Regression
Now we will turn to a slightly different form of machine-learning problem, called regression. ("Regression," in common parlance, means moving backwards. But this is forward progress!) It is still supervised learning, so our data will still have the form

Sn = {(x(1), y(1)), . . . , (x(n), y(n))} .
But now, instead of the y values being discrete, they will be real-valued, and so our hypotheses will have the form

h : ℝᵈ → ℝ .
This is a good framework when we want to predict a numerical quantity, like height, stock
value, etc., rather than to divide the inputs into categories.
The first step is to pick a loss function, to describe how to evaluate the quality of the predictions our hypothesis is making, when compared to the "target" y values in the data set. The choice of loss function is part of modeling your domain. In the absence of additional information about a regression problem, we typically use squared error (SE):

Loss(guess, actual) = (guess − actual)² .
It penalizes guesses that are too high the same amount as it penalizes guesses that are too
low, and has a good mathematical justification in the case that your data are generated
from an underlying linear hypothesis, but with Gaussian-distributed noise added to the y
values.
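As a tiny concrete check of that symmetry (the function name and the numbers here are our own illustration, not from the notes):

def squared_error(guess, actual):
    """SE loss: (guess - actual)^2."""
    return (guess - actual) ** 2

# Overshooting by 2 and undershooting by 2 are penalized identically.
assert squared_error(7.0, 5.0) == squared_error(3.0, 5.0) == 4.0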
We will consider the case of a linear hypothesis class,
h(x; θ, θ0) = θᵀ x + θ0 ,

remembering that we can get a rich class of hypotheses by performing a non-linear feature transformation before doing the regression. So, θᵀ x + θ0 is a linear function of x, but θᵀ ϕ(x) + θ0 is a non-linear function of x if ϕ is a non-linear function of x.
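For example, here is one hypothetical choice of non-linear ϕ, a polynomial feature map, sketched in numpy (the function name and the particular powers are our own illustration, not prescribed by the notes):

import numpy as np

def poly_features(x, k):
    """Map a scalar input x to the feature vector (x, x^2, ..., x^k)."""
    return np.array([x ** j for j in range(1, k + 1)])

# theta^T phi(x) + theta_0 is non-linear in x, but still linear in the
# parameters (theta, theta_0), so the regression machinery below applies.
theta, theta_0 = np.array([1.0, -0.5, 0.25]), 2.0
prediction = theta @ poly_features(3.0, k=3) + theta_0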
We will treat regression as an optimization problem, in which, given a data set D, we wish to find a linear hypothesis that minimizes mean squared error. That is, we want to find values for Θ = (θ, θ0) that minimize the objective, often called mean squared error,

J(θ, θ0) = (1/n) Σ_{i=1}^{n} (θᵀ x(i) + θ0 − y(i))² .
Let W be the n × d matrix whose ith row is the input x(i) (so its entries are x1(i), . . . , xd(i)), and let T be the n × 1 column vector whose ith entry is the target y(i). Folding θ0 in by appending a constant 1 feature to each input (as we will also do for ridge regression below), the minimizing parameter vector has the closed form

θ = (Wᵀ W)⁻¹ Wᵀ T .
So, given our data, we can directly compute the linear regression that minimizes mean
squared error. That’s pretty awesome!
A common variation, called ridge regression, adds a regularization term λ‖θ‖² that penalizes large values of θ. The objective becomes

Jridge(θ, θ0) = (1/n) Σ_{i=1}^{n} (θᵀ x(i) + θ0 − y(i))² + λ‖θ‖² .
Larger λ values pressure θ values to be near zero. Note that we don't penalize θ0; intuitively, θ0 is what "floats" the regression surface to the right level for the data you have,
and so you shouldn’t make it harder to fit a data set where the y values tend to be around
one million than one where they tend to be around one. The other parameters control the
orientation of the regression surface, and we prefer it to have a not-too-crazy orientation.
There is an analytical expression for the θ, θ0 values that minimize Jridge , but it’s a little
bit more complicated to derive than the solution for OLS because θ0 needs special treatment.
If we decide not to treat θ0 specially (so we add a 1 feature to our input vectors), then we
get:
∇θ Jridge = (2/n) Wᵀ (Wθ − T) + 2λθ .
Setting to 0 and solving, we get:
(2/n) Wᵀ (Wθ − T) + 2λθ = 0
(1/n) Wᵀ Wθ − (1/n) Wᵀ T + λθ = 0
(1/n) Wᵀ Wθ + λθ = (1/n) Wᵀ T
Wᵀ Wθ + nλθ = Wᵀ T
(Wᵀ W + nλI) θ = Wᵀ T
θ = (Wᵀ W + nλI)⁻¹ Wᵀ T
Whew! So,

θridge = (Wᵀ W + nλI)⁻¹ Wᵀ T ,

where the matrix Wᵀ W + nλI is guaranteed to be invertible when λ > 0. This is called "ridge" regression because we are adding a "ridge" of λ values along the diagonal of the matrix before inverting it.

Study Question: Derive this version of the ridge regression solution.
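Here is a minimal sketch of that formula in numpy; the function name and toy data are our own, and the constant 1 feature is appended as in the derivation above:

import numpy as np

def ridge_fit(W, T, lam):
    """Closed-form ridge regression: theta = (W^T W + n*lam*I)^{-1} W^T T.

    Assumes a constant 1 feature has been appended to each row of W, so the
    offset theta_0 is just the last entry of theta and is regularized along
    with everything else (the "don't treat theta_0 specially" variant above).
    """
    n, d = W.shape
    return np.linalg.solve(W.T @ W + n * lam * np.eye(d), W.T @ T)

# Hypothetical usage: larger lam pulls the entries of theta toward zero.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
W = np.hstack([X, np.ones((50, 1))])
T = X @ np.array([2.0, -1.0]) + 0.5 + 0.1 * rng.normal(size=50)
for lam in (0.0, 0.1, 10.0):
    print(lam, ridge_fit(W, T, lam))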
Talking about regularization   In machine learning in general, not just regression, it is useful to distinguish two ways in which a hypothesis h ∈ H might contribute to errors on test data. We have

Structural error: This is error that arises because there is no hypothesis h ∈ H that will perform well on the data, for example because the data was really generated by a sine wave but we are trying to fit it with a line.

Estimation error: This is error that arises because we do not have enough data (or the data are in some way unhelpful) to allow us to choose a good h ∈ H.

There are technical definitions of these concepts that are studied in more advanced treatments of machine learning; structural error is referred to as bias and estimation error as variance.

When we increase λ, we tend to increase structural error but decrease estimation error, and vice versa.

Study Question: Consider using a polynomial basis of order k as a feature transformation φ on your data. Would increasing k tend to increase or decrease structural error? What about estimation error?
We can also minimize Jridge with gradient descent, rather than via the analytical solution. The derivatives we need are

∇θ Jridge = (2/n) Σ_{i=1}^{n} (θᵀ x(i) + θ0 − y(i)) x(i) + 2λθ

∂Jridge/∂θ0 = (2/n) Σ_{i=1}^{n} (θᵀ x(i) + θ0 − y(i)) .
Armed with these derivatives, we can do gradient descent, using the regular or stochastic
gradient methods from chapter 6.
Even better, the objective functions for OLS and ridge regression are convex, which means they have only one minimum, which means, with a small enough step size, gradient descent is guaranteed to find the optimum.
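To make the update concrete, here is a minimal batch gradient-descent sketch in numpy using the derivatives above; the function name, toy data, and fixed step size are our own choices, not part of the notes:

import numpy as np

def ridge_gradient_step(X, y, theta, theta_0, lam, step):
    """One batch gradient-descent step on the ridge objective,
    using the derivatives above (theta_0 is not regularized)."""
    n = X.shape[0]
    residual = X @ theta + theta_0 - y             # shape (n,)
    grad_theta = (2 / n) * (X.T @ residual) + 2 * lam * theta
    grad_theta_0 = (2 / n) * residual.sum()
    return theta - step * grad_theta, theta_0 - step * grad_theta_0

# Hypothetical usage: run a fixed number of steps from a zero start.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 3.0 + 0.1 * rng.normal(size=100)
theta, theta_0 = np.zeros(3), 0.0
for _ in range(1000):
    theta, theta_0 = ridge_gradient_step(X, y, theta, theta_0, lam=0.01, step=0.05)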