MIT_Regression

Chapter 7 discusses regression as a supervised learning problem where the goal is to predict real-valued outputs rather than discrete categories. It introduces the concept of loss functions, specifically squared error, and outlines the ordinary least squares (OLS) method for finding linear hypotheses that minimize mean squared error. The chapter also covers ridge regression to address issues like overfitting and the invertibility of matrices in high-dimensional data, and concludes with optimization techniques such as gradient descent.


CHAPTER 7

Regression

Now we will turn to a slightly different form of machine-learning problem, called regression. It is still supervised learning, so our data will still have the form

Sn = {(x^(1), y^(1)), . . . , (x^(n), y^(n))}.

(“Regression,” in common parlance, means moving backwards. But this is forward progress!)

But now, instead of the y values being discrete, they will be real-valued, and so our hypotheses will have the form

h : R^d → R .
This is a good framework when we want to predict a numerical quantity, like height, stock
value, etc., rather than to divide the inputs into categories.
The first step is to pick a loss function, to describe how to evaluate the quality of the predictions our hypothesis is making, when compared to the “target” y values in the data set.
The choice of loss function is part of modeling your domain. In the absence of additional
information about a regression problem, we typically use squared error (SE):

Loss(guess, actual) = (guess − actual)2 .

It penalizes guesses that are too high the same amount as it penalizes guesses that are too
low, and has a good mathematical justification in the case that your data are generated
from an underlying linear hypothesis, but with Gaussian-distributed noise added to the y
values.
We will consider the case of a linear hypothesis class,

h(x; θ, θ0 ) = θT x + θ0 ,

remembering that we can get a rich class of hypotheses by performing a non-linear feature transformation before doing the regression. So, θ^T x + θ0 is a linear function of x, but θ^T ϕ(x) + θ0 is a non-linear function of x if ϕ is a non-linear function of x.
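As a minimal sketch of this point (the feature map ϕ and the parameter values here are made up for illustration), a hypothesis that is linear in the features can be non-linear in the original input:

```python
import numpy as np

# Hypothetical feature map phi(x) = (x, x^2) for a scalar input x.
def phi(x):
    return np.array([x, x ** 2])

theta = np.array([1.0, 2.0])   # illustrative parameter values
theta_0 = -3.0

# h is linear in the features phi(x) but quadratic in x itself.
def h(x):
    return theta @ phi(x) + theta_0

# e.g. h(2) = 1*2 + 2*4 - 3 = 7
```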
We will treat regression as an optimization problem, in which, given a data set D, we
wish to find a linear hypothesis that minimizes mean squared error. Our objective, often
called mean squared error, is to find values for Θ = (θ, θ0 ) that minimize

J(θ, θ0) = (1/n) Σ_{i=1}^{n} ( θ^T x^(i) + θ0 − y^(i) )² ,
MIT 6.036 Fall 2019

resulting in the solution:


θ*, θ0* = arg min_{θ,θ0} J(θ, θ0) .    (7.1)

1 Analytical solution: ordinary least squares


One very interesting aspect of the problem of finding a linear hypothesis that minimizes mean squared error (this general problem is often called ordinary least squares (OLS)) is that we can find a closed-form formula for the answer!

(What does “closed form” mean? Generally, that it involves direct evaluation of a mathematical expression using a fixed number of “typical” operations (like arithmetic operations, trig functions, powers, etc.). So equation 7.1 is not in closed form, because it’s not at all clear what operations one needs to perform to find the solution.)

Everything is easier to deal with if we assume that the x^(i) have been augmented with an extra input dimension (feature) that always has value 1, so we may ignore θ0. (See chapter 3, section 2 for a reminder about this strategy.)

We will approach this just like a minimization problem from calculus homework: take the derivative of J with respect to θ, set it to zero, and solve for θ. There is an additional step required, to check that the resulting θ is a minimum (rather than a maximum or an inflection point), but we won’t work through that here. It is possible to approach this problem by:

• Finding ∂J/∂θ_k for k in 1, . . . , d,

• Constructing a set of d equations of the form ∂J/∂θ_k = 0, and

• Solving the system for values of θ_k.

(We will use d here for the total number of features in each x^(i), including the added 1.)

That works just fine. To get practice for applying techniques like this to more complex problems, we will work through a more compact (and cool!) matrix view.
Study Question: Work through this and check your answer against ours below.
We can think of our training data in terms of matrices X and Y, where each column of X is an example, and each “column” of Y is the corresponding target output value:

    X = [ x_1^(1) … x_1^(n) ]        Y = [ y^(1) … y^(n) ] .
        [    ⋮    ⋱    ⋮   ]
        [ x_d^(1) … x_d^(n) ] ,

Study Question: What are the dimensions of X and Y?


In most textbooks, they think of an individual example x^(i) as a row, rather than a column. So that we get an answer that will be recognizable to you, we are going to define a new matrix and vector, W and T, which are just transposes of our X and Y, and then work with them:

    W = X^T = [ x_1^(1) … x_d^(1) ]        T = Y^T = [ y^(1) ]
              [    ⋮    ⋱    ⋮   ]                   [   ⋮   ]
              [ x_1^(n) … x_d^(n) ] ,                [ y^(n) ] .
Study Question: What are the dimensions of W and T ?


Now we can write

J(θ) = (1/n) (Wθ − T)^T (Wθ − T) = (1/n) Σ_{i=1}^{n} ( Σ_{j=1}^{d} W_ij θ_j − T_i )²

(here (Wθ − T)^T is 1 × n and (Wθ − T) is n × 1)

Last Updated: 12/18/19 11:56:05



and using facts about matrix/vector calculus, we get

∇_θ J = (2/n) W^T (Wθ − T) ,

where W^T is d × n and (Wθ − T) is n × 1.

Setting to 0 and solving, we get:

(2/n) W^T (Wθ − T) = 0
W^T Wθ − W^T T = 0
W^T Wθ = W^T T
θ = (W^T W)^(−1) W^T T

And the dimensions work out!

θ = (W^T W)^(−1) W^T T ,

with (W^T W)^(−1) of size d × d, W^T of size d × n, and T of size n × 1, so θ is d × 1.

So, given our data, we can directly compute the linear regression that minimizes mean
squared error. That’s pretty awesome!
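As a concrete sketch of the closed form (the data here is made up for illustration, and np.linalg.solve is used rather than forming the inverse explicitly, which is the standard numerically preferable choice):

```python
import numpy as np

# Illustrative data: y = 3*x + 1 exactly, with a constant-1 feature
# appended so that theta_0 is absorbed into theta (as in the text).
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 3.0 * x + 1.0

W = np.column_stack([x, np.ones_like(x)])  # n x d, rows are examples
T = y.reshape(-1, 1)                       # n x 1 target vector

# theta = (W^T W)^(-1) W^T T, computed via a linear solve
theta = np.linalg.solve(W.T @ W, W.T @ T)
# theta is approximately [[3.], [1.]]: slope 3, offset 1
```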

2 Regularizing linear regression



Well, actually, there are some kinds of trouble we can get into. What if W^T W is not invertible?

Study Question: Consider, for example, a situation where the data-set is just the same point repeated twice: x^(1) = x^(2) = (1, 2)^T. What is W in this case? What is W^T W? What is (W^T W)^(−1)?
Another kind of problem is overfitting: we have formulated an objective that is just
about fitting the data as well as possible, but as we discussed in the context of margin
maximization, we might also want to regularize to keep the hypothesis from getting too
attached to the data.
We address both the problem of W^T W not being invertible and the problem of overfitting using a mechanism called ridge regression. We add a regularization term ‖θ‖² to the OLS objective, with trade-off parameter λ.

Study Question: When we add a regularizer of the form ‖θ‖², what is our most “preferred” value of θ, in the absence of any data?
Here is the ridge regression objective function:

J_ridge(θ, θ0) = (1/n) Σ_{i=1}^{n} ( θ^T x^(i) + θ0 − y^(i) )² + λ‖θ‖²

Larger λ values pressure θ values to be near zero. Note that we don’t penalize θ0; intuitively, θ0 is what “floats” the regression surface to the right level for the data you have, and so you shouldn’t make it harder to fit a data set where the y values tend to be around one million than one where they tend to be around one. The other parameters control the orientation of the regression surface, and we prefer it to have a not-too-crazy orientation.
There is an analytical expression for the θ, θ0 values that minimize Jridge , but it’s a little
bit more complicated to derive than the solution for OLS because θ0 needs special treatment.




If we decide not to treat θ0 specially (so we add a 1 feature to our input vectors), then we
get:
∇_θ J_ridge = (2/n) W^T (Wθ − T) + 2λθ .
Setting to 0 and solving, we get:

(2/n) W^T (Wθ − T) + 2λθ = 0
(1/n) W^T Wθ − (1/n) W^T T + λθ = 0
(1/n) W^T Wθ + λθ = (1/n) W^T T
W^T Wθ + nλθ = W^T T
(W^T W + nλI) θ = W^T T
θ = (W^T W + nλI)^(−1) W^T T

Whew! So,

θ_ridge = (W^T W + nλI)^(−1) W^T T ,

where the matrix W^T W + nλI is invertible as long as λ > 0. (This is called “ridge” regression because we are adding a “ridge” of λ values along the diagonal of the matrix before inverting it.)

Study Question: Derive this version of the ridge regression solution.
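A minimal sketch of this formula, using data chosen to make W^T W singular (echoing the earlier study question about a repeated point; the value λ = 0.1 is an arbitrary illustration):

```python
import numpy as np

# The same point repeated twice, so W^T W has rank 1 and plain OLS fails.
W = np.array([[1.0, 2.0],
              [1.0, 2.0]])
T = np.array([[1.0],
              [1.0]])

n, d = W.shape
lam = 0.1  # arbitrary illustrative trade-off parameter

# theta_ridge = (W^T W + n*lambda*I)^(-1) W^T T
theta_ridge = np.linalg.solve(W.T @ W + n * lam * np.eye(d), W.T @ T)
# The n*lambda ridge on the diagonal makes the system solvable.
```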
Talking about regularization   In machine learning in general, not just regression, it is useful to distinguish two ways in which a hypothesis h ∈ H might contribute to errors on test data. We have

Structural error: This is error that arises because there is no hypothesis h ∈ H that will perform well on the data, for example because the data was really generated by a sine wave but we are trying to fit it with a line.

Estimation error: This is error that arises because we do not have enough data (or the data are in some way unhelpful) to allow us to choose a good h ∈ H.

When we increase λ, we tend to increase structural error but decrease estimation error, and vice versa.

(There are technical definitions of these concepts that are studied in more advanced treatments of machine learning. Structural error is referred to as bias and estimation error is referred to as variance.)

Study Question: Consider using a polynomial basis of order k as a feature transformation φ on your data. Would increasing k tend to increase or decrease structural error? What about estimation error?

3 Optimization via gradient descent



Inverting the d × d matrix W^T W takes O(d³) time, which makes the analytic solution impractical for large d. If we have high-dimensional data, we can fall back on gradient descent.

(Well, actually, Gauss-Jordan elimination, a popular algorithm, takes O(d³) arithmetic operations, but the bit complexity of the intermediate results can grow exponentially! There are other algorithms with polynomial bit complexity. If this just made no sense to you, don’t worry.)

Study Question: Why is having large n not as much of a computational problem as having large d?

Recall the ridge objective

J_ridge(θ, θ0) = (1/n) Σ_{i=1}^{n} ( θ^T x^(i) + θ0 − y^(i) )² + λ‖θ‖²

and its gradient with respect to θ

∇_θ J = (2/n) Σ_{i=1}^{n} ( θ^T x^(i) + θ0 − y^(i) ) x^(i) + 2λθ

and partial derivative with respect to θ0

∂J/∂θ0 = (2/n) Σ_{i=1}^{n} ( θ^T x^(i) + θ0 − y^(i) ) .

Armed with these derivatives, we can do gradient descent, using the regular or stochastic
gradient methods from chapter 6.
Even better, the objective functions for OLS and ridge regression are convex, which
means they have only one minimum, which means, with a small enough step size, gra-
dient descent is guaranteed to find the optimum.
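These update rules can be sketched as follows (the learning rate, iteration count, and data are illustrative choices, not from the text; θ0 is left unregularized, as discussed above):

```python
import numpy as np

def ridge_gd(X, y, lam=0.01, step=0.05, iters=5000):
    """Gradient descent on the ridge objective.
    X: n x d data matrix (rows are examples), y: length-n targets."""
    n, d = X.shape
    theta = np.zeros(d)
    theta_0 = 0.0
    for _ in range(iters):
        resid = X @ theta + theta_0 - y              # length-n residuals
        grad_theta = (2.0 / n) * (X.T @ resid) + 2.0 * lam * theta
        grad_theta_0 = (2.0 / n) * resid.sum()       # theta_0 is not penalized
        theta = theta - step * grad_theta
        theta_0 = theta_0 - step * grad_theta_0
    return theta, theta_0

# On noiseless data y = 2x - 1 with lambda = 0, this recovers
# roughly slope 2 and offset -1.
X = np.linspace(0.0, 1.0, 20).reshape(-1, 1)
y = 2.0 * X[:, 0] - 1.0
theta, theta_0 = ridge_gd(X, y, lam=0.0)
```

Since the objective is convex, the result does not depend on the zero initialization; any starting point converges to the same optimum for a small enough step size.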
