MIT_Regression

Chapter 7 discusses regression as a supervised learning problem where the goal is to predict real-valued outputs rather than discrete categories. It introduces the concept of loss functions, specifically squared error, and outlines the ordinary least squares (OLS) method for finding linear hypotheses that minimize mean squared error. The chapter also covers ridge regression to address issues like overfitting and the invertibility of matrices in high-dimensional data, and concludes with optimization techniques such as gradient descent.


CHAPTER 7

Regression

Now we will turn to a slightly different form of machine-learning problem, called regression. It is still supervised learning, so our data will still have the form

Sn = {(x^(1), y^(1)), . . . , (x^(n), y^(n))}.

(“Regression,” in common parlance, means moving backwards. But this is forward progress!)

But now, instead of the y values being discrete, they will be real-valued, and so our hypotheses will have the form

h : R^d → R .
This is a good framework when we want to predict a numerical quantity, like height, stock
value, etc., rather than to divide the inputs into categories.
The first step is to pick a loss function, to describe how to evaluate the quality of the predictions our hypothesis is making, when compared to the “target” y values in the data set.
The choice of loss function is part of modeling your domain. In the absence of additional
information about a regression problem, we typically use squared error (SE):

Loss(guess, actual) = (guess − actual)2 .

It penalizes guesses that are too high the same amount as it penalizes guesses that are too
low, and has a good mathematical justification in the case that your data are generated
from an underlying linear hypothesis, but with Gaussian-distributed noise added to the y
values.
We will consider the case of a linear hypothesis class,

h(x; θ, θ0 ) = θT x + θ0 ,

remembering that we can get a rich class of hypotheses by performing a non-linear feature transformation before doing the regression. So, θ^T x + θ0 is a linear function of x, but θ^T ϕ(x) + θ0 is a non-linear function of x if ϕ is a non-linear function of x.
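As a minimal sketch of this point (the feature map ϕ and the parameter values here are made up for illustration), a hypothesis that is linear in the features can be non-linear in the original input:

```python
import numpy as np

# Hypothetical feature map phi(x) = (x, x^2) for a scalar input x.
def phi(x):
    return np.array([x, x ** 2])

theta = np.array([1.0, 2.0])   # illustrative parameter values
theta_0 = -3.0

# h is linear in the features phi(x) but quadratic in x itself.
def h(x):
    return theta @ phi(x) + theta_0

# e.g. h(2) = 1*2 + 2*4 - 3 = 7
```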
We will treat regression as an optimization problem, in which, given a data set D, we
wish to find a linear hypothesis that minimizes mean squared error. Our objective, often
called mean squared error, is to find values for Θ = (θ, θ0 ) that minimize

J(θ, θ0) = (1/n) Σ_{i=1}^{n} ( θ^T x^(i) + θ0 − y^(i) )² ,
MIT 6.036 Fall 2019

resulting in the solution:


θ*, θ0* = arg min_{θ,θ0} J(θ, θ0) .    (7.1)

1 Analytical solution: ordinary least squares


One very interesting aspect of the problem of finding a linear hypothesis that minimizes mean squared error (this general problem is often called ordinary least squares (OLS)) is that we can find a closed-form formula for the answer!

(What does “closed form” mean? Generally, that it involves direct evaluation of a mathematical expression using a fixed number of “typical” operations (like arithmetic operations, trig functions, powers, etc.). So equation 7.1 is not in closed form, because it’s not at all clear what operations one needs to perform to find the solution.)

Everything is easier to deal with if we assume that the x^(i) have been augmented with an extra input dimension (feature) that always has value 1, so we may ignore θ0. (See chapter 3, section 2 for a reminder about this strategy.)

We will approach this just like a minimization problem from calculus homework: take the derivative of J with respect to θ, set it to zero, and solve for θ. There is an additional step required, to check that the resulting θ is a minimum (rather than a maximum or an inflection point), but we won’t work through that here. It is possible to approach this problem by:

• Finding ∂J/∂θ_k for k in 1, . . . , d,

• Constructing a set of d equations of the form ∂J/∂θ_k = 0, and

• Solving the system for values of θ_k.

(We will use d here for the total number of features in each x^(i), including the added 1.)

That works just fine. To get practice for applying techniques like this to more complex problems, we will work through a more compact (and cool!) matrix view.
Study Question: Work through this and check your answer against ours below.
We can think of our training data in terms of matrices X and Y, where each column of X is an example, and each “column” of Y is the corresponding target output value:

    X = [ x_1^(1) … x_1^(n) ]        Y = [ y^(1) … y^(n) ] .
        [    ⋮    ⋱    ⋮   ]
        [ x_d^(1) … x_d^(n) ] ,

Study Question: What are the dimensions of X and Y?


In most textbooks, they think of an individual example x^(i) as a row, rather than a column. So that we get an answer that will be recognizable to you, we are going to define a new matrix and vector, W and T, which are just transposes of our X and Y, and then work with them:

    W = X^T = [ x_1^(1) … x_d^(1) ]        T = Y^T = [ y^(1) ]
              [    ⋮    ⋱    ⋮   ]                   [   ⋮   ]
              [ x_1^(n) … x_d^(n) ] ,                [ y^(n) ] .
Study Question: What are the dimensions of W and T ?


Now we can write

J(θ) = (1/n) (Wθ − T)^T (Wθ − T) = (1/n) Σ_{i=1}^{n} ( Σ_{j=1}^{d} W_ij θ_j − T_i )²

(here (Wθ − T)^T is 1 × n and (Wθ − T) is n × 1)

Last Updated: 12/18/19 11:56:05



and using facts about matrix/vector calculus, we get

∇_θ J = (2/n) W^T (Wθ − T) ,

where W^T is d × n and (Wθ − T) is n × 1.

Setting to 0 and solving, we get:

(2/n) W^T (Wθ − T) = 0
W^T Wθ − W^T T = 0
W^T Wθ = W^T T
θ = (W^T W)^(−1) W^T T

And the dimensions work out!

θ = (W^T W)^(−1) W^T T ,

with (W^T W)^(−1) of size d × d, W^T of size d × n, and T of size n × 1, so θ is d × 1.

So, given our data, we can directly compute the linear regression that minimizes mean
squared error. That’s pretty awesome!
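As a concrete sketch of the closed form (the data here is made up for illustration, and np.linalg.solve is used rather than forming the inverse explicitly, which is the standard numerically preferable choice):

```python
import numpy as np

# Illustrative data: y = 3*x + 1 exactly, with a constant-1 feature
# appended so that theta_0 is absorbed into theta (as in the text).
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 3.0 * x + 1.0

W = np.column_stack([x, np.ones_like(x)])  # n x d, rows are examples
T = y.reshape(-1, 1)                       # n x 1 target vector

# theta = (W^T W)^(-1) W^T T, computed via a linear solve
theta = np.linalg.solve(W.T @ W, W.T @ T)
# theta is approximately [[3.], [1.]]: slope 3, offset 1
```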

2 Regularizing linear regression



Well, actually, there are some kinds of trouble we can get into. What if W^T W is not invertible?

Study Question: Consider, for example, a situation where the data-set is just the same point repeated twice: x^(1) = x^(2) = (1, 2)^T. What is W in this case? What is W^T W? What is (W^T W)^(−1)?
Another kind of problem is overfitting: we have formulated an objective that is just
about fitting the data as well as possible, but as we discussed in the context of margin
maximization, we might also want to regularize to keep the hypothesis from getting too
attached to the data.
We address both the problem of W^T W not being invertible and the problem of overfitting using a mechanism called ridge regression. We add a regularization term ‖θ‖² to the OLS objective, with trade-off parameter λ.

Study Question: When we add a regularizer of the form ‖θ‖², what is our most “preferred” value of θ, in the absence of any data?
Here is the ridge regression objective function:

J_ridge(θ, θ0) = (1/n) Σ_{i=1}^{n} ( θ^T x^(i) + θ0 − y^(i) )² + λ‖θ‖²

Larger λ values pressure θ values to be near zero. Note that we don’t penalize θ0; intuitively, θ0 is what “floats” the regression surface to the right level for the data you have, and so you shouldn’t make it harder to fit a data set where the y values tend to be around one million than one where they tend to be around one. The other parameters control the orientation of the regression surface, and we prefer it to have a not-too-crazy orientation.
There is an analytical expression for the θ, θ0 values that minimize Jridge , but it’s a little
bit more complicated to derive than the solution for OLS because θ0 needs special treatment.




If we decide not to treat θ0 specially (so we add a 1 feature to our input vectors), then we
get:
∇_θ J_ridge = (2/n) W^T (Wθ − T) + 2λθ .
Setting to 0 and solving, we get:

(2/n) W^T (Wθ − T) + 2λθ = 0
(1/n) W^T Wθ − (1/n) W^T T + λθ = 0
(1/n) W^T Wθ + λθ = (1/n) W^T T
W^T Wθ + nλθ = W^T T
(W^T W + nλI) θ = W^T T
θ = (W^T W + nλI)^(−1) W^T T

Whew! So,

θ_ridge = (W^T W + nλI)^(−1) W^T T ,

where the matrix W^T W + nλI is invertible as long as λ > 0. (This is called “ridge” regression because we are adding a “ridge” of λ values along the diagonal of the matrix before inverting it.)

Study Question: Derive this version of the ridge regression solution.
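A minimal sketch of this formula, using data chosen to make W^T W singular (echoing the earlier study question about a repeated point; the value λ = 0.1 is an arbitrary illustration):

```python
import numpy as np

# The same point repeated twice, so W^T W has rank 1 and plain OLS fails.
W = np.array([[1.0, 2.0],
              [1.0, 2.0]])
T = np.array([[1.0],
              [1.0]])

n, d = W.shape
lam = 0.1  # arbitrary illustrative trade-off parameter

# theta_ridge = (W^T W + n*lambda*I)^(-1) W^T T
theta_ridge = np.linalg.solve(W.T @ W + n * lam * np.eye(d), W.T @ T)
# The n*lambda ridge on the diagonal makes the system solvable.
```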
Talking about regularization   In machine learning in general, not just regression, it is useful to distinguish two ways in which a hypothesis h ∈ H might contribute to errors on test data. We have

Structural error: This is error that arises because there is no hypothesis h ∈ H that will perform well on the data, for example because the data was really generated by a sine wave but we are trying to fit it with a line.

Estimation error: This is error that arises because we do not have enough data (or the data are in some way unhelpful) to allow us to choose a good h ∈ H.

When we increase λ, we tend to increase structural error but decrease estimation error, and vice versa.

(There are technical definitions of these concepts that are studied in more advanced treatments of machine learning. Structural error is referred to as bias and estimation error is referred to as variance.)

Study Question: Consider using a polynomial basis of order k as a feature transformation φ on your data. Would increasing k tend to increase or decrease structural error? What about estimation error?

3 Optimization via gradient descent



Inverting the d × d matrix W^T W takes O(d³) time, which makes the analytic solution impractical for large d. If we have high-dimensional data, we can fall back on gradient descent.

(Well, actually, Gauss-Jordan elimination, a popular algorithm, takes O(d³) arithmetic operations, but the bit complexity of the intermediate results can grow exponentially! There are other algorithms with polynomial bit complexity. If this just made no sense to you, don’t worry.)

Study Question: Why is having large n not as much of a computational problem as having large d?

Recall the ridge objective

J_ridge(θ, θ0) = (1/n) Σ_{i=1}^{n} ( θ^T x^(i) + θ0 − y^(i) )² + λ‖θ‖²

and its gradient with respect to θ

∇_θ J = (2/n) Σ_{i=1}^{n} ( θ^T x^(i) + θ0 − y^(i) ) x^(i) + 2λθ

and partial derivative with respect to θ0

∂J/∂θ0 = (2/n) Σ_{i=1}^{n} ( θ^T x^(i) + θ0 − y^(i) ) .

Armed with these derivatives, we can do gradient descent, using the regular or stochastic
gradient methods from chapter 6.
Even better, the objective functions for OLS and ridge regression are convex, which
means they have only one minimum, which means, with a small enough step size, gra-
dient descent is guaranteed to find the optimum.
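These update rules can be sketched as follows (the learning rate, iteration count, and data are illustrative choices, not from the text; θ0 is left unregularized, as discussed above):

```python
import numpy as np

def ridge_gd(X, y, lam=0.01, step=0.05, iters=5000):
    """Gradient descent on the ridge objective.
    X: n x d data matrix (rows are examples), y: length-n targets."""
    n, d = X.shape
    theta = np.zeros(d)
    theta_0 = 0.0
    for _ in range(iters):
        resid = X @ theta + theta_0 - y              # length-n residuals
        grad_theta = (2.0 / n) * (X.T @ resid) + 2.0 * lam * theta
        grad_theta_0 = (2.0 / n) * resid.sum()       # theta_0 is not penalized
        theta = theta - step * grad_theta
        theta_0 = theta_0 - step * grad_theta_0
    return theta, theta_0

# On noiseless data y = 2x - 1 with lambda = 0, this recovers
# roughly slope 2 and offset -1.
X = np.linspace(0.0, 1.0, 20).reshape(-1, 1)
y = 2.0 * X[:, 0] - 1.0
theta, theta_0 = ridge_gd(X, y, lam=0.0)
```

Since the objective is convex, the result does not depend on the zero initialization; any starting point converges to the same optimum for a small enough step size.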
