
CSC 411: Lecture 02: Linear Regression

Richard Zemel, Raquel Urtasun and Sanja Fidler

University of Toronto

(Most plots in this lecture are from Bishop’s book)

Zemel, Urtasun, Fidler (UofT) CSC 411: 02-Regression 1 / 22


Problems for Today
What should I watch this Friday?
Goal: Predict movie rating automatically!
Goal: How many followers will I get?
Goal: Predict the price of the house

Zemel, Urtasun, Fidler (UofT) CSC 411: 02-Regression 2 / 22


Regression
What do all these problems have in common?
▶ Continuous outputs, we'll call these t (e.g., a rating: a real number between 0 and 10, # of followers, house price)
Predicting continuous outputs is called regression
What do I need in order to predict these outputs?
▶ Features (inputs), we'll call these x (a vector x when there are several features)
▶ Training examples, many x^(i) for which t^(i) is known (e.g., many movies for which we know the rating)
▶ A model, a function that represents the relationship between x and t
▶ A loss (or cost, or objective) function, which tells us how well our model approximates the training examples
▶ Optimization, a way of finding the parameters of our model that minimize the loss function
Zemel, Urtasun, Fidler (UofT) CSC 411: 02-Regression 3 / 22
Today: Linear Regression

Linear regression
▶ continuous outputs
▶ simple model (linear)

Introduce key concepts:


▶ loss functions
▶ generalization
▶ optimization
▶ model complexity
▶ regularization

Zemel, Urtasun, Fidler (UofT) CSC 411: 02-Regression 4 / 22


Simple 1-D regression

Circles are data points (i.e., training examples) that are given to us
The data points are uniform in x, but may be displaced in y
t(x) = f(x) + ε
with ε some noise
In green is the "true" curve that we don't know
Goal: We want to fit a curve to these points
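
For concreteness, here is a minimal numpy sketch of how data like this might be generated. It assumes, as in Bishop's running example, that the unknown "true" curve is a sine and the noise is Gaussian; the specific function, noise level, and sample size are illustrative assumptions, not part of the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative assumption: the unknown "true" curve f(x) is sin(2*pi*x),
# as in Bishop's running example; the noise level 0.2 is arbitrary.
def f(x):
    return np.sin(2 * np.pi * x)

N = 10
x = np.linspace(0.0, 1.0, N)          # data points uniform in x
t = f(x) + rng.normal(0.0, 0.2, N)    # t(x) = f(x) + eps, displaced in t

print(np.column_stack([x, t]))        # the circles we would be given
```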
Zemel, Urtasun, Fidler (UofT) CSC 411: 02-Regression 5 / 22
Simple 1-D regression

Key Questions:
▶ How do we parametrize the model?
▶ What loss (objective) function should we use to judge the fit?
▶ How do we optimize fit to unseen test data (generalization)?

Zemel, Urtasun, Fidler (UofT) CSC 411: 02-Regression 6 / 22


Example: Boston Housing data
Estimate median house price in a neighborhood based on neighborhood
statistics
Look at first possible attribute (feature): per capita crime rate

Use this to predict house prices in other neighborhoods


Is this a good input (attribute) to predict house prices?
Zemel, Urtasun, Fidler (UofT) CSC 411: 02-Regression 7 / 22
Represent the Data
Data is described as pairs D = {(x^(1), t^(1)), ..., (x^(N), t^(N))}
▶ x ∈ R is the input feature (per capita crime rate)
▶ t ∈ R is the target output (median house price)
▶ the superscript (i) simply indicates the training examples (we have N in this case)
Here t is continuous, so this is a regression problem
Model outputs y, an estimate of t

y(x) = w0 + w1 x

What type of model did we choose?


Divide the dataset into training and testing examples
▶ Use the training examples to construct a hypothesis, or function approximator, that maps x to a predicted y
▶ Evaluate the hypothesis on the test set
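
A minimal sketch of this workflow, with made-up numbers standing in for crime rates and house prices; the split point and the hand-set weights are purely illustrative (how to actually fit the weights comes next).

```python
import numpy as np

# Hypothetical data: x = per capita crime rate, t = median house price.
x = np.array([0.1, 0.3, 1.2, 2.5, 4.0, 6.3, 8.9, 11.4])
t = np.array([34.0, 31.5, 27.0, 24.2, 22.1, 19.8, 15.3, 13.0])

# Divide the dataset into training and testing examples.
x_train, t_train = x[:6], t[:6]
x_test, t_test = x[6:], t[6:]

# The hypothesis: a linear model y(x) = w0 + w1*x with some (not yet optimized) weights.
w0, w1 = 30.0, -1.5
y_test = w0 + w1 * x_test

# Evaluate the hypothesis on the test set.
print("predictions:", y_test)
print("targets:    ", t_test)
```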
Zemel, Urtasun, Fidler (UofT) CSC 411: 02-Regression 8 / 22
Noise

A simple model typically does not exactly fit the data


▶ lack of fit can be considered noise

Sources of noise:
▶ Imprecision in data attributes (input noise, e.g., noise in the per-capita crime rate)
▶ Errors in data targets (mislabeling, e.g., noise in house prices)
▶ Additional attributes, not captured by the data attributes, that affect the target values (latent variables). In the example, what else could affect house prices?
▶ The model may be too simple to account for the data targets

Zemel, Urtasun, Fidler (UofT) CSC 411: 02-Regression 9 / 22


Least-Squares Regression

Define a model

y(x) = function(x, w)

Linear: y(x) = w0 + w1 x

Standard least-squares objective (sum of squared errors over the training set):

ℓ(w) = ∑_{n=1}^{N} [t^(n) − y(x^(n))]^2
Zemel, Urtasun, Fidler (UofT) CSC 411: 02-Regression 10 / 22
Optimizing the Objective
One straightforward method: gradient descent
▶ initialize w (e.g., randomly)
▶ repeatedly update w based on the gradient:

w ← w − λ ∂ℓ/∂w

λ is the learning rate


For a single training case, this gives the LMS (Least Mean Squares) update rule:

w ← w + 2λ (t^(n) − y(x^(n))) x^(n)

where (t^(n) − y(x^(n))) is the error

Note: As error approaches zero, so does the update (w stops changing)

Zemel, Urtasun, Fidler (UofT) CSC 411: 02-Regression 11 / 22


Optimizing Across Training Set
Two ways to generalize this to all examples in the training set (both variants are sketched in code after the algorithm below):
1. Batch updates: sum or average the updates across every example n, then change the parameter values

w ← w + 2λ ∑_{n=1}^{N} (t^(n) − y(x^(n))) x^(n)

2. Stochastic/online updates: update the parameters for each training case in turn, according to its own gradient

Algorithm 1 Stochastic gradient descent


1: Randomly shuffle examples in the training set
2: for i = 1 to N do
3: Update:

w ← w + 2λ (t^(i) − y(x^(i))) x^(i)    (update for a linear model)

4: end for
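
Both update schemes might look roughly like the following numpy sketch for the 1-D linear model; the toy data, learning rate, and epoch counts are illustrative assumptions, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D data (illustrative): t is roughly 2 + 3*x plus noise.
x = rng.uniform(0.0, 1.0, 50)
t = 2.0 + 3.0 * x + rng.normal(0.0, 0.1, 50)

def y(x, w):                            # linear model y(x) = w0 + w1*x
    return w[0] + w[1] * x

lam = 0.005   # learning rate; kept small because the batch step sums over all N examples

# 1. Batch updates: sum the per-example updates, then change w once per pass.
w_batch = np.zeros(2)
for epoch in range(1000):
    err = t - y(x, w_batch)                          # t^(n) - y(x^(n)) for all n
    w_batch = w_batch + 2 * lam * np.array([err.sum(), (err * x).sum()])

# 2. Stochastic/online updates: update w after every single training case.
w_sgd = np.zeros(2)
for epoch in range(200):
    for i in rng.permutation(len(x)):                # randomly shuffle examples
        err_i = t[i] - y(x[i], w_sgd)
        w_sgd = w_sgd + 2 * lam * err_i * np.array([1.0, x[i]])

print("batch:", w_batch)   # both should end up near (2, 3)
print("sgd:  ", w_sgd)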

Zemel, Urtasun, Fidler (UofT) CSC 411: 02-Regression 12 / 22


Analytical Solution?

For some objectives we can also find the optimal solution analytically
This is the case for linear least-squares regression
How?
Compute the derivatives of the objective w.r.t. w and set them to 0
Define:

t = [t^(1), t^(2), ..., t^(N)]^T

X = [ 1  x^(1)
      1  x^(2)
      ...
      1  x^(N) ]   (one row per training example)

Then:

w = (X^T X)^{-1} X^T t
(work it out!)
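
As a sanity check, the closed-form solution takes a few lines of numpy; the toy data below is made up, and solving the linear system stands in for forming the explicit inverse.

```python
import numpy as np

# Toy 1-D data (illustrative values only).
x = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 2.5])
t = np.array([1.1, 2.4, 3.9, 5.2, 6.8, 8.1])

# Design matrix with a column of ones for the bias: row n is [1, x^(n)].
X = np.column_stack([np.ones_like(x), x])

# Normal equations: w = (X^T X)^{-1} X^T t.
# Solving the linear system is preferred over computing an explicit inverse.
w = np.linalg.solve(X.T @ X, X.T @ t)

print("w0, w1 =", w)   # roughly intercept 1 and slope 2.8 for this toy data
```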
Zemel, Urtasun, Fidler (UofT) CSC 411: 02-Regression 13 / 22
Multi-dimensional Inputs

One method of extending the model is to consider other input dimensions

y(x) = w0 + w1 x1 + w2 x2

In the Boston housing example, we can look at the number of rooms

Zemel, Urtasun, Fidler (UofT) CSC 411: 02-Regression 14 / 22


Linear Regression with Multi-dimensional Inputs

Imagine now we want to predict the median house price from these
multi-dimensional observations
Each house is a data point n, with observations indexed by j:
 
x^(n) = (x_1^(n), ..., x_j^(n), ..., x_d^(n))

We can incorporate the bias w0 into w, by using x0 = 1, then


y(x) = w0 + ∑_{j=1}^{d} w_j x_j = w^T x

We can then solve for w = (w0, w1, ..., wd). How?


We can use gradient descent to solve for each coefficient, or compute w
analytically (how does the solution change?)
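
A sketch of the multi-dimensional case with two made-up neighborhood features (crime rate and number of rooms); the data and weights are illustrative. Note how prepending a column of ones implements the x_0 = 1 bias trick, after which the analytical solution has the same form as in 1-D.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: column 0 = per capita crime rate, column 1 = number of rooms.
features = rng.uniform(0.0, 10.0, size=(100, 2))
true_w = np.array([25.0, -1.2, 3.0])               # made-up bias, crime and room weights
t = true_w[0] + features @ true_w[1:] + rng.normal(0.0, 0.5, 100)

# Absorb the bias w0 into w by prepending x_0 = 1 to every input vector.
X = np.column_stack([np.ones(len(features)), features])

# Same least-squares solution as before, now with d+1 weights.
w = np.linalg.lstsq(X, t, rcond=None)[0]

print(w)   # should be close to [25.0, -1.2, 3.0]
```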

Zemel, Urtasun, Fidler (UofT) CSC 411: 02-Regression 15 / 22


More Powerful Models? Fitting a Polynomial

What if our linear model is not good? How can we create a more
complicated model?
We can create a more complicated model by defining input variables that are
combinations of components of x
Example: an M-th order polynomial function of a one-dimensional feature x:

y(x, w) = w0 + ∑_{j=1}^{M} w_j x^j

where x^j is the j-th power of x


We can use the same approach to optimize for the weights w
How do we do that?
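
A sketch of polynomial fitting via the same least-squares machinery; the noisy-sine data and the order M = 3 are illustrative assumptions. The key point is that the model is still linear in w, only the inputs change.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples of a smooth curve (an illustrative choice, as in Bishop's example).
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, 10)

M = 3                                              # polynomial order

# New "input variables": the powers x^0, x^1, ..., x^M of the original feature.
X = np.vander(x, M + 1, increasing=True)           # columns [1, x, x^2, ..., x^M]

# The model is still linear in w, so ordinary least squares applies unchanged.
w = np.linalg.lstsq(X, t, rcond=None)[0]

print("fitted weights w0..wM:", w)
```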

Zemel, Urtasun, Fidler (UofT) CSC 411: 02-Regression 16 / 22


Which Fit is Best?
(figure from Bishop)

Zemel, Urtasun, Fidler (UofT) CSC 411: 02-Regression 17 / 22


Generalization
Generalization = the model's ability to predict held-out data
What is happening?
Our model with M = 9 overfits the data (it also models the noise)
Not a problem if we have lots of training examples
Let's look at the estimated weights for various M in the case of fewer examples
The weights become huge to compensate for the noise (see the sketch below)
One way of dealing with this is to encourage the weights to be small (this
way no input dimension will have too much influence on prediction). This is
called regularization
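
The overfitting behaviour can be reproduced with a short experiment; the data-generating curve, noise level, and the particular orders M compared are assumptions chosen to mirror Bishop's figures, not values from the slides. With few training points, the M = 9 fit typically drives training error near zero while test error and the weight magnitudes blow up.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(n):
    # Illustrative noisy-sine data; the "true" curve and noise level are assumptions.
    x = rng.uniform(0.0, 1.0, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, n)

x_train, t_train = make_data(10)      # few training examples
x_test, t_test = make_data(100)       # held-out data

def rms(x, t, w):
    X = np.vander(x, len(w), increasing=True)
    return np.sqrt(np.mean((X @ w - t) ** 2))

for M in (0, 1, 3, 9):
    X = np.vander(x_train, M + 1, increasing=True)
    w = np.linalg.lstsq(X, t_train, rcond=None)[0]
    print(f"M={M}: train RMS={rms(x_train, t_train, w):.3f}  "
          f"test RMS={rms(x_test, t_test, w):.3f}  max|w|={np.abs(w).max():.1f}")
```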

Zemel, Urtasun, Fidler (UofT) CSC 411: 02-Regression 18 / 22


Regularized Least Squares
Increasing the number of input features this way can complicate the model considerably
Goal: select the appropriate model complexity automatically
Standard approach: regularization
ℓ̃(w) = ∑_{n=1}^{N} [t^(n) − (w0 + w1 x^(n))]^2 + α w^T w

Intuition: Since we are minimizing the loss, the second term will encourage
smaller values in w
When we use a penalty on the squared weights, this is known as ridge regression in statistics
Leads to a modified update rule for gradient descent:
w ← w + 2λ [ ∑_{n=1}^{N} (t^(n) − y(x^(n))) x^(n) − αw ]

Also has an analytical solution: w = (X^T X + αI)^{-1} X^T t (verify!)
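
A sketch of the regularized analytical solution next to the unregularized one; the value of α is hand-picked for illustration, and (matching the slide's formula) the bias is penalized along with the other weights. The regularized weights stay much smaller on the same overfitting-prone setup.

```python
import numpy as np

rng = np.random.default_rng(1)

# The overfitting-prone setup again: an M = 9 polynomial fit to 10 noisy points.
x = rng.uniform(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, 10)
X = np.vander(x, 10, increasing=True)              # columns 1, x, ..., x^9

alpha = 1e-3                                       # regularization strength (hand-picked)

# Regularized analytical solution: w = (X^T X + alpha*I)^{-1} X^T t.
w_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ t)
w_plain = np.linalg.lstsq(X, t, rcond=None)[0]

print("max|w| unregularized:", np.abs(w_plain).max())
print("max|w| regularized:  ", np.abs(w_ridge).max())
```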


Zemel, Urtasun, Fidler (UofT) CSC 411: 02-Regression 19 / 22
Regularized least squares

Better generalization
Choose α carefully

Zemel, Urtasun, Fidler (UofT) CSC 411: 02-Regression 20 / 22


1-D regression illustrates key concepts

Data fits: is a linear model best (model selection)?


▶ Simple models may not capture all the important variations (signal) in the data: underfit
▶ More complex models may overfit the training data (fit not only the signal but also the noise in the data), especially if there is not enough data to constrain the model
One method of assessing fit: test generalization = the model's ability to predict held-out data
Optimization is essential: stochastic and batch iterative approaches; analytic
when available

Zemel, Urtasun, Fidler (UofT) CSC 411: 02-Regression 21 / 22


So...

Which movie will you watch?

Zemel, Urtasun, Fidler (UofT) CSC 411: 02-Regression 22 / 22
