Ch02-Regression Handout
University of Toronto
Linear regression
I continuous outputs
I simple model (linear)
Circles are data points (i.e., training examples) that are given to us
The data points are uniform in x, but may be displaced in y
t(x) = f(x) + ε, with ε some noise
In green is the "true" curve that we don't know
Goal: We want to fit a curve to these points
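A small sketch of this setup, assuming a sine curve for the unknown f and Gaussian noise (both are illustrative choices, not from the handout):

import numpy as np

# "True" curve f(x) that we don't know; a sine wave is assumed here for illustration
def f(x):
    return np.sin(2 * np.pi * x)

x = np.linspace(0, 1, 10)                 # data points uniform in x
t = f(x) + 0.1 * np.random.randn(len(x))  # targets displaced in y: t = f(x) + noise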
Simple 1-D regression
Key Questions:
I How do we parametrize the model?
I What loss (objective) function should we use to judge the fit?
I How do we optimize fit to unseen test data (generalization)?
y (x) = w0 + w1 x
Sources of noise:
I Imprecision in data attributes (input noise, e.g., noise in per-capita crime)
I Errors in data targets (mis-labeling, e.g., noise in house prices)
I Additional attributes, not captured by the data attributes, that affect the target values (latent variables). In the example, what else could affect house prices?
I Model may be too simple to account for data targets
Define a model
y (x) = function(x, w)
Linear: y (x) = w0 + w1 x
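A minimal sketch of this linear model in Python (the parameter values below are only illustrative):

def y(x, w0, w1):
    # Linear model: prediction is an affine function of the input
    return w0 + w1 * x

print(y(3.0, w0=1.0, w1=2.0))  # with w0 = 1 and w1 = 2, an input of 3 predicts 7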
Optimizing the Objective
One straightforward method: gradient descent
I initialize w (e.g., randomly)
I repeatedly update w based on the gradient
w ← w − λ ∂ℓ/∂w
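A hedged sketch of this loop for the sum-of-squares objective, assuming a design matrix X whose rows are [1, x^(n)] (the learning rate and iteration count are illustrative choices, not prescribed by the handout):

import numpy as np

def gradient_descent(X, t, lam=1e-3, n_iters=1000):
    """Minimize sum_n (t^(n) - y(x^(n)))^2 by gradient descent on w."""
    N, D = X.shape
    w = 0.01 * np.random.randn(D)      # initialize w (e.g., randomly)
    for _ in range(n_iters):
        y = X @ w                      # predictions for all N training examples
        grad = -2 * X.T @ (t - y)      # gradient of the loss w.r.t. w
        w = w - lam * grad             # w <- w - lambda * dl/dw
    return w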
For some objectives we can also find the optimal solution analytically
This is the case for linear least-squares regression
How?
Compute the derivatives of the objective w.r.t. w and set them to 0
Define the design matrix X, with one row [1, x^(n)] per training example, and stack the targets into a vector t:
X = [ [1, x^(1)], [1, x^(2)], ..., [1, x^(N)] ],   t = [t^(1), ..., t^(N)]^T
Then:
w = (X^T X)^(-1) X^T t
(work it out!)
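A small sketch of the closed-form solution (the helper name fit_least_squares is made up for illustration; solving the normal equations is preferable to forming the explicit inverse):

import numpy as np

def fit_least_squares(x, t):
    """Closed-form least-squares fit of y(x) = w0 + w1 * x."""
    X = np.column_stack([np.ones_like(x), x])   # design matrix with rows [1, x^(n)]
    # Solve (X^T X) w = X^T t rather than explicitly inverting X^T X
    return np.linalg.solve(X.T @ X, X.T @ t)

x = np.array([0.0, 1.0, 2.0, 3.0])
t = np.array([1.1, 2.9, 5.2, 6.8])    # roughly t = 1 + 2x plus noise (illustrative)
print(fit_least_squares(x, t))        # approximately [1, 2]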
Multi-dimensional Inputs
y (x) = w0 + w1 x1 + w2 x2
Imagine now we want to predict the median house price from these
multi-dimensional observations
Each house is a data point n, with observations indexed by j:
x^(n) = (x_1^(n), · · · , x_j^(n), · · · , x_d^(n))
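The same closed-form fit carries over to d-dimensional inputs; a sketch, with the houses stacked into an N x d array (all values below are made up for illustration):

import numpy as np

# N = 4 houses, d = 2 observed attributes per house (e.g., size and per-capita crime)
X_raw = np.array([[1200.0, 0.30],
                  [1500.0, 0.10],
                  [ 900.0, 0.80],
                  [2000.0, 0.05]])
t = np.array([300.0, 420.0, 210.0, 560.0])          # median house prices (illustrative)

X = np.column_stack([np.ones(len(X_raw)), X_raw])   # rows [1, x_1^(n), ..., x_d^(n)]
w = np.linalg.lstsq(X, t, rcond=None)[0]            # weights [w0, w1, ..., wd]
print(w)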
What if our linear model is not good? How can we create a more
complicated model?
We can create a more complicated model by defining input variables that are
combinations of components of x
Example: an M-th order polynomial function of a one-dimensional feature x:
y(x, w) = w0 + Σ_{j=1}^{M} w_j x^j
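A sketch of this feature expansion using NumPy's np.vander to build the powers of x (the degree M = 3 and the noisy sine targets are illustrative):

import numpy as np

def polynomial_design_matrix(x, M):
    """Columns [1, x, x^2, ..., x^M] for a 1-D input vector x."""
    return np.vander(x, N=M + 1, increasing=True)

x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + 0.1 * np.random.randn(10)   # noisy targets

Phi = polynomial_design_matrix(x, M=3)
w = np.linalg.lstsq(Phi, t, rcond=None)[0]               # [w0, w1, ..., wM]
print(w)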
Intuition: the regularized objective adds a penalty term to the squared error, e.g. ℓ(w) = Σ_{n=1}^{N} (t^(n) − y(x^(n)))^2 + α Σ_j w_j^2; since we are minimizing the loss, the second term encourages smaller values in w
When the penalty is on the squared weights, this is known as ridge regression in statistics
Leads to a modified update rule for gradient descent:
w ← w + 2λ [ Σ_{n=1}^{N} (t^(n) − y(x^(n))) x^(n) − α w ]
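A minimal sketch of this regularized update, reusing the design-matrix convention from above (the learning rate lam and penalty weight alpha are illustrative values):

import numpy as np

def ridge_gradient_descent(X, t, lam=1e-3, alpha=0.1, n_iters=1000):
    """Gradient descent on the penalized sum-of-squares loss (ridge regression)."""
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(n_iters):
        residual = t - X @ w                              # t^(n) - y(x^(n)) for all n
        w = w + 2 * lam * (X.T @ residual - alpha * w)    # update rule from the handout
    return w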
I Better generalization
I Choose α carefully