The chapter discusses linear regression models that use a linear combination of basis functions to model the relationship between input variables x and target variables t. It covers maximum likelihood and least squares estimation for fitting the regression weights, and introduces regularization to reduce overfitting. The bias-variance decomposition is explained as a means of understanding the tradeoff between model complexity and accuracy. Bayesian linear regression is also discussed as a way to place a probability distribution over possible regression weights.

Chris Bishop’s PRML

Ch. 3: Linear Models of Regression

Mathieu Guillaumin & Radu Horaud

October 25, 2007

Chapter content

- An example (polynomial curve fitting) was considered in Ch. 1.
- A linear combination (regression) of a fixed set of nonlinear functions (basis functions).
- Supervised learning: N observations {x_n} with corresponding target values {t_n} are provided. The goal is to predict t for a new value of x.
- Construct a function such that y(x) is a prediction of t.
- Probabilistic perspective: model the predictive distribution p(t|x).

Figure 1.16, page 29

Schematic of the predictive distribution p(t|x0, w, β): a Gaussian centred on y(x0, w), with width 2σ, shown around the regression curve y(x, w).
The chapter section by section

3.1 Linear basis function models
    - Maximum likelihood and least squares
    - Geometry of least squares
    - Sequential learning
    - Regularized least squares
3.2 The bias-variance decomposition
3.3 Bayesian linear regression
    - Parameter distribution
    - Predictive distribution
    - Equivalent kernel
3.4 Bayesian model comparison
3.5 The evidence approximation
3.6 Limitations of fixed basis functions

Linear Basis Function Models

y(x, w) = \sum_{j=0}^{M-1} w_j \phi_j(x) = w^\top \phi(x)

where:
- w = (w_0, ..., w_{M-1})^\top and φ = (φ_0, ..., φ_{M-1})^\top, with φ_0(x) = 1 so that w_0 is the bias parameter.
- In general x ∈ R^D, but it will be convenient to treat the case x ∈ R.
- We observe the set X = {x_1, ..., x_n, ..., x_N} with corresponding target variables t = {t_n}.

Basis function choices

- Polynomial: \phi_j(x) = x^j
- Gaussian: \phi_j(x) = \exp\left( -\frac{(x - \mu_j)^2}{2 s^2} \right)
- Sigmoidal: \phi_j(x) = \sigma\left( \frac{x - \mu_j}{s} \right), where \sigma(a) = \frac{1}{1 + e^{-a}}
- splines, Fourier, wavelets, etc. (a small code sketch of these choices follows below)
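As a concrete illustration, here is a minimal NumPy sketch of these basis-function choices and of the linear model y(x, w) = w^T φ(x) from the previous slide. The function names, the centres µ_j and the width s are our own choices for the example, not anything prescribed by the slides.

    import numpy as np

    def polynomial_basis(x, M):
        """phi_j(x) = x^j for j = 0, ..., M-1 (phi_0(x) = 1 is the bias column)."""
        return np.vander(x, M, increasing=True)

    def gaussian_basis(x, centres, s):
        """phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)), with a constant bias column prepended."""
        phi = np.exp(-(x[:, None] - centres[None, :]) ** 2 / (2 * s ** 2))
        return np.hstack([np.ones((len(x), 1)), phi])

    def sigmoid_basis(x, centres, s):
        """phi_j(x) = sigma((x - mu_j) / s) with sigma(a) = 1 / (1 + exp(-a)), plus a bias column."""
        phi = 1.0 / (1.0 + np.exp(-(x[:, None] - centres[None, :]) / s))
        return np.hstack([np.ones((len(x), 1)), phi])

    x = np.linspace(-1.0, 1.0, 200)
    centres = np.linspace(-1.0, 1.0, 9)       # basis-function centres mu_j (arbitrary choice)
    Phi = gaussian_basis(x, centres, s=0.2)   # design matrix: one row phi(x_n)^T per input
    w = np.zeros(Phi.shape[1])                # any weight vector
    y = Phi @ w                               # y(x_n, w) = w^T phi(x_n) for every x_n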

Examples of basis functions

(Plots, left to right: polynomial, Gaussian and sigmoidal basis functions on the interval [−1, 1].)

Maximum likelihood and least squares

t = y(x, w) + ε,  where y(x, w) is deterministic and ε is zero-mean Gaussian noise with precision β.

For an i.i.d. data set we have the likelihood function

p(t | X, w, \beta) = \prod_{n=1}^{N} \mathcal{N}(t_n \mid w^\top \phi(x_n), \beta^{-1})

with mean w^\top \phi(x_n) and variance \beta^{-1} for each factor.

We can use the machinery of maximum likelihood to estimate the parameters w and the precision β:

w_{ML} = (\Phi^\top \Phi)^{-1} \Phi^\top t,  where \Phi is the N x M design matrix with elements \Phi_{nj} = \phi_j(x_n),

and

\beta_{ML}^{-1} = \frac{1}{N} \sum_{n=1}^{N} \left( t_n - w_{ML}^\top \phi(x_n) \right)^2
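A minimal sketch of these two estimators on synthetic data. The sinusoidal toy data, noise level and Gaussian design matrix are our own choices for illustration:

    import numpy as np

    rng = np.random.default_rng(0)
    N = 25
    x = rng.uniform(0.0, 1.0, N)
    t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, N)   # noisy targets

    # Design matrix with a bias column and Gaussian basis functions.
    centres = np.linspace(0.0, 1.0, 9)
    Phi = np.hstack([np.ones((N, 1)),
                     np.exp(-(x[:, None] - centres[None, :]) ** 2 / (2 * 0.1 ** 2))])

    # w_ML = (Phi^T Phi)^{-1} Phi^T t, computed with a least-squares solver
    # rather than an explicit matrix inverse.
    w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)

    # 1 / beta_ML is the mean squared residual on the training set.
    beta_ml = 1.0 / np.mean((t - Phi @ w_ml) ** 2)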

Geometry of least squares

Schematic: the least-squares solution y is the orthogonal projection of the target vector t onto the subspace S spanned by the basis-function vectors ϕ1 and ϕ2.

Sequential learning

Apply a technique known as stochastic gradient descent or sequential gradient descent, i.e. replace the batch error

E_D(w) = \frac{1}{2} \sum_{n=1}^{N} \left( t_n - w^\top \phi(x_n) \right)^2

with the per-pattern update (η is a learning rate parameter)

w^{(\tau+1)} = w^{(\tau)} + \eta \left( t_n - w^{(\tau)\top} \phi(x_n) \right) \phi(x_n)    (3.23)

where the term multiplying η is the negative gradient -\nabla E_n of the error contributed by pattern n.
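A sketch of this sequential update, assuming a design matrix Phi and targets t as in the previous example; the learning rate and the number of passes are hand-picked for illustration:

    import numpy as np

    def sequential_ls(Phi, t, eta=0.05, n_passes=50, seed=0):
        """Stochastic gradient descent on the sum-of-squares error, one pattern at a time."""
        rng = np.random.default_rng(seed)
        w = np.zeros(Phi.shape[1])
        for _ in range(n_passes):
            for n in rng.permutation(len(t)):        # visit the patterns in random order
                err = t[n] - w @ Phi[n]              # t_n - w^T phi(x_n)
                w = w + eta * err * Phi[n]           # update (3.23)
        return w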

Regularized least squares

The total error function

\frac{1}{2} \sum_{n=1}^{N} \left( t_n - w^\top \phi(x_n) \right)^2 + \frac{\lambda}{2} w^\top w

is minimized by

w = (\lambda I + \Phi^\top \Phi)^{-1} \Phi^\top t

Regularization has the advantage of limiting the effective model complexity, so the problem of choosing the appropriate number of basis functions is replaced by that of finding a suitable value of the regularization coefficient λ.
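A minimal sketch of the regularized solution; lam plays the role of λ and the helper name is ours:

    import numpy as np

    def ridge_fit(Phi, t, lam):
        """w = (lambda I + Phi^T Phi)^{-1} Phi^T t, via a linear solve instead of an explicit inverse."""
        M = Phi.shape[1]
        return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)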

The Bias-Variance Decomposition

- Over-fitting occurs when the number of basis functions is large and the training data set is of limited size.
- Limiting the number of basis functions limits the flexibility of the model.
- Regularization can control over-fitting, but raises the question of how to determine λ.
- The bias-variance tradeoff is a frequentist viewpoint of model complexity.

Back to section 1.5.5
- The regression loss function: L(t, y(x)) = (y(x) - t)^2
- The decision problem: minimize the expected loss

  E[L] = \iint (y(x) - t)^2 \, p(x, t) \, dx \, dt

- Solution: y(x) = \int t \, p(t|x) \, dt = E_t[t|x]
  - this is known as the regression function
  - the conditional average of t conditioned on x, e.g. Figure 1.28, page 47
- Another expression for the expected loss:

  E[L] = \int (y(x) - E[t|x])^2 \, p(x) \, dx + \iint (E[t|x] - t)^2 \, p(x, t) \, dx \, dt    (1.90)

- The optimal prediction is obtained by minimizing the expected squared loss:

  h(x) = E[t|x] = \int t \, p(t|x) \, dt    (3.36)

- The expected squared loss can be decomposed into two terms:

  E[L] = \int (y(x) - h(x))^2 \, p(x) \, dx + \iint (h(x) - t)^2 \, p(x, t) \, dx \, dt    (3.37)

- The theoretical minimum of the first term is zero for an appropriate choice of the function y(x) (given unlimited data and unlimited computing power).
- The second term arises from noise in the data and represents the minimum achievable value of the expected squared loss.

An ensemble of data sets

- For any given data set D we obtain a prediction function y(x; D).
- The performance of a particular algorithm is assessed by averaging over an ensemble of such data sets, namely E_D[L]. This expands into the following terms:

  expected loss = (bias)^2 + variance + noise

- There is a tradeoff between bias and variance (estimated numerically in the sketch below):
  - flexible models have low bias and high variance
  - rigid models have high bias and low variance
- Although the bias-variance decomposition provides interesting insight into model complexity, it is of limited practical value, because several data sets are needed.
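A sketch of how (bias)^2 and variance can be estimated by regenerating the data L times; the sinusoidal ground truth h(x) = sin(2πx), the Gaussian design and the fixed regularization coefficient are our own illustrative choices:

    import numpy as np

    rng = np.random.default_rng(1)
    L, N, s, lam = 100, 25, 0.1, 0.1
    centres = np.linspace(0.0, 1.0, 24)
    x_test = np.linspace(0.0, 1.0, 200)
    h = np.sin(2 * np.pi * x_test)                   # h(x) = E[t|x] for this toy problem

    def design(x):
        return np.hstack([np.ones((len(x), 1)),
                          np.exp(-(x[:, None] - centres[None, :]) ** 2 / (2 * s ** 2))])

    preds = np.empty((L, len(x_test)))
    for l in range(L):                               # one regularized fit y(x; D_l) per data set D_l
        x = rng.uniform(0.0, 1.0, N)
        t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, N)
        Phi = design(x)
        w = np.linalg.solve(lam * np.eye(Phi.shape[1]) + Phi.T @ Phi, Phi.T @ t)
        preds[l] = design(x_test) @ w

    y_bar = preds.mean(axis=0)                       # average prediction over the ensemble
    bias_sq = np.mean((y_bar - h) ** 2)              # (bias)^2, averaged over x
    variance = np.mean(preds.var(axis=0))            # variance, averaged over x

Increasing lam lowers the variance and raises the squared bias, and decreasing it does the opposite.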

Example: L=100, N=25, M=25, Gaussian basis

(Plots of the resulting fits y(x; D) over [0, 1] for the ensemble of data sets, illustrating the bias-variance tradeoff.)

Bayesian Linear Regression (1/5)
Assume additive Gaussian noise with known precision β.
The likelihood function p(t|w) is the exponential of a quadratic function of w, so its conjugate prior is Gaussian:

p(w) = \mathcal{N}(w \mid m_0, S_0)    (3.48)

The posterior is also Gaussian (2.116):

p(w|t) = \mathcal{N}(w \mid m_N, S_N) \propto p(t|w) \, p(w)    (3.49)

where

m_N = S_N (S_0^{-1} m_0 + \beta \Phi^\top t)    (3.50)
S_N^{-1} = S_0^{-1} + \beta \Phi^\top \Phi    (3.51)

- Note how this fits a sequential learning framework: the posterior after one batch of data can serve as the prior for the next (see the sketch below).
- The maximum of a Gaussian is at its mean, so w_MAP = m_N.
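A minimal sketch of the posterior update (3.50)-(3.51); the function name is ours and Phi, t, beta, m0, S0 are assumed to be given:

    import numpy as np

    def posterior(Phi, t, beta, m0, S0):
        """Return (m_N, S_N) of the Gaussian posterior p(w | t) = N(w | m_N, S_N)."""
        S0_inv = np.linalg.inv(S0)
        SN_inv = S0_inv + beta * Phi.T @ Phi              # (3.51)
        SN = np.linalg.inv(SN_inv)
        mN = SN @ (S0_inv @ m0 + beta * Phi.T @ t)        # (3.50)
        return mN, SN

Calling it on successive batches of data, with the returned (m_N, S_N) fed back in as the prior (m_0, S_0) of the next call, gives the sequential behaviour noted above.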
Bayesian Linear Regression (2/5)
Assume the prior on w is a zero-mean isotropic Gaussian governed by a single precision hyperparameter α (i.e. m_0 = 0 and S_0 = α^{-1} I):

p(w | \alpha) = \mathcal{N}(w \mid 0, \alpha^{-1} I)    (3.52)

Then

m_N = \beta S_N \Phi^\top t    (3.53)
S_N^{-1} = \alpha I + \beta \Phi^\top \Phi    (3.54)

- Note that α → 0 gives m_N → w_{ML} = (\Phi^\top \Phi)^{-1} \Phi^\top t    (3.35)

The log of the posterior is the sum of the log likelihood and the log prior:

\ln p(w|t) = -\frac{\beta}{2} \sum_{n=1}^{N} \left( t_n - w^\top \phi(x_n) \right)^2 - \frac{\alpha}{2} w^\top w + \text{const}    (3.55)

so maximizing it is equivalent to least squares with a quadratic regularizer of coefficient λ = α/β.

Bayesian Linear Regression (3/5)
In practice, we want to make predictions of t for new values of x:

p(t | t, \alpha, \beta) = \int p(t | w, \beta) \, p(w | t, \alpha, \beta) \, dw    (3.57)

- Conditional distribution: p(t | w, \beta) = \mathcal{N}(t \mid y(x, w), \beta^{-1})    (3.8)
- Posterior: p(w | t, \alpha, \beta) = \mathcal{N}(w \mid m_N, S_N)    (3.49)

The convolution is a Gaussian (2.115):

p(t | x, t, \alpha, \beta) = \mathcal{N}(t \mid m_N^\top \phi(x), \sigma_N^2(x))    (3.58)

where

\sigma_N^2(x) = \beta^{-1} + \phi(x)^\top S_N \phi(x)    (3.59)

with β^{-1} accounting for the noise in the data and φ(x)^T S_N φ(x) for the uncertainty in w.
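A sketch of the predictive mean and variance (3.58)-(3.59), assuming a basis vector phi_x = φ(x) at the query point and the posterior (m_N, S_N) computed as in the earlier sketch:

    import numpy as np

    def predictive(phi_x, mN, SN, beta):
        """Mean and variance of p(t | x, t) = N(t | m_N^T phi(x), sigma_N^2(x))."""
        mean = mN @ phi_x                           # m_N^T phi(x)
        var = 1.0 / beta + phi_x @ SN @ phi_x       # beta^{-1} + phi(x)^T S_N phi(x)
        return mean, var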

Bayesian Linear Regression (4/5)

y(x, m_N) can be rewritten as \sum_{n=1}^{N} k(x, x_n) \, t_n, where

k(x, x') = \beta \, \phi(x)^\top S_N \phi(x')    (3.61-3.62)

Smoother matrix, equivalent kernel, linear smoother.

The kernel acts as a similarity or closeness measure, giving more weight to evidence that is close to the point where we want to make the prediction (see the sketch below).
- Basis functions ↔ kernel duality
- With \psi(x) = \beta^{1/2} S_N^{1/2} \phi(x), we have k(x, x') = \psi(x)^\top \psi(x')    (3.65)
- The kernel sums to one (over the training set)
- cov(y(x), y(x')) = \beta^{-1} k(x, x')    (3.63)
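A sketch of the equivalent kernel (3.62), assuming a training design matrix Phi, targets t, the posterior covariance SN and a basis vector phi_x = φ(x) at the query point:

    import numpy as np

    def equivalent_kernel(phi_x, Phi, SN, beta):
        """k(x, x_n) = beta * phi(x)^T S_N phi(x_n) for every training input x_n."""
        return beta * Phi @ (SN @ phi_x)             # shape (N,): one weight per training point

    # Prediction as a linear smoother: y(x, m_N) = sum_n k(x, x_n) t_n
    #   y = equivalent_kernel(phi_x, Phi, SN, beta) @ t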

Bayesian Linear Regression (5/5)
Kernel from Gaussian basis functions

Plots of the equivalent kernel k(x, x') for x' = 0, corresponding (left) to polynomial basis functions and (right) to sigmoidal basis functions.


Bayesian Model Comparison (1/2)
The over-fitting that appears in maximum likelihood can be avoided by marginalizing over the model parameters.

- Cross-validation is no longer needed
- All of the data can be used for training the model
- Models can be compared on the basis of the training data alone

p(M_i | D) \propto p(M_i) \, p(D | M_i)    (3.66)

p(D | M_i) is the model evidence or marginal likelihood.

Assuming the posterior p(w | D, M_i) is sharply peaked around w_MAP, with width ∆w_posterior, inside a flat prior of width ∆w_prior (single-parameter case):

p(D) = \int p(D | w) \, p(w) \, dw \simeq p(D | w_{MAP}) \, \frac{\Delta w_{\text{posterior}}}{\Delta w_{\text{prior}}}    (3.70)

Bayesian Model Comparison (2/2)
(Figures: a posterior of width ∆w_posterior sharply peaked around w_MAP inside a flat prior of width ∆w_prior; and the distributions p(D) over data sets D for three models M1, M2, M3 of increasing complexity, with the more complex models spreading their probability over a wider range of data sets.)

Back to multiple parameters: assuming they all share the same ∆w ratio, the complexity penalty is linear in M:

\ln p(D) \simeq \ln p(D | w_{MAP}) + M \ln\left( \frac{\Delta w_{\text{posterior}}}{\Delta w_{\text{prior}}} \right)    (3.72)

About p(D | M_i):
- if M_i is too simple, it fits the data poorly
- if M_i is too complex/powerful, the probability of generating any particular observed data set is washed out (spread over too many possible data sets)
The evidence approximation (1/2)
A fully Bayesian treatment would imply marginalizing over both hyperparameters and parameters, but this is intractable:

p(t | t) = \iiint p(t | w, \beta) \, p(w | t, \alpha, \beta) \, p(\alpha, \beta | t) \, dw \, d\alpha \, d\beta    (3.74)

An approximation is found by maximizing the marginal likelihood function p(α, β | t) ∝ p(t | α, β) p(α, β) to obtain (α̂, β̂) (empirical Bayes):

\ln p(t | \alpha, \beta) = \frac{M}{2} \ln \alpha + \frac{N}{2} \ln \beta - E(m_N) - \frac{1}{2} \ln |S_N^{-1}| - \frac{N}{2} \ln(2\pi)    (3.77 → 3.86)

Assuming p(α, β | t) is sharply peaked at (α̂, β̂):

p(t | t) \simeq p(t | t, \hat{\alpha}, \hat{\beta}) = \int p(t | w, \hat{\beta}) \, p(w | t, \hat{\alpha}, \hat{\beta}) \, dw    (3.75)
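A sketch of evaluating this log evidence, assuming the zero-mean prior of (3.52) so that E(m_N) = (β/2)‖t − Φ m_N‖² + (α/2) m_N^T m_N as in PRML (3.82); evaluating it for models of different size M (or over a grid of α, β) produces curves like the one on the next slide:

    import numpy as np

    def log_evidence(Phi, t, alpha, beta):
        """ln p(t | alpha, beta) with A = S_N^{-1} = alpha I + beta Phi^T Phi."""
        N, M = Phi.shape
        A = alpha * np.eye(M) + beta * Phi.T @ Phi
        mN = beta * np.linalg.solve(A, Phi.T @ t)                 # posterior mean, (3.53)-(3.54)
        E_mN = (0.5 * beta * np.sum((t - Phi @ mN) ** 2)
                + 0.5 * alpha * mN @ mN)                          # E(m_N), cf. (3.82)
        log_det_A = np.linalg.slogdet(A)[1]                       # ln |S_N^{-1}|
        return (0.5 * M * np.log(alpha) + 0.5 * N * np.log(beta)
                - E_mN - 0.5 * log_det_A - 0.5 * N * np.log(2 * np.pi))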

The evidence approximation (2/2)
Plot of the model evidence ln p(t|α, β) versus the model complexity M, for polynomial regression on the synthetic sinusoidal example (with fixed α).


The computation of (α̂, β̂) gives rise to the quantity γ = α m_N^\top m_N (3.90); γ has the nice interpretation of being the effective number of parameters.