The chapter discusses linear regression models that use a linear combination of basis functions to model the relationship between input variables x and target variables t. It covers maximum likelihood and least squares estimation for fitting the regression weights, and introduces regularization to reduce overfitting. The bias-variance decomposition is explained as a means of understanding the tradeoff between model complexity and accuracy. Bayesian linear regression is also discussed as a way to place a probability distribution over possible regression weights.

Chris Bishop’s PRML

Ch. 3: Linear Models of Regression

Mathieu Guillaumin & Radu Horaud

October 25, 2007

Chapter content

- An example (polynomial curve fitting) was considered in Ch. 1.
- A linear combination (regression) of a fixed set of nonlinear functions (basis functions).
- Supervised learning: N observations {x_n} with corresponding target values {t_n} are provided. The goal is to predict t for a new value of x.
- Construct a function such that y(x) is a prediction of t.
- Probabilistic perspective: model the predictive distribution p(t|x).

Figure 1.16, page 29

Schematic of the predictive distribution p(t|x0, w, β): a Gaussian centred on y(x0, w), with width 2σ, shown around the regression curve y(x, w).
The chapter section by section

3.1 Linear basis function models
    - Maximum likelihood and least squares
    - Geometry of least squares
    - Sequential learning
    - Regularized least squares
3.2 The bias-variance decomposition
3.3 Bayesian linear regression
    - Parameter distribution
    - Predictive distribution
    - Equivalent kernel
3.4 Bayesian model comparison
3.5 The evidence approximation
3.6 Limitations of fixed basis functions

Linear Basis Function Models

y(x, w) = \sum_{j=0}^{M-1} w_j \phi_j(x) = w^\top \phi(x)

where:
- w = (w_0, ..., w_{M-1})^\top and φ = (φ_0, ..., φ_{M-1})^\top, with φ_0(x) = 1 so that w_0 is the bias parameter.
- In general x ∈ R^D, but it will be convenient to treat the case x ∈ R.
- We observe the set X = {x_1, ..., x_n, ..., x_N} with corresponding target variables t = {t_n}.

Basis function choices

- Polynomial: \phi_j(x) = x^j
- Gaussian: \phi_j(x) = \exp\left( -\frac{(x - \mu_j)^2}{2 s^2} \right)
- Sigmoidal: \phi_j(x) = \sigma\left( \frac{x - \mu_j}{s} \right), where \sigma(a) = \frac{1}{1 + e^{-a}}
- splines, Fourier, wavelets, etc. (a small code sketch of these choices follows below)
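As a concrete illustration, here is a minimal NumPy sketch of these basis-function choices and of the linear model y(x, w) = w^T φ(x) from the previous slide. The function names, the centres µ_j and the width s are our own choices for the example, not anything prescribed by the slides.

    import numpy as np

    def polynomial_basis(x, M):
        """phi_j(x) = x^j for j = 0, ..., M-1 (phi_0(x) = 1 is the bias column)."""
        return np.vander(x, M, increasing=True)

    def gaussian_basis(x, centres, s):
        """phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)), with a constant bias column prepended."""
        phi = np.exp(-(x[:, None] - centres[None, :]) ** 2 / (2 * s ** 2))
        return np.hstack([np.ones((len(x), 1)), phi])

    def sigmoid_basis(x, centres, s):
        """phi_j(x) = sigma((x - mu_j) / s) with sigma(a) = 1 / (1 + exp(-a)), plus a bias column."""
        phi = 1.0 / (1.0 + np.exp(-(x[:, None] - centres[None, :]) / s))
        return np.hstack([np.ones((len(x), 1)), phi])

    x = np.linspace(-1.0, 1.0, 200)
    centres = np.linspace(-1.0, 1.0, 9)       # basis-function centres mu_j (arbitrary choice)
    Phi = gaussian_basis(x, centres, s=0.2)   # design matrix: one row phi(x_n)^T per input
    w = np.zeros(Phi.shape[1])                # any weight vector
    y = Phi @ w                               # y(x_n, w) = w^T phi(x_n) for every x_n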

Examples of basis functions

(Plots, left to right: polynomial, Gaussian and sigmoidal basis functions on the interval [−1, 1].)

Maximum likelihood and least squares

t = y(x, w) + ε,  where y(x, w) is deterministic and ε is zero-mean Gaussian noise with precision β.

For an i.i.d. data set we have the likelihood function

p(t | X, w, \beta) = \prod_{n=1}^{N} \mathcal{N}(t_n \mid w^\top \phi(x_n), \beta^{-1})

with mean w^\top \phi(x_n) and variance \beta^{-1} for each factor.

We can use the machinery of maximum likelihood to estimate the parameters w and the precision β:

w_{ML} = (\Phi^\top \Phi)^{-1} \Phi^\top t,  where \Phi is the N x M design matrix with elements \Phi_{nj} = \phi_j(x_n),

and

\beta_{ML}^{-1} = \frac{1}{N} \sum_{n=1}^{N} \left( t_n - w_{ML}^\top \phi(x_n) \right)^2
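A minimal sketch of these two estimators on synthetic data. The sinusoidal toy data, noise level and Gaussian design matrix are our own choices for illustration:

    import numpy as np

    rng = np.random.default_rng(0)
    N = 25
    x = rng.uniform(0.0, 1.0, N)
    t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, N)   # noisy targets

    # Design matrix with a bias column and Gaussian basis functions.
    centres = np.linspace(0.0, 1.0, 9)
    Phi = np.hstack([np.ones((N, 1)),
                     np.exp(-(x[:, None] - centres[None, :]) ** 2 / (2 * 0.1 ** 2))])

    # w_ML = (Phi^T Phi)^{-1} Phi^T t, computed with a least-squares solver
    # rather than an explicit matrix inverse.
    w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)

    # 1 / beta_ML is the mean squared residual on the training set.
    beta_ml = 1.0 / np.mean((t - Phi @ w_ml) ** 2)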

Geometry of least squares

Schematic: the least-squares solution y is the orthogonal projection of the target vector t onto the subspace S spanned by the basis-function vectors ϕ1 and ϕ2.

Sequential learning

Apply a technique known as stochastic gradient descent or sequential gradient descent, i.e. replace the batch error

E_D(w) = \frac{1}{2} \sum_{n=1}^{N} \left( t_n - w^\top \phi(x_n) \right)^2

with the per-pattern update (η is a learning rate parameter)

w^{(\tau+1)} = w^{(\tau)} + \eta \left( t_n - w^{(\tau)\top} \phi(x_n) \right) \phi(x_n)    (3.23)

where the term multiplying η is the negative gradient -\nabla E_n of the error contributed by pattern n.
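A sketch of this sequential update, assuming a design matrix Phi and targets t as in the previous example; the learning rate and the number of passes are hand-picked for illustration:

    import numpy as np

    def sequential_ls(Phi, t, eta=0.05, n_passes=50, seed=0):
        """Stochastic gradient descent on the sum-of-squares error, one pattern at a time."""
        rng = np.random.default_rng(seed)
        w = np.zeros(Phi.shape[1])
        for _ in range(n_passes):
            for n in rng.permutation(len(t)):        # visit the patterns in random order
                err = t[n] - w @ Phi[n]              # t_n - w^T phi(x_n)
                w = w + eta * err * Phi[n]           # update (3.23)
        return w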

Regularized least squares

The total error function

\frac{1}{2} \sum_{n=1}^{N} \left( t_n - w^\top \phi(x_n) \right)^2 + \frac{\lambda}{2} w^\top w

is minimized by

w = (\lambda I + \Phi^\top \Phi)^{-1} \Phi^\top t

Regularization has the advantage of limiting the effective model complexity, so the problem of choosing the appropriate number of basis functions is replaced by that of finding a suitable value of the regularization coefficient λ.
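A minimal sketch of the regularized solution; lam plays the role of λ and the helper name is ours:

    import numpy as np

    def ridge_fit(Phi, t, lam):
        """w = (lambda I + Phi^T Phi)^{-1} Phi^T t, via a linear solve instead of an explicit inverse."""
        M = Phi.shape[1]
        return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)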

The Bias-Variance Decomposition

- Over-fitting occurs when the number of basis functions is large and the training data set is of limited size.
- Limiting the number of basis functions limits the flexibility of the model.
- Regularization can control over-fitting, but raises the question of how to determine λ.
- The bias-variance tradeoff is a frequentist viewpoint of model complexity.

Back to section 1.5.5
- The regression loss function: L(t, y(x)) = (y(x) - t)^2
- The decision problem: minimize the expected loss

  E[L] = \iint (y(x) - t)^2 \, p(x, t) \, dx \, dt

- Solution: y(x) = \int t \, p(t|x) \, dt = E_t[t|x]
  - this is known as the regression function
  - the conditional average of t conditioned on x, e.g. Figure 1.28, page 47
- Another expression for the expected loss:

  E[L] = \int (y(x) - E[t|x])^2 \, p(x) \, dx + \iint (E[t|x] - t)^2 \, p(x, t) \, dx \, dt    (1.90)

- The optimal prediction is obtained by minimizing the expected squared loss:

  h(x) = E[t|x] = \int t \, p(t|x) \, dt    (3.36)

- The expected squared loss can be decomposed into two terms:

  E[L] = \int (y(x) - h(x))^2 \, p(x) \, dx + \iint (h(x) - t)^2 \, p(x, t) \, dx \, dt    (3.37)

- The theoretical minimum of the first term is zero for an appropriate choice of the function y(x) (given unlimited data and unlimited computing power).
- The second term arises from noise in the data and represents the minimum achievable value of the expected squared loss.

An ensemble of data sets

- For any given data set D we obtain a prediction function y(x; D).
- The performance of a particular algorithm is assessed by averaging over an ensemble of such data sets, namely E_D[L]. This expands into the following terms:

  expected loss = (bias)^2 + variance + noise

- There is a tradeoff between bias and variance (estimated numerically in the sketch below):
  - flexible models have low bias and high variance
  - rigid models have high bias and low variance
- Although the bias-variance decomposition provides interesting insight into model complexity, it is of limited practical value, because several data sets are needed.
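A sketch of how (bias)^2 and variance can be estimated by regenerating the data L times; the sinusoidal ground truth h(x) = sin(2πx), the Gaussian design and the fixed regularization coefficient are our own illustrative choices:

    import numpy as np

    rng = np.random.default_rng(1)
    L, N, s, lam = 100, 25, 0.1, 0.1
    centres = np.linspace(0.0, 1.0, 24)
    x_test = np.linspace(0.0, 1.0, 200)
    h = np.sin(2 * np.pi * x_test)                   # h(x) = E[t|x] for this toy problem

    def design(x):
        return np.hstack([np.ones((len(x), 1)),
                          np.exp(-(x[:, None] - centres[None, :]) ** 2 / (2 * s ** 2))])

    preds = np.empty((L, len(x_test)))
    for l in range(L):                               # one regularized fit y(x; D_l) per data set D_l
        x = rng.uniform(0.0, 1.0, N)
        t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, N)
        Phi = design(x)
        w = np.linalg.solve(lam * np.eye(Phi.shape[1]) + Phi.T @ Phi, Phi.T @ t)
        preds[l] = design(x_test) @ w

    y_bar = preds.mean(axis=0)                       # average prediction over the ensemble
    bias_sq = np.mean((y_bar - h) ** 2)              # (bias)^2, averaged over x
    variance = np.mean(preds.var(axis=0))            # variance, averaged over x

Increasing lam lowers the variance and raises the squared bias, and decreasing it does the opposite.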

Example: L=100, N=25, M=25, Gaussian basis

(Plots of the resulting fits y(x; D) over [0, 1] for the ensemble of data sets, illustrating the bias-variance tradeoff.)

Bayesian Linear Regression (1/5)
Assume additive Gaussian noise with known precision β.
The likelihood function p(t|w) is the exponential of a quadratic function of w, so its conjugate prior is Gaussian:

p(w) = \mathcal{N}(w \mid m_0, S_0)    (3.48)

The posterior is also Gaussian (2.116):

p(w|t) = \mathcal{N}(w \mid m_N, S_N) \propto p(t|w) \, p(w)    (3.49)

where

m_N = S_N (S_0^{-1} m_0 + \beta \Phi^\top t)    (3.50)
S_N^{-1} = S_0^{-1} + \beta \Phi^\top \Phi    (3.51)

- Note how this fits a sequential learning framework: the posterior after one batch of data can serve as the prior for the next (see the sketch below).
- The maximum of a Gaussian is at its mean, so w_MAP = m_N.
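A minimal sketch of the posterior update (3.50)-(3.51); the function name is ours and Phi, t, beta, m0, S0 are assumed to be given:

    import numpy as np

    def posterior(Phi, t, beta, m0, S0):
        """Return (m_N, S_N) of the Gaussian posterior p(w | t) = N(w | m_N, S_N)."""
        S0_inv = np.linalg.inv(S0)
        SN_inv = S0_inv + beta * Phi.T @ Phi              # (3.51)
        SN = np.linalg.inv(SN_inv)
        mN = SN @ (S0_inv @ m0 + beta * Phi.T @ t)        # (3.50)
        return mN, SN

Calling it on successive batches of data, with the returned (m_N, S_N) fed back in as the prior (m_0, S_0) of the next call, gives the sequential behaviour noted above.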
Bayesian Linear Regression (2/5)
Assume the prior on w is a zero-mean isotropic Gaussian governed by a single precision hyperparameter α (i.e. m_0 = 0 and S_0 = α^{-1} I):

p(w | \alpha) = \mathcal{N}(w \mid 0, \alpha^{-1} I)    (3.52)

Then

m_N = \beta S_N \Phi^\top t    (3.53)
S_N^{-1} = \alpha I + \beta \Phi^\top \Phi    (3.54)

- Note that α → 0 gives m_N → w_{ML} = (\Phi^\top \Phi)^{-1} \Phi^\top t    (3.35)

The log of the posterior is the sum of the log likelihood and the log prior:

\ln p(w|t) = -\frac{\beta}{2} \sum_{n=1}^{N} \left( t_n - w^\top \phi(x_n) \right)^2 - \frac{\alpha}{2} w^\top w + \text{const}    (3.55)

so maximizing it is equivalent to least squares with a quadratic regularizer of coefficient λ = α/β.

Bayesian Linear Regression (3/5)
In practice, we want to make predictions of t for new values of x:

p(t | t, \alpha, \beta) = \int p(t | w, \beta) \, p(w | t, \alpha, \beta) \, dw    (3.57)

- Conditional distribution: p(t | w, \beta) = \mathcal{N}(t \mid y(x, w), \beta^{-1})    (3.8)
- Posterior: p(w | t, \alpha, \beta) = \mathcal{N}(w \mid m_N, S_N)    (3.49)

The convolution is a Gaussian (2.115):

p(t | x, t, \alpha, \beta) = \mathcal{N}(t \mid m_N^\top \phi(x), \sigma_N^2(x))    (3.58)

where

\sigma_N^2(x) = \beta^{-1} + \phi(x)^\top S_N \phi(x)    (3.59)

with β^{-1} accounting for the noise in the data and φ(x)^T S_N φ(x) for the uncertainty in w.
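A sketch of the predictive mean and variance (3.58)-(3.59), assuming a basis vector phi_x = φ(x) at the query point and the posterior (m_N, S_N) computed as in the earlier sketch:

    import numpy as np

    def predictive(phi_x, mN, SN, beta):
        """Mean and variance of p(t | x, t) = N(t | m_N^T phi(x), sigma_N^2(x))."""
        mean = mN @ phi_x                           # m_N^T phi(x)
        var = 1.0 / beta + phi_x @ SN @ phi_x       # beta^{-1} + phi(x)^T S_N phi(x)
        return mean, var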

Bayesian Linear Regression (4/5)

y(x, m_N) can be rewritten as \sum_{n=1}^{N} k(x, x_n) \, t_n, where

k(x, x') = \beta \, \phi(x)^\top S_N \phi(x')    (3.61-3.62)

Smoother matrix, equivalent kernel, linear smoother.

The kernel acts as a similarity or closeness measure, giving more weight to evidence that is close to the point where we want to make the prediction (see the sketch below).
- Basis functions ↔ kernel duality
- With \psi(x) = \beta^{1/2} S_N^{1/2} \phi(x), we have k(x, x') = \psi(x)^\top \psi(x')    (3.65)
- The kernel sums to one (over the training set)
- cov(y(x), y(x')) = \beta^{-1} k(x, x')    (3.63)
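A sketch of the equivalent kernel (3.62), assuming a training design matrix Phi, targets t, the posterior covariance SN and a basis vector phi_x = φ(x) at the query point:

    import numpy as np

    def equivalent_kernel(phi_x, Phi, SN, beta):
        """k(x, x_n) = beta * phi(x)^T S_N phi(x_n) for every training input x_n."""
        return beta * Phi @ (SN @ phi_x)             # shape (N,): one weight per training point

    # Prediction as a linear smoother: y(x, m_N) = sum_n k(x, x_n) t_n
    #   y = equivalent_kernel(phi_x, Phi, SN, beta) @ t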

Bayesian Linear Regression (5/5)
Kernel from Gaussian basis functions

Plots of the equivalent kernel k(x, x') for x' = 0, corresponding (left) to polynomial basis functions and (right) to sigmoidal basis functions.


Bayesian Model Comparison (1/2)
The over-fitting that appears in maximum likelihood can be avoided by marginalizing over the model parameters.

- Cross-validation is no longer needed
- All of the data can be used for training the model
- Models can be compared on the basis of the training data alone

p(M_i | D) \propto p(M_i) \, p(D | M_i)    (3.66)

p(D | M_i) is the model evidence or marginal likelihood.

Assuming the posterior p(w | D, M_i) is sharply peaked around w_MAP, with width ∆w_posterior, inside a flat prior of width ∆w_prior (single-parameter case):

p(D) = \int p(D | w) \, p(w) \, dw \simeq p(D | w_{MAP}) \, \frac{\Delta w_{\text{posterior}}}{\Delta w_{\text{prior}}}    (3.70)

Bayesian Model Comparison (2/2)
(Figures: a posterior of width ∆w_posterior sharply peaked around w_MAP inside a flat prior of width ∆w_prior; and the distributions p(D) over data sets D for three models M1, M2, M3 of increasing complexity, with the more complex models spreading their probability over a wider range of data sets.)

Back to multiple parameters: assuming they all share the same ∆w ratio, the complexity penalty is linear in M:

\ln p(D) \simeq \ln p(D | w_{MAP}) + M \ln\left( \frac{\Delta w_{\text{posterior}}}{\Delta w_{\text{prior}}} \right)    (3.72)

About p(D | M_i):
- if M_i is too simple, it fits the data poorly
- if M_i is too complex/powerful, the probability of generating any particular observed data set is washed out (spread over too many possible data sets)
The evidence approximation (1/2)
A fully Bayesian treatment would imply marginalizing over both hyperparameters and parameters, but this is intractable:

p(t | t) = \iiint p(t | w, \beta) \, p(w | t, \alpha, \beta) \, p(\alpha, \beta | t) \, dw \, d\alpha \, d\beta    (3.74)

An approximation is found by maximizing the marginal likelihood function p(α, β | t) ∝ p(t | α, β) p(α, β) to obtain (α̂, β̂) (empirical Bayes):

\ln p(t | \alpha, \beta) = \frac{M}{2} \ln \alpha + \frac{N}{2} \ln \beta - E(m_N) - \frac{1}{2} \ln |S_N^{-1}| - \frac{N}{2} \ln(2\pi)    (3.77 → 3.86)

Assuming p(α, β | t) is sharply peaked at (α̂, β̂):

p(t | t) \simeq p(t | t, \hat{\alpha}, \hat{\beta}) = \int p(t | w, \hat{\beta}) \, p(w | t, \hat{\alpha}, \hat{\beta}) \, dw    (3.75)
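A sketch of evaluating this log evidence, assuming the zero-mean prior of (3.52) so that E(m_N) = (β/2)‖t − Φ m_N‖² + (α/2) m_N^T m_N as in PRML (3.82); evaluating it for models of different size M (or over a grid of α, β) produces curves like the one on the next slide:

    import numpy as np

    def log_evidence(Phi, t, alpha, beta):
        """ln p(t | alpha, beta) with A = S_N^{-1} = alpha I + beta Phi^T Phi."""
        N, M = Phi.shape
        A = alpha * np.eye(M) + beta * Phi.T @ Phi
        mN = beta * np.linalg.solve(A, Phi.T @ t)                 # posterior mean, (3.53)-(3.54)
        E_mN = (0.5 * beta * np.sum((t - Phi @ mN) ** 2)
                + 0.5 * alpha * mN @ mN)                          # E(m_N), cf. (3.82)
        log_det_A = np.linalg.slogdet(A)[1]                       # ln |S_N^{-1}|
        return (0.5 * M * np.log(alpha) + 0.5 * N * np.log(beta)
                - E_mN - 0.5 * log_det_A - 0.5 * N * np.log(2 * np.pi))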

The evidence approximation (2/2)
Plot of the model evidence ln p(t|α, β) versus the model complexity M, for polynomial regression on the synthetic sinusoidal example (with fixed α).


The computation of (α̂, β̂) gives rise to the quantity γ = α m_N^\top m_N (3.90); γ has the nice interpretation of being the effective number of parameters.