Lec22: Introduction to Bayesian Regression
Bayesian Regression
Prof. Nicholas Zabaras
Email: [email protected]
URL: https://www.zabaras.com/
The linear regression model with additive Gaussian noise is
$$ y(\boldsymbol{x},\boldsymbol{w}) = \boldsymbol{\phi}(\boldsymbol{x})^T\boldsymbol{w}, \qquad t = y(\boldsymbol{x},\boldsymbol{w}) + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0,\beta^{-1}), $$
and the posterior over the weights is
$$ p(\boldsymbol{w}\mid\boldsymbol{x},\boldsymbol{t},\alpha,\beta) \propto p(\boldsymbol{t}\mid\boldsymbol{x},\boldsymbol{w},\beta)\, p(\boldsymbol{w}\mid\alpha). $$
* Note that one should not penalize the bias term $w_0$, as it does not contribute to overfitting. We discussed in an earlier lecture how to address this by treating the bias term separately and working with "centered" data: centered responses $\boldsymbol{t}_c = \boldsymbol{t} - \bar t\,\boldsymbol{1}_N$, $\bar t \equiv \frac{1}{N}\sum_{i=1}^{N} t_i$, and centered basis functions such that $y(\boldsymbol{x}_n,\boldsymbol{w}) = \boldsymbol{\phi}_c(\boldsymbol{x}_n)^T\boldsymbol{w}$, with $\boldsymbol{\Phi}_c = \big[\boldsymbol{\varphi}_{1c}\;\boldsymbol{\varphi}_{2c}\;\ldots\;\boldsymbol{\varphi}_{Mc}\big]$ and $\boldsymbol{\varphi}_{ic} = \boldsymbol{\varphi}_i - \frac{1}{N}\sum_{n=1}^{N}\varphi_i(\boldsymbol{x}_n)$ (so that each centered column sums to zero), $i = 1,\ldots,M-1$. Here $\boldsymbol{\phi}^T$ are the rows and $\boldsymbol{\varphi}$ the columns of the $N\times M$ design matrix $\boldsymbol{\Phi}$.
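To make the model concrete, here is a minimal Python/NumPy sketch (my own illustration; the course materials use MATLAB demos) that draws synthetic data from $t = \boldsymbol{\phi}(x)^T\boldsymbol{w} + \varepsilon$ with a polynomial feature map. The values of `beta` and `w_true` are arbitrary choices for the example.

```python
import numpy as np

def poly_features(x, M):
    """Design matrix with rows phi(x_n)^T = [1, x_n, x_n^2, ..., x_n^{M-1}]."""
    return np.vander(x, M, increasing=True)

rng = np.random.default_rng(0)
beta = 25.0                               # assumed noise precision (noise variance 1/beta)
w_true = np.array([0.3, -1.2, 0.8])       # hypothetical "true" weights (M = 3)

x = rng.uniform(-1.0, 1.0, size=20)       # inputs
Phi = poly_features(x, M=3)               # N x M design matrix
eps = rng.normal(0.0, beta ** -0.5, size=x.size)
t = Phi @ w_true + eps                    # t = y(x, w) + eps
```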
MAP Estimate
We can determine 𝒘 (point estimate) by finding the most probable value of 𝒘
given the data, i.e. maximizing the posterior.
With the prior $p(\boldsymbol{w}\mid\alpha) = \mathcal{N}(\boldsymbol{w}\mid\boldsymbol{0},\alpha^{-1}\boldsymbol{I})$, the posterior is proportional to
$$ \exp\!\Big(-\tfrac{\alpha}{2}\,\boldsymbol{w}^T\boldsymbol{I}_{M\times M}\,\boldsymbol{w} \;-\; \tfrac{\beta}{2}\,\boldsymbol{w}^T\Big(\sum_{n=1}^{N}\boldsymbol{\phi}(\boldsymbol{x}_n)\boldsymbol{\phi}(\boldsymbol{x}_n)^T\Big)\boldsymbol{w} \;+\; \beta\,\boldsymbol{w}^T\sum_{n=1}^{N}t_n\,\boldsymbol{\phi}(\boldsymbol{x}_n)\Big) $$
$$ \propto\; \mathcal{N}\Big(\boldsymbol{w}\;\Big|\;\beta\,\boldsymbol{S}_N\sum_{n=1}^{N}t_n\,\boldsymbol{\phi}(\boldsymbol{x}_n),\;\boldsymbol{S}_N\Big), \qquad \boldsymbol{S}_N^{-1} = \alpha\boldsymbol{I} + \beta\sum_{n=1}^{N}\boldsymbol{\phi}(\boldsymbol{x}_n)\boldsymbol{\phi}(\boldsymbol{x}_n)^T. $$
E.g. for polynomial regression: $\boldsymbol{\phi}(x_n) = \big(1,\; x_n,\; x_n^2,\;\ldots,\; x_n^{M-1}\big)^T$, $\boldsymbol{\phi}(x)^T = \big(1\;\; x\;\; x^2\;\cdots\; x^{M-1}\big)$, and $\boldsymbol{I}$ is the $M\times M$ unit matrix.

In matrix notation,
$$ \boldsymbol{m}_N = \beta\,\boldsymbol{S}_N\boldsymbol{\Phi}^T\boldsymbol{t}, \qquad \boldsymbol{S}_N^{-1} = \alpha\boldsymbol{I} + \beta\,\boldsymbol{\Phi}^T\boldsymbol{\Phi}, $$
and the MAP estimate is the posterior mean/mode, $\boldsymbol{w}_{\mathrm{MAP}} = \boldsymbol{m}_N$.
For a general Gaussian prior
$$ p(\boldsymbol{w}) = \mathcal{N}(\boldsymbol{w}\mid\boldsymbol{m}_0,\boldsymbol{S}_0), $$
combining the prior with the likelihood and using the results for marginal and conditional Gaussian distributions gives the posterior
$$ p(\boldsymbol{w}\mid\boldsymbol{t}) = \mathcal{N}(\boldsymbol{w}\mid\boldsymbol{m}_N,\boldsymbol{S}_N), $$
where
$$ \boldsymbol{m}_N = \boldsymbol{S}_N\big(\boldsymbol{S}_0^{-1}\boldsymbol{m}_0 + \beta\,\boldsymbol{\Phi}^T\boldsymbol{t}\big), \qquad \boldsymbol{S}_N^{-1} = \boldsymbol{S}_0^{-1} + \beta\,\boldsymbol{\Phi}^T\boldsymbol{\Phi}. $$
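As an illustrative sketch of these formulas (mine, not the course's MATLAB code), the function below returns $\boldsymbol{m}_N$ and $\boldsymbol{S}_N$ given a design matrix, targets, a Gaussian prior, and the noise precision; the data and hyperparameter values are placeholders.

```python
import numpy as np

def posterior(Phi, t, m0, S0, beta):
    """Posterior N(w | mN, SN) with SN^{-1} = S0^{-1} + beta Phi^T Phi and
    mN = SN (S0^{-1} m0 + beta Phi^T t)."""
    S0_inv = np.linalg.inv(S0)
    SN = np.linalg.inv(S0_inv + beta * Phi.T @ Phi)
    mN = SN @ (S0_inv @ m0 + beta * Phi.T @ t)
    return mN, SN

# Example with the isotropic prior m0 = 0, S0 = alpha^{-1} I used on the earlier slide
alpha, beta, M = 2.0, 25.0, 3
rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, 30)
Phi = np.vander(x, M, increasing=True)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, beta ** -0.5, x.size)
mN, SN = posterior(Phi, t, np.zeros(M), np.eye(M) / alpha, beta)
```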
Posterior Distribution: Derivation
$$ p(\boldsymbol{w}\mid\boldsymbol{m}_0,\boldsymbol{S}_0) \propto \exp\!\Big(-\tfrac{1}{2}\,(\boldsymbol{w}-\boldsymbol{m}_0)^T\boldsymbol{S}_0^{-1}(\boldsymbol{w}-\boldsymbol{m}_0)\Big), $$
$$ p(\boldsymbol{t}\mid\boldsymbol{x},\boldsymbol{w},\beta) \propto \exp\!\Big(-\tfrac{\beta}{2}\sum_{n=1}^{N}\big(\boldsymbol{\phi}(\boldsymbol{x}_n)^T\boldsymbol{w}-t_n\big)^2\Big) \propto \exp\!\Big(-\tfrac{\beta}{2}\,\boldsymbol{w}^T\Big(\sum_{n=1}^{N}\boldsymbol{\phi}(\boldsymbol{x}_n)\boldsymbol{\phi}(\boldsymbol{x}_n)^T\Big)\boldsymbol{w} + \beta\sum_{n=1}^{N}t_n\,\boldsymbol{\phi}(\boldsymbol{x}_n)^T\boldsymbol{w}\Big). $$
We now have the product of two Gaussians, and the posterior is easily computed by completing the square in $\boldsymbol{w}$:
$$ p(\boldsymbol{w}\mid\boldsymbol{x},\boldsymbol{t},\boldsymbol{S}_0,\beta) \propto \exp\!\Big(-\tfrac{1}{2}\,\boldsymbol{w}^T\boldsymbol{S}_0^{-1}\boldsymbol{w} + \boldsymbol{w}^T\boldsymbol{S}_0^{-1}\boldsymbol{m}_0 - \tfrac{\beta}{2}\,\boldsymbol{w}^T\Big(\sum_{n=1}^{N}\boldsymbol{\phi}(\boldsymbol{x}_n)\boldsymbol{\phi}(\boldsymbol{x}_n)^T\Big)\boldsymbol{w} + \beta\,\boldsymbol{w}^T\sum_{n=1}^{N}t_n\,\boldsymbol{\phi}(\boldsymbol{x}_n)\Big) $$
$$ \propto \mathcal{N}\Big(\boldsymbol{w}\;\Big|\;\underbrace{\boldsymbol{S}_N\Big(\boldsymbol{S}_0^{-1}\boldsymbol{m}_0 + \beta\sum_{n=1}^{N}t_n\,\boldsymbol{\phi}(\boldsymbol{x}_n)\Big)}_{\boldsymbol{m}_N},\;\boldsymbol{S}_N\Big), \qquad \boldsymbol{S}_N^{-1} = \boldsymbol{S}_0^{-1} + \beta\sum_{n=1}^{N}\boldsymbol{\phi}(\boldsymbol{x}_n)\boldsymbol{\phi}(\boldsymbol{x}_n)^T = \boldsymbol{S}_0^{-1} + \beta\,\boldsymbol{\Phi}^T\boldsymbol{\Phi}. $$
Sequential Posterior Calculation
Note that because the posterior distribution is Gaussian, its posterior mode coincides
with its mean.
$$ p(\boldsymbol{w}\mid\boldsymbol{t}) = \mathcal{N}(\boldsymbol{w}\mid\boldsymbol{m}_N,\boldsymbol{S}_N), \qquad \boldsymbol{m}_N = \boldsymbol{S}_N\big(\boldsymbol{S}_0^{-1}\boldsymbol{m}_0 + \beta\,\boldsymbol{\Phi}^T\boldsymbol{t}\big), \qquad \boldsymbol{S}_N^{-1} = \boldsymbol{S}_0^{-1} + \beta\,\boldsymbol{\Phi}^T\boldsymbol{\Phi}, \qquad \boldsymbol{w}_{\mathrm{MAP}} = \boldsymbol{m}_N. $$
The above expressions for the posterior mean and covariance can also be written as a sequential calculation: having already observed $N$ data points, we now consider an additional data point $(\boldsymbol{x}_{N+1}, t_{N+1})$. Treating the current posterior $\mathcal{N}(\boldsymbol{w}\mid\boldsymbol{m}_N,\boldsymbol{S}_N)$ as the prior for the new observation, we have
$$ \boldsymbol{S}_{N+1}^{-1} = \boldsymbol{S}_N^{-1} + \beta\,\boldsymbol{\phi}(\boldsymbol{x}_{N+1})\,\boldsymbol{\phi}(\boldsymbol{x}_{N+1})^T, \qquad \boldsymbol{m}_{N+1} = \boldsymbol{S}_{N+1}\big(\boldsymbol{S}_N^{-1}\boldsymbol{m}_N + \beta\, t_{N+1}\,\boldsymbol{\phi}(\boldsymbol{x}_{N+1})\big). $$
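A sketch (mine) of this rank-one sequential update, with a sanity check that feeding the data in one point at a time reproduces the batch posterior; all names and numerical values are illustrative.

```python
import numpy as np

def update_one(mN, SN, phi_new, t_new, beta):
    """One-point posterior update: S_{N+1}^{-1} = S_N^{-1} + beta phi phi^T,
    m_{N+1} = S_{N+1} (S_N^{-1} m_N + beta t_{N+1} phi)."""
    SN_inv = np.linalg.inv(SN)
    S_new = np.linalg.inv(SN_inv + beta * np.outer(phi_new, phi_new))
    m_new = S_new @ (SN_inv @ mN + beta * t_new * phi_new)
    return m_new, S_new

# Sanity check: sequential updates reproduce the batch posterior.
alpha, beta, M = 2.0, 25.0, 3
rng = np.random.default_rng(2)
x = rng.uniform(-1.0, 1.0, 10)
Phi = np.vander(x, M, increasing=True)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, beta ** -0.5, x.size)

m, S = np.zeros(M), np.eye(M) / alpha          # prior N(0, alpha^{-1} I)
for phi_n, t_n in zip(Phi, t):
    m, S = update_one(m, S, phi_n, t_n, beta)

S_batch = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)
m_batch = beta * S_batch @ Phi.T @ t
assert np.allclose(m, m_batch) and np.allclose(S, S_batch)
```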
[Figures (MATLAB code): prior/posterior contours in $\boldsymbol{w}$-space and the corresponding $y(x,\boldsymbol{w})$ curves in data space, drawn using samples of $\boldsymbol{w}$ from the posterior, after observing a single data point.]
Note that the regression lines pass close to the data point (shown with a circle)
[Figures: the corresponding panels after observing two data points.]
Note that the regression lines now pass close to both data points
Example: 22 Data Points Collected
[Figures: likelihood contour, contours of the posterior, and $y(x,\boldsymbol{w})$ curves using samples of $\boldsymbol{w}$ from the posterior.]
Note how closely the regression lines cluster after 22 data points have been collected.
Summary of Results
[Figures (MATLAB code): prior/posterior (no data yet) and data space (no data yet), followed by the corresponding prior/posterior and data-space panels as data points are added.]
In a Bayesian treatment we are ultimately interested in the predictive distribution
$$ p(t\mid x,\boldsymbol{x},\boldsymbol{t}) = \int p(t\mid x,\boldsymbol{w})\, p(\boldsymbol{w}\mid\boldsymbol{x},\boldsymbol{t})\, d\boldsymbol{w}, \quad\text{where} $$
$$ p(t\mid x,\boldsymbol{w},\beta) = \mathcal{N}\big(t\mid y(x,\boldsymbol{w}),\beta^{-1}\big) = \mathcal{N}\big(t\mid\boldsymbol{\phi}(x)^T\boldsymbol{w},\beta^{-1}\big) \quad\text{and} $$
$$ p(\boldsymbol{w}\mid\boldsymbol{x},\boldsymbol{t},\alpha,\beta) = \mathcal{N}\Big(\boldsymbol{w}\;\Big|\;\beta\,\boldsymbol{S}_N\sum_{n=1}^{N}t_n\,\boldsymbol{\phi}(\boldsymbol{x}_n),\;\boldsymbol{S}_N\Big), \qquad \boldsymbol{S}_N^{-1} = \alpha\boldsymbol{I} + \beta\sum_{n=1}^{N}\boldsymbol{\phi}(\boldsymbol{x}_n)\boldsymbol{\phi}(\boldsymbol{x}_n)^T. $$
Recall the marginal and conditional results for a linear Gaussian model: if $p(\boldsymbol{x}) = \mathcal{N}(\boldsymbol{x}\mid\boldsymbol{\mu},\boldsymbol{\Lambda}^{-1})$ and $p(\boldsymbol{y}\mid\boldsymbol{x}) = \mathcal{N}(\boldsymbol{y}\mid\boldsymbol{A}\boldsymbol{x}+\boldsymbol{b},\boldsymbol{L}^{-1})$, then
$$ p(\boldsymbol{y}) = \mathcal{N}\big(\boldsymbol{y}\mid\boldsymbol{A}\boldsymbol{\mu}+\boldsymbol{b},\;\boldsymbol{L}^{-1}+\boldsymbol{A}\boldsymbol{\Lambda}^{-1}\boldsymbol{A}^T\big), $$
$$ p(\boldsymbol{x}\mid\boldsymbol{y}) = \mathcal{N}\big(\boldsymbol{x}\mid\boldsymbol{\Sigma}\{\boldsymbol{A}^T\boldsymbol{L}(\boldsymbol{y}-\boldsymbol{b})+\boldsymbol{\Lambda}\boldsymbol{\mu}\},\;\boldsymbol{\Sigma}\big), \qquad \boldsymbol{\Sigma} = \big(\boldsymbol{\Lambda}+\boldsymbol{A}^T\boldsymbol{L}\boldsymbol{A}\big)^{-1}. $$
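A quick Monte Carlo sanity check of the marginal result (my own sketch; the matrices below are arbitrary): sample $\boldsymbol{x}\sim\mathcal{N}(\boldsymbol{\mu},\boldsymbol{\Lambda}^{-1})$ and $\boldsymbol{y}\mid\boldsymbol{x}\sim\mathcal{N}(\boldsymbol{A}\boldsymbol{x}+\boldsymbol{b},\boldsymbol{L}^{-1})$, then compare the sample mean and variance of $\boldsymbol{y}$ with $\boldsymbol{A}\boldsymbol{\mu}+\boldsymbol{b}$ and $\boldsymbol{L}^{-1}+\boldsymbol{A}\boldsymbol{\Lambda}^{-1}\boldsymbol{A}^T$.

```python
import numpy as np

rng = np.random.default_rng(3)
mu = np.array([1.0, -0.5])
Lam = np.array([[2.0, 0.3], [0.3, 1.5]])   # precision of x
A = np.array([[1.0, 2.0]])                 # y is 1-D, x is 2-D
b = np.array([0.5])
L = np.array([[4.0]])                      # precision of y given x

n = 200_000
x = rng.multivariate_normal(mu, np.linalg.inv(Lam), size=n)           # x ~ N(mu, Lam^{-1})
noise = rng.multivariate_normal(np.zeros(1), np.linalg.inv(L), size=n)
y = x @ A.T + b + noise                                                # y | x ~ N(Ax + b, L^{-1})

print(y.mean(axis=0), A @ mu + b)                          # both ~ A mu + b
print(y.var(axis=0, ddof=1),
      np.linalg.inv(L) + A @ np.linalg.inv(Lam) @ A.T)     # both ~ L^{-1} + A Lam^{-1} A^T
```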
Predictive Distribution
Identify the weight posterior and the noise model with the linear Gaussian model above:
$$ p(\boldsymbol{x}) = \mathcal{N}(\boldsymbol{x}\mid\boldsymbol{\mu},\boldsymbol{\Lambda}^{-1}) \;\longleftrightarrow\; p(\boldsymbol{w}\mid\boldsymbol{x},\boldsymbol{t},\alpha,\beta) = \mathcal{N}\Big(\boldsymbol{w}\;\Big|\;\beta\,\boldsymbol{S}_N\sum_{n=1}^{N}t_n\,\boldsymbol{\phi}(\boldsymbol{x}_n),\;\boldsymbol{S}_N\Big), $$
$$ p(\boldsymbol{y}\mid\boldsymbol{x}) = \mathcal{N}(\boldsymbol{y}\mid\boldsymbol{A}\boldsymbol{x}+\boldsymbol{b},\boldsymbol{L}^{-1}) \;\longleftrightarrow\; p(t\mid x,\boldsymbol{w},\beta) = \mathcal{N}\big(t\mid y(x,\boldsymbol{w}),\beta^{-1}\big) = \mathcal{N}\big(t\mid\boldsymbol{\phi}(x)^T\boldsymbol{w},\beta^{-1}\big). $$
Thus for our problem:
$$ \boldsymbol{x}\rightarrow\boldsymbol{w}, \qquad \boldsymbol{\mu} = \beta\,\boldsymbol{S}_N\sum_{n=1}^{N}t_n\,\boldsymbol{\phi}(\boldsymbol{x}_n), \qquad \boldsymbol{\Lambda}^{-1} = \boldsymbol{S}_N, $$
$$ \boldsymbol{y}\rightarrow t, \qquad \boldsymbol{A} = \boldsymbol{\phi}(x)^T, \qquad \boldsymbol{b} = 0, \qquad \boldsymbol{L}^{-1} = \beta^{-1}. $$
The predictive distribution now takes the form
$$ p(\boldsymbol{y}) = \mathcal{N}\big(\boldsymbol{y}\mid\boldsymbol{A}\boldsymbol{\mu}+\boldsymbol{b},\;\boldsymbol{L}^{-1}+\boldsymbol{A}\boldsymbol{\Lambda}^{-1}\boldsymbol{A}^T\big) $$
$$ \Longrightarrow\quad p(t\mid x,\boldsymbol{x},\boldsymbol{t}) = \mathcal{N}\Big(t\;\Big|\;\beta\,\boldsymbol{\phi}(x)^T\boldsymbol{S}_N\sum_{n=1}^{N}t_n\,\boldsymbol{\phi}(\boldsymbol{x}_n),\;\beta^{-1}+\boldsymbol{\phi}(x)^T\boldsymbol{S}_N\,\boldsymbol{\phi}(x)\Big). $$
Predictive Distribution
In a full Bayesian treatment, we want to compute the predictive distribution,
i.e. given the training data 𝒙 and 𝒕 and a new test point 𝑥, we want the
distribution:
$$ p(t\mid x,\boldsymbol{x},\boldsymbol{t}) = \int p(t\mid x,\boldsymbol{w})\, p(\boldsymbol{w}\mid\boldsymbol{x},\boldsymbol{t})\, d\boldsymbol{w}, \qquad p(t\mid x,\boldsymbol{w},\beta) = \mathcal{N}\big(t\mid y(x,\boldsymbol{w}),\beta^{-1}\big), $$
$$ p(t\mid x,\boldsymbol{x},\boldsymbol{t}) = \mathcal{N}\big(t\mid m(x),\, s^2(x)\big), $$
where the mean and variance were shown in the earlier slide to be
$$ m(x) = \beta\,\boldsymbol{\phi}(x)^T\boldsymbol{S}_N\sum_{n=1}^{N}\boldsymbol{\phi}(\boldsymbol{x}_n)\,t_n, \qquad s^2(x) = \beta^{-1} + \boldsymbol{\phi}(x)^T\boldsymbol{S}_N\,\boldsymbol{\phi}(x), \qquad \boldsymbol{S}_N^{-1} = \alpha\boldsymbol{I} + \beta\sum_{n=1}^{N}\boldsymbol{\phi}(\boldsymbol{x}_n)\boldsymbol{\phi}(\boldsymbol{x}_n)^T. $$
E.g. for polynomial regression: $\boldsymbol{\phi}(x_n) = \big(1,\; x_n,\; x_n^2,\;\ldots,\; x_n^{M-1}\big)^T$, $\boldsymbol{\phi}(x)^T = \big(1\;\; x\;\; x^2\;\cdots\; x^{M-1}\big)$, and $\boldsymbol{I}$ is the $M\times M$ unit matrix.
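A sketch of the predictive mean $m(x)$ and variance $s^2(x)$ for the isotropic prior and polynomial features (my own Python illustration, not the MATLAB code used for the figures; the $\alpha$, $\beta$ values mirror the example on the next slide).

```python
import numpy as np

def fit_posterior(Phi, t, alpha, beta):
    """SN^{-1} = alpha I + beta Phi^T Phi,  mN = beta SN Phi^T t."""
    SN = np.linalg.inv(alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi)
    return beta * SN @ Phi.T @ t, SN

def predict(x_star, mN, SN, beta, M):
    """Predictive mean m(x) = phi(x)^T mN and variance s^2(x) = 1/beta + phi(x)^T SN phi(x)."""
    Phi_star = np.vander(np.atleast_1d(x_star), M, increasing=True)
    mean = Phi_star @ mN
    var = 1.0 / beta + np.einsum('ij,jk,ik->i', Phi_star, SN, Phi_star)
    return mean, var

alpha, beta, M = 5e-3, 11.1, 10    # degree-9 polynomial, i.e. phi(x) = [1, x, ..., x^9]
rng = np.random.default_rng(4)
x = rng.uniform(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, beta ** -0.5, x.size)
mN, SN = fit_posterior(np.vander(x, M, increasing=True), t, alpha, beta)
mean, var = predict(np.linspace(0.0, 1.0, 5), mN, SN, beta, M)
```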
Predictive Distribution
The predictive distribution is shown below for an $M = 9$ polynomial, with fixed parameters $\alpha = 5\times10^{-3}$ and $\beta = 11.1$ (corresponding to a known noise variance). The red curve denotes the mean of the predictive distribution and the red region corresponds to $\pm 1$ standard deviation around the mean.
[Figure: "Polynomial Bayesian Regression, M = 9" — generating function $\sin(2\pi x)$, random data points used for fitting, the predictive mean, and a band spanning one standard deviation of the predictive distribution.]
Note that
$$ \sigma_N^2(x) = \frac{1}{\beta} + \boldsymbol{\phi}(x)^T\boldsymbol{S}_N\,\boldsymbol{\phi}(x) \qquad\text{and}\qquad \sigma_{N+1}^2(x) \le \sigma_N^2(x). $$
Predictive Distribution
It is easy to show:
$$ \sigma_N^2(x) = \frac{1}{\beta} + \boldsymbol{\phi}(x)^T\boldsymbol{S}_N\,\boldsymbol{\phi}(x) \qquad\text{and}\qquad \sigma_{N+1}^2(x) \le \sigma_N^2(x). $$
Note that
$$ \boldsymbol{S}_{N+1}^{-1} = \alpha\boldsymbol{I} + \beta\sum_{n=1}^{N+1}\boldsymbol{\phi}(\boldsymbol{x}_n)\boldsymbol{\phi}(\boldsymbol{x}_n)^T = \boldsymbol{S}_N^{-1} + \beta\,\boldsymbol{\phi}(\boldsymbol{x}_{N+1})\,\boldsymbol{\phi}(\boldsymbol{x}_{N+1})^T, $$
so that, by the matrix inversion (Sherman-Morrison) lemma,
$$ \boldsymbol{S}_{N+1} = \boldsymbol{S}_N - \frac{\beta\,\boldsymbol{S}_N\,\boldsymbol{\phi}(\boldsymbol{x}_{N+1})\,\boldsymbol{\phi}(\boldsymbol{x}_{N+1})^T\,\boldsymbol{S}_N}{1+\beta\,\boldsymbol{\phi}(\boldsymbol{x}_{N+1})^T\boldsymbol{S}_N\,\boldsymbol{\phi}(\boldsymbol{x}_{N+1})}, $$
and therefore
$$ \sigma_{N+1}^2(x) = \sigma_N^2(x) - \frac{\beta\,\big(\boldsymbol{\phi}(x)^T\boldsymbol{S}_N\,\boldsymbol{\phi}(\boldsymbol{x}_{N+1})\big)^2}{1+\beta\,\boldsymbol{\phi}(\boldsymbol{x}_{N+1})^T\boldsymbol{S}_N\,\boldsymbol{\phi}(\boldsymbol{x}_{N+1})} \;\le\; \sigma_N^2(x). $$
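The monotone decrease of the predictive variance can be checked numerically; the sketch below (mine, with arbitrary data) verifies that adding one observation never increases $s^2(x)$ at any query point.

```python
import numpy as np

def SN_of(x, alpha, beta, M):
    """SN for the isotropic prior: SN^{-1} = alpha I + beta sum_n phi(x_n) phi(x_n)^T."""
    Phi = np.vander(x, M, increasing=True)
    return np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)

alpha, beta, M = 2.0, 25.0, 4
rng = np.random.default_rng(5)
x = rng.uniform(0.0, 1.0, 8)

S_N  = SN_of(x, alpha, beta, M)
S_N1 = SN_of(np.append(x, rng.uniform(0.0, 1.0)), alpha, beta, M)   # one extra observation

xs = np.linspace(-1.0, 2.0, 200)            # query points, including extrapolation
Phi_s = np.vander(xs, M, increasing=True)
var_N  = 1.0 / beta + np.einsum('ij,jk,ik->i', Phi_s, S_N,  Phi_s)
var_N1 = 1.0 / beta + np.einsum('ij,jk,ik->i', Phi_s, S_N1, Phi_s)
assert np.all(var_N1 <= var_N + 1e-9)       # sigma^2_{N+1}(x) <= sigma^2_N(x) everywhere
```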
The predictive uncertainty depends on $x$ and is smallest in the neighborhood of the data points; the level of uncertainty decreases as more data points are observed.

[Figures (MATLAB code): predictive distribution for increasing numbers of data points (e.g. $N = 2$); each panel shows the generating function $\sin(2\pi x)$, the random data points used for fitting, the predictive mean, and the predictive standard deviation.]
[Figures: the plug-in predictive $p(t\mid x,\hat{\boldsymbol{w}})$ compared with the full posterior predictive; functions sampled from the plug-in approximation to the posterior versus functions sampled from the posterior. Run linregPostPredDemo from PMTK3.]
Predictive Distribution
We are not interested in 𝒘 itself but in making predictions of 𝑡 for new values of 𝑥.
This requires the predictive distribution
$$ p(t\mid x,\boldsymbol{x},\boldsymbol{t}) = \int p(t\mid x,\boldsymbol{w},\beta)\, p(\boldsymbol{w}\mid\boldsymbol{x},\boldsymbol{t},\alpha,\beta)\, d\boldsymbol{w} = \mathcal{N}\big(t\mid\boldsymbol{m}_N^T\boldsymbol{\phi}(x),\,\sigma_N^2(x)\big), $$
where
$$ \sigma_N^2(x) = \frac{1}{\beta} + \boldsymbol{\phi}(x)^T\boldsymbol{S}_N\,\boldsymbol{\phi}(x), \qquad \boldsymbol{S}_N^{-1} = \alpha\boldsymbol{I} + \beta\,\boldsymbol{\Phi}^T\boldsymbol{\Phi}. $$
The 1st term represents the noise on the data whereas the 2nd term reflects the
uncertainty associated with 𝒘.
Because the noise process and the distribution of 𝒘 are independent Gaussians,
their variances are additive.
The error bars get larger as we move away from the training points. By contrast, in
the plug-in approximation, the error bars are of constant size.
As additional data points are observed, the posterior distribution becomes narrower.
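The contrast between the plug-in approximation and the full predictive can be seen in a small sketch (mine, not the PMTK3 demo): the plug-in predictive $p(t\mid x,\hat{\boldsymbol{w}})$ with $\hat{\boldsymbol{w}}=\boldsymbol{m}_N$ keeps the constant variance $1/\beta$, whereas the full predictive variance $1/\beta+\boldsymbol{\phi}(x)^T\boldsymbol{S}_N\boldsymbol{\phi}(x)$ grows away from the data.

```python
import numpy as np

alpha, beta, M = 2.0, 25.0, 4
rng = np.random.default_rng(6)
x = rng.uniform(0.0, 1.0, 12)
Phi = np.vander(x, M, increasing=True)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, beta ** -0.5, x.size)

SN = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)
mN = beta * SN @ Phi.T @ t

xs = np.array([0.5, 2.0, 5.0])               # inside the data, then further and further away
Phi_s = np.vander(xs, M, increasing=True)
plugin_var = np.full(xs.shape, 1.0 / beta)                             # constant error bars
bayes_var = 1.0 / beta + np.einsum('ij,jk,ik->i', Phi_s, SN, Phi_s)    # grows with distance
print(plugin_var)
print(bayes_var)
```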
[Figures (MATLAB code): plots of $y(x,\boldsymbol{w})$ where $\boldsymbol{w}$ is a sample from the posterior over $\boldsymbol{w}$, for $N = 1$, $N = 2$, and $N = 12$ data points.]

Here we are visualizing the joint uncertainty in the posterior distribution between the $y$ values at two or more $x$ values.
The model becomes very confident in its predictions when extrapolating outside the
region occupied by the basis functions. This is an undesirable behavior.
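A sketch (mine) of this effect with localized Gaussian basis functions: far from the basis-function centers $\boldsymbol{\phi}(x)\approx\boldsymbol{0}$, so the predictive variance collapses to the noise floor $1/\beta$ even though no data were observed there.

```python
import numpy as np

def gauss_features(x, centers, s):
    """phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)), one column per center mu_j."""
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * s ** 2))

alpha, beta, s = 2.0, 25.0, 0.1
centers = np.linspace(0.0, 1.0, 9)           # basis functions confined to [0, 1]
rng = np.random.default_rng(7)
x = rng.uniform(0.0, 1.0, 15)
Phi = gauss_features(x, centers, s)
SN = np.linalg.inv(alpha * np.eye(centers.size) + beta * Phi.T @ Phi)

for x_star in (0.5, 3.0):                    # inside the data vs. far outside
    phi = gauss_features(np.array([x_star]), centers, s)[0]
    var = 1.0 / beta + phi @ SN @ phi
    print(x_star, var)                       # at x = 3, phi(x) ~ 0 and var ~ 1/beta: overconfident
```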
Equivalent Kernel

The posterior-mean predictions can be written as linear combinations of the training targets:
$$ y(x,\boldsymbol{m}_N) = \sum_{n=1}^{N} k(x,\boldsymbol{x}_n)\, t_n, \qquad k(x,\boldsymbol{x}_n) = \beta\,\boldsymbol{\phi}(x)^T \boldsymbol{S}_N\, \boldsymbol{\phi}(\boldsymbol{x}_n). $$
The kernel is shown for Gaussian basis functions
$$ \phi_j(x) = \exp\!\Big(-\frac{(x-\mu_j)^2}{2s^2}\Big), $$
with 20 samples $x_n$ equally spaced in $[0,1]$.

[Figures (MATLAB code): the kernel $k(x,x')$ plotted against $x'$, and a contour plot of the kernel.]
Equivalent Kernel
$$ y(x,\boldsymbol{m}_N) = \sum_{n=1}^{N} k(x,\boldsymbol{x}_n)\, t_n, \qquad k(x,\boldsymbol{x}_n) = \beta\,\boldsymbol{\phi}(x)^T \boldsymbol{S}_N\, \boldsymbol{\phi}(\boldsymbol{x}_n) $$
[Figure: the equivalent kernel corresponding to 20 Gaussian basis functions ($s = 0.05$).]
Note that we make predictions by taking linear combinations of the training set target
values (linear smoothers).
The equivalent kernel depends on the input values 𝒙𝑛 since these appear in 𝑺𝑁.
See from the figure that it is localized around $x$, and so the mean of the predictive distribution $y(\boldsymbol{x},\boldsymbol{m}_N)$ is obtained by forming a weighted combination of the target values in which data points close to $\boldsymbol{x}$ are given higher weight than points further removed from $\boldsymbol{x}$.

[Figure: contour plot of the kernel (MATLAB code).]
[Figures (MATLAB code): equivalent kernels $k(x,x')$ plotted against $x'$ for the polynomial basis $\phi_j(x) = x^j$ and for the sigmoidal basis $\phi_j(x) = \sigma\!\big((x-x_j)/s\big)$, with $\sigma(a) = 1/(1+e^{-a})$.]
Examples of equivalent kernels 𝑘(𝑥, 𝑥’) for 𝑥 = 0 plotted as a function of 𝑥’. 10 basis
functions were used in each case. The sigmoidal basis is centered at 10 equally
spaced 𝑥𝑛 in [−1,1].
These are localized functions of 𝑥’ even though the corresponding basis functions
are nonlocal.
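An illustrative sketch (mine) of the equivalent kernel for Gaussian basis functions, checking that the posterior-mean prediction is the linear smoother $\sum_n k(x,x_n)\,t_n$, that the kernel weights are localized around the query point, and that they sum to (approximately) one.

```python
import numpy as np

def gauss_features(x, centers, s):
    """phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2))."""
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * s ** 2))

alpha, beta, s = 2.0, 25.0, 0.05
centers = np.linspace(0.0, 1.0, 20)          # 20 Gaussian basis functions, as in the figure
x_train = np.linspace(0.0, 1.0, 20)          # 20 training inputs equally spaced in [0, 1]
rng = np.random.default_rng(8)
Phi = gauss_features(x_train, centers, s)
t = np.sin(2 * np.pi * x_train) + rng.normal(0.0, beta ** -0.5, x_train.size)

SN = np.linalg.inv(alpha * np.eye(centers.size) + beta * Phi.T @ Phi)
mN = beta * SN @ Phi.T @ t

x_star = 0.5
phi_star = gauss_features(np.array([x_star]), centers, s)[0]
k = beta * phi_star @ SN @ Phi.T             # k(x*, x_n) for every training input x_n

print(k @ t, phi_star @ mN)                  # linear smoother: sum_n k(x*, x_n) t_n == y(x*, mN)
print(k.sum())                               # approximately one
print(np.round(k, 3))                        # weights are localized around x* = 0.5
```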
Equivalent Kernel
Consider the covariance between $y(\boldsymbol{x})$ and $y(\boldsymbol{x}')$. Recalling that
$$ p(\boldsymbol{w}\mid\boldsymbol{x},\boldsymbol{t},\alpha,\beta) = \mathcal{N}\Big(\boldsymbol{w}\;\Big|\;\beta\,\boldsymbol{S}_N\sum_{n=1}^{N}t_n\,\boldsymbol{\phi}(\boldsymbol{x}_n),\;\boldsymbol{S}_N\Big), $$
this covariance is given by
$$ \operatorname{cov}\!\big[y(\boldsymbol{x}),\,y(\boldsymbol{x}')\big] = \operatorname{cov}\!\big[\boldsymbol{\phi}(\boldsymbol{x})^T\boldsymbol{w},\;\boldsymbol{w}^T\boldsymbol{\phi}(\boldsymbol{x}')\big] = \boldsymbol{\phi}(\boldsymbol{x})^T \boldsymbol{S}_N\, \boldsymbol{\phi}(\boldsymbol{x}') = \beta^{-1}\, k(\boldsymbol{x},\boldsymbol{x}'). $$
From the earlier discussion on the equivalent kernel, we see that the predictive mean
at nearby points will be highly correlated, whereas for more distant points the
correlation will be smaller.
By drawing samples from the posterior distribution over 𝒘, and plotting the
corresponding model functions 𝑦(𝒙, 𝒘), we are visualizing the joint uncertainty in the
posterior distribution between the 𝑦 values at two (or more) 𝒙 values, as governed by
the equivalent kernel.
This is in contrast to the figure shown earlier, where we visualized the pointwise uncertainty in the predictions.
The equivalent kernel satisfies
$$ \sum_{n=1}^{N} k(\boldsymbol{x},\boldsymbol{x}_n) = 1 $$
for all values of 𝒙. However, the equivalent kernel may be negative for some values
of 𝒙.
Like all kernel functions, the equivalent kernel can be expressed as an inner product:
$$ k(\boldsymbol{x},\boldsymbol{z}) = \boldsymbol{\psi}(\boldsymbol{x})^T\boldsymbol{\psi}(\boldsymbol{z}), \qquad\text{where}\quad k(\boldsymbol{x},\boldsymbol{z}) = \beta\,\boldsymbol{\phi}(\boldsymbol{x})^T \boldsymbol{S}_N\, \boldsymbol{\phi}(\boldsymbol{z}) \quad\text{and}\quad \boldsymbol{\psi}(\boldsymbol{x}) = \beta^{1/2}\, \boldsymbol{S}_N^{1/2}\, \boldsymbol{\phi}(\boldsymbol{x}). $$
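A short numerical check (mine) of this inner-product form: with $\boldsymbol{\psi}(x)=\beta^{1/2}\boldsymbol{S}_N^{1/2}\boldsymbol{\phi}(x)$, the products $\boldsymbol{\psi}(x)^T\boldsymbol{\psi}(z)$ reproduce $k(x,z)=\beta\,\boldsymbol{\phi}(x)^T\boldsymbol{S}_N\boldsymbol{\phi}(z)$; a symmetric square root of $\boldsymbol{S}_N$ is obtained from its eigendecomposition.

```python
import numpy as np

def gauss_features(x, centers, s):
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * s ** 2))

alpha, beta, s = 2.0, 25.0, 0.05
centers = np.linspace(0.0, 1.0, 20)
x_train = np.linspace(0.0, 1.0, 20)
Phi = gauss_features(x_train, centers, s)
SN = np.linalg.inv(alpha * np.eye(centers.size) + beta * Phi.T @ Phi)

# Symmetric square root of SN (SN is symmetric positive definite)
evals, evecs = np.linalg.eigh(SN)
SN_half = evecs @ np.diag(np.sqrt(evals)) @ evecs.T

def phi(x):
    return gauss_features(np.atleast_1d(x), centers, s)[0]

def psi(x):
    return np.sqrt(beta) * SN_half @ phi(x)          # psi(x) = beta^{1/2} SN^{1/2} phi(x)

def k(x, z):
    return beta * phi(x) @ SN @ phi(z)               # k(x, z) = beta phi(x)^T SN phi(z)

xa, za = 0.3, 0.7
assert np.isclose(psi(xa) @ psi(za), k(xa, za))      # inner-product form reproduces the kernel
```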