Lec22 Introduction2BayesianRegression

This document introduces Bayesian regression and related concepts. It discusses using a Gaussian likelihood and Gaussian prior distribution to develop a Bayesian regression model. The posterior distribution is derived, which is also Gaussian. Maximum a posteriori (MAP) estimation is discussed, and it is shown that MAP estimates are equivalent to regularized least squares estimates. The document outlines goals of explaining Bayesian inference and prediction for regression, and discussing regression with basis functions versus Gaussian process regression.

Introduction to Bayesian Regression
Prof. Nicholas Zabaras

Email: [email protected]
URL: https://www.zabaras.com/

September 25, 2020

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras


Contents

• Bayesian regression, MAP and regularized least squares
• Posterior distribution, sequential posterior calculation, predictive distribution, point-wise uncertainty in the predictions, plug-in approximation, regression function samples
• Regression with basis functions vs. Gaussian process regression
• Equivalent-kernel based regression

References:
• C. Bishop, Pattern Recognition and Machine Learning (PRML), Chapters 1 and 2
• K. Murphy, Machine Learning: A Probabilistic Perspective, Chapter 5
• C. P. Robert, The Bayesian Choice: From Decision-Theoretic Motivations to Computational Implementation, Springer-Verlag, NY, 2001 (online resource)
• A. Gelman, J. B. Carlin, H. S. Stern and D. B. Rubin, Bayesian Data Analysis, Chapman and Hall/CRC Press, 2nd Edition, 2003
• J.-M. Marin and C. P. Robert, The Bayesian Core, Springer-Verlag, 2007 (online resource)
• B. Vidakovic, Bayesian Statistics for Engineering, online course at Georgia Tech


Goals

• The goals for today's lecture are the following:
  • Understand Bayesian regression with a Gaussian likelihood and a Gaussian prior: Bayesian inference and prediction
  • Understand how regularized least squares estimates can be derived as MAP estimates
  • Learn how regression models can be posed in terms of kernels


Likelihood Model

• Consider a set of training data comprising $N$ inputs $\mathbf{x} = (x_1, \ldots, x_N)^T$ and the corresponding target values $\mathbf{t} = (t_1, \ldots, t_N)^T$.

• We assume that, given the value of $x$, the corresponding value of $t$ has a Gaussian distribution with mean equal to $y(x, \mathbf{w})$ and precision (inverse of the variance) $\beta$:

$$p(t \mid x, \mathbf{w}, \beta) = \mathcal{N}\big(t \mid y(x, \mathbf{w}),\, \beta^{-1}\big), \qquad y(x, \mathbf{w}) = \boldsymbol{\phi}(x)^T\mathbf{w},$$

i.e. $t = y(x, \mathbf{w}) + \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, \beta^{-1})$.


MAP and Regularized Least Squares

• Now let us take a step towards a Bayesian approach and introduce a prior distribution over the coefficients $\mathbf{w}$. Consider a Gaussian distribution

$$p(\mathbf{w} \mid \alpha) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1}\mathbf{I}) = \left(\frac{\alpha}{2\pi}\right)^{M/2}\exp\!\left(-\frac{\alpha}{2}\,\mathbf{w}^T\mathbf{w}\right)$$

where $\alpha$ is the precision of the distribution, and $M$ is the total number of elements in the vector $\mathbf{w}$.*

• Using Bayes' theorem:

$$p(\mathbf{w} \mid \mathbf{x}, \mathbf{t}, \alpha, \beta) \propto p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta)\, p(\mathbf{w} \mid \alpha)$$
* Note that one should not penalize the bias term $w_0$, as it does not contribute to overfitting. We discussed in an earlier lecture how to address this by treating the bias term separately and working with "centered" data: centered responses $\mathbf{t}_c = \mathbf{t} - \bar{t}\,\mathbf{1}_N$ with $\bar{t} \equiv \sum_{n=1}^N t_n / N$, and centered basis functions such that $y(\mathbf{x}_n, \mathbf{w}) = \boldsymbol{\phi}_c(\mathbf{x}_n)^T\mathbf{w}$, where $\boldsymbol{\Phi}_c = [\boldsymbol{\varphi}_1^c\; \boldsymbol{\varphi}_2^c\; \ldots\; \boldsymbol{\varphi}_M^c]$ and $\boldsymbol{\varphi}_i^c = \boldsymbol{\varphi}_i - \tfrac{1}{N}\sum_{n=1}^N \varphi_i(\mathbf{x}_n)\,\mathbf{1}_N$, so that each centered column has zero mean, $i = 1, \ldots, M-1$. Here $\boldsymbol{\phi}^T$ are the rows and $\boldsymbol{\varphi}$ the columns of the $N \times M$ design matrix $\boldsymbol{\Phi}$.
MAP Estimate

• We can determine $\mathbf{w}$ (point estimate) by finding the most probable value of $\mathbf{w}$ given the data, i.e. by maximizing the posterior.

• This technique is called maximum a posteriori (MAP) estimation.

• The maximum of the posterior is given by the minimum of

$$\frac{\beta}{2}\sum_{n=1}^{N}\big(y(x_n, \mathbf{w}) - t_n\big)^2 + \frac{\alpha}{2}\,\mathbf{w}^T\mathbf{w}.$$

• Note that the MAP point estimate is equivalent to minimizing a regularized sum-of-squares error function with regularization parameter

$$\lambda = \frac{\alpha}{\beta} = \frac{\text{precision of the prior}}{\text{precision in the data}}.$$
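This equivalence is easy to check numerically. Below is a minimal sketch (illustrative Python, not the lecture's MATLAB demos; the data, polynomial basis and hyperparameter values are assumptions) that computes the MAP estimate directly and via ridge regression with $\lambda = \alpha/\beta$.

```python
# Minimal sketch: MAP estimate equals ridge-regularized least squares with
# lambda = alpha / beta.  All names and values below are illustrative.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: t = sin(2*pi*x) + Gaussian noise with precision beta
N, M = 10, 9                 # number of data points, number of basis functions
beta = 11.1                  # assumed known noise precision
alpha = 5e-3                 # prior precision on the weights
x = rng.uniform(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 1.0 / np.sqrt(beta), N)

# Polynomial design matrix Phi with rows phi(x_n)^T = [1, x_n, ..., x_n^(M-1)]
Phi = np.vander(x, M, increasing=True)          # shape (N, M)

# MAP estimate: minimize (beta/2)*||Phi w - t||^2 + (alpha/2)*||w||^2
# => (alpha*I + beta*Phi^T Phi) w_MAP = beta*Phi^T t
w_map = np.linalg.solve(alpha * np.eye(M) + beta * Phi.T @ Phi, beta * Phi.T @ t)

# Same solution written as ridge regression with lambda = alpha / beta
lam = alpha / beta
w_ridge = np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)
assert np.allclose(w_map, w_ridge)              # identical up to round-off
```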


Posterior Distribution

$$p(\mathbf{w} \mid \alpha) = \left(\frac{\alpha}{2\pi}\right)^{M/2}\exp\!\left(-\frac{\alpha}{2}\,\mathbf{w}^T\mathbf{w}\right), \qquad p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) \propto \exp\!\left(-\frac{\beta}{2}\sum_{n=1}^N\big(\boldsymbol{\phi}(x_n)^T\mathbf{w} - t_n\big)^2\right)$$

Expanding the likelihood exponent (quadratic in $\mathbf{w}$):

$$p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) \propto \exp\!\left(-\frac{\beta}{2}\,\mathbf{w}^T\!\left[\sum_{n=1}^N\boldsymbol{\phi}(x_n)\boldsymbol{\phi}(x_n)^T\right]\!\mathbf{w} + \beta\,\mathbf{w}^T\sum_{n=1}^N t_n\,\boldsymbol{\phi}(x_n)\right)$$

• We now have the product of two Gaussians, and the posterior is easily computed by completing the square in $\mathbf{w}$:

$$p(\mathbf{w} \mid \mathbf{x}, \mathbf{t}, \alpha, \beta) \propto \exp\!\left(-\frac{\alpha}{2}\,\mathbf{w}^T\mathbf{w} - \frac{\beta}{2}\,\mathbf{w}^T\!\left[\sum_{n=1}^N\boldsymbol{\phi}(x_n)\boldsymbol{\phi}(x_n)^T\right]\!\mathbf{w} + \beta\,\mathbf{w}^T\sum_{n=1}^N t_n\,\boldsymbol{\phi}(x_n)\right) \propto \exp\!\left(-\frac{1}{2}(\mathbf{w}-\boldsymbol{\mu})^T\mathbf{S}_N^{-1}(\mathbf{w}-\boldsymbol{\mu})\right)$$

$$= \mathcal{N}\!\left(\mathbf{w} \,\Big|\, \beta\,\mathbf{S}_N\sum_{n=1}^N t_n\,\boldsymbol{\phi}(x_n),\; \mathbf{S}_N\right), \qquad \mathbf{S}_N^{-1} = \alpha\,\mathbf{I} + \beta\sum_{n=1}^N\boldsymbol{\phi}(x_n)\boldsymbol{\phi}(x_n)^T.$$

E.g. for polynomial regression:

$$\boldsymbol{\phi}(x_n) = \big(1,\, x_n,\, x_n^2,\, \ldots,\, x_n^{M-1}\big)^T, \qquad \boldsymbol{\phi}(x)^T = \big(1 \;\; x \;\; x^2 \;\; \ldots \;\; x^{M-1}\big), \qquad \mathbf{I} = M \times M \text{ identity matrix}.$$


Posterior Distribution

• For a prior

$$p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1}\mathbf{I}),$$

the corresponding posterior distribution over $\mathbf{w}$ is then given by

$$p(\mathbf{w} \mid \mathbf{t}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_N, \mathbf{S}_N), \qquad \mathbf{m}_N = \beta\,\mathbf{S}_N\boldsymbol{\Phi}^T\mathbf{t}, \qquad \mathbf{S}_N^{-1} = \alpha\,\mathbf{I} + \beta\,\boldsymbol{\Phi}^T\boldsymbol{\Phi}.$$
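As an illustration, the posterior mean and covariance translate directly into a few lines of code. This is a minimal sketch (illustrative Python rather than the lecture's MATLAB code; the polynomial basis and hyperparameter values are assumptions).

```python
# Minimal sketch: posterior N(w | m_N, S_N) for the zero-mean Gaussian prior.
import numpy as np

def posterior(Phi, t, alpha, beta):
    """Return (m_N, S_N) with S_N^{-1} = alpha*I + beta*Phi^T Phi, m_N = beta*S_N*Phi^T t."""
    M = Phi.shape[1]
    S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)
    m_N = beta * S_N @ Phi.T @ t
    return m_N, S_N

# Example with a 9th-order polynomial basis (assumed hyperparameters)
rng = np.random.default_rng(1)
alpha, beta, M = 5e-3, 11.1, 9
x = rng.uniform(0, 1, 10)
t = np.sin(2 * np.pi * x) + rng.normal(0, 1 / np.sqrt(beta), x.size)
Phi = np.vander(x, M, increasing=True)

m_N, S_N = posterior(Phi, t, alpha, beta)
print(m_N.shape, S_N.shape)      # (9,), (9, 9)
```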


General Gaussian as a Prior p(w)

• Assume additive Gaussian noise with known precision $\beta$. The likelihood function $p(\mathbf{t} \mid \mathbf{w})$ is the exponential of a quadratic function of $\mathbf{w}$:

$$p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \beta) = \prod_{n=1}^N \mathcal{N}\big(t_n \mid \mathbf{w}^T\boldsymbol{\phi}(x_n),\, \beta^{-1}\big) = \left(\frac{\beta}{2\pi}\right)^{N/2}\exp\!\left(-\frac{\beta}{2}\sum_{n=1}^N\big(t_n - \mathbf{w}^T\boldsymbol{\phi}(x_n)\big)^2\right)$$

and its conjugate prior is Gaussian:

$$p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_0, \mathbf{S}_0).$$

• Combining this with the likelihood and using the results for marginal and conditional Gaussian distributions gives the posterior

$$p(\mathbf{w} \mid \mathbf{t}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_N, \mathbf{S}_N)$$

where

$$\mathbf{m}_N = \mathbf{S}_N\big(\mathbf{S}_0^{-1}\mathbf{m}_0 + \beta\,\boldsymbol{\Phi}^T\mathbf{t}\big), \qquad \mathbf{S}_N^{-1} = \mathbf{S}_0^{-1} + \beta\,\boldsymbol{\Phi}^T\boldsymbol{\Phi}.$$
Posterior Distribution: Derivation

$$p(\mathbf{w} \mid \mathbf{m}_0, \mathbf{S}_0) \propto \exp\!\left(-\frac{1}{2}(\mathbf{w}-\mathbf{m}_0)^T\mathbf{S}_0^{-1}(\mathbf{w}-\mathbf{m}_0)\right), \qquad p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) \propto \exp\!\left(-\frac{\beta}{2}\sum_{n=1}^N\big(\boldsymbol{\phi}(x_n)^T\mathbf{w} - t_n\big)^2\right)$$

$$\propto \exp\!\left(-\frac{\beta}{2}\,\mathbf{w}^T\!\left[\sum_{n=1}^N\boldsymbol{\phi}(x_n)\boldsymbol{\phi}(x_n)^T\right]\!\mathbf{w} + \beta\sum_{n=1}^N t_n\,\boldsymbol{\phi}(x_n)^T\mathbf{w}\right)$$

• We again have the product of two Gaussians, and the posterior is easily computed by completing the square in $\mathbf{w}$:

$$p(\mathbf{w} \mid \mathbf{x}, \mathbf{t}, \mathbf{S}_0, \beta) \propto \exp\!\left(-\frac{1}{2}\,\mathbf{w}^T\mathbf{S}_0^{-1}\mathbf{w} + \mathbf{w}^T\mathbf{S}_0^{-1}\mathbf{m}_0 - \frac{\beta}{2}\,\mathbf{w}^T\!\left[\sum_{n=1}^N\boldsymbol{\phi}(x_n)\boldsymbol{\phi}(x_n)^T\right]\!\mathbf{w} + \beta\,\mathbf{w}^T\sum_{n=1}^N t_n\,\boldsymbol{\phi}(x_n)\right)$$

$$= \mathcal{N}\Big(\mathbf{w} \,\Big|\, \underbrace{\mathbf{S}_N\big(\mathbf{S}_0^{-1}\mathbf{m}_0 + \beta\,\boldsymbol{\Phi}^T\mathbf{t}\big)}_{\mathbf{m}_N},\; \mathbf{S}_N\Big), \qquad \mathbf{S}_N^{-1} = \mathbf{S}_0^{-1} + \beta\sum_{n=1}^N\boldsymbol{\phi}(x_n)\boldsymbol{\phi}(x_n)^T = \mathbf{S}_0^{-1} + \beta\,\boldsymbol{\Phi}^T\boldsymbol{\Phi}.$$
Sequential Posterior Calculation

• Note that because the posterior distribution is Gaussian, its mode coincides with its mean:

$$p(\mathbf{w} \mid \mathbf{t}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_N, \mathbf{S}_N), \qquad \mathbf{m}_N = \mathbf{S}_N\big(\mathbf{S}_0^{-1}\mathbf{m}_0 + \beta\,\boldsymbol{\Phi}^T\mathbf{t}\big), \qquad \mathbf{S}_N^{-1} = \mathbf{S}_0^{-1} + \beta\,\boldsymbol{\Phi}^T\boldsymbol{\Phi}, \qquad \mathbf{w}_{\mathrm{MAP}} = \mathbf{m}_N.$$

• The above expressions for the posterior mean and covariance can also be written for a sequential calculation: having already observed $N$ data points, we now consider an additional data point $(\mathbf{x}_{N+1}, t_{N+1})$. In this case we have

$$p(\mathbf{w} \mid t_{N+1}, \mathbf{x}_{N+1}, \mathbf{m}_N, \mathbf{S}_N) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_{N+1}, \mathbf{S}_{N+1}),$$
$$\mathbf{m}_{N+1} = \mathbf{S}_{N+1}\big(\mathbf{S}_N^{-1}\mathbf{m}_N + \beta\,\boldsymbol{\phi}_{N+1}\,t_{N+1}\big), \qquad \mathbf{S}_{N+1}^{-1} = \mathbf{S}_N^{-1} + \beta\,\boldsymbol{\phi}_{N+1}\boldsymbol{\phi}_{N+1}^T,$$

where $\boldsymbol{\phi}_{N+1} \equiv \boldsymbol{\phi}(\mathbf{x}_{N+1})$.


Bayesian Regression: Example

• We generate synthetic data from the function $f(x, \mathbf{a}) = a_0 + a_1 x$ with parameter values $a_0 = -0.3$ and $a_1 = 0.5$, by first choosing values of $x_n$ from the uniform distribution $\mathcal{U}(x \mid -1, 1)$, then evaluating $f(x_n, \mathbf{a})$, and finally adding Gaussian noise with standard deviation 0.2 to obtain the target values $t_n$.
• We assume $\beta = (1/0.2)^2 = 25$ and $\alpha = 2.0$.
• We perform Bayesian inference sequentially, one point at a time, so the posterior at each level becomes the new prior (a sketch of this update is given below).
• We show results after 1, 2 and 22 points have been collected.
• The results include the likelihood contours (for one point), the posterior, and samples of the regression function drawn from the posterior.
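The following is a minimal sketch of the sequential update for this straight-line example (illustrative Python rather than the MATLAB demo referenced on these slides; the random seed and number of points are arbitrary choices).

```python
# Minimal sketch: sequential Bayesian update for y = w0 + w1*x, where the
# posterior after each point becomes the prior for the next.
import numpy as np

rng = np.random.default_rng(0)
a0, a1 = -0.3, 0.5            # true parameters of the generating function
alpha, beta = 2.0, 25.0       # prior precision, noise precision

S_inv = alpha * np.eye(2)     # prior precision matrix S_0^{-1} = alpha * I
m = np.zeros(2)               # prior mean m_0 = 0

for n in range(22):
    x_n = rng.uniform(-1.0, 1.0)
    t_n = a0 + a1 * x_n + rng.normal(0.0, 0.2)
    phi = np.array([1.0, x_n])                 # phi(x) = [1, x]

    # m_{N+1} = S_{N+1} (S_N^{-1} m_N + beta*phi*t),  S_{N+1}^{-1} = S_N^{-1} + beta*phi*phi^T
    rhs = S_inv @ m + beta * phi * t_n
    S_inv = S_inv + beta * np.outer(phi, phi)
    m = np.linalg.solve(S_inv, rhs)

print("posterior mean after 22 points:", m)    # approaches (-0.3, 0.5)
```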


Bayesian Regression: Example

[Figure: left panel shows the prior over (w0, w1) before any data are observed; right panel shows curves y(x, w) for samples of w drawn from the prior. (MatLab code)]


Example: One Data Point Collected

[Figure: likelihood contours, contours of the posterior, and curves y(x, w) for samples of w drawn from the posterior, after one data point. (MatLab code)]

Note that the regression lines pass close to the data point (shown with a circle).


Example: 2nd Data Point Collected

[Figure: likelihood contours for the latest point, contours of the posterior, and curves y(x, w) for samples of w drawn from the posterior, after two data points. (MatLab code)]

Note that the regression lines now pass close to both data points.
Example: 22 Data Points Collected

[Figure: likelihood contours for the latest point, contours of the posterior, and curves y(x, w) for samples of w drawn from the posterior, after 22 data points. (MatLab code)]

Note how tightly the regression lines are constrained after 22 data points have been collected.
Summary of Results

[Figure: summary of the sequential learning for the straight-line example. The top row shows the prior/posterior (no data yet) and samples of the regression function in data space; each subsequent row shows the likelihood of the most recent point, the updated prior/posterior, and regression-function samples in data space. (MatLab code)]


Summary of Results

[Figure: the same likelihood / prior-posterior / data-space summary generated by running bayesLinRegDemo2d from PMTK3, with parameter-space axes (w0, w1) and data-space axes (x, y); the last row corresponds to the state after 20 data points.]


Predictive Distribution

• In a full Bayesian treatment, we want to compute the predictive distribution, i.e. given the training data $\mathbf{x}$ and $\mathbf{t}$ and a new test point $x$, we want the distribution

$$p(t \mid x, \mathbf{x}, \mathbf{t}) = \int p(t \mid x, \mathbf{w})\, p(\mathbf{w} \mid \mathbf{x}, \mathbf{t})\, d\mathbf{w}, \quad \text{where}$$
$$p(t \mid x, \mathbf{w}, \beta) = \mathcal{N}\big(t \mid y(x, \mathbf{w}),\, \beta^{-1}\big) = \mathcal{N}\big(t \mid \boldsymbol{\phi}(x)^T\mathbf{w},\, \beta^{-1}\big) \quad \text{and}$$
$$p(\mathbf{w} \mid \mathbf{x}, \mathbf{t}, \alpha, \beta) = \mathcal{N}\!\left(\mathbf{w} \,\Big|\, \beta\,\mathbf{S}_N\sum_{n=1}^N t_n\,\boldsymbol{\phi}(x_n),\; \mathbf{S}_N\right), \qquad \mathbf{S}_N^{-1} = \alpha\,\mathbf{I} + \beta\sum_{n=1}^N\boldsymbol{\phi}(x_n)\boldsymbol{\phi}(x_n)^T.$$

• To integrate $\mathbf{w}$ out, as indicated in the predictive-distribution expression, we use a fundamental result for linear Gaussian models.


Appendix: Linear Gaussian Models

$$p(\mathbf{x}) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Lambda}^{-1}), \qquad p(\mathbf{y} \mid \mathbf{x}) = \mathcal{N}(\mathbf{y} \mid \mathbf{A}\mathbf{x} + \mathbf{b}, \mathbf{L}^{-1})$$

• For the above linear Gaussian model, the following very useful results hold:

$$p(\mathbf{y}) = \mathcal{N}\big(\mathbf{y} \mid \mathbf{A}\boldsymbol{\mu} + \mathbf{b},\; \mathbf{L}^{-1} + \mathbf{A}\boldsymbol{\Lambda}^{-1}\mathbf{A}^T\big)$$
$$p(\mathbf{x} \mid \mathbf{y}) = \mathcal{N}\big(\mathbf{x} \mid \boldsymbol{\Sigma}\big(\mathbf{A}^T\mathbf{L}(\mathbf{y} - \mathbf{b}) + \boldsymbol{\Lambda}\boldsymbol{\mu}\big),\; \boldsymbol{\Sigma}\big), \qquad \boldsymbol{\Sigma} = \big(\boldsymbol{\Lambda} + \mathbf{A}^T\mathbf{L}\mathbf{A}\big)^{-1}$$
Predictive Distribution

• We identify the general linear Gaussian model

$$p(\mathbf{x}) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Lambda}^{-1}), \qquad p(\mathbf{y} \mid \mathbf{x}) = \mathcal{N}(\mathbf{y} \mid \mathbf{A}\mathbf{x} + \mathbf{b}, \mathbf{L}^{-1})$$

with

$$p(\mathbf{w} \mid \mathbf{x}, \mathbf{t}, \alpha, \beta) = \mathcal{N}\!\left(\mathbf{w} \,\Big|\, \beta\,\mathbf{S}_N\sum_{n=1}^N t_n\,\boldsymbol{\phi}(x_n),\; \mathbf{S}_N\right), \qquad p(t \mid x, \mathbf{w}, \beta) = \mathcal{N}\big(t \mid \boldsymbol{\phi}(x)^T\mathbf{w},\, \beta^{-1}\big).$$

• Thus, for our problem:

$$\mathbf{x} \to \mathbf{w}, \quad \boldsymbol{\mu} = \beta\,\mathbf{S}_N\sum_{n=1}^N t_n\,\boldsymbol{\phi}(x_n), \quad \boldsymbol{\Lambda}^{-1} = \mathbf{S}_N, \quad \mathbf{y} \to t, \quad \mathbf{A} = \boldsymbol{\phi}(x)^T, \quad \mathbf{b} = 0, \quad \mathbf{L}^{-1} = \beta^{-1}.$$

• The predictive distribution now takes the form:

$$p(\mathbf{y}) = \mathcal{N}\big(\mathbf{y} \mid \mathbf{A}\boldsymbol{\mu} + \mathbf{b},\; \mathbf{L}^{-1} + \mathbf{A}\boldsymbol{\Lambda}^{-1}\mathbf{A}^T\big) \;\Longrightarrow\; p(t \mid x, \mathbf{x}, \mathbf{t}) = \mathcal{N}\!\left(t \,\Big|\, \beta\,\boldsymbol{\phi}(x)^T\mathbf{S}_N\sum_{n=1}^N t_n\,\boldsymbol{\phi}(x_n),\; \beta^{-1} + \boldsymbol{\phi}(x)^T\mathbf{S}_N\boldsymbol{\phi}(x)\right).$$
Predictive Distribution

• In a full Bayesian treatment, we want to compute the predictive distribution, i.e. given the training data $\mathbf{x}$ and $\mathbf{t}$ and a new test point $x$, we want the distribution

$$p(t \mid x, \mathbf{x}, \mathbf{t}) = \int p(t \mid x, \mathbf{w})\, p(\mathbf{w} \mid \mathbf{x}, \mathbf{t})\, d\mathbf{w}, \qquad p(t \mid x, \mathbf{w}, \beta) = \mathcal{N}\big(t \mid y(x,\mathbf{w}),\, \beta^{-1}\big)$$
$$\Longrightarrow\; p(t \mid x, \mathbf{x}, \mathbf{t}) = \mathcal{N}\big(t \mid m(x),\, s^2(x)\big)$$

where the mean and variance were shown on the earlier slide to be

$$m(x) = \beta\,\boldsymbol{\phi}(x)^T\mathbf{S}_N\sum_{n=1}^N\boldsymbol{\phi}(x_n)\,t_n, \qquad s^2(x) = \beta^{-1} + \boldsymbol{\phi}(x)^T\mathbf{S}_N\boldsymbol{\phi}(x), \qquad \mathbf{S}_N^{-1} = \alpha\,\mathbf{I} + \beta\sum_{n=1}^N\boldsymbol{\phi}(x_n)\boldsymbol{\phi}(x_n)^T.$$
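The predictive mean and variance above translate directly into code. The following sketch (illustrative Python; the polynomial basis and hyperparameters mirror the nearby examples but are otherwise assumptions) evaluates $m(x)$ and $s^2(x)$ at a few test inputs.

```python
# Minimal sketch: predictive mean m(x) and variance s^2(x) = 1/beta + phi(x)^T S_N phi(x).
import numpy as np

def posterior(Phi, t, alpha, beta):
    S_N = np.linalg.inv(alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi)
    m_N = beta * S_N @ Phi.T @ t
    return m_N, S_N

def predict(x_star, m_N, S_N, beta, M):
    """Predictive mean and variance at test inputs x_star (polynomial basis)."""
    Phi_star = np.vander(x_star, M, increasing=True)
    mean = Phi_star @ m_N
    var = 1.0 / beta + np.sum((Phi_star @ S_N) * Phi_star, axis=1)   # diag of Phi* S_N Phi*^T
    return mean, var

rng = np.random.default_rng(2)
M, alpha, beta = 9, 5e-3, 11.1
x = rng.uniform(0, 1, 10)
t = np.sin(2 * np.pi * x) + rng.normal(0, 1 / np.sqrt(beta), x.size)
m_N, S_N = posterior(np.vander(x, M, increasing=True), t, alpha, beta)

x_star = np.linspace(0, 1, 5)
mean, var = predict(x_star, m_N, S_N, beta, M)
print(np.c_[x_star, mean, np.sqrt(var)])    # test input, predictive mean, predictive std
```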


Predictive Distribution

• The notation used here is as follows:

$$p(t \mid x, \mathbf{x}, \mathbf{t}) = \mathcal{N}\big(t \mid m(x),\, s^2(x)\big)$$
$$m(x) = \beta\,\boldsymbol{\phi}(x)^T\mathbf{S}_N\sum_{n=1}^N\boldsymbol{\phi}(x_n)\,t_n$$
$$s^2(x) = \beta^{-1} + \boldsymbol{\phi}(x)^T\mathbf{S}_N\boldsymbol{\phi}(x)$$
$$\mathbf{S}_N^{-1} = \alpha\,\mathbf{I} + \beta\sum_{n=1}^N\boldsymbol{\phi}(x_n)\boldsymbol{\phi}(x_n)^T$$

The predictive mean and variance are functions of $x$.

E.g. for polynomial regression:

$$\boldsymbol{\phi}(x_n) = \big(1,\, x_n,\, x_n^2,\, \ldots,\, x_n^{M-1}\big)^T, \qquad \boldsymbol{\phi}(x)^T = \big(1 \;\; x \;\; x^2 \;\; \ldots \;\; x^{M-1}\big), \qquad \mathbf{I} = M \times M \text{ identity matrix}.$$
Predictive Distribution

• The predictive distribution using an $M = 9$ polynomial, with fixed parameters $\alpha = 5 \times 10^{-3}$ and $\beta = 11.1$ (corresponding to the known noise variance).

• The red curve denotes the mean of the predictive distribution, and the red region corresponds to ±1 standard deviation around the mean.

[Figure: polynomial Bayesian regression, M = 9. The plot shows the generating function sin(2πx), the random data points used for fitting, the predictive mean, and the spanned standard-deviation band. (MatLab code)]


Predictive Distribution

• In a full Bayesian treatment, we want to compute the predictive distribution, i.e. given the training data $\mathbf{x}$ and $\mathbf{t}$ and a new test point $x$, we want the distribution

$$p(t \mid x, \mathbf{x}, \mathbf{t}) = \int p(t \mid x, \mathbf{w})\, p(\mathbf{w} \mid \mathbf{x}, \mathbf{t})\, d\mathbf{w}, \qquad p(t \mid x, \mathbf{w}, \beta) = \mathcal{N}\big(t \mid y(x,\mathbf{w}),\, \beta^{-1}\big)$$
$$\Longrightarrow\; p(t \mid x, \mathbf{x}, \mathbf{t}) = \mathcal{N}\big(t \mid m(x),\, \sigma_N^2(x)\big)$$

where the mean and variance are given by

$$m(x) = \beta\,\boldsymbol{\phi}(x)^T\mathbf{S}_N\sum_{n=1}^N\boldsymbol{\phi}(x_n)\,t_n = \boldsymbol{\phi}(x)^T\mathbf{m}_N, \qquad \mathbf{m}_N = \beta\,\mathbf{S}_N\boldsymbol{\Phi}^T\mathbf{t}$$
$$\sigma_N^2(x) = \underbrace{\beta^{-1}}_{\text{uncertainty in the data}} + \underbrace{\boldsymbol{\phi}(x)^T\mathbf{S}_N\boldsymbol{\phi}(x)}_{\text{uncertainty in } \mathbf{w}}$$
$$\mathbf{S}_N^{-1} = \alpha\,\mathbf{I} + \beta\sum_{n=1}^N\boldsymbol{\phi}(x_n)\boldsymbol{\phi}(x_n)^T = \alpha\,\mathbf{I} + \beta\,\boldsymbol{\Phi}^T\boldsymbol{\Phi}$$

• Note that:

$$\sigma_N^2(x) = \beta^{-1} + \boldsymbol{\phi}(x)^T\mathbf{S}_N\boldsymbol{\phi}(x) \xrightarrow{\;N \to \infty\;} \beta^{-1} \qquad \text{and} \qquad \sigma_{N+1}^2(x) \le \sigma_N^2(x).$$
Predictive Distribution

• It is easy to show that

$$\sigma_N^2(x) = \beta^{-1} + \boldsymbol{\phi}(x)^T\mathbf{S}_N\boldsymbol{\phi}(x) \xrightarrow{\;N \to \infty\;} \beta^{-1} \qquad \text{and} \qquad \sigma_{N+1}^2(x) \le \sigma_N^2(x).$$

• Note that

$$\mathbf{S}_{N+1}^{-1} = \alpha\,\mathbf{I} + \beta\sum_{n=1}^{N+1}\boldsymbol{\phi}(x_n)\boldsymbol{\phi}(x_n)^T = \mathbf{S}_N^{-1} + \beta\,\boldsymbol{\phi}(x_{N+1})\boldsymbol{\phi}(x_{N+1})^T$$

and recall the matrix identity

$$\big(\mathbf{M} + \mathbf{v}\mathbf{v}^T\big)^{-1} = \mathbf{M}^{-1} - \frac{\mathbf{M}^{-1}\mathbf{v}\,\mathbf{v}^T\mathbf{M}^{-1}}{1 + \mathbf{v}^T\mathbf{M}^{-1}\mathbf{v}}.$$

• Using these results, we can write:

$$\sigma_{N+1}^2(x) = \beta^{-1} + \boldsymbol{\phi}(x)^T\mathbf{S}_{N+1}\boldsymbol{\phi}(x) = \beta^{-1} + \boldsymbol{\phi}(x)^T\!\left[\mathbf{S}_N - \frac{\beta\,\mathbf{S}_N\boldsymbol{\phi}(x_{N+1})\boldsymbol{\phi}(x_{N+1})^T\mathbf{S}_N}{1 + \beta\,\boldsymbol{\phi}(x_{N+1})^T\mathbf{S}_N\boldsymbol{\phi}(x_{N+1})}\right]\!\boldsymbol{\phi}(x) = \sigma_N^2(x) - \frac{\beta\,\big(\boldsymbol{\phi}(x)^T\mathbf{S}_N\boldsymbol{\phi}(x_{N+1})\big)^2}{1 + \beta\,\boldsymbol{\phi}(x_{N+1})^T\mathbf{S}_N\boldsymbol{\phi}(x_{N+1})} \;\le\; \sigma_N^2(x).$$


Predictive Distribution: Summary

• The notation used here is as follows:

$$p(t \mid x, \mathbf{x}, \mathbf{t}) = \mathcal{N}\big(t \mid m(x),\, \sigma_N^2(x)\big)$$
$$m(x) = \beta\,\boldsymbol{\phi}(x)^T\mathbf{S}_N\sum_{n=1}^N\boldsymbol{\phi}(x_n)\,t_n$$
$$\sigma_N^2(x) = \beta^{-1} + \boldsymbol{\phi}(x)^T\mathbf{S}_N\boldsymbol{\phi}(x)$$
$$\mathbf{S}_N^{-1} = \alpha\,\mathbf{I} + \beta\sum_{n=1}^N\boldsymbol{\phi}(x_n)\boldsymbol{\phi}(x_n)^T$$

Note: the predictive mean and variance are functions of $x$.

For regression with a general basis, $\boldsymbol{\phi}(x_n) = \big(\phi_0(x_n),\, \phi_1(x_n),\, \ldots,\, \phi_{M-1}(x_n)\big)^T$; for polynomial regression in particular, $\boldsymbol{\phi}(x_n) = \big(1,\, x_n,\, x_n^2,\, \ldots,\, x_n^{M-1}\big)^T$ and $\mathbf{I}$ is the $M \times M$ identity matrix.


Pointwise Uncertainty in the Predictions

• Model: M = 9 Gaussian basis functions (10 parameters), with the scale of the Gaussians adjusted to the data; α = 5 × 10⁻³, β = 11.1; using N = 1, 2, 4, 10 data points.

[Figure: predictive distribution, M = 9, for N = 1 and N = 2. Each panel shows the generating function sin(2πx), the random data points used for fitting, the predictive mean, and the predictive standard-deviation band. (MatLab code)]

• The predictive uncertainty is smaller near the data.
• The level of uncertainty decreases with N.


Pointwise Uncertainty in the Predictions

[Figure: predictive distribution, M = 9, for N = 4 and N = 10. Each panel shows the generating function sin(2πx), the random data points used for fitting, the predictive mean, and the predictive standard-deviation band. (MatLab code)]


Summary of Results

[Figure: the four predictive-distribution plots (M = 9) collected together. Each panel shows the generating function sin(2πx), the random data points used for fitting, the predictive mean, and the predictive standard-deviation band. (MatLab code)]


Plugin Approximation

• In the plug-in approximation, the posterior over $\mathbf{w}$ is replaced by a point mass at an estimate $\hat{\mathbf{w}}$ (here the MLE):

$$p(t \mid x, \mathbf{x}, \mathbf{t}) \approx \int p(t \mid x, \mathbf{w})\,\delta_{\hat{\mathbf{w}}}(\mathbf{w})\,d\mathbf{w} = p(t \mid x, \hat{\mathbf{w}}).$$

[Figure: run linregPostPredDemo from PMTK3. Top row: predictions with the training data, using the plug-in approximation (MLE) on the left and the posterior predictive with known variance on the right. Bottom row: functions sampled from the plug-in approximation to the posterior (left) and functions sampled from the posterior (right).]
Predictive Distribution

• We are not interested in $\mathbf{w}$ itself but in making predictions of $t$ for new values of $x$. This requires the predictive distribution

$$p(t \mid x, \mathbf{x}, \mathbf{t}) = \int p(t \mid x, \mathbf{w}, \beta)\,p(\mathbf{w} \mid \mathbf{x}, \mathbf{t}, \alpha, \beta)\,d\mathbf{w} = \mathcal{N}\big(t \mid \mathbf{m}_N^T\boldsymbol{\phi}(x),\; \sigma_N^2(x)\big)$$

where

$$\sigma_N^2(x) = \frac{1}{\beta} + \boldsymbol{\phi}(x)^T\mathbf{S}_N\boldsymbol{\phi}(x), \qquad \mathbf{S}_N^{-1} = \alpha\,\mathbf{I} + \beta\,\boldsymbol{\Phi}^T\boldsymbol{\Phi}.$$

• The first term represents the noise on the data, whereas the second term reflects the uncertainty associated with $\mathbf{w}$.
• Because the noise process and the distribution of $\mathbf{w}$ are independent Gaussians, their variances are additive.
• The error bars get larger as we move away from the training points. By contrast, in the plug-in approximation the error bars are of constant size.
• As additional data points are observed, the posterior distribution becomes narrower.
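The contrast with the plug-in approximation can be seen numerically. In the sketch below (illustrative Python; the basis, data and hyperparameter values are assumptions), the plug-in predictive variance stays at $1/\beta$, while the Bayesian predictive variance grows as the test input moves away from the training data.

```python
# Minimal sketch: plug-in variance (constant 1/beta) vs. Bayesian predictive
# variance 1/beta + phi(x)^T S_N phi(x), which grows away from the data.
import numpy as np

rng = np.random.default_rng(3)
M, alpha, beta = 4, 2.0, 25.0
x = rng.uniform(-1, 1, 10)                       # training inputs in [-1, 1]
t = -0.3 + 0.5 * x + rng.normal(0, 0.2, x.size)

Phi = np.vander(x, M, increasing=True)           # cubic polynomial basis
S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)

for x_star in [0.0, 1.0, 2.0, 3.0]:              # moving away from the data
    phi = np.vander([x_star], M, increasing=True)[0]
    bayes_var = 1.0 / beta + phi @ S_N @ phi
    print(f"x* = {x_star:3.1f}:  plug-in var = {1.0/beta:.4f},  Bayesian var = {bayes_var:.4f}")
```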


Covariance Between the Predictions

• Draw samples from the posterior of w and then plot y(x, w). We use the same data and basis functions as in the earlier example.
• We are visualizing the joint uncertainty in the posterior distribution between the y values at two or more x values.

[Figure: plots of y(x, w), where w is a sample from the posterior over w, for N = 1 and N = 2 data points. (MatLab code)]
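A minimal sketch of this procedure (illustrative Python rather than the MATLAB code linked on the slide; the 9th-order polynomial basis and hyperparameters are assumptions): draw weight samples from $\mathcal{N}(\mathbf{w} \mid \mathbf{m}_N, \mathbf{S}_N)$ and evaluate the corresponding curves $y(x, \mathbf{w})$.

```python
# Minimal sketch: sample w ~ N(m_N, S_N) and evaluate y(x, w) = phi(x)^T w.
import numpy as np

rng = np.random.default_rng(4)
M, alpha, beta = 9, 5e-3, 11.1
x = rng.uniform(0, 1, 4)                                   # N = 4 training points
t = np.sin(2 * np.pi * x) + rng.normal(0, 1 / np.sqrt(beta), x.size)

Phi = np.vander(x, M, increasing=True)
S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t

# Draw 5 weight vectors from the posterior and evaluate the curves on a grid
w_samples = rng.multivariate_normal(m_N, S_N, size=5)      # shape (5, M)
x_grid = np.linspace(0, 1, 101)
Y = np.vander(x_grid, M, increasing=True) @ w_samples.T    # shape (101, 5): one curve per sample
```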


Covariance Between the Predictions

• Draw samples from the posterior of w and then plot y(x, w).
• We are visualizing the joint uncertainty in the posterior distribution between the y values at two or more x values.

[Figure: plots of y(x, w), where w is a sample from the posterior over w, for N = 4 and N = 12 data points. (MatLab code)]


Summary of Results

[Figure: the four plots of regression-function samples y(x, w), with w drawn from the posterior, collected together for increasing numbers of data points. (MatLab code)]


Gaussian Basis vs. Gaussian Process

• If we use localized basis functions such as Gaussians, then in regions away from the basis function support the contribution from the second term in the predictive variance will go to zero, leaving only the noise contribution $\beta^{-1}$:

$$\sigma_N^2(x) = \beta^{-1} + \boldsymbol{\phi}(x)^T\mathbf{S}_N\boldsymbol{\phi}(x) \;\longrightarrow\; \beta^{-1} \quad \text{away from the support of } \boldsymbol{\phi}(x).$$

• The model becomes very confident in its predictions when extrapolating outside the region occupied by the basis functions. This is undesirable behavior.

• This problem can be avoided by adopting an alternative Bayesian approach to regression (Gaussian processes). A small numerical illustration is sketched below.
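The sketch below (illustrative Python; the Gaussian basis centers, width and hyperparameter values are assumptions) shows the predictive variance collapsing to $1/\beta$ far outside the basis support.

```python
# Minimal sketch: with localized Gaussian basis functions centered in [0, 1],
# the predictive variance far outside the support reduces to the noise term 1/beta.
import numpy as np

def gauss_basis(x, centers, s=0.05):
    """Design matrix with Gaussian basis functions exp(-(x - mu_j)^2 / (2 s^2))."""
    x = np.atleast_1d(x)
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * s ** 2))

rng = np.random.default_rng(5)
alpha, beta = 5e-3, 11.1
centers = np.linspace(0, 1, 9)

x = rng.uniform(0, 1, 25)
t = np.sin(2 * np.pi * x) + rng.normal(0, 1 / np.sqrt(beta), x.size)

Phi = gauss_basis(x, centers)
S_N = np.linalg.inv(alpha * np.eye(centers.size) + beta * Phi.T @ Phi)

for x_star in [0.5, 2.0, 5.0]:        # inside vs. far outside the basis support
    phi = gauss_basis(x_star, centers)[0]
    var = 1.0 / beta + phi @ S_N @ phi
    print(f"x* = {x_star:3.1f}: predictive variance = {var:.4f}  (1/beta = {1/beta:.4f})")
```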


Equivalent Kernel

• The predictive mean in our regression model can be written as:

$$y(x, \mathbf{m}_N) = \boldsymbol{\phi}(x)^T\mathbf{m}_N = \beta\,\boldsymbol{\phi}(x)^T\mathbf{S}_N\boldsymbol{\Phi}^T\mathbf{t} = \sum_{n=1}^N \underbrace{\beta\,\boldsymbol{\phi}(x)^T\mathbf{S}_N\boldsymbol{\phi}(x_n)}_{\text{equivalent kernel } k(x,\,x_n)}\, t_n$$

so that

$$y(x, \mathbf{m}_N) = \sum_{n=1}^N k(x, x_n)\,t_n, \qquad k(x, x_n) = \beta\,\boldsymbol{\phi}(x)^T\mathbf{S}_N\boldsymbol{\phi}(x_n).$$

[Figure: contour plot of the equivalent kernel, shown as a plot of x vs. x′, for 20 samples x_n equally spaced in [0, 1], Gaussian basis functions with s = 0.05, α = 5 × 10⁻³ and β = 11.1, where

$$\phi_j(x) = \exp\!\left(-\frac{(x - x_j)^2}{2 s^2}\right).$$

(MatLab code)]
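The equivalent kernel is straightforward to compute once $\mathbf{S}_N$ is known. A minimal sketch (illustrative Python; here the 20 training inputs are also used as the Gaussian basis centers, which is an assumption made for brevity):

```python
# Minimal sketch: equivalent kernel k(x, x_n) = beta * phi(x)^T S_N phi(x_n)
# for a Gaussian basis with s = 0.05 and 20 equally spaced inputs in [0, 1].
import numpy as np

def gauss_basis(x, centers, s=0.05):
    x = np.atleast_1d(x)
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * s ** 2))

alpha, beta = 5e-3, 11.1
x_n = np.linspace(0, 1, 20)          # training inputs (also basis centers in this sketch)
centers = x_n

Phi = gauss_basis(x_n, centers)
S_N = np.linalg.inv(alpha * np.eye(centers.size) + beta * Phi.T @ Phi)

def equiv_kernel(x, x_prime):
    """k(x, x') = beta * phi(x)^T S_N phi(x')."""
    return beta * gauss_basis(x, centers) @ S_N @ gauss_basis(x_prime, centers).T

x_grid = np.linspace(0, 1, 100)
K = equiv_kernel(x_grid, x_n)        # row i holds the slice k(x_grid[i], x_n)
print(K.shape)                       # (100, 20)
```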


Equivalent Kernel

• The predictive mean can be written as follows:

$$y(x, \mathbf{m}_N) = \boldsymbol{\phi}(x)^T\mathbf{m}_N = \beta\,\boldsymbol{\phi}(x)^T\mathbf{S}_N\boldsymbol{\Phi}^T\mathbf{t} = \sum_{n=1}^N k(x, x_n)\,t_n, \qquad k(x, x_n) = \beta\,\boldsymbol{\phi}(x)^T\mathbf{S}_N\boldsymbol{\phi}(x_n).$$

[Figure: two slices k(x, x′) through the equivalent kernel, plotted as functions of x′ for x = 0.3 and x = 0.5, using 20 Gaussian basis functions (s = 0.05) and 100 points x′ uniformly spaced in (0, 1). (MatLab code)]


Equivalent Kernel

$$y(x, \mathbf{m}_N) = \sum_{n=1}^N k(x, x_n)\,t_n, \qquad k(x, x_n) = \beta\,\boldsymbol{\phi}(x)^T\mathbf{S}_N\boldsymbol{\phi}(x_n)$$

• Note that we make predictions by taking linear combinations of the training-set target values (linear smoothers).
• The equivalent kernel depends on the input values $x_n$, since these appear in $\mathbf{S}_N$.
• We see from the figure that the kernel is localized around $x$, so the mean of the predictive distribution, $y(x, \mathbf{m}_N)$, is obtained by forming a weighted combination of the $t_n$ in which data points close to $x$ are given higher weight than points further removed from $x$.

[Figure: contour plot of the equivalent kernel, as on the earlier slide.]


Equivalent Kernel

[Figure: examples of equivalent kernels k(x, x′) for x = 0, plotted as functions of x′, for a polynomial basis (left) and a sigmoidal basis (right), with

$$\phi_j(x) = x^j \qquad \text{and} \qquad \phi_j(x) = \sigma\!\left(\frac{x - x_j}{s}\right), \quad \sigma(a) = \frac{1}{1 + e^{-a}}.$$

(MatLab code)]

• Ten basis functions were used in each case. The sigmoidal basis is centered at 10 equally spaced points x_j in [−1, 1].
• These are localized functions of x′ even though the corresponding basis functions are nonlocal.
Equivalent Kernel

• Consider the covariance between $y(x)$ and $y(x')$, which is given by (recall that $p(\mathbf{w} \mid \mathbf{x}, \mathbf{t}, \alpha, \beta) = \mathcal{N}\big(\mathbf{w} \mid \beta\,\mathbf{S}_N\sum_{n=1}^N t_n\boldsymbol{\phi}(x_n),\, \mathbf{S}_N\big)$):

$$\mathrm{cov}\big[y(x),\, y(x')\big] = \mathrm{cov}\big[\boldsymbol{\phi}(x)^T\mathbf{w},\; \mathbf{w}^T\boldsymbol{\phi}(x')\big] = \boldsymbol{\phi}(x)^T\,\mathrm{cov}[\mathbf{w}]\,\boldsymbol{\phi}(x') = \boldsymbol{\phi}(x)^T\mathbf{S}_N\boldsymbol{\phi}(x') = \beta^{-1}\,k(x, x').$$

• From the earlier discussion of the equivalent kernel, we see that the predictive mean at nearby points will be highly correlated, whereas for more distant points the correlation will be smaller.
• By drawing samples from the posterior distribution over $\mathbf{w}$ and plotting the corresponding model functions $y(x, \mathbf{w})$, we are visualizing the joint uncertainty in the posterior distribution between the $y$ values at two (or more) $x$ values, as governed by the equivalent kernel.
• This is in contrast to the earlier figures, where we visualized the pointwise uncertainty in the predictions.


Equivalent Kernel

• We can avoid the use of basis functions and define the kernel function directly, leading e.g. to Gaussian processes (more on kernel methods later in this course).

• Note that

$$\sum_{n=1}^N k(x, x_n) = 1$$

for all values of $x$. However, the equivalent kernel may be negative for some values of $x$.

• Like all kernel functions, the equivalent kernel can be expressed as an inner product:

$$k(x, z) = \beta\,\boldsymbol{\phi}(x)^T\mathbf{S}_N\boldsymbol{\phi}(z) = \boldsymbol{\psi}(x)^T\boldsymbol{\psi}(z), \qquad \boldsymbol{\psi}(x) = \beta^{1/2}\,\mathbf{S}_N^{1/2}\boldsymbol{\phi}(x).$$
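Both properties are easy to verify numerically. The sketch below (illustrative Python; the Gaussian-plus-bias basis and the hyperparameter values are assumptions) checks that the kernel weights sum to approximately one and that $k(x, z) = \boldsymbol{\psi}(x)^T\boldsymbol{\psi}(z)$.

```python
# Minimal sketch: check the sum-to-one behavior of the equivalent kernel and
# its inner-product form psi(x) = sqrt(beta) * S_N^{1/2} phi(x).
import numpy as np

def basis(x, centers, s=0.05):
    x = np.atleast_1d(x)
    g = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * s ** 2))
    return np.hstack([np.ones((x.size, 1)), g])          # bias column + Gaussians

alpha, beta = 5e-3, 11.1
x_n = np.linspace(0, 1, 20)
Phi = basis(x_n, x_n)
S_N = np.linalg.inv(alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi)

x_grid = np.linspace(0, 1, 7)
K = beta * basis(x_grid, x_n) @ S_N @ Phi.T              # K[i, n] = k(x_i, x_n)
print(K.sum(axis=1))                                     # each row sums to approximately 1

# Inner-product form via a symmetric square root of S_N
evals, evecs = np.linalg.eigh(S_N)
S_half = evecs @ np.diag(np.sqrt(evals)) @ evecs.T
Psi = np.sqrt(beta) * basis(x_grid, x_n) @ S_half
K_xx = beta * basis(x_grid, x_n) @ S_N @ basis(x_grid, x_n).T
print(np.allclose(Psi @ Psi.T, K_xx))                    # True
```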
