Lec22: Introduction to Bayesian Regression
Bayesian Regression
Prof. Nicholas Zabaras
Email: [email protected]
URL: https://www.zabaras.com/
The linear regression model with additive Gaussian noise is
$$ y(\boldsymbol{x},\boldsymbol{w}) = \boldsymbol{\phi}(\boldsymbol{x})^T\boldsymbol{w}, \qquad t = y(\boldsymbol{x},\boldsymbol{w}) + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0,\beta^{-1}), $$
and the posterior over the weights is
$$ p(\boldsymbol{w}\mid\boldsymbol{x},\boldsymbol{t},\alpha,\beta) \propto p(\boldsymbol{t}\mid\boldsymbol{x},\boldsymbol{w},\beta)\, p(\boldsymbol{w}\mid\alpha). $$
* Note that one should not penalize the bias term $w_0$, as it does not contribute to overfitting. We discussed in an earlier lecture how to address this by treating the bias term separately and working with "centered" data: centered responses $\boldsymbol{t}_c = \boldsymbol{t} - \bar t\,\boldsymbol{1}_N$, $\bar t \equiv \frac{1}{N}\sum_{i=1}^{N} t_i$, and centered basis functions such that $y(\boldsymbol{x}_n,\boldsymbol{w}) = \boldsymbol{\phi}_c(\boldsymbol{x}_n)^T\boldsymbol{w}$, with $\boldsymbol{\Phi}_c = \big[\boldsymbol{\varphi}_{1c}\;\boldsymbol{\varphi}_{2c}\;\ldots\;\boldsymbol{\varphi}_{Mc}\big]$ and $\boldsymbol{\varphi}_{ic} = \boldsymbol{\varphi}_i - \frac{1}{N}\sum_{n=1}^{N}\varphi_i(\boldsymbol{x}_n)$ (so that each centered column sums to zero), $i = 1,\ldots,M-1$. Here $\boldsymbol{\phi}^T$ are the rows and $\boldsymbol{\varphi}$ the columns of the $N\times M$ design matrix $\boldsymbol{\Phi}$.
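To make the model concrete, here is a minimal Python/NumPy sketch (my own illustration; the course materials use MATLAB demos) that draws synthetic data from $t = \boldsymbol{\phi}(x)^T\boldsymbol{w} + \varepsilon$ with a polynomial feature map. The values of `beta` and `w_true` are arbitrary choices for the example.

```python
import numpy as np

def poly_features(x, M):
    """Design matrix with rows phi(x_n)^T = [1, x_n, x_n^2, ..., x_n^{M-1}]."""
    return np.vander(x, M, increasing=True)

rng = np.random.default_rng(0)
beta = 25.0                               # assumed noise precision (noise variance 1/beta)
w_true = np.array([0.3, -1.2, 0.8])       # hypothetical "true" weights (M = 3)

x = rng.uniform(-1.0, 1.0, size=20)       # inputs
Phi = poly_features(x, M=3)               # N x M design matrix
eps = rng.normal(0.0, beta ** -0.5, size=x.size)
t = Phi @ w_true + eps                    # t = y(x, w) + eps
```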
MAP Estimate
We can determine 𝒘 (point estimate) by finding the most probable value of 𝒘
given the data, i.e. maximizing the posterior.
With the prior $p(\boldsymbol{w}\mid\alpha) = \mathcal{N}(\boldsymbol{w}\mid\boldsymbol{0},\alpha^{-1}\boldsymbol{I})$, the posterior is proportional to
$$ \exp\!\Big(-\tfrac{\alpha}{2}\,\boldsymbol{w}^T\boldsymbol{I}_{M\times M}\,\boldsymbol{w} \;-\; \tfrac{\beta}{2}\,\boldsymbol{w}^T\Big(\sum_{n=1}^{N}\boldsymbol{\phi}(\boldsymbol{x}_n)\boldsymbol{\phi}(\boldsymbol{x}_n)^T\Big)\boldsymbol{w} \;+\; \beta\,\boldsymbol{w}^T\sum_{n=1}^{N}t_n\,\boldsymbol{\phi}(\boldsymbol{x}_n)\Big) $$
$$ \propto\; \mathcal{N}\Big(\boldsymbol{w}\;\Big|\;\beta\,\boldsymbol{S}_N\sum_{n=1}^{N}t_n\,\boldsymbol{\phi}(\boldsymbol{x}_n),\;\boldsymbol{S}_N\Big), \qquad \boldsymbol{S}_N^{-1} = \alpha\boldsymbol{I} + \beta\sum_{n=1}^{N}\boldsymbol{\phi}(\boldsymbol{x}_n)\boldsymbol{\phi}(\boldsymbol{x}_n)^T. $$
E.g. for polynomial regression: $\boldsymbol{\phi}(x_n) = \big(1,\; x_n,\; x_n^2,\;\ldots,\; x_n^{M-1}\big)^T$, $\boldsymbol{\phi}(x)^T = \big(1\;\; x\;\; x^2\;\cdots\; x^{M-1}\big)$, and $\boldsymbol{I}$ is the $M\times M$ unit matrix.

In matrix notation,
$$ \boldsymbol{m}_N = \beta\,\boldsymbol{S}_N\boldsymbol{\Phi}^T\boldsymbol{t}, \qquad \boldsymbol{S}_N^{-1} = \alpha\boldsymbol{I} + \beta\,\boldsymbol{\Phi}^T\boldsymbol{\Phi}, $$
and the MAP estimate is the posterior mean/mode, $\boldsymbol{w}_{\mathrm{MAP}} = \boldsymbol{m}_N$.
For a general Gaussian prior
$$ p(\boldsymbol{w}) = \mathcal{N}(\boldsymbol{w}\mid\boldsymbol{m}_0,\boldsymbol{S}_0), $$
combining the prior with the likelihood and using the results for marginal and conditional Gaussian distributions gives the posterior
$$ p(\boldsymbol{w}\mid\boldsymbol{t}) = \mathcal{N}(\boldsymbol{w}\mid\boldsymbol{m}_N,\boldsymbol{S}_N), $$
where
$$ \boldsymbol{m}_N = \boldsymbol{S}_N\big(\boldsymbol{S}_0^{-1}\boldsymbol{m}_0 + \beta\,\boldsymbol{\Phi}^T\boldsymbol{t}\big), \qquad \boldsymbol{S}_N^{-1} = \boldsymbol{S}_0^{-1} + \beta\,\boldsymbol{\Phi}^T\boldsymbol{\Phi}. $$
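As an illustrative sketch of these formulas (mine, not the course's MATLAB code), the function below returns $\boldsymbol{m}_N$ and $\boldsymbol{S}_N$ given a design matrix, targets, a Gaussian prior, and the noise precision; the data and hyperparameter values are placeholders.

```python
import numpy as np

def posterior(Phi, t, m0, S0, beta):
    """Posterior N(w | mN, SN) with SN^{-1} = S0^{-1} + beta Phi^T Phi and
    mN = SN (S0^{-1} m0 + beta Phi^T t)."""
    S0_inv = np.linalg.inv(S0)
    SN = np.linalg.inv(S0_inv + beta * Phi.T @ Phi)
    mN = SN @ (S0_inv @ m0 + beta * Phi.T @ t)
    return mN, SN

# Example with the isotropic prior m0 = 0, S0 = alpha^{-1} I used on the earlier slide
alpha, beta, M = 2.0, 25.0, 3
rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, 30)
Phi = np.vander(x, M, increasing=True)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, beta ** -0.5, x.size)
mN, SN = posterior(Phi, t, np.zeros(M), np.eye(M) / alpha, beta)
```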
Posterior Distribution: Derivation
$$ p(\boldsymbol{w}\mid\boldsymbol{m}_0,\boldsymbol{S}_0) \propto \exp\!\Big(-\tfrac{1}{2}\,(\boldsymbol{w}-\boldsymbol{m}_0)^T\boldsymbol{S}_0^{-1}(\boldsymbol{w}-\boldsymbol{m}_0)\Big), $$
$$ p(\boldsymbol{t}\mid\boldsymbol{x},\boldsymbol{w},\beta) \propto \exp\!\Big(-\tfrac{\beta}{2}\sum_{n=1}^{N}\big(\boldsymbol{\phi}(\boldsymbol{x}_n)^T\boldsymbol{w}-t_n\big)^2\Big) \propto \exp\!\Big(-\tfrac{\beta}{2}\,\boldsymbol{w}^T\Big(\sum_{n=1}^{N}\boldsymbol{\phi}(\boldsymbol{x}_n)\boldsymbol{\phi}(\boldsymbol{x}_n)^T\Big)\boldsymbol{w} + \beta\sum_{n=1}^{N}t_n\,\boldsymbol{\phi}(\boldsymbol{x}_n)^T\boldsymbol{w}\Big). $$
We now have the product of two Gaussians, and the posterior is easily computed by completing the square in $\boldsymbol{w}$:
$$ p(\boldsymbol{w}\mid\boldsymbol{x},\boldsymbol{t},\boldsymbol{S}_0,\beta) \propto \exp\!\Big(-\tfrac{1}{2}\,\boldsymbol{w}^T\boldsymbol{S}_0^{-1}\boldsymbol{w} + \boldsymbol{w}^T\boldsymbol{S}_0^{-1}\boldsymbol{m}_0 - \tfrac{\beta}{2}\,\boldsymbol{w}^T\Big(\sum_{n=1}^{N}\boldsymbol{\phi}(\boldsymbol{x}_n)\boldsymbol{\phi}(\boldsymbol{x}_n)^T\Big)\boldsymbol{w} + \beta\,\boldsymbol{w}^T\sum_{n=1}^{N}t_n\,\boldsymbol{\phi}(\boldsymbol{x}_n)\Big) $$
$$ \propto \mathcal{N}\Big(\boldsymbol{w}\;\Big|\;\underbrace{\boldsymbol{S}_N\Big(\boldsymbol{S}_0^{-1}\boldsymbol{m}_0 + \beta\sum_{n=1}^{N}t_n\,\boldsymbol{\phi}(\boldsymbol{x}_n)\Big)}_{\boldsymbol{m}_N},\;\boldsymbol{S}_N\Big), \qquad \boldsymbol{S}_N^{-1} = \boldsymbol{S}_0^{-1} + \beta\sum_{n=1}^{N}\boldsymbol{\phi}(\boldsymbol{x}_n)\boldsymbol{\phi}(\boldsymbol{x}_n)^T = \boldsymbol{S}_0^{-1} + \beta\,\boldsymbol{\Phi}^T\boldsymbol{\Phi}. $$
Sequential Posterior Calculation
Note that because the posterior distribution is Gaussian, its posterior mode coincides
with its mean.
$$ p(\boldsymbol{w}\mid\boldsymbol{t}) = \mathcal{N}(\boldsymbol{w}\mid\boldsymbol{m}_N,\boldsymbol{S}_N), \qquad \boldsymbol{m}_N = \boldsymbol{S}_N\big(\boldsymbol{S}_0^{-1}\boldsymbol{m}_0 + \beta\,\boldsymbol{\Phi}^T\boldsymbol{t}\big), \qquad \boldsymbol{S}_N^{-1} = \boldsymbol{S}_0^{-1} + \beta\,\boldsymbol{\Phi}^T\boldsymbol{\Phi}, \qquad \boldsymbol{w}_{\mathrm{MAP}} = \boldsymbol{m}_N. $$
The above expressions for the posterior mean and covariance can also be written as a sequential calculation: having already observed $N$ data points, we now consider an additional data point $(\boldsymbol{x}_{N+1}, t_{N+1})$. Treating the current posterior $\mathcal{N}(\boldsymbol{w}\mid\boldsymbol{m}_N,\boldsymbol{S}_N)$ as the prior for the new observation, we have
$$ \boldsymbol{S}_{N+1}^{-1} = \boldsymbol{S}_N^{-1} + \beta\,\boldsymbol{\phi}(\boldsymbol{x}_{N+1})\,\boldsymbol{\phi}(\boldsymbol{x}_{N+1})^T, \qquad \boldsymbol{m}_{N+1} = \boldsymbol{S}_{N+1}\big(\boldsymbol{S}_N^{-1}\boldsymbol{m}_N + \beta\, t_{N+1}\,\boldsymbol{\phi}(\boldsymbol{x}_{N+1})\big). $$
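A sketch (mine) of this rank-one sequential update, with a sanity check that feeding the data in one point at a time reproduces the batch posterior; all names and numerical values are illustrative.

```python
import numpy as np

def update_one(mN, SN, phi_new, t_new, beta):
    """One-point posterior update: S_{N+1}^{-1} = S_N^{-1} + beta phi phi^T,
    m_{N+1} = S_{N+1} (S_N^{-1} m_N + beta t_{N+1} phi)."""
    SN_inv = np.linalg.inv(SN)
    S_new = np.linalg.inv(SN_inv + beta * np.outer(phi_new, phi_new))
    m_new = S_new @ (SN_inv @ mN + beta * t_new * phi_new)
    return m_new, S_new

# Sanity check: sequential updates reproduce the batch posterior.
alpha, beta, M = 2.0, 25.0, 3
rng = np.random.default_rng(2)
x = rng.uniform(-1.0, 1.0, 10)
Phi = np.vander(x, M, increasing=True)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, beta ** -0.5, x.size)

m, S = np.zeros(M), np.eye(M) / alpha          # prior N(0, alpha^{-1} I)
for phi_n, t_n in zip(Phi, t):
    m, S = update_one(m, S, phi_n, t_n, beta)

S_batch = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)
m_batch = beta * S_batch @ Phi.T @ t
assert np.allclose(m, m_batch) and np.allclose(S, S_batch)
```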
[Figures (MATLAB code): prior/posterior contours in $\boldsymbol{w}$-space and the corresponding $y(x,\boldsymbol{w})$ curves in data space, drawn using samples of $\boldsymbol{w}$ from the posterior, after observing a single data point.]
Note that the regression lines pass close to the data point (shown with a circle)
[Figures: the corresponding panels after observing two data points.]
Note that the regression lines now pass close to both data points
Example: 22 Data Points Collected
[Figures: likelihood contour, contours of the posterior, and $y(x,\boldsymbol{w})$ curves using samples of $\boldsymbol{w}$ from the posterior.]
Note how closely the regression lines cluster after 22 data points have been collected.
Summary of Results
[Figures (MATLAB code): prior/posterior (no data yet) and data space (no data yet), followed by the corresponding prior/posterior and data-space panels as data points are added.]
In a Bayesian treatment we are ultimately interested in the predictive distribution
$$ p(t\mid x,\boldsymbol{x},\boldsymbol{t}) = \int p(t\mid x,\boldsymbol{w})\, p(\boldsymbol{w}\mid\boldsymbol{x},\boldsymbol{t})\, d\boldsymbol{w}, \quad\text{where} $$
$$ p(t\mid x,\boldsymbol{w},\beta) = \mathcal{N}\big(t\mid y(x,\boldsymbol{w}),\beta^{-1}\big) = \mathcal{N}\big(t\mid\boldsymbol{\phi}(x)^T\boldsymbol{w},\beta^{-1}\big) \quad\text{and} $$
$$ p(\boldsymbol{w}\mid\boldsymbol{x},\boldsymbol{t},\alpha,\beta) = \mathcal{N}\Big(\boldsymbol{w}\;\Big|\;\beta\,\boldsymbol{S}_N\sum_{n=1}^{N}t_n\,\boldsymbol{\phi}(\boldsymbol{x}_n),\;\boldsymbol{S}_N\Big), \qquad \boldsymbol{S}_N^{-1} = \alpha\boldsymbol{I} + \beta\sum_{n=1}^{N}\boldsymbol{\phi}(\boldsymbol{x}_n)\boldsymbol{\phi}(\boldsymbol{x}_n)^T. $$
Recall the marginal and conditional results for a linear Gaussian model: if $p(\boldsymbol{x}) = \mathcal{N}(\boldsymbol{x}\mid\boldsymbol{\mu},\boldsymbol{\Lambda}^{-1})$ and $p(\boldsymbol{y}\mid\boldsymbol{x}) = \mathcal{N}(\boldsymbol{y}\mid\boldsymbol{A}\boldsymbol{x}+\boldsymbol{b},\boldsymbol{L}^{-1})$, then
$$ p(\boldsymbol{y}) = \mathcal{N}\big(\boldsymbol{y}\mid\boldsymbol{A}\boldsymbol{\mu}+\boldsymbol{b},\;\boldsymbol{L}^{-1}+\boldsymbol{A}\boldsymbol{\Lambda}^{-1}\boldsymbol{A}^T\big), $$
$$ p(\boldsymbol{x}\mid\boldsymbol{y}) = \mathcal{N}\big(\boldsymbol{x}\mid\boldsymbol{\Sigma}\{\boldsymbol{A}^T\boldsymbol{L}(\boldsymbol{y}-\boldsymbol{b})+\boldsymbol{\Lambda}\boldsymbol{\mu}\},\;\boldsymbol{\Sigma}\big), \qquad \boldsymbol{\Sigma} = \big(\boldsymbol{\Lambda}+\boldsymbol{A}^T\boldsymbol{L}\boldsymbol{A}\big)^{-1}. $$
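A quick Monte Carlo sanity check of the marginal result (my own sketch; the matrices below are arbitrary): sample $\boldsymbol{x}\sim\mathcal{N}(\boldsymbol{\mu},\boldsymbol{\Lambda}^{-1})$ and $\boldsymbol{y}\mid\boldsymbol{x}\sim\mathcal{N}(\boldsymbol{A}\boldsymbol{x}+\boldsymbol{b},\boldsymbol{L}^{-1})$, then compare the sample mean and variance of $\boldsymbol{y}$ with $\boldsymbol{A}\boldsymbol{\mu}+\boldsymbol{b}$ and $\boldsymbol{L}^{-1}+\boldsymbol{A}\boldsymbol{\Lambda}^{-1}\boldsymbol{A}^T$.

```python
import numpy as np

rng = np.random.default_rng(3)
mu = np.array([1.0, -0.5])
Lam = np.array([[2.0, 0.3], [0.3, 1.5]])   # precision of x
A = np.array([[1.0, 2.0]])                 # y is 1-D, x is 2-D
b = np.array([0.5])
L = np.array([[4.0]])                      # precision of y given x

n = 200_000
x = rng.multivariate_normal(mu, np.linalg.inv(Lam), size=n)           # x ~ N(mu, Lam^{-1})
noise = rng.multivariate_normal(np.zeros(1), np.linalg.inv(L), size=n)
y = x @ A.T + b + noise                                                # y | x ~ N(Ax + b, L^{-1})

print(y.mean(axis=0), A @ mu + b)                          # both ~ A mu + b
print(y.var(axis=0, ddof=1),
      np.linalg.inv(L) + A @ np.linalg.inv(Lam) @ A.T)     # both ~ L^{-1} + A Lam^{-1} A^T
```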
Predictive Distribution
Identify the weight posterior and the noise model with the linear Gaussian model above:
$$ p(\boldsymbol{x}) = \mathcal{N}(\boldsymbol{x}\mid\boldsymbol{\mu},\boldsymbol{\Lambda}^{-1}) \;\longleftrightarrow\; p(\boldsymbol{w}\mid\boldsymbol{x},\boldsymbol{t},\alpha,\beta) = \mathcal{N}\Big(\boldsymbol{w}\;\Big|\;\beta\,\boldsymbol{S}_N\sum_{n=1}^{N}t_n\,\boldsymbol{\phi}(\boldsymbol{x}_n),\;\boldsymbol{S}_N\Big), $$
$$ p(\boldsymbol{y}\mid\boldsymbol{x}) = \mathcal{N}(\boldsymbol{y}\mid\boldsymbol{A}\boldsymbol{x}+\boldsymbol{b},\boldsymbol{L}^{-1}) \;\longleftrightarrow\; p(t\mid x,\boldsymbol{w},\beta) = \mathcal{N}\big(t\mid y(x,\boldsymbol{w}),\beta^{-1}\big) = \mathcal{N}\big(t\mid\boldsymbol{\phi}(x)^T\boldsymbol{w},\beta^{-1}\big). $$
Thus for our problem:
$$ \boldsymbol{x}\rightarrow\boldsymbol{w}, \qquad \boldsymbol{\mu} = \beta\,\boldsymbol{S}_N\sum_{n=1}^{N}t_n\,\boldsymbol{\phi}(\boldsymbol{x}_n), \qquad \boldsymbol{\Lambda}^{-1} = \boldsymbol{S}_N, $$
$$ \boldsymbol{y}\rightarrow t, \qquad \boldsymbol{A} = \boldsymbol{\phi}(x)^T, \qquad \boldsymbol{b} = 0, \qquad \boldsymbol{L}^{-1} = \beta^{-1}. $$
The predictive distribution now takes the form
$$ p(\boldsymbol{y}) = \mathcal{N}\big(\boldsymbol{y}\mid\boldsymbol{A}\boldsymbol{\mu}+\boldsymbol{b},\;\boldsymbol{L}^{-1}+\boldsymbol{A}\boldsymbol{\Lambda}^{-1}\boldsymbol{A}^T\big) $$
$$ \Longrightarrow\quad p(t\mid x,\boldsymbol{x},\boldsymbol{t}) = \mathcal{N}\Big(t\;\Big|\;\beta\,\boldsymbol{\phi}(x)^T\boldsymbol{S}_N\sum_{n=1}^{N}t_n\,\boldsymbol{\phi}(\boldsymbol{x}_n),\;\beta^{-1}+\boldsymbol{\phi}(x)^T\boldsymbol{S}_N\,\boldsymbol{\phi}(x)\Big). $$
Predictive Distribution
In a full Bayesian treatment, we want to compute the predictive distribution,
i.e. given the training data 𝒙 and 𝒕 and a new test point 𝑥, we want the
distribution:
$$ p(t\mid x,\boldsymbol{x},\boldsymbol{t}) = \int p(t\mid x,\boldsymbol{w})\, p(\boldsymbol{w}\mid\boldsymbol{x},\boldsymbol{t})\, d\boldsymbol{w}, \qquad p(t\mid x,\boldsymbol{w},\beta) = \mathcal{N}\big(t\mid y(x,\boldsymbol{w}),\beta^{-1}\big), $$
$$ p(t\mid x,\boldsymbol{x},\boldsymbol{t}) = \mathcal{N}\big(t\mid m(x),\, s^2(x)\big), $$
where the mean and variance were shown in the earlier slide to be
$$ m(x) = \beta\,\boldsymbol{\phi}(x)^T\boldsymbol{S}_N\sum_{n=1}^{N}\boldsymbol{\phi}(\boldsymbol{x}_n)\,t_n, \qquad s^2(x) = \beta^{-1} + \boldsymbol{\phi}(x)^T\boldsymbol{S}_N\,\boldsymbol{\phi}(x), \qquad \boldsymbol{S}_N^{-1} = \alpha\boldsymbol{I} + \beta\sum_{n=1}^{N}\boldsymbol{\phi}(\boldsymbol{x}_n)\boldsymbol{\phi}(\boldsymbol{x}_n)^T. $$
E.g. for polynomial regression: $\boldsymbol{\phi}(x_n) = \big(1,\; x_n,\; x_n^2,\;\ldots,\; x_n^{M-1}\big)^T$, $\boldsymbol{\phi}(x)^T = \big(1\;\; x\;\; x^2\;\cdots\; x^{M-1}\big)$, and $\boldsymbol{I}$ is the $M\times M$ unit matrix.
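A sketch of the predictive mean $m(x)$ and variance $s^2(x)$ for the isotropic prior and polynomial features (my own Python illustration, not the MATLAB code used for the figures; the $\alpha$, $\beta$ values mirror the example on the next slide).

```python
import numpy as np

def fit_posterior(Phi, t, alpha, beta):
    """SN^{-1} = alpha I + beta Phi^T Phi,  mN = beta SN Phi^T t."""
    SN = np.linalg.inv(alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi)
    return beta * SN @ Phi.T @ t, SN

def predict(x_star, mN, SN, beta, M):
    """Predictive mean m(x) = phi(x)^T mN and variance s^2(x) = 1/beta + phi(x)^T SN phi(x)."""
    Phi_star = np.vander(np.atleast_1d(x_star), M, increasing=True)
    mean = Phi_star @ mN
    var = 1.0 / beta + np.einsum('ij,jk,ik->i', Phi_star, SN, Phi_star)
    return mean, var

alpha, beta, M = 5e-3, 11.1, 10    # degree-9 polynomial, i.e. phi(x) = [1, x, ..., x^9]
rng = np.random.default_rng(4)
x = rng.uniform(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, beta ** -0.5, x.size)
mN, SN = fit_posterior(np.vander(x, M, increasing=True), t, alpha, beta)
mean, var = predict(np.linspace(0.0, 1.0, 5), mN, SN, beta, M)
```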
Predictive Distribution
The predictive distribution is shown below for an $M = 9$ polynomial, with fixed parameters $\alpha = 5\times10^{-3}$ and $\beta = 11.1$ (corresponding to a known noise variance). The red curve denotes the mean of the predictive distribution and the red region corresponds to $\pm 1$ standard deviation around the mean.
[Figure: "Polynomial Bayesian Regression, M = 9" — generating function $\sin(2\pi x)$, random data points used for fitting, the predictive mean, and a band spanning one standard deviation of the predictive distribution.]
Note that
$$ \sigma_N^2(x) = \frac{1}{\beta} + \boldsymbol{\phi}(x)^T\boldsymbol{S}_N\,\boldsymbol{\phi}(x) \qquad\text{and}\qquad \sigma_{N+1}^2(x) \le \sigma_N^2(x). $$
Predictive Distribution
It is easy to show:
$$ \sigma_N^2(x) = \frac{1}{\beta} + \boldsymbol{\phi}(x)^T\boldsymbol{S}_N\,\boldsymbol{\phi}(x) \qquad\text{and}\qquad \sigma_{N+1}^2(x) \le \sigma_N^2(x). $$
Note that
$$ \boldsymbol{S}_{N+1}^{-1} = \alpha\boldsymbol{I} + \beta\sum_{n=1}^{N+1}\boldsymbol{\phi}(\boldsymbol{x}_n)\boldsymbol{\phi}(\boldsymbol{x}_n)^T = \boldsymbol{S}_N^{-1} + \beta\,\boldsymbol{\phi}(\boldsymbol{x}_{N+1})\,\boldsymbol{\phi}(\boldsymbol{x}_{N+1})^T, $$
so that, by the matrix inversion (Sherman-Morrison) lemma,
$$ \boldsymbol{S}_{N+1} = \boldsymbol{S}_N - \frac{\beta\,\boldsymbol{S}_N\,\boldsymbol{\phi}(\boldsymbol{x}_{N+1})\,\boldsymbol{\phi}(\boldsymbol{x}_{N+1})^T\,\boldsymbol{S}_N}{1+\beta\,\boldsymbol{\phi}(\boldsymbol{x}_{N+1})^T\boldsymbol{S}_N\,\boldsymbol{\phi}(\boldsymbol{x}_{N+1})}, $$
and therefore
$$ \sigma_{N+1}^2(x) = \sigma_N^2(x) - \frac{\beta\,\big(\boldsymbol{\phi}(x)^T\boldsymbol{S}_N\,\boldsymbol{\phi}(\boldsymbol{x}_{N+1})\big)^2}{1+\beta\,\boldsymbol{\phi}(\boldsymbol{x}_{N+1})^T\boldsymbol{S}_N\,\boldsymbol{\phi}(\boldsymbol{x}_{N+1})} \;\le\; \sigma_N^2(x). $$
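The monotone decrease of the predictive variance can be checked numerically; the sketch below (mine, with arbitrary data) verifies that adding one observation never increases $s^2(x)$ at any query point.

```python
import numpy as np

def SN_of(x, alpha, beta, M):
    """SN for the isotropic prior: SN^{-1} = alpha I + beta sum_n phi(x_n) phi(x_n)^T."""
    Phi = np.vander(x, M, increasing=True)
    return np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)

alpha, beta, M = 2.0, 25.0, 4
rng = np.random.default_rng(5)
x = rng.uniform(0.0, 1.0, 8)

S_N  = SN_of(x, alpha, beta, M)
S_N1 = SN_of(np.append(x, rng.uniform(0.0, 1.0)), alpha, beta, M)   # one extra observation

xs = np.linspace(-1.0, 2.0, 200)            # query points, including extrapolation
Phi_s = np.vander(xs, M, increasing=True)
var_N  = 1.0 / beta + np.einsum('ij,jk,ik->i', Phi_s, S_N,  Phi_s)
var_N1 = 1.0 / beta + np.einsum('ij,jk,ik->i', Phi_s, S_N1, Phi_s)
assert np.all(var_N1 <= var_N + 1e-9)       # sigma^2_{N+1}(x) <= sigma^2_N(x) everywhere
```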
The predictive uncertainty depends on $x$ and is smallest in the neighborhood of the data points; the level of uncertainty decreases as more data points are observed.

[Figures (MATLAB code): predictive distribution for increasing numbers of data points (e.g. $N = 2$); each panel shows the generating function $\sin(2\pi x)$, the random data points used for fitting, the predictive mean, and the predictive standard deviation.]
[Figures: the plug-in predictive $p(t\mid x,\hat{\boldsymbol{w}})$ compared with the full posterior predictive; functions sampled from the plug-in approximation to the posterior versus functions sampled from the posterior. Run linregPostPredDemo from PMTK3.]
Predictive Distribution
We are not interested in 𝒘 itself but in making predictions of 𝑡 for new values of 𝑥.
This requires the predictive distribution
$$ p(t\mid x,\boldsymbol{x},\boldsymbol{t}) = \int p(t\mid x,\boldsymbol{w},\beta)\, p(\boldsymbol{w}\mid\boldsymbol{x},\boldsymbol{t},\alpha,\beta)\, d\boldsymbol{w} = \mathcal{N}\big(t\mid\boldsymbol{m}_N^T\boldsymbol{\phi}(x),\,\sigma_N^2(x)\big), $$
where
$$ \sigma_N^2(x) = \frac{1}{\beta} + \boldsymbol{\phi}(x)^T\boldsymbol{S}_N\,\boldsymbol{\phi}(x), \qquad \boldsymbol{S}_N^{-1} = \alpha\boldsymbol{I} + \beta\,\boldsymbol{\Phi}^T\boldsymbol{\Phi}. $$
The 1st term represents the noise on the data whereas the 2nd term reflects the
uncertainty associated with 𝒘.
Because the noise process and the distribution of 𝒘 are independent Gaussians,
their variances are additive.
The error bars get larger as we move away from the training points. By contrast, in
the plug-in approximation, the error bars are of constant size.
As additional data points are observed, the posterior distribution becomes narrower.
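The contrast between the plug-in approximation and the full predictive can be seen in a small sketch (mine, not the PMTK3 demo): the plug-in predictive $p(t\mid x,\hat{\boldsymbol{w}})$ with $\hat{\boldsymbol{w}}=\boldsymbol{m}_N$ keeps the constant variance $1/\beta$, whereas the full predictive variance $1/\beta+\boldsymbol{\phi}(x)^T\boldsymbol{S}_N\boldsymbol{\phi}(x)$ grows away from the data.

```python
import numpy as np

alpha, beta, M = 2.0, 25.0, 4
rng = np.random.default_rng(6)
x = rng.uniform(0.0, 1.0, 12)
Phi = np.vander(x, M, increasing=True)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, beta ** -0.5, x.size)

SN = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)
mN = beta * SN @ Phi.T @ t

xs = np.array([0.5, 2.0, 5.0])               # inside the data, then further and further away
Phi_s = np.vander(xs, M, increasing=True)
plugin_var = np.full(xs.shape, 1.0 / beta)                             # constant error bars
bayes_var = 1.0 / beta + np.einsum('ij,jk,ik->i', Phi_s, SN, Phi_s)    # grows with distance
print(plugin_var)
print(bayes_var)
```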
[Figures (MATLAB code): plots of $y(x,\boldsymbol{w})$ where $\boldsymbol{w}$ is a sample from the posterior over $\boldsymbol{w}$, for $N = 1$, $N = 2$, and $N = 12$ data points.]

Here we are visualizing the joint uncertainty in the posterior distribution between the $y$ values at two or more $x$ values.
The model becomes very confident in its predictions when extrapolating outside the
region occupied by the basis functions. This is an undesirable behavior.
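A sketch (mine) of this effect with localized Gaussian basis functions: far from the basis-function centers $\boldsymbol{\phi}(x)\approx\boldsymbol{0}$, so the predictive variance collapses to the noise floor $1/\beta$ even though no data were observed there.

```python
import numpy as np

def gauss_features(x, centers, s):
    """phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)), one column per center mu_j."""
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * s ** 2))

alpha, beta, s = 2.0, 25.0, 0.1
centers = np.linspace(0.0, 1.0, 9)           # basis functions confined to [0, 1]
rng = np.random.default_rng(7)
x = rng.uniform(0.0, 1.0, 15)
Phi = gauss_features(x, centers, s)
SN = np.linalg.inv(alpha * np.eye(centers.size) + beta * Phi.T @ Phi)

for x_star in (0.5, 3.0):                    # inside the data vs. far outside
    phi = gauss_features(np.array([x_star]), centers, s)[0]
    var = 1.0 / beta + phi @ SN @ phi
    print(x_star, var)                       # at x = 3, phi(x) ~ 0 and var ~ 1/beta: overconfident
```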
Equivalent Kernel

The posterior-mean predictions can be written as linear combinations of the training targets:
$$ y(x,\boldsymbol{m}_N) = \sum_{n=1}^{N} k(x,\boldsymbol{x}_n)\, t_n, \qquad k(x,\boldsymbol{x}_n) = \beta\,\boldsymbol{\phi}(x)^T \boldsymbol{S}_N\, \boldsymbol{\phi}(\boldsymbol{x}_n). $$
The kernel is shown for Gaussian basis functions
$$ \phi_j(x) = \exp\!\Big(-\frac{(x-\mu_j)^2}{2s^2}\Big), $$
with 20 samples $x_n$ equally spaced in $[0,1]$.

[Figures (MATLAB code): the kernel $k(x,x')$ plotted against $x'$, and a contour plot of the kernel.]
Equivalent Kernel
$$ y(x,\boldsymbol{m}_N) = \sum_{n=1}^{N} k(x,\boldsymbol{x}_n)\, t_n, \qquad k(x,\boldsymbol{x}_n) = \beta\,\boldsymbol{\phi}(x)^T \boldsymbol{S}_N\, \boldsymbol{\phi}(\boldsymbol{x}_n) $$
[Figure: the equivalent kernel corresponding to 20 Gaussian basis functions ($s = 0.05$).]
Note that we make predictions by taking linear combinations of the training set target
values (linear smoothers).
The equivalent kernel depends on the input values 𝒙𝑛 since these appear in 𝑺𝑁.
See from the figure that it is localized around $x$, and so the mean of the predictive distribution $y(\boldsymbol{x},\boldsymbol{m}_N)$ is obtained by forming a weighted combination of the target values in which data points close to $\boldsymbol{x}$ are given higher weight than points further removed from $\boldsymbol{x}$.

[Figure: contour plot of the kernel (MATLAB code).]
[Figures (MATLAB code): equivalent kernels $k(x,x')$ plotted against $x'$ for the polynomial basis $\phi_j(x) = x^j$ and for the sigmoidal basis $\phi_j(x) = \sigma\!\big((x-x_j)/s\big)$, with $\sigma(a) = 1/(1+e^{-a})$.]
Examples of equivalent kernels 𝑘(𝑥, 𝑥’) for 𝑥 = 0 plotted as a function of 𝑥’. 10 basis
functions were used in each case. The sigmoidal basis is centered at 10 equally
spaced 𝑥𝑛 in [−1,1].
These are localized functions of 𝑥’ even though the corresponding basis functions
are nonlocal.
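An illustrative sketch (mine) of the equivalent kernel for Gaussian basis functions, checking that the posterior-mean prediction is the linear smoother $\sum_n k(x,x_n)\,t_n$, that the kernel weights are localized around the query point, and that they sum to (approximately) one.

```python
import numpy as np

def gauss_features(x, centers, s):
    """phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2))."""
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * s ** 2))

alpha, beta, s = 2.0, 25.0, 0.05
centers = np.linspace(0.0, 1.0, 20)          # 20 Gaussian basis functions, as in the figure
x_train = np.linspace(0.0, 1.0, 20)          # 20 training inputs equally spaced in [0, 1]
rng = np.random.default_rng(8)
Phi = gauss_features(x_train, centers, s)
t = np.sin(2 * np.pi * x_train) + rng.normal(0.0, beta ** -0.5, x_train.size)

SN = np.linalg.inv(alpha * np.eye(centers.size) + beta * Phi.T @ Phi)
mN = beta * SN @ Phi.T @ t

x_star = 0.5
phi_star = gauss_features(np.array([x_star]), centers, s)[0]
k = beta * phi_star @ SN @ Phi.T             # k(x*, x_n) for every training input x_n

print(k @ t, phi_star @ mN)                  # linear smoother: sum_n k(x*, x_n) t_n == y(x*, mN)
print(k.sum())                               # approximately one
print(np.round(k, 3))                        # weights are localized around x* = 0.5
```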
Equivalent Kernel
Consider the covariance between $y(\boldsymbol{x})$ and $y(\boldsymbol{x}')$. Recalling that
$$ p(\boldsymbol{w}\mid\boldsymbol{x},\boldsymbol{t},\alpha,\beta) = \mathcal{N}\Big(\boldsymbol{w}\;\Big|\;\beta\,\boldsymbol{S}_N\sum_{n=1}^{N}t_n\,\boldsymbol{\phi}(\boldsymbol{x}_n),\;\boldsymbol{S}_N\Big), $$
this covariance is given by
$$ \operatorname{cov}\!\big[y(\boldsymbol{x}),\,y(\boldsymbol{x}')\big] = \operatorname{cov}\!\big[\boldsymbol{\phi}(\boldsymbol{x})^T\boldsymbol{w},\;\boldsymbol{w}^T\boldsymbol{\phi}(\boldsymbol{x}')\big] = \boldsymbol{\phi}(\boldsymbol{x})^T \boldsymbol{S}_N\, \boldsymbol{\phi}(\boldsymbol{x}') = \beta^{-1}\, k(\boldsymbol{x},\boldsymbol{x}'). $$
From the earlier discussion on the equivalent kernel, we see that the predictive mean
at nearby points will be highly correlated, whereas for more distant points the
correlation will be smaller.
By drawing samples from the posterior distribution over 𝒘, and plotting the
corresponding model functions 𝑦(𝒙, 𝒘), we are visualizing the joint uncertainty in the
posterior distribution between the 𝑦 values at two (or more) 𝒙 values, as governed by
the equivalent kernel.
This is in contrast to the figure shown earlier, where we visualized the pointwise uncertainty in the predictions.
The equivalent kernel satisfies
$$ \sum_{n=1}^{N} k(\boldsymbol{x},\boldsymbol{x}_n) = 1 $$
for all values of 𝒙. However, the equivalent kernel may be negative for some values
of 𝒙.
Like all kernel functions, the equivalent kernel can be expressed as an inner product:
$$ k(\boldsymbol{x},\boldsymbol{z}) = \boldsymbol{\psi}(\boldsymbol{x})^T\boldsymbol{\psi}(\boldsymbol{z}), \qquad\text{where}\quad k(\boldsymbol{x},\boldsymbol{z}) = \beta\,\boldsymbol{\phi}(\boldsymbol{x})^T \boldsymbol{S}_N\, \boldsymbol{\phi}(\boldsymbol{z}) \quad\text{and}\quad \boldsymbol{\psi}(\boldsymbol{x}) = \beta^{1/2}\, \boldsymbol{S}_N^{1/2}\, \boldsymbol{\phi}(\boldsymbol{x}). $$
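A short numerical check (mine) of this inner-product form: with $\boldsymbol{\psi}(x)=\beta^{1/2}\boldsymbol{S}_N^{1/2}\boldsymbol{\phi}(x)$, the products $\boldsymbol{\psi}(x)^T\boldsymbol{\psi}(z)$ reproduce $k(x,z)=\beta\,\boldsymbol{\phi}(x)^T\boldsymbol{S}_N\boldsymbol{\phi}(z)$; a symmetric square root of $\boldsymbol{S}_N$ is obtained from its eigendecomposition.

```python
import numpy as np

def gauss_features(x, centers, s):
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * s ** 2))

alpha, beta, s = 2.0, 25.0, 0.05
centers = np.linspace(0.0, 1.0, 20)
x_train = np.linspace(0.0, 1.0, 20)
Phi = gauss_features(x_train, centers, s)
SN = np.linalg.inv(alpha * np.eye(centers.size) + beta * Phi.T @ Phi)

# Symmetric square root of SN (SN is symmetric positive definite)
evals, evecs = np.linalg.eigh(SN)
SN_half = evecs @ np.diag(np.sqrt(evals)) @ evecs.T

def phi(x):
    return gauss_features(np.atleast_1d(x), centers, s)[0]

def psi(x):
    return np.sqrt(beta) * SN_half @ phi(x)          # psi(x) = beta^{1/2} SN^{1/2} phi(x)

def k(x, z):
    return beta * phi(x) @ SN @ phi(z)               # k(x, z) = beta phi(x)^T SN phi(z)

xa, za = 0.3, 0.7
assert np.isclose(psi(xa) @ psi(za), k(xa, za))      # inner-product form reproduces the kernel
```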