MCMC
Outline
1. Review of Bayesian inference
2. Monte Carlo integration
3. MCMC in Bayesian inference: idea
4. MCMC in Bayesian inference: algorithms
5. Output analysis and diagnostics
6. Concluding remarks
Reading
1 Review of Bayesian inference
Data y = y1, y2, ..., yn, realisations of random variables Y1, Y2, ..., Yn, with distribution (model) f(y1, y2, ..., yn|θ).

π(θ|y) = f(y|θ)π(θ) / ∫ f(y|θ)π(θ) dθ
       = L(θ|y)π(θ) / ∫ L(θ|y)π(θ) dθ
       ∝ L(θ|y)π(θ)

i.e.

Posterior ∝ Likelihood × Prior
Three quantities of interest are
3. Posterior predictive distribution: Let ỹ denote some future unobserved response; then the posterior predictive distribution is

p(ỹ|y) = ∫ f(ỹ|θ, y)π(θ|y) dθ = ∫ f(ỹ|θ)π(θ|y) dθ.
Example
Suppose x1, ..., xn are independent N(θ, σ²) observations with σ² known, and the prior is θ ∼ N(µ, τ²). Then

f(x1, ..., xn|θ) = (2πσ²)^(−n/2) exp{ −(1/2) Σ_{i=1}^{n} (xi − θ)²/σ² }

so

π(θ|x) ∝ exp{ −(1/2) [ Σ_{i=1}^{n} (xi − θ)²/σ² + (θ − µ)²/τ² ] } =: e^(−M/2)
Calculate (Exercise)

M = a(θ − b/a)² − b²/a + c

where

a = n/σ² + 1/τ²
b = Σ xi/σ² + µ/τ²
c = Σ xi²/σ² + µ²/τ²

So

π(θ|x) = N(θ | b/a, 1/a).

Exercise: the predictive distribution for x is N(µ, σ² + τ²).
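As a quick numerical check (a sketch only; the data are simulated and the values of µ, τ², σ² below are arbitrary illustrative choices, not from these notes), a, b and the resulting posterior can be computed directly:

import numpy as np

rng = np.random.default_rng(0)

# Illustrative values (assumptions for this sketch)
mu, tau2 = 0.0, 4.0          # prior mean and variance for theta
sigma2, theta_true, n = 1.0, 1.5, 50
x = rng.normal(theta_true, np.sqrt(sigma2), size=n)

# a and b as defined above
a = n / sigma2 + 1.0 / tau2
b = x.sum() / sigma2 + mu / tau2

print("posterior mean b/a :", b / a)
print("posterior var  1/a :", 1.0 / a)

For large n the posterior mean b/a is pulled towards the sample mean and the posterior variance 1/a shrinks roughly like σ²/n.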
Example: Normal Linear Regression
Consider a normal linear regression,
y = xβ + ε,  ε ∼ N(0, σ²I),

so that

y ∼ N(y|xβ, σ²I)
Suppose we take the prior β ∼ N(β|0, vI). Then,
π(β|y) ∝ f(y|β)π(β)
       ∝ (σ²)^(−n/2) exp[ −(1/(2σ²)) (y − xβ)'(y − xβ) ] × |vI|^(−1/2) exp[ −(2v)^(−1) β'β ]
       ∝ exp{ −(1/(2σ²)) β'x'xβ − (2v)^(−1) β'β + (1/(2σ²)) (y'xβ + β'x'y) }.
Matching up the densities, π(β|y) = N(β|µ, Σ), we find

Σ^(−1) = v^(−1)I + σ^(−2) x'x

and

µ = (x'x + σ² v^(−1) I)^(−1) x'y.
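As a sketch of these formulas in code (the design matrix and response below are simulated, and the values of σ² and v are arbitrary illustrative choices):

import numpy as np

rng = np.random.default_rng(1)

# Simulated design and response (illustrative only)
n, p = 100, 3
x = rng.normal(size=(n, p))
beta_true = np.array([1.0, -0.5, 0.25])
sigma2, v = 1.0, 10.0
y = x @ beta_true + rng.normal(0.0, np.sqrt(sigma2), size=n)

# Posterior precision and covariance: Sigma^{-1} = v^{-1} I + sigma^{-2} x'x
prec = np.eye(p) / v + (x.T @ x) / sigma2
Sigma = np.linalg.inv(prec)
# Posterior mean: mu = sigma^{-2} Sigma x'y
mu = Sigma @ (x.T @ y) / sigma2

print("posterior mean:", mu)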
For new data, {y0, x0}, predictive densities follow,

p(y0|x0, y) = ∫ f(y0|x0, β, y) π(β|y) dβ = ∫ N(y0|x0β, σ²) N(β|β̂, v̂) dβ

where β̂ = µ and v̂ = Σ as above.
Computationally, even evaluating the posterior distribution, the prior predictive distribution, the marginal likelihood, or the posterior predictive distribution is not an easy task, in particular if we do not have conjugate priors.
2 Monte Carlo integration
In general, when X is a random variable with distribution π, and h is a function, evaluating

E_π[h(X)] = ∫ h(x)π(x) dx

analytically may be difficult. Instead, we can draw samples x^(1), ..., x^(M) ∼ π and approximate

E_π[h(X)] ≈ (1/M) Σ_{i=1}^{M} h(x^(i)).    (1)
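A minimal sketch of (1): the target π and the function h below (a standard normal and h(x) = x², so that E_π[h(X)] = 1) are illustrative choices only.

import numpy as np

rng = np.random.default_rng(2)

M = 100_000
x = rng.normal(size=M)          # independent samples x^(i) ~ pi = N(0, 1)
h = x ** 2                      # h(x) = x^2, so E_pi[h(X)] = 1

print("Monte Carlo estimate:", h.mean())   # (1): average of h over the samples
print("exact value         :", 1.0)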
Application to Bayesian inference

Draw samples from the posterior,

θ^(i) ∼ π(θ|y),  i = 1, ..., M,

and apply (1) with π(θ|y) as the target distribution.
Examples:
(3) Inference of marginal effects: Suppose θ is multivariate and we are interested in the sub-vector θj ∈ θ (for example a particular parameter in a normal linear regression model). Then,

F_{θj}(a) ≈ (1/M) Σ_{i=1}^{M} I(θj^(i) ≤ a)
Note that all these quantities can be computed
from the same bag of samples. That is, we can first
collect θ(1) , . . . , θ(M ) as a proxy for π(θ|y) and then
use the same set of samples over and again for what-
ever we are subsequently interested in.
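A small sketch of this reuse of one bag of samples (the draws below are simulated directly from a normal, standing in for posterior samples θ^(1), ..., θ^(M); all numbers are illustrative only):

import numpy as np

rng = np.random.default_rng(3)

# Pretend these are M posterior draws theta^(1..M) for a 2-dimensional theta
M = 10_000
theta = rng.multivariate_normal(mean=[1.0, -2.0],
                                cov=[[1.0, 0.3], [0.3, 2.0]], size=M)

# The same bag of samples answers many questions:
post_mean = theta.mean(axis=0)                       # posterior means
ci_theta1 = np.percentile(theta[:, 0], [2.5, 97.5])  # 95% interval for theta_1
prob_neg  = (theta[:, 1] <= 0).mean()                # F_{theta_2}(0), as in (3)

print(post_mean, ci_theta1, prob_neg)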
Independent sampling from π(x) may be difficult. Fortunately (1) still applies if we generate the samples using a Markov chain, provided some conditions hold; in that case (1) is called the Ergodic Theorem.
Review of Markov chains
Example: Xt = αXt−1 + εt.
Example. A two-state Markov chain (Xt )t=0,1,...
has transition matrix
P = [ 1 − α     α
        β     1 − β ].

Conditioning on the state at time n,

p11^(n+1) = (1 − α) p11^(n) + β p12^(n),

and from

p12^(n) + p11^(n) = Pr(Xn = 1 or 2) = 1

we obtain

p11^(n+1) = (1 − α − β) p11^(n) + β,

with p11^(0) = 1. Solving this recursion gives as unique solution

p11^(n) = β/(α + β) + (α/(α + β)) (1 − α − β)^n   for α + β > 0
p11^(n) = 1                                        for α + β = 0
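A small simulation sketch (the values of α and β below are arbitrary illustrative choices): it compares the empirical P(Xn = 1) over many chains started in state 1 with the closed-form p11^(n), and also checks numerically that π = (β/(α+β), α/(α+β)) satisfies πP = π, as discussed below.

import numpy as np

rng = np.random.default_rng(4)

alpha, beta = 0.3, 0.1          # illustrative transition probabilities
P = np.array([[1 - alpha, alpha],
              [beta, 1 - beta]])

n, reps = 10, 100_000
states = np.zeros(reps, dtype=int)       # all chains start in state 1 (index 0)
for _ in range(n):
    u = rng.random(reps)
    # move each chain according to its current row of P
    states = np.where(u < P[states, 1], 1, 0)

empirical = (states == 0).mean()
exact = beta / (alpha + beta) + alpha / (alpha + beta) * (1 - alpha - beta) ** n
print("empirical vs exact p11^(n):", empirical, exact)

pi = np.array([beta, alpha]) / (alpha + beta)
print("pi P =", pi @ P, " pi =", pi)     # invariant distribution: pi P = pi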
A Markov chain has stationary or invariant dis-
tribution π if
∫ π(y) P(y, x) dy = π(x),   for all x
In matrix notation: πP = π
Fact: If the state space I is finite and pij^(n) → πj as n → ∞ for all j ∈ I, then π = (πi, i ∈ I) is invariant.

In the two-state example, p11^(n) → β/(α + β) and p12^(n) → α/(α + β), and so π = (β/(α + β), α/(α + β)) is the invariant distribution.
You can also check that πP = π.
One can try to break a Markov chain Xn into smaller pieces. We say that i → j, i communicates with j, if pij^(n) > 0 for some n ≥ 0. The chain is irreducible if i → j for all pairs of states i, j.
A state i is aperiodic if pii^(n) > 0 for all sufficiently large n.
Ergodic Theorem: Assume the homogeneous
Markov chain has stationary distribution π and is
aperiodic and irreducible. Then (1) holds; for any
function h such that ∫ h(x)π(x) dx exists,

(1/n) Σ_{t=1}^{n} h(Xt) → E_π[h(X)] = ∫ h(x)π(x) dx   as n → ∞.
Here, X ∼ π.
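A sketch of the theorem in action on the two-state chain above (again with arbitrary illustrative α and β): the time average of h(X) = I(X = 1) along one long trajectory approaches the stationary probability β/(α+β), even though the Xt are dependent.

import numpy as np

rng = np.random.default_rng(5)

alpha, beta = 0.3, 0.1
P = np.array([[1 - alpha, alpha],
              [beta, 1 - beta]])

n = 200_000
x = np.empty(n, dtype=int)
x[0] = 0                                   # start in state 1 (index 0)
for t in range(1, n):
    x[t] = rng.choice(2, p=P[x[t - 1]])    # one step of the chain

# Ergodic average of h(X) = 1{X = state 1} along the single path
print("time average     :", (x == 0).mean())
print("stationary prob. :", beta / (alpha + beta))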
Note: Usually it is not possible to start the chain in the stationary distribution; if it were easy to sample from that distribution directly, we would not need a Markov chain in the first place.
3 MCMC in Bayesian inference: idea

As the name suggests, MCMC works by simulating a discrete-time Markov chain; it produces a dependent sequence (a chain) of random variables, {θ^(i)}_{i=1}^{M}, with approximate distribution

p(θ^(i)) ≈ π(θ|y)
It is fair to say that MCMC has revitalised (per-
haps even revolutionised) Bayesian statistics. Why?
We shall see that it suffices to be able to evaluate the posterior up to a normalising constant,

π(θ|y) ∝ p(y|θ)π(θ)
Example: Normal Linear Regression
Example: Logistic Regression - Titanic data
Sampling density for Titanic survivals: each yi ∈ {0, 1} indicates whether passenger i survived, and we model the survival probability P(yi = 1) as a function of the covariates xi through the sampling density p(yi|xi).
A popular approach is to use a Generalised Linear Model (GLM), which defines this association to be linear on an appropriate scale, for instance through a linear predictor ηi = xi'β.
The most popular link function for binary regression (two-class classification), yi ∈ {0, 1}, is the logit link, as it quantifies the log-odds:

logit(P(yi = 1|xi)) = log[ P(yi = 1|xi) / P(yi = 0|xi) ] = ηi,

equivalently P(yi = 1|xi) = 1/(1 + exp(−ηi)).
It is usual to write the model in hierarchical form,

yi | β ∼ Bernoulli(pi),  pi = 1/(1 + exp(−xi'β)),  i = 1, ..., n,

together with a prior on β, for example β ∼ N(β|0, vI).
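As a sketch of what the samplers in the next section actually need, the unnormalised log posterior of this model can be coded directly; the data below are simulated stand-ins (not the Titanic records) and the prior variance v is an arbitrary illustrative choice.

import numpy as np

rng = np.random.default_rng(6)

# Stand-in data (the actual Titanic covariates are not reproduced here)
n, p = 200, 3
x = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([-0.5, 1.0, -1.0])
prob = 1.0 / (1.0 + np.exp(-x @ beta_true))
y = rng.binomial(1, prob)

v = 10.0                                  # assumed prior variance, beta ~ N(0, vI)

def log_post(beta):
    # Unnormalised log pi(beta | y): Bernoulli log-likelihood + normal log-prior
    eta = x @ beta
    loglik = np.sum(y * eta - np.log1p(np.exp(eta)))
    logprior = -0.5 * beta @ beta / v
    return loglik + logprior

print(log_post(np.zeros(p)), log_post(beta_true))

This log_post (log-likelihood plus log-prior, up to additive constants) is all that a Metropolis-Hastings sampler has to evaluate.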
4 MCMC in Bayesian inference: algorithms

In the previous chapter we presented an example of using MCMC for simulation-based inference.
The general algorithm is
–MCMC Algorithm–
θ^(0) ← x
For i = 1 to M
    Draw θ^(i) ∼ f(· | θ^(i−1))
End

where f(· | ·) is the transition kernel of the chain.
4.1 The Metropolis-Hastings (M-H) algorithm
–M-H Algorithm–
θ^(0) ← x
For i = 0 to M
    Draw θ̃ ∼ q(θ̃ | θ^(i))
    Compute
        α(θ^(i), θ̃) = min{ 1, [π(θ̃|y) q(θ^(i)|θ̃)] / [π(θ^(i)|y) q(θ̃|θ^(i))] }
    With probability α(θ^(i), θ̃) set θ^(i+1) ← θ̃, otherwise set θ^(i+1) ← θ^(i)
End
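A minimal random-walk Metropolis sketch of this algorithm; the target below (a standard normal log density) and the proposal standard deviation are illustrative placeholders for whatever unnormalised log posterior one actually has, such as the log_post of the logistic example.

import numpy as np

rng = np.random.default_rng(7)

def log_target(theta):
    # Unnormalised log density of the target; here N(0, 1) purely for illustration
    return -0.5 * theta ** 2

def rw_metropolis(log_target, theta0, M=20_000, sd=1.0):
    # Random-walk Metropolis with symmetric proposal N(current, sd^2)
    chain = np.empty(M)
    theta, lp = theta0, log_target(theta0)
    accepted = 0
    for i in range(M):
        prop = theta + sd * rng.normal()          # draw theta~ ~ q(.|theta)
        lp_prop = log_target(prop)
        if np.log(rng.random()) < lp_prop - lp:   # accept with prob min(1, ratio)
            theta, lp = prop, lp_prop
            accepted += 1
        chain[i] = theta
    return chain, accepted / M

chain, acc_rate = rw_metropolis(log_target, theta0=5.0)
print("acceptance rate:", acc_rate)
print("estimated mean and sd:", chain.mean(), chain.std())

Because only differences of log densities enter the acceptance step, the normalising constant of the posterior is never needed.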
Why does it work?
Note:
To accept with probability π(b|y)q(a|b) / [π(a|y)q(b|a)], draw u ∼ Uniform(0, 1) and accept the proposed move if u is smaller than this ratio.
In the special case of a symmetric proposal density (the Metropolis algorithm), q(a|b) = q(b|a), for example q(a|b) = N(a|b, 1), the ratio reduces to that of the posterior densities,

α(a, b) = min{ 1, π(b|y) / π(a|y) }
Choices for q(a|b)
It is usual to “centre” the proposal density around the current state and make “local” moves. A popular choice when θ is real valued is to take q(a|b) = N(a|b, V), i.e. a = b + ε with ε ∼ N(0, V), where V is user specified. That is, a normal density centred at the current state b.
4.2 The Gibbs Sampler
An important alternative approach is available when each full conditional distribution

π(θj | θ−j, y),   j = 1, ..., p,

can be sampled from directly.
–Gibbs Sampler–
θ^(0) ← x
For i = 0 to M
    Set θ̃ ← θ^(i)
    For j = 1 to p
        Draw θ̃j ∼ π(θj | θ̃−j, y)
    End
    Set θ^(i+1) ← θ̃
End
Note:
Example: normal linear regression
y = xβ + ε,  ε ∼ N(0, σ²I),

y ∼ N(y|xβ, σ²I)
Suppose we take β ∼ N(β|0, vI) and σ² ∼ IG(σ²|a, b). The joint posterior π(β, σ²|y) is then not of standard form. However, the full conditionals are known,

β | σ², y ∼ N(β | µ, Σ),   Σ = (v^(−1)I + σ^(−2)x'x)^(−1),   µ = σ^(−2) Σ x'y,

and

σ² | β, y ∼ IG( a + n/2,  b + (y − xβ)'(y − xβ)/2 ),

so a Gibbs sampler can alternate draws from these two distributions.
Example: hierarchical normal linear regression

Consider again the normal linear regression model

y = xβ + ε,  ε ∼ N(0, σ²I)
y ∼ N(y|xβ, σ²I)
β ∼ N(β|0, vI)
σ² ∼ IG(σ²|a, b)
v ∼ IG(v|c, d)
Then the joint posterior density is

π(β, σ², v | y) ∝ N(y|xβ, σ²I) N(β|0, vI) IG(σ²|a, b) IG(v|c, d),

which is not of standard form. However, the full conditionals are known,

β | σ², v, y ∼ N(β | µ, Σ),   Σ = (v^(−1)I + σ^(−2)x'x)^(−1),   µ = σ^(−2) Σ x'y,

and

σ² | β, v, y ∼ IG( a + n/2,  b + (y − xβ)'(y − xβ)/2 ),

and

v | β, σ², y ∼ IG( c + p/2,  d + β'β/2 ),

where p is the dimension of β.
–Gibbs Sampler, hierarchical normal linear regression–
{β, σ², v}^(0) ← x
For i = 0 to M
    Draw β^(i+1) ∼ π(β | (σ²)^(i), v^(i), y)
    Draw (σ²)^(i+1) ∼ π(σ² | β^(i+1), v^(i), y)
    Draw v^(i+1) ∼ π(v | β^(i+1), (σ²)^(i+1), y)
End
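A sketch of this Gibbs sampler in code; the data are simulated and the hyperparameters a, b, c, d, chain length and burn-in are arbitrary illustrative values. Inverse-gamma draws are obtained as reciprocals of gamma draws.

import numpy as np

rng = np.random.default_rng(8)

# Simulated data (illustrative only)
n, p = 100, 3
x = rng.normal(size=(n, p))
y = x @ np.array([1.0, -0.5, 0.25]) + rng.normal(size=n)

a, b, c, d = 2.0, 1.0, 2.0, 1.0          # illustrative IG hyperparameters
M, burn = 5_000, 1_000

beta, sigma2, v = np.zeros(p), 1.0, 1.0
draws = {"beta": [], "sigma2": [], "v": []}

for i in range(M):
    # beta | sigma2, v, y  ~  N(mu, Sigma)
    Sigma = np.linalg.inv(np.eye(p) / v + (x.T @ x) / sigma2)
    mu = Sigma @ (x.T @ y) / sigma2
    beta = rng.multivariate_normal(mu, Sigma)

    # sigma2 | beta, y  ~  IG(a + n/2, b + ||y - x beta||^2 / 2)
    resid = y - x @ beta
    sigma2 = 1.0 / rng.gamma(a + n / 2, 1.0 / (b + resid @ resid / 2))

    # v | beta  ~  IG(c + p/2, d + beta'beta / 2)
    v = 1.0 / rng.gamma(c + p / 2, 1.0 / (d + beta @ beta / 2))

    if i >= burn:
        draws["beta"].append(beta)
        draws["sigma2"].append(sigma2)
        draws["v"].append(v)

print("posterior mean of beta:", np.mean(draws["beta"], axis=0))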
Some Issues:
5 Output analysis and diagnostics
5.1 Convergence and burn-in
The accepted practice is to discard an initial set of samples as being unrepresentative of the stationary distribution of the Markov chain (the target distribution). That is, the first B samples, θ^(1), ..., θ^(B), are discarded.
It is worth emphasising from the beginning that in practice no general exact tests for convergence exist.
We shall consider graphical analysis and convergence tests; for more details see the CODA documentation at
http://www.mrc-bsu.cam.ac.uk/bugs/documentation/Download/cdaman03.pdf
Graphical Analysis
There should be
- no continuous drift
- no strong autocorrelation
in the sequence of values following burn-in (as the samples are supposed to follow the same distribution).
Usually, θ(0) is far away from the major support
of the posterior density. Initially then, the chain will
often be seen to “migrate” away from θ(0) towards a
region of high posterior probability centred around
a mode of π(θ|y)
If the chain has converged, additional samples from the posterior distribution should not influence the calculation of the mean. Running means will reveal whether the posterior mean has settled to a particular value.
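A sketch of a running-mean check; the chain below is a simulated autocorrelated sequence standing in for post burn-in MCMC output.

import numpy as np

rng = np.random.default_rng(9)

# Stand-in for post burn-in draws: an AR(1)-like dependent sequence
chain = np.empty(5_000)
chain[0] = 0.0
for t in range(1, chain.size):
    chain[t] = 0.9 * chain[t - 1] + rng.normal()

running_mean = np.cumsum(chain) / np.arange(1, chain.size + 1)
print(running_mean[[99, 999, 4999]])      # should settle down as t grows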
Autocorrelation plots
Formal convergence diagnostics
Geweke’s test
Gelman & Rubin’s test
When little improvement could be gained, the
chains are taken as having converged.
5. R̂ is defined as

R̂ = (width of pooled interval) / (mean width of within-sequence intervals).
Other tests
Tricks to speed up convergence
5.2 Tests for dependence in the chain
The Theory
The variance of the estimator, σf², is given by

σf² = Σ_{s=−∞}^{∞} cov[ f(θ^(i)), f(θ^(i+s)) ]
In Practice
A useful statistic is the Effective Sample Size,

ESS = M / (1 + 2 Σ_{j=1}^{k} ρ(j)),

where ρ(j) is the lag-j autocorrelation of the chain and k is a cut-off beyond which the autocorrelations are negligible.
We call

Eff = 1 / (1 + 2 Σ_{j=1}^{k} ρ(j))

the efficiency of the chain, so that ESS = M × Eff.
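A sketch of estimating ρ(j), ESS and Eff from a chain; the cut-off k and the simulated autocorrelated chain below are illustrative choices only.

import numpy as np

def autocorr(chain, k):
    # Sample autocorrelations rho(1), ..., rho(k) of a 1-d chain
    x = chain - chain.mean()
    var = x @ x / len(x)
    return np.array([(x[:-j] @ x[j:]) / (len(x) * var) for j in range(1, k + 1)])

def ess(chain, k=100):
    rho = autocorr(chain, k)
    eff = 1.0 / (1.0 + 2.0 * rho.sum())   # Eff as defined above
    return len(chain) * eff, eff          # (ESS, Eff)

# Example: an AR(1)-like stand-in chain, as in the running-means sketch
rng = np.random.default_rng(10)
chain = np.empty(20_000)
chain[0] = 0.0
for t in range(1, chain.size):
    chain[t] = 0.9 * chain[t - 1] + rng.normal()

print(ess(chain))   # roughly chain.size * (1 - 0.9) / (1 + 0.9) for this chain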
6 Concluding remarks
Often the posterior will not be of standard form
(for example when the prior is non-conjugate)