
Markov Chain Monte Carlo

and Applied Bayesian Statistics


Trinity Term 2005
Prof. Gesine Reinert

Markov chain Monte Carlo is a stochastic simulation technique that is very useful for computing inferential quantities. It is often used in a Bayesian context, but is not restricted to a Bayesian setting.

Outline

1. Review of Bayesian inference

2. Monte Carlo integration and Markov chains

3. MCMC in Bayesian inference: ideas

4. MCMC in Bayesian inference: algorithms

5. Output analysis and diagnostics

6. Concluding remarks

There will be a practical session, using the software package WINBUGS, Tuesday week 3, 2-4 pm.

Reading

1. Gelman, A. et al. (2004). Bayesian Data Analysis. Chapman & Hall.

2. Robert, C.P. and Casella, G. (2004). Monte Carlo Statistical Methods. 2nd ed. Springer.

3. Gilks, W.R. et al., eds. (1996). Markov Chain Monte Carlo in Practice. Chapman & Hall.

Acknowledgement: Chris Holmes, for providing his lecture notes and examples, which are partly due to Nicky Best. Note that this course differs from the one Chris Holmes gave last year, though.

1 Review of Bayesian inference

Data y = (y_1, y_2, ..., y_n), realisations of random variables Y_1, Y_2, ..., Y_n, with distribution (model)

$$f(y_1, y_2, \ldots, y_n \mid \theta).$$

L(θ|y) = f(y|θ) is the likelihood of y if θ is the true parameter (vector).

The parameter (vector) θ = (θ_1, ..., θ_p) has a prior distribution π(θ).

Inference is based on the posterior distribution

$$\pi(\theta\mid y) = \frac{f(y\mid\theta)\,\pi(\theta)}{\int f(y\mid\theta)\,\pi(\theta)\,d\theta} = \frac{L(\theta\mid y)\,\pi(\theta)}{\int L(\theta\mid y)\,\pi(\theta)\,d\theta} \propto L(\theta\mid y)\,\pi(\theta),$$

i.e.

$$\text{Posterior} \propto \text{Likelihood} \times \text{Prior}.$$
Three quantities of interest are

1. The prior predictive distribution,

$$p(y) = \int f(y\mid\theta)\,\pi(\theta)\,d\theta,$$

which represents the probability of the observed data, evaluated before the data were seen.

2. Marginal effects of a subset of parameters in a multivariate model: Suppose that we are interested in π(θ_i|y), for some subset θ_i ∈ θ (here and in the following we abuse notation by using θ = {θ_1, ..., θ_p} to denote a set as well as a vector). Then

$$\pi(\theta_i\mid y) = \int \pi(\theta_i, \theta_{-i}\mid y)\,d\theta_{-i} = \int \pi(\theta_i\mid\theta_{-i}, y)\,\pi(\theta_{-i}\mid y)\,d\theta_{-i},$$

where θ_{-i} = θ \ θ_i denotes the vector θ with θ_i removed. This distribution is the marginal posterior of θ_i.

3. Posterior predictive distribution: Let ỹ denote some future unobserved response; then the posterior predictive distribution is

$$p(\tilde y\mid y) = \int f(\tilde y\mid\theta, y)\,\pi(\theta\mid y)\,d\theta = \int f(\tilde y\mid\theta)\,\pi(\theta\mid y)\,d\theta.$$

For the last step we used that ỹ and y are conditionally independent given θ, though clearly they are unconditionally dependent.

Example

X_1, ..., X_n is a random sample from N(θ, σ²), where σ² is known, and π(θ) ∼ N(µ, τ²), where µ, τ² are known. Then

$$f(x_1, \ldots, x_n\mid\theta) = (2\pi\sigma^2)^{-n/2}\exp\left\{-\frac{1}{2}\sum_{i=1}^n \frac{(x_i-\theta)^2}{\sigma^2}\right\},$$

so

$$\pi(\theta\mid x) \propto \exp\left\{-\frac{1}{2}\left(\sum_{i=1}^n \frac{(x_i-\theta)^2}{\sigma^2} + \frac{(\theta-\mu)^2}{\tau^2}\right)\right\} =: e^{-\frac{1}{2}M}.$$
Calculate (Exercise)

$$M = a\left(\theta - \frac{b}{a}\right)^2 - \frac{b^2}{a} + c,$$

where

$$a = \frac{n}{\sigma^2} + \frac{1}{\tau^2}, \qquad b = \frac{1}{\sigma^2}\sum x_i + \frac{\mu}{\tau^2}, \qquad c = \frac{1}{\sigma^2}\sum x_i^2 + \frac{\mu^2}{\tau^2}.$$

So

$$\pi(\theta\mid x) \sim N\left(\frac{b}{a}, \frac{1}{a}\right).$$

Exercise: the predictive distribution for x is N(µ, σ² + τ²).
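
A quick numerical check of these formulas (a sketch; µ, τ², σ², n, and the data-generating mean are assumed values chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, tau2, sigma2, n = 0.0, 4.0, 1.0, 25   # assumed prior and model settings
x = rng.normal(1.5, np.sqrt(sigma2), size=n)

a = n / sigma2 + 1 / tau2
b = x.sum() / sigma2 + mu / tau2
print("posterior mean b/a =", b / a)      # shrinks the sample mean towards mu
print("posterior var 1/a  =", 1 / a)
```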

Example: Normal Linear Regression

Consider a normal linear regression,

$$y = x\beta + \epsilon,$$

where ε ∼ N(0, σ²I). Alternatively, y ∼ N(xβ, σ²I); to make the y-dependence clearer, we write

$$y \sim N(y\mid x\beta, \sigma^2 I).$$

For now assume that σ is known.

Classically, we would wish to estimate the regression coefficients, β, given a data set, {y_i, x_i}_{i=1}^n, say using the MLE

$$\hat\beta = (x'x)^{-1}x'y.$$

Bayesian modelling proceeds by constructing a joint model for the data and unknown parameters,

$$\pi(y, \beta\mid x, \sigma^2) = f(y\mid x, \beta, \sigma^2)\,\pi(\beta\mid x, \sigma^2) = N(y\mid x\beta, \sigma^2 I)\,\pi(\beta),$$

where we assume, for now, that the prior π(β) is independent of {x, σ²}.

Suppose we take

$$\pi(\beta) = N(\beta\mid 0, vI).$$

Then

$$\pi(\beta\mid y) \propto f(y\mid\beta)\,\pi(\beta) \propto (\sigma^2)^{-n/2}\exp\left[-\frac{1}{2\sigma^2}(y-x\beta)'(y-x\beta)\right] \times |vI|^{-1/2}\exp\left[-(2v)^{-1}\beta'\beta\right]$$

$$\propto \exp\left\{-\frac{1}{2}\beta'\left(\sigma^{-2}x'x + v^{-1}I\right)\beta + \frac{1}{2\sigma^2}\left(y'x\beta + \beta'x'y\right)\right\}.$$

We recall that the multivariate normal density f_{N(µ,Σ)} for some vector z can be written as

$$f_{N(\mu,\Sigma)}(z) \propto \exp\left\{-\frac{1}{2}(z-\mu)'\Sigma^{-1}(z-\mu)\right\} \propto \exp\left\{-\frac{1}{2}z'\Sigma^{-1}z + \frac{1}{2}\left(z'\Sigma^{-1}\mu + \mu'\Sigma^{-1}z\right)\right\}.$$

Matching up the densities we find

$$\Sigma^{-1} = v^{-1}I + \sigma^{-2}x'x$$

and

$$\mu = (x'x + \sigma^2 v^{-1}I)^{-1}x'y.$$

Therefore we can write

$$\pi(\beta\mid y) = N(\beta\mid\hat\beta, \hat v), \qquad \hat\beta = (x'x + \sigma^2 v^{-1}I)^{-1}x'y, \qquad \hat v = \sigma^2(x'x + \sigma^2 v^{-1}I)^{-1}.$$

Note that β again follows a normal distribution:

Prior (normal) → Posterior (normal)

Conjugate prior: when prior and posterior are in the same family.

For new data, {y_0, x_0}, predictive densities follow:

$$p(y_0\mid x_0, y) = \int f(y_0\mid x_0, \beta, y)\,\pi(\beta\mid y)\,d\beta = \int N(y_0\mid x_0\beta, \sigma^2)\,N(\beta\mid\hat\beta, \hat v)\,d\beta = N\big(y_0\mid x_0\hat\beta,\; \sigma^2 + x_0\hat v x_0'\big).$$

A Bayesian analysis might then continue by computing the posterior mean, the posterior variance, credible intervals, or using Bayesian hypothesis testing.

Computationally, even evaluating the posterior distribution, the prior predictive distribution, the marginal likelihoods, and the posterior predictive distribution is not an easy task, in particular if we do not have conjugate priors.

Historically, the need to evaluate integrals was a major stumbling block for the uptake of Bayesian methods.

Around 15 years ago, a numerical method known as Markov chain Monte Carlo (MCMC) was popularized by a paper of Gelfand and Smith (1990); other statisticians such as Ripley, Besag, Tanner, and Geman were using MCMC before.

2 Monte Carlo integration

In general, when X is a random variable with distribution π, and h is a function, evaluating

$$E_\pi[h(X)] = \int h(x)\,\pi(x)\,dx$$

can be difficult, in particular when X is high-dimensional.

However, if we can draw samples

$$x^{(1)}, x^{(2)}, \ldots, x^{(n)} \sim \pi,$$

then we can estimate

$$E_\pi[h(X)] \approx \frac{1}{n}\sum_{i=1}^n h(x^{(i)}).$$

This is Monte Carlo integration.

For independent samples, by the law of large numbers we have, in probability,

$$\frac{1}{n}\sum_{i=1}^n h(x^{(i)}) \to E_\pi[h(X)] \quad \text{as } n \to \infty. \tag{1}$$
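
As an illustration, a minimal Monte Carlo integration sketch in Python (the target π = N(0, 1), the function h, and the sample size are assumptions chosen for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)   # draws from pi = N(0, 1)
h = lambda t: t ** 2           # h(X) = X^2, so E_pi[h(X)] = 1
print(h(x).mean())             # Monte Carlo estimate, close to 1
```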

Application to Bayesian inference

Recall: all the information (needed for, say, predictions, marginals, etc.) is contained in the posterior π(θ|y).

However, π(θ|y) may not be quantifiable as a standard distribution.

Suppose we are able to draw samples, θ^{(1)}, ..., θ^{(M)}, from π(θ|y), so that

$$\theta^{(i)} \sim \pi(\theta\mid y).$$

Then most inferential quantities of interest are solvable using the bag of samples, {θ^{(i)}}_{i=1}^M, as a proxy for π(θ|y).

Examples:

(1) Suppose we are interested in Pr(θ < a|y). Then

$$\Pr(\theta < a\mid y) \approx \frac{1}{M}\sum_{i=1}^M I(\theta^{(i)} < a),$$

where I(·) is the logical indicator function. More generally, for a set A ⊆ Θ,

$$\Pr(\theta \in A\mid y) \approx \frac{1}{M}\sum_{i=1}^M I(\theta^{(i)} \in A).$$

(2) Prediction: Suppose we are interested in p(ỹ|y), for some future ỹ. Then

$$p(\tilde y\mid y) \approx \frac{1}{M}\sum_{i=1}^M f(\tilde y\mid\theta^{(i)}, y) = \frac{1}{M}\sum_{i=1}^M f(\tilde y\mid\theta^{(i)}).$$

(3) Inference for marginal effects: Suppose θ is multivariate and we are interested in the subvector θ_j ∈ θ (for example, a particular parameter in a normal linear regression model). Then

$$F_{\theta_j}(a) \approx \frac{1}{M}\sum_{i=1}^M I(\theta_j^{(i)} \le a),$$

where F(·) denotes the distribution function. More generally, for any set A_j ⊆ Θ_j, the lower-dimensional parameter space,

$$\Pr(\theta_j \in A_j\mid y) \approx \frac{1}{M}\sum_{i=1}^M I(\theta_j^{(i)} \in A_j).$$

This last point is particularly useful.

Note that all these quantities can be computed from the same bag of samples. That is, we can first collect θ^{(1)}, ..., θ^{(M)} as a proxy for π(θ|y) and then use the same set of samples over and over again for whatever we are subsequently interested in.

Warning: Monte Carlo integration is a last resort; if we can calculate expectations and probabilities analytically, then that is much preferred.
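
To make the "bag of samples" idea concrete, a small sketch (the draws below merely stand in for posterior samples; the threshold and evaluation point are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(1.0, 0.5, size=10_000)  # stand-in for posterior draws

print("Pr(theta < 0 | y) ~", np.mean(theta < 0.0))   # example (1)
print("F_theta(1.2)      ~", np.mean(theta <= 1.2))  # example (3)
print("posterior mean    ~", theta.mean())           # same samples reused
```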

Independent sampling from π(x) may be difficult. Fortunately, (1) still applies if we generate the samples using a Markov chain, provided some conditions hold; in that case (1) is called the Ergodic Theorem.

Review of Markov chains

A homogeneous Markov chain (X_t)_{t=0,1,...} is generated by sampling from a transition kernel P(y, x): if X_t = x_t, then X_{t+1} ∼ P(x_t, ·), for t = 0, 1, 2, ...; so X_{t+1} depends on the past X_0, X_1, ..., X_t only through X_t. More generally, for any set A,

$$P(x_t, A) := P(X_{t+1} \in A \mid X_t = x_t).$$

If the transition probabilities depended on t, the chain would be called inhomogeneous.

Example. Consider the AR(1) process

$$X_t = \alpha X_{t-1} + \epsilon_t,$$

where the ε_t's are independent and identically distributed. Then (X_t)_{t=0,1,...} is a homogeneous Markov chain.

For a Markov chain with finite state space I we can calculate n-step transition probabilities by matrix iteration: if p_{ij}^{(n)} = Pr(X_n = j|X_0 = i), for i, j ∈ I, then

$$\left(p_{ij}^{(n)}\right)_{i,j \in I} = P^n.$$

Example. A two-state Markov chain (X_t)_{t=0,1,...} has transition matrix

$$P = \begin{pmatrix} 1-\alpha & \alpha \\ \beta & 1-\beta \end{pmatrix}.$$

Conditioning on the first step, we have, for example,

$$p_{11}^{(n+1)} = \beta p_{12}^{(n)} + (1-\alpha)p_{11}^{(n)},$$

and from

$$p_{12}^{(n)} + p_{11}^{(n)} = \Pr(X_n = 1 \text{ or } 2) = 1$$

we obtain, for n ≥ 1,

$$p_{11}^{(n+1)} = (1-\alpha-\beta)p_{11}^{(n)} + \beta,$$

with p_{11}^{(0)} = 1. Solving the system gives as unique solution

$$p_{11}^{(n)} = \begin{cases} \dfrac{\beta}{\alpha+\beta} + \dfrac{\alpha}{\alpha+\beta}(1-\alpha-\beta)^n & \text{for } \alpha+\beta > 0, \\ 1 & \text{for } \alpha+\beta = 0. \end{cases}$$
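
A quick check of this closed form against matrix iteration (α and β are assumed values):

```python
import numpy as np

alpha, beta = 0.3, 0.1                      # assumed transition probabilities
P = np.array([[1 - alpha, alpha],
              [beta, 1 - beta]])

n = 20
Pn = np.linalg.matrix_power(P, n)           # n-step transition matrix P^n
closed = beta / (alpha + beta) + alpha / (alpha + beta) * (1 - alpha - beta) ** n
print(Pn[0, 0], closed)                     # both give p_11^(n)
```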

A Markov chain has stationary or invariant distribution π if

$$\int \pi(y)P(y, x)\,dy = \pi(x) \quad \text{for all } x;$$

that is, once we start in the stationary distribution π, all X_t will have the distribution π.

In matrix notation: πP = π.

Fact: If the state space I is finite and p_{ij}^{(n)} → π_j as n → ∞ for all j ∈ I, then π = (π_i, i ∈ I) is invariant.

Example: For the two-state Markov chain above, as n → ∞,

$$P^n \to \begin{pmatrix} \dfrac{\beta}{\alpha+\beta} & \dfrac{\alpha}{\alpha+\beta} \\ \dfrac{\beta}{\alpha+\beta} & \dfrac{\alpha}{\alpha+\beta} \end{pmatrix},$$

and so π = (β/(α+β), α/(α+β)) is the invariant distribution. You can also check that πP = π.

One can try to break a Markov chain (X_n) into smaller pieces. We say that i → j, i.e. i communicates with j, if

$$P(X_n = j \text{ for some } n \ge 0 \mid X_0 = i) > 0.$$

A Markov chain is irreducible if any state can be reached from any other state in a finite number of moves, i.e. if P(X_n = j for some n ≥ 0 | X_0 = i) > 0 for all i, j ∈ I: every state communicates with every other state.

Fact: If the chain is irreducible and if it has a stationary distribution, then the stationary distribution is unique.

A state i is aperiodic if p_{ii}^{(n)} > 0 for all sufficiently large n.

Example. Consider the two-state Markov chain with transition matrix

$$P = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}.$$

Then P² = I, so P^{2n} = I and P^{2n+1} = P; each state returns to itself only at every second step: the chain is periodic.

Fact: If an irreducible Markov chain has an aperiodic state, then automatically all its states are aperiodic.

Ergodic Theorem: Assume the homogeneous Markov chain has stationary distribution π and is aperiodic and irreducible. Then (1) holds: for any function h such that ∫ h(x)π(x)dx exists,

$$\frac{1}{n}\sum_{t=1}^n h(X_t) \to E_\pi[h(X)] = \int h(x)\,\pi(x)\,dx \quad \text{as } n \to \infty.$$

Here, X ∼ π.

Also, for such chains with

$$\sigma_h^2 = \mathrm{var}_\pi[h(X)] < \infty,$$

the central limit theorem holds, and convergence to the stationary distribution occurs (geometrically) fast.

So we can apply Monte Carlo integration to approximate ∫ h(x)π(x)dx by simulating a Markov chain that has π as stationary distribution.

Further reading on Markov chains: J.R. Norris, Markov Chains. Cambridge University Press, 1997.

Note: Usually it is not possible to start the chain in the stationary distribution; if it were easy to sample from that distribution directly, we would not need a Markov chain in the first place.

If we start the chain at some arbitrary value X_0, then for small n the distribution of the samples may be quite far away from the stationary distribution, and we had better discard the initial set of, say, T samples as being unrepresentative.

Knowing when to start collecting samples is a nontrivial task; we shall deal with this later (watch out for burn-in).

3 MCMC in Bayesian inference: ideas

As the name suggests, MCMC works by simulating a discrete-time Markov chain; it produces a dependent sequence (a chain) of random variables, {θ^{(i)}}_{i=1}^M, with approximate distribution

$$p(\theta^{(i)}) \approx \pi(\theta\mid y).$$

The chain is initialised with a user-defined starting value, θ^{(0)}.

The Markov property then specifies that the distribution of θ^{(i+1)} | θ^{(i)}, θ^{(i-1)}, ... depends only on the current state of the chain, θ^{(i)}.

It is fair to say that MCMC has revitalised (perhaps even revolutionised) Bayesian statistics. Why?

MCMC methods construct a Markov chain on the state space, θ ∈ Θ, whose steady-state distribution is the posterior of interest, π(θ|y).

MCMC procedures return a collection of M samples, {θ^{(1)}, ..., θ^{(M)}}, where each sample can be assumed to be drawn from π(θ|y): (with slight abuse of notation)

$$\Pr(\theta^{(i)} \in A) = \pi(\theta \in A\mid y)$$

for any set A ⊆ Θ, or

$$\theta^{(i)} \sim \pi(\theta\mid y) \quad \text{for } i = 1, \ldots, M.$$

We shall see that

• MCMC is a general method that simultaneously solves inference for {π(θ|y), π(θ_i|y), p(ỹ|y)};

• MCMC only requires evaluation of the joint distribution

$$\pi(y, \theta) \propto p(y\mid\theta)\,\pi(\theta),$$

up to proportionality, pointwise for any θ ∈ Θ;

• MCMC allows the modeller to concentrate on modelling; that is, to use models, π(y, θ), that you believe represent the true dependence structures in the data, rather than those that are simple to compute.

Example: Normal Linear Regression

We have seen that for the normal linear regression with known noise variance and prior π(β) = N(0, vI), the posterior is

$$\pi(\beta\mid y) = N(\beta\mid\hat\beta, \hat v), \qquad \hat\beta = (x'x + \sigma^2 v^{-1}I)^{-1}x'y, \qquad \hat v = \sigma^2(x'x + \sigma^2 v^{-1}I)^{-1}.$$

MCMC would approximate this distribution with M samples drawn from the posterior,

$$\{\beta^{(1)}, \ldots, \beta^{(M)}\} \sim N(\hat\beta, \hat v).$$
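
In this conjugate case we can sample the posterior directly; a sketch (the toy data and hyperparameters are assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma2, v = 100, 3, 1.0, 10.0            # assumed toy settings
x = rng.normal(size=(n, p))
y = x @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=np.sqrt(sigma2), size=n)

A = x.T @ x + sigma2 / v * np.eye(p)           # x'x + sigma^2 v^{-1} I
beta_hat = np.linalg.solve(A, x.T @ y)         # posterior mean
v_hat = sigma2 * np.linalg.inv(A)              # posterior covariance
samples = rng.multivariate_normal(beta_hat, v_hat, size=5000)
print(samples.mean(axis=0))                    # close to beta_hat
```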

Example: Logistic Regression - Titanic data

The data relate to 1,316 passengers who sailed on the Titanic's maiden and final voyage.

We have data records on whether each passenger survived or not, y_i ∈ {survived, died}, as well as three attributes of the passenger:

(1) Ticket class: {first, second, third}

(2) Age: {child, adult}

(3) Sex: {female, male}

We wish to perform a Bayesian analysis to see if there is an association between these attributes and the survival probability.

As stated before, the Bayesian analysis begins with the specification of a sampling distribution and prior.

Sampling density for Titanic survivals

Let y_i ∈ {0, 1} denote an indicator of whether the ith passenger survived or not.

We wish to relate the probability of survival,

$$P(y_i = 1),$$

to the passenger's covariate information, x_i = {class, age, sex} for the ith passenger.

That is, we wish to build a probability model for p(y_i|x_i).

A popular approach is to use a Generalised Linear Model (GLM), which defines this association to be linear on an appropriate scale, for instance,

$$P(y_i = 1\mid x_i) = g(\eta_i), \qquad \eta_i = x_i\beta,$$

where x_iβ = Σ_j x_{ij}β_j and g(·) is a monotone link function that maps the range of the linear predictor, η_i ∈ (−∞, ∞), onto the appropriate range, P(y_i|x_i) ∈ [0, 1].

There is a separate regression coefficient, β_j, associated with each predictor; in our case, β = (β_class, β_age, β_sex)'.

The most popular link function for binary regression (two-class classification), y_i ∈ {0, 1}, is the logit link, as it quantifies the log-odds:

$$\eta_i = \log\frac{P(y_i = 1\mid x_i)}{P(y_i = 0\mid x_i)}, \qquad \text{equivalently} \qquad P(y_i = 1\mid x_i) = g(\eta_i) = \frac{1}{1 + \exp(-\eta_i)},$$

where we note g(η_i) → 0 as η_i → −∞ and g(η_i) → 1 as η_i → ∞.

In this case, the value of the regression coefficients β quantifies the change in the log-odds for a unit change in the associated x.

This is attractive. Clearly β is unknown, and hence we shall adopt a prior, π(β).

It is usual to write the model in hierarchical form:

$$p(y_i\mid x_i) = g(\eta_i), \qquad \eta_i = x_i\beta, \qquad \beta \sim \pi(\beta).$$

We are interested in quantifying the statistical association between the survival probability and the attributes, via the posterior density,

$$\pi(\beta\mid y, x) \propto p(y\mid x, \beta)\,\pi(\beta) \propto \left[\prod_{i=1}^N p(y_i\mid x_i, \beta)\right]\pi(\beta),$$

which is not of standard form.
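
The posterior can nevertheless be evaluated pointwise up to proportionality, which is all MCMC needs. A sketch of the unnormalised log posterior (assuming y is coded 0/1 and the attributes have already been encoded as a numeric design matrix x; v = 10 is an assumed prior variance):

```python
import numpy as np

def log_post(beta, y, x, v=10.0):
    """Unnormalised log posterior for the logistic regression model
    with a N(0, vI) prior; v is an assumed hyperparameter."""
    eta = x @ beta
    loglik = np.sum(y * eta - np.log1p(np.exp(eta)))  # sum_i log p(y_i|x_i,beta)
    logprior = -0.5 * beta @ beta / v                 # N(0, vI), up to a constant
    return loglik + logprior
```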

To infer this we shall use a package known as WinBUGS, a Windows version of BUGS (Bayesian inference Using Gibbs Sampling).

4 MCMC in Bayesian inference: algorithms

In the previous chapter we presented an example of using MCMC for simulation-based inference.

Up to now we have not discussed the algorithms that lie behind MCMC and generate the samples.

First, recall that MCMC is an iterative procedure: given the current state of the chain, θ^{(i)}, the algorithm makes a probabilistic update to θ^{(i+1)}.

The general algorithm is

–MCMC Algorithm–
θ^{(0)} ← x
For i = 1 to M
    θ^{(i)} = f(θ^{(i-1)})
End

where f(·) outputs a draw from a conditional probability density.

The update, f(·), is made in such a way that the distribution p(θ^{(i)}) → π(θ|y), the target distribution, as i → ∞, for any starting value θ^{(0)}.

We shall consider two of the most general procedures for MCMC simulation from a target distribution, namely the Metropolis-Hastings algorithm and the Gibbs sampler.

4.1 The Metropolis-Hastings (M-H) algorithm

Metropolis et al. (1953) gave an algorithm for constructing a Markov chain whose stationary distribution is our target distribution π; this method was generalized by Hastings (1970).

Let the current state of the chain be θ^{(i)}.

Consider a (any) conditional density q(θ̃|θ^{(i)}), defined on θ̃ ∈ Θ (with the same dominating measure as the model).

We call q(·|θ^{(i)}) the proposal density, for reasons that will become clear.

We shall use q(·|θ^{(i)}) to update the chain as follows.

–M-H Algorithm–
θ^{(0)} ← x
For i = 0 to M
    Draw θ̃ ∼ q(θ̃|θ^{(i)})
    Set θ^{(i+1)} ← θ̃ with probability α(θ^{(i)}, θ̃), where
        α(a, b) = min{1, [π(b|y) q(a|b)] / [π(a|y) q(b|a)]}
    Else set θ^{(i+1)} ← θ^{(i)}
End

It can be shown that the Markov chain (θ^{(i)}), i = 1, 2, ..., will indeed have π(θ|y) as its stationary distribution:
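
A minimal random-walk Metropolis sketch in Python (the step size, iteration count, and the N(0, 1) example target are assumptions chosen for illustration):

```python
import numpy as np

def metropolis(log_target, theta0, n_iter=10_000, step=0.5, seed=0):
    """Random-walk Metropolis with a symmetric normal proposal (a sketch)."""
    rng = np.random.default_rng(seed)
    theta = np.atleast_1d(np.asarray(theta0, dtype=float))
    lp = log_target(theta)
    samples = np.empty((n_iter, theta.size))
    for i in range(n_iter):
        prop = theta + step * rng.normal(size=theta.size)  # draw from q(.|theta)
        lp_prop = log_target(prop)
        # symmetric q cancels in alpha: accept w.p. min(1, pi(prop)/pi(theta))
        if np.log(rng.uniform()) < lp_prop - lp:
            theta, lp = prop, lp_prop
        samples[i] = theta                                 # repeats count too
    return samples

# usage: target proportional to exp(-theta^2 / 2), i.e. a N(0, 1) "posterior"
draws = metropolis(lambda t: -0.5 * float(t @ t), theta0=[5.0])
print(draws[2000:].mean(), draws[2000:].std())  # roughly 0 and 1 after burn-in
```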

Why does it work?

The key idea is reversibility or detailed balance:

In general, the target distribution π is invariant for P if for all x, y in the state space the detailed balance equation holds:

$$\pi(x)P(x, y) = \pi(y)P(y, x).$$

We check that the M-H sampler satisfies detailed balance. Let P be the transition kernel of the M-H chain. Then, for a ≠ b,

$$\pi(a\mid y)P(a, b) = \pi(a\mid y)\,q(b\mid a)\,\alpha(a, b) = \min\big(\pi(a\mid y)\,q(b\mid a),\; \pi(b\mid y)\,q(a\mid b)\big),$$

and this expression is symmetric in a, b; hence

$$\pi(a\mid y)P(a, b) = \pi(b\mid y)P(b, a),$$

and detailed balance is satisfied.

Note:

• There is a positive probability, 1 − α(θ^{(i)}, θ̃), of remaining in the same state; and this counts as an extra iteration.

• The process looks like a stochastic hill-climbing algorithm. You always accept the proposal if π(b|y)q(a|b) / [π(a|y)q(b|a)] > 1; else you accept with that probability (defined by the ratio).

• The acceptance term corrects for the fact that the proposal density is not the target distribution.

To accept with probability π(b|y)q(a|b) / [π(a|y)q(b|a)]:

First, draw a random variable, say U, uniform on [0, 1].

IF U < α(θ^{(i)}, θ̃),
THEN accept θ̃;
ELSE reject, and the chain stays at θ^{(i)}.

The ratio of densities means that the normalising constant p(y) = ∫ f(y|θ)π(θ)dθ cancels, top and bottom. Hence we can use MCMC when the normalizing constant is unknown (as is often the case).

In the special case of a symmetric proposal density (the original Metropolis algorithm), q(a|b) = q(b|a), for example q(a|b) = N(a|b, 1), the ratio reduces to that of the posterior densities:

$$\alpha(a, b) = \min\left\{1, \frac{\pi(b\mid y)}{\pi(a\mid y)}\right\}.$$

The proposal density, q(a|b), is user-defined. Choosing it is more of an art than a science.

Pretty much any q(a|b) will do, so long as it gets you around the state space Θ. However, different q(a|b) lead to different levels of performance in terms of convergence rates to the target distribution and exploration of the model space.

Choices for q(a|b)

Clearly q(a|b) = π(θ|y) leads to an acceptance probability of 1 for all moves, and the samples are i.i.d. from the posterior. But the reason we are using MCMC is that we do not know how to draw from π(θ|y).

There is a trade-off: we would like "large" jumps (updates), so that the chain explores the state space, but large jumps usually have low acceptance probability, as the posterior density can be highly peaked.

As a rule of thumb, we set the spread of q(·) to be as large as possible without leading to very small acceptance rates, say < 0.1.

Finally, q(a|b) should be easy to simulate and evaluate.

It is usual to "centre" the proposal density around the current state and make "local" moves. A popular choice when θ is real-valued is to take q(a|b) = N(a|b, V), where V is user-specified; that is, a normal density centred at the current state b.

Warning. The Metropolis-Hastings algorithm is a general approach to sampling from a target density, in our case π(θ|y). However, it requires a user-specified proposal density q(a|b), and the acceptance rates must be continuously monitored for low and high values. This is not good for automated models (software).

4.2 The Gibbs Sampler

An important alternative approach is available in the following circumstances:

Suppose that the multidimensional θ can be partitioned into p subvectors, θ = {θ_1, ..., θ_p}, such that each conditional distribution

$$\pi(\theta_j\mid\theta_{-j}, y)$$

is easy to sample from, where θ_{-j} = θ \ θ_j.

Iterating over the p subvectors and updating each subvector in turn using π(θ_j|θ_{-j}, y) leads to a valid MCMC scheme known as the Gibbs Sampler, provided that the chain remains irreducible and aperiodic.

–Gibbs Sampler–
θ^{(0)} ← x
For i = 0 to M
    Set θ̃ ← θ^{(i)}
    For j = 1 to p
        Draw X ∼ π(θ_j|θ̃_{-j}, y)
        Set θ̃_j ← X
    End
    Set θ^{(i+1)} ← θ̃
End

Note:

The Gibbs Sampler is a special case of the Metropolis-Hastings algorithm using the ordered sub-updates q(·) = π(θ_j|θ_{-j}, y).

All proposed updates are accepted (there is no accept-reject step).

θ_j may be multidimensional or univariate.

Often, π(θ_j|θ_{-j}, y) will have standard form even if π(θ|y) does not.

Example: normal linear regression

Consider again the normal linear regression model discussed in Chapter 1,

$$y = x\beta + \epsilon,$$

where ε ∼ N(0, σ²I). Alternatively,

$$y \sim N(y\mid x\beta, \sigma^2 I);$$

we now assume that σ is unknown.

As before, we construct a joint model for the data and unknown parameters,

$$p(y, \beta, \sigma^2\mid x) = f(y\mid x, \beta, \sigma^2)\,\pi(\beta, \sigma^2\mid x) = N(y\mid x\beta, \sigma^2 I)\,\pi(\beta)\,\pi(\sigma^2),$$

where we have assumed that β and σ² are independent a priori.

Suppose we take

$$\pi(\beta) = N(\beta\mid 0, vI), \qquad \pi(\sigma^2) = IG(\sigma^2\mid a, b),$$

where IG(·|a, b) denotes the Inverse-Gamma density,

$$IG(x\mid a, b) \propto x^{-(a-2)/2}\exp\big(-b/(2x)\big).$$

Then the joint posterior density is

$$p(\beta, \sigma^2\mid y) \propto f(y\mid\beta, \sigma^2)\,\pi(\beta)\,\pi(\sigma^2) \propto (\sigma^2)^{-n/2}\exp\left[-\frac{1}{2\sigma^2}(y-x\beta)'(y-x\beta)\right] \times |vI|^{-1/2}\exp\left[-(2v)^{-1}\beta'\beta\right] \times (\sigma^2)^{-(a-2)/2}\exp\big(-b/(2\sigma^2)\big).$$

This is not a standard distribution!

However, the full conditionals are known:

$$\pi(\beta\mid y, \sigma^2) = N(\beta\mid\hat\beta, \hat v), \qquad \hat\beta = (x'x + \sigma^2 v^{-1}I)^{-1}x'y, \qquad \hat v = \sigma^2(x'x + \sigma^2 v^{-1}I)^{-1},$$

and

$$\pi(\sigma^2\mid\beta, y) = IG(\sigma^2\mid a + n, b + SS), \qquad SS = (y - x\beta)'(y - x\beta).$$

Hence the Gibbs sampler can be adopted:

–Gibbs Sampler, normal linear regression–
(β, σ²)^{(0)} ← x
For i = 0 to M
    Set (β̃, σ̃²) ← (β, σ²)^{(i)}
    Draw β̃ | σ̃² ∼ N(β|β̂, v̂)
    Draw σ̃² | β̃ ∼ IG(σ²|a + n, b + SS)
    Set (β, σ²)^{(i+1)} ← (β̃, σ̃²)
End
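
A Python sketch of this sampler (v, a, b are assumed hyperparameters; σ² is drawn via its reciprocal using the common Inverse-Gamma shape/rate convention, which may differ by a constant from the parameterisation above):

```python
import numpy as np

def gibbs_nlr(y, x, v=10.0, a=2.0, b=2.0, n_iter=5000, seed=0):
    """Gibbs sampler for normal linear regression with unknown sigma^2
    (a sketch under the assumptions stated in the lead-in)."""
    rng = np.random.default_rng(seed)
    n, p = x.shape
    sigma2 = 1.0
    xtx, xty = x.T @ x, x.T @ y
    betas, sig2s = np.empty((n_iter, p)), np.empty(n_iter)
    for i in range(n_iter):
        # beta | sigma2, y ~ N(beta_hat, v_hat)
        A = xtx + sigma2 / v * np.eye(p)
        beta_hat = np.linalg.solve(A, xty)
        beta = rng.multivariate_normal(beta_hat, sigma2 * np.linalg.inv(A))
        # sigma2 | beta, y ~ IG(a + n, b + SS): draw the precision from a Gamma
        ss = float((y - x @ beta) @ (y - x @ beta))
        sigma2 = 1.0 / rng.gamma((a + n) / 2.0, 2.0 / (b + ss))
        betas[i], sig2s[i] = beta, sigma2
    return betas, sig2s
```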

Example: hierarchical normal linear regression

Consider again the normal linear regression model

$$y = x\beta + \epsilon,$$

where ε ∼ N(0, σ²I).

We now assume that both σ and the prior variance v of π(β) are unknown.

In hierarchical form we write:

$$y \sim N(y\mid x\beta, \sigma^2 I), \qquad \beta \sim N(\beta\mid 0, vI), \qquad \sigma^2 \sim IG(\sigma^2\mid a, b), \qquad v \sim IG(v\mid c, d),$$

where IG(·|a, b) denotes the Inverse-Gamma density,

$$IG(x\mid a, b) \propto x^{-(a-2)/2}\exp\big(-b/(2x)\big).$$

Note the "hierarchy" of dependencies.

Then the joint posterior density is

$$\pi(\beta, \sigma^2, v\mid y) \propto f(y\mid\beta, \sigma^2)\,\pi(\beta\mid v)\,\pi(\sigma^2)\,\pi(v) \propto (\sigma^2)^{-n/2}\exp\left[-\frac{1}{2\sigma^2}(y-x\beta)'(y-x\beta)\right] \times |vI|^{-1/2}\exp\left[-(2v)^{-1}\beta'\beta\right] \times (\sigma^2)^{-(a-2)/2}\exp\big(-b/(2\sigma^2)\big) \times v^{-(c-2)/2}\exp\big(-d/(2v)\big).$$

Again, this is not a standard distribution!

However, the full conditionals are known:

$$\pi(\beta\mid y, \sigma^2, v) = N(\beta\mid\hat\beta, \hat v), \qquad \hat\beta = (\sigma^{-2}x'x + v^{-1}I)^{-1}\sigma^{-2}x'y, \qquad \hat v = (\sigma^{-2}x'x + v^{-1}I)^{-1};$$

$$\pi(\sigma^2\mid\beta, y) = IG(\sigma^2\mid a + n, b + SS), \qquad SS = (y - x\beta)'(y - x\beta);$$

$$\pi(v\mid\beta) = IG(v\mid c + p, d + SB), \qquad SB = \beta'\beta,$$

where p is the number of predictors (the length of the β vector).

Hence the Gibbs sampler can be adopted:

–Gibbs Sampler, hierarchical normal linear regression–
{β, σ², v}^{(0)} ← x
For i = 0 to M
    Set (β̃, σ̃², ṽ) ← {β, σ², v}^{(i)}
    Draw β̃ | σ̃², ṽ ∼ N(β|β̂, v̂)
    Draw σ̃² | β̃ ∼ IG(σ²|a + n, b + SS)
    Draw ṽ | β̃ ∼ IG(v|c + p, d + SB)
    Set {β, σ², v}^{(i+1)} ← (β̃, σ̃², ṽ)
End

When the conditionals do not have standard form, we can usually perform univariate updates (as there are a variety of methods for univariate sampling from a target density).

Some Issues:

The Gibbs sampler is automatic (no user-set parameters), which is good for software such as WinBUGS.

But M-H is more general, and if the dependence in the full conditionals, π(θ_j|θ_{-j}, y), is strong, the Gibbs sampler can be very slow to move around the space; a joint M-H proposal may then be more efficient. The choice of the subvectors can affect this.

We can combine the two in a hybrid sampler, updating some components using Gibbs and others using M-H.

5 Output analysis and diagnostics

In an ideal world, our simulation algorithm would return i.i.d. samples from the target (posterior) distribution.

However, MCMC simulation has two shortcomings:

1. The distribution of the samples, p(θ^{(i)}), only converges to the target distribution as i grows.

2. The samples are dependent.

In this chapter we shall consider how to deal with these issues. We first consider the problem of convergence.

5.1 Convergence and burn-in

Recall that MCMC is an iterative procedure, such that, given the current state of the chain, θ^{(i)}, the algorithm makes a probabilistic update to θ^{(i+1)}.

The update, f(·), is made in such a way that the distribution p(θ^{(i)}) → π(θ|y), the target distribution, as i → ∞, for any starting value θ^{(0)}.

Hence, the early samples are strongly influenced by the distribution of θ^{(0)}, which presumably is not drawn from π(θ|y).

The accepted practice is to discard an initial set of samples as being unrepresentative of the stationary distribution of the Markov chain (the target distribution). That is, the first B samples,

{θ^{(0)}, θ^{(1)}, ..., θ^{(B)}},

are discarded.

This user-defined initial portion of the chain is known as the burn-in phase of the chain.

The value of B, the length of the burn-in, is determined by you, using various convergence diagnostics which provide evidence that p(θ^{(B+1)}) and π(θ|y) are in some sense "close".

It is worth emphasising from the beginning that in practice no general exact tests for convergence exist.

Tests for convergence should more formally be called tests for lack of convergence. That is, as in hypothesis testing, we can usually only detect when it looks like convergence has NOT yet been met.

Remember: all possible sample paths are indeed possible.

Available convergence diagnostics

WinBUGS bundles a collection of convergence diagnostics and sample output analysis programs in a menu-driven set of S-Plus functions, called CODA. CODA implements a set of routines for

• graphical analysis of samples;

• summary statistics; and

• formal tests for convergence.

We shall consider the graphical analysis and convergence tests; for more details see the CODA documentation at

http://www.mrc-bsu.cam.ac.uk/bugs/documentation/Download/cdaman03.pdf

Graphical Analysis

The first step in any output analysis is to eyeball sample traces from various variables, {θ_j^{(1)}, ..., θ_j^{(M)}}, for a set of key variables j: the trace plot or history plot.

There should be
- no continuous drift, and
- no strong autocorrelation
in the sequence of values following burn-in (as the samples are supposed to follow the same distribution).

Usually, θ^{(0)} is far away from the major support of the posterior density. Initially, then, the chain will often be seen to "migrate" away from θ^{(0)} towards a region of high posterior probability centred around a mode of π(θ|y).

If the model has converged, the trace plot will move like a snake around the mode of the distribution.

The time taken to settle down to the region of a mode is certainly the very minimum lower limit for B.

The trace is not easy to interpret if there are very many points.

The trace can be easier to interpret if it is summarized by
- the cumulative posterior median, and upper and lower credible intervals (say, at the 95% level), or
- moving averages.

If the model has converged, additional samples from the posterior distribution should not influence the calculation of the mean. Running means will reveal whether the posterior mean has settled to a particular value.

Kernel density plots

Sometimes non-convergence is reflected in a multimodal distribution. A "lumpy" posterior may indicate non-convergence.

However, do not assume that the chain has converged just because the posteriors "look smooth".

Another useful visual check is to partition the sample chain into k blocks,

{{θ^{(0)}, ..., θ^{(M/k)}}, ..., {·, ..., θ^{(M)}}},

and use kernel density estimates of the within-block distributions to look for continuity/stability in the estimates.

Autocorrelation plots

Autocorrelation plots show the serial correlation in the chain. Some correlation between adjacent values will arise due to the Markov nature of the algorithm. Increasing the run length should reduce the autocorrelation.

The presence of correlation indicates that the samples are not effective in moving around the entire posterior distribution.

The autocorrelation will be high if
- the jump function does not jump far enough, or
- the jump function jumps too far, into regions of low density.

If the level of autocorrelation is high for a parameter of interest, then a trace plot will be a poor diagnostic for convergence.

Formal convergence diagnostics

CODA offers four formal tests for convergence, perhaps the two most popular being those proposed by Geweke and those of Gelman and Rubin, improved by Brooks and Gelman.

Geweke's test

Geweke (1992) proposed a convergence test based on a time-series analysis approach. It is a formal way to interpret the trace.

Informally, if the chain has reached convergence, then statistics from different portions of the chain should be close.

For a (function of the) variable of interest, the chain is subdivided into two "windows" containing the initial x% (the CODA default is 10%) and the final y% (the CODA default is 50%).

If the chain is stationary, the expectations (means) of the values in the two windows should be similar.

Geweke describes a test statistic based on a standardised difference in sample means. The test statistic has a standard normal sampling distribution if the chain has converged.

Gelman & Rubin's test

Gelman and Rubin (GR) (1992) proposed a convergence test based on output from two or more multiple runs of the MCMC simulation. This approach was improved by Brooks and Gelman (1998).

BGR is perhaps the most popular diagnostic used today.

The approach uses several chains from different starting values. The method compares the within-chain and between-chain variances for each variable. When the chains have "mixed" (converged), the variance within each sequence and the variance between sequences for each variable will be roughly equal.

G&R derive a statistic which measures the potential improvement, in terms of the estimate of the variance in the variable, which could be achieved by running the chains to infinity.

When little improvement could be gained, the chains are taken as having converged.

However, it is possible that the within-chain variance and the between-chain variance are roughly equal while the pooled and within-chain confidence interval widths do not converge to stability. The improved BGR procedure is as follows.

1. Generate m ≥ 2 MCMC chains, each with different initial values.

2. Exclude the burn-in period, and iterate for an n-iteration monitored period.

3. From each individual chain the empirical (1 − α) CI width is calculated, that is, the difference between the α/2 and (1 − α/2) empirical quantiles of the first n simulations. We obtain m within-sequence interval-width estimates.

4. From the entire set of mn observations (pooled), the empirical (1 − α) CI width is calculated.

5. R̂ is defined as

$$\hat R = \frac{\text{width of pooled interval}}{\text{mean width of within-sequence intervals}}.$$

Usually, for small n, R̂ > 1 if the initial values are chosen dispersed enough. The statistic R̂ approaches 1 as the chains converge.

The option bgr diag in WinBUGS calculates the R̂-based diagnostics with α = 0.2, recalculated after every 50 iterations. The width of the central 80% interval of the pooled runs is plotted in green, the average width of the 80% intervals within the individual runs in blue, and their ratio R̂ in red; for plotting purposes the pooled and within-interval widths are normalised to have an overall maximum of one. Convergence is achieved when R̂ is close to 1 and both the pooled and within-interval widths are stable. The values can be listed to a window by double-clicking on the figure followed by ctrl-left-mouse-click on the window.
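
A sketch of this interval-based R̂ computation (the function name and the (m, n) array layout are assumptions; WinBUGS and CODA implement their own versions):

```python
import numpy as np

def bgr_rhat(chains, alpha=0.2):
    """Interval-based BGR statistic: width of the central (1 - alpha)
    interval of the pooled draws divided by the mean width of the
    within-chain intervals. `chains` has shape (m, n)."""
    chains = np.asarray(chains)
    q = [100 * alpha / 2, 100 * (1 - alpha / 2)]
    lo, hi = np.percentile(chains.ravel(), q)       # pooled interval
    w_lo, w_hi = np.percentile(chains, q, axis=1)   # per-chain intervals
    return (hi - lo) / np.mean(w_hi - w_lo)
```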

Other tests

Heidelberger-Welch: tests for stationarity of the chain.

Raftery-Lewis: based on how many iterations are necessary to estimate the posterior for a given quantity.

Formal tests for convergence should not be taken without question as evidence for convergence. Graphical plots and examining posterior distributions for stability should always be employed for key (functions of) variables of interest.

Warning: Convergence does not mean that you have a good model!

Tricks to speed up convergence

Standardize all your variables, by subtracting their sample means and dividing by their sample standard deviations. This decreases the posterior correlation between parameters.

Use the WinBUGS over-relax algorithm. This generates multiple samples at each iteration and then selects one that is negatively correlated with the current value. The time per iteration increases, but the within-chain correlations should be reduced, and hence fewer iterations may be necessary. However, this method is not always effective.

Pick good initial values. If your initial values are close to their posterior modes, then convergence should occur relatively quickly.

Just wait. Sometimes models just take a long time to converge.

5.2 Tests for dependence in the chain

MCMC produces a set of dependent samples (conditionally Markov).

The Theory

From the central limit result for Markov chains we have that

$$\left\{\bar f(\theta^{(\cdot)}) - E[f(\theta)]\right\} \to N(0, \sigma_f^2/M),$$

where f̄(θ^{(·)}) denotes the empirical estimate of the statistic of interest using the M MCMC samples,

$$\bar f(\theta^{(\cdot)}) = \frac{1}{M}\sum_{i=1}^M f(\theta^{(i)}),$$

and E[f(θ)] denotes the true unknown expectation. We assume that the chain is aperiodic and irreducible, and that σ_f² < ∞.

The variance of the estimator, σ_f², is given by

$$\sigma_f^2 = \sum_{s=-\infty}^{\infty} \mathrm{cov}\big[f(\theta^{(i)}), f(\theta^{(i+s)})\big].$$

Hence, the greater the covariance between samples, the greater the variance of the MCMC estimator (for a given sample size M).

In Practice

The variance parameter σ_f² can be approximated using the sample autocovariances.

Plots of autocorrelations within chains are extremely useful.

High autocorrelations indicate slow mixing (movement around the parameter space), with increased variance in the MCMC estimators (and usually slower convergence).

A useful statistic is the Effective Sample Size,

$$ESS = \frac{M}{1 + 2\sum_{j=1}^k \rho(j)},$$

where M is the number of post burn-in MCMC samples and Σ_{j=1}^k ρ(j) is the sum of the first k monotone sample autocorrelations.

The ESS can be estimated from the sample autocorrelation function; the ESS estimates the reduction in the effective number of samples, compared to i.i.d. samples, due to the autocorrelation in the chain.

The ESS is a good way to compare competing MCMC strategies if you standardise for CPU run time.
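
A sketch of an ESS estimate from the sample autocorrelation function (the truncation rule here, stopping at the first non-positive autocorrelation or at lag k_max, is a simplification of the "first k monotone" rule above):

```python
import numpy as np

def ess(x, k_max=200):
    """Effective sample size estimated from sample autocorrelations."""
    x = np.asarray(x, dtype=float)
    M = x.size
    xc = x - x.mean()
    acov = np.correlate(xc, xc, mode="full")[M - 1:] / M  # autocovariances
    rho = acov / acov[0]                                  # autocorrelations
    s = 0.0
    for j in range(1, min(k_max, M - 1)):
        if rho[j] <= 0:                                   # truncate the sum
            break
        s += rho[j]
    return M / (1 + 2 * s)
```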

We call

$$\mathrm{Eff} = \frac{1}{1 + 2\sum_{j=1}^k \rho(j)},$$

that is, the ratio of the Effective Sample Size (ESS) to the number of samples generated (M), the efficiency of the MCMC.

The maximum efficiency of the MCMC is 1 and the minimum is 0.

The ESS is generally smaller than the size of the MCMC sample.

Estimating the ESS and the efficiency can be done only on the sample from the stationary distribution!

If run time is not an issue, but storage is, it is useful to thin the chain by saving only one in every T samples; clearly this will reduce the autocorrelations in the saved samples.

6 Concluding remarks

Bayesian data analysis treats all unknowns as random variables.

Probability is the central tool used to quantify all measures of uncertainty.

Bayesian data analysis is about propagating uncertainty, from prior to posterior (using Bayes' theorem).

Often the posterior will not be of standard form (for example, when the prior is non-conjugate).

In these circumstances, sample-based simulation offers a powerful tool for inference.

MCMC is (currently) the most general technique for obtaining samples from any posterior density, though it should not be used blindly!

WinBUGS is a user-friendly (free) package to construct Bayesian data models and perform MCMC.
