mcmc

An Introduction to Markov Chain Monte Carlo
Vasin,Rachel,Jinchao
April 5, 2016

Outline
1 Introduction
Motivation
Statistical Mechanics Example
Bayesian Example
Monte Carlo Methods
2 Markov Chain Monte Carlo
Gibbs Sampling
Metropolis Method
Metropolis-Hastings algorithm
3 MCMC Diagnostic

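Now let us recall an example from the first class.
Probability of the configuration $\sigma_0$ at inverse temperature $\beta = \frac{1}{kT}$:
$$\mu_{\Lambda,\beta}(\sigma = \sigma_0) = \frac{1}{Z_{\Lambda,\beta}} \exp(-\beta H_N(\sigma_0))\,P_N(\sigma = \sigma_0)$$
where the Hamiltonian is
$$H_N(\sigma) = -\frac{1}{2} \sum_{x \neq y} J(x, y)\,\sigma(x)\,\sigma(y) + h \sum_{x} \sigma(x),$$
the partition function is
$$Z_{\Lambda,\beta} = \sum_{\sigma_0} \exp(-\beta H_N(\sigma_0))\,P_N(\sigma = \sigma_0),$$
and the prior distribution is
$$P_N(\sigma = \sigma_0) = \prod_{x \in \Lambda} P(\sigma(x) = \sigma_0(x)).$$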

We want to compute the mean value of an observable $A$:
$$E[A] = \int A(\sigma)\,\mu_{\Lambda,\beta}(d\sigma).$$
However, the lattice size is usually large ($N \gg 1$), and the partition function $Z_{\Lambda,\beta}$ is hard to compute.
Thus, we try to sample from the Gibbs distribution without computing the normalizing constant and approximate the integral numerically from the samples.

Bayesian Example
Observed data $Y_1, \ldots, Y_n$ follow some distribution $p(y \mid \theta)$
$\theta$ is an unknown parameter
The posterior distribution $p(\theta \mid y)$ provides information about $\theta$
By Bayes' rule:
$$p(\theta \mid y) = \frac{p(y \mid \theta)\,p(\theta)}{\int_{\Theta} p(y \mid \theta)\,p(\theta)\,d\theta}$$
If we assume a prior distribution for $\theta$, say $U(0, 1)$, we can use MCMC methods to approximate $p(\theta \mid y)$

Bayesian Example
Construct a Markov chain whose stationary distribution is $p(\theta \mid y)$
End with a sample $(\theta_1, \ldots, \theta_M)$ that can be treated as draws from $p(\theta \mid y)$
A histogram or time series plot of the sample can be used to analyze it
Popular applications include linear regression and biomedical studies

The problems to be solved
The fundamental problem that we wish to solve is finding the expectation of some function $f(z)$ with respect to a probability distribution $p(z)$, i.e.
$$E[f] = \int f(z)\,p(z)\,dz.$$
The general idea of Monte Carlo methods is to obtain a set of samples $z^{(l)}$ (where $l = 1, \ldots, L$) drawn independently from the distribution $p(z)$. This allows the expectation to be approximated by a finite sum
$$\hat{f} = \frac{1}{L} \sum_{l=1}^{L} f(z^{(l)}).$$
The methods we can use to generate the samples include rejection sampling, importance sampling, etc.
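The finite-sum approximation can be sketched in a few lines of Python. This is only an illustration: the helper name `monte_carlo_mean` and the test case ($f(z) = z^2$ with $z \sim N(0, 1)$, whose exact expectation is 1) are assumptions chosen for the example and do not come from the slides.

```python
import random

def monte_carlo_mean(f, sampler, n_samples, seed=0):
    """Approximate E[f(z)] by averaging f over independent draws z ~ p."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        total += f(sampler(rng))
    return total / n_samples

# Illustrative case: E[z^2] = 1 exactly when z ~ N(0, 1).
estimate = monte_carlo_mean(
    f=lambda z: z * z,
    sampler=lambda rng: rng.gauss(0.0, 1.0),
    n_samples=100_000,
)
print(estimate)
```

With independent draws the standard error shrinks like $1/\sqrt{L}$, so the printed value should land close to 1.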

Markov Chain Monte Carlo
The problem, however, is that independent samples $\{z^{(l)}\}$ may not be available.
Suppose we want to draw from our posterior distribution $p(\theta \mid y)$, but we cannot sample independent draws from it. For example, we often do not know the normalizing constant.
So we consider sampling slightly dependent draws using a Markov chain; we can still estimate quantities of interest from those draws.
MCMC is a class of methods using this idea.

Markov Chain Monte Carlo
Set up an ergodic Markov chain $\{z^{(l)}\}$ such that
$$\lim_{l \to \infty} K^{l}(z^{(0)}, z) = p(z) \quad \text{(stationary distribution)}$$
irrespective of the choice of the initial value $z^{(0)}$. Here $K$ is the transition kernel.
Then, by the Ergodic Theorem,
$$\lim_{L \to \infty} \frac{1}{L} \sum_{l=1}^{L} f(z^{(l)}) = \int f(z)\,p(z)\,dz.$$
In Bayesian statistics, there are generally two MCMC algorithms that we use: the Gibbs sampler and the Metropolis-Hastings algorithm.

Gibbs Sampling
MCMC method that indirectly samples from joint distribution
Requires initial values for each random variable
Uses conditional distributions for sampling
Updates one value per time step
Sample eventually converges to the desired joint distribution
Can also approximate marginal distributions

Gibbs Sampling
Procedure:
1 Identify the joint distribution $p(z) = p(z_1, \ldots, z_k)$ from which to sample
2 Select initial values for $z_1, \ldots, z_k$. Call them $z_1^{(0)}, \ldots, z_k^{(0)}$.
3 Draw a new value for $z_1$, call it $z_1^{(1)}$, from the conditional distribution $p(z_1 \mid z_2^{(0)}, \ldots, z_k^{(0)})$
4 Draw a new value for $z_2$, call it $z_2^{(1)}$, from the conditional distribution $p(z_2 \mid z_1^{(1)}, z_3^{(0)}, \ldots, z_k^{(0)})$
5 Repeat for all $z_i$, updating the conditioned values each time
6 Repeat steps 3-5 a large number of times $T$

Gibbs Sampling
Example:
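$$f(x, y) = \frac{1}{2\pi\sigma_x\sigma_y\sqrt{1-\rho^2}} \exp\left(-\frac{1}{2(1-\rho^2)}\left[\frac{(x-\mu_x)^2}{\sigma_x^2} + \frac{(y-\mu_y)^2}{\sigma_y^2} - \frac{2\rho(x-\mu_x)(y-\mu_y)}{\sigma_x\sigma_y}\right]\right)$$
Bivariate normal with $\mathrm{Corr}(X, Y) = \rho$
Suppose $\mu_x = \mu_y = 0$ and $\sigma_x = \sigma_y = 1$
$$f(x \mid y) = \frac{f(x, y)}{\int_{\mathbb{R}} f(x, y)\,dx} = \frac{1}{\sqrt{2\pi(1-\rho^2)}} \exp\left(-\frac{(x-\rho y)^2}{2(1-\rho^2)}\right)$$
$$f(y \mid x) = \frac{f(x, y)}{\int_{\mathbb{R}} f(x, y)\,dy} = \frac{1}{\sqrt{2\pi(1-\rho^2)}} \exp\left(-\frac{(y-\rho x)^2}{2(1-\rho^2)}\right)$$
Initialize $x^{(0)} = y^{(0)} = 0$
Let $x^{(t)}$ be a random draw from $N(\rho y^{(t-1)}, 1 - \rho^2)$; for the first step this is $N(0, 1 - \rho^2)$
Let $y^{(t)}$ be a random draw from $N(\rho x^{(t)}, 1 - \rho^2)$
Repeat the last 2 steps 10000 times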
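The bivariate normal example can be sketched as follows. This is a minimal illustration rather than the presenters' own code: the function names and the value $\rho = 0.6$ are assumptions, and the second argument of `random.gauss` is a standard deviation, so the conditional variance $1 - \rho^2$ enters through its square root.

```python
import math
import random

def gibbs_bivariate_normal(rho, n_iter=10_000, seed=0):
    """Gibbs sampler for a standard bivariate normal with correlation rho."""
    rng = random.Random(seed)
    cond_sd = math.sqrt(1.0 - rho * rho)  # sd of each full conditional
    x, y = 0.0, 0.0                       # initial values
    xs, ys = [], []
    for _ in range(n_iter):
        x = rng.gauss(rho * y, cond_sd)   # draw x | y
        y = rng.gauss(rho * x, cond_sd)   # draw y | x, using the new x
        xs.append(x)
        ys.append(y)
    return xs, ys

def sample_correlation(xs, ys):
    """Plain sample correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)

xs, ys = gibbs_bivariate_normal(rho=0.6)
print(sample_correlation(xs, ys))
```

If the sampler behaves as intended, the printed sample correlation should be close to the chosen $\rho$.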

Gibbs Sampling

Metropolis Method
Now suppose we have a posterior $p(\theta \mid y)$ that we want to sample from, but
the posterior is not a distribution we know how to sample from
some (or all) of the full conditionals are not distributions we know (so Gibbs sampling cannot be used for the components whose full conditionals we don't know)
Then we can consider using the Metropolis Method.

Metropolis Method
The Metropolis Method (Metropolis et al., 1953) proceeds as follows:
Choose a starting value $z^{(1)}$.
At the current state $z^{(\tau)}$, generate a candidate $z^*$ from a proposal distribution $q(z \mid z^{(\tau)})$ that depends on the current state.
Accept the candidate sample with probability
$$A(z^*, z^{(\tau)}) = \min\left(1, \frac{p(z^*)}{p(z^{(\tau)})}\right).$$
If the candidate sample is accepted, then $z^{(\tau+1)} = z^*$; otherwise the candidate point $z^*$ is discarded, $z^{(\tau+1)}$ is set to $z^{(\tau)}$, and another candidate sample is drawn from the distribution $q(z \mid z^{(\tau+1)})$.

Comments
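For the Metropolis Method, the proposal distribution is symmetric, that is, $q(z_A \mid z_B) = q(z_B \mid z_A)$ for all values of $z_A$ and $z_B$.
A common choice of proposal distribution is the normal distribution $N(z^{(\tau)}, \sigma)$.
Note that if the step from $z^{(\tau)}$ to $z^*$ causes an increase in the value of $p(z)$, then the candidate point is certain to be kept.
We can write $p(z) = \tilde{p}(z)/Z_p$, and we will assume that $\tilde{p}(z)$ can readily be evaluated for any given value of $z$, although the value of $Z_p$ may be unknown. For example, for
$$\mu_{\Lambda,\beta}(\sigma = \sigma_0) = \frac{1}{Z_{\Lambda,\beta}} \exp(-\beta H_N(\sigma_0))\,P_N(\sigma = \sigma_0),$$
we have, for a candidate configuration $\sigma'$ with $\Delta H = H_N(\sigma') - H_N(\sigma)$ and a prior $P_N$ that gives both configurations equal weight,
$$A(\sigma, \sigma') = \min\left(1, \frac{\mu(\sigma')}{\mu(\sigma)}\right) = \min(1, \exp(-\beta \Delta H)).$$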
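The acceptance rule $\min(1, \exp(-\beta \Delta H))$ can be illustrated with single-spin-flip updates. The sketch below makes several assumptions that are not in the slides: a one-dimensional ring of at least three spins, nearest-neighbour coupling $J$ only, a uniform prior $P_N$, and hypothetical function names. Under those assumptions the Hamiltonian above reduces to $-J \sum_i \sigma_i \sigma_{i+1} + h \sum_i \sigma_i$.

```python
import math
import random

def energy(spins, J=1.0, h=0.0):
    """H = -J * sum_i s_i s_{i+1} + h * sum_i s_i on a ring (periodic boundary)."""
    n = len(spins)
    pair_sum = sum(spins[i] * spins[(i + 1) % n] for i in range(n))
    return -J * pair_sum + h * sum(spins)

def delta_energy(spins, i, J=1.0, h=0.0):
    """Energy change H(flipped) - H(current) when only spin i is flipped."""
    n = len(spins)
    neighbours = spins[(i - 1) % n] + spins[(i + 1) % n]
    return 2.0 * J * spins[i] * neighbours - 2.0 * h * spins[i]

def metropolis_sweep(spins, beta, rng, J=1.0, h=0.0):
    """Propose a flip at n randomly chosen sites, updating spins in place."""
    n = len(spins)
    for _ in range(n):
        i = rng.randrange(n)
        dH = delta_energy(spins, i, J, h)
        # Accept with probability min(1, exp(-beta * dH)).
        if dH <= 0.0 or rng.random() < math.exp(-beta * dH):
            spins[i] = -spins[i]
    return spins

rng = random.Random(0)
spins = [rng.choice([-1, 1]) for _ in range(20)]
for _ in range(100):
    metropolis_sweep(spins, beta=0.5, rng=rng)
print(energy(spins))
```

Only the local energy change is needed for each proposal, so the partition function never has to be evaluated.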

A generalization of the Metropolis Method is known as the Metropolis-Hastings algorithm (Hastings, 1970), where the proposal distribution is no longer required to be a symmetric function of its arguments.
In particular, when the current state is $z^{(\tau)}$, we draw a sample $z^*$ from the distribution $q_k(z \mid z^{(\tau)})$ and then accept it with probability $A_k(z^*, z^{(\tau)})$, where
$$A_k(z^*, z^{(\tau)}) = \min\left(1, \frac{p(z^*)\,q_k(z^{(\tau)} \mid z^*)}{p(z^{(\tau)})\,q_k(z^* \mid z^{(\tau)})}\right).$$
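To see the proposal ratio at work, here is a small sketch with a deliberately asymmetric proposal. Everything specific in it is an assumption made for illustration: the target is an unnormalized Exponential(1) density on $z > 0$, and the proposal multiplies the current state by a log-normal factor, for which $q(z \mid z^*) / q(z^* \mid z) = z^* / z$.

```python
import math
import random

def log_target(z):
    """Unnormalized log density of an Exponential(1) target on z > 0."""
    return -z

def mh_lognormal_walk(n_samples, z0=1.0, step=1.0, seed=0):
    """Metropolis-Hastings with a multiplicative log-normal proposal."""
    rng = random.Random(seed)
    z = z0
    samples = []
    for _ in range(n_samples):
        z_star = z * math.exp(step * rng.gauss(0.0, 1.0))
        # log of p(z*) q(z | z*) / (p(z) q(z* | z)); the q-ratio equals z*/z here.
        log_ratio = (log_target(z_star) - log_target(z)
                     + math.log(z_star) - math.log(z))
        if log_ratio >= 0.0 or rng.random() < math.exp(log_ratio):
            z = z_star
        samples.append(z)
    return samples

samples = mh_lognormal_walk(50_000)
print(sum(samples) / len(samples))
```

The exact mean of the target is 1, so the printed average gives a quick sanity check. Dropping the `math.log(z_star) - math.log(z)` term would turn this into a plain Metropolis rule, which is not valid for this proposal.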

Terminology
Definition: A transition kernel is a function $K$ defined on $\mathcal{X} \times \mathcal{B}(\mathcal{X})$ such that
$\forall x \in \mathcal{X}$, $K(x, \cdot)$ is a probability measure
$\forall A \in \mathcal{B}(\mathcal{X})$, $K(\cdot, A)$ is measurable.
Definition: A Markov chain with transition kernel $K$ satisfies the detailed balance condition if there exists a function $f$ satisfying
$$f(x)\,K(x, y) = f(y)\,K(y, x)$$
for every $(x, y)$.
The detailed balance condition expresses an equilibrium in the flow of the Markov chain, namely that the probability of being in $x$ and moving to $y$ is the same as the probability of being in $y$ and moving back to $x$.

Terminology
Definition: Suppose that a Markov chain with transition density $q$ satisfies the detailed balance condition with a probability density function $\pi$, i.e.
$$\pi(x)\,q(x, y) = \pi(y)\,q(y, x).$$
Then the chain is $\pi$-reversible.
Definition: A probability density $\pi : S \to [0, \infty)$ is a stationary density for this Markov chain if it satisfies
$$\int_S \pi(x)\,q(x, y)\,dx = \pi(y)$$
for all $y \in S$.

Reversibility
Theorem: If $X$ is a $\pi$-reversible Markov chain, then $\pi$ is a stationary distribution of $X$.
This solves the problem of constructing an MCMC method: the Metropolis-Hastings kernel satisfies detailed balance with respect to the target $p$, so $p$ is a stationary distribution of the chain it generates.
The detailed balance condition is not necessary for $f$ to be a stationary measure associated with the transition kernel $K$; it just provides a sufficient condition that is often easy to check and that can be used for most MCMC algorithms.
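For completeness, the two facts used on this slide follow from one-line calculations. The derivation below is a sketch in the notation of the definitions above and of the Metropolis-Hastings acceptance probability; it assumes $z \neq z'$ in the second identity, since the rejection mass at $z' = z$ satisfies detailed balance trivially.

```latex
% Reversibility implies stationarity: integrate detailed balance over x.
\int_S \pi(x)\, q(x, y)\, dx
  = \int_S \pi(y)\, q(y, x)\, dx
  = \pi(y) \int_S q(y, x)\, dx
  = \pi(y).

% Metropolis-Hastings satisfies detailed balance with respect to p:
p(z)\, q(z' \mid z)\, A(z', z)
  = \min\{\, p(z)\, q(z' \mid z),\; p(z')\, q(z \mid z') \,\}
  = p(z')\, q(z \mid z')\, A(z, z').
```

The middle expression in the second identity is symmetric in $z$ and $z'$, which is exactly why the acceptance probability is built from that ratio.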

And if $z^*$ is generated independently of the current state, with proposal density $q(z)$, then we have
Theorem (Mengersen and Tweedie, Ann. Statist. (1996))
The Metropolis-Hastings algorithm produces a uniformly ergodic chain if there exists a constant $M$ such that
$$p(z) \le M q(z), \quad \forall z \in \operatorname{supp} p.$$
In this case,
$$\|K^n(z, \cdot) - p\|_{TV} \le \left(1 - \frac{1}{M}\right)^n,$$
where $K$ is the transition kernel and $\|\cdot\|_{TV}$ denotes the total variation norm.

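Proof. If $p(z) \le M q(z)$, the transition kernel satisfies
$$K(z, z') \ge q(z') \min\left\{\frac{p(z')\,q(z)}{p(z)\,q(z')}, 1\right\} = \min\left\{p(z')\frac{q(z)}{p(z)}, q(z')\right\} \ge \frac{1}{M}\,p(z').$$
To establish the bound on $\|K^n(z, \cdot) - p\|_{TV}$, first we write
$$\begin{aligned}
\|K(z, \cdot) - p\|_{TV} &= \sup_A \left|\int_A (K(z, y) - p(y))\,dy\right| \\
&= \int_{\{y;\, p(y) \ge K(z, y)\}} (p(y) - K(z, y))\,dy \\
&\le \left(1 - \frac{1}{M}\right) \int_{\{y;\, p(y) \ge K(z, y)\}} p(y)\,dy \\
&\le 1 - \frac{1}{M}.
\end{aligned}$$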

We now continue with a recursion argument. We can write
$$\int_A (K^2(z, y) - p(y))\,dy = \int_{\operatorname{supp} p} \left[\int_A (K(u, y) - p(y))\,dy\right] (K(z, u) - p(u))\,du,$$
and by the same procedure we get
$$\|K^2(z, \cdot) - p\|_{TV} \le \left(1 - \frac{1}{M}\right)^2.$$
We next write the general recursion relation
$$\int_A (K^{n+1}(z, y) - p(y))\,dy = \int_{\operatorname{supp} p} \left[\int_A (K^n(u, y) - p(y))\,dy\right] (K(z, u) - p(u))\,du,$$
and by induction we have
$$\|K^n(z, \cdot) - p\|_{TV} \le \left(1 - \frac{1}{M}\right)^n.$$
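The bound can be checked numerically on a small finite state space, where the kernel is just a matrix. The sketch below is an illustration under assumptions: the three-state target `p` and proposal `q` are made up for the example, and the total variation distance is computed as $\sup_A |\cdot|$, which for discrete distributions equals half the $L_1$ distance.

```python
def independence_mh_kernel(p, q):
    """Transition matrix of independence Metropolis-Hastings on a finite space."""
    n = len(p)
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if j != i:
                accept = min(1.0, (p[j] * q[i]) / (p[i] * q[j]))
                K[i][j] = q[j] * accept
        K[i][i] = 1.0 - sum(K[i])  # proposals of i itself plus rejections
    return K

def mat_mul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def tv_distance(row, p):
    """sup_A |row(A) - p(A)| for discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(row, p))

p = [0.5, 0.3, 0.2]                         # illustrative target
q = [0.2, 0.3, 0.5]                         # illustrative independent proposal
M = max(pi / qi for pi, qi in zip(p, q))    # smallest M with p <= M q
K = independence_mh_kernel(p, q)

checks = []
Kn = K
for n in range(1, 11):
    worst = max(tv_distance(row, p) for row in Kn)  # worst starting state
    bound = (1.0 - 1.0 / M) ** n
    checks.append((n, worst, bound))
    Kn = mat_mul(Kn, K)

for n, worst, bound in checks:
    print(n, round(worst, 6), round(bound, 6))
```

Each printed line compares the worst-case distance after $n$ steps with $(1 - 1/M)^n$. A proposal closer to the target gives a smaller $M$ and therefore a faster guaranteed rate.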

Simple Example
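Now consider using the Metropolis Method to sample from a bimodal target distribution (an equal-weight mixture of $N(-5, 1)$ and $N(7, 9)$)
$$p(y) = \frac{1}{2\sqrt{2\pi}} \exp\left(-\frac{(y+5)^2}{2}\right) + \frac{1}{6\sqrt{2\pi}} \exp\left(-\frac{(y-7)^2}{18}\right).$$
We begin with $y^{(1)} = 0$ and let the proposal distribution be
$$q(y^{(n)} \mid y^{(n-1)}) \sim N(y^{(n-1)}, \sqrt{0.5}).$$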
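A minimal sketch of this experiment follows. It is not the code behind the histograms: the function names are hypothetical, and it assumes that $\sqrt{0.5}$ in the proposal is a standard deviation, which the slide does not state.

```python
import math
import random

def target(y):
    """Bimodal density from the slide (already normalized)."""
    left = math.exp(-(y + 5.0) ** 2 / 2.0) / (2.0 * math.sqrt(2.0 * math.pi))
    right = math.exp(-(y - 7.0) ** 2 / 18.0) / (6.0 * math.sqrt(2.0 * math.pi))
    return left + right

def metropolis(n_steps, y0=0.0, proposal_sd=math.sqrt(0.5), seed=0):
    """Random-walk Metropolis; returns the chain and the acceptance rate."""
    rng = random.Random(seed)
    chain = [y0]
    accepted = 0
    for _ in range(n_steps - 1):
        current = chain[-1]
        candidate = rng.gauss(current, proposal_sd)
        # Accept with probability min(1, p(candidate) / p(current)).
        if rng.random() < target(candidate) / target(current):
            chain.append(candidate)
            accepted += 1
        else:
            chain.append(current)
    return chain, accepted / (n_steps - 1)

chain, acceptance_rate = metropolis(100_000)
print(acceptance_rate, min(chain), max(chain))
```

Plotting histograms of `chain[:100]`, `chain[:1000]`, and so on would show how many draws are needed before both modes are represented. A small proposal scale makes crossings of the low-density region between the modes rare.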

Simple Example
[Figure: four histograms (density against draw value) of the first 100, 1000, 10000, and 1e+05 draws y_n]

Simple Example
[Figure: histograms (density against draw value) of three chains: y_n with proposal N(y(n−1), sqrt(0.5)), y2_n with proposal N(−5, sqrt(10)), and y3_n with proposal N(0, sqrt(10))]

MCMC Diagnostic
The purpose of Markov Chain Monte Carlo approximation is to obtain a sequence of parameter values $\phi^{(1)}, \phi^{(2)}, \ldots, \phi^{(S)}$ such that
$$\frac{1}{S} \sum_{s=1}^{S} g(\phi^{(s)}) \approx \int g(\phi)\,p(\phi)\,d\phi$$
for any function $g$ of interest. In other words, we want the empirical average of $g(\phi^{(1)}), g(\phi^{(2)}), \ldots, g(\phi^{(S)})$ to estimate the expected value of $g(\phi)$ under the target distribution $p(\phi)$.

MCMC Diagnostic
Isn't the Gibbs sampler guaranteed to eventually provide a good approximation to $p(\theta)$? It is, but "eventually" can be a very long time in some situations.
1 In the case of a generic parameter $\phi$ and a target distribution $p(\phi)$, it is helpful to think of the sequence $\phi^{(1)}, \phi^{(2)}, \ldots, \phi^{(S)}$ as the trajectory of a particle $\phi$ moving around the parameter space.
2 In terms of MCMC integral approximation, the critical thing is that the amount of time the particle spends in a given set $A$ is proportional to the target probability $\int_A p(\phi)\,d\phi$.

MCMC Diagnostic : Mixing
Speed of mixing describes how quickly the particle moves around the parameter space. An independent MC sampler has perfect mixing: it has zero autocorrelation and can jump between regions of the parameter space in one step. On the other hand, MCMC samples are NOT independent draws from the target distribution because
1 The first draw is set by the user and thus is not a random draw from the target distribution.
2 Subsequently, draw $s + 1$ depends on draw $s$: we say that the samples are autocorrelated.

MCMC Diagnostic : Traceplot
The initial phase of an MCMC chain is called the burn-in phase,
during which the chain converges towards the target distribution. The
trace plots can be used to detect burn-in. Note that we can never be
sure that a chain has converged but at least we can detect lack of
convergence.
A traceplot is a plot of the iteration number against the value of the
draw of the parameter at each iteration. We can see whether our chain
gets stuck in certain areas of the parameter space, which indicates bad
mixing.

Traceplot

MCMC Diagnostic : Auto Correlation Function
Another way to assess convergence is to examine the autocorrelations between the draws of our Markov chain.
Let $f = \{f(x)\}_{x \in S}$ be a real-valued function defined on the state space $S$. Then for the stationary Markov chain, $\{f_t\} \equiv \{f(X_t)\}$ is a stationary stochastic process with mean
$$\mu_f \equiv \langle f_t \rangle = \sum_x \pi_x f(x).$$

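We define the covariance as
$$\langle X_i ; X_j \rangle \equiv \langle (X_i - \langle X_i \rangle)(X_j - \langle X_j \rangle) \rangle \equiv \langle X_i X_j \rangle - \langle X_i \rangle \langle X_j \rangle$$
and define the unnormalized autocorrelation function, or autocovariance function, as
$$C_{ff}(t) \equiv \langle f_s f_{s+t} \rangle - \langle f_t \rangle^2,$$
that is, the covariance of $f_s$ and $f_{s+t}$. The normalized autocorrelation function is given by
$$\rho_{ff}(t) \equiv \frac{C_{ff}(t)}{C_{ff}(0)}.$$
Typically, $\rho_{ff}(t)$ decays exponentially, $\rho_{ff}(t) \sim e^{-|t|/\tau}$.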

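For a given observable $f$ we define the integrated autocorrelation time
$$\tau_{int,f} = \frac{1}{2} \sum_{t=-\infty}^{\infty} \rho_{ff}(t) = \frac{1}{2} + \sum_{t=1}^{\infty} \rho_{ff}(t),$$
where the second form uses $\rho_{ff}(0) = 1$ and $\rho_{ff}(-t) = \rho_{ff}(t)$.
The integrated autocorrelation time controls the statistical error in Monte Carlo measurements of $f$.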

More precisely, the sample mean
$$\bar{f} = \frac{1}{n} \sum_{t=1}^{n} f_t$$
has variance
$$\mathrm{Var}(\bar{f}) = \frac{1}{n^2} \sum_{r,s=1}^{n} C_{ff}(r - s) \approx \frac{1}{n} (2\tau_{int,f})\,C_{ff}(0)$$
for $n \gg \tau$. Thus the variance of $\bar{f}$ is a factor $2\tau_{int,f}$ larger than it would be if the $\{f_t\}$ were statistically independent. Stated differently, the number of effectively independent samples in a run of length $n$ is roughly $n / (2\tau_{int,f})$.

If the autocorrelation is still relatively high at larger lags $t$, this indicates a high degree of correlation between our draws and slow mixing.
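These quantities can be estimated from a chain as sketched below. The example is an assumption-laden illustration: the test chain is an AR(1) process with coefficient 0.5 (for which $\rho_{ff}(t) = 0.5^{t}$ and $\tau_{int,f} = 1.5$ exactly), and the infinite sum is truncated at the first non-positive estimated autocorrelation, which is one simple windowing heuristic among several.

```python
import random

def autocorrelation(x, max_lag):
    """Estimate rho(t) = C(t) / C(0) for t = 0, ..., max_lag."""
    n = len(x)
    mean = sum(x) / n
    centered = [v - mean for v in x]
    c0 = sum(v * v for v in centered) / n
    rho = []
    for t in range(max_lag + 1):
        ct = sum(centered[s] * centered[s + t] for s in range(n - t)) / n
        rho.append(ct / c0)
    return rho

def integrated_autocorr_time(rho):
    """tau_int = 1/2 + sum_{t>=1} rho(t), cut at the first non-positive rho."""
    tau = 0.5
    for r in rho[1:]:
        if r <= 0.0:
            break
        tau += r
    return tau

# Illustrative correlated chain: AR(1) with coefficient 0.5.
rng = random.Random(0)
coef, n = 0.5, 20_000
chain, value = [], 0.0
for _ in range(n):
    value = coef * value + rng.gauss(0.0, 1.0)
    chain.append(value)

rho = autocorrelation(chain, max_lag=50)
tau_int = integrated_autocorr_time(rho)
n_eff = n / (2.0 * tau_int)
print(tau_int, n_eff)
```

For this chain the effective number of independent samples should come out near $n/3$, since $2\tau_{int,f} = 3$.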

MCMC Diagnostic : Eﬀective Sample Size
One measure of whether we have enough MCMC samples is the effective sample size $S_{eff}$ of an MCMC sample. $S_{eff}$ can be interpreted as the number of independent MC samples needed to obtain the same precision for the mean. Suppose we approximate $E(\theta)$ by the sample mean
$$\bar{\theta} = \frac{1}{S} \sum_{s=1}^{S} \theta^{(s)}.$$

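For an MC sample we have
$$\mathrm{Var}_{MC}(\bar{\theta}) = \frac{\mathrm{Var}(\theta)}{S}.$$
For an MCMC sample we have
$$\mathrm{Var}_{MCMC}(\bar{\theta}) = \frac{\mathrm{Var}(\theta)}{S} + \frac{1}{S^2} \sum_{s \neq t} E\left[(\theta^{(s)} - E(\theta))(\theta^{(t)} - E(\theta))\right],$$
where the last term depends on the autocorrelation in the MCMC chain and is generally positive.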

The effective sample size is defined through
$$\mathrm{Var}_{MCMC}(\bar{\theta}) = \frac{\mathrm{Var}(\theta)}{S_{eff}}.$$
The potential scale reduction factor, or shrink factor, $\hat{R}$ is approximately the square root of the variance of the mixture of the chains divided by the average within-chain variance. If the chains have mixed well, then $\hat{R}$ is close to 1.
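One common way to compute a shrink factor of this kind is the Gelman-Rubin construction, sketched below for several chains of equal length. The slides do not say which variant they use, so the formula here (pooled variance estimate $\frac{n-1}{n} W + \frac{B}{n}$ divided by the within-chain variance $W$) and the function name are assumptions.

```python
import math

def potential_scale_reduction(chains):
    """Gelman-Rubin style R-hat for m chains of equal length n."""
    m = len(chains)
    n = len(chains[0])
    means = [sum(c) / n for c in chains]
    grand_mean = sum(means) / m
    # W: average of the within-chain sample variances.
    W = sum(
        sum((v - mu) ** 2 for v in c) / (n - 1)
        for c, mu in zip(chains, means)
    ) / m
    # B / n: sample variance of the chain means.
    B_over_n = sum((mu - grand_mean) ** 2 for mu in means) / (m - 1)
    var_plus = (n - 1) / n * W + B_over_n  # estimate of the mixture variance
    return math.sqrt(var_plus / W)

well_mixed = [[1.0, 2.0, 3.0], [1.0, 2.0, 3.0]]
stuck = [[0.0, 1.0, 2.0], [10.0, 11.0, 12.0]]
print(potential_scale_reduction(well_mixed))
print(potential_scale_reduction(stuck))
```

Chains that sit in different regions give a between-chain term that dominates, pushing the value far above 1; chains that agree give a value at or slightly below 1 for short runs.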

Golden Rule !
Some rules of thumb: $\hat{R} < 1.1$ and $S_{eff} \gg 100$ !!!

References
Christopher Bishop (2006), Pattern Recognition and Machine Learning, Springer, pp. 523-556.
Christian Robert and George Casella (2004), Monte Carlo Statistical Methods, Springer, pp. 267-318.
Jun S. Liu (2001), Monte Carlo Strategies in Scientific Computing, Springer, pp. 105-127.
http://www.mas.ncl.ac.uk/~ndjw1/teaching/sim/gibbs/gibbs.r
http://www.stats.ox.ac.uk/~cholmes/Courses/BDA/bda-mcmc.pdf
Peter D. Hoff (2009), A First Course in Bayesian Statistical Methods

Stochastic Volatility
Vasin,Rachel,Jinchao
April 5, 2016

Introduction
Financial returns indicate the profit from an investment
From the finance group: the rate of return $\sigma$ is assumed constant
In reality, volatility can vary over time
To model volatility, we can use deterministic or probabilistic methods
We will focus on probabilistic methods using MCMC and Bayesian analysis

SV Model
Let y = (y1, y2, ..., yn) be a collection of observed returns
Each return is normally distributed with mean 0 and unknown
variance
Log-variance is normally distributed with unknown mean and
variance
Log-variance also follows a Markov chain where the distribution of
the current one depends only on the value of the previous one

SV model : Model Speciﬁcation
Mathematically:
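$$y_t \mid h_t \sim N(0, e^{h_t})$$
$$h_0 \mid \mu, \phi, \sigma \sim N\left(\mu, \frac{\sigma^2}{1 - \phi^2}\right)$$
$$h_t \mid h_{t-1}, \mu, \phi, \sigma \sim N(\mu + \phi(h_{t-1} - \mu), \sigma^2)$$
$h = (h_1, h_2, \ldots, h_n)$ = log-variance or volatility process
$\mu$ = level of log-variance
$\phi$ = persistence of log-variance
$\sigma$ = volatility of log-variance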
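Simulating from the model is a direct translation of the three distributions above. The sketch below is illustrative only: the function name is hypothetical, the parameter values in the usage line are arbitrary choices, and because `random.gauss` takes a standard deviation, each variance enters through its square root.

```python
import math
import random

def simulate_sv(n, mu, phi, sigma, seed=0):
    """Draw (h_1..h_n, y_1..y_n) from the stochastic volatility model."""
    rng = random.Random(seed)
    # h_0 ~ N(mu, sigma^2 / (1 - phi^2)), the stationary law of the AR(1) process.
    h_prev = rng.gauss(mu, sigma / math.sqrt(1.0 - phi * phi))
    h, y = [], []
    for _ in range(n):
        h_t = rng.gauss(mu + phi * (h_prev - mu), sigma)
        y_t = rng.gauss(0.0, math.exp(h_t / 2.0))  # variance e^{h_t}
        h.append(h_t)
        y.append(y_t)
        h_prev = h_t
    return h, y

# Arbitrary illustrative parameters: highly persistent log-variance.
h, y = simulate_sv(n=500, mu=-10.0, phi=0.99, sigma=0.07)
print(len(h), len(y))
```

With $\phi$ close to 1 the simulated log-variance drifts slowly, which produces the volatility clustering the model is meant to capture.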

Prior distributions
To complete the model setup, a prior distribution for the parameter vector $\theta = (\mu, \phi, \sigma_\eta)$ needs to be specified, where $\sigma_\eta$ is the volatility of log-variance. Following Kim S, Shephard N, Chib S (1998), we choose independent components for each parameter:
$$p(\theta) = p(\mu)\,p(\phi)\,p(\sigma_\eta).$$
The level $\mu \in \mathbb{R}$ is equipped with the usual normal prior $\mu \sim N(b_\mu, B_\mu)$.
In practical applications, this prior is usually chosen to be rather uninformative, e.g. $b_\mu = 0$ and $B_\mu \ge 100$ for daily log returns.

Prior distributions
For the persistence parameter $\phi \in (-1, 1)$, we choose $\frac{\phi + 1}{2} \sim B(a_0, b_0)$, implying that
$$p(\phi) = \frac{1}{2B(a_0, b_0)} \left(\frac{1 + \phi}{2}\right)^{a_0 - 1} \left(\frac{1 - \phi}{2}\right)^{b_0 - 1},$$
where $a_0$ and $b_0$ are positive parameters and
$$B(x, y) = \int_0^1 t^{x-1} (1 - t)^{y-1}\,dt$$
denotes the beta function.

Prior distributions
Lastly, for the volatility of log-variance $\sigma_\eta \in \mathbb{R}^+$, we choose
$$\sigma_\eta^2 \sim \mathrm{Gamma}\left(\frac{1}{2}, \frac{1}{2B_{\sigma_\eta}}\right).$$
This choice is motivated by Fruhwirth-Schnatter and Wagner (2010), who equivalently stipulate the prior for $\pm\sqrt{\sigma_\eta^2}$ to follow a centered normal distribution.

Posterior distribution
Then we can estimate $(h, \mu, \phi, \sigma)$ using the posterior distribution given by Bayes' formula:
$$p(h, \mu, \phi, \sigma \mid y) \propto p(y \mid h)\,p(h \mid h_0, \mu, \phi, \sigma)\,p(h_0 \mid \mu, \phi, \sigma)\,p_0(\mu, \phi, \sigma)$$
where $y$ is our data, $p(y \mid h)\,p(h \mid h_0, \mu, \phi, \sigma)\,p(h_0 \mid \mu, \phi, \sigma)$ is the likelihood function given by our SV model, and $p_0(\mu, \phi, \sigma) = p(\theta)$ is the prior distribution constructed above.
Now we can use an MCMC method to sample the parameters from the posterior distribution.
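An MCMC sampler only needs this posterior up to a constant, usually on the log scale. The sketch below evaluates the right-hand side of the proportionality for given values; it is an illustration, not the stochvol implementation. The function names are hypothetical, $h_0$ is treated as an explicit argument, and the prior is passed in as a function `log_prior(mu, phi, sigma)` so that no particular parameterization is assumed.

```python
import math

def log_normal_pdf(x, mean, var):
    """Log density of N(mean, var) evaluated at x."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def log_joint(y, h, h0, mu, phi, sigma, log_prior):
    """Unnormalized log posterior of the SV model at the given values."""
    if not (-1.0 < phi < 1.0) or sigma <= 0.0:
        return -math.inf  # outside the parameter space
    total = log_prior(mu, phi, sigma)
    # p(h_0 | mu, phi, sigma)
    total += log_normal_pdf(h0, mu, sigma ** 2 / (1.0 - phi ** 2))
    prev = h0
    for y_t, h_t in zip(y, h):
        # p(h_t | h_{t-1}, mu, phi, sigma)
        total += log_normal_pdf(h_t, mu + phi * (prev - mu), sigma ** 2)
        # p(y_t | h_t)
        total += log_normal_pdf(y_t, 0.0, math.exp(h_t))
        prev = h_t
    return total

flat_prior = lambda mu, phi, sigma: 0.0  # placeholder prior for the demo
value = log_joint(y=[0.0], h=[0.0], h0=0.0, mu=0.0, phi=0.0, sigma=1.0,
                  log_prior=flat_prior)
print(value)
```

A Metropolis-type sampler would compare `log_joint` at the current and proposed values; returning `-math.inf` outside the parameter space makes such proposals get rejected automatically.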

Example : EUR/USD
Euro to US dollar exchange rates
Data available: http://sdw.ecb.europa.eu/
Also available as "exrates" in the R package "stochvol"

Example : Prior distributions
Prior distributions:
$$\mu \sim N(0, 100)$$
$$\frac{\phi + 1}{2} \sim B(20, 1.5)$$
$$\sigma_\eta^2 \sim \mathrm{Gamma}\left(\frac{1}{2}, \frac{1}{0.2}\right) = \chi_1^2 \times 0.1$$
For some discussion of this choice, see Kim S, Shephard N, Chib S (1998), who choose $a_0 = 20$ and $b_0 = 1.5$, implying a prior mean of 0.86 for $\phi$ with a prior standard deviation of 0.11.
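The quoted prior mean and standard deviation can be reproduced from the moments of the Beta distribution, since $\phi = 2X - 1$ with $X \sim B(a_0, b_0)$. The short check below also confirms that a Gamma distribution with shape $1/2$ and rate $1/0.2$ has the same mean and variance as $0.1 \times \chi_1^2$; it assumes the second Gamma argument is a rate, which is what makes the stated equality hold.

```python
import math

a0, b0 = 20.0, 1.5

# Moments of X ~ Beta(a0, b0).
beta_mean = a0 / (a0 + b0)
beta_var = a0 * b0 / ((a0 + b0) ** 2 * (a0 + b0 + 1.0))

# phi = 2X - 1, so the mean shifts and the standard deviation doubles.
phi_mean = 2.0 * beta_mean - 1.0
phi_sd = 2.0 * math.sqrt(beta_var)
print(round(phi_mean, 2), round(phi_sd, 2))  # 0.86 0.11

# Gamma(shape = 1/2, rate = 1/0.2) versus 0.1 * chi-square(1).
shape, rate = 0.5, 1.0 / 0.2
gamma_mean, gamma_var = shape / rate, shape / rate ** 2
scaled_chi2_mean, scaled_chi2_var = 0.1 * 1.0, 0.1 ** 2 * 2.0
print(gamma_mean, scaled_chi2_mean)
```

The prior therefore concentrates $\phi$ near 1, encoding a belief in highly persistent volatility.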

Example : Output Summary
Summary of 100,000 MCMC draws after a burn-in of 50,000:
Parameter   Mean       SD        5% quantile   50% quantile   95% quantile   Seff
µ           -10.1364   0.23363   -10.4630      -10.1381       -9.7980        51118
φ           0.9932     0.00286   0.9881        0.9935         0.9974         2914
σ           0.0660     0.01020   0.0505        0.0651         0.0838         1347
σ²          0.0045     0.00141   0.0026        0.0042         0.0070         1347

Example : Estimated volatilities
Then we can get the empirical quantiles of the posterior distribution of $100 \exp(h_t/2)$, the latent volatilities of $y_t$ over time in percent:

Example : Forecast volatility
Forecast volatility of yt in percentage
t + 1 0.584132065586393
t + 2 0.584587272795878
t + 3 0.584863398891602
t + 4 0.585436225819106
t + 5 0.585646621839762
t + 6 0.585741891671509
t + 7 0.586295462435408
t + 8 0.586822153121447
t + 9 0.587354949275841
t + 10 0.587772869210331

Example : Traceplot

Example : Posterior and Prior densities

Example : Mean standardized residuals

Result from STAN : Traceplot

Result from STAN : Posterior Density

References
Gregor Kastner (2016), Dealing with Stochastic Volatility in Time Series Using the R Package stochvol
Kim S, Shephard N, Chib S (1998), Stochastic Volatility: Likelihood Inference and Comparison With ARCH Models
Fruhwirth-Schnatter S, Wagner H (2010)
