SlideShare a Scribd company logo
An introduction of Markov Chain Monte Carlo
Vasin,Rachel,Jinchao
April 5, 2016
Vasin,Rachel,Jinchao MCMC April 5, 2016 1 / 64
Outline
1 Introduction
Motivation
Statistical Mechanics Example
Bayesian Example
Monte Carlo Methods
2 Markov Chain Monte Carlo
Gibbs Sampling
Metropolis Method
Metropolis-Hastings algorithm
3 MCMC Diagnostic
Vasin,Rachel,Jinchao MCMC April 5, 2016 2 / 64
Statistical Mechanics Example
Now let us recall an example we learned in first class
Probability of the configuration σ0 at inverse temperature β = 1
kT :
µΛ,β(σ = σ0) =
1
ZΛ,β
exp(−βHN (σ0))PN (σ = σ0)
where Hamiltonian HN (σ) = −1
2 x=y J(x, y)σ(x)σ(y) + h x σ(x)
Partition function ZΛ,β = σ0
exp(−βHN (σ0))PN (σ = σ0)
Prior distribution PN (σ = σ0) = x∈Λ P(σ(x) = σ(x0))
Vasin,Rachel,Jinchao MCMC April 5, 2016 3 / 64
Statistical Mechanics Example
We want to compute the mean value of an observable A
E[A] = A(σ)µΛ,β(dσ).
However, usually the lattice size N 1, and the Partition function
ZΛ,β is hard to compute.
Thus, we try to sample from the Gibbs distribution regardless the
normal constant and approximate the integral by numerical result.
Vasin,Rachel,Jinchao MCMC April 5, 2016 4 / 64
Bayesian Example
Observed data Y1, ..., Yn follows some distribution p(y | θ)
θ is an unknown parameter
Posterior distribution p(θ | y) provides information for θ
By Bayes’ Rule:
p(θ | y) =
p(y | θ)p(θ)
Θ p(y | θ)p(θ)dθ
If we assume a prior distribution for θ, say U(0, 1), we can use
MCMC methods to approximate p(θ | y)
Vasin,Rachel,Jinchao MCMC April 5, 2016 5 / 64
Bayesian Example
Construct Markov Chain with steady state p(θ | y)
End with a sample (θ1, ..., θM ) which can be assumed to be a draw
from p(θ | y)
Can use histogram or time series plot to analyze
Popular applications include linear regression and biomedical
studies
Vasin,Rachel,Jinchao MCMC April 5, 2016 6 / 64
The problems to be solved
The fundamental problem that we wish to solve is finding the
expectation of some function f(z) with respect to a probability
distribution p(z), i.e.
E[f] = f(z)p(z)dz
The general idea of Monte Carlo methods is to obtain a set of samples
z(l) (where l = 1, . . . , L) drawn independently from the distribution
p(z). This allows the expectation to be approximated by a finite sum
ˆf =
1
L
L
l=1
f(z(l)
).
The methods we can use to generate the samples include rejection
sampling, importance sampling, etc.
Vasin,Rachel,Jinchao MCMC April 5, 2016 7 / 64
Markov Chain Monte Carlo
The problem, however, is that the samples {z(l)} might not be
independent.
Suppose we want to draw from our posterior distribution p(θ|y), but
we cannot sample independent draws from it. For example, we often do
not know the normalizing constant.
So we consider sampling slightly dependent draws using a Markov
chain, then we can still find quantities of interests from those draws.
MCMC is a class of methods using this idea.
Vasin,Rachel,Jinchao MCMC April 5, 2016 8 / 64
Markov Chain Monte Carlo
Set up an ergodic Markov chain {z(l)}, such that
lim
l→∞
Kl
(z(0)
, z) = p(z) (stationary distribution)
irrespective of the choice of initial value of z(0). K is the transition
kernel here.
Then by Ergodic Theorem,
lim
L→∞
1
L
L
l=1
f(z(l)
) = f(z)p(z)dz
In Bayesian statistics, there are generally two MCMC algorithms that
we use: the Gibbs Sampler and the Metropolis-Hastings algorithm.
Vasin,Rachel,Jinchao MCMC April 5, 2016 9 / 64
Gibbs Sampling
MCMC method that indirectly samples from joint distribution
Requires initial values for each random variable
Uses conditional distributions for sampling
Updates one value per time step
Sample eventually converges to the desired joint distribution
Can also approximate marginal distributions
Vasin,Rachel,Jinchao MCMC April 5, 2016 10 / 64
Gibbs Sampling
Procedure:
1 Identify joint distribution p(z) = p(z1, ... , zk) from which to
sample
2 Select initial values for z1, ... , zk. Call them z
(0)
1 , ... , z
(0)
k .
3 Draw a new value for z1, call it z
(1)
1 , from conditional distribution
p(z1 | z
(0)
2 , ... , z
(0)
k )
4 Draw a new value for z2, call it z
(1)
2 , from conditional distribution
p(z2 | z
(1)
1 , z
(0)
3 , ... z
(0)
k )
5 Repeat for all zi, updating the conditioned values each time
6 Repeat steps 3-5 for a large number of times T
Vasin,Rachel,Jinchao MCMC April 5, 2016 11 / 64
Gibbs Sampling
Example:
f(X, Y ) = 1
2πσxσy
√
1−ρ2
exp(− 1
2(1−ρ2)
[(x−µx)2
σ2
x
+
(y−µy)2
σ2
y
−
2ρ(x−µx)(y−µy)
σxσy
])
Bivariate normal with Corr(X, Y )=ρ
Suppose µx = µy = 0 and σx = σy = 1
f(X | Y ) = f(X,Y )
R f(X,Y )dx
= 1√
2π(1−ρ2)
exp(−(x−ρy)2
2(1−ρ2)
)
f(Y | X) = f(X,Y )
R f(X,Y )dy
= 1√
2π(1−ρ2)
exp(−(y−ρx)2
2(1−ρ2)
)
Initialize x, y = 0
Let x(1) be a random draw from N(0, 1 − ρ2)
Let y(1) be a random draw from N(ρx(1), 1 − ρ2)
Repeat last 2 steps 10000 times
Vasin,Rachel,Jinchao MCMC April 5, 2016 12 / 64
Gibbs Sampling
Vasin,Rachel,Jinchao MCMC April 5, 2016 13 / 64
Gibbs Sampling
Vasin,Rachel,Jinchao MCMC April 5, 2016 14 / 64
Metropolis Method
Now suppose we have a posterior p(θ|y) that we want to sample from,
but
the posterior is not any distribution we know
some (or all) of the full conditionals is not any distributions we
know (no Gibbs sampling for those whose full conditionals we
don’t know)
Then, we can consider using the Metropolis Method.
Vasin,Rachel,Jinchao MCMC April 5, 2016 15 / 64
Metropolis Method
The Metropolis Method (Metropolis et al., 1953) follows the following
steps:
Choose a starting value z(1).
At the current state z(τ), generate a candidate z∗ by the proposal
distribution q(z|z(τ)) depends on this current state.
Accepted the candidate sample with probability
A(z∗
, z(τ)
) = min(1,
p(z∗)
p(z(τ))
).
If the candidate sample is accepted, then z(τ+1) = z∗, otherwise the
candidate point z∗ is discarded, z(τ+1) is set to z(τ) and another
candidate sample is drawn from the distribution q(z|z(τ+1)).
Vasin,Rachel,Jinchao MCMC April 5, 2016 16 / 64
Comments
For Metropolis Method, the proposal distribution is symmetric,
that is q(zA|zB) = q(zB|zA) for all values of zA and zB.
In general, we usually use normal distribution N(z(τ), σ) as
proposal distribution .
Note that if the step from z(τ) to z∗ causes an increase in the value
of p(z), then the candidate point is certain to be kept.
We can write p(z) = ˜p/Zp,and we will assume that p(z) can readily
be evaluated for any given value of z, although the value of Zp
may be unknown. Like for
µΛ,β(σ = σ0) = 1
ZΛ,β
exp(−βHN (σ0))PN (σ = σ0),
we have A(σ, σ ) = min(1, µ(σ )
µ(σ) ) = min(1, exp(−β∆H).
Vasin,Rachel,Jinchao MCMC April 5, 2016 17 / 64
Metropolis-Hastings algorithm
A generalization of Metropolis Method is known as the
Metropolis-Hastings algorithm (Hastings, 1970), where the proposal
distribution is no longer a symmetric function of its arguments.
In particular, when the current state is z(τ), we draw a sample z∗ from
the distribution qk(z|z(τ)) and then accept it with probability
Ak(z∗, z(τ)) where
Ak(z∗
, z(τ)
) = min(1,
p(z∗)qk(z(τ)|z∗)
p(z(τ))qk(z∗|z(τ))
).
Vasin,Rachel,Jinchao MCMC April 5, 2016 18 / 64
Terminology
Definition : A transition kernel is a function K defined on X × B (X)
such that
∀x ∈ X, K (x, ·) is a probability measure
∀A ∈ B (X), K (·, A) is measurable.
Definition : A Markov chain with transition kernel K satisfies the
detailed balance condition if there exists a function f satisfying
f (x) K (x, y) = f (y) K (y, x)
for every (x, y) .
The balance detailed condition express an equilibrium in the flow of the
Markov chain, namely that the probability of being in x and moving to
y is the same as the probability of being in y and moving back to x.
Vasin,Rachel,Jinchao MCMC April 5, 2016 19 / 64
Terminology
Definition : Suppose that a Markov chain with transition density q
satisfies the detailed balance condition with a probability density
function π i.e.
π (x) q (x, y) = π (y) q (y, x) .
Then the chain is π−reversible.
Definition : A probability density π : R → [0, ∞) is a stationary
density for this Markov chain, if it satisfies
S
π (x) q (x, y) dx = π (y)
for all y ∈ S.
Vasin,Rachel,Jinchao MCMC April 5, 2016 20 / 64
Reversibility
Theorem : If X is a π−reversible Markov chain, then π is a
stationary distribution of X.
Now we have solved the problem of constructing an MCMC
method. The Markov chain generated by Metropolis-Hastings
algorithm is stationary.
The detailed balanced condition is not necessary for f to be a
stationary measure associated with transition kernel K, it just
provides a sufficient condition that is often easy to check and that
can be used for most MCMC algorithms.
Vasin,Rachel,Jinchao MCMC April 5, 2016 21 / 64
Metropolis-Hastings algorithm
We can show that p(z) is an invariant distribution of the Markov chain
defined by the Metropolis-Hastings algorithm by showing that detailed
balance. We have
p(z)q(z|z )A(z , z) = min(p(z)q(z|z ), p(z )q(z |z))
= min(p(z )q(z |z), p(z)q(z|z ))
= p(z )q(z |z)A(z, z )
as required.
Vasin,Rachel,Jinchao MCMC April 5, 2016 22 / 64
Metropolis-Hastings algorithm
And if z∗ is generated independently, then we have
Theorem(Mengersen and Tweedie.Ann.Statist.(1996))
The Metropolis-Hasting Algorithm produces a uniform ergodic chain if
there exist a constant M such that
p(z) ≤ Mq(z), ∀x ∈ supp p,
In this case,
Kn
(z, ·) − p TV ≤ (1 −
1
M
)n
,
where K is the transition kernel and · TV denotes the total variation
norm.
Vasin,Rachel,Jinchao MCMC April 5, 2016 23 / 64
Metropolis-Hastings algorithm
Proof. If p(z) ≤ Mq(z), the transition kernel satisfies
K(z, z ) ≥ q(z ) min{
p(z )q(z)
p(z)q(z )
, 1}
= min{p(z )
q(z)
p(z)
, q(z )} ≥
1
M
p(z ).
To establish the bound on Kn(z, ·) − p TV , first we write
K(z, ·) − p TV = sup
A
|
A
(K(z, y) − p(y))dy|
=
{y;p(y)≥K(z,y)}
(p(y) − K(z, y))dy
≤ (1 −
1
M
)
{y;p(y)≥K(z,y)}
p(y)dy
≤ (1 −
1
M
)
Vasin,Rachel,Jinchao MCMC April 5, 2016 24 / 64
Metropolis-Hastings algorithm
We now continue with a recursion argument. We can write
A
(K2
(z, y) − p(z))dy =
supp p
[
A
(K(u, y) − p(y))dy]
× (K(z, u) − p(u))du,
and by the same procedure we can get
K2
(z, ·) − p TV ≤ (1 −
1
M
)2
.
We next write a general recursion relation
A
(Kn+1
(z, y) − p(z))dy =
supp p
[
A
(Kn
(u, y) − p(y))dy]
× (K(z, u) − p(u))du,
and by induction we have
Kn
(z, ·) − p TV ≤ (1 −
1
M
)n
.
Vasin,Rachel,Jinchao MCMC April 5, 2016 25 / 64
Simple Example
Now, consider using Metropolis Method to sample a bi-modal target
distribution
p(y) =
1
2
√
2π
exp(−
(y + 5)2
2
) +
1
6
√
2π
exp(−
(y − 7)2
18
)
we begin with y(1) = 0 and let the proposal distribution to be
q(y(n−1)
|y(n)
) ∼ N(y(n−1)
,
√
0.5)
Vasin,Rachel,Jinchao MCMC April 5, 2016 26 / 64
Simple Example
Histogram of y_n[1:100]
y_n[1:100]
Density
−10 0 5 10 15 20
0.000.150.30
Histogram of y_n[1:1000]
y_n[1:1000]
Density
−10 0 5 10 15 20
0.000.060.12
Histogram of y_n[1:10000]
y_n[1:10000]
Density
−10 0 5 10 15 20
0.000.100.200.30
Histogram of y_n[1:1e+05]
y_n[1:1e+05]
Density
−10 0 5 10 15 20
0.000.100.20
Vasin,Rachel,Jinchao MCMC April 5, 2016 27 / 64
Simple Example
Histogram of y_n
q(y(n−1)|y(n))~N(y(n−1),sqrt(0.5))
y_n
Density
−10 −5 0 5 10 15 20
0.000.100.20
Histogram of y2_n
q(y(n−1)|y(n))~N(−5,sqrt(10))
y2_n
Density
−10 −5 0 5 10 15 20
0.00.20.4
Histogram of y3_n
q(y(n−1)|y(n))~N(0,sqrt(10))
y3_n
Density
−10 −5 0 5 10 15 20
0.000.15
Vasin,Rachel,Jinchao MCMC April 5, 2016 28 / 64
MCMC Diagnostic
The purpose of Markov Chain Monte Carlo approximation is to obtain
a sequence of parameter values φ(1), φ(2), ..., φ(S) such that
1
S
S
s=1
g φ(s)
≈ g (φ) p (φ) dφ.
for any function g of interest. In other word, we want the empirical
average of g φ(1) , g φ(2) , ..., g φ(S) to estimate the expected
value of g (φ) under target distribution p (φ).
Vasin,Rachel,Jinchao MCMC April 5, 2016 29 / 64
MCMC Diagnostic
Isn’t the Gibbs sampler guaranteed to eventually provide a good
approximation to p (θ) ? It is but eventually can be very long time in
some situations.
1 In the case of a generic parameter φ and a target distribution
p (φ), it is helpful to think of the sequence φ(1), φ(2), ..., φ(S) as
the trajectory of a particle φ moving around the parameter space.
2 In term of MCMC integral approximation, the critical thing is
that the amount of time the particle spends in given set A is
proportional to the target probability A p (φ) dφ.
Vasin,Rachel,Jinchao MCMC April 5, 2016 30 / 64
MCMC Diagnostic : Mixing
Speed of mixing describes how quickly particle moves around the
parameter space. An independent MC sampler has perfect mixing : It
has zero auto correlation and can jump around between regions of the
parameter space in one step. On the other hands, MCMC samples are
NOT independent draws from a target distribution because
1 The first draw is set by the user and thus not a random draw from
the target distribution.
2 Subsequently, draw s + 1 depends on draw s: We say that the
samples are auto correlated.
Vasin,Rachel,Jinchao MCMC April 5, 2016 31 / 64
MCMC Diagnostic : Traceplot
The initial phase of an MCMC chain is called the burn-in phase,
during which the chain converges towards the target distribution. The
trace plots can be used to detect burn-in. Note that we can never be
sure that a chain has converged but at least we can detect lack of
convergence.
A traceplot is a plot of the iteration number against the value of the
draw of the parameter at each iteration. We can see whether our chain
gets stuck in certain areas of the parameter space, which indicates bad
mixing.
Vasin,Rachel,Jinchao MCMC April 5, 2016 32 / 64
Traceplot
Vasin,Rachel,Jinchao MCMC April 5, 2016 33 / 64
MCMC Diagnostic : Auto Correlation Function
Another way to assess convergence is to assess the auto correlations
between the draws of our Markov chain.
Let f = {f (x)}x∈S be a real value function defined on the state space
S. Then for the stationary markov chain, {ft} ≡ {f (Xt)} is a
stationary stochastic process with mean
µf ≡ ft =
x
πxf (x) .
Vasin,Rachel,Jinchao MCMC April 5, 2016 34 / 64
MCMC Diagnostic : Auto Correlation Function
We define covariance matrix as
Xi, Xj ≡ (Xi − Xi ) (Xj − Xj ) ≡ XiXj − Xi Xj
and define un-normalized auto correlation function or auto covariance
function as
Cff (t) ≡ fs, fs+t − ft
2
where normalized auto correlation function is given by
ρff (t) ≡
Cff (t)
Cff (0)
.
Typically, ρff (t) decays exponentially ∼ e−|t|/τ .
Vasin,Rachel,Jinchao MCMC April 5, 2016 35 / 64
MCMC Diagnostic : Auto Correlation Function
For a given observable f we define the integrated auto correlation time
τint,f =
1
2
∞
t=−∞
ρff (t)
≈
1
2
+
∞
t=1
ρff .
The integrated auto correlation time controls the statistical error in
Monte Carlo measurements of f .
Vasin,Rachel,Jinchao MCMC April 5, 2016 36 / 64
MCMC Diagnostic : Auto Correlation Function
More precisely the sample mean
¯f =
1
n
n
t=1
ft
has variance
V ar ¯f =
1
n2
n
r,s=1
Cff (r − s)
≈
1
n
(2τint,f ) Cff (0)
for n τ. Thus the variance of ¯f is a factor 2τint,f larger than it would
be if the {ft} were statistically independent. Stated differently the
number of effectively independent samples in a run of length n is
roughly n
2τint,f
.
Vasin,Rachel,Jinchao MCMC April 5, 2016 37 / 64
MCMC Diagnostic : Auto Correlation Function
If autocorrelation is still relatively high for higher values of k, this
indicates high degree of correlation between our draws and slow mixing.
Vasin,Rachel,Jinchao MCMC April 5, 2016 38 / 64
MCMC Diagnostic : Effective Sample Size
One measure to evaluate whether we have enough MCMC samples is
the effective sample size Seff of an MCMC sample. Seff can be
interpreted as the number of independent MC samples needed to
obtain the same precision for the mean. Suppose we approximate E (θ)
by the sample mean
¯θ =
1
S
θ(s)
.
Vasin,Rachel,Jinchao MCMC April 5, 2016 39 / 64
MCMC Diagnostic : Effective Sample Size
For an MC sample we have:
V arMC
¯θ =
V ar (θ)
S
.
For an MCMC sample we have
V arMCMC
¯θ =
V ar (θ)
S
+
1
S2
s=t
E θ(s)
− E (θ) θ(t)
− E (θ)
where the last term depends on the auto-correlation in the MCMC
chain, and is generally positive.
Vasin,Rachel,Jinchao MCMC April 5, 2016 40 / 64
MCMC Diagnostic : Effective Sample Size
The effective sample size is defined as follows:
V arMCMC
¯θ =
V ar (θ)
Seff
.
Potential scale reduction factor or shrink factor ˆR is, approximately,
the square root of the variance of the mixture of the chains, divided by
the average within chain variance. If chains have mixed well, then ˆR is
close to 1.
Vasin,Rachel,Jinchao MCMC April 5, 2016 41 / 64
Golden Rule !
Some rules of thumb : ˆR < 1.1 and Seff 100 !!!
Vasin,Rachel,Jinchao MCMC April 5, 2016 42 / 64
References
Christopher Bishop (2006)
Pattern Recognition and Machine Learning
Springer, p523–556.
Christian Robert and George Casella (2004)
Monte Carlo Statistical Methods
Springer, p267–318.
Jun S. Liu (2001)
Monte Carlo Strategies In Scientific Computing
Springer, p105-127
http://www.mas.ncl.ac.uk/ ndjw1/teaching/sim/gibbs/gibbs.r
http://www.stats.ox.ac.uk/ cholmes/Courses/BDA/bda-mcmc.pdf
Peter D. Hoff (2009)
A First Course in Bayesian Statistical Methods
Vasin,Rachel,Jinchao MCMC April 5, 2016 43 / 64
Stochastic Volatility
Vasin,Rachel,Jinchao
April 5, 2016
Vasin,Rachel,Jinchao MCMC April 5, 2016 44 / 64
Introduction
Financial returns indicate the profit from an investment
From finance group, rate of return σ assumed constant
In reality, volatility can vary with respect to time
To model volatility, we can use deterministic or probabilistic
methods
We will focus on probabilistic methods using MCMC and Bayesian
analysis
Vasin,Rachel,Jinchao MCMC April 5, 2016 45 / 64
SV Model
Let y = (y1, y2, ..., yn) be a collection of observed returns
Each return is normally distributed with mean 0 and unknown
variance
Log-variance is normally distributed with unknown mean and
variance
Log-variance also follows a Markov chain where the distribution of
the current one depends only on the value of the previous one
Vasin,Rachel,Jinchao MCMC April 5, 2016 46 / 64
SV model : Model Specification
Mathematically:
yt | ht ∼ N(0, eht
)
h0 | µ, φ, σ ∼ N(µ,
σ2
1 − φ2
)
ht | ht−1, µ, φ, σ ∼ N(µ + φ(ht−1 − µ), σ2
)
h = (h1, h2, ..., hn) = log-variance or volatility process
µ = level of log-variance
φ = persistence of log-variance
σ = volatility of log-variance
Vasin,Rachel,Jinchao MCMC April 5, 2016 47 / 64
Prior distributions
To complete the model setup, a prior distribution for the parameter
vector θ needs to be specified. Following Kim S, Shephard N, Chib S
(1998), we choose independent components for each parameter
p (θ) = p (µ) p (φ) p (ση) .
The level µ ∈ R is is equipped with the usual normal prior
µ ∼ N (bµ, Bµ).
In practical applications, this prior is usually chosen to be rather
uninformative; namely, bµ = 0 and Bµ ≥ 100 for daily log returns.
Vasin,Rachel,Jinchao MCMC April 5, 2016 48 / 64
Prior distributions
For the persistence parameter φ ∈ (−1, 1) , we choose (φ+1)
2 ∼ B (a0, b0)
implying that
p (φ) =
1
2B (a0, b0)
1 + φ
2
a0−1
1 − φ
2
b0−1
where a0 and b0 are positive parameters and
B (x, y) =
1
0
tx−1
(1 − t)y−1
dt
denotes the beta function.
Vasin,Rachel,Jinchao MCMC April 5, 2016 49 / 64
Prior distributions
Lastly, for the volatility log - variance ση ∈ R+, we choose
σ2
η ∼ Gamma
1
2
,
1
2Bση
.
This choice is motivated by Fruhwirth-Schnatter and Wagner (2010)
who equivalently stipulate the prior for ± σ2
η to follow a centered
normal distribution.
Vasin,Rachel,Jinchao MCMC April 5, 2016 50 / 64
Posterior distribution
Then, we can estimate our parameters (h, µ, φ, σ) using the posterior
distribution given by the Bayesian formula:
p(h, µ, φ, σ|y) ∝ p(y|h)p(h|h0, µ, φ, σ)p(h0|µ, φ, σ)p0(µ, φ, σ)
where y is our data, p(y|h)p(h|h0, µ, φ, σ)p(h0|µ, φ, σ) is the lkelihood
function which is given by our SV-model, and p0(µ, φ, σ) = p(θ) is the
prior distribution we construct just now.
Now, we can use MCMC method to sampling the parameters from the
posterior distribution.
Vasin,Rachel,Jinchao MCMC April 5, 2016 51 / 64
Example : EUR/USD
Euro to US dollar exchange rates
Data available: http://sdw.ecb.europa.eu/
Also available under ”exrates” within the R package ”stochvol”
Vasin,Rachel,Jinchao MCMC April 5, 2016 52 / 64
Example : Prior distributions
Prior distributions :
µ ∼ N (0, 100)
(φ + 1)
2
∼ B (20, 1.5)
σ2
η ∼ Gamma
1
2
,
1
0.2
= χ2
1 × 0.1
For some discussion about this issue, see, Kim S, Shephard N, Chib S
(1998) who choose a0 = 20 and b0 = 1.5 implying a prior mean of 0.86
with the prior standard deviation of 0.11.
Vasin,Rachel,Jinchao MCMC April 5, 2016 53 / 64
Example : Output Summary
The summary of 100,000 MCMC draws after a burn-in of 50,000
Mean SD 5% quantile 50% quantile 95% quantile Seff
µ -10.1364 0.23363 -10.4630 -10.1381 -9.7980 51118
φ 0.9932 0.00286 0.9881 0.9935 0.9974 2914
σ 0.0660 0.01020 0.0505 0.0651 0.0838 1347
σ2
0.0045 0.00141 0.0026 0.0042 0.0070 1347
Vasin,Rachel,Jinchao MCMC April 5, 2016 54 / 64
Example : Estimated volatilities
Then we can get the empirical quantiles of the posterior distribution of
100 exp(ht/2), the latent volatilities of yt over time in percent:
Vasin,Rachel,Jinchao MCMC April 5, 2016 55 / 64
Example : Forecast volatility
Forecast volatility of yt in percentage
t + 1 0.584132065586393
t + 2 0.584587272795878
t + 3 0.584863398891602
t + 4 0.585436225819106
t + 5 0.585646621839762
t + 6 0.585741891671509
t + 7 0.586295462435408
t + 8 0.586822153121447
t + 9 0.587354949275841
t + 10 0.587772869210331
Vasin,Rachel,Jinchao MCMC April 5, 2016 56 / 64
Example : Traceplot
Vasin,Rachel,Jinchao MCMC April 5, 2016 57 / 64
Example : Posterior and Prior densities
Vasin,Rachel,Jinchao MCMC April 5, 2016 58 / 64
Example : Mean standardized residuals
Vasin,Rachel,Jinchao MCMC April 5, 2016 59 / 64
Result from STAN : Traceplot
Vasin,Rachel,Jinchao MCMC April 5, 2016 60 / 64
Result from STAN : Traceplot
Vasin,Rachel,Jinchao MCMC April 5, 2016 61 / 64
Result from STAN : Traceplot
Vasin,Rachel,Jinchao MCMC April 5, 2016 62 / 64
Result from STAN : Posterior Density
Vasin,Rachel,Jinchao MCMC April 5, 2016 63 / 64
References
Gregor Kastner (2016)
Dealing with Stochastic Volatility in Time Series Using the R
package stochvol
Kim S, Shephard N, Chib S (1998)
Stochastic Volatility: Likelihood Inference and Com- parison With
ARCH Models
Fruhwirth-Schnatter S, Wagner H (2010)
Vasin,Rachel,Jinchao MCMC April 5, 2016 64 / 64

More Related Content

PDF
Metropolis-Hastings MCMC Short Tutorial
PDF
Introduction to MCMC methods
PDF
Gradient Descent, Back Propagation, and Auto Differentiation - Advanced Spark...
PDF
numerical differentiation&integration
PPTX
Greedy Algorithm - Huffman coding
PPTX
Interpolation and its applications
PDF
Master theorem
PPTX
Fourier series and applications of fourier transform
Metropolis-Hastings MCMC Short Tutorial
Introduction to MCMC methods
Gradient Descent, Back Propagation, and Auto Differentiation - Advanced Spark...
numerical differentiation&integration
Greedy Algorithm - Huffman coding
Interpolation and its applications
Master theorem
Fourier series and applications of fourier transform

What's hot (20)

PDF
Independent Component Analysis
PPT
Laplace transform
PPTX
Grover's algorithm simplified.pptx
PDF
FPDE presentation
PDF
Fractional calculus and applications
PDF
Numerical Methods: curve fitting and interpolation
PPTX
Face recognition using PCA
PPTX
Fourier series
PPTX
Fourier series Introduction
PDF
Frequency Domain Filtering of Digital Images
PDF
Differential Geometry
PDF
Introduction of Partial Differential Equations
PPTX
Chaotic system and its Application in Cryptography
PPT
Per4 induction
PDF
Markov Chain Monte Carlo Methods
PPTX
LAPLACE TRANSFORM SUITABILITY FOR IMAGE PROCESSING
PPT
07 frequency domain DIP
PDF
Network flow problems
PPTX
Presentation on Numerical Method (Trapezoidal Method)
PPT
introduction to Numerical Analysis
Independent Component Analysis
Laplace transform
Grover's algorithm simplified.pptx
FPDE presentation
Fractional calculus and applications
Numerical Methods: curve fitting and interpolation
Face recognition using PCA
Fourier series
Fourier series Introduction
Frequency Domain Filtering of Digital Images
Differential Geometry
Introduction of Partial Differential Equations
Chaotic system and its Application in Cryptography
Per4 induction
Markov Chain Monte Carlo Methods
LAPLACE TRANSFORM SUITABILITY FOR IMAGE PROCESSING
07 frequency domain DIP
Network flow problems
Presentation on Numerical Method (Trapezoidal Method)
introduction to Numerical Analysis
Ad

Similar to mcmc (20)

PDF
A bit about мcmc
PPT
HUST-talk-1.pptof uncertainty quantification. Volume 6. Springer, 2017. SFK08...
PDF
Hastings 1970
PDF
Delayed acceptance for Metropolis-Hastings algorithms
PDF
Random Matrix Theory and Machine Learning - Part 3
PDF
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
PDF
Lecture 3 qualtifed rules of inference
PDF
PRML Reading Chapter 11 - Sampling Method
PDF
random forests for ABC model choice and parameter estimation
PDF
Application of the Monte-Carlo Method to Nonlinear Stochastic Optimization wi...
PDF
Markov chain Monte Carlo methods and some attempts at parallelizing them
PDF
Adaptive Restore algorithm & importance Monte Carlo
ODP
Monte Carlo Methods
PPTX
Monte Carlo Berkeley.pptx
PDF
Qualitative measurement of Klauder coherent states using Bohmian machanics, C...
PDF
bayesian_statistics_introduction_uppsala_university
PDF
Linear models for classification
PDF
Maximum likelihood estimation of regularisation parameters in inverse problem...
PDF
A nonlinear approximation of the Bayesian Update formula
PPTX
Multivariate Methods Assignment Help
A bit about мcmc
HUST-talk-1.pptof uncertainty quantification. Volume 6. Springer, 2017. SFK08...
Hastings 1970
Delayed acceptance for Metropolis-Hastings algorithms
Random Matrix Theory and Machine Learning - Part 3
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
Lecture 3 qualtifed rules of inference
PRML Reading Chapter 11 - Sampling Method
random forests for ABC model choice and parameter estimation
Application of the Monte-Carlo Method to Nonlinear Stochastic Optimization wi...
Markov chain Monte Carlo methods and some attempts at parallelizing them
Adaptive Restore algorithm & importance Monte Carlo
Monte Carlo Methods
Monte Carlo Berkeley.pptx
Qualitative measurement of Klauder coherent states using Bohmian machanics, C...
bayesian_statistics_introduction_uppsala_university
Linear models for classification
Maximum likelihood estimation of regularisation parameters in inverse problem...
A nonlinear approximation of the Bayesian Update formula
Multivariate Methods Assignment Help
Ad

mcmc

  • 1. An introduction of Markov Chain Monte Carlo Vasin,Rachel,Jinchao April 5, 2016 Vasin,Rachel,Jinchao MCMC April 5, 2016 1 / 64
  • 2. Outline 1 Introduction Motivation Statistical Mechanics Example Bayesian Example Monte Carlo Methods 2 Markov Chain Monte Carlo Gibbs Sampling Metropolis Method Metropolis-Hastings algorithm 3 MCMC Diagnostic Vasin,Rachel,Jinchao MCMC April 5, 2016 2 / 64
  • 3. Statistical Mechanics Example Now let us recall an example we learned in first class Probability of the configuration σ0 at inverse temperature β = 1 kT : µΛ,β(σ = σ0) = 1 ZΛ,β exp(−βHN (σ0))PN (σ = σ0) where Hamiltonian HN (σ) = −1 2 x=y J(x, y)σ(x)σ(y) + h x σ(x) Partition function ZΛ,β = σ0 exp(−βHN (σ0))PN (σ = σ0) Prior distribution PN (σ = σ0) = x∈Λ P(σ(x) = σ(x0)) Vasin,Rachel,Jinchao MCMC April 5, 2016 3 / 64
  • 4. Statistical Mechanics Example We want to compute the mean value of an observable A E[A] = A(σ)µΛ,β(dσ). However, usually the lattice size N 1, and the Partition function ZΛ,β is hard to compute. Thus, we try to sample from the Gibbs distribution regardless the normal constant and approximate the integral by numerical result. Vasin,Rachel,Jinchao MCMC April 5, 2016 4 / 64
  • 5. Bayesian Example Observed data Y1, ..., Yn follows some distribution p(y | θ) θ is an unknown parameter Posterior distribution p(θ | y) provides information for θ By Bayes’ Rule: p(θ | y) = p(y | θ)p(θ) Θ p(y | θ)p(θ)dθ If we assume a prior distribution for θ, say U(0, 1), we can use MCMC methods to approximate p(θ | y) Vasin,Rachel,Jinchao MCMC April 5, 2016 5 / 64
  • 6. Bayesian Example Construct Markov Chain with steady state p(θ | y) End with a sample (θ1, ..., θM ) which can be assumed to be a draw from p(θ | y) Can use histogram or time series plot to analyze Popular applications include linear regression and biomedical studies Vasin,Rachel,Jinchao MCMC April 5, 2016 6 / 64
  • 7. The problems to be solved The fundamental problem that we wish to solve is finding the expectation of some function f(z) with respect to a probability distribution p(z), i.e. E[f] = f(z)p(z)dz The general idea of Monte Carlo methods is to obtain a set of samples z(l) (where l = 1, . . . , L) drawn independently from the distribution p(z). This allows the expectation to be approximated by a finite sum ˆf = 1 L L l=1 f(z(l) ). The methods we can use to generate the samples include rejection sampling, importance sampling, etc. Vasin,Rachel,Jinchao MCMC April 5, 2016 7 / 64
  • 8. Markov Chain Monte Carlo The problem, however, is that the samples {z(l)} might not be independent. Suppose we want to draw from our posterior distribution p(θ|y), but we cannot sample independent draws from it. For example, we often do not know the normalizing constant. So we consider sampling slightly dependent draws using a Markov chain, then we can still find quantities of interests from those draws. MCMC is a class of methods using this idea. Vasin,Rachel,Jinchao MCMC April 5, 2016 8 / 64
  • 9. Markov Chain Monte Carlo Set up an ergodic Markov chain {z(l)}, such that lim l→∞ Kl (z(0) , z) = p(z) (stationary distribution) irrespective of the choice of initial value of z(0). K is the transition kernel here. Then by Ergodic Theorem, lim L→∞ 1 L L l=1 f(z(l) ) = f(z)p(z)dz In Bayesian statistics, there are generally two MCMC algorithms that we use: the Gibbs Sampler and the Metropolis-Hastings algorithm. Vasin,Rachel,Jinchao MCMC April 5, 2016 9 / 64
  • 10. Gibbs Sampling MCMC method that indirectly samples from joint distribution Requires initial values for each random variable Uses conditional distributions for sampling Updates one value per time step Sample eventually converges to the desired joint distribution Can also approximate marginal distributions Vasin,Rachel,Jinchao MCMC April 5, 2016 10 / 64
  • 11. Gibbs Sampling Procedure: 1 Identify joint distribution p(z) = p(z1, ... , zk) from which to sample 2 Select initial values for z1, ... , zk. Call them z (0) 1 , ... , z (0) k . 3 Draw a new value for z1, call it z (1) 1 , from conditional distribution p(z1 | z (0) 2 , ... , z (0) k ) 4 Draw a new value for z2, call it z (1) 2 , from conditional distribution p(z2 | z (1) 1 , z (0) 3 , ... z (0) k ) 5 Repeat for all zi, updating the conditioned values each time 6 Repeat steps 3-5 for a large number of times T Vasin,Rachel,Jinchao MCMC April 5, 2016 11 / 64
  • 12. Gibbs Sampling Example: f(X, Y ) = 1 2πσxσy √ 1−ρ2 exp(− 1 2(1−ρ2) [(x−µx)2 σ2 x + (y−µy)2 σ2 y − 2ρ(x−µx)(y−µy) σxσy ]) Bivariate normal with Corr(X, Y )=ρ Suppose µx = µy = 0 and σx = σy = 1 f(X | Y ) = f(X,Y ) R f(X,Y )dx = 1√ 2π(1−ρ2) exp(−(x−ρy)2 2(1−ρ2) ) f(Y | X) = f(X,Y ) R f(X,Y )dy = 1√ 2π(1−ρ2) exp(−(y−ρx)2 2(1−ρ2) ) Initialize x, y = 0 Let x(1) be a random draw from N(0, 1 − ρ2) Let y(1) be a random draw from N(ρx(1), 1 − ρ2) Repeat last 2 steps 10000 times Vasin,Rachel,Jinchao MCMC April 5, 2016 12 / 64
  • 15. Metropolis Method Now suppose we have a posterior p(θ|y) that we want to sample from, but the posterior is not any distribution we know some (or all) of the full conditionals is not any distributions we know (no Gibbs sampling for those whose full conditionals we don’t know) Then, we can consider using the Metropolis Method. Vasin,Rachel,Jinchao MCMC April 5, 2016 15 / 64
  • 16. Metropolis Method The Metropolis Method (Metropolis et al., 1953) follows the following steps: Choose a starting value z(1). At the current state z(τ), generate a candidate z∗ by the proposal distribution q(z|z(τ)) depends on this current state. Accepted the candidate sample with probability A(z∗ , z(τ) ) = min(1, p(z∗) p(z(τ)) ). If the candidate sample is accepted, then z(τ+1) = z∗, otherwise the candidate point z∗ is discarded, z(τ+1) is set to z(τ) and another candidate sample is drawn from the distribution q(z|z(τ+1)). Vasin,Rachel,Jinchao MCMC April 5, 2016 16 / 64
  • 17. Comments For Metropolis Method, the proposal distribution is symmetric, that is q(zA|zB) = q(zB|zA) for all values of zA and zB. In general, we usually use normal distribution N(z(τ), σ) as proposal distribution . Note that if the step from z(τ) to z∗ causes an increase in the value of p(z), then the candidate point is certain to be kept. We can write p(z) = ˜p/Zp,and we will assume that p(z) can readily be evaluated for any given value of z, although the value of Zp may be unknown. Like for µΛ,β(σ = σ0) = 1 ZΛ,β exp(−βHN (σ0))PN (σ = σ0), we have A(σ, σ ) = min(1, µ(σ ) µ(σ) ) = min(1, exp(−β∆H). Vasin,Rachel,Jinchao MCMC April 5, 2016 17 / 64
  • 18. Metropolis-Hastings algorithm A generalization of Metropolis Method is known as the Metropolis-Hastings algorithm (Hastings, 1970), where the proposal distribution is no longer a symmetric function of its arguments. In particular, when the current state is z(τ), we draw a sample z∗ from the distribution qk(z|z(τ)) and then accept it with probability Ak(z∗, z(τ)) where Ak(z∗ , z(τ) ) = min(1, p(z∗)qk(z(τ)|z∗) p(z(τ))qk(z∗|z(τ)) ). Vasin,Rachel,Jinchao MCMC April 5, 2016 18 / 64
  • 19. Terminology Definition : A transition kernel is a function K defined on X × B (X) such that ∀x ∈ X, K (x, ·) is a probability measure ∀A ∈ B (X), K (·, A) is measurable. Definition : A Markov chain with transition kernel K satisfies the detailed balance condition if there exists a function f satisfying f (x) K (x, y) = f (y) K (y, x) for every (x, y) . The balance detailed condition express an equilibrium in the flow of the Markov chain, namely that the probability of being in x and moving to y is the same as the probability of being in y and moving back to x. Vasin,Rachel,Jinchao MCMC April 5, 2016 19 / 64
  • 20. Terminology Definition : Suppose that a Markov chain with transition density q satisfies the detailed balance condition with a probability density function π i.e. π (x) q (x, y) = π (y) q (y, x) . Then the chain is π−reversible. Definition : A probability density π : R → [0, ∞) is a stationary density for this Markov chain, if it satisfies S π (x) q (x, y) dx = π (y) for all y ∈ S. Vasin,Rachel,Jinchao MCMC April 5, 2016 20 / 64
  • 21. Reversibility Theorem : If X is a π−reversible Markov chain, then π is a stationary distribution of X. Now we have solved the problem of constructing an MCMC method. The Markov chain generated by Metropolis-Hastings algorithm is stationary. The detailed balanced condition is not necessary for f to be a stationary measure associated with transition kernel K, it just provides a sufficient condition that is often easy to check and that can be used for most MCMC algorithms. Vasin,Rachel,Jinchao MCMC April 5, 2016 21 / 64
  • 22. Metropolis-Hastings algorithm We can show that p(z) is an invariant distribution of the Markov chain defined by the Metropolis-Hastings algorithm by showing that detailed balance. We have p(z)q(z|z )A(z , z) = min(p(z)q(z|z ), p(z )q(z |z)) = min(p(z )q(z |z), p(z)q(z|z )) = p(z )q(z |z)A(z, z ) as required. Vasin,Rachel,Jinchao MCMC April 5, 2016 22 / 64
  • 23. Metropolis-Hastings algorithm And if z∗ is generated independently, then we have Theorem(Mengersen and Tweedie.Ann.Statist.(1996)) The Metropolis-Hasting Algorithm produces a uniform ergodic chain if there exist a constant M such that p(z) ≤ Mq(z), ∀x ∈ supp p, In this case, Kn (z, ·) − p TV ≤ (1 − 1 M )n , where K is the transition kernel and · TV denotes the total variation norm. Vasin,Rachel,Jinchao MCMC April 5, 2016 23 / 64
  • 24. Metropolis-Hastings algorithm Proof. If p(z) ≤ Mq(z), the transition kernel satisfies K(z, z ) ≥ q(z ) min{ p(z )q(z) p(z)q(z ) , 1} = min{p(z ) q(z) p(z) , q(z )} ≥ 1 M p(z ). To establish the bound on Kn(z, ·) − p TV , first we write K(z, ·) − p TV = sup A | A (K(z, y) − p(y))dy| = {y;p(y)≥K(z,y)} (p(y) − K(z, y))dy ≤ (1 − 1 M ) {y;p(y)≥K(z,y)} p(y)dy ≤ (1 − 1 M ) Vasin,Rachel,Jinchao MCMC April 5, 2016 24 / 64
  • 25. Metropolis-Hastings algorithm We now continue with a recursion argument. We can write A (K2 (z, y) − p(z))dy = supp p [ A (K(u, y) − p(y))dy] × (K(z, u) − p(u))du, and by the same procedure we can get K2 (z, ·) − p TV ≤ (1 − 1 M )2 . We next write a general recursion relation A (Kn+1 (z, y) − p(z))dy = supp p [ A (Kn (u, y) − p(y))dy] × (K(z, u) − p(u))du, and by induction we have Kn (z, ·) − p TV ≤ (1 − 1 M )n . Vasin,Rachel,Jinchao MCMC April 5, 2016 25 / 64
  • 26. Simple Example Now, consider using Metropolis Method to sample a bi-modal target distribution p(y) = 1 2 √ 2π exp(− (y + 5)2 2 ) + 1 6 √ 2π exp(− (y − 7)2 18 ) we begin with y(1) = 0 and let the proposal distribution to be q(y(n−1) |y(n) ) ∼ N(y(n−1) , √ 0.5) Vasin,Rachel,Jinchao MCMC April 5, 2016 26 / 64
  • 27. Simple Example Histogram of y_n[1:100] y_n[1:100] Density −10 0 5 10 15 20 0.000.150.30 Histogram of y_n[1:1000] y_n[1:1000] Density −10 0 5 10 15 20 0.000.060.12 Histogram of y_n[1:10000] y_n[1:10000] Density −10 0 5 10 15 20 0.000.100.200.30 Histogram of y_n[1:1e+05] y_n[1:1e+05] Density −10 0 5 10 15 20 0.000.100.20 Vasin,Rachel,Jinchao MCMC April 5, 2016 27 / 64
  • 28. Simple Example Histogram of y_n q(y(n−1)|y(n))~N(y(n−1),sqrt(0.5)) y_n Density −10 −5 0 5 10 15 20 0.000.100.20 Histogram of y2_n q(y(n−1)|y(n))~N(−5,sqrt(10)) y2_n Density −10 −5 0 5 10 15 20 0.00.20.4 Histogram of y3_n q(y(n−1)|y(n))~N(0,sqrt(10)) y3_n Density −10 −5 0 5 10 15 20 0.000.15 Vasin,Rachel,Jinchao MCMC April 5, 2016 28 / 64
  • 29. MCMC Diagnostic The purpose of Markov Chain Monte Carlo approximation is to obtain a sequence of parameter values φ(1), φ(2), ..., φ(S) such that 1 S S s=1 g φ(s) ≈ g (φ) p (φ) dφ. for any function g of interest. In other word, we want the empirical average of g φ(1) , g φ(2) , ..., g φ(S) to estimate the expected value of g (φ) under target distribution p (φ). Vasin,Rachel,Jinchao MCMC April 5, 2016 29 / 64
  • 30. MCMC Diagnostic Isn’t the Gibbs sampler guaranteed to eventually provide a good approximation to p (θ) ? It is but eventually can be very long time in some situations. 1 In the case of a generic parameter φ and a target distribution p (φ), it is helpful to think of the sequence φ(1), φ(2), ..., φ(S) as the trajectory of a particle φ moving around the parameter space. 2 In term of MCMC integral approximation, the critical thing is that the amount of time the particle spends in given set A is proportional to the target probability A p (φ) dφ. Vasin,Rachel,Jinchao MCMC April 5, 2016 30 / 64
  • 31. MCMC Diagnostic : Mixing Speed of mixing describes how quickly particle moves around the parameter space. An independent MC sampler has perfect mixing : It has zero auto correlation and can jump around between regions of the parameter space in one step. On the other hands, MCMC samples are NOT independent draws from a target distribution because 1 The first draw is set by the user and thus not a random draw from the target distribution. 2 Subsequently, draw s + 1 depends on draw s: We say that the samples are auto correlated. Vasin,Rachel,Jinchao MCMC April 5, 2016 31 / 64
  • 32. MCMC Diagnostic : Traceplot The initial phase of an MCMC chain is called the burn-in phase, during which the chain converges towards the target distribution. The trace plots can be used to detect burn-in. Note that we can never be sure that a chain has converged but at least we can detect lack of convergence. A traceplot is a plot of the iteration number against the value of the draw of the parameter at each iteration. We can see whether our chain gets stuck in certain areas of the parameter space, which indicates bad mixing. Vasin,Rachel,Jinchao MCMC April 5, 2016 32 / 64
  • 34. MCMC Diagnostic : Auto Correlation Function Another way to assess convergence is to assess the auto correlations between the draws of our Markov chain. Let f = {f (x)}x∈S be a real value function defined on the state space S. Then for the stationary markov chain, {ft} ≡ {f (Xt)} is a stationary stochastic process with mean µf ≡ ft = x πxf (x) . Vasin,Rachel,Jinchao MCMC April 5, 2016 34 / 64
  • 35. MCMC Diagnostic : Auto Correlation Function We define covariance matrix as Xi, Xj ≡ (Xi − Xi ) (Xj − Xj ) ≡ XiXj − Xi Xj and define un-normalized auto correlation function or auto covariance function as Cff (t) ≡ fs, fs+t − ft 2 where normalized auto correlation function is given by ρff (t) ≡ Cff (t) Cff (0) . Typically, ρff (t) decays exponentially ∼ e−|t|/τ . Vasin,Rachel,Jinchao MCMC April 5, 2016 35 / 64
  • 36. MCMC Diagnostic : Auto Correlation Function For a given observable f we define the integrated auto correlation time τint,f = 1 2 ∞ t=−∞ ρff (t) ≈ 1 2 + ∞ t=1 ρff . The integrated auto correlation time controls the statistical error in Monte Carlo measurements of f . Vasin,Rachel,Jinchao MCMC April 5, 2016 36 / 64
  • 37. MCMC Diagnostic : Auto Correlation Function More precisely the sample mean ¯f = 1 n n t=1 ft has variance V ar ¯f = 1 n2 n r,s=1 Cff (r − s) ≈ 1 n (2τint,f ) Cff (0) for n τ. Thus the variance of ¯f is a factor 2τint,f larger than it would be if the {ft} were statistically independent. Stated differently the number of effectively independent samples in a run of length n is roughly n 2τint,f . Vasin,Rachel,Jinchao MCMC April 5, 2016 37 / 64
  • 38. MCMC Diagnostic : Auto Correlation Function If autocorrelation is still relatively high for higher values of k, this indicates high degree of correlation between our draws and slow mixing. Vasin,Rachel,Jinchao MCMC April 5, 2016 38 / 64
  • 39. MCMC Diagnostic : Effective Sample Size One measure to evaluate whether we have enough MCMC samples is the effective sample size Seff of an MCMC sample. Seff can be interpreted as the number of independent MC samples needed to obtain the same precision for the mean. Suppose we approximate E (θ) by the sample mean ¯θ = 1 S θ(s) . Vasin,Rachel,Jinchao MCMC April 5, 2016 39 / 64
  • 40. MCMC Diagnostic : Effective Sample Size For an MC sample we have: V arMC ¯θ = V ar (θ) S . For an MCMC sample we have V arMCMC ¯θ = V ar (θ) S + 1 S2 s=t E θ(s) − E (θ) θ(t) − E (θ) where the last term depends on the auto-correlation in the MCMC chain, and is generally positive. Vasin,Rachel,Jinchao MCMC April 5, 2016 40 / 64
  • 41. MCMC Diagnostic : Effective Sample Size The effective sample size is defined as follows: V arMCMC ¯θ = V ar (θ) Seff . Potential scale reduction factor or shrink factor ˆR is, approximately, the square root of the variance of the mixture of the chains, divided by the average within chain variance. If chains have mixed well, then ˆR is close to 1. Vasin,Rachel,Jinchao MCMC April 5, 2016 41 / 64
  • 42. Golden Rule ! Some rules of thumb : ˆR < 1.1 and Seff 100 !!! Vasin,Rachel,Jinchao MCMC April 5, 2016 42 / 64
  • 43. References Christopher Bishop (2006) Pattern Recognition and Machine Learning Springer, p523–556. Christian Robert and George Casella (2004) Monte Carlo Statistical Methods Springer, p267–318. Jun S. Liu (2001) Monte Carlo Strategies In Scientific Computing Springer, p105-127 http://www.mas.ncl.ac.uk/ ndjw1/teaching/sim/gibbs/gibbs.r http://www.stats.ox.ac.uk/ cholmes/Courses/BDA/bda-mcmc.pdf Peter D. Hoff (2009) A First Course in Bayesian Statistical Methods Vasin,Rachel,Jinchao MCMC April 5, 2016 43 / 64
  • 44. Stochastic Volatility Vasin,Rachel,Jinchao April 5, 2016 Vasin,Rachel,Jinchao MCMC April 5, 2016 44 / 64
  • 45. Introduction Financial returns indicate the profit from an investment From finance group, rate of return σ assumed constant In reality, volatility can vary with respect to time To model volatility, we can use deterministic or probabilistic methods We will focus on probabilistic methods using MCMC and Bayesian analysis Vasin,Rachel,Jinchao MCMC April 5, 2016 45 / 64
  • 46. SV Model Let y = (y1, y2, ..., yn) be a collection of observed returns Each return is normally distributed with mean 0 and unknown variance Log-variance is normally distributed with unknown mean and variance Log-variance also follows a Markov chain where the distribution of the current one depends only on the value of the previous one Vasin,Rachel,Jinchao MCMC April 5, 2016 46 / 64
  • 47. SV model : Model Specification Mathematically: yt | ht ∼ N(0, eht ) h0 | µ, φ, σ ∼ N(µ, σ2 1 − φ2 ) ht | ht−1, µ, φ, σ ∼ N(µ + φ(ht−1 − µ), σ2 ) h = (h1, h2, ..., hn) = log-variance or volatility process µ = level of log-variance φ = persistence of log-variance σ = volatility of log-variance Vasin,Rachel,Jinchao MCMC April 5, 2016 47 / 64
  • 48. Prior distributions To complete the model setup, a prior distribution for the parameter vector θ needs to be specified. Following Kim S, Shephard N, Chib S (1998), we choose independent components for each parameter p (θ) = p (µ) p (φ) p (ση) . The level µ ∈ R is is equipped with the usual normal prior µ ∼ N (bµ, Bµ). In practical applications, this prior is usually chosen to be rather uninformative; namely, bµ = 0 and Bµ ≥ 100 for daily log returns. Vasin,Rachel,Jinchao MCMC April 5, 2016 48 / 64
  • 49. Prior distributions For the persistence parameter φ ∈ (−1, 1) , we choose (φ+1) 2 ∼ B (a0, b0) implying that p (φ) = 1 2B (a0, b0) 1 + φ 2 a0−1 1 − φ 2 b0−1 where a0 and b0 are positive parameters and B (x, y) = 1 0 tx−1 (1 − t)y−1 dt denotes the beta function. Vasin,Rachel,Jinchao MCMC April 5, 2016 49 / 64
  • 50. Prior distributions Lastly, for the volatility log - variance ση ∈ R+, we choose σ2 η ∼ Gamma 1 2 , 1 2Bση . This choice is motivated by Fruhwirth-Schnatter and Wagner (2010) who equivalently stipulate the prior for ± σ2 η to follow a centered normal distribution. Vasin,Rachel,Jinchao MCMC April 5, 2016 50 / 64
  • 51. Posterior distribution Then, we can estimate our parameters (h, µ, φ, σ) using the posterior distribution given by the Bayesian formula: p(h, µ, φ, σ|y) ∝ p(y|h)p(h|h0, µ, φ, σ)p(h0|µ, φ, σ)p0(µ, φ, σ) where y is our data, p(y|h)p(h|h0, µ, φ, σ)p(h0|µ, φ, σ) is the lkelihood function which is given by our SV-model, and p0(µ, φ, σ) = p(θ) is the prior distribution we construct just now. Now, we can use MCMC method to sampling the parameters from the posterior distribution. Vasin,Rachel,Jinchao MCMC April 5, 2016 51 / 64
  • 52. Example : EUR/USD Euro to US dollar exchange rates Data available: http://sdw.ecb.europa.eu/ Also available under ”exrates” within the R package ”stochvol” Vasin,Rachel,Jinchao MCMC April 5, 2016 52 / 64
  • 53. Example : Prior distributions Prior distributions : µ ∼ N (0, 100) (φ + 1) 2 ∼ B (20, 1.5) σ2 η ∼ Gamma 1 2 , 1 0.2 = χ2 1 × 0.1 For some discussion about this issue, see, Kim S, Shephard N, Chib S (1998) who choose a0 = 20 and b0 = 1.5 implying a prior mean of 0.86 with the prior standard deviation of 0.11. Vasin,Rachel,Jinchao MCMC April 5, 2016 53 / 64
  • 54. Example : Output Summary The summary of 100,000 MCMC draws after a burn-in of 50,000 Mean SD 5% quantile 50% quantile 95% quantile Seff µ -10.1364 0.23363 -10.4630 -10.1381 -9.7980 51118 φ 0.9932 0.00286 0.9881 0.9935 0.9974 2914 σ 0.0660 0.01020 0.0505 0.0651 0.0838 1347 σ2 0.0045 0.00141 0.0026 0.0042 0.0070 1347 Vasin,Rachel,Jinchao MCMC April 5, 2016 54 / 64
  • 55. Example : Estimated volatilities Then we can get the empirical quantiles of the posterior distribution of 100 exp(ht/2), the latent volatilities of yt over time in percent: Vasin,Rachel,Jinchao MCMC April 5, 2016 55 / 64
  • 56. Example : Forecast volatility Forecast volatility of yt in percentage t + 1 0.584132065586393 t + 2 0.584587272795878 t + 3 0.584863398891602 t + 4 0.585436225819106 t + 5 0.585646621839762 t + 6 0.585741891671509 t + 7 0.586295462435408 t + 8 0.586822153121447 t + 9 0.587354949275841 t + 10 0.587772869210331 Vasin,Rachel,Jinchao MCMC April 5, 2016 56 / 64
  • 57. Example : Traceplot Vasin,Rachel,Jinchao MCMC April 5, 2016 57 / 64
  • 58. Example : Posterior and Prior densities Vasin,Rachel,Jinchao MCMC April 5, 2016 58 / 64
  • 59. Example : Mean standardized residuals Vasin,Rachel,Jinchao MCMC April 5, 2016 59 / 64
  • 60. Result from STAN : Traceplot Vasin,Rachel,Jinchao MCMC April 5, 2016 60 / 64
  • 61. Result from STAN : Traceplot Vasin,Rachel,Jinchao MCMC April 5, 2016 61 / 64
  • 62. Result from STAN : Traceplot Vasin,Rachel,Jinchao MCMC April 5, 2016 62 / 64
  • 63. Result from STAN : Posterior Density Vasin,Rachel,Jinchao MCMC April 5, 2016 63 / 64
  • 64. References Gregor Kastner (2016) Dealing with Stochastic Volatility in Time Series Using the R package stochvol Kim S, Shephard N, Chib S (1998) Stochastic Volatility: Likelihood Inference and Com- parison With ARCH Models Fruhwirth-Schnatter S, Wagner H (2010) Vasin,Rachel,Jinchao MCMC April 5, 2016 64 / 64