MCMC: Metropolis-Hastings Algorithm
Lecture 25
The last line is due to the fact that there exists a unique constant that normalizes f to be a pdf. Since the left-hand side is a cdf, 1/(c(1 − ρ)) is this constant.
A major drawback of this method is that it may lead us to reject many draws before we finally accept one. This can make the procedure inefficient. If we choose c and h(z) poorly, then f(z)/(c h(z)) could be very small for many z. It will be especially difficult to choose a good c and h(·) when we do not know much about π(z).
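For concreteness, a minimal accept-reject sketch in Python (illustrative only; the function names and arguments are placeholders chosen here, with f = kπ the unnormalized target, sample_h / h_pdf the proposal we can draw from, and c a constant with f(z) ≤ c h(z)):

import numpy as np

def accept_reject(f, sample_h, h_pdf, c, n_draws, rng=None):
    # Draw z ~ h and accept it with probability f(z) / (c * h(z)).
    # If this ratio is small for most z, almost every draw is rejected,
    # which is exactly the inefficiency described above.
    rng = np.random.default_rng() if rng is None else rng
    out = []
    while len(out) < n_draws:
        z = sample_h(rng)
        if rng.uniform() < f(z) / (c * h_pdf(z)):
            out.append(z)
    return np.array(out)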
Markov Chains
A Markov Chain is a stochastic process where the distribution of xt+1 only depends on xt , P (xt+1 ∈
A|xt , xt−1 , ...) = P (xt+1 ∈ A|xt ) ∀A.
Definition 1. A transition kernel is a function, P (x, A), such that, for every x it is a probability measure
in the second argument:
P (x, A) = P (xt+1 ∈ A|xt = x)
It gives the probability of moving from x into the set A.
The transition kernel may have atoms; in particular, we will be considering cases with a non-zero probability of staying (not moving): P(x, {x}) ≠ 0.
We want to study the behavior of a sequence of draws x1 → x2 → ... where we move around according
to a transition kernel. Suppose the distribution of x_t is P^(t); then the distribution of y = x_{t+1} satisfies

P^(t+1)(y)dy = ∫_ℜ P^(t)(x) P(x, dy) dx.
Definition 2. A distribution π* is called an invariant measure (with respect to the transition kernel P(x, A)) if

π*(y)dy = ∫_ℜ π*(x) P(x, dy) dx.
Under some regularity conditions, a transition kernel P(x, A) has a unique invariant distribution π*, and the marginal distribution P^(t) of x_t, an element of a Markov chain with transition kernel P(x, A), converges to this invariant distribution π* as t → ∞. That is, if one runs the Markov chain long enough, the distribution of the current draw is close to π*. Generally, if the transition kernel is irreducible (it can reach any point from any other point) and aperiodic (not periodic, i.e. the greatest common divisor of {n : y can be reached from x in n steps} is 1), then it converges to an invariant distribution.
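As a simple concrete example, a Gaussian AR(1) process is a Markov chain whose invariant distribution can be verified directly:
\[
x_{t+1} = \rho x_t + \varepsilon_{t+1}, \quad \varepsilon_{t+1} \sim N(0, \sigma^2), \ |\rho| < 1,
\qquad P(x, dy) = N(\rho x, \sigma^2)(dy),
\]
\[
\pi^* = N\Big(0, \tfrac{\sigma^2}{1-\rho^2}\Big): \quad
x_t \sim \pi^* \ \Rightarrow\ x_{t+1} = \rho x_t + \varepsilon_{t+1}
\sim N\Big(0, \rho^2 \tfrac{\sigma^2}{1-\rho^2} + \sigma^2\Big)
= N\Big(0, \tfrac{\sigma^2}{1-\rho^2}\Big) = \pi^*.
\]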
A classical Markov chain problem is to find π* given P(x, A). MCMC solves the inverse problem.
Assume we want to simulate a draw from π ∗ (which we know up to a constant multiplier). We need to find
a transition kernel P (x, dy) such that π ∗ is its invariant measure. Let’s suppose that π ∗ is continuous. We
will consider the class of kernels

P(x, dy) = p(x, y)dy + r(x)∆_x(dy),    (*)

where ∆_x(dy) is a unit mass measure concentrated at point x: ∆_x(A) = I{x ∈ A}. So the transition kernel (*) says that we stay at x with probability r(x); otherwise y is distributed according to some pdf proportional to p(x, y). Notice that p(x, y) isn't exactly a density, because it doesn't integrate to 1: ∫P(x, dy) = 1 = ∫p(x, y)dy + r(x), so ∫p(x, y)dy = 1 − r(x).
Definition 3. A transition kernel of the form (*) is reversible with respect to π if π(x)p(x, y) = π(y)p(y, x).
Theorem 4. If a transition kernel is reversible, then π is invariant.
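To see why, a quick check for kernels of the form (*): reversibility and ∫p(y, x)dx = 1 − r(y) give, for every y,
\[
\int \pi(x) P(x, dy)\, dx
= \Big(\int \pi(x) p(x, y)\, dx\Big) dy + \pi(y) r(y)\, dy
= \Big(\int \pi(y) p(y, x)\, dx\Big) dy + \pi(y) r(y)\, dy
= \pi(y)\big(1 - r(y)\big)\, dy + \pi(y) r(y)\, dy
= \pi(y)\, dy,
\]
so π is invariant.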
Metropolis-Hastings
The goal: we want to simulate a draw from the distribution π which we know up to a constant. That is, we
can compute a function proportional to π, f (x) = kπ(x). We will generate a Markov chain with transition
kernel of the form (*) that is reversible with respect to π. Then, if the chain runs long enough, its elements will approximately have distribution π. The main question is how to generate such a Markov chain.
Suppose we have a Markov chain in state x. Assume that we can draw y ∼ q(x, y), a pdf with respect to y (so ∫q(x, y)dy = 1). Consider using this q as a transition kernel. Notice that if
π(x)q(x, y) > π(y)q(y, x)
then the chain won’t be reversible (we would move from x to y too often). This suggests that rather than
always moving to the new y we draw, we should only move with some probability, α(x, y). If we construct
α(x, y) such that
π(x)q(x, y)α(x, y) = π(y)q(y, x)α(y, x)
then we will have a reversible transition kernel with invariant measure π. We can take:
α(x, y) = min{1, π(y)q(y, x) / (π(x)q(x, y))}
We can calculate α(x, y) because although we do not know π(x), we do know f (x) = kπ(x), so we can
compute the ratio.
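To verify that this α satisfies the reversibility condition, suppose (without loss of generality) that π(y)q(y, x) ≤ π(x)q(x, y). Then α(x, y) = π(y)q(y, x)/(π(x)q(x, y)) and α(y, x) = 1, so
\[
\pi(x) q(x, y) \alpha(x, y) = \pi(y) q(y, x) = \pi(y) q(y, x) \alpha(y, x),
\]
and the kernel is reversible with respect to π.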
In summary, the Metropolis-Hastings algorithm is: given xt we move to xt+1 by
1. Generate a draw, y, from q(xt , ·)
2. Calculate α(xt , y)
3. Draw u ∼ U [0, 1]
4. If u < α(xt , y), then xt+1 = y. Otherwise xt+1 = xt
This produces a chain with
P(x, dy) = q(x, y)α(x, y)dy + r(x)∆_x(dy),    r(x) = 1 − ∫ q(x, y)α(x, y)dy.
Then the marginal distribution of x_t will converge to π. In practice, we begin the chain at an arbitrary x_0, run the algorithm many, say M, times, and then use the last N < M draws as a sample from π. Note that although the marginal distribution of each x_t is (approximately) π, the draws x_t are autocorrelated. This is not a problem
for computing moments from the draws (although the higher the autocorrelation, the more draws we need
to get the same accuracy), but if we want to put standard errors on these moments, we need to take the
autocorrelation into account.
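As an illustration, here is a minimal random-walk Metropolis-Hastings implementation of steps 1-4 in Python (a sketch; the function name, the example target, and the tuning constants are choices made for this illustration, not part of the notes):

import numpy as np

def metropolis_hastings(log_f, x0, n_draws, burn_in, step_sd, rng=None):
    # log_f: log of f(x) = k*pi(x); the unknown constant k cancels in the ratio.
    # Random-walk proposal: y = x + eps, eps ~ N(0, step_sd^2), which is symmetric,
    # so q(x, y)/q(y, x) = 1 and alpha(x, y) = min{1, pi(y)/pi(x)} = min{1, f(y)/f(x)}.
    rng = np.random.default_rng() if rng is None else rng
    draws = np.empty(n_draws + burn_in)
    x = x0
    for t in range(n_draws + burn_in):
        y = x + step_sd * rng.standard_normal()      # step 1: propose y ~ q(x, .)
        log_alpha = min(0.0, log_f(y) - log_f(x))    # step 2: log acceptance probability
        if np.log(rng.uniform()) < log_alpha:        # steps 3-4: accept, otherwise stay at x
            x = y
        draws[t] = x
    return draws[burn_in:]                           # discard burn-in, keep the last draws

# Example: target pi known only up to a constant, here an unnormalized N(2, 1) density.
log_f = lambda x: -0.5 * (x - 2.0) ** 2
sample = metropolis_hastings(log_f, x0=0.0, n_draws=50_000, burn_in=5_000, step_sd=1.0)
print(sample.mean(), sample.var())   # should be roughly 2 and 1

Because the draws are autocorrelated, a standard error for sample.mean() should use a HAC-type (e.g. Newey-West or batch-means) variance estimate rather than the i.i.d. formula.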
Choice of q()
• Random walk chain: q(x, y) = q1(y − x), i.e. y = x + ε, ε ∼ q1. This can be a nice choice because if q1 is symmetric, q1(z) = q1(−z), then q(x, y)/q(y, x) drops out and α(x, y) = min{1, π(y)/π(x)}. Popular choices of q1 are normal and U[−a, a]. Note that there is a tradeoff between step size in the chain and rejection probability when choosing σ² = Eε² (see the tuning sketch after this list). Choosing σ² too large will lead to many draws of y from low-probability areas (low π), and as a result we will reject lots of draws. Choosing σ² too small will lead us to accept most draws, but not move very much, and we will have difficulty covering the whole support of π. In either case, the autocorrelation in our draws will be very high and we will need more draws to get a good sample from π.
• Independence chain: q(x, y) = q1 (y)
• If there is additional information that π(y) ∝ ψ(y)h(y), where ψ is bounded and we can sample from h, we can take q(x, y) = h(y). This also simplifies the acceptance probability to α(x, y) = min{1, ψ(y)/ψ(x)}.
• Autocorrelated proposal: y = a + B(x − a) + ε with B < 0; this induces negative autocorrelation in the proposals. The hope is that this offsets some of the positive autocorrelation inherent in the procedure.
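The step-size tradeoff for the random-walk chain can be seen directly in simulation; a small check reusing the metropolis_hastings function and the log_f target from the sketch above (the step sizes are arbitrary):

# Rough tuning check for the random-walk proposal: very small steps accept almost
# everything but move slowly; very large steps are rejected most of the time.
for step_sd in (0.05, 1.0, 25.0):
    s = metropolis_hastings(log_f, x0=0.0, n_draws=20_000, burn_in=2_000, step_sd=step_sd)
    accept_rate = np.mean(np.diff(s) != 0)        # fraction of proposals actually taken
    lag1_corr = np.corrcoef(s[:-1], s[1:])[0, 1]  # autocorrelation of consecutive draws
    print(step_sd, round(accept_rate, 2), round(lag1_corr, 2))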
Cite as: Anna Mikusheva, course materials for 14.384 Time Series Analysis, Fall 2007. MIT OpenCourseWare (http://ocw.mit.edu),
Massachusetts Institute of Technology. Downloaded on [DD Month YYYY].
MIT OpenCourseWare
http://ocw.mit.edu
For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.