lecture-notes-markov
21 October 2013
Abstract
For Halloween, we come as a math course.
where D is some domain (possibly all space), and f is our favorite function. For
most functions, there is no closed-form expression for such definite integrals. Numerical
analysis provides various means of approximating definite integrals, starting
with Euler's method for one-dimensional integrals,
\[
\int_a^b f(x)\,dx \approx h \sum_{i=1}^{\lfloor (b-a)/h \rfloor} f(a + ih) \qquad (2)
\]
and getting arbitrarily more sophisticated. Unfortunately, these are slow, especially
when x is really a high-dimensional vector; one ends up having to evaluate the func-
tion f at an exponentially growing number of points just to get the definite integral.
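For concreteness, here is a minimal sketch in R of the rule in Eq. 2; the function name euler.integrate and the choice of test integrand are purely illustrative.

# Approximate integral_a^b f(x) dx using the rule in Eq. 2
euler.integrate <- function(f, a, b, h) {
  i <- 1:floor((b - a) / h)     # indices of the evaluation points a + i*h
  h * sum(f(a + i * h))
}
euler.integrate(sin, 0, pi, h = 1e-4)   # exact answer is 2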
It turns out that designing nuclear weapons involves doing a lot of complicated
integrals [2], and so one of the first uses of modern computing machines was an efficient
but random approximation scheme, which the physicists involved called “the Monte
Carlo method”, after the European gambling resort. Recall that the expectation value
of a function f with respect to a distribution whose probability density is p is
\[
E_p[f(X)] = \int f(x)\, p(x)\, dx \qquad (3)
\]
The most basic Monte Carlo method for evaluating the integral in (1) is to draw
X uniformly over the domain D. If the total measure of D is |D|, then the uniform
density is 1/|D| on D, and 0 everywhere else, so
\[
\int_D f(x)\,dx = |D| \int_D f(x)\, \frac{1}{|D|}\,dx \qquad (5)
\]
and
\[
\frac{|D|}{n} \sum_{i=1}^{n} f(X_i) \rightarrow \int_D f(x)\,dx \qquad (6)
\]
How good is the approximation, i.e., how close are the two sides of Eq. 6? Because
the X_i are IID, we can use another result you remember from introductory probability,
the central limit theorem: when Y_i are IID with common mean µ and variance σ²,
the sample mean approaches a Gaussian distribution with mean µ and variance σ²/n.
Symbolically,
\[
\frac{1}{n}\sum_{i=1}^{n} Y_i \rightsquigarrow N\!\left(\mu, \frac{\sigma^2}{n}\right) \qquad (7)
\]
In this case, the role of Y_i is played by f(X_i), so
\[
\frac{|D|}{n}\sum_{i=1}^{n} f(X_i) \rightsquigarrow N\!\left(\int_D f(x)\,dx,\; |D|^2 \frac{\sigma_f^2}{n}\right) \qquad (8)
\]
with σ_f² being the variance of f(X). Thus, the Monte Carlo estimate is unbiased
(its expected value is equal to the truth), and its variance goes down like 1/n. This
is true no matter what the dimension of X might be. So, unlike the numerical
integration schemes, reducing the error of the Monte Carlo estimate doesn't require
exponentially many points. In fact, if we knew σ_f², we could use the known Gaussian
distribution to give confidence intervals for ∫_D f(x)dx. If, as is usually the case, we
don't know that variance, we can always estimate it from the samples, and the Gaussian
confidence intervals will become correct as n → ∞, or we can use corrections
(based on, say, the t distribution) familiar from basic statistics.
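To make this concrete, here is a minimal sketch in R of the uniform Monte Carlo recipe of Eq. 6 with a CLT-based confidence interval; the function name mc.uniform and its arguments are illustrative choices, not anything fixed above.

# Monte Carlo estimate of integral_a^b f(x) dx by uniform sampling,
# with a central-limit-theorem confidence interval
mc.uniform <- function(f, a, b, n = 1e4, level = 0.95) {
  x <- runif(n, min = a, max = b)     # X_i drawn uniformly on [a, b]
  fx <- f(x)
  est <- (b - a) * mean(fx)           # |D| times the sample mean, as in Eq. 6
  se <- (b - a) * sd(fx) / sqrt(n)    # estimated standard error, as in Eq. 8
  z <- qnorm(1 - (1 - level) / 2)
  c(estimate = est, lower = est - z * se, upper = est + z * se)
}
mc.uniform(sin, 0, pi)   # the exact value is 2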
There are many situations in which this basic recipe is impractical or even impos-
sible. For instance, D may have a very complicated shape, making it hard to draw
samples uniformly. It may then be possible to find some larger region, say C , which
contains D and for which we can generate uniform samples. Then, because
\[
\int_D f(x)\,dx = \int_C f(x)\,\mathbf{1}_D(x)\,dx \qquad (9)
\]
we can sample the X_i uniformly over C and use the average of |C| f(X_i) 1_D(X_i) instead.
Notice however that sample points X_i which fall outside D are simply wasted, and
they will be about 1 − |D|/|C| of all the samples, so it pays to make the sampling region
not too much larger than the domain of integration!
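As a small illustrative sketch in R, here is the indicator trick of Eq. 9 used to integrate over the unit disc by sampling from a bounding square; the function name mc.disc is made up for this example.

# Integrate f over the unit disc D = {x^2 + y^2 <= 1} by sampling uniformly
# from the bounding square C = [-1, 1]^2 and multiplying by the indicator 1_D
mc.disc <- function(f, n = 1e5) {
  x <- runif(n, -1, 1)
  y <- runif(n, -1, 1)
  inside <- (x^2 + y^2 <= 1)        # the indicator 1_D(X_i)
  4 * mean(f(x, y) * inside)        # |C| = 4 for the square [-1, 1]^2
}
mc.disc(function(x, y) rep(1, length(x)))   # f = 1 recovers the area of the disc, pi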
Even this approach is of no use when D is infinite, since then a uniform distribu-
tion makes no sense. But an integral like
\[
\int_0^\infty x^2 e^{-x^2}\,dx \qquad (11)
\]
is, as we know, finite, so there should be some trick to evaluating it by Monte Carlo.
The trick is to introduce a density p which has the same support as D (in this case,
the positive half of the real line). Then
\[
\int_D f(x)\,dx = \int_D \frac{f(x)}{p(x)}\, p(x)\,dx \qquad (12)
\]
Because p(x) > 0 everywhere on D, this is legitimate (we’re never dividing by zero).
And now, if the Xi are generated from p,
\[
\frac{1}{n}\sum_{i=1}^{n} \frac{f(X_i)}{p(X_i)} \rightarrow E_p\!\left[\frac{f(X)}{p(X)}\right] = \int_D f(x)\,dx \qquad (13)
\]
as desired.
Again, it’s worth asking how good the approximation is, and again, we can use
the central limit theorem:
\[
\frac{1}{n}\sum_{i=1}^{n} \frac{f(X_i)}{p(X_i)} \rightsquigarrow N\!\left(\int_D f(x)\,dx,\; \frac{\sigma^2_{f/p}}{n}\right) \qquad (14)
\]
where σ²_{f/p} is the variance of f(X)/p(X) (when X ∼ p). Once again, the Monte
Carlo approximation is unbiased, and the variance of the approximation goes to zero
like 1/n, no matter how high-dimensional X is, or how ugly f or D might be.
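As a minimal sketch in R of the recipe in Eq. 13 applied to Eq. 11, one workable (purely illustrative) choice of p is the Exponential(1) density, which has the right support:

# Estimate integral_0^Inf x^2 exp(-x^2) dx (Eq. 11) by drawing from p = Exp(1)
n <- 1e5
x <- rexp(n, rate = 1)                 # X_i ~ p
f <- function(x) x^2 * exp(-x^2)
p <- function(x) dexp(x, rate = 1)
mean(f(x) / p(x))                      # Eq. 13; the exact value is sqrt(pi)/4
var(f(x) / p(x)) / n                   # estimate of the CLT variance in Eq. 14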
• Simple forms It is often worth looking carefully at the integrand to see if a
probability density can be factored out of it. In (11), for instance, the factor
e^{−x²} is proportional to a N(0, 1/2) density,
\[
\int_0^\infty x^2 e^{-x^2}\,dx = \int_0^\infty \sqrt{\pi}\, x^2\, \frac{1}{\sqrt{2\pi/2}}\, e^{-x^2/(2\cdot 1/2)}\,dx \qquad (15)
\]
so we can simulate from that density and take the average of √π X_i² 1(X_i > 0)
(which, by symmetry, converges to the same thing as half the average of √π X_i²).
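A quick sketch of this in R, keeping only the positive draws as just described:

# Eq. 11 again, now by factoring out the N(0, 1/2) density
n <- 1e5
x <- rnorm(n, mean = 0, sd = sqrt(1/2))   # X_i ~ N(0, 1/2)
mean(sqrt(pi) * x^2 * (x > 0))            # should be close to sqrt(pi)/4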
Problem
Write a function which takes as input a real-valued function of one argument, a lower
and upper limit of integration, and a number of samples, and returns a Monte Carlo
approximation to the integral over that interval. Check whether the lower limit is
-Inf or the upper limit is Inf (both might be true!), and choose a p appropriately.
How would you check this?
By the central limit theorem, for large n, the approximations have a Gaussian distri-
bution around the true expected values, with a variance shrinking like 1/n.
If we want E_p[f(X)] but find it easier to draw the X_i from some other density r, we
can instead average f(X_i) p(X_i)/r(X_i). Notice that this is a weighted mean of the f(X_i), but one where the weights are also
random. This sort of approximation, where we calculate the expectation under one
distribution by sampling from another, is called importance sampling, and the ratios
p(X_i)/r(X_i) are the importance weights. Once again, the approximations tend to a
Gaussian distribution around the truth.
As with picking p in plain Monte Carlo, there is a tension between choosing a
distribution r which is easy to draw from, and choosing one which will be efficient.
It is easy to check that the sample mean of the importance weights will tend towards
1 as n grows6. But if r puts a lot of probability on regions where p(x) is very small,
many terms in the sample average will be weighted down towards zero, and so nearly
wasted; while a few will have to get very large weights, and averaging only a few
random terms is noisy. So to get good approximations to E_p[f(X)], it is usually
desirable for p(x)/r(x) to not vary too much from 1.
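Here is a minimal sketch of importance sampling in R; the target p, the proposal r, and the function f are all just illustrative choices.

# Importance sampling: estimate E_p[f(X)] with p = N(0,1),
# drawing instead from r = Student's t with 3 degrees of freedom
n <- 1e5
x <- rt(n, df = 3)                 # X_i ~ r
w <- dnorm(x) / dt(x, df = 3)      # importance weights p(X_i) / r(X_i)
f <- function(x) x^2
mean(w * f(x))                     # estimates E_p[X^2] = 1
mean(w)                            # sample mean of the weights; tends towards 1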
Problem
The entropy of a distribution with density p is
\[
-\int p(x) \log_2 p(x)\,dx \qquad (19)
\]
2 Markov Chains
So far, we have been simulating sequences of completely independent random vari-
ables. This is excessively limiting. The world has lots of variables which are related
to each other — the most important parts of statistics are about describing these rela-
tionships — and so we should be able to simulate dependent variables.
A stochastic process is simply a collection of random variables with a joint dis-
tribution, usually a dependent one9 . Often, but not always, the variables come in a
sequence, X1 , X2 , . . . Xn , . . .. In principle, then, the distribution of X t +1 could depend
on the value of all previous variables, X1 , X2 , . . . X t . At the other extreme, in an IID
sequence, no variable depends on any other variable.
6 Because ∫ (p(x)/r(x)) r(x) dx = 1 (EXERCISE: why?), and the law of large numbers applies.
7 EXERCISE: Are there ever situations where the estimate would be improved by sampling from r
rather than p, assuming it's equally easy to do either?
8 Or re-derive them!
9 The original motive for using the word “stochastic” in place of “random” is that people tended to take
The most important class of stochastic processes which actually have dependence
are the Markov processes10 , in which the distribution of X t +1 depends only on the
value of X t . The variable X t is called the state of the process at time t .
When I say that the distribution of X t +1 depends only on the value of X t , I mean
in particular that X t +1 is conditionally independent of X1 , . . . X t −1 given X t . (“The
future is independent of the past, given the present state.”) This conditional indepen-
dence is called the Markov property. Conceptually, we can view it in two ways:
• In an IID sequence, X t +1 is conditionally independent of earlier states given
X t . It’s also unconditionally independent of them, since there is no dependence
at all. From this perspective, the Markov property is a minimal weakening of
the idea of an IID sequence.
• In a deterministic dynamical system, like the Arnold cat map we saw in the
last lecture (or the laws of classical physics), the next state is a function of the
current state, and earlier states are irrelevant given the present. From this per-
spective, the Markov property just says it's OK to replace strict determinism
with probability distributions.
So Markov processes are “just right” to generalize both complete independence and
strict determinism. Markov chain models are used a lot in physics, chemistry, genet-
ics, ecology, psychology, economics, sociology, and, no doubt, other fields.
Mathematically, there are two components to getting the distribution of a Markov
process. The distribution of X1 , the initial distribution, is just another probability
distribution, say p0 . Thereafter, we need the conditional distribution of X t +1 given
X t , which I’ll write q(y|x). This is either a conditional probability density func-
tion (if the X t are continuous) or a conditional probability mass function (if they
are discrete). There are many names for this, but the most transparent may be the
transition distribution11 . Then
\[
p(x_1, x_2, \ldots, x_t) = p_0(x_1) \prod_{i=1}^{t-1} q(x_{i+1} \,|\, x_i) \qquad (20)
\]
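A minimal sketch in R of simulating such a chain from an initial distribution p0 and a transition matrix q; the function name rmarkovchain is just an illustrative choice.

# Simulate t steps of a finite-state Markov chain
# p0: initial distribution (vector summing to 1)
# q:  K x K transition matrix, rows summing to 1
rmarkovchain <- function(t, p0, q) {
  K <- length(p0)
  x <- integer(t)
  x[1] <- sample(1:K, size = 1, prob = p0)                # X_1 ~ p0
  for (i in 1:(t - 1)) {
    x[i + 1] <- sample(1:K, size = 1, prob = q[x[i], ])   # X_{i+1} ~ q(.|X_i)
  }
  x
}
rmarkovchain(10, c(1, 0), rbind(c(0.5, 0.5), c(0.75, 0.25)))   # a short two-state run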
Markov Chains and the Transition Matrix When the X t variables are all discrete,
the process is called a Markov chain. If they are not just discrete but finite, say K
of them, then we can represent the transition distribution as a K × K matrix, q say,
where
\[
q_{ij} = q(j \,|\, i) = P\!\left(X_{t+1} = j \,|\, X_t = i\right) \qquad (21)
\]
Notice that all entries of q must be ≥ 0, and each row must sum to 1. Such a matrix is
also called stochastic12 . We can now use matrix arithmetic to look at how probability
distributions change.
10 These are named after the great Russian mathematician A. A. Markov, who was the first to systemati-
cally describe them and recognize their importance. See [1] for an accessible account of Markov’s life and
work, and the origins of his theory of chains in a theological quarrel with his arch-nemesis.
11 Technically, in assuming that q stays the same for all t , I am assuming that the Markov process is
“homogeneous”. Inhomogeneous Markov processes exist, but are not very useful for present purposes.
12 Technically, row stochastic, which lets you guess what a column stochastic matrix is, and then what a doubly stochastic matrix is.
Evolving Probability Distributions Suppose we start with a certain distribution
p0 on the states of the Markov chain. Because there are only finitely many states, we
can represent p0 as a 1 × K vector. Then
p1 = p0 q (22)
is another 1 × K vector. Notice that
\[
(p_1)_i = \sum_{j=1}^{K} (p_0)_j q_{ji} = \sum_{j=1}^{K} P(X_1 = j)\, P(X_2 = i \,|\, X_1 = j) = P(X_2 = i) \qquad (23)
\]
Iterating this, the distribution after t steps is p_t = p_0 q^t. If we write p_0 as a combination
of the eigenvectors v_j of q, with eigenvalues λ_j and weights a_j, then
\[
p_t = \sum_{j=1}^{K} a_j \lambda_j^t v_j \qquad (27)
\]
Notice that the only part of Eq. 27 which changes with t is the power of the
eigenvalues. If |λ j | > 1, that term grows exponentially; if |λ j | < 1, that term shrinks
exponentially; only if |λ j | = 1 does the size of the term remain the same.
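To see this numerically, here is a short sketch in R that evolves a distribution by repeated multiplication with a (purely illustrative) two-state transition matrix and then looks at the eigenvalues:

# Evolve p0 under a two-state transition matrix and watch it settle down
q <- rbind(c(0.5, 0.5),
           c(0.75, 0.25))
p <- c(1, 0)                        # start in state 1 with probability 1
for (t in 1:20) { p <- p %*% q }    # p_t = p_0 q^t
p                                   # close to the invariant distribution (0.6, 0.4)
eigen(t(q))$values                  # one eigenvalue is 1; the other has |lambda| < 1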
13 This is not the same as the distribution of state sequences for the first t steps, which is given by Eq. 20.
14 If the eigenvectors are orthogonal to each other, then the a_j are just their inner products with p_0. (Why?)
So far, we haven’t used the fact that the matrix q is rather special. Let’s start doing
so:
Proposition 1 All the eigenvalues λ j of any stochastic matrix are on or inside the unit
circle, |λ j | ≤ 1, and there is always at least one “unit” eigenvalue, λ j = 1 for at least one
j.
The actual proof is complicated, but the intuition is that this comes from the re-
quirement that probability is conserved — the total probability of all states cannot
grow larger than 1 or shrink below it.
Returning to Eq. 27, the proposition means we can divide the eigenvectors into
two kinds: those with |λ j | < 1 and those with |λ j | = 1. The contribution of the
former becomes exponentially small as t grows, so
\[
p_t \rightarrow \sum_{j:\, |\lambda_j| = 1} a_j \lambda_j^t v_j \qquad (28)
\]
This limiting distribution is also an invariant distribution (exercise!). So, in the long
run, the distribution of the state tends to a distribution which is invariant under the
transition matrix.
How much does the initial distribution p0 matter in the long run? The only
place it shows up on the right-hand side in Eq. 29 is that it implicitly determines the
weights a_j. But even then, if there is only a single invariant eigenvector v_1, then it has
to have weight a1 = 1, and the initial distribution doesn’t matter at all for the long
run.
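In practice, one way to compute an invariant distribution is to take the eigenvector of the transposed transition matrix whose eigenvalue is 1 and rescale it to sum to 1. A minimal sketch in R (the helper name invariant.dist is made up here):

# Invariant distribution of a row-stochastic matrix q:
# the left eigenvector with eigenvalue 1, normalized to be a probability vector
invariant.dist <- function(q) {
  e <- eigen(t(q))                                # left eigenvectors of q
  v <- Re(e$vectors[, which.max(Re(e$values))])   # eigenvector for eigenvalue 1
  v / sum(v)                                      # rescale so the entries sum to 1
}
invariant.dist(rbind(c(0.5, 0.5), c(0.75, 0.25)))  # gives (0.6, 0.4)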
Figure 1: Graphical representation of the Markov chain transition matrix from Eq.
30.
Figure 2: Graphical representation of a Markov chain with six states. Can you work
out the transition matrix?
Figure 3: A Markov chain in which there is one recurrent component, and every state
has period three.
the component, and the distribution is not allowed to change. So the basic invariant
distributions match up with the recurrent components of the graph.
What about eigenvectors v_j where |λ_j| = 1 but λ_j ≠ 1? These correspond to
periodic cycles in recurrent components. To see what this means, pick a state i and
look at the lengths of paths in the graph which start and end at i (“cycles rooted at i”).
If there is a common divisor greater than 1 for these path lengths, then there is a
periodicity to the behavior of the chain: it can only return to i at certain times. The
period of the state is the greatest common divisor of its cycle lengths. If the greatest
common divisor is 1, the state is aperiodic. All the states in a recurrent component
must share the same period.
To be concrete, look at Figure 3. There are three states, and the chain simply ro-
tates through them. There is, as promised, one eigenvector with eigenvalue 1, which
puts probability 1/3 on each state. This is the unique invariant distribution. But the
two complex eigenvalues are also on the unit circle, and their eigenvectors “do the
rotation”. If we start with the distribution (a, b , 1 − a − b ), after one step we get
(1 − a − b , a, b ), after two steps we get (b , 1 − a − b , a), and after three steps we are
back where we started. This never approaches the invariant distribution.
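A quick way to watch this rotation numerically (a small sketch in R, matching the three-state rotation just described):

# The rotating chain: 1 -> 2 -> 3 -> 1 with probability 1
q <- rbind(c(0, 1, 0),
           c(0, 0, 1),
           c(1, 0, 0))
p <- c(0.5, 0.3, 0.2)          # an arbitrary (a, b, 1-a-b)
p %*% q                        # one step:    (0.2, 0.5, 0.3)
p %*% q %*% q                  # two steps:   (0.3, 0.2, 0.5)
p %*% q %*% q %*% q            # three steps: back to (0.5, 0.3, 0.2)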
EXERCISE: Consider the more complicated chain in Figure 4. Convince your-
self that each state has period 3. What happens to an arbitrary initial distribution
(a, b , c, 1 − a − b − c) after one step? Two steps? Three steps? Many steps?
Figure 4: A Markov chain with four states, where each state has period 3.
Thus, in the long run, the distribution of the state at any one time approaches the
unique invariant or equilibrium distribution.
To find out how quickly the chain approaches equilibrium, put the eigenvalues in
order of magnitude17 , 1 = λ1 > |λ2 | ≥ . . . ≥ |λK−1 | ≥ |λK |, and put their eigenvectors
v j in the same order. For large t , then,
\[
1 \gg |\lambda_2|^t \gg |\lambda_j|^t \quad (j \ge 3) \qquad (32)
\]
\[
p_t - v^* \approx a_2 \lambda_2^t v_2 \qquad (33)
\]
Thus, the difference between the distribution after t steps p t and the invariant distri-
bution v ∗ shrinks exponentially fast, and the base of the exponent is λ2 . If |λ2 | is very
close to 1, then this exponential rate could be very slow, a point we’ll come back to
next time.
What this means concretely is that if we took many independent copies of the
Markov chain, and chose their initial states according to p0 , we would see the distri-
bution of states across this “ensemble” tending towards v ∗ as time went on, no matter
what p0 was. This is interesting, but can we say anything about what happens within
any one long run of the chain?
If we take a long IID sequence, the empirical distribution within that sequence
tends towards the true distribution. The same is true for a finite-state Markov chain
17 Break ties however you like.
with a single recurrent18 component:
\[
\frac{1}{n}\sum_{t=1}^{n} \mathbf{1}_i(X_t) \rightarrow v^*_i \qquad (34)
\]
This generalizes the law of large numbers from IID sequences to Markov chains. For
historical reasons20 , this is called the ergodic theorem. To get a sense of why the
ergodic theorem is true, look at the appendix.
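As a small sketch in R of Eq. 34 in action, here is one long run of the (illustrative) two-state chain used earlier, whose invariant distribution works out to (0.6, 0.4):

# One long run of a two-state chain, and the fraction of time spent in each state
q <- rbind(c(0.5, 0.5), c(0.75, 0.25))
n <- 1e5
x <- integer(n)
x[1] <- 1
for (t in 1:(n - 1)) {
  x[t + 1] <- sample(1:2, size = 1, prob = q[x[t], ])
}
table(x) / n    # close to the invariant distribution (0.6, 0.4)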
The slogan which summarizes it is that “time averages converge on state aver-
ages”, with “time averages” being the left-hand sides of Eqs. 35 and 34, and “state
averages”, a.k.a. expectation values, being the right-hand sides. In statistical terms,
what the ergodic property means is that a single long realization of a Markov chain
acts like a representative sample of the whole distribution, and becomes increasingly
representative as it grows longer. This is important for several reasons.
1. It re-assures us about the statistical methods we learned for IID data. It says
that even if the data we have to deal with are not completely independent, if
they are at least Markov, then much of what we thought we knew is still at
least approximately true.
2. It tells us that we can apply statistics to dynamics, to things which change over
time — at least if they are Markov. In fact, we can rely on just a single long
trajectory, rather than having to get multiple, independent trajectories, which
could be very difficult indeed.
can check with, say, the period-3 chain from Figure 3 that the next two equations hold, no matter what
state we start from.
19 E XERCISE : Convince yourself that Eq. 35 really does follow from Eq. 34.
20 Several decades before Markov, the physicist Ludwig Boltzmann, as part of explaining why thermo-
dynamics works, argued that a large collection of molecules should, within a short time, come arbitrarily
close to every configuration of the molecules which was compatible with the conservation of energy. Since
all our measurements take a long time to make (molecularly speaking at least), we would see only the aver-
age over these configurations, which would look like an expectation value. He needed a name for this, and
called it the “ergodic” property of the trajectory, from the Greek ergon (“energy, work”) + odos (“path,
way”). Sadly, the name stuck.
4. It gives us a way to replace complicated expectations with simple simulations.
We will see next time that there are many cases where it is much easier to find a
chain whose invariant distribution is v ∗ than it is to find v ∗ itself. (Strange but
true!) Simulating from the chain then gives a way of calculating expectations
with respect to v ∗ .
3 Summing Up
Here are the key ideas:
1. The Monte Carlo method is to evaluate integrals and expectations by simulat-
ing from suitable distributions and taking sample averages over the simulation
points. So long as sample averages converge on expectations, the approxima-
tion error in the Monte Carlo method can be made as small as we like.
• With independent samples, the Monte Carlo approximation is unbiased,
and its variance is O(1/n) in the number of simulation points.
• In importance sampling, we simulate from a different distribution than
the one we’re really interested in, and then correct by taking a weighted
average.
2. Markov processes are sequences of random variables where the distribution of
the next variable depends only on the value of the current one. A Markov chain
is a Markov process where the states are discrete. The transitions in a Markov
chain are represented in a stochastic matrix.
• To find the distribution after t steps of the chain, we take the current
distribution (written as a vector) and multiply it by the t th power of the
transition matrix.
• As t grows, the distribution becomes a combination of the eigenvectors
whose eigenvalues have magnitude 1. Eigenvectors whose eigenvalues are
simply 1 are invariant distributions.
• In an aperiodic chain, the long-run distribution is an invariant distribu-
tion.
• Each eigenvector with eigenvalue 1 corresponds to a different recurrent
component of states. If there is only a single recurrent, aperiodic compo-
nent, the long-run distribution is this unique eigenvector.
• Sample averages taken along any single sequence of the chain converge on
expectations under the invariant distribution (ergodicity).
A Hand-waving Argument about the Ergodic Theorem
The left-hand side of Eq. 34, what we want to have converge, is
\[
\frac{1}{n}\sum_{t=1}^{n} \mathbf{1}_i(X_t) \qquad (36)
\]
This is a random quantity. Let’s try to work out its expectation value (assuming X1 ∼
p0 ) and its variance as n → ∞. If the expectation tends to a limit and the variance
shrinks to zero, then the time-average as a whole must converge on its expectation
value.
For expectation, we use the fact that taking expectations is a linear operation:
\[
E_{p_0}\!\left[\frac{1}{n}\sum_{t=1}^{n} \mathbf{1}_i(X_t)\right] = \frac{1}{n}\sum_{t=1}^{n} E_{p_0}\!\left[\mathbf{1}_i(X_t)\right] \qquad (37)
\]
\[
= \frac{1}{n}\sum_{t=1}^{n} P\!\left(X_t = i\right) \qquad (38)
\]
\[
= \frac{1}{n}\sum_{t=1}^{n} (p_0 q^{t-1})_i \qquad (39)
\]
Since, as t → ∞,
\[
p_0 q^t \rightarrow v^* \qquad (40)
\]
we can conclude that
\[
(p_0 q^{t-1})_i \rightarrow v^*_i \qquad (41)
\]
and therefore
\[
E_{p_0}\!\left[\frac{1}{n}\sum_{t=1}^{n} \mathbf{1}_i(X_t)\right] \rightarrow v^*_i \qquad (42)
\]
(Recall Var[X + Y] = Var[X] + Var[Y] + 2Cov[X, Y].) The first sum, of the variances,
we can handle; the summands tend towards v^*_i(1 − v^*_i) as t grows. (Why?)
Thus the whole sum approaches n v^*_i(1 − v^*_i).
The covariances are where I'm going to wave my hands. Since p_0 q^t → v^*, no matter
what distribution we put in for p_0, the Markov chain is “asymptotically independent”.
After many time-steps, the chain has nearly the same distribution no matter
what state we started it in. So P(X_t = i, X_s = j) → P(X_t = i) P(X_s = j) as s − t → ∞.
More precisely,
\[
\left| P(X_t = i, X_s = j) - P(X_t = i)\, P(X_s = j) \right| \le \kappa_i |\lambda_2|^{s-t} \qquad (44)
\]
\[
\mathrm{Cov}\!\left[\mathbf{1}_i(X_t), \mathbf{1}_i(X_s)\right] = P(X_t = i, X_s = i) - P(X_t = i)\, P(X_s = i) \qquad (45)
\]
since the infinite sum is a geometric series. The sum of the covariances is thus limited:
\[
\sum_{t=1}^{n-1} \sum_{s=t+1}^{n} \mathrm{Cov}_{p_0}\!\left[\mathbf{1}_i(X_t), \mathbf{1}_i(X_s)\right] \qquad (50)
\]
\[
\le \sum_{t=1}^{n-1} \kappa_i \frac{|\lambda_2|}{1 - |\lambda_2|}
\]
\[
\le n \kappa_i \frac{|\lambda_2|}{1 - |\lambda_2|} \qquad (51)
\]
It may help clear up this somewhat tricky mathematical argument to compare what's going
on here to what we'd see if we had an IID sample of discrete variables. Then all the
covariance terms would go away, and we'd have
\[
\mathrm{Var}_{p_0}\!\left[\frac{1}{n}\sum_{t=1}^{n} \mathbf{1}_i(X_t)\right] = \frac{v^*_i(1 - v^*_i)}{n} \qquad (55)
\]
This is the familiar situation from basic statistics where each sample gives us an in-
dependent piece of information about the distribution, and our uncertainty about
population probabilities goes down like 1/n. With the Markov chain, because the
samples are correlated with each other, each observation is not an independent piece
of information. But it’s like we had some smaller number of independent samples:
\[
\mathrm{Var}_{p_0}\!\left[\frac{1}{n}\sum_{t=1}^{n} \mathbf{1}_i(X_t)\right] = \frac{v^*_i(1 - v^*_i)}{n/\tau} \qquad (56)
\]
where
\[
\tau = 1 + 2\, \frac{\kappa_i |\lambda_2|}{v^*_i (1 - v^*_i)(1 - |\lambda_2|)} \qquad (57)
\]
tells us how long we have to wait for the correlations to become negligible.
Stepping back even more for the really big picture, what’s going on is that the
variables in a Markov chain are dependent, but widely-separated ones are almost in-
dependent. So if we wait for a long time, averages over the dependent variables look
very much like averages over independent ones. How long is “long” is quantified by
τ.
Problem
Suppose that X1 , X2 , . . . X t , . . . is a sequence of variables, not necessarily a Markov
chain, which is “weakly stationary”, so
\[
E[X_t] = E[X_s] = \mu \qquad (58)
\]
and that
\[
\mathrm{Cov}[X_t, X_s] = \rho(|t - s|) \qquad (59)
\]
for all t, s. Further suppose that
\[
\sum_{h=0}^{\infty} \rho(h) = \kappa\, \rho(0) < \infty \qquad (60)
\]
and that
\[
\mathrm{Var}\!\left[\frac{1}{n}\sum_{t=1}^{n} X_t\right] \le \frac{\rho(0)}{n/(1 + 2\kappa)}
\]
Conclude that
\[
\frac{1}{n}\sum_{t=1}^{n} X_t \rightarrow \mu
\]
References
[1] Basharin, Gely P., Amy N. Langville and Valeriy A. Naumov (2004).
“The Life and Work of A. A. Markov.” Linear Algebra and its Applica-
tions, 386: 3–26. URL http://decision.csl.uiuc.edu/~meyn/pages/
Markov-Work-and-life.pdf.
[2] Serber, Robert (1992). The Los Alamos Primer: The First Lectures on How to
Build the Atomic Bomb. Berkeley: University of California Press. Annotated by
Robert Serber; edited and with an introduction by Richard Rhodes.