
Lecture 18: Monte Carlo and Markov Chains

36-350, Fall 2011

21 October 2013

Abstract
For Halloween, we come as a math course

1 Monte Carlo Integration


Suppose we want to evaluate a definite integral,
\int_D f(x) \, dx    (1)

where D is some domain (possibly all space), and f is our favorite function. For
most functions, there is no closed-form expression for such definite integrals. Nu-
merical analysis provides various means of approximating definite integrals, starting
with Euler’s method for one-dimensional integrals,
\int_a^b f(x) \, dx \approx h \sum_{i=1}^{\lfloor (b-a)/h \rfloor} f(a + ih)    (2)
and getting arbitrarily more sophisticated. Unfortunately, these are slow, especially
when x is really a high-dimensional vector; one ends up having to evaluate the func-
tion f at an exponentially growing number of points just to get the definite integral.
It turns out that designing nuclear weapons involves doing a lot of complicated
integrals [2], and so one of the first uses of modern computing machines was an
efficient but random approximation scheme, which the physicists involved called "the
Monte Carlo method", after the European gambling resort. Recall that the expectation
value of a function f with respect to a distribution whose probability density is p is

E_p[f(X)] = \int f(x) p(x) \, dx    (3)

If X_1, X_2, \ldots, X_n are independent random variables with common density p, we
say that they are independent and identically distributed, or IID. As you learned
in introductory probability, the law of large numbers asserts that, for IID random
variables, the sample mean converges on the expectation value:

\frac{1}{n} \sum_{i=1}^{n} f(X_i) \to E_p[f(X)] = \int f(x) p(x) \, dx    (4)

The most basic Monte Carlo method for evaluating the integral in (1) is to draw
X uniformly over the domain D. If the total measure^1 of D is |D|, then the uniform
density is 1/|D| on D, and 0 everywhere else, so

\int_D f(x) \, dx = |D| \int_D f(x) \frac{1}{|D|} \, dx    (5)

and

\frac{|D|}{n} \sum_{i=1}^{n} f(X_i) \to \int_D f(x) \, dx    (6)

How good is the approximation, i.e., how close are the two sides of Eq. 6? Because
the X_i are IID, we can use another result you remember from introductory
probability, the central limit theorem: when Y_i are IID with common mean \mu and
variance \sigma^2, the sample mean approaches a Gaussian distribution with mean \mu and
variance \sigma^2/n. Symbolically,

\frac{1}{n} \sum_{i=1}^{n} Y_i \rightsquigarrow N\!\left(\mu, \frac{\sigma^2}{n}\right)    (7)

In this case, the role of Y_i is played by f(X_i), so

\frac{|D|}{n} \sum_{i=1}^{n} f(X_i) \rightsquigarrow N\!\left(\int_D f(x) \, dx, \; |D|^2 \frac{\sigma_f^2}{n}\right)    (8)

with \sigma_f^2 being the variance of f(X). Thus, the Monte Carlo estimate is unbiased
(its expected value is equal to the truth), and its variance goes down like 1/n. This
is true no matter what the dimension of X might be. So, unlike the numerical
integration schemes, reducing the error of the Monte Carlo estimate doesn't require
exponentially many points. In fact, if we knew \sigma_f^2, we could use the known Gaussian
distribution to give confidence intervals for \int_D f(x) \, dx. If, as is usually the case, we
don't know that variance, we can always estimate it from the samples, and the Gaussian
confidence intervals will become correct as n \to \infty, or we can use corrections
(based on, say, the t distribution) familiar from basic statistics.
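To make this recipe concrete, here is a minimal R sketch of plain Monte Carlo on a bounded interval, with a rough CLT-based confidence interval estimated from the same samples; the function name, test integrand, and sample size are made up for illustration, not taken from the notes.

    # Plain Monte Carlo for the integral of f over [a, b], with a rough
    # CLT-based confidence interval (Eqs. 6 and 8). Illustrative sketch only.
    mc.integrate <- function(f, a, b, n = 1e5) {
      x <- runif(n, min = a, max = b)    # X_i drawn uniformly on D = [a, b]
      vals <- (b - a) * f(x)             # |D| f(X_i), whose mean estimates the integral
      est <- mean(vals)
      se <- sd(vals) / sqrt(n)           # estimated standard error of the estimate
      c(estimate = est, lower = est - 1.96 * se, upper = est + 1.96 * se)
    }

    mc.integrate(function(x) { x^2 }, 0, 1)   # true value is 1/3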
There are many situations in which this basic recipe is impractical or even impossible.
For instance, D may have a very complicated shape, making it hard to draw
samples uniformly. It may then be possible to find some larger region, say C, which
contains D and for which we can generate uniform samples. Then, because^2

\int_D f(x) \, dx = \int_C f(x) \mathbf{1}_D(x) \, dx    (9)

if we sample X uniformly from C,

\frac{|C|}{n} \sum_{i=1}^{n} f(X_i) \mathbf{1}_D(X_i) \to \int_D f(x) \, dx    (10)
^1 Length, area, volume, etc., as appropriate.
^2 The indicator function \mathbf{1}_D(x) is 1 when x is in D, and 0 otherwise.

Notice however that sample points X_i which fall outside D are simply wasted, and
they will be about 1 - \frac{|D|}{|C|} of all the samples — it pays to make the sampling region
not too much larger than the domain of integration!
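As a sketch of this enclosing-region trick, suppose (purely for illustration) that D is the unit disc and C is the square [-1, 1] x [-1, 1], which is easy to sample uniformly; points falling outside D are handled by the indicator, as in Eq. 10.

    # Monte Carlo over an awkward domain D via a simple enclosing region C.
    # Illustration: D = unit disc, C = the square [-1, 1]^2, so |C| = 4.
    n <- 1e5
    x <- runif(n, -1, 1)
    y <- runif(n, -1, 1)
    in.D <- (x^2 + y^2 <= 1)             # the indicator 1_D(X_i)
    f <- function(x, y) { x^2 + y^2 }    # an arbitrary integrand, for illustration
    4 * mean(f(x, y) * in.D)             # (|C|/n) * sum of f * 1_D; true value is pi/2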
Even this approach is of no use when D is infinite, since then a uniform distribution
makes no sense. But an integral like

\int_0^{\infty} x^2 e^{-x^2} \, dx    (11)

is, as we know, finite^3, so there should be some trick to evaluating it by Monte Carlo.
The trick is to introduce a density p which has the same support as D (in this case,
the positive half-line). Then

\int_D f(x) \, dx = \int_D \frac{f(x)}{p(x)} \, p(x) \, dx    (12)

Because p(x) > 0 everywhere on D, this is legitimate (we’re never dividing by zero).
And now, if the Xi are generated from p,
\frac{1}{n} \sum_{i=1}^{n} \frac{f(X_i)}{p(X_i)} \to E_p\!\left[\frac{f(X)}{p(X)}\right] = \int_D f(x) \, dx    (13)

as desired.^4
Again, it's worth asking how good the approximation is, and again, we can use
the central limit theorem:

\frac{1}{n} \sum_{i=1}^{n} \frac{f(X_i)}{p(X_i)} \rightsquigarrow N\!\left(\int_D f(x) \, dx, \; \frac{\sigma_{f/p}^2}{n}\right)    (14)

where \sigma_{f/p}^2 is the variance of f(X)/p(X) (when X \sim p). Once again, the Monte
Carlo approximation is unbiased, and the variance of the approximation goes to zero
like 1/n, no matter how high-dimensional X is, or how ugly f or D might be.

Choosing p  In principle, any p which is supported on D^5 could be used for Monte
Carlo. In practice, one looks for easy simulation, low variance, and simple forms.
Easy simulation speaks for itself; what about the others?
• Low variance  Notice that if f(x)/p(x) were constant, say c, then the variance
of the Monte Carlo variable \sigma_{f/p}^2 would be zero, and the Monte Carlo approximation
would be exact. Getting this is usually too much to hope for, but, to
the extent possible, it generally improves efficiency to have the shape of p(x)
follow that of f(x). (Of course this can conflict with easy simulation.)
^3 Why do we know this?
^4 EXERCISE: Convince yourself that Eqs. 5 and 6 are special cases of Eqs. 12 and 13, respectively.
^5 That is, p(x) > 0 if and only if x is in D.

• Simple forms  It is often worth looking carefully at the integrand to see if a
probability density can be factored out of it. In (11), for instance, the factor
e^{-x^2} is proportional to a N(0, 1/2) density,

\int_0^{\infty} x^2 e^{-x^2} \, dx = \int_0^{\infty} \left[\sqrt{\pi}\, x^2\right] \frac{1}{\sqrt{2\pi \cdot 1/2}} e^{-x^2/(2 \cdot 1/2)} \, dx    (15)

so we can simulate X_i from that N(0, 1/2) density and take the average of
\sqrt{\pi} X_i^2 \mathbf{1}(X_i \geq 0), the indicator accounting for the integral running over the
positive half-line only.
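A quick R check of this factoring trick (the sample size is arbitrary):

    # Estimate the integral of x^2 exp(-x^2) over [0, infinity) by sampling
    # from a N(0, 1/2) distribution, as in Eq. 15.
    n <- 1e6
    x <- rnorm(n, mean = 0, sd = sqrt(1/2))       # N(0, 1/2) draws
    est <- mean(sqrt(pi) * x^2 * (x >= 0))        # indicator restricts to the half-line
    c(monte.carlo = est, exact = sqrt(pi) / 4)    # exact value is sqrt(pi)/4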

Problem
Write a function which takes as input a real-valued function of one argument, a lower
and upper limit of integration, and a number of samples, and returns a Monte Carlo
approximation to the integral over that interval. Check whether the lower limit is
-Inf or the upper limit is Inf (both might be true!), and choose a p appropriately.
How would you check this?

1.1 Calculating Expectation Values


Since expectation values are integrals, everything said above about integrals applies to
them. Whenever we can simulate X and calculate f , we can approximate E p [ f (X )].
By the law of large numbers,

\frac{1}{n} \sum_{i=1}^{n} f(X_i) \to E_p[f(X)]    (16)

By the central limit theorem, for large n, the approximations have a Gaussian distri-
bution around the true expected values, with a variance shrinking like 1/n.

Importance sampling Of course, drawing the Xi from p can be tricky. We can


then look for another density r , which has two properties:
1. It is easy for us to simulate from r , and
2. r (x) > 0 whenever p(x) > 0
The second property lets us write
E_p[f(X)] = \int f(x) p(x) \, dx = \int f(x) \frac{p(x)}{r(x)} r(x) \, dx = E_r\!\left[f(X) \frac{p(X)}{r(X)}\right]    (17)

The first property means we can easily draw X_1, X_2, \ldots, X_n from r. Then

\frac{1}{n} \sum_{i=1}^{n} f(X_i) \frac{p(X_i)}{r(X_i)} \to E_p[f(X)]    (18)

Notice that this is a weighted mean of the f (Xi ), but one where the weights are also
random. This sort of approximation, where we calculate the expectation under one
distribution by sampling from another, is called importance sampling, and the ratios
p(Xi )/r (Xi ) are the importance weights. Once again, the approximations tend to a
Gaussian distribution around the truth.
As with picking p in plain Monte Carlo, there is a tension between choosing a
distribution r which is easy to draw from, and choosing one which will be efficient.
It is easy to check that the sample mean of the importance weights will tend towards
1 as n grows^6. But if r puts a lot of probability on regions where p(x) is very small,
many terms in the sample average will be weighted down towards zero, and so nearly
wasted; while a few will have to get very large weights, and averaging only a few
random terms is noisy. So to get good approximations to E_p[f(X)], it is usually
desirable for p(x)/r(x) to not vary too much from 1.^7
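Here is a minimal importance-sampling sketch in R; the target p, the proposal r, and the function f are all invented for illustration, not taken from the notes.

    # Importance sampling: estimate E_p[f(X)] by sampling from r and reweighting.
    # Illustration: p = Exponential(rate 1), r = Exponential(rate 1/2), f(x) = x^2.
    n <- 1e5
    x <- rexp(n, rate = 0.5)                        # draws from the proposal r
    w <- dexp(x, rate = 1) / dexp(x, rate = 0.5)    # importance weights p(x)/r(x)
    mean(x^2 * w)                                   # weighted mean, as in Eq. 18; E_p[X^2] = 2
    mean(w)                                         # should be close to 1 (footnote 6)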

Problem
The entropy of a distribution with density p is
- \int p(x) \log_2 p(x) \, dx    (19)

This quantity is extremely important in information theory, since it can be used


to quantify how many bits of memory are needed to store a value drawn from the
distribution, or to transmit it. Use Monte Carlo to find the entropy of a Gaussian
distribution with mean 5 and variance 3, and of an exponential distribution with scale
0.5. Analytical formulas for the entropy are available for these two distributions; look
them up^8 and calculate the exact values. How large do you need to make n in each
case to get agreement to 2 significant digits? To 3 significant digits?

2 Markov Chains
So far, we have been simulating sequences of completely independent random vari-
ables. This is excessively limiting. The world has lots of variables which are related
to each other — the most important parts of statistics are about describing these rela-
tionships — and so we should be able to simulate dependent variables.
A stochastic process is simply a collection of random variables with a joint dis-
tribution, usually a dependent one^9. Often, but not always, the variables come in a
sequence, X1 , X2 , . . . Xn , . . .. In principle, then, the distribution of X t +1 could depend
on the value of all previous variables, X1 , X2 , . . . X t . At the other extreme, in an IID
sequence, no variable depends on any other variable.
^6 Because \int \frac{p(x)}{r(x)} r(x) \, dx = 1 (EXERCISE: why?), and the law of large numbers applies.
^7 EXERCISE: Are there ever situations where the estimate would be improved by sampling from r rather than p, assuming it's equally easy to do either?
^8 Or re-derive them!
^9 The original motive for using the word "stochastic" in place of "random" is that people tended to take "random" as implying "statistically independent".

The most important class of stochastic processes which actually have dependence
are the Markov processes^10, in which the distribution of X_{t+1} depends only on the
value of X_t. The variable X_t is called the state of the process at time t.
When I say that the distribution of X t +1 depends only on the value of X t , I mean
in particular that X t +1 is conditionally independent of X1 , . . . X t −1 given X t . (“The
future is independent of the past, given the present state.”) This conditional indepen-
dence is called the Markov property. Conceptually, we can view it in two ways:
• In an IID sequence, X t +1 is conditionally independent of earlier states given
X t . It’s also unconditionally independent of them, since there is no dependence
at all. From this perspective, the Markov property is a minimal weakening of
the idea of an IID sequence.
• In a deterministic dynamical system, like the Arnold cat map we saw in the
last lecture (or the laws of classical physics), the next state is a function of the
current state, and earlier states are irrelevant given the present. From this per-
spective, the Markov property just says it's OK to replace strict determinism
with probability distributions.
So Markov processes are “just right” to generalize both complete independence and
strict determinism. Markov chain models are used a lot in physics, chemistry, genet-
ics, ecology, psychology, economics, sociology, and, no doubt, other fields.
Mathematically, there are two components to getting the distribution of a Markov
process. The distribution of X1 , the initial distribution, is just another probability
distribution, say p0 . Thereafter, we need the conditional distribution of X t +1 given
X t , which I’ll write q(y|x). This is either a conditional probability density func-
tion (if the X t are continuous) or a conditional probability mass function (if they
are discrete). There are many names for this, but the most transparent may be the
transition distribution^11. Then

p(x_1, x_2, \ldots, x_t) = p_0(x_1) \prod_{i=1}^{t-1} q(x_{i+1} \mid x_i)    (20)

Markov Chains and the Transition Matrix When the X t variables are all discrete,
the process is called a Markov chain. If they are not just discrete but finite, say K
of them, then we can represent the transition distribution as a K × K matrix, q say,
where

q_{ij} = q(j \mid i) = P(X_{t+1} = j \mid X_t = i)    (21)

Notice that all entries of q must be \geq 0, and each row must sum to 1. Such a matrix is
also called stochastic^12. We can now use matrix arithmetic to look at how probability
distributions change.
^10 These are named after the great Russian mathematician A. A. Markov, who was the first to systematically describe them and recognize their importance. See [1] for an accessible account of Markov's life and work, and the origins of his theory of chains in a theological quarrel with his arch-nemesis.
^11 Technically, in assuming that q stays the same for all t, I am assuming that the Markov process is "homogeneous". Inhomogeneous Markov processes exist, but are not very useful for present purposes.
^12 Technically, row stochastic, which lets you guess what a column stochastic matrix is, and then what a doubly stochastic matrix might be.
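As a sketch (not something spelled out in the notes), here is one way to simulate a trajectory of a finite-state Markov chain in R, given an initial distribution p0 and a row-stochastic transition matrix q; the example matrix below is arbitrary.

    # Simulate t steps of a finite-state Markov chain (states labeled 1..K)
    # from an initial distribution p0 and a row-stochastic transition matrix q.
    simulate.chain <- function(p0, q, t) {
      K <- nrow(q)
      x <- numeric(t)
      x[1] <- sample(1:K, size = 1, prob = p0)                # X_1 ~ p0
      for (i in 1:(t - 1)) {
        x[i + 1] <- sample(1:K, size = 1, prob = q[x[i], ])   # X_{i+1} ~ q(. | X_i)
      }
      x
    }

    q <- matrix(c(0.5, 0.5, 0.75, 0.25), nrow = 2, byrow = TRUE)  # arbitrary example
    simulate.chain(p0 = c(1, 0), q = q, t = 10)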

Evolving Probability Distributions Suppose we start with a certain distribution
p0 on the states of the Markov chain. Because there are only finitely many states, we
can represent p0 as a 1 × K vector. Then
p_1 = p_0 q    (22)

is another 1 \times K vector. Notice that

(p_1)_i = \sum_{j=1}^{K} (p_0)_j q_{ji} = \sum_{j=1}^{K} P(X_1 = j) P(X_2 = i \mid X_1 = j) = P(X_2 = i)    (23)

so multiplying the probability distribution p_0 by q gives us the new probability
distribution, after one step of the chain. If we look at

p_t = p_{t-1} q = p_0 q^t    (24)

we get the distribution of states after t steps of the chain^13.
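A small R sketch of Eq. 24, pushing a distribution forward by repeated right-multiplication; the transition matrix and starting distribution are arbitrary.

    # Evolve a distribution through t steps of the chain: p_t = p_0 q^t (Eq. 24).
    q <- matrix(c(0.5, 0.5, 0.75, 0.25), nrow = 2, byrow = TRUE)
    p <- matrix(c(1, 0), nrow = 1)     # all mass on state 1 to start
    for (step in 1:50) {
      p <- p %*% q                     # one step of the chain (Eq. 22)
    }
    p                                  # for this q, approaches (0.6, 0.4)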

2.1 Asymptotics of Markov Chains


What happens to p t asymptotically, as t → ∞? Since we are getting p t by matrix
arithmetic, it is natural to turn to linear algebra for an answer, and specifically to
eigenvalues and eigenvectors.
Since q is a square, K × K stochastic matrix, it will have K eigenvectors, say
v1 , . . . vK , and K eigenvalues, λ1 , . . . λK . (Not all of the eigenvalues are necessarily
different.) The eigenvectors will form a basis for RK , meaning that we can write an
arbitrary vector as a linear combination of the eigenvectors. In particular, we can
write p_0 so:

p_0 = \sum_{j=1}^{K} a_j v_j    (25)

for some coefficients a_j.^14
Since the eigenvectors multiply easily by q, we can now write p_t very simply:

p_1 = p_0 q = \sum_{j=1}^{K} a_j v_j q = \sum_{j=1}^{K} a_j \lambda_j v_j    (26)

Iterating this,

p_t = \sum_{j=1}^{K} a_j \lambda_j^t v_j    (27)

Notice that the only part of Eq. 27 which changes with t is the power of the
eigenvalues. If |λ j | > 1, that term grows exponentially; if |λ j | < 1, that term shrinks
exponentially; only if |λ j | = 1 does the size of the term remain the same.
^13 This is not the same as the distribution of state sequences for the first t steps, which is given by Eq. 20.
^14 If the eigenvectors are orthogonal to each other, then the a_j are just their inner products with p_0. (Why?)

So far, we haven’t used the fact that the matrix q is rather special. Let’s start doing
so:
Proposition 1 All the eigenvalues λ j of any stochastic matrix are on or inside the unit
circle, |λ j | ≤ 1, and there is always at least one “unit” eigenvalue, λ j = 1 for at least one
j.

The actual proof is complicated^15, but the intuition is that this comes from the re-
quirement that probability is conserved — the total probability of all states cannot
grow larger than 1 or shrink below it.
Returning to Eq. 27, the proposition means we can divide the eigenvectors into
two kinds: those with |λ j | < 1 and those with |λ j | = 1. The contribution of the
former becomes exponentially small as t grows, so

p_t \to \sum_{j : |\lambda_j| = 1} a_j \lambda_j^t v_j    (28)

We’ll need another fact about stochastic matrices.

Proposition 2 If λ j = 1, then all entries of v j are ≥ 0.

Since if v j is an eigenvector, so is c v j , we can normalize these v j so their entries sum


to 1. The eigenvectors which go with eigenvalue 1, then, are probability distribu-
tions on the states. Since, for these vectors, v j q = v j , these are called invariant or
equilibrium distributions.
Suppose that the only eigenvalues on the unit circle, |\lambda_j| = 1, are unit eigenvalues,
\lambda_j = 1. (We will see presently when this will happen.) Then we could say

p_t \to \sum_{j : |\lambda_j| = 1} a_j v_j    (29)

This limiting distribution is also an invariant distribution (exercise!). So, in the long
run, the distribution of the state tends to a distribution which is invariant under the
transition matrix.
How much does the initial distribution p0 matter in the long run? The only
place it shows up on the right-hand side in Eq. 29 is that it implicitly determines the
weights a_j. But even then, if there is only a single invariant eigenvector v_1, then it has
to have weight a_1 = 1, and the initial distribution doesn't matter at all for the long
run.
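Numerically, the invariant distribution can be read off as a left eigenvector of q with eigenvalue 1, i.e., an eigenvector of the transpose of q, normalized to sum to 1. A minimal R sketch, with an arbitrary example matrix:

    # Find the invariant distribution as the left eigenvector of q with eigenvalue 1.
    q <- matrix(c(0.5, 0.5, 0.75, 0.25), nrow = 2, byrow = TRUE)
    e <- eigen(t(q))                   # eigen-decomposition of the transpose
    j <- which.max(Re(e$values))       # the eigenvalue equal to 1 (Proposition 1)
    v.star <- Re(e$vectors[, j])
    v.star <- v.star / sum(v.star)     # normalize so the entries sum to 1
    v.star                             # (0.6, 0.4) for this q
    v.star %*% q                       # check: left unchanged by the transition matrix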

2.2 Graph Form of a Markov Chain


Suppose we have a simple two-state Markov chain, with transition matrix

q = \begin{pmatrix} 0.5 & 0.5 \\ 0.75 & 0.25 \end{pmatrix}    (30)

[Figure 1: Graphical representation of the Markov chain transition matrix from Eq. 30: two nodes, 1 and 2, with a self-loop of 0.5 on node 1, a self-loop of 0.25 on node 2, an edge 1 → 2 labeled 0.5, and an edge 2 → 1 labeled 0.75.]

We can represent this state space by a graph or network, as in Figure 1.

The general rule for drawing such graphs is that each state is a node, and that
each non-zero transition probability, q_{ij} > 0, is a directed edge from i to j, labeled
with the probability. You can amuse yourself, for instance, by working out the
transition matrix corresponding to Figure 2.
We say that two states i and j are connected if there is a directed path from i to j.
We say that they are strongly connected if there are directed paths in both directions,
from i to j and from j back to i . This is a transitive relation: if i is strongly connected
to j , and j is strongly connected to k, then i is strongly connected to k. This means
that we can break up the graph into strongly connected components, where all
states in a component are strongly connected to each other, and are not strongly
connected to any other states. For instance, in Figure 2, states 1 and 2 form a strongly
connected component, states 3, 4 and 5 are another, and finally state 6 is strongly
connected to itself but to no other state, so it is in a component of one state.
If a state i is connected, but not strongly connected, to some state j , then i is tran-
sient^16. If i is not transient — that is, it is strongly connected to every state to which
it is connected — then it is recurrent. If i is recurrent, then every state it is connected
to must also be recurrent, and we also call the whole strongly connected component
it belongs to recurrent. In Figure 2, states 1 and 2 form a recurrent component, and
states 3, 4 and 5 form another.
EXERCISE: Convince yourself that every path of a finite-state Markov chain even-
tually hits one recurrent component or another, and then stays in that component
forever. Hint: the initial state is either recurrent or transient.
Each recurrent component corresponds to an eigenvector of q with eigenvalue
1. This eigenvector is > 0 on the states of the component, and = 0 on the other
states. Remember that these eigenvectors are invariant distributions: they can give
probability 0 to the rest of the state space, because a chain started in the component
has zero probability of leaving it. But they cannot give probability 0 to states in the
component, since there is always some probability of reaching them from the rest of
the component, and the distribution is not allowed to change. So the basic invariant
distributions match up with the recurrent components of the graph.

^15 It's part of the "Perron-Frobenius theorem".
^16 The reason for the name is that if we start the chain in state i, it will eventually find its way to state j, from which it has no way back to i.

[Figure 2: Graphical representation of a Markov chain with six states. Can you work out the transition matrix?]

[Figure 3: A Markov chain in which there is one recurrent component, and every state has period three: the three states form a single cycle, 1 → 2 → 3 → 1, with each transition probability equal to 1.]
What about eigenvectors v_j where |\lambda_j| = 1 but \lambda_j \neq 1? These correspond to
periodic cycles in recurrent components. To see what this means, pick a state i and
look at the lengths of paths in the graph which start and end at i ("cycles rooted at i").
If there is a common divisor to these path lengths, then there is a periodicity to the
behavior of the chain — it can only return to i at certain times. The period of the state
is the greatest common divisor of its cycle lengths. If the greatest common divisor is
1, the state is aperiodic. All the states in a recurrent component must share the
same period.
To be concrete, look at Figure 3. There are three states, and the chain simply ro-
tates through them. There is, as promised, one eigenvector with eigenvalue 1, which
puts probability 1/3 on each state. This is the unique invariant distribution. But the
two complex eigenvalues are also on the unit circle, and their eigenvectors “do the
rotation”. If we start with the distribution (a, b , 1 − a − b ), after one step we get
(1 − a − b , a, b ), after two steps we get (b , 1 − a − b , a), and after three steps we are
back where we started. This never approaches the invariant distribution.
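A quick R illustration of this rotation, assuming the cycle 1 → 2 → 3 → 1 shown in Figure 3:

    # The period-3 chain of Figure 3: the distribution rotates and never settles.
    q3 <- matrix(c(0, 1, 0,
                   0, 0, 1,
                   1, 0, 0), nrow = 3, byrow = TRUE)
    p <- matrix(c(0.2, 0.3, 0.5), nrow = 1)   # an arbitrary (a, b, 1 - a - b)
    for (step in 1:3) {
      p <- p %*% q3
      print(p)                                # cycles through the three rotations
    }
    # After three steps we are back at (0.2, 0.3, 0.5), never approaching
    # the invariant distribution (1/3, 1/3, 1/3).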
EXERCISE: Consider the more complicated chain in Figure 4. Convince your-
self that each state has period 3. What happens to an arbitrary initial distribution
(a, b , c, 1 − a − b − c) after one step? Two steps? Three steps? Many steps?

2.3 The Ergodic Theorem (Law of Large Numbers for Markov Chains)

To return to the long-run distribution, we can now say that if there is a single, aperiodic
recurrent component, then there is only one eigenvector, v^*, with eigenvalue 1,
and

p_t = p_0 q^t \to v^*    (31)

[Figure 4: A Markov chain with four states, where each state has period 3.]

Thus, in the long run, the distribution of the state at any one time approaches the
unique invariant or equilibrium distribution.
To find out how quickly the chain approaches equilibrium, put the eigenvalues in
order of magnitude^17, 1 = \lambda_1 > |\lambda_2| \geq \ldots \geq |\lambda_{K-1}| \geq |\lambda_K|, and put their eigenvectors
v_j in the same order. For large t, then,

1 \gg |\lambda_2|^t \gg |\lambda_j|^t    (32)

no matter what j > 2 we pick. Consequently, again for large t,

p_t - v^* \approx a_2 \lambda_2^t v_2    (33)

Thus, the difference between the distribution after t steps p t and the invariant distri-
bution v ∗ shrinks exponentially fast, and the base of the exponent is λ2 . If |λ2 | is very
close to 1, then this exponential rate could be very slow, a point we’ll come back to
next time.
What this means concretely is that if we took many independent copies of the
Markov chain, and chose their initial states according to p0 , we would see the distri-
bution of states across this “ensemble” tending towards v ∗ as time went on, no matter
what p0 was. This is interesting, but can we say anything about what happens within
any one long run of the chain?
If we take a long IID sequence, the empirical distribution within that sequence
tends towards the true distribution. The same is true for a finite-state Markov chain
with a single recurrent^18 component:

\frac{1}{n} \sum_{t=1}^{n} \mathbf{1}_i(X_t) \to v_i^*    (34)

Since there are only finitely many states, this means^19

\frac{1}{n} \sum_{t=1}^{n} f(X_t) \to E_{v^*}[f(X)]    (35)

^17 Break ties however you like.

This generalizes the law of large numbers from IID sequences to Markov chains. For
historical reasons^20, this is called the ergodic theorem. To get a sense of why the
ergodic theorem is true, look at the appendix.
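To see the ergodic theorem at work numerically, here is a small R sketch (the chain and run length are arbitrary): simulate one long trajectory and compare the fraction of time spent in each state with the invariant distribution.

    # Time averages along one trajectory converge on the invariant distribution (Eq. 34).
    q <- matrix(c(0.5, 0.5, 0.75, 0.25), nrow = 2, byrow = TRUE)  # arbitrary two-state chain
    n <- 1e5
    x <- numeric(n)
    x[1] <- 1                                            # arbitrary starting state
    for (t in 2:n) {
      x[t] <- sample(1:2, size = 1, prob = q[x[t - 1], ])
    }
    table(x) / n      # empirical occupation frequencies; for this q, close to (0.6, 0.4)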
The slogan which summarizes it is that “time averages converge on state aver-
ages”, with “time averages” being the left-hand sides of Eqs. 35 and 34, and “state
averages”, a.k.a. expectation values, being the right-hand sides. In statistical terms,
what the ergodic property means is that a single long realization of a Markov chain
acts like a representative sample of the whole distribution, and becomes increasingly
representative as it grows longer. This is important for several reasons.
1. It re-assures us about the statistical methods we learned for IID data. It says
that even if the data we have to deal with are not completely independent, if
they are at least Markov, then much of what we thought we knew is still at
least approximately true.

2. It tells us that we can apply statistics to dynamics, to things which change over
time — at least if they are Markov. In fact, we can rely on just a single long
trajectory, rather than having to get multiple, independent trajectories, which
could be very difficult indeed.

3. It gives us a way to short-cut long simulations. Often what we want out of a


simulation is a time-average (what fraction of the time is the system in some
failure mode? how much of the final product is produced per unit time? what
is the average rate of return on the portfolio?). The ergodic theorem says we
don't have to step through many simulations of the Markov chain to get these
averages; we can just find the invariant distribution and calculate from there.
^18 Notice that we do not need to assume the component is recurrent and aperiodic, just recurrent. You can check with, say, the period-3 chain from Figure 3 that the next two equations hold, no matter what state we start from.
^19 EXERCISE: Convince yourself that Eq. 35 really does follow from Eq. 34.
^20 Several decades before Markov, the physicist Ludwig Boltzmann, as part of explaining why thermodynamics works, argued that a large collection of molecules should, within a short time, come arbitrarily close to every configuration of the molecules which was compatible with the conservation of energy. Since all our measurements take a long time to make (molecularly speaking at least), we would see only the average over these configurations, which would look like an expectation value. He needed a name for this, and called it the "ergodic" property of the trajectory, from the Greek ergon ("energy, work") + odos ("path, way"). Sadly, the name stuck.

4. It gives us a way to replace complicated expectations with simple simulations.
We will see next time that there are many cases where it is much easier to find a
chain whose invariant distribution is v ∗ than it is to find v ∗ itself. (Strange but
true!) Simulating from the chain then gives a way of calculating expectations
with respect to v ∗ .

3 Summing Up
Here are the key ideas:
1. The Monte Carlo method is to evaluate integrals and expectations by simulat-
ing from suitable distributions and taking sample averages over the simulation
points. So long as sample averages converge on expectations, the approxima-
tion error in the Monte Carlo method can be made as small as we like.
• With independent samples, the Monte Carlo approximation is unbiased,
and its variance is O(1/n) in the number of simulation points.
• In importance sampling, we simulate from a different distribution than
the one we’re really interested in, and then correct by taking a weighted
average.
2. Markov processes are sequences of random variables where the distribution of
the next variable depends only on the value of the current one. A Markov chain
is a Markov process where the states are discrete. The transitions in a Markov
chain are represented in a stochastic matrix.
• To find the distribution after t steps of the chain, we take the current
distribution (written as a vector) and multiply it by the t th power of the
transition matrix.
• As t grows, the distribution becomes a combination of the eigenvectors
whose eigenvalues have magnitude 1. Eigenvectors whose eigenvalues are
simply 1 are invariant distributions.
• In an aperiodic chain, the long-run distribution is an invariant distribu-
tion.
• Each eigenvector with eigenvalue 1 corresponds to a different recurrent
component of states. If there is only a single recurrent, aperiodic compo-
nent, the long-run distribution is this unique eigenvector.
• Sample averages taken along any single sequence of the chain converge on
expectations under the invariant distribution (ergodicity).

A Hand-waving Argument about the Ergodic Theorem
The left-hand side of Eq. 34, what we want to have converge, is

\frac{1}{n} \sum_{t=1}^{n} \mathbf{1}_i(X_t)    (36)

This is a random quantity. Let’s try to work out its expectation value (assuming X1 ∼
p0 ) and its variance as n → ∞. If the expectation tends to a limit and the variance
shrinks to zero, then the time-average as a whole must converge on its expectation
value.
For expectation, we use the fact that taking expectations is a linear operation:

E_{p_0}\!\left[\frac{1}{n} \sum_{t=1}^{n} \mathbf{1}_i(X_t)\right] = \frac{1}{n} \sum_{t=1}^{n} E_{p_0}\!\left[\mathbf{1}_i(X_t)\right]    (37)

= \frac{1}{n} \sum_{t=1}^{n} P(X_t = i)    (38)

= \frac{1}{n} \sum_{t=1}^{n} (p_0 q^{t-1})_i    (39)

Since, as t \to \infty,

p_0 q^t \to v^*    (40)

we can conclude that

(p_0 q^{t-1})_i \to v_i^*    (41)

and therefore

E_{p_0}\!\left[\frac{1}{n} \sum_{t=1}^{n} \mathbf{1}_i(X_t)\right] \to v_i^*    (42)

Variance is more involved:

\mathrm{Var}_{p_0}\!\left[\frac{1}{n} \sum_{t=1}^{n} \mathbf{1}_i(X_t)\right] = \frac{1}{n^2} \mathrm{Var}_{p_0}\!\left[\sum_{t=1}^{n} \mathbf{1}_i(X_t)\right]    (43)

= \frac{1}{n^2}\left[\sum_{t=1}^{n} \mathrm{Var}_{p_0}\!\left[\mathbf{1}_i(X_t)\right] + 2 \sum_{t=1}^{n-1} \sum_{s=t+1}^{n} \mathrm{Cov}_{p_0}\!\left[\mathbf{1}_i(X_t), \mathbf{1}_i(X_s)\right]\right]

(Recall Var[X + Y] = Var[X] + Var[Y] + 2 Cov[X, Y].) The first sum, of the variances,
we can handle; the summands tend towards v_i^*(1 - v_i^*) as t grows. (Why?)
Thus the whole sum approaches n v_i^*(1 - v_i^*).
The covariances are where I'm going to wave my hands. Since p_0 q^t \to v^*, no matter
what distribution we put in for p_0, the Markov chain is "asymptotically independent".
After many time-steps, the chain has nearly the same distribution no matter
what state we started it in. So P(X_t = i, X_s = j) \to P(X_t = i) P(X_s = j) as s - t \to \infty.
And in fact, from Eq. 33, we know that |P(X_t = i, X_s = j) - P(X_t = i) P(X_s = j)|
shrinks to zero exponentially, with the exponent depending on |\lambda_2|. In fact, we can
reasonably suppose (hand-waving!) that

|P(X_t = i, X_s = j) - P(X_t = i) P(X_s = j)| \leq \kappa_i |\lambda_2|^{s-t}    (44)

for some constant \kappa_i. Since

\mathrm{Cov}\!\left[\mathbf{1}_i(X_t), \mathbf{1}_i(X_s)\right] = P(X_t = i, X_s = i) - P(X_t = i) P(X_s = i)    (45)

(why?), we can add up the covariances,

\sum_{s=t+1}^{n} \mathrm{Cov}_{p_0}\!\left[\mathbf{1}_i(X_t), \mathbf{1}_i(X_s)\right] \leq \sum_{s=t+1}^{n} \kappa_i |\lambda_2|^{s-t}    (46)

\leq \sum_{s=t+1}^{\infty} \kappa_i |\lambda_2|^{s-t}    (47)

= \kappa_i \sum_{h=1}^{\infty} |\lambda_2|^{h}    (48)

= \kappa_i \frac{|\lambda_2|}{1 - |\lambda_2|}    (49)

since the infinite sum is a geometric series. The sum of the covariances is thus limited:

\sum_{t=1}^{n-1} \sum_{s=t+1}^{n} \mathrm{Cov}_{p_0}\!\left[\mathbf{1}_i(X_t), \mathbf{1}_i(X_s)\right]    (50)

\leq \sum_{t=1}^{n-1} \kappa_i \frac{|\lambda_2|}{1 - |\lambda_2|}

\leq n \kappa_i \frac{|\lambda_2|}{1 - |\lambda_2|}    (51)

Putting everything together,

\mathrm{Var}_{p_0}\!\left[\frac{1}{n} \sum_{t=1}^{n} \mathbf{1}_i(X_t)\right]    (52)

\leq \frac{1}{n^2} n v_i^*(1 - v_i^*) + \frac{2}{n^2} n \kappa_i \frac{|\lambda_2|}{1 - |\lambda_2|}    (53)

= \frac{v_i^*(1 - v_i^*) + 2 \kappa_i |\lambda_2|/(1 - |\lambda_2|)}{n}    (54)

which \to 0 as n \to \infty.

It may clear up this somewhat tricky mathematical argument to compare what's going
on here to what we'd see if we had an IID sample of discrete variables. Then all the
covariance terms would go away^21, and we'd have

\mathrm{Var}_{p_0}\!\left[\frac{1}{n} \sum_{t=1}^{n} \mathbf{1}_i(X_t)\right] = \frac{v_i^*(1 - v_i^*)}{n}    (55)

This is the familiar situation from basic statistics where each sample gives us an in-
dependent piece of information about the distribution, and our uncertainty about
population probabilities goes down like 1/n. With the Markov chain, because the
samples are correlated with each other, each observation is not an independent piece
of information. But it’s like we had some smaller number of independent samples:

\mathrm{Var}_{p_0}\!\left[\frac{1}{n} \sum_{t=1}^{n} \mathbf{1}_i(X_t)\right] = \frac{v_i^*(1 - v_i^*)}{n/\tau}    (56)

where

\tau = 1 + 2 \frac{\kappa_i |\lambda_2|}{v_i^*(1 - v_i^*)(1 - |\lambda_2|)}    (57)

tells us how long we have to wait for the correlations to become negligible.
Stepping back even more for the really big picture, what’s going on is that the
variables in a Markov chain are dependent, but widely-separated ones are almost in-
dependent. So if we wait for a long time, averages over the dependent variables look
very much like averages over independent ones. How long is “long” is quantified by
τ.

Problem
Suppose that X_1, X_2, \ldots, X_t, \ldots is a sequence of variables, not necessarily a Markov
chain, which is "weakly stationary", so

E[X_t] = E[X_s] = \mu    (58)

and that

\mathrm{Cov}[X_t, X_s] = \rho(|t - s|)    (59)

for all t, s. Further suppose that

\sum_{h=0}^{\infty} \rho(h) = \kappa \rho(0) < \infty    (60)

Imitating the arguments made above, prove that

E\!\left[\frac{1}{n} \sum_{t=1}^{n} X_t\right] = \mu
^21 Notice that an IID sequence is a kind of Markov chain, where \lambda_2 = 0 exactly.
and that

\mathrm{Var}\!\left[\frac{1}{n} \sum_{t=1}^{n} X_t\right] \leq \frac{\rho(0)}{n/(1 + 2\kappa)}

Conclude that

\frac{1}{n} \sum_{t=1}^{n} X_t \to \mu

Congratulations; you have proved the “mean-square ergodic theorem”.

References
[1] Basharin, Gely P., Amy N. Langville and Valeriy A. Naumov (2004). "The Life and Work of A. A. Markov." Linear Algebra and its Applications, 386: 3–26. URL http://decision.csl.uiuc.edu/~meyn/pages/Markov-Work-and-life.pdf.

[2] Serber, Robert (1992). The Los Alamos Primer: The First Lectures on How to Build the Atomic Bomb. Berkeley: University of California Press. Annotated by Robert Serber; edited and with an introduction by Richard Rhodes.
