CONVERGENCE THEOREM FOR FINITE MARKOV CHAINS
ARI FREEDMAN
Contents
1. Introduction and Basic Definitions 1
2. Uniqueness of Stationary Distributions 3
3. Existence of a Stationary Distribution 5
4. Convergence Theorem 8
Acknowledgments 13
References 13
the Markov property, and can be formalized by saying that if $H = \bigcap_{i=0}^{t-1} \{X_i = x_i\}$ is any event such that P(H ∩ {X_t = x}) > 0, then
$$P\{X_{t+1} = y \mid H \cap \{X_t = x\}\} = P\{X_{t+1} = y \mid X_t = x\},$$
and then we simply define P(x, y) = P{X_{t+1} = y | X_t = x}.
[Figure 1: state diagram of the example Markov chain on the states x, y, z, with arrows labeled by transition probabilities.]
We can illustrate a Markov chain with a state diagram, in which an arrow from
one state to another indicates the probability of going to the second state given we
were just in the first. For example, in this diagram, given that the Markov chain is
currently in x, we have probability .4 of staying in x, probability .6 of going to z,
and probability 0 of going to y in the next time step (Fig. 1). This Markov chain
would be represented by the transition matrix
$$P = \begin{pmatrix} .4 & 0 & .6 \\ .5 & .3 & .2 \\ 0 & 1 & 0 \end{pmatrix},$$
where the rows and columns are indexed by the states x, y, z in that order.
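As a concrete, non-essential illustration, the following Python sketch simulates this example chain from its transition matrix; NumPy and the helper names are choices made for the illustration, not part of the text. Note that the sampler consults only the current state, which is exactly the Markov property described above.

```python
import numpy as np

STATES = ["x", "y", "z"]
# Transition matrix of the example chain (rows/columns ordered x, y, z).
P = np.array([[0.4, 0.0, 0.6],
              [0.5, 0.3, 0.2],
              [0.0, 1.0, 0.0]])

def step(current, rng):
    """Sample X_{t+1} given X_t: row `current` of P gives the conditional law."""
    i = STATES.index(current)
    return STATES[rng.choice(len(STATES), p=P[i])]

def simulate(x0, T, rng):
    """Return a sample path X_0, X_1, ..., X_T started at x0."""
    path = [x0]
    for _ in range(T):
        path.append(step(path[-1], rng))  # depends only on the current state
    return path

rng = np.random.default_rng(0)
print(simulate("x", 10, rng))  # e.g. ['x', 'z', 'y', 'y', 'x', ...]
```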
It is fairly easy to see that if P and Q are both stochastic matrices, then P Q is
also a stochastic matrix, and if µ is a distribution, then µP is also a distribution.
Thus, a stationary distribution is one for which advancing it along the Markov
chain does not change the distribution: if the distribution of Xt is a stationary
distribution π, then the distribution of Xt+1 will also be π. This brings up the
question of when a Markov chain will have a stationary distribution, and if so, is this
distribution unique? Will any distribution converge to this stationary distribution
over time? It turns out that with only mild constraints, all of these are satisfied.
= h(x0 ),
where the last inequality follows from P (x0 , z) > 0 and h(z) < h(x0 ). However,
this gives us h(x0 ) < h(x0 ), a contradiction, which means h(x0 ) = h(z).
Now for any y ∈ Ω, P being irreducible implies there is a path from x0 to y; let it be x0, x1, . . . , xn = y such that P(xi, xi+1) > 0. Thus, h(x0) = h(x1), and so x1 also maximizes h, which means h(x1) = h(x2). We carry on this logic to get h(x0) = h(x1) = · · · = h(xn) = h(y). Since y was arbitrary, h is constant on Ω.
Now we are ready to show that if a stationary distribution exists (which we show
in the next section), it must be unique.
Proof. By Lemma 2.2, the only functions h that are harmonic are those of the form
h(x) = c ∀x ∈ Ω for some constant c. Putting this into vector form, this means the
only solutions to the equation P h = h, or equivalently (P − I)h = 0 are
$$h = c\begin{pmatrix}1 \\ 1 \\ \vdots \\ 1\end{pmatrix}.$$
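Lemma 2.2 can also be checked numerically for the example chain from the introduction: the null space of P − I should be one-dimensional and spanned by the all-ones vector. The sketch below is only a sanity check, with the tolerance 1e-10 an arbitrary choice.

```python
import numpy as np

# Example chain from the introduction (irreducible); rows/columns ordered x, y, z.
P = np.array([[0.4, 0.0, 0.6],
              [0.5, 0.3, 0.2],
              [0.0, 1.0, 0.0]])
n = P.shape[0]

# Harmonic functions satisfy P h = h, i.e. (P - I) h = 0, so they form the
# null space of P - I.  For an irreducible chain, Lemma 2.2 says this space
# consists exactly of the constant vectors.
_, s, vt = np.linalg.svd(P - np.eye(n))
null_space = vt[s < 1e-10]          # rows spanning the null space of P - I

print("dimension:", null_space.shape[0])                # expect 1
print("h / h[0] =", null_space[0] / null_space[0][0])   # expect [1. 1. 1.]
```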
$$\tilde{\pi}(y) = E_z(\text{number of visits to } y \text{ before returning to } z) = \sum_{t=0}^{\infty} P_z\{X_t = y,\ \tau_z^+ > t\},$$
since the expected number of visits to y before returning to z is the sum of all the
probabilities of the chain hitting y at a time step less than the return time.
For any given chain with X_0 = z, the number of visits to y before return to z is ≤ τ_z^+, since the total number of states the chain visits before returning to z is τ_z^+. Thus π̃(y) ≤ E_z(τ_z^+), which by Lemma 3.2 is finite, and thus all π̃(y) are finite.
And since P is irreducible, it is at least possible to visit y once (a path from z to y followed by a path from y to z), which means the expected number of visits to y before returning to z is positive, so π̃(y) > 0.
Now we show π̃ is stationary, or that for all y, (π̃P)(y) = π̃(y). First, see that
$$(\tilde{\pi}P)(y) = \sum_{x\in\Omega} \tilde{\pi}(x)P(x, y) = \sum_{x\in\Omega}\sum_{t=0}^{\infty} P_z\{X_t = x,\ \tau_z^+ \geq t + 1\}P(x, y).$$
By the Markov property, each term P_z{X_t = x, τ_z^+ ≥ t + 1}P(x, y) equals P_z{X_t = x, X_{t+1} = y, τ_z^+ ≥ t + 1}, and summing over all x ∈ Ω removes the restriction on X_t, since the events {X_t = x} partition the sample space. Thus
$$(\tilde{\pi}P)(y) = \sum_{t=0}^{\infty} P_z\{X_{t+1} = y,\ \tau_z^+ \geq t + 1\} = \sum_{t=1}^{\infty} P_z\{X_t = y,\ \tau_z^+ \geq t\}.$$
We notice that this final summation is very similar to our original expression for π̃(y); in particular,
$$(\tilde{\pi}P)(y) = \tilde{\pi}(y) - P_z\{X_0 = y,\ \tau_z^+ > 0\} + \sum_{t=1}^{\infty} P_z\{X_t = y,\ \tau_z^+ = t\}.$$
But this final term is just accounting for all the occurrences of X_{τ_z^+} = y, and so it sums up to P_z{X_{τ_z^+} = y}, which is equal to 1 when y = z and 0 otherwise (since the Markov chain at its return time should be back at its starting state). Similarly, P_z{X_0 = y, τ_z^+ > 0} is equal to 1 when y = z and 0 otherwise, since X_0 = z and τ_z^+ > 0 are always true by definition. Thus, these two terms are always equal, so they cancel out, leaving us with (π̃P)(y) = π̃(y) ∀y ∈ Ω, or
$$\tilde{\pi}P = \tilde{\pi}.$$
This proves that π̃ is stationary, so to make it a stationary distribution, we must divide each element by the sum of the elements, which is equal to
$$\sum_{x\in\Omega} \tilde{\pi}(x) = \sum_{x\in\Omega} E_z(\text{number of visits to } x \text{ before returning to } z) = E_z(\tau_z^+),$$
as the return time for any chain is equal to the total number of states it visits before returning to its start.
Thus, we define
$$\pi(x) = \frac{\tilde{\pi}(x)}{E_z(\tau_z^+)},$$
which exists since τ_z^+ > 0 by definition, and we will get a stationary distribution (a stationary vector multiplied by a scalar is still stationary). So by Corollary 2.3, this π is the only such stationary distribution. As such, for any choice of z ∈ Ω, we will get the same stationary distribution π, so
$$\pi(x) = \frac{\tilde{\pi}(x)}{E_x(\tau_x^+)} \quad \forall x \in \Omega.$$
Note that choosing z = x also changes the definition of π̃, so that π̃(x) is now the expected number of visits to x before returning to x, which is exactly 1: the only visit counted is the one at time 0, since the next time the chain hits x is the return time itself. Thus,
$$\pi(x) = \frac{1}{E_x(\tau_x^+)} \quad \forall x \in \Omega$$
is the unique stationary distribution for P.
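The formula π(x) = 1/E_x(τ_x^+) can be illustrated by simulation. The sketch below is an informal check (the sample size, seed, and helper names are choices for the example): it estimates each E_x(τ_x^+) for the chain from the introduction and compares 1/E_x(τ_x^+) with the stationary distribution obtained as a left eigenvector of P.

```python
import numpy as np

# Example chain from the introduction; states 0, 1, 2 stand for x, y, z.
P = np.array([[0.4, 0.0, 0.6],
              [0.5, 0.3, 0.2],
              [0.0, 1.0, 0.0]])
rng = np.random.default_rng(1)

def return_time(x):
    """Sample tau_x^+ = min{t >= 1 : X_t = x} for the chain started at X_0 = x."""
    state, t = x, 0
    while True:
        state = rng.choice(P.shape[0], p=P[state])
        t += 1
        if state == x:
            return t

n_samples = 20000
pi_hat = np.array([1.0 / np.mean([return_time(x) for _ in range(n_samples)])
                   for x in range(P.shape[0])])

# Compare with the stationary distribution found as a left eigenvector of P.
eigvals, eigvecs = np.linalg.eig(P.T)
pi = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
pi = pi / pi.sum()
print("1 / E_x(tau_x^+) (simulated):", np.round(pi_hat, 3))
print("stationary pi (eigenvector): ", np.round(pi, 3))  # both near [0.329, 0.395, 0.276]
```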
4. Convergence Theorem
We have now shown that all irreducible Markov chains have a unique stationary
distribution π. However, in order to ensure that any distribution over such a chain
will converge to π, we require one more condition, called aperiodicity.
Definition 4.1. Let T (x) = {t ≥ 1 : P t (x, x) > 0} be the set of all time steps
for which a Markov chain can start and end in a state x. Then the period of x is
gcd T (x).
Lemma 4.2. If P is irreducible, then the period of all states is equal, or
gcd T (x) = gcd T (y) ∀x, y ∈ Ω.
Proof. Fix states x and y. Since P is irreducible, ∃r, l ≥ 0 such that P r (x, y) > 0
and P l (y, x) > 0 (Fig. 2).
[Figure 2: a path of r steps from x to y, a path of l steps from y back to x, and a loop of a steps at x.]
Let m = r + l. Then m ∈ T (x), since we can get from x to y in r steps and then
from y back to x in l steps, adding up to r + l = m steps. Similarly, m ∈ T (y),
going from y to x and back to y. If a ∈ T (x), there exists a path from x to itself in
a steps (Fig. 2), so then a + m ∈ T (y) by going from y to x, from x to itself, and
from x back to y, totalling l + a + r = a + m steps. Thus, a ∈ T (y) − m ∀a ∈ T (x),
where T (y) − m = {n − m | n ∈ T (y)}, so
T (x) ⊂ T (y) − m.
Take any n ∈ T(y), so gcd T(y) | n by the definition of gcd. Thus, m ∈ T(y) implies gcd T(y) | m as well. This means gcd T(y) | n − m ∀n ∈ T(y), or equivalently,
gcd T(y) | a ∀a ∈ T(y) − m.
And we showed that T (x) ⊂ T (y) − m, so this also gives us gcd T (y) | a ∀a ∈ T (x).
So gcd T (y) is a common divisor of T (x), which implies, by the definition of gcd,
that
gcd T (y) | gcd T (x).
By a completely parallel argument, switching around x and y, we also get that gcd T(x) | gcd T(y). Therefore, gcd T(x) = gcd T(y) ∀x, y ∈ Ω.
This shows that an irreducible Markov chain has a period common to all of its
states, which we then call the period of the chain.
Definition 4.3. An irreducible Markov chain is called aperiodic if its period is
equal to 1, or equivalently, gcd T (x) = 1 ∀x ∈ Ω.
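In practice the period of a state can be computed straight from Definition 4.1 by taking the gcd of the return times found among the first several powers of P. The Python sketch below does this for two small chains (the cutoff t_max = 50 and the helper name are assumptions of the example): a deterministic 3-cycle, where every state has period 3 as Lemma 4.2 predicts, and the chain from the introduction, which is aperiodic since P(x, x) > 0 puts 1 in T(x).

```python
import numpy as np
from math import gcd
from functools import reduce

def period(P, x, t_max=50):
    """gcd of {1 <= t <= t_max : P^t(x, x) > 0}.  The cutoff t_max is an
    assumption of this sketch; it is enough to reveal the period of small chains."""
    Pt = np.eye(P.shape[0])
    times = []
    for t in range(1, t_max + 1):
        Pt = Pt @ P
        if Pt[x, x] > 0:
            times.append(t)
    return reduce(gcd, times) if times else 0

# A deterministic 3-cycle: the chain can only return to a state in multiples of 3 steps.
cycle = np.array([[0.0, 1.0, 0.0],
                  [0.0, 0.0, 1.0],
                  [1.0, 0.0, 0.0]])
# The chain from the introduction: P(x, x) = 0.4 > 0, so 1 is in T(x).
P = np.array([[0.4, 0.0, 0.6],
              [0.5, 0.3, 0.2],
              [0.0, 1.0, 0.0]])

print([period(cycle, s) for s in range(3)])  # [3, 3, 3]: one common period, as in Lemma 4.2
print([period(P, s) for s in range(3)])      # [1, 1, 1]: aperiodic
```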
Before being able to prove the Convergence Theorem, we need one result con-
cerning aperiodic chains, and a number-theoretic lemma to prove it.
Proof. Let B = {x ∈ Ω : µ(x) ≥ ν(x)} be the set of states for which µ(x)−ν(x) ≥ 0,
so its complement is B c = Ω \ B = {x ∈ Ω : µ(x) < ν(x)}, which is the set of states
for which µ(x) − ν(x) < 0.
Let A ⊂ Ω be any set of states. Now A is the disjoint union of A ∩ B and A ∩ B c ,
and any x ∈ A ∩ B c is in B c and thus µ(x) − ν(x) < 0, so
µ(A) − ν(A) = µ(A ∩ B) − ν(A ∩ B) + µ(A ∩ B c ) − ν(A ∩ B c ) ≤ µ(A ∩ B) − ν(A ∩ B).
And B is the disjoint union of A ∩ B and B \ A, and any x ∈ B \ A is in B and
thus µ(x) − ν(x) ≥ 0, so
µ(A ∩ B) − ν(A ∩ B) ≤ µ(A ∩ B) − ν(A ∩ B) + µ(B \ A) − ν(B \ A) = µ(B) − ν(B).
Putting these together, we get
µ(A) − ν(A) ≤ µ(B) − ν(B),
and by symmetric logic with A ∩ B c as an intermediary, we get
µ(B c ) − ν(B c ) ≤ µ(A) − ν(A).
Now µ(B) + µ(B c ) = ν(B) + ν(B c ) = 1 implies µ(B c ) − ν(B c ) = −(µ(B) − ν(B)),
so
−(µ(B) − ν(B)) ≤ µ(A) − ν(A) ≤ µ(B) − ν(B),
or |µ(A) − ν(A)| ≤ µ(B) − ν(B). Thus, |µ(A) − ν(A)| is bounded by µ(B) − ν(B), and we can let A = B to attain this bound, since µ(B) − ν(B) ≥ 0 (as µ(x) ≥ ν(x) ∀x ∈ B), and hence |µ(B) − ν(B)| = µ(B) − ν(B). Thus,
$$\begin{aligned}
\|\mu - \nu\|_{TV} &= \max_{A\subset\Omega} |\mu(A) - \nu(A)| \\
&= \mu(B) - \nu(B) \\
&= \tfrac{1}{2}\bigl[\mu(B) - \nu(B) + \nu(B^c) - \mu(B^c)\bigr] \\
&= \tfrac{1}{2}\Bigl(\sum_{x\in B}|\mu(x) - \nu(x)| + \sum_{x\in B^c}|\mu(x) - \nu(x)|\Bigr) \\
&= \tfrac{1}{2}\sum_{x\in\Omega}|\mu(x) - \nu(x)|.
\end{aligned}$$
Remark 4.8. This proof also gives us ||µ − ν||_TV = µ(B) − ν(B). Since µ(B) ≤ 1 and ν(B) ≥ 0, this tells us ||µ − ν||_TV ≤ 1, for any distributions µ and ν.
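For a small state space, both descriptions of the total variation distance can be computed directly and compared; the brute-force maximum over subsets is feasible only for tiny Ω, and the two distributions below are made up for the illustration.

```python
import numpy as np
from itertools import chain, combinations

def tv_by_subsets(mu, nu):
    """max over all subsets A of Omega of |mu(A) - nu(A)| (brute force)."""
    n = len(mu)
    all_subsets = chain.from_iterable(combinations(range(n), k) for k in range(n + 1))
    return max(abs(mu[list(A)].sum() - nu[list(A)].sum()) for A in all_subsets)

def tv_half_l1(mu, nu):
    """(1/2) * sum over x of |mu(x) - nu(x)|, the formula proved above."""
    return 0.5 * np.abs(mu - nu).sum()

mu = np.array([0.10, 0.40, 0.30, 0.20])   # two made-up distributions on 4 states
nu = np.array([0.25, 0.25, 0.25, 0.25])
print(tv_by_subsets(mu, nu), tv_half_l1(mu, nu))   # both approximately 0.2, never above 1
```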
Now we are finally ready to prove the main result of this paper, which tells us
that an irreducible, aperiodic Markov chain will converge at an exponential rate to
a stationary distribution over time.
Proof. Since P is irreducible and aperiodic, Lemma 4.5 tells us there exists an r ≥ 0 satisfying P^r(x, y) > 0 ∀x, y ∈ Ω.
Let
$$\Pi = \begin{pmatrix}\pi \\ \vdots \\ \pi\end{pmatrix}$$
be the |Ω| × |Ω| matrix all of whose rows are the stationary distribution π, making Π a stochastic matrix.
Since P^r(x, y) > 0 ∀x, y ∈ Ω, and π(y) > 0 ∀y ∈ Ω by Theorem 3.3,
$$\delta' = \min_{x,y\in\Omega} \frac{P^r(x, y)}{\pi(y)} > 0.$$
And since πP = π,
$$\Pi P = \begin{pmatrix}\pi \\ \vdots \\ \pi\end{pmatrix} P = \begin{pmatrix}\pi P \\ \vdots \\ \pi P\end{pmatrix} = \begin{pmatrix}\pi \\ \vdots \\ \pi\end{pmatrix} = \Pi,$$
and likewise MΠ = Π for any stochastic matrix M, since each row of MΠ is a convex combination of the rows of Π, all of which equal π. We claim that for each k ≥ 1 there is a stochastic matrix Q_k such that
$$P^{rk} = (1 - \theta^k)\Pi + \theta^k Q_k \quad \forall k \geq 1.$$
12 ARI FREEDMAN
Indeed, if P^{rn} = (1 − θ^n)Π + θ^nQ_n for a stochastic matrix Q_n, then with Q_{n+1} = Q_nQ (again stochastic, being a product of stochastic matrices),
$$\begin{aligned}
P^{r(n+1)} &= P^{rn} P^r \\
&= [(1 - \theta^n)\Pi + \theta^n Q_n]P^r \\
&= (1 - \theta^n)\Pi P^r + \theta^n Q_n P^r \\
&= (1 - \theta^n)\Pi + \theta^n Q_n[(1 - \theta)\Pi + \theta Q] \\
&= (1 - \theta^n)\Pi + \theta^n(1 - \theta)Q_n\Pi + \theta^{n+1} Q_{n+1} \\
&= (1 - \theta^n)\Pi + (\theta^n - \theta^{n+1})\Pi + \theta^{n+1} Q_{n+1} \\
&= (1 - \theta^{n+1})\Pi + \theta^{n+1} Q_{n+1},
\end{aligned}$$
completing the induction. Multiplying by P^j and using ΠP^j = Π, we get P^{rk+j} = (1 − θ^k)Π + θ^kQ_kP^j, or P^{rk+j} − Π = θ^k(Q_kP^j − Π). Looking at any row x of both sides of this equation, we have (P^{rk+j} − Π)(x, ·) = θ^k(Q_kP^j − Π)(x, ·). Thus,
$$\|P^{rk+j}(x, \cdot) - \pi\|_{TV} = \theta^k\,\|(Q_kP^j)(x, \cdot) - \Pi(x, \cdot)\|_{TV} \leq \theta^k,$$
since Q, P, and Π being stochastic implies (Q_kP^j)(x, ·) and Π(x, ·) are distributions, thus their total variation distance is bounded by 1 from Remark 4.8.
Now we can define the constants α = θ^{1/r} and C = α^{−r}. Let t ≥ 0, and define k and j by dividing t by r, so t = rk + j with 0 ≤ j < r. Finally, since 0 < θ < 1 and 0 < α < 1,
$$\max_{x\in\Omega}\|P^t(x, \cdot) - \pi\|_{TV} \leq \theta^k = \alpha^{rk} = \alpha^{t-j} \leq \alpha^{t-r} = C\alpha^t.$$
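The geometric decay promised by the theorem is easy to observe numerically. The sketch below is an illustration only, with the introduction's example chain standing in for a general irreducible, aperiodic P: it computes max_x ||P^t(x, ·) − π||_TV for several t using the half-L1 formula, and the printed values shrink roughly like Cα^t.

```python
import numpy as np

# Example chain from the introduction: irreducible and aperiodic.
P = np.array([[0.4, 0.0, 0.6],
              [0.5, 0.3, 0.2],
              [0.0, 1.0, 0.0]])

# Stationary distribution as the left eigenvector of P for eigenvalue 1.
eigvals, eigvecs = np.linalg.eig(P.T)
pi = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
pi = pi / pi.sum()

def d(t):
    """max over starting states x of ||P^t(x, .) - pi||_TV (half-L1 formula)."""
    Pt = np.linalg.matrix_power(P, t)
    return max(0.5 * np.abs(Pt[x] - pi).sum() for x in range(P.shape[0]))

for t in (1, 5, 10, 20, 40):
    print(t, d(t))   # the distances shrink roughly like C * alpha**t with 0 < alpha < 1
```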
References
[1] David A. Levin, Yuval Peres and Elizabeth L. Wilmer. Markov Chains and Mixing Times.
http://pages.uoregon.edu/dlevin/MARKOV/markovmixing.pdf.