
CONVERGENCE THEOREM FOR FINITE MARKOV CHAINS

ARI FREEDMAN
Date: August 28, 2017.

Abstract. In this expository paper, I will give an overview of the necessary conditions for convergence in Markov chains on finite state spaces. In doing so, I will prove the existence and uniqueness of a stationary distribution for irreducible Markov chains, and finally the Convergence Theorem when aperiodicity is also satisfied.

Contents
1. Introduction and Basic Definitions
2. Uniqueness of Stationary Distributions
3. Existence of a Stationary Distribution
4. Convergence Theorem
Acknowledgments

1. Introduction and Basic Definitions


A Markov chain is a stochastic process, i.e., a randomly determined process, that moves among a set of states over discrete time steps. Given that the chain is at a certain state at any given time, there is a fixed probability distribution for which state the chain will go to next (including repeating the same state). If there are n states, then the n × n transition matrix P describes the Markov chain, where the rows and columns are indexed by the states, and P(x, y), the entry in the x-th row and y-th column, gives the probability of going to state y at time t + 1, given that the chain is at state x at time t. We can formalize this as follows.
Definition 1.1. A finite Markov chain with finite state space Ω and |Ω| × |Ω| transition matrix P is a sequence of random variables X0, X1, . . . where

P{Xt+1 = y | Xt = x} = P(x, y),

or the probability of Xt+1 = y given Xt = x is P(x, y). Then P(x, ·), the x-th row of P, gives the distribution of Xt+1 given Xt = x. Here P is the notation for the probability of an event, and Px denotes the probability of an event given X0 = x ∈ Ω. Thus, Px{Xt = y} = P{Xt = y | X0 = x} = P^t(x, y), as multiplying a distribution by the transition matrix P advances the distribution one step along the Markov chain, so multiplying by P^t advances it by t steps from X0 = x.
Furthermore, when the distribution of Xt+1 is conditioned on Xt, the previous values in the chain, X0, X1, . . . , Xt−1, do not affect the value of Xt+1. This is called the Markov property, and can be formalized by saying that if H = ∩_{i=0}^{t−1} {Xi = xi} is any event such that P(H ∩ {Xt = x}) > 0, then

P{Xt+1 = y | H ∩ {Xt = x}} = P{Xt+1 = y | Xt = x},

and then we simply define P(x, y) = P{Xt+1 = y | Xt = x}.

Figure 1. An example Markov chain with three states x, y, and z.

We can illustrate a Markov chain with a state diagram, in which an arrow from
one state to another indicates the probability of going to the second state given we
were just in the first. For example, in this diagram, given that the Markov chain is
currently in x, we have probability .4 of staying in x, probability .6 of going to z,
and probability 0 of going to y in the next time step (Fig. 1). This Markov chain
would be represented by the transition matrix

          x    y    z
     x ( .4    0   .6 )
P =  y ( .5   .3   .2 )
     z (  0    1    0 )
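This matrix is easy to experiment with. The following is a minimal Python sketch (mine, not part of the original paper) that encodes P, checks that each row is a distribution, and simulates the chain; the helper simulate is a name I introduce here.

```python
import numpy as np

# The transition matrix of Figure 1; rows and columns are ordered x, y, z.
P = np.array([
    [0.4, 0.0, 0.6],   # from x: stay with prob .4, go to z with prob .6
    [0.5, 0.3, 0.2],   # from y
    [0.0, 1.0, 0.0],   # from z: go to y with probability 1
])

# Each row is a distribution: non-negative entries summing to 1.
assert np.all(P >= 0) and np.allclose(P.sum(axis=1), 1.0)

def simulate(P, x0, t, seed=0):
    """Run the chain for t steps from state index x0 and return the path."""
    rng = np.random.default_rng(seed)
    path = [x0]
    for _ in range(t):
        path.append(int(rng.choice(len(P), p=P[path[-1]])))
    return path

print(simulate(P, x0=0, t=10))   # a random trajectory starting at x
```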

This definition mentions distributions, so it may help to formalize what these are.

Definition 1.2. A probability distribution, or just a distribution, is a vector of non-negative probabilities that sums to 1 (this is known as the law of total probability).
For any state x ∈ Ω, it makes sense that P (x, ·), the x-th row of P , should be a
distribution, since the probability of going from state x to any state is at least 0, and
the sum of the probabilities of going from state x to state y, over all states y ∈ Ω,
should be 1, as these are disjoint events that cover all the possibilities. Distribu-
tions are generally expressed as row vectors, which can then be right-multiplied by
matrices.
The transition matrices associated with Markov chains all fall under the larger
category of what we call stochastic matrices.

Definition 1.3. A stochastic matrix is an n × n matrix with all non-negative values and each row summing to 1. In particular, a matrix is stochastic if and only if it consists of n distribution row vectors in R^n.

It is fairly easy to see that if P and Q are both stochastic matrices, then PQ is also a stochastic matrix, and if µ is a distribution, then µP is also a distribution.
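A quick numerical spot check of both claims (my own sketch; random_stochastic is a helper I introduce only for this illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

def random_stochastic(n):
    """Sample an n x n stochastic matrix by normalizing random positive rows."""
    M = rng.random((n, n))
    return M / M.sum(axis=1, keepdims=True)

P, Q = random_stochastic(4), random_stochastic(4)
mu = np.full(4, 0.25)                    # a distribution on 4 states

PQ = P @ Q
assert np.all(PQ >= 0) and np.allclose(PQ.sum(axis=1), 1.0)   # PQ is stochastic
muP = mu @ P
assert np.all(muP >= 0) and np.allclose(muP.sum(), 1.0)       # muP is a distribution
```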

Definition 1.4. A distribution π is called a stationary distribution of a Markov chain P if πP = π.

Thus, a stationary distribution is one for which advancing it along the Markov
chain does not change the distribution: if the distribution of Xt is a stationary
distribution π, then the distribution of Xt+1 will also be π. This brings up three questions: when does a Markov chain have a stationary distribution? If it has one, is it unique? And will any starting distribution converge to this stationary distribution over time? It turns out that with only mild constraints, the answer to all three is yes.

Definition 1.5. A Markov chain is irreducible if for all states x, y ∈ Ω, there exists a t ≥ 0 such that P^t(x, y) > 0.

Intuitively, this means that it is possible to get from x to y for any x, y ∈ Ω in some finite number of time steps, or, equivalently, there exists a sequence of states x = x0, x1, . . . , xt−1, xt = y (which we call a path) that the chain can take from x to y such that P(xi, xi+1) > 0 for all 0 ≤ i < t.
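Irreducibility can be tested mechanically on small examples. Below is a sketch of mine (not from the paper) that checks mutual reachability by summing the first n powers of P; is_irreducible is my own helper name.

```python
import numpy as np

def is_irreducible(P):
    """Check that every state can reach every other state.

    The (x, y) entry of I + P + ... + P^(n-1) is positive exactly when y
    is reachable from x in fewer than n steps, which suffices on a state
    space of size n (a shortest path never needs more than n - 1 steps).
    """
    n = len(P)
    reach, acc = np.eye(n), np.eye(n)
    for _ in range(n - 1):
        acc = acc @ P
        reach += acc
    return bool(np.all(reach > 0))

P = np.array([[0.4, 0.0, 0.6],
              [0.5, 0.3, 0.2],
              [0.0, 1.0, 0.0]])
print(is_irreducible(P))   # True for the chain of Figure 1
```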

2. Uniqueness of Stationary Distributions


In order to come up with a nice expression to show the existence of a stationary distribution, we must first prove that if such a distribution exists, it is unique. To do so, we need to define harmonic functions and prove a lemma about them.

Definition 2.1. A function h : Ω → R is harmonic at x ∈ Ω if

h(x) = Σ_{y∈Ω} P(x, y)h(y).

If h is harmonic at all states in Ω = {x1, x2, . . . , xn}, we say h is harmonic on Ω, and then Ph = h, where h is the column vector h = (h(x1), h(x2), . . . , h(xn))^T.

Lemma 2.2. If P is irreducible and h is harmonic on Ω, then h is a constant function.

Proof. Since Ω is finite, h attains a maximum at some state x0 ∈ Ω, so that h(x0) ≥ h(y) ∀y ∈ Ω. Let z ∈ Ω be any state such that P(x0, z) > 0, and assume that h(z) < h(x0). Since h is harmonic at x0,

h(x0) = Σ_{y∈Ω} P(x0, y)h(y)
      = P(x0, z)h(z) + Σ_{y∈Ω, y≠z} P(x0, y)h(y)
      ≤ P(x0, z)h(z) + Σ_{y∈Ω, y≠z} P(x0, y)h(x0)
      < P(x0, z)h(x0) + Σ_{y∈Ω, y≠z} P(x0, y)h(x0)
      = (Σ_{y∈Ω} P(x0, y)) h(x0)
      = h(x0),

where the strict inequality follows from P(x0, z) > 0 and h(z) < h(x0). However, this gives us h(x0) < h(x0), a contradiction, which means h(x0) = h(z).
Now for any y ∈ Ω, P being irreducible implies there is a path from x0 to y; let it be x0, x1, . . . , xn = y with P(xi, xi+1) > 0 for each i. Thus, h(x0) = h(x1), and so x1 also maximizes h, which means h(x1) = h(x2). We carry on this logic to get

h(x0) = h(x1) = · · · = h(xn) = h(y).

So h(y) = h(x0) ∀y ∈ Ω, and thus h is a constant function. □
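As a concrete check of the lemma (my own sketch, using SciPy's null_space), the kernel of P − I for the Figure 1 chain is one-dimensional and spanned by the constant vector:

```python
import numpy as np
from scipy.linalg import null_space

P = np.array([[0.4, 0.0, 0.6],
              [0.5, 0.3, 0.2],
              [0.0, 1.0, 0.0]])

# Functions harmonic on all of Omega are exactly the solutions of Ph = h,
# i.e. the kernel of P - I.
K = null_space(P - np.eye(3))
print(K.shape[1])          # 1: the kernel is one-dimensional
print(K[:, 0] / K[0, 0])   # [1. 1. 1.]: spanned by the constant vector
```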

Now we are ready to show that if a stationary distribution exists (which we show
in the next section), it must be unique.

Corollary 2.3. If P is irreducible and has a stationary distribution π, then π is the only such stationary distribution.

Proof. By Lemma 2.2, the only functions h that are harmonic are those of the form h(x) = c ∀x ∈ Ω for some constant c. Putting this into vector form, this means the only solutions to the equation Ph = h, or equivalently (P − I)h = 0, are

h = c(1, 1, . . . , 1)^T.

Thus, dim(ker(P − I)) = 1, so by the rank-nullity theorem, rank(P − I) = |Ω| − 1. And rank(P − I) = rank((P − I)^T) = rank(P^T − I) = |Ω| − 1, so again by the rank-nullity theorem, dim(ker(P^T − I)) = 1, so over all row vectors v ∈ R^|Ω|, the equation (P^T − I)v^T = 0 has only a one-dimensional space of solutions. But this equation is equivalent to vP = v, so, given that πP = π is a solution, all solutions must be of the form v = λπ for some scalar λ. However, to be a distribution whose elements sum to 1, we must have λ = 1, and thus the only stationary distribution is v = π. □
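The proof suggests a direct computation of π: it spans the kernel of P^T − I. A minimal sketch of mine, again on the Figure 1 chain:

```python
import numpy as np
from scipy.linalg import null_space

P = np.array([[0.4, 0.0, 0.6],
              [0.5, 0.3, 0.2],
              [0.0, 1.0, 0.0]])

# Stationary row vectors solve vP = v, i.e. (P^T - I)v^T = 0.
v = null_space(P.T - np.eye(3))[:, 0]
pi = v / v.sum()              # normalize so the entries sum to 1
print(pi)                     # ≈ [0.329, 0.395, 0.276]
assert np.allclose(pi @ P, pi)
```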

3. Existence of a Stationary Distribution


We will now show that all irreducible Markov chains have a stationary distribution by explicitly constructing one, and then by Corollary 2.3, we will know that this stationary distribution is unique.
Definition 3.1. For a Markov chain X0, X1, . . ., the hitting time for a state x ∈ Ω is the first time the chain "hits" x, notated

τx = min{t ≥ 0 : Xt = x}.

When we want the hitting time to be strictly positive, we notate it

τx+ = min{t > 0 : Xt = x},

which is called the first return time when X0 = x.
We will also be using the notation E to denote the expected value of a variable,
and again, Ex means the expected value given X0 = x.
Lemma 3.2. For any states x, y ∈ Ω of an irreducible Markov chain, Ex(τy+) is finite.

Proof. Since P is irreducible, we know for any two states z, w ∈ Ω that there exists an s > 0 such that P^s(z, w) > 0 (if z = w, we consider a path from z to a different state and back to itself to ensure s > 0). Fix one such s = s(z, w) for each pair; we let r be the maximum of these over all z, w ∈ Ω, and let

ε = min{P^s(z, w) : z, w ∈ Ω, 0 < s ≤ r, P^s(z, w) > 0}.

Then for all z, w ∈ Ω, there exists 0 < s ≤ r such that P^s(z, w) ≥ ε > 0. This implies that given any Xt, the probability of the chain reaching a state y between times t and t + r is at least ε, or, taking complements,

P{Xs ≠ y ∀t < s ≤ t + r} ≤ 1 − ε.

In general, saying τy+ > n implies Xt ≠ y ∀0 < t ≤ n. So for k > 0,

Px{τy+ > kr} = Px{Xt ≠ y ∀0 < t ≤ kr}
             = Px{Xt ≠ y ∀0 < t ≤ (k − 1)r} Px{Xt ≠ y ∀(k − 1)r < t ≤ kr | Xt ≠ y ∀0 < t ≤ (k − 1)r}
             ≤ Px{Xt ≠ y ∀0 < t ≤ (k − 1)r}(1 − ε)
             ≤ Px{Xt ≠ y ∀0 < t ≤ (k − 2)r}(1 − ε)^2
             ...
             ≤ Px{Xt ≠ y ∀0 < t ≤ 0}(1 − ε)^k
             = (1 − ε)^k,

where Px{Xt ≠ y ∀0 < t ≤ 0} = 1 by vacuous truth.
Now for any random variable Y valued on the non-negative integers, we have

E(Y) = Σ_{t=0}^∞ t P(Y = t)
     = 1·P(Y = 1) + 2·P(Y = 2) + 3·P(Y = 3) + · · ·
     = (P(Y = 1) + P(Y = 2) + P(Y = 3) + · · ·) + (P(Y = 2) + P(Y = 3) + · · ·) + · · ·
     = Σ_{t=0}^∞ P(Y > t).
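As a quick numerical sanity check of this tail-sum identity (mine, not from the paper), one can compare both sides for a geometric random variable, whose mean 1/p is known:

```python
import numpy as np

# Y ~ Geometric(p) on {1, 2, ...}: P(Y = t) = (1 - p)^(t - 1) p, E(Y) = 1/p.
p, T = 0.3, 2000                       # truncate both infinite sums at T terms
t = np.arange(1, T + 1)
pmf = (1 - p) ** (t - 1) * p
mean_direct = (t * pmf).sum()          # sum of t * P(Y = t)
tails = (1 - p) ** np.arange(T)        # P(Y > s) = (1 - p)^s
print(mean_direct, tails.sum(), 1 / p) # all three agree (≈ 3.3333)
```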

And P{τy+ > t} is a decreasing function with respect to t, since

P{τy+ > t + 1} ≤ P{τy+ > t + 1} + P{τy+ = t + 1} = P{τy+ > t},

which we use to get

Ex(τy+) = Σ_{t=0}^∞ Px{τy+ > t}
        ≤ Σ_{k=0}^∞ r Px{τy+ > kr}
        ≤ r Σ_{k=0}^∞ (1 − ε)^k,

where the middle inequality groups the terms with kr ≤ t < (k + 1)r and bounds each by the largest, Px{τy+ > kr}. By definition, 0 < ε ≤ 1, so 0 ≤ 1 − ε < 1, which means this sum converges, and thus Ex(τy+) is finite. □

Theorem 3.3. If P is irreducible, then it has a unique stationary distribution π with π(x) > 0 ∀x ∈ Ω, given by

π(x) = 1/Ex(τx+).
Proof. Fix any state z ∈ Ω. Then we define

π̃(y) = Ez(number of visits to y before returning to z)
     = Σ_{t=0}^∞ Pz{Xt = y, τz+ > t},

since the expected number of visits to y before returning to z is the sum of all the probabilities of the chain hitting y at a time step less than the return time.

For any given chain with X0 = z, the number of visits to y before returning to z is ≤ τz+, since the total number of states the chain visits before returning to z is τz+. Thus π̃(y) ≤ Ez(τz+), which by Lemma 3.2 is finite, and thus all π̃(y) are finite. And since P is irreducible, it is at least possible to visit y once (a path from z to y followed by a path from y to z), which means the expected number of visits to y before returning to z is positive, so π̃(y) > 0.
Now we show π̃ is stationary, or that for all y, (π̃P)(y) = π̃(y). First, see that

(π̃P)(y) = Σ_{x∈Ω} π̃(x)P(x, y) = Σ_{x∈Ω} Σ_{t=0}^∞ Pz{Xt = x, τz+ ≥ t + 1} P(x, y),

where we just plugged in our earlier expression for π̃(x) and replaced τz+ > t with the equivalent expression τz+ ≥ t + 1.

Since the event {τz+ ≥ t + 1} is only determined by X0, X1, . . . , Xt, it is independent of the event Xt+1 = y, when conditioned on Xt = x, which means

Pz{Xt = x, Xt+1 = y, τz+ ≥ t + 1} = Pz{Xt = x, τz+ ≥ t + 1} Pz{Xt+1 = y | Xt = x}
                                  = Pz{Xt = x, τz+ ≥ t + 1} P(x, y).

We can then plug this in to our earlier expression for (π̃P)(y) and switch the order of summation, since the inner sum converges for all x ∈ Ω, to get

(π̃P)(y) = Σ_{t=0}^∞ Σ_{x∈Ω} Pz{Xt = x, Xt+1 = y, τz+ ≥ t + 1}
         = Σ_{t=0}^∞ Pz{Xt+1 = y, τz+ ≥ t + 1}
         = Σ_{t=1}^∞ Pz{Xt = y, τz+ ≥ t},

using the fact that the sum of the probabilities of {Xt = x} over all x ∈ Ω will just equal 1. We notice that this final summation is very similar to our original expression for π̃(y); in particular,

(π̃P)(y) = π̃(y) − Pz{X0 = y, τz+ > 0} + Σ_{t=1}^∞ Pz{Xt = y, τz+ = t}.

But this final term is just accounting for all the occurrences of X_{τz+} = y, and so it sums up to Pz{X_{τz+} = y}, which is equal to 1 when y = z and 0 otherwise (since the Markov chain at its return time should be back at its starting state). Similarly, Pz{X0 = y, τz+ > 0} is equal to 1 when y = z and 0 otherwise, since X0 = z and τz+ > 0 are always true by definition. Thus, these two terms are always equal, so they cancel out, leaving us with (π̃P)(y) = π̃(y) ∀y ∈ Ω, or

π̃P = π̃.
This proves that π̃ is stationary, so to make it a stationary distribution, we must divide each element by the sum of the elements, which is equal to

Σ_{x∈Ω} π̃(x) = Σ_{x∈Ω} Ez(number of visits to x before returning to z) = Ez(τz+),

as the return time for any chain is equal to the total number of states it visits before returning to its start. Thus, we define

π(x) = π̃(x)/Ez(τz+),

which exists since τz+ > 0 by definition, and this is a stationary distribution (a stationary vector multiplied by a scalar is still stationary). So by Corollary 2.3, this π is the only such stationary distribution. As such, for any choice of z ∈ Ω, we will get the same stationary distribution π, so, choosing z = x,

π(x) = π̃(x)/Ex(τx+) ∀x ∈ Ω.

Note that choosing z = x also changes the definition of π̃, so that π̃(x) is now the expected number of visits to x before returning to x, which is exactly 1, for the single visit at time 0. Thus,

π(x) = 1/Ex(τx+) ∀x ∈ Ω

is the unique stationary distribution for P. □
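A small empirical check of this formula (my own sketch, again on the Figure 1 chain): estimate each mean return time by simulation and compare 1/Ex(τx+) against the eigenvector computation of π from Section 2.

```python
import numpy as np

P = np.array([[0.4, 0.0, 0.6],
              [0.5, 0.3, 0.2],
              [0.0, 1.0, 0.0]])
rng = np.random.default_rng(0)

def mean_return_time(P, x, trials=10000):
    """Monte Carlo estimate of Ex(tau_x^+)."""
    total = 0
    for _ in range(trials):
        state, t = x, 0
        while True:
            state = rng.choice(len(P), p=P[state])
            t += 1
            if state == x:
                break
        total += t
    return total / trials

print([1 / mean_return_time(P, x) for x in range(3)])
# ≈ [0.329, 0.395, 0.276], matching the stationary distribution found earlier
```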

4. Convergence Theorem
We have now shown that all irreducible Markov chains have a unique stationary
distribution π. However, in order to ensure that any distribution over such a chain
will converge to π, we require one more condition, called aperiodicity.
Definition 4.1. Let T(x) = {t ≥ 1 : P^t(x, x) > 0} be the set of all numbers of time steps in which a Markov chain can start at and return to a state x. Then the period of x is gcd T(x).
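On small examples the period can be computed directly from the definition. A sketch of mine (the cutoff t_max is an artifact of the truncation, not of the theory):

```python
import numpy as np
from math import gcd
from functools import reduce

def period(P, x, t_max=50):
    """gcd of {t >= 1 : P^t(x, x) > 0}, truncated at t_max steps."""
    times, Pt = [], np.eye(len(P))
    for t in range(1, t_max + 1):
        Pt = Pt @ P
        if Pt[x, x] > 0:
            times.append(t)
    return reduce(gcd, times) if times else 0   # 0: no return seen by t_max

P = np.array([[0.4, 0.0, 0.6],
              [0.5, 0.3, 0.2],
              [0.0, 1.0, 0.0]])
print([period(P, x) for x in range(3)])   # [1, 1, 1]: the chain is aperiodic
```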
Lemma 4.2. If P is irreducible, then the periods of all states are equal, or gcd T(x) = gcd T(y) ∀x, y ∈ Ω.
Proof. Fix states x and y. Since P is irreducible, ∃r, l ≥ 0 such that P^r(x, y) > 0 and P^l(y, x) > 0 (Fig. 2).

Figure 2. Since P is irreducible, we can get from x to y in r steps, from y to x in l steps, and from x to itself in a ∈ T(x) steps.

Let m = r + l. Then m ∈ T (x), since we can get from x to y in r steps and then
from y back to x in l steps, adding up to r + l = m steps. Similarly, m ∈ T (y),
going from y to x and back to y. If a ∈ T (x), there exists a path from x to itself in
a steps (Fig. 2), so then a + m ∈ T (y) by going from y to x, from x to itself, and
from x back to y, totalling l + a + r = a + m steps. Thus, a ∈ T (y) − m ∀a ∈ T (x),
where T (y) − m = {n − m | n ∈ T (y)}, so
T (x) ⊂ T (y) − m.
Take any n ∈ T(y), so gcd T(y) | n by the definition of gcd. Thus, m ∈ T(y) implies gcd T(y) | m as well. This means gcd T(y) | n − m ∀n ∈ T(y), or equivalently,

gcd T(y) | a ∀a ∈ T(y) − m.

And we showed that T(x) ⊂ T(y) − m, so this also gives us gcd T(y) | a ∀a ∈ T(x). So gcd T(y) is a common divisor of T(x), which implies, by the definition of gcd, that

gcd T(y) | gcd T(x).

By a completely parallel argument, switching around x and y, we also get that gcd T(x) | gcd T(y). Therefore, gcd T(x) = gcd T(y) ∀x, y ∈ Ω. □
This shows that an irreducible Markov chain has a period common to all of its
states, which we then call the period of the chain.
Definition 4.3. An irreducible Markov chain is called aperiodic if its period is
equal to 1, or equivalently, gcd T (x) = 1 ∀x ∈ Ω.
Before being able to prove the Convergence Theorem, we need one result con-
cerning aperiodic chains, and a number-theoretic lemma to prove it.

Lemma 4.4. If S ⊂ Z+ ∪ {0} is closed under addition (a + b ∈ S ∀a, b ∈ S) and gcd S = 1, then there exists M such that a ∈ S ∀a ≥ M.
Proof. We begin by showing that there exists a finite subset T ⊂ S such that gcd T = 1. Let S0 = {a0}, for any a0 ∈ S. Either gcd S0 = 1, in which case we let T = S0 and we are done, or there exists a1 ∈ S for which gcd(S0 ∪ {a1}) < gcd S0, since otherwise we would have gcd S = gcd S0 ≠ 1. So we let S1 = S0 ∪ {a1}, and then gcd S1 < gcd S0. We continue this process of finding ai ∈ S such that if Si = Si−1 ∪ {ai}, then gcd Si < gcd Si−1, creating a sequence of finite sets Si whose gcds strictly decrease until eventually gcd Si = 1, at which point we let T = Si and we are done. We know this will occur at some point, since any strictly decreasing sequence of positive integers must hit 1 in a finite number of steps.
Since gcd T = 1, there exists an integer linear combination of the elements in T that evaluates to 1, so if T = {t1, . . . , tn}, then there are c1, . . . , cn ∈ Z such that

c1 t1 + · · · + cn tn = 1.

Without loss of generality, we can say that c1, . . . , ck ≥ 0 and ck+1, . . . , cn < 0, for some 1 ≤ k ≤ n, so we can move all the negative terms to the other side to get

c1 t1 + · · · + ck tk = 1 + |ck+1| tk+1 + · · · + |cn| tn.

Since S is closed under addition, and all ti ∈ T ⊂ S, the non-negative linear combinations on the left side and right side of this equation are also in S, so if we let these be equal to p and q respectively, we get p, q ∈ S such that p = 1 + q. It is possible that there are no terms other than 1 on the right side, but this just means p = 1 ∈ S, so since S is closed under addition, we can let M = 1 and we are done. Otherwise we have p = 1 + q for p, q ∈ S with q ≥ 1.

Let M = q(q − 1). Then for any a ≥ M, we have a = kq + r for 0 ≤ r < q and k ≥ q − 1, by the division algorithm. Thus r ≤ q − 1 ≤ k, so k − r ≥ 0, and we can express a as a non-negative linear combination of p and q:

a = kq + r = (k − r)q + r(q + 1) = rp + (k − r)q.

Therefore, a ∈ S ∀a ≥ M, as desired. □
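A tiny numerical check of this final step (mine, not from the paper): with p = q + 1, every a ≥ q(q − 1) decomposes as rp + (k − r)q exactly as in the proof.

```python
# Check the decomposition a = rp + (k - r)q from the proof, for p = q + 1:
# every a >= q(q - 1) should be a non-negative combination of p and q.
q = 7          # an arbitrary choice; p and q here are hypothetical stand-ins
p = q + 1
M = q * (q - 1)
for a in range(M, M + 500):
    k, r = divmod(a, q)          # a = kq + r with 0 <= r < q
    assert 0 <= r <= k           # needs k >= q - 1 >= r, true since a >= M
    assert r * p + (k - r) * q == a
print("every a in [%d, %d) is representable" % (M, M + 500))
```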
Lemma 4.5. If P is aperiodic and irreducible, then there exists r ≥ 0 such that P^r(x, y) > 0 ∀x, y ∈ Ω.

Proof. For any x ∈ Ω, we see that T(x) is a set of non-negative integers closed under addition, since if a, b ∈ T(x), we have P^a(x, x) > 0 and P^b(x, x) > 0, so P^{a+b}(x, x) ≥ P^a(x, x)P^b(x, x) > 0 and thus a + b ∈ T(x). And since P is aperiodic, gcd T(x) = 1, which means we can apply Lemma 4.4 to get that there is an Mx such that P^t(x, x) > 0 ∀t ≥ Mx. If we then let M = max{Mx : x ∈ Ω}, we have P^t(x, x) > 0 ∀t ≥ M ∀x ∈ Ω.

Since P is irreducible, for any x, y ∈ Ω there exists a t such that P^t(x, y) > 0; fix one such t = t(x, y) for each pair. We let t0 = max{t(x, y) : x, y ∈ Ω} be the maximum of all of these, and finally let r = t0 + M. Then for any states x, y ∈ Ω, there exists a t ≤ t0 such that P^t(x, y) > 0, and then r − t = M + (t0 − t) ≥ M, which means P^{r−t}(x, x) > 0, so then we get

P^r(x, y) ≥ P^{r−t}(x, x) P^t(x, y) > 0. □
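For a concrete chain, the smallest such r can be found by direct search; here is a sketch of mine (power_positivity_index is a name I introduce):

```python
import numpy as np

def power_positivity_index(P, r_max=100):
    """Smallest r with P^r > 0 entrywise, or None if none is found by r_max."""
    Pr = np.eye(len(P))
    for r in range(1, r_max + 1):
        Pr = Pr @ P
        if np.all(Pr > 0):
            return r
    return None

P = np.array([[0.4, 0.0, 0.6],
              [0.5, 0.3, 0.2],
              [0.0, 1.0, 0.0]])
print(power_positivity_index(P))   # 2: already P^2 has all entries positive
```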


To prove anything about the convergence of a distribution, we need to first define some measure of distance between distributions.

Definition 4.6. The total variation distance between two distributions µ and ν is defined as

||µ − ν||TV = max_{A⊂Ω} |µ(A) − ν(A)|,

where µ(A) = Σ_{x∈A} µ(x).
Proposition 4.7. ||µ − ν||TV = (1/2) Σ_{x∈Ω} |µ(x) − ν(x)|.

Proof. Let B = {x ∈ Ω : µ(x) ≥ ν(x)} be the set of states for which µ(x)−ν(x) ≥ 0,
so its complement is B c = Ω \ B = {x ∈ Ω : µ(x) < ν(x)}, which is the set of states
for which µ(x) − ν(x) < 0.
Let A ⊂ Ω be any set of states. Now A is the disjoint union of A ∩ B and A ∩ B c ,
and any x ∈ A ∩ B c is in B c and thus µ(x) − ν(x) < 0, so
µ(A) − ν(A) = µ(A ∩ B) − ν(A ∩ B) + µ(A ∩ B c ) − ν(A ∩ B c ) ≤ µ(A ∩ B) − ν(A ∩ B).
And B is the disjoint union of A ∩ B and B \ A, and any x ∈ B \ A is in B and
thus µ(x) − ν(x) ≥ 0, so
µ(A ∩ B) − ν(A ∩ B) ≤ µ(A ∩ B) − ν(A ∩ B) + µ(B \ A) − ν(B \ A) = µ(B) − ν(B).
Putting these together, we get
µ(A) − ν(A) ≤ µ(B) − ν(B),
and by symmetric logic with A ∩ B c as an intermediary, we get
µ(B c ) − ν(B c ) ≤ µ(A) − ν(A).
Now µ(B) + µ(B c ) = ν(B) + ν(B c ) = 1 implies µ(B c ) − ν(B c ) = −(µ(B) − ν(B)),
so
−(µ(B) − ν(B)) ≤ µ(A) − ν(A) ≤ µ(B) − ν(B),
or |µ(A) − ν(A)| ≤ µ(B) − ν(B). Thus, |µ(A) − ν(A)| is bounded by µ(B) − ν(B), and we can let A = B to attain this bound, since |µ(B) − ν(B)| = µ(B) − ν(B) from the fact that µ(x) − ν(x) ≥ 0 ∀x ∈ B. Thus,

||µ − ν||TV = max_{A⊂Ω} |µ(A) − ν(A)|
            = µ(B) − ν(B)
            = (1/2)[(µ(B) − ν(B)) + (ν(B^c) − µ(B^c))]
            = (1/2)(Σ_{x∈B} |µ(x) − ν(x)| + Σ_{x∈B^c} |µ(x) − ν(x)|)
            = (1/2) Σ_{x∈Ω} |µ(x) − ν(x)|. □

Remark 4.8. This proof also gives us ||µ − ν||T V = µ(B) − ν(B). Since µ(B) ≤ 1
and ν(B) ≥ 0, this tells us ||µ − ν||T V ≤ 1, for any distributions µ and ν.
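As an illustration (my own sketch), the two characterizations of total variation agree numerically; the brute-force max over subsets is feasible when Ω is small:

```python
import numpy as np
from itertools import chain, combinations

def tv_half_l1(mu, nu):
    """Total variation via Proposition 4.7: half the L1 distance."""
    return 0.5 * np.abs(mu - nu).sum()

def tv_max_over_subsets(mu, nu):
    """Total variation via Definition 4.6: max over all subsets A of Omega."""
    idx = range(len(mu))
    subsets = chain.from_iterable(combinations(idx, k) for k in range(len(mu) + 1))
    return max(abs(mu[list(A)].sum() - nu[list(A)].sum()) for A in subsets)

mu = np.array([0.5, 0.3, 0.2])
nu = np.array([0.2, 0.4, 0.4])
print(tv_half_l1(mu, nu), tv_max_over_subsets(mu, nu))   # both 0.3
```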

Now we are finally ready to prove the main result of this paper, which tells us that an irreducible, aperiodic Markov chain will converge to its stationary distribution at an exponential rate over time.

Theorem 4.9 (Convergence Theorem). If P is irreducible and aperiodic, with stationary distribution π, then there exist constants 0 < α < 1 and C > 0 such that

max_{x∈Ω} ||P^t(x, ·) − π||TV ≤ Cα^t.

Proof. Since P is irreducible and aperiodic, Lemma 4.5 tells us there exists an r ≥ 0 satisfying P^r(x, y) > 0 ∀x, y ∈ Ω.

Let Π be the |Ω| × |Ω| matrix all of whose rows are the stationary distribution π, making Π a stochastic matrix.

Since P^r(x, y) > 0 ∀x, y ∈ Ω and π(y) > 0 ∀y ∈ Ω, by Theorem 3.3, then

δ′ = min_{x,y∈Ω} P^r(x, y)/π(y) > 0

satisfies P^r(x, y) ≥ δ′π(y) ∀x, y ∈ Ω. So δ = min{δ′, 1/2} also satisfies this property, as well as 0 < δ < 1, so setting θ = 1 − δ, we get 0 < θ < 1. Now define

Q = (P^r − (1 − θ)Π)/θ.

Every element of Q is non-negative, since P^r(x, y) − (1 − θ)Π(x, y) = P^r(x, y) − δπ(y) ≥ 0, by the definition of δ, and, because P^r and Π are stochastic, each row of Q sums to (1 − (1 − θ))/θ = 1, making Q a stochastic matrix.

For any n ≥ 0, Q^n is stochastic, from Q being stochastic, so we get Q^nΠ = Π, since

(Q^nΠ)(x, y) = Σ_{z∈Ω} Q^n(x, z)Π(z, y) = π(y) Σ_{z∈Ω} Q^n(x, z) = π(y) = Π(x, y).

And since πP = π, every row of ΠP equals πP = π, so ΠP = Π, and thus ΠP^n = (ΠP)P^{n−1} = ΠP^{n−1} = · · · = ΠP = Π.

Using these identities, we will now prove inductively that

P^{rk} = (1 − θ^k)Π + θ^kQ^k ∀k ≥ 1.

The base case k = 1 is true, since P^r = (1 − θ)Π + θQ from the definition of Q. Now assume it is true for k = n, so P^{rn} = (1 − θ^n)Π + θ^nQ^n. Then

P^{r(n+1)} = P^{rn}P^r
           = [(1 − θ^n)Π + θ^nQ^n]P^r
           = (1 − θ^n)ΠP^r + θ^nQ^nP^r
           = (1 − θ^n)Π + θ^nQ^n[(1 − θ)Π + θQ]
           = (1 − θ^n)Π + θ^n(1 − θ)Q^nΠ + θ^{n+1}Q^{n+1}
           = (1 − θ^n)Π + (θ^n − θ^{n+1})Π + θ^{n+1}Q^{n+1}
           = (1 − θ^{n+1})Π + θ^{n+1}Q^{n+1},

proving inductively that P^{rk} = (1 − θ^k)Π + θ^kQ^k ∀k ≥ 1. Multiplying both sides by P^j for j ≥ 0,

P^{rk+j} = (1 − θ^k)ΠP^j + θ^kQ^kP^j = (1 − θ^k)Π + θ^kQ^kP^j = Π + θ^k(Q^kP^j − Π),

or P^{rk+j} − Π = θ^k(Q^kP^j − Π). Looking at any row x of both sides of this equation, we have (P^{rk+j} − Π)(x, ·) = θ^k(Q^kP^j − Π)(x, ·). Thus,

||P^{rk+j}(x, ·) − π||TV = ||P^{rk+j}(x, ·) − Π(x, ·)||TV
                        = (1/2) Σ_{y∈Ω} |(P^{rk+j} − Π)(x, y)|
                        = (1/2) θ^k Σ_{y∈Ω} |(Q^kP^j − Π)(x, y)|
                        = θ^k ||Q^kP^j(x, ·) − Π(x, ·)||TV
                        ≤ θ^k,

since Q, P, and Π being stochastic implies Q^kP^j(x, ·) and Π(x, ·) are distributions, thus their total variation distance is bounded by 1 from Remark 4.8.
Now we can define the constants α = θ^{1/r} and C = α^{−r}. Let t ≥ 0, and define k and j by division of t by r, so t = rk + j with 0 ≤ j < r. Finally, since 0 < θ < 1 and 0 < α < 1,

||P^t(x, ·) − π||TV = ||P^{rk+j}(x, ·) − π||TV
                    ≤ θ^k
                    = α^{rk}
                    ≤ α^{rk}α^{j−r}
                    = α^{−r}α^{rk+j}
                    = Cα^t.

This is all true for any choice of x ∈ Ω; therefore, for any t ≥ 0,

max_{x∈Ω} ||P^t(x, ·) − π||TV ≤ Cα^t. □
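To see the theorem in action, here is a final sketch of mine that tracks the worst-case total variation distance between P^t(x, ·) and π for the Figure 1 chain; the distances shrink by a roughly constant factor per step, as the bound Cα^t predicts.

```python
import numpy as np
from scipy.linalg import null_space

P = np.array([[0.4, 0.0, 0.6],
              [0.5, 0.3, 0.2],
              [0.0, 1.0, 0.0]])

# Stationary distribution, as in Section 2, from the kernel of P^T - I.
v = null_space(P.T - np.eye(3))[:, 0]
pi = v / v.sum()

Pt = np.eye(3)
for t in range(1, 13):
    Pt = Pt @ P
    # Worst-case total variation over starting states, via Proposition 4.7.
    d = max(0.5 * np.abs(Pt[x] - pi).sum() for x in range(3))
    print(t, d)   # decays geometrically, consistent with the bound C*alpha^t
```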



Acknowledgments. I am deeply grateful to my mentor, Nat Mayer, for devoting his time to learning this subject with me and helping me through the understanding and writing of this paper. I must also thank László Babai for teaching a wonderful and instructive course which gave me much insight into the tools needed for this paper, as well as Peter May for reviewing it and organizing the University of Chicago REU.
