Vigoda - INTRODUCTION TO MCMC AND PAGERANK

The document provides an introduction to Markov chains and their properties, including ergodicity, stationary distributions, and mixing times. It also discusses the transition matrix and probabilities associated with one-step and multi-step transitions, along with examples related to a CS 6210 scenario. The lecture concludes with a discussion on limiting distributions as the number of steps increases.

INTRODUCTION TO MCMC AND PAGERANK

Eric Vigoda
Georgia Tech

Lecture for CS 6505


1 MARKOV CHAIN BASICS

2 ERGODICITY

3 WHAT IS THE STATIONARY DISTRIBUTION?

4 PAGERANK

5 MIXING TIME

6 PREVIEW OF FURTHER TOPICS


What is a Markov chain?

Example: Life in CS 6210, discrete time t = 0, 1, 2, . . . :

[Figure: four-state transition diagram on states Listen to Kishore, Check Email, StarCraft, Sleep, with edge probabilities Listen→Listen .5, Listen→Email .5; Email→Listen .2, Email→StarCraft .5, Email→Sleep .3; StarCraft→Email .3, StarCraft→StarCraft .7; Sleep→Listen .7, Sleep→Sleep .3.]

Each vertex is a state of the Markov chain.
It is a directed graph, possibly with self-loops.
Edge weights represent transition probabilities, so they are
non-negative and the weights of the outgoing edges of each vertex sum to 1.
Transition matrix

In general: N states Ω = {1, 2, . . . , N}.

N × N transition matrix P where:
P(i, j) = weight of edge i → j = Pr(going from i to j)

For the earlier example, with states 1=Listen, 2=Email, 3=StarCraft, 4=Sleep:

        [ .5  .5   0   0 ]
    P = [ .2   0  .5  .3 ]
        [  0  .3  .7   0 ]
        [ .7   0   0  .3 ]

P is a stochastic matrix: each row sums to 1.
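The stochasticity check can be sketched in code (Python here; the state ordering and the helper `is_stochastic` are illustrative, not from the lecture):

```python
# Transition matrix of the CS 6210 chain.
# States: 0 = Listen, 1 = Email, 2 = StarCraft, 3 = Sleep.
P = [
    [0.5, 0.5, 0.0, 0.0],
    [0.2, 0.0, 0.5, 0.3],
    [0.0, 0.3, 0.7, 0.0],
    [0.7, 0.0, 0.0, 0.3],
]

def is_stochastic(M, tol=1e-12):
    """True iff every entry is non-negative and every row sums to 1."""
    return all(
        all(p >= 0 for p in row) and abs(sum(row) - 1.0) < tol
        for row in M
    )

print(is_stochastic(P))  # True
```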


One-step transitions

Time: t = 0, 1, 2, . . . .
Let Xt denote the state at time t.
Xt is a random variable.
For states k and j, Pr(X1 = j | X0 = k) = P(k, j).
In general, for t ≥ 1, given:
in state k0 at time 0, in k1 at time 1, . . . , in kt−1 at time t − 1,
what is the probability of being in state j at time t?

Pr(Xt = j | X0 = k0, X1 = k1, . . . , Xt−1 = kt−1)
    = Pr(Xt = j | Xt−1 = kt−1)
    = P(kt−1, j).

The process is memoryless: only the current state matters; previous states do not.
This is known as the Markov property, hence the term Markov chain.
2-step transitions

What is the probability of Listen at time 2 given Email at time 0?

Try all possibilities for the state at time 1.

Pr(X2 = Listen | X0 = Email)
    = Pr(X2 = Listen | X1 = Listen) × Pr(X1 = Listen | X0 = Email)
    + Pr(X2 = Listen | X1 = Email) × Pr(X1 = Email | X0 = Email)
    + Pr(X2 = Listen | X1 = StarCraft) × Pr(X1 = StarCraft | X0 = Email)
    + Pr(X2 = Listen | X1 = Sleep) × Pr(X1 = Sleep | X0 = Email)
    = (.5)(.2) + 0 + 0 + (.7)(.3) = .31

        [ .5  .5   0   0 ]           [ .35  .25  .25  .15 ]
    P = [ .2   0  .5  .3 ]    P^2 =  [ .31  .25  .35  .09 ]
        [  0  .3  .7   0 ]           [ .06  .21  .64  .09 ]
        [ .7   0   0  .3 ]           [ .56  .35   0   .09 ]

States: 1=Listen, 2=Email, 3=StarCraft, 4=Sleep.
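The same arithmetic can be checked by squaring P directly (a small sketch; `matmul` is a hypothetical helper, not from the lecture):

```python
# Verify Pr(X2 = Listen | X0 = Email) = .31 by computing P^2.
# States: 0 = Listen, 1 = Email, 2 = StarCraft, 3 = Sleep.
P = [
    [0.5, 0.5, 0.0, 0.0],
    [0.2, 0.0, 0.5, 0.3],
    [0.0, 0.3, 0.7, 0.0],
    [0.7, 0.0, 0.0, 0.3],
]

def matmul(A, B):
    """Product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

P2 = matmul(P, P)
EMAIL, LISTEN = 1, 0
print(round(P2[EMAIL][LISTEN], 6))  # 0.31
```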
k-step transitions

2-step transition probabilities: use P^2.

In general, for states i and j:

Pr(Xt+2 = j | Xt = i)
    = Σ_{k=1}^{N} Pr(Xt+2 = j | Xt+1 = k) × Pr(Xt+1 = k | Xt = i)
    = Σ_k P(k, j)P(i, k) = Σ_k P(i, k)P(k, j) = P^2(i, j)

ℓ-step transition probabilities: use P^ℓ.

For states i and j and integer ℓ ≥ 1,

    Pr(Xt+ℓ = j | Xt = i) = P^ℓ(i, j).


Random Initial State

Suppose the state at time 0 is not fixed,
but is chosen from a probability distribution µ0.
Notation: X0 ∼ µ0.
What is the distribution of X1?
For state j,

Pr(X1 = j) = Σ_{i=1}^{N} Pr(X0 = i) × Pr(X1 = j | X0 = i)
           = Σ_i µ0(i)P(i, j) = (µ0 P)(j)

So X1 ∼ µ1 where µ1 = µ0 P.
And Xt ∼ µt where µt = µ0 P^t.
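One step of this update (µ1 = µ0 P) can be sketched as follows (the `step` helper is illustrative, not from the lecture):

```python
# One step of the distribution update: if X0 ~ mu0 then X1 ~ mu0 P.
P = [
    [0.5, 0.5, 0.0, 0.0],
    [0.2, 0.0, 0.5, 0.3],
    [0.0, 0.3, 0.7, 0.0],
    [0.7, 0.0, 0.0, 0.3],
]

def step(mu, P):
    """Return the row vector mu P."""
    n = len(P)
    return [sum(mu[i] * P[i][j] for i in range(n)) for j in range(n)]

mu0 = [1.0, 0.0, 0.0, 0.0]   # start in state 1 (Listen) with certainty
mu1 = step(mu0, P)
print(mu1)  # [0.5, 0.5, 0.0, 0.0] -- exactly row 1 of P
```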
Back to CS 6210 example: big t?

Let's look again at our CS 6210 example:

        [ .5  .5   0   0 ]           [ .35  .25  .25  .15 ]
    P = [ .2   0  .5  .3 ]    P^2 =  [ .31  .25  .35  .09 ]
        [  0  .3  .7   0 ]           [ .06  .21  .64  .09 ]
        [ .7   0   0  .3 ]           [ .56  .35   0   .09 ]

           [ .247770  .244781  .402267  .105181 ]
    P^10 = [ .245167  .244349  .405688  .104796 ]
           [ .239532  .243413  .413093  .103963 ]
           [ .251635  .245423  .397189  .105754 ]

           [ .244190  .244187  .406971  .104652 ]
    P^20 = [ .244187  .244186  .406975  .104651 ]
           [ .244181  .244185  .406984  .104650 ]
           [ .244195  .244188  .406966  .104652 ]

The rows are converging to

    π = [ .244186, .244186, .406977, .104651 ].
Limiting Distribution

For big t,

          [ .244186  .244186  .406977  .104651 ]
    P^t ≈ [ .244186  .244186  .406977  .104651 ]
          [ .244186  .244186  .406977  .104651 ]
          [ .244186  .244186  .406977  .104651 ]

Regardless of the starting state X0, for big t:

    Pr(Xt = 1) = .244186
    Pr(Xt = 2) = .244186
    Pr(Xt = 3) = .406977
    Pr(Xt = 4) = .104651

Let π = [ .244186, .244186, .406977, .104651 ].
In other words, for big t, Xt ∼ π.
π is called a stationary distribution.
Limiting Distribution

Let π = [ .244186, .244186, .406977, .104651].


π is called a stationary distribution.
Once we reach π we stay in π: if Xt ∼ π then Xt+1 ∼ π,
in other words, πP = π.
Any distribution π where πP = π is called a stationary distribution
of the Markov chain.
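Stationarity of this π can be checked directly (a sketch; the digits of π are the rounded values from the slides, so a small tolerance is needed):

```python
# Check pi P = pi for the CS 6210 chain (pi is rounded to 6 digits).
P = [
    [0.5, 0.5, 0.0, 0.0],
    [0.2, 0.0, 0.5, 0.3],
    [0.0, 0.3, 0.7, 0.0],
    [0.7, 0.0, 0.0, 0.3],
]
pi = [0.244186, 0.244186, 0.406977, 0.104651]

piP = [sum(pi[i] * P[i][j] for i in range(4)) for j in range(4)]
stationary = all(abs(a - b) < 1e-5 for a, b in zip(piP, pi))
print(stationary)  # True
```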
Stationary Distributions

Key questions:
When is there a stationary distribution?
If there is at least one, is it unique, or can there be more than one?
Assuming there is a unique stationary distribution:
Do we always reach it?
What is it?
Mixing time = time to reach the unique stationary distribution.
Algorithmic goal:
If we have a distribution π that we want to sample from, can
we design a Markov chain that has:
a unique stationary distribution π,
from every X0 we always reach π,
a fast mixing time?
1 MARKOV CHAIN BASICS

2 ERGODICITY

3 WHAT IS THE STATIONARY DISTRIBUTION?

4 PAGERANK

5 MIXING TIME

6 PREVIEW OF FURTHER TOPICS


Irreducibility

We want a unique stationary distribution π, and we want to
get to it from every starting state X0.
But if there are multiple strongly connected components (SCCs), then we
can't go from one to the other:

[Figure: a directed graph on states 1–6 with two strongly connected components; states 1 and 5 lie in different components.]

Starting at 1 gets to a different distribution than starting at 5.

State i communicates with state j if, starting at i, we can reach j:

    there exists t with P^t(i, j) > 0.

A Markov chain is irreducible if all pairs of states communicate.

Periodicity

Example of a bipartite Markov chain:

[Figure: a bipartite Markov chain on four states; every transition crosses between the two sides of the bipartition.]

Starting at 1 gets to a different distribution than starting at 3.

So we also need to rule out periodicity.
Aperiodic

[Figure: a four-state chain in which the cycles through state 1 have lengths 3 and 5.]

Return times for state i are the times Ri = {t : P^t(i, i) > 0}.

In the example above: R1 = {3, 5, 6, 8, 9, . . . }.
Let r = gcd(Ri) be the period of state i.
If P is irreducible then all states have the same period.
If r = 2 then the Markov chain is bipartite.
A Markov chain is aperiodic if r = 1.
Ergodic: Unique Stationary Distribution

Ergodic = irreducible and aperiodic.

Fundamental Theorem for Markov Chains:
An ergodic Markov chain has a unique stationary distribution π,
and for every initial distribution X0 ∼ µ0:

    lim_{t→∞} µt = π.

In other words, for big enough t, all rows of P^t are (approximately) π.

How big does t need to be?

What is π?
1 MARKOV CHAIN BASICS

2 ERGODICITY

3 WHAT IS THE STATIONARY DISTRIBUTION?

4 PAGERANK

5 MIXING TIME

6 PREVIEW OF FURTHER TOPICS


Determining π: Symmetric Markov Chain

P is symmetric if for all pairs i, j: P(i, j) = P(j, i).

Then π is uniformly distributed over all of the states {1, . . . , N}:

    π(j) = 1/N for all states j.

Proof: We'll verify that πP = π for this π.
We need to check that for all states j: (πP)(j) = π(j).

    (πP)(j) = Σ_{i=1}^{N} π(i)P(i, j)
            = (1/N) Σ_{i=1}^{N} P(i, j)
            = (1/N) Σ_{i=1}^{N} P(j, i)    since P is symmetric
            = 1/N                          since the rows of P sum to 1
            = π(j)
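A quick numerical check of this fact, on a hypothetical 3-state symmetric chain (the matrix is illustrative, not from the lecture):

```python
# For a symmetric stochastic matrix, the uniform distribution is stationary.
P = [
    [0.5, 0.3, 0.2],
    [0.3, 0.4, 0.3],
    [0.2, 0.3, 0.5],
]
N = len(P)
pi = [1.0 / N] * N   # uniform distribution

piP = [sum(pi[i] * P[i][j] for i in range(N)) for j in range(N)]
print(all(abs(x - 1.0 / N) < 1e-12 for x in piP))  # True
```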
Determining π: Reversible Markov Chain

P is reversible with respect to π if for all pairs i, j:

    π(i)P(i, j) = π(j)P(j, i).

If we can find such a π, then it is the stationary distribution.

Proof: Similar to the symmetric case.
We need to check that for all states j: (πP)(j) = π(j).

    (πP)(j) = Σ_{i=1}^{N} π(i)P(i, j)
            = Σ_{i=1}^{N} π(j)P(j, i)    by reversibility
            = π(j) Σ_{i=1}^{N} P(j, i)
            = π(j)
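For a reversible chain, detailed balance pins down π along any path of states. A sketch on a hypothetical 3-state birth-death chain (the matrix and the path trick are illustrative assumptions, not from the lecture):

```python
# Detailed balance pi(i)P(i,j) = pi(j)P(j,i) implies pi is stationary.
# Hypothetical 3-state birth-death chain:
P = [
    [0.7, 0.3, 0.0],
    [0.1, 0.6, 0.3],
    [0.0, 0.2, 0.8],
]
# Solve detailed balance along the path 0-1-2:
w = [1.0]
w.append(w[0] * P[0][1] / P[1][0])   # pi(1)/pi(0) = P(0,1)/P(1,0)
w.append(w[1] * P[1][2] / P[2][1])   # pi(2)/pi(1) = P(1,2)/P(2,1)
Z = sum(w)
pi = [x / Z for x in w]              # normalize to a distribution

# Verify stationarity: pi P = pi.
piP = [sum(pi[i] * P[i][j] for i in range(3)) for j in range(3)]
print(all(abs(a - b) < 1e-12 for a, b in zip(piP, pi)))  # True
```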
Some Examples

Random walk on a d-regular, connected, undirected graph G:

What is π?
Symmetric: for each edge (i, j), P(i, j) = P(j, i) = 1/d.
So π is uniform: π(i) = 1/n.

Random walk on a general connected undirected graph G:
What is π?
Consider π(i) = d(i)/Z, where d(i) = degree of vertex i and
Z = Σ_{j∈V} d(j). (Note, Z = 2m = 2|E|.)
Check that it's reversible: π(i)P(i, j) = (d(i)/Z) × (1/d(i)) = 1/Z = π(j)P(j, i).

What if G is a directed graph?

Then it may not be reversible, and if it's not reversible
then usually we can't figure out the stationary distribution,
since typically N is HUGE.
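The undirected case can be checked exactly with rational arithmetic on a small hypothetical graph (the adjacency list is illustrative, not from the lecture):

```python
# Random walk on an undirected graph: pi(i) = d(i) / (2|E|).
from fractions import Fraction

# Hypothetical 4-vertex graph given as an adjacency list.
adj = {0: [1], 1: [0, 2, 3], 2: [1, 3], 3: [1, 2]}
deg = {v: len(nbrs) for v, nbrs in adj.items()}
two_m = sum(deg.values())                       # 2|E|
pi = {v: Fraction(deg[v], two_m) for v in adj}

# Detailed balance on every edge: pi(i) * (1/d(i)) = pi(j) * (1/d(j)) = 1/(2|E|).
ok = all(pi[i] * Fraction(1, deg[i]) == pi[j] * Fraction(1, deg[j])
         for i in adj for j in adj[i])
print(ok, pi[1])  # True 3/8
```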
1 MARKOV CHAIN BASICS

2 ERGODICITY

3 WHAT IS THE STATIONARY DISTRIBUTION?

4 PAGERANK

5 MIXING TIME

6 PREVIEW OF FURTHER TOPICS


PageRank

PageRank is an algorithm devised by Brin and Page (1998) to
determine the "importance" of webpages.

Webgraph:
V = webpages
E = directed edges for hyperlinks

Notation: for page x ∈ V, let:

    Out(x) = {y : x → y ∈ E} = pages that x links to

    In(x) = {w : w → x ∈ E} = pages that link to x

Let π(x) = "rank" of page x.

We are trying to define π(x) in a sensible way.
First Ranking Idea

First idea for ranking pages: as with academic papers,
use citation counts.
Here, a citation = a link to a page.
So set π(x) = |In(x)| = number of links to x.
Refining the Ranking Idea

What if:
a webpage has 500 links and one is to Eric's page;
another webpage has only 5 links and one is to Santosh's page.
Which link is more valuable?
Academic papers: if a paper cites 50 other papers, then each
reference gets 1/50 of a citation.
Webpages: if a page y has |Out(y)| outgoing links, then
each linked page gets 1/|Out(y)|.
New solution:

    π(x) = Σ_{y∈In(x)} 1/|Out(y)|.
Further Refining the Ranking Idea

Previous:

    π(x) = Σ_{y∈In(x)} 1/|Out(y)|.

But if Eric's children's webpage has a link to Eric's page and
CNN has a link to Santosh's page, which is more important?
Solution: define π(x) recursively.
Page y has importance π(y).
A link from y gets π(y)/|Out(y)| of a citation.

    π(x) = Σ_{y∈In(x)} π(y)/|Out(y)|.
Random Walk

Importance of page x:

    π(x) = Σ_{y∈In(x)} π(y)/|Out(y)|.

This is a recursive definition of π; how do we find it?

Look at the random walk on the webgraph G = (V, E).
From a page y ∈ V, choose a random link and follow it.
This is a Markov chain. For y → x ∈ E:

    P(y, x) = 1/|Out(y)|

What is the stationary distribution of this Markov chain?
Random Walk

Random walk on the webgraph G = (V, E).

For y → x ∈ E:

    P(y, x) = 1/|Out(y)|

What is the stationary distribution of this Markov chain?

We need to find π where π = πP. Thus,

    π(x) = Σ_{y∈V} π(y)P(y, x) = Σ_{y∈In(x)} π(y)/|Out(y)|.

This is identical to the definition of the importance vector π.

Summary: the stationary distribution of the random walk on the
webgraph gives the importance π(x) of a page x.
Random Walk on the Webgraph

Random walk on the webgraph G = (V, E).

Is π the only stationary distribution?
In other words, is the Markov chain ergodic?
We need G to be strongly connected – it probably is not.
And some pages have no outgoing links...
then hit the "random" button!
Solution to make it ergodic:
Introduce a "damping factor" α, where 0 < α < 1.
(In practice, reportedly α ≈ .85.)
From page y:
with probability α, follow a random outgoing link from page y;
with probability 1 − α, go to a completely random page
(uniformly chosen from all pages V).
Random Surfer

Let N = |V| denote the number of webpages.

Transition matrix of the new Random Surfer chain:

    P(y, x) = (1 − α)/N                    if y → x ∉ E
    P(y, x) = (1 − α)/N + α/|Out(y)|       if y → x ∈ E

This new Random Surfer Markov chain is ergodic.

Thus, its unique stationary distribution is the desired π.
How to find π?
Take last week's π, and compute πP^t for big t.
What's a big enough t?
1 MARKOV CHAIN BASICS

2 ERGODICITY

3 WHAT IS THE STATIONARY DISTRIBUTION?

4 PAGERANK

5 MIXING TIME

6 PREVIEW OF FURTHER TOPICS


Mixing Time

How fast does an ergodic MC reach its unique stationary π?

We need to measure distance from π; use total variation distance.
For distributions µ and ν on a set Ω:

    dTV(µ, ν) = (1/2) Σ_{x∈Ω} |µ(x) − ν(x)|.

Example: Ω = {1, 2, 3, 4}.

µ is uniform: µ(1) = µ(2) = µ(3) = µ(4) = .25.
And ν has: ν(1) = .5, ν(2) = .1, ν(3) = .15, ν(4) = .25.

    dTV(µ, ν) = (1/2)(.25 + .15 + .1 + 0) = .25
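The distance in the example can be sketched as:

```python
# Total variation distance between two distributions on the same finite set.
def d_tv(mu, nu):
    return 0.5 * sum(abs(m - n) for m, n in zip(mu, nu))

mu = [0.25, 0.25, 0.25, 0.25]          # uniform
nu = [0.50, 0.10, 0.15, 0.25]
print(round(d_tv(mu, nu), 6))  # 0.25
```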
Mixing Time

Consider an ergodic MC with state space Ω, transition matrix P, and
unique stationary distribution π.
For state x ∈ Ω, the time to mix from x:

    T(x) = min{t : dTV(P^t(x, ·), π) ≤ 1/4}.

Then, the mixing time is Tmix = max_x T(x).

Summarizing in words:
the mixing time is the time to get within distance ≤ 1/4 of π from
the worst initial state X0.
The choice of the constant 1/4 is somewhat arbitrary:
we can get within distance ≤ ε in time O(Tmix log(1/ε)).
Mixing Time of Random Surfer

Coupling proof:
Consider 2 copies of the Random Surfer chain, (Xt) and (Yt).
Choose Y0 from π. Thus, Yt ∼ π for all t.
And X0 is arbitrary.
If Xt−1 = Yt−1 then they choose the same transition at time t.
If Xt−1 ≠ Yt−1 then, with probability 1 − α, both chains jump to the same
uniformly random page z.
Therefore,

    Pr(Xt ≠ Yt) ≤ α^t.

Setting t ≥ −2/log₂(α), we have α^t ≤ 1/4, and hence Pr(Xt ≠ Yt) ≤ 1/4.

Therefore, the mixing time satisfies:

    Tmix ≤ −2/log₂(α) ≈ 8.5 for α = .85.
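The bound can be evaluated directly (using log base 2, which is what makes the α^t ≤ 1/4 algebra come out to −2/log₂ α):

```python
import math

alpha = 0.85
# alpha^t <= 1/4  <=>  t * log2(alpha) <= -2  <=>  t >= -2 / log2(alpha)
t_bound = -2 / math.log2(alpha)
print(round(t_bound, 2))  # 8.53
assert alpha ** math.ceil(t_bound) <= 0.25   # 9 steps suffice
```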
