Modern Discrete Probability
An Essential Toolkit
Sébastien Roch
Preface
Notation
1 Introduction
1.1 Background
1.1.1 Review of graph theory
1.1.2 Review of Markov chain theory
1.2 Some discrete probability models
2.4.5 . Data science: Johnson-Lindenstrauss lemma and application to compressed sensing
2.4.6 . Data science: classification, empirical risk minimization and VC dimension
4 Coupling
4.1 Background
4.1.1 Basic definitions
4.1.2 . Random walks: harmonic functions on lattices and infinite d-regular trees
4.1.3 Total variation distance and coupling inequality
4.1.4 . Random graphs: degree sequence in Erdős-Rényi model
4.2 Stochastic domination
4.2.1 Definitions
4.2.2 . Ising model: boundary conditions
4.2.3 Correlation inequalities: FKG and Holley’s inequalities
4.2.4 . Random graphs: Janson’s inequality and application to the clique number in the Erdős-Rényi model
4.2.5 . Percolation: RSW theory and a proof of Harris’ theorem
4.3 Coupling of Markov chains and application to mixing
4.3.1 Bounding the mixing time via coupling
4.3.2 . Random walks: mixing on cycles, hypercubes, and trees
4.3.3 Path coupling
4.3.4 . Ising model: Glauber dynamics at high temperature
4.4 Chen-Stein method
4.4.1 Main bounds and examples
4.4.2 Some motivation and proof
4.4.3 . Random graphs: clique number at the threshold in the Erdős-Rényi model
6.1.2 Extinction
6.1.3 . Percolation: Galton-Watson trees
6.1.4 Multitype branching processes
6.2 Random-walk representation
6.2.1 Exploration process
6.2.2 Duality principle
6.2.3 Hitting-time theorem
6.2.4 . Percolation: critical exponents on the infinite b-ary tree
6.3 Applications
6.3.1 . Probabilistic analysis of algorithms: binary search tree
6.3.2 . Data science: the reconstruction problem, the Kesten-Stigum bound and a phase transition in phylogenetics
6.4 . Finale: the phase transition of the Erdős-Rényi model
6.4.1 Statement and proof sketch
6.4.2 Bounding cluster size: domination by branching processes
6.4.3 Concentration of cluster size: second moment bounds
6.4.4 Critical case via martingales
6.4.5 . Encore: random walk on the Erdős-Rényi graph
Preface
This book arose from a set of lecture notes prepared for a one-semester topics
course I taught at the University of Wisconsin–Madison in 2014, 2017, 2020 and
2023 which attracted a wide spectrum of students in mathematics, computer sci-
ences, engineering, and statistics.
What is it about?
The purpose of the book is to provide a graduate-level introduction to discrete prob-
ability. Topics covered are drawn primarily from stochastic processes on graphs:
percolation, random graphs, Markov random fields, random walks on graphs, etc.
No attempt is made at covering these broad areas in depth. Rather, the emphasis
is on illustrating important techniques used to analyze such processes. Along the
way, many standard results regarding discrete probability models are worked out.
The “modern” in the title refers to the (non-exclusive) focus on nonasymptotic
methods and results, reflecting the impact of the theoretical computer science liter-
ature on the trajectory of this field. In particular several applications in randomized
algorithms, probabilistic analysis of algorithms and theoretical machine learning
are used throughout to motivate the techniques described (although, again, these
areas are not covered exhaustively).
Of course the selection of topics is somewhat arbitrary and driven in part by
personal interests. But the choice was guided by a desire to introduce techniques
that are widely used across discrete probability and its applications. The material
discussed here is developed in much greater depth in the following (incomplete
list of) excellent textbooks and expository monographs, many of which influenced
various sections of this book:
- Agarwal, Jiang, Kakade, Sun. Reinforcement learning: Theory and algorithms.
[AJKS22]
- Aldous, Fill. Reversible Markov chains and random walks on graphs. [AF]
- Alon, Spencer. The Probabilistic Method. [AS11]
- Bollobás. Random graphs. [Bol01]
- Boucheron, Lugosi, Massart. Concentration Inequalities: A Nonasymptotic Theory
of Independence. [BLM13]
- Chung, Lu. Complex graphs and networks. [CL06]
- Durrett. Random Graph Dynamics. [Dur06]
- Frieze, Karoński. Introduction to random graphs. [FK16]
- Grimmett. Percolation. [Gri10b]
- Janson, Łuczak, Ruciński. Random Graphs. [JLR11]
- Lattimore, Szepesvári. Bandit Algorithms. [LS20]
- Levin, Peres, Wilmer. Markov chains and mixing times. [LPW06]
- Lyons, Peres. Probability on trees and networks. [LP16]
- Mitzenmacher, Upfal. Probability and Computing: Randomized Algorithms and
Probabilistic Analysis. [MU05]
- Motwani, Raghavan. Randomized algorithms. [MR95]
- Rassoul-Agha, Seppäläinen. A course on large deviations with an introduction to
Gibbs measures. [RAS15]
- Shalev-Shwartz, Ben-David. Understanding Machine Learning: From
Theory to Algorithms. [SSBD14]
- van Handel. Probability in high dimension. [vH16]
- van der Hofstad. Random graphs and complex networks. Vol. 1. [vdH17]
- Vershynin. High-Dimensional Probability: An Introduction with Applications in
Data Science. [Ver18]
In fact the book is meant as a first foray into the basic results and/or toolkits detailed
in these more specialized references. My hope is that, by the end, the reader will
have picked up sufficient fundamental background to learn advanced material on
their own with some ease. I should add that I used many additional helpful sources;
they are acknowledged in the “Bibliographic remarks” at the end of each chapter.
It is impossible to cover everything. Some notable omissions include, e.g., graph
limits [Lov12], influence [KS05], and group-theoretic methods [Dia88], among
others.
Much of the material covered here (and more) can also be found in [HMRAR98],
[Gri10a], and [Bre17] with a different emphasis and scope.
Prerequisites
It is assumed throughout that the reader is fluent in undergraduate linear algebra,
for example, at the level of [Axl15], and basic real analysis, for example, at the
level of [Mor05].
In addition, it is recommended that the reader has taken at least one semester of
graduate probability at the level of [Dur10]. I am also particularly fond of [Wil91],
which heavily influenced the appendix where measure-theoretic background is re-
viewed. Some familiarity with countable Markov chain theory is necessary, as
covered for instance in [Dur10, Chapter 6]. An advanced undergraduate or Mas-
ters’ level treatment such as [Dur12], [Nor98], [GS20], [Law06] or [Bre20] will
suffice however.
Organization
The book is organized around five major “tools.” The reader will have likely en-
countered those tools in prior probability courses. The goal here is to develop them
further, specifically with their application to discrete random structures in mind,
and to illustrate them in this setting on a variety of major, classical results and
applications.
In the interest of keeping the book relatively self-contained and serving the
widest spectrum of readers, each chapter begins with a “background” section which
reviews the basic material on which the rest of the chapter builds. The remaining
sections then proceed to expand on two or three important specializations of the
tools. While the chapters are meant to be somewhat modular, results from previous
chapters do occasionally make an appearance.
The techniques are illustrated throughout with simple examples first, and then
with more substantial ones in separate sections marked with the symbol . . I have
attempted to provide applications from many areas of discrete probability and the-
oretical computer science, although some techniques are better suited for certain
types of models or questions. The examples and applications are important: many
of the tools are quite straightforward (or even elementary), and it is only when seen
in action that their full power can be appreciated. Moreover, the . sections serve as
an excuse to introduce the reader to classical results and important applications—
beyond their reliance on specific tools.
Chapter 1 introduces some of the main models from probability on graphs that
we come back to repeatedly throughout the book. It begins with a brief review of
graph theory and Markov chain theory.
Chapter 2 starts out with the probabilistic method, including the first moment
principle and second moment method, and then it moves on to concentration in-
equalities for sums of independent random variables, mostly sub-Gaussian and
sub-exponential variables. It also discusses techniques to analyze the suprema of
random processes.
Chapter 3 turns to martingales. The first main topic there is the Azuma-Hoeffding
inequality and the method of bounded differences with applications to
random graphs and stochastic bandit problems. The second main topic is electrical
network theory for random walks on graphs.
Chapter 4 introduces coupling. It covers stochastic domination and correlation
inequalities as well as couplings of Markov chains with applications to mixing. It
also discusses the Chen-Stein method for Poisson approximation.
Chapter 5 is concerned with spectral methods. A major topic there is the use
of the spectral theorem and geometric bounds on the spectral gap to control the
mixing time of a reversible Markov chain. The chapter also introduces spectral
methods for community recovery in network analysis.
Chapter 6 ends the book with applications of branching processes. Among
other applications, an introduction to the reconstruction problem on trees is pro-
vided. The final section gives a detailed analysis of the phase transition of the
Erdős-Rényi graph, where techniques from all chapters of the book are brought to
bear.
Acknowledgments
The lecture notes on which this book is based were influenced by graduate courses
of David Aldous, Steve Evans, Elchanan Mossel, Yuval Peres, Alistair Sinclair, and
Luca Trevisan at UC Berkeley, where I learned much of this material. In particular
scribe notes for some of these courses helped shape early iterations of this book.
I have also learned a lot over the years from my collaborators and mentors as
well as my former and current Ph.D. students and postdocs. I am particularly grateful
to Elchanan Mossel and Allan Sly for their encouragement to finish this project and
to the UW-Madison students who have taken the various iterations of the course
that inspired the book for their invaluable feedback.
Warm thanks to everyone in the departments of mathematics at UCLA and UW-
Madison who have provided the stimulating environments that made this project
possible. Beyond my current department, I am particularly indebted to my col-
leagues in the NSF-funded Institute for Foundations of Data Science (IFDS) who
have significantly expanded my knowledge of applications of this material in ma-
chine learning and statistics.
Finally I thank my parents, my wife and my son for their love, patience and
support.
Notation
and
\[ a^+ = 0 \vee a, \qquad a^- = 0 \vee (-a). \]
• For a real a, ⌊a⌋ is the largest integer that is smaller than or equal to a and
⌈a⌉ is the smallest integer that is larger than or equal to a.
• For x ∈ R, the natural (i.e., base e) logarithm of x is denoted by log x. We
also let exp(x) = e^x.
• For a positive integer n ∈ N, we let
• For a vector u = (u_1, …, u_n) ∈ R^n and real p > 0, its p-norm (or ℓ^p-norm)
is
\[ \|u\|_p := \left( \sum_{i=1}^n |u_i|^p \right)^{1/p}. \]
When p = +∞, we have
We also use the notation ‖u‖_0 to denote the number of nonzero coordinates
of u (although it is not a norm; see Exercise 1.1). For two vectors u =
(u_1, …, u_n), v = (v_1, …, v_n) ∈ R^n, their inner product is
\[ \langle u, v \rangle := \sum_{i=1}^n u_i v_i. \]
• For a matrix A, we denote the entries of A by A(i, j), A_{i,j}, or A_{ij}. The i-th
row of A is denoted by A(i, ·) or A_{i,·}. The j-th column of A is denoted by
A(·, j) or A_{·,j}. The transpose of A is A^T.
• We use the abbreviation a.s. for “almost surely,” that is, with probability 1.
We use “w.p.” for “with probability.”
• For an event A, the random variable 1A is the indicator of A, that is, it is 1
if A occurs and 0 otherwise. We also use 1{A}.
Chapter 1
Introduction
1.1 Background
We start with a brief review of graph terminology and standard countable-space
Markov chain results.
Version: December 20, 2023
Modern Discrete Probability: An Essential Toolkit
Copyright © 2023 Sébastien Roch
CHAPTER 1. INTRODUCTION 2
E ⊆ {{u, v} : u, v ∈ V },
is the set of edges (or bonds). See Figure 1.1 for an example. We occasionally write
V (G) and E(G) for the vertices and edges of the graph G. The set of vertices V
is either finite or countably infinite. Edges of the form {u} are called self-loops. In
general, we do not allow E to be a multiset unless otherwise stated. But, when E
is a multiset, G is called a multigraph.
A vertex v ∈ V is incident with an edge e ∈ E (or vice versa) if v ∈ e. The
incident vertices of an edge are called endvertices. Two vertices u, v ∈ V are
adjacent (or neighbors), denoted by u ∼ v, if {u, v} ∈ E. The set of adjacent
vertices of v, denoted by N (v), is called the neighborhood of v and its size, that is,
δ(v) := |N (v)|, is the degree of v. A vertex v with δ(v) = 0 is called isolated. A
graph is called d-regular if all its degrees are d. A countable graph is locally finite
if all its vertices have a finite degree.
Example 1.1.1 (Petersen graph). All vertices in the Petersen graph in Figure 1.1
have degree 3, that is, it is a 3-regular graph. In particular it has no isolated vertex.
Example 1.1.2 (Triangle). The adjacency matrix of a triangle, that is, a 3-vertex
graph with all possible non-loop edges, is
\[ A = \begin{pmatrix} 0 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}. \]
There exist other matrix representations. Here is one. Let m = |E| and assume
that the edges are labeled arbitrarily as e_1, …, e_m. The incidence matrix of an
undirected graph G = (V, E) is the n × m matrix B such that B_{i,j} = 1 if vertex i
and edge e_j are incident and 0 otherwise.
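As a minimal sketch of the two matrix representations, the following builds the adjacency and incidence matrices of the triangle. The function names and the 0-based vertex labels are illustrative assumptions, not notation from the text.

```python
import numpy as np

def adjacency_matrix(n, edges):
    """Adjacency matrix of a simple undirected graph on vertices 0..n-1."""
    A = np.zeros((n, n), dtype=int)
    for u, v in edges:
        A[u, v] = 1
        A[v, u] = 1
    return A

def incidence_matrix(n, edges):
    """n x m matrix B with B[i, j] = 1 if vertex i is incident with edge j."""
    B = np.zeros((n, len(edges)), dtype=int)
    for j, (u, v) in enumerate(edges):
        B[u, j] = 1
        B[v, j] = 1
    return B

# Triangle: 3 vertices (labeled 0, 1, 2 here) with all non-loop edges.
triangle_edges = [(0, 1), (0, 2), (1, 2)]
A = adjacency_matrix(3, triangle_edges)
B = incidence_matrix(3, triangle_edges)
degrees = A.sum(axis=1)  # row sums of A give the degrees
```

Each column of B has exactly two nonzero entries (the endvertices of the edge), and one can check that B Bᵀ equals A plus the diagonal matrix of degrees.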
that is, it contains exactly those edges of G that are between vertices in V′. In
that case the notation G′ := G[V′] is used. A subgraph is said to be spanning if
V′ = V. A subgraph containing all possible non-loop edges between its vertices
is called a clique (or complete subgraph). A clique with k nodes is referred to as a
k-clique.
Example 1.1.3 (Petersen graph (continued)). The Petersen graph contains no
triangle, that is, 3-clique, induced or not.
distance between u and v, denoted by d_G(u, v). It can be checked that the graph
distance is a metric (and that, in particular, it satisfies the triangle inequality; see
Exercise 1.6). The minimum length of a cycle in a graph is its girth.
We write u ↔ v if there is a path between u and v. It can be checked that the
binary relation ↔ is an equivalence relation (i.e., it is reflexive, symmetric and
transitive; see Exercise 1.6). Its equivalence classes are called connected components.
A graph is connected if any two vertices are linked by a path, that is, if u ↔ v for
all u, v ∈ V. Or put differently, if there is only one connected component.
Example 1.1.4 (Petersen graph (continued)). The Petersen graph is connected.
Trees. A forest is a graph with no cycle, or acyclic graph. A tree is a connected
forest. Vertices of degree 1 are called leaves. A spanning tree of G is a subgraph
which is a tree and is also spanning. A tree is said to be rooted if it has a single
distinguished vertex called the root.
Trees will play a key role and we collect several important facts about them
(mostly without proofs). The following characterizations of trees will be useful.
The proof is left as an exercise (see Exercise 1.8). We write G + e (respectively
G − e) to indicate the graph G with edge e added (respectively removed).
Theorem 1.1.5 (Trees: characterizations). The following are equivalent.
(i) The graph T is a tree.
(ii) For any two vertices in T , there is a unique path between them.
(iv) The graph T is acyclic, but T + {x, y} is not for any pair of non-adjacent
vertices x, y.
Here are two important implications.
Corollary 1.1.6. If G is connected, then it has at least one spanning tree.
Proof. Indeed, from Theorem 1.1.5, a graph is a tree if and only if it is minimally
connected, in the sense that removing any of its edges disconnects it. So a spanning
tree can be obtained by removing edges of G that do not disconnect it until it is not
possible anymore.
Proof. If H is not connected, then it has at least two connected components. Each
of them is acyclic and therefore a tree. By applying Corollary 1.1.7 to the connected
components and summing up, we see that the total number of edges in H is ≤ n−2,
a contradiction. So H is connected and therefore a spanning tree.
Theorem 1.1.9 (Cayley’s formula). There are k^{k−2} trees on a set of k labeled
vertices.
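Cayley’s formula can be checked by brute force for small k. The sketch below relies on the standard fact that a connected graph on k vertices with exactly k − 1 edges is a tree, so it enumerates all edge sets of size k − 1 and counts the connected ones. The function names are illustrative, and the enumeration is only practical for very small k.

```python
from itertools import combinations

def is_connected(k, edges):
    """Depth-first search from vertex 0 on the graph ({0..k-1}, edges)."""
    adj = {v: [] for v in range(k)}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    seen = {0}
    stack = [0]
    while stack:
        u = stack.pop()
        for w in adj[u]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == k

def count_labeled_trees(k):
    """Count trees on the labeled vertex set {0, ..., k-1} by enumeration."""
    all_edges = list(combinations(range(k), 2))
    # A connected graph with k - 1 edges on k vertices is a tree.
    return sum(1 for es in combinations(all_edges, k - 1) if is_connected(k, es))

counts = {k: count_labeled_trees(k) for k in range(2, 7)}
```

For k = 2, …, 6 the counts agree with k^{k−2}.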
Some standard graphs. Here are a few more examples of finite graphs.
- Cycle graph C_n (or n-cycle): The vertex set is {0, 1, …, n − 1} and two
vertices i ≠ j are adjacent if and only if |i − j| = 1 or n − 1.
- Torus L^d_n: The vertex set is {0, 1, …, n − 1}^d and two vertices x ≠ y are
adjacent if and only if there is a coordinate i such that |x_i − y_i| = 1 or n − 1
and all other coordinates j ≠ i satisfy x_j = y_j.
- Hypercube Z^n_2 (or n-dimensional hypercube): The vertex set is {0, 1}^n and
two vertices x ≠ y are adjacent if and only if ‖x − y‖_1 = 1.
- Rooted b-ary tree T̂_b^ℓ: This graph is a tree with ℓ levels. The unique vertex on
level 0 is called the root. For j = 1, …, ℓ − 1, level j has b^j vertices, each
of which has exactly one neighbor on level j − 1 (its parent) and b neighbors
on level j + 1 (its children). The b^ℓ vertices on level ℓ are leaves.
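The adjacency rules in the list above translate directly into code. The following sketch builds the cycle, torus and hypercube as dictionaries mapping each vertex to its set of neighbors and then reads off the set of degrees. The function names and the dictionary representation are assumptions made for illustration.

```python
from itertools import product

def cycle_graph(n):
    """Cycle C_n: i ~ j iff |i - j| is 1 or n - 1."""
    return {i: {j for j in range(n) if j != i and abs(i - j) in (1, n - 1)}
            for i in range(n)}

def torus_graph(n, d):
    """Torus on {0..n-1}^d: x ~ y iff they differ in exactly one coordinate,
    and in that coordinate by 1 or n - 1."""
    verts = list(product(range(n), repeat=d))

    def adjacent(x, y):
        diff = [i for i in range(d) if x[i] != y[i]]
        return len(diff) == 1 and abs(x[diff[0]] - y[diff[0]]) in (1, n - 1)

    return {x: {y for y in verts if adjacent(x, y)} for x in verts}

def hypercube_graph(n):
    """Hypercube on {0,1}^n: x ~ y iff their l1 distance is 1."""
    verts = list(product((0, 1), repeat=n))
    return {x: {y for y in verts if sum(abs(a - b) for a, b in zip(x, y)) == 1}
            for x in verts}

def degree_set(graph):
    """Set of degrees appearing in the graph; a singleton {d} means d-regular."""
    return {len(nbrs) for nbrs in graph.values()}
```

For instance, `degree_set(cycle_graph(7))`, `degree_set(torus_graph(4, 2))` and `degree_set(hypercube_graph(3))` are singletons, reflecting that these graphs are regular.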
Here are a few examples of infinite graphs, that is, graphs with a countably infinite
number of vertices and edges.
- Infinite d-regular tree T_d: This is an infinite tree where each vertex has exactly
d neighbors. The rooted version, that is, T̂_b^ℓ with ℓ = +∞ levels, is
denoted by T̂_b.
- Lattice L^d: The vertex set is Z^d and two vertices x ≠ y are adjacent if and
only if ‖x − y‖_1 = 1.
A bipartite graph G = (L ∪ R, E) is a graph whose vertex set is composed of
the union of two disjoint sets L, R and whose edge set E is a subset of {{ℓ, r} :
ℓ ∈ L, r ∈ R}. That is, there is no edge between vertices in L, and likewise for R.
Example 1.1.10 (Some bipartite graphs). The cycle graph C_{2n} is a bipartite graph.
So is the complete bipartite graph K_{n,m} with vertex set {ℓ_1, …, ℓ_n} ∪ {r_1, …, r_m}
and edge set {{ℓ_i, r_j} : i ∈ [n], j ∈ [m]}.
In a bipartite graph G = (L ∪ R, E), a perfect matching is a collection of edges
M ⊆ E such that each vertex is incident with exactly one edge in M .
An automorphism of a graph G = (V, E) is a bijection φ of V to itself that preserves
the edges, that is, such that {x, y} ∈ E if and only if {φ(x), φ(y)} ∈ E. A
graph G = (V, E) is vertex-transitive if for any u, v ∈ V there is an automorphism
mapping u to v.
Example 1.1.11 (Petersen graph (continued)). For any ℓ ∈ Z, a (2πℓ/5)-rotation
of the planar representation of the Petersen graph in Figure 1.1 corresponds to an
automorphism.
Example 1.1.12 (Trees). The graph T_d is vertex-transitive. The graph T̂_b on the
other hand has many automorphisms, but is not vertex-transitive.
F3 (Flow-conservation constraint)
\[ \sum_{y : y \sim x} f(x, y) = 0, \qquad \forall x \in V \setminus (A \cup Z). \]
For U, W ⊆ V, let f(U, W) := Σ_{u∈U, w∈W} f(u, w). The strength of f is ‖f‖ :=
f(A, A^c).
Lemma 1.1.14 (Max flow ≤ min cut). For any flow f and cutset F,
\[ \|f\| = f(A_F, A_F^c) \le \sum_{\{x,y\} \in F} |f(x, y)| \le \kappa(F). \tag{1.1.1} \]
where we used (F1) twice. The last line is justified by the fact that the edges
between a vertex in A_F and a vertex in A_F^c have to be in F by definition of A_F.
That proves the first inequality in the claim. Condition (F2) implies the second
one.
Proof. Note that, by compactness, the supremum on the left-hand side is achieved.
Let f be an optimal flow. The idea of the proof is to construct a “matching” cutset.
An augmentable path is a path x_0 ∼ · · · ∼ x_k with x_0 ∈ A, x_i ∉ A ∪ Z for
all i ≠ 0, k, and f(x_{i−1}, x_i) < κ({x_{i−1}, x_i}) for all i ≠ 0. By default, each
vertex in A is an augmentable path. Moreover, by the optimality of f there cannot
be an augmentable path with x_k ∈ Z. Indeed, otherwise, we could “push more
flow through that path” and increase the strength of f, a contradiction.
Let B ⊆ V be the set of all final vertices in some augmentable path and let F
be the edge set between B and B^c := V \ B. Note that, again by contradiction, all
vertices in B can be reached from A without crossing F and that f(x, y) = κ(e)
for all e = {x, y} ∈ F with x ∈ B and y ∈ B^c. Furthermore F is a cutset
separating A from Z: trivially A ⊆ B; Z ⊆ B^c as argued above; and any path
from A to Z must exit B and enter B^c through an edge in F. Thus A_F = B and
we have equality in (1.1.1). That concludes the proof.
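The construction in the proof can be mimicked numerically. The sketch below is not the argument of the text but a standard augmenting-path computation (breadth-first search, as in the Edmonds-Karp algorithm), written under several assumptions: a single source and a single sink play the roles of A and Z, capacities are positive integers, the flow is stored antisymmetrically, and κ(F) is taken to be the sum of the capacities of the edges in F. All names are illustrative.

```python
from collections import deque

def max_flow_min_cut(capacity, source, sink):
    """capacity maps frozenset({x, y}) -> kappa({x, y}) > 0 (undirected edges).
    Returns (strength of the flow, reachable set B, cutset between B and B^c)."""
    nbrs = {}
    flow = {}  # antisymmetric: flow[(x, y)] == -flow[(y, x)]
    for e in capacity:
        x, y = tuple(e)
        nbrs.setdefault(x, set()).add(y)
        nbrs.setdefault(y, set()).add(x)
        flow[(x, y)] = 0
        flow[(y, x)] = 0

    def residual(x, y):
        return capacity[frozenset((x, y))] - flow[(x, y)]

    def reachable():
        """Vertices reachable from the source along edges with f(x, y) < kappa."""
        parent = {source: None}
        queue = deque([source])
        while queue:
            x = queue.popleft()
            for y in nbrs[x]:
                if y not in parent and residual(x, y) > 0:
                    parent[y] = x
                    queue.append(y)
        return parent

    while True:
        parent = reachable()
        if sink not in parent:
            break  # no augmentable path reaches the sink: the flow is optimal
        path = []
        y = sink
        while parent[y] is not None:
            path.append((parent[y], y))
            y = parent[y]
        push = min(residual(x, y) for x, y in path)
        for x, y in path:
            flow[(x, y)] += push
            flow[(y, x)] -= push

    B = set(parent)  # final vertices of augmentable paths
    strength = sum(flow[(source, y)] for y in nbrs[source])
    cut = {e for e in capacity if len(e & B) == 1}
    return strength, B, cut

capacity = {
    frozenset(("a", "u")): 3, frozenset(("a", "v")): 3,
    frozenset(("u", "v")): 1,
    frozenset(("u", "z")): 2, frozenset(("v", "z")): 2,
}
strength, B, cut = max_flow_min_cut(capacity, "a", "z")
cut_capacity = sum(capacity[e] for e in cut)
```

In this small network the strength of the computed flow equals the total capacity of the edges between B and its complement, which is the equality case of (1.1.1).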
Directed graphs. A directed graph (or digraph for short) is a pair G = (V, E)
where V is a set of vertices (or nodes or sites) and E ⊆ V^2 is a set of directed edges
(or arcs). A directed edge from x to y is typically denoted by (x, y), or occasionally
by ⟨x, y⟩. A directed path is a sequence of vertices x_0, …, x_k, all distinct, with
(x_{i−1}, x_i) ∈ E for all i = 1, …, k. We write u → v if there is such a directed
path with x_0 = u and x_k = v. We say that u, v ∈ V communicate, denoted by
u ↔ v, if u → v and v → u. In particular, we always have u ↔ u for every vertex
u. The binary relation ↔ is an equivalence relation (see Exercise 1.6). The
equivalence classes of ↔ are called the strongly connected components of G.
The following definition will prove useful.
We use the notation P_x, E_x for the probability distribution and expectation under
the chain started at x. Similarly for P_µ, E_µ where µ is a probability distribution.
Example 1.1.17 (Simple random walk on a graph). Let G = (V, E) be a finite or
infinite, locally finite graph. Simple random walk on G is the Markov chain on V,
started at an arbitrary vertex, which at each time picks a uniformly chosen neighbor
of the current state. (Exercise 1.9 asks for the transition matrix.)
Markov property. Let (X_t)_{t≥0} be a Markov chain (or chain for short) with
transition matrix P and initial distribution µ. Define the filtration (F_t)_{t≥0} with
F_t = σ(X_0, …, X_t) (see Appendix B). As mentioned above the defining property
of Markov chains, known as the Markov property, is that: given the present, the
future is independent of the past. In its simplest form, that can be interpreted as
P[X_{t+1} = y | F_t] = P_{X_t}[X_{t+1} = y] = P(X_t, y). More formally:
Theorem 1.1.18 (Markov property). Let f : V^∞ → R be bounded, measurable
and let F(x) := E_x[f((X_t)_{t≥0})], then
Proof. This follows from the Markov property. Indeed note that P_x[X_t = z | F_s] =
F(X_s) with F(y) := P_y[X_{t−s} = z] and take E_x on each side.
Example 1.1.21 (Random walk on Z). Let (X_t) be simple random walk on Z
interpreted as a graph (i.e., L) where i ∼ j if |i − j| = 1. Then P(0, x) = 1/2 if
|x| = 1. And P^2(0, x) = 1/4 if |x| = 2 and P^2(0, 0) = 1/2.
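The two-step probabilities in the example can be recovered by propagating the distribution of the walk one step at a time. The sketch below does this with exact rational arithmetic; representing a distribution as a dictionary from sites to probabilities is an assumption made for illustration.

```python
from collections import defaultdict
from fractions import Fraction

def step(dist):
    """Apply one step of simple random walk on Z to a distribution {site: prob}."""
    new = defaultdict(Fraction)
    for x, p in dist.items():
        new[x - 1] += p / 2
        new[x + 1] += p / 2
    return dict(new)

def t_step_distribution(start, t):
    """Distribution of X_t for the walk started at `start`, i.e. the row P^t(start, .)."""
    dist = {start: Fraction(1)}
    for _ in range(t):
        dist = step(dist)
    return dist

one_step = t_step_distribution(0, 1)
two_step = t_step_distribution(0, 2)
```

Here `two_step` puts mass 1/4 on each of −2 and 2 and mass 1/2 on 0, matching the values of P^2(0, ·) stated in the example.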
\[ \mu_s = \mu_0 P^s \]
\[ (P - I)\mathbf{1} = 0. \]
In particular, the columns of P − I are linearly dependent, that is, the rank of P − I
is < n. That, in turn, implies that the rows of P − I are linearly dependent since
row rank and column rank are equal. Hence there exists a non-zero row vector
z ∈ R^n such that z(P − I) = 0, or after rearranging,
\[ zP = z. \tag{1.1.2} \]
The rest of the proof is broken up into a series of lemmas. To take advantage
of irreducibility, we first construct a positive stochastic matrix with z as a left
eigenvector with eigenvalue 1. We then show that all entries of z have the same
sign. Finally, we normalize z.
Lemma 1.1.25 (Existence: Step 1). There exists a non-negative integer h such that
\[ R = \frac{1}{h+1} \left[ I + P + P^2 + \cdots + P^h \right], \]
is a stochastic matrix with strictly positive entries which satisfies
\[ zR = z. \tag{1.1.3} \]
Lemma 1.1.26 (Existence: Step 2). The entries of z are either all nonnegative or
all nonpositive.
Lemma 1.1.27 (Existence: Step 3). Let
\[ \pi = \frac{z}{z\mathbf{1}}. \]
Then π is a strictly positive stationary distribution.
We denote the entries of R and P^s by R_{x,y} and P^s_{x,y}, x, y = 1, …, n, respectively.
Proof of Lemma 1.1.25. By irreducibility (see Exercise 1.10), for any x, y ∈ [n]
there is h_{xy} such that P^{h_{xy}}_{x,y} > 0. Now define
\[ h = \max_{x, y \in [n]} h_{xy}. \]
\[ = \sum_{x : z_x \ge 0} z_x R_{x,y} + \sum_{x : z_x < 0} z_x R_{x,y}. \]
Moreover, R_{x,y} > 0 for all x, y. Therefore, the first term on the last line is strictly
positive (since it is at least z_i R_{i,y} > 0) while the second term is strictly negative
(since it is at most z_j R_{j,y} < 0). Hence, because of cancellations (see Exercise 1.13),
the expression in the previous display is strictly smaller than the sum of
the absolute values, that is,
\[ |z_y| < \sum_x |z_x| R_{x,y}. \]
\[ \pi_x = \frac{z_x}{\sum_i z_i} = \frac{|z_x|}{\sum_i |z_i|} \ge 0, \]
where the second equality comes from the previous lemma. We also used the fact
that z ≠ 0.
For all y, by definition of z,
\[ \sum_x \pi_x P_{x,y} = \sum_x \frac{z_x}{\sum_i z_i} P_{x,y} = \frac{1}{\sum_i z_i} \sum_x z_x P_{x,y} = \frac{z_y}{\sum_i z_i} = \pi_y. \]
The same holds with P_{x,y} replaced by R_{x,y} from (1.1.3). Since R_{x,y} > 0 and
z ≠ 0 it follows that π_y > 0 for all y. That proves the claim.
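The three steps of the existence argument can be followed numerically on a small irreducible chain. In the sketch below, a non-zero z with zP = z is obtained from the singular value decomposition of (P − I)ᵀ (a numerical convenience, not the method of the proof), then normalized as in Lemma 1.1.27. The example chain, simple random walk on a path with three vertices, and the function name are assumptions chosen for illustration.

```python
import numpy as np

def stationary_distribution(P):
    """Find z with z P = z via the null space of (P - I)^T, then normalize.
    Assumes P is an irreducible stochastic matrix on a finite state space."""
    n = P.shape[0]
    # The last right-singular vector spans the (one-dimensional) null space.
    _, _, vt = np.linalg.svd((P - np.eye(n)).T)
    z = vt[-1]
    return z / z.sum()  # entries of z share a sign, so the ratio is positive

# Simple random walk on the path 0 - 1 - 2.
P = np.array([[0.0, 1.0, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 1.0, 0.0]])
pi = stationary_distribution(P)

# Step 1 of the proof: with h = 2, the averaged matrix R has positive entries.
h = 2
R = sum(np.linalg.matrix_power(P, s) for s in range(h + 1)) / (h + 1)
```

For this chain one finds π = (1/4, 1/2, 1/4), all entries of R are strictly positive, and π is also a left eigenvector of R with eigenvalue 1, in line with (1.1.3).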
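Proof of Theorem 1.1.24 (ii). Suppose there are two distinct stationary distributions
π_1 and π_2 (which must be strictly positive). Since they are distinct and both
sum to 1, they are not a multiple of each other and therefore are linearly independent.
Apply the Gram-Schmidt procedure:
\[ q_1 = \frac{\pi_1}{\|\pi_1\|_2} \quad \text{and} \quad q_2 = \frac{\pi_2 - \langle \pi_2, q_1 \rangle q_1}{\|\pi_2 - \langle \pi_2, q_1 \rangle q_1\|_2}. \]
Then
\[ q_1 P = \frac{\pi_1}{\|\pi_1\|_2} P = \frac{\pi_1 P}{\|\pi_1\|_2} = \frac{\pi_1}{\|\pi_1\|_2} = q_1, \]
and all entries of q_1 are strictly positive.
Similarly,
\begin{align*}
q_2 P &= \frac{\pi_2 - \langle \pi_2, q_1 \rangle q_1}{\|\pi_2 - \langle \pi_2, q_1 \rangle q_1\|_2} P \\
&= \frac{\pi_2 P - \langle \pi_2, q_1 \rangle q_1 P}{\|\pi_2 - \langle \pi_2, q_1 \rangle q_1\|_2} \\
&= \frac{\pi_2 - \langle \pi_2, q_1 \rangle q_1}{\|\pi_2 - \langle \pi_2, q_1 \rangle q_1\|_2} \\
&= q_2.
\end{align*}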
But this is a contradiction since both vectors are strictly positive. That concludes
the proof of the uniqueness claim.
for all x, y ∈ V . These equations are known as detailed balance. Here is the key
observation: by summing over y and using the fact that P is stochastic, such a
measure is necessarily stationary. (Exercise 1.12 explains the name.)
Example 1.1.28 (Random walk on Z^d (continued)). The measure η ≡ 1 is reversible
for simple random walk on L^d.
Example 1.1.29 (Simple random walk on a graph (continued)). Let (X_t) be simple
random walk on a connected graph G = (V, E). Then (X_t) is reversible with
respect to η(v) := δ(v), where recall that δ(v) is the degree of v. Indeed, for all
{u, v} ∈ E,
\[ \delta(u) P(u, v) = \delta(u) \frac{1}{\delta(u)} = 1 = \delta(v) \frac{1}{\delta(v)} = \delta(v) P(v, u). \]
(See Exercise 1.9 for the transition matrix of simple random walk on a graph.)
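The detailed balance computation above is easy to verify numerically. The sketch below uses the transition probabilities P(u, v) = 1/δ(u) for u ∼ v that appear in the display, on a small connected graph chosen for illustration (a triangle with a pendant vertex). The function name is an assumption.

```python
import numpy as np

def srw_transition_matrix(A):
    """Transition matrix with P(u, v) = 1/deg(u) if u ~ v and 0 otherwise.
    Assumes A is a symmetric 0/1 adjacency matrix with no isolated vertex."""
    deg = A.sum(axis=1)
    return A / deg[:, None], deg

# Triangle on {0, 1, 2} with an extra vertex 3 attached to vertex 2.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])
P, deg = srw_transition_matrix(A)

balance = deg[:, None] * P   # entry (u, v) is deg(u) P(u, v)
pi = deg / deg.sum()         # degrees normalized to a probability distribution
```

The matrix `balance` is symmetric, which is detailed balance for η(v) = δ(v), and the normalized degree vector `pi` satisfies πP = π.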
Example 1.1.30 (Metropolis chain). The Metropolis algorithm modifies an irreducible,
symmetric (i.e., whose transition matrix is a symmetric matrix) chain Q
to produce a new chain P with the same transition graph and a prescribed positive
stationary distribution π. The idea is simple. For each pair x ≠ y, either we multiply
Q(x, y) by π(y)/π(x) and leave Q(y, x) intact, or vice versa. Detailed balance
immediately follows. To ensure that the new transition matrix remains stochastic,
for each pair we make the choice that lowers the transition probabilities; then we
add the lost probability to the diagonal (i.e., to the probability of staying put).
Formally, the definition of the new chain is
\[ P(x, y) := \begin{cases} Q(x, y) \left[ \frac{\pi(y)}{\pi(x)} \wedge 1 \right] & \text{if } x \neq y, \\ 1 - \sum_{z \neq x} Q(x, z) \left[ \frac{\pi(z)}{\pi(x)} \wedge 1 \right] & \text{otherwise.} \end{cases} \]
Note that, by definition of P and the fact that Q is stochastic, we have P(x, y) ≤
Q(x, y) for all x ≠ y so
\[ \sum_{y \neq x} P(x, y) \le 1, \]
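The definition of the Metropolis chain can be implemented in a few lines. The following is a minimal sketch, assuming a symmetric proposal matrix Q given as a NumPy array and a strictly positive target π; the base chain (simple random walk on a 4-cycle) and the particular π are chosen only for illustration.

```python
import numpy as np

def metropolis_chain(Q, pi):
    """Metropolis transition matrix for a symmetric stochastic matrix Q and a
    strictly positive target distribution pi."""
    n = len(pi)
    P = np.zeros((n, n))
    for x in range(n):
        for y in range(n):
            if x != y:
                P[x, y] = Q[x, y] * min(pi[y] / pi[x], 1.0)
        # The diagonal is still zero here, so the row sum is over y != x.
        P[x, x] = 1.0 - P[x].sum()
    return P

# Base chain: simple random walk on the 4-cycle (a symmetric matrix).
n = 4
Q = np.zeros((n, n))
for i in range(n):
    Q[i, (i + 1) % n] = 0.5
    Q[i, (i - 1) % n] = 0.5

pi = np.array([0.1, 0.2, 0.3, 0.4])
P = metropolis_chain(Q, pi)
balance = pi[:, None] * P   # entry (x, y) is pi(x) P(x, y)
```

The resulting P is stochastic, `balance` is symmetric (detailed balance with respect to π), and hence πP = π.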
Convergence and mixing time. A key property of Markov chains is that, under
suitable assumptions, they converge to a stationary regime. We need one more
definition before stating the theorem. A chain is said to be aperiodic if, for all
x ∈ V, the greatest common divisor of {t : P^t(x, x) > 0} is 1.
Example 1.1.31 (Lazy random walk on a graph). The lazy simple random walk on
G is the Markov chain such that, at each time, it stays put with probability 1/2 or
chooses a uniformly random neighbor of the current state otherwise. Such a chain
is aperiodic.
Lemma 1.1.32 (Consequence of aperiodicity). If P is aperiodic, irreducible and
has a finite state space, then there is a positive integer t_0 such that for all t ≥ t_0
the matrix P^t has strictly positive entries.
We can now state the convergence theorem. For probability measures µ, ν on
V, their total variation distance is
\[ \|\mu - \nu\|_{\mathrm{TV}} := \sup_{A \subseteq V} |\mu(A) - \nu(A)|. \tag{1.1.4} \]
\[ \|P^t(x, \cdot) - \pi(\cdot)\|_{\mathrm{TV}} \to 0, \]
as t → +∞.
We give a proof in the finite case in Example 4.3.3. In particular, the convergence
theorem implies that for all x, y,
\[ P^t(x, y) \to \pi(y), \]
as t → +∞.
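To see the convergence theorem at work, the sketch below tracks the total variation distance between P^t(0, ·) and π for the lazy simple random walk on the 5-cycle, whose stationary distribution is uniform. Two assumptions are made: the chain and the horizon of 30 steps are chosen for illustration, and the distance is computed as half the ℓ^1 distance, a standard identity that is checked here against the supremum in (1.1.4) by brute force over subsets.

```python
import numpy as np
from itertools import combinations

def lazy_cycle_walk(n):
    """Lazy simple random walk on the n-cycle: stay put w.p. 1/2, else move
    to a uniformly chosen neighbor."""
    P = 0.5 * np.eye(n)
    for i in range(n):
        P[i, (i + 1) % n] += 0.25
        P[i, (i - 1) % n] += 0.25
    return P

def tv_half_l1(mu, nu):
    """Total variation distance computed as half the l1 distance."""
    return 0.5 * np.abs(mu - nu).sum()

def tv_sup(mu, nu):
    """Total variation distance as the sup over nonempty subsets A, as in (1.1.4)."""
    n = len(mu)
    best = 0.0
    for r in range(1, n + 1):
        for A in combinations(range(n), r):
            idx = list(A)
            best = max(best, abs(mu[idx].sum() - nu[idx].sum()))
    return best

n = 5
P = lazy_cycle_walk(n)
pi = np.full(n, 1.0 / n)   # uniform, since P is doubly stochastic

row = np.zeros(n)
row[0] = 1.0               # distribution at time 0: point mass at vertex 0
distances = []
for t in range(31):
    distances.append(tv_half_l1(row, pi))
    row = row @ P          # row is now P^{t+1}(0, .)
```

The recorded distances start at 1 − 1/5 and decrease toward 0 as t grows.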
We will be interested in quantifying the speed of convergence in Theorem 1.1.33.
For this purpose, we define
where on the second and fourth line we used that P is a stochastic matrix.
The following concept will play a key role in quantifying the “speed of conver-
gence” to stationarity.
Definition 1.1.35 (Mixing time). For a fixed ε > 0, the mixing time is defined as
\[ C_x := \{ y \in V : x \Leftrightarrow y \}. \]
We will mostly consider bond percolation on the infinite graphs L^d or T_d. The main
question we will ask is: For which values of p is there an infinite open cluster?
Definition 1.2.2 (Erdős-Rényi graph model). Let n ∈ N and p ∈ [0, 1]. Set V :=
[n]. Under the Erdős-Rényi graph model on n vertices with density p, a random
graph G = (V, E) is generated as follows: for each pair x ≠ y in V, the edge
{x, y} is in E with probability p independently of all other edges. We write G ∼
G_{n,p} and we denote the corresponding probability measure by P_{n,p}.
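The definition translates directly into a sampler: flip an independent p-coin for each unordered pair of vertices. The sketch below uses only the standard library; the function name, the optional `rng` argument and the representation of edges as frozensets are assumptions made for illustration.

```python
import random
from itertools import combinations

def sample_erdos_renyi(n, p, rng=None):
    """Sample (V, E) with V = {1, ..., n}, each pair included independently w.p. p."""
    rng = rng or random.Random()
    vertices = list(range(1, n + 1))
    edges = {frozenset(pair)
             for pair in combinations(vertices, 2)
             if rng.random() < p}   # rng.random() is uniform on [0, 1)
    return vertices, edges

def degree_sequence(vertices, edges):
    """Degree of each vertex, in the order of `vertices`."""
    return [sum(1 for e in edges if v in e) for v in vertices]

V, E = sample_erdos_renyi(50, 0.1, rng=random.Random(0))
degs = degree_sequence(V, E)
```

Passing a seeded `random.Random` makes the sample reproducible, which is convenient when exploring questions such as the degree distribution empirically.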
Typical questions regarding the Erdős-Rényi graph model (and random graphs
more generally) include: How are degrees distributed? Is G connected? What
Definition 1.2.4 (Gibbs random field). Let S be a finite set and let G = (V, E) be a
finite graph. Denote by K the set of all cliques of G. A positive probability measure
µ on X := S^V is called a Gibbs random field if there exist clique potentials φ_K :
S^K → R, K ∈ K, such that
\[ \mu(\sigma) = \frac{1}{Z} \exp\left( \sum_{K \in \mathcal{K}} \phi_K(\sigma_K) \right), \]
The following example introduces the primary Gibbs random field we will en-
counter.
Example 1.2.5 (Ising model). For β > 0, the (ferromagnetic) Ising model with inverse temperature β is the Gibbs random field with S := {−1, +1}, $\phi_{\{i,j\}}(\sigma_{\{i,j\}}) = \beta \sigma_i \sigma_j$ and $\phi_K \equiv 0$ if $|K| \neq 2$. The function $H(\sigma) := -\sum_{\{i,j\} \in E} \sigma_i \sigma_j$ is known as the Hamiltonian. The normalizing constant Z := Z(β) is called the partition function. The states $(\sigma_i)_{i \in V}$ are referred to as spins.
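On a very small graph the Ising measure can be computed by brute force, which helps make the roles of the Hamiltonian and the partition function concrete. The sketch below is illustrative only; the function name `ising_measure` and the 4-cycle example are assumptions made here, and enumerating all 2^|V| configurations is feasible only for tiny graphs.

```python
import itertools
import math


def ising_measure(vertices, edges, beta):
    """Return (mu, Z): the Ising measure as a dict over configurations, and Z(beta)."""
    index = {v: k for k, v in enumerate(vertices)}
    states = list(itertools.product((-1, +1), repeat=len(vertices)))

    def hamiltonian(sigma):
        # H(sigma) = - sum over edges {i, j} of sigma_i * sigma_j
        return -sum(sigma[index[i]] * sigma[index[j]] for i, j in edges)

    # mu(sigma) is proportional to exp(-beta * H(sigma)).
    weights = [math.exp(-beta * hamiltonian(s)) for s in states]
    Z = sum(weights)
    return {s: w / Z for s, w in zip(states, weights)}, Z


# Example: the cycle on four vertices at inverse temperature 0.5.
vertices = [0, 1, 2, 3]
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
mu, Z = ising_measure(vertices, edges, beta=0.5)
print(Z, mu[(1, 1, 1, 1)], mu[(1, -1, 1, -1)])
```

In the ferromagnetic case the aligned configurations receive the largest weight, and the measure is invariant under flipping all spins.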
Typical questions regarding Ising models include: How fast is correlation de-
caying down the graph? How well can one guess the state at an unobserved vertex?
We will also consider certain Markov chains related to Ising models (see Defini-
tion 1.2.8).
Random walks on graphs and reversible Markov chains The last class of pro-
cesses we focus on are random walks on graphs and their generalizations. Recall
the following definition.
Definition 1.2.6 (Simple random walk on a graph). Let G = (V, E) be a finite or countable, locally finite graph. Simple random walk on G is the Markov chain on V, started at an arbitrary vertex, which at each time picks a uniformly chosen neighbor of the current state.
We generalize the definition by adding weights to the edges. In this context, we denote edge weights by c(e) for “conductance” (see Section 3.3).
Definition 1.2.7 (Random walk on a network). Let G = (V, E) be a finite or countably infinite graph. Let $c : E \to \mathbb{R}_+$ be a positive edge weight function on G. Recall that we call $\mathcal{N} = (G, c)$ a network. We assume that for all u ∈ V
$$c(u) := \sum_{e = \{u,v\} \in E} c(e) < +\infty.$$
Definition 1.2.8 (Glauber dynamics of the Ising model). Let $\mu_\beta$ be the Ising model with inverse temperature β > 0 on a graph G = (V, E). The (single-site) Glauber dynamics is the Markov chain on $\mathcal{X} := \{-1, +1\}^V$ which at each time:
- selects a site i ∈ V uniformly at random, and
$$Q_\beta(\sigma, \sigma^{i,\gamma}) := \frac{1}{n} \cdot \frac{e^{\gamma \beta S_i(\sigma)}}{e^{-\beta S_i(\sigma)} + e^{\beta S_i(\sigma)}}.$$
This chain is irreducible since we can flip each site one by one to go from any
state to any other. It is straightforward to check that Qβ (σ, σ i,γ ) is a stochastic
matrix. The next theorem shows that µβ is its stationary distribution.
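The two claims about Q_β (it is stochastic, and µ_β is stationary for it) can be checked numerically on a small graph. The sketch below rests on assumptions that are not spelled out in the text shown here: S_i(σ) is taken to be the sum of the spins of the neighbors of i, σ^{i,γ} is σ with the spin at i set to γ, and n = |V|. The function names are hypothetical.

```python
import itertools
import math


def glauber_matrix(vertices, edges, beta):
    """Build Q_beta as a dict of dicts over all spin configurations."""
    n = len(vertices)
    pos = {v: k for k, v in enumerate(vertices)}
    nbrs = {v: [] for v in vertices}
    for i, j in edges:
        nbrs[i].append(j)
        nbrs[j].append(i)
    states = list(itertools.product((-1, +1), repeat=n))
    Q = {s: {} for s in states}
    for s in states:
        for i in vertices:
            # Assumed definition: S_i(sigma) = sum of neighboring spins.
            S_i = sum(s[pos[j]] for j in nbrs[i])
            denom = math.exp(-beta * S_i) + math.exp(beta * S_i)
            for gamma in (-1, +1):
                t = list(s)
                t[pos[i]] = gamma  # sigma^{i, gamma}
                t = tuple(t)
                Q[s][t] = Q[s].get(t, 0.0) + math.exp(gamma * beta * S_i) / (n * denom)
    return states, Q


def ising_weight(s, vertices, edges, beta):
    """Unnormalized Ising weight exp(beta * sum over edges of sigma_i sigma_j)."""
    pos = {v: k for k, v in enumerate(vertices)}
    return math.exp(beta * sum(s[pos[i]] * s[pos[j]] for i, j in edges))


vertices = [0, 1, 2]
edges = [(0, 1), (1, 2)]
states, Q = glauber_matrix(vertices, edges, beta=0.4)
# Row sums of Q should all be 1 (up to rounding).
print(max(abs(sum(Q[s].values()) - 1.0) for s in states))
```

Under these assumptions one can also verify detailed balance, µ_β(σ) Q_β(σ, σ') = µ_β(σ') Q_β(σ', σ), which implies stationarity.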
We have
Exercises
Exercise 1.1 (0-norm). Show that kuk0 does not define a norm.
to show that
$$\ell! \geq \frac{\ell^\ell}{e^{\ell-1}}.$$
(iii) Use part (i) and the quantity
$$\prod_{k=1}^{\ell-1} \frac{k^{k+1}}{(k+1)^{k+1}},$$
to show that
$$\ell! \leq \frac{\ell^{\ell+1}}{e^{\ell-1}}.$$
Exercise 1.3 (A factorial bound: another way). Let ℓ be a positive integer. Show that
$$\frac{\ell^\ell}{e^{\ell-1}} \leq \ell! \leq \frac{\ell^{\ell+1}}{e^{\ell-1}},$$
by considering the logarithm of ℓ!, interpreting the resulting quantity as a Riemann sum, and bounding above and below by an integral.
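A quick numerical sanity check of the two-sided factorial bound can be reassuring before attempting the proof. The sketch below compares logarithms to avoid overflow; it is only a check on finitely many values, not a substitute for the exercise, and the function name `log_factorial_bounds` is hypothetical.

```python
import math


def log_factorial_bounds(l):
    """Return (log lower bound, log l!, log upper bound) for a positive integer l."""
    lower = l * math.log(l) - (l - 1)        # log of l^l / e^(l-1)
    exact = math.lgamma(l + 1)               # log of l!
    upper = (l + 1) * math.log(l) - (l - 1)  # log of l^(l+1) / e^(l-1)
    return lower, exact, upper


for l in (1, 2, 5, 10, 100, 1000):
    lower, exact, upper = log_factorial_bounds(l)
    print(l, lower <= exact + 1e-9, exact <= upper + 1e-9)
```

The small tolerance accounts for floating-point rounding at ℓ = 1, where all three quantities coincide.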
[Hint: Multiply the left-hand side of the inequality by $(d/n)^d \leq (d/n)^k$ and use the binomial theorem.]
Exercise 1.5 (Powers of the adjacency matrix). Let $A^n$ be the n-th matrix power of the adjacency matrix A of a graph G = (V, E). Prove that the (i, j)-th entry $a^n_{ij}$ is the number of walks of length exactly n between vertices i and j in G. [Hint: Use induction on n.]
Exercise 1.7 (Trees: number of edges). Prove that a connected graph with n ver-
tices is a tree if and only if it has n − 1 edges. [Hint: Proceed by induction. Then
use Corollary 1.1.6.]
Exercise 1.9 (Simple random walk on a graph). Let (Xt )t≥0 be simple random
walk on a finite graph G = (V, E). Suppose the vertex set is V = [n]. Write down
an expression for the transition matrix of (Xt ).
Exercise 1.10 (Communication lemma). Let (Xt ) be a finite Markov chain. Show
that, if x → y, then there is an integer r ≥ 1 such that
be stochastic matrices.
(i) Show that P (1) P (2) is a stochastic matrix. That is, a product of stochastic
matrices is a stochastic matrix.
(ii) Show that for any $\alpha_1, \ldots, \alpha_r \in [0, 1]$ with $\sum_{i=1}^r \alpha_i = 1$,
$$\sum_{i=1}^r \alpha_i P^{(i)}$$
P[Xs = z0 , . . . , X0 = zs ] = P[Xs = zs , . . . , X0 = z0 ].
Exercise 1.13 (A strict inequality). Let a, b ∈ R with a < 0 and b > 0. Show that
(iv) Assume again that P is irreducible. Use (iii) to conclude that the dimen-
sion of the null space of P T − I is exactly 1. [Hint: Use the Rank-Nullity
Theorem.]
Exercise 1.15 (Preferential attachment trees). Let (Gt )t≥1 ∼ PA1 as in Defini-
tion 1.2.3. Show that Gt is a tree with t + 1 vertices for all t ≥ 1.
Exercise 1.16 (Warm-up: a little calculus). Prove the following inequalities which
we will encounter throughout. [Hint: Basic calculus should do.]
Bibliographic Remarks
Section 1.1 For an introduction to graphs see for example [Die10] or [Bol98].
Four different proofs of Cayley’s formula are detailed in the delightful [AZ18].
Markov chain theory is covered in detail in [Dur10, Chapter 6]. For a more gentle
introduction, see for example [Dur12, Chapter 1], [Nor98, Chapter 1], or [Res92,
Chapter 2].
Section 1.2 For book-length treatments of percolation theory see [BR06a, Gri10b].
The version of the Erdős-Rényi random graph model we consider here is due to
Gilbert [Gil59]. For much deeper accounts of the theory of random graphs and
related processes, see for example [Bol01, Dur06, JLR11, vdH17, FK16]. Two
standard references on finite Markov chains and mixing times are [AF, LPW06].
Chapter 2
2.1 Background
We start with a few basic definitions and standard inequalities. See Appendix B for
a refresher on random variables and their expectation.
Version: December 20, 2023
Modern Discrete Probability: An Essential Toolkit
Copyright © 2023 Sébastien Roch
CHAPTER 2. MOMENTS AND TAILS 28
2.1.1 Definitions
Moments As a quick reminder, let X be a random variable with $\mathbb{E}|X|^k < +\infty$ for some non-negative integer k. In that case we write $X \in L^k$. Recall that the quantities $\mathbb{E}[X^k]$ and $\mathbb{E}[(X - \mathbb{E}X)^k]$, which are well-defined when $X \in L^k$, are called respectively the k-th moment and k-th central moment of X. The first moment and the second central moment are known as the mean and variance, the square root of which is the standard deviation. A random variable is said to be centered if its mean is 0. Recall that for a non-negative random variable X, the k-th moment can be expressed as
$$\mathbb{E}[X^k] = \int_0^{+\infty} k x^{k-1}\, \mathbb{P}[X > x]\, dx. \qquad (2.1.1)$$
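Formula (2.1.1) can be checked numerically in a case where both sides are known. The sketch below uses an exponential random variable with rate 1, for which P[X > x] = e^{−x} and E[X^k] = k!; the choice of distribution, the truncation point and the step count are assumptions made for illustration, and `moment_via_tail` is a hypothetical helper.

```python
import math


def moment_via_tail(k, tail, upper=60.0, steps=60000):
    """Trapezoid approximation of the integral of k x^(k-1) P[X > x] over [0, upper]."""
    h = upper / steps

    def integrand(x):
        return k * x ** (k - 1) * tail(x)

    total = 0.5 * (integrand(0.0) + integrand(upper))
    for m in range(1, steps):
        total += integrand(m * h)
    return total * h


def exp_tail(x):
    """P[X > x] for X exponential with rate 1."""
    return math.exp(-x)


for k in (1, 2, 3, 4):
    print(k, moment_via_tail(k, exp_tail), math.factorial(k))
```

The integral is truncated at 60, where the exponential tail is negligible, so the two printed columns should agree to several digits.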
$$= \mathbb{E}[e^{sX_1}]\, \mathbb{E}[e^{sX_2}]$$
Tails We refer to a probability of the form P[X ≥ x] as an upper tail (or right tail) probability. Typically x is (much) greater than the mean or median of X. Similarly we refer to P[X ≤ x] as a lower tail (or left tail) probability. Our general goal in this chapter is to bound tail probabilities using moments and moment-generating functions.
Tail bounds arise naturally in many contexts, as events of interest can often be framed in terms of a random variable being unusually large or small. Such probabilities are often hard to compute directly, however. As we will see in this chapter, moments offer an effective means to control tail probabilities for two main reasons: (i) moments contain information about the tails of a random variable, as (2.1.1) makes explicit for instance; and (ii) they are typically easier to compute or, at least, to approximate.
As we will see, tail bounds are also useful to study the maximum of a collection
of random variables.
See Figure 2.1 for a proof by picture. Note that this inequality is non-trivial only
when b > EX.
Figure 2.1: Proof of Markov’s inequality: taking expectations of the two functions
depicted above yields the inequality.
Of course this bound is non-trivial only when β is larger than the standard deviation. Results of this type that quantify the probability of deviating from the mean are referred to as concentration inequalities. Chebyshev’s inequality is perhaps the simplest instance; we will derive many more. To bound the variance, the following standard formula is sometimes useful
$$\mathrm{Var}\left[\sum_{i=1}^n X_i\right] = \sum_{i=1}^n \mathrm{Var}[X_i] + 2 \sum_{i<j} \mathrm{Cov}[X_i, X_j], \qquad (2.1.6)$$
Figure 2.2: Comparison of Markov’s and Chebyshev’s inequalities: the squared de-
viation from the mean (solid) gives a better approximation of the indicator function
(dotted) close to the mean than the absolute deviation (dashed).
We write $X \sim N(\mu, \sigma^2)$. A direct computation shows that $\mathbb{E}|X - \mu| = \sigma \sqrt{2/\pi}$. Hence Markov’s inequality gives
$$\mathbb{P}[|X - \mu| \geq b] \leq \frac{\mathbb{E}|X - \mu|}{b} = \sqrt{\frac{2}{\pi}} \cdot \frac{\sigma}{b},$$
while Chebyshev’s inequality (Theorem 2.1.2) gives
$$\mathbb{P}[|X - \mu| \geq b] \leq \left(\frac{\sigma}{b}\right)^2.$$
Hence, for b large enough, Chebyshev’s inequality produces a stronger bound. See Figure 2.2 for some insight.
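To see the comparison numerically, the following sketch evaluates both bounds next to the exact Gaussian tail, using the identity P[|X − µ| ≥ b] = erfc(b / (σ√2)). The function name `gaussian_tail_bounds` is hypothetical. Comparing the two formulas shows that the Chebyshev bound is the smaller one exactly when b > σ√(π/2), roughly 1.25σ.

```python
import math


def gaussian_tail_bounds(b, sigma=1.0):
    """Return (exact tail, Markov bound, Chebyshev bound) for P[|X - mu| >= b]."""
    exact = math.erfc(b / (sigma * math.sqrt(2.0)))
    markov = math.sqrt(2.0 / math.pi) * sigma / b
    chebyshev = (sigma / b) ** 2
    return exact, markov, chebyshev


for b in (0.5, 1.0, 1.5, 2.0, 3.0, 5.0):
    exact, markov, chebyshev = gaussian_tail_bounds(b)
    print(f"b={b}: exact={exact:.5f} markov={markov:.5f} chebyshev={chebyshev:.5f}")
```

Both bounds remain far from the true tail for large b, which is one motivation for the exponential tail bounds developed later in the chapter.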
Example 2.1.4 (Coupon collector’s problem). Let $(X_t)_{t \in \mathbb{N}}$ be i.i.d. uniform random variables over [n], that is, that are equally likely to take any value in [n]. Let $T_{n,i}$ be the first time that i elements of [n] have been picked, that is,
with $T_{n,0} := 0$. We prove that the time it takes to pick all elements at least once (or “collect each coupon”) has the following tail. For any ε > 0, we have as n → +∞:
Claim 2.1.5.
$$\mathbb{P}\left[\left|T_{n,n} - n \sum_{j=1}^n j^{-1}\right| \geq \varepsilon\, n \log n\right] \to 0.$$
To prove this claim we note that the time elapsed between $T_{n,i-1}$ and $T_{n,i}$, which we denote by $\tau_{n,i} := T_{n,i} - T_{n,i-1}$, is geometric with success probability $1 - \frac{i-1}{n}$.
And all $\tau_{n,i}$s are independent. Recall that a geometric random variable Z with success probability p has probability mass function $\mathbb{P}[Z = z] = (1 - p)^{z-1} p$ for $z \in \mathbb{N}$ and has mean 1/p and variance $(1 - p)/p^2$. So, the expectation and variance of $T_{n,n} = \sum_{i=1}^n \tau_{n,i}$ are
$$\mathbb{E}[T_{n,n}] = \sum_{i=1}^n \left(1 - \frac{i-1}{n}\right)^{-1} = n \sum_{j=1}^n j^{-1} = \Theta(n \log n), \qquad (2.1.7)$$
and
$$\mathrm{Var}[T_{n,n}] \leq \sum_{i=1}^n \left(1 - \frac{i-1}{n}\right)^{-2} = n^2 \sum_{j=1}^n j^{-2} \leq n^2 \sum_{j=1}^{+\infty} j^{-2} = \Theta(n^2). \qquad (2.1.8)$$
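So by Chebyshev’s inequality
$$\mathbb{P}\left[\left|T_{n,n} - n \sum_{j=1}^n j^{-1}\right| \geq \varepsilon\, n \log n\right] \leq \frac{\mathrm{Var}[T_{n,n}]}{(\varepsilon\, n \log n)^2} \leq \frac{n^2 \sum_{j=1}^{+\infty} j^{-2}}{(\varepsilon\, n \log n)^2} \to 0,$$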
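A short simulation illustrates the concentration of the collection time around its mean n Σ_j j^{−1} from (2.1.7). This is a sketch for intuition only; the function names, the choice n = 50 and the number of trials are arbitrary assumptions.

```python
import random


def expected_collection_time(n):
    """Exact mean of T_{n,n}: n times the n-th harmonic number, as in (2.1.7)."""
    return n * sum(1.0 / j for j in range(1, n + 1))


def simulate_collection_time(n, rng):
    """Draw uniform coupons from {0, ..., n-1} until every value has appeared."""
    seen = set()
    t = 0
    while len(seen) < n:
        seen.add(rng.randrange(n))
        t += 1
    return t


rng = random.Random(0)
n, trials = 50, 2000
samples = [simulate_collection_time(n, rng) for _ in range(trials)]
print(sum(samples) / trials, expected_collection_time(n))
```

By (2.1.8) the standard deviation of the collection time is of order n, which is smaller than the mean by a factor of order log n; this is what drives the claim.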
lim P[|Xn − X| ≥ ε] → 0.
n→+∞
See Exercise 2.5 for a proof. When the Xk s are i.i.d. and integrable (but not nec-
essarily square integrable), convergence is almost sure. That result, the strong law
of large numbers, also follows from Chebyshev’s inequality (and other ideas), but
we will not prove it here.
a contradiction.
x1 v1 + · · · + xn vn
where we used the linearity of expectation on the second line. Hence the random variable $Z = \|X_1 v_1 + \cdots + X_n v_n\|^2$ has expectation $\mathbb{E}Z = n$ and must take a value ≤ n with positive probability by the first moment principle (Theorem 2.2.1). In other words, there must be a choice of $X_i$s such that Z ≤ n. That proves the claim.
Here is a slightly more subtle example of the probabilistic method, where one
has to modify the original random choice.
Example 2.2.3 (Independent sets). For d ∈ N, let G = (V, E) be a d-regular graph
with n vertices. Such a graph necessarily has m = nd/2 edges. Our goal is to derive
a lower bound on the size, α(G), of the largest independent set in G. Recall that an
independent set is a set of vertices in a graph, no two of which are adjacent. Again,
at first sight, this may seem like a rather complicated graph-theoretic problem. But
an appropriate random choice gives a non-trivial bound. Specifically:
Claim 2.2.4.
$$\alpha(G) \geq \frac{n}{2d}.$$
Proof. The proof proceeds in two steps:
Step 1. Let 0 < p < 1 to be chosen below. To form the set S, pick each vertex
in V independently with probability p. Letting X be the number of vertices in S,
we have by the linearity of expectation that
$$\mathbb{E}X = \mathbb{E}\left[\sum_{v \in V} \mathbf{1}_{v \in S}\right] = np,$$
$$\mathbb{E}[X - Y] = np - \frac{nd}{2} p^2,$$
which, as a function of p, is maximized at p = 1/d where it takes the value n/(2d).
As a result, by the first moment principle applied to X − Y , there must exist a set
S of vertices in G such that
$$|S| - |\{\{i, j\} \in E : i, j \in S\}| \geq \frac{n}{2d}. \qquad (2.2.2)$$
Step 2. For each edge e connecting two vertices in S, remove one of the endvertices of e. By construction, the remaining set of vertices: (i) forms an independent set; and (ii) has size greater than or equal to the left-hand side of (2.2.2). That inequality implies the claim.
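The two steps of the proof can be run as a randomized procedure. The sketch below is illustrative: it samples S with p = 1/d, deletes one endpoint of every edge that still lies inside the set, and repeats to find a good outcome. The function name and the choice of the 20-cycle as test graph are assumptions made here.

```python
import random


def independent_set_by_deletion(n, edges, p, rng):
    """Return (S, edges inside S, independent set obtained after deletions)."""
    # Step 1: keep each vertex independently with probability p.
    S = {v for v in range(n) if rng.random() < p}
    inside = [(u, v) for u, v in edges if u in S and v in S]
    # Step 2: for each edge still inside the set, remove one endpoint.
    I = set(S)
    for u, v in inside:
        if u in I and v in I:
            I.discard(u)
    return S, inside, I


n, d = 20, 2
cycle = [(i, (i + 1) % n) for i in range(n)]  # a 2-regular graph
rng = random.Random(0)
best = max(
    len(independent_set_by_deletion(n, cycle, 1.0 / d, rng)[2]) for _ in range(200)
)
print(best, n / (2 * d))
```

Each deletion is charged to a distinct edge inside S, so the final set always has size at least |S| minus the number of such edges, which is the quantity X − Y in the proof.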
for a general graph G = (V, E), where δ(v) is the degree of v. This bound is achieved for
Turán graphs. See, for example, [AS11, The probabilistic lens: Turán’s theorem].
The previous example also illustrates the important indicator trick, that is, writing a random variable as a sum of indicators, which is naturally used in combination with the linearity of expectation.
This simple fact is typically used in the following manner: one wants to show that a certain “bad event” does not occur with probability approaching 1; the random variable X then counts the number of such “bad events.” In that case, X is a sum of indicators and Theorem 2.2.6 reduces simply to the standard union bound, also known as Boole’s inequality. We record one useful version of this setting in the next corollary.
we have
P[Bn ] ≤ µn .
In particular, if µn → 0 as n → +∞, then P[Bn ] → 0.
Proof. Take $X := X_n = \sum_{i=1}^{m_n} \mathbf{1}_{A_{n,i}}$ in Theorem 2.2.6.
A useful generalization of the union bound is given in Exercise 2.2. We will refer to applications of Theorem 2.2.6 as the first moment method.
Example 2.2.8 (Random k-SAT threshold). For $r \in \mathbb{R}_+$, let $\Phi_{n,r} : \{0, 1\}^n \to \{0, 1\}$ be a random k-CNF formula on n Boolean variables $z_1, \ldots, z_n$ with $\lceil rn \rceil$ clauses. That is, $\Phi_{n,r}$ is an AND of $\lceil rn \rceil$ ORs, each obtained by picking independently k literals uniformly at random (with replacement). Recall that a literal is a variable $z_i$ or its negation $\bar{z}_i$. The formula $\Phi_{n,r}$ is said to be satisfiable if there exists an assignment $z = (z_1, \ldots, z_n)$ such that $\Phi_{n,r}(z) = 1$. Clearly the higher the value of r, the less likely it is for $\Phi_{n,r}$ to be satisfiable. In fact it is natural to conjecture that a sharp transition takes place, that is, that there exists an $r_k^* \in \mathbb{R}_+$ (depending on k but not on n) such that
$$\lim_{n \to \infty} \mathbb{P}[\Phi_{n,r} \text{ is satisfiable}] = \begin{cases} 0, & \text{if } r > r_k^*, \\ 1, & \text{if } r < r_k^*. \end{cases} \qquad (2.2.4)$$
Claim 2.2.9.
Proof. How to start the proof should be obvious: let Xn be the number of satisfying
assignments of Φn,r . Applying the first moment method, since
it suffices to show that EXn → 0. To compute EXn , we use the indicator trick
$$X_n = \sum_{z \in \{0,1\}^n} \mathbf{1}_{\{z \text{ satisfies } \Phi_{n,r}\}}.$$
There are $2^n$ possible assignments. Each fixed assignment satisfies the random choice of clauses $\Phi_{n,r}$ with probability $(1 - 2^{-k})^{\lceil rn \rceil}$. Indeed note that the $\lceil rn \rceil$ clauses are picked independently and each clause literal picked is satisfied with probability 1/2. Therefore, by the assumption on r, for ε > 0 small enough and n large enough
$$\mathbb{E}X_n = 2^n (1 - 2^{-k})^{\lceil rn \rceil} \leq 2^n e^{-(\log 2)(1+\varepsilon) n} = 2^{-\varepsilon n} \to 0,$$
where we used that $(1 - 1/\ell)^\ell \leq e^{-1}$ for all $\ell \in \mathbb{N}$ (see Exercise 1.16). Theorem 2.2.6 implies the claim.
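The first moment computation in this proof can be reproduced exactly for tiny parameters and evaluated in closed form for larger ones. In the sketch below, `expected_count_bruteforce` averages the number of satisfying assignments over every possible formula, while `log2_expected_count` evaluates log₂ of 2^n (1 − 2^{−k})^{⌈rn⌉}. The function names are hypothetical, and the comparison value 2^k log 2 for r is an inference from the bound used in the proof, since the statement of the claim is not reproduced here.

```python
import itertools
import math
from fractions import Fraction


def expected_count_formula(n, m, k):
    """Closed form: 2^n (1 - 2^(-k))^m for a formula with m clauses."""
    return Fraction(2 ** n) * (1 - Fraction(1, 2 ** k)) ** m


def expected_count_bruteforce(n, m, k):
    """Average number of satisfying assignments over all formulas (tiny cases only)."""
    literals = [(i, neg) for i in range(n) for neg in (False, True)]
    clauses = list(itertools.product(literals, repeat=k))
    assignments = list(itertools.product((0, 1), repeat=n))
    total, num_formulas = 0, 0
    for formula in itertools.product(clauses, repeat=m):
        num_formulas += 1
        for z in assignments:
            # Literal (i, neg) is satisfied when z_i = 1 without negation, or z_i = 0 with it.
            if all(any((z[i] == 1) != neg for i, neg in clause) for clause in formula):
                total += 1
    return Fraction(total, num_formulas)


def log2_expected_count(n, r, k):
    """log2 of E[X_n] with ceil(r n) clauses."""
    return n + math.ceil(r * n) * math.log2(1 - 2.0 ** (-k))


print(expected_count_bruteforce(2, 2, 2), expected_count_formula(2, 2, 2))
print(log2_expected_count(100, 6.0, 3), log2_expected_count(100, 5.0, 3))
```

For k = 3 the value 2^k log 2 is about 5.55, so the expected count decays exponentially for r = 6 but grows for r = 5.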
Remark 2.2.10. Bounds in the other direction are also known. For instance, for k ≥ 3, it has been shown that if $r < 2^k \log 2 - k$, then
$$\liminf_{n \to \infty} \mathbb{P}[\Phi_{n,r} \text{ is satisfiable}] = 1.$$
See [ANP05]. For the k = 2 case, it is known that (2.2.4) in fact holds with $r_2^* = 1$ [CR92]. A breakthrough of [DSS22] also establishes (2.2.4) for large k; the threshold $r_k^*$ is characterized as the root of a certain equation coming from statistical physics.
for an ` chosen below. To bound the probability on the right-hand side, we appeal to
the first moment method (Theorem 2.2.6) by letting Xn be the number of increasing
subsequences of length `. We also use the indicator trick, that is, we think of Xn
as a sum of indicators over subsequences (not necessarily increasing) of length `.
There are $\binom{n}{\ell}$ such subsequences, each of which is increasing with probability $1/\ell!$. Note that these subsequences are not independent. Nevertheless, by the linearity of expectation and the first moment method,
$$\mathbb{P}[L_n \geq \ell] = \mathbb{P}[X_n > 0] \leq \mathbb{E}X_n = \frac{1}{\ell!} \binom{n}{\ell} \leq \frac{n^\ell}{(\ell!)^2} \leq \frac{n^\ell}{e^2 [\ell/e]^{2\ell}} \leq \left(\frac{e \sqrt{n}}{\ell}\right)^{2\ell},$$
where we used a standard bound on factorials recalled in Appendix A. Note that, in order for this bound to go to 0, we need $\ell > e\sqrt{n}$. Then (2.2.5) follows by taking $\ell = (1 + \delta) e \sqrt{n}$ in (2.2.6), for an arbitrarily small δ > 0.
For the other half of the claim, we show that
$$\frac{\mathbb{E}L_n}{\sqrt{n}} \geq 1.$$
This part does not rely on the first moment method (and may be skipped). We seek
a lower bound on the expected length of a longest increasing subsequence. The
proof uses the following two ideas. First observe that there is a natural symme-
try between the lengths of the longest increasing and decreasing subsequences—
they are identically distributed. Moreover if a permutation has a “short” longest
increasing subsequence, then intuitively it must have a “long” decreasing subse-
quence, and vice versa. Combining these two observations gives a lower bound
on the expectation of Ln . Formally, let Dn be the length of a longest decreasing
subsequence. By symmetry and the arithmetic mean-geometric mean inequality, note that
$$\mathbb{E}L_n = \mathbb{E}\left[\frac{L_n + D_n}{2}\right] \geq \mathbb{E}\sqrt{L_n D_n}.$$
We show that $L_n D_n \geq n$, which proves the claim. Let $L_n^{(k)}$ be the length of a longest increasing subsequence ending at position k, and similarly for $D_n^{(k)}$. It suffices to show that the pairs $(L_n^{(k)}, D_n^{(k)})$, 1 ≤ k ≤ n, are distinct. Indeed, noting that $L_n^{(k)} \leq L_n$ and $D_n^{(k)} \leq D_n$, the number of pairs in $[L_n] \times [D_n]$ is at most $L_n D_n$ which must then be at least n.
Let 1 ≤ j < k ≤ n. If $\sigma_n(k) > \sigma_n(j)$ then we see that $L_n^{(k)} > L_n^{(j)}$ by appending $\sigma_n(k)$ to the subsequence ending at position j achieving $L_n^{(j)}$. If the opposite holds, then we have instead $D_n^{(k)} > D_n^{(j)}$. Either way, $(L_n^{(j)}, D_n^{(j)})$ and $(L_n^{(k)}, D_n^{(k)})$ must be distinct. This clever combinatorial argument is known as the Erdős-Szekeres Theorem. That concludes the proof of the second claim.
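The pairs used in this argument are easy to compute with a quadratic-time dynamic program, which makes the distinctness claim concrete. The sketch below is illustrative; the function name `lis_lds_ending_at` is hypothetical and the input is assumed to be a sequence of distinct values.

```python
import itertools


def lis_lds_ending_at(perm):
    """For each position k, return (longest increasing, longest decreasing)
    subsequence lengths among subsequences ending at position k."""
    n = len(perm)
    inc = [1] * n
    dec = [1] * n
    for k in range(n):
        for j in range(k):
            if perm[j] < perm[k]:
                inc[k] = max(inc[k], inc[j] + 1)
            else:  # values are distinct, so perm[j] > perm[k]
                dec[k] = max(dec[k], dec[j] + 1)
    return list(zip(inc, dec))


pairs = lis_lds_ending_at([3, 1, 2, 5, 4])
L = max(a for a, _ in pairs)
D = max(b for _, b in pairs)
print(pairs, L, D, L * D >= 5)
```

Checking every permutation of a small size confirms both that the pairs are distinct and that the product of the two lengths is at least n.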
Cx := {y ∈ Z2 : x ⇔ y}.
that is, θ(p) is the probability that the origin is connected by open paths to infinitely
many vertices. It is intuitively clear that the function θ(p) is non-decreasing. Indeed
consider the following alternative representation of the percolation process: to each
edge e, assign a uniform [0, 1] random variable Ue and declare the edge open if
Ue ≤ p. Using the same Ue s for densities p1 < p2 , it follows immediately from
the monotonicity of the construction that θ(p1 ) ≤ θ(p2 ). (We will have much more
to say about this type of “coupling” argument in Chapter 4.) Moreover note that
θ(0) = 0 and θ(1) = 1. The critical value is defined as
$$p_c(\mathbb{L}^2) := \sup\{p \geq 0 : \theta(p) = 0\},$$
the point at which the probability that the origin is contained in an infinite open
cluster becomes positive. Note that by a union bound over all vertices, when
θ(p) = 0, we have that Pp [∃x, |Cx | = +∞] = 0. Conversely, because {∃x, |Cx | =
+∞} is a tail event (see Definition B.3.9) for any enumeration of the edges, by Kol-
mogorov’s 0-1 law (Theorem B.3.11) it holds that Pp [∃x, |Cx | = +∞] = 1 when
θ(p) > 0.
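The monotone construction with uniform variables U_e can be simulated on a finite box, which is only a stand-in for the infinite lattice. The sketch below assigns one uniform variable per edge and computes the open cluster of the origin for two densities using the same variables; the function names and the box size are assumptions for illustration.

```python
import random
from collections import deque


def box_edges(m):
    """Nearest-neighbor edges of the lattice restricted to the box {-m, ..., m}^2."""
    edges = []
    for x in range(-m, m + 1):
        for y in range(-m, m + 1):
            if x < m:
                edges.append(((x, y), (x + 1, y)))
            if y < m:
                edges.append(((x, y), (x, y + 1)))
    return edges


def open_cluster_of_origin(edges, U, p):
    """Vertices joined to the origin by edges e with U[e] <= p (breadth-first search)."""
    adj = {}
    for e in edges:
        if U[e] <= p:
            u, v = e
            adj.setdefault(u, []).append(v)
            adj.setdefault(v, []).append(u)
    cluster = {(0, 0)}
    queue = deque([(0, 0)])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, []):
            if v not in cluster:
                cluster.add(v)
                queue.append(v)
    return cluster


rng = random.Random(0)
edges = box_edges(10)
U = {e: rng.random() for e in edges}
small = open_cluster_of_origin(edges, U, 0.4)
large = open_cluster_of_origin(edges, U, 0.6)
print(len(small), len(large), small <= large)
```

Because an edge that is open at density p₁ is also open at any p₂ > p₁ under this construction, the cluster at the lower density is always contained in the cluster at the higher density.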
Using the first moment method we show that the critical value is non-trivial,
that is, it is strictly between 0 and 1. This is a different example of a threshold
phenomenon.
Claim 2.2.13.
$$p_c(\mathbb{L}^2) \in (0, 1).$$
Proof. We first show that, for any p < 1/3, θ(p) = 0. In order to apply the first
moment method, roughly speaking, we need to reduce the problem to counting the
number of instances of an appropriately chosen substructure. The key observation
is the following:
$$\mathbb{P}_p[|C_0| = +\infty] \leq \mathbb{P}_p[\cap_n \{X_n > 0\}] = \lim_n \mathbb{P}_p[X_n > 0] \leq \limsup_n \mathbb{E}_p[X_n], \qquad (2.2.7)$$
where the last inequality follows from Theorem 2.2.6. We bound the number of paths of length n (each of which is open with probability $p^n$) by noting that they cannot backtrack. That gives 4 choices at the first step, and at most 3 choices at each subsequent step. Hence, we get the following bound
$$\mathbb{E}_p X_n \leq 4 (3^{n-1}) p^n.$$
The right-hand side goes to 0 for all p < 1/3. When combined with (2.2.7), that proves half of the claim:
$$p_c(\mathbb{L}^2) > 0.$$
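The counting bound can be compared with an exact enumeration for small n. The sketch below assumes that X_n counts open self-avoiding paths of length n started at the origin, which is how the surrounding argument reads but is not stated verbatim in the text shown here; the function names are hypothetical.

```python
def count_self_avoiding_paths(n):
    """Number of self-avoiding paths with n edges started at the origin of the square lattice."""

    def extend(end, visited, remaining):
        if remaining == 0:
            return 1
        x, y = end
        total = 0
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (x + dx, y + dy)
            if nxt not in visited:
                visited.add(nxt)
                total += extend(nxt, visited, remaining - 1)
                visited.remove(nxt)
        return total

    return extend((0, 0), {(0, 0)}, n)


def first_moment_bound(n, p):
    """The bound 4 * 3^(n-1) * p^n, written as (4/3) (3p)^n to avoid overflow."""
    return (4.0 / 3.0) * (3.0 * p) ** n


for n in range(1, 7):
    print(n, count_self_avoiding_paths(n), 4 * 3 ** (n - 1))
print(first_moment_bound(100, 0.3), first_moment_bound(100, 0.34))
```

The exact counts fall below 4 · 3^{n−1} from n = 4 on, since non-backtracking paths may still revisit a vertex; the bound tends to 0 precisely when 3p < 1.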
For the other direction, we show that θ(p) > 0 for p close enough to 1. This
time, we count “dual cycles.” This type of proof is known as a contour argument, or
Peierls’ argument, and is based on the following construction. Consider the dual lattice $\widetilde{\mathbb{L}}^2$ whose vertices are $\mathbb{Z}^2 + (1/2, 1/2)$ and whose edges connect vertices u, v with $\|u - v\|_1 = 1$. See Figure 2.3. Note that each edge in the primal lattice
L2 has a unique corresponding edge in the dual lattice which crosses it perpendic-
ularly. We make the same assignment, open or closed, for corresponding primal
and dual edges. The following graph-theoretic lemma, whose proof is sketched
below, forms the basis of contour arguments. Recall that cycles are “self-avoiding”
by definition (see Section 1.1.1). We say that a cycle is closed if all edges in the
induced subgraph are closed, that is, are not open.
Lemma 2.2.14 (Contour lemma). If $|C_0| < +\infty$, then there is a closed cycle around the origin in the dual lattice $\widetilde{\mathbb{L}}^2$.
To prove that θ(p) > 0 for p close enough to 1, the idea is to apply the first moment
method to Zn equal to the number of closed dual cycles of length n surrounding
the origin. We bound from above the number of dual cycles of length n around the origin by the number of choices for the starting edge across the upper y-axis and for each of the n − 1 subsequent non-backtracking choices. Namely,
when p > 2/3, where the first term in parentheses on the last line comes from differentiating with respect to q the geometric series $\sum_{m \geq 0} q^m$ and setting q := 1 − p. The expression on the last line can be made smaller than 1 if we let p approach
1. We have shown that θ(p) > 0 for p close enough to 1, and that concludes the
proof. (Exercise 2.3 sketches a proof that θ(p) > 0 for all p > 2/3.)
It is straightforward to extend the claim to Ld . (Exercise 2.4 asks for the de-
tails.)
Proof of the contour lemma We conclude this section by sketching the proof of
the contour lemma, which relies on topological arguments.
Proof of Lemma 2.2.14. Assume |C0 | < +∞. Imagine identifying each vertex in
L2 with a square of side 1 centered around it so that the sides line up with dual
edges. Paint green the squares of vertices in C0 . Paint red the squares of vertices
in C0c which share a side with a green square. Leave the other squares white.
Let u0 be a highest vertex in C0 along the y-axis and let v0 and v1 be the dual
vertices corresponding to the upper left and right corners respectively of the square
of u0 . Because u0 is highest, it must be that the square above it is red. Walk
along the dual edge {v0 , v1 } separating the squares of u0 and u0 + (0, 1) from v0
to v1 . Notice that this edge satisfies what we call the red-green property: as you
traverse it from v0 to v1 , a red square sits on your left and a green square is on your
right. Proceed further by iteratively walking along an incident dual edge with the
following rule. Choose an edge satisfying the red-green property, with the edges
to your left, straight ahead, and to your right in decreasing order of priority. Stop
when a previously visited dual vertex is reached. The claim is that this procedure
constructs the desired cycle. Let v0 , v1 , v2 , . . . be the dual vertices visited. By
construction {vi−1 , vi } is a dual edge for all i.
- A dual cycle is produced. We first argue that this procedure cannot get stuck.
Let {vi−1 , vi } be the edge just crossed and assume that it has the red-green
property. If there is a green square to the left ahead, then the edge to the
left, which has highest priority, has the red-green property. If the left square
ahead is not green, but the right one is, then the left square must in fact be
red by construction (i.e., it cannot be white). In that case, the edge straight
ahead has the red-green property. Finally, if neither square ahead is green,
then the right square must in fact be red because the square behind to the
right is green by assumption. That implies that the edge to the right has
the red-green property. Hence we have shown that the procedure does not
get stuck. Moreover, because by assumption the number of green squares
- The origin lies within the cycle. The inside of a cycle in the plane is well-
defined by the Jordan curve theorem. So the dual cycle produced above has
its adjacent green squares either on the inside (negative orientation) or on the
outside (positive orientation). In the former case the origin must lie inside
the cycle as otherwise the vertices corresponding to the green squares on the
inside would not be in C0 , as they could not be connected to the origin with
open paths.
So it remains to consider the latter case, where through a similar reasoning
the origin must lie outside the cycle. Let vj be the repeated dual vertex.
Assume first that vj 6= v0 and let vj−1 and vj+1 be the dual vertices preced-
ing and following vj during the first visit to vj . Let vk be the dual vertex
preceding vj on the second visit. After traversing the edge from vj−1 to vj ,
vk cannot be to the left or to the right because in those cases the red-green
properties of the two corresponding edges (i.e., {vj−1 , vj } and {vk , vj }) are
not compatible. So vk is straight ahead and, by the priority rules, vj+1 must
be to the left upon entering vj the first time. But in that case, for the origin
to lie outside the cycle as we are assuming and for the cycle to avoid the path
v0 , . . . , vj−1 , we must traverse the cycle with a negative orientation, that is,
the green squares adjacent to the cycle must be on the inside, a contradiction.
So, finally, assume v0 is the repeated vertex. If the cycle is traversed with a
positive orientation and the origin is on the outside, it must be that the cycle
crosses the y-axis at least once above u0 + (0, 1), again a contradiction.
Hence we have shown that the origin is inside the cycle.
Figure 2.4: Second moment method: if the standard deviation σX of X is less than
its expectation µX , then the probability that X is 0 is bounded away from 1.
P[Xn > 0] → 0. That is, although the expectation diverges, the probability that
Xn is positive can be arbitrarily small.
So we turn to the second moment. Intuitively the basis for the so-called second
moment method is that, if the expectation of Xn is large and its variance is rela-
tively small, then we can bound the probability that Xn is close to 0. As we will
see in applications, the first and second moment methods often work hand in hand.
$$\mathbb{P}[X = 0] \leq \mathbb{P}[|X - \mathbb{E}X| \geq \mathbb{E}X] \leq \frac{\mathrm{Var}[X]}{(\mathbb{E}X)^2}.$$
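$$\mathbb{E}X = \mathbb{E}[X \mathbf{1}_{\{X < \theta \mathbb{E}X\}}] + \mathbb{E}[X \mathbf{1}_{\{X \geq \theta \mathbb{E}X\}}] \leq \theta\, \mathbb{E}X + \sqrt{\mathbb{E}[X^2]\, \mathbb{P}[X \geq \theta \mathbb{E}X]},$$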
As an immediate application:
Theorem 2.3.2 (Second moment method). Let X be a nonnegative random variable (not identically zero). Then
$$\mathbb{P}[X > 0] \geq \frac{(\mathbb{E}X)^2}{\mathbb{E}[X^2]}. \qquad (2.3.3)$$
Proof. Take θ ↓ 0 in (2.3.2).
Since
$$\frac{(\mathbb{E}X)^2}{\mathbb{E}[X^2]} = 1 - \frac{\mathrm{Var}[X]}{(\mathbb{E}X)^2 + \mathrm{Var}[X]},$$
we see that (2.3.3) is stronger than (2.3.1).
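The comparison between the two lower bounds on P[X > 0] can be made concrete with a binomial random variable, for which every quantity is available in closed form. The sketch below uses exact rational arithmetic; the choice of the binomial distribution and the function name `lower_bounds_for_binomial` are assumptions made for illustration.

```python
from fractions import Fraction


def lower_bounds_for_binomial(n, p):
    """Return (Chebyshev-type bound, second moment bound, exact P[X > 0]) for X ~ Bin(n, p)."""
    p = Fraction(p)
    mean = n * p
    var = n * p * (1 - p)
    second_moment = var + mean ** 2
    chebyshev = 1 - var / mean ** 2            # from P[X = 0] <= Var[X] / (E X)^2
    paley_zygmund = mean ** 2 / second_moment  # the bound (2.3.3)
    exact = 1 - (1 - p) ** n
    return chebyshev, paley_zygmund, exact


for n, p in ((10, Fraction(1, 100)), (10, Fraction(1, 2)), (1000, Fraction(1, 100))):
    c, s, e = lower_bounds_for_binomial(n, p)
    print(n, p, float(c), float(s), float(e))
```

When the mean is small, the Chebyshev-type bound is negative and therefore vacuous, while the second moment bound stays close to the exact probability.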
We typically apply the second moment method to a sequence of random vari-
ables (Xn ). The previous theorem gives a uniform lower bound on the probability
that {Xn > 0} when $\mathbb{E}[X_n^2] \leq C (\mathbb{E}[X_n])^2$ for some C > 0. Just like the first
moment method, the second moment method is often applied to a sum of indica-
tors (but see Section 2.3.3 for a weighted case). We record in the next corollary a
convenient version of the method.
Corollary 2.3.3. Let $B_n = A_{n,1} \cup \cdots \cup A_{n,m_n}$, where $A_{n,1}, \ldots, A_{n,m_n}$ is a collection of events for each n. Write $i \overset{n}{\sim} j$ if $i \neq j$ and $A_{n,i}$ and $A_{n,j}$ are not independent. Then, letting
$$\mu_n := \sum_{i=1}^{m_n} \mathbb{P}[A_{n,i}], \qquad \gamma_n := \sum_{i \overset{n}{\sim} j} \mathbb{P}[A_{n,i} \cap A_{n,j}],$$
where the second sum is over ordered pairs, we have $\lim_n \mathbb{P}[B_n] > 0$ whenever $\mu_n \to +\infty$ and $\gamma_n \leq C \mu_n^2$ for some C > 0. If moreover $\gamma_n = o(\mu_n^2)$ then $\lim_n \mathbb{P}[B_n] = 1$.
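Proof. We apply the second moment method to $X_n := \sum_{i=1}^{m_n} \mathbf{1}_{A_{n,i}}$ so that $B_n = \{X_n > 0\}$. Note that
$$\mathrm{Var}[X_n] = \sum_i \mathrm{Var}[\mathbf{1}_{A_{n,i}}] + \sum_{i \neq j} \mathrm{Cov}[\mathbf{1}_{A_{n,i}}, \mathbf{1}_{A_{n,j}}],$$
where
$$\mathrm{Var}[\mathbf{1}_{A_{n,i}}] = \mathbb{E}[(\mathbf{1}_{A_{n,i}})^2] - (\mathbb{E}[\mathbf{1}_{A_{n,i}}])^2 \leq \mathbb{P}[A_{n,i}],$$
and, if $A_{n,i}$ and $A_{n,j}$ are independent,
$$\mathrm{Cov}[\mathbf{1}_{A_{n,i}}, \mathbf{1}_{A_{n,j}}] = 0,$$
whereas, if $i \overset{n}{\sim} j$,
$$\mathrm{Cov}[\mathbf{1}_{A_{n,i}}, \mathbf{1}_{A_{n,j}}] = \mathbb{E}[\mathbf{1}_{A_{n,i}} \mathbf{1}_{A_{n,j}}] - \mathbb{E}[\mathbf{1}_{A_{n,i}}]\, \mathbb{E}[\mathbf{1}_{A_{n,j}}] \leq \mathbb{P}[A_{n,i} \cap A_{n,j}].$$
Hence
$$\frac{\mathrm{Var}[X_n]}{(\mathbb{E}X_n)^2} \leq \frac{\mu_n + \gamma_n}{\mu_n^2} = \frac{1}{\mu_n} + \frac{\gamma_n}{\mu_n^2}.$$
Noting
$$\frac{(\mathbb{E}X_n)^2}{\mathbb{E}[X_n^2]} = \frac{(\mathbb{E}X_n)^2}{(\mathbb{E}X_n)^2 + \mathrm{Var}[X_n]} = \frac{1}{1 + \mathrm{Var}[X_n]/(\mathbb{E}X_n)^2},$$
and applying Theorem 2.3.2 gives the result.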
Subgraph containment
We first consider the clique number, then we turn to more general subgraphs.
Cliques Let ω(G) be the clique number of a graph G, that is, the size of its largest clique.
Claim 2.3.4. The property $\omega(G_n) \geq 4$ has threshold function $n^{-2/3}$.
Proof. Let $X_n$ be the number of 4-cliques in the random graph $G_n \sim \mathbb{G}_{n,p_n}$. Then, noting that there are $\binom{4}{2} = 6$ edges in a 4-clique,
$$\mathbb{E}_{n,p_n}[X_n] = \binom{n}{4} p_n^6 = \Theta(n^4 p_n^6),$$
which goes to 0 when $p_n \ll n^{-2/3}$. Hence the first moment method (Theorem 2.2.6) gives one direction: $\mathbb{P}_{n,p_n}[\omega(G_n) \geq 4] \to 0$ in that case.
For the other direction, we apply the second moment method for sums of indicators, that is, Corollary 2.3.3. We use the notation from that corollary. For an enumeration $S_1, \ldots, S_{m_n}$ of the 4-tuples of vertices in $G_n$, let $A_{n,1}, \ldots, A_{n,m_n}$ be the events that the corresponding 4-clique is present. By the calculation above we have $\mu_n = \Theta(n^4 p_n^6)$ which goes to +∞ when $p_n \gg n^{-2/3}$. Also $\mu_n^2 = \Theta(n^8 p_n^{12})$ so it suffices to show that $\gamma_n = o(n^8 p_n^{12})$. Note that two 4-cliques with disjoint edge sets (but possibly sharing one vertex) are independent (i.e., their presence or absence is independent). Suppose $S_i$ and $S_j$ share 3 vertices. Then $i \overset{n}{\sim} j$ and
$$\mathbb{P}_{n,p_n}[A_{n,i} \mid A_{n,j}] = p_n^3,$$
as the event $A_{n,j}$ implies that all edges between three of the vertices in $S_i$ are already present, and there are 3 edges between the remaining vertex and the rest of $S_i$. Similarly, if $|S_i \cap S_j| = 2$, we have again $i \overset{n}{\sim} j$ and this time $\mathbb{P}_{n,p_n}[A_{n,i} \mid A_{n,j}] = p_n^5$. Putting these together, we get by the definition of the conditional probability
where we used that $p_n \gg n^{-2/3}$ (so that for example $n^3 p_n^3 \gg 1$). Corollary 2.3.3 gives the result: $\mathbb{P}_{n,p_n}[\cup_i A_{n,i}] \to 1$ when $p_n \gg n^{-2/3}$.
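The quantities µ_n and γ_n for 4-cliques can be computed exactly, which makes the condition γ_n = o(µ_n²) tangible. The sketch below counts ordered pairs of 4-subsets sharing 3 or 2 vertices; such pairs span 9 and 11 distinct edges respectively, matching the conditional probabilities p_n³ and p_n⁵ in the proof. The function names and the test density p = n^{−1/2} (which is much larger than n^{−2/3}) are assumptions for illustration.

```python
import itertools
import math


def mu_gamma_four_cliques(n, p):
    """Return (mu_n, gamma_n) for the events 'S_i spans a 4-clique' in G_{n,p}."""
    subsets = math.comb(n, 4)
    mu = subsets * p ** 6
    # Ordered pairs sharing 3 vertices: drop one of 4 vertices, add one of the n - 4 others.
    pairs3 = subsets * 4 * (n - 4)
    # Ordered pairs sharing 2 vertices: keep 2 of 4 vertices, add 2 of the n - 4 others.
    pairs2 = subsets * 6 * math.comb(n - 4, 2)
    gamma = pairs3 * p ** 9 + pairs2 * p ** 11
    return mu, gamma


def count_pairs_bruteforce(n, overlap):
    """Count ordered pairs of distinct 4-subsets of {0, ..., n-1} with a given overlap size."""
    subsets = [set(s) for s in itertools.combinations(range(n), 4)]
    return sum(1 for S in subsets for T in subsets if S != T and len(S & T) == overlap)


for n in (100, 1000, 10000):
    mu, gamma = mu_gamma_four_cliques(n, n ** -0.5)
    print(n, mu, gamma / mu ** 2)
```

The printed ratio γ_n / µ_n² decreases with n while µ_n grows, which is the regime where Corollary 2.3.3 applies.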
Roughly speaking, the first and second moments suffice to pinpoint the thresh-
old in this case because the indicators in Xn are “mostly” pairwise independent
and, as a result, the sum is “concentrated around its mean.”
General subgraphs The methods of Claim 2.3.4 can be applied to more general
subgraphs. However the situation is somewhat more complicated than it is for
cliques. For a graph H0 , let vH0 and eH0 be the number of vertices and edges of
H0 respectively. Let Xn be the number of (not necessarily induced) copies of H0
in Gn ∼ Gn,pn . By the first moment method,
$$\mathbb{P}[X_n > 0] \leq \mathbb{E}[X_n] = \Theta(n^{v_{H_0}} p_n^{e_{H_0}}) \to 0,$$
when $p_n \ll n^{-v_{H_0}/e_{H_0}}$. The constant factor, which does not play a role in the asymptotics, accounts in particular for the number of automorphisms of $H_0$. Indeed note that a fixed set of $v_{H_0}$ vertices can contain several distinct copies of $H_0$, depending on its structure (and unlike the clique case).
From the proof of Claim 2.3.4, one might guess that the threshold function is $n^{-v_{H_0}/e_{H_0}}$. That is not the case in general. To see what can go wrong, consider the graph $H_0$ in Figure 2.5 whose edge density is $\frac{e_{H_0}}{v_{H_0}} = \frac{6}{5}$. When $p_n \gg n^{-5/6}$,
the expected number of copies of $H_0$ in $G_n$ tends to +∞. But observe that the subgraph H of $H_0$ has the higher density 5/4 and, hence, when $n^{-5/6} \ll p_n \ll n^{-4/5}$ the expected number of copies of H tends to 0. By the first moment method, the probability that a copy of $H_0$ (and therefore H) is present in that regime is asymptotically negligible despite its diverging expectation. This leads to the following definition
$$r_{H_0} := \max\left\{ \frac{e_H}{v_H} : \text{subgraphs } H \subseteq H_0,\ e_H > 0 \right\}.$$
Assume H0 has at least one edge.
Claim 2.3.5. “Having a copy of $H_0$” has threshold $n^{-1/r_{H_0}}$.
Proof. We proceed as in Claim 2.3.4. Let $H_0^*$ be a subgraph of $H_0$ achieving $r_{H_0}$. When $p_n \ll n^{-1/r_{H_0}}$, the probability that a copy of $H_0^*$ is in $G_n$ tends to 0 by the argument above. Therefore the same conclusion holds for $H_0$ itself.
Assume $p_n \gg n^{-1/r_{H_0}}$. Let $S_1, \ldots, S_{m_n}$ be an enumeration of the copies (as subgraphs) of $H_0$ in a complete graph on the vertices of $G_n$. Let $A_{n,i}$ be the event that $S_i \subseteq G_n$. Using again the notation of Corollary 2.3.3,
$$\mu_n = \Theta(n^{v_{H_0}} p_n^{e_{H_0}}) = \Omega(\Phi_{H_0}(n)),$$
where
$$\Phi_{H_0}(n) := \min\left\{ n^{v_H} p_n^{e_H} : \text{subgraphs } H \subseteq H_0,\ e_H > 0 \right\}.$$
CHAPTER 2. MOMENTS AND TAILS 51
Note that ΦH0 (n) → +∞ when pn n−1/rH0 by definition of rH0 . The events
An,i and An,j are independent if Si and Sj share no edge. Otherwise we write
n
i ∼ j. Note that there are Θ(nvH n2(vH0 −vH ) ) pairs Si , Sj whose intersection is
isomorphic to H. The probability that both Si and Sj of such a pair are present in
2(eH0 −eH )
Gn is Θ(penH pn ). Hence
X
γn = P[An,i ∩ An,j ]
n
i∼j
2eH −eH
X
= Θ n2vH0 −vH pn 0
H⊆H0 ,eH >0
Θ(µ2n )
≤
Θ(ΦH0 (n))
= o(µ2n ),
where we used that ΦH0 (n) → +∞. The result follows from Corollary 2.3.3.
Going back to the example of Figure 2.5, the proof above confirms that when
$n^{-5/6} \ll p_n \ll n^{-4/5}$ the second moment method fails for $H_0$ since $\Phi_{H_0}(n) \to 0$.
In that regime, although there is in expectation a large number of copies of H0 ,
those copies are highly correlated as they are produced from a small (vanishing
in expectation) number of copies of H—producing a large variance that helps to
explain the failure of the second moment method.
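As a small illustration of the definition of $r_{H_0}$, the following sketch computes the maximal density by brute force. The graph used here is hypothetical: it is chosen only to be consistent with the densities $6/5$ and $5/4$ quoted above and is not claimed to be the graph of Figure 2.5. The function name `max_density` is ours.

```python
from fractions import Fraction
from itertools import combinations

def max_density(vertices, edges):
    """Return max e_H / v_H over subgraphs H with at least one edge.

    Adding edges never lowers the density, so it suffices to scan
    induced subgraphs, that is, subsets of vertices.
    """
    best = Fraction(0)
    for size in range(2, len(vertices) + 1):
        for subset in combinations(vertices, size):
            s = set(subset)
            e = sum(1 for (u, v) in edges if u in s and v in s)
            if e > 0:
                best = max(best, Fraction(e, size))
    return best

# Hypothetical H0: K4 minus one edge on {0, 1, 2, 3}, plus a pendant vertex 4.
H0_VERTICES = [0, 1, 2, 3, 4]
H0_EDGES = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (3, 4)]

r = max_density(H0_VERTICES, H0_EDGES)
print(r)  # the threshold in Claim 2.3.5 is then n^(-1/r)
```

For this graph the overall density is $6/5$ while the densest subgraph sits on the four vertices $\{0,1,2,3\}$.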
Connectivity threshold
Next we use the second moment method to show that the threshold function for
connectivity in the Erdős-Rényi random graph model is $\frac{\log n}{n}$. In fact we prove this
result by deriving the threshold function for the presence of isolated vertices. The
connection between the two is obvious in one direction. Isolated vertices imply a
disconnected graph. What is less obvious is that it also works the other way in the
following sense: the two thresholds actually coincide.
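Proof. Let $X_n$ be the number of isolated vertices in the random graph $G_n \sim G_{n,p_n}$.
Using $1 - x \le e^{-x}$ for all $x \in \mathbb{R}$ (see Exercise 1.16),
\[
\mathbb{E}_{n,p_n}[X_n] = n(1 - p_n)^{n-1} \le e^{\log n - (n-1) p_n} \to 0, \tag{2.3.4}
\]
when $p_n \gg \frac{\log n}{n}$. So the first moment method gives one direction:
$\mathbb{P}_{n,p_n}[X_n > 0] \to 0$.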
For the other direction, we use the second moment method. Let $A_{n,j}$ be the
event that vertex $j$ is isolated. By the computation above, using $1 - x \ge e^{-x - x^2}$
for $x \in [0, 1/2]$ (see Exercise 1.16 again),
\[
\mu_n = \sum_i \mathbb{P}_{n,p_n}[A_{n,i}] = n(1 - p_n)^{n-1} \ge e^{\log n - n p_n - n p_n^2}, \tag{2.3.5}
\]
which goes to $+\infty$ when $p_n \ll \frac{\log n}{n}$. Note that $A_{n,i}$ and $A_{n,j}$ are not independent
for all $i \ne j$ (because the absence of an edge between $i$ and $j$ is part of both events)
and
\[
\mathbb{P}_{n,p_n}[A_{n,i} \cap A_{n,j}] = (1 - p_n)^{2(n-2)+1},
\]
so that
\[
\gamma_n = \sum_{i \ne j} \mathbb{P}_{n,p_n}[A_{n,i} \cap A_{n,j}] = n(n-1)(1 - p_n)^{2n-3}.
\]
Because $\gamma_n$ is not $o(\mu_n^2)$, we cannot apply Corollary 2.3.3. Instead we use
Theorem 2.3.2 directly. We have
\[
\begin{aligned}
\frac{\mathbb{E}_{n,p_n}[X_n^2]}{\mathbb{E}_{n,p_n}[X_n]^2} &= \frac{\mu_n + \gamma_n}{\mu_n^2} \\
&\le \frac{n(1 - p_n)^{n-1} + n^2 (1 - p_n)^{2n-3}}{n^2 (1 - p_n)^{2n-2}} \\
&\le \frac{1}{n(1 - p_n)^{n-1}} + \frac{1}{1 - p_n},
\end{aligned} \tag{2.3.6}
\]
which is $1 + o(1)$ when $p_n \ll \frac{\log n}{n}$ by (2.3.5). The second moment method implies
that $\mathbb{P}_{n,p_n}[X_n > 0] \to 1$ in that case.
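To get a feel for the two regimes, the following sketch evaluates the mean $\mu_n = n(1-p_n)^{n-1}$ and the right-hand side of (2.3.6) numerically. It assumes that the second moment method is used in the form $\mathbb{P}[X_n > 0] \ge (\mathbb{E} X_n)^2 / \mathbb{E}[X_n^2]$; the function name and the particular values of $n$ and $p_n$ are illustrative choices of ours.

```python
import math

def isolated_vertex_moments(n, p):
    """Return (mu, ratio): mu = E[X_n] and the bound (2.3.6) on E[X_n^2] / E[X_n]^2."""
    mu = n * (1.0 - p) ** (n - 1)
    ratio = 1.0 / mu + 1.0 / (1.0 - p)
    return mu, ratio

n = 10_000
p_low = 0.5 * math.log(n) / n    # below the threshold log(n) / n
p_high = 2.0 * math.log(n) / n   # above the threshold

mu_low, ratio_low = isolated_vertex_moments(n, p_low)
mu_high, _ = isolated_vertex_moments(n, p_high)

print(mu_low, 1.0 / ratio_low)   # large mean; lower bound on P[X_n > 0]
print(mu_high)                   # small mean; upper bound on P[X_n > 0]
```

Below the threshold the lower bound is already close to 1 for moderate $n$, while above it the first moment is tiny.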
Proof. We start with the easy direction. If $p_n \ll \frac{\log n}{n}$, Claim 2.3.6 implies that the
graph has at least one isolated vertex (and therefore is necessarily disconnected)
with probability going to $1$ as $n \to +\infty$.
Assume now that $p_n \gg \frac{\log n}{n}$. Let $D_n$ be the event that $G_n$ is disconnected.
To bound $\mathbb{P}_{n,p_n}[D_n]$, we let $Y_k$ be the number of subsets of $k$ vertices that are
disconnected from all other vertices in the graph for $k \in \{1, \ldots, n/2\}$. Then, by
the first moment method,
\[
\mathbb{P}_{n,p_n}[D_n] = \mathbb{P}_{n,p_n}\left[\sum_{k=1}^{n/2} Y_k > 0\right] \le \sum_{k=1}^{n/2} \mathbb{E}_{n,p_n}[Y_k].
\]
The expectation of $Y_k$ is straightforward to bound. Using that $k \le n/2$ and
$\binom{n}{k} \le n^k$,
\[
\mathbb{E}_{n,p_n}[Y_k] = \binom{n}{k} (1 - p_n)^{k(n-k)} \le \left( n (1 - p_n)^{n/2} \right)^k.
\]
The expression in parentheses is $o(1)$ when $p_n \gg \frac{\log n}{n}$ by a calculation similar
to (2.3.4). Summing over $k$,
\[
\mathbb{P}_{n,p_n}[D_n] \le \sum_{k=1}^{+\infty} \left( n (1 - p_n)^{n/2} \right)^k = O\big(n (1 - p_n)^{n/2}\big) = o(1).
\]
A closer look We have shown that connectivity and the absence of isolated ver-
tices have the same threshold function. In fact, in a sense, isolated vertices are the
“last obstacle” to connectivity. A slight modification of the proof above leads to
the following more precise result. For $k \in \{1, \ldots, n/2\}$, let $Z_k$ be the number
of connected components of size $k$ in $G_n$. In particular, $Z_1$ is the number of
isolated vertices. We consider the “critical window” $p_n = \frac{c_n}{n}$ where $c_n := \log n + s$
for some fixed $s \in \mathbb{R}$. We show that, in that regime, asymptotically the graph is
typically composed of a large connected component together with some isolated
vertices. Formally, we prove the following claim which says that with probability
close to one: either the graph is connected or there are some isolated vertices
together with a (necessarily unique) connected component of size greater than $n/2$.
Claim 2.3.8.
\[
\mathbb{P}_{n,p_n}[Z_1 > 0] \ge \frac{1}{1 + e^s} + o(1) \quad \text{and} \quad \mathbb{P}_{n,p_n}\left[\sum_{k=2}^{n/2} Z_k > 0\right] = o(1).
\]
The limit of Pn,pn [Z1 > 0] can be computed explicitly using the method of mo-
ments. See Exercise 2.19.
Proof of Claim 2.3.8. We first consider isolated vertices. From (2.3.5), (2.3.6) and
the second moment method,
\[
\mathbb{P}_{n,p_n}[Z_1 > 0] \ge \left( e^{-\log n + n p_n + n p_n^2} + \frac{1}{1 - p_n} \right)^{-1} = \frac{1}{1 + e^s} + o(1),
\]
as $n \to +\infty$ by our choice of $p_n$.
To bound the number of components of size k > 1, we note first that the
random variable Yk used in the previous claim (which imposes no condition on
the edges between the vertices in the subsets of size k) is too loose to provide a
suitable bound. Instead, to bound the probability that a subset of k vertices forms
a connected component, we observe that a connected component is characterized
by two properties: it is disconnected from the rest of the graph; and it contains
a spanning tree. Formally, for $k = 2, \ldots, n/2$, we let $Z'_k$ be the number of (not
necessarily induced) maximal trees of size $k$ or, put differently, the number of
spanning trees of connected components of size $k$. Then, by the first moment
method, the probability that a connected component of size $> 1$ is present in $G_n$ is
bounded by
\[
\mathbb{P}_{n,p_n}\left[\sum_{k=2}^{n/2} Z_k > 0\right] \le \mathbb{P}_{n,p_n}\left[\sum_{k=2}^{n/2} Z'_k > 0\right] \le \sum_{k=2}^{n/2} \mathbb{E}_{n,p_n}[Z'_k]. \tag{2.3.7}
\]
To bound the expectation of $Z'_k$, we use Cayley’s formula which states that there
are $k^{k-2}$ trees on a set of $k$ labeled vertices. Recall further that a tree on $k$ vertices
has $k - 1$ edges (see Exercise 1.7). Hence,
\[
\mathbb{E}_{n,p_n}[Z'_k] = \underbrace{\binom{n}{k} k^{k-2}}_{(a)} \, \underbrace{p_n^{k-1}}_{(b)} \, \underbrace{(1 - p_n)^{k(n-k)}}_{(c)},
\]
where (a) is the number of trees of size $k$ (as subgraphs) in a complete graph of
size $n$, (b) is the probability that such a tree is present in the graph, and (c) is
the probability that this tree is disconnected from every other vertex in the graph.
Using that $k! \ge (k/e)^k$ (see Appendix A) and $1 - x \le e^{-x}$ for all $x \in \mathbb{R}$ (see
Exercise 1.16),
\[
\begin{aligned}
\mathbb{E}_{n,p_n}[Z'_k] &\le \frac{n^k}{k!} k^{k-2} p_n^{k-1} (1 - p_n)^{k(n-k)} \\
&\le \frac{n^k e^k}{k^k} k^k \, n p_n^k \, e^{-p_n k(n-k)} \\
&= n \left( e c_n e^{-\left(1 - \frac{k}{n}\right) c_n} \right)^k \\
&= n \left( e (\log n + s) e^{-\left(1 - \frac{k}{n}\right)(\log n + s)} \right)^k.
\end{aligned}
\]
\[
\sum_{k=3}^{n/2} \mathbb{E}_{n,p_n}[Z'_k] \le \sum_{k=3}^{+\infty} n \left( e (\log n + s) e^{-\frac{1}{2}(\log n + s)} \right)^k = O(n^{-1/2} \log^3 n) = o(1).
\]
Plugging this back into (2.3.7) gives the second claim in the statement.
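Cayley’s formula, used above to count the trees on $k$ labeled vertices, can be checked by brute force for small $k$. The sketch below (the function name is ours) enumerates all sets of $k-1$ edges of the complete graph on $k$ vertices and counts those that are connected, using a union-find structure.

```python
from itertools import combinations

def count_labeled_trees(k):
    """Count spanning trees of the complete graph on k labeled vertices by brute force."""
    all_edges = list(combinations(range(k), 2))
    count = 0
    for edge_set in combinations(all_edges, k - 1):
        # A graph with k - 1 edges on k vertices is a tree iff it is connected.
        parent = list(range(k))

        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x

        components = k
        for u, v in edge_set:
            ru, rv = find(u), find(v)
            if ru != rv:
                parent[ru] = rv
                components -= 1
        if components == 1:
            count += 1
    return count

for k in range(2, 6):
    print(k, count_labeled_trees(k), k ** (k - 2))
```

The enumeration grows very quickly with $k$, so this is only meant as a sanity check on small cases.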
where recall that the percolation function is θ(p) = Pp [|C0 | = +∞]. We then
consider general trees, introduce the branching number, and present a weighted
version of the second moment method.
Claim 2.3.9.
\[
p_c(\mathbb{T}_d) = \frac{1}{d - 1}.
\]
Proof. Let $\partial_n$ be the $n$-th level of $\mathbb{T}_d$, that is, the set of vertices at graph distance $n$
from $0$. Let $X_n$ be the number of vertices in $\partial_n \cap \mathcal{C}_0$. In order for the open cluster of
the root to be infinite, there must be at least one vertex on the $n$-th level connected
to the root by an open path. By the first moment method (Theorem 2.2.6),
we have
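\[
\begin{aligned}
\mathbb{E}_p[X_n^2] &= \mathbb{E}_p\left[\left(\sum_{x \in \partial_n} \mathbf{1}_{\{x \in \mathcal{C}_0\}}\right)^2\right] \\
&= \sum_{x, y \in \partial_n} \mathbb{P}_p[x, y \in \mathcal{C}_0] \\
&= \sum_{x \in \partial_n} \mathbb{P}_p[x \in \mathcal{C}_0] + \sum_{m=0}^{n-1} \sum_{x, y \in \partial_n} \mathbf{1}_{\{x \wedge y \in \partial_m\}} \, p^m p^{2(n-m)} \\
&= \mu_n + (b+1) b^{n-1} \sum_{m=0}^{n-1} (b-1) b^{(n-m)-1} p^{2n-m} \\
&\le \mu_n + (b+1)(b-1) b^{2n-2} p^{2n} \sum_{m=0}^{+\infty} (bp)^{-m} \\
&= \mu_n + \mu_n^2 \cdot \frac{b-1}{b+1} \cdot \frac{1}{1 - (bp)^{-1}},
\end{aligned}
\]
where, on the fourth line, we used that all vertices on the $n$-th level are equivalent
and that, for a fixed $x$, the set $\{y : x \wedge y \in \partial_m\}$ is composed of those vertices in $\partial_n$
that are descendants of $x \wedge y$ but not in the descendant subtree of $x \wedge y$ containing
$x$. When $p > \frac{1}{d-1} = \frac{1}{b}$, dividing by $(\mathbb{E}_p X_n)^2 = \mu_n^2 \to +\infty$, we get
\[
\begin{aligned}
\frac{\mathbb{E}_p[X_n^2]}{(\mathbb{E}_p X_n)^2} &\le \frac{1}{\mu_n} + \frac{b-1}{b+1} \cdot \frac{1}{1 - (bp)^{-1}} \qquad (2.3.9) \\
&\le 1 + \frac{b-1}{b+1} \cdot \frac{1}{1 - (bp)^{-1}} \\
&=: C_{b,p}.
\end{aligned}
\]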
which concludes the proof. (Note that the version of the second moment method
in Equation (2.3.1) does not work here. Subtract 1 in (2.3.9) and take p close to
1/b.)
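The second moment computation above is explicit enough to be evaluated numerically. The sketch below assumes $\mu_n = (b+1) b^{n-1} p^n$ with $b = d - 1$, which is the value consistent with the fourth and last lines of the display (the formula for $\mu_n$ itself is not restated here), and compares the exact ratio with the constant $C_{b,p}$. Function names are ours.

```python
def tree_second_moment_ratio(b, p, n):
    """Return (mu_n, E[X_n^2] / mu_n^2) for bond percolation on T_d with d = b + 1."""
    mu = (b + 1) * b ** (n - 1) * p ** n
    cross = sum((b - 1) * b ** (n - m - 1) * p ** (2 * n - m) for m in range(n))
    second = mu + (b + 1) * b ** (n - 1) * cross
    return mu, second / mu ** 2

def c_bound(b, p):
    """The constant C_{b,p}; only meaningful when b * p > 1."""
    return 1.0 + (b - 1) / (b + 1) / (1.0 - 1.0 / (b * p))

b, p = 2, 0.6   # supercritical on the binary-branching tree: b * p = 1.2 > 1
for n in (5, 10, 20):
    mu, ratio = tree_second_moment_ratio(b, p, n)
    print(n, mu, ratio, c_bound(b, p))
```

In this example the ratio stays bounded as $n$ grows even though $\mu_n \to +\infty$, which is what the second moment method requires.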
The argument above relies crucially on the fact that, in a tree, any two vertices
are connected by a unique path. For instance, approximating Pp [x ∈ C0 ] is much
harder on a lattice. Note furthermore that, intuitively, the reason why the first mo-
ment captures the critical threshold exactly in this case is that bond percolation on
Td is a “branching process” (defined formally and studied at length in Chapter 6),
where Xn represents the “population size at generation n.” The qualitative behav-
ior of a branching process is governed by its expectation: when the mean number of
children bp exceeds 1, the process grows exponentially on average and “explodes”
with positive probability (see Theorem 6.1.6). We will come back to this point of
view in Section 6.2.4 where branching processes are used to give a more refined
analysis of bond percolation on Td .
General trees Let T be a locally finite tree (i.e., all its degrees are finite) with
root 0. For an edge e, let ve be the endvertex of e furthest from the root. We denote
by |e| the graph distance between 0 and ve . Generalizing a previous definition
from Section 1.1.1 to infinite, locally finite graphs, a cutset separating 0 and +∞ is
a finite set of edges Π such that all infinite paths (which, recall, are self-avoiding by
definition) starting at 0 go through Π. (For our purposes, it will suffice to assume
that cutsets are finite by default.) For a cutset Π, we let Πv := {ve : e ∈ Π}.
Repeating the argument in (2.3.8), for any cutset Π,
Using the max-flow min-cut theorem (Theorem 1.1.15), the branching number can
also be characterized in terms of a “flow to +∞.” We will not do this here. (But
see Theorem 3.3.30.)
Equation (2.3.10) implies that $p_c(\mathcal{T}) \ge \frac{1}{\mathrm{br}(\mathcal{T})}$. Remarkably, this bound is
tight. The proof is based on a “weighted” second moment method argument.
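Proof. Suppose $p < \frac{1}{\mathrm{br}(\mathcal{T})}$. Then $p^{-1} > \mathrm{br}(\mathcal{T})$ and the sum in (2.3.10) can be
made arbitrarily small by definition of the branching number, that is, $\theta(p) = 0$.
Hence we have shown that $p_c(\mathcal{T}) \ge \frac{1}{\mathrm{br}(\mathcal{T})}$.
To argue in the other direction, let $p > \frac{1}{\mathrm{br}(\mathcal{T})}$, $p^{-1} < \lambda < \mathrm{br}(\mathcal{T})$, and $\varepsilon > 0$
such that
\[
\sum_{e \in \Pi} \lambda^{-|e|} \ge \varepsilon \tag{2.3.12}
\]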
for all cutsets Π. The existence of such an ε is guaranteed by the definition of the
branching number. As in the proof of Claim 2.3.9, we use that θ(p) is the limit
as n → +∞ of the probability that C0 reaches the n-th level (i.e., the vertices at
graph distance n from the root 0, which is necessarily a finite set in a locally finite
tree). However, this time, we use a weighted count on the n-th level. Let Tn be
the first n levels of T and, as before, let ∂n be the vertices on the n-th level. For a
probability measure $\nu_n$ on $\partial_n$, we define the weighted count
\[
X_n = \sum_{z \in \partial_n} \nu_n(z) \frac{\mathbf{1}_{\{z \in \mathcal{C}_0\}}}{\mathbb{P}_p[z \in \mathcal{C}_0]}.
\]
Observe that, while $\nu_n(z)$ may be $0$ for some $z$s (but not all), we still have that
$X_n > 0, \forall n$ implies $\{|\mathcal{C}_0| = +\infty\}$, which is what we need to apply the second
moment method.
Because of (2.3.12), a natural choice of νn follows from the max-flow min-
cut theorem (Theorem 1.1.15) applied to Tn with source 0, sink ∂n and capacity
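constraint $|\phi(x, y)| \le \kappa(e) := \varepsilon^{-1} \lambda^{-|e|}$ for all edges $e = \{x, y\}$. Indeed, for all
cutsets $\Pi$ in $\mathcal{T}_n$ separating $0$ and $\partial_n$, we have $\sum_{e \in \Pi} \kappa(e) = \sum_{e \in \Pi} \varepsilon^{-1} \lambda^{-|e|} \ge 1$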
by (2.3.12). That then guarantees by Theorem 1.1.15 the existence of a unit flow
φ from 0 to ∂n satisfying the capacity constraints. Define νn (z) to be the flow
entering z ∈ ∂n under φ. In particular, because φ is a unit flow, νn defines a
probability measure. It remains to bound the second moment of Xn under this
choice. We have
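\[
\begin{aligned}
\mathbb{E}_p X_n^2 &= \mathbb{E}_p\left[\left(\sum_{z \in \partial_n} \nu_n(z) \frac{\mathbf{1}_{\{z \in \mathcal{C}_0\}}}{\mathbb{P}_p[z \in \mathcal{C}_0]}\right)^2\right] \\
&= \sum_{x, y \in \partial_n} \nu_n(x) \nu_n(y) \frac{\mathbb{P}_p[x, y \in \mathcal{C}_0]}{\mathbb{P}_p[x \in \mathcal{C}_0] \, \mathbb{P}_p[y \in \mathcal{C}_0]} \\
&= \sum_{m=0}^{n} \sum_{x, y \in \partial_n} \mathbf{1}_{\{x \wedge y \in \partial_m\}} \nu_n(x) \nu_n(y) \frac{p^m p^{2(n-m)}}{p^{2n}} \\
&= \sum_{m=0}^{n} p^{-m} \sum_{z \in \partial_m} \left( \sum_{x, y \in \partial_n} \mathbf{1}_{\{x \wedge y = z\}} \nu_n(x) \nu_n(y) \right).
\end{aligned}
\]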
where the second equality follows from the construction of $\nu_n$. It follows that
\[
\theta(p) \ge C_{\varepsilon,\lambda,p}^{-1} > 0,
\]
and $p_c(\mathcal{T}) \le \frac{1}{\mathrm{br}(\mathcal{T})}$. That concludes the proof.
Note that Claims 2.3.9 and 2.3.11 imply that br(Td ) = d−1. The next example
is more striking and insightful.
Example 2.3.12 (The 3–1 tree). The 3–1 tree $\widehat{\mathcal{T}}_{3-1}$ is an infinite rooted tree. We
give a planar description. The root $\rho$ (level 0) is at the top. It has two children
below it (level 1). Then on level $n$, for $n \ge 1$, the first $2^{n-1}$ vertices starting from
the left have exactly 1 child and the next $2^{n-1}$ vertices have exactly 3 children. In
particular level $n$ has $2^n$ vertices, which we denote by $u_{n,1}, \ldots, u_{n,2^n}$. For vertex
$u_{n,j}$ we refer to $j/2^n$ as its relative position (on level $n$). So vertices have 1 or 3
children according to whether their relative position is $\le 1/2$ or $> 1/2$.
Because the level size is growing at rate 2, it is tempting to conjecture that the
branching number is 2—but that turns out to be way off.
What makes this tree entirely different from the infinite 2-ary tree, despite having
the same level growth, is that each infinite path from the root in $\widehat{\mathcal{T}}_{3-1}$ eventually
“stops branching,” with the sole exception of the rightmost path which we refer to
as the main path. Indeed, let $\Gamma = v_0 \sim v_1 \sim v_2 \sim \cdots$ with $v_0 = \rho$ be an infinite
path distinct from the main path. Let $x_i$ be the relative position of $v_i$, $i \ge 1$. Let $v_k$
be the first vertex of $\Gamma$ not on the main path. It lies on the $k$-th level.
Lemma 2.3.14. Let $v$ be a vertex that is not on the main path with relative position
$x$ and assume that $0 \le x \le \alpha < 1$. Let $w$ be a child of $v$ and denote by $y$ its
relative position. Then
\[
y \le \begin{cases} \frac{1}{2} x & \text{if } x \le 1/2, \\ x - \frac{1}{2}(1 - \alpha) & \text{otherwise.} \end{cases}
\]
Proof. Assume without loss of generality that $v = u_{n,j}$ for some $n$ and $j < 2^n$. If
$j \le 2^{n-1}$, then by construction $v$ has exactly one child with relative position
\[
y = \frac{j}{2^{n+1}} = \frac{1}{2} x.
\]
That proves the first claim.
If $j > 2^{n-1}$, then all vertices to the right of $v$ have 3 children, all of whom are
to the right of the children of $v$. Hence the children of $v$ have relative position at
most
\[
y \le \frac{2^{n+1} - 3(2^n - j)}{2^{n+1}} = \frac{3j - 2^n}{2^{n+1}} = \frac{3}{2} x - \frac{1}{2}.
\]
Subtracting $x$ and using $x \le \alpha$ gives the second claim.
We now apply Lemma 2.3.14 to $v_k$ as defined above and its descendants on $\Gamma$ with
$\alpha = 1 - 1/2^k$. We get that the relative position decreases from $v_k$ by $1/2^{k+1}$ on
each level until it falls below $1/2$ at which point it gets cut in half at each level.
Once this last regime is reached, each vertex on Γ from then on has exactly one
child—that is, there is no more branching.
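The two cases in the proof of Lemma 2.3.14 can be checked mechanically on the first few levels of the 3–1 tree. The sketch below indexes the vertices of level $n$ from left to right as in Example 2.3.12; the helper names `children` and `check_lemma` are ours, and exact rational arithmetic is used for the relative positions.

```python
from fractions import Fraction

def children(n, j):
    """Indices on level n + 1 of the children of u_{n,j} (1-indexed), for n >= 1."""
    half = 2 ** (n - 1)
    if j <= half:
        return [j]                        # one child
    first = half + 3 * (j - half - 1) + 1
    return [first, first + 1, first + 2]  # three children

def check_lemma(max_level):
    """Check both cases of the proof of Lemma 2.3.14 on levels 1..max_level."""
    for n in range(1, max_level + 1):
        # Level n + 1 must have exactly 2^(n+1) vertices.
        assert sum(len(children(n, j)) for j in range(1, 2 ** n + 1)) == 2 ** (n + 1)
        for j in range(1, 2 ** n):        # j < 2^n: skip the main path
            x = Fraction(j, 2 ** n)
            y = Fraction(max(children(n, j)), 2 ** (n + 1))
            if j <= 2 ** (n - 1):
                assert y == x / 2
            else:
                assert y == Fraction(3, 2) * x - Fraction(1, 2)
    return True

print(check_lemma(8))
```

Here `y` is the relative position of the rightmost child, so the equalities checked are the extremal cases of the inequalities in the lemma.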
We are now ready to prove the claim.
Proof of Claim 2.3.13. Take any $\lambda > 1$. From the definition of the branching
number (Definition 2.3.10), it suffices to find a sequence of cutsets $(\Pi_n)_n$ such that
\[
\sum_{e \in \Pi_n} \lambda^{-|e|} \to 0,
\]
After level $J^*$, the leftmost descendant of $w''$ has relative position $\le 1/2$ by
Lemma 2.3.14. Therefore we need $n > J^*$ to ensure that $w''$ has the desired
property. Taking
\[
\ell_n = n \, \frac{\log 3/2}{\log 3}, \tag{2.3.14}
\]
will do for $n$ large enough.
Finishing up. By construction, $\Phi_n$ is a cutset for all $n \ge n_0$. Moreover
\[
\sum_{e \in \Phi_n} \lambda^{-|e|} = 2^{n-1} \lambda^{-m_n} + \lambda^{-\ell_n} < \frac{1}{n},
\]
for $n$ large enough, where we used (2.3.13) and (2.3.14). Taking $n \to +\infty$ gives
the claim.
As a consequence of Claims 2.3.11 and 2.3.13, $|\mathcal{C}_\rho| < +\infty$ almost surely for
all $p < 1$ on $\widehat{\mathcal{T}}_{3-1}$. ◁
Proof. By the change of variable $y = x + z$ and using $e^{-z^2/2} \le 1$
\[
\int_x^{+\infty} e^{-y^2/2} \, dy \le e^{-x^2/2} \int_0^{+\infty} e^{-xz} \, dz = e^{-x^2/2} x^{-1}.
\]
where
\[
\mathbb{P}[X \ge \beta] = \mathbb{P}[e^{sX} \ge e^{s\beta}] \le \frac{M_X(s)}{e^{s\beta}} = \exp\left[ -\{ s\beta - \Psi_X(s) \} \right].
\]
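Returning to the Gaussian case, let $X \sim N(0, \nu)$ where $\nu > 0$ is the variance
and note that
\[
\begin{aligned}
M_X(s) &= \int_{-\infty}^{+\infty} e^{sx} \frac{1}{\sqrt{2\pi\nu}} e^{-\frac{x^2}{2\nu}} \, dx \\
&= e^{\frac{s^2 \nu}{2}} \int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi\nu}} e^{-\frac{(x - s\nu)^2}{2\nu}} \, dx \\
&= \exp\left( \frac{s^2 \nu}{2} \right).
\end{aligned}
\]
By straightforward calculus, the optimal choice of $s$ in (2.4.2) gives the exponent
\[
\sup_{s > 0} \left( s\beta - s^2 \nu / 2 \right) = \frac{\beta^2}{2\nu}, \tag{2.4.3}
\]
achieved at $s_\beta = \beta/\nu$. For $\beta > 0$, this leads to the bound
\[
\mathbb{P}[X \ge \beta] \le \exp\left( -\frac{\beta^2}{2\nu} \right), \tag{2.4.4}
\]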
which is much sharper than Chebyshev’s inequality for large β—compare to (2.4.1).
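To make the comparison concrete, the sketch below evaluates the Chernoff bound (2.4.4), the exact Gaussian tail (through the complementary error function) and a Chebyshev-type bound $\nu/\beta^2$. The function names are ours, and the Chebyshev bound is written in its standard two-sided form, which may differ in presentation from (2.4.1).

```python
import math

def gaussian_tail(beta, nu):
    """Exact P[X >= beta] for X ~ N(0, nu)."""
    return 0.5 * math.erfc(beta / math.sqrt(2.0 * nu))

def chernoff_bound(beta, nu):
    """The bound (2.4.4)."""
    return math.exp(-beta ** 2 / (2.0 * nu))

def chebyshev_bound(beta, nu):
    """Chebyshev's inequality applied to P[|X| >= beta], capped at 1."""
    return min(1.0, nu / beta ** 2)

for beta in (1.0, 2.0, 4.0, 8.0):
    print(beta, gaussian_tail(beta, 1.0), chernoff_bound(beta, 1.0), chebyshev_bound(beta, 1.0))
```

The Chernoff bound tracks the exponential decay of the true tail, whereas the Chebyshev bound only decays polynomially in $\beta$.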
As another toy example, we consider simple random walk on Z.
Lemma 2.4.3 (Chernoff bound for simple random walk on $\mathbb{Z}$). Let $Z_1, \ldots, Z_n$
be independent Rademacher variables, that is, they are $\{-1, 1\}$-valued random
variables with $\mathbb{P}[Z_i = 1] = \mathbb{P}[Z_i = -1] = 1/2$. Let $S_n = \sum_{i \le n} Z_i$. Then, for any
$\beta > 0$,
\[
\mathbb{P}[S_n \ge \beta] \le e^{-\beta^2 / 2n}. \tag{2.4.5}
\]
Observe the similarity between (2.4.5) and the Gaussian bound (2.4.4), if one
takes ν to be the variance of Sn , that is,
ν = Var[Sn ] = nVar[Z1 ] = nE[Z12 ] = n,
where we used that Z1 is centered. The central limit theorem says that simple
random walk is well approximated by a Gaussian in the “bulk” of the distribu-
tion; the bound above extends the approximation in the “large deviation” regime.
The bounding technique used in the proof of Lemma 2.4.3 will be substantially
extended in Section 2.4.2.
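For moderate $n$ the tail of $S_n$ can be computed exactly, which gives a direct check of (2.4.5). The sketch below uses the fact that $S_n = 2B - n$ where $B$ is binomial with parameters $n$ and $1/2$; the function names are ours.

```python
import math

def srw_tail(n, beta):
    """Exact P[S_n >= beta] for simple random walk on Z, using S_n = 2 * Bin(n, 1/2) - n."""
    k_min = math.ceil((n + beta) / 2)
    return sum(math.comb(n, k) for k in range(max(k_min, 0), n + 1)) / 2 ** n

def srw_chernoff(n, beta):
    """The bound (2.4.5)."""
    return math.exp(-beta ** 2 / (2 * n))

n = 100
for beta in (10, 20, 40):
    print(beta, srw_tail(n, beta), srw_chernoff(n, beta))
```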
Example 2.4.4 (Set balancing). Let v1 , . . . , vm be arbitrary non-zero vectors in
{0, 1}n . Think of vi = (vi,1 , . . . , vi,n ) as representing a subset of [n] = {1, . . . , n}:
vi,j = 1 indicates that j is in subset i. Suppose we want to partition [n] into two
groups such that the subsets corresponding to the vi s are as balanced as possible,
that is, are as close as possible to having the same number of elements from each
group. More formally, we seek a vector x = (x1 , . . . , xn ) ∈ {−1, +1}n such that
B ∗ = maxi=1,...,m |x · vi | is as small as possible.
A simple random choice does well: select each xi independently, uniformly at
random in {−1, +1}. Fix ε > 0. We claim that
\[
\mathbb{P}\left[ B^* \ge \sqrt{2n(\log m + \log(2\varepsilon^{-1}))} \right] \le \varepsilon. \tag{2.4.7}
\]
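On a small instance, the probability in (2.4.7) can be computed exactly by enumerating all sign vectors. The sketch below does this for a hypothetical family of four subsets of $[12]$; the instance and the function name are illustrative choices of ours.

```python
import math
from itertools import product

def imbalance_tail(vectors, eps):
    """Exact P[B* >= threshold] under uniform random signs, by enumerating {-1,+1}^n."""
    n, m = len(vectors[0]), len(vectors)
    threshold = math.sqrt(2 * n * (math.log(m) + math.log(2 / eps)))
    bad = 0
    for x in product((-1, 1), repeat=n):
        b_star = max(abs(sum(xj * vj for xj, vj in zip(x, v))) for v in vectors)
        if b_star >= threshold:
            bad += 1
    return threshold, bad / 2 ** n

# Hypothetical instance with n = 12 elements and m = 4 subsets.
VECTORS = [
    (1,) * 12,
    (1, 0) * 6,
    (1, 1, 1, 0, 0, 0) * 2,
    (1,) * 6 + (0,) * 6,
]
threshold, prob = imbalance_tail(VECTORS, eps=0.1)
print(threshold, prob)
```

The enumeration takes $2^n$ steps, so it is only feasible for small $n$; its purpose is to show how loose or tight the bound is on a concrete example.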
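Theorem 2.4.5 (Chernoff-Cramér method). Let $S_n = \sum_{i \le n} X_i$, where the $X_i$s
are i.i.d. random variables. Assume $M_{X_1}(s) < +\infty$ on $s \in (-s_0, s_0)$ for some
$s_0 > 0$. For any $\beta > 0$,
\[
\mathbb{P}[S_n \ge \beta] \le \exp\left( -n \Psi^*_{X_1}\left( \frac{\beta}{n} \right) \right). \tag{2.4.8}
\]
In particular, in the large deviations regime, that is, when $\beta = bn$ for some $b > 0$,
we have
\[
-\limsup_n \frac{1}{n} \log \mathbb{P}[S_n \ge bn] \ge \Psi^*_{X_1}(b). \tag{2.4.9}
\]
Proof. Observe that, by taking a logarithm in (2.1.3), it holds that
\[
\Psi^*_{S_n}(\beta) = \sup_{s > 0} \left( s\beta - n\Psi_{X_1}(s) \right) = \sup_{s > 0} n \left( s \frac{\beta}{n} - \Psi_{X_1}(s) \right) = n \Psi^*_{X_1}\left( \frac{\beta}{n} \right).
\]
Now optimize over $s$ in (2.4.2).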
We use the Chernoff-Cramér method to derive a few standard bounds.
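Poisson variables We start with the Poisson case. Let $Z \sim \mathrm{Poi}(\lambda)$ be Poisson
with mean $\lambda$, where recall that
\[
\mathbb{P}[Z = k] = e^{-\lambda} \frac{\lambda^k}{k!}, \qquad k \in \mathbb{Z}_+.
\]
Letting $X = Z - \lambda$,
\[
\begin{aligned}
\Psi_X(s) &= \log\left( \sum_{\ell \ge 0} e^{-\lambda} \frac{\lambda^\ell}{\ell!} e^{s(\ell - \lambda)} \right) \\
&= \log\left( e^{-(1+s)\lambda} \sum_{\ell \ge 0} \frac{(e^s \lambda)^\ell}{\ell!} \right) \\
&= \log\left( e^{-(1+s)\lambda} e^{e^s \lambda} \right) \\
&= \lambda(e^s - s - 1),
\end{aligned}
\]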
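so that straightforward calculus gives for $\beta > 0$
\[
\begin{aligned}
\Psi^*_X(\beta) &= \sup_{s > 0} \left( s\beta - \lambda(e^s - s - 1) \right) \\
&= \lambda \left[ \left( 1 + \frac{\beta}{\lambda} \right) \log\left( 1 + \frac{\beta}{\lambda} \right) - \frac{\beta}{\lambda} \right] \\
&=: \lambda \, h\left( \frac{\beta}{\lambda} \right),
\end{aligned}
\]
achieved at $s_\beta = \log\left( 1 + \frac{\beta}{\lambda} \right)$, where $h$ is defined as the expression in square
brackets above. Plugging $\Psi^*_X(\beta)$ into Theorem 2.4.5 leads for $\beta > 0$ to the bound
\[
\mathbb{P}[Z \ge \lambda + \beta] \le \exp\left( -\lambda \, h\left( \frac{\beta}{\lambda} \right) \right). \tag{2.4.10}
\]
A similar calculation for $-(Z - \lambda)$ gives for $\beta < 0$
\[
\mathbb{P}[Z \le \lambda + \beta] \le \exp\left( -\lambda \, h\left( \frac{\beta}{\lambda} \right) \right). \tag{2.4.11}
\]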
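The upper tail bound (2.4.10) is easy to compare with the exact Poisson tail. In the sketch below, $h(u) = (1+u)\log(1+u) - u$ as defined above; the tail is approximated from below by summing finitely many terms of the probability mass function in log-space. Function names and the truncation parameter `terms` are ours.

```python
import math

def h(u):
    """h(u) = (1 + u) log(1 + u) - u."""
    return (1.0 + u) * math.log1p(u) - u

def poisson_upper_tail(lam, beta, terms=500):
    """P[Z >= lam + beta] for Z ~ Poi(lam), summing the first `terms` terms of the tail."""
    k0 = math.ceil(lam + beta)
    return sum(
        math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
        for k in range(k0, k0 + terms)
    )

def poisson_chernoff(lam, beta):
    """The bound (2.4.10)."""
    return math.exp(-lam * h(beta / lam))

lam = 5.0
for beta in (1, 5, 10, 20):
    print(beta, poisson_upper_tail(lam, beta), poisson_chernoff(lam, beta))
```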
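If $S_n$ is a sum of $n$ i.i.d. $\mathrm{Poi}(\lambda)$ variables, then by (2.4.9) for $a > \lambda$
\[
\begin{aligned}
-\limsup_n \frac{1}{n} \log \mathbb{P}[S_n \ge an] &\ge \lambda \, h\left( \frac{a - \lambda}{\lambda} \right) \\
&= a \log\left( \frac{a}{\lambda} \right) - a + \lambda \\
&=: I^{\mathrm{Poi}}_\lambda(a),
\end{aligned} \tag{2.4.12}
\]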
\[
\mathbb{P}[X - \mu \le -\beta] \vee \mathbb{P}[X - \mu \ge \beta] \le \exp\left( -\frac{\beta^2}{2\nu} \right), \tag{2.4.16}
\]
where we used that X ∈ sG(ν) implies −X ∈ sG(ν). As a quick example, note
that this is the approach we took in Lemma 2.4.3, that is, we showed that a uniform
random variable in {−1, 1} (i.e., a Rademacher variable) is sub-Gaussian with
variance factor 1.
When considering (weighted) sums of independent sub-Gaussian random vari-
ables, we get the following.
Theorem 2.4.9 (General Hoeffding inequality). Suppose $X_1, \ldots, X_n$ are independent
random variables where, for each $i$, $X_i \in sG(\nu_i)$ with $0 < \nu_i < +\infty$. For
$w_1, \ldots, w_n \in \mathbb{R}$, let $S_n = \sum_{i \le n} w_i X_i$. Then
\[
S_n \in sG\left( \sum_{i=1}^{n} w_i^2 \nu_i \right).
\]
Bounded random variables For bounded random variables, the previous in-
equality reduces to a standard bound.
where the first line comes from the taking out what is known lemma (Lemma B.6.16)
and the fact that $X'_i$ is centered and independent of $X_i$, the second line follows
from the conditional Jensen’s inequality (Lemma B.6.12), and the third line uses
the tower property (Lemma B.6.16). Observe that $X_i - X'_i$ is symmetric, that is,
identically distributed to $-(X_i - X'_i)$. Hence, using that $Z_i$ is independent of both
$X_i$ and $X'_i$, we get
\[
\begin{aligned}
\mathbb{E}\left[ e^{s(X_i - X'_i)} \right] &= \mathbb{E}\left[ \mathbb{E}\left[ e^{s(X_i - X'_i)} \,\middle|\, Z_i \right] \right] \\
&= \mathbb{E}\left[ \mathbb{E}\left[ e^{s Z_i (X_i - X'_i)} \,\middle|\, Z_i \right] \right] \\
&= \mathbb{E}\left[ e^{s Z_i (X_i - X'_i)} \right] \\
&= \mathbb{E}\left[ \mathbb{E}\left[ e^{s Z_i (X_i - X'_i)} \,\middle|\, X_i, X'_i \right] \right].
\end{aligned}
\]
From (2.4.6) (together with Lemma B.6.15), the last line above is
\[
\begin{aligned}
&\le \mathbb{E}\left[ e^{(s(X_i - X'_i))^2 / 2} \right] \\
&\le e^{4 c_i^2 s^2 / 2},
\end{aligned}
\]
Proof of Theorem 2.4.10. As pointed out above, it suffices to show that $X_i$ is
sub-Gaussian with variance factor $\frac{1}{4}(b_i - a_i)^2$. This is the content of Hoeffding’s lemma
below (which we will use again in Chapter 3). First an observation:
Lemma 2.4.11 (Variance of bounded random variables). For any random variable
$Z$ taking values in $[a, b]$ with $-\infty < a \le b < +\infty$, we have
\[
\mathrm{Var}[Z] \le \frac{1}{4}(b - a)^2.
\]
Proof. Indeed
\[
\left| Z - \frac{a + b}{2} \right| \le \frac{b - a}{2},
\]
and
\[
\mathrm{Var}[Z] = \mathrm{Var}\left[ Z - \frac{a + b}{2} \right] \le \mathbb{E}\left[ \left( Z - \frac{a + b}{2} \right)^2 \right] \le \left( \frac{b - a}{2} \right)^2.
\]
Lemma 2.4.12 (Hoeffding’s lemma). Let $X$ be a random variable taking values
in $[a, b]$ for $-\infty < a \le b < +\infty$. Then $X \in sG\left( \frac{1}{4}(b - a)^2 \right)$.
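for some $s^* \in [0, s]$. Therefore it suffices to show that for all $s$
\[
\Psi''_X(s) \le \frac{1}{4}(b - a)^2. \tag{2.4.17}
\]
Note that
\[
\begin{aligned}
\Psi''_X(s) &= \frac{M''_X(s)}{M_X(s)} - \left( \frac{M'_X(s)}{M_X(s)} \right)^2 \\
&= \frac{1}{M_X(s)} \mathbb{E}\left[ X^2 e^{sX} \right] - \left( \frac{1}{M_X(s)} \mathbb{E}\left[ X e^{sX} \right] \right)^2 \\
&= \mathbb{E}\left[ X^2 \frac{e^{sX}}{M_X(s)} \right] - \left( \mathbb{E}\left[ X \frac{e^{sX}}{M_X(s)} \right] \right)^2.
\end{aligned}
\]
The trick to conclude is to notice that $\frac{e^{sx}}{M_X(s)}$ defines a density on $[a, b]$ with respect
to the law of $X$. The variance under this density (the last line above) is less than
$\frac{1}{4}(b - a)^2$ by Lemma 2.4.11. This establishes (2.4.17) and concludes the proof.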
Remark 2.4.13. The change of measure above is known as tilting and is a standard trick
in large deviations theory. See for example [Dur10, Section 2.6].
Since we have shown that $X_i$ is sub-Gaussian with variance factor $\frac{1}{4}(b_i - a_i)^2$,
Theorem 2.4.10 follows from Theorem 2.4.9.
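Hoeffding’s lemma and the tilting argument in its proof can be checked numerically on a two-valued random variable. The sketch below assumes, as the tail bound (2.4.16) suggests, that membership in $sG(\nu)$ is a statement about the centered variable, namely $\log \mathbb{E}[e^{s(X - \mathbb{E}X)}] \le s^2 \nu / 2$. The function names and the test distribution are ours.

```python
import math

def centered_cgf(a, b, q, s):
    """log E[exp(s (X - E[X]))] for X = b with probability q and X = a otherwise."""
    mean = q * b + (1.0 - q) * a
    return math.log(q * math.exp(s * (b - mean)) + (1.0 - q) * math.exp(s * (a - mean)))

def tilted_variance(a, b, q, s):
    """Variance of X under the tilted law with density exp(s x) / M_X(s)."""
    wa, wb = (1.0 - q) * math.exp(s * a), q * math.exp(s * b)
    qb = wb / (wa + wb)               # tilted probability of the value b
    return qb * (1.0 - qb) * (b - a) ** 2

a, b = -1.0, 2.0
for q in (0.1, 0.5, 0.9):
    for s in (-3.0, -1.0, 0.1, 1.0, 3.0):
        print(q, s, centered_cgf(a, b, q, s), s ** 2 * (b - a) ** 2 / 8.0)
```

The second column of the comparison is $s^2 \nu / 2$ with $\nu = \frac{1}{4}(b-a)^2$; the tilted variance is the quantity bounded through Lemma 2.4.11.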
General case We now define a broad class of distributions which have such ex-
ponential tail decay.
Observe that the key difference between (2.4.15) and (2.4.19) is the interval of $s$
over which it holds. As we will see below, the parameter $\alpha$ dictates the exponential
decay rate of the tail. The specific form of the bound in (2.4.19) is natural once one
notices that, as $|s| \to 0$, a centered random variable with variance $\nu$ (and a finite
moment-generating function) should roughly satisfy
\[
\log \mathbb{E}[e^{sX}] \approx \log\left\{ 1 + s\mathbb{E}[X] + \frac{s^2}{2} \mathbb{E}[X^2] \right\} \approx \log\left\{ 1 + \frac{s^2 \nu}{2} \right\} \approx \frac{s^2 \nu}{2}.
\]
* More commonly, “sub-exponential” refers to the case $\alpha = \sqrt{\nu}$.
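Returning to the $\chi^2$ distribution, note that from (2.4.18) we have for $|s| \le 1/4$
\[
\begin{aligned}
\Psi_{W-1}(s) &= -s - \frac{1}{2} \log(1 - 2s) \\
&= -s - \frac{1}{2} \left[ -\sum_{i=1}^{+\infty} \frac{(2s)^i}{i} \right] \\
&= \frac{s^2}{2} \left[ 4 \sum_{i=2}^{+\infty} \frac{(2s)^{i-2}}{i} \right] \\
&\le \frac{s^2}{2} \left[ 2 \sum_{i=2}^{+\infty} |1/2|^{i-2} \right] \\
&\le \frac{s^2}{2} \times 4.
\end{aligned}
\]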
Hence W ∈ sE(4, 4).
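The inequality $\Psi_{W-1}(s) \le \frac{s^2}{2} \times 4$ on $|s| \le 1/4$ can be confirmed on a grid. In the sketch below, the grid and the variable names are our choices; `worst_gap` is the smallest value of the right-hand side minus the left-hand side over the grid.

```python
import math

def psi_chi2_centered(s):
    """Psi_{W-1}(s) = -s - (1/2) log(1 - 2 s), valid for s < 1/2."""
    return -s - 0.5 * math.log1p(-2.0 * s)

GRID = [i / 400.0 for i in range(-100, 101)]   # s in [-1/4, 1/4]
worst_gap = min(4.0 * s ** 2 / 2.0 - psi_chi2_centered(s) for s in GRID)
print(worst_gap)
```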
Using the Chernoff-Cramér bound (Lemma 2.4.2), we obtain the following tail
bound for sub-exponential variables.
Theorem 2.4.15 (Sub-exponential tail bound). Suppose the random variable $X$
with mean $\mu$ is sub-exponential with parameters $(\nu, \alpha)$. Then for all $\beta \in \mathbb{R}_+$
\[
\mathbb{P}[X - \mu \ge \beta] \le \begin{cases} \exp\left( -\frac{\beta^2}{2\nu} \right), & \text{if } 0 \le \beta \le \nu/\alpha, \\ \exp\left( -\frac{\beta}{2\alpha} \right), & \text{if } \beta > \nu/\alpha. \end{cases} \tag{2.4.20}
\]
In words, the tail decays exponentially fast at large deviations but behaves as in the
sub-Gaussian case for smaller deviations. We will see below that this double tail
allows one to extrapolate naturally between different regimes. First we prove the claim.
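Before turning to the proof, here is a direct transcription of the right-hand side of (2.4.20), evaluated with the parameters $(\nu, \alpha) = (4, 4)$ obtained for the $\chi^2$ example. The function name is ours. Note that the two branches agree at the crossover $\beta = \nu/\alpha$, where both equal $\exp(-\nu/(2\alpha^2))$.

```python
import math

def sub_exponential_tail_bound(beta, nu, alpha):
    """Right-hand side of (2.4.20) for a sub-exponential variable with parameters (nu, alpha)."""
    if beta < 0:
        raise ValueError("beta must be nonnegative")
    if beta <= nu / alpha:
        return math.exp(-beta ** 2 / (2.0 * nu))   # sub-Gaussian regime
    return math.exp(-beta / (2.0 * alpha))         # exponential regime

for beta in (0.5, 1.0, 2.0, 8.0):
    print(beta, sub_exponential_tail_bound(beta, 4.0, 4.0))
```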
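At this point, the proof diverges from the sub-Gaussian case because the optimal
choice of $s$ depends on $\beta$ because of the additional constraint $|s| \le 1/\alpha$. When
$s^* = \beta/\nu$ satisfies $s^* \le 1/\alpha$, the quadratic function of $s$ in the exponent is
minimized at $s^*$, giving the bound
\[
\mathbb{P}[X \ge \beta] \le \exp\left( -\frac{\beta^2}{2\nu} \right),
\]
for $0 \le \beta \le \nu/\alpha$.
On the other hand, when $\beta > \nu/\alpha$, the exponent is strictly decreasing over
the interval $s \le 1/\alpha$. Hence the optimal choice is $s^* = 1/\alpha$, which produces the
bound
\[
\begin{aligned}
\mathbb{P}[X \ge \beta] &\le \exp\left( -\frac{\beta}{\alpha} + \frac{\nu}{2\alpha^2} \right) \\
&< \exp\left( -\frac{\beta}{\alpha} + \frac{\beta}{2\alpha} \right) \\
&= \exp\left( -\frac{\beta}{2\alpha} \right),
\end{aligned}
\]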
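\[
S_n \in sE\left( \sum_{i=1}^{n} w_i^2 \nu_i, \ \max_i |w_i| \alpha_i \right).
\]
\[
\Psi_{S_n}(s) = \sum_{i \le n} \Psi_{w_i X_i}(s) = \sum_{i \le n} \Psi_{X_i}(s w_i) \le \sum_{i \le n} \frac{(s w_i)^2 \nu_i}{2} = \frac{s^2 \sum_{i \le n} w_i^2 \nu_i}{2},
\]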
Proof. We claim that $X_i \in sE(2\sigma_i^2, 2c)$. To establish the claim, we derive a bound
on all moments of $X_i$. Note that for all integers $k \ge 2$
Using the general Bernstein inequality (Theorem 2.4.16) gives the result.
sub-Gaussian behavior. After all, the latter is on the surface a strengthening of the
former. However, note that we have obtained a better bound in Theorem 2.4.17
than we did in Theorem 2.4.10—when β is not too large. That improvement stems
from the use of the (actual) variance for moderate deviations. This is easier to
appreciate on an example.
\[
D_n = \max_{v \in V_n} \delta(v).
\]
We focus on the regime npn = ω(log n). Note that, for any vertex v ∈ Vn , its
degree is Bin(n − 1, pn ) by independence of the edges. In particular its expected
degree is (n−1)pn . To prove a high-probability upper bound on the maximum Dn ,
we need to control the deviation of the degree of each vertex from its expectation.
Observe that the degrees are not independent. Instead we apply a union bound over
all vertices, after using a tail bound.
which falls in the lower regime of the tail bound. In particular, $\beta = o(n p_n)$ (i.e.,
the deviation is much smaller than the expectation). Finally by a union bound over
$v \in V_n$
\[
\mathbb{P}\left[ D_n \ge (n-1) p_n + \sqrt{4(1+\varepsilon) p_n (1 - p_n)(n-1) \log n} \right] \le n \times \frac{1}{n^{1+\varepsilon}} \to 0.
\]
The same holds in the other direction. That proves the claim.
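For a concrete $n$, the union bound above can be compared with the exact binomial tail of a single degree. The sketch below computes $n \, \mathbb{P}[\mathrm{Bin}(n-1, p_n) \ge \text{threshold}]$ in log-space; the function names and the parameter values in the usage lines are illustrative choices of ours.

```python
import math

def binomial_upper_tail(n, p, k0):
    """P[Bin(n, p) >= k0], summing the probability mass function in log-space."""
    total = 0.0
    for k in range(max(k0, 0), n + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
                   + k * math.log(p) + (n - k) * math.log1p(-p))
        total += math.exp(log_pmf)
    return total

def max_degree_union_bound(n, p, eps):
    """Return (threshold, n * P[deg(v) >= threshold]) for the threshold in the display above."""
    beta = math.sqrt(4.0 * (1.0 + eps) * p * (1.0 - p) * (n - 1) * math.log(n))
    threshold = (n - 1) * p + beta
    return threshold, n * binomial_upper_tail(n - 1, p, math.ceil(threshold))

n, p, eps = 2000, 0.5, 0.5
threshold, union = max_degree_union_bound(n, p, eps)
print(threshold, union, n ** (-eps))
```

In this example the exact union bound is far smaller than $n^{-\varepsilon}$, which reflects the slack in the exponential inequality at this scale.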
\[
\Lambda_\theta = \{ j \in [n] : V_j = \theta W_j \},
\]
and
\[
\Theta^* = \inf\{ \theta \ge 0 : W_{\Delta_\theta} < \mathcal{W} \},
\]
where, for a subset of items $J \subset [n]$, $W_J = \sum_{j \in J} W_j$ (and $V_J$ is similarly
defined).
We consider a stochastic version of the fractional knapsack problem where the
weights and values are i.i.d. random variables picked uniformly at random in $[0, 1]$.
Characterizing $Z^*$ (e.g., its moments or distribution) is not straightforward. Here
we show that $Z^*$ is highly concentrated around a natural quantity. Observe that,
under our probabilistic model, almost surely $|\Lambda_\theta| \in \{0, 1\}$ for any $\theta \ge 0$. Hence,
there are two cases. Either $\Theta^* = 0$, in which case all items fit in the knapsack so
$Z^* = \sum_{j=1}^{n} V_j$. Or $\Theta^* > 0$, in which case $|\Lambda_{\Theta^*}| = 1$ and
\[
Z^* = V_{\Delta_{\Theta^*}} + \frac{\mathcal{W} - W_{\Delta_{\Theta^*}}}{W_{\Lambda_{\Theta^*}}} V_{\Lambda_{\Theta^*}}. \tag{2.4.22}
\]
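is a sum of independent random variables taking values in $[0, 1]$. Hence, for any
$\beta > 0$, Hoeffding’s inequality gives
\[
\mathbb{P}\left[ W_{\Delta_\theta} - n \bar{w}_\theta \ge \beta \right] \le \exp\left( -\frac{2\beta^2}{n} \right).
\]
Using this inequality with $\theta = \theta_\tau + \frac{C}{\sqrt{n}}$ (with $n$ large enough that $\theta < 1$) and
$\beta = (C/3)\sqrt{n}$ gives
\[
\mathbb{P}\left[ W_{\Delta_{\theta_\tau + \frac{C}{\sqrt{n}}}} - n \left( \frac{1}{2} - \frac{1}{3} \theta_\tau - \frac{C/3}{\sqrt{n}} \right) \ge (C/3)\sqrt{n} \right] \le \exp\left( -2(C/3)^2 \right),
\]
Applying the same argument to $-W_{\Delta_\theta}$ with $\theta = \theta_\tau - \frac{C}{\sqrt{n}}$ and combining with the
previous inequality gives
\[
\mathbb{P}\left[ |\Theta^* - \theta_\tau| > \frac{C}{\sqrt{n}} \right] \le 2 \exp\left( -2(C/3)^2 \right), \tag{2.4.26}
\]
and
\[
\mathbb{P}\left[ V_{\Delta_{\theta_\tau + \frac{C}{\sqrt{n}}}} - n \bar{v}_{\theta_\tau + \frac{C}{\sqrt{n}}} \le -(C/3)\sqrt{n} \right] \le \exp\left( -2(C/3)^2 \right). \tag{2.4.28}
\]
\[
\bar{v}_{\theta_\tau} - \bar{v}_{\theta_\tau + \frac{C}{\sqrt{n}}} = \frac{1}{6} \left( 2 \frac{C}{\sqrt{n}} \theta_\tau + \frac{C^2}{n} \right) \le \frac{C}{\sqrt{n}}.
\]
A quick check reveals that, similarly, $\bar{v}_{\theta_\tau - \frac{C}{\sqrt{n}}} - \bar{v}_{\theta_\tau} \le \frac{C}{\sqrt{n}}$. Plugging back into (2.4.27)
and (2.4.28) gives
\[
\mathbb{P}\left[ V_{\Delta_{\theta_\tau - \frac{C}{\sqrt{n}}}} \ge n \bar{v}_{\theta_\tau} + 2C\sqrt{n} \right] \le \exp\left( -2(C/3)^2 \right), \tag{2.4.29}
\]
and
\[
\mathbb{P}\left[ V_{\Delta_{\theta_\tau + \frac{C}{\sqrt{n}}}} \le n \bar{v}_{\theta_\tau} - 2C\sqrt{n} \right] \le \exp\left( -2(C/3)^2 \right). \tag{2.4.30}
\]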
A similar bound is proved for the 0-1 knapsack problem in Exercise 2.9.
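For reference, the optimal value $Z^*$ of a fractional knapsack instance can be computed by the standard greedy rule: take items in decreasing order of value-to-weight ratio and split the last item that does not fit, which is the structure visible in (2.4.22). The sketch below is a generic implementation under the assumption of strictly positive weights; the function name and the sample instance are ours.

```python
def fractional_knapsack(values, weights, capacity):
    """Greedy solution of the fractional knapsack problem; returns the optimal value."""
    order = sorted(range(len(values)), key=lambda j: values[j] / weights[j], reverse=True)
    total, remaining = 0.0, float(capacity)
    for j in order:
        if remaining <= 0:
            break
        take = min(weights[j], remaining)   # take the whole item, or a fraction of it
        total += values[j] * take / weights[j]
        remaining -= take
    return total

# Three items with ratios 6, 5 and 4 and a knapsack of capacity 50.
print(fractional_knapsack([60, 100, 120], [10, 20, 30], 50))
```

In the random model of this section, one could draw the values and weights uniformly in $[0,1]$ and compare the output across samples to observe the concentration established above.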
where T is an arbitrary index set and the Xt s are real-valued random variables. To
avoid measurability issues, we assume throughout that T is countable.† Note that
t does not in general need to be a “time” index.
So far we have developed tools that can handle cases where T is finite. When
the supremum is over an infinite index set, however, new ideas are required. One
way to proceed is to apply a tail inequality to a sufficiently dense finite subset of the
index set and then extend the resulting bound by a Lipschitz continuity argument.
We present this type of approach in this section, as well as a multi-scale version
known as chaining.
First we summarize one important special case that will be useful below: T is
finite and Xt is sub-Gaussian.
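\[
\mathbb{P}\left[ \sup_{t \in \mathcal{T}} X_t \ge \sqrt{2\nu \log |\mathcal{T}|} + \beta \right] \le \exp\left( -\frac{\beta^2}{2\nu} \right).
\]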
as claimed.
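For the tail inequality, we use a union bound and (2.4.16)
\[
\begin{aligned}
\mathbb{P}\left[ \sup_{t \in \mathcal{T}} X_t \ge \sqrt{2\nu \log |\mathcal{T}|} + \beta \right] &\le \sum_{t \in \mathcal{T}} \mathbb{P}\left[ X_t \ge \sqrt{2\nu \log |\mathcal{T}|} + \beta \right] \\
&\le |\mathcal{T}| \exp\left( -\frac{\left( \sqrt{2\nu \log |\mathcal{T}|} + \beta \right)^2}{2\nu} \right) \\
&\le \exp\left( -\frac{\beta^2}{2\nu} \right),
\end{aligned}
\]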
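\[
t \notin B_\rho(t_0, \varepsilon), \qquad \forall t \ne t_0 \in N
\]
that is, every pair of elements of $N$ is at distance strictly greater than $\varepsilon$. The largest
cardinality of an $\varepsilon$-packing of $\mathcal{T}$ is called the packing number
\[
\mathcal{P}(\mathcal{T}, \rho, \varepsilon) = \sup\{ |N| : N \text{ is an } \varepsilon\text{-packing of } \mathcal{T} \}.
\]
Lemma 2.4.24 (Covering and packing numbers). For any $\mathcal{T} \subseteq M$ and all $\varepsilon > 0$,
\[
\mathcal{N}(\mathcal{T}, \rho, \varepsilon) \le \mathcal{P}(\mathcal{T}, \rho, \varepsilon).
\]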
2. They are included in the ball of radius $3/2$ around the origin: if $z \in B^k(x_i, \varepsilon/2)$,
then $\|z\|_2 \le \|z - x_i\|_2 + \|x_i\| \le \varepsilon/2 + 1 \le 3/2$.
The volume of a ball of radius $\varepsilon/2$ is $\frac{\pi^{k/2} (\varepsilon/2)^k}{\Gamma(k/2+1)}$ and that of a ball of radius $3/2$ is
$\frac{\pi^{k/2} (3/2)^k}{\Gamma(k/2+1)}$. Dividing one by the other proves the claim.
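The ratio of the two volumes above is $(3/\varepsilon)^k$. The sketch below builds an $\varepsilon$-packing of points on the unit sphere greedily and compares its size with that ratio; by the volume argument, no $\varepsilon$-packing of unit vectors can be larger when $\varepsilon \le 1$. The sampling scheme (normalized Gaussian vectors with a fixed seed), the function names and the parameter values are our choices.

```python
import math
import random

def greedy_packing(points, eps):
    """Greedily select points at pairwise Euclidean distance > eps (an eps-packing)."""
    chosen = []
    for pt in points:
        if all(math.dist(pt, q) > eps for q in chosen):
            chosen.append(pt)
    return chosen

def random_sphere_points(k, count, seed=0):
    """Sample `count` points on the unit sphere of R^k by normalizing Gaussian vectors."""
    rng = random.Random(seed)
    pts = []
    for _ in range(count):
        g = [rng.gauss(0.0, 1.0) for _ in range(k)]
        norm = math.sqrt(sum(c * c for c in g))
        pts.append(tuple(c / norm for c in g))
    return pts

k, eps = 3, 0.5
packing = greedy_packing(random_sphere_points(k, 2000), eps)
print(len(packing), (3.0 / eps) ** k)
```

A greedy packing built from a dense enough sample is also close to an $\varepsilon$-net of the sphere, which is how such finite sets are used in the arguments that follow.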
If in addition Xt is sub-Gaussian for all t, then we can bound the expectation or tail
probability of the supremum of {Xt }t∈T —if we can bound the expectation or tail
probability of the (random) Lipschitz constant K itself. To see this, let N ⊆ T be
an ε-net of T and, for each t ∈ T , let π(t) be the closest element of N to t. We
will refer to π as the projection map of N . We then have the inequality
where we can use Theorem 2.4.21 to bound the last term.‡ We give an example of
this type of argument next (although we do not apply the above bound directly).
Another example (where (2.4.32) is used this time) can be found in Section 2.4.5.
Example 2.4.27 (Spectral norm of a random matrix). For an $m \times n$ matrix
$A \in \mathbb{R}^{m \times n}$, the spectral norm (or induced 2-norm, or 2-norm for short) is defined as
\[
\|A\|_2 := \sup_{x \in \mathbb{R}^n \setminus \{0\}} \frac{\|Ax\|_2}{\|x\|_2} = \sup_{x \in \mathbb{S}^{n-1}} \|Ax\|_2 = \sup_{\substack{x \in \mathbb{S}^{n-1} \\ y \in \mathbb{S}^{m-1}}} \langle Ax, y \rangle, \tag{2.4.33}
\]
where $\mathbb{S}^{n-1}$ is the sphere of Euclidean radius 1 around the origin in $\mathbb{R}^n$. The
rightmost expression, which is central to our developments, is justified in Exercise 5.4.
We will be interested in the case where $A$ is a random matrix with independent
entries. One key observation is that the quantity $\langle Ax, y \rangle$ can then be seen as a
linear combination of independent random variables
\[
\langle Ax, y \rangle = \sum_{i,j} x_j y_i A_{ij}.
\]
‡
If the ε-net N is not included in T , the Lipschitz condition has to hold on a larger subset that
includes both.
Hence we will be able to apply our previous tail bounds. However, we also need to
deal with the supremum.
Theorem 2.4.28 (Upper tail of the spectral norm). Let $A \in \mathbb{R}^{m \times n}$ be a random
matrix whose entries are centered, independent and sub-Gaussian with variance
factor $\nu$. Then there exists a constant $0 < C < +\infty$ such that, for all $t > 0$,
\[
\|A\|_2 \le C \sqrt{\nu} \left( \sqrt{m} + \sqrt{n} + t \right),
\]
with probability at least $1 - e^{-t^2}$.
Without the independence assumption, the norm can be much larger in general (see
Exercise 2.15).
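Lemma 2.4.30. For any $\varepsilon$-nets $N \subseteq \mathbb{S}^{n-1}$ and $M \subseteq \mathbb{S}^{m-1}$ of $\mathbb{S}^{n-1}$ and $\mathbb{S}^{m-1}$
respectively, the following inequalities hold
\[
\sup_{\substack{x \in N \\ y \in M}} \langle Ax, y \rangle \le \|A\|_2 \le \frac{1}{1 - 2\varepsilon} \sup_{\substack{x \in N \\ y \in M}} \langle Ax, y \rangle.
\]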
Chaining method
We go back to the inequality
Previously we controlled the first term on the right-hand side with a random Lips-
chitz constant and the second term with a maximal inequality for finite sets. Now
we consider cases where we may not have a good almost sure bound on the Lip-
schitz constant, but where we can control increments uniformly in the following
probabilistic sense. We say that a stochastic process $\{X_t\}_{t \in T}$ has sub-Gaussian increments on $(T, \rho)$ if there exists a deterministic constant $0 < K < +\infty$ such that
\[
X_t - X_s \in \mathrm{sG}(K^2 \rho(s,t)^2), \quad \forall s, t \in T.
\]
Even with this assumption, in (2.4.35) the first term on the right-hand side remains a supremum over an infinite set. To control it, the chaining method repeats the argument above at progressively smaller scales, leading to the following inequality.
The diameter of $T$, denoted by $\mathrm{diam}(T)$, is defined as
\[
\mathrm{diam}(T) = \sup\{\rho(s,t) : s, t \in T\}.
\]
Theorem 2.4.31 (Discrete Dudley inequality). Let $\{X_t\}_{t \in T}$ be a zero-mean stochastic process with sub-Gaussian increments on $(T, \rho)$ and assume $\mathrm{diam}(T) \le 1$. Then
\[
\mathbb{E}\Big[\sup_{t \in T} X_t\Big] \le C \sum_{k=0}^{+\infty} 2^{-k} \sqrt{\log \mathcal{N}(T, \rho, 2^{-k})}.
\]
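To get a feel for the right-hand side of the discrete Dudley inequality, here is a minimal sketch that evaluates the sum (without the constant $C$) for the toy case $T = [0,1]$ with $\rho(s,t) = |s - t|$. We assume that a smallest ε-net of $[0,1]$ has $\lceil 1/(2\varepsilon) \rceil$ points; the function names and the truncation level are hypothetical choices made for this illustration.

```python
import math


def covering_number_interval(eps):
    """Size of a smallest eps-net of [0, 1] under rho(s, t) = |s - t|.

    Centers eps, 3*eps, 5*eps, ... each cover a closed interval of length 2*eps.
    """
    return max(1, math.ceil(1.0 / (2.0 * eps)))


def dudley_sum(covering_number, num_scales=60):
    """Partial sum of sum_k 2^{-k} sqrt(log N(T, rho, 2^{-k})), without the constant C."""
    total = 0.0
    for k in range(num_scales):
        eps = 2.0 ** (-k)
        total += eps * math.sqrt(math.log(covering_number(eps)))
    return total


print(f"Dudley sum for [0, 1]: {dudley_sum(covering_number_interval):.4f}")
```

The covering numbers grow like $2^{k}$ while the weights decay like $2^{-k}$, so the square-root-of-log terms are easily summable and the bound is finite.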
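\[
\rho(\pi_k(t), \pi_{k+1}(t)) \le \rho(\pi_k(t), t) + \rho(t, \pi_{k+1}(t)) \le 2^{-k} + 2^{-k-1} \le 2^{-k+1},
\]
so that
\[
X_{\pi_{k+1}(t)} - X_{\pi_k(t)} \in \mathrm{sG}(K^2 2^{-2k+2}),
\]
for some $0 < K < +\infty$ by the sub-Gaussian increments assumption. We can therefore apply Theorem 2.4.21 to get
\[
\mathbb{E}\Big[\sup_{t \in T} \big(X_{\pi_{k+1}(t)} - X_{\pi_k(t)}\big)\Big] \le \sqrt{2 K^2 2^{-2k+2} \log\big(\mathcal{N}(T, \rho, 2^{-k-1})^2\big)} \le C 2^{-k-1} \sqrt{\log \mathcal{N}(T, \rho, 2^{-k-1})},
\]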
Johnson-Lindenstrauss lemma
The Johnson-Lindenstrauss lemma states roughly that, for any collection of points
in a high-dimensional Euclidean space, one can find an embedding of much lower
dimension that roughly preserves the metric relationships of the points, that is,
their distances. Remarkably, no structure is assumed on the original points and the
result is independent of the input dimension. The method of proof simply involves
performing a random projection.
Lemma 2.4.33 (Johnson-Lindenstrauss lemma). For any set of points $x^{(1)}, \ldots, x^{(m)}$ in $\mathbb{R}^n$ and $\theta \in (0,1)$, there exists a mapping $f : \mathbb{R}^n \to \mathbb{R}^d$ with $d = \Theta(\theta^{-2} \log m)$ such that the following holds: for all $i, j$
\[
(1 - \theta)\|x^{(i)} - x^{(j)}\|_2 \le \|f(x^{(i)}) - f(x^{(j)})\|_2 \le (1 + \theta)\|x^{(i)} - x^{(j)}\|_2. \tag{2.4.37}
\]
We use the probabilistic method: we derive a “distributional” version of the
result that, in turn, implies Lemma 2.4.33 by showing that a mapping with the de-
sired properties exists with positive probability. Before stating this claim formally,
we define the explicit random linear mapping we will employ. Let $A$ be a $d \times n$ matrix whose entries are independent $N(0,1)$. Note that, for any fixed $z \in \mathbb{R}^n$,
\[
\mathbb{E}\|Az\|_2^2 = \mathbb{E}\Big[\sum_{i=1}^d \Big(\sum_{j=1}^n A_{ij} z_j\Big)^2\Big] = d \,\mathrm{Var}\Big[\sum_{j=1}^n A_{1j} z_j\Big] = d\|z\|_2^2, \tag{2.4.38}
\]
where we used the independence of the $A_{ij}$s (and, in particular, of the rows of $A$) and the fact that
\[
\mathbb{E}\Big[\sum_{j=1}^n A_{ij} z_j\Big] = 0. \tag{2.4.39}
\]
Hence the normalized mapping $L = \frac{1}{\sqrt{d}} A$ preserves the squared Euclidean norm “on average,” that is, $\mathbb{E}\|Lz\|_2^2 = \|z\|_2^2$. We use the Chernoff-Cramér method to prove a high-probability result.
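Lemma 2.4.34. Fix $\delta, \theta \in (0,1)$. Then the random linear mapping $L$ above with $d = \Theta(\theta^{-2} \log \delta^{-1})$ is such that for any $z \in \mathbb{R}^n$ with $\|z\|_2 = 1$
\[
\mathbb{P}\big[\,|\|Lz\|_2 - 1| \ge \theta\,\big] \le \delta. \tag{2.4.40}
\]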
a union bound over all $\binom{m}{2}$ such pairs. The probability that any of the inequalities (2.4.37) is not satisfied by the linear mapping $f(z) = Lz$ is then at most 1/2. Hence a mapping with the desired properties exists for $d = \Theta(\theta^{-2} \log m)$.
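The random projection argument is easy to try numerically. The sketch below draws a Gaussian matrix, applies $L = \frac{1}{\sqrt{d}} A$ to a small point set and reports the worst relative distortion of pairwise distances. The constant 8 in the choice of $d$, the point set and all function names are assumptions made for illustration; the lemma only asserts $d = \Theta(\theta^{-2} \log m)$.

```python
import itertools
import math
import random


def random_projection(points, d, seed=0):
    """Apply L = A / sqrt(d), with A a d x n matrix of i.i.d. N(0, 1) entries."""
    rng = random.Random(seed)
    n = len(points[0])
    A = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(d)]
    scale = 1.0 / math.sqrt(d)
    return [[scale * sum(row[j] * x[j] for j in range(n)) for row in A] for x in points]


def dist(u, v):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def max_distortion(points, images):
    """Largest | ||f(x) - f(y)||_2 / ||x - y||_2 - 1 | over all pairs of points."""
    worst = 0.0
    for i, j in itertools.combinations(range(len(points)), 2):
        ratio = dist(images[i], images[j]) / dist(points[i], points[j])
        worst = max(worst, abs(ratio - 1.0))
    return worst


rng = random.Random(42)
m, n, theta = 20, 200, 0.5
points = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(m)]
d = math.ceil(8.0 * math.log(m) / theta**2)  # the constant 8 is an assumption
images = random_projection(points, d)
print(f"d = {d}, worst relative distortion = {max_distortion(points, images):.3f}")
```

Note that the target dimension depends on the number of points $m$ and on $\theta$, but not on the input dimension $n$.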
Note that the right-hand side is $\le \delta$ for $d = \Theta(\theta^{-2} \log \delta^{-1})$. An inequality in the other direction can be proved similarly by working with $-W$ (where $W$ is defined below).
Recall that a sum of independent Gaussians is Gaussian (just compute the convolution and complete the squares). So
(Alternatively, we could have used the general Bernstein inequality (Theorem 2.4.16).)
Finally, take $\beta = d(1 + \theta)^2$. Rearranging we get
To give some further geometric insights into the proof, we make a series of
observations:
1. The $d$ rows of $\frac{1}{\sqrt{n}} A$ are “on average” orthonormal. Indeed, note that for $i \neq j$
\[
\mathbb{E}\Big[\frac{1}{n} \sum_{k=1}^n A_{ik} A_{jk}\Big] = \mathbb{E}[A_{i1}]\, \mathbb{E}[A_{j1}] = 0,
\]
by independence and
\[
\mathbb{E}\Big[\frac{1}{n} \sum_{k=1}^n A_{ik}^2\Big] = \mathbb{E}[A_{i1}^2] = 1,
\]
since the $A_{ik}$s have mean 0 and variance 1. When $n$ is large, those two quantities are concentrated around their mean. Fix a unit vector $z$. Then $\frac{1}{\sqrt{n}} Az$ corresponds approximately to an orthogonal projection of $z$ onto a uniformly chosen random subspace of dimension $d$.
Compressed sensing
In the compressed sensing problem, one seeks to recover a signal $x \in \mathbb{R}^n$ from a small number of linear measurements $(Lx)_i$, $i = 1, \ldots, d$. In complete generality, one needs $n$ such measurements to recover any unknown $x \in \mathbb{R}^n$ as the sensing matrix $L$ must be invertible (or, more precisely, injective). However, by imposing extra structure on the signal and choosing the sensing matrix appropriately, much better results can be obtained. Compressed sensing relies on sparsity.
Definition 2.4.36 (Sparse vectors). We say that a vector $z \in \mathbb{R}^n$ is k-sparse if it has at most $k$ non-zero entries. We let $\mathcal{S}_k^n$ be the set of k-sparse vectors in $\mathbb{R}^n$.
Note that $\mathcal{S}_k^n$ is a union of $\binom{n}{k}$ linear subspaces, one for each support of the nonzero entries.
To solve the compressed sensing problem over k-sparse vectors, it suffices to find a sensing matrix $L$ satisfying that all subsets of $2k$ columns are linearly independent. Indeed, if $x, x' \in \mathcal{S}_k^n$, then $x - x'$ has at most $2k$ nonzero entries. Hence, in order to have $L(x - x') = 0$, it must be that $x - x' = 0$ under the previous condition on $L$. That implies the required injectivity. The implication goes in the other direction as well. Observe for instance that the matrix used in the proof of the Johnson-Lindenstrauss lemma satisfies this property as long as $d \ge 2k$: because of the continuous density of its entries, the probability that $2k$ of its columns are linearly dependent is 0 when $d \ge 2k$. For practical applications, however, other requirements must be met, in particular, computational efficiency and robustness. We describe such an approach.
The following definition will play a key role. Roughly speaking, a restricted isometry preserves enough of the metric structure of $\mathcal{S}_k^n$ to be invertible on its image.
Definition 2.4.37 (Restricted isometry property). A $d \times n$ linear mapping $L$ satisfies the $(k, \theta)$-restricted isometry property (RIP) if for all $z \in \mathcal{S}_k^n$
\[
(1 - \theta)\|z\|_2 \le \|Lz\|_2 \le (1 + \theta)\|z\|_2. \tag{2.4.42}
\]
We say that $L$ is $(k, \theta)$-RIP.
Given a $(k, \theta)$-RIP matrix $L$, can we recover $z \in \mathcal{S}_k^n$ from $Lz$? And how small can $d$ be? The next two claims answer these questions.
Lemma 2.4.38 (Sensing matrix). Let $A$ be a $d \times n$ matrix whose entries are i.i.d. $N(0,1)$ and let $L = \frac{1}{\sqrt{d}} A$. There is a constant $0 < C < +\infty$ such that if $d \ge C k \log n$ then $L$ is $(10k, 1/3)$-RIP with probability at least $1 - 1/n$.
Lemma 2.4.39 (Sparse signal recovery). Let $L$ be $(10k, 1/3)$-RIP. Then for any $x \in \mathcal{S}_k^n$, the unique solution to the following minimization problem
\[
\min_{z \in \mathbb{R}^n} \|z\|_1 \quad \text{subject to} \quad Lz = Lx, \tag{2.4.43}
\]
is $z^* = x$.
It may seem that a more natural approach, compared to (2.4.43), would be to instead minimize the number of non-zero entries in $z$, that is, $\|z\|_0$. However, the advantage of the $\ell_1$ norm is that the problem can then be formulated as a linear program, that is, the minimization of a linear objective subject to linear inequalities (see Exercise 2.13). This permits much faster computation of the solution using standard techniques, while still leading to a sparse solution. See Figure 2.8 for some insight into why $\ell_1$ indeed promotes sparsity.
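The sparsity-promoting effect of the $\ell_1$ norm can be seen in the simplest possible case, a single linear constraint $\langle a, z \rangle = b$. This is our own toy setting, not the general linear program of (2.4.43). There the $\ell_1$ minimizer has a closed form supported on one coordinate, while the $\ell_2$ minimizer is typically dense. The function names below are hypothetical.

```python
def min_l1_on_hyperplane(a, b):
    """Minimize ||z||_1 subject to <a, z> = b (a single linear constraint).

    Since |b| = |<a, z>| <= max_i |a_i| * ||z||_1, putting all the mass on a
    coordinate with largest |a_i| is optimal, and that solution is 1-sparse.
    """
    i = max(range(len(a)), key=lambda j: abs(a[j]))
    z = [0.0] * len(a)
    z[i] = b / a[i]
    return z


def min_l2_on_hyperplane(a, b):
    """Minimize ||z||_2 subject to <a, z> = b: the projection of 0 onto the hyperplane."""
    sq = sum(v * v for v in a)
    return [b * v / sq for v in a]


a, b = [1.0, 2.0], 2.0
z1 = min_l1_on_hyperplane(a, b)
z2 = min_l2_on_hyperplane(a, b)
print("l1 minimizer:", z1)
print("l2 minimizer:", z2)
```

In this two-dimensional example the $\ell_1$ solution sits at a corner of the $\ell_1$ ball and has a zero coordinate, whereas both coordinates of the $\ell_2$ solution are non-zero.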
Putting the two lemmas together shows that:
Claim 2.4.40. Let $L$ be as above with $d = \Theta(k \log n)$ as required by Lemma 2.4.38. With probability $1 - o(1)$, any $x \in \mathcal{S}_k^n$ can be recovered from the input $Lx$ by solving (2.4.43).
Note that $d$ can in general be much smaller than $n$ and not far from the $2k$ bound we derived above.
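As a sanity check on the restricted isometry property, the following sketch samples random sparse unit vectors and records the range of $\|Lz\|_2$ for a normalized Gaussian sensing matrix. This only probes finitely many sparse vectors, so it is evidence for, not a verification of, the RIP condition. The parameter values and helper names are assumptions made for this illustration.

```python
import math
import random


def sensing_matrix(d, n, rng):
    """L = A / sqrt(d), with A a d x n matrix of i.i.d. N(0, 1) entries."""
    s = 1.0 / math.sqrt(d)
    return [[s * rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(d)]


def random_sparse_unit_vector(n, k, rng):
    """Return (support, values) of a k-sparse vector with unit Euclidean norm."""
    support = rng.sample(range(n), k)
    vals = [rng.gauss(0.0, 1.0) for _ in range(k)]
    norm = math.sqrt(sum(v * v for v in vals))
    return support, [v / norm for v in vals]


def sampled_norm_range(L, n, k, trials, rng):
    """Min and max of ||Lz||_2 over randomly sampled k-sparse unit vectors z."""
    lo, hi = float("inf"), 0.0
    for _ in range(trials):
        support, vals = random_sparse_unit_vector(n, k, rng)
        Lz = [sum(row[j] * v for j, v in zip(support, vals)) for row in L]
        norm = math.sqrt(sum(v * v for v in Lz))
        lo, hi = min(lo, norm), max(hi, norm)
    return lo, hi


rng = random.Random(7)
n, k, d = 200, 3, 150
L = sensing_matrix(d, n, rng)
lo, hi = sampled_norm_range(L, n, k, trials=500, rng=rng)
print(f"||Lz||_2 ranged over [{lo:.3f}, {hi:.3f}] on the sampled sparse unit vectors")
```

On the sampled vectors the norms stay within $[2/3, 4/3]$, which is what a $(k, 1/3)$-RIP matrix would guarantee for every k-sparse unit vector.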
Figure 2.8: Because $\ell_1$ balls (squares) have corners, minimizing the $\ell_1$ norm over a linear subspace (line) tends to produce sparse solutions.
ε-net argument
We start with the proof of Lemma 2.4.38. The claim does not follow immediately from the (distributional) Johnson-Lindenstrauss lemma (i.e., Lemma 2.4.34). Indeed that lemma implies that a (normalized) matrix with i.i.d. standard Gaussian entries is an approximate isometry on a finite set of points. Here we need a linear mapping that is an approximate isometry for all vectors in $\mathcal{S}_k^n$, an uncountable space.
For a subset of indices $J \subseteq [n]$ and a vector $y \in \mathbb{R}^n$, we let $y_J$ be the vector $y$ restricted to the entries in $J$, that is, the subvector $(y_j)_{j \in J}$. Fix a subset of indices $I \subseteq [n]$ of size $10k$. We need the RIP condition (Definition 2.4.37) to hold for all $z \in \mathbb{R}^n$ with non-zero entries in $I$ (and all such $I$). The way to achieve this is to use an ε-net argument, as described in Section 2.4.4. Indeed, notice that, for $z \neq 0$, the function $\|Lz\|_2 / \|z\|_2$:
1. does not depend on the norm of $z$, so that we can restrict ourselves to the compact set $\partial B_I := \{z : z_{[n] \setminus I} = 0, \|z\|_2 = 1\}$; and
2. is continuous on $\partial B_I$, so that it suffices to construct a fine enough covering of $\partial B_I$ by a finite collection of balls (i.e., an ε-net) and apply Lemma 2.4.34 to the centers of those balls.
Proof of Lemma 2.4.38. Let $I \subseteq [n]$ be a subset of indices of size $k' := 10k$. There are $\binom{n}{k'} \le n^{k'} = \exp(k' \log n)$ such subsets and we denote their collection by $\mathcal{I}(k', n)$. We let $N_I$ be an ε-net of $\partial B_I$. By Claim 2.4.26, we can choose one in $\partial B_I$ of size at most $(3/\varepsilon)^{k'}$. We take
\[
\varepsilon = \frac{1}{C' \sqrt{6 n \log n}},
\]
for a constant $C'$ that will be determined below. The reason for this choice will become clear when we set $C'$. The union of all ε-nets has size
\[
\Big|\bigcup_{I \in \mathcal{I}(k', n)} N_I\Big| \le n^{k'} \Big(\frac{3}{\varepsilon}\Big)^{k'} \le \exp(C'' k' \log n),
\]
for some $C'' > 0$. Our goal is to show that
\[
\sup_{z \in \bigcup_{I \in \mathcal{I}(k', n)} \partial B_I} |\|Lz\|_2 - 1| \le \frac{1}{3}. \tag{2.4.44}
\]
as required by the lemma. Then, by a union bound over the $N_I$s, with probability $1 - 1/(2n)$ we have
\[
\sup_{z \in \bigcup_I N_I} |\|Lz\|_2 - 1| \le \frac{1}{6}. \tag{2.4.45}
\]
Proof of Lemma 2.4.39. Let $z^*$ be a solution to (2.4.43) and note that such a solution exists because $z = x$ satisfies the constraint. Without loss of generality assume that only the first $k$ entries of $x$ are nonzero, that is, $x_{[n] \setminus [k]} = 0$. Moreover order the remaining entries of $x$ so that the residual $r = z^* - x$ has its entries $r_{[n] \setminus [k]}$ in nonincreasing order in absolute value. Our goal is to show that $\|r\|_2 = 0$.
In order to leverage the RIP condition, we break up the vector $r$ into $9k$-long subvectors. Let
and $\bar{I}_i = \bigcup_{j > i} I_j$. We will also need $I_{01} = I_0 \cup I_1$ and $\bar{I}_{01} = \bar{I}_1$.
We first use the optimality of $z^*$. Note that $x_{\bar{I}_0} = 0$ implies that
and
\[
\|x\|_1 = \|x_{I_0}\|_1 \le \|z^*_{I_0}\|_1 + \|r_{I_0}\|_1,
\]
by the triangle inequality. Since $\|z^*\|_1 \le \|x\|_1$ by optimality (and the fact that $x$ satisfies the constraint), we then have
On the other hand, the RIP condition gives a similar inequality in the other direction. Indeed notice that $Lr = 0$ by the constraint in (2.4.43) or, put differently, $L r_{I_{01}} = -\sum_{i \ge 2} L r_{I_i}$. Then, by the RIP condition and the triangle inequality, we have that
\[
\frac{2}{3} \|r_{I_{01}}\|_2 \le \|L r_{I_{01}}\|_2 \le \sum_{i \ge 2} \|L r_{I_i}\|_2 \le \frac{4}{3} \sum_{i \ge 2} \|r_{I_i}\|_2, \tag{2.4.48}
\]
where we used the fact that by construction $r_{I_{01}}$ is $10k$-sparse and each $r_{I_i}$ is $9k$-sparse.
We note that by the ordering of the entries of x
which is called the empirical risk. Indeed observe that $\mathbb{E}[R_n(h)] = R(h)$ and, by the law of large numbers, $R_n(h) \to R(h)$ almost surely as $n \to +\infty$. Ignoring computational considerations, one can then formally define an empirical risk minimizer
\[
h^* \in \mathrm{ERM}_{\mathcal{H}}(S_n) = \{h \in \mathcal{H} : R_n(h) \le R_n(h'),\ \forall h' \in \mathcal{H}\},
\]
Overfitting
Why restrict the hypothesis class? It turns out that minimizing the empirical risk over all Boolean functions makes it impossible to achieve an arbitrarily small risk. Intuitively, considering too rich a class of functions, that is, functions that too intricately follow the data, leads to overfitting: the learned hypothesis will fit the sampled data, but it may not generalize well to unseen examples. A learner $A$ is a map from samples to measurable Boolean functions over $\mathbb{R}^d$, that is, for any $n$ and any $S_n \in (\mathbb{R}^d \times \{0,1\})^n$, the learner outputs a function $A(\,\cdot\,, S_n) : \mathbb{R}^d \to \{0,1\}$. The following theorem shows that any learner has fundamental limitations if all concepts are possible.
Theorem 2.4.42 (No Free Lunch). For any learner $A$ and any finite $\mathcal{X} \subseteq \mathbb{R}^d$ of even size $|\mathcal{X}| =: 2m > 4$, there exist a concept $C : \mathcal{X} \to \{0,1\}$ and a distribution $\mu$ over $\mathcal{X}$ such that
The gist of the proof is intuitive. In essence, if the target concept is arbitrary and we
only get to see half of the possible instances, then we have learned nothing about
the other half and cannot expect low generalization error.
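The intuition can be checked with a small simulation. In the sketch below, a learner that simply memorizes the labels it has seen (and guesses 0 elsewhere) is run against uniformly random concepts on a domain of size $2m$, using $m$ samples from the uniform distribution. The memorizing learner, the uniform choices and the function names are our own assumptions for this illustration; they are not the construction used in the proof.

```python
import random


def memorizing_learner(sample):
    """Return a hypothesis that repeats observed labels and guesses 0 elsewhere."""
    table = dict(sample)
    return lambda x: table.get(x, 0)


def average_risk(m, trials, seed=0):
    """Average true risk of the memorizing learner over random concepts.

    The domain is {0, ..., 2m - 1} with the uniform distribution, the concept is
    a uniformly random labeling, and the learner sees m i.i.d. labeled samples.
    """
    rng = random.Random(seed)
    domain = range(2 * m)
    total = 0.0
    for _ in range(trials):
        concept = [rng.randint(0, 1) for _ in domain]
        xs = [rng.randrange(2 * m) for _ in range(m)]
        h = memorizing_learner([(x, concept[x]) for x in xs])
        total += sum(h(x) != concept[x] for x in domain) / (2 * m)
    return total / trials


risk = average_risk(m=20, trials=2000)
print(f"average risk of the memorizing learner: {risk:.3f}")
```

The learner has zero empirical risk on every sample, yet its average true risk stays bounded away from 0 because it errs about half the time on the instances it never observed.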
\[
B = \{A(X, S_m) \neq Y_X\},
\]
The way out is to “limit the complexity” of the hypotheses. For instance, we could restrict ourselves to half-spaces
\[
\mathcal{H}_H = \big\{h(x) = \mathbf{1}\{x^T u \ge \alpha\} : u \in \mathbb{R}^d, \alpha \in \mathbb{R}\big\},
\]
or axis-aligned boxes
In order for the empirical risk minimizer $h^*$ to have a generalization error close to the best achievable error, we need the empirical risk of the learned hypothesis $R_n(h^*)$ to be close to its expectation $R(h^*)$, which is guaranteed by the law of large numbers for sufficiently large $n$. But that is not enough: we also need that same property to hold for all hypotheses in $\mathcal{H}$ simultaneously. Otherwise we could be fooled by a poorly performing hypothesis with unusually good empirical risk on the samples. The hypothesis class is typically infinite and, therefore, controlling empirical risk deviations from their expectations uniformly over $\mathcal{H}$ is not straightforward.
where, on the third line, we used the definition of the empirical risk minimizer. Taking an infimum over $h'$, then an expectation over the samples, and rearranging gives
This inequality allows us to relate two quantities of interest: the expected true risk of the empirical risk minimizer (i.e., $\mathbb{E}[R(h^*)]$, where recall that $h^*$ is defined over the samples) and the best possible true risk (i.e., $\inf_{h' \in \mathcal{H}} R(h')$). The first term on the right-hand side is (2.4.52) and the second one can be bounded in a similar fashion as we argue below. Observe that the suprema are inside the expectations and that the random variables $R_n(h) - R(h)$ are highly correlated. Indeed, two similar hypotheses will produce similar predictions. The correlation is ultimately what allows us to tackle infinite classes $\mathcal{H}$, as we saw in Section 2.4.4.
Indeed, to bound (2.4.52), we use the methods of Section 2.4.4. As a first step, we apply the symmetrization trick, which we introduced in Section 2.4.2 to give a proof of Hoeffding’s lemma (Lemma 2.4.12). Let $(\varepsilon_i)_{i=1}^n$ be i.i.d. uniform random variables in $\{-1, +1\}$ (i.e., Rademacher variables) and let $(X_i')_{i=1}^n$ be an independent copy of $(X_i)_{i=1}^n$. Then
\begin{align*}
&\mathbb{E}\Big[\sup_{h \in \mathcal{H}} \{R_n(h) - R(h)\}\Big]\\
&\quad = \mathbb{E}\Big[\sup_{h \in \mathcal{H}} \Big\{\frac{1}{n} \sum_{i=1}^n \ell(h, X_i) - \mathbb{E}[\ell(h, X)]\Big\}\Big]\\
&\quad = \mathbb{E}\Big[\sup_{h \in \mathcal{H}} \Big\{\frac{1}{n} \sum_{i=1}^n \big[\ell(h, X_i) - \mathbb{E}[\ell(h, X_i') \mid (X_j)_{j=1}^n]\big]\Big\}\Big]\\
&\quad = \mathbb{E}\Big[\sup_{h \in \mathcal{H}} \mathbb{E}\Big[\frac{1}{n} \sum_{i=1}^n [\ell(h, X_i) - \ell(h, X_i')] \,\Big|\, (X_j)_{j=1}^n\Big]\Big]\\
&\quad \le \mathbb{E}\Big[\sup_{h \in \mathcal{H}} \Big\{\frac{1}{n} \sum_{i=1}^n [\ell(h, X_i) - \ell(h, X_i')]\Big\}\Big],
\end{align*}
where on the fourth line we used “taking out what is known” (Lemma B.6.13) and on the fifth line we used $\sup_h \mathbb{E} Y_h \le \mathbb{E}[\sup_h Y_h]$ and the tower property. Next we note that $\ell(h, X_i) - \ell(h, X_i')$ is symmetric and independent of $\varepsilon_i$ (which is also
The exact same argument also applies to the second term on the right-hand side of (2.4.53), so
\[
\mathbb{E}[R(h^*)] - \inf_{h' \in \mathcal{H}} R(h') \le 4\, \mathbb{E}\Big[\sup_{h \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^n \varepsilon_i \ell(h, X_i)\Big]. \tag{2.4.54}
\]
Note that we will not compute the best possible true risk (which in general could be “bad,” i.e., large), only how close the empirical risk minimizer gets to it.
A naive application of the maximal inequality in Lemma 2.4.21, together with the two observations above, gives
\[
\mathbb{E}\Big[\sup_{h \in \mathcal{H}} Z_n(h)\Big] \le \sqrt{2 \log 2^n} = \sqrt{2 n \log 2}.
\]
Unfortunately, plugging this back into (2.4.54) gives an upper bound which fails to converge to 0 as $n \to +\infty$.
To obtain a better bound, we show that in general the number of distinct restrictions of $\mathcal{H}$ to $n$ points can grow much slower than $2^n$.
Definition 2.4.43 (Shattering). Let $\Lambda = \{\ell_1, \ldots, \ell_n\} \subseteq \mathbb{R}^d$ be a finite set and let $\mathcal{H}$ be a class of Boolean functions on $\mathbb{R}^d$. The restriction of $\mathcal{H}$ to $\Lambda$ is
\[
\mathcal{H}_\Lambda = \{(h(\ell_1), \ldots, h(\ell_n)) : h \in \mathcal{H}\}.
\]
We say that $\Lambda$ is shattered by $\mathcal{H}$ if $|\mathcal{H}_\Lambda| = 2^{|\Lambda|}$, that is, if all Boolean functions over $\Lambda$ can be obtained by restricting a function in $\mathcal{H}$ to the points in $\Lambda$.
Definition 2.4.44 (VC dimension). Let $\mathcal{H}$ be a class of Boolean functions on $\mathbb{R}^d$. The VC dimension of $\mathcal{H}$, denoted $\mathrm{vc}(\mathcal{H})$, is the maximum cardinality of a set shattered by $\mathcal{H}$.
We prove the following combinatorial lemma at the end of this section.
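Lemma 2.4.45 (Sauer’s lemma). Let $\mathcal{H}$ be a class of Boolean functions on $\mathbb{R}^d$. For any finite set $\Lambda = \{\ell_1, \ldots, \ell_n\} \subseteq \mathbb{R}^d$,
\[
|\mathcal{H}_\Lambda| \le \Big(\frac{en}{\mathrm{vc}(\mathcal{H})}\Big)^{\mathrm{vc}(\mathcal{H})}.
\]
That is, the number of distinct restrictions of $\mathcal{H}$ to any $n$ points grows at most as $\propto n^{\mathrm{vc}(\mathcal{H})}$.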
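The definitions of restriction, shattering and VC dimension can be checked by brute force on a small example. The sketch below uses closed intervals on the real line, restricted to six points, and compares the number of distinct restrictions to the bound in Sauer’s lemma. The choice of class, the finite grid of interval endpoints and the function names are assumptions made for this illustration; the quantity computed is the largest shattered subset among the given points, which here coincides with the VC dimension of intervals.

```python
import itertools
import math


def restriction(hypotheses, points):
    """H_Lambda: the set of label vectors (h(l_1), ..., h(l_n)) over h in H."""
    return {tuple(h(x) for x in points) for h in hypotheses}


def is_shattered(hypotheses, subset):
    """True if every Boolean labeling of `subset` is realized by some hypothesis."""
    return len(restriction(hypotheses, subset)) == 2 ** len(subset)


def vc_dimension_on(hypotheses, points):
    """Largest size of a subset of `points` shattered by the class (brute force)."""
    best = 0
    for size in range(1, len(points) + 1):
        if any(is_shattered(hypotheses, s) for s in itertools.combinations(points, size)):
            best = size
    return best


def interval(a, b):
    """Indicator function of the closed interval [a, b]."""
    return lambda x: int(a <= x <= b)


points = [1, 2, 3, 4, 5, 6]
# Endpoints at half-integers realize every labeling of `points` by an interval.
grid = [i + 0.5 for i in range(0, 7)]
hypotheses = [interval(a, b) for a in grid for b in grid]
n = len(points)
vc = vc_dimension_on(hypotheses, points)
count = len(restriction(hypotheses, points))
bound = (math.e * n / vc) ** vc
print(f"vc = {vc}, |H_Lambda| = {count}, Sauer bound = {bound:.1f}")
```

Intervals can realize any labeling of two points but never the labeling (1, 0, 1) of three ordered points, so the number of restrictions grows quadratically in $n$ rather than like $2^n$.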
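Returning to $\mathbb{E}[\sup_{h \in \mathcal{H}} Z_n(h)]$, we get the following inequality.
Lemma 2.4.46. There exists a constant $0 < C < +\infty$ such that, for any countable class of measurable Boolean functions $\mathcal{H}$ over $\mathbb{R}^d$,
\[
\mathbb{E}\Big[\sup_{h \in \mathcal{H}} Z_n(h)\Big] \le C \sqrt{\mathrm{vc}(\mathcal{H}) \log n}. \tag{2.4.57}
\]
Proof. Recall that $Z_n(h) \in \mathrm{sG}(1)$. Since the supremum over $\mathcal{H}$, when seen as restricted to $\{X_1, \ldots, X_n\}$, is in fact a supremum over at most $\big(\frac{en}{\mathrm{vc}(\mathcal{H})}\big)^{\mathrm{vc}(\mathcal{H})}$ functions by Sauer’s lemma (Lemma 2.4.45), we have by Lemma 2.4.21
\[
\mathbb{E}\Big[\sup_{h \in \mathcal{H}} Z_n(h)\Big] \le \sqrt{2 \log\Big[\Big(\frac{en}{\mathrm{vc}(\mathcal{H})}\Big)^{\mathrm{vc}(\mathcal{H})}\Big]}.
\]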
These two examples also provide insights into Sauer’s lemma. Consider the case of rectangles for instance. Over a collection of $n$ sample points, a rectangle defines the same $\{0,1\}$-labeling as the minimal-area rectangle containing the same points. Because each side of a minimal-area rectangle must touch at least one point in the sample, there are at most $n^4$ such rectangles, and hence there are at most $n^4 \ll 2^n$ restrictions of $\mathcal{H}_B$ to these sample points.
Application of chaining
It turns out that the $\sqrt{\log n}$ factor in (2.4.57) is not optimal. We use chaining (Section 2.4.4) to improve the bound.
We claim that the process $\{Z_n(h)\}_{h \in \mathcal{H}}$ has sub-Gaussian increments under an appropriately defined pseudometric. Indeed, conditioning on $(X_i)_{i=1}^n$, by the general Hoeffding inequality (Theorem 2.4.9) and Hoeffding’s lemma (Lemma 2.4.12), we have that the increment (as a function of the $\varepsilon_i$s which have variance factor 1)
\[
Z_n(g) - Z_n(h) = \sum_{i=1}^n \varepsilon_i \frac{\ell(g, X_i) - \ell(h, X_i)}{\sqrt{n}},
\]
where $\delta_x$ is the probability measure that puts mass 1 on $x$. Then, we can re-write
Hence we have shown that, conditioned on the samples, the process $\{Z_n(h)\}_{h \in \mathcal{H}}$ has sub-Gaussian increments with respect to $\|\cdot\|_{L^2(\mu_n)}$. Note that the pseudometric here is random as it depends on the samples. Though, by the law of large numbers, $\|g - h\|_{L^2(\mu_n)}$ approaches its expectation, $\|g - h\|_{L^2(\mu)}$, as $n \to +\infty$.
Applying the discrete Dudley inequality (Theorem 2.4.31), we obtain the following bound.
Lemma 2.4.51. There exists a constant $0 < C < +\infty$ such that, for any countable class of measurable Boolean functions $\mathcal{H}$ over $\mathbb{R}^d$,
\[
\mathbb{E}\Big[\sup_{h \in \mathcal{H}} Z_n(h)\Big] \le C\, \mathbb{E}\Big[\sum_{k=0}^{+\infty} 2^{-k} \sqrt{\log \mathcal{N}(\mathcal{H}, \|\cdot\|_{L^2(\mu_n)}, 2^{-k})}\Big],
\]
Proof. Because $\mathcal{H}$ comprises only Boolean functions, it follows that under the pseudometric $\|\cdot\|_{L^2(\mu_n)}$ the diameter is bounded by 1. We apply the discrete Dudley inequality conditioned on $(X_i)_{i=1}^n$. Then we take an expectation over the samples.
Our use of the symmetrization trick is more intuitive than it may have appeared at first. The central limit theorem indicates that the fluctuations of centered averages such as
\[
(R_n(g) - R(g)) - (R_n(h) - R(h))
\]
tend to cancel out and that, in the limit, the variance alone characterizes the overall behavior. The $\varepsilon_i$s in some sense explicitly capture the canceling part of this phenomenon while $\rho_n$ captures the scale of the resulting global fluctuations in the increments.
Our final task is to bound the covering numbers $\mathcal{N}(\mathcal{H}, \|\cdot\|_{L^2(\mu_n)}, 2^{-k})$.
Before proving Theorem 2.4.52, we derive its implications for uniform deviations. Compare the following bound to Lemma 2.4.46.
Lemma 2.4.53. There exists a constant $0 < C < +\infty$ such that, for any countable class of measurable Boolean functions $\mathcal{H}$ over $\mathbb{R}^d$,
\[
\mathbb{E}\Big[\sup_{h \in \mathcal{H}} Z_n(h)\Big] \le C \sqrt{\mathrm{vc}(\mathcal{H})}.
\]
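Moreover $[g_i(X_k) - g_j(X_k)]^2 \in [0,1]$. Hence, by Hoeffding’s inequality there exists a constant $0 < C < +\infty$ and an $m \le C \varepsilon^{-4} \log N$ such that
\begin{align*}
&\mathbb{P}\Big[\|g_i - g_j\|^2_{L^2(\eta)} - \|g_i - g_j\|^2_{L^2(\mu_{\mathcal{X}})} \ge \frac{3\varepsilon^2}{4}\Big]\\
&\quad = \mathbb{P}\Big[m \|g_i - g_j\|^2_{L^2(\eta)} - \sum_{k=1}^m [g_i(X_k) - g_j(X_k)]^2 \ge m \frac{3\varepsilon^2}{4}\Big]\\
&\quad \le \exp\Big(-\frac{2 (m \cdot 3\varepsilon^2/4)^2}{m}\Big)\\
&\quad = \exp\Big(-\frac{9}{8} m \varepsilon^4\Big)\\
&\quad < \frac{1}{N^2}.
\end{align*}
That implies that, for this choice of $m$,
\[
\mathbb{P}\Big[\|g_i - g_j\|_{L^2(\mu_{\mathcal{X}})} > \frac{\varepsilon}{2},\ \forall i \neq j\Big] > 0,
\]
where the probability is over the samples and we used the assumption on the collection $G$. Therefore, there must be a set $\mathcal{X} = \{x_1, \ldots, x_m\} \subseteq \mathbb{R}^d$ such that
\[
\|g_i - g_j\|_{L^2(\mu_{\mathcal{X}})} > \frac{\varepsilon}{2}, \quad \forall i \neq j. \tag{2.4.59}
\]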
where $C' = 2eC$. Plugging (2.4.61) back into (2.4.60) and rearranging gives
\[
N \le \big(C' \varepsilon^{-4}\big)^{2\, \mathrm{vc}(\mathcal{H})}.
\]
Proof of Sauer’s lemma
Recall from Appendix A (see also Exercise 1.4) that for integers $0 < d \le n$,
\[
\sum_{k=0}^d \binom{n}{k} \le \Big(\frac{en}{d}\Big)^d. \tag{2.4.62}
\]
Sauer’s lemma (Lemma 2.4.45) follows from the following claim.
Lemma 2.4.54 (Pajor). Let $\mathcal{H}$ be a class of Boolean functions on $\mathbb{R}^d$ and let $\Lambda = \{\ell_1, \ldots, \ell_n\} \subseteq \mathbb{R}^d$ be any finite subset. Then
\[
|\mathcal{H}_\Lambda| \le |\{S \subseteq \Lambda : S \text{ is shattered by } \mathcal{H}\}|,
\]
where the right-hand side includes the empty set.
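Pajor’s lemma is a purely combinatorial statement about sets of Boolean vectors, so it can be checked exhaustively on small instances. The sketch below draws random classes of labelings of $n = 4$ points and compares the number of labelings to the number of shattered subsets (counting the empty set). The random classes and the function names are hypothetical; this is a sanity check on small cases, not a proof.

```python
import itertools
import random


def count_shattered_subsets(labelings, n):
    """Number of subsets S of {0, ..., n-1} shattered by the class (empty set included)."""
    total = 0
    for size in range(n + 1):
        for S in itertools.combinations(range(n), size):
            patterns = {tuple(v[i] for i in S) for v in labelings}
            if len(patterns) == 2 ** size:
                total += 1
    return total


def check_pajor(n, class_size, trials, seed=0):
    """Check |H_Lambda| <= #shattered subsets on random classes of labelings."""
    rng = random.Random(seed)
    all_labelings = list(itertools.product([0, 1], repeat=n))
    for _ in range(trials):
        labelings = set(rng.sample(all_labelings, class_size))
        if len(labelings) > count_shattered_subsets(labelings, n):
            return False
    return True


print(check_pajor(n=4, class_size=6, trials=200))
```

Here a class is represented directly by its restriction, that is, by a set of label vectors, which is all that matters for the statement of the lemma.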
Going back to Sauer’s lemma, by Lemma 2.4.54 we have the upper bound
\[
|\mathcal{H}_\Lambda| \le |\{S \subseteq \Lambda : S \text{ is shattered by } \mathcal{H}\}|.
\]
By definition of the VC-dimension (Definition 2.4.44), the subsets $S \subseteq \Lambda$ that are shattered by $\mathcal{H}$ have size at most $\mathrm{vc}(\mathcal{H})$. So the right-hand side is bounded above by the total number of subsets of size at most $d = \mathrm{vc}(\mathcal{H})$ of a set of size $n$. By (2.4.62), this gives
\[
|\mathcal{H}_\Lambda| \le \Big(\frac{en}{\mathrm{vc}(\mathcal{H})}\Big)^{\mathrm{vc}(\mathcal{H})},
\]
which establishes Sauer’s lemma.
So it remains to prove Lemma 2.4.54.
Proof of Lemma 2.4.54. We prove the claim by induction on the size $n$ of $\Lambda$. The result is trivial for $n = 1$. Assume the result is true for any $\mathcal{H}$ and any subset of size $n - 1$. To apply induction, for $\iota = 0, 1$ we let
\[
\mathcal{H}^\iota = \{h \in \mathcal{H} : h(\ell_n) = \iota\},
\]
and we set
\[
\Lambda' = \{\ell_1, \ldots, \ell_{n-1}\}.
\]
It will be convenient to introduce the following notation
\[
S(\Lambda; \mathcal{H}) = |\{S \subseteq \Lambda : S \text{ is shattered by } \mathcal{H}\}|.
\]
Because $|\mathcal{H}_\Lambda| = |\mathcal{H}^0_{\Lambda'}| + |\mathcal{H}^1_{\Lambda'}|$ and the induction hypothesis implies $S(\Lambda'; \mathcal{H}^\iota) \ge |\mathcal{H}^\iota_{\Lambda'}|$ for $\iota = 0, 1$, it suffices to show that
\[
S(\Lambda; \mathcal{H}) \ge S(\Lambda'; \mathcal{H}^0) + S(\Lambda'; \mathcal{H}^1). \tag{2.4.63}
\]
There are two types of sets that contribute to the right-hand side.
- One but not both. Let $S \subseteq \Lambda'$ be a set that contributes to one of $S(\Lambda'; \mathcal{H}^0)$ or $S(\Lambda'; \mathcal{H}^1)$ but not both. Then $S$ is a subset of the larger set $\Lambda$ and it is certainly shattered by the larger collection $\mathcal{H}$. Hence it also contributes to the left-hand side of (2.4.63).
Exercises
Exercise 2.1 (Moments of nonnegative random variables). Prove (B.5.1). [Hint: Use Fubini’s Theorem to compute the integral.]
Exercise 2.2 (Bonferroni inequalities). Let $A_1, \ldots, A_n$ be events and $B_n := \cup_i A_i$. Define
\[
S^{(r)} := \sum_{1 \le i_1 < \cdots < i_r \le n} \mathbb{P}[A_{i_1} \cap \cdots \cap A_{i_r}],
\]
and
\[
X_n := \sum_{i=1}^n \mathbf{1}_{A_i}.
\]
(iii) Use (i) and (ii) to show that when $\ell \in [n]$ is odd
\[
\mathbb{P}[B_n] \le \sum_{r=1}^{\ell} (-1)^{r-1} S^{(r)},
\]
(ii) Show that $p_c(\mathbb{L}^d) < 1$. [Hint: Use the result for $\mathbb{L}^2$.]
(i) Assume further that Var[Xr ] ≤ C < +∞ for all r. Show that
1 X C2
P Xr ≥ β ≤ 2 .
n β n
r≤n
\[
X_v = \langle U, v \rangle \bmod 2.
\]
(i) Show that the random variables $X_v$, $v \in \{0,1\}^\ell \setminus \mathbf{0}$, are uniformly distributed in $\{0,1\}$ and pairwise independent.
(ii) Show that for any event $A$ measurable with respect to $\sigma(X_v, v \in \{0,1\}^\ell \setminus \mathbf{0})$, $\mathbb{P}[A]$ is either 0 or $\ge 1/(n+1)$.
Exercise 2.7 (Chernoff bound for Poisson trials). Using the Chernoff-Cramér method,
prove part (i) of Theorem 2.4.7. Show that part (ii) follows from part (i).
Exercise 2.8 (Stochastic knapsack: some details). Consider the stochastic frac-
tional knapsack problem in Section 2.4.3.
(i) Prove that the greedy algorithm described there gives an optimal solution to
problem (2.4.21).
Exercise 2.9 (Stochastic knapsack: 0-1 version). Consider the stochastic fractional
knapsack problem in Section 2.4.3.
(i) Adapt the greedy algorithm for the 0-1 knapsack problem and show that it is
not optimal in general. [Hint: Construct a counter-example with two items.]
Exercise 2.10 (A proof of Pólya’s theorem). Let $(S_t)$ be simple random walk on $\mathbb{L}^d$ started at the origin 0.
(i) For $d = 1$, use Stirling’s formula (see Appendix A) to show that $\mathbb{P}[S_{2n} = 0] = \Theta(n^{-1/2})$.
(ii) For $j = 1, \ldots, d$, let $N_t^{(j)}$ be the number of steps in the $j$-th coordinate by time $t$. Show that
\[
\mathbb{P}\Big[N_n^{(j)} \in \Big[\frac{n}{2d}, \frac{3n}{2d}\Big],\ \forall j\Big] \ge 1 - \exp(-\kappa_d n),
\]
(iii) Use (i) and (ii) to show that, for any $d \ge 3$, $\mathbb{P}[S_{2n} = 0] = O(n^{-d/2})$.
as n → +∞.
Exercise 2.12 (RIP vs. orthogonality). Show that a $(k, 0)$-RIP matrix with $k \ge 2$ is orthogonal, that is, its columns are orthonormal.
Exercise 2.14 (Compressed sensing: almost sparse case). By adapting the proof of Lemma 2.4.39, show the following “almost sparse” version. Let $L$ be $(10k, 1/3)$-RIP. Then, for any $x \in \mathbb{R}^n$, the solution to (2.4.43) satisfies
\[
\|z^* - x\|_2 = O(\eta(x)/\sqrt{k}),
\]
and
\[
X_n := \sum_{i=1}^n A_i.
\]
\[
\mathbb{P}[X_n = 0] \to e^{-\mu}.
\]
In fact, $X_n \overset{d}{\to} \mathrm{Poi}(\mu)$ (no need to prove this). This is a special case of the method of moments.
Exercise 2.19 (Connectivity: critical window). Using Exercise 2.18 show that, when $p_n = \frac{\log n + s}{n}$, the probability that an Erdős-Rényi graph $G_n \sim \mathbb{G}_{n, p_n}$ contains no isolated vertex converges to $e^{-e^{-s}}$.
Bibliographic Remarks
Section 2.1 For more on moment-generating functions, see [Bil12, Section 21].
Section 2.2 The examples in Section 2.2.1 are taken from [AS11, Sections 2.4,
3.2]. A fascinating account of the longest increasing subsequence problem is given
in [Rom15], from which the material in Section 2.2.3 is taken. The contour lemma,
Lemma 2.2.14, is attributed to Whitney [Whi32] and is usually proved “by pic-
ture” [Gri10a, Figure 3.1]. A formal proof of the lemma can be found in [Kes82,
Appendix A]. For much more on percolation, see [Gri10b]. A gentler introduction
is provided in [Ste].
Section 2.3 The presentation in Section 2.3.2 follows [AS11, Section 4.4] and
[JLR11, Section 3.1]. The result for general subgraphs is due to Bollobás [Bol81].
A special case (including cliques) was proved by Erdős and Rényi [ER60]. For
variants of the small subgraph containment problem involving copies that are in-
duced, disjoint, isolated, etc., see for example [JLR11, Chapter 3]. For corre-
sponding results for larger subgraphs, such as cycles or matchings, see for exam-
ple [Bol01]. The connectivity threshold in Section 2.3.2 is also due to the same
authors [ER59]. The presentation here follows [vdH17, Section 5.2]. For more on
the method of moments, see for example [Dur10, Section 3.3.5] or [JLR11, Section
6.1]. Claim 2.3.11 is due to R. Lyons [Lyo90].
Section 2.4 The use of the moment-generating function to derive tail bounds for
sums of independent random variables was pioneered by Cramér [Cra38], Bern-
stein [Ber46], and Chernoff [Che52]. For much more on concentration inequali-
ties, see for example [BLM13]. The basics of large deviations theory are covered
in [Dur10, Section 2.6]. See also [RAS15] and [DZ10]. Section 2.4.2 is based
partly on [Ver18] and [Lug, Section 3.2]. Section 2.4.3 is based on [FR98, Section
5.3]. Very insightful, and much deeper, treatment of the material in Section 2.4.4
can be found in [Ver18, vH16]. The presentation in Section 2.4.5 is inspired
by [Har, Lectures 6 and 8] and [Tao]. The Johnson-Lindenstrauss lemma was first
proved by Johnson and Lindenstrauss using non-probabilistic arguments [JL84].
The idea of using random projections to simplify the proof was introduced by
Frankl and Maehara [FM88] and the proof presented here based on Gaussian pro-
jections is due to Indyk and Motwani [IM98]. See [Ach03] for an overview of the
various proofs known. For more on the random projection method, see [Vem04].
For algorithmic applications of the Johnson-Lindenstrauss lemma, see for exam-
ple [Har, Lecture 7]. Compressed sensing emerged in the works of Donoho [Don06]
and Candès, Romberg and Tao [CRT06a, CRT06b]. The restricted isometry property was introduced by Candès and Tao [CT05]. Lemma 2.4.39 is due to Candès,
Romberg and Tao [CRT06b]. The proof of Lemma 2.4.38 presented here is due
to Baraniuk et al. [BDDW08]. A survey of compressed sensing can be found
in [CW08]. A thorough mathematical introduction to compressed sensing can be
found in [FR13]. The material in Section 2.4.2 can be found in [BLM13, Chapter
2]. Hoeffding’s lemma and inequality are due to Hoeffding [Hoe63]. Section 2.4.6
borrows from [Ver18, vH16, SSBD14, Haz16]. The proof of Sauer’s lemma fol-
lows [Ver18, Section 8.3.3]. For a proof of Claim 2.4.48 in general dimension d,
see for example [SSBD14, Section 9.1.3].
Chapter 3
In this chapter we turn to martingales, which play a central role in probability the-
ory. We illustrate their use in a number of applications to the analysis of discrete
stochastic processes. After some background on stopping times and a brief review
of basic martingale properties and results in Section 3.1, we develop two major
directions. In Section 3.2, we show how martingales can be used to derive a sub-
stantial generalization of our previous concentration inequalities—from the sums
of independent random variables we focused on in Chapter 2 to nonlinear func-
tions with Lipschitz properties. In particular, we give several applications of the
method of bounded differences to random graphs. We also discuss bandit problems
in machine learning. In the second thread in Section 3.3, we give an introduction
to potential theory and electrical network theory for Markov chains. This toolkit
in particular provides bounds on hitting times for random walks on networks, with
important implications in the study of recurrence among other applications. We
also introduce Wilson’s remarkable method for generating uniform spanning trees.
3.1 Background
We begin with a quick review of stopping times and martingales. Along the way,
we prove a few useful results. In particular, we derive some bounds on hitting times
and cover times of Markov chains.
Throughout, $(\Omega, \mathcal{F}, (\mathcal{F}_t)_{t \in \mathbb{Z}_+}, \mathbb{P})$ is a filtered space. See Appendix B for a formal definition. Recall that, intuitively, the σ-algebra $\mathcal{F}_t$ in the filtration $(\mathcal{F}_t)_t$ represents “the information known at time $t$.” All time indices are discrete (in $\mathbb{Z}_+$ unless stated otherwise). We will also use the notation $\overline{\mathbb{Z}}_+ := \{0, 1, \ldots, +\infty\}$ to allow time $+\infty$.
Version: December 20, 2023
Modern Discrete Probability: An Essential Toolkit
Copyright © 2023 Sébastien Roch
Example 3.1.2 (Hitting time). Let $(A_t)_{t \in \mathbb{Z}_+}$, with values in $(E, \mathcal{E})$, be adapted and let $B \in \mathcal{E}$. Then
\[
\tau = \inf\{t \ge 0 : A_t \in B\},
\]
is a stopping time known as a hitting time. In contrast, the last visit to a set is typically not a stopping time. ◁
Let $\tau$ be a stopping time. Denote by $\mathcal{F}_\tau$ the set of all events $F$ such that, $\forall t \in \mathbb{Z}_+$, $F \cap \{\tau = t\} \in \mathcal{F}_t$. Intuitively, the σ-algebra $\mathcal{F}_\tau$ captures the information up to time $\tau$. The following lemmas help clarify the definition of $\mathcal{F}_\tau$.
Proof. For $B \in \mathcal{E}$,
\[
\{X_\tau \in B\} \cap \{\tau = t\} = \{X_t \in B\} \cap \{\tau = t\} \in \mathcal{F}_t,
\]
Let $(X_t)$ be a Markov chain on a countable space $V$. The following two examples of stopping times will play an important role.
Definition 3.1.6 (First visit and return). The first visit time and first return time to $x \in V$ are
Similarly, $\tau_B$ and $\tau_B^+$ are the first visit time and first return time to $B \subseteq V$.
Definition 3.1.7 (Cover time). Assume $V$ is finite. The cover time of $(X_t)$ is the first time that all states have been visited, that is,
Strong Markov property Let (Xt ) be a Markov chain with transition matrix
P and initial distribution µ. Let Ft = σ(X0 , . . . , Xt ). Recall that the Markov
property (Theorem 1.1.18) says that, given the present, the future is independent
of the past. The Markov property naturally extends to stopping times. Let τ be a
stopping time with P[τ < +∞] > 0. In its simplest form we have:
In words, the chain “starts fresh” at a stopping time with the state at that time as
starting point. More generally:
Theorem 3.1.8 (Strong Markov property). Let ft : V ∞ → R be a sequence of
measurable functions, uniformly bounded in t and let Ft (x) := Ex [ft ((Xs )s≥0 )].
On {τ < +∞},
E[fτ ((Xτ +t )t≥0 ) | Fτ ] = Fτ (Xτ ).
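Throughout, when we say that two random variables $Y, Z$ are equal on an event $B$,
we mean formally that $Y \mathbb{1}_B = Z \mathbb{1}_B$ almost surely.
\[ \mathbb{E}[f_\tau((X_{\tau+t})_{t \ge 0}) \mid \mathcal{F}_\tau]\, \mathbb{1}_{\tau < +\infty} = \mathbb{E}[f_\tau((X_{\tau+t})_{t \ge 0})\, \mathbb{1}_{\tau < +\infty} \mid \mathcal{F}_\tau]. \]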
Let A ∈ Fτ . Summing over the possible values of τ , using the tower property
(Lemma B.6.16) and then the Markov property
The following typical application of the strong Markov property (Theorem 3.1.8)
is useful.
probability at least 1/2 of being greater or equal to 0 by the first moment principle
(Theorem 2.2.1), an event which implies that St is greater than or equal to b. Hence
\[ \mathbb{P}[S_t \ge b] \ge \mathbb{P}[\tau = t] + \frac{1}{2}\, \mathbb{P}[\tau < t] \ge \frac{1}{2}\, \mathbb{P}[\tau \le t]. \]
(Exercise 3.1 asks for a more formal proof.)
Theorem 3.1.10 (Reflection principle: simple random walk). Let (St ) be simple
random walk on Z started at 0. Then, ∀a, b, t > 0,
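\[ \mathbb{P}[S_t = b + a] = \mathbb{P}\Big[S_t = b - a,\ \sup_{i \le t} S_i \ge b\Big], \]
and
\[ \mathbb{P}\Big[\sup_{i \le t} S_i \ge b\Big] = \mathbb{P}[S_t = b] + 2\, \mathbb{P}[S_t > b]. \]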
Proof. For the first claim, reflect the sub-path after the first visit to b across the line
y = b. Summing over a > 0 and rearranging gives the second claim.
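Both identities of the reflection principle can be checked by brute force for small $t$. The sketch below is only an illustration, not part of the proof: it enumerates all $2^t$ equally likely paths and compares path counts on the two sides of each identity. The function names are hypothetical.

```python
from itertools import product

def path_statistics(t):
    """Yield (S_t, max_{i<=t} S_i) for every +-1 path of length t started at 0."""
    for steps in product((-1, 1), repeat=t):
        position, running_max = 0, 0
        for step in steps:
            position += step
            running_max = max(running_max, position)
        yield position, running_max

def first_identity_counts(t, a, b):
    """Path counts for {S_t = b + a} and for {S_t = b - a, sup_i S_i >= b}."""
    left = sum(1 for end, _ in path_statistics(t) if end == b + a)
    right = sum(1 for end, top in path_statistics(t) if end == b - a and top >= b)
    return left, right

def second_identity_counts(t, b):
    """Path counts for {sup_i S_i >= b} and for #{S_t = b} + 2 #{S_t > b}."""
    left = sum(1 for _, top in path_statistics(t) if top >= b)
    right = sum((end == b) + 2 * (end > b) for end, _ in path_statistics(t))
    return left, right

for t, a, b in [(6, 1, 2), (8, 2, 2), (9, 1, 3)]:
    print(t, a, b, first_identity_counts(t, a, b), second_identity_counts(t, b))
```

Because every path has probability $2^{-t}$, equal counts are equivalent to equal probabilities.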
Recurrence Let $(X_t)$ be a Markov chain on a countable state space $V$. The time
of the $k$-th return to $y$ is (letting $\tau_y^0 := 0$)
\[ \tau_y^k := \inf\{t > \tau_y^{k-1} : X_t = y\}. \]
In particular, $\tau_y^1 = \tau_y^+$. Define $\rho_{xy} := \mathbb{P}_x[\tau_y^+ < +\infty]$. Then by the strong Markov
property (and induction)
When ρyy < 1, we have Ey [Ny ] < +∞ by (3.1.2), and in particular τyk = +∞ for
some k. Or ρyy = 1 and, starting at x = y, we have τyk < +∞ almost surely for
all k by (3.1.1). That leads us to the following dichotomy.
Definition 3.1.12 (Recurrence). A state x is recurrent if ρxx = 1. Otherwise it
is transient. We refer to the recurrence or transience of a state as its type. Let x
be recurrent. If in addition Ex [τx+ ] < +∞, we say that x is positive recurrent;
otherwise we say that it is null recurrent. A chain is recurrent (or transient, or
positive recurrent, or null recurrent) if all its states are.
Recurrence is “contagious” in the following sense.
Lemma 3.1.13. If x is recurrent and ρxy > 0 then y is recurrent and ρyx = ρxy =
1.
A subset C ⊆ V is closed if x ∈ C and ρxy > 0 implies y ∈ C. A subset
D ⊆ V is irreducible if x, y ∈ D implies ρxy > 0. This definition is consistent
with (and generalizes to sets) the one we gave in Section 1.1.2. Recall that we have
the following decomposition theorem.
Theorem 3.1.14 (Decomposition theorem). Let R := {x : ρxx = 1} be the
recurrent states of the chain. Then R can be written as a disjoint union ∪j Rj
where each Rj is closed and irreducible.
Example 3.1.15 (Simple random walk on Z). Consider simple random walk (St )
on Z started at 0. The chain is clearly irreducible so it suffices to check the type
of state 0 by Lemma 3.1.13. First note the periodicity of this chain. So we look at
S2t . Then by Stirling’s formula (see Appendix A)
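\[ \mathbb{P}[S_{2t} = 0] = \binom{2t}{t} 2^{-2t} \sim 2^{-2t}\, \frac{(2t)^{2t}}{(t^t)^2}\, \frac{\sqrt{2t}}{\sqrt{2\pi}\, t} \sim \frac{1}{\sqrt{\pi t}}. \]
Thus
\[ \mathbb{E}[N_0] = \sum_{t > 0} \mathbb{P}[S_t = 0] = +\infty, \]
and the chain is recurrent.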
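The Stirling asymptotics and the divergence of the series can be observed numerically. The following sketch (function names are hypothetical) compares the exact return probability with $1/\sqrt{\pi t}$ and accumulates partial sums using the ratio $\mathbb{P}[S_{2t} = 0] / \mathbb{P}[S_{2t-2} = 0] = (2t-1)/(2t)$, which follows from the binomial formula.

```python
from math import comb, pi, sqrt

def return_probability(t):
    """Exact P[S_{2t} = 0] for simple random walk on Z started at 0."""
    return comb(2 * t, t) / 4 ** t

def stirling_estimate(t):
    """Asymptotic value 1 / sqrt(pi * t) given by Stirling's formula."""
    return 1 / sqrt(pi * t)

def partial_expected_visits(horizon):
    """Sum of P[S_{2t} = 0] over 1 <= t <= horizon, using p_t = p_{t-1} * (2t - 1) / (2t)."""
    total, p = 0.0, 1.0
    for t in range(1, horizon + 1):
        p *= (2 * t - 1) / (2 * t)
        total += p
    return total

for t in (1, 10, 100, 1000):
    print(t, return_probability(t), stirling_estimate(t))
for horizon in (100, 10_000, 1_000_000):
    print(horizon, partial_expected_visits(horizon))
```

The partial sums grow like $2\sqrt{T/\pi}$, in line with $\mathbb{E}[N_0] = +\infty$.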
Return times are closely related to stationary measures. We recall the following
standard results without proof. We gave an alternative proof of the existence of a
unique stationary distribution in the finite, irreducible case in Theorem 1.1.24.
Theorem 3.1.16. Let $x$ be a recurrent state. Then the following defines a stationary
measure
\[ \mu_x(y) := \mathbb{E}_x\Bigg[\sum_{0 \le t < \tau_x^+} \mathbb{1}_{\{X_t = y\}}\Bigg]. \]
Theorem 3.1.17. If (Xt ) is irreducible and recurrent, then the stationary measure
is unique up to a constant multiple.
Theorem 3.1.18. If there is a stationary distribution π then all states y that have
π(y) > 0 are recurrent.
Proof. We only prove the transient case. In that case, we showed in (3.1.2) that
\[ \sum_t P^t(x, y) = \mathbb{E}_x\Bigg[\sum_t \mathbb{1}_{\{X_t = y\}}\Bigg] = \mathbb{E}_x[N_y] < +\infty. \]
Hence $P^t(x, y) \to 0$.
A useful identity A slight generalization of the “cycle trick” used in the proof of
Theorem 3.1.16 gives a useful identity.
Definition 3.1.22 (Green function). Let $\sigma$ be a stopping time for a Markov chain
$(X_t)$. The Green function of the chain stopped at $\sigma$ is given by
\[ G_\sigma(x, y) = \mathbb{E}_x\Bigg[\sum_{0 \le t < \sigma} \mathbb{1}_{\{X_t = y\}}\Bigg], \qquad x, y \in V. \tag{3.1.3} \]
Gσ (x, y) = πy Ex [σ].
3.1.2 . Markov chains: exponential tail of hitting times and some cover
time bounds
Tail of a hitting time On a finite state space, the tail of any hitting time converges
to 0 exponentially fast.
Lemma 3.1.25. Let (Xt ) be a finite, irreducible Markov chain with state space V .
For any subset of states A ⊆ V and initial distribution µ:
(i) It holds that Eµ [τA ] < +∞ (and, in particular, τA < +∞ a.s.).
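(ii) Letting $\bar{t}_A := \max_x \mathbb{E}_x[\tau_A]$, we have the tail bound
\[ \mathbb{P}_\mu[\tau_A > t] \le \exp\left(-\left\lfloor \frac{t}{\lceil e\, \bar{t}_A \rceil} \right\rfloor\right). \]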
Proof. For any positive integer m, for some distribution θ over the state space V ,
by the strong Markov property (Theorem 3.1.8)
Choose a positive integer s large enough that, from any x, there is a path to A
of length at most s of positive probability. Such an s exists by irreducibility. In
particular αs < 1.
By the multiplication rule and the monotonicity of the events {τA > rs} over
r, we have
\[ \mathbb{P}_\mu[\tau_A > ms] = \mathbb{P}_\mu[\tau_A > s] \prod_{r=2}^{m} \mathbb{P}_\mu[\tau_A > rs \mid \tau_A > (r-1)s]. \]
since αs < 1.
Now that we have established that $\bar{t}_A < +\infty$, by Markov’s inequality (Theorem 2.1.1),
\[ \alpha_s = \max_x \mathbb{P}_x[\tau_A > s] \le \frac{\bar{t}_A}{s}, \]
for all positive integers $s$. Plugging back into (3.1.5) gives $\mathbb{P}_\mu[\tau_A > t] \le \left(\frac{\bar{t}_A}{s}\right)^{\lfloor t/s \rfloor}$. By differentiating with respect to $s$, it can be checked that a good choice
for $s$ is $\lceil e\, \bar{t}_A \rceil$. Simplifying gives the second claim.
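The tail bound can be checked numerically on a small chain. The sketch below is an illustration only: it takes simple random walk on an 8-cycle with $A = \{0\}$ (an arbitrary choice), computes $\mathbb{P}_x[\tau_A > t]$ by propagating the chain killed on $A$, approximates $\mathbb{E}_x[\tau_A]$ by a truncated tail sum, and compares the tails with $\exp(-\lfloor t/\lceil e\,\bar{t}_A \rceil \rfloor)$. All function names are hypothetical.

```python
from math import ceil, e, exp

def cycle_walk(n):
    """Transition matrix of simple random walk on the n-cycle (n >= 3)."""
    P = [[0.0] * n for _ in range(n)]
    for x in range(n):
        P[x][(x + 1) % n] = 0.5
        P[x][(x - 1) % n] = 0.5
    return P

def tail_probabilities(P, A, horizon):
    """tails[t][x] = P_x[tau_A > t] for t = 0, ..., horizon."""
    n = len(P)
    alive = [0.0 if x in A else 1.0 for x in range(n)]  # P_x[tau_A > 0]
    tails = [alive]
    for _ in range(horizon):
        # Markov property at time 1: for x outside A,
        # P_x[tau_A > t + 1] = sum_y P(x, y) P_y[tau_A > t]
        alive = [0.0 if x in A else sum(P[x][y] * alive[y] for y in range(n))
                 for x in range(n)]
        tails.append(alive)
    return tails

def check_tail_bound(P, A, horizon):
    """Return (t_bar, s, ok), where ok says whether the bound holds up to the horizon."""
    n = len(P)
    tails = tail_probabilities(P, A, horizon)
    # E_x[tau_A] = sum_{t >= 0} P_x[tau_A > t], truncated at the horizon
    t_bar = max(sum(row[x] for row in tails) for x in range(n))
    s = ceil(e * t_bar)
    ok = all(tails[t][x] <= exp(-(t // s)) + 1e-12
             for t in range(horizon + 1) for x in range(n))
    return t_bar, s, ok

print(check_tail_bound(cycle_walk(8), {0}, 2000))
```

In this example the largest expected hitting time is 16 (from the antipodal state), so the block size is $\lceil 16e \rceil = 44$.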
Claim 3.1.26.
\[ \max_x \mathbb{E}_x[\tau_{\mathrm{cov}}] \le (3 + \log n)\, \lceil e\, \bar{t}_{\mathrm{hit}} \rceil. \]
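Proof. By a union bound over all states to be visited and Lemma 3.1.25,
\[ \max_x \mathbb{P}_x[\tau_{\mathrm{cov}} > t] \le \min\left\{1,\ n \exp\left(-\left\lfloor \frac{t}{\lceil e\, \bar{t}_{\mathrm{hit}} \rceil} \right\rfloor\right)\right\}. \]
Summing over $t \in \mathbb{Z}_+$ and appealing to the sum of a geometric series,
\[ \max_x \mathbb{E}_x[\tau_{\mathrm{cov}}] \le (\log n + 1)\, \lceil e\, \bar{t}_{\mathrm{hit}} \rceil + \frac{1}{1 - e^{-1}}\, \lceil e\, \bar{t}_{\mathrm{hit}} \rceil, \]
where the first term on the right-hand side comes from the fact that until $t \ge
(\log n + 1) \lceil e\, \bar{t}_{\mathrm{hit}} \rceil$ the upper bound above is 1. The factor $\lceil e\, \bar{t}_{\mathrm{hit}} \rceil$ in the second
term on the right-hand side comes from the fact that we must break up the series
into blocks of size $\lceil e\, \bar{t}_{\mathrm{hit}} \rceil$. Simplifying gives the claim.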
The previous proof should be reminiscent of that of Theorem 2.4.21.
A clever argument gives a better constant factor as well as a lower bound.
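Theorem 3.1.27 (Matthews’ cover time bounds). Let
\[ t_{\mathrm{hit}}^{A} := \min_{x, y \in A,\ x \ne y} \mathbb{E}_x \tau_y, \]
and $h_n := \sum_{m=1}^{n} \frac{1}{m}$. Then
\[ \max_x \mathbb{E}_x[\tau_{\mathrm{cov}}] \le h_n\, \bar{t}_{\mathrm{hit}}, \tag{3.1.6} \]
and
\[ \min_x \mathbb{E}_x[\tau_{\mathrm{cov}}] \ge \max_{A \subseteq V} h_{|A|-1}\, t_{\mathrm{hit}}^{A}. \tag{3.1.7} \]
Clearly, $\max_{x \ne y} t_{\mathrm{hit}}^{\{x,y\}}$ is a lower bound on the worst expected cover time. Lower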
bound (3.1.7) says that a tighter bound is obtained by finding a larger subset of
states A that are “far away” from each other.
We sketch the proof of the lower bound for $A = V$, which we assume is
$[n]$ without loss of generality. The other cases are similar. Let $(J_1, \ldots, J_n)$ be a
uniform random ordering of $V$, let $C_m := \max_{i \le m} \tau_{J_i}$, and let $L_m$ be the last state
visited among $J_1, \ldots, J_m$. Then for $m \ge 2$
\[ \mathbb{E}_x[C_m - C_{m-1} \mid J_1, \ldots, J_m, \{X_t, t \le C_{m-1}\}] \ge t_{\mathrm{hit}}^{V}\, \mathbb{1}_{\{L_m = J_m\}}. \]
By symmetry, $\mathbb{P}[L_m = J_m] = \frac{1}{m}$. To see this, first pick the set of vertices corresponding to $\{J_1, \ldots, J_m\}$, wait for all of those vertices to be visited, then pick the
ordering. Moreover observe that $\mathbb{E}_x C_1 \ge (1 - \frac{1}{n})\, t_{\mathrm{hit}}^{V}$ where the factor of $(1 - \frac{1}{n})$
accounts for the probability that $J_1 \ne x$. Taking expectations above and summing
over $m$ gives the result.
Exercise 3.3 asks for a proof that the bounds above cannot in general be im-
proved up to smaller order terms.
3.1.3 Martingales
Definition Martingales are an important class of stochastic processes that corre-
spond intuitively to the “probabilistic version of a monotone sequence.” They hide
behind many processes and have properties that make them powerful tools in the
analysis of processes where they have been identified. Formally:
Definition 3.1.28 (Martingale). An adapted process $(M_t)_{t \ge 0}$ with $\mathbb{E}|M_t| < +\infty$
for all $t$ is a martingale if
\[ \mathbb{E}[M_{t+1} \mid \mathcal{F}_t] = M_t, \qquad \forall t \ge 0. \]
Recall that adapted (Definition B.7.5) simply means that Mt ∈ Ft , that is, roughly
speaking Mt is “known at time t.” Note that for a martingale, by the tower property
(Lemma B.6.16), we have E[Mt | Fs ] = Ms for all t > s, and similarly (with
inequalities) for supermartingales and submartingales.
We start with a straightforward example.
Example 3.1.29 (Sums of i.i.d. random variables with mean 0). Let $X_0, X_1, \ldots$
be i.i.d. integrable, centered random variables, $\mathcal{F}_t = \sigma(X_0, \ldots, X_t)$, $S_0 = 0$, and
$S_t = \sum_{i=1}^{t} X_i$. Note that $\mathbb{E}|S_t| < \infty$ by the triangle inequality. By taking out
what is known and the role of independence lemma (Lemma B.6.14) we obtain
Martingales however are richer than random walks with centered steps. For in-
stance mixtures of such random walks are also martingales.
Example 3.1.30 (Mixtures of random walks). Consider again the setting of Exam-
ple 3.1.29. This time assume that X0 is uniformly distributed in {1, 2} and define
Rt = X0 St , t ≥ 0.
Further examples Martingales can also be a little more hidden. Here are two
examples.
Example 3.1.31 (Variance of a sum of i.i.d. random variables). Consider again the
setting of Example 3.1.29 with $\sigma^2 := \mathrm{Var}[X_1] < \infty$. Define
\[ M_t = S_t^2 - t\sigma^2. \]
Note that by the triangle inequality and the fact that $S_t$ has mean zero and is a sum
of independent random variables
\[ \mathbb{E}|M_t| \le \sum_{i=1}^{t} \mathrm{Var}[X_i] + t\sigma^2 \le 2t\sigma^2 < +\infty. \]
Moreover, arguing similarly to the previous example, and using the fact that both
$X_t$ and $S_{t-1}$ are square integrable
\begin{align*}
\mathbb{E}[M_t \mid \mathcal{F}_{t-1}] &= \mathbb{E}[(X_t + S_{t-1})^2 - t\sigma^2 \mid \mathcal{F}_{t-1}] \\
&= \mathbb{E}[X_t^2 + 2 X_t S_{t-1} + S_{t-1}^2 - t\sigma^2 \mid \mathcal{F}_{t-1}] \\
&= \sigma^2 + 0 + S_{t-1}^2 - t\sigma^2 \\
&= M_{t-1},
\end{align*}
which proves that $(M_t)$ is a martingale.
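The martingale property of $S_t^2 - t\sigma^2$ can be verified exactly for a concrete centered step distribution. The sketch below assumes, purely for illustration, steps equal to $-2$ with probability $1/3$ and $1$ with probability $2/3$ (mean 0, variance 2), and checks the one-step conditional expectation for every possible history using rational arithmetic.

```python
from fractions import Fraction
from itertools import product

# Illustrative centered step distribution (an assumption, not from the text)
VALUES = (-2, 1)
PROBS = (Fraction(1, 3), Fraction(2, 3))
SIGMA2 = sum(p * v * v for v, p in zip(VALUES, PROBS))  # variance, since the mean is 0

def martingale_value(prefix):
    """M_t = S_t^2 - t * sigma^2 for the history prefix = (X_1, ..., X_t)."""
    return sum(prefix) ** 2 - len(prefix) * SIGMA2

def conditional_expectation_next(prefix):
    """E[M_t | X_1, ..., X_{t-1} = prefix] with t = len(prefix) + 1."""
    t = len(prefix) + 1
    s_prev = sum(prefix)
    return sum(p * ((s_prev + v) ** 2 - t * SIGMA2) for v, p in zip(VALUES, PROBS))

def is_martingale_up_to(depth):
    """Check E[M_t | F_{t-1}] = M_{t-1} on every history of length < depth."""
    return all(conditional_expectation_next(prefix) == martingale_value(prefix)
               for length in range(depth)
               for prefix in product(VALUES, repeat=length))

print(SIGMA2, is_martingale_up_to(5))
```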
Example 3.1.32 (Eigenvectors of a transition matrix). Let (Xt )t≥0 be a finite
Markov chain with state space V and transition matrix P , and let (Ft )t≥0 be the
corresponding filtration. Suppose f : V → R is such that
\[ \sum_j P(i, j) f(j) = \lambda f(i), \qquad \forall i \in V. \]
Or we can create martingales out of thin air. We give two important examples
that will appear later.
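Example 3.1.33 (Doob martingale: accumulating data). Let $X$ be a random variable with $\mathbb{E}|X| < +\infty$.
Define $M_t = \mathbb{E}[X \mid \mathcal{F}_t]$. Note that $\mathbb{E}|M_t| \le \mathbb{E}|X| < +\infty$ by Jensen’s inequality,
and
\[ \mathbb{E}[M_t \mid \mathcal{F}_{t-1}] = \mathbb{E}[\mathbb{E}[X \mid \mathcal{F}_t] \mid \mathcal{F}_{t-1}] = \mathbb{E}[X \mid \mathcal{F}_{t-1}] = M_{t-1}, \]
by the tower property. This is known as a Doob martingale. Intuitively this process
tracks our expectation of the unobserved $X$ as “more information becomes available.” See the so-called “exposure martingales” in Section 3.2.3 for a concrete
illustration of this idea.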
Then
\[ \mathbb{E}|N_t| \le \sum_{i \le t} 2\, \mathbb{E}|X_t|\, K < +\infty, \]
Lemma 3.1.35. If (Mt )t≥0 is a martingale and φ is a convex function such that
E|φ(Mt )| < +∞ for all t, then (φ(Mt ))t≥0 is a submartingale. Moreover, if
(Mt )t≥0 is a submartingale and φ is an increasing convex function with E|φ(Mt )| <
+∞ for all t, then (φ(Mt ))t≥0 is a submartingale.
Definition 3.1.36. Let $(M_t)$ be an adapted process and $\sigma$ be a stopping time. Then
\[ M_t^\sigma := M_{\sigma \wedge t} \]
is $M_t$ stopped at $\sigma$.
Lemma 3.1.37. Let (Mt ) be a supermartingale and σ be a stopping time. Then
the stopped process (Mtσ ) is a supermartingale and in particular
The same result holds with equalities if (Mt ) is a martingale, and with inequalities
in the opposite direction if (Mt ) is a submartingale.
E[Mσ ] ≤ E[M0 ],
(i) σ is bounded;
(iii) E[σ] < +∞ and (Mt ) has bounded increments (i.e., there is c > 0 such that
|Mt − Mt−1 | ≤ c a.s. for all t);
Proof. Case (iv) is Fatou’s lemma (Proposition B.4.14). We prove (iii). We leave
the proof of the other claims as an exercise (see Exercise 3.5).
From Lemma 3.1.37, we have
E[Mσ∧t − M0 ] ≤ 0. (3.1.8)
Furthermore the assumption that E[σ] < +∞ implies that σ < +∞ almost surely.
Hence we seek to take a limit as t → +∞ inside the expectation. To justify
swapping limit and expectation, note that by a telescoping sum
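\begin{align*}
|M_{\sigma \wedge t} - M_0| &\le \Bigg|\sum_{s \le \sigma \wedge t} (M_s - M_{s-1})\Bigg| \\
&\le \sum_{s \le \sigma} |M_s - M_{s-1}| \\
&\le c\sigma.
\end{align*}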
The claim now follows from dominated convergence (Proposition B.4.14). Equal-
ity holds if (Mt ) is a martingale.
Although the optional stopping theorem (Theorem 3.1.38) is useful, one of-
ten works directly with Lemma 3.1.37 and applies suitable limit theorems (see
Proposition B.4.14). The following martingale-based proof of Wald’s first identity
provides an illustration.
E[Sτ ] = µ E[τ ].
We also recall Wald’s second identity. The proof, which we omit, uses the martin-
gale in Example 3.1.31.
Example 3.1.41 (Gambler’s ruin: unbiased case). Let (St ) be simple random walk
on Z started at 0 and let τ = τa ∧ τb where −∞ < a < 0 < b < +∞, where the
first visit time τx was defined in Definition 3.1.6.
(ii) By Wald’s first identity (Theorem 3.1.39) and (i), we have E[Sτ ] = 0 or
a P[Sτ = a] + b P[Sτ = b] = 0,
(iii) Because $\sigma^2 = 1$, Wald’s second identity (Theorem 3.1.40) says that $\mathbb{E}[S_\tau^2] =
\mathbb{E}[\tau]$. Furthermore, we have by (ii)
\[ \mathbb{E}[S_\tau^2] = \frac{b}{b-a}\, a^2 + \frac{-a}{b-a}\, b^2 = -ab. \]
Thus $\mathbb{E}[\tau] = -ab$.
(iv) The first claim was proved in (ii). When b → +∞, τ = τa ∧ τb ↑ τa and
monotone convergence applied to (iii) gives that E[τa ] = +∞.
That concludes the proof.
Note that (iv) above shows that the $L^1$ condition on the stopping time in Wald’s
second identity (Theorem 3.1.40) is necessary. Indeed we have shown $a^2 = \mathbb{E}[S_{\tau_a}^2] \ne
\sigma^2\, \mathbb{E}[\tau_a] = +\infty$.
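The values $\mathbb{P}[S_\tau = a] = b/(b-a)$ and $\mathbb{E}[\tau] = -ab$ can be cross-checked by a different route than the martingale argument used here, namely first-step analysis. The sketch below (hypothetical function names) verifies with exact arithmetic that the candidates $h(x) = (b-x)/(b-a)$ and $g(x) = (x-a)(b-x)$ satisfy the one-step equations $h(x) = \frac{1}{2}(h(x-1) + h(x+1))$ and $g(x) = 1 + \frac{1}{2}(g(x-1) + g(x+1))$ at every interior point, with the right boundary values.

```python
from fractions import Fraction

def ruin_probability(x, a, b):
    """Candidate for P_x[tau_a < tau_b]: linear in x, equal to 1 at a and 0 at b."""
    return Fraction(b - x, b - a)

def expected_exit_time(x, a, b):
    """Candidate for E_x[tau_a ^ tau_b]: vanishes at both endpoints."""
    return (x - a) * (b - x)

def first_step_equations_hold(a, b):
    """Check the first-step equations at every interior point of (a, b)."""
    half = Fraction(1, 2)
    for x in range(a + 1, b):
        if ruin_probability(x, a, b) != half * (
                ruin_probability(x - 1, a, b) + ruin_probability(x + 1, a, b)):
            return False
        if expected_exit_time(x, a, b) != 1 + half * (
                expected_exit_time(x - 1, a, b) + expected_exit_time(x + 1, a, b)):
            return False
    return True

a, b = -3, 5
print(first_step_equations_hold(a, b), ruin_probability(0, a, b), expected_exit_time(0, a, b))
```

Evaluated at the starting point $x = 0$, the two candidates reduce to $b/(b-a)$ and $-ab$.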
Example 3.1.43 (Gambler’s ruin: biased case). The biased random walk on $\mathbb{Z}$ with
parameter $1/2 < p < 1$ is the process $(S_t)$ with $S_0 = 0$ and $S_t = \sum_{i=1}^{t} X_i$ where
the $X_i$s are i.i.d. in $\{-1, +1\}$ with $\mathbb{P}[X_1 = 1] = p$. Let again $\tau := \tau_a \wedge \tau_b$ where
$a < 0 < b$. Define $q := 1 - p$, $\delta := p - q > 0$, and $\phi(x) := (q/p)^x$.
Claim 3.1.44. We have:
(i) $\tau < +\infty$ almost surely;
(ii) $\mathbb{P}[\tau_a < \tau_b] = \frac{\phi(b) - \phi(0)}{\phi(b) - \phi(a)}$;
(iii) $\mathbb{E}[\tau_b] = \frac{b}{2p - 1}$;
and
(i) This claim follows by the same argument as in the unbiased case.
(ii) Note that (φ(St )) is a nonnegative, bounded martingale since q < p by as-
sumption. By Lemma 3.1.37 and dominated convergence (Proposition B.4.14),
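\[ \mathbb{P}[\tau_a < +\infty] = \frac{1}{\phi(a)} < 1, \tag{3.1.9} \]
so that $\tau_a = +\infty$ with positive probability. On the other hand, $\mathbb{P}[\tau_b < \tau_a] =
1 - \mathbb{P}[\tau_a < \tau_b] = \frac{\phi(0) - \phi(a)}{\phi(b) - \phi(a)}$, and taking $a \to -\infty$
and the fact that $\tau_b < +\infty$ almost surely from (ii) to deduce that $\mathbb{E}[\tau_b] =
\frac{\mathbb{E}[S_{\tau_b}]}{p - q} = \frac{b}{2p - 1}$.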
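The formulas in the biased case can be sanity-checked through first-step equations rather than the martingale argument. The sketch below (hypothetical names; $p = 2/3$ is an arbitrary choice) verifies exactly that $h(x) = (\phi(b) - \phi(x))/(\phi(b) - \phi(a))$ satisfies $h(x) = p\,h(x+1) + q\,h(x-1)$ with $h(a) = 1$, $h(b) = 0$, and that $g(x) = (b - x)/(2p - 1)$ satisfies $g(x) = 1 + p\,g(x+1) + q\,g(x-1)$. The second check shows only consistency with $\mathbb{E}[\tau_b] = b/(2p-1)$, since on a half-line such equations do not by themselves pin down the solution.

```python
from fractions import Fraction

def phi(x, p):
    """phi(x) = (q / p)^x with q = 1 - p; p must be a Fraction."""
    return ((1 - p) / p) ** x

def ruin_probability(x, a, b, p):
    """Candidate for P_x[tau_a < tau_b]; at x = 0 this is the expression in (ii)."""
    return (phi(b, p) - phi(x, p)) / (phi(b, p) - phi(a, p))

def expected_time_to_b(x, b, p):
    """Candidate for E_x[tau_b]; at x = 0 this is b / (2p - 1)."""
    return Fraction(b - x) / (2 * p - 1)

def first_step_equations_hold(a, b, p):
    """Check both one-step equations on the interior of (a, b) and the boundary values of h."""
    q = 1 - p
    for x in range(a + 1, b):
        if ruin_probability(x, a, b, p) != (
                p * ruin_probability(x + 1, a, b, p) + q * ruin_probability(x - 1, a, b, p)):
            return False
        if expected_time_to_b(x, b, p) != (
                1 + p * expected_time_to_b(x + 1, b, p) + q * expected_time_to_b(x - 1, b, p)):
            return False
    return ruin_probability(a, a, b, p) == 1 and ruin_probability(b, a, b, p) == 0

p = Fraction(2, 3)
print(first_step_equations_hold(-3, 4, p), ruin_probability(0, -3, 4, p), expected_time_to_b(0, 4, p))
```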
\[ \langle M_t - M_s,\, M_v - M_u \rangle = 0, \]
In words, martingale increments over disjoint time intervals are uncorrelated (pro-
vided the second moment exists). Note that this is weaker than the independence
of increments of random walks. (See Section 3.2.1 for more discussion on this.)
When this is the case, Mt converges almost surely and in L2 to a finite limit M∞ ,
and furthermore
E[Mt ] → E[M∞ ] < +∞,
as t → +∞.
surely to a finite limit M∞ with E|M∞ | < +∞ by Theorem 3.1.47. Then using
Fatou’s lemma (Proposition B.4.14) in
\[ \mathbb{E}[(M_{t+s} - M_t)^2] = \sum_{t+1 \le i \le t+s} \mathbb{E}[(M_i - M_{i-1})^2], \]
gives
\[ \mathbb{E}[(M_\infty - M_t)^2] \le \sum_{t+1 \le i} \mathbb{E}[(M_i - M_{i-1})^2]. \]
The right-hand side goes to 0 as t → +∞ since the full series is finite, which
proves the second claim.
The last claim follows from Lemmas B.4.16 and B.4.17.
Proof of Claim 3.1.52. Let b := d − 1 be the branching ratio. Because the root has
a different number of children, we consider the descendants of its children. Let Zn
be the number of vertices in the open cluster of the first child of the root n levels
below it and let Fn = σ(Z0 , . . . , Zn ). Then Z0 = 1 and
P[Zn = k, ∀n ≥ N ] = 0,
so it must be that the limit is 0 almost surely. In other words, Zn is eventually 0 for
all n large enough. This is true for every child of the root. Hence the open cluster
of the root is finite almost surely.
Claim 3.1.53.
E|C0 | = +∞.
Proof. Consider the descendant subtree, $T_1$, of the first child of the root, which we
denote by 1. Let $\widetilde{C}_1$ be the open cluster of 1 in $T_1$. As we showed in the previous
claim, the expected number of vertices on any level of $T_1$ is 1. So $\mathbb{E}|\widetilde{C}_1| = +\infty$ by
summing over the levels.
At ≤ Zt − Zt−1 ≤ Bt and Bt − At ≤ ct .
Applying this inequality to (−Zt ) gives a tail bound in the other direction.
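\[ \mathbb{P}[Z_t - Z_0 \ge \beta] \le \frac{\mathbb{E}\big[e^{s(Z_t - Z_0)}\big]}{e^{s\beta}} = \frac{\mathbb{E}\big[e^{s \sum_{r=1}^{t} (Z_r - Z_{r-1})}\big]}{e^{s\beta}}. \tag{3.2.1} \]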
Unlike the Chernoff-Cramér case, however, the terms in the exponent are not
independent. Instead, to exploit the martingale property, we condition on the filtra-
tion. By taking out what is known (Lemma B.6.13)
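\[ \mathbb{E}\Big[\mathbb{E}\Big[e^{s \sum_{r=1}^{t} (Z_r - Z_{r-1})} \,\Big|\, \mathcal{F}_{t-1}\Big]\Big] = \mathbb{E}\Big[e^{s \sum_{r=1}^{t-1} (Z_r - Z_{r-1})}\, \mathbb{E}\Big[e^{s(Z_t - Z_{t-1})} \,\Big|\, \mathcal{F}_{t-1}\Big]\Big]. \]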
The martingale property and the assumption in the statement imply that, condi-
tioned on Ft−1 , the random variable Zt − Zt−1 is centered and lies in an interval
of length ct . Hence by Hoeffding’s lemma (Lemma 2.4.12), it holds almost surely
that
\[ \mathbb{E}\Big[e^{s(Z_t - Z_{t-1})} \,\Big|\, \mathcal{F}_{t-1}\Big] \le \exp\left(\frac{s^2 c_t^2 / 4}{2}\right) = \exp\left(\frac{c_t^2 s^2}{8}\right). \tag{3.2.2} \]
Using the tower property (Lemma B.6.16) and arguing by induction, we obtain
\[ \mathbb{E}\Big[e^{s(Z_t - Z_0)}\Big] \le \exp\left(\frac{s^2 \sum_{r \le t} c_r^2}{8}\right). \]
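As an illustration of the moment-generating function bound, take $Z_t = S_t$, simple random walk on $\mathbb{Z}$, whose increments lie in $[-1, 1]$ so that $c_r = 2$. The sketch below compares the exact value $\mathbb{E}[e^{sS_t}] = \cosh(s)^t$ with $\exp(s^2 \sum_{r \le t} c_r^2 / 8)$. It also compares the exact upper tail with the bound obtained by plugging the displayed inequality into (3.2.1) and minimizing over $s$; the minimizer $s = 4\beta / \sum_{r \le t} c_r^2$ is a standard calculus step stated here as an assumption, since that part of the argument is not shown above. Function names are hypothetical.

```python
from math import comb, cosh, exp

def exact_mgf(s, t):
    """E[exp(s * S_t)] for simple random walk: t independent +-1 steps give cosh(s)^t."""
    return cosh(s) ** t

def mgf_bound(s, t, c=2.0):
    """exp(s^2 * sum_r c_r^2 / 8) with c_r = c for every r <= t."""
    return exp(s * s * t * c * c / 8)

def exact_upper_tail(beta, t):
    """P[S_t >= beta], using S_t = 2K - t with K ~ Binomial(t, 1/2)."""
    return sum(comb(t, k) for k in range(t + 1) if 2 * k - t >= beta) / 2 ** t

def optimized_tail_bound(beta, t, c=2.0):
    """Value of exp(s^2 t c^2 / 8 - s beta) at s = 4 beta / (t c^2)."""
    return exp(-2 * beta * beta / (t * c * c))

for s in (0.1, 0.5, 1.0, 2.0):
    print(s, exact_mgf(s, 20), mgf_bound(s, 20))
for beta in (4, 8, 12):
    print(beta, exact_upper_tail(beta, 20), optimized_tail_bound(beta, 20))
```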
\[ \mathbb{E}[X_s X_r] = 0, \qquad \forall r \ne s, \]
\[ - \inf_{y' \in \mathcal{X}_i} f(x_1, \ldots, x_{i-1}, y', x_{i+1}, \ldots, x_n), \]
High-level idea
We begin with two easier bounds that we will improve below. The trick to ana-
lyzing the concentration of f (X) is to consider the Doob martingale (see Exam-
ple 3.1.33)
and
Z0 = E[f (X)],
so that we can write
\[ f(X) - \mathbb{E}[f(X)] = \sum_{i=1}^{n} (Z_i - Z_{i-1}). \]
Intuitively, the martingale difference Zi −Zi−1 tracks the change in our expectation
of f (X) as Xi is revealed.
In fact a clever probabilistic argument relates martingale differences directly to
discrete derivatives. Let $X' = (X'_1, \ldots, X'_n)$ be an independent copy of $X$ and let
\[ X^{(i)} = (X_1, \ldots, X_{i-1}, X'_i, X_{i+1}, \ldots, X_n). \]
Then
\begin{align*}
Z_i - Z_{i-1} &= \mathbb{E}[f(X) \mid \mathcal{F}_i] - \mathbb{E}[f(X) \mid \mathcal{F}_{i-1}] \\
&= \mathbb{E}[f(X) \mid \mathcal{F}_i] - \mathbb{E}[f(X^{(i)}) \mid \mathcal{F}_{i-1}] \\
&= \mathbb{E}[f(X) \mid \mathcal{F}_i] - \mathbb{E}[f(X^{(i)}) \mid \mathcal{F}_i] \\
&= \mathbb{E}[f(X) - f(X^{(i)}) \mid \mathcal{F}_i].
\end{align*}
Note that we crucially used the independence of the $X_k$s in the second and third
lines. But then, by Jensen’s inequality (Lemma B.6.12),
\[ |Z_i - Z_{i-1}| \le \|D_i f\|_\infty. \tag{3.2.4} \]
Assume further that E[f (X)2 ] < +∞. By the orthogonality of increments of
martingales in L2 (Lemma 3.1.50), we immediately obtain a bound on the variance
of f
\[ \mathrm{Var}[f(X)] = \mathbb{E}[(Z_n - Z_0)^2] = \sum_{i=1}^{n} \mathbb{E}\big[(Z_i - Z_{i-1})^2\big] \le \sum_{i=1}^{n} \|D_i f\|_\infty^2. \tag{3.2.5} \]
E[f (X1 , . . . , Xn )] = 0,
and
E[f (X1 , . . . , Xn ) | X1 ] = nX1 ,
so that
|E[f (X1 , . . . , Xn ) | X1 ] − E[f (X1 , . . . , Xn )]| = n > 2.
In particular, the corresponding Doob martingale does not have increments bounded
by $\|D_i f\|_\infty = 2$.
For a less extreme example which has support over all of $\{-1, 1\}^n$, let
\[ U_i = \begin{cases} 1, & \text{w.p. } 1 - \varepsilon, \\ -1, & \text{w.p. } \varepsilon, \end{cases} \]
= 0,
so that
\[ \big|\mathbb{E}[f(X_1, \ldots, X_n) \mid X_1] - \mathbb{E}[f(X_1, \ldots, X_n)]\big| = \left(\sum_{i=0}^{n-1} (1 - 2\varepsilon)^i\right) > 2, \]
Variance bounds
We give improved bounds on the variance (compared to (3.2.5)). Our first bound
explicitly decomposes the variance of f (X) over the contributions of its individual
entries.
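Theorem 3.2.3 (Tensorization of the variance). Let $X_1, \ldots, X_n$ be independent
random variables where $X_i$ is $\mathcal{X}_i$-valued for all $i$ and let $X = (X_1, \ldots, X_n)$.
Assume that $f : \mathcal{X}_1 \times \cdots \times \mathcal{X}_n \to \mathbb{R}$ is a measurable function with $\mathbb{E}[f(X)^2] <
+\infty$. Define $\mathcal{F}_i = \sigma(X_1, \ldots, X_i)$, $\mathcal{G}_i = \sigma(X_1, \ldots, X_{i-1}, X_{i+1}, \ldots, X_n)$ and
$Z_i = \mathbb{E}[f(X) \mid \mathcal{F}_i]$. Then we have
\[ \mathrm{Var}[f(X)] \le \sum_{i=1}^{n} \mathbb{E}\big[\mathrm{Var}[f(X) \mid \mathcal{G}_i]\big]. \]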
where we used Jensen’s inequality on the last line. Taking expectations and using
the tower property
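\begin{align*}
\mathrm{Var}[f(X)] &= \sum_{i=1}^{n} \mathbb{E}\big[(Z_i - Z_{i-1})^2\big] \\
&\le \sum_{i=1}^{n} \mathbb{E}\Big[\mathbb{E}\big[(f(X) - \mathbb{E}[f(X) \mid \mathcal{G}_i])^2 \,\big|\, \mathcal{F}_i\big]\Big] \\
&= \sum_{i=1}^{n} \mathbb{E}\big[(f(X) - \mathbb{E}[f(X) \mid \mathcal{G}_i])^2\big] \\
&= \sum_{i=1}^{n} \mathbb{E}\Big[\mathbb{E}\big[(f(X) - \mathbb{E}[f(X) \mid \mathcal{G}_i])^2 \,\big|\, \mathcal{G}_i\big]\Big] \\
&= \sum_{i=1}^{n} \mathbb{E}\big[\mathrm{Var}[f(X) \mid \mathcal{G}_i]\big].
\end{align*}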
Then,
\[ \mathrm{Var}[f(X)] \le \frac{1}{2} \sum_{i=1}^{n} \mathbb{E}\big[(f(X) - f(X^{(i)}))^2\big]. \]
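The variance bound above (commonly known as the Efron-Stein inequality) can be verified exactly on a small product space. The sketch below assumes, for illustration only, that $X$ is uniform on $\{0, 1, 2\}^3$ and that $X'_i$ is an independent uniform coordinate. It computes both sides by enumeration with rational arithmetic for three test functions; for the sum of the coordinates the two sides coincide.

```python
from fractions import Fraction
from itertools import product

def exact_variance(f, alphabet, n):
    """Var[f(X)] for X uniform on alphabet^n."""
    points = list(product(alphabet, repeat=n))
    weight = Fraction(1, len(points))
    mean = sum(weight * f(x) for x in points)
    return sum(weight * (f(x) - mean) ** 2 for x in points)

def resampling_bound(f, alphabet, n):
    """(1/2) * sum_i E[(f(X) - f(X^(i)))^2], where X^(i) resamples coordinate i uniformly."""
    points = list(product(alphabet, repeat=n))
    weight = Fraction(1, len(points) * len(alphabet))
    total = Fraction(0)
    for i in range(n):
        for x in points:
            for y in alphabet:
                x_i = x[:i] + (y,) + x[i + 1:]  # replace coordinate i by the fresh value y
                total += weight * (f(x) - f(x_i)) ** 2
    return total / 2

def distinct_values(x):
    """Number of distinct symbols appearing in x."""
    return len(set(x))

ALPHABET = (0, 1, 2)
for name, f in [("sum", sum), ("max", max), ("distinct", distinct_values)]:
    print(name, exact_variance(f, ALPHABET, 3), resampling_bound(f, ALPHABET, 3))
```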
\[ \int_0^T f(x)^2\, dx \le C \int_0^T f'(x)^2\, dx, \tag{3.2.7} \]
where the best possible $C$ is $T^2/4\pi^2$ (see, e.g., [SS03, Chapter 3, Exercise 11]; this case is
also known as Wirtinger’s inequality). We give a quick proof for $T = 1$ with the suboptimal
$C = 1$. Note that $f(x) = \int_0^x f'(y)\, dy$ so, by Cauchy-Schwarz (Theorem B.4.8),
\[ f(x)^2 \le x \int_0^x f'(y)^2\, dy \le \int_0^1 f'(y)^2\, dy. \]
The result follows by integration. Intuitively, for a function with mean 0 to have a large
norm, it must have a large absolute derivative somewhere.
which is much better than the obvious $\mathrm{Var}[Z] \le \mathbb{E}[Z^2] \le n^2$. Note that we did not
require any information about the expectation of $Z$.
McDiarmid’s inequality
The following powerful consequence of the Azuma-Hoeffding inequality is com-
monly referred to as the method of bounded differences. Compare to (3.2.6).
Once again, applying the inequality to −f gives a tail bound in the other direction.
Zi = E[f (X) | Fi ],
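and
\[ A_i = \mathbb{E}\Big[\inf_{y \in \mathcal{X}_i} f(X_1, \ldots, X_{i-1}, y, X_{i+1}, \ldots, X_n) - f(X) \,\Big|\, \mathcal{F}_{i-1}\Big]. \]
\begin{align*}
Z_i &= \mathbb{E}[f(X) \mid \mathcal{F}_i] \\
&\le \mathbb{E}\Big[\sup_{y \in \mathcal{X}_i} f(X_1, \ldots, X_{i-1}, y, X_{i+1}, \ldots, X_n) \,\Big|\, \mathcal{F}_i\Big] \\
&= \mathbb{E}\Big[\sup_{y \in \mathcal{X}_i} f(X_1, \ldots, X_{i-1}, y, X_{i+1}, \ldots, X_n) \,\Big|\, \mathcal{F}_{i-1}, X_i\Big] \\
&= \mathbb{E}\Big[\sup_{y \in \mathcal{X}_i} f(X_1, \ldots, X_{i-1}, y, X_{i+1}, \ldots, X_n) \,\Big|\, \mathcal{F}_{i-1}\Big],
\end{align*}
and similarly for the other direction. Moreover, by definition, $B_i - A_i \le \|D_i f\|_\infty :=
c_i$. The Azuma-Hoeffding inequality then gives the result.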
Examples
The moral of McDiarmid’s inequality is that functions of independent variables
that are smooth, in the sense that they do not depend too much on any one of
their variables, are concentrated around their mean. Here are some straightforward
applications.
Example 3.2.10 (Balls and bins: empty bins). Suppose we throw m balls into n
bins independently, uniformly at random. The number of empty bins, Zn,m , is
centered at
\[ \mathbb{E} Z_{n,m} = n \left(1 - \frac{1}{n}\right)^m. \]
Writing $Z_{n,m}$ as the sum of indicators $\sum_{i=1}^{n} \mathbb{1}_{B_i}$, where $B_i$ is the event that bin
i is empty, is a natural first attempt at proving concentration around the mean.
However there is a problem—the Bi s are not independent. Indeed, because there
is a fixed number of bins, the event Bi intuitively makes the other such events less
likely. Instead let Xj be the index of the bin in which ball j lands. The Xj s are
independent by construction and, moreover, letting Zn,m = f (X1 , . . . , Xm ) we
have kDi f k∞ ≤ 1. Indeed, moving a single ball changes the number of empty
bins by at most 1 (if at all). Hence by the method of bounded differences
\[ \mathbb{P}\left[\left| Z_{n,m} - n \left(1 - \frac{1}{n}\right)^m \right| \ge b\sqrt{m}\right] \le 2e^{-2b^2}. \]
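For moderate $n$ and $m$ the law of the number of empty bins can be computed exactly, which allows a direct comparison with the bound from the method of bounded differences. The sketch below (hypothetical function names; $n = 20$, $m = 30$ chosen arbitrarily) tracks the number of occupied bins ball by ball: with $k$ bins occupied, the next ball lands in an occupied bin with probability $k/n$ and in an empty one with probability $(n-k)/n$.

```python
from math import exp, sqrt

def empty_bins_distribution(n, m):
    """Return dist with dist[z] = P[Z_{n,m} = z], where Z is the number of empty bins."""
    occupied = [0.0] * (n + 1)
    occupied[0] = 1.0  # before any ball is thrown, no bin is occupied
    for _ in range(m):
        nxt = [0.0] * (n + 1)
        for k, prob in enumerate(occupied):
            if prob == 0.0:
                continue
            nxt[k] += prob * k / n                 # ball lands in an occupied bin
            if k < n:
                nxt[k + 1] += prob * (n - k) / n   # ball lands in an empty bin
        occupied = nxt
    return [occupied[n - z] for z in range(n + 1)]

def mean_empty_bins(n, m):
    """Closed-form mean n * (1 - 1/n)^m."""
    return n * (1 - 1 / n) ** m

def deviation_probability(n, m, b):
    """Exact P[|Z_{n,m} - E Z_{n,m}| >= b * sqrt(m)]."""
    dist, mean = empty_bins_distribution(n, m), mean_empty_bins(n, m)
    return sum(prob for z, prob in enumerate(dist) if abs(z - mean) >= b * sqrt(m))

n, m = 20, 30
for b in (0.25, 0.5, 1.0, 1.5):
    print(b, deviation_probability(n, m, b), 2 * exp(-2 * b * b))
```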
However the $\mathbb{1}_{E_i}$s are not independent. So we cannot use a Chernoff bound for
Poisson trials (Theorem 2.4.7). Instead we use the fact that $N_n = f(X)$ where
$\|D_i f\|_\infty \le k$, as each $X_i$ appears in at most $k$ substrings of length $k$. By the
method of bounded differences, for all $b > 0$,
\[ \mathbb{P}\left[\left| N_n - \left(\frac{1}{s}\right)^k (n - k + 1) \right| \ge bk\sqrt{n}\right] \le 2e^{-2b^2}. \]
The last two examples are perhaps not surprising in that they involve “sums of
weakly independent” indicator variables. One might reasonably expect a sub-
Gaussian-type inequality in that case. The next application is more striking and
hints at connections to isoperimetric considerations (which we will not explore
here).
be the points at $\ell^1$ distance at most $r$ from $A$. Fix $\varepsilon \in (0, 1/2)$ and assume that
$|A| \ge \varepsilon 2^n$. Let $\lambda_\varepsilon$ be such that $e^{-2\lambda_\varepsilon^2} = \varepsilon$. The following application of the
method of bounded differences indicates that much of the uniform measure on the
high-dimensional hypercube lies in a close neighborhood of any such set $A$. This
is an example of the concentration of measure phenomenon.
Claim 3.2.13.
\[ r > 2\lambda_\varepsilon \sqrt{n} \implies |A_r| \ge (1 - \varepsilon) 2^n. \]
Proof. Let X = (X1 , . . . , Xn ) be uniformly distributed in {0, 1}n . Note that the
coordinates are in fact independent. The function
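has $\|D_i f\|_\infty \le 1$. Indeed changing one coordinate of $x$ can increase the $\ell^1$ distance
to the closest point to $x$ by at most 1; in the other direction, if a one-coordinate
change were to decrease $f$ by more than 1, reversing it would produce an increase
of that same amount, a contradiction. Hence McDiarmid’s inequality gives
\[ \mathbb{P}[\mathbb{E} f(X) - f(X) \ge \beta] \le \exp\left(-\frac{2\beta^2}{n}\right). \]
\[ \mathbb{P}[A] \le \exp\left(-\frac{2(\mathbb{E} f(X))^2}{n}\right), \]
\[ \mathbb{P}\big[f(X) \ge 2\lambda_\varepsilon \sqrt{n}\big] \le \mathbb{P}[f(X) - \mathbb{E} f(X) \ge b] \le \exp\left(-\frac{2\beta^2}{n}\right) = \varepsilon. \]
The result follows by observing that, with $r > 2\lambda_\varepsilon \sqrt{n}$,
\[ \frac{|A_r|}{2^n} \ge \mathbb{P}\big[f(X) < 2\lambda_\varepsilon \sqrt{n}\big] \ge 1 - \varepsilon. \]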
Claim 3.2.13 is striking for two reasons: 1) the radius $2\lambda_\varepsilon \sqrt{n}$ is much smaller
than $n$, the diameter of $\{0, 1\}^n$; and 2) it applies to any $A$ (such that $|A| \ge \varepsilon 2^n$).
The smallest $r$ such that $|A_r| \ge (1 - \varepsilon) 2^n$ in general depends on $A$. Here are two
extremes.
For $\gamma > 0$, let
\[ B(\gamma) := \left\{ x \in \{0, 1\}^n : \|x\|_1 \le \frac{n}{2} - \gamma \sqrt{\frac{n}{4}} \right\}. \]
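By the Berry-Esséen theorem (e.g., [Dur10, Theorem 3.4.9]), there is a $C > 0$ such
that, after rearranging the final quantity in (3.2.8),
\[ \left| \mathbb{P}\left[\frac{Y_n - n/2}{\sqrt{n/4}} \le -\gamma\right] - \mathbb{P}[Z \le -\gamma] \right| \le \frac{C}{\sqrt{n}}, \]
where $Z \sim N(0, 1)$. Let $\varepsilon < \varepsilon' < 1/2$ and let $\gamma_{\varepsilon'}$ be such that $\mathbb{P}[Z \le -\gamma_{\varepsilon'}] = \varepsilon'$.
Then setting $A := B(\gamma_{\varepsilon'})$, for $n$ large enough, we have $|A| \ge \varepsilon 2^n$ by (3.2.8). On
the other hand, setting $r := \gamma_{\varepsilon'} \sqrt{n/4}$, we have $A_r \subseteq B(0)$, so that $|A_r| \le \frac{1}{2} 2^n <
(1 - \varepsilon) 2^n$. We have shown that $r = \Omega(\sqrt{n})$ is in general required for Claim 3.2.13
to hold.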
For an example at the other extreme, assume for simplicity that $N := \varepsilon 2^n$ is
an integer. Let $A \subseteq \{0, 1\}^n$ be constructed as follows: starting from the empty set,
add points in $\{0, 1\}^n$ to $A$ independently, uniformly at random until $|A| = N$. Set
$r := 2$. Each point selected in $A$ has $\binom{n}{2}$ points within $\ell^1$ distance 2. By a union
bound, the probability that $A_r$ does not cover all of $\{0, 1\}^n$ is at most
\[ \mathbb{P}[|\{0, 1\}^n \setminus A_r| > 0] \le \sum_{x \in \{0,1\}^n} \mathbb{P}[x \notin A_r] \le 2^n \left(1 - \frac{\binom{n}{2}}{2^n}\right)^{\varepsilon 2^n} \le 2^n e^{-\varepsilon \binom{n}{2}}, \]
where, in the second inequality, we considered only the first $N$ picks in the construction of $A$ (possibly with repeats), and in the third inequality we used $1 - z \le
e^{-z}$ for all $z \in \mathbb{R}$ (see Exercise 1.16). In particular, as $n \to +\infty$,
So for $n$ large enough there is a set $A$ such that $A_r = \{0, 1\}^n$ where $r = 2$.
Remark 3.2.14. In fact, it can be shown that sets of the form $\{x : \|x\|_1 \le s\}$ have
the smallest “expansion” among subsets of $\{0, 1\}^n$ of the same size, a result known as
Harper’s vertex isoperimetric theorem. See, for example, [BLM13, Theorem 7.6 and Exercises 7.11-7.13].
\[ Z_i = \mathbb{E}_{n,p}[F(G) \mid \mathcal{H}_i], \qquad i = 1, \ldots, n, \]
is known as a vertex exposure martingale. An alternative way to define the filtration
is to consider instead the random variables $X_i = (\mathbb{1}_{\{\{i,j\} \in G\}} : 1 \le j \le i)$ for
$i = 2, \ldots, n$. In words, $X_i$ is a vector whose entries indicate the status (present or
absent) of all potential edges incident with $i$ and a vertex preceding it. Hence, $\mathcal{H}_i =
\sigma(X_2, \ldots, X_i)$ for $i = 1, \ldots, n$ (and $\mathcal{H}_1$ is trivial as it corresponds to a graph with
a single vertex and no edge). This representation has an important property: the
Xi s are independent as they pertain to disjoint subsets of edges. We are then in the
setting of the method of bounded differences. Re-writing F (G) = f (X1 , . . . , Xn ),
the vertex exposure martingale coincides with the martingale (3.2.3) used in that
context.
As an example, consider the chromatic number χ(G), that is, the smallest num-
ber of colors needed in a proper coloring of G. Define fχ (X1 , . . . , Xn ) := χ(G).
We use the following combinatorial observation to bound kDi fχ k∞ .
Lemma 3.2.15. Altering the status (absent or present) of edges incident to a fixed
vertex v changes the chromatic number by at most 1.
Proof. Altering the status of edges incident to v increases the chromatic number
by at most 1, since in the worst case one can simply use an extra color for v. On
the other hand, if the chromatic number were to decrease by more than 1 after al-
tering the status of edges incident to v, reversing the change and using the previous
observation would produce a contradiction.
A fortiori, since $X_i$ depends on a subset of the edges incident with vertex $i$, Lemma 3.2.15
implies that $\|D_i f_\chi\|_\infty \le 1$. Hence, for all $0 < p < 1$ and $n$, by an immediate application of McDiarmid’s inequality (Theorem 3.2.9):
Claim 3.2.16.
\[ \mathbb{P}_{n,p}\Big[\big|\chi(G) - \mathbb{E}_{n,p}[\chi(G)]\big| \ge b\sqrt{n - 1}\Big] \le 2e^{-2b^2}. \]
Figure 3.1: All but $O(\sqrt{n})$ vertices are colored using $\varphi_n$ colors. The remaining
vertices are colored using 3 additional colors.
Lemma 3.2.19. Changing the edges incident to a single vertex can change Fn by
at most 1.
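\[ \le O\Big(n^{\frac{5}{4} - \frac{3\alpha}{2}}\Big)^{4} \to 0, \]
as $n \to +\infty$, where we used that $\frac{5}{4} - \frac{3\alpha}{2} < \frac{5}{4} - \frac{5}{4} = 0$ when $\alpha > \frac{5}{6}$ so that
the geometric series is dominated by its first term. Therefore for $n$ large enough
$\mathbb{P}_{n,p_n}[Y_n > 0] \le \varepsilon/3$, concluding the proof.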
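By the choice of $\varphi_n$ in (3.2.9),
\[ \mathbb{P}_{n,p_n}[\chi(G_n) < \varphi_n] \le \frac{\varepsilon}{3}. \]
By (3.2.10) and (3.2.11) with $c = 2b_\varepsilon$,
\[ \mathbb{P}_{n,p_n}[\chi(G_n) > \varphi_n + 3] \le \frac{2\varepsilon}{3}. \]
So, overall,
\[ \mathbb{P}_{n,p_n}[\varphi_n \le \chi(G_n) \le \varphi_n + 3] \ge 1 - \varepsilon. \]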
That concludes the proof.
Power law degree sequence Let $D_i(t)$ be the degree of the $i$-th vertex in $G_t$,
denoted $v_i$, and let
\[ N_d(t) := \sum_{i=0}^{t} \mathbb{1}_{\{D_i(t) = d\}}, \]
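Claim 3.2.21.
\[ \frac{1}{t} N_d(t) \to_p f_d, \qquad \forall d \ge 1. \]
Proof. The claim is immediately implied by the following lemmas.
Lemma 3.2.22 (Convergence of the mean).
\[ \frac{1}{t} \mathbb{E} N_d(t) \to f_d, \qquad \forall d \ge 1. \]
Lemma 3.2.23 (Concentration around the mean). For any $\delta > 0$,
\[ \mathbb{P}\left[\left|\frac{1}{t} N_d(t) - \frac{1}{t} \mathbb{E} N_d(t)\right| \ge \sqrt{\frac{2 \log \delta^{-1}}{t}}\right] \le 2\delta, \qquad \forall d \ge 1,\ \forall t. \]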
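\[ \mathbb{P}[|N_d(t) - \mathbb{E} N_d(t)| \ge \beta] \le 2 \exp\left(-\frac{2\beta^2}{(2)^2 (t - 1)}\right), \]
which, choosing $\beta = \sqrt{2t \log \delta^{-1}}$, we can rewrite as
\[ \mathbb{P}\left[\left|\frac{1}{t} N_d(t) - \frac{1}{t} \mathbb{E} N_d(t)\right| \ge \sqrt{\frac{2 \log \delta^{-1}}{t}}\right] \le 2\delta. \]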
Figure 3.2: Graph obtained when x2 = (1, head), x3 = (2, tail) and x4 =
(3, head).
Figure 3.3: Substituting x3 = (2, tail) with y = (1, tail) in the example of
Figure 3.2 has the effect of replacing the dashed edges with the dotted edges. Note
that only the degrees of vertices v1 and v2 are affected by this change.
Dynamics of the mean Once again the method of bounded differences tells us
nothing about the mean, which must be analyzed by other means. The proof of
Lemma 3.2.22 does not rely on the Azuma-Hoeffding inequality but is given for
completeness (and may be skipped).
Proof of Lemma 3.2.22. The idea of the proof is to derive a recursion for fd by
considering the evolution of ENd (t) and taking a limit as t → +∞. Let d ≥ 1.
Observe that ENd (t) = 0 for t ≤ d − 1 since we need at least d edges to have
a degree-d vertex. Moreover, by the description of the preferential attachment
process, the following recursion holds for t ≥ d − 1
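\[ \mathbb{E} N_d(t+1) - \mathbb{E} N_d(t) = \underbrace{\frac{d-1}{2t}\, \mathbb{E} N_{d-1}(t)}_{(a)} - \underbrace{\frac{d}{2t}\, \mathbb{E} N_d(t)}_{(b)} + \underbrace{\mathbb{1}_{\{d=1\}}}_{(c)}. \tag{3.2.13} \]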
The proof of this lemma is given after the proof of Claim 3.2.21. We first
conclude the proof of Lemma 3.2.22. First let d = 1. In that case, g1 (t) = g1 := 1,
α := 1/2, and t0 := 1. By Lemma 3.2.24,
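\[ \frac{1}{t} \mathbb{E} N_1(t) \to \frac{1}{1 + 1/2} = \frac{2}{3} = f_1. \]
\[ g_d(t) \to g_d := \frac{d-1}{2}\, f_{d-1}, \]
as $t \to +\infty$. Using Lemma 3.2.24 with $\alpha := d/2$ and $t_0 := d - 1$, we obtain
\[ \frac{1}{t} \mathbb{E} N_d(t) \to \frac{1}{1 + d/2}\, \frac{d-1}{2}\, f_{d-1} = \frac{d-1}{d+2} \cdot \frac{4}{(d-1) d (d+1)} = f_d. \]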
That concludes the proof of Lemma 3.2.22.
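The limit can be observed by iterating recursion (3.2.13) numerically. The last display gives $f_d = \frac{4}{d(d+1)(d+2)}$, and the sketch below compares $\mathbb{E} N_d(t)/t$ with that value for large $t$. The starting point is an assumption made for illustration: at time $t = 1$ the graph is taken to be a single edge, so $\mathbb{E} N_1(1) = 2$ and $\mathbb{E} N_d(1) = 0$ for $d \ge 2$. Function names are hypothetical.

```python
def limiting_fraction(d):
    """f_d = 4 / (d (d + 1) (d + 2)), as read off from the last display."""
    return 4 / (d * (d + 1) * (d + 2))

def expected_degree_counts(max_degree, horizon):
    """Iterate (3.2.13) from t = 1 and return [E N_0(T), ..., E N_max(T)] at T = horizon."""
    mean = [0.0] * (max_degree + 1)  # index 0 is unused and stays 0
    mean[1] = 2.0                    # assumption: G_1 is a single edge with two degree-1 endpoints
    for t in range(1, horizon):
        new = mean[:]
        for d in range(1, max_degree + 1):
            gain = (d - 1) / (2 * t) * mean[d - 1]  # term (a): a degree-(d-1) vertex is chosen
            loss = d / (2 * t) * mean[d]            # term (b): a degree-d vertex is chosen
            fresh = 1.0 if d == 1 else 0.0          # term (c): the new vertex has degree 1
            new[d] = mean[d] + gain - loss + fresh
        mean = new
    return mean

horizon = 20_000
counts = expected_degree_counts(5, horizon)
for d in range(1, 6):
    print(d, counts[d] / horizon, limiting_fraction(d))
```

The values $f_d$ sum to 1 over $d \ge 1$, as expected of a limiting degree distribution.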
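To prove Claim 3.2.21, we combine Lemmas 3.2.22 and 3.2.23. Fix any $d, \delta, \varepsilon >
0$. Choose $t_0$ large enough that for all $t \ge t_0$
\[ \max\left\{ \left|\frac{1}{t} \mathbb{E} N_d(t) - f_d\right|,\ \sqrt{\frac{2 \log \delta^{-1}}{t}} \right\} \le \varepsilon. \]
Then
\[ \mathbb{P}\left[\left|\frac{1}{t} N_d(t) - f_d\right| \ge 2\varepsilon\right] \le 2\delta, \]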
for all t ≥ t0 . That proves convergence in probability.
or
\[ f(t+1) = \sum_{s=t_0}^{t} g(s) \prod_{r=s+1}^{t} \left(1 - \frac{\alpha}{r}\right) + f(t_0) \prod_{r=t_0}^{t} \left(1 - \frac{\alpha}{r}\right), \tag{3.2.15} \]
where empty products are equal to 1. To guess the limit note that, for large $s$, $g(s)$
is roughly constant and that the product in the first term behaves like
\[ \exp\left(-\sum_{r=s+1}^{t} \frac{\alpha}{r}\right) \approx \exp\left(-\alpha(\log t - \log s)\right) \approx \frac{s^\alpha}{t^\alpha}. \]
So approximating the sum by an integral we get that $f(t+1) \approx \frac{g t}{\alpha + 1}$, which is
indeed consistent with the claim.
Formally, we use that there is a constant γ = 0.577 . . . such that (see e.g. [LL10,
Lemma 12.1.3])
\[ \sum_{\ell=1}^{m} \frac{1}{\ell} = \log m + \gamma + \Theta(m^{-1}), \]
\[ \log(1 - z) = -z + \Theta(z^2). \]
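Fix $\eta > 0$ small and take $t$ large enough that $\eta t > 2\alpha$ and $|g(s) - g| < \eta$ for all
$s \ge \eta t$. Then, for $s + 1 \ge t_0$,
\begin{align*}
\sum_{r=s+1}^{t} \log\left(1 - \frac{\alpha}{r}\right) &= -\sum_{r=s+1}^{t} \left\{\frac{\alpha}{r} + \Theta(r^{-2})\right\} \\
&= -\alpha(\log t - \log s) + \Theta(s^{-1}),
\end{align*}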
Hence
\[ \frac{1}{t}\, f(t_0) \prod_{r=t_0}^{t} \left(1 - \frac{\alpha}{r}\right) = f(t_0)\, \frac{t_0^\alpha}{t^{\alpha+1}}\, (1 + \Theta(t_0^{-1})) \to 0, \]
CHAPTER 3. MARTINGALES AND POTENTIALS 170
as t → +∞. Moreover
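\[
\begin{aligned}
\frac{1}{t}\sum_{s=\eta t}^{t} g(s)\prod_{r=s+1}^{t}\left(1 - \frac{\alpha}{r}\right)
&\le \frac{1}{t}\sum_{s=\eta t}^{t}(g+\eta)\,\frac{s^{\alpha}}{t^{\alpha}}\left(1 + \Theta(s^{-1})\right)\\
&\le O(\eta) + \frac{g}{t^{\alpha+1}}\left(1 + \Theta(t^{-1})\right)\sum_{s=\eta t}^{t} s^{\alpha}\\
&\le O(\eta) + \frac{g}{t^{\alpha+1}}\left(1 + \Theta(t^{-1})\right)\frac{(t+1)^{\alpha+1}}{\alpha+1}\\
&\to O(\eta) + \frac{g}{\alpha+1},
\end{aligned}
\]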
where we bounded the sum on the second line by an integral. Similarly,
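\[
\begin{aligned}
\frac{1}{t}\sum_{s=t_0}^{\eta t-1} g(s)\prod_{r=s+1}^{t}\left(1 - \frac{\alpha}{r}\right)
&\le \frac{1}{t}\sum_{s=t_0}^{\eta t-1}(g+\eta)\,\frac{s^{\alpha}}{t^{\alpha}}\left(1 + \Theta(s^{-1})\right)\\
&\le \frac{\eta t}{t}\,(g+\eta)\,\frac{(\eta t)^{\alpha}}{t^{\alpha}}\left(1 + \Theta(t_0^{-1})\right)\\
&\to O(\eta^{\alpha+1}).
\end{aligned}
\]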
In the simplest version of the (two-arm) stochastic bandit problem, there are
two unknown reward distributions $\nu_1, \nu_2$ over $[0, 1]$ with respective means $\mu_1 \neq \mu_2$.
At each time $t = 1, \ldots, n$, we request an independent sample from $\nu_{I_t}$, where
we are free to choose $I_t \in \{1, 2\}$ based on past choices and observed rewards
$\{(I_s, Z_s)\}_{s<t}$. This will be referred to as pulling arm $I_t$. We then observe the arm
reward $Z_t \sim \nu_{I_t}$. Letting $\mu^* := \mu_1 \vee \mu_2$, our goal is to minimize
\[
R_n = n\mu^* - \mathbb{E}\left[\sum_{t=1}^{n}\mu_{I_t}\right], \tag{3.2.16}
\]
which is known as the pseudo-regret. That is, we seek to make choices $(I_t)_{t=1}^{n}$
that minimize the difference between the best achievable cumulative mean reward
and the expected cumulative mean reward from our decisions. Note that the expectation
in (3.2.16) is taken over the choices $(I_t)_{t=1}^{n}$, which themselves depend on
the random rewards $(Z_t)_{t=1}^{n}$. As indicated above, because $\nu_1$ and $\nu_2$ are unknown,
there is a fundamental friction between exploiting the arm that has done best in the
past and exploring further the other arm, which might perform better in the future.
One general approach that has proved effective in this type of problem is known
as optimism in the face of uncertainty. Roughly speaking, we construct a set of
plausible environments (in our case, the means of the reward distributions) that are
consistent with observed data; then we make an optimal decision assuming that the
true environment is the most favorable among them. A concrete implementation
of this principle is the Upper Confidence Bound (UCB) algorithm, which we now
describe. In words, we use a concentration inequality to build a confidence interval
for each reward mean, and then we pick the arm with highest upper bound.
UCB algorithm
To state the algorithm formally, we will need some notation. For $i = 1, 2$, let $T_i(t)$
be the number of times arm $i$ is pulled up to time $t$
\[
T_i(t) = \sum_{s \le t}\mathbf{1}\{I_s = i\},
\]
and let $X_{i,s}$, $s = 1, \ldots, n$, be i.i.d. samples from $\nu_i$. Assume that the reward at
time $t$ is
\[
Z_t = \begin{cases}
X_{1,T_1(t-1)+1} & \text{if } I_t = 1,\\
X_{2,T_2(t-1)+1} & \text{otherwise.}
\end{cases}
\]
In other words, $X_{i,s}$ is the $s$-th observed reward from arm $i$. Let $\hat{\mu}_{i,s}$ be the sample
mean of the observed rewards after pulling $s$ times on arm $i$
\[
\hat{\mu}_{i,s} = \frac{1}{s}\sum_{r \le s} X_{i,r}.
\]
Since the $X_{i,s}$'s are independent and $[0, 1]$-valued by assumption, by Hoeffding’s
inequality (Theorem 2.4.10), for any $\beta > 0$
The argument above implies that the true mean $\mu_i$ has probability less than $1/t^{\alpha^2}$
of being higher than $\hat{\mu}_{i,T_i(t-1)} + \alpha H(T_i(t-1), 1/t)$. The algorithm makes an
“optimistic” decision: it chooses the higher of the two values.
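The decision rule can be sketched in a few lines of Python. Several ingredients below are assumptions made for illustration and are not taken from the text: the rewards are Bernoulli, each arm is pulled once before the index is used (to avoid dividing by zero), and the confidence radius is taken to be $H(s, \delta) = \sqrt{\log(\delta^{-1})/(2s)}$, which matches the choice $\beta = \alpha\sqrt{(\log t)/2}$, $w(s) = \sqrt{s}$ used in the slicing argument. The function names are hypothetical.

```python
import math
import random


def hoeffding_radius(s, delta):
    """Assumed form of H(s, delta): Hoeffding radius for s samples in [0, 1]."""
    return math.sqrt(math.log(1.0 / delta) / (2.0 * s))


def alpha_ucb(means, n, alpha=1.5, seed=0):
    """Run alpha-UCB on Bernoulli arms (an illustrative reward model).

    Returns the pull counts T_i(n) and the realized value of sum_i Delta_i T_i(n).
    """
    rng = random.Random(seed)
    pulls = [0] * len(means)   # T_i(t - 1)
    sums = [0.0] * len(means)  # running reward totals per arm

    for t in range(1, n + 1):
        indices = []
        for i in range(len(means)):
            if pulls[i] == 0:
                indices.append(math.inf)  # force one initial pull per arm
            else:
                mean_hat = sums[i] / pulls[i]
                indices.append(mean_hat + alpha * hoeffding_radius(pulls[i], 1.0 / t))
        arm = max(range(len(means)), key=lambda i: indices[i])  # optimistic choice
        reward = 1.0 if rng.random() < means[arm] else 0.0
        pulls[arm] += 1
        sums[arm] += reward

    best = max(means)
    regret = sum((best - m) * k for m, k in zip(means, pulls))
    return pulls, regret


pulls, regret = alpha_ucb([0.7, 0.4], n=5000)
print(pulls, regret)
```

In a run of this kind the suboptimal arm should receive only a small fraction of the pulls, which is the behavior quantified by the regret bound for $\alpha$-UCB.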
The following theorem shows that UCB achieves a pseudo-regret of the order
of $O(\log n)$. Define $\Delta_i = \mu^* - \mu_i$ and $\Delta_* = \Delta_1 \vee \Delta_2$.
Theorem 3.2.26 (Pseudo-regret of UCB). In the two-arm stochastic bandit problem
where the rewards are in $[0, 1]$ with distinct means, $\alpha$-UCB with $\alpha > 1$
achieves
\[
R_n \le \frac{2\alpha^2}{\Delta_*}\log n + \Delta_* C_\alpha,
\]
for some constant $C_\alpha \in (0, +\infty)$ depending only on $\alpha$.
This bound should not come entirely as a surprise. Indeed a simple, alternative
approach to UCB is to (1) first pull each arm $m_n = o(n)$ times and then (2) use
the arm with largest estimated mean for the remainder. Assuming there is a known
lower bound on $\Delta_*$, then Hoeffding’s inequality (Theorem 2.4.10) guarantees that
$m_n$ can be chosen of the order of $\frac{1}{\Delta_*^2}\log n$ to identify the largest mean with
probability $1 - 1/n$. Because the rewards are bounded by 1, accounting for the contribution
of the first phase and the probability of failure in the second phase, one gets a
pseudo-regret of the order of $\Delta_*\frac{1}{\Delta_*^2}\log n + \frac{1}{n}\Delta_* n \approx \frac{1}{\Delta_*}\log n$. The UCB strategy,
on the other hand, elegantly adapts to the gap $\Delta_*$ and the horizon $n$.
Hence the problem boils down to bounding E[Ti (n)], the expected number of times
that arm i is pulled. Note that Ti (n) is a complicated function of the observations.
To analyze it, we will use the following sufficient condition. Let $i^*$ be the optimal
arm, that is, the one that achieves $\mu^*$. Intuitively, if arm $i \neq i^*$ is pulled, it is
because: either our upper estimate of $\mu_{i^*}$ happens to be low or our lower estimate
of $\mu_i$ happens to be high (i.e., our concentration inequality failed); or there is too
much uncertainty in our estimate of $\mu_i$ (i.e., we haven’t pulled arm $i$ enough).
Lemma 3.2.27. Under the $\alpha$-UCB strategy, if arm $i \neq i^*$ is pulled at time $t$ then
at least one of the following events holds:
\[
\mathcal{E}_{t,3} = \left\{\alpha H(T_i(t-1), 1/t) > \frac{\Delta_i}{2}\right\}. \tag{3.2.20}
\]
Proof. We argue by contradiction. Assume all the conditions above are false. Then
\[
u_n = \frac{2\alpha^2\log n}{\Delta_*^2}.
\]
Using the condition in Lemma 3.2.27, we get the following bound on E[Ti (n)].
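In particular, for all $t \le n$, the event $\mathcal{E}_{t,3}$ implies that $T_i(t-1) < u_n$. As a result,
since $T_i(t) = T_i(t-1) + 1$ whenever $I_t = i$, the event $\{I_t = i\} \cap \mathcal{E}_{t,3}$ can occur
at most $u_n$ times and
\[
\begin{aligned}
\mathbb{E}[T_i(n)] &\le u_n + \mathbb{E}\left[\sum_{t=1}^{n}\left(\mathbf{1}_{\{I_t = i\}\cap\mathcal{E}_{t,1}} + \mathbf{1}_{\{I_t = i\}\cap\mathcal{E}_{t,2}}\right)\right]\\
&\le u_n + \sum_{t=1}^{n}\mathbb{P}[\mathcal{E}_{t,1}] + \sum_{t=1}^{n}\mathbb{P}[\mathcal{E}_{t,2}],
\end{aligned}
\]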
It remains to bound P[Et,1 ] and P[Et,2 ] from above. This is not entirely straight-
forward because, while µ̂i,Ti (t−1) involves a sum of independent random variables,
the number of terms Ti (t − 1) is itself a random variable. Moreover Ti (t − 1)
depends on the past rewards Zs , s ≤ t − 1, in a complex way. So in order to
apply a concentration inequality to µ̂i,Ti (t−1) , we use a rather blunt approach: we
bound the worst deviation over all possible (deterministic) values in the support of
Ti (t − 1). That is,
Observe that the numerator on the left-hand side of the inequality on the last line
is a martingale (see Example 3.1.29) with increments in [−µi , 1 − µi ]. But the
denominator depends on s.
We try two approaches:
- We could simply use that $\sqrt{s} \ge 1$ on the denominator and apply the maximal
Slicing method
The slicing method is useful when bounding a weighted supremum. Its application
is somewhat problem-specific so we will content ourselves with illustrating it in
our case. Specifically, our goal is to control probabilities of the form
\[
\mathbb{P}\left[\sup_{s \le t-1}\frac{M_s}{w(s)} \ge \beta\right],
\]
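where $M_s := \sum_{r=1}^{s}(X_{i,r} - \mu_i)$, $w(s) := \sqrt{s}$, and $\beta := \alpha\sqrt{\frac{\log t}{2}}$. The idea is to
divide up the supremum into slices $\gamma^{k-1} \le s < \gamma^{k}$, $k \ge 1$, where the constant
$\gamma > 1$ will be optimized below. That is, fixing $K_t = \left\lceil\frac{\log t}{\log\gamma}\right\rceil$ (which roughly solves
$\gamma^{K_t} = t$), by a union bound over the slices
\[
\mathbb{P}\left[\sup_{1 \le s < t}\frac{M_s}{w(s)} \ge \beta\right] \le \sum_{k=1}^{K_t}\mathbb{P}\left[\sup_{\gamma^{k-1} \le s < \gamma^{k}}\frac{M_s}{w(s)} \ge \beta\right].
\]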
For $\alpha > 1$, we can choose $\gamma > 1$ such that $\alpha^2/\gamma > 1$. In that case, the series on
the right-hand side is summable. This improves over both (3.2.23) and (3.2.24).
We are ready to prove the main result.
Proof of Theorem 3.2.26. By (3.2.17) and Lemmas 3.2.27, 3.2.28 and 3.2.29, we
have
\[
R_n = \sum_{i=1,2}\Delta_i\,\mathbb{E}[T_i(n)] \le \Delta_*\left(u_n + 2\sum_{t=1}^{n}\frac{\log t}{\log\gamma}\,t^{-\alpha^2/\gamma}\right).
\]
Recalling that $\alpha > 1$, choose $\gamma > 1$ such that $\alpha^2/\gamma > 1$. In that case, as noted
above, the series on the right-hand side is summable and there is $C_\alpha \in (0, +\infty)$
such that
\[
R_n \le \Delta_*(u_n + C_\alpha).
\]
That proves the claim.
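To get a feel for the two pieces of this bound, one can evaluate $u_n$ and the partial sums of the series $\sum_t \frac{\log t}{\log\gamma}\,t^{-\alpha^2/\gamma}$ numerically. The sketch below is illustrative only: the function names and the parameter values ($\alpha = 2$, $\gamma = 2$, gap $0.3$) are assumptions, not choices made in the text.

```python
import math


def u_n(n, alpha, gap):
    """The threshold u_n = 2 alpha^2 log(n) / gap^2."""
    return 2.0 * alpha ** 2 * math.log(n) / gap ** 2


def slicing_series(n, alpha, gamma):
    """Partial sum over t = 1..n of (log t / log gamma) * t^(-alpha^2 / gamma)."""
    exponent = alpha ** 2 / gamma
    return sum(
        math.log(t) / math.log(gamma) * t ** (-exponent) for t in range(1, n + 1)
    )


def regret_bound(n, alpha, gap, gamma):
    """Right-hand side gap * (u_n + 2 * partial sum) of the displayed inequality."""
    return gap * (u_n(n, alpha, gap) + 2.0 * slicing_series(n, alpha, gamma))


# With alpha^2 / gamma = 2 > 1 the partial sums stabilize quickly,
# while u_n keeps growing like log n.
for n in (10 ** 2, 10 ** 3, 10 ** 4):
    print(n, round(slicing_series(n, 2.0, 2.0), 4), round(regret_bound(n, 2.0, 0.3, 2.0), 2))
```

The stabilizing partial sums are what allows the series to be absorbed into a constant $C_\alpha$, leaving the $\log n$ term as the dominant contribution.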
where now ci (x) is a finite, positive function over X1 × · · · × Xn . Notice the “one-
sided” nature of this condition, in the sense that ci depends on x but not on y. A
typical example where (3.2.27) is satisfied, but (3.2.26) is not, is given below.
We state Talagrand’s inequality without proof.
and
\[
\mathbb{P}[f(X) - \mathbb{E}f(X) \le -\beta] \le \exp\left(-\frac{\beta^2}{2\,\mathbb{E}\left[\sum_{i \le n} c_i(X)^2\right]}\right).
\]
Example 3.2.33 (Spectral norm of a random matrix with bounded entries). Let A
be an n × n random matrix. We assume that the entries Ai,j , i, j = 1, . . . , n, are
independent, centered random variables in [−1, 1]. In Theorem 2.4.28, we proved
an upper tail bound on the spectral norm
\[
\|A\|_2 = \sup_{x \in \mathbb{R}^n\setminus\{0\}}\frac{\|Ax\|_2}{\|x\|_2} = \sup_{x, y \in \mathbb{S}^{n-1}}\langle Ax, y\rangle,
\]
of such a matrix (in the more general sub-Gaussian case) using an $\varepsilon$-net argument.
Theorem 2.4.28 also implies that $\mathbb{E}\|A\|_2 = O(\sqrt{n})$ by (B.5.1). (See Exercise 3.9
for a lower bound on the expectation.)
Hence Talagrand’s inequality implies that $\|A\|_2$ is sub-Gaussian with variance
factor 4. ◁
Quantities such as (3.3.1) arise naturally, for instance in the study of recurrence,
and the connection to potential theory, the study of harmonic functions, proves
fruitful in that context as we outline in this section. It turns out that harmonic
functions and martingales are closely related. In Section 3.3.1 we elaborate on that
connection.
But first we rewrite (3.3.2) to reveal the electrical interpretation. For this we
switch to reversible chains. Recall that a reversible Markov chain is equivalent
to a random walk on a network N = (G, c) where the edges of G correspond
to transitions of positive probability. If the chain is reversible with respect to a
stationary measure π, then the edge weights are c(x, y) = π(x)P (x, y). In this
notation (3.3.2) becomes
\[
h(x) = \frac{1}{c(x)}\sum_{y \sim x} c(x, y)\,h(y), \qquad \forall x \in (A \cup Z)^c, \tag{3.3.4}
\]
where $c(x) := \sum_{y \sim x} c(x, y) = \pi(x)$. In words, $h(x)$ is the weighted average of its
neighboring values. Now comes the electrical analogy: if one interprets c(x, y) as
a conductance, a function satisfying (3.3.4) is known as a voltage. The voltages at
A and Z are 1 and 0 respectively. We show in the next subsection by a martingale
argument that, under appropriate conditions, such a voltage exists and is unique.
We develop the electrical analogy and many of its applications in Section 3.3.2.
that is, (h(Xt∧τ ∗ ))t is a martingale with respect to (Ft ). Indeed, on {τ ∗ ≤ t},
and on $\{\tau^* > t\}$
\[
\mathbb{E}[h(X_{(t+1)\wedge\tau^*}) \mid \mathcal{F}_t] = \sum_{y} P(X_t, y)\,h(y) = h(X_t) = h(X_{t\wedge\tau^*}).
\]
Although the rest of Section 3.3 is concerned with reversible Markov chains,
the current subsection applies to the non-reversible case as well. We give an
overview of potential theory for general, countable-space, discrete-time Markov
chains and its connections to martingales. As a major application, we introduce
the concept of a Lyapounov function which is useful in bounding certain hitting
times.
\[
h(x) = \mathbb{E}_x\left[h\left(X_{\tau_{W^c}}\right)\right].
\]
Proof. We first argue about uniqueness. Suppose h is defined over all of V and
satisfies (3.3.2). Let τ ∗ := τW c . Then the process (h (Xt∧τ ∗ ))t is a martingale
by (3.3.5). Because W is finite and the chain is irreducible, we have τ ∗ < +∞
almost surely, as implied by Lemma 3.1.25. Moreover the process is bounded
because h is bounded on W c and W is finite. Hence by Doob’s optional stopping
theorem (Theorem 3.1.38 (ii))
which implies that h is unique, since the right-hand side depends only on the chain
and the fixed values of h on W c .
For the existence, simply define h(x) := Ex [h (Xτ ∗ )], ∀x ∈ W, and use a
first-step argument similarly to (3.3.3).
For some insights on what happens when the assumptions of Theorem 3.3.1 are
not satisfied, see Exercise 3.11. For an alternative (arguably more intuitive) proof
of uniqueness based on the maximum principle, see Exercise 3.12.
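The representation $h(x) = \mathbb{E}_x[h(X_{\tau_{W^c}})]$ also suggests a simple way to compute a harmonic extension numerically: repeatedly replace $h(x)$, for $x \in W$, by the $P$-weighted average of its neighboring values until nothing changes. The sketch below does this for simple random walk on $\{0, \ldots, n\}$ with boundary values $1$ at $0$ and $0$ at $n$, where the solution is known to be $1 - x/n$. The data layout (a dictionary of dictionaries) and the function name are assumptions made for illustration.

```python
def harmonic_extension(P, boundary, sweeps=10000, tol=1e-12):
    """Solve h(x) = sum_y P(x, y) h(y) on W with h fixed on the complement of W.

    P:        dict mapping each state to a dict {next_state: probability}
    boundary: dict mapping each state outside W to its prescribed value
    """
    h = {x: boundary.get(x, 0.0) for x in P}
    interior = [x for x in P if x not in boundary]
    for _ in range(sweeps):
        change = 0.0
        for x in interior:
            new_value = sum(p * h[y] for y, p in P[x].items())
            change = max(change, abs(new_value - h[x]))
            h[x] = new_value  # in-place (Gauss-Seidel style) update
        if change < tol:
            break
    return h


n = 10
P = {x: {x - 1: 0.5, x + 1: 0.5} for x in range(1, n)}
P[0] = {0: 1.0}  # boundary states; their rows are never used in the updates
P[n] = {n: 1.0}
h = harmonic_extension(P, {0: 1.0, n: 0.0})
print([round(h[x], 4) for x in range(n + 1)])
```

The output should decrease linearly from 1 to 0, that is, $h(x) = \mathbb{P}_x[\tau_0 < \tau_n] = 1 - x/n$ in this special case.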
In the proof above it suffices to specify h on the outer boundary of W
provided the expectation exists. We have proved that, under the assumptions of
Theorem 3.3.1, there exists a unique solution to
\[
\begin{cases}
\Delta f(x) = 0 & \forall x \in W,\\
f(x) = h(x) & \forall x \in \partial_V W,
\end{cases} \tag{3.3.7}
\]
which is a discretized second derivative. More generally, for simple random walk
on $\mathbb{Z}^d$, we get
\[
\begin{aligned}
\Delta f(x) &= \left[\sum_{y} P(x, y) f(y)\right] - f(x)\\
&= \sum_{y} P(x, y)[f(y) - f(x)]\\
&= \frac{1}{2d}\sum_{i=1}^{d}\left\{[f(x + e_i) - f(x)] - [f(x) - f(x - e_i)]\right\},
\end{aligned}
\]
Theorem 3.3.1 has many applications. One of its consequences is that harmonic
functions on a finite state space are constant.
Proof. Fix the value of h at an arbitrary vertex z and set W = V \{z}. Applying
Theorem 3.3.1, for all x ∈ W , h(x) = Ex [h (XτW c )] = h(z).
Theorem 3.3.4 (Random target lemma). Let (Xt ) be an irreducible Markov chain
on a finite state space V with transition matrix P and stationary distribution π.
Then
\[
h(x) := \sum_{y \in V}\pi(y)\,\mathbb{E}_x[\tau_y]
\]
does not depend on $x$.
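The random target lemma is easy to check numerically on a small chain. The sketch below uses a hypothetical three-state transition matrix (chosen only for illustration), approximates $\pi$ by power iteration, and computes each $\mathbb{E}_x[\tau_y]$ by iterating the first-step equations $u(x) = 1 + \sum_z P(x, z)u(z)$ for $x \neq y$, with $u(y) = 0$. The sum $\sum_y \pi(y)\,\mathbb{E}_x[\tau_y]$ should then come out the same for every starting state $x$.

```python
def stationary(P, iters=5000):
    """Approximate the stationary distribution by power iteration (pi <- pi P)."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[x] * P[x][y] for x in range(n)) for y in range(n)]
    return pi


def hitting_times_to(P, y, iters=5000):
    """Return [E_x[tau_y] for each x] by iterating the first-step equations."""
    n = len(P)
    u = [0.0] * n
    for _ in range(iters):
        u = [
            0.0 if x == y else 1.0 + sum(P[x][z] * u[z] for z in range(n))
            for x in range(n)
        ]
    return u


def random_target_sums(P):
    """Return [sum_y pi(y) E_x[tau_y] for each starting state x]."""
    n = len(P)
    pi = stationary(P)
    times = [hitting_times_to(P, y) for y in range(n)]  # times[y][x] = E_x[tau_y]
    return [sum(pi[y] * times[y][x] for y in range(n)) for x in range(n)]


# Hypothetical irreducible, aperiodic chain on three states.
P = [
    [0.5, 0.3, 0.2],
    [0.1, 0.6, 0.3],
    [0.4, 0.4, 0.2],
]
sums = random_target_sums(P)
print(sums)
```

Both iterations converge geometrically here because the chain is irreducible and aperiodic on a finite state space; for periodic chains the power iteration would need to be replaced by a direct linear solve.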
Proof. Because the chain is irreducible and has a finite state space, $\mathbb{E}_x[\tau_y] < +\infty$
for all $x, y$. By Corollary 3.3.3, it suffices to show that $h(x) := \sum_{y}\pi(y)\,\mathbb{E}_x[\tau_y]$
is harmonic on all of $V$. As before, it is natural to expand $\mathbb{E}_x[\tau_y]$ according to the
first step of the chain,
\[
\mathbb{E}_x[\tau_y] = \mathbf{1}_{\{x \neq y\}}\left(1 + \sum_{z} P(x, z)\,\mathbb{E}_z[\tau_y]\right).
\]
Rearranging, we get
\[
\begin{aligned}
\Delta h(x) &= \left[\sum_{z} P(x, z) h(z)\right] - h(x)\\
&= \pi(x)\left(1 + \sum_{z} P(x, z)\,\mathbb{E}_z[\tau_x]\right) - 1\\
&= 0,
\end{aligned}
\]
\[
\tau_{W^c} = \inf\{t \ge 0 : X_t \in W^c\}.
\]
The first term on the right-hand side is a final cost incurred when we exit W (and
depends on where we do), while the second term is a unit time cost incurred along
the sample path. Note that, in fact, it suffices to define h on ∂V W , the outer
boundary of W if we restrict ourselves to x ∈ W . Observe also that the function
u(x) may take the value +∞; the expectation is well-defined (in R+ ∪ {+∞}) by
the nonnegativity of the terms (see Appendix B).
Example 3.3.5 (Some special cases). Here are some important special cases:
◁
The function $u$ in (3.3.8) turns out to satisfy a generalized version of (3.3.7).
The proof is usually called first-step analysis (of which we have already seen many
instances).
Theorem 3.3.6 (First-step analysis). Let P be a transition matrix on a finite or
countable state space V . Let W be a proper subset of V , and let h : W c → R+
and k : W → R+ be bounded functions. Then the function u ≥ 0, as defined
in (3.3.8), satisfies the system of equations
\[
\begin{cases}
u(x) = k(x) + \sum_{y} P(x, y)\,u(y) & \text{for } x \in W,\\
u(x) = h(x) & \text{for } x \in W^c.
\end{cases} \tag{3.3.9}
\]
= k(x) + Ex [u(X1 )] ,
If $u$ is finite, the system of equations (3.3.9) can be rewritten as the Poisson equation
(once again as an analogue of its counterpart in the theory of partial differential
equations)
\[
\begin{cases}
\Delta u = -k & \text{on } W,\\
u = h & \text{on } W^c.
\end{cases} \tag{3.3.10}
\]
∆ψ ≤ −k on W .
Then
ψ ≥ u, on V , (3.3.13)
where u is the function defined in (3.3.8).
Proof. The system (3.3.13) holds on W c by Theorem 3.3.6 and (3.3.12) since in
that case u(x) = h(x) ≤ ψ(x).
Fix x ∈ W . Consider the nonnegative supermartingale (Nt ) in Lemma 3.3.8.
By the convergence of nonnegative supermartingales (Corollary 3.1.48), (Nt ) con-
verges almost surely to a finite limit with expectation ≤ Ex [N0 ]. In particular,
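the limit $N_{\tau_{W^c}}$ is well-defined, nonnegative and finite, including on the event that
$\{\tau_{W^c} = +\infty\}$. As a result,
\[
\begin{aligned}
N_{\tau_{W^c}} &= \lim_{t}\left[\psi(X_{t\wedge\tau_{W^c}}) + \sum_{0 \le s < t\wedge\tau_{W^c}} k(X_s)\right]\\
&\ge h(X_{\tau_{W^c}})\,\mathbf{1}\{\tau_{W^c} < +\infty\} + \sum_{0 \le s < \tau_{W^c}} k(X_s),
\end{aligned}
\]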
≤ Ex [NτW c ]
≤ Ex [N0 ]
= ψ(x),
where, on the last line, we used that the initial state is x ∈ W . That proves the
claim.
Lyapounov functions
Here is an important application, bounding from above the hitting time τA to a set
A in expectation.
Theorem 3.3.10 (Controlling hitting times via Lyapounov functions). Let P be a
transition matrix on a finite or countably infinite state space V . Let A be a proper
subset of V . Suppose the nonnegative function ψ : V → R+ satisfies the system of
inequalities
∆ψ ≤ −1, on Ac . (3.3.14)
Then
Ex [τA ] ≤ ψ(x),
for all x ∈ V .
Proof. Indeed, by (3.3.14) and nonnegativity (in particular on A), the function ψ
satisfies the assumptions of Theorem 3.3.9 with W = Ac , h = 0 on A, and k = 1
on Ac . Hence, by definition of u and the claim in Theorem 3.3.9,
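\[
\begin{aligned}
\mathbb{E}_x[\tau_A] &= \mathbb{E}_x\left[h(X_{\tau_A})\,\mathbf{1}\{\tau_A < +\infty\} + \sum_{0 \le t < \tau_A} k(X_t)\right]\\
&= u(x)\\
&\le \psi(x).
\end{aligned}
\]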
That establishes the claim.
Recalling (3.3.11), condition (3.3.14) is equivalent to the following conditional
expected decrease in ψ outside A:
E[ψ(Xt+1 ) − ψ(Xt ) | Ft ] ≤ −1, on {Xt ∈ Ac }. (3.3.15)
A nonnegative function satisfying an inequality of this type, also known as drift
condition, is often referred to as a Lyapounov function. Intuitively, it tends to
decrease along the sample path outside of $A$. Because it is nonnegative, it cannot
decrease forever and therefore the chain eventually enters $A$. We consider a simple
example next.
Example 3.3.11 (A Markov chain on the nonnegative integers). Let (Zt )t≥1 be
i.i.d. integrable random variables taking values in Z such that E[Z1 ] < 0. Let
(Xt )t≥0 be the chain defined by X0 = x for some x ∈ Z+ and
\[
X_{t+1} = (X_t + Z_{t+1})^+,
\]
where recall that $z^+ = \max\{0, z\}$. In particular $X_t \in \mathbb{Z}_+$ for all $t$. Let $(\mathcal{F}_t)$ be the
corresponding filtration. When Xt is large, the “local drift” is close to E[Z1 ] < 0.
By analogy to the biased case of the gambler’s ruin (Example 3.1.43), we might
expect that, from a large starting point x, it will take time roughly x/|E[Z1 ]| in
expectation to “return to a neighborhood of 0.” We prove something along those
lines here using a Lyapounov function.
Observe that, for any y ∈ Z+ , we have on the event {Xt = y} by the Markov
property
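\[
\begin{aligned}
\mathbb{E}_x[X_{t+1} - X_t \mid \mathcal{F}_t] &= \mathbb{E}[(y + Z_{t+1})^+ - y]\\
&= \mathbb{E}[-y\,\mathbf{1}\{Z_{t+1} \le -y\} + Z_{t+1}\,\mathbf{1}\{Z_{t+1} > -y\}]\\
&\le \mathbb{E}[Z_{t+1}\,\mathbf{1}\{Z_{t+1} > -y\}]\\
&= \mathbb{E}[Z_1\,\mathbf{1}\{Z_1 > -y\}].
\end{aligned} \tag{3.3.16}
\]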
For all y, the random variable |Z1 1{Z1 > −y}| is bounded by |Z1 |, itself an
integrable random variable. Moreover, Z1 1{Z1 > −y} → Z1 as y → +∞ almost
surely. Hence, the dominated convergence theorem (Proposition B.4.14) implies
that
So for any 0 < ε < −E[Z1 ], there is yε ∈ Z+ large enough that E[Z1 1{Z1 >
−y}] < −ε for all y > yε . Fix ε as above and define
A := {0, 1, . . . , yε }.
for y ∈ Ac . This is the same as (3.3.15). Hence, we can apply Theorem 3.3.10 to
get
\[
\mathbb{E}_x[\tau_A] \le \psi(x) = \frac{x}{\varepsilon},
\]
for all $x \ge y_\varepsilon$. ◁
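The bound $\mathbb{E}_x[\tau_A] \le x/\varepsilon$ can be compared with a simulation in a concrete case. The choices below are assumptions made purely for illustration: the increments take the values $+1$ and $-1$ with probabilities $0.3$ and $0.7$, so that $\mathbb{E}[Z_1] = -0.4$; we take $\varepsilon = 0.3$; and since $Z_1 > -y$ always holds when $y \ge 2$, we can take $y_\varepsilon = 1$, that is, $A = \{0, 1\}$.

```python
import random


def mean_hitting_time(x, p_up, y_eps, trials=20000, seed=0):
    """Monte Carlo estimate of E_x[tau_A] for A = {0, ..., y_eps}.

    The chain is X_{t+1} = max(0, X_t + Z_{t+1}) with Z = +1 w.p. p_up, else -1.
    """
    rng = random.Random(seed)
    total_steps = 0
    for _ in range(trials):
        state, steps = x, 0
        while state > y_eps:
            z = 1 if rng.random() < p_up else -1
            state = max(0, state + z)
            steps += 1
        total_steps += steps
    return total_steps / trials


p_up = 0.3   # E[Z_1] = 2 * 0.3 - 1 = -0.4
eps = 0.3    # any value in (0, 0.4) works for this increment law
y_eps = 1    # for y >= 2 the truncated drift equals E[Z_1] < -eps
x = 10

estimate = mean_hitting_time(x, p_up, y_eps)
bound = x / eps
print(estimate, bound)
```

For this nearest-neighbor walk the exact value is $(x - 1)/0.4 = 22.5$, so the Lyapounov bound $x/\varepsilon \approx 33.3$ holds with some room to spare; it is an upper bound, not an asymptotically sharp estimate.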
∆ψ ≤ −1, on Ac ,
Definitions
Let $\mathcal{N} = (G, c)$ be a finite or countable network with $G = (V, E)$. Throughout
this section we assume that $\mathcal{N}$ is connected and locally finite. In the context of
electrical networks, edge weights are called conductances. The reciprocals of the
conductances are called resistances and are denoted by $r(e) := 1/c(e)$, for all
$e \in E$. For an edge $e = \{x, y\}$ we overload $c(x, y) := c(e)$ and $r(x, y) := r(e)$.
Both $c$ and $r$ are symmetric as functions of $x, y$. Recall that the transition matrix
of the random walk on $\mathcal{N}$ satisfies
\[
P(x, y) = \frac{c(x, y)}{c(x)},
\]
where
\[
c(x) = \sum_{z : z \sim x} c(x, z).
\]
where
v(a) = v0 and v|Z ≡ 0. (3.3.18)
Moreover
\[
\frac{v(x)}{v_0} = \mathbb{P}_x[\tau_a < \tau_Z], \tag{3.3.19}
\]
for the corresponding random walk on N .
Proof. Set h(x) = v(x) on A ∪ Z. Theorem 3.3.1 gives the result.
Note in the definition above that if v is a voltage with value v0 at a, then ṽ(x) =
v(x)/v0 is a voltage with value 1 at a.
Let $v$ be a voltage function on $\mathcal{N}$ with source $a$ and sink $Z$. The Laplacian-based
formulation of harmonicity, (3.3.7), can be interpreted in terms of flows (see
Definition 1.1.13). We define the current function
\[
i(x, y) := c(x, y)[v(x) - v(y)], \tag{3.3.20}
\]
or, equivalently, $v(x) - v(y) = r(x, y)\,i(x, y)$. The latter definition is usually
referred to as Ohm’s “law.” Notice that the current is defined on ordered pairs of
vertices and is anti-symmetric, that is, $i(x, y) = -i(y, x)$. In terms of the current,
the harmonicity of $v$ is then expressed as
\[
\sum_{y : y \sim x} i(x, y) = 0, \qquad \forall x \in W, \tag{3.3.21}
\]
Because $a \notin W$, it does not satisfy Kirchhoff’s node law and the strength is not
0 in general. The definition of $i(x, y)$ ensures that the flow out of the source is
nonnegative as $\mathbb{P}_y[\tau_a < \tau_Z] \le 1 = \mathbb{P}_a[\tau_a < \tau_Z]$ for all $y \sim a$ so that
\[
i(a, y) = c(a, y)[v(a) - v(y)] = c(a, y)\left[v_0\,\mathbb{P}_a[\tau_a < \tau_Z] - v_0\,\mathbb{P}_y[\tau_a < \tau_Z]\right] \ge 0.
\]
Ohm’s law is also satisfied on every other edge (to the right of x) because nothing
has changed there. That proves the claim.
We do the same reduction on the other side of x by replacing x ∼ x + 1 ∼
· · · ∼ n with a single edge of resistance Rx,n = r(x, x + 1) + · · · + r(n − 1, n).
See Figure 3.4.
Because the voltage at x was not changed by this transformation, we can com-
pute v(x) = Px [τ0 < τn ] directly on the reduced network, where it is now a
straightforward computation. Indeed, starting at x, the reduced walk jumps to 0
with probability proportional to the conductance on the new super-edge 0 ∼ x (or
The above example illustrates the series law: resistances in series add up.
There is a similar parallel law: conductances in parallel add up. To formalize
these laws, one needs to introduce multigraphs. This is straightforward, although
to avoid complicating the notation further we will not do this here. (But see
Example 3.3.22 for a simple case.)
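The two laws translate directly into code. The sketch below (helper names are illustrative) implements them and uses the series law to reproduce the path computation described above: after collapsing each side of $x$ into a super-edge, the walk started at $x$ reaches $0$ before $n$ with probability proportional to the conductance of the super-edge $0 \sim x$.

```python
def series(*resistances):
    """Series law: resistances in series add up."""
    return sum(resistances)


def parallel(*resistances):
    """Parallel law: conductances (reciprocal resistances) in parallel add up."""
    return 1.0 / sum(1.0 / r for r in resistances)


def prob_hit_zero_first(resistances, x):
    """P_x[tau_0 < tau_n] on the path 0 ~ 1 ~ ... ~ n, for 0 < x < n.

    resistances[j] is r(j, j+1), so len(resistances) == n.
    """
    r_left = series(*resistances[:x])   # super-edge between 0 and x
    r_right = series(*resistances[x:])  # super-edge between x and n
    c_left, c_right = 1.0 / r_left, 1.0 / r_right
    return c_left / (c_left + c_right)


# Simple random walk on {0, ..., 10}: unit resistances.
print(prob_hit_zero_first([1.0] * 10, 3))

# Biased walk with q/p = 2 on {0, 1, 2, 3}: r(j, j+1) = (q/p)^j.
print(prob_hit_zero_first([1.0, 2.0, 4.0], 1))

# Two resistors of resistance 1 and 2 in parallel.
print(parallel(1.0, 2.0))
```

In the unit-resistance case this recovers the familiar value $1 - x/n$, and in the biased case it agrees with the classical gambler's ruin formula.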
Another useful network reduction technique is illustrated in the next example.
Example 3.3.15 (Network reduction: binary tree). Let $\mathcal{N}$ be the rooted binary tree
with $n$ levels $\widehat{\mathbb{T}}_2^n$ and equal conductances on all edges. Let 0 be the root. Pick an
arbitrary leaf and denote it by $n$. The remaining vertices on the path between 0
and $n$, which we refer to as the main path, will be denoted by $1, \ldots, n-1$ moving
away from the root. We claim that, for all $0 < x < n$, it holds that
Indeed let v be the voltage with values 1 and 0 at a = 0 and Z = {n} respec-
tively. Let i be the corresponding current. Notice that, for each 0 ≤ y < n, the
current—as a flow—has “nowhere to go” on the subtree Ty hanging from y away
from the main path. The leaves of the subtree are dead ends. Hence the current
must be 0 on Ty and by Ohm’s law the voltage must be constant on it, that is, every
vertex in Ty has voltage v(y).
Imagine collapsing all vertices in Ty , including y, into a single vertex (and re-
moving the self-loops so created). Doing this for every vertex on the main path
results in a new reduced network which is formed of a single path as in Exam-
ple 3.3.14. Note that the voltage and the current can be taken to be the same as
they were previously on the main path. Indeed, with this choice, Ohm’s law is
automatically satisfied. Moreover, because there is no current on the hanging sub-
trees in the original network, Kirchhoff’s node law is also satisfied on the reduced
network, as no current is “lost.”
Hence the answer can be obtained from Example 3.3.14. That proves the claim.
(You should convince yourself that this result is obvious from a probabilistic point
of view.) ◁
for the escape probability. The next lemma can be interpreted as a sort of Ohm’s
law between $a$ and $Z$, where $c(a)\,\mathbb{P}[a \to Z]$ is the “effective conductance.” (We
will be more formal in Definition 3.3.19 below.)
Lemma 3.3.16 (Effective Ohm’s Law). Let v be a voltage on N with source a and
sink Z. Let i be the associated current. Then
\[
\frac{v(a)}{\|i\|} = \frac{1}{c(a)\,\mathbb{P}[a \to Z]}. \tag{3.3.22}
\]
where we used Corollary 3.3.13 on the second line and Ohm’s law on the last line.
Rearranging gives the result.
be the number of one-step transitions from x to y up to the time of the first visit to
the sink Z for the random walk on N started at a. Let v be the voltage correspond-
ing to the unit current i. Then the following formulas hold:
\[
v(x) = \frac{G_{\tau_Z}(a, x)}{c(x)}, \qquad \forall x, \tag{3.3.23}
\]
and
\[
i(x, y) = \mathbb{E}_a\left[N^Z_{x \to y} - N^Z_{y \to x}\right], \qquad \forall x \sim y.
\]
Proof. We prove the formula for the voltage by showing that $v(x)$ as defined above
is harmonic on $W = V \setminus (\{a\} \cup Z)$. Note first that, for all $z \in Z$, the expected
number of visits to $z$ before reaching $Z$ (i.e., $G_{\tau_Z}(a, z)$) is 0. Or, put differently,
$0 = v(z) = \frac{G_{\tau_Z}(a, z)}{c(z)}$. Moreover, to compute $G_{\tau_Z}(a, a)$, note that the number of
visits to $a$ before the first visit to $Z$ is geometric with success probability $\mathbb{P}[a \to Z]$
by the strong Markov property (Theorem 3.1.8) and hence
\[
G_{\tau_Z}(a, a) = \frac{1}{\mathbb{P}[a \to Z]},
\]
and, by Lemma 3.3.16 and the fact that we are using the unit current, $v(a) = \frac{G_{\tau_Z}(a, a)}{c(a)}$, as required.
To establish the formula for $x \in W$, we compute the quantity
\[
\frac{1}{c(x)}\sum_{y : y \sim x}\mathbb{E}_a\left[N^Z_{y \to x}\right],
\]
in two ways. First, because each visit to $x \in W$ must enter through one of $x$’s
neighbors (including itself in the presence of a self-loop), we get
\[
\frac{1}{c(x)}\sum_{y : y \sim x}\mathbb{E}_a\left[N^Z_{y \to x}\right] = \frac{G_{\tau_Z}(a, x)}{c(x)}. \tag{3.3.24}
\]
where we used that $c(x, y) = c(x)P(x, y) = c(y)P(y, x)$ (see Definition 1.2.7).
Equating (3.3.24) and (3.3.26) shows that $\frac{G_{\tau_Z}(a, x)}{c(x)}$ is harmonic on $W$ and hence
must be equal to the voltage function by Corollary 3.3.13.
Finally, by (3.3.25),
\[
\begin{aligned}
\mathbb{E}_a\left[N^Z_{x \to y} - N^Z_{y \to x}\right] &= P(x, y)\,G_{\tau_Z}(a, x) - P(y, x)\,G_{\tau_Z}(a, y)\\
&= P(x, y)\,v(x)\,c(x) - P(y, x)\,v(y)\,c(y)\\
&= c(x, y)[v(x) - v(y)]\\
&= i(x, y).
\end{aligned}
\]
Example 3.3.18 (Network reduction: binary tree (continued)). Recall the setting
of Example 3.3.15. We argued that the current on side edges, that is, edges of
subtrees hanging from the main path, is 0. This is clear from the probabilistic
interpretation of the current: in a walk from a to z, any traversal of a side edge
must be undone at a later time. ◁
The network reduction techniques illustrated above are useful. But the power
of the electrical network perspective is more apparent in what comes next: the
definition of the effective resistance and, especially, its variational characterization.
Effective resistance
Before proceeding further, let us recall our original motivation. Let N = (G, c)
be a countable, locally finite, connected network and let (Xt ) be the corresponding
walk. Recall that a vertex a in G is transient if Pa [τa+ < +∞] < 1.
To relate this to our setting, consider an exhaustive sequence of induced subgraphs
$G_n$ of $G$ which for our purposes is defined as: $G_0$ contains only $a$, $G_n \subseteq G_{n+1}$,
$G = \bigcup_n G_n$, and every $G_n$ is finite and connected. Such a sequence always
exists by iteratively adding the neighbors of the previous vertices and using that $G$
is locally finite and connected. Let $Z_n$ be the set of vertices of $G$ not in $G_n$. Then,
by Lemma 3.1.25, $\mathbb{P}_a[\tau_{Z_n} \wedge \tau_a^+ = +\infty] = 0$ for all $n$ by our assumptions on $(G_n)$.
Hence, the remaining possibilities are
Therefore a is transient if and only if limn P[a → Zn ] > 0. Note that the limit ex-
ists because the sequence of events {τZn < τa+ } is decreasing by construction. By
a sandwiching argument the limit also does not depend on the exhaustive sequence.
Hence we define
\[
\mathbb{P}[a \to \infty] := \lim_{n}\mathbb{P}[a \to Z_n].
\]
We use Lemma 3.3.16 to characterize this limit using electrical network concepts.
But, first, here comes the key definition. In Lemma 3.3.16, $v(a)$ can be thought
of as the potential difference between the source and the sink, and $\|i\|$ can be
thought of as the total current flowing through the network from the source to the
sink. Hence, viewing the network as a single “super-edge,” Equation (3.3.22) is
the analogue of Ohm’s law if we interpret $c(a)\,\mathbb{P}[a \to Z]$ as an “effective conductance.”
where the rightmost equality holds by Lemma 3.3.16. The reciprocal is called the
effective conductance and is denoted by $\mathscr{C}(a \leftrightarrow Z) := 1/\mathscr{R}(a \leftrightarrow Z)$.
Going back to recurrence, for an exhaustive sequence $(G_n)$ with $(Z_n)$ as above,
it is natural to define
where, once again, the limit does not depend on the choice of exhaustive sequence.
Proof. This follows immediately from the definition of the effective resistance.
Recall that, on a connected network, all states have the same type (recurrent or
transient).
Note that the network reduction techniques we discussed previously leave both
the voltage and the current strength unchanged on the reduced network. Hence
they also leave the effective resistance unchanged.
Example 3.3.21 (Gambler’s ruin chain revisited). Extend the gambler’s ruin chain
of Example 3.3.14 to all of Z+ . We determine when this chain is transient. Be-
cause it is irreducible, all states have the same type and it suffices to look at 0.
Consider the exhaustive sequence obtained by letting Gn be the graph restricted
to {0, 1, . . . , n − 1} and letting Zn = {n, n + 1 . . .}. To compute the effective
resistance R(0 ↔ Zn ), we use the same reduction as in Example 3.3.14. The
“super-edge” between 0 and n has resistance
\[
\mathscr{R}(0 \leftrightarrow Z_n) = \sum_{j=0}^{n-1} r(j, j+1) = \sum_{j=0}^{n-1}(q/p)^j = \frac{(q/p)^n - 1}{(q/p) - 1},
\]
Example 3.3.22 (Biased walk on the b-ary tree). Fix $\lambda \in (0, +\infty)$. Consider the
rooted, infinite $b$-ary tree with conductance $\lambda^j$ on all edges between level $j-1$ and
$j$, for $j \ge 1$. We determine when this chain is transient. Because it is irreducible,
all states have the same type and it suffices to look at the root. Denote the root
by 0. For an exhaustive sequence, let Gn be the root together with the first n − 1
levels. Let Zn be as before. To compute R(0 ↔ Zn ): (i) glue together all vertices
of Zn ; (ii) glue together all vertices on the same level of Gn ; (iii) replace parallel
edges with a single edge whose conductance is the sum of the conductances; (iv)
let the current on this edge be the sum of the currents; and (v) leave the voltages
unchanged. It can be checked that Ohm’s law and Kirchhoff’s node law are still
satisfied, and that hence we have not changed the effective resistance. (This is an
application of the parallel law.)
The reduced network is now a line. Denote the new vertices $0, 1, \ldots, n$. The
conductance on the edge between $j$ and $j+1$ is $b^{j+1}\lambda^j = b(b\lambda)^j$. So this is
the chain from the previous example with $(p/q) = b\lambda$ where all conductances are
scaled by a factor of $b$. Hence
\[
\mathscr{R}(0 \leftrightarrow \infty) = \begin{cases}
+\infty, & b\lambda \le 1,\\
\dfrac{1}{b(1 - (b\lambda)^{-1})}, & b\lambda > 1.
\end{cases}
\]
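The dichotomy can be observed numerically by summing the resistances of the reduced line. The sketch below takes the conductance $b(b\lambda)^j$ on the edge between $j$ and $j+1$ as given in the example; the function names are illustrative.

```python
def resistance_to_level(b, lam, n):
    """R(0 <-> Z_n) on the reduced line: series sum of 1 / (b * (b*lam)^j), j < n."""
    return sum(1.0 / (b * (b * lam) ** j) for j in range(n))


def limiting_resistance(b, lam):
    """Closed form for R(0 <-> infinity) from the example."""
    if b * lam <= 1:
        return float("inf")
    return 1.0 / (b * (1.0 - 1.0 / (b * lam)))


# Transient regime: b * lam = 2 > 1, the resistance converges.
print(resistance_to_level(2, 1.0, 60), limiting_resistance(2, 1.0))

# Critical case: b * lam = 1, the resistance grows linearly in n.
print(resistance_to_level(2, 0.5, 100), limiting_resistance(2, 0.5))
```

In the first case the partial sums settle at the closed-form value, so the effective resistance to infinity is finite; in the second they grow without bound, which corresponds to the infinite-resistance branch of the display.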
Variational principles
Recall from Definition 1.1.13 that a flow $\theta$ from source $a$ to sink $Z$ on a countable,
locally finite, connected network $\mathcal{N} = (G, c)$ is a function on pairs of adjacent
vertices such that: $\theta$ is anti-symmetric, that is, $\theta(x, y) = -\theta(y, x)$ for all $x \sim y$;
and it satisfies the flow-conservation constraint $\sum_{y : y \sim x}\theta(x, y) = 0$ on all vertices
$x$ except those in $\{a\} \cup Z$. The strength of the flow is $\|\theta\| = \sum_{y : y \sim a}\theta(a, y)$. The
current is a special flow, one that can be written as a potential difference according
to Ohm’s law. As we show next, it can also be characterized as a flow minimizing
a certain energy. Specifically, the energy of a flow $\theta$ is defined as
\[
\mathscr{E}(\theta) = \frac{1}{2}\sum_{x, y} r(x, y)\,\theta(x, y)^2.
\]
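A toy computation makes the energy-minimizing role of the current concrete. The network below is a hypothetical example chosen for illustration: three vertices `a`, `b`, `z` with unit resistances on the edges $\{a, z\}$, $\{a, b\}$ and $\{b, z\}$, source `a` and sink `z`. Every unit flow sends a fraction $p$ along the direct edge and $1 - p$ along the two-edge path. By the series and parallel laws the effective resistance is $(1 \cdot 2)/(1 + 2) = 2/3$, and the unit current corresponds to $p = 2/3$.

```python
def energy(flow, resistance):
    """Energy of a flow given on one orientation of each edge.

    Summing r * theta^2 once per edge equals the definition, which sums over
    ordered pairs (each edge counted twice) and then halves.
    """
    return sum(resistance[e] * flow[e] ** 2 for e in flow)


# Hypothetical three-vertex network with unit resistances.
resistance = {("a", "z"): 1.0, ("a", "b"): 1.0, ("b", "z"): 1.0}


def unit_flow(p):
    """Unit flow from a to z sending p directly and 1 - p through b."""
    return {("a", "z"): p, ("a", "b"): 1.0 - p, ("b", "z"): 1.0 - p}


# Scan a grid of unit flows and locate the minimizer of the energy.
energies = {k / 100: energy(unit_flow(k / 100), resistance) for k in range(101)}
best_p = min(energies, key=energies.get)
print(best_p, energies[best_p])
print(energy(unit_flow(2.0 / 3.0), resistance))
```

The scan should locate the minimum near $p = 2/3$ with energy close to $2/3$, in line with the statement that the current minimizes the energy among unit flows and that the minimal energy equals the effective resistance.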
The proof of the variational principle we present here employs a neat trick, convex
duality. In particular, it reveals that the voltage and current are dual in the sense of
convex analysis.
where h has an entry for all vertices except those in Z. For all h,
because those $\theta$s with $B\theta = b$ make the second term vanish in $\mathscr{L}(\theta; h)$. Since
$\mathscr{L}(\theta; h)$ is strictly convex as a function of $\theta$, the solution to its minimization is
characterized by the usual optimality conditions which in this case read $2R\theta - 2B^T h = 0$, or
\[
\theta = R^{-1}B^T h. \tag{3.3.28}
\]
Substituting into the Lagrangian and simplifying, we have proved that
for all h and flow θ. This inequality is a statement of weak duality. To show that a
flow θ is optimal it suffices to find h such that E (θ) = L ∗ (h).
Let θ = i be the unit current in vector form, which satisfies Bθ = b by our
choice of b and Kirchhoff’s node law (i.e., (3.3.21)). The suitable dual turns out to
be the corresponding voltage h = v in vector form restricted to V \ Z. To see this,
observe that $B^T h$ is the vector of neighboring node differences
\[
B^T h = (h(x) - h(y))_{(x, y) \in \vec{G}}, \tag{3.3.30}
\]
where implicitly h|Z ≡ 0. Hence the optimality condition (3.3.28) is nothing but
Ohm’s law (i.e., (3.3.20)) in vector form. Therefore, if i is the unit current and v is
the associated voltage in vector form, it holds that
where the first equality follows from the fact that i minimizes L (i; v) by (3.3.28)
and the second equality follows from the fact that Bi = b. So we must have
E (i) = E ∗ by weak duality (i.e., (3.3.29)).
As for uniqueness, it can be checked that two minimizers $\theta, \theta'$ satisfy
\[
\mathscr{E}^* = \frac{\mathscr{E}(\theta) + \mathscr{E}(\theta')}{2} = \mathscr{E}\left(\frac{\theta + \theta'}{2}\right) + \mathscr{E}\left(\frac{\theta - \theta'}{2}\right),
\]
by definition of the energy. The first term in the rightmost expression is greater or
equal to $\mathscr{E}^*$ since the average of two unit flows is still a unit flow. The second term
is nonnegative by definition. Hence the latter must be zero and the only way for
this to happen is if $\theta = \theta'$.
To conclude the proof, it remains to compute the optimal value. The matrix $B R^{-1} B^T$ is related to the Laplacian associated to random walk on N (see Section 3.3.1) up to a row scaling. Multiplying by row x ∈ V \ Z involves taking a conductance-weighted average of the neighboring values and subtracting the value at x, that is,
$$\begin{aligned}
\left(B R^{-1} B^T v\right)_x &= \sum_{y : (x,y) \in \vec{G}} \big[c(x,y)(v(x) - v(y))\big] - \sum_{y : (y,x) \in \vec{G}} \big[c(y,x)(v(y) - v(x))\big] \\
&= \sum_{y : y \sim x} \big[c(x,y)(v(x) - v(y))\big],
\end{aligned}$$
where we used (3.3.30) and the facts that $r(x,y)^{-1} = c(x,y)$ and c(x, y) = c(y, x), and it is assumed implicitly that $v|_Z \equiv 0$. By Corollary 3.3.13, this is zero except for the row x = a where it is
$$\sum_{y : y \sim a} c(a,y)[v(a) - v(y)] = \sum_{y : y \sim a} i(a,y) = 1,$$
where we used Ohm’s law and the fact that the current has unit strength. We have finally
$$\begin{aligned}
\mathscr{E}^* &= \mathscr{L}^*(v) \\
&= -v^T B R^{-1} B^T v + 2 v^T b \\
&= -v(a) + 2v(a) \\
&= v(a) \\
&= \mathscr{R}(a \leftrightarrow Z),
\end{aligned}$$
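The identity between the energy of the unit current and v(a) can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the 4-vertex network and the helper names `unit_current` and `energy` are hypothetical and chosen only for illustration. It solves for the voltage with $v|_Z \equiv 0$, recovers the current through Ohm’s law, and then perturbs the current by a circulation, which keeps it a unit flow but raises its energy.

```python
import numpy as np

def unit_current(n, edges, a, Z):
    """Voltage and unit current from source a to sink set Z.

    edges maps each undirected edge (x, y), listed once, to its resistance.
    """
    L = np.zeros((n, n))                          # weighted Laplacian
    for (x, y), r in edges.items():
        c = 1.0 / r
        L[x, x] += c; L[y, y] += c
        L[x, y] -= c; L[y, x] -= c
    free = [x for x in range(n) if x not in Z]    # v vanishes on Z
    b = np.zeros(len(free))
    b[free.index(a)] = 1.0                        # unit net flow out of a
    v = np.zeros(n)
    v[free] = np.linalg.solve(L[np.ix_(free, free)], b)
    i = {(x, y): (v[x] - v[y]) / r for (x, y), r in edges.items()}  # Ohm's law
    return v, i

def energy(theta, edges):
    """Energy of a flow; each edge is listed once, so no factor 1/2."""
    return sum(r * theta[e] ** 2 for e, r in edges.items())

# Hypothetical 4-vertex network: source 0, sink {3}.
edges = {(0, 1): 1.0, (0, 2): 2.0, (1, 2): 1.0, (1, 3): 3.0, (2, 3): 1.0}
v, i = unit_current(4, edges, a=0, Z={3})

# Adding a circulation around the cycle 0 -> 1 -> 2 -> 0 keeps a unit flow.
theta = dict(i)
theta[(0, 1)] += 0.1
theta[(1, 2)] += 0.1
theta[(0, 2)] -= 0.1

print(energy(i, edges), v[0], energy(theta, edges))
```

The first two printed values should agree, while the third should be strictly larger, in line with the uniqueness argument above.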
Observe that the convex combination α minimizing the sum of squares $\sum_j \alpha_j^2$ is constant. In a similar manner, Thomson’s principle (Theorem 3.3.23) stipulates
roughly speaking that the more the flow can be spread out over the network, the
lower is the effective resistance (penalizing flow on edges with higher resistance).
Pólya’s theorem below provides a vivid illustration. Here is a simple example
suggesting that, in a sense, the current is indeed a well-distributed flow.
Example 3.3.24 (Random walk on the complete graph). Let N be the complete
graph on {1, . . . , n} with unit resistances, and let a = 1 and Z = {n}. Assume
n > 2. The effective resistance is straightforward to compute in this case. Indeed,
the escape probability (with a slight abuse of notation) is
$$P[1 \to n] = \frac{1}{n-1} + \frac{1}{2}\left(1 - \frac{1}{n-1}\right) = \frac{n}{2(n-1)},$$
when n > 2. Because the direct path from 1 to n has a somewhat lower resistance,
the optimal flow is obtained by increasing the flow on that edge slightly. Namely,
for a flow α on {1, n} (and the rest divided up evenly among the two-edge paths),
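we get an energy of $\alpha^2 + 2(n-2)\left[\frac{1-\alpha}{n-2}\right]^2$ which is minimized at $\alpha = \frac{2}{n}$ where it is indeed
$$\left(\frac{2}{n}\right)^2 + \frac{2}{n-2}\left(\frac{n-2}{n}\right)^2 = \frac{2}{n}\cdot\frac{2}{n} + \frac{2}{n}\cdot\frac{n-2}{n} = \frac{2}{n}.$$
◀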
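The computation in this example is easy to reproduce numerically. The sketch below assumes NumPy; the function names are hypothetical. It computes the effective resistance between two vertices of the complete graph by grounding the sink and solving the Laplacian system, and it evaluates the energy of the flow that puts α on the direct edge.

```python
import numpy as np

def effective_resistance_complete(n):
    """R(1 <-> n) on the complete graph K_n with unit resistances."""
    L = n * np.eye(n) - np.ones((n, n))      # Laplacian of K_n
    free = list(range(n - 1))                # ground the sink (last vertex)
    b = np.zeros(n - 1)
    b[0] = 1.0                               # unit current enters at the source
    v = np.linalg.solve(L[np.ix_(free, free)], b)
    return v[0]                              # voltage at the source

def flow_energy(alpha, n):
    """Energy when alpha goes on the direct edge and the rest is split
    evenly among the n - 2 two-edge paths."""
    return alpha**2 + 2 * (n - 2) * ((1 - alpha) / (n - 2))**2

for n in (3, 5, 10):
    print(n, effective_resistance_complete(n), flow_energy(2 / n, n), 2 / n)
```

For each n, the three printed quantities should coincide up to rounding.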
As we noted above, the matrix BR−1 B T in the proof of Thomson’s princi-
ple is related to the Laplacian. Because B T h is the vector of neighboring node
differences, we have
$$h^T B R^{-1} B^T h = \frac{1}{2} \sum_{x,y} c(x,y)[h(y) - h(x)]^2,$$
where we implicitly fix $h|_Z \equiv 0$, which is called the Dirichlet energy. Thinking of $B^T$ as a “discrete gradient,” the Dirichlet energy can be interpreted as the weighted norm of the gradient of h. The following is a “dual” to Thomson’s principle.
Exercise 3.15 asks for a proof.
Theorem 3.3.25 (Dirichlet’s principle). Let N = (G, c) be a finite, connected
network. The effective conductance between source a and sink Z is characterized
by
$$\mathscr{C}(a \leftrightarrow Z) = \inf\left\{ \frac{1}{2} \sum_{x,y} c(x,y)[h(y) - h(x)]^2 \,:\, h(a) = 1,\ h|_Z \equiv 0 \right\}.$$
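As a sanity check on Dirichlet’s principle, one can compute the harmonic function with h(a) = 1 and $h|_Z \equiv 0$ on a small network and compare its Dirichlet energy with the current flowing out of a at unit voltage, which is the effective conductance. This is a minimal sketch assuming NumPy; the network and function names are hypothetical.

```python
import numpy as np

def dirichlet_energy(h, cond):
    # cond lists each undirected edge once, so the factor 1/2 is absorbed.
    return sum(c * (h[y] - h[x]) ** 2 for (x, y), c in cond.items())

def harmonic_extension(n, cond, a, Z):
    """h with h(a) = 1, h = 0 on Z, and h harmonic at all other vertices."""
    L = np.zeros((n, n))
    for (x, y), c in cond.items():
        L[x, x] += c; L[y, y] += c
        L[x, y] -= c; L[y, x] -= c
    inner = [x for x in range(n) if x != a and x not in Z]
    h = np.zeros(n)
    h[a] = 1.0
    rhs = -L[np.ix_(inner, [a])].ravel()     # move the known value h(a) = 1 to the right side
    h[inner] = np.linalg.solve(L[np.ix_(inner, inner)], rhs)
    return h

def current_out_of(a, h, cond):
    """Net current leaving a under Ohm's law, i(x, y) = c(x, y)(h(x) - h(y))."""
    out = 0.0
    for (x, y), c in cond.items():
        if x == a:
            out += c * (h[x] - h[y])
        elif y == a:
            out += c * (h[y] - h[x])
    return out

# Hypothetical network given by conductances: source 0, sink {3}.
cond = {(0, 1): 1.0, (0, 2): 0.5, (1, 2): 1.0, (1, 3): 2.0, (2, 3): 1.0}
h = harmonic_extension(4, cond, a=0, Z={3})
print(dirichlet_energy(h, cond), current_out_of(0, h, cond))
```

Changing h at an interior vertex while keeping the boundary values fixed should only increase the Dirichlet energy.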
Similarly, if N is a countable, locally finite, connected network, then for any col-
lection {Πj }j of finite, disjoint cutsets separating a from ∞,
$$\mathscr{R}(a \leftrightarrow \infty) \geq \sum_j \left( \sum_{e \in \Pi_j} c(e) \right)^{-1}.$$
Proof. Consider the case where N is finite first. We will need the following claim,
which follows immediately from Lemma 1.1.14: for any unit flow θ between a and
Z and any cutset Πj separating a from Z, it holds that
$$\sum_{e \in \Pi_j} |\theta(e)| \geq \|\theta\| = 1.$$
Example 3.3.27 (Biased walk on general trees). Let T be a locally finite tree
with root 0. Consider again the biased walk from Example 3.3.22, that is, the
conductance is λj on all edges between level j − 1 and j. Recall the branching
number br(T ) from Definition 2.3.10.
Assume λ > br(T). For any ε > 0, there is a cutset Π such that $\sum_{e \in \Pi} \lambda^{-|e|} \leq \varepsilon$. By Nash-Williams,
$$\mathscr{R}(0 \leftrightarrow \infty) \geq \left( \sum_{e \in \Pi} c(e) \right)^{-1} \geq \varepsilon^{-1}.$$
where, on the fourth line, we used Lemma 1.1.14 together with the fact that φn is a
unit flow and Fm is a cutset separating 0 and ∂n . Thomson’s principle implies that
R(0 ↔ ∂n ) is uniformly bounded in n. The walk is transient by Theorem 3.3.20.
◀
Proof. The additional edge enlarges the space of possible flows, so by Thomson’s
principle it can only lower the resistance or leave it as is. The second statement
follows from the definition of the effective resistance.
More generally:
$$\mathscr{R}_{N}(a \leftrightarrow Z) \leq \mathscr{R}_{N'}(a \leftrightarrow Z).$$
Proof. Compare the energies of an arbitrary flow on N and N′, and apply Thomson’s principle.
Note that this corollary implies the previous one by thinking of an absent edge as
one with infinite resistance.
Flows to infinity
Combining Theorem 3.3.20 and Thomson’s principle, we derive a flow-based cri-
terion for recurrence. To state the result, it is convenient to introduce the notion
of a unit flow θ from source a to ∞ on a countable, locally finite network: θ is anti-symmetric, it satisfies the flow-conservation constraint on all vertices but a, and $\|\theta\| := \sum_{y \sim a} \theta(a, y) = 1$. Note that the energy $\mathscr{E}(\theta)$ of such a flow is well defined in [0, +∞].
Proof. Suppose such a flow exists and has energy bounded by B < +∞. Let (Gn )
be an exhaustive sequence with associated sinks (Zn ). A unit flow from a to ∞
on N yields, by projection, a unit flow from a to Zn . This projected flow also has
energy bounded by B. Hence Thomson’s principle implies R(a ↔ Zn ) ≤ B for
all n and transience follows from Theorem 3.3.20.
Proving the other direction involves producing a flow to ∞. Suppose a is
transient and let (Gn ) be an exhaustive sequence as above. Then Theorem 3.3.20
implies that R(a ↔ Zn ) ≤ R(a ↔ ∞) < B for some B < +∞ and Thomson’s
principle guarantees in turn the existence of a flow θn from a to Zn with energy
bounded by B. In particular there is a unit current $i_n$, and associated voltage $v_n$, of energy bounded by B. So it remains to use the sequence of current flows $(i_n)$ to construct a flow to ∞ on the infinite network. The technical point is to show that the limit of $(i_n)$ exists and is indeed a flow. For this, consider the random walk on N started at a. Let $Y_n(x)$ be the number of visits to x before hitting $Z_n$ the first time. By the monotone convergence theorem (Proposition B.4.14),
and then
$$\begin{aligned}
i_\infty(x, y) &:= c(x, y)[v_\infty(x) - v_\infty(y)] \\
&= \lim_n c(x, y)[v_n(x) - v_n(y)] \\
&= \lim_n i_n(x, y),
\end{aligned}$$
by Ohm’s law (when n is large enough that both x and y are in $G_n$). Because $i_n$ is a flow for all n, by taking limits in the flow-conservation constraints we see that so is $i_\infty$. Note that by construction of $i_\ell$
$$\frac{1}{2} \sum_{x,y \in G_n} c(x, y)\, i_\infty(x, y)^2 = \lim_{\ell \geq n} \frac{1}{2} \sum_{x,y \in G_n} c(x, y)\, i_\ell(x, y)^2$$
2. for every edge e′ in G′, there are no more than β edges in G whose image under Φ contains e′.
Exercise 3.19 asks for a rigorous proof of rough equivalence. See also Exercise 3.20 for an important generalization of this example. ◀
Our main result about roughly equivalent networks is that they have the same
type.
Proof. Assume N is transient and let θ be a unit flow from some a to ∞ of finite energy. The existence of this flow is guaranteed by Theorem 3.3.30. Let φ, Φ be a rough embedding from N to N′ with parameters α and β.
The basic idea of the proof is to map the flow θ onto N′ using Φ. Because flows are directional, it will be convenient to think of edges as being directed. For e = {x, y} in N, let $\vec{\Phi}(x, y)$ be the path Φ(e) oriented from φ(x) to φ(y). So $(x', y') \in \vec{\Phi}(x, y)$ means that {x′, y′} ∈ Φ(e) and that x′ is visited before y′ in the path Φ(e) from φ(x) to φ(y). (If φ(x) = φ(y), choose an arbitrary orientation of the cycle Φ(e) for $\vec{\Phi}(x, y)$ and the reversed orientation for $\vec{\Phi}(y, x)$.) Then define, for x′, y′ with {x′, y′} in N′,
$$\theta'(x', y') := \sum_{(x,y) \,:\, (x',y') \in \vec{\Phi}(x,y)} \theta(x, y). \tag{3.3.31}$$
Figure 3.5: The flow on (x′, y′) is the sum of the flows on $(x_1, y_1)$, $(x_2, y_2)$, and $(x_3, y_3)$.
- Assume first that φ(x), φ(y) ≠ z′ and let (u′, z′), (z′, w′) be the directed edges incident with z′ on $\vec{\Phi}(x, y)$. Observe that, in the definition of θ′, (y, x) contributes θ(y, x) = −θ(x, y) to θ′(z′, u′) and (x, y) contributes θ(x, y) to θ′(z′, w′). So these contributions cancel out in the flow-conservation constraint for z′, that is, in the sum $\sum_{v' : v' \sim z'} \theta'(z', v')$.
- If instead e = {x, y} is such that φ(x) = z′, let (z′, w′) be the first edge on the path $\vec{\Phi}(x, y)$. Edge (x, y) contributes θ(x, y) to θ′(z′, w′). A similar statement applies to φ(y) = z′ by exchanging the roles of x and y. This case also applies to φ(x) = φ(y) = z′.
From the two cases above, summing over all paths visiting z′ gives
$$\sum_{v' : v' \sim z'} \theta'(z', v') = \sum_{z : \varphi(z) = z'} \left( \sum_{v : v \sim z} \theta(z, v) \right).$$
Summing over all pairs and using Condition 1 in Definition 3.3.31 gives
$$\begin{aligned}
\frac{1}{2} \sum_{x',y'} r'(x', y')\,\theta'(x', y')^2 &\leq \beta\, \frac{1}{2} \sum_{x',y'} r'(x', y') \sum_{(x,y) \,:\, (x',y') \in \vec{\Phi}(x,y)} \theta(x, y)^2 \\
&= \beta\, \frac{1}{2} \sum_{x,y} \theta(x, y)^2 \sum_{(x',y') \in \vec{\Phi}(x,y)} r'(x', y') \\
&\leq \alpha\beta\, \frac{1}{2} \sum_{x,y} r(x, y)\,\theta(x, y)^2,
\end{aligned}$$
Other applications
So far we have emphasized applications to recurrence. Here we show that electrical
network theory can also be used to bound commute times. In Section 3.3.5, we give
further applications beyond random walks on graphs.
An application of Corollary 3.1.24 gives another probabilistic interpretation of
the effective resistance—and a useful formula.
Proof. This follows immediately from Corollary 3.1.24 and the definition of the
effective resistance (Definition 3.3.19). Specifically,
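$$\begin{aligned}
E_x[\tau_y] + E_y[\tau_x] &= \frac{1}{\pi_x\, P_x[\tau_y < \tau_x^+]} \\
&= \frac{1}{\left(2 \sum_{e = \{x,y\} \in N} c(e)\right)^{-1} c(x)\, P_x[\tau_y < \tau_x^+]} \\
&= c_N\, \mathscr{R}(x \leftrightarrow y).
\end{aligned}$$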
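The identity at the end of this computation can be verified on a small example. The sketch below assumes NumPy and takes $c_N$ to be twice the sum of the edge conductances, as in the display above; the network and helper names are hypothetical. Hitting times are obtained by solving the usual first-step linear system, and the effective resistance is read off the pseudo-inverse of the Laplacian.

```python
import numpy as np

def laplacian(n, cond):
    L = np.zeros((n, n))
    for (x, y), c in cond.items():
        L[x, x] += c; L[y, y] += c
        L[x, y] -= c; L[y, x] -= c
    return L

def hitting_time(n, cond, x, y):
    """E_x[tau_y] for the walk with P(u, w) = c(u, w) / c(u)."""
    L = laplacian(n, cond)
    P = np.eye(n) - L / np.diag(L)[:, None]        # P = I - D^{-1} L
    others = [u for u in range(n) if u != y]
    A = np.eye(n - 1) - P[np.ix_(others, others)]
    t = np.linalg.solve(A, np.ones(n - 1))         # t(u) = 1 + sum_w P(u, w) t(w), t(y) = 0
    return t[others.index(x)]

def effective_resistance(n, cond, x, y):
    Lp = np.linalg.pinv(laplacian(n, cond))        # Moore-Penrose pseudo-inverse
    return Lp[x, x] + Lp[y, y] - 2 * Lp[x, y]

# Hypothetical network given by conductances.
cond = {(0, 1): 1.0, (0, 2): 0.5, (1, 2): 1.0, (1, 3): 2.0, (2, 3): 1.0}
c_N = 2 * sum(cond.values())
commute = hitting_time(4, cond, 0, 3) + hitting_time(4, cond, 3, 0)
print(commute, c_N * effective_resistance(4, cond, 0, 3))
```

The two printed numbers should agree up to rounding for any pair of distinct vertices.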
Example 3.3.35 (Random walk on the torus). Consider random walk on the d-dimensional torus $\mathbb{L}_n^d$ with unit resistances. We use the commute time identity to lower bound the mean hitting time $E_x[\tau_y]$ for arbitrary vertices x ≠ y at graph distance k on $\mathbb{L}_n^d$. To use the commute time identity (Theorem 3.3.34), note that by symmetry $E_x[\tau_y] = E_y[\tau_x]$ so that
$$E_x[\tau_y] = \frac{1}{2} c_N\, \mathscr{R}(x \leftrightarrow y) = d n^d\, \mathscr{R}(x \leftrightarrow y), \tag{3.3.32}$$
where we used that the number of vertices is $n^d$ and the graph is 2d-regular.
To simplify, assume n is odd and identify the vertices of $\mathbb{L}_n^d$ with the box
(the graph distance between x = 0 and y). Since the $\ell^\infty$ norm is at least 1/d times the $\ell^1$ norm on $\mathbb{L}^d$, there exists J = O(k) such that all $\Pi_j$s, j ≤ J, are cutsets separating x from y. By the Nash-Williams inequality
$$\mathscr{R}(x \leftrightarrow y) \geq \sum_{0 \leq j \leq J} |\Pi_j|^{-1} = \Omega\left( \sum_{0 \leq j \leq J} j^{-(d-1)} \right) = \begin{cases} \Omega(\log k), & d = 2 \\ \Omega(1), & d \geq 3. \end{cases}$$
Claim 3.3.36.
$$E_x[\tau_y] = \begin{cases} \Omega(n^d \log k), & d = 2 \\ \Omega(n^d), & d \geq 3. \end{cases}$$
◀
Remark 3.3.37. The bounds in the previous example are tight up to constants. See [LPW06,
Proposition 10.13]. Note that the case d ≥ 3 does not in fact depend on the distance k.
See Exercise 3.22 for an application of the commute time identity to cover
times.
We prove the theorem for d = 2, 3 using the tools developed in the previous sub-
section. The other cases follow by Rayleigh’s principle (Corollary 3.3.29). There
are elementary proofs of this result. But we showed above that the electrical net-
work approach has the advantage of being robust to the details of the lattice. For a
different argument, see Exercise 2.10.
The case d = 2 follows from the Nash-Williams inequality (Corollary 3.3.26) by letting $\Pi_j$ be the set of edges connecting vertices of $\ell^\infty$ norm j and j + 1. Using the fact that all conductances are 1, that $|\Pi_j| = O(j)$, and that $\sum_j j^{-1}$ diverges, recurrence is established by Theorem 3.3.20.
First proof
Now consider the case d = 3 and let a = 0 be the origin.
We construct a finite-energy flow to ∞ using the method of random paths. Note that a simple way to produce a unit flow to ∞ is to push a flow of 1 through an infinite path (recall that paths are self-avoiding by definition). Taking this a step further, let µ be a probability measure on infinite paths and define the anti-symmetric function
$$\theta(x, y) := E\left[\mathbf{1}_{(x,y) \in \Gamma} - \mathbf{1}_{(y,x) \in \Gamma}\right] = P[(x, y) \in \Gamma] - P[(y, x) \in \Gamma],$$
where Γ is a random path distributed according to µ, oriented away from 0. (We will give an explicit construction below where the appropriate formal probability space will be clear.) Observe that $\sum_{y \sim x} \left[\mathbf{1}_{(x,y) \in \Gamma} - \mathbf{1}_{(y,x) \in \Gamma}\right] = 0$ for any x ≠ 0 because vertices visited by Γ are entered and exited exactly once. That same sum is 1 at x = 0. Hence θ is a unit flow to ∞. For edge e = {x, y}, consider the following “edge marginal” of µ:
$$\mu(e) := P[(x, y) \in \Gamma \text{ or } (y, x) \in \Gamma] = P[(x, y) \in \Gamma] + P[(y, x) \in \Gamma] \geq \theta(x, y),$$
where we used that a path Γ cannot visit both (x, y) and (y, x) by definition. Then we get the following bound.
Claim 3.3.39 (Method of random paths).
$$\mathscr{E}(\theta) \leq \sum_e \mu(e)^2. \tag{3.3.33}$$
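A short derivation of this bound can be written out as follows. It is a sketch that assumes unit resistances, as on the lattice considered here, and uses only the anti-symmetry of θ and the inequality between µ(e) and θ noted above (applied to both orientations of e).

```latex
\begin{align*}
\mathscr{E}(\theta)
  &= \frac{1}{2} \sum_{x,y} \theta(x,y)^2
  && \text{(unit resistances, assumed)} \\
  &= \sum_{e = \{x,y\}} \theta(x,y)^2
  && \text{(since } \theta(y,x)^2 = \theta(x,y)^2\text{)} \\
  &\leq \sum_{e} \mu(e)^2
  && \text{(since } |\theta(x,y)| \leq \mu(e)\text{)}
\end{align*}
```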
That immediately implies a similar bound on the probability that an edge is visited
by Γ. Moreover:
Lemma 3.3.40. There are $O(j^2)$ edges with an endpoint at $\ell^2$ distance within [j, j + 1] from the origin.
Proof. Consider an open ball of $\ell^2$ radius 1/2 centered around each vertex of $\ell^2$ norm within [j, j + 1]. Those balls are non-intersecting and have total volume $\Theta(N_j)$, where $N_j$ is the number of such vertices. On the other hand, the volume of the shell of $\ell^2$ inner and outer radii j − 1/2 and j + 3/2 centered around the origin (where all those balls lie) is
$$\frac{4}{3}\pi(j + 3/2)^3 - \frac{4}{3}\pi(j - 1/2)^3 = O(j^2).$$
Hence $N_j = O(j^2)$. Finally note that each vertex has 6 incident edges.
Transience follows from Theorem 3.3.30. (This argument clearly does not work on $\mathbb{L}$ where there are only two rays. You should convince yourself that it does not work on $\mathbb{L}^2$ either. But see Exercise 3.17.)
Second proof
We briefly describe a second proof based on the independent-coordinate random
walk. Consider the networks N and N′ in Example 3.3.32. Because they are roughly equivalent (Definition 3.3.31), they have the same type by Theorem 3.3.33. Recall that, because the number of returns to 0 is geometric with success probability equal to the escape probability, random walk on N′ is transient if and only if the expected number of visits to 0 is finite (see (3.1.2)). By independence of the coordinates, this expectation can be written as
where we used Stirling’s formula (see Appendix A). The rightmost sum is finite if and only if d ≥ 3. That implies random walk on N′ is transient under that condition. By rough equivalence, the same is true of N.
$$P[e \in T \mid e' \in T] \leq P[e \in T], \qquad \forall e \neq e' \in G.$$
This property is perhaps not surprising. For one, the number of edges in a spanning tree is fixed, so the inclusion of e′ makes it seemingly less likely for other edges to
be present. Yet proving Claim 3.3.41 is not trivial. The proof relies on the electrical
network perspective. The key is a remarkable formula for the inclusion of an edge
in a uniform spanning tree.
Before explaining how this formula arises, we show that it implies Claim 3.3.41.
Proof of Claim 3.3.41. Recall that P[e′ ∈ T] ≠ 0. By the law of total probability,
$$P[e \in T \mid e' \notin T] \geq P[e \in T]. \tag{3.3.34}$$
$$\mathscr{R}_{N'}(x \leftrightarrow y) \geq \mathscr{R}_{N}(x \leftrightarrow y),$$
Let e = {x, y}. To get some insight into Kirchhoff’s resistance formula, we
first note that, if i is the unit current from x to y and v is the associated voltage, by
definition of the effective resistance
$$\mathscr{R}(x \leftrightarrow y) = \frac{v(x)}{\|i\|} = c(e)(v(x) - v(y)) = i(x, y), \tag{3.3.35}$$
where we used Ohm’s law (i.e., (3.3.20)) as well as the fact that c(e) = 1, v(y) = 0, and ‖i‖ = 1. Note that ‖i‖ and i(x, y) are not the same quantity: although ‖i‖ = 1, i(x, y) is only the current along the edge to y. Furthermore by the probabilistic interpretation of the current (Theorem 3.3.17), with Z = {y},
$$i(x, y) = E_x\left[N^Z_{x \to y} - N^Z_{y \to x}\right] = P_x[(x, y) \text{ is traversed before } \tau_y]. \tag{3.3.36}$$
Indeed, started at x, $N^Z_{y \to x} = 0$ and $N^Z_{x \to y} \in \{0, 1\}$. Kirchhoff’s resistance formula is then established by relating the random walk on N to the probability that
e is present in a uniform spanning tree T . To do this we introduce a random-walk-
based algorithm for generating uniform spanning trees. This rather miraculous
procedure, known as Wilson’s method, is of independent interest. (For a classical
connection between random walks and spanning trees, see also Exercise 3.23.)
With a slight abuse, we continue to call a tree T picked at random among all span-
ning trees of G with probability proportional to W (T ) a “uniform” spanning tree
on N .
To state Wilson’s method, we need the notion of loop erasure. Let $P = x_0 \sim \cdots \sim x_k$ be a walk in N. The loop erasure of P is obtained by removing cycles in the order they appear. That is, let $j^*$ be the smallest j such that $x_j = x_\ell$ for some ℓ < j. Remove the subwalk $x_{\ell+1} \sim \cdots \sim x_{j^*}$ from P, and repeat. The result is self-avoiding, that is, a path, and is denoted by LE(P).
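Loop erasure is simple to implement in a single pass over the walk. The following is a minimal sketch (the function name `loop_erase` is ours): it maintains the current self-avoiding path and, whenever the walk returns to a vertex already on that path, discards everything visited since.

```python
def loop_erase(walk):
    """Chronological loop erasure LE(P) of a walk given as a list of vertices."""
    path = []
    position = {}                 # vertex -> index in the current path
    for x in walk:
        if x in position:
            # x was visited before: erase the cycle that just closed.
            keep = position[x] + 1
            for removed in path[keep:]:
                del position[removed]
            del path[keep:]
        else:
            position[x] = len(path)
            path.append(x)
    return path

print(loop_erase(["a", "b", "c", "b", "d"]))
```

Processing the walk one step at a time removes cycles in the order they close, which is the order prescribed by the definition.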
Let $v_0$ be an arbitrary vertex of G, which we refer to as the root, and let $T_0$ be the subtree made up of $v_0$ alone. Starting with the root, order arbitrarily the vertices of G as $v_0, \ldots, v_{n-1}$. Wilson’s method constructs an increasing sequence of subtrees as follows. See Figure 3.6. Let $T := T_0$.
1. Let v be the vertex of G not in T with lowest index. Perform random walk
on N started at v until the first visit to a vertex of T . Let P be the resulting
walk.
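A compact implementation of the procedure is sketched below for unit conductances. Only the first step is listed above, so the remaining steps are an assumption here: we take them to be the standard ones, namely add the loop erasure of the walk to T and repeat until T spans G. The function name `wilson` and the use of a successor map to perform the loop erasure implicitly are our choices.

```python
import random

def wilson(adj, root, rng):
    """Sample a spanning tree of a connected graph with unit conductances.

    adj: dict vertex -> list of neighbors. Returns a set of frozenset edges.
    """
    in_tree = {root}
    parent = {}
    for v in sorted(adj):                 # fixed ordering; lowest index first
        if v in in_tree:
            continue
        # Random walk from v until it hits the tree. Overwriting nxt[x] on
        # each revisit erases loops implicitly: only the last exit survives.
        nxt = {}
        x = v
        while x not in in_tree:
            nxt[x] = rng.choice(adj[x])
            x = nxt[x]
        # Add the loop-erased path to the tree (assumed remaining step).
        x = v
        while x not in in_tree:
            in_tree.add(x)
            parent[x] = nxt[x]
            x = nxt[x]
    return {frozenset((x, y)) for x, y in parent.items()}

# Complete graph on 4 vertices, rooted at 0.
adj = {v: [w for w in range(4) if w != v] for v in range(4)}
print(wilson(adj, 0, random.Random(0)))
```

On the complete graph with n vertices, the empirical frequency of a fixed edge over many samples should be close to 2/n, which matches the effective resistance computed in Example 3.3.24, as Kirchhoff’s resistance formula predicts.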
This claim is far from obvious. Before proving it, we finish the proof of Kirch-
hoff’s resistance formula.
Proof of Theorem 3.3.42. From (3.3.35) and (3.3.36), it suffices to prove that, for
e = {x, y},
Px [(x, y) is traversed before τy ] = P[e ∈ T ],
where the probability on the left-hand side refers to random walk on N with unit
resistances started at x and the probability on the right-hand side refers to a uniform
spanning tree T on N. Generate T using Wilson’s method started at root $v_0 = y$ with the choice $v_1 = x$. If the walk from x to y during the first iteration of Wilson’s method includes (x, y), then the loop erasure is simply x ∼ y and e is in T. On the other hand, if the walk from x to y does not include (x, y), then e cannot be used at a later stage because it would create a cycle. That immediately proves the theorem.
Figure 3.6: An illustration of Wilson’s method. The dashed lines indicate erased loops.
Proof of Claim 3.3.44. The idea of the proof is to cast Wilson’s method in the more
general framework of cycle popping algorithms. We begin by explaining how such
algorithms work.
Let P be the transition matrix corresponding to random walk on N = (G, c) with G = (V, E) and root $v_0$. To each vertex x ≠ $v_0$ in V, we assign an independent stack of “colored directed edges”
where each $Y_j^x$ is chosen independently at random from the distribution P(x, ·). In particular all $Y_j^x$s are neighbors of x in N. The index j in $\langle x, Y_j^x \rangle_j$ is the color of the edge. It keeps track of the position of the edge in the original stack. (Picture $S^x$ as a spring-loaded plate dispenser located on vertex x.)
We consider a process which involves popping edges off the stacks. We use the notation $S^x$ to denote the current stack at x. The initial assignment of the stack is $S^x := S_0^x$ as above. Given the current stacks $(S^x)_x$, we call visible graph the (colored) directed graph over V with edges $\mathrm{Top}(S^x)$ for all x ≠ $v_0$, where $\mathrm{Top}(S^x)$ is the first edge in the current stack $S^x$. The latter are referred to as visible edges. We denote the current visible graph by $\vec{G}$.
Note that $\vec{G}$ has out-degree 1 for all x ≠ $v_0$ and the root has out-degree 0. In particular all undirected cycles in $\vec{G}$ are in fact directed cycles, and we refer to them simply as cycles. (Indeed a set of edges forming an undirected cycle that is not directed must have a vertex of out-degree 2.) Also recall the following characterization from Corollary 1.1.8: an acyclic, undirected subgraph with |V| vertices and |V| − 1 edges is a spanning tree of G. Hence if there is no cycle in $\vec{G}$, then it must be a spanning tree (as an undirected graph) where, furthermore, all edges point towards the root. Such a tree is also known as a spanning arborescence. Once that happens, we are done.
As the name suggests, a cycle popping algorithm proceeds by popping cycles in $\vec{G}$ off the tops of the stacks until a spanning arborescence is produced. That is, at every iteration, if $\vec{G}$ contains at least one cycle, then a cycle $\vec{C}$ is picked according to some rule, the top of each stack in $\vec{C}$ is popped, and a new visible graph $\vec{G}$ is revealed. See Figure 3.7 for an illustration.
Figure 3.7: A realization of a cycle popping algorithm (from top to bottom). In all three figures, the underlying graph is G while the arrows depict the visible edges.
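The cycle popping procedure just described can be sketched in a few lines. The sketch below assumes unit conductances and draws each stack entry lazily at the moment it becomes visible, which has the same distribution as drawing the whole stack in advance because the entries are independent. The function names and the rule used to pick a cycle (the first one found) are our choices.

```python
import random

def cycle_popping(adj, root, rng):
    """Pop cycles off lazily generated stacks until an arborescence remains.

    adj: dict vertex -> list of neighbors (unit conductances assumed).
    Returns top, where top[x] = (y, color) is the visible edge <x, y>_color.
    """
    def draw(x, color):
        return (rng.choice(adj[x]), color)

    top = {x: draw(x, 1) for x in adj if x != root}
    while True:
        cycle = find_cycle(top, root)
        if cycle is None:
            return top
        for x in cycle:                       # pop the top of each stack on the cycle
            top[x] = draw(x, top[x][1] + 1)

def find_cycle(top, root):
    """Return the vertices of some directed cycle in the visible graph, or None."""
    done = {root}                             # vertices known to lead to the root
    for start in top:
        trail, seen = [], {}
        x = start
        while x not in done and x not in seen:
            seen[x] = len(trail)
            trail.append(x)
            x = top[x][0]
        if x in seen:                         # walked back onto the current trail
            return trail[seen[x]:]
        done.update(trail)
    return None

# Complete graph on 5 vertices, rooted at 0.
adj = {v: [w for w in range(5) if w != v] for v in range(5)}
print(cycle_popping(adj, 0, random.Random(0)))
```

In the returned map, following the visible edges from any vertex leads to the root, so the visible graph is a spanning arborescence.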
With these definitions in place, the proof of the claim involves the following
steps.
(ii) The popping order does not matter. We just argued that Wilson’s method is a cycle popping algorithm. In fact we claim that any cycle popping algorithm, that is, no matter what popping choices are made along the way, produces the same final arborescence. To make this precise, we identify the popped cycles uniquely. This is where the colors come in. A colored cycle is a directed cycle over V made of colored edges from the stacks (not necessarily of the same color and not necessarily in the current visible graph). We say that a colored cycle $\vec{C}$ is poppable for a visible graph $\vec{G}$ if there exists a sequence of colored cycles $\vec{C}_1, \ldots, \vec{C}_r = \vec{C}$ that can be popped in that order starting from $\vec{G}$. Note that, by this definition, $\vec{C}_1$ is a cycle in $\vec{G}$. Now we claim that if $\vec{C}'_1$ were popped first instead of $\vec{C}_1$, producing the new visible graph $\vec{G}'$, then $\vec{C}$ would still be poppable for $\vec{G}'$. This claim implies that, in any cycle popping algorithm, either an infinite number of cycles are popped or eventually all poppable cycles are popped, independently of the order, producing the same outcome. (Note that, while the same cycle may be popped more than once, the same colored cycle cannot.)
To prove the claim, note first that if $\vec{C}'_1 = \vec{C}$ or if $\vec{C}'_1$ does not share a vertex with any of $\vec{C}_1, \ldots, \vec{C}_r$ there is nothing to prove. So let $\vec{C}_j$ be the first cycle in the sequence sharing a vertex with $\vec{C}'_1$, say x. Let $\langle x, y \rangle_c$ and $\langle x, y' \rangle_{c'}$ be the colored edges emanating from x in $\vec{C}_j$ and $\vec{C}'_1$ respectively. By definition, x is not on any of $\vec{C}_1, \ldots, \vec{C}_{j-1}$ so the edge originating from x is not popped by that sequence and we must have $\langle x, y \rangle_c = \langle x, y' \rangle_{c'}$ as colored edges. In particular, the vertex y is also a shared vertex of $\vec{C}_j$ and $\vec{C}'_1$, and the same argument applies to it. Proceeding by induction leads to the conclusion that $\vec{C}'_1 = \vec{C}_j$ as colored cycles. But then $\vec{C}$ is clearly poppable for the visible graph resulting from popping $\vec{C}'_1$ first, because it can be popped with the rearranged sequence $\vec{C}'_1 = \vec{C}_j, \vec{C}_1, \ldots, \vec{C}_{j-1}, \vec{C}_{j+1}, \ldots, \vec{C}_r = \vec{C}$, where we used the fact that $\vec{C}'_1$ does not share a vertex with $\vec{C}_1, \ldots, \vec{C}_{j-1}$.
(iii) Termination occurs in finite time almost surely. We have shown so far that, in
any cycle popping algorithm, either an infinite number of cycles are popped
or eventually all poppable cycles are popped. But Wilson’s method—a cycle
popping algorithm as we have shown—stops after a finite amount of time
with probability 1. Indeed, because the network is finite and connected, the
random walk started at each iteration hits the current T in finite time almost
surely (by Lemma 3.1.25). To sum up, all cycle popping algorithms termi-
nate and produce the same spanning arborescence. It remains to compute the
distribution of the outcome.
(iv) The arborescence has the desired distribution. Let A be the spanning arborescence produced by any cycle popping algorithm on the stacks $(S_0^x)$. To compute the distribution of A, we first compute the distribution of a particular cycle popping realization leading to A. Because the popping order does not matter, by “realization” we mean a collection $\mathcal{C}$ of colored cycles together with a final spanning arborescence A. Notice that what lies in the stacks “under” A is not relevant to the realization, that is, the same outcome is produced no matter what is under A.
So, from the distribution of the stacks, the probability of observing $(\mathcal{C}, A)$ is simply the product of the transitions corresponding to the “popped edges” in $\mathcal{C}$ and the “final edges” in A, that is,
$$\prod_{\vec{e} \in \mathcal{C} \cup A} P(\vec{e}) = \Psi(A) \prod_{\vec{C} \in \mathcal{C}} \Psi\big(\vec{C}\big),$$
of possible Cs not to vary with A. But it is clear that any arborescence could
lie under any C.
To see that we are done, let T be the undirected spanning tree corresponding to the outcome, A, of Wilson’s method. Then, because $P(x, y) = \frac{c(x,y)}{c(x)}$, we get
$$\Psi(A) = \frac{W(T)}{\prod_{x \neq v_0} c(x)},$$
where the denominator does not depend on T. So if we forget the orientation of A which is determined by the root (i.e., sum over all choices of root), we get a spanning tree whose distribution is proportional to W(T), as required.
Exercises
Exercise 3.1 (Reflection). Give a rigorous proof of Theorem 3.1.9 through a for-
mal application of the strong Markov property (i.e., specify ft and Ft in Theo-
rem 3.1.8).
Exercise 3.2 (Time of k-th return). Give a rigorous proof of (3.1.1) through a
formal application of the strong Markov property (i.e., specify ft and Ft in Theo-
rem 3.1.8).
Exercise 3.3 (Tightness of Matthews’ bounds). Show that the bounds (3.1.6) and (3.1.7)
are tight up to smaller order terms for the coupon collector problem (Example 2.1.4).
[Hint: State the problem in terms of the cover time of a random walk on the com-
plete graph with self-loops.]
Exercise 3.4 (Pólya’s urn: a surprisingly simple formula). Consider the setting of Example 3.1.49. Prove that
$$P[G_t = m + 1] = \binom{t}{m} \frac{m!\,(t - m)!}{(t + 1)!}.$$
[Hint: Consider the probability of one particular sequence of outcomes producing
the desired event.]
Exercise 3.5 (Optional stopping theorem). Give a rigorous proof of the remaining
cases of the optional stopping theorem (Theorem 3.1.38).
Exercise 3.6 (Supermartingale inequality). Let $(M_t)$ be a nonnegative supermartingale. Show that, for any b > 0,
$$P\left[\sup_{s \geq 0} M_s \geq b\right] \leq \frac{E[M_0]}{b}.$$
(iv) Prove that (iii) implies a variant of the Azuma-Hoeffding inequality (Theo-
rem 3.2.1) for bounded increments.
(v) Show that the random variables in Exercise 2.6 (after centering) do not sat-
isfy (3.3.37) (without using the claim in part (ii) of that exercise).
for n large enough. [Hint: Use the fact that $\|A\|_2 \geq \|A e_1\|_2$ together with Chebyshev’s inequality.]
(i) When p > 1/2, show that there is more than one bounded extension of h to
Z+ \{0} that is harmonic on Z+ \{0}. [Hint: Consider Px [τ0 = +∞].]
(ii) When p ≤ 1/2, show that there exists a unique bounded extension of h to
Z+ \{0} that is harmonic on Z+ \{0}.
where d and d′ are the graph distances on G and G′ respectively, and furthermore all vertices in G′ are within distance β of the image of V. Let N = (G, c) and N′ = (G′, c′) be countable, connected networks with uniformly bounded conductances, resistances and degrees. Prove that if G and G′ are roughly isometric then N and N′ are roughly equivalent. [Hint: Start by proving that being roughly isometric is an equivalence relation.]
Exercise 3.21 (Random walk on the cycle: hitting time). Use the commute time
identity (Theorem 3.3.34) to compute Ex [τy ] in Example 3.3.35 in the case d = 1.
Give a second proof using a direct martingale argument.
Exercise 3.22 (Random walk on the binary tree: cover time). As in Example 3.3.15, let N be the rooted binary tree with n levels $\widehat{\mathbb{T}}_2^n$ and equal conductances on all edges.
(i) Show that the maximal hitting time $E_a[\tau_b]$ is achieved for a and b such that their most recent common ancestor is the root 0. Furthermore argue that in that case $E_a[\tau_b] = E_a[\tau_{a,0}]$, where we recall that $\tau_{a,0}$ is the commute time between a and 0.
(ii) Use the commute time identity (Theorem 3.3.34) and Matthews’ cover time bounds (Theorem 3.1.27) to give an upper bound on the mean cover time of the order of $O(n^2 2^n)$.
Exercise 3.23 (Markov chain tree theorem). Let P be the transition matrix of a finite, irreducible Markov chain with stationary distribution π. Let G be the directed graph corresponding to the positive transitions of P. For an arborescence A of G, define its weight as
$$\Psi(A) = \prod_{\vec{e} \in A} P(\vec{e}).$$
(iii) Prove the Markov chain tree theorem: The stationary distribution π of P is proportional to
$$\pi_x = \sum_{A \,:\, \mathrm{root}(A) = x} \Psi(A).$$
Bibliographic Remarks
Section 3.1 Picking up where Appendix B leaves off, Sections 3.1.1 and 3.1.3
largely follow the textbooks [Wil91] and [Dur10], which contain excellent intro-
ductions to martingales. The latter also covers Markov chains, and includes the
proofs we skipped here. Theorem 3.1.11 is proved in [Dur10, Theorem 4.3.2].
Many more results like Corollary 3.1.24 can be derived from the occupation mea-
sure identity; see for example [AF, Chapter 2]. The upper bound in Theorem 3.1.27
was first proved by Matthews [Mat88].
Section 3.3 Section 3.3.1 is based partly on [Nor98, Sections 4.1-2], [Ebe, Sections 0.3, 1.1-2, 3.1-2], and [Bre17, Sections 7.3, 17.1]. The material in Sections 3.3.2-3.3.5 borrows from [LPW06, Chapters 9, 10], [AF, Chapters 2, 3] and, especially, [LP16, Sections 2.1-2.6, 4.1-4.2, 5.5]. Foster’s theorem (Theorem 3.3.12) is from [Fos53]. The classical reference on potential theory and its probabilistic
counterpart is [Doo01]. For the discrete case and the electrical network point of
view, the book of Doyle and Snell is excellent [DS84]. In particular the series and
parallel laws are defined and illustrated. See also [KSK76]. For an introduction to
convex optimization and duality, see for example [BV04]. The Nash-Williams
inequality is due to Nash-Williams [NW59]. The result in Example 3.3.27 is
due to R. Lyons [Lyo90]. Theorem 3.3.33 is due to Kanai [Kan86]. The commute time identity was proved by Chandra, Raghavan, Ruzzo, Smolensky and Tiwari [CRR+89]. An elementary proof of Pólya’s theorem can be found in [Dur10,
Section 4.2]. The flow we used in the proof of Pólya’s theorem is essentially
due to T. Lyons [Lyo83]. Wilson’s method is due to Wilson [Wil96]. A related
method for generating uniform spanning trees was introduced by Aldous [Ald90]
and Broder [Bro89]. A connection between loop-erased random walks and uni-
form spanning trees had previously been established by Pemantle [Pem91] using
the Aldous-Broder method. For more on negative correlation in uniform spanning
trees, see for example [LP16, Section 4.2]. For a proof of the matrix tree theo-
rem using Wilson’s method, see [KRS]. For a discussion of the running time of
Wilson’s method and other spanning tree generation approaches, see [Wil96].
Chapter 4
Coupling
4.1 Background
We begin with some background on coupling. After defining the concept formally
and giving a few simple examples, we derive the coupling inequality, which pro-
vides a fundamental approach to bounding the distance between two distributions.
Version: December 20, 2023
Modern Discrete Probability: An Essential Toolkit
Copyright © 2023 Sébastien Roch
CHAPTER 4. COUPLING 235
\[ \lim_{t} \mathbb{P}[Y_t \neq Z_t] = 0, \]
So h(y) = h(z).
Proof. From (3.3.2), $h$ is harmonic with respect to random walk on $\mathbb{L}^d$ if and only
if it is harmonic with respect to lazy random walk (Definition 1.1.31), that is, the
walk that stays put with probability 1/2 at every step. Let $\mathbb{P}_y$ and $\mathbb{P}_z$ be the laws of
lazy random walk on $\mathbb{L}^d$ started at $y$ and $z$ respectively. We construct a coupling
$((Y_t), (Z_t)) = ((Y_t^{(i)})_{i \in [d]}, (Z_t^{(i)})_{i \in [d]})$ of $\mathbb{P}_y$ and $\mathbb{P}_z$ as follows: at time $t$, pick a
coordinate $I \in [d]$ uniformly at random, then
• if $Y_t^{(I)} = Z_t^{(I)}$ then do nothing with probability 1/2 and otherwise pick
$W \in \{-1, +1\}$ uniformly at random, set $Y_{t+1}^{(I)} = Z_{t+1}^{(I)} := Z_t^{(I)} + W$ and
leave the other coordinates unchanged;
• if instead $Y_t^{(I)} \neq Z_t^{(I)}$, pick $W \in \{-1, +1\}$ uniformly at random, and with
probability 1/2 set $Y_{t+1}^{(I)} := Y_t^{(I)} + W$ and leave $Z_t$ and the other coordinates
of $Y_t$ unchanged, or otherwise set $Z_{t+1}^{(I)} := Z_t^{(I)} + W$ and leave $Y_t$ and the
other coordinates of $Z_t$ unchanged.
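The two-case update rule above can be simulated directly. The following is a minimal sketch, not part of the original construction: the names `coupled_step` and `run_coupling` are hypothetical, and points of $\mathbb{L}^d$ are represented as integer tuples. Each marginal moves like a lazy walk, and a coordinate that agrees keeps agreeing.

```python
import random

def coupled_step(y, z, rng):
    """One step of the coupling of two lazy walks on Z^d (hypothetical helper)."""
    y, z = list(y), list(z)
    i = rng.randrange(len(y))      # coordinate I, uniform in [d]
    w = rng.choice((-1, 1))        # increment W, uniform in {-1, +1}
    if y[i] == z[i]:
        # Coordinates agree: stay put w.p. 1/2, otherwise move together.
        if rng.random() < 0.5:
            y[i] += w
            z[i] += w
    elif rng.random() < 0.5:
        y[i] += w                  # only Y moves in coordinate I
    else:
        z[i] += w                  # only Z moves in coordinate I
    return tuple(y), tuple(z)

def run_coupling(y0, z0, steps, seed=0):
    """Return the list of coupled states (Y_t, Z_t) for t = 0, ..., steps."""
    rng = random.Random(seed)
    path = [(tuple(y0), tuple(z0))]
    for _ in range(steps):
        path.append(coupled_step(*path[-1], rng))
    return path
```

Because only one of the two walks moves when the chosen coordinates differ, the difference in that coordinate changes by one unit at a time, so parity never prevents the coordinates from meeting; this is the reason for passing to the lazy walk.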
as desired.
so h is harmonic on all of Td .
Let $b \neq a$ be a neighbor of the root. The key to the proof is the following
lemma.
Lemma 4.1.8.
\[ q := \mathbb{P}_a[\tau_\rho = +\infty] = \mathbb{P}_b[\tau_\rho = +\infty] > 0. \]
Proof. The equality of the two probabilities follows by symmetry. To see that
$q > 0$, let $(Z_t)$ be simple random walk on $\mathbb{T}_d$ started at $a$ until the walk hits $\rho$ and
let $L_t$ be the graph distance between $Z_t$ and the root. Then $(L_t)$ is a biased random
walk on $\mathbb{Z}$ started at 1 jumping to the right with probability $1 - \frac{1}{d}$ and jumping to
the left with probability $\frac{1}{d}$. The probability that $(L_t)$ hits 0 in finite time is $< 1$
because $1 - \frac{1}{d} > \frac{1}{2}$ when $d \ge 3$ by the gambler's ruin (Example 3.1.43).
Note that
1
h(ρ) ≤ 1 − (1 − q) < 1.
d
Indeed, if on the first step the random walk started at $\rho$ moves away from $a$, an
event of probability $1 - \frac{1}{d}$, then it must come back to $\rho$ in finite time to reach $T_a$.
Similarly, by the strong Markov property (Theorem 3.1.8),
\[ h(a) = q + (1 - q) h(\rho). \]
Since $h(\rho) \neq 1$ and $q > 0$, this shows that $h(a) > h(\rho)$. So $h$ is not constant.
Total variation distance Let µ and ν be probability measures on (S, S). Recall
the definition of the total variation distance
As the next lemma shows in the countable case, the total variation distance can
be thought of as an $\ell^1$ distance on probability measures as vectors (up to a constant
factor).
Similarly, we have
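The two bounds above are equal so $|\mu(A) - \nu(A)| \le \mu(E_*) - \nu(E_*)$. Equality is
achieved when $A = E_*$. Also
\begin{align*}
\mu(E_*) - \nu(E_*) &= \frac{1}{2}\left[\mu(E_*) - \nu(E_*) + \nu(E_*^c) - \mu(E_*^c)\right] \\
&= \frac{1}{2} \sum_{x \in S} |\mu(x) - \nu(x)|.
\end{align*}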
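On a small finite space the identity between the supremum over events and half the $\ell^1$ distance can be checked by brute force. The sketch below is illustrative only; the function names are hypothetical and measures are represented as dictionaries of `Fraction` values over a common set of states.

```python
from fractions import Fraction
from itertools import combinations

def tv_half_l1(mu, nu):
    """Half the l^1 distance between two measures on the same finite space."""
    return sum(abs(mu[x] - nu[x]) for x in mu) / 2

def tv_sup_over_events(mu, nu):
    """Supremum of |mu(A) - nu(A)| over all events A (exponential in |S|)."""
    states = list(mu)
    best = Fraction(0)
    for r in range(len(states) + 1):
        for event in combinations(states, r):
            gap = abs(sum(mu[x] for x in event) - sum(nu[x] for x in event))
            best = max(best, gap)
    return best
```

For instance, with masses (1/2, 1/3, 1/6) and (1/4, 1/4, 1/2) on three points, both functions return 1/3, attained on the event formed by the first two points.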
\[ \|\mu - \nu\|_{TV} \le \mathbb{P}[X \neq Y]. \]
Lemma 4.1.13 (Maximal coupling). Assume $S$ is finite and let $\mathcal{S} = 2^S$. Let $\mu$ and
$\nu$ be probability measures on $(S, \mathcal{S})$. Then,
Assume $p > 0$ (otherwise there is nothing to prove). First, two lemmas. See
Figure 4.1 for a proof by picture.
Lemma 4.1.14.
\[ \sum_{x \in S} \mu(x) \wedge \nu(x) = 1 - \|\mu - \nu\|_{TV}. \]
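Proof. We have
\begin{align*}
2\|\mu - \nu\|_{TV} &= \sum_{x \in S} |\mu(x) - \nu(x)| \\
&= \sum_{x \in A} [\mu(x) - \nu(x)] + \sum_{x \in B} [\nu(x) - \mu(x)] \\
&= \sum_{x \in A} \mu(x) + \sum_{x \in B} \nu(x) - \sum_{x \in S} \mu(x) \wedge \nu(x) \\
&= 2 - \sum_{x \in B} \mu(x) - \sum_{x \in A} \nu(x) - \sum_{x \in S} \mu(x) \wedge \nu(x) \\
&= 2 - 2 \sum_{x \in S} \mu(x) \wedge \nu(x),
\end{align*}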
where we used that both µ and ν sum to 1. Rearranging gives the claim.
Lemma 4.1.15.
\[ \sum_{x \in A} [\mu(x) - \nu(x)] = \sum_{x \in B} [\nu(x) - \mu(x)] = \|\mu - \nu\|_{TV} = 1 - p. \]
Proof. The first equality is immediate by the fact that µ and ν are probability mea-
sures. The second equality follows from the first one together with the second line
in the proof of the previous lemma. The last equality is a restatement of the last
lemma.
\[ \gamma_A(x) := \frac{\mu(x) - \nu(x)}{1 - p}, \quad x \in A, \]
and, independently, pick $Y$ from
\[ \gamma_B(x) := \frac{\nu(x) - \mu(x)}{1 - p}, \quad x \in B. \]
Note that $X \neq Y$ in that case because $A$ and $B$ are disjoint.
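To make the construction concrete, here is a small sketch that builds the joint law of a maximal coupling on a finite space. It assumes the standard recipe: with probability $p = \sum_x \mu(x) \wedge \nu(x)$ set $X = Y$ with law proportional to $\mu \wedge \nu$, and otherwise draw $X$ from $\gamma_A$ and $Y$ from $\gamma_B$ independently. The function name and the dictionary representation are hypothetical.

```python
from fractions import Fraction

def maximal_coupling(mu, nu):
    """Joint law {(x, y): probability} of a maximal coupling of mu and nu."""
    states = list(mu)
    overlap = {x: min(mu[x], nu[x]) for x in states}
    p = sum(overlap.values())
    # With probability p, X = Y is drawn proportionally to mu ^ nu.
    joint = {(x, x): overlap[x] for x in states if overlap[x] > 0}
    if p < 1:
        # gamma_A, gamma_B: normalized excess of mu over nu, and of nu over mu.
        gamma_a = {x: (mu[x] - nu[x]) / (1 - p) for x in states if mu[x] > nu[x]}
        gamma_b = {y: (nu[y] - mu[y]) / (1 - p) for y in states if nu[y] > mu[y]}
        for x, ga in gamma_a.items():
            for y, gb in gamma_b.items():
                joint[(x, y)] = (1 - p) * ga * gb
    return joint
```

Summing the off-diagonal entries gives $\mathbb{P}[X \neq Y] = 1 - p = \|\mu - \nu\|_{TV}$, so the coupling inequality holds with equality.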
and for x ∈ B,
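\[ A := \{0\}, \quad B := \{1\}, \]
\[ p := \sum_{x} \mu(x) \wedge \nu(x) = (1 - r) + q, \qquad 1 - p = \alpha = \beta := r - q, \]
\[ (\gamma_{\min}(x))_{x = 0,1} = \left( \frac{1 - r}{(1 - r) + q}, \frac{q}{(1 - r) + q} \right), \]
\[ \gamma_A(0) := 1, \quad \gamma_B(1) := 1. \]
The law of the maximal coupling $(X''', Y''')$ is given by
\[
\left( \mathbb{P}[(X''', Y''') = (i, j)] \right)_{i,j \in \{0,1\}}
= \begin{pmatrix} p\,\gamma_{\min}(0) & (1 - p)\,\gamma_A(0)\gamma_B(1) \\ 0 & p\,\gamma_{\min}(1) \end{pmatrix}
= \begin{pmatrix} 1 - r & r - q \\ 0 & q \end{pmatrix}.
\]
Notice that it happens to coincide with the monotone coupling. ◁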
$(X_i', W_i')$ is indeed a coupling of $X_i$, $W_i$. Let $S_n' := \sum_{i \le n} X_i'$ and $Z_n' := \sum_{i \le n} W_i'$.
Then $(S_n', Z_n')$ is a coupling of $S_n$, $Z_n$. By the coupling inequality
Mappings reduce the total variation distance The following lemma will be
useful.
Proof. From the definition of the total variation distance, we seek to bound
Since $h^{-1}(A') \in \mathcal{S}$ by the measurability of $h$, this last expression is less than
or equal to
\[ \sup_{A \in \mathcal{S}} |\mathbb{P}[X \in A] - \mathbb{P}[Y \in A]|, \]
We will give many examples throughout this chapter. See also Example 4.2.14 for
an example of a coupling of Markov chains that is not Markovian.
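Proof. Note that the degrees $D_i(n)$, $i \in [n]$, are identically distributed (but not
independent) so
\[ \mathbb{E}_{n,p_n}\left[\frac{1}{n} N_d(n)\right] = \mathbb{P}_{n,p_n}[D_1(n) = d]. \]
Moreover, by definition, $D_1(n) \sim \mathrm{Bin}(n - 1, p_n)$. Let $S_n \sim \mathrm{Bin}(n, p_n)$ and
$Z_n \sim \mathrm{Poi}(\lambda)$. Using the Poisson approximation (Theorem 4.1.18) and a Taylor
expansion,
\begin{align*}
\|\mu_{S_n} - \mu_{Z_n}\|_{TV} &\le \frac{1}{2} \sum_{i \le n} (-\log(1 - p_n))^2 \\
&= \frac{1}{2} \sum_{i \le n} \left( \frac{\lambda}{n} + O(n^{-2}) \right)^2 \\
&= \frac{\lambda^2}{2n} + O(n^{-2}).
\end{align*}
We can further couple $D_1(n)$ and $S_n$ as
\[ \left( \sum_{i \le n-1} X_i, \ \sum_{i \le n} X_i \right), \]
where the $X_i$s are i.i.d. $\mathrm{Ber}(p_n)$, that is, Bernoulli with parameter $p_n$. By the
coupling inequality (Theorem 4.1.11),
\[
\|\mu_{D_1(n)} - \mu_{S_n}\|_{TV} \le \mathbb{P}\left[ \sum_{i \le n-1} X_i \neq \sum_{i \le n} X_i \right] = \mathbb{P}[X_n = 1] = p_n = \frac{\lambda}{n}.
\]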
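By the triangle inequality for the total variation distance (Lemma 4.1.10) and
the bounds above,
\begin{align*}
\frac{1}{2} \sum_{d \ge 0} \left| \mathbb{P}_{n,p_n}[D_1(n) = d] - f_d \right| &= \|\mu_{D_1(n)} - \mu_{Z_n}\|_{TV} \\
&\le \|\mu_{D_1(n)} - \mu_{S_n}\|_{TV} + \|\mu_{S_n} - \mu_{Z_n}\|_{TV} \\
&\le \frac{\lambda + \lambda^2/2}{n} + O(n^{-2}).
\end{align*}
Therefore, for all $d$,
\[ \left| \mathbb{E}_{n,p_n}\left[\frac{1}{n} N_d(n)\right] - f_d \right| \le \frac{2\lambda + \lambda^2}{n} + O(n^{-2}) \to 0, \]
as $n \to +\infty$.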
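The $1/n$ decay can be observed numerically. The sketch below (hypothetical function name) computes the total variation distance between $\mathrm{Bin}(n-1, \lambda/n)$ and $\mathrm{Poi}(\lambda)$ using the recursions for the two probability mass functions, which avoids overflow for large $n$. It assumes $n > \lambda$ so that $p_n < 1$.

```python
from math import exp

def tv_degree_vs_poisson(n, lam):
    """Total variation distance between Bin(n - 1, lam / n) and Poi(lam)."""
    p = lam / n
    binom = (1 - p) ** (n - 1)       # P[Bin(n - 1, p) = 0]
    poisson = exp(-lam)              # P[Poi(lam) = 0]
    half_l1, poisson_seen = 0.0, 0.0
    for d in range(n):               # Bin(n - 1, p) lives on {0, ..., n - 1}
        half_l1 += abs(binom - poisson) / 2
        poisson_seen += poisson
        # Move both mass functions from d to d + 1.
        binom *= (n - 1 - d) / (d + 1) * p / (1 - p)
        poisson *= lam / (d + 1)
    # Poisson mass beyond n - 1 is unmatched by the binomial.
    return half_l1 + max(0.0, 1.0 - poisson_seen) / 2
```

The value returned can be compared with the leading term $(\lambda + \lambda^2/2)/n$ of the bound above; since that term omits the $O(n^{-2})$ correction, the comparison is only indicative. A cruder bound with no remainder term, $(\lambda + \lambda^2)/n$, follows from the same coupling step combined with Le Cam's inequality.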
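\[
\mathbb{P}_{n,p_n}\left[ \left| \frac{1}{n} N_d(n) - \mathbb{E}_{n,p_n}\left[\frac{1}{n} N_d(n)\right] \right| \ge \varepsilon \right] \le \frac{\mathrm{Var}_{n,p_n}\left[\frac{1}{n} N_d(n)\right]}{\varepsilon^2}. \tag{4.1.1}
\]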
between 1 and 2 from those of edges to other vertices, we see that the joint degrees
(D1 (n), D2 (n)) have the same distribution as (X1 + Y1 , X1 + Y2 ). So the term in
curly bracket above is equal to
\[ \mathbb{P}[X_i \ge 1] \ge p, \]
and let $S = \sum_{i=1}^n X_i$ be their sum. Now consider a separate random variable
\[ S_* \sim \mathrm{Bin}(n, p). \]
It is intuitively clear that one should be able to bound S from below by analyzing S∗
instead—which may be considerably easier. Indeed, in some sense, S “dominates”
S∗ , that is, S should have a tendency to be bigger than S∗ . One expects more
specifically that
P[S > x] ≥ P[S∗ > x].
Coupling provides a formal characterization of this notion, as we detail in this
section.
In particular we study an important special case known as positive associations.
Here a measure “dominates itself” in the following sense: conditioning on certain
events makes other events more likely. That concept is formalized in Section 4.2.3.
Figure 4.2: The law of $X$, represented here by its cumulative distribution function
$F_X$ in solid, stochastically dominates the law of $Y$, in dashed. The construction of
a monotone coupling, $(\hat{X}, \hat{Y}) := (F_X^{-1}(U), F_Y^{-1}(U))$ where $U$ is uniform in [0, 1],
is also depicted.
4.2.1 Definitions
We start with the simpler case of real random variables then consider partially
ordered sets, a natural setting for this concept.
Example 4.2.2 (Bernoulli vs. Poisson). Let X ∼ Poi(λ) be Poisson with mean
λ > 0 and let Y be a Bernoulli trial with success probability p ∈ (0, 1). In order
P[Y > x] = P[Ŷ > x] = P[X̂ ≥ Ŷ > x] ≤ P[X̂ > x] = P[X > x].
For the other direction, define the cumulative distribution functions $F_X(x) =
\mathbb{P}[X \le x]$ and $F_Y(x) = \mathbb{P}[Y \le x]$. Assume $X \succeq Y$. The idea of the proof
is to use the following standard way of generating a real random variable (see
Theorem B.2.7)
\[ X \overset{d}{=} F_X^{-1}(U), \tag{4.2.2} \]
where $U$ is a [0, 1]-valued uniform random variable and
\[ F_X^{-1}(u) := \inf\{x \in \mathbb{R} : F_X(x) \ge u\}, \]
is a generalized inverse. It is natural to construct a coupling of $X$ and $Y$ by simply
using the same uniform random variable $U$ in this representation, that is, we define
$\hat{X} = F_X^{-1}(U)$ and $\hat{Y} = F_Y^{-1}(U)$. See Figure 4.2. By (4.2.2), this is a coupling
of $X$ and $Y$. It remains to check (4.2.1). Because $F_X(x) \le F_Y(x)$ for all $x$ by
definition of stochastic domination, by the definition of the generalized inverse,
\[ \mathbb{P}[\hat{X} \ge \hat{Y}] = \mathbb{P}[F_X^{-1}(U) \ge F_Y^{-1}(U)] = 1, \]
as required.
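For laws with finite support, the quantile construction used in this proof is easy to implement. The sketch below is a minimal illustration under that assumption; the function names are hypothetical and `values` must be sorted in increasing order.

```python
import random
from bisect import bisect_left
from itertools import accumulate

def generalized_inverse(values, probs):
    """Return u -> inf{x : F(x) >= u} for a law on the sorted list `values`."""
    cdf = list(accumulate(probs))
    def inverse(u):
        k = bisect_left(cdf, u)                   # first index with cdf[k] >= u
        return values[min(k, len(values) - 1)]    # guard against rounding in cdf[-1]
    return inverse

def monotone_coupling_sample(inv_x, inv_y, rng):
    """Sample (X-hat, Y-hat) by feeding the same uniform to both inverses."""
    u = rng.random()
    return inv_x(u), inv_y(u)
```

When $F_X \le F_Y$ pointwise, every sample satisfies $\hat{X} \ge \hat{Y}$, since the first index at which the smaller distribution function reaches level $u$ cannot come earlier.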
Example 4.2.4. Returning to the example in the first paragraph of Section 4.2,
let $(X_i)_{i=1}^n$ be independent $\mathbb{Z}_+$-valued random variables with $\mathbb{P}[X_i \ge 1] \ge p$ and
consider their sum $S := \sum_{i=1}^n X_i$. Further let $S_* \sim \mathrm{Bin}(n, p)$. Write $S_*$ as the sum
$\sum_{i=1}^n Y_i$ where $(Y_i)$ are independent Bernoulli variables with $\mathbb{P}[Y_i = 1] = p$. To
couple $S$ and $S_*$, first set $(\hat{Y}_i) := (Y_i)$ and $\hat{S}_* := \sum_{i=1}^n \hat{Y}_i$. Let $\hat{X}_i$ be 0 whenever
$\hat{Y}_i = 0$. Otherwise (i.e., if $\hat{Y}_i = 1$), generate $\hat{X}_i$ according to the distribution of
$X_i$ conditioned on $\{X_i \ge 1\}$, independently of everything else. By construction
$\hat{X}_i \ge \hat{Y}_i$ almost surely for all $i$ and as a result $\sum_{i=1}^n \hat{X}_i =: \hat{S} \ge \hat{S}_*$ almost surely,
or $S \succeq S_*$ by Theorem 4.2.3. That implies for instance that $\mathbb{P}[S > x] \ge \mathbb{P}[S_* > x]$
as we claimed earlier. A slight modification of this argument gives the following
useful fact about binomials
Corollary 4.2.6. Let $X$ and $Y$ be real random variables with $X \succeq Y$ and let
$f : \mathbb{R} \to \mathbb{R}$ be a non-decreasing function. Then $f(X) \succeq f(Y)$ and furthermore,
provided $\mathbb{E}|f(X)|, \mathbb{E}|f(Y)| < +\infty$, we have that
Proof. Let (X̂, Ŷ ) be the monotone coupling of X and Y whose existence is guar-
anteed by Theorem 4.2.3. Then f (X̂) ≥ f (Ŷ ) almost surely so that, provided the
expectations exist,
\[ X_1 + X_2 \succeq Y_1 + Y_2. \]
Proof. Let $(\hat{X}_1, \hat{Y}_1)$ and $(\hat{X}_2, \hat{Y}_2)$ be independent, monotone couplings of $X_1$, $Y_1$
and $X_2$, $Y_2$ on the same probability space. Then
\[ X_1 + X_2 \overset{d}{=} \hat{X}_1 + \hat{X}_2 \ge \hat{Y}_1 + \hat{Y}_2 \overset{d}{=} Y_1 + Y_2. \]
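The following special case will be useful later. Let $0 < \Lambda < 1$ and let $m$ be a
positive integer. Then
\[
\frac{\Lambda}{m-1} \ge \frac{\Lambda}{m - \Lambda} = \frac{m}{m - \Lambda} - 1 \ge \log\left(\frac{m}{m - \Lambda}\right) = -\log\left(1 - \frac{\Lambda}{m}\right),
\]
where we used that $\log x \le x - 1$ for all $x \in \mathbb{R}_+$ (see Exercise 1.16). So, setting
$\lambda := \frac{\Lambda}{m-1}$, $p := \frac{\Lambda}{m}$ and $n := m - 1$ in (4.2.3), we get
\[ \Lambda \in (0, 1) \implies \mathrm{Poi}(\Lambda) \succeq \mathrm{Bin}\left(m - 1, \frac{\Lambda}{m}\right). \tag{4.2.4} \]
◁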
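The domination in (4.2.4) can be checked numerically by comparing distribution functions, since $\mathbb{P}[\mathrm{Poi} > x] \ge \mathbb{P}[\mathrm{Bin} > x]$ is the same as the Poisson distribution function lying below the binomial one. The following sketch is illustrative only; the function name and the floating-point tolerance `tol` are assumptions.

```python
from math import comb, exp

def poisson_dominates_binomial(Lambda, m, tol=1e-12):
    """Check P[Poi(Lambda) > x] >= P[Bin(m - 1, Lambda / m) > x] for all x."""
    n, p = m - 1, Lambda / m
    pois, pois_cdf, bin_cdf = exp(-Lambda), 0.0, 0.0
    for x in range(n + 1):
        pois_cdf += pois
        bin_cdf += comb(n, x) * p**x * (1 - p) ** (n - x)
        if pois_cdf > bin_cdf + tol:   # Poisson tail would be strictly smaller
            return False
        pois *= Lambda / (x + 1)       # Poisson mass at x + 1
    return True                        # beyond n the binomial tail is zero
```

Only the range $x \le m - 1$ needs checking because the binomial has no mass beyond it.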
where C0 is the open cluster of the origin. We argued in Example 4.1.3 (see also
Section 2.2.4) that θ(p) is nondecreasing by considering the following alternative
representation of the percolation process under $\mathbb{P}_p$: to each edge $e$, assign a uniform
[0, 1]-valued random variable $U_e$ and declare the edge open if $U_e \le p$. Using the
same $U_e$s for two different values of $p$, say $p_1 < p_2$, gives a monotone coupling of
the processes for $p_1$ and $p_2$. ◁
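The shared-uniform representation translates into a few lines of code. This sketch (hypothetical function name, arbitrary finite edge list) returns the two open-edge sets built from the same uniforms.

```python
import random

def coupled_percolation(edges, p1, p2, seed=0):
    """Open-edge sets at densities p1 and p2 built from one uniform per edge."""
    rng = random.Random(seed)
    u = {e: rng.random() for e in edges}        # U_e, shared by both processes
    open1 = {e for e in edges if u[e] <= p1}
    open2 = {e for e in edges if u[e] <= p2}
    return open1, open2
```

When $p_1 < p_2$, every edge open at density $p_1$ is also open at density $p_2$, so the open cluster of any vertex can only grow.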
The existence of a monotone coupling is perhaps more surprising for posets.
We prove the result in the finite case only, which will be enough for our purposes.
Theorem 4.2.11 (Strassen's theorem). Let $X$ and $Y$ be random variables taking
values in a finite poset $(\mathcal{X}, \le)$ with the σ-algebra $\mathcal{F} = 2^{\mathcal{X}}$. Then $X \succeq Y$ if and
only if there exists a monotone coupling $(\hat{X}, \hat{Y})$ of $X$ and $Y$.
Proof. Suppose there is such a coupling. Then for all increasing A
The proof in the other direction relies on the max-flow min-cut theorem (Theo-
rem 1.1.15). To see the connection with flows, let µX and µY be the laws of X and
$Y$ respectively, and denote by $\nu$ their joint distribution under the desired coupling.
Noting that we want $\nu(x, y) > 0$ only if $x \ge y$, the marginal conditions on the
coupling read
\[ \sum_{y : x \ge y} \nu(x, y) = \mu_X(x), \quad \forall x \in \mathcal{X}, \tag{4.2.5} \]
and
\[ \sum_{x : x \ge y} \nu(x, y) = \mu_Y(y), \quad \forall y \in \mathcal{X}. \tag{4.2.6} \]
\[ A_* = \{x \in \mathcal{X} : \exists y \in Z_*^c, \ x \ge y\}, \]
where the term in parentheses is nonnegative by assumption and the fact that $A_*$ is
increasing. That concludes the proof.
Remark 4.2.12. Strassen’s theorem (Theorem 4.2.11) holds more generally on Polish
spaces with a closed partial order. See, e.g., [Lin02, Section IV.1.2] for the details.
Example 4.2.14 (Lazier chain). Consider a random walk $(X_t)$ on the network
$\mathcal{N} = ((V, E), c)$ where $V = \{0, 1, \ldots, n\}$ and $i \sim j$ if and only if $|i - j| \le 1$
(including self-loops). Let $\mathcal{N}' = ((V, E), c')$ be a modified version of $\mathcal{N}$ on the
same graph where, for all $i$, $c(i, i) \le c'(i, i)$. That is, if $(X_t')$ is random walk on
$\mathcal{N}'$, then $(X_t')$ is "lazier" than $(X_t)$ in that it is more likely to stay put. To simplify
the calculations, assume $c(i, i) = 0$ for all $i$.
Assume that both $(X_t)$ and $(X_t')$ start at $i_0$ and define $M_s := \max_{t \le s} X_t$ and
$M_s' := \max_{t \le s} X_t'$. Since $(X_t')$ "travels less" than $(X_t)$ the following claim is
intuitively obvious:
Claim 4.2.15.
\[ M_s \succeq M_s'. \]
We prove this by producing a monotone coupling. First set $(\hat{X}_t)_{t \in \mathbb{Z}_+} := (X_t)_{t \in \mathbb{Z}_+}$.
We then generate $(\hat{X}_t')_{t \in \mathbb{Z}_+}$ as a "sticky" version of $(\hat{X}_t)_{t \in \mathbb{Z}_+}$. That is, $(\hat{X}_t')$ follows
exactly the same transitions as $(\hat{X}_t)$ (including the self-loops), but at each time it
opts to stay where it currently is, say state $j$, for an extra time step with probability
\[ \alpha_j := \frac{c'(j, j)}{\sum_{i : i \sim j} c'(i, j)}, \]
where, on the second line, we used that $c'(i, j) = c(i, j)$ for $i \neq j$ and $i \sim j$. This
coupling satisfies almost surely
\[ \hat{M}_s := \max_{t \le s} \hat{X}_t \ge \max_{t \le s} \hat{X}_t' =: \hat{M}_s' \]
because $(\hat{X}_t')_{t \le s}$ visits a subset of the states visited by $(\hat{X}_t)_{t \le s}$. In other words
$(\hat{M}_s, \hat{M}_s')$ is a monotone coupling of $(M_s, M_s')$ and this proves the claim. ◁
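The sticky construction can be simulated as follows. This is a sketch under simplifying assumptions that are not in the text: unit conductances on the edges of the path $\{0, \ldots, n\}$, no self-loops for the fast chain, and a hypothetical list `loops` with `loops[j]` playing the role of $c'(j, j)$.

```python
import random

def lazier_chain_coupling(n, loops, start, steps, seed=0):
    """Simulate (X_t) and its sticky copy (X'_t) on the path {0, ..., n}."""
    rng = random.Random(seed)

    def neighbors(j):
        return [k for k in (j - 1, j + 1) if 0 <= k <= n]

    def alpha(j):
        # Holding probability c'(j, j) / sum_i c'(i, j), with unit edge weights.
        return loops[j] / (len(neighbors(j)) + loops[j])

    xs, xps = [start], [start]
    moves_used = 0   # transitions of (X_t) already replayed by (X'_t)
    for _ in range(steps):
        xs.append(rng.choice(neighbors(xs[-1])))
        if rng.random() < alpha(xps[-1]):
            xps.append(xps[-1])            # sticky step: stay put
        else:
            moves_used += 1
            xps.append(xs[moves_used])     # replay the next transition of (X_t)
    return xs, xps
```

At every time $s$ the sticky path has only visited states of the form $X_k$ with $k \le s$, which is why its running maximum never exceeds that of the fast chain.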
The analogue of Strassen’s theorem in this case is the following theorem, which
we prove in the finite case only again.
Theorem 4.2.17. Let (Xt )t∈Z+ and (Yt )t∈Z+ be Markov chains on a finite poset
(X , ≤) with transition matrices P and Q respectively. Assume that Q stochas-
tically dominates P . Then for all x0 ≤ y0 there is a coupling (X̂t , Ŷt ) of (Xt )
started at x0 and (Yt ) started at y0 such that almost surely
Furthermore, if the chains are irreducible and have stationary distributions $\pi$ and
$\mu$ respectively, then $\pi \preceq \mu$.
W := {(x, y) ∈ X × X : x ≤ y}.
For all (x, y) ∈ W, let R((x, y), ·) be the joint law of a monotone coupling of
P (x, ·) and Q(y, ·). Such a coupling exists by Strassen’s theorem and Condi-
tion (4.2.7). Let (X̂t , Ŷt ) be a Markov chain on W with transition matrix R started
at (x0 , y0 ). By construction, X̂t ≤ Ŷt for all t almost surely. That proves the first
half of the theorem.
For the second half, let A be increasing on X . Note that the first half implies
that for all s ≥ 1
because $\hat{X}_s \le \hat{Y}_s$ and $A$ is increasing. Then, by a standard convergence result for
irreducible Markov chains (i.e., (1.1.5)),
\[
\pi(A) = \lim_{t \to +\infty} \frac{1}{t} \sum_{s \le t} P^s(x_0, A) \le \lim_{t \to +\infty} \frac{1}{t} \sum_{s \le t} Q^s(y_0, A) = \mu(A).
\]
Intuitively, because the ferromagnetic Ising model favors spin agreement, the all-
(+1) boundary condition tends to produce more +1s which in turn makes increas-
ing events more likely. And vice versa.
The idea of the proof is to use Theorem 4.2.17 with a suitable choice of Markov
chain.
Stochastic domination Recall that, in this context, vertices are often referred to
as sites. Adapting Definition 1.2.8, we consider the single-site Glauber dynamics,
which is the Markov chain on $\mathcal{X}$ which, at each time, selects a site $i \in \Lambda$ uniformly
at random and updates the spin $\sigma_i$ according to $\mu^{\xi}_{\beta,\Lambda}(\sigma)$ conditioned on agreeing
with $\sigma$ at all sites in $\Lambda \setminus \{i\}$. Specifically, for $\gamma \in \{-1, +1\}$, $i \in \Lambda$, and $\sigma \in \mathcal{X}$, let
$\sigma^{i,\gamma}$ be the configuration $\sigma$ with the state at $i$ being set to $\gamma$. Then, letting $n = |\Lambda|$,
the transition matrix of the Glauber dynamics is
\[
Q^{\xi}_{\beta,\Lambda}(\sigma, \sigma^{i,\gamma}) := \frac{1}{n} \cdot \frac{e^{\gamma \beta S_i^{\xi}(\sigma)}}{e^{-\beta S_i^{\xi}(\sigma)} + e^{\beta S_i^{\xi}(\sigma)}},
\]
where
\[
S_i^{\xi}(\sigma) := \sum_{j : j \sim i, \ j \in \Lambda} \sigma_j + \sum_{j : j \sim i, \ j \notin \Lambda} \xi_j.
\]
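The one-site update probability is simple enough to compute directly. In the sketch below the data layout is an assumption made for illustration: `sigma` maps the sites of $\Lambda$ to spins, `xi` maps boundary sites to spins, and `neighbors[i]` lists the neighbors of site `i`, whether inside or outside $\Lambda$.

```python
from math import exp

def glauber_plus_prob(i, sigma, xi, neighbors, beta):
    """Probability that the dynamics picks site i and sets its spin to +1."""
    # Local field S_i: spins of neighbors inside Lambda plus boundary spins.
    s = sum(sigma[j] if j in sigma else xi[j] for j in neighbors[i])
    return exp(beta * s) / (exp(-beta * s) + exp(beta * s)) / len(sigma)
```

The probability of picking site `i` and setting its spin to $-1$ is `1 / len(sigma)` minus the value returned. Since the returned value is increasing in the local field, raising spins in `sigma` or in `xi` can only increase it, which is the type of one-site comparison used in the domination argument.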
Claim 4.2.19.
\[ \xi' \ge \xi \implies Q^{\xi'}_{\beta,\Lambda} \text{ stochastically dominates } Q^{\xi}_{\beta,\Lambda}. \tag{4.2.8} \]
Proof. Because the Glauber dynamics updates a single site at a time, establishing
stochastic domination reduces to checking simple one-site inequalities.
Lemma 4.2.20. To establish (4.2.8), it suffices to show that for all $i$ and all $\sigma \le \tau$
\[ Q^{\xi}_{\beta,\Lambda}(\sigma, \sigma^{i,+}) \le Q^{\xi'}_{\beta,\Lambda}(\tau, \tau^{i,+}). \tag{4.2.9} \]
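Proof. Assume (4.2.9) holds. Let $A$ be increasing in $\mathcal{X}$ and let $\sigma \le \tau$. Then, for
the single-site Glauber dynamics, we have
\[ Q^{\xi}_{\beta,\Lambda}(\sigma, A) = Q^{\xi}_{\beta,\Lambda}(\sigma, A \cap B_\sigma), \tag{4.2.10} \]
where
\[ B_\sigma := \{\sigma^{i,\gamma} : i \in \Lambda, \ \gamma \in \{-1, +1\}\}, \]
and similarly for $\tau$, $\xi'$. Moreover, because $A$ is increasing and $\tau \ge \sigma$,
\[ \sigma^{i,\gamma} \in A \implies \tau^{i,\gamma} \in A, \tag{4.2.11} \]
and
\[ \sigma^{i,-} \in A \implies \sigma^{i,+} \in A. \tag{4.2.12} \]
Letting
\[
I^{\pm}_{\sigma,A} := \{i \in \Lambda : \sigma^{i,-} \in A\}, \qquad I^{+}_{\sigma,A} := \{i \in \Lambda : \sigma^{i,-} \notin A, \ \sigma^{i,+} \in A\},
\]
and similarly for $\tau$, we have by (4.2.9), (4.2.10), (4.2.11), and (4.2.12),
\begin{align*}
Q^{\xi}_{\beta,\Lambda}(\sigma, A) &= Q^{\xi}_{\beta,\Lambda}(\sigma, A \cap B_\sigma) \\
&= \sum_{i \in I^{+}_{\sigma,A}} Q^{\xi}_{\beta,\Lambda}(\sigma, \sigma^{i,+}) + \sum_{i \in I^{\pm}_{\sigma,A}} \left[ Q^{\xi}_{\beta,\Lambda}(\sigma, \sigma^{i,-}) + Q^{\xi}_{\beta,\Lambda}(\sigma, \sigma^{i,+}) \right] \\
&\le \sum_{i \in I^{+}_{\sigma,A}} Q^{\xi'}_{\beta,\Lambda}(\tau, \tau^{i,+}) + \sum_{i \in I^{\pm}_{\sigma,A}} \left[ Q^{\xi}_{\beta,\Lambda}(\sigma, \sigma^{i,-}) + Q^{\xi}_{\beta,\Lambda}(\sigma, \sigma^{i,+}) \right] \\
&= \sum_{i \in I^{+}_{\sigma,A}} Q^{\xi'}_{\beta,\Lambda}(\tau, \tau^{i,+}) + \sum_{i \in I^{\pm}_{\sigma,A}} \frac{1}{n} \\
&\le \sum_{i \in I^{+}_{\tau,A}} Q^{\xi'}_{\beta,\Lambda}(\tau, \tau^{i,+}) + \sum_{i \in I^{\pm}_{\tau,A}} \left[ Q^{\xi'}_{\beta,\Lambda}(\tau, \tau^{i,-}) + Q^{\xi'}_{\beta,\Lambda}(\tau, \tau^{i,+}) \right] \\
&= Q^{\xi'}_{\beta,\Lambda}(\tau, A),
\end{align*}
as claimed, where on the fifth line we used that $I^{+}_{\sigma,A} \subseteq I^{+}_{\tau,A} \cup I^{\pm}_{\tau,A}$ by (4.2.11) and
that $Q^{\xi'}_{\beta,\Lambda}(\tau, \tau^{i,+}) \le 1/n$ for all $i$ (in particular for $i \in I^{+}_{\tau,A} \setminus I^{+}_{\sigma,A}$).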
Finally:
Proof of Claim 4.2.18. Combining Theorem 4.2.17 and Claim 4.2.19 gives the re-
sult.
Remark 4.2.21. One can make sense of the limit of $\mu^{+}_{\beta,\Lambda}$ and $\mu^{-}_{\beta,\Lambda}$ when $|\Lambda| \to +\infty$,
which is known as an infinite-volume Gibbs measure. For more, see for example [RAS15,
Chapters 7-10].
Observe that we have not used any special property of the d-dimensional lat-
tice. Indeed Claim 4.2.18 in fact holds for any countable, locally finite graph with
positive coupling constants. We give another proof in Example 4.2.33.
Claim 4.2.23.
Pn,p [B | A] ≥ Pn,p [B]. (4.2.13)
More generally:
Theorem 4.2.28 (FKG inequality). Let $\mathcal{X} = \{0, 1\}^F$ where $F$ is finite. Suppose $\mu$
is a positive probability measure on $\mathcal{X}$ satisfying the FKG condition. Then $\mu$ has
positive associations.
Remark 4.2.29. Strict positivity is not in fact needed [FKG71]. The FKG condition is
equivalent to a strong form of positive associations. See Exercise 4.8.
= µ(ω) µ(ω′).
So the FKG inequality (Theorem 4.2.28) applies, for instance, to bond percola-
tion and the Erdős-Rényi random graph model. The pointwise nature of the FKG
condition also makes it relatively easy to check for measures which are defined
explicitly up to a normalizing constant, such as the Ising model.
Example 4.2.30 (Ising model with boundary conditions: checking FKG). Consider
again the setting of Section 4.2.2. We work on the space $\mathcal{X} := \{-1, +1\}^{\Lambda}$
rather than $\{0, 1\}^F$. Fix a finite $\Lambda \subseteq \mathbb{L}^d$, $\xi \in \{-1, +1\}^{\mathbb{L}^d}$ and $\beta > 0$.
Claim 4.2.31. The measure $\mu^{\xi}_{\beta,\Lambda}$ satisfies the FKG condition and therefore has
positive associations.
Intuitively, taking the minimum (or maximum) of two spin configurations tends
to increase agreement and therefore leads to a higher likelihood. For $\sigma, \sigma' \in \mathcal{X}$,
let $\overline{\tau} = \sigma \vee \sigma'$ and $\underline{\tau} = \sigma \wedge \sigma'$. By taking logarithms in the FKG condition and
rearranging, we arrive at
and we see that proving the claim boils down to checking an inequality for each
term in the Hamiltonian (which, confusingly, has a negative sign in it).
When $i \in \Lambda$ and $j \notin \Lambda$ such that $i \sim j$, we have
For $i, j \in \Lambda$ with $i \sim j$, note first that the case $\sigma_j = \sigma_j'$ reduces to the previous
calculation (with $\sigma_j = \sigma_j'$ playing the role of $\xi_j$), so we assume $\sigma_i \neq \sigma_i'$ and
$\sigma_j \neq \sigma_j'$. Then
since 2 is the largest value the rightmost expression ever takes. We have estab-
lished (4.2.16), which implies the claim.
Again, we have not used any special property of the lattice and the same result
holds for countable, locally finite graphs with positive coupling constants. Note
however that in the anti-ferromagnetic case, that is, if we multiply the Hamiltonian
by −1, the above argument does not work. Indeed there is no reason to expect
positive associations in that case. ◁
The FKG inequality in turn follows from a more general result known as Hol-
ley’s inequality.
Theorem 4.2.32 (Holley's inequality). Let $\mathcal{X} = \{0, 1\}^F$ where $F$ is finite. Suppose
$\mu_1$ and $\mu_2$ are positive probability measures on $\mathcal{X}$ satisfying
Then $\mu_1 \preceq \mu_2$.
Proof of Theorem 4.2.28. Assume that $\mu$ satisfies the FKG condition and let $f$, $g$
be increasing functions. Because of our restriction to positive measures in Holley's
inequality, we will work with positive functions. This is done without loss of
generality. Indeed, letting $\mathbf{0}$ be the all-0 vector, note that $f$ and $g$ are increasing if and
only if $f' := f - f(\mathbf{0}) + 1 > 0$ and $g' := g - g(\mathbf{0}) + 1 > 0$ are increasing and
that, moreover,
\begin{align*}
\mu_1(\omega)\mu_2(\omega') &= \mu(\omega) \frac{g(\omega')\mu(\omega')}{\mu(g)} \\
&= \mu(\omega)\mu(\omega') \frac{g(\omega')}{\mu(g)} \\
&\le \mu(\omega \wedge \omega')\mu(\omega \vee \omega') \frac{g(\omega \vee \omega')}{\mu(g)} \\
&= \mu_1(\omega \wedge \omega')\mu_2(\omega \vee \omega'),
\end{align*}
Proof of Theorem 4.2.32. The idea of the proof is to use Theorem 4.2.17. This is
similar to what was done in Section 4.2.2. Again we use a single-site dynamic.
For $x \in \mathcal{X}$ and $\gamma \in \{0, 1\}$, we let $x^{i,\gamma}$ be $x$ with coordinate $i$ set to $\gamma$. We
write $x \sim y$ if $\|x - y\|_1 = 1$. Let $n = |F|$. We use a scheme analogous to the
Metropolis algorithm (see Example 1.1.30). A natural symmetric chain on $\mathcal{X}$ is to
pick a coordinate uniformly at random, and flip its value. We modify it to guarantee
reversibility with respect to the desired stationary distributions, namely $\mu_1$ and $\mu_2$.
For $\alpha, \beta > 0$ small enough, the following transition matrix over $\mathcal{X}$ is irreducible
and reversible with respect to its stationary distribution $\mu_2$: for all $i \in F$,
$y \in \mathcal{X}$,
\begin{align*}
Q(y^{i,0}, y^{i,1}) &= \alpha \frac{1}{n} \{\beta\}, \\
Q(y^{i,1}, y^{i,0}) &= \alpha \frac{1}{n} \left\{ \beta \frac{\mu_2(y^{i,0})}{\mu_2(y^{i,1})} \right\}, \\
Q(y, y) &= 1 - \sum_{z : z \sim y} Q(y, z).
\end{align*}
Let P be similarly defined with respect to µ1 with the same values of α and β.
For reasons that will be clear below, the value of 0 < β < 1 is chosen small
enough that the sum of the two expressions in brackets above is smaller than 1
for all y, i in both P and Q. The value of α > 0 is then chosen small enough that
P (y, y), Q(y, y) ≥ 0 for all y. Reversibility follows immediately from the first two
equations. We call the first transition above an upward transition and the second
one a downward transition.
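The reversibility claim can be verified numerically on a small cube. The sketch below builds the transition probabilities for a given positive measure; the function name and the dictionary representation of `mu` (keyed by 0/1 tuples of length `n`) are assumptions made for illustration, and the caller is responsible for choosing `alpha` and `beta` small enough that all holding probabilities are nonnegative.

```python
from itertools import product

def holley_chain(mu, n, alpha, beta):
    """Transition probabilities Q[(y, z)] of the single-site chain for mu on {0,1}^n."""
    Q = {}
    for y in product((0, 1), repeat=n):
        stay = 1.0
        for i in range(n):
            lo = y[:i] + (0,) + y[i + 1:]    # y with coordinate i set to 0
            hi = y[:i] + (1,) + y[i + 1:]    # y with coordinate i set to 1
            if y[i] == 0:
                rate = alpha * beta / n                      # upward transition
                Q[(y, hi)] = rate
            else:
                rate = alpha * beta * mu[lo] / mu[hi] / n    # downward transition
                Q[(y, lo)] = rate
            stay -= rate
        Q[(y, y)] = stay
    return Q
```

Detailed balance holds because both directions of a flip carry the same flow: $\mu(y^{i,0}) \cdot \alpha\beta/n$ on the way up and $\mu(y^{i,1}) \cdot \alpha\beta\,\mu(y^{i,0})/(n\,\mu(y^{i,1}))$ on the way down.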
By Theorem 4.2.17, it remains to show that Q stochastically dominates P . That
is, for any x ≤ y, we want to show that P (x, ·) Q(y, ·). We produce a monotone
coupling (X̂, Ŷ ) of these two distributions. Because x ≤ y, our goal is never to
perform an upward transition in x simultaneously with a downward transition in y.
Observe that
\[ \frac{\mu_1(x^{i,0})}{\mu_1(x^{i,1})} \ge \frac{\mu_2(y^{i,0})}{\mu_2(y^{i,1})} \tag{4.2.19} \]
by taking $\omega = y^{i,0}$ and $\omega' = x^{i,1}$ in Condition (4.2.18).
The coupling works as follows. Fix x ≤ y. With probability 1 − α, set
(X̂, Ŷ ) := (x, y). Otherwise, pick a coordinate i ∈ F uniformly at random. There
are several cases to consider depending on the coordinates xi , yi (with xi ≤ yi by
assumption):
in both coordinates, that is, set $\hat{X} := x^{i,0}$ and $\hat{Y} := y^{i,0}$. With probability
\[ \beta \left( \frac{\mu_1(x^{i,0})}{\mu_1(x^{i,1})} - \frac{\mu_2(y^{i,0})}{\mu_2(y^{i,1})} \right), \]
A little accounting shows that this is indeed a coupling of P (x, ·) and Q(y, ·).
By construction, this coupling satisfies X̂ ≤ Ŷ almost surely. An application of
Theorem 4.2.17 concludes the proof.
Example 4.2.33 (Ising model revisited). Holley’s inequality implies Claim 4.2.18.
To see this, just repeat the calculations of Example 4.2.30, where now (4.2.17) is
replaced with an inequality. See Exercise 4.6. ◁
\[
\mathbb{E}_{n,p_n} X_n = \binom{n}{3} p_n^3 = \frac{n(n-1)(n-2)}{6} \left( \frac{\lambda}{n} \right)^3 \to \frac{\lambda^3}{6},
\]
The FKG inequality (Theorem 4.2.28) immediately gives one direction. Recall
that Pn,pn , as a product measure over edge sets, satisfies the FKG condition and
therefore has positive associations by the FKG inequality. Moreover the events BSc
are decreasing for all S. Hence, applying positive associations inductively,
\[
\mathbb{P}_{n,p_n}\left[ \bigcap_{S \in \binom{V}{3}} B_S^c \right] \ge \prod_{S \in \binom{V}{3}} \mathbb{P}_{n,p_n}[B_S^c] \to e^{-\lambda^3/6},
\]
where the limit follows from (4.2.21). As it turns out, the FKG inequality also
gives a bound in the other direction. This is known as Janson’s inequality, which
we state in a more general context.
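\[ B_i := \{\omega \in \mathcal{X} : \omega \ge \beta^{(i)}\}, \]
Theorem 4.2.36 (Janson's inequality). Let $\mathcal{X} := \{0, 1\}^F$ where $F$ is finite and
$\mathbb{P}$ be a positive product measure on $\mathcal{X}$. Let $\{B_i\}_{i \in I}$ and $\Delta$ be as above. Assume
further that there is $\varepsilon > 0$ such that $\mathbb{P}[B_i] \le \varepsilon$ for all $i \in I$. Then
\[
\prod_{i \in I} \mathbb{P}[B_i^c] \le \mathbb{P}[\cap_{i \in I} B_i^c] \le e^{\frac{\Delta}{1 - \varepsilon}} \prod_{i \in I} \mathbb{P}[B_i^c].
\]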
Before proving the theorem, we show that it implies Claim 4.2.34. We have
already shown in (4.2.20) and (4.2.22) that ε → 0 and ∆ → 0. Janson’s inequality
(Theorem 4.2.36) immediately implies the claim by (4.2.21).
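For very small graphs both sides of Janson's inequality can be computed exactly by enumerating all edge configurations. The sketch below does this for triangles in the Erdős-Rényi model. It rests on an assumption about the definition of $\Delta$, which is not restated here: $\Delta$ is taken as the sum of $\mathbb{P}[B_i \cap B_j]$ over unordered pairs of distinct triangles sharing at least one edge. The function name is hypothetical.

```python
from itertools import combinations, product
from math import exp

def janson_check(n, p):
    """Return (lower bound, exact P[no triangle], upper bound) in G(n, p)."""
    edges = list(combinations(range(n), 2))
    triangles = [frozenset(combinations(t, 2)) for t in combinations(range(n), 3)]
    # Exact probability of no triangle, by enumerating all 2^|E| configurations.
    no_triangle = 0.0
    for bits in product((0, 1), repeat=len(edges)):
        open_edges = {e for e, b in zip(edges, bits) if b}
        if not any(t <= open_edges for t in triangles):
            k = len(open_edges)
            no_triangle += p**k * (1 - p) ** (len(edges) - k)
    eps = p**3                                   # P[B_i] for every triangle
    lower = (1 - eps) ** len(triangles)          # product of P[B_i^c]
    # Assumed Delta: unordered pairs of distinct triangles sharing an edge.
    delta = sum(p ** len(s | t) for s, t in combinations(triangles, 2) if s & t)
    upper = exp(delta / (1 - eps)) * lower
    return lower, no_triangle, upper
```

The enumeration is exponential in the number of edges, so this is only practical for $n$ up to about 6.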
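The rest is clever manipulation. For $i \in [m]$, let $N(i) := \{\ell \in [m] : \ell \sim i\}$ and
$N_<(i) := N(i) \cap [i - 1]$. Note that $B_i$ is independent of $\{B_\ell : \ell \in [i - 1] \setminus N_<(i)\}$.
Hence,
\begin{align*}
\mathbb{P}[B_i \mid \cap_{j \in [i-1]} B_j^c] &= \frac{\mathbb{P}\left[ B_i \cap \left( \cap_{j \in [i-1]} B_j^c \right) \right]}{\mathbb{P}[\cap_{j \in [i-1]} B_j^c]} \\
&\ge \frac{\mathbb{P}\left[ B_i \cap \left( \cap_{j \in N_<(i)} B_j^c \right) \cap \left( \cap_{j \in [i-1] \setminus N_<(i)} B_j^c \right) \right]}{\mathbb{P}[\cap_{j \in [i-1] \setminus N_<(i)} B_j^c]} \\
&= \mathbb{P}\left[ B_i \cap \left( \cap_{j \in N_<(i)} B_j^c \right) \,\middle|\, \cap_{j \in [i-1] \setminus N_<(i)} B_j^c \right]
\end{align*}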
where we used independence for the first term on the last line. By a union bound,
the second term on the last line is
where the last line follows from the FKG inequality. This requires some explana-
tions:
- On the event $B_i$, all coordinates $\ell$ with $\beta^{(i)}_\ell = 1$ are fixed to 1, and the other
ones are free. So we can think of $\mathbb{P}[\,\cdot \mid B_i]$ as a positive product measure on
$\{0, 1\}^{F'}$ with $F' := \{\ell \in [m] : \beta^{(i)}_\ell = 0\}$.
- The event $B_j$ is increasing, while the event $\cap_{j \in [i-1] \setminus N_<(i)} B_j^c$ is decreasing
as the intersection of decreasing events (see Exercise 4.4).
where we used the assumption P[Bi ] ≤ ε on the second line. By the definition of
∆, we are done.
where C0 is the open cluster of the origin. We know from Example 4.2.10 that θ(p)
is non-decreasing. Let
be the critical value. We proved in Section 2.2.4 that there is a non-trivial transition,
that is, $p_c(\mathbb{L}^2) \in (0, 1)$. See Exercise 2.3 for a proof that $p_c(\mathbb{L}^2) \in [1/3, 2/3]$.
Our goal in this section is to use the FKG inequality to improve this further to:
θ(1/2) = 0.
Here we present a proof of Harris’ theorem that uses an important tool in perco-
lation theory, the Russo-Seymour-Welsh (RSW) lemma, an application of the FKG
inequality.
Harris’ theorem
To motivate the RSW lemma, we start with the proof of Harris’ theorem.
Proof of Theorem 4.2.37. Fix $p = 1/2$. We use the dual lattice $\widetilde{\mathbb{L}}^2$ as we did in
Section 2.2.4. Consider the annulus
The existence of a closed dual cycle inside Ann($\ell$), an event we denote by $O_d(\ell)$,
prevents the possibility of an infinite open path from the origin in the primal lattice
$\mathbb{L}^2$. See Figure 4.4. That is,
\[
\mathbb{P}_{1/2}[|C_0| = +\infty] \le \prod_{k=0}^{K} \{1 - \mathbb{P}_{1/2}[O_d(3^k)]\}, \tag{4.2.23}
\]
for all $K$, where we took powers of 3 to make the annuli disjoint and therefore
independent. To prove the theorem, it suffices to show that there is a constant
$c_* > 0$ such that, for all $\ell$, $\mathbb{P}_{1/2}[O_d(\ell)] \ge c_*$. Then the right-hand side of (4.2.23)
tends to 0 as $K \to +\infty$.
To simplify further, thinking of Ann($\ell$) as a union of four rectangles $[-3\ell, -\ell) \times
[-3\ell, 3\ell]$, $[-3\ell, 3\ell] \times (\ell, 3\ell]$, etc., it suffices to consider the event $O_d^{\#}(\ell)$ that each
one of these rectangles contains a closed dual path connecting its two shorter sides.
To be more precise, for the first rectangle above for instance, the path connects
$[-3\ell + 1/2, -\ell - 1/2] \times \{3\ell - 1/2\}$ to $[-3\ell + 1/2, -\ell - 1/2] \times \{-3\ell + 1/2\}$
and stays inside the rectangle. See Figure 4.4. By symmetry the probability that
such a path exists is the same for all four rectangles. Denote that probability by $\rho_\ell$.
Moreover the event that such a path exists is increasing so, although the four events
are not independent, we can apply the FKG inequality (Theorem 4.2.28). Hence,
since $O_d^{\#}(\ell) \subseteq O_d(\ell)$, we finally get the bound
The RSW lemma and some symmetry arguments, both of which are detailed
below, imply:
Claim 4.2.39. There is some $c > 0$ such that, for all $\ell$,
\[ \rho_\ell \ge c. \]
RSW theory
We have reduced the proof of Harris’ theorem to bounding the probability that
certain closed paths exist in the dual lattice. To be consistent with the standard
RSW notation, we switch to the primal lattice and consider open paths. We also let
p take any value in (0, 1).
Let Rn,α (p) be the probability that the rectangle
has an open path connecting its left and right sides with the path remaining inside
the rectangle. Such a path is called an (open) left-right crossing. The event that a
left-right crossing exists in a rectangle B is denoted by LR(B). We similarly define
the event, TB(B), that a top-bottom crossing exists in B. In essence, the RSW
lemma says this: if there is a significant probability that a left-right crossing exists
in the square B(n, n), then there is a significant probability that a left-right crossing
exists in the rectangle B(3n, n). More precisely, here is a version of the theorem
that will be enough for our purposes. (See Exercise 4.10 for a generalization.)
Lemma 4.2.40 (RSW lemma). For all $n \ge 2$ (divisible by 4) and $p \in (0, 1)$,
\[ R_{n,3}(p) \ge \frac{1}{256} R_{n,1}(p)^{11} R_{n/2,1}(p)^{12}. \tag{4.2.24} \]
The right-hand side of (4.2.24) depends only on the probability of crossing a square
from left to right. By a duality argument, at p = 1/2, it turns out that this prob-
ability is at least 1/2 independently of n. Before presenting a proof of the RSW
lemma, we detail this argument and finish the proof of Harris’ theorem.
Proof of Claim 4.2.39. The point of (4.2.24) is that, if Rn,1 (1/2) is bounded away
from 0 uniformly in n, then so is the left-hand side. By the argument in the proof
of Harris’ theorem, this then implies that a closed cycle exists in Ann(n) with a
probability bounded away from 0 as well. Hence to prove Claim 4.2.39 it suffices
to give a lower bound on Rn,1 (1/2). It is crucial that this bound not depend on the
“scale” n.
As it turns out, a simple duality-based symmetry argument does the trick. The
following fact about L2 is a variant of the contour lemma (Lemma 2.2.14). Its
proof is similar and Exercise 4.11 asks for the details (the “if” direction being the
non-trivial implication).
Lemma 4.2.41. There is an open left-right crossing in the primal rectangle [0, n +
1] × [0, n] if and only if there is no closed top-bottom crossing in the dual rectangle
[1/2, n + 1/2] × [−1/2, n + 1/2].
By symmetry, when p = 1/2, the two events in Lemma 4.2.41 have equal proba-
bility. So they must have probability 1/2 because they form a partition of the space
of outcomes. By monotonicity, that implies Rn,1 (1/2) ≥ 1/2 for all n. The RSW
lemma then implies the required bound.
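To see that the resulting bound is indeed uniform in the scale, one can plug the duality bound $R_{n,1}(1/2) \ge 1/2$, applied at scales $n$ and $n/2$, into (4.2.24). The explicit constant below is worked out here for illustration only and is not claimed to be the constant $c$ of Claim 4.2.39.

```latex
\begin{align*}
R_{n,3}(1/2) &\ge \frac{1}{256}\, R_{n,1}(1/2)^{11}\, R_{n/2,1}(1/2)^{12} \\
&\ge 2^{-8} \cdot 2^{-11} \cdot 2^{-12} = 2^{-31}.
\end{align*}
```

The point is that the right-hand side does not depend on $n$.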
The proof of the RSW lemma involves a clever choice of event that relates the
existence of crossings in squares and rectangles. (Combining crossings of squares
into crossings of rectangles is not as trivial as it might look. Try it before reading
the proof.)
Step 1: it suffices to bound $R_{n,3/2}(p)$ We first reduce the proof to finding a
bound on $R_{n,3/2}(p)$. Let $B_1' := B(2n, n)$ and $B_2' := [n, 5n] \times [-n, n]$. Note
that $B_1' \cup B_2' = B(3n, n)$ and $B_1' \cap B_2' = [n, 3n] \times [-n, n]$. Then we have the
implication
See Figure 4.5. Each event on the left-hand side is increasing so the FKG inequality
gives
\[ R_{n,3}(p) \ge R_{n,2}(p)^2 R_{n,1}(p). \]
A similar argument over $B(2n, n)$ gives
Step 2: bounding Rn,3/2 (p) The heart of the proof is to bound Rn,3/2 (p) using
an event involving crossings of squares. Let
Let $\Gamma_1$ be the event that there are paths $P_1$, $P_2$, where $P_1$ is a top-bottom crossing
of $S$ and $P_2$ is an open path connecting the left side of $B_1$ to $P_1$ and stays inside
$B_1$. Similarly let $\Gamma_2'$ be the event that there are paths $P_1'$, $P_2'$, where $P_1'$ is a
top-bottom crossing of $S$ and $P_2'$ is an open path connecting the right side of $B_2$ to $P_1'$
and stays inside $B_2$. Then we have the implication
See Figure 4.6. By symmetry $\mathbb{P}_p[\Gamma_1] = \mathbb{P}_p[\Gamma_2']$. Moreover, the events on the
left-hand side are increasing so by the FKG inequality:
Lemma 4.2.43 (Proof of RSW: step 2).
Step 3: bounding $\mathbb{P}_p[\Gamma_1]$ It remains to bound $\mathbb{P}_p[\Gamma_1]$. That requires several
additional definitions. Let $P_1$ and $P_2$ be top-bottom crossings of $S$. There is a natural
partial order over such crossings. The path $P_1$ divides $S$ into two subgraphs: $[P_1\}$
which includes the left side of $S$ (including edges on the left incident with $P_1$ but
not those edges on $P_1$ itself) and $\{P_1]$ which includes the right side of $S$ (and $P_1$
itself). Then we write $P_1 \preceq P_2$ if $\{P_1] \subseteq \{P_2]$. Assuming TB($S$) holds, one
also gets the existence of a unique rightmost crossing. Roughly speaking, take the
union of all top-bottom crossings of $S$ as sets of edges; then the "right boundary"
of this set is a top-bottom crossing $P_S^*$ such that $P_S^* \preceq P$ for all top-bottom
crossings $P$ of $S$. (We accept as a fact the existence of a unique rightmost crossing. See
Exercise 4.11 for a related construction.)
Let $I_S$ be the set of (not necessarily open) paths connecting the top and bottom of $S$ that stay inside $S$. For $P \in I_S$, we let $P'$ be the reflection of $P$ in $B_{12} \setminus S$ through the $x$-axis and we let $\tfrac{P}{P'}$ be the union of $P$ and $P'$. Define $[\tfrac{P}{P'}\}$ to be the subgraph of $B_1$ to the left of $\tfrac{P}{P'}$ (including edges on the left incident with $\tfrac{P}{P'}$ but not those edges on $\tfrac{P}{P'}$ itself). Let $\mathrm{LR}^+\big([\tfrac{P}{P'}\}\big)$ be the event that there is a left-right crossing of $[\tfrac{P}{P'}\}$ ending on $P$, that is, that there is an open path connecting the left side of $B_1$ and $P$ that stays within $[\tfrac{P}{P'}\}$. See Figure 4.6. Note that the existence of a left-right crossing of $B_1$ implies the existence of an open path connecting the left side of $B_1$ to $\tfrac{P}{P'}$. By symmetry we then get
$$\mathbb{P}_p\big[\mathrm{LR}^+\big([\tfrac{P}{P'}\}\big)\big] \ge \frac{1}{2}\, \mathbb{P}_p[\mathrm{LR}(B_1)] = \frac{1}{2} R_{n,1}(p). \tag{4.2.27}$$
(Figure caption, continued.) ... the mirror image of the rightmost top-bottom crossing in $S$; the shaded region on the right is the complement in $B_1$ of the set $[\tfrac{P}{P'}\}$. Note that, because in the bottom figure the left-right path must stay within $[\tfrac{P}{P'}\}$ by definition of $P^*_S$, the configuration shown in the top figure where a left-right path (dotted) “travels behind” the top-bottom crossing of $S$ cannot occur.
Now comes a subtle point. We turn to the rightmost crossing of $S$, for two reasons:
• First, by uniqueness of the rightmost crossing, $\{P^*_S = P\}_{P \in I_S}$ forms a partition of $\mathrm{TB}(S)$. Recall that we are looking to bound a probability from below, and therefore we have to be careful not to “double count.”
• Second, the rightmost crossing has a Markov-like property. Observe that, for $P \in I_S$, the event $\{P^*_S = P\}$ depends only on the bonds in $\{P]$. In particular it is independent of the bonds in $[\tfrac{P}{P'}\}$, for example, of the event $\mathrm{LR}^+\big([\tfrac{P}{P'}\}\big)$. Hence
$$\mathbb{P}_p\big[\mathrm{LR}^+\big([\tfrac{P}{P'}\}\big) \,\big|\, P^*_S = P\big] = \mathbb{P}_p\big[\mathrm{LR}^+\big([\tfrac{P}{P'}\}\big)\big]. \tag{4.2.28}$$
Note that the event $\{P^*_S = P\}$ is not increasing, as adding more open bonds can shift the rightmost crossing rightward. Therefore, we cannot use the FKG inequality here.
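Combining (4.2.27) and (4.2.28), we get
$$\begin{aligned}
\mathbb{P}_p[\Gamma_1] &\ge \sum_{P \in I_S} \mathbb{P}_p[P^*_S = P]\; \mathbb{P}_p\big[\mathrm{LR}^+\big([\tfrac{P}{P'}\}\big) \,\big|\, P^*_S = P\big] \\
&\ge \frac{1}{2} R_{n,1}(p) \sum_{P \in I_S} \mathbb{P}_p[P^*_S = P] \\
&= \frac{1}{2} R_{n,1}(p)\, \mathbb{P}_p[\mathrm{TB}(S)] \\
&= \frac{1}{2} R_{n,1}(p)\, R_{n/2,1}(p).
\end{aligned}$$
We have proved:
Lemma 4.2.44 (Proof of RSW: step 3).
$$\mathbb{P}_p[\Gamma_1] \ge \frac{1}{2} R_{n,1}(p)\, R_{n/2,1}(p). \tag{4.2.29}$$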
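where
$$d(t) := \max_{x \in V} \|P^t(x, \cdot) - \pi\|_{\mathrm{TV}}.$$
Lemma 4.3.1.
$$d(t) \le \bar{d}(t) \le 2 d(t), \qquad \forall t.$$
Proof. The second inequality follows from an application of the triangle inequality. For the first inequality, note that by definition of the total variation distance and the stationarity of $\pi$
$$\begin{aligned}
\|P^t(x, \cdot) - \pi\|_{\mathrm{TV}} &= \sup_{A \subseteq V} |P^t(x, A) - \pi(A)| \\
&= \sup_{A \subseteq V} \Big| \sum_{y \in V} \pi(y) [P^t(x, A) - P^t(y, A)] \Big| \\
&\le \sup_{A \subseteq V} \sum_{y \in V} \pi(y) |P^t(x, A) - P^t(y, A)| \\
&\le \sum_{y \in V} \pi(y) \Big\{ \sup_{A \subseteq V} |P^t(x, A) - P^t(y, A)| \Big\} \\
&\le \sum_{y \in V} \pi(y) \|P^t(x, \cdot) - P^t(y, \cdot)\|_{\mathrm{TV}}
\end{aligned}$$

$$\|P^t(x, \cdot) - P^t(y, \cdot)\|_{\mathrm{TV}} \le \mathbb{P}_{(x,y)}[X_t \neq Y_t] = \mathbb{P}_{(x,y)}[\tau_{\mathrm{coal}} > t]. \tag{4.3.1}$$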
Combining (4.3.1) and Lemma 4.3.1, we get the main tool of this section.
Theorem 4.3.2 (Bounding the mixing time: coupling method). Let $(X_t, Y_t)$ be a coalescing Markovian coupling of an irreducible transition matrix $P$ on a finite state space $V$ with stationary distribution $\pi$. Then
$$d(t) \le \max_{x, y \in V} \mathbb{P}_{(x,y)}[\tau_{\mathrm{coal}} > t].$$
In particular
$$t_{\mathrm{mix}}(\varepsilon) \le \inf\big\{ t \ge 0 : \mathbb{P}_{(x,y)}[\tau_{\mathrm{coal}} > t] \le \varepsilon, \ \forall x, y \big\}.$$
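As a numerical sanity check on Lemma 4.3.1, the following sketch computes both distances by brute force for lazy simple random walk on a small cycle. It assumes the standard definition $\bar{d}(t) := \max_{x,y} \|P^t(x,\cdot) - P^t(y,\cdot)\|_{\mathrm{TV}}$, which is not restated in the text above; the function names and the choice $n = 7$ are illustrative only.

```python
def lazy_cycle_matrix(n):
    """Transition matrix of lazy simple random walk on the n-cycle."""
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        P[i][i] += 0.5
        P[i][(i + 1) % n] += 0.25
        P[i][(i - 1) % n] += 0.25
    return P


def mat_mul(A, B):
    """Plain-Python product of two square matrices."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def tv(mu, nu):
    """Total variation distance between two probability vectors."""
    return 0.5 * sum(abs(a - b) for a, b in zip(mu, nu))


def d_and_dbar(P, pi, t):
    """Return (d(t), dbar(t)) by brute force over all starting states."""
    n = len(P)
    Pt = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(t):
        Pt = mat_mul(Pt, P)
    d = max(tv(row, pi) for row in Pt)
    dbar = max(tv(Pt[x], Pt[y]) for x in range(n) for y in range(n))
    return d, dbar


n = 7
P = lazy_cycle_matrix(n)
pi = [1.0 / n] * n  # the uniform distribution is stationary on a cycle
for t in (0, 1, 5, 20):
    d, dbar = d_and_dbar(P, pi, t)
    print(t, round(d, 4), round(dbar, 4))
```

Each printed row should satisfy $d(t) \le \bar{d}(t) \le 2d(t)$, up to floating-point error.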
In words there is a state z0 ∈ V such that, starting from any state w ∈ V , the
probability of reaching z0 in exactly s steps is at least δ (which does not depend on
w). Assume such a z0 exists.
We construct a coalescing Markovian coupling $(X_t, Y_t)$ of $P$. Assume first that $s = 1$ and let
$$\tilde{P}(w, z) = \frac{1}{1 - \delta} \big[ P(w, z) - \delta \mathbf{1}\{z = z_0\} \big].$$
It can be checked that $\tilde{P}$ is a stochastic matrix on $V$ provided $z_0$ satisfies the condition above (see Exercise 4.13). We use a technique known as splitting. While $X_t \neq Y_t$, at the next time step: (i) with probability $\delta$ we set $X_{t+1} = Y_{t+1} = z_0$, (ii) otherwise we pick $X_{t+1} \sim \tilde{P}(X_t, \cdot)$ and $Y_{t+1} \sim \tilde{P}(Y_t, \cdot)$ independently. On the other hand, if $X_t = Y_t$, we maintain the equality and pick the next state according to $P$. Put differently, the coupling $Q$ is defined as: if $x \neq y$,
$$Q((x, y), (x', y')) = \delta \mathbf{1}\{x' = y' = z_0\} + (1 - \delta)\, \tilde{P}(x, x')\, \tilde{P}(y, y'),$$
while if $x = y$,
$$Q((x, x), (x', x')) = P(x, x').$$
Observe that, in case (i) above, coalescence occurs at time $t + 1$. In case (ii), coalescence may or may not occur at time $t + 1$. In other words, while $X_t \neq$
By Theorem 4.3.2,
Exponential decay of the worst-case total variation distance to the stationary distribution is referred to as uniform geometric ergodicity.
Suppose now that $s > 1$. We apply the argument above to the chain $P^s$ this time. We get
$$\max_{x, y \in V} \mathbb{P}_{(x,y)}[\tau_{\mathrm{coal}} > ts] \le (1 - \delta)^t.$$
So, we have shown that uniform geometric ergodicity is implied by Doeblin’s condition.
We note however that the rate of decay derived from this technique can be
very slow. For instance the condition always holds when P is finite, irreducible
and aperiodic (as follows from Lemma 1.1.32), but a straight application of the
technique may lead to a bound depending badly on the size of the state space V
(see Exercise 4.14).
Cycle
Let $(Z_t)$ be lazy simple random walk on the cycle of size $n$, $\mathbb{Z}_n := \{0, 1, \ldots, n-1\}$, where $i \sim j$ if $|j - i| = 1 \pmod{n}$. For any starting points $x, y$, we construct a Markovian coupling $(X_t, Y_t)$ of this chain. Set $(X_0, Y_0) := (x, y)$. At each time, flip a fair coin. On heads, $Y_t$ stays put and $X_t$ moves one step, the direction of which is uniform at random. On tails, proceed similarly with the roles of $X_t$ and $Y_t$ reversed. Let $D_t$ be the clockwise distance between $X_t$ and $Y_t$. Observe that, by construction, $(D_t)$ is simple random walk on $\{0, \ldots, n\}$ and $\tau_{\mathrm{coal}} = \tau^D_{\{0,n\}}$, the first time $(D_t)$ hits $\{0, n\}$.
We use Markov’s inequality (Theorem 2.1.1) to bound $\mathbb{P}_{(x,y)}[\tau^D_{\{0,n\}} > t]$. Denote by $D_0 = d_{x,y}$ the starting distance. By Wald’s second identity (Theorem 3.1.40),
$$\mathbb{E}_{(x,y)}\big[\tau^D_{\{0,n\}}\big] = d_{x,y} (n - d_{x,y}).$$
Applying Theorem 4.3.2 and Markov’s inequality, and using that $d_{x,y}(n - d_{x,y}) \le n^2/4$, we get $d(t) \le n^2/(4t)$.
Claim 4.3.4.
$$t_{\mathrm{mix}}(\varepsilon) \le \frac{n^2}{4 \varepsilon}.$$
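The closed form from Wald’s second identity can be checked against the one-step (first-step analysis) equation for simple random walk on $\{0, \ldots, n\}$, and then combined with Markov’s inequality to recover the bound of Claim 4.3.4. The sketch below is illustrative; the function names are not from the text.

```python
def expected_exit_time(n, d):
    """E[tau] for simple random walk on {0,...,n} started at d: d * (n - d)."""
    return d * (n - d)


def satisfies_one_step_equation(n):
    """Check h(d) = 1 + (h(d-1) + h(d+1)) / 2 for 0 < d < n, with h(0) = h(n) = 0."""
    h = [expected_exit_time(n, d) for d in range(n + 1)]
    interior_ok = all(2 * h[d] == 2 + h[d - 1] + h[d + 1] for d in range(1, n))
    return interior_ok and h[0] == 0 and h[n] == 0


def coupling_tail_bound(n, t):
    """Markov's inequality with the worst starting distance: max_d d(n-d) / t."""
    worst = max(expected_exit_time(n, d) for d in range(n + 1))
    return worst / t


def cycle_tmix_upper_bound(n, eps):
    """Right-hand side of Claim 4.3.4."""
    return n * n / (4 * eps)


print(satisfies_one_step_equation(10))
print(coupling_tail_bound(10, cycle_tmix_upper_bound(10, 0.25)))
```

At $t = n^2/(4\varepsilon)$ the Markov bound on the coalescence tail is at most $\varepsilon$, which is what the claim requires.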
By the diameter-based lower bound on mixing in Section 5.2.3, this bound gives the correct order of magnitude in $n$ up to logarithmic factors. Indeed, the diameter is $\Delta = n/2$ and $\pi_{\min} = 1/n$ so that Claim 5.2.25 gives
$$t_{\mathrm{mix}}(\varepsilon) \ge \frac{n^2}{64 \log n},$$
for $n$ large enough. Exercise 4.15 sketches a tighter lower bound.
Hypercube
Let $(Z_t)_{t \in \mathbb{Z}_+}$ be lazy simple random walk on the $n$-dimensional hypercube $\mathbb{Z}_2^n := \{0, 1\}^n$ where $i \sim j$ if $\|i - j\|_1 = 1$. We denote the coordinates of $Z_t$ by $(Z_t^{(1)}, \ldots, Z_t^{(n)})$. This is equivalent to performing the Glauber dynamics chain on an empty graph (see Definition 1.2.8): at each step, we first pick a coordinate uniformly at random, then refresh its value. Because of the way the updating is done, the chain stays put with probability 1/2 at each time as required.
Inspired by this observation, the coupling $(X_t, Y_t)$ started at $(x, y)$ is the following. At each time $t$, pick a coordinate $i$ uniformly at random in $[n]$, pick a bit value $b$ in $\{0, 1\}$ uniformly at random independently of the coordinate choice. Set both $i$ coordinates to $b$, that is, $X_t^{(i)} = Y_t^{(i)} = b$. By design we reach coalescence when all coordinates have been updated at least once.
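The coupling just described is short enough to simulate directly. The sketch below runs it until every coordinate has been refreshed at least once and returns the two states, which must then agree. The function name and the seeded generator are illustrative choices, not part of the text.

```python
import random


def coupled_hypercube_run(x, y, rng):
    """Run the coupling until every coordinate has been refreshed once.

    Returns (t, X, Y): the number of steps taken and the two final states.
    """
    n = len(x)
    X, Y = list(x), list(y)
    updated = set()
    t = 0
    while len(updated) < n:
        i = rng.randrange(n)   # coordinate chosen uniformly in {0,...,n-1}
        b = rng.randrange(2)   # fresh uniform bit, shared by both chains
        X[i] = b
        Y[i] = b
        updated.add(i)
        t += 1
    return t, X, Y


rng = random.Random(0)
t, X, Y = coupled_hypercube_run([0] * 8, [1] * 8, rng)
print(t, X == Y)
```

The stopping time here is the coupon-collector time, which is an upper bound on the coalescence time (the chains may already agree earlier).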
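The following standard bound from the coupon collector’s problem (see Example 2.1.4) is what is needed to conclude.
Lemma 4.3.5. Let $\tau_{\mathrm{coll}}$ be the time it takes to update each coordinate at least once. Then, for any $c > 0$,
$$\mathbb{P}[\tau_{\mathrm{coll}} > \lceil n \log n + cn \rceil] \le e^{-c}.$$
Proof. Let $B_i$ be the event that the $i$-th coordinate has not been updated by time $\lceil n \log n + cn \rceil$. Then, using that $1 - x \le e^{-x}$ for all $x$ (see Exercise 1.16),
$$\begin{aligned}
\mathbb{P}[\tau_{\mathrm{coll}} > \lceil n \log n + cn \rceil] &\le \sum_i \mathbb{P}[B_i] \\
&= \sum_i \left(1 - \frac{1}{n}\right)^{\lceil n \log n + cn \rceil} \\
&\le n \exp\left(-\frac{n \log n + cn}{n}\right) \\
&= e^{-c}.
\end{aligned}$$
Claim 4.3.6.
$$t_{\mathrm{mix}}(\varepsilon) \le \lceil n \log n + c_\varepsilon n \rceil.$$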
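To see how tight the union bound in the proof of Lemma 4.3.5 is, one can compare it with the exact tail of the coupon-collector time, computed by inclusion-exclusion over the set of coordinates never chosen. This is a sketch for small $n$ only; the function names are illustrative.

```python
import math


def union_bound(n, c):
    """n * (1 - 1/n)^T with T = ceil(n log n + c n), as in the proof of Lemma 4.3.5."""
    T = math.ceil(n * math.log(n) + c * n)
    return n * (1.0 - 1.0 / n) ** T


def exact_tail(n, c):
    """P[tau_coll > T] by inclusion-exclusion over the set of missed coordinates."""
    T = math.ceil(n * math.log(n) + c * n)
    return sum((-1) ** (k + 1) * math.comb(n, k) * (1.0 - k / n) ** T
               for k in range(1, n + 1))


for n in (5, 10, 20):
    for c in (0.5, 1.0, 2.0):
        print(n, c, exact_tail(n, c), union_bound(n, c), math.exp(-c))
```

In each row the three printed quantities should be non-decreasing from left to right.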
Again we get a quick lower bound using the diameter-based result from Section 5.2.3. Here $\Delta = n$ and $\pi_{\min} = 1/2^n$ so that Claim 5.2.25 gives
$$t_{\mathrm{mix}}(\varepsilon) \ge \frac{n^2}{12 \log n + (4 \log 2) n} = \Omega(n),$$
for $n$ large enough. So the upper bound we derived above is off at most by a logarithmic factor in $n$. In fact:
Claim 4.3.7.
$$t_{\mathrm{mix}}(\varepsilon) \ge \frac{1}{2} n \log n - O(n).$$
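Proof. For simplicity, we assume that $n$ is odd. Let $W_t$ be the number of 1s, or Hamming weight, at time $t$. Let $A$ be the event that the Hamming weight is $\le n/2$. To bound the mixing time, we use the fact that for any $z_0$
have
$$\begin{aligned}
\mathbb{E}[W_t] &= \mathbb{E}[\mathbb{E}[W_t \mid U_t]] \\
&= \mathbb{E}\left[\frac{1}{2} U_t + (n - U_t)\right] \\
&= \mathbb{E}\left[n - \frac{1}{2} U_t\right] \\
&= n - \frac{1}{2} n \left[1 - \left(1 - \frac{1}{n}\right)^t\right] \\
&= \frac{n}{2} \left[1 + \left(1 - \frac{1}{n}\right)^t\right],
\end{aligned} \tag{4.3.4}$$
where on the fourth line we used the fact that $\mathbb{E}[U_t] = n\left[1 - \left(1 - \frac{1}{n}\right)^t\right]$ by summing over the coordinates and using linearity of expectation.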
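As to the variance, using again the observation above about the distribution of $W_t$ given $U_t$,
that is, $I_t^{(i)}$ and $I_t^{(j)}$ are negatively correlated, while
$$\mathrm{Var}[I_t^{(i)}] = \mathbb{E}[(I_t^{(i)})^2] - (\mathbb{E}[I_t^{(i)}])^2 \le \mathbb{E}[I_t^{(i)}] = \left(1 - \frac{1}{n}\right)^t.$$
Hence
$$\mathrm{Var}[U_t] \le n \left(1 - \frac{1}{n}\right)^t.$$
Plugging this back into (4.3.5), we get
$$\mathrm{Var}[W_t] \le \frac{n}{4} \left[1 - \left(1 - \frac{1}{n}\right)^t\right] + \frac{n}{4} \left(1 - \frac{1}{n}\right)^t = \frac{n}{4}.$$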
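For $t_\alpha = \frac{1}{2} n \log n - n \log \alpha$ with $\alpha > 0$, by (4.3.4),
$$\mathbb{E}[W_{t_\alpha}] = \frac{n}{2} + \frac{n}{2} e^{t_\alpha(-1/n + \Theta(1/n^2))} = \frac{n}{2} + \frac{\alpha}{2} \sqrt{n} + o(1),$$
where we used that by a Taylor expansion, for $|z| \le 1/2$, $\log(1 - z) = -z + \Theta(z^2)$. Fix $0 < \varepsilon < 1/2$. By Chebyshev’s inequality, for $t_\alpha = \frac{1}{2} n \log n - n \log \alpha$ and $n$ large enough,
$$\mathbb{P}[W_{t_\alpha} \le n/2] \le \mathbb{P}\big[|W_{t_\alpha} - \mathbb{E}[W_{t_\alpha}]| \ge (\alpha/2) \sqrt{n}\big] \le \frac{n/4}{(\alpha/2)^2 n} \le \frac{1}{2} - \varepsilon,$$
for $\alpha$ large enough. By (4.3.2) and (4.3.3), that implies $d(t_\alpha) \ge \varepsilon$ and we are done.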
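The two estimates used at the end of the proof can be evaluated numerically. The sketch below plugs $t_\alpha$ into (4.3.4) and compares the excess over $n/2$ with $(\alpha/2)\sqrt{n}$; it assumes, as the formula $\mathbb{E}[W_t \mid U_t] = \frac{1}{2}U_t + (n - U_t)$ suggests, that the walk is started from the all-ones state, and it rounds $t_\alpha$ down to an integer. Function names are illustrative.

```python
import math


def mean_weight(n, t):
    """E[W_t] from (4.3.4), assuming the walk starts at the all-ones state."""
    return 0.5 * n * (1.0 + (1.0 - 1.0 / n) ** t)


def t_alpha(n, alpha):
    """t_alpha = (1/2) n log n - n log(alpha), rounded down to an integer time."""
    return math.floor(0.5 * n * math.log(n) - n * math.log(alpha))


def chebyshev_bound(alpha):
    """(n/4) / ((alpha/2)^2 n), which simplifies to 1 / alpha^2."""
    return 1.0 / alpha ** 2


n, alpha = 10 ** 6, 3.0
excess = mean_weight(n, t_alpha(n, alpha)) - n / 2
print(excess, (alpha / 2) * math.sqrt(n), chebyshev_bound(alpha))
```

For a fixed $\varepsilon < 1/2$, any $\alpha$ with $1/\alpha^2 \le 1/2 - \varepsilon$ makes the Chebyshev step go through.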
The previous proof relies on a “distinguishing statistic.” Recall from Lemma 4.1.19 that for any random variables $X, Y$ and mapping $h$ it holds that
$$\|\mu_{h(X)} - \mu_{h(Y)}\|_{\mathrm{TV}} \le \|\mu_X - \mu_Y\|_{\mathrm{TV}},$$
where $\mu_Z$ is the law of $Z$. The mapping used in the proof of the claim is the Hamming weight. In essence, we gave a lower bound on the total variation distance between the laws of the Hamming weight at stationarity and under $P^t(z_0, \cdot)$. See Exercise 4.16 for a more general treatment of the distinguishing statistic approach.
Remark 4.3.8. The upper bound in Claim 4.3.6 is indeed off by a factor of 2. See [LPW06, Theorem 18.3] for an improved upper bound and a discussion of the so-called cutoff phenomenon. The latter refers to the fact that for all $0 < \varepsilon < 1/2$ it can be shown in this case that
$$\lim_{n \to +\infty} \frac{t^{(n)}_{\mathrm{mix}}(\varepsilon)}{t^{(n)}_{\mathrm{mix}}(1 - \varepsilon)} = 1,$$
where $t^{(n)}_{\mathrm{mix}}(\varepsilon)$ is the mixing time on the $n$-dimensional hypercube. In words, for large $n$, the total variation distance drops from 1 to 0 in a short time window. See Exercise 5.10 for a necessary condition for cutoff.
b-ary tree
Let $(Z_t)_{t \in \mathbb{Z}_+}$ be lazy simple random walk on the $\ell$-level rooted $b$-ary tree, $\widehat{\mathbb{T}}_b^{\ell}$, with $\ell \ge 2$. The root, 0, is on level 0 and the leaves, $L$, are on level $\ell$. All vertices have degree $b + 1$, except for the root which has degree $b$ and the leaves which have degree 1. By Example 1.1.29 (noting that laziness makes no difference), the stationary distribution is
$$\pi(x) := \frac{\delta(x)}{2(n - 1)},$$
where $n$ is the number of vertices and $\delta(x)$ is the degree of $x$. We used that a tree on $n$ vertices has $n - 1$ edges (Corollary 1.1.7). We construct a coupling $(X_t, Y_t)$ of this chain started at $(x, y)$. Assume without loss of generality that $x$ is no further from the root than $y$, which we denote by $x \preccurlyeq y$ (which, here, does not mean that $y$ is a descendant of $x$). The coupling has two stages:
- In the first stage, at each time, flip a fair coin. On heads, $Y_t$ stays put and $X_t$ moves one step chosen uniformly at random among its neighbors. Similarly, on tails, reverse the roles of $X_t$ and $Y_t$. Do this until $X_t$ and $Y_t$ are on the same level.
- In the second stage, that is, once the two chains are on the same level, at each time first let $X_t$ move as a lazy simple random walk on $\widehat{\mathbb{T}}_b^{\ell}$. Then let $Y_t$ move in the same direction as $X_t$, that is, if $X_t$ moves closer to the root, so does $Y_t$ and so on.
By construction, $X_t \preccurlyeq Y_t$ for all $t$. The key observation is the following. Let $\tau^*$ be the first time $(X_t)$ visits the root after visiting the leaves. By time $\tau^*$, the two chains have necessarily met: because $X_t \preccurlyeq Y_t$, when $X_t$ reaches the leaves, so does $Y_t$; after that time, the coupling is in the second stage so $X_t$ and $Y_t$ remain on the same level; in particular, when $X_t$ reaches the root (after visiting the leaves), so does $Y_t$. Hence $\tau_{\mathrm{coal}} \le \tau^*$. Intuitively, the mixing time is indeed dominated by the time it takes to reach the root from the worst starting point, a leaf. See Figure 4.7 and the corresponding lower bound argument.
To estimate $\mathbb{P}_{(x,y)}[\tau^* > t]$, we use Markov’s inequality (Theorem 2.1.1), for which we need a bound on $\mathbb{E}_{(x,y)}[\tau^*]$. We note that $\mathbb{E}_{(x,y)}[\tau^*]$ is less than the mean time for the walk to go from the root to the leaves and back. Let $L_t$ be the level of $X_t$ and let $\mathcal{N}$ be the corresponding network (where the conductances are equal to the number of edges on each level of the tree). In terms of $L_t$, the quantity we seek to bound is the mean of $\tau_{0,\ell}$, the commute time of the chain $(L_t)$ between the states 0 and $\ell$. By the commute time identity (Theorem 3.3.34),
$$\mathbb{E}[\tau_{0,\ell}] = c_{\mathcal{N}}\, R(0 \leftrightarrow \ell), \tag{4.3.6}$$
where
$$c_{\mathcal{N}} = 2 \sum_{e = \{x, y\} \in \mathcal{N}} c(e) = 4(n - 1),$$
where we simply counted the number of edges in $\widehat{\mathbb{T}}_b^{\ell}$ and the extra factor of 2 accounts for self-loops. Using network reduction techniques, we computed the effective resistance $R(0 \leftrightarrow \ell)$ in Examples 3.3.21 and 3.3.22, without self-loops. Of course adding self-loops does not affect the effective resistance as we can use the same voltage and current. So, ignoring them, we get
$$R(0 \leftrightarrow \ell) = \sum_{j=0}^{\ell - 1} r(j, j + 1) = \sum_{j=0}^{\ell - 1} b^{-(j+1)} = \frac{1}{b} \cdot \frac{1 - b^{-\ell}}{1 - b^{-1}}, \tag{4.3.7}$$
which implies
$$\frac{1}{b} \le R(0 \leftrightarrow \ell) \le \frac{1}{b - 1} \le 1.$$
Finally, applying Theorem 4.3.2 and Markov’s inequality and using (4.3.6), we get
Claim 4.3.9.
$$t_{\mathrm{mix}}(\varepsilon) \le \frac{4n}{\varepsilon}.$$
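The resistance computation in (4.3.7) and the resulting commute-time bound can be verified in exact rational arithmetic for small trees. The sketch below takes the commute time to be $c_{\mathcal{N}} R(0 \leftrightarrow \ell)$ with $c_{\mathcal{N}} = 4(n-1)$, as in the commute time identity; the function names are illustrative.

```python
from fractions import Fraction


def num_vertices(b, ell):
    """Number of vertices of the ell-level rooted b-ary tree: 1 + b + ... + b^ell."""
    return sum(b ** j for j in range(ell + 1))


def resistance_sum(b, ell):
    """R(0 <-> ell) as the series sum of r(j, j+1) = b^{-(j+1)}."""
    return sum(Fraction(1, b ** (j + 1)) for j in range(ell))


def resistance_closed_form(b, ell):
    """Closed form on the right-hand side of (4.3.7)."""
    return Fraction(1, b) * (1 - Fraction(1, b ** ell)) / (1 - Fraction(1, b))


def commute_time(b, ell):
    """c_N * R(0 <-> ell) with c_N = 4 (n - 1)."""
    n = num_vertices(b, ell)
    return 4 * (n - 1) * resistance_sum(b, ell)


for b, ell in ((2, 3), (3, 4)):
    print(b, ell, resistance_sum(b, ell), commute_time(b, ell), 4 * num_vertices(b, ell))
```

Since $R(0 \leftrightarrow \ell) \le 1$, the commute time is at most $4n$, and Markov’s inequality then gives the $4n/\varepsilon$ bound of the claim.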
This time the diameter-based bound is far off. We have $\Delta = 2\ell = \Theta(\log n)$ and $\pi_{\min} = 1/(2(n - 1))$ so that Claim 5.2.25 gives
$$t_{\mathrm{mix}}(\varepsilon) \ge \frac{(2\ell)^2}{12 \log n + 4 \log(2(n - 1))} = \Omega(\log n),$$
Figure 4.7: Setup for the lower bound on the mixing time on a b-ary tree. (Here $b = 2$.)
on one side of the root to the leaves on the other, both of which have substantial weight under the stationary distribution. That typically takes time exponential in the diameter, that is, linear in $n$. Indeed one first has to reach the root, which by the gambler’s ruin problem (Example 3.1.43), takes a number of “excursions” that is exponential in $\ell$ (see Claim 3.1.44 (ii)).
Formally let $x_0$ be a leaf of $\widehat{\mathbb{T}}_b^{\ell}$ and let $A$ be the set of vertices “on the other side of the root (inclusively),” that is, vertices whose graph distance from $x_0$ is at least $\ell$. See Figure 4.7. Then $\pi(A) \ge 1/2$ by symmetry. We use the fact that
to bound the mixing time from below. We claim that, started at $x_0$, the walk takes time linear in $n$ to reach $A$ with nontrivial probability.
Consider again the level $L_t$ of $X_t$. Using the definition of the effective resistance (Definition 3.3.19) as well as the expression for it in (4.3.7), we have
$$\mathbb{P}_\ell[\tau_0 < \tau_\ell^+] = \frac{1}{c(\ell)\, R(0 \leftrightarrow \ell)} = \frac{1}{b^\ell} \cdot \frac{b - 1}{1 - b^{-\ell}} = \frac{b - 1}{b^\ell - 1} = O\left(\frac{1}{n}\right).$$
Hence, started from the leaves, the number of excursions back to the leaves needed to reach the root for the first time is geometric with success probability $O(n^{-1})$. Each such excursion takes time at least 2 (which corresponds to going right back to the leaves after the first step). So $P^t(x_0, A)$ is bounded above by the probability that at least one such excursion was successful among the first $t/2$ attempts. That is,
$$P^t(x_0, A) \le 1 - \left(1 - O(n^{-1})\right)^{t/2} < \frac{1}{2} - \varepsilon,$$
for all $t \le \alpha_\varepsilon n$ with $\alpha_\varepsilon > 0$ small enough and
where the infimum is over all paths connecting $x$ and $y$ in $H_0$. We call a path achieving the infimum a minimum-weight path. It is straightforward to check that $w_0$ satisfies the triangle inequality. Let
$$w_0(u, v) \ge 1,$$
for all $\{u, v\} \in E_0$. Assume further that there exists $\kappa \in (0, 1)$ such that:
- (Local couplings) For all $x, y$ with $\{x, y\} \in E_0$, there is a coupling $(X^*, Y^*)$ of $P(x, \cdot)$ and $P(y, \cdot)$ satisfying the following contraction property
$$\mathbb{E}[w_0(X^*, Y^*)] \le \kappa\, w_0(x, y). \tag{4.3.8}$$
Then
$$d(t) \le \Delta_0 \kappa^t,$$
or
$$t_{\mathrm{mix}}(\varepsilon) \le \frac{\log \Delta_0 + \log \varepsilon^{-1}}{\log \kappa^{-1}}.$$
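The passage from $d(t) \le \Delta_0 \kappa^t$ to the mixing time bound is a one-line calculation: solve $\Delta_0 \kappa^t \le \varepsilon$ for $t$. A small sketch, with illustrative function names and parameter values, makes the arithmetic concrete.

```python
import math


def path_coupling_tmix_bound(delta0, kappa, eps):
    """(log Delta_0 + log(1/eps)) / log(1/kappa), for 0 < kappa < 1."""
    return (math.log(delta0) + math.log(1.0 / eps)) / math.log(1.0 / kappa)


def distance_bound(delta0, kappa, t):
    """The bound d(t) <= Delta_0 * kappa^t."""
    return delta0 * kappa ** t


for delta0, kappa, eps in ((10, 0.9, 0.25), (100, 0.99, 0.01)):
    t = math.ceil(path_coupling_tmix_bound(delta0, kappa, eps))
    print(delta0, kappa, eps, t, distance_bound(delta0, kappa, t))
```

At the first integer time past the bound, the estimate on $d(t)$ has dropped to $\varepsilon$ or below.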
Proof. The crux of the proof is to extend (4.3.8) to arbitrary pairs of vertices.
Claim 4.3.11 (Global coupling). For all $x, y \in V$, there is a coupling $(X^*, Y^*)$ of $P(x, \cdot)$ and $P(y, \cdot)$ such that (4.3.8) holds.
Iterating the coupling in this last claim immediately implies the existence of a coalescing Markovian coupling $(X_t, Y_t)$ of $P$ such that
Proof of Claim 4.3.11. Fix $x', y' \in V$ such that $\{x', y'\}$ is not an edge in the dissimilarity graph $H_0$. The idea is to combine the local couplings on a minimum-weight path between $x'$ and $y'$ in $H_0$. Let $x' = x_0 \sim \cdots \sim x_m = y'$ be such a path. For all $i = 0, \ldots, m - 1$, let $(Z^*_{i,0}, Z^*_{i,1})$ be a coupling of $P(x_i, \cdot)$ and $P(x_{i+1}, \cdot)$ satisfying the contraction property (4.3.8).
Then we proceed as follows. Set $Z^{(0)} := Z^*_{0,0}$ and $Z^{(1)} := Z^*_{0,1}$. Then iteratively pick $Z^{(i+1)}$ according to the law $\mathbb{P}[Z^*_{i,1} \in \cdot \mid Z^*_{i,0} = Z^{(i)}]$. By induction on $i$, $(X^*, Y^*) := (Z^{(0)}, Z^{(m)})$ is then a coupling of $P(x', \cdot)$ and $P(y', \cdot)$. See Figure 4.8.
Observe that
$$\sum_{z^{(i+1)}} R_i(z^{(i)}, z^{(i+1)}) = 1, \tag{4.3.9}$$
and
$$\sum_{z^{(i)}} P(x_i, z^{(i)})\, R_i(z^{(i)}, z^{(i+1)}) = P(x_{i+1}, z^{(i+1)}), \tag{4.3.10}$$
by construction of the coupling $(Z^*_{i,0}, Z^*_{i,1})$ and the definition of $R_i$. The law of the full coupling $(Z^{(0)}, \ldots, Z^{(m)})$ is
as required.
By the triangle inequality for $w_0$, the coupling $(X^*, Y^*)$ satisfies
$$\begin{aligned}
\mathbb{E}[w_0(X^*, Y^*)] &= \mathbb{E}\big[w_0(Z^{(0)}, Z^{(m)})\big] \\
&\le \sum_{i=0}^{m-1} \mathbb{E}\big[w_0(Z^{(i)}, Z^{(i+1)})\big] \\
&\le \kappa \sum_{i=0}^{m-1} w_0(x_i, x_{i+1}) \\
&= \kappa\, w_0(x', y'),
\end{aligned}$$
where, on the third line, we used (4.3.8) for adjacent pairs and the last line follows from the fact that we chose a minimum-weight path.
We illustrate the path coupling method in the next subsection. See Exer-
cise 4.17 for an optimal transport perspective on the path coupling method.
is the partition function. In this context, recall that vertices are often referred to as sites. The single-site Glauber dynamics (Definition 1.2.8) of the Ising model is the Markov chain on $\mathcal{X}$ which, at each time, selects a site $i \in V$ uniformly at random and updates the spin $\sigma_i$ according to $\mu_\beta(\sigma)$ conditioned on agreeing with $\sigma$ at all sites in $V \setminus \{i\}$. Specifically, for $\gamma \in \{-1, +1\}$, $i \in V$, and $\sigma \in \mathcal{X}$, let $\sigma^{i,\gamma}$ be the configuration $\sigma$ with the state at $i$ being set to $\gamma$. Then, letting $n = |V|$, the transition matrix of the Glauber dynamics is
$$\begin{aligned}
Q_\beta(\sigma, \sigma^{i,\gamma}) &:= \frac{1}{n} \cdot \frac{e^{\gamma \beta S_i(\sigma)}}{e^{-\beta S_i(\sigma)} + e^{\beta S_i(\sigma)}} \\
&= \frac{1}{n} \left[ \frac{1}{2} + \frac{1}{2} \tanh(\gamma \beta S_i(\sigma)) \right],
\end{aligned} \tag{4.3.11}$$
where
$$S_i(\sigma) := \sum_{j \sim i} \sigma_j.$$
All other transitions have probability 0. Recall that this chain is irreducible and reversible with respect to $\mu_\beta$. In particular $\mu_\beta$ is the stationary distribution of $Q_\beta$.
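The transition probabilities in (4.3.11) and the reversibility claim can be checked exhaustively on a small graph. The sketch below assumes the standard Ising weight $\mu_\beta(\sigma) \propto \exp\big(\beta \sum_{i \sim j} \sigma_i \sigma_j\big)$ (the definition of $\mu_\beta$ is not restated in the text above); the example graph, the value of $\beta$ and the function names are hypothetical.

```python
import itertools
import math


def local_field(sigma, i, adj):
    """S_i(sigma): sum of the spins at the neighbors of site i."""
    return sum(sigma[j] for j in adj[i])


def glauber_prob(sigma, i, gamma, adj, beta):
    """Q_beta(sigma, sigma^{i,gamma}) in the tanh form of (4.3.11)."""
    n = len(sigma)
    return (0.5 + 0.5 * math.tanh(gamma * beta * local_field(sigma, i, adj))) / n


def gibbs_weight(sigma, adj, beta):
    """Unnormalized weight exp(beta * sum over edges {i,j} of sigma_i sigma_j)."""
    pairs = sum(sigma[i] * sigma[j]
                for i in range(len(adj)) for j in adj[i] if i < j)
    return math.exp(beta * pairs)


def flip(sigma, i):
    """Configuration sigma with the spin at site i reversed."""
    return sigma[:i] + (-sigma[i],) + sigma[i + 1:]


# Hypothetical example graph: a 4-cycle with one chord, as adjacency lists.
adj = [[1, 3, 2], [0, 2], [1, 3, 0], [2, 0]]
beta = 0.3

sigma = (1, -1, 1, 1)
other = flip(sigma, 1)
lhs = gibbs_weight(sigma, adj, beta) * glauber_prob(sigma, 1, other[1], adj, beta)
rhs = gibbs_weight(other, adj, beta) * glauber_prob(other, 1, sigma[1], adj, beta)
print(lhs, rhs)  # detailed balance: the two sides should agree
```

The probabilities out of each configuration, summed over all sites and both values of $\gamma$, also add up to 1; the case $\gamma = \sigma_i$ corresponds to staying put.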
In this section we give an upper bound on the mixing time, $t_{\mathrm{mix}}(\varepsilon)$, of $Q_\beta$ using path coupling. We say that the Glauber dynamics is fast mixing if $t_{\mathrm{mix}}(\varepsilon) = O(n \log n)$. We first make a simple observation:
Proof. Similarly to what we did in Section 4.3.2 in the context of random walk on the hypercube (but for a lower bound this time), we use a coupon collecting argument (see Example 2.1.4). Let $\bar{\sigma}$ be the all-$(-1)$ configuration and let $A$ be the set of configurations where at least half of the sites are $+1$. Then, by symmetry, $\mu_\beta(A) = \mu_\beta(A^c) = 1/2$ where we assumed for simplicity that $n$ is odd. By definition of the total variation distance,
So it remains to show that by time $cn$, for $c > 0$ small, the chain is unlikely to have reached $A$. That happens if, say, fewer than a third of the sites have been updated. Using the notation of Example 2.1.4, we are seeking a bound on $T_{n,n/3}$, that is, the time to collect $n/3$ coupons out of $n$.
We can write this random variable as a sum of $n/3$ independent geometric variables $T_{n,n/3} = \sum_{i=1}^{n/3} \tau_{n,i}$, where $\mathbb{E}[\tau_{n,i}] = \left(1 - \frac{i-1}{n}\right)^{-1}$ and $\mathrm{Var}[\tau_{n,i}] \le \left(1 - \frac{i-1}{n}\right)^{-2}$. Hence, approximating the Riemann sums below by integrals,
$$\mathbb{E}[T_{n,n/3}] = \sum_{i=1}^{n/3} \left(1 - \frac{i-1}{n}\right)^{-1} = n \sum_{j=2n/3+1}^{n} j^{-1} = \Theta(n), \tag{4.3.13}$$
and
$$\mathrm{Var}[T_{n,n/3}] \le \sum_{i=1}^{n/3} \left(1 - \frac{i-1}{n}\right)^{-2} = n^2 \sum_{j=2n/3+1}^{n} j^{-2} = \Theta(n). \tag{4.3.14}$$
By Chebyshev’s inequality,
$$\mathbb{P}\big[|T_{n,n/3} - \mathbb{E}[T_{n,n/3}]| \ge \varepsilon n\big] \le \frac{\mathrm{Var}[T_{n,n/3}]}{(\varepsilon n)^2} \to 0,$$
by (4.3.14). In view of (4.3.13), taking $\varepsilon > 0$ small enough and $n$ large enough, we have shown that for $t \le c_\varepsilon n$ for some $c_\varepsilon > 0$
which proves the claim by (4.3.12) and the definition of the mixing time (Definition 1.1.35).
Remark 4.3.14. In fact, Ding and Peres proved that tmix (ε) = Ω(n log n) for any graph
on n vertices [DP11]. In Claim 4.3.7, we treated the special case of the empty graph,
which is equivalent to lazy random walk on the hypercube. See also Section 5.3.4 for a
much stronger lower bound at low temperature for certain graphs with good “expansion
properties.”
In our main result of this section, we show that the Glauber dynamics of the
Ising model is fast mixing when the inverse temperature β is small enough as a
function of the maximum degree.
Claim 4.3.15 (Glauber dynamics: fast mixing at high temperature).
$$\beta < \bar{\delta}^{-1} \implies t_{\mathrm{mix}}(\varepsilon) = O(n \log n).$$
Proof. We use path coupling. Let $H_0 = (V_0, E_0)$ where $V_0 := \mathcal{X}$ and $\{\sigma, \omega\} \in E_0$ if $\frac{1}{2} \|\sigma - \omega\|_1 = 1$ (i.e., they differ in exactly one coordinate) with unit weight on all edges. To avoid confusion, we reserve the notation $\sim$ for adjacency in $G$.
Let $\{\sigma, \omega\} \in E_0$ differ at coordinate $i$. We construct a coupling $(X^*, Y^*)$ of $Q_\beta(\sigma, \cdot)$ and $Q_\beta(\omega, \cdot)$. We first pick the same coordinate $i^*$ to update. If $i^*$ is such that all its neighbors in $G$ have the same state in $\sigma$ and $\omega$, that is, if $\sigma_j = \omega_j$ for all $j \sim i^*$, we update $X^*$ from $\sigma$ according to the Glauber rule and set $Y^*_{i^*} := X^*_{i^*}$. Note that this includes the case $i^* = i$. Otherwise, that is, if $i^* \sim i$, we proceed as follows. From the state $\sigma$, the probability of updating site $i^*$ to state $\gamma \in \{-1, +1\}$ is given by the expression in brackets in (4.3.11), and similarly for $\omega$. Unlike the previous case, we cannot guarantee that the update is identical in both chains. In order to minimize the chance of increasing the distance between the two chains, we use a monotone coupling, which recall from Example 4.1.17 is maximal in the two-state case. Specifically, we pick a uniform random variable $U$ in $[-1, 1]$ and set
$$X^*_{i^*} := \begin{cases} +1 & \text{if } U \le \tanh(\beta S_{i^*}(\sigma)), \\ -1 & \text{otherwise,} \end{cases}$$
and
$$Y^*_{i^*} := \begin{cases} +1 & \text{if } U \le \tanh(\beta S_{i^*}(\omega)), \\ -1 & \text{otherwise.} \end{cases}$$
We set $X^*_j := \sigma_j$ and $Y^*_j := \omega_j$ for all $j \neq i^*$. The expected distance between $X^*$ and $Y^*$ is then
$$\mathbb{E}[w_0(X^*, Y^*)] = 1 - \underbrace{\frac{1}{n}}_{(a)} + \underbrace{\frac{1}{n} \sum_{j : j \sim i} \frac{1}{2} \big| \tanh(\beta S_j(\sigma)) - \tanh(\beta S_j(\omega)) \big|}_{(b)}, \tag{4.3.15}$$
where
$$s := S_j(\sigma) \wedge S_j(\omega).$$
The derivative of $\tanh$ is maximized at 0 where it is equal to 1. So the right-hand side of (4.3.16) is $\le \beta(s + 2) - \beta s = 2\beta$. Plugging this back into (4.3.15) and using $1 - x \le e^{-x}$ for all $x$ (see Exercise 1.16), we get
$$\mathbb{E}[w_0(X^*, Y^*)] \le 1 - \frac{1 - \bar{\delta} \beta}{n} \le \exp\left(-\frac{1 - \bar{\delta} \beta}{n}\right) = \kappa\, w_0(\sigma, \omega),$$
where
$$\kappa := \exp\left(-\frac{1 - \bar{\delta} \beta}{n}\right) < 1,$$
by our assumption on $\beta$. The diameter of $H_0$ is $\Delta_0 = n$. By Theorem 4.3.10,
Setting. The basic setup is a sum of Bernoulli (i.e., $\{0, 1\}$-valued) random variables $\{X_i\}_{i=1}^n$
$$W = \sum_{i=1}^n X_i, \tag{4.4.1}$$
where the $X_i$s are not assumed independent or identically distributed. Define
$$p_i := \mathbb{P}[X_i = 1], \tag{4.4.2}$$
and
$$\mathbb{E}[W] = \lambda := \sum_{i=1}^n p_i. \tag{4.4.3}$$
Letting $\mu$ denote the law of $W$ and $\pi$ be the Poisson distribution with mean $\lambda$, our goal is to bound $\|\mu - \pi\|_{\mathrm{TV}}$.
We first state the main bounds and give some examples of their use. We then motivate and prove the result, and return to further applications. Throughout the next two subsections, we use the notation in (4.4.1), (4.4.2) and (4.4.3).
By bounding the right-hand side of (4.4.5) for any function satisfying (4.4.6), we
get a Poisson approximation result for µ.
One way to do this is to construct a certain type of coupling. We begin with a
definition, which will be justified in the corollary below. We write X ∼ Y |A to
mean that X is distributed as Y conditioned on the event A.
Definition 4.4.3 (Stein coupling). A Stein coupling is a pair $(U_i, V_i)$, for each $i = 1, \ldots, n$, such that
$$U_i \sim W, \qquad V_i \sim W - 1 \mid X_i = 1.$$
Each pair $(U_i, V_i)$ is defined on a joint probability space, but different pairs need not be.
How such a coupling is constructed will become clearer in the examples below.
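Corollary 4.4.4. Let $(U_i, V_i)$, $i = 1, \ldots, n$, be a Stein coupling. Then
$$\|\mu - \pi\|_{\mathrm{TV}} \le (1 \wedge \lambda^{-1}) \sum_{i=1}^n p_i\, \mathbb{E}|U_i - V_i|. \tag{4.4.7}$$
Proof. By (4.4.5), using the facts that $\lambda = \sum_{i=1}^n p_i$ and $W = \sum_{i=1}^n X_i$, we get
$$\begin{aligned}
\|\mu - \pi\|_{\mathrm{TV}} &= \mathbb{E}\left[-\lambda h(W + 1) + W h(W)\right] \\
&= \mathbb{E}\left[-\left(\sum_{i=1}^n p_i\right) h(W + 1) + \left(\sum_{i=1}^n X_i\right) h(W)\right] \\
&= \sum_{i=1}^n \left(-p_i\, \mathbb{E}[h(W + 1)] + \mathbb{E}[X_i h(W)]\right) \\
&= \sum_{i=1}^n \left(-p_i\, \mathbb{E}[h(W + 1)] + \mathbb{E}[h(W) \mid X_i = 1]\, \mathbb{P}[X_i = 1]\right) \\
&= \sum_{i=1}^n p_i \left(-\mathbb{E}[h(W + 1)] + \mathbb{E}[h(W) \mid X_i = 1]\right).
\end{aligned}$$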
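Claim 4.4.6.
$$\|\mu - \pi\|_{\mathrm{TV}} \le (1 \wedge \lambda^{-1}) \sum_{i=1}^n p_i^2.$$
$$U_i = W$$
and
$$V_i = \sum_{j : j \neq i} X_j.$$
By independence,
$$V_i = W - X_i \sim W - 1 \mid X_i = 1,$$
as desired. Plugging into (4.4.7), we obtain the bound
$$\begin{aligned}
\|\mu - \pi\|_{\mathrm{TV}} &\le (1 \wedge \lambda^{-1}) \sum_{i=1}^n p_i\, \mathbb{E}|U_i - V_i| \\
&\le (1 \wedge \lambda^{-1}) \sum_{i=1}^n p_i\, \mathbb{E}\Big| W - \sum_{j \neq i} X_j \Big| \\
&\le (1 \wedge \lambda^{-1}) \sum_{i=1}^n p_i\, \mathbb{E}|X_i| \\
&\le (1 \wedge \lambda^{-1}) \sum_{i=1}^n p_i^2.
\end{aligned}$$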
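For independent Bernoulli variables, both sides of Claim 4.4.6 can be computed exactly: the law of $W$ by a simple dynamic program, and the total variation distance to the Poisson law by summing over the support. The sketch below does this for a few hypothetical choices of the $p_i$s; the function names are illustrative.

```python
import math


def sum_of_bernoullis_pmf(ps):
    """Law of W = X_1 + ... + X_n for independent Bernoulli(p_i), by convolution."""
    pmf = [1.0]
    for p in ps:
        new = [0.0] * (len(pmf) + 1)
        for k, m in enumerate(pmf):
            new[k] += m * (1 - p)
            new[k + 1] += m * p
        pmf = new
    return pmf


def poisson_pmf(lam, k):
    """Poisson(lam) mass at k, computed in log space."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))


def tv_to_poisson(ps):
    """Total variation distance between the law of W and Poisson(sum of p_i)."""
    lam = sum(ps)
    pmf = sum_of_bernoullis_pmf(ps)
    n = len(ps)
    head = sum(abs(pmf[k] - poisson_pmf(lam, k)) for k in range(n + 1))
    tail = 1.0 - sum(poisson_pmf(lam, k) for k in range(n + 1))  # Poisson mass above n
    return 0.5 * (head + max(tail, 0.0))


def chen_stein_bound(ps):
    """Right-hand side of Claim 4.4.6: (1 ∧ 1/lambda) * sum of p_i^2."""
    lam = sum(ps)
    return min(1.0, 1.0 / lam) * sum(p * p for p in ps)


for ps in ([0.1] * 20, [0.05, 0.2, 0.1, 0.3]):
    print(tv_to_poisson(ps), chen_stein_bound(ps))
```

In each printed pair the exact distance should not exceed the bound.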
$$X_i = \mathbf{1}\{\text{box } i \text{ is empty}\},$$
and let $W = \sum_{i=1}^n X_i$ be the number of empty boxes. Note that the $X_i$s are not independent. In particular, we cannot use Theorem 4.1.18. Note that
$$p_i = \left(1 - \frac{1}{n}\right)^k,$$
so by Corollary 4.4.4
$$\|\mu - \pi\|_{\mathrm{TV}} \le (1 \wedge \lambda^{-1}) \left\{ n^2 \left(1 - \frac{1}{n}\right)^{2k} - n(n - 1) \left(1 - \frac{2}{n}\right)^k \right\}.$$
Theorem 4.4.8 (Chen-Stein method: dissociated case). Suppose that for each $i$ there is a neighborhood $N_i \subseteq [n] \setminus \{i\}$ such that
$$X_i \text{ is independent of } \{X_j : j \notin N_i \cup \{i\}\}.$$
Then
$$\|\mu - \pi\|_{\mathrm{TV}} \le (1 \wedge \lambda^{-1}) \sum_{i=1}^n \Big\{ p_i^2 + \sum_{j \in N_i} \left(p_i p_j + \mathbb{E}[X_i X_j]\right) \Big\}.$$
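Because the law of $\{X_k : k \notin N_i \cup \{i\}\}$ (and therefore of the first term in $V_i$) is independent of the event $\{X_i = 1\}$, the above scheme satisfies the conditions of the Stein coupling.
Hence we can apply Corollary 4.4.4. The construction of $(U_i, V_i)$ guarantees that $U_i - V_i$ depends only on “$i$ and its neighborhood.” Specifically, we get
$$\begin{aligned}
\|\mu - \pi\|_{\mathrm{TV}} &\le (1 \wedge \lambda^{-1}) \sum_{i=1}^n p_i\, \mathbb{E}|U_i - V_i| \\
&= (1 \wedge \lambda^{-1}) \sum_{i=1}^n p_i\, \mathbb{E}\Big| \sum_{j=1}^n X_j - \sum_{k \notin N_i \cup \{i\}} X_k - \sum_{j \in N_i} Y_j^{(i)} \Big| \\
&= (1 \wedge \lambda^{-1}) \sum_{i=1}^n p_i\, \mathbb{E}\Big| X_i + \sum_{j \in N_i} \big(X_j - Y_j^{(i)}\big) \Big| \\
&\le (1 \wedge \lambda^{-1}) \sum_{i=1}^n p_i \Big\{ \mathbb{E}|X_i| + \sum_{j \in N_i} \big(\mathbb{E}|X_j| + \mathbb{E}|Y_j^{(i)}|\big) \Big\},
\end{aligned}$$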
Example 4.4.9 (Longest head run). Let $0 < q < 1$ and let $Z_1, Z_2, \ldots$ be i.i.d. Bernoulli random variables with success probability $q = \mathbb{P}[Z_i = 1]$. We are interested in the distribution of $R$, the length of the longest run of 1s starting in the first $n$ tosses. For any positive integer $t$, let $X_1^{(t)} := Z_1 \cdots Z_t$ and
$$X_i^{(t)} := (1 - Z_{i-1}) Z_i \cdots Z_{i+t-1}, \qquad i \ge 2.$$
The event $\{X_i^{(t)} = 1\}$ indicates that a head run of length at least $t$ starts at the $i$-th toss. Now define
$$W^{(t)} := \sum_{i=1}^n X_i^{(t)}.$$
$$X_i^{(t)} = (1 - Z_{i-1}) Z_i \cdots Z_{i+t-1},$$
and
$$X_{i+t+1}^{(t)} = (1 - Z_{i+t}) Z_{i+t+1} \cdots Z_{i+2t},$$
do not depend on any common $Z_j$, while $X_i^{(t)}$ and
$$X_{i+t}^{(t)} = (1 - Z_{i+t-1}) Z_{i+t} \cdots Z_{i+2t-1},$$
and, for $i \ge 2$,
$$\begin{aligned}
p_i^{(t)} &:= \mathbb{E}[(1 - Z_{i-1}) Z_i \cdots Z_{i+t-1}] \\
&= \mathbb{E}[1 - Z_{i-1}] \prod_{j=i}^{i+t-1} \mathbb{E}[Z_j] \\
&= (1 - q) q^t \\
&\le q^t.
\end{aligned}$$
For $i \ge 1$ and $j \in N_i^{(t)}$, observe that a head run of length at least $t$ cannot start simultaneously at $i$ and $j$. So $\mathbb{E}[X_i^{(t)} X_j^{(t)}] = 0$ in that case. We also have
We are ready to apply Theorem 4.4.8. We get, letting $\mu^{(t)}$ denote the law of $W^{(t)}$ and $\pi^{(t)}$ be the Poisson distribution with mean $\lambda^{(t)}$,
$$\begin{aligned}
\|\mu^{(t)} - \pi^{(t)}\|_{\mathrm{TV}} &\le \big(1 \wedge (\lambda^{(t)})^{-1}\big) \sum_{i=1}^n \Big\{ (p_i^{(t)})^2 + \sum_{j \in N_i^{(t)}} \big(p_i^{(t)} p_j^{(t)} + \mathbb{E}[X_i^{(t)} X_j^{(t)}]\big) \Big\} \\
&\le \frac{1}{(1 - q) n} \big(1 \wedge (n q^t)^{-1}\big) [2t + 1] (n q^t)^2.
\end{aligned}$$
This bound is non-asymptotic: it holds for any $q, n, t$. One special regime of note is $t = \log_{1/q} n + C$ with large $n$. In that case, we have $n q^t \to C'$ as $n \to +\infty$ for some $0 < C' < +\infty$ and the total variation above is of order $O(\log n / n)$. Going back to (4.4.8), we finally obtain when $t = \log_{1/q} n + C$ that
$$\Big| \mathbb{P}[R < t] - e^{-\lambda^{(t)}} \Big| = O\left(\frac{\log n}{n}\right),$$
$$\begin{aligned}
&= \mu(A^*) - \pi(A^*) \\
&= \sum_{z \in A^*} (\mu(z) - \pi(z)),
\end{aligned} \tag{4.4.9}$$
The probability of staying put is: $1 - \frac{1}{2n} \lambda$ if $x = 0$, $1 - \frac{1}{2n} x - \frac{1}{2n} \lambda$ if $1 \le x \le n$, and $1 - \frac{1}{2n} \lambda \frac{\pi(n)}{1 - \Pi(n)}$ if $x = n + 1$. Those are all strictly positive when $\lambda < n$. Hence by construction $P$ is aperiodic and irreducible, and it satisfies the detailed balance conditions (4.4.10).
Recalling (3.3.6), the Laplacian is
$$\Delta f(x) = \sum_y P(x, y) [f(y) - f(x)]$$
$$\Delta f(n + 1) = -\lambda \frac{\pi(n)}{1 - \Pi(n)}\, g(n + 1).$$
It is a standard fact (see Exercise 4.19) that the expectation of the Laplacian under the stationary distribution is 0. Inverting the relationship (4.4.12), for any $g : \{0, \ldots, n + 1\} \to \mathbb{R}$, there is a corresponding $f$ (unique up to an additive constant). So we have shown that if $Z \sim \bar{\pi}$ then
$$\mathbb{E}\left[ (\lambda g(Z + 1) - Z g(Z)) \mathbf{1}_{\{Z \le n\}} - \lambda \frac{\pi(n)}{1 - \Pi(n)}\, g(Z) \mathbf{1}_{\{Z = n+1\}} \right] = 0,$$
that is,
$$\mathbb{E}\left[ (\lambda g(Z + 1) - Z g(Z)) \mathbf{1}_{\{Z \le n\}} \right] = \lambda \pi(n) g(n + 1).$$
Notice that, if $g$ is extended to a bounded function on $\mathbb{Z}_+$, $\lambda$ is fixed and $Z \sim \mathrm{Poi}(\lambda)$, then taking $n \to +\infty$ recovers Theorem 4.4.1 by dominated convergence (Proposition B.4.14).*
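The limiting identity, $\mathbb{E}[\lambda g(Z + 1) - Z g(Z)] = 0$ for $Z \sim \mathrm{Poi}(\lambda)$ and bounded $g$, is easy to test numerically by truncating the Poisson sum far in the tail. The sketch below does so for a few bounded test functions; the cutoff of 200 and the function names are illustrative choices.

```python
import math


def poisson_pmf(lam, k):
    """Poisson(lam) mass at k, computed in log space for stability."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))


def stein_expectation(lam, g, cutoff=200):
    """Approximate E[lam * g(Z+1) - Z * g(Z)] for Z ~ Poi(lam), truncated at cutoff."""
    return sum(poisson_pmf(lam, z) * (lam * g(z + 1) - z * g(z))
               for z in range(cutoff + 1))


def alternating_sign(k):
    """A bounded test function: +1 on even integers, -1 on odd integers."""
    return 1.0 if k % 2 == 0 else -1.0


for lam in (0.5, 3.0, 10.0):
    print(lam, stein_expectation(lam, math.sin), stein_expectation(lam, alternating_sign))
```

All printed values should be zero up to floating-point error, since the truncation error is negligible for these values of $\lambda$.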
where the subscript of $\mathbb{E}$ indicates the initial state. We will later take expectations over $\mu$ and use the fact that $\pi(z) = \bar{\pi}(z)$ on $\{0, 1, \ldots, n\}$ to interpolate between $\mu$ and $\pi$.
First, we use standard Markov chains facts to compute (4.4.13). Define for $y \in \{1, \ldots, n + 1\}$
$$g_z^t(y) := \frac{1}{2n} \sum_{s=0}^{t-1} \left(\mathbb{E}_y[\delta_z(X_s)] - \mathbb{E}_{y-1}[\delta_z(X_s)]\right), \tag{4.4.14}$$
and $g_z^t(0) := 0$. The function $g_z^t(y)$ is, up to a factor (whose purpose will be clearer below), the difference between the expected number of visits to $z$ up to time $t - 1$ when started at $y$ and $y - 1$ respectively. It depends on $\mu$ only through $\lambda$ and $n$. By Chapman-Kolmogorov (Theorem 1.1.20) applied to the first step of the chain,
$$\mathbb{E}_y[\delta_z(X_{s+1})] = P(y, y + 1)\, \mathbb{E}_{y+1}[\delta_z(X_s)] + P(y, y)\, \mathbb{E}_y[\delta_z(X_s)] + P(y, y - 1)\, \mathbb{E}_{y-1}[\delta_z(X_s)].$$
Using that $P(y, y + 1) + P(y, y) + P(y, y - 1) = 1$ and rearranging we get for $0 \le y \le n$ and $0 \le z \le n + 1$
$$\begin{aligned}
\sum_{s=0}^{t-1} \mathbb{E}_y[\delta_z(X_s) - \delta_z(X_{s+1})] = \sum_{s=0}^{t-1} \Big\{ &- P(y, y + 1) \left(\mathbb{E}_{y+1}[\delta_z(X_s)] - \mathbb{E}_y[\delta_z(X_s)]\right) \\
&+ P(y, y - 1) \left(\mathbb{E}_y[\delta_z(X_s)] - \mathbb{E}_{y-1}[\delta_z(X_s)]\right) \Big\}
\end{aligned}$$
In fact, an explicit expression for $g_z^\infty$ can be derived via the following recursion. That expression will be helpful to establish the Lipschitz condition in Theorem 4.4.2.
Lemma 4.4.11. For all $0 \le y \le n$ and $0 \le z \le n + 1$,
Lemma 4.4.11 leads to the following formula for $g_z^\infty$, which we establish after the proof of the theorem.
Lemma 4.4.12. For $1 \le y \le n + 1$ and $0 \le z \le n + 1$,
$$g_z^\infty(y) = \begin{cases} \dfrac{\Pi(y - 1)}{y \pi(y)}\, \bar{\pi}(z), & \text{if } z \ge y, \\[2ex] -\dfrac{1 - \Pi(y - 1)}{y \pi(y)}\, \bar{\pi}(z), & \text{if } z < y. \end{cases} \tag{4.4.16}$$
Proof. Fix $z \in \{0, 1, \ldots, n\}$. Multiplying both sides in Lemma 4.4.11 by $\mu(y)$ and summing over $y$ in $\{0, 1, \ldots, n\}$ gives
Now summing over $z$ in $A^* \subseteq \{0, 1, \ldots, n\}$ and using (4.4.9) gives the claim.
Lemma 4.4.12 can be used to derive a Lipschitz constant for $g_A^\infty$. That lemma is also established after the proof of the theorem.
Lemma 4.4.14. For $A \subseteq \{0, 1, \ldots, n\}$ and $y, y' \in \{0, 1, \ldots, n + 1\}$,
$$|g_A^\infty(y') - g_A^\infty(y)| \le (1 \wedge \lambda^{-1}) |y' - y|.$$
Lemmas 4.4.13 and 4.4.14 imply the theorem with $h := g_{A^*}^\infty$.
Proof of Lemma 4.4.10. We use a coupling argument. Let (Ys , Ỹs )+∞ s=0 be an in-
dependent Markovian coupling of (Ys ), the chain started at y − 1, and (Ỹs ), the
chain started at y. Let τ be the first time s that Ys = Ỹs . Because Ys and Ỹs
are independent and P is a birth-death chain with strictly positive nearest-neighbor
and staying-put transition probabilities, the coupled chain (Ys , Ỹs )+∞
s=0 is aperiodic
and irreducible over {0, 1, . . . , n + 1}2 . By the exponential tail of hitting times,
Lemma 3.1.25, it holds that E[τ ] < +∞.
Modify the coupling (Ys , Ỹs ) to enforce Ỹs = Ys for all s ≥ τ (while not
changing (Ys )), that is, to make it coalescing. By the Strong Markov property
(Theorem 3.1.8), the resulting chain (Ys∗ , Ỹs∗ ) is also a Markovian coupling of the
chain started at y − 1 and y respectively. Using this coupling, we rewrite
t−1
1 X
gzt (y) = (Ey [δz (Xs )] − Ey−1 [δz (Xs )])
2n
s=0
t−1
1 X
= E[δz (Ỹs∗ ) − δz (Ys∗ )]
2n
s=0
" t−1 #
1 X
∗ ∗
= E (δz (Ỹs ) − δz (Ys )) .
2n
s=0
uniformly in t. Indeed, after s = τ , the terms in the sum are 0, while before
s = τ the terms are bounded by 1 in absolute value. By the integrability of τ ,
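the dominated convergence theorem (Proposition B.4.14) allows us to take the limit,
leading to
\[
\begin{aligned}
g_z^\infty(y) &= \lim_{t \to +\infty} \frac{1}{2n}\, E\left[ \sum_{s=0}^{t-1} \left( \delta_z(\tilde{Y}_s^*) - \delta_z(Y_s^*) \right) \right]\\
&= \frac{1}{2n}\, E\left[ \sum_{s=0}^{+\infty} \left( \delta_z(\tilde{Y}_s^*) - \delta_z(Y_s^*) \right) \right]\\
&< +\infty.
\end{aligned}
\]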
Proof of Lemma 4.4.12. Our starting point is Lemma 4.4.11, from which we deduce
the recursive formula
\[
g_z^\infty(y + 1) = \frac{1}{\lambda} \left\{ y\, g_z^\infty(y) + \pi(z) - \delta_z(y) \right\}, \tag{4.4.18}
\]
for 0 ≤ y ≤ n and 0 ≤ z ≤ n.
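We guess a general formula and then check it. By (4.4.18),
\[
g_z^\infty(1) = \frac{1}{\lambda} \{\pi(z) - \delta_z(0)\}, \tag{4.4.19}
\]
\[
\begin{aligned}
g_z^\infty(2) &= \frac{1}{\lambda} \{ g_z^\infty(1) + \pi(z) - \delta_z(1) \}\\
&= \frac{1}{\lambda} \left\{ \frac{1}{\lambda} \{\pi(z) - \delta_z(0)\} + \pi(z) - \delta_z(1) \right\}\\
&= \frac{1}{\lambda^2} \{\pi(z) - \delta_z(0)\} + \frac{1}{\lambda} \{\pi(z) - \delta_z(1)\},
\end{aligned}
\]
\[
\begin{aligned}
g_z^\infty(3) &= \frac{1}{\lambda} \{ 2 g_z^\infty(2) + \pi(z) - \delta_z(2) \}\\
&= \frac{1}{\lambda} \left\{ \frac{2}{\lambda^2} \{\pi(z) - \delta_z(0)\} + \frac{2}{\lambda} \{\pi(z) - \delta_z(1)\} + \pi(z) - \delta_z(2) \right\}\\
&= \frac{2}{\lambda^3} \{\pi(z) - \delta_z(0)\} + \frac{2}{\lambda^2} \{\pi(z) - \delta_z(1)\} + \frac{1}{\lambda} \{\pi(z) - \delta_z(2)\},
\end{aligned}
\]
and so forth. We posit the general formula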
\[
g_z^\infty(y) = \frac{(y-1)!}{\lambda^y} \sum_{k=0}^{y-1} \frac{\lambda^k}{k!} \{\pi(z) - \delta_z(k)\}, \tag{4.4.20}
\]
for 1 ≤ y ≤ n + 1 and 0 ≤ z ≤ n.
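The formula is straightforward to confirm by induction. Indeed it holds for
y = 1 as can be seen in (4.4.19) (and recalling that 0! = 1 by convention) and,
assuming it holds for y, we have by (4.4.18)
\[
\begin{aligned}
g_z^\infty(y + 1) &= \frac{1}{\lambda} \{ y\, g_z^\infty(y) + \pi(z) - \delta_z(y) \}\\
&= \frac{1}{\lambda} \left\{ y\, \frac{(y-1)!}{\lambda^y} \sum_{k=0}^{y-1} \frac{\lambda^k}{k!} \{\pi(z) - \delta_z(k)\} + \pi(z) - \delta_z(y) \right\}\\
&= \frac{y!}{\lambda^{y+1}} \sum_{k=0}^{y-1} \frac{\lambda^k}{k!} \{\pi(z) - \delta_z(k)\} + \frac{1}{\lambda} \{\pi(z) - \delta_z(y)\}\\
&= \frac{y!}{\lambda^{y+1}} \sum_{k=0}^{y} \frac{\lambda^k}{k!} \{\pi(z) - \delta_z(k)\},
\end{aligned}
\]
as desired.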
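We rewrite (4.4.20) according to whether the term $\delta_z(y) = 1\{z = y\}$ plays a
role in the equation. For z ≥ y > 0, the equation simplifies to
\[
\begin{aligned}
g_z^\infty(y) &= \frac{(y-1)!}{\lambda^y} \sum_{k=0}^{y-1} \frac{\lambda^k}{k!}\, \pi(z)\\
&= \frac{1}{y}\, \frac{y!}{e^{-\lambda} \lambda^y} \sum_{k=0}^{y-1} \frac{e^{-\lambda} \lambda^k}{k!}\, \pi(z)\\
&= \frac{\Pi(y-1)}{y\, \pi(y)}\, \pi(z).
\end{aligned}
\]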
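The recursion (4.4.18) and the closed form (4.4.20) can be compared numerically. The following is a minimal sketch, assuming that π is the Poi(λ) probability mass function and Π its cumulative distribution function (as the identity above suggests) and using the initial value $g_z^\infty(0) = 0$; the function names are hypothetical.

```python
import math

def poisson_pmf(lam, k):
    """Poi(lam) probability mass at k (assumed meaning of pi)."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def g_recursive(lam, z, y_max):
    """Values g(0), ..., g(y_max) from the recursion (4.4.18), with g(0) = 0."""
    g = [0.0]
    pi_z = poisson_pmf(lam, z)
    for y in range(y_max):
        delta = 1.0 if z == y else 0.0
        g.append((y * g[y] + pi_z - delta) / lam)
    return g

def g_closed(lam, z, y):
    """Closed form (4.4.20), for y >= 1."""
    pi_z = poisson_pmf(lam, z)
    total = sum(lam**k / math.factorial(k) * (pi_z - (1.0 if z == k else 0.0))
                for k in range(y))
    return math.factorial(y - 1) / lam**y * total

lam, z, y_max = 1.5, 3, 8
g = g_recursive(lam, z, y_max)
max_gap = max(abs(g[y] - g_closed(lam, z, y)) for y in range(1, y_max + 1))

# Case z >= y: compare with Pi(y-1) / (y pi(y)) * pi(z)
y = 2
cdf = sum(poisson_pmf(lam, k) for k in range(y))
simplified = cdf / (y * poisson_pmf(lam, y)) * poisson_pmf(lam, z)
case_gap = abs(g[y] - simplified)
```

Both `max_gap` and `case_gap` should be at the level of floating-point rounding.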
We have argued that all the terms in this last sum are non-negative—with the sole
exception of the term y = z. Hence, for a fixed 0 ≤ z ≤ n, it must be that
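\[
= \frac{e^{-\lambda}}{\lambda} \left\{ e^{\lambda} - 1 \right\} = \frac{1 - e^{-\lambda}}{\lambda}.
\]
For λ ≥ 1, we have $\frac{1 - e^{-\lambda}}{\lambda} \le \frac{1}{\lambda} = (1 \wedge \lambda^{-1})$, while for 0 < λ < 1 we have
$\frac{1 - e^{-\lambda}}{\lambda} \le \frac{\lambda}{\lambda} = 1 = (1 \wedge \lambda^{-1})$ by Exercise 1.16.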
It remains to consider the case y = 0. Recall that $g_z^\infty(0) = 0$. By Lemma 4.4.12,
for z ≥ 1,
And
\[
g_0^\infty(1) - g_0^\infty(0) = g_0^\infty(1) = -\frac{1 - \Pi(0)}{\pi(1)}\, \pi(0) = -\frac{1 - e^{-\lambda}}{\lambda}.
\]
So we have established (4.4.21) and that concludes the proof.
\[
p_n = C n^{-2/3},
\]
for some constant C > 0. We use the Chen-Stein method in the form of Theorem 4.4.8.
For an enumeration $S_1, \ldots, S_m$ of the 4-tuples of vertices in $G_n$, let $A_1, \ldots, A_m$
be the events that the corresponding 4-cliques are present and define $Z_i = 1_{A_i}$.
Then $W = \sum_{i=1}^m Z_i$ is the number of 4-cliques in $G_n$. We argued previously (see
Claim 2.3.4) that
\[
q_i := E[Z_i] = p_n^6,
\]
and
\[
\lambda := E[W] = \binom{n}{4} p_n^6.
\]
In our regime of interest, λ is of constant order.
Observe that the Zi s are not independent because the 4-tuples may share po-
tential edges. However they admit a neighborhood structure as in Theorem 4.4.8.
Specifically, for i = 1, . . . , m, define
Then the conditions of Theorem 4.4.8 are satisfied, that is, $Z_i$ is independent of
$\{Z_j : j \notin N_i \cup \{i\}\}$. We argued previously (again see Claim 2.3.4) that
\[
|N_i| = \binom{4}{3}(n - 4) + \binom{4}{2}\binom{n-4}{2} = \Theta(n^2),
\]
where the first term counts the number of $S_j$s sharing exactly three vertices with
$S_i$, in which case $E[Z_i Z_j] = p_n^9$, and the second term counts those sharing two, in
which case $E[Z_i Z_j] = p_n^{11}$.
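We are ready to apply the bound in Theorem 4.4.8. Let π be the Poisson distribution
with mean λ. Using the formulas above, we get when $p_n = C n^{-2/3}$
\[
\begin{aligned}
\|\mu - \pi\|_{TV}
&\le (1 \wedge \lambda^{-1}) \sum_{i=1}^{m} \Big\{ q_i^2 + \sum_{j \in N_i} (q_i q_j + E[Z_i Z_j]) \Big\}\\
&\le (1 \wedge \lambda^{-1}) \binom{n}{4}
\left\{ p_n^{12} + \binom{4}{3}(n-4)(p_n^{12} + p_n^{9}) + \binom{4}{2}\binom{n-4}{2}(p_n^{12} + p_n^{11}) \right\}\\
&= (1 \wedge \lambda^{-1})\, \Theta(n^4 p_n^{12} + n^5 p_n^{9} + n^6 p_n^{11})\\
&= (1 \wedge \lambda^{-1})\, \Theta(n^4 n^{-8} + n^5 n^{-6} + n^6 n^{-22/3})\\
&= (1 \wedge \lambda^{-1})\, \Theta(n^{-1}),
\end{aligned}
\]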
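To get a feel for the orders of magnitude, the explicit expression in the second line of the display can be evaluated directly. The sketch below is only an illustration: the function name and the default C = 1 are hypothetical choices, not part of the text.

```python
from math import comb

def four_clique_bound(n, C=1.0):
    """Return (lambda, bound) for p_n = C * n**(-2/3).

    `bound` is (1 ^ 1/lambda) * binom(n,4) * {p^12 + 4(n-4)(p^12 + p^9)
    + 6 binom(n-4,2)(p^12 + p^11)}, as in the display above.
    """
    p = C * n ** (-2.0 / 3.0)
    lam = comb(n, 4) * p**6
    inner = (p**12
             + comb(4, 3) * (n - 4) * (p**12 + p**9)
             + comb(4, 2) * comb(n - 4, 2) * (p**12 + p**11))
    return lam, min(1.0, 1.0 / lam) * comb(n, 4) * inner

lam_small, bound_small = four_clique_bound(10**3)
lam_large, bound_large = four_clique_bound(10**6)
```

With C = 1, λ stays close to 1/24 while the bound decays like 1/(6n), in line with the Θ(n^{-1}) rate.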
Exercises
Exercise 4.1 (Harmonic function on Zd : unbounded). Give an example of an un-
bounded harmonic function on Z. Give one on Zd for general d. [Hint: What is the
simplest function after the constant one?]
Exercise 4.2 (Binomial vs. Binomial). Use coupling to show that
n ≥ m, q ≥ p =⇒ Bin(n, q) ⪰ Bin(m, p).
Exercise 4.3 (A chain that is not stochastically monotone). Consider random walk
on a network N = ((V, E), c) where V = {0, 1, . . . , n} and i ∼ j if and only if
|i − j| = 1 (in particular, not including self-loops). Show that the transition matrix
is, in general, not stochastically monotone (see Definition 4.2.16).
Exercise 4.4 (Increasing events: properties). Let F be a σ-algebra over the poset
X . Recall that an event A ∈ F is increasing if x ∈ A implies that any y ≥ x is also
in A and that a function f : X → R is increasing if x ≤ y implies f (x) ≤ f (y).
(i) Show that an event A ∈ F is increasing if and only if the indicator function
1A is increasing.
(ii) Let A, B ∈ F be increasing. Show that A ∩ B and A ∪ B are increasing.
(iii) An event A is decreasing if x ∈ A implies that any y ≤ x is also in A. Show
that A is decreasing if and only if Ac is increasing.
(iv) Let A, B ∈ F be decreasing. Show that A ∩ B and A ∪ B are decreasing.
Exercise 4.5 (Harris’ inequality: alternative proof). We say that f : Rn → R is
coordinatewise nondecreasing if it is nondecreasing in each variable while keeping
the other variables fixed.
(i) (Chebyshev’s association inequality) Let f : R → R and g : R → R be
coordinatewise nondecreasing and let X be a real random variable. Show
that
E[f (X)g(X)] ≥ E[f (X)]E[g(X)].
[Hint: Consider the quantity (f(X) − f(X′))(g(X) − g(X′)) where X′ is
an independent copy of X.]
(ii) (Harris’ inequality) Let f : Rn → R and g : Rn → R be coordinatewise
nondecreasing and let X = (X1 , . . . , Xn ) be independent real random vari-
ables. Show by induction on n that
E[f (X)g(X)] ≥ E[f (X)]E[g(X)].
[Hint: You may need Lemma B.6.15.]
Exercise 4.7 (FKG: sufficient conditions). Let X := {0, 1}F where F is finite and
let µ be a positive probability measure on X . We use the notation introduced in the
proof of Holley’s inequality (Theorem 4.2.32).
(i) To check the FKG condition, show that it suffices to check that, for all x ≤
y ∈ X and i ∈ F ,
\[
\frac{\mu(y^{i,1})}{\mu(y^{i,0})} \ge \frac{\mu(x^{i,1})}{\mu(x^{i,0})}.
\]
[Hint: Write µ(ω ∨ ω′)/µ(ω) as a telescoping product.]
(ii) To check the FKG condition, show that it suffices to check (4.2.15) only for
those ω, ω′ ∈ X such that ‖ω − ω′‖₁ = 2 and neither ω ≤ ω′ nor ω′ ≤ ω.
[Hint: Use (i).]
Exercise 4.8 (FKG and strong positive associations). Let X := {0, 1}F where F
is finite and let µ be a positive probability measure on X . For Λ ⊆ F and ξ ∈ X ,
let
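\[
X_\Lambda^\xi := \{\omega_\Lambda \times \xi_{\Lambda^c} : \omega_\Lambda \in \{0, 1\}^\Lambda\},
\]
where $\omega_\Lambda \times \xi_{\Lambda^c}$ agrees with ω on coordinates in Λ and with ξ on coordinates in
F \ Λ. Define the measure $\mu_\Lambda^\xi$ over $\{0, 1\}^\Lambda$ as
\[
\mu_\Lambda^\xi(\omega_\Lambda) := \frac{\mu(\omega_\Lambda \times \xi_{\Lambda^c})}{\mu(X_\Lambda^\xi)}.
\]
That is, $\mu_\Lambda^\xi$ is µ conditioned on agreeing with ξ on F \ Λ. The measure µ is said to
be strongly positively associated if $\mu_\Lambda^\xi$ is positively associated for all Λ and ξ.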
Prove that the FKG condition is equivalent to strong positive associations. [Hint:
Use Exercise 4.7 as well as the FKG inequality.]
(i) Let et be the minimum number of edges in a t-vertex union of k not mutually
vertex-disjoint triangles. Show that, for any k ≥ 2 and k ≤ t < 3k, it holds
that et > t.
(ii) Use Exercise 2.18 to give a second proof of the fact that $P[X_n = 0] \to e^{-\lambda^3/6}$.
Exercise 4.10 (RSW lemma: general α). Let $R_{n,\alpha}(p)$ be as defined in Section 4.2.5.
Show that for all n ≥ 2 (divisible by 4) and p ∈ (0, 1)
\[
R_{n,\alpha}(p) \ge \left( \frac{1}{2} \right)^{2\alpha - 2} R_{n,1}(p)^{6\alpha - 7}\, R_{n/2,1}(p)^{6\alpha - 6}.
\]
Exercise 4.11 (Primal and dual crossings). Modify the proof of Lemma 2.2.14 to
prove Lemma 4.2.41.
Exercise 4.12 (Square-root trick). Let µ be an FKG measure on {0, 1}F where F
is finite. Let A1 and A2 be increasing events with µ(A1 ) = µ(A2 ). Show that
\[
\mu(A_1) \ge 1 - \sqrt{1 - \mu(A_1 \cup A_2)}.
\]
Exercise 4.13 (Splitting: details). Show that P̃ , as defined in Example 4.3.3, is a
transition matrix on V provided z0 satisfies the condition there.
Exercise 4.14 (Doeblin’s condition in finite case). Let P be a transition matrix on
a finite state space.
(i) Show that Doeblin’s condition (see Example 4.3.3) holds when P is finite,
irreducible and aperiodic.
(ii) Show that Doeblin’s condition holds for lazy random walk on the hypercube
with s = n. Use it to derive a bound on the mixing time.
Exercise 4.15 (Mixing on cycles: lower bound). Let (Zt ) be lazy, simple random
walk on the cycle of size n, Zn := {0, 1 . . . , n − 1}, where i ∼ j if |j − i| = 1
(mod n). Assume n is divisible by 4 and fix 0 < ε < 1/2.
(i) Let A = {n/2, . . . , n − 1}. By coupling $(Z_t)$ with lazy, simple random walk
on Z, show that
\[
P^{\alpha n^2}(n/4, A) < \frac{1}{2} - \varepsilon,
\]
for α ≤ α_ε for some α_ε > 0. [Hint: You may want to use Chebyshev’s
inequality (Theorem 2.1.2) or Kolmogorov’s maximal inequality (Corollary 3.1.46).]
(ii) Deduce that
tmix (ε) ≥ αε n2 .
Exercise 4.17 (Path coupling and optimal transport). Let V be a finite state space
and let P be an irreducible transition matrix on V with stationary distribution π.
Let w0 be a metric on V . For probability measures µ, ν on V , let
(ii) Assume that the conditions of Theorem 4.3.10 hold. Show that for any prob-
ability measures µ, ν
Exercise 4.18 (Stein equation for the Poisson distribution). Let λ > 0. Show that
a non-negative integer-valued random variable Z is Poi(λ) if and only if for all g
bounded
E[λg(Z + 1) − Zg(Z)] = 0.
provided the sum is finite. Show that a probability distribution µ over V is stationary
for P if and only if for all bounded measurable functions f
\[
\sum_{x \in V} \mu(x)\, \Delta f(x) = 0.
\]
Exercise 4.20 (Chen-Stein method for positively related variables). Using the notation
in (4.4.1), (4.4.2) and (4.4.3), suppose that for each i we can construct a
coupling $\{(X_j^{(i)} : j = 1, \ldots, n), (Y_j^{(i)} : j \ne i)\}$ with $(X_j^{(i)})_j \sim (X_j)_j$ such that
Show that
\[
\|\mu - \pi\|_{TV} \le (1 \wedge \lambda^{-1}) \left\{ \mathrm{Var}(W) - \lambda + 2 \sum_{i=1}^{n} p_i^2 \right\}.
\]
Exercise 4.21 (Chen-Stein and 4-cliques). Use Exercise 4.20 to give an improved
asymptotic bound in the setting of Section 4.4.3.
Exercise 4.22 (Chen-Stein for negatively related variables). Using the notation
in (4.4.1), (4.4.2) and (4.4.3), suppose that for each i we can construct a coupling
$\{(X_j^{(i)} : j = 1, \ldots, n), (Y_j^{(i)} : j \ne i)\}$ with $(X_j^{(i)})_j \sim (X_j)_j$ such that
Show that
\[
\|\mu - \pi\|_{TV} \le (1 \wedge \lambda^{-1}) \left\{ \lambda - \mathrm{Var}(W) \right\}.
\]
Bibliographic remarks
Section 4.1 The coupling method is generally attributed to Doeblin [Doe38]. The
standard reference on coupling is [Lin02]. See that reference for a history of cou-
pling and a facsimile of Doeblin’s paper. See also [dH]. Section 4.1.2 is based
on [Per, Section 6] and Section 4.1.4 is based on [vdH17, Section 5.3].
Section 4.3 The material in Section 4.3 borrows heavily from [LPW06, Chapters
5, 14, 15] and [AF, Chapter 12]. Aldous [Ald83] was the first author to make
explicit use of coupling to bound total variation distance to stationarity of finite
Markov chains. The link between couplings of Markov chains and total variation
distance was also used by Griffeath [Gri75] and Pitman [Pit76]. Example 4.3.3 is
based on [Str14] and [JH01]. For a more general treatment, see [MT09, Chapter
16]. The proof of Claim 4.3.7 is partly based on [LPW06, Proposition 7.13]. See
also [DGG+00] and [HS07] for alternative proofs. Path coupling is due to Bubley
and Dyer [BD97]. The optimal transport perspective on the path coupling method
in Exercise 4.17 is from [LPW06, Chapter 14]. For more on optimal transport,
see for example [Vil09]. The main result in Section 4.3.4 is taken from [LPW06,
Theorem 15.1]. For more background on the so-called “critical slowdown” of the
Glauber dynamics of Ising and Potts models on various graphs, see [CDL+12,
LS12].
Spectral methods
5.1 Background
We first review some important concepts from linear algebra. In particular, we re-
call the spectral theorem as well as the variational characterization of eigenvalues.
We also derive a few perturbation results. We end this section with an application
Version: December 20, 2023
Modern Discrete Probability: An Essential Toolkit
Copyright © 2023 Sébastien Roch
327
CHAPTER 5. SPECTRAL METHODS 328
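where $y \in \mathbb{R}^{d_1}$, $z \in \mathbb{R}^{d_2}$, $A \in \mathbb{R}^{d_1 \times d_1}$, $B \in \mathbb{R}^{d_1 \times d_2}$, $C \in \mathbb{R}^{d_2 \times d_1}$, and
$D \in \mathbb{R}^{d_2 \times d_2}$. Then it follows by direct calculation that
\[
\begin{pmatrix} y \\ z \end{pmatrix}^T
\begin{pmatrix} A & B \\ C & D \end{pmatrix}
\begin{pmatrix} y \\ z \end{pmatrix}
= y^T A y + y^T B z + z^T C y + z^T D z. \tag{5.1.1}
\]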
We will also need the following linear algebra fact. Let v1 , . . . , vj be orthonor-
mal vectors in Rd , with j < d. Then they can be completed into an orthonormal
basis v1 , . . . , vd of Rd .
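A first eigenvector Let $A_1 = A$. Maximizing the objective function $\langle v, A_1 v \rangle$,
we let
\[
v_1 \in \arg\max\{\langle v, A_1 v \rangle : \|v\|_2 = 1\},
\]
and
\[
\lambda_1 = \max\{\langle v, A_1 v \rangle : \|v\|_2 = 1\}.
\]
Complete $v_1$ into an orthonormal basis of $\mathbb{R}^d$, $v_1, \hat{v}_2, \ldots, \hat{v}_d$, and form the block
matrix
\[
\hat{W}_1 := \begin{pmatrix} v_1 & \hat{V}_1 \end{pmatrix}
\]
where the columns of $\hat{V}_1$ are $\hat{v}_2, \ldots, \hat{v}_d$. Note that $\hat{W}_1$ is orthogonal by construction.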
Getting one step closer to diagonalization We show next that $\hat{W}_1$ gets us one
step closer to a diagonal matrix by similarity transformation. Note first that
\[
\hat{W}_1^T A_1 \hat{W}_1 = \begin{pmatrix} \lambda_1 & w_1^T \\ w_1 & A_2 \end{pmatrix}
\]
where $w_1 := \hat{V}_1^T A_1 v_1$ and $A_2 := \hat{V}_1^T A_1 \hat{V}_1$. The key claim is that $w_1 = 0$. This
follows from an argument by contradiction.
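Suppose $w_1 \ne 0$ and consider the unit vector
\[
z := \hat{W}_1 \times \frac{1}{\sqrt{1 + \delta^2 \|w_1\|_2^2}} \begin{pmatrix} 1 \\ \delta w_1 \end{pmatrix}
\]
which achieves objective value
\[
\begin{aligned}
z^T A_1 z &= \frac{1}{1 + \delta^2 \|w_1\|_2^2}
\begin{pmatrix} 1 \\ \delta w_1 \end{pmatrix}^T
\begin{pmatrix} \lambda_1 & w_1^T \\ w_1 & A_2 \end{pmatrix}
\begin{pmatrix} 1 \\ \delta w_1 \end{pmatrix}\\
&= \frac{1}{1 + \delta^2 \|w_1\|_2^2} \left( \lambda_1 + 2\delta \|w_1\|_2^2 + \delta^2 w_1^T A_2 w_1 \right),
\end{aligned}
\]
where we used (5.1.1). By the Taylor expansion,
\[
\frac{1}{1 + \epsilon^2} = 1 - \epsilon^2 + O(\epsilon^4),
\]
for δ small enough,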
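Next step of the induction Apply the same argument to the symmetric submatrix
$A_2 \in \mathbb{R}^{(d-1) \times (d-1)}$, let $\hat{W}_2 \in \mathbb{R}^{(d-1) \times (d-1)}$ be the corresponding orthogonal
matrix, and define $\lambda_2$ and $A_3$ through the equation
\[
\hat{W}_2^T A_2 \hat{W}_2 = \begin{pmatrix} \lambda_2 & 0 \\ 0 & A_3 \end{pmatrix}.
\]
Define the block matrix
\[
W_2 = \begin{pmatrix} 1 & 0 \\ 0 & \hat{W}_2 \end{pmatrix}
\]
and observe that
\[
\begin{aligned}
W_2^T W_1^T A_1 W_1 W_2
&= W_2^T \begin{pmatrix} \lambda_1 & 0 \\ 0 & A_2 \end{pmatrix} W_2\\
&= \begin{pmatrix} \lambda_1 & 0 \\ 0 & \hat{W}_2^T A_2 \hat{W}_2 \end{pmatrix}\\
&= \begin{pmatrix} \lambda_1 & 0 & 0 \\ 0 & \lambda_2 & 0 \\ 0 & 0 & A_3 \end{pmatrix}.
\end{aligned}
\]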
Proceeding similarly by induction gives the claim, with the final Q being the
product of the Wi s (which is orthogonal as the product of orthogonal matrices).
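One step of this construction can be checked numerically. The following is a minimal sketch assuming NumPy; it obtains the maximizer $v_1$ from `numpy.linalg.eigh` rather than by solving the maximization directly, completes it into an orthonormal basis with a QR factorization, and verifies that the off-diagonal block $w_1$ vanishes. All variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
M = rng.standard_normal((d, d))
A1 = (M + M.T) / 2                      # a random symmetric matrix

# v1 maximizes <v, A1 v> over unit vectors; eigh sorts eigenvalues ascending
eigvals, eigvecs = np.linalg.eigh(A1)
lam1, v1 = eigvals[-1], eigvecs[:, -1]

# Complete v1 into an orthonormal basis: QR of [v1 | random columns]
Q, _ = np.linalg.qr(np.column_stack([v1, rng.standard_normal((d, d - 1))]))
W1 = Q.copy()
W1[:, 0] = v1                           # QR may have flipped the sign of v1

T = W1.T @ A1 @ W1                      # block form [[lam1, w1^T], [w1, A2]]
w1 = T[1:, 0]
A2 = T[1:, 1:]
```

Here `np.linalg.norm(w1)` is numerically zero, and the eigenvalues of `A2` are the remaining eigenvalues of `A1`, which is what the induction exploits.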
Furthermore we have the following min-max formulas, which do not depend on the
choice of spectral decomposition, for all k = 1, . . . , d
Note that, in all these formulas, the vector u = vk is optimal. To derive the “local”
formula, the first ones above, we expand a vector in Vk into the basis v1 , . . . , vk
and use the fact that RA (vi ) = λi and that eigenvalues are in nonincreasing order.
The “global” formulas then follow from a dimension argument.
We will need the following dimension-based fact. Let U, V ⊆ Rd be linear
subspaces such that dim(U) + dim(V) > d, where dim(U) denotes the dimension
of U. Then there exists a nonzero vector in the intersection U ∩ V. That is,
Proof of Theorem 5.1.3. We first prove the local formulas, that is, the ones involv-
ing a specific decomposition.
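\[
\langle u, Au \rangle = \left\langle u, \sum_{i=1}^{k} \langle u, v_i \rangle \lambda_i v_i \right\rangle = \sum_{i=1}^{k} \lambda_i \langle u, v_i \rangle^2.
\]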
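Thus,
\[
R_A(u) = \frac{\langle u, Au \rangle}{\langle u, u \rangle}
= \frac{\sum_{i=1}^{k} \lambda_i \langle u, v_i \rangle^2}{\sum_{i=1}^{k} \langle u, v_i \rangle^2}
\ge \lambda_k\, \frac{\sum_{i=1}^{k} \langle u, v_i \rangle^2}{\sum_{i=1}^{k} \langle u, v_i \rangle^2}
= \lambda_k
\]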
Global formulas Since $V_k$ has dimension k, it follows from the local formula
that
\[
\lambda_k = \min_{u \in V_k} R_A(u) \le \max_{\dim(V) = k}\, \min_{u \in V} R_A(u).
\]
Let V be any subspace with dimension k. Because $W_{d-k+1}$ has dimension d − k + 1,
we have that $\dim(V) + \dim(W_{d-k+1}) > d$ and there must be a nonzero vector $u_0$
in the intersection $V \cap W_{d-k+1}$ by the dimension-based fact above. We then have
by the other local formula that
\[
\lambda_k = \max_{u \in W_{d-k+1}} R_A(u) \ge R_A(u_0) \ge \min_{u \in V} R_A(u).
\]
Combining with the inequality in the other direction above gives the claim. The
other global formula is proved similarly.
Since the latter quantity is always nonnegative, it also implies that L is positive
semidefinite.
0 ≤ µ1 ≤ µ 2 ≤ · · · ≤ µn ,
In order for this to hold, it must be that any two adjacent vertices i and j
have yi = yj . That is, {i, j} ∈ E implies yi = yj . Furthermore, because G is
connected, between any two of its vertices u and v (adjacent or not) there is a path
u = w0 ∼ · · · ∼ wk = v along which the yw s must be the same. Thus y is a
constant vector.
But that is a contradiction since the eigenvectors y1 , . . . , yn are in fact linearly
independent, so that y1 and y2 cannot both be a constant vector.
µn ≥ δ̄ + 1.
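Proof. Let u ∈ V be a vertex with degree $\bar{\delta}$. Let z be the vector with entries
\[
z_i = \begin{cases}
\bar{\delta} & \text{if } i = u,\\
-1 & \text{if } \{i, u\} \in E,\\
0 & \text{otherwise},
\end{cases}
\]
and let x be the unit vector $z/\|z\|_2$. By definition of the degree of u, $\|z\|_2^2 =
\bar{\delta}^2 + \bar{\delta}(-1)^2 = \bar{\delta}(\bar{\delta} + 1)$.
Using Lemma 5.1.5,
\[
\begin{aligned}
\langle z, Lz \rangle &= \sum_{e = \{i,j\} \in E} (z_i - z_j)^2\\
&\ge \sum_{i : \{i,u\} \in E} (z_i - z_u)^2\\
&= \sum_{i : \{i,u\} \in E} (-1 - \bar{\delta})^2\\
&= \bar{\delta}(\bar{\delta} + 1)^2,
\end{aligned}
\]
where we restricted the sum to those edges incident with u and used the fact that
all terms in the sum are nonnegative. Finally
\[
\langle x, Lx \rangle = \left\langle \frac{z}{\|z\|_2}, L \frac{z}{\|z\|_2} \right\rangle
= \frac{1}{\|z\|_2^2} \langle z, Lz \rangle
\ge \frac{\bar{\delta}(\bar{\delta} + 1)^2}{\bar{\delta}(\bar{\delta} + 1)} = \bar{\delta} + 1,
\]
so that
\[
\mu_n = \max\{\langle x', Lx' \rangle : \|x'\|_2 = 1\} \ge \langle x, Lx \rangle \ge \bar{\delta} + 1,
\]
as claimed.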
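The test vector z used in the proof is easy to try out on a small graph. The sketch below assumes NumPy and an arbitrarily chosen example graph; it builds the Laplacian L = D − A, forms z around a maximum-degree vertex and compares the Rayleigh quotient of z with the largest Laplacian eigenvalue.

```python
import numpy as np

# A small example graph on 5 vertices (hypothetical choice)
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (3, 4)]
n = 5
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
deg = A.sum(axis=1)
L = np.diag(deg) - A                  # graph Laplacian

u = int(np.argmax(deg))               # a vertex of maximum degree
dbar = deg[u]
z = -A[u].copy()                      # -1 on the neighbors of u, 0 elsewhere
z[u] = dbar                           # maximum degree at u itself

rayleigh = z @ L @ z / (z @ z)        # <x, Lx> for x = z / ||z||_2
mu_n = np.linalg.eigvalsh(L)[-1]      # largest Laplacian eigenvalue
```

One should observe `mu_n >= rayleigh >= dbar + 1`, matching the chain of inequalities in the proof.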
\[
\mu_2 = \min\left\{ \frac{\sum_{\{u,v\} \in E} (x_u - x_v)^2}{\sum_{u=1}^{n} x_u^2} : x = (x_1, \ldots, x_n) \ne 0,\ \sum_{u=1}^{n} x_u = 0 \right\}.
\]
\[
\mu_2 = \min_{x \in W_{d-1}} R_L(x).
\]
Since $y_1$ is constant and $W_{d-1}$ is the subspace orthogonal to it, this is equivalent
to restricting the minimization to those nonzero xs such that
\[
0 = \langle x, y_1 \rangle = \frac{1}{\sqrt{n}} \sum_{u=1}^{n} x_u.
\]
\[
R_L(x) = \frac{\langle x, Lx \rangle}{\langle x, x \rangle} = \frac{\sum_{\{u,v\} \in E} (x_u - x_v)^2}{\sum_{u=1}^{n} x_u^2}.
\]
over all centered unit vectors, y2 tends to assign similar coordinates to adjacent
vertices. A similar reasoning applies to the third Laplacian eigenvector, which in
addition is orthogonal to the second one. See Figure 5.1 for an illustration.
Figure 5.1: Top: A 3-by-3 grid graph with vertices located at independent uni-
formly random points in a square. Bottom: The same 3-by-3 grid graph with ver-
tices located at the coordinates corresponding to the second and third eigenvectors
of the Laplacian matrix. That is, vertex i is located at position (y2,i , y3,i ).
and
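\[
\sum_{u=1}^{n} x_u^2 = \sum_{u \in V_1} \alpha^2 + \sum_{u \in V_2} \beta^2 = |V_1| \alpha^2 + |V_2| \beta^2 = 1.
\]
Replacing the first equation in the second one, we get
\[
|V_1| \left( \frac{-|V_2| \beta}{|V_1|} \right)^2 + |V_2| \beta^2 = \frac{|V_2|^2 \beta^2}{|V_1|} + |V_2| \beta^2 = 1,
\]
or
\[
\beta^2 = \frac{|V_1|}{|V_2|(|V_2| + |V_1|)} = \frac{|V_1|}{n |V_2|}.
\]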
Take
\[
\beta = -\sqrt{\frac{|V_1|}{n |V_2|}}, \qquad \alpha = \frac{-|V_2| \beta}{|V_1|} = \sqrt{\frac{|V_2|}{n |V_1|}}.
\]
The vector x we constructed is in fact an eigenvector of L. Indeed, let B be an
oriented incidence matrix of G. Then, for $e_k = \{u, v\}$, $(B^T x)_k$ is either $x_u - x_v$ or
$x_v - x_u$. In both cases, that is 0. So $Lx = BB^T x = 0$, that is, x is an eigenvector
of L with eigenvalue 0.
We have shown that $\mu_2 = 0$ when G has two connected components. A slight
modification of this argument shows that $\mu_2 = 0$ whenever G is not connected. ◀
It can be shown (see Exercise 5.2) that the Laplacian quadratic form satisfies in the
edge-weighted case
\[
\langle x, Lx \rangle = \sum_{\{i,j\} \in E} w_{ij} (x_i - x_j)^2, \tag{5.1.3}
\]
for x = (x1 , . . . , xn ) ∈ Rn . (The keen observer will have noticed that we al-
ready encountered this quantity as the “Dirichlet energy” in Section 3.3.3; more
on this in Section 5.3.) As a positive semidefinite matrix (see again Exercise 5.2),
the network Laplacian has an orthonormal basis of eigenvectors with nonnegative
eigenvalues that satisfy the variational characterization we derived above. In par-
ticular, if we denote the eigenvalues 0 = µ1 ≤ µ2 ≤ · · · ≤ µn , it follows from
Other variants of the Laplacian are useful. We introduce the normalized Lapla-
cian next.
\[
\begin{aligned}
\mathcal{L}^T &= I^T - (D^{-1/2} A D^{-1/2})^T\\
&= I - (D^{-1/2})^T A^T (D^{-1/2})^T\\
&= I - D^{-1/2} A D^{-1/2}\\
&= \mathcal{L}.
\end{aligned}
\]
by the properties of the Laplacian. Hence by the spectral theorem (Theorem 5.1.1),
we can write
\[
\mathcal{L} = \sum_{i=1}^{n} \eta_i z_i z_i^T,
\]
0 ≤ η1 ≤ η2 ≤ · · · ≤ ηn .
which makes z1 into a unit norm vector. The relationship to the Laplacian implies
(see Exercise 5.3) that
\[
x^T \mathcal{L} x = \sum_{\{i,j\} \in E} w_{ij} \left( \frac{x_i}{\sqrt{\delta(i)}} - \frac{x_j}{\sqrt{\delta(j)}} \right)^2,
\]
\[
\|A\|_2 := \max_{0 \ne x \in \mathbb{R}^m} \frac{\|Ax\|}{\|x\|} = \max_{x \in S^{m-1}} \|Ax\|.
\]
(ii) $\|A\|_2 \ge 0$
Proof. These properties all follow from the definition of the induced norm and the
corresponding properties for the vector norm:
\[
\|\alpha A x\|_2 = |\alpha| \|Ax\|_2
\]
and
Wd−k+1 (C) = span(vk (C), . . . , vd (C)).
The following lemma is one version of what is known as Weyl’s inequality.
Proof. Let H = B − A. We prove only one upper bound. The other one follows
from interchanging the roles of A and B. Because
it follows from (5.1.2) that Vj (B) ∩ Wd−j+1 (A) contains a nonzero vector. Let v
be a unit vector in that intersection.
By Theorem 5.1.3,
\[
\lambda_j(B) \le \langle v, (A + H)v \rangle = \langle v, Av \rangle + \langle v, Hv \rangle \le \lambda_j(A) + \langle v, Hv \rangle.
\]
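As a quick sanity check, the standard consequence of this argument, namely that $\max_j |\lambda_j(B) - \lambda_j(A)| \le \|B - A\|_2$ for symmetric A and B, can be tested on random matrices. This is a minimal sketch assuming NumPy; the matrices and the perturbation size are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 6

def random_symmetric(scale):
    """A random symmetric d x d matrix with entries of order `scale`."""
    M = scale * rng.standard_normal((d, d))
    return (M + M.T) / 2

A = random_symmetric(1.0)
H = random_symmetric(0.1)              # perturbation H = B - A
B = A + H

def eig_desc(S):
    """Eigenvalues in nonincreasing order, as in the text."""
    return np.sort(np.linalg.eigvalsh(S))[::-1]

eig_shift = np.max(np.abs(eig_desc(B) - eig_desc(A)))
norm_H = np.linalg.norm(H, 2)          # spectral (induced 2-) norm
```

The quantity `eig_shift` should never exceed `norm_H`.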
Then
\[
\min_{s \in \{+1, -1\}} \|v_i(A) - s\, v_i(B)\|_2^2 \le \frac{8 \|A - B\|_2^2}{\delta^2}.
\]
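The eigenvector bound can also be probed numerically. In the sketch below (NumPy assumed), δ is taken to be the gap between $\lambda_i(A)$ and the other eigenvalues of A; this reading of δ is an assumption based on how δ is used in the proof, and the index i is chosen to be that of the largest eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 6
M = rng.standard_normal((d, d))
A = (M + M.T) / 2
N = 0.05 * rng.standard_normal((d, d))
B = A + (N + N.T) / 2                       # small symmetric perturbation of A

lam_A, V_A = np.linalg.eigh(A)              # ascending eigenvalues
lam_B, V_B = np.linalg.eigh(B)
i = d - 1                                   # index of the largest eigenvalue

# Assumed definition: delta = distance from lambda_i(A) to the rest of the spectrum
delta = min(abs(lam_A[i] - lam_A[j]) for j in range(d) if j != i)

lhs = min(np.sum((V_A[:, i] - s * V_B[:, i]) ** 2) for s in (+1, -1))
rhs = 8 * np.linalg.norm(A - B, 2) ** 2 / delta ** 2
```

The minimum over the sign s is needed because eigenvectors are only determined up to sign; `lhs` should not exceed `rhs`.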
Proof. Expand $v_i(B)$ in the basis formed by the eigenvectors of A, that is,
\[
v_i(B) = \sum_{j=1}^{d} \langle v_i(B), v_j(A) \rangle\, v_j(A),
\]
where, on the last two lines, we used the orthonormality of the $v_j(A)$s and $v_j(B)$s
through Parseval’s identity, as well as the definition of δ.
On the other hand, letting E = A − B, by the triangle inequality
\[
\begin{aligned}
\|(A - \lambda_i(A) I)\, v_i(B)\|_2 &= \|(B + E - \lambda_i(A) I)\, v_i(B)\|_2\\
&\le \|(B - \lambda_i(A) I)\, v_i(B)\|_2 + \|E\, v_i(B)\|_2\\
&\le |\lambda_i(B) - \lambda_i(A)|\, \|v_i(B)\|_2 + \|E\|_2 \|v_i(B)\|_2\\
&\le 2 \|E\|_2,
\end{aligned}
\]
where we used Lemma 5.1.12 and Weyl’s inequality.
Combining the last two inequalities gives
\[
\left( 1 - \langle v_i(B), v_i(A) \rangle^2 \right) \le \frac{4 \|E\|_2^2}{\delta^2}.
\]
The result follows by noting that, since $|\langle v_i(B), v_i(A) \rangle| \le 1$ by Cauchy-Schwarz
(Theorem B.4.8), we have
\[
\min_{s \in \{+1, -1\}} \|v_i(A) - s\, v_i(B)\|_2^2 = 2 - 2 |\langle v_i(B), v_i(A) \rangle|
\]
\[
W = \begin{pmatrix} p & q \\ q & p \end{pmatrix},
\]
with rows and columns indexed by the community labels +1 and −1.
We assume that p ≥ q, encoding the fact that vertices belonging to the same community
are more likely to share an edge. To summarize, we say that $(X, G) \sim \mathrm{SBM}_{n,p,q}$ if:
1. (Communities) The assignment $X = (X_1, \ldots, X_n)$ is uniformly random
over
The role of s in this formula is to account for the fact that the community names
are not meaningful.
Now consider the following recovery requirements. These are asymptotic notions,
as n → +∞.
Definition 5.1.16 (Recovery requirement). Let $(X, G) \sim \mathrm{SBM}_{n,p,q}$. For any estimator
$\hat{X} := \hat{X}(G) \in \Pi_2^n$, we say that it achieves:
Next we establish sufficient conditions for almost exact recovery. First we describe
a natural estimator X̂.
MAP estimator and spectral clustering A natural starting point is the maximum
a posteriori (MAP) estimator. Let Ω(X) be the balanced partition of [n]
corresponding to X and $\hat{\Omega}(G)$ be the one corresponding to $\hat{X}(G)$. The probability
of error, that is, the probability of not recovering the true partition, is given by
\[
P[\Omega(X) \ne \hat{\Omega}(G)] = \sum_{g} P[\hat{\Omega}(g) \ne \Omega(X) \mid G = g]\, P[G = g], \tag{5.1.6}
\]
where the sum is over all graphs on n vertices (i.e., all possible subsets of edges
present) and we dropped the subscript n, p, q to simplify the notation. The MAP
estimator $\hat{\Omega}_{MAP}(G)$ is obtained by minimizing each term $P[\hat{\Omega}(g) \ne \Omega(X) \mid G = g]$
individually (note that P[G = g] > 0 for all g by definition of the $\mathrm{SBM}_{n,p,q}$, a
probability which does not depend on the estimator). Equivalently we choose for
each g a partition γ that maximizes the posterior probability
where we applied Bayes’ rule on the first line and the uniformity of the partition X
on the second line.
Based on (5.1.7), we seek a partition that maximizes P[G = g | Ω(X) = γ]. We
compute this last probability explicitly. For fixed g, let M := M (g) be the number
of edges in g. For any γ, denote by Min := Min (g, γ) and Mout := Mout (g, γ)
the number of edges within and across communities respectively, and note that
Min = M − Mout . By definition of the SBMn,p,q model, the probability of a graph
g given a partition γ is expressed simply as
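\[
\begin{aligned}
&P[G = g \mid \Omega(X) = \gamma]\\
&\quad = q^{M_{out}} (1 - q)^{(n/2)^2 - M_{out}}\, p^{M_{in}} (1 - p)^{\{\binom{n}{2} - (n/2)^2\} - M_{in}}\\
&\quad = q^{M_{out}} (1 - q)^{(n/2)^2 - M_{out}}\, p^{M - M_{out}} (1 - p)^{\{\binom{n}{2} - (n/2)^2\} - \{M - M_{out}\}}\\
&\quad = \left[ \frac{q}{1 - q} \cdot \frac{1 - p}{p} \right]^{M_{out}} \left\{ (1 - q)^{(n/2)^2}\, p^{M} (1 - p)^{\{\binom{n}{2} - (n/2)^2\} - M} \right\}.
\end{aligned}
\]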
The expression in curly brackets does not depend on the partition γ. Moreover,
since we assume that p ≥ q, we have that $\frac{q}{1-q} \cdot \frac{1-p}{p} \le 1$ (which can be checked
directly by rearranging and cancelling). Therefore, to maximize P[G = g | Ω(X) =
γ] over γ for a fixed g, we need to choose a partition that results in the smallest
possible value of $M_{out}$, the number of edges across the two communities. This
problem is well-known in combinatorial optimization, where it is referred to as
the minimum bisection problem. It is unfortunately NP-hard and we consider a
relaxation that admits a polynomial-time algorithmic solution.
To see how this comes about, observe that the minimum bisection problem can
be reformulated as
\[
\max_{x \in \{+1, -1\}^n,\ x^T \mathbf{1} = 0} x^T A x
\]
where we changed the notation from x to z to emphasize that the solution no longer
encodes a partition. We recognize the Rayleigh quotient of A as the objective func-
tion in the final formulation. At this point, it is tempting to use Courant-Fischer
(Theorem 5.1.3) and conclude that the maximum above is achieved at the second
eigenvalue of A. Note however that the vector 1 (appearing in the orthogonality
constraint zT 1 = 0) is not in general an eigenvector of A (unless the graph hap-
pens to be regular). To leverage the variational characterization of eigenvalues in
a statistically justified way, we instead turn to the expected adjacency matrix and
then establish concentration.
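The reformulation of minimum bisection as the maximization of $x^T A x$ over balanced sign vectors rests on the identity $x^T A x = 2(M_{in} - M_{out}) = 2M - 4M_{out}$. The following brute-force sketch checks it on a toy graph (two triangles joined by one edge, a hypothetical example); it enumerates all balanced labelings, which is only feasible for very small n.

```python
from itertools import combinations

n = 6
# Two triangles {0,1,2} and {3,4,5} joined by the edge (2, 3)
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
M = len(edges)

def labels(side):
    """Sign vector x with +1 on `side` and -1 elsewhere."""
    return [1 if v in side else -1 for v in range(n)]

def cut_size(x):
    """M_out: number of edges across the two communities."""
    return sum(1 for i, j in edges if x[i] != x[j])

def quadratic_form(x):
    """x^T A x for the adjacency matrix A (each edge is counted twice)."""
    return 2 * sum(x[i] * x[j] for i, j in edges)

bisections = [labels(set(s)) for s in combinations(range(n), n // 2)]
min_cut = min(cut_size(x) for x in bisections)
max_quad = max(quadratic_form(x) for x in bisections)
```

On this graph the minimum bisection separates the two triangles, and `max_quad` equals `2 * M - 4 * min_cut`.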
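Lemma 5.1.17 (Expected adjacency). Let $(X, G) \sim \mathrm{SBM}_{n,p,q}$, let A be the adjacency
matrix of G and let $A_X = E_{n,p,q}[A \mid X]$. Then
\[
A_X = n\, \frac{p + q}{2}\, u_1 u_1^T + n\, \frac{p - q}{2}\, u_2 u_2^T - p\, I,
\]
where
\[
u_1 = \frac{1}{\sqrt{n}}\, \mathbf{1}, \qquad u_2 = \frac{1}{\sqrt{n}}\, X.
\]
Proof. For any distinct pair i, j, the term
\[
\left( n\, \frac{p + q}{2}\, u_1 u_1^T \right)_{i,j} = n\, \frac{p + q}{2} \left( \frac{1}{\sqrt{n}} \right)^2 = \frac{p + q}{2}
\]
while the term
\[
\left( n\, \frac{p - q}{2}\, u_2 u_2^T \right)_{i,j} = n\, \frac{p - q}{2} \left( \frac{1}{\sqrt{n}} \right)^2 X_i X_j = \frac{p - q}{2}\, X_i X_j.
\]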
The product Xi Xj is 1 when i and j belong to the same community and is −1
otherwise. In the former case, summing the two terms indeed gives p, while in the
latter case it gives q. Finally, the term −pI accounts for the fact that A has zeros
on the diagonal.
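The decomposition in Lemma 5.1.17 can be verified entry by entry for a small balanced assignment. This is a minimal sketch assuming NumPy; the values of n, p, q and the alternating assignment X are arbitrary illustrative choices.

```python
import numpy as np

n, p, q = 8, 0.7, 0.2
X = np.array([1, -1] * (n // 2))             # a balanced assignment

u1 = np.ones(n) / np.sqrt(n)
u2 = X / np.sqrt(n)
A_X = (n * (p + q) / 2 * np.outer(u1, u1)
       + n * (p - q) / 2 * np.outer(u2, u2)
       - p * np.eye(n))

# Direct definition: p within a community, q across, 0 on the diagonal
expected = np.where(np.outer(X, X) == 1, p, q)
np.fill_diagonal(expected, 0.0)

top_two = np.sort(np.linalg.eigvalsh(A_X))[::-1][:2]
```

The matrices `A_X` and `expected` agree, and `top_two` contains n(p + q)/2 − p and n(p − q)/2 − p.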
Now condition on X and observe that u1 and u2 in Lemma 5.1.17 are orthog-
onal by our assumption that X corresponds to a balanced partition (i.e., with two
communities of equal size). Hence we deduce that an eigenvector decomposition of
AX is formed of u1 , u2 and any orthonormal basis of the orthogonal complement
of the span of u1 and u2 , with respective eigenvalues
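\[
n\, \frac{p + q}{2} - p, \qquad n\, \frac{p - q}{2} - p, \qquad -p.
\]
So the second largest eigenvalue of $A_X$ is $\lambda_2(A_X) = n\, \frac{p - q}{2} - p$ (independently of
X), and Courant-Fischer implies
\[
\max_{0 \ne z \in \mathbb{R}^n,\ z^T \mathbf{1} = 0} \frac{z^T A_X z}{z^T z} = \lambda_2(A_X).
\]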
Almost exact recovery We prove the following. We restrict ourselves to the case
where p and q are constants not depending on n.
Theorem 5.1.18. Let $(X, G) \sim \mathrm{SBM}_{n,p,q}$ and let A be the adjacency matrix of
G. Let $\mu := \min\{q, \frac{p - q}{2}\} > 0$. Clustering according to the sign of the second
eigenvector of A identifies the two communities of G with probability at least
$1 - e^{-n}$, except for $C/\mu^2$ misclassified nodes for some constant C > 0.
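The estimator in the theorem is straightforward to simulate. The following is a minimal sketch assuming NumPy; the sampler, the parameter values and the function names are hypothetical choices made for illustration, and the error count is taken up to a global sign flip since community names are not meaningful.

```python
import numpy as np

def sample_sbm(n, p, q, rng):
    """Sample a balanced assignment X and an adjacency matrix A."""
    X = np.array([1] * (n // 2) + [-1] * (n // 2))
    rng.shuffle(X)
    probs = np.where(np.outer(X, X) == 1, p, q)
    upper = np.triu(rng.random((n, n)) < probs, k=1)   # independent edges, i < j
    A = (upper | upper.T).astype(float)                # symmetric, zero diagonal
    return X, A

def spectral_clustering(A):
    """Cluster by the sign of the eigenvector of the second largest eigenvalue."""
    _, eigvecs = np.linalg.eigh(A)                     # ascending eigenvalues
    u2_hat = eigvecs[:, -2]
    return np.where(u2_hat >= 0, 1, -1)

rng = np.random.default_rng(0)
n, p, q = 200, 0.6, 0.2
X, A = sample_sbm(n, p, q, rng)
X_hat = spectral_clustering(A)
errors = int(min(np.sum(X_hat != X), np.sum(X_hat != -X)))
```

For constants p and q that are well separated, as here, `errors` is typically a very small fraction of n.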
There are two key ingredients to the proof: concentration of the adjacency matrix
and perturbation arguments.
We start with the former.
Lemma 5.1.19 (Norm of the centered adjacency matrix). Let $(X, G) \sim \mathrm{SBM}_{n,p,q}$,
let A be the adjacency matrix of G and let $A_X = E_{n,p,q}[A \mid X]$. There is a constant
C′ > 0 such that, conditioned on X,
\[
\|A - A_X\|_2 \le C' \sqrt{n},
\]
by adjusting the constant. Note that this bound holds for any X.
If the signs of (u2 )i and θ (û2 )i disagree, then the i-th term in the sum above is
≥ 1. So there can be at most C/µ2 such disagreements. That establishes the
desired bound on the number of misclassified nodes.
Remark 5.1.20. It was shown in [YP14, MNS15a, AS15] that almost exact recovery
in the balanced two-community model $\mathrm{SBM}_{n,p_n,q_n}$ with $p_n = a_n/n$ and
$q_n = b_n/n$ is achievable (and computationally efficiently so) if and only if
\[
\frac{(a_n - b_n)^2}{a_n + b_n} = \omega(1).
\]
On the other hand, it was shown in [ABH16, MNS15a] that exact recovery in the
$\mathrm{SBM}_{n,p_n,q_n}$ with $p_n = \alpha \log n / n$ and $q_n = \beta \log n / n$ is achievable and
computationally efficiently so if $\sqrt{\alpha} - \sqrt{\beta} > \sqrt{2}$ and not achievable if $\sqrt{\alpha} - \sqrt{\beta} < \sqrt{2}$.
and
\[
\|f\|_\pi^2 := \langle f, f \rangle_\pi.
\]
The inner product is well-defined since the series is summable by Hölder’s inequality
(Theorem B.4.8), which implies the Cauchy-Schwarz inequality
\[
\langle f, g \rangle_\pi \le \|f\|_\pi \|g\|_\pi.
\]
Minkowski’s inequality (Theorem B.4.9) implies the triangle inequality
\[
\|f + g\|_\pi \le \|f\|_\pi + \|g\|_\pi.
\]
The integral with respect to π (see Appendix B) reduces in this case to a sum
\[
\pi(f) := \sum_{x \in V} \pi(x) f(x),
\]
This shows that P f is well-defined since π > 0 and hence the series in square
brackets on the first line is finite for all x. Applying the same argument to kP f k2π
gives the inequality above.
Everything above holds whether or not P is reversible, so long as π is a stationary
measure. Now we use reversibility. We claim that, when P is reversible,
it is self-adjoint, that is,
\[
= \langle Pf, g \rangle_\pi,
\]
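So $M = (M(x, y))_{x,y} = D_\pi^{1/2} P D_\pi^{-1/2}$ is a symmetric matrix. By the spectral
theorem (Theorem 5.1.1), it has real eigenvectors $\{\phi_j\}_{j=1}^n$ forming an orthonormal
basis of $\mathbb{R}^n$ with corresponding real eigenvalues $\{\lambda_j\}_{j=1}^n$. Define $f_j := D_\pi^{-1/2} \phi_j$.
Then
\[
\begin{aligned}
P f_j &= P D_\pi^{-1/2} \phi_j\\
&= D_\pi^{-1/2} D_\pi^{1/2} P D_\pi^{-1/2} \phi_j\\
&= D_\pi^{-1/2} M \phi_j\\
&= \lambda_j D_\pi^{-1/2} \phi_j\\
&= \lambda_j f_j,
\end{aligned}
\]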
and
Because $\{\phi_j\}_{j=1}^n$ is an orthonormal basis of $\mathbb{R}^n$, we have that $\{f_j\}_{j=1}^n$ is an
orthonormal basis of $(\mathbb{R}^n, \langle \cdot, \cdot \rangle_\pi)$.
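The symmetrization argument translates directly into a numerical recipe for the eigenfunctions of a reversible chain. The sketch below assumes NumPy and builds a reversible P as a random walk on a weighted graph with symmetric positive weights (a hypothetical example); it then checks that the $f_j$ are eigenfunctions of P and orthonormal in $\langle \cdot, \cdot \rangle_\pi$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
C = rng.random((n, n))
C = C + C.T                                  # symmetric positive edge weights
P = C / C.sum(axis=1, keepdims=True)         # random walk on the network
pi = C.sum(axis=1) / C.sum()                 # its reversible stationary distribution

D_half = np.diag(np.sqrt(pi))
D_half_inv = np.diag(1.0 / np.sqrt(pi))
M = D_half @ P @ D_half_inv                  # M = D^{1/2} P D^{-1/2}

lam, Phi = np.linalg.eigh((M + M.T) / 2)     # symmetrize to remove rounding noise
F = D_half_inv @ Phi                         # columns are f_j = D^{-1/2} phi_j

gram = F.T @ np.diag(pi) @ F                 # matrix of <f_i, f_j>_pi
```

Here `P @ F` equals `F * lam` column by column, and `gram` is the identity matrix up to rounding.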
\[
\|f\|_\infty = \max_{x \in V} |f(x)|.
\]
When the chain is aperiodic, it cannot have an eigenvalue −1. Exercise 5.9 asks
for a proof.
Lemma 5.2.4. If P has an eigenvalue equal to −1, then P is not aperiodic.
Lemma 5.2.5. For all j ≠ 1, $\pi f_j = 0$.
Proof. By orthonormality, $\langle f_1, f_j \rangle_\pi = 0$. Now use the fact that $f_1 = \mathbf{1}$.
Proof. Using the notation of Theorem 5.2.1, the matrix Φ whose columns are the
$\phi_j$s is orthogonal so $\Phi \Phi^T = I$. That is,
\[
\sum_{j=1}^{n} \phi_j(x) \phi_j(y) = \delta_x(y),
\]
or
\[
\sum_{j=1}^{n} \sqrt{\pi(x) \pi(y)}\, f_j(x) f_j(y) = \delta_x(y).
\]
Proof. Let F be the matrix whose columns are the eigenvectors $\{f_j\}_{j=1}^n$ and let
$D_\lambda$ be the diagonal matrix with $\{\lambda_j\}_{j=1}^n$ on the diagonal. Using the notation in the
proof of Theorem 5.2.1,
\[
P^t D_\pi^{-1} = F D_\lambda^t F^T.
\]
for α, β ∈ (0, 1). Observe that P is reversible with respect to the stationary
distribution
\[
\pi := \left( \frac{\beta}{\alpha + \beta}, \frac{\alpha}{\alpha + \beta} \right).
\]
We know that $f_1 = \mathbf{1}$ is an eigenfunction with eigenvalue 1. As can be checked by
direct computation, the other eigenfunction (in vector form) is
\[
f_2 := \left( \sqrt{\frac{\alpha}{\beta}}, -\sqrt{\frac{\beta}{\alpha}} \right),
\]
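Or, rearranging,
\[
P^t = \begin{pmatrix}
\frac{\beta}{\alpha + \beta} & \frac{\alpha}{\alpha + \beta}\\
\frac{\beta}{\alpha + \beta} & \frac{\alpha}{\alpha + \beta}
\end{pmatrix}
+ (1 - \alpha - \beta)^t
\begin{pmatrix}
\frac{\alpha}{\alpha + \beta} & -\frac{\alpha}{\alpha + \beta}\\
-\frac{\beta}{\alpha + \beta} & \frac{\beta}{\alpha + \beta}
\end{pmatrix}.
\]
As a result,
\[
t_{mix}(\varepsilon) = \left\lceil \frac{\log\left( \varepsilon\, \frac{\alpha + \beta}{\beta} \right)}{\log |1 - \alpha - \beta|} \right\rceil
= \left\lceil \frac{\log \varepsilon^{-1} - \log \frac{\alpha + \beta}{\beta}}{\log |1 - \alpha - \beta|^{-1}} \right\rceil.
\]
◀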
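The closed-form expression for $P^t$ can be compared with direct matrix powers. The sketch below assumes that the two-state transition matrix is P = [[1 − α, α], [β, 1 − β]], which is the form consistent with the stationary distribution and eigenfunctions given in the example; the helper names are hypothetical.

```python
def mat_mul(A, B):
    """Product of two 2 x 2 matrices given as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def mat_pow(P, t):
    """P^t by repeated multiplication, with P^0 = I."""
    result = [[1.0, 0.0], [0.0, 1.0]]
    for _ in range(t):
        result = mat_mul(result, P)
    return result

def closed_form(alpha, beta, t):
    """Stationary part plus (1 - alpha - beta)^t times the transient part."""
    s = alpha + beta
    r = (1.0 - s) ** t
    return [[beta / s + r * alpha / s, alpha / s - r * alpha / s],
            [beta / s - r * beta / s, alpha / s + r * beta / s]]

alpha, beta, t = 0.3, 0.5, 7
P = [[1.0 - alpha, alpha], [beta, 1.0 - beta]]   # assumed form of P
direct = mat_pow(P, t)
formula = closed_form(alpha, beta, t)
max_diff = max(abs(direct[i][j] - formula[i][j])
               for i in range(2) for j in range(2))
```

The two computations agree up to rounding, and the transient part decays geometrically at rate |1 − α − β|.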
Spectral gap and mixing Assume further that P is aperiodic. Recall that by the
convergence theorem (Theorem 1.1.33), for all x, y, P t (x, y) → π(y) as t → +∞,
and that the mixing time (Definition 1.1.35) is
tmix (ε) := min{t ≥ 0 : d(t) ≤ ε},
where d(t) := maxx∈V kP t (x, ·) − π(·)kTV . It will be convenient to work with a
different notion of distance.
Definition 5.2.9 (Separation distance). The separation distance is defined as
\[
s_x(t) := \max_{y \in V} \left[ 1 - \frac{P^t(x, y)}{\pi(y)} \right],
\]
and we let $s(t) := \max_{x \in V} s_x(t)$.
Lemma 5.2.10 (Separation distance and total variation distance).
d(t) ≤ s(t).
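Proof. By Lemma 4.1.15,
\[
\begin{aligned}
\|P^t(x, \cdot) - \pi(\cdot)\|_{TV} &= \sum_{y : P^t(x,y) < \pi(y)} \left[ \pi(y) - P^t(x, y) \right]\\
&= \sum_{y : P^t(x,y) < \pi(y)} \pi(y) \left[ 1 - \frac{P^t(x, y)}{\pi(y)} \right]\\
&\le s_x(t).
\end{aligned}
\]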
Since this holds for any x, the claim follows.
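The inequality d(t) ≤ s(t) can be observed on a small example. The following sketch uses a three-state chain obtained from symmetric positive weights (a hypothetical choice, for which the stationary distribution is proportional to the row sums) and records both distances for the first few values of t.

```python
def mat_mul(A, B):
    """Product of two square matrices given as nested lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

# Symmetric positive weights c(x, y); P(x, y) = c(x, y) / c(x)
C = [[1.0, 2.0, 0.5],
     [2.0, 1.0, 3.0],
     [0.5, 3.0, 2.0]]
row = [sum(r) for r in C]
P = [[c / row[x] for c in C[x]] for x in range(3)]
pi = [r / sum(row) for r in row]              # stationary distribution

def tv_and_separation(Pt):
    """Return (d(t), s(t)) for the t-step transition matrix Pt."""
    d = max(0.5 * sum(abs(Pt[x][y] - pi[y]) for y in range(3)) for x in range(3))
    s = max(1.0 - Pt[x][y] / pi[y] for x in range(3) for y in range(3))
    return d, s

Pt = [[float(i == j) for j in range(3)] for i in range(3)]   # P^0 = I
history = []
for t in range(8):
    history.append(tv_and_separation(Pt))
    Pt = mat_mul(Pt, P)
```

Each pair in `history` satisfies d(t) ≤ s(t), and both quantities decrease toward 0 as t grows.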
It follows from the spectral decomposition (Theorem 5.2.7) that the speed of
convergence of $P^t(x, y)$ to π(y) is governed by the eigenvalue of P, other than 1,
that is largest in absolute value.
Definition 5.2.11 (Spectral gap). The absolute spectral gap is $\gamma_* := 1 - \lambda_*$ where
$\lambda_* := |\lambda_2| \vee |\lambda_n|$. The spectral gap is $\gamma := 1 - \lambda_2$.
By Lemmas 5.2.3 and 5.2.4, we have $\gamma_* > 0$ when P is irreducible and aperiodic.
Note that the eigenvalues of the lazy version $\frac{1}{2} P + \frac{1}{2} I$ of P are $\{\frac{1}{2}(\lambda_j + 1)\}_{j=1}^n$
which are all nonnegative. So, there, $\gamma_* = \gamma$.
Definition 5.2.12 (Relaxation time). The relaxation time is defined as
\[
t_{rel} := \gamma_*^{-1}.
\]
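\[
\left| \frac{P^t(x, y)}{\pi(y)} - 1 \right| \le \lambda_*^t \sqrt{\pi(x)^{-1} \pi(y)^{-1}} \le \frac{\lambda_*^t}{\pi_{\min}} = \frac{(1 - \gamma_*)^t}{\pi_{\min}} \le \frac{e^{-\gamma_* t}}{\pi_{\min}}, \tag{5.2.5}
\]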
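so $d(t) \ge \frac{1}{2} \lambda_*^t$. When $t = t_{\mathrm{mix}}(\varepsilon)$, $\varepsilon \ge d(t_{\mathrm{mix}}(\varepsilon)) \ge \frac{1}{2} \lambda_*^{t_{\mathrm{mix}}(\varepsilon)}$. Therefore, rearranging and taking a logarithm, we get
\[ t_{\mathrm{mix}}(\varepsilon) \left( \frac{1}{\lambda_*} - 1 \right) \ge t_{\mathrm{mix}}(\varepsilon) \log \frac{1}{\lambda_*} \ge \log \frac{1}{2\varepsilon}, \]
where we used $z = 1 - \lambda_*^{-1}$ in $1 - z \le e^{-z}$ to get the first inequality. The result follows from
\[ \left( \frac{1}{\lambda_*} - 1 \right)^{-1} = \left( \frac{1 - \lambda_*}{\lambda_*} \right)^{-1} = \left( \frac{\gamma_*}{1 - \gamma_*} \right)^{-1} = t_{\mathrm{rel}} - 1. \]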
which is indeed 1.
From the eigenvalues, we derive the relaxation time (Definition 5.2.12) analyt-
ically.
Theorem 5.2.16 (Cycle: relaxation time). The relaxation time for lazy simple random walk on the n-cycle is
\[ t_{\mathrm{rel}} = \frac{1}{1 - \cos\frac{2\pi}{n}} = \Theta(n^2). \]
\[ 1 - \cos\frac{2\pi}{n} = \frac{4\pi^2}{2 n^2} + O(n^{-4}) = \frac{2\pi^2}{n^2} + O(n^{-4}). \]
Since $\pi_{\min} = 1/n$, we get $t_{\mathrm{mix}}(\varepsilon) = O(n^2 \log n)$ and $t_{\mathrm{mix}}(\varepsilon) = \Omega(n^2)$ by Theorem 5.2.14.
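To get a feel for the orders of magnitude, the following sketch evaluates the expression for the relaxation time stated in Theorem 5.2.16 and plugs it into the two bounds of Theorem 5.2.14, in the form $(t_{\mathrm{rel}} - 1)\log\frac{1}{2\varepsilon} \le t_{\mathrm{mix}}(\varepsilon) \le \log\left(\frac{1}{\varepsilon \pi_{\min}}\right) t_{\mathrm{rel}}$ used elsewhere in this chapter. The function names are our own, and the code merely evaluates these formulas; it does not simulate the chain.

```python
import math

def cycle_relaxation_time(n):
    """Evaluate 1 / (1 - cos(2*pi/n)), the expression stated in Theorem 5.2.16."""
    return 1.0 / (1.0 - math.cos(2.0 * math.pi / n))

def mixing_time_bounds(t_rel, pi_min, eps):
    """Lower and upper bounds on t_mix(eps) obtained from the relaxation time."""
    lower = (t_rel - 1.0) * math.log(1.0 / (2.0 * eps))
    upper = math.log(1.0 / (eps * pi_min)) * t_rel
    return lower, upper

def leading_order(n):
    """Leading term n^2 / (2*pi^2) of the relaxation time, from the Taylor expansion."""
    return n ** 2 / (2.0 * math.pi ** 2)
```

For n = 200, say, `cycle_relaxation_time(n)` and `leading_order(n)` agree to within a fraction of a percent, while the gap between the two mixing time bounds is the logarithmic factor discussed next.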
It turns out our upper bound is off by a logarithmic factor. A sharper bound on
the mixing time can be obtained by working directly with the spectral decomposi-
tion. By Lemma 4.1.9 and Cauchy-Schwarz (Theorem B.4.8), for any x ∈ V ,
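\begin{align*}
4 \|P^t(x, \cdot) - \pi(\cdot)\|_{\mathrm{TV}}^2
&= \left\{ \sum_y \pi(y) \left| \frac{P^t(x, y)}{\pi(y)} - 1 \right| \right\}^2 \\
&\le \sum_y \pi(y) \left( \frac{P^t(x, y)}{\pi(y)} - 1 \right)^2 \\
&= \left\| \sum_{j=1}^{n-1} \mu_j^t g_j(x) g_j \right\|_\pi^2 \\
&= \sum_{j=1}^{n-1} \mu_j^{2t} g_j(x)^2,
\end{align*}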
where we used the spectral decomposition of P t (Theorem 5.2.7) on the third line
and Parseval’s identity (i.e., (5.2.1)) on the fourth line.
Here comes the trick: the total variation distance does not depend on the start-
ing point x by symmetry. Multiplying by π(x) and summing over x—on the right-
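\begin{align*}
4 \|P^t(x, \cdot) - \pi(\cdot)\|_{\mathrm{TV}}^2
&\le \sum_x \pi(x) \sum_{j=1}^{n-1} \mu_j^{2t} g_j(x)^2 \\
&= \sum_{j=1}^{n-1} \mu_j^{2t} \sum_x \pi(x) g_j(x)^2 \\
&= \sum_{j=1}^{n-1} \mu_j^{2t},
\end{align*}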
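For $x \in [0, \pi/2)$, $\cos x \le e^{-x^2/2}$ (see Exercise 1.16). Then
\begin{align*}
4 d(t)^2
&\le 2 \sum_{j=1}^{(n-1)/2} \exp\left( -\frac{4\pi^2 j^2}{n^2} t \right) \\
&\le 2 \exp\left( -\frac{4\pi^2}{n^2} t \right) \sum_{j=1}^{\infty} \exp\left( -\frac{4\pi^2 (j^2 - 1)}{n^2} t \right) \\
&\le 2 \exp\left( -\frac{4\pi^2}{n^2} t \right) \sum_{\ell=0}^{\infty} \exp\left( -\frac{4\pi^2 t}{n^2} \ell \right) \\
&= \frac{2 \exp\left( -\frac{4\pi^2}{n^2} t \right)}{1 - \exp\left( -\frac{4\pi^2}{n^2} t \right)}.
\end{align*}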
These are called parity functions. We show that the parity functions form an eigenbasis of the transition matrix.
Lemma 5.2.17 (Hypercube: eigenbasis). For all $J \subseteq [n]$, the function $\chi_J$ is an eigenfunction of P with eigenvalue
\[ \mu_J := \frac{n - |J|}{n}. \]
Moreover the $\chi_J$s are orthonormal in $\ell^2(V, \pi)$.
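Proof. For $x \in V$ and $i \in [n]$, let $x^{[i]}$ be x where coordinate i is flipped. Note that, for all J, x,
\begin{align*}
\sum_y P(x, y) \chi_J(y)
&= \frac{1}{2} \chi_J(x) + \frac{1}{2} \sum_{i=1}^n \frac{1}{n} \chi_J(x^{[i]}) \\
&= \left[ \frac{1}{2} + \frac{1}{2} \frac{n - |J|}{n} \right] \chi_J(x) - \frac{1}{2} \frac{|J|}{n} \chi_J(x) \\
&= \frac{n - |J|}{n} \chi_J(x).
\end{align*}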
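For the orthonormality, note that
\[ \sum_{x \in V} \pi(x) \chi_J(x)^2 = \sum_{x \in V} \frac{1}{2^n} \prod_{j \in J} x_j^2 = 1. \]
For $J \ne J' \subseteq [n]$,
\begin{align*}
\sum_{x \in V} \pi(x) \chi_J(x) \chi_{J'}(x)
&= \sum_{x \in V} \frac{1}{2^n} \prod_{j \in J \cap J'} x_j^2 \prod_{j \in J \setminus J'} x_j \prod_{j \in J' \setminus J} x_j \\
&= \frac{2^{|J \cap J'|}}{2^n} \prod_{j \in J \setminus J'} \sum_{x_j \in \{-1,+1\}} x_j \prod_{j \in J' \setminus J} \sum_{x_j \in \{-1,+1\}} x_j \\
&= 0,
\end{align*}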
Theorem 5.2.18 (Hypercube: relaxation time). The relaxation time for lazy simple
random walk on the n-dimensional hypercube is
trel = n.
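Then
\begin{align*}
4 d(t)^2
&\le \sum_{J \ne \emptyset} \left( \frac{n - |J|}{n} \right)^{2t} \\
&= \sum_{\ell=1}^n \binom{n}{\ell} \left( 1 - \frac{\ell}{n} \right)^{2t} \\
&\le \sum_{\ell=1}^n \binom{n}{\ell} \exp\left( -\frac{2t\ell}{n} \right) \\
&= \left( 1 + \exp\left( -\frac{2t}{n} \right) \right)^n - 1,
\end{align*}
where we used that $1 - x \le e^{-x}$ for all x (see Exercise 1.16). So, by definition, $t_{\mathrm{mix}}(\varepsilon) \le \frac{1}{2} n \log n + O(n)$.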
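The last step can be made concrete numerically. The sketch below (function names are ours) evaluates the binomial sum from the second line of the display, the closed-form bound from the last line, and the first time t at which the closed-form bound drops below $4\varepsilon^2$, which by the display guarantees $d(t) \le \varepsilon$. It is a sanity check of the bound, not a computation of the true mixing time.

```python
import math

def exact_sum(n, t):
    """Sum over l = 1..n of C(n, l) * (1 - l/n)^(2t)."""
    return sum(math.comb(n, l) * (1.0 - l / n) ** (2 * t) for l in range(1, n + 1))

def closed_form_bound(n, t):
    """The bound (1 + exp(-2t/n))^n - 1 on 4 d(t)^2."""
    return (1.0 + math.exp(-2.0 * t / n)) ** n - 1.0

def first_time_below(n, eps):
    """Smallest t at which the closed-form bound is at most 4 * eps^2."""
    t = 0
    while closed_form_bound(n, t) > 4.0 * eps ** 2:
        t += 1
    return t
```

Comparing `first_time_below(n, eps)` with `0.5 * n * math.log(n)` for a few values of n shows a difference that grows linearly in n, in line with the $O(n)$ correction term.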
Remark 5.2.19. In fact, lazy simple random walk on the n-dimensional hypercube has
a “cutoff” at (1/2)n log n. Roughly speaking, within a time window of size O(n), the
total variation distance to the stationary distribution goes from near 1 to near 0. See,
e.g., [LPW06, Section 18.2.2].
Varopoulos-Carne bound
Our main bound is the following. Recall that a reversible Markov chain is equiva-
lent to a random walk on the network corresponding to its positive transition prob-
abilities (see Definition 1.2.7 and the discussion following it).
As a sanity check before proving the theorem, note that if the chain is aperiodic and π is the stationary distribution then by the convergence theorem (Theorem 1.1.33)
\[ P^t(x, y) \to \pi(y) \le 2 \sqrt{\frac{\pi(y)}{\pi(x)}}, \quad \text{as } t \to +\infty, \]
where again (St ) is simple random walk on Z started at 0, and then use the Chernoff
bound (Lemma 2.4.3).
By the local finiteness assumption, only a finite number of states can be reached by time t. Hence we can reduce the problem to a finite state space. More precisely, let $\tilde{V} = \{z \in V : \rho(x, z) \le t\}$ and for $z, w \in \tilde{V}$
\[
\tilde{P}(z, w) = \begin{cases} P(z, w) & \text{if } z \ne w, \\ P(z, z) + P(z, V \setminus \tilde{V}) & \text{otherwise.} \end{cases}
\]
is a Chebyshev polynomial of the first kind. Note that $|T_k(\xi)| \le 1$ on $[-1, 1]$ by definition. The classical trigonometric identity (to see this, write it in complex form)
\[ \cos((k + 1)\theta) + \cos((k - 1)\theta) = 2 \cos\theta \cos(k\theta), \]
implies the recursion
\[ T_{k+1}(\xi) + T_{k-1}(\xi) = 2 \xi \, T_k(\xi), \]
which in turn implies that $T_k$ is indeed a polynomial. It has degree k from induction and the fact that $T_0(\xi) = 1$ and $T_1(\xi) = \xi$. The connection to simple random walk on Z comes from the following somewhat miraculous representation (which does not rely on reversibility). Let $T_k(P)$ denote the polynomial $T_k$ evaluated at P as a matrix polynomial.
Lemma 5.2.21.
\[ P^t = \sum_{k=-t}^{t} \mathbb{P}[S_t = k] \, T_{|k|}(P). \]
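The representation in Lemma 5.2.21 is a polynomial identity, so it can be sanity-checked in its scalar form $\xi^t = \sum_{k=-t}^{t} \mathbb{P}[S_t = k] \, T_{|k|}(\xi)$. The sketch below does this, computing $T_k$ through the three-term recursion $T_{k+1}(\xi) = 2\xi T_k(\xi) - T_{k-1}(\xi)$ and using the binomial formula for the law of simple random walk on Z started at 0. The function names are illustrative, and only the scalar case is checked here, not the matrix statement itself.

```python
import math

def chebyshev(k, xi):
    """T_k(xi) via the recursion T_{k+1} = 2*xi*T_k - T_{k-1}, with T_0 = 1, T_1 = xi."""
    prev, cur = 1.0, xi
    if k == 0:
        return prev
    for _ in range(k - 1):
        prev, cur = cur, 2.0 * xi * cur - prev
    return cur

def srw_pmf(t, k):
    """P[S_t = k] for simple random walk on Z started at 0."""
    if abs(k) > t or (t + k) % 2 != 0:
        return 0.0
    return math.comb(t, (t + k) // 2) / 2 ** t

def chebyshev_expansion(t, xi):
    """Right-hand side of the scalar identity: sum_k P[S_t = k] * T_|k|(xi)."""
    return sum(srw_pmf(t, k) * chebyshev(abs(k), xi) for k in range(-t, t + 1))
```

For any real ξ and integer t ≥ 0, `chebyshev_expansion(t, xi)` should match `xi ** t` up to floating-point error.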
where we used Parseval’s identity (5.2.1) twice and the fact that $T_k(\lambda_i)^2 \in [0, 1]$.
Let $\delta_z$ denote the point mass at z. By Cauchy-Schwarz (Theorem B.4.8) and (5.2.7),
\[
T_k(P)(x, y) = \frac{\langle \delta_x, T_k(P) \delta_y \rangle_\pi}{\pi(x)} \le \frac{\|\delta_x\|_\pi \|\delta_y\|_\pi}{\pi(x)} = \frac{\sqrt{\pi(x)} \sqrt{\pi(y)}}{\pi(x)} = \sqrt{\frac{\pi(y)}{\pi(x)}},
\]
for any k (in particular for $k \ge \rho(x, y)$) and we have proved the claim.
Combining the two lemmas gives the result.
Remark 5.2.23. The local finiteness assumption is made for simplicity only. The result
holds for any countable-space, reversible chain. See [LP16, Section 13.2].
Lower bound on mixing Let $(X_t)$ be an irreducible aperiodic (for now not necessarily reversible) Markov chain with finite state space V and stationary distribution π. Recall that, for a fixed $0 < \varepsilon < 1/2$, the mixing time is
\[ t_{\mathrm{mix}}(\varepsilon) = \min\{t : d(t) \le \varepsilon\}, \]
where
\[ d(t) = \max_{x \in V} \|P^t(x, \cdot) - \pi\|_{\mathrm{TV}}. \]
It is intuitively clear that $t_{\mathrm{mix}}(\varepsilon)$ is at least of the order of the “diameter” of the transition graph of P. For $x, y \in V$, let $\rho(x, y)$ be the graph distance between x and y on the undirected version of the transition graph, that is, ignoring the orientation of the edges. With this definition, a shortest directed path from x to y contains at least $\rho(x, y)$ edges. Here we define the diameter of the transition graph as $\Delta := \max_{x, y \in V} \rho(x, y)$. Let $x_0, y_0$ be a pair of vertices achieving the diameter. Then we claim that $P^{\lfloor (\Delta - 1)/2 \rfloor}(x_0, \cdot)$ and $P^{\lfloor (\Delta - 1)/2 \rfloor}(y_0, \cdot)$ are supported on disjoint sets. To see this let
\[ A = \{z \in V : \rho(x_0, z) < \rho(y_0, z)\}, \]
be the set of states closer to $x_0$ than $y_0$. See Figure 5.2. By the triangle inequality for ρ, any z such that $\rho(x_0, z) \le \lfloor (\Delta - 1)/2 \rfloor$ is in A, otherwise we would have $\rho(y_0, z) \le \rho(x_0, z) \le \lfloor (\Delta - 1)/2 \rfloor$ and hence $\rho(x_0, y_0) \le \rho(x_0, z) + \rho(y_0, z) \le 2 \lfloor (\Delta - 1)/2 \rfloor < \Delta$, a contradiction. Similarly, if $\rho(y_0, z) \le \lfloor (\Delta - 1)/2 \rfloor$, then $z \in A^c$. By the triangle inequality for the total variation distance,
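\begin{align*}
d(\lfloor (\Delta - 1)/2 \rfloor)
&\ge \frac{1}{2} \left\| P^{\lfloor (\Delta - 1)/2 \rfloor}(x_0, \cdot) - P^{\lfloor (\Delta - 1)/2 \rfloor}(y_0, \cdot) \right\|_{\mathrm{TV}} \\
&\ge \frac{1}{2} \left\{ P^{\lfloor (\Delta - 1)/2 \rfloor}(x_0, A) - P^{\lfloor (\Delta - 1)/2 \rfloor}(y_0, A) \right\} \\
&= \frac{1}{2} \{1 - 0\} = \frac{1}{2}, \tag{5.2.8}
\end{align*}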
Figure 5.2: The supports of $P^{\lfloor (\Delta - 1)/2 \rfloor}(x_0, \cdot)$ and $P^{\lfloor (\Delta - 1)/2 \rfloor}(y_0, \cdot)$ are contained in A and $A^c$ respectively.
Claim 5.2.24.
\[ t_{\mathrm{mix}}(\varepsilon) \ge \frac{\Delta}{2}. \]
This bound is often far from the truth. Consider for instance simple random walk on a cycle of size n. The diameter is $\Delta = n/2$. But Lemma 2.4.3 suggests that it takes time of order $\Delta^2$ to even reach the antipode of the starting point, let alone achieve stationarity. More generally, when P is reversible, the “diffusive behavior” captured by the Varopoulos-Carne bound (Theorem 5.2.20) implies that the mixing time does indeed scale at least as the square of the diameter.
Assume that P is reversible with respect to π and has diameter Δ. Letting $n = |V|$ and $\pi_{\min} = \min_{x \in V} \pi(x)$, we then have the following.
\[ t_{\mathrm{mix}}(\varepsilon) \ge \frac{\Delta^2}{12 \log n + 4 |\log \pi_{\min}|}, \]
provided $n \ge \frac{16}{(1 - 2\varepsilon)^2}$.
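To compare the two diameter-based lower bounds, one can simply evaluate them. The sketch below does so; the function names are ours, and the suggested test case (a cycle of even size n with $\Delta = n/2$ and $\pi_{\min} = 1/n$, as in the discussion above) is an illustrative choice.

```python
import math

def diameter_lower_bound(delta):
    """The bound t_mix(eps) >= Delta / 2 of Claim 5.2.24."""
    return delta / 2.0

def diffusive_lower_bound(delta, n, pi_min):
    """The bound Delta^2 / (12 log n + 4 |log pi_min|) for reversible chains."""
    return delta ** 2 / (12.0 * math.log(n) + 4.0 * abs(math.log(pi_min)))

def size_condition_holds(n, eps):
    """Check the proviso n >= 16 / (1 - 2*eps)^2 under which the second bound is stated."""
    return n >= 16.0 / (1.0 - 2.0 * eps) ** 2
```

On a cycle with n = 1000, for example, `diffusive_lower_bound(500, 1000, 1/1000)` is roughly an order of magnitude larger than `diameter_lower_bound(500)`, reflecting the $\Delta^2 / \log n$ scaling.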
Proof. The proof is based on the same argument we used to derive our first diameter-based bound, except that the Varopoulos-Carne bound gives a better dependence on the diameter. Namely, let $x_0$, $y_0$, and A be as above. By the Varopoulos-Carne bound,
\[
P^t(x_0, A^c) = \sum_{z \in A^c} P^t(x_0, z) \le \sum_{z \in A^c} 2 \sqrt{\frac{\pi(z)}{\pi(x_0)}} \, e^{-\frac{\rho^2(x_0, z)}{2t}} \le 2 n \pi_{\min}^{-1/2} e^{-\frac{\Delta^2}{8t}},
\]
where we used that $|A^c| \le n$ and $\rho(x_0, z) \ge \frac{\Delta}{2}$ for $z \in A^c$. For any
\[ t < \frac{\Delta^2}{12 \log n + 4 |\log \pi_{\min}|}, \tag{5.2.9} \]
we get that
\[ P^t(x_0, A^c) \le 2 n \pi_{\min}^{-1/2} \exp\left( -\frac{3 \log n + |\log \pi_{\min}|}{2} \right) = \frac{2}{\sqrt{n}}, \]
Our goal is to estimate πf from the sample path of the Markov chain $(X_t)_{t \ge 0}$ with transition matrix P. Indeed the ergodic theorem guarantees that
\[ \frac{1}{T} \sum_{t=1}^{T} f(X_t) \to \pi f, \]
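Proof. We have
\[ |\mathbb{E}[f(X_t)] - \pi f| = \left| \sum_x \sum_y \mu_x P^t_{x,y} f(y) - \sum_y \pi_y f(y) \right|. \]
Because $\sum_x \mu_x = 1$, the right-hand side is
\begin{align*}
&= \left| \sum_x \sum_y \mu_x P^t_{x,y} f(y) - \sum_x \sum_y \mu_x \pi_y f(y) \right| \\
&\le \sum_x \mu_x \sum_y \left| P^t_{x,y} - \pi_y \right| |f(y)|,
\end{align*}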
Hence,
\[ 0 \le \frac{1}{T^2} \sum_{t=1}^{T} \mathrm{Var}[f(X_t)] \le \frac{T \|f\|_\infty^2}{T^2} \to 0, \]
as $T \to +\infty$.
Bounding the covariance requires a more delicate argument. Fix 1 ≤ s < t ≤
T . The trick is to condition on Xs and use the Markov Property (Theorem 1.1.18).
By definition of the covariance, the tower property (Lemma B.6.16) and taking out
what is known (Lemma B.6.13),
By Lemma 5.2.28,
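Returning to the sum over the covariances, the previous bound gives
\begin{align*}
\frac{2}{T^2} \sum_{1 \le s < t \le T} \mathrm{Cov}[f(X_s), f(X_t)]
&\le \frac{2}{T^2} \sum_{1 \le s < t \le T} |\mathrm{Cov}[f(X_s), f(X_t)]| \\
&\le \frac{2}{T^2} \sum_{1 \le s < t \le T} 4 (1 - \gamma_*)^{t-s} \pi_{\min}^{-1} \|f\|_\infty^2.
\end{align*}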
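To evaluate the sum we make the change of variable $h = t - s$ to get that the previous expression is
\begin{align*}
&\le 4 \pi_{\min}^{-1} \|f\|_\infty^2 \frac{2}{T^2} \sum_{1 \le s \le T} \sum_{h=1}^{T-s} (1 - \gamma_*)^h \\
&\le 4 \pi_{\min}^{-1} \|f\|_\infty^2 \frac{2}{T^2} \sum_{1 \le s \le T} \sum_{h=0}^{+\infty} (1 - \gamma_*)^h \\
&= 4 \pi_{\min}^{-1} \|f\|_\infty^2 \frac{2}{T^2} \sum_{1 \le s \le T} \frac{1}{\gamma_*} \\
&= 8 \pi_{\min}^{-1} \|f\|_\infty^2 \gamma_*^{-1} \frac{1}{T} \to 0,
\end{align*}
as $T \to +\infty$.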
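Combining the variance and covariance bounds, we have shown that
\[
\mathrm{Var}\left[ \frac{1}{T} \sum_{t=1}^{T} f(X_t) \right] \le \|f\|_\infty^2 \frac{1}{T} + 8 \pi_{\min}^{-1} \|f\|_\infty^2 \gamma_*^{-1} \frac{1}{T} \le 9 \pi_{\min}^{-1} \|f\|_\infty^2 \gamma_*^{-1} \frac{1}{T}.
\]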
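The final bound translates directly into a (conservative) run-length estimate: to bring the bound on the variance of the ergodic average below $\delta^2$, it suffices to take $T \ge 9 \pi_{\min}^{-1} \|f\|_\infty^2 \gamma_*^{-1} / \delta^2$. The sketch below only evaluates this arithmetic; the function names and the sample parameter values are our own assumptions.

```python
import math

def variance_bound(f_sup, pi_min, gamma_star, T):
    """The bound 9 * ||f||_inf^2 / (pi_min * gamma_star * T) on the variance of the average."""
    return 9.0 * f_sup ** 2 / (pi_min * gamma_star * T)

def steps_for_std(f_sup, pi_min, gamma_star, target_std):
    """Smallest integer T for which the variance bound is at most target_std^2 (up to rounding)."""
    return math.ceil(9.0 * f_sup ** 2 / (pi_min * gamma_star * target_std ** 2))
```

For instance, with $\|f\|_\infty = 1$, $\pi_{\min} = 0.1$, $\gamma_* = 0.05$ and a target standard deviation of 0.1, `steps_for_std` returns a run length on the order of $10^5$ steps. The bound is crude, so shorter runs may well suffice in practice.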
for all x, y ∈ V .
- Conditions stronger than reversibility are needed for the spectral theorem, in a form similar to what we used, to apply. Specifically, one needs that P is a compact operator: whenever $(f_n)_n \in \ell^2(V, \pi)$ is a bounded sequence, then there exists a subsequence $(f_{n_k})_k$ such that $(P f_{n_k})$ converges in the norm. Unfortunately that is often not the case, as the next example illustrates, even in the reversible, positive recurrent setting.
Example 5.2.29 (A positive recurrent chain whose P is not compact). For $p < 1/2$, let $(X_t)$ be the birth-death chain with $V := \{0, 1, 2, \ldots\}$, $P(0, 0) := 1 - p$, $P(0, 1) = p$, $P(x, x + 1) := p$ and $P(x, x - 1) := 1 - p$ for all $x \ge 1$, and $P(x, y) := 0$ if $|x - y| > 1$. As can be checked by direct computation, P is reversible with respect to the stationary distribution $\pi(x) = (1 - \gamma) \gamma^x$ for $x \ge 0$ where $\gamma := \frac{p}{1-p}$. For $j \ge 1$, define $g_j(x) := \pi(j)^{-1/2} \mathbf{1}_{\{x = j\}}$. Then $\|g_j\|_\pi^2 = 1$ for all j so $\{g_j\}_j$ is bounded in $\ell^2(V, \pi)$. On the other hand,
So
We will not say much about the spectral theory of infinite networks. In this
subsection, we establish a relationship between the operator norm of P —which is
related to its spectrum—and the decay of P t (x, y).
Let $\ell_0(V)$ be the set of real-valued functions on V with finite support. It is dense in $\ell^2(V, \pi)$. Indeed let $v_1, v_2, \ldots$ be an enumeration of V and, for $f \in \ell^2(V, \pi)$, define $f|_n(v_i) := f(v_i) \mathbf{1}_{i \le n}$ to be f restricted to $v_1, \ldots, v_n$. Then
\[ \|f - f|_n\|_\pi^2 = \sum_{i=n+1}^{\infty} \pi(v_i) f(v_i)^2 \to 0, \tag{5.2.10} \]
as $n \to \infty$, since $\|f\|_\pi^2 = \sum_x \pi(x) f(x)^2 < +\infty$. We will also need the following
\[ \|P f\|_\pi \le \|P\|_\pi \|f\|_\pi. \tag{5.2.12} \]
The same can be seen to hold for any $f \in \ell^2(V, \pi)$ by considering the sequence $(f|_n)_n$ and noting that $\|f|_n\|_\pi \to \|f\|_\pi$ and $\|P(f|_n)\|_\pi \to \|P f\|_\pi$ as $n \to \infty$ by (5.2.10), (5.2.11) and the triangle inequality. This latter observation explains why it suffices to restrict the supremum to $\ell_0$ in the definition of the norm.
Note that, by (5.2.2), $\|P\|_\pi \le 1$. Note further that, if V is finite or more generally if π is summable, then we have in fact $\|P\|_\pi = 1$ by taking $f \equiv 1$ above. When P is self-adjoint, the norm $\|P\|_\pi$ is also equal to what is known as the spectral radius, that is, the radius of the smallest disk centered at 0 in the complex plane that contains the spectrum of P. We will not need to define what that means formally here. (But Exercise 5.5 asks for a proof in the setting of symmetric matrices.)
Our main result is the following.
Theorem 5.2.31 (Spectral radius). Let P be irreducible and reversible with respect to $\pi > 0$. Then
\[ \rho(P) := \limsup_t P^t(x, y)^{1/t} = \|P\|_\pi. \]
In the positive recurrent case (for instance if the chain is finite), we have P t (x, y) →
π(y) > 0 and so ρ(P ) = 1 = kP kπ . The theorem says that the equality between
ρ(P ) and kP kπ holds in general for reversible chains.
Proof of Theorem 5.2.31. To see that the limit does not depend on x, y, let u, v, x, y ∈
V and k, m ≥ 0 such that P m (u, x) > 0 and P k (y, v) > 0. Then
which shows that $\limsup_t P^t(u, v)^{1/t} \ge \limsup_t P^t(x, y)^{1/t}$ for all u, v, x, y.
We first show that $\rho(P) \le \|P\|_\pi$. Observe that applying (5.2.4) and (5.2.12) repeatedly gives that $P^t$ is self-adjoint and satisfies the inequality $\|P^t\|_\pi \le \|P\|_\pi^t$. Because $\|\delta_z\|_\pi^2 = \pi(z) \le 1$, by Cauchy-Schwarz
\[ \pi(x) P^t(x, y) = \langle \delta_x, P^t \delta_y \rangle_\pi \le \|P\|_\pi^t \|\delta_x\|_\pi \|\delta_y\|_\pi = \|P\|_\pi^t \sqrt{\pi(x) \pi(y)}. \]
Hence $P^t(x, y) \le \sqrt{\frac{\pi(y)}{\pi(x)}} \, \|P\|_\pi^t$ and
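or
\[ \frac{\|P^{t+1} f\|_\pi}{\|P^t f\|_\pi} \le \frac{\|P^{t+2} f\|_\pi}{\|P^{t+1} f\|_\pi}. \]
So $\frac{\|P^{t+1} f\|_\pi}{\|P^t f\|_\pi}$ is non-decreasing and therefore has a limit $L \le +\infty$. Moreover, for $t = 0$, we get
\[ \frac{\|P f\|_\pi}{\|f\|_\pi} \le L, \tag{5.2.13} \]
so it suffices to prove $L \le \rho(P)$.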
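- Observe that
\[
\left( \frac{\|P^t f\|_\pi}{\|f\|_\pi} \right)^{1/t} = \left( \frac{\|P f\|_\pi}{\|f\|_\pi} \times \cdots \times \frac{\|P^t f\|_\pi}{\|P^{t-1} f\|_\pi} \right)^{1/t} \to L,
\]
so in fact
\[ L = \lim_t \|P^t f\|_\pi^{1/t}. \]
- By self-adjointness again
\[ \|P^t f\|_\pi^2 = \langle f, P^{2t} f \rangle_\pi = \sum_x \pi(x) f(x) \sum_y f(y) P^{2t}(x, y). \]
for all x, y in the support of f. For such a t, plugging back into the previous display
\[ \|P^t f\|_\pi^{1/t} \le (\rho(P) + \varepsilon) \left( \sum_x \pi(x) |f(x)| \sum_y |f(y)| \right)^{1/2t}. \]
\[ L = \lim_t \|P^t f\|_\pi^{1/t} \le \rho(P). \tag{5.2.14} \]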
So, combining (5.2.13) and (5.2.14), we have shown that $\|P\|_\pi \le \rho(P)$ and that concludes the proof.
This is not an if and only if. Random walk on $\mathbb{Z}^3$ is transient, yet $P^{2t}(0, 0) = \Theta(t^{-3/2})$ so there $\|P\|_\pi = \rho(P) = 1$.
In the non-reversible case, our definition of kP kπ still makes sense with respect
to any stationary measure π (although P is not self-adjoint). But the equality in
Theorem 5.2.31 no longer holds in general.
\[ \limsup_n \frac{\|P g_n\|_{\pi_1}}{\|g_n\|_{\pi_1}} \ge 1, \]
and $\|P\|_{\pi_1} \ge 1$. But we already showed that $\|P\|_{\pi_1} \le 1$ in (5.2.2), so the claim follows.
On the other hand, E0 [Xt ] = (2p−1)t. So the martingale Zt := Xt −(2p−1)t
(see Example 3.1.29), as a sum of t independent centered random variables in
{−1 − (2p − 1), 1 − (2p − 1)}, satisfies the assumptions of the Azuma-Hoeffding
inequality (Theorem 3.2.1) with increment bound ct := 2. So
Therefore
\[ \limsup_t P^t(0, 0)^{1/t} \le e^{-(2p-1)^2/2} < 1. \]
◀
relationships between the “volume” of sets and their “circumference.” The classical isoperimetric inequality states that the area enclosed by any rectifiable simple closed curve in the plane is at most the length of the curve squared divided by 4π. Moreover equality is achieved if and only if the curve is a circle.
Remark 5.3.1. Here is an easy proof in the smooth case. Suppose $r(s) = (x(s), y(s))$, $s \in [0, 2\pi]$, is the parametrization of a positively oriented, smooth, simple closed curve in the plane centered at the origin with arc-length 2π, where $\|r'(s)\|_2 = 1$ for all s, $\int_0^{2\pi} r(s) \, ds = 0$ and $x(0) = x(2\pi) = 0$. By Green’s theorem, the area enclosed by the curve is
\[ A = \int_0^{2\pi} x(s) y'(s) \, ds = \frac{1}{2} \int_0^{2\pi} \left[ x(s)^2 + y'(s)^2 - (x(s) - y'(s))^2 \right] ds, \]
\[ A \le \frac{1}{2} \int_0^{2\pi} \left[ x(s)^2 + y'(s)^2 \right] ds \le \frac{1}{2} \int_0^{2\pi} \left[ x'(s)^2 + y'(s)^2 \right] ds = \pi, \]
which is indeed the area of a circle of circumference 2π.
\[ \partial_E S := \{e = \{x, y\} \in E : x \in S, y \in S^c\}. \]
and
\[ |W|_h := \sum_{v \in W} h(v). \]
\[ \Phi_E(S; g, h) := \frac{|\partial_E S|_g}{|S|_h}. \]
Roughly speaking, this is the ratio of the “size of the boundary” of a set to its
“volume.”
Our main definition, the edge expansion constant, quantifies the worst such
ratio. First, one last piece of notation: for disjoint subsets $S_0, S_1 \subseteq V$, we let
\[ c(S_0, S_1) := \sum_{x_0 \in S_0} \sum_{x_1 \in S_1} c(x_0, x_1). \]
Definition 5.3.2 (Edge expansion). For a subset of states $S \subseteq V$, the edge expansion constant (or bottleneck ratio) of S is
\[ \Phi_E(S; c, \pi) = \frac{|\partial_E S|_c}{|S|_\pi} = \frac{c(S, S^c)}{\pi(S)}. \]
We refer to $(S, S^c)$ as a cut. The edge expansion constant (or bottleneck ratio or Cheeger number or isoperimetric constant*) of N is
\[ \Phi_* := \min\left\{ \Phi_E(S; c, \pi) : S \subseteq V, \ 0 < \pi(S) \le \frac{1}{2} \right\}. \]
Intuitively, a small value of Φ∗ suggests the existence of a “bottleneck” in N .
Conversely, a large value of Φ∗ indicates that all sets “expand out.” See Fig-
ure 5.3. Note that the quantity ΦE (S; c, π) has a natural probabilistic interpretation:
pick a stationary state and make one step according to the transition matrix; then
ΦE (S; c, π) is the conditional probability that, given that the first state is in S, the
next one is in S c .
Equivalently, the edge expansion constant can be expressed as
\[ \Phi_* := \min\left\{ \frac{c(S, S^c)}{\pi(S) \wedge \pi(S^c)} : S \subseteq V, \ 0 < \pi(S) < 1 \right\}. \]
Example 5.3.3 (Edge expansion: complete graph). Let $G = K_n$ be the complete graph on n vertices. Let $c(x, y) = 1/n^2$ for all x, y (corresponding to picking any vertex uniformly at random at the next step) and $\pi(x) = 1/n$ for all x. For simplicity, take n even. Then for a subset S of size $|S| = k$,
\[ \Phi_E(S; c, \pi) = \frac{|\partial_E S|_c}{|S|_\pi} = \frac{k(n - k)/n^2}{k/n} = \frac{n - k}{n}. \]
Thus, the minimum is achieved for $k = n/2$ and
\[ \Phi_* = \frac{n - n/2}{n} = \frac{1}{2}. \]
◀
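For small state spaces, the edge expansion constant can be computed by brute force directly from Definition 5.3.2. The sketch below enumerates all subsets S with $0 < \pi(S) \le 1/2$; it is exponential in n and meant only as an illustration. The function name and the choice n = 6 are ours.

```python
from itertools import combinations

def edge_expansion(n, c, pi):
    """Brute-force minimum of c(S, S^c) / pi(S) over S with 0 < pi(S) <= 1/2.

    `c` is a function (x, y) -> conductance and `pi` a list of stationary weights.
    """
    best = float("inf")
    for k in range(1, n):
        for subset in combinations(range(n), k):
            inside = set(subset)
            mass = sum(pi[x] for x in inside)
            if mass > 0.5 + 1e-12:          # only sets of stationary mass at most 1/2
                continue
            flow = sum(c(x, y) for x in inside for y in range(n) if y not in inside)
            best = min(best, flow / mass)
    return best

# Complete graph example: c(x, y) = 1/n^2 and pi(x) = 1/n.
n = 6
phi_complete = edge_expansion(n, lambda x, y: 1.0 / n ** 2, [1.0 / n] * n)
```

For the complete graph with n even, the computed value agrees with the value 1/2 derived in Example 5.3.3.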
* It is also called “conductance,” but that terminology clashes with our use of the term.
The associated quadratic form, also known as Dirichlet energy, is $D(f) := D(f, f)$.
where the last expression denotes the variance under π. So the variational characterization of γ implies that
\[ \mathrm{Var}_\pi[f] \le \gamma^{-1} D(f), \]
for all f such that $\pi f = 0$. In fact, it holds for any f by considering $f - \pi f$ and noticing that both sides are unaffected by subtracting a constant from f.
We have shown:
Theorem 5.3.4 (Poincaré inequality for N ). Let P be finite, irreducible and re-
versible with respect to π. Then
\[ \frac{\Phi_*^2}{2} \le \gamma \le 2 \Phi_*. \]
In terms of the relaxation time trel = γ −1 , these inequalities have an intuitive
meaning: the presence or absence of a bottleneck in the state space leads to slow or
fast mixing respectively. We detail some applications to mixing times in the next
subsections.
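The two-sided bound $\Phi_*^2/2 \le \gamma \le 2\Phi_*$ can be checked on a small example. The sketch below uses lazy simple random walk on the n-cycle with n ≥ 3, computing $\Phi_*$ by brute force and taking the spectral gap to be $\gamma = \frac{1}{2}(1 - \cos\frac{2\pi}{n})$. The latter relies on the eigenvalues $\cos(2\pi j/n)$ of simple random walk on the cycle together with the lazy-chain remark after Definition 5.2.11, and is treated here as an assumption. The function name is ours.

```python
import math
from itertools import combinations

def lazy_cycle_cheeger_check(n):
    """Return (phi_star, gamma) for lazy simple random walk on the n-cycle, n >= 3."""
    pi = 1.0 / n                       # uniform stationary distribution
    c = pi * 0.25                      # c(x, y) = pi(x) P(x, y) for neighbors x ~ y
    best = float("inf")
    for k in range(1, n // 2 + 1):     # pi(S) <= 1/2 means |S| <= n/2
        for subset in combinations(range(n), k):
            inside = set(subset)
            # Count pairs (x, y) with x in S, y outside S and x ~ y.
            cut = sum(
                1
                for x in inside
                for y in ((x - 1) % n, (x + 1) % n)
                if y not in inside
            )
            best = min(best, cut * c / (k * pi))
    gamma = (1.0 - math.cos(2.0 * math.pi / n)) / 2.0   # assumed spectral gap
    return best, gamma
```

For n = 8 the minimizing set is an arc of four consecutive vertices, and the returned pair can be checked against both inequalities.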
Before giving a proof of the theorem, we start with a trivial—yet insightful—
example.
Example 5.3.6 (Two-state chain). Let $V := \{0, 1\}$ and, for $\alpha, \beta \in (0, 1)$,
\[ P := \begin{pmatrix} 1 - \alpha & \alpha \\ \beta & 1 - \beta \end{pmatrix} \]
Proof of Theorem 5.3.5. We start with the upper bound. In view of the Poincaré
inequality for N (Theorem 5.3.4), to get an upper bound on the spectral gap, it
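Then
\[ \sum_x \pi(x) f_S(x) = \pi(S) \left[ -\sqrt{\frac{\pi(S^c)}{\pi(S)}} \right] + \pi(S^c) \left[ \sqrt{\frac{\pi(S)}{\pi(S^c)}} \right] = 0, \]
and
\[ \sum_x \pi(x) f_S(x)^2 = \pi(S) \left[ -\sqrt{\frac{\pi(S^c)}{\pi(S)}} \right]^2 + \pi(S^c) \left[ \sqrt{\frac{\pi(S)}{\pi(S^c)}} \right]^2 = 1. \]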
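\begin{align*}
\gamma &\le \frac{D(f_S)}{\mathrm{Var}_\pi[f_S]} \\
&= \frac{1}{2} \sum_{x,y} c(x, y) [f_S(x) - f_S(y)]^2 \\
&= \sum_{x \in S, y \in S^c} c(x, y) \left[ -\sqrt{\frac{\pi(S^c)}{\pi(S)}} - \sqrt{\frac{\pi(S)}{\pi(S^c)}} \right]^2 \\
&= \sum_{x \in S, y \in S^c} c(x, y) \left[ -\frac{\pi(S^c) + \pi(S)}{\sqrt{\pi(S) \pi(S^c)}} \right]^2 \\
&= \frac{c(S, S^c)}{\pi(S) \pi(S^c)} \\
&\le 2 \frac{c(S, S^c)}{\pi(S)},
\end{align*}
as claimed.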
The other direction is trickier. Because we seek an upper bound on the edge expansion constant $\Phi_*$, our goal is to find a cut $(S, S^c)$ such that
\[ \frac{c(S, S^c)}{\pi(S) \wedge \pi(S^c)} \le \sqrt{2\gamma}. \tag{5.3.4} \]
\[ \Phi_* \le \frac{c(S_i, S_i^c)}{\pi(S_i) \wedge \pi(S_i^c)}. \tag{5.3.6} \]
2. (Normalization) Let
\[ f := f_2 - f_2(v_m). \]
\[ g(v_1)^2 + g(v_n)^2 = 1. \]
Lemma 5.3.7.
\[ \frac{1}{2} \sum_{x,y} c(x, y) (g(x) - g(y))^2 \le \gamma \sum_x \pi(x) g(x)^2. \]
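\[ \gamma = \frac{D(f_2)}{\mathrm{Var}_\pi[f_2]}. \]
\[ \gamma = \frac{D(g)}{\mathrm{Var}_\pi[g]}. \]
Now use the fact that $\mathrm{Var}_\pi[g] \le \sum_x \pi(x) g(x)^2$.
3. (Random cut) Pick Θ in $[g(v_1), g(v_n)]$ with density $2|\theta|$. Note that
\[ \int_{g(v_1)}^{g(v_n)} 2|\theta| \, d\theta = g(v_1)^2 + g(v_n)^2 = 1. \]
Finally define
\[ Z := \{v_i : g(v_i) < \Theta\}. \]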
The rest of the proof is calculations. We bound the expectations on both sides
of (5.3.5).
Lemma 5.3.8. The following hold:
(i)
\[ \mathbb{E}[\pi(Z) \wedge \pi(Z^c)] = \sum_x \pi(x) g(x)^2. \]
(ii)
\[ \mathbb{E}[c(Z, Z^c)] \le \left( \frac{1}{2} \sum_{x,y} c(x, y) (g(x) - g(y))^2 \right)^{1/2} \left( 2 \sum_x \pi(x) g(x)^2 \right)^{1/2}. \]
Lemmas 5.3.7 and 5.3.8 immediately imply (5.3.5) and that concludes the proof of Theorem 5.3.5. So it remains to prove this last lemma.
The expression in the first parentheses is equal to $\frac{1}{2} \sum_{x,y} c(x, y) (g(x) - g(y))^2$. So it remains to bound the expression in the second parentheses.
Note that
b-ary tree Let $(Z_t)$ be lazy simple random walk on the ℓ-level rooted b-ary tree, $\widehat{T}_b^\ell$. The root, 0, is on level 0 and the leaves, L, are on level ℓ. All vertices have degree $b + 1$, except for the root which has degree b and the leaves which have degree 1. Recall that the stationary distribution is
\[ \pi(x) := \frac{\delta(x)}{2(n - 1)}, \tag{5.3.8} \]
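By Theorem 5.3.5,
\[ \gamma \le 2 \Phi_* \le \frac{1}{n - 2} \quad \text{and} \quad t_{\mathrm{rel}} = \gamma^{-1} \ge n - 2. \]
Thus by Theorem 5.2.14 and the fact that the chain is lazy
\[ t_{\mathrm{mix}}(\varepsilon) \ge (t_{\mathrm{rel}} - 1) \log \frac{1}{2\varepsilon} = \Omega(n). \]
We showed in Section 4.3.2, using other techniques, that $t_{\mathrm{mix}}(\varepsilon) = \Theta(n)$.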
Cycle Let (Zt ) be lazy simple random walk on the cycle of size n, Zn :=
{0, 1 . . . , n − 1}, where i ∼ j if |j − i| = 1 (mod n). Assume n is even.
and
\[ t_{\mathrm{mix}}(\varepsilon) \le \log\left( \frac{1}{\varepsilon \pi_{\min}} \right) t_{\mathrm{rel}} = O(n^2 \log n). \]
We know from exact eigenvalue computations (see Section 5.2.2 where technically we considered the non-lazy chain; laziness only affects the relaxation time by a factor of 2) that in fact $\gamma = \frac{2\pi^2}{n^2} + O(n^{-4})$. We also showed in that section that $t_{\mathrm{mix}}(\varepsilon) = O(n^2)$. (Exercise 4.15 shows this is tight up to a constant factor.)
Hypercube Let $(Z_t)$ be lazy simple random walk on the n-dimensional hypercube $\mathbb{Z}_2^n := \{0, 1\}^n$ where $i \sim j$ if $\|i - j\|_1 = 1$.
To get a bound on the edge expansion constant, consider the set $S = \{x \in \mathbb{Z}_2^n : x_1 = 0\}$. By symmetry $\pi(S) = \frac{1}{2}$. For each $i \sim j$, $c(i, j) = \frac{1}{2^n} \cdot \frac{1}{2} \cdot \frac{1}{n} = \frac{1}{n 2^{n+1}}$. Hence
\[ \Phi_* \le \frac{2^{n-1} \frac{1}{n 2^{n+1}}}{\frac{1}{2}} = \frac{1}{2n}, \]
where in the numerator we used that $|S| = 2^{n-1}$. By Theorem 5.3.5,
\[ \gamma \le 2 \Phi_* \le \frac{1}{n}. \]
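The computation of the cut ratio for $S = \{x : x_1 = 0\}$ can be reproduced by direct enumeration for small n. The sketch below (with a function name of our choosing) sums the conductances of the edges leaving S and divides by $\pi(S)$; since it enumerates all $2^n$ vertices, it is only meant for small dimensions.

```python
from itertools import product

def hypercube_cut_ratio(n):
    """c(S, S^c) / pi(S) for S = {x : x_1 = 0} under lazy simple random walk on {0,1}^n."""
    vertices = list(product((0, 1), repeat=n))
    pi = 1.0 / 2 ** n                 # uniform stationary distribution
    c = pi * 0.5 / n                  # c(x, y) = pi(x) * (1/2) * (1/n) for neighbors
    flow = 0.0
    for x in vertices:
        if x[0] != 0:
            continue                  # only start from vertices in S
        for i in range(n):
            y = x[:i] + (1 - x[i],) + x[i + 1:]   # flip coordinate i
            if y[0] != 0:
                flow += c             # edge leaves S
    mass = sum(pi for x in vertices if x[0] == 0)
    return flow / mass
```

The returned ratio equals $1/(2n)$, so that $2\Phi_* \le 1/n$, which matches the spectral gap $\gamma = 1/n$ implied by Theorem 5.2.18; the upper bound in Theorem 5.3.5 is therefore tight in this example.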
denote the edge expansion constant of Gn with unit conductances.† Let α > 0. We
say that {Gn }n is a (d, α)-expander family if for all n
Φ∗ (Gn ) ≥ α.
The key point of the definition is that the edge expansion constant of all graphs in
an expander family is bounded away from 0 uniformly in n. Note that it is trivial
to construct such a family if we drop the bounded degree assumption: the edge ex-
pansion constant of the complete graph Kn is 1/2 by Example 5.3.3. On the other
hand, it is far from obvious that one can construct a family of sparse graphs (i.e.,
such that |E(Gn )| = O(|V (Gn )|)) with an edge expansion constant uniformly
bounded away from 0. It turns out that a simple probabilistic construction does the
trick.
We will need the following definition. For a subset $S \subseteq V$, we let the vertex boundary of S be
\[ \partial_V S := \{y \in S^c : \exists x \in S \text{ s.t. } x \sim y\}. \]
†
In terms of random walk, this corresponds to choosing a neighbor uniformly at random and
taking the stationary measure equal to the degree. Note that scaling the stationary measure does not
affect the edge expansion constant.
Claim 5.3.10 (Pinsker’s model: edge expansion constant). There exists $\alpha > 0$ such that
\[ \lim_n \mathbb{P}[\Phi_*(G_n) \ge \alpha] = 1. \]
Proof. For convenience, assume n is even. We need to show that with probability
going to 1, for any S with |S| ≤ |V (Gn )|/2 = n, we have |∂E S| ≥ αd|S| for
some α > 0. We first reduce the proof to a statement about sets of vertices lying
on one side of Gn .
Lemma 5.3.11. There is $\beta > 0$ such that
\[ \lim_n \mathbb{P}\left[ |\partial_V K| \ge (1 + \beta)|K|, \ \forall K \subseteq L, \ |K| \le n/2 \right] = 1. \]
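\[ B_K := \{ |\partial_V K| \le k + \lfloor \beta k \rfloor \}, \]
by considering all subsets of $\{r_{k+1}, \ldots, r_n\}$ of size $\lfloor \beta k \rfloor$ and bounding the probability that all edges out of K fall into one of them and $K'$. Note that there are $\binom{n-k}{\lfloor \beta k \rfloor}$ such subsets. See Figure 5.5.
Since $\sigma_n^1$ and $\sigma_n^2$ are uniform and independent, they each match K to a uniformly chosen subset of the same size in R and we have by a union bound
\[
\mathbb{P}[B_K] \le \binom{n-k}{\lfloor \beta k \rfloor} \left[ \frac{\binom{k + \lfloor \beta k \rfloor}{k}}{\binom{n}{k}} \right]^2 \le \frac{\binom{n}{\lfloor \beta k \rfloor} \binom{k + \lfloor \beta k \rfloor}{\lfloor \beta k \rfloor}^2}{\binom{n}{k}^2},
\]
where we used that $\binom{n}{s} = \binom{n}{n-s}$.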
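where we defined
\[ f_n(k) := \left( \frac{k}{n} \right)^{k(1-\beta)} \left( \frac{e^3 (1 + \beta)^2}{\beta^3} \right)^{\beta k} \mathbf{1}_{\{k \le n/2\}}. \]
Let also
\[ g(k) := \left[ \left( \frac{1}{2} \right)^{1-\beta} \left( \frac{e^3 (1 + \beta)^2}{\beta^3} \right)^{\beta} \right]^k, \]
and notice that for β small enough
\[ f_n(k) \to 0, \]
as $n \to +\infty$, and
\[ \sum_{k=1}^{\infty} g(k) \le \frac{1}{1 - \gamma_\beta} < +\infty. \]
Hence, by the dominated convergence theorem (Theorem B.4.7), combining (5.3.11) and (5.3.12) we get
\[ \mathbb{P}[\exists K \subseteq L, \ |K| \le n/2, \ |\partial_V K| \le (1 + \beta)|K|] \le \sum_{k=1}^{\infty} f_n(k) \to 0. \]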
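\[
Q_\beta(\sigma, \sigma^{i,\gamma}) := \frac{1}{n} \cdot \frac{e^{\gamma \beta S_i(\sigma)}}{e^{-\beta S_i(\sigma)} + e^{\beta S_i(\sigma)}} = \frac{1}{n} \left\{ \frac{1}{2} + \frac{1}{2} \tanh(\gamma \beta S_i(\sigma)) \right\},
\]
where
\[ S_i(\sigma) := \sum_{j \sim i} \sigma_j. \]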
All other transitions have probability 0. Recall that this chain is irreducible and
reversible with respect to µβ . In particular µβ is the stationary distribution of Qβ .
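The two expressions for the Glauber transition probability can be compared numerically. The sketch below reads $Q_\beta(\sigma, \sigma^{i,\gamma})$ as the probability of selecting site i (uniformly among n sites) and setting its spin to $\gamma \in \{-1, +1\}$; this reading, the function names, and the small test graph are assumptions made for illustration.

```python
import math

def local_field(sigma, neighbors, i):
    """S_i(sigma): sum of the spins of the neighbors of site i."""
    return sum(sigma[j] for j in neighbors[i])

def glauber_prob_exp(sigma, neighbors, i, gamma, beta):
    """Transition probability written with exponentials."""
    n = len(sigma)
    s = local_field(sigma, neighbors, i)
    return (1.0 / n) * math.exp(gamma * beta * s) / (math.exp(-beta * s) + math.exp(beta * s))

def glauber_prob_tanh(sigma, neighbors, i, gamma, beta):
    """The same probability written with the hyperbolic tangent."""
    n = len(sigma)
    s = local_field(sigma, neighbors, i)
    return (1.0 / n) * (0.5 + 0.5 * math.tanh(gamma * beta * s))

# Hypothetical test case: a 4-cycle with one disagreeing spin.
cycle4 = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
spins = [1, -1, 1, 1]
```

For each site, the probabilities of the two possible updates sum to 1/n, and the update aligned with the local field is the more likely one.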
We showed in Claim 4.3.15 that the Glauber dynamics is fast mixing at high tem-
perature. More precisely we proved that tmix (ε) = O(n log n) when β < δ̄ −1 .
Here we prove a converse: at low temperature, graphs with good enough expan-
sion properties produce exponentially slow mixing of the Glauber dynamics.
Curie-Weiss model
Let $G = K_n$ be the complete graph on n vertices. In this case, the Ising model is often referred to as the Curie-Weiss model. It is natural to scale β with n. We define $\alpha := \beta(n - 1)$. Since $\bar{\delta} = n - 1$, we have that, when $\alpha < 1$, $\beta = \frac{\alpha}{n-1} < \bar{\delta}^{-1}$ so $t_{\mathrm{mix}}(\varepsilon) = O(n \log n)$. In the other direction, we prove:
Claim 5.3.14 (Curie-Weiss model: slow mixing at low temperature). For α > 1,
tmix (ε) = Ω(exp(r(α)n)) for some function r(α) > 0 not depending on n.
Proof. We first prove exponentially slow mixing when α is large enough, an argument which will be useful in the generalization to expander graphs below.
The idea of the proof is to bound the edge expansion constant and use Theorem 5.3.5. To simplify the proof, assume n is odd. We denote the edge expansion constant of the chain by $\Phi_*^{\mathcal{X}}$ to avoid confusion with that of the base graph G.
Intuitively, because the spins tend to align strongly at low temperature, it takes
a considerable amount of time to travel from a configuration with a majority of
−1s to a configuration with a majority of +1s. Because the model tends to prefer
agreeing spins but does not favor any particular spin, a natural place to look for a
bottleneck is the set
\[ \mathcal{M} := \left\{ \sigma \in \mathcal{X} : \sum_i \sigma_i < 0 \right\}, \]
where the quantity $m(\sigma) := \sum_i \sigma_i$ is called the magnetization. Note that the magnetization is positive if and only if a majority of spins are +1 and that it forms a Markov chain by itself. So the boundary of the set $\mathcal{M}$ must be crossed to travel from configurations with mostly −1 spins to configurations with mostly +1 spins.
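Observe further that $\mu_\beta(\mathcal{M}) = 1/2$. The edge expansion constant is hence bounded by
\[
\Phi_*^{\mathcal{X}} \le \frac{\sum_{\sigma \in \mathcal{M}, \sigma' \notin \mathcal{M}} \mu_\beta(\sigma) Q_\beta(\sigma, \sigma')}{\mu_\beta(\mathcal{M})} = 2 \sum_{\sigma \in \mathcal{M}, \sigma' \notin \mathcal{M}} \mu_\beta(\sigma) Q_\beta(\sigma, \sigma'). \tag{5.3.13}
\]
Because the Glauber dynamics changes a single spin at a time, in order for $\sigma \in \mathcal{M}$ to be adjacent to a configuration $\sigma' \notin \mathcal{M}$, it must be that
\[ \sigma \in \mathcal{M}_{-1} := \{\sigma \in \mathcal{X} : m(\sigma) = -1\}, \]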
we see that the partition function is a sum of O(n) exponentially large terms and is therefore dominated by the term corresponding to the largest exponent. Using Stirling’s formula,
\[ \log \binom{n}{k} = (1 + o(1)) \, n H(k/n), \]
where $H(p) = -p \log p - (1 - p) \log(1 - p)$ is the entropy, and therefore
\[ K_\alpha(p) := H(p) + \alpha \frac{(1 - 2p)^2}{2}. \]
Note that the first term in $K_\alpha(p)$ is increasing on [0, 1/2] while the second term is decreasing on [0, 1/2]. In a sense, we are looking at the tradeoff between the contribution from the entropy (i.e., how many ways are there to have k spins with value −1) and that from the Hamiltonian (i.e., how much such a configuration is favored). We seek to maximize $K_\alpha(p)$ to determine the leading term in the partition function.
By a straightforward computation,
\[ K_\alpha'(p) = \log \frac{1 - p}{p} - 2\alpha(1 - 2p), \]
and
\[ K_\alpha''(p) = -\frac{1}{p(1 - p)} + 4\alpha. \]
Observe first that, when $\alpha < 1$ (i.e., at high temperature), $K_\alpha'(1/2) = 0$ and $K_\alpha''(p) < 0$ for all $p \in [0, 1]$ since $p(1 - p) \le 1/4$. Hence, in that case, $K_\alpha$ is maximized at $p = 1/2$.
In our case of interest, on the other hand, that is, when $\alpha > 1$, $K_\alpha''(p) > 0$ in an interval around 1/2 so there is $p_* < 1/2$ with $K_\alpha(p_*) > K_\alpha(1/2) = \log 2$. So the distribution significantly favors “unbalanced” configurations and crossing $\mathcal{M}_{-1}$ becomes a bottleneck for the Glauber dynamics. Going back to (5.3.17) and bounding $Z(\beta) \ge 2 Y_{\alpha, \lfloor p_* n \rfloor}$, we get
\[ \Phi_*^{\mathcal{X}} = O\left( \exp\left( -n [K_\alpha(p_*) - K_\alpha(1/2)] \right) \right). \]
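The change of behavior of $K_\alpha$ at $\alpha = 1$ is easy to observe numerically. The sketch below maximizes $K_\alpha(p) = H(p) + \alpha (1 - 2p)^2 / 2$ over a grid in (0, 1/2], with natural logarithms in the entropy; the grid search and the function names are illustrative choices rather than part of the argument.

```python
import math

def entropy(p):
    """H(p) = -p log p - (1 - p) log(1 - p), with natural logarithms."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)

def K(alpha, p):
    """K_alpha(p) = H(p) + alpha * (1 - 2p)^2 / 2."""
    return entropy(p) + alpha * (1.0 - 2.0 * p) ** 2 / 2.0

def argmax_K(alpha, grid=1000):
    """Grid maximizer of K_alpha over p in (0, 1/2]."""
    points = [i / (2.0 * grid) for i in range(1, grid + 1)]
    return max(points, key=lambda p: K(alpha, p))
```

With α = 0.5 the maximizer sits at p = 1/2, whereas with α = 1.5 it moves to some $p_* < 1/2$ with $K_\alpha(p_*)$ strictly larger than $K_\alpha(1/2)$; this positive gap is the exponential rate appearing in the bound on the edge expansion constant.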
Expander graphs
In the proof of Claim 5.3.14, the bottleneck slowing down the chain arises as a
result of the fact that, when m(σ) = −1, there is a large number of edges in
the base graph Kn connecting Jσ and Jσc . That produces a low probability for
such configurations under the ferromagnetic Ising model, where agreeing spins are
favored. The same argument easily extends to expander graphs. In words, we prove something that, at first, may seem a bit counter-intuitive: good expansion properties in the base graph produce a bottleneck in the Glauber dynamics at low temperature.
Claim 5.3.15 (Ising model on expander graphs: slow mixing of the Glauber dy-
namics). Let {Gn }n be a (d, γ)-expander family. For large enough inverse tem-
perature β > 0, the Glauber dynamics of the Ising model on Gn satisfies tmix (ε) =
Ω(exp(r(β)|V (Gn )|)) for some function r(β) > 0 not depending on n.
Proof. Let $\mu_\beta$ be the probability distribution over spin configurations under the Ising model over $G_n = (V, E)$ with inverse temperature β. Let $Q_\beta$ be the transition matrix of the Glauber dynamics. For not necessarily disjoint subsets of vertices $W_0, W_1 \subseteq V$ in the base graph $G_n$, let
\[ E(W_0, W_1) := \{\{u, v\} \in E : u \in W_0, v \in W_1\} \]
be the set of edges with one endpoint in $W_0$ and one endpoint in $W_1$. Let $N = |V(G_n)|$ and assume it is odd for simplicity. We use the notation in the proof of Claim 5.3.14. Following the argument in that proof, we observe that (5.3.15) and (5.3.16) still hold. Thus
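\[
\Phi_*^{\mathcal{X}} \le (1 + o(1)) \sum_{\sigma \in \mathcal{M}_{-1}} \frac{\exp\left( \beta \left[ |E(J_\sigma, J_\sigma)| + |E(J_\sigma^c, J_\sigma^c)| - |E(J_\sigma, J_\sigma^c)| \right] \right)}{Z(\beta)}.
\]
As we did in (5.3.17), we bound the partition function $Z(\beta) = \sum_{\sigma \in \mathcal{X}} e^{-\beta H(\sigma)}$ with the term for the all-(−1) configuration, leading to
\begin{align*}
\Phi_*^{\mathcal{X}}
&\le (1 + o(1)) \sum_{\sigma \in \mathcal{M}_{-1}} \frac{\exp\left( \beta \left[ |E(J_\sigma, J_\sigma)| + |E(J_\sigma^c, J_\sigma^c)| - |E(J_\sigma, J_\sigma^c)| \right] \right)}{\exp\left( \beta \left[ |E(J_\sigma, J_\sigma)| + |E(J_\sigma^c, J_\sigma^c)| + |E(J_\sigma, J_\sigma^c)| \right] \right)} \\
&= (1 + o(1)) \sum_{\sigma \in \mathcal{M}_{-1}} \exp\left( -2\beta |E(J_\sigma, J_\sigma^c)| \right) \\
&= (1 + o(1)) \sum_{\sigma \in \mathcal{M}_{-1}} \exp\left( -2\beta |\partial_E J_\sigma^c| \right) \\
&\le (1 + o(1)) \binom{N}{(N+1)/2} \exp\left( -2\beta\gamma d |J_\sigma^c| \right) \\
&= (1 + o(1)) \sqrt{\frac{2}{\pi N}} \, 2^N (1 + o(1)) \exp\left( -\beta\gamma d (N - 1) \right) \\
&\le C_{\beta,\gamma,d} \sqrt{\frac{2}{\pi N}} \exp\left( -N [\beta\gamma d - \log 2] \right),
\end{align*}
for some constant $C_{\beta,\gamma,d} > 0$. We used the definition of an expander family (Definition 5.3.9) on the fourth line above. Taking $\beta > 0$ large enough gives the result.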
For each pair of vertices x, y, let $\nu_{x,y}$ be a directed path between x and y in the digraph $\widetilde{G} = (V, \widetilde{E})$, as a collection of directed edges. Let $|\nu_{x,y}|$ be the number of edges in the path. The congestion ratio associated with the paths $\nu = \{\nu_{x,y}\}_{x,y \in V}$ is
\[ C_\nu = \max_{\vec{e} \in \widetilde{E}} \frac{1}{c(\vec{e})} \sum_{x,y : \vec{e} \in \nu_{x,y}} |\nu_{x,y}| \pi(x) \pi(y). \]
Note that $C_\nu$ tends to be large when many selected paths, called canonical paths, go through the same “congested” edge. To get a good bound in the theorem below, one must choose canonical paths that are well “spread out.”
Theorem 5.3.16 (Canonical paths method). For any choice of paths ν as above, we have the following bound on the spectral gap
\[ \gamma \ge \frac{1}{C_\nu}. \]
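As a small worked instance of the canonical paths method, the sketch below computes the congestion ratio for simple random walk on the path graph {0, ..., n−1} (a one-dimensional case chosen by us for illustration), where the only sensible canonical path between x and y is the straight segment. The comparison value $1 - \cos\frac{\pi}{n-1}$ for the spectral gap is a standard fact about this walk that is not stated in the text and is treated here as an assumption. All function names are ours.

```python
import math

def congestion_ratio_path(n):
    """C_nu for simple random walk on the path 0 - 1 - ... - (n-1), with straight-line paths."""
    deg = [1 if x in (0, n - 1) else 2 for x in range(n)]
    two_e = 2 * (n - 1)                     # 2|E|
    pi = [dx / two_e for dx in deg]         # pi(x) = deg(x) / (2|E|)
    c = 1.0 / two_e                         # c(e) = 1 / (2|E|) for every edge
    worst = 0.0
    for i in range(n - 1):
        # Directed edge i -> i+1 is used by the paths from x <= i to y >= i+1.
        # The reversed edge carries the same load by symmetry.
        load = sum(
            (y - x) * pi[x] * pi[y]
            for x in range(i + 1)
            for y in range(i + 1, n)
        )
        worst = max(worst, load / c)
    return worst

def path_spectral_gap(n):
    """Assumed spectral gap 1 - cos(pi / (n - 1)) of simple random walk on the path."""
    return 1.0 - math.cos(math.pi / (n - 1))
```

For n = 5 the most congested edges are the two middle ones, and the resulting lower bound $1/C_\nu$ can be compared with `path_spectral_gap(5)` to see how much is lost.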
Proof. We establish a Poincaré inequality (5.3.18) with C := Cν . The proof strat-
egy is to start with the variance and manipulate it to bring out canonical paths.
For any $f \in \ell^2(V, \pi)$, it can be checked by expanding that
\[
\mathrm{Var}_\pi[f] = \frac{1}{2} \sum_{x,y} \pi(x) \pi(y) (f(x) - f(y))^2. \tag{5.3.20}
\]
To bring out terms similar to those in (5.3.19), we write $f(x) - f(y)$ as a telescoping
sum over the canonical path between $x$ and $y$. That is, letting $\vec{e}_1, \ldots, \vec{e}_{|\nu_{x,y}|}$ be the
edges in $\nu_{x,y}$, observe that
\[
f(y) - f(x) = \sum_{i=1}^{|\nu_{x,y}|} \nabla f(\vec{e}_i).
\]
≤ Cν D(f ).
Example 5.3.17 (Random walk inside a box). Consider random walk on the fol-
lowing d-dimensional box with sides of length n:
\[
P(x, y) = \frac{1}{|\{z : z \sim x\}|}, \quad \forall x, y \in [n]^d,\ x \sim y,
\]
\[
\pi(x) = \frac{|\{z : z \sim x\}|}{2|E|},
\]
and
\[
c(e) = \frac{1}{2|E|}, \quad \forall e \in E.
\]
We define $\widetilde{E}$ as before.
We use Theorem 5.3.16 to bound the spectral gap. For $x = (x_1, \ldots, x_d)$, $y =
(y_1, \ldots, y_d) \in [n]^d$, we construct $\nu_{x,y}$ by matching each coordinate in turn. That
is, for two vertices $w, z \in [n]^d$ with a single distinct coordinate, let $[w, z]$ be the
directed path from $w$ to $z$ in $\widetilde{G} = (V, \widetilde{E})$ corresponding to a straight line (or the
empty path if $w = z$). Then
\[
\nu_{x,y} = \bigcup_{i=1}^{d} \left[(y_1, \ldots, y_{i-1}, x_i, x_{i+1}, \ldots, x_d),\ (y_1, \ldots, y_{i-1}, y_i, x_{i+1}, \ldots, x_d)\right]. \tag{5.3.21}
\]
It remains to bound
\[
C_\nu = \max_{\vec{e} \in \widetilde{E}} \frac{1}{c(\vec{e})} \sum_{x,y \,:\, \vec{e} \in \nu_{x,y}} |\nu_{x,y}|\, \pi(x) \pi(y),
\]
from above.
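Before bounding $C_\nu$ analytically, it can help to compute it by brute force on a small box. The following Python sketch is only an illustration: the function name `box_congestion` and the choice to enumerate all ordered pairs $(x, y)$ are ours, and it assumes the box is $[n]^d$ with nearest-neighbor edges, $c(e) = 1/(2|E|)$ and the coordinate-matching paths of (5.3.21).

```python
from itertools import product
from math import cos, pi


def box_congestion(n, d):
    """Return (C_nu, max number of pairs (x, y) whose path uses one directed edge)."""
    vertices = list(product(range(1, n + 1), repeat=d))

    def degree(x):
        # One neighbor per coordinate and direction that stays inside the box.
        return sum((xi > 1) + (xi < n) for xi in x)

    two_E = sum(degree(x) for x in vertices)          # this is 2|E|
    stationary = {x: degree(x) / two_E for x in vertices}

    load = {}   # directed edge -> sum of |nu_xy| pi(x) pi(y) over paths through it
    count = {}  # directed edge -> number of pairs (x, y) whose path uses it
    for x in vertices:
        for y in vertices:
            # Build nu_{x,y}: match coordinate 1, then coordinate 2, and so on.
            path, cur = [], list(x)
            for i in range(d):
                step = 1 if y[i] > cur[i] else -1
                while cur[i] != y[i]:
                    nxt = cur.copy()
                    nxt[i] += step
                    path.append((tuple(cur), tuple(nxt)))
                    cur = nxt
            weight = len(path) * stationary[x] * stationary[y]
            for e in path:
                load[e] = load.get(e, 0.0) + weight
                count[e] = count.get(e, 0) + 1

    conductance = 1.0 / two_E                         # c(e) = 1/(2|E|) for every edge
    return max(load.values()) / conductance, max(count.values())


if __name__ == "__main__":
    c_nu, _ = box_congestion(4, 1)
    # For d = 1 the box is a path, whose spectral gap is 1 - cos(pi / (n - 1)).
    print(1.0 / c_nu, 1.0 - cos(pi / 3))
```

For $d = 1$ the walk is simple random walk on a path with $n$ vertices, whose eigenvalues are $\cos(k\pi/(n-1))$, so the lower bound $1/C_\nu$ can be compared with the exact gap. The second return value is the maximum number of pairs routed through a single directed edge.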
Each term in the union defining νx,y contains at most n edges, and therefore
Not attempting to get the best constant factors, the edge weights (i.e., conductances)
satisfy
\[
c(\vec{e}) = \frac{1}{2|E|} \ge \frac{1}{2 \cdot 2d n^d} = \frac{1}{4 d n^d},
\]
for all $\vec{e}$, since there are $n^d$ vertices and each has at most $2d$ incident edges. Likewise,
for any $x$,
\[
\pi(x) = \frac{|\{z : z \sim x\}|}{2|E|} \le \frac{2d}{2 \cdot (d n^d)/2} = \frac{2}{n^d},
\]
\[
= \frac{16 d^2}{n^{d-1}} \max_{\vec{e} \in \widetilde{E}} \left|\{x, y : \vec{e} \in \nu_{x,y}\}\right|.
\]
To bound the cardinality of the set on the last line, we note that any edge $\vec{e} \in \widetilde{E}$
is of the form
that is, the endvertices differ by exactly one unit along a single coordinate. By the
construction of the path νx,y in (5.3.21), if ~e ∈ νx,y then it must lie in the subpath
The remaining components of x and y (of which there are i of the former and
d − (i − 1) of the latter) each has at most n possible values (although not all of
them are allowed), so that
Exercises
Exercise 5.1. Let A be an n × n symmetric random matrix. We assume that
the entries on and above the diagonal, Ai,j , i ≤ j, are independent uniform in
{+1, −1} (and each entry below the diagonal is equal to the corresponding entry
above). Use Talagrand’s inequality (Theorem 3.2.32) to prove concentration of the
largest eigenvalue of A around its mean (which you do not need to compute).
Exercise 5.2. Let G = (V, E, w) be a network.
(i) Prove formula (5.1.3) for the Laplacian quadratic form. (Hint: For an orientation
$G^\sigma = (V, E^\sigma)$ of $G$ (that is, give an arbitrary direction to each edge
to turn it into a digraph), consider the matrix $B^\sigma \in \mathbb{R}^{n \times m}$ where the column
corresponding to arc $(i, j)$ has $-\sqrt{w_{ij}}$ in row $i$ and $\sqrt{w_{ij}}$ in row $j$, and every
other entry is 0.)
for x = (x1 , . . . , xn ) ∈ Rn .
Exercise 5.4 (2-norm). Prove that
[Hint: Use Cauchy-Schwarz (Theorem B.4.8) for one direction, and set $y =
Ax/\|Ax\|_2$ for the other one.]
Exercise 5.5 (Spectral radius of a symmetric matrix). Let A ∈ Rn×n be a sym-
metric matrix. The set $\sigma(A)$ of eigenvalues of $A$ is called the spectrum of $A$ and
Exercise 5.6 (Community recovery in sparse networks). Assume without proof the
following theorem.
Theorem 5.3.18 (Remark 3.13 of [BH16]). Consider a symmetric matrix $Z =
[Z_{i,j}] \in \mathbb{R}^{n \times n}$ whose entries are independent and obey $\mathbb{E} Z_{i,j} = 0$, $Z_{i,j} \le B$
and $\mathbb{E} Z_{i,j}^2 \le \sigma^2$ for all $1 \le i, j \le n$. Then with high probability we have
$\|Z\| \lesssim \sigma\sqrt{n} + B\sqrt{\log n}$.
Let $(X, G) \sim \mathrm{SBM}_{n,p_n,q_n}$. Show that, under the conditions $p_n \gtrsim \frac{\log n}{n}$ and
$\sqrt{p_n/n} = o(p_n - q_n)$, spectral clustering achieves almost exact recovery.
Exercise 5.7 (Parseval’s identity). Prove Parseval’s identity (i.e., (5.2.1)) in the
finite-dimensional case.
Exercise 5.8 (Dirichlet kernel). Prove that for $\theta \neq 0$
\[
1 + 2 \sum_{k=1}^{n} \cos k\theta = \frac{\sin((n + 1/2)\theta)}{\sin(\theta/2)}.
\]
[Hint: Switch to the complex representation and use the formula for a geometric
series.]
Exercise 5.9 (Eigenvalues and periodicity). Let P be a finite irreducible transition
matrix reversible with respect to π over V . Show that if P has a nonzero eigen-
function $f$ with eigenvalue $-1$, then $P$ is not aperiodic. [Hint: Look at $x$ achieving
$\|f\|_\infty$.]
Exercise 5.10 (Mixing time: necessary condition for cutoff). Consider a sequence
of Markov chains indexed by $n = 1, 2, \ldots$. Assume that each chain has a finite
state space and is irreducible, aperiodic, and reversible. Let $t_{\mathrm{mix}}^{(n)}(\varepsilon)$ and $t_{\mathrm{rel}}^{(n)}$ be
respectively the mixing time and relaxation time of the $n$-th chain. The sequence
is said to have pre-cutoff if
\[
\sup_{0 < \varepsilon < 1/2} \limsup_{n \to +\infty} \frac{t_{\mathrm{mix}}^{(n)}(\varepsilon)}{t_{\mathrm{mix}}^{(n)}(1 - \varepsilon)} < +\infty.
\]
Show that if for some $\varepsilon > 0$
\[
\sup_{n \ge 1} \frac{t_{\mathrm{mix}}^{(n)}(\varepsilon)}{t_{\mathrm{rel}}^{(n)}} < +\infty,
\]
then there is no pre-cutoff. In particular, there is no cutoff, as defined in
Remark 4.3.8.
Exercise 5.11 (Relaxation time and variance). Let P be a finite irreducible transi-
tion matrix reversible with respect to π over V . Define
\[
\mathrm{Var}_\pi[g] = \sum_{x \in V} \pi(x) [g(x) - \pi g]^2.
\]
Exercise 5.12 (Lumping). Let $(X_t)$ be a Markov chain on a finite state space $V$
with transition matrix $P$. Suppose there is an equivalence relation $\sim$ on $V$ with
equivalence classes $V^\sharp$, denoting by $[x]$ the equivalence class of $x$, such that $[X_t]$ is a
Markov chain with transition matrix $P^\sharp([x], [y]) = P(x, [y])$.
(i) Compute the transition matrix P and stationary distribution π of the chain
(Xt ). Show that P is reversible with respect to π.
(iii) Compute the spectral gap γ of P in terms of the spectral gaps γj of the Pj s.
Exercise 5.15 (Hypercube revisited). Use Exercise 5.14 to recover Lemma 5.2.17.
Exercise 5.16 (Norm and Rayleigh quotient). Let P be irreducible and reversible
with respect to π > 0.
Exercise 5.17 (Random walk on a box with holes). Consider the random walk in
Example 5.3.17 with d = 2. Suppose we remove from the network an arbitrary
collection of horizontal edges at even heights. Use the canonical paths method to
derive a lower bound on the spectral gap of the form $\gamma \ge 1/(Cn^2)$. [Hint: Modify
the argument in Example 5.3.17 and relate the congestion ratio before and after the
removal.]
Bibliographic Remarks
Section 5.1 General references on the spectral theorem, the Courant-Fischer and
perturbation results include the classics [HJ13, Ste98]. Much more on spectral
graph theory can be gleaned from [Chu97, Nic18]. Section 5.1.4 is based largely
on [Abb18], which gives a broad survey of theoretical results for community re-
covery, and [Ver18, Section 4.5] as well as on scribe notes by Joowon Lee, Aidan
Howells, Govind Gopakumar, and Shuyao Li for “MATH 888: Topics in Mathe-
matical Data Science” taught at the University of Wisconsin–Madison in Fall 2021.
Section 5.2 For a great introduction to Hilbert space theory and its applications
(including to the Dirichlet problem), consult [SS05, Chapters 4,5]. Section 5.2.1
borrows from [LP17, Chapter 12]. A representation-theoretic approach to comput-
ing eigenvalues and eigenfunctions, greatly generalizing the calculations in 5.2.2,
is presented in [Dia88]. The presentation in Section 5.2.3 follows [KP, Section 3]
and [LP16, Section 13.3]. The Varopoulos-Carne bound is due to Carne [Car85]
and Varopoulos [Var85]. For a probabilistic approach to the Varopoulos-Carne
bound see Peyre’s proof [Pey08]. The application to mixing times is from [LP16].
There are many textbooks dedicated to Markov chain Monte Carlo (MCMC) and
its uses in data analysis, for example, [RC04, GL06, GCS+ 14]. See also [Dia09].
A good overview of the techniques developed in the statistics literature to bound the
rate of convergence of MCMC methods (a combination of coupling and Lyapounov
arguments) is [JH01]. A deeper treatment of these ideas is developed in [MT09].
A formal definition of the spectral radius and its relationship to the operator norm
can be found, for instance, in [Rud73, Part III].
Section 5.3 This section follows partly the presentation in [LP16, Section 6.4],
[LPW06, Section 13.6], and [Spi12]. Various proofs of the isoperimetric inequal-
ity can be found in [SS03, SS05]. Theorem 5.3.5 is due to [SJ89, LS88]. The
approach to its proof used here is due to Luca Trevisan. The original Cheeger
inequality was proved, in the context of manifolds, in [Che70]. For a fascinat-
ing introduction to expander graphs and their applications, see [HLW06]. A de-
tailed account of the Curie-Weiss model can be found in [FV18]. Section 5.3.5 is
based partly on [Ber14, Sections 3 and 4]. The method of canonical paths, and
some related comparison techniques, were developed in [JS89, DS91, DSC93b,
DSC93a]. For more advanced functional techniques for bounding the mixing time,
see e.g. [MT06].
Chapter 6
Branching processes
Branching processes, which are the focus of this chapter, arise naturally in the
study of stochastic processes on trees and locally tree-like graphs. Similarly to
martingales, finding a hidden (or not-so-hidden) branching process within a prob-
abilistic model can lead to useful bounds and insights into asymptotic behavior.
After a review of the basic extinction theory of branching processes in Section 6.1
and of a fruitful random-walk perspective in Section 6.2, we give a couple of
examples of applications in discrete probability in Section 6.3. In particular we analyze
the height of a binary search tree, a standard data structure in computer science.
We also give an introduction to phylogenetics, where a “multitype” variant of the
Galton-Watson branching process plays an important role; we use the techniques
derived in this chapter to establish a phase transition in the reconstruction of an-
cestral molecular sequences. We end this chapter in Section 6.4 with a detailed
look into the phase transition of the Erdős-Rényi graph model. The random-walk
perspective mentioned above allows one to analyze the “exploration” of a largest
connected component, leading to information about the “evolution” of its size as
edge density increases. Tools from all chapters come to bear on this final, marquee
application.
6.1 Background
We begin with a review of the theory of Galton-Watson branching processes, a
standard stochastic model for population growth. In particular we discuss extinc-
tion theory. We also briefly introduce a multitype variant, where branching process
Version: December 20, 2023
Modern Discrete Probability: An Essential Toolkit
Copyright © 2023 Sébastien Roch
We denote by $\{p_k\}_{k \ge 0}$ the law of $X(1, 1)$. We also let $f(s) := \mathbb{E}[s^{X(1,1)}]$ be
the corresponding probability generating function. To avoid trivialities we assume
$\mathbb{P}[X(1, 1) = i] < 1$ for all $i \ge 0$. We further assume that $p_0 > 0$.
In words, Zt models the size of a population at time (or generation) t. The random
variable X(i, t) corresponds to the number of offspring of the i-th individual (if
there is one) in generation t − 1. Generation t is formed of all offspring of the
individuals in generation t − 1.
By tracking genealogical relationships, that is, who is whose child, we obtain
a tree T rooted at the single individual in generation 0 with a vertex for each indi-
vidual in the progeny and an edge for each parent-child relationship. We refer to T
as a Galton-Watson tree.
A basic observation about Galton-Watson processes is that their growth (or
decay) is exponential in t.
\[
W_t := m^{-t} Z_t. \tag{6.1.1}
\]
\[
\mathcal{F}_t = \sigma(Z_0, \ldots, Z_t).
\]
In particular, $\mathbb{E}[Z_t] = m^t$.
This is true for all k. Rearranging shows that (Wt ) is a martingale. For the second
claim, note that E[Wt ] = E[W0 ] = 1.
In fact, the martingale convergence theorem (Theorem 3.1.47) gives the following.
Proof. This follows immediately from the martingale convergence theorem for
nonnegative martingales (Corollary 3.1.48).
6.1.2 Extinction
Observe that 0 is a fixed point of the process. The event
Proof. The process (Zt ) is integer-valued and 0 is the only fixed point of the pro-
cess under the assumption that $p_1 < 1$. From any state $k$, the probability of never
coming back to $k > 0$ is at least $p_0^k > 0$, so every state $k > 0$ is transient. So the
only possibilities left are $Z_t \to 0$ and $Z_t \to +\infty$, and the claim follows.
We address the general case using probability generating functions. Let $f_t(s) =
\mathbb{E}[s^{Z_t}]$, where by convention we set $f_t(0) := \mathbb{P}[Z_t = 0]$. Note that, by monotonic-
ity,
Moreover, by the tower property (Lemma B.6.16) and the Markov property (The-
orem 1.1.18), ft has a natural recursive form
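\begin{align*}
f_t(s) &= \mathbb{E}[s^{Z_t}] \\
&= \mathbb{E}[\mathbb{E}[s^{Z_t} \mid \mathcal{F}_{t-1}]] \\
&= \mathbb{E}[f(s)^{Z_{t-1}}] \\
&= f_{t-1}(f(s)) = \cdots = f^{(t)}(s), \tag{6.1.3}
\end{align*}
where $f^{(t)}$ is the $t$-th iterate of $f$. The subcritical case below has an easier proof
(see Exercise 6.1).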
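The recursion (6.1.3) can be checked exactly on a small example. The sketch below is an illustration only: the helper names (`pgf`, `convolve`, `next_generation`, `law_of_Z`) and the test offspring law are ours, and it assumes $Z_0 = 1$ and an offspring law with finite support. It computes the law of $Z_t$ by repeated convolution, which can then be compared with the iterate $f^{(t)}(0)$ and with $\mathbb{E}[Z_t] = m^t$.

```python
from fractions import Fraction


def pgf(p, s):
    """Probability generating function f(s) = sum_k p_k s^k."""
    return sum(pk * s ** k for k, pk in enumerate(p))


def convolve(a, b):
    """Law of the sum of two independent nonnegative integer variables."""
    out = [Fraction(0)] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def next_generation(dist, p):
    """Law of Z_t from the law of Z_{t-1}: a sum of Z_{t-1} i.i.d. offspring counts."""
    out = [Fraction(0)] * ((len(dist) - 1) * (len(p) - 1) + 1)
    k_fold = [Fraction(1)]             # law of the empty sum (k = 0 parents)
    for prob_k in dist:
        for j, q in enumerate(k_fold):
            out[j] += prob_k * q
        k_fold = convolve(k_fold, p)   # law of the sum over one more parent
    return out


def law_of_Z(p, t):
    """Exact law of Z_t started from a single individual (Z_0 = 1)."""
    dist = [Fraction(0), Fraction(1)]
    for _ in range(t):
        dist = next_generation(dist, p)
    return dist
```

With the hypothetical offspring law $p = (1/4, 1/4, 1/2)$, for instance, `law_of_Z(p, 4)[0]` should coincide with four iterations of `pgf` started at 0. Exact rational arithmetic avoids any rounding question.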
Theorem 6.1.6 (Extinction: subcritical and supercritical cases). The probability
of extinction η is given by the smallest fixed point of f in [0, 1]. Moreover:
(i) (Subcritical regime) If m < 1 then η = 1.
Figure 6.1: Fixed points of f in subcritical (left) and supercritical (right) cases.
(ii) If m < 1 then f (t) > t for t ∈ [0, 1). Let η0 := 1 in that case.
Proof. Assume $m > 1$. Since $f'(1) = m > 1$, there is $\delta > 0$ such that $f(1 - \delta) <$
1 − δ. On the other hand f (0) = p0 > 0 so by continuity of f there must be a
fixed point in (0, 1 − δ). Moreover, by strict convexity and the fact that f (1) = 1,
if x ∈ (0, 1) is a fixed point then f (y) < y for y ∈ (x, 1), proving uniqueness.
The second part follows by strict convexity and monotonicity.
Proof. We only prove (i). The argument for (ii) is similar. By monotonicity, for
$x \in [0, \eta_0)$, we have $x < f(x) < f(\eta_0) = \eta_0$. Iterating
\[
x < f^{(1)}(x) < \cdots < f^{(t)}(x) < f^{(t)}(\eta_0) = \eta_0.
\]
The result then follows from the above lemmas together with Equations (6.1.2)
and (6.1.3).
So the process goes extinct with probability 1 when $\lambda \le 1$. For $\lambda > 1$, the
probability of extinction $\eta$ is the smallest solution in $[0, 1]$ to the equation
\[
e^{-\lambda(1 - x)} = x.
\]
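The fixed point has no closed form, but it is easy to approximate. As a minimal sketch (the function name and the iteration count are our choices), one can iterate $f(x) = e^{-\lambda(1-x)}$ starting from 0; by (6.1.3) the $t$-th iterate is $\mathbb{P}[Z_t = 0]$, which increases to $\eta$.

```python
import math


def poisson_extinction(lam, iterations=10_000):
    """Approximate the extinction probability of a Poisson(lam) branching process."""
    x = 0.0
    for _ in range(iterations):
        # One application of the generating function f(x) = exp(-lam * (1 - x)).
        x = math.exp(-lam * (1.0 - x))
    return x


if __name__ == "__main__":
    for lam in (0.5, 2.0, 3.0):
        print(lam, poisson_extinction(lam))
```

Convergence is geometric away from $\lambda = 1$ and slow at $\lambda = 1$, where the derivative of $f$ at the fixed point equals 1.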
We can use these extinction results to obtain more information on the limit in
Lemma 6.1.3. Recall the definition of (Wt ) in (6.1.1). Of course, conditioned on
extinction, W∞ = 0 almost surely. On the other hand:
Proof. Let $T^{(1)}, \ldots, T^{(Z_1)}$ be the descendant subtrees of the children of the root.
We use the notation $T \in A$ to mean that tree $T$ satisfies $A$. By the tower property,
the definition of inherited, and conditional independence,
\begin{align*}
\mathbb{P}[A] &= \mathbb{E}[\mathbb{P}[T \in A \mid Z_1]] \\
&\le \mathbb{E}[\mathbb{P}[T^{(i)} \in A,\ \forall i \le Z_1 \mid Z_1]] \\
&= \mathbb{E}[\mathbb{P}[A]^{Z_1}] \\
&= f(\mathbb{P}[A]).
\end{align*}
\[
\mathbb{E}[W_t^2] = \mathbb{E}[W_{t-1}^2] + \mathbb{E}[(W_t - W_{t-1})^2].
\]
\[
\mathbb{E}[W_t^2] = \mathbb{E}[W_{t-1}^2] + m^{-t-1} \sigma^2.
\]
1 = E[Wt ] → E[W∞ ].
be an array of i.i.d. $\mathbb{Z}_+^\tau$-valued random row vectors with distribution $\{p_k^{(\alpha)}\}$. Let
\[
Z_0 = k_0 \in \mathbb{Z}_+^\tau,
\]
be the initial population at time 0, again as a row vector. Recursively, the popula-
tion vector
\[
Z_t = (Z_{t,1}, \ldots, Z_{t,\tau}) \in \mathbb{Z}_+^\tau,
\]
at time $t \ge 1$ is set to
\[
Z_t := \sum_{\alpha=1}^{\tau} \sum_{i=1}^{Z_{t-1,\alpha}} X^{(\alpha)}(i, t).
\]
That is, mα,β is the expected number of offspring of type β of an individual of type
α. We assume throughout that mα,β < +∞ for all α, β.
To see how M drives the growth of the process, we generalize the proof of
Lemma 6.1.2. By the recursive formula (6.1.4),
τ ZX
X t−1,α
\[
\mathbb{E}[Z_t \mid Z_0] = Z_0 M^t. \tag{6.1.6}
\]
Moreover any real right eigenvector $u$ (as a column vector) of $M$ with real eigen-
value $\lambda \neq 0$ gives rise to a martingale
\[
U_t := \lambda^{-t} Z_t u, \quad t \ge 0, \tag{6.1.7}
\]
since
\[
W_t := \rho^{-t} Z_t w, \quad t \ge 0,
\]
\[
\rho < 1 \implies q = 1. \tag{6.1.8}
\]
Exercise 6.1 asks for the proof. We state the following more general result without
proof. We use the notation of Theorem 6.1.17. We will also refer to the generating
functions
\[
f^{(\alpha)}(s) := \mathbb{E}\left[\prod_{\beta=1}^{\tau} s_\beta^{X_\beta^{(\alpha)}(1,1)}\right], \quad s \in [0, 1]^\tau,
\]
with $f = (f^{(1)}, \ldots, f^{(\tau)})$.
Theorem 6.1.18 (Extinction: multitype case). Let $(Z_t)$ be a positive regular, non-
singular multitype branching process with a finite mean matrix M .
(i) If ρ ≤ 1 then q = 1.
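c- Almost surely
\[
\lim_{t \to +\infty} \rho^{-t} Z_t = v W_\infty,
\]
where $W_\infty$ is a nonnegative random variable.
d- If in addition $\mathrm{Var}[X_\beta^{(\alpha)}(1, 1)] < +\infty$ for all $\alpha, \beta$ then
\[
\mathbb{E}[W_\infty \mid Z_0 = e_\alpha] = w_\alpha,
\]
and
\[
q^{(\alpha)} = \mathbb{P}[W_\infty = 0 \mid Z_0 = e_\alpha],
\]
for all $\alpha \in [\tau]$.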
Remark 6.1.19. As in the single-type case, a theorem of Kesten and Stigum gives a neces-
sary and sufficient condition for the last claim of Theorem 6.1.18 (ii) to hold [KS66b].
Linear functionals Theorem 6.1.18 also characterizes the limit behavior of lin-
ear functionals of the form Zt u for any vector that is not orthogonal to v. In
contrast, interesting new behavior arises when u is orthogonal to v. We will not
derive the general theory here. We only show through a second moment calculation
that a phase transition takes place.
We restrict ourselves to the supercritical case $\rho > 1$ and to $u = (u_1, \ldots, u_\tau)$
being a real right eigenvector of $M$ with a real eigenvalue $\lambda \notin \{0, \rho\}$. Let $U_t$ be
the corresponding martingale from (6.1.7). The vector $u$ is necessarily orthogonal
to $v$. Indeed $vMu$ is equal to both $\rho v u$ and $\lambda v u$. Because $\rho \neq \lambda$ by assumption,
this is only possible if all three expressions are 0. That implies $vu = 0$ since we
also have $\rho \neq 0$ by assumption.
To compute the second moment of Ut , we mimic the computations in the proof
of Lemma 6.1.13. We have
\[
\mathbb{E}[U_t^2 \mid Z_0] = \mathbb{E}[U_{t-1}^2 \mid Z_0] + \mathbb{E}[(U_t - U_{t-1})^2 \mid Z_0],
\]
by the orthogonality of increments (Lemma 3.1.50). Since E[Ut | Ft−1 ] = Ut−1 by
the martingale property, we get
E[(Ut − Ut−1 )2 | Ft−1 ] = Var[Ut | Ft−1 ]
= Var[λ−t Zt u | Ft−1 ]
Xτ ZXt−1,α
where S(u) = (Var[X(1) (1, 1) u], . . . , Var[X(τ ) (1, 1) u]) as a column vector. In
the last display, we used (6.1.4) on the third line and the independence of the
random vectors X(α) (i, t) on the fourth line. Hence, taking expectations and us-
ing (6.1.6), we get
\[
\mathbb{E}[U_t^2 \mid Z_0] = \mathbb{E}[U_{t-1}^2 \mid Z_0] + \lambda^{-2t} Z_0 M^{t-1} S(u),
\]
and finally
\[
\mathbb{E}[U_t^2 \mid Z_0] = (Z_0 u)^2 + \sum_{s=1}^{t} \lambda^{-2s} Z_0 M^{s-1} S(u). \tag{6.1.9}
\]
The case S(u) = 0 is trivial (see Exercise 6.4), so we exclude it from the following
lemma.
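Lemma 6.1.20 (Second moment of $U_t$). Assume $S(u) \neq 0$ and $Z_0 \neq 0$. The
sequence $\mathbb{E}[U_t^2 \mid Z_0]$, $t = 0, 1, 2, \ldots$, is non-decreasing and satisfies
\[
\sup_{t \ge 0} \mathbb{E}[U_t^2 \mid Z_0]
\begin{cases}
< +\infty & \text{if } \rho < \lambda^2, \\
= +\infty & \text{otherwise.}
\end{cases}
\]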
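To see the dichotomy of the lemma concretely, one can evaluate the sum in (6.1.9) in a toy two-type example. Everything specific in the sketch below is an assumption made for illustration: the symmetric mean matrix $M$ with rows $(a, b)$ and $(b, a)$, so that $\rho = a + b$ and $u = (1, -1)$ has eigenvalue $\lambda = a - b$; the initial state $Z_0 = (1, 0)$; and the placeholder variance vector $S(u) = (1, 1)$.

```python
def second_moment(a, b, t):
    """E[U_t^2 | Z_0 = (1, 0)] from (6.1.9) for M = [[a, b], [b, a]] and u = (1, -1)."""
    lam = a - b                      # eigenvalue associated with u
    z = [1.0, 0.0]                   # row vector Z_0 M^{s-1}, starting at s = 1
    S = [1.0, 1.0]                   # hypothetical variance vector S(u)
    total = 1.0                      # (Z_0 u)^2 with Z_0 = (1, 0), u = (1, -1)
    for s in range(1, t + 1):
        total += lam ** (-2 * s) * (z[0] * S[0] + z[1] * S[1])
        z = [z[0] * a + z[1] * b, z[0] * b + z[1] * a]   # right-multiply by M
    return total


if __name__ == "__main__":
    print(second_moment(3.0, 0.5, 200))   # rho = 3.5 < lambda^2 = 6.25
    print(second_moment(2.0, 1.0, 40))    # rho = 3 >= lambda^2 = 1
```

In this example the $s$-th summand is $\rho^{s-1}/\lambda^{2s}$, a geometric series with ratio $\rho/\lambda^2$, which makes the threshold $\rho < \lambda^2$ visible directly.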
Proof. Because $S(u) \neq 0$ and nonnegative and the matrix $M$ is strictly positive by
assumption, we have that
\[
\widetilde{S}(u) := M S(u) > 0.
\]
Since $w$ is also strictly positive, there is $0 < C^- \le C^+ < +\infty$ such that
\[
C^- w \le \widetilde{S}(u) \le C^+ w.
\]
uniformly in t.
In the case $\rho < \lambda^2$, the martingale $(U_t)$ is bounded in $L^2$ and therefore con-
verges almost surely to a limit $U_\infty$ with $\mathbb{E}[U_\infty \mid Z_0] = Z_0 u$ by Theorem 3.1.51.
On the other hand, when $\rho \ge \lambda^2$, it can be shown (we will not do this here) that
$Z_t u / \sqrt{Z_t w}$ satisfies a central limit theorem with a limit independent of $Z_0$. Im-
plications of these claims are illustrated in Section 6.3.2.
Exploration of a graph
Because this will be useful again later, we describe it first in the context of a locally
finite graph G = (V, E). The exploration process starts at an arbitrary vertex
v ∈ V and has 3 types of vertices:
- At : active vertices,
- Et : explored vertices,
- Nt : neutral vertices.
vertex is explored at each time until the set of active vertices is empty. The size of
the connected component of v can be characterized as follows.
Lemma 6.2.1.
τ0 = |Cv |.
Proof. Indeed a single vertex of Cv is explored at each time until all of Cv has been
visited. At that point, At is empty.
- Otherwise, (a) one active vertex becomes an explored vertex and (b) its off-
spring become active vertices. That is,
\[
A_t =
\begin{cases}
A_{t-1} + \underbrace{(-1)}_{(a)} + \underbrace{X_t}_{(b)} & \text{if } t - 1 < \tau_0, \\
0 & \text{otherwise,}
\end{cases}
\]
We let $Y_t := X_t - 1$ and
\[
S_t := 1 + \sum_{s=1}^{t} Y_s,
\]
with $S_0 := 1$. Then
\[
\tau_0 = \inf\{t \ge 0 : S_t = 0\},
\]
and
(At ) = (St∧τ0 ),
is a random walk started at 1 with i.i.d. increments (Yt ) stopped when it hits 0 for
the first time.
We refer to
\[
H = (X_1, \ldots, X_{\tau_0}),
\]
as the history of the process $(Z_i)$. Observe that, under breadth-first search, the
process $(Z_i)$ can be reconstructed from $H$: $Z_0 = 1$, $Z_1 = X_1$, $Z_2 = X_2 + \cdots +
X_{Z_1 + 1}$, and so forth. (Exercise 6.5 asks for a general formula.) As a result, $(Z_i)$
can be recovered from $(S_t)$ as well. We call $(x_1, \ldots, x_t)$ a valid history if
\[
1 + (x_1 - 1) + \cdots + (x_s - 1) > 0,
\]
Theorem 6.2.2 (Duality principle). Let $(Z_i)$ be a branching process with offspring
distribution $\{p_k\}_{k \ge 0}$ and extinction probability $\eta < 1$. Let $(Z'_i)$ be a branching
process with offspring distribution $\{p'_k\}_{k \ge 0}$ where
\[
p'_k = \eta^{k-1} p_k.
\]
Then $(Z_i)$ conditioned on extinction has the same distribution as $(Z'_i)$, which is
referred to as the dual branching process.
Let $f$ be the probability generating function of the offspring distribution of $(Z_i)$.
Note that
\[
\sum_{k \ge 0} p'_k = \sum_{k \ge 0} \eta^{k-1} p_k = \eta^{-1} f(\eta) = 1,
\]
Proof of Theorem 6.2.2. We use the random-walk representation. Let $H = (X_1, \ldots, X_{\tau_0})$
and $H' = (X'_1, \ldots, X'_{\tau'_0})$ be the histories of $(Z_i)$ and $(Z'_i)$ respectively. In the case
of extinction of $(Z_i)$, the history $H$ has finite length.
By definition of the conditional probability, for a valid history $(x_1, \ldots, x_t)$ with
a finite $t$,
\[
\mathbb{P}[H = (x_1, \ldots, x_t) \mid \tau_0 < +\infty] = \frac{\mathbb{P}[H = (x_1, \ldots, x_t)]}{\mathbb{P}[\tau_0 < +\infty]} = \eta^{-1} \prod_{s=1}^{t} p_{x_s}.
\]
Since this is true for all valid histories and the processes can be recovered from
their histories, we have proved the claim.
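\[
p'_k = \eta^{k-1} p_k = \eta^{k-1} e^{-\lambda} \frac{\lambda^k}{k!} = \eta^{-1} e^{-\lambda} \frac{(\lambda\eta)^k}{k!},
\]
\[
p'_k = e^{\lambda(1-\eta)} e^{-\lambda} \frac{(\lambda\eta)^k}{k!} = e^{-\lambda\eta} \frac{(\lambda\eta)^k}{k!}.
\]
That is, the dual branching process has offspring distribution $\mathrm{Poi}(\lambda\eta)$.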
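This identity is easy to confirm numerically. In the sketch below (function names are ours, and $\eta$ is approximated by iterating the generating function from 0), we compare $\eta^{k-1} p_k$ with the $\mathrm{Poi}(\lambda\eta)$ probability mass function.

```python
import math


def poisson_pmf(mu, k):
    """Probability that a Poisson(mu) variable equals k."""
    return math.exp(-mu) * mu ** k / math.factorial(k)


def extinction_probability(lam, iterations=10_000):
    """Smallest fixed point of f(x) = exp(-lam (1 - x)), by iteration from 0."""
    x = 0.0
    for _ in range(iterations):
        x = math.exp(-lam * (1.0 - x))
    return x


def dual_pmf(lam, eta, k):
    """Dual offspring law p'_k = eta^(k-1) p_k for Poisson(lam) offspring."""
    return eta ** (k - 1) * poisson_pmf(lam, k)


if __name__ == "__main__":
    lam = 2.0
    eta = extinction_probability(lam)
    for k in range(5):
        print(k, dual_pmf(lam, eta, k), poisson_pmf(lam * eta, k))
```

Note that the dual mean $\lambda\eta$ is below 1 when $\lambda > 1$, in line with the dual process being one that goes extinct.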
Lemma 6.2.4 (Total progeny and random-walk representation). Let W be the total
progeny of the Galton-Watson branching process (Zi ). Then
W = τ0 .
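As a small deterministic illustration of this identity (a sketch with helper names of our choosing), the code below takes a complete breadth-first history $(x_1, \ldots, x_t)$, computes the hitting time of 0 by the walk $S_t$, and separately rebuilds the generation sizes $Z_0, Z_1, \ldots$ whose sum is the total progeny.

```python
def hitting_time(history):
    """First t with S_t = 1 + sum_{s <= t} (x_s - 1) = 0, or None if S never hits 0."""
    s = 1
    for t, x in enumerate(history, start=1):
        s += x - 1
        if s == 0:
            return t
    return None


def generations(history):
    """Generation sizes Z_0, Z_1, ... recovered from a complete breadth-first history."""
    sizes, pos, current = [], 0, 1
    while current > 0:
        sizes.append(current)
        block = history[pos:pos + current]   # offspring counts of the current generation
        pos += current
        current = sum(block)
    return sizes


if __name__ == "__main__":
    h = (3, 0, 2, 0, 1, 0, 0)
    print(hitting_time(h), generations(h), sum(generations(h)))
```

The function `generations` assumes that the history is complete, that is, that it ends exactly when the walk first hits 0.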
At + t + Nt = n, ∀t ≤ τ0 .
At t = τ0 , At = Nt = 0 and we get
τ0 = n.
P[Ut ≤ 1] = 1.
Fix a positive integer $\ell$. Let $\sigma_\ell$ be the first time $t$ such that $R_t = \ell$. Then
\[
\mathbb{P}[\sigma_\ell = t] = \frac{\ell}{t}\, \mathbb{P}[R_t = \ell].
\]
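The identity can be verified exactly for a small step distribution. The following sketch is illustrative only: the function names and the test step law are ours, and we assume integer-valued i.i.d. steps that are at most 1. It computes both sides with rational arithmetic, the left one by a dynamic program over paths that have not yet visited the level.

```python
from fractions import Fraction


def walk_law(step, t):
    """Law of R_t = U_1 + ... + U_t as a dict {value: probability}."""
    law = {0: Fraction(1)}
    for _ in range(t):
        new = {}
        for r, pr in law.items():
            for u, pu in step.items():
                new[r + u] = new.get(r + u, Fraction(0)) + pr * pu
        law = new
    return law


def first_hit_prob(step, level, t):
    """P[sigma_level = t]: the walk equals `level` for the first time at time t."""
    alive = {0: Fraction(1)}             # mass of paths that have not yet hit `level`
    for s in range(1, t + 1):
        new = {}
        for r, pr in alive.items():
            for u, pu in step.items():
                new[r + u] = new.get(r + u, Fraction(0)) + pr * pu
        hit = new.pop(level, Fraction(0))   # paths hitting `level` exactly at time s
        if s == t:
            return hit
        alive = new
    return Fraction(0)
```

A usage example: with the hypothetical step law `{1: 1/3, 0: 1/6, -1: 1/4, -2: 1/4}` (as `Fraction`s), `first_hit_prob(step, l, t)` can be compared with `Fraction(l, t) * walk_law(step, t).get(l, 0)` for small `l` and `t`.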
t
Finally we get:
τ0 = inf{t ≥ 0 : St = 0}
= inf{t ≥ 0 : 1 + (X1 − 1) + · · · + (Xt − 1) = 0}
= inf{t ≥ 0 : X1 + · · · + Xt = t − 1}.
and
τ0 = inf{t ≥ 0 : Rt = 1}.
The process $(R_t)$ satisfies the assumptions of the hitting-time theorem (Theo-
rem 6.2.5) with $\ell = 1$ and $\sigma_\ell = \tau_0 = W$. Applying the theorem gives the
claim.
Lemma 6.2.8 (Spitzer’s combinatorial lemma). Assume $r_t > 0$. Let $\ell$ be the num-
ber of cyclic permutations such that $t$ is a ladder index. Then $\ell \ge 1$ and each such
cyclic permutation has exactly $\ell$ ladder indices.
We first show that $\ell \ge 1$, that is, there is at least one cyclic permutation where
$t$ is a ladder index. Let $\beta \ge 1$ be the smallest index achieving the maximum of
$r_1, \ldots, r_t$, that is,
rβ+j − rβ ≤ 0 < rt , ∀j = 1, . . . , t − β,
and
rt − [rβ − rj ] < rt , ∀j = 1, . . . , β − 1.
From (6.2.1), in u(β) , t is a ladder index.
For the second claim, since $\ell \ge 1$, we can assume without loss of generality
that $u$ is such that $t$ is a ladder index. (Note that $r_t^{(\beta)} = r_t$ for all $\beta$.) We show
that $\beta$ is a ladder index in $u$ if and only if $t$ is a ladder index in $u^{(\beta)}$. That does
indeed imply the claim as there are $\ell$ cyclic permutations where $t$ is a ladder index
by assumption. We use (6.2.1) again. Observe that $\beta$ is a ladder index in $u$ if and
only if
rβ > r0 ∨ · · · ∨ rβ−1 ,
which holds if and only if
Moreover, because $r_t > r_j$ for all $j$ by the assumption that $t$ is a ladder index, the
last display holds if and only if
and
rt − [rβ − rj ] < rt , ∀j = 1, . . . , β − 1, (6.2.4)
that is, if and only if $t$ is a ladder index in $u^{(\beta)}$ by (6.2.1). Indeed, the second
condition (i.e., (6.2.4)) is intact from (6.2.2), while the first one (i.e., (6.2.3)) can
be rewritten as $r_\beta > -(r_t - r_{\beta+j})$ where the right-hand side is $< 0$ for $j =
1, \ldots, t - \beta - 1$ and $= 0$ for $j = t - \beta$.
Proof of hitting-time theorem We are now ready to prove the hitting-time the-
orem. We only handle the case $\ell = 1$ (which is the one we used for the law of the
total progeny). Exercise 6.6 asks for the full proof.
Proof of Theorem 6.2.5. Recall that $R_t = \sum_{s=1}^{t} U_s$ and $\sigma_1 = \inf\{j \ge 0 : R_j = 1\}$.
By the assumption that $U_s \le 1$ almost surely for all $s$,
\[
\{\sigma_1 = t\} = \{t \text{ is the first ladder index in } R_1, \ldots, R_t\}.
\]
By symmetry, for all $\beta = 0, \ldots, t - 1$,
\begin{align*}
&\mathbb{P}[t \text{ is the first ladder index in } R_1, \ldots, R_t] \\
&\qquad = \mathbb{P}[t \text{ is the first ladder index in } R_1^{(\beta)}, \ldots, R_t^{(\beta)}].
\end{align*}
Let $E_\beta$ be the event on the last line. Then,
\[
\mathbb{P}[\sigma_1 = t] = \mathbb{E}[\mathbf{1}_{E_0}] = \mathbb{E}\left[\frac{1}{t} \sum_{\beta=0}^{t-1} \mathbf{1}_{E_\beta}\right].
\]
Critical value We denote the root by 0. Similarly to what we did in Section 6.1.3,
we think of the open cluster of the root, $C_0$, as the progeny of a branching process
as follows. Denote by $\partial_n$ the $n$-th level of $\widehat{T}_b$, that is, the vertices of $\widehat{T}_b$ at graph
distance $n$ from the root. In the branching process interpretation, we think of the
immediate descendants in $C_0$ of a vertex $v$ as the offspring of $v$. By construction, $v$
has at most $b$ children, independently of all other vertices in the same generation.
In this branching process, the offspring distribution $\{q_k\}_{k=0}^{b}$ is binomial with pa-
rameters $b$ and $p$; $Z_n := |C_0 \cap \partial_n|$ represents the size of the progeny at generation
$n$; and $W := |C_0|$ is the total progeny of the process. In particular $|C_0| < +\infty$ if
and only if the process goes extinct. Because the mean number of offspring is $bp$,
by Theorem 6.1.6, this leads immediately to a second proof of (a rooted variant of)
Claim 2.3.9:
Claim 6.2.9.
\[
p_c\big(\widehat{T}_b\big) = \frac{1}{b}.
\]
is 0 on [0, 1/b], while on (1/b, 1] the quantity η(p) := 1 − θ(p) is the unique
solution in [0, 1) of the fixed point equation
For $b = 2$, for instance, we can compute the fixed point explicitly by noting that
\begin{align*}
0 &= ((1 - p) + ps)^2 - s \\
&= p^2 s^2 + [2p(1 - p) - 1]s + (1 - p)^2,
\end{align*}
For $p \in (\frac{1}{b}, 1)$, the total progeny is infinite with positive probability (and in par-
ticular the expectation is infinite), but we can compute the expected cluster size on
the event that $|C_0| < +\infty$. For this purpose we use the duality principle.
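\begin{align*}
\hat{q}_k &:= [\eta(p)]^{k-1} q_k \\
&= [\eta(p)]^{k-1} \binom{b}{k} p^k (1 - p)^{b-k} \\
&= \frac{[\eta(p)]^{k}}{((1 - p) + p\,\eta(p))^{b}} \binom{b}{k} p^k (1 - p)^{b-k} \\
&= \binom{b}{k} \left(\frac{p\,\eta(p)}{(1 - p) + p\,\eta(p)}\right)^{k} \left(\frac{1 - p}{(1 - p) + p\,\eta(p)}\right)^{b-k} \\
&=: \binom{b}{k} \hat{p}^k (1 - \hat{p})^{b-k},
\end{align*}
where we used (6.2.5) and implicitly defined the dual density
\[
\hat{p} := \frac{p\,\eta(p)}{(1 - p) + p\,\eta(p)}. \tag{6.2.7}
\]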
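Hence, using (6.2.6) with both $p$ and $\hat{p}$ as well as the fact that $\mathbb{P}_p[|C_0| < +\infty] =
\eta(p)$, we have the following.
Claim 6.2.12.
\[
\chi^f(p) := \mathbb{E}_p\left[|C_0| \mathbf{1}_{\{|C_0| < +\infty\}}\right] =
\begin{cases}
\frac{1}{1 - bp}, & p \in [0, \frac{1}{b}), \\
\frac{\eta(p)}{1 - b\hat{p}}, & p \in (\frac{1}{b}, 1).
\end{cases}
\]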
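For $b = 2$, $\eta(p) = 1 - \theta(p) = \left(\frac{1 - p}{p}\right)^2$ so
\[
\hat{p} = \frac{p \left(\frac{1 - p}{p}\right)^2}{(1 - p) + p \left(\frac{1 - p}{p}\right)^2} = \frac{(1 - p)^2}{p(1 - p) + (1 - p)^2} = 1 - p,
\]
and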
Distribution of the open cluster size In fact the hitting-time theorem gives an
explicit formula for the distribution of $|C_0|$. Namely, recall that $|C_0| \stackrel{d}{=} \tau_0$ where
\[
\tau_0 = \inf\{t \ge 0 : S_t = 0\},
\]
for $S_t = \sum_{\ell \le t} X_\ell - (t - 1)$ where $S_0 = 1$ and the $X_\ell$s are i.i.d. binomial with
parameters $b$ and $p$. By Theorem 6.2.6,
\[
\mathbb{P}[\tau_0 = t] = \frac{1}{t}\, \mathbb{P}[S_t = 0],
\]
and we have
\[
\mathbb{P}_p[|C_0| = \ell] = \frac{1}{\ell}\, \mathbb{P}\left[\sum_{i \le \ell} X_i = \ell - 1\right] = \frac{1}{\ell} \binom{b\ell}{\ell - 1} p^{\ell - 1} (1 - p)^{b\ell - (\ell - 1)}, \tag{6.2.8}
\]
where we used that a sum of independent binomials with the same $p$ is itself bino-
mial. In particular at criticality (where $|C_0| < +\infty$ almost surely; see Claim 3.1.52),
using Stirling’s formula (see Appendix A) it can be checked that
\[
\mathbb{P}_{p_c}[|C_0| = \ell] \sim \frac{1}{\ell} \frac{1}{\sqrt{2\pi p_c (1 - p_c) b \ell}} = \frac{1}{\sqrt{2\pi (1 - p_c) \ell^3}},
\]
as $\ell \to +\infty$.
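Formula (6.2.8) is straightforward to evaluate numerically, which gives a quick sanity check of the surrounding claims. The sketch below is an illustration under our own choices (the function name, log-space evaluation via `math.lgamma`, and truncating sums at a finite $\ell$); it assumes $0 < p < 1$.

```python
import math


def cluster_size_pmf(b, p, ell):
    """P_p[|C_0| = ell] from (6.2.8), computed in log space to avoid overflow."""
    n, k = b * ell, ell - 1
    log_binom = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    log_prob = log_binom + k * math.log(p) + (n - k) * math.log1p(-p)
    return math.exp(log_prob) / ell


if __name__ == "__main__":
    b = 2
    for p in (0.3, 0.7):
        total = sum(cluster_size_pmf(b, p, ell) for ell in range(1, 2001))
        print(p, total)   # compare with 1 (p < 1/2) or ((1 - p) / p) ** 2 (p > 1/2)
    ell = 2000
    print(cluster_size_pmf(b, 0.5, ell) * math.sqrt(2 * math.pi * 0.5 * ell ** 3))
```

For $b = 2$ the total mass should be close to 1 below criticality and to $\eta(p) = ((1-p)/p)^2$ above it, while at $p_c = 1/2$ the last printed ratio should be close to 1.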
and
\[
\chi^f(p) \sim \frac{1}{2} |p - p_c|^{-1}.
\]
In fact, as can be seen from Claim 6.2.12, the critical exponent of χf (p) does not
depend on b. The same holds for θ(p) (see Exercise 6.9). Using (6.2.8), the higher
moments of |C0 | can also be studied around criticality (see Exercise 6.10).
6.3 Applications
We develop two applications of branching processes in discrete probability. First,
we prove a result about the height of a random binary search tree. Then we describe
a phase transition in an Ising model on a tree with applications to evolutionary
biology. In the next section, we also use branching processes to study the phase
transition of the Erdős-Rényi random graph model.
- if the root’s key is strictly larger than xi+1 , then move to its left descendant,
otherwise move to its right descendant;
- if such a descendant does not exist then create it and assign it xi+1 as its key;
- otherwise repeat.
Inserting keys (and other operations such as deleting keys, which we do not de-
scribe) takes time proportional to the height Hn of the tree Tn , that is, the length
of the longest path from the root to a leaf. While, in general, the height can be as
large as n (if keys are inserted in order for instance), the typical behavior can be
much smaller.
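The insertion rule above is short enough to sketch in code. The following Python function is only an illustration (the name `bst_height` and the dictionary-based representation are our choices; it assumes distinct keys): it inserts the keys one at a time and returns the height measured in edges.

```python
import random


def bst_height(keys):
    """Height (in edges) of the binary search tree built by inserting keys in order."""
    left, right = {}, {}          # child pointers, indexed by the parent's key
    root = keys[0]
    height = 0
    for key in keys[1:]:
        node, depth = root, 0
        while True:
            # Go left if the current key is strictly larger, otherwise go right.
            children = left if node > key else right
            depth += 1
            if node not in children:
                children[node] = key      # the descendant does not exist: create it
                break
            node = children[node]         # otherwise repeat from the descendant
        height = max(height, depth)
    return height


if __name__ == "__main__":
    n = 10_000
    keys = list(range(n))
    random.shuffle(keys)
    print(bst_height(keys), bst_height(list(range(50))))
```

Inserting sorted keys produces a path, while a uniformly random order typically produces a far shallower tree, which is the regime analyzed in this section.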
other words,
\[
S_{\rho'} = \sigma^{-1}(1) - 1.
\]
Similarly, denoting by $\rho''$ the right descendant of $\rho$, we see that
\[
S_{\rho''} = n - \sigma^{-1}(1).
\]
The second part of this last observation can be checked by direct computation.
Rename $X'_1, \ldots, X'_{S_{\rho'}}$ the keys in the subtree rooted at $\rho'$ in the order that they are
inserted and let $\sigma'$ be the (random) permutation corresponding to their ordering,
that is,
\[
X'_{\sigma'(1)} < X'_{\sigma'(2)} < \cdots < X'_{\sigma'(S_{\rho'})}.
\]
Hn = sup {h : ∃v ∈ Lh , Sv ≥ 1} , (6.3.3)
From Lemma 6.3.2 and the characterization of the height in (6.3.3), we need to
control how fast products of independent uniforms decrease. But that is only half
of the story: the number of paths of length ` from the root grows exponentially
with `. The following lemma, which takes both effects into account, will play a
key role in the analysis. It also explains the definition of γ in (6.3.1). Note that
we ignore—for the time being—the repeated rounding in Lemma 6.3.2; it will turn
out to have a minor effect.
Proof. Taking logarithms turns the product on the left-hand side into a sum of
i.i.d. random variables
\[
2^\ell\, \mathbb{P}\left[U_1 \cdots U_\ell \ge e^{-\ell/c}\right] = 2^\ell\, \mathbb{P}\left[\sum_{i=1}^{\ell} (-\log U_i) \le \ell/c\right]. \tag{6.3.4}
\]
from which (6.3.5) follows: the lower bound is obtained by keeping only the first
term in the sum; the upper bound is obtained by factoring out $y^\ell e^{-y}/\ell!$ and relating
the remaining sum to a geometric series.
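So it remains to prove the general claim. First note that $-\log U_1$ is exponen-
tially distributed. Indeed, for any $y \ge 0$,
\[
\mathbb{P}[-\log U_1 > y] = \mathbb{P}\left[U_1 < e^{-y}\right] = e^{-y}.
\]
So
\[
\mathbb{P}[-\log U_1 \le y] = 1 - e^{-y} = e^{-y} \left\{\sum_{i=0}^{+\infty} \frac{y^i}{i!} - 1\right\} = e^{-y} \left\{\sum_{i=1}^{+\infty} \frac{y^i}{i!}\right\},
\]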
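We return to the proof of Lemma 6.3.3. Plugging (6.3.5) into (6.3.4), we get
\begin{align*}
\frac{2^\ell (\ell/c)^\ell e^{-(\ell/c)}}{\ell!}
&\le 2^\ell\, \mathbb{P}\left[U_1 \cdots U_\ell \ge e^{-\ell/c}\right] \\
&\le \frac{2^\ell (\ell/c)^\ell e^{-(\ell/c)}}{\ell!} \left(\frac{1}{1 - \frac{(\ell/c)}{\ell + 1}}\right). \tag{6.3.6}
\end{align*}
As $\ell \to +\infty$,
\[
\frac{1}{1 - \frac{(\ell/c)}{\ell + 1}} \to \frac{1}{1 - \frac{1}{c}}, \tag{6.3.7}
\]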
which is positive when $c > 1$. We will use the standard bound (see Exercise 1.3
for a proof)
\[
\frac{\ell^\ell}{e^{\ell - 1}} \le \ell! \le \frac{\ell^{\ell + 1}}{e^{\ell - 1}}.
\]
It implies immediately that
By (6.3.1) and the remark following it, the expression in square brackets is $> 1$ or
$< 1$ depending on whether $c < \gamma$ or $c > \gamma$. Combining (6.3.6), (6.3.7) and (6.3.8)
and taking a limit as $\ell \to +\infty$ gives the claim.
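The sign change in the exponential growth rate can be observed numerically from the lower bound in (6.3.6). In the sketch below, the function name `log_rate` and the two trial values $c = 4$ and $c = 5$ are our own choices for illustration; the code computes $\frac{1}{\ell} \log$ of $2^\ell (\ell/c)^\ell e^{-\ell/c}/\ell!$ using `math.lgamma` for $\log \ell!$.

```python
import math


def log_rate(c, ell):
    """(1/ell) * log of the lower bound 2^ell (ell/c)^ell e^(-ell/c) / ell! in (6.3.6)."""
    log_bound = (ell * math.log(2.0) + ell * math.log(ell / c)
                 - ell / c - math.lgamma(ell + 1))
    return log_bound / ell


if __name__ == "__main__":
    for c in (4.0, 5.0):
        print(c, log_rate(c, 10_000))
```

A positive rate for one trial value and a negative rate for the other is consistent with a threshold $\gamma$ lying between them, as the lemma asserts.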
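for any $v \in L_h$, where the first equality follows from (6.3.3). Since
\begin{align*}
2^h\, \mathbb{P}[S_v \ge 1] &\le 2^h\, \mathbb{P}[n U_1 U_2 \cdots U_h \ge 1] \\
&= 2^h\, \mathbb{P}\left[U_1 U_2 \cdots U_h \ge e^{-h/(\gamma + \varepsilon)}\right] \\
&\to 0, \tag{6.3.10}
\end{align*}
as $h \to +\infty$. From (6.3.9) and (6.3.10), we obtain finally that for any $\varepsilon > 0$
\[
\mathbb{P}[H_n / \log n \ge \gamma + \varepsilon] \to 0,
\]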
\[
Z_t^{u,\ell} = \sum_{r=1}^{Z_{t-1}^{u,\ell}} \left|L^*_\ell[u_{t-1,r}]\right|,
\]
and let $u_{t,1}, \ldots, u_{t,Z_t^{u,\ell}}$ be the vertices in $\cup_{r=1}^{Z_{t-1}^{u,\ell}} L^*_\ell[u_{t-1,r}]$ from left to right.
In words, $Z_1^{u,\ell}$ counts the number of vertices $\ell$ levels below $u$ whose subtree sizes
(ignoring rounding) have not decreased “too much” compared to that of $u$ (in the
sense of Lemma 6.3.3). We let such vertices (if any) be $u_{1,1}, \ldots, u_{1,Z_1^{u,\ell}}$. Sim-
ilarly, $Z_2^{u,\ell}$ counts the same quantity over all vertices $\ell$ levels below the vertices
$u_{1,1}, \ldots, u_{1,Z_1^{u,\ell}}$, and so forth.
Because the $W_v$s are i.i.d., this process is indeed a Galton-Watson branching process. The expectation of the offspring distribution (which by symmetry does not depend on the choice of $u$) is
\[
m = \mathbb{E}\left[Z_1^{u,\ell}\right] = 2^\ell\, \mathbb{P}\left[U_1 \cdots U_\ell \ge e^{-\ell/c}\right],
\]
where we used the notation of Lemma 6.3.3. By that lemma, we can choose $\ell$ large enough that $m > 1$. Fix such an $\ell$ for the rest of the proof. In that case, by Theorem 6.1.6, the process survives with probability $1 - \eta$ for some $0 \le \eta < 1$. The relevance of this observation can be seen from taking $u = \rho$.
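As a rough numerical companion to the choice of $\ell$, the following sketch evaluates $m = 2^\ell\, \mathbb{P}[U_1 \cdots U_\ell \ge e^{-\ell/c}]$ exactly. It relies on the standard fact that $-\log(U_1 \cdots U_\ell)$ is a sum of $\ell$ i.i.d. exponential variables with mean 1, so that the probability equals $e^{-y} \sum_{i \ge \ell} y^i/i!$ with $y = \ell/c$. It also returns the two sides of (6.3.6). The function names and the sample values of $\ell$ and $c$ are illustrative assumptions, not part of the text.

```python
import math


def product_tail(ell, c, rel_tol=1e-16):
    """P[U_1 * ... * U_ell >= exp(-ell / c)] for i.i.d. Uniform(0, 1) U_i.

    -log of the product is a sum of ell Exp(1) variables, so the probability
    is exp(-y) * sum_{i >= ell} y**i / i! with y = ell / c.
    """
    y = ell / c
    # First term of the series, computed in log space to avoid overflow.
    term = math.exp(-y + ell * math.log(y) - math.lgamma(ell + 1))
    total, i = term, ell
    while term > rel_tol * total:
        i += 1
        term *= y / i
        total += term
    return total


def offspring_mean(ell, c):
    """m = 2**ell * P[U_1 ... U_ell >= exp(-ell / c)]."""
    return 2.0 ** ell * product_tail(ell, c)


def bounds_6_3_6(ell, c):
    """Left- and right-hand sides of (6.3.6); requires c > 1."""
    y = ell / c
    lower = math.exp(ell * math.log(2.0) - y + ell * math.log(y)
                     - math.lgamma(ell + 1))
    upper = lower / (1.0 - y / (ell + 1))
    return lower, upper


if __name__ == "__main__":
    for c in (3.0, 6.0):          # hypothetical sample values of c
        for ell in (5, 20, 50):
            lo, hi = bounds_6_3_6(ell, c)
            print(c, ell, lo, offspring_mean(ell, c), hi)
```

With these sample values, $\ell = 20$ already gives $m > 1$ for $c = 3$, while $m < 1$ for $c = 6$.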
Claim 6.3.5. Let $c' < c$. Conditioned on survival of $(Z_t^{\rho,\ell})$, for $n$ large enough $H_n \ge c' \log n - \theta_n \ell$ almost surely for some $\theta_n \in [0, 1)$.
\[
Z_k^{\rho,\ell} \ge 1,
\]
\[
n\, U[\rho, v^*] \ge n (e^{-\ell/c})^k.
\]
Now take $s = c' \log n - \theta_n \ell$ with $c' < c$ and $\theta_n \in [0, 1)$ such that $s$ is a multiple of $\ell$. Then
\[
n(e^{-\ell/c})^k = n(e^{-s/c}) = n\left(n^{-c'/c} e^{-\theta_n \ell/c}\right) = n^{1-c'/c} e^{-\theta_n \ell/c}
\ge c' \log n - \theta_n \ell + 1 = s + 1, \tag{6.3.12}
\]
for all $n$ large enough, where we used that $1 - c'/c > 0$, $\theta_n \in [0, 1)$ and $\ell$ is fixed. So, using the characterization of the height in (6.3.2) and (6.3.3) together with inequality (6.3.11), we derive
But this is not quite what we want: this last claim holds only conditioned on
survival; or put differently, it holds with probability 1 − η, a value which could
be significantly smaller than 1 in general. To handle this last issue, we consider a
large number of independent copies of the Galton-Watson process above in order
to “boost” the probability that at least one of them survives to a value arbitrarily
close to 1.
\[
u^*_1, \ldots, u^*_{2^{J\ell}}
\]
Under $B_1^c$, at least one of the branching processes survives; let $I$ be the lowest index among them.
- (Fast decay at the top) To bound the height, we also need to control the effect of the first $J\ell$ levels on the subtree sizes. Let $B_2$ be the event that at least one of the $W$-values associated with the $2^{J\ell} - 1$ vertices ancestral to the $u^*_i$s is outside the interval $(\alpha, 1 - \alpha)$. Choose $\alpha$ small enough that this event has probability $< \delta/2$, that is,
$s = k\ell = c' \log n - \theta_n \ell$,
as before, we have $Z_k^{u^*_I,\ell} \ge 1$ so there is $v^* \in L_s[u^*_I]$ such that
where we used (6.3.14). Observe that (6.3.12) remains valid (for potentially larger $n$) even after multiplying all expressions on the left-hand side of the inequality by $\alpha^{J\ell}$. Arguing as in (6.3.13), we get that $H_n \ge c' \log n - \theta_n \ell + J\ell$. This event holds with probability at least
For any $\varepsilon > 0$, we can choose $c' = \gamma - \varepsilon$ and $c' < c < \gamma$. Further, $\delta$ can be made arbitrarily small (provided $n$ is large enough). Put differently, we have proved that for any $\varepsilon > 0$
\[
\mathbb{P}[H_n / \log n \ge \gamma - \varepsilon] \to 1,
\]
we see that the states $\sigma_1$ at the first level are completely randomized (i.e., independent of $\sigma_0$) with probability $(2p)^2$, in which case we cannot hope to reconstruct the root state better than a coin flip. Intuitively the reconstruction problem is solvable if we can find an estimator of the root state which outperforms a random coin flip as $h$ grows to $+\infty$. Let $\mu_h^+$ be the distribution $\mu_h$ conditioned on the root state $\sigma_0$ being $+1$, and similarly for $\mu_h^-$. Observe that $\mu_h = \frac{1}{2}\mu_h^+ + \frac{1}{2}\mu_h^-$. Recall also that
\[
\|\mu_h^+ - \mu_h^-\|_{\mathrm{TV}} = \frac{1}{2} \sum_{s_h \in \{+1,-1\}^{2^h}} |\mu_h^+(s_h) - \mu_h^-(s_h)|.
\]
Let $\mu_h(s_0|s_h)$ be the posterior probability of the root state, that is, the conditional probability of the root state $s_0$ given the states $s_h$ at level $h$. By Bayes' rule,
\[
\mu_h(+1|s_h) = \frac{(1/2)\mu_h^+(s_h)}{\mu_h(s_h)},
\]
and similarly for $\mu_h(-1|s_h)$. Hence the choice above is equivalent to
\[
\hat\sigma_0(s_h) =
\begin{cases}
+1 & \text{if } \mu_h(+1|s_h) \ge \mu_h(-1|s_h),\\
-1 & \text{otherwise.}
\end{cases}
\]
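Since $\mathbb{P}[\hat\sigma_0(\sigma_h) = \sigma_0] + \mathbb{P}[\hat\sigma_0(\sigma_h) \ne \sigma_0] = 1$, the display above can be rewritten as
\[
\mathbb{P}[\hat\sigma_0^{\mathrm{MAP}}(\sigma_h) \ne \sigma_0] = \frac{1}{2} - \frac{1}{2} \|\mu_h^+ - \mu_h^-\|_{\mathrm{TV}}.
\]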
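The identity between the MAP error probability and the total variation distance can be checked by brute force on small trees. The sketch below enumerates all leaf configurations of the complete binary tree of height $h$, assuming (as in the transition matrix $P$ used in this section) that each edge flips the state independently with probability $p$ and that the root is uniform. The function names are hypothetical, and the enumeration is only practical for very small $h$.

```python
import itertools


def leaf_likelihood(root, leaves, p):
    """Probability of the leaf states `leaves` (tuple of +1/-1 of length 2^d)
    below a vertex in state `root`, when each edge flips with probability p."""
    if len(leaves) == 1:
        return 1.0 if leaves[0] == root else 0.0
    half = len(leaves) // 2
    prob = 1.0
    # The two subtrees are conditionally independent given the root state.
    for part in (leaves[:half], leaves[half:]):
        prob *= sum(
            ((1 - p) if child == root else p) * leaf_likelihood(child, part, p)
            for child in (+1, -1)
        )
    return prob


def map_error_and_tv(h, p):
    """Return (MAP error probability, TV distance between mu_h^+ and mu_h^-)."""
    err, tv = 0.0, 0.0
    for s in itertools.product((+1, -1), repeat=2 ** h):
        plus = leaf_likelihood(+1, s, p)
        minus = leaf_likelihood(-1, s, p)
        tv += 0.5 * abs(plus - minus)
        # MAP picks the root state with the larger posterior, so it errs
        # exactly on the smaller of the two joint masses (root prior is 1/2).
        err += 0.5 * min(plus, minus)
    return err, tv


if __name__ == "__main__":
    for h in (1, 2, 3):
        err, tv = map_error_and_tv(h, p=0.1)
        print(h, err, 0.5 - 0.5 * tv)
```

The two printed columns should agree up to floating-point error for every $h$.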
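Given that $\hat\sigma_0^{\mathrm{MAP}}$ was chosen to minimize the error probability, we also have that for any root estimator $\hat\sigma_0$
\[
\mathbb{P}[\hat\sigma_0(\sigma_h) \ne \sigma_0] \ge \frac{1}{2} - \frac{1}{2} \|\mu_h^+ - \mu_h^-\|_{\mathrm{TV}}.
\]
Since this last inequality also applies to the estimator $-\hat\sigma_0$, we have also that
\[
\mathbb{P}[\hat\sigma_0(\sigma_h) \ne \sigma_0] \le \frac{1}{2}.
\]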
The next lemma summarizes the discussion above.
Lemma 6.3.8 (Probability of erroneous reconstruction). The probability of an erroneous root reconstruction behaves as follows.
(ii) If the reconstruction problem is unsolvable, then for any root estimator $\hat\sigma_0$
\[
\lim_{h \to +\infty} \mathbb{P}[\hat\sigma_0(\sigma_h) \ne \sigma_0] = \frac{1}{2}.
\]
It turns out that the accuracy of the MAP estimator undergoes a phase transition
at a critical mutation probability p∗ . Our main theorem is the following.
\[
2\theta_*^2 = 1,
\]
and set $p_* = \frac{1-\theta_*}{2}$. Then the reconstruction problem is:
Kesten-Stigum bound
The condition in Theorem 6.3.9 is referred to as the Kesten-Stigum bound. We explain why next. We showed in Lemma 6.3.8 that the MAP estimator has an error probability bounded away from $1/2$ if and only if the reconstruction problem is solvable. Of course, other estimators may also achieve that same desirable outcome. In fact, from the lemma, to establish reconstruction solvability it suffices to exhibit one such “better-than-random” estimator. So, rather than analyzing $\hat\sigma_0^{\mathrm{MAP}}$, we look at a simpler estimator first and prove half of Theorem 6.3.9. The other half will be proven below using different ideas.
The key is to notice that a multitype branching process (see Section 6.1.4) hides
in the background. For h ≥ 0, consider the random row vector Zh = (Zh,+ , Zh,− )
where the first component records the number of +1 states (which we refer to as
belonging to the + type) in σh and, likewise, the second component counts the
−1 states (referred to as of − type). Then (Zh )h≥0 is a two-type Galton-Watson
process where each individual has exactly two children. Their types depend on the
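type of the parent. A type $+$ individual has the following offspring distribution:
\[
p^{(+)}_{\mathbf{k}} =
\begin{cases}
(1-p)^2 & \text{if } \mathbf{k} = (2, 0),\\
2p(1-p) & \text{if } \mathbf{k} = (1, 1),\\
p^2 & \text{if } \mathbf{k} = (0, 2),\\
0 & \text{otherwise.}
\end{cases}
\]
Similar expressions hold for $p^{(-)}_{\mathbf{k}}$. The mean matrix is given by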
eigenvector decomposition
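\[
P = \begin{pmatrix} 1-p & p \\ p & 1-p \end{pmatrix}
= \begin{pmatrix} 1/2 & 1/2 \\ 1/2 & 1/2 \end{pmatrix}
+ (1-2p) \begin{pmatrix} 1/2 & -1/2 \\ -1/2 & 1/2 \end{pmatrix}
= \lambda_1 x_1 x_1^T + \lambda_2 x_2 x_2^T,
\]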
The eigenvalues of $M$ are twice those of $P$ while the eigenvectors are the same. In particular, using the notation and convention of the Perron-Frobenius Theorem (Theorem 6.1.17), we have
\[
\rho = 2, \qquad w = \begin{pmatrix} 1/2 \\ 1/2 \end{pmatrix}.
\]
These should not come entirely as a surprise. In particular, recall from Theo-
rem 6.1.18 that ρ can be interpreted as an “overall rate of growth” of the popu-
lation, which here is two since each individual has exactly two children (ignoring
the types).
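The relation between the offspring distribution, the mean matrix and the matrix $P$ can be verified mechanically. The sketch below recomputes the mean matrix from the offspring distribution of a $+$ parent and rebuilds $P$ from its spectral form. The unit-norm eigenvectors $x_1 = (1,1)/\sqrt{2}$ and $x_2 = (1,-1)/\sqrt{2}$ with $\lambda_1 = 1$, $\lambda_2 = 1 - 2p$ are our reading of the decomposition above, and all names are hypothetical.

```python
def mean_matrix(p):
    """Mean matrix of the two-type process: row = parent type (+, -),
    column = expected number of children of type (+, -)."""
    # Offspring law of a + parent, as (number of +, number of -) children.
    law_plus = {(2, 0): (1 - p) ** 2, (1, 1): 2 * p * (1 - p), (0, 2): p ** 2}
    row_plus = [sum(k[j] * w for k, w in law_plus.items()) for j in range(2)]
    row_minus = row_plus[::-1]  # by the +/- symmetry of the model
    return [row_plus, row_minus]


def spectral_form(p):
    """lambda_1 x_1 x_1^T + lambda_2 x_2 x_2^T with unit-norm x_1, x_2."""
    s = 2 ** -0.5
    x1, x2 = (s, s), (s, -s)
    lam1, lam2 = 1.0, 1.0 - 2.0 * p
    return [
        [lam1 * x1[i] * x1[j] + lam2 * x2[i] * x2[j] for j in range(2)]
        for i in range(2)
    ]


if __name__ == "__main__":
    p = 0.3
    print(mean_matrix(p))    # entries close to 2 * P
    print(spectral_form(p))  # entries close to P
```

For any $p$, the first output should be twice the second entry by entry, consistent with the eigenvalues of $M$ being twice those of $P$.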
Let $u = (1, -1)$ be a column vector proportional to the second right eigenvector of $M$. We know from Section 6.1.4 that
\[
U_h = (2\lambda_2)^{-h} Z_h u = \frac{1}{2^h \theta^h} \sum_{\ell \in L_h} \sigma_\ell, \qquad h \ge 0,
\]
\[
\theta := \lambda_2 = 1 - 2p.
\]
Upon looking more closely, the quantity $U_h$ has a natural interpretation: its sign is the majority estimator, that is, $\mathrm{sgn}(U_h) = +1$ if a majority of individuals at level $h$ are of type $+$ (breaking ties in favor of $+$), and is $-1$ otherwise. We indicated previously that we only need to find one estimator with an error probability bounded away from $1/2$ to establish reconstruction solvability for a given value of $p$. The majority estimator
\[
\hat\sigma_0^{\mathrm{Maj}} := \mathrm{sgn}(U_h),
\]
is an obvious one to try. What is less obvious is that it works, all the way to the threshold. This essentially follows from the results of Section 6.1.4, as we detail next.
We begin with an informal discussion. When can $\hat\sigma_0^{\mathrm{Maj}}$ be expected to work? We will not in fact bound the error probability of $\hat\sigma_0^{\mathrm{Maj}}$, but instead analyze directly the properties of $(U_h)$. By our modeling assumptions, $Z_0$ is either $(1, 0)$ or $(0, 1)$ with equal probability. Hence, by the martingale property, we obtain that
\[
\mathbb{E}[U_h \mid Z_0] = Z_0 u = \sigma_0. \tag{6.3.16}
\]
In words, $U_h$ is “centered” around the root state. Intuitively, its second moment therefore captures how informative it is about $\sigma_0$. Lemma 6.1.20 exhibits a phase transition for $\mathbb{E}[U_h^2 \mid Z_0]$. The condition for that lemma to hold is
Proof. By applying the Markov transition matrix $P$ on the first level and using the symmetries of the model, for any $\ell \in L_h$ and $\ell' \in L_{h-1}$, we have
Although we do not strictly need it, we also derive an explicit formula for the
variance. The proof is typical of how conditional independence properties of this
kind of Markov model on trees can be used to derive recursions for quantities of
interest.
where the last line follows from symmetry, with Var+ indicating the conditional
variance given that the root state σ0 is +1. Write Uh = U̇h + Üh as a sum over the
left and right subtrees below the root respectively. Using the conditional indepen-
dence of those two subtrees given the root state, we get from (6.3.18) that
We now use the Markov transition matrix on the first level to derive a recursion
in h. Let σ̇0 be the state at the left child of the root. We use the fact that the random
variables 2θU̇h conditioned on σ̇0 = +1 and Uh−1 conditioned on σ0 = +1 are
identically distributed. Using E+ [U̇h ] = 1/2 (by Lemma 6.3.10 and symmetry),
we get from (6.3.19) that
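\begin{align*}
\mathrm{Var}[U_h] &= 1 - 2\left(\mathbb{E}_+[\dot U_h]\right)^2 + 2\,\mathbb{E}_+\left[\dot U_h^2\right]\\
&= 1 - 2(1/2)^2 + 2\left\{(1-p)\,\mathbb{E}_+\left[(2\theta)^{-2} U_{h-1}^2\right] + p\,\mathbb{E}_-\left[(2\theta)^{-2} U_{h-1}^2\right]\right\}
\end{align*}
by symmetry and the fact that $\mathbb{E}[U_{h-1}] = 0$. Solving the affine recursion (6.3.20) gives
\[
\mathrm{Var}[U_h] = (2\theta^2)^{-h} + (1/2)\sum_{i=0}^{h-1} (2\theta^2)^{-i},
\]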
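The closed-form expression for $\mathrm{Var}[U_h]$ can be compared against an exact computation on small trees. The sketch below computes the law of the sum of the leaf states given the root state by the same two ingredients used in the proof (conditional independence of the two subtrees, then the transition matrix on the top edge), and divides by $(2\theta)^{h}$ to recover the moments of $U_h$. Function names are hypothetical and a uniform root state is assumed.

```python
def leaf_sum_law(h, root, p):
    """Exact law {value: probability} of the sum of the 2^h leaf states
    given that the root is in state `root` (+1 or -1)."""
    if h == 0:
        return {root: 1.0}
    # Law of the leaf sum below one child: mix over the state of that child.
    branch = {}
    for child in (+1, -1):
        w = (1 - p) if child == root else p
        for s, q in leaf_sum_law(h - 1, child, p).items():
            branch[s] = branch.get(s, 0.0) + w * q
    # The two subtrees are conditionally independent given the root: convolve.
    out = {}
    for s1, q1 in branch.items():
        for s2, q2 in branch.items():
            out[s1 + s2] = out.get(s1 + s2, 0.0) + q1 * q2
    return out


def mean_given_plus(h, p):
    """E_+[U_h], where U_h is the leaf sum divided by (2 * theta)^h."""
    theta = 1 - 2 * p
    law = leaf_sum_law(h, +1, p)
    return sum(s * q for s, q in law.items()) / (2 * theta) ** h


def variance_exact(h, p):
    """Var[U_h] under a uniform root; E[U_h] = 0 and E_+[U_h^2] = E_-[U_h^2]."""
    theta = 1 - 2 * p
    law = leaf_sum_law(h, +1, p)
    return sum(s * s * q for s, q in law.items()) / (2 * theta) ** (2 * h)


def variance_formula(h, p):
    """(2 theta^2)^{-h} + (1/2) * sum_{i < h} (2 theta^2)^{-i}."""
    x = 1.0 / (2 * (1 - 2 * p) ** 2)
    return x ** h + 0.5 * sum(x ** i for i in range(h))


if __name__ == "__main__":
    for h in range(6):
        print(h, variance_exact(h, 0.1), variance_formula(h, 0.1))
```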
\[
\bar s_h = \frac{1}{2^h \theta^h} \sum_{\ell \in L_h} s_{h,\ell}.
\]
\begin{align*}
\sum_z |\bar\mu_h^+(z) - \bar\mu_h^-(z)|
&= \sum_z \Big| \sum_{s_h : \bar s_h = z} \left(\mu_h^+(s_h) - \mu_h^-(s_h)\right) \Big|\\
&\le \sum_z \sum_{s_h : \bar s_h = z} |\mu_h^+(s_h) - \mu_h^-(s_h)|\\
&= \sum_{s_h \in \{+1,-1\}^{2^h}} |\mu_h^+(s_h) - \mu_h^-(s_h)|,
\end{align*}
where the first sum is over the support of $\bar\mu_h$. So it suffices to bound from below the left-hand side on the first line.
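For that purpose, we apply Cauchy-Schwarz and use the variance bound in Lemma 6.3.11. First note that $\frac{1}{2}\bar\mu_h^+ + \frac{1}{2}\bar\mu_h^- = \bar\mu_h$ so that, by the triangle inequality,
\[
\frac{|\bar\mu_h^+(z) - \bar\mu_h^-(z)|}{2\bar\mu_h(z)} \le \frac{\bar\mu_h^+(z) + \bar\mu_h^-(z)}{2\bar\mu_h(z)} = 1. \tag{6.3.21}
\]
Hence, we get
\begin{align*}
\sum_z |\bar\mu_h^+(z) - \bar\mu_h^-(z)|
&= \sum_z \frac{|\bar\mu_h^+(z) - \bar\mu_h^-(z)|}{2\bar\mu_h(z)}\, 2\bar\mu_h(z)\\
&\ge 2 \sum_z \left(\frac{\bar\mu_h^+(z) - \bar\mu_h^-(z)}{2\bar\mu_h(z)}\right)^2 \bar\mu_h(z)\\
&\ge 2\, \frac{\left(\sum_z z\, \frac{\bar\mu_h^+(z) - \bar\mu_h^-(z)}{2\bar\mu_h(z)}\, \bar\mu_h(z)\right)^2}{\sum_z z^2\, \bar\mu_h(z)}\\
&\ge \frac{1}{2}\, \frac{\left(\sum_z z \left(\bar\mu_h^+(z) - \bar\mu_h^-(z)\right)\right)^2}{\sum_z z^2\, \bar\mu_h(z)}\\
&= \frac{1}{2}\, \frac{\left(\mathbb{E}_+[U_h] - \mathbb{E}_-[U_h]\right)^2}{\mathrm{Var}[U_h]}\\
&\ge 4\left(1 - (2\theta^2)^{-1}\right) > 0,
\end{align*}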
where we used (6.3.21) on the second line, Cauchy-Schwarz on the third line (after
rearranging), and Lemmas 6.3.10 and 6.3.11 on the last line.
Remark 6.3.12. The proof above and a correlation inequality of [EKPS00, Theorem 1.4]
give a lower bound on the probability of reconstruction of the majority estimator.
Impossibility of reconstruction
The previous result was based on showing that majority voting, that is, $\hat\sigma_0^{\mathrm{Maj}}$, produces a good root-state estimator, up to $p = p_*$. Here we establish that this result is best possible. Majority is not in fact the best root-state estimator: in general its error probability can be higher than that of $\hat\sigma_0^{\mathrm{MAP}}$, as the latter also takes into account the configuration of the states at level $h$. However, perhaps surprisingly, it turns out that the critical threshold for $\hat\sigma_0^{\mathrm{Maj}}$ coincides with that of $\hat\sigma_0^{\mathrm{MAP}}$ in the CFN model.
To prove the second part of Theorem 6.3.9 we analyze the MAP estimator.
Recall that $\mu_h(s_0|s_h)$ is the conditional probability of the root state $s_0$ given the states $s_h$ at level $h$. It will be more convenient to work with the following “root magnetization”
\[
R_h := \mu_h(+1|\sigma_h) - \mu_h(-1|\sigma_h),
\]
which, as a function of $\sigma_h$, is a random variable. Note that $\mathbb{E}[R_h] = 0$ by symmetry. By Bayes' rule and the fact that $\mu_h(+1|\sigma_h) + \mu_h(-1|\sigma_h) = 1$, we have the following alternative formulas which will prove useful
\begin{align*}
R_h &= \frac{1}{2\mu_h(\sigma_h)} \left[\mu_h^+(\sigma_h) - \mu_h^-(\sigma_h)\right], \tag{6.3.22}\\
R_h &= 2\mu_h(+1|\sigma_h) - 1 = \frac{\mu_h^+(\sigma_h)}{\mu_h(\sigma_h)} - 1, \tag{6.3.23}\\
R_h &= 1 - 2\mu_h(-1|\sigma_h) = 1 - \frac{\mu_h^-(\sigma_h)}{\mu_h(\sigma_h)}. \tag{6.3.24}
\end{align*}
It turns out to be enough to prove an upper bound on the variance of Rh .
Lemma 6.3.13 (Second moment bound). It holds that
\[
\|\mu_h^+ - \mu_h^-\|_{\mathrm{TV}} \le \sqrt{\mathbb{E}[R_h^2]}.
\]
Proof. By (6.3.22),
\begin{align*}
\frac{1}{2} \sum_{s_h \in \{+1,-1\}^{2^h}} |\mu_h^+(s_h) - \mu_h^-(s_h)|
&= \sum_{s_h \in \{+1,-1\}^{2^h}} \mu_h(s_h)\, |\mu_h(+1|s_h) - \mu_h(-1|s_h)|\\
&= \mathbb{E}|R_h|\\
&\le \sqrt{\mathbb{E}[R_h^2]},
\end{align*}
where we used Cauchy-Schwarz on the last line.
Let $\bar z_h = \mathbb{E}[R_h^2]$. In view of Lemma 6.3.13, the proof of Theorem 6.3.9 (ii) will follow from establishing the limit
\[
\lim_{h \to +\infty} \bar z_h = 0.
\]
We apply the same kind of recursive argument we used for the analysis of majority
(see in particular Lemma 6.3.11): we condition on the root to exploit conditional
independence; we use the Markov transition matrix on the top edges.
We first derive a recursion for Rh itself—as a random variable. We proceed in
two steps:
- Step 1: we break up the first h levels of the tree into two identical (h − 1)-
level trees with an additional edge at their respective root through conditional
independence;
- Step 2: we account for that edge through the Markov transition matrix.
We will need some notation. Let σ̇h be the states at level h (from the root) below
the left child of the root and let µ̇h be the distribution of σ̇h (and use a superscript
+ to denote the conditional probability given the root is +, and so on). Define
where µ̇h (s0 |ṡh ) is the conditional probability that the root is s0 given that σ̇h =
ṡh . Similarly, denote with a double dot the same quantities with respect to the
subtree below the right child of the root. Expressions similar to (6.3.22), (6.3.23)
and (6.3.24) also hold.
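\[
R_h = \frac{\dot Y_h + \ddot Y_h}{1 + \dot Y_h \ddot Y_h}.
\]
Proof. Using $\mu_h^+(s_h) = \dot\mu_h^+(\dot s_h)\, \ddot\mu_h^+(\ddot s_h)$ by conditional independence, (6.3.22) applied to $R_h$, and (6.3.23) and (6.3.24) applied to $\dot Y_h$ and $\ddot Y_h$, we get
\begin{align*}
R_h &= \frac{1}{2} \sum_{\gamma = +,-} \gamma\, \frac{\mu_h^\gamma(\sigma_h)}{\mu_h(\sigma_h)}\\
&= \frac{1}{2}\, \frac{\dot\mu_h(\dot\sigma_h)\, \ddot\mu_h(\ddot\sigma_h)}{\mu_h(\sigma_h)} \sum_{\gamma = +,-} \gamma\, \frac{\dot\mu_h^\gamma(\dot\sigma_h)\, \ddot\mu_h^\gamma(\ddot\sigma_h)}{\dot\mu_h(\dot\sigma_h)\, \ddot\mu_h(\ddot\sigma_h)}\\
&= \frac{1}{2}\, \frac{\dot\mu_h(\dot\sigma_h)\, \ddot\mu_h(\ddot\sigma_h)}{\mu_h(\sigma_h)} \sum_{\gamma = +,-} \gamma \left(1 + \gamma \dot Y_h\right)\left(1 + \gamma \ddot Y_h\right)\\
&= \frac{\dot\mu_h(\dot\sigma_h)\, \ddot\mu_h(\ddot\sigma_h)}{\mu_h(\sigma_h)}\, (\dot Y_h + \ddot Y_h).
\end{align*}
\begin{align*}
&= \frac{1}{2} \sum_{\gamma = +,-} \left(1 + \gamma \dot Y_h\right)\left(1 + \gamma \ddot Y_h\right)\\
&= 1 + \dot Y_h \ddot Y_h.
\end{align*}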
where ν̇h (ṡ0 |ṡh ) is the conditional probability that the left child of the root is ṡ0
given that the states at level h (from the root) below the left child are σ̇h = ṡh ;
and similarly for the right child of the root. Again expressions similar to (6.3.22),
(6.3.23) and (6.3.24) hold. The following lemma is left as an exercise (see Exer-
cise 6.18).
Ẏh = θḊh .
We are now ready to prove the second half of our main theorem.
Proof of Theorem 6.3.9 (ii). Putting Lemmas 6.3.14 and 6.3.15 together, we get
\[
R_h = \frac{\theta(\dot D_h + \ddot D_h)}{1 + \theta^2 \dot D_h \ddot D_h}. \tag{6.3.25}
\]
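Applied at every level of the tree, the recursion (6.3.25) computes the root magnetization from the leaf states in one bottom-up pass. The sketch below compares this recursive computation with a direct application of Bayes' rule on a small tree. It assumes that $\dot D_h$ and $\ddot D_h$ are the magnetizations of the two children of the root given the leaf states below them (their definitions are not reproduced above), that $0 < p < 1/2$, and that the root is uniform; the function names are hypothetical.

```python
import itertools


def leaf_likelihood(root, leaves, p):
    """Probability of the leaf states `leaves` below a vertex in state `root`,
    when each edge flips the state with probability p."""
    if len(leaves) == 1:
        return 1.0 if leaves[0] == root else 0.0
    half = len(leaves) // 2
    prob = 1.0
    for part in (leaves[:half], leaves[half:]):
        prob *= sum(
            ((1 - p) if child == root else p) * leaf_likelihood(child, part, p)
            for child in (+1, -1)
        )
    return prob


def magnetization_bayes(leaves, p):
    """R_h = mu_h(+1 | leaves) - mu_h(-1 | leaves), directly from Bayes' rule."""
    plus = leaf_likelihood(+1, leaves, p)
    minus = leaf_likelihood(-1, leaves, p)
    return (plus - minus) / (plus + minus)


def magnetization_recursive(leaves, p):
    """Same quantity via R = theta (D' + D'') / (1 + theta^2 D' D'')."""
    if len(leaves) == 1:
        return float(leaves[0])  # a leaf knows its own state
    theta = 1 - 2 * p
    half = len(leaves) // 2
    d_left = magnetization_recursive(leaves[:half], p)
    d_right = magnetization_recursive(leaves[half:], p)
    return theta * (d_left + d_right) / (1 + theta ** 2 * d_left * d_right)


if __name__ == "__main__":
    p, h = 0.2, 3
    gap = max(
        abs(magnetization_bayes(s, p) - magnetization_recursive(s, p))
        for s in itertools.product((+1, -1), repeat=2 ** h)
    )
    print(gap)  # should be at floating-point noise level
```

The recursive version costs one pass over the tree, whereas the direct Bayes computation repeats work at every level.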
We now take expectations. Recall that we seek to compute the second moment of
\begin{align*}
&= \mathbb{E}[(1 + R_h) R_h]\\
&= \mathbb{E}[R_h^2],
\end{align*}
where we used (6.3.23) on the third line and $\mathbb{E}[R_h] = 0$ on the fifth line. So it suffices to compute the conditional first moment.
Using the expansion
\[
\frac{1}{1+r} = 1 - r + \frac{r^2}{1+r},
\]
- (Poisson tail) Let $S_n$ be a sum of $n$ i.i.d. $\mathrm{Poi}(\lambda)$ variables. Recall from (2.4.10) and (2.4.11) that for $a > \lambda$
\[
-\frac{1}{n} \log \mathbb{P}[S_n \ge an] \ge a \log \frac{a}{\lambda} - a + \lambda =: I^{\mathrm{Poi}}_\lambda(a), \tag{6.4.1}
\]
and similarly for $a < \lambda$
\[
-\frac{1}{n} \log \mathbb{P}[S_n \le an] \ge I^{\mathrm{Poi}}_\lambda(a). \tag{6.4.2}
\]
To simplify the notation, we let
where the inequality follows from the convexity of $I_\lambda$ and the fact that it attains its minimum at $\lambda = 1$ where it is $0$.
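The Poisson tail bounds (6.4.1) and (6.4.2) are easy to sanity-check numerically, since $S_n$ is itself Poisson with mean $n\lambda$. The sketch below compares exact tail probabilities with $\exp(-n I^{\mathrm{Poi}}_\lambda(a))$. The function names and the sample parameters are illustrative assumptions; the quantity `rate_poisson(1.0, lam)`, equal to $\lambda - 1 - \log\lambda$, is our reading of $I_\lambda$, whose defining display is not reproduced above.

```python
import math


def rate_poisson(a, lam):
    """I^Poi_lambda(a) = a * log(a / lambda) - a + lambda, for a > 0."""
    return a * math.log(a / lam) - a + lam


def poisson_cdf(k, mean):
    """P[Poi(mean) <= k] for an integer k >= 0, summed term by term."""
    term = math.exp(-mean)
    total = term
    for i in range(1, k + 1):
        term *= mean / i
        total += term
    return total


if __name__ == "__main__":
    n = 10
    # Upper tail, a = 1 > lambda = 0.5: P[S_n >= n] versus exp(-n I(1)).
    upper = 1.0 - poisson_cdf(n - 1, n * 0.5)
    print(upper, math.exp(-n * rate_poisson(1.0, 0.5)))
    # Lower tail, a = 1 < lambda = 2: P[S_n <= n] versus exp(-n I(1)).
    lower = poisson_cdf(n, n * 2.0)
    print(lower, math.exp(-n * rate_poisson(1.0, 2.0)))
```

In both printed pairs the exact probability should be the smaller number.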
Theorem 6.4.1 (Subcritical case: upper bound on the largest cluster). Let $G_n \sim G_{n,p_n}$ where $p_n = \frac{\lambda}{n}$ with $\lambda \in (0, 1)$. For all $\kappa > 0$,
We also give a matching logarithmic lower bound on the size of Cmax in Theo-
rem 6.4.11.
In the supercritical case, that is, when $\lambda > 1$, we prove the existence of a unique connected component of size linear in $n$, which is referred to as the giant component.
Theorem 6.4.2 (Supercritical regime: giant component). Let $G_n \sim G_{n,p_n}$ where $p_n = \frac{\lambda}{n}$ with $\lambda > 1$. For any $\gamma \in (1/2, 1)$ and $\delta < 2\gamma - 1$,
\[
1 - e^{-\lambda\zeta} = \zeta.
\]
In fact, with probability $1 - O(n^{-\delta})$, there is a unique largest component and the second largest connected component has size $O(\log n)$.
Exploration process
Recall that the exploration process started at v has 3 types of vertices: the active
vertices At , the explored vertices Et , and the neutral vertices Nt . We start with
A0 := {v}, E0 := ∅, and N0 contains all other vertices in Gn . We imagine re-
vealing the edges of Gn as they are encountered in this process and we let (Ft ) be
the corresponding filtration. In words, starting with v, the cluster of v is progres-
sively grown by adding to it at each time a vertex adjacent to one of the previously
explored vertices and uncovering its remaining neighbors in Gn .
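The exploration process just described can be sketched in a few lines. The code below runs it on a graph given as a dictionary of neighbor sets and records the sizes of the active, explored and neutral sets at each step. Processing active vertices in first-in-first-out order is an arbitrary choice made here, and the function names as well as the sampler `sample_gnp` are hypothetical helpers.

```python
import random


def sample_gnp(n, p, seed=0):
    """Sample an Erdos-Renyi graph on {0, ..., n-1} as {vertex: set of neighbors}."""
    rng = random.Random(seed)
    adj = {u: set() for u in range(n)}
    for u in range(n):
        for w in range(u + 1, n):
            if rng.random() < p:
                adj[u].add(w)
                adj[w].add(u)
    return adj


def explore(adj, v):
    """Run the exploration process from v.

    Returns the list of (A_t, E_t, N_t) sizes for t = 0, 1, ..., tau_0,
    where tau_0 is the first time the active set is empty.
    """
    active = [v]                  # active vertices, in order of activation
    explored = set()
    neutral = set(adj) - {v}
    history = [(len(active), len(explored), len(neutral))]
    while active:
        u = active.pop(0)         # (a) one active vertex becomes explored
        explored.add(u)
        newly_active = adj[u] & neutral   # (b) its neutral neighbors become active
        neutral -= newly_active
        active.extend(sorted(newly_active))
        history.append((len(active), len(explored), len(neutral)))
    return history


if __name__ == "__main__":
    adj = sample_gnp(200, 1.5 / 200)
    history = explore(adj, 0)
    print("cluster size of vertex 0:", len(history) - 1)
```

The number of steps until the active set empties, `len(history) - 1`, is the size of the cluster of `v`, and the three counts sum to the number of vertices at every step.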
Let as before $A_t := |\mathcal{A}_t|$, $E_t := |\mathcal{E}_t|$, and $N_t := |\mathcal{N}_t|$, and
\[
\tau_0 := \inf\{t \ge 0 : A_t = 0\} = |C_v|,
\]
where the rightmost equality is from Lemma 6.2.1. Recall that $(E_t)$ is non-decreasing while $(N_t)$ is non-increasing, and that the process is fixed for all $t > \tau_0$. Since $E_t = t$ for all $t \le \tau_0$ (as exactly one vertex is explored at each time until the set of active vertices is empty) and $(\mathcal{A}_t, \mathcal{E}_t, \mathcal{N}_t)$ forms a partition of $[n]$ for all $t$, we have
\[
A_t + t + N_t = n, \qquad \forall t \le \tau_0. \tag{6.4.4}
\]
Hence, in tracking the size of the exploration process, we can work with At or Nt .
Moreover at t = τ0 we have
Similarly to the case of a Galton-Watson tree, the processes (At ) and (Nt )
admit a simple recursive form. Conditioning on Ft−1 :
- (Active vertices) If $A_{t-1} = 0$, the exploration process has finished its course and $A_t = 0$. Otherwise, (a) one active vertex becomes explored and (b) its neutral neighbors become active vertices. That is,
\[
A_t = A_{t-1} + \mathbf{1}_{\{A_{t-1} > 0\}} \Big[ \underbrace{-1}_{(a)} + \underbrace{X_t}_{(b)} \Big], \tag{6.4.6}
\]
- (Neutral vertices) Similarly, if $A_{t-1} > 0$, that is, $N_{t-1} < n - (t-1)$, $X_t$ neutral vertices become active. That is,
A
t := At−1 + 1{A >0} − 1 + Xt ,
t−1
(6.4.11)
with A
0 := 1. In words, (At ) is the size of the active set of a Galton-Watson
branching process with offspring distribution Poi(λ), as defined in Section 6.2.1.
As a result, letting
Wλ = τ0 := inf{t ≥ 0 : A
t = 0},
Lower bound: In the other direction, we proceed in two steps. We first show that,
up to a certain time, the process is bounded from below by a branching process
with binomial offspring distribution. In a second step, we show that this binomial
branching process can be approximated by a Poisson branching process.
\[
A^\prec_t := A^\prec_{t-1} + \mathbf{1}_{\{A^\prec_{t-1} > 0\}} \left[ -1 + X^\prec_t \right], \tag{6.4.12}
\]
with $A^\prec_0 := 1$, where
\[
X^\prec_t := \sum_{j=1}^{n-k_n} I_{t,j}. \tag{6.4.13}
\]
Note that we use the same $I_{t,j}$s as in the definition of $X_t$, that is, we couple the two processes. This time $(A^\prec_t)$ is the size of the active set in the exploration process of a Galton-Watson branching process with offspring distribution $\mathrm{Bin}(n - k_n, p_n)$. Let
\[
\tau_0^\prec := \inf\{t \ge 0 : A^\prec_t = 0\},
\]
be the total progeny of this branching process. We prove the following rela-
tionship between τ0 and τ0≺ .
\[
A_t \ge A^\prec_t, \qquad \forall t \le \sigma_{n-k_n},
\]
On the other hand, when $\sigma_{n-k_n} = +\infty$, $N_t > n - k_n$ for all $t$ (in particular for $t = \tau_0$) and therefore $|C_v| = \tau_0 = n - N_{\tau_0} < k_n$ by (6.4.5). Moreover in that case, because $A_t \ge A^\prec_t$ for all $t \le \sigma_{n-k_n} = +\infty$, it holds in addition that $\tau_0^\prec \le \tau_0 < k_n$. To sum up, we have proved the implications
2. (Poisson approximation) Our next step is to approximate the tail of $\tau_0^\prec$ by that of $\tau_0$.
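where the $X_i^\prec$s are independent $\mathrm{Bin}(n - k_n, p_n)$. Note further that, because the sum of independent binomials with the same success probability is binomial,
\[
\sum_{i=1}^t X_i^\prec \sim \mathrm{Bin}(t(n - k_n), p_n).
\]
Recall on the other hand that $(X_t)$ is $\mathrm{Poi}(\lambda)$ and, because a sum of independent Poisson random variables is Poisson (see Exercise 6.7), we have
\[
\mathbb{P}[\tau_0 = t] = \frac{1}{t}\, \mathbb{P}\left[\sum_{i=1}^t X_i = t - 1\right], \tag{6.4.15}
\]
where
\[
\sum_{i=1}^t X_i \sim \mathrm{Poi}(t\lambda).
\]
Finally, recalling that $p_n = \lambda/n$, combining the last three displays and using the triangle inequality for the total variation distance,
\begin{align*}
&\left| \mathbb{P}\left[\sum_{i=1}^t X_i^\prec = t - 1\right] - \mathbb{P}\left[\sum_{i=1}^t X_i = t - 1\right] \right|\\
&\qquad \le \frac{1}{2}\, t(n - k_n)\left[-\log(1 - p_n)\right]^2 + \left| t\lambda - t(n - k_n)\left(-\log(1 - p_n)\right) \right|\\
&\qquad \le \frac{1}{2}\, tn \left(\frac{\lambda}{n} + O\left(\frac{\lambda^2}{n^2}\right)\right)^2 + \left| t\lambda - t(n - k_n)\left(\frac{\lambda}{n} + O\left(\frac{\lambda^2}{n^2}\right)\right) \right|\\
&\qquad = O\left(\frac{t k_n}{n}\right),
\end{align*}
as claimed.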
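Formula (6.4.15) makes the law of the total progeny of the Poisson branching process explicit: since the sum is $\mathrm{Poi}(t\lambda)$, the probability of a total progeny of $t$ is $e^{-\lambda t}(\lambda t)^{t-1}/t!$. The sketch below evaluates this mass function and checks that it sums to 1 when $\lambda < 1$ and to the extinction probability $1 - \zeta_\lambda$ when $\lambda > 1$, where $\zeta_\lambda$ is computed as the positive solution of $1 - e^{-\lambda\zeta} = \zeta$ by fixed-point iteration. The function names, the truncation level and the iteration count are assumptions made for illustration.

```python
import math


def total_progeny_pmf(t, lam):
    """(1/t) * P[Poi(t * lam) = t - 1] = exp(-lam t) (lam t)^(t-1) / t!,
    computed in log space for numerical stability."""
    return math.exp(-lam * t + (t - 1) * math.log(lam * t) - math.lgamma(t + 1))


def survival_probability(lam, iterations=500):
    """Largest solution in [0, 1] of 1 - exp(-lam * z) = z, obtained by
    iterating z -> 1 - exp(-lam * z) from z = 1 (tends to 0 when lam < 1)."""
    z = 1.0
    for _ in range(iterations):
        z = 1.0 - math.exp(-lam * z)
    return z


if __name__ == "__main__":
    for lam in (0.5, 2.0):
        mass = sum(total_progeny_pmf(t, lam) for t in range(1, 2001))
        print(lam, mass, 1.0 - survival_probability(lam))
```

For $\lambda > 1$ the missing mass is the probability that the total progeny is infinite, which is how the survival probability enters the supercritical analysis.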
Lemma 6.4.6 (Subcritical regime: upper bound on cluster size). Let $G_n \sim G_{n,p_n}$ where $p_n = \frac{\lambda}{n}$ with $\lambda \in (0, 1)$ and let $C_v$ be the connected component of $v \in [n]$. For all $\kappa > 0$,
where the $X_i$s are i.i.d. $\mathrm{Poi}(\lambda)$. Both terms on the right-hand side depend on whether the mean $\lambda$ is smaller or larger than 1. When $\lambda < 1$, the Poisson branching process goes extinct with probability 1 by the extinction theory (Theorem 6.1.6). Hence $\mathbb{P}[W_\lambda = +\infty] = 0$.
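As to the second term, the sum of the $X_i$s is $\mathrm{Poi}(\lambda t)$. Using the Poisson tail (6.4.1) for $\lambda < 1$ and $k_n = \omega(1)$,
\begin{align*}
\sum_{t \ge k_n} \frac{1}{t}\, \mathbb{P}\left[\sum_{i=1}^t X_i = t - 1\right]
&\le \sum_{t \ge k_n} \mathbb{P}\left[\sum_{i=1}^t X_i \ge t - 1\right]\\
&\le \sum_{t \ge k_n} \exp\left(-t\, I^{\mathrm{Poi}}_\lambda\left(\frac{t-1}{t}\right)\right)\\
&\le \sum_{t \ge k_n} \exp\left(-t\left(I_\lambda - O(t^{-1})\right)\right)\\
&\le C \sum_{t \ge k_n} \exp\left(-t I_\lambda\right)\\
&= O\left(\exp\left(-I_\lambda k_n\right)\right), \tag{6.4.17}
\end{align*}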
Proof of Theorem 6.4.1. Let again $c = (1 + \kappa) I_\lambda^{-1}$ for $\kappa > 0$. By a union bound and symmetry,
By Lemma 6.4.6,
as $n \to +\infty$.
In fact we prove below that the largest component is indeed of size roughly $I_\lambda^{-1} \log n$. But first we turn to the supercritical regime.
1 − e−λζ = ζ.
\[
\mathbb{P}_{n,p_n}\left[|C_v| \ge (1 + \kappa) I_\lambda^{-1} \log n\right] = \zeta_\lambda + O\left(\frac{\log^2 n}{n}\right).
\]
Note the small but critical difference with Lemma 6.4.6: this time the branching process can survive. This happens with probability $\zeta_\lambda$ by extinction theory (Theorem 6.1.6). In that case, we will need further arguments to nail down the cluster size. Observe also that the result holds for a fixed vertex $v$ and therefore does not yet tell us about the largest cluster. We come back to the latter in the next subsection.
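Proof of Lemma 6.4.7. We adapt the proof of Lemma 6.4.6, beginning with (6.4.16) which, recall, states
\[
\mathbb{P}[W_\lambda \ge k_n] = \mathbb{P}[W_\lambda = +\infty] + \sum_{t \ge k_n} \frac{1}{t}\, \mathbb{P}\left[\sum_{i=1}^t X_i = t - 1\right],
\]
where the $X_i$s are i.i.d. $\mathrm{Poi}(\lambda)$. When $\lambda > 1$, $\mathbb{P}[W_\lambda = +\infty] = \zeta_\lambda$, where $\zeta_\lambda > 0$ is the survival probability of the branching process by Example 6.1.10. As to the second term, using (6.4.2) for $\lambda > 1$,
\begin{align*}
\sum_{t \ge k_n} \frac{1}{t}\, \mathbb{P}\left[\sum_{i=1}^t X_i = t - 1\right]
&\le \sum_{t \ge k_n} \mathbb{P}\left[\sum_{i=1}^t X_i \le t\right]\\
&\le \sum_{t \ge k_n} \exp\left(-t I_\lambda\right)\\
&\le C \exp\left(-I_\lambda k_n\right), \tag{6.4.20}
\end{align*}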
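\[
\mathbb{P}_{n,p_n}[|C_v| \ge c \log n] = \mathbb{P}[W_\lambda \ge c \log n] + O\left(\frac{\log^2 n}{n}\right). \tag{6.4.21}
\]
\[
\mathbb{P}_{n,p_n}[|C_v| \ge c \log n] = \zeta_\lambda + O\left(\frac{\log^2 n}{n}\right), \tag{6.4.23}
\]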
as claimed.
Recall that the Poisson branching process approximation was based on the fact
that the degree of a vertex is well approximated by a Poisson distribution. When
the exploration process goes on for too long however (i.e., when kn is large), this
approximation is not as accurate because of a saturation effect: at each step of the
exploration, we uncover edges to the neutral vertices (which then become active);
and, because an Erdős-Rényi graph has a finite pool of vertices from which to
draw these edges, as the number of neutral vertices decreases so does the expected
number of uncovered edges. Instead we use the following lemma which explicitly
accounts for the dwindling size of Nt . Roughly speaking, we model the set of
neutral vertices as a process that discards a fraction pn of its current set at each
time step (i.e., those neutral vertices with an edge to the current explored vertex).
Lemma 6.4.8. Let $G_n \sim G_{n,p_n}$ where $p_n = \frac{\lambda}{n}$ with $\lambda > 0$ and let $C_v$ be the connected component of $v \in [n]$. Let $Y_t \sim \mathrm{Bin}(n - 1, 1 - (1 - p_n)^t)$. Then, for any $t$,
Proof. We work with neutral vertices. By (6.4.4) and Lemma 6.2.1, for any t,
It is easier to consider the process without the indicator as it has a simple distribution. Define $N'_0 := n - 1$ and
\[
N'_t := N'_{t-1} - \sum_{i=1}^{N'_{t-1}} I_{t,i},
\]
and observe that $N_t \ge N'_t$ for all $t$, as the two processes agree up to time $\tau_0$ at which point $N_t$ stays fixed. The interpretation of $N'_t$ is straightforward: starting with $n - 1$ vertices, at each time each remaining vertex is discarded with probability $p_n$. Hence, the number of surviving vertices at time $t$ has distribution
\[
N'_t \sim \mathrm{Bin}(n - 1, (1 - p_n)^t),
\]
The previous lemma gives the following additional bound on the cluster size
in the supercritical regime. Together with Lemma 6.4.7 it shows that, when |Cv | >
c log n, the cluster size is in fact linear in n with high probability. We will have
more to say about the largest cluster in the next subsection.
\[
1 - e^{-\lambda\zeta} = \zeta.
\]
For any $\alpha < \zeta_\lambda$ and any $\delta > 0$, there exists $\kappa_{\delta,\alpha} > 0$ large enough so that
\[
1 - e^{-\lambda\zeta} - \zeta = 0.
\]
Let $\alpha < \zeta_\lambda$.
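For any $t \in [c \log n, \alpha n]$, by the Chernoff bound for Poisson trials (Theorem 2.4.7 (ii)(b)),
\[
\mathbb{P}[Y_t \le t] \le \exp\left(-\frac{\mu_t}{2}\left(1 - \frac{t}{\mu_t}\right)^2\right). \tag{6.4.26}
\]
For $t/n \le \alpha < \zeta_\lambda$, using $1 - x \le e^{-x}$ for $x \in (0, 1)$ (see Exercise 1.16), there is $\gamma_\alpha > 1$ such that
\begin{align*}
\mu_t &\ge (n - 1)\left(1 - e^{-\lambda(t/n)}\right)\\
&= t \left(\frac{n-1}{n}\right) \frac{1 - e^{-\lambda(t/n)}}{t/n}\\
&\ge t \left(\frac{n-1}{n}\right) \frac{1 - e^{-\lambda\alpha}}{\alpha}\\
&\ge \gamma_\alpha t,
\end{align*}
for $n$ large enough, where we used that $1 - e^{-\lambda x}$ is increasing in $x$ on the third line and that $1 - e^{-\lambda x} - x > 0$ for $0 < x < \zeta_\lambda$ on the fourth line (as can be checked by computing the first and second derivatives). Plugging this back into (6.4.26), we get
\[
\mathbb{P}[Y_t \le t] \le \exp\left(-t \left\{\frac{\gamma_\alpha}{2}\left(1 - \frac{1}{\gamma_\alpha}\right)^2\right\}\right).
\]
Therefore
\begin{align*}
\sum_{t = c\log n}^{\alpha n} \mathbb{P}_{n,p_n}[|C_v| = t]
&\le \sum_{t = c\log n}^{\alpha n} \mathbb{P}[Y_t \le t]\\
&\le \sum_{t = c\log n}^{+\infty} \exp\left(-t \left\{\frac{\gamma_\alpha}{2}\left(1 - \frac{1}{\gamma_\alpha}\right)^2\right\}\right)\\
&= O\left(\exp\left(-c \log n \left\{\frac{\gamma_\alpha}{2}\left(1 - \frac{1}{\gamma_\alpha}\right)^2\right\}\right)\right).
\end{align*}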
The quantity $S_{k_n}$ is the number of small vertices. By Lemma 6.4.7, its expectation is
Lemma 6.4.10, which is proved below, leads to our main result in the supercritical case: the existence of the giant component, a unique cluster $C_{\max}$ of size linear in $n$.
Proof of Theorem 6.4.2. Take $\alpha \in (\zeta_\lambda/2, \zeta_\lambda)$ and let $\underline{k}_n$, $\bar{k}_n$, and $\gamma$ be as above. Let $B_{1,n} := \{|B_{k_n} - \zeta_\lambda n| \ge n^\gamma\}$. Because $\gamma < 1$, the event $B_{1,n}^c$ implies that
\[
\sum_{v \in [n]} \mathbf{1}_{\{|C_v| > k_n\}} = B_{k_n} > \zeta_\lambda n - n^\gamma \ge 1,
\]
for $n$ large enough. That is, there is at least one “large” cluster of size $> \underline{k}_n$. In turn, that implies
\[
|C_{\max}| \le B_{k_n},
\]
since there are at most $B_{k_n}$ vertices in that large cluster.
Let $B_{2,n} := \{\exists v, |C_v| \in [\underline{k}_n, \bar{k}_n]\}$. If $B_{2,n}^c$ holds, in addition to $B_{1,n}^c$, then
since there is no cluster whose size falls in $[\underline{k}_n, \bar{k}_n]$. Moreover there is equality across the last display if there is a unique cluster of size greater than $\bar{k}_n$.
This is indeed the case under $B_{1,n}^c \cap B_{2,n}^c$: if there were two distinct clusters of size $\bar{k}_n$, then since $2\alpha > \zeta_\lambda$ we would have for $n$ large enough
To bound the second term in (6.4.28), we sum over the size of $C_u$ and note that, conditioned on $\{|C_u| = \ell, u \nleftrightarrow v\}$, the size of $C_v$ has the same distribution as the unconditional size of $C_1$ in a $G_{n-\ell,p_n}$ random graph, that is,
Observe that the probability on the right-hand side is increasing in $\ell$ (as can be
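We define
\[
\Delta_k := \mathbb{P}_{n-k,p_n}[|C_1| \le k] - \mathbb{P}_{n,p_n}[|C_1| \le k].
\]
Then, plugging this back above, we get
\[
\sum_{u,v \in [n]} \sum_{\ell \le k} \mathbb{P}_{n,p_n}[|C_u| = \ell, |C_v| \le k, u \nleftrightarrow v]
\le \sum_{u,v \in [n]} \mathbb{P}_{n,p_n}[|C_u| \le k]\left(\mathbb{P}_{n,p_n}[|C_v| \le k] + \Delta_k\right)
\]
and
\[
\sum_{u,v \in [n]} \mathbb{P}_{n,p_n}[|C_u| \le k, |C_v| \le k, u \nleftrightarrow v] \le \left(\mathbb{E}_{n,p_n}[S_k]\right)^2 + \lambda n k^2. \tag{6.4.31}
\]
\[
\mathrm{Var}[S_k] \le 2\lambda n k^2.
\]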
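\begin{align*}
\mathbb{P}\left[|S_{k_n} - (1 - \zeta_\lambda) n| \ge n^\gamma\right]
&\le \mathbb{P}\left[|S_{k_n} - \mathbb{E}_{n,p_n}[S_{k_n}]| \ge n^\gamma - C \log^2 n\right]\\
&\le \frac{2\lambda n\, \underline{k}_n^2}{(n^\gamma - C \log^2 n)^2}\\
&\le \frac{2\lambda n (1 + \kappa_{\delta,\alpha})^2 I_\lambda^{-2} \log^2 n}{C' n^{2\gamma}}\\
&\le C'' n^{-\delta},
\end{align*}
for constants $C, C', C'' > 0$ and $n$ large enough, where we used that $2\gamma > 1$ and $\delta < 2\gamma - 1$.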
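for $n$ large enough. For any $\beta \in (0, \kappa)$, taking $\varepsilon$ small enough we have
\[
= \Omega(n^\beta).
\]
\begin{align*}
\mathbb{P}_{n,p_n}[B_{k_n} > 0]
&\ge \frac{(\mathbb{E} B_{k_n})^2}{\mathbb{E}[B_{k_n}^2]}\\
&\ge \left(1 + \frac{O(n k_n e^{-k_n I_\lambda})}{\Omega(n^{2\beta})}\right)^{-1}\\
&= \left(1 + \frac{O(n k_n e^{(\kappa - 1)\log n})}{\Omega(n^{2\beta})}\right)^{-1}\\
&= \left(1 + \frac{O(k_n n^\kappa)}{\Omega(n^{2\beta})}\right)^{-1}\\
&\to 1,
\end{align*}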
Remark 6.4.15. One can also derive a lower bound on the probability that $|C_{\max}| > \kappa n^{2/3}$ for some $\kappa > 0$ [ER60]. Exercise 6.20 provides a sketch based on counting tree components; the combinatorial approach has the advantage of giving insights into the structure of the graph (see [Bol01] for more on this). See also [NP10] for a martingale proof of the lower bound as well as a better upper bound.
\[
\mathbb{P}_{n,p_n}[|C_v| > k] \le \frac{c'}{\sqrt{k}}.
\]
Before we establish the lemma, we prove the theorem assuming it.
Take
\[
k_n := \kappa n^{2/3}.
\]
By Markov's inequality (Theorem 2.1.1) and Lemma 6.4.16,
Proof of Lemma 6.4.16. Once again, we use the exploration process defined in Section 6.4.2 started at $v$. Let $(\mathcal{F}_t)$ be the corresponding filtration and let $A_t = |\mathcal{A}_t|$ be the size of the active set.
with $M_0 := 1$ and $(\tilde X_t)$ are i.i.d. $\mathrm{Bin}(n, 1/n)$. We couple $(A_t)$ and $(M_t)$ through the equation (6.4.7) by letting
\[
\tilde X_t = \sum_{i=1}^n I_{t,i}.
\]
Recalling that
\[
\tau_0 = \inf\{t \ge 0 : A_t = 0\} = |C_v|,
\]
by Lemma 6.2.1, we have $\tilde\tau_0 \ge \tau_0 = |C_v|$ almost surely. So
The tail of $\tilde\tau_0$. To bound the tail of $\tilde\tau_0$, we introduce a modified stopping time. For $h > 0$, let
\[
\tau'_h := \inf\{t \ge 0 : M_t = 0 \text{ or } M_t \ge h\}.
\]
We will use the inequality
and we will choose h below to minimize the rightmost expression (or, more specif-
ically, an upper bound on it). The rest of the analysis is similar to the gambler’s
ruin problem in Example 3.1.41, with some slight complications arising from the
fact that the process is not nearest-neighbor.
We note that by the exponential tail of hitting times on finite state spaces (Lemma 3.1.25), the stopping time $\tau'_h$ is almost surely finite and, in fact, has a finite expectation. By two applications of Markov's inequality,
\[
\mathbb{P}[M_{\tau'_h} \ge h] \le \frac{\mathbb{E}[M_{\tau'_h}]}{h},
\]
and
\[
\mathbb{P}[\tau'_h > k] \le \frac{\mathbb{E}\tau'_h}{k}.
\]
We bound the expectations on the right-hand sides.
Bounding $\mathbb{E}M_{\tau'_h}$ and $\mathbb{E}\tau'_h$. To compute $\mathbb{E}M_{\tau'_h}$, we apply the optional stopping theorem in the uniformly bounded case (Theorem 3.1.38 (ii)) to the stopped process $(M_{t \wedge \tau'_h})$ (which is also a martingale by Lemma 3.1.37) to get that
\[
\mathbb{E}[M_{\tau'_h}] = \mathbb{E}[M_0] = 1.
\]
We conclude that
\[
\mathbb{P}[M_{\tau'_h} \ge h] \le \frac{1}{h}. \tag{6.4.34}
\]
To compute $\mathbb{E}\tau'_h$, we use a different martingale (adapted from Example 3.1.31), specifically
\[
L_t := M_t^2 - \sigma^2 t,
\]
where we let $\sigma^2 := n \frac{1}{n}\left(1 - \frac{1}{n}\right) = 1 - \frac{1}{n}$, which is $\ge \frac{1}{2}$ when $n \ge 2$. To see that $(L_t)$ is a martingale, note that by taking out what is known (Lemma B.6.13) and using the fact that $(M_t)$ is itself a martingale
By Lemma 3.1.37, the stopped process $(L_{t \wedge \tau'_h})$ is also a martingale; and it has bounded increments since
\begin{align*}
|L_{(t+1) \wedge \tau'_h} - L_{t \wedge \tau'_h}|
&\le |M^2_{(t+1) \wedge \tau'_h} - M^2_{t \wedge \tau'_h}| + \sigma^2\\
&\le (-1 + \tilde X_{t+1})^2 + 2h\, |-1 + \tilde X_{t+1}| + \sigma^2\\
&\le n^2 + 2hn + 1.
\end{align*}
We use the optional stopping theorem in the bounded increments case (Theorem 3.1.38 (iii)) on $(L_{t \wedge \tau'_h})$ to get
\begin{align*}
&\le (\sigma^2 + 1) + 2h + h^2\\
&\le 4h^2.
\end{align*}
We show in that regime that lazy simple random walk $(X_t)$ on $G_n$ “mixes fast.” Recall from Example 1.1.29 that, when the graph is connected, the corresponding transition matrix $P$ is reversible with respect to the stationary distribution
\[
\pi(v) := \frac{\delta(v)}{2|E_n|},
\]
where $\delta(v)$ is the degree of $v$. For a fixed $\varepsilon > 0$, the mixing time (see Definition 1.1.35) is
\[
t_{\mathrm{mix}}(\varepsilon) = \inf\{t \ge 0 : d(t) \le \varepsilon\},
\]
where
\[
d(t) = \sup_{x \in V_n} \|P^t(x, \cdot) - \pi(\cdot)\|_{\mathrm{TV}}.
\]
By convention, we let $t_{\mathrm{mix}}(\varepsilon) = +\infty$ if the graph is not connected. Our main result is the following.
\[
\Phi_E(S; c, \pi) = \frac{c(S, S^c)}{\pi(S)},
\]
where
\[
\pi_{\min} = \min_x \pi(x) = \min_x \frac{\delta(x)}{2|E_n|} = \frac{\min_x \delta(x)}{\sum_y \delta(y)}.
\]
So our main task is to bound δ(x) and |E(S, S c )| with high probability. We do this
next.
Bounding the degrees. In fact, we have already done half the work. Indeed in Example 2.4.18 we studied the maximum degree of $G_n$
\[
D_n = \max_{v \in V_n} \delta(v),
\]
in the regime $np_n = \omega(\log n)$. We showed that for any $\zeta > 0$, as $n \to +\infty$,
\[
\mathbb{P}\left[|D_n - np_n| \le 2\sqrt{(1 + \zeta) np_n \log n}\right] \to 1.
\]
The proof of that result actually shows something stronger: all degrees satisfy the inequality simultaneously, that is,
\[
\mathbb{P}\left[\forall v \in V_n,\ |\delta(v) - np_n| \le 2\sqrt{(1 + \zeta) np_n \log n}\right] = 1 - o(1). \tag{6.4.38}
\]
We will use the fact that $2\sqrt{(1 + \zeta) np_n \log n} = o(np_n)$ when $np_n = \omega(\log n)$. In essence, all degrees are roughly $np_n$. That implies the following claims.
Lemma 6.4.19 (Bounds on stationary distribution and volume). The following hold with probability $1 - o(1)$.
(i) The smallest stationary probability satisfies
\[
\pi_{\min} \ge \frac{1 - o(1)}{n}.
\]
(ii) For any set of vertices $S \subseteq V_n$ with $|S| > 2n/3$, we have
\[
\pi(S) > \frac{1}{2}.
\]
\[
\Phi_* = \Omega(1).
\]
Proof. By the definition of Φ∗ and Lemma 6.4.19 (ii), we can restrict ourselves to
sets S of size at most 2n/3. Let S be such a set with s = |S|. Then |E(S, S c )| is
Bin(s(n − s), pn ). By Bernstein’s inequality with c = 1 and νi = pn (1 − pn ),
β2
c
Pn,pn [|E(S, S )| ≤ s(n − s)pn − β] ≤ exp − ,
4s(n − s)pn (1 − pn )
CHAPTER 6. BRANCHING PROCESSES 494
By a union bound over all sets of size s and using the fact that ns ≤ ( ne s
s )
(see Appendix A), there is a constant C > 0 such that
c 1
Pn,pn ∃S, |S| = s, |E(S, S )| ≤ s(n − s)pn
2
n s(n − s)pn
≤ exp −
s 16(1 − pn )
np
n
≤ exp −s + s log(ne/s)
48
≤ exp (−Csnpn ) ,
for $n$ large enough, where we also used that $n - s \ge n/3$ and $np_n = \omega(\log n)$.
Summing over $s$ gives, for a constant $C' > 0$,
$$\mathbb{P}_{n,p_n}\left[\exists S,\ 1 \le |S| \le 2n/3,\ |E(S, S^c)| \le \frac{1}{2}|S|(n - |S|)p_n\right] \le C' \exp\left(-C n p_n\right),$$
which goes to 0 as n → +∞.
Using (6.4.36) and Lemma 6.4.19 (iii), any set $S$ such that $|E(S, S^c)| > \frac{1}{2}|S|(n - |S|)p_n$
has edge expansion
$$\Phi_E(S; c, \pi) \ge \frac{\frac{1}{2}|S|(n - |S|)p_n}{|S|\, np_n (1 + o(1))} \ge \frac{1}{6}(1 - o(1)).$$
That proves the claim.
Proof of the theorem. Finally, we are ready to prove the main result.
Proof of Theorem 6.4.18. Plugging Lemma 6.4.19 (i) and Lemma 6.4.20 into (6.4.37)
gives
$$t_{\mathrm{mix}}(\varepsilon) \le \log\left(\frac{1}{\varepsilon \pi_{\min}}\right) \frac{2}{\Phi_*^2} \le C'' \log\left(\varepsilon^{-1} n (1 + o(1))\right) = O(\log n),$$
for some constant $C'' > 0$.
Remark 6.4.21. A mixing time of $O(\log n)$ in fact holds for lazy simple random walk on
$G_{n,p_n}$ when $p_n = \lambda \frac{\log n}{n}$ with $\lambda > 1$ [CF07]. See also [Dur06, Section 6.5]. Mixing time
on the giant component has also been studied. See, e.g., [FR08, BKW14, DKLP11].
Exercises
Exercise 6.1 (Galton-Watson process: subcritical case). We use Markov’s inequal-
ity to analyze the subcritical case.
(i) Let (Zt ) be a Galton-Watson process with offspring distribution mean m <
1. Use Markov’s inequality (Theorem 2.1.1) to prove that extinction occurs
almost surely.
(ii) Prove the equivalent result in the multitype case, that is, prove (6.1.8).
(i) Compute the probability generating function f of {pk }k≥0 and the extinction
probability η := ηp as a function of p.
$$f_t(s) = \frac{p m^t (1 - s) + qs - p}{q m^t (1 - s) + qs - p}.$$
$$f_t(s) = \frac{t - (t-1)s}{t + 1 - ts},$$
and deduce that
$$\mathbb{E}\left[e^{-\lambda Z_t / t} \,\middle|\, Z_t > 0\right] \to \frac{1}{1 + \lambda}.$$
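The last two displays can be checked numerically. The sketch below assumes that $f_t$ denotes the probability generating function of $Z_t$ with $Z_0 = 1$, so that $f_t$ is the $t$-fold composition of $f_1$, and it reads $f_1(s) = 1/(2 - s)$ off the closed form at $t = 1$; both are assumptions about the missing context, and the function names are ours. It uses $\mathbb{E}[s^{Z_t} \mid Z_t > 0] = (f_t(s) - f_t(0))/(1 - f_t(0))$.

```python
from fractions import Fraction
import math


def f1(s):
    """The t = 1 case of the closed form: f_1(s) = 1 / (2 - s)."""
    return 1 / (2 - s)


def f_closed(t, s):
    """Closed form f_t(s) = (t - (t - 1) s) / (t + 1 - t s)."""
    return (t - (t - 1) * s) / (t + 1 - t * s)


def f_iterated(t, s):
    """t-fold composition of f1 evaluated at s."""
    for _ in range(t):
        s = f1(s)
    return s


def conditional_laplace(t, lam):
    """E[exp(-lam Z_t / t) | Z_t > 0], computed from the closed form of f_t."""
    s = math.exp(-lam / t)
    extinct_by_t = f_closed(t, 0.0)  # P[Z_t = 0] = f_t(0)
    return (f_closed(t, s) - extinct_by_t) / (1.0 - extinct_by_t)


if __name__ == "__main__":
    s = Fraction(1, 3)
    print(all(f_iterated(t, s) == f_closed(t, s) for t in range(1, 10)))
    print(conditional_laplace(10_000, 2.0), 1 / (1 + 2.0))
```

Exact rational arithmetic is used for the composition check so that equality can be tested without rounding; this is a check on specific inputs, not a substitute for the induction asked for in the exercise.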
Exercise 6.3 (Supercritical branching process: infinite line of descent). Let (Zt )
be a supercritical Galton-Watson branching process with offspring distribution
$\{p_k\}_{k \ge 0}$. Let $\eta$ be the extinction probability and define $\zeta := 1 - \eta$. Let $Z_t^\infty$
be the number of individuals in the $t$-th generation with an infinite line of descent,
i.e., whose descendant subtree is infinite. Denote by $S$ the event of nonextinction
of $(Z_t)$. Define $p_0^\infty := 0$ and
$$p_k^\infty := \zeta^{-1} \sum_{j \ge k} \binom{j}{k} \eta^{j-k} \zeta^k p_j.$$
[Hint: Condition on Z1 .]
(iii) Show by induction on $t$ that, conditioned on nonextinction, the process $(Z_t^\infty)$
has the same distribution as a Galton-Watson branching process with offspring
distribution $\{p_k^\infty\}_{k \ge 0}$.
Exercise 6.4 (Multitype branching processes: a special case). Extend Lemma 6.1.20
to the case S(u) = 0. [Hint: Show that Ut = Z0 u for all t almost surely.]
Exercise 6.5 (Galton-Watson: Inverting history). Let
H = (X1 , . . . , Xτ0 ),
be the history (see Section 6.2) of the Galton-Watson process (Zi ). Write Zi as a
function of H, for all i.
Exercise 6.6 (Spitzer’s lemma). Prove Theorem 6.2.5.
Exercise 6.7 (Sum of Poisson). Let Q1 and Q2 be independent Poisson random
variables with respective means λ1 and λ2 . Show by direct computation of the
convolution that the sum $Q_1 + Q_2$ is Poisson with mean $\lambda_1 + \lambda_2$. [Hint: Recall
that $\mathbb{P}[Q_1 = k] = e^{-\lambda_1} \lambda_1^k / k!$ for all $k \in \mathbb{Z}_+$.]
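Before attempting the computation by hand, the claim can be checked numerically. The following sketch evaluates the convolution sum directly and compares it with the Poisson probability mass function of mean $\lambda_1 + \lambda_2$; the function names and the truncation at `k_max` are illustrative choices.

```python
import math


def poisson_pmf(lam, k):
    """P[Q = k] for Q Poisson with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def convolution_pmf(lam1, lam2, k):
    """P[Q1 + Q2 = k] = sum_j P[Q1 = j] P[Q2 = k - j] for independent Q1, Q2."""
    return sum(poisson_pmf(lam1, j) * poisson_pmf(lam2, k - j) for j in range(k + 1))


def max_gap(lam1, lam2, k_max=30):
    """Largest discrepancy between the convolution and Poisson(lam1 + lam2) for k <= k_max."""
    return max(abs(convolution_pmf(lam1, lam2, k) - poisson_pmf(lam1 + lam2, k))
               for k in range(k_max + 1))


if __name__ == "__main__":
    print(max_gap(1.5, 2.5))
```

The discrepancy is at the level of floating-point rounding, which is what the binomial theorem applied to $(\lambda_1 + \lambda_2)^k$ predicts.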
Exercise 6.8 (Percolation on bounded-degree graphs). Let $G = (V, E)$ be a countable
graph such that all vertices have degree bounded by $b + 1$ for $b \ge 2$. Let $0$ be
a distinguished vertex in $G$. For bond percolation on $G$, prove that
$$p_c(G) \ge p_c(\widehat{\mathbb{T}}_b),$$
1
1
b
h(ε, u) := u − 1− b − ε (1 − u) + b +ε .
(i) Show that there is a constant C > 0 not depending on ε, u such that
b−1 2
h(ε, u) − bεu + u ≤ C(u3 ∨ εu2 ).
2b
$$\lim_{p \downarrow p_c(\widehat{\mathbb{T}}_b)} \frac{\theta(p)}{p - p_c(\widehat{\mathbb{T}}_b)} = \frac{2b^2}{b - 1}.$$
Exercise 6.10 (Percolation on $\widehat{\mathbb{T}}_2$: higher moments of $|C_0|$). Consider bond percolation
on the rooted infinite binary tree $\widehat{\mathbb{T}}_2$. For density $p < \frac{1}{2}$, let $Z_p$ be an
integer-valued random variable with distribution
$$\mathbb{P}_p[Z_p = \ell] = \frac{\ell\, \mathbb{P}_p[|C_0| = \ell]}{\mathbb{E}_p |C_0|}, \qquad \forall \ell \ge 1.$$
(i) Using the explicit formula for $\mathbb{P}_p[|C_0| = \ell]$ derived in Section 6.2.4, show
that for all $0 < a < b < +\infty$
$$\mathbb{P}_p\left[\frac{Z_p}{(1/4)\left(\frac{1}{2} - p\right)^{-2}} \in [a, b]\right] \to C \int_a^b x^{-1/2} e^{-x}\, dx,$$
$$\lim_{p \uparrow p_c(\widehat{\mathbb{T}}_2)} \frac{\mathbb{E}_p |C_0|^k}{\left(p_c(\widehat{\mathbb{T}}_2) - p\right)^{-1-2(k-1)}} = C_k.$$
(ii) Couple the two processes step-by-step and use (i) to show that
$$\left|\mathbb{P}[W_{n,p_n} \ge k] - \mathbb{P}[W_\lambda \ge k]\right| \le \frac{\lambda^2}{n} \sum_{i=1}^{k-1} \mathbb{P}[W_\lambda \ge i].$$
Exercise 6.13 (Random binary search tree: property (BST)). Show that the (BST)
property is preserved by the algorithm described at the beginning of Section 6.3.1.
Exercise 6.14 (Random binary search tree: limit). Consider the equation (6.3.1).
(ii) Prove that the expression on the left-hand side is strictly decreasing at that
solution.
(i) Show that, for any $v$, it holds that $S_{v'} + S_{v''} = S_v - 1$ almost surely provided
$S_v \ge 1$.
(ii) Show that, for any v, there is almost surely a descendant w of v (not neces-
sarily immediate) such that Sw = 1.
(iii) Let
Hn = max {h : ∃v ∈ Vh , Sv = 1} ,
where Vh is the set of vertices of T at topological distance h from the root.
Show that Hn ≤ n.
Exercise 6.16 (Ising vs. CFN). Let Th be a rooted complete binary tree with h
levels. Fix 0 < p < 1/2. Assign to each vertex v a state σv ∈ {+1, −1} at
random according to the CFN model described in Section 6.3.2. Show that this
distribution is equivalent to a ferromagnetic Ising model on Th and determine the
inverse temperature β in terms of p. [Hint: Write the distribution of the states under
the CFN model as a product over the edges.]
Exercise 6.17 (Monotonicity of $\|\mu_h^+ - \mu_h^-\|_{\mathrm{TV}}$). Let $\mu_h^+$, $\mu_h^-$ be as in Section 6.3.2.
Show that
$$\|\mu_{h+1}^+ - \mu_{h+1}^-\|_{\mathrm{TV}} \le \|\mu_h^+ - \mu_h^-\|_{\mathrm{TV}}.$$
Exercise 6.19 (Cayley’s formula). Let $(Z_t)$ be a Poisson branching process with
offspring mean 1 started at $Z_0 = 1$ and let $T$ be the corresponding Galton-Watson
tree. Let $W$ be the total size of the progeny, that is, the number of vertices in $T$.
Recall from Example 6.2.7 that
$$\mathbb{P}[W = n] = \frac{n^{n-1} e^{-n}}{n!}.$$
(i) Given $W = n$, label the vertices of $T$ uniformly at random with the integers
$1, \ldots, n$. Show that every rooted labeled tree on $n$ vertices arises with
probability $e^{-n}/n!$. [Hint: Label the vertices as you grow the tree and observe
that a lot of terms cancel out or simplify.]
(ii) Derive Cayley’s formula: the number of labeled trees on $n$ vertices is $n^{n-2}$.
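For small $n$, Cayley’s formula can be confirmed by exhaustive enumeration. The sketch below counts the spanning trees of the complete graph on $n$ labeled vertices by testing every set of $n - 1$ edges for acyclicity with a union-find structure. It is a brute-force sanity check with names of our choosing, not the probabilistic derivation asked for in the exercise.

```python
from itertools import combinations


def count_labeled_trees(n):
    """Count labeled trees on n vertices by brute force over (n - 1)-edge subsets."""
    if n == 1:
        return 1
    edges = list(combinations(range(n), 2))
    count = 0
    for subset in combinations(edges, n - 1):
        parent = list(range(n))  # union-find forest, one component per vertex

        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]  # path halving
                x = parent[x]
            return x

        acyclic = True
        for u, v in subset:
            ru, rv = find(u), find(v)
            if ru == rv:  # this edge would close a cycle
                acyclic = False
                break
            parent[ru] = rv
        # An acyclic graph with n - 1 edges on n vertices is a tree.
        if acyclic:
            count += 1
    return count


if __name__ == "__main__":
    for n in range(2, 7):
        print(n, count_labeled_trees(n), n ** (n - 2))
```

The enumeration grows quickly with $n$ and is only meant for $n$ up to about 7.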
(i) Let $\gamma_{n,k}$ be the expected number of isolated tree components of size $k$ in $G_n$.
Justify the formula
$$\gamma_{n,k} = \binom{n}{k} k^{k-2} \left(\frac{1}{n}\right)^{k-1} \left(1 - \frac{1}{n}\right)^{k(n-k) + \binom{k}{2} - (k-1)}.$$
$$\gamma_{n,k} \sim n\, \frac{k^{-5/2}}{\sqrt{2\pi}} \exp\left(-\frac{k^3}{6n^2}\right).$$
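The exact formula for $\gamma_{n,k}$ and the asymptotic expression can be compared numerically. The sketch below works in log-space (via `math.lgamma`) to avoid overflow; the specific values of $n$ and $k$ and the function names are illustrative assumptions.

```python
import math


def log_gamma_nk(n, k):
    """log of C(n,k) k^(k-2) (1/n)^(k-1) (1 - 1/n)^(k(n-k) + C(k,2) - (k-1))."""
    log_binom = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    exponent = k * (n - k) + k * (k - 1) // 2 - (k - 1)
    return (log_binom + (k - 2) * math.log(k) - (k - 1) * math.log(n)
            + exponent * math.log1p(-1.0 / n))


def log_asymptotic(n, k):
    """log of n k^(-5/2) / sqrt(2 pi) * exp(-k^3 / (6 n^2))."""
    return (math.log(n) - 2.5 * math.log(k) - 0.5 * math.log(2 * math.pi)
            - k ** 3 / (6 * n ** 2))


if __name__ == "__main__":
    n = 10 ** 6
    for k in (100, 1000, 10000):
        print(k, math.exp(log_gamma_nk(n, k) - log_asymptotic(n, k)))
```

The printed ratios should be close to 1 when $k$ is large and of order at most $n^{2/3}$, the range relevant to the critical regime.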
(iii) Conclude that for $0 < \delta < 1$ the expectation of $U$, the number of isolated
tree components of size in $[(\delta n)^{2/3}, n^{2/3}]$, is $\Omega(\delta^{-1})$ as $\delta \to 0$.
(v) Prove that Var[U ] = O(E[U ]). [Hint: Use (2.1.6), (iv), and (ii).]
Exercise 6.21 (Critical regime: overshoot bound). The goal of this exercise is to
prove Lemma 6.4.17. We use the notation of Section 6.4.4.
Bibliographic remarks
Section 6.1 See [Dur10, Section 5.3.4] for a quick introduction to branching pro-
cesses. A more detailed overview relating to its use in discrete probability can
be found in [vdH17, Chapter 3]. A classical reference on branching processes
is [AN04]. The Kesten-Stigum Theorem is due to Kesten and Stigum [KS66b].
Our proof of a weaker version with the second moment condition follows [Dur10,
Example 5.4.3]. Section 6.1.4 is based loosely on [AN04, Chapter V]. A proof
of Theorem 6.1.18 can be found in [Har63]. A good reference for the Perron-
Frobenius Theorem (Theorem 6.1.17 as well as more general versions) is [HJ13,
Chapter 8]. The central limit theorem for ρ ≥ λ2 referred to at the end of Sec-
tion 6.1.4 is due to Kesten and Stigum [KS66a, KS67]. The critical percolation
threshold for percolation on Galton-Watson trees is due to R. Lyons [Lyo90].
Section 6.2 The exploration process in Section 6.2.1 dates back to [ML86] and
[Kar90]. The hitting-time theorem (Theorem 6.2.5) in the case $\ell = 1$ was first
proved in [Ott49]. For alternative proofs, see for example [vdHK08] or [Wen75].
Spitzer’s combinatorial lemma (Lemma 6.2.8) is from [Spi56]. See also [Fel71,
Section XII.6]. The presentation in Section 6.2.4 follows [vdH10]. See also [Dur85].
Section 6.3 Section 6.3.1 follows [Dev98, Section 2.1] from the excellent vol-
ume [HMRAR98]. Section 6.3.2 is partly a simplified version of [BCMR06]. Fur-
ther applications in phylogenetics, specifically to the sample complexity of phy-
logeny inference algorithms, can be found in, for example, [Mos04, Mos03, Roc10,
DMR11, RS17]. The reconstruction problem also has applications in community
detection [MNS15b]. See [Abb18] for a survey.
Section 6.4 The phase transition of the Erdős-Rényi graph model was first studied
in [ER60]. For much more, see for example [vdH17, Chapter 4], [JLR11, Chapter
5] and [Bol01, Chapter 6]. In particular a central limit theorem for the giant com-
ponent, proved by several authors including Martin-Löf [ML98], Pittel [Pit90], and
Barraez, Boucheron, and de la Vega [BBFdlV00], is established in [vdH17, Section
4.5]. Section 6.4.4 is based on [NP10]. See also [Per09, Sections 2 and 3]. Much
more is known about the critical regime; see, e.g., [Ald97, Bol84, Lu90, LuPW94].
Section 6.4.5 is based partly on [Dur06, Section 6.5]. For a lot more on random
walk on random graphs (not just Erdős-Rényi), see [Dur06, Chapter 6]. For more
on the spectral properties of random graphs, see [CL06].
Appendix A
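$$\frac{n^n}{e^{n-1}} \le n! \le \frac{n^{n+1}}{e^{n-1}},$$
$$\frac{n^k}{k^k} \le \binom{n}{k} \le \frac{e^k n^k}{k^k},$$
$$(x + y)^n = \sum_{k=0}^{n} \binom{n}{k} x^k y^{n-k},$$
$$\sum_{k=0}^{d} \binom{n}{k} \le \left(\frac{en}{d}\right)^d,$$
$$n! \sim \sqrt{2\pi n} \left(\frac{n}{e}\right)^n,$$
$$\binom{2n}{n} = (1 + o(1)) \frac{4^n}{\sqrt{\pi n}},$$
and
$$\log \binom{n}{k} = (1 + o(1))\, n H(k/n),$$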
where H(p) := −p log p − (1 − p) log(1 − p). The third one is the binomial
theorem. The fifth one is Stirling’s formula.
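These formulas are easy to sanity-check numerically. The sketch below tests the three inequalities on small ranges and evaluates the ratios appearing in the asymptotic statements; the function names and the ranges tested are illustrative choices, and $H$ is computed with the natural logarithm as in the definition above.

```python
import math


def check_factorial_bounds(n):
    """n^n / e^(n-1) <= n! <= n^(n+1) / e^(n-1)."""
    lower = n ** n / math.e ** (n - 1)
    upper = n ** (n + 1) / math.e ** (n - 1)
    return lower <= math.factorial(n) <= upper


def check_binomial_bounds(n, k):
    """(n/k)^k <= C(n, k) <= (e n / k)^k for 1 <= k <= n."""
    return (n / k) ** k <= math.comb(n, k) <= (math.e * n / k) ** k


def check_partial_sum_bound(n, d):
    """sum_{k=0}^d C(n, k) <= (e n / d)^d for 1 <= d <= n."""
    return sum(math.comb(n, k) for k in range(d + 1)) <= (math.e * n / d) ** d


def stirling_ratio(n):
    """n! divided by sqrt(2 pi n) (n/e)^n; tends to 1."""
    return math.factorial(n) / (math.sqrt(2 * math.pi * n) * (n / math.e) ** n)


def central_binomial_ratio(n):
    """C(2n, n) divided by 4^n / sqrt(pi n); tends to 1."""
    return math.comb(2 * n, n) * math.sqrt(math.pi * n) / 4 ** n


def entropy_ratio(n, k):
    """log C(n, k) divided by n H(k/n), for 0 < k < n."""
    p = k / n
    entropy = -p * math.log(p) - (1 - p) * math.log(1 - p)
    return math.log(math.comb(n, k)) / (n * entropy)


if __name__ == "__main__":
    print(all(check_factorial_bounds(n) for n in range(1, 21)))
    print(stirling_ratio(100), central_binomial_ratio(100), entropy_ratio(1000, 300))
```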
Appendix B
Measure-theoretic foundations
(i) S ∈ Σ0 ;
(ii) F ∈ Σ0 implies F c ∈ Σ0 ;
(iii) F, G ∈ Σ0 implies F ∪ G ∈ Σ0 .
That, of course, implies that the empty set as well as all pairwise intersections
are also in Σ0 . The collection Σ0 is an actual algebra (i.e., a vector space with a
bilinear product) with the symmetric difference as its “sum,” the intersection as its
“product” and the underlying field being the field with two elements.
Version: December 20, 2023
Modern Discrete Probability: An Essential Toolkit
Copyright © 2023 Sébastien Roch
(i) µ0 (∅) = 0;
Example B.1.10. For the algebra in Example B.1.2, the set function
$$\lambda_0\left(\bigcup_{i=1}^{k} (a_i, b_i]\right) = \sum_{i=1}^{k} (b_i - a_i)$$
Definition B.1.12 (Probability space). If $(\Omega, \mathcal{F}, \mathbb{P})$ is a measure space with $\mathbb{P}(\Omega) = 1$
then $\mathbb{P}$ is called a probability measure and $(\Omega, \mathcal{F}, \mathbb{P})$ is called a probability space
(or probability triple).
If in addition µ0 is finite, the next lemma implies that the extension is unique.
Example B.1.15. The sets $(-\infty, x]$ for $x \in \mathbb{R}$ form a π-system generating $\mathcal{B}(\mathbb{R})$.
That is, $\mathcal{B}(\mathbb{R})$ is the smallest σ-algebra containing that π-system.
Finally we can define Lebesgue measure. We start with (0, 1] and extend to R
in the obvious way. We need the following lemma.
$$\mathcal{L}_X = \mathbb{P} \circ X^{-1},$$
We refer to such a random variable as a uniform random variable over $(0, 1]$.
Distribution functions are characterized by a few simple properties.
Proposition B.2.5. Suppose F = FX is the distribution function of a random
variable X on (Ω, F, P). Then the following hold:
(i) F is non-decreasing;
(ii) $\lim_{x \to +\infty} F(x) = 1$, $\lim_{x \to -\infty} F(x) = 0$;
(iii) F is right-continuous.
Proof. The first property follows from the monotonicity of probability measure
(which itself follows immediately from σ-additivity).
For the second property, note that the limit exists by the first property. The
value of the limit follows from the following important lemma.
Lemma B.2.6 (Monotone convergence properties of measures). Let (S, Σ, µ) be a
measure space.
(i) If Fn ∈ Σ, n ≥ 1, with Fn ↑ F , then µ(Fn ) ↑ µ(F ).
(ii) If Gn ∈ Σ, n ≥ 1, with Gn ↓ G and µ(Gk ) < +∞ for some k, then
µ(Gn ) ↓ µ(G).
Proof. Clearly $F = \cup_n F_n \in \Sigma$. For $n \ge 1$, write $H_n = F_n \setminus F_{n-1}$ (with $F_0 = \emptyset$).
Then by disjointness
$$\mu(F_n) = \sum_{k \le n} \mu(H_k) \uparrow \sum_{k < +\infty} \mu(H_k) = \mu(F).$$
if xn ↓ x.
It turns out that the properties above characterize distribution functions in the
following sense.
The result says that all real random variables can be generated from uniform
random variables over (0, 1].
Proof. Assume first that $F$ is continuous and strictly increasing. Define $X(\omega) = F^{-1}(\omega)$
for all $\omega \in \Omega$. Then, $\forall x \in \mathbb{R}$,
In general, let
X(ω) = inf{x : F (x) ≥ ω}.
It suffices to prove that
X(ω) ≤ x ⇐⇒ ω ≤ F (x).
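The generalized inverse and the equivalence above can be illustrated on a finitely supported distribution. In the sketch below, the support is assumed sorted in increasing order, exact rational arithmetic is used so that the comparisons are not affected by rounding, and the function names are ours.

```python
from fractions import Fraction


def make_cdf(support, probs):
    """Distribution function F(x) = P[X <= x] of a finitely supported law."""
    def F(x):
        return sum(p for v, p in zip(support, probs) if v <= x)
    return F


def generalized_inverse(support, probs, omega):
    """X(omega) = inf{x : F(x) >= omega} for omega in (0, 1]; support sorted increasingly."""
    total = 0
    for v, p in zip(support, probs):
        total += p
        if total >= omega:
            return v
    raise ValueError("omega must lie in (0, 1]")


def equivalence_holds(support, probs, omegas, xs):
    """Check that X(omega) <= x if and only if omega <= F(x) on the given grids."""
    F = make_cdf(support, probs)
    return all((generalized_inverse(support, probs, w) <= x) == (w <= F(x))
               for w in omegas for x in xs)


if __name__ == "__main__":
    # Number of heads in two fair coin flips.
    support = [0, 1, 2]
    probs = [Fraction(1, 4), Fraction(1, 2), Fraction(1, 4)]
    omegas = [Fraction(k, 16) for k in range(1, 17)]
    xs = [Fraction(k, 4) for k in range(-4, 13)]
    print(equivalence_holds(support, probs, omegas, xs))
```

Feeding a uniform random variable on $(0, 1]$ into `generalized_inverse` then produces a sample with distribution function $F$, which is the content of the result.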
σ(Yγ , γ ∈ Γ),
Example B.2.9. Suppose we flip two unbiased coins and let X be the number of
heads observed. Then, denoting heads by H and tails by T,
Note that $h^{-1}$ preserves all set operations. For example, $h^{-1}(A \cup B) = h^{-1}(A) \cup h^{-1}(B)$.
This gives the following important lemma.
Proof. Let $\mathcal{E}$ be the collection of sets $B$ such that $h^{-1}(B) \in \Sigma$. By the observation before the
statement, $\mathcal{E}$ is a σ-algebra. But $\mathcal{C} \subseteq \mathcal{E}$ which implies $\sigma(\mathcal{C}) \subseteq \mathcal{E}$ by minimality.
(i) f ◦ h ∈ mΣ.
{g ≤ c} ∈ Σ.
(iv) ∀α ∈ R, h1 + h2 , h1 h2 , αh ∈ mΣ.
(iv) This follows from (iii). For example note that, writing the left-hand side as
h1 > c − h2 ,
B.3 Independence
Let (Ω, F, P) be a probability space.
that is, fill up the array diagonally. By the argument above, the Vi ’s are independent
and Bernoulli(1/2).
Finally let µn , n ≥ 1, be a sequence of probability measures with distribution
functions Fn , n ≥ 1. For each n, define
Here $\mathcal{R}^{\mathbb{N}}$ is the product σ-algebra, that is, the σ-algebra generated by
finite-dimensional rectangles.
where
Tn = σ(Xn+1 , Xn+2 , . . .).
As an intersection of σ-algebras, T is a σ-algebra. It is called the tail σ-algebra of
the sequence (Xn ).
Intuitively, an event is in the tail if changing a finite number of values does not
affect its occurrence.
Example B.3.10. If $S_n = \sum_{k \le n} X_k$, then
$$\{\lim_n S_n \text{ exists}\} \in \mathcal{T},$$
X∞ = σ(Xn , n ≥ 1).
Note that
$$\mathcal{K}_\infty = \bigcup_{n \ge 1} \mathcal{X}_n,$$
B.4 Expectation
Let $(S, \Sigma, \mu)$ be a measure space. We denote by $\mathbf{1}_A$ the indicator of a set $A$, that is,
$$\mathbf{1}_A(s) = \begin{cases} 1, & \text{if } s \in A,\\ 0, & \text{o.w.} \end{cases}$$
where $a_k \in [0, +\infty]$ and $A_k \in \Sigma$ for all $k$. We denote the set of all such functions
by $\mathrm{SF}^+$. We define the integral of $f$ by
$$\mu(f) := \sum_{k=1}^{m} a_k \mu(A_k) \le +\infty.$$
(iii) If f ≤ g then µf ≤ µg. [Hint: Show that g − f ∈ SF+ and use linearity.]
The main definition and theorem of integration theory follow.
Definition B.4.3 (Nonnegative functions). Let f ∈ (mΣ)+ . Then the integral of f
is defined by
µ(f ) = sup{µ(h) : h ∈ SF+ , h ≤ f }.
Again we also write µf = µ(f ).
Theorem B.4.4 (Monotone convergence theorem). If fn , f ∈ (mΣ)+ , n ≥ 1, with
fn ↑ f , then
µfn ↑ µf.
Many theorems in integration follow from the monotone convergence theorem.
In that context, the following approximation is useful.
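Definition B.4.5 (Staircase function). For $f \in (m\Sigma)^+$ and $r \ge 1$, the $r$-th staircase
function $\alpha^{(r)}$ is
$$\alpha^{(r)}(x) = \begin{cases} 0, & \text{if } x = 0,\\ (i-1)2^{-r}, & \text{if } (i-1)2^{-r} < x \le i 2^{-r} \le r,\\ r, & \text{if } x > r. \end{cases}$$
We let $f^{(r)} = \alpha^{(r)}(f)$. Note that $f^{(r)} \in \mathrm{SF}^+$ and $f^{(r)} \uparrow f$ as $r \to +\infty$.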
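The staircase function is straightforward to implement, and doing so makes the monotone approximation $f^{(r)} \uparrow f$ visible on examples. The following is a minimal sketch; the function name and the test values are ours.

```python
import math


def staircase(x, r):
    """alpha^(r)(x) for x in [0, +inf] and integer r >= 1."""
    if x == 0:
        return 0.0
    if x > r:
        return float(r)
    # Unique integer i with (i - 1) 2^-r < x <= i 2^-r.
    i = math.ceil(x * 2 ** r)
    return (i - 1) / 2 ** r


if __name__ == "__main__":
    for x in (0.3, 1.0, math.pi, math.inf):
        print(x, [staircase(x, r) for r in range(1, 7)])
```

For each fixed $x$ the printed values are nondecreasing in $r$ and never exceed $x$; for $x = +\infty$ they equal $r$, which is why the limit recovers $f$ even where $f$ is infinite.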
Using the previous definition, we get for example the following properties.
Proposition B.4.6. Let f, g ∈ (mΣ)+ .
(i) If $\mu(f \ne g) = 0$, then $\mu(f) = \mu(g)$.
(ii) For all c ≥ 0, f + g, cf ∈ (mΣ)+ and
µ(f + g) = µf + µg, µ(cf ) = cµf.
$$\|f\|_p := \mu(|f|^p)^{1/p},$$
up to equality almost everywhere. We state the following results without proof.
Theorem B.4.8 (Hölder’s inequality). Let $1 < p, q < +\infty$ such that $p^{-1} + q^{-1} = 1$.
Then, for any $f \in L^p(S, \Sigma, \mu)$ and $g \in L^q(S, \Sigma, \mu)$, it holds that
$fg \in L^1(S, \Sigma, \mu)$ and further
$$\|fg\|_1 \le \|f\|_p \|g\|_q.$$
The case $p = q = 2$ is known as the Cauchy-Schwarz inequality (or Schwarz
inequality).
Theorem B.4.9 (Minkowski’s inequality). Let $1 < p < +\infty$. Then, for any $f, g \in L^p(S, \Sigma, \mu)$,
it holds that $f + g \in L^p(S, \Sigma, \mu)$ and further
$$\|f + g\|_p \le \|f\|_p + \|g\|_p.$$
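Both Hölder’s and Minkowski’s inequalities can be checked numerically on a finite measure space, where a function is a vector and the measure is given by nonnegative point masses. The sketch below is such a check; the function names and the specific vectors are illustrative assumptions.

```python
def lp_norm(f, weights, p):
    """||f||_p = mu(|f|^p)^(1/p) for a measure with point masses `weights`."""
    return sum(w * abs(v) ** p for v, w in zip(f, weights)) ** (1.0 / p)


def holder_gap(f, g, weights, p):
    """||f||_p ||g||_q - ||f g||_1 with 1/p + 1/q = 1; nonnegative by Holder."""
    q = p / (p - 1.0)
    fg = [a * b for a, b in zip(f, g)]
    return lp_norm(f, weights, p) * lp_norm(g, weights, q) - lp_norm(fg, weights, 1)


def minkowski_gap(f, g, weights, p):
    """||f||_p + ||g||_p - ||f + g||_p; nonnegative by Minkowski."""
    total = [a + b for a, b in zip(f, g)]
    return lp_norm(f, weights, p) + lp_norm(g, weights, p) - lp_norm(total, weights, p)


if __name__ == "__main__":
    f = [1.0, -2.0, 3.0]
    g = [0.5, 4.0, -1.0]
    weights = [0.2, 0.5, 0.3]
    for p in (1.5, 2.0, 3.0):
        print(p, holder_gap(f, g, weights, p), minkowski_gap(f, g, weights, p))
```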
$$\|f_n - f\|_p \to 0,$$
as n → +∞.
We can now define the expectation. Let (Ω, F, P) be a probability space.
Definition B.4.11 (Expectation). If X ≥ 0 is a random variable then we define the
expectation of X, denoted by E[X], as the integral of X over P. More generally
(i.e., not assuming non-negativity), if
we let
E[X] = E[X + ] − E[X − ].
We denote the set of all such integrable random variables (up to equality almost
surely) by $L^1(\Omega, \mathcal{F}, \mathbb{P})$.
The properties of the integral for nonnegative functions (see Proposition B.4.6)
extend to the expectation.
Proposition B.4.12. Let X, X1 , X2 be random variables in L1 (Ω, F, P).
(LIN) If a1 , a2 ∈ R, then E[a1 X1 + a2 X2 ] = a1 E[X1 ] + a2 E[X2 ].
E|Xn − X| → 0,
and, hence,
E[Xn ] → E[X].
(Indeed,
E|Xn − X| → 0.
E|Xn − X| → 0.
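Proof. We only prove (FATOU). To use (MON) we write the lim inf as an increasing
limit. Letting $Z_k = \inf_{n \ge k} X_n$, we have
$$\liminf_n X_n = \uparrow \lim_k Z_k,$$
so that by (MON)
$$\mathbb{E}[\liminf_n X_n] = \uparrow \lim_k \mathbb{E}[Z_k].$$
Finally, we get
$$\mathbb{E}[\liminf_n X_n] \le \uparrow \lim_k \inf_{n \ge k} \mathbb{E}[X_n].$$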
E[h(X)] ≥ h(E[X]).
The $L^p$ norm defined earlier applies to random variables as well. That is, for
$p \ge 1$, we let $\|X\|_p = \mathbb{E}[|X|^p]^{1/p}$ and denote by $L^p(\Omega, \mathcal{F}, \mathbb{P})$ the collection of
random variables $X$ (up to almost sure equality) such that $\|X\|_p < +\infty$. Jensen’s
inequality (Theorem B.4.15) implies the following relationship.
This latter inequality is useful among other things to argue about the convergence
of expectations. We say that $X_n$ converges to $X_\infty$ in $L^p$ if $\|X_n - X_\infty\|_p \to 0$. By
the previous lemma, convergence in $L^r$ implies convergence in $L^p$ for $r \ge p \ge 1$.
Further we have:
$$\|X_n - X_\infty\|_1 \to 0,$$
implies
$$\mathbb{E}[X_n] \to \mathbb{E}[X_\infty].$$
where
$$I_1^f(s_1) := \int_{S_2} f(s_1, s_2)\, \mu_2(ds_2) \in b\Sigma_1,$$
$$I_2^f(s_2) := \int_{S_1} f(s_1, s_2)\, \mu_1(ds_1) \in b\Sigma_2.$$
(The equality and inclusions above are part of the statement.) The set function
µ is a measure on (S, Σ) called the product measure of µ1 and µ2 and we write
µ = µ1 × µ2 and
If f ∈ (mΣ)+ then
$$\mu(f) = \int_{S_1} I_1^f(s_1)\, \mu_1(ds_1) = \int_{S_2} I_2^f(s_2)\, \mu_2(ds_2),$$
where I1f , I2f are defined as before (i.e., as the sup over bounded functions from
below). The same is valid if f ∈ mΣ and µ(|f |) < +∞.
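On finite spaces the statement reduces to exchanging the order of two finite sums, which the following sketch verifies with exact rational arithmetic. The representation of a measure as a dictionary of point masses, the function names and the example are our own illustrative assumptions.

```python
from fractions import Fraction


def iterated_integrals(f, mu1, mu2):
    """Integrate f(s1, s2) against mu1 x mu2 in both orders and directly.

    mu1 and mu2 map points to point masses (finite measures on finite sets).
    """
    # Integrate I_1^f(s1) = sum_{s2} f(s1, s2) mu2({s2}) against mu1.
    first = sum(m1 * sum(f(s1, s2) * m2 for s2, m2 in mu2.items())
                for s1, m1 in mu1.items())
    # Integrate I_2^f(s2) = sum_{s1} f(s1, s2) mu1({s1}) against mu2.
    second = sum(m2 * sum(f(s1, s2) * m1 for s1, m1 in mu1.items())
                 for s2, m2 in mu2.items())
    # Direct integral against the product measure.
    direct = sum(f(s1, s2) * m1 * m2
                 for s1, m1 in mu1.items() for s2, m2 in mu2.items())
    return first, second, direct


if __name__ == "__main__":
    mu1 = {0: Fraction(1, 3), 1: Fraction(2, 3)}
    mu2 = {1: Fraction(1, 2), 2: Fraction(1, 4), 3: Fraction(1, 4)}
    print(iterated_integrals(lambda s1, s2: (s1 + 1) * s2 ** 2, mu1, mu2))
```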
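1. For $f = \mathbf{1}_B$ with $B \in \mathcal{B}$,
$$\mathbb{E}[\mathbf{1}_B(X)] = \mathcal{L}(B) = \int_{\mathbb{R}} \mathbf{1}_B(y)\, \mathcal{L}(dy).$$
2. If $f = \sum_{k=1}^{m} a_k \mathbf{1}_{A_k}$ is a simple function, then by (LIN)
$$\mathbb{E}[f(X)] = \sum_{k=1}^{m} a_k \mathbb{E}[\mathbf{1}_{A_k}(X)] = \sum_{k=1}^{m} a_k \int_{\mathbb{R}} \mathbf{1}_{A_k}(y)\, \mathcal{L}(dy) = \int_{\mathbb{R}} f(y)\, \mathcal{L}(dy).$$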
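4. Finally, assume that $f$ is such that $\mathbb{E}|f(X)| < +\infty$. Then by (LIN)
$$\begin{aligned}
\mathbb{E}[f(X)] &= \mathbb{E}[f^+(X)] - \mathbb{E}[f^-(X)]\\
&= \int_{\mathbb{R}} f^+(y)\, \mathcal{L}(dy) - \int_{\mathbb{R}} f^-(y)\, \mathcal{L}(dy)\\
&= \int_{\mathbb{R}} f(y)\, \mathcal{L}(dy).
\end{aligned}$$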
Definition B.5.5 (Density). Let $X$ be a random variable with law $\mu$. We say that
$X$ has density $f_X$ if for all $B \in \mathcal{B}(\mathbb{R})$
$$\mu(B) = \mathbb{P}[X \in B] = \int_B f_X(x)\, \lambda(dx).$$
Theorem B.5.6 (Convolution). Let $X$ and $Y$ be independent random variables
with distribution functions $F$ and $G$ respectively. Then the distribution function,
$H$, of $X + Y$ is
$$H(z) = \int F(z - y)\, dG(y).$$
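Proof. From Fubini’s Theorem (Theorem B.5.3), denoting the laws of $X$ and $Y$
by $\mu$ and $\nu$ respectively,
$$\begin{aligned}
\mathbb{P}[X + Y \le z] &= \int \int \mathbf{1}_{\{x + y \le z\}}\, \mu(dx)\, \nu(dy)\\
&= \int F(z - y)\, \nu(dy)\\
&= \int F(z - y)\, dG(y)
\end{aligned}$$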
$$\begin{aligned}
&= \int \int_{-\infty}^{z} f(x - y)\, dx\, dG(y)\\
&= \int_{-\infty}^{z} \int f(x - y)\, dG(y)\, dx\\
&= \int_{-\infty}^{z} \int f(x - y) g(y)\, dy\, dx.
\end{aligned}$$
where we assume P[Z = zj ] > 0 for all j. As motivation for the general definition,
we make the following observations.
Y (ω) = yj on Gj = {ω : Z(ω) = zj }.
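$$\begin{aligned}
\mathbb{E}[Y; G_j] &= y_j \mathbb{P}[G_j]\\
&= \sum_i x_i \mathbb{P}[X = x_i \mid Z = z_j]\, \mathbb{P}[Z = z_j]\\
&= \sum_i x_i \mathbb{P}[X = x_i, Z = z_j]\\
&= \mathbb{E}[X; G_j].
\end{aligned}$$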
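The discrete computation above translates directly into code. The sketch below takes a joint probability mass function of $(X, Z)$, computes $y_j = \mathbb{E}[X \mid Z = z_j]$, and checks the identity $\mathbb{E}[Y; G_j] = \mathbb{E}[X; G_j]$ on each $G_j = \{Z = z_j\}$. The dictionary representation, the function names and the two-coin example are our own illustrative assumptions.

```python
from fractions import Fraction


def conditional_expectation(joint):
    """Given joint[(x, z)] = P[X = x, Z = z], return {z: E[X | Z = z]}."""
    p_z = {}
    for (x, z), p in joint.items():
        p_z[z] = p_z.get(z, 0) + p
    return {z: sum(x * p for (x, zz), p in joint.items() if zz == z) / p_z[z]
            for z in p_z}


def check_defining_property(joint):
    """Check E[Y; G_j] = E[X; G_j] on each G_j = {Z = z_j}, where Y = y_j on G_j."""
    y = conditional_expectation(joint)
    for z, y_z in y.items():
        prob_g = sum(p for (x, zz), p in joint.items() if zz == z)
        lhs = y_z * prob_g  # E[Y; G_j] = y_j P[G_j]
        rhs = sum(x * p for (x, zz), p in joint.items() if zz == z)  # E[X; G_j]
        if lhs != rhs:
            return False
    return True


if __name__ == "__main__":
    # X = number of heads in two fair coin flips, Z = outcome of the first flip.
    quarter = Fraction(1, 4)
    joint = {(0, 0): quarter, (1, 0): quarter, (1, 1): quarter, (2, 1): quarter}
    print(conditional_expectation(joint), check_defining_property(joint))
```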
We are ready to state the general definition of the conditional expectation. Its
existence and uniqueness follow from the next theorem.
When G = σ(Z), we sometimes use the notation E[X | Z] := E[X | G]. A similar
convention applies to collections of random variables, for example, E[X | Z1 , Z2 ] :=
E[X | σ(Z1 , Z2 )] and so on.
We first prove uniqueness. Existence is proved below after some more concepts
are introduced.
Proof. Take $(Y_n)$ such that $\|X - Y_n\|_2 \to \Delta$. We use the fact that $L^2(\Omega, \mathcal{G}, \mathbb{P})$ is
complete (Theorem B.4.10) and first seek to prove that $(Y_n)$ is Cauchy. Using the
parallelogram law (Theorem B.4.20), note that
$$\|X - Y_r\|_2^2 + \|X - Y_s\|_2^2 = 2 \left\|X - \frac{1}{2}(Y_r + Y_s)\right\|_2^2 + 2 \left\|\frac{1}{2}(Y_r - Y_s)\right\|_2^2.$$
The first term on the right-hand side is $\ge 2\Delta^2$ by definition of $\Delta$, so taking limits
$r, s \to +\infty$ we have what we need, that is, that $(Y_n)$ is indeed Cauchy.
Let $Y$ be the limit of $(Y_n)$ in $L^2(\Omega, \mathcal{G}, \mathbb{P})$. Note that by the triangle inequality
$$\Delta \le \|X - Y\|_2 \le \|X - Y_n\|_2 + \|Y_n - Y\|_2 \to \Delta,$$
$$\|X - Y - tZ\|_2^2 \ge \Delta^2 = \|X - Y\|_2^2,$$
$$-2t \langle Z, X - Y \rangle + t^2 \|Z\|_2^2 \ge 0,$$
Proof of Theorem B.6.1 (i). The previous theorem implies that conditional expec-
tations exist for indicators and simple functions. Now take X ∈ L1 (Ω, F, P) and
write X = X + − X − , so we can assume X is in fact nonnegative without loss of
generality. Using the staircase function
$$X^{(r)} = \begin{cases} 0, & \text{if } X = 0,\\ (i-1)2^{-r}, & \text{if } (i-1)2^{-r} < X \le i 2^{-r} \le r,\\ r, & \text{if } X > r, \end{cases}$$
we have 0 ≤ X (r) ↑ X. Let Y (r) = E[X (r) | G]. Using an argument similar to
the proof of uniqueness, it follows that U ≥ 0 implies E[U | G] ≥ 0 for a simple
function U . Using linearity (which is immediate from the definition), we then have
Y (r) ↑ Y := lim sup Y (r) which is measurable in G. By (MON),
Example B.6.6. On (Ω, F, P) = ((0, 1], B(0, 1], λ), let G be the σ-algebra of all
countable and co-countable (i.e., whose complement in (0, 1] is countable) subsets
of (0, 1]. Then P[G] ∈ {0, 1} for all G ∈ G and
so that E[X | G] = E[X]. Yet, G contains all singletons and we seemingly have
“full information,” which would lead to the wrong guess $\mathbb{E}[X \mid \mathcal{G}] = X$.
Proof. Use the linearity of expectation and the fact that a linear combination of
random variables in G is also in G.
Proof. Let Y = E[X | G] and assume for contradiction that P[Y < 0] > 0. There
is n ≥ 1 such that P[Y < −n−1 ] > 0. But that implies, for G = {Y < −n−1 },
a contradiction.
for all G ∈ G.
$$X_n \ge Z_m := \inf_{k \ge m} X_k \uparrow \in \mathcal{G},$$
so we must have
$$\liminf_n \mathbb{E}[|X_n - X| \mid \mathcal{G}] = 0.$$
Now use that |E[Xn −X | G]| ≤ E[|Xn −X| | G] (which follows from (cPOS)).
where the Xi s are i.i.d. in Rd , independent of S0 . The case Xi uniform in {−1, +1}
is called simple random walk on Z.
Filtered spaces provide a formal framework for time-indexed processes. We
restrict ourselves to discrete time. (We will not discuss continuous-time processes
in this book.)
Definition B.7.4. A filtered space is a tuple (Ω, F, (Ft )t∈Z+ , P) where:
• (Ω, F, P) is a probability space;
Definition B.7.5. Fix $(\Omega, \mathcal{F}, (\mathcal{F}_t)_{t \in \mathbb{Z}_+}, \mathbb{P})$. A process $(W_t)_{t \ge 0}$ is adapted if
$W_t \in \mathcal{F}_t$ for all $t$.
[AK97] Noga Alon and Michael Krivelevich. The concentration of the chro-
matic number of random graphs. Combinatorica, 17(3):303–313,
1997.
[Ald83] David Aldous. Random walks on finite groups and rapidly mixing
Markov chains. In Seminar on probability, XVII, volume 986 of
Lecture Notes in Math., pages 243–297. Springer, Berlin, 1983.
[Ald90] David J. Aldous. The random walk construction of uniform span-
ning trees and uniform labelled trees. SIAM J. Discrete Math.,
3(4):450–465, 1990.
[Ald97] David Aldous. Brownian excursions, critical random graphs and the
multiplicative coalescent. Ann. Probab., 25(2):812–854, 1997.
[Alo03] Noga Alon. Problems and results in extremal combinatorics. I. Dis-
crete Math., 273(1-3):31–53, 2003. EuroComb’01 (Barcelona).
[AMS09] Jean-Yves Audibert, Rémi Munos, and Csaba Szepesvári. Explo-
ration–exploitation tradeoff using variance estimates in multi-armed
bandits. Theoretical Computer Science, 410(19):1876–1902, April
2009.
[AN04] K. B. Athreya and P. E. Ney. Branching processes. Dover Pub-
lications, Inc., Mineola, NY, 2004. Reprint of the 1972 original
[Springer, New York; MR0373040].
[ANP05] Dimitris Achlioptas, Assaf Naor, and Yuval Peres. Rigorous lo-
cation of phase transitions in hard optimization problems. Nature,
435:759–764, 2005.
[AS11] N. Alon and J.H. Spencer. The Probabilistic Method. Wiley Series
in Discrete Mathematics and Optimization. Wiley, 2011.
[AS15] Emmanuel Abbe and Colin Sandon. Community detection in gen-
eral stochastic block models: Fundamental limits and efficient algo-
rithms for recovery. In Venkatesan Guruswami, editor, IEEE 56th
Annual Symposium on Foundations of Computer Science, FOCS
2015, Berkeley, CA, USA, 17-20 October, 2015, pages 670–688.
IEEE Computer Society, 2015.
[Axl15] Sheldon Axler. Linear algebra done right. Undergraduate Texts in
Mathematics. Springer, Cham, third edition, 2015.
[AZ18] Martin Aigner and Günter M. Ziegler. Proofs from The Book.
Springer, Berlin, sixth edition, 2018. See corrected reprint of the
1998 original [ MR1723092], Including illustrations by Karl H. Hof-
mann.
[Azu67] Kazuoki Azuma. Weighted sums of certain dependent random vari-
ables. Tôhoku Math. J. (2), 19:357–367, 1967.
[BA99] Albert-László Barabási and Réka Albert. Emergence of scaling in
random networks. Science, 286(5439):509–512, 1999.
[BBFdlV00] D. Barraez, S. Boucheron, and W. Fernandez de la Vega. On the
fluctuations of the giant component. Combin. Probab. Comput.,
9(4):287–304, 2000.
[BC03] Bo Brinkman and Moses Charikar. On the Impossibility of Dimen-
sion Reduction in L1. In Proceedings of the 44th Annual IEEE Sym-
posium on Foundations of Computer Science, 2003.
[BCB12] Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret Analysis of
Stochastic and Nonstochastic Multi-armed Bandit Problems. Now
Publishers, 2012. Google-Books-ID: Rl2skwEACAAJ.
[BCMR06] Christian Borgs, Jennifer T. Chayes, Elchanan Mossel, and
Sébastien Roch. The Kesten-Stigum reconstruction bound is tight
for roughly symmetric binary channels. In FOCS, pages 518–530,
2006.
[BD97] Russ Bubley and Martin E. Dyer. Path coupling: A technique for
proving rapid mixing in markov chains. In 38th Annual Symposium
on Foundations of Computer Science, FOCS ’97, Miami Beach,
Florida, USA, October 19-22, 1997, pages 223–231. IEEE Com-
puter Society, 1997.
[BDDW08] Richard Baraniuk, Mark Davenport, Ronald DeVore, and Michael
Wakin. A simple proof of the restricted isometry property for ran-
dom matrices. Constr. Approx., 28(3):253–263, 2008.
[BDJ99] Jinho Baik, Percy Deift, and Kurt Johansson. On the distribution of
the length of the longest increasing subsequence of random permu-
tations. J. Amer. Math. Soc., 12(4):1119–1178, 1999.
[Ber46] S.N. Bernstein. Probability Theory (in russian). M.-L. Gostechiz-
dat, 1946.
[Ber14] Nathanaël Berestycki. Lectures on mixing times: A crossroad
between probability, analysis and geometry. Available at
https://homepage.univie.ac.at/nathanael.berestycki/wp-content/uploads/2022/05/mixing3.pdf, 2014.
[BH57] S. R. Broadbent and J. M. Hammersley. Percolation processes. I.
Crystals and mazes. Proc. Cambridge Philos. Soc., 53:629–641,
1957.
[BKW14] Itai Benjamini, Gady Kozma, and Nicholas Wormald. The mixing
time of the giant component of a random graph. Random Structures
Algorithms, 45(3):383–407, 2014.
[Bol84] Béla Bollobás. The evolution of random graphs. Trans. Amer. Math.
Soc., 286(1):257–274, 1984.
[Bol98] Béla Bollobás. Modern graph theory, volume 184 of Graduate Texts
in Mathematics. Springer-Verlag, New York, 1998.
[BR06b] Béla Bollobás and Oliver Riordan. A short proof of the Harris-
Kesten theorem. Bull. London Math. Soc., 38(3):470–484, 2006.
[Bre20] Pierre Bremaud. Markov chains—Gibbs fields, Monte Carlo sim-
ulation and queues, volume 31 of Texts in Applied Mathematics.
Springer, Cham, 2020. Second edition [of 1689633].
[BRST01] Béla Bollobás, Oliver Riordan, Joel Spencer, and Gábor Tusnády.
The degree sequence of a scale-free random graph process. Random
Structures Algorithms, 18(3):279–290, 2001.
[CF07] Colin Cooper and Alan Frieze. The cover time of sparse random
graphs. Random Structures Algorithms, 30(1-2):1–16, 2007.
[Che70] Jeff Cheeger. A lower bound for the smallest eigenvalue of the
Laplacian. In Problems in analysis (Sympos. in honor of Salomon
Bochner, Princeton Univ., Princeton, N.J., 1969), pages 195–199.
Princeton Univ. Press, Princeton, N.J., 1970.
[Chu97] Fan R. K. Chung. Spectral graph theory, volume 92 of CBMS Re-
gional Conference Series in Mathematics. Published for the Confer-
ence Board of the Mathematical Sciences, Washington, DC; by the
American Mathematical Society, Providence, RI, 1997.
[CL06] Fan Chung and Linyuan Lu. Complex graphs and networks, vol-
ume 107 of CBMS Regional Conference Series in Mathematics.
Published for the Conference Board of the Mathematical Sciences,
Washington, DC; by the American Mathematical Society, Provi-
dence, RI, 2006.
[CR92] V. Chvatal and B. Reed. Mick gets some (the odds are on his side)
[satisfiability]. In Foundations of Computer Science, 1992. Proceed-
ings., 33rd Annual Symposium on, pages 620–627, Oct 1992.
Colin McDiarmid, Jorge Ramirez-Alfonsin, and Bruce Reed, ed-
itors, Probabilistic Methods for Algorithmic Discrete Mathemat-
ics, volume 16 of Algorithms and Combinatorics, pages 249–314.
Springer Berlin Heidelberg, 1998.
[Dey] Partha Dey. Lecture notes on “Stein-Chen method for Poisson
approximation”. https://faculty.math.illinois.edu/~psdey/414CourseNotes.pdf.
[DGG+ 00] Martin Dyer, Leslie Ann Goldberg, Catherine Greenhill, Mark Jer-
rum, and Michael Mitzenmacher. An extension of path coupling and
its application to the Glauber dynamics for graph colourings (ex-
tended abstract). In Proceedings of the Eleventh Annual ACM-SIAM
Symposium on Discrete Algorithms (San Francisco, CA, 2000),
pages 616–624. ACM, New York, 2000.
[Dia09] Persi Diaconis. The Markov chain Monte Carlo revolution. Bull.
Amer. Math. Soc. (N.S.), 46(2):179–205, 2009.
[DKLP11] Jian Ding, Jeong Han Kim, Eyal Lubetzky, and Yuval Peres.
Anatomy of a young giant component in the random graph. Ran-
dom Structures Algorithms, 39(2):139–178, 2011.
[Doe38] Wolfgang Doeblin. Exposé de la théorie des chaı̂nes simples con-
stantes de markoff à un nombre fini d’états. Rev. Math. Union Inter-
balkan, 2:77–105, 1938.
[Don06] David L. Donoho. Compressed sensing. IEEE Trans. Inform. The-
ory, 52(4):1289–1306, 2006.
[Doo01] J.L. Doob. Classical Potential Theory and Its Probabilistic Coun-
terpart. Classics in Mathematics. Springer Berlin Heidelberg, 2001.
[DP11] Jian Ding and Yuval Peres. Mixing time for the Ising model: a uni-
form lower bound for all graphs. Ann. Inst. Henri Poincaré Probab.
Stat., 47(4):1020–1028, 2011.
[DS84] P.G. Doyle and J.L. Snell. Random walks and electric net-
works. Carus mathematical monographs. Mathematical Association
of America, 1984.
[DS91] Persi Diaconis and Daniel Stroock. Geometric bounds for eigenval-
ues of Markov chains. Ann. Appl. Probab., 1(1):36–61, 1991.
[DSC93a] Persi Diaconis and Laurent Saloff-Coste. Comparison techniques
for random walk on finite groups. Ann. Probab., 21(4):2131–2156,
1993.
[DSC93b] Persi Diaconis and Laurent Saloff-Coste. Comparison theorems for
reversible Markov chains. Ann. Appl. Probab., 3(3):696–730, 1993.
[DSS22] Jian Ding, Allan Sly, and Nike Sun. Proof of the satisfiability con-
jecture for large k. Ann. of Math. (2), 196(1):1–388, 2022.
[Dur85] Richard Durrett. Some general results concerning the critical ex-
ponents of percolation processes. Z. Wahrsch. Verw. Gebiete,
69(3):421–437, 1985.
[Dur06] R. Durrett. Random Graph Dynamics. Cambridge Series in Sta-
tistical and Probabilistic Mathematics. Cambridge University Press,
2006.
[Dur10] R. Durrett. Probability: Theory and Examples. Cambridge Series
in Statistical and Probabilistic Mathematics. Cambridge University
Press, 2010.
[Dur12] Richard Durrett. Essentials of stochastic processes. Springer Texts
in Statistics. Springer, New York, second edition, 2012.
[DZ10] Amir Dembo and Ofer Zeitouni. Large deviations techniques and
applications, volume 38 of Stochastic Modelling and Applied Prob-
ability. Springer-Verlag, Berlin, 2010. Corrected reprint of the sec-
ond (1998) edition.
[Ebe] Andreas Eberle. Markov Processes. 2021.
https://uni-bonn.sciebo.de/s/kzTUFff5FrWGAay.
[EKPS00] W. S. Evans, C. Kenyon, Y. Peres, and L. J. Schulman. Broadcasting
on trees and the Ising model. Ann. Appl. Probab., 10(2):410–433,
2000.
[ER59] P. Erdős and A. Rényi. On random graphs. I. Publ. Math. Debrecen,
6:290–297, 1959.
[ER60] P. Erdős and A. Rényi. On the evolution of random graphs. Magyar
Tud. Akad. Mat. Kutató Int. Közl., 5:17–61, 1960.
[Fel71] William Feller. An introduction to probability theory and its ap-
plications. Vol. II. Second edition. John Wiley & Sons, Inc., New
York-London-Sydney, 1971.
[FK16] Alan Frieze and Michał Karoński. Introduction to random graphs.
Cambridge University Press, Cambridge, 2016.
[FKG71] C. M. Fortuin, P. W. Kasteleyn, and J. Ginibre. Correlation inequali-
ties on some partially ordered sets. Comm. Math. Phys., 22:89–103,
1971.
[FM88] P. Frankl and H. Maehara. The Johnson-Lindenstrauss lemma and
the sphericity of some graphs. J. Combin. Theory Ser. B, 44(3):355–
362, 1988.
[Fos53] F. G. Foster. On the stochastic matrices associated with certain queu-
ing processes. Ann. Math. Statistics, 24:355–360, 1953.
[FR98] Alan M. Frieze and Bruce Reed. Probabilistic analysis of algorithms.
In Michel Habib, Colin McDiarmid, Jorge Ramirez-Alfonsin, and
Bruce Reed, editors, Probabilistic Methods for Algorithmic Discrete
Mathematics, volume 16 of Algorithms and Combinatorics, pages
36–92. Springer Berlin Heidelberg, 1998.
[FR08] N. Fountoulakis and B. A. Reed. The evolution of the mixing rate of
a simple random walk on the giant component of a random graph.
Random Structures Algorithms, 33(1):68–86, 2008.
[FR13] Simon Foucart and Holger Rauhut. A Mathematical Introduction to
Compressive Sensing. Applied and Numerical Harmonic Analysis.
Birkhäuser Basel, 2013.
[GC11] Aurélien Garivier and Olivier Cappé. The KL-UCB Algorithm for
Bounded Stochastic Bandits and Beyond. In Proceedings of the 24th
Annual Conference on Learning Theory, pages 359–376. JMLR
Workshop and Conference Proceedings, December 2011. ISSN:
1938-7228.
[GCS+14] Andrew Gelman, John B. Carlin, Hal S. Stern, David B. Dunson,
Aki Vehtari, and Donald B. Rubin. Bayesian data analysis. Texts in
Statistical Science Series. CRC Press, Boca Raton, FL, third edition,
2014.
[GL06] Dani Gamerman and Hedibert Freitas Lopes. Markov chain Monte
Carlo. Texts in Statistical Science Series. Chapman & Hall/CRC,
Boca Raton, FL, second edition, 2006. Stochastic simulation for
Bayesian inference.
[GS20] Geoffrey R. Grimmett and David R. Stirzaker. Probability and ran-
dom processes. Oxford University Press, Oxford, 2020. Fourth edi-
tion [of 0667520].
[Har] Nicholas Harvey. Lecture notes for CPSC 536N: Randomized
Algorithms. http://www.cs.ubc.ca/~nickhar/W12/.
[HLW06] Shlomo Hoory, Nathan Linial, and Avi Wigderson. Expander graphs
and their applications. Bull. Amer. Math. Soc. (N.S.), 43(4):439–561
(electronic), 2006.
[Jan90] Svante Janson. Poisson approximation for large deviations. Random
Structures Algorithms, 1(2):221–229, 1990.
[JH01] Galin L. Jones and James P. Hobert. Honest exploration of in-
tractable probability distributions via Markov chain Monte Carlo.
Statist. Sci., 16(4):312–334, 2001.
[JL84] William B. Johnson and Joram Lindenstrauss. Extensions of Lip-
schitz mappings into a Hilbert space. In Conference in modern
analysis and probability (New Haven, Conn., 1982), volume 26 of
Contemp. Math., pages 189–206. Amer. Math. Soc., Providence, RI,
1984.
[JLR11] S. Janson, T. Luczak, and A. Rucinski. Random Graphs. Wiley
Series in Discrete Mathematics and Optimization. Wiley, 2011.
[JS89] Mark Jerrum and Alistair Sinclair. Approximating the permanent.
SIAM J. Comput., 18(6):1149–1178, 1989.
[Kan86] Masahiko Kanai. Rough isometries and the parabolicity of Rieman-
nian manifolds. J. Math. Soc. Japan, 38(2):227–238, 1986.
[Kar90] Richard M. Karp. The transitive closure of a random digraph. Ran-
dom Structures Algorithms, 1(1):73–93, 1990.
[Kes80] Harry Kesten. The critical probability of bond percolation on the
square lattice equals $\frac{1}{2}$. Comm. Math. Phys., 74(1):41–59, 1980.
[Kes82] Harry Kesten. Percolation theory for mathematicians, volume 2 of
Progress in Probability and Statistics. Birkhäuser, Boston, Mass.,
1982.
[KP] Júlia Komjáthy and Yuval Peres. Lecture notes for Markov
chains: mixing times, hitting times, and cover times, 2012.
Saint-Petersburg Summer School.
http://www.win.tue.nl/~jkomjath/SPBlecturenotes.pdf.
[KS66b] H. Kesten and B. P. Stigum. A limit theorem for multidimensional
Galton-Watson processes. Ann. Math. Statist., 37:1211–1223, 1966.
[KS05] Gil Kalai and Shmuel Safra. Threshold Phenomena and Influence:
Perspectives from Mathematics, Computer Science,
and Economics. In Computational Complexity and Statistical
Physics. Oxford University Press, December 2005. eprint:
https://academic.oup.com/book/0/chapter/354512033/chapter-pdf/43716844/isbn-9780195177374-book-part-8.pdf.
[Lov12] László Lovász. Large networks and graph limits, volume 60 of
American Mathematical Society Colloquium Publications. Amer-
ican Mathematical Society, Providence, RI, 2012.
[LP16] Russell Lyons and Yuval Peres. Probability on trees and networks,
volume 42 of Cambridge Series in Statistical and Probabilistic
Mathematics. Cambridge University Press, New York, 2016.
[LP17] David A. Levin and Yuval Peres. Markov chains and mixing times.
American Mathematical Society, Providence, RI, 2017. Second edition
of [MR2466937], with contributions by Elizabeth L. Wilmer,
with a chapter on “Coupling from the past” by James G. Propp and
David B. Wilson.
[LR85] T. L. Lai and Herbert Robbins. Asymptotically efficient adaptive
allocation rules. Advances in Applied Mathematics, 6(1):4–22, March
1985.
[LS12] Eyal Lubetzky and Allan Sly. Critical Ising on the square lattice
mixes in polynomial time. Comm. Math. Phys., 313(3):815–836,
2012.
[Lu90] Tomasz Łuczak. Component behavior near the critical point of the
random graph process. Random Structures Algorithms, 1(3):287–
310, 1990.
[LuPW94] Tomasz Łuczak, Boris Pittel, and John C. Wierman. The structure
of a random graph at the point of the phase transition. Trans. Amer.
Math. Soc., 341(2):721–748, 1994.
[Lyo83] Terry Lyons. A simple criterion for transience of a reversible
Markov chain. Ann. Probab., 11(2):393–402, 1983.
[ML98] Anders Martin-Löf. The final size of a nearly critical epidemic, and
the first passage time of a Wiener process to a parabolic barrier. J.
Appl. Probab., 35(3):671–682, 1998.
[MNS15a] Elchanan Mossel, Joe Neeman, and Allan Sly. Consistency thresh-
olds for the planted bisection model. In Rocco A. Servedio and
Ronitt Rubinfeld, editors, Proceedings of the Forty-Seventh Annual
ACM on Symposium on Theory of Computing, STOC 2015, Port-
land, OR, USA, June 14-17, 2015, pages 69–75. ACM, 2015.
[MNS15b] Elchanan Mossel, Joe Neeman, and Allan Sly. Reconstruction and
estimation in the planted partition model. Probab. Theory Related
Fields, 162(3-4):431–461, 2015.
[Mos04] E. Mossel. Phase transitions in phylogeny. Trans. Amer. Math. Soc.,
356(6):2379–2404, 2004.
[MP03] E. Mossel and Y. Peres. Information flow on trees. Ann. Appl.
Probab., 13(3):817–844, 2003.
[MR95] Rajeev Motwani and Prabhakar Raghavan. Randomized algorithms.
Cambridge University Press, Cambridge, 1995.
[MS86] Vitali D. Milman and Gideon Schechtman. Asymptotic theory of
finite-dimensional normed spaces, volume 1200 of Lecture Notes in
Mathematics. Springer-Verlag, Berlin, 1986. With an appendix by
M. Gromov.
[MT06] Ravi Montenegro and Prasad Tetali. Mathematical aspects of mix-
ing times in Markov chains. Found. Trends Theor. Comput. Sci.,
1(3):x+121, 2006.
[MT09] Sean Meyn and Richard L. Tweedie. Markov chains and stochastic
stability. Cambridge University Press, Cambridge, second edition,
2009. With a prologue by Peter W. Glynn.
[MU05] Michael Mitzenmacher and Eli Upfal. Probability and Comput-
ing: Randomized Algorithms and Probabilistic Analysis. Cambridge
University Press, New York, NY, USA, 2005.
[Nic18] Bogdan Nica. A brief introduction to spectral graph theory. EMS
Textbooks in Mathematics. European Mathematical Society (EMS),
Zürich, 2018.
[Nor98] J. R. Norris. Markov chains, volume 2 of Cambridge Series in Sta-
tistical and Probabilistic Mathematics. Cambridge University Press,
Cambridge, 1998. Reprint of 1997 original.
[NP10] Asaf Nachmias and Yuval Peres. The critical random graph, with
martingales. Israel J. Math., 176:29–41, 2010.
[NW59] C. St. J. A. Nash-Williams. Random walk and electric currents in
networks. Proc. Cambridge Philos. Soc., 55:181–194, 1959.
[Ott49] Richard Otter. The multiplicative process. Ann. Math. Statistics,
20:206–224, 1949.
[Pem91] Robin Pemantle. Choosing a spanning tree for the integer lattice
uniformly. Ann. Probab., 19(4):1559–1574, 1991.
[Pem00] Robin Pemantle. Towards a theory of negative dependence. J. Math.
Phys., 41(3):1371–1390, 2000. Probabilistic techniques in equilib-
rium and nonequilibrium statistical physics.
[Pet] Gábor Pete. Probability and geometry on groups. Lecture notes for
a graduate course. http://www.math.bme.hu/~gabor/PGG.html.
[Pit90] Boris Pittel. On tree census and the giant component in sparse ran-
dom graphs. Random Structures Algorithms, 1(3):311–342, 1990.
[RS17] Sebastien Roch and Allan Sly. Phase transition in the sample com-
plexity of likelihood-based phylogeny inference. Probability Theory
and Related Fields, 169(1):3–62, Oct 2017.
[RT87] WanSoo T. Rhee and Michel Talagrand. Martingale inequalities and
NP-complete problems. Math. Oper. Res., 12(1):177–181, 1987.
[Rud73] Walter Rudin. Functional analysis. McGraw-Hill Series in Higher
Mathematics. McGraw-Hill Book Co., New York-Düsseldorf-
Johannesburg, 1973.
[Rus78] Lucio Russo. A note on percolation. Z. Wahrscheinlichkeitstheorie
und Verw. Gebiete, 43(1):39–48, 1978.
[SE64] M. F. Sykes and J. W. Essam. Exact critical percolation probabilities
for site and bond problems in two dimensions. J. Mathematical
Phys., 5:1117–1127, 1964.
[SJ89] Alistair Sinclair and Mark Jerrum. Approximate counting, uniform
generation and rapidly mixing Markov chains. Inform. and Comput.,
82(1):93–133, 1989.
[Spi56] Frank Spitzer. A combinatorial lemma and its application to proba-
bility theory. Trans. Amer. Math. Soc., 82:323–339, 1956.
[Spi12] Daniel A. Spielman. Lecture notes on spectral graph theory.
https://www.cs.yale.edu/homes/spielman/561/2012/index.html, 2012.
[SS87] E. Shamir and J. Spencer. Sharp concentration of the chromatic
number on random graphs $G_{n,p}$. Combinatorica, 7(1):121–129,
1987.
[SS03] Elias M. Stein and Rami Shakarchi. Fourier analysis, volume 1 of
Princeton Lectures in Analysis. Princeton University Press, Prince-
ton, NJ, 2003. An introduction.
[SS05] Elias M. Stein and Rami Shakarchi. Real analysis, volume 3 of
Princeton Lectures in Analysis. Princeton University Press, Prince-
ton, NJ, 2005. Measure theory, integration, and Hilbert spaces.
[SSBD14] S. Shalev-Shwartz and S. Ben-David. Understanding Machine
Learning: From Theory to Algorithms. Cambridge University Press,
2014.
[Ste] J. E. Steif. A mini course on percolation theory, 2009.
http://www.math.chalmers.se/~steif/perc.pdf.
[Ste72] Charles Stein. A bound for the error in the normal approximation
to the distribution of a sum of dependent random variables. In Pro-
ceedings of the Sixth Berkeley Symposium on Mathematical Statis-
tics and Probability (Univ. California, Berkeley, Calif., 1970/1971),
Vol. II: Probability theory, pages 583–602, 1972.
[Ste97] J. Michael Steele. Probability theory and combinatorial optimiza-
tion, volume 69 of CBMS-NSF Regional Conference Series in Ap-
plied Mathematics. Society for Industrial and Applied Mathematics
(SIAM), Philadelphia, PA, 1997.
[Ste98] G. W. Stewart. Matrix algorithms. Vol. I. Society for Industrial and
Applied Mathematics, Philadelphia, PA, 1998. Basic decomposi-
tions.
[Str65] V. Strassen. The existence of probability measures with given
marginals. Ann. Math. Statist., 36:423–439, 1965.
[Str14] Daniel W. Stroock. Doeblin’s Theory for Markov Chains. In An In-
troduction to Markov Processes, pages 25–47. Springer Berlin Hei-
delberg, Berlin, Heidelberg, 2014.
[SW78] P. D. Seymour and D. J. A. Welsh. Percolation probabilities on the
square lattice. Ann. Discrete Math., 3:227–245, 1978. Advances
in graph theory (Cambridge Combinatorial Conf., Trinity College,
Cambridge, 1977).
[Tao] Terence Tao. Open question: deterministic UUP matrices.
https://terrytao.wordpress.com/2007/07/02/open-question-deterministic-uup-matrices/.
[Var85] Nicholas Th. Varopoulos. Long range estimates for Markov chains.
Bull. Sci. Math. (2), 109(3):225–252, 1985.
[vdH10] Remco van der Hofstad. Percolation and random graphs. In New
perspectives in stochastic geometry, pages 173–247. Oxford Univ.
Press, Oxford, 2010.
[vdH17] Remco van der Hofstad. Random graphs and complex networks.
Vol. 1. Cambridge Series in Statistical and Probabilistic Mathemat-
ics, [43]. Cambridge University Press, Cambridge, 2017.
[vdHK08] Remco van der Hofstad and Michael Keane. An elementary proof
of the hitting time theorem. Amer. Math. Monthly, 115(8):753–756,
2008.
Index
    positively related variables, 324
    Stein coupling, 302, 306
    Stein equation, 301
Chernoff-Cramér bound, see tail bounds
Chernoff-Cramér method, 67, 117, 146
community recovery, 345
compressed sensing, 96
concentration inequalities, see tail bounds, 120
concentration of measure, 156
conditional expectation
    definition, 522–525
    examples, 525–526
    law of total probability, 527
    properties, 526–527
    tower property, 527
congestion ratio, 406
convergence in probability, 32
convex duality
    Lagrangian, 203
    weak duality, 203
coupling
    coalescence time, 282
    coalescing, 282
    coupling inequality, 241, 247
    definition, 235
    Erdős-Rényi graph model, 466, 483
    independent, 235
    Markovian, 246, 282
    maximal coupling, 242
    monotone, 235, 236, 251, 254, 257
    path coupling, 293, 299
    splitting, 283
    stochastic domination, 251
    Strassen’s theorem, 254
coupon collector, 31, 286
Courant-Fischer theorem, 331, 336, 340, 341, 384
covariance, 30
covering number, 87, 111
cumulant-generating function, 64, 66
Curie-Weiss model, 401
cutset, 58

Davis-Kahan theorem, 343, 349
dependency graph, 46
dimension reduction, 93
Dirichlet
    energy, 383, 406
    form, 383
    principle, 206, 229
    problem, 183
drift condition, 190
Dudley’s inequality, 91, 111

edge expansion constant, see networks
Efron-Stein inequality, 151
electrical networks
    definitions, 192–194
    Dirichlet energy, see Dirichlet
    effective conductance, 200, 206
    effective resistance, 200, 214, 229
    flow, see flows
    Kirchhoff’s cycle law, 193
    Kirchhoff’s node law, 193
    Kirchhoff’s resistance formula, 218
    Nash-Williams inequality, 206, 229
    Ohm’s law, 193, 219
    parallel law, 196
    Rayleigh’s principle, 209, 219
    series law, 196
    Thomson’s principle, 202, 208
    voltage, 192
empirical measure, 110, 112
epsilon-net, 86, 89, 97–101
Erdős-Rényi graph model
    chromatic number, 158
    clique number, 48, 318, 321
    cluster, 468
    connectivity, 52
    definition, 18
    degree sequence, 247
    evolution, 466
    exploration process, 468
    FKG, 265
    giant component, 467, 481
    isolated vertices, 51
    largest connected component, 467
    maximum degree, 78
    positive associations, 263
    random walk, 490
    subgraph containment, 269
    threshold function, 47–55
exhaustive sequence, 229
expander graphs
    (d, α)-expander family, 394
    Pinsker’s model, 395

factorials
    bounds, 22, 502
    definition, x
    Stirling’s formula, 502
Fenchel-Legendre dual, 66
first moment method, 36–41, 46, 48, 49, 52–54, 56, 144, 161
first moment principle, 33, 36, 104
first-step analysis, 186
FKG
    condition, 264, 321
    inequality, 265, 321, 325
    measure, 264, 322
flows
    current, 193
    definition, 6
    energy, 202, 209, 212
    flow-conservation constraints, 193, 210
    max-flow min-cut theorem, 7, 254
    strength, 193
    to ∞, 209, 211

gambler’s ruin, 138–141, 195, 201
Gibbs random fields, 19
Glauber dynamics
    definition, 21
    fast mixing, 297, 401
graph Laplacian
    connectivity, 334
    definition, 333
    degree, 335
    eigenvalues, 334
    Fiedler vector, 334
    network, 339
    normalized, 340
    quadratic form, 333
graphs
    3–1 tree, see trees
    b-ary tree $\widehat{\mathbb{T}}_b^\ell$, 5, 290
    n-clique, 382
    adjacency matrix, see matrices
    bridge, 218
    Cayley’s formula, see trees
    chromatic number, 8, 158
    clique, 3, 48
    clique number, 48, 318
    coloring, 8
    complete graph $K_n$, 5
    cutset, 58, 206
    cycle $C_n$, 5, 285
    definitions, 2–9
    degree, 78
    diameter, 368
    directed, 8, 9
    expander, see expander graphs
    flow, see flows
    graph distance, 4
    hypercube $\mathbb{Z}_2^n$, 5, 286
    incidence matrix, see matrices
    independent set, 8, 34
    infinite, 5
    infinite binary tree, 451
    Laplacian, see graph Laplacian
    matching, 8
    matrix representation, see matrices
    multigraph, 2
    network, see networks
    oriented incidence matrix, see matrices
    perfect matching, 8
    spanning arborescence, 222
    torus $\mathbb{L}_n^d$, 5
    tree, see trees
    Turán graphs, 35
Green function, see Markov chains

Hölder’s inequality, 515
Hamming distance, 178
harmonic functions, see Markov chains
Harper’s vertex isoperimetric theorem, 158
Harris’ inequality, 320, 325
Harris’ theorem, see percolation
hitting time, see stopping time
Hoeffding’s inequality, see tail bounds
Hoeffding’s lemma, see tail bounds, 146
Holley’s inequality, 266, 321, 325

increasing event, see posets
indicator trick, 36
inherited property, 421
Ising model
    boundary conditions, 260
    complete graph, 401
    definition, 19
    FKG, 265
    Glauber dynamics, 286, 297, 400
    magnetization, 401
    random cluster, 451
    trees, 451, 499
isoperimetric inequality, 381

Janson’s inequality, 271, 325
Jensen’s inequality, 518
Johnson-Lindenstrauss
    distributional lemma, 93
    lemma, 93–96

Kesten’s theorem, see percolation
knapsack problem, 79
Kolmogorov’s maximal inequality, 141
Kullback-Leibler divergence, 68

Laplacian
    graphs, see graph Laplacian
    Markov chains, 183, 204, 323
    networks, see graph Laplacian
large deviations, 69, 120
law of total probability, see conditional expectation
laws of large numbers, 32–33
Lipschitz
    condition, 178–180
    process, 88
Lyapounov function, see Markov chains

Markov chain Monte Carlo, 370
Markov chain tree theorem, 231
Markov chains
    average occupation time, 186
    birth-death, 194, 228, 376
    bottleneck ratio, see networks
    Chapman-Kolmogorov, 10
    commute time, 214
    commute time identity, 214, 230, 290
    construction, 10
    cover time, 124
    decomposition theorem, 127
    definitions, 9–17
    Doeblin’s condition, 283, 322
    escape probability, 197, 205
    examples, 9–16
    exit law, 186
    exit probability, 186
    first return time, 124
    first visit time, 124, 181
    Green function, 129, 186, 197
    harmonic functions, 181
    hitting times, 313
    irreducible set, 127
    lower bound, 285
    Lyapounov function, 190, 294
    Markov property, 10
    martingales, 134
    Matthews’ cover time bounds, 132, 230
    mean exit time, 186
    Metropolis algorithm, 15
    mixing time, see mixing times
    positive recurrence, 127
    potential theory, 185
    recurrence, 199, 209–214
    recurrent state, 127
    relaxation time, 358
    reversibility, 181, 365
    splitting, see coupling
    stationary measure, 128
    stochastic domination, 260, 267
    stochastic monotonicity, 258
    strong Markov property, 124
    uniform geometric ergodicity, 284
    Varopoulos-Carne bound, 365, 414
Markov’s inequality, see tail bounds, 64
martingales, 456
    Azuma-Hoeffding inequality, see tail bounds
    convergence theorem, 142, 417
    definition, 133
    Doob martingale, 135, 147, 158
    Doob’s submartingale inequality, 141, 146
    edge exposure martingale, 159
    exposure martingale, 158
    hitting time, 230
    Markov chain, see Markov chains
    martingale difference, 147
    optional stopping theorem, 137
    orthogonality of increments, 143, 148
    stopped process, 136
    submartingale, 133
    supermartingale, 133
    vertex exposure martingale, 158
matrices
    2-norm, 88
    adjacency, 2, 332
    block, 328
    diagonal, 328
    graph Laplacian, see graph Laplacian
    incidence, 3
    oriented incidence, 9
    orthogonal, 328
    spectral norm, 88, 179, 341
    spectral radius, 410, 425
    spectrum, 410
    stochastic matrix, 9
    symmetric, 328
maximal Azuma-Hoeffding inequality, see tail bounds
maximum principle, 14, 24, 183, 229
method of bounded differences, see tail bounds
method of moments, 119, 120
method of random paths, 216, 229
Metropolis algorithm, see Markov chains
minimum bisection problem, 347
Minkowski’s inequality, 516
mixing times
    b-ary tree, 290, 391
    cutoff, 289, 364, 411
    cycle, 322, 359, 392
    definition, 17
    diameter, 369
    diameter bound, 369
    distinguishing statistic, 289, 322
    hypercube, 362, 393, 413
    lower bound, 284, 287, 297, 322, 368, 369
    random walk on cycle, 285
    random walk on hypercube, 286
    separation distance, 357
    upper bound, 297
moment-generating functions
    $\chi^2$, 73
    definition, 28
    Gaussian, 65
    Poisson, 67
    Rademacher, 65
moments, 28
    exponential moment, see moment-generating functions
multi-armed bandits, see bandits
multitype branching processes
    definitions, 423–425
    Kesten-Stigum bound, 455
    mean matrix, 424
    nonsingular case, 424

negative associations, 219
networks
    cut, 382
    definition, 8
    edge boundary, 381
    edge expansion, 382
    vertex boundary, 394
no free lunch, see binary classification
notation, ix–xi

operator
    compact, 376
    norm, 377
    spectral radius, 377
optimal transport, 323
optional stopping theorem, see martingales

Pólya’s theorem, see random walk
Pólya’s urn, 142
packing number, 87
Pajor’s lemma, 114
parity functions, 363
Parseval’s identity, 352
pattern matching, 155
peeling method, see slicing method
percolation
    contour lemma, 42
    critical exponents, 441
    critical value, 40, 55, 272, 423, 438
    dual lattice, 41, 273
    Galton-Watson tree, 422
    Harris’ theorem, 273, 325
    Kesten’s theorem, 273, 325
    on $\mathbb{L}^2$, 40, 272
    on $\mathbb{L}^d$, 254
    on a graph, 236
    on infinite trees, 55, 144
    percolation function, 40, 55, 254, 272, 438
    RSW lemma, 273, 322, 325
permutations
    Erdős-Szekeres Theorem, 39
    longest increasing subsequence, 38
    random, 38
Perron-Frobenius theory
    Perron vector, 426
    theorem, 12, 426, 456
Poincaré inequality, 152, 385, 386, 406
Poisson approximation, 245, 269
Poisson equation, 187
Poisson trials, 69
posets
    decreasing event, 264
    definition, 253
    increasing event, 254, 320
positive associations
    definition, 263
    strong, 321
positively correlated events, 264
probabilistic method, 33–36, 93, 103, 112
probability generating function, 418
probability spaces
    definitions, 503–506
    distribution function, 506
    expectation, 513–519
    filtered spaces, 122, 528
    Fubini’s theorem, 519
    independence, 510–513
    process, 528
    random variables, 506–510
pseudo-regret, 171

random graphs
    Erdős-Rényi, see Erdős-Rényi graph model
    preferential attachment, 19, 163
    stochastic blockmodel, 345
random projection, 93
random target lemma, 184
random variables
    $\chi^2$, 73, 95
    Bernoulli, 68, 235, 244, 250, 307
    binomial, 68, 287, 320
    Gaussian, 30, 70, 73, 93
    geometric, 32, 284
    Poisson, 67, 241, 244, 250, 252, 420, 496
    Rademacher, 65, 70
    uncorrelated, 32, 117, 143
    uniform, 31
random walk
    b-ary tree, 290
    asymmetric random walk on $\mathbb{Z}$, 380
    biased random walk on $\mathbb{Z}$, 139, 236
    cycle, 285, 359
    hypercube, 286, 362
    lazy, 16, 238, 284
    loop erasure, 220
    on a graph, 10–15
    on a network, 20
    Pólya’s theorem, 215
    reflection principle, 125
    simple random walk on $\mathbb{Z}$, 11, 127, 138, 183, 211, 365, 528
    simple random walk on $\mathbb{Z}^d$, 238
    simple random walk on a graph, 20
    tree, 239
    Wald’s identities, 137–141
Rayleigh quotient, 330, 336, 384
reconstruction problem
    definition, 451
    MAP estimator, 453
    solvability, 452
reflection principle, see random walk
relaxation times
    cycle, 361
    hypercube, 364
restricted isometry property, 97, 101–102
rough embedding, 210, 211
rough equivalence, 211, 230
rough isometry, 230
RSW lemma, see percolation

Sauer’s lemma, 108, 110, 112
second moment method, 45, 46, 48, 50, 52, 54, 57, 58, 60
set balancing, 66
shattering, 108
simple random walk on a graph, see random walk
slicing method, 176
span, 331
sparse signal recovery, 97
sparsity, 96
spectral clustering, 349
spectral gap, 358
spectral theorem, 328, 340
Spitzer’s combinatorial lemma, 436
Stirling’s formula, see factorials
stochastic bandits, see bandits
stochastic domination
    binomial, 470
    coupling, see coupling
    Markov chains, 258
    monotonicity, 320
    posets, 254
    real random variables, 250
stochastic processes
    adapted, 529
    definition, 528
    filtration, 528
    predictable, 529
    sample path, 528
    supremum, 85–93
stopping time
    cover time, see Markov chains
    definition, 123
    first return time, see Markov chains
    first visit time, see Markov chains
    hitting time, 123
    strong Markov property, see Markov chains
Strassen’s theorem, 325
sub-exponential variable, see tail bounds
sub-Gaussian increments, 91
sub-Gaussian variable, see tail bounds, 146
submodularity, 325
symmetrization, 71, 106

tail bounds
    Azuma-Hoeffding inequality, 145, 158, 159, 227, 380
    Bernstein’s inequality, 78, 493
    Bernstein’s inequality for bounded variables, 77
    Chebyshev’s inequality, 29, 45, 247, 481
    Chernoff bound, 69, 155, 365
    Chernoff-Cramér bound, 64, 75, 85
    definitions, 28–29
    general Bernstein inequality, 76
    general Hoeffding inequality, 70, 89
    Hoeffding’s inequality, 82, 112
    Hoeffding’s inequality for bounded variables, 71
    Hoeffding’s lemma, 72
    Markov’s inequality, 29, 36, 104, 141, 146
    McDiarmid’s inequality, 153, 159, 164, 178, 179
    method of bounded differences, 153, 156, 163
    Paley-Zygmund inequality, 46
    sub-exponential, 74
    sub-Gaussian, 70, 85, 107
    Talagrand’s inequality, 179
Talagrand’s inequality, see tail bounds
threshold phenomena, 37, 41, 47
tilting, 73
total variation distance, 16
tower property, see conditional expectation
trees
    3–1 tree, 61
    Cayley’s formula, 5, 26, 54, 499
    characterization, 4–5
    definition, 4
    infinite, 55, 438
    uniform spanning tree, 218–226
type, see recurrence

uniform spanning trees, see trees
union bound, 36, 116
Varopoulos-Carne bound, see Markov chains
VC dimension, 108, 111