
Modern Discrete Probability

An Essential Toolkit

Sébastien Roch

December 20, 2023


To Betsy
Contents

Preface v

Notation ix

1 Introduction 1
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Review of graph theory . . . . . . . . . . . . . . . . . . . 2
1.1.2 Review of Markov chain theory . . . . . . . . . . . . . . 9
1.2 Some discrete probability models . . . . . . . . . . . . . . . . . . 18

2 Moments and tails 27


2.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.1.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.1.2 Basic inequalities . . . . . . . . . . . . . . . . . . . . . . 29
2.2 First moment method . . . . . . . . . . . . . . . . . . . . . . . . 33
2.2.1 The probabilistic method . . . . . . . . . . . . . . . . . . 33
2.2.2 Boole’s inequality . . . . . . . . . . . . . . . . . . . . . 36
2.2.3 . Random permutations: longest increasing subsequence . 38
2.2.4 . Percolation: existence of a non-trivial critical value on Z2 40
2.3 Second moment method . . . . . . . . . . . . . . . . . . . . . . 44
2.3.1 Paley-Zygmund inequality . . . . . . . . . . . . . . . . . 45
2.3.2 . Random graphs: subgraph containment and connectivity
in the Erdős-Rényi model . . . . . . . . . . . . . . . . . . 47
2.3.3 . Percolation: critical value on trees and branching number 55
2.4 Chernoff-Cramér method . . . . . . . . . . . . . . . . . . . . . . 63
2.4.1 Tail bounds via the moment-generating function . . . . . 64
2.4.2 Sub-Gaussian and sub-exponential random variables . . . 69
2.4.3 . Probabilistic analysis of algorithms: knapsack problem . 79
2.4.4 Epsilon-nets and chaining . . . . . . . . . . . . . . . . . 85

2.4.5 . Data science: Johnson-Lindenstrauss lemma and appli-
cation to compressed sensing . . . . . . . . . . . . . . . . 93
2.4.6 . Data science: classification, empirical risk minimization
and VC dimension . . . . . . . . . . . . . . . . . . . . . 102

3 Martingales and potentials 122


3.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
3.1.1 Stopping times . . . . . . . . . . . . . . . . . . . . . . . 123
3.1.2 . Markov chains: exponential tail of hitting times and some
cover time bounds . . . . . . . . . . . . . . . . . . . . . 130
3.1.3 Martingales . . . . . . . . . . . . . . . . . . . . . . . . . 133
3.1.4 . Percolation: critical regime on infinite d-regular tree . . 144
3.2 Concentration for martingales and applications . . . . . . . . . . 145
3.2.1 Azuma-Hoeffding inequality . . . . . . . . . . . . . . . . 145
3.2.2 Method of bounded differences . . . . . . . . . . . . . . 147
3.2.3 . Random graphs: exposure martingale and application to
the chromatic number in Erdős-Rényi model . . . . . . . . 158
3.2.4 . Random graphs: degree sequence of preferential attach-
ment graphs . . . . . . . . . . . . . . . . . . . . . . . . . 163
3.2.5 . Data science: stochastic bandits and the slicing method 170
3.2.6 Coda: Talagrand’s inequality . . . . . . . . . . . . . . . . 178
3.3 Potential theory and electrical networks . . . . . . . . . . . . . . 180
3.3.1 Martingales, the Dirichlet problem and Lyapounov functions 182
3.3.2 Basic electrical network theory . . . . . . . . . . . . . . . 192
3.3.3 Bounding the effective resistance via variational principles 202
3.3.4 . Random walks: Pólya’s theorem, two ways . . . . . . . 215
3.3.5 . Randomized algorithms: Wilson’s method for generating
uniform spanning trees . . . . . . . . . . . . . . . . . . . 218

4 Coupling 234
4.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
4.1.1 Basic definitions . . . . . . . . . . . . . . . . . . . . . . 235
4.1.2 . Random walks: harmonic functions on lattices and infi-
nite d-regular trees . . . . . . . . . . . . . . . . . . . . . 237
4.1.3 Total variation distance and coupling inequality . . . . . . 240
4.1.4 . Random graphs: degree sequence in Erdős-Rényi model 246
4.2 Stochastic domination . . . . . . . . . . . . . . . . . . . . . . . . 249
4.2.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . 250
4.2.2 . Ising model: boundary conditions . . . . . . . . . . . . 260
4.2.3 Correlation inequalities: FKG and Holley’s inequalities . . 263

4.2.4 . Random graphs: Janson’s inequality and application to
the clique number in the Erdős-Rényi model . . . . . . . . 269
4.2.5 . Percolation: RSW theory and a proof of Harris’ theorem 272
4.3 Coupling of Markov chains and application to mixing . . . . . . . 281
4.3.1 Bounding the mixing time via coupling . . . . . . . . . . 281
4.3.2 . Random walks: mixing on cycles, hypercubes, and trees 284
4.3.3 Path coupling . . . . . . . . . . . . . . . . . . . . . . . . 293
4.3.4 . Ising model: Glauber dynamics at high temperature . . 297
4.4 Chen-Stein method . . . . . . . . . . . . . . . . . . . . . . . . . 300
4.4.1 Main bounds and examples . . . . . . . . . . . . . . . . . 301
4.4.2 Some motivation and proof . . . . . . . . . . . . . . . . . 308
4.4.3 . Random graphs: clique number at the threshold in the
Erdős-Rényi model . . . . . . . . . . . . . . . . . . . . . 318

5 Spectral methods 327


5.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 327
5.1.1 Eigenvalues and their variational characterization . . . . . 328
5.1.2 Elements of spectral graph theory . . . . . . . . . . . . . 332
5.1.3 Perturbation results . . . . . . . . . . . . . . . . . . . . . 341
5.1.4 . Data science: community recovery . . . . . . . . . . . . 345
5.2 Spectral techniques for reversible Markov chains . . . . . . . . . 351
5.2.1 Spectral gap . . . . . . . . . . . . . . . . . . . . . . . . . 353
5.2.2 . Random walks: a spectral look at cycles and hypercubes 359
5.2.3 . Markov chains: Varopoulos-Carne and diameter-based
bounds on the mixing time . . . . . . . . . . . . . . . . . 365
5.2.4 . Randomized algorithms: Markov chain Monte Carlo and
a quantitative ergodic theorem . . . . . . . . . . . . . . . 370
5.2.5 Spectral radius . . . . . . . . . . . . . . . . . . . . . . . 375
5.3 Geometric bounds . . . . . . . . . . . . . . . . . . . . . . . . . . 380
5.3.1 Cheeger’s inequality . . . . . . . . . . . . . . . . . . . . 385
5.3.2 . Random walks: trees, cycles, and hypercubes revisited . 391
5.3.3 . Random graphs: existence of an expander family and
application to mixing . . . . . . . . . . . . . . . . . . . . 394
5.3.4 . Ising model: Glauber dynamics on complete graphs and
expanders . . . . . . . . . . . . . . . . . . . . . . . . . . 400
5.3.5 Congestion ratio . . . . . . . . . . . . . . . . . . . . . . 405

6 Branching processes 415


6.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 415
6.1.1 Basic definitions . . . . . . . . . . . . . . . . . . . . . . 416

6.1.2 Extinction . . . . . . . . . . . . . . . . . . . . . . . . . . 417
6.1.3 . Percolation: Galton-Watson trees . . . . . . . . . . . . 422
6.1.4 Multitype branching processes . . . . . . . . . . . . . . . 423
6.2 Random-walk representation . . . . . . . . . . . . . . . . . . . . 429
6.2.1 Exploration process . . . . . . . . . . . . . . . . . . . . . 429
6.2.2 Duality principle . . . . . . . . . . . . . . . . . . . . . . 432
6.2.3 Hitting-time theorem . . . . . . . . . . . . . . . . . . . . 433
6.2.4 . Percolation: critical exponents on the infinite b-ary tree . 438
6.3 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 442
6.3.1 . Probabilistic analysis of algorithms: binary search tree . 442
6.3.2 . Data science: the reconstruction problem, the Kesten-
Stigum bound and a phase transition in phylogenetics . . . 451
6.4 . Finale: the phase transition of the Erdős-Rényi model . . . . . . 466
6.4.1 Statement and proof sketch . . . . . . . . . . . . . . . . . 466
6.4.2 Bounding cluster size: domination by branching processes 468
6.4.3 Concentration of cluster size: second moment bounds . . 480
6.4.4 Critical case via martingales . . . . . . . . . . . . . . . . 486
6.4.5 . Encore: random walk on the Erdős-Rényi graph . . . . 490

A Useful combinatorial formulas 502

B Measure-theoretic foundations 503


B.1 Probability spaces . . . . . . . . . . . . . . . . . . . . . . . . . . 503
B.2 Random variables . . . . . . . . . . . . . . . . . . . . . . . . . . 506
B.3 Independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . 510
B.4 Expectation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 513
B.5 Fubini’s theorem . . . . . . . . . . . . . . . . . . . . . . . . . . 519
B.6 Conditional expectation . . . . . . . . . . . . . . . . . . . . . . . 522
B.7 Filtered spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . 528

Preface

This book arose from a set of lecture notes prepared for a one-semester topics
course I taught at the University of Wisconsin–Madison in 2014, 2017, 2020 and
2023, which attracted a wide spectrum of students in mathematics, computer
sciences, engineering, and statistics.

What is it about?
The purpose of the book is to provide a graduate-level introduction to discrete prob-
ability. Topics covered are drawn primarily from stochastic processes on graphs:
percolation, random graphs, Markov random fields, random walks on graphs, etc.
No attempt is made at covering these broad areas in depth. Rather, the emphasis
is on illustrating important techniques used to analyze such processes. Along the
way, many standard results regarding discrete probability models are worked out.
The “modern” in the title refers to the (non-exclusive) focus on nonasymptotic
methods and results, reflecting the impact of the theoretical computer science
literature on the trajectory of this field. In particular, several applications in
randomized algorithms, probabilistic analysis of algorithms, and theoretical
machine learning are used throughout to motivate the techniques described
(although, again, these areas are not covered exhaustively).
Of course the selection of topics is somewhat arbitrary and driven in part by
personal interests. But the choice was guided by a desire to introduce techniques
that are widely used across discrete probability and its applications. The material
discussed here is developed in much greater depth in the following (incomplete
list of) excellent textbooks and expository monographs, many of which influenced
various sections of this book:
- Agarwal, Jiang, Kakade, Sun. Reinforcement learning: Theory and algorithms.
[AJKS22]
- Aldous, Fill. Reversible Markov chains and random walks on graphs. [AF]
- Alon, Spencer. The Probabilistic Method. [AS11]

- Bollobás. Random graphs. [Bol01]
- Boucheron, Lugosi, Massart. Concentration Inequalities: A Nonasymptotic Theory
of Independence. [BLM13]
- Chung, Lu. Complex graphs and networks. [CL06]
- Durrett. Random Graph Dynamics. [Dur06]
- Frieze and Karoński. Introduction to random graphs. [FK16]
- Grimmett. Percolation. [Gri10b]
- Janson, Luczak, Rucinski. Random Graphs. [JLR11]
- Lattimore, Szepesvári. Bandit Algorithms. [LS20]
- Levin, Peres, Wilmer. Markov chains and mixing times. [LPW06]
- Lyons, Peres. Probability on trees and networks. [LP16]
- Mitzenmacher, Upfal. Probability and Computing: Randomized Algorithms and
Probabilistic Analysis. [MU05]
- Motwani, Raghavan. Randomized algorithms. [MR95]
- Rassoul-Agha, Seppäläinen. A course on large deviations with an introduction to
Gibbs measures. [RAS15]
- Shalev-Shwartz, Ben-David. Understanding Machine Learning: From
Theory to Algorithms. [SSBD14]
- van Handel. Probability in high dimension. [vH16]
- van der Hofstad. Random graphs and complex networks. Vol. 1. [vdH17]
- Vershynin. High-Dimensional Probability: An Introduction with Applications in
Data Science. [Ver18]

In fact the book is meant as a first foray into the basic results and/or toolkits detailed
in these more specialized references. My hope is that, by the end, the reader will
have picked up sufficient fundamental background to learn advanced material on
their own with some ease. I should add that I used many additional helpful sources;
they are acknowledged in the “Bibliographic remarks” at the end of each chapter.
It is impossible to cover everything. Some notable omissions include, e.g., graph
limits [Lov12], influence [KS05], and group-theoretic methods [Dia88], among
others.
Much of the material covered here (and more) can also be found in [HMRAR98],
[Gri10a], and [Bre17] with a different emphasis and scope.

Prerequisites
It is assumed throughout that the reader is fluent in undergraduate linear algebra,
for example, at the level of [Axl15], and basic real analysis, for example, at the
level of [Mor05].

In addition, it is recommended that the reader has taken at least one semester of
graduate probability at the level of [Dur10]. I am also particularly fond of [Wil91],
which heavily influenced the appendix where measure-theoretic background is
reviewed. Some familiarity with countable Markov chain theory is necessary, as
covered for instance in [Dur10, Chapter 6]. An advanced undergraduate or
Masters-level treatment such as [Dur12], [Nor98], [GS20], [Law06] or [Bre20] will
suffice, however.

Organization
The book is organized around five major “tools.” The reader will have likely
encountered those tools in prior probability courses. The goal here is to develop them
further, specifically with their application to discrete random structures in mind,
and to illustrate them in this setting on a variety of major, classical results and
applications.
In the interest of keeping the book relatively self-contained and serving the
widest spectrum of readers, each chapter begins with a “background” section which
reviews the basic material on which the rest of the chapter builds. The remaining
sections then proceed to expand on two or three important specializations of the
tools. While the chapters are meant to be somewhat modular, results from previous
chapters do occasionally make an appearance.
The techniques are illustrated throughout with simple examples first, and then
with more substantial ones in separate sections marked with the symbol . . I have
attempted to provide applications from many areas of discrete probability and
theoretical computer science, although some techniques are better suited for certain
types of models or questions. The examples and applications are important: many
of the tools are quite straightforward (or even elementary), and it is only when seen
in action that their full power can be appreciated. Moreover, the . sections serve as
an excuse to introduce the reader to classical results and important applications,
beyond their reliance on specific tools.
Chapter 1 introduces some of the main models from probability on graphs that
we come back to repeatedly throughout the book. It begins with a brief review of
graph theory and Markov chain theory.
Chapter 2 starts out with the probabilistic method, including the first moment
principle and second moment method, and then it moves on to concentration
inequalities for sums of independent random variables, mostly sub-Gaussian and
sub-exponential variables. It also discusses techniques to analyze the suprema of
random processes.
Chapter 3 turns to martingales. The first main topic there is the Azuma-Hoeffding
inequality and the method of bounded differences with applications to
random graphs and stochastic bandit problems. The second main topic is electrical
network theory for random walks on graphs.
Chapter 4 introduces coupling. It covers stochastic domination and correlation
inequalities as well as couplings of Markov chains with applications to mixing. It
also discusses the Chen-Stein method for Poisson approximation.
Chapter 5 is concerned with spectral methods. A major topic there is the use
of the spectral theorem and geometric bounds on the spectral gap to control the
mixing time of a reversible Markov chain. The chapter also introduces spectral
methods for community recovery in network analysis.
Chapter 6 ends the book with applications of branching processes. Among
other applications, an introduction to the reconstruction problem on trees is
provided. The final section gives a detailed analysis of the phase transition of the
Erdős-Rényi graph, where techniques from all chapters of the book are brought to
bear.

Acknowledgments
The lecture notes on which this book is based were influenced by graduate courses
of David Aldous, Steve Evans, Elchanan Mossel, Yuval Peres, Alistair Sinclair, and
Luca Trevisan at UC Berkeley, where I learned much of this material. In particular
scribe notes for some of these courses helped shape early iterations of this book.
I have also learned a lot over the years from my collaborators and mentors as
well as my former and current Ph.D. students and postdocs. I am particularly
grateful to Elchanan Mossel and Allan Sly for their encouragement to finish this
project, and to the UW-Madison students who have taken the various iterations of
the course that inspired the book, for their invaluable feedback.
Warm thanks to everyone in the departments of mathematics at UCLA and UW-
Madison who have provided the stimulating environments that made this project
possible. Beyond my current department, I am particularly indebted to my
colleagues in the NSF-funded Institute for Foundations of Data Science (IFDS) who
have significantly expanded my knowledge of applications of this material in
machine learning and statistics.
Finally, I thank my parents, my wife and my son for their love, patience and
support.

Notation

Throughout the book, we will use the following notation.


• The real numbers are denoted by R, the nonnegative reals are denoted by
R+, the integers are denoted by Z, the nonnegative integers are denoted by
Z+, the natural numbers (i.e., positive integers) are denoted by N and the
rational numbers are denoted by Q. We will also use the notation Z̄+ :=
{0, 1, . . . , +∞} for the extended nonnegative integers.
• For two reals a, b ∈ R,

a ∧ b := min{a, b}, a ∨ b := max{a, b},

and
a+ = 0 ∨ a, a− = 0 ∨ (−a).

• For a real a, ⌊a⌋ is the largest integer that is smaller than or equal to a and
⌈a⌉ is the smallest integer that is larger than or equal to a.
• For x ∈ R, the natural (i.e., base e) logarithm of x is denoted by log x. We
also let exp(x) = e^x.
• For a positive integer n ∈ N, we let

[n] := {1, . . . , n}.

• Let A be a set. The cardinality of A is denoted by |A|. The power set of A,
i.e., the collection of all of its subsets, is denoted by 2^A.
• For two sets A, B, their cartesian product is denoted by A × B.
• We will use the following notation for standard vectors: 0 is the all-zero
vector, 1 is the all-one vector, and ei is the standard basis vector with a one
in coordinate i and a zero elsewhere. In each case, the dimension is implicit,
as well as whether it is a row or column vector.

• For a vector u = (u1, . . . , un) ∈ R^n and real p > 0, its p-norm (or ℓp-norm) is

\|u\|_p := \left( \sum_{i=1}^{n} |u_i|^p \right)^{1/p}.

When p = +∞, we have

\|u\|_\infty := \max_i |u_i|.

We also use the notation \|u\|_0 to denote the number of nonzero coordinates
of u (although it is not a norm; see Exercise 1.1). For two vectors u =
(u1, . . . , un), v = (v1, . . . , vn) ∈ R^n, their inner product is

\langle u, v \rangle := \sum_{i=1}^{n} u_i v_i.

The same notations apply to row vectors.
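For readers who like to experiment, the norms and inner product above translate directly into code. The following Python sketch is an illustrative aside added here, not part of the text, and the helper names are ours:

```python
# Illustrative sketch: p-norms, the infinity-"norm", the zero-"norm",
# and the inner product for vectors given as Python lists.
def p_norm(u, p):
    # ||u||_p = (sum_i |u_i|^p)^(1/p), for p > 0
    return sum(abs(x) ** p for x in u) ** (1.0 / p)

def inf_norm(u):
    # ||u||_inf = max_i |u_i|
    return max(abs(x) for x in u)

def zero_norm(u):
    # ||u||_0 = number of nonzero coordinates (not a true norm)
    return sum(1 for x in u if x != 0)

def inner(u, v):
    # <u, v> = sum_i u_i v_i
    return sum(x * y for x, y in zip(u, v))
```

For instance, p_norm([3, 4], 2) returns 5.0, the Euclidean length of (3, 4).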

• For a matrix A, we denote the entries of A by A(i, j), A_{i,j}, or A_{ij}. The i-th
row of A is denoted by A(i, ·) or A_{i,·}. The j-th column of A is denoted by
A(·, j) or A_{·,j}. The transpose of A is A^T.

• For a vector z = (z1, . . . , zd), we let diag(z) be the diagonal matrix with
diagonal entries z1, . . . , zd.

• The binomial coefficients are defined as

\binom{n}{k} = \frac{n!}{k!(n-k)!},

where k, n ∈ N with k ≤ n and n! = 1 × 2 × · · · × n is the factorial of n.
Some standard approximations for \binom{n}{k} and n! are listed in Appendix A. See
also Exercises 1.2, 1.3, and 1.4.
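As a quick sanity check (an illustrative aside, not from the text), the factorial definition above can be compared against Python’s built-in math.comb:

```python
import math

# Binomial coefficient via the factorial definition:
# C(n, k) = n! / (k! (n - k)!), for 0 <= k <= n.
def binom(n, k):
    return math.factorial(n) // (math.factorial(k) * math.factorial(n - k))
```

For example, binom(5, 2) gives 10, in agreement with math.comb(5, 2).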

• We use the abbreviation a.s. for “almost surely,” that is, with probability 1.
We use “w.p.” for “with probability.”

• Convergence in probability is denoted by →_p. Convergence in distribution is
denoted by →_d.

• For a random variable X and a probability distribution µ, we write X ∼ µ to
indicate that X has distribution µ. We write X =_d Y if the random variables
X and Y have the same distribution.

• For an event A, the random variable 1A is the indicator of A, that is, it is 1
if A occurs and 0 otherwise. We also use 1{A}.

• For probability measures µ, ν on a countable set S, their total variation
distance is

\|\mu - \nu\|_{\mathrm{TV}} := \sup_{A \subseteq S} |\mu(A) - \nu(A)|.
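For distributions on a finite set, the supremum in the definition of total variation distance is achieved at A = {x : µ(x) > ν(x)}, which yields the identity ‖µ − ν‖TV = (1/2) Σ_x |µ(x) − ν(x)|. A minimal Python sketch of this identity (an illustrative aside, not part of the text; distributions are represented as dicts):

```python
# Total variation distance between two distributions mu, nu on a finite set,
# given as dicts mapping points to probabilities (missing points carry mass 0).
# Uses ||mu - nu||_TV = (1/2) * sum_x |mu(x) - nu(x)|.
def tv_distance(mu, nu):
    support = set(mu) | set(nu)
    return 0.5 * sum(abs(mu.get(x, 0.0) - nu.get(x, 0.0)) for x in support)
```

For instance, tv_distance({'a': 0.5, 'b': 0.5}, {'a': 1.0}) returns 0.5.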
• For nonnegative functions f(n), g(n) of n ∈ Z+ we write f(n) = O(g(n))
if there exists a constant C > 0 such that f(n) ≤ C g(n) for all
n large enough. Similarly, f(n) = Ω(g(n)) means that f(n) ≥ c g(n) for
some constant c > 0 for all n large enough. The notation f(n) = Θ(g(n))
indicates that both f(n) = O(g(n)) and f(n) = Ω(g(n)) hold. We also
write f(n) = o(g(n)) or g(n) = ω(f(n)) or f(n) ≪ g(n) or g(n) ≫ f(n)
if f(n)/g(n) → 0 as n → +∞. If f(n)/g(n) → 1 we write f(n) ∼ g(n).
The same notations are used for functions of a real variable x as x → +∞.

Chapter 1

Introduction

In this chapter we describe a few discrete probability models to which we will
come back repeatedly throughout the book. While there exists a vast array of
well-studied random combinatorial structures (permutations, partitions, urn
models, Boolean functions, polytopes, etc.), our focus is primarily on a limited number
of graph-based processes, namely percolation, random graphs, the Ising model,
and random walks on networks. We will not attempt to derive the theory of these
models exhaustively here. Instead we will employ them to illustrate some essential
techniques from discrete probability. Note that the toolkit developed in this book
is meant to apply to other probabilistic models of interest as well, and in fact many
more will be encountered along the way. After a brief review of graph basics and
Markov chain theory in Section 1.1, we formally introduce our main models in
Section 1.2. We also formulate various questions about these models that will be
answered (at least partially) later on. We assume that the reader is familiar with the
measure-theoretic foundations of probability. A refresher of all required concepts
and results is provided in Appendix B.

1.1 Background
We start with a brief review of graph terminology and standard countable-space
Markov chain results.
Version: December 20, 2023
Modern Discrete Probability: An Essential Toolkit
Copyright © 2023 Sébastien Roch

CHAPTER 1. INTRODUCTION 2

Figure 1.1: Petersen graph.

1.1.1 Review of graph theory


Basic definitions An undirected graph (or graph for short) is a pair G = (V, E)
where V is the set of vertices (or nodes or sites) and

E ⊆ {{u, v} : u, v ∈ V }

is the set of edges (or bonds). See Figure 1.1 for an example. We occasionally write
V (G) and E(G) for the vertices and edges of the graph G. The set of vertices V
is either finite or countably infinite. Edges of the form {u} are called self-loops. In
general, we do not allow E to be a multiset unless otherwise stated. But, when E
is a multiset, G is called a multigraph.
A vertex v ∈ V is incident with an edge e ∈ E (or vice versa) if v ∈ e. The
incident vertices of an edge are called endvertices. Two vertices u, v ∈ V are
adjacent (or neighbors), denoted by u ∼ v, if {u, v} ∈ E. The set of adjacent
vertices of v, denoted by N (v), is called the neighborhood of v and its size, that is,
δ(v) := |N (v)|, is the degree of v. A vertex v with δ(v) = 0 is called isolated. A
graph is called d-regular if all its degrees are d. A countable graph is locally finite
if all its vertices have a finite degree.

Example 1.1.1 (Petersen graph). All vertices in the Petersen graph in Figure 1.1
have degree 3, that is, it is a 3-regular graph. In particular it has no isolated vertex.
J

A convenient (and mathematically useful) way to specify a graph is the following
matrix representation. Assume the graph G = (V, E) has n = |V | vertices,
numbered 1, . . . , n. The adjacency matrix A of G is the n × n symmetric matrix
defined as

A_{x,y} = \begin{cases} 1 & \text{if } \{x, y\} \in E, \\ 0 & \text{otherwise.} \end{cases}

Example 1.1.2 (Triangle). The adjacency matrix of a triangle, that is, a 3-vertex
graph with all possible non-loop edges, is

A = \begin{pmatrix} 0 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}.
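The adjacency matrix is straightforward to build programmatically. Here is an illustrative Python sketch (an aside added for concreteness, not part of the text; vertex i is stored at row and column i − 1):

```python
# Illustrative sketch: adjacency matrix of an undirected graph on
# vertices 1, ..., n, with edges given as pairs (x, y).
def adjacency_matrix(n, edges):
    A = [[0] * n for _ in range(n)]
    for x, y in edges:
        A[x - 1][y - 1] = 1
        A[y - 1][x - 1] = 1  # undirected: A is symmetric
    return A
```

Applied to the triangle of Example 1.1.2, adjacency_matrix(3, [(1, 2), (1, 3), (2, 3)]) reproduces the matrix displayed above.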

There exist other matrix representations. Here is one. Let m = |E| and assume
that the edges are labeled arbitrarily as e1, . . . , em. The incidence matrix of an
undirected graph G = (V, E) is the n × m matrix B such that B_{i,j} = 1 if vertex i
and edge e_j are incident and 0 otherwise.
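In the same illustrative spirit (again a sketch, not from the text), the incidence matrix can be built as follows, with edge e_{j+1} stored in column j:

```python
# Illustrative sketch: incidence matrix of an undirected graph on
# vertices 1, ..., n, with B[i][j] = 1 iff vertex i + 1 is an
# endvertex of the (j + 1)-st edge.
def incidence_matrix(n, edges):
    B = [[0] * len(edges) for _ in range(n)]
    for j, (x, y) in enumerate(edges):
        B[x - 1][j] = 1
        B[y - 1][j] = 1
    return B
```

Note that each column sums to 2, one for each endvertex of the corresponding edge.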

Subgraphs, paths, and cycles A subgraph of G = (V, E) is a graph G′ =
(V′, E′) with V′ ⊆ V and E′ ⊆ E. Implied in this definition is the fact that the
edges in E′ are incident with V′ only. The subgraph G′ is said to be induced if

E′ = {{x, y} : x, y ∈ V′, {x, y} ∈ E},

that is, it contains exactly those edges of G that are between vertices in V′. In
that case the notation G′ := G[V′] is used. A subgraph is said to be spanning if
V′ = V . A subgraph containing all possible non-loop edges between its vertices
is called a clique (or complete subgraph). A clique with k nodes is referred to as a
k-clique.

Example 1.1.3 (Petersen graph (continued)). The Petersen graph contains no tri-
angle, that is, 3-clique, induced or not. J

A walk in G is a sequence of (not necessarily distinct) vertices x0 ∼ x1 ∼
· · · ∼ xk. Note the requirement that consecutive vertices of a walk are adjacent.
The number k ≥ 0 is the length of the walk. If the endvertices x0 , xk coincide,
that is, x0 = xk , we refer to the walk as closed. If the vertices of a walk are all
distinct, we call it a path (or self-avoiding walk). If the vertices of a closed walk
are all distinct except for the endvertices and its length is at least 3, we call it a
cycle. A path or cycle can be seen as a (not necessarily induced) subgraph of G.
The length of the shortest path connecting two distinct vertices u, v is the graph
distance between u and v, denoted by dG(u, v). It can be checked that the graph
distance is a metric (and that, in particular, it satisfies the triangle inequality; see
Exercise 1.6). The minimum length of a cycle in a graph is its girth.
We write u ↔ v if there is a path between u and v. It can be checked that the
binary relation ↔ is an equivalence relation (i.e., it is reflexive, symmetric and
transitive; see Exercise 1.6). Its equivalence classes are called connected components.
A graph is connected if any two vertices are linked by a path, that is, if u ↔ v for
all u, v ∈ V; put differently, if there is only one connected component.
Example 1.1.4 (Petersen graph (continued)). The Petersen graph is connected. J
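Graph distances, and hence connected components, can be computed by breadth-first search. The sketch below is an illustrative aside (not part of the text); adjacency is given as a dict of neighbor sets, and the function returns dG(u, v), or None when u and v lie in different connected components:

```python
from collections import deque

# Graph distance d_G(u, v) via breadth-first search.
def graph_distance(adj, u, v):
    dist = {u: 0}
    queue = deque([u])
    while queue:
        x = queue.popleft()
        if x == v:
            return dist[x]
        for y in adj[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    return None  # u and v are in different connected components
```

On a 4-cycle with vertices 0, 1, 2, 3, for example, graph_distance gives d(0, 2) = 2.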

Trees A forest is a graph with no cycle, or acyclic graph. A tree is a connected
forest. Vertices of degree 1 are called leaves. A spanning tree of G is a subgraph
which is a tree and is also spanning. A tree is said to be rooted if it has a single
distinguished vertex called the root.
Trees will play a key role and we collect several important facts about them
(mostly without proofs). The following characterizations of trees will be useful.
The proof is left as an exercise (see Exercise 1.8). We write G + e (respectively
G − e) to indicate the graph G with edge e added (respectively removed).
Theorem 1.1.5 (Trees: characterizations). The following are equivalent.
(i) The graph T is a tree.

(ii) For any two vertices in T , there is a unique path between them.

(iii) The graph T is connected, but T − e is not for any edge e in T.

(iv) The graph T is acyclic, but T + {x, y} is not for any pair of non-adjacent
vertices x, y.
Here are two important implications.
Corollary 1.1.6. If G is connected, then it has at least one spanning tree.
Proof. Indeed, from Theorem 1.1.5, a graph is a tree if and only if it is minimally
connected, in the sense that removing any of its edges disconnects it. So a spanning
tree can be obtained by removing edges of G that do not disconnect it until it is not
possible anymore.

The following characterization is proved in Exercise 1.7.


Corollary 1.1.7. A connected graph with n vertices is a tree if and only if it has
n − 1 edges.

And here is a related fact.

Corollary 1.1.8. Let G be a graph with n vertices. If an acyclic subgraph H has
n vertices and n − 1 edges, then it is a spanning tree of G.

Proof. If H is not connected, then it has at least two connected components. Each
of them is acyclic and therefore a tree. By applying Corollary 1.1.7 to the connected
components and summing up, we see that the total number of edges in H is ≤ n−2,
a contradiction. So H is connected and therefore a spanning tree.

Finally, a classical formula:

Theorem 1.1.9 (Cayley’s formula). There are k^{k−2} trees on a set of k labeled
vertices.

We give a proof of Cayley’s formula based on branching processes in Exercise 6.19.
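Cayley's formula can also be verified by brute force for small k (an illustrative aside, not from the text): enumerate all subsets of k − 1 edges of the complete graph on {0, . . . , k − 1} and, using Corollary 1.1.7, count those that are connected.

```python
from itertools import combinations

def is_connected(k, edges):
    # Depth-first search from vertex 0 over the given edge set.
    adj = {v: set() for v in range(k)}
    for x, y in edges:
        adj[x].add(y)
        adj[y].add(x)
    seen, stack = {0}, [0]
    while stack:
        for y in adj[stack.pop()]:
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return len(seen) == k

def count_labeled_trees(k):
    # A connected graph on k vertices with k - 1 edges is a tree
    # (Corollary 1.1.7), so it suffices to count connected edge subsets.
    all_edges = list(combinations(range(k), 2))
    return sum(1 for E in combinations(all_edges, k - 1) if is_connected(k, E))
```

For k = 4 this returns 16 = 4^{4−2}, in agreement with Theorem 1.1.9. (The enumeration is feasible only for small k, since the number of edge subsets grows very quickly.)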

Some standard graphs Here are a few more examples of finite graphs.

- Complete graph Kn: This graph is made of n vertices with all non-loop
edges.

- Cycle graph Cn (or n-cycle): The vertex set is {0, 1, . . . , n − 1} and two
vertices i ≠ j are adjacent if and only if |i − j| = 1 or n − 1.

- Torus L^d_n: The vertex set is {0, 1, . . . , n − 1}^d and two vertices x ≠ y are
adjacent if and only if there is a coordinate i such that |x_i − y_i| = 1 or n − 1
and all other coordinates j ≠ i satisfy x_j = y_j.

- Hypercube Z^n_2 (or n-dimensional hypercube): The vertex set is {0, 1}^n and
two vertices x ≠ y are adjacent if and only if ‖x − y‖_1 = 1.

- Rooted b-ary tree T̂^ℓ_b: This graph is a tree with ℓ levels. The unique vertex on
level 0 is called the root. For j = 1, . . . , ℓ − 1, level j has b^j vertices, each
of which has exactly one neighbor on level j − 1 (its parent) and b neighbors
on level j + 1 (its children). The b^ℓ vertices on level ℓ are leaves.

Here are a few examples of infinite graphs, that is, graphs with a countably infinite
number of vertices and edges.

- Infinite d-regular tree Td: This is an infinite tree where each vertex has
exactly d neighbors. The rooted version, that is, T̂^ℓ_b with ℓ = +∞ levels, is
denoted by T̂_b.

- Lattice L^d: The vertex set is Z^d and two vertices x ≠ y are adjacent if and
only if ‖x − y‖_1 = 1.

A bipartite graph G = (L ∪ R, E) is a graph whose vertex set is composed of
the union of two disjoint sets L, R and whose edge set E is a subset of {{ℓ, r} :
ℓ ∈ L, r ∈ R}. That is, there is no edge between vertices in L, and likewise for R.

Example 1.1.10 (Some bipartite graphs). The cycle graph C_{2n} is a bipartite graph.
So is the complete bipartite graph K_{n,m} with vertex set {ℓ_1, . . . , ℓ_n} ∪ {r_1, . . . , r_m}
and edge set {{ℓ_i, r_j} : i ∈ [n], j ∈ [m]}. J

In a bipartite graph G = (L ∪ R, E), a perfect matching is a collection of edges
M ⊆ E such that each vertex is incident with exactly one edge in M.

An automorphism of a graph G = (V, E) is a bijection φ of V to itself that preserves
the edges, that is, such that {x, y} ∈ E if and only if {φ(x), φ(y)} ∈ E. A
graph G = (V, E) is vertex-transitive if for any u, v ∈ V there is an automorphism
mapping u to v.

Example 1.1.11 (Petersen graph (continued)). For any ℓ ∈ Z, a (2πℓ/5)-rotation
of the planar representation of the Petersen graph in Figure 1.1 corresponds to an
automorphism. J

Example 1.1.12 (Trees). The graph T_d is vertex-transitive. The graph T̂_b, on the
other hand, has many automorphisms but is not vertex-transitive. J

Flows Let G = (V, E) be a connected graph with two distinguished disjoint
vertex sets, a source-set (or source for short) A ⊆ V and a sink-set (or sink for
short) Z. Let κ : E → R_+ be a capacity function.

Definition 1.1.13 (Flow). A flow from source A to sink Z is a function f : V × V →
R such that:

F1 (Antisymmetry) f(x, y) = −f(y, x), ∀x, y ∈ V.

F2 (Capacity constraint) |f(x, y)| ≤ κ(e), ∀e = {x, y} ∈ E, and f(x, y) = 0
otherwise.

F3 (Flow-conservation constraint)

    ∑_{y : y ∼ x} f(x, y) = 0, ∀x ∈ V \ (A ∪ Z).

For U, W ⊆ V, let f(U, W) := ∑_{u∈U, w∈W} f(u, w). The strength of f is ‖f‖ :=
f(A, A^c).

One useful consequence of antisymmetry is that, for any U ⊆ V, we have
f(U, U) = 0 since each distinct pair x ≠ y ∈ U appears exactly twice in the sum,
once in each ordering. Also if W_1 and W_2 are disjoint, then f(U, W_1 ∪ W_2) =
f(U, W_1) + f(U, W_2). In particular, combining both observations, f(U, W) =
f(U, W \ U) = −f(W \ U, U).

For F ⊆ E, let κ(F) := ∑_{e∈F} κ(e). We call F a cutset separating A and
Z (or cutset for short) if all paths connecting A and Z include an edge in F. For
such an F, let A_F be the set of vertices not separated from A by F, that is, vertices
from which there is a path to A not crossing an edge in F. Clearly A ⊆ A_F but
A_F ∩ Z = ∅.

Lemma 1.1.14 (Max flow ≤ min cut). For any flow f and cutset F,

    ‖f‖ = f(A_F, A_F^c) ≤ ∑_{{x,y}∈F} |f(x, y)| ≤ κ(F).    (1.1.1)

Proof. Since F is a cutset, (A_F \ A) ∩ (A ∪ Z) = ∅. So, by (F3),

    f(A, A^c) = f(A, A^c) + ∑_{u∈A_F\A} f(u, V)
              = f(A, A_F \ A) + f(A, A_F^c)
                + f(A_F \ A, A_F) + f(A_F \ A, A_F^c)
              = f(A, A_F \ A) + f(A, A_F^c)
                + f(A_F \ A, A) + f(A_F \ A, A_F^c)
              = f(A_F, A_F^c)
              ≤ ∑_{{x,y}∈F} |f(x, y)|,

where we used (F1) twice. The last line is justified by the fact that the edges
between a vertex in A_F and a vertex in A_F^c have to be in F by definition of A_F.
That proves the first inequality in the claim. Condition (F2) implies the second
one.

Remarkably, this bound is tight, in the following sense.

Theorem 1.1.15 (Max-flow min-cut theorem). Let G be a finite connected graph
with source A and sink Z, and let κ be a capacity function. Then the following
holds:

    sup{‖f‖ : flow f} = min{κ(F) : cutset F}.

Proof. Note that, by compactness, the supremum on the left-hand side is achieved.
Let f be an optimal flow. The idea of the proof is to construct a "matching" cutset.
An augmentable path is a path x_0 ∼ · · · ∼ x_k with x_0 ∈ A, x_i ∉ A ∪ Z for
all i ≠ 0, k, and f(x_{i−1}, x_i) < κ({x_{i−1}, x_i}) for all i = 1, . . . , k. By default, each
vertex in A is an augmentable path. Moreover, by the optimality of f there cannot
be an augmentable path with x_k ∈ Z. Indeed, otherwise, we could "push more
flow through that path" and increase the strength of f, a contradiction.

Let B ⊆ V be the set of all final vertices in some augmentable path and let F
be the edge set between B and B^c := V \ B. Note that, again by contradiction, all
vertices in B can be reached from A without crossing F and that f(x, y) = κ(e)
for all e = {x, y} ∈ F with x ∈ B and y ∈ B^c. Furthermore F is a cutset
separating A from Z: trivially A ⊆ B; Z ⊆ B^c as argued above; and any path
from A to Z must exit B and enter B^c through an edge in F. Thus A_F = B and
we have equality in (1.1.1). That concludes the proof.
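The theorem can be checked numerically on small examples. The following is a minimal sketch (the 4-vertex network and all names are ours): an augmenting-path computation of the maximum flow strength, Edmonds-Karp style, compared against a brute-force minimum cutset, taking A = {0}, Z = {3}, and representing each undirected edge by two arcs.

```python
from collections import deque
from itertools import combinations

def max_flow(n, cap, s, t):
    """Edmonds-Karp: repeatedly push flow along a shortest augmenting
    path of the residual graph; cap maps arcs (u, v) to capacities."""
    res = dict(cap)  # residual capacities (local copy)
    flow = 0
    while True:
        parent = {s: None}
        q = deque([s])
        while q and t not in parent:  # BFS for an augmenting path
            u = q.popleft()
            for v in range(n):
                if v not in parent and res.get((u, v), 0) > 0:
                    parent[v] = u
                    q.append(v)
        if t not in parent:
            return flow
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        b = min(res[(u, v)] for u, v in path)  # bottleneck capacity
        for u, v in path:
            res[(u, v)] -= b
            res[(v, u)] = res.get((v, u), 0) + b
        flow += b

def min_cut(n, cap, s, t):
    """Brute force over vertex sets B with s in B and t not in B;
    the cut value is the total capacity of arcs leaving B."""
    best = float("inf")
    others = [v for v in range(n) if v not in (s, t)]
    for r in range(len(others) + 1):
        for extra in combinations(others, r):
            B = {s} | set(extra)
            best = min(best, sum(c for (u, v), c in cap.items()
                                 if u in B and v not in B))
    return best

# A made-up 4-vertex undirected network; each edge becomes two arcs.
edges = {(0, 1): 3, (0, 2): 2, (1, 2): 1, (1, 3): 2, (2, 3): 3}
cap = {}
for (u, v), c in edges.items():
    cap[(u, v)] = c
    cap[(v, u)] = c
print(max_flow(4, cap, 0, 3), min_cut(4, cap, 0, 3))  # 5 5
```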

Colorings, independent sets, and matchings A coloring of a graph G = (V, E)
is an assignment of colors to each vertex in G. In a coloring, two vertices may share
the same color. A coloring is proper if for every edge e in G the endvertices of e
have distinct colors. The smallest number of colors in a proper coloring of a graph
G is called the chromatic number χ(G) of G.

An independent vertex set (or independent set for short) of G = (V, E) is a
subset of vertices W ⊆ V such that all pairs of vertices in W are non-adjacent.
Likewise, two edges are independent if they are not incident with the same vertex.
A matching is a set of pairwise independent edges. A matching F is perfect if
every vertex in G is incident with an edge of F.
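On small graphs, the chromatic number can be computed by brute force directly from the definition; a minimal sketch (the 5-cycle example and the function name are illustrative):

```python
import itertools

def chromatic_number(V, E):
    """Brute-force chi(G): the smallest k admitting a proper coloring,
    found by trying all assignments of k colors to the vertices."""
    for k in range(1, len(V) + 1):
        for col in itertools.product(range(k), repeat=len(V)):
            if all(col[u] != col[v] for u, v in E):
                return k

# The 5-cycle is an odd cycle, hence not bipartite: chi = 3.
V = list(range(5))
E = [(i, (i + 1) % 5) for i in range(5)]
print(chromatic_number(V, E))  # 3
```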

Edge-weighted graphs We refer to an edge-weighted graph G = (V, E, w) as
a network. Here w : E → R_+ is a function that assigns positive real weights to
the edges. Definitions can be generalized naturally. In particular, one defines the
degree of a vertex i as

    δ(i) = ∑_{j : e={i,j}∈E} w_e.

The adjacency matrix A of G is the n × n symmetric matrix defined as

    A_{i,j} = w_e   if e = {i, j} ∈ E,   and   A_{i,j} = 0   otherwise,

where we denote the vertices {1, . . . , n}.



Directed graphs A directed graph (or digraph for short) is a pair G = (V, E)
where V is a set of vertices (or nodes or sites) and E ⊆ V² is a set of directed edges
(or arcs). A directed edge from x to y is typically denoted by (x, y), or occasionally
by ⟨x, y⟩. A directed path is a sequence of vertices x_0, . . . , x_k, all distinct, with
(x_{i−1}, x_i) ∈ E for all i = 1, . . . , k. We write u → v if there is such a directed
path with x_0 = u and x_k = v. We say that u, v ∈ V communicate, denoted by
u ↔ v, if u → v and v → u. In particular, we always have u ↔ u for every state
u. The binary relation ↔ is an equivalence relation (see Exercise 1.6). The
equivalence classes of ↔ are called the strongly connected components of G.

The following definition will prove useful.

Definition 1.1.16 (Oriented incidence matrix). Let G = (V, E) be an undirected
graph. Assume that the vertices of G = (V, E) are numbered 1, . . . , |V| and that
the edges are labeled arbitrarily as e_1, . . . , e_{|E|}. An orientation of G is the choice
of a direction ~e_i for each edge e_i, turning it into a digraph ~G. An oriented incidence
matrix of G is the incidence matrix of an orientation, that is, the matrix B such
that B_{ij} = −1 if edge ~e_j leaves vertex i, B_{ij} = 1 if edge ~e_j enters vertex i, and 0
otherwise.

1.1.2 Review of Markov chain theory


Informally, a Markov chain (or Markov process) is a time-indexed stochastic
process satisfying the property: conditioned on the present, the future is independent
of the past. We restrict ourselves to the discrete-time, time-homogeneous,
countable-space case, where such a process is characterized by its initial distribution
and a transition matrix.

Construction of a Markov chain For our purposes, it will suffice to "define"
a Markov chain through a particular construction. Let V be a finite or countable
space. Recall that a stochastic matrix on V is a nonnegative matrix P =
(P(i, j))_{i,j∈V} satisfying

    ∑_{j∈V} P(i, j) = 1, ∀i ∈ V.

We think of P(i, ·) as a probability distribution on V. In particular, for a set of
states A ⊆ V, we let

    P(i, A) = ∑_{j∈A} P(i, j).
j∈A

Let µ be a probability measure on V and let P be a stochastic matrix on V.
One way to construct a Markov chain (X_t)_{t≥0} on V with transition matrix P and
initial distribution µ is the following:

- Pick X_0 ∼ µ and let (Y(i, n))_{i∈V, n≥1} be a mutually independent array of
random variables with Y(i, n) ∼ P(i, ·).

- Set inductively X_n := Y(X_{n−1}, n), n ≥ 1.

So in particular:

    P[X_0 = x_0, . . . , X_t = x_t] = µ(x_0) P(x_0, x_1) · · · P(x_{t−1}, x_t).

We use the notation P_x, E_x for the probability distribution and expectation under
the chain started at x. Similarly for P_µ, E_µ where µ is a probability distribution.
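The construction above translates directly into a simulation routine. A minimal sketch (the two-state chain at the end is made up for illustration):

```python
import random

def simulate_chain(mu, P, T, rng):
    """Sample X_0 ~ mu, then X_n = Y(X_{n-1}, n) with Y(i, .) ~ P(i, .),
    for n = 1, ..., T; mu and the rows of P are dicts of probabilities."""
    states = list(mu)
    x = rng.choices(states, weights=[mu[s] for s in states])[0]
    path = [x]
    for _ in range(T):
        x = rng.choices(states, weights=[P[x][s] for s in states])[0]
        path.append(x)
    return path

# Two-state chain: stays put with prob. 0.9, switches with prob. 0.1.
P = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.1, 1: 0.9}}
mu = {0: 1.0, 1: 0.0}  # start deterministically at state 0
path = simulate_chain(mu, P, 20, random.Random(0))
print(path)
```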
Example 1.1.17 (Simple random walk on a graph). Let G = (V, E) be a finite or
infinite, locally finite graph. Simple random walk on G is the Markov chain on V,
started at an arbitrary vertex, which at each time picks a uniformly chosen neighbor
of the current state. (Exercise 1.9 asks for the transition matrix.) J
Markov property Let (X_t)_{t≥0} be a Markov chain (or chain for short) with
transition matrix P and initial distribution µ. Define the filtration (F_t)_{t≥0} with
F_t = σ(X_0, . . . , X_t) (see Appendix B). As mentioned above, the defining property
of Markov chains, known as the Markov property, is that: given the present, the
future is independent of the past. In its simplest form, that can be interpreted as
P[X_{t+1} = y | F_t] = P_{X_t}[X_{t+1} = y] = P(X_t, y). More formally:

Theorem 1.1.18 (Markov property). Let f : V^∞ → R be bounded, measurable
and let F(x) := E_x[f((X_t)_{t≥0})]. Then

    E[f((X_{s+t})_{t≥0}) | F_s] = F(X_s) a.s.

Remark 1.1.19. We will come back to the "strong" Markov property in Chapter 3.

We define P^t(x, y) := P_x[X_t = y]. An important consequence of the Markov
property (Theorem 1.1.18) is the following.

Theorem 1.1.20 (Chapman-Kolmogorov).

    P^t(x, z) = ∑_{y∈V} P^s(x, y) P^{t−s}(y, z), s ∈ {0, 1, . . . , t}.

Proof. This follows from the Markov property. Indeed note that P_x[X_t = z | F_s] =
F(X_s) with F(y) := P_y[X_{t−s} = z] and take E_x on each side.

Example 1.1.21 (Random walk on Z). Let (X_t) be simple random walk on Z
interpreted as a graph (i.e., L) where i ∼ j if |i − j| = 1.* Then P(0, x) = 1/2 if
|x| = 1. And P^2(0, x) = 1/4 if |x| = 2 and P^2(0, 0) = 1/2. J
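These two-step probabilities can be recovered by applying Chapman-Kolmogorov directly, that is, by convolving the one-step distribution with itself; a minimal sketch:

```python
def step(dist):
    """One step of simple random walk on Z: from each site, move
    left or right with probability 1/2 each (a convolution with P(0, .))."""
    out = {}
    for x, p in dist.items():
        out[x - 1] = out.get(x - 1, 0.0) + p / 2
        out[x + 1] = out.get(x + 1, 0.0) + p / 2
    return out

dist = {0: 1.0}  # start at the origin
for _ in range(2):
    dist = step(dist)
print(dist)  # {-2: 0.25, 0: 0.5, 2: 0.25}
```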

As is conventional in Markov chain theory, we think of probability distributions
over the state space as row vectors. We will typically denote them by Greek letters
(e.g., µ, π). If we write µ_s for the law of X_s as a row vector, then

    µ_s = µ_0 P^s,

where here P^s is the matrix product of P by itself s times.

Stationarity The transition graph of a chain is the directed graph on V whose
edges are the transitions with strictly positive probability. A chain is irreducible if
V is the unique (strongly) connected component of its transition graph, that is, if
all pairs of states have a directed path between them in the transition graph.

Example 1.1.22 (Simple random walk on a graph (continued)). Simple random
walk on G is irreducible if and only if G is connected. J

A stationary measure π is a measure on V such that

    ∑_{x∈V} π(x) P(x, y) = π(y), ∀y ∈ V,

or in matrix form π = πP. We say that π is a stationary distribution if in addition
π is a probability measure.

Example 1.1.23 (Random walk on Z^d). The all-one measure π ≡ 1 is stationary
for simple random walk on L^d. J

Finite, irreducible chains always have a unique stationary distribution.

Theorem 1.1.24 (Existence and uniqueness: finite case). If P is irreducible and
has a finite state space, then:

(i) (Existence) it has a stationary distribution which, furthermore, is strictly
positive;

(ii) (Uniqueness) the stationary distribution is unique.


* On Z, simple random walk often refers to any nearest-neighbor random walk, whereas the
example here is called simple symmetric random walk. We will not adopt this terminology here.

This result follows from Perron-Frobenius theory (a version of which is stated as
Theorem 6.1.17). We give a self-contained proof.

Proof of Theorem 1.1.24 (i). We begin by proving existence. Denote by n the
number of states. Because P is stochastic, we have by definition that P1 = 1,
where 1 is the all-one vector. Put differently,

    (P − I)1 = 0.

In particular, the columns of P − I are linearly dependent, that is, the rank of P − I
is < n. That, in turn, implies that the rows of P − I are linearly dependent since
row rank and column rank are equal. Hence there exists a non-zero row vector
z ∈ R^n such that z(P − I) = 0, or after rearranging,

    zP = z.    (1.1.2)

The rest of the proof is broken up into a series of lemmas. To take advantage
of irreducibility, we first construct a positive stochastic matrix with z as a left
eigenvector with eigenvalue 1. We then show that all entries of z have the same
sign. Finally, we normalize z.

Lemma 1.1.25 (Existence: Step 1). There exists a non-negative integer h such that

    R = (1/(h + 1)) [I + P + P^2 + · · · + P^h]

is a stochastic matrix with strictly positive entries which satisfies

    zR = z.    (1.1.3)

Lemma 1.1.26 (Existence: Step 2). The entries of z are either all nonnegative or
all nonpositive.

Lemma 1.1.27 (Existence: Step 3). Let

    π = z / (z1),

where z1 = ∑_i z_i. Then π is a strictly positive stationary distribution.

We denote the entries of R and P^s by R_{x,y} and P^s_{x,y}, x, y = 1, . . . , n, respectively.

Proof of Lemma 1.1.25. By irreducibility (see Exercise 1.10), for any x, y ∈ [n]
there is h_{xy} such that P^{h_{xy}}_{x,y} > 0. Now define

    h = max_{x,y∈[n]} h_{xy}.

The matrix P^s, as a product of stochastic matrices, is a stochastic matrix for all s
(see Exercise 1.11). In particular, it has nonnegative entries. Hence, for each x, y,

    R_{x,y} = (1/(h + 1)) [I_{x,y} + P_{x,y} + P^2_{x,y} + · · · + P^h_{x,y}]
            ≥ (1/(h + 1)) P^{h_{xy}}_{x,y} > 0.

Moreover the matrix R, as a convex combination of stochastic matrices, is a stochastic
matrix (see Exercise 1.11).

Since zP = z, it follows by induction that zP^s = z for all s. Therefore,

    zR = (1/(h + 1)) [zI + zP + zP^2 + · · · + zP^h]
       = (1/(h + 1)) [z + z + · · · + z]
       = z.

That concludes the proof.

Proof of Lemma 1.1.26. We argue by contradiction. Suppose that two entries of
z = (z_1, . . . , z_n) have different signs, say z_i > 0 while z_j < 0. By the previous
lemma,

    |z_y| = |∑_x z_x R_{x,y}|
          = |∑_{x : z_x ≥ 0} z_x R_{x,y} + ∑_{x : z_x < 0} z_x R_{x,y}|.

Moreover, R_{x,y} > 0 for all x, y. Therefore, the first term on the last line is strictly
positive (since it is at least z_i R_{i,y} > 0) while the second term is strictly negative
(since it is at most z_j R_{j,y} < 0). Hence, because of cancellations (see Exercise
1.13), the expression in the previous display is strictly smaller than the sum of
the absolute values, that is,

    |z_y| < ∑_x |z_x| R_{x,y}.

Since R is also stochastic by the previous lemma, we deduce after summing
over y that

    ∑_y |z_y| < ∑_y ∑_x |z_x| R_{x,y} = ∑_x |z_x| ∑_y R_{x,y} = ∑_x |z_x|,

a contradiction, thereby proving the claim.



Proof of Lemma 1.1.27. Now define π entrywise by

    π_x = z_x / ∑_i z_i = |z_x| / ∑_i |z_i| ≥ 0,

where the second equality comes from the previous lemma. We also used the fact
that z ≠ 0.

For all y, by definition of z,

    ∑_x π_x P_{x,y} = ∑_x (z_x / ∑_i z_i) P_{x,y} = (1/∑_i z_i) ∑_x z_x P_{x,y} = z_y / ∑_i z_i = π_y.

The same holds with P_{x,y} replaced by R_{x,y} from (1.1.3). Since R_{x,y} > 0 and
z ≠ 0, it follows that π_y > 0 for all y. That proves the claim.

That concludes the proof of the existence claim.

It remains to prove uniqueness. See Exercise 1.14 for an alternative proof
based on the maximum principle (to which we come back in Theorem 3.3.9 and
Exercise 3.12).

Proof of Theorem 1.1.24 (ii). Suppose there are two distinct stationary distributions
π_1 and π_2 (which must be strictly positive). Since they are distinct and both
sum to 1, they are not a multiple of each other and therefore are linearly independent.
Apply the Gram-Schmidt procedure:

    q_1 = π_1 / ‖π_1‖_2   and   q_2 = (π_2 − ⟨π_2, q_1⟩ q_1) / ‖π_2 − ⟨π_2, q_1⟩ q_1‖_2.

Then

    q_1 P = (π_1 / ‖π_1‖_2) P = π_1 P / ‖π_1‖_2 = π_1 / ‖π_1‖_2 = q_1,

and all entries of q_1 are strictly positive.
Similarly,

    q_2 P = (π_2 − ⟨π_2, q_1⟩ q_1) P / ‖π_2 − ⟨π_2, q_1⟩ q_1‖_2
          = (π_2 P − ⟨π_2, q_1⟩ q_1 P) / ‖π_2 − ⟨π_2, q_1⟩ q_1‖_2
          = (π_2 − ⟨π_2, q_1⟩ q_1) / ‖π_2 − ⟨π_2, q_1⟩ q_1‖_2
          = q_2.

Since z := q_2 satisfies (1.1.2), by Lemmas 1.1.25–1.1.27 there is a multiple of
q_2, say q'_2 = αq_2 with α ≠ 0, such that q'_2 P = q'_2 and all entries of q'_2 are strictly
positive. By the Gram-Schmidt procedure,

    ⟨q_1, q'_2⟩ = ⟨q_1, αq_2⟩ = α⟨q_1, q_2⟩ = 0.

But this is a contradiction since both vectors are strictly positive. That concludes
the proof of the uniqueness claim.
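For a concrete finite chain, the stationary distribution of Theorem 1.1.24 can be approximated by iterating µ ↦ µP; a minimal sketch via power iteration (the 3-state chain is made up, and the method relies on the chain being aperiodic, which holds here because of the positive diagonal entries):

```python
def stationary(P, iters=1000):
    """Approximate the stationary distribution by power iteration:
    repeatedly apply mu -> mu P starting from the uniform distribution."""
    n = len(P)
    mu = [1.0 / n] * n
    for _ in range(iters):
        mu = [sum(mu[i] * P[i][j] for i in range(n)) for j in range(n)]
    return mu

# An irreducible, aperiodic 3-state chain (birth-death on a path).
P = [[0.5, 0.5, 0.0],
     [0.25, 0.5, 0.25],
     [0.0, 0.5, 0.5]]
pi = stationary(P)
print(pi)  # approximately [0.25, 0.5, 0.25]
```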

Reversibility A transition matrix P is reversible with respect to a measure η if

    η(x) P(x, y) = η(y) P(y, x)

for all x, y ∈ V. These equations are known as detailed balance. Here is the key
observation: by summing over y and using the fact that P is stochastic, such a
measure is necessarily stationary. (Exercise 1.12 explains the name.)

Example 1.1.28 (Random walk on Z^d (continued)). The measure η ≡ 1 is reversible
for simple random walk on L^d. J

Example 1.1.29 (Simple random walk on a graph (continued)). Let (X_t) be simple
random walk on a connected graph G = (V, E). Then (X_t) is reversible with
respect to η(v) := δ(v), where recall that δ(v) is the degree of v. Indeed, for all
{u, v} ∈ E,

    δ(u) P(u, v) = δ(u) · (1/δ(u)) = 1 = δ(v) · (1/δ(v)) = δ(v) P(v, u).

(See Exercise 1.9 for the transition matrix of simple random walk on a graph.) J

Example 1.1.30 (Metropolis chain). The Metropolis algorithm modifies an irreducible,
symmetric (i.e., whose transition matrix is a symmetric matrix) chain Q
to produce a new chain P with the same transition graph and a prescribed positive
stationary distribution π. The idea is simple. For each pair x ≠ y, either we multiply
Q(x, y) by π(y)/π(x) and leave Q(y, x) intact, or vice versa. Detailed balance
immediately follows. To ensure that the new transition matrix remains stochastic,
for each pair we make the choice that lowers the transition probabilities; then we
add the lost probability to the diagonal (i.e., to the probability of staying put).
Formally, the definition of the new chain is

    P(x, y) := Q(x, y) [π(y)/π(x) ∧ 1]   for x ≠ y,
    P(x, x) := 1 − ∑_{z≠x} Q(x, z) [π(z)/π(x) ∧ 1].

Note that, by definition of P and the fact that Q is stochastic, we have P(x, y) ≤
Q(x, y) for all x ≠ y, so

    ∑_{y≠x} P(x, y) ≤ 1,

and hence P is well-defined as a transition matrix. We claim further that P is
reversible with respect to π. Suppose x ≠ y and assume without loss of generality
that π(x) ≥ π(y). Then, by definition of P, we have

    π(x) P(x, y) = π(x) Q(x, y) (π(y)/π(x))
                 = Q(x, y) π(y)
                 = Q(y, x) π(y)
                 = P(y, x) π(y),

where we used the symmetry of Q. J
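The construction can be checked mechanically. Below is a minimal sketch building P from a made-up symmetric proposal Q (simple random walk on a triangle) and a made-up target π, then verifying detailed balance:

```python
def metropolis(Q, pi):
    """Build the Metropolis chain P from a symmetric proposal matrix Q
    and a positive target distribution pi, following the formula above."""
    n = len(Q)
    P = [[0.0] * n for _ in range(n)]
    for x in range(n):
        off = 0.0
        for y in range(n):
            if y != x:
                P[x][y] = Q[x][y] * min(pi[y] / pi[x], 1.0)
                off += P[x][y]
        P[x][x] = 1.0 - off  # lost probability goes to the diagonal
    return P

Q = [[0.0, 0.5, 0.5], [0.5, 0.0, 0.5], [0.5, 0.5, 0.0]]
pi = [0.2, 0.3, 0.5]
P = metropolis(Q, pi)
# Detailed balance: pi(x) P(x, y) == pi(y) P(y, x) for all x, y.
print(all(abs(pi[x] * P[x][y] - pi[y] * P[y][x]) < 1e-12
          for x in range(3) for y in range(3)))  # True
```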

Convergence and mixing time A key property of Markov chains is that, under
suitable assumptions, they converge to a stationary regime. We need one more
definition before stating the theorem. A chain is said to be aperiodic if, for all
x ∈ V, the greatest common divisor of {t : P^t(x, x) > 0} is 1.

Example 1.1.31 (Lazy random walk on a graph). The lazy simple random walk on
G is the Markov chain such that, at each time, it stays put with probability 1/2 or
chooses a uniformly random neighbor of the current state otherwise. Such a chain
is aperiodic. J

Lemma 1.1.32 (Consequence of aperiodicity). If P is aperiodic, irreducible and
has a finite state space, then there is a positive integer t_0 such that for all t ≥ t_0
the matrix P^t has strictly positive entries.

We can now state the convergence theorem. For probability measures µ, ν on
V, their total variation distance is

    ‖µ − ν‖_TV := sup_{A⊆V} |µ(A) − ν(A)|.    (1.1.4)

Theorem 1.1.33 (Convergence theorem). Suppose P is irreducible, aperiodic and
has stationary distribution π. Then, for all x,

    ‖P^t(x, ·) − π(·)‖_TV → 0,

as t → +∞.

We give a proof in the finite case in Example 4.3.3. In particular, the convergence
theorem implies that for all x, y,

    P^t(x, y) → π(y).

Without aperiodicity, it can be shown that we have the weaker claim

    (1/t) ∑_{s=1}^t P^s(x, y) → π(y),    (1.1.5)

as t → +∞.

We will be interested in quantifying the speed of convergence in Theorem 1.1.33.
For this purpose, we define

    d(t) := sup_{x∈V} ‖P^t(x, ·) − π(·)‖_TV.    (1.1.6)

Lemma 1.1.34 (Monotonicity of d(t)). The function d(t) is non-increasing in t.

Proof. Note that, by definition of P^{t+1},

    d(t + 1) = sup_{x∈V} sup_{A⊆V} |P^{t+1}(x, A) − π(A)|
             = sup_{x∈V} sup_{A⊆V} |∑_z P(x, z)(P^t(z, A) − π(A))|
             ≤ sup_{x∈V} ∑_z P(x, z) sup_{A⊆V} |P^t(z, A) − π(A)|
             ≤ sup_{z∈V} sup_{A⊆V} |P^t(z, A) − π(A)|
             = d(t),

where on the second and fourth lines we used that P is a stochastic matrix.

The following concept will play a key role in quantifying the "speed of convergence"
to stationarity.

Definition 1.1.35 (Mixing time). For a fixed ε > 0, the mixing time is defined as

    t_mix(ε) := inf{t ≥ 0 : d(t) ≤ ε}.
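For a small chain, d(t) and t_mix(ε) can be computed exactly by matrix powers, using the standard identity ‖µ − ν‖_TV = (1/2) ∑_x |µ(x) − ν(x)| on a countable space; a minimal sketch (the lazy walk on the 3-cycle is illustrative):

```python
def mat_mult(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def d(P, pi, t):
    """d(t) = max_x || P^t(x, .) - pi ||_TV, via the half-L1 identity."""
    n = len(P)
    Pt = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(t):
        Pt = mat_mult(Pt, P)
    return max(0.5 * sum(abs(Pt[x][y] - pi[y]) for y in range(n))
               for x in range(n))

def t_mix(P, pi, eps):
    """Smallest t with d(t) <= eps (d is non-increasing by Lemma 1.1.34)."""
    t = 0
    while d(P, pi, t) > eps:
        t += 1
    return t

# Lazy simple random walk on the 3-cycle; pi is uniform.
P = [[0.5, 0.25, 0.25], [0.25, 0.5, 0.25], [0.25, 0.25, 0.5]]
pi = [1 / 3] * 3
print(d(P, pi, 0), t_mix(P, pi, 0.25))  # d(0) = 2/3 and t_mix = 1
```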



1.2 Some discrete probability models


With the necessary background covered, we are now in a position to define formally a few important discrete probability models that will be ubiquitous in this
book. These are all graph-based processes. Many more interesting random discrete
structures and other related probabilistic models will be encountered throughout
(and defined where needed).

Percolation Percolation processes are meant to model the movement of a fluid


through a porous medium. There are several types of percolation models. We
focus here on bond percolation. In words, edges of a graph are “open” at random,
indicating that fluid is passing through. We are interested in the “open clusters,”
that is, the regions reached by the fluid.

Definition 1.2.1 (Bond percolation). Let G = (V, E) be a finite or infinite graph.
The bond percolation process on G with density p ∈ [0, 1], whose probability
measure is denoted by P_p, is defined as follows: each edge of G is independently set to
open with probability p; otherwise it is set to closed. Write x ⇔ y if x, y ∈ V are
connected by a path all of whose edges are open. The open cluster of x is

    C_x := {y ∈ V : x ⇔ y}.

We will mostly consider bond percolation on the infinite graphs L^d or T_d. The main
question we will ask is: For which values of p is there an infinite open cluster?
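A minimal simulation sketch on a finite n × n box (a finite stand-in for the infinite lattice L²; the parameters are illustrative), computing the open cluster of a vertex by breadth-first search with lazily sampled edges:

```python
import random
from collections import deque

def open_cluster(n, p, x, rng):
    """Open cluster of vertex x under bond percolation with density p
    on the n x n grid graph; each edge is sampled once, on demand."""
    def neighbors(v):
        i, j = v
        for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            if 0 <= i + di < n and 0 <= j + dj < n:
                yield (i + di, j + dj)
    open_edge = {}
    def is_open(u, v):
        e = (min(u, v), max(u, v))  # canonical key for the undirected edge
        if e not in open_edge:
            open_edge[e] = rng.random() < p
        return open_edge[e]
    cluster, q = {x}, deque([x])
    while q:
        u = q.popleft()
        for v in neighbors(u):
            if v not in cluster and is_open(u, v):
                cluster.add(v)
                q.append(v)
    return cluster

rng = random.Random(42)
C = open_cluster(20, 0.7, (10, 10), rng)
print(len(C))  # size of the open cluster of the center vertex
```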

Random graphs Random graphs provide a natural framework to study complex


networks. Different behaviors are observed depending on the modeling choices
made. Perhaps the simplest and most studied is the Erdős-Rényi random graph
model. We consider the version due to Gilbert. Here the edges are present in-
dependently with a fixed probability. Despite its simplicity, this model exhibits
a rich set of phenomena that make it a prime example for the use of a variety of
probabilistic techniques.

Definition 1.2.2 (Erdős-Rényi graph model). Let n ∈ N and p ∈ [0, 1]. Set V :=
[n]. Under the Erdős-Rényi graph model on n vertices with density p, a random
graph G = (V, E) is generated as follows: for each pair x ≠ y in V, the edge
{x, y} is in E with probability p independently of all other edges. We write G ∼
G_{n,p} and we denote the corresponding probability measure by P_{n,p}.

Typical questions regarding the Erdős-Rényi graph model (and random graphs
more generally) include: How are degrees distributed? Is G connected? What
is the (asymptotic) probability of observing a particular subgraph, for example, a
triangle? What is the typical chromatic number?
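The first of these questions can already be explored by simulation. A minimal sketch (parameters are illustrative) sampling G ∼ G_{n,p} and computing the average degree, whose mean is (n − 1)p:

```python
import itertools
import random

def erdos_renyi(n, p, rng):
    """Sample G ~ G_{n,p}: each pair is independently an edge with prob. p."""
    return [e for e in itertools.combinations(range(n), 2)
            if rng.random() < p]

rng = random.Random(0)
n, p = 200, 0.05
E = erdos_renyi(n, p, rng)
deg = [0] * n
for u, v in E:
    deg[u] += 1
    deg[v] += 1
avg = sum(deg) / n
print(avg)  # close to its mean (n - 1) p = 9.95
```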
As one alternative to the Erdős-Rényi model, we will also encounter preferential
attachment graphs. These are meant to model the growth of a network where
new edges are more likely to be incident with vertices of high degree, a reasonable
assumption in some applied settings. Such a process produces graphs with properties
that differ from those of the Erdős-Rényi model; in particular they tend to
have a "fatter" degree distribution tail. In the definition of preferential attachment
graphs, we restrict ourselves to the tree case (see Exercise 1.15).

Definition 1.2.3 (Preferential attachment graph). The preferential attachment graph
process produces a sequence of graphs (G_t)_{t≥1} as follows. We start at time 1 with
two vertices, denoted v_0 and v_1, connected by an edge. At time t, we add vertex v_t
with a single edge connecting it to an old vertex, which is picked proportionally to
its degree. We write (G_t)_{t≥1} ∼ PA_1.
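A minimal simulation sketch of PA_1 (the representation choices are ours): picking an old vertex proportionally to its degree can be implemented by choosing a uniform element from the list of all edge endpoints, since each vertex appears in that list once per incident edge.

```python
import random

def preferential_attachment_tree(T, rng):
    """Grow (G_t) ~ PA_1 up to time T; return the parent of each new vertex."""
    parent = {1: 0}      # time 1: v0 and v1 joined by an edge
    endpoints = [0, 1]   # each edge contributes both of its endpoints
    for t in range(2, T + 1):
        old = rng.choice(endpoints)  # uniform endpoint = degree-biased vertex
        parent[t] = old
        endpoints.extend([old, t])
    return parent

rng = random.Random(1)
par = preferential_attachment_tree(1000, rng)
deg = {}
for v, u in par.items():
    deg[u] = deg.get(u, 0) + 1
    deg[v] = deg.get(v, 0) + 1
print(max(deg.values()))  # high-degree hubs tend to emerge
```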

Markov random fields Another common class of graph-based processes involves
the assignment of random "states" to the vertices of a fixed graph. The
state distribution is typically specified through "interactions between neighboring
vertices." Such models are widely studied in statistical physics and also have important
applications in statistics. We focus on models with a Markovian (i.e., conditional
independence) structure that makes them particularly amenable to rigorous
analysis and computational methods. We start with Gibbs random fields, a broad
class of such models.

Definition 1.2.4 (Gibbs random field). Let S be a finite set and let G = (V, E) be a
finite graph. Denote by K the set of all cliques of G. A positive probability measure
µ on X := S^V is called a Gibbs random field if there exist clique potentials φ_K :
S^K → R, K ∈ K, such that

    µ(σ) = (1/Z) exp( ∑_{K∈K} φ_K(σ_K) ),

where σ_K is the vector σ ∈ X restricted to the vertices (i.e., coordinates) in K and
Z is a normalizing constant.

The following example introduces the primary Gibbs random field we will encounter.

Example 1.2.5 (Ising model). For β > 0, the (ferromagnetic) Ising model with inverse
temperature β is the Gibbs random field with S := {−1, +1}, φ_{{i,j}}(σ_{{i,j}}) =
βσ_i σ_j, and φ_K ≡ 0 if |K| ≠ 2. The function H(σ) := −∑_{{i,j}∈E} σ_i σ_j is known
as the Hamiltonian. The normalizing constant Z := Z(β) is called the partition
function. The states (σ_i)_{i∈V} are referred to as spins. J
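On very small graphs, Z(β) and the full measure µ_β can be computed by brute-force enumeration of the 2^{|V|} spin configurations; a minimal sketch (the triangle graph and β = 0.5 are illustrative):

```python
import itertools
import math

def ising_measure(V, E, beta):
    """Brute-force the Ising Gibbs measure on a small graph:
    mu(sigma) = exp(-beta H(sigma)) / Z with H(sigma) = -sum_{ij in E} s_i s_j."""
    weights = {}
    Z = 0.0
    for sigma in itertools.product([-1, 1], repeat=len(V)):
        H = -sum(sigma[i] * sigma[j] for i, j in E)
        w = math.exp(-beta * H)
        weights[sigma] = w
        Z += w
    return {s: w / Z for s, w in weights.items()}, Z

V = [0, 1, 2]
E = [(0, 1), (1, 2), (0, 2)]  # triangle
mu, Z = ising_measure(V, E, 0.5)
# Ferromagnetism: the two all-equal configurations are the most likely.
print(mu[(1, 1, 1)] == max(mu.values()))  # True
```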
Typical questions regarding Ising models include: How fast is correlation de-
caying down the graph? How well can one guess the state at an unobserved vertex?
We will also consider certain Markov chains related to Ising models (see Defini-
tion 1.2.8).

Random walks on graphs and reversible Markov chains The last class of processes
we focus on are random walks on graphs and their generalizations. Recall
the following definition.

Definition 1.2.6 (Simple random walk on a graph). Let G = (V, E) be a finite
or countable, locally finite graph. Simple random walk on G is the Markov chain
on V, started at an arbitrary vertex, which at each time picks a uniformly chosen
neighbor of the current state.

We generalize the definition by adding weights to the edges. In this context, we
denote edge weights by c(e) for "conductance" (see Section 3.3).
Definition 1.2.7 (Random walk on a network). Let G = (V, E) be a finite or
countably infinite graph. Let c : E → R_+ be a positive edge weight function on G.
Recall that we call N = (G, c) a network. We assume that for all u ∈ V,

    c(u) := ∑_{e={u,v}∈E} c(e) < +∞.

Random walk on network N is the Markov chain on V, started at an arbitrary
vertex, which at each time picks a neighbor of the current state proportionally to
the weight of the corresponding edge. That is, the transition matrix is given by

    P(u, v) = c(e)/c(u)   if e = {u, v} ∈ E,   and   P(u, v) = 0   otherwise.

By definition of P, it is immediate that this Markov chain is reversible with respect
to the measure η(u) := c(u). In fact, conversely, any countable reversible Markov
chain can be seen as a random walk on a network by setting c(e) := π(x)P(x, y) =
π(y)P(y, x) for all x, y such that P(x, y) > 0.

Typical questions include: How long does it take to visit all vertices at least
once, or a particular subset of vertices for the first time? How fast does the walk
approach stationarity? How often does the walk return to its starting point?
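A minimal sketch building the transition matrix of the random walk on a made-up 3-vertex network and checking reversibility with respect to η(u) = c(u):

```python
def network_walk_matrix(c):
    """Transition matrix P(u, v) = c({u, v}) / c(u) of the random walk
    on a network; c is a dict keyed by frozensets of endpoints."""
    vertices = sorted({v for e in c for v in e})
    cu = {u: sum(w for e, w in c.items() if u in e) for u in vertices}
    P = {u: {v: c.get(frozenset((u, v)), 0.0) / cu[u]
             for v in vertices if v != u} for u in vertices}
    return P, cu

# A made-up 3-vertex network (a weighted triangle).
c = {frozenset((0, 1)): 2.0, frozenset((1, 2)): 1.0, frozenset((0, 2)): 3.0}
P, cu = network_walk_matrix(c)
# Reversibility with respect to eta(u) = c(u): detailed balance holds.
print(all(abs(cu[u] * P[u][v] - cu[v] * P[v][u]) < 1e-12
          for u in P for v in P[u]))  # True
```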
We will also encounter a particular class of Markov chains related to Ising
models, the Glauber dynamics.

Definition 1.2.8 (Glauber dynamics of the Ising model). Let µ_β be the Ising model
with inverse temperature β > 0 on a graph G = (V, E). The (single-site) Glauber
dynamics is the Markov chain on X := {−1, +1}^V which at each time:

- selects a site i ∈ V uniformly at random, and

- updates the spin at i according to µ_β conditioned on agreeing with the current
state at all sites in V \ {i}.

Specifically, for γ ∈ {−1, +1}, i ∈ V, and σ ∈ X, let σ^{i,γ} be the configuration σ
with the spin at i being set to γ. Let n = |V| and S_i(σ) := ∑_{j∼i} σ_j. The nonzero
entries of the transition matrix are

    Q_β(σ, σ^{i,γ}) := (1/n) · e^{γβS_i(σ)} / (e^{−βS_i(σ)} + e^{βS_i(σ)}).

This chain is irreducible since we can flip each site one by one to go from any
state to any other. It is straightforward to check that Q_β is a stochastic matrix.
The next theorem shows that µ_β is its stationary distribution.

Theorem 1.2.9. The Glauber dynamics is reversible with respect to µ_β.

Proof. For all σ ∈ X and i ∈ V, let

    S_{≠i}(σ) := H(σ^{i,+}) + S_i(σ) = H(σ^{i,−}) − S_i(σ).

We have

    µ_β(σ^{i,−}) Q_β(σ^{i,−}, σ^{i,+})
        = (e^{−βS_{≠i}(σ)} e^{−βS_i(σ)} / Z(β)) · (e^{βS_i(σ)} / (n[e^{−βS_i(σ)} + e^{βS_i(σ)}]))
        = e^{−βS_{≠i}(σ)} / (nZ(β)[e^{−βS_i(σ)} + e^{βS_i(σ)}])
        = (e^{−βS_{≠i}(σ)} e^{βS_i(σ)} / Z(β)) · (e^{−βS_i(σ)} / (n[e^{−βS_i(σ)} + e^{βS_i(σ)}]))
        = µ_β(σ^{i,+}) Q_β(σ^{i,+}, σ^{i,−}).

That concludes the proof.
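A minimal simulation sketch of the single-site Glauber update (the 4-cycle, β, and the run length are illustrative):

```python
import math
import random

def glauber_step(sigma, neighbors, beta, rng):
    """One step of single-site Glauber dynamics for the Ising model:
    pick a uniform site i, then set its spin to +1 with probability
    e^{beta S_i} / (e^{-beta S_i} + e^{beta S_i}), where S_i is the
    sum of the neighboring spins."""
    i = rng.randrange(len(sigma))
    S = sum(sigma[j] for j in neighbors[i])
    p_plus = math.exp(beta * S) / (math.exp(-beta * S) + math.exp(beta * S))
    sigma = list(sigma)
    sigma[i] = 1 if rng.random() < p_plus else -1
    return tuple(sigma)

# 4-cycle, all spins up initially.
neighbors = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
sigma = (1, 1, 1, 1)
rng = random.Random(0)
for _ in range(100):
    sigma = glauber_step(sigma, neighbors, 1.0, rng)
print(sigma)
```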



Exercises
Exercise 1.1 (0-norm). Show that ‖u‖_0 does not define a norm.

Exercise 1.2 (A factorial bound: one way). Let ℓ be a positive integer.

(i) Use the bound 1 + x ≤ e^x to show that

    (k + 1)/k ≤ e^{1/k}

and

    k/(k + 1) ≤ e^{−1/(k+1)},

for all positive integers k.

(ii) Use part (i) and the quantity

    ∏_{k=1}^{ℓ−1} (k + 1)^k / k^k

to show that

    ℓ! ≥ ℓ^ℓ / e^{ℓ−1}.

(iii) Use part (i) and the quantity

    ∏_{k=1}^{ℓ−1} k^{k+1} / (k + 1)^{k+1}

to show that

    ℓ! ≤ ℓ^{ℓ+1} / e^{ℓ−1}.
Exercise 1.3 (A factorial bound: another way). Let ℓ be a positive integer. Show
that

    ℓ^ℓ / e^{ℓ−1} ≤ ℓ! ≤ ℓ^{ℓ+1} / e^{ℓ−1},

by considering the logarithm of ℓ!, interpreting the resulting quantity as a Riemann
sum, and bounding above and below by an integral.

Exercise 1.4 (A binomial bound). Show that for integers 0 < d ≤ n,

    ∑_{k=0}^{d} (n choose k) ≤ (en/d)^d.

[Hint: Multiply the left-hand side of the inequality by (d/n)^d ≤ (d/n)^k and use
the binomial theorem.]

Exercise 1.5 (Powers of the adjacency matrix). Let A^n be the n-th matrix power
of the adjacency matrix A of a graph G = (V, E). Prove that the (i, j)-th entry a^n_{ij}
is the number of walks of length exactly n between vertices i and j in G. [Hint:
Use induction on n.]

Exercise 1.6 (Paths). Let u, v be vertices of a graph G = (V, E).

(i) Show that the graph distance dG (u, v) is a metric.

(ii) Show that the binary relation u ↔ v is an equivalence relation.

(iii) Prove (ii) in the directed case.

Exercise 1.7 (Trees: number of edges). Prove that a connected graph with n ver-
tices is a tree if and only if it has n − 1 edges. [Hint: Proceed by induction. Then
use Corollary 1.1.6.]

Exercise 1.8 (Trees: characterizations). Prove Theorem 1.1.5.

Exercise 1.9 (Simple random walk on a graph). Let (Xt )t≥0 be simple random
walk on a finite graph G = (V, E). Suppose the vertex set is V = [n]. Write down
an expression for the transition matrix of (Xt ).

Exercise 1.10 (Communication lemma). Let (X_t) be a finite Markov chain. Show that, if x → y, then there is an integer r ≥ 1 such that

    P[X_r = y | X_0 = x] = (P^r)_{x,y} > 0.

Exercise 1.11 (Stochastic matrices from stochastic matrices). Let

    P^{(1)}, P^{(2)}, ..., P^{(r)} ∈ R^{n×n}

be stochastic matrices.

(i) Show that P^{(1)} P^{(2)} is a stochastic matrix. That is, a product of stochastic matrices is a stochastic matrix.

(ii) Show that for any α_1, ..., α_r ∈ [0, 1] with ∑_{i=1}^{r} α_i = 1, the matrix

    ∑_{i=1}^{r} α_i P^{(i)}

is stochastic. That is, a convex combination of stochastic matrices is a stochastic matrix.
Exercise 1.12 (Reversing time). Let (Xt ) be a finite Markov chain with transition
matrix P . Assume P is reversible with respect to a probability distribution π.
Assume that the initial distribution is π. Show that for any sequence of states
z0 , . . . , zs , the reversed sequence has the same probability, that is,

P[Xs = z0 , . . . , X0 = zs ] = P[Xs = zs , . . . , X0 = z0 ].

Exercise 1.13 (A strict inequality). Let a, b ∈ R with a < 0 and b > 0. Show that

|a + b| < |a| + |b|.

[Hint: Consider the cases a + b ≥ 0 and a + b < 0 separately.]


Exercise 1.14 (Uniqueness: maximum principle). Let P = (P_{i,j})_{i,j=1}^{n} ∈ R^{n×n} be a transition matrix.

(i) Let α_1, ..., α_m > 0 be such that ∑_{i=1}^{m} α_i = 1, and let x = (x_1, ..., x_m) ∈ R^m. Show that

    ∑_{i=1}^{m} α_i x_i ≤ max_i x_i,

and that equality holds if and only if x_1 = x_2 = · · · = x_m.

(ii) Let 0 ≠ y ∈ R^n be a right eigenvector of P with eigenvalue 1, that is, P y = y. Assume that y is not a constant vector, that is, there is i ≠ j such that y_i ≠ y_j. Let k be such that y_k = max_{i∈[n]} y_i. Show that for any ℓ such that P_{k,ℓ} > 0 we necessarily have y_ℓ = y_k. [Hint: Use that y_k = ∑_{i=1}^{n} P_{k,i} y_i and apply (i).]

(iii) Assume that P is irreducible. Let 0 ≠ y ∈ R^n again be a right eigenvector of P with eigenvalue 1. Use (ii) to show that y is necessarily a constant vector.

(iv) Assume again that P is irreducible. Use (iii) to conclude that the dimension of the null space of P^T − I is exactly 1. [Hint: Use the Rank-Nullity Theorem.]

Exercise 1.15 (Preferential attachment trees). Let (Gt )t≥1 ∼ PA1 as in Defini-
tion 1.2.3. Show that Gt is a tree with t + 1 vertices for all t ≥ 1.

Exercise 1.16 (Warm-up: a little calculus). Prove the following inequalities which
we will encounter throughout. [Hint: Basic calculus should do.]

(i) Show that 1 − x ≤ e^{−x} for all x ∈ R.

(ii) Show that 1 − x ≥ e^{−x−x^2} for x ∈ [0, 1/2].

(iii) Show that log(1 + x) ≤ x − x^2/4 for x ∈ [0, 1].

(iv) Show that log x ≤ x − 1 for all x ∈ R_+.

(v) Show that cos x ≤ e^{−x^2/2} for x ∈ [0, π/2). [Hint: Consider the function h(x) = log(e^{x^2/2} cos x) and recall that the derivative of tan x is 1 + tan^2 x.]

Bibliographic Remarks
Section 1.1 For an introduction to graphs see for example [Die10] or [Bol98].
Four different proofs of Cayley’s formula are detailed in the delightful [AZ18].
Markov chain theory is covered in detail in [Dur10, Chapter 6]. For a gentler introduction, see for example [Dur12, Chapter 1], [Nor98, Chapter 1], or [Res92, Chapter 2].

Section 1.2 For book-length treatments of percolation theory see [BR06a, Gri10b].
The version of the Erdős-Rényi random graph model we consider here is due to
Gilbert [Gil59]. For much deeper accounts of the theory of random graphs and
related processes, see for example [Bol01, Dur06, JLR11, vdH17, FK16]. Two
standard references on finite Markov chains and mixing times are [AF, LPW06].
Chapter 2

Moments and tails

In this chapter we look at the moments of a random variable. Specifically we


demonstrate that moments capture useful information about the tail of a random
variable while often being simpler to compute or at least bound. Several well-
known inequalities quantify this intuition. Although they are straightforward to
derive, such inequalities are surprisingly powerful. Through a range of applica-
tions, we illustrate the utility of controlling the tail of a random variable, typically
by allowing one to dismiss certain “bad events” as rare. We begin in Section 2.1
by recalling the classical Markov and Chebyshev inequalities. Then we discuss
three of the most fundamental tools in discrete probability and probabilistic com-
binatorics. In Sections 2.2 and 2.3, we derive the complementary first and second
moment methods, and give several standard applications, especially to phase tran-
sitions in random graphs and percolation. In Section 2.4 we develop the Chernoff-
Cramér method, which relies on the moment-generating function and is the build-
ing block for a large class of tail bounds. Two key applications in data science are
briefly introduced: sparse recovery and empirical risk minimization.

2.1 Background
We start with a few basic definitions and standard inequalities. See Appendix B for
a refresher on random variables and their expectation.
Version: December 20, 2023
Modern Discrete Probability: An Essential Toolkit
Copyright © 2023 Sébastien Roch

27
CHAPTER 2. MOMENTS AND TAILS 28

2.1.1 Definitions
Moments As a quick reminder, let X be a random variable with E|X|^k < +∞ for some non-negative integer k. In that case we write X ∈ L^k. Recall that the quantities E[X^k] and E[(X − EX)^k], which are well-defined when X ∈ L^k, are called respectively the k-th moment and the k-th central moment of X. The first moment and the second central moment are known as the mean and the variance, the square root of which is the standard deviation. A random variable is said to be centered if its mean is 0. Recall that, for a non-negative random variable X, the k-th moment can be expressed as

    E[X^k] = ∫_0^{+∞} k x^{k−1} P[X > x] dx.    (2.1.1)
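As a quick sanity check of (2.1.1) (our own numerical illustration, not part of the text), take X ∼ Exponential(1), for which P[X > x] = e^{−x} and E[X^2] = 2; a simple trapezoid-rule integration recovers the second moment:

```python
import math

# X ~ Exponential(1): P[X > x] = e^{-x} and E[X^2] = 2.
# Evaluate (2.1.1) with k = 2, i.e. integrate 2 x P[X > x] over [0, T],
# by the trapezoid rule; the tail beyond T = 50 is negligible.
k, h, T = 2, 0.001, 50.0
xs = [i * h for i in range(int(T / h) + 1)]
vals = [k * x ** (k - 1) * math.exp(-x) for x in xs]
integral = h * (sum(vals) - 0.5 * (vals[0] + vals[-1]))
assert abs(integral - 2.0) < 1e-3
```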

The moment-generating function (or exponential moment) of X is the function

    M_X(s) := E[e^{sX}],

defined for all s ∈ R where it is finite, which includes at least s = 0. If M_X(s) is defined on (−s_0, s_0) for some s_0 > 0, then X has finite moments of all orders: for any non-negative integer k,

    (d^k/ds^k) M_X(s) = E[X^k e^{sX}],    (2.1.2)

and the following expansion holds

    M_X(s) = ∑_{k≥0} (s^k/k!) E[X^k],    |s| < s_0.

The moment-generating function plays nicely with sums of independent random variables. Specifically, if X_1 and X_2 are independent random variables with M_{X_1} and M_{X_2} defined over a joint interval (−s_0, s_0), then for s in that interval

    M_{X_1+X_2}(s) = E[e^{s(X_1+X_2)}]
                   = E[e^{sX_1} e^{sX_2}]
                   = E[e^{sX_1}] E[e^{sX_2}]
                   = M_{X_1}(s) M_{X_2}(s),    (2.1.3)

where we used independence in the third equality.
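The factorization (2.1.3) can be verified on a toy case (our own illustration): a Binomial(2, p) variable is a sum of two independent Bernoulli(p) variables, so its moment-generating function should be the square of the Bernoulli one:

```python
import math

def mgf_bernoulli(p, s):
    # M_X(s) = E[e^{sX}] = (1 - p) + p e^s for X ~ Bernoulli(p)
    return (1 - p) + p * math.exp(s)

def mgf_binomial(n, p, s):
    # Direct computation from the pmf: sum_k C(n,k) p^k (1-p)^(n-k) e^{sk}
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k) * math.exp(s * k)
               for k in range(n + 1))

p, s = 0.3, 0.7
# (2.1.3): the MGF of a sum of independent variables is the product of the MGFs
assert abs(mgf_binomial(2, p, s) - mgf_bernoulli(p, s) ** 2) < 1e-12
```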
One more piece of notation: if A is an event and X ∈ L^1, then we use the shorthand

    E[X; A] = E[X 1_A].

Tails We refer to a probability of the form P[X ≥ x] as an upper tail (or right tail) probability. Typically x is (much) greater than the mean or median of X. Similarly, we refer to P[X ≤ x] as a lower tail (or left tail) probability. Our general goal in this chapter is to bound tail probabilities using moments and moment-generating functions.
Tail bounds arise naturally in many contexts, as events of interest can often be framed in terms of a random variable being unusually large or small. Such probabilities are often hard to compute directly, however. As we will see in this chapter, moments offer an effective means to control tail probabilities for two main reasons: (i) moments contain information about the tails of a random variable, as (2.1.1) makes explicit for instance; and (ii) they are typically easier to compute, or at least to approximate.
As we will see, tail bounds are also useful to study the maximum of a collection
of random variables.

2.1.2 Basic inequalities


Markov's inequality Our first bound on the tail of a random variable is Markov's inequality. In words, for a non-negative random variable: the heavier the tail, the larger the expectation. This simple inequality is in fact a key ingredient in more sophisticated tail bounds, as we will see.

Theorem 2.1.1 (Markov's inequality). Let X be a non-negative random variable. Then, for all b > 0,

    P[X ≥ b] ≤ EX / b.    (2.1.4)

Proof.
    EX ≥ E[X; X ≥ b] ≥ E[b; X ≥ b] = b P[X ≥ b].

See Figure 2.1 for a proof by picture. Note that this inequality is non-trivial only when b > EX.

Chebyshev's inequality An application of Markov's inequality (Theorem 2.1.1) to |X − EX|^2 gives a classical tail bound featuring the second moment of a random variable.

Theorem 2.1.2 (Chebyshev's inequality). Let X be a random variable with EX^2 < +∞. Then, for all β > 0,

    P[|X − EX| > β] ≤ Var[X] / β^2.    (2.1.5)

Figure 2.1: Proof of Markov’s inequality: taking expectations of the two functions
depicted above yields the inequality.

Proof. This follows immediately by applying (2.1.4) to |X − EX|^2 with b = β^2.

Of course this bound is non-trivial only when β is larger than the standard deviation. Results of this type that quantify the probability of deviating from the mean are referred to as concentration inequalities. Chebyshev's inequality is perhaps the simplest instance; we will derive many more. To bound the variance, the following standard formula is sometimes useful:

    Var[∑_{i=1}^{n} X_i] = ∑_{i=1}^{n} Var[X_i] + 2 ∑_{i<j} Cov[X_i, X_j],    (2.1.6)

where recall that the covariance of X_i and X_j is

    Cov[X_i, X_j] := E[X_i X_j] − E[X_i] E[X_j].

When X_i and X_j are independent, Cov[X_i, X_j] = 0.


Example 2.1.3. Let X be a Gaussian random variable with mean µ and variance σ^2, that is, whose density is

    f_X(x) = (1/√(2πσ^2)) exp(−(x − µ)^2/(2σ^2)),    x ∈ R.

Figure 2.2: Comparison of Markov’s and Chebyshev’s inequalities: the squared de-
viation from the mean (solid) gives a better approximation of the indicator function
(dotted) close to the mean than the absolute deviation (dashed).
We write X ∼ N(µ, σ^2). A direct computation shows that E|X − µ| = σ√(2/π). Hence Markov's inequality gives

    P[|X − µ| ≥ b] ≤ E|X − µ| / b = √(2/π) · (σ/b),

while Chebyshev's inequality (Theorem 2.1.2) gives

    P[|X − µ| ≥ b] ≤ (σ/b)^2.

Hence, for b large enough, Chebyshev's inequality produces a stronger bound. See Figure 2.2 for some insight. J
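A short simulation (our own illustration, with the arbitrary choices µ = 0, σ = 1, b = 3) confirms the comparison: both bounds are valid, and Chebyshev's is tighter for this b:

```python
import math
import random

random.seed(0)
mu, sigma, b, n = 0.0, 1.0, 3.0, 200_000
samples = [random.gauss(mu, sigma) for _ in range(n)]
tail = sum(abs(x - mu) >= b for x in samples) / n   # true value is about 0.0027

markov_bound = math.sqrt(2 / math.pi) * sigma / b   # E|X - mu| / b
chebyshev_bound = (sigma / b) ** 2                  # Var[X] / b^2

# Both bounds hold, and Chebyshev's is the stronger one for this (large) b
assert tail <= chebyshev_bound <= markov_bound
```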
Example 2.1.4 (Coupon collector's problem). Let (X_t)_{t∈N} be i.i.d. uniform random variables over [n], that is, random variables equally likely to take any value in [n]. Let T_{n,i} be the first time that i distinct elements of [n] have been picked, that is,

    T_{n,i} = inf{t ≥ 1 : |{X_1, ..., X_t}| = i},

with T_{n,0} := 0. We prove that the time it takes to pick all elements at least once, or "collect each coupon," has the following tail behavior. For any ε > 0, we have as n → +∞:

Claim 2.1.5.

    P[|T_{n,n} − n ∑_{j=1}^{n} j^{−1}| ≥ ε n log n] → 0.

To prove this claim, we note that the time elapsed between T_{n,i−1} and T_{n,i}, which we denote by τ_{n,i} := T_{n,i} − T_{n,i−1}, is geometric with success probability 1 − (i−1)/n, and all the τ_{n,i}s are independent. Recall that a geometric random variable Z with success probability p has probability mass function P[Z = z] = (1 − p)^{z−1} p for z ∈ N, with mean 1/p and variance (1 − p)/p^2. So the expectation and variance of T_{n,n} = ∑_{i=1}^{n} τ_{n,i} are

    E[T_{n,n}] = ∑_{i=1}^{n} (1 − (i−1)/n)^{−1} = n ∑_{j=1}^{n} j^{−1} = Θ(n log n),    (2.1.7)

and

    Var[T_{n,n}] ≤ ∑_{i=1}^{n} (1 − (i−1)/n)^{−2} = n^2 ∑_{j=1}^{n} j^{−2} ≤ n^2 ∑_{j=1}^{+∞} j^{−2} = Θ(n^2).    (2.1.8)

So by Chebyshev's inequality (Theorem 2.1.2),

    P[|T_{n,n} − n ∑_{j=1}^{n} j^{−1}| ≥ ε n log n] ≤ Var[T_{n,n}] / (ε n log n)^2
        ≤ n^2 ∑_{j=1}^{+∞} j^{−2} / (ε n log n)^2
        → 0,

by (2.1.7) and (2.1.8). J
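A quick simulation (our own illustration) matches the expectation computation (2.1.7):

```python
import random

def collect_all(n, rng):
    """Draw uniform coupons from {0, ..., n-1} until all n appear;
    return the number of draws."""
    seen, t = set(), 0
    while len(seen) < n:
        seen.add(rng.randrange(n))
        t += 1
    return t

rng = random.Random(42)
n, trials = 50, 2000
avg = sum(collect_all(n, rng) for _ in range(trials)) / trials
expected = n * sum(1 / j for j in range(1, n + 1))  # (2.1.7): n * H_n
assert abs(avg / expected - 1) < 0.05               # within 5% of E[T_{n,n}]
```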


A classical implication of Chebyshev's inequality is (a version of) the law of large numbers. Recall that a sequence of random variables (X_n)_{n≥1} converges in probability to a random variable X, denoted by X_n →_p X, if for all ε > 0,

    lim_{n→+∞} P[|X_n − X| ≥ ε] = 0.

Theorem 2.1.6 (L^2 weak law of large numbers). Let X_1, X_2, ... be uncorrelated random variables, that is, E[X_i X_j] = E[X_i] E[X_j] for i ≠ j, with E[X_i] = µ < +∞ and sup_i Var[X_i] < +∞. Then

    (1/n) ∑_{k≤n} X_k →_p µ.

See Exercise 2.5 for a proof. When the Xk s are i.i.d. and integrable (but not nec-
essarily square integrable), convergence is almost sure. That result, the strong law
of large numbers, also follows from Chebyshev’s inequality (and other ideas), but
we will not prove it here.
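Theorem 2.1.6 can be observed in a simulation (our own illustration, with uniform [0, 1] variables, so µ = 1/2 and Var = 1/12): the deviation probability shrinks as n grows, consistently with Chebyshev's bound Var/(n ε^2):

```python
import random

rng = random.Random(2)

def deviation_prob(n, eps=0.05, trials=400):
    """Estimate P[|mean of n uniforms - 1/2| >= eps] by Monte Carlo."""
    hits = 0
    for _ in range(trials):
        avg = sum(rng.random() for _ in range(n)) / n
        hits += abs(avg - 0.5) >= eps
    return hits / trials

p_small, p_large = deviation_prob(10), deviation_prob(2000)
# Chebyshev gives P <= 1/(12 n eps^2), i.e. <= 1/60 for n = 2000,
# while for n = 10 a deviation of 0.05 is still common
assert p_large < 0.05 < p_small
```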

2.2 First moment method


In this section, we develop some techniques based on the first moment. Recall
that the expectation of a random variable has an elementary, yet handy, property:
linearity. That is, if random variables X1 , . . . , Xk defined on a joint probability
space have finite first moments, then
E[X1 + · · · + Xk ] = E[X1 ] + · · · + E[Xk ], (2.2.1)
without any further assumption. In particular linearity holds whether or not the Xi s
are independent.

2.2.1 The probabilistic method


A key technique of probabilistic combinatorics is the so-called probabilistic method.
The idea is that one can establish the existence of an object satisfying a certain
property—without having to construct one explicitly. Instead one argues that a
randomly chosen object exhibits the given property with positive probability. The
following “obvious” observation, sometimes referred to as the first moment princi-
ple, plays a key role in this context.
Theorem 2.2.1 (First moment principle). Let X be a random variable with finite expectation. Then, for any µ ∈ R,

    EX ≤ µ =⇒ P[X ≤ µ] > 0.
Proof. We argue by contradiction. Assume EX ≤ µ and P[X ≤ µ] = 0. Write {X ≤ µ} = ∩_{n≥1} {X < µ + 1/n}. That implies by monotonicity (see Lemma B.2.6) that, for any ε ∈ (0, 1), it holds that P[X < µ + 1/n] < ε for n large enough. Hence, because we assume that P[X ≤ µ] = 0,

    µ ≥ EX
      = E[X; X < µ + 1/n] + E[X; X ≥ µ + 1/n]
      ≥ µ P[X < µ + 1/n] + (µ + 1/n)(1 − P[X < µ + 1/n])
      = µ + n^{−1}(1 − P[X < µ + 1/n])
      > µ + n^{−1}(1 − ε)
      > µ,

a contradiction.

The power of this principle is easier to appreciate on an example.


Example 2.2.2 (Balancing vectors). Let v_1, ..., v_n be arbitrary unit vectors in R^n. How small can we make the 2-norm of the linear combination

    x_1 v_1 + · · · + x_n v_n

by appropriately choosing x_1, ..., x_n ∈ {−1, +1}? We claim that it can be as small as √n, for any collection of v_i s. At first sight, this may appear to be a complicated geometry problem. But the proof is trivial once one thinks of choosing the x_i s at random. Let X_1, ..., X_n be independent random variables uniformly distributed in {−1, +1}. Then, since E[X_i X_j] = E[X_i] E[X_j] = 0 for all i ≠ j while E[X_i^2] = 1 for all i,

    E‖X_1 v_1 + · · · + X_n v_n‖_2^2 = E[∑_{i,j} X_i X_j ⟨v_i, v_j⟩]
        = ∑_{i,j} E[X_i X_j ⟨v_i, v_j⟩]
        = ∑_{i,j} ⟨v_i, v_j⟩ E[X_i X_j]
        = ∑_{i} ‖v_i‖_2^2
        = n,

where we used the linearity of expectation in the second equality. Hence the random variable Z = ‖X_1 v_1 + · · · + X_n v_n‖_2^2 has expectation EZ = n and must take a value ≤ n with positive probability by the first moment principle (Theorem 2.2.1). In other words, there must be a choice of the X_i s such that Z ≤ n, that is, such that ‖X_1 v_1 + · · · + X_n v_n‖_2 ≤ √n. That proves the claim. J
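The computation in Example 2.2.2 can be checked by brute force in a small dimension (our own sketch; random directions stand in for the arbitrary unit vectors): averaging ‖x_1 v_1 + · · · + x_n v_n‖_2^2 over all 2^n sign choices gives exactly n, so some choice is at most n:

```python
import math
import random

rng = random.Random(1)
n = 8
# Arbitrary unit vectors in R^n (random directions, normalized)
vs = []
for _ in range(n):
    v = [rng.gauss(0, 1) for _ in range(n)]
    norm = math.sqrt(sum(c * c for c in v))
    vs.append([c / norm for c in v])

def sq_norm(signs):
    # ||sum_i signs[i] * v_i||_2^2
    s = [sum(signs[i] * vs[i][k] for i in range(n)) for k in range(n)]
    return sum(c * c for c in s)

all_signs = [[1 if (m >> i) & 1 else -1 for i in range(n)] for m in range(2 ** n)]
values = [sq_norm(x) for x in all_signs]

avg = sum(values) / len(values)
assert abs(avg - n) < 1e-9          # cross terms cancel exactly over all signs
assert min(values) <= n + 1e-9      # first moment principle: some choice <= n
```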
Here is a slightly more subtle example of the probabilistic method, where one
has to modify the original random choice.
Example 2.2.3 (Independent sets). For d ∈ N, let G = (V, E) be a d-regular graph
with n vertices. Such a graph necessarily has m = nd/2 edges. Our goal is to derive a lower bound on the size, α(G), of the largest independent set in G. Recall that an
independent set is a set of vertices in a graph, no two of which are adjacent. Again,
at first sight, this may seem like a rather complicated graph-theoretic problem. But
an appropriate random choice gives a non-trivial bound. Specifically:

Claim 2.2.4.

    α(G) ≥ n / (2d).
Proof. The proof proceeds in two steps:

1. We first prove the existence of a subset S of vertices with relatively few


edges.

2. We remove vertices from S to obtain an independent set.

Step 1. Let 0 < p < 1 be a parameter to be chosen below. To form the set S, pick each vertex in V independently with probability p. Letting X be the number of vertices in S, we have by the linearity of expectation that

    EX = E[∑_{v∈V} 1_{v∈S}] = np,

where we used that E[1_{v∈S}] = p. Letting Y be the number of edges between vertices in S, we have by the linearity of expectation that

    EY = E[∑_{{i,j}∈E} 1_{i∈S} 1_{j∈S}] = (nd/2) p^2,

where we also used that E[1_{i∈S} 1_{j∈S}] = p^2 by independence. Hence, subtracting,

    E[X − Y] = np − (nd/2) p^2,

which, as a function of p, is maximized at p = 1/d, where it takes the value n/(2d). As a result, by the first moment principle applied to X − Y, there must exist a set S of vertices in G such that

    |S| − |{{i, j} ∈ E : i, j ∈ S}| ≥ n/(2d).    (2.2.2)

Step 2. For each edge e connecting two vertices in S, remove one of the endvertices of e. By construction, the remaining set of vertices: (i) forms an independent set; and (ii) has size greater than or equal to the left-hand side of (2.2.2). That inequality implies the claim.

Note that a graph G made of n/(d + 1) cliques of size d + 1 (with no edge


between the cliques) has α(G) = n/(d + 1), showing that our bound is tight up to
a constant. This is known as a Turán graph. J
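The two-step proof of Claim 2.2.4 is effectively a randomized algorithm; here is a minimal sketch (the cycle graph, which is 2-regular, is just a convenient test case):

```python
import random

def random_independent_set(n, edges, p, rng):
    """Step 1: keep each vertex independently with probability p.
    Step 2: for each edge inside the kept set, delete one endpoint."""
    s = {v for v in range(n) if rng.random() < p}
    for i, j in edges:
        if i in s and j in s:
            s.discard(j)
    return s

n, d = 30, 2
edges = [(i, (i + 1) % n) for i in range(n)]   # cycle: a 2-regular graph
rng = random.Random(7)
sets = [random_independent_set(n, edges, 1 / d, rng) for _ in range(200)]

# Every returned set is independent ...
for s in sets:
    assert all(not (i in s and j in s) for i, j in edges)
# ... and, as Claim 2.2.4 predicts, some run reaches size >= n/(2d) = 7.5
assert max(len(s) for s in sets) >= n / (2 * d)
```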

Remark 2.2.5. The previous result can be strengthened to

    α(G) ≥ ∑_{v∈V} 1/(δ(v) + 1),

for a general graph G = (V, E), where δ(v) is the degree of v. This bound is achieved for Turán graphs. See, for example, [AS11, The probabilistic lens: Turán's theorem].

The previous example also illustrates the important indicator trick, that is, writing a random variable as a sum of indicators, which is naturally used in combination with the linearity of expectation.

2.2.2 Boole’s inequality


One implication of the first moment principle (Theorem 2.2.1) is that: if a non-
negative, integer-valued random variable X has expectation strictly smaller than 1,
then its value is 0 with positive probability. The following application of Markov’s
inequality (Theorem 2.1.1) adds a quantitative twist: if that same X has a “small”
expectation, then its value is 0 with “large” probability.

Theorem 2.2.6 (First moment method). If X is a non-negative, integer-valued


random variable, then
P[X > 0] ≤ EX. (2.2.3)

Proof. Take b = 1 in Markov’s inequality.

This simple fact is typically used in the following manner: one wants to show
that a certain “bad event” does not occur with probability approaching 1; the ran-
dom variable X then counts the number of such “bad events.” In that case, X is a
sum of indicators and Theorem 2.2.6 reduces simply to the standard union bound, also known as Boole's inequality. We record one useful version of this setting in the next corollary.

Corollary 2.2.7. Let B_n = A_{n,1} ∪ · · · ∪ A_{n,m_n}, where A_{n,1}, ..., A_{n,m_n} is a collection of events for each n. Then, letting

    µ_n := ∑_{i=1}^{m_n} P[A_{n,i}],

we have

    P[B_n] ≤ µ_n.

In particular, if µ_n → 0 as n → +∞, then P[B_n] → 0.

Proof. Take X := X_n = ∑_{i=1}^{m_n} 1_{A_{n,i}} in Theorem 2.2.6.

A useful generalization of the union bound is given in Exercise 2.2. We will refer to applications of Theorem 2.2.6 as the first moment method.

Example 2.2.8 (Random k-SAT threshold). For r ∈ R_+, let Φ_{n,r} : {0,1}^n → {0,1} be a random k-CNF formula on n Boolean variables z_1, ..., z_n with ⌈rn⌉ clauses. That is, Φ_{n,r} is an AND of ⌈rn⌉ ORs, each obtained by picking independently k literals uniformly at random (with replacement). Recall that a literal is a variable z_i or its negation z̄_i. The formula Φ_{n,r} is said to be satisfiable if there exists an assignment z = (z_1, ..., z_n) such that Φ_{n,r}(z) = 1. Clearly the higher the value of r, the less likely it is for Φ_{n,r} to be satisfiable. In fact it is natural to conjecture that a sharp transition takes place, that is, that there exists an r_k^* ∈ R_+ (depending on k but not on n) such that

    lim_{n→∞} P[Φ_{n,r} is satisfiable] = 0 if r > r_k^*,  and  = 1 if r < r_k^*.    (2.2.4)

Studying such threshold phenomena is a major theme of modern discrete probability. Using the first moment method (Theorem 2.2.6), we give an upper bound on the threshold. Formally:

Claim 2.2.9.

    r > 2^k log 2 =⇒ lim sup_{n→∞} P[Φ_{n,r} is satisfiable] = 0.

Proof. How to start the proof should be obvious: let X_n be the number of satisfying assignments of Φ_{n,r}. Applying the first moment method, since

    P[Φ_{n,r} is satisfiable] = P[X_n > 0],

it suffices to show that EX_n → 0. To compute EX_n, we use the indicator trick

    X_n = ∑_{z∈{0,1}^n} 1_{z satisfies Φ_{n,r}}.

There are 2^n possible assignments. Each fixed assignment satisfies the random choice of clauses Φ_{n,r} with probability (1 − 2^{−k})^{⌈rn⌉}. Indeed note that the ⌈rn⌉ clauses are picked independently and each literal picked is satisfied with probability 1/2. Therefore, by the assumption on r, for ε > 0 small enough and n

large enough,

    EX_n = 2^n (1 − 2^{−k})^{⌈rn⌉}
         ≤ 2^n (1 − 2^{−k})^{(2^k log 2)(1+ε)n}
         ≤ 2^n e^{−(log 2)(1+ε)n}
         = 2^{−εn}
         → 0,

where we used that (1 − 1/ℓ)^ℓ ≤ e^{−1} for all ℓ ∈ N (see Exercise 1.16). Theorem 2.2.6 implies the claim.
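The first moment computation above predicts that, well beyond density 2^k log 2, random formulas are rarely satisfiable. A brute-force experiment on a small instance (our own illustration; the parameter choices are arbitrary) agrees:

```python
import math
import random
from itertools import product

def random_kcnf(n, k, m, rng):
    """m clauses; each literal is a (variable index, sign) pair picked
    uniformly at random with replacement."""
    return [[(rng.randrange(n), rng.choice((False, True))) for _ in range(k)]
            for _ in range(m)]

def satisfiable(n, clauses):
    # Brute force over all 2^n assignments (fine for small n)
    return any(all(any(z[v] == sign for v, sign in clause) for clause in clauses)
               for z in product((False, True), repeat=n))

rng = random.Random(3)
n, k = 8, 3
r = 1.5 * (2 ** k) * math.log(2)    # 50% above the 2^k log 2 first-moment bound
m = int(r * n) + 1                  # plays the role of the ceiling of rn
frac = sum(satisfiable(n, random_kcnf(n, k, m, rng)) for _ in range(30)) / 30
# Here E[X_n] = 2^n (1 - 2^{-k})^m is about 0.03, so satisfiability is rare
assert frac < 0.5
```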
Remark 2.2.10. Bounds in the other direction are also known. For instance, for k ≥ 3, it has been shown that if r < 2^k log 2 − k, then

    lim inf_{n→∞} P[Φ_{n,r} is satisfiable] = 1.

See [ANP05]. For the k = 2 case, it is known that (2.2.4) in fact holds with r_2^* = 1 [CR92]. A breakthrough of [DSS22] also establishes (2.2.4) for large k; the threshold r_k^* is characterized as the root of a certain equation coming from statistical physics.

2.2.3 Random permutations: longest increasing subsequence

In this section, we bound the expected length of a longest increasing subsequence in a random permutation. Let σ_n = (σ_n(1), ..., σ_n(n)) be a uniformly random permutation of [n] := {1, ..., n} (i.e., a bijection of [n] to itself chosen uniformly at random among all such mappings) and let L_n be the length of a longest increasing subsequence of σ_n (i.e., of a sequence of indices i_1 < · · · < i_k such that σ_n(i_1) < · · · < σ_n(i_k)).

Claim 2.2.11.

    EL_n = Θ(√n).
Proof. We first prove that

    lim sup_{n→∞} EL_n/√n ≤ e,    (2.2.5)

which implies half of the claim. Bounding the expectation of L_n is not straightforward as it is the expectation of a maximum. A natural way to proceed is to find a value ℓ for which P[L_n ≥ ℓ] is "small." More formally, we bound the expectation as follows:

    EL_n ≤ ℓ P[L_n < ℓ] + n P[L_n ≥ ℓ] ≤ ℓ + n P[L_n ≥ ℓ],    (2.2.6)



for an ℓ chosen below. To bound the probability on the right-hand side, we appeal to the first moment method (Theorem 2.2.6) by letting X_n be the number of increasing subsequences of length ℓ. We also use the indicator trick, that is, we think of X_n as a sum of indicators over subsequences (not necessarily increasing) of length ℓ. There are (n choose ℓ) such subsequences, each of which is increasing with probability 1/ℓ!. Note that these subsequences are not independent. Nevertheless, by the linearity of expectation and the first moment method,

    P[L_n ≥ ℓ] = P[X_n > 0] ≤ EX_n = (1/ℓ!)(n choose ℓ) ≤ n^ℓ/(ℓ!)^2 ≤ n^ℓ/(e^2 [ℓ/e]^{2ℓ}) ≤ (e√n/ℓ)^{2ℓ},

where we used a standard bound on factorials recalled in Appendix A. Note that, in order for this bound to go to 0, we need ℓ > e√n. Then (2.2.5) follows by taking ℓ = (1 + δ)e√n in (2.2.6), for an arbitrarily small δ > 0.
For the other half of the claim, we show that

    EL_n/√n ≥ 1.

This part does not rely on the first moment method (and may be skipped). We seek a lower bound on the expected length of a longest increasing subsequence. The proof uses the following two ideas. First observe that there is a natural symmetry between the lengths of the longest increasing and decreasing subsequences: they are identically distributed. Moreover, if a permutation has a "short" longest increasing subsequence, then intuitively it must have a "long" decreasing subsequence, and vice versa. Combining these two observations gives a lower bound on the expectation of L_n. Formally, let D_n be the length of a longest decreasing subsequence. By symmetry and the arithmetic mean-geometric mean inequality, note that

    EL_n = E[(L_n + D_n)/2] ≥ E[√(L_n D_n)].

We show that L_n D_n ≥ n, which proves the claim. Let L_n^{(k)} be the length of a longest increasing subsequence ending at position k, and similarly for D_n^{(k)}. It suffices to show that the pairs (L_n^{(k)}, D_n^{(k)}), 1 ≤ k ≤ n, are distinct. Indeed, noting that L_n^{(k)} ≤ L_n and D_n^{(k)} ≤ D_n, the number of pairs in [L_n] × [D_n] is at most L_n D_n, which must then be at least n.
Let 1 ≤ j < k ≤ n. If σ_n(k) > σ_n(j), then we see that L_n^{(k)} > L_n^{(j)} by appending σ_n(k) to the subsequence ending at position j that achieves L_n^{(j)}. If the opposite holds, then we have instead D_n^{(k)} > D_n^{(j)}. Either way, (L_n^{(j)}, D_n^{(j)}) and (L_n^{(k)}, D_n^{(k)}) must be distinct. This clever combinatorial argument is known as the Erdős-Szekeres theorem. That concludes the proof of the second claim.

Remark 2.2.12. It has been shown that in fact

    EL_n = 2√n + c n^{1/6} + o(n^{1/6}),

as n → +∞, where c = −1.77... [BDJ99].
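The quantity L_n is computable in O(n log n) time by patience sorting; a simulation (our own sketch) is consistent with Claim 2.2.11 and with the 2√n asymptotics of Remark 2.2.12:

```python
import math
import random
from bisect import bisect_left

def lis_length(perm):
    """Length of a longest (strictly) increasing subsequence via patience sorting."""
    piles = []
    for x in perm:
        i = bisect_left(piles, x)
        if i == len(piles):
            piles.append(x)
        else:
            piles[i] = x
    return len(piles)

rng = random.Random(5)
n, trials = 400, 200
total = 0
for _ in range(trials):
    perm = list(range(n))
    rng.shuffle(perm)
    total += lis_length(perm)
avg = total / trials
# The proof brackets E[L_n] between sqrt(n) and e*sqrt(n); in fact it is ~ 2 sqrt(n)
assert math.sqrt(n) <= avg <= math.e * math.sqrt(n)
```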

2.2.4 Percolation: existence of a non-trivial critical value on Z^2


In this section we use the first moment method (Theorem 2.2.6) to prove the exis-
tence of a non-trivial threshold in bond percolation on the two-dimensional lattice.
We begin with some background.

Critical value in bond percolation Consider bond percolation (Definition 1.2.1) on the two-dimensional lattice L^2 (see Section 1.1.1) with density p. Let P_p denote the corresponding probability measure. Recall that paths are "self-avoiding" by definition (see Section 1.1.1). We say that a path is open if all edges in the induced subgraph are open. Writing x ⇔ y if x, y ∈ L^2 are connected by an open path, recall that the open cluster of x is

    C_x := {y ∈ Z^2 : x ⇔ y}.

The percolation function is defined as

    θ(p) := P_p[|C_0| = +∞],

that is, θ(p) is the probability that the origin is connected by open paths to infinitely many vertices. It is intuitively clear that the function θ(p) is non-decreasing. Indeed, consider the following alternative representation of the percolation process: to each edge e, assign a uniform [0, 1] random variable U_e and declare the edge open if U_e ≤ p. Using the same U_e s for densities p_1 < p_2, it follows immediately from the monotonicity of the construction that θ(p_1) ≤ θ(p_2). (We will have much more to say about this type of "coupling" argument in Chapter 4.) Moreover, note that θ(0) = 0 and θ(1) = 1. The critical value is defined as

    p_c(L^2) := sup{p ≥ 0 : θ(p) = 0},

the point at which the probability that the origin is contained in an infinite open cluster becomes positive. Note that by a union bound over all vertices, when θ(p) = 0, we have that P_p[∃x, |C_x| = +∞] = 0. Conversely, because {∃x, |C_x| = +∞} is a tail event (see Definition B.3.9) for any enumeration of the edges, by Kolmogorov's 0-1 law (Theorem B.3.11) it holds that P_p[∃x, |C_x| = +∞] = 1 when θ(p) > 0.

Using the first moment method we show that the critical value is non-trivial,
that is, it is strictly between 0 and 1. This is a different example of a threshold
phenomenon.

Claim 2.2.13.

    p_c(L^2) ∈ (0, 1).

Proof. We first show that, for any p < 1/3, θ(p) = 0. In order to apply the first
moment method, roughly speaking, we need to reduce the problem to counting the
number of instances of an appropriately chosen substructure. The key observation
is the following:

An infinite C0 contains an open path starting at 0 of infinite length and,


as a result, of all lengths.

Hence, we let X_n be the number of open paths of length n starting at 0. Then, by monotonicity,

    P_p[|C_0| = +∞] ≤ P_p[∩_n {X_n > 0}] = lim_n P_p[X_n > 0] ≤ lim sup_n E_p[X_n],    (2.2.7)

where the last inequality follows from Theorem 2.2.6. We bound the number of paths of length n (each of which is open with probability p^n) by noting that they cannot backtrack. That gives 4 choices at the first step, and at most 3 choices at each subsequent step. Hence, we get the following bound:

    E_p[X_n] ≤ 4(3^{n−1}) p^n.

The right-hand side goes to 0 for all p < 1/3. When combined with (2.2.7), that proves half of the claim: p_c(L^2) > 0.
For the other direction, we show that θ(p) > 0 for p close enough to 1. This time, we count "dual cycles." This type of proof is known as a contour argument, or Peierls' argument, and is based on the following construction. Consider the dual lattice L̃^2 whose vertices are Z^2 + (1/2, 1/2) and whose edges connect vertices u, v with ‖u − v‖_1 = 1. See Figure 2.3. Note that each edge in the primal lattice L^2 has a unique corresponding edge in the dual lattice, which crosses it perpendicularly. We make the same assignment, open or closed, for corresponding primal and dual edges. The following graph-theoretic lemma, whose proof is sketched below, forms the basis of contour arguments. Recall that cycles are "self-avoiding" by definition (see Section 1.1.1). We say that a cycle is closed if all edges in the induced subgraph are closed, that is, are not open.

Figure 2.3: Primal (solid) and dual (dotted) lattices.

Lemma 2.2.14 (Contour lemma). If |C_0| < +∞, then there is a closed cycle around the origin in the dual lattice L̃^2.
To prove that θ(p) > 0 for p close enough to 1, the idea is to apply the first moment method to Z_n, the number of closed dual cycles of length n surrounding the origin. We bound from above the number of dual cycles of length n around the origin by the number of choices for the starting edge across the upper y-axis and for each of the n − 1 subsequent non-backtracking choices. Namely,

    P[|C_0| < +∞] ≤ P[∃n ≥ 4, Z_n > 0]
                  ≤ ∑_{n≥4} P[Z_n > 0]
                  ≤ ∑_{n≥4} E[Z_n]
                  ≤ ∑_{n≥4} (n/2) 3^{n−1} (1 − p)^n
                  = (3^3 (1−p)^4 / 2) ∑_{m≥1} (m + 3)(3(1 − p))^{m−1}
                  = (3^3 (1−p)^4 / 2) [1/(1 − 3(1 − p))^2 + 3/(1 − 3(1 − p))],

when p > 2/3, where the first term in parentheses Pon themlast line comes from
differentiating with respect to q the geometric series m≥0 q and setting q := 1−
p. The expression on the last line can be made smaller than 1 if we let p approach
1. We have shown that θ(p) > 0 for p close enough to 1, and that concludes the
proof. (Exercise 2.3 sketches a proof that θ(p) > 0 for all p > 2/3.)
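The two regimes of the proof show up already in a finite-volume simulation (our own sketch; reaching the boundary of the box [−L, L]^2 serves as a proxy for membership in an infinite cluster):

```python
import random
from collections import deque

def reaches_boundary(p, L, rng):
    """Open each bond of Z^2 independently with probability p (sampled lazily),
    then BFS from the origin along open bonds inside the box [-L, L]^2."""
    open_bond = {}
    def is_open(u, v):
        e = (u, v) if u < v else (v, u)   # canonical key for an undirected bond
        if e not in open_bond:
            open_bond[e] = rng.random() < p
        return open_bond[e]
    seen, queue = {(0, 0)}, deque([(0, 0)])
    while queue:
        x, y = queue.popleft()
        if max(abs(x), abs(y)) == L:
            return True
        for u in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if u not in seen and is_open((x, y), u):
                seen.add(u)
                queue.append(u)
    return False

rng = random.Random(11)
L, trials = 12, 300
low = sum(reaches_boundary(0.25, L, rng) for _ in range(trials)) / trials
high = sum(reaches_boundary(0.75, L, rng) for _ in range(trials)) / trials
# p = 1/4 < 1/3: the path-counting bound already forces a small probability;
# p = 3/4 > 2/3: the contour argument forces a large one
assert low < 0.2
assert high > 0.5
```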

It is straightforward to extend the claim to L^d. (Exercise 2.4 asks for the details.)

Proof of the contour lemma We conclude this section by sketching the proof of
the contour lemma, which relies on topological arguments.

Proof of Lemma 2.2.14. Assume |C0 | < +∞. Imagine identifying each vertex in
L2 with a square of side 1 centered around it so that the sides line up with dual
edges. Paint green the squares of vertices in C0 . Paint red the squares of vertices
in C0c which share a side with a green square. Leave the other squares white.
Let u0 be a highest vertex in C0 along the y-axis and let v0 and v1 be the dual
vertices corresponding to the upper left and right corners respectively of the square
of u0 . Because u0 is highest, it must be that the square above it is red. Walk
along the dual edge {v0 , v1 } separating the squares of u0 and u0 + (0, 1) from v0
to v1 . Notice that this edge satisfies what we call the red-green property: as you
traverse it from v0 to v1 , a red square sits on your left and a green square is on your
right. Proceed further by iteratively walking along an incident dual edge with the
following rule. Choose an edge satisfying the red-green property, with the edges
to your left, straight ahead, and to your right in decreasing order of priority. Stop
when a previously visited dual vertex is reached. The claim is that this procedure
constructs the desired cycle. Let v0 , v1 , v2 , . . . be the dual vertices visited. By
construction {vi−1 , vi } is a dual edge for all i.

- A dual cycle is produced. We first argue that this procedure cannot get stuck.
Let {vi−1 , vi } be the edge just crossed and assume that it has the red-green
property. If there is a green square to the left ahead, then the edge to the
left, which has highest priority, has the red-green property. If the left square
ahead is not green, but the right one is, then the left square must in fact be
red by construction (i.e., it cannot be white). In that case, the edge straight
ahead has the red-green property. Finally, if neither square ahead is green,
then the right square must in fact be red because the square behind to the
right is green by assumption. That implies that the edge to the right has
the red-green property. Hence we have shown that the procedure does not
get stuck. Moreover, because by assumption the number of green squares
is finite, this procedure must eventually terminate when a previously visited


dual vertex is reached, forming a cycle (of length at least 4).

- The origin lies within the cycle. The inside of a cycle in the plane is well-
defined by the Jordan curve theorem. So the dual cycle produced above has
its adjacent green squares either on the inside (negative orientation) or on the
outside (positive orientation). In the former case the origin must lie inside
the cycle as otherwise the vertices corresponding to the green squares on the
inside would not be in C0 , as they could not be connected to the origin with
open paths.
So it remains to consider the latter case, where through a similar reasoning
the origin must lie outside the cycle. Let vj be the repeated dual vertex.
Assume first that vj ≠ v0 and let vj−1 and vj+1 be the dual vertices preceding and following vj during the first visit to vj. Let vk be the dual vertex
preceding vj on the second visit. After traversing the edge from vj−1 to vj ,
vk cannot be to the left or to the right because in those cases the red-green
properties of the two corresponding edges (i.e., {vj−1 , vj } and {vk , vj }) are
not compatible. So vk is straight ahead and, by the priority rules, vj+1 must
be to the left upon entering vj the first time. But in that case, for the origin
to lie outside the cycle as we are assuming and for the cycle to avoid the path
v0 , . . . , vj−1 , we must traverse the cycle with a negative orientation, that is,
the green squares adjacent to the cycle must be on the inside, a contradiction.
So, finally, assume v0 is the repeated vertex. If the cycle is traversed with a
positive orientation and the origin is on the outside, it must be that the cycle
crosses the y-axis at least once above u0 + (0, 1), again a contradiction.
Hence we have shown that the origin is inside the cycle.

That concludes the proof.


Remark 2.2.15. It turns out that pc (L2 ) = 1/2. We will prove pc (L2 ) ≥ 1/2, known as
Harris’ Theorem, in Section 4.2.5. The other direction is due to Kesten [Kes80].

2.3 Second moment method


The first moment method (Theorem 2.2.6) gives an upper bound on the probability
that a non-negative, integer-valued random variable is positive—which is nontrivial
provided its expectation is small enough. In this section we seek a lower bound on
that probability. We first note that a large expectation does not suffice in general.
Say Xn is n^2 with probability 1/n, and 0 otherwise. Then EXn = n → +∞, yet
Figure 2.4: Second moment method: if the standard deviation σX of X is less than
its expectation µX , then the probability that X is 0 is bounded away from 1.

P[Xn > 0] → 0. That is, although the expectation diverges, the probability that
Xn is positive can be arbitrarily small.
So we turn to the second moment. Intuitively the basis for the so-called second
moment method is that, if the expectation of Xn is large and its variance is rela-
tively small, then we can bound the probability that Xn is close to 0. As we will
see in applications, the first and second moment methods often work hand in hand.

2.3.1 Paley-Zygmund inequality


As an immediate corollary of Chebyshev’s inequality (Theorem 2.1.2), we get a
first version of the second moment method: if the standard deviation of X is less
than its expectation, then the probability that X is 0 is bounded away from 1. See
Figure 2.4. Formally, let X be a nonnegative random variable (not identically zero).
Then
\[
P[X > 0] \ge 1 - \frac{\mathrm{Var}[X]}{(EX)^2}. \tag{2.3.1}
\]
Indeed, by (2.1.5),
\[
P[X = 0] \le P[|X - EX| \ge EX] \le \frac{\mathrm{Var}[X]}{(EX)^2}.
\]
The following tail bound, a simple application of Cauchy-Schwarz (Theorem B.4.8), leads to an improved version of this inequality.

Theorem 2.3.1 (Paley-Zygmund inequality). Let $X$ be a nonnegative random variable. For all $0 < \theta < 1$,
\[
P[X \ge \theta\, EX] \ge (1-\theta)^2\, \frac{(EX)^2}{E[X^2]}. \tag{2.3.2}
\]
Proof. We have
\[
EX = E[X \mathbf{1}_{\{X < \theta EX\}}] + E[X \mathbf{1}_{\{X \ge \theta EX\}}] \le \theta\, EX + \sqrt{E[X^2]\, P[X \ge \theta EX]},
\]
where we used Cauchy-Schwarz. Rearranging gives the result.

As an immediate application:

Theorem 2.3.2 (Second moment method). Let $X$ be a nonnegative random variable (not identically zero). Then
\[
P[X > 0] \ge \frac{(EX)^2}{E[X^2]}. \tag{2.3.3}
\]
Proof. Take $\theta \downarrow 0$ in (2.3.2).

Since
\[
\frac{(EX)^2}{E[X^2]} = 1 - \frac{\mathrm{Var}[X]}{(EX)^2 + \mathrm{Var}[X]},
\]
we see that (2.3.3) is stronger than (2.3.1).
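Since (2.3.1), (2.3.2), and (2.3.3) involve only the first two moments, they are easy to verify numerically on any finite distribution. The following quick check (an illustration, not part of the text) uses exact rational arithmetic on a small three-point example:

```python
from fractions import Fraction as F

# A small nonnegative random variable: its values and probabilities.
vals  = [0, 1, 4]
probs = [F(1, 2), F(1, 4), F(1, 4)]

EX  = sum(p * v for v, p in zip(vals, probs))       # first moment
EX2 = sum(p * v * v for v, p in zip(vals, probs))   # second moment
var = EX2 - EX ** 2
P_pos = sum(p for v, p in zip(vals, probs) if v > 0)

# Second moment method (2.3.3): P[X > 0] >= (EX)^2 / E[X^2] ...
assert P_pos >= EX ** 2 / EX2
# ... which is stronger than the Chebyshev version (2.3.1).
assert EX ** 2 / EX2 >= 1 - var / EX ** 2

# Paley-Zygmund (2.3.2) on a grid of theta in (0, 1).
for k in range(1, 10):
    theta = F(k, 10)
    lhs = sum(p for v, p in zip(vals, probs) if v >= theta * EX)
    assert lhs >= (1 - theta) ** 2 * EX ** 2 / EX2
print("all inequalities hold")
```

Any other choice of values and probabilities works the same way, which is a useful sanity check when applying these bounds.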
We typically apply the second moment method to a sequence of random vari-
ables (Xn ). The previous theorem gives a uniform lower bound on the probability
that {Xn > 0} when E[Xn2 ] ≤ CE[Xn ]2 for some C > 0. Just like the first
moment method, the second moment method is often applied to a sum of indica-
tors (but see Section 2.3.3 for a weighted case). We record in the next corollary a
convenient version of the method.
Corollary 2.3.3. Let $B_n = A_{n,1} \cup \cdots \cup A_{n,m_n}$, where $A_{n,1}, \ldots, A_{n,m_n}$ is a collection of events for each $n$. Write $i \overset{n}{\sim} j$ if $i \ne j$ and $A_{n,i}$ and $A_{n,j}$ are not independent. Then, letting
\[
\mu_n := \sum_{i=1}^{m_n} P[A_{n,i}], \qquad \gamma_n := \sum_{i \overset{n}{\sim} j} P[A_{n,i} \cap A_{n,j}],
\]
where the second sum is over ordered pairs, we have $\liminf_n P[B_n] > 0$ whenever $\mu_n \to +\infty$ and $\gamma_n \le C \mu_n^2$ for some $C > 0$. If moreover $\gamma_n = o(\mu_n^2)$, then $\lim_n P[B_n] = 1$.
Proof. We apply the second moment method to $X_n := \sum_{i=1}^{m_n} \mathbf{1}_{A_{n,i}}$, so that $B_n = \{X_n > 0\}$. Note that
\[
\mathrm{Var}[X_n] = \sum_i \mathrm{Var}[\mathbf{1}_{A_{n,i}}] + \sum_{i \ne j} \mathrm{Cov}[\mathbf{1}_{A_{n,i}}, \mathbf{1}_{A_{n,j}}],
\]
where
\[
\mathrm{Var}[\mathbf{1}_{A_{n,i}}] = E[(\mathbf{1}_{A_{n,i}})^2] - (E[\mathbf{1}_{A_{n,i}}])^2 \le P[A_{n,i}],
\]
and, if $A_{n,i}$ and $A_{n,j}$ are independent,
\[
\mathrm{Cov}[\mathbf{1}_{A_{n,i}}, \mathbf{1}_{A_{n,j}}] = 0,
\]
whereas, if $i \overset{n}{\sim} j$,
\[
\mathrm{Cov}[\mathbf{1}_{A_{n,i}}, \mathbf{1}_{A_{n,j}}] = E[\mathbf{1}_{A_{n,i}} \mathbf{1}_{A_{n,j}}] - E[\mathbf{1}_{A_{n,i}}]\, E[\mathbf{1}_{A_{n,j}}] \le P[A_{n,i} \cap A_{n,j}].
\]
Hence
\[
\frac{\mathrm{Var}[X_n]}{(E X_n)^2} \le \frac{\mu_n + \gamma_n}{\mu_n^2} = \frac{1}{\mu_n} + \frac{\gamma_n}{\mu_n^2}.
\]
Noting
\[
\frac{(E X_n)^2}{E[X_n^2]} = \frac{(E X_n)^2}{(E X_n)^2 + \mathrm{Var}[X_n]} = \frac{1}{1 + \mathrm{Var}[X_n]/(E X_n)^2},
\]
and applying Theorem 2.3.2 gives the result.

2.3.2 Random graphs: subgraph containment and connectivity in the Erdős-Rényi model

Threshold phenomena are also common in random graphs. We consider here the Erdős-Rényi random graph model (Definition 1.2.2). In this context a threshold function for a graph property $P$ is a function $r(n)$ such that
\[
\lim_n P_{n,p_n}[G_n \text{ has property } P] =
\begin{cases}
0, & \text{if } p_n \ll r(n),\\
1, & \text{if } p_n \gg r(n),
\end{cases}
\]
where $G_n \sim G_{n,p_n}$ is a random graph with $n$ vertices and density $p_n$. In this section, we illustrate this type of phenomenon on two properties: the containment of small subgraphs and connectivity.
Subgraph containment

We first consider the clique number, then we turn to more general subgraphs.

Cliques Let $\omega(G)$ be the clique number of a graph $G$, that is, the size of its largest clique.

Claim 2.3.4. The property $\omega(G_n) \ge 4$ has threshold function $n^{-2/3}$.

Proof. Let $X_n$ be the number of 4-cliques in the random graph $G_n \sim G_{n,p_n}$. Then, noting that there are $\binom{4}{2} = 6$ edges in a 4-clique,
\[
E_{n,p_n}[X_n] = \binom{n}{4} p_n^6 = \Theta(n^4 p_n^6),
\]
which goes to 0 when $p_n \ll n^{-2/3}$. Hence the first moment method (Theorem 2.2.6) gives one direction: $P_{n,p_n}[\omega(G_n) \ge 4] \to 0$ in that case.
For the other direction, we apply the second moment method for sums of indicators, that is, Corollary 2.3.3. We use the notation from that corollary. For an enumeration $S_1, \ldots, S_{m_n}$ of the 4-tuples of vertices in $G_n$, let $A_{n,1}, \ldots, A_{n,m_n}$ be the events that the corresponding 4-clique is present. By the calculation above we have $\mu_n = \Theta(n^4 p_n^6)$, which goes to $+\infty$ when $p_n \gg n^{-2/3}$. Also $\mu_n^2 = \Theta(n^8 p_n^{12})$, so it suffices to show that $\gamma_n = o(n^8 p_n^{12})$. Note that two 4-cliques with disjoint edge sets (but possibly sharing one vertex) are independent (i.e., their presence or absence is independent). Suppose $S_i$ and $S_j$ share 3 vertices. Then $i \overset{n}{\sim} j$ and
\[
P_{n,p_n}[A_{n,i} \,|\, A_{n,j}] = p_n^3,
\]
as the event $A_{n,j}$ implies that all edges between three of the vertices in $S_i$ are already present, and there are 3 edges between the remaining vertex and the rest of $S_i$. Similarly, if $|S_i \cap S_j| = 2$, we have again $i \overset{n}{\sim} j$ and this time $P_{n,p_n}[A_{n,i} \,|\, A_{n,j}] = p_n^5$. Putting these together, we get by the definition of the conditional probability
(see Appendix B) and the fact that $P_{n,p_n}[A_{n,j}] = p_n^6$
\[
\begin{aligned}
\gamma_n &= \sum_{i \overset{n}{\sim} j} P[A_{n,i} \cap A_{n,j}]\\
&= \sum_{i \overset{n}{\sim} j} P_{n,p_n}[A_{n,j}]\, P_{n,p_n}[A_{n,i} \,|\, A_{n,j}]\\
&= \sum_j P_{n,p_n}[A_{n,j}] \sum_{i : i \overset{n}{\sim} j} P_{n,p_n}[A_{n,i} \,|\, A_{n,j}]\\
&= \binom{n}{4} p_n^6 \left( \binom{4}{3}(n-4)\, p_n^3 + \binom{4}{2} \binom{n-4}{2} p_n^5 \right)\\
&= O(n^5 p_n^9) + O(n^6 p_n^{11})\\
&= O\!\left( \frac{n^8 p_n^{12}}{n^3 p_n^3} \right) + O\!\left( \frac{n^8 p_n^{12}}{n^2 p_n} \right)\\
&= o(n^8 p_n^{12})\\
&= o(\mu_n^2),
\end{aligned}
\]
where we used that $p_n \gg n^{-2/3}$ (so that, for example, $n^3 p_n^3 \gg 1$). Corollary 2.3.3 gives the result: $P_{n,p_n}[\cup_i A_{n,i}] \to 1$ when $p_n \gg n^{-2/3}$.

Roughly speaking, the first and second moments suffice to pinpoint the thresh-
old in this case because the indicators in Xn are “mostly” pairwise independent
and, as a result, the sum is “concentrated around its mean.”
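The two regimes can be made concrete by evaluating the exact first moment $E_{n,p_n}[X_n] = \binom{n}{4} p_n^6$ on either side of the threshold $n^{-2/3}$; the snippet below (an illustration, not from the text) does just that:

```python
from math import comb

def expected_4_cliques(n, p):
    # E[X_n] = C(n, 4) p^6: each 4-set of vertices spans a clique w.p. p^6.
    return comb(n, 4) * p ** 6

n = 10 ** 5
below = expected_4_cliques(n, n ** (-0.75))  # p_n << n^(-2/3)
above = expected_4_cliques(n, n ** (-0.55))  # p_n >> n^(-2/3)
print(below, above)
assert below < 1e-3   # expectation vanishes: whp no 4-clique, by Theorem 2.2.6
assert above > 100    # expectation diverges with n in this regime
```

Of course, a diverging expectation alone proves nothing; it is the second moment computation above that converts it into a high-probability statement.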

General subgraphs The methods of Claim 2.3.4 can be applied to more general subgraphs. However the situation is somewhat more complicated than it is for cliques. For a graph $H_0$, let $v_{H_0}$ and $e_{H_0}$ be the number of vertices and edges of $H_0$ respectively. Let $X_n$ be the number of (not necessarily induced) copies of $H_0$ in $G_n \sim G_{n,p_n}$. By the first moment method,
\[
P[X_n > 0] \le E[X_n] = \Theta\!\left(n^{v_{H_0}} p_n^{e_{H_0}}\right) \to 0,
\]
when $p_n \ll n^{-v_{H_0}/e_{H_0}}$. The constant factor, which does not play a role in the asymptotics, accounts in particular for the number of automorphisms of $H_0$. Indeed note that a fixed set of $v_{H_0}$ vertices can contain several distinct copies of $H_0$, depending on its structure (and unlike the clique case).
From the proof of Claim 2.3.4, one might guess that the threshold function is $n^{-v_{H_0}/e_{H_0}}$. That is not the case in general. To see what can go wrong, consider the graph $H_0$ in Figure 2.5, whose edge density is $e_{H_0}/v_{H_0} = 6/5$. When $p_n \gg n^{-5/6}$,
Figure 2.5: Graph H0 and subgraph H.

the expected number of copies of $H_0$ in $G_n$ tends to $+\infty$. But observe that the subgraph $H$ of $H_0$ has the higher density $5/4$ and, hence, when $n^{-5/6} \ll p_n \ll n^{-4/5}$ the expected number of copies of $H$ tends to 0. By the first moment method, the probability that a copy of $H$—and therefore of $H_0$—is present in that regime is asymptotically negligible despite the diverging expectation for $H_0$. This leads to the following definition:
\[
r_{H_0} := \max\left\{ \frac{e_H}{v_H} : \text{subgraphs } H \subseteq H_0,\ e_H > 0 \right\}.
\]
Assume $H_0$ has at least one edge.

Claim 2.3.5. "Having a copy of $H_0$" has threshold function $n^{-1/r_{H_0}}$.

Proof. We proceed as in Claim 2.3.4. Let $H_0^*$ be a subgraph of $H_0$ achieving $r_{H_0}$. When $p_n \ll n^{-1/r_{H_0}}$, the probability that a copy of $H_0^*$ is in $G_n$ tends to 0 by the argument above. Therefore the same conclusion holds for $H_0$ itself.
Assume $p_n \gg n^{-1/r_{H_0}}$. Let $S_1, \ldots, S_{m_n}$ be an enumeration of the copies (as subgraphs) of $H_0$ in a complete graph on the vertices of $G_n$. Let $A_{n,i}$ be the event that $S_i \subseteq G_n$. Using again the notation of Corollary 2.3.3,
\[
\mu_n = \Theta\!\left(n^{v_{H_0}} p_n^{e_{H_0}}\right) = \Omega(\Phi_{H_0}(n)),
\]
where
\[
\Phi_{H_0}(n) := \min\left\{ n^{v_H} p_n^{e_H} : \text{subgraphs } H \subseteq H_0,\ e_H > 0 \right\}.
\]
Note that $\Phi_{H_0}(n) \to +\infty$ when $p_n \gg n^{-1/r_{H_0}}$ by definition of $r_{H_0}$. The events $A_{n,i}$ and $A_{n,j}$ are independent if $S_i$ and $S_j$ share no edge. Otherwise we write $i \overset{n}{\sim} j$. Note that there are $\Theta(n^{v_H} n^{2(v_{H_0} - v_H)})$ pairs $S_i$, $S_j$ whose intersection is isomorphic to $H$. The probability that both $S_i$ and $S_j$ of such a pair are present in $G_n$ is $\Theta(p_n^{e_H} p_n^{2(e_{H_0} - e_H)})$. Hence
\[
\begin{aligned}
\gamma_n &= \sum_{i \overset{n}{\sim} j} P[A_{n,i} \cap A_{n,j}]\\
&= \sum_{H \subseteq H_0,\, e_H > 0} \Theta\!\left( n^{2 v_{H_0} - v_H}\, p_n^{2 e_{H_0} - e_H} \right)\\
&= \frac{\Theta(\mu_n^2)}{\Theta(\Phi_{H_0}(n))}\\
&= o(\mu_n^2),
\end{aligned}
\]
where we used that $\Phi_{H_0}(n) \to +\infty$. The result follows from Corollary 2.3.3.

Going back to the example of Figure 2.5, the proof above confirms that when $n^{-5/6} \ll p_n \ll n^{-4/5}$ the second moment method fails for $H_0$ since $\Phi_{H_0}(n) \to 0$. In that regime, although there is in expectation a large number of copies of $H_0$, those copies are highly correlated as they are produced from a small (vanishing in expectation) number of copies of $H$—producing a large variance that helps to explain the failure of the second moment method.
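For small graphs, $r_{H_0}$ can be computed by brute force over vertex subsets, since the maximum of $e_H/v_H$ over subgraphs is attained on an induced subgraph. The sketch below is an illustration, not from the text; the edge list is a hypothetical guess for Figure 2.5, chosen only to match the densities $6/5$ and $5/4$ quoted above:

```python
from itertools import combinations

# Hypothetical H0 consistent with the quoted densities: vertices 0..4,
# a "diamond" H = K4 minus an edge on {0,1,2,3} (5 edges), plus a
# pendant vertex 4 attached by one edge.
edges_H0 = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 4)]

def r(edges):
    """Max over vertex subsets S of (# induced edges) / |S|; this equals
    the max of e_H / v_H over subgraphs H with e_H > 0."""
    verts = sorted({v for e in edges for v in e})
    best = 0.0
    for k in range(2, len(verts) + 1):
        for S in combinations(verts, k):
            eS = sum(1 for (u, v) in edges if u in S and v in S)
            if eS > 0:
                best = max(best, eS / k)
    return best

print(r(edges_H0))   # densest subgraph is the diamond: 5/4 = 1.25
assert abs(r(edges_H0) - 5 / 4) < 1e-12
```

The same brute-force search applies to any small $H_0$; for large graphs, densest-subgraph algorithms would be needed instead.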

Connectivity threshold

Next we use the second moment method to show that the threshold function for connectivity in the Erdős-Rényi random graph model is $\frac{\log n}{n}$. In fact we prove this result by deriving the threshold function for the presence of isolated vertices. The connection between the two is obvious in one direction: isolated vertices imply a disconnected graph. What is less obvious is that it also works the other way in the following sense: the two thresholds actually coincide.

Isolated vertices We begin with isolated vertices.


Claim 2.3.6. "Not having an isolated vertex" has threshold function $\frac{\log n}{n}$.

Proof. Let $X_n$ be the number of isolated vertices in the random graph $G_n \sim G_{n,p_n}$. Using $1 - x \le e^{-x}$ for all $x \in \mathbb{R}$ (see Exercise 1.16),
\[
E_{n,p_n}[X_n] = n (1 - p_n)^{n-1} \le e^{\log n - (n-1) p_n} \to 0, \tag{2.3.4}
\]


when $p_n \gg \frac{\log n}{n}$. So the first moment method gives one direction: $P_{n,p_n}[X_n > 0] \to 0$.
For the other direction, we use the second moment method. Let $A_{n,j}$ be the event that vertex $j$ is isolated. By the computation above, using $1 - x \ge e^{-x - x^2}$ for $x \in [0, 1/2]$ (see Exercise 1.16 again),
\[
\mu_n = \sum_i P_{n,p_n}[A_{n,i}] = n (1 - p_n)^{n-1} \ge e^{\log n - n p_n - n p_n^2}, \tag{2.3.5}
\]
which goes to $+\infty$ when $p_n \ll \frac{\log n}{n}$. Note that $A_{n,i}$ and $A_{n,j}$ are not independent for all $i \ne j$ (because the absence of an edge between $i$ and $j$ is part of both events) and
\[
P_{n,p_n}[A_{n,i} \cap A_{n,j}] = (1 - p_n)^{2(n-2)+1},
\]
so that
\[
\gamma_n = \sum_{i \ne j} P_{n,p_n}[A_{n,i} \cap A_{n,j}] = n(n-1)(1 - p_n)^{2n-3}.
\]
Because $\gamma_n$ is not $o(\mu_n^2)$, we cannot apply Corollary 2.3.3. Instead we use Theorem 2.3.2 directly. We have
\[
\begin{aligned}
\frac{E_{n,p_n}[X_n^2]}{E_{n,p_n}[X_n]^2} &= \frac{\mu_n + \gamma_n}{\mu_n^2}\\
&\le \frac{n(1-p_n)^{n-1} + n^2 (1-p_n)^{2n-3}}{n^2 (1-p_n)^{2n-2}}\\
&\le \frac{1}{n (1-p_n)^{n-1}} + \frac{1}{1-p_n}, \tag{2.3.6}
\end{aligned}
\]
which is $1 + o(1)$ when $p_n \ll \frac{\log n}{n}$ by (2.3.5). The second moment method implies that $P_{n,p_n}[X_n > 0] \to 1$ in that case.
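Both regimes are visible numerically in the exact first moment $E_{n,p_n}[X_n] = n(1-p_n)^{n-1}$; a quick check (an illustration, not from the text):

```python
from math import log

def expected_isolated(n, p):
    # E[X_n] = n (1 - p)^(n-1): each vertex is isolated w.p. (1 - p)^(n-1).
    return n * (1 - p) ** (n - 1)

n = 10 ** 6
above = expected_isolated(n, 2 * log(n) / n)    # p_n >> log n / n
below = expected_isolated(n, 0.5 * log(n) / n)  # p_n << log n / n
print(above, below)
assert above < 1e-3   # essentially no isolated vertices expected
assert below > 1e2    # many isolated vertices expected
```

For $p_n = c \log n / n$ the expectation behaves like $n^{1-c}$, so the constant $c = 1$ is exactly where the count switches between vanishing and diverging.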

Connectivity We use Claim 2.3.6 to study the threshold for connectivity.


Claim 2.3.7. Connectivity has threshold function $\frac{\log n}{n}$.

Proof. We start with the easy direction. If $p_n \ll \frac{\log n}{n}$, Claim 2.3.6 implies that the graph has at least one isolated vertex—and therefore is necessarily disconnected—with probability going to 1 as $n \to +\infty$.
Assume now that $p_n \gg \frac{\log n}{n}$. Let $D_n$ be the event that $G_n$ is disconnected. To bound $P_{n,p_n}[D_n]$, we let $Y_k$ be the number of subsets of $k$ vertices that are
disconnected from all other vertices in the graph, for $k \in \{1, \ldots, n/2\}$. Then, by the first moment method,
\[
P_{n,p_n}[D_n] = P_{n,p_n}\!\left[\sum_{k=1}^{n/2} Y_k > 0\right] \le \sum_{k=1}^{n/2} E_{n,p_n}[Y_k].
\]
The expectation of $Y_k$ is straightforward to bound. Using that $k \le n/2$ and $\binom{n}{k} \le n^k$,
\[
E_{n,p_n}[Y_k] = \binom{n}{k} (1-p_n)^{k(n-k)} \le \left( n (1-p_n)^{n/2} \right)^k.
\]
The expression in parentheses is $o(1)$ when $p_n \gg \frac{\log n}{n}$ by a calculation similar to (2.3.4). Summing over $k$,
\[
P_{n,p_n}[D_n] \le \sum_{k=1}^{+\infty} \left( n (1-p_n)^{n/2} \right)^k = O\!\left( n (1-p_n)^{n/2} \right) = o(1),
\]
where we used that the geometric series (started at $k = 1$) is dominated asymptotically by its first term. So the probability of being disconnected goes to 0 when $p_n \gg \frac{\log n}{n}$ and we have proved the claim.
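The sharpness of this threshold shows up already at moderate $n$. Below is a small seeded simulation (an illustration, not from the text) that samples $G_{n,p}$ and tests connectivity with a union-find structure:

```python
import random
from math import log

def connected_gnp(n, p, rng):
    # Sample each of the C(n, 2) edges independently and merge components
    # with union-find; the graph is connected iff one component remains.
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x
    comps = n
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                ru, rv = find(u), find(v)
                if ru != rv:
                    parent[ru] = rv
                    comps -= 1
    return comps == 1

rng = random.Random(2023)
n = 300
assert connected_gnp(n, 4 * log(n) / n, rng)        # well above log n / n
assert not connected_gnp(n, 0.2 * log(n) / n, rng)  # well below log n / n
```

With these parameters the failure probabilities are astronomically small (of order $n e^{-np}$ on the supercritical side), so the asserts are reliable despite the randomness.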

A closer look We have shown that connectivity and the absence of isolated vertices have the same threshold function. In fact, in a sense, isolated vertices are the "last obstacle" to connectivity. A slight modification of the proof above leads to the following more precise result. For $k \in \{1, \ldots, n/2\}$, let $Z_k$ be the number of connected components of size $k$ in $G_n$. In particular, $Z_1$ is the number of isolated vertices. We consider the "critical window" $p_n = \frac{c_n}{n}$ where $c_n := \log n + s$ for some fixed $s \in \mathbb{R}$. We show that, in that regime, asymptotically the graph is typically composed of a large connected component together with some isolated vertices. Formally, we prove the following claim, which says that with probability close to one, either the graph is connected or there are some isolated vertices together with a (necessarily unique) connected component of size greater than $n/2$.

Claim 2.3.8.
\[
P_{n,p_n}[Z_1 > 0] \ge \frac{1}{1 + e^s} + o(1) \qquad \text{and} \qquad P_{n,p_n}\!\left[\sum_{k=2}^{n/2} Z_k > 0\right] = o(1).
\]
The limit of $P_{n,p_n}[Z_1 > 0]$ can be computed explicitly using the method of moments. See Exercise 2.19.
Proof of Claim 2.3.8. We first consider isolated vertices. From (2.3.5), (2.3.6) and the second moment method,
\[
P_{n,p_n}[Z_1 > 0] \ge \left( e^{-\log n + n p_n + n p_n^2} + \frac{1}{1 - p_n} \right)^{-1} = \frac{1}{1 + e^s} + o(1),
\]
as $n \to +\infty$ by our choice of $p_n$.
To bound the number of components of size $k > 1$, we note first that the random variable $Y_k$ used in the previous claim (which imposes no condition on the edges between the vertices in the subsets of size $k$) is too loose to provide a suitable bound. Instead, to bound the probability that a subset of $k$ vertices forms a connected component, we observe that a connected component is characterized by two properties: it is disconnected from the rest of the graph; and it contains a spanning tree. Formally, for $k = 2, \ldots, n/2$, we let $Z'_k$ be the number of (not necessarily induced) maximal trees of size $k$ or, put differently, the number of spanning trees of connected components of size $k$. Then, by the first moment method, the probability that a connected component of size $> 1$ is present in $G_n$ is bounded by
\[
P_{n,p_n}\!\left[\sum_{k=2}^{n/2} Z_k > 0\right] \le P_{n,p_n}\!\left[\sum_{k=2}^{n/2} Z'_k > 0\right] \le \sum_{k=2}^{n/2} E_{n,p_n}[Z'_k]. \tag{2.3.7}
\]
To bound the expectation of $Z'_k$, we use Cayley's formula, which states that there are $k^{k-2}$ trees on a set of $k$ labeled vertices. Recall further that a tree on $k$ vertices has $k-1$ edges (see Exercise 1.7). Hence,
\[
E_{n,p_n}[Z'_k] = \underbrace{\binom{n}{k} k^{k-2}}_{\text{(a)}}\ \underbrace{p_n^{k-1}}_{\text{(b)}}\ \underbrace{(1-p_n)^{k(n-k)}}_{\text{(c)}},
\]
where (a) is the number of trees of size $k$ (as subgraphs) in a complete graph of size $n$, (b) is the probability that such a tree is present in the graph, and (c) is the probability that this tree is disconnected from every other vertex in the graph.
Using that $k! \ge (k/e)^k$ (see Appendix A) and $1 - x \le e^{-x}$ for all $x \in \mathbb{R}$ (see Exercise 1.16),
\[
\begin{aligned}
E_{n,p_n}[Z'_k] &\le \frac{n^k}{k!}\, k^{k-2}\, p_n^{k-1}\, (1-p_n)^{k(n-k)}\\
&\le n^k\, \frac{e^k}{k^k}\, k^k\, n\, p_n^k\, e^{-p_n k (n-k)}\\
&= n \left( e\, c_n\, e^{-(1 - \frac{k}{n}) c_n} \right)^k\\
&= n \left( e (\log n + s)\, e^{-(1 - \frac{k}{n})(\log n + s)} \right)^k.
\end{aligned}
\]
For $k \le n/2$, the expression in parentheses is $o(1)$. In fact, for $2 \le k \le n/2$, $E_{n,p_n}[Z'_k] = o(1)$. Furthermore, summing over $k \ge 3$,
\[
\sum_{k=3}^{n/2} E_{n,p_n}[Z'_k] \le n \sum_{k=3}^{+\infty} \left( e (\log n + s)\, e^{-\frac{1}{2}(\log n + s)} \right)^k = O(n^{-1/2} \log^3 n) = o(1).
\]
Plugging this back into (2.3.7) gives the second claim in the statement.

2.3.3 Percolation: critical value on trees and branching number

Consider bond percolation (see Definition 1.2.1) on the infinite $d$-regular tree $T_d$. Root the tree arbitrarily at a vertex 0 and let $C_0$ be the open cluster of the root. In this section we illustrate the use of the first and second moment methods on the identification of the critical value
\[
p_c(T_d) = \sup\{p \in [0,1] : \theta(p) = 0\},
\]
where recall that the percolation function is $\theta(p) = P_p[|C_0| = +\infty]$. We then consider general trees, introduce the branching number, and present a weighted version of the second moment method.

Regular tree Our main result for $T_d$ is the following.

Claim 2.3.9.
\[
p_c(T_d) = \frac{1}{d-1}.
\]
Proof. Let ∂n be the n-th level of Td , that is, the set of vertices at graph distance n
from 0. Let Xn be the number of vertices in ∂n ∩C0 . In order for the open cluster of
Figure 2.6: Most recent common ancestor of x and y.

the root to be infinite, there must be at least one vertex on the $n$-th level connected to the root by an open path. By the first moment method (Theorem 2.2.6),
\[
\theta(p) = P_p[|C_0| = +\infty] \le P_p[X_n > 0] \le E_p X_n = d (d-1)^{n-1} p^n \to 0, \tag{2.3.8}
\]
as $n \to +\infty$, for any $p < \frac{1}{d-1}$. Here we used that there is a unique path between 0 and any vertex in the tree to deduce that $P_p[x \in C_0] = p^n$ for $x \in \partial_n$. Equation (2.3.8) implies half of the claim: $p_c(T_d) \ge \frac{1}{d-1}$.
The second moment method gives a lower bound on $P_p[X_n > 0]$. To simplify the notation, it is convenient to introduce the "branching ratio" $b := d - 1$. We say that $x$ is a descendant of $z$ if the path between 0 and $x$ goes through $z$. Each $z \ne 0$ has $d-1$ descendant subtrees, that is, subtrees of $T_d$ rooted at $z$ made of all descendants of $z$. Let $x \wedge y$ be the most recent common ancestor of $x$ and $y$, that is, the furthest vertex from 0 that lies on both the path from 0 to $x$ and the path from 0 to $y$; see Figure 2.6. Letting
\[
\mu_n := E_p[X_n] = E_p\!\left[\sum_{x \in \partial_n} \mathbf{1}_{\{x \in C_0\}}\right] = (b+1)\, b^{n-1} p^n,
\]
we have
 2 
X
Ep [Xn2 ] = Ep  1{x∈C0 }  
x∈∂n
X
= Pp [x, y ∈ C0 ]
x,y∈∂n

X n−1
X X
= Pp [x ∈ C0 ] + 1{x∧y∈∂m } pm p2(n−m)
x∈∂n m=0 x,y∈∂n
n−1
X
= µn + (b + 1)bn−1 (b − 1)b(n−m)−1 p2n−m
m=0
+∞
X
≤ µn + (b + 1)(b − 1)b 2n−2 2n
p (bp)−m
m=0
b−1 1
= µn + µ2n · · ,
b + 1 1 − (bp)−1
where, on the fourth line, we used that all vertices on the n-th level are equivalent
and that, for a fixed x, the set {y : x ∧ y ∈ ∂m } is composed of those vertices in ∂n
that are descendants of x ∧ y but not in the descendant subtree of x ∧ y containing
1
x. When p > d−1 = 1b , dividing by (Ep Xn )2 = µ2n → +∞, we get

Ep [Xn2 ] 1 b−1 1
≤ + · (2.3.9)
(Ep Xn )2 µn b + 1 1 − (bp)−1
b−1 1
≤ 1+ ·
b + 1 1 − (bp)−1
=: Cb,p .

By the second moment method (Theorem 2.3.2) and monotonicity,


−1
θ(p) = Pp [|C0 | = +∞] = Pp [∀n, Xn > 0] = lim Pp [Xn > 0] ≥ Cb,p > 0,
n

which concludes the proof. (Note that the version of the second moment method
in Equation (2.3.1) does not work here. Subtract 1 in (2.3.9) and take p close to
1/b.)

The argument above relies crucially on the fact that, in a tree, any two vertices
are connected by a unique path. For instance, approximating Pp [x ∈ C0 ] is much
harder on a lattice. Note furthermore that, intuitively, the reason why the first mo-
ment captures the critical threshold exactly in this case is that bond percolation on
Td is a “branching process” (defined formally and studied at length in Chapter 6),
where Xn represents the “population size at generation n.” The qualitative behav-
ior of a branching process is governed by its expectation: when the mean number of
children bp exceeds 1, the process grows exponentially on average and “explodes”
with positive probability (see Theorem 6.1.6). We will come back to this point of
view in Section 6.2.4 where branching processes are used to give a more refined
analysis of bond percolation on Td .
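This branching-process picture can be simulated directly. The sketch below (a seeded illustration, not from the text) tracks the generation sizes of the open cluster on $T_3$, where $p_c = 1/2$, on either side of criticality:

```python
import random

def survives(p, d, depth, rng, cap=1000):
    # Generation sizes of the open cluster on T_d: the root has d
    # child-edges, every other vertex has d - 1; each edge is open w.p. p.
    x = sum(rng.random() < p for _ in range(d))          # level 1
    for _ in range(depth - 1):
        if x == 0:
            return False
        if x >= cap:              # population exploded; treat as survival
            return True
        x = sum(rng.random() < p for _ in range(x * (d - 1)))
    return x > 0

rng = random.Random(42)
trials = 200
frac_super = sum(survives(0.9, 3, 30, rng) for _ in range(trials)) / trials
frac_sub   = sum(survives(0.3, 3, 30, rng) for _ in range(trials)) / trials
print(frac_super, frac_sub)
assert frac_super > 0.5   # p = 0.9 > 1/2 = p_c(T_3): survival is likely
assert frac_sub < 0.5     # p = 0.3 < p_c(T_3): extinction is almost certain
```

The population cap is a standard truncation: once the cluster reaches 1000 vertices on a level, extinction by depth 30 is so unlikely that declaring survival is a safe shortcut.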

General trees Let $T$ be a locally finite tree (i.e., all its degrees are finite) with root 0. For an edge $e$, let $v_e$ be the endvertex of $e$ furthest from the root. We denote by $|e|$ the graph distance between 0 and $v_e$. Generalizing a previous definition from Section 1.1.1 to infinite, locally finite graphs, a cutset separating 0 and $+\infty$ is a finite set of edges $\Pi$ such that all infinite paths (which, recall, are self-avoiding by definition) starting at 0 go through $\Pi$. (For our purposes, it will suffice to assume that cutsets are finite by default.) For a cutset $\Pi$, we let $\Pi_v := \{v_e : e \in \Pi\}$. Repeating the argument in (2.3.8), for any cutset $\Pi$,
\[
\theta(p) = P_p[|C_0| = +\infty] \le P_p[C_0 \cap \Pi_v \ne \emptyset] \le \sum_{u \in \Pi_v} P_p[u \in C_0] = \sum_{e \in \Pi} p^{|e|}. \tag{2.3.10}
\]

This bound naturally leads to the following definition.

Definition 2.3.10 (Branching number). The branching number of $T$ is given by
\[
\mathrm{br}(T) = \sup\left\{ \lambda \ge 1 : \inf_{\text{cutset } \Pi} \sum_{e \in \Pi} \lambda^{-|e|} > 0 \right\}. \tag{2.3.11}
\]
Using the max-flow min-cut theorem (Theorem 1.1.15), the branching number can also be characterized in terms of a "flow to $+\infty$." We will not do this here. (But see Theorem 3.3.30.)
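For intuition, consider the $d$-regular tree: taking $\Pi_n$ to be all edges between levels $n-1$ and $n$ gives $\sum_{e \in \Pi_n} \lambda^{-|e|} = d(d-1)^{n-1}\lambda^{-n}$, which tends to 0 precisely when $\lambda > d-1$; these cutsets alone already show $\mathrm{br}(T_d) \le d-1$ (the matching lower bound requires controlling all cutsets). A quick numeric check (an illustration, not from the text):

```python
def level_cutset_sum(d, lam, n):
    # All d (d-1)^(n-1) edges between levels n-1 and n sit at distance n.
    return d * (d - 1) ** (n - 1) * lam ** (-n)

d = 3
# For lambda > d - 1 the level-cutset sums vanish as n grows ...
assert level_cutset_sum(d, 2.1, 200) < 1e-3
# ... while at lambda = d - 1 they stay bounded away from 0.
assert abs(level_cutset_sum(d, 2.0, 200) - 1.5) < 1e-12
```

The 3-1 tree example that follows shows that level cutsets can be wildly misleading: there, smarter cutsets drive the sum to 0 for every $\lambda > 1$.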
Equation (2.3.10) implies that $p_c(T) \ge \frac{1}{\mathrm{br}(T)}$. Remarkably, this bound is tight. The proof is based on a "weighted" second moment method argument.

Claim 2.3.11. For any rooted, locally finite tree $T$,
\[
p_c(T) = \frac{1}{\mathrm{br}(T)}.
\]
Proof. Suppose $p < \frac{1}{\mathrm{br}(T)}$. Then $p^{-1} > \mathrm{br}(T)$ and the sum in (2.3.10) can be made arbitrarily small by definition of the branching number, that is, $\theta(p) = 0$. Hence we have shown that $p_c(T) \ge \frac{1}{\mathrm{br}(T)}$.
To argue in the other direction, let $p > \frac{1}{\mathrm{br}(T)}$, $p^{-1} < \lambda < \mathrm{br}(T)$, and $\varepsilon > 0$ be such that
\[
\sum_{e \in \Pi} \lambda^{-|e|} \ge \varepsilon \tag{2.3.12}
\]
for all cutsets $\Pi$. The existence of such an $\varepsilon$ is guaranteed by the definition of the branching number. As in the proof of Claim 2.3.9, we use that $\theta(p)$ is the limit as $n \to +\infty$ of the probability that $C_0$ reaches the $n$-th level (i.e., the vertices at graph distance $n$ from the root 0, which is necessarily a finite set in a locally finite tree). However, this time, we use a weighted count on the $n$-th level. Let $T_n$ be the first $n$ levels of $T$ and, as before, let $\partial_n$ be the vertices on the $n$-th level. For a probability measure $\nu_n$ on $\partial_n$, we define the weighted count
\[
X_n = \sum_{z \in \partial_n} \frac{\nu_n(z)}{P_p[z \in C_0]}\, \mathbf{1}_{\{z \in C_0\}}.
\]
The purpose of the denominator is normalization, that is,
\[
E_p X_n = \sum_{z \in \partial_n} \nu_n(z) = 1.
\]
Observe that, while $\nu_n(z)$ may be 0 for some $z$'s (but not all), we still have that $\{X_n > 0,\ \forall n\}$ implies $\{|C_0| = +\infty\}$, which is what we need to apply the second moment method.
Because of (2.3.12), a natural choice of $\nu_n$ follows from the max-flow min-cut theorem (Theorem 1.1.15) applied to $T_n$ with source 0, sink $\partial_n$, and capacity constraint $|\phi(x,y)| \le \kappa(e) := \varepsilon^{-1} \lambda^{-|e|}$ for all edges $e = \{x, y\}$. Indeed, for all cutsets $\Pi$ in $T_n$ separating 0 and $\partial_n$, we have $\sum_{e \in \Pi} \kappa(e) = \sum_{e \in \Pi} \varepsilon^{-1} \lambda^{-|e|} \ge 1$ by (2.3.12). That then guarantees by Theorem 1.1.15 the existence of a unit flow $\phi$ from 0 to $\partial_n$ satisfying the capacity constraints. Define $\nu_n(z)$ to be the flow entering $z \in \partial_n$ under $\phi$. In particular, because $\phi$ is a unit flow, $\nu_n$ defines a probability measure. It remains to bound the second moment of $X_n$ under this
choice. We have
\[
\begin{aligned}
E_p X_n^2 &= E_p\!\left[\left(\sum_{z \in \partial_n} \frac{\nu_n(z)}{P_p[z \in C_0]}\, \mathbf{1}_{\{z \in C_0\}}\right)^2\right]\\
&= \sum_{x,y \in \partial_n} \nu_n(x)\nu_n(y)\, \frac{P_p[x, y \in C_0]}{P_p[x \in C_0]\, P_p[y \in C_0]}\\
&= \sum_{m=0}^{n} \sum_{x,y \in \partial_n} \mathbf{1}_{\{x \wedge y \in \partial_m\}}\, \nu_n(x)\nu_n(y)\, \frac{p^m\, p^{2(n-m)}}{p^{2n}}\\
&= \sum_{m=0}^{n} p^{-m} \left( \sum_{z \in \partial_m} \sum_{x,y \in \partial_n} \mathbf{1}_{\{x \wedge y = z\}}\, \nu_n(x)\nu_n(y) \right).
\end{aligned}
\]
In the expression in parentheses, for each $x$ descendant of $z$, the sum over $y$ is at most $\nu_n(x)\nu_n(z)$ by the definition of a flow; then the sum over those $x$'s gives at most $\nu_n(z)^2$. So
\[
\begin{aligned}
E_p X_n^2 &\le \sum_{m=0}^{n} p^{-m} \sum_{z \in \partial_m} \nu_n(z)^2\\
&\le \sum_{m=0}^{n} p^{-m} \sum_{z \in \partial_m} (\varepsilon^{-1} \lambda^{-m})\, \nu_n(z)\\
&\le \varepsilon^{-1} \sum_{m=0}^{+\infty} (p\lambda)^{-m}\\
&= \frac{\varepsilon^{-1}}{1 - (p\lambda)^{-1}} =: C_{\varepsilon,\lambda,p} < +\infty,
\end{aligned}
\]
where the second line follows from the capacity constraint, and we used $p\lambda > 1$ on the last line. From the second moment method (recalling that $E_p X_n = 1$),
\[
\theta(p) = P_p[|C_0| = +\infty] \ge P_p[\forall n,\ X_n > 0] = \lim_n P_p[X_n > 0] \ge C_{\varepsilon,\lambda,p}^{-1} > 0,
\]
where the second equality follows from the construction of $\nu_n$. It follows that $\theta(p) > 0$, and hence $p_c(T) \le \frac{1}{\mathrm{br}(T)}$. That concludes the proof.

Note that Claims 2.3.9 and 2.3.11 imply that br(Td ) = d−1. The next example
is more striking and insightful.
Example 2.3.12 (The 3–1 tree). The 3–1 tree $\widehat{T}_{3-1}$ is an infinite rooted tree. We give a planar description. The root $\rho$ (level 0) is at the top. It has two children below it (level 1). Then on level $n$, for $n \ge 1$, the first $2^{n-1}$ vertices starting from the left have exactly 1 child and the next $2^{n-1}$ vertices have exactly 3 children. In particular level $n$ has $2^n$ vertices, which we denote by $u_{n,1}, \ldots, u_{n,2^n}$. For vertex $u_{n,j}$ we refer to $j/2^n$ as its relative position (on level $n$). So vertices have 1 or 3 children according to whether their relative position is $\le 1/2$ or $> 1/2$. Because the level size is growing at rate 2, it is tempting to conjecture that the branching number is 2—but that turns out to be way off.

Claim 2.3.13. $\mathrm{br}(\widehat{T}_{3-1}) = 1$.

What makes this tree entirely different from the infinite 2-ary tree, despite having the same level growth, is that each infinite path from the root in $\widehat{T}_{3-1}$ eventually "stops branching," with the sole exception of the rightmost path, which we refer to as the main path. Indeed, let $\Gamma = v_0 \sim v_1 \sim v_2 \sim \cdots$ with $v_0 = \rho$ be an infinite path distinct from the main path. Let $x_i$ be the relative position of $v_i$, $i \ge 1$. Let $v_k$ be the first vertex of $\Gamma$ not on the main path. It lies on the $k$-th level.

Lemma 2.3.14. Let $v$ be a vertex that is not on the main path with relative position $x$ and assume that $0 \le x \le \alpha < 1$. Let $w$ be a child of $v$ and denote by $y$ its relative position. Then
\[
y \le
\begin{cases}
\frac{1}{2} x, & \text{if } x \le 1/2,\\[2pt]
x - \frac{1}{2}(1 - \alpha), & \text{otherwise.}
\end{cases}
\]

Proof. Assume without loss of generality that $v = u_{n,j}$ for some $n$ and $j < 2^n$. If $j \le 2^{n-1}$, then by construction $v$ has exactly one child, with relative position
\[
y = \frac{j}{2^{n+1}} = \frac{1}{2} x.
\]
That proves the first claim.
If $j > 2^{n-1}$, then all vertices to the right of $v$ have 3 children, all of whom are to the right of the children of $v$. Hence the children of $v$ have relative position at most
\[
y \le \frac{2^{n+1} - 3(2^n - j)}{2^{n+1}} = \frac{3j - 2^n}{2^{n+1}} = \frac{3}{2} x - \frac{1}{2}.
\]
Subtracting $x$ and using $x \le \alpha$ gives the second claim.
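The contraction in Lemma 2.3.14 can be iterated numerically to watch a path off the main path lose its branching (an illustration, not from the text):

```python
def child_position_bound(x, alpha):
    # Upper bound of Lemma 2.3.14 on the relative position of a child.
    return 0.5 * x if x <= 0.5 else x - 0.5 * (1 - alpha)

k = 10                       # the path leaves the main path at level k
alpha = 1 - 2.0 ** (-k)      # off the main path, position is <= alpha
x = alpha
trace = [x]
for _ in range(5000):
    x = child_position_bound(x, alpha)
    trace.append(x)

# The position drifts below 1/2 and is then halved forever; since only
# vertices with relative position > 1/2 have 3 children, branching stops.
assert any(t <= 0.5 for t in trace)
assert trace[-1] < 1e-100
```

The linear drift phase (subtracting $(1-\alpha)/2 = 1/2^{k+1}$ per level) lasts roughly $2^k$ levels, after which the geometric-decay phase takes over.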

We now apply Lemma 2.3.14 to $v_k$ as defined above and its descendants on $\Gamma$ with $\alpha = 1 - 1/2^k$. We get that the relative position decreases from $v_k$ by $1/2^{k+1}$ on each level until it falls below $1/2$, at which point it gets cut in half at each level. Once this last regime is reached, each vertex on $\Gamma$ from then on has exactly one child—that is, there is no more branching.
We are now ready to prove the claim.

Proof of Claim 2.3.13. Take any $\lambda > 1$. From the definition of the branching number (Definition 2.3.10), it suffices to find a sequence of cutsets $(\Pi_n)_n$ such that
\[
\sum_{e \in \Pi_n} \lambda^{-|e|} \to 0,
\]
as $n \to +\infty$. What does not work is to choose $\Pi_n := \Lambda_n$ to be the set of edges between level $n-1$ and level $n$, since we then have
\[
\sum_{e \in \Lambda_n} \lambda^{-|e|} = 2^n \lambda^{-n},
\]
which diverges whenever $\lambda < 2$. Instead we construct a new cutset $\Phi_n$ based on $\Lambda_n$ as follows. We divide up $\Lambda_n$ into the disjoint union $\Lambda_n^- \cup \Lambda_n^+$, where $\Lambda_n^-$ are the edges whose endvertex on level $n$ has relative position $\le 1/2$ and $\Lambda_n^+$ are the rest of the edges. Start with $\Phi_n := \emptyset$.
Step 1. For each edge $e$ in $\Lambda_n^-$, letting $v$ be the endvertex of $e$ on level $n$, add to $\Phi_n$ the edge $\{v', v''\}$, where $v'$ and $v''$ are the unique descendants of $v$ on level $m_n - 1$ and $m_n$ respectively. The value of $m_n \ge n$ is chosen so that
\[
2^{n-1} \lambda^{-m_n} \le \frac{1}{2^n}. \tag{2.3.13}
\]
Any infinite path from the root going through one of the edges in $\Lambda_n^-$ has to go through the edge that replaced it in $\Phi_n$, since there is no branching below that point by Lemma 2.3.14.
Step 2. We also add to Φn the edge {w0 , w00 } on the main path where w0 =
u`n −1,2`n −1 is on level `n − 1 and w00 = u`n ,2`n is on level `n . We mean for
the value of `n to be such that any infinite path going through an edge in Λ+ n has
0 00
to go through {w , w } first. That is, we need all vertices of level n with relative
position > 1/2 to be a descendant of w00 . The number of descendants of w00 on
level J > `n is 3J−`n until the last J such that it is ≤ 2J−1 , which we denote by
J ∗ . A quick calculation gives
 
`n log 3 − log 2
J∗ = .
log 3 − log 2
CHAPTER 2. MOMENTS AND TAILS 63

After level $J^*$, the leftmost descendant of $w''$ has relative position ≤ 1/2 by Lemma 2.3.14. Therefore we need $n > J^*$ to ensure that $w''$ has the desired property. Taking

$\ell_n = \left\lfloor n\, \frac{\log(3/2)}{\log 3} \right\rfloor,$   (2.3.14)

will do for n large enough.
Finishing up. By construction, $\Phi_n$ is a cutset for all n large enough. Moreover,

$\sum_{e \in \Phi_n} \lambda^{-|e|} = 2^{n-1}\lambda^{-m_n} + \lambda^{-\ell_n} < \frac{1}{n},$

for n large enough, where we used (2.3.13) and (2.3.14). Taking n → +∞ gives the claim.

As a consequence of Claims 2.3.11 and 2.3.13, $|\mathcal{C}_\rho| < +\infty$ almost surely for all p < 1 on $\widehat{T}_{3-1}$. J

2.4 Chernoff-Cramér method


Chebyshev’s inequality (Theorem 2.1.2) gives a bound on the concentration around its mean of a square integrable random variable. It is, in general, best possible. Indeed take X to be $\mu + b\sigma$ or $\mu - b\sigma$ with probability $(2b^2)^{-1}$ each, and µ otherwise. Then $\mathbb{E}X = \mu$, $\mathrm{Var}\,X = \sigma^2$, and for $\beta = b\sigma$,

$P[|X - \mathbb{E}X| \ge \beta] = P[|X - \mathbb{E}X| = \beta] = \frac{1}{b^2} = \frac{\mathrm{Var}\,X}{\beta^2}.$

However, in many cases, much stronger bounds can be derived. For instance, if X ∼ N(0, 1), by the following lemma

$P[|X - \mathbb{E}X| \ge \beta] \sim \sqrt{\frac{2}{\pi}}\, \beta^{-1} \exp(-\beta^2/2) \ll \frac{1}{\beta^2},$   (2.4.1)

as β → +∞. Indeed:

Lemma 2.4.1. For x > 0,

$(x^{-1} - x^{-3})\, e^{-x^2/2} \le \int_x^{+\infty} e^{-y^2/2}\, dy \le x^{-1} e^{-x^2/2}.$

Proof. By the change of variable y = x + z and using $e^{-z^2/2} \le 1$,

$\int_x^{+\infty} e^{-y^2/2}\, dy \le e^{-x^2/2} \int_0^{+\infty} e^{-xz}\, dz = e^{-x^2/2}\, x^{-1}.$

For the other direction, by differentiation,

$\int_x^{+\infty} (1 - 3y^{-4})\, e^{-y^2/2}\, dy = (x^{-1} - x^{-3})\, e^{-x^2/2}.$

In this section we discuss the Chernoff-Cramér method, which produces exponen-


tial tail bounds, provided the moment-generating function (see Section 2.1.1) is
finite in a neighborhood of 0.

2.4.1 Tail bounds via the moment-generating function


When the variance is finite, squaring within Markov’s inequality (Theorem 2.1.1) produces Chebyshev’s inequality (Theorem 2.1.2). This “boosting” can be pushed further when stronger integrability conditions hold.

Chernoff-Cramér We refer to (2.4.2) in the next lemma as the Chernoff-Cramér bound.

Lemma 2.4.2 (Chernoff-Cramér bound). Assume X is a random variable such that $M_X(s) < +\infty$ for $s \in (-s_0, s_0)$ for some $s_0 > 0$. For any β > 0 and s > 0,

$P[X \ge \beta] \le \exp\left[-\{s\beta - \Psi_X(s)\}\right],$   (2.4.2)

where

$\Psi_X(s) := \log M_X(s),$

is the cumulant-generating function of X.

Proof. Exponentiating within Markov’s inequality gives for s > 0

$P[X \ge \beta] = P[e^{sX} \ge e^{s\beta}] \le \frac{M_X(s)}{e^{s\beta}} = \exp\left[-\{s\beta - \Psi_X(s)\}\right].$

Returning to the Gaussian case, let X ∼ N(0, ν) where ν > 0 is the variance and note that

$M_X(s) = \int_{-\infty}^{+\infty} e^{sx}\, \frac{1}{\sqrt{2\pi\nu}}\, e^{-\frac{x^2}{2\nu}}\, dx = e^{\frac{s^2\nu}{2}} \int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi\nu}}\, e^{-\frac{(x - s\nu)^2}{2\nu}}\, dx = \exp\left(\frac{s^2\nu}{2}\right).$

By straightforward calculus, the optimal choice of s in (2.4.2) gives the exponent

$\sup_{s > 0}\left(s\beta - s^2\nu/2\right) = \frac{\beta^2}{2\nu},$   (2.4.3)

achieved at $s_\beta = \beta/\nu$. For β > 0, this leads to the bound

$P[X \ge \beta] \le \exp\left(-\frac{\beta^2}{2\nu}\right),$   (2.4.4)

which is much sharper than Chebyshev’s inequality for large β (compare to (2.4.1)).
As another toy example, we consider simple random walk on Z.
Lemma 2.4.3 (Chernoff bound for simple random walk on Z). Let $Z_1, \dots, Z_n$ be independent Rademacher variables, that is, they are {−1, 1}-valued random variables with $P[Z_i = 1] = P[Z_i = -1] = 1/2$. Let $S_n = \sum_{i \le n} Z_i$. Then, for any β > 0,

$P[S_n \ge \beta] \le e^{-\beta^2/2n}.$   (2.4.5)

Proof. The moment-generating function of $Z_1$ can be bounded as follows:

$M_{Z_1}(s) = \frac{e^s + e^{-s}}{2} = \sum_{j \ge 0} \frac{s^{2j}}{(2j)!} \le \sum_{j \ge 0} \frac{(s^2/2)^j}{j!} = e^{s^2/2}.$   (2.4.6)

Using independence and taking s = β/n in the Chernoff-Cramér bound (2.4.2), we get

$P[S_n \ge \beta] \le \exp\left(-s\beta + n\Psi_{Z_1}(s)\right) \le \exp\left(-s\beta + ns^2/2\right) = e^{-\beta^2/2n},$

which concludes the proof.



Observe the similarity between (2.4.5) and the Gaussian bound (2.4.4), if one takes ν to be the variance of $S_n$, that is,

$\nu = \mathrm{Var}[S_n] = n\,\mathrm{Var}[Z_1] = n\,\mathbb{E}[Z_1^2] = n,$
where we used that Z1 is centered. The central limit theorem says that simple
random walk is well approximated by a Gaussian in the “bulk” of the distribu-
tion; the bound above extends the approximation in the “large deviation” regime.
The bounding technique used in the proof of Lemma 2.4.3 will be substantially
extended in Section 2.4.2.
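The bound (2.4.5) is easy to try out numerically by comparing the empirical tail frequency of simple random walk with the Chernoff bound; a minimal sketch in Python (the sample sizes and parameters below are arbitrary choices, not from the text):

```python
import math
import random

def srw_tail_empirical(n, beta, trials, seed=0):
    """Estimate P[S_n >= beta] for S_n a sum of n Rademacher variables."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        s = sum(1 if rng.random() < 0.5 else -1 for _ in range(n))
        if s >= beta:
            hits += 1
    return hits / trials

def srw_tail_chernoff(n, beta):
    """The Chernoff bound e^{-beta^2/(2n)} from (2.4.5)."""
    return math.exp(-beta**2 / (2 * n))

n, beta = 100, 30
emp = srw_tail_empirical(n, beta, trials=5000)
bound = srw_tail_chernoff(n, beta)
# The empirical frequency stays below the Chernoff bound e^{-4.5} ~ 0.011.
```

Here β is three standard deviations, so the empirical tail is a few parts in a thousand while the bound is about 0.011: not tight, but of the right exponential order.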
Example 2.4.4 (Set balancing). Let $v_1, \dots, v_m$ be arbitrary non-zero vectors in $\{0,1\}^n$. Think of $v_i = (v_{i,1}, \dots, v_{i,n})$ as representing a subset of $[n] = \{1, \dots, n\}$: $v_{i,j} = 1$ indicates that j is in subset i. Suppose we want to partition [n] into two groups such that the subsets corresponding to the $v_i$’s are as balanced as possible, that is, are as close as possible to having the same number of elements from each group. More formally, we seek a vector $x = (x_1, \dots, x_n) \in \{-1, +1\}^n$ such that $B^* = \max_{i=1,\dots,m} |x \cdot v_i|$ is as small as possible.

A simple random choice does well: select each $x_i$ independently, uniformly at random in {−1, +1}. Fix ε > 0. We claim that

$P\left[B^* \ge \sqrt{2n(\log m + \log(2\varepsilon^{-1}))}\right] \le \varepsilon.$   (2.4.7)

Indeed, by (2.4.5) (considering only the non-zero entries of $v_i$),

$P\left[|x \cdot v_i| \ge \sqrt{2n(\log m + \log(2\varepsilon^{-1}))}\right] \le 2\exp\left(-\frac{2n(\log m + \log(2\varepsilon^{-1}))}{2\|v_i\|_1}\right) \le \frac{\varepsilon}{m},$

where we used that $\|v_i\|_1 \le n$. Taking a union bound over the m vectors gives the result. In (2.4.7), the $\sqrt{n}$ term on the right-hand side of the inequality is to be expected since it is the standard deviation of $|x \cdot v_i|$ in the worst case. The power of the exponential tail bound (2.4.5) appears in the logarithmic terms, which would have been replaced with something much larger if one had used Chebyshev’s inequality instead. J
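The random signing of Example 2.4.4 can be simulated directly; the following sketch (dimensions and ε are arbitrary choices) draws random subsets and checks the discrepancy against the threshold in (2.4.7):

```python
import math
import random

def discrepancy(vectors, x):
    """B* = max_i |x . v_i| for a signing x in {-1,+1}^n."""
    return max(abs(sum(xj * vj for xj, vj in zip(x, v))) for v in vectors)

rng = random.Random(42)
n, m, eps = 200, 50, 0.1
vectors = [[rng.randint(0, 1) for _ in range(n)] for _ in range(m)]  # random subsets
x = [rng.choice((-1, 1)) for _ in range(n)]                          # random signing
threshold = math.sqrt(2 * n * (math.log(m) + math.log(2 / eps)))
b_star = discrepancy(vectors, x)
# By (2.4.7), b_star <= threshold with probability at least 1 - eps.
```

In practice the typical discrepancy is well below the threshold: the claim (2.4.7) is a union bound over worst-case subsets.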
The Chernoff-Cramér bound is particularly useful for sums of independent random variables as the moment-generating function then factorizes; see (2.1.3). Let

$\Psi_X^*(\beta) = \sup_{s \in \mathbb{R}_+}\left(s\beta - \Psi_X(s)\right),$

be the Fenchel-Legendre dual of the cumulant-generating function of X.

Theorem 2.4.5 (Chernoff-Cramér method). Let $S_n = \sum_{i \le n} X_i$, where the $X_i$’s are i.i.d. random variables. Assume $M_{X_1}(s) < +\infty$ on $s \in (-s_0, s_0)$ for some $s_0 > 0$. For any β > 0,

$P[S_n \ge \beta] \le \exp\left(-n\, \Psi_{X_1}^*\!\left(\frac{\beta}{n}\right)\right).$   (2.4.8)

In particular, in the large deviations regime, that is, when β = bn for some b > 0, we have

$-\limsup_n \frac{1}{n} \log P[S_n \ge bn] \ge \Psi_{X_1}^*(b).$   (2.4.9)

Proof. Observe that, by taking a logarithm in (2.1.3), it holds that

$\Psi_{S_n}^*(\beta) = \sup_{s > 0}\left(s\beta - n\Psi_{X_1}(s)\right) = \sup_{s > 0}\, n\left(s\,\frac{\beta}{n} - \Psi_{X_1}(s)\right) = n\,\Psi_{X_1}^*\!\left(\frac{\beta}{n}\right).$

Now optimize over s in (2.4.2).
We use the Chernoff-Cramér method to derive a few standard bounds.

Poisson variables We start with the Poisson case. Let Z ∼ Poi(λ) be Poisson with mean λ, where recall that

$P[Z = k] = e^{-\lambda}\,\frac{\lambda^k}{k!}, \qquad k \in \mathbb{Z}_+.$
Letting X = Z − λ,

$\Psi_X(s) = \log\left(\sum_{\ell \ge 0} e^{-\lambda}\,\frac{\lambda^\ell}{\ell!}\, e^{s(\ell - \lambda)}\right) = \log\left(e^{-(1+s)\lambda}\sum_{\ell \ge 0}\frac{(e^s\lambda)^\ell}{\ell!}\right) = \log\left(e^{-(1+s)\lambda}\, e^{e^s\lambda}\right) = \lambda(e^s - s - 1),$

so that straightforward calculus gives for β > 0

$\Psi_X^*(\beta) = \sup_{s > 0}\left(s\beta - \lambda(e^s - s - 1)\right) = \lambda\left[\left(1 + \frac{\beta}{\lambda}\right)\log\left(1 + \frac{\beta}{\lambda}\right) - \frac{\beta}{\lambda}\right] =: \lambda\, h\!\left(\frac{\beta}{\lambda}\right),$
 
achieved at $s_\beta = \log(1 + \beta/\lambda)$, where h is defined as the expression in square brackets above. Plugging $\Psi_X^*(\beta)$ into Theorem 2.4.5 leads for β > 0 to the bound

$P[Z \ge \lambda + \beta] \le \exp\left(-\lambda\, h\!\left(\frac{\beta}{\lambda}\right)\right).$   (2.4.10)

A similar calculation for −(Z − λ) gives for β < 0

$P[Z \le \lambda + \beta] \le \exp\left(-\lambda\, h\!\left(\frac{\beta}{\lambda}\right)\right).$   (2.4.11)

If $S_n$ is a sum of n i.i.d. Poi(λ) variables, then by (2.4.9) for a > λ

$-\limsup_n \frac{1}{n}\log P[S_n \ge an] \ge \lambda\, h\!\left(\frac{a - \lambda}{\lambda}\right) = a \log\frac{a}{\lambda} - a + \lambda =: I_\lambda^{\mathrm{Poi}}(a),$   (2.4.12)

and similarly for a < λ

$-\limsup_n \frac{1}{n}\log P[S_n \le an] \ge I_\lambda^{\mathrm{Poi}}(a).$   (2.4.13)
In fact, these bounds follow immediately from (2.4.10) and (2.4.11) by noting that
Sn ∼ Poi(nλ) (see, e.g., Exercise 6.7).
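Because the Poisson tail can be summed exactly, the bound (2.4.10) is easy to sanity-check numerically; a minimal sketch (λ and β are arbitrary choices):

```python
import math

def poisson_upper_tail(lam, k0):
    """Exact P[Z >= k0] for Z ~ Poi(lam), truncating the sum far in the tail."""
    cutoff = int(lam + 40 * math.sqrt(lam) + 50)
    pmf = math.exp(-lam)          # P[Z = 0]
    tail = pmf if k0 <= 0 else 0.0
    for k in range(1, cutoff + 1):
        pmf *= lam / k            # P[Z = k] from P[Z = k-1]
        if k >= k0:
            tail += pmf
    return tail

def poisson_chernoff(lam, beta):
    """The bound exp(-lam * h(beta/lam)) from (2.4.10), for beta > 0."""
    r = beta / lam
    return math.exp(-lam * ((1 + r) * math.log(1 + r) - r))

lam, beta = 10.0, 8.0
exact = poisson_upper_tail(lam, 18)   # P[Z >= lam + beta], lam + beta = 18
bound = poisson_chernoff(lam, beta)
# exact is below bound, as (2.4.10) guarantees.
```

Computing the pmf iteratively avoids the factorial overflow one would hit with a naive `lam**k / k!` formula.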

Binomial variables and Chernoff bounds Let Z ∼ Bin(n, p) be a binomial random variable with parameters n and p. Recall that Z is a sum of i.i.d. indicators $Y_1, \dots, Y_n$ equal to 1 with probability p. The $Y_i$’s are also known as Bernoulli random variables or Bernoulli trials, and their law is denoted by Ber(p). We also refer to p as the success probability. Letting $X_i = Y_i - p$ and $S_n = Z - np$,

$\Psi_{X_1}(s) = \log\left(pe^s + (1-p)\right) - ps.$

For b ∈ (0, 1 − p), letting a = b + p, direct calculation gives

$\Psi_{X_1}^*(b) = \sup_{s > 0}\left(sb - \left(\log\left[pe^s + (1-p)\right] - ps\right)\right) = (1-a)\log\frac{1-a}{1-p} + a\log\frac{a}{p} =: D(a\|p),$   (2.4.14)

achieved at $s_b = \log\frac{(1-p)a}{p(1-a)}$. The function D(a‖p) in (2.4.14) is the so-called Kullback-Leibler divergence or relative entropy between two Bernoulli variables

with parameters a and p respectively. By (2.4.8) for β > 0

$P[Z \ge np + \beta] \le \exp\left(-n\, D(p + \beta/n \,\|\, p)\right).$

Applying the same argument to $Z' = n - Z$ gives a bound in the other direction.
Remark 2.4.6. In the large deviations regime, it can be shown that the previous bound is tight in the sense that

$-\frac{1}{n}\log P[Z \ge np + bn] \to D(p + b \,\|\, p) =: I_{n,p}^{\mathrm{Bin}}(b),$

as n → +∞. The theory of large deviations provides general results of this type. See for example [Dur10, Section 2.6]. Upper bounds will be enough for our purposes.
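The binomial bound $P[Z \ge np + \beta] \le \exp(-nD(p+\beta/n\|p))$ can likewise be compared with the exact tail; a quick sketch (parameters are arbitrary choices):

```python
import math

def binom_upper_tail(n, p, k0):
    """Exact P[Z >= k0] for Z ~ Bin(n, p)."""
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(k0, n + 1))

def kl_bernoulli(a, p):
    """D(a||p) between Bernoulli laws, as in (2.4.14)."""
    return (1 - a) * math.log((1 - a) / (1 - p)) + a * math.log(a / p)

n, p, k0 = 100, 0.3, 45          # tail at np + bn with b = 0.15
a = k0 / n
bound = math.exp(-n * kl_bernoulli(a, p))
exact = binom_upper_tail(n, p, k0)
# exact <= bound, as the Chernoff-Cramér method guarantees.
```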
The following related bounds, proved in Exercise 2.7, are often useful.

Theorem 2.4.7 (Chernoff bounds for Poisson trials). Let $Y_1, \dots, Y_n$ be independent {0, 1}-valued random variables with $P[Y_i = 1] = p_i$ and $\mu = \sum_i p_i$. These are called Poisson trials. Let $Z = \sum_i Y_i$. Then:

(i) Above the mean

(a) For any δ > 0,

$P[Z \ge (1+\delta)\mu] \le \left(\frac{e^{\delta}}{(1+\delta)^{(1+\delta)}}\right)^{\mu}.$

(b) For any 0 < δ ≤ 1,

$P[Z \ge (1+\delta)\mu] \le e^{-\mu\delta^2/3}.$

(ii) Below the mean

(a) For any 0 < δ < 1,

$P[Z \le (1-\delta)\mu] \le \left(\frac{e^{-\delta}}{(1-\delta)^{(1-\delta)}}\right)^{\mu}.$

(b) For any 0 < δ < 1,

$P[Z \le (1-\delta)\mu] \le e^{-\mu\delta^2/2}.$
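In each pair above, bound (b) is a weaker but more convenient consequence of bound (a); this domination is easy to confirm numerically on a grid, as in the following sketch (grid resolution and µ are arbitrary choices):

```python
import math

def chernoff_above(mu, delta):
    """Bound (i)(a): (e^delta / (1+delta)^(1+delta))^mu."""
    return (math.exp(delta) / (1 + delta) ** (1 + delta)) ** mu

def chernoff_below(mu, delta):
    """Bound (ii)(a): (e^(-delta) / (1-delta)^(1-delta))^mu."""
    return (math.exp(-delta) / (1 - delta) ** (1 - delta)) ** mu

mu = 5.0
for k in range(1, 100):
    d = k / 100
    # The (a)-bounds are at least as strong as the corresponding (b)-bounds.
    assert chernoff_above(mu, d) <= math.exp(-mu * d * d / 3) + 1e-12
    assert chernoff_below(mu, d) <= math.exp(-mu * d * d / 2) + 1e-12
```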

2.4.2 Sub-Gaussian and sub-exponential random variables


The bounds in Section 2.4.1 were obtained by computing the moment-generating
function explicitly (possibly with some approximations). This is not always possi-
ble. In this section, we give some important examples of tail bounds derived from
the Chernoff-Cramér method for broad classes of random variables under natural
conditions on their distributions.

Sub-Gaussian random variables


We begin with sub-Gaussian random variables which, as the name suggests, have
a moment-generating function that is bounded by that of a Gaussian.

General case Here is our key definition.


Definition 2.4.8 (Sub-Gaussian random variables). We say that a random variable X with mean µ is sub-Gaussian with variance factor ν if

$\Psi_{X-\mu}(s) \le \frac{s^2\nu}{2}, \qquad \forall s \in \mathbb{R},$   (2.4.15)

for some ν > 0. We use the notation X ∈ sG(ν).
Note that the right-hand side in (2.4.15) is the cumulant-generating function of
a N(0, ν). By the Chernoff-Cramér method and (2.4.3) it follows immediately that

$P[X - \mu \le -\beta] \vee P[X - \mu \ge \beta] \le \exp\left(-\frac{\beta^2}{2\nu}\right),$   (2.4.16)

where we used that X ∈ sG(ν) implies −X ∈ sG(ν). As a quick example, note
that this is the approach we took in Lemma 2.4.3, that is, we showed that a uniform
random variable in {−1, 1} (i.e., a Rademacher variable) is sub-Gaussian with
variance factor 1.
When considering (weighted) sums of independent sub-Gaussian random vari-
ables, we get the following.
Theorem 2.4.9 (General Hoeffding inequality). Suppose $X_1, \dots, X_n$ are independent random variables where, for each i, $X_i \in \mathrm{sG}(\nu_i)$ with $0 < \nu_i < +\infty$. For $w_1, \dots, w_n \in \mathbb{R}$, let $S_n = \sum_{i \le n} w_i X_i$. Then

$S_n \in \mathrm{sG}\left(\sum_{i=1}^n w_i^2 \nu_i\right).$

In particular, for all β > 0,

$P[S_n - \mathbb{E}S_n \ge \beta] \le \exp\left(-\frac{\beta^2}{2\sum_{i=1}^n w_i^2\nu_i}\right).$

Proof. Assume the $X_i$’s are centered. By independence and (2.1.3),

$\Psi_{S_n}(s) = \sum_{i \le n}\Psi_{w_iX_i}(s) = \sum_{i \le n}\Psi_{X_i}(sw_i) \le \sum_{i \le n}\frac{(sw_i)^2\nu_i}{2} = \frac{s^2\sum_{i \le n}w_i^2\nu_i}{2}.$

Bounded random variables For bounded random variables, the previous inequality reduces to a standard bound.

Theorem 2.4.10 (Hoeffding’s inequality for bounded variables). Let $X_1, \dots, X_n$ be independent random variables where, for each i, $X_i$ takes values in $[a_i, b_i]$ with $-\infty < a_i \le b_i < +\infty$. Let $S_n = \sum_{i \le n} X_i$. For all β > 0,

$P[S_n - \mathbb{E}S_n \ge \beta] \le \exp\left(-\frac{2\beta^2}{\sum_{i \le n}(b_i - a_i)^2}\right).$

By Theorem 2.4.9, it suffices to show that $X_i \in \mathrm{sG}(\nu_i)$ with $\nu_i = \frac{1}{4}(b_i - a_i)^2$. We first give a quick proof of a weaker version that uses a trick called symmetrization. Suppose the $X_i$’s are centered and satisfy $|X_i| \le c_i$ for some $c_i > 0$. Let $X_i'$ be an independent copy of $X_i$ and let $Z_i$ be an independent uniform random variable in {−1, 1}. For any s,

$\mathbb{E}\left[e^{sX_i}\right] = \mathbb{E}\left[e^{s\,\mathbb{E}[X_i - X_i' \,\mid\, X_i]}\right] \le \mathbb{E}\left[\mathbb{E}\left[e^{s(X_i - X_i')} \,\middle|\, X_i\right]\right] = \mathbb{E}\left[e^{s(X_i - X_i')}\right],$

where the first line comes from the taking out what is known lemma (Lemma B.6.16) and the fact that $X_i'$ is centered and independent of $X_i$, the second line follows from the conditional Jensen’s inequality (Lemma B.6.12), and the third line uses the tower property (Lemma B.6.16). Observe that $X_i - X_i'$ is symmetric, that is, identically distributed to $-(X_i - X_i')$. Hence, using that $Z_i$ is independent of both $X_i$ and $X_i'$, we get

$\mathbb{E}\left[e^{s(X_i - X_i')}\right] = \mathbb{E}\left[\mathbb{E}\left[e^{s(X_i - X_i')} \,\middle|\, Z_i\right]\right] = \mathbb{E}\left[\mathbb{E}\left[e^{sZ_i(X_i - X_i')} \,\middle|\, Z_i\right]\right] = \mathbb{E}\left[e^{sZ_i(X_i - X_i')}\right] = \mathbb{E}\left[\mathbb{E}\left[e^{sZ_i(X_i - X_i')} \,\middle|\, X_i, X_i'\right]\right].$

From (2.4.6) (together with Lemma B.6.15), the last line above is

$\le \mathbb{E}\left[e^{(s(X_i - X_i'))^2/2}\right] \le e^{4c_i^2 s^2/2},$

since $|X_i|, |X_i'| \le c_i$. Putting everything together, we arrive at

$\mathbb{E}\left[e^{sX_i}\right] \le e^{4c_i^2 s^2/2}.$

That is, $X_i$ is sub-Gaussian with variance factor $4c_i^2$. By Theorem 2.4.9, $S_n$ is sub-Gaussian with variance factor $\sum_{i \le n} 4c_i^2$ and

$P[S_n \ge t] \le \exp\left(-\frac{t^2}{8\sum_{i \le n} c_i^2}\right).$

Proof of Theorem 2.4.10. As pointed out above, it suffices to show that $X_i$ is sub-Gaussian with variance factor $\frac{1}{4}(b_i - a_i)^2$. This is the content of Hoeffding’s lemma below (which we will use again in Chapter 3). First an observation:

Lemma 2.4.11 (Variance of bounded random variables). For any random variable Z taking values in [a, b] with $-\infty < a \le b < +\infty$, we have

$\mathrm{Var}[Z] \le \frac{1}{4}(b - a)^2.$

Proof. Indeed

$\left|Z - \frac{a+b}{2}\right| \le \frac{b-a}{2},$

and

$\mathrm{Var}[Z] = \mathrm{Var}\left[Z - \frac{a+b}{2}\right] \le \mathbb{E}\left[\left(Z - \frac{a+b}{2}\right)^2\right] \le \left(\frac{b-a}{2}\right)^2.$

Lemma 2.4.12 (Hoeffding’s lemma). Let X be a random variable taking values in [a, b] for $-\infty < a \le b < +\infty$. Then $X \in \mathrm{sG}\left(\frac{1}{4}(b-a)^2\right)$.

Proof. Note first that $X - \mathbb{E}X \in [a - \mathbb{E}X,\, b - \mathbb{E}X]$ and $\frac{1}{4}((b - \mathbb{E}X) - (a - \mathbb{E}X))^2 = \frac{1}{4}(b - a)^2$. So without loss of generality we assume that $\mathbb{E}X = 0$. Because X is bounded, $M_X(s)$ is finite for all $s \in \mathbb{R}$. Hence, by (2.1.2),

$\Psi_X(0) = \log M_X(0) = 0, \qquad \Psi_X'(0) = \frac{M_X'(0)}{M_X(0)} = \mathbb{E}X = 0,$

and by a Taylor expansion

$\Psi_X(s) = \Psi_X(0) + s\Psi_X'(0) + \frac{s^2}{2}\Psi_X''(s^*) = \frac{s^2}{2}\Psi_X''(s^*),$

for some $s^* \in [0, s]$. Therefore it suffices to show that for all s

$\Psi_X''(s) \le \frac{1}{4}(b - a)^2.$   (2.4.17)

Note that

$\Psi_X''(s) = \frac{M_X''(s)}{M_X(s)} - \left(\frac{M_X'(s)}{M_X(s)}\right)^2 = \frac{1}{M_X(s)}\,\mathbb{E}\left[X^2 e^{sX}\right] - \left(\frac{1}{M_X(s)}\,\mathbb{E}\left[X e^{sX}\right]\right)^2 = \mathbb{E}\left[X^2 \frac{e^{sX}}{M_X(s)}\right] - \left(\mathbb{E}\left[X \frac{e^{sX}}{M_X(s)}\right]\right)^2.$

The trick to conclude is to notice that $\frac{e^{sx}}{M_X(s)}$ defines a density on [a, b] with respect to the law of X. The variance under this density (the last expression above) is less than $\frac{1}{4}(b-a)^2$ by Lemma 2.4.11. This establishes (2.4.17) and concludes the proof.

Remark 2.4.13. The change of measure above is known as tilting and is a standard trick
in large deviations theory. See for example [Dur10, Section 2.6].
Since we have shown that $X_i$ is sub-Gaussian with variance factor $\frac{1}{4}(b_i - a_i)^2$, Theorem 2.4.10 follows from Theorem 2.4.9.
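Hoeffding’s lemma can be probed numerically: for any finite-support law on [a, b], evaluate $\Psi_{X-\mathbb{E}X}(s)$ on a grid and compare it with $s^2(b-a)^2/8$. A sketch with two arbitrarily chosen laws:

```python
import math

def cgf_centered(support, probs, s):
    """Psi_{X-EX}(s) = log E[exp(s (X - EX))] for a finite-support law."""
    mean = sum(x * p for x, p in zip(support, probs))
    return math.log(sum(p * math.exp(s * (x - mean))
                        for x, p in zip(support, probs)))

def hoeffding_rhs(a, b, s):
    """The sub-Gaussian bound s^2 (b-a)^2 / 8 from Lemma 2.4.12."""
    return s * s * (b - a) ** 2 / 8

laws = [
    ((0.0, 1.0), (0.3, 0.7)),             # a Bernoulli-type law on [0, 1]
    ((-1.0, 0.2, 1.0), (0.2, 0.5, 0.3)),  # three points on [-1, 1]
]
for support, probs in laws:
    a, b = min(support), max(support)
    for k in range(-50, 51):
        s = k / 10
        assert cgf_centered(support, probs, s) <= hoeffding_rhs(a, b, s) + 1e-12
```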

Sub-exponential random variables


Unfortunately, not every random variable of interest is sub-Gaussian. A simple example is the square of a Gaussian variable. Indeed, suppose X ∼ N(0, 1). Then $W = X^2$ is $\chi^2$-distributed and its moment-generating function can be computed explicitly. Using the change of variable $u = x\sqrt{1 - 2s}$, for s < 1/2,

$M_W(s) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{+\infty} e^{sx^2}\, e^{-x^2/2}\, dx = \frac{1}{\sqrt{1-2s}} \times \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{+\infty} e^{-u^2/2}\, du = \frac{1}{(1-2s)^{1/2}}.$   (2.4.18)

When s ≥ 1/2, however, we have $M_W(s) = +\infty$. In particular, W cannot be sub-Gaussian for any variance factor ν > 0. (Note that centering W produces an additional factor of $e^{-s}$ in the moment-generating function which does not prevent

it from diverging.) Further confirming this observation, arguing as in (2.4.1), the upper tail of W decays as

$P[W \ge \beta] = P[|X| \ge \sqrt{\beta}] \sim \sqrt{\frac{2}{\pi}}\,(\sqrt{\beta})^{-1}\exp(-(\sqrt{\beta})^2/2) = \sqrt{\frac{2}{\pi\beta}}\,\exp(-\beta/2),$

as β → +∞. That is, it decays exponentially with β, but slower than the Gaussian tail.

General case We now define a broad class of distributions which have such exponential tail decay.

Definition 2.4.14 (Sub-exponential random variable). We say that a random variable X with mean µ is sub-exponential with parameters (ν, α) if

$\Psi_{X-\mu}(s) \le \frac{s^2\nu}{2}, \qquad \forall |s| \le \frac{1}{\alpha},$   (2.4.19)

for some ν, α > 0. We write X ∈ sE(ν, α).*

Observe that the key difference between (2.4.15) and (2.4.19) is the interval of s over which it holds. As we will see below, the parameter α dictates the exponential decay rate of the tail. The specific form of the bound in (2.4.19) is natural once one notices that, as |s| → 0, a centered random variable with variance ν (and a finite moment-generating function) should roughly satisfy

$\log \mathbb{E}[e^{sX}] \approx \log\left(1 + s\,\mathbb{E}[X] + \frac{s^2}{2}\mathbb{E}[X^2]\right) \approx \log\left(1 + \frac{s^2\nu}{2}\right) \approx \frac{s^2\nu}{2}.$

*More commonly, “sub-exponential” refers to the case α = ν.

Returning to the $\chi^2$ distribution, note that from (2.4.18) we have for |s| ≤ 1/4

$\Psi_{W-1}(s) = -s - \frac{1}{2}\log(1 - 2s) = -s - \frac{1}{2}\left[-\sum_{i=1}^{+\infty}\frac{(2s)^i}{i}\right] = \frac{s^2}{2}\left[4\sum_{i=2}^{+\infty}\frac{(2s)^{i-2}}{i}\right] \le \frac{s^2}{2}\left[2\sum_{i=2}^{+\infty}\left(\frac{1}{2}\right)^{i-2}\right] \le \frac{s^2}{2} \times 4.$

Hence W ∈ sE(4, 4).
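The computation above is easy to double-check numerically: $\Psi_{W-1}(s) = -s - \frac{1}{2}\log(1-2s)$ should stay below $2s^2$ on |s| ≤ 1/4. A quick sketch:

```python
import math

def cgf_chi2_centered(s):
    """Psi_{W-1}(s) = -s - (1/2) log(1 - 2s), valid for s < 1/2."""
    return -s - 0.5 * math.log(1 - 2 * s)

# Sub-exponential bound with (nu, alpha) = (4, 4): Psi_{W-1}(s) <= 2 s^2
# on the interval |s| <= 1/alpha = 1/4.
for k in range(-25, 26):
    s = k / 100
    assert cgf_chi2_centered(s) <= 2 * s * s + 1e-12
```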
Using the Chernoff-Cramér bound (Lemma 2.4.2), we obtain the following tail bound for sub-exponential variables.

Theorem 2.4.15 (Sub-exponential tail bound). Suppose the random variable X with mean µ is sub-exponential with parameters (ν, α). Then for all $\beta \in \mathbb{R}_+$,

$P[X - \mu \ge \beta] \le \begin{cases} \exp\left(-\frac{\beta^2}{2\nu}\right), & \text{if } 0 \le \beta \le \nu/\alpha,\\ \exp\left(-\frac{\beta}{2\alpha}\right), & \text{if } \beta > \nu/\alpha. \end{cases}$   (2.4.20)

In words, the tail decays exponentially fast at large deviations but behaves as in the sub-Gaussian case for smaller deviations. We will see below that this double tail allows one to interpolate naturally between different regimes. First we prove the claim.

Proof of Theorem 2.4.15. We start by applying the Chernoff-Cramér bound. For any β > 0 and |s| ≤ 1/α,

$P[X - \mu \ge \beta] \le \exp\left(-s\beta + \Psi_{X-\mu}(s)\right) \le \exp\left(-s\beta + s^2\nu/2\right).$

At this point, the proof diverges from the sub-Gaussian case: the optimal choice of s depends on β because of the additional constraint |s| ≤ 1/α. When $s^* = \beta/\nu$ satisfies $s^* \le 1/\alpha$, the quadratic function of s in the exponent is minimized at $s^*$, giving the bound

$P[X - \mu \ge \beta] \le \exp\left(-\frac{\beta^2}{2\nu}\right),$

for 0 ≤ β ≤ ν/α.

On the other hand, when β > ν/α, the exponent is strictly decreasing over the interval s ≤ 1/α. Hence the optimal choice is $s^* = 1/\alpha$, which produces the bound

$P[X - \mu \ge \beta] \le \exp\left(-\frac{\beta}{\alpha} + \frac{\nu}{2\alpha^2}\right) < \exp\left(-\frac{\beta}{\alpha} + \frac{\beta}{2\alpha}\right) = \exp\left(-\frac{\beta}{2\alpha}\right),$

where we used that ν < βα in the second inequality.

For (weighted) sums of independent sub-exponential random variables, we get the following.

Theorem 2.4.16 (General Bernstein inequality). Suppose $X_1, \dots, X_n$ are independent random variables where, for each i, $X_i \in \mathrm{sE}(\nu_i, \alpha_i)$ with $0 < \nu_i, \alpha_i < +\infty$. For $w_1, \dots, w_n \in \mathbb{R}$, let $S_n = \sum_{i \le n} w_i X_i$. Then

$S_n \in \mathrm{sE}\left(\sum_{i=1}^n w_i^2\nu_i,\ \max_i |w_i|\alpha_i\right).$

In particular, for all β > 0,

$P[S_n - \mathbb{E}S_n \ge \beta] \le \begin{cases} \exp\left(-\frac{\beta^2}{2\sum_{i=1}^n w_i^2\nu_i}\right), & \text{if } 0 \le \beta \le \frac{\sum_{i=1}^n w_i^2\nu_i}{\max_i |w_i|\alpha_i},\\ \exp\left(-\frac{\beta}{2\max_i |w_i|\alpha_i}\right), & \text{if } \beta > \frac{\sum_{i=1}^n w_i^2\nu_i}{\max_i |w_i|\alpha_i}. \end{cases}$

Proof. By independence and (2.1.3),

$\Psi_{S_n}(s) = \sum_{i \le n}\Psi_{w_iX_i}(s) = \sum_{i \le n}\Psi_{X_i}(sw_i) \le \sum_{i \le n}\frac{(sw_i)^2\nu_i}{2} = \frac{s^2\sum_{i \le n}w_i^2\nu_i}{2},$

provided $|sw_i| \le 1/\alpha_i$ for all i, that is,

$|s| \le \frac{1}{\max_i |w_i|\alpha_i}.$

Bounded random variables: revisited We apply the previous result to bounded random variables.

Theorem 2.4.17 (Bernstein’s inequality for bounded variables). Let $X_1, \dots, X_n$ be independent random variables where, for each i, $X_i$ has mean $\mu_i$, variance $\sigma_i^2$ and satisfies $|X_i - \mu_i| \le c$ for some $0 < c < +\infty$. Let $S_n = \sum_{i \le n} X_i$. For all β > 0,

$P[S_n - \mathbb{E}S_n \ge \beta] \le \begin{cases} \exp\left(-\frac{\beta^2}{4\sum_{i=1}^n\sigma_i^2}\right), & \text{if } 0 \le \beta \le \frac{\sum_{i=1}^n\sigma_i^2}{c},\\ \exp\left(-\frac{\beta}{4c}\right), & \text{if } \beta > \frac{\sum_{i=1}^n\sigma_i^2}{c}. \end{cases}$

Proof. We claim that $X_i \in \mathrm{sE}(2\sigma_i^2, 2c)$. To establish the claim, we derive a bound on all moments of $X_i$. Note that for all integers k ≥ 2

$\mathbb{E}|X_i - \mu_i|^k \le c^{k-2}\,\mathbb{E}|X_i - \mu_i|^2 = c^{k-2}\sigma_i^2.$

Hence, first applying the dominated convergence theorem (Proposition B.4.14) to establish the limit, we have for $|s| \le \frac{1}{2c}$

$\mathbb{E}[e^{s(X_i - \mu_i)}] = \sum_{k=0}^{+\infty}\frac{s^k}{k!}\,\mathbb{E}[(X_i - \mu_i)^k] \le 1 + s\,\mathbb{E}[(X_i - \mu_i)] + \sum_{k=2}^{+\infty}\frac{s^k}{k!}\,c^{k-2}\sigma_i^2 \le 1 + \frac{s^2\sigma_i^2}{2} + \frac{s^2\sigma_i^2}{3!}\sum_{k=3}^{+\infty}(cs)^{k-2} = 1 + \frac{s^2\sigma_i^2}{2}\left(1 + \frac{1}{3}\,\frac{cs}{1-cs}\right) \le 1 + \frac{s^2\sigma_i^2}{2}\left(1 + \frac{1}{3}\,\frac{1/2}{1-1/2}\right) \le 1 + \frac{s^2}{2}\,2\sigma_i^2 \le \exp\left(\frac{s^2}{2}\,2\sigma_i^2\right).$

Using the general Bernstein inequality (Theorem 2.4.16) gives the result.

It may seem counter-intuitive to derive a tail bound based on the sub-exponential


property of bounded random variables when we have already done so using their

sub-Gaussian behavior. After all, the latter is on the surface a strengthening of the
former. However, note that we have obtained a better bound in Theorem 2.4.17
than we did in Theorem 2.4.10—when β is not too large. That improvement stems
from the use of the (actual) variance for moderate deviations. This is easier to
appreciate on an example.

Example 2.4.18 (Erdős-Rényi: maximum degree). Let Gn = (Vn , En ) ∼ Gn,pn


be a random graph with n vertices and density pn under the Erdős-Rényi model
(Definition 1.2.2). Recall that two vertices u, v ∈ Vn are adjacent if {u, v} ∈ En
and that the set of adjacent vertices of v, denoted by N (v), is called the neighbor-
hood of v. The degree of v is the size of its neighborhood, that is, δ(v) = |N (v)|.
Here we study the maximum degree of Gn

$D_n = \max_{v \in V_n} \delta(v).$

We focus on the regime npn = ω(log n). Note that, for any vertex v ∈ Vn , its
degree is Bin(n − 1, pn ) by independence of the edges. In particular its expected
degree is (n−1)pn . To prove a high-probability upper bound on the maximum Dn ,
we need to control the deviation of the degree of each vertex from its expectation.
Observe that the degrees are not independent. Instead we apply a union bound over
all vertices, after using a tail bound.

Claim 2.4.19. For any ε > 0, as n → +∞,

$P\left[|D_n - (n-1)p_n| \ge 2\sqrt{(1+\varepsilon)\,np_n \log n}\right] \to 0.$

Proof. For a fixed vertex v, think of $\delta(v) = S_{n-1} \sim \mathrm{Bin}(n-1, p_n)$ as a sum of n − 1 independent {0, 1}-valued random variables, one for each possible edge. That is, $S_{n-1} = \sum_{i=1}^{n-1} X_i$ where the $X_i$’s are bounded random variables. The mean of $X_i$ is $p_n$ and its variance is $p_n(1 - p_n)$. So in Bernstein’s inequality (Theorem 2.4.17), we can take $\mu_i := p_n$, $\sigma_i^2 := p_n(1 - p_n)$ and c := 1 for all i. We get

$P[S_{n-1} \ge (n-1)p_n + \beta] \le \begin{cases} \exp\left(-\frac{\beta^2}{4\nu}\right), & \text{if } 0 \le \beta \le \nu,\\ \exp\left(-\frac{\beta}{4}\right), & \text{if } \beta > \nu, \end{cases}$

where $\nu = (n-1)p_n(1-p_n) = \omega(\log n)$ by assumption. We choose β to be the smallest value that will produce a tail probability less than $n^{-1-\varepsilon}$ for ε > 0, that is,

$\beta = \sqrt{4(n-1)p_n(1-p_n) \times (1+\varepsilon)\log n} = o(\nu),$

which falls in the lower regime of the tail bound. In particular, $\beta = o(np_n)$ (i.e., the deviation is much smaller than the expectation). Finally by a union bound over $v \in V_n$,

$P\left[D_n \ge (n-1)p_n + \sqrt{4(1+\varepsilon)p_n(1-p_n)(n-1)\log n}\right] \le n \times \frac{1}{n^{1+\varepsilon}} \to 0.$

The same holds in the other direction. That proves the claim.

Had we used Hoeffding’s inequality (Theorem 2.4.10) in the proof of Claim 2.4.19 we would have had to take $\beta = \sqrt{(1+\varepsilon)n\log n}$. That would have produced a much weaker bound when $p_n = o(1)$. Indeed the advantage of Bernstein’s inequality is that it makes explicit use of the variance, which when $p_n = o(1)$ is much smaller than the worst case for bounded variables. J
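Claim 2.4.19 can be illustrated by simulation (at moderate n, so this is only a sanity check, not a proof of the asymptotics); a sketch with arbitrarily chosen n, $p_n$ and ε:

```python
import math
import random

def max_degree_erdos_renyi(n, p, rng):
    """Sample G(n, p) edge by edge and return the maximum degree."""
    deg = [0] * n
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                deg[u] += 1
                deg[v] += 1
    return max(deg)

rng = random.Random(1)
n, p, eps = 400, 0.2, 0.5
d_max = max_degree_erdos_renyi(n, p, rng)
mean_deg = (n - 1) * p
dev = math.sqrt(4 * (1 + eps) * p * (1 - p) * (n - 1) * math.log(n))
# Per Claim 2.4.19, d_max lies within `dev` of the mean degree (w.h.p.).
```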

2.4.3 Probabilistic analysis of algorithms: knapsack problem


In a knapsack problem, we have n items. Item i has weight $W_i \in [0, 1]$ and value $V_i \in [0, 1]$. Given a weight bound W, we want to pack as valuable a collection of items in the knapsack as possible under the constraint that the total weight is less than or equal to W. Formally we seek a solution to the optimization problem

$Z^* = \max\left\{\sum_{j=1}^n x_j V_j \,:\, x_1, \dots, x_n \in [0, 1],\ \sum_{j=1}^n x_j W_j \le W\right\}.$   (2.4.21)

This is the fractional knapsack problem, where we allow a fraction of an item to be added to the knapsack.
It is used as a computationally tractable relaxation of the 0-1 knapsack problem, which also includes the combinatorial constraint $x_j \in \{0, 1\}$, ∀j. Indeed, it turns out that the optimization problem (2.4.21) is solved exactly by a simple greedy algorithm (see Exercise 2.8 for a formal proof of correctness): let π be a permutation of {1, . . . , n} that puts the items in decreasing order of value per unit weight

$\frac{V_{\pi(1)}}{W_{\pi(1)}} \ge \frac{V_{\pi(2)}}{W_{\pi(2)}} \ge \cdots \ge \frac{V_{\pi(n)}}{W_{\pi(n)}};$

add the items in that order until the first time the weight constraint is violated; include whatever fraction of that last item will fit. This greedy algorithm has a natural geometric interpretation, depicted in Figure 2.7, that will be useful. We associate item j to a point $(W_j, V_j) \in [0, 1]^2$ and keep only those items falling on or above a line with slope θ chosen to satisfy the total weight constraint. Specifically, let

$\Delta_\theta = \{j \in [n] : V_j > \theta W_j\},$

Figure 2.7: Visualization of the greedy algorithm.



$\Lambda_\theta = \{j \in [n] : V_j = \theta W_j\},$

and

$\Theta^* = \inf\{\theta \ge 0 : W_{\Delta_\theta} < W\},$

where, for a subset of items $J \subset [n]$, $W_J = \sum_{j \in J} W_j$ (and $V_J$ is similarly defined).
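The greedy rule just described translates directly into code; a minimal sketch of the fractional knapsack solver (variable names are ours; items are assumed to have strictly positive weight, which holds almost surely in the stochastic model below):

```python
def fractional_knapsack(values, weights, budget):
    """Greedy solution of (2.4.21): sort items by value per unit weight, add
    whole items until the budget would be exceeded, then take a fraction of
    the last item.  Assumes all weights are strictly positive."""
    order = sorted(range(len(values)),
                   key=lambda j: values[j] / weights[j], reverse=True)
    total_value, remaining = 0.0, budget
    for j in order:
        if weights[j] <= remaining:
            total_value += values[j]
            remaining -= weights[j]
        else:
            total_value += values[j] * (remaining / weights[j])
            break
    return total_value

# Three items; the best-ratio item fits whole, the next only fractionally.
z = fractional_knapsack([0.6, 0.5, 0.3], [0.4, 0.5, 0.6], 0.8)  # 0.6 + 0.8*0.5 = 1.0
```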
We consider a stochastic version of the fractional knapsack problem where the
weights and values are i.i.d. random variables picked uniformly at random in [0, 1].
Characterizing Z ∗ (e.g., its moments or distribution) is not straightforward. Here
we show that Z ∗ is highly concentrated around a natural quantity. Observe that,
under our probabilistic model, almost surely |Λθ | ∈ {0, 1} for any θ ≥ 0. Hence,
there are two cases. Either Θ∗ = 0, in which case all items fit in the knapsack so
$Z^* = \sum_{j=1}^n V_j$. Or $\Theta^* > 0$, in which case $|\Lambda_{\Theta^*}| = 1$ and

$Z^* = V_{\Delta_{\Theta^*}} + \frac{W - W_{\Delta_{\Theta^*}}}{W_{\Lambda_{\Theta^*}}}\, V_{\Lambda_{\Theta^*}}.$   (2.4.22)

One interesting regime is W = τ n for some constant τ > 0. Clearly, τ > 1 is


trivial. In fact, because
$\mathbb{E}\left[\sum_{j=1}^n W_j\right] = n\,\mathbb{E}[W_1] = \frac{n}{2},$

we assume that τ ≤ 1/2. To further simplify the calculations, we restrict ourselves


to the case τ ∈ (1/6, 1/2). (See Exercise 2.8 for the remaining case.) In this
regime, we show that Z ∗ grows linearly with n and give a bound on its deviation.
Although Z ∗ is technically a sum of random variables, the choice of Θ∗ corre-
lates them and we cannot apply our concentration bounds directly. Instead we show
that Θ∗ itself can be controlled well. It is natural to conjecture that Θ∗ is approx-
imately equal to a solution θτ of the expected constraint equation E[W∆θτ ] = W,
that is,
$n\bar{w}_{\theta_\tau} = n\tau,$   (2.4.23)

where $\bar{w}_\theta$ is defined through

$\mathbb{E}[W_{\Delta_\theta}] = \mathbb{E}\Big[\sum_{j \in \Delta_\theta} W_j\Big] = \mathbb{E}\Big[\sum_{j=1}^n \mathbf{1}\{V_j > \theta W_j\}\, W_j\Big] = n\,\mathbb{E}\left[\mathbf{1}\{V_1 > \theta W_1\}\, W_1\right] =: n\bar{w}_\theta.$

Similarly, we define

$\bar{v}_\theta := \mathbb{E}\left[\mathbf{1}\{V_1 > \theta W_1\}\, V_1\right].$
We see directly from the definitions that both w̄θ and v̄θ are monotone as functions
of θ.
Our main claim is the following.
Claim 2.4.20. There is a constant c > 0 such that for any δ > 0

$P\left[|Z^* - n\bar{v}_{\theta_\tau}| \ge \sqrt{c\, n \log \delta^{-1}}\right] \le \delta,$

for all n large enough.


Proof. Because all weights and values are in [0, 1], it follows from (2.4.22) that
$V_{\Delta_{\Theta^*}} \le Z^* \le V_{\Delta_{\Theta^*}} + 1,$   (2.4.24)
and it will suffice to work with V∆Θ∗ . The idea of the proof is to show that Θ∗
is close to θτ by establishing that W∆θ is highly likely to be less than τ n when
θ > θτ , while the opposite holds when θ < θτ . For this, we view W∆θ as a sum
of independent bounded random variables and use Hoeffding’s inequality (Theo-
rem 2.4.10).
Controlling Θ*. First, it will be useful to compute $\bar{w}_\theta$ and $\theta_\tau$ analytically. By definition,

$\bar{w}_\theta = \mathbb{E}\left[\mathbf{1}\{V_1 > \theta W_1\}\, W_1\right] = \int_0^1\!\!\int_0^1 \mathbf{1}\{y > \theta x\}\, x\, dy\, dx = \int_0^{1 \wedge 1/\theta} (1 - \theta x)\, x\, dx = \begin{cases} \frac{1}{2} - \frac{1}{3}\theta, & \text{if } \theta \le 1,\\ \frac{1}{6\theta^2}, & \text{otherwise}. \end{cases}$   (2.4.25)

Plugging back into (2.4.23), we get the unique solution

$\theta_\tau := 3\left(\frac{1}{2} - \tau\right) \in (0, 1),$

for the range τ ∈ (1/6, 1/2).


Now observe that, for each fixed θ, the quantity

$W_{\Delta_\theta} = \sum_{j=1}^n \mathbf{1}\{V_j > \theta W_j\}\, W_j,$

is a sum of independent random variables taking values in [0, 1]. Hence, for any β > 0, Hoeffding’s inequality gives

$P\left[W_{\Delta_\theta} - n\bar{w}_\theta \ge \beta\right] \le \exp\left(-\frac{2\beta^2}{n}\right).$

Using this inequality with $\theta = \theta_\tau + \frac{C}{\sqrt{n}}$ (with n large enough that θ < 1) and $\beta = (C/3)\sqrt{n}$ gives

$P\left[W_{\Delta_{\theta_\tau + C/\sqrt{n}}} - n\left(\frac{1}{2} - \frac{1}{3}\left(\theta_\tau + \frac{C}{\sqrt{n}}\right)\right) \ge (C/3)\sqrt{n}\right] \le \exp\left(-2(C/3)^2\right),$

where we used (2.4.25). After rearranging and using that $n\left(\frac{1}{2} - \frac{1}{3}\theta_\tau\right) = n\tau$ by (2.4.23) and (2.4.25), this gives

$P\left[\Theta^* \ge \theta_\tau + \frac{C}{\sqrt{n}}\right] = P\left[W_{\Delta_{\theta_\tau + C/\sqrt{n}}} \ge n\tau\right] \le \exp\left(-2(C/3)^2\right).$

Applying the same argument to $-W_{\Delta_\theta}$ with $\theta = \theta_\tau - \frac{C}{\sqrt{n}}$ and combining with the previous inequality gives

$P\left[|\Theta^* - \theta_\tau| > \frac{C}{\sqrt{n}}\right] \le 2\exp\left(-2(C/3)^2\right),$   (2.4.26)

assuming n is large enough.


Controlling Z*. We conclude by applying Hoeffding’s inequality to $V_{\Delta_\theta}$. Arguing as above with the same θ’s and β (but the roles of the two cases reversed), we obtain

$P\left[V_{\Delta_{\theta_\tau - C/\sqrt{n}}} - n\bar{v}_{\theta_\tau - C/\sqrt{n}} \ge (C/3)\sqrt{n}\right] \le \exp\left(-2(C/3)^2\right),$   (2.4.27)

and

$P\left[V_{\Delta_{\theta_\tau + C/\sqrt{n}}} - n\bar{v}_{\theta_\tau + C/\sqrt{n}} \le -(C/3)\sqrt{n}\right] \le \exp\left(-2(C/3)^2\right).$   (2.4.28)

Again, it will be useful to compute $\bar{v}_\theta$ analytically. By definition,

$\bar{v}_\theta = \mathbb{E}\left[\mathbf{1}\{V_1 > \theta W_1\}\, V_1\right] = \int_0^1\!\!\int_0^1 \mathbf{1}\{y > \theta x\}\, y\, dx\, dy = \int_0^{1 \wedge \theta} \frac{y^2}{\theta}\, dy + \int_{1 \wedge \theta}^1 y\, dy = \begin{cases} \frac{1}{2} - \frac{1}{6}\theta^2, & \text{if } \theta \le 1,\\ \frac{1}{3\theta}, & \text{otherwise}. \end{cases}$

Assuming n is large enough (recall that $\theta_\tau < 1$), we get

$\bar{v}_{\theta_\tau} - \bar{v}_{\theta_\tau + C/\sqrt{n}} = \frac{1}{6}\left(2\frac{C}{\sqrt{n}}\theta_\tau + \frac{C^2}{n}\right) \le \frac{C}{\sqrt{n}}.$

A quick check reveals that, similarly, $\bar{v}_{\theta_\tau - C/\sqrt{n}} - \bar{v}_{\theta_\tau} \le \frac{C}{\sqrt{n}}$. Plugging back into (2.4.27) and (2.4.28) gives

$P\left[V_{\Delta_{\theta_\tau - C/\sqrt{n}}} \ge n\bar{v}_{\theta_\tau} + 2C\sqrt{n}\right] \le \exp\left(-2(C/3)^2\right),$   (2.4.29)

and

$P\left[V_{\Delta_{\theta_\tau + C/\sqrt{n}}} \le n\bar{v}_{\theta_\tau} - 2C\sqrt{n}\right] \le \exp\left(-2(C/3)^2\right).$   (2.4.30)

Observe that the following monotonicity property holds almost surely:

$\theta_0 \le \theta_1 \le \theta_2 \implies V_{\Delta_{\theta_0}} \ge V_{\Delta_{\theta_1}} \ge V_{\Delta_{\theta_2}}.$   (2.4.31)

Combining (2.4.24), (2.4.26), (2.4.29), (2.4.30) and (2.4.31), we obtain

$P\left[|Z^* - n\bar{v}_{\theta_\tau}| > 2C\sqrt{n}\right] \le 4\exp\left(-2(C/3)^2\right),$

for n large enough. Choosing C appropriately gives the claim.

A similar bound is proved for the 0-1 knapsack problem in Exercise 2.9.
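Claim 2.4.20 can also be probed by simulation: draw i.i.d. uniform weights and values, solve the fractional knapsack greedily, and compare $Z^*/n$ with $\bar{v}_{\theta_\tau} = \frac{1}{2} - \theta_\tau^2/6$. A sketch (sizes and τ are arbitrary choices):

```python
import math
import random

def z_star(values, weights, budget):
    """Greedy optimum of the fractional knapsack problem (2.4.21)."""
    order = sorted(range(len(values)),
                   key=lambda j: values[j] / weights[j], reverse=True)
    total, remaining = 0.0, budget
    for j in order:
        take = min(1.0, remaining / weights[j])
        total += take * values[j]
        remaining -= take * weights[j]
        if remaining <= 0.0:
            break
    return total

rng = random.Random(7)
n, tau = 5000, 0.25                 # tau in (1/6, 1/2)
values = [rng.random() for _ in range(n)]
weights = [rng.random() for _ in range(n)]
theta_tau = 3 * (0.5 - tau)         # solves (2.4.23) via (2.4.25)
v_bar = 0.5 - theta_tau**2 / 6      # v-bar evaluated at theta_tau
z = z_star(values, weights, tau * n)
# z / n should be within O(1/sqrt(n)) of v_bar, per Claim 2.4.20.
```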

2.4.4 Epsilon-nets and chaining


Suppose we are interested in bounding the expectation or tail of the supremum of
a stochastic process
sup Xt ,
t∈T

where T is an arbitrary index set and the Xt s are real-valued random variables. To
avoid measurability issues, we assume throughout that T is countable.† Note that
t does not in general need to be a “time” index.
So far we have developed tools that can handle cases where T is finite. When
the supremum is over an infinite index set, however, new ideas are required. One
way to proceed is to apply a tail inequality to a sufficiently dense finite subset of the
index set and then extend the resulting bound by a Lipschitz continuity argument.
We present this type of approach in this section, as well as a multi-scale version
known as chaining.
First we summarize one important special case that will be useful below: T is
finite and Xt is sub-Gaussian.

Theorem 2.4.21 (Maximal inequalities: sub-Gaussian case). Let $\{X_t\}_{t \in T}$ be a stochastic process with finite index set T. Assume that there is ν > 0 such that, for all t, $X_t \in \mathrm{sG}(\nu)$ and $\mathbb{E}[X_t] = 0$. Then

$\mathbb{E}\left[\sup_{t \in T} X_t\right] \le \sqrt{2\nu \log |T|},$

and, for all β > 0,

$P\left[\sup_{t \in T} X_t \ge \sqrt{2\nu \log |T|} + \beta\right] \le \exp\left(-\frac{\beta^2}{2\nu}\right).$

Proof. For the expectation, we apply a variation on the Chernoff-Cramér method (Section 2.4). Naively, we could bound the supremum $\sup_{t \in T} X_t$ by the sum $\sum_{t \in T} |X_t|$, but that would lead to a bound growing linearly with the cardinality |T|. Instead we first take an exponential, which tends to amplify the largest term and produces a much stronger bound. Specifically, by Jensen’s inequality (Theorem B.4.15), for any s > 0,

$\mathbb{E}\left[\sup_{t \in T} X_t\right] = \mathbb{E}\left[\frac{1}{s}\sup_{t \in T} sX_t\right] \le \frac{1}{s}\log \mathbb{E}\left[\exp\left(\sup_{t \in T} sX_t\right)\right].$

†Technically, it suffices to assume that there is a countable $T_0 \subseteq T$ such that $\sup_{t \in T} X_t = \sup_{t \in T_0} X_t$ almost surely.

Since $e^{a \vee b} \le e^a + e^b$ by the non-negativity of the exponential, we can bound
\[
\mathbb{E}\Big[\sup_{t \in \mathcal{T}} X_t\Big] \le \frac{1}{s} \log \mathbb{E}\Big[\sum_{t \in \mathcal{T}} \exp(s X_t)\Big] = \frac{1}{s} \log\Big[\sum_{t \in \mathcal{T}} M_{X_t}(s)\Big] \le \frac{1}{s} \log\Big(|\mathcal{T}|\, e^{s^2 \nu/2}\Big) = \frac{\log |\mathcal{T}|}{s} + \frac{s\nu}{2}.
\]
The optimal choice of $s$ (i.e., leading to the least upper bound) is when the two terms in the sum above are equal, that is, $s = \sqrt{2\nu^{-1} \log |\mathcal{T}|}$, which gives finally
\[
\mathbb{E}\Big[\sup_{t \in \mathcal{T}} X_t\Big] \le \sqrt{2\nu \log |\mathcal{T}|},
\]
as claimed.
For the tail inequality, we use a union bound and (2.4.16):
\[
\mathbb{P}\Big[\sup_{t \in \mathcal{T}} X_t \ge \sqrt{2\nu \log |\mathcal{T}|} + \beta\Big] \le \sum_{t \in \mathcal{T}} \mathbb{P}\Big[X_t \ge \sqrt{2\nu \log |\mathcal{T}|} + \beta\Big] \le |\mathcal{T}| \exp\Big(-\frac{(\sqrt{2\nu \log |\mathcal{T}|} + \beta)^2}{2\nu}\Big) \le \exp\Big(-\frac{\beta^2}{2\nu}\Big),
\]
as claimed, where we used that $\beta > 0$ on the last line.
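As a quick numerical sanity check (a hypothetical sketch, not from the text, assuming standard Gaussians, which satisfy $\mathrm{sG}(1)$), one can compare a Monte Carlo estimate of the expected maximum of $|\mathcal{T}|$ i.i.d. $N(0,1)$ variables to the bound $\sqrt{2 \log |\mathcal{T}|}$:

```python
import math
import random

random.seed(0)

def empirical_max_mean(num_vars, num_trials=1000):
    """Monte Carlo estimate of E[max of num_vars i.i.d. N(0,1) variables]."""
    total = 0.0
    for _ in range(num_trials):
        total += max(random.gauss(0.0, 1.0) for _ in range(num_vars))
    return total / num_trials

results = {}
for size in (10, 100, 1000):
    bound = math.sqrt(2 * math.log(size))  # sqrt(2 * nu * log|T|) with nu = 1
    results[size] = (empirical_max_mean(size), bound)
    print(f"|T|={size}: E[max] ~ {results[size][0]:.3f}, bound {bound:.3f}")
```

The estimate stays below the bound at every size, and the gap narrows as $|\mathcal{T}|$ grows, consistent with the bound being tight up to lower-order terms for independent Gaussians.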

Epsilon-nets and covering numbers


Moving on to infinite index sets, we first define the notion of an ε-net. This notion
requires that a pseudometric ρ (i.e., ρ : T × T → R+ is symmetric and satisfies
the triangle inequality) be defined over T .

Definition 2.4.22 (ε-net). Let $\mathcal{T}$ be a subset of a pseudometric space $(M, \rho)$ and let $\varepsilon > 0$. The collection of points $N \subseteq M$ is called an ε-net of $\mathcal{T}$ if
\[
\mathcal{T} \subseteq \bigcup_{t \in N} B_\rho(t, \varepsilon),
\]
where $B_\rho(t, \varepsilon) = \{s \in \mathcal{T} : \rho(s,t) \le \varepsilon\}$, that is, each element of $\mathcal{T}$ is within distance $\varepsilon$ of an element in $N$. The smallest cardinality of an ε-net of $\mathcal{T}$ is called the covering number
\[
\mathcal{N}(\mathcal{T}, \rho, \varepsilon) = \inf\{|N| : N \text{ is an } \varepsilon\text{-net of } \mathcal{T}\}.
\]

A natural way to construct an ε-net is the following algorithm. Start with N = ∅


and successively add a point from T to N at distance at least ε from all other pre-
vious points until it is not possible to do so anymore. Provided T is compact, this
procedure will terminate after a finite number of steps. This leads to the following
dual perspective.
Definition 2.4.23 (ε-packing). Let $\mathcal{T}$ be a subset of a pseudometric space $(M, \rho)$ and let $\varepsilon > 0$. The collection of points $N \subseteq \mathcal{T}$ is called an ε-packing of $\mathcal{T}$ if
\[
t \notin B_\rho(t', \varepsilon), \qquad \forall t \ne t' \in N,
\]
that is, every pair of elements of $N$ is at distance strictly greater than $\varepsilon$. The largest cardinality of an ε-packing of $\mathcal{T}$ is called the packing number
\[
\mathcal{P}(\mathcal{T}, \rho, \varepsilon) = \sup\{|N| : N \text{ is an } \varepsilon\text{-packing of } \mathcal{T}\}.
\]

Lemma 2.4.24 (Covering and packing numbers). For any T ⊆ M and all ε > 0,

N (T , ρ, ε) ≤ P(T , ρ, ε).

Proof. Observe that a maximal ε-packing N is an ε-net. Indeed, by maximality,


any element of T \ N is at distance at most ε from an element of N .
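The greedy construction described above is easy to implement. Here is an illustrative sketch (not from the text) for a finite set of points in the plane under the Euclidean metric; the selected points form an ε-packing by construction and, once the loop terminates, an ε-net by maximality, exactly as in the proof of Lemma 2.4.24:

```python
import math
import random

def greedy_net(points, eps):
    """Greedily select points pairwise more than eps apart (an eps-packing);
    at termination every input point is within eps of a selected point."""
    net = []
    for p in points:
        if all(math.dist(p, q) > eps for q in net):
            net.append(p)
    return net

random.seed(1)
pts = [(random.random(), random.random()) for _ in range(500)]
eps = 0.2
net = greedy_net(pts, eps)

# Packing property: pairwise distances in the net strictly exceed eps.
pack_ok = all(math.dist(p, q) > eps for i, p in enumerate(net) for q in net[:i])
# Covering (net) property: every point lies within eps of the net.
cover_ok = all(min(math.dist(p, q) for q in net) <= eps for p in pts)
print(len(net), pack_ok, cover_ok)
```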

Example 2.4.25 (Sphere in Rk ). We let Bk (x, ε) be the ball of radius ε around


x ∈ Rk with the Euclidean metric. We let Sk−1 be the sphere of radius 1 centered
around the origin 0, that is, the surface of Bk (0, 1). Let 0 < ε < 1.
Claim 2.4.26. For $S := \mathbb{S}^{k-1}$,
\[
\mathcal{N}(S, \rho, \varepsilon) \le \Big(\frac{3}{\varepsilon}\Big)^k.
\]
Proof. Let $N$ be any maximal ε-packing of $S$. We show that $|N| \le (3/\varepsilon)^k$, which implies the claim by Lemma 2.4.24. The balls of radius $\varepsilon/2$ around points in $N$, $\{B_k(x_i, \varepsilon/2) : x_i \in N\}$, satisfy two properties:
1. They are pairwise disjoint: if $z \in B_k(x_i, \varepsilon/2) \cap B_k(x_j, \varepsilon/2)$, then $\|x_i - x_j\|_2 \le \|x_i - z\|_2 + \|x_j - z\|_2 \le \varepsilon$, a contradiction.
2. They are included in the ball of radius $3/2$ around the origin: if $z \in B_k(x_i, \varepsilon/2)$, then $\|z\|_2 \le \|z - x_i\|_2 + \|x_i\|_2 \le \varepsilon/2 + 1 \le 3/2$.
The volume of a ball of radius $\varepsilon/2$ is $\frac{\pi^{k/2} (\varepsilon/2)^k}{\Gamma(k/2+1)}$ and that of a ball of radius $3/2$ is $\frac{\pi^{k/2} (3/2)^k}{\Gamma(k/2+1)}$. Dividing one by the other proves the claim.

This bound will be useful later. J


The basic approach to using an ε-net to control the supremum of a stochastic process is the following. We say that a stochastic process $\{X_t\}_{t \in \mathcal{T}}$ is Lipschitz for a pseudometric $\rho$ on $\mathcal{T}$ if there is a random variable $0 < K < +\infty$ such that
\[
|X_t - X_s| \le K \rho(s,t), \qquad \forall s, t \in \mathcal{T}.
\]
If in addition $X_t$ is sub-Gaussian for all $t$, then we can bound the expectation or tail probability of the supremum of $\{X_t\}_{t \in \mathcal{T}}$, provided we can bound the expectation or tail probability of the (random) Lipschitz constant $K$ itself. To see this, let $N \subseteq \mathcal{T}$ be an ε-net of $\mathcal{T}$ and, for each $t \in \mathcal{T}$, let $\pi(t)$ be the closest element of $N$ to $t$. We will refer to $\pi$ as the projection map of $N$. We then have the inequality
\[
\sup_{t \in \mathcal{T}} X_t \le \sup_{t \in \mathcal{T}} (X_t - X_{\pi(t)}) + \sup_{t \in \mathcal{T}} X_{\pi(t)} \le K\varepsilon + \sup_{s \in N} X_s, \tag{2.4.32}
\]
where we can use Theorem 2.4.21 to bound the last term.‡ We give an example of this type of argument next (although we do not apply the above bound directly). Another example (where (2.4.32) is used this time) can be found in Section 2.4.5.
Example 2.4.27 (Spectral norm of a random matrix). For an $m \times n$ matrix $A \in \mathbb{R}^{m \times n}$, the spectral norm (or induced 2-norm, or 2-norm for short) is defined as
\[
\|A\|_2 := \sup_{x \in \mathbb{R}^n \setminus \{0\}} \frac{\|Ax\|_2}{\|x\|_2} = \sup_{x \in \mathbb{S}^{n-1}} \|Ax\|_2 = \sup_{x \in \mathbb{S}^{n-1},\, y \in \mathbb{S}^{m-1}} \langle Ax, y \rangle, \tag{2.4.33}
\]
where $\mathbb{S}^{n-1}$ is the sphere of Euclidean radius 1 around the origin in $\mathbb{R}^n$. The rightmost expression, which is central to our developments, is justified in Exercise 5.4. We will be interested in the case where $A$ is a random matrix with independent entries. One key observation is that the quantity $\langle Ax, y \rangle$ can then be seen as a linear combination of independent random variables
\[
\langle Ax, y \rangle = \sum_{i,j} x_j y_i A_{ij}.
\]

‡ If the ε-net $N$ is not included in $\mathcal{T}$, the Lipschitz condition has to hold on a larger subset that includes both.

Hence we will be able to apply our previous tail bounds. However, we also need to
deal with the supremum.

Theorem 2.4.28 (Upper tail of the spectral norm). Let $A \in \mathbb{R}^{m \times n}$ be a random matrix whose entries are centered, independent and sub-Gaussian with variance factor $\nu$. Then there exists a constant $0 < C < +\infty$ such that, for all $t > 0$,
\[
\|A\|_2 \le C\sqrt{\nu}(\sqrt{m} + \sqrt{n} + t),
\]
with probability at least $1 - e^{-t^2}$.
Without the independence assumption, the norm can be much larger in general (see Exercise 2.15).

Proof. Fix $\varepsilon = 1/4$. By Claim 2.4.26, there is an ε-net $N \subseteq \mathbb{S}^{n-1}$ of $\mathbb{S}^{n-1}$ (respectively $M \subseteq \mathbb{S}^{m-1}$ of $\mathbb{S}^{m-1}$) with $|N| \le 12^n$ (respectively $|M| \le 12^m$). We proceed in two steps:

1. We first apply the general Hoeffding inequality (Theorem 2.4.9) to control


the deviations of the supremum in (2.4.33) restricted to N and M .

2. We then extend the bound to the full supremum by Lipschitz continuity.

Formally, the result follows from the following two lemmas.


Lemma 2.4.29. Let $N$ and $M$ be as above. There is a constant $C$ large enough (not depending on $n$, $m$) such that, for all $t > 0$,
\[
\mathbb{P}\Big[\max_{x \in N,\, y \in M} \langle Ax, y \rangle \ge \frac{1}{2} C\sqrt{\nu}(\sqrt{m} + \sqrt{n} + t)\Big] \le e^{-t^2}.
\]
Lemma 2.4.30. For any ε-nets $N \subseteq \mathbb{S}^{n-1}$ and $M \subseteq \mathbb{S}^{m-1}$ of $\mathbb{S}^{n-1}$ and $\mathbb{S}^{m-1}$ respectively, the following inequalities hold
\[
\sup_{x \in N,\, y \in M} \langle Ax, y \rangle \le \|A\|_2 \le \frac{1}{1 - 2\varepsilon} \sup_{x \in N,\, y \in M} \langle Ax, y \rangle.
\]

Proof of Lemma 2.4.29. Recall that
\[
\langle Ax, y \rangle = \sum_{i,j} x_j y_i A_{ij}
\]
is a linear combination of independent random variables. By the general Hoeffding inequality, $\langle Ax, y \rangle$ is sub-Gaussian with variance factor
\[
\sum_{i,j} (x_j y_i)^2 \nu = \|x\|_2^2 \|y\|_2^2 \nu = \nu,
\]
for all $x \in N$ and $y \in M$. In particular, for all $\beta > 0$,
\[
\mathbb{P}[\langle Ax, y \rangle \ge \beta] \le \exp\Big(-\frac{\beta^2}{2\nu}\Big).
\]
Hence, by a union bound over $N$ and $M$,
\[
\begin{aligned}
\mathbb{P}\Big[\max_{x \in N,\, y \in M} \langle Ax, y \rangle \ge \frac{1}{2} C\sqrt{\nu}(\sqrt{m} + \sqrt{n} + t)\Big]
&\le \sum_{x \in N,\, y \in M} \mathbb{P}\Big[\langle Ax, y \rangle \ge \frac{1}{2} C\sqrt{\nu}(\sqrt{m} + \sqrt{n} + t)\Big] \\
&\le |N||M| \exp\bigg(-\frac{1}{2\nu}\Big(\frac{1}{2} C\sqrt{\nu}(\sqrt{m} + \sqrt{n} + t)\Big)^2\bigg) \\
&\le 12^{n+m} \exp\Big(-\frac{C^2}{8}\big(m + n + t^2\big)\Big) \\
&\le e^{-t^2},
\end{aligned}
\]
for $C^2/8 = \log 12 \ge 1$, where in the third inequality we ignored all cross-products since they are nonnegative.
Proof of Lemma 2.4.30. The first inequality is immediate by definition of the spectral norm. For the second inequality, we will use the following observation:
\[
\langle Ax, y \rangle - \langle Ax', y' \rangle = \langle Ax, y - y' \rangle + \langle A(x - x'), y' \rangle. \tag{2.4.34}
\]
Fix $x \in \mathbb{S}^{n-1}$ and $y \in \mathbb{S}^{m-1}$ such that $\langle Ax, y \rangle = \|A\|_2$ (which exist by compactness), and let $x' \in N$ and $y' \in M$ be such that
\[
\|x - x'\|_2 \le \varepsilon \quad \text{and} \quad \|y - y'\|_2 \le \varepsilon.
\]
Then (2.4.34), Cauchy-Schwarz and the definition of the spectral norm imply
\[
\|A\|_2 - \langle Ax', y' \rangle \le \|A\|_2 \|x\|_2 \|y - y'\|_2 + \|A\|_2 \|x - x'\|_2 \|y'\|_2 \le 2\varepsilon \|A\|_2.
\]
Rearranging gives the claim.
Putting the two lemmas together concludes the proof of Theorem 2.4.28.
We will give an application of this bound in Section 5.1.4. J
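A quick simulation (a hypothetical illustration, assuming NumPy is available) is consistent with the $\sqrt{m} + \sqrt{n}$ scaling in Theorem 2.4.28; the constant 2 below is an arbitrary comfortable choice, not the theorem's $C$:

```python
import numpy as np

rng = np.random.default_rng(0)

def max_spectral_norm(m, n, trials=20):
    """Largest spectral norm observed over `trials` m x n N(0,1) matrices."""
    return max(np.linalg.norm(rng.standard_normal((m, n)), 2)
               for _ in range(trials))

checks = []
for m, n in [(50, 200), (100, 100), (200, 50)]:
    obs = max_spectral_norm(m, n)
    scale = np.sqrt(m) + np.sqrt(n)
    checks.append(obs <= 2 * scale)
    print(f"{m}x{n}: max ||A||_2 = {obs:.1f}, sqrt(m)+sqrt(n) = {scale:.1f}")
```

In fact the observed norms hover just above $\sqrt{m} + \sqrt{n}$ itself, which is the known asymptotic value for Gaussian matrices.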

Chaining method
We go back to the inequality
\[
\sup_{t \in \mathcal{T}} X_t \le \sup_{t \in \mathcal{T}} (X_t - X_{\pi(t)}) + \sup_{t \in \mathcal{T}} X_{\pi(t)}. \tag{2.4.35}
\]
Previously we controlled the first term on the right-hand side with a random Lipschitz constant and the second term with a maximal inequality for finite sets. Now we consider cases where we may not have a good almost sure bound on the Lipschitz constant, but where we can control increments uniformly in the following probabilistic sense. We say that a stochastic process $\{X_t\}_{t \in \mathcal{T}}$ has sub-Gaussian increments on $(\mathcal{T}, \rho)$ if there exists a deterministic constant $0 < K < +\infty$ such that
\[
X_t - X_s \in \mathrm{sG}(K^2 \rho(s,t)^2), \qquad \forall s, t \in \mathcal{T}.
\]
Even with this assumption, in (2.4.35) the first term on the right-hand side remains a supremum over an infinite set. To control it, the chaining method repeats the argument above at progressively smaller scales, leading to the following inequality. The diameter of $\mathcal{T}$, denoted by $\mathrm{diam}(\mathcal{T})$, is defined as
\[
\mathrm{diam}(\mathcal{T}) = \sup\{\rho(s,t) : s, t \in \mathcal{T}\}.
\]

Theorem 2.4.31 (Discrete Dudley inequality). Let $\{X_t\}_{t \in \mathcal{T}}$ be a zero-mean stochastic process with sub-Gaussian increments on $(\mathcal{T}, \rho)$ and assume $\mathrm{diam}(\mathcal{T}) \le 1$. Then
\[
\mathbb{E}\Big[\sup_{t \in \mathcal{T}} X_t\Big] \le C \sum_{k=0}^{+\infty} 2^{-k} \sqrt{\log \mathcal{N}(\mathcal{T}, \rho, 2^{-k})},
\]
for some constant $0 \le C < +\infty$.

Proof. Recall that we assume that $\mathcal{T}$ is countable. Let $\mathcal{T}_j \subseteq \mathcal{T}$, $j \ge 1$, be a sequence of finite sets such that $\mathcal{T}_j \uparrow \mathcal{T}$. By monotone convergence (Proposition B.4.14),
\[
\mathbb{E}\Big[\sup_{t \in \mathcal{T}} X_t\Big] = \sup_{j \ge 1} \mathbb{E}\Big[\sup_{t \in \mathcal{T}_j} X_t\Big].
\]
Moreover, $\mathcal{N}(\mathcal{T}_j, \rho, \varepsilon) \le \mathcal{N}(\mathcal{T}, \rho, \varepsilon)$ for any $\varepsilon > 0$ since $\mathcal{T}_j \subseteq \mathcal{T}$. Hence it suffices to handle the case $|\mathcal{T}| < +\infty$.
ε-nets at all scales. For each $k \ge 0$, let $N_k$ be a $2^{-k}$-net of $\mathcal{T}$ with $|N_k| = \mathcal{N}(\mathcal{T}, \rho, 2^{-k})$ and projection map $\pi_k$. Because $\mathrm{diam}(\mathcal{T}) \le 1$, we can set $N_0 = \{t_0\}$, where $t_0 \in \mathcal{T}$ can be taken arbitrarily. Moreover, because $\mathcal{T}$ is finite, there is $1 \le \kappa < +\infty$ such that we can take $N_k = \mathcal{T}$ for all $k \ge \kappa$.§ In particular, $\pi_\kappa(t) = t$ for all $t \in \mathcal{T}$. By a telescoping argument,
\[
X_t = X_{t_0} + \sum_{k=0}^{\kappa-1} \big(X_{\pi_{k+1}(t)} - X_{\pi_k(t)}\big).
\]
Taking a supremum and then an expectation gives
\[
\mathbb{E}\Big[\sup_{t \in \mathcal{T}} X_t\Big] \le \sum_{k=0}^{\kappa-1} \mathbb{E}\Big[\sup_{t \in \mathcal{T}} \big(X_{\pi_{k+1}(t)} - X_{\pi_k(t)}\big)\Big], \tag{2.4.36}
\]
where we used $\mathbb{E}[X_{t_0}] = 0$.


Sub-Gaussian bound. We use the maximal inequality (Theorem 2.4.21) to bound
the expectation in (2.4.36). For each k, the number of distinct elements in the
supremum is at most

|{(πk (t), πk+1 (t)) : t ∈ T }| ≤ |Nk × Nk+1 |


= |Nk | × |Nk+1 |
≤ N (T , ρ, 2−k−1 )2 .

For any t ∈ T , by the triangle inequality,

ρ(πk (t), πk+1 (t)) ≤ ρ(πk (t), t) + ρ(t, πk+1 (t)) ≤ 2−k + 2−k−1 ≤ 2−k+1 ,

so that
Xπk+1 (t) − Xπk (t) ∈ sG(K2 2−2k+2 ),
for some 0 < K < +∞ by the sub-Gaussian increments assumption. We can
therefore apply Theorem 2.4.21 to get
  q

E sup Xπk+1 (t) − Xπk (t) ≤ 2K2 2−2k+2 log(N (T , ρ, 2−k−1 )2 )
t∈T
q
−k−1
≤ C2 log N (T , ρ, 2−k−1 ),

for some constant 0 ≤ C < +∞ (depending on K).


To finish the argument, we plug back into (2.4.36),
  κ−1
X q
−k−1
E sup Xt ≤ C2 log N (T , ρ, 2−k−1 ),
t∈T k=0

which implies the claim.


§
Technically, T could be part of a larger countable space by the discussion above.

Using a similar argument, one can derive a tail inequality.
Theorem 2.4.32 (Chaining tail inequality). Let $\{X_t\}_{t \in \mathcal{T}}$ be a zero-mean stochastic process with sub-Gaussian increments on $(\mathcal{T}, \rho)$ and assume that $\mathrm{diam}(\mathcal{T}) \le 1$. Then, for all $t_0 \in \mathcal{T}$ and $\beta > 0$,
\[
\mathbb{P}\Big[\sup_{t \in \mathcal{T}} (X_t - X_{t_0}) \ge C \sum_{k=0}^{+\infty} 2^{-k} \sqrt{\log \mathcal{N}(\mathcal{T}, \rho, 2^{-k})} + \beta\Big] \le C \exp\Big(-\frac{\beta^2}{C}\Big),
\]
for some constant $0 \le C < +\infty$.
We give an application of the discrete Dudley inequality in Section 2.4.6.
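For a concrete feel for the entropy sum, here is a small sketch (not from the text) that upper bounds the covering numbers of a finite index set at dyadic scales with the greedy net construction of Section 2.4.4 and accumulates the series, truncated at the scale $\kappa$ where the net exhausts the set:

```python
import math
import random

def covering_number_upper(points, eps):
    """Size of a greedy eps-net: an upper bound on N(T, rho, eps)."""
    net = []
    for p in points:
        if all(math.dist(p, q) > eps for q in net):
            net.append(p)
    return len(net)

random.seed(2)
# A finite index set of diameter at most 1: points in a half-unit square.
T = [(0.5 * random.random(), 0.5 * random.random()) for _ in range(200)]

dudley_sum = 0.0
k = 0
while True:
    N_k = covering_number_upper(T, 2.0 ** (-k))
    if N_k > 1:
        dudley_sum += 2.0 ** (-k) * math.sqrt(math.log(N_k))
    if N_k == len(T):  # scale kappa reached: finer nets no longer grow
        break
    k += 1
print(f"entropy sum truncated at k={k}: {dudley_sum:.3f}")
```

The sum is dominated by the middle scales: very coarse nets have $\log \mathcal{N} \approx 0$, while very fine scales are suppressed by the factor $2^{-k}$.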

2.4.5 . Data science: Johnson-Lindenstrauss lemma and application to compressed sensing
In this section we discuss an application of the Chernoff-Cramér method (Sec-
tion 2.4.1) to dimension reduction in data science. We use once again an ε-net
argument (Section 2.4.4).

Johnson-Lindenstrauss lemma
The Johnson-Lindenstrauss lemma states roughly that, for any collection of points
in a high-dimensional Euclidean space, one can find an embedding of much lower
dimension that roughly preserves the metric relationships of the points, that is,
their distances. Remarkably, no structure is assumed on the original points and the
result is independent of the input dimension. The method of proof simply involves
performing a random projection.
Lemma 2.4.33 (Johnson-Lindenstrauss lemma). For any set of points $x^{(1)}, \ldots, x^{(m)}$ in $\mathbb{R}^n$ and $\theta \in (0,1)$, there exists a mapping $f : \mathbb{R}^n \to \mathbb{R}^d$ with $d = \Theta(\theta^{-2} \log m)$ such that the following holds: for all $i, j$,
\[
(1 - \theta)\|x^{(i)} - x^{(j)}\|_2 \le \|f(x^{(i)}) - f(x^{(j)})\|_2 \le (1 + \theta)\|x^{(i)} - x^{(j)}\|_2. \tag{2.4.37}
\]
We use the probabilistic method: we derive a “distributional” version of the
result that, in turn, implies Lemma 2.4.33 by showing that a mapping with the de-
sired properties exists with positive probability. Before stating this claim formally,
we define the explicit random linear mapping we will employ. Let A be a d × n
matrix whose entries are independent $N(0,1)$. Note that, for any fixed $z \in \mathbb{R}^n$,
\[
\mathbb{E}\big[\|Az\|_2^2\big] = \mathbb{E}\Big[\sum_{i=1}^d \Big(\sum_{j=1}^n A_{ij} z_j\Big)^2\Big] = d\, \mathrm{Var}\Big[\sum_{j=1}^n A_{1j} z_j\Big] = d\|z\|_2^2, \tag{2.4.38}
\]
where we used the independence of the $A_{ij}$s (and, in particular, of the rows of $A$) and the fact that
\[
\mathbb{E}\Big[\sum_{j=1}^n A_{ij} z_j\Big] = 0. \tag{2.4.39}
\]
Hence the normalized mapping
\[
L = \frac{1}{\sqrt{d}} A
\]
preserves the squared Euclidean norm "on average," that is, $\mathbb{E}\big[\|Lz\|_2^2\big] = \|z\|_2^2$. We use the Chernoff-Cramér method to prove a high-probability result.

Lemma 2.4.34. Fix $\delta, \theta \in (0,1)$. Then the random linear mapping $L$ above with $d = \Theta(\theta^{-2} \log \delta^{-1})$ is such that for any $z \in \mathbb{R}^n$ with $\|z\|_2 = 1$,
\[
\mathbb{P}\big[|\|Lz\|_2 - 1| \ge \theta\big] \le \delta. \tag{2.4.40}
\]

Before proving Lemma 2.4.34, we argue that it implies the Johnson-Lindenstrauss lemma (Lemma 2.4.33). Simply take $\delta = 1/(2\binom{m}{2})$, apply the previous lemma to each normalized pairwise difference $z = (x^{(i)} - x^{(j)})/\|x^{(i)} - x^{(j)}\|_2$, and use a union bound over all $\binom{m}{2}$ such pairs. The probability that any of the inequalities (2.4.37) is not satisfied by the linear mapping $f(z) = Lz$ is then at most $1/2$. Hence a mapping with the desired properties exists for $d = \Theta(\theta^{-2} \log m)$.
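The following hypothetical NumPy sketch carries out exactly this random-projection construction on a synthetic point cloud; the projection dimension d = 250 is generous for m = 50 points and distortion θ = 0.5:

```python
import numpy as np

rng = np.random.default_rng(42)

n, m = 1000, 50        # ambient dimension, number of points
theta = 0.5            # target distortion
d = 250                # projection dimension (ample for this m and theta)

X = rng.standard_normal((m, n))                 # arbitrary point cloud
L = rng.standard_normal((d, n)) / np.sqrt(d)    # normalized Gaussian map
Y = X @ L.T                                     # projected points f(x) = Lx

max_distortion = 0.0
for i in range(m):
    for j in range(i):
        orig = np.linalg.norm(X[i] - X[j])
        proj = np.linalg.norm(Y[i] - Y[j])
        max_distortion = max(max_distortion, abs(proj / orig - 1.0))
print(f"worst relative distortion over {m*(m-1)//2} pairs: {max_distortion:.3f}")
```

Note that the code never looks at the points when building $L$: the projection is oblivious to the data, which is what makes the lemma independent of any structure in the input.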

Proof of Lemma 2.4.34. We prove one direction. Specifically, we establish
\[
\mathbb{P}[\|Lz\|_2 \ge 1 + \theta] \le \exp\Big(-\frac{3}{4} d\theta^2\Big). \tag{2.4.41}
\]
Note that the right-hand side is $\le \delta$ for $d = \Theta(\theta^{-2} \log \delta^{-1})$. An inequality in the other direction can be proved similarly by working with $-W$ (where $W$ is defined below).
Recall that a sum of independent Gaussians is Gaussian (just compute the convolution and complete the squares). So
\[
(Az)_k \sim N(0, \|z\|_2^2) = N(0,1), \qquad \forall k,
\]
where we argued as in (2.4.38) to compute the variance. Hence
\[
W = \|Az\|_2^2 = \sum_{k=1}^d (Az)_k^2
\]
is a sum of squares of independent Gaussians, that is, $\chi^2$-distributed random variables. By (2.4.18) and independence,
\[
M_W(s) = \frac{1}{(1 - 2s)^{d/2}}.
\]
Applying the Chernoff-Cramér bound (2.4.2) with $s = \frac{1}{2}(1 - d/\beta)$ gives
\[
\mathbb{P}[W \ge \beta] \le \frac{M_W(s)}{e^{s\beta}} = \frac{1}{e^{s\beta}(1 - 2s)^{d/2}} = \Big(\frac{\beta}{d}\Big)^{d/2} e^{(d - \beta)/2}.
\]
(Alternatively, we could have used the general Bernstein inequality (Theorem 2.4.16).) Finally, take $\beta = d(1 + \theta)^2$. Rearranging we get
\[
\begin{aligned}
\mathbb{P}[\|Lz\|_2 \ge 1 + \theta] &= \mathbb{P}[\|Az\|_2^2 \ge d(1 + \theta)^2] \\
&= \mathbb{P}[W \ge \beta] \\
&\le e^{d[1 - (1+\theta)^2]/2} (1 + \theta)^d \\
&= \exp\big(-d(\theta + \theta^2/2 - \log(1 + \theta))\big) \\
&\le \exp\Big(-\frac{3}{4} d\theta^2\Big),
\end{aligned}
\]
where we used that $\log(1 + x) \le x - x^2/4$ on $[0,1]$ (see Exercise 1.16).


Remark 2.4.35. The Johnson-Lindenstrauss lemma is essentially optimal [Alo03, Sec-
tion 9]: any set of n points with all pairwise distances in [1 − θ, 1 + θ] requires at least
Ω(log n/(θ2 log θ−1 )) dimensions. Note however that it relies crucially on the use of the
Euclidean norm [BC03].

To give some further geometric insights into the proof, we make a series of observations:
1. The $d$ rows of $\frac{1}{\sqrt{n}} A$ are "on average" orthonormal. Indeed, note that for $i \ne j$,
\[
\mathbb{E}\Big[\frac{1}{n} \sum_{k=1}^n A_{ik} A_{jk}\Big] = \mathbb{E}[A_{i1}]\, \mathbb{E}[A_{j1}] = 0,
\]
by independence, and
\[
\mathbb{E}\Big[\frac{1}{n} \sum_{k=1}^n A_{ik}^2\Big] = \mathbb{E}[A_{i1}^2] = 1,
\]
since the $A_{ik}$s have mean 0 and variance 1. When $n$ is large, those two quantities are concentrated around their mean. Fix a unit vector $z$. Then $\frac{1}{\sqrt{n}} Az$ corresponds approximately to an orthogonal projection of $z$ onto a uniformly chosen random subspace of dimension $d$.
2. Now observe that projecting $z$ on a uniform random subspace of dimension $d$ can be done in the following way: first apply a uniformly chosen random rotation to $z$; and then project the resulting vector on the first $d$ dimensions. In other words, $\frac{1}{\sqrt{n}} \|Az\|_2$ is approximately distributed as the norm of the first $d$ components of a uniform unit vector in $\mathbb{R}^n$. To analyze this quantity, note that a vector in $\mathbb{R}^n$ whose components are independent $N(0,1)$, when divided by its norm, produces a uniform unit vector in $\mathbb{R}^n$. When $d$ is large, the norm of the first $d$ components of that vector is therefore a ratio whose numerator is concentrated around $\sqrt{d}$ and whose denominator is concentrated around $\sqrt{n}$ (by calculations similar to those in the first point above).
3. Hence $\|Lz\|_2 = \sqrt{\frac{n}{d}} \times \frac{1}{\sqrt{n}} \|Az\|_2$ should be concentrated around 1.
The Johnson-Lindenstrauss lemma makes it possible to solve certain computational problems (e.g., finding the nearest point to a query) more efficiently by working in a smaller dimension. We discuss a different application of the "random projection method" next.
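Observation 2 above is easy to check by simulation. The following sketch (a hypothetical illustration, not from the text) draws uniform unit vectors in $\mathbb{R}^n$ by normalizing Gaussian vectors and averages the norm of their first $d$ coordinates:

```python
import math
import random

random.seed(7)
n, d, trials = 2000, 200, 200

avg = 0.0
for _ in range(trials):
    g = [random.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(x * x for x in g))
    u = [x / norm for x in g]                     # uniform unit vector in R^n
    avg += math.sqrt(sum(x * x for x in u[:d]))   # norm of first d coordinates
avg /= trials

target = math.sqrt(d / n)  # sqrt(d)/sqrt(n), per the discussion above
print(f"mean norm of first {d} coords: {avg:.3f} vs sqrt(d/n) = {target:.3f}")
```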

Compressed sensing
In the compressed sensing problem, one seeks to recover a signal $x \in \mathbb{R}^n$ from a small number of linear measurements $(Lx)_i$, $i = 1, \ldots, d$. In complete generality, one needs $n$ such measurements to recover any unknown $x \in \mathbb{R}^n$, as the sensing matrix $L$ must be invertible (or, more precisely, injective). However, by imposing extra structure on the signal and choosing the sensing matrix appropriately, much better results can be obtained. Compressed sensing relies on sparsity.
Definition 2.4.36 (Sparse vectors). We say that a vector $z \in \mathbb{R}^n$ is $k$-sparse if it has at most $k$ non-zero entries. We let $S_k^n$ be the set of $k$-sparse vectors in $\mathbb{R}^n$.
Note that $S_k^n$ is a union of $\binom{n}{k}$ linear subspaces, one for each support of the nonzero entries.
To solve the compressed sensing problem over $k$-sparse vectors, it suffices to find a sensing matrix $L$ such that all subsets of $2k$ of its columns are linearly independent. Indeed, if $x, x' \in S_k^n$, then $x - x'$ has at most $2k$ nonzero entries. Hence, in order to have $L(x - x') = 0$, it must be that $x - x' = 0$ under the previous condition on $L$. That implies the required injectivity. The implication goes in the other direction as well. Observe for instance that the matrix used in the proof of the Johnson-Lindenstrauss lemma satisfies this property as long as $d \ge 2k$: because of the continuous density of its entries, the probability that $2k$ of its columns are linearly dependent is 0 when $d \ge 2k$. For practical applications, however, other requirements must be met, in particular computational efficiency and robustness. We describe such an approach.
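The zero-probability claim about linearly dependent columns is easy to probe numerically. This hypothetical NumPy check draws random $2k$-column submatrices of a Gaussian matrix with $d = 2k$ rows and verifies full column rank:

```python
import numpy as np

rng = np.random.default_rng(3)

n, k = 60, 5
d = 2 * k                        # the minimal number of measurements
A = rng.standard_normal((d, n))

# Any 2k columns should be linearly independent with probability 1.
full_rank = all(
    np.linalg.matrix_rank(A[:, rng.choice(n, size=2 * k, replace=False)]) == 2 * k
    for _ in range(100)
)
print("all sampled 2k-column submatrices full rank:", full_rank)
```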
The following definition will play a key role. Roughly speaking, a restricted isometry preserves enough of the metric structure of $S_k^n$ to be invertible on its image.
Definition 2.4.37 (Restricted isometry property). A $d \times n$ linear mapping $L$ satisfies the $(k, \theta)$-restricted isometry property (RIP) if for all $z \in S_k^n$,
\[
(1 - \theta)\|z\|_2 \le \|Lz\|_2 \le (1 + \theta)\|z\|_2. \tag{2.4.42}
\]
We say that $L$ is $(k, \theta)$-RIP.
Given a $(k, \theta)$-RIP matrix $L$, can we recover $z \in S_k^n$ from $Lz$? And how small can $d$ be? The next two claims answer these questions.
Lemma 2.4.38 (Sensing matrix). Let $A$ be a $d \times n$ matrix whose entries are i.i.d. $N(0,1)$ and let $L = \frac{1}{\sqrt{d}} A$. There is a constant $0 < C < +\infty$ such that if $d \ge Ck \log n$, then $L$ is $(10k, 1/3)$-RIP with probability at least $1 - 1/n$.
Lemma 2.4.39 (Sparse signal recovery). Let $L$ be $(10k, 1/3)$-RIP. Then for any $x \in S_k^n$, the unique solution to the following minimization problem
\[
\min_{z \in \mathbb{R}^n} \|z\|_1 \quad \text{subject to} \quad Lz = Lx, \tag{2.4.43}
\]
is $z^* = x$.
It may seem that a more natural approach, compared to (2.4.43), would be to instead minimize the number of non-zero entries in $z$, that is, $\|z\|_0$. However, the advantage of the $\ell_1$ norm is that the problem can then be formulated as a linear program, that is, the minimization of a linear objective subject to linear inequalities (see Exercise 2.13). This permits much faster computation of the solution using standard techniques, while still leading to a sparse solution. See Figure 2.8 for some insights into why $\ell_1$ indeed promotes sparsity.
Putting the two lemmas together shows that:
Claim 2.4.40. Let $L$ be as above with $d = \Theta(k \log n)$ as required by Lemma 2.4.38. With probability $1 - o(1)$, any $x \in S_k^n$ can be recovered from the input $Lx$ by solving (2.4.43).
Note that $d$ can in general be much smaller than $n$ and not far from the $2k$ bound we derived above.

Figure 2.8: Because $\ell_1$ balls (squares) have corners, minimizing the $\ell_1$ norm over a linear subspace (line) tends to produce sparse solutions.

ε-net argument. We start with the proof of Lemma 2.4.38. The claim does not follow immediately from the (distributional) Johnson-Lindenstrauss lemma (i.e., Lemma 2.4.34). Indeed that lemma implies that a (normalized) matrix with i.i.d. standard Gaussian entries is an approximate isometry on a finite set of points. Here we need a linear mapping that is an approximate isometry for all vectors in $S_k^n$, an uncountable space.
For a subset of indices $J \subseteq [n]$ and a vector $y \in \mathbb{R}^n$, we let $y_J$ be the vector $y$ restricted to the entries in $J$, that is, the subvector $(y_j)_{j \in J}$. Fix a subset of indices $I \subseteq [n]$ of size $10k$. We need the RIP condition (Definition 2.4.37) to hold for all $z \in \mathbb{R}^n$ with non-zero entries in $I$ (and all such $I$). The way to achieve this is to use an ε-net argument, as described in Section 2.4.4. Indeed, notice that, for $z \ne 0$, the function $\|Lz\|_2/\|z\|_2$:
1. does not depend on the norm of $z$, so that we can restrict ourselves to the compact set $\partial B_I := \{z : z_{[n] \setminus I} = 0, \|z\|_2 = 1\}$; and
2. is continuous on $\partial B_I$, so that it suffices to construct a fine enough covering of $\partial B_I$ by a finite collection of balls (i.e., an ε-net) and apply Lemma 2.4.34 to the centers of those balls.
Proof of Lemma 2.4.38. Let $I \subseteq [n]$ be a subset of indices of size $k' := 10k$. There are $\binom{n}{k'} \le n^{k'} = \exp(k' \log n)$ such subsets and we denote their collection by $\mathcal{I}(k', n)$. We let $N_I$ be an ε-net of $\partial B_I$. By Claim 2.4.26, we can choose one in $\partial B_I$ of size at most $(3/\varepsilon)^{k'}$. We take
\[
\varepsilon = \frac{1}{6 C' \sqrt{n \log n}},
\]
for a constant $C'$ that will be determined below. The reason for this choice will become clear when we set $C'$. The union of all ε-nets has size
\[
\Big|\bigcup_{I \in \mathcal{I}(k', n)} N_I\Big| \le n^{k'} \Big(\frac{3}{\varepsilon}\Big)^{k'} \le \exp(C'' k' \log n),
\]
for some $C'' > 0$. Our goal is to show that
\[
\sup_{z \in \bigcup_{I \in \mathcal{I}(k', n)} \partial B_I} \big|\|Lz\|_2 - 1\big| \le \frac{1}{3}. \tag{2.4.44}
\]
We seek to apply the inequality (2.4.32).
Applying the "distributional" Johnson-Lindenstrauss lemma to the ε-nets: The first step is to control the supremum in (2.4.44), restricted to the ε-nets. Lemma 2.4.34 is exactly what we need for this. Take $\theta = 1/6$, $\delta = 1/(2n|\cup_I N_I|)$, and
\[
d = \Theta\big(\theta^{-2} \log(2n|\cup_I N_I|)\big) = \Theta(k' \log n),
\]


as required by the lemma. Then, by a union bound over the $N_I$s, with probability $1 - 1/(2n)$ we have
\[
\sup_{z \in \cup_I N_I} \big|\|Lz\|_2 - 1\big| \le \frac{1}{6}. \tag{2.4.45}
\]

Lipschitz continuity: The next step is to establish Lipschitz continuity of $|\|Lz\|_2 - 1|$. For vectors $y, z \in \mathbb{R}^n$, by repeated applications of the triangle inequality, we have
\[
\big||\|Lz\|_2 - 1| - |\|Ly\|_2 - 1|\big| \le \big|\|Lz\|_2 - \|Ly\|_2\big| \le \|L(z - y)\|_2.
\]
To bound the rightmost expression, we let $A_*$ be the largest entry of $A$ in absolute value and note that
\[
\|L(z - y)\|_2^2 = \sum_{i=1}^d \Big(\sum_{j=1}^n L_{ij}(z_j - y_j)\Big)^2 \le \sum_{i=1}^d \Big(\sum_{j=1}^n L_{ij}^2\Big)\Big(\sum_{j=1}^n (z_j - y_j)^2\Big) \le dn \Big(\frac{1}{\sqrt{d}} A_*\Big)^2 \|z - y\|_2^2 = n A_*^2 \|z - y\|_2^2,
\]
where we used Cauchy-Schwarz (Theorem B.4.8) in the first inequality. Taking a square root, we see that the (random) Lipschitz constant of $|\|Lz\|_2 - 1|$ (with respect to the Euclidean metric) is at most $K := \sqrt{n} A_*$.
Controlling the Lipschitz constant: So it remains to control $A_*$. For this we use the Chernoff-Cramér bound for Gaussians (see (2.4.4)), which implies by a union bound over the entries of $A$ that
\[
\mathbb{P}\big[A_* \ge C' \sqrt{\log n}\big] \le \mathbb{P}\big[\exists i, j,\ |A_{ij}| \ge C' \sqrt{\log n}\big] \le n^2 \exp\Big(-\frac{(C' \sqrt{\log n})^2}{2}\Big) \le \frac{1}{2n},
\]
for a $C' > 0$ large enough. Hence with probability $1 - 1/(2n)$, we have $A_* < C' \sqrt{\log n}$ and
\[
K\varepsilon \le \frac{1}{6}, \tag{2.4.46}
\]
by the choice of $\varepsilon$ made previously.


Putting everything together: We apply (2.4.32). Combining (2.4.45) and (2.4.46),
with probability 1 − 1/n, the claim (2.4.44) holds. That concludes the proof.

$\ell_1$ minimization. Finally we prove Lemma 2.4.39 (which can be skipped).

Proof of Lemma 2.4.39. Let z∗ be a solution to (2.4.43) and note that such a solu-
tion exists because z = x satisfies the constraint. Without loss of generality assume
that only the first k entries of x are nonzero, that is, x[n]\[k] = 0. Moreover order
the remaining entries of x so that the residual r = z∗ − x has its entries r[n]\[k] in
nonincreasing order in absolute value. Our goal is to show that krk2 = 0.
In order to leverage the RIP condition, we break up the vector $r$ into $9k$-long subvectors. Let
\[
I_0 = [k], \qquad I_i = \{(9(i-1) + 1)k + 1, \ldots, (9i + 1)k\}, \quad \forall i \ge 1,
\]
and $\bar{I}_i = \bigcup_{j > i} I_j$. We will also need $I_{01} = I_0 \cup I_1$ and $\bar{I}_{01} = \bar{I}_1$.
We first use the optimality of $z^*$. Note that $x_{\bar{I}_0} = 0$ implies that
\[
\|z^*\|_1 = \|z^*_{I_0}\|_1 + \|z^*_{\bar{I}_0}\|_1 = \|z^*_{I_0}\|_1 + \|r_{\bar{I}_0}\|_1,
\]
and
\[
\|x\|_1 = \|x_{I_0}\|_1 \le \|z^*_{I_0}\|_1 + \|r_{I_0}\|_1,
\]
by the triangle inequality. Since $\|z^*\|_1 \le \|x\|_1$ by optimality (and the fact that $x$ satisfies the constraint), we then have
\[
\|r_{\bar{I}_0}\|_1 \le \|r_{I_0}\|_1. \tag{2.4.47}
\]

On the other hand, the RIP condition gives a similar inequality in the other direction. Indeed, notice that $Lr = 0$ by the constraint in (2.4.43) or, put differently, $Lr_{I_{01}} = -\sum_{i \ge 2} Lr_{I_i}$. Then, by the RIP condition and the triangle inequality, we have that
\[
\frac{2}{3}\|r_{I_{01}}\|_2 \le \|Lr_{I_{01}}\|_2 \le \sum_{i \ge 2} \|Lr_{I_i}\|_2 \le \frac{4}{3} \sum_{i \ge 2} \|r_{I_i}\|_2, \tag{2.4.48}
\]
where we used the fact that by construction $r_{I_{01}}$ is $10k$-sparse and each $r_{I_i}$ is $9k$-sparse.
We note that by the ordering of the entries of $r$,
\[
\|r_{I_{i+1}}\|_2^2 \le 9k \Big(\frac{\|r_{I_i}\|_1}{9k}\Big)^2 = \frac{\|r_{I_i}\|_1^2}{9k}, \tag{2.4.49}
\]
where we bounded each entry of $r_{I_{i+1}}$ by the expression in parentheses. Combining (2.4.47) and (2.4.49), and using that $\|r_{I_0}\|_1 \le \sqrt{k}\|r_{I_0}\|_2$ by Cauchy-Schwarz, we have
\[
\sum_{i \ge 2} \|r_{I_i}\|_2 \le \sum_{j \ge 1} \frac{\|r_{I_j}\|_1}{\sqrt{9k}} = \frac{\|r_{\bar{I}_0}\|_1}{3\sqrt{k}} \le \frac{\|r_{I_0}\|_1}{3\sqrt{k}} \le \frac{\|r_{I_0}\|_2}{3} \le \frac{\|r_{I_{01}}\|_2}{3}.
\]
Plugging this back into (2.4.48) gives
\[
\|r_{I_{01}}\|_2 \le 2 \sum_{i \ge 2} \|r_{I_i}\|_2 \le \frac{2}{3}\|r_{I_{01}}\|_2,
\]
which implies $r_{I_{01}} = 0$. In particular $r_{I_0} = 0$ and, by (2.4.47), $r_{\bar{I}_0} = 0$ as well. We have shown that $r = 0$. Or, in other words, $z^* = x$.
Remark 2.4.41. Lemma 2.4.39 can be extended to noisy measurements using a modifica-
tion of (2.4.43). This provides some robustness to noise which is important in applications.
See [CRT06b].

2.4.6 . Data science: classification, empirical risk minimization and VC dimension
In the binary classification problem, one is given samples $S_n = \{(X_i, C(X_i))\}_{i=1}^n$, where $X_i \in \mathbb{R}^d$ is a feature vector and $C(X_i) \in \{0,1\}$ is a label. The feature vectors are assumed to be independent samples from an unknown probability measure $\mu$ and $C : \mathbb{R}^d \to \{0,1\}$ is a measurable Boolean function. For instance, the feature vector might be an image (encoded as a vector) and the label might indicate "cat" (label 0) or "dog" (label 1). Our goal is to learn the function (or concept) $C$ from the samples.
More precisely, we seek to construct a hypothesis $h : \mathbb{R}^d \to \{0,1\}$ that is a good approximation to $C$ in the sense that it predicts the label well on a new sample (from the same distribution). Formally, we want $h$ to have small true risk (or generalization error)
\[
R(h) = \mathbb{P}[h(X) \ne C(X)],
\]
where $X \sim \mu$. Because we only have access to the distribution $\mu$ through the samples, it is natural to estimate the true risk of the hypothesis $h$ using the samples as
\[
R_n(h) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}\{h(X_i) \ne C(X_i)\},
\]

which is called the empirical risk. Indeed, observe that $\mathbb{E} R_n(h) = R(h)$ and, by the law of large numbers, $R_n(h) \to R(h)$ almost surely as $n \to +\infty$. Ignoring computational considerations, one can then formally define an empirical risk minimizer
\[
h^* \in \mathrm{ERM}_{\mathcal{H}}(S_n) = \{h \in \mathcal{H} : R_n(h) \le R_n(h'),\ \forall h' \in \mathcal{H}\},
\]
where $\mathcal{H}$, the hypothesis class, is a given collection of Boolean functions over $\mathbb{R}^d$. (We assume that $h^*$ can be defined as a measurable function of the samples.)

Overfitting. Why restrict the hypothesis class? It turns out that minimizing the empirical risk over all Boolean functions makes it impossible to achieve an arbitrarily small risk. Intuitively, considering too rich a class of functions, that is, functions that too intricately follow the data, leads to overfitting: the learned hypothesis will fit the sampled data, but it may not generalize well to unseen examples. A learner $A$ is a map from samples to measurable Boolean functions over $\mathbb{R}^d$; that is, for any $n$ and any $S_n \in (\mathbb{R}^d \times \{0,1\})^n$, the learner outputs a function $A(\cdot, S_n) : \mathbb{R}^d \to \{0,1\}$. The following theorem shows that any learner has fundamental limitations if all concepts are possible.

Theorem 2.4.42 (No Free Lunch). For any learner $A$ and any finite $\mathcal{X} \subseteq \mathbb{R}^d$ of even size $|\mathcal{X}| =: 2m > 4$, there exist a concept $C : \mathcal{X} \to \{0,1\}$ and a distribution $\mu$ over $\mathcal{X}$ such that
\[
\mathbb{P}[R(A(\cdot, S_m)) \ge 1/8] \ge 1/8, \tag{2.4.50}
\]
where $S_m = \{(X_i, C(X_i))\}_{i=1}^m$ with independent $X_i \sim \mu$.

The gist of the proof is intuitive. In essence, if the target concept is arbitrary and we
only get to see half of the possible instances, then we have learned nothing about
the other half and cannot expect low generalization error.

Proof of Theorem 2.4.42. We let $\mu$ be uniform over $\mathcal{X}$. To prove the existence of a concept satisfying (2.4.50), we use the probabilistic method (Section 2.2.1) and pick $C$ at random. For each $x \in \mathcal{X}$, we set $C(x) := Y_x$ where the $Y_x$s are i.i.d. uniform in $\{0,1\}$.
We first bound $\mathbb{E}[R(A(\cdot, S_m))]$, where the expectation runs over both the random labels $\{Y_x\}_{x \in \mathcal{X}}$ and the samples $S_m = \{(X_i, C(X_i))\}_{i=1}^m$. For an additional independent sample $X \sim \mu$, we will need the event that the learner, given samples $S_m$, makes an incorrect prediction on $X$,
\[
B = \{A(X, S_m) \ne Y_X\},
\]
and the event that $X$ is observed in the samples $S_m$,
\[
O = \{X \in \{X_1, \ldots, X_m\}\}.
\]
By the tower property (Lemma B.6.16),
\[
\begin{aligned}
\mathbb{E}[R(A(\cdot, S_m))] &= \mathbb{P}[B] \\
&= \mathbb{E}[\mathbb{P}[B \mid S_m]] \\
&= \mathbb{E}\big[\mathbb{P}[B \mid O, S_m]\mathbb{P}[O \mid S_m] + \mathbb{P}[B \mid O^c, S_m]\mathbb{P}[O^c \mid S_m]\big] \\
&\ge \mathbb{E}\big[\mathbb{P}[B \mid O^c, S_m]\mathbb{P}[O^c \mid S_m]\big] \\
&\ge \frac{1}{2} \times \frac{1}{2},
\end{aligned}
\]
where we used that:
• $\mathbb{P}[O^c \mid S_m] \ge 1/2$ because $|\mathcal{X}| = 2m$ and $\mu$ is uniform; and
• $\mathbb{P}[B \mid O^c, S_m] = 1/2$ because for any $x \notin \{X_1, \ldots, X_m\}$ the prediction $A(x, S_m) \in \{0,1\}$ is independent of $Y_x$ and the latter is uniform.
Conditioning over the concept, we have proved that
\[
\mathbb{E}\big[\mathbb{E}[R(A(\cdot, S_m)) \mid \{Y_x\}_{x \in \mathcal{X}}]\big] \ge \frac{1}{4}.
\]
Hence, by the first moment principle (Theorem 2.2.1),
\[
\mathbb{P}\big[\mathbb{E}[R(A(\cdot, S_m)) \mid \{Y_x\}_{x \in \mathcal{X}}] \ge 1/4\big] > 0,
\]
where the probability is taken over $\{Y_x\}_{x \in \mathcal{X}}$. That is, there exists a choice $\{y_x\}_{x \in \mathcal{X}} \in \{0,1\}^{\mathcal{X}}$ such that
\[
\mathbb{E}[R(A(\cdot, S_m)) \mid \{Y_x = y_x\}_{x \in \mathcal{X}}] \ge \frac{1}{4}. \tag{2.4.51}
\]
Finally, to prove (2.4.50), we use a variation on Markov's inequality (Theorem 2.1.1) for $[0,1]$-valued random variables. If $Z \in [0,1]$ is a random variable with $\mathbb{E}[Z] = \mu$ and $\alpha \in [0,1]$, then
\[
\mathbb{E}[Z] \le \alpha \times \mathbb{P}[Z < \alpha] + 1 \times \mathbb{P}[Z \ge \alpha] \le \mathbb{P}[Z \ge \alpha] + \alpha.
\]
Taking $\alpha = \mu/2$ gives
\[
\mathbb{P}[Z \ge \mu/2] \ge \mu/2.
\]
Going back to (2.4.51), we obtain
\[
\mathbb{P}\Big[R(A(\cdot, S_m)) \ge \frac{1}{8} \,\Big|\, \{Y_x = y_x\}_{x \in \mathcal{X}}\Big] \ge \frac{1}{8},
\]
establishing the claim.

The way out is to “limit the complexity” of the hypotheses. For instance, we
could restrict ourselves to half-spaces
n o
HH = h(x) = 1{xT u ≥ α} : u ∈ Rd , α ∈ R ,

or axis-aligned boxes

HB = {h(x) = 1{xi ∈ [αi , βi ], ∀i} : −∞ ≤ αi ≤ βi ≤ ∞, ∀i} .

In order for the empirical risk minimizer h∗ to have a generalization error close
to the best achievable error, we need the empirical risk of the learned hypothesis
Rn (h∗ ) to be close to its expectation R(h∗ ), which is guaranteed by the law of
large numbers for sufficiently large n. But that is not enough: we also need the
same property to hold for all hypotheses in H simultaneously. Otherwise we could
be fooled by a poorly performing hypothesis with unusually good empirical risk on
the samples. The hypothesis class is typically infinite and, therefore, controlling
empirical risk deviations from their expectations uniformly over H is not straight-
forward.

Uniform deviations Our goal in this section is to show how to bound

E[sup_{h∈H} {Rn(h) − R(h)}] = E[sup_{h∈H} {(1/n) Σ_{i=1}^n ℓ(h, Xi) − E[ℓ(h, X)]}]    (2.4.52)

in terms of a measure of complexity of the class H, where we defined the loss
ℓ(h, x) = 1{h(x) ≠ C(x)} to simplify the notation. We assume that H is countable.
(Observe for instance that, for HH and HB, nothing is lost by assuming that
the parameters defining the hypotheses are rational-valued.)
Controlling deviations uniformly over H as in (2.4.52) allows one to provide
guarantees on the empirical risk minimizer. Indeed, for any h0 ∈ H,

R(h∗) = Rn(h∗) + {R(h∗) − Rn(h∗)}
≤ Rn(h∗) + sup_{h∈H} {R(h) − Rn(h)}
≤ Rn(h0) + sup_{h∈H} {R(h) − Rn(h)}
= R(h0) + {Rn(h0) − R(h0)} + sup_{h∈H} {R(h) − Rn(h)}
≤ R(h0) + sup_{h∈H} {Rn(h) − R(h)} + sup_{h∈H} {R(h) − Rn(h)},

where, on the third line, we used the definition of the empirical risk minimizer.
Taking an infimum over h0, then an expectation over the samples, and rearranging
gives

E[R(h∗)] − inf_{h0∈H} R(h0)
≤ E[sup_{h∈H} {Rn(h) − R(h)}] + E[sup_{h∈H} {R(h) − Rn(h)}].    (2.4.53)

This inequality allows us to relate two quantities of interest: the expected true risk
of the empirical risk minimizer (i.e., E[R(h∗ )], where recall that h∗ is defined over
the samples) and the best possible true risk (i.e., inf h0 ∈H R(h0 )). The first term
on the right-hand side is (2.4.52) and the second one can be bounded in a similar
fashion as we argue below. Observe that the suprema are inside the expectations
and that the random variables Rn (h) − R(h) are highly correlated. Indeed, two
similar hypotheses will produce similar predictions. The correlation is ultimately
what allows us to tackle infinite classes H – as we saw in Section 2.4.4.
Indeed, to bound (2.4.52), we use the methods of Section 2.4.4. As a first
step, we apply the symmetrization trick, which we introduced in Section 2.4.2 to
give a proof of Hoeffding's lemma (Lemma 2.4.12). Let (εi)_{i=1}^n be i.i.d. uniform
random variables in {−1, +1} (i.e., Rademacher variables) and let (Xi′)_{i=1}^n be an
independent copy of (Xi)_{i=1}^n. Then

E[sup_{h∈H} {Rn(h) − R(h)}]
= E[sup_{h∈H} {(1/n) Σ_{i=1}^n ℓ(h, Xi) − E[ℓ(h, X)]}]
= E[sup_{h∈H} {(1/n) Σ_{i=1}^n [ℓ(h, Xi) − E[ℓ(h, Xi′) | (Xj)_{j=1}^n]]}]
= E[sup_{h∈H} E[(1/n) Σ_{i=1}^n [ℓ(h, Xi) − ℓ(h, Xi′)] | (Xj)_{j=1}^n]]
≤ E[sup_{h∈H} {(1/n) Σ_{i=1}^n [ℓ(h, Xi) − ℓ(h, Xi′)]}],

where on the fourth line we used "taking out what is known" (Lemma B.6.13)
and on the fifth line we used sup_h E[Yh] ≤ E[sup_h Yh] and the tower property. Next
we note that ℓ(h, Xi) − ℓ(h, Xi′) is symmetric and independent of εi (which is also
symmetric) to deduce that the last line above is

= E[sup_{h∈H} {(1/n) Σ_{i=1}^n εi [ℓ(h, Xi) − ℓ(h, Xi′)]}]
≤ E[sup_{h∈H} (1/n) Σ_{i=1}^n εi ℓ(h, Xi) + sup_{h∈H} (1/n) Σ_{i=1}^n (−εi) ℓ(h, Xi′)]
= 2 E[sup_{h∈H} (1/n) Σ_{i=1}^n εi ℓ(h, Xi)].

The exact same argument also applies to the second term on the right-hand side
of (2.4.53), so

E[R(h∗)] − inf_{h0∈H} R(h0) ≤ 4 E[sup_{h∈H} (1/n) Σ_{i=1}^n εi ℓ(h, Xi)].    (2.4.54)

Changing the normalization, we define the process

Zn(h) = (1/√n) Σ_{i=1}^n εi ℓ(h, Xi),    h ∈ H.    (2.4.55)

Our task reduces to upper bounding

E[sup_{h∈H} Zn(h)].    (2.4.56)

Note that we will not compute the best possible true risk (which in general could
be “bad,” i.e., large)—only how close the empirical risk minimizer gets to it.

VC dimension We make two observations about Zn(h).

1. It is centered. Also, as a weighted sum of independent random variables in
[−1, 1], it is sub-Gaussian with variance factor 1 by the general Hoeffding
inequality (Theorem 2.4.9) and Hoeffding's lemma (Lemma 2.4.12).

2. It depends only on the values of the hypothesis h at a finite number of points,
X1, . . . , Xn. Hence, while the supremum in (2.4.56) is over a potentially
infinite class of functions H, it is in effect a supremum over at most 2^n
functions, that is, all the possible restrictions of the h's to (Xi)_{i=1}^n.

A naive application of the maximal inequality in Lemma 2.4.21, together with the
two observations above, gives

E[sup_{h∈H} Zn(h)] ≤ √(2 log 2^n) = √(2n log 2).

Unfortunately, plugging this back into (2.4.54) gives an upper bound which fails to
converge to 0 as n → +∞.
To obtain a better bound, we show that in general the number of distinct restrictions
of H to n points can grow much more slowly than 2^n.
Definition 2.4.43 (Shattering). Let Λ = {ℓ1, . . . , ℓn} ⊆ R^d be a finite set and let
H be a class of Boolean functions on R^d. The restriction of H to Λ is

HΛ = {(h(ℓ1), . . . , h(ℓn)) : h ∈ H}.

We say that Λ is shattered by H if |HΛ| = 2^{|Λ|}, that is, if all Boolean functions over
Λ can be obtained by restricting a function in H to the points in Λ.

Definition 2.4.44 (VC dimension). Let H be a class of Boolean functions on R^d.
The VC dimension of H, denoted vc(H), is the maximum cardinality of a set shattered
by H.
We prove the following combinatorial lemma at the end of this section.

Lemma 2.4.45 (Sauer's lemma). Let H be a class of Boolean functions on R^d. For
any finite set Λ = {ℓ1, . . . , ℓn} ⊆ R^d,

|HΛ| ≤ (en / vc(H))^{vc(H)}.

That is, the number of distinct restrictions of H to any n points grows at most as
∝ n^{vc(H)}.
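To make the bound concrete, here is a small illustrative sketch (not from the text; the sample points and threshold values are arbitrary choices). It counts the distinct restrictions of one-dimensional threshold functions, a class of VC dimension 1, and compares the count to Sauer's bound (en/d)^d = en with d = 1, which is far below 2^n:

```python
import math

def restrictions(points, hypotheses):
    """Set of distinct 0/1 labelings induced on `points` by the class."""
    return {tuple(h(x) for x in points) for h in hypotheses}

# Thresholds h(x) = 1{x >= a}: this class shatters no two-point set
# (the labeling (1, 0) is impossible), so its VC dimension is 1.
points = [0.5, 1.7, 2.3, 4.0, 5.1]
thresholds = [lambda x, a=a: int(x >= a) for a in [-1, 1, 2, 3, 4.5, 6]]

count = len(restrictions(points, thresholds))
sauer_bound = math.e * len(points)   # (en/d)^d with d = vc = 1
print(count, "<=", sauer_bound)      # 6 <= 13.59..., versus 2**5 = 32
```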
Returning to E[sup_{h∈H} Zn(h)], we get the following inequality.

Lemma 2.4.46. There exists a constant 0 < C < +∞ such that, for any countable
class of measurable Boolean functions H over R^d,

E[sup_{h∈H} Zn(h)] ≤ C √(vc(H) log n).    (2.4.57)

Proof. Recall that Zn(h) ∈ sG(1). Since the supremum over H, when seen as
restricted to {X1, . . . , Xn}, is in fact a supremum over at most (en/vc(H))^{vc(H)} functions
by Sauer's lemma (Lemma 2.4.45), we have by Lemma 2.4.21

E[sup_{h∈H} Zn(h)] ≤ √(2 log[(en/vc(H))^{vc(H)}]).

That proves the claim.

Returning to (2.4.54), the previous lemma finally implies

E[R(h∗)] − inf_{h0∈H} R(h0) ≤ 4C √(vc(H) log n / n).
For hypothesis classes with finite VC dimension, the bound goes to 0 as n → +∞.
We give some examples.
Example 2.4.47 (VC dimension of half-spaces). Consider the class of half-spaces.
Claim 2.4.48.
vc(HH ) = d + 1.
We only prove the case d = 1, where HH reduces to half-lines (−∞, γ] or [γ, +∞).
Clearly any two-point set Λ = {ℓ1, ℓ2} ⊆ R is shattered by HH. On the other
hand, for any Λ = {ℓ1, ℓ2, ℓ3} with ℓ1 < ℓ2 < ℓ3, any half-line containing ℓ1 and
ℓ3 necessarily includes ℓ2 as well. Hence no set of size 3 is shattered by HH. J
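The d = 1 argument can also be checked by brute force. The sketch below (illustrative; a finite grid of thresholds stands in for all half-lines) tests whether a point set is shattered:

```python
def shattered(points, hypotheses):
    """True if `hypotheses` realizes all 2^n labelings of `points`."""
    labelings = {tuple(h(x) for x in points) for h in hypotheses}
    return len(labelings) == 2 ** len(points)

# Half-lines in d = 1: [a, +inf) and (-inf, a] for thresholds a on a grid.
grid = [a / 2 for a in range(-10, 11)]
half_lines = [lambda x, a=a: int(x >= a) for a in grid] \
           + [lambda x, a=a: int(x <= a) for a in grid]

print(shattered([1.0, 2.0], half_lines))       # True: two points shattered
print(shattered([1.0, 2.0, 3.0], half_lines))  # False: labeling (1,0,1) fails
```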
Example 2.4.49 (VC dimension of boxes). Consider the class of axis-aligned
boxes.
Claim 2.4.50.
vc(HB ) = 2d.
We only prove the case d = 2, where HB reduces to rectangles. The four-point
set Λ = {(−1, 0), (1, 0), (0, −1), (0, 1)} is shattered by HB . Indeed, the rectangle
[−1, 1] × [−1, 1] contains Λ, with each side of the rectangle containing one of
the points. Moving any side inward by ε < 1 removes the corresponding point
from the rectangle without affecting the other ones. Hence, any subset of Λ can be
obtained by this procedure.
On the other hand, let Λ = {ℓ1, . . . , ℓ5} ⊆ R² be any set of five distinct points.
If the points all lie on the same axis-aligned line, then an argument similar to the
half-line case in Claim 2.4.48 shows that Λ is not shattered. Otherwise consider
the axis-aligned rectangle with smallest area containing Λ. For each side of the
rectangle, choose one point of Λ that lies on it. These necessarily exist (otherwise
the rectangle could be made even smaller) and denote them by xN for the highest,
xE for the rightmost, xS for the lowest, and xW for the leftmost. Note that they
may not be distinct, but in any case at least one point in Λ, say ℓ5 without loss of
generality, is not in the list. Now observe that any axis-aligned rectangle containing
xN, xE, xS, xW must also contain ℓ5 since its coordinates are sandwiched between
the bounds defined by those points. Hence no set of size 5 is shattered. That proves
the claim. J
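The rectangle claims can likewise be verified by enumeration, using the observation from the proof that a labeling is realizable by some axis-aligned rectangle if and only if it is realizable by the bounding box of the points labeled 1 (any rectangle containing those points contains their bounding box). The following illustrative sketch checks the four-point diamond and a five-point superset:

```python
import itertools

def boxes_shatter(points):
    """Check whether axis-aligned rectangles shatter `points` in d = 2,
    testing each labeling via the bounding box of its 1-labeled points."""
    subsets = itertools.chain.from_iterable(
        itertools.combinations(points, r) for r in range(len(points) + 1))
    for ones in subsets:
        if not ones:
            continue  # the all-zero labeling is realized by an empty box
        xs, ys = [p[0] for p in ones], [p[1] for p in ones]
        lo_x, hi_x, lo_y, hi_y = min(xs), max(xs), min(ys), max(ys)
        inside = {p for p in points
                  if lo_x <= p[0] <= hi_x and lo_y <= p[1] <= hi_y}
        if inside != set(ones):  # the bounding box picks up an extra point
            return False
    return True

diamond = [(-1, 0), (1, 0), (0, -1), (0, 1)]
print(boxes_shatter(diamond))             # True: vc(HB) >= 4 when d = 2
print(boxes_shatter(diamond + [(0, 0)]))  # False: these 5 points fail
```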

These two examples also provide insights into Sauer's lemma. Consider the
case of rectangles for instance. Over a collection of n sample points, a rectangle
defines the same {0, 1}-labeling as the minimal-area rectangle containing the same
points. Because each side of a minimal-area rectangle must touch at least one point
in the sample, there are at most n^4 such rectangles, and hence there are at most
n^4 ≪ 2^n restrictions of HB to these sample points.

Application of chaining It turns out that the log n factor in (2.4.57) is not
optimal. We use chaining (Section 2.4.4) to improve the bound.
We claim that the process {Zn(h)}h∈H has sub-Gaussian increments under an
appropriately defined pseudometric. Indeed, conditioning on (Xi)_{i=1}^n, by the general
Hoeffding inequality (Theorem 2.4.9) and Hoeffding's lemma (Lemma 2.4.12),
we have that the increment (as a function of the εi's, which have variance factor 1)

Zn(g) − Zn(h) = Σ_{i=1}^n εi (ℓ(g, Xi) − ℓ(h, Xi)) / √n,

is sub-Gaussian with variance factor

Σ_{i=1}^n ((ℓ(g, Xi) − ℓ(h, Xi)) / √n)² × 1 = (1/n) Σ_{i=1}^n [ℓ(g, Xi) − ℓ(h, Xi)]².

Define the pseudometric

ρn(g, h) = [(1/n) Σ_{i=1}^n [ℓ(g, Xi) − ℓ(h, Xi)]²]^{1/2} = [(1/n) Σ_{i=1}^n [g(Xi) − h(Xi)]²]^{1/2},

where we used that ℓ(h, x) = 1{h(x) ≠ C(x)} by definition. It satisfies the
triangle inequality since it can be expressed as a Euclidean norm. In fact, it will be
useful to recast it in a more general setting. For a probability measure η over R^d,
define

‖g − h‖²_{L²(η)} = ∫_{R^d} (g(x) − h(x))² dη(x).
Let µn be the empirical measure

µn = µ_{(Xi)_{i=1}^n} := (1/n) Σ_{i=1}^n δ_{Xi},    (2.4.58)

where δx is the probability measure that puts mass 1 on x. Then, we can re-write

ρn(g, h) = ‖g − h‖_{L²(µn)}.


Hence we have shown that, conditioned on the samples, the process {Zn(h)}h∈H
has sub-Gaussian increments with respect to ‖ · ‖_{L²(µn)}. Note that the pseudometric
here is random, as it depends on the samples. However, by the law of large
numbers, ‖g − h‖_{L²(µn)} approaches its expectation, ‖g − h‖_{L²(µ)}, as n → +∞.
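For concreteness, the random pseudometric is easy to compute from data. In the following illustrative sketch (not from the text), µ is uniform on [0, 1] and g, h are two threshold hypotheses, so that ‖g − h‖_{L²(µ)} = √0.1, which the empirical version approximates:

```python
import math
import random

def emp_l2(g, h, xs):
    """Empirical L2(mu_n) distance between two Boolean hypotheses."""
    return math.sqrt(sum((g(x) - h(x)) ** 2 for x in xs) / len(xs))

random.seed(0)
xs = [random.uniform(0, 1) for _ in range(10_000)]

# g and h disagree exactly on [0.4, 0.5), a set of mu-measure 0.1.
g = lambda x: int(x >= 0.4)
h = lambda x: int(x >= 0.5)
print(emp_l2(g, h, xs), "vs", math.sqrt(0.1))  # empirical vs population value
```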
Applying the discrete Dudley inequality (Theorem 2.4.31), we obtain the following
bound.

Lemma 2.4.51. There exists a constant 0 < C < +∞ such that, for any countable
class of measurable Boolean functions H over R^d,

E[sup_{h∈H} Zn(h)] ≤ C E[Σ_{k=0}^{+∞} 2^{−k} √(log N(H, ‖ · ‖_{L²(µn)}, 2^{−k}))],

where µn is the empirical measure over the samples (Xi)_{i=1}^n.

Proof. Because H comprises only Boolean functions, it follows that under the
pseudometric ‖ · ‖_{L²(µn)} the diameter is bounded by 1. We apply the discrete
Dudley inequality conditioned on (Xi)_{i=1}^n. Then we take an expectation over the
samples.

Our use of the symmetrization trick is more intuitive than it may have appeared at
first. The central limit theorem indicates that the fluctuations of centered averages
such as

(Rn(g) − R(g)) − (Rn(h) − R(h))

tend to cancel out and that, in the limit, the variance alone characterizes the overall
behavior. The εi's in some sense explicitly capture the canceling part of this
phenomenon, while ρn captures the scale of the resulting global fluctuations in the
increments.
Our final task is to bound the covering numbers N(H, ‖ · ‖_{L²(µn)}, 2^{−k}).

Theorem 2.4.52 (Covering numbers via VC dimension). There exists a constant
0 < C < +∞ such that, for any class of measurable Boolean functions H over
R^d, any probability measure η over R^d and any ε ∈ (0, 1),

N(H, ‖ · ‖_{L²(η)}, ε) ≤ (2/ε)^{C vc(H)}.

Before proving Theorem 2.4.52, we derive its implications for uniform deviations.
Compare the following bound to Lemma 2.4.46.

Lemma 2.4.53. There exists a constant 0 < C < +∞ such that, for any countable
class of measurable Boolean functions H over R^d,

E[sup_{h∈H} Zn(h)] ≤ C √(vc(H)).

Proof. By Lemma 2.4.51 and Theorem 2.4.52,

E[sup_{h∈H} Zn(h)] ≤ C E[Σ_{k=0}^{+∞} 2^{−k} √(log N(H, ‖ · ‖_{L²(µn)}, 2^{−k}))]
≤ C E[Σ_{k=0}^{+∞} 2^{−k} √(log (2/2^{−k})^{C′ vc(H)})]
= C √(vc(H)) E[Σ_{k=0}^{+∞} 2^{−k} √(k + 1) √(C′ log 2)]
≤ C″ √(vc(H)),

for some 0 < C″ < +∞.

It remains to prove Theorem 2.4.52.

Proof of Theorem 2.4.52. Let G = {g1, . . . , gN} ⊆ H be a maximal ε-packing of
H with N ≥ N(H, ‖ · ‖_{L²(η)}, ε), which exists by Lemma 2.4.24. We use the probabilistic
method (Section 2.2) and Hoeffding's inequality for bounded variables
(Theorem 2.4.10) to show that there exists a small number of points {x1, . . . , xm}
such that G is still a good packing when H is restricted to the xi's. Then we use
Sauer's lemma (Lemma 2.4.45) to conclude.

1. Restriction. By construction, the collection G satisfies

‖gi − gj‖_{L²(η)} > ε,    ∀i ≠ j.

For an integer m that we will choose as small as possible below, let X =
{X1, . . . , Xm} be i.i.d. samples from η and let µX be the corresponding
empirical measure (as defined in (2.4.58)). Observe that, for any i ≠ j,

E[‖gi − gj‖²_{L²(µX)}] = E[(1/m) Σ_{k=1}^m [gi(Xk) − gj(Xk)]²] = ‖gi − gj‖²_{L²(η)}.

Moreover [gi(Xk) − gj(Xk)]² ∈ [0, 1]. Hence, by Hoeffding's inequality
there exists a constant 0 < C < +∞ and an m ≤ Cε^{−4} log N such that

P[‖gi − gj‖²_{L²(η)} − ‖gi − gj‖²_{L²(µX)} ≥ 3ε²/4]
= P[m ‖gi − gj‖²_{L²(η)} − Σ_{k=1}^m [gi(Xk) − gj(Xk)]² ≥ m · 3ε²/4]
≤ exp(−2 (m · 3ε²/4)² / m)
= exp(−(9/8) m ε⁴)
< 1/N².

That implies that, for this choice of m, after a union bound over the fewer
than N² pairs {i, j},

P[‖gi − gj‖_{L²(µX)} > ε/2, ∀i ≠ j] > 0,

where the probability is over the samples and we used the assumption on the
collection G. Therefore, there must be a set X = {x1, . . . , xm} ⊆ R^d such
that

‖gi − gj‖_{L²(µX)} > ε/2,    ∀i ≠ j.    (2.4.59)

2. VC bound. In particular, by (2.4.59), the functions in G restricted to X are
distinct. By Sauer's lemma (Lemma 2.4.45),

N = |GX| ≤ |HX| ≤ (em / vc(H))^{vc(H)} ≤ (eCε^{−4} log N / vc(H))^{vc(H)}.    (2.4.60)

Using that (1/(2D)) log N = log N^{1/(2D)} ≤ N^{1/(2D)}, where D = vc(H), we get

(eCε^{−4} log N / vc(H))^{vc(H)} ≤ (C′ ε^{−4})^{vc(H)} N^{1/2},    (2.4.61)

where C′ = 2eC. Plugging (2.4.61) back into (2.4.60) and rearranging gives

N ≤ (C′ ε^{−4})^{2 vc(H)}.

That concludes the proof.



Proof of Sauer's lemma Recall from Appendix A (see also Exercise 1.4) that
for integers 0 < d ≤ n,

Σ_{k=0}^d (n choose k) ≤ (en/d)^d.    (2.4.62)
Sauer's lemma (Lemma 2.4.45) follows from the following claim.

Lemma 2.4.54 (Pajor). Let H be a class of Boolean functions on R^d and let Λ =
{ℓ1, . . . , ℓn} ⊆ R^d be any finite subset. Then

|HΛ| ≤ |{S ⊆ Λ : S is shattered by H}|,

where the right-hand side includes the empty set.

Going back to Sauer's lemma, by Lemma 2.4.54 we have the upper bound

|HΛ| ≤ |{S ⊆ Λ : S is shattered by H}|.

By definition of the VC dimension (Definition 2.4.44), the subsets S ⊆ Λ that
are shattered by H have size at most vc(H). So the right-hand side is bounded
above by the total number of subsets of size at most d = vc(H) of a set of size n.
By (2.4.62), this gives

|HΛ| ≤ (en / vc(H))^{vc(H)},

which establishes Sauer's lemma.
So it remains to prove Lemma 2.4.54.

Proof of Lemma 2.4.54. We prove the claim by induction on the size n of Λ. The
result is trivial for n = 1. Assume the result is true for any H and any subset of
size n − 1. To apply induction, for ι = 0, 1 we let

Hι = {h ∈ H : h(ℓn) = ι},

and we set

Λ0 = {ℓ1, . . . , ℓn−1}.

It will be convenient to introduce the following notation

S(Λ; H) = |{S ⊆ Λ : S is shattered by H}|.

Because |HΛ| = |(H0)Λ0| + |(H1)Λ0| and the induction hypothesis implies S(Λ0; Hι) ≥
|(Hι)Λ0| for ι = 0, 1, it suffices to show that

S(Λ; H) ≥ S(Λ0; H0) + S(Λ0; H1).    (2.4.63)

There are two types of sets that contribute to the right-hand side.

- One but not both. Let S ⊆ Λ0 be a set that contributes to one of S(Λ0; H0)
or S(Λ0; H1) but not both. Then S is a subset of the larger set Λ and it is
certainly shattered by the larger collection H. Hence it also contributes to
the left-hand side of (2.4.63).

- Both. Let S ⊆ Λ0 be a set that contributes to both S(Λ0; H0) and S(Λ0; H1).
Hence it contributes two to the right-hand side of (2.4.63). As in the previous
point, it is also included in S(Λ; H), but it only contributes one to the left-hand
side of (2.4.63). It turns out that there is another set that contributes
one to the left-hand side but zero to the right-hand side: the subset S ∪ {ℓn}.
Indeed, by definition of Hι, the subset S ∪ {ℓn} cannot be shattered by it
since all functions in it take the same value on ℓn. On the other hand, any
Boolean function h on S ∪ {ℓn} with h(ℓn) = ι is realized in Hι since S
itself is shattered by Hι.

That concludes the proof.



Exercises
Exercise 2.1 (Moments of nonnegative random variables). Prove (B.5.1). [Hint:
Use Fubini’s Theorem to compute the integral.]
Exercise 2.2 (Bonferroni inequalities). Let A1, . . . , An be events and Bn := ∪i Ai.
Define

S^(r) := Σ_{1≤i1<···<ir≤n} P[Ai1 ∩ · · · ∩ Air],

and

Xn := Σ_{i=1}^n 1_{Ai}.

(i) Let x0 ≤ x1 ≤ · · · ≤ xs ≥ xs+1 ≥ · · · ≥ xm be a unimodal sequence of
nonnegative reals such that Σ_{j=0}^m (−1)^j xj = 0. Show that Σ_{j=0}^ℓ (−1)^j xj is
≥ 0 for even ℓ and ≤ 0 for odd ℓ.

(ii) Show that, for all r,

Σ_{1≤i1<···<ir≤n} 1_{Ai1} 1_{Ai2} · · · 1_{Air} = (Xn choose r).

(iii) Use (i) and (ii) to show that when ℓ ∈ [n] is odd

P[Bn] ≤ Σ_{r=1}^ℓ (−1)^{r−1} S^(r),

and when ℓ ∈ [n] is even

P[Bn] ≥ Σ_{r=1}^ℓ (−1)^{r−1} S^(r).

These inequalities are called Bonferroni inequalities. The case ℓ = 1 is
Boole's inequality.
Exercise 2.3 (Percolation on Z2 : a better bound). Let E1 be the event that all edges
are open in [−N, N ]2 and E2 be the event that there is no closed self-avoiding dual
cycle surrounding [−N, N ]2 . By looking at E1 ∩ E2 , show that θ(p) > 0 for
p > 2/3.
Exercise 2.4 (Percolation on Zd : existence of critical threshold). Consider bond
percolation on Ld .

(i) Show that pc (Ld ) > 0. [Hint: Count self-avoiding paths.]

(ii) Show that pc (Ld ) < 1. [Hint: Use the result for L2 .]

Exercise 2.5 (Sums of uncorrelated variables). Centered random variables X1, X2, . . .
are uncorrelated if

E[Xr Xs] = 0,    ∀r ≠ s.

(i) Assume further that Var[Xr] ≤ C < +∞ for all r. Show that

P[|(1/n) Σ_{r≤n} Xr| ≥ β] ≤ C / (β² n).

(ii) Use (i) to prove Theorem 2.1.6.

Exercise 2.6 (Pairwise independence: lack of concentration). Let U = (U1, . . . , Uℓ)
be uniformly distributed over {0, 1}^ℓ. Let n = 2^ℓ − 1. For all v ∈ {0, 1}^ℓ \ {0}, define

Xv = ⟨U, v⟩ mod 2.

(i) Show that the random variables Xv, v ∈ {0, 1}^ℓ \ {0}, are uniformly distributed
in {0, 1} and pairwise independent.

(ii) Show that for any event A measurable with respect to σ(Xv, v ∈ {0, 1}^ℓ \ {0}),
P[A] is either 0 or ≥ 1/(n + 1).

Exercise 2.5 shows that pairwise independence implies “polynomial concentra-


tion” of the average of square-integrable Xv s. On the other hand, the current ex-
ercise suggests that in general pairwise independence cannot imply “exponential
concentration.”

Exercise 2.7 (Chernoff bound for Poisson trials). Using the Chernoff-Cramér method,
prove part (i) of Theorem 2.4.7. Show that part (ii) follows from part (i).

Exercise 2.8 (Stochastic knapsack: some details). Consider the stochastic frac-
tional knapsack problem in Section 2.4.3.

(i) Prove that the greedy algorithm described there gives an optimal solution to
problem (2.4.21).

(ii) Prove Claim 2.4.20 for τ ∈ (0, 1/6).

Exercise 2.9 (Stochastic knapsack: 0-1 version). Consider the stochastic fractional
knapsack problem in Section 2.4.3.

(i) Adapt the greedy algorithm for the 0-1 knapsack problem and show that it is
not optimal in general. [Hint: Construct a counter-example with two items.]

(ii) Prove Claim 2.4.20 for the greedy solution of (i).

Exercise 2.10 (A proof of Pólya’s theorem). Let (St ) be simple random walk on
Ld started at the origin 0.

(i) For d = 1, use Stirling's formula (see Appendix A) to show that P[S_{2n} = 0] = Θ(n^{−1/2}).

(ii) For j = 1, . . . , d, let N_t^{(j)} be the number of steps in the j-th coordinate by
time t. Show that

P[N_n^{(j)} ∈ [n/(2d), 3n/(2d)], ∀j] ≥ 1 − exp(−κ_d n),

for some constant κ_d > 0.

(iii) Use (i) and (ii) to show that, for any d ≥ 3, P[S2n = 0] = O(n−d/2 ).

Exercise 2.11 (Maximum degree). Let Gn = (Vn , En ) ∼ Gn,pn be an Erdős-


Rényi graph with n vertices and density pn . Suppose npn = C log n for some
C > 0. Let Dn be the maximum degree of Gn . Use Bernstein’s inequality to show
that for any ε > 0

P [Dn ≥ (n − 1)pn + max{C, 4(1 + ε)} log n] → 0,

as n → +∞.

Exercise 2.12 (RIP vs. orthogonality). Show that a (k, 0)-RIP matrix with k ≥ 2
is orthogonal, that is, its columns are orthonormal.

Exercise 2.13 (Compressed sensing: linear programming formulation). Formu-


late (2.4.43) as a linear program, that is, the minimization of a linear objective
subject to linear inequalities.

Exercise 2.14 (Compressed sensing: almost sparse case). By adapting the proof of
Lemma 2.4.39, show the following "almost sparse" version. Let L be (10k, 1/3)-RIP.
Then, for any x ∈ R^n, the solution to (2.4.43) satisfies

‖z∗ − x‖₂ = O(η(x)/√k),

where η(x) := min_{x′∈S^n_k} ‖x − x′‖₁.



Exercise 2.15 (Spectral norm without independence). Give an example of a ran-


dom matrix A ∈ Rn×n whose entries are bounded, but not independent, such that
the spectral norm is Ω(n) with high probability.

Exercise 2.16 (Spectral norm: symmetric matrix). Let A ∈ Rn×n be a symmetric


random matrix. We assume that entries on and above the diagonal Ai,j , i ≤ j, are
centered, independent and sub-Gaussian with variance factor ν. Each entry below
the diagonal is equal to the corresponding entry above it. Prove an analogue of
Theorem 2.4.28 for A. [Hint: Mimic the proof of Theorem 2.4.28.]

Exercise 2.17 (Chaining tail inequality). Prove Theorem 2.4.32.

Exercise 2.18 (Poisson convergence: method of moments). Let A1, . . . , An be
events and A := ∪i Ai. Define

S^(r) := Σ_{1≤i1<···<ir≤n} P[Ai1 ∩ · · · ∩ Air],

and

Xn := Σ_{i=1}^n 1_{Ai}.

Assume that there is µ > 0 such that, for all r,

S^(r) → µ^r / r!,

as n → +∞. Use Exercise 2.2 and a Taylor expansion of e^{−µ} to show that

P[Xn = 0] → e^{−µ}.

In fact, Xn converges in distribution to Poi(µ) (no need to prove this). This is a special case of the method
of moments.

Exercise 2.19 (Connectivity: critical window). Using Exercise 2.18 show that,
when pn = (log n + s)/n, the probability that an Erdős-Rényi graph Gn ∼ Gn,pn
contains no isolated vertex converges to e^{−e^{−s}}.

Bibliographic Remarks
Section 2.1 For more on moment-generating functions, see [Bil12, Section 21].

Section 2.2 The examples in Section 2.2.1 are taken from [AS11, Sections 2.4,
3.2]. A fascinating account of the longest increasing subsequence problem is given
in [Rom15], from which the material in Section 2.2.3 is taken. The contour lemma,
Lemma 2.2.14, is attributed to Whitney [Whi32] and is usually proved “by pic-
ture” [Gri10a, Figure 3.1]. A formal proof of the lemma can be found in [Kes82,
Appendix A]. For much more on percolation, see [Gri10b]. A gentler introduction
is provided in [Ste].

Section 2.3 The presentation in Section 2.3.2 follows [AS11, Section 4.4] and
[JLR11, Section 3.1]. The result for general subgraphs is due to Bollobás [Bol81].
A special case (including cliques) was proved by Erdős and Rényi [ER60]. For
variants of the small subgraph containment problem involving copies that are in-
duced, disjoint, isolated, etc., see for example [JLR11, Chapter 3]. For corre-
sponding results for larger subgraphs, such as cycles or matchings, see for exam-
ple [Bol01]. The connectivity threshold in Section 2.3.2 is also due to the same
authors [ER59]. The presentation here follows [vdH17, Section 5.2]. For more on
the method of moments, see for example [Dur10, Section 3.3.5] or [JLR11, Section
6.1]. Claim 2.3.11 is due to R. Lyons [Lyo90].

Section 2.4 The use of the moment-generating function to derive tail bounds for
sums of independent random variables was pioneered by Cramér [Cra38], Bern-
stein [Ber46], and Chernoff [Che52]. For much more on concentration inequali-
ties, see for example [BLM13]. The basics of large deviations theory are covered
in [Dur10, Section 2.6]. See also [RAS15] and [DZ10]. Section 2.4.2 is based
partly on [Ver18] and [Lug, Section 3.2]. Section 2.4.3 is based on [FR98, Section
5.3]. Very insightful, and much deeper, treatment of the material in Section 2.4.4
can be found in [Ver18, vH16]. The presentation in Section 2.4.5 is inspired
by [Har, Lectures 6 and 8] and [Tao]. The Johnson-Lindenstrauss lemma was first
proved by Johnson and Lindenstrauss using non-probabilistic arguments [JL84].
The idea of using random projections to simplify the proof was introduced by
Frankl and Maehara [FM88] and the proof presented here based on Gaussian pro-
jections is due to Indyk and Motwani [IM98]. See [Ach03] for an overview of the
various proofs known. For more on the random projection method, see [Vem04].
For algorithmic applications of the Johnson-Lindenstrauss lemma, see for exam-
ple [Har, Lecture 7]. Compressed sensing emerged in the works of Donoho [Don06]

and Candès, Romberg and Tao [CRT06a, CRT06b]. The restricted isometry prop-
erty was introduced by Candès and Tao [CT05]. Lemma 2.4.39 is due to Candés,
Romberg and Tao [CRT06b]. The proof of Lemma 2.4.38 presented here is due
to Baraniuk et al. [BDDW08]. A survey of compressed sensing can be found
in [CW08]. A thorough mathematical introduction to compressed sensing can be
found in [FR13]. The material in Section 2.4.2 can be found in [BLM13, Chapter
2]. Hoeffding’s lemma and inequality are due to Hoeffding [Hoe63]. Section 2.4.6
borrows from [Ver18, vH16, SSBD14, Haz16]. The proof of Sauer’s lemma fol-
lows [Ver18, Section 8.3.3]. For a proof of Claim 2.4.48 in general dimension d,
see for example [SSBD14, Section 9.1.3].
Chapter 3

Martingales and potentials

In this chapter we turn to martingales, which play a central role in probability the-
ory. We illustrate their use in a number of applications to the analysis of discrete
stochastic processes. After some background on stopping times and a brief review
of basic martingale properties and results in Section 3.1, we develop two major
directions. In Section 3.2, we show how martingales can be used to derive a sub-
stantial generalization of our previous concentration inequalities—from the sums
of independent random variables we focused on in Chapter 2 to nonlinear func-
tions with Lipschitz properties. In particular, we give several applications of the
method of bounded differences to random graphs. We also discuss bandit problems
in machine learning. In the second thread in Section 3.3, we give an introduction
to potential theory and electrical network theory for Markov chains. This toolkit
in particular provides bounds on hitting times for random walks on networks, with
important implications in the study of recurrence among other applications. We
also introduce Wilson’s remarkable method for generating uniform spanning trees.

3.1 Background
We begin with a quick review of stopping times and martingales. Along the way,
we prove a few useful results. In particular, we derive some bounds on hitting times
and cover times of Markov chains.
Throughout, (Ω, F, (Ft )t∈Z+ , P) is a filtered space. See Appendix B for a
formal definition. Recall that, intuitively, the σ-algebra Ft in the filtration (Ft )t
represents "the information known at time t." All time indices are discrete (in Z+
Version: December 20, 2023
Modern Discrete Probability: An Essential Toolkit
Copyright © 2023 Sébastien Roch

122
CHAPTER 3. MARTINGALES AND POTENTIALS 123

unless stated otherwise). We will also use the notation Z̄+ := {0, 1, . . . , +∞} to
allow time +∞.

3.1.1 Stopping times


Definitions Roughly speaking, a stopping time is a random time whose value is
determined by a rule not depending on the future. Formally:

Definition 3.1.1 (Stopping time). A random variable τ : Ω → Z̄+ is called a
stopping time if

{τ ≤ t} ∈ Ft,    ∀t ∈ Z+,

or, equivalently,

{τ = t} ∈ Ft,    ∀t ∈ Z+.

To see the equivalence above, note that {τ = t} = {τ ≤ t} \ {τ ≤ t − 1}, and


{τ ≤ t} = ∪i≤t {τ = i}.

Example 3.1.2 (Hitting time). Let (At)_{t∈Z+}, with values in (E, E), be adapted and
let B ∈ E. Then

τ = inf{t ≥ 0 : At ∈ B}

is a stopping time known as a hitting time. In contrast, the last visit to a set is
typically not a stopping time. J
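As an illustrative simulation (the parameters are arbitrary choices, not from the text), consider the first exit time of simple random walk on Z started at 0 from the interval (−a, b), i.e., the hitting time of the set {−a, b}. It is a stopping time, and a classical gambler's-ruin computation gives it expectation ab, which a Monte Carlo estimate reproduces:

```python
import random

def exit_time(a, b, rng):
    """Steps until simple random walk from 0 first hits -a or b;
    this hitting time has expectation a * b."""
    s, t = 0, 0
    while -a < s < b:
        s += rng.choice((-1, 1))
        t += 1
    return t

rng = random.Random(0)
est = sum(exit_time(5, 5, rng) for _ in range(20_000)) / 20_000
print(est)  # close to 5 * 5 = 25
```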

Let τ be a stopping time. Denote by Fτ the set of all events F such that, ∀t ∈ Z+ ,
F ∩ {τ = t} ∈ Ft . Intuitively, the σ-algebra Fτ captures the information up to
time τ . The following lemmas help clarify the definition of Fτ .

Lemma 3.1.3. Fτ = Fs if τ := s, Fτ = F∞ = σ(∪t Ft ) if τ := +∞ and


Fτ ⊆ F∞ for any stopping time τ .

Proof. In the first case, note that F ∩ {τ = t} is empty if t ≠ s and is F if t = s.
So if F ∈ Fτ then F = F ∩ {τ = s} ∈ Fs by definition of Fτ, and if F ∈ Fs
then F ∩ {τ = t} ∈ Ft for all t (it is either F or empty) by the observation above. So we have proved both
inclusions. The same argument works for s = +∞, proving the second claim. For the third claim note that, for any
F ∈ Fτ,

F = ∪_{t∈Z̄+} F ∩ {τ = t} ∈ F∞,

again by definition of Fτ.
F = ∪t∈Z+ F ∩ {τ = t} ∈ F∞ ,
again by definition of Fτ .

Lemma 3.1.4. If (Xt ) is adapted and τ is a stopping time then Xτ ∈ Fτ (where


we assume that X∞ ∈ F∞ , e.g., by setting X∞ := lim inf Xn ).

Proof. For B ∈ E,

{Xτ ∈ B} ∩ {τ = t} = {Xt ∈ B} ∩ {τ = t} ∈ Ft ,

by definition of τ . That shows Xτ is measurable with respect to Fτ as claimed.

Lemma 3.1.5. If σ, τ are stopping times then Fσ∧τ ⊆ Fτ .


Proof. Let F ∈ Fσ∧τ . Note that

F ∩ {τ = t} = ∪s≤t [(F ∩ {σ ∧ τ = s}) ∩ {τ = t}] ∈ Ft .

Indeed, the expression in parentheses is in Fs ⊆ Ft by definition of Fσ∧τ and


{τ = t} ∈ Ft .

Let (Xt) be a Markov chain on a countable space V. The following two examples
of stopping times will play an important role.

Definition 3.1.6 (First visit and return). The first visit time and first return time to
x ∈ V are

τ_x := inf{t ≥ 0 : Xt = x}    and    τ_x^+ := inf{t ≥ 1 : Xt = x}.

Similarly, τ_B and τ_B^+ are the first visit time and first return time to B ⊆ V.
Definition 3.1.7 (Cover time). Assume V is finite. The cover time of (Xt) is the
first time that all states have been visited, that is,

τ_cov := inf{t ≥ 0 : {X0, . . . , Xt} = V}.
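As an illustrative simulation (not from the text; the chain and n are arbitrary choices), the cover time of simple random walk on the n-cycle has expectation n(n − 1)/2, a classical fact that a Monte Carlo estimate reproduces:

```python
import random

def cover_time(n, rng):
    """Cover time of simple random walk on the n-cycle Z/nZ, started at 0."""
    x, seen, t = 0, {0}, 0
    while len(seen) < n:
        x = (x + rng.choice((-1, 1))) % n
        seen.add(x)
        t += 1
    return t

rng = random.Random(1)
n = 12
est = sum(cover_time(n, rng) for _ in range(5_000)) / 5_000
print(est)  # close to n * (n - 1) / 2 = 66 for the cycle
```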

Strong Markov property Let (Xt) be a Markov chain with transition matrix
P and initial distribution µ. Let Ft = σ(X0, . . . , Xt). Recall that the Markov
property (Theorem 1.1.18) says that, given the present, the future is independent
of the past. The Markov property naturally extends to stopping times. Let τ be a
stopping time with P[τ < +∞] > 0. In its simplest form we have: on {τ < +∞},

P[X_{τ+1} = y | Fτ] = P_{Xτ}[X1 = y] = P(Xτ, y).

In words, the chain "starts fresh" at a stopping time with the state at that time as
starting point. More generally:
Theorem 3.1.8 (Strong Markov property). Let ft : V ∞ → R be a sequence of
measurable functions, uniformly bounded in t and let Ft (x) := Ex [ft ((Xs )s≥0 )].
On {τ < +∞},
E[fτ ((Xτ +t )t≥0 ) | Fτ ] = Fτ (Xτ ).

Throughout, when we say that two random variables Y, Z are equal on an event B,
we mean formally that Y 1B = Z1B almost surely.

Proof of Theorem 3.1.8. We use that

E[fτ ((Xτ +t )t≥0 ) | Fτ ]1τ <+∞ = E[fτ ((Xτ +t )t≥0 )1τ <+∞ | Fτ ].

Let A ∈ Fτ . Summing over the possible values of τ , using the tower property
(Lemma B.6.16) and then the Markov property

E[fτ ((Xτ +t )t≥0 )1τ <+∞ ; A]
        = E[fτ ((Xτ +t )t≥0 ); A ∩ {τ < +∞}]
        = Σ_{s≥0} E[fs ((Xs+t )t≥0 ); A ∩ {τ = s}]
        = Σ_{s≥0} E[E[fs ((Xs+t )t≥0 ); A ∩ {τ = s} | Fs ]]
        = Σ_{s≥0} E[1A∩{τ =s} E[fs ((Xs+t )t≥0 ) | Fs ]]
        = Σ_{s≥0} E[1A∩{τ =s} Fs (Xs )]
        = Σ_{s≥0} E[Fs (Xs ); A ∩ {τ = s}]
        = E[Fτ (Xτ ); A ∩ {τ < +∞}]
        = E[Fτ (Xτ )1τ <+∞ ; A],

where, on the fifth line, we used that A ∩ {τ = s} ∈ Fs by definition of Fτ


and taking out what is known (Lemma B.6.13). The definition of the conditional
expectation (Theorem B.6.1) concludes the proof.

The following typical application of the strong Markov property (Theorem 3.1.8)
is useful.

Theorem 3.1.9 (Reflection principle). Let X1 , X2 , . . . be i.i.d. with a distribution
symmetric about 0 and let St = Σ_{i≤t} Xi . Then, for b > 0,

        P[ sup_{i≤t} Si ≥ b ] ≤ 2 P[St ≥ b].

Proof. Let τ := inf{i ≤ t : Si ≥ b}. By the strong Markov property, on {τ <
t}, St − Sτ is independent of Fτ and is symmetric about 0. In particular, it has
probability at least 1/2 of being greater than or equal to 0 by the first moment principle
(Theorem 2.2.1), an event which implies that St is greater than or equal to b. Hence

        P[St ≥ b] ≥ P[τ = t] + (1/2) P[τ < t] ≥ (1/2) P[τ ≤ t].
(Exercise 3.1 asks for a more formal proof.)

In the case of simple random walk on Z, we get a stronger statement.

Theorem 3.1.10 (Reflection principle: simple random walk). Let (St ) be simple
random walk on Z started at 0. Then, ∀a, b, t > 0,

        P[St = b + a] = P[ St = b − a, sup_{i≤t} Si ≥ b ],

and
        P[ sup_{i≤t} Si ≥ b ] = P[St = b] + 2 P[St > b].

Proof. For the first claim, reflect the sub-path after the first visit to b across the line
y = b. Summing over a > 0 and rearranging gives the second claim.
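Both identities can be checked exhaustively for small t by enumerating all 2^t equally likely paths. A Python sketch (illustrative; the function name and parameter choices are ours):

```python
from itertools import product

def check_reflection(t, a, b):
    """Verify, by enumeration of all +/-1 paths of length t,
    P[S_t = b+a] = P[S_t = b-a, max_i S_i >= b]  and
    P[max_i S_i >= b] = P[S_t = b] + 2 P[S_t > b]."""
    lhs = rhs = hit = at_b = above_b = 0
    for steps in product((-1, 1), repeat=t):
        s = m = 0
        for x in steps:
            s += x
            m = max(m, s)
        lhs += (s == b + a)
        rhs += (s == b - a and m >= b)
        hit += (m >= b)
        at_b += (s == b)
        above_b += (s > b)
    assert lhs == rhs                   # first claim
    assert hit == at_b + 2 * above_b    # second claim
```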

We record another related result that will be useful later.

Theorem 3.1.11 (Ballot theorem). In an election with n voters, candidate A gets α
votes and candidate B gets β < α votes. If the votes are counted in uniformly random
order, the probability that A leads B throughout the counting is (α − β)/n.

Recurrence Let (Xt ) be a Markov chain on a countable state space V . The time
of the k-th return to y is (letting τy^0 := 0)

        τy^k := inf{t > τy^{k−1} : Xt = y}.

In particular, τy^1 = τy+ . Define ρxy := Px [τy+ < +∞]. Then by the strong Markov
property (and induction)

        Px [τy^k < +∞] = ρxy ρyy^{k−1} .                  (3.1.1)

(Exercise 3.2 asks for a more formal proof.) Letting

        Ny := Σ_{t>0} 1{Xt =y} = Σ_{k≥1} 1{τy^k <+∞} ,

be the number of visits to y after time 0, by linearity

        Ex [Ny ] = ρxy / (1 − ρyy ) .                  (3.1.2)

When ρyy < 1, we have Ey [Ny ] < +∞ by (3.1.2), and in particular τyk = +∞ for
some k. Or ρyy = 1 and, starting at x = y, we have τyk < +∞ almost surely for
all k by (3.1.1). That leads us to the following dichotomy.
Definition 3.1.12 (Recurrence). A state x is recurrent if ρxx = 1. Otherwise it
is transient. We refer to the recurrence or transience of a state as its type. Let x
be recurrent. If in addition Ex [τx+ ] < +∞, we say that x is positive recurrent;
otherwise we say that it is null recurrent. A chain is recurrent (or transient, or
positive recurrent, or null recurrent) if all its states are.
Recurrence is “contagious” in the following sense.
Lemma 3.1.13. If x is recurrent and ρxy > 0 then y is recurrent and ρyx = ρxy =
1.
A subset C ⊆ V is closed if x ∈ C and ρxy > 0 implies y ∈ C. A subset
D ⊆ V is irreducible if x, y ∈ D implies ρxy > 0. This definition is consistent
with (and generalizes to sets) the one we gave in Section 1.1.2. Recall that we have
the following decomposition theorem.
Theorem 3.1.14 (Decomposition theorem). Let R := {x : ρxx = 1} be the
recurrent states of the chain. Then R can be written as a disjoint union ∪j Rj
where each Rj is closed and irreducible.
Example 3.1.15 (Simple random walk on Z). Consider simple random walk (St )
on Z started at 0. The chain is clearly irreducible so it suffices to check the type
of state 0 by Lemma 3.1.13. First note the periodicity of this chain. So we look at
S2t . Then by Stirling’s formula (see Appendix A)
        P[S2t = 0] = (2t choose t) 2^{−2t} ∼ 2^{−2t} · (2t)^{2t} √(4πt) / (t^t √(2πt))^2 = 1/√(πt) .

Thus
        E[N0 ] = Σ_{t>0} P[St = 0] = +∞,
and the chain is recurrent. J
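The asymptotics above are easy to confirm numerically; a small illustrative Python check (not part of the text):

```python
import math

def p_return(t):
    """Exact P[S_{2t} = 0] = (2t choose t) 2^{-2t} for simple random walk."""
    return math.comb(2 * t, t) / 4 ** t

# The ratio to the Stirling approximation 1/sqrt(pi*t) tends to 1, and the
# partial sums of P[S_{2t} = 0] grow without bound (like sqrt(t)), which is
# exactly why E[N_0] = +infinity and the walk is recurrent.
```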
Return times are closely related to stationary measures. We recall the following
standard results without proof. We gave an alternative proof of the existence of a
unique stationary distribution in the finite, irreducible case in Theorem 1.1.24.

Theorem 3.1.16. Let x be a recurrent state. Then the following defines a stationary
measure
        μx (y) := Ex [ Σ_{0≤t<τx+} 1{Xt =y} ] .

Theorem 3.1.17. If (Xt ) is irreducible and recurrent, then the stationary measure
is unique up to a constant multiple.

Theorem 3.1.18. If there is a stationary distribution π then all states y that have
π(y) > 0 are recurrent.

Theorem 3.1.19. If (Xt ) is irreducible and has a stationary distribution π, then

        π(x) = 1 / Ex [τx+ ] .
Theorem 3.1.20. If (Xt ) is irreducible, then the following are equivalent.

(i) There is a stationary distribution.

(ii) All states are positive recurrent.

(iii) There is a positive recurrent state.

We have seen previously that, in the irreducible, positive recurrent, aperiodic


case, there is convergence to stationarity (see Theorem 1.1.33). In the transient
and null recurrent cases, there is no stationary distribution to converge to by Theo-
rem 3.1.20. Instead, we have the following.

Theorem 3.1.21 (Convergence of P^t : transient and null recurrent cases). If P is
an irreducible chain which is either transient or null recurrent, we have for all x, y
that
        lim_t P^t (x, y) = 0.

Proof. We only prove the transient case. In that case, we showed in (3.1.2) that

        Σ_t P^t (x, y) = Ex [ Σ_t 1{Xt =y} ] = Ex [Ny ] < +∞.

Hence P^t (x, y) → 0.

A useful identity A slight generalization of the “cycle trick” used in the proof of
Theorem 3.1.16 gives a useful identity.

Definition 3.1.22 (Green function). Let σ be a stopping time for a Markov chain
(Xt ). The Green function of the chain stopped at σ is given by

        Gσ (x, y) = Ex [ Σ_{0≤t<σ} 1{Xt =y} ] ,   x, y ∈ V,                  (3.1.3)

that is, it is the expected number of visits to y before σ when started at x.

Lemma 3.1.23 (Occupation measure identity). Consider an irreducible, positive


recurrent Markov chain (Xt )t≥0 with transition matrix P and stationary distribu-
tion π. Let x be a state and σ be a stopping time such that Ex [σ] < +∞ and
Px [Xσ = x] = 1. For any y,

Gσ (x, y) = πy Ex [σ].

Proof. By the uniqueness of the stationary measure up to constant multiple (The-


orem 3.1.17), it suffices to show that Gσ (x, y) satisfies the system for a stationary
measure as a function of y
        Σ_y Gσ (x, y)P (y, z) = Gσ (x, z),   ∀z,                  (3.1.4)

and use the fact that

        Σ_y Gσ (x, y) = Ex [ Σ_y Σ_{0≤t<σ} 1{Xt =y} ] = Ex [σ].

To check (3.1.4), because Xσ = X0 almost surely, observe that

        Gσ (x, z) = Ex [ Σ_{0≤t<σ} 1{Xt =z} ]
                  = Ex [ Σ_{0≤t<σ} 1{Xt+1 =z} ]
                  = Σ_{t≥0} Px [Xt+1 = z, σ > t].

Since {σ > t} ∈ Ft , applying the Markov property we get

        Gσ (x, z) = Σ_{t≥0} Σ_y Px [Xt = y, Xt+1 = z, σ > t]
                  = Σ_{t≥0} Σ_y Px [Xt+1 = z | Xt = y, σ > t] Px [Xt = y, σ > t]
                  = Σ_{t≥0} Σ_y P (y, z) Px [Xt = y, σ > t]
                  = Σ_y Gσ (x, y)P (y, z),

which establishes (3.1.4) and proves the claim.

Here is a typical application of this lemma.


Corollary 3.1.24. In the setting of Lemma 3.1.23, for all x ≠ y,

        Px [τy < τx+ ] = 1 / ( πx (Ex [τy ] + Ey [τx ]) ) .
Proof. Let σ be the time of the first visit to x after the first visit to y. Then Ex [σ] =
Ex [τy ] + Ey [τx ] < +∞, where we used that the chain is irreducible and positive
recurrent. By the strong Markov property, the number of visits to x before the first
visit to y is geometric with success probability Px [τy < τx+ ] (where, here, a visit
to x is a “failed trial”). Moreover the number of visits to x after the first visit to
y but before σ is 0 by definition. Hence Gσ (x, x) is the mean of the geometric
distribution, namely 1/Px [τy < τx+ ]. Applying the occupation measure identity
gives the result.
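As an illustration (not from the text), the corollary can be verified symbolically on the two-state chain with P(0,1) = a and P(1,0) = b, where all the quantities involved have simple closed forms (standard computations for this chain): π0 = b/(a+b), E0[τ1] = 1/a, E1[τ0] = 1/b, and P0[τ1 < τ0+] = a since the first step decides the event.

```python
from fractions import Fraction

def check_two_state(a, b):
    """Two-state chain, P(0,1)=a, P(1,0)=b with 0 < a, b < 1: check that
    the closed forms above are consistent with Corollary 3.1.24, i.e.
    P_0[tau_1 < tau_0^+] = 1 / (pi_0 (E_0[tau_1] + E_1[tau_0]))."""
    a, b = Fraction(a), Fraction(b)
    pi0 = b / (a + b)
    assert a == 1 / (pi0 * (1 / a + 1 / b))
```

Exact rational arithmetic avoids any floating-point tolerance in the comparison.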

3.1.2 . Markov chains: exponential tail of hitting times and some cover
time bounds
Tail of a hitting time On a finite state space, the tail of any hitting time converges
to 0 exponentially fast.
Lemma 3.1.25. Let (Xt ) be a finite, irreducible Markov chain with state space V .
For any subset of states A ⊆ V and initial distribution µ:
(i) It holds that Eµ [τA ] < +∞ (and, in particular, τA < +∞ a.s.).
(ii) Letting t̄A := max_x Ex [τA ], we have the tail bound

        Pμ [τA > t] ≤ exp ( −⌊ t / ⌈e t̄A ⌉ ⌋ ) .

Proof. For any positive integer m, for some distribution θ over the state space V ,
by the strong Markov property (Theorem 3.1.8)

        Pμ [τA > ms | τA > (m − 1)s] = Pθ [τA > s] ≤ max_x Px [τA > s] =: αs .

Choose a positive integer s large enough that, from any x, there is a path to A
of length at most s of positive probability. Such an s exists by irreducibility. In
particular αs < 1.
By the multiplication rule and the monotonicity of the events {τA > rs} over
r, we have

        Pμ [τA > ms] = Pμ [τA > s] Π_{r=2}^m Pμ [τA > rs | τA > (r − 1)s].

Therefore, Pμ [τA > ms] ≤ αs^m , which in turn implies

        Pμ [τA > t] ≤ αs^{⌊t/s⌋} .                  (3.1.5)

The result for the expectation follows from

        Eμ [τA ] = Σ_{t≥0} Pμ [τA > t] ≤ Σ_t αs^{⌊t/s⌋} < +∞,

since αs < 1.
Now that we have established that t̄A < +∞, by Markov's inequality (Theo-
rem 2.1.1),
        αs = max_x Px [τA > s] ≤ t̄A / s ,

for all positive integers s. Plugging back into (3.1.5) gives Pμ [τA > t] ≤
( t̄A / s )^{⌊t/s⌋} . By differentiating with respect to s, it can be checked that a good choice
for s is ⌈e t̄A ⌉. Simplifying gives the second claim.
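The bound can be tested exactly on a small chain. The sketch below (a hypothetical example of ours: a lazy-reflecting walk on a path, with A = {0}) computes E_x[τ_A] by value iteration and the tail probabilities by iterating the transition operator restricted to V \ A:

```python
import math

def check_tail_bound(n=6, tmax=300):
    """Lazy-reflecting SRW on {0,...,n-1} with A = {0}: verify
    P_x[tau_A > t] <= exp(-floor(t / ceil(e * tbar_A))) for all x, t."""
    def q_apply(u):
        # one step of the transition operator restricted to V \ A
        v = [0.0] * n
        for x in range(1, n):
            right = u[x + 1] if x + 1 < n else u[x]  # lazy reflection at n-1
            v[x] = 0.5 * (u[x - 1] + right)          # u[0] = 0: absorbed at A
        return v

    h = [0.0] * n                  # value iteration for h(x) = E_x[tau_A]
    for _ in range(100000):
        g = q_apply(h)
        h = [0.0] + [1.0 + g[x] for x in range(1, n)]
    s = math.ceil(math.e * max(h))

    u = [0.0] + [1.0] * (n - 1)    # u_t(x) = P_x[tau_A > t]
    for t in range(1, tmax + 1):
        u = q_apply(u)
        assert max(u) <= math.exp(-(t // s)) + 1e-12
```

Value iteration converges because h_k(x) = E_x[τ_A ∧ k] increases to E_x[τ_A], which is finite by part (i) of the lemma.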

Application to cover times We give an application of the previous bound to


cover times. Let (Xt ) be a finite, irreducible Markov chain on V with n := |V | >
1. Recall that the cover time is τcov := maxy τy . We bound the mean cover time in
terms of
        t̄hit := max_{x≠y} Ex [τy ].

Claim 3.1.26.
        max_x Ex [τcov ] ≤ (3 + log n)⌈e t̄hit ⌉.

Proof. By a union bound over all states to be visited and Lemma 3.1.25,

        max_x Px [τcov > t] ≤ min { 1, n exp ( −⌊ t / ⌈e t̄hit ⌉ ⌋ ) } .

Summing over t ∈ Z+ and appealing to the sum of a geometric series,

        max_x Ex [τcov ] ≤ (log n + 1)⌈e t̄hit ⌉ + ⌈e t̄hit ⌉ / (1 − e^{−1}) ,

where the first term on the right-hand side comes from the fact that until t ≥
(log n + 1)⌈e t̄hit ⌉ the upper bound above is 1. The factor ⌈e t̄hit ⌉ in the second
term on the right-hand side comes from the fact that we must break up the series
into blocks of size ⌈e t̄hit ⌉. Simplifying gives the claim.
The previous proof should be reminiscent of that of Theorem 2.4.21.
A clever argument gives a better constant factor as well as a lower bound.
Theorem 3.1.27 (Matthews' cover time bounds). Let

        t^A_hit := min_{x,y∈A, x≠y} Ex [τy ],

and hn := Σ_{m=1}^n 1/m . Then

        max_x Ex [τcov ] ≤ hn t̄hit ,                  (3.1.6)

and
        min_x Ex [τcov ] ≥ max_{A⊆V} h_{|A|−1} t^A_hit .                  (3.1.7)
Clearly, max_{x≠y} t^{{x,y}}_hit is a lower bound on the worst expected cover time. Lower
bound (3.1.7) says that a tighter bound is obtained by finding a larger subset of
states A that are "far away" from each other.
We sketch the proof of the lower bound for A = V , which we assume is
[n] without loss of generality. The other cases are similar. Let (J1 , . . . , Jn ) be a
uniform random ordering of V , let Cm := max_{i≤m} τJi , and let Lm be the last state
visited among J1 , . . . , Jm . Then for m ≥ 2

        Ex [Cm − Cm−1 | J1 , . . . , Jm , {Xt , t ≤ Cm−1 }] ≥ t^V_hit 1{Lm =Jm } .

By symmetry, P[Lm = Jm ] = 1/m . To see this, first pick the set of vertices corre-
sponding to {J1 , . . . , Jm }, wait for all of those vertices to be visited, then pick the
ordering. Moreover observe that Ex [C1 ] ≥ (1 − 1/n) t^V_hit where the factor of (1 − 1/n)
accounts for the probability that J1 ≠ x. Taking expectations above and summing
over m gives the result.
Exercise 3.3 asks for a proof that the bounds above cannot in general be im-
proved up to smaller order terms.
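A quick numerical illustration (ours, not from the text) uses two standard facts about random walk on the n-cycle, stated here without proof: E_x[τ_y] = k(n − k) when x and y are at graph distance k, and the mean cover time is n(n − 1)/2. Matthews' upper bound must dominate the exact mean cover time:

```python
def check_matthews_cycle(n):
    """Matthews' upper bound h_n * tbar_hit versus the exact mean cover
    time n(n-1)/2 of SRW on the n-cycle (both closed forms are standard)."""
    h_n = sum(1.0 / m for m in range(1, n + 1))
    tbar_hit = (n // 2) * ((n + 1) // 2)   # max over k of k(n-k)
    assert n * (n - 1) / 2 <= h_n * tbar_hit
```

For large n the two sides differ by a factor of order log n, consistent with the remark that the bounds cannot in general be improved beyond smaller order terms.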

3.1.3 Martingales
Definition Martingales are an important class of stochastic processes that corre-
spond intuitively to the “probabilistic version of a monotone sequence.” They hide
behind many processes and have properties that make them powerful tools in the
analysis of processes where they have been identified. Formally:

Definition 3.1.28 (Martingale). An adapted process (Mt )t≥0 with E|Mt | < +∞
for all t is a martingale if

        E[Mt+1 | Ft ] = Mt ,   ∀t ≥ 0.

If equality is replaced with ≤ or ≥, we get a supermartingale or a submartingale
respectively. We say that a martingale is bounded in Lp if sup_t E[|Mt |^p ] < +∞.

Recall that adapted (Definition B.7.5) simply means that Mt ∈ Ft , that is, roughly
speaking Mt is “known at time t.” Note that for a martingale, by the tower property
(Lemma B.6.16), we have E[Mt | Fs ] = Ms for all t > s, and similarly (with
inequalities) for supermartingales and submartingales.
We start with a straightforward example.

Example 3.1.29 (Sums of i.i.d. random variables with mean 0). Let X0 , X1 , . . .
be i.i.d. integrable, centered random variables, Ft = σ(X0 , . . . , Xt ), S0 = 0, and
St = Σ_{i=1}^t Xi . Note that E|St | < ∞ by the triangle inequality. By taking out
what is known and the role of independence lemma (Lemma B.6.14) we obtain

E[St | Ft−1 ] = E[St−1 + Xt | Ft−1 ] = St−1 + E[Xt ] = St−1 ,

which proves that (St ) is a martingale. J

Martingales however are richer than random walks with centered steps. For in-
stance mixtures of such random walks are also martingales.

Example 3.1.30 (Mixtures of random walks). Consider again the setting of Exam-
ple 3.1.29. This time assume that X0 is uniformly distributed in {1, 2} and define

Rt = X0 St , t ≥ 0.

Then, because (St ) is a martingale,

E[Rt | Ft−1 ] = X0 E[St | Ft−1 ] = X0 St−1 = Rt−1 ,

so (Rt ) is also a martingale.



Further examples Martingales can also be a little more hidden. Here are two
examples.
Example 3.1.31 (Variance of a sum of i.i.d. random variables). Consider again the
setting of Example 3.1.29 with σ^2 := Var[X1 ] < ∞. Define

        Mt = St^2 − tσ^2 .

Note that by the triangle inequality and the fact that St has mean zero and is a sum
of independent random variables

        E|Mt | ≤ Σ_{i=1}^t Var[Xi ] + tσ^2 ≤ 2tσ^2 < +∞.

Moreover, arguing similarly to the previous example, and using the fact that both
Xt and St−1 are square integrable

        E[Mt | Ft−1 ] = E[(Xt + St−1 )^2 − tσ^2 | Ft−1 ]
                      = E[Xt^2 + 2Xt St−1 + St−1^2 − tσ^2 | Ft−1 ]
                      = σ^2 + 0 + St−1^2 − tσ^2
                      = Mt−1 ,
which proves that (Mt ) is a martingale. J
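For ±1 steps (so σ^2 = 1), the martingale property of St^2 − t can be confirmed by brute force over all equally likely paths; a small illustrative check (ours):

```python
from itertools import product

def check_variance_martingale(t):
    """The sum of (S_t^2 - t) over all 2^t equally likely +/-1 paths is 0,
    i.e. E[S_t^2 - t] = 0 when sigma^2 = 1."""
    total = sum(sum(steps) ** 2 - t for steps in product((-1, 1), repeat=t))
    assert total == 0
```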
Example 3.1.32 (Eigenvectors of a transition matrix). Let (Xt )t≥0 be a finite
Markov chain with state space V and transition matrix P , and let (Ft )t≥0 be the
corresponding filtration. Suppose f : V → R is such that
        Σ_j P (i, j)f (j) = λf (i),   ∀i ∈ V.

In other words, f is a (right) eigenvector of P with eigenvalue λ ≠ 0. Define

        Mt = λ^{−t} f (Xt ).

Note that by the finiteness of the state space

        E|Mt | < +∞,

and that further by the Markov property

        E[Mt | Ft−1 ] = λ^{−t} E[f (Xt ) | Ft−1 ]
                      = λ^{−t} Σ_j P (Xt−1 , j)f (j)
                      = λ^{−t} · λf (Xt−1 )
                      = Mt−1 .
That is, (Mt ) is a martingale. J

Or we can create martingales out of thin air. We give two important examples
that will appear later.

Example 3.1.33 (Doob martingale: accumulating data). Let X be a random variable
with E|X| < +∞. Define Mt = E[X | Ft ]. Note that E|Mt | ≤ E|X| < +∞ by
Jensen's inequality, and

        E[Mt | Ft−1 ] = E[X | Ft−1 ] = Mt−1 ,

by the tower property. This is known as a Doob martingale. Intuitively this process
tracks our expectation of the unobserved X as "more information becomes avail-
able." See the so-called "exposure martingales" in Section 3.2.3 for a concrete
illustration of this idea. J

Example 3.1.34 (Martingale transform). Let (Xt )t≥1 be an integrable, adapted


process and let (Ct )t≥1 be a bounded, predictable process. Recall that predictable
(Definition B.7.6) means Ct ∈ Ft−1 for all t, that is, roughly speaking Ct is
“known at time t − 1.” Define
        Nt = Σ_{i≤t} (Xi − E[Xi | Fi−1 ])Ci .

Then
        E|Nt | ≤ Σ_{i≤t} 2K E|Xi | < +∞,

where we used that |Ct | ≤ K for all t ≥ 1 for some constant K, and

E[Nt − Nt−1 | Ft−1 ] = E[(Xt − E[Xt | Ft−1 ])Ct | Ft−1 ]


= Ct (E[Xt | Ft−1 ] − E[Xt | Ft−1 ])
= 0,

by taking out what is known. So (Nt ) is a martingale.


When (Xt ) is itself a martingale (in which case E[Xi | Fi−1 ] = Xi−1 in the
definition of Nt ), this is a sort of “stochastic (Stieltjes) integral.” When, instead,
(Xt ) is a supermartingale (respectively submartingale) and (Ct ) is nonnegative and
bounded, then the same computation shows that
        Nt = Σ_{i≤t} (Xi − Xi−1 )Ci ,

defines a supermartingale (respectively submartingale). J



As the next lemma shows (an immediate consequence of Jensen's inequal-
ity in its conditional version, Lemma B.6.12), submartingales naturally arise as
convex functions of martingales.

Lemma 3.1.35. If (Mt )t≥0 is a martingale and φ is a convex function such that
E|φ(Mt )| < +∞ for all t, then (φ(Mt ))t≥0 is a submartingale. Moreover, if
(Mt )t≥0 is a submartingale and φ is an increasing convex function with E|φ(Mt )| <
+∞ for all t, then (φ(Mt ))t≥0 is a submartingale.

Martingales and stopping times A fundamental reason explaining the utility of


martingales in analyzing a variety of stochastic processes is that they play nicely
with stopping times, in particular, through what is known as the optional stopping
theorem (in its various forms). We will encounter many applications of this impor-
tant result. First a definition:

Definition 3.1.36 (Stopped process). Let (Mt ) be an adapted process and σ be a
stopping time. Then
        Mt^σ (ω) := Mσ(ω)∧t (ω),

is Mt stopped at σ.
Lemma 3.1.37. Let (Mt ) be a supermartingale and σ be a stopping time. Then
the stopped process (Mtσ ) is a supermartingale and in particular

E[Mt ] ≤ E[Mσ∧t ] ≤ E[M0 ].

The same result holds with equalities if (Mt ) is a martingale, and with inequalities
in the opposite direction if (Mt ) is a submartingale.

Proof. Note that
        Mt^σ − M0 = Σ_{i≤t} Ci (Xi − Xi−1 ),

with Ci = 1{i ≤ σ} ∈ Fi−1 (which is nonnegative and bounded) and Xi = Mi
for all i, and use Example 3.1.34 to conclude that E[Mσ∧t ] ≤ E[M0 ].
On the other hand,
        Mt − Mt^σ = Σ_{i≤t} (1 − Ci )(Xi − Xi−1 ).

So the other inequality follows from the same argument.



Theorem 3.1.38 (Doob’s optional stopping theorem). Let (Mt ) be a supermartin-


gale and σ be a stopping time. Then Mσ is integrable and

E[Mσ ] ≤ E[M0 ],

if any of the following conditions hold:

(i) σ is bounded;

(ii) (Mt ) is uniformly bounded and σ is almost surely finite;

(iii) E[σ] < +∞ and (Mt ) has bounded increments (i.e., there is c > 0 such that
|Mt − Mt−1 | ≤ c a.s. for all t);

(iv) (Mt ) is nonnegative and σ is almost surely finite.

The first three imply equality above if (Mt ) is a martingale.

Proof. Case (iv) is Fatou’s lemma (Proposition B.4.14). We prove (iii). We leave
the proof of the other claims as an exercise (see Exercise 3.5).
From Lemma 3.1.37, we have

E[Mσ∧t − M0 ] ≤ 0. (3.1.8)

Furthermore the assumption that E[σ] < +∞ implies that σ < +∞ almost surely.
Hence we seek to take a limit as t → +∞ inside the expectation. To justify
swapping limit and expectation, note that by a telescoping sum

        |Mσ∧t − M0 | = | Σ_{s≤σ∧t} (Ms − Ms−1 ) |
                     ≤ Σ_{s≤σ} |Ms − Ms−1 |
                     ≤ cσ.

The claim now follows from dominated convergence (Proposition B.4.14). Equal-
ity holds if (Mt ) is a martingale.

Although the optional stopping theorem (Theorem 3.1.38) is useful, one of-
ten works directly with Lemma 3.1.37 and applies suitable limit theorems (see
Proposition B.4.14). The following martingale-based proof of Wald’s first identity
provides an illustration.

Theorem 3.1.39 (Wald's first identity). Let X1 , X2 , . . . ∈ L1 be i.i.d. with E[X1 ] =
μ and let τ ∈ L1 be a stopping time. Let St = Σ_{s=1}^t Xs . Then

        E[Sτ ] = μ E[τ ].

Proof. We first prove the result for nonnegative Xi s. By Example 3.1.29, St − tµ


is a martingale and Lemma 3.1.37 implies that E[Sτ ∧t − µ(τ ∧ t)] = 0, or

E[Sτ ∧t ] = µ E[τ ∧ t].

Note that, in the nonnegative case, we have Sτ ∧t ↑ Sτ and τ ∧ t ↑ τ . Thus, by


monotone convergence (Proposition B.4.14), the claim E[Sτ ] = µ E[τ ] follows in
that case.
Consider now the general case. Again, E[Sτ ∧t ] = µ E[τ ∧t] and E[τ ∧t] ↑ E[τ ].
Applying the previous argument to the sum of nonnegative random variables Rt =
Σ_{s=1}^t |Xs | shows that E[Rτ ] = E|X1 | E[τ ] < +∞ by assumption. Since |Sτ ∧t | ≤
Rτ for all t by the triangle inequality, dominated convergence (Proposition B.4.14)
implies E[Sτ ∧t ] → E[Sτ ] and we are done.
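Wald's first identity can be verified exactly on a toy example (hypothetical, for illustration): let the Xi be uniform on {1, 2} (so μ = 3/2) and let τ be the first index with Xi = 2, capped at n, which is a bounded stopping time.

```python
from itertools import product
from fractions import Fraction

def check_wald(n):
    """Enumerate all 2^n sequences of X_i uniform on {1,2}; tau = first i
    with X_i = 2, capped at n.  Verify E[S_tau] = (3/2) E[tau] exactly."""
    e_s = e_tau = Fraction(0)
    w = Fraction(1, 2 ** n)                 # probability of each sequence
    for xs in product((1, 2), repeat=n):
        tau = next((i + 1 for i, x in enumerate(xs) if x == 2), n)
        e_s += w * sum(xs[:tau])
        e_tau += w * tau
    assert e_s == Fraction(3, 2) * e_tau
```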

We also recall Wald’s second identity. The proof, which we omit, uses the martin-
gale in Example 3.1.31.

Theorem 3.1.40 (Wald's second identity). Let X1 , X2 , . . . ∈ L2 be i.i.d. with
E[X1 ] = 0 and Var[X1 ] = σ^2 and let τ ∈ L1 be a stopping time. Let St =
Σ_{s=1}^t Xs . Then
        E[Sτ^2 ] = σ^2 E[τ ].

We illustrate Wald's identities on the gambler's ruin problem, which is character-
istic of applications of stopping times in Markov chains. We consider the "unbi-
ased" and "biased" cases separately.

Example 3.1.41 (Gambler’s ruin: unbiased case). Let (St ) be simple random walk
on Z started at 0 and let τ = τa ∧ τb with −∞ < a < 0 < b < +∞, where the
first visit time τx was defined in Definition 3.1.6.

Claim 3.1.42. We have:

(i) τ < +∞ almost surely;


(ii) P[τa < τb ] = b/(b − a) ;

(iii) E[τ ] = −ab;

(iv) τa < +∞ almost surely but E[τa ] = +∞.



Proof. We prove the claims in order.


(i) We argue that in fact E[τ ] < ∞. That follows immediately from the ex-
ponential tail of hitting times in Lemma 3.1.25 for the chain (Sτ ∧t ) whose
(effective) state space, {a, a + 1, . . . , b}, is finite.

(ii) By Wald’s first identity (Theorem 3.1.39) and (i), we have E[Sτ ] = 0 or

a P[Sτ = a] + b P[Sτ = b] = 0,

that is, using P[Sτ = a] = 1 − P[Sτ = b] = P[τa < τb ],


        P[τa < τb ] = b/(b − a)   and   P[τa < +∞] ≥ P[τa < τb ] → 1,

where we took b → +∞ in the first expression to obtain the second one.

(iii) Because σ^2 = 1, Wald's second identity (Theorem 3.1.40) says that E[Sτ^2 ] =
E[τ ]. Furthermore, we have by (ii)

        E[Sτ^2 ] = a^2 · b/(b − a) + b^2 · (−a)/(b − a) = −ab.

Thus E[τ ] = −ab.

(iv) The first claim was proved in (ii). When b → +∞, τ = τa ∧ τb ↑ τa and
monotone convergence applied to (iii) gives that E[τa ] = +∞.
That concludes the proof.

Note that (iv) above shows that the L1 condition on the stopping time in Wald's
second identity (Theorem 3.1.40) is necessary. Indeed we have shown a^2 = E[Sτa^2 ] ≠
σ^2 E[τa ] = +∞. J
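Claims (ii) and (iii) can be checked numerically by solving the harmonic and Poisson equations for the walk on {a, . . . , b} absorbed at the endpoints, via Gauss–Seidel iteration (an illustrative sketch of ours):

```python
def check_gambler(a, b, sweeps=5000):
    """Unbiased walk on {a,...,b} started at 0, absorbed at the endpoints:
    h(x) = P_x[tau_a < tau_b] solves h(x) = (h(x-1)+h(x+1))/2 with h(a)=1,
    h(b)=0; g(x) = E_x[tau] solves g(x) = 1 + (g(x-1)+g(x+1))/2 with
    g(a)=g(b)=0.  Compare with b/(b-a) and -a*b at x = 0."""
    h = {x: 1.0 if x == a else 0.0 for x in range(a, b + 1)}
    g = {x: 0.0 for x in range(a, b + 1)}
    for _ in range(sweeps):
        for x in range(a + 1, b):
            h[x] = 0.5 * (h[x - 1] + h[x + 1])
            g[x] = 1.0 + 0.5 * (g[x - 1] + g[x + 1])
    assert abs(h[0] - b / (b - a)) < 1e-8
    assert abs(g[0] + a * b) < 1e-8
```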
Example 3.1.43 (Gambler's ruin: biased case). The biased random walk on Z with
parameter 1/2 < p < 1 is the process (St ) with S0 = 0 and St = Σ_{i=1}^t Xi where
the Xi s are i.i.d. in {−1, +1} with P[X1 = 1] = p. Let again τ := τa ∧ τb where
a < 0 < b. Define q := 1 − p, δ := p − q > 0, and φ(x) := (q/p)x .
Claim 3.1.44. We have:
(i) τ < +∞ almost surely;
(ii) P[τa < τb ] = (φ(b) − φ(0))/(φ(b) − φ(a)) ;

(iii) E[τb ] = b/(2p − 1) ;

(iv) τa = +∞ with positive probability.


Proof. Let ψt (x) := x − δt. We use two martingales: (φ(St )) and (ψt (St )).
Observe that indeed both processes are clearly integrable and

E[φ(St ) | Ft−1 ] = p(q/p)^{St−1 +1} + q(q/p)^{St−1 −1} = φ(St−1 ),

and

E[ψt (St ) | Ft−1 ] = p[St−1 + 1 − δt] + q[St−1 − 1 − δt] = ψt−1 (St−1 ).

(i) This claim follows by the same argument as in the unbiased case.
(ii) Note that (φ(St )) is a nonnegative, bounded martingale since q < p by as-
sumption. By Lemma 3.1.37 and dominated convergence (Proposition B.4.14),

φ(0) = E[φ(Sτ )] = P[τa < τb ] φ(a) + P[τa > τb ] φ(b),


or, rearranging, P[τa < τb ] = (φ(b) − φ(0))/(φ(b) − φ(a)) . Taking b → +∞, by monotonicity

        P[τa < +∞] = 1/φ(a) < 1,                  (3.1.9)

so that τa = +∞ with positive probability. On the other hand, P[τb < τa ] =
1 − P[τa < τb ] = (φ(0) − φ(a))/(φ(b) − φ(a)) , and taking a → −∞

        P[τb < +∞] = 1.

(iii) By Lemma 3.1.37 applied to (ψt (St )),

0 = E[Sτb ∧t − δ(τb ∧ t)]. (3.1.10)

By monotone convergence (Proposition B.4.14), E[τb ∧ t] ↑ E[τb ]. Further-


more, observe that − inf_t St ≥ 0 almost surely since S0 = 0. Moreover, for
x ≥ 0, by (3.1.9)

        P[− inf_t St ≥ x] = P[τ−x < +∞] = (q/p)^x ,

so that E[− inf_t St ] = Σ_{x≥1} P[− inf_t St ≥ x] < +∞. Hence, in (3.1.10),
we can use dominated convergence (Proposition B.4.14) with

        |Sτb ∧t | ≤ max{b, − inf_t St },

and the fact that τb < +∞ almost surely from (ii) to deduce that E[τb ] =
E[Sτb ]/(p − q) = b/(2p − 1) .

(iv) That claim was proved in (ii).


That concludes the proof.
Note that, in (iii) above, in order to apply Wald’s first identity directly we would
have had to prove that τb ∈ L1 first. J
We also obtain the following maximal version of Markov’s inequality (Theo-
rem 2.1.1).
Theorem 3.1.45 (Doob's submartingale inequality). Let (Mt ) be a nonnegative
submartingale. Then, for b > 0,

        P[ sup_{0≤s≤t} Ms ≥ b ] ≤ E[Mt ]/b .
Observe that a naive application of Markov's inequality implies only that

        sup_{0≤s≤t} P[Ms ≥ b] ≤ E[Mt ]/b ,

where we used that E[Ms ] ≤ E[Mt ] for all 0 ≤ s ≤ t for a submartingale. Intro-
ducing an appropriate stopping time immediately gives something stronger. (Exer-
cise 3.6 asks for the supermartingale version of this.)
Proof. Let σ be the first time that Mt ≥ b. Then the event of interest can be
characterized as
        { sup_{0≤s≤t} Ms ≥ b } = {Mσ∧t ≥ b} .

By Markov's inequality,
        P[Mσ∧t ≥ b] ≤ E[Mσ∧t ]/b .

Lemma 3.1.37 implies that E[Mσ∧t ] ≤ E[Mt ], which concludes the proof.
One consequence of the previous bound is a strengthening of Chebyshev’s inequal-
ity (Theorem 2.1.2) for sums of independent random variables.
Corollary 3.1.46 (Kolmogorov's maximal inequality). Let X1 , X2 , . . . be inde-
pendent random variables with E[Xi ] = 0 and Var[Xi ] < +∞. Define St =
Σ_{i≤t} Xi . Then, for β > 0,

        P[ max_{i≤t} |Si | ≥ β ] ≤ Var[St ]/β^2 .
Proof. By Example 3.1.29, (St ) is a martingale. By Lemma 3.1.35, (St2 ) is hence
a (nonnegative) submartingale. The result follows from Doob’s submartingale in-
equality (Theorem 3.1.45).
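Kolmogorov's inequality can be compared with the exact maximum distribution for ±1 steps (so Var[St ] = t) by enumeration (an illustrative check of ours):

```python
from itertools import product
from fractions import Fraction

def check_kolmogorov(t, beta):
    """Exact P[max_{i<=t} |S_i| >= beta] versus the bound t/beta^2."""
    hit = 0
    for steps in product((-1, 1), repeat=t):
        s = m = 0
        for x in steps:
            s += x
            m = max(m, abs(s))
        hit += (m >= beta)
    assert Fraction(hit, 2 ** t) <= Fraction(t, beta ** 2)
```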

Convergence Finally another fundamental result about martingales is the follow-


ing convergence theorem, which we state without proof. We give a quick applica-
tion below.
Theorem 3.1.47 (Convergence theorem). Let (Mt ) be a supermartingale bounded
in L1 . Then (Mt ) converges almost surely to a finite limit M∞ . Moreover, letting
M∞ := lim sup_t Mt , then M∞ ∈ F∞ and E|M∞ | < +∞.
Corollary 3.1.48 (Convergence of non-negative supermartingales). If (Mt ) is a
non-negative supermartingale then Mt converges almost surely to a finite limit
M∞ with E[M∞ ] ≤ E[M0 ].
Proof. By the supermartingale property, (Mt ) is bounded in L1 since
E|Mt | = E[Mt ] ≤ E[M0 ], ∀t.
Then we use the martingale convergence theorem (Theorem 3.1.47) and Fatou’s
lemma (Proposition B.4.14).
Example 3.1.49 (Pólya's urn). An urn contains 1 red ball and 1 green ball. At each
time, we pick one ball and put it back with an extra ball of the same color. This
process is known as Pólya's urn. Let Rt (respectively Gt ) be the number of red
balls (respectively green balls) after the tth draw. Let

        Ft := σ(R0 , G0 , R1 , G1 , . . . , Rt , Gt ).
Define Mt to be the fraction of green balls after the tth draw. Then

        E[Mt | Ft−1 ] = Rt−1/(Gt−1 + Rt−1 ) · Gt−1/(Gt−1 + Rt−1 + 1)
                        + Gt−1/(Gt−1 + Rt−1 ) · (Gt−1 + 1)/(Gt−1 + Rt−1 + 1)
                      = Gt−1/(Gt−1 + Rt−1 )
                      = Mt−1 .

Since Mt ≥ 0 and is a martingale, we have Mt → M∞ almost surely. In fact,
Exercise 3.4 asks for a proof that

        P[Gt = m + 1] = (t choose m) m!(t − m)!/(t + 1)! = 1/(t + 1) .

So taking a limit as t → +∞

        P[Mt ≤ x] = ⌊x(t + 2) − 1⌋/(t + 1) → x.

That is, (Mt ) converges in distribution to a uniform random variable on [0, 1]. J
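The claimed uniform law of Gt is easy to verify exactly by propagating the distribution of the urn (a short illustrative computation of ours):

```python
from fractions import Fraction

def green_distribution(t):
    """Exact distribution of G_t (number of green balls after t draws),
    starting from 1 green and 1 red ball: one step of the urn dynamics
    per draw, with exact rational probabilities."""
    dist = {1: Fraction(1)}
    for n in range(t):
        total = n + 2                     # balls in the urn before this draw
        new = {}
        for g, p in dist.items():
            new[g + 1] = new.get(g + 1, Fraction(0)) + p * g / total
            new[g] = new.get(g, Fraction(0)) + p * (total - g) / total
        dist = new
    return dist
```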

Convergence of the expectation in general requires stronger conditions. A sim-


ple case is boundedness in L2 . Before stating the result, we derive a key property
of martingales in L2 which will be useful later.

Lemma 3.1.50 (Orthogonality of increments). Let (Mt ) be a martingale with
Mt ∈ L2 for all t. Let s ≤ t ≤ u ≤ v. Then,

        ⟨Mt − Ms , Mv − Mu ⟩ = 0,

where ⟨X, Y ⟩ = E[XY ].

Proof. Use Mu = E[Mv | Fu ] and Mt − Ms ∈ Fu , and apply the L2 characteriza-


tion of the conditional expectation (Theorem B.6.2).

In words, martingale increments over disjoint time intervals are uncorrelated (pro-
vided the second moment exists). Note that this is weaker than the independence
of increments of random walks. (See Section 3.2.1 for more discussion on this.)

Theorem 3.1.51 (Convergence of martingales bounded in L2 ). Let (Mt ) be a mar-
tingale with Mt ∈ L2 for all t. Then (Mt ) is bounded in L2 if and only if

        Σ_{t≥1} E[(Mt − Mt−1 )^2 ] < +∞.

When this is the case, Mt converges almost surely and in L2 to a finite limit M∞ ,
and furthermore
E[Mt ] → E[M∞ ] < +∞,
as t → +∞.

Proof. Writing Mt as a telescoping sum of increments, the orthogonality of incre-
ments (Lemma 3.1.50) implies

        E[Mt^2 ] = E[ ( M0 + Σ_{s=1}^t (Ms − Ms−1 ) )^2 ]
                 = E[M0^2 ] + Σ_{s=1}^t E[(Ms − Ms−1 )^2 ],

proving the first claim.


By the monotonicity of norms (Lemma B.4.16), (Mt ) being bounded in L2 im-
plies that (Mt ) is bounded in L1 which, in turn, implies that Mt converges almost

surely to a finite limit M∞ with E|M∞ | < +∞ by Theorem 3.1.47. Then using
Fatou’s lemma (Proposition B.4.14) in
        E[(Mt+s − Mt )^2 ] = Σ_{t+1≤i≤t+s} E[(Mi − Mi−1 )^2 ],

gives
        E[(M∞ − Mt )^2 ] ≤ Σ_{i≥t+1} E[(Mi − Mi−1 )^2 ].

The right-hand side goes to 0 as t → +∞ since the full series is finite, which
proves the second claim.
The last claim follows from Lemmas B.4.16 and B.4.17.

3.1.4 . Percolation: critical regime on infinite d-regular tree


Consider bond percolation (see Definition 1.2.1) on the infinite d-regular tree Td
rooted at a vertex 0. In Section 2.3.3, we showed that

        pc (Td ) = sup{p ∈ [0, 1] : Pp [|C0 | = +∞] = 0} = 1/(d − 1) ,

where recall that C0 is the open cluster of the root. Here we consider the critical
case, that is, we set density p = 1/(d − 1). (The same results apply to the infinite b-ary
tree T̂b with d = b + 1.) Assume d ≥ 3 (since d = 2 is simply a path).
First:
Claim 3.1.52. |C0 | < +∞ almost surely.
Let Xn := |∂n ∩ C0 |, where ∂n are the n-th level vertices. In Section 2.3.3, we
proved the same claim in the subcritical case using the first moment method. It
does not work here because

        E[Xn ] = d(d − 1)^{n−1} p^n = d/(d − 1) ↛ 0.
Instead we use a martingale argument which will be generalized when we discuss
branching processes in Section 6.1.

Proof of Claim 3.1.52. Let b := d − 1 be the branching ratio. Because the root has
a different number of children, we consider the descendants of its children. Let Zn
be the number of vertices in the open cluster of the first child of the root n levels
below it and let Fn = σ(Z0 , . . . , Zn ). Then Z0 = 1 and

E[Zn | Fn−1 ] = bpZn−1 = Zn−1 .



So (Zn ) is a nonnegative, integer-valued martingale and it converges almost surely


to a finite limit by Corollary 3.1.48. (In particular, E[Zn ] = 1, which will be useful
below.) But, clearly, for any integer k > 0 and N ≥ 0

P[Zn = k, ∀n ≥ N ] = 0,

so it must be that the limit is 0 almost surely. In other words, Zn is eventually 0 for
all n large enough. This is true for every child of the root. Hence the open cluster
of the root is finite almost surely.

On the other hand:

Claim 3.1.53.
E|C0 | = +∞.

Proof. Consider the descendant subtree, T1 , of the first child of the root, which we
denote by 1. Let C̃1 be the open cluster of 1 in T1 . As we showed in the previous
claim, the expected number of vertices on any level of T1 is 1. So E|C̃1| = +∞ by
summing over the levels.
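The dichotomy just established (|C0| < +∞ almost surely, yet E|C0| = +∞) can be observed in simulation. The following Python sketch (ours, not from the text; all names are illustrative) generates the generation sizes (Zn) below one child of the root at the critical density p = 1/b with b = d − 1: the empirical mean of Zn stays near 1, yet essentially every run dies out.

```python
import random

def simulate_Z(b, levels, rng):
    """Generation sizes Z_0, ..., Z_levels of the open cluster below one
    child of the root at critical density p = 1/b (with b = d - 1)."""
    Z = [1]
    for _ in range(levels):
        # each of the Z[-1] current vertices has Binomial(b, 1/b) open children
        children = sum(1 for _ in range(Z[-1] * b) if rng.random() < 1.0 / b)
        Z.append(children)
        if children == 0:
            Z.extend([0] * (levels + 1 - len(Z)))   # extinct: pad with zeros
            break
    return Z

rng = random.Random(0)
b, levels, runs = 2, 200, 2000        # d = 3: the infinite 3-regular tree
trajs = [simulate_Z(b, levels, rng) for _ in range(runs)]

# martingale property: E[Z_n] = 1 for every n (checked here at n = 5) ...
mean_Z5 = sum(t[5] for t in trajs) / runs
# ... and yet Z_n hits 0 in (almost) every run
extinct_frac = sum(t[-1] == 0 for t in trajs) / runs
print(mean_Z5, extinct_frac)
```

With these parameters the empirical mean at level 5 is close to 1 while the vast majority of runs are extinct well before level 200, matching Claim 3.1.52.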

3.2 Concentration for martingales and applications


The Chernoff-Cramér method extends naturally to martingales. This observation
leads to powerful new tail bounds that hold far beyond the case of sums of inde-
pendent variables. In particular it will allow us to prove one version of the concen-
tration phenomenon, which can be stated informally as: a function f (X1 , . . . , Xn )
of many independent random variables that is not too sensitive to any of its coor-
dinates tends to be close to its mean.

3.2.1 Azuma-Hoeffding inequality


The main result of this section is the following generalization of Hoeffding’s in-
equality (Theorem 2.4.10).

Theorem 3.2.1 (Maximal Azuma-Hoeffding inequality). Let (Zt )t∈Z+ be a mar-


tingale with respect to the filtration (Ft )t∈Z+ . Assume that there are predictable
processes (At ) and (Bt ) (i.e., At , Bt ∈ Ft−1 ) and constants 0 < ct < +∞ such
that: for all t ≥ 1, almost surely,

At ≤ Zt − Zt−1 ≤ Bt and Bt − At ≤ ct .

Then, for all β > 0,

P[ sup_{0≤i≤t} (Zi − Z0) ≥ β ] ≤ exp( −2β² / Σ_{i≤t} c_i² ).

Applying this inequality to (−Zt ) gives a tail bound in the other direction.

Proof of Theorem 3.2.1. As in the Chernoff-Cramér method, we start by applying


Markov’s inequality (Theorem 2.1.1). Here we use the maximal version for sub-
martingales, Doob’s submartingale inequality (Theorem 3.1.45). First notice that
e^{sx} is increasing and convex for s > 0, so that by Lemma 3.1.35 the process
(e^{s(Zt − Z0)})t is a submartingale. Hence, for s > 0, by Theorem 3.1.45,

P[ sup_{0≤i≤t} (Zi − Z0) ≥ β ] = P[ sup_{0≤i≤t} e^{s(Zi − Z0)} ≥ e^{sβ} ]
                               ≤ E[ e^{s(Zt − Z0)} ] / e^{sβ}
                               = E[ e^{s Σ_{r=1}^t (Zr − Zr−1)} ] / e^{sβ}.    (3.2.1)
Unlike the Chernoff-Cramér case, however, the terms in the exponent are not
independent. Instead, to exploit the martingale property, we condition on the filtra-
tion. By taking out what is known (Lemma B.6.13)
E[ E[ e^{s Σ_{r=1}^t (Zr − Zr−1)} | Ft−1 ] ] = E[ e^{s Σ_{r=1}^{t−1} (Zr − Zr−1)} E[ e^{s(Zt − Zt−1)} | Ft−1 ] ].

The martingale property and the assumption in the statement imply that, condi-
tioned on Ft−1 , the random variable Zt − Zt−1 is centered and lies in an interval
of length ct . Hence by Hoeffding’s lemma (Lemma 2.4.12), it holds almost surely
that

E[ e^{s(Zt − Zt−1)} | Ft−1 ] ≤ exp( s² (ct²/4) / 2 ) = exp( ct² s² / 8 ).    (3.2.2)

Using the tower property (Lemma B.6.16) and arguing by induction, we obtain

E[ e^{s(Zt − Z0)} ] ≤ exp( s² Σ_{r≤t} cr² / 8 ).

Put differently, we have proved that Zt − Z0 is sub-Gaussian with variance factor
(1/4) Σ_{r≤t} cr². By (2.4.16) (or, equivalently, by choosing s = β/((1/4) Σ_{r≤t} cr²)
in (3.2.1)) we get the result.
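As a numerical sanity check (ours, not from the text), consider simple random walk, a martingale whose increments lie in [−1, 1] so that we may take ct = 2 in Theorem 3.2.1; the empirical tail of the running maximum indeed sits below the stated bound.

```python
import math, random

rng = random.Random(1)
t, beta, runs = 400, 40.0, 5000

exceed = 0
for _ in range(runs):
    z, running_max = 0, 0
    for _ in range(t):
        z += rng.choice((-1, 1))       # martingale increment in [-1, 1]
        running_max = max(running_max, z)
    if running_max >= beta:
        exceed += 1

empirical = exceed / runs
# Theorem 3.2.1 with c_i = 2: exp(-2 beta^2 / sum_i c_i^2) = exp(-beta^2 / (2t))
bound = math.exp(-2 * beta ** 2 / (4 * t))
print(empirical, bound)
```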

In Theorem 3.2.1 the martingale difference sequence (Xt), where Xt := Zt −
Zt−1 , is not only “pairwise uncorrelated” by Lemma 3.1.50, that is,

E[Xs Xr] = 0, ∀r ≠ s,

but it is in fact “mutually uncorrelated,” that is,

E[Xj1 · · · Xjk] = 0, ∀k ≥ 1, ∀1 ≤ j1 < · · · < jk .

This stronger property helps explain why Σ_{r≤t} Xr is highly concentrated. This
point is the subject of Exercise 3.7, which guides the reader through a slightly
different proof of the Azuma-Hoeffding inequality. Compare with Exercises 2.5
and 2.6.

3.2.2 Method of bounded differences


The power of the maximal Azuma-Hoeffding inequality (Theorem 3.2.1) is that it
produces tail inequalities for quantities other than sums of independent variables.
The setting is the following. Let X1 , . . . , Xn be independent random variables
where Xi is Xi -valued for all i and let X = (X1 , . . . , Xn ). Assume that f :
X1 × · · · × Xn → R is a measurable function. Our goal is to characterize the
concentration properties of f (X) around its expectation in terms of its “discrete
derivatives”

Di f(x) := sup_{y∈Xi} f(x1 , . . . , xi−1 , y, xi+1 , . . . , xn)
           − inf_{y′∈Xi} f(x1 , . . . , xi−1 , y′, xi+1 , . . . , xn),

where x = (x1 , . . . , xn ) ∈ X1 × · · · × Xn . We think of Di f (x) as a measure of


the “sensitivity” of f to its i-th coordinate.

High-level idea
We begin with two easier bounds that we will improve below. The trick to ana-
lyzing the concentration of f (X) is to consider the Doob martingale (see Exam-
ple 3.1.33)

Zi = E[f (X) | Fi ], (3.2.3)

where Fi = σ(X1 , . . . , Xi ), which is well-defined provided E|f (X)| < +∞.


Note that
Zn = E[f (X) | Fn ] = f (X),

and
Z0 = E[f (X)],
so that we can write
n
X
f (X) − E[f (X)] = (Zi − Zi−1 ).
i=1

Intuitively, the martingale difference Zi −Zi−1 tracks the change in our expectation
of f (X) as Xi is revealed.
In fact a clever probabilistic argument relates martingale differences directly to
discrete derivatives. Let X 0 = (X10 , . . . , Xn0 ) be an independent copy of X and let
X (i) = (X1 , . . . , Xi−1 , Xi0 , Xi+1 , . . . , Xn ).
Then
Zi − Zi−1 = E[f (X) | Fi ] − E[f (X) | Fi−1 ]
= E[f (X) | Fi ] − E[f (X (i) ) | Fi−1 ]
= E[f (X) | Fi ] − E[f (X (i) ) | Fi ]
= E[f (X) − f (X (i) ) | Fi ].
Note that we crucially used the independence of the Xk s in the second and third
lines. But then, by Jensen’s inequality (Lemma B.6.12),
|Zi − Zi−1| ≤ ‖Di f‖∞ .    (3.2.4)
Assume further that E[f (X)2 ] < +∞. By the orthogonality of increments of
martingales in L2 (Lemma 3.1.50), we immediately obtain a bound on the variance
of f
Var[f(X)] = E[(Zn − Z0)²] = Σ_{i=1}^n E[ (Zi − Zi−1)² ] ≤ Σ_{i=1}^n ‖Di f‖∞².    (3.2.5)

By the maximal Azuma-Hoeffding inequality and the fact that


Zi − Zi−1 ∈ [−‖Di f‖∞ , ‖Di f‖∞],
we also get a bound on the tail
P[f(X) − E[f(X)] ≥ β] ≤ exp( −β² / (2 Σ_{i≤n} ‖Di f‖∞²) ).    (3.2.6)
A more careful analysis, which we detail below, leads to a better bound.
We emphasize that, although it may not be immediately obvious, independence
plays a crucial role in the bound (3.2.4), as the next example shows.

Example 3.2.2 (A counterexample). Let f (x1 , . . . , xn ) = x1 + · · · + xn where


xi ∈ {−1, 1} for all i. Then,

‖D1 f‖∞ = sup_{x2,...,xn} [(1 + x2 + · · · + xn) − (−1 + x2 + · · · + xn)] = 2,

and similarly ‖Di f‖∞ = 2 for i = 2, . . . , n. Let X1 be a uniform random variable


on {−1, 1}. First consider the case where we set X2 , . . . , Xn all equal to X1 . Then

E[f (X1 , . . . , Xn )] = 0,

and
E[f (X1 , . . . , Xn ) | X1 ] = nX1 ,
so that
|E[f (X1 , . . . , Xn ) | X1 ] − E[f (X1 , . . . , Xn )]| = n > 2.
In particular, the corresponding Doob martingale does not have increments bounded
by ‖Di f‖∞ = 2.
For a less extreme example which has support over all of {−1, 1}^n, let
Ui = 1 with probability 1 − ε, and Ui = −1 with probability ε,

for some ε > 0 independently for all i = 1, . . . , n − 1. Let again X1 be a uniform


random variable on {−1, 1} and, for i = 2, . . . , n, define the random variable Xi =
Ui−1 Xi−1 , that is, Xi is the same as Xi−1 with probability 1 − ε and otherwise is
flipped. Then,

E[f(X1 , . . . , Xn)] = E[X1 + · · · + Xn]
                     = E[ X1 (1 + Σ_{i=1}^{n−1} Π_{j≤i} Uj) ]
                     = E[X1] E[ 1 + Σ_{i=1}^{n−1} Π_{j≤i} Uj ]
                     = 0,

by the independence of X1 and the Ui s. Similarly


 
E[f(X1 , . . . , Xn) | X1] = X1 E[ 1 + Σ_{i=1}^{n−1} Π_{j≤i} Uj ] = X1 Σ_{i=0}^{n−1} (1 − 2ε)^i,

so that
|E[f(X1 , . . . , Xn) | X1] − E[f(X1 , . . . , Xn)]| = Σ_{i=0}^{n−1} (1 − 2ε)^i > 2,

for ε small enough and n ≥ 3. In particular, the corresponding Doob martingale


does not have increments bounded by ‖Di f‖∞ = 2. J
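The second construction can also be probed numerically. The following sketch (ours; the parameters are illustrative) estimates E[f(X1, . . . , Xn) | X1 = 1] by simulation and compares it to the closed form Σ_{i=0}^{n−1}(1 − 2ε)^i derived above, which exceeds 2 for small ε.

```python
import random

rng = random.Random(2)
n, eps, runs = 10, 0.05, 20000

def sample_f_given_x1(x1):
    # X_i = U_{i-1} X_{i-1}: each coordinate repeats the previous one
    # with probability 1 - eps and is flipped with probability eps
    x, total = x1, x1
    for _ in range(n - 1):
        x = x if rng.random() < 1 - eps else -x
        total += x
    return total

estimate = sum(sample_f_given_x1(1) for _ in range(runs)) / runs
exact = sum((1 - 2 * eps) ** i for i in range(n))   # = E[f | X1 = 1]
print(estimate, exact)
```

Both numbers are far above 2, confirming that the first increment of the Doob martingale violates the bound ‖Di f‖∞ = 2 without independence.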

Variance bounds
We give improved bounds on the variance (compared to (3.2.5)). Our first bound
explicitly decomposes the variance of f (X) over the contributions of its individual
entries.
Theorem 3.2.3 (Tensorization of the variance). Let X1 , . . . , Xn be independent
random variables where Xi is Xi -valued for all i and let X = (X1 , . . . , Xn ).
Assume that f : X1 × · · · × Xn → R is a measurable function with E[f (X)2 ] <
+∞. Define Fi = σ(X1 , . . . , Xi ), Gi = σ(X1 , . . . , Xi−1 , Xi+1 , . . . , Xn ) and
Zi = E[f (X) | Fi ]. Then we have
Var[f(X)] ≤ Σ_{i=1}^n E[ Var[f(X) | Gi] ].

Proof of Theorem 3.2.3. The key lemma is the following.


Lemma 3.2.4.

E [E [f (X) | Gi ] | Fi ] = E [f (X) | Fi−1 ]

Proof. By the tower property (Lemma B.6.16),

E [f (X) | Fi−1 ] = E [E [f (X) | Gi ] | Fi−1 ] .

Moreover, σ(Xi ) is independent of σ(Gi , Fi−1 ) so by the role of independence


(Lemma B.6.14), we have

E [E [f (X) | Gi ] | Fi−1 ] = E [E [f (X) | Gi ] | Fi−1 , Xi ] = E [E [f (X) | Gi ] | Fi ] .

Combining the last two displays gives the result.

Again, we take advantage of the orthogonality of increments to write


Var[f(X)] = Σ_{i=1}^n E[ (Zi − Zi−1)² ].

By the lemma above,

(Zi − Zi−1)² = (E[f(X) | Fi] − E[f(X) | Fi−1])²
             = (E[f(X) | Fi] − E[E[f(X) | Gi] | Fi])²
             = (E[ f(X) − E[f(X) | Gi] | Fi ])²
             ≤ E[ (f(X) − E[f(X) | Gi])² | Fi ],

where we used Jensen’s inequality on the last line. Taking expectations and using
the tower property
Var[f(X)] = Σ_{i=1}^n E[ (Zi − Zi−1)² ]
          ≤ Σ_{i=1}^n E[ E[ (f(X) − E[f(X) | Gi])² | Fi ] ]
          = Σ_{i=1}^n E[ (f(X) − E[f(X) | Gi])² ]
          = Σ_{i=1}^n E[ E[ (f(X) − E[f(X) | Gi])² | Gi ] ]
          = Σ_{i=1}^n E[ Var[f(X) | Gi] ].

That concludes the proof.

We derive two useful consequences of the tensorization property of the vari-


ance. The first one is the Efron-Stein inequality.

Theorem 3.2.5 (Efron-Stein inequality). Let X1 , . . . , Xn be independent random


variables where Xi is Xi -valued for all i and let X = (X1 , . . . , Xn ). Assume that
f : X1 × · · · × Xn → R is a measurable function with E[f (X)2 ] < +∞. Let
X 0 = (X10 , . . . , Xn0 ) be an independent copy of X and

X (i) = (X1 , . . . , Xi−1 , Xi0 , Xi+1 , . . . , Xn ).

Then,
Var[f(X)] ≤ (1/2) Σ_{i=1}^n E[ (f(X) − f(X^(i)))² ].

Proof. Observe that if Y′ is an independent copy of Y ∈ L², then Var[Y] =
(1/2) E[(Y − Y′)²], which can be seen by adding and subtracting the mean, expanding
and using independence. Hence,

Var[f(X) | Gi] = (1/2) E[ (f(X) − f(X^(i)))² | Gi ],

where we used the independence of the Xi s and the Xi′ s. Plugging back into Theo-
rem 3.2.3 gives the claim.
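As a concrete illustration (ours, not from the text), take f(X) = max(X1, . . . , Xn) with i.i.d. uniform coordinates. The Efron-Stein bound can be estimated by Monte Carlo alongside the empirical variance:

```python
import random

rng = random.Random(3)
n, runs = 20, 4000

f_vals, es_vals = [], []
for _ in range(runs):
    x = [rng.random() for _ in range(n)]
    x_prime = [rng.random() for _ in range(n)]   # independent copy X'
    fx = max(x)
    f_vals.append(fx)
    # (1/2) * sum_i (f(X) - f(X^{(i)}))^2, where X^{(i)} is X with X_i resampled
    s = 0.0
    for i in range(n):
        xi = x[:]
        xi[i] = x_prime[i]
        s += (fx - max(xi)) ** 2
    es_vals.append(0.5 * s)

mean_f = sum(f_vals) / runs
emp_var = sum((v - mean_f) ** 2 for v in f_vals) / runs
es_bound = sum(es_vals) / runs
print(emp_var, es_bound)     # empirical variance vs Efron-Stein estimate
```

For the maximum of uniforms the Efron-Stein estimate exceeds the variance only by a small constant factor, so the inequality is quite sharp here.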

Our second consequence of Theorem 3.2.3 is a Poincaré-type inequality which


relates the variance of a function to its expected “square gradient.” Compare to
the much weaker (3.2.5), which involves in each term a supremum rather than an
expectation.
Theorem 3.2.6 (Bounded differences inequality). Let X1 , . . . , Xn be independent
random variables where Xi is Xi -valued for all i and let X = (X1 , . . . , Xn ).
Assume that f : X1 × · · · × Xn → R is a measurable function with E[f (X)2 ] <
+∞. Then
Var[f(X)] ≤ (1/4) Σ_{i=1}^n E[ Di f(X)² ].

Proof. By Lemma 2.4.11,


Var[f(X) | Gi] ≤ (1/4) Di f(X)².
Plugging back into Theorem 3.2.3 gives the claim.
Remark 3.2.7. For comparison, a version of the Poincaré inequality in one dimension
asserts the following: let f : [0, T] → R be continuously differentiable with f(0) =
f(T) = 0, ∫_0^T [f(x)² + f′(x)²] dx < +∞ and ∫_0^T f(x) dx = 0; then

∫_0^T f(x)² dx ≤ C ∫_0^T f′(x)² dx,    (3.2.7)

where the best possible C is T²/4π² (see, e.g., [SS03, Chapter 3, Exercise 11]; this case is
also known as Wirtinger’s inequality). We give a quick proof for T = 1 with the suboptimal
C = 1. Note that f(x) = ∫_0^x f′(y) dy so, by Cauchy-Schwarz (Theorem B.4.8),

f(x)² ≤ x ∫_0^x f′(y)² dy ≤ ∫_0^1 f′(y)² dy.

The result follows by integration. Intuitively, for a function with mean 0 to have a large
norm, it must have a large absolute derivative somewhere.
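A quick numerical check (ours) that f(x) = sin(2πx/T), which satisfies the hypotheses above, attains the optimal constant T²/4π² in (3.2.7):

```python
import math

T, N = 3.0, 10000
h = T / N
xs = [(k + 0.5) * h for k in range(N)]            # midpoint-rule nodes
f_sq = sum(math.sin(2 * math.pi * x / T) ** 2 for x in xs) * h
fp_sq = sum(((2 * math.pi / T) * math.cos(2 * math.pi * x / T)) ** 2
            for x in xs) * h
ratio = f_sq / fp_sq                               # should equal T^2 / (4 pi^2)
print(ratio, T ** 2 / (4 * math.pi ** 2))
```

Here ∫ f² = T/2 and ∫ f′² = (2π/T)² T/2, so the ratio is exactly T²/4π²; the midpoint rule recovers this to high accuracy since the integrands are periodic.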

Example 3.2.8 (Longest common subsequence). Let X1 , . . . , X2n be independent


uniform random variables in {−1, +1}. Let Z be the length of the longest common
subsequence in (X1 , . . . , Xn ) and (Xn+1 , . . . , X2n ), that is,

Z = max{ k : ∃ 1 ≤ i1 < i2 < · · · < ik ≤ n
         and n + 1 ≤ j1 < j2 < · · · < jk ≤ 2n
         such that Xi1 = Xj1 , Xi2 = Xj2 , . . . , Xik = Xjk }.

Then, writing Z = f(X1 , . . . , X2n), it follows that ‖Di f‖∞ ≤ 1. Indeed, fix


x = (x1 , . . . , x2n ) and let xi,+ (respectively xi,− ) be x where the i-th component
is replaced with +1 (respectively −1). Assume without loss of generality that
f (xi,− ) ≤ f (xi,+ ). Then |f (xi,+ ) − f (xi,− )| ≤ 1 because removing the i-th
component (and its match) from a longest common subsequence when xi = +1 (if
present) decreases the length by 1. Since this is true for any x, we have ‖Di f‖∞ ≤
1. Finally, by the bounded differences inequality (Theorem 3.2.6),
Var[Z] ≤ (1/4) Σ_{i=1}^{2n} ‖Di f‖∞² ≤ n/2,

which is much better than the obvious Var[Z] ≤ E[Z²] ≤ n². Note that we did not
require any information about the expectation of Z. J
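For a numerical companion (ours, not from the text), Z can be computed with the classical dynamic program and the empirical variance compared to the bound n/2:

```python
import random

def lcs_length(a, b):
    """Classical O(|a||b|) dynamic program for the longest common subsequence."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

rng = random.Random(4)
n, runs = 30, 1500
vals = []
for _ in range(runs):
    seq = [rng.choice((-1, 1)) for _ in range(2 * n)]
    vals.append(lcs_length(seq[:n], seq[n:]))

mean_z = sum(vals) / runs
emp_var = sum((v - mean_z) ** 2 for v in vals) / runs
print(emp_var, n / 2)        # empirical variance vs the bound n/2
```

The empirical variance is in fact far below n/2, which is consistent: the bounded differences inequality only gives an upper bound.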

McDiarmid’s inequality
The following powerful consequence of the Azuma-Hoeffding inequality is com-
monly referred to as the method of bounded differences. Compare to (3.2.6).

Theorem 3.2.9 (McDiarmid’s inequality). Let X1 , . . . , Xn be independent random


variables where Xi is Xi -valued for all i, and let X = (X1 , . . . , Xn ). Assume
f : X1 × · · · × Xn → R is a measurable function such that ‖Di f‖∞ < +∞ for
all i. Then for all β > 0

P[f(X) − Ef(X) ≥ β] ≤ exp( −2β² / Σ_{i≤n} ‖Di f‖∞² ).

Once again, applying the inequality to −f gives a tail bound in the other direction.

Proof of Theorem 3.2.9. As before, we let

Zi = E[f (X) | Fi ],

where Fi = σ(X1 , . . . , Xi ), we let Gi = σ(X1 , . . . , Xi−1 , Xi+1 , . . . , Xn ). Then,


it holds that Ai ≤ Zi − Zi−1 ≤ Bi where

Bi = E[ sup_{y∈Xi} f(X1 , . . . , Xi−1 , y, Xi+1 , . . . , Xn) − f(X) | Fi−1 ],

and

Ai = E[ inf_{y∈Xi} f(X1 , . . . , Xi−1 , y, Xi+1 , . . . , Xn) − f(X) | Fi−1 ].

Indeed, since σ(Xi ) is independent of Fi−1 and Gi , by the role of independence


(Lemma B.6.14)

Zi = E[f(X) | Fi]
   ≤ E[ sup_{y∈Xi} f(X1 , . . . , Xi−1 , y, Xi+1 , . . . , Xn) | Fi ]
   = E[ sup_{y∈Xi} f(X1 , . . . , Xi−1 , y, Xi+1 , . . . , Xn) | Fi−1 , Xi ]
   = E[ sup_{y∈Xi} f(X1 , . . . , Xi−1 , y, Xi+1 , . . . , Xn) | Fi−1 ],

and similarly for the other direction. Moreover, by definition, Bi − Ai ≤ ‖Di f‖∞ =:
ci . The Azuma-Hoeffding inequality then gives the result.

Examples
The moral of McDiarmid’s inequality is that functions of independent variables
that are smooth, in the sense that they do not depend too much on any one of
their variables, are concentrated around their mean. Here are some straightforward
applications.
Example 3.2.10 (Balls and bins: empty bins). Suppose we throw m balls into n
bins independently, uniformly at random. The number of empty bins, Zn,m , is
centered at
EZn,m = n (1 − 1/n)^m.
Writing Zn,m as the sum of indicators Σ_{i=1}^n 1_{Bi}, where Bi is the event that bin
i is empty, is a natural first attempt at proving concentration around the mean.
However there is a problem—the Bi s are not independent. Indeed, because there

is a fixed number of bins, the event Bi intuitively makes the other such events less
likely. Instead let Xj be the index of the bin in which ball j lands. The Xj s are
independent by construction and, moreover, letting Zn,m = f (X1 , . . . , Xm ) we
have ‖Di f‖∞ ≤ 1. Indeed, moving a single ball changes the number of empty
bins by at most 1 (if at all). Hence by the method of bounded differences

P[ |Zn,m − n(1 − 1/n)^m| ≥ b√m ] ≤ 2e^{−2b²}.
J
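A quick simulation (ours; the parameters are illustrative) confirms the concentration of the number of empty bins:

```python
import math, random

rng = random.Random(5)
n = m = 100                          # n bins, m balls
b, runs = 2.0, 3000

mean = n * (1 - 1 / n) ** m
exceed = 0
for _ in range(runs):
    occupied = {rng.randrange(n) for _ in range(m)}   # bins hit by the m balls
    empty = n - len(occupied)
    if abs(empty - mean) >= b * math.sqrt(m):
        exceed += 1

empirical = exceed / runs
bound = 2 * math.exp(-2 * b ** 2)    # method of bounded differences
print(empirical, bound)
```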

Example 3.2.11 (Pattern matching). Let X = (X1 , X2 , . . . , Xn ) be i.i.d. random


variables taking values uniformly at random in a finite set S of size s = |S|. Let
a = (a1 , . . . , ak ) be a fixed string of elements of S. We are interested in the
number of occurrences of a as a (consecutive) substring in X, which we denote by
Nn . Denote by Ei the event that the substring of X starting at i is a. Summing
over the starting positions and using the linearity of expectation, the mean of Nn is

ENn = E[ Σ_{i=1}^{n−k+1} 1_{Ei} ] = (n − k + 1) (1/s)^k.

However the 1Ei s are not independent. So we cannot use a Chernoff bound for
Poisson trials (Theorem 2.4.7). Instead we use the fact that Nn = f (X) where
‖Di f‖∞ ≤ k, as each Xi appears in at most k substrings of length k. By the
method of bounded differences, for all b > 0,

P[ |Nn − (n − k + 1)(1/s)^k| ≥ bk√n ] ≤ 2e^{−2b²}.
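As a quick check of the mean (ours; the alphabet {A, C, G, T} and the pattern a = ACG are illustrative choices, not from the text):

```python
import random

rng = random.Random(6)
alphabet = "ACGT"                   # s = 4
a, n, runs = "ACG", 500, 2000
k = len(a)

counts = []
for _ in range(runs):
    x = "".join(rng.choice(alphabet) for _ in range(n))
    counts.append(sum(1 for i in range(n - k + 1) if x[i:i + k] == a))

emp_mean = sum(counts) / runs
exact_mean = (n - k + 1) * (1 / len(alphabet)) ** k   # = 498/64
print(emp_mean, exact_mean)
```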

The last two examples are perhaps not surprising in that they involve “sums of
weakly independent” indicator variables. One might reasonably expect a sub-
Gaussian-type inequality in that case. The next application is more striking and
hints at connections to isoperimetric considerations (which we will not explore
here).

Example 3.2.12 (Concentration of measure on the hypercube). For A ⊆ {0, 1}^n a
subset of the hypercube and r > 0, we let

Ar = { x ∈ {0, 1}^n : inf_{a∈A} ‖x − a‖1 ≤ r },

be the points at ℓ1 distance at most r from A. Fix ε ∈ (0, 1/2) and assume that
|A| ≥ ε2^n. Let λε be such that e^{−2λε²} = ε. The following application of the
method of bounded differences indicates that much of the uniform measure on the
high-dimensional hypercube lies in a close neighborhood of any such set A. This
is an example of the concentration of measure phenomenon.

Claim 3.2.13.
r > 2λε √n  =⇒  |Ar| ≥ (1 − ε)2^n.

Proof. Let X = (X1 , . . . , Xn) be uniformly distributed in {0, 1}^n. Note that the
coordinates are in fact independent. The function

f(x) = inf_{a∈A} ‖x − a‖1 ,

has ‖Di f‖∞ ≤ 1. Indeed changing one coordinate of x can increase the ℓ1 distance
to the closest point to x by at most 1; in the other direction, if a one-coordinate
change were to decrease f by more than 1, reversing it would produce an increase
of that same amount—a contradiction. Hence McDiarmid’s inequality gives

P[Ef(X) − f(X) ≥ β] ≤ exp( −2β²/n ).

Choosing β = Ef (X) and noting that f (x) ≤ 0 if and only if x ∈ A gives

P[A] ≤ exp( −2(Ef(X))²/n ),

or, rearranging and using our assumption on A,


Ef(X) ≤ √( (n/2) log(1/P[A]) ) ≤ √( (n/2) log(1/ε) ) = λε √n.

By a second application of the method of bounded differences with β = λε √n,

P[ f(X) ≥ 2λε √n ] ≤ P[ f(X) − Ef(X) ≥ β ] ≤ exp( −2β²/n ) = ε.

The result follows by observing that, with r > 2λε √n,

|Ar| / 2^n ≥ P[ f(X) < 2λε √n ] ≥ 1 − ε.
2


Claim 3.2.13 is striking for two reasons: 1) the radius 2λε √n is much smaller
than n, the diameter of {0, 1}^n; and 2) it applies to any A (such that |A| ≥ ε2^n).
The smallest r such that |Ar| ≥ (1 − ε)2^n in general depends on A. Here are two
extremes.
For γ > 0, let

B(γ) := { x ∈ {0, 1}^n : ‖x‖1 ≤ n/2 − γ √(n/4) }.

Note that, letting Yn ∼ Bin(n, 1/2),

|B(γ)| / 2^n = 2^{−n} Σ_{ℓ=0}^{n/2 − γ√(n/4)} (n choose ℓ) = P[ Yn ≤ n/2 − γ √(n/4) ].    (3.2.8)

By the Berry-Esséen theorem (e.g., [Dur10, Theorem 3.4.9]), there is a C > 0 such
that, after rearranging the final quantity in (3.2.8),

| P[ (Yn − n/2)/√(n/4) ≤ −γ ] − P[Z ≤ −γ] | ≤ C/√n,

where Z ∼ N(0, 1). Let ε < ε′ < 1/2 and let γε′ be such that P[Z ≤ −γε′] = ε′.
Then setting A := B(γε′), for n large enough, we have |A| ≥ ε2^n by (3.2.8). On
the other hand, setting r := γε′ √(n/4), we have Ar ⊆ B(0), so that |Ar| ≤ (1/2) 2^n <
(1 − ε)2^n. We have shown that r = Ω(√n) is in general required for Claim 3.2.13
to hold.
For an example at the other extreme, assume for simplicity that N := ε2^n is
an integer. Let A ⊆ {0, 1}^n be constructed as follows: starting from the empty set,
add points in {0, 1}^n to A independently, uniformly at random until |A| = N. Set
r := 2. Each point selected in A has (n choose 2) points within ℓ1 distance 2. By a union
bound, the probability that Ar does not cover all of {0, 1}^n is at most

P[ |{0, 1}^n \ Ar| > 0 ] ≤ Σ_{x∈{0,1}^n} P[x ∉ Ar] ≤ 2^n ( 1 − (n choose 2)/2^n )^{ε2^n} ≤ 2^n e^{−ε (n choose 2)},

where, in the second inequality, we considered only the first N picks in the con-
struction of A (possibly with repeats), and in the third inequality we used 1 − z ≤
e^{−z} for all z ∈ R (see Exercise 1.16). In particular, since the bound tends to 0
as n → +∞, for all n large enough,

P[ |{0, 1}^n \ Ar| > 0 ] < 1.

So for n large enough there is a set A such that Ar = {0, 1}^n where r = 2. J
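For Hamming balls, Claim 3.2.13 can be checked exactly with binomial coefficients, without any simulation (an illustration of ours; the claim of course covers arbitrary A):

```python
import math

# Take A = {x : ||x||_1 <= s} with s smallest such that |A| >= eps * 2^n; then
# A_r is the ball of radius s + r, and we measure it for r just above
# 2 * lambda_eps * sqrt(n).
n, eps = 40, 0.1
lam = math.sqrt(0.5 * math.log(1 / eps))           # solves e^{-2 lam^2} = eps
cum = [0] * (n + 2)
for k in range(n + 1):
    cum[k + 1] = cum[k] + math.comb(n, k)          # cum[k+1] = #{x : ||x||_1 <= k}

s = next(k for k in range(n + 1) if cum[k + 1] >= eps * 2 ** n)
r = int(2 * lam * math.sqrt(n)) + 1                # r > 2 lam sqrt(n)
frac_A = cum[s + 1] / 2 ** n
frac_Ar = cum[min(s + r, n) + 1] / 2 ** n
print(frac_A, frac_Ar)     # frac_A >= eps and frac_Ar >= 1 - eps
```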

Remark 3.2.14. In fact, it can be shown that sets of the form {x : kxk1 ≤ s} have
the smallest “expansion” among subsets of {0, 1}n of the same size, a result known as
Harper’s vertex isoperimetric theorem. See, for example, [BLM13, Theorem 7.6 and Exer-
cises 7.11-7.13].

3.2.3 . Random graphs: exposure martingale and application to the
chromatic number in Erdős-Rényi model
Exposure martingales In the context of the Erdős-Rényi graph model (Defi-
nition 1.2.2), a common way to apply the Azuma-Hoeffding inequality (Theo-
rem 3.2.1) is to introduce an “exposure martingale.” Let G ∼ Gn,p and let F
be any function on graphs such that En,p |F (G)| < +∞ for all n, p. Choose an
arbitrary ordering of the vertices and, for i = 1, . . . , n, denote by Hi the sub-
graph of G induced by the first i vertices. Then the filtration Hi = σ(H1 , . . . , Hi ),
i = 1, . . . , n, corresponds to adding the vertices of G one at a time (together with
their edges to the previous vertices). The Doob martingale

Zi = En,p [F (G) | Hi ], i = 1, . . . , n,

is known as a vertex exposure martingale. An alternative way to define the filtration
is to consider instead the random variables Xi = (1_{{i,j}∈G} : 1 ≤ j < i) for
i = 2, . . . , n. In words, Xi is a vector whose entries indicate the status (present or
absent) of all potential edges incident with i and a vertex preceding it. Hence, Hi =
σ(X2 , . . . , Xi ) for i = 1, . . . , n (and H1 is trivial as it corresponds to a graph with
a single vertex and no edge). This representation has an important property: the
Xi s are independent as they pertain to disjoint subsets of edges. We are then in the
setting of the method of bounded differences. Re-writing F (G) = f (X1 , . . . , Xn ),
the vertex exposure martingale coincides with the martingale (3.2.3) used in that
context.
As an example, consider the chromatic number χ(G), that is, the smallest num-
ber of colors needed in a proper coloring of G. Define fχ (X1 , . . . , Xn ) := χ(G).
We use the following combinatorial observation to bound kDi fχ k∞ .

Lemma 3.2.15. Altering the status (absent or present) of edges incident to a fixed
vertex v changes the chromatic number by at most 1.

Proof. Altering the status of edges incident to v increases the chromatic number
by at most 1, since in the worst case one can simply use an extra color for v. On
the other hand, if the chromatic number were to decrease by more than 1 after al-
tering the status of edges incident to v, reversing the change and using the previous
observation would produce a contradiction.
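Lemma 3.2.15 can be verified by brute force on small graphs (an illustration of ours; chromatic_number below is a naive exponential search, adequate only for tiny n):

```python
import itertools, random

def chromatic_number(n, edges):
    """Smallest k such that the graph on {0, ..., n-1} has a proper k-coloring
    (naive search over all colorings; fine for tiny n)."""
    for k in range(1, n + 1):
        for col in itertools.product(range(k), repeat=n):
            if all(col[u] != col[v] for u, v in edges):
                return k
    return n

rng = random.Random(7)
n, v = 6, 0
max_diff = 0
for _ in range(40):
    edges = {(i, j) for i in range(n) for j in range(i + 1, n)
             if rng.random() < 0.5}
    chi = chromatic_number(n, edges)
    # rerandomize the status of every edge incident to vertex v
    kept = {e for e in edges if v not in e}
    at_v = {(v, j) for j in range(1, n) if rng.random() < 0.5}
    chi_new = chromatic_number(n, kept | at_v)
    max_diff = max(max_diff, abs(chi - chi_new))

print(max_diff)   # by Lemma 3.2.15, never exceeds 1
```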

A fortiori, since Xi depends on a subset of the edges incident with vertex i, Lemma 3.2.15
implies that ‖Di fχ‖∞ ≤ 1. Hence, for all 0 < p < 1 and n, by an immediate ap-
plication of McDiarmid’s inequality (Theorem 3.2.9):
Claim 3.2.16.
Pn,p [ |χ(G) − En,p [χ(G)]| ≥ b √(n − 1) ] ≤ 2e^{−2b²}.

Edge exposure martingales can be defined in a manner similar to the vertex
case: reveal the edges one at a time in an arbitrary order. By Lemma 3.2.15, the
corresponding function also satisfies the same ℓ∞ bound. Observe however that,
for the chromatic number, edge exposure results in a much weaker bound as the
Θ(n²) random variables produce only a linear in n deviation for the same tail
probability. (The reader may want to ponder the apparent paradox: using a larger
number of independent variables seemingly leads to weaker concentration in this
case.)
Remark 3.2.17. Note that Claim 3.2.16 tells us nothing about the expectation of χ(G). It
turns out that, up to logarithmic factors, En,pn [χ(G)] is of order npn when pn ∼ n^{−α} for
some 0 < α < 1. We will not prove this result here. See the “Bibliographic remarks” at
the end of this chapter for more on the chromatic number of Erdős-Rényi graphs.

χ(G) is concentrated on few values Much stronger concentration results can


be obtained: when pn = n^{−α} with α > 1/2, the chromatic number χ(G) is in
fact concentrated on two values! We give a partial result along those lines which
illustrates a less straightforward choice of martingale in the Azuma-Hoeffding in-
equality (Theorem 3.2.1).
Claim 3.2.18. Let pn = n^{−α} with α > 5/6 and let Gn ∼ Gn,pn . Then for any ε > 0
there is ϕn := ϕn (α, ε) such that
Pn,pn [ ϕn ≤ χ(Gn ) ≤ ϕn + 3 ] ≥ 1 − ε,
for all n large enough.
Proof. We consider the following martingale. Let ϕn be the smallest integer such
that
Pn,pn [χ(Gn) ≤ ϕn] > ε/3.    (3.2.9)
Let Fn (Gn ) be the minimal size of a set of vertices, U , in Gn such that Gn \U is
ϕn -colorable. Let (Zi ) be the vertex exposure martingale associated to the quantity

Fn (Gn ). The proof proceeds in two steps: we show that 1) all but O(√n) vertices
can be ϕn -colored and 2) the remaining vertices can be colored using 3 additional
colors. See Figure 3.1 for an illustration of the proof strategy.
We claim that (Zi ) has increments bounded by 1.


Figure 3.1: All but O(√n) vertices are colored using ϕn colors. The remaining
vertices are colored using 3 additional colors.

Lemma 3.2.19. Changing the edges incident to a single vertex can change Fn by
at most 1.

Proof. Changing the edges incident to v can increase Fn by at most 1. Indeed, if


Fn increases after such a change, it must be that v ∉ U since in the other case the
edges incident with v would not affect the colorability of Gn \ U —present or not.
So we can add v to U and restore colorability. On the other hand, if Fn were to
decrease by more than 1, reversing the change and using the previous observation
would give a contradiction.
Choose bε such that e^{−bε²/2} = ε/3. Then, applying the Azuma-Hoeffding in-
equality to (−Zi),

Pn,pn [ Fn(Gn) − En,pn [Fn(Gn)] ≤ −bε √(n − 1) ] ≤ ε/3,

which, since Pn,pn [Fn(Gn) = 0] = Pn,pn [χ(Gn) ≤ ϕn] > ε/3, implies that

En,pn [Fn(Gn)] ≤ bε √(n − 1).

Applying the Azuma-Hoeffding inequality to (Zi) gives

Pn,pn [ Fn(Gn) ≥ 2bε √(n − 1) ]
    ≤ Pn,pn [ Fn(Gn) − En,pn [Fn(Gn)] ≥ bε √(n − 1) ]
    ≤ ε/3.    (3.2.10)

So with probability at least 1 − ε/3, we can color all vertices but 2bε √(n − 1) using
ϕn colors. Let U be the remaining uncolored vertices.
We claim that, with high probability, we can color the vertices in U using at
most 3 extra colors.
Lemma 3.2.20. Fix c > 0, α > 5/6 and ε > 0. Let Gn ∼ Gn,pn with pn = n^{−α}.
For all n large enough,

Pn,pn [ every subset of c√n vertices of Gn can be 3-colored ] > 1 − ε/3.    (3.2.11)
Proof. We use the first moment method (Theorem 2.2.6). We refer to a subset of
vertices that is not 3-colorable but such that all of its proper subsets are as minimal,
non 3-colorable. Let Yn be the number of such subsets of size at most c√n in Gn .
Any vertex of a minimal, non 3-colorable subset W must have degree at least 3
within W. Indeed suppose that w ∈ W has degree less than 3. Then W \{w} is 3-colorable by
definition. But, since w has fewer than 3 neighbors, it can also be properly colored

without adding a new color—a contradiction. In particular, the subgraph of Gn


induced by W must have at least (3/2)|W| edges. Hence, the probability that a subset
of vertices of Gn of size ℓ is minimal, non 3-colorable is at most

( (ℓ choose 2) choose 3ℓ/2 ) pn^{3ℓ/2},

by a union bound over all subsets of edges of size 3ℓ/2.
By the first moment method, the binomial bounds (n choose ℓ) ≤ (en/ℓ)^ℓ (see Ap-
pendix A) and (ℓ choose 2) ≤ ℓ²/2, for some c′ ∈ (0, +∞),

Pn,pn [Yn > 0] ≤ En,pn Yn
    ≤ Σ_{ℓ=4}^{c√n} (n choose ℓ) ( (ℓ choose 2) choose 3ℓ/2 ) pn^{3ℓ/2}
    ≤ Σ_{ℓ=4}^{c√n} (en/ℓ)^ℓ (eℓ/3)^{3ℓ/2} n^{−3ℓα/2}
    ≤ Σ_{ℓ=4}^{c√n} ( e^{5/2} n^{1−3α/2} ℓ^{1/2} / 3^{3/2} )^ℓ
    ≤ Σ_{ℓ=4}^{c√n} ( c′ n^{5/4−3α/2} )^ℓ
    ≤ O( n^{4(5/4−3α/2)} )
    → 0,

as n → +∞, where we used that 5/4 − 3α/2 < 5/4 − 5/4 = 0 when α > 5/6 so that
the geometric series is dominated by its first term. Therefore for n large enough
Pn,pn [Yn > 0] ≤ ε/3, concluding the proof.
By the choice of ϕn in (3.2.9),
Pn,pn [χ(Gn) < ϕn] ≤ ε/3.

By (3.2.10) and (3.2.11) with c = 2bε ,

Pn,pn [χ(Gn) > ϕn + 3] ≤ 2ε/3.
So, overall,
Pn,pn [ϕn ≤ χ(Gn ) ≤ ϕn + 3] ≥ 1 − ε.
That concludes the proof.

3.2.4 . Random graphs: degree sequence of preferential attachment
graphs
Let (Gt )t≥1 ∼ PA1 be a preferential attachment graph (Definition 1.2.3). A key
feature of such graphs is a power-law degree sequence: the fraction of vertices with
degree d behaves like d^{−α} for some α > 0, that is, it has a fat tail. Recall that we
restrict ourselves to the tree case. In contrast, we will show in Section 4.1.4 that
a (sparse) Erdős-Rényi random graph has an asymptotically Poisson-distributed
degree sequence, and therefore a much thinner tail.

Power law degree sequence Let Di (t) be the degree of the i-th vertex in Gt ,
denoted vi , and let
Nd(t) := Σ_{i=0}^t 1_{Di(t)=d},

be the number of vertices of degree d in Gt . By construction N0 (t) = 0 for all t.


Define the sequence
fd := 4 / (d(d + 1)(d + 2)),    d ≥ 1.    (3.2.12)

Our main claim is:

Claim 3.2.21.
(1/t) Nd(t) →p fd ,    ∀d ≥ 1.
Proof. The claim is immediately implied by the following lemmas.
Lemma 3.2.22 (Convergence of the mean).
(1/t) ENd(t) → fd ,    ∀d ≥ 1.
Lemma 3.2.23 (Concentration around the mean). For any δ > 0,

P[ |(1/t) Nd(t) − (1/t) ENd(t)| ≥ √(2 log δ^{−1} / t) ] ≤ 2δ,    ∀d ≥ 1, ∀t.

An alternative representation of the process We start with the proof of Lem-


ma 3.2.23, which is an application of the method of bounded differences.

Proof of Lemma 3.2.23. In our description of the preferential attachment process,


the random choices made at each time depend in a seemingly complicated way
on previous choices. In order to establish concentration of the process around its
mean, we introduce a clever, alternative construction which has the advantage that
it involves independent choices.
We start with a single vertex v0 . At time 1, we add a single vertex v1 and an
edge e1 connecting v0 and v1 . For bookkeeping, we orient edges away from the
vertex of higher time index (but we ignore the orientations in the output). For a
directed edge (i, j), we refer to i as its tail and j as its head. For all s ≥ 2, let
Xs be an independent, uniformly chosen edge extremity among the edges in Gs−1 ,
that is, pick a uniform element in

Xs := {(1, tail), (1, head), . . . , (s − 1, tail), (s − 1, head)}.

To form Gs , attach a new edge es to the vertex of Gs−1 corresponding to Xs . A


vertex of degree d′ in Gs−1 is selected with probability d′/(2(s − 1)), as it should. Note
that Xs can be picked in advance independently of the sequence (Gs′)s′<s . For
instance, if x2 = (1, head), x3 = (2, tail) and x4 = (3, head), the graph obtained
at time 4 is depicted in Figure 3.2.
We claim that Nd (t) =: h(X2 , . . . , Xt ) as a function of X2 , . . . , Xt satisfies
‖Di h‖∞ ≤ 2. Indeed let (x2 , . . . , xt ) be a realization of (X2 , . . . , Xt ) and let
y ∈ Xs with y ≠ xs . Replacing xs = (i, end) with y = (j, end′) where i, j ∈
{1, . . . , s − 1} and end, end′ ∈ {tail, head} has the effect of redirecting the head of
edge es from the end of ei to the end′ of ej . This redirection also brings along with
it the heads of all other edges associated with the choice (s, head). But, crucially,
those changes only affect the degrees of the vertices (i, end) and (j, end0 ) in the
original graph. Hence the number of vertices with degree d changes by at most
2, as claimed. For instance, returning to the example of Figure 3.2, if we replace
x3 = (2, tail) with y = (1, tail), one obtains the graph in Figure 3.3. Note that
only the degrees of vertices v1 and v2 are affected by this change.
By McDiarmid’s inequality (Theorem 3.2.9), for all β > 0,

    P[|N_d(t) − E N_d(t)| ≥ β] ≤ 2 exp(−2β²/(2²(t − 1))),

which, choosing β = √(2t log δ⁻¹), we can rewrite as

    P[ |(1/t) N_d(t) − (1/t) E N_d(t)| ≥ √(2 log δ⁻¹/t) ] ≤ 2δ.

That concludes the proof of the lemma.
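As a sanity check, the construction above is easy to simulate. The following sketch (our own code, not from the text) grows the tree via the edge-extremity choices X_s and compares the empirical degree fractions N_d(t)/t with the limits f_d = 4/(d(d + 1)(d + 2)) established below.

```python
import random
from collections import Counter

def pref_attach_degrees(t, seed=0):
    """Grow the preferential attachment tree for t steps via the
    edge-extremity trick: the new edge e_s attaches to a uniformly
    chosen extremity (tail or head) of an existing edge, so a vertex
    of degree d' is selected with probability d'/(2(s-1))."""
    rng = random.Random(seed)
    edges = [(1, 0)]                 # edge e_1 joins v_1 (tail) to v_0 (head)
    deg = Counter({0: 1, 1: 1})
    for s in range(2, t + 1):
        e = rng.randrange(s - 1)     # uniform existing edge...
        end = rng.randrange(2)       # ...and a uniform extremity of it
        target = edges[e][end]
        edges.append((s, target))    # new vertex v_s, oriented away from it
        deg[s] += 1
        deg[target] += 1
    return deg

t = 50_000
counts = Counter(pref_attach_degrees(t).values())
for d in (1, 2, 3):
    fd = 4 / (d * (d + 1) * (d + 2))
    print(d, round(counts[d] / t, 4), round(fd, 4))
```

The O(√t) fluctuations guaranteed by the concentration argument are visible: the empirical fractions sit within a few hundredths of f_d already at moderate t.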


CHAPTER 3. MARTINGALES AND POTENTIALS 165

Figure 3.2: Graph obtained when x2 = (1, head), x3 = (2, tail) and x4 =
(3, head).

Figure 3.3: Substituting x3 = (2, tail) with y = (1, tail) in the example of
Figure 3.2 has the effect of replacing the dashed edges with the dotted edges. Note
that only the degrees of vertices v1 and v2 are affected by this change.

Dynamics of the mean Once again the method of bounded differences tells us
nothing about the mean, which must be analyzed by other means. The proof of
Lemma 3.2.22 does not rely on the Azuma-Hoeffding inequality but is given for
completeness (and may be skipped).
Proof of Lemma 3.2.22. The idea of the proof is to derive a recursion for fd by
considering the evolution of ENd (t) and taking a limit as t → +∞. Let d ≥ 1.
Observe that ENd (t) = 0 for t ≤ d − 1 since we need at least d edges to have
a degree-d vertex. Moreover, by the description of the preferential attachment
process, the following recursion holds for t ≥ d − 1:

    E N_d(t + 1) − E N_d(t) = ((d − 1)/(2t)) E N_{d−1}(t) − (d/(2t)) E N_d(t) + 1{d=1},   (3.2.13)

where the three terms on the right-hand side are labeled (a), (b), and (c), respectively.

Indeed: (a) N_d(t) increases by 1 if a vertex of degree d − 1 is picked, an event of
probability ((d − 1)/(2t)) N_{d−1}(t) because the sum of degrees at time t is twice the number
of edges (i.e., t); (b) N_d(t) decreases by 1 if a vertex of degree d is picked, an
event of probability (d/(2t)) N_d(t); and (c) the last term comes from the fact that the new
vertex always has degree 1. We rewrite (3.2.13) as
    E N_d(t + 1) = E N_d(t) + ((d − 1)/(2t)) E N_{d−1}(t) − (d/(2t)) E N_d(t) + 1{d=1}
                 = (1 − (d/2)/t) E N_d(t) + { ((d − 1)/2) (1/t) E N_{d−1}(t) + 1{d=1} }
                 =: (1 − (d/2)/t) E N_d(t) + g_d(t),   (3.2.14)

where g_d(t) is defined as the expression in curly brackets on the second line. We
will not solve this recursion explicitly. Instead we seek to analyze its asymptotics,
specifically we show that 1t ENd (t) → fd .
The key is to notice that the expression for ENd (t+1) depends on 1t ENd−1 (t)—
so we work by induction on d. Because of the form of the recursion, the following
technical lemma is what we need to proceed.
Lemma 3.2.24. Let f, g be nonnegative functions of t ∈ ℕ satisfying the recursion

    f(t + 1) = (1 − α/t) f(t) + g(t),   ∀t ≥ t₀,

with g(t) → g ∈ [0, +∞) as t → +∞, and where α > 0, t₀ ≥ 2α, and f(t₀) ≥ 0 are
constants. Then

    (1/t) f(t) → g/(1 + α),

as t → +∞.

The proof of this lemma is given after the proof of Claim 3.2.21. We first
conclude the proof of Lemma 3.2.22. Let d = 1. In that case, g₁(t) = g₁ := 1,
α := 1/2, and t₀ := 1. By Lemma 3.2.24,

    (1/t) E N₁(t) → 1/(1 + 1/2) = 2/3 = f₁.

Assuming by induction that (1/t) E N_{d−1}(t) → f_{d−1}, we get

    g_d(t) → g_d := ((d − 1)/2) f_{d−1},

as t → +∞. Using Lemma 3.2.24 with α := d/2 and t₀ := d − 1, we obtain

    (1/t) E N_d(t) → (1/(1 + d/2)) · ((d − 1)/2) f_{d−1} = ((d − 1)/(d + 2)) · 4/((d − 1)d(d + 1)) = f_d.
That concludes the proof of Lemma 3.2.22.
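The convergence asserted by Lemma 3.2.24 can also be checked numerically. A minimal sketch (ours, not from the text), applied to the d = 1 case where α = 1/2 and g(t) ≡ 1:

```python
def limit_ratio(alpha, g_seq, t0, f0, T):
    """Iterate f(t+1) = (1 - alpha/t) f(t) + g(t) from f(t0) = f0 and
    return f(T)/T, which Lemma 3.2.24 predicts tends to g/(1+alpha)."""
    f = f0
    for t in range(t0, T):
        f = (1 - alpha / t) * f + g_seq(t)
    return f / T

# d = 1 case: alpha = 1/2, g(t) = 1; predicted limit 1/(1 + 1/2) = 2/3 = f_1
ratio = limit_ratio(0.5, lambda t: 1.0, 1, 0.0, 10**5)
print(ratio)
```

Swapping in any other convergent sequence g(t) lets one probe the general statement of the lemma.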

To prove Claim 3.2.21, we combine Lemmas 3.2.22 and 3.2.23. Fix any d, δ, ε > 0.
Choose t₀ large enough that, for all t ≥ t₀,

    max{ |(1/t) E N_d(t) − f_d|, √(2 log δ⁻¹/t) } ≤ ε.

Then

    P[ |(1/t) N_d(t) − f_d| ≥ 2ε ] ≤ 2δ,

for all t ≥ t₀. That proves convergence in probability.

Proof of the technical lemma It remains to prove Lemma 3.2.24.

Proof of Lemma 3.2.24. By induction on t, we have

    f(t + 1) = (1 − α/t) f(t) + g(t)
             = (1 − α/t)[(1 − α/(t − 1)) f(t − 1) + g(t − 1)] + g(t)
             = [(1 − α/t) g(t − 1) + g(t)] + (1 − α/t)(1 − α/(t − 1)) f(t − 1)
             = · · ·
             = Σ_{i=0}^{t−t₀} g(t − i) Π_{j=0}^{i−1} (1 − α/(t − j)) + f(t₀) Π_{j=0}^{t−t₀} (1 − α/(t − j)),

or

    f(t + 1) = Σ_{s=t₀}^{t} g(s) Π_{r=s+1}^{t} (1 − α/r) + f(t₀) Π_{r=t₀}^{t} (1 − α/r),   (3.2.15)

where empty products are equal to 1. To guess the limit, note that, for large s, g(s)
is roughly constant and that the product in the first term behaves like

    exp(− Σ_{r=s+1}^{t} α/r) ≈ exp(−α(log t − log s)) = s^α/t^α.

So, approximating the sum by an integral, we get that f(t + 1) ≈ g t/(α + 1), which is
indeed consistent with the claim.
Formally, we use that there is a constant γ = 0.577 . . . such that (see e.g. [LL10,
Lemma 12.1.3])

    Σ_{ℓ=1}^{m} 1/ℓ = log m + γ + Θ(m⁻¹),

and that, by a Taylor expansion, for |z| ≤ 1/2,

    log(1 − z) = −z + Θ(z²).

Fix η > 0 small and take t large enough that ηt > 2α and |g(s) − g| < η for all
s ≥ ηt. Then, for s + 1 ≥ t₀,

    Σ_{r=s+1}^{t} log(1 − α/r) = Σ_{r=s+1}^{t} { −α/r + Θ(r⁻²) }
                               = −α(log t − log s) + Θ(s⁻¹),

so, taking exponentials,

    Π_{r=s+1}^{t} (1 − α/r) = (s^α/t^α)(1 + Θ(s⁻¹)).

Hence

    (1/t) f(t₀) Π_{r=t₀}^{t} (1 − α/r) = (t₀^α/t^{α+1})(1 + Θ(t₀⁻¹)) → 0,

as t → +∞. Moreover
    (1/t) Σ_{s=ηt}^{t} g(s) Π_{r=s+1}^{t} (1 − α/r) ≤ (g + η) (1/t) Σ_{s=ηt}^{t} (s^α/t^α)(1 + Θ(s⁻¹))
        ≤ O(η) + (g/t^{α+1})(1 + Θ(t⁻¹)) Σ_{s=ηt}^{t} s^α
        ≤ O(η) + (g/t^{α+1})(1 + Θ(t⁻¹)) (t + 1)^{α+1}/(α + 1)
        → O(η) + g/(α + 1),

where we bounded the sum Σ_{s=ηt}^{t} s^α by an integral. Similarly,
    (1/t) Σ_{s=t₀}^{ηt−1} g(s) Π_{r=s+1}^{t} (1 − α/r) ≤ (g + η) (1/t) Σ_{s=t₀}^{ηt−1} (s^α/t^α)(1 + Θ(s⁻¹))
        ≤ (g + η) (ηt/t) ((ηt)^α/t^α)(1 + Θ(t₀⁻¹))
        → O(η^{α+1}).

Plugging these inequalities back into (3.2.15), we get

    lim sup_{t} (1/t) f(t + 1) ≤ g/(1 + α) + O(η).

A similar inequality holds in the other direction. Letting η → 0 concludes the
proof.
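The unrolled identity (3.2.15) is also easy to verify numerically against direct iteration; the following sketch (with arbitrary illustrative values of α, t₀, f(t₀) and g, chosen by us) does exactly that.

```python
import math

# Sanity check (ours, not from the text) that the unrolled form (3.2.15)
# agrees with direct iteration of f(t+1) = (1 - alpha/t) f(t) + g(t).
alpha, t0, f0 = 0.7, 2, 1.3
g = lambda s: 1.0 + 1.0 / s

def iterate(T):
    f = f0
    for t in range(t0, T + 1):
        f = (1 - alpha / t) * f + g(t)
    return f                          # this is f(T+1)

def formula(T):
    prod = lambda a, b: math.prod(1 - alpha / r for r in range(a, b + 1))
    return (sum(g(s) * prod(s + 1, T) for s in range(t0, T + 1))
            + f0 * prod(t0, T))       # empty products equal 1

print(abs(iterate(50) - formula(50)))
```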
Remark 3.2.25. A more quantitative result (uniform in t and d) can be derived. See, for
example, [vdH17, Sections 8.5, 8.6]. See the same reference for a generalization beyond
trees.

3.2.5 Data science: stochastic bandits and the slicing method


In this section, we consider an application of the maximal Azuma-Hoeffding in-
equality (Theorem 3.2.1) to (multi-armed) bandit problems. These are meant as a
simple model of sequential decision making with limited information where a fun-
damental issue is trading off between exploitation of actions that have done well in
the past and exploration of actions that might perform better in the future. A typ-
ical application is online advertising, where one must decide which advertisement
to display to the next visitor to a website.

In the simplest version of the (two-arm) stochastic bandit problem, there are
two unknown reward distributions ν1 , ν2 over [0, 1] with respective means µ1 6= µ2 .
At each time t = 1, . . . , n, we request an independent sample from νIt , where
we are free to choose It ∈ {1, 2} based on past choices and observed rewards
{(Is , Zs )}s<t . This will be referred to as pulling arm It . We then observe the arm
reward Zt ∼ νIt . Letting µ∗ := µ1 ∨ µ2 , our goal is to minimize
    R_n = n µ* − E[ Σ_{t=1}^{n} µ_{I_t} ],   (3.2.16)

which is known as the pseudo-regret. That is, we seek to make choices (I_t)_{t=1}^{n}
that minimize the difference between the best achievable cumulative mean reward
and the expected cumulative mean reward from our decisions. Note that the expec-
tation in (3.2.16) is taken over the choices (I_t)_{t=1}^{n}, which themselves depend on
the random rewards (Z_t)_{t=1}^{n}. As indicated above, because ν₁ and ν₂ are unknown,
there is a fundamental friction between exploiting the arm that has done best in the
past and exploring further the other arm, which might perform better in the future.
One general approach that has proved effective in this type of problem is known
as optimism in the face of uncertainty. Roughly speaking, we construct a set of
plausible environments (in our case, the means of the reward distributions) that are
consistent with observed data; then we make an optimal decision assuming that the
true environment is the most favorable among them. A concrete implementation
of this principle is the Upper Confidence Bound (UCB) algorithm, which we now
describe. In words, we use a concentration inequality to build a confidence interval
for each reward mean, and then we pick the arm with highest upper bound.

UCB algorithm
To state the algorithm formally, we will need some notation. For i = 1, 2, let Ti (t)
be the number of times arm i is pulled up to time t,

    T_i(t) = Σ_{s≤t} 1{I_s = i},

and let X_{i,s}, s = 1, . . . , n, be i.i.d. samples from ν_i. Assume that the reward at
time t is

    Z_t = X_{1,T₁(t−1)+1} if I_t = 1, and Z_t = X_{2,T₂(t−1)+1} otherwise.

In other words, Xi,s is the s-th observed reward from arm i. Let µ̂i,s be the sample
mean of the observed rewards after pulling arm i s times,

    µ̂_{i,s} = (1/s) Σ_{r≤s} X_{i,r}.

Since the X_{i,s}’s are independent and [0, 1]-valued by assumption, by Hoeffd-
ing’s inequality (Theorem 2.4.10), for any β > 0,

    P[µ̂_{i,s} − µ_i ≥ β] ∨ P[µ_i − µ̂_{i,s} ≥ β] ≤ exp(−2sβ²).

The right-hand side can be made ≤ δ provided

    β ≥ √(log δ⁻¹/(2s)) =: H(s, δ).
We are now ready to state the α-UCB algorithm, where α > 1 is the exploration
parameter. At each time t, we pick

    I_t ∈ arg max_{i=1,2} { µ̂_{i,T_i(t−1)} + α H(T_i(t−1), 1/t) }.

The argument above implies that the true mean µ_i has probability less than 1/t^{α²}
of being higher than µ̂_{i,T_i(t−1)} + α H(T_i(t−1), 1/t). The algorithm makes an
“optimistic” decision: it chooses the arm with the higher of the two values.
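The steps above can be sketched in code. The following is a minimal simulation of α-UCB on two Bernoulli arms (an illustrative choice of ours; the text allows any reward distributions on [0, 1]), reporting the realized pseudo-regret, which should grow like log n rather than linearly.

```python
import math
import random

def alpha_ucb(mu, n, alpha=2.0, seed=0):
    """Run alpha-UCB on two Bernoulli arms with means mu = (mu1, mu2);
    returns the realized pseudo-regret n*max(mu) - sum_t mu[I_t]."""
    rng = random.Random(seed)
    counts = [0, 0]                       # T_i(t-1)
    sums = [0.0, 0.0]                     # total reward collected per arm
    pulled_mean = 0.0
    for t in range(1, n + 1):
        if 0 in counts:                   # pull each arm once to initialize
            i = counts.index(0)
        else:
            def ucb(j):
                H = math.sqrt(math.log(t) / (2 * counts[j]))
                return sums[j] / counts[j] + alpha * H
            i = max((0, 1), key=ucb)
        reward = 1.0 if rng.random() < mu[i] else 0.0
        counts[i] += 1
        sums[i] += reward
        pulled_mean += mu[i]
    return n * max(mu) - pulled_mean

regret = alpha_ucb((0.6, 0.4), n=20_000)
print(regret)   # far below the worst case 0.2 * 20000 = 4000
```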
The following theorem shows that UCB achieves a pseudo-regret of the order
of O(log n). Define ∆i = µ∗ − µi and ∆∗ = ∆1 ∨ ∆2 .
Theorem 3.2.26 (Pseudo-regret of UCB). In the two-arm stochastic bandit prob-
lem where the rewards are in [0, 1] with distinct means, α-UCB with α > 1
achieves
    R_n ≤ (2α²/∆*) log n + ∆* C_α,
for some constant Cα ∈ (0, +∞) depending only on α.
This bound should not come entirely as a surprise. Indeed a simple, alternative
approach to UCB is to (1) first pull each arm mn = o(n) times and then (2) use
the arm with largest estimated mean for the remainder. Assuming there is a known
lower bound on ∆*, then Hoeffding’s inequality (Theorem 2.4.10) guarantees that
m_n can be chosen of the order of (1/∆*²) log n to identify the largest mean with proba-
bility 1 − 1/n. Because the rewards are bounded by 1, accounting for the contribu-
tion of the first phase and the probability of failure in the second phase, one gets a
pseudo-regret of the order of ∆* · (1/∆*²) log n + (1/n) · ∆* n ≈ (1/∆*) log n. The UCB strategy,
on the other hand, elegantly adapts to the gap ∆* and the horizon n.

Analysis of the UCB algorithm


We break down the proof into a sequence of lemmas. We first rewrite the pseudo-
regret as

    R_n = n µ* − E[ Σ_{t=1}^{n} µ_{I_t} ]
        = E[ Σ_{t=1}^{n} (µ* − µ_{I_t}) ]
        = E[ Σ_{t=1}^{n} Σ_{i=1,2} 1{I_t = i} ∆_i ]
        = Σ_{i=1,2} ∆_i E[T_i(n)].   (3.2.17)

Hence the problem boils down to bounding E[Ti (n)], the expected number of times
that arm i is pulled. Note that Ti (n) is a complicated function of the observations.
To analyze it, we will use the following sufficient condition. Let i∗ be the optimal
arm, that is, the one that achieves µ∗ . Intuitively, if arm i 6= i∗ is pulled, it is
because: either our upper estimate of µi∗ happens to be low or our lower estimate
of µi happens to be high (i.e., our concentration inequality failed); or there is too
much uncertainty in our estimate of µi (i.e., we haven’t pulled arm i enough).
Lemma 3.2.27. Under the α-UCB strategy, if arm i ≠ i* is pulled at time t, then
at least one of the following events holds:

    E_{t,1} = { µ̂_{i*,T_{i*}(t−1)} + α H(T_{i*}(t−1), 1/t) ≤ µ* },   (3.2.18)
    E_{t,2} = { µ̂_{i,T_i(t−1)} − α H(T_i(t−1), 1/t) > µ_i },   (3.2.19)
    E_{t,3} = { α H(T_i(t−1), 1/t) > ∆_i/2 }.   (3.2.20)
Proof. We argue by contradiction. Assume all the events above fail. Then

    µ̂_{i*,T_{i*}(t−1)} + α H(T_{i*}(t−1), 1/t) > µ* = µ_i + ∆_i ≥ µ̂_{i,T_i(t−1)} + α H(T_i(t−1), 1/t).

That implies that arm i would not be chosen.



We first deal with E_{t,3}. Let

    u_n = 2α² log n / ∆*².

Using the condition in Lemma 3.2.27, we get the following bound on E[Ti (n)].

Lemma 3.2.28. Under the α-UCB strategy, for i ≠ i*,

    E[T_i(n)] ≤ u_n + Σ_{t=1}^{n} P[E_{t,1}] + Σ_{t=1}^{n} P[E_{t,2}].

Proof. For i ≠ i*, by definition of T_i(n),

    E[T_i(n)] = E[ Σ_{t=1}^{n} 1{I_t = i} ]
              ≤ E[ Σ_{t=1}^{n} ( 1{ {I_t = i} ∩ E_{t,1} } + 1{ {I_t = i} ∩ E_{t,2} } + 1{ {I_t = i} ∩ E_{t,3} } ) ],

where we used that, by Lemma 3.2.27,

    {I_t = i} ⊆ E_{t,1} ∪ E_{t,2} ∪ E_{t,3}.

The condition in E_{t,3} can be written equivalently as

    α √(log t/(2 T_i(t−1))) > ∆_i/2  ⟺  T_i(t−1) < 2α² log t / ∆_i².

In particular, for all t ≤ n, the event E_{t,3} implies that T_i(t−1) < u_n. As a result,
since T_i(t) = T_i(t−1) + 1 whenever I_t = i, the event {I_t = i} ∩ E_{t,3} can occur
at most u_n times and

    E[T_i(n)] ≤ u_n + E[ Σ_{t=1}^{n} ( 1{ {I_t = i} ∩ E_{t,1} } + 1{ {I_t = i} ∩ E_{t,2} } ) ]
              ≤ u_n + Σ_{t=1}^{n} P[E_{t,1}] + Σ_{t=1}^{n} P[E_{t,2}],

which proves the claim.



It remains to bound P[Et,1 ] and P[Et,2 ] from above. This is not entirely straight-
forward because, while µ̂i,Ti (t−1) involves a sum of independent random variables,
the number of terms Ti (t − 1) is itself a random variable. Moreover Ti (t − 1)
depends on the past rewards Zs , s ≤ t − 1, in a complex way. So in order to
apply a concentration inequality to µ̂i,Ti (t−1) , we use a rather blunt approach: we
bound the worst deviation over all possible (deterministic) values in the support of
Ti (t − 1). That is,

    P[E_{t,2}] = P[ µ̂_{i,T_i(t−1)} − α H(T_i(t−1), 1/t) > µ_i ]
             ≤ P[ ∪_{s≤t−1} { µ̂_{i,s} − α H(s, 1/t) > µ_i } ].   (3.2.21)

We reformulate the previous bound as

    P[ ∪_{s≤t−1} { µ̂_{i,s} − α H(s, 1/t) > µ_i } ]
      = P[ sup_{s≤t−1} ( µ̂_{i,s} − µ_i − α H(s, 1/t) ) > 0 ]
      = P[ sup_{s≤t−1} ( (1/s) Σ_{r≤s} X_{i,r} − µ_i − α √(log t/(2s)) ) > 0 ]
      = P[ sup_{s≤t−1} (1/√s) ( (1/√s) Σ_{r≤s} (X_{i,r} − µ_i) − α √(log t/2) ) > 0 ]
      = P[ sup_{s≤t−1} ( Σ_{r=1}^{s} (X_{i,r} − µ_i) ) / √s > α √(log t/2) ].   (3.2.22)

Observe that the numerator on the left-hand side of the inequality on the last line
is a martingale (see Example 3.1.29) with increments in [−µi , 1 − µi ]. But the
denominator depends on s.
We try two approaches:

- We could simply use that s ≥ 1 in the denominator and apply the maximal
  Azuma-Hoeffding inequality (Theorem 3.2.1) to get

      Σ_{t=1}^{n} P[E_{t,2}] ≤ Σ_{t=1}^{n} P[ sup_{s≤t−1} Σ_{r=1}^{s} (X_{i,r} − µ_i) > α √(log t/2) ]
                            ≤ Σ_{t=1}^{n} exp(−2(α √(log t/2))²/(t − 1))
                            = Σ_{t=1}^{n} exp(−α² log t/(t − 1)).   (3.2.23)

  That is of order Θ(n) for any α.


- On the other hand, we could use a union bound over s and apply the maximal
  Azuma-Hoeffding inequality to each term to get

      Σ_{t=1}^{n} P[E_{t,2}] ≤ Σ_{t=1}^{n} Σ_{s≤t−1} P[ Σ_{r=1}^{s} (X_{i,r} − µ_i) > α √(s log t/2) ]
                            ≤ Σ_{t=1}^{n} Σ_{s≤t−1} exp(−2(α √(s log t/2))²/s)
                            = Σ_{t=1}^{n} (t − 1) exp(−α² log t)
                            ≤ Σ_{t=1}^{n} 1/t^{α²−1}.   (3.2.24)

  The series converges for α > √2. Therefore, in that case, this bound is
  Θ(1), which is much better than our previous attempt. For 1 < α ≤ √2,
  however, we get a bound of order Θ(n^{2−α²}), which is still polynomial in n.
It turns out that doing something “in between” the two approaches above gives
a bound that significantly improves over both of them in the 1 < α ≤ √2 regime.
This is known as the slicing (or peeling) method.

Slicing method
The slicing method is useful when bounding a weighted supremum. Its application
is somewhat problem-specific so we will content ourselves with illustrating it in
our case. Specifically, our goal is to control probabilities of the form

    P[ sup_{s≤t−1} M_s/w(s) ≥ β ],

where M_s := Σ_{r=1}^{s} (X_{i,r} − µ_i), w(s) := √s, and β := α √(log t/2). The idea is to
divide up the supremum into slices γ^{k−1} ≤ s < γ^k, k ≥ 1, where the constant
γ > 1 will be optimized below. That is, fixing K_t = ⌈log t/log γ⌉ (which roughly solves
γ^{K_t} = t), by a union bound over the slices,

    P[ sup_{1≤s<t} M_s/w(s) ≥ β ] ≤ Σ_{k=1}^{K_t} P[ sup_{γ^{k−1}≤s<γ^k} M_s/w(s) ≥ β ].

Because w(s) is increasing, on each slice separately we can bound

    P[ sup_{γ^{k−1}≤s<γ^k} M_s/w(s) ≥ β ] ≤ P[ sup_{γ^{k−1}≤s<γ^k} M_s/w(γ^{k−1}) ≥ β ]
        = P[ sup_{γ^{k−1}≤s<γ^k} M_s ≥ β w(γ^{k−1}) ]
        ≤ P[ sup_{s≤γ^k} M_s ≥ β w(γ^{k−1}) ].

Now we apply the maximal Azuma-Hoeffding inequality (Theorem 3.2.1) to obtain

    P[ sup_{s≤γ^k} M_s ≥ β w(γ^{k−1}) ] ≤ exp(−2(β w(γ^{k−1}))²/γ^k)
        = exp(−2β²/γ)
        = t^{−α²/γ},

where we used that M_s − M_{s−1} = X_{i,s} − µ_i ∈ [−µ_i, 1 − µ_i], an interval of length
1. Plugging this back above we get

    P[ sup_{1≤s<t} M_s/w(s) ≥ β ] ≤ ⌈log t/log γ⌉ t^{−α²/γ}.   (3.2.25)
Now we see the tradeoff: increasing γ makes the slices larger and hence the tail
inequality weaker, but it also makes the number of slices smaller which helps with
the union bound.
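The tradeoff is easy to see numerically. The sketch below (ours, not from the text) evaluates the three sums corresponding to (3.2.23), (3.2.24) and the sliced bound (3.2.25) for an α in the regime 1 < α ≤ √2:

```python
import math

# Numerical comparison of the three tail-bound sums for sum_t P[E_{t,2}]:
# the crude bound (3.2.23), the union bound (3.2.24), and the sliced
# bound (3.2.25).
def bound_naive(n, a):
    return sum(math.exp(-a * a * math.log(t) / (t - 1)) for t in range(2, n + 1))

def bound_union(n, a):
    return sum(t ** (1 - a * a) for t in range(1, n + 1))

def bound_sliced(n, a, gamma):
    return sum(math.ceil(math.log(t) / math.log(gamma)) * t ** (-a * a / gamma)
               for t in range(2, n + 1))

alpha, gamma = 1.2, 1.2     # alpha^2/gamma = 1.2 > 1, so (3.2.25) is summable
n = 10**4
print(bound_naive(n, alpha), bound_union(n, alpha), bound_sliced(n, alpha, gamma))
```

The first sum grows linearly in n, the second polynomially, while the sliced sum stays bounded.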
Combining (3.2.21), (3.2.22), and (3.2.25), we have proved:
Lemma 3.2.29. For any γ > 1, it holds that

    Σ_{t=1}^{n} P[E_{t,2}] ≤ Σ_{t=1}^{n} ⌈log t/log γ⌉ t^{−α²/γ},

and similarly for P[E_{t,1}].



For α > 1, we can choose γ > 1 such that α²/γ > 1. In that case, the series on
the right-hand side is summable. This improves over both (3.2.23) and (3.2.24).
We are ready to prove the main result.

Proof of Theorem 3.2.26. By (3.2.17) and Lemmas 3.2.27, 3.2.28 and 3.2.29, we
have

    R_n = Σ_{i=1,2} ∆_i E[T_i(n)] ≤ ∆* ( u_n + 2 Σ_{t=1}^{n} ⌈log t/log γ⌉ t^{−α²/γ} ).

Recalling that α > 1, choose γ > 1 such that α²/γ > 1. In that case, as noted
above, the series on the right-hand side is summable and there is C_α ∈ (0, +∞)
such that

    R_n ≤ ∆* (u_n + C_α).

That proves the claim.

Remark 3.2.30. A slightly better—and provably optimal—multiplicative constant


in the pseudo-regret bound has been obtained by [GC11] using a variant of UCB
called KL-UCB. The matching lower bound is due to [LR85]. See also [BCB12,
Sections 2.3-2.4]. Further improvements can be obtained by using Bernstein’s
rather than Hoeffding’s inequality [AMS09].

3.2.6 Coda: Talagrand’s inequality


We end this section with a celebrated concentration inequality that applies un-
der weaker conditions than McDiarmid’s inequality (Theorem 3.2.9)—but is not
proved using the martingale method. It is known as Talagrand’s inequality.
Bounds on ‖D_i f‖_∞ are often expressed in terms of a Lipschitz condition under
an appropriate metric. Let 0 < c_i < +∞, i = 1, . . . , n, and c = (c₁, . . . , c_n). The
c-weighted Hamming distance is defined as

    ρ_c(x, y) := Σ_{i=1}^{n} c_i 1{x_i ≠ y_i},

for x = (x₁, . . . , x_n), y = (y₁, . . . , y_n) ∈ X₁ × · · · × X_n. The proof of the
following equivalence is left as an exercise (see Exercise 3.8).
Lemma 3.2.31 (Lipschitz condition). A function f : X₁ × · · · × X_n → ℝ satisfies
the Lipschitz condition

    |f(x) − f(y)| ≤ ρ_c(x, y),   ∀x, y ∈ X₁ × · · · × X_n,   (3.2.26)

if and only if ‖D_i f‖_∞ ≤ c_i for all i.
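The equivalence can be checked by brute force on a small product space; the function f below is an arbitrary illustrative choice of ours, not from the text.

```python
from itertools import product

# Brute-force check of Lemma 3.2.31: f(x) = x_0 + max(x_1, x_2) on {0,1}^3
# with weights c = (1, 1, 1).
X = list(product([0, 1], repeat=3))
c = (1.0, 1.0, 1.0)
f = lambda x: x[0] + max(x[1], x[2])

rho = lambda x, y: sum(ci for ci, xi, yi in zip(c, x, y) if xi != yi)
lipschitz = all(abs(f(x) - f(y)) <= rho(x, y) for x in X for y in X)

# ||D_i f||_inf: worst change of f when coordinate i is flipped
Di = [max(abs(f(x) - f(x[:i] + (1 - x[i],) + x[i+1:])) for x in X)
      for i in range(3)]
print(lipschitz, Di)
```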

Consider the following relaxed version of (3.2.26):

    f(x) − f(y) ≤ Σ_{i=1}^{n} c_i(x) 1{x_i ≠ y_i},   ∀x, y ∈ X₁ × · · · × X_n,   (3.2.27)

where now ci (x) is a finite, positive function over X1 × · · · × Xn . Notice the “one-
sided” nature of this condition, in the sense that ci depends on x but not on y. A
typical example where (3.2.27) is satisfied, but (3.2.26) is not, is given below.
We state Talagrand’s inequality without proof.

Theorem 3.2.32 (Talagrand’s inequality). Let X₁, . . . , X_n be independent random
variables where X_i is X_i-valued for all i, and let X = (X₁, . . . , X_n). Assume
f : X₁ × · · · × X_n → ℝ is a measurable function such that (3.2.27) holds. Then
f(X) is sub-Gaussian with variance factor ‖Σ_{i≤n} c_i²‖_∞. In fact, for all β > 0,
the following upper and lower tail bounds hold:

    P[f(X) − E f(X) ≥ β] ≤ exp( −β² / (2 ‖Σ_{i≤n} c_i²‖_∞) ),

and

    P[f(X) − E f(X) ≤ −β] ≤ exp( −β² / (2 E[Σ_{i≤n} c_i(X)²]) ).

Compared to McDiarmid’s inequality (Theorem 3.2.9), the upper tail in Theo-


rem 3.2.32 has the sum over the coordinates inside the supremum, potentially a
major improvement; the lower tail is even better, replacing the supremum with an
expectation.

Example 3.2.33 (Spectral norm of a random matrix with bounded entries). Let A
be an n × n random matrix. We assume that the entries Ai,j , i, j = 1, . . . , n, are
independent, centered random variables in [−1, 1]. In Theorem 2.4.28, we proved
an upper tail bound on the spectral norm

    ‖A‖₂ = sup_{x ∈ ℝⁿ\{0}} ‖Ax‖₂/‖x‖₂ = sup_{x,y ∈ S^{n−1}} ⟨Ax, y⟩,

of such a matrix (in the more general sub-Gaussian case) using an ε-net argument.
Theorem 2.4.28 also implies that E‖A‖₂ = O(√n) by (B.5.1). (See Exercise 3.9
for a lower bound on the expectation.)

Here we use Talagrand’s inequality (Theorem 3.2.32) directly to show concen-
tration around the mean. For this, we need to check (3.2.27), where we think of the
spectral norm as a function of n² independent random variables:

    ‖A‖₂ = f({A_{i,j}}_{i,j}).

Let x*(A) and y*(A) be unit vectors in ℝⁿ such that

    ‖A‖₂ = ⟨A x*(A), y*(A)⟩,

which exist by compactness.


Given two n × n matrices A, Ã with entries in [−1, 1], we have

    ‖A‖₂ − ‖Ã‖₂ = ⟨A x*(A), y*(A)⟩ − sup_{x,y ∈ S^{n−1}} ⟨Ãx, y⟩
        ≤ ⟨A x*(A), y*(A)⟩ − ⟨Ã x*(A), y*(A)⟩
        = ⟨(A − Ã) x*(A), y*(A)⟩
        ≤ Σ_{i,j} |A_{ij} − Ã_{ij}| |x*(A)_i| |y*(A)_j|
        ≤ Σ_{i,j} 1{A_{ij} ≠ Ã_{ij}} c_{ij}(A),

where on the last line we set

    c_{ij}(A) := 2 |x*(A)_i| |y*(A)_j|,

and used the fact that |A_{ij} − Ã_{ij}| ≤ 2. Note that

    Σ_{i,j} c_{ij}(A)² = 4 Σ_i x*(A)_i² Σ_j y*(A)_j² = 4.

Hence Talagrand’s inequality implies that ‖A‖₂ is sub-Gaussian with variance fac-
tor 4. J
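This concentration is visible in simulation. The sketch below (ours, assuming NumPy is available) samples matrices with independent uniform [−1, 1] entries; the mean of ‖A‖₂ grows with n while its standard deviation stays of constant order, consistent with a variance factor of 4.

```python
import numpy as np

# Empirical illustration: spectral norms of n x n matrices with i.i.d.
# uniform [-1, 1] entries; mean grows like sqrt(n), fluctuations are O(1).
rng = np.random.default_rng(0)
for n in (50, 200):
    norms = [np.linalg.norm(rng.uniform(-1, 1, (n, n)), 2) for _ in range(100)]
    print(n, float(np.mean(norms)), float(np.std(norms)))
```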

3.3 Potential theory and electrical networks


In this section we develop a classical link between random walks and electrical net-
works. The electrical interpretation is a useful physical analogy. The mathematical
substance of the connection starts with the following observation.

Let (Xt ) be a Markov chain with transition matrix P on a finite or countable


state space V . Recall from Definition 3.1.6 that τB is the first visit time to B ⊆ V .
For two disjoint subsets A, Z of V, the probability of hitting A before Z,

    h(x) = P_x[τ_A < τ_Z],   (3.3.1)

seen as a function of the starting point x ∈ V, is harmonic (with respect to P) on
W := (A ∪ Z)^c := V \ (A ∪ Z) in the sense that

    h(x) = Σ_y P(x, y) h(y),   ∀x ∈ W.   (3.3.2)

Indeed, note that h = 1 (respectively = 0) on A (respectively Z) and, by the Markov
property (Theorem 1.1.18), after the first step of the chain, for x ∈ W,

    P_x[τ_A < τ_Z] = Σ_{y ∉ A∪Z} P(x, y) P_y[τ_A < τ_Z] + Σ_{y ∈ A} P(x, y) · 1 + Σ_{y ∈ Z} P(x, y) · 0
                   = Σ_y P(x, y) P_y[τ_A < τ_Z].   (3.3.3)

Quantities such as (3.3.1) arise naturally, for instance in the study of recurrence,
and the connection to potential theory, the study of harmonic functions, proves
fruitful in that context as we outline in this section. It turns out that harmonic
functions and martingales are closely related. In Section 3.3.1 we elaborate on that
connection.
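The boundary-value problem (3.3.2), with h = 1 on A and h = 0 on Z, is just a linear system. As a concrete check (our own sketch, assuming NumPy), for simple random walk on {0, . . . , N} with A = {N} and Z = {0}, solving it recovers the gambler's-ruin probability h(x) = x/N:

```python
import numpy as np

# Solve (3.3.2) for x = 1..N-1: h(x) = (h(x-1) + h(x+1))/2,
# with boundary values h(0) = 0 and h(N) = 1.
N = 10
A = np.zeros((N - 1, N - 1))
b = np.zeros(N - 1)
for i, x in enumerate(range(1, N)):
    A[i, i] = 1.0
    if x - 1 >= 1:
        A[i, i - 1] = -0.5
    if x + 1 <= N - 1:
        A[i, i + 1] = -0.5
    if x + 1 == N:
        b[i] = 0.5          # contribution of the boundary value h(N) = 1
h = np.linalg.solve(A, b)
print(h)                    # matches x/N for x = 1..N-1
```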
But first we rewrite (3.3.2) to reveal the electrical interpretation. For this we
switch to reversible chains. Recall that a reversible Markov chain is equivalent
to a random walk on a network N = (G, c) where the edges of G correspond
to transitions of positive probability. If the chain is reversible with respect to a
stationary measure π, then the edge weights are c(x, y) = π(x)P (x, y). In this
notation (3.3.2) becomes

    h(x) = (1/c(x)) Σ_{y∼x} c(x, y) h(y),   ∀x ∈ (A ∪ Z)^c,   (3.3.4)

where c(x) := Σ_{y∼x} c(x, y) = π(x). In words, h(x) is the weighted average of its
neighboring values. Now comes the electrical analogy: if one interprets c(x, y) as
a conductance, a function satisfying (3.3.4) is known as a voltage. The voltages at
A and Z are 1 and 0 respectively. We show in the next subsection by a martingale
argument that, under appropriate conditions, such a voltage exists and is unique.
We develop the electrical analogy and many of its applications in Section 3.3.2.

3.3.1 Martingales, the Dirichlet problem and Lyapounov functions


To see why martingales come in, let Ft = σ(X0 , . . . , Xt ) and let τ ∗ := τW c . By a
first-step calculation again, for a function h satisfying (3.3.2),

    h(X_{t∧τ*}) = E[ h(X_{(t+1)∧τ*}) | F_t ],   ∀t ≥ 0,   (3.3.5)

that is, (h(Xt∧τ ∗ ))t is a martingale with respect to (Ft ). Indeed, on {τ ∗ ≤ t},

E[h(X(t+1)∧τ ∗ ) | Ft ] = h(Xτ ∗ ) = h(Xt∧τ ∗ ),

and on {τ* > t},

    E[h(X_{(t+1)∧τ*}) | F_t] = Σ_y P(X_t, y) h(y) = h(X_t) = h(X_{t∧τ*}).

Although the rest of Section 3.3 is concerned with reversible Markov chains,
the current subsection applies to the non-reversible case as well. We give an
overview of potential theory for general, countable-space, discrete-time Markov
chains and its connections to martingales. As a major application, we introduce
the concept of a Lyapounov function which is useful in bounding certain hitting
times.

Existence and uniqueness of a harmonic extension


We begin with a special case, which will be generalized below.

Theorem 3.3.1 (Harmonic extension: existence and uniqueness). Let P be an ir-


reducible transition matrix on a finite or countably infinite state space V . Let W
be a finite, proper subset of V and let h : W c → R be a bounded function on
W c . Then there exists a unique extension of h to W that is harmonic on W , that
is, which satisfies (3.3.2). The solution is given by

h(x) = Ex [h (XτW c )] .

Proof. We first argue about uniqueness. Suppose h is defined over all of V and
satisfies (3.3.2). Let τ ∗ := τW c . Then the process (h (Xt∧τ ∗ ))t is a martingale
by (3.3.5). Because W is finite and the chain is irreducible, we have τ ∗ < +∞
almost surely, as implied by Lemma 3.1.25. Moreover the process is bounded
because h is bounded on W c and W is finite. Hence by Doob’s optional stopping
theorem (Theorem 3.1.38 (ii))

h(x) = Ex [h(Xτ ∗ )], ∀x ∈ W,



which implies that h is unique, since the right-hand side depends only on the chain
and the fixed values of h on W c .
For the existence, simply define h(x) := Ex [h (Xτ ∗ )], ∀x ∈ W, and use a
first-step argument similarly to (3.3.3).

For some insights on what happens when the assumptions of Theorem 3.3.1 are
not satisfied, see Exercise 3.11. For an alternative (arguably more intuitive) proof
of uniqueness based on the maximum principle, see Exercise 3.12.
In the proof above it suffices to specify h on the outer boundary of W

∂V W = {z ∈ V \W : ∃y ∈ W, P (y, z) > 0}.

Introduce the Laplacian associated to P:

    ∆f(x) = [ Σ_y P(x, y) f(y) ] − f(x)
          = Σ_y P(x, y) [f(y) − f(x)]
          = E_x[f(X₁) − f(X₀)],   (3.3.6)

provided the expectation exists. We have proved that, under the assumptions of
Theorem 3.3.1, there exists a unique solution to
    ∆f(x) = 0       ∀x ∈ W,
    f(x) = h(x)     ∀x ∈ ∂_V W,      (3.3.7)

and that solution is given by f (x) = Ex [h (XτW c )], for x ∈ W ∪ ∂V W . The


system (3.3.7), in reference to its counterpart in the theory of partial differential
equations, is referred to as a Dirichlet problem.
Example 3.3.2 (Simple random walk on Zd). The Laplacian above can be inter-
preted as a discretized version of the standard Laplacian. For instance, for simple
random walk on Z,

    ∆f(x) = [ Σ_y P(x, y) f(y) ] − f(x)
          = Σ_y P(x, y) [f(y) − f(x)]
          = (1/2) { [f(x + 1) − f(x)] − [f(x) − f(x − 1)] },

which is a discretized second derivative. More generally, for simple random walk
on Zd, we get

    ∆f(x) = [ Σ_y P(x, y) f(y) ] − f(x)
          = Σ_y P(x, y) [f(y) − f(x)]
          = (1/(2d)) Σ_{i=1}^{d} { [f(x + e_i) − f(x)] − [f(x) − f(x − e_i)] },

where e₁, . . . , e_d is the standard basis in Rd. J

Theorem 3.3.1 has many applications. One of its consequences is that harmonic
functions on a finite state space are constant.

Corollary 3.3.3. Let P be an irreducible transition matrix on a finite state space


V . If h is harmonic on all of V , then it is constant.

Proof. Fix the value of h at an arbitrary vertex z and set W = V \{z}. Applying
Theorem 3.3.1, for all x ∈ W , h(x) = Ex [h (XτW c )] = h(z).

As an example of application of this corollary, we prove the following surpris-


ing result: in a finite, irreducible Markov chain, the expected time to hit a target
chosen at random according to the stationary distribution does not depend on the
starting point.

Theorem 3.3.4 (Random target lemma). Let (Xt ) be an irreducible Markov chain
on a finite state space V with transition matrix P and stationary distribution π.
Then

    h(x) := Σ_{y∈V} π(y) E_x[τ_y]

does not in fact depend on x.

Proof. Because the chain is irreducible and has a finite state space, E_x[τ_y] < +∞
for all x, y. By Corollary 3.3.3, it suffices to show that h(x) := Σ_y π(y) E_x[τ_y]
is harmonic on all of V. As before, it is natural to expand E_x[τ_y] according to the
first step of the chain,

    E_x[τ_y] = 1{x ≠ y} ( 1 + Σ_z P(x, z) E_z[τ_y] ).

Substituting into the definition of h(x) gives

    h(x) = (1 − π(x)) + Σ_z Σ_{y ≠ x} π(y) P(x, z) E_z[τ_y]
         = (1 − π(x)) + Σ_z P(x, z) ( h(z) − π(x) E_z[τ_x] ).

Rearranging, we get

    ∆h(x) = [ Σ_z P(x, z) h(z) ] − h(x)
          = π(x) ( 1 + Σ_z P(x, z) E_z[τ_x] ) − 1
          = 0,

where we used 1/π(x) = E_x[τ_x⁺] = 1 + Σ_z P(x, z) E_z[τ_x], by Theorem 3.1.19
and a first-step argument (recall that the first return time τ_x⁺ was defined in Defini-
tion 3.1.6).
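The random target lemma is easy to test numerically (a sketch of ours, assuming NumPy): generate a random irreducible chain, compute E_x[τ_y] by solving linear systems, and check that Σ_y π(y) E_x[τ_y] does not depend on x.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
P = rng.uniform(0.1, 1.0, (n, n))
P /= P.sum(axis=1, keepdims=True)      # strictly positive => irreducible

# stationary distribution: left eigenvector of P for eigenvalue 1
w, V = np.linalg.eig(P.T)
pi = np.real(V[:, np.argmin(np.abs(w - 1))])
pi /= pi.sum()

# E_x[tau_y]: solve (I - Q) t = 1 with Q = P restricted to V \ {y}
E = np.zeros((n, n))
for y in range(n):
    idx = [x for x in range(n) if x != y]
    Q = P[np.ix_(idx, idx)]
    E[idx, y] = np.linalg.solve(np.eye(n - 1) - Q, np.ones(n - 1))

h = E @ pi
print(h)    # all entries agree up to numerical error
```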

Potential theory for Markov chains


More generally, many quantities of interest can be expressed in the following form.
Consider again a subset W ⊂ V and the stopping time

τW c = inf{t ≥ 0 : Xt ∈ W c }.

Let also h : W^c → R₊ and k : W → R₊. Define the quantity

    u(x) := E_x[ h(X_{τ_{W^c}}) 1{τ_{W^c} < +∞} + Σ_{0≤t<τ_{W^c}} k(X_t) ].   (3.3.8)

The first term on the right-hand side is a final cost incurred when we exit W (and
depends on where we do), while the second term is a unit time cost incurred along
the sample path. Note that, in fact, it suffices to define h on ∂V W , the outer
boundary of W if we restrict ourselves to x ∈ W . Observe also that the function
u(x) may take the value +∞; the expectation is well-defined (in R+ ∪ {+∞}) by
the nonnegativity of the terms (see Appendix B).

Example 3.3.5 (Some special cases). Here are some important special cases:

• Revisiting (3.3.1), for two disjoint subsets A, Z of V, the probability

      u(x) := P_x[τ_A < τ_Z],

  of hitting A before Z as a function of the starting point x ∈ V is obtained
  by taking W := (A ∪ Z)^c, h = 1 (respectively = 0) on A (respectively Z),
  and k = 0 on V. The further special case Z = ∅ leads to the exit probability
  from A,

      u(x) := P_x[τ_A < +∞].

  On the other hand, if A and Z form a disjoint partition of W^c (or ∂_V W will
  suffice if x ∈ W), we get the exit law from W,

      u(x) := P_x[X_{τ_{W^c}} ∈ A; τ_{W^c} < +∞].

• The average occupation time of A ⊆ W before exiting W,

      u(x) := E_x[ Σ_{0≤t<τ_{W^c}} 1{X_t ∈ A} ],

  is obtained by taking h = 0 on V, and k = 1 (respectively = 0) on A
  (respectively on A^c). Revisiting (3.1.3), the Green function of the chain
  stopped at τ_{W^c}, that is,

      u(x) := G_{τ_{W^c}}(x, y) = E_x[ Σ_{0≤t<τ_{W^c}} 1{X_t = y} ],

  is obtained by taking A = {y}. Another special case is A = W, where we
  get the mean exit time from A,

      u(x) := E_x[τ_{A^c}].

J
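As a concrete instance (our own sketch, assuming NumPy) of the case k ≡ 1, h ≡ 0: the mean exit time of simple random walk from W = {1, . . . , N − 1} solves (I − P_W) u = 1, and the solution matches the classical formula u(x) = x(N − x).

```python
import numpy as np

# Mean exit time of simple random walk on {0,...,N} from the interior:
# solve (I - Q) u = 1 where Q is P restricted to W = {1,...,N-1}.
N = 10
Q = np.zeros((N - 1, N - 1))
for i in range(N - 1):
    if i > 0:
        Q[i, i - 1] = 0.5
    if i < N - 2:
        Q[i, i + 1] = 0.5
u = np.linalg.solve(np.eye(N - 1) - Q, np.ones(N - 1))
x = np.arange(1, N)
print(u)    # matches x * (N - x)
```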
The function u in (3.3.8) turns out to satisfy a generalized version of (3.3.7).
The proof is usually called first-step analysis (of which we have already seen many
instances).
Theorem 3.3.6 (First-step analysis). Let P be a transition matrix on a finite or
countable state space V . Let W be a proper subset of V , and let h : W c → R+
and k : W → R+ be bounded functions. Then the function u ≥ 0, as defined
in (3.3.8), satisfies the system of equations
    u(x) = k(x) + Σ_y P(x, y) u(y)   for x ∈ W,
    u(x) = h(x)                      for x ∈ W^c.      (3.3.9)

Proof. For x ∈ W c , by definition u(x) = h(x) since τW c = 0. Fix x ∈ W . By


taking out what is known (Lemma B.6.13), the tower property (Lemma B.6.16)
and the Markov property (Theorem 1.1.18),

    u(x) = k(x) + E_x[ h(X_{τ_{W^c}}) 1{τ_{W^c} < +∞} + Σ_{1≤t<τ_{W^c}} k(X_t) ]
         = k(x) + E_x[ E[ h(X_{τ_{W^c}}) 1{τ_{W^c} < +∞} + Σ_{1≤t<τ_{W^c}} k(X_t) | F₁ ] ]
         = k(x) + E_x[u(X₁)],

which gives the claim.
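When V is finite, (3.3.9) is just a linear system that can be solved directly. The following sketch is not from the text (it assumes NumPy is available): it solves the system for simple random walk on {0, . . . , n} with W = {1, . . . , n − 1}, k ≡ 1 and h ≡ 0, so that u(x) = Ex[τ_{W^c}] is the mean exit time, whose classical value is x(n − x).

```python
import numpy as np

# First-step analysis (3.3.9) as a linear system: (I - P_W) u_W = k_W,
# where P_W is the transition matrix restricted to W. Example: simple
# random walk on {0, ..., n}, W = {1, ..., n-1}, k = 1 on W, h = 0 on W^c,
# so u(x) = E_x[tau_{W^c}] is the mean exit time.
n = 10
P = np.zeros((n + 1, n + 1))
for x in range(1, n):
    P[x, x - 1] = P[x, x + 1] = 0.5

W = list(range(1, n))
PW = P[np.ix_(W, W)]            # h = 0 on {0, n}: those columns drop out
uW = np.linalg.solve(np.eye(len(W)) - PW, np.ones(len(W)))

# The exact mean exit time for simple random walk is x(n - x).
for i, x in enumerate(W):
    assert abs(uW[i] - x * (n - x)) < 1e-8
```

The same solve handles any bounded h and k by moving the known boundary values h(y), y ∈ W^c, into the right-hand side.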

If u is finite, the system of equations (3.3.9) can be rewritten as the Poisson equation (once again as an analogue of its counterpart in the theory of partial differential equations)

∆u = −k on W,
u = h on W^c.     (3.3.10)

This is well-defined for instance if W is a finite subset and P is irreducible. Indeed,


as we argued in the proof of Theorem 3.3.1, the stopping time τW c then has a finite
expectation. Because h is bounded, it follows that
 
u(x) := Ex[ h(X_{τ_{W^c}}) 1{τ_{W^c} < +∞} + Σ_{0≤t<τ_{W^c}} k(X_t) ]
      ≤ sup_{x∈W^c} h(x) + sup_{x∈W} k(x) · sup_{x∈W} Ex[τ_{W^c}]
      < +∞,

uniformly in x. Using (3.3.6) and rearranging (3.3.9) gives (3.3.10).


Remark 3.3.7. A more general form of the statement which can be used to study
certain moment-generating functions can be found, for example, in [Ebe, Theorem
1.3].
In a generalization of Theorem 3.3.1, our next theorem allows one to establish
uniqueness of the solution of the system (3.3.10) under some conditions (which we
will not detail here, but see Exercise 3.13). Perhaps even more useful, it also gives
an effective approach to bound the function u from above. This is based on the
following supermartingale.

Lemma 3.3.8 (Locally superharmonic functions). Let P be a transition matrix


on a finite or countable state space V . Let W be a proper subset of V , and let
h : W c → R+ and k : W → R+ be bounded functions. Suppose the nonnegative
function ψ : V → R+ satisfies

∆ψ ≤ −k on W .

Then the process


N_t := ψ(X_{t∧τ_{W^c}}) + Σ_{0≤s<t∧τ_{W^c}} k(X_s),

is a nonnegative supermartingale for any initial point x ∈ V .


Proof. Observe that: on {τW c ≤ t}, we have Nt+1 = Nt ; while on {τW c > t},
we have Nt+1 − Nt = ψ(Xt+1 ) − ψ(Xt ) + k(Xt ) by cancellations in the sum. So,
since {τW c > t} ∈ Ft by definition of a stopping time, it holds by taking out what
is known that

E[Nt+1 − Nt | Ft ] = E[1{τW c > t}(ψ(Xt+1 ) − ψ(Xt ) + k(Xt )) | Ft ]


= 1{τW c > t}(E[ψ(Xt+1 ) − ψ(Xt ) | Ft ] + k(Xt ))
= 1{τW c > t}(∆ψ(Xt ) + k(Xt ))
≤ 1{τW c > t}(−k(Xt ) + k(Xt ))
= 0,

where we used that, by (3.3.6) and the Markov property,

E[ψ(Xt+1 ) − ψ(Xt ) | Ft ] = ∆ψ(Xt ), (3.3.11)

and that Xt ∈ W on {τW c > t}.

Theorem 3.3.9 (Poisson equation: bounding the solution). Let P be a transition


matrix on a finite or countable state space V . Let W be a proper subset of V ,
and let h : W c → R+ and k : W → R+ be bounded functions. Suppose the
nonnegative function ψ : V → R+ satisfies the system of inequalities
∆ψ ≤ −k on W,
ψ ≥ h on W^c.     (3.3.12)

Then
ψ ≥ u, on V , (3.3.13)
where u is the function defined in (3.3.8).

Proof. The inequality (3.3.13) holds on W^c by Theorem 3.3.6 and (3.3.12) since in that case u(x) = h(x) ≤ ψ(x).
Fix x ∈ W . Consider the nonnegative supermartingale (Nt ) in Lemma 3.3.8.
By the convergence of nonnegative supermartingales (Corollary 3.1.48), (Nt ) con-
verges almost surely to a finite limit with expectation ≤ Ex [N0 ]. In particular,
the limit NτW c is well-defined, nonnegative and finite, including on the event that
{τW c = +∞}. As a result,
 
N_{τ_{W^c}} = lim_t [ ψ(X_{t∧τ_{W^c}}) + Σ_{0≤s<t∧τ_{W^c}} k(X_s) ]
            ≥ h(X_{τ_{W^c}}) 1{τ_{W^c} < +∞} + Σ_{0≤s<τ_{W^c}} k(X_s),

where we used (3.3.12). Moreover, by Lemma 3.1.37, Ex [Nt∧τW c ] ≤ Ex [N0 ] for


all t and Fatou’s lemma (see Proposition B.4.14) gives Ex [NτW c ] ≤ Ex [N0 ].
Hence, by definition of u,
 
u(x) = Ex[ h(X_{τ_{W^c}}) 1{τ_{W^c} < +∞} + Σ_{0≤t<τ_{W^c}} k(X_t) ]
     ≤ Ex[N_{τ_{W^c}}]
     ≤ Ex[N_0]
     = ψ(x),
where, on the last line, we used that the initial state is x ∈ W . That proves the
claim.
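The domination ψ ≥ u can also be observed numerically. The sketch below is not from the text (it assumes NumPy): it takes simple random walk on {0, . . . , n} with W = {1, . . . , n − 1}, k ≡ 1 and h ≡ 0, where the exact solution is u(x) = x(n − x), and checks a supersolution ψ(x) = x(n − x) + 3 against it.

```python
import numpy as np

# Sketch: check the domination psi >= u of Theorem 3.3.9 on a small example.
# Simple random walk on {0,...,n}, W = {1,...,n-1}, k = 1 on W, h = 0 on W^c.
n = 12
P = np.zeros((n + 1, n + 1))
for x in range(1, n):
    P[x, x - 1] = P[x, x + 1] = 0.5
P[0, 0] = P[n, n] = 1.0

W = list(range(1, n))
u = np.zeros(n + 1)
u[W] = np.linalg.solve(np.eye(len(W)) - P[np.ix_(W, W)], np.ones(len(W)))

# A supersolution: psi(x) = x(n - x) + 3 satisfies Delta psi = -1 on W and
# psi >= 0 = h on {0, n}; Theorem 3.3.9 then gives psi >= u everywhere.
psi = np.array([x * (n - x) + 3.0 for x in range(n + 1)])
Lpsi = P @ psi - psi                    # Laplacian: Delta psi = (P - I) psi
assert np.all(Lpsi[W] <= -1 + 1e-10)    # Delta psi <= -k on W
assert np.all(psi >= u - 1e-10)         # the domination (3.3.13)
```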

Lyapounov functions
Here is an important application, bounding from above the hitting time τA to a set
A in expectation.
Theorem 3.3.10 (Controlling hitting times via Lyapounov functions). Let P be a
transition matrix on a finite or countably infinite state space V . Let A be a proper
subset of V . Suppose the nonnegative function ψ : V → R+ satisfies the system of
inequalities
∆ψ ≤ −1, on Ac . (3.3.14)
Then
Ex [τA ] ≤ ψ(x),

for all x ∈ V .
Proof. Indeed, by (3.3.14) and nonnegativity (in particular on A), the function ψ
satisfies the assumptions of Theorem 3.3.9 with W = Ac , h = 0 on A, and k = 1
on Ac . Hence, by definition of u and the claim in Theorem 3.3.9,
 
Ex[τA] = Ex[ h(X_{τA}) 1{τA < +∞} + Σ_{0≤t<τA} k(X_t) ]
       = u(x)
       ≤ ψ(x).
That establishes the claim.
Recalling (3.3.11), condition (3.3.14) is equivalent to the following conditional expected decrease in ψ outside A:

E[ψ(X_{t+1}) − ψ(X_t) | F_t] ≤ −1, on {X_t ∈ A^c}.     (3.3.15)

A nonnegative function satisfying an inequality of this type, also known as a drift condition, is often referred to as a Lyapounov function. Intuitively, it tends to decrease along the sample path outside of A. Because it is nonnegative, it cannot decrease forever, and therefore the chain eventually enters A. We consider a simple example next.
Example 3.3.11 (A Markov chain on the nonnegative integers). Let (Zt )t≥1 be
i.i.d. integrable random variables taking values in Z such that E[Z1 ] < 0. Let
(Xt )t≥0 be the chain defined by X0 = x for some x ∈ Z+ and
Xt+1 = (Xt + Zt+1 )+ ,
where recall that z + = max{0, z}. In particular Xt ∈ Z+ for all t. Let (Ft ) be the
corresponding filtration. When Xt is large, the “local drift” is close to E[Z1 ] < 0.
By analogy to the biased case of the gambler’s ruin (Example 3.1.43), we might
expect that, from a large starting point x, it will take time roughly x/|E[Z1 ]| in
expectation to “return to a neighborhood of 0.” We prove something along those
lines here using a Lyapounov function.
Observe that, for any y ∈ Z+ , we have on the event {Xt = y} by the Markov
property
Ex [Xt+1 − Xt | Ft ] = E[(y + Zt+1 )+ − y]
= E[−y1{Zt+1 ≤ −y} + Zt+1 1{Zt+1 > −y}]
≤ E[Zt+1 1{Zt+1 > −y}]
= E[Z1 1{Z1 > −y}]. (3.3.16)

For all y, the random variable |Z1 1{Z1 > −y}| is bounded by |Z1 |, itself an
integrable random variable. Moreover, Z1 1{Z1 > −y} → Z1 as y → +∞ almost
surely. Hence, the dominated convergence theorem (Proposition B.4.14) implies
that

lim_{y→+∞} E[Z1 1{Z1 > −y}] = E[Z1] < 0.

So for any 0 < ε < −E[Z1 ], there is yε ∈ Z+ large enough that E[Z1 1{Z1 >
−y}] < −ε for all y > yε . Fix ε as above and define

A := {0, 1, . . . , yε }.

We use Theorem 3.3.10 to bound τA in expectation. Define the Lyapounov function

ψ(x) = x/ε, ∀x ∈ Z+.

On the event {Xt = y}, we rewrite (3.3.16) as

E[ψ(X_{t+1}) − ψ(X_t) | F_t] ≤ E[Z1 1{Z1 > −y}]/ε ≤ −1,

for y ∈ A^c. This is the same as (3.3.15). Hence, we can apply Theorem 3.3.10 to get

Ex[τA] ≤ ψ(x) = x/ε,

for all x ≥ yε. J
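The bound in this example can be checked by simulation. The sketch below is not from the text: it makes the concrete choice of Z1 uniform on {−2, +1}, so E[Z1] = −1/2, and for y ≥ 3 we have E[Z1 1{Z1 > −y}] = E[Z1] = −1/2; with ε = 0.4 we may thus take yε = 2 and A = {0, 1, 2}, and the bound reads Ex[τA] ≤ x/ε = 2.5x.

```python
import random

# Simulation sketch (choices not from the text): Z uniform on {-2, +1}, so
# E[Z] = -1/2; with eps = 0.4 we may take y_eps = 2 and A = {0, 1, 2}.
# The Lyapounov bound then reads E_x[tau_A] <= psi(x) = x / eps.
random.seed(0)
eps, x0 = 0.4, 30
A = {0, 1, 2}

def hitting_time(x):
    t = 0
    while x not in A:
        # the chain X_{t+1} = (X_t + Z_{t+1})^+
        x = max(0, x + random.choice([-2, 1]))
        t += 1
    return t

trials = 1000
mean_tau = sum(hitting_time(x0) for _ in range(trials)) / trials
# The true mean is roughly (x0 - 2)/|E[Z]| = 56, comfortably below the
# Lyapounov bound x0 / eps = 75.
assert mean_tau <= x0 / eps
```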

A well-known, closely related result gives a criterion for positive recurrence.


We state it without proof.

Theorem 3.3.12 (Foster’s theorem). Let P be an irreducible transition matrix on


a countable state space V . Let A be a finite, proper subset of V . Suppose the
nonnegative function ψ : V → R+ satisfies the system of inequalities

∆ψ ≤ −1, on Ac ,

as well as the condition


Σ_{y∈V} P(x, y) ψ(y) < +∞, ∀x ∈ A.

Then P is positive recurrent.



3.3.2 Basic electrical network theory


We now develop the basic theory of electrical networks for the analysis of random
walks. All results in this subsection (and the next one) concern reversible Markov
chains, or random walks on networks (see Definition 1.2.7). We begin with a few
definitions. Throughout, we will use the notation h|B for the function h restricted
to the subset B. We also write h ≡ c if h is identically equal to the constant c.

Definitions
Let N = (G, c) be a finite or countable network with G = (V, E). Throughout
this section we assume that N is connected and locally finite. In the context of
electrical networks, edge weights are called conductances. The reciprocals of the conductances are called resistances and are denoted by r(e) := 1/c(e), for all e ∈ E. For an edge e = {x, y} we overload c(x, y) := c(e) and r(x, y) := r(e).
Both c and r are symmetric as functions of x, y. Recall that the transition matrix
of the random walk on N satisfies
P(x, y) = c(x, y)/c(x),

where

c(x) = Σ_{z:z∼x} c(x, z).

Let A, Z be disjoint, non-empty subsets of V such that W := (A ∪ Z)^c is finite. For our purposes it will suffice to take A to be a singleton, that is, A = {a} for some a. Then a is called the source and Z is called the sink-set, or sink for short. As an immediate corollary of Theorem 3.3.1, we obtain the existence and uniqueness of a voltage function, defined formally in the next corollary. It will be useful to consider voltages taking an arbitrary value at a, but we always set the voltage on Z to 0.

Corollary 3.3.13 (Voltage). Fix v0 > 0. Let N = (G, c) be a finite or countable,


connected network with G = (V, E). Let A := {a}, Z be disjoint non-empty
subsets of V such that W = (A ∪ Z)c is non-empty and finite. Then there exists
a unique voltage defined as follows: a function v on V such that v is harmonic on W, that is,

v(x) = (1/c(x)) Σ_{y:y∼x} c(x, y) v(y), ∀x ∈ W,     (3.3.17)

where

v(a) = v0 and v|Z ≡ 0.     (3.3.18)

Moreover,

v(x)/v0 = Px[τa < τZ],     (3.3.19)

for the corresponding random walk on N.
Proof. Set h(x) = v(x) on A ∪ Z. Theorem 3.3.1 gives the result.

Note in the definition above that if v is a voltage with value v0 at a, then ṽ(x) =
v(x)/v0 is a voltage with value 1 at a.
Let v be a voltage function on N with source a and sink Z. The Laplacian-based formulation of harmonicity, (3.3.7), can be interpreted in terms of flows (see Definition 1.1.13). We define the current function

i(x, y) := c(x, y)[v(x) − v(y)],     (3.3.20)

or, equivalently, v(x) − v(y) = r(x, y) i(x, y). The latter definition is usually referred to as Ohm's "law." Notice that the current is defined on ordered pairs of vertices and is anti-symmetric, that is, i(x, y) = −i(y, x). In terms of the current, the harmonicity of v is then expressed as

Σ_{y:y∼x} i(x, y) = 0, ∀x ∈ W,     (3.3.21)

that is, i is a flow on W (without capacity constraints). This set of equations is known as Kirchhoff's node law. We also refer to these constraints as flow-conservation constraints. To be clear, the current is not just any flow: it is a flow that can be written as a potential difference according to Ohm's law. Such a current also satisfies Kirchhoff's cycle law: if x1 ∼ x2 ∼ · · · ∼ xk ∼ xk+1 = x1 is a cycle, then

Σ_{j=1}^{k} i(xj, xj+1) r(xj, xj+1) = 0,

as can be seen by substituting Ohm’s law.


The strength of the current is defined as

kik := Σ_{y:y∼a} i(a, y).

Because a ∉ W, it does not satisfy Kirchhoff's node law and the strength is not 0 in general. The definition of i(x, y) ensures that the flow out of the source is nonnegative, as Py[τa < τZ] ≤ 1 = Pa[τa < τZ] for all y ∼ a, so that

i(a, y) = c(a, y)[v(a) − v(y)] = c(a, y) [v0 Pa [τa < τZ ] − v0 Py [τa < τZ ]] ≥ 0.

Note that by multiplying the voltage by a constant we obtain a current which is similarly scaled. Up to that scaling, the current is unique from the uniqueness of the voltage. We will often consider the unit current, where we scale v and i so as to enforce that kik = 1.
Summing up the previous paragraphs, to determine the voltage it suffices to
find functions v and i that simultaneously satisfy Ohm’s law and Kirchhoff’s node
law. Here is an example.

Example 3.3.14 (Network reduction: birth-death chain). Let N be the line on


{0, 1, . . . , n} with j ∼ k ⇐⇒ |j − k| = 1 and arbitrary (positive) conductances
on the edges. Let (Xt ) be the corresponding walk. We use the principle above
to compute Px [τ0 < τn ] for 1 ≤ x ≤ n − 1. Consider the voltage function
v when v(0) = 1 and v(n) = 0 with current i, which exists and is unique by
Corollary 3.3.13. The desired quantity is v(x).
Note that because i is a flow on N , the flow into every vertex equals the flow out
of that vertex, and we must have i(y, y + 1) = i(0, 1) = kik for all y. To compute
v(x), we note that it remains the same if we replace the path 0 ∼ 1 ∼ · · · ∼ x
with a single edge of resistance R0,x = r(0, 1) + · · · + r(x − 1, x). Indeed leave
the voltage unchanged on the remaining nodes (to the right of x) and define the
current on the new edge as kik. Kirchhoff’s node law is automatically satisfied by
the argument above. To check Ohm’s law on the new “super-edge,” note that on
the original network N (with the original voltage function)

v(0) − v(x) = (v(0) − v(1)) + · · · + (v(x − 1) − v(x))


= r(x − 1, x)i(x − 1, x) + · · · + r(0, 1)i(0, 1)
= [r(0, 1) + · · · + r(x − 1, x)]kik
= R0,x kik.

Ohm’s law is also satisfied on every other edge (to the right of x) because nothing
has changed there. That proves the claim.
We do the same reduction on the other side of x by replacing x ∼ x + 1 ∼
· · · ∼ n with a single edge of resistance Rx,n = r(x, x + 1) + · · · + r(n − 1, n).
See Figure 3.4.
Because the voltage at x was not changed by this transformation, we can com-
pute v(x) = Px [τ0 < τn ] directly on the reduced network, where it is now a
straightforward computation. Indeed, starting at x, the reduced walk jumps to 0
with probability proportional to the conductance on the new super-edge 0 ∼ x (or

Figure 3.4: Reduced network.

the reciprocal of the resistance), that is,


Px[τ0 < τn] = R0,x^{−1} / (R0,x^{−1} + Rx,n^{−1})
            = Rx,n / (Rx,n + R0,x)
            = (r(x, x+1) + · · · + r(n−1, n)) / (r(0, 1) + · · · + r(n−1, n)).
Some special cases:
• Simple random walk. In the case of simple random walk, all resistances are
equal and we get
Px[τ0 < τn] = (n − x)/n.
• Gambler’s ruin. The gambler’s ruin example (see Examples 3.1.41 and 3.1.43)
corresponds to taking c(j, j + 1) = (p/q)j or r(j, j + 1) = (q/p)j , for some
0 < p < 1 and q = 1 − p. In this case we obtain
Px[τ0 < τn] = Σ_{j=x}^{n−1} (q/p)^j / Σ_{j=0}^{n−1} (q/p)^j = (q/p)^x (1 − (q/p)^{n−x}) / (1 − (q/p)^n) = ((p/q)^{n−x} − 1) / ((p/q)^n − 1),

when p ≠ q (otherwise we get back the simple random walk case). J
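The reduction formula can be verified numerically by solving the harmonic system (3.3.17)–(3.3.18) directly; the sketch below is not from the text (it assumes NumPy) and uses random positive conductances on the path, checking v(x) = Px[τ0 < τn] against the resistance ratio.

```python
import numpy as np

# Sketch: birth-death network on {0,...,n} with random positive conductances.
# Solve the harmonic system with v(0) = 1, v(n) = 0 and compare with
# P_x[tau_0 < tau_n] = (r(x,x+1)+...+r(n-1,n)) / (r(0,1)+...+r(n-1,n)).
rng = np.random.default_rng(1)
n = 8
c = rng.uniform(0.5, 2.0, size=n)          # c[j] = conductance of edge {j, j+1}
r = 1.0 / c

# For x in W = {1,...,n-1}:
#   (c(x-1,x)+c(x,x+1)) v(x) - c(x-1,x) v(x-1) - c(x,x+1) v(x+1) = 0,
# with the boundary values v(0) = 1, v(n) = 0 moved to the right-hand side.
A = np.zeros((n - 1, n - 1))
b = np.zeros(n - 1)
for i, x in enumerate(range(1, n)):
    A[i, i] = c[x - 1] + c[x]
    if x - 1 >= 1:
        A[i, i - 1] = -c[x - 1]
    else:
        b[i] += c[x - 1] * 1.0             # v(0) = 1
    if x + 1 <= n - 1:
        A[i, i + 1] = -c[x]                # v(n) = 0 contributes nothing
v = np.linalg.solve(A, b)

for i, x in enumerate(range(1, n)):
    assert abs(v[i] - r[x:].sum() / r.sum()) < 1e-10
```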



The above example illustrates the series law: resistances in series add up. There is a similar parallel law: conductances in parallel add up. To formalize these laws, one needs to introduce multigraphs. This is straightforward, although to avoid complicating the notation further we will not do this here. (But see Example 3.3.22 for a simple case.)
Another useful network reduction technique is illustrated in the next example.

Example 3.3.15 (Network reduction: binary tree). Let N be the rooted binary tree
with n levels, T̂_2^n, and equal conductances on all edges. Let 0 be the root. Pick an
arbitrary leaf and denote it by n. The remaining vertices on the path between 0
and n, which we refer to as the main path, will be denoted by 1, . . . , n − 1 moving
away from the root. We claim that, for all 0 < x < n, it holds that

Px [τ0 < τn ] = (n − x)/n.

Indeed let v be the voltage with values 1 and 0 at a = 0 and Z = {n} respec-
tively. Let i be the corresponding current. Notice that, for each 0 ≤ y < n, the
current—as a flow—has “nowhere to go” on the subtree Ty hanging from y away
from the main path. The leaves of the subtree are dead ends. Hence the current
must be 0 on Ty and by Ohm’s law the voltage must be constant on it, that is, every
vertex in Ty has voltage v(y).
Imagine collapsing all vertices in Ty , including y, into a single vertex (and re-
moving the self-loops so created). Doing this for every vertex on the main path
results in a new reduced network which is formed of a single path as in Exam-
ple 3.3.14. Note that the voltage and the current can be taken to be the same as
they were previously on the main path. Indeed, with this choice, Ohm’s law is
automatically satisfied. Moreover, because there is no current on the hanging sub-
trees in the original network, Kirchhoff’s node law is also satisfied on the reduced
network, as no current is “lost.”
Hence the answer can be obtained from Example 3.3.14. That proves the claim.
(You should convince yourself that this result is obvious from a probabilistic point
of view.) J
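The claim can also be checked by brute force on the full tree, with no reduction at all; the sketch below is not from the text (it assumes NumPy). It solves the harmonic system on a depth-4 binary tree, with heap-style indexing, and verifies v = (n − x)/n along the main path (here n = d = 4 and the chosen leaf is the leftmost one).

```python
import numpy as np

# Sketch: Example 3.3.15 by brute force. Complete binary tree of depth d with
# unit conductances, root a = 0, sink Z = {leftmost leaf}. Heap indexing:
# children of node i are 2i+1 and 2i+2; the leftmost node at depth x is 2^x - 1.
d = 4
V = 2 ** (d + 1) - 1
adj = [[] for _ in range(V)]
for i in range(V):
    for ch in (2 * i + 1, 2 * i + 2):
        if ch < V:
            adj[i].append(ch)
            adj[ch].append(i)

a, z = 0, 2 ** d - 1
W = [x for x in range(V) if x not in (a, z)]
idx = {x: i for i, x in enumerate(W)}
A = np.eye(len(W))
b = np.zeros(len(W))
for x in W:
    for y in adj[x]:
        w = 1.0 / len(adj[x])          # P(x, y) for unit conductances
        if y == a:
            b[idx[x]] += w             # boundary value v(a) = 1
        elif y != z:                   # v(z) = 0 drops out
            A[idx[x], idx[y]] -= w
v = np.linalg.solve(A, b)

for x in range(1, d):                  # main-path vertex at depth x is 2^x - 1
    assert abs(v[idx[2 ** x - 1]] - (d - x) / d) < 1e-10
```

The hanging-subtree vertices come out with the same voltage as the main-path vertex they hang from, exactly as the collapse argument predicts.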

We gave a probabilistic interpretation of the voltage. What about the current?


The following result says that, roughly speaking, i(x, y) is the net traffic on the
edge {x, y} from x to y. We start with an important formula for the voltage at a.
For the walk started at a, we use the shorthand

P[a → Z] := Pa [τZ < τa+ ],



for the escape probability. The next lemma can be interpreted as a sort of Ohm's law between a and Z, where c(a) P[a → Z] plays the role of an "effective conductance." (We will be more formal in Definition 3.3.19 below.)

Lemma 3.3.16 (Effective Ohm’s Law). Let v be a voltage on N with source a and
sink Z. Let i be the associated current. Then
v(a)/kik = 1/(c(a) P[a → Z]).     (3.3.22)

Proof. Using the usual first-step analysis,

P[a → Z] = Σ_{x:x∼a} P(a, x) Px[τZ < τa]
         = Σ_{x:x∼a} (c(a, x)/c(a)) [1 − v(x)/v(a)]
         = (1/(c(a)v(a))) Σ_{x:x∼a} c(a, x)[v(a) − v(x)]
         = (1/(c(a)v(a))) Σ_{x:x∼a} i(a, x),

where we used Corollary 3.3.13 on the second line and Ohm's law on the last line. Rearranging gives the result.

Recall the Green function from (3.1.3).

Theorem 3.3.17 (Probabilistic interpretation of the current). For x ∼ y, let N^Z_{x→y} be the number of one-step transitions from x to y up to the time of the first visit to the sink Z for the random walk on N started at a. Let v be the voltage corresponding to the unit current i. Then the following formulas hold:

v(x) = G_{τZ}(a, x)/c(x), ∀x,     (3.3.23)

and

i(x, y) = Ea[N^Z_{x→y} − N^Z_{y→x}], ∀x ∼ y.

Proof. We prove the formula for the voltage by showing that v(x) as defined above is harmonic on W = V \ ({a} ∪ Z). Note first that, for all z ∈ Z, the expected number of visits to z before reaching Z (i.e., G_{τZ}(a, z)) is 0. Or, put differently, 0 = v(z) = G_{τZ}(a, z)/c(z). Moreover, to compute G_{τZ}(a, a), note that the number of visits to a before the first visit to Z is geometric with success probability P[a → Z] by the strong Markov property (Theorem 3.1.8) and hence

G_{τZ}(a, a) = 1/P[a → Z],

and, by Lemma 3.3.16 and the fact that we are using the unit current, v(a) = G_{τZ}(a, a)/c(a), as required.
To establish the formula for x ∈ W, we compute the quantity

(1/c(x)) Σ_{y:y∼x} Ea[N^Z_{y→x}]

in two ways. First, because each visit to x ∈ W must enter through one of x's neighbors (including itself in the presence of a self-loop), we get

(1/c(x)) Σ_{y:y∼x} Ea[N^Z_{y→x}] = G_{τZ}(a, x)/c(x).     (3.3.24)

On the other hand, by the Markov property (Theorem 1.1.18),

Ea[N^Z_{y→x}] = Ea[ Σ_{0≤t<τZ} 1{Xt = y, Xt+1 = x} ]
= Σ_{t≥0} Pa[Xt = y, Xt+1 = x, τZ > t]
= Σ_{t≥0} Pa[τZ > t] Pa[Xt = y | τZ > t] Pa[Xt+1 = x | Xt = y, τZ > t]
= Σ_{t≥0} Pa[τZ > t] Pa[Xt = y | τZ > t] P(y, x)
= Σ_{t≥0} Pa[Xt = y, τZ > t] P(y, x)
= P(y, x) Ea[ Σ_{0≤t<τZ} 1{Xt = y} ]
= P(y, x) G_{τZ}(a, y),     (3.3.25)



so that, summing over y, we obtain this time

(1/c(x)) Σ_{y:y∼x} Ea[N^Z_{y→x}] = (1/c(x)) Σ_{y:y∼x} P(y, x) G_{τZ}(a, y)
= Σ_{y:y∼x} P(x, y) G_{τZ}(a, y)/c(y),     (3.3.26)

where we used that c(x, y) = c(x)P(x, y) = c(y)P(y, x) (see Definition 1.2.7). Equating (3.3.24) and (3.3.26) shows that G_{τZ}(a, x)/c(x) is harmonic on W and hence must be equal to the voltage function by Corollary 3.3.13.
Finally, by (3.3.25),

Ea[N^Z_{x→y} − N^Z_{y→x}] = P(x, y) G_{τZ}(a, x) − P(y, x) G_{τZ}(a, y)
= P(x, y) v(x) c(x) − P(y, x) v(y) c(y)
= c(x, y)[v(x) − v(y)]
= i(x, y).

That concludes the proof.
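Formula (3.3.23) lends itself to a direct numerical check. In the sketch below (not from the text; it assumes NumPy), the Green function G_{τZ}(a, ·) is computed as a row of the fundamental matrix (I − Q)^{−1} of the chain killed on Z, for a path with random conductances; the resulting v is harmonic off {a} ∪ Z and, once normalized, matches the hitting probabilities of Corollary 3.3.13.

```python
import numpy as np

# Sketch: Theorem 3.3.17's voltage formula on a path 0,...,n with random
# conductances, source a = 0, sink Z = {n}. The Green function of the stopped
# chain is G_{tauZ}(a, x) = [(I - Q)^{-1}]_{a, x}, Q = P restricted to V \ Z.
rng = np.random.default_rng(7)
n = 6
c_edge = rng.uniform(0.5, 2.0, size=n)     # conductance of edge {j, j+1}
c = np.zeros(n + 1)                        # c(x) = sum of incident conductances
c[0], c[n] = c_edge[0], c_edge[-1]
for x in range(1, n):
    c[x] = c_edge[x - 1] + c_edge[x]
P = np.zeros((n + 1, n + 1))
for x in range(n):
    P[x, x + 1] = c_edge[x] / c[x]
    P[x + 1, x] = c_edge[x] / c[x + 1]

Q = P[:n, :n]                              # kill the chain on Z = {n}
G = np.linalg.inv(np.eye(n) - Q)           # G[a, x] = G_{tauZ}(a, x)
v = G[0, :] / c[:n]                        # candidate voltage; v(n) = 0

# v should be harmonic on W = {1,...,n-1}: v(x) = sum_y P(x, y) v(y).
v_full = np.append(v, 0.0)
for x in range(1, n):
    assert abs(v_full[x] - P[x] @ v_full) < 1e-10
# And v(x)/v(0) should be the hitting probability P_x[tau_0 < tau_n].
r = 1.0 / c_edge
for x in range(n):
    assert abs(v[x] / v[0] - r[x:].sum() / r.sum()) < 1e-10
```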

Example 3.3.18 (Network reduction: binary tree (continued)). Recall the setting
of Example 3.3.15. We argued that the current on side edges, that is, edges of
subtrees hanging from the main path, is 0. This is clear from the probabilistic
interpretation of the current: in a walk from a to the sink, any traversal of a side edge
must be undone at a later time. J

The network reduction techniques illustrated above are useful. But the power
of the electrical network perspective is more apparent in what comes next: the
definition of the effective resistance and, especially, its variational characterization.

Effective resistance
Before proceeding further, let us recall our original motivation. Let N = (G, c)
be a countable, locally finite, connected network and let (Xt ) be the corresponding
walk. Recall that a vertex a in G is transient if Pa [τa+ < +∞] < 1.
To relate this to our setting, consider an exhaustive sequence of induced subgraphs Gn of G, which for our purposes is defined as follows: G0 contains only a, Gn ⊆ Gn+1, G = ∪n Gn, and every Gn is finite and connected. Such a sequence always exists by iteratively adding the neighbors of the previous vertices and using that G is locally finite and connected. Let Zn be the set of vertices of G not in Gn. Then,

by Lemma 3.1.25, Pa [τZn ∧ τa+ = +∞] = 0 for all n by our assumptions on (Gn ).
Hence, the remaining possibilities are

1 = Pa[∃n, τa+ < τZn] + Pa[∀n, τZn < τa+]
  = Pa[τa+ < +∞] + limn P[a → Zn].

Therefore a is transient if and only if limn P[a → Zn ] > 0. Note that the limit ex-
ists because the sequence of events {τZn < τa+ } is decreasing by construction. By
a sandwiching argument the limit also does not depend on the exhaustive sequence.
Hence we define
P[a → ∞] := limn P[a → Zn].

We use Lemma 3.3.16 to characterize this limit using electrical network concepts.
But, first, here comes the key definition. In Lemma 3.3.16, v(a) can be thought
of as the potential difference between the source and the sink, and kik can be
thought of as the total current flowing through the network from the source to the
sink. Hence, viewing the network as a single “super-edge,” Equation (3.3.22) is
the analogue of Ohm’s law if we interpret c(a) P[a → Z] as an “effective conduc-
tance.”

Definition 3.3.19 (Effective resistance and conductance). Let N = (G, c) be a finite or countable, locally finite, connected network. Let A = {a} and Z be disjoint non-empty subsets of the vertex set V such that W := V \(A ∪ Z) is finite. Let v be a voltage from source a to sink Z and let i be the corresponding current. The effective resistance between a and Z is defined as

R(a ↔ Z) := 1/(c(a) P[a → Z]) = v(a)/kik,

where the rightmost equality holds by Lemma 3.3.16. The reciprocal is called the effective conductance and is denoted by C(a ↔ Z) := 1/R(a ↔ Z).

Going back to recurrence, for an exhaustive sequence (Gn) with (Zn) as above, it is natural to define

R(a ↔ ∞) := limn R(a ↔ Zn),

where, once again, the limit does not depend on the choice of exhaustive sequence.

Theorem 3.3.20 (Recurrence and resistance). Let N = (G, c) be a countable,


locally finite, connected network. Vertex a (and hence all vertices) in N is transient
if and only if R(a ↔ ∞) < +∞.

Proof. This follows immediately from the definition of the effective resistance.
Recall that, on a connected network, all states have the same type (recurrent or
transient).

Note that the network reduction techniques we discussed previously leave both
the voltage and the current strength unchanged on the reduced network. Hence
they also leave the effective resistance unchanged.

Example 3.3.21 (Gambler’s ruin chain revisited). Extend the gambler’s ruin chain
of Example 3.3.14 to all of Z+ . We determine when this chain is transient. Be-
cause it is irreducible, all states have the same type and it suffices to look at 0.
Consider the exhaustive sequence obtained by letting Gn be the graph restricted
to {0, 1, . . . , n − 1} and letting Zn = {n, n + 1 . . .}. To compute the effective
resistance R(0 ↔ Zn ), we use the same reduction as in Example 3.3.14. The
“super-edge” between 0 and n has resistance
R(0 ↔ Zn) = Σ_{j=0}^{n−1} r(j, j+1) = Σ_{j=0}^{n−1} (q/p)^j = ((q/p)^n − 1)/((q/p) − 1),

when p ≠ q, and similarly it has resistance n in the p = q case. Hence, taking a limit as n → +∞,

R(0 ↔ ∞) = +∞ if p ≤ 1/2, and = p/(2p − 1) if p > 1/2.

So 0 is transient if and only if p > 1/2. J
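As a quick numerical illustration (not from the text), the partial resistances R(0 ↔ Zn) for p = 0.6, computed directly via the series law, increase to the limit p/(2p − 1) = 3:

```python
# Sketch: partial sums R(0 <-> Z_n) of Example 3.3.21 for p = 0.6 approach
# the finite limit p/(2p - 1) = 3, so the walk is transient.
p = 0.6
q = 1 - p
limit = p / (2 * p - 1)
prev_gap = float("inf")
for n in (10, 20, 40, 80):
    Rn = sum((q / p) ** j for j in range(n))   # series law: resistances add up
    gap = limit - Rn
    assert 0 < gap < prev_gap                  # increasing toward the limit
    prev_gap = gap
assert limit - sum((q / p) ** j for j in range(80)) < 1e-9
```

For p ≤ 1/2 the same partial sums grow without bound, matching the recurrent case.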

Example 3.3.22 (Biased walk on the b-ary tree). Fix λ ∈ (0, +∞). Consider the
rooted, infinite b-ary tree with conductance λ^j on all edges between level j − 1 and
j, for j ≥ 1. We determine when this chain is transient. Because it is irreducible,
all states have the same type and it suffices to look at the root. Denote the root
by 0. For an exhaustive sequence, let Gn be the root together with the first n − 1
levels. Let Zn be as before. To compute R(0 ↔ Zn ): (i) glue together all vertices
of Zn ; (ii) glue together all vertices on the same level of Gn ; (iii) replace parallel
edges with a single edge whose conductance is the sum of the conductances; (iv)
let the current on this edge be the sum of the currents; and (v) leave the voltages
unchanged. It can be checked that Ohm’s law and Kirchhoff’s node law are still
satisfied, and that hence we have not changed the effective resistance. (This is an
application of the parallel law.)
The reduced network is now a line. Denote the new vertices 0, 1, . . . , n. The
conductance on the edge between j and j + 1 is b^{j+1} λ^j = b(bλ)^j. So this is

the chain from the previous example with (p/q) = bλ where all conductances are
scaled by a factor of b. Hence
R(0 ↔ ∞) = +∞ if bλ ≤ 1, and = 1/(b(1 − (bλ)^{−1})) if bλ > 1.

So the root is transient if and only if bλ > 1.


A generalization is provided in Example 3.3.27. J

3.3.3 Bounding the effective resistance via variational principles


The examples we analyzed so far were atypical in that it was possible to reduce
the network down to a single edge using simple rules and read off the effective
resistance. In general, we need more robust techniques to bound the effective re-
sistance. The following two variational principles provide a powerful approach for
this purpose. We derive them for finite networks, but will later on apply them to
exhaustive sequences.

Variational principles
Recall from Definition 1.1.13 that a flow θ from source a to sink Z on a countable, locally finite, connected network N = (G, c) is a function on pairs of adjacent vertices such that: θ is anti-symmetric, that is, θ(x, y) = −θ(y, x) for all x ∼ y; and it satisfies the flow-conservation constraint Σ_{y:y∼x} θ(x, y) = 0 on all vertices x except those in {a} ∪ Z. The strength of the flow is kθk = Σ_{y:y∼a} θ(a, y). The current is a special flow: one that can be written as a potential difference according to Ohm's law. As we show next, it can also be characterized as a flow minimizing a certain energy. Specifically, the energy of a flow θ is defined as

E(θ) = (1/2) Σ_{x,y} r(x, y) θ(x, y)².

The proof of the variational principle we present here employs a neat trick, convex
duality. In particular, it reveals that the voltage and current are dual in the sense of
convex analysis.

Theorem 3.3.23 (Thomson’s principle). Let N = (G, c) be a finite, connected


network. The effective resistance between source a and sink Z is characterized by

R(a ↔ Z) = inf {E (θ) : θ is a unit flow from a to Z} . (3.3.27)

The unique minimizer is the unit current.



Proof. It will be convenient to work in vector form. Let 1, . . . , n be the vertices of G and order the edges arbitrarily as e1, . . . , em. (We ignore any self-loops, which have no flow.) Choose an arbitrary orientation of N, that is, replace each edge ei = {x, y} with either ~ei = (x, y) or (y, x). Let ~G be the corresponding directed graph. Think of the flow θ as a vector with one coordinate for each oriented edge.
Then the flow constraint can be written as a linear system Bθ = b. Here the matrix
B has a column for each directed edge and a row for each vertex except those in
Z. The entries of B are Bx,(x,y) = 1, By,(x,y) = −1, and 0 otherwise. We have
already encountered this matrix: it is an oriented incidence matrix of G (see Defini-
tion 1.1.16) restricted to the rows in V \ Z. The vector b has 0s everywhere except
for ba = 1. Let r be the vector of resistances and let R be the diagonal matrix with
diagonal r. In vector form, E(θ) = θ^T Rθ and the optimization problem (3.3.27) reads

E∗ = inf{θ^T Rθ : Bθ = b}.
We first characterize the optimal flow. We introduce the Lagrangian

L(θ; h) := θ^T Rθ − 2h^T (Bθ − b),

where h has an entry for all vertices except those in Z. For all h,

E∗ ≥ inf_θ L(θ; h),

because those θs with Bθ = b make the second term vanish in L (θ; h). Since
L (θ; h) is strictly convex as a function of θ, the solution to its minimization is
characterized by the usual optimality conditions which in this case read 2Rθ −
2B T h = 0, or
θ = R^{−1} B^T h.     (3.3.28)
Substituting into the Lagrangian and simplifying, we have proved that

E(θ) ≥ E∗ ≥ −h^T BR^{−1}B^T h + 2h^T b =: L∗(h),     (3.3.29)

for all h and flow θ. This inequality is a statement of weak duality. To show that a
flow θ is optimal it suffices to find h such that E (θ) = L ∗ (h).
Let θ = i be the unit current in vector form, which satisfies Bθ = b by our
choice of b and Kirchhoff’s node law (i.e., (3.3.21)). The suitable dual turns out to
be the corresponding voltage h = v in vector form restricted to V \ Z. To see this,
observe that B^T h is the vector of neighboring node differences

B^T h = (h(x) − h(y))_{(x,y) ∈ ~G},     (3.3.30)

where implicitly h|Z ≡ 0. Hence the optimality condition (3.3.28) is nothing but
Ohm’s law (i.e., (3.3.20)) in vector form. Therefore, if i is the unit current and v is
the associated voltage in vector form, it holds that

L∗(v) = L(i; v) = E(i),

where the first equality follows from the fact that i minimizes θ ↦ L(θ; v) by (3.3.28) and the second equality follows from the fact that Bi = b. So we must have E(i) = E∗ by weak duality (i.e., (3.3.29)).
As for uniqueness, it can be checked that two minimizers θ, θ′ satisfy

E∗ = (E(θ) + E(θ′))/2 = E((θ + θ′)/2) + E((θ − θ′)/2),

by definition of the energy. The first term in the rightmost expression is greater than or equal to E∗ since the average of two unit flows is still a unit flow. The second term is nonnegative by definition. Hence the latter must be zero, and the only way for this to happen is if θ = θ′.
To conclude the proof, it remains to compute the optimal value. The matrix BR^{−1}B^T is related to the Laplacian associated to random walk on N (see Section 3.3.1) up to a row scaling. Multiplying by row x ∈ V \ Z involves taking a conductance-weighted average of the neighboring values and subtracting the value at x, that is,

[BR^{−1}B^T v]_x = Σ_{y:(x,y)∈~G} c(x, y)(v(x) − v(y)) − Σ_{y:(y,x)∈~G} c(y, x)(v(y) − v(x))
                 = Σ_{y:y∼x} c(x, y)(v(x) − v(y)),

where we used (3.3.30) and the facts that r(x, y)^{−1} = c(x, y) and c(x, y) = c(y, x), and it is assumed implicitly that v|Z ≡ 0. By Corollary 3.3.13, this is zero except for the row x = a, where it is

Σ_{y:y∼a} c(a, y)[v(a) − v(y)] = Σ_{y:y∼a} i(a, y) = 1,

where we used Ohm’s law and the fact that the current has unit strength. We have

finally

E∗ = L∗(v)
   = −v^T BR^{−1}B^T v + 2v^T b
   = −v(a) + 2v(a)
   = v(a)
   = R(a ↔ Z),

where the last equality holds by Definition 3.3.19, since the current has unit strength. That concludes the proof.
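Both halves of Thomson's principle can be observed numerically: the unit current's energy equals R(a ↔ Z), while perturbing the flow around a cycle, which preserves the flow-conservation constraints, strictly increases the energy. The sketch below is not from the text (it assumes NumPy) and uses a small complete graph with random conductances.

```python
import numpy as np

# Sketch: unit current via the harmonic system, its energy vs R(a <-> Z),
# and a cycle perturbation that keeps the flow constraints but costs energy.
rng = np.random.default_rng(3)
n = 6
a, z = 0, n - 1                      # source a, sink Z = {z}
C = np.zeros((n, n))
for x in range(n):
    for y in range(x + 1, n):
        C[x, y] = C[y, x] = rng.uniform(0.5, 2.0)   # complete graph, random c(x,y)

deg = C.sum(axis=1)
P = C / deg[:, None]
W = list(range(1, n - 1))
v = np.zeros(n)
v[a] = 1.0                           # voltage with v(a) = 1, v(z) = 0
v[W] = np.linalg.solve(np.eye(len(W)) - P[np.ix_(W, W)], P[W, a])
i = C * (v[:, None] - v[None, :])    # Ohm's law: i(x,y) = c(x,y)(v(x) - v(y))
strength = i[a].sum()
i_unit = i / strength
R_eff = v[a] / strength              # R(a <-> Z) = v(a)/||i||

def energy(th):
    mask = C > 0
    return 0.5 * np.sum(th[mask] ** 2 / C[mask])

assert abs(energy(i_unit) - R_eff) < 1e-10

# Pushing delta around the cycle a -> 1 -> 2 -> a keeps every flow constraint;
# by Thomson's principle the energy can only go up (strictly, by uniqueness).
delta, pert = 0.1, i_unit.copy()
for (x, y) in [(a, 1), (1, 2), (2, a)]:
    pert[x, y] += delta
    pert[y, x] -= delta
assert energy(pert) > R_eff
```

The cross term between the current and the cycle perturbation vanishes by Kirchhoff's cycle law, so the energy increase is exactly the energy of the added cycle flow.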

Observe that the convex combination α minimizing the sum of squares Σ_j α_j^2 is the constant one. In a similar manner, Thomson's principle (Theorem 3.3.23) stipulates
roughly speaking that the more the flow can be spread out over the network, the
lower is the effective resistance (penalizing flow on edges with higher resistance).
Pólya’s theorem below provides a vivid illustration. Here is a simple example
suggesting that, in a sense, the current is indeed a well-distributed flow.

Example 3.3.24 (Random walk on the complete graph). Let N be the complete
graph on {1, . . . , n} with unit resistances, and let a = 1 and Z = {n}. Assume
n > 2. The effective resistance is straightforward to compute in this case. Indeed,
the escape probability (with a slight abuse of notation) is
$$P[1 \to n] = \frac{1}{n-1} + \frac{1}{2}\left(1 - \frac{1}{n-1}\right) = \frac{n}{2(n-1)},$$
as we either jump to n immediately or jump to one of the remaining nodes, in which
case we reach n first with probability 1/2 by symmetry. Hence, since c(1) = n − 1,
we get
$$R(1 \leftrightarrow n) = \frac{2}{n},$$
from the definition of the effective resistance (Definition 3.3.19).
We now look for the optimal flow in Thomson’s principle. Pushing a flow of 1
through the edge {1, n} gives an upper bound of 1, which is far from the optimal
2/n. Spreading the flow a bit more by pushing 1/2 through the edge {1, n} and 1/2
through the path 1 ∼ 2 ∼ n gives the slightly better bound 3 · (1/2)² = 3/4.
Taking this further, pushing a flow of 1/(n − 1) through {1, n} as well as through
each two-edge path to n via the remaining neighbors of 1 gives the yet improved bound
$$\left(\frac{1}{n-1}\right)^2 + 2(n-2)\left(\frac{1}{n-1}\right)^2 = \frac{2n-3}{(n-1)^2} = \frac{2}{n} \cdot \frac{2n^2 - 3n}{2n^2 - 4n + 2} > \frac{2}{n},$$
when n > 2. Because the direct path from 1 to n has a somewhat lower resistance,
the optimal flow is obtained by increasing the flow on that edge slightly. Namely,
for a flow α on {1, n} (and the rest divided up evenly among the two-edge paths),
we get an energy of $\alpha^2 + 2(n-2)\left[\frac{1-\alpha}{n-2}\right]^2$, which is minimized at α = 2/n, where it is
indeed
$$\left(\frac{2}{n}\right)^2 + \frac{2}{n-2}\left(\frac{n-2}{n}\right)^2 = \frac{4}{n^2} + \frac{2(n-2)}{n^2} = \frac{2}{n}.$$
J
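The trial flows in this example can be checked numerically. The sketch below evaluates the closed-form energies from the text for n = 10 and scans the one-parameter family of flows to confirm the minimum at α = 2/n:

```python
# Energies of the trial flows on the complete graph K_n with unit resistances,
# source 1 and sink n (closed forms taken from the example).
n = 10

direct = 1.0 ** 2                                        # all flow on edge {1, n}
two_paths = 3 * 0.5 ** 2                                 # half direct, half via 1 ~ 2 ~ n
spread = (1 / (n - 1)) ** 2 + 2 * (n - 2) * (1 / (n - 1)) ** 2  # spread evenly

def energy(alpha):
    # Flow alpha on {1, n}, remainder split evenly over the n-2 two-edge paths.
    return alpha ** 2 + 2 * (n - 2) * ((1 - alpha) / (n - 2)) ** 2

# Grid search over alpha confirms the minimizer alpha = 2/n.
best = min(energy(a / 1000) for a in range(1001))
print(direct, two_paths, spread, best)
```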
As we noted above, the matrix BR^{-1}B^T in the proof of Thomson’s princi-
ple is related to the Laplacian. Because B^T h is the vector of neighboring node
differences, we have
$$h^T BR^{-1}B^T h = \frac{1}{2}\sum_{x,y} c(x,y)[h(y) - h(x)]^2,$$
where we implicitly fix h|Z ≡ 0. This quantity is called the Dirichlet energy.
Thinking of B^T as a “discrete gradient,” the Dirichlet energy can be interpreted as
the weighted norm of the gradient of h. The following is a “dual” to Thomson’s
principle. Exercise 3.15 asks for a proof.
Exercise 3.15 asks for a proof.
Theorem 3.3.25 (Dirichlet’s principle). Let N = (G, c) be a finite, connected
network. The effective conductance between source a and sink Z is characterized
by
$$\mathcal{C}(a \leftrightarrow Z) = \inf\left\{\frac{1}{2}\sum_{x,y} c(x,y)[h(y) - h(x)]^2 : h(a) = 1,\ h|_Z \equiv 0\right\}.$$
The unique minimizer is the voltage v with v(a) = 1.
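One way to see Dirichlet’s principle in action is to compare the Dirichlet energy of the voltage with that of perturbed functions satisfying the same boundary conditions. A minimal sketch on a three-edge path with unit conductances (so R = 3 in series and C = 1/3); the setup is illustrative, not from the text:

```python
import random

# Path network 0 - 1 - 2 - 3 with unit conductances; a = 0, Z = {3}.
edges = [(0, 1), (1, 2), (2, 3)]

def dirichlet(h):
    # Sum over edges of c(x,y)[h(x) - h(y)]^2 (the 1/2 double-count cancels).
    return sum((h[x] - h[y]) ** 2 for x, y in edges)

# The voltage is harmonic on interior vertices: linear, h(a) = 1, h|_Z = 0.
v = [1.0, 2 / 3, 1 / 3, 0.0]
C_eff = dirichlet(v)  # should equal 1/R = 1/3

# Random perturbations of the interior values never beat the voltage.
random.seed(0)
worse = []
for _ in range(100):
    h = [1.0, v[1] + random.uniform(-.3, .3), v[2] + random.uniform(-.3, .3), 0.0]
    worse.append(dirichlet(h))
print(C_eff, min(worse))
```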


The following lower bound is a typical application of Thomson’s principle. See
Pólya’s theorem below for an example of its use. Recall from Section 1.1.1 that,
on a finite graph, a cutset separating a from Z is a set of edges Π such that any
path between a and Z must include at least one edge in Π. Similarly, as defined
in Section 2.3.3, on a countable, locally finite network, a cutset separating a from
∞ is a finite set of edges that must be crossed by any infinite (self-avoiding) path
from a.
Corollary 3.3.26 (Nash-Williams inequality). Let N be a finite, connected network
and let {Π_j}_{j=1}^n be a collection of disjoint cutsets separating source a from sink
Z. Then
$$R(a \leftrightarrow Z) \geq \sum_{j=1}^{n} \left(\sum_{e \in \Pi_j} c(e)\right)^{-1}.$$
Similarly, if N is a countable, locally finite, connected network, then for any col-
lection {Π_j}_j of finite, disjoint cutsets separating a from ∞,
$$R(a \leftrightarrow \infty) \geq \sum_{j} \left(\sum_{e \in \Pi_j} c(e)\right)^{-1}.$$

Proof. Consider the case where N is finite first. We will need the following claim,
which follows immediately from Lemma 1.1.14: for any unit flow θ between a and
Z and any cutset Π_j separating a from Z, it holds that
$$\sum_{e \in \Pi_j} |\theta(e)| \geq \|\theta\| = 1.$$
By Cauchy-Schwarz (Theorem B.4.8),
$$\sum_{e \in \Pi_j} c(e) \sum_{e' \in \Pi_j} r(e')\theta(e')^2 \geq \left(\sum_{e \in \Pi_j} \sqrt{c(e) r(e)}\, |\theta(e)|\right)^2 = \left(\sum_{e \in \Pi_j} |\theta(e)|\right)^2 \geq 1.$$
Rearranging, summing over j, and using the disjointness of the cutsets,
$$\mathcal{E}(\theta) = \frac{1}{2}\sum_{x,y} r(x,y)\theta(x,y)^2 \geq \sum_{j=1}^{n} \sum_{e' \in \Pi_j} r(e')\theta(e')^2 \geq \sum_{j=1}^{n} \left(\sum_{e \in \Pi_j} c(e)\right)^{-1}.$$
Thomson’s principle gives the result.


The infinite case follows from a similar argument using an exhaustive sequence.

The following example is an application of Nash-Williams (Corollary 3.3.26)
and Thomson’s principle to recurrence.

Example 3.3.27 (Biased walk on general trees). Let T be a locally finite tree
with root 0. Consider again the biased walk from Example 3.3.22, that is, the
conductance is λ^{-j} on all edges between level j − 1 and j. Recall the branching
number br(T) from Definition 2.3.10.

Assume λ > br(T). For any ε > 0, there is a cutset Π such that $\sum_{e \in \Pi} \lambda^{-|e|} \leq \varepsilon$.
By Nash-Williams,
$$R(0 \leftrightarrow \infty) \geq \left(\sum_{e \in \Pi} c(e)\right)^{-1} \geq \varepsilon^{-1}.$$
Since ε is arbitrary, the walk is recurrent by Theorem 3.3.20.


Suppose instead that λ < br(T) and let λ < λ_* < br(T). By the proof of
Claim 2.3.11, for all n ≥ 1, there exist ε > 0 and a unit flow φ_n from 0 to the
n-level vertices ∂_n with capacity constraints $|\phi_n(x,y)| \leq \varepsilon^{-1}\lambda_*^{-|e|}$ for all edges
e = {x, y}, where |e| is the graph distance from the root to the endvertex of e
furthest from it. Then, letting F_m = {e : |e| = m}, the energy of the flow is
$$\begin{aligned}
\mathcal{E}(\phi_n) &= \frac{1}{2}\sum_{x,y} r(x,y)\phi_n(x,y)^2 \\
&\leq \sum_{m=1}^{n} \lambda^m \sum_{e=\{x,y\} \in F_m} |\phi_n(x,y)|\, \varepsilon^{-1}\lambda_*^{-m} \\
&= \varepsilon^{-1} \sum_{m=1}^{n} \left(\frac{\lambda}{\lambda_*}\right)^m \sum_{e=\{x,y\} \in F_m} |\phi_n(x,y)| \\
&\leq \varepsilon^{-1} \sum_{m=1}^{+\infty} \left(\frac{\lambda}{\lambda_*}\right)^m \\
&< +\infty,
\end{aligned}$$
where, on the fourth line, we used Lemma 1.1.14 together with the fact that φ_n is a
unit flow and F_m is a cutset separating 0 and ∂_n. Thomson’s principle implies that
R(0 ↔ ∂_n) is uniformly bounded in n. The walk is transient by Theorem 3.3.20.
J
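For a concrete instance of this example, take T to be the regular b-ary tree, for which br(T) = b. Under the conductance convention used here, the b^j level-j edges each have resistance λ^j; by symmetry they combine in parallel, and levels combine in series, giving R(0 ↔ ∂_n) = ∑_{j≤n} (λ/b)^j. The sketch below (an assumed symmetry reduction, not stated in the text) checks that this stays bounded for λ < b and blows up for λ > b:

```python
def R_root_to_level(b, lam, n):
    # b-ary tree, edge resistance lam**j at level j; the b**j parallel level-j
    # edges collapse to resistance (lam/b)**j, and levels add in series.
    return sum((lam / b) ** j for j in range(1, n + 1))

b = 2  # binary tree: br(T) = 2
transient = [R_root_to_level(b, 1.5, n) for n in (10, 100, 1000)]  # lam < br(T)
recurrent = [R_root_to_level(b, 3.0, n) for n in (10, 20, 30)]     # lam > br(T)
print(transient, recurrent)
```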

Another typical application of Thomson’s principle is the following monotonicity
property (which is not obvious from a probabilistic point of view).

Corollary 3.3.28. Adding an edge to a finite, connected network cannot increase
the effective resistance between a source a and a sink Z. In particular, if the added
edge is not incident with a, then P[a → Z] cannot decrease.

Proof. The additional edge enlarges the space of possible flows, so by Thomson’s
principle it can only lower the resistance or leave it as is. The second statement
follows from the definition of the effective resistance.

More generally:

Corollary 3.3.29 (Rayleigh’s principle). Let N and N′ be two networks on the
same finite, connected graph G such that, for each edge in G, the resistance in N′
is greater than it is in N. Then, for any source a and sink Z,
$$R_N(a \leftrightarrow Z) \leq R_{N'}(a \leftrightarrow Z).$$

Proof. Compare the energies of an arbitrary flow on N and N′, and apply Thomson’s
principle.

Note that this corollary implies the previous one by thinking of an absent edge as
one with infinite resistance.
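Rayleigh’s principle is easy to test numerically via the Laplacian-pseudoinverse formula for effective resistance, R(x ↔ y) = (e_x − e_y)^T L^+ (e_x − e_y) (a standard identity, used here as an assumed tool; the network below is made up for illustration):

```python
import numpy as np

def eff_resistance(n, edges, x, y):
    # R(x <-> y) = (e_x - e_y)^T L^+ (e_x - e_y), L the weighted Laplacian.
    L = np.zeros((n, n))
    for u, v, c in edges:
        L[u, u] += c; L[v, v] += c; L[u, v] -= c; L[v, u] -= c
    Lp = np.linalg.pinv(L)
    d = np.zeros(n); d[x], d[y] = 1.0, -1.0
    return d @ Lp @ d

# Triangle 0-1-2 plus a pendant edge to 3 (illustrative conductances).
edges = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.0), (2, 3, 2.0)]
R = eff_resistance(4, edges, 0, 3)

# Raise one resistance: lower the conductance of edge {0, 2}.
edges_hi = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 0.5), (2, 3, 2.0)]
R_hi = eff_resistance(4, edges_hi, 0, 3)
print(R, R_hi)
```

By series-parallel reduction, R = 2/3 + 1/2 = 7/6 and R_hi = 1 + 1/2 = 3/2, so the effective resistance indeed does not decrease.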

Flows to infinity
Combining Theorem 3.3.20 and Thomson’s principle, we derive a flow-based cri-
terion for recurrence. To state the result, it is convenient to introduce the notion
of a unit flow θ from source a to ∞ on a countable, locally finite network: θ is
anti-symmetric, it satisfies the flow-conservation constraint on all vertices but a,
and $\|\theta\| := \sum_{y \sim a} \theta(a, y) = 1$. Note that the energy E(θ) of such a flow is well
defined in [0, +∞].

Theorem 3.3.30 (Recurrence and finite-energy flows). Let N = (G, c) be a count-
able, locally finite, connected network. Vertex a (and hence all vertices) in N is
transient if and only if there is a unit flow from a to ∞ of finite energy.

Proof. Suppose such a flow exists and has energy bounded by B < +∞. Let (Gn )
be an exhaustive sequence with associated sinks (Zn ). A unit flow from a to ∞
on N yields, by projection, a unit flow from a to Zn . This projected flow also has
energy bounded by B. Hence Thomson’s principle implies R(a ↔ Zn ) ≤ B for
all n and transience follows from Theorem 3.3.20.
Proving the other direction involves producing a flow to ∞. Suppose a is
transient and let (G_n) be an exhaustive sequence as above. Then Theorem 3.3.20
implies that R(a ↔ Z_n) ≤ R(a ↔ ∞) < B for some B < +∞, and Thomson’s
principle guarantees in turn the existence of a flow θ_n from a to Z_n with energy
bounded by B. In particular there is a unit current i_n, and associated voltage v_n,
of energy bounded by B. So it remains to use the sequence of current flows (i_n)
to construct a flow to ∞ on the infinite network. The technical point is to show
that the limit of (i_n) exists and is indeed a flow. For this, consider the random
walk on N started at a. Let Y_n(x) be the number of visits to x before hitting
Z_n the first time. By the monotone convergence theorem (Proposition B.4.14),

E_a Y_n(x) → E_a Y_∞(x), where Y_∞(x) is the total number of visits to x. Moreover
E_a Y_∞(x) < +∞ by transience and (3.1.2). By (3.3.23), E_a Y_n(x) = c(x) v_n(x).
So we can now define
$$v_\infty(x) := \lim_n v_n(x) < +\infty,$$
and then
$$i_\infty(x, y) := c(x,y)[v_\infty(x) - v_\infty(y)] = \lim_n c(x,y)[v_n(x) - v_n(y)] = \lim_n i_n(x, y),$$
by Ohm’s law (when n is large enough that both x and y are in G_n). Because i_n is
a flow for all n, by taking limits in the flow-conservation constraints we see that so
is i_∞. Note that, by construction of i_ℓ,
$$\frac{1}{2}\sum_{x,y \in G_n} r(x,y)\, i_\infty(x,y)^2 = \lim_{\ell \to \infty} \frac{1}{2}\sum_{x,y \in G_n} r(x,y)\, i_\ell(x,y)^2 \leq \limsup_{\ell} \mathcal{E}(i_\ell) < B,$$
uniformly in n. Because the left-hand side converges to the energy of i_∞ as n →
+∞, we are done.

We give an application to Pólya’s theorem in Section 3.3.4.


Finally we derive a useful general result illustrating the robustness reaped from
Thomson’s principle. At a high level, a rough embedding from N to N′ is a
mapping of the edges of N to paths of N′ of comparable overall resistance that
do not overlap much. The formal definition follows. As we will see, the purpose
of a rough embedding is to allow a flow on N to be morphed into a flow on N′ of
comparable energy.
Definition 3.3.31 (Rough embedding). Let N = (G, c) and N′ = (G′, c′) be
networks with resistances r and r′ respectively. We say that a map φ from the
vertices of G to the vertices of G′ is a rough embedding if there are constants
α, β < +∞ and a map Φ defined on the edges of G such that:

1. for every edge e = {x, y} in G, Φ(e) is a non-empty path of edges of G′
between φ(x) and φ(y) such that
$$\sum_{e' \in \Phi(e)} r'(e') \leq \alpha\, r(e);$$

2. for every edge e′ in G′, there are no more than β edges in G whose image
under Φ contains e′.

The map φ need not in general be a bijection.

We say that two networks are roughly equivalent if there exist rough embeddings
between them, one in each direction.
Example 3.3.32 (Independent-coordinate random walk). Let N = L^d with unit
resistances and let N′ be the network corresponding to the independent-coordinate
random walk
$$(Y_t^{(1)}, \ldots, Y_t^{(d)}),$$
where each coordinate $(Y_t^{(i)})$ is an independent simple random walk on Z started
at 0. For example, the neighborhood of the origin in N′ is {(x_1, . . . , x_d) : x_i ∈
{−1, 1}, ∀i}. Note that N′ contains only those points of Z^d with coordinates of
identical parities.
Despite encoding quite different random walks, we claim that the networks N
and N 0 are roughly equivalent.

• N to N′: Consider the map φ which associates to each x ∈ N a closest
point in N′, chosen in some arbitrary manner. For Φ, associate to each edge
e = {x, y} ∈ N a shortest path in N′ between φ(x) and φ(y), again chosen
arbitrarily. If φ(x) = φ(y), choose an arbitrary, non-empty, shortest cycle
through φ(x).

• N′ to N: Consider the map φ which associates to each x ∈ N′ the corre-
sponding point x in N. Construct Φ similarly to the previous case.

Exercise 3.19 asks for a rigorous proof of rough equivalence. See also Exer-
cise 3.20 for an important generalization of this example. J

Our main result about roughly equivalent networks is that they have the same
type.

Theorem 3.3.33 (Recurrence and rough equivalence). Let N and N′ be roughly
equivalent, locally finite, connected networks. Then N is transient if and only if
N′ is transient.

Proof. Assume N is transient and let θ be a unit flow from some a to ∞ of finite
energy. The existence of this flow is guaranteed by Theorem 3.3.30. Let φ, Φ be a
rough embedding from N to N′ with parameters α and β.
The basic idea of the proof is to map the flow θ onto N′ using Φ. Because
flows are directional, it will be convenient to think of edges as being directed.

Figure 3.5: The flow on (x′, y′) is the sum of the flows on (x_1, y_1), (x_2, y_2), and
(x_3, y_3).

For e = {x, y} in N, let $\vec{\Phi}(x, y)$ be the path Φ(e) oriented from φ(x) to φ(y). So
$(x', y') \in \vec{\Phi}(x, y)$ means that {x′, y′} ∈ Φ(e) and that x′ is visited before y′ in the
path Φ(e) from φ(x) to φ(y). (If φ(x) = φ(y), choose an arbitrary orientation of
the cycle Φ(e) for $\vec{\Phi}(x, y)$ and the reversed orientation for $\vec{\Phi}(y, x)$.) Then define,
for x′, y′ with {x′, y′} in N′,
$$\theta'(x', y') := \sum_{(x,y) : (x',y') \in \vec{\Phi}(x,y)} \theta(x, y). \tag{3.3.31}$$
See Figure 3.5.


We claim that θ′ is a flow to ∞ of finite energy on N′. We first check that θ′ is
a flow.

1. (Anti-symmetry) By construction, θ′(y′, x′) = −θ′(x′, y′), that is, θ′ is anti-
symmetric, because θ itself is anti-symmetric. We used the fact that $\vec{\Phi}(y, x)$
is $\vec{\Phi}(x, y)$ oriented in the opposite direction.

2. (Flow conservation) Next we check the flow-conservation constraints. Fix
z′ in N′. By Condition 2 in Definition 3.3.31, there are finitely many edges
e in N such that Φ(e) visits z′. Let e = {x, y} be such an edge. There are
two cases:

   - Assume first that φ(x), φ(y) ≠ z′ and let (u′, z′), (z′, w′) be the di-
rected edges incident with z′ on $\vec{\Phi}(x, y)$. Observe that, in the def-
inition of θ′, (y, x) contributes θ(y, x) = −θ(x, y) to θ′(z′, u′) and
(x, y) contributes θ(x, y) to θ′(z′, w′). So these contributions can-
cel out in the flow-conservation constraint for z′, that is, in the sum
$\sum_{v' : v' \sim z'} \theta'(z', v')$.

   - If instead e = {x, y} is such that φ(x) = z′, let (z′, w′) be the first
edge on the path $\vec{\Phi}(x, y)$. Edge (x, y) contributes θ(x, y) to θ′(z′, w′).
A similar statement applies to φ(y) = z′ by changing the role of x and
y. This case also applies to φ(x) = φ(y) = z′.

From the two cases above, summing over all paths visiting z′ gives
$$\sum_{v' : v' \sim z'} \theta'(z', v') = \sum_{z : \phi(z) = z'} \left(\sum_{v : v \sim z} \theta(z, v)\right).$$
Because θ is a flow, the sum in parentheses is 0 if z ≠ a and 1 otherwise. So
the right-hand side is 0 unless a ∈ φ^{-1}({z′}), in which case it is 1.
We have shown that θ′ is a unit flow from φ(a) to ∞. It remains to bound the
energy of θ′. By (3.3.31), Cauchy-Schwarz, and Condition 2 in Definition 3.3.31,
$$\theta'(x', y')^2 = \left(\sum_{(x,y) : (x',y') \in \vec{\Phi}(x,y)} \theta(x, y)\right)^2 \leq \left(\sum_{(x,y) : (x',y') \in \vec{\Phi}(x,y)} 1\right)\left(\sum_{(x,y) : (x',y') \in \vec{\Phi}(x,y)} \theta(x, y)^2\right) \leq \beta \sum_{(x,y) : (x',y') \in \vec{\Phi}(x,y)} \theta(x, y)^2.$$

Summing over all pairs and using Condition 1 in Definition 3.3.31 gives
$$\frac{1}{2}\sum_{x',y'} r'(x', y')\theta'(x', y')^2 \leq \beta\, \frac{1}{2}\sum_{x',y'} r'(x', y') \sum_{(x,y) : (x',y') \in \vec{\Phi}(x,y)} \theta(x, y)^2 = \beta\, \frac{1}{2}\sum_{x,y} \theta(x, y)^2 \sum_{(x',y') \in \vec{\Phi}(x,y)} r'(x', y') \leq \alpha\beta\, \frac{1}{2}\sum_{x,y} r(x, y)\theta(x, y)^2,$$
which is finite by assumption. That concludes the proof.

As an application, we give a second proof of Pólya’s theorem in Section 3.3.4.

Other applications
So far we have emphasized applications to recurrence. Here we show that electrical
network theory can also be used to bound commute times. In Section 3.3.5, we give
further applications beyond random walks on graphs.
An application of Corollary 3.1.24 gives another probabilistic interpretation of
the effective resistance—and a useful formula.

Theorem 3.3.34 (Commute time identity). Let N = (G, c) be a finite, connected
network with vertex set V. For x ≠ y, let the commute time τ_{x,y} be the time of the
first return to x after the first visit to y. Then
$$E_x[\tau_{x,y}] = E_x[\tau_y] + E_y[\tau_x] = c_N\, R(x \leftrightarrow y),$$
where $c_N = 2\sum_{e = \{x,y\} \in N} c(e)$.

Proof. This follows immediately from Corollary 3.1.24 and the definition of the
effective resistance (Definition 3.3.19). Specifically,
$$E_x[\tau_y] + E_y[\tau_x] = \frac{1}{\pi_x\, P_x[\tau_y < \tau_x^+]} = \frac{1}{\left(2\sum_{e = \{x,y\} \in N} c(e)\right)^{-1} c(x)\, P_x[\tau_y < \tau_x^+]} = c_N\, R(x \leftrightarrow y).$$

Example 3.3.35 (Random walk on the torus). Consider random walk on the d-
dimensional torus L_n^d with unit resistances. We use the commute time identity to
lower bound the mean hitting time E_x[τ_y] for arbitrary vertices x ≠ y at graph
distance k on L_n^d. To use the commute time identity (Theorem 3.3.34), note that
by symmetry E_x[τ_y] = E_y[τ_x], so that
$$E_x[\tau_y] = \frac{1}{2} c_N\, R(x \leftrightarrow y) = d n^d\, R(x \leftrightarrow y), \tag{3.3.32}$$
where we used that the number of vertices is n^d and the graph is 2d-regular.
To simplify, assume n is odd and identify the vertices of L_n^d with the box
$$B := \{-(n-1)/2, \ldots, (n-1)/2\}^d,$$

in Ld centered at x = 0. Let ∂Bj∞ = {z ∈ Ld : kzk∞ = j} and let Πj be the set


of edges between ∂Bj∞ and ∂Bj+1 ∞ . Note that on B the `1 norm of y is at most k

(the graph distance between x = 0 and y). Since the `∞ norm is at least 1/d times
the `1 norm on Ld , there exists J = O(k) such that all Πj s, j ≤ J, are cutsets
separating x from y. By the Nash-Williams inequality
(
X
−1
X 
−(d−1)
 Ω(log k), d = 2
R(x ↔ y) ≥ |Πj | = Ω j =
0≤j≤J 0≤j≤J
Ω(1), d ≥ 3.

From (3.3.32), we get:

Claim 3.3.36.
$$E_x[\tau_y] = \begin{cases} \Omega(n^d \log k), & d = 2, \\ \Omega(n^d), & d \geq 3. \end{cases}$$
J
Remark 3.3.37. The bounds in the previous example are tight up to constants. See [LPW06,
Proposition 10.13]. Note that the case d ≥ 3 does not in fact depend on the distance k.

See Exercise 3.22 for an application of the commute time identity to cover
times.

3.3.4 Random walks: Pólya’s theorem, two ways


The following is a classical result.

Theorem 3.3.38 (Pólya’s theorem). Random walk on L^d is recurrent for d ≤ 2
and transient for d ≥ 3.

We prove the theorem for d = 2, 3 using the tools developed in the previous sub-
section. The other cases follow by Rayleigh’s principle (Corollary 3.3.29). There
are elementary proofs of this result. But we showed above that the electrical net-
work approach has the advantage of being robust to the details of the lattice. For a
different argument, see Exercise 2.10.
The case d = 2 follows from the Nash-Williams inequality (Corollary 3.3.26)
by letting Π_j be the set of edges connecting vertices of ℓ∞ norm j and j + 1. Using
the fact that all conductances are 1, that |Π_j| = O(j), and that $\sum_j j^{-1}$ diverges,
recurrence is established by Theorem 3.3.20.

First proof
Now consider the case d = 3 and let a = 0 be the origin.
We construct a finite-energy flow to ∞ using the method of random paths. Note
that a simple way to produce a unit flow to ∞ is to push a flow of 1 through an
infinite path (which, recall, is self-avoiding by definition). Taking this a step fur-
ther, let μ be a probability measure on infinite paths and define the anti-symmetric
function
$$\theta(x, y) := E[\mathbf{1}_{(x,y) \in \Gamma} - \mathbf{1}_{(y,x) \in \Gamma}] = P[(x, y) \in \Gamma] - P[(y, x) \in \Gamma],$$
where Γ is a random path distributed according to μ, oriented away from 0. (We
will give an explicit construction below where the appropriate formal probability
space will be clear.) Observe that $\sum_{y \sim x} [\mathbf{1}_{(x,y) \in \Gamma} - \mathbf{1}_{(y,x) \in \Gamma}] = 0$ for any x ≠ 0
because vertices visited by Γ are entered and exited exactly once. That same sum
is 1 at x = 0. Hence θ is a unit flow to ∞. For edge e = {x, y}, consider the
following “edge marginal” of μ:
$$\mu(e) := P[(x, y) \in \Gamma \text{ or } (y, x) \in \Gamma] = P[(x, y) \in \Gamma] + P[(y, x) \in \Gamma] \geq \theta(x, y),$$
where we used that a path Γ cannot visit both (x, y) and (y, x) by definition. Then
we get the following bound.
Claim 3.3.39 (Method of random paths).
$$\mathcal{E}(\theta) \leq \sum_{e} \mu(e)^2. \tag{3.3.33}$$

For a measure μ concentrated on a single path, the sum above is infinite. To
obtain a useful bound, what we need is a large collection of spread-out paths. On
the lattice L^3, we construct μ as follows. Let U be a uniformly random point on
the unit sphere in R^3 and let γ be the ray from 0 to ∞ going through U. Imagine
centering a unit cube around each point in Z^3 whose edges are aligned with the
axes. Then γ traverses an infinite number of such cubes. Let Γ be the correspond-
ing path in the lattice L^3. To see that this procedure indeed produces a path, observe
that γ, upon exiting a cube around a point z ∈ Z^3, enters the cube of a neighboring
point z′ ∈ Z^3 through a face corresponding to the edge between z and z′ on the
lattice L^3 (unless it goes through a corner of the cube, but this has probability 0).
To argue that μ distributes its mass among sufficiently spread-out paths, we bound
the probability that a vertex is visited by Γ. Let z be an arbitrary vertex in Z^3.
Because the sphere of radius $\|z\|_2$ around the origin in R^3 has area $O(\|z\|_2^2)$ and
its intersection with the unit cube centered around z has area O(1), it follows that
$$P[z \in \Gamma] = O\big(1/\|z\|_2^2\big).$$

That immediately implies a similar bound on the probability that an edge is visited
by Γ. Moreover:

Lemma 3.3.40. There are O(j²) edges with an endpoint at ℓ² distance within
[j, j + 1] from the origin.

Proof. Consider an open ball of ℓ² radius 1/2 centered around each vertex of ℓ²
norm within [j, j + 1]. Those balls are non-intersecting and have total volume
Θ(N_j), where N_j is the number of such vertices. On the other hand, the volume of
the shell of ℓ² inner and outer radii j − 1/2 and j + 3/2 centered around the origin
(where all those balls lie) is
$$\frac{4}{3}\pi(j + 3/2)^3 - \frac{4}{3}\pi(j - 1/2)^3 = O(j^2).$$
Hence N_j = O(j²). Finally, note that each vertex has 6 incident edges.

Plugging those bounds into (3.3.33), we get
$$\mathcal{E}(\theta) \leq \sum_{j} O(j^2) \cdot O(1/j^2)^2 = O\Big(\sum_{j} j^{-2}\Big) < +\infty.$$
Transience follows from Theorem 3.3.30. (This argument clearly does not work
on L, where there are only two rays. You should convince yourself that it does not
work on L² either. But see Exercise 3.17.)

Second proof
We briefly describe a second proof based on the independent-coordinate random
walk. Consider the networks N and N′ in Example 3.3.32. Because they are
roughly equivalent (Definition 3.3.31), they have the same type by Theorem 3.3.33.
Recall that, because the number of returns to 0 is geometric with success probabil-
ity equal to the escape probability, random walk on N′ is transient if and only if
the expected number of visits to 0 is finite (see (3.1.2)). By independence of the
coordinates, this expectation can be written as
$$\sum_{t \geq 0} P\big[Y_{2t}^{(1)} = 0\big]^d = \sum_{t \geq 0} \left(\binom{2t}{t} 2^{-2t}\right)^d = \sum_{t \geq 0} \Theta(t^{-d/2}),$$
where we used Stirling’s formula (see Appendix A). The rightmost sum is finite
if and only if d ≥ 3. That implies random walk on N′ is transient under that
condition. By rough equivalence, the same is true of N.
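The convergence claim can be checked numerically. The sketch below computes a_t = C(2t, t) 2^{-2t} via the recursion a_t = a_{t-1}(2t − 1)/(2t), confirms the Stirling asymptotics a_t √(πt) → 1, and contrasts tail sums of a_t^d for d = 2 (still growing) and d = 3 (negligible):

```python
from math import pi, sqrt

# a_t = P[Y_{2t} = 0] = C(2t, t) 4^{-t}, via a_t = a_{t-1} * (2t-1)/(2t).
T = 5000
a = [1.0] * (T + 1)
for t in range(1, T + 1):
    a[t] = a[t - 1] * (2 * t - 1) / (2 * t)

ratios = [a[t] * sqrt(pi * t) for t in (10, 100, 1000)]   # should approach 1
tail2 = sum(a[t] ** 2 for t in range(1001, T + 1))        # d = 2: diverging sum
tail3 = sum(a[t] ** 3 for t in range(1001, T + 1))        # d = 3: converging sum
print(ratios, tail2, tail3)
```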

3.3.5 Randomized algorithms: Wilson’s method for generating uniform spanning trees

In this section, we describe an application of electrical network theory to spanning
trees.
With a slight abuse of notation, we use e ∈ G to indicate that e is an edge of
G.

Uniform spanning trees Let G = (V, E) be a finite connected graph. Recall
that a spanning tree is a subtree of G containing all its vertices. Such a tree has
|V| − 1 edges. A uniform spanning tree is a spanning tree T chosen uniformly at
random among all spanning trees of G.
We make some simple observations first. Because G is connected, it has at least
one spanning tree by Corollary 1.1.6. Moreover, for any edge e ∈ G, there always
exists at least one spanning tree including it. To see this, let T′ be any spanning tree
of G, which exists by the previous observation. If e ∉ T′, then we obtain a new
spanning tree by adding e to T′ and removing one edge ≠ e in the cycle created.
As a consequence, the probability of inclusion P[e ∈ T] in a uniform spanning tree
T cannot be 0. It is however possible for P[e ∈ T] to equal 1 if removing e
disconnects the graph. Such an edge is called a bridge.
A fundamental property of uniform spanning trees is the following negative
correlation between edges.

Claim 3.3.41. For a uniform spanning tree T of a connected graph G,
$$P[e \in T \mid e' \in T] \leq P[e \in T], \qquad \forall e \neq e' \in G.$$

This property is perhaps not surprising. For one, the number of edges in a spanning
tree is fixed, so the inclusion of e0 makes it seemingly less likely for other edges to
be present. Yet proving Claim 3.3.41 is not trivial. The proof relies on the electrical
network perspective. The key is a remarkable formula for the inclusion of an edge
in a uniform spanning tree.

Theorem 3.3.42 (Kirchhoff’s resistance formula). Let G = (V, E) be a finite,
connected graph and let N be the network on G with unit resistances. If T is a
uniform spanning tree on G, then for all e = {x, y},
$$P[e \in T] = R(x \leftrightarrow y).$$

Before explaining how this formula arises, we show that it implies Claim 3.3.41.

Proof of Claim 3.3.41. Recall that P[e′ ∈ T] ≠ 0. By the law of total probability,
$$P[e \in T] = P[e \in T \mid e' \in T]\, P[e' \in T] + P[e \in T \mid e' \notin T]\, P[e' \notin T],$$
so, since P[e′ ∈ T] + P[e′ ∉ T] = 1, we can instead prove
$$P[e \in T \mid e' \notin T] \geq P[e \in T]. \tag{3.3.34}$$
Picking a uniform spanning tree on N conditioned on {e′ ∉ T} is the same as
picking a uniform spanning tree on the modified network N′ where e′ is removed.
By Rayleigh’s principle (in the form of Corollary 3.3.28),
$$R_{N'}(x \leftrightarrow y) \geq R_N(x \leftrightarrow y),$$
and Kirchhoff’s resistance formula (Theorem 3.3.42) gives (3.3.34).


Remark 3.3.43. More generally, thinking of a uniform spanning tree T as a random subset
of edges, the law of T has the property of negative associations, defined as follows. An
event A ⊆ 2E is said to be increasing if ω ∪ {e} ∈ A whenever ω ∈ A. The event A
is said to depend only on F ⊆ E if for all ω1 , ω2 ∈ 2E that agree on F , either both
are in A or neither is. The law PT of T has negative associations in the sense that for
any two increasing events A and B that depend only on disjoint sets of edges, we have
PT [A ∩ B] ≤ PT [A]PT [B]. See [LP16, Exercise 4.6].
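Kirchhoff’s resistance formula can be checked by brute force on a small graph, enumerating spanning trees directly and computing effective resistances via the Laplacian pseudoinverse R(x ↔ y) = (e_x − e_y)^T L^+ (e_x − e_y) (a standard identity, used here as an assumed tool; the graph is illustrative):

```python
import numpy as np
from itertools import combinations

n = 4
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]  # 4-cycle plus a chord

def is_spanning_tree(S):
    # n-1 edges and no cycle (union-find) => spanning tree.
    if len(S) != n - 1:
        return False
    parent = list(range(n))
    def find(u):
        while parent[u] != u:
            u = parent[u]
        return u
    for u, v in S:
        ru, rv = find(u), find(v)
        if ru == rv:
            return False
        parent[ru] = rv
    return True

trees = [S for S in combinations(edges, n - 1) if is_spanning_tree(S)]

# Effective resistances from the Laplacian pseudoinverse (unit conductances).
L = np.zeros((n, n))
for u, v in edges:
    L[u, u] += 1; L[v, v] += 1; L[u, v] -= 1; L[v, u] -= 1
Lp = np.linalg.pinv(L)

for u, v in edges:
    p = sum(1 for S in trees if (u, v) in S) / len(trees)
    d = np.zeros(n); d[u], d[v] = 1.0, -1.0
    print((u, v), p, d @ Lp @ d)  # the two values agree, per Theorem 3.3.42
```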

Let e = {x, y}. To get some insight into Kirchhoff’s resistance formula, we
first note that, if i is the unit current from x to y and v is the associated voltage, by
definition of the effective resistance,
$$R(x \leftrightarrow y) = \frac{v(x)}{\|i\|} = c(e)(v(x) - v(y)) = i(x, y), \tag{3.3.35}$$
where we used Ohm’s law (i.e., (3.3.20)) as well as the fact that c(e) = 1, v(y) = 0,
and ‖i‖ = 1. Note that ‖i‖ and i(x, y) are not the same quantity: although ‖i‖ = 1,
i(x, y) is only the current along the edge to y. Furthermore, by the probabilistic
interpretation of the current (Theorem 3.3.17), with Z = {y},
$$i(x, y) = E_x\big[N_{x \to y}^Z - N_{y \to x}^Z\big] = P_x[(x, y) \text{ is traversed before } \tau_y]. \tag{3.3.36}$$
Indeed, started at x, $N_{y \to x}^Z = 0$ and $N_{x \to y}^Z \in \{0, 1\}$. Kirchhoff’s resistance for-
mula is then established by relating the random walk on N to the probability that
e is present in a uniform spanning tree T. To do this we introduce a random-walk-
based algorithm for generating uniform spanning trees. This rather miraculous
procedure, known as Wilson’s method, is of independent interest. (For a classical
connection between random walks and spanning trees, see also Exercise 3.23.)

Wilson’s method It will be somewhat easier to work in a more general context.
Let N = (G, c) be a finite, connected network on G with arbitrary conductances
and define the weight of a spanning tree T on N as
$$W(T) = \prod_{e \in T} c(e).$$
With a slight abuse, we continue to call a tree T picked at random among all span-
ning trees of G with probability proportional to W(T) a “uniform” spanning tree
on N.
To state Wilson’s method, we need the notion of loop erasure. Let P = x_0 ∼
· · · ∼ x_k be a walk in N. The loop erasure of P is obtained by removing cycles in
the order they appear. That is, let j* be the smallest j such that x_j = x_ℓ for some
ℓ < j. Remove the subwalk x_{ℓ+1} ∼ · · · ∼ x_{j*} from P, and repeat. The result is
self-avoiding, that is, a path, and is denoted by LE(P).
Let v_0 be an arbitrary vertex of G, which we refer to as the root, and let T_0
be the subtree made up of v_0 alone. Starting with the root, order arbitrarily the
vertices of G as v_0, . . . , v_{n−1}. Wilson’s method constructs an increasing sequence
of subtrees as follows. See Figure 3.6. Let T := T_0.

1. Let v be the vertex of G not in T with lowest index. Perform random walk
on N started at v until the first visit to a vertex of T . Let P be the resulting
walk.

2. Add the loop erasure LE(P) to T .

3. Repeat until all vertices of G are in T .
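The three steps above can be sketched in code. A common implementation trick (used here; the text’s stack formalism comes later) is to record, for each vertex on the current walk, only its most recent exit edge — overwriting earlier exits performs exactly the loop erasure. All names are illustrative:

```python
import random

def wilson(adj, root=0, seed=1):
    # adj: adjacency dict of an unweighted connected graph.
    rng = random.Random(seed)
    in_tree = {root}
    tree_edges = []
    for start in adj:
        if start in in_tree:
            continue
        # Random walk from `start` until it hits the current tree. For each
        # visited vertex, keep only the latest exit edge: this is loop erasure.
        nxt = {}
        v = start
        while v not in in_tree:
            nxt[v] = rng.choice(adj[v])
            v = nxt[v]
        # Retrace the loop-erased path and add it to the tree.
        v = start
        while v not in in_tree:
            in_tree.add(v)
            tree_edges.append((v, nxt[v]))
            v = nxt[v]
    return tree_edges

def grid_adj(m):
    # m x m grid graph with vertices numbered row by row.
    adj = {}
    for i in range(m):
        for j in range(m):
            v = i * m + j
            nbrs = []
            if i > 0: nbrs.append(v - m)
            if i < m - 1: nbrs.append(v + m)
            if j > 0: nbrs.append(v - 1)
            if j < m - 1: nbrs.append(v + 1)
            adj[v] = nbrs
    return adj

T = wilson(grid_adj(3))
print(T)
```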

Let T_0, . . . , T_m be the sequence of subtrees produced by Wilson’s method.

Claim 3.3.44. Forgetting the root, T_m is a uniform spanning tree on N.

This claim is far from obvious. Before proving it, we finish the proof of Kirch-
hoff’s resistance formula.

Proof of Theorem 3.3.42. From (3.3.35) and (3.3.36), it suffices to prove that, for
e = {x, y},
Px [(x, y) is traversed before τy ] = P[e ∈ T ],
where the probability on the left-hand side refers to random walk on N with unit
resistances started at x and the probability on the right-hand side refers to a uniform
spanning tree T on N. Generate T using Wilson’s method started at root v_0 = y
with the choice v_1 = x. If the walk from x to y during the first iteration of Wilson’s
method includes (x, y), then the loop erasure is simply x ∼ y and e is in T. On
the other hand, if the walk from x to y does not include (x, y), then e cannot be
used at a later stage because it would create a cycle. That immediately proves the
theorem.

Figure 3.6: An illustration of Wilson’s method. The dashed lines indicate erased
loops.

It remains to prove the claim.

Proof of Claim 3.3.44. The idea of the proof is to cast Wilson’s method in the more
general framework of cycle popping algorithms. We begin by explaining how such
algorithms work.
Let P be the transition matrix corresponding to random walk on N = (G, c)
with G = (V, E) and root v_0. To each vertex x ≠ v_0 in V, we assign an indepen-
dent stack of “colored directed edges”
$$S_0^x := (\langle x, Y_1^x \rangle_1, \langle x, Y_2^x \rangle_2, \ldots),$$
where each Y_j^x is chosen independently at random from the distribution P(x, ·).
In particular, all Y_j^x are neighbors of x in N. The index j in ⟨x, Y_j^x⟩_j is the color
of the edge. It keeps track of the position of the edge in the original stack. (Picture
S^x as a spring-loaded plate dispenser located on vertex x.)
We consider a process which involves popping edges off the stacks. We use the notation S^x to denote the current stack at x. The initial assignment of the stack is S^x := S_0^x as above. Given the current stacks (S^x)_x, we call visible graph the (colored) directed graph over V with edges Top(S^x) for all x ≠ v0, where Top(S^x) is the first edge in the current stack S^x. The latter are referred to as visible edges. We denote the current visible graph by G⃗.
Note that G⃗ has out-degree 1 at every x ≠ v0 and the root has out-degree 0. In particular all undirected cycles in G⃗ are in fact directed cycles, and we refer to them simply as cycles. (Indeed a set of edges forming an undirected cycle that is not directed must have a vertex of out-degree 2.) Also recall the following characterization from Corollary 1.1.8: an acyclic, undirected subgraph with |V| vertices and |V| − 1 edges is a spanning tree of G. Hence if there is no cycle in G⃗, then it must be a spanning tree (as an undirected graph) where, furthermore, all edges point towards the root. Such a tree is also known as a spanning arborescence. Once that happens, we are done.
As the name suggests, a cycle popping algorithm proceeds by popping cycles in G⃗ off the tops of the stacks until a spanning arborescence is produced. That is, at every iteration, if G⃗ contains at least one cycle, then a cycle C⃗ is picked according to some rule, the top of each stack on C⃗ is popped, and a new visible graph G⃗ is revealed. See Figure 3.7 for an illustration.
CHAPTER 3. MARTINGALES AND POTENTIALS 223

Figure 3.7: A realization of a cycle popping algorithm (from top to bottom). In all
three figures, the underlying graph is G while the arrows depict the visible edges.
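The stack-and-popping mechanism above can be sketched in code. The following is a minimal Python illustration (ours, not from the text; all names are invented), with the infinite stacks realized lazily: only the current visible edge Top(S^x) is stored, and popping a cycle reveals fresh draws from P(x, ·).

```python
import random

def cycle_popping_arborescence(neighbors, weights, root, rng):
    """Sample a spanning arborescence by cycle popping.

    neighbors[x] lists the out-neighbors of x; weights[x] gives the
    corresponding transition probabilities P(x, .).  The infinite stack
    at each x != root is realized lazily: only the visible edge Top(S^x)
    is stored, and popping reveals a fresh draw from P(x, .).
    """
    vertices = [x for x in neighbors if x != root]
    top = {x: rng.choices(neighbors[x], weights=weights[x])[0] for x in vertices}
    while True:
        cycle = None
        for start in vertices:
            # Follow visible edges; every walk ends at the root or closes a cycle.
            path, seen = start, set()
            while path != root and path not in seen:
                seen.add(path)
                path = top[path]
            if path != root:  # the walk closed a directed cycle through `path`
                cycle, y = [], path
                while True:
                    cycle.append(y)
                    y = top[y]
                    if y == path:
                        break
                break
        if cycle is None:  # no cycle left: the visible graph is an arborescence
            return top     # maps each x != root to its parent (edges point to root)
        for x in cycle:    # pop the cycle off the tops of the stacks
            top[x] = rng.choices(neighbors[x], weights=weights[x])[0]

# Example: random walk on a triangle with unit conductances.
rng = random.Random(0)
nbrs = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
wts = {x: [1, 1] for x in nbrs}
arb = cycle_popping_arborescence(nbrs, wts, root=0, rng=rng)
assert set(arb) == {1, 2}
```

By Claim 3.3.44, the popping rule used to select the next cycle does not affect the law of the output.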

With these definitions in place, the proof of the claim involves the following
steps.

(i) Wilson’s method is a cycle popping algorithm. Recasting Wilson’s method,


we can think of the initial stacks (S_0^x) as corresponding to picking—ahead of
time—all potential transitions in the random walks. With this representation,
the algorithm boils down to a recipe for choosing which cycle to pop next.
Indeed, at each iteration, we start from a vertex v not in the current tree T .
A key observation: following the visible edges from v traces a path whose
distribution is that of random walk on N . Loop erasure then corresponds
to popping cycles as they are closed. We pop only those visible edges on
the removed cycles, as they originate from vertices that will be visited again
by the algorithm and for which a new transition will then be needed. Those
visible edges in the resulting loop-erased path are not popped—note that they
are part of the final arborescence.

(ii) The popping order does not matter. We just argued that Wilson's method is a cycle popping algorithm. In fact we claim that any cycle popping algorithm, that is, no matter what popping choices are made along the way, produces the same final arborescence. To make this precise, we identify the popped cycles uniquely. This is where the colors come in. A colored cycle is a directed cycle over V made of colored edges from the stacks (not necessarily of the same color and not necessarily in the current visible graph). We say that a colored cycle C⃗ is poppable for a visible graph G⃗ if there exists a sequence of colored cycles C⃗_1, ..., C⃗_r = C⃗ that can be popped in that order starting from G⃗. Note that, by this definition, C⃗_1 is a cycle in G⃗. Now we claim that if C⃗′_1 were popped first instead of C⃗_1, producing the new visible graph G⃗′, then C⃗ would still be poppable for G⃗′. This claim implies that, in any cycle popping algorithm, either an infinite number of cycles are popped or eventually all poppable cycles are popped—independently of the order—producing the same outcome. (Note that, while the same cycle may be popped more than once, the same colored cycle cannot.)

To prove the claim, note first that if C⃗′_1 = C⃗_1 or if C⃗′_1 does not share a vertex with any of C⃗_1, ..., C⃗_r there is nothing to prove. So let C⃗_j be the first cycle in the sequence sharing a vertex with C⃗′_1, say x. Let ⟨x, y⟩_c and ⟨x, y′⟩_{c′} be the colored edges emanating from x in C⃗_j and C⃗′_1 respectively. By definition, x is not on any of C⃗_1, ..., C⃗_{j−1} so the edge originating from x is not popped by that sequence and we must have ⟨x, y⟩_c = ⟨x, y′⟩_{c′} as colored edges. In particular, the vertex y is also a shared vertex of C⃗_j and C⃗′_1, and the same argument applies to it. Proceeding by induction leads to the conclusion that C⃗′_1 = C⃗_j as colored cycles. But then C⃗ is clearly poppable for the visible graph resulting from popping C⃗′_1 first, because it can be popped with the rearranged sequence C⃗′_1 = C⃗_j, C⃗_1, ..., C⃗_{j−1}, C⃗_{j+1}, ..., C⃗_r = C⃗, where we used the fact that C⃗′_1 does not share a vertex with C⃗_1, ..., C⃗_{j−1}.

(iii) Termination occurs in finite time almost surely. We have shown so far that, in
any cycle popping algorithm, either an infinite number of cycles are popped
or eventually all poppable cycles are popped. But Wilson’s method—a cycle
popping algorithm as we have shown—stops after a finite amount of time
with probability 1. Indeed, because the network is finite and connected, the
random walk started at each iteration hits the current T in finite time almost
surely (by Lemma 3.1.25). To sum up, all cycle popping algorithms termi-
nate and produce the same spanning arborescence. It remains to compute the
distribution of the outcome.

(iv) The arborescence has the desired distribution. Let A be the spanning ar-
borescence produced by any cycle popping algorithm on the stacks (S0x ). To
compute the distribution of A, we first compute the distribution of a par-
ticular cycle popping realization leading to A. Because the popping order
does not matter, by “realization” we mean a collection C of colored cycles
together with a final spanning arborescence A. Notice that what lies in the
stacks “under” A is not relevant to the realization, that is, the same outcome
is produced no matter what is under A.
So, from the distribution of the stacks, the probability of observing (C, A) is
simply the product of the transitions corresponding to the “popped edges” in
C and the “final edges” in A, that is,
    ∏_{e⃗ ∈ C ∪ A} P(e⃗) = Ψ(A) ∏_{C⃗ ∈ C} Ψ(C⃗),

where the function Ψ returns the product of the transition probabilities of a
set of directed edges. Thanks to the product form on the right-hand side,
summing over all possible Cs gives that the probability of producing A is
proportional to Ψ(A).
For this argument to work though, there are two small details to take care of.
First, note that we want the probability of the “uncolored” arborescence. But
observe that, in fact, there is no need to keep track of the colors on the edges
of A because these are determined by C. Secondly, we need for the collection

of possible Cs not to vary with A. But it is clear that any arborescence could
lie under any C.

To see that we are done, let T be the undirected spanning tree corresponding to
the outcome, A, of Wilson's method. Then, because P(x, y) = c(x, y)/c(x), we get

    Ψ(A) = W(T) / ∏_{x ≠ v0} c(x),

where note that the denominator does not depend on T . So if we forget the orien-
tation of A which is determined by the root (i.e., sum over all choices of root), we
get a spanning tree whose distribution is proportional to W (T ), as required.
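For comparison, here is a compact sketch of Wilson's method itself on a graph with unit conductances (an illustration of ours, not code from the text). The dictionary nxt records only the last exit edge of the current walk from each vertex, which performs loop erasure implicitly: closing a loop overwrites (pops) the looped transitions.

```python
import random

def wilson_spanning_tree(neighbors, root, rng):
    """Sample a uniform spanning tree (unit conductances), rooted at `root`,
    by Wilson's method with loop-erased random walks."""
    parent = {root: None}  # current tree T, stored as vertex -> parent
    for v in neighbors:
        if v in parent:
            continue
        nxt, x = {}, v
        while x not in parent:       # random walk from v until it hits T
            nxt[x] = rng.choice(neighbors[x])
            x = nxt[x]
        x = v                        # add the loop-erased path to T
        while x not in parent:
            parent[x] = nxt[x]
            x = nxt[x]
    return parent

# Example: the cycle graph on four vertices.
rng = random.Random(1)
nbrs = {i: [(i - 1) % 4, (i + 1) % 4] for i in range(4)}
tree = wilson_spanning_tree(nbrs, root=0, rng=rng)
assert len(tree) == 4 and tree[0] is None
```

Forgetting the orientation (the root) yields, by the theorem just proved, a sample from the weighted spanning tree distribution, here the uniform one.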

Exercises
Exercise 3.1 (Reflection). Give a rigorous proof of Theorem 3.1.9 through a for-
mal application of the strong Markov property (i.e., specify ft and Ft in Theo-
rem 3.1.8).
Exercise 3.2 (Time of k-th return). Give a rigorous proof of (3.1.1) through a
formal application of the strong Markov property (i.e., specify ft and Ft in Theo-
rem 3.1.8).
Exercise 3.3 (Tightness of Matthews’ bounds). Show that the bounds (3.1.6) and (3.1.7)
are tight up to smaller order terms for the coupon collector problem (Example 2.1.4).
[Hint: State the problem in terms of the cover time of a random walk on the com-
plete graph with self-loops.]
Exercise 3.4 (Pólya's urn: a surprisingly simple formula). Consider the setting of
Example 3.1.49. Prove that
 
    P[G_t = m + 1] = (t choose m) · m!(t − m)!/(t + 1)!.
[Hint: Consider the probability of one particular sequence of outcomes producing
the desired event.]
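The formula can be sanity-checked numerically (assuming, as in the standard Pólya urn, that the urn of Example 3.1.49 starts with one green and one red ball and the drawn ball is returned with one extra ball of its color): the exact law of G_t is computable by dynamic programming, and the right-hand side simplifies to 1/(t + 1).

```python
from math import comb, factorial

def polya_green_distribution(t):
    """Exact law of the number of green balls after t draws from an urn
    started with one green and one red ball (the drawn ball is returned
    together with one extra ball of the same color)."""
    dist = {1: 1.0}                  # after 0 draws: one green ball
    for s in range(t):
        total = s + 2                # balls in the urn before draw s + 1
        new = {}
        for g, p in dist.items():
            new[g + 1] = new.get(g + 1, 0.0) + p * g / total        # drew green
            new[g] = new.get(g, 0.0) + p * (total - g) / total      # drew red
        dist = new
    return dist

t = 6
dist = polya_green_distribution(t)
for m in range(t + 1):
    rhs = comb(t, m) * factorial(m) * factorial(t - m) / factorial(t + 1)
    assert abs(dist[m + 1] - rhs) < 1e-12    # both sides equal 1/(t + 1)
```

In other words, G_t is uniform on {1, ..., t + 1}, which is the "surprisingly simple" content of the exercise.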
Exercise 3.5 (Optional stopping theorem). Give a rigorous proof of the remaining
cases of the optional stopping theorem (Theorem 3.1.38).
Exercise 3.6 (Supermartingale inequality). Let (Mt ) be a nonnegative, supermartin-
gale. Show that, for any b > 0,
 
    P[ sup_{s≥0} M_s ≥ b ] ≤ E[M_0]/b.

[Hint: Mimic the proof of the submartingale case.]


Exercise 3.7 (Azuma-Hoeffding: a second proof). This exercise leads the reader
through an alternative proof of the Azuma-Hoeffding inequality.
(i) Show that for all x ∈ [−1, 1] and a > 0

    e^{ax} ≤ cosh a + x sinh a.

(ii) Use a Taylor expansion to show that for all x

    cosh x ≤ e^{x²/2}.

(iii) Let X_1, ..., X_n be (not necessarily independent) random variables such that, for all i, |X_i| ≤ c_i for some constant c_i < +∞ and

    E[X_{i_1} · · · X_{i_k}] = 0,  ∀ 1 ≤ k ≤ n, ∀ 1 ≤ i_1 < · · · < i_k ≤ n.  (3.3.37)

Show, using (i) and (ii), that for all b > 0

    P[ ∑_{i=1}^n X_i ≥ b ] ≤ exp( −b² / (2 ∑_{i=1}^n c_i²) ).

(iv) Prove that (iii) implies a variant of the Azuma-Hoeffding inequality (Theo-
rem 3.2.1) for bounded increments.

(v) Show that the random variables in Exercise 2.6 (after centering) do not sat-
isfy (3.3.37) (without using the claim in part (ii) of that exercise).

Exercise 3.8 (Lipschitz condition). Give a rigorous proof of Lemma 3.2.31.

Exercise 3.9 (Lower bound on expected spectral norm). Let A be an n × n random matrix. Assume that the entries A_{i,j}, i, j = 1, ..., n, are independent, centered random variables in [−1, 1]. Suppose further that there is 0 < σ² < +∞ such that Var[A_{i,j}] ≥ σ² for all i, j. Show that there is 0 < c < +∞ such that

    E‖A‖ ≥ c √n,

for n large enough. [Hint: Use the fact that ‖A‖ ≥ ‖Ae_1‖_2 together with Chebyshev's inequality.]

Exercise 3.10 (Kirchhoff’s laws). Consider a finite, connected network with a


source and a sink. Show that an anti-symmetric function on the edges satisfying
Kirchhoff’s two laws is a current function (i.e., it corresponds to a voltage function
through Ohm’s law).

Exercise 3.11 (Dirichlet problem: non-uniqueness). Let (Xt ) be the birth-and-


death chain on Z+ with P (x, x + 1) = p and P (x, x − 1) = 1 − p for all x ≥ 1,
and P (0, 1) = 1, for some 0 < p < 1. Fix h(0) = 1.

(i) When p > 1/2, show that there is more than one bounded extension of h to
Z+ \{0} that is harmonic on Z+ \{0}. [Hint: Consider Px [τ0 = +∞].]

(ii) When p ≤ 1/2, show that there exists a unique bounded extension of h to
Z+ \{0} that is harmonic on Z+ \{0}.

Exercise 3.12 (Maximum principle). Let N = (G, c) be a finite or countable,


connected network with G = (V, E). Let W be a finite, connected, proper subset
of V .
(i) Let h : V → R be a function on V . Prove the maximum principle: if h is
harmonic on W , that is, it satisfies
    h(x) = (1/c(x)) ∑_{y∼x} c(x, y) h(y),  ∀x ∈ W,

and if h achieves its supremum on W, then h is constant on W ∪ ∂_V W, where

    ∂_V W = {z ∈ V \ W : ∃y ∈ W, y ∼ z}.

(ii) Let h : W c → R be a bounded function on W c := V \ W . Let h1 and h2


be extensions of h to W that are harmonic on W . Use part (i) to prove that
h1 ≡ h2 .
Exercise 3.13 (Poisson equation: uniqueness). Show that u is the unique solution
of the system in Theorem 3.3.6 under the conditions of Theorem 3.3.1. [Hint: Use
Theorem 3.3.9 and mimic the proof of Theorem 3.3.1.]
Exercise 3.14 (Effective resistance: metric). Show that effective resistances be-
tween pairs of vertices form a metric.
Exercise 3.15 (Dirichlet principle: proof). Prove Theorem 3.3.25.
Exercise 3.16 (Martingale problem). Let V be countable, let (Xt ) be a stochastic
process adapted to (Ft ) and taking values in V , and let P be a transition prob-
ability on V with associated Laplacian operator ∆. Show that the following are
equivalent:
(i) The process (Xt ) is a Markov chain with transition probability P .
(ii) For any bounded measurable function f : V → R, the process
    M_t^f = f(X_t) − ∑_{s=0}^{t−1} ∆f(X_s),

is a martingale with respect to (Ft ).


Exercise 3.17 (Random walk on L2 : effective resistance). Consider random walk
on L2 , which we showed is recurrent. Let (Gn ) be the exhaustive sequence cor-
responding to vertices at distance at most n from the origin and let Zn be the
corresponding sink-set. Show that R(0 ↔ Zn ) = Θ(log n). [Hint: Use the Nash-
Williams inequality and the method of random paths.]

Exercise 3.18 (Random walk on regular graphs: effective resistance). Let G be a


d-regular graph with n vertices and d > n/2. Let N be the network (G, c) with
unit conductances. Let a and z be arbitrary distinct vertices.

(i) Show that there are at least 2d − n vertices x ≠ a, z such that a ∼ x ∼ z is


a path.

(ii) Prove that


    E_a(τ_{a,z}) ≤ 2dn/(2d − n).
Exercise 3.19 (Independent-coordinate random walk). Give a rigorous proof that
the two networks in Example 3.3.32 are roughly equivalent.

Exercise 3.20 (Rough isometries). Graphs G = (V, E) and G′ = (V′, E′) are roughly isometric (or quasi-isometric) if there is a map φ : V → V′ and constants 0 < α, β < +∞ such that for all x, y ∈ V

    α^{−1} d(x, y) − β ≤ d′(φ(x), φ(y)) ≤ α d(x, y) + β,

where d and d′ are the graph distances on G and G′ respectively, and furthermore all vertices in G′ are within distance β of the image of V. Let N = (G, c) and N′ = (G′, c′) be countable, connected networks with uniformly bounded conductances, resistances and degrees. Prove that if G and G′ are roughly isometric then N and N′ are roughly equivalent. [Hint: Start by proving that being roughly isometric is an equivalence relation.]

Exercise 3.21 (Random walk on the cycle: hitting time). Use the commute time
identity (Theorem 3.3.34) to compute Ex [τy ] in Example 3.3.35 in the case d = 1.
Give a second proof using a direct martingale argument.

Exercise 3.22 (Random walk on the binary tree: cover time). As in Example 3.3.15, let N be the rooted binary tree with n levels, T̂_2^n, with equal conductances on all edges.

(i) Show that the maximal hitting time Ea τb is achieved for a and b such that
their most recent common ancestor is the root 0. Furthermore argue that
in that case Ea [τb ] = Ea [τa,0 ], where recall that τa,0 is the commute time
between a and 0.

(ii) Use the commute time identity (Theorem 3.3.34) and Matthews’ cover time
bounds (Theorem 3.1.27) to give an upper bound on the mean cover time of
the order of O(n2 2n ).

Exercise 3.23 (Markov chain tree theorem). Let P be the transition matrix of a fi-
nite, irreducible Markov chain with stationary distribution π. Let G be the directed
graph corresponding to the positive transitions of P . For an arborescence A of G,
define its weight as

    Ψ(A) = ∏_{e⃗ ∈ A} P(e⃗).

Consider the following process on spanning arborescences over G. Let ρ be the


root of the current spanning arborescence A. Pick an outgoing edge ~e = (ρ, x) of
ρ according to P (ρ, · ). Edge ~e is not in A by definition of an arborescence. Add
~e to A. This creates a cycle. Remove the edge of this cycle originating from x,
producing a new arborescence A0 with root x. Repeat the process.

(i) Show that this chain is irreducible.

(ii) Show that Ψ is a stationary measure for this chain.

(iii) Prove the Markov chain tree theorem: the stationary distribution π of P is proportional to the arborescence weights, that is,

    π_x ∝ ∑_{A : root(A) = x} Ψ(A).
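For small state spaces, the statement in part (iii) can be verified by brute force. The sketch below (our illustration, with invented names) enumerates parent maps, keeps those forming arborescences, and compares the normalized weights with the stationary distribution obtained by solving πP = π.

```python
import itertools
import numpy as np

def arborescence_weights(P):
    """Brute-force the sums of Psi(A) over spanning arborescences rooted at
    each state of a small chain, where Psi(A) multiplies P along the edges."""
    n = len(P)
    totals = np.zeros(n)
    for root in range(n):
        others = [x for x in range(n) if x != root]
        for choice in itertools.product(range(n), repeat=len(others)):
            parent = dict(zip(others, choice))
            if any(x == y or P[x][y] == 0 for x, y in parent.items()):
                continue
            ok = True               # every vertex must reach the root (no cycle)
            for x in others:
                y, steps = x, 0
                while y != root and steps <= n:
                    y, steps = parent[y], steps + 1
                ok = ok and y == root
            if ok:
                w = 1.0
                for x, y in parent.items():
                    w *= P[x][y]    # Psi(A): product of transition probabilities
                totals[root] += w
    return totals / totals.sum()

P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.3, 0.5],
              [0.4, 0.0, 0.6]])
pi = arborescence_weights(P)
# Stationary distribution: solve pi P = pi with the normalization sum(pi) = 1.
A = P.T - np.eye(3)
A[-1, :] = 1.0
stat = np.linalg.solve(A, np.array([0.0, 0.0, 1.0]))
assert np.allclose(pi, stat, atol=1e-10)
```

The agreement for this (arbitrarily chosen) irreducible chain illustrates the theorem; the exercise asks for a proof via the auxiliary chain on arborescences.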

Bibliographic Remarks
Section 3.1 Picking up where Appendix B leaves off, Sections 3.1.1 and 3.1.3
largely follow the textbooks [Wil91] and [Dur10], which contain excellent intro-
ductions to martingales. The latter also covers Markov chains, and includes the
proofs we skipped here. Theorem 3.1.11 is proved in [Dur10, Theorem 4.3.2].
Many more results like Corollary 3.1.24 can be derived from the occupation mea-
sure identity; see for example [AF, Chapter 2]. The upper bound in Theorem 3.1.27
was first proved by Matthews [Mat88].

Section 3.2 The Azuma-Hoeffding inequality is due to Hoeffding [Hoe63] and


Azuma [Azu67]. The version of the inequality in Exercise 3.7 is from [Ste97]. The
method of bounded differences has its origins in the works of Yurinskii [Yur76],
Maurey [Mau79], Milman and Schechtman [MS86], Rhee and Talagrand [RT87],
and Shamir and Spencer [SS87]. In its current form, it appears in [McD89]. Exam-
ple 3.2.11 is taken from [MU05, Section 12.5]. The presentation in Section 3.2.3
follows [AS11, Section 7.3]. Claim 3.2.16 is due to Shamir and Spencer [SS87].
The 2-point concentration result alluded to in Section 3.2.3 is due to Alon and Kriv-
elevich [AK97]. For the full story on the chromatic number of Erdős-Rényi graphs,
see [JLR11, Chapter 7]. Claim 3.2.21 is due to Bollobás, Riordan, Spencer, and
Tusnády [BRST01]. It confirmed simulations of Barabási and Albert [BA99]. The
expectation was analyzed by Dorogovtsev, Mendes, and Samukhin [DMS00]. For
much more on preferential attachment models see [Dur06], [CL06], or [vdH17].
Example 3.2.12 borrows from [BLM13, Section 7.1] and [Pet, Section 6.3]. Gen-
eral references on the concentration of measure phenomenon and concentration
inequalities are [Led01] and [BLM13]. See [BCB12] or [LS20] for an introduc-
tion to bandit problems; or [AJKS22] for an introduction to the sample complex-
ity of the more general reinforcement learning problem. The slicing argument
in Section 3.2.5 is based on [Bub10]. A more general discussion of the slicing
method, whose best known application is the proof of the law of the iterated loga-
rithm (e.g., [Wil91, Section 14.7]), can be found in [vH16]. Section 3.2.6 is based
on [vH16, Section 4.3]. In particular, a proof of Talagrand’s inequality (Theo-
rem 3.2.32) can be found there. See also [AS11, Chapter 7] or [BLM13, Chapter
7].

Section 3.3 Section 3.3.1 is based partly on [Nor98, Sections 4.1-2], [Ebe, Sec-
tions 0.3, 1.1-2, 3.1-2], and [Bre17, Sections 7.3, 17.1]. The material in Secti-
ons 3.3.2-3.3.5 borrows from [LPW06, Chapters 9, 10], [AF, Chapters 2, 3] and,
especially, [LP16, Sections 2.1-2.6, 4.1-4.2, 5.5]. Foster’s theorem (Theorem 3.3.12)

is from [Fos53]. The classical reference on potential theory and its probabilistic
counterpart is [Doo01]. For the discrete case and the electrical network point of
view, the book of Doyle and Snell is excellent [DS84]. In particular the series and
parallel laws are defined and illustrated. See also [KSK76]. For an introduction to
convex optimization and duality, see for example [BV04]. The Nash-Williams
inequality is due to Nash-Williams [NW59]. The result in Example 3.3.27 is
due to R. Lyons [Lyo90]. Theorem 3.3.33 is due to Kanai [Kan86]. The com-
mute time identity was proved by Chandra, Raghavan, Ruzzo, Smolensky and Ti-
wari [CRR+ 89]. An elementary proof of Pólya’s theorem can be found in [Dur10,
Section 4.2]. The flow we used in the proof of Pólya’s theorem is essentially
due to T. Lyons [Lyo83]. Wilson’s method is due to Wilson [Wil96]. A related
method for generating uniform spanning trees was introduced by Aldous [Ald90]
and Broder [Bro89]. A connection between loop-erased random walks and uni-
form spanning trees had previously been established by Pemantle [Pem91] using
the Aldous-Broder method. For more on negative correlation in uniform spanning
trees, see for example [LP16, Section 4.2]. For a proof of the matrix tree theo-
rem using Wilson’s method, see [KRS]. For a discussion of the running time of
Wilson’s method and other spanning tree generation approaches, see [Wil96].
Chapter 4

Coupling

In this chapter we move on to coupling, another probabilistic technique with a wide


range of applications (far beyond discrete stochastic processes). The idea behind
the coupling method is deceptively simple: to compare two probability measures µ
and ν, it is sometimes useful to construct a joint probability space with marginals
µ and ν. For instance, in the classical application of coupling to the convergence
of Markov chains (Theorem 1.1.33), one simultaneously constructs two copies of a
Markov chain—one of which is already at stationarity—and shows that they can be
made to coincide after a random amount of time called the coupling time. We begin
in Section 4.1 by defining coupling formally and deriving its connection to the total
variation distance through the coupling inequality. We illustrate the basic idea on
a classical Poisson approximation result, which we apply to the degree sequence
of an Erdős-Rényi graph. In Section 4.2, we introduce the concept of stochastic
domination and some related correlation inequalities. We develop a key applica-
tion in percolation theory. Coupling of Markov chains is the subject of Section 4.3,
where it serves as a powerful tool to derive mixing time bounds. Finally, we end in
Section 4.4 with the Chen-Stein method for Poisson approximations, a technique
that applies in particular in some natural settings with dependent variables.

4.1 Background
We begin with some background on coupling. After defining the concept formally
and giving a few simple examples, we derive the coupling inequality, which pro-
vides a fundamental approach to bounding the distance between two distributions.
Version: December 20, 2023
Modern Discrete Probability: An Essential Toolkit
Copyright © 2023 Sébastien Roch

234
CHAPTER 4. COUPLING 235

As an application, we analyze the degree distribution in the Erdős-Rényi graph


model. Throughout this chapter, (S, S) is a measurable space. Also we will denote
by µZ the law of random variable Z.

4.1.1 Basic definitions


A formal definition of coupling follows. Recall (see Appendix B) that for measur-
able spaces (S1, S1) and (S2, S2), we can consider the product space (S1 × S2, S1 × S2)
where
S1 × S2 := {(s1 , s2 ) : s1 ∈ S1 , s2 ∈ S2 }
is the Cartesian product of S1 and S2 , and S1 × S2 is the smallest σ-algebra on
S1 × S2 containing the rectangles A1 × A2 for all A1 ∈ S1 and A2 ∈ S2 .
Definition 4.1.1 (Coupling). Let µ and ν be probability measures on the same
measurable space (S, S). A coupling of µ and ν is a probability measure γ on the
product space (S × S, S × S) such that the marginals of γ coincide with µ and ν,
that is,
γ(A × S) = µ(A) and γ(S × A) = ν(A), ∀A ∈ S.
For two random variables X and Y taking values in (S, S), a coupling of X and
Y is a joint variable (X 0 , Y 0 ) taking values in (S × S, S × S) whose law as a
probability measure is a coupling of the laws of X and Y . Note that, under this
definition, X and Y need not be defined on the same probability space (but X 0 and
Y 0 do need to). We also say that (X 0 , Y 0 ) is a coupling of µ and ν if the law of
(X 0 , Y 0 ) is a coupling of µ and ν.
We give a few examples.
Example 4.1.2 (Coupling of Bernoulli variables). Let X and Y be Bernoulli ran-
dom variables with respective parameters 0 ≤ q < r ≤ 1. That is, P[X = 1] = q
and P[Y = 1] = r. Here S = {0, 1} and S = 2S .
- (Independent coupling) One coupling of X and Y is (X′, Y′) where X′ has the law of X, Y′ has the law of Y, and X′ and Y′ are independent of one another. Its law is

      (P[(X′, Y′) = (i, j)])_{i,j∈{0,1}} = [ (1 − q)(1 − r)   (1 − q)r
                                             q(1 − r)         qr       ].

- (Monotone coupling) Another possibility is to pick U uniformly at random in [0, 1], and set X″ := 1_{U ≤ q} and Y″ := 1_{U ≤ r}. Then (X″, Y″) is a coupling of X and Y with law

      (P[(X″, Y″) = (i, j)])_{i,j∈{0,1}} = [ 1 − r   r − q
                                             0       q     ].
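The monotone coupling is easy to simulate. The sketch below (an illustration of ours) checks empirically that the marginals are Bernoulli(q) and Bernoulli(r), that X″ ≤ Y″ always, and that disagreement occurs with frequency close to r − q.

```python
import random

def monotone_bernoulli(q, r, rng):
    """One draw (X'', Y'') from the monotone coupling of Bernoulli(q) and
    Bernoulli(r), q <= r, using a single shared uniform."""
    u = rng.random()
    return int(u <= q), int(u <= r)

rng = random.Random(42)
q, r, n = 0.3, 0.7, 200_000
samples = [monotone_bernoulli(q, r, rng) for _ in range(n)]
assert all(x <= y for x, y in samples)            # monotonicity holds surely
mean_x = sum(x for x, _ in samples) / n           # should be close to q
mean_y = sum(y for _, y in samples) / n           # should be close to r
disagree = sum(x != y for x, y in samples) / n    # should be close to r - q
assert abs(mean_x - q) < 0.01 and abs(mean_y - r) < 0.01
assert abs(disagree - (r - q)) < 0.01
```

The disagreement frequency r − q will reappear below: by the coupling inequality of Section 4.1.3, it bounds the total variation distance, and here the bound is in fact attained.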

Example 4.1.3 (Bond percolation: monotonicity). Let G = (V, E) be a countable


graph. Denote by Pp the law of bond percolation (Definition 1.2.1) on G with
density p. Let x ∈ V and assume 0 ≤ q < r ≤ 1. Using the monotone coupling in
the previous example on each edge independently produces a coupling of Pq and
Pr . More precisely:

- Let {Ue }e∈E be independent uniforms on [0, 1].

- For p ∈ [0, 1], let Wp be the set of edges e such that Ue ≤ p.

Thinking of W_p as specifying the open edges in the percolation process on G under P_p, we see that (W_q, W_r) is a coupling of P_q and P_r with the property that P[W_q ⊆ W_r] = 1. Let C_x^{(q)} and C_x^{(r)} be the open clusters of x under W_q and W_r respectively. Because C_x^{(q)} ⊆ C_x^{(r)},

    θ(q) := P_q[|C_x| = +∞]
          = P[|C_x^{(q)}| = +∞]
          ≤ P[|C_x^{(r)}| = +∞]
          = P_r[|C_x| = +∞]
          = θ(r).

(We made this claim in Section 2.2.4.) J


Example 4.1.4 (Biased random walk on Z). For p ∈ [0, 1], let (S_t^{(p)}) be nearest-neighbor random walk on Z started at 0 with probability p of jumping to the right and probability 1 − p of jumping to the left. (See the gambler's ruin problem in Example 3.1.43.) Assume 0 ≤ q < r ≤ 1. Using again the monotone coupling of Bernoulli variables above, we produce a coupling of (S_t^{(q)}) and (S_t^{(r)}).

- Let (X_i″, Y_i″)_i be an infinite sequence of i.i.d. monotone Bernoulli couplings with parameters q and r respectively.

- Define (Z_i^{(q)}, Z_i^{(r)}) := (2X_i″ − 1, 2Y_i″ − 1). Note that P[2X_1″ − 1 = 1] = P[X_1″ = 1] = q and P[2X_1″ − 1 = −1] = P[X_1″ = 0] = 1 − q, and similarly for Y_i″.

- Let Ŝ_t^{(q)} = ∑_{i≤t} Z_i^{(q)} and Ŝ_t^{(r)} = ∑_{i≤t} Z_i^{(r)}.

Then (Ŝ_t^{(q)}, Ŝ_t^{(r)}) is a coupling of (S_t^{(q)}) and (S_t^{(r)}) such that Ŝ_t^{(q)} ≤ Ŝ_t^{(r)} for all t almost surely. In particular, we deduce that for all y and all t

    P[S_t^{(q)} ≤ y] = P[Ŝ_t^{(q)} ≤ y] ≥ P[Ŝ_t^{(r)} ≤ y] = P[S_t^{(r)} ≤ y].
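This coupling can be run directly. In the sketch below (our illustration), a single uniform per step drives both walks, so u ≤ q implies u ≤ r and the r-walk can never fall below the q-walk.

```python
import random

def coupled_biased_walks(q, r, t, rng):
    """Run the monotone coupling of the q- and r-biased walks on Z for t
    steps: one shared uniform per step drives both copies."""
    s_q = s_r = 0
    for _ in range(t):
        u = rng.random()
        s_q += 1 if u <= q else -1   # step of the q-biased walk
        s_r += 1 if u <= r else -1   # step of the r-biased walk
        assert s_q <= s_r            # domination holds at every step
    return s_q, s_r

rng = random.Random(7)
finals = [coupled_biased_walks(0.3, 0.6, 50, rng) for _ in range(1000)]
assert all(a <= b for a, b in finals)
```

The pathwise inequality Ŝ_t^{(q)} ≤ Ŝ_t^{(r)} checked above is exactly what yields the stochastic domination of the marginals.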

4.1.2 Random walks: harmonic functions on lattices and infinite d-regular trees
Let (Xt ) be a Markov chain on a finite or countably infinite state space V with
transition matrix P and let Px be the law of (Xt ) started at x. We say that a
function h : V → R is bounded if supx∈V |h(x)| < +∞. Recall from Section 3.3
that h is harmonic (with respect to P ) on V if
    h(x) = ∑_{y∈V} P(x, y) h(y),  ∀x ∈ V.

We first give a coupling-based criterion for bounded harmonic functions to be con-


stant. Recall that we treated the finite state-space case (where boundedness is au-
tomatic) in Corollary 3.3.3.
Lemma 4.1.5 (Coupling and bounded harmonic functions). If, for all y, z ∈ V ,
there is a coupling ((Yt )t , (Zt )t ) of Py and Pz such that

    lim_{t→∞} P[Y_t ≠ Z_t] = 0,

then all bounded harmonic functions on V are constant.


Proof. Let h be bounded and harmonic on V with supx |h(x)| = M < +∞. Let
y, z be any points in V . Then, arguing as in Section 3.3.1, (h(Yt )) and (h(Zt )) are
martingales and, in particular,

E[h(Yt )] = E[h(Y0 )] = h(y) and E[h(Zt )] = E[h(Z0 )] = h(z).

So by Jensen’s inequality (Theorem B.4.15) and the boundedness assumption

    |h(y) − h(z)| = |E[h(Y_t)] − E[h(Z_t)]|
                  ≤ E|h(Y_t) − h(Z_t)|
                  ≤ 2M P[Y_t ≠ Z_t]
                  → 0.

So h(y) = h(z).

Harmonic functions on Zd Consider random walk on Ld for d ≥ 1. In that case,


we show that all bounded harmonic functions are constant.

Theorem 4.1.6 (Bounded harmonic functions on Zd ). All bounded harmonic func-


tions on Ld are constant.

Proof. From (3.3.2), h is harmonic with respect to random walk on Ld if and only
if it is harmonic with respect to lazy random walk (Definition 1.1.31), that is, the
walk that stays put with probability 1/2 at every step. Let Py and Pz be the laws of
lazy random walk on L^d started at y and z respectively. We construct a coupling ((Y_t), (Z_t)) = ((Y_t^{(i)})_{i∈[d]}, (Z_t^{(i)})_{i∈[d]}) of P_y and P_z as follows: at time t, pick a coordinate I ∈ [d] uniformly at random, then

• if Y_t^{(I)} = Z_t^{(I)}, then do nothing with probability 1/2 and otherwise pick W ∈ {−1, +1} uniformly at random, set Y_{t+1}^{(I)} = Z_{t+1}^{(I)} := Z_t^{(I)} + W and leave the other coordinates unchanged;

• if instead Y_t^{(I)} ≠ Z_t^{(I)}, pick W ∈ {−1, +1} uniformly at random, and with probability 1/2 set Y_{t+1}^{(I)} := Y_t^{(I)} + W and leave Z_t and the other coordinates of Y_t unchanged, or otherwise set Z_{t+1}^{(I)} := Z_t^{(I)} + W and leave Y_t and the other coordinates of Z_t unchanged.

It is straightforward to check that ((Y_t), (Z_t)) is indeed a coupling of P_y and P_z. To apply the previous lemma, it remains to bound P[Y_t ≠ Z_t].
The key is to note that, for each coordinate i, the difference (Y_t^{(i)} − Z_t^{(i)}) is itself a nearest-neighbor random walk on Z started at y^{(i)} − z^{(i)} with holding probability (i.e., probability of staying put) 1 − 1/d—until it hits 0. Simple random walk on Z is irreducible and recurrent (Theorem 3.3.38). The holding probability does not affect the type of the walk. So (Y_t^{(i)} − Z_t^{(i)}) hits 0 in finite time with probability 1. Hence, letting τ^{(i)} be the first time that Y_t^{(i)} − Z_t^{(i)} = 0, we have P[Y_t^{(i)} ≠ Z_t^{(i)}] ≤ P[τ^{(i)} > t] → P[τ^{(i)} = +∞] = 0.
By a union bound,

    P[Y_t ≠ Z_t] ≤ ∑_{i∈[d]} P[Y_t^{(i)} ≠ Z_t^{(i)}] → 0,

as desired.
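In d = 1 the coupling in this proof can be checked exactly by evolving the joint law with dynamic programming. The sketch below (our illustration) verifies that each marginal agrees with the lazy walk, and that the two copies have met with positive probability.

```python
from collections import defaultdict

def step_coupling(joint):
    """One step of the d = 1 coupling: exact evolution of the joint law of
    (Y_t, Z_t) under the rules in the proof (a single coordinate)."""
    new = defaultdict(float)
    for (y, z), p in joint.items():
        if y == z:
            new[(y, z)] += p / 2                 # both stay put
            new[(y + 1, z + 1)] += p / 4         # both move together
            new[(y - 1, z - 1)] += p / 4
        else:
            for w in (1, -1):                    # exactly one copy moves
                new[(y + w, z)] += p / 4
                new[(y, z + w)] += p / 4
    return dict(new)

def step_lazy(mu):
    """One step of lazy simple random walk on Z."""
    new = defaultdict(float)
    for y, p in mu.items():
        new[y] += p / 2
        new[y + 1] += p / 4
        new[y - 1] += p / 4
    return dict(new)

joint, mu_y, mu_z = {(0, 3): 1.0}, {0: 1.0}, {3: 1.0}
for _ in range(8):
    joint = step_coupling(joint)
    mu_y, mu_z = step_lazy(mu_y), step_lazy(mu_z)

marg_y = defaultdict(float)
for (y, z), p in joint.items():
    marg_y[y] += p
# The first marginal of the coupling is exactly the lazy walk from 0 ...
assert all(abs(marg_y[y] - mu_y.get(y, 0.0)) < 1e-12 for y in marg_y)
# ... and the two copies have already met with positive probability.
disagree = sum(p for (y, z), p in joint.items() if y != z)
assert disagree < 1.0
```

Note that once the copies agree they move together forever, so P[Y_t ≠ Z_t] is nonincreasing, in line with the limit used in the proof.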

Exercise 4.1 asks for an example of a non-constant (necessarily unbounded)


harmonic function on Zd .

Harmonic functions on Td On trees, the situation is different. Let Td be the


infinite d-regular tree with root ρ. For x ∈ Td , we let Tx be the subtree, rooted at
x, of descendants of x.

Theorem 4.1.7 (Bounded harmonic functions on Td ). For d ≥ 3, let (Xt ) be


simple random walk on Td and let P be the corresponding transition matrix. Let a
be a neighbor of the root and consider the function

h(x) := Px [Xt ∈ Ta for all but finitely many t].

Then h is a non-constant, bounded harmonic function on Td .

Proof. The function h is bounded since it is defined as a probability, and by the


usual first-step analysis
    h(x) = ∑_{y : y∼x} (1/d) P_y[X_t ∈ T_a for all but finitely many t] = ∑_y P(x, y) h(y),

so h is harmonic on all of Td .
Let b 6= a be a neighbor of the root. The key of the proof is the following
lemma.
Lemma 4.1.8.
q := Pa [τρ = +∞] = Pb [τρ = +∞] > 0.

Proof. The equality of the two probabilities follows by symmetry. To see that q > 0, let (Z_t) be simple random walk on T_d started at a until the walk hits ρ and let L_t be the graph distance between Z_t and the root. Then (L_t) is a biased random walk on Z started at 1, jumping to the right with probability 1 − 1/d and to the left with probability 1/d. The probability that (L_t) hits 0 in finite time is < 1 because 1 − 1/d > 1/2 when d ≥ 3, by the gambler's ruin (Example 3.1.43).

Note that

    h(ρ) ≤ 1 − (1 − 1/d) q < 1.
Indeed, if on the first step the random walk started at ρ moves away from a, an event of probability 1 − 1/d, then it must come back to ρ in finite time to reach T_a.
Similarly, by the strong Markov property (Theorem 3.1.8),

h(a) = q + (1 − q)h(ρ).

Since h(ρ) ≠ 1 and q > 0, this shows that h(a) > h(ρ). So h is not constant.

4.1.3 Total variation distance and coupling inequality


In the examples of Section 4.1.1, we used coupling to prove monotonicity state-
ments. Coupling is also useful to bound the distance between probability measures.
For this, we need the coupling inequality.

Total variation distance Let µ and ν be probability measures on (S, S). Recall
the definition of the total variation distance

    ‖µ − ν‖_TV := sup_{A∈S} |µ(A) − ν(A)|.

As the next lemma shows in the countable case, the total variation distance can
be thought of as an ℓ1 distance on probability measures as vectors (up to a constant
factor).

Lemma 4.1.9 (Alternative definition of total variation distance). If S is countable,


then it holds that
    ‖µ − ν‖_TV = (1/2) ∑_{x∈S} |µ(x) − ν(x)|.

Proof. Let E∗ := {x : µ(x) ≥ ν(x)}. Then, for any A ⊆ S, by definition of E∗

µ(A) − ν(A) ≤ µ(A ∩ E∗ ) − ν(A ∩ E∗ ) ≤ µ(E∗ ) − ν(E∗ ).

Similarly, we have

ν(A) − µ(A) ≤ ν(E∗c ) − µ(E∗c )


= (1 − ν(E∗ )) − (1 − µ(E∗ ))
= µ(E∗ ) − ν(E∗ ).

The two bounds above are equal so |µ(A) − ν(A)| ≤ µ(E∗ ) − ν(E∗ ). Equality is
achieved when A = E∗ . Also
    µ(E∗) − ν(E∗) = (1/2)[µ(E∗) − ν(E∗) + ν(E∗^c) − µ(E∗^c)]
                  = (1/2) ∑_{x∈S} |µ(x) − ν(x)|.

That concludes the proof.
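For a small finite S, both expressions for the total variation distance can be computed directly. The following sketch (our illustration) compares the supremum over all events with the half-ℓ1 formula of the lemma.

```python
from itertools import combinations

def tv_sup(mu, nu, states):
    """Total variation as the supremum of |mu(A) - nu(A)| over all events A."""
    best = 0.0
    for r in range(len(states) + 1):
        for A in combinations(states, r):
            gap = abs(sum(mu.get(x, 0.0) for x in A)
                      - sum(nu.get(x, 0.0) for x in A))
            best = max(best, gap)
    return best

def tv_half_l1(mu, nu, states):
    """Total variation as half the l1 distance (the identity of the lemma)."""
    return 0.5 * sum(abs(mu.get(x, 0.0) - nu.get(x, 0.0)) for x in states)

mu = {0: 0.5, 1: 0.3, 2: 0.2}
nu = {0: 0.1, 1: 0.4, 2: 0.5}
states = [0, 1, 2]
assert abs(tv_sup(mu, nu, states) - tv_half_l1(mu, nu, states)) < 1e-12
# The supremum is achieved on E* = {x : mu(x) >= nu(x)} = {0}, as in the proof.
assert abs(tv_sup(mu, nu, states) - 0.4) < 1e-12
```

The brute-force supremum is exponential in |S|, of course; the point is only to illustrate that the event E∗ from the proof attains it.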

Like the `1 distance, the total variation distance is a metric. In particular, it


satisfies the triangle inequality.

Lemma 4.1.10 (Total variation distance: triangle inequality). Let µ, ν, η be prob-


ability measures on (S, S). Then

    ‖µ − ν‖_TV ≤ ‖µ − η‖_TV + ‖η − ν‖_TV.

Proof. From the definition,

    sup_{A∈S} |µ(A) − ν(A)| ≤ sup_{A∈S} { |µ(A) − η(A)| + |η(A) − ν(A)| }
                            ≤ sup_{A∈S} |µ(A) − η(A)| + sup_{A∈S} |η(A) − ν(A)|.

Coupling inequality We come to an elementary, yet fundamental inequality.


Lemma 4.1.11 (Coupling inequality). Let µ and ν be probability measures on
(S, S). For any coupling (X, Y ) of µ and ν,

    ‖µ − ν‖_TV ≤ P[X ≠ Y].

Proof. For any A ∈ S,

    µ(A) − ν(A) = P[X ∈ A] − P[Y ∈ A]
                = P[X ∈ A, X = Y] + P[X ∈ A, X ≠ Y]
                  − P[Y ∈ A, X = Y] − P[Y ∈ A, X ≠ Y]
                = P[X ∈ A, X ≠ Y] − P[Y ∈ A, X ≠ Y]
                ≤ P[X ≠ Y],

and, similarly, ν(A) − µ(A) ≤ P[X ≠ Y]. Hence

    |µ(A) − ν(A)| ≤ P[X ≠ Y].

Taking a supremum over A gives the claim.

Here is a quick example.


Example 4.1.12 (A coupling of Poisson random variables). Let X ∼ Poi(λ) and
Y ∼ Poi(ν) with λ > ν. Recall that a sum of independent Poisson is Poisson
(see Exercise 6.7). This fact leads to a natural coupling: let Ŷ ∼ Poi(ν), Ẑ ∼
Poi(λ − ν) independently of Ŷ , and X̂ = Ŷ + Ẑ. Then (X̂, Ŷ ) is a coupling of X
and Y , and by the coupling inequality (Lemma 4.1.11)

kµX − µY kTV ≤ P[X̂ ≠ Ŷ ] = P[Ẑ > 0] = 1 − e^{−(λ−ν)} ≤ λ − ν,

where we used 1 − e^{−x} ≤ x for all x (see Exercise 1.16). J
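The chain of inequalities in Example 4.1.12 can be checked numerically. Here is a small sketch in Python (the parameter values, truncation level, and function names are ours): it evaluates kµX − µY kTV exactly via Lemma 4.1.9 and compares it to both bounds.

```python
import math

def poisson_pmf(mean, k):
    return math.exp(-mean) * mean ** k / math.factorial(k)

def tv_poisson(m1, m2, kmax=60):
    # Lemma 4.1.9; the mass beyond kmax is negligible for small means
    return 0.5 * sum(abs(poisson_pmf(m1, k) - poisson_pmf(m2, k))
                     for k in range(kmax + 1))

lam, nu = 2.0, 1.5
tv = tv_poisson(lam, nu)
coupling_bound = 1 - math.exp(-(lam - nu))   # P[Zhat > 0]
# tv <= coupling_bound <= lam - nu, as the example asserts
```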



Figure 4.1: Proof by picture that: 1 − p = α = β = kµ − νkTV .

Remarkably, the inequality in Lemma 4.1.11 is tight. For simplicity, we prove
this in the finite case only.

Lemma 4.1.13 (Maximal coupling). Assume S is finite and let S = 2S . Let µ and
ν be probability measures on (S, S). Then,

kµ − νkTV = inf{P[X ≠ Y ] : coupling (X, Y ) of µ and ν}.

Proof. We construct a coupling which achieves equality in the coupling inequality.
Such a coupling is called a maximal coupling.
Let A = {x ∈ S : µ(x) > ν(x)}, B = {x ∈ S : µ(x) ≤ ν(x)} and

p := Σ_{x∈S} µ(x) ∧ ν(x),   α := Σ_{x∈A} [µ(x) − ν(x)],   β := Σ_{x∈B} [ν(x) − µ(x)].

Assume p > 0 (otherwise there is nothing to prove). First, two lemmas. See
Figure 4.1 for a proof by picture.
Lemma 4.1.14.

Σ_{x∈S} µ(x) ∧ ν(x) = 1 − kµ − νkTV .

Proof. We have

2kµ − νkTV = Σ_{x∈S} |µ(x) − ν(x)|
           = Σ_{x∈A} [µ(x) − ν(x)] + Σ_{x∈B} [ν(x) − µ(x)]
           = Σ_{x∈A} µ(x) + Σ_{x∈B} ν(x) − Σ_{x∈S} µ(x) ∧ ν(x)
           = 2 − Σ_{x∈B} µ(x) − Σ_{x∈A} ν(x) − Σ_{x∈S} µ(x) ∧ ν(x)
           = 2 − 2 Σ_{x∈S} µ(x) ∧ ν(x),

where we used that both µ and ν sum to 1. Rearranging gives the claim.

Lemma 4.1.15.

Σ_{x∈A} [µ(x) − ν(x)] = Σ_{x∈B} [ν(x) − µ(x)] = kµ − νkTV = 1 − p.

Proof. The first equality is immediate by the fact that µ and ν are probability mea-
sures. The second equality follows from the first one together with the second line
in the proof of the previous lemma. The last equality is a restatement of the last
lemma.

The maximal coupling is defined as follows:

- With probability p, pick X = Y from γmin where

  γmin (x) := (1/p) µ(x) ∧ ν(x), x ∈ S.

- Otherwise, pick X from γA where

  γA (x) := (µ(x) − ν(x))/(1 − p), x ∈ A,

  and, independently, pick Y from

  γB (x) := (ν(x) − µ(x))/(1 − p), x ∈ B.

  Note that X ≠ Y in that case because A and B are disjoint.

The marginal law of X is: for x ∈ A,

p γmin (x) + (1 − p) γA (x) = ν(x) + µ(x) − ν(x) = µ(x),

and for x ∈ B,

p γmin (x) + (1 − p) γA (x) = µ(x) + 0 = µ(x).

A similar calculation holds for Y . Finally P[X ≠ Y ] = 1 − p = kµ − νkTV .


Remark 4.1.16. A proof of this result for general Polish spaces can be found in [dH,
Section 2.5].

We return to our coupling of Bernoulli variables.


Example 4.1.17 (Coupling of Bernoulli variables (continued)). Recall the setting
of Example 4.1.2. To construct the maximal coupling as above, we note that

A := {0}, B := {1},
p := Σ_x µ(x) ∧ ν(x) = (1 − r) + q,   1 − p = α = β = r − q,

(γmin (x))_{x=0,1} = ( (1 − r)/((1 − r) + q), q/((1 − r) + q) ),

γA (0) := 1,   γB (1) := 1.

The law of the maximal coupling (X′′′ , Y′′′ ) is given by

(P[(X′′′ , Y′′′ ) = (i, j)])_{i,j∈{0,1}} = [ p γmin (0)   (1 − p) γA (0) γB (1) ]
                                          [ 0            p γmin (1)            ]

                                        = [ 1 − r   r − q ]
                                          [ 0       q     ].

Notice that it happens to coincide with the monotone coupling. J
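For finite S, the construction in the proof of Lemma 4.1.13 translates directly into code. The sketch below (Python; the function name is ours) builds the joint law as a dictionary; on the Bernoulli laws of Example 4.1.17 with r = 0.7 and q = 0.4 it reproduces the table of the example.

```python
def maximal_coupling(mu, nu):
    # joint law gamma with marginals mu, nu and P[X != Y] = ||mu - nu||_TV:
    # mass mu ^ nu on the diagonal, product gamma_A x gamma_B off the diagonal
    support = set(mu) | set(nu)
    p = sum(min(mu.get(x, 0.0), nu.get(x, 0.0)) for x in support)
    gamma = {(x, x): min(mu.get(x, 0.0), nu.get(x, 0.0))
             for x in support if min(mu.get(x, 0.0), nu.get(x, 0.0)) > 0}
    if p < 1:
        for x in support:                         # contributes only when x is in A
            a = mu.get(x, 0.0) - nu.get(x, 0.0)
            for y in support:                     # contributes only when y is in B
                b = nu.get(y, 0.0) - mu.get(y, 0.0)
                if a > 0 and b > 0:
                    gamma[(x, y)] = gamma.get((x, y), 0.0) + a * b / (1 - p)
    return gamma

r, q = 0.7, 0.4
gamma = maximal_coupling({0: 1 - q, 1: q}, {0: 1 - r, 1: r})  # mu = Ber(q), nu = Ber(r)
```

Here gamma[(0, 0)] = 1 − r, gamma[(0, 1)] = r − q, gamma[(1, 1)] = q, and the off-diagonal mass equals kµ − νkTV = r − q.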

Poisson approximation Here is a classical application of coupling: the approxi-
mation of a sum of independent Bernoulli variables with a Poisson. It gives a quan-
titative bound in total variation distance. Let X1 , . . . , Xn be independent Bernoulli
random variables with parameters p1 , . . . , pn respectively. We are interested in the
case where the pi s are “small.” Let Sn := Σ_{i≤n} Xi . We approximate Sn with a
Poisson random variable Zn as follows: let W1 , . . . , Wn be independent Poisson
random variables with means λ1 , . . . , λn respectively and define Zn := Σ_{i≤n} Wi .
We choose λi = − log(1 − pi ) for reasons that will become clear below. Note that
Zn ∼ Poi(λ) where λ = Σ_{i≤n} λi .

Theorem 4.1.18 (Poisson approximation).

kµSn − Poi(λ)kTV ≤ (1/2) Σ_{i≤n} λi^2 .

Proof. We couple the pairs Xi , Wi independently for i ≤ n. Let

Wi′ ∼ Poi(λi ) and Xi′ = Wi′ ∧ 1.

Because of our choice λi = − log(1 − pi ), which implies

1 − pi = P[Xi = 0] = P[Wi = 0] = e^{−λi },

(Xi′ , Wi′ ) is indeed a coupling of Xi , Wi . Let Sn′ := Σ_{i≤n} Xi′ and Zn′ := Σ_{i≤n} Wi′ .
Then (Sn′ , Zn′ ) is a coupling of Sn , Zn . By the coupling inequality

kµSn − µZn kTV ≤ P[Sn′ ≠ Zn′ ]
              ≤ Σ_{i≤n} P[Xi′ ≠ Wi′ ]
              = Σ_{i≤n} P[Wi′ ≥ 2]
              = Σ_{i≤n} Σ_{j≥2} e^{−λi } λi^j / j!
              ≤ Σ_{i≤n} (λi^2 / 2) Σ_{ℓ≥0} e^{−λi } λi^ℓ / ℓ!
              = Σ_{i≤n} λi^2 / 2.
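Since all quantities in Theorem 4.1.18 are explicit, the bound can be tested numerically. The following sketch (Python; the parameter values, truncation, and helper name are ours) computes the exact law of Sn by convolution and compares its total variation distance to Poi(λ) against (1/2) Σ_{i≤n} λi^2.

```python
import math

def bernoulli_sum_pmf(ps):
    # exact law of S_n = sum of independent Ber(p_i), by sequential convolution
    pmf = [1.0]
    for p in ps:
        nxt = [0.0] * (len(pmf) + 1)
        for k, w in enumerate(pmf):
            nxt[k] += w * (1 - p)
            nxt[k + 1] += w * p
        pmf = nxt
    return pmf

ps = [0.05, 0.1, 0.02, 0.08, 0.04]
lambdas = [-math.log(1 - p) for p in ps]   # the choice made in the proof
lam = sum(lambdas)
pmf_s = bernoulli_sum_pmf(ps)
tv = 0.5 * sum(abs((pmf_s[k] if k < len(pmf_s) else 0.0)
                   - math.exp(-lam) * lam ** k / math.factorial(k))
               for k in range(60))
bound = 0.5 * sum(l * l for l in lambdas)
# tv <= bound, as Theorem 4.1.18 guarantees
```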

Mappings reduce the total variation distance The following lemma will be
useful.

Lemma 4.1.19 (Mappings). Let X and Y be random variables taking values in
(S, S), let h be a measurable map from (S, S) to (S′, S′), and let X′ := h(X) and
Y′ := h(Y ). The following inequality holds

kµX′ − µY′ kTV ≤ kµX − µY kTV .



Proof. From the definition of the total variation distance, we seek to bound

sup_{A′∈S′} |P[X′ ∈ A′] − P[Y′ ∈ A′]|
    = sup_{A′∈S′} |P[h(X) ∈ A′] − P[h(Y ) ∈ A′]|
    = sup_{A′∈S′} |P[X ∈ h^{−1}(A′)] − P[Y ∈ h^{−1}(A′)]|.

Since h^{−1}(A′) ∈ S by the measurability of h, this last expression is less than or
equal to

sup_{A∈S} |P[X ∈ A] − P[Y ∈ A]|,

which proves the claim.
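Lemma 4.1.19 is easy to sanity-check on a small finite space. In the sketch below (Python; the distributions and the map h are our own toy choices), h merges two states, and the distance can only shrink.

```python
def tv(mu, nu):
    # Lemma 4.1.9: (1/2) * l1 distance, for countable S
    keys = set(mu) | set(nu)
    return 0.5 * sum(abs(mu.get(k, 0.0) - nu.get(k, 0.0)) for k in keys)

def pushforward(mu, h):
    # law of h(X) when X ~ mu
    out = {}
    for x, w in mu.items():
        out[h(x)] = out.get(h(x), 0.0) + w
    return out

mu = {0: 0.2, 1: 0.5, 2: 0.3}
nu = {0: 0.4, 1: 0.1, 2: 0.5}
h = lambda x: min(x, 1)   # merge states 1 and 2
# tv(pushforward(mu, h), pushforward(nu, h)) <= tv(mu, nu)
```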

Coupling of Markov chains In the context of Markov chains, a natural way to
couple is to do so step by step. We will refer to such couplings as Markovian. An
important special case is a Markovian coupling of a chain with itself.

Definition 4.1.20 (Markovian coupling). Let P and Q be transition matrices on
the same state space V . A Markovian coupling of P and Q is a Markov chain
(Xt , Yt )t on V × V with transition matrix R satisfying: for all x, y, x′, y′ ∈ V ,

Σ_{z′} R((x, y), (x′, z′)) = P (x, x′),

Σ_{z′} R((x, y), (z′, y′)) = Q(y, y′).

We will give many examples throughout this chapter. See also Example 4.2.14 for
an example of a coupling of Markov chains that is not Markovian.

4.1.4 . Random graphs: degree sequence in Erdős-Rényi model

Let Gn ∼ Gn,pn be an Erdős-Rényi graph with pn := λ/n and λ > 0 (see Defini-
tion 1.2.2). For i ∈ [n], let Di (n) be the degree of vertex i and define

Nd (n) := Σ_{i=1}^{n} 1{Di (n)=d} ,

the number of vertices of degree d.



Theorem 4.1.21 (Erdős-Rényi graph: degree sequence).

(1/n) Nd (n) →p fd := e^{−λ} λ^d / d! , ∀d ≥ 0.
Proof. We proceed in two steps:

1. we use the coupling inequality (Lemma 4.1.11) to show that the expectation
of (1/n) Nd (n) is close to fd ; and

2. we appeal to Chebyshev’s inequality (Theorem 2.1.2) to show that (1/n) Nd (n)
is close to its expectation.

We justify each step as a lemma.

Lemma 4.1.22 (Convergence of the mean).

lim_{n→+∞} (1/n) En,pn [Nd (n)] = fd , ∀d ≥ 1.

Proof. Note that the degrees Di (n), i ∈ [n], are identically distributed (but not
independent) so

(1/n) En,pn [Nd (n)] = Pn,pn [D1 (n) = d].

Moreover, by definition, D1 (n) ∼ Bin(n − 1, pn ). Let Sn ∼ Bin(n, pn ) and
Zn ∼ Poi(λ). Using the Poisson approximation (Theorem 4.1.18) and a Taylor
expansion,

kµSn − µZn kTV ≤ (1/2) Σ_{i≤n} (− log(1 − pn ))^2
              = (1/2) Σ_{i≤n} (λ/n + O(n^{−2}))^2
              = λ^2/(2n) + O(n^{−2}).

We can further couple D1 (n) and Sn as

( Σ_{i≤n−1} Xi , Σ_{i≤n} Xi ),

where the Xi s are i.i.d. Ber(pn ), that is, Bernoulli with parameter pn . By the
coupling inequality (Lemma 4.1.11),

kµD1 (n) − µSn kTV ≤ P[ Σ_{i≤n−1} Xi ≠ Σ_{i≤n} Xi ] = P[Xn = 1] = pn = λ/n.

By the triangle inequality for the total variation distance (Lemma 4.1.10) and
the bounds above,

(1/2) Σ_{d≥0} |Pn,pn [D1 (n) = d] − fd | = kµD1 (n) − µZn kTV
    ≤ kµD1 (n) − µSn kTV + kµSn − µZn kTV
    ≤ (λ + λ^2/2)/n + O(n^{−2}).

Therefore, for all d,

|(1/n) En,pn [Nd (n)] − fd | ≤ (2λ + λ^2)/n + O(n^{−2}) → 0,

as n → +∞.

Lemma 4.1.23 (Concentration around the mean).

Pn,pn [ |(1/n) Nd (n) − (1/n) En,pn [Nd (n)]| ≥ ε ] ≤ (2λ + 1)/(ε^2 n), ∀d ≥ 1, ∀n.

Proof. By Chebyshev’s inequality, for all ε > 0

Pn,pn [ |(1/n) Nd (n) − (1/n) En,pn [Nd (n)]| ≥ ε ] ≤ Varn,pn [(1/n) Nd (n)] / ε^2 . (4.1.1)

To compute the variance, we note that

Varn,pn [(1/n) Nd (n)]
    = (1/n^2) En,pn [ ( Σ_{i≤n} 1{Di (n)=d} )^2 ] − (1/n^2) (n Pn,pn [D1 (n) = d])^2
    = (1/n^2) { n(n − 1) Pn,pn [D1 (n) = d, D2 (n) = d]
                + n Pn,pn [D1 (n) = d] − n^2 Pn,pn [D1 (n) = d]^2 }
    ≤ 1/n + { Pn,pn [D1 (n) = d, D2 (n) = d] − Pn,pn [D1 (n) = d]^2 }, (4.1.2)
where we used the crude bound Pn,pn [D1 (n) = d] ≤ 1. We bound the last line
using a neat coupling argument. Let Y1 and Y2 be independent Bin(n − 2, pn ), and
let X1 and X2 be independent Ber(pn ). By separating the contribution of the edge

between 1 and 2 from those of edges to other vertices, we see that the joint degrees
(D1 (n), D2 (n)) have the same distribution as (X1 + Y1 , X1 + Y2 ). So the term in
curly bracket above is equal to

P[(X1 + Y1 , X1 + Y2 ) = (d, d)] − P[X1 + Y1 = d]^2
    = P[(X1 + Y1 , X1 + Y2 ) = (d, d)] − P[(X1 + Y1 , X2 + Y2 ) = (d, d)]
    ≤ P[(X1 + Y1 , X1 + Y2 ) = (d, d), (X1 + Y1 , X2 + Y2 ) ≠ (d, d)]
    = P[(X1 + Y1 , X1 + Y2 ) = (d, d), X2 + Y2 ≠ d]
    = P[X1 = 0, Y1 = Y2 = d, X2 = 1] + P[X1 = 1, Y1 = Y2 = d − 1, X2 = 0]
    ≤ P[X2 = 1] + P[X1 = 1]
    = 2λ/n.

Plugging back into (4.1.2) we get Varn,pn [(1/n) Nd (n)] ≤ (2λ + 1)/n, and (4.1.1) gives the
claim.

Combining the lemmas concludes the proof of Theorem 4.1.21.
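Theorem 4.1.21 can also be observed in a quick simulation. The sketch below (Python; the seed, n, λ, and tolerance are arbitrary choices of ours) samples one Gn,pn and compares empirical degree fractions to fd; at finite n the match is only approximate, with fluctuations of order n^{−1/2}.

```python
import math
import random

def degree_fractions(n, lam, rng, dmax=8):
    # sample G(n, lam/n) and return the fraction of vertices of degree d, d <= dmax
    p = lam / n
    deg = [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                deg[i] += 1
                deg[j] += 1
    return [sum(1 for d in deg if d == k) / n for k in range(dmax + 1)]

lam, n = 2.0, 500
frac = degree_fractions(n, lam, random.Random(0))
f = [math.exp(-lam) * lam ** d / math.factorial(d) for d in range(9)]
# frac[d] is close to f[d] for each d, up to O(n^{-1/2}) fluctuations
```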

4.2 Stochastic domination


In comparing two probability measures, a natural relationship is that of “domina-
tion.” For instance, let (Xi )_{i=1}^n be independent Z+ -valued random variables with

P[Xi ≥ 1] ≥ p,

and let S = Σ_{i=1}^n Xi be their sum. Now consider a separate random variable

S∗ ∼ Bin(n, p).

It is intuitively clear that one should be able to bound S from below by analyzing S∗
instead—which may be considerably easier. Indeed, in some sense, S “dominates”
S∗ , that is, S should have a tendency to be bigger than S∗ . One expects more
specifically that
P[S > x] ≥ P[S∗ > x].
Coupling provides a formal characterization of this notion, as we detail in this
section.
In particular we study an important special case known as positive associations.
Here a measure “dominates itself” in the following sense: conditioning on certain
events makes other events more likely. That concept is formalized in Section 4.2.3.

Figure 4.2: The law of X, represented here by its cumulative distribution function
FX in solid, stochastically dominates the law of Y , in dashed. The construction of
a monotone coupling, (X̂, Ŷ ) := (F_X^{−1}(U ), F_Y^{−1}(U )) where U is uniform in [0, 1],
is also depicted.

4.2.1 Definitions
We start with the simpler case of real random variables then consider partially
ordered sets, a natural setting for this concept.

Ordering of real random variables Recall that, intuitively, stochastic domina-
tion captures the idea that one variable “tends to take larger values” than the other.
For real random variables, it is defined in terms of tail probabilities, or equivalently
in terms of cumulative distribution functions. See Figure 4.2 for an illustration.

Definition 4.2.1 (Stochastic domination). Let µ and ν be probability measures on
R. The measure µ is said to stochastically dominate ν, denoted by µ ⪰ ν, if for all
x∈R

µ((x, +∞)) ≥ ν((x, +∞)).

A real random variable X stochastically dominates Y , denoted by X ⪰ Y , if the
law of X dominates the law of Y .

Example 4.2.2 (Bernoulli vs. Poisson). Let X ∼ Poi(λ) be Poisson with mean
λ > 0 and let Y be a Bernoulli trial with success probability p ∈ (0, 1). In order

for X to stochastically dominate Y , we need to have

P[X > ℓ] ≥ P[Y > ℓ], ∀ℓ ≥ 0.

This is always true for ℓ ≥ 1 since P[X > ℓ] > 0 but P[Y > ℓ] = 0. So it remains
to consider the case ℓ = 0. We have

1 − e^{−λ} = P[X > 0] ≥ P[Y > 0] = p,

if and only if

λ ≥ − log(1 − p). J
Note that stochastic domination does not require X and Y to be defined on
the same probability space. However the connection to coupling arises from the
following characterization.
Theorem 4.2.3 (Coupling and stochastic domination). The real random variable
X stochastically dominates Y if and only if there is a coupling (X̂, Ŷ ) of X and
Y such that
P[X̂ ≥ Ŷ ] = 1. (4.2.1)
We refer to (X̂, Ŷ ) as a monotone coupling of X and Y .

Proof. Suppose there is such a coupling. Then for all x ∈ R

P[Y > x] = P[Ŷ > x] = P[X̂ ≥ Ŷ > x] ≤ P[X̂ > x] = P[X > x].
For the other direction, define the cumulative distribution functions FX (x) =
P[X ≤ x] and FY (x) = P[Y ≤ x]. Assume X ⪰ Y . The idea of the proof
is to use the following standard way of generating a real random variable (see
Theorem B.2.7)

X =_d F_X^{−1}(U ), (4.2.2)

where U is a [0, 1]-valued uniform random variable and

F_X^{−1}(u) := inf{x ∈ R : FX (x) ≥ u}

is a generalized inverse. It is natural to construct a coupling of X and Y by simply
using the same uniform random variable U in this representation, that is, we define
X̂ = F_X^{−1}(U ) and Ŷ = F_Y^{−1}(U ). See Figure 4.2. By (4.2.2), this is a coupling
of X and Y . It remains to check (4.2.1). Because FX (x) ≤ FY (x) for all x by
definition of stochastic domination, by the definition of the generalized inverse,

P[X̂ ≥ Ŷ ] = P[F_X^{−1}(U ) ≥ F_Y^{−1}(U )] = 1,

as required.
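The inverse-CDF construction from the proof is concrete enough to code directly. Below is a minimal sketch (Python; the helper and the Bernoulli parameters are ours) checking, over a grid of values of u, that the coupled pair satisfies X̂ ≥ Ŷ when X ∼ Ber(r) dominates Y ∼ Ber(q).

```python
def generalized_inverse(values, cdf, u):
    # F^{-1}(u) = inf{x : F(x) >= u}, for a law on the sorted points `values`
    for x, f_x in zip(values, cdf):
        if f_x >= u:
            return x
    return values[-1]

r, q = 0.7, 0.4
FX = [1 - r, 1.0]   # CDF of Ber(r) at 0 and 1
FY = [1 - q, 1.0]   # CDF of Ber(q); FX <= FY pointwise since r > q
coupled = [(generalized_inverse([0, 1], FX, i / 1000),
            generalized_inverse([0, 1], FY, i / 1000)) for i in range(1, 1000)]
# every pair satisfies x_hat >= y_hat: the monotone coupling of Theorem 4.2.3
```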

Example 4.2.4. Returning to the example in the first paragraph of Section 4.2,
let (Xi )_{i=1}^n be independent Z+ -valued random variables with P[Xi ≥ 1] ≥ p and
consider their sum S := Σ_{i=1}^n Xi . Further let S∗ ∼ Bin(n, p). Write S∗ as the sum
Σ_{i=1}^n Yi where (Yi ) are independent Bernoulli variables with P[Yi = 1] = p. To
couple S and S∗ , first set (Ŷi ) := (Yi ) and Ŝ∗ := Σ_{i=1}^n Ŷi . Let X̂i be 0 whenever
Ŷi = 0. Otherwise (i.e., if Ŷi = 1), generate X̂i according to the distribution of
Xi conditioned on {Xi ≥ 1}, independently of everything else. By construction
X̂i ≥ Ŷi almost surely for all i and as a result Σ_{i=1}^n X̂i =: Ŝ ≥ Ŝ∗ almost surely,
or S ⪰ S∗ by Theorem 4.2.3. That implies for instance that P[S > x] ≥ P[S∗ > x]
as we claimed earlier. A slight modification of this argument gives the following
useful fact about binomials

n ≥ m, q ≥ p =⇒ Bin(n, q) ⪰ Bin(m, p).

Exercise 4.2 asks for a formal proof. J

Example 4.2.5 (Poisson distribution). Let X ∼ Poi(µ) and Y ∼ Poi(ν) with
µ > ν. Recall that a sum of independent Poisson is Poisson (see Exercise 6.7). This
fact leads to a natural coupling: let Ŷ ∼ Poi(ν), Ẑ ∼ Poi(µ − ν) independently of
Ŷ , and X̂ = Ŷ + Ẑ. Then (X̂, Ŷ ) is a coupling and X̂ ≥ Ŷ a.s. because Ẑ ≥ 0.
Hence X ⪰ Y . J

We record two useful consequences of Theorem 4.2.3.

Corollary 4.2.6. Let X and Y be real random variables with X ⪰ Y and let
f : R → R be a non-decreasing function. Then f (X) ⪰ f (Y ) and furthermore,
provided E|f (X)|, E|f (Y )| < +∞, we have that

E[f (X)] ≥ E[f (Y )].

Proof. Let (X̂, Ŷ ) be the monotone coupling of X and Y whose existence is guar-
anteed by Theorem 4.2.3. Then f (X̂) ≥ f (Ŷ ) almost surely so that, provided the
expectations exist,

E[f (X)] = E[f (X̂)] ≥ E[f (Ŷ )] = E[f (Y )],

and furthermore (f (X̂), f (Ŷ )) is a monotone coupling of f (X) and f (Y ). Hence
f (X) ⪰ f (Y ).

Corollary 4.2.7. Let X1 , X2 be independent random variables. Let Y1 , Y2 be
independent random variables such that Xi ⪰ Yi , i = 1, 2. Then

X1 + X2 ⪰ Y1 + Y2 .

Proof. Let (X̂1 , Ŷ1 ) and (X̂2 , Ŷ2 ) be independent, monotone couplings of X1 , Y1
and X2 , Y2 on the same probability space. Then

X1 + X2 =_d X̂1 + X̂2 ≥ Ŷ1 + Ŷ2 =_d Y1 + Y2 .

Example 4.2.8 (Binomial vs. Poisson). A sum of n independent Poisson variables
with mean λ is Poi(nλ). A sum of n independent Bernoulli trials with success
probability p is Bin(n, p). Using Example 4.2.2 and Corollary 4.2.7, we get

λ ≥ − log(1 − p) =⇒ Poi(nλ) ⪰ Bin(n, p). (4.2.3)

The following special case will be useful later. Let 0 < Λ < 1 and let m be a
positive integer. Then

Λ/(m − 1) ≥ Λ/(m − Λ) = m/(m − Λ) − 1 ≥ log(m/(m − Λ)) = − log(1 − Λ/m),

where we used that log x ≤ x − 1 for all x ∈ R+ (see Exercise 1.16). So, setting
λ := Λ/(m − 1), p := Λ/m and n := m − 1 in (4.2.3), we get

Λ ∈ (0, 1) =⇒ Poi(Λ) ⪰ Bin(m − 1, Λ/m). (4.2.4)

J
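Implication (4.2.4) can be checked against the tail definition of stochastic domination. The sketch below (Python; Λ, m, and the truncation level are our choices) compares P[Poi(Λ) > x] to P[Bin(m − 1, Λ/m) > x].

```python
import math

def poi_tail(mean, x, kmax=60):
    # P[Poi(mean) > x]; mass beyond kmax is negligible for small means
    return sum(math.exp(-mean) * mean ** k / math.factorial(k)
               for k in range(x + 1, kmax))

def bin_tail(n, p, x):
    # P[Bin(n, p) > x]
    return sum(math.comb(n, k) * p ** k * (1 - p) ** (n - k)
               for k in range(x + 1, n + 1))

Lam, m = 0.7, 5
tails_ok = all(poi_tail(Lam, x) >= bin_tail(m - 1, Lam / m, x) - 1e-12
               for x in range(m))
```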

Ordering on partially ordered sets The definition of stochastic domination
hinges on the totally ordered nature of R. It also extends naturally to posets. Let
(X , ≤) be a poset, that is, for all x, y, z ∈ X :
- (Reflexivity) x ≤ x;

- (Antisymmetry) if x ≤ y and y ≤ x then x = y; and

- (Transitivity) if x ≤ y and y ≤ z then x ≤ z.


Throughout, we assume that X is a measurable space.
For instance the set {0, 1}F is a poset when equipped with the relation x ≤ y
if and only if xi ≤ yi for all i ∈ F , where x = (xi )i∈F and y = (yi )i∈F .
Equivalently the subsets of F , denoted by 2F , form a poset with the inclusion
relation.
A totally ordered set satisfies in addition that, for any x, y, we have either x ≤ y
or y ≤ x. That is not satisfied in the previous example.

Let F be a σ-algebra over the poset X . An event A ∈ F is increasing if x ∈ A
implies that any y ≥ x is also in A. A function f : X → R is increasing if
x ≤ y implies f (x) ≤ f (y). Some properties of increasing events are derived in
Exercise 4.4.
Definition 4.2.9 (Stochastic domination for posets). Let (X , ≤) be a poset and
let F be a σ-algebra on X . Let µ and ν be probability measures on (X , F).
The measure µ is said to stochastically dominate ν, denoted by µ ⪰ ν, if for all
increasing A ∈ F

µ(A) ≥ ν(A).

An X -valued random variable X stochastically dominates Y , denoted by X ⪰ Y ,
if the law of X dominates the law of Y .
As before, a monotone coupling (X̂, Ŷ ) of X and Y is one which satisfies X̂ ≥ Ŷ
almost surely.
Example 4.2.10 (Monotonicity of the percolation function). We briefly revisit
Example 4.1.3 to illustrate our definitions. Consider bond percolation on the d-
dimensional lattice Ld (Definition 1.2.1). Here the poset is the collection of all
subsets of edges, specifying the open edges, with the inclusion relation. Recall that
the percolation function is given by

θ(p) := Pp [|C0 | = +∞],

where C0 is the open cluster of the origin. We argued in Example 4.1.3 (see also
Section 2.2.4) that θ(p) is nondecreasing by considering the following alternative
representation of the percolation process under Pp : to each edge e, assign a uniform
[0, 1]-valued random variable Ue and declare the edge open if Ue ≤ p. Using the
same Ue s for two different values of p, say p1 < p2 , gives a monotone coupling of
the processes for p1 and p2 . J
The existence of a monotone coupling is perhaps more surprising for posets.
We prove the result in the finite case only, which will be enough for our purposes.
Theorem 4.2.11 (Strassen’s theorem). Let X and Y be random variables taking
values in a finite poset (X , ≤) with the σ-algebra F = 2X . Then X ⪰ Y if and
only if there exists a monotone coupling (X̂, Ŷ ) of X and Y .
Proof. Suppose there is such a coupling. Then for all increasing A

P[Y ∈ A] = P[Ŷ ∈ A] = P[X̂ ≥ Ŷ ∈ A] ≤ P[X̂ ∈ A] = P[X ∈ A].

The proof in the other direction relies on the max-flow min-cut theorem (Theo-
rem 1.1.15). To see the connection with flows, let µX and µY be the laws of X and

Y respectively, and denote by ν their joint distribution under the desired coupling.
Noting that we want ν(x, y) > 0 only if x ≥ y, the marginal conditions on the
coupling read

Σ_{y: x≥y} ν(x, y) = µX (x), ∀x ∈ X , (4.2.5)

and

Σ_{x: x≥y} ν(x, y) = µY (y), ∀y ∈ X . (4.2.6)

These equations can be interpreted as flow-conservation constraints. Consider
the following directed graph. There are two vertices, (w, 1) and (w, 2), for each
element w in X with edges connecting each (x, 1) to those (y, 2)s with x ≥ y.
These edges have capacity +∞. In addition there is a source a and a sink z.
The source has a directed edge of capacity µX (x) to (x, 1) for each x ∈ X and,
similarly, each (y, 2) has a directed edge of capacity µY (y) to the sink. The ex-
istence of a monotone coupling will follow once we show that there is a flow of
strength 1 between a and z. Indeed, in that case, all edges from the source and
all edges to the sink must be at capacity. If we let ν(x, y) be the flow on edge
h(x, 1), (y, 2)i, the systems in (4.2.5) and (4.2.6) encode conservation of flow on
the vertices (X × {1}) ∪ (X × {2}). Hence the flow between X × {1} and X × {2}
yields the desired coupling. See Figure 4.3.
By the max-flow min-cut theorem (Theorem 1.1.15), it suffices to show that a
minimum cut has capacity 1. Such a cut is of course obtained by choosing all edges
out of the source. So it remains to show that no cut has capacity less than 1. This
is where we use the fact that µX (A) ≥ µY (A) for all increasing A. Because the
edges between X × {1} and X × {2} have infinite capacity, they cannot be used
in a minimum cut. So we can restrict our attention to those cuts containing edges
from a to A∗ × {1} and from Z∗ × {2} to z for subsets A∗ , Z∗ ⊆ X .
We must have
A∗ ⊇ {x ∈ X : ∃y ∈ Z∗c , x ≥ y},
to block all paths of the form a ∼ (x, 1) ∼ (y, 2) ∼ z with x and y as in the
previous display; here Z∗c = X \ Z∗ . In fact, for a minimum cut, we further have

A∗ = {x ∈ X : ∃y ∈ Z∗c , x ≥ y},

as adding an x not satisfying this property is redundant. In particular A∗ is in-
creasing: if x1 ∈ A∗ and x2 ≥ x1 , then ∃y ∈ Z∗c such that x1 ≥ y and, since
x2 ≥ x1 ≥ y, we also have x2 ∈ A∗ .
Observe further that, because y ≥ y, the set A∗ also includes Z∗c . If it were
the case that A∗ 6= Z∗c , then we could construct a cut with lower or equal capacity

Figure 4.3: Construction of a monotone coupling through the max-flow repre-
sentation for independent Bernoulli pairs with parameters r (on the left) and
q < r (on the right). Edge labels indicate capacity. Edges without labels
have infinite capacity. The dotted edges depict a suboptimal cut. The dark ver-
tices correspond to the sets A∗ and Z∗ for this cut. The capacity of the cut is
r^2 + r(1 − r) + (1 − q)^2 + (1 − q)q = r + (1 − q) > r + (1 − r) = 1.

by fixing A∗ and setting Z∗ := A∗^c : suppose A∗ ∩ Z∗ is nonempty; because A∗ is
increasing, any y ∈ A∗ ∩ Z∗ is such that paths of the form a ∼ (x, 1) ∼ (y, 2) ∼ z
with x ≥ y are cut by x ∈ A∗ ; so we do not need those ys in Z∗ . Hence, for a
minimum cut, we can assume that in fact A∗ = Z∗c . The capacity of the cut is

µX (A∗ ) + µY (Z∗ ) = µX (A∗ ) + 1 − µY (A∗ ) = 1 + (µX (A∗ ) − µY (A∗ )) ≥ 1,

where the term in parenthesis is nonnegative by assumption and the fact that A∗ is
increasing. That concludes the proof.
Remark 4.2.12. Strassen’s theorem (Theorem 4.2.11) holds more generally on Polish
spaces with a closed partial order. See, e.g., [Lin02, Section IV.1.2] for the details.

The proof of Corollary 4.2.6 immediately extends to:

Corollary 4.2.13. Let X and Y be X -valued random variables with X ⪰ Y and
let f : X → R be an increasing function. Then f (X) ⪰ f (Y ) and furthermore,
provided E|f (X)|, E|f (Y )| < +∞, we have that

E[f (X)] ≥ E[f (Y )].

Ordering of Markov chains Stochastic domination also arises in the context of
Markov chains. We begin with an example. Recall the notion of a Markovian
coupling from Definition 4.1.20. The following coupling of Markov chains is not
Markovian.

Example 4.2.14 (Lazier chain). Consider a random walk (Xt ) on the network
N = ((V, E), c) where V = {0, 1, . . . , n} and i ∼ j if and only if |i − j| ≤ 1
(including self-loops). Let N′ = ((V, E), c′) be a modified version of N on the
same graph where, for all i, c(i, i) ≤ c′(i, i). That is, if (Xt′ ) is random walk on
N′, then (Xt′ ) is “lazier” than (Xt ) in that it is more likely to stay put. To simplify
the calculations, assume c(i, i) = 0 for all i.
Assume that both (Xt ) and (Xt′ ) start at i0 and define Ms := max_{t≤s} Xt and
Ms′ := max_{t≤s} Xt′ . Since (Xt′ ) “travels less” than (Xt ) the following claim is
intuitively obvious:

Claim 4.2.15.
Ms ⪰ Ms′ .

We prove this by producing a monotone coupling. First set (X̂t )t∈Z+ := (Xt )t∈Z+ .
We then generate (X̂t′ )t∈Z+ as a “sticky” version of (X̂t )t∈Z+ . That is, (X̂t′ ) follows
exactly the same transitions as (X̂t ) (including the self-loops), but at each time it
opts to stay where it currently is, say state j, for an extra time step with probability

αj := c′(j, j) / Σ_{i: i∼j} c′(i, j),

which is in [0, 1] by assumption. Marginally, (X̂t′ ) is a random walk on N′. Indeed,
we have by construction of the coupling that the probability of staying put when in
state j is

αj = c′(j, j) / Σ_{i: i∼j} c′(i, j),

and, for k ≠ j with k ∼ j, the probability of moving to state k when in state j is

(1 − αj ) c(j, k) / Σ_{i: i∼j} c(i, j)
    = ( ( Σ_{i: i∼j} c′(i, j) − c′(j, j) ) / Σ_{i: i∼j} c′(i, j) ) · ( c(j, k) / Σ_{i: i∼j} c(i, j) )
    = ( Σ_{i: i∼j} c(i, j) / Σ_{i: i∼j} c′(i, j) ) · ( c′(j, k) / Σ_{i: i∼j} c(i, j) )
    = c′(j, k) / Σ_{i: i∼j} c′(i, j),

where, on the second line, we used that c′(i, j) = c(i, j) for i ≠ j and i ∼ j. This
coupling satisfies almost surely

M̂s := max_{t≤s} X̂t ≥ max_{t≤s} X̂t′ =: M̂s′

because (X̂t′ )t≤s visits a subset of the states visited by (X̂t )t≤s . In other words
(M̂s , M̂s′ ) is a monotone coupling of (Ms , Ms′ ) and this proves the claim. J
s s

As we indicated, the previous example involved an “asynchronous” coupling of
the chains. Often a simpler step-by-step approach—that is, through the construc-
tion of a Markovian coupling—is possible. We specialize the notion of stochastic
domination to that important case.

Definition 4.2.16 (Stochastic domination of Markov chains). Let P and Q be tran-
sition matrices on a finite or countably infinite poset (X , ≤). The transition matrix
Q is said to stochastically dominate the transition matrix P if

x ≤ y =⇒ P (x, ·) ⪯ Q(y, ·). (4.2.7)

If the above condition is satisfied for P = Q, we say that P is stochastically
monotone.

The analogue of Strassen’s theorem in this case is the following theorem, which
we prove in the finite case only again.

Theorem 4.2.17. Let (Xt )t∈Z+ and (Yt )t∈Z+ be Markov chains on a finite poset
(X , ≤) with transition matrices P and Q respectively. Assume that Q stochas-
tically dominates P . Then for all x0 ≤ y0 there is a coupling (X̂t , Ŷt ) of (Xt )
started at x0 and (Yt ) started at y0 such that almost surely

X̂t ≤ Ŷt , ∀t.

Furthermore, if the chains are irreducible and have stationary distributions π and
µ respectively, then π ⪯ µ.

Observe that, for a Markovian, monotone coupling to exist, it is not generally
enough for the weaker condition P (x, ·) ⪯ Q(x, ·) to hold for all x, as should
be clear from the proof. See also Exercise 4.3.

Proof of Theorem 4.2.17. Let

W := {(x, y) ∈ X × X : x ≤ y}.

For all (x, y) ∈ W, let R((x, y), ·) be the joint law of a monotone coupling of
P (x, ·) and Q(y, ·). Such a coupling exists by Strassen’s theorem and Condi-
tion (4.2.7). Let (X̂t , Ŷt ) be a Markov chain on W with transition matrix R started
at (x0 , y0 ). By construction, X̂t ≤ Ŷt for all t almost surely. That proves the first
half of the theorem.
For the second half, let A be increasing on X . Note that the first half implies
that for all s ≥ 1

P^s (x0 , A) = P[X̂s ∈ A] ≤ P[Ŷs ∈ A] = Q^s (y0 , A),

because X̂s ≤ Ŷs and A is increasing. Then, by a standard convergence result for
irreducible Markov chains (i.e., (1.1.5)),

π(A) = lim_{t→+∞} (1/t) Σ_{s≤t} P^s (x0 , A) ≤ lim_{t→+∞} (1/t) Σ_{s≤t} Q^s (y0 , A) = µ(A).

This proves the claim by definition of stochastic domination.

An example of application of this theorem is given in the next subsection.



4.2.2 . Ising model: boundary conditions

Consider the d-dimensional lattice Ld . Let Λ be a finite subset of vertices in Ld
and define X := {−1, +1}^Λ , which is a poset when equipped with the relation
σ ≤ σ′ if and only if σi ≤ σ′i for all i ∈ Λ. Generalizing Example 1.2.5, for
ξ ∈ {−1, +1}^{Ld} , the (ferromagnetic) Ising model on Λ with boundary conditions
ξ and inverse temperature β is the probability distribution over spin configurations
σ ∈ X given by

µ^ξ_{β,Λ}(σ) := (1/Z_{Λ,ξ}(β)) e^{−β H_{Λ,ξ}(σ)},

where

H_{Λ,ξ}(σ) := − Σ_{i∼j: i,j∈Λ} σi σj − Σ_{i∼j: i∈Λ, j∉Λ} σi ξj

is the Hamiltonian and

Z_{Λ,ξ}(β) := Σ_{σ∈X} e^{−β H_{Λ,ξ}(σ)},

is the partition function. For shorthand, we occasionally write + and − instead of
+1 and −1.
For the all-(+1) and all-(−1) boundary conditions we denote the measure
above by µ^+_{β,Λ}(σ) and µ^−_{β,Λ}(σ) respectively. In this section, we show that these
two measures are “extreme” in the following sense.

Claim 4.2.18. For all boundary conditions ξ ∈ {−1, +1}^{Ld} ,

µ^+_{β,Λ} ⪰ µ^ξ_{β,Λ} ⪰ µ^−_{β,Λ} .

Intuitively, because the ferromagnetic Ising model favors spin agreement, the all-
(+1) boundary condition tends to produce more +1s which in turn makes increas-
ing events more likely. And vice versa.
The idea of the proof is to use Theorem 4.2.17 with a suitable choice of Markov
chain.

Stochastic domination Recall that, in this context, vertices are often referred to
as sites. Adapting Definition 1.2.8, we consider the single-site Glauber dynamics,
which is the Markov chain on X which, at each time, selects a site i ∈ Λ uniformly
at random and updates the spin σi according to µ^ξ_{β,Λ}(σ) conditioned on agreeing
with σ at all sites in Λ\{i}. Specifically, for γ ∈ {−1, +1}, i ∈ Λ, and σ ∈ X , let
σ^{i,γ} be the configuration σ with the state at i being set to γ. Then, letting n = |Λ|,
the transition matrix of the Glauber dynamics is

Q^ξ_{β,Λ}(σ, σ^{i,γ}) := (1/n) · e^{γβ S^ξ_i(σ)} / (e^{−β S^ξ_i(σ)} + e^{β S^ξ_i(σ)}),

where

S^ξ_i(σ) := Σ_{j: j∼i, j∈Λ} σj + Σ_{j: j∼i, j∉Λ} ξj .

All other transitions have probability 0. It is straightforward to check that Q^ξ_{β,Λ} is
a stochastic matrix.
This chain is clearly irreducible. It is also reversible with respect to µ^ξ_{β,Λ}.
Indeed, for all σ ∈ X and i ∈ Λ, let

S^ξ_{≠i}(σ) := H_{Λ,ξ}(σ^{i,+}) + S^ξ_i(σ) = H_{Λ,ξ}(σ^{i,−}) − S^ξ_i(σ).

Arguing as in Theorem 1.2.9, we have

µ^ξ_{β,Λ}(σ^{i,−}) Q^ξ_{β,Λ}(σ^{i,−}, σ^{i,+})
    = (e^{−β S^ξ_{≠i}(σ)} e^{−β S^ξ_i(σ)} / Z_{Λ,ξ}(β)) · e^{β S^ξ_i(σ)} / (n[e^{−β S^ξ_i(σ)} + e^{β S^ξ_i(σ)}])
    = e^{−β S^ξ_{≠i}(σ)} / (n Z_{Λ,ξ}(β) [e^{−β S^ξ_i(σ)} + e^{β S^ξ_i(σ)}])
    = (e^{−β S^ξ_{≠i}(σ)} e^{β S^ξ_i(σ)} / Z_{Λ,ξ}(β)) · e^{−β S^ξ_i(σ)} / (n[e^{−β S^ξ_i(σ)} + e^{β S^ξ_i(σ)}])
    = µ^ξ_{β,Λ}(σ^{i,+}) Q^ξ_{β,Λ}(σ^{i,+}, σ^{i,−}).

In particular µ^ξ_{β,Λ} is the stationary distribution of Q^ξ_{β,Λ}.

Claim 4.2.19.

ξ′ ≥ ξ =⇒ Q^{ξ′}_{β,Λ} stochastically dominates Q^ξ_{β,Λ}. (4.2.8)

Proof. Because the Glauber dynamics updates a single site at a time, establishing
stochastic domination reduces to checking simple one-site inequalities.

Lemma 4.2.20. To establish (4.2.8), it suffices to show that for all i and all σ ≤ τ

Q^ξ_{β,Λ}(σ, σ^{i,+}) ≤ Q^{ξ′}_{β,Λ}(τ, τ^{i,+}). (4.2.9)

Proof. Assume (4.2.9) holds. Let A be increasing in X and let σ ≤ τ . Then, for
the single-site Glauber dynamics, we have

Q^ξ_{β,Λ}(σ, A) = Q^ξ_{β,Λ}(σ, A ∩ Bσ), (4.2.10)

where

Bσ := {σ^{i,γ} : i ∈ Λ, γ ∈ {−1, +1}},

and similarly for τ , ξ′. Moreover, because A is increasing and τ ≥ σ,

σ^{i,γ} ∈ A =⇒ τ^{i,γ} ∈ A, (4.2.11)

and

σ^{i,−} ∈ A =⇒ σ^{i,+} ∈ A. (4.2.12)

Letting

I^±_{σ,A} := {i ∈ Λ : σ^{i,−} ∈ A},   I^+_{σ,A} := {i ∈ Λ : σ^{i,−} ∉ A, σ^{i,+} ∈ A},

and similarly for τ , we have by (4.2.9), (4.2.10), (4.2.11), and (4.2.12),

Q^ξ_{β,Λ}(σ, A) = Q^ξ_{β,Λ}(σ, A ∩ Bσ)
    = Σ_{i∈I^+_{σ,A}} Q^ξ_{β,Λ}(σ, σ^{i,+}) + Σ_{i∈I^±_{σ,A}} [Q^ξ_{β,Λ}(σ, σ^{i,−}) + Q^ξ_{β,Λ}(σ, σ^{i,+})]
    ≤ Σ_{i∈I^+_{σ,A}} Q^{ξ′}_{β,Λ}(τ, τ^{i,+}) + Σ_{i∈I^±_{σ,A}} [Q^ξ_{β,Λ}(σ, σ^{i,−}) + Q^ξ_{β,Λ}(σ, σ^{i,+})]
    = Σ_{i∈I^+_{σ,A}} Q^{ξ′}_{β,Λ}(τ, τ^{i,+}) + Σ_{i∈I^±_{σ,A}} (1/n)
    ≤ Σ_{i∈I^+_{τ,A}} Q^{ξ′}_{β,Λ}(τ, τ^{i,+}) + Σ_{i∈I^±_{τ,A}} [Q^{ξ′}_{β,Λ}(τ, τ^{i,−}) + Q^{ξ′}_{β,Λ}(τ, τ^{i,+})]
    = Q^{ξ′}_{β,Λ}(τ, A),

as claimed, where on the fifth line we used that I^+_{σ,A} ⊆ I^+_{τ,A} ∪ I^±_{τ,A} by (4.2.11) and
that Q^{ξ′}_{β,Λ}(τ, τ^{i,+}) ≤ 1/n for all i (in particular for i ∈ I^+_{τ,A} \ I^+_{σ,A}).

Returning to the proof of Claim 4.2.19, observe that

\[ Q^\xi_{\beta,\Lambda}(\sigma, \sigma^{i,+}) = \frac{1}{n} \cdot \frac{e^{\beta S^\xi_i(\sigma)}}{e^{-\beta S^\xi_i(\sigma)} + e^{\beta S^\xi_i(\sigma)}} = \frac{1}{n} \cdot \frac{1}{e^{-2\beta S^\xi_i(\sigma)} + 1}, \]

which is increasing in S^ξ_i(σ). Now σ ≤ τ and ξ ≤ ξ′ imply that S^ξ_i(σ) ≤ S^{ξ′}_i(τ). That proves the claim by Lemma 4.2.20.

Finally:

Proof of Claim 4.2.18. Combining Theorem 4.2.17 and Claim 4.2.19 gives the re-
sult.

Remark 4.2.21. One can make sense of the limit of µ^+_{β,Λ} and µ^−_{β,Λ} as |Λ| → +∞, which is known as an infinite-volume Gibbs measure. For more, see for example [RAS15, Chapters 7-10].

Observe that we have not used any special property of the d-dimensional lattice. In fact, Claim 4.2.18 holds for any countable, locally finite graph with positive coupling constants. We give another proof in Example 4.2.33.

4.2.3 Correlation inequalities: FKG and Holley’s inequalities


A special case of stochastic domination is positive associations. In this section,
we restrict ourselves to posets of the form {0, 1}F for F finite. We begin with an
example.

Example 4.2.22 (Erdős-Rényi graph: positive associations). Consider an Erdős-Rényi graph G ∼ G_{n,p}. Let E = {{x, y} : x, y ∈ [n], x ≠ y}. Think of G as
taking values in the poset ({0, 1}E , ≤) where a 1 indicates that the corresponding
edge is present. In fact observe that the law of G, which we denote as usual by Pn,p ,
is a product measure on {0, 1}E . The event A that G is connected is increasing
because adding edges cannot disconnect an already connected graph. So is the
event B of having a chromatic number larger than 4. Intuitively then, conditioning
on A makes B more likely: the occurrence of A tends to be accompanied with a
larger number of edges which in turn makes B more probable.
This is an example of a more general phenomenon. That is, for any non-empty
increasing events A and B, we have:

Claim 4.2.23.
Pn,p [B | A] ≥ Pn,p [B]. (4.2.13)

Or, put differently, the conditional measure Pn,p [ · | A] stochastically dominates


the unconditional measure Pn,p [ · ]. This is a special case of what is known as
Harris’ inequality, proved below. Note that (4.2.13) is equivalent to Pn,p [A ∩ B] ≥
Pn,p [A] Pn,p [B], that is, to the fact that A and B are positively correlated. J

More generally:

Definition 4.2.24 (Positive associations). Let µ be a probability measure on {0, 1}^F where F is finite. Then µ is said to have positive associations, or is positively associated, if for all increasing functions f, g : {0, 1}^F → ℝ,

\[ \mu(fg) \geq \mu(f)\mu(g), \]

where

\[ \mu(h) := \sum_{\omega \in \{0,1\}^F} \mu(\omega)\, h(\omega). \]

In particular, for any increasing events A and B it holds that

\[ \mu(A \cap B) \geq \mu(A)\mu(B), \]

that is, A and B are positively correlated. Denoting by µ(A | B) the conditional probability of A given B, this is equivalent to

\[ \mu(A \mid B) \geq \mu(A). \]
Remark 4.2.25. Note that positive associations concern only "monotone" events. See Remark 4.2.45.
Remark 4.2.26. A notion of negative associations, which is a somewhat more delicate
concept, was defined in Remark 3.3.43. See also [Pem00].
Let µ be positively associated. Note that if A and B are decreasing, that is, their complements are increasing (see Exercise 4.4), then

\begin{align*}
\mu(A \cap B) &= 1 - \mu(A^c \cup B^c) \\
&= 1 - \mu(A^c) - \mu(B^c) + \mu(A^c \cap B^c) \\
&\geq 1 - \mu(A^c) - \mu(B^c) + \mu(A^c)\mu(B^c) \\
&= \mu(A)\mu(B),
\end{align*}

or µ(A | B) ≥ µ(A). Similarly, if A is increasing and B is decreasing, we have µ(A ∩ B) ≤ µ(A)µ(B), or

\[ \mu(A \mid B) \leq \mu(A). \tag{4.2.14} \]
Harris' inequality states that product measures on {0, 1}^F have positive associations. We prove a more general result known as the FKG inequality. For two configurations ω, ω′ in {0, 1}^F, we let ω ∧ ω′ and ω ∨ ω′ be the coordinatewise minimum and maximum of ω and ω′.

Definition 4.2.27 (FKG condition). Let X = {0, 1}^F where F is finite. A positive probability measure µ on X satisfies the FKG condition if

\[ \mu(\omega \vee \omega')\, \mu(\omega \wedge \omega') \geq \mu(\omega)\, \mu(\omega'), \qquad \forall \omega, \omega' \in \mathcal{X}. \tag{4.2.15} \]

This property is also known as log-supermodularity. We call such a measure an FKG measure.

Theorem 4.2.28 (FKG inequality). Let X = {0, 1}^F where F is finite. Suppose µ is a positive probability measure on X satisfying the FKG condition. Then µ has positive associations.

Remark 4.2.29. Strict positivity is not in fact needed [FKG71]. The FKG condition is equivalent to a strong form of positive associations. See Exercise 4.8.

Note that product measures satisfy the FKG condition with equality. Indeed, if µ(ω) is of the form ∏_{f∈F} µ_f(ω_f), then

\begin{align*}
\mu(\omega \vee \omega')\, \mu(\omega \wedge \omega')
&= \prod_f \mu_f(\omega_f \vee \omega'_f)\, \mu_f(\omega_f \wedge \omega'_f) \\
&= \prod_{f : \omega_f = \omega'_f} \mu_f(\omega_f)^2 \prod_{f : \omega_f \neq \omega'_f} \mu_f(\omega_f)\mu_f(\omega'_f) \\
&= \prod_{f : \omega_f = \omega'_f} \mu_f(\omega_f)\mu_f(\omega'_f) \prod_{f : \omega_f \neq \omega'_f} \mu_f(\omega_f)\mu_f(\omega'_f) \\
&= \mu(\omega)\, \mu(\omega').
\end{align*}

So the FKG inequality (Theorem 4.2.28) applies, for instance, to bond percola-
tion and the Erdős-Rényi random graph model. The pointwise nature of the FKG
condition also makes it relatively easy to check for measures which are defined
explicitly up to a normalizing constant, such as the Ising model.

Example 4.2.30 (Ising model with boundary conditions: checking FKG). Consider again the setting of Section 4.2.2. We work on the space X := {−1, +1}^Λ rather than {0, 1}^F. Fix a finite Λ ⊆ L^d, ξ ∈ {−1, +1}^{L^d}, and β > 0.

Claim 4.2.31. The measure µ^ξ_{β,Λ} satisfies the FKG condition and therefore has positive associations.

Intuitively, taking the minimum (or maximum) of two spin configurations tends to increase agreement and therefore leads to a higher likelihood. For σ, σ′ ∈ X, let τ^∨ := σ ∨ σ′ and τ^∧ := σ ∧ σ′. By taking logarithms in the FKG condition and rearranging, we arrive at

\[ H_{\Lambda,\xi}(\tau^\vee) + H_{\Lambda,\xi}(\tau^\wedge) \leq H_{\Lambda,\xi}(\sigma) + H_{\Lambda,\xi}(\sigma'), \tag{4.2.16} \]

and we see that proving the claim boils down to checking an inequality for each term in the Hamiltonian (which, confusingly, has a negative sign in it).

When i ∈ Λ and j ∉ Λ with i ∼ j, we have

\[ \tau^\vee_i \xi_j + \tau^\wedge_i \xi_j = (\tau^\vee_i + \tau^\wedge_i)\xi_j = (\sigma_i + \sigma'_i)\xi_j = \sigma_i \xi_j + \sigma'_i \xi_j. \tag{4.2.17} \]

For i, j ∈ Λ with i ∼ j, note first that the case σ_j = σ′_j reduces to the previous calculation (with σ_j = σ′_j playing the role of ξ_j), so we assume σ_i ≠ σ′_i and σ_j ≠ σ′_j. Then

\[ \tau^\vee_i \tau^\vee_j + \tau^\wedge_i \tau^\wedge_j = (+1)(+1) + (-1)(-1) = 2 \geq \sigma_i \sigma_j + \sigma'_i \sigma'_j, \]

since 2 is the largest value the rightmost expression ever takes. We have established (4.2.16), which implies the claim.
Again, we have not used any special property of the lattice and the same result
holds for countable, locally finite graphs with positive coupling constants. Note
however that in the anti-ferromagnetic case, that is, if we multiply the Hamiltonian
by −1, the above argument does not work. Indeed there is no reason to expect
positive associations in that case. J
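The FKG condition (4.2.15) for the Ising measure can be checked exhaustively on a small instance. The sketch below (illustrative; three sites on a path with all-plus boundary spins at both ends, β = 0.5) verifies log-supermodularity for every pair of configurations.

```python
import itertools
import math

# Exact check of the FKG condition for a small Ising model:
# three sites on a path, with boundary spins fixed to +1 at both ends.
beta = 0.5

def hamiltonian(s):
    # interior couplings plus the two boundary terms (boundary spins = +1)
    return -(s[0] * s[1] + s[1] * s[2]) - s[0] - s[2]

configs = list(itertools.product([-1, +1], repeat=3))
w = {s: math.exp(-beta * hamiltonian(s)) for s in configs}
Z = sum(w.values())
mu = {s: v / Z for s, v in w.items()}

def join(s, t):  # coordinatewise maximum
    return tuple(max(a, b) for a, b in zip(s, t))

def meet(s, t):  # coordinatewise minimum
    return tuple(min(a, b) for a, b in zip(s, t))

for s in configs:
    for t in configs:
        # log-supermodularity: mu(s v t) mu(s ^ t) >= mu(s) mu(t)
        assert mu[join(s, t)] * mu[meet(s, t)] >= mu[s] * mu[t] - 1e-12
```

Flipping the sign of the interaction terms (the anti-ferromagnetic case) makes the check fail, matching the remark above.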

The FKG inequality in turn follows from a more general result known as Holley's inequality.

Theorem 4.2.32 (Holley's inequality). Let X = {0, 1}^F where F is finite. Suppose µ₁ and µ₂ are positive probability measures on X satisfying

\[ \mu_2(\omega \vee \omega')\, \mu_1(\omega \wedge \omega') \geq \mu_2(\omega)\, \mu_1(\omega'), \qquad \forall \omega, \omega' \in \mathcal{X}. \tag{4.2.18} \]

Then µ₁ ≼ µ₂.

Before proving Holley’s inequality (Theorem 4.2.32), we check that it indeed


implies the FKG inequality. See Exercise 4.5 for an elementary proof in the inde-
pendent case, that is, of Harris’ inequality.

Proof of Theorem 4.2.28. Assume that µ satisfies the FKG condition and let f, g be increasing functions. Because of our restriction to positive measures in Holley's inequality, we will work with positive functions. This is done without loss of generality. Indeed, letting 0 be the all-0 vector, note that f and g are increasing if and only if f′ := f − f(0) + 1 > 0 and g′ := g − g(0) + 1 > 0 are increasing and that, moreover,

\begin{align*}
\mu(f'g') - \mu(f')\mu(g') &= \mu([f' - \mu(f')][g' - \mu(g')]) \\
&= \mu([f - \mu(f)][g - \mu(g)]) \\
&= \mu(fg) - \mu(f)\mu(g).
\end{align*}

In Holley's inequality, we let µ₁ := µ and define the positive probability measure

\[ \mu_2(\omega) := \frac{g(\omega)\mu(\omega)}{\mu(g)}. \]

We check that µ₁ and µ₂ satisfy the conditions of Theorem 4.2.32. Note that ω′ ≤ ω ∨ ω′ for any ω so that, because g is increasing, we have g(ω′) ≤ g(ω ∨ ω′). Hence, for any ω, ω′,

\begin{align*}
\mu_1(\omega)\mu_2(\omega')
&= \mu(\omega)\, \frac{g(\omega')\mu(\omega')}{\mu(g)} \\
&= \mu(\omega)\mu(\omega')\, \frac{g(\omega')}{\mu(g)} \\
&\leq \mu(\omega \wedge \omega')\mu(\omega \vee \omega')\, \frac{g(\omega \vee \omega')}{\mu(g)} \\
&= \mu_1(\omega \wedge \omega')\mu_2(\omega \vee \omega'),
\end{align*}

where on the third line we used the FKG condition satisfied by µ.

So Holley's inequality implies that µ₁ ≼ µ₂. Hence, since f is increasing, by Corollary 4.2.13,

\[ \mu(f) = \mu_1(f) \leq \mu_2(f) = \frac{\mu(fg)}{\mu(g)}, \]

and the theorem is proved.

Proof of Theorem 4.2.32. The idea of the proof is to use Theorem 4.2.17. This is similar to what was done in Section 4.2.2. Again we use a single-site dynamic. For x ∈ X and γ ∈ {0, 1}, we let x^{i,γ} be x with coordinate i set to γ. We write x ∼ y if ‖x − y‖₁ = 1. Let n = |F|. We use a scheme analogous to the Metropolis algorithm (see Example 1.1.30). A natural symmetric chain on X is to pick a coordinate uniformly at random, and flip its value. We modify it to guarantee reversibility with respect to the desired stationary distributions, namely µ₁ and µ₂.

For α, β > 0 small enough, the following transition matrix over X is irreducible and reversible with respect to its stationary distribution µ₂: for all i ∈ F, y ∈ X,

\begin{align*}
Q(y^{i,0}, y^{i,1}) &= \alpha\, \frac{1}{n}\, [\beta], \\
Q(y^{i,1}, y^{i,0}) &= \alpha\, \frac{1}{n} \left[\beta\, \frac{\mu_2(y^{i,0})}{\mu_2(y^{i,1})}\right], \\
Q(y, y) &= 1 - \sum_{z : z \sim y} Q(y, z).
\end{align*}

Let P be similarly defined with respect to µ1 with the same values of α and β.
For reasons that will be clear below, the value of 0 < β < 1 is chosen small
enough that the sum of the two expressions in brackets above is smaller than 1
for all y, i in both P and Q. The value of α > 0 is then chosen small enough that
P (y, y), Q(y, y) ≥ 0 for all y. Reversibility follows immediately from the first two
equations. We call the first transition above an upward transition and the second
one a downward transition.
By Theorem 4.2.17, it remains to show that Q stochastically dominates P . That
is, for any x ≤ y, we want to show that P (x, ·)  Q(y, ·). We produce a monotone
coupling (X̂, Ŷ ) of these two distributions. Because x ≤ y, our goal is never to
perform an upward transition in x simultaneously with a downward transition in y.
Observe that

\[ \frac{\mu_1(x^{i,0})}{\mu_1(x^{i,1})} \geq \frac{\mu_2(y^{i,0})}{\mu_2(y^{i,1})} \tag{4.2.19} \]

by taking ω = y^{i,0} and ω′ = x^{i,1} in Condition (4.2.18).
The coupling works as follows. Fix x ≤ y. With probability 1 − α, set
(X̂, Ŷ ) := (x, y). Otherwise, pick a coordinate i ∈ F uniformly at random. There
are several cases to consider depending on the coordinates xi , yi (with xi ≤ yi by
assumption):

- (x_i, y_i) = (0, 0): With probability β, perform an upward transition in both coordinates, that is, set X̂ := x^{i,1} and Ŷ := y^{i,1}. With probability 1 − β, set (X̂, Ŷ) := (x, y) instead.

- (x_i, y_i) = (1, 1): With probability β µ₂(y^{i,0})/µ₂(y^{i,1}), perform a downward transition in both coordinates, that is, set X̂ := x^{i,0} and Ŷ := y^{i,0}. With probability

\[ \beta \left[\frac{\mu_1(x^{i,0})}{\mu_1(x^{i,1})} - \frac{\mu_2(y^{i,0})}{\mu_2(y^{i,1})}\right], \]

perform a downward transition in x only, that is, set X̂ := x^{i,0} and Ŷ := y. With the remaining probability, set (X̂, Ŷ) := (x, y) instead. Note that (4.2.19) and our choice of β guarantee that this step is well-defined.

- (x_i, y_i) = (0, 1): With probability β, perform an upward transition in x only, that is, set X̂ := x^{i,1} and Ŷ := y. With probability β µ₂(y^{i,0})/µ₂(y^{i,1}), perform a downward transition in y only, that is, set X̂ := x and Ŷ := y^{i,0}. With the remaining probability, set (X̂, Ŷ) := (x, y) instead. Again our choice of β guarantees that this step is well-defined.

A little accounting shows that this is indeed a coupling of P (x, ·) and Q(y, ·).
By construction, this coupling satisfies X̂ ≤ Ŷ almost surely. An application of
Theorem 4.2.17 concludes the proof.

Example 4.2.33 (Ising model revisited). Holley’s inequality implies Claim 4.2.18.
To see this, just repeat the calculations of Example 4.2.30, where now (4.2.17) is
replaced with an inequality. See Exercise 4.6. J

4.2.4 Random graphs: Janson's inequality and application to the clique number in the Erdős-Rényi model
Let G = (V, E) ∼ G_{n,p_n} be an Erdős-Rényi graph (see Definition 1.2.2). By Claim 2.3.5, the property of being triangle-free has threshold n^{−1}. That is, the probability that G contains a triangle goes to 0 or 1 as n → +∞ according to whether p_n ≪ n^{−1} or p_n ≫ n^{−1} respectively. In this section, we investigate what happens "at the threshold," by which we mean that we take p_n = λ/n for some λ > 0 not depending on n.

For any subset S of three distinct vertices of G, let B_S be the event that S forms a triangle in G. So

\[ \varepsilon := P_{n,p_n}[B_S] = p_n^3 \to 0. \tag{4.2.20} \]

Denoting the set of unordered triples of distinct vertices by \binom{V}{3}, let X_n = \sum_{S \in \binom{V}{3}} \mathbf{1}_{B_S} be the number of triangles in G. By the linearity of expectation, the mean number of triangles is

\[ E_{n,p_n} X_n = \binom{n}{3} p_n^3 = \frac{n(n-1)(n-2)}{6} \left(\frac{\lambda}{n}\right)^3 \to \frac{\lambda^3}{6}, \]

as n → +∞. If the events {B_S}_S were mutually independent, X_n would be binomially distributed and the event that G is triangle-free would have probability

\[ \prod_{S \in \binom{V}{3}} P_{n,p_n}[B_S^c] = (1 - p_n^3)^{\binom{n}{3}} \to e^{-\lambda^3/6}. \tag{4.2.21} \]

In fact, by the Poisson approximation to the binomial (e.g., Theorem 4.1.18), we


would have that the number of triangles converges weakly to Poi(λ3 /6).
In reality, of course, the events {BS } are not mutually independent. Observe
however that, for most pairs S, S 0 , the events BS and BS 0 are in fact pairwise
independent. That is the case whenever |S ∩ S 0 | ≤ 1, that is, whenever the edges
connecting S are disjoint from those connecting S 0 . Write S ∼ S 0 if S 6= S 0 are

not independent, that is, if |S ∩ S′| = 2. The expected number of unordered pairs S ∼ S′ both forming a triangle is

\[ \Delta := \frac{1}{2} \sum_{\substack{S, S' \in \binom{V}{3} \\ S \sim S'}} P_{n,p_n}[B_S \cap B_{S'}] = \frac{1}{2} \binom{n}{3} \binom{3}{2} (n-3)\, p_n^5 = \Theta(n^4 p_n^5) \to 0, \tag{4.2.22} \]

where the \binom{n}{3} comes from the number of ways of choosing S, the \binom{3}{2} comes from the number of ways of choosing the vertices in common between S and S′, and the n − 3 comes from the number of ways of choosing the third vertex of S′. Given that the events {B_S}_{S ∈ \binom{V}{3}} are "mostly" independent, it is natural to expect that X_n behaves asymptotically as it would in the independent case. Indeed we prove:

Claim 4.2.34.

\[ P_{n,p_n}[X_n = 0] \to e^{-\lambda^3/6}. \]

Remark 4.2.35. In fact, X_n converges in distribution to Poi(λ³/6). See Exercises 2.18 and 4.9.

The FKG inequality (Theorem 4.2.28) immediately gives one direction. Recall that P_{n,p_n}, as a product measure over edge sets, satisfies the FKG condition and therefore has positive associations by the FKG inequality. Moreover the events B_S^c are decreasing for all S. Hence, applying positive associations inductively,

\[ P_{n,p_n}\Big[\bigcap_{S \in \binom{V}{3}} B_S^c\Big] \geq \prod_{S \in \binom{V}{3}} P_{n,p_n}[B_S^c] \to e^{-\lambda^3/6}, \]

where the limit follows from (4.2.21). As it turns out, the FKG inequality also gives a bound in the other direction. This is known as Janson's inequality, which we state in a more general context.

Janson's inequality. Let X := {0, 1}^F where F is finite. Let B_i, i ∈ I, be a finite collection of events of the form

\[ B_i := \{\omega \in \mathcal{X} : \omega \geq \beta^{(i)}\}, \]

for some β^{(i)} ∈ X. Think of these as "bad events" corresponding to a certain subset of coordinates being set to 1. By definition, the B_i's are increasing. Assume P is a positive product measure on X. Write i ∼ j if β_r^{(i)} = β_r^{(j)} = 1 for at least one r, and note that B_i is independent of B_j if i ≁ j. Set

\[ \Delta := \sum_{\substack{\{i,j\} \\ i \sim j}} P[B_i \cap B_j]. \]

Theorem 4.2.36 (Janson's inequality). Let X := {0, 1}^F where F is finite and P be a positive product measure on X. Let {B_i}_{i∈I} and Δ be as above. Assume further that there is ε > 0 such that P[B_i] ≤ ε for all i ∈ I. Then

\[ \prod_{i \in I} P[B_i^c] \leq P\Big[\bigcap_{i \in I} B_i^c\Big] \leq e^{\frac{\Delta}{1-\varepsilon}} \prod_{i \in I} P[B_i^c]. \]

Before proving the theorem, we show that it implies Claim 4.2.34. We have
already shown in (4.2.20) and (4.2.22) that ε → 0 and ∆ → 0. Janson’s inequality
(Theorem 4.2.36) immediately implies the claim by (4.2.21).
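Both bounds in Janson's inequality can be verified exactly on a small instance by enumeration. The sketch below (illustrative; the parameters n = 5 and p = 0.25 are arbitrary) takes the triangle events B_S in G(5, p) and checks that the exact triangle-free probability lies between the two sides of the inequality stated above.

```python
import itertools
import math

# Exact check of Janson's inequality for triangles in G(5, p).
n, p = 5, 0.25
edges = list(itertools.combinations(range(n), 2))
triples = list(itertools.combinations(range(n), 3))

def tri_edges(S):
    a, b, c = S
    return {(a, b), (a, c), (b, c)}

# exact probability that G(5, p) is triangle-free, by full enumeration
p_no_triangle = 0.0
for omega in itertools.product([0, 1], repeat=len(edges)):
    present = {e for e, x in zip(edges, omega) if x}
    if any(tri_edges(S) <= present for S in triples):
        continue
    q = 1.0
    for x in omega:
        q *= p if x else 1 - p
    p_no_triangle += q

eps = p ** 3                       # P[B_S] for each triple S
lower = (1 - eps) ** len(triples)  # product of the P[B_S^c]
# Delta: sum over unordered pairs S ~ S' (sharing exactly one edge)
Delta = sum(p ** len(tri_edges(S) | tri_edges(T))
            for S, T in itertools.combinations(triples, 2)
            if len(set(S) & set(T)) == 2)
upper = math.exp(Delta / (1 - eps)) * lower

assert lower <= p_no_triangle <= upper
```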

Proof of Theorem 4.2.36. The lower bound is the FKG inequality.

In the other direction, assume without loss of generality that I = [m]. The first step is to apply the chain rule to obtain

\[ P\Big[\bigcap_{i \in I} B_i^c\Big] = \prod_{i=1}^m P\Big[B_i^c \,\Big|\, \bigcap_{j \in [i-1]} B_j^c\Big]. \]

The rest is clever manipulation. For i ∈ [m], let N(i) := {ℓ ∈ [m] : ℓ ∼ i} and N_<(i) := N(i) ∩ [i−1]. Note that B_i is independent of {B_ℓ : ℓ ∈ [i−1]\N_<(i)}. Hence,

\begin{align*}
P\Big[B_i \,\Big|\, \bigcap_{j \in [i-1]} B_j^c\Big]
&= \frac{P\big[B_i \cap \big(\bigcap_{j \in [i-1]} B_j^c\big)\big]}{P\big[\bigcap_{j \in [i-1]} B_j^c\big]} \\
&\geq \frac{P\big[B_i \cap \big(\bigcap_{j \in N_<(i)} B_j^c\big) \cap \big(\bigcap_{j \in [i-1] \setminus N_<(i)} B_j^c\big)\big]}{P\big[\bigcap_{j \in [i-1] \setminus N_<(i)} B_j^c\big]} \\
&= P\Big[B_i \cap \Big(\bigcap_{j \in N_<(i)} B_j^c\Big) \,\Big|\, \bigcap_{j \in [i-1] \setminus N_<(i)} B_j^c\Big] \\
&= P\Big[B_i \,\Big|\, \bigcap_{j \in [i-1] \setminus N_<(i)} B_j^c\Big] \times P\Big[\bigcap_{j \in N_<(i)} B_j^c \,\Big|\, B_i \cap \bigcap_{j \in [i-1] \setminus N_<(i)} B_j^c\Big] \\
&= P[B_i]\; P\Big[\bigcap_{j \in N_<(i)} B_j^c \,\Big|\, B_i \cap \bigcap_{j \in [i-1] \setminus N_<(i)} B_j^c\Big],
\end{align*}

where on the second line we only enlarged the denominator (the numerator is unchanged, as the intersection is merely regrouped), and where we used independence for the first term on the last line. By a union bound, the second term on the last line is

\begin{align*}
P\Big[\bigcap_{j \in N_<(i)} B_j^c \,\Big|\, B_i \cap \bigcap_{j \in [i-1] \setminus N_<(i)} B_j^c\Big]
&\geq 1 - \sum_{j \in N_<(i)} P\Big[B_j \,\Big|\, B_i \cap \bigcap_{j' \in [i-1] \setminus N_<(i)} B_{j'}^c\Big] \\
&\geq 1 - \sum_{j \in N_<(i)} P[B_j \mid B_i],
\end{align*}

where the last line follows from the FKG inequality. This requires some explanations:

- On the event B_i, all coordinates r with β_r^{(i)} = 1 are fixed to 1, and the other ones are free. So we can think of P[· | B_i] as a positive product measure on {0, 1}^{F′} with F′ := {r ∈ F : β_r^{(i)} = 0}.

- The event B_j is increasing, while the event ∩_{j∈[i−1]\N_<(i)} B_j^c is decreasing as the intersection of decreasing events (see Exercise 4.4).

- So we can apply the FKG inequality in the form (4.2.14) to P[· | B_i].

Combining the last three displays and using 1 + x ≤ e^x for all x (see Exercise 1.16), we get

\begin{align*}
P\Big[\bigcap_{i \in I} B_i^c\Big]
&\leq \prod_{i=1}^m \Big(P[B_i^c] + \sum_{j \in N_<(i)} P[B_i \cap B_j]\Big) \\
&\leq \prod_{i=1}^m P[B_i^c] \Big(1 + \frac{1}{1-\varepsilon} \sum_{j \in N_<(i)} P[B_i \cap B_j]\Big) \\
&\leq \prod_{i=1}^m P[B_i^c]\, \exp\Big(\frac{1}{1-\varepsilon} \sum_{j \in N_<(i)} P[B_i \cap B_j]\Big),
\end{align*}

where we used the assumption P[B_i] ≤ ε on the second line. By the definition of Δ, we are done.

4.2.5 Percolation: RSW theory and a proof of Harris' theorem


Consider bond percolation (Definition 1.2.1) on the two-dimensional lattice L2 .
Recall that the percolation function is given by

θ(p) := Pp [|C0 | = +∞],

where C0 is the open cluster of the origin. We know from Example 4.2.10 that θ(p)
is non-decreasing. Let

pc (L2 ) := sup{p ≥ 0 : θ(p) = 0},

be the critical value. We proved in Section 2.2.4 that there is a non-trivial transition,
that is, pc (L2 ) ∈ (0, 1). See Exercise 2.3 for a proof that pc (L2 ) ∈ [1/3, 2/3].
Our goal in this section is to use the FKG inequality to improve this further to:

Theorem 4.2.37 (Harris’ theorem).

θ(1/2) = 0.

Or, put differently, pc (L2 ) ≥ 1/2.


Remark 4.2.38. This bound is tight, that is, in fact pc (L2 ) = 1/2. The other direction is
known as Kesten’s theorem. See, e.g., [BR06a].

Here we present a proof of Harris’ theorem that uses an important tool in perco-
lation theory, the Russo-Seymour-Welsh (RSW) lemma, an application of the FKG
inequality.

Harris’ theorem
To motivate the RSW lemma, we start with the proof of Harris’ theorem.
Proof of Theorem 4.2.37. Fix p = 1/2. We use the dual lattice \widetilde{L}^2 as we did in Section 2.2.4. Consider the annulus

Ann(`) := [−3`, 3`]2 \[−`, `]2 .

The existence of a closed dual cycle inside Ann(ℓ), an event we denote by O_d(ℓ), prevents the possibility of an infinite open path from the origin in the primal lattice L². See Figure 4.4. That is,

\[ P_{1/2}[|C_0| = +\infty] \leq \prod_{k=0}^{K} \big\{1 - P_{1/2}[O_d(3^k)]\big\}, \tag{4.2.23} \]

for all K, where we took powers of 3 to make the annuli disjoint and therefore independent. To prove the theorem, it suffices to show that there is a constant c_* > 0 such that, for all ℓ, P_{1/2}[O_d(ℓ)] ≥ c_*. Then the right-hand side of (4.2.23) tends to 0 as K → +∞.
To simplify further, thinking of Ann(ℓ) as a union of four rectangles [−3ℓ, −ℓ) × [−3ℓ, 3ℓ], [−3ℓ, 3ℓ] × (ℓ, 3ℓ], etc., it suffices to consider the event O_d^#(ℓ) that each one of these rectangles contains a closed dual path connecting its two shorter sides. To be more precise, for the first rectangle above for instance, the path connects [−3ℓ + 1/2, −ℓ − 1/2] × {3ℓ − 1/2} to [−3ℓ + 1/2, −ℓ − 1/2] × {−3ℓ + 1/2} and stays inside the rectangle. See Figure 4.4. By symmetry the probability that such a path exists is the same for all four rectangles. Denote that probability by ρ_ℓ. Moreover the event that such a path exists is increasing so, although the four events

Figure 4.4: Top: the event O_d(ℓ). Bottom: the event O_d^#(ℓ).

are not independent, we can apply the FKG inequality (Theorem 4.2.28). Hence, since O_d^#(ℓ) ⊆ O_d(ℓ), we finally get the bound

\[ P_{1/2}[O_d(\ell)] \geq \rho_\ell^4. \]

The RSW lemma and some symmetry arguments, both of which are detailed
below, imply:
Claim 4.2.39. There is some c > 0 such that, for all `,

ρ` ≥ c.

That concludes the proof.

It remains to prove Claim 4.2.39. We first state the RSW lemma.

RSW theory
We have reduced the proof of Harris’ theorem to bounding the probability that
certain closed paths exist in the dual lattice. To be consistent with the standard
RSW notation, we switch to the primal lattice and consider open paths. We also let
p take any value in (0, 1).
Let Rn,α (p) be the probability that the rectangle

B(αn, n) := [−n, (2α − 1)n] × [−n, n],

has an open path connecting its left and right sides with the path remaining inside
the rectangle. Such a path is called an (open) left-right crossing. The event that a
left-right crossing exists in a rectangle B is denoted by LR(B). We similarly define
the event, TB(B), that a top-bottom crossing exists in B. In essence, the RSW
lemma says this: if there is a significant probability that a left-right crossing exists
in the square B(n, n), then there is a significant probability that a left-right crossing
exists in the rectangle B(3n, n). More precisely, here is a version of the theorem
that will be enough for our purposes. (See Exercise 4.10 for a generalization.)

Lemma 4.2.40 (RSW lemma). For all n ≥ 2 (divisible by 4) and p ∈ (0, 1),

\[ R_{n,3}(p) \geq \frac{1}{256}\, R_{n,1}(p)^{11}\, R_{n/2,1}(p)^{12}. \tag{4.2.24} \]
The right-hand side of (4.2.24) depends only on the probability of crossing a square
from left to right. By a duality argument, at p = 1/2, it turns out that this prob-
ability is at least 1/2 independently of n. Before presenting a proof of the RSW
lemma, we detail this argument and finish the proof of Harris’ theorem.

Proof of Claim 4.2.39. The point of (4.2.24) is that, if Rn,1 (1/2) is bounded away
from 0 uniformly in n, then so is the left-hand side. By the argument in the proof
of Harris’ theorem, this then implies that a closed cycle exists in Ann(n) with a
probability bounded away from 0 as well. Hence to prove Claim 4.2.39 it suffices
to give a lower bound on Rn,1 (1/2). It is crucial that this bound not depend on the
“scale” n.
As it turns out, a simple duality-based symmetry argument does the trick. The
following fact about L2 is a variant of the contour lemma (Lemma 2.2.14). Its
proof is similar and Exercise 4.11 asks for the details (the “if” direction being the
non-trivial implication).
Lemma 4.2.41. There is an open left-right crossing in the primal rectangle [0, n +
1] × [0, n] if and only if there is no closed top-bottom crossing in the dual rectangle
[1/2, n + 1/2] × [−1/2, n + 1/2].
By symmetry, when p = 1/2, the two events in Lemma 4.2.41 have equal proba-
bility. So they must have probability 1/2 because they form a partition of the space
of outcomes. By monotonicity, that implies Rn,1 (1/2) ≥ 1/2 for all n. The RSW
lemma then implies the required bound.
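The duality argument above gives that crossing the rectangle [0, n + 1] × [0, n] from left to right at p = 1/2 has probability exactly 1/2. This can be checked by simulation; the Monte Carlo sketch below is illustrative only (grid size, seed, and trial count are arbitrary choices) and estimates the crossing probability for bond percolation via a graph search.

```python
import random

# Monte Carlo check that the left-right crossing probability of the
# rectangle [0, n+1] x [0, n] at p = 1/2 is 1/2 (exactly, by duality).
random.seed(0)

def crosses(n):
    # open each horizontal/vertical bond of the rectangle independently w.p. 1/2
    horiz = {((x, y), (x + 1, y)): random.random() < 0.5
             for x in range(n + 1) for y in range(n + 1)}
    vert = {((x, y), (x, y + 1)): random.random() < 0.5
            for x in range(n + 2) for y in range(n)}
    adj = {}
    for (u, v), is_open in {**horiz, **vert}.items():
        if is_open:
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
    # search from the left side {x = 0}; success if we reach {x = n+1}
    frontier = [(0, y) for y in range(n + 1)]
    seen = set(frontier)
    while frontier:
        u = frontier.pop()
        if u[0] == n + 1:
            return True
        for w in adj.get(u, ()):
            if w not in seen:
                seen.add(w)
                frontier.append(w)
    return False

trials = 2000
est = sum(crosses(8) for _ in range(trials)) / trials
assert abs(est - 0.5) < 0.07  # within a few standard errors of 1/2
```

Notably, the estimate stays near 1/2 regardless of the grid size, which is the scale-invariance exploited in the proof of Claim 4.2.39.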

The proof of the RSW lemma involves a clever choice of event that relates the
existence of crossings in squares and rectangles. (Combining crossings of squares
into crossings of rectangles is not as trivial as it might look. Try it before reading
the proof.)

Proof of Lemma 4.2.40. There are several steps in the proof.

Step 1: it suffices to bound R_{n,3/2}(p). We first reduce the proof to finding a bound on R_{n,3/2}(p). Let B_1' := B(2n, n) and B_2' := [n, 5n] × [−n, n]. Note that B_1' ∪ B_2' = B(3n, n) and B_1' ∩ B_2' = [n, 3n] × [−n, n]. Then we have the implication

\[ \mathrm{LR}(B_1') \cap \mathrm{TB}(B_1' \cap B_2') \cap \mathrm{LR}(B_2') \subseteq \mathrm{LR}(B(3n, n)). \]

See Figure 4.5. Each event on the left-hand side is increasing so the FKG inequality gives

\[ R_{n,3}(p) \geq R_{n,2}(p)^2\, R_{n,1}(p). \]

A similar argument over B(2n, n) gives

\[ R_{n,2}(p) \geq R_{n,3/2}(p)^2\, R_{n,1}(p). \]

Combining the two, we have proved:



Figure 4.5: Illustration of the implication LR(B_1') ∩ TB(B_1' ∩ B_2') ∩ LR(B_2') ⊆ LR(B(3n, n)).

Lemma 4.2.42 (Proof of RSW: step 1).

\[ R_{n,3}(p) \geq R_{n,3/2}(p)^4\, R_{n,1}(p)^3. \tag{4.2.25} \]

Step 2: bounding R_{n,3/2}(p). The heart of the proof is to bound R_{n,3/2}(p) using an event involving crossings of squares. Let

\begin{align*}
B_1 &:= B(n, n) = [-n, n] \times [-n, n], \\
B_2 &:= [0, 2n] \times [-n, n], \\
B_{12} &:= B_1 \cap B_2 = [0, n] \times [-n, n], \\
S &:= [0, n] \times [0, n].
\end{align*}

Let Γ₁ be the event that there are paths P₁, P₂, where P₁ is a top-bottom crossing of S and P₂ is an open path connecting the left side of B₁ to P₁ and staying inside B₁. Similarly let Γ₂′ be the event that there are paths P₁′, P₂′, where P₁′ is a top-bottom crossing of S and P₂′ is an open path connecting the right side of B₂ to P₁′ and staying inside B₂. Then we have the implication

\[ \Gamma_1 \cap \mathrm{LR}(S) \cap \Gamma_2' \subseteq \mathrm{LR}(B(3n/2, n)). \]

See Figure 4.6. By symmetry P_p[Γ₁] = P_p[Γ₂′]. Moreover, the events on the left-hand side are increasing so by the FKG inequality:

Lemma 4.2.43 (Proof of RSW: step 2).

\[ R_{n,3/2}(p) \geq P_p[\Gamma_1]^2\, R_{n/2,1}(p). \tag{4.2.26} \]

Step 3: bounding P_p[Γ₁]. It remains to bound P_p[Γ₁]. That requires several additional definitions. Let P₁ and P₂ be top-bottom crossings of S. There is a natural partial order over such crossings. The path P₁ divides S into two subgraphs: [P₁} which includes the left side of S (including edges on the left incident with P₁ but not those edges on P₁ itself) and {P₁] which includes the right side of S (and P₁ itself). Then we write P₁ ⪯ P₂ if {P₁] ⊆ {P₂]. Assuming TB(S) holds, one also gets the existence of a unique rightmost crossing. Roughly speaking, take the union of all top-bottom crossings of S as sets of edges; then the "right boundary" of this set is a top-bottom crossing P_S^* such that P_S^* ⪯ P for all top-bottom crossings P of S. (We accept as a fact the existence of a unique rightmost crossing. See Exercise 4.11 for a related construction.)
Let I_S be the set of (not necessarily open) paths connecting the top and bottom of S that stay inside S. For P ∈ I_S, we let P′ be the reflection of P in B₁₂ \ S through the x-axis, and we let PP′ denote the union of P and P′. Define [PP′} to be the

Figure 4.6: Top: illustration of the implication Γ₁ ∩ LR(S) ∩ Γ₂′ ⊆ LR(B(3n/2, n)). Bottom: the event LR⁺([PP′}) ∩ {P = P_S^*}; the dashed path is the mirror image of the rightmost top-bottom crossing in S; the shaded region on the right is the complement in B₁ of the set [PP′}. Note that, because in the bottom figure the left-right path must stay within [PP′} by definition of P_S^*, the configuration shown in the top figure where a left-right path (dotted) "travels behind" the top-bottom crossing of S cannot occur.

subgraph of B1 to the left of PP0 (including edges on the left incident with PP0 but
P + P

not those edges on P 0 itself). Let LR [ P 0 } be the event that there is a left-right
crossing of [ PP0 } ending on P , that is, that there is an open path connecting the left
side of B1 and P that stays within [ PP0 }. See Figure 4.6. Note that the existence of
a left-right crossing of B1 implies the existence of an open path connecting the left
side of B1 to PP0 . By symmetry we then get
 1 1
Pp LR+ [ PP0 } ≥ Pp [LR(B1 )] = Rn,1 (p).

(4.2.27)
2 2
Now comes a subtle point. We turn to the rightmost crossing of S, for two reasons:

• First, by uniqueness of the rightmost crossing, {P_S^* = P}_{P ∈ I_S} forms a partition of TB(S). Recall that we are looking to bound a probability from below, and therefore we have to be careful not to "double count."

• Second, the rightmost crossing has a Markov-like property. Observe that, for P ∈ I_S, the event {P_S^* = P} depends only on the bonds in {P]. In particular it is independent of the bonds in [PP′}, for example, of the event LR⁺([PP′}). Hence

\[ P_p\big[\mathrm{LR}^+[PP'\} \,\big|\, P_S^* = P\big] = P_p\big[\mathrm{LR}^+[PP'\}\big]. \tag{4.2.28} \]

Note that the event {P_S^* = P} is not increasing, as adding more open bonds can shift the rightmost crossing rightward. Therefore, we cannot use the FKG inequality here.
Combining (4.2.27) and (4.2.28), we get

\begin{align*}
P_p[\Gamma_1] &\geq \sum_{P \in I_S} P_p[P_S^* = P]\; P_p\big[\mathrm{LR}^+[PP'\} \,\big|\, P_S^* = P\big] \\
&\geq \frac{1}{2}\, R_{n,1}(p) \sum_{P \in I_S} P_p[P_S^* = P] \\
&= \frac{1}{2}\, R_{n,1}(p)\, P_p[\mathrm{TB}(S)] \\
&= \frac{1}{2}\, R_{n,1}(p)\, R_{n/2,1}(p).
\end{align*}

We have proved:

Lemma 4.2.44 (Proof of RSW: step 3).

\[ P_p[\Gamma_1] \geq \frac{1}{2}\, R_{n,1}(p)\, R_{n/2,1}(p). \tag{4.2.29} \]

Step 4: putting everything together. Combining (4.2.25), (4.2.26), and (4.2.29) gives

\begin{align*}
R_{n,3}(p) &\geq R_{n,3/2}(p)^4\, R_{n,1}(p)^3 \\
&\geq \big[P_p[\Gamma_1]^2\, R_{n/2,1}(p)\big]^4\, R_{n,1}(p)^3 \\
&\geq \left[\left(\frac{1}{2}\, R_{n,1}(p)\, R_{n/2,1}(p)\right)^2 R_{n/2,1}(p)\right]^4 R_{n,1}(p)^3.
\end{align*}

Collecting the terms concludes the proof of the RSW lemma.


Remark 4.2.45. This argument is quite subtle. It is instructive to read the remark after
[Gri97, Theorem 9.3].

4.3 Coupling of Markov chains and application to mixing


As we have seen, coupling is useful to bound total variation distance. In this section
we apply the technique to bound the mixing time of Markov chains.

4.3.1 Bounding the mixing time via coupling


Let P be an irreducible, aperiodic Markov transition matrix on the finite state space
V with stationary distribution π. Recall from Definition 1.1.35 that, for a fixed
0 < ε < 1/2, the mixing time of P is

tmix (ε) := min{t : d(t) ≤ ε},

where
d(t) := max kP t (x, ·) − πkTV .
x∈V

It will be easier to work with

\[ \bar{d}(t) := \max_{x,y \in V} \|P^t(x,\cdot) - P^t(y,\cdot)\|_{TV}. \]

The quantities d(t) and \bar{d}(t) are related in the following way.

Lemma 4.3.1.

\[ d(t) \leq \bar{d}(t) \leq 2\, d(t), \qquad \forall t. \]

Proof. The second inequality follows from an application of the triangle inequality.

For the first inequality, note that by the definition of the total variation distance and the stationarity of π,

\begin{align*}
\|P^t(x,\cdot) - \pi\|_{TV} &= \sup_{A \subseteq V} |P^t(x, A) - \pi(A)| \\
&= \sup_{A \subseteq V} \Big|\sum_{y \in V} \pi(y)\big[P^t(x, A) - P^t(y, A)\big]\Big| \\
&\leq \sup_{A \subseteq V} \sum_{y \in V} \pi(y)\, |P^t(x, A) - P^t(y, A)| \\
&\leq \sum_{y \in V} \pi(y) \sup_{A \subseteq V} |P^t(x, A) - P^t(y, A)| \\
&= \sum_{y \in V} \pi(y)\, \|P^t(x,\cdot) - P^t(y,\cdot)\|_{TV} \\
&\leq \max_{x,y \in V} \|P^t(x,\cdot) - P^t(y,\cdot)\|_{TV}.
\end{align*}
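Lemma 4.3.1 can be sanity-checked numerically. The sketch below (illustrative; the 4-state chain is randomly generated, and π is approximated by a high power of P) computes d(t) and d̄(t) at a fixed t and verifies d(t) ≤ d̄(t) ≤ 2 d(t).

```python
import random

# Numeric sanity check of Lemma 4.3.1 on a small chain: d(t) <= dbar(t) <= 2 d(t).
random.seed(1)
m = 4
# a random transition matrix with all entries positive (irreducible, aperiodic)
P = [[random.random() + 0.1 for _ in range(m)] for _ in range(m)]
P = [[x / sum(row) for x in row] for row in P]

def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(m)]
            for i in range(m)]

def tv(mu, nu):
    # total variation distance between two distributions on [m]
    return 0.5 * sum(abs(a - b) for a, b in zip(mu, nu))

# approximate the stationary distribution pi by a high power of P
Pt = P
for _ in range(500):
    Pt = mat_mul(Pt, P)
pi = Pt[0]

t = 3
Q = P
for _ in range(t - 1):
    Q = mat_mul(Q, P)
d = max(tv(Q[x], pi) for x in range(m))
dbar = max(tv(Q[x], Q[y]) for x in range(m) for y in range(m))
assert d <= dbar + 1e-9 and dbar <= 2 * d + 1e-9
```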

Coalescence. Recall that a Markovian coupling of P with itself is a Markov chain (X_t, Y_t)_t on V × V with transition matrix Q satisfying: for all x, y, x′, y′ ∈ V,

\[ \sum_{z'} Q((x, y), (x', z')) = P(x, x'), \qquad \sum_{z'} Q((x, y), (z', y')) = P(y, y'). \]

We say that a Markovian coupling is coalescing if further: for all z ∈ V,

\[ x' \neq y' \implies Q((z, z), (x', y')) = 0. \]

Let (X_t, Y_t) be a coalescing Markovian coupling of P. By the coalescing condition, if X_s = Y_s then X_t = Y_t for all t ≥ s. That is, once (X_t) and (Y_t) meet, they remain equal. Let τ_coal be the coalescence time (also called the coupling time), that is,

\[ \tau_{\mathrm{coal}} := \inf\{t \geq 0 : X_t = Y_t\}. \]

The key to the coupling approach to mixing times is the following immediate consequence of the coupling inequality (Lemma 4.1.11). For any starting point (x, y),

\[ \|P^t(x,\cdot) - P^t(y,\cdot)\|_{TV} \leq P_{(x,y)}[X_t \neq Y_t] = P_{(x,y)}[\tau_{\mathrm{coal}} > t]. \tag{4.3.1} \]

Combining (4.3.1) and Lemma 4.3.1, we get the main tool of this section.

Theorem 4.3.2 (Bounding the mixing time: coupling method). Let (X_t, Y_t) be a coalescing Markovian coupling of an irreducible transition matrix P on a finite state space V with stationary distribution π. Then

\[ d(t) \leq \max_{x,y \in V} P_{(x,y)}[\tau_{\mathrm{coal}} > t]. \]

In particular,

\[ t_{\mathrm{mix}}(\varepsilon) \leq \inf\big\{t \geq 0 : P_{(x,y)}[\tau_{\mathrm{coal}} > t] \leq \varepsilon,\ \forall x, y\big\}. \]
Note that a Markovian coupling can be made coalescing by modifying it as follows:


when Xt = Yt , perform one step of the chain to determine Xt+1 and set Yt+1 :=
Xt+1 . That modification does not affect the coalescence time.
We give a few simple examples of the coupling method in the next subsection.
First, we discuss a classical result.
Example 4.3.3 (Doeblin's condition). Let P be a transition matrix on a countable
space V . One form of Doeblin's condition (also called a minorization condition)
is: there is s ∈ Z+ and δ > 0 such that
$$\sup_{z \in V} \inf_{w \in V} P^s(w, z) > \delta.$$
In words, there is a state z0 ∈ V such that, starting from any state w ∈ V , the
probability of reaching z0 in exactly s steps is at least δ (which does not depend on
w). Assume such a z0 exists.
We construct a coalescing Markovian coupling (Xt , Yt ) of P . Assume first that
s = 1 and let
$$\tilde{P}(w, z) = \frac{1}{1 - \delta}\left[P(w, z) - \delta\, \mathbf{1}\{z = z_0\}\right].$$
It can be checked that P̃ is a stochastic matrix on V provided z0 satisfies the
condition above (see Exercise 4.13). We use a technique known as splitting. While
Xt ≠ Yt , at the next time step: (i) with probability δ we set Xt+1 = Yt+1 = z0 ; (ii)
otherwise we pick Xt+1 ∼ P̃ (Xt , ·) and Yt+1 ∼ P̃ (Yt , ·) independently. On the
other hand, if Xt = Yt , we maintain the equality and pick the next state according
to P . Put differently, the coupling Q is defined as: if x ≠ y,
$$Q((x,y),(x',y')) = \delta\, \mathbf{1}\{x' = y' = z_0\} + (1-\delta)\tilde{P}(x,x')\tilde{P}(y,y'),$$
while if x = y,
$$Q((x,x),(x',x')) = P(x,x').$$
Observe that, in case (i) above, coalescence occurs at time t + 1. In case (ii),
coalescence may or may not occur at time t + 1. In other words, while Xt ≠ Yt ,
coalescence occurs at the next step with probability at least δ. So τcoal is
stochastically dominated by a geometric random variable with success probability
δ, or
$$\max_{x,y \in V} \mathbb{P}_{(x,y)}[\tau_{\mathrm{coal}} > t] \leq (1 - \delta)^t.$$

By Theorem 4.3.2,
$$\max_{x \in V} \|P^t(x,\cdot) - \pi\|_{\mathrm{TV}} \leq (1 - \delta)^t.$$

Exponential decay of the worst-case total variation distance to the stationary
distribution is referred to as uniform geometric ergodicity.
Suppose now that s > 1. We apply the argument above to the chain $P^s$ this
time. We get
$$\max_{x,y \in V} \mathbb{P}_{(x,y)}[\tau_{\mathrm{coal}} > ts] \leq (1 - \delta)^t,$$
so that, after a change of variable,
$$\max_{x \in V} \|P^t(x,\cdot) - \pi\|_{\mathrm{TV}} \leq (1 - \delta)^{\lfloor t/s \rfloor}.$$

So, we have shown that uniform geometric ergodicity is implied by Doeblin’s con-
dition.
We note however that the rate of decay derived from this technique can be
very slow. For instance the condition always holds when P is finite, irreducible
and aperiodic (as follows from Lemma 1.1.32), but a straight application of the
technique may lead to a bound depending badly on the size of the state space V
(see Exercise 4.14). J
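As an illustration, the geometric decay implied by Doeblin's condition is easy to check numerically. The Python sketch below uses a small made-up 3-state transition matrix (purely hypothetical), computes δ = sup_z inf_w P(w, z) for s = 1, and verifies that the worst-case total variation distance to stationarity is dominated by (1 − δ)^t.

```python
# A small made-up 3-state transition matrix, for illustration only.
P = [[0.5, 0.3, 0.2],
     [0.2, 0.6, 0.2],
     [0.3, 0.3, 0.4]]
n = len(P)

# Doeblin coefficient with s = 1: delta = sup_z inf_w P(w, z).
delta = max(min(P[w][z] for w in range(n)) for z in range(n))

def step(dist):
    # One step of the chain: row vector times transition matrix.
    return [sum(dist[w] * P[w][z] for w in range(n)) for z in range(n)]

def worst_tv(t):
    # Approximate pi by running the chain for many steps, then take the
    # worst total variation distance over starting states after t steps.
    pi = [1.0, 0.0, 0.0]
    for _ in range(500):
        pi = step(pi)
    dists = []
    for x in range(n):
        row = [1.0 if z == x else 0.0 for z in range(n)]
        for _ in range(t):
            row = step(row)
        dists.append(0.5 * sum(abs(row[z] - pi[z]) for z in range(n)))
    return max(dists)

# Uniform geometric ergodicity: max_x ||P^t(x, .) - pi||_TV <= (1 - delta)^t.
for t in range(1, 8):
    assert worst_tv(t) <= (1 - delta) ** t + 1e-9
```

In practice the true decay rate is often much faster than (1 − δ)^t, in line with the caveat above.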

4.3.2 Random walks: mixing on cycles, hypercubes, and trees


In this section, we consider lazy simple random walk on various graphs. By this we
mean that the walk stays put with probability 1/2 and otherwise picks an adjacent
vertex uniformly at random. In each case, we construct a coupling to bound the
mixing time. As a reference, we compare our upper bounds to the diameter-based
lower bound we will derive in Section 5.2.3. Specifically, by Claim 5.2.25, for a
finite, reversible Markov chain with stationary distribution π and diameter ∆ we
have the lower bound
$$t_{\mathrm{mix}}(\varepsilon) = \Omega\left(\frac{\Delta^2}{\log(n \vee \pi_{\min}^{-1})}\right),$$
where πmin is the smallest value taken by π.



Cycle

Let (Zt ) be lazy simple random walk on the cycle of size n, Zn := {0, 1, . . . , n − 1},
where i ∼ j if |j − i| = 1 (mod n). For any starting points x, y, we construct a
Markovian coupling (Xt , Yt ) of this chain. Set (X0 , Y0 ) := (x, y). At each time,
flip a fair coin. On heads, Yt stays put and Xt moves one step, the direction of
which is uniform at random. On tails, proceed similarly with the roles of Xt and
Yt reversed. Let Dt be the clockwise distance between Xt and Yt . Observe that,
by construction, (Dt ) is simple random walk on {0, . . . , n} and $\tau_{\mathrm{coal}} = \tau^D_{\{0,n\}}$, the
first time (Dt ) hits {0, n}.
We use Markov's inequality (Theorem 2.1.1) to bound $\mathbb{P}_{(x,y)}[\tau^D_{\{0,n\}} > t]$.
Denote by D0 = dx,y the starting distance. By Wald's second identity (Theorem 3.1.40),
$$\mathbb{E}_{(x,y)}\big[\tau^D_{\{0,n\}}\big] = d_{x,y}(n - d_{x,y}).$$

Applying Theorem 4.3.2 and Markov's inequality we get
$$\begin{aligned}
d(t) &\leq \max_{x,y \in V} \mathbb{P}_{(x,y)}[\tau_{\mathrm{coal}} > t] \\
&\leq \max_{x,y \in V} \frac{\mathbb{E}_{(x,y)}\big[\tau^D_{\{0,n\}}\big]}{t} \\
&= \max_{x,y \in V} \frac{d_{x,y}(n - d_{x,y})}{t} \\
&\leq \frac{n^2}{4t},
\end{aligned}$$
or:

Claim 4.3.4.
$$t_{\mathrm{mix}}(\varepsilon) \leq \frac{n^2}{4\varepsilon}.$$

By the diameter-based lower bound on mixing in Section 5.2.3, this bound
gives the correct order of magnitude in n up to logarithmic factors. Indeed, the
diameter is ∆ = n/2 and πmin = 1/n so that Claim 5.2.25 gives
$$t_{\mathrm{mix}}(\varepsilon) \geq \frac{n^2}{64 \log n},$$
for n large enough. Exercise 4.15 sketches a tighter lower bound.
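The coupling on the cycle is easy to simulate. The sketch below (an illustration with made-up parameters, not part of the text) estimates the mean coalescence time empirically and compares it with the value d_{x,y}(n − d_{x,y}) predicted by Wald's second identity.

```python
import random

def coalescence_time(n, x, y, rng):
    """One run of the cycle coupling: a fair coin decides which chain
    moves one step (in a uniform direction) while the other stays put."""
    t = 0
    while x != y:
        step = rng.choice([-1, 1])
        if rng.random() < 0.5:
            x = (x + step) % n
        else:
            y = (y + step) % n
        t += 1
    return t

rng = random.Random(0)
n, x0, y0 = 12, 0, 5
d = (y0 - x0) % n                      # starting clockwise distance
runs = [coalescence_time(n, x0, y0, rng) for _ in range(2000)]
avg = sum(runs) / len(runs)
# Wald's second identity predicts E[tau_coal] = d (n - d) = 35 here.
assert abs(avg - d * (n - d)) < 5
```

Note that the clockwise distance between the two chains performs exactly the simple random walk (Dt) described above, which is what makes the prediction exact.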

Hypercube

Let $(Z_t)_{t \in \mathbb{Z}_+}$ be lazy simple random walk on the n-dimensional hypercube $\mathbb{Z}_2^n :=
\{0,1\}^n$ where i ∼ j if $\|i - j\|_1 = 1$. We denote the coordinates of Zt by
$(Z_t^{(1)}, \ldots, Z_t^{(n)})$. This is equivalent to performing the Glauber dynamics chain
on an empty graph (see Definition 1.2.8): at each step, we first pick a coordinate
uniformly at random, then refresh its value. Because of the way the updating is
done, the chain stays put with probability 1/2 at each time as required.
Inspired by this observation, the coupling (Xt , Yt ) started at (x, y) is the
following. At each time t, pick a coordinate i uniformly at random in [n] and pick a bit
value b in {0, 1} uniformly at random, independently of the coordinate choice. Set
the i-th coordinate of both chains to b, that is, $X_t^{(i)} = Y_t^{(i)} = b$. By design we reach coalescence
when all coordinates have been updated at least once.
The following standard bound from the coupon collector’s problem (see Ex-
ample 2.1.4) is what is needed to conclude.
Lemma 4.3.5. Let τcoll be the time it takes to update each coordinate at least once.
Then, for any c > 0,
$$\mathbb{P}[\tau_{\mathrm{coll}} > \lceil n \log n + cn \rceil] \leq e^{-c}.$$
Proof. Let Bi be the event that the i-th coordinate has not been updated by time
⌈n log n + cn⌉. Then, using that 1 − x ≤ e^{−x} for all x (see Exercise 1.16),
$$\begin{aligned}
\mathbb{P}[\tau_{\mathrm{coll}} > \lceil n \log n + cn \rceil]
&\leq \sum_i \mathbb{P}[B_i] \\
&= \sum_i \left(1 - \frac{1}{n}\right)^{\lceil n \log n + cn \rceil} \\
&\leq n \exp\left(-\frac{n \log n + cn}{n}\right) \\
&= e^{-c}.
\end{aligned}$$
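The chain of inequalities in this proof can be checked numerically; a minimal sketch:

```python
import math

# Check n (1 - 1/n)^t <= n e^{-t/n} <= e^{-c} for t = ceil(n log n + c n),
# the union bound used in the proof of Lemma 4.3.5.
for n in [10, 100, 1000]:
    for c in [0.5, 1.0, 2.0]:
        t = math.ceil(n * math.log(n) + c * n)
        assert n * (1 - 1 / n) ** t <= math.exp(-c)
```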

Applying Theorem 4.3.2, we get
$$\begin{aligned}
d(\lceil n \log n + cn \rceil)
&\leq \max_{x,y \in V} \mathbb{P}_{(x,y)}[\tau_{\mathrm{coal}} > \lceil n \log n + cn \rceil] \\
&\leq \mathbb{P}[\tau_{\mathrm{coll}} > \lceil n \log n + cn \rceil] \\
&\leq e^{-c}.
\end{aligned}$$
Hence for cε > 0 large enough:

Claim 4.3.6.
$$t_{\mathrm{mix}}(\varepsilon) \leq \lceil n \log n + c_\varepsilon n \rceil.$$
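Since the coupling updates the same coordinate of both chains to the same bit, the coalescence time from opposite corners is exactly the full coupon-collector time. The simulation sketch below (illustrative parameters only) checks that its empirical mean is close to $n H_n \approx n \log n$.

```python
import random

def coalesce_time(n, x, y, rng):
    # Both chains refresh the same uniformly chosen coordinate to the
    # same uniform bit; they agree once every coordinate is refreshed.
    x, y = list(x), list(y)
    t = 0
    while x != y:
        i = rng.randrange(n)
        x[i] = y[i] = rng.randrange(2)
        t += 1
    return t

rng = random.Random(1)
n = 8
runs = [coalesce_time(n, [0] * n, [1] * n, rng) for _ in range(2000)]
avg = sum(runs) / len(runs)
H = sum(1 / k for k in range(1, n + 1))   # harmonic number H_n
# Starting from opposite corners, tau_coal is the full coupon-collector
# time, with mean n H_n (about 21.7 for n = 8).
assert abs(avg - n * H) < 1.5
```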

Again we get a quick lower bound using the diameter-based result from
Section 5.2.3. Here ∆ = n and $\pi_{\min} = 1/2^n$ so that Claim 5.2.25 gives
$$t_{\mathrm{mix}}(\varepsilon) \geq \frac{n^2}{12 \log n + (4 \log 2) n} = \Omega(n),$$
for n large enough. So the upper bound we derived above is off at most by a
logarithmic factor in n. In fact:

Claim 4.3.7.
$$t_{\mathrm{mix}}(\varepsilon) \geq \frac{1}{2} n \log n - O(n).$$
2

Proof. For simplicity, we assume that n is odd. Let Wt be the number of 1s, or
Hamming weight, at time t. Let A be the event that the Hamming weight is ≤ n/2.
To bound the mixing time, we use the fact that for any z0
$$d(t) \geq \|P^t(z_0, \cdot) - \pi\|_{\mathrm{TV}} \geq |P^t(z_0, A) - \pi(A)|. \tag{4.3.2}$$
Under the stationary distribution, the Hamming weight is equal in distribution to
a Bin(n, 1/2). In particular the probability that a majority of coordinates are 0 is
1/2. That is, π(A) = 1/2.
On the other hand, let (Zt ) start at z0 , the all-1 vector. By the definition of A,
$$|P^t(z_0, A) - \pi(A)| = |\mathbb{P}[W_t \leq n/2] - 1/2|. \tag{4.3.3}$$

We use Chebyshev's inequality (Theorem 2.1.2) to bound the probability on the
right-hand side. So we need to compute the expectation and variance of Wt .
Let Ut be the number of (distinct) updated coordinates up to time t in the
Glauber dynamics representation of the chain discussed above. Observe that,
conditioned on Ut , the Hamming weight Wt is equal in distribution to Bin(Ut , 1/2) +
(n − Ut ) as the updated coordinates are uniform and the other ones are 1. Thus we
have

$$\begin{aligned}
\mathbb{E}[W_t] &= \mathbb{E}[\mathbb{E}[W_t \mid U_t]] \\
&= \mathbb{E}\left[\frac{1}{2} U_t + (n - U_t)\right] \\
&= \mathbb{E}\left[n - \frac{1}{2} U_t\right] \\
&= n - \frac{1}{2}\, n \left[1 - \left(1 - \frac{1}{n}\right)^t\right] \\
&= \frac{n}{2}\left[1 + \left(1 - \frac{1}{n}\right)^t\right], \tag{4.3.4}
\end{aligned}$$
where on the fourth line we used the fact that $\mathbb{E}[U_t] = n\big[1 - \big(1 - \frac{1}{n}\big)^t\big]$, obtained
by summing over the coordinates and using linearity of expectation.
As to the variance, using again the observation above about the distribution of
Wt given Ut ,
$$\mathrm{Var}[W_t] = \mathbb{E}[\mathrm{Var}[W_t \mid U_t]] + \mathrm{Var}[\mathbb{E}[W_t \mid U_t]]
= \frac{1}{4}\mathbb{E}[U_t] + \frac{1}{4}\mathrm{Var}[U_t]. \tag{4.3.5}$$
It remains to compute Var[Ut ]. Let $I_t^{(i)}$ be 1 if coordinate i has not been updated
up to time t and 0 otherwise. Note that for i ≠ j
$$\begin{aligned}
\mathrm{Cov}[I_t^{(i)}, I_t^{(j)}]
&= \mathbb{E}[I_t^{(i)} I_t^{(j)}] - \mathbb{E}[I_t^{(i)}]\,\mathbb{E}[I_t^{(j)}] \\
&= \left(1 - \frac{2}{n}\right)^t - \left(1 - \frac{1}{n}\right)^{2t} \\
&= \left(1 - \frac{2}{n}\right)^t - \left(1 - \frac{2}{n} + \frac{1}{n^2}\right)^t \\
&\leq 0,
\end{aligned}$$
that is, $I_t^{(i)}$ and $I_t^{(j)}$ are negatively correlated, while
$$\mathrm{Var}[I_t^{(i)}] = \mathbb{E}[(I_t^{(i)})^2] - (\mathbb{E}[I_t^{(i)}])^2 \leq \mathbb{E}[I_t^{(i)}] = \left(1 - \frac{1}{n}\right)^t.$$

Then, writing n − Ut as the sum of these indicators, we have
$$\begin{aligned}
\mathrm{Var}[U_t] &= \mathrm{Var}[n - U_t] \\
&= \sum_{i=1}^n \mathrm{Var}[I_t^{(i)}] + 2 \sum_{i < j} \mathrm{Cov}[I_t^{(i)}, I_t^{(j)}] \\
&\leq n \left(1 - \frac{1}{n}\right)^t.
\end{aligned}$$
Plugging this back into (4.3.5), we get
$$\mathrm{Var}[W_t] \leq \frac{n}{4}\left[1 - \left(1 - \frac{1}{n}\right)^t\right] + \frac{n}{4}\left(1 - \frac{1}{n}\right)^t = \frac{n}{4}.$$
For $t_\alpha = \frac{1}{2} n \log n - n \log \alpha$ with α > 0, by (4.3.4),
$$\mathbb{E}[W_{t_\alpha}] = \frac{n}{2} + \frac{n}{2}\, e^{t_\alpha(-1/n + \Theta(1/n^2))} = \frac{n}{2} + \frac{\alpha}{2}\sqrt{n} + o(1),$$
where we used that, by a Taylor expansion, for |z| ≤ 1/2, log(1 − z) = −z + Θ(z²).
Fix 0 < ε < 1/2. By Chebyshev's inequality, for $t_\alpha = \frac{1}{2} n \log n - n \log \alpha$
and n large enough,
$$\mathbb{P}[W_{t_\alpha} \leq n/2] \leq \mathbb{P}\big[|W_{t_\alpha} - \mathbb{E}[W_{t_\alpha}]| \geq (\alpha/2)\sqrt{n}\big] \leq \frac{n/4}{(\alpha/2)^2 n} \leq \frac{1}{2} - \varepsilon,$$
for α large enough. By (4.3.2) and (4.3.3), that implies d(tα ) ≥ ε and we are
done.
The previous proof relies on a "distinguishing statistic." Recall from Lemma 4.1.19
that for any random variables X, Y and mapping h it holds that
$$\|\mu_{h(X)} - \mu_{h(Y)}\|_{\mathrm{TV}} \leq \|\mu_X - \mu_Y\|_{\mathrm{TV}},$$
where µZ is the law of Z. The mapping used in the proof of the claim is the
Hamming weight. In essence, we gave a lower bound on the total variation distance
between the laws of the Hamming weight at stationarity and under P t (z0 , ·). See
Exercise 4.16 for a more general treatment of the distinguishing statistic approach.
Remark 4.3.8. The upper bound in Claim 4.3.6 is indeed off by a factor of 2. See [LPW06,
Theorem 18.3] for an improved upper bound and a discussion of the so-called cutoff
phenomenon. The latter refers to the fact that for all 0 < ε < 1/2 it can be shown in this case
that
$$\lim_{n \to +\infty} \frac{t_{\mathrm{mix}}^{(n)}(\varepsilon)}{t_{\mathrm{mix}}^{(n)}(1 - \varepsilon)} = 1,$$
where $t_{\mathrm{mix}}^{(n)}(\varepsilon)$ is the mixing time on the n-dimensional hypercube. In words, for large n,
the total variation distance drops from 1 to 0 in a short time window. See Exercise 5.10 for
a necessary condition for cutoff.

b-ary tree

Let $(Z_t)_{t \in \mathbb{Z}_+}$ be lazy simple random walk on the ℓ-level rooted b-ary tree, $\widehat{T}_b^\ell$, with
ℓ ≥ 2. The root, 0, is on level 0 and the leaves, L, are on level ℓ. All vertices
have degree b + 1, except for the root which has degree b and the leaves which
have degree 1. By Example 1.1.29 (noting that laziness makes no difference), the
stationary distribution is
$$\pi(x) := \frac{\delta(x)}{2(n-1)},$$
where n is the number of vertices and δ(x) is the degree of x. We used that a tree
on n vertices has n − 1 edges (Corollary 1.1.7). We construct a coupling (Xt , Yt )
of this chain started at (x, y). Assume without loss of generality that x is no further
from the root than y, which we denote by x ≼ y (which, here, does not mean that
y is a descendant of x). The coupling has two stages:
- In the first stage, at each time, flip a fair coin. On heads, Yt stays put and Xt
moves one step chosen uniformly at random among its neighbors. Similarly,
on tails, reverse the roles of Xt and Yt . Do this until Xt and Yt are on the
same level.
- In the second stage, that is, once the two chains are on the same level, at each
time first let Xt move as a lazy simple random walk on $\widehat{T}_b^\ell$. Then let Yt move
in the same direction as Xt , that is, if Xt moves closer to the root, so does
Yt and so on.
By construction, Xt ≼ Yt for all t. The key observation is the following. Let τ∗
be the first time (Xt ) visits the root after visiting the leaves. By time τ∗, the two
chains have necessarily met: because Xt ≼ Yt , when Xt reaches the leaves, so
does Yt ; after that time, the coupling is in the second stage so Xt and Yt remain on
the same level; in particular, when Xt reaches the root (after visiting the leaves), so
does Yt . Hence τcoal ≤ τ∗. Intuitively, the mixing time is indeed dominated by the
time it takes to reach the root from the worst starting point, a leaf. See Figure 4.7
and the corresponding lower bound argument.
To estimate $\mathbb{P}_{(x,y)}[\tau^* > t]$, we use Markov's inequality (Theorem 2.1.1), for
which we need a bound on $\mathbb{E}_{(x,y)}[\tau^*]$. We note that $\mathbb{E}_{(x,y)}[\tau^*]$ is less than the mean
time for the walk to go from the root to the leaves and back. Let Lt be the level
of Xt and let N be the corresponding network (where the conductances are equal
to the number of edges on each level of the tree). In terms of Lt , the quantity we
seek to bound is the mean of τ0,ℓ , the commute time of the chain (Lt ) between the
states 0 and ℓ. By the commute time identity (Theorem 3.3.34),
$$\mathbb{E}[\tau_{0,\ell}] = c_{\mathcal{N}}\, \mathcal{R}(0 \leftrightarrow \ell), \tag{4.3.6}$$



where
$$c_{\mathcal{N}} = 2 \sum_{e = \{x,y\} \in \mathcal{N}} c(e) = 4(n-1),$$
where we simply counted the number of edges in $\widehat{T}_b^\ell$ and the extra factor of 2
accounts for self-loops. Using network reduction techniques, we computed the
effective resistance R(0 ↔ ℓ) in Examples 3.3.21 and 3.3.22—without self-loops.
Of course adding self-loops does not affect the effective resistance as we can use
the same voltage and current. So, ignoring them, we get
$$\mathcal{R}(0 \leftrightarrow \ell) = \sum_{j=0}^{\ell-1} r(j, j+1) = \sum_{j=0}^{\ell-1} b^{-(j+1)} = \frac{1}{b} \cdot \frac{1 - b^{-\ell}}{1 - b^{-1}}, \tag{4.3.7}$$
which implies
$$\frac{1}{b} \leq \mathcal{R}(0 \leftrightarrow \ell) \leq \frac{1}{b-1} \leq 1.$$
Finally, applying Theorem 4.3.2 and Markov's inequality and using (4.3.6), we get
$$\begin{aligned}
d(t) &\leq \max_{x,y \in V} \mathbb{P}_{(x,y)}[\tau^* > t] \\
&\leq \max_{x,y \in V} \frac{\mathbb{E}_{(x,y)}[\tau^*]}{t} \\
&\leq \frac{\mathbb{E}[\tau_{0,\ell}]}{t} \\
&\leq \frac{4n}{t},
\end{aligned}$$
or:

Claim 4.3.9.
$$t_{\mathrm{mix}}(\varepsilon) \leq \frac{4n}{\varepsilon}.$$
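The series computation behind (4.3.7) can be checked directly; the minimal sketch below compares the sum of level resistances with the closed form and the bounds just derived.

```python
# Series formula (4.3.7): between levels j and j+1 there are b^(j+1)
# parallel unit-resistance edges, contributing resistance b^-(j+1) in series.
def effective_resistance(b, l):
    return sum(b ** -(j + 1) for j in range(l))

for b in [2, 3, 5]:
    for l in [2, 4, 8]:
        R = effective_resistance(b, l)
        closed_form = (1 / b) * (1 - b ** -l) / (1 - 1 / b)
        assert abs(R - closed_form) < 1e-12
        assert 1 / b <= R <= 1 / (b - 1) <= 1
```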
This time the diameter-based bound is far off. We have ∆ = 2ℓ = Θ(log n)
and πmin = 1/2(n − 1) so that Claim 5.2.25 gives
$$t_{\mathrm{mix}}(\varepsilon) \geq \frac{(2\ell)^2}{12 \log n + 4 \log(2(n-1))} = \Omega(\log n),$$
for n large enough.


Here is a better lower bound. We take b = 2 to simplify. Intuitively the mixing
time is significantly greater than the squared diameter because the chain tends to
be pushed away from the root. Consider the time it takes to go from the leaves

Figure 4.7: Setup for the lower bound on the mixing time on a b-ary tree. (Here
b = 2.)

on one side of the root to the leaves on the other, both of which have substantial
weight under the stationary distribution. That typically takes time exponential in
the diameter—that is, linear in n. Indeed one first has to reach the root, which by
the gambler's ruin problem (Example 3.1.43), takes an exponential in ℓ number of
"excursions" (see Claim 3.1.44 (ii)).
Formally let x0 be a leaf of $\widehat{T}_b^\ell$ and let A be the set of vertices "on the other
side of the root (inclusively)," that is, vertices whose graph distance from x0 is at least
ℓ. See Figure 4.7. Then π(A) ≥ 1/2 by symmetry. We use the fact that
$$\|P^t(x_0, \cdot) - \pi\|_{\mathrm{TV}} \geq |P^t(x_0, A) - \pi(A)|,$$

to bound the mixing time from below. We claim that, started at x0 , the walk takes
time linear in n to reach A with nontrivial probability.
Consider again the level Lt of Xt . Using the definition of the effective resistance
(Definition 3.3.19) as well as the expression for it in (4.3.7), we have
$$\mathbb{P}_\ell[\tau_0 < \tau_\ell^+] = \frac{1}{c(\ell)\, \mathcal{R}(0 \leftrightarrow \ell)} = \frac{1}{b^\ell} \cdot \frac{b-1}{1 - b^{-\ell}} = \frac{b-1}{b^\ell - 1} = O\left(\frac{1}{n}\right).$$
Hence, started from the leaves, the number of excursions back to the leaves needed
to reach the root for the first time is geometric with success probability O(n⁻¹).
Each such excursion takes time at least 2 (which corresponds to going right back
to the leaves after the first step). So P t (x0 , A) is bounded above by the probability
that at least one such excursion was successful among the first t/2 attempts. That
is,
$$P^t(x_0, A) \leq 1 - \left(1 - O(n^{-1})\right)^{t/2} < \frac{1}{2} - \varepsilon,$$
for all t ≤ αε n with αε > 0 small enough, and
$$\|P^{\alpha_\varepsilon n}(x_0, \cdot) - \pi\|_{\mathrm{TV}} \geq |P^{\alpha_\varepsilon n}(x_0, A) - \pi(A)| > \varepsilon.$$
We have proved that tmix (ε) ≥ αε n.

4.3.3 Path coupling


Path coupling is a method for constructing Markovian couplings from “simpler”
couplings. The building blocks are one-step couplings starting from pairs of initial
states that are close in some “dissimilarity graph.”
Let (Xt ) be an irreducible Markov chain on a finite state space V with transition
matrix P and stationary distribution π. Assume that we have a dissimilarity
graph H0 = (V0 , E0 ) on V0 := V with edge weights w0 : E0 → R+ . This graph
need not have the same edges as the transition graph of (Xt ). We extend w0 to the
path metric
$$w_0(x, y) := \inf\left\{\sum_{i=0}^{m-1} w_0(x_i, x_{i+1}) : x = x_0, x_1, \ldots, x_m = y \text{ is a path in } H_0\right\},$$
where the infimum is over all paths connecting x and y in H0 . We call a path
achieving the infimum a minimum-weight path. It is straightforward to check that
w0 satisfies the triangle inequality. Let
$$\Delta_0 := \max_{x,y} w_0(x, y),$$
be the weighted diameter of H0 .


Theorem 4.3.10 (Path coupling method). Assume that
$$w_0(u, v) \geq 1,$$
for all {u, v} ∈ E0 . Assume further that there exists κ ∈ (0, 1) such that:
- (Local couplings) For all x, y with {x, y} ∈ E0 , there is a coupling (X∗ , Y∗ )
of P (x, ·) and P (y, ·) satisfying the following contraction property
$$\mathbb{E}[w_0(X^*, Y^*)] \leq \kappa\, w_0(x, y). \tag{4.3.8}$$
Then
$$d(t) \leq \Delta_0 \kappa^t,$$
or
$$t_{\mathrm{mix}}(\varepsilon) \leq \left\lceil \frac{\log \Delta_0 + \log \varepsilon^{-1}}{\log \kappa^{-1}} \right\rceil.$$
Proof. The crux of the proof is to extend (4.3.8) to arbitrary pairs of vertices.
Claim 4.3.11 (Global coupling). For all x, y ∈ V , there is a coupling (X∗ , Y∗ ) of
P (x, ·) and P (y, ·) such that (4.3.8) holds.
Iterating the coupling in this last claim immediately implies the existence of a
coalescing Markovian coupling (Xt , Yt ) of P such that
$$\begin{aligned}
\mathbb{E}_{(x,y)}[w_0(X_t, Y_t)]
&= \mathbb{E}_{(x,y)}[\mathbb{E}[w_0(X_t, Y_t) \mid X_{t-1}, Y_{t-1}]] \\
&\leq \mathbb{E}_{(x,y)}[\kappa\, w_0(X_{t-1}, Y_{t-1})] \\
&\leq \cdots \\
&\leq \kappa^t\, \mathbb{E}_{(x,y)}[w_0(X_0, Y_0)] \\
&= \kappa^t\, w_0(x, y) \\
&\leq \kappa^t \Delta_0.
\end{aligned}$$
By assumption, $\mathbf{1}_{\{x \neq y\}} \leq w_0(x, y)$, so that by the coupling inequality and Lemma
4.3.1, we have
$$d(t) \leq \bar{d}(t) \leq \max_{x,y} \mathbb{P}_{(x,y)}[X_t \neq Y_t] \leq \max_{x,y} \mathbb{E}_{(x,y)}[w_0(X_t, Y_t)] \leq \kappa^t \Delta_0,$$
which implies the theorem.


Remark 4.3.12. In essence, w0 satisfies a form of Lyapounov condition (i.e., (3.3.15)) with
a “geometric drift.” See, e.g., [MT09, Chapter 15].
It remains to prove Claim 4.3.11.

Proof of Claim 4.3.11. Fix x′, y′ ∈ V such that {x′, y′} is not an edge in the
dissimilarity graph H0 . The idea is to combine the local couplings on a minimum-
weight path between x′ and y′ in H0 . Let x′ = x0 ∼ · · · ∼ xm = y′ be such a path.
For all i = 0, . . . , m − 1, let $(Z_{i,0}^*, Z_{i,1}^*)$ be a coupling of P (xi , ·) and P (xi+1 , ·)
satisfying the contraction property (4.3.8).
Then we proceed as follows. Set $Z^{(0)} := Z_{0,0}^*$ and $Z^{(1)} := Z_{0,1}^*$. Then
iteratively pick $Z^{(i+1)}$ according to the law $\mathbb{P}[Z_{i,1}^* \in \cdot \mid Z_{i,0}^* = Z^{(i)}]$. By induction
on i, $(X^*, Y^*) := (Z^{(0)}, Z^{(m)})$ is then a coupling of P (x′, ·) and P (y′, ·). See
Figure 4.8.

Figure 4.8: Coupling of P (x′, ·) and P (y′, ·) constructed from a sequence of local
couplings $(Z_{0,0}^*, Z_{0,1}^*), \ldots, (Z_{m-1,0}^*, Z_{m-1,1}^*)$.

To be more formal, define the transition matrix
$$R_i(z^{(i)}, z^{(i+1)}) := \mathbb{P}[Z_{i,1}^* = z^{(i+1)} \mid Z_{i,0}^* = z^{(i)}].$$
Observe that
$$\sum_{z^{(i+1)}} R_i(z^{(i)}, z^{(i+1)}) = 1, \tag{4.3.9}$$
and
$$\sum_{z^{(i)}} P(x_i, z^{(i)})\, R_i(z^{(i)}, z^{(i+1)}) = P(x_{i+1}, z^{(i+1)}), \tag{4.3.10}$$
by construction of the coupling $(Z_{i,0}^*, Z_{i,1}^*)$ and the definition of Ri . The law of the
full coupling $(Z^{(0)}, \ldots, Z^{(m)})$ is
$$\mathbb{P}[(Z^{(0)}, \ldots, Z^{(m)}) = (z^{(0)}, \ldots, z^{(m)})]
= P(x_0, z^{(0)})\, R_0(z^{(0)}, z^{(1)}) \cdots R_{m-1}(z^{(m-1)}, z^{(m)}).$$
Using (4.3.9) and (4.3.10) inductively gives respectively
$$\mathbb{P}[X^* = z^{(0)}] = \mathbb{P}[Z^{(0)} = z^{(0)}] = P(x_0, z^{(0)}),$$
$$\mathbb{P}[Y^* = z^{(m)}] = \mathbb{P}[Z^{(m)} = z^{(m)}] = P(x_m, z^{(m)}),$$
as required.
By the triangle inequality for w0 , the coupling $(X^*, Y^*)$ satisfies
$$\begin{aligned}
\mathbb{E}[w_0(X^*, Y^*)] &= \mathbb{E}\big[w_0(Z^{(0)}, Z^{(m)})\big] \\
&\leq \sum_{i=0}^{m-1} \mathbb{E}\big[w_0(Z^{(i)}, Z^{(i+1)})\big] \\
&\leq \sum_{i=0}^{m-1} \kappa\, w_0(x_i, x_{i+1}) \\
&= \kappa\, w_0(x', y'),
\end{aligned}$$
where, on the third line, we used (4.3.8) for adjacent pairs and the last line follows
from the fact that we chose a minimum-weight path.

That concludes the proof of the theorem.

We illustrate the path coupling method in the next subsection. See Exer-
cise 4.17 for an optimal transport perspective on the path coupling method.

4.3.4 Ising model: Glauber dynamics at high temperature


Let G = (V, E) be a finite, connected graph with maximal degree δ̄. Define X :=
{−1, +1}^V . Recall from Example 1.2.5 that the (ferromagnetic) Ising model on V
with inverse temperature β is the probability distribution over spin configurations
σ ∈ X given by
$$\mu_\beta(\sigma) := \frac{1}{Z(\beta)}\, e^{-\beta \mathcal{H}(\sigma)},$$
where
$$\mathcal{H}(\sigma) := -\sum_{i \sim j} \sigma_i \sigma_j,$$
is the Hamiltonian and
$$Z(\beta) := \sum_{\sigma \in \mathcal{X}} e^{-\beta \mathcal{H}(\sigma)},$$
is the partition function. In this context, recall that vertices are often referred to as
sites. The single-site Glauber dynamics (Definition 1.2.8) of the Ising model is the
Markov chain on X which, at each time, selects a site i ∈ V uniformly at random
and updates the spin σi according to µβ (σ) conditioned on agreeing with σ at all
sites in V \{i}. Specifically, for γ ∈ {−1, +1}, i ∈ V , and σ ∈ X , let $\sigma^{i,\gamma}$ be
the configuration σ with the state at i being set to γ. Then, letting n = |V |, the
transition matrix of the Glauber dynamics is
$$Q_\beta(\sigma, \sigma^{i,\gamma}) := \frac{1}{n} \cdot \frac{e^{\gamma \beta S_i(\sigma)}}{e^{-\beta S_i(\sigma)} + e^{\beta S_i(\sigma)}}
= \frac{1}{n} \left[\frac{1}{2} + \frac{1}{2} \tanh(\gamma \beta S_i(\sigma))\right], \tag{4.3.11}$$
where
$$S_i(\sigma) := \sum_{j \sim i} \sigma_j.$$

All other transitions have probability 0. Recall that this chain is irreducible and
reversible with respect to µβ . In particular µβ is the stationary distribution of Qβ .
In this section we give an upper bound on the mixing time, tmix (ε), of Qβ
using path coupling. We say that the Glauber dynamics is fast mixing if tmix (ε) =
O(n log n). We first make a simple observation:

Claim 4.3.13 (Glauber dynamics: lower bound on mixing).
$$t_{\mathrm{mix}}(\varepsilon) = \Omega(n), \qquad \forall \beta > 0.$$



Proof. Similarly to what we did in Section 4.3.2 in the context of random walk
on the hypercube (but for a lower bound this time), we use a coupon collecting
argument (see Example 2.1.4). Let σ̄ be the all-(−1) configuration and let A be the
set of configurations where at least half of the sites are +1. Then, by symmetry,
$\mu_\beta(A) = \mu_\beta(A^c) = 1/2$, where we assumed for simplicity that n is odd. By
definition of the total variation distance,
$$d(t) \geq \|Q_\beta^t(\bar{\sigma}, \cdot) - \mu_\beta(\cdot)\|_{\mathrm{TV}}
\geq |Q_\beta^t(\bar{\sigma}, A) - \mu_\beta(A)|
= |Q_\beta^t(\bar{\sigma}, A) - 1/2|. \tag{4.3.12}$$
So it remains to show that by time c n, for c > 0 small, the chain is unlikely to have
reached A. That happens if, say, fewer than a third of the sites have been updated.
Using the notation of Example 2.1.4, we are seeking a bound on Tn,n/3 , that is, the
time to collect n/3 coupons out of n.
We can write this random variable as a sum of n/3 independent geometric
variables $T_{n,n/3} = \sum_{i=1}^{n/3} \tau_{n,i}$, where $\mathbb{E}[\tau_{n,i}] = \big(1 - \frac{i-1}{n}\big)^{-1}$ and $\mathrm{Var}[\tau_{n,i}] \leq
\big(1 - \frac{i-1}{n}\big)^{-2}$. Hence, approximating the Riemann sums below by integrals,
$$\mathbb{E}[T_{n,n/3}] = \sum_{i=1}^{n/3} \left(1 - \frac{i-1}{n}\right)^{-1} = n \sum_{j=2n/3+1}^{n} j^{-1} = \Theta(n), \tag{4.3.13}$$
and
$$\mathrm{Var}[T_{n,n/3}] \leq \sum_{i=1}^{n/3} \left(1 - \frac{i-1}{n}\right)^{-2} = n^2 \sum_{j=2n/3+1}^{n} j^{-2} = \Theta(n). \tag{4.3.14}$$
So by Chebyshev's inequality (Theorem 2.1.2)
$$\mathbb{P}\big[|T_{n,n/3} - \mathbb{E}[T_{n,n/3}]| \geq \varepsilon n\big] \leq \frac{\mathrm{Var}[T_{n,n/3}]}{(\varepsilon n)^2} \to 0,$$
by (4.3.14). In view of (4.3.13), taking ε > 0 small enough and n large enough,
we have shown that for t ≤ cε n for some cε > 0
$$Q_\beta^t(\bar{\sigma}, A) \leq 1/3,$$
which proves the claim by (4.3.12) and the definition of the mixing time (Definition 1.1.35).

Remark 4.3.14. In fact, Ding and Peres proved that tmix (ε) = Ω(n log n) for any graph
on n vertices [DP11]. In Claim 4.3.7, we treated the special case of the empty graph,
which is equivalent to lazy random walk on the hypercube. See also Section 5.3.4 for a
much stronger lower bound at low temperature for certain graphs with good “expansion
properties.”
In our main result of this section, we show that the Glauber dynamics of the
Ising model is fast mixing when the inverse temperature β is small enough as a
function of the maximum degree.
Claim 4.3.15 (Glauber dynamics: fast mixing at high temperature).
$$\beta < \bar{\delta}^{-1} \implies t_{\mathrm{mix}}(\varepsilon) = O(n \log n).$$
Proof. We use path coupling. Let H0 = (V0 , E0 ) where V0 := X and {σ, ω} ∈ E0
if $\frac{1}{2}\|\sigma - \omega\|_1 = 1$ (i.e., they differ in exactly one coordinate), with unit weight on
all edges. To avoid confusion, we reserve the notation ∼ for adjacency in G.
Let {σ, ω} ∈ E0 differ at coordinate i. We construct a coupling (X∗ , Y∗ ) of
Qβ (σ, ·) and Qβ (ω, ·). We first pick the same coordinate i∗ to update. If i∗ is such
that all its neighbors in G have the same state in σ and ω, that is, if σj = ωj for
all j ∼ i∗ , we update X∗ from σ according to the Glauber rule and set Y∗ := X∗ .
Note that this includes the case i∗ = i. Otherwise, that is, if i∗ ∼ i, we proceed as
follows. From the state σ, the probability of updating site i∗ to state γ ∈ {−1, +1}
is given by the expression in brackets in (4.3.11), and similarly for ω. Unlike the
previous case, we cannot guarantee that the update is identical in both chains. In
order to minimize the chance of increasing the distance between the two chains,
we use a monotone coupling, which recall from Example 4.1.17 is maximal in the
two-state case. Specifically, we pick a uniform random variable U in [−1, 1] and
set
$$X_{i^*}^* := \begin{cases} +1 & \text{if } U \leq \tanh(\beta S_{i^*}(\sigma)), \\ -1 & \text{otherwise}, \end{cases}
\qquad
Y_{i^*}^* := \begin{cases} +1 & \text{if } U \leq \tanh(\beta S_{i^*}(\omega)), \\ -1 & \text{otherwise}. \end{cases}$$
We set $X_j^* := \sigma_j$ and $Y_j^* := \omega_j$ for all j ≠ i∗ . The expected distance between X∗
and Y∗ is then
$$\mathbb{E}[w_0(X^*, Y^*)]
= 1 - \underbrace{\frac{1}{n}}_{(a)} + \frac{1}{n} \underbrace{\sum_{j : j \sim i} \frac{1}{2} |\tanh(\beta S_j(\sigma)) - \tanh(\beta S_j(\omega))|}_{(b)}, \tag{4.3.15}$$
where: (a) corresponds to i∗ = i, in which case w0 (X∗ , Y∗ ) = 0; (b) corresponds
to i∗ ∼ i, in which case w0 (X∗ , Y∗ ) = 2 with probability
$$\frac{1}{2} |\tanh(\beta S_{i^*}(\sigma)) - \tanh(\beta S_{i^*}(\omega))|,$$
by our coupling; and otherwise w0 (X∗ , Y∗ ) = w0 (σ, ω) = 1. To bound (b), we
note that for any j ∼ i
$$|\tanh(\beta S_j(\sigma)) - \tanh(\beta S_j(\omega))| = \tanh(\beta(s+2)) - \tanh(\beta s), \tag{4.3.16}$$
where
$$s := S_j(\sigma) \wedge S_j(\omega).$$
The derivative of tanh is maximized at 0, where it is equal to 1. So the right-hand
side of (4.3.16) is ≤ β(s + 2) − βs = 2β. Plugging this back into (4.3.15) and
using 1 − x ≤ e^{−x} for all x (see Exercise 1.16), we get
$$\mathbb{E}[w_0(X^*, Y^*)] \leq 1 - \frac{1 - \bar{\delta}\beta}{n} \leq \exp\left(-\frac{1 - \bar{\delta}\beta}{n}\right) = \kappa\, w_0(\sigma, \omega),$$
where
$$\kappa := \exp\left(-\frac{1 - \bar{\delta}\beta}{n}\right) < 1,$$
by our assumption on β. The diameter of H0 is ∆0 = n. By Theorem 4.3.10,
$$t_{\mathrm{mix}}(\varepsilon) \leq \left\lceil \frac{\log \Delta_0 + \log \varepsilon^{-1}}{\log \kappa^{-1}} \right\rceil
= \left\lceil \frac{n(\log n + \log \varepsilon^{-1})}{1 - \bar{\delta}\beta} \right\rceil,$$
which implies the claim.

which implies the claim.


Remark 4.3.16. A slightly more careful analysis shows that the condition δ̄ tanh(β) < 1
is enough for the claim to hold. See [LPW06, Theorem 15.1].
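The one-step contraction (4.3.8) established in the proof can be verified exactly on a tiny instance. The sketch below (a hypothetical 4-cycle with β = 0.3 < 1/δ̄, chosen only for illustration) enumerates all adjacent pairs of configurations and evaluates the expected distance formula (4.3.15) against κ.

```python
import math
from itertools import product

# One-step contraction check for the path coupling of the Glauber dynamics
# on a made-up 4-cycle; max degree is 2, so we need beta < 1/2.
n = 4
neighbors = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
beta = 0.3
max_deg = 2

def S(sigma, j):
    # Sum of the neighboring spins of site j.
    return sum(sigma[k] for k in neighbors[j])

kappa = math.exp(-(1 - max_deg * beta) / n)
for sigma in product([-1, 1], repeat=n):
    for i in range(n):
        omega = list(sigma)
        omega[i] = -omega[i]          # adjacent configuration in H0
        # Expected distance after one coupled step, formula (4.3.15).
        exp_dist = 1 - 1 / n + (1 / n) * sum(
            0.5 * abs(math.tanh(beta * S(sigma, j)) - math.tanh(beta * S(omega, j)))
            for j in neighbors[i]
        )
        assert exp_dist <= kappa
```

Enumerating all pairs is only feasible for very small graphs, but it makes the role of the high-temperature assumption concrete: increasing β toward 1/δ̄ pushes the expected distance up toward κ.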

4.4 Chen-Stein method


The Chen-Stein method serves to establish Poisson approximation results with
quantitative bounds in certain settings with dependent variables that are common,
for instance, in random graphs and string statistics.

Setting The basic setup is a sum of Bernoulli (i.e., {0, 1}-valued) random
variables $\{X_i\}_{i=1}^n$,
$$W = \sum_{i=1}^n X_i, \tag{4.4.1}$$
where the Xi s are not assumed independent or identically distributed. Define
$$p_i = \mathbb{P}[X_i = 1], \tag{4.4.2}$$
and
$$\mathbb{E}[W] = \lambda := \sum_{i=1}^n p_i. \tag{4.4.3}$$
Letting µ denote the law of W and π be the Poisson distribution with mean λ, our
goal is to bound $\|\mu - \pi\|_{\mathrm{TV}}$.
We first state the main bounds and give some examples of their use. We then
motivate and prove the result, and return to further applications. Throughout the
next two subsections, we use the notation in (4.4.1), (4.4.2) and (4.4.3).

4.4.1 Main bounds and examples


We begin with an elementary observation.
Theorem 4.4.1 (Stein equation for the Poisson distribution). Let λ > 0. A non-
negative integer-valued random variable Z is Poi(λ) if and only if for all bounded g
$$\mathbb{E}[\lambda g(Z+1) - Z g(Z)] = 0. \tag{4.4.4}$$
The "only if" follows from a direct calculation. The "if" follows from taking
$g(z) := \mathbf{1}_{\{z = k\}}$ for all k ≥ 1 and deriving a recursion. Exercise 4.18 asks for the details.
One might expect that if the left-hand side of (4.4.4) is "small for many gs," then
Z is approximately Poisson.
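The "only if" direction of the Stein equation can be checked numerically for a particular bounded g; a minimal sketch (the test function g is an arbitrary illustrative choice):

```python
import math

# Direct check of the Stein equation (4.4.4) for Poi(lambda).
lam = 2.5
g = lambda z: 1.0 / (z + 1)   # arbitrary bounded test function

lhs, p = 0.0, math.exp(-lam)  # p tracks the pmf P[Z = k], starting at k = 0
for k in range(100):          # the Poisson tail beyond 100 is negligible here
    lhs += p * (lam * g(k + 1) - k * g(k))
    p *= lam / (k + 1)
assert abs(lhs) < 1e-10
```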
The following key result in some sense helps to formalize this intuition. We
prove it by constructing a Markov chain that "interpolates" between µ and π,
where (4.4.4) will arise naturally (see Section 4.4.2).
Theorem 4.4.2 (Chen-Stein method). Let W ∼ µ and π ∼ Poi(λ). Then there
exists a function h : {0, 1, . . . , n + 1} → R such that
$$\|\mu - \pi\|_{\mathrm{TV}} = \mathbb{E}[-\lambda h(W+1) + W h(W)]. \tag{4.4.5}$$
Moreover h satisfies the following Lipschitz condition: for all y, y′ ∈ {0, 1, . . . , n + 1},
$$|h(y') - h(y)| \leq (1 \wedge \lambda^{-1})\, |y' - y|. \tag{4.4.6}$$

By bounding the right-hand side of (4.4.5) for any function satisfying (4.4.6), we
get a Poisson approximation result for µ.
One way to do this is to construct a certain type of coupling. We begin with a
definition, which will be justified in the corollary below. We write X ∼ Y |A to
mean that X is distributed as Y conditioned on the event A.
Definition 4.4.3 (Stein coupling). A Stein coupling is a pair (Ui , Vi ), for each
i = 1, . . . , n, such that
$$U_i \sim W, \qquad V_i \sim W - 1 \mid X_i = 1.$$
Each pair (Ui , Vi ) is defined on a joint probability space, but different pairs need
not be.
How such a coupling is constructed will become clearer in the examples below.
Corollary 4.4.4. Let (Ui , Vi ), i = 1, . . . , n, be a Stein coupling. Then
$$\|\mu - \pi\|_{\mathrm{TV}} \leq (1 \wedge \lambda^{-1}) \sum_{i=1}^n p_i\, \mathbb{E}|U_i - V_i|. \tag{4.4.7}$$
Proof. By (4.4.5), using the facts that $\lambda = \sum_{i=1}^n p_i$ and $W = \sum_{i=1}^n X_i$, we get
$$\begin{aligned}
\|\mu - \pi\|_{\mathrm{TV}}
&= \mathbb{E}[-\lambda h(W+1) + W h(W)] \\
&= \mathbb{E}\left[-\left(\sum_{i=1}^n p_i\right) h(W+1) + \left(\sum_{i=1}^n X_i\right) h(W)\right] \\
&= \sum_{i=1}^n \left(-p_i\, \mathbb{E}[h(W+1)] + \mathbb{E}[X_i h(W)]\right) \\
&= \sum_{i=1}^n \left(-p_i\, \mathbb{E}[h(W+1)] + \mathbb{E}[h(W) \mid X_i = 1]\, \mathbb{P}[X_i = 1]\right) \\
&= \sum_{i=1}^n p_i \left(-\mathbb{E}[h(W+1)] + \mathbb{E}[h(W) \mid X_i = 1]\right).
\end{aligned}$$
Let (Ui , Vi ), i = 1, . . . , n, be a Stein coupling (Definition 4.4.3). Then, we can
rewrite this last expression as
$$\begin{aligned}
&= \sum_{i=1}^n p_i \left(-\mathbb{E}[h(U_i + 1)] + \mathbb{E}[h(V_i + 1)]\right) \\
&\leq \sum_{i=1}^n p_i\, \mathbb{E}[|h(U_i + 1) - h(V_i + 1)|].
\end{aligned}$$
By (4.4.6), we finally get
$$\|\mu - \pi\|_{\mathrm{TV}} \leq (1 \wedge \lambda^{-1}) \sum_{i=1}^n p_i\, \mathbb{E}|U_i - V_i|,$$
which concludes the proof.

As a first example, we derive a Poisson approximation result in the independent
case. Compare to Theorem 4.1.18.

Example 4.4.5 (Independent Xi s). Assume the Xi s are independent. We prove
the following:

Claim 4.4.6.
$$\|\mu - \pi\|_{\mathrm{TV}} \leq (1 \wedge \lambda^{-1}) \sum_{i=1}^n p_i^2.$$

We use the following Stein coupling. For each i = 1, . . . , n, we let
$$U_i = W$$
and
$$V_i = \sum_{j : j \neq i} X_j.$$
By independence,
$$V_i = W - X_i \sim W - 1 \mid X_i = 1,$$
as desired. Plugging into (4.4.7), we obtain the bound
$$\begin{aligned}
\|\mu - \pi\|_{\mathrm{TV}}
&\leq (1 \wedge \lambda^{-1}) \sum_{i=1}^n p_i\, \mathbb{E}|U_i - V_i| \\
&\leq (1 \wedge \lambda^{-1}) \sum_{i=1}^n p_i\, \mathbb{E}\Big|W - \sum_{j \neq i} X_j\Big| \\
&\leq (1 \wedge \lambda^{-1}) \sum_{i=1}^n p_i\, \mathbb{E}|X_i| \\
&\leq (1 \wedge \lambda^{-1}) \sum_{i=1}^n p_i^2.
\end{aligned}$$
J
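Claim 4.4.6 can be verified exactly on a small example by computing the law of W by convolution; the p_i below are made up for illustration.

```python
import math

# Exact check of Claim 4.4.6: W is a sum of independent Bernoulli(p_i),
# compared with Poi(lambda) in total variation.
ps = [0.1, 0.25, 0.05, 0.3]
lam = sum(ps)

# Law of W by direct convolution.
law = [1.0]
for p in ps:
    new = [0.0] * (len(law) + 1)
    for k, q in enumerate(law):
        new[k] += q * (1 - p)
        new[k + 1] += q * p
    law = new

# Poisson pmf, truncated far into the tail.
K = 60
poi, q = [], math.exp(-lam)
for k in range(K):
    poi.append(q)
    q *= lam / (k + 1)

tv = 0.5 * sum(abs((law[k] if k < len(law) else 0.0) - poi[k]) for k in range(K))
tv += 0.5 * (1.0 - sum(poi))   # remaining Poisson tail mass (tiny)

bound = min(1.0, 1.0 / lam) * sum(p * p for p in ps)
assert tv <= bound
```

In examples like this the true total variation distance is typically well below the bound, which is consistent with its role as a worst-case estimate.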

Here is a less straightforward example.

Example 4.4.7 (Balls in boxes). Suppose we throw k balls uniformly at random in
n boxes independently. Let
$$X_i = \mathbf{1}\{\text{box } i \text{ is empty}\},$$
and let $W = \sum_{i=1}^n X_i$ be the number of empty boxes. Note that the Xi s are not
independent. In particular, we cannot use Theorem 4.1.18. Note that
$$p_i = \left(1 - \frac{1}{n}\right)^k,$$
for all i and, hence,
$$\lambda = n \left(1 - \frac{1}{n}\right)^k.$$
For each i = 1, . . . , n, we generate the coupling (Ui , Vi ) in the following way.
We let Ui = W . If box i is empty then Vi = W − 1. Otherwise, we re-distribute
all balls in box i among the remaining boxes and let Vi count the number of empty
boxes ≠ i. By construction, both conditions of the Stein coupling are satisfied.
Moreover we have almost surely Vi ≤ Ui so that
$$\sum_{i=1}^n p_i\, \mathbb{E}|U_i - V_i| = \sum_{i=1}^n p_i\, \mathbb{E}[U_i - V_i] = \lambda^2 - \sum_{i=1}^n p_i\, \mathbb{E}[V_i].$$

By the fact that Vi ∼ W − 1|Xi = 1 and Bayes' rule,
$$\begin{aligned}
\sum_{i=1}^n p_i\, \mathbb{E}[V_i]
&= \sum_{i=1}^n \mathbb{P}[X_i = 1] \sum_{k=1}^n (k-1)\, \mathbb{P}[V_i = k-1] \\
&= \sum_{i=1}^n \sum_{k=1}^n (k-1)\, \mathbb{P}[W = k \mid X_i = 1]\, \mathbb{P}[X_i = 1] \\
&= \sum_{i=1}^n \sum_{k=1}^n (k-1)\, \mathbb{P}[X_i = 1 \mid W = k]\, \mathbb{P}[W = k].
\end{aligned}$$
Now we use the fact that $\mathbb{P}[X_i = 1 \mid W = k] = \mathbb{E}[X_i \mid W = k]$ because Xi is an
indicator variable. So the last line above is
$$\begin{aligned}
&= \sum_{i=1}^n \sum_{k=1}^n (k-1)\, \mathbb{E}[X_i \mid W = k]\, \mathbb{P}[W = k] \\
&= \sum_{k=1}^n (k-1)\, \mathbb{E}\Big[\sum_{i=1}^n X_i \,\Big|\, W = k\Big]\, \mathbb{P}[W = k] \\
&= \sum_{k=1}^n (k-1)\, k\, \mathbb{P}[W = k] \\
&= \mathbb{E}[W^2] - \mathbb{E}[W].
\end{aligned}$$

It remains to compute E[W^2]. We have by symmetry

$$E[W^2] = n\, E[X_1^2] + n(n-1)\, E[X_1 X_2] = \lambda + n(n-1) \Big( 1 - \frac{2}{n} \Big)^k,$$

so by Corollary 4.4.4

$$\|\mu - \pi\|_{\mathrm{TV}} \le (1 \wedge \lambda^{-1}) \bigg\{ n^2 \Big( 1 - \frac{1}{n} \Big)^{2k} - n(n-1) \Big( 1 - \frac{2}{n} \Big)^k \bigg\}.$$

When k = n \log n + Cn for instance, it can be checked that \|\mu - \pi\|_{\mathrm{TV}} = O(\log n / n). J
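A quick Monte Carlo sketch (our own code; the constants are illustrative) compares the empirical law of the number of empty boxes to Poi(λ) in the k = n log n regime:

```python
import math
import random

def empty_boxes(k, n, rng):
    # Throw k balls uniformly into n boxes; return the number of empty boxes.
    occupied = set(rng.randrange(n) for _ in range(k))
    return n - len(occupied)

def tv_empirical_vs_poisson(k, n, trials, seed=0):
    rng = random.Random(seed)
    counts = {}
    for _ in range(trials):
        w = empty_boxes(k, n, rng)
        counts[w] = counts.get(w, 0) + 1
    lam = n * (1 - 1 / n) ** k
    # TV distance between the empirical law of W and Poi(lambda) on {0,...,n}.
    tv = sum(abs(counts.get(z, 0) / trials
                 - math.exp(-lam) * lam ** z / math.factorial(z))
             for z in range(n + 1))
    return 0.5 * tv

n = 50
k = int(n * math.log(n))     # the k = n log n + Cn regime
tv = tv_empirical_vs_poisson(k, n, trials=10000)
assert tv < 0.15             # small, consistent with the O(log n / n) bound
```

The residual distance mixes the true total variation distance with Monte Carlo sampling error, so the tolerance above is deliberately loose.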

This last example is generalized in Exercise 4.22.


In special settings, one can give useful general bounds by constructing an ap-
propriate Stein coupling. We give an important example next. Recall that [n] =
{1, . . . , n}.

Theorem 4.4.8 (Chen-Stein method: dissociated case). Suppose that for each i there is a neighborhood N_i \subseteq [n] \setminus \{i\} such that

$$X_i \text{ is independent of } \{X_j : j \notin N_i \cup \{i\}\}.$$

Then

$$\|\mu - \pi\|_{\mathrm{TV}} \le (1 \wedge \lambda^{-1}) \sum_{i=1}^n \bigg\{ p_i^2 + \sum_{j \in N_i} \big( p_i p_j + E[X_i X_j] \big) \bigg\}.$$

Proof. We use the following Stein coupling. Let U_i = W. Then generate

$$(Y_j^{(i)})_{j \in N_i} \sim (X_j)_{j \in N_i} \,\big|\, \{X_k : k \notin N_i \cup \{i\}\},\ X_i = 1,$$

and set

$$V_i = \sum_{k \notin N_i \cup \{i\}} X_k + \sum_{j \in N_i} Y_j^{(i)}.$$

Because the law of \{X_k : k \notin N_i \cup \{i\}\} (and therefore of the first term in V_i) is independent of the event \{X_i = 1\}, the above scheme satisfies the conditions of the Stein coupling.

Hence we can apply Corollary 4.4.4. The construction of (U_i, V_i) guarantees that U_i - V_i depends only on "i and its neighborhood." Specifically, we get

\begin{align*}
\|\mu - \pi\|_{\mathrm{TV}}
&\le (1 \wedge \lambda^{-1}) \sum_{i=1}^n p_i\, E|U_i - V_i|\\
&= (1 \wedge \lambda^{-1}) \sum_{i=1}^n p_i\, E\bigg| \sum_{j=1}^n X_j - \sum_{k \notin N_i \cup \{i\}} X_k - \sum_{j \in N_i} Y_j^{(i)} \bigg|\\
&= (1 \wedge \lambda^{-1}) \sum_{i=1}^n p_i\, E\bigg| X_i + \sum_{j \in N_i} \big( X_j - Y_j^{(i)} \big) \bigg|\\
&\le (1 \wedge \lambda^{-1}) \sum_{i=1}^n p_i \bigg\{ E|X_i| + \sum_{j \in N_i} \big( E|X_j| + E|Y_j^{(i)}| \big) \bigg\},
\end{align*}

where we used the triangle inequality. Recalling that p_i = P[X_i = 1] = E[X_i] = E|X_i| and the definition of Y_j^{(i)}, the last expression above is

\begin{align*}
&= (1 \wedge \lambda^{-1}) \sum_{i=1}^n p_i \bigg\{ p_i + \sum_{j \in N_i} \big( p_j + E[X_j \mid X_i = 1] \big) \bigg\}\\
&= (1 \wedge \lambda^{-1}) \sum_{i=1}^n \bigg\{ p_i^2 + \sum_{j \in N_i} \big( p_i p_j + p_i\, E[X_j \mid X_i = 1] \big) \bigg\}\\
&= (1 \wedge \lambda^{-1}) \sum_{i=1}^n \bigg\{ p_i^2 + \sum_{j \in N_i} \big( p_i p_j + E[X_i X_j] \big) \bigg\}.
\end{align*}

That concludes the proof.
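Once the p_i's, the neighborhoods N_i, and the joint moments E[X_i X_j] are known, the bound of Theorem 4.4.8 is mechanical to evaluate. A small helper sketch (our own code and naming; with empty neighborhoods it recovers the independent-case bound of Claim 4.4.6):

```python
def dissociated_bound(ps, neighborhoods, cross_moments):
    """Chen-Stein bound of Theorem 4.4.8 (dissociated case).

    ps[i] = P[X_i = 1]; neighborhoods[i] is the set N_i;
    cross_moments[(i, j)] = E[X_i X_j] for j in N_i.
    """
    lam = sum(ps)
    total = 0.0
    for i, p in enumerate(ps):
        total += p * p
        for j in neighborhoods[i]:
            total += p * ps[j] + cross_moments[(i, j)]
    return min(1.0, 1.0 / lam) * total

# Independent case: empty neighborhoods reduce to Claim 4.4.6.
ps = [0.2, 0.1, 0.3]
b = dissociated_bound(ps, [set(), set(), set()], {})
assert abs(b - sum(p * p for p in ps)) < 1e-12
```
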

Next we give an example of the previous theorem.

Example 4.4.9 (Longest head run). Let 0 < q < 1 and let Z_1, Z_2, \ldots be i.i.d. Bernoulli random variables with success probability q = P[Z_i = 1]. We are interested in the distribution of R, the length of the longest run of 1s starting in the first n tosses. For any positive integer t, let X_1^{(t)} := Z_1 \cdots Z_t and

$$X_i^{(t)} := (1 - Z_{i-1}) Z_i \cdots Z_{i+t-1}, \qquad i \ge 2.$$

The event \{X_i^{(t)} = 1\} indicates that a head run of length at least t starts at the i-th toss. Now define

$$W^{(t)} := \sum_{i=1}^n X_i^{(t)}.$$

The key observation is that

$$\{R < t\} = \{W^{(t)} = 0\}. \tag{4.4.8}$$

Notice that, for fixed t, the X_i^{(t)}'s are neither independent nor identically distributed. However, they exhibit a natural neighborhood structure as in Theorem 4.4.8. Indeed let

$$N_i^{(t)} := \{\alpha \in [n] : |\alpha - i| \le t\} \setminus \{i\}.$$

Then X_i^{(t)} is independent of \{X_j^{(t)} : j \notin N_i^{(t)} \cup \{i\}\}. For example,

$$X_i^{(t)} = (1 - Z_{i-1}) Z_i \cdots Z_{i+t-1}$$

and

$$X_{i+t+1}^{(t)} = (1 - Z_{i+t}) Z_{i+t+1} \cdots Z_{i+2t}$$

do not depend on any common Z_j, while X_i^{(t)} and

$$X_{i+t}^{(t)} = (1 - Z_{i+t-1}) Z_{i+t} \cdots Z_{i+2t-1}$$

both depend on Z_{i+t-1}.

We compute the quantities needed to apply Theorem 4.4.8. We have

$$p_1^{(t)} := E[Z_1 \cdots Z_t] = \prod_{j=1}^t E[Z_j] = q^t,$$

and, for i \ge 2,

$$p_i^{(t)} := E[(1 - Z_{i-1}) Z_i \cdots Z_{i+t-1}] = E[1 - Z_{i-1}] \prod_{j=i}^{i+t-1} E[Z_j] = (1-q) q^t \le q^t.$$

For i \ge 1 and j \in N_i^{(t)}, observe that a head run of length at least t cannot start simultaneously at i and j. So E[X_i^{(t)} X_j^{(t)}] = 0 in that case. We also have

$$\lambda^{(t)} := E[W^{(t)}] = q^t + (n-1)(1-q) q^t \in [n(1-q)q^t,\ n q^t],$$

and

$$|N_i^{(t)}| \le 2t.$$

We are ready to apply Theorem 4.4.8. Letting \mu^{(t)} denote the law of W^{(t)} and \pi^{(t)} the Poisson distribution with mean \lambda^{(t)}, we get

\begin{align*}
\|\mu^{(t)} - \pi^{(t)}\|_{\mathrm{TV}}
&\le (1 \wedge (\lambda^{(t)})^{-1}) \sum_{i=1}^n \bigg\{ (p_i^{(t)})^2 + \sum_{j \in N_i^{(t)}} \Big( p_i^{(t)} p_j^{(t)} + E[X_i^{(t)} X_j^{(t)}] \Big) \bigg\}\\
&\le (1 \wedge (n(1-q)q^t)^{-1}) \big\{ n q^{2t} + 2t\, n q^{2t} \big\}\\
&\le \frac{1}{(1-q)n}\, (1 \wedge (n q^t)^{-1})\, [2t+1]\, (n q^t)^2.
\end{align*}

This bound is non-asymptotic: it holds for any q, n, t. One special regime of note is t = \log_{1/q} n + C with large n. In that case, we have n q^t \to C' as n \to +\infty for some 0 < C' < +\infty and the total variation distance above is of order O(\log n / n). Going back to (4.4.8), we finally obtain when t = \log_{1/q} n + C that

$$\Big| P[R < t] - e^{-\lambda^{(t)}} \Big| = O\bigg( \frac{\log n}{n} \bigg),$$

where recall that R and \lambda^{(t)} implicitly depend on n. J
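The approximation P[R < t] ≈ e^{-λ^{(t)}} can be checked by simulation. The sketch below (our own code; the choice C = 2 and the trial count are arbitrary) estimates P[R < t] for a fair coin:

```python
import math
import random

def longest_run_starting_in_first_n(bits, n):
    # Length of the longest run of 1s that starts within the first n tosses.
    best = 0
    for i in range(n):
        if bits[i] == 1 and (i == 0 or bits[i - 1] == 0):  # a run starts at i
            j = i
            while j < len(bits) and bits[j] == 1:
                j += 1
            best = max(best, j - i)
    return best

def estimate_prob_R_less_than_t(q, n, t, trials, seed=0):
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Extra tosses beyond n so runs starting near n can be completed.
        bits = [1 if rng.random() < q else 0 for _ in range(n + 10 * t)]
        if longest_run_starting_in_first_n(bits, n) < t:
            hits += 1
    return hits / trials

q, n = 0.5, 200
t = int(math.log(n, 1 / q)) + 2          # t = log_{1/q} n + C with C = 2
lam = q ** t + (n - 1) * (1 - q) * q ** t
est = estimate_prob_R_less_than_t(q, n, t, trials=10000)
assert abs(est - math.exp(-lam)) < 0.05
```
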

4.4.2 Some motivation and proof


The idea behind the Chen-Stein method is to interpolate between µ and π in Theo-
rem 4.4.2 by constructing a Markov chain with initial distribution µ and stationary
distribution π. Here we use a discrete-time, finite Markov chain.
CHAPTER 4. COUPLING 309

Proof of Theorem 4.4.2. We seek a bound on

$$\|\mu - \pi\|_{\mathrm{TV}} = \sup_{A \subseteq \mathbb{Z}_+} |\mu(A) - \pi(A)| = \mu(A^*) - \pi(A^*) = \sum_{z \in A^*} (\mu(z) - \pi(z)), \tag{4.4.9}$$

where A^* = \{z \in \mathbb{Z}_+ : \mu(z) > \pi(z)\}, by Lemma 4.1.15. Since W \le n almost surely, \mu(z) = 0 for all z > n, which implies that A^* \subseteq \{0, 1, \ldots, n\}. In particular, it will suffice to bound \mu(z) - \pi(z) for 0 \le z \le n. We also assume \lambda < n (the case \lambda = n being uninteresting).

Constructing the Markov chain. It will be convenient to truncate \pi at n, that is, we define

$$\bar{\pi}(z) = \begin{cases} \pi(z), & 0 \le z \le n,\\ 1 - \Pi(n), & z = n+1,\\ 0, & \text{otherwise}, \end{cases}$$

where \Pi(z) = \sum_{w \le z} \pi(w) is the cumulative distribution function of the Poisson distribution with mean \lambda. We construct a Markov chain with stationary distribution \bar{\pi}. We will also need the chain to be aperiodic and irreducible over \{0, 1, \ldots, n+1\}.

We choose the transition matrix (P(x, y))_{0 \le x, y \le n+1} to be that of a birth-death chain reversible with respect to \bar{\pi}, that is, we require P(x, y) = 0 unless |x - y| \le 1 and

$$\frac{P(x, x+1)}{P(x+1, x)} = \frac{\bar{\pi}(x+1)}{\bar{\pi}(x)}, \qquad \forall x \in [n]. \tag{4.4.10}$$
For x < n,

$$\frac{\bar{\pi}(x+1)}{\bar{\pi}(x)} = \frac{\pi(x+1)}{\pi(x)} = \frac{e^{-\lambda} \lambda^{x+1} / (x+1)!}{e^{-\lambda} \lambda^x / x!} = \frac{\lambda}{x+1}.$$

In view of this, we want P(x, x+1) \propto \lambda and P(x, x-1) \propto x. We choose the proportionality constant to ensure that all transition probabilities are in [0, 1]. Specifically, for x \ne y, the nonzero transition probabilities take values

$$P(x, y) = \begin{cases} \frac{1}{2n} \lambda, & \text{if } 0 \le x \le n,\ y = x+1,\\[1mm] \frac{1}{2n} x, & \text{if } 1 \le x \le n,\ y = x-1,\\[1mm] \frac{1}{2n} \lambda\, \frac{\pi(n)}{1 - \Pi(n)}, & \text{if } x = n+1,\ y = n. \end{cases} \tag{4.4.11}$$

1 1 1
The probability of staying put is: 1 − 2n λ if x = 0, 1 − 2n x − 2n λ if 1 ≤ x ≤ n,
1 π(n)
and 1− 2n λ 1−Π(n) if x = n+1. Those are all strictly positive when λ < n. Hence
by construction P is aperiodic and irreducible, and it satisfies the detailed balance
conditions (4.4.10).
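The truncated birth-death chain is concrete enough to write down in full. The following sketch (our own code) builds P from (4.4.11) for given λ < n and checks the detailed balance conditions (4.4.10), stationarity of π̄, and the strictly positive holding probabilities:

```python
import math

def truncated_poisson(lam, n):
    # pi-bar: Poi(lam) truncated at n, with the tail mass placed at n+1.
    pi = [math.exp(-lam) * lam ** z / math.factorial(z) for z in range(n + 1)]
    return pi + [1.0 - sum(pi)]

def birth_death_P(lam, n):
    pibar = truncated_poisson(lam, n)
    m = n + 2
    P = [[0.0] * m for _ in range(m)]
    for x in range(n + 1):
        P[x][x + 1] = lam / (2 * n)            # birth
        if x >= 1:
            P[x][x - 1] = x / (2 * n)          # death
    P[n + 1][n] = (lam / (2 * n)) * pibar[n] / pibar[n + 1]
    for x in range(m):                         # staying-put probabilities
        P[x][x] = 1.0 - sum(P[x])
    return P, pibar

lam, n = 2.0, 10
P, pibar = birth_death_P(lam, n)
# Detailed balance: pibar(x) P(x, x+1) = pibar(x+1) P(x+1, x).
for x in range(n + 1):
    assert abs(pibar[x] * P[x][x + 1] - pibar[x + 1] * P[x + 1][x]) < 1e-12
# Stationarity: pibar P = pibar.
for y in range(n + 2):
    assert abs(sum(pibar[x] * P[x][y] for x in range(n + 2)) - pibar[y]) < 1e-12
# Laziness (hence aperiodicity): strictly positive holding probabilities.
assert all(P[x][x] > 0 for x in range(n + 2))
```
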
Recalling (3.3.6), the Laplacian is

\begin{align*}
\Delta f(x) &= \sum_y P(x, y) [f(y) - f(x)]\\
&= P(x, x+1)[f(x+1) - f(x)] - P(x, x-1)[f(x) - f(x-1)]\\
&= \lambda g(x+1) - x g(x),
\end{align*}

for 0 \le x \le n, where we defined

$$g(x) := \frac{f(x) - f(x-1)}{2n}, \qquad x \in \{1, \ldots, n+1\}, \tag{4.4.12}$$

and g(0) is arbitrary. At x = n+1,

$$\Delta f(n+1) = -\lambda\, \frac{\pi(n)}{1 - \Pi(n)}\, g(n+1).$$

It is a standard fact (see Exercise 4.19) that the expectation of the Laplacian
under the stationary distribution is 0. Inverting the relationship (4.4.12), for any
g : {0, . . . , n + 1} → R, there is a corresponding f (unique up to an additive
constant). So we have shown that if Z ∼ π̄ then
 
π(n)
E (λg(Z + 1) − Zg(Z))1{Z≤n} − λn g(Z)1{Z=n+1} = 0,
1 − Π(n)

that is,  
E (λg(Z + 1) − Zg(Z))1{Z≤n} = λnπ(n)g(n + 1).
Notice that, if g is extended to a bounded function on Z+ , λ is fixed and Z ∼
Poi(λ), then taking n → +∞ recovers Theorem 4.4.1 by dominated convergence
(Proposition B.4.14).*

Markov chains calculations. By the convergence theorem for Markov chains (Theorem 1.1.33),

$$P^t(y, z) \to \bar{\pi}(z)$$
* The above argument is more natural in the setting of continuous-time Markov chains, but we will not introduce them here.

for all 0 \le y \le n+1 and 0 \le z \le n+1 as t \to +\infty. Hence, letting \delta_z(x) = 1_{\{x = z\}}, by telescoping

$$\delta_z(y) - \bar{\pi}(z) = \lim_{t \to +\infty} E_y[\delta_z(X_0) - \delta_z(X_t)] = \lim_{t \to +\infty} \sum_{s=0}^{t-1} E_y[\delta_z(X_s) - \delta_z(X_{s+1})], \tag{4.4.13}$$

where the subscript of E indicates the initial state. We will later take expectations over \mu and use the fact that \pi(z) = \bar{\pi}(z) on \{0, 1, \ldots, n\} to interpolate between \mu and \pi.

First, we use standard Markov chains facts to compute (4.4.13). Define for y \in \{1, \ldots, n+1\}

$$g_z^t(y) := \frac{1}{2n} \sum_{s=0}^{t-1} \big( E_y[\delta_z(X_s)] - E_{y-1}[\delta_z(X_s)] \big), \tag{4.4.14}$$

and g_z^t(0) := 0. The function g_z^t(y) is, up to a factor (whose purpose will be clearer below), the difference between the expected number of visits to z up to time t-1 when started at y and y-1, respectively. It depends on \mu only through \lambda and n. By Chapman-Kolmogorov (Theorem 1.1.20) applied to the first step of the chain,

$$E_y[\delta_z(X_{s+1})] = P(y, y+1)\, E_{y+1}[\delta_z(X_s)] + P(y, y)\, E_y[\delta_z(X_s)] + P(y, y-1)\, E_{y-1}[\delta_z(X_s)].$$

Using that P(y, y+1) + P(y, y) + P(y, y-1) = 1 and rearranging, we get for 0 \le y \le n and 0 \le z \le n+1

\begin{align*}
\sum_{s=0}^{t-1} E_y[\delta_z(X_s) - \delta_z(X_{s+1})]
&= -\sum_{s=0}^{t-1} \Big( P(y, y+1) \big( E_{y+1}[\delta_z(X_s)] - E_y[\delta_z(X_s)] \big)\\
&\qquad\qquad - P(y, y-1) \big( E_y[\delta_z(X_s)] - E_{y-1}[\delta_z(X_s)] \big) \Big)\\
&= -2n P(y, y+1)\, g_z^t(y+1) + 2n P(y, y-1)\, g_z^t(y)\\
&= -\lambda g_z^t(y+1) + y g_z^t(y), \tag{4.4.15}
\end{align*}
where we used (4.4.11) on the last line.
We establish after the proof of the theorem that gzt (y) has a well-defined limit.
That fact is not immediately obvious as the limit is the “difference of two infinities.”
But a simple coupling argument does the trick.

Lemma 4.4.10. Let g_z^t : \{0, 1, \ldots, n+1\} \to \mathbb{R} be defined as in (4.4.14). Then there exists a bounded function g_z^\infty : \{0, 1, \ldots, n+1\} \to \mathbb{R} such that, for all 0 \le z \le n+1 and 0 \le y \le n+1,

$$g_z^\infty(y) = \lim_{t \to +\infty} g_z^t(y).$$

In fact, an explicit expression for g_z^\infty can be derived via the following recursion. That expression will be helpful to establish the Lipschitz condition in Theorem 4.4.2.

Lemma 4.4.11. For all 0 \le y \le n and 0 \le z \le n+1,

$$\delta_z(y) - \bar{\pi}(z) = -\lambda g_z^\infty(y+1) + y g_z^\infty(y).$$

Proof. Combine (4.4.13), (4.4.15), and Lemma 4.4.10.

Lemma 4.4.11 leads to the following formula for g_z^\infty, which we establish after the proof of the theorem.

Lemma 4.4.12. For 1 \le y \le n+1 and 0 \le z \le n+1,

$$g_z^\infty(y) = \begin{cases} \dfrac{\Pi(y-1)}{y\,\pi(y)}\, \bar{\pi}(z), & \text{if } z \ge y,\\[2mm] -\dfrac{1 - \Pi(y-1)}{y\,\pi(y)}\, \bar{\pi}(z), & \text{if } z < y, \end{cases} \tag{4.4.16}$$

and g_z^\infty(0) = 0.

Interpolating between \mu and \pi. For A \subseteq \{0, 1, \ldots, n\}, define

$$g_A^\infty(y) := \sum_{z \in A} g_z^\infty(y).$$

We obtain the following key bound.

Lemma 4.4.13 (Chen's equation). Let W \sim \mu and \pi = \mathrm{Poi}(\lambda). Then

$$\|\mu - \pi\|_{\mathrm{TV}} = E\big[ -\lambda g_{A^*}^\infty(W+1) + W g_{A^*}^\infty(W) \big], \tag{4.4.17}$$

where A^* = \{z \in \mathbb{Z}_+ : \mu(z) > \pi(z)\}.

Proof. Fix z \in \{0, 1, \ldots, n\}. Multiplying both sides in Lemma 4.4.11 by \mu(y) and summing over y in \{0, 1, \ldots, n\} gives

$$\mu(z) - \pi(z) = E\big[ -\lambda g_z^\infty(W+1) + W g_z^\infty(W) \big].$$

Now summing over z in A^* \subseteq \{0, 1, \ldots, n\} and using (4.4.9) gives the claim.

Lemma 4.4.12 can be used to derive a Lipschitz constant for g_A^\infty. That lemma is also established after the proof of the theorem.

Lemma 4.4.14. For A \subseteq \{0, 1, \ldots, n\} and y, y' \in \{0, 1, \ldots, n+1\},

$$|g_A^\infty(y') - g_A^\infty(y)| \le (1 \wedge \lambda^{-1})\, |y' - y|.$$

Lemmas 4.4.13 and 4.4.14 imply the theorem with h := g_{A^*}^\infty.

Proofs of technical lemmas It remains to prove Lemmas 4.4.10, 4.4.12 and


4.4.14.

Proof of Lemma 4.4.10. We use a coupling argument. Let (Y_s, \tilde{Y}_s)_{s=0}^{+\infty} be an independent Markovian coupling of (Y_s), the chain started at y-1, and (\tilde{Y}_s), the chain started at y. Let \tau be the first time s that Y_s = \tilde{Y}_s. Because Y_s and \tilde{Y}_s are independent and P is a birth-death chain with strictly positive nearest-neighbor and staying-put transition probabilities, the coupled chain (Y_s, \tilde{Y}_s)_{s=0}^{+\infty} is aperiodic and irreducible over \{0, 1, \ldots, n+1\}^2. By the exponential tail of hitting times, Lemma 3.1.25, it holds that E[\tau] < +\infty.

Modify the coupling (Y_s, \tilde{Y}_s) to enforce \tilde{Y}_s = Y_s for all s \ge \tau (while not changing (Y_s)), that is, to make it coalescing. By the Strong Markov property (Theorem 3.1.8), the resulting chain (Y_s^*, \tilde{Y}_s^*) is also a Markovian coupling of the chain started at y-1 and y, respectively. Using this coupling, we rewrite

\begin{align*}
g_z^t(y) &= \frac{1}{2n} \sum_{s=0}^{t-1} \big( E_y[\delta_z(X_s)] - E_{y-1}[\delta_z(X_s)] \big)\\
&= \frac{1}{2n} \sum_{s=0}^{t-1} E\big[ \delta_z(\tilde{Y}_s^*) - \delta_z(Y_s^*) \big]\\
&= \frac{1}{2n}\, E\bigg[ \sum_{s=0}^{t-1} \big( \delta_z(\tilde{Y}_s^*) - \delta_z(Y_s^*) \big) \bigg].
\end{align*}

The random variable inside the expectation is bounded in absolute value by

$$\bigg| \sum_{s=0}^{t-1} \big( \delta_z(\tilde{Y}_s^*) - \delta_z(Y_s^*) \big) \bigg| \le \tau,$$

uniformly in t. Indeed, after s = \tau the terms in the sum are 0, while before s = \tau the terms are bounded by 1 in absolute value. By the integrability of \tau, the dominated convergence theorem (Proposition B.4.14) allows us to take the limit, leading to

$$g_z^\infty(y) = \lim_{t \to +\infty} \frac{1}{2n}\, E\bigg[ \sum_{s=0}^{t-1} \big( \delta_z(\tilde{Y}_s^*) - \delta_z(Y_s^*) \big) \bigg] = \frac{1}{2n}\, E\bigg[ \sum_{s=0}^{+\infty} \big( \delta_z(\tilde{Y}_s^*) - \delta_z(Y_s^*) \big) \bigg] < +\infty.$$

That concludes the proof.

Proof of Lemma 4.4.12. Our starting point is Lemma 4.4.11, from which we deduce the recursive formula

$$g_z^\infty(y+1) = \frac{1}{\lambda} \big\{ y g_z^\infty(y) + \pi(z) - \delta_z(y) \big\}, \tag{4.4.18}$$

for 0 \le y \le n and 0 \le z \le n.

We guess a general formula and then check it. By (4.4.18),

$$g_z^\infty(1) = \frac{1}{\lambda} \{\pi(z) - \delta_z(0)\}, \tag{4.4.19}$$

\begin{align*}
g_z^\infty(2) &= \frac{1}{\lambda} \big\{ g_z^\infty(1) + \pi(z) - \delta_z(1) \big\}\\
&= \frac{1}{\lambda} \Big\{ \frac{1}{\lambda} \{\pi(z) - \delta_z(0)\} + \pi(z) - \delta_z(1) \Big\}\\
&= \frac{1}{\lambda^2} \{\pi(z) - \delta_z(0)\} + \frac{1}{\lambda} \{\pi(z) - \delta_z(1)\},
\end{align*}

\begin{align*}
g_z^\infty(3) &= \frac{1}{\lambda} \big\{ 2 g_z^\infty(2) + \pi(z) - \delta_z(2) \big\}\\
&= \frac{1}{\lambda} \Big\{ \frac{2}{\lambda^2} \{\pi(z) - \delta_z(0)\} + \frac{2}{\lambda} \{\pi(z) - \delta_z(1)\} + \pi(z) - \delta_z(2) \Big\}\\
&= \frac{2}{\lambda^3} \{\pi(z) - \delta_z(0)\} + \frac{2}{\lambda^2} \{\pi(z) - \delta_z(1)\} + \frac{1}{\lambda} \{\pi(z) - \delta_z(2)\},
\end{align*}

and so forth. We posit the general formula

$$g_z^\infty(y) = \frac{(y-1)!}{\lambda^y} \sum_{k=0}^{y-1} \frac{\lambda^k}{k!} \{\pi(z) - \delta_z(k)\}, \tag{4.4.20}$$

for 1 \le y \le n+1 and 0 \le z \le n.

The formula is straightforward to confirm by induction. Indeed it holds for y = 1, as can be seen in (4.4.19) (recalling that 0! = 1 by convention), and, assuming it holds for y, we have by (4.4.18)

\begin{align*}
g_z^\infty(y+1) &= \frac{1}{\lambda} \big\{ y g_z^\infty(y) + \pi(z) - \delta_z(y) \big\}\\
&= \frac{1}{\lambda} \bigg\{ y\, \frac{(y-1)!}{\lambda^y} \sum_{k=0}^{y-1} \frac{\lambda^k}{k!} \{\pi(z) - \delta_z(k)\} + \pi(z) - \delta_z(y) \bigg\}\\
&= \frac{y!}{\lambda^{y+1}} \sum_{k=0}^{y-1} \frac{\lambda^k}{k!} \{\pi(z) - \delta_z(k)\} + \frac{1}{\lambda} \{\pi(z) - \delta_z(y)\}\\
&= \frac{y!}{\lambda^{y+1}} \sum_{k=0}^{y} \frac{\lambda^k}{k!} \{\pi(z) - \delta_z(k)\},
\end{align*}

as desired.

We rewrite (4.4.20) according to whether the term \delta_z(y) = 1\{z = y\} plays a role in the equation. For z \ge y > 0, the equation simplifies to

$$g_z^\infty(y) = \frac{(y-1)!}{\lambda^y} \sum_{k=0}^{y-1} \frac{\lambda^k}{k!}\, \pi(z) = \frac{1}{y}\, \frac{y!}{e^{-\lambda}\lambda^y} \sum_{k=0}^{y-1} \frac{e^{-\lambda}\lambda^k}{k!}\, \pi(z) = \frac{\Pi(y-1)}{y\,\pi(y)}\, \pi(z).$$

For 0 \le z < y, we get instead

\begin{align*}
g_z^\infty(y) &= \frac{(y-1)!}{\lambda^y} \bigg\{ \bigg( \sum_{k=0}^{y-1} \frac{\lambda^k}{k!} \bigg) \pi(z) - \frac{\lambda^z}{z!} \bigg\}\\
&= \frac{1}{y}\, \frac{y!}{e^{-\lambda}\lambda^y} \bigg\{ \bigg( \sum_{k=0}^{y-1} \frac{e^{-\lambda}\lambda^k}{k!} \bigg) \pi(z) - \pi(z) \bigg\}\\
&= \frac{\Pi(y-1) - 1}{y\,\pi(y)}\, \pi(z).
\end{align*}

The cases z = n+1 are analogous. That concludes the proof.
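The recursion (4.4.18) and the closed form (4.4.16) can be checked against one another numerically, at least for z ≤ n where π̄(z) = π(z). A sketch (our own code; the values of λ and n are arbitrary):

```python
import math

def poi(lam, z):
    # Poisson(lam) probability mass at z.
    return math.exp(-lam) * lam ** z / math.factorial(z)

def g_recursive(lam, n, z):
    # g_z^infty via the recursion (4.4.18), starting from g_z^infty(0) = 0.
    g = [0.0] * (n + 2)
    for y in range(n + 1):
        delta = 1.0 if y == z else 0.0
        g[y + 1] = (y * g[y] + poi(lam, z) - delta) / lam
    return g

def g_closed(lam, n, z, y):
    # Closed form (4.4.16) for 1 <= y <= n+1 and z <= n.
    Pi = sum(poi(lam, k) for k in range(y))     # Pi(y-1)
    if z >= y:
        return Pi / (y * poi(lam, y)) * poi(lam, z)
    return -(1.0 - Pi) / (y * poi(lam, y)) * poi(lam, z)

lam, n = 1.5, 8
for z in range(n + 1):
    g = g_recursive(lam, n, z)
    for y in range(1, n + 2):
        assert abs(g[y] - g_closed(lam, n, z, y)) < 1e-9
```
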



Proof of Lemma 4.4.14. It suffices to prove that, for A \subseteq \{0, 1, \ldots, n\} and y \in \{0, 1, \ldots, n\},

$$|g_A^\infty(y+1) - g_A^\infty(y)| \le (1 \wedge \lambda^{-1}), \tag{4.4.21}$$

and then use the triangle inequality.

We start with the case y \ge 1. We use the expression derived in Lemma 4.4.12. For 1 \le y < z \le n+1,

\begin{align*}
g_z^\infty(y+1) - g_z^\infty(y)
&= \frac{\Pi(y)}{(y+1)\pi(y+1)}\, \bar{\pi}(z) - \frac{\Pi(y-1)}{y\,\pi(y)}\, \bar{\pi}(z)\\
&= \frac{\bar{\pi}(z)}{y\,\pi(y)} \Big\{ \frac{y}{\lambda}\, \Pi(y) - \Pi(y-1) \Big\},
\end{align*}

where we used that \pi(y+1)/\pi(y) = \lambda/(y+1). We show that the expression in curly brackets is non-negative. Indeed, taking out the term k' = 0 in the first sum below and changing variables, we get

\begin{align*}
\frac{y}{\lambda} \sum_{k'=0}^{y} \frac{e^{-\lambda} \lambda^{k'}}{(k')!} - \sum_{k=0}^{y-1} \frac{e^{-\lambda} \lambda^{k}}{k!}
&= \frac{y}{\lambda}\, e^{-\lambda} + \sum_{k=0}^{y-1} \frac{e^{-\lambda} \lambda^{(k+1)-1}}{(k+1)!/y} - \sum_{k=0}^{y-1} \frac{e^{-\lambda} \lambda^{k}}{k!}\\
&\ge \frac{y}{\lambda}\, e^{-\lambda} + \sum_{k=0}^{y-1} \frac{e^{-\lambda} \lambda^{k}}{k!} - \sum_{k=0}^{y-1} \frac{e^{-\lambda} \lambda^{k}}{k!}\\
&\ge 0.
\end{align*}

So g_z^\infty(y+1) - g_z^\infty(y) \ge 0 for 1 \le y < z. A similar calculation, which we omit, shows that the same inequality holds for z < y \le n. The cases y = 0, which are analogous, are detailed below.

For notational convenience, it will be helpful to define g_z^\infty(n+2) = 0 for all z. Then, for y = n+1 and z \le n, we get

$$g_z^\infty(n+2) - g_z^\infty(n+1) = 0 + \frac{1 - \Pi(n)}{(n+1)\pi(n+1)}\, \pi(z) \ge 0.$$

Moreover, by telescoping,

$$0 = g_z^\infty(n+2) - g_z^\infty(0) = \sum_{y=0}^{n+1} \big\{ g_z^\infty(y+1) - g_z^\infty(y) \big\}.$$

We have argued that all the terms in this last sum are non-negative, with the sole exception of the term y = z. Hence, for a fixed 0 \le z \le n, it must be that the maximum of |g_z^\infty(y+1) - g_z^\infty(y)| is achieved at y = z. The case z = n+1 is analogous. By definition of g_z^\infty, for a fixed 0 \le y \le n, it holds that \sum_z \{g_z^\infty(y+1) - g_z^\infty(y)\} = 0 and the maximum of |g_A^\infty(y+1) - g_A^\infty(y)| over A \subseteq \{0, 1, \ldots, n\} is achieved at A = \{y\}. It remains to bound that last case.

We have, using \pi(y+1)/\pi(y) = \lambda/(y+1) again, that for 1 \le y \le n

\begin{align*}
|g_y^\infty(y+1) - g_y^\infty(y)|
&= \bigg| -\frac{1 - \Pi(y)}{(y+1)\pi(y+1)}\, \pi(y) - \frac{\Pi(y-1)}{y\,\pi(y)}\, \pi(y) \bigg|\\
&= \frac{1}{\lambda} \sum_{k \ge y+1} \frac{e^{-\lambda} \lambda^k}{k!} + \frac{1}{y} \sum_{k=0}^{y-1} \frac{e^{-\lambda} \lambda^k}{k!}\\
&= \frac{e^{-\lambda}}{\lambda} \bigg\{ \sum_{k'=1}^{y} \frac{\lambda^{k'} k'}{(k')!\, y} + \sum_{k \ge y+1} \frac{\lambda^k}{k!} \bigg\}\\
&\le \frac{e^{-\lambda}}{\lambda} \bigg\{ \sum_{k \ge 1} \frac{\lambda^k}{k!} \bigg\}\\
&= \frac{e^{-\lambda}}{\lambda} \big\{ e^{\lambda} - 1 \big\}\\
&= \frac{1 - e^{-\lambda}}{\lambda}.
\end{align*}

For \lambda \ge 1, we have \frac{1 - e^{-\lambda}}{\lambda} \le \frac{1}{\lambda} = (1 \wedge \lambda^{-1}), while for 0 < \lambda < 1 we have \frac{1 - e^{-\lambda}}{\lambda} \le \frac{\lambda}{\lambda} = 1 = (1 \wedge \lambda^{-1}) by Exercise 1.16.

It remains to consider the cases y = 0. Recall that g_z^\infty(0) = 0. By Lemma 4.4.12, for z \ge 1,

$$g_z^\infty(1) - g_z^\infty(0) = g_z^\infty(1) = \frac{\Pi(0)}{\pi(1)}\, \bar{\pi}(z) \ge 0.$$

And

$$g_0^\infty(1) - g_0^\infty(0) = g_0^\infty(1) = -\frac{1 - \Pi(0)}{\pi(1)}\, \pi(0) = -\frac{1 - e^{-\lambda}}{\lambda}.$$

So we have established (4.4.21), and that concludes the proof.
CHAPTER 4. COUPLING 318

4.4.3 Random graphs: clique number at the threshold in the Erdős-Rényi model

We revisit the subgraph containment problem of Section 2.3.2 (and Section 4.2.4). Let G_n \sim G_{n,p_n} be an Erdős-Rényi graph with n vertices and density p_n. Let \omega(G) be the clique number of a graph G, that is, the size of its largest clique. We showed previously that the property \omega(G) \ge 4 has threshold function n^{-2/3}. Here we consider what happens when

$$p_n = C n^{-2/3},$$

for some constant C > 0. We use the Chen-Stein method in the form of Theorem 4.4.8.

For an enumeration S_1, \ldots, S_m of the 4-tuples of vertices in G_n, let A_1, \ldots, A_m be the events that the corresponding 4-cliques are present, and define Z_i = 1_{A_i}. Then W = \sum_{i=1}^m Z_i is the number of 4-cliques in G_n. We argued previously (see Claim 2.3.4) that

$$q_i := E[Z_i] = p_n^6,$$

and

$$\lambda := E[W] = \binom{n}{4} p_n^6.$$

In our regime of interest, \lambda is of constant order.

Observe that the Z_i's are not independent because the 4-tuples may share potential edges. However, they admit a neighborhood structure as in Theorem 4.4.8. Specifically, for i = 1, \ldots, m, define

$$N_i = \{j : S_i \text{ and } S_j \text{ share at least two vertices}\} \setminus \{i\}.$$

Then the conditions of Theorem 4.4.8 are satisfied, that is, Z_i is independent of \{Z_j : j \notin N_i \cup \{i\}\}. We argued previously (again see Claim 2.3.4) that

$$|N_i| = \binom{4}{3}(n-4) + \binom{4}{2}\binom{n-4}{2} = \Theta(n^2),$$

where the first term counts the number of S_j's sharing exactly three vertices with S_i, in which case E[Z_i Z_j] = p_n^9, and the second term counts those sharing two, in which case E[Z_i Z_j] = p_n^{11}.

We are ready to apply the bound in Theorem 4.4.8. Let \pi be the Poisson distribution with mean \lambda. Using the formulas above, we get when p_n = C n^{-2/3}

\begin{align*}
\|\mu - \pi\|_{\mathrm{TV}}
&\le (1 \wedge \lambda^{-1}) \sum_{i=1}^m \bigg\{ q_i^2 + \sum_{j \in N_i} \big( q_i q_j + E[Z_i Z_j] \big) \bigg\}\\
&\le (1 \wedge \lambda^{-1}) \binom{n}{4} \bigg\{ p_n^{12} + \binom{4}{3}(n-4)\big( p_n^{12} + p_n^{9} \big) + \binom{4}{2}\binom{n-4}{2}\big( p_n^{12} + p_n^{11} \big) \bigg\}\\
&= (1 \wedge \lambda^{-1})\, \Theta\big( n^4 p_n^{12} + n^5 p_n^{9} + n^6 p_n^{11} \big)\\
&= (1 \wedge \lambda^{-1})\, \Theta\big( n^4 n^{-8} + n^5 n^{-6} + n^6 n^{-22/3} \big)\\
&= (1 \wedge \lambda^{-1})\, \Theta(n^{-1}),
\end{align*}

which goes to 0 as n \to +\infty.
See Exercise 4.21 for an improved bound.
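For small n, the Poisson approximation for the number of 4-cliques can be checked by brute-force simulation. A sketch (our own code; n = 20 is chosen only to keep the enumeration fast) compares the empirical P[W = 0] to e^{-λ}:

```python
import math
import random
from itertools import combinations

def count_4_cliques(n, p, rng):
    # Sample G(n, p) and count 4-cliques by brute-force enumeration.
    adj = [[False] * n for _ in range(n)]
    for u, v in combinations(range(n), 2):
        if rng.random() < p:
            adj[u][v] = adj[v][u] = True
    count = 0
    for S in combinations(range(n), 4):
        if all(adj[u][v] for u, v in combinations(S, 2)):
            count += 1
    return count

n, C = 20, 1.0
p = C * n ** (-2 / 3)
lam = math.comb(n, 4) * p ** 6
rng = random.Random(0)
trials = 300
zeros = sum(count_4_cliques(n, p, rng) == 0 for _ in range(trials)) / trials
# The Poisson approximation predicts P[W = 0] ~ e^{-lambda}.
assert abs(zeros - math.exp(-lam)) < 0.06
```

The tolerance accounts for both the approximation error and Monte Carlo noise at this modest trial count.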

Exercises
Exercise 4.1 (Harmonic function on Zd : unbounded). Give an example of an un-
bounded harmonic function on Z. Give one on Zd for general d. [Hint: What is the
simplest function after the constant one?]
Exercise 4.2 (Binomial vs. Binomial). Use coupling to show that

$$n \ge m,\ q \ge p \implies \mathrm{Bin}(n, q) \succeq \mathrm{Bin}(m, p).$$
Exercise 4.3 (A chain that is not stochastically monotone). Consider random walk
on a network N = ((V, E), c) where V = {0, 1, . . . , n} and i ∼ j if and only if
|i − j| = 1 (in particular, not including self-loops). Show that the transition matrix
is, in general, not stochastically monotone (see Definition 4.2.16).
Exercise 4.4 (Increasing events: properties). Let F be a σ-algebra over the poset
X . Recall that an event A ∈ F is increasing if x ∈ A implies that any y ≥ x is also
in A and that a function f : X → R is increasing if x ≤ y implies f (x) ≤ f (y).
(i) Show that an event A ∈ F is increasing if and only if the indicator function
1A is increasing.
(ii) Let A, B ∈ F be increasing. Show that A ∩ B and A ∪ B are increasing.
(iii) An event A is decreasing if x ∈ A implies that any y ≤ x is also in A. Show
that A is decreasing if and only if Ac is increasing.
(iv) Let A, B ∈ F be decreasing. Show that A ∩ B and A ∪ B are decreasing.
Exercise 4.5 (Harris’ inequality: alternative proof). We say that f : Rn → R is
coordinatewise nondecreasing if it is nondecreasing in each variable while keeping
the other variables fixed.
(i) (Chebyshev’s association inequality) Let f : R → R and g : R → R be
coordinatewise nondecreasing and let X be a real random variable. Show
that
E[f (X)g(X)] ≥ E[f (X)]E[g(X)].
[Hint: Consider the quantity (f (X) − f (X 0 ))(g(X) − g(X 0 )) where X 0 is
an independent copy of X.]
(ii) (Harris’ inequality) Let f : Rn → R and g : Rn → R be coordinatewise
nondecreasing and let X = (X1 , . . . , Xn ) be independent real random vari-
ables. Show by induction on n that
E[f (X)g(X)] ≥ E[f (X)]E[g(X)].
[Hint: You may need Lemma B.6.15.]

Exercise 4.6. Provide the details for Example 4.2.33.

Exercise 4.7 (FKG: sufficient conditions). Let X := {0, 1}F where F is finite and
let µ be a positive probability measure on X . We use the notation introduced in the
proof of Holley’s inequality (Theorem 4.2.32).

(i) To check the FKG condition, show that it suffices to check that, for all x ≤
y ∈ X and i ∈ F ,
$$\frac{\mu(y^{i,1})}{\mu(y^{i,0})} \ge \frac{\mu(x^{i,1})}{\mu(x^{i,0})}.$$
[Hint: Write µ(ω ∨ ω 0 )/µ(ω) as a telescoping product.]

(ii) To check the FKG condition, show that it suffices to check (4.2.15) only for
those ω, ω 0 ∈ X such that kω − ω 0 k1 = 2 and neither ω ≤ ω 0 nor ω 0 ≤ ω.
[Hint: Use (i).]

Exercise 4.8 (FKG and strong positive associations). Let X := {0, 1}F where F
is finite and let µ be a positive probability measure on X . For Λ ⊆ F and ξ ∈ X ,
let
XΛξ := {ωΛ × ξΛc : ωΛ ∈ {0, 1}Λ },
where ωΛ × ξΛc agrees with ω on coordinates in Λ and with ξ on coordinates in
F \Λ. Define the measure µξΛ over {0, 1}Λ as

$$\mu_\Lambda^\xi(\omega_\Lambda) := \frac{\mu(\omega_\Lambda \times \xi_{\Lambda^c})}{\mu(\mathcal{X}_\Lambda^\xi)}.$$

That is, µξΛ is µ conditioned on agreeing with ξ on F \Λ. The measure µ is said to
be strongly positively associated if µξΛ (ωΛ ) is positively associated for all Λ and ξ.
Prove that the FKG condition is equivalent to strong positive associations. [Hint:
Use Exercise 4.7 as well as the FKG inequality.]

Exercise 4.9 (Triangle-freeness: a second proof). Consider again the setting of


Section 4.2.4.

(i) Let et be the minimum number of edges in a t-vertex union of k not mutually
vertex-disjoint triangles. Show that, for any k ≥ 2 and k ≤ t < 3k, it holds
that et > t.

(ii) Use Exercise 2.18 to give a second proof of the fact that P[X_n = 0] \to e^{-\lambda^3/6}.

Exercise 4.10 (RSW lemma: general α). Let Rn,α (p) be as defined in Section 4.2.5.
Show that for all n \ge 2 (divisible by 4) and p \in (0, 1)

$$R_{n,\alpha}(p) \ge \Big( \frac{1}{2} \Big)^{2\alpha - 2} R_{n,1}(p)^{6\alpha - 7}\, R_{n/2,1}(p)^{6\alpha - 6}.$$
Exercise 4.11 (Primal and dual crossings). Modify the proof of Lemma 2.2.14 to
prove Lemma 4.2.41.
Exercise 4.12 (Square-root trick). Let µ be an FKG measure on {0, 1}F where F
is finite. Let A1 and A2 be increasing events with µ(A1 ) = µ(A2 ). Show that
$$\mu(A_1) \ge 1 - \sqrt{1 - \mu(A_1 \cup A_2)}.$$
Exercise 4.13 (Splitting: details). Show that P̃ , as defined in Example 4.3.3, is a
transition matrix on V provided z0 satisfies the condition there.
Exercise 4.14 (Doeblin’s condition in finite case). Let P be a transition matrix on
a finite state space.
(i) Show that Doeblin’s condition (see Example 4.3.3) holds when P is finite,
irreducible and aperiodic.
(ii) Show that Doeblin’s condition holds for lazy random walk on the hypercube
with s = n. Use it to derive a bound on the mixing time.
Exercise 4.15 (Mixing on cycles: lower bound). Let (Zt ) be lazy, simple random
walk on the cycle of size n, Zn := {0, 1 . . . , n − 1}, where i ∼ j if |j − i| = 1
(mod n). Assume n is divisible by 4 and fix 0 < ε < 1/2.
(i) Let A = {n/2, . . . , n − 1}. By coupling (Zt ) with lazy, simple random walk
on Z, show that
$$P^{\alpha n^2}(n/4, A) < \frac{1}{2} - \varepsilon,$$
for α ≤ αε for some αε > 0. [Hint: You may want to use Chebyshev’s
inequality (Theorem 2.1.2) or Kolmogorov’s maximal inequality (Coroll-
ary 3.1.46).]
(ii) Deduce that
tmix (ε) ≥ αε n2 .

Exercise 4.16 (Lower bound on mixing: distinguishing statistic). Let X and Y


be random variables on a finite state space S. Let h : S → R be a measurable
real-valued map. Assume that
E[h(Y )] − E[h(X)] ≥ rσ,

where r > 0 and \sigma^2 := \max\{\mathrm{Var}[h(X)], \mathrm{Var}[h(Y)]\}. Show that

$$\|\mu_X - \mu_Y\|_{\mathrm{TV}} \ge 1 - \frac{8}{r^2}.$$
[Hint: Consider the interval on one side of the midpoint between E[h(X)] and
E[h(Y )].]

Exercise 4.17 (Path coupling and optimal transport). Let V be a finite state space
and let P be an irreducible transition matrix on V with stationary distribution π.
Let w0 be a metric on V . For probability measures µ, ν on V , let

W0 (µ, ν) := inf {E[w0 (X, Y )] : (X, Y ) is a coupling of µ and ν} ,

be the so-called Wasserstein distance (or transportation metric) between µ and ν.

(i) Show that W_0 is a metric. [Hint: See the proof of Claim 4.3.11.]

(ii) Assume that the conditions of Theorem 4.3.10 hold. Show that for any prob-
ability measures µ, ν

W0 (µP, νP ) ≤ κ W0 (µ, ν).

(iii) Use (i) and (ii) to prove Theorem 4.3.10.

Exercise 4.18 (Stein equation for the Poisson distribution). Let λ > 0. Show that
a non-negative integer-valued random variable Z is Poi(λ) if and only if for all g
bounded
E[λg(Z + 1) − Zg(Z)] = 0.

Exercise 4.19 (Laplacian and stationarity). Let P be an irreducible transition matrix on a finite or countably infinite state space V. Recall the Laplacian operator is

$$\Delta f(x) = \bigg[ \sum_y P(x, y) f(y) \bigg] - f(x),$$

provided the sum is finite. Show that a probability distribution \mu over V is stationary for P if and only if, for all bounded measurable functions f,

$$\sum_{x \in V} \mu(x) \Delta f(x) = 0.$$

Exercise 4.20 (Chen-Stein method for positively related variables). Using the notation in (4.4.1), (4.4.2) and (4.4.3), suppose that for each i we can construct a coupling \{(X_j^{(i)} : j = 1, \ldots, n), (Y_j^{(i)} : j \ne i)\} with (X_j^{(i)})_j \sim (X_j)_j such that

$$(Y_j^{(i)}, j \ne i) \sim (X_j^{(i)}, j \ne i) \,\big|\, X_i^{(i)} = 1 \qquad\text{and}\qquad Y_j^{(i)} \ge X_j^{(i)},\ \forall j \ne i.$$

Show that

$$\|\mu - \pi\|_{\mathrm{TV}} \le (1 \wedge \lambda^{-1}) \bigg\{ \mathrm{Var}(W) - \lambda + 2 \sum_{i=1}^n p_i^2 \bigg\}.$$

Exercise 4.21 (Chen-Stein and 4-cliques). Use Exercise 4.20 to give an improved
asymptotic bound in the setting of Section 4.4.3.

Exercise 4.22 (Chen-Stein for negatively related variables). Using the notation in (4.4.1), (4.4.2) and (4.4.3), suppose that for each i we can construct a coupling \{(X_j^{(i)} : j = 1, \ldots, n), (Y_j^{(i)} : j \ne i)\} with (X_j^{(i)})_j \sim (X_j)_j such that

$$(Y_j^{(i)}, j \ne i) \sim (X_j^{(i)}, j \ne i) \,\big|\, X_i^{(i)} = 1 \qquad\text{and}\qquad Y_j^{(i)} \le X_j^{(i)},\ \forall j \ne i.$$

Show that

$$\|\mu - \pi\|_{\mathrm{TV}} \le (1 \wedge \lambda^{-1}) \big\{ \lambda - \mathrm{Var}(W) \big\}.$$

Bibliographic remarks
Section 4.1 The coupling method is generally attributed to Doeblin [Doe38]. The
standard reference on coupling is [Lin02]. See that reference for a history of cou-
pling and a facsimile of Doeblin’s paper. See also [dH]. Section 4.1.2 is based
on [Per, Section 6] and Section 4.1.4 is based on [vdH17, Section 5.3].

Section 4.2 Strassen’s theorem is due to Strassen [Str65]. Harris’ inequality is


due to Harris [Har60]. The FKG inequality is due to Fortuin, Kasteleyn, and Gini-
bre [FKG71]. A “four-function” version of Holley’s inequality, which also extends
to distributive lattices, was proved by Ahlswede and Daykin [AD78]. See for ex-
ample [AS11, Section 6.1]. An exposition of submodularity and its connections
to convexity can be found in [Lov83]. For more on Markov random fields, see
for example [RAS15]. Section 4.2.4 follows [AS11, Sections 8.1, 8.2, 10.1]. Jan-
son’s inequality is due to Janson [Jan90]. Boppana and Spencer [BS89] gave the
proof presented here. For more on Janson’s inequality, see [JLR11, Section 2.2].
The presentation in Section 4.2.5 follows closely [BR06b, Sections 3 and 4]. See
also [BR06a, Chapter 3]. Broadbent and Hammersley [BH57, Ham57] initiated
the study of the critical value of percolation. Harris’ theorem was proved by Har-
ris [Har60] and Kesten’s theorem was proved two decades later by Kesten [Kes80],
confirming non-rigorous work of Sykes and Essam [SE64]. The RSW lemma was
obtained independently by Russo [Rus78] and Seymour and Welsh [SW78]. The
proof we gave here is due to Bollobás and Riordan [BR06b]. Another short proof
of a version of the RSW lemma for critical site percolation on a triangular lat-
tice was given by Smirnov; see for example [Ste]. The type of “scale invariance”
seen in the RSW lemma plays a key role in the contemporary theory of critical
two-dimensional percolation and of two-dimensional lattice models more gener-
ally. See for example [Law05, Gri10a].

Section 4.3 The material in Section 4.3 borrows heavily from [LPW06, Chapters
5, 14, 15] and [AF, Chapter 12]. Aldous [Ald83] was the first author to make
explicit use of coupling to bound total variation distance to stationarity of finite
Markov chains. The link between couplings of Markov chains and total variation
distance was also used by Griffeath [Gri75] and Pitman [Pit76]. Example 4.3.3 is
based on [Str14] and [JH01]. For a more general treatment, see [MT09, Chapter
16]. The proof of Claim 4.3.7 is partly based on [LPW06, Proposition 7.13]. See
also [DGG+ 00] and [HS07] for alternative proofs. Path coupling is due to Bubley
and Dyer [BD97]. The optimal transport perspective on the path coupling method
in Exercise 4.17 is from [LPW06, Chapter 14]. For more on optimal transport,

see for example [Vil09]. The main result in Section 4.3.4 is taken from [LPW06,
Theorem 15.1]. For more background on the so-called “critical slowdown” of the
Glauber dynamics of Ising and Potts models on various graphs, see [CDL+ 12,
LS12].

Section 4.4 The Chen-Stein method was introduced by Chen in [Che75] as an


adaptation of the Stein method [Ste72] to the Poisson distribution. The presen-
tation in Section 4.4 is inspired heavily by [Dey] and [vH16]. Example 4.4.9 is
taken from [AGG89]. Further applications of the Chen-Stein and Stein methods to
random graphs can be found in [JLR11, Chapter 6].
Chapter 5

Spectral methods

In this chapter, we develop spectral techniques. We highlight some applications


to Markov chain mixing and network analysis. The main tools are the spectral
theorem and the variational characterization of eigenvalues, which we review in
Section 5.1 together with some related results. We also give a brief introduction
to spectral graph theory and detail an application to community recovery. In Sec-
tion 5.2 we apply the spectral theorem to reversible Markov chains. In particular
we define the spectral gap and establish its close relationship to the mixing time.
Roughly speaking, we show through an eigendecomposition of the transition ma-
trix that the gap between the eigenvalue 1 (which is the largest in absolute value)
and the rest of the spectrum drives how fast P t converges to the stationary distribu-
tion. We give several examples. We then show in Section 5.3 that the spectral gap
can be bounded using certain isoperimetric properties of the underlying network.
We prove Cheeger’s inequality, which quantifies this relationship, and introduce
expander graphs, an important family of graphs with good “expansion.” Applica-
tions to mixing times are also discussed. One specific technique is the “canonical
paths method,” which bounds the spectral gap by formalizing a notion of con-
gestion in the network.

5.1 Background
We first review some important concepts from linear algebra. In particular, we recall the spectral theorem as well as the variational characterization of eigenvalues. We also derive a few perturbation results. We end this section with an application to community recovery in network analysis.

Version: December 20, 2023
Modern Discrete Probability: An Essential Toolkit
Copyright © 2023 Sébastien Roch

5.1.1 Eigenvalues and their variational characterization


When a matrix A ∈ R^{d×d} is symmetric, that is, a_{ij} = a_{ji} for all i, j, a remarkable result is that A is similar to a diagonal matrix by an orthogonal transformation. Put differently, there exists an orthonormal basis of R^d made of eigenvectors of A. Recall that a matrix Q ∈ R^{d×d} is orthogonal if QQ^T = I_{d×d} and Q^T Q = I_{d×d}, where I_{d×d} is the d × d identity matrix. In words, its columns form an orthonormal basis of R^d. For a vector z = (z_1, ..., z_d), we let diag(z) = diag(z_1, ..., z_d) be the diagonal matrix with diagonal entries z_1, ..., z_d. Unless specified otherwise, a vector is by default a "column vector" and its transpose is a "row vector."
Theorem 5.1.1 (Spectral theorem). Let A ∈ R^{d×d} be a symmetric matrix, that is, A^T = A. Then A has d orthonormal eigenvectors q_1, ..., q_d with corresponding (not necessarily distinct) real eigenvalues λ_1 ≥ λ_2 ≥ ... ≥ λ_d. In matrix form, this is written as the matrix factorization

A = QΛQ^T = Σ_{i=1}^d λ_i q_i q_i^T,

where Q has columns q_1, ..., q_d and Λ = diag(λ_1, ..., λ_d). We refer to this factorization as a spectral decomposition of A.

The proof uses a greedy sequence maximizing the quadratic form ⟨v, Av⟩. For a hint as to why that might come about, note that for a unit eigenvector v with eigenvalue λ we have ⟨v, Av⟩ = ⟨v, λv⟩ = λ.
We will need the following formula. Consider the block matrices

$$\begin{pmatrix} y \\ z \end{pmatrix} \quad\text{and}\quad \begin{pmatrix} A & B \\ C & D \end{pmatrix},$$

where y ∈ R^{d_1}, z ∈ R^{d_2}, A ∈ R^{d_1×d_1}, B ∈ R^{d_1×d_2}, C ∈ R^{d_2×d_1}, and D ∈ R^{d_2×d_2}. Then it follows by direct calculation that

$$\begin{pmatrix} y \\ z \end{pmatrix}^T \begin{pmatrix} A & B \\ C & D \end{pmatrix} \begin{pmatrix} y \\ z \end{pmatrix} = y^T A y + y^T B z + z^T C y + z^T D z. \tag{5.1.1}$$
We will also need the following linear algebra fact. Let v_1, ..., v_j be orthonormal vectors in R^d, with j < d. Then they can be completed into an orthonormal basis v_1, ..., v_d of R^d.

Proof of Theorem 5.1.1. We proceed by induction.



A first eigenvector. Let A_1 = A. Maximizing the objective function ⟨v, A_1 v⟩ over unit vectors, we let

v_1 ∈ arg max{⟨v, A_1 v⟩ : ‖v‖_2 = 1},

and

λ_1 = max{⟨v, A_1 v⟩ : ‖v‖_2 = 1}.

Complete v_1 into an orthonormal basis of R^d, v_1, v̂_2, ..., v̂_d, and form the block matrix

Ŵ_1 := (v_1  V̂_1),

where the columns of V̂_1 are v̂_2, ..., v̂_d. Note that Ŵ_1 is orthogonal by construction.

Getting one step closer to diagonalization. We show next that Ŵ_1 gets us one step closer to a diagonal matrix by similarity transformation. Note first that

$$\hat{W}_1^T A_1 \hat{W}_1 = \begin{pmatrix} \lambda_1 & w_1^T \\ w_1 & A_2 \end{pmatrix},$$

where w_1 := V̂_1^T A_1 v_1 and A_2 := V̂_1^T A_1 V̂_1. The key claim is that w_1 = 0. This follows from an argument by contradiction.

Suppose w_1 ≠ 0 and consider the unit vector

$$z := \hat{W}_1 \times \frac{1}{\sqrt{1 + \delta^2 \|w_1\|_2^2}} \begin{pmatrix} 1 \\ \delta w_1 \end{pmatrix},$$

which achieves objective value

$$z^T A_1 z = \frac{1}{1 + \delta^2 \|w_1\|_2^2} \begin{pmatrix} 1 \\ \delta w_1 \end{pmatrix}^T \begin{pmatrix} \lambda_1 & w_1^T \\ w_1 & A_2 \end{pmatrix} \begin{pmatrix} 1 \\ \delta w_1 \end{pmatrix} = \frac{1}{1 + \delta^2 \|w_1\|_2^2} \left( \lambda_1 + 2\delta \|w_1\|_2^2 + \delta^2 w_1^T A_2 w_1 \right),$$

where we used (5.1.1). By the Taylor expansion

1/(1 + ε²) = 1 − ε² + O(ε⁴),

for δ > 0 small enough,

z^T A_1 z = (λ_1 + 2δ‖w_1‖_2² + δ² w_1^T A_2 w_1)(1 − δ²‖w_1‖_2² + O(δ⁴)) = λ_1 + 2δ‖w_1‖_2² + O(δ²) > λ_1.

That gives the desired contradiction.


So, letting W1 := Ŵ1 ,
 
λ1 0
W1T A1 W1 = .
0 A2

Finally note that A2 = V̂1T A1 V̂1 is symmetric since


AT2 = (V̂1T A1 V̂1 )T = V̂1T AT1 V̂1 = V̂1T A1 V̂1 = A2 ,
by the symmetry of A1 itself.

Next step of the induction Apply the same argument to the symmetric subma-
trix A2 ∈ R(d−1)×(d−1) , let Ŵ2 ∈ R(d−1)×(d−1) be the corresponding orthogonal
matrix, and define λ2 and A3 through the equation
 
T λ2 0
Ŵ2 A2 Ŵ2 = .
0 A3
Define the block matrix  
1 0
W2 =
0 Ŵ2
and observe that
 
λ1 0
W2T W1T A1 W1 W2 = W2T W2
0 A2
 
λ1 0
=
0 Ŵ2T A2 Ŵ2
 
λ1 0 0
=  0 λ2 0  .
0 0 A3
Proceeding similarly by induction gives the claim, with the final Q being the
product of the Wi s (which is orthogonal as the product of orthogonal matrices).
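As a sanity check, the factorization in Theorem 5.1.1 is easy to verify numerically. The sketch below (assuming NumPy is available; the matrix size and seed are arbitrary choices for illustration) computes a spectral decomposition with `np.linalg.eigh` and checks both the matrix form and the sum-of-rank-one form.

```python
import numpy as np

rng = np.random.default_rng(0)

# Build a random symmetric matrix A.
M = rng.standard_normal((5, 5))
A = (M + M.T) / 2

# eigh returns eigenvalues in nondecreasing order with orthonormal
# eigenvectors as columns; flip to match the book's nonincreasing convention.
evals, evecs = np.linalg.eigh(A)
lam = evals[::-1]            # lambda_1 >= ... >= lambda_d
Q = evecs[:, ::-1]           # columns q_1, ..., q_d

# Check the factorization A = Q Lambda Q^T and orthogonality of Q.
assert np.allclose(A, Q @ np.diag(lam) @ Q.T)
assert np.allclose(Q.T @ Q, np.eye(5))

# The sum-of-rank-one form: A = sum_i lambda_i q_i q_i^T.
A_sum = sum(lam[i] * np.outer(Q[:, i], Q[:, i]) for i in range(5))
assert np.allclose(A, A_sum)
```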

We derive an important variational characterization inspired by the proof of the spectral theorem. We will need the following quantity.

Definition 5.1.2 (Rayleigh quotient). Let A ∈ R^{d×d} be a symmetric matrix. The Rayleigh quotient of A is defined as

R_A(u) = ⟨u, Au⟩ / ⟨u, u⟩,

which is defined for any u ≠ 0 in R^d.

We let the span of a collection of vectors be defined as

span(u_1, ..., u_n) := { Σ_{i=1}^n α_i u_i : α_1, ..., α_n ∈ R }.

Theorem 5.1.3 (Courant-Fischer theorem). Let A ∈ R^{d×d} be a symmetric matrix with spectral decomposition A = Σ_{i=1}^d λ_i v_i v_i^T, where λ_1 ≥ ... ≥ λ_d. For each k = 1, ..., d, define the subspaces

V_k = span(v_1, ..., v_k) and W_{d−k+1} = span(v_k, ..., v_d).

Then, for all k = 1, ..., d,

λ_k = min_{u ∈ V_k} R_A(u) = max_{u ∈ W_{d−k+1}} R_A(u).

Furthermore we have the following min-max formulas, which do not depend on the choice of spectral decomposition: for all k = 1, ..., d,

λ_k = max_{V : dim(V)=k} min_{u ∈ V} R_A(u) = min_{W : dim(W)=d−k+1} max_{u ∈ W} R_A(u).

Note that, in all these formulas, the vector u = v_k is optimal. To derive the "local" formulas, the first ones above, we expand a vector in V_k into the basis v_1, ..., v_k and use the fact that R_A(v_i) = λ_i and that the eigenvalues are in nonincreasing order. The "global" formulas then follow from a dimension argument.
We will need the following dimension-based fact. Let U, V ⊆ R^d be linear subspaces such that dim(U) + dim(V) > d, where dim(U) denotes the dimension of U. Then there exists a nonzero vector in the intersection U ∩ V. That is,

dim(U) + dim(V) > d ⟹ (U ∩ V) \ {0} ≠ ∅.   (5.1.2)

Proof of Theorem 5.1.3. We first prove the local formulas, that is, the ones involving a specific decomposition.

Local formulas. Since v_1, ..., v_k form an orthonormal basis of V_k, any nonzero vector u ∈ V_k can be written as u = Σ_{i=1}^k ⟨u, v_i⟩ v_i, and it follows that

⟨u, u⟩ = Σ_{i=1}^k ⟨u, v_i⟩²,

⟨u, Au⟩ = ⟨u, Σ_{i=1}^k ⟨u, v_i⟩ λ_i v_i⟩ = Σ_{i=1}^k λ_i ⟨u, v_i⟩².

Thus,

R_A(u) = ⟨u, Au⟩/⟨u, u⟩ = (Σ_{i=1}^k λ_i ⟨u, v_i⟩²) / (Σ_{i=1}^k ⟨u, v_i⟩²) ≥ λ_k (Σ_{i=1}^k ⟨u, v_i⟩²) / (Σ_{i=1}^k ⟨u, v_i⟩²) = λ_k,

where we used λ_1 ≥ ... ≥ λ_k and the fact that ⟨u, v_i⟩² ≥ 0. Moreover R_A(v_k) = λ_k. So we have established

λ_k = min_{u ∈ V_k} R_A(u).

The expression in terms of W_{d−k+1} is proved similarly.

Global formulas. Since V_k has dimension k, it follows from the local formula that

λ_k = min_{u ∈ V_k} R_A(u) ≤ max_{V : dim(V)=k} min_{u ∈ V} R_A(u).

Let V be any subspace with dimension k. Because W_{d−k+1} has dimension d − k + 1, we have dim(V) + dim(W_{d−k+1}) > d, and there must be a nonzero vector u_0 in the intersection V ∩ W_{d−k+1} by the dimension-based fact above. We then have by the other local formula that

λ_k = max_{u ∈ W_{d−k+1}} R_A(u) ≥ R_A(u_0) ≥ min_{u ∈ V} R_A(u).

Since this inequality holds for any subspace of dimension k, we have

λ_k ≥ max_{V : dim(V)=k} min_{u ∈ V} R_A(u).

Combining with the inequality in the other direction above gives the claim. The other global formula is proved similarly.
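The local formulas are easy to test numerically. The following sketch (NumPy assumed; dimension, seed, and the choice k = 3 are arbitrary) checks that R_A(v_k) = λ_k and that random vectors drawn from V_k and from W_{d−k+1} sit on the correct side of λ_k.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 6
M = rng.standard_normal((d, d))
A = (M + M.T) / 2

evals, evecs = np.linalg.eigh(A)
lam = evals[::-1]              # nonincreasing eigenvalues
V = evecs[:, ::-1]             # columns v_1, ..., v_d

def rayleigh(u):
    return (u @ A @ u) / (u @ u)

k = 3
# R_A(v_k) = lambda_k, the optimal vector in both local formulas.
assert np.isclose(rayleigh(V[:, k - 1]), lam[k - 1])

# Random vectors in V_k = span(v_1, ..., v_k) have R_A >= lambda_k,
# and random vectors in W_{d-k+1} = span(v_k, ..., v_d) have R_A <= lambda_k.
for _ in range(100):
    u = V[:, :k] @ rng.standard_normal(k)
    w = V[:, k - 1:] @ rng.standard_normal(d - k + 1)
    assert rayleigh(u) >= lam[k - 1] - 1e-10
    assert rayleigh(w) <= lam[k - 1] + 1e-10
```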

5.1.2 Elements of spectral graph theory


We apply the variational characterization of eigenvalues to matrices arising in
graph theory. In this section, graphs have no self-loop.

Unweighted graphs. As we have previously seen, a convenient way of specifying a graph is through a matrix representation. Assume the undirected graph G = (V, E) has n = |V| vertices. Recall that the adjacency matrix A of G is the n × n symmetric matrix defined as

A_{xy} = 1 if {x, y} ∈ E, and A_{xy} = 0 otherwise.

Another matrix of interest is the Laplacian matrix. It is related to the Laplace operator we encountered previously. We will show in particular that it contains useful information about the connectedness of the graph. Recall that, given a graph G = (V, E), the quantity δ(v) denotes the degree of v ∈ V.

Definition 5.1.4 (Graph Laplacian). Let G = (V, E) be a graph with vertices V = {1, ..., n} and adjacency matrix A ∈ R^{n×n}. Let D = diag(δ(1), ..., δ(n)) be the degree matrix. The graph Laplacian (or Laplacian matrix, or Laplacian for short) associated to G is defined as L = D − A. Its entries are

l_{ij} = δ(i) if i = j; l_{ij} = −1 if {i, j} ∈ E; and l_{ij} = 0 otherwise.

Observe that the Laplacian L of a graph G is a symmetric matrix:

L^T = (D − A)^T = D^T − A^T = D − A = L,

where we used that both D and A are themselves symmetric. The associated quadratic form is particularly simple and will play an important role.

Lemma 5.1.5 (Laplacian quadratic form). Let G = (V, E) be a graph with n = |V| vertices. Its Laplacian L is a positive semidefinite matrix and furthermore we have the following formula for the Laplacian quadratic form (or Dirichlet energy)

x^T L x = Σ_{e={i,j}∈E} (x_i − x_j)²,

for any x = (x_1, ..., x_n) ∈ R^n.


Proof of Lemma 5.1.5. Let B be an oriented incidence matrix of G (see Defini-
tion 1.1.16). We claim that BB T = L. Indeed, for i 6= j, entry (i, j) of BB T is a
sum over all edges containing i and j as endvertices, of which there is at most one.
When e = {i, j} ∈ E, that entry is −1, since one of i or j has a 1 in the column
of B corresponding to e and the other one has a −1. For i = j, letting bxy be entry
(x, y) of B, X
(BB T )ii = b2xy = δ(i).
e={x,y}∈E:i∈e

That shows that BB T = L entry-by-entry.


For any x, we have (B T x)k = xv − xu if the edge ek = {u, v} is oriented as
(u, v) under B. That implies
X
xT Lx = xT BB T x = kB T xk22 = (xi − xj )2 .
e={i,j}∈E
CHAPTER 5. SPECTRAL METHODS 334

Since the latter quantity is always nonnegative, it also implies that L is positive
semidefinite.
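Both conclusions of Lemma 5.1.5 can be checked on a small example. The sketch below (NumPy assumed; the 5-vertex graph is an arbitrary choice) builds L = D − A and verifies the quadratic-form identity and positive semidefiniteness.

```python
import numpy as np

# A small graph on 5 vertices (a path plus one extra edge).
n = 5
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (1, 3)]

A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
D = np.diag(A.sum(axis=1))   # degree matrix
L = D - A                    # graph Laplacian

rng = np.random.default_rng(2)
x = rng.standard_normal(n)

# Quadratic form: x^T L x equals the sum of (x_i - x_j)^2 over edges.
quad = sum((x[i] - x[j]) ** 2 for i, j in edges)
assert np.isclose(x @ L @ x, quad)

# Positive semidefiniteness: all eigenvalues are >= 0 (up to roundoff).
assert np.linalg.eigvalsh(L).min() > -1e-10
```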

As a convention, we denote the eigenvalues of a Laplacian matrix L by

0 ≤ μ_1 ≤ μ_2 ≤ ... ≤ μ_n,

and we will refer to them as Laplacian eigenvalues. Here is a simple observation. For any G = (V, E), the constant unit vector

y_1 = (1/√n)(1, ..., 1)

is an eigenvector of the Laplacian with eigenvalue 0. Indeed, let B be an oriented incidence matrix of G and recall from the proof of Lemma 5.1.5 that L = BB^T. By construction B^T y_1 = 0 since each column of B has exactly one 1 and one −1. So L y_1 = B B^T y_1 = 0 as claimed. In general, the constant vector may not be the only eigenvector with eigenvalue 0.

We are now ready to derive connectivity consequences. Recall that, for any graph G, the Laplacian eigenvalue μ_1 = 0.

Lemma 5.1.6 (Laplacian and connectivity). If G is connected, then the Laplacian eigenvalue μ_2 > 0.

Proof. Let G = (V, E) with n = |V| and let L = Σ_{i=1}^n μ_i y_i y_i^T be a spectral decomposition of its Laplacian L with 0 = μ_1 ≤ ... ≤ μ_n. Suppose by way of contradiction that μ_2 = 0. Any eigenvector y = (y_1, ..., y_n) with eigenvalue 0 satisfies Ly = 0 by definition. By Lemma 5.1.5 then

0 = y^T L y = Σ_{e={i,j}∈E} (y_i − y_j)².

In order for this to hold, it must be that any two adjacent vertices i and j have y_i = y_j. That is, {i, j} ∈ E implies y_i = y_j. Furthermore, because G is connected, between any two of its vertices u and v (adjacent or not) there is a path u = w_0 ∼ ... ∼ w_k = v along which the y_w s must be the same. Thus y is a constant vector.

But that is a contradiction since the eigenvectors y_1, ..., y_n are in fact linearly independent, so that y_1 and y_2 cannot both be a constant vector.

The quantity μ_2 is sometimes referred to as the algebraic connectivity of the graph. The corresponding eigenvector, y_2, is known as the Fiedler vector.

We will be interested in more quantitative results of this type. Before proceeding, we start with a simple observation. By our proof of Theorem 5.1.1, the largest eigenvalue μ_n of the Laplacian L is the solution to the optimization problem

μ_n = max{⟨x, Lx⟩ : ‖x‖_2 = 1}.

Such an extremal characterization is useful in order to bound the eigenvalue μ_n, since any choice of x with ‖x‖_2 = 1 gives a lower bound through the quantity ⟨x, Lx⟩. We give a simple consequence.

Lemma 5.1.7 (Laplacian and degree). Let G = (V, E) be a graph with maximum degree δ̄. Let μ_n be the largest Laplacian eigenvalue. Then

μ_n ≥ δ̄ + 1.

Proof. Let u ∈ V be a vertex with degree δ̄. Let z be the vector with entries

z_i = δ̄ if i = u; z_i = −1 if {i, u} ∈ E; and z_i = 0 otherwise,

and let x be the unit vector z/‖z‖_2. By definition of the degree of u, ‖z‖_2² = δ̄² + δ̄(−1)² = δ̄(δ̄ + 1).

Using Lemma 5.1.5,

⟨z, Lz⟩ = Σ_{e={i,j}∈E} (z_i − z_j)² ≥ Σ_{i: {i,u}∈E} (z_i − z_u)² = Σ_{i: {i,u}∈E} (−1 − δ̄)² = δ̄(δ̄ + 1)²,

where we restricted the sum to those edges incident with u and used the fact that all terms in the sum are nonnegative. Finally,

⟨x, Lx⟩ = ⟨z/‖z‖_2, L(z/‖z‖_2)⟩ = ⟨z, Lz⟩/‖z‖_2² = δ̄(δ̄ + 1)²/(δ̄(δ̄ + 1)) = δ̄ + 1,

so that

μ_n = max{⟨x′, Lx′⟩ : ‖x′‖_2 = 1} ≥ ⟨x, Lx⟩ = δ̄ + 1,

as claimed.
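The test vector from this proof can be checked directly. The sketch below (NumPy assumed; the star graph is an arbitrary choice) verifies the bound; for the star the test vector achieves it exactly.

```python
import numpy as np

# A star graph K_{1,4}: vertex 0 has maximum degree 4.
n = 5
edges = [(0, j) for j in range(1, n)]
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
L = np.diag(A.sum(axis=1)) - A

deg_max = int(A.sum(axis=1).max())

# The proof's test vector: deg_max at the max-degree vertex,
# -1 at its neighbors, 0 elsewhere; then normalize.
z = np.zeros(n)
z[0] = deg_max
for i, j in edges:
    z[j] = -1.0
x = z / np.linalg.norm(z)

mu_n = np.linalg.eigvalsh(L).max()
assert x @ L @ x >= deg_max + 1 - 1e-10   # the lower bound from the proof
assert mu_n >= deg_max + 1 - 1e-10        # hence mu_n >= max degree + 1
```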
A special case of Courant-Fischer (Theorem 5.1.3) for the Laplacian matrix is the following.

Corollary 5.1.8 (Variational characterization of μ_2). Let G = (V, E) be a graph with n = |V| vertices. Assume the Laplacian L of G has spectral decomposition L = Σ_{i=1}^n μ_i y_i y_i^T with 0 = μ_1 ≤ μ_2 ≤ ... ≤ μ_n and y_1 = (1/√n)(1, ..., 1). Then

μ_2 = min{ (Σ_{{u,v}∈E} (x_u − x_v)²) / (Σ_{u=1}^n x_u²) : x = (x_1, ..., x_n) ≠ 0, Σ_{u=1}^n x_u = 0 }.

Proof. By Theorem 5.1.3,

μ_2 = min_{x ∈ span(y_2, ..., y_n)} R_L(x).

Since y_1 is constant and span(y_2, ..., y_n) is the subspace orthogonal to it, this is equivalent to restricting the minimization to those nonzero x such that

0 = ⟨x, y_1⟩ = (1/√n) Σ_{u=1}^n x_u.

Moreover, by Lemma 5.1.5,

⟨x, Lx⟩ = Σ_{{u,v}∈E} (x_u − x_v)²,

so the Rayleigh quotient is

R_L(x) = ⟨x, Lx⟩/⟨x, x⟩ = (Σ_{{u,v}∈E} (x_u − x_v)²) / (Σ_{u=1}^n x_u²).

That proves the claim.

One application of this extremal characterization is a graph drawing heuristic. Consider the entries of the second Laplacian eigenvector y_2, normalized to have unit norm. The entries are centered around 0 by the condition Σ_{u=1}^n x_u = 0. Because it minimizes the quantity

(Σ_{{u,v}∈E} (x_u − x_v)²) / (Σ_{u=1}^n x_u²)

over all centered unit vectors, y_2 tends to assign similar coordinates to adjacent vertices. A similar reasoning applies to the third Laplacian eigenvector, which in addition is orthogonal to the second one. See Figure 5.1 for an illustration.

Figure 5.1: Top: A 3-by-3 grid graph with vertices located at independent uniformly random points in a square. Bottom: The same 3-by-3 grid graph with vertices located at the coordinates corresponding to the second and third eigenvectors of the Laplacian matrix. That is, vertex i is located at position (y_{2,i}, y_{3,i}).
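The bottom panel of Figure 5.1 can be reproduced in a few lines. A sketch (NumPy assumed; plotting omitted) that computes the spectral-drawing coordinates (y_{2,i}, y_{3,i}) for the 3-by-3 grid and checks, crudely, that the drawing places adjacent vertices close together:

```python
import numpy as np

# Build the 3-by-3 grid graph of Figure 5.1.
n = 9
coord = [(r, c) for r in range(3) for c in range(3)]
A = np.zeros((n, n))
for u, (r1, c1) in enumerate(coord):
    for v, (r2, c2) in enumerate(coord):
        if abs(r1 - r2) + abs(c1 - c2) == 1:   # grid neighbors
            A[u, v] = 1.0
L = np.diag(A.sum(axis=1)) - A

# Eigenvectors in nondecreasing eigenvalue order: y_1, y_2, y_3, ...
mu, Y = np.linalg.eigh(L)

# Spectral drawing: place vertex i at (y_{2,i}, y_{3,i}).
positions = np.column_stack([Y[:, 1], Y[:, 2]])

# Mean squared edge length: since y_2, y_3 are unit vectors, the sum over
# directed edges of squared distances equals 2*(mu_2 + mu_3).
edge_len = np.mean([np.sum((positions[u] - positions[v]) ** 2)
                    for u in range(n) for v in range(n) if A[u, v] == 1])
assert positions.shape == (9, 2)
assert edge_len < 2.0 / n   # loose check that adjacent vertices sit nearby
```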

Example 5.1.9 (Two-component graph). Let G = (V, E) be a graph with two connected components ∅ ≠ V_1, V_2 ⊆ V. By the properties of connected components, we have V_1 ∩ V_2 = ∅ and V_1 ∪ V_2 = V. Assume the Laplacian L of G has spectral decomposition L = Σ_{i=1}^n μ_i y_i y_i^T with 0 = μ_1 ≤ μ_2 ≤ ... ≤ μ_n and y_1 = (1/√n)(1, ..., 1). We claimed earlier that for such a graph μ_2 = 0. We prove this here using Corollary 5.1.8:

μ_2 = min{ Σ_{{u,v}∈E} (x_u − x_v)² : x = (x_1, ..., x_n) ∈ R^n, Σ_{u=1}^n x_u = 0, Σ_{u=1}^n x_u² = 1 }.

Based on this characterization, it suffices to find a vector x satisfying Σ_{u=1}^n x_u = 0 and Σ_{u=1}^n x_u² = 1 such that Σ_{{u,v}∈E} (x_u − x_v)² = 0. Indeed, since μ_2 ≥ 0 and any such x gives an upper bound on μ_2, we then necessarily have that μ_2 = 0.

For Σ_{{u,v}∈E} (x_u − x_v)² to be 0, one might be tempted to take a constant vector x. But then we could not satisfy Σ_{u=1}^n x_u = 0 and Σ_{u=1}^n x_u² = 1. Instead, we modify this guess slightly. Because the graph has two connected components, there is no edge between V_1 and V_2. Hence we can assign a different value to each component and still get Σ_{{u,v}∈E} (x_u − x_v)² = 0. So we look for a vector x = (x_1, ..., x_n) of the form

x_u = α if u ∈ V_1, and x_u = β if u ∈ V_2.

To satisfy the constraints on x, we require

Σ_{u=1}^n x_u = Σ_{u∈V_1} α + Σ_{u∈V_2} β = |V_1|α + |V_2|β = 0,

and

Σ_{u=1}^n x_u² = Σ_{u∈V_1} α² + Σ_{u∈V_2} β² = |V_1|α² + |V_2|β² = 1.

Replacing the first equation in the second one, we get

|V_1| (−|V_2|β/|V_1|)² + |V_2|β² = |V_2|²β²/|V_1| + |V_2|β² = 1,

or

β² = |V_1| / (|V_2|(|V_2| + |V_1|)) = |V_1| / (n|V_2|).

Take

β = −√(|V_1|/(n|V_2|)), α = −|V_2|β/|V_1| = √(|V_2|/(n|V_1|)).

The vector x we constructed is in fact an eigenvector of L. Indeed, let B be an oriented incidence matrix of G. Then, for e_k = {u, v}, (B^T x)_k is either x_u − x_v or x_v − x_u. In both cases, that is 0. So Lx = BB^T x = 0, that is, x is an eigenvector of L with eigenvalue 0.

We have shown that μ_2 = 0 when G has two connected components. A slight modification of this argument shows that μ_2 = 0 whenever G is not connected.
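Example 5.1.9 can be verified numerically. The sketch below (NumPy assumed; the two components, a triangle and an edge, are an arbitrary choice) checks that μ_2 = 0 and that the constructed vector is indeed a centered unit-norm eigenvector with eigenvalue 0.

```python
import numpy as np

# Two components: a triangle {0,1,2} and an edge {3,4}.
n = 5
edges = [(0, 1), (1, 2), (0, 2), (3, 4)]
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
L = np.diag(A.sum(axis=1)) - A

mu = np.linalg.eigvalsh(L)        # nondecreasing
assert np.isclose(mu[0], 0)       # mu_1 = 0 always
assert np.isclose(mu[1], 0)       # mu_2 = 0: the graph is disconnected

# The vector from Example 5.1.9: alpha on V_1, beta on V_2.
n1, n2 = 3, 2
beta = -np.sqrt(n1 / (n * n2))
alpha = np.sqrt(n2 / (n * n1))
x = np.array([alpha] * n1 + [beta] * n2)

assert np.isclose(x.sum(), 0)               # centered
assert np.isclose((x ** 2).sum(), 1)        # unit norm
assert np.allclose(L @ x, 0)                # eigenvector with eigenvalue 0
```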

Networks. In the case of a network (i.e., edge-weighted graph) G = (V, E, w), the Laplacian can be defined as follows. As usual, we assume that w : E → R_+ is a function that assigns positive real weights to the edges. We write w_e = w_{ij} for the weight of edge e = {i, j}. Recall that the degree of a vertex i is

δ(i) = Σ_{j: {i,j}∈E} w_{ij},

and the adjacency matrix A of G is the n × n symmetric matrix defined as

A_{ij} = w_{ij} if {i, j} ∈ E, and A_{ij} = 0 otherwise.

Definition 5.1.10 (Network Laplacian). Let G = (V, E, w) be a network with n = |V| vertices and adjacency matrix A. Let D = diag(δ(1), ..., δ(n)) be the degree matrix. The network Laplacian (or Laplacian matrix, or Laplacian for short) associated to G is defined as L = D − A.

It can be shown (see Exercise 5.2) that the Laplacian quadratic form satisfies in the edge-weighted case

⟨x, Lx⟩ = Σ_{{i,j}∈E} w_{ij} (x_i − x_j)²,   (5.1.3)

for x = (x_1, ..., x_n) ∈ R^n. (The keen observer will have noticed that we already encountered this quantity as the "Dirichlet energy" in Section 3.3.3; more on this in Section 5.3.) As a positive semidefinite matrix (see again Exercise 5.2), the network Laplacian has an orthonormal basis of eigenvectors with nonnegative eigenvalues that satisfy the variational characterization we derived above. In particular, if we denote the eigenvalues 0 = μ_1 ≤ μ_2 ≤ ... ≤ μ_n, it follows from Courant-Fischer (Theorem 5.1.3) that

μ_2 = min{ Σ_{{u,v}∈E} w_{uv} (x_u − x_v)² : x = (x_1, ..., x_n) ∈ R^n, Σ_{u=1}^n x_u = 0, Σ_{u=1}^n x_u² = 1 }.

Other variants of the Laplacian are useful. We introduce the normalized Laplacian next.

Definition 5.1.11 (Normalized Laplacian). The normalized Laplacian of G = (V, E, w) with adjacency matrix A and degree matrix D is defined as

ℒ = I − D^{−1/2} A D^{−1/2}.

The entries of ℒ are

ℒ_{ij} = 1 if i = j, and ℒ_{ij} = −w_{ij}/√(δ(i)δ(j)) otherwise.

We also note the following relation to the (unnormalized) Laplacian:

ℒ = D^{−1/2} L D^{−1/2}.   (5.1.4)

We check that the normalized Laplacian is symmetric:

ℒ^T = I^T − (D^{−1/2} A D^{−1/2})^T = I − (D^{−1/2})^T A^T (D^{−1/2})^T = I − D^{−1/2} A D^{−1/2} = ℒ.

It is also positive semidefinite. Indeed,

x^T ℒ x = x^T D^{−1/2} L D^{−1/2} x = (D^{−1/2} x)^T L (D^{−1/2} x) ≥ 0,

by the properties of the Laplacian. Hence by the spectral theorem (Theorem 5.1.1), we can write

ℒ = Σ_{i=1}^n η_i z_i z_i^T,

where the zi s are orthonormal eigenvectors of L and the eigenvalues satisfy

0 ≤ η1 ≤ η2 ≤ · · · ≤ ηn .

One more observation: because the constant vector is an eigenvector of L with


eigenvalue 0, we get from (5.1.4) that D1/2 1 is an eigenvector of L with eigenvalue
0. So η1 = 0 and we set
! s
D1/2 1 δ(i)
(z1 )i = 1/2
= P , ∀i ∈ [n],
kD 1k2 i∈V δ(i)
i

which makes z1 into a unit norm vector. The relationship to the Laplacian implies
(see Exercise 5.3) that
!2
T
X xi xj
x Lx = wij p −p ,
{i,j}∈E
δ(i) δ(j)

for x = (x1 , . . . , xn ) ∈ Rn . Through the change of variables


xi
yi = p ,
δ(i)

Courant-Fischer (Theorem 5.1.3) gives this time


(
X
η2 = min wuv (yu − yv )2 :
{u,v}∈E
n n
)
X X
n
y = (y1 , . . . , yn ) ∈ R , δ(u)yu = 0, δ(u)yu2 =1 .
u=1 u=1
(5.1.5)
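A quick numerical check of these facts (NumPy assumed; the small weighted network is an arbitrary choice): relation (5.1.4), η_1 = 0, positive semidefiniteness, and the formula for z_1.

```python
import numpy as np

# A small connected weighted network on 4 vertices.
n = 4
wedges = [(0, 1, 2.0), (1, 2, 1.0), (2, 3, 3.0), (0, 2, 0.5)]
A = np.zeros((n, n))
for i, j, w in wedges:
    A[i, j] = A[j, i] = w
deg = A.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
L = np.diag(deg) - A
Lnorm = np.eye(n) - D_inv_sqrt @ A @ D_inv_sqrt

# Relation (5.1.4): normalized Laplacian = D^{-1/2} L D^{-1/2}.
assert np.allclose(Lnorm, D_inv_sqrt @ L @ D_inv_sqrt)

eta, Z = np.linalg.eigh(Lnorm)
assert np.isclose(eta[0], 0)     # eta_1 = 0
assert (eta >= -1e-10).all()     # positive semidefinite

# z_1 is proportional to D^{1/2} 1: (z_1)_i = sqrt(deg(i)/sum(deg)),
# up to the sign returned by eigh.
z1 = np.sqrt(deg / deg.sum())
s = np.sign(Z[0, 0])
assert np.allclose(s * Z[:, 0], z1)
```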

5.1.3 Perturbation results

We will need some perturbation results for eigenvalues and eigenvectors. Recall the following definition. Define S^{m−1} = {x ∈ R^m : ‖x‖_2 = 1}. The spectral norm (or induced 2-norm, or 2-norm) of a matrix A ∈ R^{n×m} is

‖A‖_2 := max_{0≠x∈R^m} ‖Ax‖_2/‖x‖_2 = max_{x∈S^{m−1}} ‖Ax‖_2.

The induced 2-norm of a matrix has many other useful properties.



Lemma 5.1.12 (Properties of the induced norm). Let A, B ∈ R^{n×m} and α ∈ R. The following hold:

(i) ‖Ax‖_2 ≤ ‖A‖_2 ‖x‖_2, ∀x ∈ R^m;

(ii) ‖A‖_2 ≥ 0;

(iii) ‖A‖_2 = 0 if and only if A = 0;

(iv) ‖αA‖_2 = |α| ‖A‖_2;

(v) ‖A + B‖_2 ≤ ‖A‖_2 + ‖B‖_2;

(vi) ‖AB‖_2 ≤ ‖A‖_2 ‖B‖_2.

Proof. These properties all follow from the definition of the induced norm and the corresponding properties for the vector norm:

• Claims (i) and (ii) are immediate from the definition.

• For (iii), note that ‖A‖_2 = 0 implies ‖Ax‖_2 = 0, ∀x ∈ S^{m−1}, so that Ax = 0, ∀x ∈ S^{m−1}. In particular, A_{ij} = e_i^T A e_j = 0, ∀i, j.

• For (iv), (v), (vi), observe that for all x ∈ S^{m−1}

‖αAx‖_2 = |α| ‖Ax‖_2,

‖(A + B)x‖_2 = ‖Ax + Bx‖_2 ≤ ‖Ax‖_2 + ‖Bx‖_2 ≤ ‖A‖_2 + ‖B‖_2,

‖(AB)x‖_2 = ‖A(Bx)‖_2 ≤ ‖A‖_2 ‖Bx‖_2 ≤ ‖A‖_2 ‖B‖_2.
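The properties in Lemma 5.1.12 can be spot-checked numerically; `np.linalg.norm(·, 2)` computes the spectral norm (largest singular value). A sketch with arbitrary random matrices:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 6))
B = rng.standard_normal((4, 6))
C = rng.standard_normal((6, 3))
x = rng.standard_normal(6)

norm2 = lambda M: np.linalg.norm(M, 2)   # spectral norm

assert np.linalg.norm(A @ x) <= norm2(A) * np.linalg.norm(x) + 1e-10  # (i)
assert norm2(A) >= 0                                                  # (ii)
assert np.isclose(norm2(np.zeros((4, 6))), 0)                         # (iii)
assert np.isclose(norm2(-2.5 * A), 2.5 * norm2(A))                    # (iv)
assert norm2(A + B) <= norm2(A) + norm2(B) + 1e-10                    # (v)
assert norm2(A @ C) <= norm2(A) * norm2(C) + 1e-10                    # (vi)
```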

Perturbations of eigenvalues. For a symmetric matrix C ∈ R^{d×d}, we let λ_j(C), j = 1, ..., d, be the eigenvalues of C in nonincreasing order with corresponding orthonormal eigenvectors v_j(C), j = 1, ..., d. As in the Courant-Fischer theorem (Theorem 5.1.3), define the subspaces

V_k(C) = span(v_1(C), ..., v_k(C)) and W_{d−k+1}(C) = span(v_k(C), ..., v_d(C)).

The following lemma is one version of what is known as Weyl's inequality.

Lemma 5.1.13 (Weyl's inequality). Let A ∈ R^{d×d} and B ∈ R^{d×d} be symmetric matrices. Then

max_{j∈[d]} |λ_j(B) − λ_j(A)| ≤ ‖B − A‖_2.

Proof. Let H = B − A. We prove only one direction of the bound; the other one follows from interchanging the roles of A and B. Fix j ∈ [d]. Because

dim(V_j(B)) + dim(W_{d−j+1}(A)) = j + (d − j + 1) = d + 1 > d,

it follows from (5.1.2) that V_j(B) ∩ W_{d−j+1}(A) contains a nonzero vector. Let v be a unit vector in that intersection. By Theorem 5.1.3,

λ_j(B) ≤ ⟨v, (A + H)v⟩ = ⟨v, Av⟩ + ⟨v, Hv⟩ ≤ λ_j(A) + ⟨v, Hv⟩.

Moreover, by Cauchy-Schwarz (Theorem B.4.8), since ‖v‖_2 = 1,

⟨v, Hv⟩ ≤ ‖v‖_2 ‖Hv‖_2 ≤ ‖H‖_2,

which proves the claim after rearranging.
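Weyl's inequality is easy to check on random instances. A sketch (NumPy assumed; sizes and perturbation scale are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
d = 6
M = rng.standard_normal((d, d))
A = (M + M.T) / 2
H = 0.1 * rng.standard_normal((d, d))
H = (H + H.T) / 2
B = A + H

lam_A = np.linalg.eigvalsh(A)[::-1]   # nonincreasing order
lam_B = np.linalg.eigvalsh(B)[::-1]

# Weyl: every eigenvalue moves by at most the spectral norm of B - A.
assert np.abs(lam_B - lam_A).max() <= np.linalg.norm(B - A, 2) + 1e-10
```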

Perturbations of eigenvectors. While Weyl's inequality (Lemma 5.1.13) indicates that the eigenvalues of A and B are close when ‖A − B‖_2 is small, it says nothing about the eigenvectors. The following theorem remedies that. It is traditionally stated in terms of the angle between the eigenvectors (whence the name). Here we give a version that is more suited to the applications we will encounter. We do not optimize the constants. We use the same notation as in the previous paragraph. Recall Parseval's identity: if u_1, ..., u_d is an orthonormal basis of R^d, then ‖x‖_2² = Σ_{i=1}^d ⟨x, u_i⟩².

Theorem 5.1.14 (Davis-Kahan sin θ theorem). Let A ∈ R^{d×d} and B ∈ R^{d×d} be symmetric matrices. For an i ∈ {1, ..., d}, assume that

δ := min_{j≠i} |λ_i(A) − λ_j(A)| > 0.

Then

min_{s∈{+1,−1}} ‖v_i(A) − s v_i(B)‖_2² ≤ 8‖A − B‖_2²/δ².
Proof. Expand v_i(B) in the basis formed by the eigenvectors of A, that is,

v_i(B) = Σ_{j=1}^d ⟨v_i(B), v_j(A)⟩ v_j(A),

where we used the orthonormality of the v_j(A)s. On the one hand,

‖(A − λ_i(A) I_{d×d}) v_i(B)‖_2²
  = ‖Σ_{j=1}^d ⟨v_i(B), v_j(A)⟩ (A − λ_i(A) I_{d×d}) v_j(A)‖_2²
  = ‖Σ_{j≠i} ⟨v_i(B), v_j(A)⟩ (λ_j(A) − λ_i(A)) v_j(A)‖_2²
  = Σ_{j≠i} ⟨v_i(B), v_j(A)⟩² (λ_j(A) − λ_i(A))²
  ≥ δ² (1 − ⟨v_i(B), v_i(A)⟩²),

where, on the last two lines, we used the orthonormality of the v_j(A)s and v_j(B)s through Parseval's identity, as well as the definition of δ.

On the other hand, letting E = A − B, by the triangle inequality,

‖(A − λ_i(A) I) v_i(B)‖_2 = ‖(B + E − λ_i(A) I) v_i(B)‖_2
  ≤ ‖(B − λ_i(A) I) v_i(B)‖_2 + ‖E v_i(B)‖_2
  ≤ |λ_i(B) − λ_i(A)| ‖v_i(B)‖_2 + ‖E‖_2 ‖v_i(B)‖_2
  ≤ 2‖E‖_2,

where we used Lemma 5.1.12 and Weyl's inequality.

Combining the last two inequalities gives

1 − ⟨v_i(B), v_i(A)⟩² ≤ 4‖E‖_2²/δ².

The result follows by noting that, since |⟨v_i(B), v_i(A)⟩| ≤ 1 by Cauchy-Schwarz (Theorem B.4.8), we have

min_{s∈{+1,−1}} ‖v_i(A) − s v_i(B)‖_2² = 2 − 2|⟨v_i(B), v_i(A)⟩|
  ≤ 2(1 − ⟨v_i(B), v_i(A)⟩²)
  ≤ 8‖E‖_2²/δ².
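A numerical illustration of the Davis-Kahan bound (NumPy assumed; the matrix A, chosen with well-separated eigenvalues so that δ > 0, and the perturbation scale are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(5)
d = 6
# A diagonal matrix with well-separated eigenvalues 0, 3, 6, ..., plus a
# small symmetric perturbation.
A = np.diag(np.arange(d, dtype=float) * 3.0)
H = 0.1 * rng.standard_normal((d, d))
H = (H + H.T) / 2
B = A + H

lam_A, V_A = np.linalg.eigh(A)
lam_B, V_B = np.linalg.eigh(B)

i = 2   # an arbitrary eigenvalue index (0-based)
delta = min(abs(lam_A[i] - lam_A[j]) for j in range(d) if j != i)
bound = 8 * np.linalg.norm(B - A, 2) ** 2 / delta ** 2

# The left-hand side of the theorem, minimized over the sign flip.
lhs = min(np.sum((V_A[:, i] - s * V_B[:, i]) ** 2) for s in (1, -1))
assert lhs <= bound + 1e-10
```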
5.1.4 Data science: community recovery

A common task in network analysis is to recover hidden community structure. Informally, we seek groups of vertices with more edges within the groups than to the rest of the graph. More rigorously, providing statistical guarantees on the output of a community recovery algorithm requires some underlying random graph model. The standard model for this purpose is the stochastic blockmodel, a generalization of the Erdős-Rényi graph model with a "planted partition."

Stochastic blockmodel and recovery requirement. We restrict ourselves to the simple case of two strictly balanced communities. Consider a random graph on n (even) nodes where there are two communities, labeled +1 and −1, each consisting of n/2 nodes. Each vertex i ∈ V is assigned a community label X_i ∈ {+1, −1} as follows: a subset of n/2 vertices is chosen uniformly at random among all such subsets to form community +1, and the rest of the vertices form community −1. For two nodes i, j, the edge {i, j} is present with probability p if they belong to the same community, and with probability q otherwise. All edges are independent. The following 2 × 2 matrix, with rows and columns indexed by the community labels +1 and −1, describes the edge density within and across the two communities:

$$W = \begin{pmatrix} p & q \\ q & p \end{pmatrix}.$$

We assume that p ≥ q, encoding the fact that vertices belonging to the same community are more likely to share an edge. To summarize, we say that (X, G) ∼ SBM_{n,p,q} if:

1. (Communities) The assignment X = (X_1, ..., X_n) is uniformly random over

Π_2^n := {x ∈ {+1, −1}^n : x^T 1 = 0},

where 1 = (1, ..., 1) is the all-one vector.

2. (Graph) Conditioned on X, the graph G = ([n], E) has independent edges, where {i, j} is present with probability W_{X_i, X_j} for all i < j.

We denote the corresponding measure by P_{n,p,q}. We allow p and q to depend on n (although we do not make that dependence explicit).

Roughly speaking, the community recovery problem is the following: given G, output X. There are different notions of recovery.
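The model is straightforward to simulate. The sketch below (NumPy assumed; `sample_sbm` is a helper written for this illustration, and the parameters are arbitrary) draws a balanced assignment and independent edges, then checks the empirical within- and across-community edge densities against p and q.

```python
import numpy as np

def sample_sbm(n, p, q, rng):
    """Sample (X, A) ~ SBM_{n,p,q} with two balanced communities."""
    assert n % 2 == 0
    X = np.array([1] * (n // 2) + [-1] * (n // 2))
    rng.shuffle(X)                       # uniform balanced assignment
    A = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            prob = p if X[i] == X[j] else q
            if rng.random() < prob:
                A[i, j] = A[j, i] = 1.0
    return X, A

rng = np.random.default_rng(6)
X, A = sample_sbm(200, 0.6, 0.1, rng)
assert X.sum() == 0                      # balanced communities
assert np.allclose(A, A.T) and np.all(np.diag(A) == 0)

# Empirical within/across edge densities should be near p and q.
same = np.equal.outer(X, X)
off_diag = ~np.eye(200, dtype=bool)
p_hat = A[same & off_diag].mean()
q_hat = A[~same].mean()
assert abs(p_hat - 0.6) < 0.05 and abs(q_hat - 0.1) < 0.05
```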

Definition 5.1.15 (Agreement). The agreement between two community assignment vectors x, y ∈ {+1, −1}^n is the largest fraction of common assignments between x and ±y, that is,

α(x, y) = max_{s∈{+1,−1}} (1/n) Σ_{i=1}^n 1{x_i = s y_i}.

The role of s in this formula is to account for the fact that the community names are not meaningful.

Now consider the following recovery requirements. These are asymptotic notions, as n → +∞.

Definition 5.1.16 (Recovery requirement). Let (X, G) ∼ SBM_{n,p,q}. For any estimator X̂ := X̂(G) ∈ Π_2^n, we say that it achieves:

- exact recovery if P_{n,p,q}[α(X, X̂) = 1] = 1 − o(1); or

- almost exact recovery if P_{n,p,q}[α(X, X̂) = 1 − o(1)] = 1 − o(1).

Next we establish sufficient conditions for almost exact recovery. First we describe a natural estimator X̂.

MAP estimator and spectral clustering. A natural starting point is the maximum a posteriori (MAP) estimator. Let Ω(X) be the balanced partition of [n] corresponding to X and Ω̂(G) be the one corresponding to X̂(G). The probability of error, that is, the probability of not recovering the true partition, is given by

P[Ω(X) ≠ Ω̂(G)] = Σ_g P[Ω̂(g) ≠ Ω(X) | G = g] P[G = g],   (5.1.6)

where the sum is over all graphs on n vertices (i.e., all possible subsets of edges present) and we dropped the subscript n, p, q to simplify the notation. The MAP estimator Ω̂_MAP(G) is obtained by minimizing each term P[Ω̂(g) ≠ Ω(X) | G = g] individually (note that P[G = g] > 0 for all g by definition of the SBM_{n,p,q}, a probability which does not depend on the estimator). Equivalently we choose for each g a partition γ that maximizes the posterior probability

P[Ω(X) = γ | G = g] = P[G = g | Ω(X) = γ] P[Ω(X) = γ] / P[G = g]
  = P[G = g | Ω(X) = γ] · 1/(|Π_2^n| P[G = g]),   (5.1.7)
where we applied Bayes' rule on the first line and the uniformity of the partition X on the second line.

Based on (5.1.7), we seek a partition that maximizes P[G = g | Ω(X) = γ]. We compute this last probability explicitly. For fixed g, let M := M(g) be the number of edges in g. For any γ, denote by M_in := M_in(g, γ) and M_out := M_out(g, γ) the number of edges within and across communities respectively, and note that M_in = M − M_out. By definition of the SBM_{n,p,q} model, the probability of a graph g given a partition γ is expressed simply as

P[G = g | Ω(X) = γ]
  = q^{M_out} (1 − q)^{(n/2)² − M_out} · p^{M_in} (1 − p)^{[C(n,2) − (n/2)²] − M_in}
  = q^{M_out} (1 − q)^{(n/2)² − M_out} · p^{M − M_out} (1 − p)^{[C(n,2) − (n/2)²] − [M − M_out]}
  = [ (q/(1 − q)) · ((1 − p)/p) ]^{M_out} · { (1 − q)^{(n/2)²} p^M (1 − p)^{[C(n,2) − (n/2)²] − M} },

where (n/2)² is the number of vertex pairs across the two communities and C(n,2) = n(n−1)/2 is the total number of vertex pairs. The expression in curly brackets does not depend on the partition γ. Moreover, since we assume that p ≥ q, we have that (q/(1 − q)) · ((1 − p)/p) ≤ 1 (which can be checked directly by rearranging and cancelling). Therefore, to maximize P[G = g | Ω(X) = γ] over γ for a fixed g, we need to choose a partition that results in the smallest possible value of M_out, the number of edges across the two communities. This problem is well known in combinatorial optimization, where it is referred to as the minimum bisection problem. It is unfortunately NP-hard and we consider a relaxation that admits a polynomial-time algorithmic solution.

To see how this comes about, observe that the minimum bisection problem can be reformulated as

max_{x ∈ {+1,−1}^n, x^T 1 = 0} x^T A x,

where A is the n × n adjacency matrix. (Indeed, for x encoding a balanced partition, x^T A x = 2(M_in − M_out) = 2M − 4M_out, so maximizing x^T A x minimizes M_out.) Replacing the combinatorial constraint x ∈ {+1, −1}^n by x ∈ R^n with ‖x‖_2² = n leads to the relaxation

max_{z ∈ R^n, z^T 1 = 0, ‖z‖_2² = n} z^T A z
  = max_{0 ≠ z ∈ R^n, z^T 1 = 0} (√n z/‖z‖_2)^T A (√n z/‖z‖_2)
  = n max_{0 ≠ z ∈ R^n, z^T 1 = 0} (z^T A z)/(z^T z),

where we changed the notation from x to z to emphasize that the solution no longer encodes a partition. We recognize the Rayleigh quotient of A as the objective function in the final formulation. At this point, it is tempting to use Courant-Fischer (Theorem 5.1.3) and conclude that the maximum above is achieved at the second eigenvalue of A. Note however that the vector 1 (appearing in the orthogonality constraint z^T 1 = 0) is not in general an eigenvector of A (unless the graph happens to be regular). To leverage the variational characterization of eigenvalues in a statistically justified way, we instead turn to the expected adjacency matrix and then establish concentration.
Lemma 5.1.17 (Expected adjacency). Let (X, G) ∼ SBM_{n,p,q}, let A be the adjacency matrix of G, and let A_X = E_{n,p,q}[A | X]. Then

A_X = n((p+q)/2) u_1 u_1^T + n((p−q)/2) u_2 u_2^T − p I,

where

u_1 = (1/√n) 1, u_2 = (1/√n) X.

Proof. For any distinct pair i, j, the term

(n((p+q)/2) u_1 u_1^T)_{i,j} = n((p+q)/2)(1/√n)² = (p+q)/2,

while the term

(n((p−q)/2) u_2 u_2^T)_{i,j} = n((p−q)/2)(1/√n)² X_i X_j = ((p−q)/2) X_i X_j.

The product X_i X_j is 1 when i and j belong to the same community and is −1 otherwise. In the former case, summing the two terms indeed gives p, while in the latter case it gives q. Finally, the term −pI accounts for the fact that A has zeros on the diagonal.

Now condition on X and observe that u_1 and u_2 in Lemma 5.1.17 are orthogonal by our assumption that X corresponds to a balanced partition (i.e., with two communities of equal size). Hence we deduce that an eigenvector decomposition of A_X is formed of u_1, u_2, and any orthonormal basis of the orthogonal complement of the span of u_1 and u_2, with respective eigenvalues

n(p+q)/2 − p,  n(p−q)/2 − p,  −p.

So the second largest eigenvalue of A_X is λ_2(A_X) = n(p−q)/2 − p (independently of X), and Courant-Fischer implies

max_{0 ≠ z ∈ R^n, z^T 1 = 0} (z^T A_X z)/(z^T z) = λ_2(A_X).
The corresponding eigenvector, up to scaling and sign, is precisely what we are trying to recover, namely, the community assignment $X$. These observations motivate the following spectral clustering approach.

1. Input: graph $G$ with adjacency matrix $A$.

2. Compute an eigenvector decomposition of $A$.

3. Let $\hat{u}_2$ be the eigenvector corresponding to the second largest eigenvalue.

4. Output: $\hat{X}(G) = \mathrm{sgn}(\hat{u}_2)$.

Here we used the notation
$$(\mathrm{sgn}(z))_i = \begin{cases} +1 & \text{if } z_i \geq 0,\\ -1 & \text{otherwise.} \end{cases}$$
Because we used $A$ rather than $A_X$ (which we do not know), it is not immediate that this approach will work. Below, we use Davis-Kahan (Theorem 5.1.14) to show that, under some conditions, the second eigenvector of $A$ is concentrated around that of $A_X$, and therefore almost exact recovery holds.
Before getting to the analysis, we make a final algorithmic remark. The “clus-
tering” above, specifically taking the sign of the second eigenvector, works in this
toy model but is perhaps somewhat naive. More generally, in a spectral cluster-
ing method, one uses the top eigenvectors (deciding how many is a bit of an art)
of the adjacency matrix (or of another matrix associated to the graph such as the
Laplacian or normalized Laplacian) to obtain a low-dimensional representation of
the input. Then in a second step, one uses a clustering algorithm, for example,
k-means clustering, to extract communities in the low-dimensional space.
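To make the procedure concrete, here is a minimal numpy sketch of the spectral clustering step above on a graph drawn from the stochastic block model. The function names and parameter values are our own illustration, not the book's notation.

```python
import numpy as np

def sbm_adjacency(X, p, q, rng):
    """Sample the adjacency matrix of an SBM with labels X in {+1,-1}^n."""
    prob = np.where(np.equal.outer(X, X), p, q)   # p within, q across communities
    U = rng.random((len(X), len(X)))
    A = np.triu((U < prob).astype(float), 1)      # independent coin flips above the diagonal
    return A + A.T                                # symmetric, zero diagonal

def spectral_partition(A):
    """Sign of the eigenvector of the second largest eigenvalue of A."""
    _, V = np.linalg.eigh(A)                      # eigenvalues in ascending order
    u2 = V[:, -2]                                 # eigenvector of the second largest eigenvalue
    return np.where(u2 >= 0, 1, -1)

rng = np.random.default_rng(0)
n = 500
X = np.repeat([1, -1], n // 2)                    # balanced communities
A = sbm_adjacency(X, p=0.6, q=0.1, rng=rng)
X_hat = spectral_partition(A)
# The labels are only recovered up to a global sign flip.
misclassified = min(np.sum(X_hat != X), np.sum(X_hat != -X))
```

With these (arbitrary) constants the signal eigenvalue $n(p-q)/2$ dominates the noise, so the partition is recovered with few, if any, errors.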
Almost exact recovery We prove the following result, restricting ourselves to the case where $p$ and $q$ are constants not depending on $n$.

Theorem 5.1.18. Let $(X, G) \sim \mathrm{SBM}_{n,p,q}$ and let $A$ be the adjacency matrix of $G$. Let $\mu := \min\{q, \frac{p-q}{2}\} > 0$. Clustering according to the sign of the second eigenvector of $A$ identifies the two communities of $G$ with probability at least $1 - e^{-n}$, except for $C/\mu^2$ misclassified nodes for some constant $C > 0$.

There are two key ingredients to the proof: concentration of the adjacency matrix
and perturbation arguments.
We start with the former.
Lemma 5.1.19 (Norm of the centered adjacency matrix). Let $(X, G) \sim \mathrm{SBM}_{n,p,q}$, let $A$ be the adjacency matrix of $G$ and let $A_X = \mathbb{E}_{n,p,q}[A \mid X]$. There is a constant $C' > 0$ such that, conditioned on $X$,
$$\|A - A_X\|_2 \leq C' \sqrt{n},$$
with probability at least $1 - e^{-n}$.

Proof. Condition on $X$. We use Theorem 2.4.28 on the random matrix $R := A - A_X$. The entries of $R$ are centered and independent (conditionally on $X$). Moreover they are bounded. Indeed, for $i \neq j$, $A_{ij} \in \{0, 1\}$ while $(A_X)_{ij} \in \{q, p\}$. So $R_{ij} \in [-p, 1 - q]$. On the diagonal, $R_{ii} = 0$. Hence, by Hoeffding's lemma (Lemma 2.4.12), the entries are sub-Gaussian with variance factor
$$\frac{1}{4} (1 - q - (-p))^2 \leq 1.$$
Taking $t = n$ in Theorem 2.4.28, there is a constant $C > 0$ such that with probability $1 - e^{-n}$
$$\|A - A_X\|_2 \leq C \sqrt{1} (\sqrt{n} + \sqrt{n} + \sqrt{n}).$$
Adjusting the constant gives the claim.

We are ready to prove the theorem.

Proof of Theorem 5.1.18. Condition on $X$. To apply the Davis-Kahan theorem (Theorem 5.1.14), we need to bound the smallest gap $\delta$ between the second largest eigenvalue of $A_X$ and its other eigenvalues. Recall that the eigenvalues are
$$n \frac{p+q}{2} - p, \qquad n \frac{p-q}{2} - p, \qquad -p,$$
so
$$\delta = \min\left\{ n \frac{p-q}{2},\; n q \right\} = n \mu > 0.$$
By Davis-Kahan and Lemma 5.1.19, with probability at least $1 - e^{-n}$, there is $\theta \in \{+1, -1\}$ such that
$$\|u_2 - \theta \hat{u}_2\|_2^2 \leq \frac{8 \|A - A_X\|_2^2}{\delta^2} \leq \frac{8 (C' \sqrt{n})^2}{(n \mu)^2} = \frac{C}{n \mu^2},$$
by adjusting the constant. Note that this bound holds for any $X$.

Rearranging and expanding the norm, we get
$$\sum_i \left( \sqrt{n}\, (u_2)_i - \sqrt{n}\, \theta\, (\hat{u}_2)_i \right)^2 \leq \frac{C}{\mu^2}.$$
If the signs of $(u_2)_i$ and $\theta (\hat{u}_2)_i$ disagree, then the $i$-th term in the sum above is $\geq 1$, since $\sqrt{n}\,|(u_2)_i| = 1$. So there can be at most $C/\mu^2$ such disagreements. That establishes the desired bound on the number of misclassified nodes.
Remark 5.1.20. It was shown in [YP14, MNS15a, AS15] that almost exact recovery in the balanced two-community model $\mathrm{SBM}_{n,p_n,q_n}$ with $p_n = a_n/n$ and $q_n = b_n/n$ is achievable (and computationally efficiently so) if and only if
$$\frac{(a_n - b_n)^2}{a_n + b_n} = \omega(1).$$
On the other hand, it was shown in [ABH16, MNS15a] that exact recovery in the $\mathrm{SBM}_{n,p_n,q_n}$ with $p_n = \alpha \log n/n$ and $q_n = \beta \log n/n$ is achievable (and computationally efficiently so) if $\sqrt{\alpha} - \sqrt{\beta} > \sqrt{2}$ and not achievable if $\sqrt{\alpha} - \sqrt{\beta} < \sqrt{2}$.

5.2 Spectral techniques for reversible Markov chains

In this section, we apply the spectral theorem to reversible Markov chains. Throughout, $(X_t)$ is an irreducible Markov chain on a state space $V$ with transition matrix $P$ reversible with respect to a positive stationary measure $\pi > 0$. Recall that this means that $\pi(x) P(x,y) = \pi(y) P(y,x)$ for all $x, y \in V$.

A Hilbert space It will be convenient to introduce a Hilbert space of functions over $V$. Let $\ell^2(V, \pi)$ be the space of functions $f : V \to \mathbb{R}$ such that $\sum_{x \in V} \pi(x) f(x)^2 < +\infty$. Equipped with the following inner product, it forms a Hilbert space (i.e., a real inner product space that is also a complete metric space with respect to the induced metric; see Theorem B.4.10; we will work mostly in finite dimension where it is merely a slight generalization of Euclidean space). For $f, g \in \ell^2(V, \pi)$, define
$$\langle f, g \rangle_\pi := \sum_{x \in V} \pi(x) f(x) g(x),$$
and
$$\|f\|_\pi^2 := \langle f, f \rangle_\pi.$$
The inner product is well-defined since the series is summable by Hölder's inequality (Theorem B.4.8), which implies the Cauchy-Schwarz inequality
$$\langle f, g \rangle_\pi \leq \|f\|_\pi \|g\|_\pi.$$
Minkowski's inequality (Theorem B.4.9) implies the triangle inequality
$$\|f + g\|_\pi \leq \|f\|_\pi + \|g\|_\pi.$$
The integral with respect to $\pi$ (see Appendix B) reduces in this case to a sum
$$\pi(f) := \sum_{x \in V} \pi(x) f(x),$$
provided $\pi(|f|) < +\infty$ or $f \geq 0$. Here $|f|$ is defined as $|f|(x) := |f(x)|$ for all $x \in V$. We also write $\pi f = \pi(f)$ to simplify the notation.
We recall some standard Hilbert space facts. The countable collection of functions $\{f_i\}_{i=1}^\infty$ in $\ell^2(V, \pi)$ is an orthonormal basis if: (i) $\langle f_i, f_j \rangle_\pi = 0$ if $i \neq j$ and $= 1$ if $i = j$; and (ii) any $f \in \ell^2(V, \pi)$ can be written as $\lim_{n \to +\infty} \sum_{i=1}^n \langle f_i, f \rangle_\pi f_i = f$, where the limit is in the norm. We then have Parseval's identity: for any $g \in \ell^2(V, \pi)$,
$$\|g\|_\pi^2 = \sum_{j=1}^\infty \langle g, f_j \rangle_\pi^2. \qquad (5.2.1)$$

Think of $P$ as an operator on $\ell^2(V, \pi)$. That is, let $P f : V \to \mathbb{R}$ be defined as
$$(P f)(x) := \sum_{y \in V} P(x,y) f(y),$$
for $x \in V$. For any $f \in \ell^2(V, \pi)$, $P f$ is well-defined and further we have
$$\|P f\|_\pi \leq \|f\|_\pi. \qquad (5.2.2)$$
Indeed by Cauchy-Schwarz, stochasticity, Fubini and stationarity,
$$\|P |f|\|_\pi^2 = \sum_x \pi(x) \left[ \sum_y P(x,y) |f(y)| \right]^2
\leq \sum_x \pi(x) \left[ \sum_y P(x,y) |f(y)|^2 \right] \left[ \sum_z P(x,z) \right]
= \sum_y \sum_x \pi(x) P(x,y) f(y)^2
= \sum_y \pi(y) f(y)^2
= \|f\|_\pi^2 < +\infty. \qquad (5.2.3)$$
This shows that $P f$ is well-defined since $\pi > 0$ and hence the series in square brackets on the first line is finite for all $x$. Applying the same argument to $\|P f\|_\pi^2$ gives the inequality above.
Everything above holds whether or not $P$ is reversible, so long as $\pi$ is a stationary measure. Now we use reversibility. We claim that, when $P$ is reversible, it is self-adjoint, that is,
$$\langle f, P g \rangle_\pi = \langle P f, g \rangle_\pi \qquad \forall f, g \in \ell^2(V, \pi). \qquad (5.2.4)$$
This follows immediately by reversibility:
$$\langle f, P g \rangle_\pi = \sum_{x \in V} \pi(x) f(x) \sum_{y \in V} P(x,y) g(y)
= \sum_{x \in V} \sum_{y \in V} \pi(y) P(y,x) f(x) g(y)
= \sum_{y \in V} \pi(y) g(y) \sum_{x \in V} P(y,x) f(x)
= \langle P f, g \rangle_\pi,$$
where we argue as in (5.2.3) to justify using Fubini.

Throughout this section, we denote by $\mathbf{0}$ and $\mathbf{1}$ the all-zero and all-one functions respectively.

5.2.1 Spectral gap


In this subsection, we restrict ourselves to a finite state space $V$. Our goal is to bound the mixing time of $(X_t)$ in terms of the eigenvalues of the transition matrix $P$. We assume that $\pi$ is now the stationary distribution, that is, $\sum_{x \in V} \pi(x) = 1$ (which is unique by Theorem 1.1.24 and irreducibility). We also let $n := |V| < +\infty$.

Spectral decomposition Self-adjointness generalizes the notion of a symmetric matrix, with one consequence being that a version of the spectral theorem applies to $P$ (at least in this finite-dimensional case; see Section 5.2.5 for more discussion on this). For completeness, we derive it from Theorem 5.1.1. It will be convenient to assume without loss of generality that $V = [n]$ and identify functions in $\ell^2(V, \pi)$ with vectors in $\mathbb{R}^n$.

Theorem 5.2.1 (Reversibility: spectral theorem). There is an orthonormal basis of $\ell^2(V, \pi)$ formed of real eigenfunctions $\{f_j\}_{j=1}^n$ of $P$ with real eigenvalues $\{\lambda_j\}_{j=1}^n$.
Proof. Let $D_\pi$ be the diagonal matrix with $\pi$ on the diagonal. By reversibility,
$$M(x,y) := (D_\pi^{1/2} P D_\pi^{-1/2})_{x,y} = \sqrt{\frac{\pi(x)}{\pi(y)}}\, P(x,y) = \sqrt{\frac{\pi(y)}{\pi(x)}}\, P(y,x) = (D_\pi^{1/2} P D_\pi^{-1/2})_{y,x} = M(y,x).$$
So $M = (M(x,y))_{x,y} = D_\pi^{1/2} P D_\pi^{-1/2}$ is a symmetric matrix. By the spectral theorem (Theorem 5.1.1), it has real eigenvectors $\{\phi_j\}_{j=1}^n$ forming an orthonormal basis of $\mathbb{R}^n$ with corresponding real eigenvalues $\{\lambda_j\}_{j=1}^n$. Define $f_j := D_\pi^{-1/2} \phi_j$. Then
$$P f_j = P D_\pi^{-1/2} \phi_j = D_\pi^{-1/2} D_\pi^{1/2} P D_\pi^{-1/2} \phi_j = D_\pi^{-1/2} M \phi_j = \lambda_j D_\pi^{-1/2} \phi_j = \lambda_j f_j,$$
and
$$\langle f_i, f_j \rangle_\pi = \langle D_\pi^{-1/2} \phi_i, D_\pi^{-1/2} \phi_j \rangle_\pi = \sum_{x \in V} \pi(x) [\pi(x)^{-1/2} \phi_i(x)][\pi(x)^{-1/2} \phi_j(x)] = \sum_{x \in V} \phi_i(x) \phi_j(x).$$
Because $\{\phi_j\}_{j=1}^n$ is an orthonormal basis of $\mathbb{R}^n$, we have that $\{f_j\}_{j=1}^n$ is an orthonormal basis of $(\mathbb{R}^n, \langle \cdot, \cdot \rangle_\pi)$.

We collect a few more facts about the eigenbasis. Recall that
$$\|f\|_\infty = \max_{x \in V} |f(x)|.$$

Lemma 5.2.2. Any eigenvalue $\lambda$ of $P$ satisfies $|\lambda| \leq 1$.

Proof. It holds that
$$P f = \lambda f \implies |\lambda| \|f\|_\infty = \|P f\|_\infty = \max_x \left| \sum_y P(x,y) f(y) \right| \leq \|f\|_\infty.$$
Rearranging gives the claim.

We order the eigenvalues $1 \geq \lambda_1 \geq \cdots \geq \lambda_n \geq -1$. The second eigenvalue will play an important role below.

Lemma 5.2.3. We have $\lambda_1 = 1$ and $\lambda_2 < 1$. Also we can take $f_1 = \mathbf{1}$.

Proof. Because $P$ is stochastic, the all-one vector is a right eigenvector with eigenvalue $1$. Any eigenfunction with eigenvalue $1$ is harmonic with respect to $P$ on $V$ (see (3.3.2)). By Corollary 3.3.3, for a finite, irreducible chain the only harmonic functions are the constant functions. So the eigenspace corresponding to $1$ is one-dimensional. We must have $\lambda_2 < 1$ by Lemma 5.2.2.

When the chain is aperiodic, it cannot have an eigenvalue $-1$. Exercise 5.9 asks for a proof.

Lemma 5.2.4. If $P$ has an eigenvalue equal to $-1$, then $P$ is not aperiodic.

Lemma 5.2.5. For all $j \neq 1$, $\pi f_j = 0$.

Proof. By orthonormality, $\langle f_1, f_j \rangle_\pi = 0$. Now use the fact that $f_1 = \mathbf{1}$.

Let $\delta_x(y) := \mathbf{1}_{\{x=y\}}$.

Lemma 5.2.6. For all $x, y$,
$$\sum_{j=1}^n f_j(x) f_j(y) = \pi(x)^{-1} \delta_x(y).$$

Proof. Using the notation of Theorem 5.2.1, the matrix $\Phi$ whose columns are the $\phi_j$'s is orthogonal, so $\Phi \Phi^T = I$. That is,
$$\sum_{j=1}^n \phi_j(x) \phi_j(y) = \delta_x(y),$$
or
$$\sum_{j=1}^n \sqrt{\pi(x) \pi(y)}\, f_j(x) f_j(y) = \delta_x(y).$$
Rearranging gives the result.

Using the eigendecomposition of $P$, we get the following expression for its $t$-th power $P^t$.

Theorem 5.2.7 (Spectral decomposition of $P^t$). Let $\{f_j\}_{j=1}^n$ be the eigenfunctions of a reversible and irreducible transition matrix $P$ with corresponding eigenvalues $\{\lambda_j\}_{j=1}^n$, as defined previously. Assume $\lambda_1 \geq \cdots \geq \lambda_n$. We have the decomposition
$$\frac{P^t(x,y)}{\pi(y)} = 1 + \sum_{j=2}^n f_j(x) f_j(y) \lambda_j^t.$$

Proof. Let $F$ be the matrix whose columns are the eigenvectors $\{f_j\}_{j=1}^n$ and let $D_\lambda$ be the diagonal matrix with $\{\lambda_j\}_{j=1}^n$ on the diagonal. Using the notation in the proof of Theorem 5.2.1,
$$D_\pi^{1/2} P^t D_\pi^{-1/2} = M^t = (D_\pi^{1/2} F) D_\lambda^t (D_\pi^{1/2} F)^T,$$
which after rearranging becomes
$$P^t D_\pi^{-1} = F D_\lambda^t F^T.$$
Expanding and using Lemma 5.2.3 gives the result.

Example 5.2.8 (Two-state chain). Let $V := \{0, 1\}$ and
$$P := \begin{pmatrix} 1 - \alpha & \alpha \\ \beta & 1 - \beta \end{pmatrix},$$
for $\alpha, \beta \in (0,1)$. Observe that $P$ is reversible with respect to the stationary distribution
$$\pi := \left( \frac{\beta}{\alpha + \beta}, \frac{\alpha}{\alpha + \beta} \right).$$
We know that $f_1 = \mathbf{1}$ is an eigenfunction with eigenvalue $1$. As can be checked by direct computation, the other eigenfunction (in vector form) is
$$f_2 := \left( \sqrt{\frac{\alpha}{\beta}},\; -\sqrt{\frac{\beta}{\alpha}} \right),$$
with eigenvalue $\lambda_2 := 1 - \alpha - \beta$. We normalized $f_2$ so that $\|f_2\|_\pi^2 = 1$.

By Theorem 5.2.7, the spectral decomposition at time $t$ is therefore
$$P^t D_\pi^{-1} = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix} + (1 - \alpha - \beta)^t \begin{pmatrix} \frac{\alpha}{\beta} & -1 \\ -1 & \frac{\beta}{\alpha} \end{pmatrix}.$$
Or, rearranging,
$$P^t = \begin{pmatrix} \frac{\beta}{\alpha+\beta} & \frac{\alpha}{\alpha+\beta} \\ \frac{\beta}{\alpha+\beta} & \frac{\alpha}{\alpha+\beta} \end{pmatrix} + (1 - \alpha - \beta)^t \begin{pmatrix} \frac{\alpha}{\alpha+\beta} & -\frac{\alpha}{\alpha+\beta} \\ -\frac{\beta}{\alpha+\beta} & \frac{\beta}{\alpha+\beta} \end{pmatrix}.$$
Note for instance that the case $\alpha + \beta = 1$ corresponds to a rank-one $P$, which immediately converges to stationarity.

Assume $\beta \geq \alpha$. Then, by (1.1.6) and Lemma 4.1.9,
$$d(t) = \max_x \frac{1}{2} \sum_y |P^t(x,y) - \pi(y)| = \frac{\beta}{\alpha + \beta} |1 - \alpha - \beta|^t.$$
As a result,
$$t_{\mathrm{mix}}(\varepsilon) = \left\lceil \frac{\log\left(\varepsilon \frac{\alpha+\beta}{\beta}\right)}{\log |1 - \alpha - \beta|} \right\rceil = \left\lceil \frac{\log \varepsilon^{-1} - \log\left(\frac{\alpha+\beta}{\beta}\right)}{\log |1 - \alpha - \beta|^{-1}} \right\rceil.$$
J
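The closed form above can be checked numerically against direct matrix powers (our own verification script, not from the text):

```python
import numpy as np

alpha, beta = 0.3, 0.5
P = np.array([[1 - alpha, alpha],
              [beta, 1 - beta]])
pi = np.array([beta, alpha]) / (alpha + beta)

def P_t(t):
    """Closed form for P^t from the spectral decomposition of Example 5.2.8."""
    stationary = np.tile(pi, (2, 1))                    # rank-one stationary part
    fluctuation = np.array([[alpha, -alpha],
                            [-beta, beta]]) / (alpha + beta)
    return stationary + (1 - alpha - beta) ** t * fluctuation

t = 7
Pt_formula = P_t(t)
Pt_direct = np.linalg.matrix_power(P, t)
# Worst-case total variation distance over the two starting states.
d_t = 0.5 * np.max(np.abs(Pt_direct - pi).sum(axis=1))
```

Here $\beta \geq \alpha$, so $d(t)$ should equal $\frac{\beta}{\alpha+\beta}|1-\alpha-\beta|^t$ exactly.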

Spectral gap and mixing Assume further that $P$ is aperiodic. Recall that by the convergence theorem (Theorem 1.1.33), for all $x, y$, $P^t(x,y) \to \pi(y)$ as $t \to +\infty$, and that the mixing time (Definition 1.1.35) is
$$t_{\mathrm{mix}}(\varepsilon) := \min\{t \geq 0 : d(t) \leq \varepsilon\},$$
where $d(t) := \max_{x \in V} \|P^t(x, \cdot) - \pi(\cdot)\|_{\mathrm{TV}}$. It will be convenient to work with a different notion of distance.

Definition 5.2.9 (Separation distance). The separation distance is defined as
$$s_x(t) := \max_{y \in V} \left[ 1 - \frac{P^t(x,y)}{\pi(y)} \right],$$
and we let $s(t) := \max_{x \in V} s_x(t)$.

Lemma 5.2.10 (Separation distance and total variation distance). $d(t) \leq s(t)$.

Proof. By Lemma 4.1.15,
$$\|P^t(x, \cdot) - \pi(\cdot)\|_{\mathrm{TV}} = \sum_{y : P^t(x,y) < \pi(y)} \left[ \pi(y) - P^t(x,y) \right] = \sum_{y : P^t(x,y) < \pi(y)} \pi(y) \left[ 1 - \frac{P^t(x,y)}{\pi(y)} \right] \leq s_x(t).$$
Since this holds for any $x$, the claim follows.

It follows from the spectral decomposition (Theorem 5.2.7) that the speed of convergence of $P^t(x,y)$ to $\pi(y)$ is dominated by the largest eigenvalue of $P$ not equal to $1$.

Definition 5.2.11 (Spectral gap). The absolute spectral gap is $\gamma_* := 1 - \lambda_*$, where $\lambda_* := |\lambda_2| \vee |\lambda_n|$. The spectral gap is $\gamma := 1 - \lambda_2$.

By Lemmas 5.2.3 and 5.2.4, we have $\gamma_* > 0$ when $P$ is irreducible and aperiodic. Note that the eigenvalues of the lazy version $\frac{1}{2} P + \frac{1}{2} I$ of $P$ are $\{\frac{1}{2}(\lambda_j + 1)\}_{j=1}^n$, which are all nonnegative. So, there, $\gamma_* = \gamma$.

Definition 5.2.12 (Relaxation time). The relaxation time is defined as
$$t_{\mathrm{rel}} := \gamma_*^{-1}.$$

Example 5.2.13 (Two-state chain (continued)). Returning to Example 5.2.8, there are two cases:

• $\alpha + \beta \leq 1$: In that case the (absolute) spectral gap is $\gamma_* = \gamma = \alpha + \beta$ and the relaxation time is $t_{\mathrm{rel}} = 1/(\alpha + \beta)$.

• $\alpha + \beta > 1$: In that case the absolute spectral gap is $\gamma_* = 2 - \alpha - \beta$ and the relaxation time is $t_{\mathrm{rel}} = 1/(2 - \alpha - \beta)$.

J
The following result clarifies the relationship between the mixing and relaxation times. Let $\pi_{\min} := \min_x \pi(x)$.

Theorem 5.2.14 (Mixing time and relaxation time). Let $P$ be reversible, irreducible, and aperiodic with positive stationary distribution $\pi$. For all $\varepsilon > 0$,
$$(t_{\mathrm{rel}} - 1) \log\left(\frac{1}{2\varepsilon}\right) \leq t_{\mathrm{mix}}(\varepsilon) \leq \log\left(\frac{1}{\varepsilon \pi_{\min}}\right) t_{\mathrm{rel}}.$$

Proof. We start with the upper bound. By Lemma 5.2.10, it suffices to find $t$ such that $s(t) \leq \varepsilon$. By the spectral decomposition and Cauchy-Schwarz,
$$\left| \frac{P^t(x,y)}{\pi(y)} - 1 \right| \leq \lambda_*^t \sum_{j=2}^n |f_j(x) f_j(y)| \leq \lambda_*^t \sqrt{\sum_{j=2}^n f_j(x)^2} \sqrt{\sum_{j=2}^n f_j(y)^2}.$$
By Lemma 5.2.6, $\sum_{j=2}^n f_j(x)^2 \leq \pi(x)^{-1}$. Plugging this back above, we get
$$\left| \frac{P^t(x,y)}{\pi(y)} - 1 \right| \leq \lambda_*^t \sqrt{\pi(x)^{-1} \pi(y)^{-1}} \leq \frac{\lambda_*^t}{\pi_{\min}} = \frac{(1 - \gamma_*)^t}{\pi_{\min}} \leq \frac{e^{-\gamma_* t}}{\pi_{\min}}, \qquad (5.2.5)$$

where we used that 1 − z ≤ e−z for all z ∈ R (see Exercise


 1.16). Observe that
1
the right-hand side is less than ε when t ≥ log επmin trel .
For the lower bound, let f∗ be an eigenfunction associated with an eigenvalue
achieving λ∗ := |λ2 | ∨ |λn |. Let z be such that |f∗ (z)| = kf∗ k∞ . By Lemma 5.2.5,
πf∗ = 0. Hence

λt∗ |f∗ (z)| = |P t f∗ (z)|


X
= [P t (z, y)f∗ (y) − π(y)f∗ (y)]
y
X
≤ kf∗ k∞ |P t (z, y) − π(y)| ≤ kf∗ k∞ 2d(t),
y

t (ε)
so d(t) ≥ 21 λt∗ . When t = tmix (ε), ε ≥ d(tmix (ε)) ≥ 12 λ∗mix . Therefore,
rearranging and taking a logarithm, we get
     
1 1 1
tmix (ε) − 1 ≥ tmix (ε) log ≥ log ,
λ∗ λ∗ 2ε

where we used z = 1 − λ−1 in 1 − z ≤ e−z to get the first inequality. The result
 −1 ∗  −1  −1
γ∗
follows from λ1∗ − 1 = 1−λλ∗

= 1−γ∗ = trel − 1.
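The quantities in Theorem 5.2.14 are easy to compute numerically. The following sketch (our own, on an arbitrary small chain) obtains $\gamma_*$ via the symmetrization $M = D_\pi^{1/2} P D_\pi^{-1/2}$ from Theorem 5.2.1 and checks the bound $d(t) \leq s(t) \leq e^{-\gamma_* t}/\pi_{\min}$ implied by (5.2.5):

```python
import numpy as np

# Lazy simple random walk on a path with 6 vertices (reversible, aperiodic).
n = 6
P = np.zeros((n, n))
for x in range(n):
    nbrs = [y for y in (x - 1, x + 1) if 0 <= y < n]
    P[x, x] = 0.5
    for y in nbrs:
        P[x, y] = 0.5 / len(nbrs)
deg = np.array([1.0] + [2.0] * (n - 2) + [1.0])
pi = deg / deg.sum()                          # stationary distribution: proportional to degree

# Eigenvalues of P via the symmetric matrix M = D^{1/2} P D^{-1/2}.
M = np.sqrt(pi)[:, None] * P / np.sqrt(pi)[None, :]
lam = np.sort(np.linalg.eigvalsh(M))[::-1]    # descending
gamma_star = 1 - max(abs(lam[1]), abs(lam[-1]))
t_rel = 1 / gamma_star

t = 60
Pt = np.linalg.matrix_power(P, t)
d_t = 0.5 * np.max(np.abs(Pt - pi).sum(axis=1))
```

Since $M$ and $P$ are similar matrices, `eigvalsh` on the symmetric $M$ returns the (real) eigenvalues of $P$.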

5.2.2 Random walks: a spectral look at cycles and hypercubes

We illustrate the results of the previous subsection on random walk on cycles and hypercubes.

Random walk on a cycle

Consider simple random walk on the $n$-cycle (see Example 1.1.17). That is, $V := \{0, 1, \ldots, n-1\}$ and $P(x,y) = 1/2$ if and only if $|x - y| = 1 \bmod n$. We assume that $n$ is odd to avoid periodicity issues. Let $\pi \equiv n^{-1}$ be the stationary distribution (by symmetry and $|V| = n$). We showed in Section 4.3.2 that (for the lazy version of the chain) the mixing time is $t_{\mathrm{mix}}(\varepsilon) = \Theta(n^2)$.

Here we use spectral techniques. We first compute the eigendecomposition, which in this case can be determined explicitly.

Lemma 5.2.15 (Cycle: eigenbasis). For $j = 1, \ldots, n-1$, the function
$$g_j(x) := \sqrt{2} \cos\left(\frac{2\pi j x}{n}\right), \qquad x = 0, 1, \ldots, n-1,$$
is an eigenfunction of $P$ with eigenvalue
$$\mu_j := \cos\left(\frac{2\pi j}{n}\right),$$
and $g_0 = \mathbf{1}$ is an eigenfunction with eigenvalue $1$. Moreover the $g_j$'s are orthonormal in $\ell^2(V, \pi)$.
Proof. We know from Lemma 5.2.3 that $\mathbf{1}$ is an eigenfunction with eigenvalue $1$. Let $j \in \{1, \ldots, n-1\}$. Note that, for all $x$, switching momentarily to the complex representation (where we use $i$ for the imaginary unit),
$$\sum_y P(x,y) g_j(y) = \frac{1}{2} \left[ \sqrt{2} \cos\left(\frac{2\pi j (x-1)}{n}\right) + \sqrt{2} \cos\left(\frac{2\pi j (x+1)}{n}\right) \right]
= \frac{\sqrt{2}}{2} \left[ \frac{e^{i \frac{2\pi j(x-1)}{n}} + e^{-i \frac{2\pi j(x-1)}{n}}}{2} + \frac{e^{i \frac{2\pi j(x+1)}{n}} + e^{-i \frac{2\pi j(x+1)}{n}}}{2} \right]
= \sqrt{2} \left[ \frac{e^{i \frac{2\pi j x}{n}} + e^{-i \frac{2\pi j x}{n}}}{2} \right] \left[ \frac{e^{i \frac{2\pi j}{n}} + e^{-i \frac{2\pi j}{n}}}{2} \right]
= \sqrt{2} \cos\left(\frac{2\pi j x}{n}\right) \cos\left(\frac{2\pi j}{n}\right)
= \cos\left(\frac{2\pi j}{n}\right) g_j(x).$$

The orthonormality follows from standard trigonometric identities. We prove only that the $g_j$'s have unit norm. We use the Dirichlet kernel (see Exercise 5.8)
$$1 + 2 \sum_{k=1}^n \cos k\theta = \frac{\sin((n + 1/2)\theta)}{\sin(\theta/2)},$$
for $\theta \neq 0$, and the identity $\cos^2(\theta) = \frac{1}{2}(1 + \cos(2\theta))$. For $j = 0$, $g_j = \mathbf{1}$ and the norm squared is $\sum_x \pi(x) = 1$. For $j \neq 0$, we have that $\|g_j\|_\pi^2$ is
$$\sum_{x \in V} \pi(x) g_j(x)^2 = \frac{1}{n} \sum_{x=0}^{n-1} 2 \cos^2\left(\frac{2\pi j x}{n}\right)
= \frac{1}{n} \sum_{x=0}^{n-1} \left[ 1 + \cos\left(\frac{4\pi j x}{n}\right) \right]
= 1 + \frac{1}{n} \sum_{k=1}^n \cos\left(k \frac{4\pi j}{n}\right)
= 1 + \frac{1}{2n} \left[ \frac{\sin((n + 1/2)(4\pi j / n))}{\sin((4\pi j / n)/2)} - 1 \right],$$
which is indeed $1$, since $\sin((n + 1/2)(4\pi j/n)) = \sin(4\pi j + 2\pi j/n) = \sin(2\pi j/n)$, so the bracket vanishes.

From the eigenvalues, we derive the relaxation time (Definition 5.2.12) analytically.

Theorem 5.2.16 (Cycle: relaxation time). The relaxation time for lazy simple random walk on the $n$-cycle is
$$t_{\mathrm{rel}} = \frac{1}{1 - \cos\left(\frac{2\pi}{n}\right)} = \Theta(n^2).$$

Proof. By Lemma 5.2.15, the absolute spectral gap is $1 - \cos\left(\frac{2\pi}{n}\right)$, using that $n$ is odd. By a Taylor expansion,
$$1 - \cos\left(\frac{2\pi}{n}\right) = \frac{2\pi^2}{n^2} + O(n^{-4}).$$

Since $\pi_{\min} = 1/n$, we get $t_{\mathrm{mix}}(\varepsilon) = O(n^2 \log n)$ and $t_{\mathrm{mix}}(\varepsilon) = \Omega(n^2)$ by Theorem 5.2.14.
It turns out our upper bound is off by a logarithmic factor. A sharper bound on the mixing time can be obtained by working directly with the spectral decomposition. By Lemma 4.1.9 and Cauchy-Schwarz (Theorem B.4.8), for any $x \in V$,
$$4 \|P^t(x, \cdot) - \pi(\cdot)\|_{\mathrm{TV}}^2 = \left\{ \sum_y \pi(y) \left| \frac{P^t(x,y)}{\pi(y)} - 1 \right| \right\}^2
\leq \sum_y \pi(y) \left( \frac{P^t(x,y)}{\pi(y)} - 1 \right)^2
= \left\| \sum_{j=1}^{n-1} \mu_j^t g_j(x) g_j \right\|_\pi^2
= \sum_{j=1}^{n-1} \mu_j^{2t} g_j(x)^2,$$
where we used the spectral decomposition of $P^t$ (Theorem 5.2.7) in the third step and Parseval's identity (i.e., (5.2.1)) in the last step.

Here comes the trick: the total variation distance does not depend on the starting point $x$ by symmetry. Multiplying by $\pi(x)$ and summing over $x$ (on the right-hand side only) gives
$$4 \|P^t(x, \cdot) - \pi(\cdot)\|_{\mathrm{TV}}^2 \leq \sum_x \pi(x) \sum_{j=1}^{n-1} \mu_j^{2t} g_j(x)^2 = \sum_{j=1}^{n-1} \mu_j^{2t} \sum_x \pi(x) g_j(x)^2 = \sum_{j=1}^{n-1} \mu_j^{2t},$$
where we used that $\|g_j\|_\pi^2 = 1$.

We get
$$4 d(t)^2 \leq \sum_{j=1}^{n-1} \cos^{2t}\left(\frac{2\pi j}{n}\right) = 2 \sum_{j=1}^{(n-1)/2} \cos^{2t}\left(\frac{2\pi j}{n}\right).$$
For $x \in [0, \pi/2)$, $\cos x \leq e^{-x^2/2}$ (see Exercise 1.16). Then
$$4 d(t)^2 \leq 2 \sum_{j=1}^{(n-1)/2} \exp\left(-\frac{4\pi^2 j^2}{n^2} t\right)
\leq 2 \exp\left(-\frac{4\pi^2}{n^2} t\right) \sum_{j=1}^{\infty} \exp\left(-\frac{4\pi^2 (j^2 - 1)}{n^2} t\right)
\leq 2 \exp\left(-\frac{4\pi^2}{n^2} t\right) \sum_{\ell=0}^{\infty} \exp\left(-\frac{4\pi^2 t}{n^2} \ell\right)
= \frac{2 \exp\left(-\frac{4\pi^2}{n^2} t\right)}{1 - \exp\left(-\frac{4\pi^2}{n^2} t\right)}.$$
So $t_{\mathrm{mix}}(\varepsilon) = O(n^2)$.
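A short check of Lemma 5.2.15 (our own script, not from the text): each $g_j$ satisfies $P g_j = \mu_j g_j$ and has unit norm in $\ell^2(V, \pi)$ for a small odd cycle.

```python
import numpy as np

n = 9                                          # odd cycle to avoid periodicity issues
P = np.zeros((n, n))
for x in range(n):
    P[x, (x - 1) % n] = 0.5
    P[x, (x + 1) % n] = 0.5

x = np.arange(n)
eig_ok, norm_ok = True, True
for j in range(n):
    g = np.sqrt(2) * np.cos(2 * np.pi * j * x / n) if j > 0 else np.ones(n)
    mu = np.cos(2 * np.pi * j / n)             # equals 1 when j = 0
    eig_ok = eig_ok and np.allclose(P @ g, mu * g)
    norm_ok = norm_ok and np.isclose(np.sum(g**2) / n, 1.0)   # pi = 1/n

# Second largest eigenvalue of P should be cos(2*pi/n).
lam2 = np.sort(np.linalg.eigvalsh(P))[-2]
```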

Random walk on the hypercube

Consider simple random walk on the hypercube $V := \{-1, +1\}^n$ where $x \sim y$ if they differ at exactly one coordinate. We consider the lazy version to avoid issues of periodicity (see Example 1.1.31). Let $P$ be the transition matrix and let $\pi \equiv 2^{-n}$ be the stationary distribution (by symmetry and $|V| = 2^n$). We showed in Section 4.3.2 that $t_{\mathrm{mix}}(\varepsilon) = \Theta(n \log n)$. Here we use spectral techniques.

For $J \subseteq [n]$, we let
$$\chi_J(x) = \prod_{j \in J} x_j, \qquad x \in V.$$
These are called parity functions. We show that the parity functions form an eigenbasis of the transition matrix.

Lemma 5.2.17 (Hypercube: eigenbasis). For all $J \subseteq [n]$, the function $\chi_J$ is an eigenfunction of $P$ with eigenvalue
$$\mu_J := \frac{n - |J|}{n}.$$
Moreover the $\chi_J$'s are orthonormal in $\ell^2(V, \pi)$.
Proof. For $x \in V$ and $i \in [n]$, let $x^{[i]}$ be $x$ where coordinate $i$ is flipped. Note that, for all $J, x$,
$$\sum_y P(x,y) \chi_J(y) = \frac{1}{2} \chi_J(x) + \frac{1}{2} \sum_{i=1}^n \frac{1}{n} \chi_J(x^{[i]})
= \left[ \frac{1}{2} + \frac{1}{2} \frac{n - |J|}{n} \right] \chi_J(x) - \frac{1}{2} \frac{|J|}{n} \chi_J(x)
= \frac{n - |J|}{n} \chi_J(x).$$
For the orthonormality, note that
$$\sum_{x \in V} \pi(x) \chi_J(x)^2 = \sum_{x \in V} \frac{1}{2^n} \prod_{j \in J} x_j^2 = 1.$$
For $J \neq J' \subseteq [n]$, factoring the sum coordinate by coordinate,
$$\sum_{x \in V} \pi(x) \chi_J(x) \chi_{J'}(x)
= \sum_{x \in V} \frac{1}{2^n} \prod_{j \in J \cap J'} x_j^2 \prod_{j \in J \setminus J'} x_j \prod_{j \in J' \setminus J} x_j
= \frac{2^{|J \cap J'|}}{2^n} \prod_{j \in J \setminus J'} \left( \sum_{x_j \in \{-1,+1\}} x_j \right) \prod_{j \in J' \setminus J} \left( \sum_{x_j \in \{-1,+1\}} x_j \right) \prod_{j \notin J \cup J'} \left( \sum_{x_j \in \{-1,+1\}} 1 \right) \frac{1}{2^{\,n - |J \cup J'|}} \cdot 2^{\,n - |J \cup J'|} \cdot \frac{1}{2^{|J \cap J'|}} \cdot 2^{|J \cap J'|}
= 0,$$
since at least one of $J \setminus J'$ or $J' \setminus J$ is nonempty and $\sum_{x_j \in \{-1,+1\}} x_j = 0$.

From the eigenvalues, we obtain the relaxation time.

Theorem 5.2.18 (Hypercube: relaxation time). The relaxation time for lazy simple random walk on the $n$-dimensional hypercube is
$$t_{\mathrm{rel}} = n.$$

Proof. From Lemma 5.2.17, the absolute spectral gap is
$$\gamma_* = \gamma = 1 - \frac{n-1}{n} = \frac{1}{n}.$$

Note that $\pi_{\min} = 1/2^n$. Hence, by Theorem 5.2.14, we have $t_{\mathrm{mix}}(\varepsilon) = O(n^2)$ and $t_{\mathrm{mix}}(\varepsilon) = \Omega(n)$. Those bounds, it turns out, are both off.

As we did for the cycle, we obtain a sharper upper bound by working directly with the spectral decomposition. By the same argument we used there,
$$4 d(t)^2 \leq \sum_{J \neq \emptyset} \mu_J^{2t}.$$
Then
$$4 d(t)^2 \leq \sum_{J \neq \emptyset} \left( \frac{n - |J|}{n} \right)^{2t}
= \sum_{\ell=1}^n \binom{n}{\ell} \left( 1 - \frac{\ell}{n} \right)^{2t}
\leq \sum_{\ell=1}^n \binom{n}{\ell} \exp\left( -\frac{2t\ell}{n} \right)
= \left( 1 + \exp\left(-\frac{2t}{n}\right) \right)^n - 1,$$
where we used that $1 - x \leq e^{-x}$ for all $x$ (see Exercise 1.16). So, by definition, $t_{\mathrm{mix}}(\varepsilon) \leq \frac{1}{2} n \log n + O(n)$.
Remark 5.2.19. In fact, lazy simple random walk on the n-dimensional hypercube has
a “cutoff” at (1/2)n log n. Roughly speaking, within a time window of size O(n), the
total variation distance to the stationary distribution goes from near 1 to near 0. See,
e.g., [LPW06, Section 18.2.2].
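Lemma 5.2.17 and Theorem 5.2.18 can be verified by brute force for a small $n$ (our own script, not from the text):

```python
import numpy as np
from itertools import product

n = 4
states = list(product([-1, 1], repeat=n))       # hypercube {-1,+1}^n
idx = {s: i for i, s in enumerate(states)}
N = len(states)
P = np.zeros((N, N))
for s in states:
    i = idx[s]
    P[i, i] = 0.5                               # lazy step
    for k in range(n):
        flipped = list(s)
        flipped[k] = -flipped[k]
        P[i, idx[tuple(flipped)]] = 0.5 / n

# Every parity function chi_J is an eigenfunction with eigenvalue (n - |J|)/n.
parity_ok = True
for J in product([0, 1], repeat=n):
    chi = np.array([np.prod([s[k] for k in range(n) if J[k]]) for s in states])
    mu = (n - sum(J)) / n
    parity_ok = parity_ok and np.allclose(P @ chi, mu * chi)

# pi is uniform, so P is symmetric and its spectrum is directly accessible.
lam = np.sort(np.linalg.eigvalsh(P))[::-1]
gamma_star = 1 - lam[1]                         # all eigenvalues are nonnegative (lazy chain)
```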
5.2.3 Markov chains: Varopoulos-Carne and diameter-based bounds on the mixing time

If $(S_t)$ is simple random walk on $\mathbb{Z}$, then Lemma 2.4.3 guarantees that for any $x, y \in \mathbb{Z}$,
$$P^t(x,y) \leq e^{-|x-y|^2 / 2t}, \qquad (5.2.6)$$
where $P$ is the transition matrix of $(S_t)$. Interestingly a similar bound holds for any reversible Markov chain, and Lemma 2.4.3 plays an unexpected role in its proof. An application to mixing times is discussed below.

Varopoulos-Carne bound

Our main bound is the following. Recall that a reversible Markov chain is equivalent to a random walk on the network corresponding to its positive transition probabilities (see Definition 1.2.7 and the discussion following it).

Theorem 5.2.20 (Varopoulos-Carne bound). Let $P$ be the transition matrix of an irreducible Markov chain $(X_t)$ on the countable state space $V$. Assume further that $P$ is reversible with respect to the stationary measure $\pi$ and that the corresponding network $\mathcal{N}$ is locally finite. Then the following holds:
$$\forall x, y \in V,\ \forall t \in \mathbb{N}, \qquad P^t(x,y) \leq 2 \sqrt{\frac{\pi(y)}{\pi(x)}}\, e^{-\rho(x,y)^2 / 2t},$$
where $\rho(x,y)$ is the graph distance between $x$ and $y$ on $\mathcal{N}$.

As a sanity check before proving the theorem, note that if the chain is aperiodic and $\pi$ is the stationary distribution, then by the convergence theorem (Theorem 1.1.33)
$$P^t(x,y) \to \pi(y) \leq 2 \sqrt{\frac{\pi(y)}{\pi(x)}}, \qquad \text{as } t \to +\infty,$$
since $\pi(x), \pi(y) \leq 1$.
since π(x), π(y) ≤ 1.

Proof of Theorem 5.2.20. The idea of the proof is to show that


s
π(y)
P t (x, y) ≤ 2 P[St ≥ ρ(x, y)],
π(x)

where again (St ) is simple random walk on Z started at 0, and then use the Chernoff
bound (Lemma 2.4.3).
CHAPTER 5. SPECTRAL METHODS 366

By the local finiteness assumption, only a finite number of states can be reached
by time t. Hence we can reduce the problem to a finite state space. More precisely,
let Ṽ = {z ∈ V : ρ(x, z) ≤ t} and for z, w ∈ Ṽ
(
P (z, w) if z 6= w,
P̃ (z, w) =
P (z, z) + P (z, V \ Ṽ ) otherwise.

By construction P̃ is reversible with respect to π̃ = π/π(Ṽ ) on Ṽ . Because


within time t one never reaches a state z where P (z, V \ Ṽ ) > 0, by Chapman-
Kolmogorov (Theorem 1.1.20) and using the fact that π̃(y)/π̃(x) = π(y)/π(x), it
suffices to prove the result for P̃ . Hence we assume without loss of generality that
V is finite with |V | = n.
To relate $(X_t)$ to simple random walk on $\mathbb{Z}$, we use a special representation of $P^t$ based on Chebyshev polynomials. For $\xi = \cos\theta \in [-1, 1]$,
$$T_k(\xi) = \cos k\theta$$
is a Chebyshev polynomial of the first kind. Note that $|T_k(\xi)| \leq 1$ on $[-1,1]$ by definition. The classical trigonometric identity (to see this, write it in complex form)
$$\cos((k+1)\theta) + \cos((k-1)\theta) = 2 \cos\theta \cos(k\theta)$$
implies the recursion
$$T_{k+1}(\xi) + T_{k-1}(\xi) = 2\xi\, T_k(\xi),$$
which in turn implies that $T_k$ is indeed a polynomial. It has degree $k$ from induction and the fact that $T_0(\xi) = 1$ and $T_1(\xi) = \xi$. The connection to simple random walk on $\mathbb{Z}$ comes from the following somewhat miraculous representation (which does not rely on reversibility). Let $T_k(P)$ denote the polynomial $T_k$ evaluated at $P$ as a matrix polynomial.

Lemma 5.2.21.
$$P^t = \sum_{k=-t}^{t} \mathbb{P}[S_t = k]\, T_{|k|}(P).$$

Proof. It suffices to prove
$$\xi^t = \sum_{k=-t}^{t} \mathbb{P}[S_t = k]\, T_{|k|}(\xi),$$
as an identity of polynomials. By the binomial theorem (Appendix A),
$$\xi^t = \left( \frac{e^{i\theta} + e^{-i\theta}}{2} \right)^t = 2^{-t} \sum_{\ell=0}^{t} \binom{t}{\ell} (e^{i\theta})^\ell (e^{-i\theta})^{t-\ell} = \sum_{k=-t}^{t} \mathbb{P}[S_t = k]\, e^{ik\theta},$$
where we used that
$$S_t = -t + 2\ell = (+1)\ell + (-1)(t - \ell)$$
corresponds to the event of making $\ell$ steps to the right and $t - \ell$ steps to the left. Now take real parts on both sides and use that $\cos(k\theta) = \cos(-k\theta)$ to get the claim. (Put differently, $(\cos\theta)^t$ is the characteristic function $\mathbb{E}[e^{i\theta S_t}]$ of $S_t$.)
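Lemma 5.2.21 is easy to test numerically (our own script; the walk probabilities are $\mathbb{P}[S_t = -t + 2\ell] = \binom{t}{\ell} 2^{-t}$, and the representation holds for any stochastic matrix):

```python
import numpy as np
from math import comb

def T(P, k):
    """Chebyshev polynomial T_k evaluated at the matrix P via the recursion."""
    if k == 0:
        return np.eye(len(P))
    T_prev, T_cur = np.eye(len(P)), P.copy()
    for _ in range(k - 1):
        T_prev, T_cur = T_cur, 2 * P @ T_cur - T_prev
    return T_cur

# An arbitrary small stochastic matrix (birth-death chain on 3 states).
P = np.array([[0.5, 0.5, 0.0],
              [0.25, 0.5, 0.25],
              [0.0, 0.5, 0.5]])
t = 6
rhs = sum(comb(t, (t + k) // 2) / 2**t * T(P, abs(k))
          for k in range(-t, t + 1) if (t + k) % 2 == 0)
lhs = np.linalg.matrix_power(P, t)
```

The parity filter `(t + k) % 2 == 0` reflects the fact that $S_t$ has the same parity as $t$.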

We bound $T_k(P)(x,y)$ as follows.

Lemma 5.2.22. It holds that
$$T_k(P)(x,y) = 0, \qquad \forall k < \rho(x,y),$$
and
$$T_k(P)(x,y) \leq \sqrt{\frac{\pi(y)}{\pi(x)}}, \qquad \forall k \geq \rho(x,y).$$

Proof. Note that $T_k(P)(x,y) = 0$ when $k < \rho(x,y)$ because $T_k(P)(x,y)$ is a function of the entries $P^\ell(x,y)$ for $\ell \leq k$, all of which are $0$.

We work on $\ell^2(V, \pi)$. Let $f_1, \ldots, f_n$ be an eigendecomposition of $P$ orthonormal with respect to the inner product $\langle \cdot, \cdot \rangle_\pi$ with eigenvalues $\lambda_1, \ldots, \lambda_n \in [-1, 1]$. Such a decomposition exists by Theorem 5.2.1. Then $f_1, \ldots, f_n$ is also an eigendecomposition of the polynomial $T_k(P)$ with eigenvalues
$$T_k(\lambda_1), \ldots, T_k(\lambda_n) \in [-1, 1],$$
by the definition of the Chebyshev polynomials. By decomposing any function $f = \sum_{i=1}^n \alpha_i f_i$ over this eigenbasis, that implies that
$$\|T_k(P) f\|_\pi^2 = \left\| \sum_{i=1}^n \alpha_i T_k(\lambda_i) f_i \right\|_\pi^2 = \sum_{i=1}^n \alpha_i^2 T_k(\lambda_i)^2 \leq \sum_{i=1}^n \alpha_i^2 = \|f\|_\pi^2, \qquad (5.2.7)$$
where we used Parseval's identity (5.2.1) twice and the fact that $T_k(\lambda_i)^2 \in [0, 1]$.

Let $\delta_z$ denote the point mass at $z$. By Cauchy-Schwarz (Theorem B.4.8) and (5.2.7),
$$T_k(P)(x,y) = \frac{\langle \delta_x, T_k(P) \delta_y \rangle_\pi}{\pi(x)} \leq \frac{\|\delta_x\|_\pi \|\delta_y\|_\pi}{\pi(x)} = \frac{\sqrt{\pi(x)} \sqrt{\pi(y)}}{\pi(x)} = \sqrt{\frac{\pi(y)}{\pi(x)}},$$
for any $k$ (in particular for $k \geq \rho(x,y)$), and we have proved the claim.

Combining the two lemmas gives the result.

Remark 5.2.23. The local finiteness assumption is made for simplicity only. The result holds for any countable-space, reversible chain. See [LP16, Section 13.2].
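The Varopoulos-Carne bound itself can be verified on a small example (our own script; simple random walk on a path, where $\rho(x,y) = |x-y|$):

```python
import numpy as np

# Simple random walk on the path 0..9 with reflecting ends.
n = 10
P = np.zeros((n, n))
P[0, 1] = P[n - 1, n - 2] = 1.0
for x in range(1, n - 1):
    P[x, x - 1] = P[x, x + 1] = 0.5
deg = np.array([1.0] + [2.0] * (n - 2) + [1.0])
pi = deg / deg.sum()                            # reversible stationary distribution

t = 6
Pt = np.linalg.matrix_power(P, t)
# Entrywise Varopoulos-Carne bound: 2 sqrt(pi(y)/pi(x)) exp(-rho(x,y)^2 / 2t).
vc = np.array([[2 * np.sqrt(pi[y] / pi[x]) * np.exp(-(x - y) ** 2 / (2 * t))
                for y in range(n)] for x in range(n)])
vc_holds = bool(np.all(Pt <= vc + 1e-12))
```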

Lower bound on mixing Let $(X_t)$ be an irreducible aperiodic (for now not necessarily reversible) Markov chain with finite state space $V$ and stationary distribution $\pi$. Recall that, for a fixed $0 < \varepsilon < 1/2$, the mixing time is
$$t_{\mathrm{mix}}(\varepsilon) = \min\{t : d(t) \leq \varepsilon\},$$
where
$$d(t) = \max_{x \in V} \|P^t(x, \cdot) - \pi\|_{\mathrm{TV}}.$$
It is intuitively clear that $t_{\mathrm{mix}}(\varepsilon)$ is at least of the order of the "diameter" of the transition graph of $P$. For $x, y \in V$, let $\rho(x,y)$ be the graph distance between $x$ and $y$ on the undirected version of the transition graph, that is, ignoring the orientation of the edges. With this definition, a shortest directed path from $x$ to $y$ contains at least $\rho(x,y)$ edges. Here we define the diameter of the transition graph as $\Delta := \max_{x,y \in V} \rho(x,y)$. Let $x_0, y_0$ be a pair of vertices achieving the diameter. Then we claim that $P^{\lfloor (\Delta-1)/2 \rfloor}(x_0, \cdot)$ and $P^{\lfloor (\Delta-1)/2 \rfloor}(y_0, \cdot)$ are supported on disjoint sets. To see this let
$$A = \{z \in V : \rho(x_0, z) < \rho(y_0, z)\}$$
be the set of states closer to $x_0$ than $y_0$. See Figure 5.2.

Figure 5.2: The supports of $P^{\lfloor (\Delta-1)/2 \rfloor}(x_0, \cdot)$ and $P^{\lfloor (\Delta-1)/2 \rfloor}(y_0, \cdot)$ are contained in $A$ and $A^c$ respectively.

By the triangle inequality for $\rho$, any $z$ such that $\rho(x_0, z) \leq \lfloor (\Delta-1)/2 \rfloor$ is in $A$; otherwise we would have $\rho(y_0, z) \leq \rho(x_0, z) \leq \lfloor (\Delta-1)/2 \rfloor$ and hence $\rho(x_0, y_0) \leq \rho(x_0, z) + \rho(y_0, z) \leq 2 \lfloor (\Delta-1)/2 \rfloor < \Delta$, a contradiction. Similarly, if $\rho(y_0, z) \leq \lfloor (\Delta-1)/2 \rfloor$, then $z \in A^c$. By the triangle inequality for the total variation distance,
$$d(\lfloor (\Delta-1)/2 \rfloor) \geq \frac{1}{2} \left\| P^{\lfloor (\Delta-1)/2 \rfloor}(x_0, \cdot) - P^{\lfloor (\Delta-1)/2 \rfloor}(y_0, \cdot) \right\|_{\mathrm{TV}}
\geq \frac{1}{2} \left\{ P^{\lfloor (\Delta-1)/2 \rfloor}(x_0, A) - P^{\lfloor (\Delta-1)/2 \rfloor}(y_0, A) \right\}
= \frac{1}{2} \{1 - 0\} = \frac{1}{2}, \qquad (5.2.8)$$
where we used (1.1.4) in the second inequality, so that:

Claim 5.2.24.
$$t_{\mathrm{mix}}(\varepsilon) \geq \frac{\Delta}{2}.$$
This bound is often far from the truth. Consider for instance simple random walk on a cycle of size $n$. The diameter is $\Delta = n/2$. But Lemma 2.4.3 suggests that it takes time of order $\Delta^2$ to even reach the antipode of the starting point, let alone achieve stationarity. More generally, when $P$ is reversible, the "diffusive behavior" captured by the Varopoulos-Carne bound (Theorem 5.2.20) implies that the mixing time does indeed scale at least as the square of the diameter.

Assume that $P$ is reversible with respect to $\pi$ and has diameter $\Delta$. Letting $n = |V|$ and $\pi_{\min} = \min_{x \in V} \pi(x)$, we then have the following.

Claim 5.2.25. The following lower bound holds:
$$t_{\mathrm{mix}}(\varepsilon) \geq \frac{\Delta^2}{12 \log n + 4 |\log \pi_{\min}|},$$
provided $n \geq \frac{16}{(1 - 2\varepsilon)^2}$.

Proof. The proof is based on the same argument we used to derive our first diameter-based bound, except that the Varopoulos-Carne bound gives a better dependence on the diameter. Namely, let $x_0$, $y_0$, and $A$ be as above. By the Varopoulos-Carne bound,
$$P^t(x_0, A^c) = \sum_{z \in A^c} P^t(x_0, z) \leq \sum_{z \in A^c} 2 \sqrt{\frac{\pi(z)}{\pi(x_0)}}\, e^{-\frac{\rho^2(x_0,z)}{2t}} \leq 2 n\, \pi_{\min}^{-1/2}\, e^{-\frac{\Delta^2}{8t}},$$
where we used that $|A^c| \leq n$ and $\rho(x_0, z) \geq \Delta/2$ for $z \in A^c$. For any
$$t < \frac{\Delta^2}{12 \log n + 4 |\log \pi_{\min}|}, \qquad (5.2.9)$$
we get that
$$P^t(x_0, A^c) \leq 2 n\, \pi_{\min}^{-1/2} \exp\left( -\frac{3 \log n + |\log \pi_{\min}|}{2} \right) = \frac{2}{\sqrt{n}},$$
or $P^t(x_0, A) \geq 1 - \frac{2}{\sqrt{n}}$. Similarly, $P^t(y_0, A) \leq \frac{2}{\sqrt{n}}$, so that arguing as in (5.2.8),
$$d(t) \geq \frac{1}{2} \left( 1 - \frac{2}{\sqrt{n}} - \frac{2}{\sqrt{n}} \right) = \frac{1}{2} - \frac{2}{\sqrt{n}} \geq \varepsilon,$$
for $t$ as in (5.2.9) and $n$ as in the statement.

Remark 5.2.26. The dependence on $\Delta$ and $\pi_{\min}$ in Claim 5.2.25 cannot be improved. See [LP16, Section 13.3].

5.2.4 Randomized algorithms: Markov chain Monte Carlo and a quantitative ergodic theorem

In Markov chain Monte Carlo methods, one generates samples from a probability distribution of interest $\pi$ over some state space $V$ in order to estimate some of its properties, for example, its mean, by designing and then running a Markov chain with stationary distribution $\pi$. The Metropolis algorithm from Example 1.1.30 is a standard way of constructing such a chain. These techniques play a central role in Bayesian statistics in particular, where $\pi$ is the so-called posterior distribution given the data.
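To ground this, here is a minimal Metropolis sampler in the spirit of Example 1.1.30; the toy target and all names are our own illustration, not the book's.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy target on V = {0,...,9}, known only up to normalization.
weights = np.arange(1, 11, dtype=float)
pi = weights / weights.sum()

def metropolis_step(x):
    """Symmetric +/-1 proposal, accepted with probability min(1, pi(y)/pi(x))."""
    y = x + rng.choice((-1, 1))
    if not 0 <= y <= 9:
        return x                                 # proposal off the state space: stay put
    return y if rng.random() < min(1.0, weights[y] / weights[x]) else x

def mcmc_estimate(f, T, x0=0):
    """Time average (1/T) sum_{t=1}^T f(X_t) approximating pi f."""
    x, total = x0, 0.0
    for _ in range(T):
        x = metropolis_step(x)
        total += f(x)
    return total / T

estimate = mcmc_estimate(lambda x: x, T=100_000)
exact = float(np.dot(np.arange(10), pi))         # pi f, equal to 6.0 for this target
```

Only the unnormalized `weights` enter the acceptance ratio, which is what makes the method useful when the normalizing constant is intractable.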
We restrict ourselves here to finite $V$ and, without loss of generality, we assume that $V = [n]$. Let $P$ be an irreducible chain reversible with respect to a stationary distribution $\pi = (\pi_x)_{x \in V}$. As previously, we work on $\ell^2(V, \pi)$. Let $f : V \to \mathbb{R}$ be a function in $\ell^2(V, \pi)$. Recall that
$$\pi f = \sum_{x \in V} \pi_x f(x).$$

Our goal is to estimate $\pi f$ from the sample path of the Markov chain $(X_t)_{t \geq 0}$ with transition matrix $P$. Indeed the ergodic theorem guarantees that
$$\frac{1}{T} \sum_{t=1}^{T} f(X_t) \to \pi f,$$
almost surely as $T \to +\infty$ for any starting point. We derive a simple, quantitative version of this statement that provides insights into how long the chain needs to be run to get an accurate estimate in terms of the spectral gap.
Theorem 5.2.27 (Ergodic theorem: reversible case). Let P = (Px,y)_{x,y∈V} be an irreducible aperiodic transition matrix over a finite state space V, reversible with respect to the stationary distribution π = (πx)_{x∈V}. Let f : V → R be a function in ℓ²(V, π). Then for any initial distribution µ = (µx)_{x∈V},

    (1/T) Σ_{t=1}^T f(Xt) → πf,

in probability as T → +∞. Moreover, for any ε > 0,

    P[ |(1/T) Σ_{t=1}^T f(Xt) − πf| ≥ ε ] ≤ (9 πmin^{−1} ‖f‖²_∞ γ*^{−1} (1/T)) / (ε − πmin^{−1} ‖f‖_∞ γ*^{−1} (1/T))² → 0,

as T → +∞, where γ* > 0 is the absolute spectral gap of P.


Recall that, by Lemmas 5.2.3 and 5.2.4, we have γ∗ > 0 since P is irreducible
and aperiodic. We will first need the following lemma.
Lemma 5.2.28 (Convergence of the expectation). For any initial distribution µ = (µx)_{x∈V} and any t,

    |E[f(Xt)] − πf| ≤ (1 − γ*)^t πmin^{−1} ‖f‖_∞.

Proof. We have

    |E[f(Xt)] − πf| = | Σ_x Σ_y µx P^t(x, y) f(y) − Σ_y πy f(y) |.

Because Σ_x µx = 1, the right-hand side is

    = | Σ_x Σ_y µx P^t(x, y) f(y) − Σ_x Σ_y µx πy f(y) |
    ≤ Σ_x µx Σ_y |P^t(x, y) − πy| |f(y)|,

by the triangle inequality. Now by (5.2.5) this is

    ≤ Σ_x µx Σ_y (πy/πmin)(1 − γ*)^t |f(y)|
    = (1/πmin)(1 − γ*)^t Σ_x µx Σ_y πy |f(y)|
    ≤ (1 − γ*)^t πmin^{−1} ‖f‖_∞.

That proves the claim.

Proof of Theorem 5.2.27. We use Chebyshev's inequality (Theorem 2.1.2), similarly to the proof of the L² weak law of large numbers (Theorem 2.1.6). In particular, we note that the Xt's are not independent.

By Lemma 5.2.28, the expectation of the time average can be bounded as follows:

    | E[(1/T) Σ_{t=1}^T f(Xt)] − πf | ≤ (1/T) Σ_{t=1}^T |E[f(Xt)] − πf|
        ≤ (1/T) Σ_{t=1}^T (1 − γ*)^t πmin^{−1} ‖f‖_∞
        ≤ πmin^{−1} ‖f‖_∞ (1/T) Σ_{t=0}^{+∞} (1 − γ*)^t
        = πmin^{−1} ‖f‖_∞ γ*^{−1} (1/T) → 0,

as T → +∞, since γ* > 0.

Next we bound the variance of the sum. We have

    Var[(1/T) Σ_{t=1}^T f(Xt)] = (1/T²) Σ_{t=1}^T Var[f(Xt)] + (2/T²) Σ_{1≤s<t≤T} Cov[f(Xs), f(Xt)].

We bound the variance and covariance terms separately. To obtain convergence, a trivial bound on the variance suffices:

    0 ≤ Var[f(Xt)] ≤ E[f(Xt)²] ≤ ‖f‖²_∞.

Hence,

    0 ≤ (1/T²) Σ_{t=1}^T Var[f(Xt)] ≤ T ‖f‖²_∞ / T² → 0,

as T → +∞.
Bounding the covariance requires a more delicate argument. Fix 1 ≤ s < t ≤ T. The trick is to condition on Xs and use the Markov property (Theorem 1.1.18). By the definition of the covariance, the tower property (Lemma B.6.16) and taking out what is known (Lemma B.6.13),

    Cov[f(Xs), f(Xt)]
        = E[(f(Xs) − E[f(Xs)])(f(Xt) − E[f(Xt)])]
        = Σ_x E[(f(Xs) − E[f(Xs)])(f(Xt) − E[f(Xt)]) | Xs = x] P[Xs = x]
        = Σ_x E[f(Xt) − E[f(Xt)] | Xs = x] (f(x) − E[f(Xs)]) P[Xs = x].

We now use the time homogeneity of the chain to note that

    E[f(Xt) − E[f(Xt)] | Xs = x] = E[f(Xt) | Xs = x] − E[f(Xt)] = E[f(Xt−s) | X0 = x] − E[f(Xt)].

By Lemma 5.2.28,

    |E[f(Xt) − E[f(Xt)] | Xs = x]|
        = |E[f(Xt−s) | X0 = x] − E[f(Xt)]|
        = |(E[f(Xt−s) | X0 = x] − πf) − (E[f(Xt)] − πf)|
        ≤ |E[f(Xt−s) | X0 = x] − πf| + |E[f(Xt)] − πf|
        ≤ (1 − γ*)^{t−s} πmin^{−1} ‖f‖_∞ + (1 − γ*)^t πmin^{−1} ‖f‖_∞
        ≤ 2 (1 − γ*)^{t−s} πmin^{−1} ‖f‖_∞,

which does not depend on x. Plugging back above,

    |Cov[f(Xs), f(Xt)]|
        ≤ Σ_x |E[f(Xt) − E[f(Xt)] | Xs = x]| |f(x) − E[f(Xs)]| P[Xs = x]
        ≤ 2 (1 − γ*)^{t−s} πmin^{−1} ‖f‖_∞ Σ_x |f(x) − E[f(Xs)]| P[Xs = x]
        ≤ 2 (1 − γ*)^{t−s} πmin^{−1} ‖f‖_∞ Σ_x 2‖f‖_∞ P[Xs = x]
        ≤ 4 (1 − γ*)^{t−s} πmin^{−1} ‖f‖²_∞.

Returning to the sum over the covariances, the previous bound gives

    (2/T²) Σ_{1≤s<t≤T} Cov[f(Xs), f(Xt)]
        ≤ (2/T²) Σ_{1≤s<t≤T} |Cov[f(Xs), f(Xt)]|
        ≤ (2/T²) Σ_{1≤s<t≤T} 4 (1 − γ*)^{t−s} πmin^{−1} ‖f‖²_∞.

To evaluate the sum we make the change of variable h = t − s to get that the previous expression is

    ≤ 4 πmin^{−1} ‖f‖²_∞ (2/T²) Σ_{1≤s≤T} Σ_{h=1}^{T−s} (1 − γ*)^h
        ≤ 4 πmin^{−1} ‖f‖²_∞ (2/T²) Σ_{1≤s≤T} Σ_{h=0}^{+∞} (1 − γ*)^h
        = 4 πmin^{−1} ‖f‖²_∞ (2/T²) Σ_{1≤s≤T} (1/γ*)
        = 8 πmin^{−1} ‖f‖²_∞ γ*^{−1} (1/T) → 0,

as T → +∞.
Combining the variance and covariance bounds, we have shown that

    Var[(1/T) Σ_{t=1}^T f(Xt)] ≤ ‖f‖²_∞ (1/T) + 8 πmin^{−1} ‖f‖²_∞ γ*^{−1} (1/T) ≤ 9 πmin^{−1} ‖f‖²_∞ γ*^{−1} (1/T).

For any ε > 0,

    P[ |(1/T) Σ_{t=1}^T f(Xt) − πf| ≥ ε ]
        = P[ |((1/T) Σ_{t=1}^T f(Xt) − E[(1/T) Σ_{t=1}^T f(Xt)]) + (E[(1/T) Σ_{t=1}^T f(Xt)] − πf)| ≥ ε ]
        ≤ P[ |(1/T) Σ_{t=1}^T f(Xt) − E[(1/T) Σ_{t=1}^T f(Xt)]| + |E[(1/T) Σ_{t=1}^T f(Xt)] − πf| ≥ ε ]
        ≤ P[ |(1/T) Σ_{t=1}^T f(Xt) − E[(1/T) Σ_{t=1}^T f(Xt)]| ≥ ε − πmin^{−1} ‖f‖_∞ γ*^{−1} (1/T) ].

We can now apply Chebyshev's inequality to get

    P[ |(1/T) Σ_{t=1}^T f(Xt) − πf| ≥ ε ] ≤ (9 πmin^{−1} ‖f‖²_∞ γ*^{−1} (1/T)) / (ε − πmin^{−1} ‖f‖_∞ γ*^{−1} (1/T))² → 0,

as T → +∞.
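Theorem 5.2.27 is straightforward to probe numerically. The sketch below is not part of the text: the chain (a lazy walk on a 4-cycle), the test function f, and the use of numpy are our own illustrative choices. It simulates the ergodic average, compares it to πf, and computes the absolute spectral gap from the eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)

# Lazy random walk on a 4-cycle: reversible, uniform stationary distribution.
n = 4
P = np.zeros((n, n))
for x in range(n):
    P[x, x] = 0.5
    P[x, (x + 1) % n] += 0.25
    P[x, (x - 1) % n] += 0.25
pi = np.full(n, 1.0 / n)
f = np.array([1.0, -2.0, 3.0, 0.5])
pif = float(pi @ f)                      # target value πf

# Absolute spectral gap γ*: P is symmetric here, so eigvalsh applies.
lam = np.sort(np.linalg.eigvalsh(P))
gamma_star = 1.0 - max(abs(lam[0]), lam[-2])

# Ergodic average (1/T) Σ_{t=1}^T f(X_t) along a single sample path.
T = 100_000
x, total = 0, 0.0
for _ in range(T):
    x = rng.choice(n, p=P[x])
    total += f[x]
avg = total / T

print(f"πf = {pif}, estimate = {avg:.3f}, γ* = {gamma_star}")
```

The variance bound 9 πmin^{−1} ‖f‖²_∞ γ*^{−1}/T suggests an error of order T^{−1/2}, which is consistent with what the simulation produces.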

5.2.5 Spectral radius

The results in this section have so far concerned finite state spaces. The countably infinite case presents a number of complications. We start with a few observations:

- Suppose P is irreducible, aperiodic and positive recurrent. Then we know from the convergence theorem (Theorem 1.1.33) that, if π is the stationary distribution, then for all x

    ‖P^t(x, ·) − π(·)‖_TV → 0,

  as t → +∞. The convergence rate depends on the starting point x. In the infinite state space case, one typically needs to make that dependence explicit to get meaningful results. In particular the mixing time—as we have defined it—may not be a useful concept.

- In the transient and null recurrent cases, there is no stationary distribution to converge to by Theorem 3.1.20. Instead, we have the following by Theorem 3.1.21: if P is an irreducible chain which is either transient or null recurrent, then

    lim_t P^t(x, y) = 0,

  for all x, y ∈ V.

- Conditions stronger than reversibility are needed for the spectral theorem—in a form similar to what we used—to apply. Specifically, one needs that P is a compact operator: whenever (fn)_n ∈ ℓ²(V, π) is a bounded sequence, there exists a subsequence (fnk)_k such that (P fnk)_k converges in the norm. Unfortunately that is often not the case, as the next example illustrates, even in the reversible, positive recurrent setting.

Example 5.2.29 (A positive recurrent chain whose P is not compact). For p < 1/2, let (Xt) be the birth-death chain with V := {0, 1, 2, . . .}, P(0, 0) := 1 − p, P(0, 1) := p, P(x, x + 1) := p and P(x, x − 1) := 1 − p for all x ≥ 1, and P(x, y) := 0 if |x − y| > 1. As can be checked by direct computation, P is reversible with respect to the stationary distribution π(x) = (1 − γ)γ^x for x ≥ 0, where γ := p/(1 − p). For j ≥ 1, define gj(x) := π(j)^{−1/2} 1_{x=j}. Then ‖gj‖²_π = 1 for all j, so {gj}_j is bounded in ℓ²(V, π). On the other hand,

    P gj(x) = p π(j)^{−1/2} 1_{x=j−1} + (1 − p) π(j)^{−1/2} 1_{x=j+1}.

So

    ‖P gj‖²_π = p² π(j)^{−1} π(j − 1) + (1 − p)² π(j)^{−1} π(j + 1)
             = p² (1 − p)/p + (1 − p)² p/(1 − p)
             = 2p(1 − p).

Hence {P gj}_j is also bounded. However, for j > ℓ,

    ‖P gj − P gℓ‖²_π ≥ (1 − p)² π(j)^{−1} π(j + 1) + p² π(ℓ)^{−1} π(ℓ − 1) = 2p(1 − p).

So {P gj}_j does not have a converging subsequence. J
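The arithmetic in this example is easy to check mechanically. The short sketch below (our own, not from the text) evaluates ‖P gj‖²_π directly from the two-point support of P gj and confirms that it equals 2p(1 − p) for every j.

```python
# Check of Example 5.2.29: for the birth-death chain with p < 1/2 and
# ratio g = p/(1-p), one has ‖P g_j‖²_π = 2p(1-p) for every j ≥ 1, so
# the sequence {P g_j} is bounded but has no convergent subsequence.
p = 0.3
g = p / (1 - p)
pi = lambda x: (1 - g) * g**x          # stationary distribution π(x)

def Pgj_norm_sq(j):
    # P g_j is supported on {j-1, j+1} with values p·π(j)^{-1/2} and
    # (1-p)·π(j)^{-1/2}; its squared ℓ²(V, π)-norm is therefore:
    return (p**2 * pi(j - 1) + (1 - p)**2 * pi(j + 1)) / pi(j)

for j in range(1, 8):
    assert abs(Pgj_norm_sq(j) - 2 * p * (1 - p)) < 1e-12
print("‖P g_j‖²_π = 2p(1-p) =", 2 * p * (1 - p))
```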

We will not say much about the spectral theory of infinite networks. In this subsection, we establish a relationship between the operator norm of P—which is related to its spectrum—and the decay of P^t(x, y).

Let ℓ0(V) be the set of real-valued functions on V with finite support. It is dense in ℓ²(V, π). Indeed let v1, v2, . . . be an enumeration of V and, for f ∈ ℓ²(V, π), define f|n(vi) := f(vi) 1_{i≤n} to be f restricted to v1, . . . , vn. Then

    ‖f − f|n‖²_π = Σ_{i=n+1}^{+∞} π(vi) f(vi)² → 0,        (5.2.10)

as n → ∞, since ‖f‖²_π = Σ_x π(x) f(x)² < +∞. We will also need the following:

    ‖P f − P(f|n)‖²_π = ‖P(f − f|n)‖²_π ≤ ‖f − f|n‖²_π → 0,        (5.2.11)

where we used (5.2.2).

Definition 5.2.30 (Operator norm). The operator norm of P is

    ‖P‖_π = sup { ‖P f‖_π / ‖f‖_π : f ∈ ℓ0(V), f ≠ 0 }.

By definition, for any f ∈ ℓ0(V),

    ‖P f‖_π ≤ ‖P‖_π ‖f‖_π.        (5.2.12)

The same can be seen to hold for any f ∈ ℓ²(V, π) by considering the sequence (f|n)_n and noting that ‖f|n‖_π → ‖f‖_π and ‖P(f|n)‖_π → ‖P f‖_π as n → ∞ by (5.2.10), (5.2.11) and the triangle inequality. This latter observation explains why it suffices to restrict the supremum to ℓ0 in the definition of the norm.

Note that, by (5.2.2), ‖P‖_π ≤ 1. Note further that, if V is finite or, more generally, if π is summable, then in fact ‖P‖_π = 1, by taking f ≡ 1 above. When P is self-adjoint, the norm ‖P‖_π is also equal to what is known as the spectral radius, that is, the radius of the smallest disk centered at 0 in the complex plane that contains the spectrum of P. We will not need to define what that means formally here. (But Exercise 5.5 asks for a proof in the setting of symmetric matrices.)
Our main result is the following.

Theorem 5.2.31 (Spectral radius). Let P be irreducible and reversible with respect to π > 0. Then

    ρ(P) := lim sup_t P^t(x, y)^{1/t} = ‖P‖_π.

In particular, the limit does not depend on x, y. Moreover, for all t,

    P^t(x, y) ≤ √(π(y)/π(x)) ‖P‖^t_π.

In the positive recurrent case (for instance if the chain is finite), we have P^t(x, y) → π(y) > 0 and so ρ(P) = 1 = ‖P‖_π. The theorem says that the equality between ρ(P) and ‖P‖_π holds in general for reversible chains.

Proof of Theorem 5.2.31. To see that the limit does not depend on x, y, let u, v, x, y ∈ V and k, m ≥ 0 be such that P^m(u, x) > 0 and P^k(y, v) > 0. Then

    P^{t+m+k}(u, v)^{1/(t+m+k)} ≥ (P^m(u, x) P^t(x, y) P^k(y, v))^{1/(t+m+k)}
        ≥ P^m(u, x)^{1/(t+m+k)} P^t(x, y)^{1/t} P^k(y, v)^{1/(t+m+k)},

which shows that lim sup_t P^t(u, v)^{1/t} ≥ lim sup_t P^t(x, y)^{1/t} for all u, v, x, y.

We first show that ρ(P) ≤ ‖P‖_π. Observe that applying (5.2.4) and (5.2.12) repeatedly gives that P^t is self-adjoint and satisfies ‖P^t‖_π ≤ ‖P‖^t_π. Because ‖δz‖²_π = π(z) ≤ 1, by Cauchy-Schwarz,

    π(x) P^t(x, y) = ⟨δx, P^t δy⟩_π ≤ ‖P‖^t_π ‖δx‖_π ‖δy‖_π = ‖P‖^t_π √(π(x)π(y)).

Hence P^t(x, y) ≤ √(π(y)/π(x)) ‖P‖^t_π and

    ρ(P) = lim sup_t P^t(x, y)^{1/t} ≤ lim sup_t ( √(π(y)/π(x)) ‖P‖^t_π )^{1/t} = ‖P‖_π.

To establish the inequality in the other direction, we make a series of observations. Fix a nonzero f ∈ ℓ0(V).

- By self-adjointness and Cauchy-Schwarz,

    ‖P^{t+1} f‖²_π = ⟨P^{t+1} f, P^{t+1} f⟩_π = ⟨P^{t+2} f, P^t f⟩_π ≤ ‖P^{t+2} f‖_π ‖P^t f‖_π,

  or

    ‖P^{t+1} f‖_π / ‖P^t f‖_π ≤ ‖P^{t+2} f‖_π / ‖P^{t+1} f‖_π.

  So ‖P^{t+1} f‖_π / ‖P^t f‖_π is non-decreasing and therefore has a limit L ≤ +∞. Moreover, for t = 0, we get

    ‖P f‖_π / ‖f‖_π ≤ L,        (5.2.13)

  so it suffices to prove L ≤ ρ(P).

- Observe that

    (‖P^t f‖_π / ‖f‖_π)^{1/t} = ( (‖P f‖_π / ‖f‖_π) × · · · × (‖P^t f‖_π / ‖P^{t−1} f‖_π) )^{1/t} → L,

  so in fact

    L = lim_t ‖P^t f‖_π^{1/t}.

- By self-adjointness again,

    ‖P^t f‖²_π = ⟨f, P^{2t} f⟩_π = Σ_x π(x) f(x) Σ_y f(y) P^{2t}(x, y).

  By the definition of ρ(P), for any ε > 0, there is a t large enough that

    P^{2t}(x, y) ≤ (ρ(P) + ε)^{2t},

  for all x, y in the support of f. For such a t, plugging back into the previous display,

    ‖P^t f‖_π^{1/t} ≤ (ρ(P) + ε) ( Σ_x π(x)|f(x)| Σ_y |f(y)| )^{1/2t}.

  The expression in parentheses on the right-hand side is finite because f has finite support. Since ε is arbitrary, we get

    L = lim_t ‖P^t f‖_π^{1/t} ≤ ρ(P).        (5.2.14)

So, combining (5.2.13) and (5.2.14), we have shown that ‖P‖_π ≤ ρ(P) and that concludes the proof.

Corollary 5.2.32. Let P be irreducible and reversible with respect to π. If ‖P‖_π < 1, then P is transient.

Proof. By Theorem 5.2.31, P^t(x, x) ≤ ‖P‖^t_π so

    Σ_t P^t(x, x) ≤ Σ_t ‖P‖^t_π < +∞.

Let (Xt) be a chain with transition matrix P. Because

    Σ_t P^t(x, x) = E_x [ Σ_t 1_{Xt=x} ],

we have that Σ_t 1_{Xt=x} < +∞, Px-a.s., and (Xt) is transient.

This is not an if and only if. Random walk on Z³ is transient, yet P^{2t}(0, 0) = Θ(t^{−3/2}), so there ‖P‖_π = ρ(P) = 1.

In the non-reversible case, our definition of ‖P‖_π still makes sense with respect to any stationary measure π (although P is not self-adjoint). But the equality in Theorem 5.2.31 no longer holds in general.

Example 5.2.33 (Counter-example). Let (Xt) be asymmetric random walk on Z with probability p ∈ (1/2, 1) of going to the right. Then both π0(x) := (p/(1 − p))^x and π1(x) := 1 define stationary measures, but the transition matrix P is only reversible with respect to π0.

Under π1, we have ‖P‖_{π1} = 1. Indeed, let gn(x) := 1_{|x|≤n} and note that

    (P gn)(x) = 1_{|x|≤n−1} + p 1_{x = −n−1 or −n} + (1 − p) 1_{x = n or n+1},

so ‖gn‖²_{π1} = 2n + 1 and ‖P gn‖²_{π1} ≥ 2(n − 1) + 1. Hence

    lim sup_n ‖P gn‖_{π1} / ‖gn‖_{π1} ≥ 1,

and ‖P‖_{π1} ≥ 1. But we already showed that ‖P‖_{π1} ≤ 1 in (5.2.2), so the claim follows.

On the other hand, E0[Xt] = (2p − 1)t. So the martingale Zt := Xt − (2p − 1)t (see Example 3.1.29), as a sum of t independent centered random variables in {−1 − (2p − 1), 1 − (2p − 1)}, satisfies the assumptions of the Azuma-Hoeffding inequality (Theorem 3.2.1) with increment bound ct := 2. So

    P^t(0, 0)^{1/t} ≤ P0[Xt ≤ 0]^{1/t} = P0[Xt − (2p − 1)t ≤ −(2p − 1)t]^{1/t} ≤ ( e^{−2(2p−1)²t²/(2²t)} )^{1/t}.

Therefore

    lim sup_t P^t(0, 0)^{1/t} ≤ e^{−(2p−1)²/2} < 1.

J
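For this walk one can go further: the return probability is explicit, P^{2t}(0, 0) = C(2t, t)(p(1 − p))^t, and a classical computation (not carried out in the text) gives ρ(P) = 2√(p(1 − p)). The sketch below, our own, estimates ρ(P) from the exact return probabilities, working in log-space to avoid underflow, and compares it with the Azuma-based bound derived above.

```python
from math import exp, lgamma, log, sqrt

# Return probabilities of the asymmetric walk on Z:
# P^{2t}(0,0) = C(2t, t) (p(1-p))^t, computed in log-space.
def log_return_prob(t, p):
    return lgamma(2 * t + 1) - 2 * lgamma(t + 1) + t * log(p * (1 - p))

p = 0.8
t = 5000
rho_est = exp(log_return_prob(t, p) / (2 * t))   # P^{2t}(0,0)^{1/(2t)}
rho = 2 * sqrt(p * (1 - p))                      # classical limit value
azuma = exp(-(2 * p - 1) ** 2 / 2)               # bound from the example

print(rho_est, rho, azuma)   # rho_est ≈ rho, and rho < azuma < 1
```

The Azuma-Hoeffding bound is thus correct but not tight; the spectral radius is strictly smaller.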

5.3 Geometric bounds

The goal of this section is to relate the spectral gap to certain geometric properties of the underlying network, more specifically isoperimetric properties, that is, relationships between the "volume" of sets and their "circumference." The classical isoperimetric inequality states that the area enclosed by any rectifiable simple closed curve in the plane is at most the length of the curve squared divided by 4π. Moreover, equality is achieved if and only if the curve is a circle.

Remark 5.3.1. Here is an easy proof in the smooth case. Suppose r(s) = (x(s), y(s)), s ∈ [0, 2π], is the parametrization of a positively oriented, smooth, simple closed curve in the plane centered at the origin with arc-length 2π, where ‖r′(s)‖₂ = 1 for all s, ∫₀^{2π} r(s) ds = 0 and x(0) = x(2π) = 0. By Green's theorem, the area enclosed by the curve is

    A = ∫₀^{2π} x(s) y′(s) ds = (1/2) ∫₀^{2π} [x(s)² + y′(s)² − (x(s) − y′(s))²] ds,

where we used that 2ab = a² + b² − (a − b)². By the one-dimensional Poincaré inequality (Remark 3.2.7),

    A ≤ (1/2) ∫₀^{2π} [x(s)² + y′(s)²] ds ≤ (1/2) ∫₀^{2π} [x′(s)² + y′(s)²] ds = π,

which is indeed the area of a circle of circumference 2π.

Edge expansion We define our isoperimetric quantity of interest. Let (Xt) be a finite, irreducible Markov chain on V, reversible with respect to a stationary measure π > 0. (In this section, we do not necessarily assume that π is a probability distribution.) Let P be its transition matrix. We think of (Xt) as a random walk on the network N = (G, c), where G is the transition graph and c(x, y) := π(x)P(x, y) = π(y)P(y, x).

For a subset S ⊆ V, we let the edge boundary of S be

    ∂_E S := {e = {x, y} ∈ E : x ∈ S, y ∈ S^c}.

Let g : E → R+ be an edge weight function and h : V → R+ a vertex weight function. For F ⊆ E and W ⊆ V, we define

    |F|_g := Σ_{e∈F} g(e)  and  |W|_h := Σ_{v∈W} h(v).

Finally, for S ⊆ V, we let

    Φ_E(S; g, h) := |∂_E S|_g / |S|_h.

Roughly speaking, this is the ratio of the "size of the boundary" of a set to its "volume."

Our main definition, the edge expansion constant, quantifies the worst such ratio. First, one last piece of notation: for disjoint subsets S0, S1 ⊆ V, we let

    c(S0, S1) := Σ_{x0∈S0} Σ_{x1∈S1} c(x0, x1).

Definition 5.3.2 (Edge expansion). For a subset of states S ⊆ V, the edge expansion constant (or bottleneck ratio) of S is

    Φ_E(S; c, π) = |∂_E S|_c / |S|_π = c(S, S^c) / π(S).

We refer to (S, S^c) as a cut. The edge expansion constant (or bottleneck ratio or Cheeger number or isoperimetric constant*) of N is

    Φ* := min { Φ_E(S; c, π) : S ⊆ V, 0 < π(S) ≤ 1/2 }.

Intuitively, a small value of Φ* suggests the existence of a "bottleneck" in N. Conversely, a large value of Φ* indicates that all sets "expand out." See Figure 5.3. Note that the quantity Φ_E(S; c, π) has a natural probabilistic interpretation: pick a stationary state and make one step according to the transition matrix; then Φ_E(S; c, π) is the conditional probability that, given that the first state is in S, the next one is in S^c.

Equivalently, the edge expansion constant can be expressed as

    Φ* := min { c(S, S^c) / (π(S) ∧ π(S^c)) : S ⊆ V, 0 < π(S) < 1 }.

Example 5.3.3 (Edge expansion: complete graph). Let G = Kn be the complete graph on n vertices. Let c(x, y) = 1/n² for all x, y (corresponding to picking any vertex uniformly at random at the next step) and π(x) = 1/n for all x. For simplicity, take n even. Then for a subset S of size |S| = k,

    Φ_E(S; c, π) = |∂_E S|_c / |S|_π = (k(n − k)/n²) / (k/n) = (n − k)/n.

Thus, the minimum is achieved for k = n/2 and

    Φ* = (n − n/2)/n = 1/2.

J

* It is also called "conductance," but that terminology clashes with our use of the term.
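Since the state space in Example 5.3.3 is tiny, Φ* can also be computed by brute force over all subsets; the following sketch (our own, not from the text) confirms the computation.

```python
from itertools import combinations

# Brute-force edge expansion of K_n with c(x,y) = 1/n², π(x) = 1/n:
# Φ_E(S; c, π) = (n - |S|)/n, minimized at |S| = n/2, so Φ* = 1/2.
n = 6
V = range(n)

def ratio(S):
    Sc = [y for y in V if y not in S]
    cut = sum(1.0 / n**2 for x in S for y in Sc)   # c(S, S^c)
    return cut / (len(S) / n)                      # ... divided by π(S)

phis = [ratio(set(S)) for k in range(1, n // 2 + 1)
        for S in combinations(V, k)]
phi_star = min(phis)
print("Φ* =", phi_star)
```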

Figure 5.3: A bottleneck.

Dirichlet form, Rayleigh quotient, and normalized Laplacian We relate the edge expansion constant of N to the spectral gap of P. Recall that we denote by λ1, . . . , λn the eigenvalues of P in decreasing order. First, we adapt the variational characterization of Theorem 5.1.3 to the network setting.

The Dirichlet form is defined over ℓ²(V, π) as the bilinear form

    D(f, g) := ⟨f, (I − P)g⟩_π.

The associated quadratic form, also known as the Dirichlet energy, is D(f) := D(f, f). Note that, using stochasticity and reversibility,

    ⟨f, (I − P)f⟩_π = ⟨f, f⟩_π − ⟨f, P f⟩_π
        = (1/2) Σ_x π(x) f(x)² + (1/2) Σ_y π(y) f(y)² − Σ_{x,y} π(x) f(x) f(y) P(x, y)
        = (1/2) Σ_x f(x)² π(x) Σ_y P(x, y) + (1/2) Σ_y f(y)² π(y) Σ_x P(y, x) − Σ_{x,y} π(x) f(x) f(y) P(x, y)
        = (1/2) Σ_{x,y} π(x) P(x, y) f(x)² + (1/2) Σ_{x,y} π(x) P(x, y) f(y)² − Σ_{x,y} π(x) P(x, y) f(x) f(y)
        = (1/2) Σ_{x,y} c(x, y) [f(x) − f(y)]²,

which is indeed consistent with the expression we encountered previously in Theorem 3.3.25.

The Rayleigh quotient for I − P over ℓ²(V, π) is then

    ⟨f, (I − P)f⟩_π / ⟨f, f⟩_π = ( (1/2) Σ_{x,y} c(x, y)[f(x) − f(y)]² ) / ( Σ_x π(x) f(x)² ) = z^T L z / z^T z,

where L is the normalized Laplacian of the network N and we defined the vector z = (zx)_{x∈V} with zx := √(π(x)) f(x). Consequently, the Courant-Fischer theorem (Theorem 5.1.3) in the form (5.1.5) gives the following. Here η2 = 1 − λ2 = γ is the spectral gap of P, which can also be seen as the second smallest eigenvalue of I − P (which has the same eigenfunctions as P itself). We have

    γ = inf { ⟨f, (I − P)f⟩_π / ⟨f, f⟩_π : πf = 0, f ≠ 0 }.

The infimum is achieved by the eigenfunction f2 of P corresponding to its second largest eigenvalue λ2. (Recall from Lemma 5.2.5 that πf2 = 0.)

We note further that if πf = 0 then

    ⟨f, f⟩_π = ⟨f − πf, f − πf⟩_π = Var_π[f],

where the last expression denotes the variance under π. So the variational charac-
terization of γ implies that

Varπ [f ] ≤ γ −1 D(f ),

for all f such that πf = 0. In fact, it holds for any f by considering f − πf and
noticing that both sides are unaffected by subtracting a constant to f .
We have shown:

Theorem 5.3.4 (Poincaré inequality for N ). Let P be finite, irreducible and re-
versible with respect to π. Then

Varπ [f ] ≤ γ −1 D(f ), (5.3.1)

for all f ∈ `2 (V π). Equality is achieved by the eigenfunction f2 of P correspond-


ing to the second largest eigenvalue λ2 .

An inequality of the type

Varπ [f ] ≤ CD(f ), ∀f, (5.3.2)

is known as a Poincaré inequality, a simple version of which we encountered pre-


Poincaré
viously in Remark 3.2.7. To see the connection with that one-dimensional version,
~ be an orientation of E, that inequality
it will be convenient to work with directed edges. Let E
is, for each e ∈ {x, y}, E~ includes either (x, y) or (y, x) with associated weight
c(~e) := c(e) > 0 For a function f : V → R and an edge ~e = (x, y) ∈ E, ~ we
define the “discrete gradient”

∇f (~e) = f (y) − f (x).

With this notation, we can rewrite the Dirichlet energy as


1X X
D(f ) = c(x, y)[f (x) − f (y)]2 = c(~e)[∇f (~e)]2 , (5.3.3)
2 x,y
~e

hence (5.3.1) is a network analogue of (3.2.7).
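The identities above are easy to test numerically. The sketch below (our own; it builds a random reversible chain from a symmetric conductance matrix) checks the two expressions for D(f) against each other and verifies the Poincaré inequality (5.3.1), with γ obtained from the eigenvalues of the symmetrized transition matrix.

```python
import numpy as np

rng = np.random.default_rng(1)

# Random reversible chain from a random symmetric conductance matrix c,
# normalized so that π(x) = Σ_y c(x,y) is a probability distribution.
n = 5
C = rng.random((n, n)); C = (C + C.T) / 2
C /= C.sum()
pi = C.sum(axis=1)
P = C / pi[:, None]                 # P(x,y) = c(x,y)/π(x), reversible w.r.t. π

f = rng.standard_normal(n)

# Dirichlet form two ways: ⟨f, (I-P)f⟩_π and ½ Σ_{x,y} c(x,y)[f(x) - f(y)]².
D1 = float((pi * f) @ (f - P @ f))
diff = f[:, None] - f[None, :]
D2 = 0.5 * float((C * diff**2).sum())

# Spectral gap γ = 1 - λ₂ via the symmetric matrix Dπ^{1/2} P Dπ^{-1/2}.
A = np.diag(np.sqrt(pi)) @ P @ np.diag(1 / np.sqrt(pi))
lam = np.sort(np.linalg.eigvalsh(A))
gamma = 1.0 - lam[-2]

# Poincaré inequality (5.3.1): Var_π[f] ≤ γ^{-1} D(f).
var = float(pi @ (f - pi @ f) ** 2)
print(D1, D2, gamma, var)
```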

5.3.1 Cheeger's inequality

The edge expansion constant and the spectral gap are related through the following isoperimetric inequalities. The lower bound is known as Cheeger's inequality.

Theorem 5.3.5. Let P be a finite, irreducible, reversible Markov transition matrix and let γ = 1 − λ2 be the spectral gap of P. Then

    Φ*²/2 ≤ γ ≤ 2Φ*.

In terms of the relaxation time trel = γ^{−1}, these inequalities have an intuitive meaning: the presence or absence of a bottleneck in the state space leads to slow or fast mixing respectively. We detail some applications to mixing times in the next subsections.

Before giving a proof of the theorem, we start with a trivial—yet insightful—example.

Example 5.3.6 (Two-state chain). Let V := {0, 1} and, for α, β ∈ (0, 1),

    P := ( 1−α    α  )
         (  β    1−β )

which has stationary distribution

    π := ( β/(α+β), α/(α+β) ).

Recall from Example 5.2.8 that the second right eigenvector is

    f2 := ( √(α/β), −√(β/α) ) = ( √(π1/π0), −√(π0/π1) ),

with eigenvalue λ2 := 1 − α − β, so the spectral gap is α + β. Assume that β ≤ α. Then the bottleneck ratio is

    Φ* = c(0, 1)/π(0) = P(0, 1) = α.

Then Theorem 5.3.5 reads

    α²/2 ≤ α + β ≤ 2α,

which is indeed satisfied for all 0 < β ≤ α < 1. Note that the upper bound is tight when α = β. J
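The example can be verified mechanically; the sketch below (our own) computes the eigenvalues of the two-state chain for a few parameter pairs and checks the Cheeger sandwich.

```python
import numpy as np

# Check of Example 5.3.6: eigenvalues of the two-state chain are 1 and
# 1 - α - β, and Theorem 5.3.5 reads α²/2 ≤ α + β ≤ 2α when β ≤ α.
for alpha, beta in [(0.5, 0.5), (0.9, 0.1), (0.3, 0.05)]:
    P = np.array([[1 - alpha, alpha], [beta, 1 - beta]])
    lam2 = np.sort(np.real(np.linalg.eigvals(P)))[0]
    gamma = 1.0 - lam2                 # spectral gap 1 - λ2
    phi = alpha                        # bottleneck ratio of S = {0}
    assert abs(gamma - (alpha + beta)) < 1e-9
    assert phi**2 / 2 <= gamma + 1e-9 and gamma <= 2 * phi + 1e-9
print("two-state Cheeger sandwich verified")
```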

Proof of Theorem 5.3.5. We start with the upper bound. In view of the Poincaré inequality for N (Theorem 5.3.4), to get an upper bound on the spectral gap, it suffices to plug a well-chosen function f into (5.3.1). Taking a hint from Example 5.3.6, for S ⊆ V with π(S) ∈ (0, 1/2], we let

    fS(x) := −√(π(S^c)/π(S))   if x ∈ S,
    fS(x) := +√(π(S)/π(S^c))   if x ∈ S^c.

Then

    Σ_x π(x) fS(x) = π(S)[−√(π(S^c)/π(S))] + π(S^c)[√(π(S)/π(S^c))] = 0,

and

    Σ_x π(x) fS(x)² = π(S)[−√(π(S^c)/π(S))]² + π(S^c)[√(π(S)/π(S^c))]² = 1.

So Var_π[fS] = 1. Hence, from Theorem 5.3.4,

    γ ≤ D(fS)/Var_π[fS]
      = (1/2) Σ_{x,y} c(x, y)[fS(x) − fS(y)]²
      = Σ_{x∈S, y∈S^c} c(x, y) [−√(π(S^c)/π(S)) − √(π(S)/π(S^c))]²
      = Σ_{x∈S, y∈S^c} c(x, y) [(π(S^c) + π(S))/√(π(S)π(S^c))]²
      = c(S, S^c)/(π(S) π(S^c))
      ≤ 2 c(S, S^c)/π(S),

where the last step uses π(S^c) ≥ 1/2, as claimed.

The other direction is trickier. Because we seek an upper bound on the edge expansion constant Φ*, our goal is to find a cut (S, S^c) such that

    c(S, S^c)/(π(S) ∧ π(S^c)) ≤ √(2γ).        (5.3.4)

Because the eigenfunction f2 achieves γ in Theorem 5.3.4, it is natural to look to it for "good cuts." Thinking of f2 as a one-dimensional embedding of the network, it turns out to be enough to consider only "sweep cuts" of the form S := {v : f2(v) ≤ θ} for a threshold θ. How to pick the right threshold is less obvious. Here we use a probabilistic argument, that is, we construct a random cut (Z, Z^c). Observe that it suffices that

    E[c(Z, Z^c)] ≤ √(2γ) E[π(Z) ∧ π(Z^c)],        (5.3.5)

since then E[√(2γ)(π(Z) ∧ π(Z^c)) − c(Z, Z^c)] ≥ 0, which in turn implies that we have P[√(2γ)(π(Z) ∧ π(Z^c)) − c(Z, Z^c) ≥ 0] > 0 by the first moment principle (Theorem 2.2.1); in other words, there exists a cut satisfying (5.3.4).

We now describe the random cut (Z, Z^c):

1. (Cuts from f2) Let again f2 be an eigenfunction corresponding to the eigenvalue λ2 of P with ‖f2‖²_π = 1. Order the vertices V := {v1, . . . , vn} in such a way that

    f2(vi) ≤ f2(vi+1),  ∀i = 1, . . . , n − 1.

As we described above, the function f2 naturally produces a series of cuts (Si, Si^c) where Si := {v1, . . . , vi}. By definition of the bottleneck ratio,

    Φ* ≤ c(Si, Si^c)/(π(Si) ∧ π(Si^c)).        (5.3.6)

2. (Normalization) Let

    m := min{i : π(Si) > 1/2},

and define the translated function f := f2 − f2(vm). We further set g := αf, where α > 0 is chosen so that

    g(v1)² + g(vn)² = 1.

Note that, by construction, g(v1) ≤ · · · ≤ g(vm) = 0 ≤ g(vm+1) ≤ · · · ≤ g(vn). The function g is related to γ as follows:

Lemma 5.3.7.

    (1/2) Σ_{x,y} c(x, y)(g(x) − g(y))² ≤ γ Σ_x π(x) g(x)².

Proof. By Theorem 5.3.4,

    γ = D(f2)/Var_π[f2].

Because neither the numerator nor the denominator is affected by adding a constant, we also have

    γ = D(f)/Var_π[f].

Furthermore, a constant multiplying f cancels out in the ratio, so

    γ = D(g)/Var_π[g].

Now use the fact that Var_π[g] ≤ Σ_x π(x) g(x)².

3. (Random cut) Pick Θ in [g(v1), g(vn)] with density 2|θ|. Note that

    ∫_{g(v1)}^{g(vn)} 2|θ| dθ = g(v1)² + g(vn)² = 1.

Finally define

    Z := {vi : g(vi) < Θ}.

The rest of the proof consists of calculations. We bound the expectations on both sides of (5.3.5).

Lemma 5.3.8. The following hold:

(i)   E[π(Z) ∧ π(Z^c)] = Σ_x π(x) g(x)².

(ii)  E[c(Z, Z^c)] ≤ ( (1/2) Σ_{x,y} c(x, y)(g(x) − g(y))² )^{1/2} ( 2 Σ_x π(x) g(x)² )^{1/2}.

Lemmas 5.3.7 and 5.3.8 immediately imply (5.3.5), and that concludes the proof of Theorem 5.3.5. So it remains to prove this last lemma.

Proof of Lemma 5.3.8. We start with (i). By definition of g, Θ ≤ 0 implies that π(Z) ∧ π(Z^c) = π(Z), and vice versa. Thus

    E[π(Z) ∧ π(Z^c)] = E[ Σ_{ℓ<m} π(vℓ) 1_{vℓ∈Z} 1_{Θ≤0} + Σ_{ℓ≥m} π(vℓ) 1_{vℓ∈Z^c} 1_{Θ>0} ]
        = E[ Σ_{ℓ<m} π(vℓ) 1_{g(vℓ)<Θ≤0} + Σ_{ℓ≥m} π(vℓ) 1_{0<Θ≤g(vℓ)} ]
        = Σ_{ℓ<m} π(vℓ) P[g(vℓ) < Θ ≤ 0] + Σ_{ℓ≥m} π(vℓ) P[0 < Θ ≤ g(vℓ)]
        = Σ_{ℓ<m} π(vℓ) g(vℓ)² + Σ_{ℓ≥m} π(vℓ) g(vℓ)²
        = Σ_x π(x) g(x)²,        (5.3.7)

where we integrated over the density of Θ to obtain the fourth line.

We move on to (ii). To compute E[c(Z, Z^c)], we note that vk ∈ Z and vℓ ∈ Z^c if and only if g(vk) < Θ ≤ g(vℓ). The probability of that event depends on the signs of g(vk) and g(vℓ). If g(vk)g(vℓ) ≥ 0,

    P[g(vk) < Θ ≤ g(vℓ)] = |g(vk)² − g(vℓ)²| = |g(vk) − g(vℓ)| |g(vk) + g(vℓ)| = |g(vk) − g(vℓ)| (|g(vk)| + |g(vℓ)|).

If g(vk)g(vℓ) < 0,

    P[g(vk) < Θ ≤ g(vℓ)] = g(vk)² + g(vℓ)²
        ≤ g(vk)² + g(vℓ)² − 2 g(vk) g(vℓ)
        = (g(vk) − g(vℓ))²
        = |g(vk) − g(vℓ)| (|g(vk)| + |g(vℓ)|).

We apply Cauchy-Schwarz to get

    E[c(Z, Z^c)] = Σ_{k<ℓ} c(vk, vℓ) P[g(vk) < Θ ≤ g(vℓ)]
        ≤ Σ_{k<ℓ} c(vk, vℓ) |g(vk) − g(vℓ)| (|g(vk)| + |g(vℓ)|)
        ≤ ( Σ_{k<ℓ} c(vk, vℓ)(g(vk) − g(vℓ))² )^{1/2} ( Σ_{k<ℓ} c(vk, vℓ)(|g(vk)| + |g(vℓ)|)² )^{1/2}.

The expression in the first parentheses is equal to (1/2) Σ_{x,y} c(x, y)(g(x) − g(y))². So it remains to bound the expression in the second parentheses. Note that

    (|g(x)| + |g(y)|)² = 2g(x)² + 2g(y)² − (|g(x)| − |g(y)|)² ≤ 2g(x)² + 2g(y)².

Therefore, since Σ_y c(x, y) = Σ_y c(y, x) = π(x),

    Σ_{k<ℓ} c(vk, vℓ)(|g(vk)| + |g(vℓ)|)² ≤ (1/2) Σ_{x,y} c(x, y)(|g(x)| + |g(y)|)²
        ≤ Σ_x π(x) g(x)² + Σ_y π(y) g(y)²
        = 2 Σ_x π(x) g(x)².

That concludes the proof.
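The proof is constructive: sweeping over the prefix cuts in the order given by f2 must produce a cut of ratio at most √(2γ). The sketch below (our own; the "dumbbell" conductance matrix is an illustrative choice) implements this sweep on a chain with a deliberate bottleneck.

```python
import numpy as np

# Sweep cuts: order states by the second eigenfunction f2 and take the best
# prefix cut. By the proof above, some prefix cut has ratio at most √(2γ).
# We build a "dumbbell": two well-connected blocks joined by a weak edge.
n = 10
C = np.zeros((n, n))
C[:5, :5] = 1.0
C[5:, 5:] = 1.0
C[4, 5] = C[5, 4] = 0.01                      # the bottleneck
C = C * (1 - np.eye(n)) + 0.1 * np.eye(n)     # mild self-loops, no 1s on diag
C /= C.sum()
pi = C.sum(axis=1)

A = C / np.sqrt(pi[:, None] * pi[None, :])    # symmetrized D^{-1/2} C D^{-1/2}
lam, U = np.linalg.eigh(A)                    # ascending eigenvalues
gamma = 1.0 - lam[-2]
f2 = U[:, -2] / np.sqrt(pi)                   # second eigenfunction of P

order = np.argsort(f2)
best = min(
    C[np.ix_(order[:i], order[i:])].sum()
    / min(pi[order[:i]].sum(), pi[order[i:]].sum())
    for i in range(1, n)
)
print(best, (2 * gamma) ** 0.5)               # best ≤ √(2γ)
```

Here f2 is nearly constant on each block with opposite signs, so the sweep recovers the bottleneck cut.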

5.3.2 Random walks: trees, cycles, and hypercubes revisited

We use the techniques of the previous subsection to bound the mixing time of random walk on some simple graphs. In particular we revisit the examples of Section 4.3.2.

b-ary tree Let (Zt) be lazy simple random walk on the ℓ-level rooted b-ary tree, T̂ℓ^b. The root, 0, is on level 0 and the leaves, L, are on level ℓ. All vertices have degree b + 1, except for the root which has degree b and the leaves which have degree 1. Recall that the stationary distribution is

    π(x) := δ(x) / (2(n − 1)),        (5.3.8)

where n is the number of vertices and δ(x) is the degree of x. We take b = 2 to simplify.

It is intuitively clear that each edge of this graph constitutes a bottleneck, with the root being the most "balanced" one. Let x0 be a leaf of T̂ℓ^b and let A be the set of vertices "on the other side of the root (inclusively)," that is, vertices whose graph distance from x0 is at least ℓ. See Figure 4.7. Let S be the remaining vertices. Then by symmetry π(S) ≤ 1/2. Note that there is a single edge connecting S and S^c = A, namely, the edge linking 0 and the root of the subtree TS formed by the vertices in S. More precisely, let vS be the root of TS. From (5.3.8), P(vS, 0) = (1/2)·(1/3) = 1/6 (where the 1/2 accounts for the laziness), π(vS) = 3/(2n − 2), and, by symmetry,

    π(S) = ((2n − 2 − 2)/2) / (2n − 2) = (n − 2)/(2n − 2),

where in the numerator we subtracted the degree of the root before dividing the sum of the remaining degrees by 2. Hence

    Φ* ≤ [ (1/6)·(3/(2n − 2)) ] / [ (n − 2)/(2n − 2) ] = 1/(2(n − 2)).

By Theorem 5.3.5,

    γ ≤ 2Φ* ≤ 1/(n − 2)  and  trel = γ^{−1} ≥ n − 2.

Thus by Theorem 5.2.14 and the fact that the chain is lazy,

    tmix(ε) ≥ (trel − 1) log(1/(2ε)) = Ω(n).

We showed in Section 4.3.2, using other techniques, that tmix(ε) = Θ(n).
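The bound trel ≥ n − 2 can be confirmed numerically on a small tree; the sketch below (our own, with heap-style vertex indexing) computes the spectral gap of the lazy walk on the complete binary tree.

```python
import numpy as np

# Lazy simple random walk on the complete binary tree with ℓ levels (b = 2):
# the bottleneck bound above gives γ ≤ 1/(n - 2), i.e. t_rel ≥ n - 2.
ell = 5
n = 2 ** (ell + 1) - 1                    # number of vertices
adj = np.zeros((n, n))
for v in range(n):
    for w in (2 * v + 1, 2 * v + 2):      # children of v in heap indexing
        if w < n:
            adj[v, w] = adj[w, v] = 1.0
deg = adj.sum(axis=1)
P = 0.5 * np.eye(n) + 0.5 * adj / deg[:, None]
pi = deg / deg.sum()                      # π(x) = δ(x)/(2(n-1))

A = np.sqrt(pi)[:, None] * P / np.sqrt(pi)[None, :]   # symmetric by reversibility
lam = np.sort(np.linalg.eigvalsh(A))
gamma = 1.0 - lam[-2]
print(n, gamma, 1.0 / gamma)              # t_rel = 1/γ ≥ n - 2
```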

Cycle Let (Zt) be lazy simple random walk on the cycle of size n, Zn := {0, 1, . . . , n − 1}, where i ∼ j if |j − i| = 1 (mod n). Assume n is even.

Consider a subset of vertices S. Note that by symmetry π(S) = |S|/n. Moreover, for all i ∼ j, c(i, j) = π(i)P(i, j) = (1/n)·(1/2)·(1/2) = 1/(4n). Among all sets of size |S|, consecutive vertices minimize the size of the boundary. So

    Φ* ≤ [ 2·(1/(4n)) ] / (ℓ/n) = 1/(2ℓ),

for all ℓ ≤ n/2. This expression is minimized for ℓ = n/2, so

    Φ* = 1/n.

By Theorem 5.3.5,

    1/(2n²) = Φ*²/2 ≤ γ ≤ 2Φ* = 2/n,

and

    n/2 ≤ trel = γ^{−1} ≤ 2n².

Thus by Theorem 5.2.14,

    tmix(ε) ≥ (trel − 1) log(1/(2ε)) = Ω(n),

and

    tmix(ε) ≤ log(1/(ε πmin)) trel = O(n² log n).

We know from exact eigenvalue computations (see Section 5.2.2, where technically we considered the non-lazy chain; laziness only affects the relaxation time by a factor of 2) that in fact γ = 2π²/n² + O(n^{−4}). We also showed in that section that tmix(ε) = O(n²). (Exercise 4.15 shows this is tight up to a constant factor.)
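The exact eigenvalue formula for the cycle makes these claims easy to verify; the sketch below (our own) checks the asymptotics of the gap and the Cheeger sandwich for the lazy chain.

```python
from math import cos, pi as PI

# Non-lazy cycle eigenvalues are cos(2πk/n), so γ = 1 - cos(2π/n)
# = 2π²/n² + O(n⁻⁴); the lazy gap is half of that and satisfies the
# Cheeger sandwich Φ*²/2 = 1/(2n²) ≤ γ_lazy ≤ 2Φ* = 2/n.
for n in (10, 50, 200):
    gap = 1 - cos(2 * PI / n)          # non-lazy spectral gap
    assert abs(gap - 2 * PI**2 / n**2) < 100 / n**4
    gap_lazy = gap / 2                 # laziness halves the gap
    assert 1 / (2 * n**2) <= gap_lazy <= 2 / n
print("cycle spectral gap checks pass")
```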

Hypercube Let (Zt) be lazy simple random walk on the n-dimensional hypercube Z2^n := {0, 1}^n, where i ∼ j if ‖i − j‖1 = 1.

To get a bound on the edge expansion constant, consider the set S = {x ∈ Z2^n : x1 = 0}. By symmetry π(S) = 1/2. For each i ∼ j, c(i, j) = (1/2^n)·(1/2)·(1/n) = 1/(n 2^{n+1}). Hence

    Φ* ≤ [ 2^{n−1}·(1/(n 2^{n+1})) ] / (1/2) = 1/(2n),

where in the numerator we used that |∂_E S| = |S| = 2^{n−1}. By Theorem 5.3.5,

    γ ≤ 2Φ* ≤ 1/n.

Thus by Theorem 5.2.14,

    tmix(ε) ≥ (trel − 1) log(1/(2ε)) = Ω(n).

We know from exact eigenvalue computations (Section 5.2.2) that in fact γ = 1/n. We also showed in Section 4.3.2 that tmix(ε) = Θ(n log n).
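The exact value γ = 1/d is easy to confirm for a small dimension d; the sketch below (our own) builds the full transition matrix of the lazy hypercube walk and reads off the gap.

```python
import numpy as np
from itertools import product

# Lazy walk on the hypercube {0,1}^d: exact computations give γ = 1/d,
# matching the Cheeger upper bound γ ≤ 2Φ* ≤ 1/d with equality.
d = 4
V = list(product((0, 1), repeat=d))
idx = {v: i for i, v in enumerate(V)}
P = 0.5 * np.eye(len(V))
for v in V:
    for k in range(d):                         # flip coordinate k
        w = v[:k] + (1 - v[k],) + v[k + 1:]
        P[idx[v], idx[w]] += 0.5 / d
lam = np.sort(np.linalg.eigvalsh(P))           # P symmetric: π is uniform
gamma = 1.0 - lam[-2]
print("γ =", gamma, "≈ 1/d =", 1 / d)
```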

5.3.3 Random graphs: existence of an expander family and application to mixing

In many applications, it is useful to construct "bottleneck-free" graphs. In particular, random walks mix rapidly on such graphs. Formally:

Definition 5.3.9 (Expander family). Let {Gn}_n be a collection of finite d-regular graphs with lim_n |V(Gn)| = +∞, where V(Gn) is the vertex set of Gn. Let

    Φ*(Gn) := min { |∂_E S| / (d|S|) : S ⊆ V(Gn), 0 < |S| ≤ |V(Gn)|/2 }

denote the edge expansion constant of Gn with unit conductances.† Let α > 0. We say that {Gn}_n is a (d, α)-expander family if for all n

    Φ*(Gn) ≥ α.

The key point of the definition is that the edge expansion constant of all graphs in an expander family is bounded away from 0 uniformly in n. Note that it is trivial to construct such a family if we drop the bounded degree assumption: the edge expansion constant of the complete graph Kn is 1/2 by Example 5.3.3. On the other hand, it is far from obvious that one can construct a family of sparse graphs (i.e., such that |E(Gn)| = O(|V(Gn)|)) with an edge expansion constant uniformly bounded away from 0. It turns out that a simple probabilistic construction does the trick.

We will need the following definition. For a subset S ⊆ V, we let the vertex boundary of S be

    ∂_V S := {y ∈ S^c : ∃x ∈ S s.t. x ∼ y}.

† In terms of random walk, this corresponds to choosing a neighbor uniformly at random and taking the stationary measure equal to the degree. Note that scaling the stationary measure does not affect the edge expansion constant.
Figure 5.4: A draw from Pinsker’s model.

Existence of expander graphs For simplicity, we allow multigraphs (i.e., E is a multiset; or, put differently, there can be multiple edges between the same two vertices) and consider the case d = 3. We construct a random bipartite multigraph $G_n = (L_n, R_n, E_n)$ on 2n vertices known as Pinsker's model. Denote the vertices by $L_n = \{\ell_1, \ldots, \ell_n\}$ and $R_n = \{r_1, \ldots, r_n\}$. Let $\sigma_n^1$ and $\sigma_n^2$ be independent uniform random permutations of [n]. The edge set of $G_n$ is given by
$$E_n := \{(\ell_i, r_i) : i \in [n]\} \cup \{(\ell_i, r_{\sigma_n^1(i)}) : i \in [n]\} \cup \{(\ell_i, r_{\sigma_n^2(i)}) : i \in [n]\}.$$
In words, $G_n$ is a union of three independent uniform perfect matchings (and its vertices are labeled so that one of the matchings is $\{(\ell_i, r_i)\}_i$). See Figure 5.4. Observe that, as a multigraph, all vertices of $G_n$ have degree 3. We show that there exists $\alpha > 0$ such that, for all n large enough, with positive (in fact, high) probability $G_n$ has an edge expansion constant bounded below by $\alpha$. In particular, such a $G_n$ exists for all n large enough and, thus, there exists a $(3, \alpha)$-expander family.
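Pinsker's model is simple to sample, and for small n the vertex expansion of every subset $K \subseteq L_n$ can be checked by brute force. A minimal sketch (assuming NumPy; function names are ours):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)

def pinsker(n):
    """Right-neighborhoods in Pinsker's model: left vertex i is joined
    to r_i and to its images under two uniform permutations."""
    s1, s2 = rng.permutation(n), rng.permutation(n)
    return [{i, int(s1[i]), int(s2[i])} for i in range(n)]

def min_vertex_expansion(nbrs, n):
    """min over K subset of L with 1 <= |K| <= n/2 of |bd_V(K)|/|K|,
    where bd_V(K) is the set of right-vertices adjacent to K."""
    best = float("inf")
    for k in range(1, n // 2 + 1):
        for K in combinations(range(n), k):
            boundary = set().union(*(nbrs[i] for i in K))
            best = min(best, len(boundary) / k)
    return best

nbrs = pinsker(10)
print(min_vertex_expansion(nbrs, 10))  # typically at least 1 + beta for some beta > 0
```

Since the identity matching guarantees $\{r_i : i \in K\}$ lies in the boundary of K, the ratio is always at least 1; the point of the proof below is that it typically exceeds $1 + \beta$.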
Claim 5.3.10 (Pinsker's model: edge expansion constant). There exists $\alpha > 0$ such that
$$\lim_n \mathbb{P}[\Phi_*(G_n) \ge \alpha] = 1.$$
Proof. For convenience, assume n is even. We need to show that with probability going to 1, for any S with $|S| \le |V(G_n)|/2 = n$, we have $|\partial_E S| \ge \alpha d |S|$ for some $\alpha > 0$. We first reduce the proof to a statement about sets of vertices lying on one side of $G_n$.

Lemma 5.3.11. There is $\beta > 0$ such that
$$\lim_n \mathbb{P}\left[|\partial_V K| \ge (1+\beta)|K|,\ \forall K \subseteq L,\ |K| \le n/2\right] = 1.$$
The same holds for R.


Before proving Lemma 5.3.11, we argue that it implies Claim 5.3.10. Note that
the lemma concerns the vertex boundary of K. To relate the latter to the edge
boundary, let S with |S| ≤ n, and let SL := S ∩ L and SR := S ∩ R. For any
subset K ⊆ SL , the size of the edge boundary of S can be bounded below as
follows
|∂E S| ≥ |∂V K| − |SR |, (5.3.9)
where we took into account that the vertices of ∂V K in SR do not contribute to the
edge boundary, but the others do as they are incident with at least one edge in ∂E S.
It remains to find a good K.
Assume without loss of generality that |SL | ≥ |SR | (in the other case, just
interchange the roles of L and R), and suppose that the event in the lemma holds.
In particular, |SR | ≤ |S|/2. We claim that there is a subset K of SL such that
|SR | ≤ |S|/2 ≤ |K| ≤ n/2. (5.3.10)
There are two cases:
- If |SL | < n/2, then take K = SL . It follows that |K| = |SL | ≥ |S|/2.
- If |SL | ≥ n/2, then let K be any subset of SL of size n/2. Since |S| ≤ n, it
follows that |K| = n/2 ≥ |S|/2.
Under the event in the lemma, |∂V K| ≥ (1 + β)|K|.
Going back to (5.3.9), using the lower bound on $|\partial_V K|$ and (5.3.10), we get
$$|\partial_E S| \ge (1+\beta)|K| - |K| = \beta|K| \ge \frac{\beta}{2}|S| = \alpha|S|,$$
where we set α = β/2. Since this holds for any set S with |S| ≤ n, we have
proved Claim 5.3.10.
It remains to prove the lemma.
Figure 5.5: Illustration of the main step in proof of the lemma.

Proof of Lemma 5.3.11. Let $K \subseteq L$ with $k := |K| \le n/2$. Without loss of generality assume $K = \{\ell_1, \ldots, \ell_k\}$. Observe that, by construction, $\partial_V K \supseteq K'$ where $K' = \{r_1, \ldots, r_k\}$. We analyze the "bad event"
$$B_K := \{|\partial_V K| \le k + \lfloor \beta k \rfloor\},$$
by considering all subsets of $\{r_{k+1}, \ldots, r_n\}$ of size $\lfloor \beta k \rfloor$ and bounding the probability that all edges out of K fall into one of them and $K'$. Note that there are $\binom{n-k}{\lfloor \beta k \rfloor}$ such subsets. See Figure 5.5.

Since $\sigma_n^1$ and $\sigma_n^2$ are uniform and independent, they each match K to a uniformly chosen subset of the same size in R and we have by a union bound
$$\mathbb{P}[B_K] \le \binom{n-k}{\lfloor \beta k \rfloor}\left[\frac{\binom{k+\lfloor \beta k \rfloor}{k}}{\binom{n}{k}}\right]^2 \le \binom{n}{\lfloor \beta k \rfloor}\frac{\binom{k+\lfloor \beta k \rfloor}{\lfloor \beta k \rfloor}^2}{\binom{n}{k}^2},$$
where we used that $\binom{n}{s} = \binom{n}{n-s}$.
Taking a union bound again, this time over the Ks, we have
$$\mathbb{P}[\exists K \subseteq L,\ |K| \le n/2,\ |\partial_V K| \le (1+\beta)|K|] \le \sum_{K \subseteq L,\, |K| \le n/2} \mathbb{P}[B_K] \le \sum_{k=1}^{n/2} \binom{n}{k}\binom{n}{\lfloor \beta k \rfloor}\frac{\binom{k+\lfloor \beta k \rfloor}{\lfloor \beta k \rfloor}^2}{\binom{n}{k}^2}. \qquad (5.3.11)$$

We use the bound $\left(\frac{n}{s}\right)^s \le \binom{n}{s} \le \frac{e^s n^s}{s^s} \le \frac{e^t n^t}{t^t}$ for $s \le t < n$ (see Appendix A; to see the last inequality, note that $\frac{\mathrm{d}}{\mathrm{d}t}\log\left(\frac{e^t n^t}{t^t}\right) = \log\left(\frac{n}{t}\right) > 0$ for $0 < t < n$). We obtain that the sum in the last display is bounded as
$$\sum_{k=1}^{n/2} \binom{n}{k}\binom{n}{\lfloor \beta k \rfloor}\frac{\binom{k+\lfloor \beta k \rfloor}{\lfloor \beta k \rfloor}^2}{\binom{n}{k}^2} = \sum_{k=1}^{n/2} \binom{n}{\lfloor \beta k \rfloor}\frac{\binom{k+\lfloor \beta k \rfloor}{\lfloor \beta k \rfloor}^2}{\binom{n}{k}} \le \sum_{k=1}^{n/2} \frac{e^{\beta k} n^{\beta k}}{(\beta k)^{\beta k}} \cdot \frac{\left(\frac{e^{\beta k}(k+\beta k)^{\beta k}}{(\beta k)^{\beta k}}\right)^2}{\left(\frac{n}{k}\right)^{k}} \le \sum_{k=1}^{n/2} \left(\frac{k}{n}\right)^{k(1-\beta)}\left(\frac{e^3(1+\beta)^2}{\beta^3}\right)^{\beta k} = \sum_{k=1}^{\infty} f_n(k), \qquad (5.3.12)$$
where we defined
$$f_n(k) := \mathbf{1}_{\{k \le n/2\}}\left(\frac{k}{n}\right)^{k(1-\beta)}\left(\frac{e^3(1+\beta)^2}{\beta^3}\right)^{\beta k}.$$
Let also
$$g(k) := \left[\left(\frac{1}{2}\right)^{1-\beta}\left(\frac{e^3(1+\beta)^2}{\beta^3}\right)^{\beta}\right]^k,$$
and notice that for $\beta$ small enough
$$|f_n(k)| \le g(k), \quad \forall k,$$
since $\frac{k}{n} \le \frac{1}{2}$ for $k \le \frac{n}{2}$ and
$$\gamma_\beta := \left(\frac{1}{2}\right)^{1-\beta}\left(\frac{e^3(1+\beta)^2}{\beta^3}\right)^{\beta} < 1,$$
using that $\beta^\beta \to 1$ as $\beta \to 0$. Moreover, for each k,
$$f_n(k) \to 0,$$
as $n \to +\infty$, and
$$\sum_{k=1}^{\infty} g(k) \le \frac{1}{1-\gamma_\beta} < +\infty.$$
Hence, by the dominated convergence theorem (Theorem B.4.7), combining (5.3.11) and (5.3.12) we get
$$\mathbb{P}[\exists K \subseteq L,\ |K| \le n/2,\ |\partial_V K| \le (1+\beta)|K|] \le \sum_{k=1}^{\infty} f_n(k) \to 0.$$

That concludes the proof.

That concludes the proof of Claim 5.3.10.

Claim 5.3.10 implies:


Theorem 5.3.12 (Existence of expander family). For α > 0 small enough, there
exists a (3, α)-expander (multigraph) family.
Proof. By Claim 5.3.10, for all n large enough, there exists Gn with Φ∗ (Gn ) ≥ α
for some fixed α > 0.

Fast mixing on expander graphs As we mentioned above, an important prop-


erty of an expander graph is that random walk on such a graph mixes rapidly. We
make this precise.
Claim 5.3.13 (Mixing on expanders). Let {Gn } be a (d, α)-expander family. Then
tmix (ε) = Θ(log |V (Gn )|), where the constant depends on ε and α.
Proof. Because of the degree assumption, random walk on $G_n$ is reversible with respect to the uniform distribution (see Example 1.1.29). So, by Theorems 5.2.14 and 5.3.5, the mixing time is upper bounded by
$$t_{\mathrm{mix}}(\varepsilon) \le \log\left(\frac{1}{\varepsilon \pi_{\min}}\right) t_{\mathrm{rel}} \le \log\left(\frac{|V(G_n)|}{\varepsilon}\right) 2\alpha^{-2} = O(\log |V(G_n)|).$$
By the diameter-based lower bound on the mixing time for reversible chains (Claim 5.2.25), for n large enough
$$t_{\mathrm{mix}}(\varepsilon) \ge \frac{\Delta^2}{12 \log |V(G_n)| + 4|\log \pi_{\min}|},$$
where $\Delta$ is the diameter of $G_n$. For a d-regular graph $G_n$, the diameter is at least $\log_d |V(G_n)|$. Indeed, by induction, the number of vertices within graph distance k of any vertex is at most $d^k$. For $d^k$ to be greater than $|V(G_n)|$, we need $k \ge \log_d |V(G_n)|$. Finally,
$$t_{\mathrm{mix}}(\varepsilon) \ge \frac{(\log_d |V(G_n)|)^2}{16 \log |V(G_n)|} = \Omega(\log |V(G_n)|).$$
That concludes the proof.
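The uniform spectral gap behind this claim can be watched numerically: on draws from Pinsker's model, the gap of the walk typically stays bounded away from 0 as n grows. A minimal sketch (assuming NumPy; names are ours):

```python
import numpy as np

rng = np.random.default_rng(3)

def pinsker_gap(n):
    """Spectral gap of random walk on a draw of Pinsker's 3-regular
    bipartite multigraph (left vertices 0..n-1, right n..2n-1)."""
    s1, s2 = rng.permutation(n), rng.permutation(n)
    A = np.zeros((2 * n, 2 * n))
    for i in range(n):
        for j in (i, int(s1[i]), int(s2[i])):
            A[i, n + j] += 1  # parallel edges accumulate
            A[n + j, i] += 1
    P = A / 3  # 3-regular: the walk picks one of the 3 incident edges uniformly
    eigs = np.sort(np.linalg.eigvalsh(P))[::-1]
    return 1 - eigs[1]

for n in (50, 100, 200, 400):
    print(n, round(pinsker_gap(n), 4))  # roughly constant in n for typical draws
```

Here P is symmetric (the stationary distribution is uniform), so a symmetric eigensolver applies; the bipartite structure only affects the smallest eigenvalue, not the gap $\gamma = 1 - \lambda_2$.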

5.3.4 Ising model: Glauber dynamics on complete graphs and expanders
Let G = (V, E) be a finite, connected graph with maximal degree δ̄. Define X :=
{−1, +1}V . Recall from Example 1.2.5 that the (ferromagnetic) Ising model on V
with inverse temperature β is the probability distribution over spin configurations
$\sigma \in \mathcal{X}$ given by
$$\mu_\beta(\sigma) := \frac{1}{Z(\beta)} e^{-\beta H(\sigma)},$$
where
$$H(\sigma) := -\sum_{i \sim j} \sigma_i \sigma_j$$
is the Hamiltonian and
$$Z(\beta) := \sum_{\sigma \in \mathcal{X}} e^{-\beta H(\sigma)}$$
is the partition function. In this context, recall that vertices are often referred to as
sites. The single-site Glauber dynamics of the Ising model (Definition 1.2.8) is the
Markov chain on X which, at each time, selects a site i ∈ V uniformly at random
and updates the spin σi according to µβ (σ) conditioned on agreeing with σ at all
sites in V \{i}. Specifically, for γ ∈ {−1, +1}, i ∈ V , and σ ∈ X , let σ i,γ be
the configuration σ with the state at i being set to γ. Then, letting n = |V |, the
transition matrix of the Glauber dynamics is
$$Q_\beta(\sigma, \sigma^{i,\gamma}) := \frac{1}{n} \cdot \frac{e^{\gamma \beta S_i(\sigma)}}{e^{-\beta S_i(\sigma)} + e^{\beta S_i(\sigma)}} = \frac{1}{n}\left(\frac{1}{2} + \frac{1}{2}\tanh(\gamma \beta S_i(\sigma))\right),$$
where
$$S_i(\sigma) := \sum_{j \sim i} \sigma_j.$$

All other transitions have probability 0. Recall that this chain is irreducible and
reversible with respect to µβ . In particular µβ is the stationary distribution of Qβ .
We showed in Claim 4.3.15 that the Glauber dynamics is fast mixing at high tem-
perature. More precisely we proved that tmix (ε) = O(n log n) when β < δ̄ −1 .
Here we prove a converse: at low temperature, graphs with good enough expan-
sion properties produce exponentially slow mixing of the Glauber dynamics.

Curie-Weiss model

Let $G = K_n$ be the complete graph on n vertices. In this case, the Ising model is often referred to as the Curie-Weiss model. It is natural to scale $\beta$ with n. We define $\alpha := \beta(n-1)$. Since $\bar{\delta} = n-1$, we have that, when $\alpha < 1$, $\beta = \frac{\alpha}{n-1} < \bar{\delta}^{-1}$ so $t_{\mathrm{mix}}(\varepsilon) = O(n \log n)$. In the other direction, we prove:

Claim 5.3.14 (Curie-Weiss model: slow mixing at low temperature). For $\alpha > 1$, $t_{\mathrm{mix}}(\varepsilon) = \Omega(\exp(r(\alpha)n))$ for some function $r(\alpha) > 0$ not depending on n.
Proof. We first prove exponential mixing when α is large enough, an argument
which will be useful in the generalization to expander graphs below.
The idea of the proof is to bound the edge expansion constant and use Theo-
rem 5.3.5. To simplify the proof, assume n is odd. We denote the edge expansion
constant of the chain by $\Phi_*^{\mathcal{X}}$ to avoid confusion with that of the base graph G.
Intuitively, because the spins tend to align strongly at low temperature, it takes
a considerable amount of time to travel from a configuration with a majority of
−1s to a configuration with a majority of +1s. Because the model tends to prefer
agreeing spins but does not favor any particular spin, a natural place to look for a
bottleneck is the set
$$\mathcal{M} := \left\{\sigma \in \mathcal{X} : \sum_i \sigma_i < 0\right\},$$
where the quantity $m(\sigma) := \sum_i \sigma_i$ is called the magnetization. Note that the magnetization is positive if and only if a majority of spins are +1 and that it forms a Markov chain by itself. So the boundary of the set $\mathcal{M}$ must be crossed to travel from configurations with mostly $-1$ spins to configurations with mostly $+1$ spins.
Observe further that $\mu_\beta(\mathcal{M}) = 1/2$. The edge expansion constant is hence bounded by
$$\Phi_*^{\mathcal{X}} \le \frac{\sum_{\sigma \in \mathcal{M},\, \sigma' \notin \mathcal{M}} \mu_\beta(\sigma) Q_\beta(\sigma, \sigma')}{\mu_\beta(\mathcal{M})} = 2 \sum_{\sigma \in \mathcal{M},\, \sigma' \notin \mathcal{M}} \mu_\beta(\sigma) Q_\beta(\sigma, \sigma'). \qquad (5.3.13)$$

Because the Glauber dynamics changes a single spin at a time, in order for σ ∈ M
to be adjacent to a configuration σ 0 ∈
/ M, it must be that
σ ∈ M−1 := {σ ∈ X : m(σ) = −1} ,
and that $\sigma' = \sigma^{j,+}$ for some site j such that
$$j \in J_\sigma := \{j \in V : \sigma_j = -1\}.$$
Because the number of such sites is $(n+1)/2$ on $\mathcal{M}_{-1}$, that is, $|J_\sigma| = (n+1)/2$ for all $\sigma \in \mathcal{M}_{-1}$, and the Glauber dynamics picks a site uniformly at random, it follows that
$$\sum_{\sigma \in \mathcal{M},\, \sigma' \notin \mathcal{M}} \mu_\beta(\sigma) Q_\beta(\sigma, \sigma') = \sum_{\sigma \in \mathcal{M}_{-1}} \mu_\beta(\sigma) \sum_{j \in J_\sigma} Q_\beta(\sigma, \sigma^{j,+}) \le \sum_{\sigma \in \mathcal{M}_{-1}} \mu_\beta(\sigma)\, \frac{(n+1)/2}{n} \qquad (5.3.14)$$
$$= \frac{1}{2}\left(1 + \frac{1}{n}\right) \mu_\beta(\mathcal{M}_{-1}). \qquad (5.3.15)$$
Thus plugging this back in (5.3.13) gives
$$\Phi_*^{\mathcal{X}} \le \left(1 + \frac{1}{n}\right) \mu_\beta(\mathcal{M}_{-1}) = (1 + o(1)) \sum_{\sigma \in \mathcal{M}_{-1}} \frac{e^{-\beta H(\sigma)}}{Z(\beta)} \qquad (5.3.16)$$
$$= (1 + o(1)) \sum_{\sigma \in \mathcal{M}_{-1}} \frac{\exp\left(\frac{\alpha}{n-1}\left[\binom{|J_\sigma|}{2} + \binom{|J_\sigma^c|}{2} - |J_\sigma||J_\sigma^c|\right]\right)}{Z(\beta)}.$$

We bound the partition function $Z(\beta) = \sum_{\sigma \in \mathcal{X}} e^{-\beta H(\sigma)}$ from below with the term for the all-$(-1)$ configuration, leading to
$$\Phi_*^{\mathcal{X}} \le (1 + o(1)) \sum_{\sigma \in \mathcal{M}_{-1}} \frac{\exp\left(\frac{\alpha}{n-1}\left[\binom{|J_\sigma|}{2} + \binom{|J_\sigma^c|}{2} - |J_\sigma||J_\sigma^c|\right]\right)}{\exp\left(\frac{\alpha}{n-1}\left[\binom{|J_\sigma|}{2} + \binom{|J_\sigma^c|}{2} + |J_\sigma||J_\sigma^c|\right]\right)} \qquad (5.3.17)$$
$$= (1 + o(1)) \sum_{\sigma \in \mathcal{M}_{-1}} \exp\left(-\frac{2\alpha}{n-1}|J_\sigma||J_\sigma^c|\right)$$
$$= (1 + o(1)) \binom{n}{(n+1)/2} \exp\left(-\frac{2\alpha}{n-1} \cdot \frac{n+1}{2} \cdot \frac{n-1}{2}\right)$$
$$= (1 + o(1)) \sqrt{\frac{2}{\pi n}}\, 2^n (1 + o(1)) \exp\left(-\frac{\alpha(n+1)}{2}\right)$$
$$\le C_\alpha \sqrt{\frac{2}{\pi n}} \exp\left(-n\left[\frac{\alpha}{2} - \log 2\right]\right),$$
for some constant $C_\alpha > 0$ depending on $\alpha$, where we used Stirling's formula (see Appendix A). Hence, by Theorems 5.2.14 and 5.3.5, for $\alpha > 2 \log 2$ there is $r(\alpha) > 0$ such that
$$t_{\mathrm{mix}}(\varepsilon) \ge (t_{\mathrm{rel}} - 1)\log\left(\frac{1}{2\varepsilon}\right) \ge \exp(r(\alpha)n)\log\left(\frac{1}{2\varepsilon}\right).$$
That proves the weaker result.


We now show that $\alpha > 1$ in fact suffices. For this, we need to improve our bound on the partition function in (5.3.17). Writing
$$Z(\beta) = \sum_{\sigma \in \mathcal{X}} e^{-\beta H(\sigma)} = \sum_{k=0}^{n} \binom{n}{k} \exp\left(\frac{\alpha}{n-1}\left[\binom{k}{2} + \binom{n-k}{2} - k(n-k)\right]\right) = 2 \sum_{k=0}^{(n-1)/2} \binom{n}{k} \exp\left(\frac{\alpha}{n-1}\left[\binom{k}{2} + \binom{n-k}{2} - k(n-k)\right]\right) =: 2 \sum_{k=0}^{(n-1)/2} Y_{\alpha,k},$$
we see that the partition function is a sum of O(n) exponentially large terms and is therefore dominated by the term corresponding to the largest exponent. Using Stirling's formula,
$$\log \binom{n}{k} = (1 + o(1))\, n H(k/n),$$
where $H(p) = -p \log p - (1-p)\log(1-p)$ is the entropy, and therefore
$$\log Y_{\alpha,k} = (1 + o(1))\, n\left[H(k/n) + \alpha\, \frac{(k/n)^2 + (1-k/n)^2 - 2(k/n)(1-k/n)}{2}\right] = (1 + o(1))\, n K_\alpha(k/n),$$
where, for $p \in [0, 1]$, we let
$$K_\alpha(p) := H(p) + \alpha\, \frac{(1-2p)^2}{2}.$$
Note that the first term in $K_\alpha(p)$ is increasing on [0, 1/2] while the second term is decreasing on [0, 1/2]. In a sense, we are looking at the tradeoff between the contribution from the entropy (i.e., how many ways are there to have k spins with value $-1$) and that from the Hamiltonian (i.e., how much such a configuration is favored). We seek to maximize $K_\alpha(p)$ to determine the leading term in the partition function.
By a straightforward computation,
$$K_\alpha'(p) = \log\left(\frac{1-p}{p}\right) - 2\alpha(1-2p),$$
and
$$K_\alpha''(p) = -\frac{1}{p(1-p)} + 4\alpha.$$
Observe first that, when α < 1 (i.e., at high temperature), Kα0 (1/2) = 0 and
Kα00 (p) < 0 for all p ∈ [0, 1] since p(1 − p) ≤ 1/4. Hence, in that case, Kα is
maximized at p = 1/2.
In our case of interest, on the other hand, that is, when $\alpha > 1$, $K_\alpha''(p) > 0$ in an interval around 1/2, so there is $p_* < 1/2$ with $K_\alpha(p_*) > K_\alpha(1/2) = \log 2$.
So the distribution significantly favors “unbalanced” configurations and crossing
M−1 becomes a bottleneck for the Glauber dynamics. Going back to (5.3.17) and
bounding Z(β) ≥ 2Yα,bp∗ nc , we get
$$\Phi_*^{\mathcal{X}} = O\left(\exp(-n[K_\alpha(p_*) - K_\alpha(1/2)])\right).$$

Applying Theorems 5.2.14 and 5.3.5 concludes the proof.
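The entropy-energy tradeoff in $K_\alpha$ is easy to see numerically: on a fine grid, the maximizer sits at $p = 1/2$ for $\alpha < 1$ and jumps strictly away from 1/2 once $\alpha > 1$. A minimal sketch (assuming NumPy):

```python
import numpy as np

def K(alpha, p):
    """K_alpha(p) = H(p) + alpha * (1 - 2p)^2 / 2, natural-log entropy."""
    H = -p * np.log(p) - (1 - p) * np.log(1 - p)
    return H + alpha * (1 - 2 * p) ** 2 / 2

p = np.linspace(1e-6, 1 - 1e-6, 100001)
maximizers = {a: p[np.argmax(K(a, p))] for a in (0.5, 2.0)}
print(maximizers)  # alpha = 0.5: maximizer at 1/2; alpha = 2.0: far from 1/2
```

The jump of the maximizer away from 1/2 is exactly what makes the balanced configurations in $\mathcal{M}_{-1}$ exponentially unlikely, producing the bottleneck.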

Expander graphs
In the proof of Claim 5.3.14, the bottleneck slowing down the chain arises as a
result of the fact that, when m(σ) = −1, there is a large number of edges in
the base graph Kn connecting Jσ and Jσc . That produces a low probability for
such configurations under the ferromagnetic Ising model, where agreeing spins are
favored. The same argument easily extends to expander graphs. In words, we
prove something that—at first—may seem a bit counter-intuitive: good expansion properties in the base graph produce a bottleneck in the Glauber dynamics at low temperature.

Claim 5.3.15 (Ising model on expander graphs: slow mixing of the Glauber dy-
namics). Let {Gn }n be a (d, γ)-expander family. For large enough inverse tem-
perature β > 0, the Glauber dynamics of the Ising model on Gn satisfies tmix (ε) =
Ω(exp(r(β)|V (Gn )|)) for some function r(β) > 0 not depending on n.

Proof. Let µβ be the probability distribution over spin configurations under the
Ising model over Gn = (V, E) with inverse temperature β. Let Qβ be the transition
matrix of the Glauber dynamics. For not necessarily disjoint subsets of vertices
W0 , W1 ⊆ V in the base graph Gn , let

E(W0 , W1 ) := {{u, v} : u ∈ W0 , v ∈ W1 , {u, v} ∈ E},

be the set of edges with one endpoint in W0 and one endpoint in W1 . Let N =
|V (Gn )| and assume it is odd for simplicity. We use the notation in the proof
of Claim 5.3.14. Following the argument in that proof, we observe that (5.3.15)
and (5.3.16) still hold. Thus
$$\Phi_*^{\mathcal{X}} \le (1 + o(1)) \sum_{\sigma \in \mathcal{M}_{-1}} \frac{\exp\left(\beta\left[|E(J_\sigma, J_\sigma)| + |E(J_\sigma^c, J_\sigma^c)| - |E(J_\sigma, J_\sigma^c)|\right]\right)}{Z(\beta)}.$$

As we did in (5.3.17), we bound the partition function $Z(\beta) = \sum_{\sigma \in \mathcal{X}} e^{-\beta H(\sigma)}$ from below with the term for the all-$(-1)$ configuration, leading to
$$\Phi_*^{\mathcal{X}} \le (1 + o(1)) \sum_{\sigma \in \mathcal{M}_{-1}} \frac{\exp\left(\beta\left[|E(J_\sigma, J_\sigma)| + |E(J_\sigma^c, J_\sigma^c)| - |E(J_\sigma, J_\sigma^c)|\right]\right)}{\exp\left(\beta\left[|E(J_\sigma, J_\sigma)| + |E(J_\sigma^c, J_\sigma^c)| + |E(J_\sigma, J_\sigma^c)|\right]\right)}$$
$$= (1 + o(1)) \sum_{\sigma \in \mathcal{M}_{-1}} \exp\left(-2\beta |E(J_\sigma, J_\sigma^c)|\right)$$
$$= (1 + o(1)) \sum_{\sigma \in \mathcal{M}_{-1}} \exp\left(-2\beta |\partial_E J_\sigma^c|\right)$$
$$\le (1 + o(1)) \binom{N}{(N+1)/2} \exp\left(-2\beta \gamma d |J_\sigma^c|\right)$$
$$= (1 + o(1)) \sqrt{\frac{2}{\pi N}}\, 2^N (1 + o(1)) \exp\left(-\beta \gamma d (N-1)\right)$$
$$\le C_{\beta,\gamma,d} \sqrt{\frac{2}{\pi N}} \exp\left(-N[\beta \gamma d - \log 2]\right),$$
for some constant Cβ,γ,d > 0. We used the definition of an expander family (Def-
inition 5.3.9) on the fourth line above. Taking β > 0 large enough gives the re-
sult.

5.3.5 Congestion ratio


Recall from (5.3.2) that an inequality of the type

Varπ [f ] ≤ CD(f ), (5.3.18)


holding for all f is known as a Poincaré inequality. By Theorem 5.3.4, it implies


the lower bound γ ≥ C −1 on the spectral gap γ = 1 − λ2 . In this section, we
derive such an inequality using a formal measure of “congestion” in the network.
Let $N = (G, c)$ be a finite, connected network with $G = (V, E)$. We assume that $c(x, y) = \pi(x) P(x, y)$ and therefore $c(x) = \sum_{y \sim x} c(x, y) = \pi(x)$, where $\pi$ is the stationary distribution of random walk on N. To state the bound, it will be convenient to work with directed edges—this time in both directions. Let $\vec{E}$ contain all edges from E with both orientations, that is, for each $e = \{x, y\}$, $\vec{E}$ includes (x, y) and (y, x) with associated weight $c(x, y) = c(y, x) = c(e) > 0$. For a function $f \in \ell^2(V, \pi)$ and an edge $\vec{e} = (x, y) \in \vec{E}$, we define as before
$$\nabla f(\vec{e}) = f(y) - f(x).$$
With this notation, we can rewrite the Dirichlet energy as
$$\mathcal{D}(f) = \frac{1}{2} \sum_{x,y} c(x, y)[f(x) - f(y)]^2 = \frac{1}{2} \sum_{\vec{e} \in \vec{E}} c(\vec{e})[\nabla f(\vec{e})]^2. \qquad (5.3.19)$$

For each pair of vertices x, y, let $\nu_{x,y}$ be a directed path between x and y in the digraph $\vec{G} = (V, \vec{E})$, viewed as a collection of directed edges. Let $|\nu_{x,y}|$ be the number of edges in the path. The congestion ratio associated with the paths $\nu = \{\nu_{x,y}\}_{x,y \in V}$ is
$$C_\nu = \max_{\vec{e} \in \vec{E}} \frac{1}{c(\vec{e})} \sum_{x,y : \vec{e} \in \nu_{x,y}} |\nu_{x,y}|\, \pi(x)\pi(y).$$
Note that $C_\nu$ tends to be large when many selected paths, called canonical paths, go through the same "congested" edge. To get a good bound in the theorem below, one must choose canonical paths that are well "spread out."

Theorem 5.3.16 (Canonical paths method). For any choice of paths $\nu$ as above, we have the following bound on the spectral gap:
$$\gamma \ge \frac{1}{C_\nu}.$$

Proof. We establish a Poincaré inequality (5.3.18) with C := Cν . The proof strat-
egy is to start with the variance and manipulate it to bring out canonical paths.
For any $f \in \ell^2(V, \pi)$, it can be checked by expanding that
$$\mathrm{Var}_\pi[f] = \frac{1}{2} \sum_{x,y} \pi(x)\pi(y)(f(x) - f(y))^2. \qquad (5.3.20)$$
To bring out terms similar to those in (5.3.19), we write $f(x) - f(y)$ as a telescoping sum over the canonical path between x and y. That is, letting $\vec{e}_1, \ldots, \vec{e}_{|\nu_{x,y}|}$ be the edges in $\nu_{x,y}$, observe that
$$f(y) - f(x) = \sum_{i=1}^{|\nu_{x,y}|} \nabla f(\vec{e}_i).$$
By Cauchy-Schwarz (Theorem B.4.8),
$$(f(y) - f(x))^2 = \left(\sum_{i=1}^{|\nu_{x,y}|} \nabla f(\vec{e}_i)\right)^2 \le \left(\sum_{i=1}^{|\nu_{x,y}|} 1^2\right)\left(\sum_{i=1}^{|\nu_{x,y}|} \nabla f(\vec{e}_i)^2\right) = |\nu_{x,y}| \sum_{\vec{e} \in \nu_{x,y}} \nabla f(\vec{e})^2.$$

Combining the last display with (5.3.20) and rearranging, we arrive at
$$\mathrm{Var}_\pi[f] \le \frac{1}{2} \sum_{x,y} \pi(x)\pi(y)|\nu_{x,y}| \sum_{\vec{e} \in \nu_{x,y}} \nabla f(\vec{e})^2 = \frac{1}{2} \sum_{\vec{e} \in \vec{E}} \nabla f(\vec{e})^2 \sum_{x,y : \vec{e} \in \nu_{x,y}} |\nu_{x,y}|\, \pi(x)\pi(y) = \frac{1}{2} \sum_{\vec{e} \in \vec{E}} c(\vec{e}) \nabla f(\vec{e})^2 \left(\frac{1}{c(\vec{e})} \sum_{x,y : \vec{e} \in \nu_{x,y}} |\nu_{x,y}|\, \pi(x)\pi(y)\right) \le C_\nu\, \mathcal{D}(f).$$

That concludes the proof.

We give an example next.

Example 5.3.17 (Random walk inside a box). Consider random walk on the following d-dimensional box with sides of length n:
$$V = [n]^d = \{1, \ldots, n\}^d,$$
$$E = \{\{x, y\} \subseteq [n]^d : \|x - y\|_1 = 1\},$$
$$P(x, y) = \frac{1}{|\{z : z \sim x\}|}, \quad \forall x, y \in [n]^d,\ x \sim y,$$
$$\pi(x) = \frac{|\{z : z \sim x\}|}{2|E|},$$
and
$$c(e) = \frac{1}{2|E|}, \quad \forall e \in E.$$
We define $\vec{E}$ as before.
We use Theorem 5.3.16 to bound the spectral gap. For $x = (x_1, \ldots, x_d)$, $y = (y_1, \ldots, y_d) \in [n]^d$, we construct $\nu_{x,y}$ by matching each coordinate in turn. That is, for two vertices $w, z \in [n]^d$ with a single distinct coordinate, let [w, z] be the directed path from w to z in $\vec{G} = (V, \vec{E})$ corresponding to a straight line (or the empty path if w = z). Then
$$\nu_{x,y} = \bigcup_{i=1}^{d} \left[(y_1, \ldots, y_{i-1}, x_i, x_{i+1}, \ldots, x_d), (y_1, \ldots, y_{i-1}, y_i, x_{i+1}, \ldots, x_d)\right]. \qquad (5.3.21)$$
It remains to bound
$$C_\nu = \max_{\vec{e} \in \vec{E}} \frac{1}{c(\vec{e})} \sum_{x,y : \vec{e} \in \nu_{x,y}} |\nu_{x,y}|\, \pi(x)\pi(y)$$

from above.
Each term in the union defining νx,y contains at most n edges, and therefore

|νx,y | ≤ dn, ∀x, y.

Not attempting to get the best constant factors, the edge weights (i.e., conductances) satisfy
$$c(\vec{e}) = \frac{1}{2|E|} \ge \frac{1}{2 \cdot 2dn^d} = \frac{1}{4dn^d},$$
for all $\vec{e}$, since there are $n^d$ vertices and each has at most 2d incident edges. Likewise, for any x,
$$\pi(x) = \frac{|\{z : z \sim x\}|}{2|E|} \le \frac{2d}{2 \cdot (dn^d)/2} = \frac{2}{n^d},$$
where we divided by two in the denominator to account for the double-counting of


edges. Hence we get
$$C_\nu \le \max_{\vec{e} \in \vec{E}} \frac{1}{1/(4dn^d)} \sum_{x,y : \vec{e} \in \nu_{x,y}} (dn)\left(\frac{2}{n^d}\right)\left(\frac{2}{n^d}\right) = \frac{16d^2}{n^{d-1}} \max_{\vec{e} \in \vec{E}} |\{x, y : \vec{e} \in \nu_{x,y}\}|.$$

To bound the cardinality of the set on the last line, we note that any edge ~e ∈ E
e
is of the form

~e = ((z1 , . . . , zi−1 , zi , zi+1 , . . . , zd ), (z1 , . . . , zi−1 , zi ± 1, zi+1 , . . . , zd ))

that is, the endvertices differ by exactly one unit along a single coordinate. By the
construction of the path νx,y in (5.3.21), if ~e ∈ νx,y then it must lie in the subpath

((z1 , . . . , zi−1 , zi , zi+1 , . . . , zd ), (z1 , . . . , zi−1 , zi ± 1, zi+1 , . . . , zd ))


∈ [(y1 , . . . , yi−1 , xi , xi+1 , . . . , xd ), (y1 , . . . , yi−1 , yi , xi+1 , . . . , xd )] .

But that imposes constraints on x and y. Namely, we must have

y1 = z1 , . . . , yi−1 = zi−1 , xi+1 = zi+1 , . . . , xd = zd .

The remaining components of x and y (of which there are i of the former and
d − (i − 1) of the latter) each has at most n possible values (although not all of
them are allowed), so that

|{x, y : ~e ∈ νx,y }| ≤ ni nd−(i−1) = nd+1 .

This upper bound is valid for any ~e.


Putting everything together, we get the bound
$$C_\nu \le \frac{16d^2}{n^{d-1}}\, n^{d+1} = 16 d^2 n^2,$$
so that
$$\gamma \ge \frac{1}{16 d^2 n^2}.$$
Observe that this lower bound on the spectral gap depends only mildly (i.e., polynomially) on the dimension. ◭
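For small boxes the guarantee $\gamma \ge 1/(16d^2n^2)$ can be compared with the true spectral gap. The sketch below (assuming NumPy; names are ours) builds the walk on $[n]^2$, symmetrizes it with respect to $\pi$, and checks that the computed gap exceeds the canonical-paths bound.

```python
import numpy as np
from itertools import product

def box_walk(n, d):
    """Simple random walk on [n]^d: jump to a uniformly chosen neighbor."""
    verts = list(product(range(1, n + 1), repeat=d))
    idx = {v: k for k, v in enumerate(verts)}
    P = np.zeros((len(verts), len(verts)))
    for v in verts:
        nbrs = []
        for i in range(d):
            for s in (-1, 1):
                w = v[:i] + (v[i] + s,) + v[i + 1:]
                if w in idx:
                    nbrs.append(w)
        for w in nbrs:
            P[idx[v], idx[w]] = 1 / len(nbrs)
    return P

n, d = 6, 2
P = box_walk(n, d)
pi = (P > 0).sum(axis=1).astype(float)
pi /= pi.sum()  # stationary distribution, proportional to degree
# By reversibility, D^{1/2} P D^{-1/2} is symmetric with the same real spectrum.
A = np.sqrt(pi)[:, None] * P / np.sqrt(pi)[None, :]
gamma = 1 - np.sort(np.linalg.eigvalsh(A))[::-1][1]
print(gamma, 1 / (16 * d**2 * n**2))  # true gap vs canonical-paths bound
```

The symmetrization step is what allows the use of a symmetric eigensolver even though P itself is not symmetric here (degrees differ at the boundary).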
One advantage of the canonical paths method is that it is somewhat robust
to modifying the underlying network through comparison arguments. See Exer-
cise 5.17 for a simple illustration.
Exercises
Exercise 5.1. Let A be an n × n symmetric random matrix. We assume that
the entries on and above the diagonal, Ai,j , i ≤ j, are independent uniform in
{+1, −1} (and each entry below the diagonal is equal to the corresponding entry
above). Use Talagrand’s inequality (Theorem 3.2.32) to prove concentration of the
largest eigenvalue of A around its mean (which you do not need to compute).
Exercise 5.2. Let G = (V, E, w) be a network.
(i) Prove formula (5.1.3) for the Laplacian quadratic form. (Hint: For an orientation $G^\sigma = (V, E^\sigma)$ of G (that is, give an arbitrary direction to each edge to turn it into a digraph), consider the matrix $B^\sigma \in \mathbb{R}^{n \times m}$ where the column corresponding to arc (i, j) has $-\sqrt{w_{ij}}$ in row i and $\sqrt{w_{ij}}$ in row j, and every other entry is 0.)

(ii) Show that the network Laplacian is positive semidefinite.


Exercise 5.3. Let $G = (V, E, w)$ be a weighted graph with normalized Laplacian $\mathcal{L}$. Show that
$$x^T \mathcal{L} x = \sum_{\{i,j\} \in E} w_{ij}\left(\frac{x_i}{\sqrt{\delta(i)}} - \frac{x_j}{\sqrt{\delta(j)}}\right)^2,$$
for $x = (x_1, \ldots, x_n) \in \mathbb{R}^n$.
Exercise 5.4 (2-norm). Prove that
$$\sup_{x \in \mathbb{S}^{n-1}} \|Ax\|_2 = \sup_{x \in \mathbb{S}^{n-1},\, y \in \mathbb{S}^{m-1}} \langle Ax, y \rangle.$$
[Hint: Use Cauchy-Schwarz (Theorem B.4.8) for one direction, and set $y = Ax/\|Ax\|_2$ for the other one.]
Exercise 5.5 (Spectral radius of a symmetric matrix). Let $A \in \mathbb{R}^{n \times n}$ be a symmetric matrix. The set $\sigma(A)$ of eigenvalues of A is called the spectrum of A and
$$\rho(A) = \max\{|\lambda| : \lambda \in \sigma(A)\}$$
is its spectral radius. Prove that
$$\rho(A) = \|A\|_2,$$
where recall that
$$\|A\|_2 = \max_{0 \ne x \in \mathbb{R}^n} \frac{\|Ax\|}{\|x\|}.$$
Exercise 5.6 (Community recovery in sparse networks). Assume without proof the following theorem.

Theorem 5.3.18 (Remark 3.13 of [BH16]). Consider a symmetric matrix $Z = [Z_{i,j}] \in \mathbb{R}^{n \times n}$ whose entries are independent and obey $\mathbb{E} Z_{i,j} = 0$ and $|Z_{i,j}| \le B$, $\forall 1 \le i, j \le n$, and $\mathbb{E} Z_{i,j}^2 \le \sigma^2$; then with high probability we have $\|Z\| \lesssim \sigma\sqrt{n} + B\sqrt{\log n}$.

Let $(X, G) \sim \mathrm{SBM}_{n, p_n, q_n}$. Show that, under the conditions $p_n \gtrsim \frac{\log n}{n}$ and $\sqrt{p_n/n} = o(p_n - q_n)$, spectral clustering achieves almost exact recovery.
Exercise 5.7 (Parseval’s identity). Prove Parseval’s identity (i.e., (5.2.1)) in the
finite-dimensional case.
Exercise 5.8 (Dirichlet kernel). Prove that for $\theta \ne 0$
$$1 + 2\sum_{k=1}^{n} \cos k\theta = \frac{\sin((n + 1/2)\theta)}{\sin(\theta/2)}.$$

[Hint: Switch to the complex representation and use the formula for a geometric
series.]
Exercise 5.9 (Eigenvalues and periodicity). Let P be a finite irreducible transition
matrix reversible with respect to π over V . Show that if P has a nonzero eigen-
function f with eigenvalue −1, then P is not aperiodic. [Hint: Look at x achieving
kf k∞ .]
Exercise 5.10 (Mixing time: necessary condition for cutoff). Consider a sequence
of Markov chains indexed by n = 1, 2, . . .. Assume that each chain has a finite
(n) (n)
state space and is irreducible, aperiodic, and reversible. Let tmix (ε) and trel be
respectively the mixing time and relaxation time of the n-th chain. The sequence
is said to have pre-cutoff if
$$\sup_{0 < \varepsilon < 1/2} \limsup_{n \to +\infty} \frac{t_{\mathrm{mix}}^{(n)}(\varepsilon)}{t_{\mathrm{mix}}^{(n)}(1 - \varepsilon)} < +\infty.$$
Show that if for some $\varepsilon > 0$
$$\sup_{n \ge 1} \frac{t_{\mathrm{mix}}^{(n)}(\varepsilon)}{t_{\mathrm{rel}}^{(n)}} < +\infty,$$
then there is no pre-cutoff. In particular, there is no cutoff, as defined in Re-
mark 4.3.8.
Exercise 5.11 (Relaxation time and variance). Let P be a finite irreducible transition matrix reversible with respect to $\pi$ over V. Define
$$\mathrm{Var}_\pi[g] = \sum_{x \in V} \pi(x)[g(x) - \pi g]^2.$$
Let $\gamma_*$ be the absolute spectral gap of P. Show that
$$\mathrm{Var}_\pi[P^t f] \le (1 - \gamma_*)^{2t}\, \mathrm{Var}_\pi[f].$$

Exercise 5.12 (Lumping). Let (Xt ) be a Markov chain on a finite state space V
with transition P . Suppose there is an equivalence relation ∼ on V with equiv-
alence classes V ] , denoting by [x] the equivalence class of x, such that [Xt ] is a
Markov chain with transition matrix P ] ([x], [y]) = P (x, [y]).

(i) Let f : V → R be an eigenfunction of P with eigenvalue λ and assume that


f is constant on each equivalence class. Prove that f ] ([x]) := f (x) defines
an eigenfunction of P ] . What is its eigenvalue?

(ii) Suppose $g : V^\sharp \to \mathbb{R}$ is an eigenfunction of $P^\sharp$ with eigenvalue $\lambda$. Prove that $g^\flat : V \to \mathbb{R}$ defined by $g^\flat(x) := g([x])$ is an eigenfunction of P. What is its eigenvalue?

Exercise 5.13 (Random walk on path with reflecting boundaries). Let n be an


even positive integer. Let (Xt ) be simple random walk on the path {1, . . . , n} with
reflecting boundaries, that is, the transition matrix P is defined by P (x, x − 1) =
P (x, x + 1) = 1/2 for x ∈ {2, . . . , n − 1}, and P (1, 2) = P (n, n − 1) = 1.
Use Exercise 5.12 to compute the eigenfunctions of P . [Hint: Use the results of
Section 5.2.2.]

Exercise 5.14 (Product chain). For j = 1, . . . , d, let Pj be a transition matrix on


the finite state space Vj reversible with respect to the stationary distribution πj .
Let w = (wj )j∈[d] be a probability distribution over [d]. Consider the following
Markov chain (Xt ) on V := V1 × · · · × Vd : at each step, pick j according to w,
then take one step along the j-th coordinate according to Pj .

(i) Compute the transition matrix P and stationary distribution π of the chain
(Xt ). Show that P is reversible with respect to π.

(ii) Construct an orthonormal basis of `2 (V, π) made of eigenfunctions of P in


terms of eigenfunctions of the Pj s. What are the corresponding eigenvalues?

(iii) Compute the spectral gap γ of P in terms of the spectral gaps γj of the Pj s.
Exercise 5.15 (Hypercube revisited). Use Exercise 5.14 to recover Lemma 5.2.17.

Exercise 5.16 (Norm and Rayleigh quotient). Let P be irreducible and reversible
with respect to π > 0.

(i) Prove the polarization identity
$$\langle Pf, g \rangle_\pi = \frac{1}{4}\left[\langle P(f+g), f+g \rangle_\pi - \langle P(f-g), f-g \rangle_\pi\right].$$

(ii) Show that
$$\|P\|_\pi = \sup\left\{\frac{\langle f, Pf \rangle_\pi}{\langle f, f \rangle_\pi} : f \in \ell^0(V),\ f \ne 0\right\}.$$

Exercise 5.17 (Random walk on a box with holes). Consider the random walk in
Example 5.3.17 with d = 2. Suppose we remove from the network an arbitrary
collection of horizontal edges at even heights. Use the canonical paths method to
derive a lower bound on the spectral gap of the form γ ≥ 1/(Cn2 ). [Hint: Modify
the argument in Example 5.3.17 and relate the congestion ratio before and after the
removal.]
Bibliographic Remarks
Section 5.1 General references on the spectral theorem, the Courant-Fischer and
perturbation results include the classics [HJ13, Ste98]. Much more on spectral
graph theory can be gleaned from [Chu97, Nic18]. Section 5.1.4 is based largely
on [Abb18], which gives a broad survey of theoretical results for community re-
covery, and [Ver18, Section 4.5] as well as on scribe notes by Joowon Lee, Aidan
Howells, Govind Gopakumar, and Shuyao Li for “MATH 888: Topics in Mathe-
matical Data Science” taught at the University of Wisconsin–Madison in Fall 2021.

Section 5.2 For a great introduction to Hilbert space theory and its applications
(including to the Dirichlet problem), consult [SS05, Chapters 4,5]. Section 5.2.1
borrows from [LP17, Chapter 12]. A representation-theoretic approach to comput-
ing eigenvalues and eigenfunctions, greatly generalizing the calculations in 5.2.2,
is presented in [Dia88]. The presentation in Section 5.2.3 follows [KP, Section 3]
and [LP16, Section 13.3]. The Varopoulos-Carne bound is due to Carne [Car85]
and Varopoulos [Var85]. For a probabilistic approach to the Varopoulos-Carne
bound see Peyre’s proof [Pey08]. The application to mixing times is from [LP16].
There are many textbooks dedicated to Markov chain Monte Carlo (MCMC) and
its uses in data analysis, for example, [RC04, GL06, GCS+ 14]. See also [Dia09].
A good overview of the techniques developed in the statistics literature to bound the
rate of convergence of MCMC methods (a combination of coupling and Lyapounov
arguments) is [JH01]. A deeper treatment of these ideas is developed in [MT09].
A formal definition of the spectral radius and its relationship to the operator norm
can be found, for instance, in [Rud73, Part III].

Section 5.3 This section follows partly the presentation in [LP16, Section 6.4],
[LPW06, Section 13.6], and [Spi12]. Various proofs of the isoperimetric inequal-
ity can be found in [SS03, SS05]. Theorem 5.3.5 is due to [SJ89, LS88]. The
approach to its proof used here is due to Luca Trevisan. The original Cheeger
inequality was proved, in the context of manifolds, in [Che70]. For a fascinat-
ing introduction to expander graphs and their applications, see [HLW06]. A de-
tailed account of the Curie-Weiss model can be found in [FV18]. Section 5.3.5 is
based partly on [Ber14, Sections 3 and 4]. The method of canonical paths, and
some related comparison techniques, were developed in [JS89, DS91, DSC93b,
DSC93a]. For more advanced functional techniques for bounding the mixing time,
see e.g. [MT06].
Chapter 6

Branching processes

Branching processes, which are the focus of this chapter, arise naturally in the
study of stochastic processes on trees and locally tree-like graphs. Similarly to
martingales, finding a hidden (or not-so-hidden) branching process within a prob-
abilistic model can lead to useful bounds and insights into asymptotic behavior.
After a review of the basic extinction theory of branching processes in Section 6.1
and of a fruitful random-walk perspective in Section 6.2, we give a couple exam-
ples of applications in discrete probability in Section 6.3. In particular we analyze
the height of a binary search tree, a standard data structure in computer science.
We also give an introduction to phylogenetics, where a “multitype” variant of the
Galton-Watson branching process plays an important role; we use the techniques
derived in this chapter to establish a phase transition in the reconstruction of an-
cestral molecular sequences. We end this chapter in Section 6.4 with a detailed
look into the phase transition of the Erdős-Rényi graph model. The random-walk
perspective mentioned above allows one to analyze the “exploration” of a largest
connected component, leading to information about the “evolution” of its size as
edge density increases. Tools from all chapters come to bear on this final, marquee
application.

6.1 Background
We begin with a review of the theory of Galton-Watson branching processes, a
standard stochastic model for population growth. In particular we discuss extinc-
tion theory. We also briefly introduce a multitype variant, where branching process and Markov chain aspects interact to produce interesting new behavior.

Version: December 20, 2023
Modern Discrete Probability: An Essential Toolkit
Copyright © 2023 Sébastien Roch

6.1.1 Basic definitions


Recall the definition of a Galton-Watson process.

Definition 6.1.1. A Galton-Watson branching process is a Markov chain of the
following form:
• Let Z0 := 1.

• Let X(i, t), i ≥ 1, t ≥ 1, be an array of i.i.d. Z+ -valued random variables
with finite mean m = E[X(1, 1)] < +∞, and define inductively,

Zt := Σ_{1≤i≤Z_{t−1}} X(i, t).

We denote by {pk }k≥0 the law of X(1, 1). We also let f (s) := E[sX(1,1) ] be
the corresponding probability generating function. To avoid trivialities we assume
P[X(1, 1) = i] < 1 for all i ≥ 0. We further assume that p0 > 0.

In words, Zt models the size of a population at time (or generation) t. The random
variable X(i, t) corresponds to the number of offspring of the i-th individual (if
there is one) in generation t − 1. Generation t is formed of all offspring of the
individuals in generation t − 1.
By tracking genealogical relationships, that is, who is whose child, we obtain
a tree T rooted at the single individual in generation 0 with a vertex for each indi-
vidual in the progeny and an edge for each parent-child relationship. We refer to T
as a Galton-Watson tree.
A basic observation about Galton-Watson processes is that their growth (or
decay) is exponential in t.

Lemma 6.1.2 (Exponential growth I). Let

Wt := m−t Zt . (6.1.1)

Then (Wt ) is a nonnegative martingale with respect to the filtration

Ft = σ(Z0 , . . . , Zt ).

In particular, E[Zt ] = mt .

Proof. We use Lemma B.6.17. Observe that on {Zt−1 = k}

E[Zt | Ft−1 ] = E[ Σ_{1≤j≤k} X(j, t) | Ft−1 ] = mk = mZt−1 .

This is true for all k. Rearranging shows that (Wt ) is a martingale. For the second
claim, note that E[Wt ] = E[W0 ] = 1.

In fact, the martingale convergence theorem (Theorem 3.1.47) gives the following.

Lemma 6.1.3 (Exponential growth II). We have Wt → W∞ < +∞ almost surely


for some nonnegative random variable W∞ ∈ σ(∪t Ft ) with E[W∞ ] ≤ 1.

Proof. This follows immediately from the martingale convergence theorem for
nonnegative martingales (Corollary 3.1.48).

6.1.2 Extinction
Observe that 0 is a fixed point of the process. The event

{Zt → 0} = {∃t : Zt = 0},

is called extinction. Establishing when extinction occurs is a central question in
branching process theory. We let η be the probability of extinction. Recall that, to
avoid trivialities, we assume that p0 > 0 and p1 < 1. Here is a first observation
about extinction.

Lemma 6.1.4. Almost surely either Zt → 0 or Zt → +∞.

Proof. The process (Zt ) is integer-valued and 0 is the only fixed point of the process
under the assumption that p1 < 1. From any state k > 0, the probability of never
coming back to k is at least p_0^k > 0, so every state k > 0 is transient. So the
only possibilities left are Zt → 0 and Zt → +∞, and the claim follows.

In the critical case, that immediately implies almost sure extinction.

Theorem 6.1.5 (Extinction: critical case). Assume m = 1. Then Zt → 0 almost


surely, that is, η = 1.

Proof. When m = 1, (Zt ) itself is a nonnegative martingale, so by Lemma 6.1.3 it
converges almost surely to a finite limit. Since (Zt ) is integer-valued, Lemma 6.1.4
then forces Zt → 0.

We address the general case using probability generating functions. Let ft (s) =
E[sZt ], where by convention we set ft (0) := P[Zt = 0]. Note that, by monotonic-
ity,

η = P[∃t ≥ 0 : Zt = 0] = lim_{t→+∞} P[Zt = 0] = lim_{t→+∞} ft (0). (6.1.2)

Moreover, by the tower property (Lemma B.6.16) and the Markov property (The-
orem 1.1.18), ft has a natural recursive form

ft (s) = E[sZt ]
= E[E[sZt | Ft−1 ]]
= E[f (s)Zt−1 ]
= ft−1 (f (s)) = · · · = f (t) (s), (6.1.3)

where f (t) is the t-th iterate of f . The subcritical case below has an easier proof
(see Exercise 6.1).
Theorem 6.1.6 (Extinction: subcritical and supercritical cases). The probability
of extinction η is given by the smallest fixed point of f in [0, 1]. Moreover:
(i) (Subcritical regime) If m < 1 then η = 1.

(ii) (Supercritical regime) If m > 1 then η < 1.


Proof. The case p0 + p1 = 1 is straightforward: the process dies almost surely
after a geometrically distributed time. So we assume p0 + p1 < 1 for the rest of
the proof.
We first summarize without proof some properties of f which follow from
standard power series facts.
Lemma 6.1.7. On [0, 1], the function f satisfies:
(i) f (0) = p0 , f (1) = 1;

(ii) f is infinitely differentiable on [0, 1);

(iii) f is strictly convex and increasing; and

(iv) lim_{s↑1} f′(s) = m < +∞.


We first characterize the fixed points of f . See Figure 6.1 for an illustration.
Lemma 6.1.8. We have the following.
(i) If m > 1 then f has a unique fixed point η0 ∈ [0, 1).

Figure 6.1: Fixed points of f in subcritical (left) and supercritical (right) cases.

(ii) If m < 1 then f (s) > s for all s ∈ [0, 1). Let η0 := 1 in that case.

Proof. Assume m > 1. Since f′(1) = m > 1, there is δ > 0 such that f (1 − δ) <
1 − δ. On the other hand f (0) = p0 > 0 so by continuity of f there must be a
fixed point in (0, 1 − δ). Moreover, by strict convexity and the fact that f (1) = 1,
if x ∈ (0, 1) is a fixed point then f (y) < y for y ∈ (x, 1), proving uniqueness.
The second part follows by strict convexity and monotonicity.

It remains to prove convergence of the iterates to the appropriate fixed point.


See Figure 6.2 for an illustration.
Lemma 6.1.9. We have the following.

(i) If x ∈ [0, η0 ), then f (t) (x) ↑ η0 .

(ii) If x ∈ (η0 , 1) then f (t) (x) ↓ η0 .

Proof. We only prove (i). The argument for (ii) is similar. By monotonicity, for
x ∈ [0, η0 ), we have x < f (x) < f (η0 ) = η0 . Iterating

x < f (1) (x) < · · · < f (t) (x) < f (t) (η0 ) = η0 .

Figure 6.2: Convergence of iterates to a fixed point.

So f (t) (x) ↑ L ≤ η0 as t → ∞. By continuity of f , we can take the limit t → ∞


inside of f on the right-hand side of the equality

f (t) (x) = f (f (t−1) (x)),

to get L = f (L). So by definition of η0 we must have L = η0 .

The result then follows from the above lemmas together with Equations (6.1.2)
and (6.1.3).

Example 6.1.10 (Poisson branching process). Consider the offspring distribution


X(1, 1) ∼ Poi(λ) with mean λ > 0. We refer to this case as the Poisson branching
process. Then
f (s) = E[s^{X(1,1)} ] = Σ_{i≥0} e^{−λ} (λ^i /i!) s^i = e^{λ(s−1)} .

So the process goes extinct with probability 1 when λ ≤ 1. For λ > 1, the
probability of extinction η is the smallest solution in [0, 1] to the equation

e−λ(1−x) = x.

The survival probability ζλ := 1 − η satisfies 1 − e−λζλ = ζλ . J
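Lemma 6.1.9 also suggests a practical way to compute η: iterate f from 0 until the iterates stabilize. A minimal sketch for the Poisson case (the value λ = 2 below is purely illustrative):

```python
import math

def poisson_extinction_prob(lam, tol=1e-12, max_iter=100000):
    """Smallest fixed point of f(s) = exp(lam*(s-1)) in [0, 1],
    obtained as the increasing limit of the iterates f^(t)(0) (Lemma 6.1.9)."""
    s = 0.0
    for _ in range(max_iter):
        s_new = math.exp(lam * (s - 1.0))
        if abs(s_new - s) < tol:
            return s_new
        s = s_new
    return s

eta = poisson_extinction_prob(2.0)
```

For λ = 2 this gives η ≈ 0.2032, while any λ ≤ 1 drives the iterates up to 1, matching Theorems 6.1.5 and 6.1.6.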

We can use these extinction results to obtain more information on the limit in
Lemma 6.1.3. Recall the definition of (Wt ) in (6.1.1). Of course, conditioned on
extinction, W∞ = 0 almost surely. On the other hand:

Lemma 6.1.11 (Exponential growth III). Conditioned on nonextinction, either


W∞ = 0 almost surely or W∞ > 0 almost surely.

As a result, P[W∞ = 0] ∈ {η, 1}.

Proof of Lemma 6.1.11. A property of rooted trees is said to be inherited if all


finite trees satisfy the property and whenever a tree satisfies the property then so
do all subtrees rooted at the children of the root. The property {W∞ = 0}, as a
property of the Galton-Watson tree T , is inherited, seeing that Zt is a sum over the
children of the root of the number of descendants at the corresponding generation
t − 1. The result then follows from the following 0-1 law.
Lemma 6.1.12 (0-1 law for inherited properties). For a Galton-Watson tree T , an
inherited property A has, conditioned on nonextinction, probability 0 or 1.

Proof. Let T (1) , . . . , T (Z1 ) be the descendant subtrees of the children of the root.
We use the notation T ∈ A to mean that tree T satisfies A. By the tower property,
the definition of inherited, and conditional independence,

P[A] = E[P[T ∈ A | Z1 ]]
≤ E[P[T (i) ∈ A, ∀i ≤ Z1 | Z1 ]]
= E[P[A]Z1 ]
= f (P[A]).

So P[A] ∈ [0, η] ∪ {1} by the proof of Lemma 6.1.8.


Moreover since A holds for finite trees, we have P[A] ≥ η, where recall that
η is the probability of extinction. Hence, in fact, P[A] ∈ {η, 1}. Conditioning on
nonextinction gives the claim.

That concludes the proof.

A further moment assumption provides a more detailed picture.

Lemma 6.1.13 (Exponential growth IV). Let (Zt ) be a Galton-Watson branching


process with m = E[X(1, 1)] > 1 and σ 2 = Var[X(1, 1)] < +∞. Then, (Wt )
converges in L2 and, in particular, E[W∞ ] = 1. Further, P[W∞ = 0] = η.

Proof. We bound E[Wt^2 ] by computing it explicitly by induction. From the
orthogonality of increments (Lemma 3.1.50), it holds that

E[Wt^2 ] = E[W_{t−1}^2 ] + E[(Wt − Wt−1 )^2 ].

Since E[Wt | Ft−1 ] = Wt−1 by the martingale property,

E[(Wt − Wt−1 )^2 | Ft−1 ] = Var[Wt | Ft−1 ]
= m^{−2t} Var[Zt | Ft−1 ]
= m^{−2t} Var[ Σ_{i=1}^{Z_{t−1}} X(i, t) | Ft−1 ]
= m^{−2t} Zt−1 σ^2 .

Hence, taking expectations and using Lemma 6.1.2, we get

E[Wt^2 ] = E[W_{t−1}^2 ] + m^{−t−1} σ^2 .

Since E[W_0^2 ] = 1, induction gives

E[Wt^2 ] = 1 + σ^2 Σ_{i=2}^{t+1} m^{−i} ,

which is uniformly bounded from above when m > 1.


By the convergence theorem for martingales bounded in L2 (Theorem 3.1.51),
(Wt ) converges almost surely and in L2 to a finite limit W∞ and

1 = E[Wt ] → E[W∞ ].

The last statement follows from Lemma 6.1.11.


Remark 6.1.14. A theorem of Kesten and Stigum gives a necessary and sufficient condition
for E[W∞ ] = 1 to hold [KS66b]. See, e.g., [LP16, Chapter 12].

6.1.3 Percolation: Galton-Watson trees


Let T be the Galton-Watson tree for an offspring distribution with mean m > 1.
Now perform bond percolation on T with density p (see Definition 1.2.1). Let C0
be the open cluster of the root in T . Recall from Section 2.3.3 that the critical value
is
pc (T ) = sup{p ∈ [0, 1] : θ(p) = 0},
where the percolation function (conditioned on T ) is θ(p) = Pp [|C0 | = +∞ | T ].

Theorem 6.1.15 (Bond percolation on Galton-Watson trees). Assume m > 1.
Conditioned on nonextinction of T ,

pc (T ) = 1/m,

almost surely.
Proof. We can think of C0 (or more precisely, its size on each level) as being itself
generated by a Galton-Watson branching process, where this time the offspring
distribution is the law of Σ_{i=1}^{X(1,1)} Ii where the Ii s are i.i.d. Ber(p) and X(1, 1) is
distributed according to the offspring distribution of T . In words, we are “thinning”
T . By conditioning on X(1, 1) and then using the tower property (Lemma B.6.16),
the offspring mean under the process generating C0 is mp.
If mp ≤ 1 then by the extinction theory (Theorems 6.1.5 and 6.1.6)

1 = Pp [|C0 | < +∞] = E[Pp [|C0 | < +∞ | T ]],

and we must have Pp [|C0 | < +∞ | T ] = 1 almost surely. Taking p = 1/m, we
get pc (T ) ≥ 1/m almost surely. That holds, in particular, on the nonextinction of T ,
which happens with positive probability.
For the other direction, fix p such that mp > 1. The property of trees {Pp [|C0 | <
+∞ | T ] = 1} is inherited. So by Lemma 6.1.12, conditioned on nonextinction of
T , it has probability 0 or 1. That probability is of course 1 on extinction. By
Theorem 6.1.6,

1 > Pp [|C0 | < +∞] = E[Pp [|C0 | < +∞ | T ]],

and, conditioned on nonextinction of T , we must have Pp [|C0 | < +∞ | T ] = 0—


i.e., pc (T ) < p—almost surely. Repeating this argument for a sequence pn ↓
1/m simultaneously (i.e., on the same T ) and using the monotonicity of Pp [|C0 | <
+∞ | T ], we get that pc (T ) ≤ 1/m almost surely conditioned on nonextinction of
T . That proves the claim.
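For Poisson offspring the thinning in the proof can be made fully explicit: thinning Poi(λ) by Ber(p) gives Poi(λp), so the annealed survival probability Pp [|C0 | = +∞] is exactly the survival probability of a Poisson branching process with mean λp. A small numerical sketch (the choice λ = 3, hence pc = 1/3, is illustrative):

```python
import math

def extinction_prob(lam, tol=1e-12, max_iter=100000):
    # smallest fixed point of f(s) = exp(lam*(s-1)) in [0, 1] (Example 6.1.10)
    s = 0.0
    for _ in range(max_iter):
        s, prev = math.exp(lam * (s - 1.0)), s
        if abs(s - prev) < tol:
            break
    return s

lam = 3.0  # offspring mean m = 3, so p_c(T) = 1/3 on the tree
# thinned offspring is Poi(lam * p): survival probability as a function of p
survival = {p: 1.0 - extinction_prob(lam * p) for p in (0.2, 0.3, 0.34, 0.5)}
```

Survival is 0 (to numerical precision) for p below 1/3 and strictly positive above it, in line with Theorem 6.1.15.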

6.1.4 Multitype branching processes


Multitype branching processes are a useful generalization of Galton-Watson
processes (Definition 6.1.1). Their behavior combines aspects of branching processes
(exponential growth, extinction, etc.) and Markov chains (reducibility, mixing,
etc.). We will not develop the full theory here. In this section, we define this class
of processes and hint (largely without proofs) at their properties. In Section 6.3.2,
we illustrate some of the more intricate interplay between the driving phenomena
involved in a special example of practical importance.

Definition In a multitype branching process, each individual has one of τ types,


which we will denote in this section by 1, . . . , τ for simplicity. Each type α ∈ [τ ] =
{1, . . . , τ } has its own offspring distribution {p_k^{(α)} : k ∈ Z_+^τ }, which specifies the
distribution of the number of offspring of each type it has. Just to emphasize, this
is a collection of (typically distinct) multivariate distributions.
For reasons that will become clear below, it will be convenient to work with
row vectors. For each α ∈ [τ ], let

X^{(α)} (i, t) = (X_1^{(α)} (i, t), . . . , X_τ^{(α)} (i, t)), ∀i, t ≥ 1,

be an array of i.i.d. Z_+^τ -valued random row vectors with distribution {p_k^{(α)} }. Let

Z0 = k0 ∈ Zτ+ ,

be the initial population at time 0, again as a row vector. Recursively, the popula-
tion vector
Zt = (Zt,1 , . . . , Zt,τ ) ∈ Zτ+ ,
at time t ≥ 1 is set to
Zt := Σ_{α=1}^{τ} Σ_{i=1}^{Z_{t−1,α}} X^{(α)} (i, t). (6.1.4)

In words, the i-th individual of type α at generation t − 1 produces X_β^{(α)} (i, t)
individuals of type β at generation t (before itself dying). Let Ft = σ(Z0 , . . . , Zt )
be the corresponding filtration. We assume throughout that P[‖X^{(α)} (1, 1)‖_1 =
1] < 1 for at least one α (which is referred to as the nonsingular case); otherwise
the process reduces to a simple finite Markov chain.

Martingales As in the single-type case, the means of the offspring distributions


play a key role in the theory. This time however they form a matrix, the so-called
mean matrix M = (mα,β ) with entries

mα,β = E[X_β^{(α)} (1, 1)], ∀α, β ∈ [τ ].

That is, mα,β is the expected number of offspring of type β of an individual of type
α. We assume throughout that mα,β < +∞ for all α, β.

To see how M drives the growth of the process, we generalize the proof of
Lemma 6.1.2. By the recursive formula (6.1.4),

E[Zt | Ft−1 ] = E[ Σ_{α=1}^{τ} Σ_{i=1}^{Z_{t−1,α}} X^{(α)} (i, t) | Ft−1 ]
= Σ_{α=1}^{τ} Σ_{i=1}^{Z_{t−1,α}} E[X^{(α)} (i, t) | Ft−1 ]
= Σ_{α=1}^{τ} Z_{t−1,α} E[X^{(α)} (1, 1)]
= Zt−1 M, (6.1.5)

where recall that Zt−1 and Zt are row vectors. Inductively,

E[Zt | Z0 ] = Z0 M^t . (6.1.6)

Moreover any real right eigenvector u (as a column vector) of M with real
eigenvalue λ ≠ 0 gives rise to a martingale

Ut := λ−t Zt u, t ≥ 0, (6.1.7)

since

E[Ut | Ft−1 ] = E[λ−t Zt u | Ft−1 ]


= λ−t E[Zt | Ft−1 ] u
= λ−t Zt−1 M u
= λ−t Zt−1 λu
= Ut−1 .

Extinction The classical Perron-Frobenius Theorem characterizes the direction


of largest growth of the matrix M . We state a version of it without proof in the
case where all entries of M are strictly positive, which is referred to as the positive
regular case. Note that, unlike the case of simple finite Markov chains, the matrix
M is not in general stochastic, as it also reflects the “growth” of the population in
addition to the “transitions” between types. We encountered the following concept
in Section 5.2.5 and Exercise 5.5.

Definition 6.1.16. The spectral radius ρ(A) of a matrix A is the maximum of the
eigenvalues of A in absolute value.

Theorem 6.1.17 (Perron-Frobenius theorem: positive regular case). Let M be a


strictly positive, square matrix. Then ρ := ρ(M ) is an eigenvalue of M with al-
gebraic and geometric multiplicities 1. It is also the only eigenvalue with absolute
value ρ. The corresponding left and right eigenvectors, denoted by v (as a row
vector) and w (as a column vector) respectively, are positive vectors. They are
referred to as left and right Perron vector. We assume that they are normalized so
that 1w = 1 and vw = 1. Here 1 is the all-one row vector.
Because w is positive, the martingale

Wt := ρ−t Zt w, t ≥ 0,

is nonnegative. Therefore it converges almost surely to a random limit with a finite


mean by Corollary 3.1.48. When ρ < 1, an argument based on Markov’s inequality
(Theorem 2.1.1) implies that the process goes extinct almost surely. Formally, let
q (α) be the probability of extinction when started with a single individual of type
α, that is,
q (α) := P[Zt = 0 for some t | Z0 = eα ],
where eα ∈ Zτ+ is the standard basis row vector with a one in the α-th coordinate,
and let q := (q (1) , . . . , q (τ ) ). Then

ρ < 1 =⇒ q = 1. (6.1.8)

Exercise 6.1 asks for the proof. We state the following more general result without
proof. We use the notation of Theorem 6.1.17. We will also refer to the generating
functions

f^{(α)} (s) := E[ Π_{β=1}^{τ} s_β^{X_β^{(α)}(1,1)} ], s ∈ [0, 1]^τ ,

with f = (f^{(1)} , . . . , f^{(τ)} ).

Theorem 6.1.18 (Extinction: multitype case). Let (Zt ) be a positive regular, nonsingular
multitype branching process with a finite mean matrix M .

(i) If ρ ≤ 1 then q = 1.

(ii) If ρ > 1 then:

a- It holds that q < 1.


b- The unique solution to f (s) = s in [0, 1)τ is q.

c- Almost surely

lim_{t→+∞} ρ^{−t} Zt = vW∞ ,

where W∞ is a nonnegative random variable.

d- If in addition Var[X_β^{(α)} (1, 1)] < +∞ for all α, β then

E[W∞ | Z0 = eα ] = wα ,

and

q^{(α)} = P[W∞ = 0 | Z0 = eα ],

for all α ∈ [τ ].
Remark 6.1.19. As in the single-type case, a theorem of Kesten and Stigum gives a neces-
sary and sufficient condition for the last claim of Theorem 6.1.18 (ii) to hold [KS66b].
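A concrete two-type illustration (all numbers below are hypothetical): suppose a type-α individual has an independent Poi(M[α][β]) number of type-β children, so that f^(α)(s) = exp(Σ_β M[α][β](s_β − 1)) and the mean matrix of the process is exactly M. The sketch below computes ρ(M) for a 2 × 2 example and the extinction vector q as the limit of the iterates f^(t)(0), in the spirit of Theorem 6.1.18:

```python
import math

# Hypothetical 2-type example: a type-a individual has an independent
# Poi(M[a][b]) number of type-b children, so the mean matrix is exactly M and
#   f^(a)(s) = exp( sum_b M[a][b] * (s_b - 1) ).
M = [[1.2, 0.8],
     [0.3, 1.1]]

def spectral_radius_2x2(M):
    # Perron root of a strictly positive 2x2 matrix
    tr = M[0][0] + M[1][1]
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return (tr + math.sqrt(tr * tr - 4.0 * det)) / 2.0

def f(s):
    return tuple(
        math.exp(sum(M[a][b] * (s[b] - 1.0) for b in range(2)))
        for a in range(2)
    )

q = (0.0, 0.0)
for _ in range(10000):   # iterates f^(t)(0) converge to the extinction vector q
    q = f(q)
rho = spectral_radius_2x2(M)
```

Here ρ ≈ 1.64 > 1 and the iteration yields q strictly between 0 and 1 in both coordinates, with f(q) = q to machine precision, consistent with part (ii) of the theorem.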

Linear functionals Theorem 6.1.18 also characterizes the limit behavior of lin-
ear functionals of the form Zt u for any vector that is not orthogonal to v. In
contrast, interesting new behavior arises when u is orthogonal to v. We will not
derive the general theory here. We only show through a second moment calculation
that a phase transition takes place.
We restrict ourselves to the supercritical case ρ > 1 and to u = (u1 , . . . , uτ )
being a real right eigenvector of M with a real eigenvalue λ ∉ {0, ρ}. Let Ut be
the corresponding martingale from (6.1.7). The vector u is necessarily orthogonal
to v. Indeed vM u is equal to both ρvu and λvu. Because ρ 6= λ by assumption,
this is only possible if all three expressions are 0. That implies vu = 0 since we
also have ρ 6= 0 by assumption.
To compute the second moment of Ut , we mimic the computations in the proof
of Lemma 6.1.13. We have
E[Ut^2 | Z0 ] = E[U_{t−1}^2 | Z0 ] + E[(Ut − Ut−1 )^2 | Z0 ],

by the orthogonality of increments (Lemma 3.1.50). Since E[Ut | Ft−1 ] = Ut−1 by
the martingale property, we get

E[(Ut − Ut−1 )^2 | Ft−1 ] = Var[Ut | Ft−1 ]
= Var[λ^{−t} Zt u | Ft−1 ]
= λ^{−2t} Var[ Σ_{α=1}^{τ} Σ_{i=1}^{Z_{t−1,α}} X^{(α)} (i, t) u | Ft−1 ]
= λ^{−2t} Σ_{α=1}^{τ} Z_{t−1,α} Var[X^{(α)} (1, 1) u]
= λ^{−2t} Zt−1 S(u) ,

where S(u) = (Var[X(1) (1, 1) u], . . . , Var[X(τ ) (1, 1) u]) as a column vector. In
the last display, we used (6.1.4) on the third line and the independence of the
random vectors X(α) (i, t) on the fourth line. Hence, taking expectations and us-
ing (6.1.6), we get
E[Ut2 | Z0 ] = E[Ut−1
2
| Z0 ] + λ−2t Z0 M t−1 S(u) .
and finally
t
X
E[Ut2 | Z0 ] = (Z0 u)2 + λ−2s Z0 M s−1 S(u) . (6.1.9)
s=1

The case S(u) = 0 is trivial (see Exercise 6.4), so we exclude it from the following
lemma.
Lemma 6.1.20 (Second moment of Ut ). Assume S(u) ≠ 0 and Z0 ≠ 0. The
sequence E[Ut^2 | Z0 ], t = 0, 1, 2, . . ., is non-decreasing and satisfies

sup_{t≥0} E[Ut^2 | Z0 ] < +∞ if ρ < λ^2 , and sup_{t≥0} E[Ut^2 | Z0 ] = +∞ otherwise.

Proof. Because S(u) ≠ 0 and nonnegative and the matrix M is strictly positive by
assumption, we have that

S̃(u) := M S(u) > 0.

Since w is also strictly positive, there is 0 < C− ≤ C+ < +∞ such that

C− w ≤ S̃(u) ≤ C+ w.

Moreover, since M is positive, each inequality is preserved when multiplying on
both sides by M , that is, for any s ≥ 1

C− ρ^s w ≤ M^s S̃(u) ≤ C+ ρ^s w. (6.1.10)

Now rewrite (6.1.9) as

E[Ut^2 | Z0 ] = (Z0 u)^2 + λ^{−2} Z0 S(u) + λ^{−4} Σ_{s=2}^{t} (1/λ^2 )^{s−2} Z0 M^{s−2} S̃(u) .

There are two cases:

- When ρ < λ^2 , using (6.1.10), the sum on the right-hand side can be bounded
above by

C+ Z0 w Σ_{s=2}^{t} (ρ/λ^2 )^{s−2} ≤ C+ Z0 w · 1/(1 − (ρ/λ^2 )) < +∞,

uniformly in t.

- When ρ ≥ λ^2 , the same sum can be bounded from below by

C− Z0 w Σ_{s=2}^{t} (ρ/λ^2 )^{s−2} → +∞,

as t → +∞. Indeed, Z0 ≠ 0 implies that the inner product Z0 w is strictly
positive.

That proves the claim.

In the case ρ < λ^2 , the martingale (Ut ) is bounded in L2 and therefore converges
almost surely to a limit U∞ with E[U∞ | Z0 ] = Z0 u by Theorem 3.1.51.
On the other hand, when ρ ≥ λ^2 , it can be shown (we will not do this here) that
Zt u/√(Zt w) satisfies a central limit theorem with a limit independent of Z0 . Implications
of these claims are illustrated in Section 6.3.2.

6.2 Random-walk representation


In this section, we develop a random-walk representation of the Galton-Watson
process. We give two applications: a characterization of the Galton-Watson process
conditioned on extinction in terms of a dual branching process; and a formula for
the size of the total progeny. We illustrate both in Section 6.2.4, where we revisit
percolation on the infinite b-ary tree.

6.2.1 Exploration process


We introduce an exploration process where a random-walk perspective will natu-
rally arise.

Exploration of a graph
Because this will be useful again later, we describe it first in the context of a locally
finite graph G = (V, E). The exploration process starts at an arbitrary vertex
v ∈ V and has 3 types of vertices:
- At : active vertices,

- Et : explored vertices,

- Nt : neutral vertices.

Figure 6.3: Exploration process for Cv .

At the beginning, we have A0 := {v}, E0 := ∅, and N0 contains all other vertices


in G. At time t, if At−1 = ∅ (i.e., there are no active vertices) we let (At , Et , Nt ) :=
(At−1 , Et−1 , Nt−1 ). Otherwise, we pick an element, at , from At−1 (say on a
first-come, first-served basis to be explicit) and set:
- At := (At−1 \{at }) ∪ {x ∈ Nt−1 : {x, at } ∈ E}
- Et := Et−1 ∪ {at }
- Nt := Nt−1 \{x ∈ Nt−1 : {x, at } ∈ E}
We imagine revealing the edges of G as they are encountered in the exploration
process. In words, starting with v, the connected component Cv of v is progres-
sively grown by adding to it at each time a vertex adjacent to one of the previously
explored vertices and uncovering its remaining neighbors in G. In this process, Et
is the set of previously explored vertices and At —the frontier of the process—is
the set of vertices who are known to belong to Cv but whose full neighborhood is
waiting to be uncovered. The rest of the vertices form the set Nt . See Figure 6.3.
Let At := |At |, Et := |Et |, and Nt := |Nt |. Note that (Et )—not to be
confused with the edge set—is non-decreasing while (Nt ) is non-increasing. Let
τ0 := inf{t ≥ 0 : At = 0},
be the first time At is 0 (which by convention is +∞ if there is no such t). The
process is fixed for all t > τ0 . Notice that Et = t for all t ≤ τ0 , as exactly one

vertex is explored at each time until the set of active vertices is empty. The size of
the connected component of v can be characterized as follows.

Lemma 6.2.1.
τ0 = |Cv |.

Proof. Indeed a single vertex of Cv is explored at each time until all of Cv has been
visited. At that point, At is empty.
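The exploration process is straightforward to implement; the first-come, first-served service rule makes it a breadth-first search. A sketch (the adjacency-dict encoding of G and the toy graph are conveniences chosen here, not the text's notation):

```python
from collections import deque

def explore_component(G, v):
    """Exploration process of Section 6.2.1 on a locally finite graph G,
    given as a dict mapping each vertex to a list of neighbors.
    Returns (tau_0, C_v); by Lemma 6.2.1, tau_0 = |C_v|."""
    active = deque([v])          # A_t, served first-come, first-served
    seen = {v}                   # vertices that are active or explored
    explored = set()             # E_t
    t = 0
    while active:                # stop at tau_0, the first time A_t is empty
        a = active.popleft()     # a_t moves from active to explored
        explored.add(a)
        for x in G[a]:           # its neutral neighbors become active
            if x not in seen:
                seen.add(x)
                active.append(x)
        t += 1
    return t, explored

# toy graph with two components: a triangle {0, 1, 2} and an edge {3, 4}
G = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4], 4: [3]}
tau0, comp = explore_component(G, 0)
```

Starting from vertex 0 the process stops at time 3, the size of the triangle, as Lemma 6.2.1 predicts.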

Random-walk representation of a Galton-Watson tree


Let (Zi )i≥0 be a Galton-Watson branching process and let T be the corresponding
Galton-Watson tree. We run the exploration process above on T started at the
root 0. We will refer to the index i in Zi as a “generation,” and to the index t
in the exploration process as “time”—they are not the same. Let (At , Et , Nt ) and
At := |At |, Et := |Et |, and Nt := |Nt | be as above. Let (Ft ) be the corresponding
filtration. Because we explore the vertices on first-come, first-served basis, we
exhaust all vertices in generation i before considering vertices in generation i + 1
(i.e., we perform breadth-first search).
The random-walk representation is the following. Observe that the process
(At ) admits a simple recursive form. We start with A0 := 1. Then, conditioning
on Ft−1 :

- If At−1 = 0, the exploration process has finished its course and At = 0.

- Otherwise, (a) one active vertex becomes an explored vertex and (b) its offspring
become active vertices. That is,

At = At−1 + (−1) + Xt if t − 1 < τ0 (the −1 accounting for (a) and the Xt
term for (b)), and At = 0 otherwise,

where Xt is distributed according to the offspring distribution.

We let Yt := Xt − 1 and

St := 1 + Σ_{s=1}^{t} Ys ,

with S0 := 1. Then

τ0 = inf{t ≥ 0 : St = 0},

and
(At ) = (St∧τ0 ),
is a random walk started at 1 with i.i.d. increments (Yt ) stopped when it hits 0 for
the first time.
We refer to
H = (X1 , . . . , Xτ0 ),
as the history of the process (Zi ). Observe that, under breadth-first search, the
process (Zi ) can be reconstructed from H: Z0 = 1, Z1 = X1 , Z2 = X2 + . . . +
X_{Z1 +1} , and so forth. (Exercise 6.5 asks for a general formula.) As a result, (Zi )
can be recovered from (St ) as well. We call (x1 , . . . , xt ) a valid history if
1 + (x1 − 1) + · · · + (xs − 1) > 0,

for all s < t and


1 + (x1 − 1) + · · · + (xt − 1) = 0.
Note that a valid history may have probability 0 under the offspring distribution.

6.2.2 Duality principle


The random-walk representation above is useful to prove the following duality
principle.

Theorem 6.2.2 (Duality principle). Let (Zi ) be a branching process with offspring
distribution {pk }k≥0 and extinction probability η < 1. Let (Zi′ ) be a branching
process with offspring distribution {pk′ }k≥0 where

pk′ = η^{k−1} pk .

Then (Zi ) conditioned on extinction has the same distribution as (Zi′ ), which is
referred to as the dual branching process.

Let f be the probability generating function of the offspring distribution of (Zi ).
Note that

Σ_{k≥0} pk′ = Σ_{k≥0} η^{k−1} pk = η^{−1} f (η) = 1,

because η is a fixed point of f by Theorem 6.1.6. So {pk′ }k≥0 is indeed a probability
distribution. Note further that its expectation is

Σ_{k≥0} k pk′ = Σ_{k≥0} k η^{k−1} pk = f′(η) < 1,

since, by Lemma 6.1.7, f′ is strictly increasing, f (η) = η < 1 and f (1) = 1
(which would not be possible if f′(η) were greater than or equal to 1; see Figure 6.1
for an illustration). So the dual branching process is subcritical.

Proof of Theorem 6.2.2. We use the random-walk representation. Let H = (X1 , . . . , X_{τ0} )
and H′ = (X1′ , . . . , X′_{τ0′} ) be the histories of (Zi ) and (Zi′ ) respectively. In the case
of extinction of (Zi ), the history H has finite length.
By definition of the conditional probability, for a valid history (x1 , . . . , xt ) with
a finite t,

P[H = (x1 , . . . , xt ) | τ0 < +∞] = P[H = (x1 , . . . , xt )] / P[τ0 < +∞] = η^{−1} Π_{s=1}^{t} p_{xs} .

Because (x1 − 1) + · · · + (xt − 1) = −1,

η^{−1} Π_{s=1}^{t} p_{xs} = η^{−1} Π_{s=1}^{t} η^{1−xs} p′_{xs} = Π_{s=1}^{t} p′_{xs} = P[H′ = (x1 , . . . , xt )].

Since this is true for all valid histories and the processes can be recovered from
their histories, we have proved the claim.

Example 6.2.3 (Poisson branching process). Let (Zi ) be a Galton-Watson branch-


ing process with offspring distribution Poi(λ) where λ > 1. Then the dual proba-
bility distribution is given by

p′_k = η^{k−1} pk = η^{k−1} e^{−λ} λ^k /k! = η^{−1} e^{−λ} (λη)^k /k! ,

where recall from Example 6.1.10 that e^{−λ(1−η)} = η, so

p′_k = e^{λ(1−η)} e^{−λ} (λη)^k /k! = e^{−λη} (λη)^k /k! .
That is, the dual branching process has offspring distribution Poi(λη). J
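This identity can be double-checked numerically: with an illustrative value such as λ = 2, the weights η^{k−1} p_k should coincide term by term with the Poi(λη) probability mass function, and the dual mean λη should fall below 1. A sketch:

```python
import math

lam = 2.0
eta = 0.0
for _ in range(1000):          # iterates f^(t)(0) converge up to eta (Lemma 6.1.9)
    eta = math.exp(-lam * (1.0 - eta))

def pois_pmf(mu, k):
    return math.exp(-mu) * mu ** k / math.factorial(k)

dual = [eta ** (k - 1) * pois_pmf(lam, k) for k in range(40)]   # p'_k = eta^{k-1} p_k
target = [pois_pmf(lam * eta, k) for k in range(40)]            # Poi(lam * eta) pmf
```

The two lists agree to machine precision, and λη ≈ 0.41 < 1, confirming that the dual process is subcritical.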

6.2.3 Hitting-time theorem


The random-walk representation also gives a formula for the distribution of the size
of the progeny.

Law of total progeny The key is the following claim.

Lemma 6.2.4 (Total progeny and random-walk representation). Let W be the total
progeny of the Galton-Watson branching process (Zi ). Then

W = τ0 .

Proof. Recall that


τ0 := inf{t ≥ 0 : At = 0}.
If the process does not go extinct, then τ0 = +∞ as there are always more vertices
to explore.
Suppose the process goes extinct and that W = n. Notice that Et = t for all
t ≤ τ0 , as exactly one vertex is explored at each time until the set of active vertices
is empty. Moreover, for all t, (At , Et , Nt ) forms a partition of [n] so

At + t + Nt = n, ∀t ≤ τ0 .

At t = τ0 , At = Nt = 0 and we get

τ0 = n.

That proves the claim.
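Lemma 6.2.4 gives a convenient way to sample W: run the stopped walk (St ) rather than grow the tree. A sketch, using an arbitrary subcritical offspring law {p0 , p1 , p2 } = {1/2, 1/4, 1/4} with mean m = 3/4 chosen for illustration (summing E[Zi ] = m^i over generations gives E[W] = 1/(1 − m) = 4):

```python
import random

def total_progeny(offspring, rng, cap=10 ** 6):
    """Sample W = tau_0 by running S_t = 1 + sum_{s<=t}(X_s - 1) until it hits 0.
    Returns None if the walk has not hit 0 by time cap (a proxy for survival)."""
    s, t = 1, 0
    while s > 0:
        if t >= cap:
            return None
        s += offspring(rng) - 1
        t += 1
    return t

def offspring(rng):
    # {p_0, p_1, p_2} = {1/2, 1/4, 1/4}, mean m = 3/4 (subcritical)
    u = rng.random()
    return 0 if u < 0.5 else (1 if u < 0.75 else 2)

rng = random.Random(1)
samples = [w for w in (total_progeny(offspring, rng) for _ in range(20000))
           if w is not None]
mean_w = sum(samples) / len(samples)  # should be near 1/(1 - m) = 4
```

Since the process is subcritical, extinction is almost sure and the cap is essentially never reached; the empirical mean lands close to 4.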

To compute the distribution of W = τ0 , we use the following hitting-time


theorem, which is proved later in this subsection.

Theorem 6.2.5 (Hitting-time theorem). Let (Rt ) be a random walk started at 0


with i.i.d. increments (Ut ) satisfying

P[Ut ≤ 1] = 1.

Fix a positive integer ℓ. Let σℓ be the first time t such that Rt = ℓ. Then

P[σℓ = t] = (ℓ/t) P[Rt = ℓ].
Finally we get:

Theorem 6.2.6 (Law of total progeny). Let (Zt ) be a Galton-Watson branching


process with total progeny W . In the random-walk representation of (Zt ),

P[W = t] = (1/t) P[X1 + · · · + Xt = t − 1],

for all t ≥ 1.

Proof. Recall that Yt := Xt − 1 ≥ −1 and

St = 1 + Σ_{s=1}^{t} Ys ,

with S0 = 1, and that

τ0 = inf{t ≥ 0 : St = 0}
= inf{t ≥ 0 : 1 + (X1 − 1) + · · · + (Xt − 1) = 0}
= inf{t ≥ 0 : X1 + · · · + Xt = t − 1}.

Define Rt := 1 − St and Ut := −Yt for all t. Then R0 := 0,

{X1 + · · · + Xt = t − 1} = {Rt = 1},

and
τ0 = inf{t ≥ 0 : Rt = 1}.
The process (Rt ) satisfies the assumptions of the hitting-time theorem (Theo-
rem 6.2.5) with ℓ = 1 and σℓ = τ0 = W . Applying the theorem gives the
claim.

Example 6.2.7 (Poisson branching process (continued)). Let (Zi ) be a Galton-


Watson branching process with offspring distribution Poi(λ) where λ > 0. Let W
be its total progeny. By the hitting-time theorem, for t ≥ 1,

P[W = t] = (1/t) P[X1 + · · · + Xt = t − 1]
= (1/t) e^{−λt} (λt)^{t−1} /(t − 1)!
= e^{−λt} (λt)^{t−1} /t! ,
where we used that a sum of independent Poisson is Poisson. J
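This is the Borel distribution with parameter λ. Its pmf can be sanity-checked numerically: in the subcritical case the probabilities should sum to 1, and the mean should be 1/(1 − λ) (summing E[Zi ] = λ^i over generations). A sketch with the illustrative value λ = 0.5:

```python
import math

lam = 0.5  # subcritical, so W < +infinity almost surely

def total_progeny_pmf(lam, t):
    # P[W = t] = e^{-lam t} (lam t)^{t-1} / t!   (Example 6.2.7)
    return math.exp(-lam * t) * (lam * t) ** (t - 1) / math.factorial(t)

ts = range(1, 151)
probs = [total_progeny_pmf(lam, t) for t in ts]
total = sum(probs)                            # should be ~1
mean = sum(t * p for t, p in zip(ts, probs))  # should be ~1/(1 - lam) = 2
```

Truncating at t = 150 is enough here since the tail of the distribution decays geometrically for λ < 1.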

Spitzer’s combinatorial lemma Before proving the hitting-time theorem, we


begin with a combinatorial lemma of independent interest. Let u1 , . . . , ut ∈ R and
define r0 := 0 and rj := u1 +· · ·+uj for 1 ≤ j ≤ t. We say that j is a ladder index
if rj > r0 ∨ · · · ∨ rj−1 . Consider the cyclic permutations of u = (u1 , . . . , ut ), that
is, u(0) = u, u(1) = (u2 , . . . , ut , u1 ), . . . , u(t−1) = (ut , u1 , . . . , ut−1 ). Define
the corresponding partial sums r_j^{(β)} := u_1^{(β)} + · · · + u_j^{(β)} for j = 1, . . . , t and
β = 0, . . . , t − 1.

Lemma 6.2.8 (Spitzer’s combinatorial lemma). Assume rt > 0. Let ℓ be the number
of cyclic permutations such that t is a ladder index. Then ℓ ≥ 1 and each such
cyclic permutation has exactly ℓ ladder indices.

Proof. We will need the following observation


(r_1^{(β)} , . . . , r_t^{(β)} )
= (rβ+1 − rβ , rβ+2 − rβ , . . . , rt − rβ ,
[rt − rβ ] + r1 , [rt − rβ ] + r2 , . . . , [rt − rβ ] + rβ )
= (rβ+1 − rβ , rβ+2 − rβ , . . . , rt − rβ ,
rt − [rβ − r1 ], rt − [rβ − r2 ], . . . , rt − [rβ − rβ−1 ], rt ). (6.2.1)

We first show that ℓ ≥ 1, that is, there is at least one cyclic permutation where
t is a ladder index. Let β ≥ 1 be the smallest index achieving the maximum of
r1 , . . . , rt , that is,

rβ > r1 ∨ · · · ∨ rβ−1 and rβ ≥ rβ+1 ∨ · · · ∨ rt .

Moreover, rt > 0 = r0 by assumption. Hence,

rβ+j − rβ ≤ 0 < rt , ∀j = 1, . . . , t − β,

and
rt − [rβ − rj ] < rt , ∀j = 1, . . . , β − 1.
From (6.2.1), in u(β) , t is a ladder index.
For the second claim, since ℓ ≥ 1, we can assume without loss of generality
that u is such that t is a ladder index. (Note that r_t^{(β)} = rt for all β.) We show
that β is a ladder index in u if and only if t is a ladder index in u(β) . That does
indeed imply the claim as there are ℓ cyclic permutations where t is a ladder index
by assumption. We use (6.2.1) again. Observe that β is a ladder index in u if and
only if
rβ > r0 ∨ · · · ∨ rβ−1 ,
which holds if and only if

rβ > r0 = 0 and rt − [rβ − rj ] < rt , ∀j = 1, . . . , β − 1. (6.2.2)

Moreover, because rt > rj for all j by the assumption that t is a ladder index, the
last display holds if and only if

rβ+j − rβ < rt , ∀j = 1, . . . , t − β, (6.2.3)



and
rt − [rβ − rj ] < rt , ∀j = 1, . . . , β − 1, (6.2.4)
that is, if and only if t is a ladder index in u(β)
by (6.2.1). Indeed, the second
condition (i.e., (6.2.4)) is intact from (6.2.2), while the first one (i.e., (6.2.3)) can
be rewritten as rβ > −(rt − rβ+j ) where the right-hand side is < 0 for j =
1, . . . , t − β − 1 and = 0 for j = t − β.
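Spitzer’s lemma is easy to confirm by brute force on small random inputs; a sketch (function names are ours, not from the text):

```python
import random

def ladder_indices(u):
    # j is a ladder index if r_j > max(r_0, ..., r_{j-1}), with r_0 = 0
    r, best, out = 0.0, 0.0, []
    for j, x in enumerate(u, start=1):
        r += x
        if r > best:
            out.append(j)
            best = r
    return out

def check_spitzer(trials=500, seed=0):
    rng = random.Random(seed)
    for _ in range(trials):
        t = rng.randint(2, 8)
        u = [rng.uniform(-1, 1) for _ in range(t)]
        if sum(u) <= 0:          # the lemma assumes r_t > 0
            continue
        cyc = [u[b:] + u[:b] for b in range(t)]          # u^{(0)}, ..., u^{(t-1)}
        ell = sum(t in ladder_indices(c) for c in cyc)
        if ell < 1:
            return False
        # every permutation where t is a ladder index has exactly ell of them
        for c in cyc:
            if t in ladder_indices(c) and len(ladder_indices(c)) != ell:
                return False
    return True

assert check_spitzer()
```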

Proof of hitting-time theorem We are now ready to prove the hitting-time the-
orem. We only handle the case ` = 1 (which is the one we used for the law of the
total progeny). Exercise 6.6 asks for the full proof.
Proof of Theorem 6.2.5. Recall that Rt = U1 + · · · + Ut and σ1 = inf{j ≥ 0 : Rj = 1}.
By the assumption that Us ≤ 1 almost surely for all s,

{σ1 = t} = {t is the first ladder index in R1 , . . . , Rt }.
By symmetry, for all β = 0, . . . , t − 1,

P[t is the first ladder index in R1 , . . . , Rt ]
    = P[t is the first ladder index in R_1^{(β)} , . . . , R_t^{(β)} ].

Let Eβ be the event on the last line. Then,

P[σ1 = t] = E[1_{E_0}] = (1/t) E[ Σ_{β=0}^{t−1} 1_{E_β} ].

By Spitzer’s combinatorial lemma (Lemma 6.2.8), there is at most one cyclic
permutation where t is the first ladder index. (There is at least one cyclic permu-
tation where t is a ladder index—but it may not be the first one, that is, there may
be multiple ladder indices.) In particular, Σ_{β=0}^{t−1} 1_{E_β} ∈ {0, 1}. So, by the previous
display,

P[σ1 = t] = (1/t) P[ ∪_{β=0}^{t−1} E_β ].
Finally we claim that {Rt = 1} = ∪_{β=0}^{t−1} Eβ . Indeed, because R0 = 0 and
Us ≤ 1 for all s, the partial sum at the j-th ladder index must take value j. So the
event ∪_{β=0}^{t−1} Eβ implies {Rt = 1} since the last partial sum of all cyclic permutations
is Rt . Similarly, because there is at least one cyclic permutation such that t is a
ladder index, the event {Rt = 1} implies that t is in fact the first ladder index in
that cyclic permutation, and therefore it implies ∪_{β=0}^{t−1} Eβ . Hence,

P[σ1 = t] = (1/t) P[Rt = 1] ,
which concludes the proof (for the case ` = 1).
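The identity P[σ1 = t] = (1/t) P[Rt = 1] can be confirmed by exact enumeration for a small walk whose increments Us = Xs − 1 are bounded by 1; the offspring law below is an arbitrary choice of ours, not from the text:

```python
from itertools import product

def check_hitting_time(offspring_law, t_max=6, tol=1e-12):
    # walk R_j = U_1 + ... + U_j with i.i.d. increments U = X - 1 <= 1,
    # sigma_1 = first j with R_j = 1; verify P[sigma_1 = t] = P[R_t = 1]/t
    for t in range(1, t_max + 1):
        lhs = rhs = 0.0
        for xs in product(offspring_law, repeat=t):
            prob = 1.0
            for x in xs:
                prob *= offspring_law[x]
            r, hit = 0, None
            for j, x in enumerate(xs, start=1):
                r += x - 1
                if r == 1 and hit is None:
                    hit = j
            if hit == t:
                lhs += prob
            if r == 1:
                rhs += prob
        if abs(lhs - rhs / t) > tol:
            return False
    return True

assert check_hitting_time({0: 0.5, 1: 0.3, 2: 0.2})
```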

6.2.4 . Percolation: critical exponents on the infinite b-ary tree


In this section, we use branching processes to study bond percolation (Defini-
tion 1.2.1) on the infinite b-ary tree T̂b and derive explicit expressions for quan-
tities of interest. Close to the critical value, we prove the existence of “critical
exponents.” We illustrate the use of both the duality principle (Theorem 6.2.2) and
the hitting-time theorem (Theorem 6.2.6).

Critical value We denote the root by 0. Similarly to what we did in Section 6.1.3,
we think of the open cluster of the root, C0 , as the progeny of a branching process
as follows. Denote by ∂n the n-th level of T̂b , that is, the vertices of T̂b at graph
distance n from the root. In the branching process interpretation, we think of the
immediate descendants in C0 of a vertex v as the offspring of v. By construction, v
has at most b children, independently of all other vertices in the same generation.
In this branching process, the offspring distribution {qk}_{k=0}^{b} is binomial with pa-
rameters b and p; Zn := |C0 ∩ ∂n | represents the size of the progeny at generation
n; and W := |C0 | is the total progeny of the process. In particular |C0 | < +∞ if
and only if the process goes extinct. Because the mean number of offspring is bp,
by Theorem 6.1.6, this leads immediately to a second proof of (a rooted variant of)
Claim 2.3.9:

Claim 6.2.9. pc (T̂b ) = 1/b.

Percolation function The generating function of the offspring distribution is


φ(s) := ((1 − p) + ps)^b . So, by Theorems 6.1.5 and 6.1.6, the percolation function

θ(p) = Pp [|C0 | = +∞],

is 0 on [0, 1/b], while on (1/b, 1] the quantity η(p) := 1 − θ(p) is the unique
solution in [0, 1) of the fixed point equation

s = ((1 − p) + ps)^b . (6.2.5)

For b = 2, for instance, we can compute the fixed point explicitly by noting that

0 = ((1 − p) + ps)^2 − s = p^2 s^2 + [2p(1 − p) − 1]s + (1 − p)^2 ,

whose solution for p ∈ (1/2, 1] is

s∗ = ( −[2p(1 − p) − 1] ± √([2p(1 − p) − 1]^2 − 4p^2 (1 − p)^2) ) / (2p^2)
   = ( −[2p(1 − p) − 1] ± √(1 − 4p(1 − p)) ) / (2p^2)
   = ( −[2p(1 − p) − 1] ± (2p − 1) ) / (2p^2)
   = ( 2p^2 + [(1 − 2p) ± (2p − 1)] ) / (2p^2).
So, rejecting the fixed point 1,

θ(p) = 1 − (2p^2 + 2(1 − 2p))/(2p^2) = (2p − 1)/p^2 .
We have proved:

Claim 6.2.10. For b = 2,

θ(p) = 0 for 0 ≤ p ≤ 1/2, and θ(p) = 2(p − 1/2)/p^2 for 1/2 < p ≤ 1.

Since η(p) = 1 − θ(p), we have in that case

η(p) = 1 for 0 ≤ p ≤ 1/2, and η(p) = (1 − p)^2/p^2 for 1/2 < p ≤ 1.
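These closed forms are easy to sanity-check against the fixed point equation (6.2.5); a quick numerical sketch for b = 2 (function names are ours):

```python
def theta(p):
    # percolation function for b = 2 (Claim 6.2.10)
    return 0.0 if p <= 0.5 else 2 * (p - 0.5) / p ** 2

def check_fixed_point():
    for p in [0.51, 0.6, 0.75, 0.9, 1.0]:
        eta = 1 - theta(p)
        # eta(p) must solve (6.2.5), s = ((1 - p) + p s)^b with b = 2,
        # and lie in [0, 1)
        if abs(eta - ((1 - p) + p * eta) ** 2) > 1e-12 or not (0 <= eta < 1):
            return False
    return True

assert check_fixed_point()
```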

Conditioning on a finite cluster The expected size of the population at genera-
tion n is (bp)^n by Lemma 6.1.2, so for p ∈ [0, 1/b)

Ep |C0 | = Σ_{n≥0} (bp)^n = 1/(1 − bp). (6.2.6)

For p ∈ (1/b, 1), the total progeny is infinite with positive probability (and in par-
ticular the expectation is infinite), but we can compute the expected cluster size on
the event that |C0 | < +∞. For this purpose we use the duality principle.

Recall that qk = (b choose k) p^k (1 − p)^{b−k} , k = 0, . . . , b, is the offspring distribution.
For 0 ≤ k ≤ b, we let the dual offspring distribution be

q̂k := [η(p)]^{k−1} qk
    = [η(p)]^{k−1} (b choose k) p^k (1 − p)^{b−k}
    = ([η(p)]^k / ((1 − p) + p η(p))^b) (b choose k) p^k (1 − p)^{b−k}
    = (b choose k) ( p η(p)/((1 − p) + p η(p)) )^k ( (1 − p)/((1 − p) + p η(p)) )^{b−k}
    =: (b choose k) p̂^k (1 − p̂)^{b−k} ,

where we used (6.2.5) and implicitly defined the dual density

p̂ := p η(p) / ((1 − p) + p η(p)). (6.2.7)

In particular {q̂k } is a probability distribution as expected under Theorem 6.2.2—


it is in fact binomial with parameters b and p̂. Summarizing the implications of
Theorem 6.2.2:

Claim 6.2.11. Conditioned on |C0 | < +∞, (supercritical) percolation on T̂b with
density p ∈ (1/b, 1) has the same distribution as (subcritical) percolation on T̂b with
density defined by (6.2.7).

Hence, using (6.2.6) with both p and p̂ as well as the fact that Pp [|C0 | < +∞] =
η(p), we have the following.

Claim 6.2.12.

χ^f (p) := Ep [ |C0 | 1{|C0 |<+∞} ] = 1/(1 − bp) for p ∈ [0, 1/b), and = η(p)/(1 − bp̂) for p ∈ (1/b, 1).
For b = 2, η(p) = 1 − θ(p) = ((1 − p)/p)^2 , so

p̂ = p ((1 − p)/p)^2 / ((1 − p) + p ((1 − p)/p)^2)
  = (1 − p)^2 / (p(1 − p) + (1 − p)^2)
  = 1 − p,
and

Claim 6.2.13. For b = 2,

χ^f (p) = (1/2)/(1/2 − p) for p ∈ [0, 1/2), and χ^f (p) = (1/2)((1 − p)/p)^2/(p − 1/2) for p ∈ (1/2, 1).
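The identity p̂ = 1 − p for b = 2, and the agreement between Claims 6.2.12 and 6.2.13, can also be checked numerically; a sketch (names are ours):

```python
def eta2(p):
    # eta(p) = ((1 - p)/p)^2 for b = 2, supercritical regime
    return ((1 - p) / p) ** 2

def dual_density(p):
    # (6.2.7) specialized to b = 2
    return p * eta2(p) / ((1 - p) + p * eta2(p))

def check_duality():
    for p in [0.55, 0.6, 0.75, 0.9]:
        if abs(dual_density(p) - (1 - p)) > 1e-12:
            return False
        # Claim 6.2.12 at b = 2 must agree with Claim 6.2.13
        lhs = eta2(p) / (1 - 2 * dual_density(p))
        rhs = 0.5 * eta2(p) / (p - 0.5)
        if abs(lhs - rhs) > 1e-9:
            return False
    return True

assert check_duality()
```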

Distribution of the open cluster size In fact the hitting-time theorem gives an
explicit formula for the distribution of |C0 |. Namely, recall that |C0 | is equal in
distribution to τ0 where

τ0 = inf{t ≥ 0 : St = 0},

for St = Σ_{`≤t} X` − (t − 1) where S0 = 1 and the X` s are i.i.d. binomial with
parameters b and p. By Theorem 6.2.6,

P[τ0 = t] = (1/t) P[St = 0],
and we have

Pp [|C0 | = `] = (1/`) P[ Σ_{i≤`} Xi = ` − 1 ] = (1/`) (b` choose ` − 1) p^{`−1} (1 − p)^{b`−(`−1)} , (6.2.8)

where we used that a sum of independent binomials with the same p is itself bino-
mial. In particular at criticality (where |C0 | < +∞ almost surely; see Claim 3.1.52),
using Stirling’s formula (see Appendix A) it can be checked that

Ppc [|C0 | = `] ∼ (1/`) · 1/√(2π pc (1 − pc ) b`) = 1/√(2π (1 − pc ) `^3) ,

as ` → +∞.
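Formula (6.2.8) can be checked numerically in the subcritical regime, where the probabilities must sum to 1 and the mean must match (6.2.6); a sketch (names are ours):

```python
import math

def cluster_pmf(ell, p, b):
    # (6.2.8): P_p[|C_0| = ell] on the infinite b-ary tree
    return (math.comb(b * ell, ell - 1) * p ** (ell - 1)
            * (1 - p) ** (b * ell - (ell - 1)) / ell)

def check_subcritical(b, p, n_terms=250):
    total = sum(cluster_pmf(l, p, b) for l in range(1, n_terms))
    mean = sum(l * cluster_pmf(l, p, b) for l in range(1, n_terms))
    # probabilities sum to 1 and the mean cluster size is 1/(1 - bp)
    return abs(total - 1) < 1e-9 and abs(mean - 1 / (1 - b * p)) < 1e-9

assert check_subcritical(2, 0.3)
assert check_subcritical(3, 0.2)
```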

Critical exponents Close to criticality, physicists predict that many quantities
behave according to power laws of the form |p − pc |^β , where the exponent is re-
ferred to as a critical exponent. The critical exponents are believed to satisfy certain
“universality” properties. But even proving the existence of such exponents in gen-
eral remains a major open problem. On trees, though, we can simply read off the
critical exponents from the above formulas. For b = 2, Claims 6.2.10 and 6.2.13
imply for instance that, as p → pc ,

θ(p) ∼ 8(p − pc )1{p>1/2} ,



and

χ^f (p) ∼ (1/2) |p − pc |^{−1} .
In fact, as can be seen from Claim 6.2.12, the critical exponent of χf (p) does not
depend on b. The same holds for θ(p) (see Exercise 6.9). Using (6.2.8), the higher
moments of |C0 | can also be studied around criticality (see Exercise 6.10).
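Both exponents for b = 2 can be read off numerically from the closed forms; a quick check (a sketch, with our own tolerances):

```python
def theta(p):
    # Claim 6.2.10 (b = 2)
    return 0.0 if p <= 0.5 else 2 * (p - 0.5) / p ** 2

pc, eps = 0.5, 1e-6
# theta(p) ~ 8 (p - pc) as p -> pc from above
ratio_theta = theta(pc + eps) / eps
# chi^f(p) ~ (1/2)|p - pc|^{-1}, from below and from above (Claim 6.2.13)
ratio_sub = (0.5 / (0.5 - (pc - eps))) * eps
ratio_sup = (0.5 * ((1 - (pc + eps)) / (pc + eps)) ** 2 / eps) * eps
assert abs(ratio_theta - 8) < 1e-3
assert abs(ratio_sub - 0.5) < 1e-3
assert abs(ratio_sup - 0.5) < 1e-3
```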

6.3 Applications
We develop two applications of branching processes in discrete probability. First,
we prove a result about the height of a random binary search tree. Then we describe
a phase transition in an Ising model on a tree with applications to evolutionary
biology. In the next section, we also use branching processes to study the phase
transition of the Erdős-Rényi random graph model.

6.3.1 . Probabilistic analysis of algorithms: binary search tree


A binary search tree (BST) is a commonly used data structure in computer science.
It consists of a rooted binary tree Tn = (Vn , En ). Each vertex has a “left” and
“right” subtree (possibly empty) and a “key” from an input sequence x1 , . . . , xn ∈
R (which we assume are distinct) that satisfies the BST property: the key at vertex
v ∈ V is greater than all keys in the left subtree below it and less than all keys
in the right subtree below it. Such a data structure can be used for a variety of
algorithmic tasks, such as searching for keys or sorting them.
The tree is constructed recursively as follows. Assume that the keys x1 , . . . , xi
have already been inserted and that the current tree Ti satisfies the BST property.
To insert xi+1 :
- start at the root;

- if the root’s key is strictly larger than xi+1 , then move to its left descendant,
otherwise move to its right descendant;

- if such a descendant does not exist then create it and assign it xi+1 as its key;

- otherwise repeat.
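The insertion procedure above translates directly into code. A sketch (class and function names are ours, not from the text):

```python
class Node:
    __slots__ = ("key", "left", "right")
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None

def insert(root, key):
    # walk down from the root as described above, then create the leaf
    if root is None:
        return Node(key)
    cur = root
    while True:
        if cur.key > key:            # root's key strictly larger: go left
            if cur.left is None:
                cur.left = Node(key)
                return root
            cur = cur.left
        else:                        # otherwise: go right
            if cur.right is None:
                cur.right = Node(key)
                return root
            cur = cur.right

def height(root):
    # length (in edges) of the longest root-to-leaf path; empty tree: -1
    if root is None:
        return -1
    return 1 + max(height(root.left), height(root.right))

def bst_height(keys):
    root = None
    for k in keys:
        root = insert(root, k)
    return height(root)

# keys inserted in increasing order produce a path, the worst case
assert bst_height(range(6)) == 5
assert bst_height([3, 1, 4, 1.5, 9, 2.6, 5]) == 3
```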
Inserting keys (and other operations such as deleting keys, which we do not de-
scribe) takes time proportional to the height Hn of the tree Tn , that is, the length
of the longest path from the root to a leaf. While, in general, the height can be as
large as n (if keys are inserted in order for instance), the typical behavior can be
much smaller.

Indeed, here we study the case of n keys X1 , . . . , Xn i.i.d. from a continuous
distribution on R and establish a much better behavior for the random height. Let
γ be the unique solution greater than 1 of

(1/e) (2e/γ)^γ = 1. (6.3.1)
See Exercise 6.14 for a proof that γ is well-defined and that the left-hand side is
strictly decreasing at γ. We show:
Claim 6.3.1. Hn / log n →p γ as n → ∞.
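The constant γ is easily computed from (6.3.1) by bisection, since the left-hand side decreases through 1 for c > 2; one finds γ ≈ 4.311. A sketch (our own code, not from the text):

```python
import math

def lhs(c):
    # left-hand side of (6.3.1): (1/e) (2e/c)^c
    return (1 / math.e) * (2 * math.e / c) ** c

# lhs > 1 at c = 3, lhs < 1 at c = 10, and lhs is decreasing on (2, inf)
lo, hi = 3.0, 10.0
for _ in range(100):
    mid = (lo + hi) / 2
    if lhs(mid) > 1:
        lo = mid
    else:
        hi = mid
gamma = (lo + hi) / 2
assert abs(gamma - 4.311) < 1e-3
assert lhs(gamma - 0.1) > 1 > lhs(gamma + 0.1)   # strictly decreasing through 1
```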

Alternative representation of the height


The main idea of the proof is to relate the height Hn of the tree Tn to a product
of independent uniform random variables. We make a series of observations about
the structure of the tree. First:
Observation 1. Keys affect the construction of the binary search tree
only through their ordering. Let σ be the corresponding (random)
permutation, that is,

Xσ(1) < Xσ(2) < · · · < Xσ(n) .

Let t[σ] be the binary search tree generated by the permutation σ.


Second, by symmetry:
Observation 2. The permutation σ is uniformly distributed.
Denote by Sv the size of the subtree rooted at v (including v itself) in t[σ]. At the
root ρ, we have Sρ = n. What is the size of the subtree rooted at the left descendant
ρ0 of ρ? Eventually all keys with a rank lower than σ −1 (1), that is, those keys with
indices in {σ(i) : i < σ −1 (1)}, find their way into the left subtree of the root. In
other words,

Sρ0 = σ −1 (1) − 1.

Similarly, denoting by ρ00 the right descendant of ρ, we see that

Sρ00 = n − σ −1 (1).

We refer to σ −1 (1) as the rank of the root. By Observation 2:

Observation 3. The rank σ −1 (1) of the root is uniformly distributed
in [n]. Moreover it is identically distributed to ⌊Sρ Wρ ⌋ + 1, where Wρ
is uniform in [0, 1].

The second part of this last observation can be checked by direct computation.
Rename X′1 , . . . , X′Sρ0 the keys in the subtree rooted at ρ0 in the order that they are
inserted and let σ 0 be the (random) permutation corresponding to their ordering,
that is,

X′σ0 (1) < X′σ0 (2) < · · · < X′σ0 (Sρ0 ) .

Define σ 00 similarly for ρ00 . Again by symmetry:


Observation 4. Conditioned on σ −1 (1) (and therefore on Sρ0 and
Sρ00 ), the permutations σ 0 and σ 00 are independent and uniformly dis-
tributed.
Finally, recursively:
Observation 5. The binary search tree t[σ] is obtained by appending
the left subtree t[σ 0 ] and right subtree t[σ 00 ] to the root ρ.
If Sρ0 = 0, then t[σ 0 ] = ∅ (and there is in fact no ρ0 ); while, if Sρ0 = 1, the tree t[σ 0 ]
is comprised of the single vertex ρ0 . Similarly for σ 00 . Hence this recursive process
stops whenever we reach a vertex v with Sv ∈ {0, 1}. But it will be convenient to
extend it indefinitely to produce an infinite binary tree T = T̂2 , where all additional
vertices v are assigned Sv = 0.
The upshot of all these observations is that we obtain the following alternative
characterization of the height Hn :
- assign an independent U [0, 1] (i.e., uniform in [0, 1]) random variable Wv to
each vertex v in the infinite binary tree T ;

- at the root ρ, set


Sρ = n;

- then recursively from the root down, set

Sv0 := ⌊Sv Wv ⌋ and Sv00 := ⌊Sv (1 − Wv )⌋, (6.3.2)

where v 0 and v 00 are the left and right descendants of v in T .


It can be checked that Sv0 + Sv00 = Sv − 1 almost surely, provided Sv ≥ 1 (see
Exercise 6.15). Moreover notice that, when Sv = 1, then Sv0 = Sv00 = 0 almost
surely; while, if Sv = 0, then Sv0 = Sv00 = 0. Finally the height Hn is the highest
level containing a vertex with subtree size at least 1, that is,

Hn = sup {h : ∃v ∈ Lh , Sv ≥ 1} , (6.3.3)

where Lh is the set of vertices of T at graph distance h from the root.
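The recursion (6.3.2)-(6.3.3) gives a direct way to sample Hn without building a tree of keys. A simulation sketch (the asserted numerical bounds are crude finite-n checks of our own, not statements from the text; convergence of Hn / log n to γ is very slow):

```python
import math
import random

def height_via_recursion(n, rng):
    # sample H_n from (6.3.2)-(6.3.3): push subtree sizes down the tree and
    # record the deepest vertex v with S_v >= 1
    best, stack = 0, [(n, 0)]
    while stack:
        s, depth = stack.pop()
        if s >= 1:
            best = max(best, depth)
            w = rng.random()
            stack.append((int(s * w), depth + 1))        # floor(S_v W_v)
            stack.append((int(s * (1 - w)), depth + 1))  # floor(S_v (1 - W_v))
    return best

rng = random.Random(3)
n, reps = 10000, 50
avg = sum(height_via_recursion(n, rng) for _ in range(reps)) / reps
# crude finite-n check: the ratio approaches gamma ≈ 4.311 only very slowly
assert 2.3 < avg / math.log(n) < 4.45
```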



Key technical bound


Because W ∼ U [0, 1] implies also that (1 − W ) ∼ U [0, 1], we immediately get
from (6.3.2) that:

Lemma 6.3.2 (Distribution of subtree size). Let v be a vertex at topological dis-
tance ` from the root of T . Let U1 , . . . , U` be i.i.d. U [0, 1]. Then we have the
equality in distribution

Sv = ⌊· · · ⌊⌊nU1 ⌋U2 ⌋ · · · U` ⌋.

From Lemma 6.3.2 and the characterization of the height in (6.3.3), we need to
control how fast products of independent uniforms decrease. But that is only half
of the story: the number of paths of length ` from the root grows exponentially
with `. The following lemma, which takes both effects into account, will play a
key role in the analysis. It also explains the definition of γ in (6.3.1). Note that
we ignore—for the time being—the repeated rounding in Lemma 6.3.2; it will turn
out to have a minor effect.

Lemma 6.3.3 (Product of uniforms). Let U1 , U2 , . . . be i.i.d. U [0, 1]. Then

lim_{`→+∞} 2^` P[U1 · · · U` ≥ e^{−`/c}] = +∞ if c < γ, and = 0 if c > γ.

Proof. Taking logarithms turns the product on the left-hand side into a sum of
i.i.d. random variables:

2^` P[U1 · · · U` ≥ e^{−`/c}] = 2^` P[ Σ_{i=1}^{`} (− log Ui ) ≤ `/c ]. (6.3.4)

Now it is elementary to bound the right-hand side.

Lemma 6.3.4 (A tail bound). Let U1 , . . . , U` be i.i.d. U [0, 1]. Then for any y > 0

(y^` e^{−y}/`!) ≤ P[ Σ_{i=1}^{`} (− log Ui ) ≤ y ] ≤ (y^` e^{−y}/`!) · 1/(1 − y/(` + 1)). (6.3.5)

Proof. We prove a more general claim, specifically

P[ Σ_{i=1}^{`} (− log Ui ) ≤ y ] = e^{−y} Σ_{i=`}^{+∞} y^i/i! ,

from which (6.3.5) follows: the lower bound is obtained by keeping only the first
term in the sum; the upper bound is obtained by factoring out y^` e^{−y}/`! and relating
the remaining sum to a geometric series.
So it remains to prove the general claim. First note that − log U1 is exponen-
tially distributed. Indeed, for any y ≥ 0,
P[− log U1 > y] = P[U1 < e^{−y}] = e^{−y} .

So

P[− log U1 ≤ y] = 1 − e^{−y} = e^{−y} ( Σ_{i=0}^{+∞} y^i/i! − 1 ) = e^{−y} Σ_{i=1}^{+∞} y^i/i! ,

as claimed in the base case ` = 1.


Proceeding by induction, suppose the claim holds up to ` − 1. Then

P[ Σ_{i=1}^{`} (− log Ui ) > y ]
    = ∫_0^{+∞} e^{−z} P[ Σ_{i=1}^{`−1} (− log Ui ) > y − z ] dz
    = e^{−y} + ∫_0^{y} e^{−z} P[ Σ_{i=1}^{`−1} (− log Ui ) > y − z ] dz
    = e^{−y} + ∫_0^{y} e^{−z} e^{−(y−z)} Σ_{i=0}^{`−2} (y − z)^i/i! dz
    = e^{−y} + e^{−y} Σ_{i=0}^{`−2} y^{i+1}/(i!(i + 1))
    = e^{−y} Σ_{j=0}^{`−1} y^j/j! .

That proves the claim.
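The general claim is the CDF of a sum of ` i.i.d. Exp(1) variables (an Erlang distribution), and is easy to test by simulation; a sketch (parameters are our choices):

```python
import math
import random

def claim_cdf(ell, y, terms=120):
    # the claim: P[sum_{i=1..ell} (-log U_i) <= y] = e^{-y} sum_{i>=ell} y^i/i!,
    # accumulated term by term to avoid overflow in the factorials
    total, term = 0.0, math.exp(-y) * y ** ell / math.factorial(ell)
    for i in range(ell, ell + terms):
        total += term
        term *= y / (i + 1)
    return total

rng = random.Random(7)
ell, y, n = 5, 3.0, 200000
hits = sum(
    sum(-math.log(1.0 - rng.random()) for _ in range(ell)) <= y for _ in range(n)
)
assert abs(hits / n - claim_cdf(ell, y)) < 0.01
```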

We return to the proof of Lemma 6.3.3. Plugging (6.3.5) into (6.3.4), we get

2^` (`/c)^` e^{−(`/c)}/`! ≤ 2^` P[U1 · · · U` ≥ e^{−`/c}]
                          ≤ (2^` (`/c)^` e^{−(`/c)}/`!) · 1/(1 − (`/c)/(` + 1)). (6.3.6)

As ` → +∞,

1/(1 − (`/c)/(` + 1)) → 1/(1 − 1/c), (6.3.7)

which is positive when c > 1. We will use the standard bound (see Exercise 1.3
for a proof)

`^`/e^{`−1} ≤ `! ≤ `^{`+1}/e^{`−1} .

It implies immediately that

2^` (`/c)^` e^{−(`/c)} e^{`−1}/`^{`+1} ≤ 2^` (`/c)^` e^{−(`/c)}/`! ≤ 2^` (`/c)^` e^{−(`/c)} e^{`−1}/`^` ,

which after simplifying gives

(e`)^{−1} [(1/e)(2e/c)^c]^{`/c} ≤ 2^` (`/c)^` e^{−(`/c)}/`! ≤ e^{−1} [(1/e)(2e/c)^c]^{`/c} . (6.3.8)

By (6.3.1) and the remark following it, the expression in square brackets is > 1 or
< 1 depending on whether c < γ or c > γ. Combining (6.3.6), (6.3.7) and (6.3.8)
and taking a limit as ` → +∞ gives the claim.

As an immediate consequence of Lemma 6.3.3, we bound the height from
above. Fix any ε > 0 and let h := (γ + ε) log n. We use a union bound as
follows:

P[Hn ≥ h] = P[ ∪_{v∈Lh} {Sv ≥ 1} ] ≤ Σ_{v∈Lh} P[Sv ≥ 1] = 2^h P[Sv ≥ 1], (6.3.9)

for any v ∈ Lh , where the first equality follows from (6.3.3). Since

⌊· · · ⌊⌊nU1 ⌋U2 ⌋ · · · Uh ⌋ ≤ nU1 U2 · · · Uh ,

Lemmas 6.3.2 and 6.3.3 imply that

2^h P[Sv ≥ 1] ≤ 2^h P[nU1 U2 · · · Uh ≥ 1] = 2^h P[U1 U2 · · · Uh ≥ e^{−h/(γ+ε)}] → 0, (6.3.10)

as h → +∞. From (6.3.9) and (6.3.10), we obtain finally that for any ε > 0

P [Hn / log n ≥ γ + ε] → 0,

as n → +∞, which establishes one direction of Claim 6.3.1.



Lower bounding the height: a branching process


Establishing the other direction is where branching processes enter the scene. We
will need some additional notation. Fix c < γ and let ` be a positive integer that
will be set later on. For any pair of vertices v, w ∈ T with w a descendant of v,
let Q[v, w] be the set of vertices on the path between v and w, including v but
excluding w. Further, recalling (6.3.2), define

U[v, w] = Π_{z∈Q[v,w]} U_z^{v,w} ,

where U_z^{v,w} = Wz (respectively 1 − Wz ) if the path from v to w takes the left
(respectively right) edge upon exiting z. Denote by L` [v] the set of descendant
vertices of v in T at graph distance ` from v and consider the random subset

L∗` [v] = { w ∈ L` [v] : U[v, w] ≥ e^{−`/c} }.

Fix a vertex u ∈ T . We define the following Galton-Watson branching process.

- Initialize Z_0^{u,`} := 1 and u_{0,1} := u.

- For t ≥ 1, set

  Z_t^{u,`} = Σ_{r=1}^{Z_{t−1}^{u,`}} |L∗` [u_{t−1,r} ]| ,

  and let u_{t,1} , . . . , u_{t,Z_t^{u,`}} be the vertices in ∪_{r=1}^{Z_{t−1}^{u,`}} L∗` [u_{t−1,r} ] from left to right.

In words, Z_1^{u,`} counts the number of vertices ` levels below u whose subtree sizes
(ignoring rounding) have not decreased “too much” compared to that of u (in the
sense of Lemma 6.3.3). We let such vertices (if any) be u_{1,1} , . . . , u_{1,Z_1^{u,`}} . Sim-
ilarly, Z_2^{u,`} counts the same quantity over all vertices ` levels below the vertices
u_{1,1} , . . . , u_{1,Z_1^{u,`}} , and so forth.
Because the Wv s are i.i.d., this process is indeed a Galton-Watson branching
process. The expectation of the offspring distribution (which by symmetry does
not depend on the choice of u) is
m = E[Z_1^{u,`}] = 2^` P[U1 · · · U` ≥ e^{−`/c}] ,

where we used the notation of Lemma 6.3.3. By that lemma, we can choose `
large enough that m > 1. Fix such an ` for the rest of the proof. In that case, by
Theorem 6.1.6, the process survives with probability 1 − η for some 0 ≤ η < 1.
The relevance of this observation can be seen from taking u = ρ.

Claim 6.3.5. Let c0 < c. Conditioned on survival of (Z_t^{ρ,`}), for n large enough
Hn ≥ c0 log n − θn ` almost surely for some θn ∈ [0, 1).

Proof. To account for the rounding, we will need the inequality

⌊· · · ⌊⌊nU1 ⌋U2 ⌋ · · · Us ⌋ ≥ nU1 U2 · · · Us − s, (6.3.11)

which holds for all n, s ≥ 1, as can be checked by induction. Write s = k` for
some positive integer k to be determined. Conditioned on survival of (Z_t^{ρ,`}), the
population at generation k satisfies

Z_k^{ρ,`} ≥ 1,

which implies that, for some v ∗ ∈ Ls [ρ], it holds that

n U[ρ, v ∗ ] ≥ n(e^{−`/c})^k .

Now take s = c0 log n − θn ` with c0 < c and θn ∈ [0, 1) such that s is a multiple
of `. Then

n(e^{−`/c})^k = n e^{−s/c} = n (n^{−c0/c} e^{θn `/c}) = n^{1−c0/c} e^{θn `/c}
             ≥ c0 log n − θn ` + 1 = s + 1, (6.3.12)

for all n large enough, where we used that 1 − c0 /c > 0, θn ∈ [0, 1) and ` is
fixed. So, using the characterization of the height in (6.3.2) and (6.3.3) together
with inequality (6.3.11), we derive

Sv∗ ≥ n U[ρ, v ∗ ] − s ≥ n(e^{−`/c})^k − s ≥ 1. (6.3.13)

That is, Hn ≥ c0 log n − θn `.

But this is not quite what we want: this last claim holds only conditioned on
survival; or put differently, it holds with probability 1 − η, a value which could
be significantly smaller than 1 in general. To handle this last issue, we consider a
large number of independent copies of the Galton-Watson process above in order
to “boost” the probability that at least one of them survives to a value arbitrarily
close to 1.

Claim 6.3.6. For any δ > 0, there is a J so that Hn ≥ c0 log n − θn ` + J` with


probability at least 1 − δ for all n large enough.

Proof. Let J` be a multiple of ` and let

u∗1 , . . . , u∗_{2^{J`}}

be the vertices on level J` from left to right. Each process

(Z_t^{u∗_i ,`})_{t≥0} , i = 1, . . . , 2^{J`} ,

is an independent copy of (Z_t^{ρ,`})_{t≥0} .


We define two “bad events”:

- (No survival) Let B1 be the event that all (Z_t^{u∗_i ,`})s go extinct and choose J
large enough that this event has probability < δ/2, that is,

P[B1 ] = η^{2^{J`}} < δ/2.

Under B1^c , at least one of the branching processes survives; let I be the lowest
index among them.

- (Fast decay at the top) To bound the height, we also need to control the effect
of the first J` levels on the subtree sizes. Let B2 be the event that at least
one of the W -values associated with the 2^{J`} − 1 vertices ancestral to the u∗i s
is outside the interval (α, 1 − α). Choose α small enough that this event has
probability < δ/2, that is,

P[B2 ] ≤ (2α)(2^{J`} − 1) < δ/2.

Under B2^c , we have almost surely the lower bound

U[ρ, u∗I ] ≥ α^{J`} , (6.3.14)

since it in fact holds for all u∗i s simultaneously.


We are now ready to conclude. Assume B1^c and B2^c hold. Taking

s = k` = c0 log n − θn `,

as before, we have Z_k^{u∗_I ,`} ≥ 1 so there is v ∗ ∈ Ls [u∗I ] such that

n U[ρ, v ∗ ] = n U[ρ, u∗I ] U[u∗I , v ∗ ] ≥ n α^{J`} (e^{−`/c})^k ,

where we used (6.3.14). Observe that (6.3.12) remains valid (for potentially larger
n) even after multiplying all expressions on the left-hand side of the inequality by
α^{J`} . Arguing as in (6.3.13), we get that Hn ≥ c0 log n − θn ` + J`. This event
holds with probability at least

P[(B1 ∪ B2 )^c ] ≥ 1 − P[B1 ] − P[B2 ] ≥ 1 − δ.

We have proved the claim.



For any ε > 0, we can choose c0 = γ − ε and c0 < c < γ. Further, δ can
be made arbitrarily small (provided n is large enough). Put differently, we have
proved that for any ε > 0

P [Hn / log n ≥ γ − ε] → 1,

as n → +∞, which establishes the other direction of Claim 6.3.1.

6.3.2 . Data science: the reconstruction problem, the Kesten-Stigum
bound and a phase transition in phylogenetics
In this section, we explore an application of multitype branching processes in sta-
tistical phylogenetics, the reconstruction of evolutionary trees from molecular data.
Informally, we consider a ferromagnetic Ising model (Example 1.2.5) on an infi-
nite binary tree and we ask: when do the states at level h “remember” the state
at the root? We establish the existence of a phase transition. Before defining the
problem formally and explaining its connection to evolutionary biology, we de-
scribe an equivalent definition of the model. This alternative “Markov chain on a
tree” perspective will make it easier to derive recursions for quantities of interest.
Equivalence between the two models is proved in Exercise 6.16.

The reconstruction problem


Consider a rooted infinite binary tree T = T̂2 , where the root is denoted by 0. Fix
a parameter 0 < p < 1/2, which we will refer to as the mutation probability for
reasons that will be explained below. We assign a state σv in C = {+1, −1} to
each vertex v as follows. At the root 0, the state σ0 is picked uniformly at random
in {+1, −1}. Moving away from the root, the state σv at a vertex v, conditioned
on the state at its immediate ancestor u, is equal to σu with probability 1 − p and
to −σu with probability p. In the computational biology literature, this model is
referred to as the Cavender-Farris-Neyman (CFN) model.
For h ≥ 0, let Lh be the set of vertices in T at graph distance h from the
root. We denote by σh = (σ` )`∈Lh the vector of states at level h and we denote
by µh the distribution of σh . The reconstruction problem consists in trying to
“guess” the state at the root σ0 given the states σh at level h. We first note that
in general we cannot expect an arbitrarily good estimator. Indeed, rewriting the
Markov transition matrix along the edges (i.e., the matrix encoding the probability
of the state at a vertex given the state at its immediate ancestor) in its random
cluster form

P := ( 1−p   p  )  =  (1 − 2p) ( 1  0 )  +  (2p) ( 1/2  1/2 ) ,   (6.3.15)
     (  p   1−p )              ( 0  1 )          ( 1/2  1/2 )

we see that the states σ1 at the first level are completely randomized (i.e., indepen-
dent of σ0 ) with probability (2p)^2 —in which case we cannot hope to reconstruct
the root state better than a coin flip. Intuitively the reconstruction problem is solv-
able if we can find an estimator of the root state which outperforms a random coin
flip as h grows to +∞. Let µ_h^+ be the distribution µh conditioned on the root state
σ0 being +1, and similarly for µ_h^− . Observe that µh = (1/2) µ_h^+ + (1/2) µ_h^− .
Recall also that

‖µ_h^+ − µ_h^−‖TV = (1/2) Σ_{sh ∈{+1,−1}^{2^h}} |µ_h^+(sh ) − µ_h^−(sh )|.

Definition 6.3.7 (Reconstruction solvability). We say that the reconstruction prob-
lem for 0 < p < 1/2 is solvable if

lim inf_{h→+∞} ‖µ_h^+ − µ_h^−‖TV > 0,

otherwise the problem is unsolvable.



(Exercise 6.17 asks for a proof that kµ+ h − µh kTV is monotone in h and therefore
has a limit.)
To see the connection with the description above, consider an arbitrary root
estimator σ̂0 (sh ). Then the probability of a mistake is
P[σ̂0 (σh ) ≠ σ0 ] = (1/2) Σ_{sh ∈{+1,−1}^{2^h}} µ_h^−(sh ) 1{σ̂0 (sh ) = +1}
                  + (1/2) Σ_{sh ∈{+1,−1}^{2^h}} µ_h^+(sh ) 1{σ̂0 (sh ) = −1}.

This expression is minimized by choosing for each sh separately

σ̂0 (sh ) = +1 if µ_h^+(sh ) ≥ µ_h^−(sh ), and −1 otherwise.

Let µh (s0 |sh ) be the posterior probability of the root state, that is, the conditional
probability of the root state s0 given the states sh at level h. By Bayes’ rule,

µh (+1|sh ) = (1/2) µ_h^+(sh ) / µh (sh ),

and similarly for µh (−1|sh ). Hence the choice above is equivalent to

σ̂0 (sh ) = +1 if µh (+1|sh ) ≥ µh (−1|sh ), and −1 otherwise,

which is known as the maximum a posteriori (MAP) estimator. (We encountered
it in a different context in Section 5.1.4.) For short, we will denote it by σ̂0MAP .
Now note that

P[σ̂0MAP (σh ) = σ0 ] − P[σ̂0MAP (σh ) ≠ σ0 ]
    = (1/2) Σ_{sh} µ_h^+(sh ) [1{σ̂0MAP (sh ) = +1} − 1{σ̂0MAP (sh ) = −1}]
    + (1/2) Σ_{sh} µ_h^−(sh ) [1{σ̂0MAP (sh ) = −1} − 1{σ̂0MAP (sh ) = +1}]
    = (1/2) Σ_{sh} µ_h^+(sh ) σ̂0MAP (sh ) − (1/2) Σ_{sh} µ_h^−(sh ) σ̂0MAP (sh )
    = (1/2) Σ_{sh} |µ_h^+(sh ) − µ_h^−(sh )|
    = ‖µ_h^+ − µ_h^−‖TV ,

where the sums are over sh ∈ {+1, −1}^{2^h} and the third equality comes from

|a − b| = (a − b)1{a ≥ b} + (b − a)1{a < b}.

Since P[σ̂0 (σh ) = σ0 ] + P[σ̂0 (σh ) ≠ σ0 ] = 1, the display above can be rewritten
as

P[σ̂0MAP (σh ) ≠ σ0 ] = 1/2 − (1/2)‖µ_h^+ − µ_h^−‖TV .

Given that σ̂0MAP was chosen to minimize the error probability, we also have that
for any root estimator σ̂0

P[σ̂0 (σh ) ≠ σ0 ] ≥ 1/2 − (1/2)‖µ_h^+ − µ_h^−‖TV .

Since this last inequality also applies to the estimator −σ̂0 , we also have

P[σ̂0 (σh ) ≠ σ0 ] ≤ 1/2 + (1/2)‖µ_h^+ − µ_h^−‖TV .
2
The next lemma summarizes the discussion above.
Lemma 6.3.8 (Probability of erroneous reconstruction). The probability of an er-
roneous root reconstruction behaves as follows.

(i) If the reconstruction problem is solvable, then

    lim_{h→+∞} P[σ̂0MAP (σh ) ≠ σ0 ] < 1/2.

(ii) If the reconstruction problem is unsolvable, then for any root estimator σ̂0

    lim_{h→+∞} P[σ̂0 (σh ) ≠ σ0 ] = 1/2.

It turns out that the accuracy of the MAP estimator undergoes a phase transition
at a critical mutation probability p∗ . Our main theorem is the following.

Theorem 6.3.9 (Solvability). Let θ∗ be the unique positive solution to 2θ∗^2 = 1,
and set p∗ = (1 − θ∗)/2. Then the reconstruction problem is:

(i) solvable if 0 < p < p∗ ;

(ii) unsolvable if p∗ ≤ p < 1/2.

We will prove this theorem in the rest of the section.


But first, what does all of this have to do with evolutionary biology? Trun-
cate T at level h to obtain a finite tree Th with leaf set Lh . In phylogenetics, one
uses such a tree to depict evolutionary relationships between extant species that are
represented by its leaves. Each internal branching corresponds to a past specia-
tion event. Extinctions have been pruned from the tree. The genomes of ancestral
species, starting from the most recent common ancestor at the root, are posited to
have evolved along the (deterministic) tree Th according to a random process of
single-site substitutions. To simplify, each position in the genome is assumed to
take one of two values, +1 or −1, and it evolves independently from all other posi-
tions under a CFN model on Th . That is, on each edge of the tree a mutation occurs
with probability p, changing the state of the immediate descendant species at that
position. This is of course only a toy model, but it is not far from what evolution-
ary biologists actually use in practice with great success. One practical problem
of interest is to reconstruct the genome of ancestors given access to contemporary
genomes. This is, in a nutshell, the reconstruction problem.

Kesten-Stigum bound
The condition in Theorem 6.3.9 is referred to as the Kesten-Stigum bound. We
explain why next. We showed in Lemma 6.3.8 that the MAP estimator has an error
probability bounded away from 1/2 if and only if the reconstruction problem is
solvable. Of course, other estimators may also achieve that same desirable out-
come. In fact, from the lemma, to establish reconstruction solvability it suffices to
exhibit one such “better-than-random” estimator. So, rather than analyzing σ̂0MAP ,
we look at a simpler estimator first and prove half of Theorem 6.3.9. The other half
will be proven below using different ideas.
The key is to notice that a multitype branching process (see Section 6.1.4) hides
in the background. For h ≥ 0, consider the random row vector Zh = (Zh,+ , Zh,− )
where the first component records the number of +1 states (which we refer to as
belonging to the + type) in σh and, likewise, the second component counts the
−1 states (referred to as of − type). Then (Zh )h≥0 is a two-type Galton-Watson
process where each individual has exactly two children. Their types depend on the
type of the parent. A type + individual has the following offspring distribution:

p_k^{(+)} = (1 − p)^2   if k = (2, 0),
          = 2p(1 − p)   if k = (1, 1),
          = p^2         if k = (0, 2),
          = 0           otherwise.

Similar expressions hold for p_k^{(−)} . The mean matrix is given by

M = ( 2(1 − p)^2 + 2p(1 − p)    2p(1 − p) + 2p^2        )
    ( 2p(1 − p) + 2p^2          2(1 − p)^2 + 2p(1 − p)  )

  = 2 ( (1 − p)((1 − p) + p)    p((1 − p) + p)        )
      ( p((1 − p) + p)          (1 − p)((1 − p) + p)  )

  = 2 ( 1 − p    p    )
      ( p      1 − p  )

  = 2P,

where (not coincidentally) we have already encountered the matrix P in (6.3.15).


As a symmetric matrix, by the spectral theorem (Theorem 5.1.1), P has a real
CHAPTER 6. BRANCHING PROCESSES 456

eigenvector decomposition
 
P = [ 1 − p    p     ]
    [ p        1 − p ]

  = [ 1/2    1/2 ] + (1 − 2p) [  1/2    −1/2 ]
    [ 1/2    1/2 ]            [ −1/2     1/2 ]

  = λ_1 x_1 x_1^T + λ_2 x_2 x_2^T,

where the eigenvalues and eigenvectors are

λ_1 = 1,   λ_2 = 1 − 2p,   x_1 = (1/√2, 1/√2)^T,   x_2 = (1/√2, −1/√2)^T.

The eigenvalues of M are twice those of P while the eigenvectors are the same.
In particular, using the notation and convention of the Perron-Frobenius Theorem
(Theorem 6.1.17), we have

ρ = 2,   w = (1/2, 1/2)^T.

These should not come entirely as a surprise. In particular, recall from Theo-
rem 6.1.18 that ρ can be interpreted as an “overall rate of growth” of the popu-
lation, which here is two since each individual has exactly two children (ignoring
the types).
Let u = (1, −1)^T be a column vector proportional to the second right eigenvector
of M. We know from Section 6.1.4 that

U_h = (2λ_2)^{−h} Z_h u = (1/(2^h θ^h)) Σ_{ℓ∈L_h} σ_ℓ,   h ≥ 0,

is a martingale, where we used the notation

θ := λ_2 = 1 − 2p.

Upon looking more closely, the quantity U_h has a natural interpretation: its sign is
the majority estimator, that is, sgn(U_h) = +1 if a majority of individuals at level h
are of type + (breaking ties in favor of +), and is −1 otherwise. We indicated
previously that we only need to find one estimator with an error probability
bounded away from 1/2 to establish reconstruction solvability for a given value
of p. The majority estimator

σ̂_0^Maj := sgn(U_h),

is an obvious one to try. What is less obvious is that it works—all the way to the
threshold. This essentially follows from the results of Section 6.1.4, as we detail
next.
We begin with an informal discussion. When can σ̂0Maj be expected to work?
We will not in fact bound the error probability of σ̂0Maj , but instead analyze directly
the properties of (Uh ). By our modeling assumptions, Z0 is either (1, 0) or (0, 1)
with equal probability. Hence, by the martingale property, we obtain that
E[Uh | Z0 ] = Z0 u = σ0 . (6.3.16)
In words, Uh is “centered” around the root state. Intuitively, its second moment
therefore captures how informative it is about σ0 . Lemma 6.1.20 exhibits a phase
transition for E[U_h^2 | Z_0]. The condition for that lemma to hold is

(Var[X^{(+)}(1,1) u], Var[X^{(−)}(1,1) u]) ≠ 0,

where X^{(+)}(1,1) ∼ {p_k^{(+)}} and X^{(−)}(1,1) ∼ {p_k^{(−)}}. This is indeed
satisfied. The lemma then states that E[U_h^2 | Z_0] is uniformly bounded if and only
if ρ < (2λ_2)^2, or after rearranging

2θ^2 > 1.   (6.3.17)
Note that this is the condition in Theorem 6.3.9. It arises as a tradeoff between
the rate of growth ρ = 2 and the second largest eigenvalue λ2 = θ of the Markov
transition matrix P . One way to make sense of it is to observe the following:
- On any infinite path out of the root, the process performs a finite Markov
chain with transition matrix P . We know from Theorem 5.2.14 (see in partic-
ular Example 5.2.8) that the chain mixes—and therefore “forgets” its starting
state σ0 —at a rate governed by the spectral gap 1 − λ2 .
- On the other hand, the tree itself is growing at rate ρ = 2, which produces
an exponentially large number of (overlapping) paths out of the root. That
growth helps preserve the information about σ0 down the tree through the
duplication of the state (with mutation) at each branching.
- The condition ρ < (2λ_2)^2 says in essence that when mixing is slow
enough—corresponding to larger values of λ_2—compared to the growth, the
reconstruction problem is solvable. Lemma 6.1.20 was first proved by Kesten
and Stigum, and (6.3.17) is thereby known as the Kesten-Stigum bound.
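These heuristics are easy to probe numerically. The sketch below (all function and parameter names are ours, not the book's) simulates the broadcast process on the binary tree and measures how often the majority estimator recovers the root state; for p well below the Kesten-Stigum threshold p* = (1 − 1/√2)/2 ≈ 0.146 the success rate stays bounded away from 1/2 as h grows, while for p close to 1/2 it does not.

```python
import random

def broadcast_leaves(h, p, rng):
    """CFN broadcast: the root state is uniform in {+1, -1} and each edge
    flips the parent state independently with probability p.
    Returns (root_state, leaf_states_at_level_h)."""
    root = rng.choice([+1, -1])
    level = [root]
    for _ in range(h):
        # each vertex has exactly two children
        level = [-s if rng.random() < p else s
                 for s in level for _ in range(2)]
    return root, level

def majority_success_rate(h, p, trials, seed=0):
    """Fraction of trials in which sgn(sum of leaf states) recovers the
    root (ties broken in favor of +1, as for the majority estimator)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        root, leaves = broadcast_leaves(h, p, rng)
        guess = +1 if sum(leaves) >= 0 else -1
        hits += (guess == root)
    return hits / trials
```

For instance, with h = 6 the success rate at p = 0.05 is far above 1/2, while at p = 0.45 it is essentially a coin flip, consistent with the discussion above.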
It remains to turn these observations into a formal proof.
Denote by E+ the expectation conditioned on σ0 = +1, and similarly for E− .
The following lemma is a consequence of (6.3.16). We give a quick alternative
proof.

Lemma 6.3.10 (Unbiasedness of Uh ). We have

E+ [Uh ] = +1, E− [Uh ] = −1.

Proof. By applying the Markov transition matrix P on the first level and using the
symmetries of the model, for any ℓ ∈ L_h and ℓ′ ∈ L_{h−1}, we have

E+[σ_ℓ] = (1 − p) E+[σ_ℓ′] + p E−[σ_ℓ′]
        = (1 − p) E+[σ_ℓ′] + p E+[−σ_ℓ′]
        = (1 − 2p) E+[σ_ℓ′]
        = θ E+[σ_ℓ′].

Iterating, we get E+[σ_ℓ] = θ^h. The claim follows by linearity of expectation.

Although we do not strictly need it, we also derive an explicit formula for the
variance. The proof is typical of how conditional independence properties of this
kind of Markov model on trees can be used to derive recursions for quantities of
interest.

Lemma 6.3.11 (Variance of U_h). We have

Var[U_h] → { (1/2)/(1 − (2θ^2)^{−1})   if 2θ^2 > 1,
           { +∞                        otherwise.

Proof. By the conditional variance formula,

Var[U_h] = Var[E[U_h | σ_0]] + E[Var[U_h | σ_0]]
         = Var[σ_0] + E[Var[U_h | σ_0]]
         = 1 + Var+[U_h],   (6.3.18)

where the last line follows from symmetry, with Var+ indicating the conditional
variance given that the root state σ_0 is +1. Write U_h = U̇_h + Ü_h as a sum over
the left and right subtrees below the root respectively. Using the conditional
independence of those two subtrees given the root state, we get from (6.3.18) that

Var[U_h] = 1 + Var+[U_h]
         = 1 + Var+[U̇_h + Ü_h]
         = 1 + 2 Var+[U̇_h]
         = 1 + 2 (E+[U̇_h^2] − (E+[U̇_h])^2).   (6.3.19)

We now use the Markov transition matrix on the first level to derive a recursion
in h. Let σ̇_0 be the state at the left child of the root. We use the fact that the
random variables 2θU̇_h conditioned on σ̇_0 = +1 and U_{h−1} conditioned on
σ_0 = +1 are identically distributed. Using E+[U̇_h] = 1/2 (by Lemma 6.3.10 and
symmetry), we get from (6.3.19) that

Var[U_h] = 1 − 2 (E+[U̇_h])^2 + 2 E+[U̇_h^2]
         = 1 − 2 (1/2)^2 + 2 ((1 − p) E+[(2θ)^{−2} U_{h−1}^2] + p E−[(2θ)^{−2} U_{h−1}^2])
         = 1/2 + (2θ^2)^{−1} E+[U_{h−1}^2]
         = 1/2 + (2θ^2)^{−1} Var[U_{h−1}],   (6.3.20)

where we used that

Var[U_{h−1}] = E[U_{h−1}^2] = E+[U_{h−1}^2] = E−[U_{h−1}^2],

by symmetry and the fact that E[U_{h−1}] = 0. Solving the affine recursion (6.3.20)
gives

Var[U_h] = (2θ^2)^{−h} + (1/2) Σ_{i=0}^{h−1} (2θ^2)^{−i},

where we used that Var[U_0] = Var[σ_0] = 1. The result follows.
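The affine recursion (6.3.20) can be iterated directly; a quick numerical check (a sketch, with our own function name) confirms the dichotomy in Lemma 6.3.11.

```python
def var_U(h, theta):
    """Iterate the affine recursion (6.3.20),
    Var[U_h] = 1/2 + (2*theta**2)**(-1) * Var[U_{h-1}],
    started at Var[U_0] = 1."""
    v = 1.0
    for _ in range(h):
        v = 0.5 + v / (2 * theta**2)
    return v
```

When 2θ^2 > 1 the iterates approach the limit (1/2)/(1 − (2θ^2)^{−1}); otherwise they diverge.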

We can now prove the first part of Theorem 6.3.9.



Proof of Theorem 6.3.9 (i). Let µ̄_h be the distribution of U_h and define µ̄_h^+ and
µ̄_h^− similarly. We give a bound on ‖µ_h^+ − µ_h^−‖_TV through a bound on
‖µ̄_h^+ − µ̄_h^−‖_TV. Let s̄_h be the U_h-value associated to
s_h = (s_{h,ℓ})_{ℓ∈L_h} ∈ {+1, −1}^{2^h}, that is,

s̄_h = (1/(2^h θ^h)) Σ_{ℓ∈L_h} s_{h,ℓ}.

Then, by marginalizing and the triangle inequality,

Σ_z |µ̄_h^+(z) − µ̄_h^−(z)| = Σ_z |Σ_{s_h : s̄_h = z} (µ_h^+(s_h) − µ_h^−(s_h))|
    ≤ Σ_z Σ_{s_h : s̄_h = z} |µ_h^+(s_h) − µ_h^−(s_h)|
    = Σ_{s_h ∈ {+1,−1}^{2^h}} |µ_h^+(s_h) − µ_h^−(s_h)|,

where the first sum is over the support of µ̄h . So it suffices to bound from below
the left-hand side on the first line.
For that purpose, we apply Cauchy-Schwarz and use the variance bound in
Lemma 6.3.11. First note that (1/2)µ̄_h^+ + (1/2)µ̄_h^− = µ̄_h so that, by the
triangle inequality,

|µ̄_h^+(z) − µ̄_h^−(z)| / (2µ̄_h(z)) ≤ (µ̄_h^+(z) + µ̄_h^−(z)) / (2µ̄_h(z)) = 1.   (6.3.21)

Hence, we get

Σ_z |µ̄_h^+(z) − µ̄_h^−(z)| = Σ_z (|µ̄_h^+(z) − µ̄_h^−(z)| / (2µ̄_h(z))) · 2µ̄_h(z)
    ≥ 2 Σ_z ((µ̄_h^+(z) − µ̄_h^−(z)) / (2µ̄_h(z)))^2 µ̄_h(z)
    ≥ 2 (Σ_z ((µ̄_h^+(z) − µ̄_h^−(z)) / (2µ̄_h(z))) z µ̄_h(z))^2 / Σ_z z^2 µ̄_h(z)
    = (1/2) (Σ_z z (µ̄_h^+(z) − µ̄_h^−(z)))^2 / Σ_z z^2 µ̄_h(z)
    = (1/2) (E+[U_h] − E−[U_h])^2 / Var[U_h]
    ≥ 4 (1 − (2θ^2)^{−1}) > 0,

where we used (6.3.21) on the second line, Cauchy-Schwarz on the third line (after
rearranging), and Lemmas 6.3.10 and 6.3.11 on the last line.

Remark 6.3.12. The proof above and a correlation inequality of [EKPS00, Theorem 1.4]
give a lower bound on the probability of reconstruction of the majority estimator.

Impossibility of reconstruction
The previous result was based on showing that majority voting, that is, σ̂0Maj , pro-
duces a good root-state estimator—up to p = p∗ . Here we establish that this result
is best possible. Majority is not in fact the best root-state estimator: in general its
error probability can be higher than σ̂0MAP as the latter also takes into account the
configuration of the states at level h. However, perhaps surprisingly, it turns out
that the critical threshold for σ̂0Maj coincides with that of σ̂0MAP in the CFN model.
To prove the second part of Theorem 6.3.9 we analyze the MAP estimator.
Recall that µh (s0 |sh ) is the conditional probability of the root state s0 given the

states sh at level h. It will be more convenient to work with the following “root
magnetization”
Rh := µh (+1|σh ) − µh (−1|σh ),
which, as a function of σh , is a random variable. Note that E[Rh ] = 0 by symme-
try. By Bayes’ rule and the fact that µh (+1|σh ) + µh (−1|σh ) = 1, we have the
following alternative formulas which will prove useful
R_h = (1/(2µ_h(σ_h))) [µ_h^+(σ_h) − µ_h^−(σ_h)],   (6.3.22)

R_h = 2µ_h(+1|σ_h) − 1 = µ_h^+(σ_h)/µ_h(σ_h) − 1,   (6.3.23)

R_h = 1 − 2µ_h(−1|σ_h) = 1 − µ_h^−(σ_h)/µ_h(σ_h).   (6.3.24)
It turns out to be enough to prove an upper bound on the variance of R_h.

Lemma 6.3.13 (Second moment bound). It holds that

‖µ_h^+ − µ_h^−‖_TV ≤ √(E[R_h^2]).

Proof. By (6.3.22),

(1/2) Σ_{s_h ∈ {+1,−1}^{2^h}} |µ_h^+(s_h) − µ_h^−(s_h)|
    = Σ_{s_h ∈ {+1,−1}^{2^h}} µ_h(s_h) |µ_h(+1|s_h) − µ_h(−1|s_h)|
    = E|R_h|
    ≤ √(E[R_h^2]),

where we used Cauchy-Schwarz on the last line.
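For small h, Lemma 6.3.13 can be verified exactly by brute-force enumeration of the leaf configurations. The sketch below (our own naming) computes the leaf distributions µ_h^± of the CFN broadcast recursively, then both sides of the inequality.

```python
def leaf_dist(h, p, root):
    """Exact distribution of the leaf states of the CFN broadcast on the
    binary tree of height h, given the root state `root` (p in (0,1))."""
    if h == 0:
        return {(root,): 1.0}
    out = {}
    for c1 in (+1, -1):
        for c2 in (+1, -1):
            # probability of the two child states given the parent state
            w = ((p if c1 != root else 1 - p)
                 * (p if c2 != root else 1 - p))
            for t1, q1 in leaf_dist(h - 1, p, c1).items():
                for t2, q2 in leaf_dist(h - 1, p, c2).items():
                    out[t1 + t2] = out.get(t1 + t2, 0.0) + w * q1 * q2
    return out

def tv_and_second_moment(h, p):
    """Return (||mu_h^+ - mu_h^-||_TV, E[R_h^2]), computed exactly."""
    mu_p = leaf_dist(h, p, +1)
    mu_m = leaf_dist(h, p, -1)
    tv = 0.5 * sum(abs(mu_p[s] - mu_m[s]) for s in mu_p)
    ex2 = 0.0
    for s in mu_p:
        mu = 0.5 * (mu_p[s] + mu_m[s])
        r = (mu_p[s] - mu_m[s]) / (2 * mu)  # R_h(s), by (6.3.22)
        ex2 += mu * r * r
    return tv, ex2
```

For example, with h = 2 and p = 0.1 one checks that the total variation distance is indeed at most √(E[R_h^2]), as the lemma asserts.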

Let z̄_h = E[R_h^2]. In view of Lemma 6.3.13, the proof of Theorem 6.3.9 (ii) will
follow from establishing the limit

lim_{h→+∞} z̄_h = 0.

We apply the same kind of recursive argument we used for the analysis of majority
(see in particular Lemma 6.3.11): we condition on the root to exploit conditional
independence; we use the Markov transition matrix on the top edges.
We first derive a recursion for Rh itself—as a random variable. We proceed in
two steps:

- Step 1: we break up the first h levels of the tree into two identical (h − 1)-
level trees with an additional edge at their respective root through conditional
independence;

- Step 2: we account for that edge through the Markov transition matrix.

We will need some notation. Let σ̇h be the states at level h (from the root) below
the left child of the root and let µ̇h be the distribution of σ̇h (and use a superscript
+ to denote the conditional probability given the root is +, and so on). Define

Ẏh = µ̇h (+1|σ̇h ) − µ̇h (−1|σ̇h ),

where µ̇h (s0 |ṡh ) is the conditional probability that the root is s0 given that σ̇h =
ṡh . Similarly, denote with a double dot the same quantities with respect to the
subtree below the right child of the root. Expressions similar to (6.3.22), (6.3.23)
and (6.3.24) also hold.

Lemma 6.3.14 (Recursion: Step 1). It holds almost surely that

R_h = (Ẏ_h + Ÿ_h) / (1 + Ẏ_h Ÿ_h).

Proof. Using µ_h^+(s_h) = µ̇_h^+(ṡ_h) µ̈_h^+(s̈_h) by conditional independence, (6.3.22)
applied to R_h, and (6.3.23) and (6.3.24) applied to Ẏ_h and Ÿ_h, we get

R_h = (1/2) Σ_{γ=+,−} γ µ_h^γ(σ_h)/µ_h(σ_h)
    = (1/2) (µ̇_h(σ̇_h) µ̈_h(σ̈_h)/µ_h(σ_h)) Σ_{γ=+,−} γ µ̇_h^γ(σ̇_h) µ̈_h^γ(σ̈_h)/(µ̇_h(σ̇_h) µ̈_h(σ̈_h))
    = (1/2) (µ̇_h(σ̇_h) µ̈_h(σ̈_h)/µ_h(σ_h)) Σ_{γ=+,−} γ (1 + γ Ẏ_h)(1 + γ Ÿ_h)
    = (µ̇_h(σ̇_h) µ̈_h(σ̈_h)/µ_h(σ_h)) (Ẏ_h + Ÿ_h).

The factor in front can be computed as follows:

µ_h(σ_h)/(µ̇_h(σ̇_h) µ̈_h(σ̈_h)) = Σ_{γ=+,−} (1/2) µ_h^γ(σ_h)/(µ̇_h(σ̇_h) µ̈_h(σ̈_h))
    = Σ_{γ=+,−} (1/2) µ̇_h^γ(σ̇_h) µ̈_h^γ(σ̈_h)/(µ̇_h(σ̇_h) µ̈_h(σ̈_h))
    = (1/2) Σ_{γ=+,−} (1 + γ Ẏ_h)(1 + γ Ÿ_h)
    = 1 + Ẏ_h Ÿ_h.

That proves the claim.

For the second step of the recursion, we define

Ḋh = ν̇h (+1|σ̇h ) − ν̇h (−1|σ̇h ),

where ν̇h (ṡ0 |ṡh ) is the conditional probability that the left child of the root is ṡ0
given that the states at level h (from the root) below the left child are σ̇h = ṡh ;
and similarly for the right child of the root. Again expressions similar to (6.3.22),
(6.3.23) and (6.3.24) hold. The following lemma is left as an exercise (see Exer-
cise 6.18).

Lemma 6.3.15 (Recursion: Step 2). It holds almost surely that

Ẏh = θḊh .

We are now ready to prove the second half of our main theorem.

Proof of Theorem 6.3.9 (ii). Putting Lemmas 6.3.14 and 6.3.15 together, we get

R_h = θ(Ḋ_h + D̈_h) / (1 + θ^2 Ḋ_h D̈_h).   (6.3.25)
We now take expectations. Recall that we seek to compute the second moment of

R_h. However, an important simplification arises from the following observation:

E+[R_h] = Σ_{s_h ∈ {+1,−1}^{2^h}} µ_h^+(s_h) R_h(s_h)
        = Σ_{s_h ∈ {+1,−1}^{2^h}} µ_h(s_h) (µ_h^+(s_h)/µ_h(s_h)) R_h(s_h)
        = Σ_{s_h ∈ {+1,−1}^{2^h}} µ_h(s_h) (1 + R_h(s_h)) R_h(s_h)
        = E[(1 + R_h) R_h]
        = E[R_h^2],

where we used (6.3.23) on the third line and E[R_h] = 0 on the fifth line. So it
suffices to compute the conditional first moment.
Using the expansion

1/(1 + r) = 1 − r + r^2/(1 + r),

with r = θ^2 Ḋ_h D̈_h, we have by (6.3.25) that

R_h = θ(Ḋ_h + D̈_h) − θ^3 (Ḋ_h + D̈_h) Ḋ_h D̈_h + θ^4 Ḋ_h^2 D̈_h^2 R_h
    ≤ θ(Ḋ_h + D̈_h) − θ^3 (Ḋ_h + D̈_h) Ḋ_h D̈_h + θ^4 Ḋ_h^2 D̈_h^2,   (6.3.26)

where we used |R_h| ≤ 1.


We will need the conditional first and second moments of Ḋ_h. For the first
moment, note that by symmetry (more precisely, by the fact that R_{h−1} conditioned
on σ_0 = −1 is equal in distribution to −R_{h−1} conditioned on σ_0 = +1)

E+[Ḋ_h] = (1 − p) E+[R_{h−1}] + p E−[R_{h−1}]
        = (1 − p) E+[R_{h−1}] + p E+[−R_{h−1}]
        = (1 − 2p) E+[R_{h−1}]
        = θ E+[R_{h−1}].

Similarly, for the second moment, we have

E+[Ḋ_h^2] = (1 − p) E+[R_{h−1}^2] + p E−[R_{h−1}^2]
          = E[R_{h−1}^2]
          = E+[R_{h−1}],

where we used that E+[R_{h−1}^2] = E−[R_{h−1}^2] by symmetry, so that
E[R_{h−1}^2] = (1/2) E+[R_{h−1}^2] + (1/2) E−[R_{h−1}^2] = E+[R_{h−1}^2].
Taking expectations in (6.3.26), using conditional independence, and plugging
in the formulas for E+[Ḋ_h] and E+[Ḋ_h^2] above, we obtain

z̄_h = E+[R_h]
    ≤ θ(E+[Ḋ_h] + E+[D̈_h]) − θ^3 (E+[Ḋ_h^2] E+[D̈_h] + E+[D̈_h^2] E+[Ḋ_h]) + θ^4 E+[Ḋ_h^2] E+[D̈_h^2]
    = 2θ^2 E+[R_{h−1}] − 2θ^4 E+[R_{h−1}]^2 + θ^4 E+[R_{h−1}]^2
    = 2θ^2 z̄_{h−1} − θ^4 z̄_{h−1}^2.   (6.3.27)
We analyze this recursion next. At h = 0, we have z̄_0 = E+[R_0] = 1.

- When 2θ^2 < 1, the sequence z̄_h decreases to 0 exponentially fast:

  z̄_h ≤ (2θ^2)^h,   h ≥ 0.

- When 2θ^2 = 1 on the other hand, convergence to 0 occurs at a slower rate.
  We show by induction that

  z̄_h ≤ 4/h,   h ≥ 1.

  Note that z̄_1 ≤ z̄_0 − θ^4 z̄_0^2 = 3/4 ≤ 4, since θ^4 = 1/4, which proves the
  base case. Assuming the bound holds for h − 1, we have from (6.3.27) that

  z̄_h ≤ z̄_{h−1} − (1/4) z̄_{h−1}^2
      ≤ 4/(h − 1) − 4/(h − 1)^2
      = 4 (h − 2)/(h − 1)^2
      ≤ 4/h,

  where the last line follows from checking that h(h − 2) ≤ (h − 1)^2.

Since in both cases

lim_{h→+∞} z̄_h = 0,

the claim follows from Lemma 6.3.13.
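Both decay regimes of (6.3.27) can be seen by iterating the recursion numerically (a sketch, with our own function name): exponential decay below the threshold, and decay of order 4/h at criticality 2θ^2 = 1, matching the induction above.

```python
def zbar_upper(h, theta):
    """Iterate the upper bound from (6.3.27),
    z_h = 2*theta**2 * z_{h-1} - theta**4 * z_{h-1}**2,
    started at z_0 = 1."""
    z = 1.0
    for _ in range(h):
        z = 2 * theta**2 * z - theta**4 * z * z
    return z
```

At θ = 1/√2 (criticality) the iterates track the 4/h bound; at θ = 0.5 they vanish exponentially fast.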
Remark 6.3.16. While Theorem 6.3.9 part (i) can be generalized beyond the CFN model
(see, e.g., [MP03]), part (ii) cannot. A striking construction of [Mos01] shows that, under
more general models, certain root-state estimators taking into account the configuration
of the states at level h can “beat” the Kesten-Stigum bound.

6.4 Finale: the phase transition of the Erdős-Rényi model


A compelling way to view an Erdős-Rényi random graph—as its density varies—
is the following coupling or “evolution.” For each pair {i, j}, let U{i,j} be in-
dependent uniform random variables in [0, 1] and set G(p) := ([n], E(p)) where
{i, j} ∈ E(p) if and only if U{i,j} ≤ p. Then G(p) is distributed according to Gn,p .
As p varies from 0 to 1, we start with an empty graph and progressively add edges
until the complete graph is obtained.
We showed in Section 2.3.2 that (log n)/n is a threshold function for connectivity.
Before connectivity occurs in the evolution of the random graph, a quantity of
interest is the size of the largest connected component. As we show in the current
section, this quantity itself undergoes a remarkable phase transition: when p = λ/n
with λ < 1, the largest component has size Θ(log n); as λ crosses 1, many
components quickly merge to form a so-called “giant component” of size Θ(n).
This celebrated result is often referred to as “the” phase transition of the Erdős-
Rényi graph model. Although the proof is quite long, it is well worth studying in
detail. It employs most of the tools we have seen up to this point: first and second
moment methods, Chernoff-Cramér bounds, martingale techniques, coupling and
stochastic domination, and branching processes. It is quintessential discrete
probability.
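The phase transition is easy to observe in simulation. The sketch below (names ours) samples G(n, λ/n) through independent uniforms, as in the coupling above, and computes the largest component size with a union-find; for λ < 1 the largest component is of logarithmic size, while for λ > 1 it is close to ζ_λ n.

```python
import random
from collections import Counter

def largest_component(n, lam, seed=0):
    """Sample G(n, lam/n) and return the size of its largest connected
    component, using a union-find over the n vertices."""
    rng = random.Random(seed)
    p = lam / n
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() <= p:          # edge {i,j} present iff U_{ij} <= p
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[ri] = rj
    return max(Counter(find(v) for v in range(n)).values())
```

For example, at λ = 1.5 one has ζ_λ ≈ 0.583, so on n = 1000 vertices the giant component has roughly 583 vertices; at λ = 0.5 the largest component has only a few dozen vertices.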

6.4.1 Statement and proof sketch


Before stating the main theorems, we recall a basic result from Chapter 2.

- (Poisson tail) Let S_n be a sum of n i.i.d. Poi(λ) variables. Recall from (2.4.10)
  and (2.4.11) that for a > λ

  −(1/n) log P[S_n ≥ an] ≥ a log(a/λ) − a + λ =: I_λ^Poi(a),   (6.4.1)

  and similarly for a < λ

  −(1/n) log P[S_n ≤ an] ≥ I_λ^Poi(a).   (6.4.2)

  To simplify the notation, we let

  I_λ := I_λ^Poi(1) = λ − 1 − log λ ≥ 0,   (6.4.3)

  where the inequality follows from the convexity of I_λ and the fact that it
  attains its minimum at λ = 1 where it is 0.
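For later reference, the rate function is trivial to code (a small sketch with our own names):

```python
import math

def rate_poi(lam, a):
    """Poisson large-deviation rate I_lambda^Poi(a) = a*log(a/lam) - a + lam,
    as in (6.4.1)."""
    return a * math.log(a / lam) - a + lam

def rate_one(lam):
    """I_lambda = I_lambda^Poi(1) = lam - 1 - log(lam), as in (6.4.3)."""
    return lam - 1 - math.log(lam)
```

One checks that I_λ vanishes only at λ = 1 and is strictly positive elsewhere, consistent with (6.4.3).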

We let p = λ/n and denote by C_max a largest connected component. In the
subcritical case, that is, when λ < 1, we show that the largest connected component
has logarithmic size in n.

Theorem 6.4.1 (Subcritical case: upper bound on the largest cluster). Let G_n ∼
G_{n,p_n} where p_n = λ/n with λ ∈ (0, 1). For all κ > 0,

P_{n,p_n}[|C_max| ≥ (1 + κ) I_λ^{−1} log n] = o(1),

where I_λ is defined in (6.4.3).

We also give a matching logarithmic lower bound on the size of C_max in Theo-
rem 6.4.11.
In the supercritical case, that is, when λ > 1, we prove the existence of a
unique connected component of size linear in n, which is referred to as the giant
component.

Theorem 6.4.2 (Supercritical regime: giant component). Let G_n ∼ G_{n,p_n} where
p_n = λ/n with λ > 1. For any γ ∈ (1/2, 1) and δ < 2γ − 1,

P_{n,p_n}[| |C_max| − ζ_λ n | ≥ n^γ] = O(n^{−δ}),

where ζ_λ is the unique solution in (0, 1) to the fixed point equation

1 − e^{−λζ} = ζ.

In fact, with probability 1 − O(n^{−δ}), there is a unique largest component and the
second largest connected component has size O(log n).

See Figure 6.4 for an illustration.
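The constant ζ_λ is easy to compute numerically by iterating the fixed point map (a sketch, names ours):

```python
import math

def zeta(lam, iters=200):
    """Solve 1 - exp(-lam*z) = z on (0,1) by fixed-point iteration.
    For lam > 1 the map contracts toward the unique root in (0,1)."""
    z = 0.5
    for _ in range(iters):
        z = 1.0 - math.exp(-lam * z)
    return z
```

For example, ζ_{1.5} ≈ 0.583: at λ = 1.5, just under three fifths of the vertices end up in the giant component.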


At a high level, the proof goes as follows:

- (Subcritical regime) In the subcritical case, we use an exploration process


and a domination argument to approximate the size of the connected compo-
nents with the progeny of a branching process. The result then follows from
the hitting-time theorem and the Poisson tail.

- (Supercritical regime) In the supercritical case, a similar argument gives a


bound on the expected size of the giant component, which is related to the
survival of the branching process. Chebyshev’s inequality gives concentra-
tion. The hard part there is to bound the variance.

Figure 6.4: Illustration of the phase transition.

6.4.2 Bounding cluster size: domination by branching processes


For a vertex v ∈ [n], let C_v be the connected component containing v, which we
also refer to as the cluster of v. To analyze the size of C_v, we use the exploration
process introduced in Section 6.2.1 and show that it is dominated above and below
by branching processes.
Exploration process
Recall that the exploration process started at v has 3 types of vertices: the active
vertices At , the explored vertices Et , and the neutral vertices Nt . We start with
A0 := {v}, E0 := ∅, and N0 contains all other vertices in Gn . We imagine re-
vealing the edges of Gn as they are encountered in this process and we let (Ft ) be
the corresponding filtration. In words, starting with v, the cluster of v is progres-
sively grown by adding to it at each time a vertex adjacent to one of the previously
explored vertices and uncovering its remaining neighbors in Gn .
Let as before At := |At |, Et := |Et |, and Nt := |Nt |, and

τ0 := inf{t ≥ 0 : At = 0} = |Cv |,

where the rightmost equality is from Lemma 6.2.1. Recall that (Et ) is non-decreasing
while (Nt ) is non-increasing, and that the process is fixed for all t > τ0 . Since
Et = t for all t ≤ τ0 (as exactly one vertex is explored at each time until the set of
active vertices is empty) and (At , Et , Nt ) forms a partition of [n] for all t, we have

At + t + Nt = n, ∀t ≤ τ0 . (6.4.4)

Hence, in tracking the size of the exploration process, we can work with At or Nt .
Moreover at t = τ0 we have

|Cv | = τ0 = n − Nτ0 . (6.4.5)

Similarly to the case of a Galton-Watson tree, the processes (At ) and (Nt )
admit a simple recursive form. Conditioning on Ft−1 :

- (Active vertices) If A_{t−1} = 0, the exploration process has finished its course
  and A_t = 0. Otherwise, (a) one active vertex becomes explored and (b) its
  neutral neighbors become active vertices. That is,

  A_t = A_{t−1} + 1_{{A_{t−1} > 0}} (−1 + X_t),   (6.4.6)

  where the −1 accounts for (a), the term X_t accounts for (b), and X_t is
  binomial with parameters N_{t−1} and p_n. By (6.4.4), N_{t−1} can be written in
  terms of A_{t−1} as N_{t−1} = n − (t − 1) − A_{t−1}. For the coupling arguments
  below, it will be useful to think of X_t as a sum of independent Bernoulli
  variables. That is, let (I_{t,j} : t ≥ 1, j ≥ 1) be an array of independent,
  identically distributed {0,1}-variables with P[I_{1,1} = 1] = p_n. We write

  X_t = Σ_{i=1}^{N_{t−1}} I_{t,i}.   (6.4.7)

- (Neutral vertices) Similarly, if A_{t−1} > 0, that is, N_{t−1} < n − (t − 1), then
  X_t neutral vertices become active. That is,

  N_t = N_{t−1} − 1_{{N_{t−1} < n−(t−1)}} X_t.   (6.4.8)
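The exploration process translates directly into code. The sketch below (names ours) grows the cluster of a vertex v in a graph given by adjacency sets and returns τ_0 = |C_v|:

```python
def cluster_size(adj, v):
    """Exploration process of Section 6.2.1 with active/explored/neutral
    vertices, started at v. `adj` maps each vertex to its set of neighbors.
    Returns tau_0 = |C_v|."""
    active = {v}
    explored = set()
    neutral = set(adj) - {v}
    t = 0
    while active:
        u = active.pop()          # one active vertex becomes explored ...
        explored.add(u)
        newly = adj[u] & neutral  # ... and its neutral neighbors become active
        neutral -= newly
        active |= newly
        t += 1
    return t  # t = tau_0 = |C_v|, as in Lemma 6.2.1
```

On the path graph 0-1-2 plus an isolated vertex 3, the cluster of 0 has size 3 and the cluster of 3 has size 1.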

Poisson branching process approximation


With these observations, we now relate the size of the cluster of v to the total
progeny of a Poisson branching process with an appropriately chosen offspring
mean. The intuition is simple: when p_n = λ/n, the number of neighbors of a
vertex is well approximated by a Poisson distribution; therefore, exploration of the
cluster of v is similar to that of the corresponding branching process. We will see
that this holds long enough to prove accurate results about the subcritical regime
(see Lemma 6.4.6). It will also be useful in the supercritical regime, but additional
arguments will be required there (see Lemmas 6.4.7 and 6.4.8).
Lemma 6.4.3 (Cluster size: Poisson branching process approximation). Let G_n ∼
G_{n,p_n} where p_n = λ/n with λ > 0 and let C_v be the connected component of
v ∈ [n]. Let W_λ be the total progeny of a branching process with offspring
distribution Poi(λ). Then, for 1 ≤ k_n = o(√n),

P[W_λ ≥ k_n] − O(k_n^2/n) ≤ P_{n,p_n}[|C_v| ≥ k_n] ≤ P[W_λ ≥ k_n].
From Example 6.2.7, we have an explicit formula for the distribution of Wλ .
Before proving the lemma, recall the following simple domination results from
Chapter 4:
- (Binomial domination) We have

  n ≥ m ⟹ Bin(n, p) ≽ Bin(m, p).   (6.4.9)

  The binomial distribution is also dominated by the Poisson distribution in
  the following way:

  λ ∈ (0, 1) ⟹ Poi(λ) ≽ Bin(n − 1, λ/n).   (6.4.10)

  For the proofs, see Examples 4.2.4 and 4.2.8.
We use these domination results to relate the size of a connected component to the
progeny of a branching process.

Proof of Lemma 6.4.3. We start with the upper bound.

Upper bound: Because N_{t−1} = n − (t − 1) − A_{t−1} ≤ n − 1, conditioned on ℱ_{t−1},
the following stochastic domination relations hold:

Bin(N_{t−1}, λ/n) ≼ Bin(n − 1, λ/n) ≼ Poi(λ),

by (6.4.9) and (6.4.10). Observe that the center and rightmost distributions do not
depend on N_{t−1}. Let (X_t^≻) be a sequence of independent Poi(λ).
Using the coupling in Example 4.2.8, we can couple the processes (I_{t,j})_j and
(X_t^≻) in such a way that X_t^≻ ≥ Σ_{j=1}^{n−1} I_{t,j} almost surely for all t. Then,
by induction on t,

A_t ≤ A_t^≻,

almost surely for all t, where we define (recalling (6.4.6))

A_t^≻ := A_{t−1}^≻ + 1_{{A_{t−1}^≻ > 0}} (−1 + X_t^≻),   (6.4.11)

with A_0^≻ := 1. In words, (A_t^≻) is the size of the active set of a Galton-Watson
branching process with offspring distribution Poi(λ), as defined in Section 6.2.1.
As a result, letting

W_λ = τ_0^≻ := inf{t ≥ 0 : A_t^≻ = 0}

be the total progeny of this branching process, we immediately get

P_{n,p_n}[|C_v| ≥ k_n] = P_{n,p_n}[τ_0 ≥ k_n] ≤ P[τ_0^≻ ≥ k_n] = P[W_λ ≥ k_n].

Lower bound: In the other direction, we proceed in two steps. We first show that,
up to a certain time, the process is bounded from below by a branching process
with binomial offspring distribution. In a second step, we show that this binomial
branching process can be approximated by a Poisson branching process.

1. (Domination from below) Let A_t^≺ be defined as (again recalling (6.4.6))

   A_t^≺ := A_{t−1}^≺ + 1_{{A_{t−1}^≺ > 0}} (−1 + X_t^≺),   (6.4.12)

   with A_0^≺ := 1, where

   X_t^≺ := Σ_{j=1}^{n−k_n} I_{t,j}.   (6.4.13)

   Note that we use the same I_{t,j}'s as in the definition of X_t, that is, we
   couple the two processes. This time (A_t^≺) is the size of the active set in the
   exploration process of a Galton-Watson branching process with offspring
   distribution Bin(n − k_n, p_n). Let

   τ_0^≺ := inf{t ≥ 0 : A_t^≺ = 0}

   be the total progeny of this branching process. We prove the following
   relationship between τ_0 and τ_0^≺.

   Lemma 6.4.4. We have

   P[τ_0^≺ ≥ k_n] ≤ P_{n,p_n}[τ_0 ≥ k_n].



Proof. We claim that A_t is bounded from below by A_t^≺ up to the stopping
time

σ_{n−k_n} := inf{t ≥ 0 : N_t ≤ n − k_n},

which by convention is +∞ if the event is not reached (i.e., if the cluster is
“small”; see below). Indeed, N_0 = n − 1 and for all t ≤ σ_{n−k_n}, N_{t−1} >
n − k_n by definition. Hence, by the coupling (6.4.7) and (6.4.13), X_t ≥ X_t^≺
for all t ≤ σ_{n−k_n} and, as a result, by induction on t,

A_t ≥ A_t^≺,   ∀t ≤ σ_{n−k_n},

where we used the recursions (6.4.6) and (6.4.12).


Because the inequality between A_t and A_t^≺ holds only up to time σ_{n−k_n},
we cannot compare τ_0 and τ_0^≺ directly. However, we will use the following
observation: the size of the cluster of v is at least the total number of active
and explored vertices at any time t. In particular, when σ_{n−k_n} < +∞,

τ_0 = |C_v| ≥ A_{σ_{n−k_n}} + E_{σ_{n−k_n}} = n − N_{σ_{n−k_n}} ≥ k_n.

On the other hand, when σ_{n−k_n} = +∞, we have N_t > n − k_n for all t—in
particular for t = τ_0—and therefore |C_v| = τ_0 = n − N_{τ_0} < k_n by (6.4.5).
Moreover in that case, because A_t ≥ A_t^≺ for all t ≤ σ_{n−k_n} = +∞, it holds
in addition that τ_0^≺ ≤ τ_0 < k_n. To sum up, we have proved the implications

τ_0^≺ ≥ k_n ⟹ σ_{n−k_n} < +∞ ⟹ τ_0 ≥ k_n.

In particular, we have proved the lemma.

2. (Poisson approximation) Our next step is to approximate the tail of τ_0^≺ by
   that of τ_0^≻.

   Lemma 6.4.5. We have

   P[τ_0^≺ ≥ k_n] = P[τ_0^≻ ≥ k_n] + O(k_n^2/n).

   Proof. By Theorem 6.2.6,

   P[τ_0^≺ = t] = (1/t) P[Σ_{i=1}^t X_i^≺ = t − 1],   (6.4.14)

where the X_i^≺'s are independent Bin(n − k_n, p_n). Note further that, because
the sum of independent binomials with the same success probability is
binomial,

Σ_{i=1}^t X_i^≺ ∼ Bin(t(n − k_n), p_n).

Recall on the other hand that (X_t^≻) is Poi(λ) and, because a sum of
independent Poissons is Poisson (see Exercise 6.7), we have

P[τ_0^≻ = t] = (1/t) P[Σ_{i=1}^t X_i^≻ = t − 1],   (6.4.15)

where

Σ_{i=1}^t X_i^≻ ∼ Poi(tλ).

We use the Poisson approximation result in Theorem 4.1.18 to compare the
probabilities on the right-hand sides of (6.4.14) and (6.4.15). In fact, because
the Poisson approximation is in terms of the total variation distance—which
bounds any event—one might be tempted to apply it directly to the tails of
τ_0^≺ and τ_0^≻ by summing over t. Note however that the factor of 1/t
in (6.4.14) and (6.4.15) prevents us from doing so.
Instead, we argue for each t separately and use that

|P[Σ_{i=1}^t X_i^≺ = t − 1] − P[Σ_{i=1}^t X_i^≻ = t − 1]|
    ≤ ‖Bin(t(n − k_n), p_n) − Poi(tλ)‖_TV,

by the observations in the previous paragraph. Theorem 4.1.18 tells us that

‖Bin(t(n − k_n), p_n) − Poi(t(n − k_n)[− log(1 − p_n)])‖_TV
    ≤ (1/2) t(n − k_n)[− log(1 − p_n)]^2.

We must adjust the mean of the Poisson distribution. To do so, we argue as
in Example 4.1.12 to get

‖Poi(t(n − k_n)[− log(1 − p_n)]) − Poi(tλ)‖_TV
    ≤ |tλ − t(n − k_n)(− log(1 − p_n))|.

Finally, recalling that p_n = λ/n, combining the last three displays and using
the triangle inequality for the total variation distance,

|P[Σ_{i=1}^t X_i^≺ = t − 1] − P[Σ_{i=1}^t X_i^≻ = t − 1]|
    ≤ (1/2) t(n − k_n)[− log(1 − p_n)]^2 + |tλ − t(n − k_n)(− log(1 − p_n))|
    ≤ (1/2) tn (λ/n + O(λ^2/n^2))^2 + |tλ − t(n − k_n)(λ/n + O(λ^2/n^2))|
    = O(tk_n/n),

where we used that k_n ≥ 1 and λ is fixed.


So, by (6.4.14) and (6.4.15), dividing by t and then summing over t < k_n
gives

|P[τ_0^≺ < k_n] − P[τ_0^≻ < k_n]| = O(k_n^2/n).

Rearranging proves the lemma.

Putting together Lemmas 6.4.4 and 6.4.5 gives

P_{n,p_n}[|C_v| ≥ k_n] = P_{n,p_n}[τ_0 ≥ k_n]
    ≥ P[τ_0^≺ ≥ k_n]
    ≥ P[τ_0^≻ ≥ k_n] − O(k_n^2/n)
    = P[W_λ ≥ k_n] − O(k_n^2/n),

as claimed.
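The hitting-time formula used throughout—P[W_λ = t] = (1/t) P[Σ_{i=1}^t X_i = t − 1], which for Poi(λ) offspring gives P[W_λ = t] = e^{−λt}(λt)^{t−1}/t!—can be checked against direct simulation of the total progeny (a sketch, names ours):

```python
import math
import random

def poisson_sample(lam, rng):
    """Poisson(lam) sampling by inversion of the CDF (fine for small lam)."""
    u, k = rng.random(), 0
    acc = term = math.exp(-lam)
    while u > acc:
        k += 1
        term *= lam / k
        acc += term
    return k

def total_progeny(lam, rng):
    """Total progeny of a Galton-Watson process with Poi(lam) offspring,
    simulated through the exploration walk A_t of Section 6.2.1."""
    active, t = 1, 0
    while active > 0:
        active += poisson_sample(lam, rng) - 1
        t += 1
    return t

rng = random.Random(3)
lam = 0.5  # subcritical: the progeny is finite almost surely
samples = [total_progeny(lam, rng) for _ in range(5000)]
# From the hitting-time theorem: P[W = 1] = P[Poi(lam) = 0] = exp(-lam)
emp_1 = samples.count(1) / len(samples)
```

With λ = 0.5 the empirical frequency of W_λ = 1 is close to e^{−1/2} ≈ 0.607, matching the formula.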

Subcritical regime: largest cluster


We are now ready to analyze the subcritical regime, that is, the case λ < 1.

Lemma 6.4.6 (Subcritical regime: upper bound on cluster size). Let G_n ∼ G_{n,p_n}
where p_n = λ/n with λ ∈ (0, 1) and let C_v be the connected component of v ∈ [n].
For all κ > 0,

P_{n,p_n}[|C_v| ≥ (1 + κ) I_λ^{−1} log n] = O(n^{−(1+κ)}).

Proof. We use the Poisson branching process approximation (Lemma 6.4.3). To
apply the lemma we need to bound the tail of the progeny W_λ of a Poisson
branching process. Using the notation of Lemma 6.4.3, by Theorem 6.2.6,

P[W_λ ≥ k_n] = P[W_λ = +∞] + Σ_{t≥k_n} (1/t) P[Σ_{i=1}^t X_i^≻ = t − 1],   (6.4.16)

where the X_i^≻'s are i.i.d. Poi(λ). Both terms on the right-hand side depend on
whether the mean λ is smaller or larger than 1. When λ < 1, the Poisson
branching process goes extinct with probability 1 by the extinction theory (Theo-
rem 6.1.6). Hence P[W_λ = +∞] = 0.
As to the second term, the sum of the X_i^≻'s is Poi(λt). Using the Poisson
tail (6.4.1) for λ < 1 and k_n = ω(1),

Σ_{t≥k_n} (1/t) P[Σ_{i=1}^t X_i^≻ = t − 1] ≤ Σ_{t≥k_n} P[Σ_{i=1}^t X_i^≻ ≥ t − 1]
    ≤ Σ_{t≥k_n} exp(−t I_λ^Poi((t − 1)/t))
    = Σ_{t≥k_n} exp(−t (I_λ − O(t^{−1})))
    ≤ C Σ_{t≥k_n} exp(−t I_λ)
    = O(exp(−I_λ k_n)),   (6.4.17)

for some constant C > 0.


Let c = (1 + κ) I_λ^{−1} for κ > 0. By Lemma 6.4.3,

P_{n,p_n}[|C_v| ≥ c log n] ≤ P[W_λ ≥ c log n].

By (6.4.16) and (6.4.17),

P[W_λ ≥ c log n] = O(exp(−I_λ c log n)) = O(n^{−(1+κ)}),   (6.4.18)

which proves the claim.

As before, let C_max be a largest connected component of G_n (choosing the
component containing the lowest label if there is more than one such component).
A union bound and the previous lemma immediately imply an upper bound on the
size of C_max in the subcritical case.

Proof of Theorem 6.4.1. Let again c = (1 + κ) I_λ^{−1} for κ > 0. By a union bound
and symmetry,

P_{n,p_n}[|C_max| ≥ c log n] = P_{n,p_n}[∃v, |C_v| ≥ c log n]
    ≤ n P_{n,p_n}[|C_1| ≥ c log n].   (6.4.19)

By Lemma 6.4.6,

P_{n,p_n}[|C_max| ≥ c log n] = O(n · n^{−(1+κ)}) = O(n^{−κ}) → 0,

as n → +∞.

In fact we prove below that the largest component is indeed of size roughly I_λ^{−1} log n.
But first we turn to the supercritical regime.

Supercritical regime: two phases


Applying the Poisson branching process approximation in the supercritical regime
gives the following.
Lemma 6.4.7 (Supercritical regime: extinction). Let G_n ∼ G_{n,p_n} where p_n = λ/n
with λ > 1, and let C_v be the connected component of v ∈ [n]. Let ζ_λ be the
unique solution in (0, 1) to the fixed point equation

1 − e^{−λζ} = ζ.

For any κ > 0,

P_{n,p_n}[|C_v| ≥ (1 + κ) I_λ^{−1} log n] = ζ_λ + O(log^2 n / n).
Note the small—but critical—difference with Lemma 6.4.6: this time the
branching process can survive. This happens with probability ζ_λ by extinction
theory (Theorem 6.1.6). In that case, we will need further arguments to nail down
the cluster size. Observe also that the result holds for a fixed vertex v—and
therefore does not yet tell us about the largest cluster. We come back to the latter
in the next subsection.

Proof of Lemma 6.4.7. We adapt the proof of Lemma 6.4.6, beginning with (6.4.16),
which recall states

    P[Wλ ≥ kn] = P[Wλ = +∞] + Σ_{t≥kn} (1/t) P[X1 + · · · + Xt = t − 1],

where the Xi are i.i.d. Poi(λ). When λ > 1, P[Wλ = +∞] = ζλ, where ζλ > 0
is the survival probability of the branching process by Example 6.1.10. As to the
second term, using (6.4.2) for λ > 1,

    Σ_{t≥kn} (1/t) P[X1 + · · · + Xt = t − 1] ≤ Σ_{t≥kn} P[X1 + · · · + Xt ≤ t]
                                             ≤ Σ_{t≥kn} exp(−t Iλ)
                                             ≤ C exp(−Iλ kn),     (6.4.20)

for a constant C > 0.

Now let c = (1 + κ)Iλ^{−1} for κ > 0. By Lemma 6.4.3,

    Pn,pn [|Cv| ≥ c log n] = P[Wλ ≥ c log n] + O(log² n / n).     (6.4.21)

By (6.4.16) and (6.4.20),

    P[Wλ ≥ c log n] = ζλ + O(exp(−cIλ log n)) = ζλ + O(n^{−(1+κ)}).     (6.4.22)

Combining (6.4.21) and (6.4.22), for any κ > 0,

    Pn,pn [|Cv| ≥ c log n] = ζλ + O(log² n / n),     (6.4.23)

as claimed.

Recall that the Poisson branching process approximation was based on the fact
that the degree of a vertex is well approximated by a Poisson distribution. When
the exploration process goes on for too long however (i.e., when kn is large), this
approximation is not as accurate because of a saturation effect: at each step of the
exploration, we uncover edges to the neutral vertices (which then become active);
and, because an Erdős-Rényi graph has a finite pool of vertices from which to
draw these edges, as the number of neutral vertices decreases so does the expected
number of uncovered edges. Instead we use the following lemma which explicitly
accounts for the dwindling size of Nt . Roughly speaking, we model the set of
neutral vertices as a process that discards a fraction pn of its current set at each
time step (i.e., those neutral vertices with an edge to the current explored vertex).

Lemma 6.4.8. Let Gn ∼ Gn,pn where pn = λ/n with λ > 0 and let Cv be the
connected component of v ∈ [n]. Let Yt ∼ Bin(n − 1, 1 − (1 − pn)^t). Then, for
any t,

    Pn,pn [|Cv| = t] ≤ P[Yt = t − 1].

Proof. We work with neutral vertices. By (6.4.4) and Lemma 6.2.1, for any t,

    Pn,pn [|Cv| = t] = Pn,pn [τ0 = t] ≤ Pn,pn [Nt = n − t].     (6.4.24)

Recall that N0 = n − 1 and

    Nt = Nt−1 − 1{Nt−1 < n − (t−1)} Σ_{i=1}^{Nt−1} It,i.

It is easier to consider the process without the indicator as it has a simple
distribution. Define N′0 := n − 1 and

    N′t := N′t−1 − Σ_{i=1}^{N′t−1} It,i,

and observe that Nt ≥ N′t for all t, as the two processes agree up to time τ0, at
which point Nt stays fixed. The interpretation of N′t is straightforward: starting
with n − 1 vertices, at each time step each remaining vertex is discarded with
probability pn. Hence, the number of surviving vertices at time t has distribution

    N′t ∼ Bin(n − 1, (1 − pn)^t),

by the independence of the steps. Arguing as in (6.4.24),

    Pn,pn [|Cv| = t] ≤ Pn,pn [N′t = n − t]
                    = Pn,pn [(n − 1) − N′t = t − 1]
                    = P[Yt = t − 1],

which concludes the proof.
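As a sanity check on the key step of the proof, one can simulate the auxiliary discarding process N′t directly and compare it with its claimed Bin(n − 1, (1 − pn)^t) marginal. This is our own illustration; the helper names are not from the text.

```python
import random

def neutral_process(n, p, t, rng):
    # N'_t: start from n - 1 vertices; each surviving vertex is discarded
    # independently with probability p at every step.
    m = n - 1
    for _ in range(t):
        m -= sum(1 for _ in range(m) if rng.random() < p)
    return m

rng = random.Random(1)
n, lam, t = 200, 2.0, 20
p = lam / n
reps = 1000
avg = sum(neutral_process(n, p, t, rng) for _ in range(reps)) / reps
expected = (n - 1) * (1 - p) ** t   # mean of Bin(n - 1, (1 - p)^t)
print(avg, expected)                # the two should be close
```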

The previous lemma gives the following additional bound on the cluster size
in the supercritical regime. Together with Lemma 6.4.7 it shows that, when |Cv| >
c log n, the cluster size is in fact linear in n with high probability. We will have
more to say about the largest cluster in the next subsection.

Lemma 6.4.9 (Supercritical regime: saturation). Let Gn ∼ Gn,pn where pn = λ/n
with λ > 1 and let Cv be the connected component of v ∈ [n]. Let ζλ be the unique
solution in (0, 1) to the fixed point equation

    1 − e^{−λζ} = ζ.

For any α < ζλ and any δ > 0, there exists κδ,α > 0 large enough so that

    Pn,pn [(1 + κδ,α)Iλ^{−1} log n ≤ |Cv| ≤ αn] = O(n^{−(1+δ)}).     (6.4.25)

Proof. By Lemma 6.4.8,

    Pn,pn [|Cv| = t] ≤ P[Yt = t − 1] ≤ P[Yt ≤ t],

where Yt ∼ Bin(n − 1, 1 − (1 − pn)^t). Roughly, the right-hand side is negligible
until the mean µt := (n − 1)(1 − (1 − λ/n)^t) is of the order of t. Let ζλ be as
above, and recall that it is a solution to

    1 − e^{−λζ} − ζ = 0.

Note in particular that, when t = ζλ n,

    µt = (n − 1)(1 − (1 − λ/n)^{ζλ n}) ≈ n(1 − e^{−λζλ}) = ζλ n = t.

Let α < ζλ. For any t ∈ [c log n, αn], by the Chernoff bound for Poisson trials
(Theorem 2.4.7 (ii)(b)),

    P[Yt ≤ t] ≤ exp(−(µt/2)(1 − t/µt)²).     (6.4.26)

For t/n ≤ α < ζλ, using 1 − x ≤ e^{−x} for x ∈ (0, 1) (see Exercise 1.16), there is
γα > 1 such that

    µt ≥ (n − 1)(1 − e^{−λ(t/n)})
       = t · ((n − 1)/n) · (1 − e^{−λ(t/n)})/(t/n)
       ≥ t · ((n − 1)/n) · (1 − e^{−λα})/α
       ≥ γα t,

for n large enough, where we used that (1 − e^{−λx})/x is decreasing in x on the
third line and that 1 − e^{−λx} − x > 0 for 0 < x < ζλ on the fourth line (as can
be checked by computing the first and second derivatives). Plugging this back
into (6.4.26), we get
    P[Yt ≤ t] ≤ exp(−t (γα/2)(1 − 1/γα)²).

Therefore

    Σ_{t=c log n}^{αn} Pn,pn [|Cv| = t] ≤ Σ_{t=c log n}^{αn} P[Yt ≤ t]
                                        ≤ Σ_{t≥c log n} exp(−t (γα/2)(1 − 1/γα)²)
                                        = O(exp(−c log n · (γα/2)(1 − 1/γα)²)).

Taking κδ,α > 0 large enough proves (6.4.25).

6.4.3 Concentration of cluster size: second moment bounds

To characterize the size of the largest cluster in the supercritical case, we use
Chebyshev's inequality. We also use a related second moment argument to give
a lower bound on the largest cluster in the subcritical regime.

Supercritical regime: giant component

Assume λ > 1. Our goal is to characterize the size of the largest component. We do
this by bounding what is not in it (i.e., intuitively those vertices whose exploration
process goes extinct). For δ > 0 and α < ζλ, let κδ,α be as defined in Lemma 6.4.9.
Set

    k̲n := (1 + κδ,α)Iλ^{−1} log n    and    k̄n := αn.

We call a vertex v such that |Cv| ≤ k̲n a small vertex. Let

    Sk := Σ_{v∈[n]} 1{|Cv| ≤ k}.

It will also be useful to work with

    Bk := n − Sk = Σ_{v∈[n]} 1{|Cv| > k}.

The quantity Sk̲n is the number of small vertices. By Lemma 6.4.7, its expectation
is

    En,pn [Sk̲n] = n(1 − Pn,pn [|Cv| > k̲n]) = (1 − ζλ)n + O(log² n).     (6.4.27)

Using Chebyshev's inequality (Theorem 2.1.2), we prove that Sk̲n is concentrated.

Lemma 6.4.10 (Concentration of Sk̲n). For any γ ∈ (1/2, 1) and δ < 2γ − 1,

    Pn,pn [|Sk̲n − (1 − ζλ)n| ≥ n^γ] = O(n^{−δ}).

Lemma 6.4.10, which is proved below, leads to our main result in the supercritical
case: the existence of the giant component, a unique cluster Cmax of size linear in n.

Proof of Theorem 6.4.2. Take α ∈ (ζλ/2, ζλ) and let k̲n, k̄n, and γ be as above.
Let B1,n := {|Bk̲n − ζλ n| ≥ n^γ}. Because γ < 1, the event B1,n^c implies that

    Σ_{v∈[n]} 1{|Cv| > k̲n} = Bk̲n > ζλ n − n^γ ≥ 1,

for n large enough. That is, there is at least one "large" cluster of size > k̲n. In
turn, that implies

    |Cmax| ≤ Bk̲n,

since there are at most Bk̲n vertices in that large cluster.
Let B2,n := {∃v, |Cv| ∈ [k̲n, k̄n]}. If B2,n^c holds, in addition to B1,n^c, then

    |Cmax| ≤ Bk̲n = Bk̄n,

since there is no cluster whose size falls in [k̲n, k̄n]. Moreover there is equality
across the last display if there is a unique cluster of size greater than k̄n.
This is indeed the case under B1,n^c ∩ B2,n^c: if there were two distinct clusters of
size greater than k̄n, then, since 2α > ζλ, we would have for n large enough

    Bk̲n = Bk̄n > 2k̄n = 2αn > ζλ n + n^γ,

a contradiction. Hence we have proved that, under B1,n^c ∩ B2,n^c,

    |Cmax| = Bk̲n = Bk̄n.

Take δ < 2γ − 1. Applying Lemmas 6.4.10 and 6.4.9,

    P[B1,n ∪ B2,n] ≤ O(n^{−δ}) + n · O(n^{−(1+δ)}) = O(n^{−δ}),

which concludes the proof.
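Numerically, the conclusion is easy to visualize: the largest component occupies roughly a ζλ fraction of the vertices. The sketch below is our own code (all helper names are hypothetical); it compares a sampled giant component with the fixed point of 1 − e^{−λζ} = ζ.

```python
import math
import random

def giant_fraction(n, lam, rng):
    # Sample G(n, lam/n) and return |C_max| / n via depth-first search.
    p = lam / n
    adj = [[] for _ in range(n)]
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                adj[u].append(v)
                adj[v].append(u)
    seen = [False] * n
    best = 0
    for s in range(n):
        if seen[s]:
            continue
        stack, size = [s], 0
        seen[s] = True
        while stack:
            x = stack.pop()
            size += 1
            for y in adj[x]:
                if not seen[y]:
                    seen[y] = True
                    stack.append(y)
        best = max(best, size)
    return best / n

rng = random.Random(2)
lam = 2.0
zeta = 1.0
for _ in range(100):                 # solve 1 - exp(-lam * zeta) = zeta
    zeta = 1.0 - math.exp(-lam * zeta)
frac = giant_fraction(2000, lam, rng)
print(frac, zeta)                    # both should be close to 0.797
```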



It remains to prove Lemma 6.4.10.

Proof of Lemma 6.4.10. As mentioned above, we use Chebyshev's inequality. Hence
our main task is to bound the variance of Sk̲n.
Our starting point is the following expression for the second moment:

    En,pn [Sk²] = Σ_{u,v∈[n]} Pn,pn [|Cu| ≤ k, |Cv| ≤ k]
                = Σ_{u,v∈[n]} (Pn,pn [|Cu| ≤ k, |Cv| ≤ k, u ↔ v]
                               + Pn,pn [|Cu| ≤ k, |Cv| ≤ k, u ↮ v]),     (6.4.28)

where u ↔ v indicates that u and v are in the same connected component, and
u ↮ v denotes its complement.

To bound the first term in (6.4.28), we note that u ↔ v implies that Cu = Cv.
Hence,

    Σ_{u,v∈[n]} Pn,pn [|Cu| ≤ k, |Cv| ≤ k, u ↔ v]
        = Σ_{u,v∈[n]} Pn,pn [|Cu| ≤ k, v ∈ Cu]
        = Σ_{u,v∈[n]} En,pn [1{|Cu| ≤ k} 1{v ∈ Cu}]
        = Σ_{u∈[n]} En,pn [|Cu| 1{|Cu| ≤ k}]
        = n En,pn [|C1| 1{|C1| ≤ k}]
        ≤ nk.     (6.4.29)

To bound the second term in (6.4.28), we sum over the size of Cu and note that,
conditioned on {|Cu| = ℓ, u ↮ v}, the size of Cv has the same distribution as the
unconditional size of C1 in a Gn−ℓ,pn random graph, that is,

    Pn,pn [|Cv| ≤ k | |Cu| = ℓ, u ↮ v] = Pn−ℓ,pn [|C1| ≤ k].

Observe that the probability on the right-hand side is increasing in ℓ (as can be
seen, e.g., by coupling; see below for a related argument). Hence

    Σ_{u,v∈[n]} Σ_{ℓ≤k} Pn,pn [|Cu| = ℓ, |Cv| ≤ k, u ↮ v]
        = Σ_{u,v∈[n]} Σ_{ℓ≤k} Pn,pn [|Cu| = ℓ, u ↮ v] Pn,pn [|Cv| ≤ k | |Cu| = ℓ, u ↮ v]
        = Σ_{u,v∈[n]} Σ_{ℓ≤k} Pn,pn [|Cu| = ℓ, u ↮ v] Pn−ℓ,pn [|C1| ≤ k]
        ≤ Σ_{u,v∈[n]} Σ_{ℓ≤k} Pn,pn [|Cu| = ℓ] Pn−k,pn [|C1| ≤ k]
        = Σ_{u,v∈[n]} Pn,pn [|Cu| ≤ k] Pn−k,pn [|C1| ≤ k].

To get a bound on the variance of Sk, we need to relate this last expression to
(En,pn [Sk])², where we will use that

    En,pn [Sk] = En,pn [Σ_{v∈[n]} 1{|Cv| ≤ k}] = Σ_{v∈[n]} Pn,pn [|Cv| ≤ k].     (6.4.30)

We define

    ∆k := Pn−k,pn [|C1| ≤ k] − Pn,pn [|C1| ≤ k].

Then, plugging this back above, we get

    Σ_{u,v∈[n]} Σ_{ℓ≤k} Pn,pn [|Cu| = ℓ, |Cv| ≤ k, u ↮ v]
        ≤ Σ_{u,v∈[n]} Pn,pn [|Cu| ≤ k] (Pn,pn [|Cv| ≤ k] + ∆k)
        ≤ (En,pn [Sk])² + n² |∆k|,


by (6.4.30). It remains to bound ∆k.
We use a coupling argument. Let H ∼ Gn−k,pn and construct H′ ∼ Gn,pn
in the following manner: let H′ coincide with H on the first n − k vertices, then
pick the rest of the edges independently. Then clearly ∆k ≥ 0 since the cluster of 1
in H′ includes the cluster of 1 in H. In fact, ∆k is the probability that under this
coupling the cluster of 1 has at most k vertices in H but not in H′. That implies
in particular that at least one of the vertices in the cluster of 1 in H is connected to
a vertex in {n − k + 1, . . . , n}. Hence, by a union bound over those k² potential
edges,

    ∆k ≤ k² pn,

and

    Σ_{u,v∈[n]} Pn,pn [|Cu| ≤ k, |Cv| ≤ k, u ↮ v] ≤ (En,pn [Sk])² + λnk².     (6.4.31)

Combining (6.4.28), (6.4.29), and (6.4.31), we get

    Var[Sk] ≤ nk + λnk² ≤ 2λnk²,

where the last inequality holds for k ≥ 1/λ, and in particular for k = k̲n.
The result follows from (6.4.27) and Chebyshev's inequality:

    P[|Sk̲n − (1 − ζλ)n| ≥ n^γ]
        ≤ P[|Sk̲n − En,pn [Sk̲n]| ≥ n^γ − C log² n]
        ≤ 2λn k̲n² / (n^γ − C log² n)²
        ≤ 2λn (1 + κδ,α)² Iλ^{−2} log² n / (C′ n^{2γ})
        ≤ C″ n^{−δ},

for constants C, C′, C″ > 0 and n large enough, where we used that 2γ > 1 and
δ < 2γ − 1.

Subcritical regime: second moment argument

A second moment argument also gives a lower bound on the size of the largest
component in the subcritical case. We proved in Theorem 6.4.1 that, when λ < 1,
the probability of observing a connected component of size larger than Iλ^{−1} log n
is vanishingly small. In the other direction, we get:

Theorem 6.4.11 (Subcritical regime: lower bound on the largest cluster). Let
Gn ∼ Gn,pn where pn = λ/n with λ ∈ (0, 1). For all κ ∈ (0, 1),

    Pn,pn [|Cmax| ≤ (1 − κ)Iλ^{−1} log n] = o(1).

Proof. Recall that

    Bk = Σ_{v∈[n]} 1{|Cv| > k}.

It suffices to prove that with probability 1 − o(1) we have Bk > 0 when k =
(1 − κ)Iλ^{−1} log n. To apply the second moment method (Theorem 2.3.2), we need
an upper bound on the second moment of Bk and a lower bound on its first moment.
The following lemma is closely related to Lemma 6.4.10. Exercise 6.12 asks for a
proof.

Lemma 6.4.12 (Second moment of Bk). Assume λ < 1. There is a constant C > 0
such that

    En,pn [Bk²] ≤ (En,pn [Bk])² + Cnk e^{−kIλ},    ∀k ≥ 0.

Lemma 6.4.13 (First moment of Bk). Let kn = (1 − κ)Iλ^{−1} log n. Then, for any
β ∈ (0, κ), we have

    En,pn [Bkn] = Ω(n^β),

for n large enough.

Proof. By Lemma 6.4.3,

    En,pn [Bkn] = n Pn,pn [|C1| > kn] ≥ n P[Wλ > kn] − O(kn²).     (6.4.32)

Once again, we use the random-walk representation of the total progeny of a
branching process (Theorem 6.2.6). In contrast to the proof of Lemma 6.4.6, we
need a lower bound this time. For this purpose, we use the explicit expression for
the law of the total progeny Wλ from Example 6.2.7:

    P[Wλ > kn] = Σ_{t>kn} (1/t) e^{−λt} (λt)^{t−1}/(t − 1)!.

Using Stirling's formula (see Appendix A) and (6.4.3), we note that

    (1/t) e^{−λt} (λt)^{t−1}/(t − 1)! = e^{−λt} (λt)^{t−1}/t!
        = e^{−λt} (λt)^t / (λt · (t/e)^t √(2πt) (1 + o(1)))
        = ((1 − o(1)) / (λ √(2πt³))) exp(−tλ + t log λ + t)
        = ((1 − o(1)) / (λ √(2πt³))) exp(−t Iλ).

Hence, for any ε > 0,

    P[Wλ > kn] ≥ λ^{−1} Σ_{t>kn} exp(−t(Iλ + ε)) = Ω(exp(−kn (Iλ + ε))),
for n large enough. For any β ∈ (0, κ), taking ε small enough we have

    n P[Wλ > kn] = Ω(n exp(−kn (Iλ + ε)))
                 = Ω(exp({1 − (1 − κ)Iλ^{−1}(Iλ + ε)} log n))
                 = Ω(n^β).

Plugging this back into (6.4.32) gives

    En,pn [Bkn] = Ω(n^β),

which proves the claim.
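The Borel-type formula for the law of Wλ used above can be checked by simulation. The sketch below is our own (helper names are ours, not the text's); it uses the random-walk representation of the total progeny.

```python
import math
import random

def poisson(lam, rng):
    # Knuth-style inversion-by-multiplication sampler; fine for small lam.
    L, k, prod = math.exp(-lam), 0, rng.random()
    while prod > L:
        k += 1
        prod *= rng.random()
    return k

def total_progeny(lam, rng, cap=10 ** 5):
    # Random-walk representation: one active individual is explored per step.
    active, steps = 1, 0
    while active > 0 and steps < cap:
        active += poisson(lam, rng) - 1
        steps += 1
    return steps

def borel(lam, t):
    # P[W_lam = t] = e^{-lam t} (lam t)^{t-1} / t!
    return math.exp(-lam * t) * (lam * t) ** (t - 1) / math.factorial(t)

lam, reps = 0.5, 20000
rng = random.Random(3)
counts = {}
for _ in range(reps):
    w = total_progeny(lam, rng)
    counts[w] = counts.get(w, 0) + 1
for t in (1, 2, 3):
    print(t, counts.get(t, 0) / reps, borel(lam, t))  # empirical vs exact
```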

We return to the proof of Theorem 6.4.11. Let again kn = (1 − κ)Iλ^{−1} log n.
By the second moment method and Lemmas 6.4.12 and 6.4.13,

    Pn,pn [Bkn > 0] ≥ (En,pn [Bkn])² / En,pn [Bkn²]
                   ≥ (1 + O(nkn e^{−kn Iλ}) / Ω(n^{2β}))^{−1}
                   = (1 + O(nkn e^{(κ−1) log n}) / Ω(n^{2β}))^{−1}
                   = (1 + O(kn n^κ) / Ω(n^{2β}))^{−1}
                   → 1,

for β close enough to κ. That proves the claim.

6.4.4 Critical case via martingales

It remains to consider the critical case, that is, when λ = 1. As it turns out, the
model goes through a "double jump": as λ crosses 1, the largest cluster size goes
from order log n to order n^{2/3} to order n. Here we use martingale methods to show
the following.

Theorem 6.4.14 (Critical case: upper bound on the largest cluster). Let Gn ∼
Gn,pn where pn = 1/n. For all κ > 1,

    Pn,pn [|Cmax| > κn^{2/3}] ≤ C/κ^{3/2},

for some constant C > 0.
Remark 6.4.15. One can also derive a lower bound on the probability that |Cmax| >
κn^{2/3} for some κ > 0 [ER60]. Exercise 6.20 provides a sketch based on counting
tree components; the combinatorial approach has the advantage of giving insights
into the structure of the graph (see [Bol01] for more on this). See also [NP10] for a
martingale proof of the lower bound as well as a better upper bound.

The key technical bound is the following.

Lemma 6.4.16. Let Gn ∼ Gn,pn where pn = 1/n and let Cv be the connected
component of v ∈ [n]. There are constants c, c′ > 0 such that for all k ≥ c,

    Pn,pn [|Cv| > k] ≤ c′/√k.
Before we establish the lemma, we prove the theorem assuming it.

Proof of Theorem 6.4.14. Recall that

    Bk = Σ_{v∈[n]} 1{|Cv| > k}.

Take

    kn := κn^{2/3}.

Observe that if |Cmax| > kn, then each of the more than kn vertices of Cmax lies in
a cluster of size > kn, so Bkn > kn. By Markov's inequality (Theorem 2.1.1) and
Lemma 6.4.16,

    Pn,pn [|Cmax| > kn] ≤ Pn,pn [Bkn > kn]
                       ≤ En,pn [Bkn] / kn
                       = n Pn,pn [|Cv| > kn] / kn
                       ≤ nc′ / kn^{3/2}
                       ≤ C/κ^{3/2},

for some constant C > 0.

It remains to prove the lemma.

Proof of Lemma 6.4.16. Once again, we use the exploration process defined in
Section 6.4.2 started at v. Let (Ft) be the corresponding filtration and let At = |At|
be the size of the active set.

Domination by a martingale  Recalling (6.4.6), we define

    Mt := Mt−1 + (−1 + X̃t),     (6.4.33)

with M0 := 1, where the X̃t are i.i.d. Bin(n, 1/n). We couple (At) and (Mt) through
equation (6.4.7) by letting

    X̃t = Σ_{i=1}^{n} It,i.

In particular Mt ≥ At for all t. Furthermore, we have

    E[Mt | Ft−1] = Mt−1 − 1 + n · (1/n) = Mt−1.

So (Mt) is a martingale. We define the stopping time

    τ̃0 := inf{t ≥ 0 : Mt = 0}.

Recalling that

    τ0 = inf{t ≥ 0 : At = 0} = |Cv|,

by Lemma 6.2.1, we have τ̃0 ≥ τ0 = |Cv| almost surely. So

    Pn,pn [|Cv| > k] ≤ P[τ̃0 > k].

The tail of τ̃0  To bound the tail of τ̃0, we introduce a modified stopping time. For
h > 0, let

    τ′h := inf{t ≥ 0 : Mt = 0 or Mt ≥ h}.

We will use the inequality

    P[τ̃0 > k] = P[Mt > 0, ∀t ≤ k] ≤ P[τ′h > k] + P[Mτ′h ≥ h],

and we will choose h below to minimize the rightmost expression (or, more specif-
ically, an upper bound on it). The rest of the analysis is similar to the gambler's
ruin problem in Example 3.1.41, with some slight complications arising from the
fact that the process is not nearest-neighbor.
We note that by the exponential tail of hitting times on finite state spaces
(Lemma 3.1.25), the stopping time τ′h is almost surely finite and, in fact, has a
finite expectation. By two applications of Markov's inequality,

    P[Mτ′h ≥ h] ≤ E[Mτ′h]/h,

and

    P[τ′h > k] ≤ E[τ′h]/k.
We bound the expectations on the right-hand sides.

Bounding E[Mτ′h] and E[τ′h]  To compute E[Mτ′h], we apply the optional stopping
theorem in the uniformly bounded case (Theorem 3.1.38 (ii)) to the stopped process
(Mt∧τ′h) (which is also a martingale by Lemma 3.1.37) to get that

    E[Mτ′h] = E[M0] = 1.

We conclude that

    P[Mτ′h ≥ h] ≤ 1/h.     (6.4.34)
To compute E[τ′h], we use a different martingale (adapted from Example 3.1.31),
specifically

    Lt := Mt² − σ²t,

where we let σ² := n · (1/n)(1 − 1/n) = 1 − 1/n, which is ≥ 1/2 when n ≥ 2. To see
that (Lt) is a martingale, note that by taking out what is known (Lemma B.6.13)
and using the fact that (Mt) is itself a martingale,

    E[Lt | Ft−1] = E[(Mt−1 + (Mt − Mt−1))² − σ²t | Ft−1]
                 = E[Mt−1² + 2Mt−1(Mt − Mt−1) + (Mt − Mt−1)² − σ²t | Ft−1]
                 = Mt−1² + 2Mt−1 · 0 + σ² − σ²t
                 = Lt−1.

By Lemma 3.1.37, the stopped process (Lt∧τ′h) is also a martingale, and it has
bounded increments since

    |L(t+1)∧τ′h − Lt∧τ′h| ≤ |(M(t+1)∧τ′h)² − (Mt∧τ′h)²| + σ²
                          ≤ (−1 + X̃t+1)² + 2h|−1 + X̃t+1| + σ²
                          ≤ n² + 2hn + 1.

We use the optional stopping theorem in the bounded increments case (Theo-
rem 3.1.38 (iii)) on (Lt∧τ′h) to get

    E[(Mτ′h)² − σ²τ′h] = E[(Mτ′h)²] − σ² E[τ′h] = 1.
After rearranging,

    E[τ′h] ≤ (1/σ²) E[(Mτ′h)²] ≤ 2 E[(Mτ′h)²],     (6.4.35)

where we used the fact that σ² ≥ 1/2.

To bound E[(Mτ′h)²], we need to control by how much the process "overshoots"
h. A stochastic domination argument gives the desired bound; Exercise 6.21 asks
for a proof.

Lemma 6.4.17 (Overshoot bound). Let f be an increasing function and W ∼
Bin(n, 1/n). Then

    E[f(Mτ′h − h) | Mτ′h ≥ h] ≤ E[f(W)].

The lemma implies that

    E[(Mτ′h)² | Mτ′h ≥ h] = E[(Mτ′h − h)² + 2(Mτ′h − h)h + h² | Mτ′h ≥ h]
                          ≤ (σ² + 1) + 2h + h²
                          ≤ 4h²,

for h ≥ 2. Since Mτ′h = 0 unless Mτ′h ≥ h, plugging back into (6.4.35) gives

    E[τ′h] ≤ 2 E[(Mτ′h)² | Mτ′h ≥ h] P[Mτ′h ≥ h] ≤ 8h,

where we used (6.4.34).


Putting everything together  Finally, take h := √(k/8). Putting everything together,

    Pn,pn [|Cv| > k] ≤ P[τ̃0 > k] ≤ P[τ′h > k] + P[Mτ′h ≥ h] ≤ 8h/k + 1/h = 2√(8/k).

That concludes the proof.
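The k^{−1/2} tail can be seen in simulation by running the dominating walk (Mt) directly. The code below is our own sketch; the helper names are not from the text.

```python
import random

def walk_survives(n, k, rng):
    # Dominating martingale: M_0 = 1, increments Bin(n, 1/n) - 1;
    # returns True if M stays positive for k steps.
    m = 1
    for _ in range(k):
        m += sum(1 for _ in range(n) if rng.random() < 1.0 / n) - 1
        if m <= 0:
            return False
    return True

rng = random.Random(4)
n, reps = 100, 4000
est = {}
for k in (25, 100):
    est[k] = sum(walk_survives(n, k, rng) for _ in range(reps)) / reps
    print(k, est[k], 2 * (8 / k) ** 0.5)  # Monte Carlo estimate vs the proof's bound
```

The bound 2√(8/k) is loose but of the right order; the estimates should decay roughly like 1/√k as k grows.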

6.4.5 Encore: random walk on the Erdős-Rényi graph


So far in this section we have used techniques from all chapters of the book—with
the exception of Chapter 5. Not to be outdone, we discuss one last result that will
make use of spectral techniques. We venture a little further down the evolution of
the Erdős-Rényi graph model to the connected regime. Specifically, recall from
Section 2.3.2 that Gn = (Vn , En ) ∼ Gn,pn is connected with probability 1 − o(1)
when npn = ω(log n).

We show in that regime that lazy simple random walk (Xt) on Gn "mixes fast."
Recall from Example 1.1.29 that, when the graph is connected, the corresponding
transition matrix P is reversible with respect to the stationary distribution

    π(v) := δ(v)/(2|En|),

where δ(v) is the degree of v. For a fixed ε > 0, the mixing time (see Defini-
tion 1.1.35) is

    tmix(ε) = inf{t ≥ 0 : d(t) ≤ ε},

where

    d(t) = sup_{x∈Vn} ‖P^t(x, ·) − π(·)‖TV.
By convention, we let tmix (ε) = +∞ if the graph is not connected. Our main
result is the following.

Theorem 6.4.18 (Mixing on a connected Erdős-Rényi graph). Let Gn ∼ Gn,pn


with npn = ω(log n). With probability 1 − o(1), the mixing time is O(log n).

Edge expansion  We use Cheeger's inequality (Theorem 5.3.5) which, recall,
states that

    γ ≥ Φ∗²/2,

where γ is the spectral gap of P (see Definition 5.2.11) and

    Φ∗ = min{ΦE(S; c, π) : S ⊆ V, 0 < π(S) ≤ 1/2}

is the edge expansion constant (see Definition 5.3.2), with

    ΦE(S; c, π) = c(S, Sᶜ)/π(S),

for a subset of vertices S ⊆ Vn. Here, for a pair of vertices x, y connected by an
edge,

    c(x, y) = π(x)P(x, y) = (δ(x)/(2|En|)) · (1/δ(x)) = 1/(2|En|).

Hence

    c(S, Sᶜ) = |E(S, Sᶜ)|/(2|En|),
where E(S, Sᶜ) is the set of edges between S and Sᶜ. Similarly,

    π(S) = Σ_{x∈S} δ(x) / (2|En|).

The numerator is referred to as the volume of S and we use the notation vol(S) =
Σ_{x∈S} δ(x). So

    c(S, Sᶜ)/π(S) = |E(S, Sᶜ)|/vol(S).     (6.4.36)
Because the random walk is lazy, the spectral gap is equal to the absolute spec-
tral gap (see Definition 5.2.11), and as a consequence the relaxation time (see Def-
inition 5.2.12) is

    trel = γ^{−1}.

Using Theorem 5.2.14, we get

    tmix(ε) ≤ log(1/(ε πmin)) trel ≤ log(1/(ε πmin)) · (2/Φ∗²),     (6.4.37)

where

    πmin = min_x π(x) = min_x δ(x)/(2|En|) = min_x δ(x) / Σ_y δ(y).

So our main task is to bound δ(x) and |E(S, Sᶜ)| with high probability. We do this
next.

Bounding the degrees  In fact, we have already done half the work. Indeed, in
Example 2.4.18 we studied the maximum degree of Gn,

    Dn = max_{v∈Vn} δ(v),

in the regime npn = ω(log n). We showed that for any ζ > 0, as n → +∞,

    P[|Dn − npn| ≤ 2√((1 + ζ) npn log n)] → 1.

The proof of that result actually shows something stronger: all degrees satisfy the
inequality simultaneously, that is,

    P[∀v ∈ Vn, |δ(v) − npn| ≤ 2√((1 + ζ) npn log n)] = 1 − o(1).     (6.4.38)

We will use the fact that 2√((1 + ζ) npn log n) = o(npn) when npn = ω(log n). In
essence, all degrees are roughly npn. That implies the following claims.

Lemma 6.4.19 (Bounds on stationary distribution and volume). The following hold
with probability 1 − o(1).

(i) The smallest stationary probability satisfies

        πmin ≥ (1 − o(1))/n.

(ii) For any set of vertices S ⊆ Vn with |S| > 2n/3, we have

        π(S) > 1/2.

(iii) For any set of vertices S ⊆ Vn with s := |S|,

        vol(S) = snpn (1 + o(1)).

Proof. We assume that the event in (6.4.38) holds. For (i), that means

    πmin ≥ (npn − 2√((1 + ζ)npn log n)) / (n(npn + 2√((1 + ζ)npn log n))) = (1/n)(1 − o(1)),

when npn = ω(log n). For (ii), we get

    π(S) = Σ_{x∈S} δ(x) / Σ_{x∈Vn} δ(x)
         ≥ |S|(npn − 2√((1 + ζ)npn log n)) / (n(npn + 2√((1 + ζ)npn log n)))
         > (2/3)(1 − o(1)).

Finally, (iii) follows similarly.
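The degree concentration (6.4.38) driving these estimates is easy to check by direct simulation. This is our own sketch (names are ours); it samples one graph in the regime npn = ω(log n) and verifies that every degree is close to npn.

```python
import math
import random

rng = random.Random(5)
n = 400
p = (math.log(n) ** 2) / n           # n p = log^2 n = omega(log n)
deg = [0] * n
for u in range(n):
    for v in range(u + 1, n):
        if rng.random() < p:
            deg[u] += 1
            deg[v] += 1
mean_deg = n * p
dev = 2 * math.sqrt(2 * mean_deg * math.log(n))   # 2 sqrt((1+zeta) np log n), zeta = 1
print(min(deg), mean_deg, max(deg))
ok = all(abs(d - mean_deg) <= dev for d in deg)   # every degree close to np
print(ok)
```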

Bounding the cut size  An application of Bernstein's inequality (Theorem 2.4.17)
gives the following bound.

Lemma 6.4.20 (Bound on the edge expansion). With probability 1 − o(1),

    Φ∗ = Ω(1).

Proof. By the definition of Φ∗ and Lemma 6.4.19 (ii), we can restrict ourselves to
sets S of size at most 2n/3. Let S be such a set with s = |S|. Then |E(S, Sᶜ)| is
Bin(s(n − s), pn). By Bernstein's inequality with c = 1 and νi = pn(1 − pn),

    Pn,pn [|E(S, Sᶜ)| ≤ s(n − s)pn − β] ≤ exp(−β² / (4s(n − s)pn(1 − pn))),

for β ≤ s(n − s)pn(1 − pn). We take β = (1/2)s(n − s)pn and get

    Pn,pn [|E(S, Sᶜ)| ≤ (1/2)s(n − s)pn] ≤ exp(−s(n − s)pn / (16(1 − pn))).

By a union bound over all sets of size s and using the fact that (n choose s) ≤ (ne/s)^s
(see Appendix A), there is a constant C > 0 such that

    Pn,pn [∃S, |S| = s, |E(S, Sᶜ)| ≤ (1/2)s(n − s)pn]
        ≤ (n choose s) exp(−s(n − s)pn / (16(1 − pn)))
        ≤ exp(−s npn/48 + s log(ne/s))
        ≤ exp(−Cs npn),

for n large enough, where we also used that n − s ≥ n/3 and npn = ω(log n).
Summing over s gives, for a constant C′ > 0,

    Pn,pn [∃S, 1 ≤ |S| ≤ 2n/3, |E(S, Sᶜ)| ≤ (1/2)|S|(n − |S|)pn] ≤ C′ exp(−Cnpn),

which goes to 0 as n → +∞.
Using (6.4.36) and Lemma 6.4.19 (iii), any set S such that |E(S, Sᶜ)| >
(1/2)|S|(n − |S|)pn has edge expansion

    ΦE(S; c, π) ≥ (1/2)|S|(n − |S|)pn / (|S|npn(1 + o(1))) ≥ (1/6)(1 − o(1)),

using n − |S| ≥ n/3. That proves the claim.

Proof of the theorem  Finally, we are ready to prove the main result.

Proof of Theorem 6.4.18. Plugging Lemma 6.4.19 (i) and Lemma 6.4.20 into (6.4.37)
gives

    tmix(ε) ≤ log(1/(ε πmin)) · (2/Φ∗²) ≤ C″ log(ε^{−1} n(1 + o(1))) = O(log n),

for some constant C″ > 0.

Remark 6.4.21. A mixing time of O(log n) in fact holds for lazy simple random walk on
Gn,pn when pn = λ log n / n with λ > 1 [CF07]. See also [Dur06, Section 6.5]. Mixing time
on the giant component has also been studied. See, e.g., [FR08, BKW14, DKLP11].
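A small numerical check of the spectral argument (our own sketch, using numpy; all names are ours): build one sample in the regime npn = ω(log n), form the lazy walk, and inspect its spectral gap and the resulting mixing-time bound (6.4.37).

```python
import numpy as np

rng = np.random.default_rng(6)
n = 300
p = (np.log(n) ** 2) / n                     # n p_n = log^2 n
A = np.triu((rng.random((n, n)) < p).astype(float), 1)
A = A + A.T                                  # symmetric adjacency, no self-loops
deg = A.sum(axis=1)
assert deg.min() > 0                         # no isolated vertices (holds w.h.p. here)
P = 0.5 * (np.eye(n) + A / deg[:, None])     # lazy simple random walk
# P is reversible, hence similar to a symmetric matrix: real eigenvalues in [0, 1]
evals = np.sort(np.linalg.eigvals(P).real)[::-1]
gap = 1.0 - evals[1]                         # spectral gap (= absolute gap for the lazy walk)
pi_min = deg.min() / deg.sum()
t_mix_bound = np.log(1.0 / (0.25 * pi_min)) / gap   # (6.4.37) with eps = 1/4, t_rel = 1/gap
print(gap, t_mix_bound)
```

The gap stays bounded away from 0 as n grows in this regime, so the bound on tmix is indeed logarithmic in n.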

Exercises
Exercise 6.1 (Galton-Watson process: subcritical case). We use Markov’s inequal-
ity to analyze the subcritical case.

(i) Let (Zt ) be a Galton-Watson process with offspring distribution mean m <
1. Use Markov’s inequality (Theorem 2.1.1) to prove that extinction occurs
almost surely.

(ii) Prove the equivalent result in the multitype case, that is, prove (6.1.8).

Exercise 6.2 (Galton-Watson process: geometric offspring). Let (Zt) be a Galton-
Watson branching process with geometric offspring distribution (started at 0), that
is, pk = p(1 − p)^k for all k ≥ 0, for some p ∈ (0, 1). Let q := 1 − p, let m be the
mean of the offspring distribution, and let Wt = m^{−t} Zt.

(i) Compute the probability generating function f of {pk}k≥0 and the extinction
probability η := ηp as a function of p.

(ii) If G is a 2 × 2 matrix, define

        G(s) := (G11 s + G12)/(G21 s + G22).

    Show that G(H(s)) = (GH)(s).

(iii) Assume m ≠ 1. Use (ii) to derive

        ft(s) = (pm^t(1 − s) + qs − p)/(qm^t(1 − s) + qs − p).

    Deduce that when m > 1,

        E[exp(−λW∞)] = η + (1 − η) · (1 − η)/(λ + (1 − η)).

(iv) Assume m = 1. Show that

        ft(s) = (t − (t − 1)s)/(t + 1 − ts),

    and deduce that

        E[e^{−λZt/t} | Zt > 0] → 1/(1 + λ).
Exercise 6.3 (Supercritical branching process: infinite line of descent). Let (Zt)
be a supercritical Galton-Watson branching process with offspring distribution
{pk}k≥0. Let η be the extinction probability and define ζ := 1 − η. Let Zt^∞
be the number of individuals in the t-th generation with an infinite line of descent,
i.e., whose descendant subtree is infinite. Denote by S the event of nonextinction
of (Zt). Define p∞_0 := 0 and

    p∞_k := ζ^{−1} Σ_{j≥k} (j choose k) η^{j−k} ζ^k pj.

(i) Show that {p∞_k}k≥0 is a probability distribution and compute its expectation.

(ii) Show that for any k ≥ 0,

        P[Z1^∞ = k | S] = p∞_k.

    [Hint: Condition on Z1.]

(iii) Show by induction on t that, conditioned on nonextinction, the process (Zt^∞)
has the same distribution as a Galton-Watson branching process with off-
spring distribution {p∞_k}k≥0.

Exercise 6.4 (Multitype branching processes: a special case). Extend Lemma 6.1.20
to the case S(u) = 0. [Hint: Show that Ut = Z0 u for all t almost surely.]
Exercise 6.5 (Galton-Watson: Inverting history). Let
H = (X1 , . . . , Xτ0 ),
be the history (see Section 6.2) of the Galton-Watson process (Zi ). Write Zi as a
function of H, for all i.
Exercise 6.6 (Spitzer’s lemma). Prove Theorem 6.2.5.
Exercise 6.7 (Sum of Poisson). Let Q1 and Q2 be independent Poisson random
variables with respective means λ1 and λ2. Show by direct computation of the
convolution that the sum Q1 + Q2 is Poisson with mean λ1 + λ2. [Hint: Recall
that P[Q1 = k] = e^{−λ1} λ1^k / k! for all k ∈ Z+.]
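A quick numeric sanity check of this identity (our own snippet; it is of course not a proof):

```python
import math

def pois_pmf(lam, k):
    # Poisson pmf: e^{-lam} lam^k / k!
    return math.exp(-lam) * lam ** k / math.factorial(k)

lam1, lam2 = 1.3, 2.1
for k in range(10):
    conv = sum(pois_pmf(lam1, j) * pois_pmf(lam2, k - j) for j in range(k + 1))
    assert abs(conv - pois_pmf(lam1 + lam2, k)) < 1e-12
print("convolution matches Poi(lam1 + lam2) for k = 0..9")
```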
Exercise 6.8 (Percolation on bounded-degree graphs). Let G = (V, E) be a count-
able graph such that all vertices have degree bounded by b + 1 for b ≥ 2. Let 0 be
a distinguished vertex in G. For bond percolation on G, prove that

    pc(G) ≥ pc(T̂b),

by bounding the expected size of the cluster of 0. [Hint: Consider self-avoiding
paths started at 0.]
Exercise 6.9 (Percolation on T̂b: critical exponent of θ(p)). Consider bond per-
colation on the rooted infinite b-ary tree T̂b with b > 2. For ε ∈ [0, 1 − 1/b] and
u ∈ [0, 1], define

    h(ε, u) := u − [1 − ((1 − 1/b − ε) + (1/b + ε)(1 − u))^b].

(i) Show that there is a constant C > 0 not depending on ε, u such that

        |h(ε, u) + bεu − ((b − 1)/(2b)) u²| ≤ C(u³ ∨ εu²).

(ii) Use (i) to prove that

        lim_{p↓pc(T̂b)} θ(p)/(p − pc(T̂b)) = 2b²/(b − 1).

Exercise 6.10 (Percolation on T̂2: higher moments of |C0|). Consider bond per-
colation on the rooted infinite binary tree T̂2. For density p < 1/2, let Zp be an
integer-valued random variable with distribution

    Pp[Zp = ℓ] = ℓ Pp[|C0| = ℓ] / Ep|C0|,    ∀ℓ ≥ 1.

(i) Using the explicit formula for Pp[|C0| = ℓ] derived in Section 6.2.4, show
that for all 0 < a < b < +∞,

        Pp[Zp / ((1/4)(1/2 − p)^{−2}) ∈ [a, b]] → C ∫_a^b x^{−1/2} e^{−x} dx,

    as p ↑ 1/2, for some constant C > 0.

(ii) Show that for all k ≥ 2 there is Ck > 0 such that

        lim_{p↑pc(T̂2)} Ep|C0|^k / (pc(T̂2) − p)^{−1−2(k−1)} = Ck.

(iii) What happens when p ↓ pc(T̂2)?

Exercise 6.11 (Branching process approximation: improved bound). Let pn = λ/n
with λ > 0. Let Wn,pn, respectively Wλ, be the total progeny of a branching
process with offspring distribution Bin(n, pn), respectively Poi(λ).

(i) Show that

        |P[Wn,pn ≥ k] − P[Wλ ≥ k]|
            ≤ max{P[Wn,pn ≥ k, Wλ < k], P[Wn,pn < k, Wλ ≥ k]}.

(ii) Couple the two processes step-by-step and use (i) to show that

        |P[Wn,pn ≥ k] − P[Wλ ≥ k]| ≤ (λ²/n) Σ_{i=1}^{k−1} P[Wλ ≥ i].

Exercise 6.12 (Subcritical Erdős-Rényi: second moment). Prove Lemma 6.4.12.

Exercise 6.13 (Random binary search tree: property (BST)). Show that the (BST)
property is preserved by the algorithm described at the beginning of Section 6.3.1.

Exercise 6.14 (Random binary search tree: limit). Consider the equation (6.3.1).

(i) Show that there exists a unique solution greater than 1.

(ii) Prove that the expression on the left-hand side is strictly decreasing at that
solution.

Exercise 6.15 (Random binary search tree: height is well-defined). Let T be an
infinite binary tree. Assign an independent U[0, 1] random variable Zv to each
vertex v in T, set Sρ = n, and then recursively from the root down

    Sv′ := ⌊Sv Zv⌋    and    Sv″ := ⌊Sv(1 − Zv)⌋,

where v′ and v″ are the left and right descendants of v in T.

(i) Show that, for any v, it holds that Sv′ + Sv″ = Sv − 1 almost surely provided
Sv ≥ 1.

(ii) Show that, for any v, there is almost surely a descendant w of v (not neces-
sarily immediate) such that Sw = 1.

(iii) Let

        Hn = max{h : ∃v ∈ Vh, Sv = 1},

    where Vh is the set of vertices of T at topological distance h from the root.
    Show that Hn ≤ n.

Exercise 6.16 (Ising vs. CFN). Let Th be a rooted complete binary tree with h
levels. Fix 0 < p < 1/2. Assign to each vertex v a state σv ∈ {+1, −1} at
random according to the CFN model described in Section 6.3.2. Show that this
distribution is equivalent to a ferromagnetic Ising model on Th and determine the
inverse temperature β in terms of p. [Hint: Write the distribution of the states under
the CFN model as a product over the edges.]
Exercise 6.17 (Monotonicity of ‖µ+_h − µ−_h‖TV). Let µ+_h, µ−_h be as in Sec-
tion 6.3.2. Show that

    ‖µ+_{h+1} − µ−_{h+1}‖TV ≤ ‖µ+_h − µ−_h‖TV.

[Hint: Use the Markovian nature of the process.]

Exercise 6.18 (Unsolvability: recursion). Prove Lemma 6.3.15.

Exercise 6.19 (Cayley's formula). Let (Zt) be a Poisson branching process with
offspring mean 1 started at Z0 = 1 and let T be the corresponding Galton-Watson
tree. Let W be the total size of the progeny, that is, the number of vertices in T.
Recall from Example 6.2.7 that

    P[W = n] = n^{n−1} e^{−n} / n!.

(i) Given W = n, label the vertices of T uniformly at random with the integers
1, . . . , n. Show that every rooted labeled tree on n vertices arises with prob-
ability e^{−n}/n!. [Hint: Label the vertices as you grow the tree and observe
that a lot of terms cancel out or simplify.]

(ii) Derive Cayley's formula: the number of labeled trees on n vertices is n^{n−2}.

Exercise 6.20 (Critical regime: tree components). Let Gn ∼ Gn,pn where pn = n1 .

(i) Let γn,k be the expected number of isolated tree components of size k in Gn .
Justify the formula
k
n k−2 1 k−1 1 k(n−k)+(2)−(k−1)
     
γn,k = k 1− .
k n n

[Hint: We did a related calculation in Section 2.3.2.]

(ii) Show that, if k = ω(1) and k = o(n3/4 ),

k −5/2 k3
 
γn,k ∼n √ exp − 2 .
2π 6n
CHAPTER 6. BRANCHING PROCESSES 500

(iii) Conclude that for 0 < δ < 1 the expectation of U , the number of isolated
tree components of size in [(δn)2/3 , n2/3 ], is Ω(δ −1 ) as δ → 0.

(iv) For 1 ≤ k1 ≤ k2 ≤ n − k1 , let σn,k1 ,k2 be the expected number of pairs of


isolated tree components where the first one has size k1 and the second one
has size k2 . Justify the formula
k1
n k1 −2 1 k1 −1 1 k1 (n−k1 )+( 2 )−(k1 −1)
     
σn,k1 ,k2 = k 1−
k1 1 n n
k2
1 k2 (n−(k1 +k2 ))+( 2 )−(k2 −1)
   k2 −1  
n − k1 k2 −2 1
× k2 1− ,
k2 n n

and show that


σn,k1 ,k2 ≤ γn,k1 γn,k2 .
[Hint: You may need to prove that, for $0 < a \leq 1 \leq b$, it holds that $1 - ab \leq (1 - a)^b$.]

(v) Prove that Var[U ] = O(E[U ]). [Hint: Use (2.1.6), (iv), and (ii).]
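The exact formula in (i) and the asymptotics in (ii) can be compared numerically. The sketch below (Python; names illustrative) evaluates both in log scale via `lgamma`, for the sample values n = 10^6 and k = 10^3, which fall in the regime k = ω(1), k = o(n^{3/4}):

```python
from math import lgamma, log, log1p, pi

def log_gamma_nk(n, k):
    # log of the exact expression in part (i)
    log_binom = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    expo = k * (n - k) + k * (k - 1) // 2 - (k - 1)
    return log_binom + (k - 2) * log(k) - (k - 1) * log(n) + expo * log1p(-1.0 / n)

def log_asymptotic(n, k):
    # log of the asymptotic expression in part (ii)
    return log(n) - 2.5 * log(k) - 0.5 * log(2 * pi) - k**3 / (6.0 * n**2)

n, k = 10**6, 10**3
print(log_gamma_nk(n, k), log_asymptotic(n, k))  # the two logs nearly agree
```

The two values differ by O(k/n), consistent with the claimed equivalence.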

Exercise 6.21 (Critical regime: overshoot bound). The goal of this exercise is to
prove Lemma 6.4.17. We use the notation of Section 6.4.4.

(i) Let W, Z ∼ Bin(n, 1/n) and 0 ≤ r ≤ n. Show that W − r conditioned on


W ≥ r is stochastically dominated by Z. [Hint: Use the representation of
W as a sum of indicators. Thinking of the partial sums as a Markov chain,
consider the first time it reaches r.]

(ii) Show that $M_{\tau_h'} - h$ conditioned on $M_{\tau_h'} \geq h$ is stochastically dominated by Z from (i). [Hint: By the tower property, it suffices to show that
$$\mathbb{P}[M_{\tau_h'} - h \geq z \mid \tau_h' = \ell,\, M_{\ell-1} = h - r,\, M_\ell \geq h] \leq \mathbb{P}[Z \geq z],$$
for the relevant $\ell, r, z$.]

(iii) Use (ii) to prove Lemma 6.4.17.
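The domination claimed in (i) can be verified exactly on a small instance. The sketch below (Python; the parameters n = 50, r = 3 are an illustrative choice) computes the binomial pmf and compares the conditional tail of W − r with the tail of Z:

```python
from math import comb

n, r = 50, 3
# pmf of Bin(n, 1/n)
pmf = [comb(n, w) * (1 / n)**w * (1 - 1 / n)**(n - w) for w in range(n + 1)]

def tail(t):
    # P[V >= t] for V ~ Bin(n, 1/n)
    return sum(pmf[max(t, 0):])

for z in range(10):
    lhs = tail(r + z) / tail(r)  # P[W - r >= z | W >= r]
    rhs = tail(z)                # P[Z >= z]
    assert lhs <= rhs + 1e-12
print("stochastic domination holds for n = 50, r = 3")
```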



Bibliographic remarks
Section 6.1 See [Dur10, Section 5.3.4] for a quick introduction to branching pro-
cesses. A more detailed overview relating to its use in discrete probability can
be found in [vdH17, Chapter 3]. A classical reference on branching processes
is [AN04]. The Kesten-Stigum Theorem is due to Kesten and Stigum [KS66b].
Our proof of a weaker version with the second moment condition follows [Dur10,
Example 5.4.3]. Section 6.1.4 is based loosely on [AN04, Chapter V]. A proof
of Theorem 6.1.18 can be found in [Har63]. A good reference for the Perron-
Frobenius Theorem (Theorem 6.1.17 as well as more general versions) is [HJ13,
Chapter 8]. The central limit theorem for ρ ≥ λ2 referred to at the end of Sec-
tion 6.1.4 is due to Kesten and Stigum [KS66a, KS67]. The critical percolation
threshold for percolation on Galton-Watson trees is due to R. Lyons [Lyo90].

Section 6.2 The exploration process in Section 6.2.1 dates back to [ML86] and
[Kar90]. The hitting-time theorem (Theorem 6.2.5) in the case ` = 1 was first
proved in [Ott49]. For alternative proofs, see for example [vdHK08] or [Wen75].
Spitzer’s combinatorial lemma (Lemma 6.2.8) is from [Spi56]. See also [Fel71,
Section XII.6]. The presentation in Section 6.2.4 follows [vdH10]. See also [Dur85].

Section 6.3 Section 6.3.1 follows [Dev98, Section 2.1] from the excellent vol-
ume [HMRAR98]. Section 6.3.2 is partly a simplified version of [BCMR06]. Fur-
ther applications in phylogenetics, specifically to the sample complexity of phy-
logeny inference algorithms, can be found in, for example, [Mos04, Mos03, Roc10,
DMR11, RS17]. The reconstruction problem also has applications in community
detection [MNS15b]. See [Abb18] for a survey.

Section 6.4 The phase transition of the Erdős-Rényi graph model was first studied
in [ER60]. For much more, see for example [vdH17, Chapter 4], [JLR11, Chapter
5] and [Bol01, Chapter 6]. In particular a central limit theorem for the giant com-
ponent, proved by several authors including Martin-Löf [ML98], Pittel [Pit90], and
Barraez, Boucheron, and de la Vega [BBFdlV00], is established in [vdH17, Section
4.5]. Section 6.4.4 is based on [NP10]. See also [Per09, Sections 2 and 3]. Much
more is known about the critical regime; see, e.g., [Ald97, Bol84, Lu90, LuPW94].
Section 6.4.5 is based partly on [Dur06, Section 6.5]. For a lot more on random
walk on random graphs (not just Erdős-Rényi), see [Dur06, Chapter 6]. For more
on the spectral properties of random graphs, see [CL06].
Appendix A

Useful combinatorial formulas

Recall the following facts about factorials and binomial coefficients:

$$\frac{n^n}{e^{n-1}} \leq n! \leq \frac{n^{n+1}}{e^{n-1}},$$
$$\frac{n^k}{k^k} \leq \binom{n}{k} \leq \frac{e^k n^k}{k^k},$$
$$(x+y)^n = \sum_{k=0}^n \binom{n}{k} x^k y^{n-k},$$
$$\sum_{k=0}^d \binom{n}{k} \leq \left(\frac{en}{d}\right)^d,$$
$$n! \sim \sqrt{2\pi n}\, \left(\frac{n}{e}\right)^n,$$
$$\binom{2n}{n} = (1 + o(1))\, \frac{4^n}{\sqrt{\pi n}},$$
and
$$\log \binom{n}{k} = (1 + o(1))\, n H(k/n),$$
where $H(p) := -p \log p - (1-p)\log(1-p)$. The third one is the binomial
theorem. The fifth one is Stirling's formula.
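These bounds are easy to sanity-check numerically. The sketch below (Python; the values n = 30, k = 8, d = 5 are illustrative) verifies the inequalities and prints the ratios behind the two asymptotic statements:

```python
from math import comb, factorial, log, pi, sqrt, e

n, k, d = 30, 8, 5
# factorial bounds: n^n/e^{n-1} <= n! <= n^{n+1}/e^{n-1}
assert n**n / e**(n - 1) <= factorial(n) <= n**(n + 1) / e**(n - 1)
# binomial bounds: (n/k)^k <= C(n,k) <= (en/k)^k
assert (n / k)**k <= comb(n, k) <= (e * n / k)**k
# bound on the partial sum of binomial coefficients
assert sum(comb(n, i) for i in range(d + 1)) <= (e * n / d)**d
# Stirling's formula: ratio tends to 1
stirling = sqrt(2 * pi * n) * (n / e)**n
print(factorial(n) / stirling)
# entropy approximation of log C(n,k): ratio tends to 1 (slowly)
H = lambda p: -p * log(p) - (1 - p) * log(1 - p)
print(log(comb(n, k)) / (n * H(k / n)))
```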

Version: December 20, 2023


Modern Discrete Probability: An Essential Toolkit
Copyright © 2023 Sébastien Roch

Appendix B

Measure-theoretic foundations

This appendix contains relevant background on measure-theoretic probability. We


follow closely the highly recommended [Wil91]. Missing proofs (and a lot more
details and examples) can be found there. Another excellent textbook on this topic
is [Dur10].

B.1 Probability spaces


Let S be a set. In general it turns out that we cannot assign a probability to ev-
ery subset of S. Here we discuss “well-behaved” collections of subsets. First an
algebra on S is a collection of subsets stable under finitely many set operations.

Definition B.1.1 (Algebra on S). A collection Σ0 of subsets of S is an algebra on


S if the following conditions hold:

(i) S ∈ Σ0 ;

(ii) F ∈ Σ0 implies F c ∈ Σ0 ;

(iii) F, G ∈ Σ0 implies F ∪ G ∈ Σ0 .

That, of course, implies that the empty set as well as all pairwise intersections
are also in Σ0 . The collection Σ0 is an actual algebra (i.e., a vector space with a
bilinear product) with the symmetric difference as its “sum,” the intersection as its
“product” and the underlying field being the field with two elements.
APPENDIX B. MEASURE-THEORETIC FOUNDATIONS 504

Example B.1.2. On R, sets of the form


$$\bigcup_{i=1}^k (a_i, b_i]$$
where the union is disjoint, with $k < +\infty$ and $-\infty \leq a_i \leq b_i \leq +\infty$, form an
algebra. J
Finite set operations are not enough for our purposes. For instance, we want to
be able to take limits. A σ-algebra is stable under countably many set operations.
Definition B.1.3 (σ-algebra on S). A collection Σ of subsets of S is a σ-algebra on
S (or σ-field on S) if
(i) S ∈ Σ;
(ii) F ∈ Σ implies F c ∈ Σ;
(iii) Fn ∈ Σ, ∀n implies ∪n Fn ∈ Σ.
Example B.1.4. $2^S$ is a trivial example. J
To give a nontrivial example, we need the following definition. We begin with
a lemma.
Lemma B.1.5 (Intersection of σ-algebras). Let Fi , i ∈ I, be σ-algebras on S
where I is arbitrary. Then ∩i Fi is a σ-algebra.
Proof. We prove only one of the conditions. The other ones are similar. Suppose
A ∈ Fi for all i. Then Ac is in Fi for all i since each Fi is itself a σ-algebra.

Definition B.1.6 (σ-algebra generated by C). Let C be a collection of subsets of S.


Then we let σ(C) be the smallest σ-algebra containing C, defined as the intersection
of all such σ-algebras (including in particular 2S ).
Example B.1.7. The smallest σ-algebra containing all open sets in R, denoted
B(R), is called the Borel σ-algebra. This is a non-trivial σ-algebra in the sense
that it can be proved that there exist subsets of R that are not in B, but that any
“reasonable” set is in B. In particular, it contains the algebra in Example B.1.2. J
Example B.1.8. The σ-algebra generated by the algebra in Example B.1.2 is B(R).
This follows from the fact that all open sets of R can be written as a countable union
of open intervals. (Indeed, for x ∈ O an open set, let Ix be the largest open interval
contained in O and containing x. If Ix ∩ Iy 6= ∅ then Ix = Iy by maximality (i.e.,
take the union). Then O = ∪x Ix and there are only countably many disjoint ones
because each one contains a rational.) J

We now define measures.

Definition B.1.9 (Additivity and σ-additivity). A non-negative set function on an


algebra Σ0
µ0 : Σ0 → [0, +∞],
is additive if

(i) µ0 (∅) = 0;

(ii) F, G ∈ Σ0 with F ∩ G = ∅ implies µ0 (F ∪ G) = µ0 (F ) + µ0 (G).

Moreover µ0 is said to be σ-additive if condition (ii) is true for any countable
collection of disjoint sets whose union is in Σ0 , that is, if Fn ∈ Σ0 , n ≥ 0, all
pairwise disjoint with ∪n Fn ∈ Σ0 , then $\mu_0(\cup_n F_n) = \sum_n \mu_0(F_n)$.

Example B.1.10. For the algebra in Example B.1.2, the set function
$$\lambda_0\left(\bigcup_{i=1}^k (a_i, b_i]\right) = \sum_{i=1}^k (b_i - a_i)$$
is additive. (In fact, it is also σ-additive. We will show this later.) J

Definition B.1.11 (Measure space). Let Σ be a σ-algebra on S. Then (S, Σ) is a


measurable space. A σ-additive function µ on Σ is called a measure and (S, Σ, µ)
is called a measure space.

Definition B.1.12 (Probability space). If (Ω, F, P) is a measure space with P(Ω) = 1 then P is called a probability measure and (Ω, F, P) is called a probability space (or probability triple).

The sets in F are referred to as events.


To define a measure on B(R) we need the following tools from abstract mea-
sure theory.

Theorem B.1.13 (Caratheodory’s extension theorem). Let Σ0 be an algebra on S


and let Σ = σ(Σ0 ). If µ0 is σ-additive on Σ0 then there exists a measure µ on Σ
that agrees with µ0 on Σ0 .

If in addition µ0 is finite, the next lemma implies that the extension is unique.

Lemma B.1.14 (Uniqueness of extensions). Let I be a π-system on S, that is, a


family of subsets closed under finite intersections, and let Σ = σ(I). If µ1 , µ2 are
finite measures on (S, Σ) that agree on I, then they agree on Σ.

Example B.1.15. The sets (−∞, x] for x ∈ R form a π-system generating B(R).
That is, B(R) is the smallest σ-algebra containing that π-system. J

Finally we can define Lebesgue measure. We start with (0, 1] and extend to R
in the obvious way. We need the following lemma.

Lemma B.1.16 (σ-additivity of λ0 ). Let λ0 be the set function defined above in


Example B.1.10, restricted to (0, 1]. Then λ0 is σ-additive.

Definition B.1.17 (Lebesgue measure on unit interval). The unique extension of λ0


(see Example B.1.10) to (0, 1] is denoted λ and is called Lebesgue measure.

B.2 Random variables


Let (S, Σ, µ) be a measure space and let B = B(R).

Definition B.2.1 (Measurable function). Suppose h : S → R and define

h−1 (A) = {s ∈ S : h(s) ∈ A}.

The function h is Σ-measurable if h−1 (B) ∈ Σ for all B ∈ B. We denote by


mΣ (resp., (mΣ)+ , bΣ) the Σ-measurable functions (resp., that are non-negative,
bounded).

In the probabilistic case:

Definition B.2.2. A random variable is a measurable function on a probability space (Ω, F, P).

The behavior of a random variable is characterized by its distribution function.

Definition B.2.3 (Distribution function). Let X be a random variable on a proba-


bility space (Ω, F, P). The law of X is

LX = P ◦ X −1 ,

which is a probability measure on (R, B). By Lemma B.1.14, LX is determined by
the distribution function (DF) of X,
$$F_X(x) = \mathbb{P}[X \leq x], \qquad x \in \mathbb{R}.$$

Example B.2.4. The distribution function of a constant random variable is a jump


of size 1 at the value it takes almost surely. The distribution function of a random
variable with law equal to Lebesgue measure on (0, 1] is

$$F_X(x) = \begin{cases} 0, & x \leq 0,\\ x, & x \in (0,1],\\ 1, & x > 1. \end{cases}$$
We refer to such a random variable as a uniform random variable over (0, 1]. J
Distribution functions are characterized by a few simple properties.
Proposition B.2.5. Suppose F = FX is the distribution function of a random
variable X on (Ω, F, P). Then the following hold:
(i) F is non-decreasing;
(ii) limx→+∞ F (x) = 1, limx→−∞ F (x) = 0;
(iii) F is right-continuous.
Proof. The first property follows from the monotonicity of probability measure
(which itself follows immediately from σ-additivity).
For the second property, note that the limit exists by the first property. The
value of the limit follows from the following important lemma.
Lemma B.2.6 (Monotone convergence properties of measures). Let (S, Σ, µ) be a
measure space.
(i) If Fn ∈ Σ, n ≥ 1, with Fn ↑ F , then µ(Fn ) ↑ µ(F ).
(ii) If Gn ∈ Σ, n ≥ 1, with Gn ↓ G and µ(Gk ) < +∞ for some k, then
µ(Gn ) ↓ µ(G).
Proof. Clearly F = ∪n Fn ∈ Σ. For n ≥ 1, write Hn = Fn \Fn−1 (with F0 = ∅).
Then by disjointness
$$\mu(F_n) = \sum_{k \leq n} \mu(H_k) \uparrow \sum_{k < +\infty} \mu(H_k) = \mu(F).$$

The second statement is similar.

Similarly, for the third property, by Lemma B.2.6 again

P[X ≤ xn ] ↓ P[X ≤ x],

if xn ↓ x.

It turns out that the properties above characterize distribution functions in the
following sense.

Theorem B.2.7 (Skorokhod representation). Let F satisfy the three properties


above in Proposition B.2.5. Then there is a random variable X on

(Ω, F, P) = ((0, 1], B(0, 1], λ),

with distribution function F . The law of X is called the Lebesgue-Stieltjes measure


associated to F .

The result says that all real random variables can be generated from uniform
random variables over (0, 1].

Proof. Assume first that F is continuous and strictly increasing. Define X(ω) =
F −1 (ω) for all ω ∈ Ω. Then, ∀x ∈ R,

P[X ≤ x] = P[{ω : F −1 (ω) ≤ x}] = P[{ω : ω ≤ F (x)}] = F (x).

In general, let
X(ω) = inf{x : F (x) ≥ ω}.
It suffices to prove that

X(ω) ≤ x ⇐⇒ ω ≤ F (x).

The ⇐ direction is clear by definition of X. On the other hand, by the right-


continuity of F , we have that ω ≤ F (X(ω)). Therefore, by monotonicity of F ,

X(ω) ≤ x ⇒ ω ≤ F (X(ω)) ≤ F (x).

That proves the claim.
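The generalized inverse in the proof is precisely the inverse-transform sampling recipe. A minimal sketch, assuming for illustration that F is the distribution function of an Exponential(1) (an assumption of the example, not part of the text), recovers that law from uniform draws via bisection, using only that F is non-decreasing:

```python
import random
from math import exp

def F(x):
    # DF of an Exponential(1), chosen for illustration
    return 1.0 - exp(-x) if x > 0 else 0.0

def generalized_inverse(omega, lo=0.0, hi=100.0, iters=50):
    # bisection for X(omega) = inf{x : F(x) >= omega};
    # uses only that F is non-decreasing and right-continuous
    for _ in range(iters):
        mid = (lo + hi) / 2
        if F(mid) >= omega:
            hi = mid
        else:
            lo = mid
    return hi

random.seed(0)
samples = [generalized_inverse(random.random()) for _ in range(20_000)]
print(sum(samples) / len(samples))  # the mean of an Exponential(1) is 1
```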

Turning measurability on its head, we get the following important definition.

Definition B.2.8. Let (Ω, F, P) be a probability space. Let Yγ , γ ∈ Γ, be a collec-


tion of maps from Ω to R. We let

σ(Yγ , γ ∈ Γ),

be the smallest σ-algebra on which the Yγ ’s are measurable.

In a sense, the above σ-algebra corresponds to “the partial information avail-


able when the Yγ s are observed.”

Example B.2.9. Suppose we flip two unbiased coins and let X be the number of
heads observed. Then, denoting heads by H and tails by T,

σ(X) = σ({{HH}, {HT, TH}, {TT}}),

which is coarser than the full σ-algebra 2Ω . J

Note that h−1 preserves all set operations. For example, h−1 (A ∪ B) =
h−1 (A) ∪ h−1 (B). This gives the following important lemma.

Lemma B.2.10 (Sufficient condition for measurability). Suppose C ⊆ B with


σ(C) = B. Then h−1 : C → Σ implies h ∈ mΣ. That is, it suffices to check
measurability on a collection generating B.

Proof. Let E be the sets such that h−1 (B) ∈ Σ. By the observation before the
statement, E is a σ-algebra. But C ⊆ E which implies σ(C) ⊆ E by minimality.

As a consequence we get the following properties of measurable functions.

Proposition B.2.11 (Properties of measurable functions). Let h, hn , n ≥ 1, be in


mΣ and f ∈ mB.

(i) f ◦ h ∈ mΣ.

(ii) If S is a topological space and h is continuous, then h is B(S)-measurable,


where B(S) is generated by the open sets of S.

(iii) The function g : S → R is in mΣ if for all c ∈ R,

{g ≤ c} ∈ Σ.

(iv) ∀α ∈ R, h1 + h2 , h1 h2 , αh ∈ mΣ.

(v) inf hn , sup hn , lim inf hn , lim sup hn are in mΣ.

(vi) The set $\{s : \lim h_n(s) \text{ exists in } \mathbb{R}\}$ is measurable.

Proof. We sketch the proof of a few of them.


(ii) This follows from Lemma B.2.10 by taking C as the open sets of R.
(iii) Similarly, take C to be the sets of the form (−∞, c].

(iv) This follows from (iii). For example, writing the event {h1 + h2 > c} as {h1 > c − h2}, note that
$$\{h_1 + h_2 > c\} = \bigcup_{q \in \mathbb{Q}} \left[\{h_1 > q\} \cap \{q > c - h_2\}\right],$$
which is a countable union of measurable sets by assumption.


(v) Note that
{sup hn ≤ c} = ∩n {hn ≤ c}.
Further, note that lim inf is the sup of an inf.

B.3 Independence
Let (Ω, F, P) be a probability space.

Definition B.3.1 (Independence). Sub-σ-algebras G1 , G2 , . . . of F are independent


if for all Gi ∈ Gi , i ≥ 1, and distinct i1 , . . . , in we have
$$\mathbb{P}[G_{i_1} \cap \cdots \cap G_{i_n}] = \prod_{j=1}^n \mathbb{P}[G_{i_j}].$$

Specializing to events and random variables:

Definition B.3.2 (Independent random variables). Random variables X1 , X2 , . . .


are independent if the σ-algebras σ(X1 ), σ(X2 ), . . . are independent.

Definition B.3.3 (Independent events). Events E1 , E2 , . . . are independent if the


σ-algebras
Ei = {∅, Ei , Eic , Ω}, i ≥ 1,
are independent.

Recall the more familiar definitions.

Theorem B.3.4 (Independent random variables: familiar definition). Random vari-


ables X, Y are independent if and only if for all x, y ∈ R

P[X ≤ x, Y ≤ y] = P[X ≤ x] P[Y ≤ y].

Theorem B.3.5 (Independent events: familiar definition). Events E1 , E2 are inde-


pendent if and only if
P[E1 ∩ E2 ] = P[E1 ] P[E2 ].

The proofs of these characterizations follow immediately from the following


lemma.
Lemma B.3.6 (Independence and π-systems). Suppose that G and H are sub-σ-
algebras and that I and J are π-systems such that
σ(I) = G, σ(J ) = H.
Then G and H are independent if and only if I and J are as well, that is,
P[I ∩ J] = P[I] P[J], ∀I ∈ I, J ∈ J .
Proof. Suppose I and J are independent. For fixed I ∈ I, the measures P[I ∩ H]
and P[I] P[H] are equal for H ∈ J and have total mass P[I] < +∞. By the
Uniqueness of Extensions Lemma (Lemma B.1.14) the above measures agree on
σ(J ) = H.
Repeat the argument. Fix H ∈ H. Then the measures P[G ∩ H] and P[G] P[H]
agree on I and have total mass P[H] < +∞. Therefore they must agree on σ(I) =
G.
We give a standard construction of an infinite sequence of independent random
variables with prescribed distributions.
Let (Ω, F, P) = ((0, 1], B(0, 1], λ) and for ω ∈ Ω consider the binary expan-
sion
ω = 0.ω1 ω2 . . . .
(For dyadic rationals, use the all-1 ending and note that the dyadic rationals have
measure 0 by countability.) This construction produces a sequence of independent
so-called Bernoulli trials. That is, under λ, each bit is Bernoulli(1/2) and any finite
collection is independent.
To get two independent uniform random variables, consider the following con-
struction:
U1 = 0.ω1 ω3 ω5 . . .
U2 = 0.ω2 ω4 ω6 . . .
Let A1 (resp. A2 ) be the π-system consisting of all finite intersections of events of
the form {ωi ∈ Hi } for odd i (resp. even i). By Lemma B.3.6, the σ-fields σ(A1 )
and σ(A2 ) are independent.
More generally, let
V1 = 0.ω1 ω3 ω6 . . .
V2 = 0.ω2 ω5 ω9 . . .
V3 = 0.ω4 ω8 ω13 . . .
. . .

that is, fill up the array diagonally. By the argument above, the Vi ’s are independent
and Bernoulli(1/2).
Finally let µn , n ≥ 1, be a sequence of probability measures with distribution
functions Fn , n ≥ 1. For each n, define

Xn (ω) = inf{x : Fn (x) ≥ Vn (ω)}.

By the (proof of the) Skorokhod Representation (Theorem B.2.7), Xn has distribution function Fn .
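The odd/even bit-splitting construction can be simulated directly. In the sketch below (Python; illustrative, with the expansion truncated at 52 bits), the two extracted variables have the right means, and E[U1 U2] ≈ E[U1]E[U2], consistent with independence:

```python
import random

def split_bits(omega, bits=52):
    # route odd-indexed binary digits of omega to u1, even-indexed to u2
    u1 = u2 = 0.0
    w1 = w2 = 0.5
    for i in range(1, bits + 1):
        omega *= 2
        b = int(omega)   # i-th binary digit of omega
        omega -= b
        if i % 2:
            u1 += b * w1
            w1 /= 2
        else:
            u2 += b * w2
            w2 /= 2
    return u1, u2

random.seed(1)
pairs = [split_bits(random.random()) for _ in range(100_000)]
m1 = sum(u for u, _ in pairs) / len(pairs)
m2 = sum(v for _, v in pairs) / len(pairs)
prod = sum(u * v for u, v in pairs) / len(pairs)
print(m1, m2, prod)  # means near 1/2; E[U1 U2] near 1/4 under independence
```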

Definition B.3.7 (I.i.d. random variables). A sequence of independent random variables (Xn ) as above is independent and identically distributed (i.i.d.) if Fn = F for all n, for some common distribution function F.

Alternatively, we have the following more general result.

Theorem B.3.8 (Kolmogorov’s extension theorem). Suppose we are given proba-


bility measures µn on (Rn , B(Rn )) that are consistent, that is,

µn+1 ((a1 , b1 ] × · · · × (an , bn ] × R) = µn ((a1 , b1 ] × · · · × (an , bn ]).

Then there exists a unique probability measure $\mathbb{P}$ on $(\mathbb{R}^{\mathbb{N}}, \mathcal{R}^{\mathbb{N}})$ with
$$\mathbb{P}[\omega : \omega_i \in (a_i, b_i],\ 1 \leq i \leq n] = \mu_n((a_1, b_1] \times \cdots \times (a_n, b_n]).$$
Here $\mathcal{R}^{\mathbb{N}}$ is the product σ-algebra, that is, the σ-algebra generated by finite-dimensional rectangles.

Next, we discuss a first non-trivial result about independent sequences.

Definition B.3.9 (Tail σ-algebra). Let X1 , X2 , . . . be random variables on a prob-


ability space (Ω, F, P). Define
$$\mathcal{T} = \bigcap_{n \geq 1} \mathcal{T}_n,$$

where
Tn = σ(Xn+1 , Xn+2 , . . .).
As an intersection of σ-algebras, T is a σ-algebra. It is called the tail σ-algebra of
the sequence (Xn ).

Intuitively, an event is in the tail if changing a finite number of values does not
affect its occurrence.

Example B.3.10. If $S_n = \sum_{k \leq n} X_k$, then
$$\{\lim_n S_n \text{ exists}\} \in \mathcal{T}, \qquad \{\limsup_n n^{-1} S_n > 0\} \in \mathcal{T},$$
but
$$\{\limsup_n S_n > 0\} \notin \mathcal{T}.$$
J

Theorem B.3.11 (Kolmogorov’s 0-1 law). Let (Xn ) be a sequence of independent


random variables with tail σ-algebra T . Then T is P-trivial, that is, for all A ∈ T
we have either P[A] = 0 or 1.

Proof. Let Xn = σ(X1 , . . . , Xn ). Note that Xn and Tn are independent. More-


over, since T ⊆ Tn we have that Xn is independent of T . Now let

X∞ = σ(Xn , n ≥ 1).

Note that
$$\mathcal{K}_\infty = \bigcup_{n \geq 1} \mathcal{X}_n,$$

is a π-system generating X∞ . Therefore, by Lemma B.3.6, X∞ is independent of


T . But T ⊆ X∞ and therefore T is independent of itself! Hence if A ∈ T ,

P[A] = P[A ∩ A] = P[A]2 ,

which can occur only if P[A] ∈ {0, 1}.

B.4 Expectation
Let (S, Σ, µ) be a measure space. We denote by 1A the indicator of a set A, that is,
$$\mathbf{1}_A(s) = \begin{cases} 1, & \text{if } s \in A,\\ 0, & \text{otherwise.} \end{cases}$$

Definition B.4.1 (Simple functions). A simple function is a function of the form


$$f = \sum_{k=1}^m a_k \mathbf{1}_{A_k},$$

where ak ∈ [0, +∞] and Ak ∈ Σ for all k. We denote the set of all such functions
by SF+ . We define the integral of f by
$$\mu(f) := \sum_{k=1}^m a_k\, \mu(A_k) \leq +\infty.$$

We also write µf = µ(f ).


The following is left as a (somewhat tedious but) immediate exercise.
Proposition B.4.2. Let f, g ∈ SF+ .
(i) If µ(f 6= g) = 0, then µf = µg. [Hint: Rewrite f and g over the same
disjoint sets.]

(ii) For all c ≥ 0, f + g, cf ∈ SF+ and

µ(f + g) = µf + µg, µ(cf ) = cµf.

[Hint: This one is obvious by definition.]

(iii) If f ≤ g then µf ≤ µg. [Hint: Show that g − f ∈ SF+ and use linearity.]
The main definition and theorem of integration theory follows.
Definition B.4.3 (Nonnegative functions). Let f ∈ (mΣ)+ . Then the integral of f
is defined by
µ(f ) = sup{µ(h) : h ∈ SF+ , h ≤ f }.
Again we also write µf = µ(f ).
Theorem B.4.4 (Monotone convergence theorem). If fn , f ∈ (mΣ)+ , n ≥ 1, with
fn ↑ f , then
µfn ↑ µf.
Many theorems in integration follow from the monotone convergence theorem.
In that context, the following approximation is useful.
Definition B.4.5 (Staircase function). For f ∈ (mΣ)+ and r ≥ 1, the r-th staircase
function α(r) is

0,
 if x = 0,
(r)
α (x) = (i − 1)2−r , if (i − 1)2−r < x ≤ i2−r ≤ r,

r, if x > r,

We let f (r) = α(r) (f ). Note that f (r) ∈ SF+ and f (r) ↑ f as r → +∞.
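For a concrete illustration of the approximation f^(r) ↑ f, take f(x) = x on (0, 1] with Lebesgue measure: the staircase integrals increase to µ(f) = 1/2. The sketch below (Python; the Riemann-sum quadrature is an assumption of the illustration, not part of the text) computes them:

```python
from math import ceil

def staircase(x, r):
    # alpha^{(r)} from Definition B.4.5
    if x == 0:
        return 0.0
    if x > r:
        return float(r)
    return (ceil(x * 2**r) - 1) / 2**r

def mu_staircase(r, grid=100_000):
    # Riemann-sum approximation of the Lebesgue integral of f^{(r)} on (0,1]
    # for f(x) = x; the exact value is (1 - 2^{-r})/2
    return sum(staircase((j + 1) / grid, r) for j in range(grid)) / grid

for r in (1, 2, 4, 8):
    print(r, mu_staircase(r))  # increases toward mu(f) = 1/2
```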

Using the previous definition, we get for example the following properties.
Proposition B.4.6. Let f, g ∈ (mΣ)+ .
(i) If µ(f 6= g) = 0, then µ(f ) = µ(g).
(ii) For all c ≥ 0, f + g, cf ∈ (mΣ)+ and
µ(f + g) = µf + µg, µ(cf ) = cµf.

(iii) If f ≤ g then µf ≤ µg.


For a function f , let f + and f − be the positive and negative parts of f , that is,
f + (s) = f (s) ∨ 0, f − (s) = (−f (s)) ∨ 0.
Note that |f | = f + + f − . Finally we define
µ(f ) := µ(f + ) − µ(f − ),
provided µ(f + ) + µ(f − ) < +∞, in which case we write f ∈ L1 (S, Σ, µ). Propo-
sition B.4.6 can be generalized naturally to this definition. Moreover we have the
following.
Theorem B.4.7 (Dominated convergence theorem). If fn , f ∈ mΣ, n ≥ 1, with
fn (s) → f (s) for all s ∈ S, and there is a nonnegative function g ∈ L1 (S, Σ, µ)
such that |fn | ≤ g, then
µ(|fn − f |) → 0,
and in particular
µfn → µf,
as n → ∞.
More generally, for 0 < p < +∞, the space Lp (S, Σ, µ) contains all functions
f : S → R such that kf kp < +∞, where

kf kp := µ(|f |p )1/p ,
up to equality almost everywhere. We state the following results without proof.
Theorem B.4.8 (Hölder’s inequality). Let 1 < p, q < +∞ such that p−1 +
q −1 = 1. Then, for any f ∈ Lp (S, Σ, µ) and g ∈ Lq (S, Σ, µ), it holds that
f g ∈ L1 (S, Σ, µ) and further
$$\|fg\|_1 \leq \|f\|_p\, \|g\|_q.$$
The case p = q = 2 is known as the Cauchy-Schwarz inequality (or Schwarz inequality).
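A quick numerical sanity check of Hölder's inequality on a finite measure space (counting measure on 100 points; all values illustrative):

```python
import random

random.seed(2)
p_exp, q_exp = 3.0, 1.5  # conjugate exponents: 1/3 + 2/3 = 1
f = [random.uniform(-1, 1) for _ in range(100)]
g = [random.uniform(-1, 1) for _ in range(100)]
lhs = sum(abs(a * b) for a, b in zip(f, g))            # ||fg||_1
rhs = ((sum(abs(a)**p_exp for a in f))**(1 / p_exp)
       * (sum(abs(b)**q_exp for b in g))**(1 / q_exp))  # ||f||_p ||g||_q
print(lhs <= rhs)  # True
```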

Theorem B.4.9 (Minkowski’s inequality). Let 1 < p < +∞. Then, for any f, g ∈
Lp (S, Σ, µ), it holds that f + g ∈ Lp (S, Σ, µ) and further

kf + gkp ≤ kf kp + kgkp .

Theorem B.4.10 (Lp completeness). Let 1 ≤ p < +∞. If (fn )n in Lp (S, Σ, µ) is


Cauchy, that is,
$$\sup_{n,m \geq k} \|f_n - f_m\|_p \to 0,$$

as k → +∞, then there exists f ∈ Lp (S, Σ, µ) such that

kfn − f kp → 0,

as n → +∞.
We can now define the expectation. Let (Ω, F, P) be a probability space.
Definition B.4.11 (Expectation). If X ≥ 0 is a random variable then we define the
expectation of X, denoted by E[X], as the integral of X over P. More generally
(i.e., not assuming non-negativity), if

E|X| = E[X + ] + E[X − ] < +∞,

we let
E[X] = E[X + ] − E[X − ].
We denote the set of all such integrable random variables (up to equality almost surely) by L1 (Ω, F, P).
The properties of the integral for nonnegative functions (see Proposition B.4.6)
extend to the expectation.
Proposition B.4.12. Let X, X1 , X2 be random variables in L1 (Ω, F, P).
(LIN) If a1 , a2 ∈ R, then E[a1 X1 + a2 X2 ] = a1 E[X1 ] + a2 E[X2 ].

(POS) If X ≥ 0, then E[X] ≥ 0.


One useful implication of (POS) is that |X| − X ≥ 0 so that E[X] ≤ E|X| and,
by applying the same argument to −X, we have further |E[X]| ≤ E|X|.
The monotone convergence theorem (Theorem B.4.4) implies the following
results. We first need a definition.
Definition B.4.13 (Almost sure convergence). We say that Xn → X almost surely
(a.s.) if
P[Xn → X] = 1.

Proposition B.4.14. Let X, Y, Xn , n ≥ 1, be random variables in L1 (Ω, F, P).


(MON) If 0 ≤ Xn ↑ X, then E[Xn ] ↑ E[X] ≤ +∞.

(FATOU) If Xn ≥ 0, then E[lim inf n Xn ] ≤ lim inf n E[Xn ].

(DOM) If |Xn | ≤ Y , n ≥ 1, with E[Y ] < +∞ and Xn → X a.s., then

E|Xn − X| → 0,

and, hence,
E[Xn ] → E[X].
(Indeed,

|E[Xn ] − E[X]| = |E[Xn − X]|


≤ E|Xn − X|.)

(SCHEFFE) If Xn → X a.s. and E|Xn | → E|X| then

E|Xn − X| → 0.

(BDD) If Xn → X a.s. and |Xn | ≤ K < +∞ for all n then

E|Xn − X| → 0.

Proof. We only prove (FATOU). To use (MON) we write the lim inf as an increasing limit. Letting $Z_k = \inf_{n \geq k} X_n$, we have
$$\liminf_n X_n = \uparrow \lim_k Z_k,$$
so that by (MON)
$$\mathbb{E}[\liminf_n X_n] = \uparrow \lim_k \mathbb{E}[Z_k].$$
For n ≥ k we have $X_n \geq Z_k$, so that $\mathbb{E}[X_n] \geq \mathbb{E}[Z_k]$, hence
$$\mathbb{E}[Z_k] \leq \inf_{n \geq k} \mathbb{E}[X_n].$$
Finally, we get
$$\mathbb{E}[\liminf_n X_n] \leq \uparrow \lim_k \inf_{n \geq k} \mathbb{E}[X_n] = \liminf_n \mathbb{E}[X_n].$$

The following inequality is often useful. We give an example below.



Theorem B.4.15 (Jensen's inequality). Let h : G → R be a convex function on an open interval G such that P[X ∈ G] = 1 and X, h(X) ∈ L1 (Ω, F, P). Then
$$\mathbb{E}[h(X)] \geq h(\mathbb{E}[X]).$$
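As a quick illustration with h(x) = x² (convex), the gap E[h(X)] − h(E[X]) is the variance of X, hence nonnegative; the same holds for the empirical moments (sketch below, illustrative):

```python
import random

random.seed(3)
xs = [random.gauss(0, 1) for _ in range(50_000)]
mean = sum(xs) / len(xs)
second_moment = sum(x * x for x in xs) / len(xs)  # empirical E[h(X)], h(x) = x^2
print(second_moment >= mean * mean)  # the gap is the empirical variance
```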

The Lp norm defined earlier applies to random variables as well. That is, for
p ≥ 1, we let kXkp = E[|X|p ]1/p and denote by Lp (Ω, F, P) the collection of
random variables X (up to almost sure equality) such that kXkp < +∞. Jensen’s
inequality (Theorem B.4.15) implies the following relationship.

Lemma B.4.16 (Monotonicity of norms). For 1 ≤ p ≤ r < +∞, we have kXkp ≤


kXkr .

Proof. For n ≥ 0, let


$X_n = (|X| \wedge n)^p.$
Take h(x) = xr/p which is convex on (0, +∞). Then, by Jensen’s inequality,

(E[Xn ])r/p ≤ E[(Xn )r/p ] = E[(|X| ∧ n)r ] ≤ E[|X|r ].

Take n → ∞ and use (MON).

This latter inequality is useful among other things to argue about the convergence
of expectations. We say that Xn converges to X∞ in Lp if kXn − X∞ kp → 0. By
the previous lemma, convergence on Lr implies convergence in Lp for r ≥ p ≥ 1.
Further we have:

Lemma B.4.17 (Convergence of expectations). Assume Xn , X∞ ∈ L1 . Then

kXn − X∞ k1 → 0,

implies
E[Xn ] → E[X∞ ].

Proof. Note that

|E[Xn ] − E[X∞ ]| ≤ E|Xn − X∞ | → 0.

So, a fortiori, convergence in Lp , p ≥ 1, implies convergence of expectations.


Square integrable random variables have a nice geometry by virtue of forming
a Hilbert space.

Definition B.4.18 (Square integrable variables). Recall that L2 (Ω, F, P) denotes


the set of all square integrable random variables (up to equality almost surely), that is, those X with $\mathbb{E}[X^2] < +\infty$. For X, Y ∈ L2 (Ω, F, P), define the inner product $\langle X, Y\rangle := \mathbb{E}[XY]$. Then the L2 norm is $\|X\|_2 = \sqrt{\langle X, X\rangle}$.
Theorem B.4.19 (Cauchy-Schwarz inequality). If X, Y ∈ L2 (Ω, F, P), then XY ∈
L1 (Ω, F, P) and
$$\mathbb{E}|XY| \leq \sqrt{\mathbb{E}[X^2]\,\mathbb{E}[Y^2]},$$
or put differently
$$|\langle X, Y\rangle| \leq \|X\|_2\, \|Y\|_2.$$
Theorem B.4.20 (Parallelogram law). If X, Y ∈ L2 (Ω, F, P), then
$$\|X+Y\|_2^2 + \|X-Y\|_2^2 = 2\|X\|_2^2 + 2\|Y\|_2^2.$$

B.5 Fubini’s theorem


We now define product measures and state (without proof) Fubini’s Theorem.
Definition B.5.1 (Product σ-algebra). Let (S1 , Σ1 ) and (S2 , Σ2 ) be measurable spaces.
Let S = S1 × S2 be the Cartesian product of S1 and S2 . For i = 1, 2, let
πi : S → Si be the projection on the i-th coordinate, that is,
πi (s1 , s2 ) = si .
The product σ-algebra Σ = Σ1 × Σ2 is defined as
Σ = σ(π1 , π2 ).
In words, it is the smallest σ-algebra that makes coordinate maps measurable. It
is generated by sets of the form
π1−1 (B1 ) = B1 × S2 , π2−1 (B2 ) = S1 × B2 , B1 ∈ Σ1 , B2 ∈ Σ2 .
Theorem B.5.2 (Fubini's Theorem). For F ∈ Σ, let f = 1F and define
$$\mu(F) := \int_{S_1} I_1^f(s_1)\,\mu_1(ds_1) = \int_{S_2} I_2^f(s_2)\,\mu_2(ds_2),$$
where
$$I_1^f(s_1) := \int_{S_2} f(s_1, s_2)\,\mu_2(ds_2) \in b\Sigma_1, \qquad I_2^f(s_2) := \int_{S_1} f(s_1, s_2)\,\mu_1(ds_1) \in b\Sigma_2.$$

(The equality and inclusions above are part of the statement.) The set function
µ is a measure on (S, Σ) called the product measure of µ1 and µ2 and we write
µ = µ1 × µ2 and

(S, Σ, µ) = (S1 , Σ1 , µ1 ) × (S2 , Σ2 , µ2 ).

Moreover µ is the unique measure on (S, Σ) for which

$$\mu(A_1 \times A_2) = \mu_1(A_1)\,\mu_2(A_2), \qquad A_i \in \Sigma_i.$$

If f ∈ (mΣ)+ then
$$\mu(f) = \int_{S_1} I_1^f(s_1)\,\mu_1(ds_1) = \int_{S_2} I_2^f(s_2)\,\mu_2(ds_2),$$

where I1f , I2f are defined as before (i.e., as the sup over bounded functions from
below). The same is valid if f ∈ mΣ and µ(|f |) < +∞.

Some applications of Fubini’s Theorem (Theorem B.5.2) follow. We first recall


the following useful formula.

Theorem B.5.3 (Change-of-variables formula). Let X be a random variable with


law L. If f : R → R is such that either f ≥ 0 or E|f (X)| < +∞ then
$$\mathbb{E}[f(X)] = \int_{\mathbb{R}} f(y)\,\mathcal{L}(dy).$$

Proof. We use the standard machinery.

1. For f = 1B with B ∈ B,
$$\mathbb{E}[\mathbf{1}_B(X)] = \mathcal{L}(B) = \int_{\mathbb{R}} \mathbf{1}_B(y)\,\mathcal{L}(dy).$$

2. If $f = \sum_{k=1}^m a_k \mathbf{1}_{A_k}$ is a simple function, then by (LIN)
$$\mathbb{E}[f(X)] = \sum_{k=1}^m a_k\, \mathbb{E}[\mathbf{1}_{A_k}(X)] = \sum_{k=1}^m a_k \int_{\mathbb{R}} \mathbf{1}_{A_k}(y)\,\mathcal{L}(dy) = \int_{\mathbb{R}} f(y)\,\mathcal{L}(dy).$$

3. Let f ≥ 0 and approximate f by a sequence {fn } of increasing simple


functions. By (MON)
$$\mathbb{E}[f(X)] = \lim_n \mathbb{E}[f_n(X)] = \lim_n \int_{\mathbb{R}} f_n(y)\,\mathcal{L}(dy) = \int_{\mathbb{R}} f(y)\,\mathcal{L}(dy).$$

4. Finally, assume that f is such that E|f (X)| < +∞. Then by (LIN)
$$\mathbb{E}[f(X)] = \mathbb{E}[f^+(X)] - \mathbb{E}[f^-(X)] = \int_{\mathbb{R}} f^+(y)\,\mathcal{L}(dy) - \int_{\mathbb{R}} f^-(y)\,\mathcal{L}(dy) = \int_{\mathbb{R}} f(y)\,\mathcal{L}(dy).$$

Theorem B.5.4. Let X and Y be independent random variables with respective


laws µ and ν. Let f and g be measurable functions such that either f, g ≥ 0 or
E|f (X)|, E|g(Y )| < +∞. Then
E[f (X)g(Y )] = E[f (X)]E[g(Y )].
Proof. From the change-of-variables formula (Theorem B.5.3) and Fubini’s Theo-
rem (Theorem B.5.2), we get
$$\begin{aligned}
\mathbb{E}[f(X)g(Y)] &= \int_{\mathbb{R}^2} f(x)g(y)\,(\mu \times \nu)(dx \times dy)\\
&= \int_{\mathbb{R}} \left(\int_{\mathbb{R}} f(x)g(y)\,\mu(dx)\right) \nu(dy)\\
&= \int_{\mathbb{R}} g(y)\,\mathbb{E}[f(X)]\,\nu(dy)\\
&= \mathbb{E}[f(X)]\,\mathbb{E}[g(Y)].
\end{aligned}$$

Definition B.5.5 (Density). Let X be a random variable with law µ. We say that
X has density fX if for all B ∈ B(R)
$$\mu(B) = \mathbb{P}[X \in B] = \int_B f_X(x)\,\lambda(dx).$$
Theorem B.5.6 (Convolution). Let X and Y be independent random variables
with distribution functions F and G respectively. Then the distribution function,
H, of X + Y is
$$H(z) = \int F(z-y)\,dG(y).$$

This is called the convolution of F and G. Moreover, if X and Y have densities f


and g respectively, then X + Y has density
$$h(z) = \int f(z-y)\,g(y)\,dy.$$

Proof. From Fubini’s Theorem (Theorem B.5.3), denoting the laws of X and Y
by µ and ν respectively,
Z Z
P[X + Y ≤ z] = 1{x+y≤z} µ(dx)ν(dy)
Z
= F (z − y)ν(dy)
Z
= F (z − y)dG(y)
Z Z z 
= f (x − y)dx dG(y)
−∞
Z z Z 
= f (x − y)dG(y) dx
−∞
Z z Z 
= f (x − y)g(y)dy dx.
−∞
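As a concrete check of the density formula: for X, Y independent Uniform(0,1) (an illustrative choice), the convolution h is the triangular density min(z, 2 − z) on (0, 2). The sketch below approximates the integral by a Riemann sum at a few points:

```python
def h(z, grid=100_000):
    # Riemann sum of int f(z - y) g(y) dy with f = g = 1_{(0,1)}
    count = 0
    for j in range(grid):
        y = (j + 0.5) / grid  # midpoints of (0,1)
        if 0.0 < z - y < 1.0:
            count += 1
    return count / grid

for z in (0.5, 1.0, 1.5):
    print(z, h(z))  # triangular density: min(z, 2 - z)
```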

See Exercise 2.1 for a proof of the following standard formula.


Theorem B.5.7 (Moments of nonnegative random variables). For any nonnegative random variable X and positive integer k,
$$\mathbb{E}[X^k] = \int_0^{+\infty} k x^{k-1}\, \mathbb{P}[X > x]\, dx. \tag{B.5.1}$$
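Formula (B.5.1) is easy to test numerically. For X uniform on (0, 1) (an illustrative choice), E[X^k] = 1/(k + 1), while the right-hand side becomes the integral of k x^{k−1}(1 − x) over (0, 1); a midpoint-rule quadrature recovers the same value:

```python
def rhs(k, grid=200_000):
    # midpoint-rule approximation of int_0^1 k x^{k-1} P[X > x] dx,
    # with P[X > x] = 1 - x for X ~ Uniform(0,1)
    total = 0.0
    for j in range(grid):
        x = (j + 0.5) / grid
        total += k * x**(k - 1) * (1 - x)
    return total / grid

for k in (1, 2, 3):
    print(k, rhs(k), 1 / (k + 1))  # E[X^k] = 1/(k+1) for X ~ Uniform(0,1)
```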

B.6 Conditional expectation


Before defining the conditional expectation, we recall some elementary concepts.
For two events A, B, the conditional probability of A given B is defined as
$$\mathbb{P}[A \mid B] = \frac{\mathbb{P}[A \cap B]}{\mathbb{P}[B]},$$
where we assume P[B] > 0.
Now let X and Z be random variables taking values x1 , . . . , xm and z1 , . . . , zn
respectively. The conditional expectation of X given Z = zj is defined as
$$y_j = \mathbb{E}[X \mid Z = z_j] = \sum_i x_i\, \mathbb{P}[X = x_i \mid Z = z_j],$$

where we assume P[Z = zj ] > 0 for all j. As motivation for the general definition,
we make the following observations.

• We can think of the conditional expectation as a random variable Y = E[X | Z] defined as follows

Y (ω) = yj on Gj = {ω : Z(ω) = zj }.

• Then Y is G-measurable where G = σ(Z).

• On sets in G, the expectation of Y agrees with the expectation of X. Indeed, note first that

E[Y; G_j] = y_j P[G_j]

          = Σ_i x_i P[X = x_i | Z = z_j] P[Z = z_j]

          = Σ_i x_i P[X = x_i, Z = z_j]

          = E[X; G_j].

This is also true for all G ∈ G by summation over j.
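These observations can be illustrated on simulated data (the distributions below are our own choice): under the empirical measure of a sample, the within-group averages play the role of the y_j's, and the identity E[Y; G_j] = E[X; G_j] holds exactly.

```python
import numpy as np

rng = np.random.default_rng(1)

# Y = E[X | Z] takes the value y_j = E[X | Z = z_j] on G_j = {Z = z_j}.
# Under the empirical measure of a sample, y_j is the within-group
# average, and E[Y; G_j] = E[X; G_j] holds exactly.
n = 100_000
Z = rng.integers(0, 3, size=n)          # Z takes values in {0, 1, 2}
X = Z + rng.normal(size=n)              # X is correlated with Z

Y = np.empty(n)
for zj in range(3):
    Y[Z == zj] = X[Z == zj].mean()      # y_j on the event {Z = z_j}

for zj in range(3):
    G = (Z == zj)
    # E[Y; G] vs E[X; G] under the empirical measure
    assert abs(np.mean(Y * G) - np.mean(X * G)) < 1e-9
```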

We are ready to state the general definition of the conditional expectation. Its
existence and uniqueness follow from the next theorem.

Theorem B.6.1 (Conditional expectation). Let X ∈ L1 (Ω, F, P) and G ⊆ F a


sub-σ-algebra. Then:

(i) (Existence) There exists a random variable Y ∈ L1 (Ω, G, P) such that

E[Y ; G] = E[X; G], ∀G ∈ G. (B.6.1)

Such a Y is called a version of the conditional expectation of X given G and is denoted by E[X | G].
(ii) (Uniqueness) It is unique in the sense that, if Y and Y′ are two versions of the conditional expectation, then Y = Y′ almost surely.

When G = σ(Z), we sometimes use the notation E[X | Z] := E[X | G]. A similar
convention applies to collections of random variables, for example, E[X | Z1 , Z2 ] :=
E[X | σ(Z1 , Z2 )] and so on.
We first prove uniqueness. Existence is proved below after some more concepts
are introduced.

Proof of Theorem B.6.1 (ii). By way of contradiction, let Y, Y′ be two versions of E[X | G] such that, without loss of generality, P[Y > Y′] > 0. By monotonicity, there is n ≥ 1 with G = {Y > Y′ + n⁻¹} ∈ G such that P[G] > 0. Then, by definition,

0 = E[Y − Y′; G] > n⁻¹ P[G] > 0,

which gives a contradiction.

To prove existence, we use the L2 method. In L2(Ω, F, P), the conditional expectation reduces to an orthogonal projection.

Theorem B.6.2 (Conditional expectation: L2 case). Let X ∈ L2(Ω, F, P) and G ⊆ F a sub-σ-algebra. Then there exists an (almost surely) unique Y ∈ L2(Ω, G, P) such that

‖X − Y‖₂ = ∆ := inf{‖X − W‖₂ : W ∈ L2(Ω, G, P)},

and, moreover, ⟨Z, X − Y⟩ = 0 for all Z ∈ L2(Ω, G, P). In particular, it satisfies (B.6.1). Such a Y is called the orthogonal projection of X on L2(Ω, G, P).

Proof. Take (Y_n) such that ‖X − Y_n‖₂ → ∆. We use the fact that L2(Ω, G, P) is complete (Theorem B.4.10) and first seek to prove that (Y_n) is Cauchy. Using the parallelogram law (Theorem B.4.20), note that

‖X − Y_r‖₂² + ‖X − Y_s‖₂² = 2‖X − (Y_r + Y_s)/2‖₂² + 2‖(Y_r − Y_s)/2‖₂².

The first term on the right-hand side is ≥ 2∆² by definition of ∆, so taking limits r, s → +∞ we have what we need, that is, that (Y_n) is indeed Cauchy.
Let Y be the limit of (Y_n) in L2(Ω, G, P). Note that by the triangle inequality

∆ ≤ ‖X − Y‖₂ ≤ ‖X − Y_n‖₂ + ‖Y_n − Y‖₂ → ∆,

as n → +∞. As a result, for any Z ∈ L2(Ω, G, P) and t ∈ R,

‖X − Y − tZ‖₂² ≥ ∆² = ‖X − Y‖₂²,

so that, expanding and rearranging, we have

−2t⟨Z, X − Y⟩ + t²‖Z‖₂² ≥ 0,

which is only possible for every t ∈ R if the first term is 0.
Uniqueness follows from the parallelogram law and the definition of ∆.
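The projection picture can be made concrete with simulated data. A sketch under our own toy setup, with G = σ(Z) for a discrete Z: among σ(Z)-measurable predictors, the group-mean predictor minimizes the L2 distance to X (in the empirical measure), and the residual is orthogonal to functions of Z.

```python
import numpy as np

rng = np.random.default_rng(2)

# E[X | Z] as an orthogonal projection: among sigma(Z)-measurable
# predictors, the group-mean predictor Y minimizes ||X - W||_2 (in the
# empirical measure), and the residual X - Y is orthogonal to g(Z).
n = 50_000
Z = rng.integers(0, 4, size=n)
X = np.sin(Z) + rng.normal(size=n)

means = np.array([X[Z == z].mean() for z in range(4)])
Y = means[Z]                             # Y = E[X | Z], pointwise

# Orthogonality: <g(Z), X - Y> = 0, e.g. for g(z) = z and g(z) = z^2
for g in (Z.astype(float), Z.astype(float) ** 2):
    assert abs(np.mean((X - Y) * g)) < 1e-9

# Any other sigma(Z)-measurable candidate is no closer in L2
W = np.cos(Z)                            # an arbitrary function of Z
assert np.mean((X - Y) ** 2) <= np.mean((X - W) ** 2) + 1e-12
```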

We return to the proof of existence of the conditional expectation. We use the standard machinery.

Proof of Theorem B.6.1 (i). The previous theorem implies that conditional expectations exist for indicators and simple functions. Now take X ∈ L1(Ω, F, P) and write X = X⁺ − X⁻, so we can assume X is in fact nonnegative without loss of generality. Using the staircase function

X^(r) = 0, if X = 0;   (i − 1)2^{−r}, if (i − 1)2^{−r} < X ≤ i2^{−r} ≤ r;   r, if X > r,

we have 0 ≤ X^(r) ↑ X. Let Y^(r) = E[X^(r) | G]. Using an argument similar to the proof of uniqueness, it follows that U ≥ 0 implies E[U | G] ≥ 0 for a simple function U. Using linearity (which is immediate from the definition), we then have Y^(r) ↑ Y := lim sup Y^(r), which is measurable in G. By (MON),

E[Y; G] = E[X; G], ∀G ∈ G.

That concludes the proof.

Before deriving some properties, we give a few examples.

Example B.6.3. If X ∈ L1 (Ω, G, P) then E[X | G] = X almost surely trivially. J

Example B.6.4. If G = {∅, Ω}, then E[X | G] = E[X]. J

Example B.6.5. Let A, B ∈ F with 0 < P[B] < 1. If G = {∅, B, Bᶜ, Ω} and X = 1_A, then

P[A | G] = P[A ∩ B] / P[B] on ω ∈ B,   and   P[A | G] = P[A ∩ Bᶜ] / P[Bᶜ] on ω ∈ Bᶜ.

J

Intuition about the conditional expectation sometimes breaks down.

Example B.6.6. On (Ω, F, P) = ((0, 1], B(0, 1], λ), let G be the σ-algebra of all
countable and co-countable (i.e., whose complement in (0, 1] is countable) subsets
of (0, 1]. Then P[G] ∈ {0, 1} for all G ∈ G and

E[X; G] = E[E[X]; G] = E[X]P[G],

so that E[X | G] = E[X]. Yet, G contains all singletons and we seemingly have
“full information,” which would lead to the wrong guess E[X | G] = X. J

We show that the conditional expectation behaves similarly to the ordinary expectation. Below, all X and X_i's are in L1(Ω, F, P) and G is a sub-σ-algebra of F.

Lemma B.6.7 (cLIN). If a1 , a2 ∈ R, then E[a1 X1 + a2 X2 | G] = a1 E[X1 | G] +


a2 E[X2 | G] a.s.

Proof. Use the linearity of expectation and the fact that a linear combination of
random variables in G is also in G.

Lemma B.6.8 (cPOS). If X ≥ 0 then E[X | G] ≥ 0 a.s.

Proof. Let Y = E[X | G] and assume for contradiction that P[Y < 0] > 0. There is n ≥ 1 such that P[Y < −n⁻¹] > 0. But that implies, for G = {Y < −n⁻¹},

E[X; G] = E[Y; G] < −n⁻¹ P[G] < 0,

a contradiction.

Lemma B.6.9 (cMON). If 0 ≤ Xn ↑ X then E[Xn | G] ↑ E[X | G] a.s.

Proof. Let Yn = E[Xn | G]. By (cLIN) and (cPOS), 0 ≤ Yn ↑. Then letting


Y = lim sup Yn , by (MON),

E[X; G] = E[Y ; G],

for all G ∈ G.

Lemma B.6.10 (cFATOU). If Xn ≥ 0 then E[lim inf Xn | G] ≤ lim inf E[Xn | G]


a.s.

Proof. Note that, for n ≥ m,

X_n ≥ Z_m := inf_{k≥m} X_k, and Z_m ↑ lim inf_n X_n,

so that inf_{n≥m} E[X_n | G] ≥ E[Z_m | G]. Applying (cMON),

E[lim inf_n X_n | G] = lim_m E[Z_m | G] ≤ lim inf_n E[X_n | G].

Lemma B.6.11 (cDOM). If |X_n| ≤ V ∈ L1(Ω, F, P) and X_n → X a.s., then E[X_n | G] → E[X | G] a.s.



Proof. Applying (cFATOU) to W_n := 2V − |X_n − X| ≥ 0,

E[2V | G] = E[lim inf_n W_n | G]

          ≤ lim inf_n E[W_n | G]

          = E[2V | G] − lim sup_n E[|X_n − X| | G],

so we must have

lim sup_n E[|X_n − X| | G] = 0.

Now use that |E[X_n − X | G]| ≤ E[|X_n − X| | G] (which follows from (cPOS)).

Lemma B.6.12 (cJENSEN). If f is convex and E[|f(X)|] < +∞, then

f(E[X | G]) ≤ E[f(X) | G].
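For a quick numerical illustration with f(x) = x² (our own toy example, with G = σ(Z) for a discrete Z): on each event {Z = z}, f(E[X | G]) is the squared group mean, which is at most the group mean of X² since variances are nonnegative.

```python
import numpy as np

rng = np.random.default_rng(3)

# Conditional Jensen with the convex f(x) = x^2: on each {Z = z},
# (group mean of X)^2 <= group mean of X^2, i.e.
# f(E[X | G]) <= E[f(X) | G] for G = sigma(Z).
Z = rng.integers(0, 5, size=100_000)
X = Z + rng.normal(size=100_000)

for z in range(5):
    xs = X[Z == z]
    assert xs.mean() ** 2 <= (xs ** 2).mean() + 1e-12
```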
In addition, we highlight (without proof) the following important properties of
the conditional expectation.
Lemma B.6.13 (Taking out what is known). If X ∈ L1 (Ω, F, P) and Z ∈ mG is
bounded or if X is bounded and Z ∈ L1 (Ω, G, P), then E[ZX | G] = Z E[X | G].
This is also true if X, Z ≥ 0, E[X] < +∞ and E[ZX] < +∞, or X ∈
L2 (Ω, F, P) and Z ∈ L2 (Ω, G, P).
Lemma B.6.14 (Role of independence). If X ∈ L1 (Ω, F, P) is independent of
H then E[X | H] = E[X]. In fact, if H is independent of σ(σ(X), G), then
E[X | σ(G, H)] = E[X | G].
Lemma B.6.15 (Conditioning on an independent random variable). Suppose X, Y
are independent. Let φ be a function with E|φ(X, Y )| < +∞ and let g(x) =
E(φ(x, Y )). Then,
E(φ(X, Y )|X) = g(X).
Lemma B.6.16 (Tower property). If H ⊆ G is a σ-algebra and X ∈ L1(Ω, F, P), then

E[E[X | G] | H] = E[E[X | H] | G] = E[X | H].

That is, the “smallest σ-algebra wins.”
An important special case of the latter, also known as the law of total probability
or the law of total expectation, is E[E[X | G]] = E[X].
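The tower property is easy to check on simulated data with nested discrete σ-algebras (our own toy setup): take H = σ(Z1) ⊆ G = σ(Z1, Z2) and compare E[E[X | G] | H] with E[X | H] under the empirical measure.

```python
import numpy as np

rng = np.random.default_rng(4)

# Tower property with H = sigma(Z1) coarser than G = sigma(Z1, Z2):
# E[ E[X | Z1, Z2] | Z1 ] should coincide with E[X | Z1].
n = 120_000
Z1 = rng.integers(0, 2, size=n)
Z2 = rng.integers(0, 3, size=n)
X = Z1 + Z2 + rng.normal(size=n)

def cond_exp(vals, labels):
    # E[vals | labels] as within-group averages, returned pointwise
    out = np.empty_like(vals, dtype=float)
    for v in np.unique(labels):
        out[labels == v] = vals[labels == v].mean()
    return out

inner = cond_exp(X, 10 * Z1 + Z2)       # E[X | G]
lhs = cond_exp(inner, Z1)               # E[ E[X | G] | H ]
rhs = cond_exp(X, Z1)                   # E[X | H]

assert np.max(np.abs(lhs - rhs)) < 1e-9
# Law of total expectation: E[ E[X | G] ] = E[X]
assert abs(inner.mean() - X.mean()) < 1e-9
```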
One last useful property:
Lemma B.6.17. Let (Ω, F, P) be a probability space. If Y1 = Y2 a.s. on B ∈ F
then E[Y1 | F] = E[Y2 | F] a.s. on B.

B.7 Filtered spaces


Finally we define stochastic processes. Let E be a set and let E be a σ-algebra
defined over E.
Definition B.7.1. A stochastic process (or process) is a collection {X_t}_{t∈T} of (E, E)-valued random variables on a probability space (Ω, F, P), where T is an arbitrary index set.
Here is a typical example.
Example B.7.2. When T = Z+ (or T = N or T = Z) we have a discrete-
time process, in which case we often write the process as a sequence (Xt )t≥0 . For
instance:
• X_0, X_1, X_2, . . . i.i.d. random variables;

• (S_t)_{t≥0} where S_t = Σ_{i≤t} X_i with the X_i as above.
We let
Ft = σ(X0 , X1 , . . . , Xt ),
which can be thought of as “the information known up to time t.” For a fixed ω ∈ Ω, (X_t(ω) : t ∈ T) is called a sample path. J
Definition B.7.3. A random walk on R^d is a process of the form

S_t = S_0 + Σ_{i=1}^{t} X_i,   t ≥ 1,

where the X_i's are i.i.d. in R^d, independent of S_0. The case X_i uniform in {−1, +1} is called simple random walk on Z.
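Simple random walk is easy to simulate; a short sketch (the sample sizes are our own arbitrary choices) checks that E[S_t] = 0 and Var(S_t) = t when S_0 = 0, as follows from independence of the increments.

```python
import numpy as np

rng = np.random.default_rng(5)

# Simple random walk on Z started at S_0 = 0: increments X_i are
# i.i.d. uniform in {-1, +1}, so E[S_t] = 0 and Var(S_t) = t.
t, n_paths = 200, 20_000
steps = rng.choice([-1, 1], size=(n_paths, t))
S = steps.cumsum(axis=1)                # S[:, k] holds S_{k+1}

end = S[:, -1].astype(float)            # S_t across sampled paths
assert abs(end.mean()) < 0.5            # E[S_t] = 0
assert abs(end.var() / t - 1.0) < 0.05  # Var(S_t) = t
```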
Filtered spaces provide a formal framework for time-indexed processes. We
restrict ourselves to discrete time. (We will not discuss continuous-time processes
in this book.)
Definition B.7.4. A filtered space is a tuple (Ω, F, (F_t)_{t∈Z₊}, P) where:

• (Ω, F, P) is a probability space;

• (F_t)_{t∈Z₊} is a filtration, that is,

F_0 ⊆ F_1 ⊆ · · · ⊆ F_∞ := σ(∪_t F_t) ⊆ F,

where each F_t is a σ-algebra.


APPENDIX B. MEASURE-THEORETIC FOUNDATIONS 529

Definition B.7.5. Fix (Ω, F, (F_t)_{t∈Z₊}, P). A process (W_t)_{t≥0} is adapted if W_t ∈ F_t for all t.

Intuitively, in the previous definition, the value of Wt is “known at time t.”

Definition B.7.6. A process (C_t)_{t≥1} is predictable if C_t ∈ F_{t−1} for all t ≥ 1.
Example B.7.7. Continuing Example B.7.2. The collection (Ft )t≥0 forms a fil-
tration. The process (St )t≥0 is adapted. On the other hand, the process Ct =
1{St−1 ≤ k} is predictable. J
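Example B.7.7 can be mirrored in code; a small sketch (the threshold k = 0 is an arbitrary choice of ours) emphasizes the time shift: C_t is computed from the path strictly before time t, so it could serve, e.g., as a betting stake placed at time t.

```python
import numpy as np

rng = np.random.default_rng(6)

# Following Example B.7.7: S_t is adapted (a function of X_1,...,X_t),
# while C_t = 1{S_{t-1} <= k} is predictable (a function of
# X_1,...,X_{t-1} only). Here S_0 = 0 and k = 0.
t_max, k = 100, 0
X = rng.choice([-1, 1], size=t_max)
S = np.concatenate(([0], X.cumsum()))   # S_0, S_1, ..., S_{t_max}

C = (S[:-1] <= k).astype(int)           # C[t-1] is C_t, t = 1..t_max

# C_t is determined by the walk strictly before time t
for t in range(1, t_max + 1):
    assert C[t - 1] == int(S[t - 1] <= k)
```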
Bibliography

[Abb18] Emmanuel Abbe. Community Detection and Stochastic Block Mod-


els. Foundations and Trends® in Communications and Information
Theory, 14(1-2):1–162, June 2018. Publisher: Now Publishers, Inc.
[ABH16] Emmanuel Abbe, Afonso S. Bandeira, and Georgina Hall. Exact Re-
covery in the Stochastic Block Model. IEEE Transactions on Infor-
mation Theory, 62(1):471–487, January 2016. Conference Name:
IEEE Transactions on Information Theory.
[Ach03] Dimitris Achlioptas. Database-friendly random projections:
Johnson-Lindenstrauss with binary coins. J. Comput. Syst. Sci.,
66(4):671–687, 2003.
[AD78] Rudolf Ahlswede and David E. Daykin. An inequality for the
weights of two families of sets, their unions and intersections. Z.
Wahrsch. Verw. Gebiete, 43(3):183–185, 1978.
[AF] David Aldous and James Allen Fill. Reversible Markov chains and random walks on graphs. http://www.stat.berkeley.edu/~aldous/RWG/book.html.

[AGG89] R. Arratia, L. Goldstein, and L. Gordon. Two Moments Suffice for


Poisson Approximations: The Chen-Stein Method. Annals of Prob-
ability, 17(1):9–25, January 1989. Publisher: Institute of Mathemat-
ical Statistics.
[AJKS22] Alekh Agarwal, Nan Jiang, Sham M. Kakade, and Wen Sun. Re-
inforcement learning: Theory and algorithms. Available at https:
//rltheorybook.github.io/rltheorybook_AJKS.pdf, 2022.

[AK97] Noga Alon and Michael Krivelevich. The concentration of the chro-
matic number of random graphs. Combinatorica, 17(3):303–313,
1997.

[Ald83] David Aldous. Random walks on finite groups and rapidly mixing
Markov chains. In Seminar on probability, XVII, volume 986 of
Lecture Notes in Math., pages 243–297. Springer, Berlin, 1983.
[Ald90] David J. Aldous. The random walk construction of uniform span-
ning trees and uniform labelled trees. SIAM J. Discrete Math.,
3(4):450–465, 1990.
[Ald97] David Aldous. Brownian excursions, critical random graphs and the
multiplicative coalescent. Ann. Probab., 25(2):812–854, 1997.
[Alo03] Noga Alon. Problems and results in extremal combinatorics. I. Dis-
crete Math., 273(1-3):31–53, 2003. EuroComb’01 (Barcelona).
[AMS09] Jean-Yves Audibert, Rémi Munos, and Csaba Szepesvári. Explo-
ration–exploitation tradeoff using variance estimates in multi-armed
bandits. Theoretical Computer Science, 410(19):1876–1902, April
2009.
[AN04] K. B. Athreya and P. E. Ney. Branching processes. Dover Pub-
lications, Inc., Mineola, NY, 2004. Reprint of the 1972 original
[Springer, New York; MR0373040].
[ANP05] Dimitris Achlioptas, Assaf Naor, and Yuval Peres. Rigorous lo-
cation of phase transitions in hard optimization problems. Nature,
435:759–764, 2005.
[AS11] N. Alon and J.H. Spencer. The Probabilistic Method. Wiley Series
in Discrete Mathematics and Optimization. Wiley, 2011.
[AS15] Emmanuel Abbe and Colin Sandon. Community detection in gen-
eral stochastic block models: Fundamental limits and efficient algo-
rithms for recovery. In Venkatesan Guruswami, editor, IEEE 56th
Annual Symposium on Foundations of Computer Science, FOCS
2015, Berkeley, CA, USA, 17-20 October, 2015, pages 670–688.
IEEE Computer Society, 2015.
[Axl15] Sheldon Axler. Linear algebra done right. Undergraduate Texts in
Mathematics. Springer, Cham, third edition, 2015.
[AZ18] Martin Aigner and Günter M. Ziegler. Proofs from The Book.
Springer, Berlin, sixth edition, 2018. See corrected reprint of the
1998 original [ MR1723092], Including illustrations by Karl H. Hof-
mann.

[Azu67] Kazuoki Azuma. Weighted sums of certain dependent random vari-
ables. Tôhoku Math. J. (2), 19:357–367, 1967.
[BA99] Albert-László Barabási and Réka Albert. Emergence of scaling in
random networks. Science, 286(5439):509–512, 1999.
[BBFdlV00] D. Barraez, S. Boucheron, and W. Fernandez de la Vega. On the
fluctuations of the giant component. Combin. Probab. Comput.,
9(4):287–304, 2000.
[BC03] Bo Brinkman and Moses Charikar. On the Impossibility of Dimen-
sion Reduction in L1. In Proceedings of the 44th Annual IEEE Sym-
posium on Foundations of Computer Science, 2003.
[BCB12] Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret Analysis of
Stochastic and Nonstochastic Multi-armed Bandit Problems. Now
Publishers, 2012. Google-Books-ID: Rl2skwEACAAJ.
[BCMR06] Christian Borgs, Jennifer T. Chayes, Elchanan Mossel, and
Sébastien Roch. The Kesten-Stigum reconstruction bound is tight
for roughly symmetric binary channels. In FOCS, pages 518–530,
2006.
[BD97] Russ Bubley and Martin E. Dyer. Path coupling: A technique for
proving rapid mixing in markov chains. In 38th Annual Symposium
on Foundations of Computer Science, FOCS ’97, Miami Beach,
Florida, USA, October 19-22, 1997, pages 223–231. IEEE Com-
puter Society, 1997.
[BDDW08] Richard Baraniuk, Mark Davenport, Ronald DeVore, and Michael
Wakin. A simple proof of the restricted isometry property for ran-
dom matrices. Constr. Approx., 28(3):253–263, 2008.
[BDJ99] Jinho Baik, Percy Deift, and Kurt Johansson. On the distribution of
the length of the longest increasing subsequence of random permu-
tations. J. Amer. Math. Soc., 12(4):1119–1178, 1999.
[Ber46] S.N. Bernstein. Probability Theory (in russian). M.-L. Gostechiz-
dat, 1946.
[Ber14] Nathanaël Berestycki. Lectures on mixing times: A crossroad
between probability, analysis and geometry. Available at https:
//homepage.univie.ac.at/nathanael.berestycki/
wp-content/uploads/2022/05/mixing3.pdf, 2014.

[BH57] S. R. Broadbent and J. M. Hammersley. Percolation processes. I.
Crystals and mazes. Proc. Cambridge Philos. Soc., 53:629–641,
1957.

[BH16] Afonso S. Bandeira and Ramon van Handel. Sharp nonasymptotic


bounds on the norm of random matrices with independent entries.
The Annals of Probability, 44(4):2479–2506, July 2016. Publisher:
Institute of Mathematical Statistics.

[Bil12] P. Billingsley. Probability and Measure. Wiley Series in Probability


and Statistics. Wiley, 2012.

[BKW14] Itai Benjamini, Gady Kozma, and Nicholas Wormald. The mixing
time of the giant component of a random graph. Random Structures
Algorithms, 45(3):383–407, 2014.

[BLM13] S. Boucheron, G. Lugosi, and P. Massart. Concentration Inequali-


ties: A Nonasymptotic Theory of Independence. OUP Oxford, 2013.

[Bol81] Béla Bollobás. Random graphs. In Combinatorics (Swansea, 1981),


volume 52 of London Math. Soc. Lecture Note Ser., pages 80–102.
Cambridge Univ. Press, Cambridge-New York, 1981.

[Bol84] Béla Bollobás. The evolution of random graphs. Trans. Amer. Math.
Soc., 286(1):257–274, 1984.

[Bol98] Béla Bollobás. Modern graph theory, volume 184 of Graduate Texts
in Mathematics. Springer-Verlag, New York, 1998.

[Bol01] Béla Bollobás. Random graphs, volume 73 of Cambridge Studies in


Advanced Mathematics. Cambridge University Press, Cambridge,
second edition, 2001.

[BR06a] Béla Bollobás and Oliver Riordan. Percolation. Cambridge Univer-


sity Press, New York, 2006.

[BR06b] Béla Bollobás and Oliver Riordan. A short proof of the Harris-
Kesten theorem. Bull. London Math. Soc., 38(3):470–484, 2006.

[Bre17] Pierre Bremaud. Discrete probability models and methods, vol-


ume 78 of Probability Theory and Stochastic Modelling. Springer,
Cham, 2017. Probability on graphs and trees, Markov chains and
random fields, entropy and coding.

[Bre20] Pierre Bremaud. Markov chains—Gibbs fields, Monte Carlo sim-
ulation and queues, volume 31 of Texts in Applied Mathematics.
Springer, Cham, 2020. Second edition [of 1689633].

[Bro89] Andrei Z. Broder. Generating random spanning trees. In FOCS,


pages 442–447. IEEE Computer Society, 1989.

[BRST01] Béla Bollobás, Oliver Riordan, Joel Spencer, and Gábor Tusnády.
The degree sequence of a scale-free random graph process. Random
Structures Algorithms, 18(3):279–290, 2001.

[BS89] Ravi Boppona and Joel Spencer. A useful elementary correlation


inequality. J. Combin. Theory Ser. A, 50(2):305–307, 1989.

[Bub10] Sébastien Bubeck. Bandits Games and Clustering Foundations.


phdthesis, Université des Sciences et Technologie de Lille - Lille
I, June 2010.

[BV04] S.P. Boyd and L. Vandenberghe. Convex Optimization. Berichte


über verteilte messysteme. Cambridge University Press, 2004.

[Car85] Thomas Keith Carne. A transmutation formula for Markov chains.


Bull. Sci. Math. (2), 109(4):399–405, 1985.

[CDL+ 12] P. Cuff, J. Ding, O. Louidor, E. Lubetzky, Y. Peres, and A. Sly.


Glauber dynamics for the mean-field Potts model. J. Stat. Phys.,
149(3):432–477, 2012.

[CF07] Colin Cooper and Alan Frieze. The cover time of sparse random
graphs. Random Structures Algorithms, 30(1-2):1–16, 2007.

[Che52] Herman Chernoff. A measure of asymptotic efficiency for tests of a


hypothesis based on the sum of observations. Ann. Math. Statistics,
23:493–507, 1952.

[Che70] Jeff Cheeger. A lower bound for the smallest eigenvalue of the
Laplacian. In Problems in analysis (Sympos. in honor of Salomon
Bochner, Princeton Univ., Princeton, N.J., 1969), pages 195–199.
Princeton Univ. Press, Princeton, N.J., 1970.

[Che75] Louis H. Y. Chen. Poisson Approximation for Dependent Trials.


The Annals of Probability, 3(3):534–545, 1975. Publisher: Institute
of Mathematical Statistics.

[Chu97] Fan R. K. Chung. Spectral graph theory, volume 92 of CBMS Re-
gional Conference Series in Mathematics. Published for the Confer-
ence Board of the Mathematical Sciences, Washington, DC; by the
American Mathematical Society, Providence, RI, 1997.

[CL06] Fan Chung and Linyuan Lu. Complex graphs and networks, vol-
ume 107 of CBMS Regional Conference Series in Mathematics.
Published for the Conference Board of the Mathematical Sciences,
Washington, DC; by the American Mathematical Society, Provi-
dence, RI, 2006.

[CR92] V. Chvatal and B. Reed. Mick gets some (the odds are on his side)
[satisfiability]. In Foundations of Computer Science, 1992. Proceed-
ings., 33rd Annual Symposium on, pages 620–627, Oct 1992.

[Cra38] H. Cramér. Sur un nouveau théorème-limite de la théorie des prob-


abilités. Actualités Scientifiques et Industrielles, 736:5–23, 1938.

[CRR+ 89] Ashok K. Chandra, Prabhakar Raghavan, Walter L. Ruzzo, Roman


Smolensky, and Prasoon Tiwari. The electrical resistance of a graph
captures its commute and cover times (detailed abstract). In David S.
Johnson, editor, STOC, pages 574–586. ACM, 1989.

[CRT06a] Emmanuel J. Candès, Justin Romberg, and Terence Tao. Robust


uncertainty principles: exact signal reconstruction from highly in-
complete frequency information. IEEE Trans. Inform. Theory,
52(2):489–509, 2006.

[CRT06b] Emmanuel J. Candès, Justin K. Romberg, and Terence Tao. Sta-


ble signal recovery from incomplete and inaccurate measurements.
Comm. Pure Appl. Math., 59(8):1207–1223, 2006.

[CT05] Emmanuel J. Candès and Terence Tao. Decoding by linear program-


ming. IEEE Trans. Inform. Theory, 51(12):4203–4215, 2005.

[CW08] E.J. Candès and M.B. Wakin. An introduction to compressive sam-


pling. Signal Processing Magazine, IEEE, 25(2):21–30, March
2008.

[Dev98] Luc Devroye. Branching processes and their applications in the


analysis of tree structures and tree algorithms. In Michel Habib,

Colin McDiarmid, Jorge Ramirez-Alfonsin, and Bruce Reed, ed-
itors, Probabilistic Methods for Algorithmic Discrete Mathemat-
ics, volume 16 of Algorithms and Combinatorics, pages 249–314.
Springer Berlin Heidelberg, 1998.

[Dey] Partha Dey. Lecture notes on “Stein-Chen method for Poisson approximation”. https://faculty.math.illinois.edu/~psdey/414CourseNotes.pdf.
[DGG+ 00] Martin Dyer, Leslie Ann Goldberg, Catherine Greenhill, Mark Jer-
rum, and Michael Mitzenmacher. An extension of path coupling and
its application to the Glauber dynamics for graph colourings (ex-
tended abstract). In Proceedings of the Eleventh Annual ACM-SIAM
Symposium on Discrete Algorithms (San Francisco, CA, 2000),
pages 616–624. ACM, New York, 2000.

[dH] Frank den Hollander. Probability theory: The coupling method,


2012. http://websites.math.leidenuniv.nl/probability/
lecturenotes/CouplingLectures.pdf.

[Dia88] Persi Diaconis. Group representations in probability and statistics,


volume 11 of Institute of Mathematical Statistics Lecture Notes—
Monograph Series. Institute of Mathematical Statistics, Hayward,
CA, 1988.

[Dia09] Persi Diaconis. The Markov chain Monte Carlo revolution. Bull.
Amer. Math. Soc. (N.S.), 46(2):179–205, 2009.

[Die10] Reinhard Diestel. Graph theory, volume 173 of Graduate Texts in


Mathematics. Springer, Heidelberg, fourth edition, 2010.

[DKLP11] Jian Ding, Jeong Han Kim, Eyal Lubetzky, and Yuval Peres.
Anatomy of a young giant component in the random graph. Ran-
dom Structures Algorithms, 39(2):139–178, 2011.

[DMR11] Constantinos Daskalakis, Elchanan Mossel, and Sébastien Roch.


Evolutionary trees and the ising model on the bethe lattice: a
proof of steel’s conjecture. Probability Theory and Related Fields,
149:149–189, 2011. 10.1007/s00440-009-0246-2.

[DMS00] S. N. Dorogovtsev, J. F. F. Mendes, and A. N. Samukhin. Struc-


ture of growing networks with preferential linking. Phys. Rev. Lett.,
85:4633–4636, Nov 2000.

[Doe38] Wolfgang Doeblin. Exposé de la théorie des chaînes simples constantes de Markoff à un nombre fini d’états. Rev. Math. Union Interbalkan, 2:77–105, 1938.
[Don06] David L. Donoho. Compressed sensing. IEEE Trans. Inform. The-
ory, 52(4):1289–1306, 2006.
[Doo01] J.L. Doob. Classical Potential Theory and Its Probabilistic Coun-
terpart. Classics in Mathematics. Springer Berlin Heidelberg, 2001.
[DP11] Jian Ding and Yuval Peres. Mixing time for the Ising model: a uni-
form lower bound for all graphs. Ann. Inst. Henri Poincaré Probab.
Stat., 47(4):1020–1028, 2011.
[DS84] P.G. Doyle and J.L. Snell. Random walks and electric net-
works. Carus mathematical monographs. Mathematical Association
of America, 1984.
[DS91] Persi Diaconis and Daniel Stroock. Geometric bounds for eigenval-
ues of Markov chains. Ann. Appl. Probab., 1(1):36–61, 1991.
[DSC93a] Persi Diaconis and Laurent Saloff-Coste. Comparison techniques
for random walk on finite groups. Ann. Probab., 21(4):2131–2156,
1993.
[DSC93b] Persi Diaconis and Laurent Saloff-Coste. Comparison theorems for
reversible Markov chains. Ann. Appl. Probab., 3(3):696–730, 1993.
[DSS22] Jian Ding, Allan Sly, and Nike Sun. Proof of the satisfiability con-
jecture for large k. Ann. of Math. (2), 196(1):1–388, 2022.
[Dur85] Richard Durrett. Some general results concerning the critical ex-
ponents of percolation processes. Z. Wahrsch. Verw. Gebiete,
69(3):421–437, 1985.
[Dur06] R. Durrett. Random Graph Dynamics. Cambridge Series in Sta-
tistical and Probabilistic Mathematics. Cambridge University Press,
2006.
[Dur10] R. Durrett. Probability: Theory and Examples. Cambridge Series
in Statistical and Probabilistic Mathematics. Cambridge University
Press, 2010.
[Dur12] Richard Durrett. Essentials of stochastic processes. Springer Texts
in Statistics. Springer, New York, second edition, 2012.

[DZ10] Amir Dembo and Ofer Zeitouni. Large deviations techniques and
applications, volume 38 of Stochastic Modelling and Applied Prob-
ability. Springer-Verlag, Berlin, 2010. Corrected reprint of the sec-
ond (1998) edition.
[Ebe] Andreas Eberle. Markov Processes. 2021. https://uni-bonn.
sciebo.de/s/kzTUFff5FrWGAay.
[EKPS00] W. S. Evans, C. Kenyon, Y. Peres, and L. J. Schulman. Broadcasting
on trees and the Ising model. Ann. Appl. Probab., 10(2):410–433,
2000.
[ER59] P. Erdős and A. Rényi. On random graphs. I. Publ. Math. Debrecen,
6:290–297, 1959.
[ER60] P. Erdős and A. Rényi. On the evolution of random graphs. Magyar
Tud. Akad. Mat. Kutató Int. Közl., 5:17–61, 1960.
[Fel71] William Feller. An introduction to probability theory and its ap-
plications. Vol. II. Second edition. John Wiley & Sons, Inc., New
York-London-Sydney, 1971.
[FK16] Alan Frieze and Michał Karoński. Introduction to random graphs. Cambridge University Press, Cambridge, 2016.
[FKG71] C. M. Fortuin, P. W. Kasteleyn, and J. Ginibre. Correlation inequali-
ties on some partially ordered sets. Comm. Math. Phys., 22:89–103,
1971.
[FM88] P. Frankl and H. Maehara. The Johnson-Lindenstrauss lemma and
the sphericity of some graphs. J. Combin. Theory Ser. B, 44(3):355–
362, 1988.
[Fos53] F. G. Foster. On the stochastic matrices associated with certain queu-
ing processes. Ann. Math. Statistics, 24:355–360, 1953.
[FR98] Alan M. Frieze and Bruce Reed. Probabilistic analysis of algorithms.
In Michel Habib, Colin McDiarmid, Jorge Ramirez-Alfonsin, and
Bruce Reed, editors, Probabilistic Methods for Algorithmic Discrete
Mathematics, volume 16 of Algorithms and Combinatorics, pages
36–92. Springer Berlin Heidelberg, 1998.
[FR08] N. Fountoulakis and B. A. Reed. The evolution of the mixing rate of
a simple random walk on the giant component of a random graph.
Random Structures Algorithms, 33(1):68–86, 2008.

[FR13] Simon Foucart and Holger Rauhut. A Mathematical Introduction to
Compressive Sensing. Applied and Numerical Harmonic Analysis.
Birkhäuser Basel, 2013.

[FV18] S. Friedli and Y. Velenik. Statistical mechanics of lattice systems.


Cambridge University Press, Cambridge, 2018. A concrete mathe-
matical introduction.

[GC11] Aurélien Garivier and Olivier Cappé. The KL-UCB Algorithm for
Bounded Stochastic Bandits and Beyond. In Proceedings of the 24th
Annual Conference on Learning Theory, pages 359–376. JMLR
Workshop and Conference Proceedings, December 2011. ISSN:
1938-7228.

[GCS+ 14] Andrew Gelman, John B. Carlin, Hal S. Stern, David B. Dunson,
Aki Vehtari, and Donald B. Rubin. Bayesian data analysis. Texts in
Statistical Science Series. CRC Press, Boca Raton, FL, third edition,
2014.

[Gil59] E. N. Gilbert. Random graphs. Ann. Math. Statist., 30:1141–1144,


1959.

[GL06] Dani Gamerman and Hedibert Freitas Lopes. Markov chain Monte
Carlo. Texts in Statistical Science Series. Chapman & Hall/CRC,
Boca Raton, FL, second edition, 2006. Stochastic simulation for
Bayesian inference.

[Gri97] Geoffrey Grimmett. Percolation and disordered systems. In Lec-


tures on probability theory and statistics (Saint-Flour, 1996), vol-
ume 1665 of Lecture Notes in Math., pages 153–300. Springer,
Berlin, 1997.

[Gri10a] Geoffrey Grimmett. Probability on graphs, volume 1 of Institute


of Mathematical Statistics Textbooks. Cambridge University Press,
Cambridge, 2010. Random processes on graphs and lattices.

[Gri10b] G.R. Grimmett. Percolation. Grundlehren der mathematischen Wis-


senschaften. Springer, 2010.

[Gri75] David Griffeath. A maximal coupling for Markov chains. Z.


Wahrscheinlichkeitstheorie und Verw. Gebiete, 31:95–106, 1974/75.

[GS20] Geoffrey R. Grimmett and David R. Stirzaker. Probability and ran-
dom processes. Oxford University Press, Oxford, 2020. Fourth edi-
tion [of 0667520].

[Ham57] J. M. Hammersley. Percolation processes. II. The connective con-


stant. Proc. Cambridge Philos. Soc., 53:642–645, 1957.

[Har] Nicholas Harvey. Lecture notes for CPSC 536N: Randomized Algorithms. http://www.cs.ubc.ca/~nickhar/W12/.

[Har60] T. E. Harris. A lower bound for the critical probability in a certain


percolation process. Proc. Cambridge Philos. Soc., 56:13–20, 1960.

[Har63] Theodore E. Harris. The theory of branching processes. Die


Grundlehren der mathematischen Wissenschaften, Band 119.
Springer-Verlag, Berlin; Prentice Hall, Inc., Englewood Cliffs, N.J.,
1963.

[Haz16] Elad Hazan. Introduction to Online Convex Optimization. Founda-


tions and Trends® in Optimization, 2(3-4):157–325, August 2016.
Publisher: Now Publishers, Inc.

[HJ13] Roger A. Horn and Charles R. Johnson. Matrix analysis. Cambridge


University Press, Cambridge, second edition, 2013.

[HLW06] Shlomo Hoory, Nathan Linial, and Avi Wigderson. Expander graphs
and their applications. Bull. Amer. Math. Soc. (N.S.), 43(4):439–561
(electronic), 2006.

[HMRAR98] M. Habib, C. McDiarmid, J. Ramirez-Alfonsin, and B. Reed, ed-


itors. Probabilistic methods for algorithmic discrete mathemat-
ics, volume 16 of Algorithms and Combinatorics. Springer-Verlag,
Berlin, 1998.

[Hoe63] Wassily Hoeffding. Probability inequalities for sums of bounded


random variables. J. Amer. Statist. Assoc., 58:13–30, 1963.

[HS07] Thomas P. Hayes and Alistair Sinclair. A general lower bound


for mixing of single-site dynamics on graphs. Ann. Appl. Probab.,
17(3):931–952, 2007.

[IM98] Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors:


Towards removing the curse of dimensionality. In Jeffrey Scott Vit-
ter, editor, STOC, pages 604–613. ACM, 1998.

[Jan90] Svante Janson. Poisson approximation for large deviations. Random
Structures Algorithms, 1(2):221–229, 1990.
[JH01] Galin L. Jones and James P. Hobert. Honest exploration of in-
tractable probability distributions via Markov chain Monte Carlo.
Statist. Sci., 16(4):312–334, 2001.
[JL84] William B. Johnson and Joram Lindenstrauss. Extensions of Lip-
schitz mappings into a Hilbert space. In Conference in modern
analysis and probability (New Haven, Conn., 1982), volume 26 of
Contemp. Math., pages 189–206. Amer. Math. Soc., Providence, RI,
1984.
[JLR11] S. Janson, T. Luczak, and A. Rucinski. Random Graphs. Wiley
Series in Discrete Mathematics and Optimization. Wiley, 2011.
[JS89] Mark Jerrum and Alistair Sinclair. Approximating the permanent.
SIAM J. Comput., 18(6):1149–1178, 1989.
[Kan86] Masahiko Kanai. Rough isometries and the parabolicity of Rieman-
nian manifolds. J. Math. Soc. Japan, 38(2):227–238, 1986.
[Kar90] Richard M. Karp. The transitive closure of a random digraph. Ran-
dom Structures Algorithms, 1(1):73–93, 1990.
[Kes80] Harry Kesten. The critical probability of bond percolation on the square lattice equals 1/2. Comm. Math. Phys., 74(1):41–59, 1980.
[Kes82] Harry Kesten. Percolation theory for mathematicians, volume 2 of
Progress in Probability and Statistics. Birkhäuser, Boston, Mass.,
1982.
[KP] Júlia Komjáthy and Yuval Peres. Lecture notes for Markov chains: mixing times, hitting times, and cover times, 2012. Saint-Petersburg Summer School. http://www.win.tue.nl/~jkomjath/SPBlecturenotes.pdf.

[KRS] Michael J. Kozdron, Larissa M. Richards, and Daniel W. Stroock.
Determinants, their applications to Markov processes, and a random
walk proof of Kirchhoff's matrix tree theorem, 2013. Preprint avail-
able at http://arxiv.org/abs/1306.2059.
[KS66a] H. Kesten and B. P. Stigum. Additional limit theorems for indecom-
posable multidimensional Galton-Watson processes. Ann. Math.
Statist., 37:1463–1481, 1966.

[KS66b] H. Kesten and B. P. Stigum. A limit theorem for multidimensional
Galton-Watson processes. Ann. Math. Statist., 37:1211–1223, 1966.

[KS67] H. Kesten and B. P. Stigum. Limit theorems for decomposable
multi-dimensional Galton-Watson processes. J. Math. Anal. Appl.,
17:309–338, 1967.

[KS05] Gil Kalai and Shmuel Safra. Threshold Phenomena and In-
fluence: Perspectives from Mathematics, Computer Science,
and Economics. In Computational Complexity and Statistical
Physics. Oxford University Press, December 2005. eprint:
https://academic.oup.com/book/0/chapter/354512033/chapter-
pdf/43716844/isbn-9780195177374-book-part-8.pdf.

[KSK76] John G. Kemeny, J. Laurie Snell, and Anthony W. Knapp. Denumer-
able Markov chains. Springer-Verlag, New York-Heidelberg-Berlin,
second edition, 1976. With a chapter on Markov random fields, by
David Griffeath, Graduate Texts in Mathematics, No. 40.

[Law05] Gregory F. Lawler. Conformally invariant processes in the plane,
volume 114 of Mathematical Surveys and Monographs. American
Mathematical Society, Providence, RI, 2005.

[Law06] Gregory F. Lawler. Introduction to stochastic processes. Chapman
& Hall/CRC, Boca Raton, FL, second edition, 2006.

[Led01] M. Ledoux. The Concentration of Measure Phenomenon. Mathe-
matical surveys and monographs. American Mathematical Society,
2001.

[Lin02] Torgny Lindvall. Lectures on the coupling method. Dover Pub-
lications, Inc., Mineola, NY, 2002. Corrected reprint of the 1992
original.

[LL10] G.F. Lawler and V. Limic. Random Walk: A Modern Introduction.
Cambridge Studies in Advanced Mathematics. Cambridge Univer-
sity Press, 2010.

[Lov83] L. Lovász. Submodular functions and convexity. In Mathematical
programming: the state of the art (Bonn, 1982), pages 235–257.
Springer, Berlin, 1983.

[Lov12] László Lovász. Large networks and graph limits, volume 60 of
American Mathematical Society Colloquium Publications. Amer-
ican Mathematical Society, Providence, RI, 2012.

[LP16] Russell Lyons and Yuval Peres. Probability on trees and networks,
volume 42 of Cambridge Series in Statistical and Probabilistic
Mathematics. Cambridge University Press, New York, 2016.

[LP17] David A. Levin and Yuval Peres. Markov chains and mixing times.
American Mathematical Society, Providence, RI, 2017. Second edi-
tion of [MR2466937], With contributions by Elizabeth L. Wilmer,
With a chapter on “Coupling from the past” by James G. Propp and
David B. Wilson.

[LPW06] David A. Levin, Yuval Peres, and Elizabeth L. Wilmer. Markov
chains and mixing times. American Mathematical Society, 2006.

[LR85] T.L Lai and Herbert Robbins. Asymptotically efficient adaptive al-
location rules. Advances in Applied Mathematics, 6(1):4–22, March
1985.

[LS88] Gregory F. Lawler and Alan D. Sokal. Bounds on the L^2 spec-
trum for Markov chains and Markov processes: a generalization
of Cheeger's inequality. Trans. Amer. Math. Soc., 309(2):557–580,
1988.

[LS12] Eyal Lubetzky and Allan Sly. Critical Ising on the square lattice
mixes in polynomial time. Comm. Math. Phys., 313(3):815–836,
2012.

[LS20] Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Cambridge
University Press, 2020.

[Lu90] Tomasz Łuczak. Component behavior near the critical point of the
random graph process. Random Structures Algorithms, 1(3):287–
310, 1990.

[Lug] Gabor Lugosi. Concentration-of-measure inequalities, 2004. Avail-
able at http://www.econ.upf.edu/~lugosi/anu.pdf.

[LuPW94] Tomasz Łuczak, Boris Pittel, and John C. Wierman. The structure
of a random graph at the point of the phase transition. Trans. Amer.
Math. Soc., 341(2):721–748, 1994.

[Lyo83] Terry Lyons. A simple criterion for transience of a reversible
Markov chain. Ann. Probab., 11(2):393–402, 1983.

[Lyo90] Russell Lyons. Random walks and percolation on trees. Ann.
Probab., 18(3):931–958, 1990.

[Mat88] Peter Matthews. Covering problems for Markov chains. Ann.
Probab., 16(3):1215–1228, 1988.

[Mau79] Bernard Maurey. Construction de suites symétriques. C. R. Acad.
Sci. Paris Sér. A-B, 288(14):A679–A681, 1979.

[McD89] Colin McDiarmid. On the method of bounded differences. In Sur-
veys in combinatorics, 1989 (Norwich, 1989), volume 141 of Lon-
don Math. Soc. Lecture Note Ser., pages 148–188. Cambridge Univ.
Press, Cambridge, 1989.

[ML86] Anders Martin-Löf. Symmetric sampling procedures, general epi-
demic processes and their threshold limit theorems. J. Appl.
Probab., 23(2):265–282, 1986.

[ML98] Anders Martin-Löf. The final size of a nearly critical epidemic, and
the first passage time of a Wiener process to a parabolic barrier. J.
Appl. Probab., 35(3):671–682, 1998.

[MNS15a] Elchanan Mossel, Joe Neeman, and Allan Sly. Consistency thresh-
olds for the planted bisection model. In Rocco A. Servedio and
Ronitt Rubinfeld, editors, Proceedings of the Forty-Seventh Annual
ACM on Symposium on Theory of Computing, STOC 2015, Port-
land, OR, USA, June 14-17, 2015, pages 69–75. ACM, 2015.

[MNS15b] Elchanan Mossel, Joe Neeman, and Allan Sly. Reconstruction and
estimation in the planted partition model. Probab. Theory Related
Fields, 162(3-4):431–461, 2015.

[Mor05] Frank Morgan. Real analysis. American Mathematical Society,
Providence, RI, 2005.

[Mos01] E. Mossel. Reconstruction on trees: beating the second eigenvalue.
Ann. Appl. Probab., 11(1):285–300, 2001.

[Mos03] E. Mossel. On the impossibility of reconstructing ancestral data and
phylogenies. J. Comput. Biol., 10(5):669–678, 2003.

[Mos04] E. Mossel. Phase transitions in phylogeny. Trans. Amer. Math. Soc.,
356(6):2379–2404, 2004.
[MP03] E. Mossel and Y. Peres. Information flow on trees. Ann. Appl.
Probab., 13(3):817–844, 2003.
[MR95] Rajeev Motwani and Prabhakar Raghavan. Randomized algorithms.
Cambridge University Press, Cambridge, 1995.
[MS86] Vitali D. Milman and Gideon Schechtman. Asymptotic theory of
finite-dimensional normed spaces, volume 1200 of Lecture Notes in
Mathematics. Springer-Verlag, Berlin, 1986. With an appendix by
M. Gromov.
[MT06] Ravi Montenegro and Prasad Tetali. Mathematical aspects of mix-
ing times in Markov chains. Found. Trends Theor. Comput. Sci.,
1(3):x+121, 2006.
[MT09] Sean Meyn and Richard L. Tweedie. Markov chains and stochastic
stability. Cambridge University Press, Cambridge, second edition,
2009. With a prologue by Peter W. Glynn.
[MU05] Michael Mitzenmacher and Eli Upfal. Probability and Comput-
ing: Randomized Algorithms and Probabilistic Analysis. Cambridge
University Press, New York, NY, USA, 2005.
[Nic18] Bogdan Nica. A brief introduction to spectral graph theory. EMS
Textbooks in Mathematics. European Mathematical Society (EMS),
Zürich, 2018.
[Nor98] J. R. Norris. Markov chains, volume 2 of Cambridge Series in Sta-
tistical and Probabilistic Mathematics. Cambridge University Press,
Cambridge, 1998. Reprint of 1997 original.
[NP10] Asaf Nachmias and Yuval Peres. The critical random graph, with
martingales. Israel J. Math., 176:29–41, 2010.
[NW59] C. St. J. A. Nash-Williams. Random walk and electric currents in
networks. Proc. Cambridge Philos. Soc., 55:181–194, 1959.
[Ott49] Richard Otter. The multiplicative process. Ann. Math. Statistics,
20:206–224, 1949.
[Pem91] Robin Pemantle. Choosing a spanning tree for the integer lattice
uniformly. Ann. Probab., 19(4):1559–1574, 1991.

[Pem00] Robin Pemantle. Towards a theory of negative dependence. J. Math.
Phys., 41(3):1371–1390, 2000. Probabilistic techniques in equilib-
rium and nonequilibrium statistical physics.

[Per] Yuval Peres. Course notes on Probability on trees and networks,
2004. http://stat-www.berkeley.edu/~peres/notes1.pdf.

[Per09] Yuval Peres. The unreasonable effectiveness of martingales. In Pro-
ceedings of the Twentieth Annual ACM-SIAM Symposium on Dis-
crete Algorithms, SODA ’09, pages 997–1000, Philadelphia, PA,
USA, 2009. Society for Industrial and Applied Mathematics.

[Pet] Gábor Pete. Probability and geometry on groups. Lecture notes for
a graduate course. http://www.math.bme.hu/~gabor/PGG.html.

[Pey08] Rémi Peyre. A probabilistic approach to Carne’s bound. Potential
Anal., 29(1):17–36, 2008.

[Pit76] J. W. Pitman. On coupling of Markov chains. Z. Wahrscheinlichkeit-
stheorie und Verw. Gebiete, 35(4):315–322, 1976.

[Pit90] Boris Pittel. On tree census and the giant component in sparse ran-
dom graphs. Random Structures Algorithms, 1(3):311–342, 1990.

[RAS15] Firas Rassoul-Agha and Timo Seppäläinen. A course on large devia-
tions with an introduction to Gibbs measures, volume 162 of Gradu-
ate Studies in Mathematics. American Mathematical Society, Prov-
idence, RI, 2015.

[RC04] Christian P. Robert and George Casella. Monte Carlo statistical
methods. Springer Texts in Statistics. Springer-Verlag, New York,
second edition, 2004.

[Res92] Sidney Resnick. Adventures in stochastic processes. Birkhäuser
Boston, Inc., Boston, MA, 1992.

[Roc10] Sebastien Roch. Toward Extracting All Phylogenetic Infor-
mation from Matrices of Evolutionary Distances. Science,
327(5971):1376–1379, 2010.

[Rom15] Dan Romik. The surprising mathematics of longest increasing sub-
sequences, volume 4 of Institute of Mathematical Statistics Text-
books. Cambridge University Press, New York, 2015.

[RS17] Sebastien Roch and Allan Sly. Phase transition in the sample com-
plexity of likelihood-based phylogeny inference. Probability Theory
and Related Fields, 169(1):3–62, Oct 2017.
[RT87] WanSoo T. Rhee and Michel Talagrand. Martingale inequalities and
NP-complete problems. Math. Oper. Res., 12(1):177–181, 1987.
[Rud73] Walter Rudin. Functional analysis. McGraw-Hill Series in Higher
Mathematics. McGraw-Hill Book Co., New York-Düsseldorf-
Johannesburg, 1973.
[Rus78] Lucio Russo. A note on percolation. Z. Wahrscheinlichkeitstheorie
und Verw. Gebiete, 43(1):39–48, 1978.
[SE64] M. F. Sykes and J. W. Essam. Exact critical percolation probabilities
for site and bond problems in two dimensions. J. Mathematical
Phys., 5:1117–1127, 1964.
[SJ89] Alistair Sinclair and Mark Jerrum. Approximate counting, uniform
generation and rapidly mixing Markov chains. Inform. and Comput.,
82(1):93–133, 1989.
[Spi56] Frank Spitzer. A combinatorial lemma and its application to proba-
bility theory. Trans. Amer. Math. Soc., 82:323–339, 1956.
[Spi12] Daniel A. Spielman. Lecture notes on spectral graph theory.
https://www.cs.yale.edu/homes/spielman/561/2012/index.html, 2012.
[SS87] E. Shamir and J. Spencer. Sharp concentration of the chromatic
number on random graphs G_{n,p}. Combinatorica, 7(1):121–129,
1987.
[SS03] Elias M. Stein and Rami Shakarchi. Fourier analysis, volume 1 of
Princeton Lectures in Analysis. Princeton University Press, Prince-
ton, NJ, 2003. An introduction.
[SS05] Elias M. Stein and Rami Shakarchi. Real analysis, volume 3 of
Princeton Lectures in Analysis. Princeton University Press, Prince-
ton, NJ, 2005. Measure theory, integration, and Hilbert spaces.
[SSBD14] S. Shalev-Shwartz and S. Ben-David. Understanding Machine
Learning: From Theory to Algorithms. Understanding Machine
Learning: From Theory to Algorithms. Cambridge University Press,
2014.

[Ste] J. E. Steif. A mini course on percolation theory, 2009. http://www.
math.chalmers.se/~steif/perc.pdf.

[Ste72] Charles Stein. A bound for the error in the normal approximation
to the distribution of a sum of dependent random variables. In Pro-
ceedings of the Sixth Berkeley Symposium on Mathematical Statis-
tics and Probability (Univ. California, Berkeley, Calif., 1970/1971),
Vol. II: Probability theory, pages 583–602, 1972.
[Ste97] J. Michael Steele. Probability theory and combinatorial optimiza-
tion, volume 69 of CBMS-NSF Regional Conference Series in Ap-
plied Mathematics. Society for Industrial and Applied Mathematics
(SIAM), Philadelphia, PA, 1997.
[Ste98] G. W. Stewart. Matrix algorithms. Vol. I. Society for Industrial and
Applied Mathematics, Philadelphia, PA, 1998. Basic decomposi-
tions.
[Str65] V. Strassen. The existence of probability measures with given
marginals. Ann. Math. Statist., 36:423–439, 1965.
[Str14] Daniel W. Stroock. Doeblin’s Theory for Markov Chains. In An In-
troduction to Markov Processes, pages 25–47. Springer Berlin Hei-
delberg, Berlin, Heidelberg, 2014.
[SW78] P. D. Seymour and D. J. A. Welsh. Percolation probabilities on the
square lattice. Ann. Discrete Math., 3:227–245, 1978. Advances
in graph theory (Cambridge Combinatorial Conf., Trinity College,
Cambridge, 1977).
[Tao] Terence Tao. Open question: deterministic UUP matri-
ces. https://terrytao.wordpress.com/2007/07/02/
open-question-deterministic-uup-matrices/.

[Var85] Nicholas Th. Varopoulos. Long range estimates for Markov chains.
Bull. Sci. Math. (2), 109(3):225–252, 1985.
[vdH10] Remco van der Hofstad. Percolation and random graphs. In New
perspectives in stochastic geometry, pages 173–247. Oxford Univ.
Press, Oxford, 2010.
[vdH17] Remco van der Hofstad. Random graphs and complex networks.
Vol. 1. Cambridge Series in Statistical and Probabilistic Mathemat-
ics, [43]. Cambridge University Press, Cambridge, 2017.

[vdHK08] Remco van der Hofstad and Michael Keane. An elementary proof
of the hitting time theorem. Amer. Math. Monthly, 115(8):753–756,
2008.

[Vem04] Santosh S. Vempala. The random projection method. DIMACS
Series in Discrete Mathematics and Theoretical Computer Science,
65. American Mathematical Society, Providence, RI, 2004. With a
foreword by Christos H. Papadimitriou.

[Ver18] Roman Vershynin. High-Dimensional Probability: An Introduction
with Applications in Data Science. Cambridge University Press,
September 2018. Google-Books-ID: TahxDwAAQBAJ.

[vH16] Ramon van Handel. Probability in high dimension. http://www.
princeton.edu/~rvan/APC550.pdf, 2016.

[Vil09] Cédric Villani. Optimal transport, volume 338 of Grundlehren der
Mathematischen Wissenschaften [Fundamental Principles of Math-
ematical Sciences]. Springer-Verlag, Berlin, 2009. Old and new.

[Wen75] J. G. Wendel. Left-continuous random walk and the Lagrange ex-
pansion. Amer. Math. Monthly, 82:494–499, 1975.

[Whi32] Hassler Whitney. Non-separable and planar graphs. Trans. Amer.
Math. Soc., 34(2):339–362, 1932.

[Wil91] David Williams. Probability with martingales. Cambridge Mathe-
matical Textbooks. Cambridge University Press, Cambridge, 1991.

[Wil96] David Bruce Wilson. Generating random spanning trees more
quickly than the cover time. In Gary L. Miller, editor, STOC, pages
296–303. ACM, 1996.

[YP14] Se-Young Yun and Alexandre Proutière. Community detection via
random and adaptive sampling. In Maria-Florina Balcan, Vitaly
Feldman, and Csaba Szepesvári, editors, Proceedings of The 27th
Conference on Learning Theory, COLT 2014, Barcelona, Spain,
June 13-15, 2014, volume 35 of JMLR Workshop and Conference
Proceedings, pages 138–175. JMLR.org, 2014.

[Yur76] V. V. Yurinskiı̆. Exponential inequalities for sums of random vectors.
J. Multivariate Anal., 6(4):473–499, 1976.

Index

ε-net, see epsilon-net
Azuma-Hoeffding inequality, see tail bounds
balancing vectors, 34
ballot theorem, 126
balls and bins, 154
bandits
    definition, 170
    optimism in the face of uncertainty, 171
    upper confidence bound, 171
Bernstein's inequality, see tail bounds
Berry-Esséen theorem, 157
binary classification
    definitions, 102–103
    empirical risk minimizer, 103
    no free lunch, 103
binary search tree, 442
binomial coefficients
    binomial theorem, 502
    bounds, 23, 502
    definition, x
bond percolation
    definition, 18
    FKG, 265
Bonferroni inequalities, 116
Boole's inequality, see union bound
Boolean functions
    random k-SAT problem, 37
bounded differences inequality, 152
branching number, 58
branching processes
    binomial offspring, 438
    duality principle, 432
    exploration process, 431
    extinction, 417, 418
    Galton-Watson process, 416
    Galton-Watson tree, 416, 422
    geometric offspring, 495
    history, 432, 496
    infinite line of descent, 496
    linear functionals, 427
    multitype, see multitype branching processes
    Poisson offspring, 420, 433, 435
    positive regular case, 425
    random-walk representation, 429, 431, 433
canonical paths, 406
Cauchy-Schwarz inequality, 515, 519
Cavender-Farris-Neyman (CFN) model, 451
Cayley's formula, see trees
chaining method, 91
Chebyshev polynomials, 366
Chebyshev's association inequality, 320
Chebyshev's inequality, see tail bounds
Cheeger's inequality, 385, 491
Chen-Stein method
    dissociated case, 305, 318
    main result, 301
    negatively related variables, 324
    positively related variables, 324
    Stein coupling, 302, 306
    Stein equation, 301
Chernoff-Cramér bound, see tail bounds
Chernoff-Cramér method, 67, 117, 146
community recovery, 345
compressed sensing, 96
concentration inequalities, see tail bounds, 120
concentration of measure, 156
conditional expectation
    definition, 522–525
    examples, 525–526
    law of total probability, 527
    properties, 526–527
    tower property, 527
congestion ratio, 406
convergence in probability, 32
convex duality
    Lagrangian, 203
    weak duality, 203
coupling
    coalescence time, 282
    coalescing, 282
    coupling inequality, 241, 247
    definition, 235
    Erdős-Rényi graph model, 466, 483
    independent, 235
    Markovian, 246, 282
    maximal coupling, 242
    monotone, 235, 236, 251, 254, 257
    path coupling, 293, 299
    splitting, 283
    stochastic domination, 251
    Strassen's theorem, 254
coupon collector, 31, 286
Courant-Fischer theorem, 331, 336, 340, 341, 384
covariance, 30
covering number, 87, 111
cumulant-generating function, 64, 66
Curie-Weiss model, 401
cutset, 58
Davis-Kahan theorem, 343, 349
dependency graph, 46
dimension reduction, 93
Dirichlet
    energy, 383, 406
    form, 383
    principle, 206, 229
    problem, 183
drift condition, 190
Dudley's inequality, 91, 111
edge expansion constant, see networks
Efron-Stein inequality, 151
electrical networks
    definitions, 192–194
    Dirichlet energy, see Dirichlet
    effective conductance, 200, 206
    effective resistance, 200, 214, 229
    flow, see flows
    Kirchhoff's cycle law, 193
    Kirchhoff's node law, 193
    Kirchhoff's resistance formula, 218
    Nash-Williams inequality, 206, 229
    Ohm's law, 193, 219
    parallel law, 196
    Rayleigh's principle, 209, 219
    series law, 196
    Thomson's principle, 202, 208
    voltage, 192
empirical measure, 110, 112
epsilon-net, 86, 89, 97–101
Erdős-Rényi graph model
    chromatic number, 158
    clique number, 48, 318, 321
    cluster, 468
    connectivity, 52
    definition, 18
    degree sequence, 247
    evolution, 466
    exploration process, 468
    FKG, 265
    giant component, 467, 481
    isolated vertices, 51
    largest connected component, 467
    maximum degree, 78
    positive associations, 263
    random walk, 490
    subgraph containment, 269
    threshold function, 47–55
exhaustive sequence, 229
expander graphs
    (d, α)-expander family, 394
    Pinsker's model, 395
factorials
    bounds, 22, 502
    definition, x
    Stirling's formula, 502
Fenchel-Legendre dual, 66
first moment method, 36–41, 46, 48, 49, 52–54, 56, 144, 161
first moment principle, 33, 36, 104
first-step analysis, 186
FKG
    condition, 264, 321
    inequality, 265, 321, 325
    measure, 264, 322
flows
    current, 193
    definition, 6
    energy, 202, 209, 212
    flow-conservation constraints, 193, 210
    max-flow min-cut theorem, 7, 254
    strength, 193
    to ∞, 209, 211
gambler's ruin, 138–141, 195, 201
Gibbs random fields, 19
Glauber dynamics
    definition, 21
    fast mixing, 297, 401
graph Laplacian
    connectivity, 334
    definition, 333
    degree, 335
    eigenvalues, 334
    Fiedler vector, 334
    network, 339
    normalized, 340
    quadratic form, 333
graphs
    3–1 tree, see trees
    b-ary tree T_b^ℓ, 5, 290
    n-clique, 382
    adjacency matrix, see matrices
    bridge, 218
    Cayley's formula, see trees
    chromatic number, 8, 158
    clique, 3, 48
    clique number, 48, 318
    coloring, 8
    complete graph K_n, 5
    cutset, 58, 206
    cycle C_n, 5, 285
    definitions, 2–9
    degree, 78
    diameter, 368
    directed, 8, 9
    expander, see expander graphs
    flow, see flows
    graph distance, 4
    hypercube Z_2^n, 5, 286
    incidence matrix, see matrices
    independent set, 8, 34
    infinite, 5
    infinite binary tree, 451
    Laplacian, see graph Laplacian
    matching, 8
    matrix representation, see matrices
    multigraph, 2
    network, see networks
    oriented incidence matrix, see matrices
    perfect matching, 8
    spanning arborescence, 222
    torus L_n^d, 5
    tree, see trees
    Turán graphs, 35
Green function, see Markov chains
Hölder's inequality, 515
Hamming distance, 178
harmonic functions, see Markov chains
Harper's vertex isoperimetric theorem, 158
Harris' inequality, 320, 325
Harris' theorem, see percolation
hitting time, see stopping time
Hoeffding's inequality, see tail bounds
Hoeffding's lemma, see tail bounds, 146
Holley's inequality, 266, 321, 325
increasing event, see posets
indicator trick, 36
inherited property, 421
Ising model
    boundary conditions, 260
    complete graph, 401
    definition, 19
    FKG, 265
    Glauber dynamics, 286, 297, 400
    magnetization, 401
    random cluster, 451
    trees, 451, 499
isoperimetric inequality, 381
Janson's inequality, 271, 325
Jensen's inequality, 518
Johnson-Lindenstrauss
    distributional lemma, 93
    lemma, 93–96
Kesten's theorem, see percolation
knapsack problem, 79
Kolmogorov's maximal inequality, 141
Kullback-Leibler divergence, 68
Laplacian
    graphs, see graph Laplacian
    Markov chains, 183, 204, 323
    networks, see graph Laplacian
large deviations, 69, 120
law of total probability, see conditional expectation
laws of large numbers, 32–33
Lipschitz
    condition, 178–180
    process, 88
Lyapounov function, see Markov chains
Markov chain Monte Carlo, 370
Markov chain tree theorem, 231
Markov chains
    average occupation time, 186
    birth-death, 194, 228, 376
    bottleneck ratio, see networks
    Chapman-Kolmogorov, 10
    commute time, 214
    commute time identity, 214, 230, 290
    construction, 10
    cover time, 124
    decomposition theorem, 127
    definitions, 9–17
    Doeblin's condition, 283, 322
    escape probability, 197, 205
    examples, 9–16
    exit law, 186
    exit probability, 186
    first return time, 124
    first visit time, 124, 181
    Green function, 129, 186, 197
    harmonic functions, 181
    hitting times, 313
    irreducible set, 127
    lower bound, 285
    Lyapounov function, 190, 294
    Markov property, 10
    martingales, 134
    Matthews' cover time bounds, 132, 230
    mean exit time, 186
    Metropolis algorithm, 15
    mixing time, see mixing times
    positive recurrence, 127
    potential theory, 185
    recurrence, 199, 209–214
    recurrent state, 127
    relaxation time, 358
    reversibility, 181, 365
    splitting, see coupling
    stationary measure, 128
    stochastic domination, 260, 267
    stochastic monotonicity, 258
    strong Markov property, 124
    uniform geometric ergodicity, 284
    Varopoulos-Carne bound, 365, 414
Markov's inequality, see tail bounds, 64
martingales, 456
    Azuma-Hoeffding inequality, see tail bounds
    convergence theorem, 142, 417
    definition, 133
    Doob martingale, 135, 147, 158
    Doob's submartingale inequality, 141, 146
    edge exposure martingale, 159
    exposure martingale, 158
    hitting time, 230
    Markov chain, see Markov chains
    martingale difference, 147
    optional stopping theorem, 137
    orthogonality of increments, 143, 148
    stopped process, 136
    submartingale, 133
    supermartingale, 133
    vertex exposure martingale, 158
matrices
    2-norm, 88
    adjacency, 2, 332
    block, 328
    diagonal, 328
    graph Laplacian, see graph Laplacian
    incidence, 3
    oriented incidence, 9
    orthogonal, 328
    spectral norm, 88, 179, 341
    spectral radius, 410, 425
    spectrum, 410
    stochastic matrix, 9
    symmetric, 328
maximal Azuma-Hoeffding inequality, see tail bounds
maximum principle, 14, 24, 183, 229
method of bounded differences, see tail bounds
method of moments, 119, 120
method of random paths, 216, 229
Metropolis algorithm, see Markov chains
minimum bisection problem, 347
Minkowski's inequality, 516
mixing times
    b-ary tree, 290, 391
    cutoff, 289, 364, 411
    cycle, 322, 359, 392
    definition, 17
    diameter, 369
    diameter bound, 369
    distinguishing statistic, 289, 322
    hypercube, 362, 393, 413
    lower bound, 284, 287, 297, 322, 368, 369
    random walk on cycle, 285
    random walk on hypercube, 286
    separation distance, 357
    upper bound, 297
moment-generating functions
    χ2, 73
    definition, 28
    Gaussian, 65
    Poisson, 67
    Rademacher, 65
moments, 28
    exponential moment, see moment-generating functions
multi-armed bandits, see bandits
multitype branching processes
    definitions, 423–425
    Kesten-Stigum bound, 455
    mean matrix, 424
    nonsingular case, 424
negative associations, 219
networks
    cut, 382
    definition, 8
    edge boundary, 381
    edge expansion, 382
    vertex boundary, 394
no free lunch, see binary classification
notation, ix–xi
operator
    compact, 376
    norm, 377
    spectral radius, 377
optimal transport, 323
optional stopping theorem, see martingales
Pólya's theorem, see random walk
Pólya's urn, 142
packing number, 87
Pajor's lemma, 114
parity functions, 363
Parseval's identity, 352
pattern matching, 155
peeling method, see slicing method
percolation
    contour lemma, 42
    critical exponents, 441
    critical value, 40, 55, 272, 423, 438
    dual lattice, 41, 273
    Galton-Watson tree, 422
    Harris' theorem, 273, 325
    Kesten's theorem, 273, 325
    on L^2, 40, 272
    on L^d, 254
    on a graph, 236
    on infinite trees, 55, 144
    percolation function, 40, 55, 254, 272, 438
    RSW lemma, 273, 322, 325
permutations
    Erdős-Szekeres Theorem, 39
    longest increasing subsequence, 38
    random, 38
Perron-Frobenius theory
    Perron vector, 426
    theorem, 12, 426, 456
Poincaré inequality, 152, 385, 386, 406
Poisson approximation, 245, 269
Poisson equation, 187
Poisson trials, 69
posets
    decreasing event, 264
    definition, 253
    increasing event, 254, 320
positive associations
    definition, 263
    strong, 321
positively correlated events, 264
probabilistic method, 33–36, 93, 103, 112
probability generating function, 418
probability spaces
    definitions, 503–506
    distribution function, 506
    expectation, 513–519
    filtered spaces, 122, 528
    Fubini's theorem, 519
    independence, 510–513
    process, 528
    random variables, 506–510
pseudo-regret, 171
random graphs
    Erdős-Rényi, see Erdős-Rényi graph model
    preferential attachment, 19, 163
    stochastic blockmodel, 345
random projection, 93
random target lemma, 184
random variables
    χ2, 73, 95
    Bernoulli, 68, 235, 244, 250, 307
    binomial, 68, 287, 320
    Gaussian, 30, 70, 73, 93
    geometric, 32, 284
    Poisson, 67, 241, 244, 250, 252, 420, 496
    Rademacher, 65, 70
    uncorrelated, 32, 117, 143
    uniform, 31
random walk
    b-ary tree, 290
    asymmetric random walk on Z, 380
    biased random walk on Z, 139, 236
    cycle, 285, 359
    hypercube, 286, 362
    lazy, 16, 238, 284
    loop erasure, 220
    on a graph, 10–15
    on a network, 20
    Pólya's theorem, 215
    reflection principle, 125
    simple random walk on Z, 11, 127, 138, 183, 211, 365, 528
    simple random walk on Z^d, 238
    simple random walk on a graph, 20
    tree, 239
    Wald's identities, 137–141
Rayleigh quotient, 330, 336, 384
reconstruction problem
    definition, 451
    MAP estimator, 453
    solvability, 452
reflection principle, see random walk
relaxation times
    cycle, 361
    hypercube, 364
restricted isometry property, 97, 101–102
rough embedding, 210, 211
rough equivalence, 211, 230
rough isometry, 230
RSW lemma, see percolation
Sauer's lemma, 108, 110, 112
second moment method, 45, 46, 48, 50, 52, 54, 57, 58, 60
set balancing, 66
shattering, 108
simple random walk on a graph, see random walk
slicing method, 176
span, 331
sparse signal recovery, 97
sparsity, 96
spectral clustering, 349
spectral gap, 358
spectral theorem, 328, 340
Spitzer's combinatorial lemma, 436
Stirling's formula, see factorials
stochastic bandits, see bandits
stochastic domination
    binomial, 470
    coupling, see coupling
    Markov chains, 258
    monotonicity, 320
    posets, 254
    real random variables, 250
stochastic processes
    adapted, 529
    definition, 528
    filtration, 528
    predictable, 529
    sample path, 528
    supremum, 85–93
stopping time
    cover time, see Markov chains
    definition, 123
    first return time, see Markov chains
    first visit time, see Markov chains
    hitting time, 123
strong Markov property, see Markov chains
Strassen's theorem, 325
sub-exponential variable, see tail bounds
sub-Gaussian increments, 91
sub-Gaussian variable, see tail bounds, 146
submodularity, 325
symmetrization, 71, 106
tail bounds
    Azuma-Hoeffding inequality, 145, 158, 159, 227, 380
    Bernstein's inequality, 78, 493
    Bernstein's inequality for bounded variables, 77
    Chebyshev's inequality, 29, 45, 247, 481
    Chernoff bound, 69, 155, 365
    Chernoff-Cramér bound, 64, 75, 85
    definitions, 28–29
    general Bernstein inequality, 76
    general Hoeffding inequality, 70, 89
    Hoeffding's inequality, 82, 112
    Hoeffding's inequality for bounded variables, 71
    Hoeffding's lemma, 72
    Markov's inequality, 29, 36, 104, 141, 146
    McDiarmid's inequality, 153, 159, 164, 178, 179
    method of bounded differences, 153, 156, 163
    Paley-Zygmund inequality, 46
    sub-exponential, 74
    sub-Gaussian, 70, 85, 107
    Talagrand's inequality, 179
Talagrand's inequality, see tail bounds
threshold phenomena, 37, 41, 47
tilting, 73
total variation distance, 16
tower property, see conditional expectation
trees
    3–1 tree, 61
    Cayley's formula, 5, 26, 54, 499
    characterization, 4–5
    definition, 4
    infinite, 55, 438
    uniform spanning tree, 218–226
type, see recurrence
uniform spanning trees, see trees
union bound, 36, 116
Varopoulos-Carne bound, see Markov chains
VC dimension, 108, 111
Wald's identities, see random walk
Wasserstein distance, 323
Weyl's inequality, 342
