
Probabilistic Combinatorics

Unofficial Lecture Notes


Hilary Term 2011, 16 lectures
O. M. Riordan [email protected]
These notes give APPROXIMATELY the contents of the lectures. They are
not official in any way. They are also not intended for distribution, only as a
revision aid. I would be grateful to receive corrections by e-mail. (But please
check the course webpage first in case the correction has already been made.)
Horizontal lines indicate approximately the division between lectures.
Recommended books:
For much of the course The Probabilistic Method (second edition, Wiley,
2000) by Alon and Spencer is the most accessible reference.
Very good books containing a lot of material, especially about random
graphs, are Random Graphs by Bollobás, and Random Graphs by Janson,
Łuczak and Ruciński. However, do not expect these books to be easy to read!
Remark on notation: $\subset$ simply means $\subseteq$, though the latter is sometimes used
for emphasis.
0 Introduction: what is probabilistic combinatorics?
The first question is what is combinatorics! This is hard to define exactly, but
should become clearer through examples, of which the main one is graph theory.
Roughly speaking, combinatorics is the study of discrete structures. Here
discrete means either finite, or infinite but discrete in the sense that the integers
are, as opposed to the reals. Usually in combinatorics, there are some underlying
objects whose internal structure we ignore, and we study structures built on
them: the most common example is graph theory, where we do not care what
the vertices are, but study the abstract structure of graphs on a given set of
vertices. Abstractly, a graph is just a set of unordered pairs of vertices, i.e., a
symmetric irreflexive binary relation on its vertex set. More generally, we might
study collections of subsets of a given vertex set, for example.
Turning to probabilistic combinatorics, this is combinatorics with randomness
involved. It can mean two things: (a) the use of randomness (e.g., random
graphs) to solve deterministic combinatorial problems, or (b) the study of random
combinatorial objects for their own sake. Historically, the subject started
with (a), but after a while, the same objects (e.g., random graphs) come up
again and again, and one realizes that it is not only important, but also
interesting, to study these in themselves, as well as their applications. The subject
has also led to new developments in probability theory.
The course will mainly be organized around proof techniques. However,
each technique will be illustrated with examples, and one particular example
(random graphs) will occur again and again, so by the end of the course we will
have covered aim (b) in this special case as well as aim (a) above.
The first few examples will be mathematically very simple; nevertheless they
will show the power of the method in general. Of course, modern applications
are usually not quite so simple.
1 The first moment method
Perhaps the most basic inequality in probability is the union bound: if $A_1$ and
$A_2$ are two events, then $P(A_1 \cup A_2) \le P(A_1) + P(A_2)$. More generally,
\[ P\bigl(A_1 \cup \cdots \cup A_n\bigr) \le \sum_{i=1}^{n} P(A_i). \]
This totally trivial fact is already useful.
Example 1.1 (Ramsey numbers). Recall the definition: the Ramsey number
$R(k,\ell)$ is the smallest $n$ such that every red/blue colouring of the edges of the
complete graph $K_n$ contains either a red $K_k$ or a blue $K_\ell$.
Theorem 1.1 (Erdős, 1947). If $n, k \ge 1$ are integers such that
\[ \binom{n}{k} 2^{1-\binom{k}{2}} < 1, \]
then $R(k,k) > n$.
Proof. Colour the edges of $K_n$ red/blue at random so that each edge is red with
probability 1/2 and blue with probability 1/2, and the colours of all edges are
independent.
There are $\binom{n}{k}$ copies of $K_k$ in $K_n$. Let $A_i$ be the event that the $i$th copy is
monochromatic. Then
\[ P(A_i) = 2\left(\tfrac{1}{2}\right)^{\binom{k}{2}} = 2^{1-\binom{k}{2}}. \]
Thus
\[ P(\exists\text{ monochromatic } K_k) \le \sum_i P(A_i) = \binom{n}{k} 2^{1-\binom{k}{2}} < 1. \]
Thus, in the random colouring, the probability that there is no monochromatic
$K_k$ is greater than 0. Hence it is possible that the random colouring is good
(contains no monochromatic $K_k$), i.e., there exists a good colouring.
Deducing an explicit bound on $R(k,k)$ involves a little calculation.
Corollary 1.2. $R(k,k) \ge 2^{k/2}$ for $k \ge 3$.
Proof. Set $n = \lfloor 2^{k/2} \rfloor$. Then
\[ \binom{n}{k} 2^{1-\binom{k}{2}} \le \frac{n^k}{k!}\, 2^{1-\binom{k}{2}} \le \frac{2^{k^2/2}}{k!}\, 2^{1-k^2/2+k/2} = \frac{2^{1+k/2}}{k!}, \]
which is smaller than 1 if $k \ge 3$.
Remark. The result above is very simple, and may seem weak. But the best
known lower bounds proved by non-random methods are roughly $k^{(\log k)^C}$ with $C$
constant, i.e., grow only slightly faster than polynomially. This is tiny compared
with the exponential bound above. Note that the upper bounds are roughly $4^k$,
so exponential is the right order: the constant is unknown.
Often, the first-moment method simply refers to using the union bound as
above. But it is much more general than that. We recall another basic term
from probability.
Definition. The first moment of a random variable $X$ is simply its mean, or
expectation, written $E[X]$.
Recall that expectation is linear. If $X$ and $Y$ are (real-valued) random variables
and $\lambda$ is a (constant!) real number, then $E[X + Y] = E[X] + E[Y]$, and
$E[\lambda X] = \lambda E[X]$. Crucially, these ALWAYS hold, irrespective of any relationship
(or not) between $X$ and $Y$.
If $A$ is an event, then its indicator function $I_A$ is the random variable which
takes the value 1 when $A$ holds and 0 when $A$ does not hold.
Let $A_1, \ldots, A_n$ be events, let $I_i$ be the indicator function of $A_i$, and set
$X = \sum_i I_i$, so $X$ is the (random) number of the events $A_i$ that hold. Then
\[ E[X] = \sum_{i=1}^{n} E[I_i] = \sum_{i=1}^{n} P(A_i). \]
We use the following trivial observation: for any random variable $X$, it cannot
be true that $X$ is always smaller than its mean, or always larger: if $E[X] = \mu$,
then $P(X \ge \mu) > 0$ and $P(X \le \mu) > 0$.
Example 1.2 (Ramsey numbers again).
Theorem 1.3. Let $n, k \ge 1$ be integers. Then
\[ R(k,k) > n - \binom{n}{k} 2^{1-\binom{k}{2}}. \]
Proof. Colour the edges of $K_n$ as before. Let $X$ denote the (random) number
of monochromatic copies of $K_k$ in the colouring. Then
\[ \mu = E[X] = \binom{n}{k} 2^{1-\binom{k}{2}}. \]
Since $P(X \le \mu) > 0$, there exists a colouring with at most $\mu$ monochromatic
copies of $K_k$. Pick one vertex from each of these monochromatic $K_k$s (this
may involve picking the same vertex more than once). Delete all the selected
vertices. Then we have deleted at most $\mu$ vertices, and we are left with a good
colouring of $K_m$ for some $m \ge n - \mu$. Thus $R(k,k) > m \ge n - \mu$.
The type of argument above is often called a deletion argument. Instead of
trying to avoid bad things in our random structure, we first ensure that there
aren't too many, and then fix things (here by deleting) to get rid of those few.
Corollary 1.4. $R(k,k) \ge (1 - o(1))\, e^{-1} k\, 2^{k/2}$.
Here we are using standard asymptotic notation (see handout). Explicitly,
we mean that for any $\epsilon > 0$ there is a $k_0$ such that $R(k,k) \ge (1-\epsilon)\, e^{-1} k\, 2^{k/2}$
for all $k \ge k_0$.
Proof. Exercise: take $n = \lfloor e^{-1} k\, 2^{k/2} \rfloor$.
We now give a totally different example of the first-moment method.
Example 1.3 (Sum-free sets).
Definition. A set $S \subseteq \mathbb{R}$ is sum-free if there do not exist $a, b, c \in S$ such that
$a + b = c$.
Note that $\{1, 2\}$ is not sum-free, since $1 + 1 = 2$. The set $\{2, 3, 7, 8, 12\}$ is
sum-free, for example.
Theorem 1.5 (Erdős, 1965). Let $S = \{s_1, s_2, \ldots, s_n\}$ be a set of $n \ge 1$ non-zero
integers. There is some $A \subseteq S$ such that $A$ is sum-free and $|A| > n/3$.
End L1 Start L2
Proof. We use a trick: fix a prime $p$ of the form $3k + 2$ not dividing any $s_i$; for
example, take $p > \max |s_i|$.
Let $I = \{k+1, \ldots, 2k+1\}$. Then $I$ is sum-free modulo $p$: there do not exist
$a, b, c \in I$ such that $a + b \equiv c \bmod p$.
Pick $r$ uniformly at random from $1, 2, \ldots, p-1$, and set $t_i = r s_i \bmod p$, so
each $t_i$ is a random element of $\{1, 2, \ldots, p-1\}$.
For each $i$, as $r$ runs from 1 to $p-1$, $t_i$ takes each possible value $1, 2, \ldots, p-1$
exactly once. Hence
\[ P(t_i \in I) = \frac{|I|}{p-1} = \frac{k+1}{3k+1} > \frac{1}{3}. \]
By the first moment method, we have
\[ E[\#\{i : t_i \in I\}] = \sum_{i=1}^{n} P(t_i \in I) > n/3. \]
It follows that there is some $r$ such that, for this particular $r$, the number of
$i$ with $t_i \in I$ is greater than $n/3$. For this $r$, let $A = \{s_i : t_i \in I\}$, so $A \subseteq S$
and $|A| > n/3$. If we had $s_i, s_j, s_k \in A$ with $s_i + s_j = s_k$ then we would have
$r s_i + r s_j = r s_k$, and hence $t_i + t_j \equiv t_k \bmod p$, which contradicts the fact that
$I$ is sum-free modulo $p$.
The proof above is an example of an averaging argument. This particular
example is not so easy to come up with without a hint, although it is hopefully
easy to follow.
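For readers who like to experiment, here is a minimal Python sketch of the argument above: it finds a prime $p \equiv 2 \pmod 3$ exceeding $\max|s_i|$, tries random multipliers $r$, and keeps the largest set $\{s_i : rs_i \bmod p$ lands in the middle third$\}$. The helper names (is_prime, sum_free_subset) and the use of repeated random trials are ours; the proof only guarantees that some $r$ works.

import random

def is_prime(m):
    if m < 2:
        return False
    d = 2
    while d * d <= m:
        if m % d == 0:
            return False
        d += 1
    return True

def sum_free_subset(S, trials=1000):
    # Prime p = 3k + 2 exceeding max |s_i|, so p divides no s_i.
    p = max(abs(s) for s in S) + 1
    while not (is_prime(p) and p % 3 == 2):
        p += 1
    k = (p - 2) // 3
    middle = range(k + 1, 2 * k + 2)          # {k+1, ..., 2k+1}, sum-free mod p
    best = set()
    for _ in range(trials):
        r = random.randrange(1, p)            # uniform on {1, ..., p-1}
        A = {s for s in S if (r * s) % p in middle}
        if len(A) > len(best):
            best = A
    return best                               # sum-free, and typically |A| > |S|/3

S = list(range(1, 31))
A = sum_free_subset(S)
print(sorted(A), len(A), ">", len(S) / 3)

Each set returned for a single $r$ is automatically sum-free, by exactly the congruence argument in the proof; taking the best over many trials simply makes it likely that we see an $r$ achieving $|A| > n/3$.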
Example 1.4 (2-colouring hypergraphs). A hypergraph H is simply an ordered
pair (V, E) where V is a set of vertices and E is a set of hyperedges, i.e., a set
of subsets of V . Hyperedges are often called simply edges.
Note that E is a set, so each possible hyperedge (subset of V ) is either
present or not, just as each possible edge of a graph is either present or not.
If we want to allow multiple copies of the same hyperedge, we could define
multi-hypergraphs in analogy with multigraphs.
$H$ is $r$-uniform if $|e| = r$ for all $e \in E$, i.e., if every hyperedge consists of
exactly $r$ vertices. In particular, a 2-uniform hypergraph is simply a graph.
An example of a 3-uniform hypergraph is the Fano plane shown in the figure.
This has 7 vertices and 7 hyperedges; in the drawing, the 6 straight lines and the
circle each represent a hyperedge. (As usual, how they are drawn is irrelevant;
all that matters is which vertices each hyperedge contains.)
A (proper) 2-colouring of a hypergraph H is a red/blue colouring of the
vertices such that every hyperedge contains vertices of both colours. (If H is
2-uniform, this is the same as a proper 2-colouring of $H$ as a graph.) We say that $H$
is 2-colourable if it has a 2-colouring. (Sometimes this is called having property
B.)
Let $m(r)$ be the minimum $m$ such that there exists a non-2-colourable $r$-uniform
hypergraph with $m$ edges. It is easy to check that $m(2) = 3$. It is
harder to check that $m(3) = 7$ (there is no need to do this!).
Theorem 1.6. For $r \ge 2$ we have $m(r) \ge 2^{r-1}$.
Proof. Let $H = (V, E)$ be any $r$-uniform hypergraph with $m < 2^{r-1}$ edges.
Colour the vertices red and blue randomly: each red with probability 1/2 and
blue with probability 1/2, with different vertices coloured independently. For
any $e \in E$, the probability that $e$ is monochromatic is $2(1/2)^r = (1/2)^{r-1}$.
By the union bound, it follows that the probability that there is at least one
monochromatic edge is at most $m (1/2)^{r-1} < 1$. Thus there exists a good
colouring.
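The proof of Theorem 1.6 is effectively an algorithm: if an $r$-uniform hypergraph has fewer than $2^{r-1}$ edges, a uniformly random colouring is proper with positive probability, so repeated sampling finds one quickly. Here is a small Python sketch along those lines; the helper name find_2_colouring and the particular example hypergraph are ours.

import random

def find_2_colouring(vertices, edges, max_tries=10000):
    # Each random colouring is proper with probability >= 1 - m * 2^(1-r) > 0
    # when the r-uniform hypergraph has m < 2^(r-1) edges (Theorem 1.6).
    for _ in range(max_tries):
        colour = {v: random.choice("RB") for v in vertices}
        if all(len({colour[v] for v in e}) == 2 for e in edges):
            return colour
    return None

# A 4-uniform hypergraph on 8 vertices with 7 < 2^3 edges.
V = range(8)
E = [(0, 1, 2, 3), (4, 5, 6, 7), (0, 2, 4, 6), (1, 3, 5, 7),
     (0, 1, 4, 5), (2, 3, 6, 7), (0, 3, 5, 6)]
print(find_2_colouring(V, E))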
We can also obtain a bound in the other direction; this is slightly harder.
Theorem 1.7 (Erdős, 1964). If $r$ is large enough then $m(r) \le 3 r^2 2^r$.
Proof. Fix $r \ge 3$. Let $V$ be a set of $n$ vertices, where $n$ (which depends on $r$)
will be chosen later.
Let $e_1, \ldots, e_m$ be chosen independently and uniformly at random from all
$\binom{n}{r}$ possible hyperedges on $V$. Although repetitions are possible, the hypergraph
\[ H = (V, \{e_1, \ldots, e_m\}) \]
certainly has at most $m$ hyperedges.
Let $\chi$ be any red/blue colouring of $V$. (Not a random one this time.) Then $\chi$
has either at least $n/2$ red vertices, or at least $n/2$ blue ones. It follows that at
least (crudely) $\binom{n/2}{r}$ of the possible hyperedges are monochromatic with respect
to $\chi$.
Let $p = p_\chi$ denote the probability that $e_1$ (a hyperedge chosen at random
from all possibilities) is monochromatic with respect to $\chi$. Then
\[ p \ge \frac{\binom{n/2}{r}}{\binom{n}{r}} = \frac{(n/2)(n/2-1)\cdots(n/2-r+1)}{n(n-1)\cdots(n-r+1)} \ge \left(\frac{n/2-r+1}{n-r+1}\right)^r \ge \left(\frac{n/2-r}{n-r}\right)^r = 2^{-r}\left(1 - \frac{r}{n-r}\right)^r. \]
Set $n = r^2$. Then $p \ge 2^{-r}(1 - 1/(r-1))^r$. Since $(1 - 1/(r-1))^r \to e^{-1}$ as
$r \to \infty$, we see that $p \ge \frac{1}{3}\, 2^{-r}$ if $r$ is large enough, which we assume from now
on.
The probability that a given, fixed colouring $\chi$ is a proper 2-colouring of
our random hypergraph $H$ is simply the probability that none of $e_1, \ldots, e_m$
is monochromatic with respect to $\chi$. Since $e_1, \ldots, e_m$ are independent, this is
exactly $(1-p)^m$.
By the union bound, the probability that $H$ is 2-colourable is at most the
sum over all possible $\chi$ of the probability that $\chi$ is a 2-colouring, which is at
most $2^n (1-p)^m$.
Recall that $n = r^2$, and set $m = 3 r^2 2^r$. Then using the standard fact
$1 - x \le e^{-x}$, we have
\[ 2^n (1-p)^m \le 2^n e^{-pm} \le 2^{r^2} e^{-\frac{3 r^2 2^r}{3 \cdot 2^r}} = 2^{r^2} e^{-r^2} < 1. \]
Therefore there exists an $r$-uniform hypergraph $H$ with at most $m$ edges and
no 2-colouring.
End L2 Start L3
Remark. Why does the first moment method work? Often, there is some complicated
event $A$ whose probability we want to know (or at least bound). For
example, $A$ might be the event that the random colouring is a 2-colouring
of a fixed (complicated) hypergraph $H$. Often, $A$ is constructed by taking the
union or intersection of simple events $A_1, \ldots, A_k$.
If $A_1, \ldots, A_k$ are independent, then
\[ P(A_1 \cap \cdots \cap A_k) = \prod_i P(A_i) \quad\text{and}\quad P(A_1 \cup \cdots \cup A_k) = 1 - \prod_i (1 - P(A_i)). \]
If $A_1, \ldots, A_k$ are mutually exclusive, then
\[ P(A_1 \cup \cdots \cup A_k) = \sum_i P(A_i). \]
(For example, this gives us the probability $2(1/2)^{|e|}$ that a fixed hyperedge is
monochromatic in a random 2-colouring.)
In general, the relationship between the $A_i$ may be very complicated. However,
if $X$ is the number of $A_i$ that hold, then we always have $E[X] = \sum_i P(A_i)$
and
\[ P\Bigl(\bigcup_i A_i\Bigr) = P(X > 0) \le \sum_i P(A_i). \]
The key point is that while the left-hand side is complicated, the right-hand
side is simple: we evaluate it by looking at one simple event at a time.
So far we have used the expectation only via the observations $P(X \ge E[X]) > 0$
and $P(X \le E[X]) > 0$, together with the union bound. A slightly more
sophisticated (but still simple) way to use it is via Markov's inequality.
Lemma 1.8 (Markov's inequality). If $X$ is a random variable taking only non-negative
values and $t > 0$, then $P(X \ge t) \le E[X]/t$.
Proof. $E[X] \ge t\,P(X \ge t) + 0\cdot P(X < t)$.
We now start on one of our main themes, the study of the random graph
$G(n,p)$.
Definition. Given an integer $n \ge 1$ and a real number $0 \le p \le 1$, the random
graph $G(n,p)$ is the graph with vertex set $[n] = \{1, 2, \ldots, n\}$ in which each
possible edge $ij$, $1 \le i < j \le n$, is present with probability $p$, independently of
the others.
Thus, for any graph $H$ on $[n]$,
\[ P\bigl(G(n,p) = H\bigr) = p^{e(H)} (1-p)^{\binom{n}{2} - e(H)}. \]
For example, if $p = 1/2$, then all $2^{\binom{n}{2}}$ graphs on $[n]$ are equally likely.
Remark. It is important to remember that we work with labelled graphs. For
example, the probability that $G(3,p)$ is a star with three vertices is $3 p^2 (1-p)$,
since there are three (isomorphic) graphs on $\{1,2,3\}$ that are stars.
We use the notation $G(n,p)$ for the probability space of graphs on $[n]$ with
the probabilities above. All of $G \in G(n,p)$, $G = G(n,p)$ and $G \sim G(n,p)$ mean
exactly the same thing, namely that $G$ is a random graph with this distribution.
This model of random graphs is often called the Erdős-Rényi model, although
in fact it was first defined by Gilbert. Erdős and Rényi introduced an
essentially equivalent model, and were the real founders of the theory of random
graphs, so associating the model with their names is reasonable!
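As a concrete reference point, here is a minimal Python sampler for $G(n,p)$, generating each of the $\binom{n}{2}$ possible edges independently with probability $p$; the function name sample_gnp is ours.

import random

def sample_gnp(n, p):
    # Vertex set [n] = {1, ..., n}; each pair ij (i < j) is an edge
    # independently with probability p.
    adj = {v: set() for v in range(1, n + 1)}
    for i in range(1, n + 1):
        for j in range(i + 1, n + 1):
            if random.random() < p:
                adj[i].add(j)
                adj[j].add(i)
    return adj

G = sample_gnp(10, 0.5)
print(sum(len(nbrs) for nbrs in G.values()) // 2, "edges out of", 45)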
Example 1.5 (High girth and chromatic number). Let us recall some definitions.
The girth $g(G)$ of a graph $G$ is the length of the shortest cycle in $G$, or $\infty$
if $G$ contains no cycles. The chromatic number $\chi(G)$ is the least $k$ such that
$G$ has a $k$-colouring of the vertices in which adjacent vertices receive different
colours, and the independence number $\alpha(G)$ is the maximum number of vertices
in an independent set $I$ in $G$, i.e., a set of vertices of $G$ no two of which are
joined by an edge.
Recall that $\chi(G) \ge |G|/\alpha(G)$.
Theorem 1.9 (Erdős, 1959). For any $k$ and $\ell$ there exists a graph $G$ with
$\chi(G) \ge k$ and $g(G) \ge \ell$.
There are non-random proofs of this, but they are not so easy.
The idea of the proof is to consider $G(n,p)$ for suitable $n$ and $p$. We will
show separately that (a) very likely there are few short cycles, and (b) very
likely there is no large independent set. Then it is likely that the properties in
(a) and (b) both hold, and after deleting a few vertices (to kill the short cycles),
we obtain the graph we need.
Proof. Fix $k, \ell \ge 3$. There are
\[ \frac{n(n-1)\cdots(n-r+1)}{2r} \]
possible cycles of length $r$ in $G(n,p)$: the numerator counts sequences of $r$
distinct vertices, and the denominator accounts for the fact that each cycle
corresponds to $2r$ sequences, depending on the choice of starting point and
direction.
Let $X_r$ be the number of $r$-cycles in $G(n,p)$. Then
\[ E[X_r] = \frac{n(n-1)\cdots(n-r+1)}{2r}\, p^r \le \frac{n^r p^r}{2r}. \]
Set $p = n^{-1+1/\ell}$, and let $X$ be the number of short cycles, i.e., cycles with
length less than $\ell$. Then $X = X_3 + X_4 + \cdots + X_{\ell-1}$, so
\[ E[X] = \sum_{r=3}^{\ell-1} E[X_r] \le \sum_{r=3}^{\ell-1} \frac{(np)^r}{2r} = \sum_{r=3}^{\ell-1} \frac{n^{r/\ell}}{2r} = O(n^{1-1/\ell}) = o(n). \]
By Markov's inequality it follows that
\[ P(X \ge n/2) \le \frac{E[X]}{n/2} \to 0. \]
Set $m = n^{1-1/(2\ell)}$. Let $Y$ be the number of independent sets in $G(n,p)$ of
size (exactly) $m$. Then
\[ E[Y] = \binom{n}{m} (1-p)^{\binom{m}{2}} \le \left(\frac{en}{m}\right)^m e^{-p\binom{m}{2}} = \left(\frac{en}{m}\, e^{-p\frac{m-1}{2}}\right)^m. \]
Now
\[ p\,\frac{m-1}{2} \ge \frac{pm}{3} \ge \frac{n^{-1+1/\ell}\, n^{1-1/(2\ell)}}{3} = \frac{n^{1/(2\ell)}}{3}. \]
Thus $p(m-1)/2 \ge 2\log n$ if $n$ is large enough, which we may assume. But then
\[ E[Y] \le \left(\frac{en}{m}\, n^{-2}\right)^m \to 0, \]
and by Markov's inequality we have
\[ P(Y > 0) = P(Y \ge 1) \le E[Y] \to 0, \]
i.e., $P(\alpha(G) \ge m) \to 0$.
Combining the two results above, we have $P(X \ge n/2 \text{ or } \alpha(G) \ge m) \to 0$.
Hence, if $n$ is large enough, there exists some graph $G$ with $n$ vertices, with
fewer than $n/2$ short cycles, and with $\alpha(G) < m$.
Construct $G'$ by deleting one vertex from each short cycle of $G$. Then
$g(G') \ge \ell$, and $|G'| \ge n - n/2 = n/2$. Also, $\alpha(G') \le \alpha(G) < m$. Thus
\[ \chi(G') \ge \frac{|G'|}{\alpha(G')} \ge \frac{n/2}{m} = \frac{n/2}{n^{1-1/(2\ell)}} = \frac{1}{2}\, n^{1/(2\ell)}, \]
which is larger than $k$ if $n$ is large enough.
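The two estimates in this proof are easy to evaluate numerically. The sketch below (our own helper names, with logarithms used to avoid floating-point underflow) computes $E[X]$ for the short cycles and the logarithm of the bound on $E[Y]$ for a concrete choice of $n$ and $\ell$; note that this crude version of the estimates only bites for very large $n$.

import math

def expected_short_cycles(n, ell):
    # E[X] = sum_{r=3}^{ell-1} n(n-1)...(n-r+1) p^r / (2r), with p = n^(-1+1/ell).
    p = n ** (-1 + 1 / ell)
    return sum(math.prod(n - i for i in range(r)) * p ** r / (2 * r)
               for r in range(3, ell))

def log_independent_set_bound(n, ell):
    # log of ((en/m) * exp(-p(m-1)/2))^m, the bound on E[Y], m = ceil(n^(1-1/(2 ell))).
    p = n ** (-1 + 1 / ell)
    m = math.ceil(n ** (1 - 1 / (2 * ell)))
    return m * (math.log(math.e * n / m) - p * (m - 1) / 2)

n, ell = 10 ** 20, 4
print(expected_short_cycles(n, ell), "vs n/2 =", n / 2)   # few short cycles
print(log_independent_set_bound(n, ell))                  # hugely negative: E[Y] -> 0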
End L3 Start L4
Our final example of the first moment method will be something rather
different.
Example 1.6 (Antichains).
A set system on $V$ is just a collection $\mathcal{A}$ of subsets of $V$, i.e., a subset of the
power set $\mathcal{P}(V)$. Thus $\mathcal{A}$ corresponds to the edge set of a hypergraph on $V$;
although the concepts are the same, the viewpoints tend to be slightly different.
The $r$-th layer $L_r$ of $\mathcal{P}(V)$ is just $\{A \subseteq V : |A| = r\}$.
We often think of an arrow from $A$ to $B$ whenever $A, B \in \mathcal{P}(V)$ with $A \subseteq B$
and $A \ne B$. A chain is a sequence $A_1, \ldots, A_t$ of subsets of $V$ with $A_1 \subsetneq A_2 \subsetneq \cdots \subsetneq A_t$,
i.e., each subset is strictly contained in the next. An antichain is just
a collection of sets $A_1, \ldots, A_r$ with no arrows, i.e., no $A_i$ contained in any other
$A_j$.
Suppose that $|V| = n$. How many sets can a chain contain? At most $n+1$,
since we can have at most one set from each layer. Also, every maximal chain
$\mathcal{C}$ is of the form $(C_0, C_1, \ldots, C_n)$ with $C_r \in L_r$; in other words, we start from
$C_0 = \emptyset$ and add elements one-by-one in any order, finishing with $C_n = V$.
A harder question is: how many sets can an antichain $\mathcal{A} \subseteq \mathcal{P}(V)$ contain?
An obvious candidate for a large antichain is the biggest layer $L_{\lfloor n/2 \rfloor}$.
Theorem 1.10 (LYM Inequality). Let $\mathcal{A} \subseteq \mathcal{P}(V)$ be an antichain, where $|V| = n$.
Then
\[ \sum_{r=0}^{n} \frac{|\mathcal{A} \cap L_r|}{|L_r|} \le 1. \]
Proof. Let $\mathcal{C} = (C_0, C_1, \ldots, C_n)$ be a maximal chain chosen uniformly at random.
Since $C_r$ is uniformly random from $L_r$, we have
\[ P(C_r \in \mathcal{A}) = \frac{|\mathcal{A} \cap L_r|}{|L_r|}. \]
Now $|\mathcal{C} \cap \mathcal{A}| \le 1$: otherwise, for two sets $A, B$ in the intersection, since they are
in $\mathcal{C}$ we have $A \subsetneq B$ or $B \subsetneq A$, which is impossible for two sets in $\mathcal{A}$.
Thus
\[ 1 \ge E[|\mathcal{C} \cap \mathcal{A}|] = \sum_{r=0}^{n} P(C_r \in \mathcal{A}) = \sum_{r=0}^{n} \frac{|\mathcal{A} \cap L_r|}{|L_r|}. \]
Corollary 1.11 (Sperner's Theorem). If $|V| = n$ and $\mathcal{A} \subseteq \mathcal{P}(V)$ is an
antichain, then $|\mathcal{A}| \le \max_r |L_r| = \binom{n}{\lfloor n/2 \rfloor}$.
Proof. Exercise.
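As a quick numerical illustration (with our own helper names), the following Python snippet computes the LYM sum for a family of subsets of $\{1, \ldots, n\}$; taking the whole middle layer shows that Sperner's bound is attained, with LYM sum exactly 1.

from itertools import combinations
from math import comb

def lym_sum(family, n):
    # sum over r of |A ∩ L_r| / |L_r| for a family of subsets of {1, ..., n}
    counts = {}
    for A in family:
        counts[len(A)] = counts.get(len(A), 0) + 1
    return sum(c / comb(n, r) for r, c in counts.items())

n = 6
middle_layer = [frozenset(c) for c in combinations(range(1, n + 1), n // 2)]
print(len(middle_layer), "=", comb(n, n // 2))   # the largest layer has 20 sets
print(lym_sum(middle_layer, n))                  # exactly 1.0, so the bound is tight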
2 The second moment method
Suppose $(X_n)$ is a sequence of random variables, each taking non-negative integer
values. If $E[X_n] \to 0$ as $n \to \infty$, then we have $P(X_n > 0) = P(X_n \ge 1) \le E[X_n] \to 0$.
Under what conditions can we show that $P(X_n > 0) \to 1$? Simply
$E[X_n] \to \infty$ is not enough: it is easy to find examples where $E[X_n] \to \infty$, but
$P(X_n = 0) \to 1$. Rather, we need some control on the difference between $X_n$
and $E[X_n]$.
Definition. The variance $\mathrm{Var}[X]$ of a random variable $X$ is defined by
\[ \mathrm{Var}[X] = E\bigl[(X - EX)^2\bigr] = E\bigl[X^2\bigr] - \bigl(EX\bigr)^2. \]
We recall a basic fact from probability.
Lemma 2.1 (Chebyshev's Inequality). Let $X$ be a random variable and $t > 0$.
Then
\[ P\bigl(|X - EX| \ge t\bigr) \le \frac{\mathrm{Var}[X]}{t^2}. \]
Proof. We have
\[ P\bigl(|X - EX| \ge t\bigr) = P\bigl((X - EX)^2 \ge t^2\bigr). \]
Applying Markov's inequality, this is at most $E\bigl[(X - EX)^2\bigr]/t^2 = \mathrm{Var}[X]/t^2$.
In practice, we usually use this as follows.
Corollary 2.2. Let $(X_n)$ be a sequence of random variables with $E[X_n] = \mu_n > 0$
and $\mathrm{Var}[X_n] = o(\mu_n^2)$. Then $P(X_n = 0) \to 0$.
Proof.
\[ P(X_n = 0) \le P\bigl(|X_n - \mu_n| \ge \mu_n\bigr) \le \frac{\mathrm{Var}[X_n]}{\mu_n^2} \to 0. \]
Remark. The mean $\mu = E[X]$ is usually easy to calculate. Since $\mathrm{Var}[X] = E[X^2] - \mu^2$,
this means that knowing the variance is equivalent to knowing the
second moment $E[X^2]$. In particular, with $\mu_n = E[X_n]$, the condition $\mathrm{Var}[X_n] = o(\mu_n^2)$
is exactly equivalent to $E[X_n^2] = (1 + o(1))\mu_n^2$, i.e., $E[X_n^2] \sim \mu_n^2$:
\[ \mathrm{Var}[X_n] = o(\mu_n^2) \iff E[X_n^2] \sim \mu_n^2. \]
Sometimes the second moment is more convenient to calculate than the variance.
Suppose that $X = I_1 + \cdots + I_k$, where each $I_i$ is the indicator function of
some event $A_i$. We have seen that $E[X]$ is easy to calculate; $E[X^2]$ is not too
much harder:
\[ E[X^2] = E\Bigl[\sum_i I_i \sum_j I_j\Bigr] = E\Bigl[\sum_{i,j} I_i I_j\Bigr] = \sum_{i,j} E[I_i I_j] = \sum_{i=1}^{k}\sum_{j=1}^{k} P(A_i \cap A_j). \]
Example 2.1 ($K_4$s in $G(n,p)$).
Theorem 2.3. Let $p = p(n)$ be a function of $n$.
1. If $n^4 p^6 \to 0$ as $n \to \infty$, then $P(G(n,p) \text{ contains a } K_4) \to 0$.
2. If $n^4 p^6 \to \infty$ as $n \to \infty$, then $P(G(n,p) \text{ contains a } K_4) \to 1$.
Proof. Let $X$ (really $X_n$, as the distribution depends on $n$) denote the number
of $K_4$s in $G(n,p)$. For each set $S$ of 4 vertices from $[n]$, let $A_S$ be the event that
$S$ spans a $K_4$ in $G(n,p)$. Then
\[ \mu = E[X] = \sum_S P(A_S) = \binom{n}{4} p^6 = \frac{n(n-1)(n-2)(n-3)}{4!}\, p^6 \sim \frac{n^4 p^6}{24}. \]
In case 1 it follows that $E[X] \to 0$, so $P(X > 0) \to 0$, as required.
For the second part of the result, we have $E[X^2] = \sum_{S,T} P(A_S \cap A_T)$. The
contributions from all terms where $S$ and $T$ meet in a given number of vertices
are as follows:
\[
\begin{array}{cl}
|S \cap T| & \text{contribution} \\
0 & \binom{n}{4}\binom{n-4}{4}\, p^{12} \sim \frac{n^4}{24}\cdot\frac{n^4}{24}\, p^{12} \sim \mu^2 \\
1 & \binom{n}{4}\, 4\binom{n-4}{3}\, p^{12} = \Theta(n^7 p^{12}) \\
2 & \binom{n}{4}\binom{4}{2}\binom{n-4}{2}\, p^{11} = \Theta(n^6 p^{11}) \\
3 & \binom{n}{4}\binom{4}{3}\binom{n-4}{1}\, p^{9} = \Theta(n^5 p^{9}) \\
4 & \binom{n}{4}\, p^{6} = \mu
\end{array}
\]
Recall that by assumption $n^4 p^6 \to \infty$, so $\mu \to \infty$ and the last contribution
is $o(\mu^2)$. How do the other contributions compare to $\mu^2$? Firstly, since
$\mu^2 = \Theta(n^8 p^{12})$, we have $n^7 p^{12} = o(\mu^2)$. For the others, note that $n^{-4} p^{-6} \to 0$,
i.e., $p^{-6} = o(n^4)$, so $p^{-1} = o(n^{2/3})$. Thus
\[ \frac{n^6 p^{11}}{n^8 p^{12}} = n^{-2} p^{-1} = o(n^{-2} n^{2/3}) = o(1), \]
and
\[ \frac{n^5 p^{9}}{n^8 p^{12}} = n^{-3} p^{-3} = o(n^{-3} n^{2}) = o(1). \]
Putting this all together, $E[X^2] = \mu^2 + o(\mu^2)$, so $\mathrm{Var}[X] = o(\mu^2)$, and by
Corollary 2.2 we have $P(X = 0) \to 0$.
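Theorem 2.3 is easy to see experimentally. The brute-force Monte Carlo sketch below (the function has_k4 and the particular parameters are ours) samples $G(n,p)$ with $p = c\,n^{-2/3}$ for a few values of $c$ and estimates the probability that a $K_4$ is present; it only works for small $n$, where checking all $\binom{n}{4}$ quadruples is feasible.

import random
from itertools import combinations

def has_k4(n, p):
    # Sample G(n, p) and check all 4-sets of vertices for a complete K4.
    adj = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            adj[i][j] = adj[j][i] = random.random() < p
    return any(all(adj[u][v] for u, v in combinations(S, 2))
               for S in combinations(range(n), 4))

n, trials = 30, 100
for c in (0.3, 1.5, 4.0):
    p = c * n ** (-2 / 3)                       # around the threshold n^(-2/3)
    freq = sum(has_k4(n, p) for _ in range(trials)) / trials
    print("c =", c, " estimated P(K4 present) =", freq)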
End L4 Start L5
Remark. Sometimes one considers the 2nd factorial moment of $X$, namely
$E_2[X] = E[X(X-1)]$. If $X$ is the number of events $A_i$ that hold, then $X(X-1)$
is the number of ordered pairs of distinct events that hold, so
\[ E_2[X] = \sum_{j \ne i} P(A_i \cap A_j). \]
Since $X^2 = X(X-1) + X$, we have $E[X^2] = E_2[X] + E[X]$, so if we know the
mean, then knowing any of $E[X^2]$, $E_2[X]$ or $\mathrm{Var}[X]$ lets us easily calculate the
others.
When applying the second moment method, our ultimate aim is always
to estimate the variance, showing that it is small compared to the square of
the mean, so Corollary 2.2 applies. So far we first calculated $E[X^2]$, due to
the simplicity of the formula $\sum_{i,j} P(A_i \cap A_j)$. However, this involves some
unnecessary work when many of the events are independent. We can avoid
this by directly calculating the variance.
Suppose as usual that $X = I_1 + \ldots + I_k$, with $I_i$ the indicator function of
$A_i$. Then
\[ \mathrm{Var}[X] = E[X^2] - (E[X])^2 = \sum_{i,j} P(A_i \cap A_j) - \Bigl(\sum_i P(A_i)\Bigr)\Bigl(\sum_j P(A_j)\Bigr) = \sum_{i,j} \bigl(P(A_i \cap A_j) - P(A_i)P(A_j)\bigr). \]
Write $i \sim j$ if $i \ne j$ and $A_i$ and $A_j$ are dependent. The contribution from terms
where $A_i$ and $A_j$ are independent is zero by definition, so
\[ \mathrm{Var}[X] = \sum_i \bigl(P(A_i) - P(A_i)^2\bigr) + \sum_{j \sim i} \bigl(P(A_i \cap A_j) - P(A_i)P(A_j)\bigr) \]
\[ \le E[X] + \sum_{j \sim i} P(A_i \cap A_j). \]
Note that the second last line is an exact formula for the variance. The last
line is just an upper bound, but this upper bound is often good enough.
The bound above gives another standard way of applying the 2nd moment
method.
Corollary 2.4. If $\mu = E[X] \to \infty$ and $\Delta = \sum_{j \sim i} P(A_i \cap A_j) = o(\mu^2)$, then
$P(X > 0) \to 1$.
Proof. We have
\[ \frac{\mathrm{Var}[X]}{\mu^2} \le \frac{\mu + \Delta}{\mu^2} = \frac{1}{\mu} + \frac{\Delta}{\mu^2} \to 0. \]
Now apply Chebyshev's inequality in the form of Corollary 2.2.
Definition. Let $\mathcal{P}$ be a property of graphs (e.g., 'contains a $K_4$'). A function
$p^*(n)$ is called a threshold function for $\mathcal{P}$ in the model $G(n,p)$ if
- $p(n) = o(p^*(n))$ implies that $P(G(n, p(n)) \text{ has } \mathcal{P}) \to 0$, and
- $p(n)/p^*(n) \to \infty$ implies that $P(G(n, p(n)) \text{ has } \mathcal{P}) \to 1$.
Theorem 2.3 says that $n^{-2/3}$ is a threshold function for $G(n,p)$ to contain a
$K_4$. Note that threshold functions are not quite uniquely defined (e.g., $2n^{-2/3}$
is also one). Also, not every property has a threshold function, although many
do.
Example 2.2 (Small subgraphs in $G(n,p)$). Fix a graph $H$ with $v$ vertices and
$e$ edges. What is the threshold for copies of $H$ to appear in $G(n,p)$? (Here $H$
is small in the sense that its size is fixed as $n \to \infty$.)
Let $X$ be the number of copies of $H$ in $G(n,p)$. There are $n(n-1)\cdots(n-v+1)$
injective maps $\phi : V(H) \to [n]$. Two maps $\phi, \phi'$ give the same copy if
and only if $\phi' = \phi \circ \alpha$ for some automorphism $\alpha$ of $H$. Thus there are
\[ \frac{n(n-1)\cdots(n-v+1)}{\mathrm{aut}(H)} \]
possible copies, where $\mathrm{aut}(H)$ is the number of automorphisms of $H$. (For
example, $\mathrm{aut}(C_r) = 2r$.) It follows that
\[ E[X] = \frac{n(n-1)\cdots(n-v+1)}{\mathrm{aut}(H)}\, p^e \sim \frac{n^v p^e}{\mathrm{aut}(H)} = \Theta(n^v p^e). \]
This suggests that the threshold should be $p = n^{-v/e}$.
Unfortunately, this can't quite be right. Consider for example $H$ to be a $K_4$
with an extra edge hanging off, so $v = 5$ and $e = 7$. Our proposed threshold
is $p = n^{-5/7}$, which is much smaller than $p = n^{-2/3}$. Consider the range
in between, where $p/n^{-5/7} \to \infty$ but $p/n^{-2/3} \to 0$. Then $E[X] \to \infty$, but the
probability that $G(n,p)$ contains a $K_4$ tends to 0, so the probability that $G(n,p)$
contains a copy of $H$ tends to 0.
The problem is that $H$ contains a subgraph $K_4$ which is hard to find, because
its $e/v$ ratio is larger than that of $H$.
Definition. The edge density $d(H)$ of a graph $H$ is $e(H)/|H|$, i.e., $1/2$ times
the average degree of $H$.
Definition. $H$ is balanced if $\emptyset \ne H' \subseteq H$ implies $d(H') \le d(H)$, and strictly
balanced if $\emptyset \ne H' \subseteq H$ and $H' \ne H$ implies $d(H') < d(H)$.
Examples of strictly balanced graphs are complete graphs, trees, and connected
regular graphs. Examples of balanced but not strictly balanced graphs
are disconnected regular graphs, or a triangle with an extra edge attached (making
four vertices and four edges).
For balanced graphs, $p = n^{-v/e}$ does turn out to be the threshold.
Theorem 2.5. Let $H$ be a balanced graph with $|H| = v$ and $e(H) = e$. Then
$p^*(n) = n^{-v/e}$ is a threshold function for the property '$G(n,p)$ contains a copy
of $H$'.
Proof. Let $X$ denote the number of copies of $H$ in $G(n,p)$, and set $\mu = EX$, so
$\mu = \Theta(n^v p^e)$. If $p/n^{-v/e} \to 0$ then $\mu \to 0$, so $P(X \ge 1) \to 0$.
Suppose that $p/n^{-v/e} \to \infty$, i.e., that $n^v p^e \to \infty$.
End L5 Start L6
Let $H_1, \ldots, H_N$ list all possible copies of $H$ with vertices in $[n]$, and let $A_i$
denote the event that the $i$th copy $H_i$ is present in $G = G(n,p)$. Let
\[ \Delta = \sum_{j \sim i} P(A_i \cap A_j) = \sum_{j \sim i} P(H_i \cup H_j \subseteq G). \]
Note that in all terms above, the graphs $H_i$ and $H_j$ share at least one edge.
We split the sum by the number $r$ of vertices of $H_i \cap H_j$ and the number $s$
of edges of $H_i \cap H_j$. Note that $H_i \cap H_j$ is a non-empty subgraph of $H_i$, which
is isomorphic to $H$. Since $H$ is balanced, we thus have
\[ \frac{s}{r} = d(H_i \cap H_j) \le d(H) = \frac{e}{v}, \]
so $s \le re/v$.
The contribution to $\Delta$ from terms with given $r$ and $s$ is
\[ O\bigl(n^{2v-r} p^{2e-s}\bigr) = O\bigl(\mu^2/(n^r p^s)\bigr). \]
Now
\[ n^r p^s \ge n^r p^{re/v} = (n^v p^e)^{r/v} = \Theta(\mu^{r/v}). \]
Since $\mu \to \infty$ and $r/v > 0$, it follows that $n^r p^s \to \infty$, so the contribution from
this pair $(r, s)$ is $o(\mu^2)$.
Since there are only a fixed number of pairs to consider, it follows that
$\Delta = o(\mu^2)$. Hence by Corollary 2.4, $P(X > 0) \to 1$.
In general the threshold is $n^{-1/d(H')}$, where $H'$ is the densest subgraph of
$H$. The proof is almost the same.
Remark. If $H$ is strictly balanced and $p = c n^{-v/e}$, then $\mu$ tends to a constant
and the $r$th factorial moment $E_r[X] = E[X(X-1)\cdots(X-r+1)]$ satisfies
$E_r[X] \sim \mu^r$, from which one can show that the number of copies of $H$ has
asymptotically a Poisson distribution. We shall not do this.
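A brute-force check of the quantity controlling the general threshold mentioned above is easy to write. The Python sketch below (our own helper names) computes the maximum density over non-empty subgraphs; the general threshold is $n$ to the power $-1$ over this maximum, and $H$ is balanced exactly when the maximum equals $d(H)$. The example is the $K_4$ with a pendant edge discussed in Example 2.2.

from itertools import combinations

def max_subgraph_density(vertices, edges):
    # max over non-empty W ⊆ V(H) of (number of edges inside W) / |W|
    best = 0.0
    for k in range(1, len(vertices) + 1):
        for W in combinations(vertices, k):
            Wset = set(W)
            e_in = sum(1 for e in edges if set(e) <= Wset)
            best = max(best, e_in / k)
    return best

# K4 with a pendant edge: d(H) = 7/5, but the K4 inside has density 3/2.
V = [1, 2, 3, 4, 5]
E = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4), (4, 5)]
print(len(E) / len(V), max_subgraph_density(V, E))   # 1.4 vs 1.5: H is not balanced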
3 The Erdős-Lovász local lemma
Suppose that we have some bad events $A_1, \ldots, A_n$, and we want to show that
it's possible that no $A_i$ holds, no matter how unlikely. If $\sum P(A_i) < 1$ then the
union bound gives what we want. What about if $\sum P(A_i)$ is large? In general
of course it might be that $\bigcup A_i$ always holds. One trivial case where we can rule
this out is when the $A_i$ are independent. Then
\[ P\Bigl(\bigcap A_i^c\Bigr) = \prod P(A_i^c) = \prod_{i=1}^{n} (1 - P(A_i)) > 0, \]
provided each $A_i$ has probability less than 1.
What if each $A_i$ 'depends' only on a few others?
Recall that $A_1, \ldots, A_n$ are independent if for all disjoint $S, T \subseteq [n]$ we have
\[ P\Bigl(\bigcap_{i \in S} A_i \cap \bigcap_{i \in T} A_i^c\Bigr) = \prod_{i \in S} P(A_i) \prod_{i \in T} P(A_i^c). \]
This is not the same as each pair of events being independent (see below).
Definition. An event $A$ is independent of a set $\{B_1, \ldots, B_n\}$ of events if for
all disjoint $S, T \subseteq [n]$ we have
\[ P\Bigl(A \Bigm| \bigcap_{i \in S} B_i \cap \bigcap_{i \in T} B_i^c\Bigr) = P(A), \]
i.e., if knowing that certain $B_i$ hold and certain others do not does not affect
the probability that $A$ holds.
For example, suppose that each of the following four sequences of coin tosses
happens with probability 1/4: TTT, THH, HTH and HHT. Let $A_i$ be the
event that the $i$th toss is H. Then one can check that any two events $A_i$ are
independent, but $\{A_1, A_2, A_3\}$ is not a set of independent events. Similarly, $A_1$
is not independent of $\{A_2, A_3\}$, since $P(A_1 \mid A_2 \cap A_3) = 0$.
Recall that a digraph on a vertex set $V$ is a set of ordered pairs of distinct
elements of $V$, i.e., a graph in which each edge has an orientation, there are
no loops, and there is at most one edge from a given $i$ to a given $j$, but we may
have edges in both directions. We write $i \to j$ if there is an edge from $i$ to $j$.
Definition. A digraph $D$ on $[n]$ is called a dependency digraph for the events
$A_1, \ldots, A_n$ if for each $i$ the event $A_i$ is independent of the set of events $\{A_j :
j \ne i,\ i \not\to j\}$.
Theorem 3.1 (Local Lemma, general form). Let $D$ be a dependency digraph
for the events $A_1, \ldots, A_n$. Suppose that there are real numbers $0 < x_i < 1$ such
that
\[ P(A_i) \le x_i \prod_{j : i \to j} (1 - x_j) \]
for each $i$. Then
\[ P\Bigl(\bigcap_{i=1}^{n} A_i^c\Bigr) \ge \prod_{i=1}^{n} (1 - x_i) > 0. \]
Roughly speaking, $A_i$ is allowed to depend on $A_j$ when $i \to j$. More
precisely, $A_i$ must be independent of the remaining $A_j$ 'as a set', not just
individually.
Proof. We claim that for any subset $S$ of $[n]$ and any $i \notin S$ we have
\[ P\Bigl(A_i^c \Bigm| \bigcap_{j \in S} A_j^c\Bigr) \ge 1 - x_i, \qquad (1) \]
i.e., that
\[ P\Bigl(A_i \Bigm| \bigcap_{j \in S} A_j^c\Bigr) \le x_i. \qquad (2) \]
Assuming the claim, then
\[ P\Bigl(\bigcap_{i=1}^{n} A_i^c\Bigr) = P(A_1^c)\, P(A_2^c \mid A_1^c)\, P(A_3^c \mid A_1^c \cap A_2^c) \cdots \ge (1 - x_1)(1 - x_2)(1 - x_3) \cdots = \prod_{i=1}^{n} (1 - x_i). \]
It remains to prove the claim. For this we use induction on $|S|$.
For the base case $|S| = 0$ we have
\[ P\Bigl(A_i \Bigm| \bigcap_{j \in S} A_j^c\Bigr) = P(A_i) \le x_i \prod_{j : i \to j} (1 - x_j) \le x_i, \]
as required.
Now suppose the claim holds whenever $|S| < r$, and consider $S$ with $|S| = r$,
and $i \notin S$. Let $S_1 = \{j \in S : i \to j\}$ and $S_0 = S \setminus S_1 = \{j \in S : i \not\to j\}$, and
consider $B = \bigcap_{j \in S_1} A_j^c$ and $C = \bigcap_{j \in S_0} A_j^c$.
The left-hand side of (2) is
\[ P(A_i \mid B \cap C) = \frac{P(A_i \cap B \cap C)}{P(B \cap C)} = \frac{P(A_i \cap B \cap C)}{P(C)} \cdot \frac{P(C)}{P(B \cap C)} = \frac{P(A_i \cap B \mid C)}{P(B \mid C)}. \qquad (3) \]
To bound the numerator, note that $P(A_i \cap B \mid C) \le P(A_i \mid C) = P(A_i)$, since
$A_i$ is independent of the set of events $\{A_j : j \in S_0\}$. Hence, by the assumption
of the theorem,
\[ P(A_i \cap B \mid C) \le P(A_i) \le x_i \prod_{j : i \to j} (1 - x_j). \qquad (4) \]
For the denominator in (3), write $S_1$ as $\{j_1, \ldots, j_a\}$ and $S_0$ as $\{k_1, \ldots, k_b\}$. Then
\[ P(B \mid C) = P\bigl(A_{j_1}^c \cap \cdots \cap A_{j_a}^c \bigm| A_{k_1}^c \cap \cdots \cap A_{k_b}^c\bigr) = \prod_{t=1}^{a} P\bigl(A_{j_t}^c \bigm| A_{k_1}^c \cap \cdots \cap A_{k_b}^c \cap A_{j_1}^c \cap \cdots \cap A_{j_{t-1}}^c\bigr). \]
In each conditional probability in the product, we condition on the intersection
of at most $a + b - 1 = r - 1$ events, and $A_{j_t}$ is not one of them, so (1) applies,
and thus
\[ P(B \mid C) \ge \prod_{t=1}^{a} (1 - x_{j_t}) = \prod_{j \in S_1} (1 - x_j) \ge \prod_{j : i \to j} (1 - x_j), \]
since $S_1 \subseteq \{j : i \to j\}$. Together with (3) and (4) this gives $P(A_i \mid B \cap C) \le x_i$,
which is exactly (2). This completes the proof by induction.
End L6 Start L7
Dependency digraphs are slightly slippery. First recall that given the events
$A_1, \ldots, A_n$, we cannot construct $D$ simply by taking $i \to j$ if $A_i$ and $A_j$ are
dependent. Considering three events such that each pair is independent but
$\{A_1, A_2, A_3\}$ is not, a legal dependency digraph must have at least one edge
from vertex 1 (since $A_1$ does depend on $\{A_2, A_3\}$), and similarly from each
other vertex.
The same example shows that (even minimal) dependency digraphs are not
unique: in this case there are 8 minimal dependency digraphs.
There is an important special case where dependency digraphs are easy to
construct; we state it as a trivial lemma.
Lemma 3.2. Suppose that $(X_\alpha)_{\alpha \in F}$ is a set of independent random variables,
and that $A_1, \ldots, A_n$ are events where $A_i$ is determined by $\{X_\alpha : \alpha \in F_i\}$ for
some $F_i \subseteq F$. Then the (di)graph in which $i \to j$ and $j \to i$ if and only if
$F_i \cap F_j \ne \emptyset$ is a dependency digraph for $A_1, \ldots, A_n$.
Proof. For each $i$, the events $\{A_j : j \ne i,\ i \not\to j\}$ are (jointly) determined by the
variables $(X_\alpha : \alpha \in F \setminus F_i)$, and $A_i$ is independent of this family of variables.
We now turn to a more user-friendly version of the local lemma. The out-degree
of a vertex $i$ in a digraph $D$ is simply the number of $j$ such that $i \to j$.
Theorem 3.3 (Local Lemma, symmetric version). Let $A_1, \ldots, A_n$ be events
having a dependency digraph $D$ with maximum out-degree at most $d$. If $P(A_i) \le p$
for all $i$ and $ep(d + 1) \le 1$, then $P\bigl(\bigcap A_i^c\bigr) > 0$.
Proof. Set $x_i = 1/(d + 1)$ for all $i$ and apply Theorem 3.1. We have $|\{j : i \to j\}| \le d$,
so
\[ x_i \prod_{j : i \to j} (1 - x_j) \ge \frac{1}{d+1}\left(\frac{d}{d+1}\right)^d \ge \frac{1}{e(d+1)} \ge p \ge P(A_i), \]
and Theorem 3.1 applies.
Remark. Considering d+1 disjoint events with probability 1/(d+1) shows that
the constant (here e) must be at least 1. In fact, the constant e is best possible
for large d.
Example 3.1 (Hypergraph colouring).
Theorem 3.4. Let $H$ be an $r$-uniform hypergraph in which each edge meets at
most $d$ other edges. If $d + 1 \le 2^{r-1}/e$ then $H$ has a 2-colouring.
Proof. Colour the vertices randomly in the usual way, each red/blue with probability
1/2, independently of the others. Let $A_i$ be the event that the $i$th edge
$e_i$ is monochromatic, so $P(A_i) = 2^{1-r} = p$.
By Lemma 3.2 we may form a dependency digraph for the $A_i$ by joining $i$ to
$j$ (both ways) if $e_i$ and $e_j$ share one or more vertices. The maximum out-degree
is at most $d$ by assumption, and
\[ ep(d + 1) \le e\, 2^{1-r}\, \frac{2^{r-1}}{e} = 1, \]
so Theorem 3.3 gives $P\bigl(\bigcap A_i^c\bigr) > 0$, i.e., there exists a good colouring.
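A tiny Python helper (our own names and example) makes the hypothesis of Theorem 3.4 concrete: compute $d$ as the maximum number of other edges any edge meets, and test $e \cdot 2^{1-r}(d+1) \le 1$.

import math

def lll_guarantees_2_colouring(edges, r):
    # p = 2^(1-r) bounds P(an edge is monochromatic); d bounds the number of
    # other edges any edge meets; the symmetric local lemma needs e*p*(d+1) <= 1.
    edge_sets = [set(e) for e in edges]
    d = max(sum(1 for j, f in enumerate(edge_sets) if j != i and e & f)
            for i, e in enumerate(edge_sets))
    return math.e * 2.0 ** (1 - r) * (d + 1) <= 1, d

# 6-uniform "sliding window" hypergraph: each edge meets at most 2 others,
# and e * 2^(-5) * 3 < 1, so Theorem 3.4 guarantees a proper 2-colouring.
edges = [tuple(range(i, i + 6)) for i in range(0, 60, 3)]
print(lll_guarantees_2_colouring(edges, r=6))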
Example 3.2 (Ramsey numbers again).
Theorem 3.5. If $k \ge 3$ and $e\, 2^{1-\binom{k}{2}} \binom{k}{2}\binom{n}{k-2} \le 1$ then $R(k,k) > n$.
Proof. Colour the edges of $K_n$ as usual, each red or blue with probability 1/2,
independently of each other. For each $S \subseteq [n]$ with $|S| = k$ let $A_S$ be the event
that the complete graph on $S$ is monochromatic, so $P(A_S) = 2^{1-\binom{k}{2}}$.
For the dependency digraph, by Lemma 3.2 we may join $S$ and $T$ if they
share an edge, i.e., if $|S \cap T| \ge 2$. The maximum degree $d$ satisfies
\[ d = |\{T \ne S : |S \cap T| \ge 2\}| < \binom{k}{2}\binom{n}{k-2}. \]
By assumption $ep(d + 1) \le 1$, so Theorem 3.3 applies, giving the result.
Corollary 3.6. $R(k,k) \ge (1 + o(1))\, \frac{\sqrt{2}}{e}\, k\, 2^{k/2}$.
Proof. Straightforward(ish) calculation; you won't be asked to do it!
Note: this improves the bound from the first moment method by a factor of
$\sqrt{2}$. This is not much, but this is the best bound known!
Example 3.3 ($R(3,k)$). In the previous example, the local lemma didn't make
so much difference, because each event depended on very many others. If we
consider 'off-diagonal' Ramsey numbers the situation changes, but we can't use
the symmetric form. The point here is to understand how to apply the lemma
when we have two 'types' of events; the details of the calculation are not important.
Colour the edges of $K_n$ red with probability $p$ and blue with probability
$1 - p$, independently of each other, where $p = p(n) \to 0$.
For each $S \subseteq [n]$ with $|S| = 3$ let $A_S$ be the event that $S$ spans a red triangle,
and for each $T \subseteq [n]$ with $|T| = k$ let $B_T$ be the event that $T$ spans a blue $K_k$.
Note that
\[ P(A_S) = p^3 \quad\text{and}\quad P(B_T) = (1 - p)^{\binom{k}{2}}. \]
As usual, we can form the dependency digraph by joining two events if they
involve one or more common edges.
Each A event is joined to
- at most $3n$ other A events, and
- at most $\binom{n}{k} \le n^k$ B events.
(There are only $\binom{n}{k}$ B events in total.)
Also, each B event is joined to
- at most $\binom{k}{2} n$ A events, and
- at most $n^k$ B events.
End L7 Start L8
Our aim is to apply Theorem 3.1 with $x_i = x$ for all A events and $x_i = y$ for
all B events to conclude that the probability that none of the $A_S$ or $B_T$ holds
is positive, which gives $R(3,k) > n$. The conditions are satisfied provided we
have
\[ p^3 \le x (1 - x)^{3n} (1 - y)^{n^k} \qquad (5) \]
and
\[ (1 - p)^{\binom{k}{2}} \le y (1 - x)^{\binom{k}{2} n} (1 - y)^{n^k}. \qquad (6) \]
It turns out that
\[ p = \frac{1}{6\sqrt{n}}, \qquad x = \frac{1}{12 n^{3/2}}, \qquad k = 30\sqrt{n}\log n, \qquad y = n^{-k} \]
satisfies (5) and (6) if $k$ is large enough. This gives the following result.
Theorem 3.7. There exists a constant $c > 0$ such that $R(3,k) \ge c k^2/(\log k)^2$
if $k$ is large enough.
Proof. The argument above shows that $R(3, 30\sqrt{n}\log n) > n$. Rearrange this
to get the result.
Remark. This bound is best possible apart from one factor of $\log k$. Removing
this factor is not easy, and was a major achievement of J.H. Kim.
4 The Chernoff bounds
Generally speaking, we are ultimately interested in whether the random graph
$G(n,p)$ has some property 'almost always' (with probability tending to one as
$n \to \infty$), or 'almost never'. For example, this is enough to allow us to show the
existence of graphs with various combinations of properties, using the fact that
if two or three properties individually hold almost always, then their intersection
holds almost always. Sometimes, however, we need to consider a number $k$ of
properties (events) that tends to infinity as $n \to \infty$. This means that we would
like tighter bounds on the probability that individual events fail to hold.
For example, let $G = G(n,p)$ and consider its maximum degree $\Delta(G)$. For
any $d$ we have $P(\Delta(G) \ge d) \le n P(d_v \ge d)$ where $v$ is any given vertex. In turn
this is at most $n P(X \ge d)$ where $X \sim \mathrm{Bi}(n, p)$. To show that $P(\Delta(G) \ge d) \to 0$
for some $d = d(n)$ we would need a bound of the form
\[ P(X \ge d) = o(1/n). \qquad (7) \]
Recall that if $X \sim \mathrm{Bi}(n,p)$ then $\mu = E[X] = np$ and $\sigma^2 = \mathrm{Var}[X] = np(1-p)$.
(For example, if $p = 1/2$ then $\mu = n/2$ and $\sigma = \sqrt{n}/2$.) Chebyshev's inequality
gives $P(X \ge \mu + \lambda\sigma) \le \lambda^{-2}$; to use this for (7) we need $\lambda$ much larger than $\sqrt{n}$.
If $p = 1/2$ this means a deviation $\lambda\sigma$ of order $n$, which is useless.
On the other hand, the central limit theorem suggests that
\[ P(X \ge \mu + \lambda\sigma) = P\Bigl(\frac{X - \mu}{\sigma} \ge \lambda\Bigr) \approx P(N(0,1) \ge \lambda) \approx e^{-\lambda^2/2}, \]
where $N(0,1)$ is the standard normal distribution. But this is only valid for $\lambda$
constant as $n \to \infty$, so again it is no use for (7).
Our next aim is to prove a bound similar to the above, but valid no matter
how $\lambda$ depends on $n$.
Theorem 4.1. Suppose that $n \ge 1$ and $p, x \in (0,1)$. Let $X \sim \mathrm{Bi}(n,p)$. Then
\[ P(X \ge nx) \le \left[\left(\frac{p}{x}\right)^x \left(\frac{1-p}{1-x}\right)^{1-x}\right]^n \quad\text{if } x \ge p, \]
and
\[ P(X \le nx) \le \left[\left(\frac{p}{x}\right)^x \left(\frac{1-p}{1-x}\right)^{1-x}\right]^n \quad\text{if } x \le p. \]
Note that the exact expression is in some sense not so important; what
matters is (a) the proof technique, and (b) that it is exponential in $n$ if $x$ and
$p$ are fixed.
Proof. The idea is simply to apply Markov's inequality to the random variable
$e^{tX}$ for some number $t$ that we will choose so as to optimize the bound.
Consider $X$ as a sum $X_1 + \ldots + X_n$ where the $X_i$ are independent with
$P(X_i = 1) = p$ and $P(X_i = 0) = 1 - p$. Then
\[ E[e^{tX}] = E[e^{tX_1} e^{tX_2} \cdots e^{tX_n}] = E[e^{tX_1}] \cdots E[e^{tX_n}] = E[e^{tX_1}]^n = \bigl(p e^t + (1-p) e^0\bigr)^n, \]
where we used independence for the second equality.
For any $t > 0$, using the fact that $y \mapsto e^{ty}$ is increasing and Markov's
inequality, we have
\[ P(X \ge nx) = P(e^{tX} \ge e^{tnx}) \le E[e^{tX}]/e^{tnx} = \bigl[(p e^t + 1 - p)\, e^{-tx}\bigr]^n. \qquad (8) \]
To get the best bound we minimize over $t$ (by differentiating and equating to
zero).
For $x > p$, it turns out that the minimum occurs when
\[ e^t = \frac{x}{p} \cdot \frac{1-p}{1-x} > 1, \]
so $t > 0$ (otherwise the argument above is invalid), and we get
\[ P(X \ge nx) \le \left[\left(x\, \frac{1-p}{1-x} + 1 - p\right) \left(\frac{p}{x}\right)^x \left(\frac{1-x}{1-p}\right)^x\right]^n = \left[\left(\frac{p}{x}\right)^x \left(\frac{1-p}{1-x}\right)^{1-x}\right]^n, \]
proving the first part of the theorem. (For $x = p$ we get $t = 0$, which works too,
but when $x = p$ the result is trivial in any case.)
For the second part, argue similarly but with $t < 0$ at the minimum, so $e^{ty}$
is a decreasing function, and (8) becomes $P(X \le nx) = P(e^{tX} \ge e^{tnx})$; or let
$Y = n - X$, so $Y \sim \mathrm{Bi}(n, 1-p)$. Then $P(X \le nx) = P(Y \ge n(1-x))$, and
apply the first part.
Remark. Theorem 4.1 gives the best possible bound among bounds of the form
$P(X \ge nx) \le g(x,p)^n$, where $g(x,p)$ is some function of $x$ and $p$.
In the form above, the bound is a little hard to use. Here are some more
practical forms.
Corollary 4.2. Let $X \sim \mathrm{Bi}(n,p)$. Then for $h, t > 0$
\[ P\bigl(X \ge np + nh\bigr) \le e^{-2h^2 n} \quad\text{and}\quad P\bigl(X \ge np + t\bigr) \le e^{-2t^2/n}. \]
Also, for $0 \le \epsilon \le 1$ we have
\[ P\bigl(X \ge (1+\epsilon)np\bigr) \le e^{-\epsilon^2 np/4} \quad\text{and}\quad P\bigl(X \le (1-\epsilon)np\bigr) \le e^{-\epsilon^2 np/2}. \]
Proof. Fix $p$. For $x > p$ or $x < p$ Theorem 4.1 gives $P(X \ge nx) \le e^{-f(x)n}$ or
$P(X \le nx) \le e^{-f(x)n}$, where
\[ f(x) = x\log\left(\frac{x}{p}\right) + (1-x)\log\left(\frac{1-x}{1-p}\right). \]
We aim to bound $f(x)$ from below by some simpler function. Note that $f(p) = 0$.
Also,
\[ f'(x) = \log x - \log p - \log(1-x) + \log(1-p), \]
so $f'(p) = 0$ and
\[ f''(x) = \frac{1}{x} + \frac{1}{1-x}. \]
If $f''(x) \ge a$ for all $x$ between $p$ and $p+h$ then (e.g., by Taylor's Theorem) we
get $f(p+h) \ge a h^2/2$.
Now for any $x$ we have $f''(x) \ge \inf_{0<x<1}\{1/x + 1/(1-x)\} = 4$, so $f(p+h) \ge 2h^2$,
giving the first bound; the second is the same bound in different notation,
setting $t = nh$.
For the third bound, if $p \le x \le p(1+\epsilon) \le 2p$ then $f''(x) \ge 1/x \ge 1/(2p)$,
giving $f(p + \epsilon p) \ge \frac{\epsilon^2 p^2}{2}\cdot\frac{1}{2p} = \frac{\epsilon^2 p}{4}$, which gives the result.
For the final bound, if $0 < x \le p$ then $f''(x) \ge 1/x \ge 1/p$, giving $f(p - \epsilon p) \ge \frac{\epsilon^2 p^2}{2}\cdot\frac{1}{p} = \frac{\epsilon^2 p}{2}$.
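A quick numerical comparison may help calibrate these bounds. The Python sketch below (our own function names) computes the exact binomial upper tail, the bound of Theorem 4.1, and the simpler $e^{-2h^2 n}$ form from Corollary 4.2, for Bi(1000, 1/2).

import math

def binom_upper_tail(n, p, x):
    # Exact P(X >= nx) for X ~ Bi(n, p).
    k0 = math.ceil(n * x)
    return sum(math.comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(k0, n + 1))

def chernoff_bound(n, p, x):
    # Theorem 4.1: ((p/x)^x ((1-p)/(1-x))^(1-x))^n, valid for x >= p.
    return ((p / x) ** x * ((1 - p) / (1 - x)) ** (1 - x)) ** n

n, p = 1000, 0.5
for x in (0.55, 0.6, 0.7):
    print(x, binom_upper_tail(n, p, x), chernoff_bound(n, p, x),
          math.exp(-2 * (x - p) ** 2 * n))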
End L8 Start L9
Remark. Recall that $\sigma = \sqrt{np(1-p)}$, so when $p$ is small then $\epsilon np \approx \epsilon\sqrt{np}\,\sigma$.
The central limit theorem suggests that the probability of a deviation this large
should be around $e^{-\epsilon^2 np/2}$, as in the final bound above. The second bound is
weaker, but has the advantage of being correct!
In general, think of the bounds as of the form $e^{-c\lambda^2}$ for the probability of
being $\lambda$ standard deviations away from the mean. Alternatively, deviations on
the scale of the mean are exponentially unlikely.
The Chernoff bounds apply more generally than just to binomial distributions;
they apply to other sums of independent variables where each variable
has bounded range.
Example 4.1 (The maximum degree of $G(n,p)$).
Theorem 4.3. Let $p = p(n)$ satisfy $np \ge 10\log n$, and let $\Delta$ be the maximum
degree of $G(n,p)$. Then
\[ P\Bigl(\Delta \ge np + 3\sqrt{np\log n}\Bigr) \to 0 \]
as $n \to \infty$.
Proof. Let $d = np + 3\sqrt{np\log n}$. As noted at the start of the section,
\[ P(\Delta \ge d) \le n P(d_v \ge d) \le n P(X \ge d), \]
where $d_v \sim \mathrm{Bi}(n-1, p)$ is the degree of a given vertex, and $X \sim \mathrm{Bi}(n,p)$.
Applying the bound $P(X \ge (1+\epsilon)np) \le e^{-\epsilon^2 np/4}$ from Corollary 4.2 with
$\epsilon = 3\sqrt{\log n/(np)} \le 1$, we have
\[ n P(X \ge d) \le n e^{-\epsilon^2 np/4} = n e^{-9(\log n)/4} = n\, n^{-9/4} = n^{-5/4} \to 0, \]
giving the result.
Note that for large $n$ there will be some vertices with degrees any given
number of standard deviations above the average. The result says however that
all degrees will be at most $C\sqrt{\log n}$ standard deviations above. This is best
possible, apart from the constant.
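A small simulation (our own code and parameter choices) illustrates Theorem 4.3: with $np = 20\log n$, the observed maximum degree sits comfortably below $np + 3\sqrt{np\log n}$.

import math
import random

def max_degree_gnp(n, p):
    # Sample G(n, p) and return its maximum degree.
    deg = [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            if random.random() < p:
                deg[i] += 1
                deg[j] += 1
    return max(deg)

n = 2000
p = 20 * math.log(n) / n                      # so np = 20 log n >= 10 log n
bound = n * p + 3 * math.sqrt(n * p * math.log(n))
print("max degree:", max_degree_gnp(n, p), " bound:", round(bound, 1))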
5 The phase transition in G(n, p)
[ Summary of what we know about G(n, p) in various ranges; most interesting
near p = 1/n. ]
5.1 Branching processes
Let $Z$ be a probability distribution on the non-negative integers. The Galton-Watson
branching process with offspring distribution $Z$ is defined as follows:
Generation 0 consists of a single individual.
Generation $t+1$ consists of the children of individuals in generation $t$
(formally, it's the disjoint union of the sets of children of the individuals
in generation $t$).
The number of children of each individual has distribution $Z$, and is independent
of everything else, i.e., of the history so far, and of other individuals
in the same generation.
We write $X_t$ for the number of individuals in generation $t$, and $X = (X_0, X_1, \ldots)$
for the random sequence of generation sizes. Note that $X_0 = 1$, and given that
$X_t = k$, then $X_{t+1}$ is the sum of $k$ independent copies of $Z$.
Let $\mu = EZ$. Then $EX_0 = 1$. Also $E[X_{t+1} \mid X_t = k] = k\mu$. Thus
\[ E[X_{t+1}] = \sum_k P(X_t = k)\, E[X_{t+1} \mid X_t = k] = \sum_k P(X_t = k)\, k\mu = \mu\, E[X_t]. \]
Hence $E[X_t] = \mu^t$ for all $t$.
The branching process survives if $X_t > 0$ for all $t$, and dies out or goes
extinct if $X_t = 0$ for some $t$.
If $\mu = E[Z] < 1$, then for any $t$ we have
\[ P(X \text{ survives}) \le P(X_t > 0) \le E[X_t] = \mu^t. \]
Letting $t \to \infty$ shows that $P(X \text{ survives}) = 0$.
What if $\mu > 1$? Note that any branching process with $P(Z = 0) > 0$ may
die out; the question is, can it survive?
We recall some basic properties of probability generating functions.
Definition. If $Z$ is a random variable taking non-negative integer values, the
probability generating function of $Z$ is the function $f_Z : [0,1] \to \mathbb{R}$ defined by
\[ f_Z(x) = E[x^Z] = \sum_{k=0}^{\infty} P(Z = k)\, x^k. \]
The following facts are easy to check:
- $f_Z$ is continuous on $[0,1]$.
- $f_Z(0) = P(Z = 0)$ and $f_Z(1) = 1$.
- $f_Z$ is increasing, with $f_Z'(1) = E[Z]$.
- If $P(Z \ge 2) > 0$, then $f_Z$ is strictly convex.
For the last two observations, note that for $0 < x \le 1$ we have
\[ f_Z'(x) = \sum_{k=1}^{\infty} k\, P(Z = k)\, x^{k-1} \ge 0, \]
and
\[ f_Z''(x) = \sum_{k \ge 2} k(k-1)\, P(Z = k)\, x^{k-2} \ge 0, \]
with strict inequality if $P(Z \ge 2) > 0$. (We are implicitly assuming that $E[Z]$
exists.)
End L9 Start L10
Let $\rho_t = P(X_t = 0)$. Then $\rho_0 = 0$ and
\[ \rho_{t+1} = \sum_k P(X_{t+1} = 0 \mid X_1 = k)\, P(X_1 = k) = \sum_k \rho_t^k\, P(Z = k) = f_Z(\rho_t), \]
since, given the number of individuals in the first generation, the descendants
of each of them form an independent copy of the branching process.
Let $X_Z$ denote the Galton-Watson branching process with offspring distribution
$Z$. Let $\rho = \rho(Z)$ denote the extinction probability of $X_Z$, i.e., the
probability that the process dies out.
Theorem 5.1. For any probability distribution $Z$ on the non-negative integers,
$\rho(Z)$ is equal to the smallest non-negative solution to $f_Z(x) = x$.
Proof. As above, let $\rho_t = P(X_t = 0)$, so $0 = \rho_0 \le \rho_1 \le \rho_2 \le \cdots$. Since the events
$\{X_t = 0\}$ are nested and their union is the event that the process dies out, we
have $\rho_t \to \rho$ as $t \to \infty$.
As shown above, $\rho_{t+1} = f_Z(\rho_t)$. Since $f_Z$ is continuous, taking the limit of
both sides gives $\rho = f_Z(\rho)$, so $\rho$ is a non-negative solution to $f_Z(x) = x$.
Let $x_0$ be the smallest non-negative solution to $f_Z(x) = x$, so $x_0 \le \rho$. Then
$0 = \rho_0 \le x_0$. Since $f_Z$ is increasing, this gives
\[ \rho_1 = f_Z(\rho_0) \le f_Z(x_0) = x_0. \]
Similarly, by induction we obtain $\rho_t \le x_0$ for all $t$, so taking the limit, $\rho \le x_0$,
and hence $\rho = x_0$.
Corollary 5.2. If $E[Z] > 1$ then $\rho(Z) < 1$, i.e., the probability that $X_Z$ survives
is positive. If $E[Z] < 1$ then $\rho(Z) = 1$.
Proof. The question is simply whether the curves $f_Z(x)$ and $x$ meet anywhere
in $[0,1]$ other than at $x = 1$; sketch the graphs!
For the first statement, suppose for convenience that $E[Z]$ is defined. Then
$f_Z'(1) > 1$, so there exists $\epsilon > 0$ such that $f_Z(1 - \epsilon) < 1 - \epsilon$. Since $f_Z(0) \ge 0$,
applying the Intermediate Value Theorem to $f_Z(x) - x$, there must be some
$x \in [0, 1-\epsilon]$ for which $f_Z(x) = x$. But then $\rho \le x \le 1 - \epsilon < 1$.
We have already proved the second; it can be reproved using the convexity
of $f_Z$.
Note that when $P(Z \ge 2) > 0$ and $E[Z] > 1$, then there is a unique solution
to $f_Z(x) = x$ in $[0,1)$; this follows from the strict convexity of $f_Z$.
Remark. We have left out the case E[Z] = 1; it is an exercise to show that if
E[Z] = 1 then, except in a trivial case, the process always dies.
Definition. For $c > 0$, a random variable $Z$ has the Poisson distribution with
mean $c$, written $Z \sim \mathrm{Po}(c)$, if
\[ P(Z = k) = \frac{c^k}{k!}\, e^{-c} \quad\text{for } k = 0, 1, 2, \ldots. \]
Basic properties (exercise): $E[Z] = c$, $E_2[Z] = c^2$. (More generally, the $r$th
factorial moment is $E_r[Z] = c^r$.)
Lemma 5.3. Suppose $n \to \infty$ and $p \to 0$ with $np \to c$ where $c > 0$ is constant.
Let $Z_n$ have the binomial distribution $\mathrm{Bi}(n,p)$, and let $Z \sim \mathrm{Po}(c)$. Then $Z_n$
converges in distribution to $Z$, i.e., for each $k$, $P(Z_n = k) \to P(Z = k)$.
Proof. For $k$ fixed,
\[ P(Z_n = k) = \binom{n}{k} p^k (1-p)^{n-k} \sim \frac{n^k}{k!}\, p^k (1-p)^{n} = \frac{(np)^k}{k!}\, e^{-np + O(np^2)} \to \frac{c^k}{k!}\, e^{-c}, \]
since $np \to c$ and $np^2 \to 0$.
As we shall see shortly, there is a very close connection between components
in $G(n, c/n)$ and the Galton-Watson branching process $X_{\mathrm{Po}(c)}$ where the offspring
distribution is Poisson with mean $c$. The extinction probability of this
process will be especially important.
Theorem 5.4. Let $c > 0$. Then the extinction probability $\rho = \rho(c)$ of the
branching process $X_{\mathrm{Po}(c)}$ satisfies the equation
\[ \rho = e^{-c(1-\rho)}. \]
Furthermore, $\rho < 1$ if and only if $c > 1$.
Proof. The probability generating function of the Poisson distribution with
mean $c$ is given by
\[ f(x) = \sum_{k=0}^{\infty} \frac{c^k}{k!}\, e^{-c}\, x^k = e^{cx} e^{-c} = e^{c(x-1)} = e^{-c(1-x)}. \]
The result now follows from Theorem 5.1.
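The fixed-point characterisation is easy to evaluate numerically: iterating $\rho_{t+1} = f_Z(\rho_t) = e^{-c(1-\rho_t)}$ from $\rho_0 = 0$, exactly as in the proof of Theorem 5.1, converges to the extinction probability. A short Python sketch (our own function name):

import math

def extinction_probability(c, steps=100000):
    # Iterate rho_{t+1} = exp(-c * (1 - rho_t)) from rho_0 = 0 (Theorem 5.1):
    # the limit is the smallest non-negative root of x = e^{-c(1-x)}.
    rho = 0.0
    for _ in range(steps):
        rho = math.exp(-c * (1.0 - rho))
    return rho

for c in (0.5, 1.0, 1.5, 2.0):
    print(c, round(extinction_probability(c), 4))
# For c <= 1 the limit is 1 (convergence is slow at c = 1); for c = 2 we get
# rho ≈ 0.2032, so the survival probability is about 0.797, which matches the
# asymptotic fraction of vertices outside small components (cf. Lemma 5.9).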
5.2 Component exploration
In the light of Lemma 5.3, it is intuitively clear that the Poisson branching
process gives a good local approximation to the neighbourhood of a vertex of
G(n, c/n). To make this precise, we shall explore the component of a vertex
in a certain way. First we describe the (simpler) exploration for the branching
process.
Exploration process (for branching process).
Start with $Y_0 = 1$, meaning one live individual (the root). At each step,
select a live individual; it has $Z_t$ children and then dies. (So several steps may
correspond to a single generation.) Let $Y_t$ be the number of individuals alive
after $t$ steps. Then
\[ Y_t = \begin{cases} Y_{t-1} + Z_t - 1 & \text{if } Y_{t-1} > 0, \\ 0 & \text{if } Y_{t-1} = 0. \end{cases} \]
The process dies out if $Y_m = 0$ for some $m$; in this case the total number of
individuals is $\min\{m : Y_m = 0\}$.
Until it hits zero, the sequence $(Y_t)$ is a random walk with i.i.d. steps $Z_1 - 1,
Z_2 - 1, \ldots$ taking values in $\{-1, 0, 1, 2, \ldots\}$. Each step has expectation $E[Z] - 1 = \mu - 1$,
so $\mu < 1$ implies negative drift and the walk will hit 0, i.e., the
process will die, with probability 1. If $\mu > 1$ then the drift is positive, so the
probability of never hitting 0 is positive, i.e., the probability that the process
survives is positive.
Component exploration in $G(n,p)$.
Let $v$ be a fixed vertex of $G = G(n,p)$. At each stage, each vertex $u$ of $G$
will be live, unreached, or processed. $Y_t$ will be the number of live vertices
after $t$ steps; there will be exactly $t$ processed vertices, and $n - t - Y_t$ unreached
vertices.
At $t = 0$, $v$ is live and all other vertices are unreached, so $Y_0 = 1$.
At each step $t$, pick a live vertex $w$, if there is one. For each unreached $w'$,
check whether $ww' \in E(G)$; if so, make $w'$ live. After completing these checks,
set $w$ to be processed.
Let $R_t$ be the number of $w'$ which become live during step $t$. (Think of this
as the number of vertices Reached in step $t$.) Then
\[ Y_t = \begin{cases} Y_{t-1} + R_t - 1 & \text{if } Y_{t-1} > 0, \\ 0 & \text{if } Y_{t-1} = 0. \end{cases} \]
The process stops at the first $m$ for which $Y_m = 0$, and it is easy to check
that we have reached all vertices in the component $C_v$ of $G$ containing $v$. In
particular, $|C_v| = m$.
End L10 Start L11
So far, $G$ could be any graph. Now suppose that $G = G(n,p)$. Then each
edge is present with probability $p$ independently of the others. No edge is tested
twice (we only check edges from live to unreached vertices, and then one end
becomes processed). Since $n - (t-1) - Y_{t-1}$ vertices are unreached at the start
of step $t$, it follows that given $Y_0, \ldots, Y_{t-1}$, the number $R_t$ of vertices reached
in step $t$ has the distribution
\[ R_t \sim \mathrm{Bi}\bigl(n - (t-1) - Y_{t-1},\ p\bigr). \qquad (9) \]
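Here is a direct Python implementation of this exploration (our own code), generating the edges of $G(n,p)$ on demand exactly as described: at each step one live vertex is processed and each currently unreached vertex becomes live independently with probability $p$. Running it from many starting vertices already shows the contrast between $c < 1$ and $c > 1$ that Section 5.4 makes precise.

import random

def explore_component(n, p, v=0):
    # Live/unreached/processed exploration of C_v; each live-unreached pair
    # is examined exactly once, so edges can be generated on the fly.
    unreached = set(range(n)) - {v}
    live = [v]
    processed = 0
    while live:
        w = live.pop()                        # process one live vertex
        processed += 1
        reached = {u for u in unreached if random.random() < p}
        unreached -= reached
        live.extend(reached)
    return processed                          # = |C_v|

n, trials = 1000, 100
for c in (0.5, 1.5):
    sizes = [explore_component(n, c / n) for _ in range(trials)]
    frac_large = sum(s > 100 for s in sizes) / trials
    print("c =", c, " largest component seen:", max(sizes),
          " fraction of starts in a component of size > 100:", frac_large)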
5.3 Vertices in small components
Given a vertex $v$ of a graph $G$, let $C_v$ denote the component of $G$ containing $v$,
and $|C_v|$ its size, i.e., number of vertices.
Let $\pi_k(c)$ denote the probability that $|X_{\mathrm{Po}(c)}| = k$, where $|X| = \sum_{t \ge 0} X_t$
denotes the total number of individuals in all generations of the branching process
$X$.
Lemma 5.5. Suppose that $p = p(n)$ satisfies $np \to c$ where $c > 0$ is constant.
Let $v$ be a given vertex of $G(n,p)$. For each constant $k$ we have
\[ P(|C_v| = k) \to \pi_k(c) \]
as $n \to \infty$.
Proof. The idea is simply to show that the random walks $(Y_t)$ (from the exploration
of $C_v$) and $(\tilde Y_t)$ (from the exploration of the branching process) have
almost the same probability of first hitting zero at $t = k$. We do this by
comparing the probabilities of individual trajectories.
Define $(Y_t)$ and $(R_t)$ as in the exploration above. Then $|C_v| = k$ if and only
if $Y_k = 0$ and $Y_t > 0$ for all $t < k$. Let $S_k$ be the set of all possible sequences
$y = (y_0, \ldots, y_k)$ of values for $Y = (Y_0, \ldots, Y_k)$ with this property. (I.e., $y_0 = 1$,
$y_k = 0$, $y_t > 0$ for $t < k$, and each $y_t$ is an integer with $y_t \ge y_{t-1} - 1$.) Then
\[ P(|C_v| = k) = \sum_{y \in S_k} P(Y = y). \]
Similarly,
\[ \pi_k(c) = P\bigl(|X_{\mathrm{Po}(c)}| = k\bigr) = \sum_{y \in S_k} P(\tilde Y = y). \]
Fix any sequence $y \in S_k$. For each $t$ let $r_t = y_t - y_{t-1} + 1$, so $(r_t)$ is the
sequence of $R_t$ values corresponding to $Y = y$. From (9) we have
\[ P(Y = y) = \prod_{t=1}^{k} P\bigl(\mathrm{Bi}(n - (t-1) - y_{t-1},\ p) = r_t\bigr). \]
In each factor, $t-1$, $y_{t-1}$ and $r_t$ are constants. As $n \to \infty$ we have $n - (t-1) - y_{t-1} \sim n$,
so $(n - (t-1) - y_{t-1})\, p \to c$. Applying Lemma 5.3 to each factor
in the product, it follows that
\[ P(Y = y) \to \prod_{t=1}^{k} P\bigl(\mathrm{Po}(c) = r_t\bigr). \]
But this is just $P(\tilde Y = y)$, from the exploration for the branching process.
Summing over the finite number of possible sequences $y$ gives the result.
We write $N_k(G)$ for the number of vertices of a graph $G$ in components with
$k$ vertices. (So $N_k(G)$ is $k$ times the number of $k$-vertex components in $G$.)
Corollary 5.6. Suppose that $np \to c$ where $c > 0$ is constant. For each fixed $k$
we have $E N_k(G(n,p)) \sim n\, \pi_k(c)$ as $n \to \infty$.
Proof. The expectation is simply $\sum_v P(|C_v| = k) = n P(|C_v| = k) \sim n\, \pi_k(c)$.
Lemma 5.5 tells us that the branching process predicts the expected number
of vertices in components of each fixed size $k$. It is not hard to use the second
moment method to show that in fact this number is concentrated around its
mean.
Definition. Let $(X_n)$ be a sequence of real-valued random variables and $a$ a
(constant) real number. Then $X_n$ converges to $a$ in probability, written $X_n \xrightarrow{p} a$,
if for all (fixed) $\epsilon > 0$ we have $P(|X_n - a| > \epsilon) \to 0$ as $n \to \infty$.
Lemma 5.7. Suppose that $E[X_n] \to a$ and $E[X_n^2] \to a^2$. Then $X_n \xrightarrow{p} a$.
Proof. Note that $\mathrm{Var}[X_n] = E[X_n^2] - (EX_n)^2 \to a^2 - a^2 = 0$, and apply Chebyshev.
In fact, whenever we showed that some quantity $X_n$ was almost always
positive by using the second moment method, we really showed more, that
$X_n / E[X_n] \xrightarrow{p} 1$, i.e., that $X_n$ is concentrated around its mean.
Lemma 5.8. Let $c > 0$ and $k$ be constant, and let $N_k = N_k(G(n, c/n))$. Then
$N_k/n \xrightarrow{p} \pi_k(c)$.
Proof. We have already shown that $E(N_k/n) \to \pi_k(c)$.
Let $I_v$ be the indicator function of the event that $|C_v| = k$, so $N_k = \sum_v I_v$
and
\[ N_k^2 = \sum_{v,w} I_v I_w = A + B, \]
where
\[ A = \sum_{v,w} I_v I_w I_{\{C_v = C_w\}} \]
is the part of the sum from vertices in the same component, and
\[ B = \sum_{v,w} I_v I_w I_{\{C_v \ne C_w\}} \]
is the part from vertices in different components. [Note that we can split the
sum even though it's random whether a particular pair of vertices are in the
same component or not.]
If $I_v = 1$, then $|C_v| = k$, so $\sum_w I_w I_{\{C_v = C_w\}} = k$. Hence $A = k N_k \le kn$,
and $E[A] = o(n^2)$.
Since all vertices $v$ are equivalent, we can rewrite $E[B]$ as
\[ n P(|C_v| = k)\, E\Bigl[\sum_w I_w I_{\{C_v \ne C_w\}} \Bigm| |C_v| = k\Bigr], \]
where $v$ is any fixed vertex. Now $\sum_w I_w I_{\{C_v \ne C_w\}}$ is just $N_k(G - C_v)$, the number
of vertices of $G - C_v$ in components of size $k$. Exploring $C_v$ as before, given
that $|C_v| = k$ we have not examined any of the edges among the $n - k$ vertices
not in $C_v$, so $G - C_v$ has the distribution of $G(n-k, c/n)$. Since $n - k \sim n$,
Lemma 5.5 gives
\[ E[B] \sim n P(|C_v| = k)\, (n-k)\, \pi_k(c) \sim (n\, \pi_k(c))^2. \]
Hence, $E[N_k^2] = E[A] + E[B] \sim (n\, \pi_k(c))^2$, i.e., $E[(N_k/n)^2] \to \pi_k(c)^2$.
Lemma 5.7 now gives the result.
End L11 Start L12
Let $N_{\le K}(G)$ denote the number of vertices $v$ of $G$ with $|C_v| \le K$, and let
$\pi_{\le K}(c) = P\bigl(|X_{\mathrm{Po}(c)}| \le K\bigr)$.
With $G = G(n, c/n)$, we have seen that for $k$ fixed, $N_k(G)/n \xrightarrow{p} \pi_k(c)$.
Summing over $k = 1, \ldots, K$, it follows that if $K$ is fixed, then
\[ \frac{N_{\le K}(G)}{n} \xrightarrow{p} \pi_{\le K}(c). \qquad (10) \]
What if we want to consider components of sizes growing with $n$? Then we
must be more careful.
Recall that $\rho = \rho(c)$ denotes the extinction probability of the branching
process $X_{\mathrm{Po}(c)}$, so $\sum_k \pi_k(c) = \rho$. In other words,
\[ \pi_{\le K}(c) = \sum_{k=1}^{K} \pi_k(c) \to \rho(c) \quad\text{as } K \to \infty. \]
If $c > 1$, then $N_{\le n}(G)/n = 1$, while $\pi_{\le n}(c) \to \rho(c) < 1$, so we cannot extend
the formula (10) to arbitrary $K = K(n)$. But we can allow $K$ to grow at some
rate.
Lemma 5.9. Let $c > 0$ be constant, and suppose that $k^* = k^*(n)$ satisfies
$k^* \to \infty$ and $k^* \le n^{1/4}$. Then the number $N^*$ of vertices of $G(n, c/n)$ in
components with at most $k^*$ vertices satisfies $N^*/n \xrightarrow{p} \rho(c)$.
Proof. [Sketch; non-examinable] The key point is that since $k^* \to \infty$, we have
\[ P(|X_{\mathrm{Po}(c)}| \le k^*) \to \rho(c). \]
We omit the comparison of the branching process and graph in this case. For
enthusiasts, one possibility is to repeat the proof of Lemma 5.5 but being more
careful with the estimates. One can use the fact that $P\bigl(\mathrm{Bi}(n-m, c/n) = k\bigr)$
and $P(\mathrm{Po}(c) = k)$ agree within a factor $1 + O((k+m+1)^2/n)$ when $k, m \le n/4$,
to show that all sequences $y$ corresponding to the random walk ending by time
$k^*$ have essentially the same probability in the graph and branching process
explorations.
For each xed k, we know almost exactly how many vertices are in compo-
nents of size k. Does this mean that we know the whole component structure?
Not quite: if c > 1, so < 1, then Lemma 5.9 tells us that there are around
(1 )n vertices in components of size at least n
1/4
, say. But are these compo-
nents really of around that size, or much larger?
To answer this, we return to the exploration process.
5.4 The phase transition
Recall that our exploration of the component of vertex $v$ in $G(n, c/n)$ leads to a random walk $(Y_t)_{t=0}^{m}$ with $Y_0 = 1$, $Y_m = 0$, and at each step $Y_t = Y_{t-1} + R_t - 1$ where, conditional on the process so far, $R_t$ has the binomial distribution $\mathrm{Bi}(n - (t-1) - Y_{t-1},\, c/n)$.
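For concreteness, here is a minimal Python sketch (not from the notes; names and parameters are mine) of this exploration walk, revealing the edges of $G(n, c/n)$ only as they are queried:

    import random

    def exploration_walk(n, c, v=0):
        """Explore the component of v in a fresh sample of G(n, c/n);
        return the walk (Y_0, ..., Y_m) described above."""
        p = c / n
        active = {v}                            # discovered but not yet processed
        unexplored = set(range(n)) - {v}        # never touched
        walk = [1]                              # Y_0 = 1
        while active:
            u = active.pop()                    # process one active vertex per step
            new = {w for w in unexplored if random.random() < p}  # R_t new neighbours
            unexplored -= new
            active |= new
            walk.append(walk[-1] + len(new) - 1)  # Y_t = Y_{t-1} + R_t - 1
        return walk                             # len(walk) - 1 = |C_v|, and walk[-1] == 0

    w = exploration_walk(2000, 1.5)
    print("component size:", len(w) - 1, "final value:", w[-1])

At each step the number of unexplored vertices is $n - (t-1) - Y_{t-1}$, so the conditional distribution of $R_t$ is exactly the binomial stated above.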
Suppose first that $c < 1$. The distribution of $R_t$ is dominated by a $\mathrm{Bi}(n, c/n)$ distribution. More precisely, we can define independent variables $R_t^+ \sim \mathrm{Bi}(n, c/n)$ so that $R_t \le R_t^+$ holds for all $t$ for which $R_t$ is defined. [To be totally explicit, construct the random variables step by step. At step $t$, we want $R_t$ to be $\mathrm{Bi}(x, c/n)$ for some $x \le n$ that depends on what has happened so far. Toss $x$ coins to determine $R_t$, and then $n - x$ further coins, taking the total number of heads to be $R_t^+$.]
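A tiny (unofficial) sketch of one step of this coin-tossing coupling, with x standing for the number of unexplored vertices at that step:

    import random

    def coupled_step(x, n, p):
        """One step of the coupling: R_t ~ Bi(x, p) and R_t_plus ~ Bi(n, p),
        built from the same coins so that R_t <= R_t_plus always holds."""
        coins = [random.random() < p for _ in range(n)]
        r_t = sum(coins[:x])          # heads among the first x coins
        r_t_plus = sum(coins)         # heads among all n coins
        return r_t, r_t_plus

    print(coupled_step(x=850, n=1000, p=0.0009))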
Let $(Y_t^+)$ be the walk with $Y_0^+ = 1$ and increments $R_t^+ - 1$, so $Y_t \le Y_t^+$ for all $t$ until our exploration in $G(n, c/n)$ stops. Then for any $k$ we have
\[ P(|C_v| > k) \le P\bigl( Y_0, \ldots, Y_k > 0 \bigr) \le P\bigl( Y_0^+, \ldots, Y_k^+ > 0 \bigr) \le P(Y_k^+ > 0). \]
But $Y_k^+$ has an extremely simple distribution:
\[ Y_k^+ + k - 1 = \sum_{i=1}^{k} R_i^+ \sim \mathrm{Bi}(nk, c/n), \]
so
\[ P(Y_k^+ > 0) = P(Y_k^+ + k - 1 \ge k) = P\bigl( \mathrm{Bi}(nk, c/n) \ge k \bigr) = P\bigl( \mathrm{Bi}(nk, c/n) \ge ck + (1-c)k \bigr). \]
Letting $\delta = \min\{(1-c)/c,\, 1\}$, the Chernoff bound gives that this final probability is at most $e^{-\delta^2 ck/4}$. If we set $k = A\log n$ with $A > 8/(\delta^2 c)$, then we have $P(|C_v| \ge k) \le e^{-2\log n} = 1/n^2$, proving the following result.

We say that an event (formally, a sequence of events) holds with high probability or whp if its probability tends to $1$ as $n \to \infty$.

Theorem 5.10. Let $c < 1$ be constant. There is a constant $A > 0$ such that whp every component of $G(n, c/n)$ has size at most $A\log n$.

Proof. Define $A$ as above. Then the expected number of vertices in components larger than $A\log n$ is at most $n \cdot 1/n^2 = 1/n = o(1)$.
Now suppose that $c > 1$. Then our random walk has positive drift, at least to start with. Once the number $n - t - Y_t$ of unexplored vertices becomes smaller than $n/c$, this is no longer true.

Fix any $\varepsilon > 0$, and let $k^+ = (1 - 1/c - \varepsilon)n$. Now let $R_t^-$ be independent increments with the distribution $\mathrm{Bi}(n/c + \varepsilon n,\, c/n)$, and let $(Y_t^-)$ be the corresponding random walk. As long as we have used up at most $k^+$ vertices, our increments $R_t$ dominate $R_t^-$. It follows that for any $k \le k^+$ we have
\[ P(|C_v| = k) \le P(Y_k = 0) \le P(Y_k^- \le 0). \]
Once again, $Y_k^-$ has a simple distribution: it is $1 - k + \mathrm{Bi}(nk(c^{-1} + \varepsilon), c/n)$. Hence
\[ P(Y_k^- \le 0) = P\bigl( \mathrm{Bi}(nk(c^{-1}+\varepsilon), c/n) \le k - 1 \bigr) \le P\bigl( \mathrm{Bi}(nk(c^{-1}+\varepsilon), c/n) \le k \bigr). \]
The binomial has mean $\mu = k + \varepsilon ck$, so $k = (1 - \delta)\mu$ for $\delta = \varepsilon c/(1 + \varepsilon c)$, which is a positive constant. By the Chernoff bound, this probability is at most $e^{-\delta^2 \mu/2} \le e^{-\delta^2 k/2}$.

Choose a constant $A$ such that $A > 6/\delta^2$ and let $k^- = A\log n$. Then for $k^- \le k \le k^+$ we have
\[ P(|C_v| = k) \le e^{-\delta^2 k/2} \le e^{-\delta^2 k^-/2} \le e^{-3\log n} = 1/n^3. \]
Applying the union bound over $k^- \le k \le k^+$ and over all $n$ vertices $v$, it follows that whp there are no vertices at all in components of size between $k^-$ and $k^+$. In other words, whp all components are either small, i.e., of size at most $k^- = O(\log n)$, or large, i.e., of size at least $k^+ = (1 - 1/c - \varepsilon)n$.

From Lemma 5.9, we know that there are almost exactly $\eta n$ vertices in small components; hence there are almost exactly $(1-\eta)n$ vertices in large components.
Given a graph $G$, let $L_i(G)$ denote the number of vertices in the $i$th largest component. Note that which component is the $i$th largest may be ambiguous, if there are ties, but the value of $L_i(G)$ is unambiguous.

We close this section by showing that almost all vertices not in small components are in a unique giant component.
Theorem 5.11. Let $c > 0$ be constant, and let $G = G(n, c/n)$.
If $c < 1$ there is a constant $A$ such that $L_1(G) \le A\log n$ holds whp.
If $c > 1$ then $L_1(G)/n \xrightarrow{p} (1 - \eta)$, and there is a constant $A$ such that $L_2(G) \le A\log n$ holds whp.

Proof. We have almost completed the proof. The only remaining ingredient is to show in the $c > 1$ case that there cannot be two large components.

The simplest way to show this is just to choose $\varepsilon$ so that $(1 - 1/c - \varepsilon) > (1 - \eta)/2$. Then we don't have enough vertices in large components to have two or more large components. But is this possible? Such an $\varepsilon$ exists if and only if $(1 - 1/c) > (1 - \eta)/2$, i.e., if and only if $\eta > 2/c - 1$.

Recall that $\eta = \eta(c)$ is the smallest solution to $\eta = e^{-c(1-\eta)}$. Furthermore, for $0 \le x < \eta$ we have $x < e^{-c(1-x)}$, while for $\eta < x < 1$ we have $x > e^{-c(1-x)}$. So what we have to show is that $x = 2/c - 1$ falls into the first case, i.e., that
\[ 2/c - 1 < e^{-c(1-(2/c-1))} = e^{2-2c}. \]
Multiplying by $c$ and rearranging, let $f(c) = c\,e^{2-2c} + c - 2$, so our task is to show $f(c) > 0$ for $c > 1$. This is easy by calculus: we have $f(1) = 0$, $f'(1) = 0$ and $f''(c) > 0$ for $c > 1$. (In fact $f''(c) = 4(c-1)e^{2-2c}$.)
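For completeness, the routine differentiation behind the last sentence:
\[ f'(c) = e^{2-2c} - 2c\,e^{2-2c} + 1 = (1-2c)e^{2-2c} + 1, \qquad f'(1) = (1-2)e^{0} + 1 = 0, \]
\[ f''(c) = -2e^{2-2c} - 2(1-2c)e^{2-2c} = (4c-4)e^{2-2c} = 4(c-1)e^{2-2c} > 0 \quad\text{for } c > 1, \]
so $f$ is strictly convex on $(1,\infty)$ with $f(1) = f'(1) = 0$, and hence $f(c) > 0$ for all $c > 1$.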
End L12 Start L13
6 Correlation and concentration
6.1 The Harris–Kleitman Lemma
In this section we turn to the following simple question and its generalizations. Does conditioning on $G = G(n,p)$ containing a triangle make $G$ more or less likely to be connected? Note that if we condition on a fixed set $E$ of edges being present, then this is the same as simply adding $E$ to $G(n,p)$, which does increase the chance of connectedness. But conditioning on at least one triangle being present is not so simple.

Let $X$ be any finite set, the ground set. For $0 \le p \le 1$ let $X_p$ be a random subset of $X$ obtained by selecting each element independently with probability $p$. A property of subsets of $X$ is just some collection $\mathcal{A} \subseteq \mathcal{P}(X)$ of subsets of $X$. For example, the property "contains element 1 or element 3" may be identified with the set $\mathcal{A}$ of all subsets $A$ of $X$ with $1 \in A$ or $3 \in A$.

We write $P_p^X(\mathcal{A})$ for
\[ P(X_p \in \mathcal{A}) = \sum_{A \in \mathcal{A}} p^{|A|} (1-p)^{|X| - |A|}. \]
Most of the time, we omit $X$ from the notation, writing $P_p(\mathcal{A})$ for $P_p^X(\mathcal{A})$.
We say that $\mathcal{A} \subseteq \mathcal{P}(X)$ is an up-set, or increasing property, if $A \in \mathcal{A}$ and $A \subseteq B \subseteq X$ implies $B \in \mathcal{A}$. Similarly, $\mathcal{A}$ is a down-set or decreasing property if $A \in \mathcal{A}$ and $B \subseteq A$ implies $B \in \mathcal{A}$. Note that $\mathcal{A}$ is an up-set if and only if $\mathcal{A}^c = \mathcal{P}(X) \setminus \mathcal{A}$ is a down-set.

To illustrate the definitions, consider the (for us) most common special case. Here $X$ consists of all $\binom{n}{2}$ possible edges of $K_n$, and $X_p$ is then simply the edge-set of $G(n,p)$. Then a property of subsets of $X$ is just a set of graphs on $[n]$, e.g., all connected graphs. A property is increasing if it is preserved by adding edges, and decreasing if it is preserved by deleting edges.
Lemma 6.1 (Harris's Lemma). If $\mathcal{A}, \mathcal{B} \subseteq \mathcal{P}(X)$ are up-sets and $0 \le p \le 1$ then
\[ P_p(\mathcal{A} \cap \mathcal{B}) \ge P_p(\mathcal{A})\, P_p(\mathcal{B}). \]
In other words, $P\bigl( X_p \in \mathcal{A} \text{ and } X_p \in \mathcal{B} \bigr) \ge P(X_p \in \mathcal{A})\, P(X_p \in \mathcal{B})$, i.e., $P\bigl( X_p \in \mathcal{A} \mid X_p \in \mathcal{B} \bigr) \ge P(X_p \in \mathcal{A})$, i.e., increasing properties are positively correlated.
Proof. We use induction on $n = |X|$. The base case $n = 0$ makes perfect sense and holds trivially, though you can start from $n = 1$ if you prefer.

Suppose that $|X| = n \ge 1$ and that the result holds for smaller sets $X$. Without loss of generality, let $X = [n] = \{1, 2, \ldots, n\}$.

For any $\mathcal{C} \subseteq \mathcal{P}(X)$ let
\[ \mathcal{C}_0 = \{ C \in \mathcal{C} : n \notin C \} \subseteq \mathcal{P}([n-1]) \]
and
\[ \mathcal{C}_1 = \{ C \setminus \{n\} : C \in \mathcal{C},\ n \in C \} \subseteq \mathcal{P}([n-1]). \]
Thus $\mathcal{C}_0$ and $\mathcal{C}_1$ correspond to the subsets of $\mathcal{C}$ not containing and containing $n$ respectively, except that for $\mathcal{C}_1$ we delete $n$ from every set to obtain a collection of subsets of $[n-1]$.

Note that
\[ P_p(\mathcal{C}) = (1-p)\, P_p(\mathcal{C}_0) + p\, P_p(\mathcal{C}_1). \qquad (11) \]
More precisely,
\[ P_p^{[n]}(\mathcal{C}) = (1-p)\, P_p^{[n-1]}(\mathcal{C}_0) + p\, P_p^{[n-1]}(\mathcal{C}_1). \]
Suppose now that $\mathcal{A}$ and $\mathcal{B} \subseteq \mathcal{P}([n])$ are up-sets. Then $\mathcal{A}_0$, $\mathcal{A}_1$, $\mathcal{B}_0$ and $\mathcal{B}_1$ are all up-sets in $\mathcal{P}([n-1])$. Also, $\mathcal{A}_0 \subseteq \mathcal{A}_1$ and $\mathcal{B}_0 \subseteq \mathcal{B}_1$. Let $a_0 = P_p(\mathcal{A}_0)$ etc., so certainly $a_0 \le a_1$ and $b_0 \le b_1$.

Since $(\mathcal{A} \cap \mathcal{B})_i = \mathcal{A}_i \cap \mathcal{B}_i$, by (11) and the induction hypothesis we have
\[ P_p(\mathcal{A} \cap \mathcal{B}) = (1-p)\, P_p(\mathcal{A}_0 \cap \mathcal{B}_0) + p\, P_p(\mathcal{A}_1 \cap \mathcal{B}_1) \ge (1-p) a_0 b_0 + p a_1 b_1 = x, \]
say. On the other hand,
\[ P_p(\mathcal{A})\, P_p(\mathcal{B}) = \bigl( (1-p)a_0 + p a_1 \bigr)\bigl( (1-p)b_0 + p b_1 \bigr) = y, \]
say. So it remains to show that $x \ge y$. But
\[ x - y = \bigl( (1-p) - (1-p)^2 \bigr) a_0 b_0 - p(1-p) a_0 b_1 - p(1-p) a_1 b_0 + (p - p^2) a_1 b_1 = p(1-p)(a_1 - a_0)(b_1 - b_0) \ge 0, \]
recalling that $a_0 \le a_1$ and $b_0 \le b_1$.
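As an unofficial sanity check, Harris's Lemma can be verified by brute force on a tiny ground set; in the Python sketch below the two up-sets are arbitrary examples of my own choosing:

    from itertools import combinations

    def prob(family, X, p):
        """P_p(family) = sum over A in family of p^|A| (1-p)^(|X|-|A|)."""
        return sum(p ** len(A) * (1 - p) ** (len(X) - len(A)) for A in family)

    def up_closure(generators, X):
        """Smallest up-set of P(X) containing the given generating sets."""
        subsets = [frozenset(S) for r in range(len(X) + 1) for S in combinations(X, r)]
        gens = [frozenset(g) for g in generators]
        return [A for A in subsets if any(g <= A for g in gens)]

    X, p = frozenset(range(4)), 0.3
    A = up_closure([{0, 1}], X)        # "contains both 0 and 1"
    B = up_closure([{1, 2}, {3}], X)   # "contains 1 and 2, or contains 3"
    AB = [S for S in A if S in B]
    print(prob(AB, X, p), ">=", prob(A, X, p) * prob(B, X, p))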
Harris's Lemma has an immediate corollary concerning two down-sets, or one up- and one down-set.

Corollary 6.2. If $\mathcal{U}$ is an up-set and $\mathcal{D}_1$ and $\mathcal{D}_2$ are down-sets, then
\[ P_p(\mathcal{U} \cap \mathcal{D}_1) \le P_p(\mathcal{U})\, P_p(\mathcal{D}_1), \]
and
\[ P_p(\mathcal{D}_1 \cap \mathcal{D}_2) \ge P_p(\mathcal{D}_1)\, P_p(\mathcal{D}_2). \]
Proof. Exercise, using the fact that $\mathcal{D}_i^c$ is an up-set.
An up-set $\mathcal{U}$ is called principal if it is of the form $\mathcal{U} = \{A \subseteq X : S \subseteq A\}$ for some fixed $S \subseteq X$. Recall that conditioning on such a $\mathcal{U}$ is the same as simply adding the elements of $S$ into our random set $X_p$.

Corollary 6.3. If $\mathcal{D}_1$ and $\mathcal{D}_2$ are down-sets and $\mathcal{U}_0$ is a principal up-set, then
\[ P_p(\mathcal{D}_1 \mid \mathcal{U}_0 \cap \mathcal{D}_2) \le P_p(\mathcal{D}_1 \mid \mathcal{U}_0). \]
Proof. Thinking of the elements of the set $S \subseteq X$ corresponding to $\mathcal{U}_0$ as always present, this can be viewed as a statement about two down-sets $\mathcal{D}_1'$ and $\mathcal{D}_2'$ in $\mathcal{P}(X \setminus S)$. Apply Harris's Lemma to these.
6.2 Janson's inequalities
We have shown (from the Chernoff bounds) that, roughly speaking, if we have many independent events and the expected number that hold is large, then the probability that none holds is very small. What if our events are not quite independent, but each depends on only a few others?
End L13 Start L14
Let $E_1, \ldots, E_k$ be sets of possible edges of $G(n,p)$, and let $A_i$ be the event that all edges in $E_i$ are present in $G(n,p)$. Note that each $A_i$ is a principal up-set. Let $X$ be the number of $A_i$ that hold. (For example, if the $E_i$ list all possible edge sets of triangles, then $X$ is the number of triangles in $G(n,p)$.) As usual, let $\mu = E[X] = \sum_i P(A_i)$.

As in Chapter 2, write $i \sim j$ if $i \ne j$ and $E_i \cap E_j \ne \emptyset$, and let
\[ \Delta = \sum_{j \sim i} P(A_i \cap A_j). \]
Theorem 6.4. In the setting above, we have $P(X = 0) \le e^{-\mu + \Delta/2}$.
Before turning to the proof, note that
\[ P(X = 0) = P(A_1^c \cap \cdots \cap A_k^c) = P(A_1^c)\, P(A_2^c \mid A_1^c) \cdots P(A_k^c \mid A_1^c \cap \cdots \cap A_{k-1}^c) \ge \prod_{i=1}^{k} P(A_i^c) = \prod_{i=1}^{k} \bigl(1 - P(A_i)\bigr), \]
where we used Harris's Lemma and the fact that the intersection of two or more down-sets is again a down-set. In the (typical) case that all $P(A_i)$ are small, the final bound is roughly $e^{-\sum P(A_i)} = e^{-\mu}$, so (if $\Delta$ is small) Theorem 6.4 is saying that the probability that $X = 0$ is not much larger than the minimum it could possibly be.
Proof. Let $r_i = P(A_i \mid A_1^c \cap \cdots \cap A_{i-1}^c)$. Note that
\[ P(X = 0) = P(A_1^c \cap \cdots \cap A_k^c) = \prod_{i=1}^{k} (1 - r_i). \qquad (12) \]
Our aim is to show that $r_i$ is not much smaller than $P(A_i)$.

Fix $i$, and let $D_1$ be the intersection of those $A_j^c$ where $j < i$ and $j \sim i$. Let $D_0$ be the intersection of those $A_j^c$ where $j < i$ and $j \not\sim i$. Thus
\[ r_i = P(A_i \mid D_0 \cap D_1). \]
For any three events,
\[ P(A \mid B \cap C) = \frac{P(A \cap B \cap C)}{P(B \cap C)} \ge \frac{P(A \cap B \cap C)}{P(C)} = P(A \cap B \mid C). \]
Thus $r_i \ge P(A_i \cap D_1 \mid D_0) = P(A_i \mid D_0)\, P(D_1 \mid A_i \cap D_0)$.

Now $D_0$ depends only on the presence of edges in $\bigcup_{j<i,\, j \not\sim i} E_j$, which is disjoint from $E_i$. Hence $P(A_i \mid D_0) = P(A_i)$. Also, $A_i$ is a principal up-set and $D_0$ and $D_1$ are down-sets, so by Corollary 6.3 we have $P(D_1 \mid A_i \cap D_0) \ge P(D_1 \mid A_i)$. Thus
\[ r_i \ge P(A_i)\, P(D_1 \mid A_i) = P(A_i \cap D_1) = P\Bigl( A_i \setminus \bigcup_{j<i,\, j\sim i} (A_i \cap A_j) \Bigr) = P(A_i) - P\Bigl( \bigcup_{j<i,\, j\sim i} (A_i \cap A_j) \Bigr) \ge P(A_i) - \sum_{j<i,\, j\sim i} P(A_i \cap A_j). \]
By (12) we thus have
\[ P(X = 0) = \prod_i (1 - r_i) \le \prod_i e^{-r_i} = \exp\Bigl( -\sum_i r_i \Bigr) \le \exp\Bigl( -\sum_{i=1}^{k} P(A_i) + \sum_{j\sim i,\, j<i} P(A_i \cap A_j) \Bigr) = \exp(-\mu + \Delta/2). \]
When $\Delta$ is much larger than $\mu$, Theorem 6.4 is not very useful. But there is a trick to deduce something from it in this case.

Theorem 6.5. In the setting of Theorem 6.4, if $\Delta \ge \mu$ then $P(X = 0) \le e^{-\frac{\mu^2}{2\Delta}}$.
Proof. For any $S \subseteq [k]$, by Theorem 6.4 we have
\[ P(X = 0) = P\Bigl( \bigcap_{i=1}^{k} A_i^c \Bigr) \le P\Bigl( \bigcap_{i \in S} A_i^c \Bigr) \le e^{-\mu_S + \Delta_S/2}, \qquad (13) \]
where
\[ \mu_S = \sum_{i \in S} P(A_i) = \sum_{i=1}^{k} I_{\{i \in S\}} P(A_i) \]
and
\[ \Delta_S = \sum_{i \in S} \sum_{j \in S,\, j \sim i} P(A_i \cap A_j) = \sum_{j \sim i} I_{\{i, j \in S\}} P(A_i \cap A_j). \]
Suppose now that $0 \le r \le 1$, and let $S$ be the random subset of $[k]$ obtained by selecting each element independently with probability $r$. Then $\mu_S$ and $\Delta_S$ become random variables. By linearity of expectation we have
\[ E[\mu_S] = \sum_i r P(A_i) = r\mu \]
and
\[ E[\Delta_S] = \sum_{j \sim i} P(A_i \cap A_j)\, P(i, j \in S) = r^2 \Delta. \]
Thus $E[\mu_S - \Delta_S/2] = r\mu - r^2\Delta/2$.

Since a random variable cannot always be smaller than its mean, there exists some set $S$ such that $\mu_S - \Delta_S/2 \ge r\mu - r^2\Delta/2$. Applying (13) to this particular set $S$ it follows that
\[ P(X = 0) \le e^{-r\mu + r^2\Delta/2}. \]
This bound is valid for any $0 \le r \le 1$; to get the best result we optimize, which simply involves setting $r = \mu/\Delta \le 1$. Then we obtain
\[ P(X = 0) \le e^{-\frac{\mu^2}{\Delta} + \frac{\mu^2}{2\Delta}} = e^{-\frac{\mu^2}{2\Delta}}. \]
Together Theorems 6.4 and 6.5 give the following.

Corollary 6.6. Under the assumptions of Theorem 6.4 we have
\[ P(X = 0) \le \exp\bigl( -\min\{\mu/2,\ \mu^2/(2\Delta)\} \bigr). \]
Proof. For $\Delta < \mu$ apply Theorem 6.4 (then $-\mu + \Delta/2 \le -\mu/2$); for $\Delta \ge \mu$ apply Theorem 6.5.
End L14 Start L15
How do the second moment method and Janson's inequalities compare? Suppose that $X$ is the number of events $A_i$ that hold, let $\mu = E[X]$, and let $\Delta = \sum_{j\sim i} P(A_i \cap A_j)$, as in the context of Corollary 2.4. Then Corollary 2.4 says that if $\mu \to \infty$ and $\Delta = o(\mu^2)$ (i.e., $\mu^2/\Delta \to \infty$), then $P(X = 0) \to 0$. More concretely, if $\mu \ge L$ and $\mu^2/\Delta \ge L$, then the proof of Corollary 2.4 gives $P(X = 0) \le 2/L$.

Janson's inequality, in the form of Corollary 6.6, has more restrictive assumptions: the events $A_i$ have to be events of a specific type. When this holds, the $\Delta$ there is the same as before. When $\mu \ge L$ and $\mu^2/\Delta \ge L$, the conclusion is that
\[ P(X = 0) \le e^{-L/2}. \]
Both bounds imply that $P(X = 0) \to 0$ when $\mu$ and $\mu^2/\Delta$ both tend to infinity, but when Janson's inequalities apply, the concrete bound they give is exponentially stronger than that from the second moment method.
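As a concrete (unofficial) illustration, take the $A_i$ to be the triangle events, so that $X$ is the number of triangles in $G(n,p)$. Then $\mu = \binom{n}{3}p^3$ and, counting ordered pairs of distinct triangles sharing an edge, $\Delta = \binom{n}{2}(n-2)(n-3)p^5$. The Python sketch below (parameter choices arbitrary) evaluates a Chebyshev-type bound of the same flavour as Corollary 2.4 alongside the bound from Corollary 6.6:

    from math import comb, exp

    def triangle_bounds(n, p):
        mu = comb(n, 3) * p ** 3                        # expected number of triangles
        delta = comb(n, 2) * (n - 2) * (n - 3) * p ** 5  # ordered pairs sharing one edge
        chebyshev = 1 / mu + delta / mu ** 2             # P(X=0) <= Var(X)/mu^2 <= 1/mu + delta/mu^2
        janson = exp(-min(mu / 2, mu ** 2 / (2 * delta)))
        return mu, delta, chebyshev, janson

    print(triangle_bounds(n=1000, p=0.01))

For these values $\Delta < \mu$, so the Janson bound is of order $e^{-\mu/2}$, astronomically smaller than the polynomial Chebyshev-style bound.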
6.3 Clique and chromatic number of G(n, p)
We shall illustrate the power of Janson's inequality by using it to study the chromatic number of $G(n,p)$. The ideas are more important than the details of the calculations. We start by looking at something much simpler: the clique number.

Throughout this section $p$ is constant with $0 < p < 1$.

As usual, let $\omega(G) = \max\{k : G \text{ contains a copy of } K_k\}$ be the clique number of $G$. For $k = k(n)$ let $X_k$ be the number of copies of $K_k$ in $G = G(n,p)$, and
\[ \mu_k = E[X_k] = \binom{n}{k} p^{\binom{k}{2}}. \]
Note that
\[ \frac{\mu_{k+1}}{\mu_k} = \binom{n}{k+1}\binom{n}{k}^{-1} p^{\binom{k+1}{2} - \binom{k}{2}} = \frac{n-k}{k+1}\, p^{k}, \qquad (14) \]
which is a decreasing function of $k$. It follows that the ratio is at least $1$ up to some point and then at most one, so $\mu_k$ first increases from $\mu_0 = 1$, $\mu_1 = n, \ldots$, and then decreases. We define
\[ k_0 = k_0(n, p) = \min\{k : \mu_k < 1\}. \]
Lemma 6.7. With $0 < p < 1$ fixed we have $k_0 \sim 2\log_{1/p} n = 2\frac{\log n}{\log(1/p)}$ as $n \to \infty$.

Proof. Using standard binomial bounds,
\[ \Bigl( \frac{n}{k} \Bigr)^{k} p^{k(k-1)/2} \le \mu_k \le \Bigl( \frac{en}{k} \Bigr)^{k} p^{k(k-1)/2}. \]
Taking the $k$th root it follows that
\[ \mu_k^{1/k} = \Theta\Bigl( \frac{n}{k}\, p^{(k-1)/2} \Bigr) = \Theta\Bigl( \frac{n}{k}\, p^{k/2} \Bigr). \]
Let $\varepsilon > 0$ be given.

If $k \le (1-\varepsilon) 2\log_{1/p} n$ then $k/2 \le (1-\varepsilon)\log_{1/p} n$, so $(1/p)^{k/2} \le n^{1-\varepsilon}$, i.e., $p^{k/2} \ge n^{-1+\varepsilon}$. Thus $\mu_k^{1/k}$ is at least a constant times $n \cdot n^{-1+\varepsilon}/\log n = n^{\varepsilon}/\log n$, so $\mu_k^{1/k} > 1$ if $n$ is large. Hence $\mu_k > 1$, so $k_0 > k$.

Similarly, if $k \ge (1+\varepsilon) 2\log_{1/p} n$ then $p^{k/2} \le n^{-1-\varepsilon}$, and if $n$ is large enough it follows that $\mu_k < 1$, so $k_0 \le k$. So for any fixed $\varepsilon$ we have
\[ (1-\varepsilon)\, 2\log_{1/p} n \le k_0 \le (1+\varepsilon)\, 2\log_{1/p} n \]
if $n$ is large enough, i.e., $k_0 \sim 2\log_{1/p} n$.
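A quick (unofficial) numerical check of the definition, computing $k_0(n,p)$ directly and comparing it with the first-order asymptotic $2\log_{1/p} n$; note that the convergence is fairly slow, so the two values differ noticeably for moderate $n$:

    from math import comb, log

    def k0(n, p):
        """k0(n, p) = min{k : E[number of k-cliques in G(n, p)] < 1}."""
        k = 0
        while comb(n, k) * p ** (k * (k - 1) // 2) >= 1:
            k += 1
        return k

    for n in (10**3, 10**4, 10**5):
        print(n, k0(n, 0.5), 2 * log(n) / log(2))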
Note for later that if $k \sim k_0$ then
\[ \Bigl( \frac{1}{p} \Bigr)^{k} = n^{2+o(1)}, \qquad (15) \]
so from (14) we have
\[ \frac{\mu_{k+1}}{\mu_k} = \frac{n - O(\log n)}{\Theta(\log n)}\, n^{-2+o(1)} = n^{-1+o(1)}. \qquad (16) \]
Lemma 6.8. With $0 < p < 1$ fixed we have $P\bigl( \omega(G(n,p)) > k_0 \bigr) \to 0$ as $n \to \infty$.

Proof. We have $\omega(G(n,p)) > k_0$ if and only if $X_{k_0+1} > 0$, which has probability at most $E[X_{k_0+1}] = \mu_{k_0+1}$. Now $\mu_{k_0} < 1$ by definition, so by (16) we have $\mu_{k_0+1} \le n^{-1+o(1)}$, so $\mu_{k_0+1} \to 0$.
Let $\Delta_k$ be the expected number of ordered pairs of distinct $k$-cliques sharing at least one edge. This is exactly the quantity $\Delta$ appearing in Corollaries 2.4 and 6.6 when $X$ is the number of $k$-cliques.

Lemma 6.9. Suppose that $k \sim k_0$. Then
\[ \frac{\Delta_k}{\mu_k^2} \le \max\Bigl\{ n^{-2+o(1)},\ \frac{n^{-1+o(1)}}{\mu_k} \Bigr\}. \]
In particular, if $\mu_k \to \infty$ then $\Delta_k = o(\mu_k^2)$.
Proof. We have
\[ \Delta_k = \sum_{s=2}^{k-1} \binom{n}{k}\binom{k}{s}\binom{n-k}{k-s} p^{2\binom{k}{2} - \binom{s}{2}}, \]
so
\[ \frac{\Delta_k}{\mu_k^2} = \sum_{s=2}^{k-1} \alpha_s, \quad\text{where}\quad \alpha_s = \frac{\binom{k}{s}\binom{n-k}{k-s}}{\binom{n}{k}}\, p^{-\binom{s}{2}}. \]
Let
\[ \beta_s = \frac{\alpha_{s+1}}{\alpha_s} = \frac{k-s}{s+1} \cdot \frac{k-s}{n - 2k + s + 1} \cdot p^{-s}, \]
so
\[ \beta_s = n^{-1+o(1)} \Bigl( \frac{1}{p} \Bigr)^{s}. \qquad (17) \]
In particular, using (15) we have $\beta_s < 1$ for $s \le k/4$, say, and $\beta_s > 1$ for $s \ge 3k/4$. In between we have $\beta_{s+1}/\beta_s \sim 1/p$, so $\beta_{s+1}/\beta_s \ge 1$, and $\beta_s$ is increasing when $s$ runs from $k/4$ to $3k/4$.

It follows that there is some $s_0 \in [k/4, 3k/4]$ such that $\beta_s \le 1$ for $s \le s_0$ and $\beta_s > 1$ for $s > s_0$. In other words, the sequence $\alpha_s$ decreases and then increases. Hence $\max\{\alpha_s : 2 \le s \le k-1\} = \max\{\alpha_2, \alpha_{k-1}\}$, so
\[ \frac{\Delta_k}{\mu_k^2} = \sum_{s=2}^{k-1} \alpha_s \le k \max\{\alpha_2, \alpha_{k-1}\} = n^{o(1)} \max\{\alpha_2, \alpha_{k-1}\}. \]
End L15 Start L16
Either calculating directly, or using $\alpha_0 \sim 1$, $\alpha_2 = \alpha_0 \beta_0 \beta_1$, and the approximate formula for $\beta_s$ in (17), one can check that $\alpha_2 = n^{-2+o(1)}$. Similarly, $\alpha_k = 1/\mu_k$ and $\alpha_{k-1} = \alpha_k/\beta_{k-1} = n^{-1+o(1)}/\mu_k$, using (17) and (15).
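For the record, the omitted calculation (using (17) with $s = 0, 1$ and $s = k-1$, together with (15)) runs as follows:
\[ \alpha_0 = \frac{\binom{n-k}{k}}{\binom{n}{k}} \sim 1, \qquad \alpha_2 = \alpha_0\,\beta_0\,\beta_1 = n^{o(1)} \cdot n^{-1+o(1)} \cdot n^{-1+o(1)} = n^{-2+o(1)}, \]
\[ \alpha_k = \frac{1}{\binom{n}{k} p^{\binom{k}{2}}} = \frac{1}{\mu_k}, \qquad \beta_{k-1} = n^{-1+o(1)}(1/p)^{k-1} = n^{1+o(1)}, \qquad \alpha_{k-1} = \frac{\alpha_k}{\beta_{k-1}} = \frac{n^{-1+o(1)}}{\mu_k}. \]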
Theorem 6.10. Let $0 < p < 1$ be fixed. Define $k_0 = k_0(n,p)$ as above, and let $G = G(n,p)$. Then
\[ P\bigl( k_0 - 2 \le \omega(G) \le k_0 \bigr) \to 1. \]
Proof. The upper bound is Lemma 6.8. For the lower bound, $\mu_{k_0-1} \ge 1$ by the definition of $k_0$. So by (16) we have $\mu_{k_0-2} \ge n^{1-o(1)}$, and in particular $\mu_{k_0-2} \to \infty$. Let $k = k_0 - 2$. Then by Lemma 6.9 we have $\Delta_k = o(\mu_k^2)$. Hence by the second moment method (Corollary 2.4) we have $P(\omega(G) < k_0 - 2) = P(X_k = 0) \to 0$.
Note that we have pinned down the clique number to one of three values; with only a very little more care, we can pin it down to two values or, almost all the time, a single value! (Usually, $\mu_{k_0-1}$ is much larger than one, and $\mu_{k_0}$ much smaller than one.)
Using Janson's inequality, we can get a very tight bound on the probability that the clique number is significantly smaller than expected.

Theorem 6.11. Under the assumptions of Theorem 6.10 we have
\[ P\bigl( \omega(G) < k_0 - 3 \bigr) \le e^{-n^{2-o(1)}}. \]
Note that this is an almost inconceivably small probability: the probability that $G(n,p)$ contains no edges at all is $(1-p)^{\binom{n}{2}} = e^{-\Theta(n^2)}$.
Proof. Let $k = k_0 - 3$. Then arguing as above we have $\mu_k \ge n^{2-o(1)}$. Hence by Lemma 6.9 we have $\Delta_k/\mu_k^2 \le n^{-2+o(1)}$, so $\mu_k^2/\Delta_k \ge n^{2-o(1)}$. Thus by Janson's inequality (Corollary 6.6) we have $P(X_k = 0) \le e^{-n^{2-o(1)}}$.
Why is such a good error bound useful? Because it allows us to study the chromatic number, by showing that with high probability every subgraph of a decent size contains a fairly large independent set.

Theorem 6.12 (Bollobás). Let $0 < p < 1$ be constant and let $G = G(n,p)$. Then for any fixed $\varepsilon > 0$ we have
\[ P\Bigl( (1-\varepsilon)\, \frac{n}{2\log_x n} \le \chi(G) \le (1+\varepsilon)\, \frac{n}{2\log_x n} \Bigr) \to 1 \]
as $n \to \infty$, where $x = 1/(1-p)$.
Proof. Apply Theorem 6.10 to the complement $G^c$ of $G$, noting that $G^c \sim G(n, 1-p)$. Writing $\alpha(G)$ for the independence number of $G$, we find that with probability tending to $1$ we have $\alpha(G) = \omega(G^c) \le k_0(n, 1-p) \sim 2\log_x n$. Since $\chi(G) \ge n/\alpha(G)$, this gives the lower bound.

For the upper bound, let $m = n/(\log n)^2$, say. For each subset $V$ of $V(G)$ with $|V| = m$, let $E_V$ be the event that $G[V]$ contains an independent set of size at least $k = k_0(m, 1-p) - 3$. Note that
\[ k \sim 2\log_x m \sim 2\log_x n. \]
Applying Theorem 6.11 to the complement of $G[V]$, which has the distribution of $G(m, 1-p)$, we have
\[ P(E_V^c) \le e^{-m^{2-o(1)}} = e^{-n^{2-o(1)}}. \]
Let $E = \bigcap_{|V|=m} E_V$. Considering the $\binom{n}{m} \le 2^n$ possible sets separately, the union bound gives
\[ P(E^c) \le \sum_{|V|=m} P(E_V^c) \le 2^n e^{-n^{2-o(1)}} \to 0. \]
It follows that $E$ holds with high probability. But when $E$ holds one can colour by greedily choosing independent sets of size at least $k$ for the colour classes, until at most $m$ vertices remain, and then simply using one colour for each remaining vertex. Since we use at most $n/k$ sets of size at least $k$, this shows that when $E$ holds, then
\[ \chi(G(n,p)) \le \frac{n}{k} + m = (1+o(1))\, \frac{n}{2\log_x n} + \frac{n}{(\log n)^2} \sim \frac{n}{2\log_x n}, \]
completing the proof.
7 Postscript: other models
There are several standard models of random graphs on the vertex set $[n] = \{1, 2, \ldots, n\}$. We have focussed on $G(n,p)$, where each possible edge is included independently with probability $p$.

The model originally studied by the founders of the theory of random graphs, Erdős and Rényi, is slightly different. Fix $n \ge 1$ and $0 \le m \le N = \binom{n}{2}$. The random graph $G(n,m)$ is the graph with vertex set $[n]$ obtained by choosing exactly $m$ edges randomly, with all $\binom{N}{m}$ possible sets of $m$ edges equally likely.

For most questions (but not, for example, "is the number of edges even"), $G(n,p)$ and $G(n,m)$ behave very similarly, provided we choose the density parameters in a corresponding way, i.e., take $p \approx m/N$.
Often, we consider random graphs of different densities simultaneously. In $G(n,m)$, there is a natural way to do this, called the random graph process. This is the random sequence $(G_m)_{m=0,1,\ldots,N}$ of graphs on $[n]$ obtained by starting with no edges, and adding edges one-by-one in a random order, with all $N!$ orders equally likely. Note that each individual $G_m$ has the distribution of $G(n,m)$: we take the first $m$ edges in a random order, so all possibilities are equally likely. The key point is that in the sequence $(G_m)$, we define all the $G_m$ together, in such a way that if $m_1 < m_2$, then $G_{m_1} \subseteq G_{m_2}$. (This is called a coupling of the distributions $G(n,m)$ for different $m$.)
There is a similar coupling in the $G(n,p)$ setting, the continuous-time random graph process. This is the random sequence $(G_t)_{t \in [0,1]}$ defined as follows: for each possible edge $e$, let $U_e$ be a random variable with the uniform distribution on the interval $[0,1]$, with the different $U_e$ independent. Let the edge set of $G_t$ be $\{e : U_e \le t\}$. (Formally this defines a random function $t \mapsto G_t$ from $[0,1]$ to the set of graphs on $[n]$.) One can think of $U_e$ as giving the time at which the edge $e$ is born; $G_t$ consists of all edges born by time $t$. For any $p$, $G_p$ has the distribution of $G(n,p)$, but again these distributions are coupled in the natural way: if $p_1 < p_2$ then $G_{p_1} \subseteq G_{p_2}$.
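A minimal sketch (not from the notes; names arbitrary) of this coupling: sample the birth times $U_e$ once, then threshold them at several values of $t$ to obtain nested graphs.

    import random
    from itertools import combinations

    def coupled_graphs(n, thresholds, seed=0):
        """Return the edge sets {e : U_e <= t} for each t in thresholds,
        all driven by the same birth times U_e."""
        rng = random.Random(seed)
        birth = {e: rng.random() for e in combinations(range(n), 2)}
        return {t: {e for e, u in birth.items() if u <= t} for t in thresholds}

    gs = coupled_graphs(50, [0.02, 0.05, 0.1])
    print([len(gs[t]) for t in (0.02, 0.05, 0.1)], gs[0.02] <= gs[0.05] <= gs[0.1])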
Of course there are many other random graph models not touched on in this course (as well as many more results about $G(n,p)$). These include other classical models, such as the configuration model of Bollobás, and also new inhomogeneous models introduced as more realistic models for networks in the real world.