The CS Theorist's Toolkit
December 5, 2005
Abstract
These are edited lecture notes from a graduate course at the Computer Science department
of Princeton University in Fall 2002. The course was my attempt to teach first-year graduate
students many mathematical tools useful in theoretical computer science. Of course, the goal
was too ambitious for a course with 12 three-hour lectures. I had to relegate some topics to
homework; these include online algorithms, Yao's lemma as a way to lowerbound randomized
complexity, Madhu Sudan's list decoding algorithm (useful recently in complexity theory and
pseudorandomness), and pseudorandom properties of expander graphs. If I had time for another
lecture I would have covered basic information theory.
To put the choice of topics in context, I should mention that our theory grads take a
two-semester course sequence on Advanced Algorithm Design and Complexity Theory during their
first year, and I did not wish to duplicate topics covered in them. Inevitably, the choice of
topics (especially the final two chapters) also reflected my own current research interests.
The scribe notes were written by students, and I have attempted to edit them. For this
techreport I decided to reshuffle material for coherence, and so the scribe names given with
each chapter do not completely reflect who wrote that chapter. So I will list all the scribes
here and thank them for their help: Tony Wirth, Satyen Kale, Miroslav Dudik, Paul Chang, Elad
Hazan, Elena Nabieva, Nir Ailon, Renato F. Werneck, Loukas Georgiadis, Manoj M.P., and Edith
Elkind. I hope the course was as much fun for them as it was for me.
Sanjeev Arora
March 2003
Contents

1 Probabilistic Arguments
  1.1 Introduction
  1.2 Independent sets in random graphs
  1.3 Graph colouring: local versus global
  1.4 Bounding distribution tails

2 LP Duality
  2.1 Introduction
  2.2 Duality
  2.3 Example: Max-Flow Min-Cut
  2.4 Approximate Inclusion-Exclusion
  2.5 A note on algorithms

3 The Dimension Method
  3.1 Basics: Fields and Vector Spaces
  3.2 Systems of Linear Equations
  3.3 Dispersal of Information Using Polynomials
  3.4 Hashing: An introduction
  3.5 Pairwise and k-wise Independent Sampling
  3.6 Madhu Sudan's List Decoding Algorithm
  3.7 The Razborov-Smolensky Circuit Lower Bound

10 Semidefinite Programming
  10.1 Introduction
  10.2 Basic definitions
  10.3 An SDP for Graph Expansion
  10.4 0.878-Approximation for Max Cut
  10.5 Spectral Partitioning of Random Graphs
Chapter 1
Probabilistic Arguments
Scribe: Tony Wirth
1.1 Introduction
Think of the topics in this course as a toolbox to rely upon during your research career.
Today's lecture shows the application of simple probabilistic arguments to prove seemingly
difficult results. Alon and Spencer's text goes into considerably more detail on this topic,
as will Sudakov's course in the Math dept.
A random variable is a mapping from a probability space to R. To give an example,
the probability space could be that of all possible outcomes of n tosses of a fair coin, and
Xi is the random variable that is 1 if the ith toss is a head, and is 0 otherwise.
Let X_1, X_2, X_3, . . . , X_n be a sequence of random variables. The first observation we make is the Linearity of Expectation, viz.
$$E\Big[\sum_i X_i\Big] = \sum_i E[X_i].$$
It is important to realize that linearity holds regardless of whether or not the random variables are independent.
Can we say something about E[X_1 X_2]? In general, nothing much, but if X_1, X_2 are independent (formally, this means that for all a, b, Pr[X_1 = a, X_2 = b] = Pr[X_1 = a] Pr[X_2 = b]), then E[X_1 X_2] = E[X_1] E[X_2].
The first of a number of inequalities presented today, Markov's inequality, says that any non-negative random variable X satisfies
$$\Pr[X \geq k\, E[X]] \leq \frac{1}{k}.$$
Note that this is just another way to write the trivial observation that E[X] \geq k \cdot \Pr[X \geq k]. Sometimes we refer to the application of Markov's inequality as an averaging argument.
Can we give any meaningful upper bound on Pr[X < c E[X]] where c < 1, in other words the probability that X is a lot less than its expectation? In general we cannot. However, if we know an upper bound on X then we can. For example, if X \in [0, 1] and E[X] = \mu, then for any c < 1 we have (simple exercise)
$$\Pr[X \leq c\mu] \leq \frac{1-\mu}{1-c\mu}.$$
1.2 Independent sets in random graphs

We illustrate the power of the averaging argument by studying the size of the largest independent set in random graphs. Recall that an independent set is a collection of nodes between every pair of which there is no edge (an anti-clique, if you like). It is NP-hard to determine the size of the largest independent set in a graph in general.
Let G(n, 1/2) stand for the distribution on graphs with n vertices in which the probability that each edge is included in the graph is 1/2. In fact, G(n, 1/2) is the uniform distribution on graphs with n nodes (verify this!). Thus one of the reasons for studying random graphs is to study the performance of some algorithms on the average graph.
What is the size of the largest independent set in a random graph?
Theorem 1
The probability that a graph drawn from G(n, 1/2) has an independent set of size greater than 2 log n is tiny.

Proof: For all subsets S of {1, 2, . . . , n}, let X_S be an indicator random variable for S being an independent set. Now, let the r.v. Y_k be the number of independent sets of size k; that is,
$$Y_k = \sum_{S : |S| = k} X_S, \qquad \text{so} \qquad E[Y_k] = \sum_{S : |S| = k} E[X_S].$$
Since
$$E[X_S] = \Pr[X_S = 1] = \frac{1}{2^{\binom{|S|}{2}}},$$
and using the estimates
$$\binom{n}{k} \leq \Big(\frac{ne}{k}\Big)^k \quad \text{and} \quad \binom{k}{2} \approx \frac{k^2}{2}, \tag{1.1}$$
we can show that
$$E[Y_k] = \binom{n}{k}\,\frac{1}{2^{\binom{k}{2}}} \leq \Big(\frac{ne}{k}\Big)^k \frac{1}{2^{k^2/2}} = \Big(\frac{ne}{k\,2^{k/2}}\Big)^k = \Big(\frac{e}{2\log n}\Big)^{2\log n}, \tag{1.2}$$
substituting 2 log n for k. Hence the mean value of Y_k tends to zero very rapidly.
In particular, Markov's inequality tells us that Pr[Y_k \geq 1] \leq E[Y_k], and hence that the probability of an independent set of size 2 log n is tiny. □

Since the distribution G(n, 1/2) picks each edge with probability 1/2, it also leaves out each edge with probability 1/2. Thus we can simultaneously bound the maximum size of a clique by 2 log n.
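To get a feel for how sharply E[Y_k] collapses as k grows, here is a small Python computation (standard library only; n = 1000 is just an illustrative choice) that evaluates (1.2)-style expectations exactly:

    import math

    n = 1000
    for k in range(10, 26):
        expected = math.comb(n, k) / 2 ** math.comb(k, 2)   # E[Y_k] exactly
        print(k, f"{expected:.3g}")
    # the expectation drops by many orders of magnitude per unit increase in k,
    # so independent sets of size noticeably above 2*log2(n) are very unlikely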
Now we also show a lower bound on the size of the largest independent set/clique. Note that substituting k = 2 log n − \sqrt{\log n} in the above calculation for E[Y_k] shows that the expected number of independent sets of size 2 log n − \sqrt{\log n} is very large. However, a large expectation by itself does not rule out the possibility that this number is 0 for almost all graphs (and huge only for a few rare ones).
For this we need a more powerful inequality, Chebyshev's inequality, which says
$$\Pr[|X - \mu| \geq k\sigma] \leq \frac{1}{k^2},$$
where \mu and \sigma^2 are the mean and variance of X. Recall that \sigma^2 = E[(X - \mu)^2] = E[X^2] - \mu^2.
Actually, Chebyshev's inequality is just a special case of Markov's inequality: by definition,
$$E\big[|X - \mu|^2\big] = \sigma^2,$$
and so,
$$\Pr\big[|X - \mu|^2 \geq k^2 \sigma^2\big] \leq \frac{1}{k^2}.$$
Theorem 2
If G is drawn from G(n, 1/2), then almost surely G has an independent set of size 2 log n − \sqrt{\log n}.

Proof: If k = 2 log n − \sqrt{\log n}, then by substituting into formula (1.2) we find that E[Y_k] is approximately
$$\Big(\frac{ne}{k\,2^{k/2}}\Big)^k = \Big(\frac{e\, 2^{\sqrt{\log n}/2}}{2\log n}\Big)^k.$$
This quantity tends to infinity, so the mean is much greater than one. However, the variance could be so large that the value of the mean tells us nothing.
The calculation of the variance is, however, quite messy. Let N stand for \binom{n}{k} and p stand for 2^{-\binom{k}{2}}. The expected value of Y_k is N p. We show that E[Y_k^2] is N^2 p^2 + \Delta, where \Delta \ll N^2 p^2, so that Var[Y_k] = \Delta is smaller than N^2 p^2. Then concentration about the mean follows from Chebyshev's inequality. Observe that
$$E[Y_k^2] = E\Big[\Big(\sum_S X_S\Big)^2\Big] = \sum_{S, T : |S| = |T| = k} E[X_S X_T],$$
by linearity of expectation.
Note that if |S \cap T| \leq 1 then X_S and X_T are independent events and E[X_S X_T] = E[X_S] E[X_T] = p^2. Now we calculate the fraction of pairs S, T that satisfy |S \cap T| = i > 1. There are \binom{n}{k} ways to choose S and then \binom{k}{i}\binom{n-k}{k-i} ways to choose T. Since X_S X_T = 1 iff both S and T are independent sets, and they share \binom{i}{2} potential edges, the probability that both are independent sets is 2^{-2\binom{k}{2}} \cdot 2^{\binom{i}{2}} = p^2\, 2^{\binom{i}{2}}. Hence
$$E[Y_k^2] = \binom{n}{k}\binom{n-k}{k} p^2 + \binom{n}{k} \sum_{i=2}^{k-1} \binom{k}{i}\binom{n-k}{k-i}\, p^2\, 2^{\binom{i}{2}} + \binom{n}{k}\, p.$$
The first term is N^2 p^2 (1 + o(1)) and the remaining terms can be shown to be o(N^2 p^2), which gives the desired concentration. □
Using the fact that in a random graph, all degrees are concentrated in [n/2 − O(\sqrt{n \log n}), n/2 + O(\sqrt{n \log n})] (see Chernoff bounds below), we can show that k \geq \log n − O(1) with high probability.
Open Problem 1 Is it possible to find a clique of size (1 + \epsilon) \log n in polynomial time? (The best algorithm we know of does exhaustive listing of sets of size (1 + \epsilon) \log n, which takes about n^{(1+\epsilon)\log n} time.)
1.3 Graph colouring: local versus global

Imagine that one day you pick up a political map of some world and decide to colour in the countries. To keep the boundaries clear you decide to fill in adjacent countries with different colours. It turns out that no matter what map anyone gives you (a map is just a so-called planar graph), four colours will always suffice.
We can extend this notion of colouring to graphs: we assign each vertex a colour so that
no two vertices with an edge between them share a colour. More formally, a k-colouring
of a graph G = (V, E) is a family of k independent sets whose union is V . Note that we
can ask that these sets form a partition of V, but we don't have to.
Following convention, let \alpha(G) stand for the size of the largest independent set in G and let \chi(G) stand for the minimum k for which G admits a k-colouring. The formal definition of colouring tells us that
$$\chi(G) \geq \frac{n}{\alpha(G)}.$$
Can we provide an upper bound for the chromatic number \chi(G) using \alpha(G), for instance \chi(G) \leq 2n/\alpha(G)? Such a bound is impossible: there exist graphs in which there is a large independent set, but the remainder of the graph is a clique. However, if the vertex degrees are bounded, then so is the chromatic number.
Theorem 3
If the maximum degree of a graph G is d, then \chi(G) \leq d + 1.

Proof: By induction. The base case is trivially true. For the inductive step, assume all graphs with n nodes satisfy the theorem. Given a graph G with n + 1 nodes, identify one vertex v. The induced subgraph G − {v} can be coloured with d + 1 colours, by the induction assumption. The node v has edges to at most d other nodes, due to the maximum degree constraint, and hence is adjacent to at most d colours. Hence the (d + 1)st colour is available for v, and so G can be (d + 1)-coloured. □
Perhaps being able to colour small subgraphs with few colours might help us to colour larger graphs? Such conjectures can waste a lot of brain-cycles. But a simple probabilistic argument, doable on the back of an envelope, can help us dispose of such conjectures, as in the following 1962 result of Erdős. It shows that the chromatic number is truly a global property and cannot be deduced from looking locally.
Theorem 4
For all k there exists a positive \epsilon such that for all sufficiently large n there is a graph G on n vertices with \chi(G) > k, but every subgraph of G with at most \epsilon n vertices is 3-colourable.

Proof: Let G be a graph selected from G(n, p) where p = c/n. We show not only that there exists such a G that satisfies the theorem statement, but that a G selected in this way satisfies it almost surely.
First, we show that with high probability \alpha(G) < n/k, which implies \chi(G) > k. We can approximate \binom{n}{an} by 2^{H(a)n}, where H(a) is the entropy function
$$H(a) = a \log\frac{1}{a} + (1-a)\log\frac{1}{1-a}.$$
Using the fact that when p is small, 1 − p \approx e^{-p}, we find that the expected number of independent sets of size n/k is
$$\binom{n}{n/k}(1-p)^{\binom{n/k}{2}} \approx \exp\Big(n H\big(\tfrac{1}{k}\big)\ln 2 - \frac{cn}{2k^2}\Big).$$
If c is at least 2k^2 H(1/k) \ln 2, this expectation drops to zero rapidly.
Second, we show that in every induced subgraph on at most \epsilon n nodes, the average degree is less than 3. Note that if there exists a subgraph on at most \epsilon n vertices that is not 3-colourable, then there is a minimal such subgraph. In the minimal such subgraph, every vertex must have degree at least 3. (Proof: Suppose a vertex has degree at most 2 yet omitting it gives a 3-colourable subgraph. Then putting the vertex back in, we can extend the 3-colouring to that vertex by giving it a colour that has not been assigned to its at most 2 neighbours.) So we have a subgraph on s \leq \epsilon n vertices with at least 3s/2 edges. The probability of such a subgraph existing is at most
$$\sum_{s \leq \epsilon n} \binom{n}{s}\binom{\binom{s}{2}}{3s/2}\Big(\frac{c}{n}\Big)^{3s/2},$$
recalling that p = c/n. If s is O(1), the terms tend to zero rapidly and the sum is negligible. Otherwise, we can use the approximations presented in line (1.1) to arrive at the quantity
$$\sum_{s \leq \epsilon n} \Big[\frac{ne}{s}\Big(\frac{sec}{3n}\Big)^{3/2}\Big]^s = \sum_{s \leq \epsilon n} \Big[e^{5/2}\, 3^{-3/2}\, c^{3/2} \sqrt{s/n}\Big]^s.$$
If \epsilon is less than e^{-5} 3^{3} c^{-3}, the summation terms are strictly less than 1, since s \leq \epsilon n. On the other hand, we have eliminated the cases where s is O(1), so the remainder of the terms form a geometric series, and so their sum is bounded by a constant times the first term, which is tiny. □
1.4 Bounding distribution tails

When we toss a coin many times, the expected number of heads is half the number of tosses. How tightly is this distribution concentrated? Should we be very surprised if after 1000 tosses we have 625 heads? We can provide bounds on how likely a sum of Poisson trials is to deviate from its mean. (A sequence of {0, 1} random variables is Poisson, as against Bernoulli, if the expected values can vary between trials.) The Chernoff bound presented here was probably known before Chernoff published it in 1952.
Theorem 5
Let X_1, X_2, . . . , X_n be independent Poisson trials and let p_i = E[X_i], where 0 < p_i < 1. Then the sum X = \sum_{i=1}^n X_i, which has mean \mu = \sum_{i=1}^n p_i, satisfies
$$\Pr[X \geq (1+\delta)\mu] \leq \Big[\frac{e^{\delta}}{(1+\delta)^{(1+\delta)}}\Big]^{\mu}.$$

Proof: Surprisingly, this inequality is also proved using the Markov inequality.
We introduce a positive dummy variable t and observe that
$$E[\exp(tX)] = E\Big[\exp\Big(t\sum_i X_i\Big)\Big] = E\Big[\prod_i \exp(tX_i)\Big] = \prod_i E[\exp(tX_i)], \tag{1.3}$$
where the last equality holds because the Xi r.v.s are independent. Now,
$$E[\exp(tX_i)] = (1 - p_i) + p_i e^{t},$$
therefore,
$$\prod_i E[\exp(tX_i)] = \prod_i \big[1 + p_i(e^{t} - 1)\big] \leq \prod_i \exp\big(p_i(e^{t} - 1)\big) = \exp\Big(\sum_i p_i (e^{t} - 1)\Big) = \exp\big(\mu(e^{t} - 1)\big). \tag{1.4}$$
Hence
$$\Pr[X \geq (1+\delta)\mu] \leq \frac{E[\exp(tX)]}{\exp(t(1+\delta)\mu)} \leq \frac{\exp(\mu(e^{t} - 1))}{\exp(t(1+\delta)\mu)},$$
using lines (1.3) and (1.4) and the fact that t is positive. Since t is a dummy variable, we can choose any positive value we like for it. The right hand side is minimized if t = \ln(1+\delta) (just differentiate), and this leads to the theorem statement. □
A similar technique yields this result for the lower tail of the distribution:
$$\Pr[X \leq (1-\delta)\mu] \leq \Big[\frac{e^{-\delta}}{(1-\delta)^{(1-\delta)}}\Big]^{\mu}.$$
By the way, if all n coin tosses are fair (Heads has probability 1/2), then the probability of seeing N heads where |N - n/2| > a\sqrt{n} is at most e^{-a^2/2}. The chance of seeing at least 625 heads in 1000 tosses of an unbiased coin is less than 5.3 \times 10^{-7}.
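As a sanity check, here is a short Python sketch (standard library only; not from the original notes) that evaluates the bound of Theorem 5 for the 625-heads example and compares it with the exact binomial tail:

    import math

    n, target = 1000, 625
    mu = n / 2
    delta = target / mu - 1          # 0.25

    # Chernoff bound of Theorem 5: Pr[X >= (1+delta)mu] <= (e^delta / (1+delta)^(1+delta))^mu
    chernoff = (math.exp(delta) / (1 + delta) ** (1 + delta)) ** mu

    # exact binomial tail, for comparison
    exact = sum(math.comb(n, k) for k in range(target, n + 1)) / 2 ** n

    print(f"Chernoff bound: {chernoff:.3g}")   # about 5.2e-07, consistent with the figure above
    print(f"exact tail:     {exact:.3g}")      # much smaller still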
Exercises
1. Prove that the greedy algorithm finds a clique of size \log n − O(1) with high probability in a random graph.
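The exercise can also be checked empirically. Below is a minimal Python sketch (function names are my own) of the natural greedy algorithm on G(n, 1/2); since each vertex pair is examined at most once, the edges can be sampled lazily, and the cliques found typically have size close to log2 n:

    import math, random

    def greedy_clique(n, p=0.5, rng=None):
        """Grow a clique greedily in G(n, p): scan vertices in order and keep a
        vertex iff it is adjacent to every vertex kept so far.  Edges are sampled
        lazily, which is equivalent because each pair is examined at most once."""
        rng = rng or random.Random()
        clique = []
        for v in range(n):
            if all(rng.random() < p for _ in clique):
                clique.append(v)
        return clique

    n = 1 << 16
    print(f"log2(n) = {math.log2(n):.0f}")
    print([len(greedy_clique(n)) for _ in range(10)])   # sizes cluster around log2(n)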
Chapter 2

LP Duality

2.1 Introduction
A Linear Program involves optimizing a linear cost function subject to linear inequality constraints. Linear programs are useful in algorithm design as well as a tool in mathematical proofs. The typical program looks as follows.
Given: vectors c, a_1, a_2, . . . , a_m \in R^n, and real numbers b_1, b_2, . . . , b_m.
Objective: find X \in R^n to minimize c \cdot X, subject to:
$$a_1 \cdot X \geq b_1, \quad a_2 \cdot X \geq b_2, \quad \ldots, \quad a_m \cdot X \geq b_m, \quad X \geq 0. \tag{2.1}$$
The notation X \geq Y simply means that X is componentwise at least Y. Now we represent the system in (2.1) more compactly using matrix notation. Let
$$A = \begin{pmatrix} a_1^T \\ a_2^T \\ \vdots \\ a_m^T \end{pmatrix} \quad \text{and} \quad b = \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_m \end{pmatrix}.$$
Then the Linear Program (LP for short) can be rewritten as:
$$\min c^T X : \quad AX \geq b, \quad X \geq 0. \tag{2.2}$$
This form is general enough to represent any possible linear program. For instance, if the linear program involves a linear equality a \cdot X = b then we can replace it by the two inequalities
a \cdot X \geq b and a \cdot X \leq b.
If the variable X_i is unconstrained, then we can replace each occurrence by X_i^+ − X_i^−, where X_i^+, X_i^− are two new non-negative variables.
The set of conditions in an LP may not be satisfiable, however. Farkas' Lemma tells us when this happens.

Lemma 6 (Farkas' Lemma)
The set of linear inequalities (2.1) is infeasible if and only if, using positive linear combinations of the inequalities, it is possible to derive −1 \geq 0, i.e. there exist \lambda_1, \lambda_2, . . . , \lambda_m \geq 0 such that
$$\sum_{i=1}^{m} \lambda_i a_i \leq 0 \quad \text{and} \quad \sum_{i=1}^{m} \lambda_i b_i > 0.$$

2.2 Duality
With every LP we can associate another LP called its dual. The original LP is called the primal. If the primal has n variables and m constraints, then the dual has m variables and n constraints.
$$\text{Primal:} \quad \min c^T X : \; AX \geq b, \; X \geq 0. \qquad \text{Dual:} \quad \max Y^T b : \; Y^T A \leq c^T, \; Y \geq 0. \tag{2.3}$$
(Aside: if the primal contains an equality constraint instead of an inequality, then the corresponding dual variable is unconstrained.)
It is an easy exercise that the dual of the dual is just the primal.
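Here is a minimal numerical sketch (not from the notes; the specific A, b, c are arbitrary illustrative choices) that checks the upcoming Duality Theorem with scipy:

    import numpy as np
    from scipy.optimize import linprog

    A = np.array([[1.0, 2.0], [3.0, 1.0]])
    b = np.array([4.0, 6.0])
    c = np.array([2.0, 3.0])

    # Primal: min c^T X  s.t.  A X >= b, X >= 0   (rewritten as -A X <= -b for linprog)
    primal = linprog(c, A_ub=-A, b_ub=-b, bounds=[(0, None)] * 2)

    # Dual: max Y^T b  s.t.  Y^T A <= c^T, Y >= 0  (linprog minimizes, so negate b)
    dual = linprog(-b, A_ub=A.T, b_ub=c, bounds=[(0, None)] * 2)

    print(primal.fun, -dual.fun)   # the two optima coincide (here both are 6.8)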
Theorem 7 (The Duality Theorem)
If both the Primal and the Dual of an LP are feasible, then the two optima coincide.

Proof: The proof involves two parts:
1. Primal optimum \geq Dual optimum.
This is the easy part. Suppose X*, Y* are the respective optima. This implies that AX* \geq b. Now, since Y* \geq 0, the product Y*^T AX* is a non-negative linear combination of the rows of AX*, so the inequality
$$Y^{*T} A X^{*} \geq Y^{*T} b$$
holds. Again, since X* \geq 0 and c^T \geq Y^{*T} A, the inequality
$$c^T X^{*} \geq (Y^{*T} A) X^{*} \geq Y^{*T} b$$
holds, which completes the proof of this part.
2. Dual optimum \geq Primal optimum.
Let k be the optimum value of the primal. Since the primal is a minimization problem, the following set of linear inequalities is infeasible for any \epsilon > 0:
$$-c^T X \geq -(k - \epsilon), \qquad AX \geq b, \qquad X \geq 0. \tag{2.4}$$
By Farkas' Lemma there exist \lambda_0, \lambda_1, . . . , \lambda_m \geq 0 such that
$$\sum_{i=1}^{m} \lambda_i a_i \leq \lambda_0 c \tag{2.5}$$
and
$$\sum_{i=1}^{m} \lambda_i b_i > \lambda_0 (k - \epsilon). \tag{2.6}$$
Note that \lambda_0 > 0: omitting the first inequality in (2.4) leaves a feasible system, by assumption about the primal. Thus, consider the vector
$$Y = \Big(\frac{\lambda_1}{\lambda_0}, \ldots, \frac{\lambda_m}{\lambda_0}\Big)^T.$$
By (2.5), Y^T A \leq c^T and Y \geq 0, so Y is dual feasible; by (2.6), Y^T b > k − \epsilon. Since \epsilon > 0 was arbitrary, the dual optimum is at least k. □
2.3 Example: Max-Flow Min-Cut

The input is a directed graph G(V, E) with one source s and one sink t. Each edge e has a capacity c_e. The flow on any edge must be less than its capacity, and at any node apart from s and t, flow must be conserved: total incoming flow must equal total outgoing flow. We wish to maximize the flow we can send from s to t. The maximum flow problem can be formulated as a Linear Program as follows.
Let P denote the set of all (directed) paths from s to t. Then the max-flow problem becomes:
$$\max \sum_{P \in \mathcal{P}} f_P : \tag{2.7}$$
$$\forall P \in \mathcal{P} : f_P \geq 0 \tag{2.8}$$
$$\forall e \in E : \sum_{P : e \in P} f_P \leq c_e \tag{2.9}$$
Its dual is:
$$\min \sum_{e \in E} c_e y_e : \tag{2.10}$$
$$\forall e \in E : y_e \geq 0 \tag{2.11}$$
$$\forall P \in \mathcal{P} : \sum_{e \in P} y_e \geq 1 \tag{2.12}$$
Notice that the dual in fact represents the fractional min s-t cut problem: think of each edge e being picked up to a fraction y_e. The constraints say that a total weight of at least 1 must be picked on each path. Thus the usual min cut problem simply involves 0/1 solutions for the y_e's in the dual.

Exercise 1 Prove that the optimum solution does have y_e \in {0, 1}.

Thus, max s-t flow = (capacity of) min cut.
2.4 Approximate Inclusion-Exclusion

Warning: the proof of approximate inclusion-exclusion given here is incomplete towards the end. Specifically, I found after giving the lecture that the main theorem of the Linial-Nisan paper is incorrect as stated. I plan to correct this but haven't gotten around to it. The reader can still get the basic idea from the description below.
Now we see an interesting application of linear programming and the duality theorem. The Inclusion-Exclusion formula for the cardinality of the union of n finite sets A_1, A_2, . . . , A_n is given by
$$|A_1 \cup A_2 \cup \cdots \cup A_n| = \sum_i |A_i| - \sum_{i<j} |A_i \cap A_j| + \cdots + (-1)^{n+1} |A_1 \cap A_2 \cap \cdots \cap A_n|. \tag{2.13}$$
The question is, suppose we know only the first k terms of the formula; how good an approximation can we get? The simple idea of truncating the formula to the first k terms doesn't work: for instance, consider the case when all A_i are identical. The answer (due to Linial and Nisan, 1988) is that for k \geq \Omega(\sqrt{n}) we can get a good approximation.
Lemma 9
The following relation connects the r_j and the a_j:
$$r_j = \sum_{i=j}^{n} \binom{i}{j} a_i.$$

This leads to the following linear program in the variables x_1, . . . , x_n:
$$\max \sum_{i=1}^{n} x_i : \tag{2.20}$$
$$\forall j \leq k : \sum_{i=1}^{n} \binom{i}{j} x_i = 0 \tag{2.21}$$
$$\forall S \subseteq [n] : \sum_{i \in S} x_i \leq 1 \tag{2.22}$$
$$\forall S \subseteq [n] : \sum_{i \in S} x_i \geq -1 \tag{2.23}$$
It is easy to check that this defines proper probabilities using (2.22, 2.23), and that the sequences of events they define satisfy (2.15) because of (2.21). □
Now we look at the dual of the LP. It involves finding \alpha_S, \beta_S \geq 0 and y_i's such that
$$\min \sum_{S \subseteq [n]} (\alpha_S + \beta_S) : \tag{2.26}$$
$$\forall\, 1 \leq i \leq n : \sum_{S \ni i} (\alpha_S - \beta_S) + \sum_{j \leq \min(i,k)} \binom{i}{j} y_j \geq 1 \tag{2.27}$$
Clearly, the optimum solution will, for each subset S, never need to make both \alpha_S and \beta_S positive. For example, if \alpha_S = a > 0 and \beta_S = b > a, then making \alpha_S = 0 and \beta_S = b − a still satisfies all constraints while lowering the objective.
For any given y_1, y_2, . . . , y_n, what is the best choice of \alpha_S, \beta_S? For each i let c_i = 1 − \sum_{j \leq \min(i,k)} \binom{i}{j} y_j. Let I = {i : c_i > 0} and J = {j : c_j < 0}. Let c^+ = \max\{c_i : i \in I\} and c^- = -\min\{c_j : j \in J\}. (If I or J is empty the corresponding quantity is defined to be 0.) Then there is a dual solution of cost c^+ + c^-, namely, \alpha_I = c^+, \beta_J = c^-, and the variables associated with all other subsets are zero. Furthermore, every feasible solution must have some set with \alpha_S \geq c^+ and some other set with \beta_S \geq c^-, and thus have cost at least c^+ + c^-. Finally, we claim that at the optimum the y_i's are such that J = \emptyset, and hence c^- = 0. Suppose not, and c^- > 0. Then divide all y_i by 1 + c^-; a simple calculation shows that the new c^- is 0 whereas the new c^+ is (c^+ + c^-)/(1 + c^-). Thus the objective function has decreased, a contradiction. Hence the dual is equivalent to
$$\min_{y_1, \ldots, y_n} \max_{1 \leq i \leq n} \Big(1 - \sum_{j \leq \min(i,k)} \binom{i}{j} y_j\Big) : \tag{2.28}$$
$$1 - \sum_{j \leq \min(i,k)} \binom{i}{j} y_j \geq 0 \quad \forall i. \tag{2.29}$$
Lemma 12
The optimum value of the program in Lemma 11 is given by
$$\inf_q \max_{m \in [n]} (1 - q(m)), \tag{2.30}$$
where the infimum ranges over all polynomials q of degree at most k with constant term 0 such that q(m) \leq 1 for all m \in \{1, . . . , n\}.

Proof: Recall that \binom{x}{i} = x(x-1)(x-2)\cdots(x-i+1)/i!, which is a degree-i polynomial whose constant term is 0. (That is, at x = 0 its value is 0.) It is also a polynomial that is 0 at x = 1, 2, . . . , (i − 1). We note that any polynomial of degree at most k and constant term 0 can be written as a linear combination of the \binom{x}{i}, for 1 \leq i \leq k. (The proof is by induction: if the polynomial is c x^i + q(x) where q(x) is a polynomial of degree at most i − 1, then it may be expressed as c\, i!\, \binom{x}{i} + r(x) where r(x) has degree at most i − 1.) Finally, note that if we define the polynomial q(x) = \sum_{j=1}^{k} \binom{x}{j} y_j, then (2.28) becomes simply
$$\inf_q \max_{m \in \{1, \ldots, n\}} (1 - q(m)). \qquad \Box$$
We will use the Chebyshev polynomials T_m, which satisfy:
1. $$T_m(x) = \frac{(x + \sqrt{x^2-1})^m + (x - \sqrt{x^2-1})^m}{2}.$$
2. For x \in [−1, 1], we have −1 \leq T_m(x) \leq 1.

Consider the following polynomial of degree k:
$$q_{k,n}(x) = 1 - \frac{T_k\big(\frac{2x-(n+1)}{n-1}\big)}{T_k\big(\frac{-(n+1)}{n-1}\big)}. \tag{2.32}$$
Note that q_{k,n}(0) = 0 (i.e. its constant term is 0), and when x \in [1, n] we have T_k((2x-(n+1))/(n-1)) \in [−1, 1], so |q_{k,n}(x) − 1| \leq 1/D where D = T_k((n+1)/(n-1)).
Then p(x) = D\, q_{k,n}(x)/(1 + D) satisfies (D−1)/(1+D) \leq p(m) \leq 1 for all m \in [1, n]. Thus it is a dual feasible solution and we conclude that E(k, n) \leq 1 − (D−1)/(1+D) = 2/(1+D). Thus the maximum ratio between Pr[\cup_i A_i] and Pr[\cup_i B_i] is at most 1/(1 − E(k, n)) \leq (D+1)/(D−1). It only remains to estimate this quantity. Since D = T_k((n+1)/(n-1)) = (\lambda^k + \lambda^{-k})/2, where \lambda = (\sqrt{n}+1)/(\sqrt{n}-1) \approx 1 + 2/\sqrt{n} for large n, we can upperbound this by
$$\frac{D+1}{D-1} \leq 1 + O(k^2/n).$$
2.5 A note on algorithms
We have emphasized the use of the duality theorem as a tool for proving theorems. Of course, the primary use of LPs is to solve optimization problems. Several algorithms exist to solve LPs in polynomial time. We want to mention Khachiyan's ellipsoid algorithm in particular because it can solve even exponential-size LPs provided they have a polynomial-time separation oracle. (There is an additional technical condition: we need to know a containing ball for the polytope in question, and the ball should not be too much bigger than the polytope. Usually this condition is satisfied.)
A separation oracle for an LP decides whether a given input (x_1, x_2, . . . , x_n) is feasible or not, and if it isn't, outputs one constraint that it violates.
For example, consider the dual of max-flow (viz. fractional min-cut) that was discussed in the previous lecture:
$$\min \sum_{e \in E} c_e y_e : \tag{2.33}$$
$$\forall e \in E : y_e \geq 0 \tag{2.34}$$
$$\forall P \in \mathcal{P} : \sum_{e \in P} y_e \geq 1 \tag{2.35}$$
This can be solved in many ways, but the simplest (if we do not care about efficiency too much) is to use the Ellipsoid method, since we can design a polytime separation oracle for this problem using the shortest path algorithm. Suppose the oracle is given as input a vector (y_e)_{e \in E}. To decide if it is feasible, the oracle computes the shortest path from s to t with edge weights y_e, and checks if the length of this shortest path is at least 1. (Of course, before anything else one should check that all the y_e \geq 0.) Clearly, (y_e)_{e \in E} is feasible iff the shortest path has length at least 1, and if it is infeasible then the shortest path constitutes an unsatisfied constraint.
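Here is a minimal Python sketch of the separation oracle just described (function names are my own); it checks nonnegativity and then runs Dijkstra with edge weights y_e:

    import heapq

    def separation_oracle(edges, y, s, t):
        """edges: list of directed (u, v) pairs; y[e]: proposed value for edge e.
        Returns None if (y_e) is feasible, otherwise a violated constraint."""
        for e, val in enumerate(y):
            if val < 0:
                return ("y_e >= 0 violated", edges[e])
        # Dijkstra from s with weights y_e, remembering predecessor edges
        dist, pred, heap = {s: 0.0}, {}, [(0.0, s)]
        while heap:
            d, u = heapq.heappop(heap)
            if d > dist.get(u, float("inf")):
                continue
            for e, (a, b) in enumerate(edges):
                if a == u and d + y[e] < dist.get(b, float("inf")):
                    dist[b] = d + y[e]
                    pred[b] = e
                    heapq.heappush(heap, (dist[b], b))
        if dist.get(t, float("inf")) >= 1:
            return None                     # every path constraint (2.35) holds
        # reconstruct the violating s-t path (its total y-weight is < 1)
        path, v = [], t
        while v != s:
            path.append(pred[v]); v = edges[pred[v]][0]
        return ("sum of y_e over path < 1", list(reversed(path)))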
Chapter 3

The Dimension Method

3.1 Basics: Fields and Vector Spaces
We recall some basic linear algebra. A field is a set closed under addition, subtraction, multiplication and division by nonzero elements. By addition and multiplication, we mean commutative and associative operations which obey distributive laws. The additive identity is called zero, the multiplicative identity is called unity. Examples of fields are the reals R, the rationals Q, and the integers modulo a prime p, denoted by Z/p. We will be mostly concerned with finite fields. The cardinality of a finite field must be a power of a prime, and all finite fields with the same number of elements are isomorphic. Thus for each prime power p^k there is essentially one field F with |F| = p^k. We shall denote this field by GF(p^k).
A vector space V over a field F is an additive group closed under (left) multiplication by elements of F. We require that this multiplication be distributive with respect to addition in both V and F, and associative with respect to multiplication in F.
Vectors v_1, . . . , v_k are said to be linearly independent if \sum_{i=1}^{k} \alpha_i v_i = 0 implies that \alpha_1 = \alpha_2 = \cdots = \alpha_k = 0. A maximal set of vectors {v_i}_{i \in I} whose every finite subset is linearly independent is called a basis of V; all such sets have the same cardinality, called the dimension of V (denoted dim V). If V has a finite dimension k and {v_i}_{i=1}^{k} is a basis, then every vector v \in V can be uniquely represented as
$$v = \sum_{i=1}^{k} \alpha_i v_i, \qquad \alpha_i \in F.$$
Let A_{m \times n} = \{a_{ij}\} be a matrix over a field F. The rank of A, denoted rank A, is the maximum number of linearly independent rows in A. It is equal to the maximum number of linearly independent columns; hence rank A = rank A^T.
Let M = \{m_{ij}\} be an n by n matrix. The determinant of M is defined as follows:
$$\det M = \sum_{\sigma \in S_n} (-1)^{\epsilon(\sigma)} \prod_{i=1}^{n} m_{i\sigma(i)},$$
where S_n is the group of permutations over [n], and \epsilon(\sigma) is the parity of the permutation \sigma. The matrix M_{n \times n} has rank n if and only if \det M \neq 0. We will use this fact to prove the following result, which is our first example of the Dimension Method.
Theorem 13
Let M_{n \times n} be a random matrix over GF(2). Then Pr[\det M \neq 0] \geq 1/4.

Proof: Denote the columns of M by M_i, where i = 1, 2, . . . , n. It suffices to bound the probability that these columns are linearly independent:
Pr[M_1, . . . , M_n linearly independent]
= \prod_{i=1}^{n} Pr[M_1, . . . , M_i linearly independent | M_1, . . . , M_{i-1} linearly independent]
= \prod_{i=1}^{n} (1 − Pr[M_i \in span(M_1, . . . , M_{i-1}) | M_1, . . . , M_{i-1} linearly independent]).
Now, if M_1, . . . , M_{i-1} are independent then their span is of dimension i − 1 and hence it contains 2^{i-1} vectors. The column M_i is picked uniformly at random from the space of 2^n vectors, independently of M_1, . . . , M_{i-1}. Thus the probability that it will lie in their span is 2^{i-1}/2^n. Using the inequality 1 − x \geq e^{-2x \ln 2} for 0 \leq x \leq 1/2, we get
$$\Pr[\det M \neq 0] = \prod_{i=1}^{n} \big(1 - 2^{i-1-n}\big) \geq \prod_{i=1}^{n} \exp\{-2^{i-1-n} \cdot 2\ln 2\} = \exp\Big\{-2\ln 2 \sum_{i=1}^{n} 2^{i-1-n}\Big\} \geq \exp(-2\ln 2) = \frac{1}{4}. \qquad \Box$$
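A quick simulation (standard library only; not from the notes) confirms the bound, and also the product formula used in the proof:

    import random

    def is_invertible_gf2(rows, n):
        """Gaussian elimination over GF(2); rows are given as n-bit integers."""
        rows = rows[:]
        for col in range(n):
            pivot = next((r for r in range(col, n) if rows[r] >> col & 1), None)
            if pivot is None:
                return False
            rows[col], rows[pivot] = rows[pivot], rows[col]
            for r in range(n):
                if r != col and rows[r] >> col & 1:
                    rows[r] ^= rows[col]
        return True

    n, trials = 20, 20000
    hits = sum(is_invertible_gf2([random.getrandbits(n) for _ in range(n)], n)
               for _ in range(trials))
    product = 1.0
    for i in range(1, n + 1):
        product *= 1 - 2 ** (i - 1 - n)
    print(hits / trials, product)   # both are about 0.289, comfortably above 1/4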
3.2 Systems of Linear Equations

Consider a system of m linear equations in n unknowns over a field F, written in matrix form as Ax = b, with columns A_1, . . . , A_n. (∗)
Proposition 14
1. The system (∗) is feasible if and only if b \in span(A_1, . . . , A_n), which occurs if and only if rank(A|b) = rank A. (Here A|b is the matrix whose last column is b and the other columns are from A.)
2. Suppose F is finite. If rank A = k then the system (∗) has either 0 solutions (if infeasible) or |F|^{n-k} solutions (if feasible). In particular, if n = k then the solution is unique if it exists.
3. If b = 0 then a nontrivial solution exists if and only if rank A \leq n − 1. In particular, if n > m then nontrivial solutions always exist.
Example 2 Suppose M is a random matrix over GF(2) and b is a random n × 1 vector. What is the probability that the system Mx = b has a unique solution? By Theorem 13 it is at least 1/4.
Theorem 15
A nonzero polynomial of degree d has at most d distinct roots.

Proof: Suppose p(x) = \sum_{i=0}^{d} c_i x^i has d + 1 distinct roots \alpha_1, . . . , \alpha_{d+1} in some field F. Then
$$\sum_{i=0}^{d} \alpha_j^i c_i = p(\alpha_j) = 0$$
for j = 1, . . . , d + 1. In other words, the system Ay = 0 with
$$A = \begin{pmatrix} 1 & \alpha_1 & \alpha_1^2 & \ldots & \alpha_1^d \\ 1 & \alpha_2 & \alpha_2^2 & \ldots & \alpha_2^d \\ \vdots & & & & \vdots \\ 1 & \alpha_{d+1} & \alpha_{d+1}^2 & \ldots & \alpha_{d+1}^d \end{pmatrix}$$
has a solution y = c. The matrix A is a Vandermonde matrix, hence
$$\det A = \prod_{i > j} (\alpha_i - \alpha_j),$$
which is nonzero for distinct \alpha_i. Hence rank A = d + 1. The system Ay = 0 therefore has only the trivial solution, a contradiction to c \neq 0. □
Theorem 16
For any set of pairs (a_1, b_1), . . . , (a_{d+1}, b_{d+1}) with distinct a_i, there exists a unique polynomial g(x) of degree at most d such that g(a_i) = b_i for all i = 1, 2, . . . , d + 1.

Proof: The requirements are satisfied by the Lagrange interpolating polynomial
$$g(x) = \sum_{i=1}^{d+1} b_i \frac{\prod_{j \neq i} (x - a_j)}{\prod_{j \neq i} (a_i - a_j)}.$$
If two polynomials g_1(x), g_2(x) satisfy the requirements, then their difference p(x) = g_1(x) − g_2(x) is of degree at most d, and is zero for x = a_1, . . . , a_{d+1}. Thus, from the previous theorem, the polynomial p(x) must be zero, and the polynomials g_1(x), g_2(x) identical. □
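Below is a small Python sketch of Lagrange interpolation over GF(p) for a prime p (function names are my own; pow(x, -1, p) computes modular inverses and requires Python 3.8+):

    def interpolate_mod_p(points, p):
        """Lagrange interpolation over GF(p); points = [(a_i, b_i)] with distinct
        a_i.  Returns coefficients [c_0, c_1, ...] of the unique degree <= d poly."""
        d = len(points) - 1
        coeffs = [0] * (d + 1)
        for i, (ai, bi) in enumerate(points):
            num = [1]                      # running product prod_{j != i} (x - a_j)
            denom = 1
            for j, (aj, _) in enumerate(points):
                if j == i:
                    continue
                new = [0] * (len(num) + 1)
                for k, c in enumerate(num):      # multiply num by (x - a_j)
                    new[k] = (new[k] - aj * c) % p
                    new[k + 1] = (new[k + 1] + c) % p
                num = new
                denom = denom * (ai - aj) % p
            scale = bi * pow(denom, -1, p) % p
            for k, c in enumerate(num):
                coeffs[k] = (coeffs[k] + scale * c) % p
        return coeffs

    # example: recover a degree-2 polynomial from three of its values mod 17
    p, pts = 17, [(1, 3), (2, 1), (5, 8)]
    g = interpolate_mod_p(pts, p)
    assert all(sum(c * pow(a, k, p) for k, c in enumerate(g)) % p == b for a, b in pts)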
3.3 Dispersal of Information Using Polynomials
Theorem 19 (Berlekamp-Welch)
If c(x), e(x) satisfy the conditions of the lemma, then e(x) | c(x) and p(x) = c(x)/e(x).

Proof: Consider the polynomial c(x) − p(x)e(x). It has degree at most 6d, but it has at least 15d roots because it is zero on all noncorrupted \alpha_i's. Therefore, c(x) − p(x)e(x) \equiv 0 and c(x) = p(x)e(x). □

Remark 1 This strategy will work whenever a fixed \delta-fraction of packets is corrupted, where \delta < 1/2. Somebody asked if a scheme is known that recovers the polynomial even if more than 1/2 the packets are corrupted. The answer is Yes, using Sudan's list decoding algorithm. See the homework.
3.4 Hashing: An introduction
Suppose we use a hash function h : X \to U, |U| = p \geq 2n, picked at random from a pairwise independent family. Fix a bucket u and consider the random variables
$$X_x = \begin{cases} 1 & \text{if } h(x) = u, \\ 0 & \text{otherwise,} \end{cases}$$
where x is an element of the sequence. By pairwise independence, choosing an arbitrary y \in X such that y \neq x, we obtain
$$\Pr[h(x) = u] = \sum_{v \in U} \Pr[h(x) = u, h(y) = v] = p \cdot \frac{1}{p^2} = \frac{1}{p}.$$
The size of bucket u is S = \sum_x X_x. Calculate the expectation of S^2:
$$E[S^2] = \sum_{x,y} E[X_x X_y] = \sum_x E[X_x^2] + \sum_{x \neq y} E[X_x X_y] = \sum_x \Pr[h(x) = u] + \sum_{x \neq y} \Pr[h(x) = u, h(y) = u] = \frac{n}{p} + \frac{n(n-1)}{p^2}.$$
(Aside: The element distinctness problem is impossible to solve, even using randomness, in linear time in the comparison model, where the algorithm is only allowed to compare two numbers at every step.)
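A minimal sketch (my own naming, not from the notes) of the standard pairwise independent family h_{a,b}(x) = (ax + b) mod p, with an empirical check of the probabilities used above:

    import random

    p = 101  # a prime, the size of the range U

    def random_hash(p):
        a, b = random.randrange(p), random.randrange(p)
        return lambda x: (a * x + b) % p

    # empirical check of pairwise independence for one fixed pair (x, y), x != y
    x, y, u, v = 3, 7, 0, 5
    trials = 200000
    marginal = joint = 0
    for _ in range(trials):
        h = random_hash(p)
        marginal += (h(x) == u)
        joint += (h(x) == u and h(y) == v)
    print(marginal / trials, 1 / p)        # these should be close
    print(joint / trials, 1 / p ** 2)      # and so should these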
3.5 Pairwise and k-wise Independent Sampling

Consider a randomized algorithm that uses n random bits and gives a Yes/No answer. Suppose we know that one of the answers happens with probability at least 2/3, but we do not know which. We can determine that answer with high probability by running the algorithm m times with independent random bits and taking the majority answer; by the Chernoff bounds the error in this estimation will be exp(−\Omega(m)). Can we do this estimation using fewer than mn random bits? Intuitively speaking, the algorithm converts n random bits into a single random bit (Yes/No), so it has thrown away a lot of randomness. Can we perhaps reuse some of it? Later in the course we will see some powerful techniques to do so; here we use more elementary ideas. Here we see a technique that uses 2n random bits and whose error probability is 1/m. (We need m < 2^n.) The idea is to use random strings that are pairwise independent and use Chebyshev's inequality.
A sequence of random variables z_1, z_2, z_3, . . . is pairwise independent if every pair of them is independent.
We can construct a sequence of m pairwise independent strings {z_i}, z_i \in GF(q), q \geq 2^n, using only 2 log q random bits. Let {x_i}, x_i \in GF(q), be any fixed sequence of distinct elements. Pick a, b \in GF(q) at random and set z_i = a x_i + b. Running the algorithm on z_1, . . . , z_m will guarantee that the answers are pairwise independent.
Analogously, we can construct k-wise independent sequences by picking a_0, . . . , a_{k-1} at random and applying the map x \mapsto \sum_{j=0}^{k-1} a_j x^j to an arbitrary sequence {x_i}, x_i \in GF(q).
Chebyshev's inequality generalizes to higher moments:
$$\Pr\Big[|X - E[X]| > t \big(E\big[|X - E[X]|^k\big]\big)^{1/k}\Big] < t^{-k}.$$
This uses k log q random bits, but the error in the estimation goes down as 1/m^k.
Example 5 Secret Sharing (A. Shamir, How to share a secret, Comm. ACM 1979). We want to design a scheme for sharing a secret a_0 among m people so that any k + 1 people can recover the secret, but k or fewer people cannot.
If a_1, . . . , a_k are picked randomly and person i receives the pair (\alpha_i, p(\alpha_i)), where p(x) = \sum_{j=0}^{k} a_j x^j and the \alpha_i are distinct nonzero field elements, then any set of k people will receive a random k-tuple of strings, whereas any k + 1 people will be able to recover the polynomial p(x) by interpolation.
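Here is a toy Python sketch of the scheme over GF(q) with q prime (function names and parameter values are my own illustrative choices); recover() interpolates the shares at x = 0 to obtain the secret a_0:

    import random

    q = 2 ** 61 - 1   # a prime

    def share(secret, k, m):
        """Any k+1 of the m shares determine the secret; k shares reveal nothing."""
        coeffs = [secret] + [random.randrange(q) for _ in range(k)]
        return [(i, sum(c * pow(i, j, q) for j, c in enumerate(coeffs)) % q)
                for i in range(1, m + 1)]

    def recover(shares):
        """Lagrange interpolation at x = 0 from k+1 shares (alpha_i, p(alpha_i))."""
        total = 0
        for i, (xi, yi) in enumerate(shares):
            num = denom = 1
            for j, (xj, _) in enumerate(shares):
                if j != i:
                    num = num * (-xj) % q
                    denom = denom * (xi - xj) % q
            total = (total + yi * num * pow(denom, -1, q)) % q
        return total

    shares = share(secret=123456789, k=3, m=7)
    assert recover(random.sample(shares, 4)) == 123456789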
3.6 Madhu Sudan's List Decoding Algorithm

Sudan's algorithm gives a way to recover a polynomial from its values, even when most (an overwhelming majority) of the values are corrupted. See Question 4 on HW 2 for a self-guided tour of this algorithm.
Strictly speaking, this algorithm doesn't use just the dimension method: it also uses Berlekamp's (or Kaltofen's) algorithm for factoring polynomials over finite fields.
3.7 The Razborov-Smolensky Circuit Lower Bound
Definition 2 A family of circuits is a collection of circuits {C_n}_{n \geq 1}, such that the nth circuit has n input nodes. We say that a circuit family computes a function f : {0, 1}* \to {0, 1} if for all x, C_{|x|}(x) = f(x).

Remark 2 A circuit family can compute undecidable languages: we may hardwire into a circuit any of the 2^{2^n} functions on n inputs, and may use different circuits for each input length. (For example, there exists a circuit family which computes the characteristic function of {⟨M, w⟩ : M halts on w}.)
We are interested only in a subclass of circuits, the MOD_3 circuits.

Definition 3 A MOD_3 circuit of depth d on n inputs is one whose depth is bounded by a constant d. Each boolean gate performs any of the four operations AND (∧), OR (∨), NOT (¬), and sum modulo three (MOD_3). The MOD_3 gate outputs zero if the sum of its inputs is zero modulo three, and one otherwise.

The inclusion of the MOD_3 gates gives the circuit some power. Nevertheless, we show that it cannot compute the MOD_2 function, namely, the parity of n bits.
Theorem 20 (The Razborov-Smolensky Circuit Lower Bound)
Computing MOD_2 with MOD_3 circuits of depth d requires size at least (1/50)\, 2^{n^{1/2d}/2}.

The proof proceeds in two steps.

Step 1 We show that for any depth-d circuit C of size S and any integer l, there is a polynomial over GF(3) of degree (2l)^d that agrees with C on at least a 1 − S/2^l fraction of inputs; we then set 2l = n^{1/2d} to obtain a degree-\sqrt{n} polynomial that agrees with C on a 1 − S/2^{n^{1/2d}/2} fraction of inputs.

Step 2 We show that no polynomial of degree \sqrt{n} agrees with MOD_2 on more than a 49/50 fraction of inputs.

Together, the two steps imply that S \geq 2^{n^{1/2d}/2}/50 for any depth-d circuit computing MOD_2, thus proving the theorem. Now we give details.
Step 1. Consider a node i in the circuit at depth h. (The input is assumed to have depth 0.) If f_i(x_1, . . . , x_n) is the function computed at this node, we desire a polynomial \tilde{f}_i(x_1, . . . , x_n) over GF(3) with degree (2l)^h, such that on any input in {0, 1}^n \subseteq GF(3)^n, the polynomial \tilde{f}_i produces an output in {0, 1}. (In other words, even though \tilde{f}_i is a polynomial over GF(3), for 0/1 inputs it takes values in {0, 1} \subseteq GF(3).) Furthermore, we desire
$$\Pr_{x \in \{0,1\}^n}[\tilde{f}_i(x) \neq f_i(x)] \leq \frac{\text{circuit size up to node } i}{2^l}. \tag{3.1}$$
We construct the approximating polynomial by induction. The case h = 0 is trivial. Suppose we have replaced the output of all nodes up to height h with polynomials of appropriately bounded degree and error. We wish to approximate the output of a node g of height h + 1 with immediate inputs f_1, . . . , f_k.
1. If g is a NOT gate, we may write g = ¬f_i. Then \tilde{g} = 1 − \tilde{f}_i is an approximating polynomial with the same degree and error as that of \tilde{f}_i.

2. If g is a MOD_3 gate, we use the approximation \tilde{g} = (\sum_i \tilde{f}_i)^2. The degree increases by at most a factor of 2, and there is no increase in the error rate.

3. If g is an AND or an OR gate, the required approximation is slightly harder to produce. Suppose g = \wedge_{i \in I} f_i. The naive approach approximates an AND node g with the polynomial \prod_{i \in I} \tilde{f}_i. If g = \vee_{i \in I} f_i we use De Morgan's law and similarly obtain the naive approximator 1 − \prod_{i \in I}(1 − \tilde{f}_i). Unfortunately, both of these increase the degree of the polynomial by a factor of |I|, which could be much larger than the allowed 2l.
Let us give the correct solution for OR, leaving the case of AND to yet another application of De Morgan's laws.
If g is an OR gate, then g = 1 if and only if at least one of the f_i = 1. We observe that if any of the f_i = 1, then the sum (over GF(3)) of a random subset of {f_i} is nonzero with probability at least 1/2.
Pick l subsets S_1, . . . , S_l of {1, . . . , k} randomly. We compute the l polynomials (\sum_{j \in S_i} \tilde{f}_j)^2, each of which has degree at most twice that of the largest input polynomial. We then compute the OR of these l terms using the naive approach. The result is a polynomial with degree at most 2l times that of the largest input polynomial. For any x, the probability over the choice of subsets that this polynomial differs from OR(f_1, . . . , f_k) is at most 1/2^l. So, by the expectation argument, there exists a choice for the l subsets such that the probability over the choice of x that this polynomial differs from OR(f_1, . . . , f_k) is at most 1/2^l.
(There was a question in class about how the errors at different gates affect each other. An approximator may introduce some error, but another approximator higher up may introduce another error which cancels this one. The answer is that we are ignoring this issue, and using just the union bound to upperbound the probability that any of the approximator polynomials anywhere in the circuit miscomputes.)
This completes the inductive construction of the proper polynomial.
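The claim used for OR gates, that a nonzero 0/1 vector has a nonzero subset-sum mod 3 with probability at least 1/2, is easy to check exhaustively; here is a small Python verification (not from the notes) for vectors of length 8:

    import itertools

    def success_probability(f):
        """Pr over a uniform subset S that (sum_{i in S} f_i)^2 mod 3 equals 1."""
        k = len(f)
        hits = 0
        for bits in itertools.product([0, 1], repeat=k):     # all subsets S
            s = sum(fi for fi, b in zip(f, bits) if b) % 3
            hits += (s * s % 3 == 1)
        return hits / 2 ** k

    worst = min(success_probability(f)
                for f in itertools.product([0, 1], repeat=8) if any(f))
    print(worst)   # 0.5: every nonzero 0/1 vector of length 8 succeeds w.p. >= 1/2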
Step 2. Suppose that a polynomial f of degree \sqrt{n} agrees with the MOD_2 function on a subset G \subseteq {0, 1}^n. After the change of variables y_i = 1 + x_i (so that y_i \in {1, −1} \subseteq GF(3)), parity becomes the product \prod_i y_i, and f becomes a polynomial g of degree \sqrt{n} in the y_i. Let F_G denote the set of all functions from G to GF(3).

Lemma 21
For every S \in F_G, there exists a polynomial g_S which is a sum of monomials a_I \prod_{i \in I} y_i with |I| \leq n/2 + \sqrt{n}. (3.3)

To see this, note that over G any monomial \prod_{i \in I} y_i with |I| > n/2 takes the same values as g(y_1, y_2, . . . , y_n)\prod_{i \notin I} y_i, a polynomial of degree at most n − |I| + \sqrt{n} \leq n/2 + \sqrt{n}. Thus w.l.o.g. every S \in F_G is represented by a polynomial of degree at most n/2 + \sqrt{n}, and counting gives
$$3^{|G|} = |F_G| \leq 3^{\sum_{i \leq n/2 + \sqrt{n}} \binom{n}{i}}.$$
Using knowledge of the tails of a binomial distribution,
$$\sum_{i \leq n/2 + \sqrt{n}} \binom{n}{i} \leq \frac{49}{50}\, 2^n, \tag{3.6}$$
and hence |G| \leq (49/50)\, 2^n, as required.
Chapter 4
4.1
This example is partially motivated by the study of massively parallel computers and partly by telephone networks. We model a communications network as a graph with N inputs and N outputs. We wish to route calls from the input users to the output users through routers, represented by internal nodes. Each call occupies the routers it uses, so separate calls must take vertex-disjoint paths. In particular, we would like to be able to route any permutation matching inputs to outputs with decentralized control. In this section we construct a network in which all permutations can be routed. In the next lecture, we will decentralize the control.

Remark 4 Any network which can route all of the permutations has size \Omega(N \log N), since
$$N! \leq \#\text{states} \leq d^{\#\text{nodes}}. \tag{4.1}$$
Figure 4.1: An example delta network. Each layer contains a dotted set of edges up and a solid set of edges down. In the multibutterfly, each of these sets forms an expander graph.
Definition 5 A Delta network on N inputs is defined recursively: it consists of the first level, which contains the N input nodes, and two subnetworks that are Delta networks on N/2 inputs, and are called the up and down networks respectively. Each input node has edges to inputs in both the up and down networks. Outputs whose address has MSB (most significant bit) 1 lie in the up subnetwork and the remaining outputs are in the down subnetwork.

Clearly, any call to a destination whose MSB is 1 must go from the input node to some node in the up subnetwork. Reasoning similarly for all the levels, we see that the delta network is a layered network in which the up/down decision at level i depends on the ith bit of the destination node. We may route to any of the input nodes of the smaller networks.
An example of the Delta network is the butterfly, where each input node (b_0 b_1 . . . b_{\log N}) has exactly two outgoing edges, one each to the up and down subnetworks. The edges go to nodes whose label in that subnetwork is (b_1 b_2 . . . b_{\log N}).
Butterfly networks cannot route all permutations with vertex-disjoint paths. For instance, they function poorly on the transpose permutation. Instead, we present the multibutterfly network of Upfal '88. In the butterfly network, each of the N nodes in the input layer is connected to one of the N/2 top nodes of the next layer. In the multibutterfly, the degree of that node is d and the subgraph of edges from each of the N nodes to the N/2 top nodes of the next layer is an (\alpha, \beta)-expander for some \alpha < 1/2.
We assume that the number of inputs (also, the number of outputs) is 2\alpha N and each is connected to 1/(2\alpha) successive inputs (resp., outputs) of the multibutterfly. (Alternatively, we may still use N for the number of inputs and outputs and then connect them using multibutterflies with N/(2\alpha) inputs.)
We claim that any permutation on these 2\alpha N inputs/outputs can be routed using vertex-disjoint paths. We do this layer by layer. At most \alpha N calls need to be routed to the top (or bottom) subnetwork. Let S be the set of nodes at which these calls originate. We can route these calls in a vertex-disjoint manner if there exists a matching between this subset S of input nodes and some subset S' of nodes in the top network. To see that this can be done, we apply Hall's theorem, noting that the expander property guarantees that S has at least |S| neighbours.

Theorem 22 (Hall's Theorem)
If G = (V_1, V_2, E) is a bipartite graph, there exists a perfect matching involving V_1 if and only if for all subsets S \subseteq V_1, |\Gamma(S)| \geq |S|.

Proof: Apply the max-flow min-cut theorem. □

Repeating for all levels we obtain our vertex-disjoint paths.

4.2
3. Every node in L that receives an acceptance matches to any one of the nodes in R that accepted it, and then drops out. All remaining nodes constitute S_{i+1}.
4. Repeat till all vertices of L are matched.

The next claim shows that |S_i| decreases by a constant factor after every phase, and so in O(log N) phases it falls below 1, that is, it becomes 0.

Claim: |S_{i+1}|/|S_i| \leq 2(1 − \beta/d).
In phase i, let n_j be the number of vertices from R that get exactly j proposals. Since G is a d-regular graph, we have:
$$|S_i|\, d \geq n_1 + 2n_2 + 3n_3 + \cdots = \sum_{k \geq 1} k\, n_k \geq n_1 + 2 \sum_{k \geq 2} n_k.$$
Since G is an (\alpha, \beta)-expander:
$$|\Gamma(S_i)| = n_1 + n_2 + \cdots = \sum_{k \geq 1} n_k \geq \beta |S_i|.$$
Combining both:
$$n_1 \geq 2\Big[\sum_{k \geq 1} n_k\Big] - \Big[n_1 + 2\sum_{k \geq 2} n_k\Big] \geq (2\beta - d)|S_i|.$$
This is the number of nodes in R that send acceptances. Any node in S_i can receive at most d acceptances, so the number that drop out is at least n_1/d. Thus |S_{i+1}| \leq |S_i| − n_1/d and the claim follows.
Remark 5 This simple algorithm only scratches the surface of what is possible. One can improve the algorithm to run in O(log N) time, and furthermore, route calls in a nonblocking fashion. This means that callers can make calls and hang up any number of times and in any (adversarially determined) order, but still every unused input can call any unused output and the call is placed within O(log N) steps using local control. The main idea in proving the nonblocking property is to treat busy nodes in the circuit (those currently used by other paths) as faulty, and to show that the remaining graph/circuit still has high expansion. See the paper by Arora, Leighton, Maggs and an improvement by Pippenger that requires expansion much less than d/2.
Chapter 5

5.1

5.1.1

Recall that for a symmetric matrix M \in R^{n \times n}, the smallest eigenvalue is
$$\lambda_1 = \min_{x \in \mathbb{R}^n,\, x \neq 0} \frac{x^T M x}{x^T x},$$
and if x_1 is a corresponding eigenvector, the second smallest eigenvalue is
$$\lambda_2 = \min_{x \in W(x_1),\, x \neq 0} \frac{x^T M x}{x^T x}.$$
If we denote by W(x_1, . . . , x_{k-1}) the set of vectors orthogonal to x_1, . . . , x_{k-1}, then the kth smallest eigenvalue is:
$$\lambda_k = \min_{x \in W(x_1, \ldots, x_{k-1}),\, x \neq 0} \frac{x^T M x}{x^T x}.$$
This characterization of the spectrum is called the Courant-Fischer Theorem.
3. Denote by Spec(M) the spectrum of matrix M, that is, the multi-set of its eigenvalues. For a block-diagonal matrix M, that is, a matrix of the form
$$M = \begin{pmatrix} A & 0 \\ 0 & B \end{pmatrix},$$
the following holds: Spec(M) = Spec(A) \cup Spec(B).

4. Eigenvalues of a matrix can be computed in polynomial time. (Eigenvalues are the roots of the characteristic polynomial of a matrix.)

5. The Interlacing theorem:
A matrix B is called a principal minor of matrix M if it can be obtained from M by deleting k < n columns and the corresponding k rows.
Let A \in R^{(n-1) \times (n-1)} be a principal minor of the matrix M. Let
$$\mathrm{Spec}(A) = \{\mu_1 \leq \mu_2 \leq \cdots \leq \mu_{n-1}\}.$$
Then:
$$\lambda_1 \leq \mu_1 \leq \lambda_2 \leq \mu_2 \leq \cdots \leq \mu_{n-1} \leq \lambda_n.$$
5.1.2 Matrices of Graphs

The most common matrix associated with graphs in the literature is the adjacency matrix. For a graph G = (V, E), the adjacency matrix A = A_G is defined as
$$A_{i,j} = \begin{cases} 1 & (i,j) \in E, \\ 0 & \text{otherwise.} \end{cases}$$
Another common matrix is the Laplacian L = L_G, where d_i denotes the degree of vertex i:
$$L_{i,j} = \begin{cases} 1 & \text{if } i = j, \\ -\frac{1}{\sqrt{d_i d_j}} & \text{if } (i,j) \in E,\ i \neq j, \\ 0 & \text{otherwise.} \end{cases}$$
Example - The cycle
Consider a cycle on n nodes. The Laplacian is
$$L = \begin{pmatrix} 1 & -\tfrac{1}{2} & 0 & \cdots & -\tfrac{1}{2} \\ -\tfrac{1}{2} & 1 & -\tfrac{1}{2} & \cdots & 0 \\ \vdots & & \ddots & & \vdots \\ -\tfrac{1}{2} & 0 & \cdots & -\tfrac{1}{2} & 1 \end{pmatrix}.$$
This matrix has 1s on the diagonal, and 0s or −1/2 elsewhere, depending on whether the indices correspond to an edge. Since a cycle is 2-regular, in each row and column there are exactly two entries equal to −1/2.
Claim 1 The all-ones vector 1 is an eigenvector of the Laplacian of the n-node cycle, corresponding to eigenvalue 0.
Proof: Observe that since every row has exactly two entries equal to −1/2 and a 1 on the diagonal,
$$L \cdot \mathbf{1} = 0 = 0 \cdot \mathbf{1}. \qquad \Box$$
In fact, we can characterize all eigenvalues of the cycle:

Claim 2 The eigenvalues of the Laplacian of the n-node cycle are
$$\Big\{1 - \cos\frac{2\pi k}{n} \;:\; k = 0, 1, \ldots, n-1\Big\}.$$
Proof: Observe that if the vertices are named consecutively, then each vertex i is connected to i − 1 and i + 1 (mod n). Therefore, a value \lambda is an eigenvalue with eigenvector x if and only if for every index i (with indices taken modulo n):
$$x_i - \frac{1}{2}(x_{i-1} + x_{i+1}) = \lambda x_i.$$
It remains to exhibit the eigenvectors. With the eigenvalue \lambda_k = 1 − \cos(2\pi k/n) we associate the eigenvector x^k with coordinates
$$x^k_i = \cos\frac{2\pi i k}{n}.$$
And indeed (recall the identity \cos x + \cos y = 2 \cos\frac{x+y}{2} \cos\frac{x-y}{2}):
$$x^k_i - \frac{1}{2}(x^k_{i-1} + x^k_{i+1}) = \cos\frac{2\pi i k}{n} - \frac{1}{2}\Big[\cos\frac{2\pi (i-1) k}{n} + \cos\frac{2\pi (i+1) k}{n}\Big] = \cos\frac{2\pi i k}{n} - \cos\frac{2\pi i k}{n}\cos\frac{2\pi k}{n} = \Big(1 - \cos\frac{2\pi k}{n}\Big) x^k_i = \lambda_k x^k_i. \qquad \Box$$
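A quick numerical check of Claim 2 (assuming numpy; not from the notes):

    import numpy as np

    n = 8
    L = np.eye(n)
    for i in range(n):
        L[i, (i - 1) % n] = L[i, (i + 1) % n] = -0.5

    computed = np.sort(np.linalg.eigvalsh(L))
    predicted = np.sort(1 - np.cos(2 * np.pi * np.arange(n) / n))
    print(np.allclose(computed, predicted))   # True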
5.1.3

In this section we state the connection between expansion of a graph (defined below) and the eigenvalues of its characteristic matrices.

Definition 7 Let G = (V, E) be a graph. For any subset of vertices S \subseteq V define its volume to be
$$\mathrm{Vol}(S) := \sum_{i \in S} d_i.$$
For a subset of vertices S \subseteq V, denote by E(S, \bar{S}) the set of edges crossing the cut defined by S. Using the above definition we can define the edge expansion of a graph:

Definition 8 The Cheeger constant of G is
$$h_G := \min_{S \subseteq V} \frac{|E(S, \bar{S})|}{\min\{\mathrm{Vol}(S), \mathrm{Vol}(\bar{S})\}} = \min_{S \subseteq V,\ \mathrm{Vol}(S) \leq |E|} \frac{|E(S, \bar{S})|}{\mathrm{Vol}(S)}.$$
Theorem 23 (Cheeger's inequality)
Let \lambda be the second smallest eigenvalue of the Laplacian L of G. Then
$$\frac{h_G^2}{2} \leq \lambda \leq 2 h_G.$$

We first observe that the vector w with w_i = \sqrt{d_i} is an eigenvector of L with eigenvalue 0.
Proof: By straightforward calculation. Denote by b the vector b := L w. The ith coordinate of b is
$$b_i = w_i - \sum_{j \sim i} \frac{w_j}{\sqrt{d_i d_j}} = \sqrt{d_i} - \sum_{j \sim i} \frac{1}{\sqrt{d_i}} = \sqrt{d_i} - \frac{d_i}{\sqrt{d_i}} = 0.$$
Hence b is the zero vector, and thus 0 is an eigenvalue of L corresponding to the eigenvector w. □

Using the previous claim, we can write an explicit expression for the second smallest eigenvalue:
$$\lambda = \min_{x : \sum_i \sqrt{d_i}\, x_i = 0} \frac{x^T L x}{x^T x}. \tag{5.1.1}$$
Writing out x^T L x = \sum_{i,j} x_i L_{i,j} x_j, this becomes
$$\lambda = \min_{x : \sum_i \sqrt{d_i}\, x_i = 0} \frac{\sum_i x_i^2 - \sum_i \sum_{j \sim i} \frac{x_i x_j}{\sqrt{d_i d_j}}}{\sum_i x_i^2}. \tag{5.1.2}$$
Now substitute y_i := x_i/\sqrt{d_i}:
$$\lambda = \min_{\sum_i d_i y_i = 0} \frac{\sum_i d_i y_i^2 - \sum_i \sum_{j \sim i} y_i y_j}{\sum_i d_i y_i^2} \tag{5.1.3}$$
$$= \min_{\sum_i d_i y_i = 0} \frac{\sum_{(i,j) \in E} (y_i - y_j)^2}{\sum_i d_i y_i^2}. \tag{5.1.4}$$
To prove the upper bound \lambda \leq 2h_G, consider an arbitrary subset S \subseteq V (with Vol(S) \leq Vol(\bar{S})) and the vector a with coordinates
$$a_i = \begin{cases} \dfrac{1}{\mathrm{Vol}(S)} & i \in S, \\[6pt] -\dfrac{1}{\mathrm{Vol}(\bar S)} & i \notin S. \end{cases}$$
Notice that a is a legal test vector, as
$$\sum_i d_i a_i = \sum_{i \in S} \frac{d_i}{\mathrm{Vol}(S)} - \sum_{i \notin S} \frac{d_i}{\mathrm{Vol}(\bar S)} = 1 - 1 = 0.$$
Now, according to the last expression obtained for \lambda we get
$$\lambda \leq \frac{\sum_{(i,j) \in E} (a_i - a_j)^2}{\sum_i d_i a_i^2}
= \frac{|E(S, \bar S)| \big(\frac{1}{\mathrm{Vol}(S)} + \frac{1}{\mathrm{Vol}(\bar S)}\big)^2}{\sum_{i \in S} \frac{d_i}{\mathrm{Vol}(S)^2} + \sum_{i \notin S} \frac{d_i}{\mathrm{Vol}(\bar S)^2}}
= |E(S, \bar S)| \Big(\frac{1}{\mathrm{Vol}(S)} + \frac{1}{\mathrm{Vol}(\bar S)}\Big)
\leq \frac{2\, |E(S, \bar S)|}{\mathrm{Vol}(S)}.$$
And since this holds for any such S \subseteq V, it specifically holds for the set that minimizes the quantity in Cheeger's constant, and we get
$$\lambda \leq 2 h_G.$$
Before we proceed to the more difficult part of the theorem, we recall the Cauchy-Schwarz inequality: if a_1, . . . , a_n \in R and b_1, . . . , b_n \in R, then
$$\sum_{i=1}^{n} a_i b_i \leq \Big(\sum_i a_i^2\Big)^{1/2} \Big(\sum_i b_i^2\Big)^{1/2}.$$
Now let y attain the minimum in (5.1.4), so that \lambda = \sum_{(i,j) \in E}(y_i - y_j)^2 / \sum_i d_i y_i^2, and define
$$u_i = \begin{cases} -y_i & y_i < 0 \\ 0 & \text{otherwise} \end{cases}
\qquad
v_i = \begin{cases} y_i & y_i > 0 \\ 0 & \text{otherwise.} \end{cases}$$
Observe that (y_i − y_j)^2 \geq (u_i − u_j)^2 + (v_i − v_j)^2 and y_i^2 = u_i^2 + v_i^2. And hence:
$$\lambda \geq \frac{\sum_{(i,j) \in E} \big[(u_i - u_j)^2 + (v_i - v_j)^2\big]}{\sum_i d_i (u_i^2 + v_i^2)}. \tag{5.1.5}$$
Since (a+b)/(c+d) \geq \min(a/c, b/d), we may assume without loss of generality that
$$\lambda \geq \frac{\sum_{(i,j) \in E} (u_i - u_j)^2}{\sum_i d_i u_i^2}.$$
Now comes the mysterious part (at least to Sanjeev): multiply and divide by the same quantity.
$$\frac{\sum_{(i,j) \in E} (u_i - u_j)^2}{\sum_i d_i u_i^2}
= \frac{\sum_{(i,j) \in E} (u_i - u_j)^2}{\sum_i d_i u_i^2} \cdot \frac{\sum_{(i,j) \in E} (u_i + u_j)^2}{\sum_{(i,j) \in E} (u_i + u_j)^2}
\geq \frac{\big[\sum_{(i,j) \in E} (u_i - u_j)^2\big]\big[\sum_{(i,j) \in E} (u_i + u_j)^2\big]}{2\big(\sum_i d_i u_i^2\big)^2},$$
where the last inequality comes from (a + b)^2 \leq 2(a^2 + b^2). Now using Cauchy-Schwarz in the numerator:
$$\geq \frac{\big[\sum_{(i,j) \in E} (u_i - u_j)(u_i + u_j)\big]^2}{2\big(\sum_i d_i u_i^2\big)^2}
= \frac{\big[\sum_{(i,j) \in E} (u_i^2 - u_j^2)\big]^2}{2\big(\sum_i d_i u_i^2\big)^2}.$$
Now order the vertices so that u_1 \geq u_2 \geq \cdots \geq u_n, and denote by S_k = {v_1, . . . , v_k} \subseteq V the set of the first k vertices. Denote by C_k the size of the cut induced by S_k:
$$C_k := |E(S_k, \bar{S_k})|.$$
Now, since u_i^2 − u_j^2 = (u_i^2 − u_{i+1}^2) + (u_{i+1}^2 − u_{i+2}^2) + \cdots + (u_{j-1}^2 − u_j^2), we can write
$$\sum_{(i,j) \in E} (u_i^2 - u_j^2) = \sum_k (u_k^2 - u_{k+1}^2)\, C_k,$$
and hence
$$\lambda \geq \frac{\big[\sum_k (u_k^2 - u_{k+1}^2)\, C_k\big]^2}{2\big(\sum_i d_i u_i^2\big)^2}.$$
According to the definition of h_G we know that C_k \geq h_G \big(\sum_{i \leq k} d_i\big) (as h_G is the minimum of a set of expressions containing these). Hence
$$\lambda \geq \frac{\big[\sum_k (u_k^2 - u_{k+1}^2)\, h_G \big(\sum_{i \leq k} d_i\big)\big]^2}{2\big(\sum_i d_i u_i^2\big)^2}
= h_G^2\, \frac{\big[\sum_k (u_k^2 - u_{k+1}^2) \big(\sum_{i \leq k} d_i\big)\big]^2}{2\big(\sum_i d_i u_i^2\big)^2}
= h_G^2\, \frac{\big(\sum_k d_k u_k^2\big)^2}{2\big(\sum_i d_i u_i^2\big)^2}
= \frac{h_G^2}{2}.$$
And this concludes the second part of the theorem. □

Remark 6 Note that we proved the stronger result that one of the sweep cuts (S_k, \bar{S_k}) actually satisfies |E(S_k, \bar{S_k})|/\mathrm{Vol}(S_k) \leq \sqrt{2\lambda}.
Chapter 6

6.1 Basics
The uniform distribution, which assigns probability 1/n to each node, is a stationary distribution for this chain, since it is unchanged after applying one step of the chain.

Definition 11 A Markov chain M is ergodic if there exists a unique stationary distribution \pi and for every (initial) distribution x the limit \lim_{t \to \infty} x M^t = \pi.

Theorem 24
The following are necessary and sufficient conditions for ergodicity:
1. connectivity: \forall i, j : M^t(i, j) > 0 for some t.
2. aperiodicity: \forall i, j : \gcd\{t : M^t(i, j) > 0\} = 1.

Remark 7 Clearly, these conditions are necessary. If the Markov chain is disconnected it cannot have a unique stationary distribution; there is a different stationary distribution for each connected component. Similarly, a bipartite graph does not have a unique distribution: if the initial distribution places all probability on one side of the bipartite graph, then the distribution at time t oscillates between the two sides depending on whether t is odd or even. Note that in a bipartite graph \gcd\{t : M^t(i, j) > 0\} \geq 2. The sufficiency of these conditions is proved using eigenvalue techniques (for inspiration see the analysis of mixing time later on).
Both conditions are easily satisfied in practice. In particular, any Markov chain can be made aperiodic by adding self-loops assigned probability 1/2.

Definition 12 An ergodic Markov chain is reversible if the stationary distribution \pi satisfies \pi_i P_{ij} = \pi_j P_{ji} for all i, j.
Uses of Markov Chains. A Markov Chain is a very convenient way to model many situations where the memoryless property makes sense. Examples include communication theory (Markovian sources), linguistics (Markovian models of language production), speech recognition, and internet search (Google's PageRank algorithm is based upon a Markovian model of a random surfer).
6.2 Mixing Times
Informally, the mixing time of a Markov chain is the time it takes to reach nearly uniform distribution from any arbitrary starting distribution.

Definition 13 The mixing time of an ergodic Markov chain M is t if for every starting distribution x, the distribution xM^t satisfies ||xM^t − \pi||_1 \leq 1/4. (Here ||\cdot||_1 denotes the \ell_1 norm and the constant 1/4 is arbitrary.)

The next exercise clarifies why we are interested in the \ell_1 norm.

Exercise 2 For any distribution \pi on {1, 2, . . . , N} and S \subseteq {1, 2, . . . , N}, let \pi(S) = \sum_{i \in S} \pi_i. Show that for any two distributions \pi, \pi',
$$\|\pi - \pi'\|_1 = 2 \max_{S \subseteq \{1, \ldots, N\}} |\pi(S) - \pi'(S)|. \tag{6.2.1}$$
Here is another way to restate the property in (6.2.1). Suppose A is some deterministic algorithm (we place no bounds on its complexity) that, given any number i \in {1, 2, . . . , N}, outputs Yes or No. If ||\pi − \pi'||_1 \leq \epsilon then the probability that A outputs Yes on a random input drawn according to \pi cannot be too different from the probability it outputs Yes on an input drawn according to \pi'. For this reason, \ell_1 distance is also called statistical difference.
We are interested in analysing the mixing time so that we can draw a sample from the stationary distribution.
Example 7 (Mixing time of a cycle) Consider an n-cycle, i.e., a Markov chain with n states where, at each state, Pr(left) = Pr(right) = Pr(stay) = 1/3.
Suppose the initial distribution concentrates all probability at state 0. Then t steps correspond to about 2t/3 random coin tosses and the index of the final state is
$$(\#(\text{Heads}) - \#(\text{Tails})) \pmod n.$$
Clearly, it takes \Omega(n^2) steps for the walk to reach the other half of the circle with any reasonable probability, and the mixing time is \Omega(n^2). We will later see that this lowerbound is fairly tight.
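A small simulation (standard library only; not from the notes) of this lazy walk makes the n^2 behaviour visible, by tracking the \ell_1 distance of the time-t distribution from uniform:

    n = 50
    dist = [0.0] * n
    dist[0] = 1.0
    uniform = 1.0 / n

    for t in range(1, 4 * n * n + 1):
        dist = [(dist[i] + dist[(i - 1) % n] + dist[(i + 1) % n]) / 3 for i in range(n)]
        if t in (n, n * n, 4 * n * n):
            l1 = sum(abs(p - uniform) for p in dist)
            print(f"t = {t:6d}   l1 distance from uniform = {l1:.4f}")
    # the distance is still large at t = n but small once t is a few times n^2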
6.2.1

Markov chains allow one to sample from very nontrivial sets, provided we know how to find at least one element of this set. The idea is to define a Markov chain whose state space is the same as this set. The Markov chain is such that it has a unique stationary distribution, which is uniform. We know how to find one element of the set. We do a walk according to the Markov chain with this as the starting point, and after T = O(mixing time) steps, output the node we are at. This is approximately a random sample from the set. We illustrate this idea later. First we discuss why sampling from large sets is important.
Usually this set has exponential size and it is only given implicitly. We give a few examples of some interesting sets.
Example 8 (Perfect matchings) For a given graph G = (V, E), the set of all perfect matchings in G could be exponentially large compared to the size of the graph, and is only known implicitly. We know how to generate some element of this set, since we can find a perfect matching (if one exists) in polynomial time. But how do we generate a random element?

Example 9 (0-1 knapsack) Given a_1, . . . , a_n, b \in Z^+, the set of vectors (x_1, . . . , x_n) \in {0, 1}^n such that \sum_i a_i x_i \leq b.
In both cases, determining the exact size of the set is in #P (the complexity class corresponding to counting the number of solutions to an NP problem). In fact, we have the following.

Theorem 25 (Valiant, late 1970s)
If there exists a polynomial-time algorithm for counting the number of perfect matchings or the number of solutions to the 0-1 knapsack counting problem, then P = NP (in fact, P = P^{#P}).
Valiant's Theorem does not rule out finding good approximations to this problem.

Definition 14 A Fully Polynomial Randomized Approximation Scheme (FPRAS) is a randomized algorithm which, for any \epsilon, \delta, finds an answer in time polynomial in (n/\epsilon) \log(1/\delta) that is correct within a multiplicative factor (1 + \epsilon) with probability (1 − \delta).

It turns out that approximate counting is equivalent to approximate sampling.

Theorem 26 (Jerrum, Valiant, Vazirani, 1984)
If we can sample almost uniformly, in polynomial time, from A = {(x_1, . . . , x_n) : \sum_i a_i x_i \leq b}, then we can design an FPRAS for the knapsack counting problem. Conversely, given an FPRAS for knapsack counting, we can draw an almost uniform sample from A.

Remark 8 By sampling almost uniformly we mean having a sampling algorithm whose output distribution has \ell_1 distance exp(−n^2) (say) from the uniform distribution. For ease of exposition, we think of this as a uniform sample.
Proof: We first show how to count approximately, assuming there is a polynomial-time sampling algorithm. The idea is simple though the details require some care (which we suppress here). Suppose we have a sampling algorithm for knapsack. Draw a few samples from A, and observe what fraction have x_1 = 0. Say it is p. Let A_0 be the set of solutions with x_1 = 0. Then p = |A_0| / |A|. Now since A_0 is the set of (x_2, . . . , x_n) such that \sum_{i \geq 2} a_i x_i \leq b − a_1 x_1, it is also the set of solutions of a knapsack problem, but with one fewer variable. Using the algorithm recursively, assume we can calculate |A_0|. Then we can calculate
$$|A| = |A_0| / p.$$
Of course, we do not know |A_0| and p exactly but only up to some accuracy, say a factor (1 + \epsilon), so we will only know |A| up to accuracy (1 + \epsilon)^2 \approx 1 + 2\epsilon.
Actually the above is not accurate, since it ignores the possibility that p is so small that we never see an element of A_0 when we draw poly(n) samples from A. However, in that case, the set A_1 = A \ A_0 must be at least 1/2 of A and we can estimate its size. Then we proceed in the rest of the algorithm using A_1.
Therefore, by choosing \epsilon appropriately so that (1 + \epsilon)^n is small, and using the Chernoff bound, we can achieve the desired bound on the error in polynomial time.
The converse is similar. To turn a counting algorithm into a sampling algorithm, we need to show how to output a random member of A. We do this bit by bit, first outputting x_1, then x_2, and so on. To output x_1, output 0 with probability p and 1 with probability 1 − p, where p = |A_0| / |A| is calculated by calling the counting algorithm twice. Having output x_1 with the correct probability, we are left with a sampling problem on n − 1 variables, which we solve recursively. Again, we need some care because we only have an approximate counting algorithm instead of an exact algorithm. Since we need to call the approximate counting algorithm only 2n times, an error of (1 + \epsilon) each time could turn into an error of (1 + \epsilon)^{2n}, which is about 1 + 2\epsilon n. □
Thus to count approximately, it suces to sample from the uniform distribution. We
dene a Markov chain M on A whose stationary distribution is uniform. Then we show
that its mixing time is poly(n).
45
The Markov chain is as follows. If the current node is (x1 , . . . , xn ) (note a1 x1 + a2 x2 +
. . . + an xn b) then
1. with probability 1/2 remain at the same node
2. else pick i {1, . . . , n}.
Let y = (x1 , . . . , xi1 , 1 xi , xi+1 , . . . , xn ). If y A, go there. Else stay put.
Note that M is
1. aperiodic because of self-loops
2. connected because every sequence can be turned into the zero vector in a nite number
of transformations, i.e., every node is connected to 0.
Therefore, M is ergodic, i.e., has a unique stationary distribution. Since the uniform distribution is stationary, it follows that the stationary distribution of M is uniform.
Now the question is: how fast does M converge to the uniform distribution? If M mixes
fast, we can get an ecient approximation algorithm for the knapsack counting: we get the
solution by running M for the mixing time and sampling from the resulting distribution
after the mixing time has elapsed.
Theorem 27
(Morris-Sinclair, 1999): The mixing time for M is O(n8 ).
Fact (see our remark later in our analysis of mixing time): running the M for a bit
longer than the mixing time results in a distribution that is extremely close to uniform.
Thus, we get the following sampling algorithm:
1. Start with the zero vector as the initial distribution of M .
2. Run M for O(n9 ) time.
3. output the node at which the algorithm stops.
This results in a uniform sampling from A.
Thus Markov chains are useful for sampling from a distribution. Often, we are unable to
prove any useful bounds on the mixing time (this is the case for many Markov chains used
in simulated annealing and the Metropolis algorithm of statistical physics) but nevertheless
in practice the chains are found to mix rapidly. Thus they are useful even though we do
not have a proof that they work.
6.3
46
Since M is ergodic, and since n1 1 is a stationary distribution, then n1 1 is the unique
stationary distribution for M .
The question is how fast does M convege to n1 1? Note that if x is a distribution, x can
be written as
1
x = 1 +
n
n
i ei
i=2
where ei are the eigenvectors of M which form an orthogonal basis and 1 is the rst eigenvector with eigenvalue 1. (Clearly, x can be written as a combination of the eigenvectors;
2
the observation here is that the coecient in front of the rst eigenvector 1 is 1 x/ 1
2
which is n1 i xi = n1 .)
M t x = M t1 (M x)
1
( 1 +
i i ei )
n
n
= M
t1
i=2
1
(M ( 1 +
i i ei ))
n
n
= M
t2
...
n
1
1+
i ti ei
=
n
i=2
i=2
Also
n
i ti ei 2 tmax
i=2
but Cauchy Schwartz inequality relates the 2 and 1 distances: |p|1 n |p|2 . So we have
proved:
Theorem 28
n
).
The mixing time is at most O( log
max
Note also that if we let the Markov chain run for O(k log n/max ) steps then the distance
to uniform distribution drops to exp(k). This is why we were not very fussy about the
constant 1/4 in the denition of the mixing time earlier.
Finally, we recall from the last lecture: for S V , V ol(S) = iS di , where di is the
degree of node i, the Cheeger Constant is
E(S, S
min
hG =
V ol(V ) V ol(S)
SV,vol(S)
2
47
2hG
h2G
2
n
i ti ei 2 (1
i=2
h2G t
) n
2
c t
n2 ) ,
48
6.4
n
i vi .
i=2
Multiplying
the above equation by 1, we get 1 = 1 (since x1 = 1 = 1). Therefore
t
xM = + ni=2 i ti vi , and hence
||xM t ||2 ||
n
i ti vi ||2
(6.4.1)
i=2
22 + + 2n
t ||x||2 ,
as needed. 2
(6.4.2)
(6.4.3)
Chapter 7
7.1
7.1.1
n/2
(n/2)!
49
50
7.1.2
Funny facts
1. Scribes contribution: The volume of the unit n-ball tends to 0 as the dimension tends
to .
2. For any c > 1, a (1 1c ) - fraction of the volume of the n-ball lies in a strip of width
/
n
a a
O( c log
n ). A strip of width a is Bn intersected with {(x1 , ..., xn )|x1 [ 2 , 2 ]}.
3. If you pick 2 vectors on the surface of Bn independently, then with probability > 1 n1 ,
| cos(x,y )| = O(
log n
),
n
where x,y is the angle between x and y. In other words, the 2 vectors are almost
orthogonal w.h.p. To prove this, we use the following lemma:
Lemma 31
Suppose a is a unit vector in Rn . Let x = (x1 , ..., xn ) Rn be chosen from the surface
of Bn by choosing each coordinate at random from
{1, 1} and normalizing by factor
1 . Denote by X the random variable a x =
ai xi . Then:
n
2
= E(X) = E(
ai xi ) = 0
a2
1
i
= .
ai xi )2 ] = E[
ai aj xi xj ] =
ai aj E[xi xj ] =
2 = E[(
n
n
Using the Cherno bound, we see that,
t 2
Corollary 32
If two unit vectors x, y are chosen at random from Rn , then
0
P r |cos(x,y )| >
Now, to get fact (3), put = n1 .
log
n
2
<
51
7.2
We can apply our earlier study of random walks to geometric random walks, and derive
bounds on the mixing time using geometric ideas.
Denition of a convex body: A set K Rn is convex if x, y K, [0, 1],
x + (1 )y K.
In other words, for any two points in K, the line segment connecting them is in the set.
examples: Balls, cubes, polytopes (=set of feasible solutions to an LP), ellipsoides etc.
Convex bodies are very important in theory of algorithms. Many optimization problems
can be formulated as optimizing convex functions in convex domains. A special case is linear
programming.
Todays Goal: To generate a random point in a convex body. This can be used to
approximate the volume of the body and to solve convex programs (we saw how to do the
latter in Vempalas talk at the theory lunch last week). We emphasize that the description
below is missing many details, though they are easy to gure out.
First we need to specify how the convex body is represented:
0 K.
K is contained in a cube of side R with center at 0.
A unit cube is contained in K.
R n2 .
there exists a separation oracle that given any point x Rn returns yes if x K
and if x K then returns a separating hyperplane {y|aT y = b} s.t. x and K are on
opposite sides (i.e. aT x < b, aT z > b z K). Note 1: a (closed) convex body can
actually be characterized by the property that it can be separated from any external
point with a separating hyperplane. Note 2: In the linear program case, giving a
separating hyperplane is equivalent to specifying a violated inequality.
The idea is to approximate the convex body K with a ne grid of scale < n12 . The
volume of K is approximately proportional to the number of grid points. There is some
error because we are approximating the convex body using a union of small cubes, and the
error term is like
(number of grid cubes that touch the surface of K) (volume of small cube),
which is volume of K since the minimum crosssection of K (at least 1) is much larger
than the dimensions of the grid cube.
Consider the graph GK whose vertices are grid notes contained in K, and there is an
edge between every pair of grid neighbors. To a rst approximation, the graph is regular:
almost every node has degree 2n. The exceptions are grid points that lie close to the surface
of K, since not their 2n neighbors may lie inside K. But as noted, such grid points form
a negligible fraction. Thus generating a random point of GK is a close approximation to
generating a random point of K.
52
Figure 7.1: The bottom-left most grid point in the triangle is isolated.
Figure 7.2: The dashed convex body is obtained by smoothening the solid triangle.
However, we only know one node in GK , namely, the origin 0. Thus one could try the
following idea. Start a random walk from 0. At any step, if youre on a grid point x, stay
at x with probability 1/2, else randomly pick a neighbor y of x. If y K, move to y, else
stay at x. If this walk is ergodic, then the stationary distribution will be close to uniform
on GK , as noted. The hope is that this random walk is rapidly mixing, so running it for
poly(n) time does yield a fairly unbiased sample from the stationary distribution.
Unfortunately, the walk may not be ergodic since GK may not be connected (see gure 7.1).
To solve this problem, we smoothen K, namely, we replace K with a -neighborhood of
K dened as
Bn (x, ),
K =
xk
where Bn (x, ) is a closed ball centered at x with radius , and n1 (see gure 7.2). It
can be checked that this negligibly increases the volume while ensuring the following two
facts:
K is convex, and a separation oracle for K can be easily built from the separation
oracle for K, and,
Any 2 grid points in K are connected via a path of grid points in K .
Now let GK be the graph on grid nodes contained in K . It is connected, and almost
all its nodes are in K (since the volume increase in going from K to K was minimal).
53
We go back to estimating the mixing time. Recall that the mixing time is the time t
s.t. starting from any distribution x0 , after t steps the distribution xt satises
xt 2
1
.
4
2 t
) N,
2
where is the Cheeger number attached to the graph and N is the size of the graph (number
of grid points in K ). So we reduce to the problem of estimating the Cheeger constant.
[Aside: In case you ever go through the literature, you will see that many papers use
the conductance of the graph instead of the Cheeger number. The conductance of a
graph G is
|E(S, S)|Vol(V
)
min
.
SV Vol(S)Vol(S)
Easy exercise: the conductance is within factor 2 of the Cheeger number.]
To estimate the Cheeger number of the grid graph, we use a result by Lovasz and
Simonovits. Let U be a subset of nodes of GK . Consider the union of the corresponding
set of grid cubes; this is a measurable subset (actually it is even better behaved; it is a nite
union of convex subsets) of K. Call this set S. Let S be the boundary of S. Then it is
not hard to see that:
2nVoln1 (S),
Voln1 (S) E(S, S)
is the number of grid edges crossing between S and S.
The [LS] result is:
where E(S, S)
Theorem 33
If K is a convex body containing a unit cube, then for any measurable S K,
Voln1 (S)
poly(n)
Therefore, the mixing time is O( 12 log N ) = O(poly(n)).
7.3
Dimension Reduction
Now we describe a central result of high-dimensional geometry (at least when distances are
measured in the 2 norm). Problem: Given n points z 1 , z 2 , ..., z n in n , we would like to
54
nd n points u1 , u2 , ..., un in m where m is of low dimension (compared to n) and the
metric restricted to the points is almost preserved, namely:
z i z j 2 ui uj 2 (1 + )z j z j 2 i, j.
(7.3.1)
1+
m ,
1+
m }.
j 2
k=1
l=1
j 2
u = u u =
i
0 n
m
k=1
22
zl xkl
l=1
2
Let Xk be the random variable ( nl=1 zl xkl )2 . Its expectation is = 1+
m z (can be seen
similarly to the proof of lemma 31). Therefore, the expectation of u2 is (1 + )z2 . If we
show that u2 is concentrated enough around its mean, then it would prove the theorem.
More formally, we state the following Cherno bound lemma:
Lemma 35
There exist constants c1 > 0 and c2 > 0 such that:
1. P r[u2 > (1 + )] < ec1
2m
2m
Therefore there is a constant c such that the probability of a bad case is bounded by:
P r[(u2 > (1 + )) (u2 < (1 ))] < ec
2m
55
n
Now, we have 2 random variables of the type ui uj 2 . Choose = 2 . Using the
union bound, we get that the probability that any of these random variables is not within
(1 2 ) of their expected value is bounded by
n c 2 m
e 4 .
2
c)
, we get that with positive probability, all the variables
So if we choose m > 8(log n+log
2
are close to their expectation within factor (1 2 ). This means that for all i,j:
(1 )(1 + )z i z j 2 ui uj 2 (1 + )(1 + )z i z j 2
2
2
Therefore,
1+
m ,
so = 2 mz2
(7.3.2)
zl2 (xjl )2 )) + t( (
zl zh xkl xkh )))] exp(t(1 + )2 mz2 )
= E[exp(t( (
k
(7.3.3)
l=h
zl zh xkl xkh )))] exp(t(1 + )2 mz2 )
= E[exp(t2 mz2 + t( (
k
l=h
2
zl = z2 . So continuing, we get:
The last step used the fact that (xkl )2 = 2 and
zl zh xkl xkh )))] exp (t2 mz2 )
P E[exp (t( (
(7.3.4)
k
l=h
56
The set of variables {xkl xkh }l=h are pairwise independent. Therefore the above expectation
can be rewritten as a product of expectations:
E[exp(tzl zh xkl xkh )] exp(t2 mz2 )
(7.3.5)
P
k l=h
we notice that
E[exp(tzl zh xkl xkh )] =
1
1
exp(tzl zh 2 ) + exp(tzl zh 2 ) < exp(t2 zl2 zh2 4 )
2
2
(the last inequality is easily obtained by Taylor expanding the exponent function). Plugging
that in (7.3.5), we get:
exp (t2 zl2 zh2 4 ) exp (t2 mz2 )
P <
k l=h
l=h
= exp (mt2
(7.3.6)
l=h
Using simple analysis of quadratic function we see that the last expression obtains its
minimum when
z2
.
t=
22 l=h zl2 zh2
Substituting for t, we get:
z4
)
P < exp ( 2 m
4 l=h zl2 zh2
Finally, the expression
0
(z) =
z4
4 l=h zl2 zh2
(7.3.7)
is bounded below by a constant c1 . To prove this, rst note that (z) = (z) for any = 0.
So it is enough to consider the case z = 1. Then, using Lagrange multipliers technique,
for example, we get that (z) obtains its minimum when zl = 1n for each l = 1..n. Plugging
this in the expression for (z) we see that it is bounded above by a constant c1 that does
not depend on n. This completes the proof. 2
7.4
VC Dimension
57
We continue our study of high dimensional geometry with the concept of VC-dimension,
named after its inventors Vapnik and Chervonenkis. This useful tool allows us to reason
about situations in which the trivial probabilistic method (i.e., the union bound) cannot
be applied. Typical situations are those in which we have to deal with uncountably many
objects, as in the following example.
Example 10 Suppose we pick set of m points randomly from the unit square in 2 . Let
C be any triangle inside the unit square of area 1/4. The expected number of samplepoints
that lie inside it is m/4, and Cherno bounds show that the probability that the this number
is not in m/4[1 ] is exp(2 m).
Now suppose we have a collection of 2 m triangles of area 1/4. The union bound shows
that whp the sample is such that all triangles in our collection have m/4[1] samplepoints.
But what if we want the preceding statement to hold for all (uncountably many!) triangles of area 1/4 in the unit square? The union bound fails badly! Nevertheless the following
statement is true.
Theorem 36
Let > 0 and m be suciently large. Let S be a random sample of m points in the unit
sphere. Then the probability is at least 1 o(1) that every triangle of area 1/4 has between
m/4[1 ] samplepoints.
How can we prove such a result? The intuition is that even though the set of such
triangles is uncountable, the number of vastly dierent triangles is not too large. Thus
there is a small set of typical representatives, and we only need to argue about those.
7.4.1
Definition
VC-dimension is used to formalize the above intuition. First, some preliminary denitions:
Definition 15 A Range Space is a pair (X, R), where X is a set and R is a family of
subsets of X (R 2X ).1
Definition 16 For any A X, we dene the PR (A), the projection of R on A, to be
{r A : r R}.
Definition 17 We say that A is shattered if PR (A) = 2A , i.e., if the projection of R on
A includes all possible subsets of A.
With these denitions, we can nally dene the VC-dimension of a range space:
Definition 18 The VC-dimension of a range space (X, R) is the cardinality of the largest
set A that it shatters: VC-dim = sup{|A| : A is shattered}. It may be innite.
1 X
58
Example 1. Consider the case where X = R2 (the plane) and R is the set of all axisaligned squares. The sets A we consider in this case are points on the plane. Lets consider
dierent values of |A|:
|A| = 1: With a single point on the plane, there are only two subsets A, the empty set
and A itself. Since there for every point there are axis-aligned squares that contain it
and others that dont, A is shattered.
|A| = 2: If we have two points, it is easy to nd axis-aligned squares that cover both
of them, none of them, and each of them separately.
|A| = 3: It is possible to come up with a set of three points such that are shattered;
the vertices of an equilateral triangle, for example: there are axis-aligned squares that
contain none of the points, each point individually, each possible pair of points, and
all 3 points.
|A| = 4: In that case, it is impossible to come up with four points on the plane that
are shattered by the set of axis-parallel squares. There will always be some pair of
points that cannot be covered by a square.
Therefore, the VC-dimension of this range space if 3.
Example 2. Let X = F n (where F is a nite eld) and R be the set of all linear subspaces
of dimension d. (Note that these cannot possibly shatter a set of d + 1 linearly independent
vectors.)
First assume d n/2. We claim that any set A = {v1 , v2 , . . . , vd } of d linearly independent vectors is shattered by X. Take any subset I of the vectors. We can build a basis of a
d-dimension subspace as follows: rst, take the |I| vectors of this set (which are all linearly
independent); then, complete the basis with d |I| linearly independent vectors zj in F n
that do not belong to the (d-dimensional) subspace determined by v1 , v2 , . . . , vd . This basis
determines the subspace
span{{vi }iI , zd+1 , zd+2 , . . . , zd+d|I| },
which is d-dimensional, as desired.
But note that this construction works as long as there are d |I| linearly independent
vectors available to complete the basis. This happens when d n/2. If d > n/2, we can
pick at most n d vectors outside the set. Therefore, the VC-dimension of the range space
analized is min{d, n d}.
We now study some properties of range spaces related to VC-dimension. The next
theorem gives a bound; it is tight (exercise: prove this using the ideas in the previous
example).
Theorem 37
If (X, R)
has VC-dimension
d and A x has n elements then |PR (A)| g(d, n) where
n
g(d, n) = id i .
59
To prove this theorem, all we need is the following result:
Lemma 38
If (X, R) has VC-dimension d and |X| = n then |R| g(d, n).
This lemma clearly implies the theorem above: just apply the lemma with A instead of
X, and look at the range space (A, PR (A)). We now proceed to the proof of the lemma.
Proof: We prove Lemma 38 by induction on d + n. Let S = (X, R), and consider some
element x X. By denition, we have that
R1
3
45
6
S x = (X {x}, {r {x} : r R})
and
R2
3
45
6
S \ x = (X {x}, {x R : x r but r {x} R})
Note that every element of R that does not contain x is contained in R1 ; and every element
of R that does contain x is in R2 (with x removed). Therefore,
|R| = |R1 | + |R2 |.
A well-known property of binomial coecients states that
n
ni
ni
=
+
,
i
i
i1
which implies that
g(d, n) = g(d, n 1) + g(d 1, n 1),
according to our denition of g(d, n).
In the subproblems dened above, |X {x}| = n 1. We claim that there exists an
x such that S \ x has VC-dimension at most d 1. That being the case, the inductive
hypothesis immediately applies:
|R| = |R1 | + |R2 | g(d, n 1) + g(d 1, n 1) = g(d, n).
So once we prove the claim, we will be done. Let A S be shattered in S with |A| = d.
We want to prove that in S \ x no set B X {x} can be shattered if |B| = d. But this is
straighforward: if B is shattered in S \ x, then B {x} is shattered in S. This completes
the proof of Lemma 38. (Incidentally, the bound given by the lemma is tight.) 2
The following result is due to Haussler and Welzl (1987), and also appears in Vapnik
and Chervonenkiss paper.
Theorem 39
Let (X, R) be a range space with VC-dimension d, and let A X have size n. Suppose S
is a random sample of size m drawn (with replacement) from A, and let m be such that
&
#
2 8d
8d
4
log ,
log
.
m max
Then, with probability at least 1 , S is such that for all r R such that |r A| n, we
have |r S| = .
60
Proof: Dene r to be heavy if |r A| n. We will now analyze a single experiment
from two dierent points of view.
1. From the rst point of view, we pick two random samples N and T , each of size
exactly m (note that they both have the same characteristics as the set S mentioned
in the statement of the theorem).
Consider the following events:
E1 : there exists a heavy r such that r N = (this is a bad event; we must
prove that it happens with probability at most );
E2 : there exists a heavy r such that r N = , but |r T | m/2.
We claim that, if E1 happens with high probability, then so does E2 . Specically, we
prove that P r(E2 |E1 ) 1/2. Suppose E1 has happened and that some specic r is
the culprit. Now, pick T . Consider the probability that this r satises |r T | m
2
(this is a lower bound for P r(E2 |E1 )). We know that |r T | is a binomial random
variable; its expectation is m and its variance is (1 )m (and therefore strictly
smaller than m. Using Chebyschevs inequality, we get:
$
m %
m
4
Pr |r T | <
(7.4.1)
=
2
2
(m/2)
m
Because the statement of the theorem assumes that m > 8/, we conclude that indeed
P r(E2 |E1 ) 1/2.
2. Now consider the same experiment from a dierent point of view: we pick a single
sample of size 2m, which is then partitioned randomly between N and T . Consider
the following event:
Er : r N = , but |r T | m/2.
It is clear that
E2 =
Er .
r:heavy
P =
61
There are no more than g(d, 2m) distinct events Er . Applying the union bound, we
see that the probability that at least one of the occurs for a xed N T is at most
g(d, 2m) 2m/2 . But this holds for any choice of N T , so
Pr(E2 ) g(d, 2m) 2m/2 .
(7.4.2)
4
4
The constraints on m dened in the statement of the theorem take care of the second half
of this inequality:
2
m
2
4
.
m log =
7.4.2
Let (X, R) be a range space of VC dimension d. Consider the range space (X, Rh ) where
Rh consists of h-wise intersections of sets in R. We claim that the VC-dimension is at
most 2dh log(dh).
For, suppose A is some subset of R that is shattered
in (X, Rh ), and let k = |A|. Then
m
k
|PRh (A)| = 2 , but we know that it is at most h where m = |PR (A)| g(k, d). Hence
g(k, d)
k
k(d+1)h ,
2
h
whence it follows that k 2dh log(dh).
Exercise: Generalize to range spaces in which the subsets are boolean combinations of
the sets of R.
62
7.4.3
7.4.4
Geometric Algorithms
63
The original nearest neighbor searching problem then reduces to point-location in the
Voronoi diagram. One way to do that is to divide the space into vertical slabs separated
by vertical lines one passing through each vertex of the Voronoi diagram. There are at
most O(n) slabs and each is divided into up to O(n) subregions (from the Voronoi diagram).
These slabs can be maintained in a straightforward way in a structure of size O(n2 ) (more
sophisticated strucutures can reduce this signicantly). Queries can be thought of as two
binary searches (one to nd the correct slab and another within the slab), and require
O(log n) time.
Higher dimensions: Meisers Algorithm The notion of Voronoi diagrams can be
generalized in a straighforward way to arbitrary dimensions. For every two points p and q,
the points that are equidistant from both dene a set Hpq given by:
Hpq = {x : d(x, p) = d(x, q)}.
Considering a d-dimensional space with the 2 norm, we get that
d(x, p) = d(x, q)
d
d
(xi pi )2 =
(xi qi )2
i=1
d
i=1
i=1
64
where g determines the size of the structure for point location within the sample. We dont
compute f, g because that is not the point of this simplied description. (Using some care,
the running time is dO(d) log n and storage is nO(d) .) We now proceed to prove the claim:
Proof: Consider the range space (X, R), where X is the set of all m hyperplanes of
the Voronoi diagram and R contains for each simplex D a subset RD X containing all
hyperplanes that intersect the interior of D. From Section 7.4.2 we know that the VCdimension of this range space is O(d2 log d). Then with high probability the sample must
be such that for every simplex D that is intersected by at least m/2 hyperplanes is intersected by at least one hyperplane in the sample. This means that D cannot be a cell in the
arrangement of the sample. 2
Because the RHS grows faster than the LHS, we only have to worry about the smaller
possible value of m, which is m0 = (8d/) log(8d/) according to the statement of the
theorem:
m0
d log(2m0 )
4
8d
8d
8d
8d
log
d log 2
log
4
16d
8d
8d
d log
log
2d log
2
8d
16d
8d
log
log
log
2
8d
16d
8d
log
8d
8d
2 log
8d
4d
log .
The last inequality is obviously true, so we are done.
Chapter 8
8.1
Introduction
Sanjeev admits that he used to nd fourier transforms intimidating as a student. His fear
vanished once he realized that it is a rewording of the following trivial idea.
of n then every vector v can be expressed as
If u1 , u2 , . . . , un is an orthonormal
basis
2
2
i i ui where i =< v, ui > and
i i = |v|2 .
Whenever the word fourier transform is used one is usually thinking about vectors
as functions. In the above example, one can think of a vector in n as a function from
{0, 1, . . . , n 1} to . Furthermore, the basis {ui } is dened in some nice way. Then the
i s are called fourier coecients and the simple fact mentioned about their 2 norm is
called Parsevals Identity.
The classic fourier transform, using the basis functions cos(2nx) and sin(2nx), is
analogous except we are talking about dierentiable (or at least continuous) functions from
[1, 1] to , and the denition of inner product uses integrals instead of sums (the vectors
are in an innite dimensional space).
Example 11 (Fast fourier transform) The FFT studied in undergrad algorithms uses the
$
%T
(N 1)j
j
for j = 0, 1, . . . , N 1. Here
following orthonormal set uj = 1 N
. . . N
N = e2i/N is the N th root of 1.
Consider a function f : [1, N ] R, i.e. a vector in RN . The Discrete Fourier Transform
(DFT) of f , denoted by f, is dened as
fk =
N
(k1)(x1)
f (x) N
, k [1, N ]
x=1
(8.1.1)
66
f (x) =
N
1 (k1)(x1)
.
fk N
N
(8.1.2)
k=1
$
%T
(N 1)j
j
In other words f is written as a linear combination of the basis vectors uj = 1 N
,
. . . N
for j in [1, N ]. Another way to express this is by the matrix-vector product f = M f , where
1
1
1
1
11
12
N
N
21
22
M = 1
N
N
(N 1)1
(N 1)2
N
1 N
...
1
1(N 1)
...
N
2(N 1)
...
N
...
(N 1)(N 1)
. . . N
(8.1.3)
8.2
N
fi ui ,
(8.2.1)
i=1
and the coecients fi are due to orthonormality equal to the inner product of f and of the
corresponding basis vector, that is fi =< f, ui >.
Definition 19 The length of f is the L2 -norm
7
f 2 (i).
f 2 =
(8.2.2)
(8.2.3)
67
8.2.1
Now we consider real-valued functions whose domain is the boolean hypercube, i.e., f :
{1, 1}n R. The inner product of two functions f and g is dened to be
< f, g >=
1
2n
f (x)g(x).
(8.2.4)
x{1,1}n
Now we describe a particular orthonormal basis. For all the subsets S of {1, . . . , n} we
dene the functions
xi .
(8.2.5)
S (x) =
iS
T
where x = x1 . . . xn
is a vector in {1, 1}n . The S s form an orthonormal basis.
Moreover, they are the eigenvectors of the adjacency matrix of the hypercube.
Example 12 Note that
(x) = 1
and
{i} (x) = xi .
Therefore,
< {i} , >=
1
2n
xi = 0.
x{1,1}n
1
2n
1
2n
x{1,1}n
S (x)S (x) =
1
2n
x{1,1}n
iS
xi
xj
jS
x{1,1}n iSS
68
Since the S s form a basis every f can be written as
f=
fS S ,
(8.2.6)
S{1,...,n}
fS =< f, S >= n
2
x:S (x)=1
f (x)
%
f (x) .
(8.2.7)
x:S (x)=1
Remark 11 The basis functions are just the eigenvectors of the adjacency matrix of the
boolean hypercube, which you calculated in an earlier homework. This observation and
its extension to Cayley graphs of groups forms the basis of analysis of random walks on
Cayley graphs using fourier transforms.
8.3
In this section we describe two applications of the version of the Discrete Fourier Transform that we presented in section 8.2.1. The Fourier Transform is useful for understanding
the properties of functions. The applications that we discuss here are: (i) PCPs and (ii)
constructing small-bias probability spaces with a low number of random bits.
8.3.1
We consider assignments of a boolean formula. These can be encoded in a way that enables
them to be probabilistically checked by examining only a constant number of bits. If the
encoded assignment satises the formula then the verier accepts with probability 1. If no
satisfying assignment exists then the verier rejects with high probability.
Definition
20 The function f : GF (2)n GF (2) is linear if there exist a1 , . . . , an such
that f (x) = ni=1 ai xi for all x GF (2)n .
Definition 21 The function g : GF (2)n GF (2) is -close if there exists a linear function f such that PrxGF (2)n [f (x) = g(x)] 1 .
We think of g as an adversarily constructed function. The adversary attempts to deceive
the verier to accept g.
The functions that we considered are dened in GF (2) but we can change the domain
to be an appropriate group for the DFT of section 8.2.1.
69
GF (2)
0+0 =0
0+1 =1
1+1 =0
{1, 1} R
11=1
1 (1) = 1
(1) (1) = 1
"
Therefore if we use the mapping 0 1 and 1 1 we have that iS xi iS xi .
This implies that the linear functions become the Fourier Functions S . The verier uses
the following test in order to determine if g is linear.
Linear Test
Pick x, y R GF (2)n
Accept i g(x) + g(y) = g(x + y)
It is clear that a linear function is indeed accepted with probability one. So we only
need to consider what happens if g is accepted with probability 1 .
Theorem 40
If g is accepted with probability 1 then g is -close.
Proof: We change our view from GF (2) to {1, 1}. Notice now that
x + y = x1 y1 x2 y2 . . . xn yn
and that if is the fraction of the points in {1, 1}n where g and S agree then by denition
gS =< g, S >= (1 ) = 2 1.
Now we need to show that if the test accepts g with probability 1 it implies that
there exists a subset S such that gS 1 2. The test accepts when for the given choice of
x and y we have over GF (2) that g(x) + g(y) = g(x + y). So the equivalent test in {1, 1}
is g(x) g(y) g(x + y) = 1 because both sides must have the same sign.
Therefore if the test accepts with probability 1 then
= Ex,y [g(x) g(y) g(x + y)] = (1 ) = 1 2.
We replace g in (8.3.1) with its Fourier Transform and get
% $
% $
$
gS1 S1 (x)
gS2 S2 (y)
= Ex,y
S1 {1,...,n}
= Ex,y
S2 {1,...,n}
S3 {1,...,n}
gS1 gS2 gS3 Ex,y S1 (x) S2 (y) S3 (x + y)
gS1 gS2 gS3 Ex,y S1 (x) S3 (x) S2 (y) S3 (y) .
(8.3.1)
%
gS3 S3 (x + y)
70
However
Ex S1 (x)S3 (x) =< S1 , S3 >
S{1,...,n}
8.3.2
S2
Sg
S{1,...,n}
Long Code
Suppose that g(x) = xW for all x. Then g(x) + g(y) = xW + yW , while g(x + y + z) =
(x + y + z)W = xW + yW + xW . So the probability of acceptance in this case is 1 .
Theorem 41 (Hastad)
If the Long Word Test accepts with probability 1/2 + then
S
fS3 (1 2)|S| 2.
71
The proof is similar to that of the Theorem 40 and is left as an exercise. The interpretation of Theorem 41 is that if f passes the test with must depend on a few coordinates.
This is not the same as being close to a coordinate function but it suces for the PCP
application.
Remark 13 Hastads analysis belongs to a eld called inuence of boolean variables, that
has been used recently in demonstration of sharp thresholds for many random graph properties (Friedgut and others) and in learning theory.
8.3.3
We consider n random variables x1 , . . . , xn , where each xi is in {0, 1} and their joint probability distribution is D. Let U be the uniform distribution. The distance of D from the
uniform distribution is
D U =
|U () D()|.
{0,1}n
(8.3.2)
iS
Here we are concerned with the construction of (small) -bias probability spaces that
are k-wise independent, where k is a function of . Such spaces have several applications.
For example they can be used to reduce the amount of randomness that is necessary in
several randomized algorithms and for derandomizations.
Definition 25 The variables x1 , . . . , xn are -biased if for all subsets S {1, . . . , n},
biasD (S) . They are k-wise -biased if for all subsets S {1, . . . , n} such that |S| k,
biasD (S) .
Definition 26 The variables x1 , . . . , xn are k-wise -dependent if for all subsets S
{1, . . . , n} such that |S| k, U (S) D(S) . (D(S) and U (S) are the distributions D
and U restricted to S.)
Remark 14 If = 0 this is just the notion of k-wise independence studied briey in an
earlier lecture.
The following two conditions are equivalent:
1. All xi are independent and Pr[xi = 0] = Pr[xi = 1].
2. For every subset S {1, . . . , n} we have Pr[ iS xi = 0] = Pr[ iS xi = 1], that is
the parity of the subset is equally likely to be 0 or 1.
72
Here we try to reduce the size of the sample space by relaxing the second condition,
that is we consider biased spaces.
We will use the following theorem.
Theorem 42 (Diaconis and Shahshahani 88)
D U 2n
2
D
S
%1/2
=
S{1,...,n}
biasD (S)2
%1/2
.
(8.3.3)
S{1,...,n}
Proof: We write the Fourier expansion of D as D = S D
S S . Then,
S S .
D
D=U+
(8.3.4)
S=
S=
S=
S=
(8.3.5)
(8.3.6)
Thus
D U 2n
2
D
S
%1/2
=
S{1,...,n}
biasD (S)2
%1/2
.
(8.3.7)
S{1,...,n}
73
where
S (x) =
xi ( mod 2).
iS
With probability at least 1 not all ri s are such that for all S, S (ri ) = 0. But if there exists a vector rj in the sample with S (rj ) = 1 then we can argue that PrD [S (x1 , . . . , xn ) =
1] = PrD [S (x1 , . . . , xn ) = 0]. Consequently biasD (S) . 2
Construction of the family F
We construct log n families F1 , . . . , Flog n . For a vector v {0, 1}n let |v| be the number
of ones in v. For a vector r chosen uniformly at random from Fi and for each vector v in
{0, 1}n with k |v| 2k 1 we require that
Prr [< r, v >= 1] .
Assume that we have random variables that are uniform on {1, . . . , n} and c-wise independent (using only O(c log n) random bits - see Lecture 4). We dene three random
vectors u, v and r such that
1. The ui s are c-wise independent and Pr[ui = 0] = Pr[ui = 1] = 1/2.
2. The wi s are pairwise independent and Pr[wi = 1] = 2/k.
3. The elements of r are
#
ri =
1 if ui = 1 and wi = 1
0 otherwise
74
We claim that Pr[< r, v >= 1] 1/4 for all v with k |v| 2k 1. In order to prove
this we dene the vector v with coordinates:
#
1 if vi = 1 and wi = 1
vi =
0 otherwise
By the above denition we have < v , u >=< r, v >. We wish to show that
1
Pr[1 |v | c] .
2
(8.3.8)
Because the ui s are c-wise independent (8.3.8) sucies to guarantee that Pr[< v , u >=
1] 1/4. We view |v | as a binomial random variable h with probability p = 2/k for
each non-zero entry. Therefore, E[h] = pl and Var[h] = p(1 p)l, where l [k, 2k 1] is
the number of potential non-zero entries. By Chebyshevs inequality at the endpoints of
[k, 2k 1] we have
Pr[|h pk| 2]
1
pk(1 p)
4
2
and
Pr[|h 2pk| 3]
2pk(1 p)
4
.
9
9
log
n
ai ri , where ri R Fi .
i=1
Each family Fi corresponds to the case that |v| [2i1 , 2i 1]. Now for any v we can show
that Pr[< v, r >= 1] 18 (so = 1/8).
Chapter 9
9.1
Introduction
In this lecture (and the next one) we will look at methods of nding approximate solutions
to NP-hard optimization problems. We concentrate on algorithms derived as follows: the
problem is formulated as an Integer Linear Programming (or integer semi-denite programming, in the next lecture) problem; then it is relaxed to an ordinary Linear Programming
problem (or semi-denite programming problem, respectively) and solved. This solution
may be fractional and thus not correspond to a solution to the original problem. After
somehow modifying it (rounding) we obtain a solution. We then show that the solution
thus obtained is a approximately optimum.
The rst main example is the Leighton-Rao approximation for Sparsest Cut in a graph.
Then we switch gears to look at Lift and Project methods for designing LP relaxations.
These yield a hierarchy of relaxations, with potentially a trade-o between running time
and approximation ratio. Understanding this trade-o remains an important open probem.
9.2
Vertex Cover
75
76
An Integer Linear Programming (ILP) formulation for this problem is as follows:
xi {0, 1} for each i V
xi + xj 1 for each {i, j} E
Minimize
xi
iV
9.3
Sparsest Cut
Our next example is Graph Expansion, or Sparsest-Cut problem. Given a graph G = (V, E)
dene
E(S, S)
E(S, S)
=
min
G = min
|V |
SV min{|S|, |S|}
|S|
SV :|S|
2
where E(S, S)
nd a cut S which realizes G .
Compare this with the Edge Expansion problem we encountered in an earlier lecture: the Cheegar
constant was defined as
E(S, S)
hG = min
SV min{Vol(S), Vol(S)}
where Vol(S) =
vS
deg(v). Note that for a d-regular graph, we have Vol(S) = d|S| and hence hG = G /d.
77
LP formulation
Leighton and Rao (1988) showed how to approximate G within a factor of O(log n) using
an LP relaxation method. The ILP formulation is in terms of the cut-metric induced by a
cut.
is dened as follows: S (i, j) =
Definition 27 The cut-metric on V , induced by a cut (S, S),
|V |
2
E(S, S)
|S||S|
xij = xji
xij + xjk xik
xij {0, 1/t}
xij
Minimize
(integrality constraint)
{i,j}E
Note that this exactly solves the sparsest cut problem if we could take t corresponding to
the optimum cut. This is because, (a) the cut-metric induced by the optimum cut gives a
solution to the above ILP with value equal to G , and (b) a solution to the above ILP gives
a cut S dened by an equivalence class of points with pair-wise distance (values of xij )
S)
{i,j}E xij .
equal to 0, so that E(S,
t
In going to the LP relaxation we substitute the integrality constraint by 0 xij 1.
This in fact is a relaxation for all possible values of t, and therefore lets us solve a single
LP without knowing the correct value of t.
Rounding the LP
Once we formulate the LP as above it can be solved to obtain {xij } which dene a metric
on the nodes of the graph. But to get a valid cut from this, this needs to be rounded o
to a cut-metric- i.e., an integral solution of the form xij {0, } has to be obtained (which,
78
as we noted above, can be converted to a cut of cost no more than the cost it gives for the
LP). The rounding proceeds in two steps. First the graph is embedded into the 1 -norm
2
metric-space RO(log n) . This entails some distortion in the metric, and can increase the
cost. But then, we can go from this metric to a cut-metric with out any increase in the
cost. Below we elaborate the two steps.
2
(9.3.1)
Lemma 46
For any n points z1 , . . . , zn Rm , there exist cuts S1 , . . . , SN [n] and 1 , . . . , N 0
where N = m(n 1) such that for all i, j [n],
k Sk (i, j)
(9.3.2)
||zi zj ||1 =
k
Proof: We consider each of the m co-ordinates separately, and for each co-ordinate give
n 1 cuts Sk and corresponding k . Consider the rst co-ordinate for which we produce
(1)
R be the rst co-ordinate of zi . w.l.o.g assume that the
cuts {Sk }n1
k=1 as follows: let i
(1)
(1)
(1)
n points are sorted by their rst co-ordinate: 1 2 . . . n . Let Sk = {z1 , . . . , zk }
(1)
(1)
(1)
(1)
and k = k+1 k . Then n1
k=1 k Sk (i, j) = |i j |.
Similarly dening Sk , k for each co-ordinate we get,
m(n1)
k=1
k Sk (i, j) =
m
()
|i
()
j | = ||zi zj ||1
=1
k , Sk )
.
The cut-metric we choose is given by S = argminSk E(S
|Sk ||Sk |
Now we shall argue that S provides an O(log n) approximate solution to the sparsest
cut problem. Let optf be the solution of the LP relaxation.
Theorem 47
optf G c log noptf .
79
Proof: As outlined above, the proof proceeds by constructing a cut from the metric
2
obtained as the LP solution, by rst embedding it in RO(log n) and then expressing it
as a positive combination of cut-metrics. Let xij denote the LP solution. Then optf =
xij
xij .
{i,j}E
i<j
Applying the two inequalities in Equation (9.3.1) to the metric (i, j) = xij ,
c log n {i,j}E xij
{i,j}E ||zi zj ||1
i<j xij
i<j ||zi zj ||1
Now by Equation (9.3.2),
{i,j}E ||zi zj ||1
{i,j}E
k k Sk (i, j)
k k
{i,j}E Sk (i, j)
=
=
i<j ||zi zj ||1
i<j
k k Sk (i, j)
k k
i<j Sk (i, j)
k E(Sk , Sk )
E(Sk , Sk )
min
= k
k
|Sk ||Sk |
k k |Sk ||Sk |
E(S , S )
=
|S ||S |
ak
k bk
mink
ak
bk .
Corollary 48
n
n
2 optf G c log n 2 optf .
The O(log n) integrality gap is tight, as it occurs when the graph is a constant degree
(say 3-regular) expander. (An exercise; needs a bit of work.)
9.4
In both examples we saw thus far, the integrality gap proved was tight. Can we design
tighter relaxations for these problems? Researchers have looked at this question in great
detail. Next, we consider an more abstract view of the process of writing better relaxation.
The feasible region for the ILP problem is a polytope, namely, the convex hull of the
integer solutions 2 . We will call this the integer polytope. The set of feasible solutions of
the relaxation is also a polytope, which contains the integer polytope. We call this the
relaxed polytope; in the above examples it was of polynomial size but note that it would
suce (thanks to the Ellipsoid algorithm) to just have an implicit description of it using
a polynomial-time separation oracle. On the other hand, if P = NP the integer polytope
has no such description. The name of the game here is to design a relaxed polytope that
2
By integer solutions, we refer to solutions with co-ordinates 0 or 1. We assume that the integrality
constraints in the ILP correspond to restricting the solution to such points.
80
as close to the integer polytope as possible. Lift-and-project methods give, starting from
any relaxation of our choice, a hierarchy of relaxations where the nal relaxation gives the
integer polytope. Of course, solving this nal relaxation takes exponential time. In-between
relaxations may take somewhere between polynomial and exponential time to solve, and it
is an open question in most cases to determine their integrality gap.
Basic idea
The main idea in the Lift and Project methods is to try to simulate non-linear programming
using linear programming. Recall that nonlinear constraints are very powerful: to restrict
a variable x to be in {0, 1} we simply add the quadratic constraint x(1 x) = 0. Of course,
this means nonlinear programming is NP-hard in general. In lift-and-project methods we
introduce auxiliary variables for the nonlinear terms.
Example 13 Here is a quadratic program for the vertex cover problem.
0 xi 1 for each i V
(1 xi )(1 xj ) = 0 for each {i, j} E
To simulate this using an LP, we introduce extra variables yij 0, with the intention
that yij represents the product xi xj . This is the lift step, in which we lift the problem to
a higher dimensional space. To get a solution for the original problem from a solution for
the new problem, we simply project it onto the variables xi . Note that this a relaxation of
the original integer linear program, in the sense that any solution of that program will still
be retained as a solution after the lift and project steps. Since we have no way of ensuring
yij = xi xj in every solution of the lifted problem, we still may end up with a relaxed
polytope. But note that this relaxation can be no worse than the original LP relaxation (in
which we simply dropped the integrality constraints), because 1 xi xj + yij = 0, yij 0
xi + xj 1, and any point in the new relaxation is present in the original one.
(If we insisted that the matrix (yij ) formed a positive semi-denite matrix, it would still
be a (tighter) relaxation, and we get a Semi-denite Programming problem. We shall see
this in the next lecture.)
9.5
Now we describe the method formally. Suppose we are given a polytope P Rn (via a
separation oracle) and we are trying to get a representation for P0 P dened as the
convex hull of P {0, 1}n . We proceed as follows:
The rst step is homogenization. We change the polytope P into a cone K in Rn+1 . (A
cone is a set of points that is closed under scaling: if x is in the set, so is cx for all c 0.) If a
point (1 , . . . , n ) P then (1, 1 , . . . , n ) K. In terms of the linear constraints dening
P this amounts to multiplying the constant
by a new variable x0
n term in the constraints
and thus making it homogeneous: i.e., i=1 ai xi b is replaced by ni=1 ai xi bx0 . Let
K0 K be the cone generated by the points K{0, 1}n+1 (x0 = 0 gives the origin; otherwise
x0 = 1 and we have the points which dene the polytope P
0 ) For r = 1, 2, . . . , n we shall
r
V
(r)
n+1
, where Vn+1 (r) = ri=0 n+1
dene SA (K) to be a cone in R
i . Each co-ordinate
corresponds to a variable ys for s [n + 1], |s| r. The intention is that the variable ys
81
"
r|s|
stands for the homogenuous term ( is xi ) x0 . Let y(r) denote the vector of all the
Vn+1 (r) variables.
Definition 28 Cone SAr (K) is dened as follows:
SA1 (K) = K, with y{i} = xi and y = x0 .
SAr (K) The constraints dening SAr (K) are obtained from the constraints dening
SAr1 (K): for each constraint ay(r1) 0, for each i [n], form the following two
constraints:
(1 xi ) ay(r1) 0
xi ay(r1) 0
where the operator distributes over the sum ay(r1) =
is a shorthand for ys{i} .
s[n]:|s|r as ys
and xi ys
K {0, 1}n+1 . Then we note that the cone SAr (K) contains
Suppose (1, x1 , . . . , xn ) "
the points dened by ys = is xi . This is true for r = 1 and is maintained inductively by
the constraints we form for each r. Note that if i s then xi ys = ys , but we also have
x2i = xi for xi {0, 1}.
To get a relaxation of K we need to come back to n + 1 dimensions. Next we do this:
Definition 29 S r (K) is the cone obtained by projecting the cone SAr (K) to n + 1 dimensions as follows: a point u SAr (K) will be projected to u|s:|s|1; the variable u is mapped
to x0 and for each i [n] u{i} to xi .
Example 14 This example illustrates the above procedure for Vertex Cover, and shows
how it can be thought of as a simulation of non-linear programming. The constraints for
S 1 (K) come from the linear programming constraints:
j V, 0 y{j} y
{i, j} E, y{i} + y{j} y 0
Among the various constraints for SA2 (K) formed from the above constraints for SA1 (K),
we have (1 xi ) (y y{j} ) 0 and (1 xi ) (y{i} + y{j} y ) 0 The rst one expands
to y y{i} y{j} + y{i,j} 0 and the second one becomes to y{j} y{i,j} y y{i} 0,
together enforcing y y{i} y{j} + y{i,j} = 0. This is simulating the quadratic constraint
1 xi xj + xi xj = 0 or the more familiar (1 xi )(1 xj ) = 0 for each {i, j} E. It can be
shown that the dening
constraints for the cone S 2 (K) are the odd-cyle constraints: for
an odd cycle C, xi C (|C| + 1)/2. An exact characterization of S ( (K)r) for r > 2 is
open.
Intuitively, as we increase r we get tighter relaxations of K0 , as we are adding more
and more valid constraints. Let us prove this formally. First, we note the following
characterization of SAr (K).
Lemma 49
u SAr (K) i i [n] we have vi , wi SAr1 (K), where s [n], |s| r 1, vi = us{i}
and wi = us us{i} .
82
Proof: To establish the lemma, we make our notation used in dening SAr (K) from
SAr1 (K) explicit. Suppose SAr1 (K) has a constraint ay(r1) 0. Recall that from this,
for each i [n] we form two constraints for SAr (K), say a y(r) 0 and a y(r) 0, where
a y(r) xi ay(r1) is given by
#
0
if i s
(9.5.1)
as =
as + as\{i} if i s
and a y(r) (1 xi ) ay(r1) by
as
#
=
if i s
as
as\{i} if i s
(9.5.2)
as + as\{i} us
s i
as vsi +
s i
as vsi = avi
s i
i
where we used the fact that for s i, us = vsi = vs\{i}
. Similarly, noting that for s i,
i
i
ws = 0 and for s i, ws = us us\{i} , we have
a u =
as\{i} us +
s i
as us
s i
=
as us{i} + as us = awi
s i
Therefore, u satises the constraints of SAr (K) i for each i [n] vi and wi satisfy the
constraints of SAr1 (K). 2
Now we are ready to show that S r (K), 1 r n form a hierarchy of relaxations of K0 .
Theorem 50
K0 S r (K) S r1 (K) K for every r.
Proof: We have already seen that each"integer solution x in K0 gives rise to a corresponding
point in SAr (K): we just take y() s = is xi . Projecting to a point in S r (K), we just get
x back. Thus K0 S r (K) for each r.
So it is left to only show that S r (K) S r1 (K), because S 1 (K) = K.
Suppose x S r (K) is obtained as u|s:|s|1, u SAr (K). Let vi , wi SAr1 (K) be
two vectors (for some i [n]) as given by Lemma 49. Since SAr1 (K) is a convex cone,
vi + wi SAr1 (K). Note that for each s [n], |s| r 1, wsi + vsi = us . In particular
this holds for s, |s| 1. So x = (vi + wi )|s:|s|1 S r1 (K). 2
Theorem 51
S n (K) = K0
83
Proof: Recall from the proof of Theorem 50 that if x S r (K), then there are two points
vi , wi SAr1 (K) such that x = vi |s:|s|1 + wi |s:|s|1. It follows from the denition of
i
i
and w{i}
= 0. So vi |s:|s|1 S r1 (K)|y{i} =y and wi |s:|s|1
vi and wi that vi = v{i}
S r1 (K)|y{i} =0 . Hence, S r (K) S r1 (K)|y{i} =y + S r1 (K)|y{i} =0 . Further, this holds for
all i [n]. Thus,
%
$
S r1 (K)|y{i} =y + S r1 (K)|y{i} =0
S r (K)
i[n]
S r (K)
{i1 ,...,ir }[n]
T {0,y
K|(y{i
}r
1}
,...,y{ir } )=T
T {0,y }n
Chapter 10
Semidefinite Programming
scribe:Edith Elkind
10.1
Introduction
10.2
Basic definitions
85
Proof: From (ii), it follows that if M is positive semidenite, then so is M for any > 0.
From (iii), it follows that if M1 , M2 are positive semidenite, then so is M1 +M2 . Obviously,
symmetry is preserved, too. 2
Definition 31 A semidenite program is formulated as follows:
minC Y
A1 Y b1 ,
...
Am Y bm ,
Y 0
where Y is an
n n-matrix of variables, C, A1 , . . . , Am Rnn , b1 , . . . , bm R and X Y is
interpreted as ni,j=1 Xij Yij .
Such a convex program can be solved to an arbitrary precision (modulo some technical
requirements we do not go into) using the Ellipsoid method. As mentioned earlier this
method can solve a linear program to an arbitrary degree of precision in polynomial time
even if the number of constraints is non-polynomial (or innite), so long as there is a
polynomial-time separation oracle for the set of constraints. In case of the above convex
program, we can replace the Y 0 by the innite family of linear constraints xT Y x 0
for every x n . Furthermore, given any Y that is not psd, there is a simple algorithm
(derived from the Cholesky decomposition) that gives an x n such that xT Y x < 0; this
is the separation oracle.
10.3
Last time, we wrote a linear program for Graph Expansion. Given a graph G = (V, E), for
every i, j V , we introduce a variable xij and consider the following LP:
{i,j}E xij
min
i<j xij
xij + xjk xik for all i, j, k (triangle inequality)
xij 0
xij = xji
In an eort to tighten the relaxation, we require the following constraint in addition:
v1 , v2 , . . . , vn n such that xij = (vi vj ) (vi vj ).
We show that this constraint is not an unfair one, by shoing that the optimum integer
solution (and, more generally, any cut metric) satises it: simply pick an arbitrary vector
so the optimum integer solution is still
u and set vi = u if vi S and vi = u if vi S,
feasible.
86
How can we optimize under this constraint using SDP? Consider the matrix M where
x
Mii = 1 and Mij = 1 2ij if i = j. The constraint amounts to saying this matrix to be psd
(see Theorem 30 part (iv)).
We leave it as an exercise to show that this SDP provides a lower bound that is at
least as good as the eigenvalue bound for d-regular graphs and no worse than LeightonRao
bound for the general case. Thus this approach combines the best of both worlds.
It is conjectured that the linearity gap for this program is O(1) (it is known to be
O(log n)). This would follow from another conjecture (by Goemans and Linial), namely,
that any metric d(i, j) that can be represented as vi vj 2 for some {vi }ni=1 , can be
embedded into l1 with O(1) distortion.
10.4
Let us present the celebrated 0.878 approximation algorithm for Max Cut by Goemans and
Williamson (1993). Here the goal is to partition the vertices of a graph G = (V, E) into two
(10.4.1)
10.5
This part is based on a FOCS 2001 paper by Frank McSherry, which considers the task
of nding a solution to an NP-hard graph problem when the input graph is drawn from a
distribution over acceptable graphs.
For example, suppose that a graph on n vertices is constructed as follows: include each
edge with probability p; then pick k vertices out of n (uniformly over all sets of size k), and
complete the clique on these vertices.
87
Given a graph constructed according to this procedure, can we nd the planted clique
with high probability (over the coin tosses of the generating algorithm as well as our own
random choices)?
More generally, consider the following setup: for a graph on n vertices, there is a mapping
: [n] [k] that partitions the vertices into k classes, and a matrix Pkk , 0 Pij 1. To
on a vertex set V , |V | = n, include an edge between vi and
generate a random instance G
can
vj with probability P(i)(j) . We would like to construct an algorithm that given G
reconstruct with high probability. (Usually, after that it is possible to (approximately)
reconstruct P ).
Note that for the clique problem described above, k = 2, P00 = 1, P01 = P10 = P11 = p
and the planted clique can be described as 1 (0).
by M
. Note that M = EM
gives us the corresponding
Denote the adjacency matrix of G
edge probabilities. Hence, for any two vertices that are in the same class with respect to
, the corresponding columns are identical, so the rank of M is at most k. Obviously, if
is obtained from M by
we knew M , we could nd by grouping the columns; since M
randomized rounding, let us try to use clustering.
M . Suppose that P projects the space spanned by
Let E be the error matrix, E = M
the columns of M onto k dimensions. If both P (E) and P (M ) M are small, then by
) M . Now, our goal is to nd a suitable P .
triangle inequality, P (M
Since M is a low-rank symmetric matrix, we can use singular value decomposition
(SVD).
It is known that an m n-matrix A of rank r can be represented as U V T , where U is
an m r orthogonal matrix, = diag(1 , . . . , r ), and V is an n r orthogonal matrix.
This decomposition is unique up to a permutation of i s (and respective permutations
of rows and columns of U and V ), so we can assume 1 . . . r ; for a symmetric A, i s
are the eigenvalues.
By truncating U to k columns, to k values, and V T to k rows, we obtain a matrix
Ak , which is known to be the best rank k approximation of A in the spectral norm. The
columns of this matrix are projections of the corresponding columns of A onto the rst k
eigenvectors of A. In what follows, we denote the projection X Xk by PXk .
Instead of applying SVD directly, we will use a minor modication of this approach.
Namely, consider the following algorithm Partition( ):
Randomly divide [n] into two parts; let us say that this division splits the columns of
as [A | B].
G
Compute P1 = PBk , P2 = PAk .
2 (B)].
work for B and vice versa. By splitting the matrix into two parts we avoid some tricky
interdependency issues.
Suppose that for all u we have
|P1 (Au ) Au | 1 and |P1 (Au Au )| 2
|P2 (Bu ) Bu | 1 and |P2 (Bu Bu )| 2
and |Gu Gv | 4(1 + 2 ) whenever (u) = (v). Then our algorithm nds correctly
if = 2(1 + 2 ). To see that, suppose that u and v ended up in the same cluster. As
|Gu Hu | 1 + 2 , |Gv Hv | 1 + 2 , and |Hu Hv | < 2(1 + 2 ), it must be that
Gu = Gv . Conversely, if Gu = Gv , Hu and Hv cannot be too far apart.
To make use of this algorithm, we need a bound on 1 and 2 .
Theorem 55
With probability at least 1 ,
|PA (Bu ) Bu | 8 nk/su
|P (Bu Bu )| 2k log(n/),
A
where is the largest deviation of an entry in A and su is the size of the part containing u,
i.e., su = |{v | (v) = (u)}|.
For the proof, the reader is referred to the paper.
88
Homework 1
Out: September 16
Due: September 23
You can collaborate with your classmates, but be sure to list your collaborators with your
answer. If you get help from a published source (book, paper etc.), cite that. Also, limit
your answers to one page or less you just need to give enough detail to convince me. If
you suspect a problem is open, just say so and give reasons for your suspicion.
1 For every integer k 1 show that any graph with more than kk nodes has either a
clique or an independent set of size at least k. (Moral: A graph cannot be completely
disordered: there is always some local order in it.) Can you prove the same statement
for graphs with fewer than kk nodes?
2 A bipartite graph G = (V1 , V2 , E) with |V1 | = |V2 | is said to be an (, ) expander if
every subset S V1 of size at most |V1 | has at least |S| neighbors in V2 . Show
that for every , > 0 satisfying < 1 there is an integer d such that a d-regular
expander exists for every large enough |V1 |.
3 A graph is said to be nontransitive if there is no triple of vertices i, j, k such that all
of {i, j} , {j, k} , {k, i} are edges. Show that in a nontransitive d-regular graph, there
is an independent set of size at least n log d/8d. (Hint: Suppose we try to pick an
independent set S randomly from all independent sets in the graph. For any vertex
v, suppose you have picked the portion of S except for v and the neighbors of v. How
would you pick the rest of S?)
89
90
princeton university F02
Homework 2
Out: October 7
Due: October 21
You can collaborate with your classmates, but be sure to list your collaborators with your
answer. If you get help from a published source (book, paper etc.), cite that. Also, limit
your answers to one page or less you just need to give enough detail to convince me. If
you suspect a problem is open, just say so and give reasons for your suspicion.
1 Let $\mathcal{A}$ be a class of deterministic algorithms for a problem and $I$ be the set of possible inputs. Both sets are finite. Let $\mathrm{cost}(A, x)$ denote the cost of running algorithm $A$ on input $x$. The distributional complexity of the problem is
$$\max_{\mathcal{D}} \min_{A \in \mathcal{A}} E_{x \in_{\mathcal{D}} I}[\mathrm{cost}(A, x)],$$
where $\mathcal{D}$ ranges over probability distributions on $I$, and its randomized complexity is
$$\min_{P} \max_{x \in I} E_{A \in_{P} \mathcal{A}}[\mathrm{cost}(A, x)],$$
where $P$ is a probability distribution on $\mathcal{A}$. Prove Yao's lemma, which says that the two complexities are equal.
Does this result hold if the class of algorithms and inputs is infinite?
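For finite $\mathcal{A}$ and $I$ both quantities are values of a finite zero-sum game and can be computed by linear programming, so the claimed equality can at least be checked numerically on small random cost matrices. The sketch below (function names and the random matrix are ours, purely illustrative; it assumes numpy and scipy are available) computes both quantities this way; it is a sanity check, of course, not a proof.

    import numpy as np
    from scipy.optimize import linprog

    def randomized_complexity(C):
        # min over distributions P on algorithms of max_x E_{A ~ P}[cost(A, x)]
        m, n = C.shape                        # m algorithms, n inputs
        c = np.r_[np.zeros(m), 1.0]           # variables (p_1..p_m, v); minimize v
        A_ub = np.c_[C.T, -np.ones(n)]        # for each input x: sum_a p_a C[a, x] - v <= 0
        A_eq = np.r_[np.ones(m), 0.0].reshape(1, -1)
        res = linprog(c, A_ub=A_ub, b_ub=np.zeros(n), A_eq=A_eq, b_eq=[1.0],
                      bounds=[(0, None)] * m + [(None, None)])
        return res.x[-1]

    def distributional_complexity(C):
        # max over distributions D on inputs of min_A E_{x ~ D}[cost(A, x)]
        m, n = C.shape
        c = np.r_[np.zeros(n), -1.0]          # variables (D_1..D_n, w); maximize w
        A_ub = np.c_[-C, np.ones(m)]          # for each algorithm A: w - sum_x D_x C[A, x] <= 0
        A_eq = np.r_[np.ones(n), 0.0].reshape(1, -1)
        res = linprog(c, A_ub=A_ub, b_ub=np.zeros(m), A_eq=A_eq, b_eq=[1.0],
                      bounds=[(0, None)] * n + [(None, None)])
        return res.x[-1]

    C = np.random.rand(5, 7)                  # cost(A, x) for 5 algorithms and 7 inputs
    print(randomized_complexity(C), distributional_complexity(C))  # agree up to LP tolerance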
2 (The rent-or-buy problem) Your job requires you to take long trips by car. You can
either buy a car ($ 5000) or rent one whenever you need it ($ 200 each time).
If i is the number of trips you end up making, clearly, it makes sense to buy if i > 25.
(Let us ignore emotional factors, as well as the fact that a used car is still worth
something after a few years.) Furthermore the optimum expenditure as a function of $i$ is $C(i) = \min\{5000, 200i\}$. (A small numerical illustration of this cost model appears right after this problem.)
However, you do not know i ahead of time: in fact, you don't learn about each trip until the day before, at which time you have to rent or buy. (Such problems are studied in a field called online algorithms.)
(a) Show that if you make the decision to buy or rent at each step using a deterministic algorithm then there is a strategy whose cost is at most 2C(i) for any
i. Show that no deterministic algorithm can do much better for some i. (Alas,
life is suboptimal??)
(b) Does your answer change if you can use a randomized strategy, and try to
minimize your expected cost? (Hint: Lowerbounds can be proved using Yao's
Lemma.)
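To get a feel for the cost model above, the following tiny script (illustrative only; the thresholds tried are arbitrary) compares the optimum $C(i) = \min\{5000, 200i\}$ with the worst-case cost of a generic deterministic strategy of the form 'rent for the first t trips, then buy on the next trip'.

    BUY, RENT = 5000, 200

    def opt(i):
        # optimum expenditure C(i) = min{5000, 200 i}
        return min(BUY, RENT * i)

    def threshold_cost(i, t):
        # rent for the first t trips; buy on trip t + 1 if it ever happens
        return RENT * i if i <= t else RENT * t + BUY

    for t in (5, 25, 100):
        worst_ratio = max(threshold_cost(i, t) / opt(i) for i in range(1, 500))
        print("t =", t, "  worst-case cost / C(i) =", round(worst_ratio, 2))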
3 Prove the Schwartz-Zippel Lemma: If $g(x_1, x_2, \ldots, x_m)$ is any nonzero polynomial of total degree $d$ and $S \subseteq F$ is any subset of field elements, then the fraction of $(a_1, a_2, \ldots, a_m) \in S^m$ for which $g(a_1, a_2, \ldots, a_m) = 0$ is at most $d/|S|$.
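As a quick numerical illustration of this statement (not, of course, a proof), one can fix a small polynomial, sample points from $S^m$, and compare the observed fraction of zeros with the bound $d/|S|$; the polynomial and the parameters below are arbitrary choices.

    import random

    p = 101                                  # work over the field GF(101)
    S = list(range(20))                      # S is a subset of the field, |S| = 20

    def g(x, y, z):
        # an arbitrary nonzero polynomial of total degree d = 3
        return (x * y * z + 2 * x * y + 5 * z + 7) % p

    trials = 200000
    zeros = sum(1 for _ in range(trials)
                if g(random.choice(S), random.choice(S), random.choice(S)) == 0)
    print(zeros / trials, "vs the bound d/|S| =", 3 / len(S))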
4 (Sudan's list decoding) Let $(a_1, b_1), (a_2, b_2), \ldots, (a_n, b_n) \in F^2$ where $F = GF(q)$ and $q \gg n$. We say that a polynomial $p(x)$ describes $k$ of these pairs if $p(a_i) = b_i$ for $k$ values of $i$.
(a) Show that there exists a bivariate polynomial $Q(z, x)$ of degree at most $\sqrt{n} + 1$ in $z$ and in $x$ such that $Q(b_i, a_i) = 0$ for each $i = 1, \ldots, n$. Show also that there is an efficient ($\mathrm{poly}(n)$ time) algorithm to construct such a $Q$.
(b) Show that if $R(z, x)$ is a bivariate polynomial and $g(x)$ a univariate polynomial, then $z - g(x)$ divides $R(z, x)$ iff $R(g(x), x)$ is the 0 polynomial.
(c) Suppose $p(x)$ is a degree $d$ polynomial that describes $k$ of the points. Show that if $k > (d + 1)(\sqrt{n} + 1)$ then $z - p(x)$ divides $Q(z, x)$.
Homework 3
Out: October 21
Due: November 4
You can collaborate with your classmates, but be sure to list your collaborators with your
answer. If you get help from a published source (book, paper etc.), cite that. Also, limit
your answers to one page or less; you just need to give enough detail to convince me. If
you suspect a problem is open, just say so and give reasons for your suspicion.
1 Show that $\mathrm{rank}(A)$, for an $n \times n$ matrix $A$, is the least $k$ such that $A$ can be expressed as the sum of $k$ rank 1 matrices. (This characterization of rank is often useful.)
2 Compute all eigenvalues and eigenvectors of the Laplacian of the boolean hypercube on $n = 2^k$ nodes.
3 Suppose $\lambda_1 (= d) \geq \lambda_2 \geq \cdots \geq \lambda_n$ are the eigenvalues of the adjacency matrix of a connected $d$-regular graph $G$. Then show that $\lambda_2(G) \leq d - \frac{1}{n(n-1)}$.
4 Let $G$ be an $n$-vertex connected graph. Let $\lambda_2$ be the second largest eigenvalue of its adjacency matrix and $x$ be the corresponding eigenvector. Then show that the subgraph induced on $S = \{i : x_i \geq 0\}$ is connected.
Homework 4
Out: November 11
Due: November 25
You can collaborate with your classmates, but be sure to list your collaborators with your
answer. If you get help from a published source (book, paper etc.), cite that. Also, limit
your answers to one page or less; you just need to give enough detail to convince me. If
you suspect a problem is open, just say so and give reasons for your suspicion.
1 Estimate the mixing time of the random walk on the lollipop graph, which consists of
a complete graph on n nodes and a path of length n attached to one of those nodes.
2 Give an efficient algorithm for the following task, which is used for dimension reduction in many clustering algorithms. We are given $n$ vectors $x_1, x_2, \ldots, x_n \in \mathbb{R}^n$ and a number $k$, and we desire the $k$-dimensional subspace $S$ of $\mathbb{R}^n$ which minimizes the sum of the squared distances of $x_1, x_2, \ldots, x_n$ to $S$. You may assume that eigenvalues and eigenvectors can be efficiently computed.
3 In this question you will prove that random walks on constant degree expanders mix very rapidly: for any subset of vertices $C$ that is fairly large, the probability that a random walk of length $l$ avoids $C$ is $\exp(-\Omega(l))$. Let $G = (V, E)$ be an unweighted undirected graph and $A$ be its adjacency matrix.
(a) Show that the number of walks of length $l$ is $g^T A^l g$, where $g$ is the all-1 vector.
(b) Suppose now that $G$ is $d$-regular and has $n$ vertices. Suppose that each of the eigenvalues of $A$ except the largest (which is $d$) has magnitude at most $\lambda$. Let $C$ be a subset of $cn$ vertices and let $A'$ be the adjacency matrix of the induced graph on $V \setminus C$. Show that every eigenvalue of $A'$ is at most $(1 - c)d + c\lambda$ in magnitude.
(c) Conclude that if $\lambda < 0.9d$ (i.e., $G$ is an expander) and $c = 1/2$, then the probability that a random walk of length $l$ in $G$ (starting at a randomly chosen vertex) avoids $C$ is at most $\exp(-\Omega(l))$.
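A quick simulation can make part (c) plausible (it is in no way a proof): take a random $d$-regular graph, which is an expander with high probability, let $C$ be half of the vertices, and estimate the probability that a walk of length $l$ avoids $C$; the estimate should decay exponentially in $l$. The sketch below assumes networkx is available, purely as a convenient way to generate a random regular graph.

    import random
    import networkx as nx

    n, d, trials = 200, 6, 20000
    G = nx.random_regular_graph(d, n, seed=1)     # an expander with high probability
    C = set(range(n // 2))                        # c = 1/2: C is half of the vertices
    adj = {v: list(G.neighbors(v)) for v in G}

    def walk_avoids_C(l):
        v = random.randrange(n)                   # start at a uniformly random vertex
        if v in C:
            return False
        for _ in range(l):
            v = random.choice(adj[v])             # one step of the random walk
            if v in C:
                return False
        return True

    for l in (5, 10, 20, 30):
        est = sum(walk_avoids_C(l) for _ in range(trials)) / trials
        print(l, est)                             # decays roughly exponentially in l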