
Notes for the course

“Mathematics of Signals, Networks, and Learning”


Spring 2023
Afonso S. Bandeira & Antoine Maillard
[email protected] & [email protected]

Last update: June 5, 2023

Course overview
This is an introductory course to Mathematical aspects of Data Science, Machine Learning, Signal
Processing, and Networks.
Course Coordinator: Pedro Abdalla Teixeira, [email protected]
The contents of the course will be updated as the semester progresses, and the following list is subject
to many possible changes.

1. Unsupervised Learning and Data Parsimony:

• Clustering and k-means


• Singular Value Decomposition
• Low Rank approximations and Eckart–Young–Mirsky Theorem
• Dimension Reduction and Principal Component Analysis
• Kernel PCA and Bochner’s theorem
• Sparsity and Compressed Sensing
• Finite Frame Theory and Equiangular tight frames
• The Paley ETF
• Concentration inequalities and random low-coherence frames.

2. Connections to graph theory

• Introduction to Spectral Graph Theory.


• Paley ETF and the Paley graph

3. Signal processing and Fourier analysis

• Shannon’s sampling theorem and the Nyquist rate.


• Discrete Fourier transform

4. Supervised Learning:

• Introduction to Classification and Generalization of Classifiers


• Uniform convergence, the VC theorem and VC Dimension

Note for non-Mathematics students: this class requires a certain degree of mathematical maturity, including abstract thinking and the ability to understand and write proofs.
Please visit the Forum at
https://forum.math.ethz.ch/c/spring-23/math-of-signals-networks-and-learning/149
for more information.
You will notice several questions along the way, separated into Challenges (and Exploratory Challenges).

• Challenges are well-defined mathematical questions, of varying level of difficulty. Some are
very easy, and some are much harder than any homework problem.

• Exploratory Challenges are not necessarily well defined, but thinking about them should
improve your understanding of the material.

We also include a few “Further Reading” references in case you are interested in learning more about
a particular topic.
Some chapters of these lecture notes are a close adaptation of those of the previous years, by A. S. Bandeira and N. Zhivotovskiy [BZ22]. Some parts are adapted from the book [BSS23]. If you are looking
for a more advanced course on this topic with lecture notes and many open problems, you can also
read through [Ban16] (by the first author).
Besides the goal of serving as an introduction to the Mathematics of Data Science and related areas,
the content selection also has the goal of illustrating interesting connections between different parts
of Mathematics and some of their (a priori) surprising applications.
Important disclaimer: This draft is in the making and subject to many future changes, additions and removals. Please excuse the lack of polish and the typos. If you find any typos or mistakes, please let us know! This draft was last updated on June 5, 2023.

Contents

1 Introduction 4

2 Clustering and k-means (24.02.2023) 5

3 The singular value decomposition (03.03.2023) 8

4 Low-rank approximation of matrix data (03.03.2023 - 10.03.2023) 9

5 Principal Component Analysis (10.03.2023) 12

6 Kernel PCA (10&17.03.2023) 14

7 Fourier Transform and Bochner’s theorem (17.03.2023) 16

8 Fourier Series and Shannon Sampling (24.03.2023) 19

9 The Discrete Fourier Transform (24.03.2023) 22

10 Graphs and Networks (31.03.2023) 23

11 Graph Cuts and Spectral Graph Theory (31.03.2023 - 21.04.2023) 25

12 Parsimony, compressed sensing and sparse recovery (21.04.2023 - 28.04.2023) 29

13 Finite frame theory and the Welch bound (05.05.2023) 34

14 Equiangular Tight Frames (ETFs) (05.05.2023) 37

15 The Paley ETF (12.05.2023) 40

16 Elements of classification theory (19.05.2023) 45

17 Hoeffding’s inequality (19-26.05.2023) 49

18 Uniform convergence and the VC theorem (26.05.2023 - 02.06.2023) 54

19 The Vapnik-Chervonenkis dimension (02.06.2023) 59

A Rest of Proof of Bochner’s Theorem 63

B Some elements of number theory 64

1 Introduction
We will study several areas of Signal Processing, Machine Learning and Analysis of Data, focusing
on the mathematical aspects. We list below the areas we will consider (the list contains the subjects
already covered, and will be updated as the course progresses).

• Unsupervised Learning: The most common instance in exploratory data analysis is when we receive data points without a priori known structure, think e.g. unlabeled images from a database, genomes of a population, etc. The natural first question is to ask if we can learn the geometry of the data. Simple examples include: Does the data separate well into clusters? Does the data naturally live in a smaller dimensional space or manifold? Sometimes the dataset comes in the form of a network (or graph) and the same questions can be asked; an approach in this case is Spectral Graph Theory, which we will cover if time permits.

• Signal Processing: Oftentimes, the data we observe comes in the form of a signal f(t), in which we can think of t as the time. After motivating Fourier analysis with Bochner's theorem in the previous part, we will see how Fourier analysis allows us to understand when we can uniquely reconstruct a signal from a discrete set of measurements, by proving Shannon's sampling theorem. We will then define the Discrete Fourier Transform and overview some of its applications.

• Parsimony and Sparsity: Sometimes, the information/signal we are after has a particular struc-
ture. A common form of parsimony is sparsity in a particular linear dictionary, such as natural
images in the Wavelet basis, or audio signals in the Fourier basis. We will present the basics of
Compressed sensing of sparse vectors, and use it to motivate the construction of low-coherence
frames.

• Finite frame theory: Motivated by the compressed sensing application above, we will introduce
the notion of maximally low-coherence frames, or equiangular tight frames. Using elementary
notions of number theory, we will present the construction of the Paley ETF, one of the few
explicit constructions that exist for these objects.

• Supervised Learning: In this last part, we will introduce some basics of statistical learning
theory, using the common problem of classification, i.e. learning an unknown classifier function
from examples. As a textbook illustrative example, one can have in mind classifying correctly
images of cats and dogs by generalizing from a finite sample of such images in which the label is
given. We will introduce the notion of PAC learnability for finite classes of functions, and using
tools of probability and concentration of measure, we will give guarantees on generalisation for
possibly infinite classes of functions via the VC dimension.

2 Clustering and k-means (24.02.2023)
Clustering is one of the central tasks in machine learning. Given a set of data points, the purpose
of clustering is to partition the data into a set of clusters where data points assigned to the same
cluster correspond to similar data (for example, having small distance to each other if the points are
in Euclidean space).

Figure 1: Examples of points which can be well separated in clusters.

k-means Clustering
One of the most popular methods used for clustering is k-means clustering. Given $x_1, \dots, x_n \in \mathbb{R}^p$, k-means clustering partitions the data points into clusters $S_1, \dots, S_k$ with centers $\mu_1, \dots, \mu_k \in \mathbb{R}^p$ as the solution to:
$$\min_{\substack{\text{partition } S_1, \dots, S_k \\ \mu_1, \dots, \mu_k}} \; \sum_{l=1}^{k} \sum_{i \in S_l} \|x_i - \mu_l\|^2. \qquad (1)$$

A popular algorithm that attempts to minimize eq. (1) is Lloyd’s Algorithm [Llo82] (this is also
sometimes referred to as simply “the k-means algorithm”). It relies on the following two observations
Proposition 2.1 (Properties of the k-means objective)

• Given a choice for the partition $S_1 \cup \cdots \cup S_k$, the centers that minimize (1) are given by
$$\mu_l = \frac{1}{|S_l|} \sum_{i \in S_l} x_i.$$

• Given the centers $\mu_1, \dots, \mu_k \in \mathbb{R}^p$, the partition that minimizes (1) assigns each point $x_i$ to the closest center $\mu_l$.

Challenge 2.1. Prove Proposition 2.1.


We describe Lloyd’s algorithm in Algorithm 2.1. Unfortunately, Lloyd’s algorithm is not guaranteed
to converge to the solution of (1). Indeed, it oftentimes gets stuck in local optima of (1). In fact,
optimizing (1) is N P -hard and so there is no polynomial time algorithm that works in the worst-case
(assuming the widely believed conjecture P ̸= N P ).
Challenge 2.2. Show that Lloyd's algorithm 2.1 converges [1] (even if not always to the minimum of (1)).

Algorithm 2.1 Lloyd’s algorithm
It is an iterative algorithm that starts with an arbitrary choice of centers and iteratively alternates
between

• Given centers $\mu_1, \dots, \mu_k$, assign each point $x_i$ to the cluster
$$\arg\min_{l=1,\dots,k} \|x_i - \mu_l\|.$$

• Update the centers $\mu_l = \frac{1}{|S_l|} \sum_{i \in S_l} x_i$,

until no update is taken.
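To make the two alternating steps concrete, here is a minimal NumPy sketch of Lloyd's algorithm (an added illustration, not the implementation from the class notebooks; the function name, the random initialization and the synthetic data are our own choices):

import numpy as np

def lloyd(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()  # arbitrary initial centers
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # assignment step: send each point to its closest center (Proposition 2.1, 2nd item)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # no update is taken: the algorithm stops
        labels = new_labels
        # update step: each center becomes the mean of its cluster (Proposition 2.1, 1st item)
        for l in range(k):
            if np.any(labels == l):
                centers[l] = X[labels == l].mean(axis=0)
    return labels, centers

# usage on two well-separated Gaussian blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(5.0, 1.0, (50, 2))])
labels, centers = lloyd(X, k=2)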

Challenge 2.3. Can you find an example of points and starting centers for which Lloyd’s algorithm
does not converge to the optimal solution of (1)?

Exploratory Challenge 2.4. How would you try to “fix” Lloyd’s Algorithm to avoid it getting stuck
in the example you constructed in Challenge 2.3?

Figure 2: Because the solutions of k-means are always convex clusters, it is not able to handle some
cluster structures.

While popular, k-means clustering has some potential issues:

• One needs to set the number of clusters a priori. A typical way to overcome this issue is to try
the algorithm for different numbers of clusters.

• The way the formula (1) is defined requires the points to live in a Euclidean space. Often we are interested in clustering data for which we only have some measure of affinity between different data points, but not necessarily an embedding in Rp (this issue can be overcome by reformulating eq. (1) in terms of distances only; you will do this on the first homework problem set).

• The formulation is computationally hard, so algorithms may produce suboptimal instances.

• The solutions of k-means are always convex clusters. This means that k-means cannot find
clusters such as in Figure 2.

[1] In the sense that it stops after a finite number of iterations.

Further reading 2.1. On the computational side, there are many interesting questions regarding when the k-means objective can be efficiently approximated; you can see a few open problems on this in [Ban16] (for example Open Problem 9.4).

3 The singular value decomposition (03.03.2023)
We recall here some useful facts and definitions on the singular value decomposition.
Data is often presented as a d × n matrix whose columns correspond to n data points in Rd . Other
examples include matrices of interactions where the entry (i, j) of a matrix contains information about
an interaction, or similarity, between an item (or entity) i and j. In this context, the Singular Value
Decomposition (SVD) is one of the most powerful tools to analyze matrix data.
Given a matrix X ∈ Rn×m , its Singular Value Decomposition is given by (U, Σ, V ) such that

X = U ΣV ⊤ ,

where U ∈ O(n) and V ∈ O(m) are orthogonal matrices, and Σ ∈ Rn×m is a rectangular diagonal
matrix, in the sense that Σij = 0 for i ̸= j, and whose diagonal entries are non-negative.
The diagonal entries $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_{\min\{n,m\}}$ of Σ are called the singular values [2] of X. Note that, unlike eigenvalues, singular values are always real and non-negative. The columns $(u_k)_{k\in[n]}$ and $(v_\mu)_{\mu\in[m]}$ of respectively U and V are called the left and right singular vectors of X.
Proposition 3.1 (Some basic properties of SVD)

• rk(X) is equal to the number of non-zero singular values of X.

• If n ≤ m, then the singular values of X are the square roots of the eigenvalues of XX ⊤ . If
m ≤ n they are the square roots of the eigenvalues of X ⊤ X.

Challenge 3.1. Prove this fact.

The SVD can also be written in more economic ways. For example, if rk(X) = r ≤ min{n, m} then
we can instead write
X = U ΣV ⊤ ,
where $U^\top U = I_{r\times r}$, $V^\top V = I_{r\times r}$, and Σ is a non-singular r × r diagonal matrix. Note that this representation only requires r(n + m + 1) numbers, which if r ≪ min{n, m} (i.e. if X is low-rank), is a considerable saving when compared to the nm elements of X. It is also useful to write the SVD as
$$X = \sum_{k=1}^{r} \sigma_k u_k v_k^\top,$$

where σk is the k-th largest singular value, and uk and vk are the corresponding left and right singular
vectors.

[2] The most common convention is that the singular values are ordered in decreasing order; this is the convention we follow here.

4 Low-rank approximation of matrix data (03.03.2023 -
10.03.2023)
A key observation in Machine Learning and Data Science is that (matrix) data is oftentimes well
approximated by low-rank matrices. We will see examples of this phenomenon later in the course, and
in the code simulations available on the class webpage.
In order to talk about what it means for a matrix B to approximate another matrix A, we need to
have a notion of distance between matrices of the same dimensions, or equivalently a notion of norm
of A − B. Let us start with some classical norms.
Definition 4.1 (Spectral Norm)
The spectral norm of $X \in \mathbb{R}^{n\times m}$ is given by
$$\|X\| := \max_{\|v\|_2 = 1} \|Xv\|_2,$$
or equivalently $\|X\| := \sigma_1(X)$.

Challenge 4.1. Show that the two definitions above are equivalent.
Another common matrix norm is the Frobenius (or Hilbert-Schmidt) norm.
Definition 4.2 (Frobenius norm)
The Frobenius norm of $X \in \mathbb{R}^{n\times m}$ is given by
$$\|X\|_F := \left[ \sum_{i=1}^{n} \sum_{j=1}^{m} X_{ij}^2 \right]^{1/2}.$$

Challenge 4.2. Show that
$$\|X\|_F^2 = \sum_{i=1}^{\min\{n,m\}} \sigma_i(X)^2 = \mathrm{Tr}[XX^\top].$$

Challenge 4.3. Show that the spectral and Frobenius norms are indeed norms.
Note that by solving Challenges 4.1 and 4.3 you have shown also that for any two matrices X, Y ∈
Rn×n ,
σ1 (X + Y ) ≤ σ1 (X) + σ1 (Y ). (2)
There is a natural generalization of the two norms above, the so called Schatten p-norms.
Definition 4.3 (Schatten p-norm)
Given a matrix $X \in \mathbb{R}^{n\times m}$ and $1 \le p \le \infty$, the Schatten p-norm of X is given by
$$\|X\|_{(S,p)} := \left( \sum_{i=1}^{\min\{n,m\}} \sigma_i(X)^p \right)^{1/p} = \|\sigma(X)\|_p,$$
where σ(X) corresponds to the vector whose entries are the singular values of X. Note that for p = ∞, this corresponds to the spectral norm and we often simply use ∥X∥ without a subscript. Moreover, the Schatten 2-norm is the Frobenius norm, according to Challenge 4.2.

Challenge 4.4. Show that the Schatten p-norm is a norm (proving the triangle inequality for general p ∈ [1, ∞] is non-trivial).

Another key insight in this section is that, since the rank of a matrix X is the number of non-zero
singular values, a natural rank-r approximation for a matrix X is to replace all singular values but
the largest r singular values of X with zero. This is often referred to as the truncated SVD. Let us
be more precise.
Definition 4.4 (Truncated SVD)
Let X ∈ Rn×m and X = U ΣV ⊤ be its SVD. We define Xr = Ur Σr Vr⊤ the truncated SVD of X by
setting Ur ∈ Rn×r and Vr ∈ Rm×r to be, respectively, the first r columns of U and V ; and Σr ∈ Rr×r
to be a diagonal matrix with the first r singular values of X (notice these are the largest ones, due
to the way we defined SVD).

Warning: The notation Xr for low-rank approximations is not standard.


Note that $\mathrm{rk}(X_r) = r$ and $\sigma_1(X - X_r) = \sigma_{r+1}(X)$. It turns out that this way of approximating a matrix by a low-rank matrix is optimal in a very strong sense. This is captured by the celebrated Eckart–Young–Mirsky Theorem, which we will prove now, starting with a particular case.
Lemma 4.1 (Eckart–Young–Mirsky Theorem for Spectral norm)
The truncated SVD provides the best low-rank approximation in spectral norm. In other words,
let X ∈ Rn×m and r ≤ min{n, m}. Let Xr be as in Definition 4.4. Then for any matrix B with
rk(B) ≤ r we have:
∥X − B∥ ≥ ∥X − Xr ∥.

Proof of Lemma 4.1 – The claim with r = min{m, n} is trivial, as then Xr = X. We assume r < min{m, n}. Let $X = U\Sigma V^\top$ be the SVD of X. Since rk(B) ≤ r there must exist a vector w ≠ 0 in the span of the first r + 1 right singular vectors $v_1, \dots, v_{r+1}$ of X such that w is in the kernel of B. Without loss of generality let w have unit norm. Let us write $w = \sum_{k=1}^{r+1} \alpha_k v_k$. Since w is unit-norm and the $v_k$'s are orthonormal we have $\alpha_k = v_k^\top w$ and $\sum_{k=1}^{r+1} \alpha_k^2 = 1$. Finally,
$$\|X - B\| \ge \|(X - B)w\|_2 = \|Xw\|_2 = \|\Sigma V^\top w\|_2 = \sqrt{\sum_{k=1}^{r+1} \sigma_k^2(X)\, \alpha_k^2} \ge \sigma_{r+1}(X) = \|X - X_r\|.$$

Challenge 4.5. If you think the existence of the vector w in the start of the proof above is not obvious
(or any other step), try to prove it.

The inequality (2) is a particular case of a more general set of inequalities, the Weyl inequalities, named after Hermann Weyl (a brilliant mathematician who spent many years at ETH). Here we focus on the inequalities for singular values; the more classical ones are for eigenvalues. It is worth noting also that these follow from the ones for eigenvalues, since the singular values of X are the square roots of the eigenvalues of $X^\top X$.

Theorem 4.2 (Weyl inequalities for singular values)


For all $X, Y \in \mathbb{R}^{n\times m}$:
$$\sigma_{i+j-1}(X + Y) \le \sigma_i(X) + \sigma_j(Y),$$
for all $1 \le i, j \le \min\{n, m\}$ satisfying $i + j - 1 \le \min\{n, m\}$.

Proof of Theorem 4.2 – Let $X_{i-1}$ and $Y_{j-1}$ be, respectively, the rank i − 1 and rank j − 1 approximations of X and Y (as in Definition 4.4). By eq. (2) we have
$$\sigma_1\big((X - X_{i-1}) + (Y - Y_{j-1})\big) \le \sigma_1(X - X_{i-1}) + \sigma_1(Y - Y_{j-1}) = \sigma_i(X) + \sigma_j(Y).$$

Since Xi−1 + Yj−1 has rank at most i + j − 2, Lemma 4.1 implies that

σi+j−1 (X + Y ) = σ1 (X + Y − (X + Y )i+j−2 ) ≤ σ1 (X + Y − (Xi−1 + Yj−1 )) .

Putting both inequalities together we get

σi+j−1 (X + Y ) ≤ σ1 (X + Y − Xi−1 − Yj−1 ) ≤ σi (X) + σj (Y ).

Challenge 4.6. There is another simple proof of this theorem based on the Courant-Fischer minimax variational characterization of singular values:
$$\sigma_k(X) = \max_{V \subseteq \mathbb{R}^m,\ \dim(V) = k}\ \min_{v \in V,\ \|v\| = 1} \|Xv\|, \qquad (3)$$
$$\sigma_{k+1}(X) = \min_{V \subseteq \mathbb{R}^m,\ \dim(V) = m - k}\ \max_{v \in V,\ \|v\| = 1} \|Xv\|. \qquad (4)$$
Try to prove it that way.

We are now ready to prove the main theorem of this section:

Theorem 4.3 (Eckart–Young–Mirsky Theorem)


The truncated SVD provides the best low-rank approximation in any Schatten p-norm. Formally, let $X \in \mathbb{R}^{n\times m}$, $r \le \min\{n, m\}$, and $1 \le p \le \infty$. Let Xr be the truncated SVD of X retaining the leading r singular values, see Definition 4.4. Then
$$X_r = \underset{\substack{B \in \mathbb{R}^{n\times m} \\ \mathrm{rk}(B) \le r}}{\arg\min} \|X - B\|_{(S,p)}.$$

We have already proved this for p = ∞ (Lemma 4.1). The proof of the general result follows from
Weyl’s inequalities (Theorem 4.2).
Proof of Theorem 4.3 – Let $X \in \mathbb{R}^{n\times m}$, and B a matrix with rk(B) ≤ r. We use Weyl's inequalities for X − B and B:
$$\sigma_{i+j-1}(X) \le \sigma_i(X - B) + \sigma_j(B).$$
Taking j = r + 1, and i ≥ 1 satisfying i + (r + 1) − 1 ≤ min{n, m}, we have
$$\sigma_{i+r}(X) \le \sigma_i(X - B), \qquad (5)$$
since $\sigma_{r+1}(B) = 0$. Note that:
$$\|X - B\|_{(S,p)}^p = \sum_{k=1}^{\min\{n,m\}} \sigma_k^p(X - B) \ge \sum_{k=1}^{\min\{n,m\} - r} \sigma_k^p(X - B).$$
And by eq. (5):
$$\sum_{k=1}^{\min\{n,m\} - r} \sigma_k^p(X - B) \ge \sum_{k=1}^{\min\{n,m\} - r} \sigma_{k+r}^p(X) = \sum_{k=r+1}^{\min\{n,m\}} \sigma_k^p(X) = \|X - X_r\|_{(S,p)}^p.$$

5 Principal Component Analysis (10.03.2023)
When given some high-dimensional data, a statistician often seeks to find out if this data can be approximately represented as lying in a smaller dimensional set, see Fig. 3.

Figure 3: Two sets of data points. The blue points can clearly be well-approximated by a one-dimensional subspace (a line), while the orange points can not.

In general, this procedure is referred to as dimensionality reduction: given a set of data points
$$y_1, \dots, y_m \in \mathbb{R}^p, \qquad (6)$$
we are hoping to find a good d-dimensional representation of $\{y_i\}_{i=1}^m$, ideally with d ≪ p. The simplest such representation is given by a d-dimensional subspace: does there exist a set $z_1, \dots, z_m$ of points which all lie on the same d-dimensional affine subspace, and such that $z_i$ is "close" to $y_i$?
Let us simplify the setup slightly: we go from affine to linear subspace approximation by removing the empirical mean of $\{y_i\}_{i=1}^m$. More precisely, denoting $\mu := (1/m)\sum_{i=1}^m y_i$, we will try to approximate $x_i := y_i - \mu$ by points $\{z_i\}_{i=1}^m$ lying on a d-dimensional linear subspace.
To measure closeness of $z_i$ to $x_i$, we will use the Euclidean norm. This leads us to look for
$$\underset{\{z_i\}_{i=1}^m}{\arg\min} \sum_{i=1}^{m} \|z_i - x_i\|_2^2, \qquad (7)$$
in which the minimum is taken over all $\{z_i\}_{i=1}^m$ such that $\dim[\mathrm{Span}(\{z_i\})] \le d$. Note that $\mathrm{Span}(\{z_i\})$ is also the image space of the matrix $Z \in \mathbb{R}^{p\times m}$, by defining
$$Z := \begin{pmatrix} | & & | \\ z_1 & \cdots & z_m \\ | & & | \end{pmatrix} \quad \text{and} \quad X := \begin{pmatrix} | & & | \\ x_1 & \cdots & x_m \\ | & & | \end{pmatrix}.$$
This allows us to rewrite eq. (7) as:
$$\underset{\substack{Z \in \mathbb{R}^{p\times m} \\ \mathrm{rk}(Z) \le d}}{\arg\min} \|Z - X\|_F^2. \qquad (8)$$

We recognize exactly the setup of Theorem 4.3: the solution is given by the truncated SVD of X, that is
$$\underset{\substack{Z \in \mathbb{R}^{p\times m} \\ \mathrm{rk}(Z) \le d}}{\arg\min} \|Z - X\|_F^2 = U_d \Sigma_d V_d^\top, \qquad (9)$$
in which $X = U\Sigma V^\top$, and we used the notations of Definition 4.4. Coming back to our original task of approximating $\{y_i\}_{i=1}^m$, this means that the best d-dimensional representation is given by
$$y_i \approx z_i = U_d \beta_i + \mu, \qquad (10)$$
where $\beta_i := \Sigma_d v_i \in \mathbb{R}^d$, with $v_i \in \mathbb{R}^d$ the i-th column of $V_d^\top$ (that is, the transposed i-th row of $V_d$). Eq. (10) is known as Principal Component Analysis (PCA). We refer to [BSS23] for an alternative derivation of PCA, not based on the Eckart–Young theorem.

Remark 5.1. Notice how in eq. (10) the left singular vectors Ud and the right singular vectors Vd
have two different interpretations:

• The singular vectors Ud correspond to the basis in which to project the original points (after
centering).

• The singular vectors Vd (or more precisely the vectors βi = Σd vi ) then correspond to low dimen-
sional coordinates for the points in this basis.

While centering the data points might seem arbitrary when looking for the best d-dimensional
approximation, one can show that indeed this is the optimal choice:

Challenge 5.1. Instead of centering the points at the start, we could have asked for the best approximation in the sense of picking $\beta_k \in \mathbb{R}^d$, a matrix $U_d$ whose columns are a basis for a d-dimensional subspace, and $\mu \in \mathbb{R}^p$ such that (10) is the best possible approximation (in the sense of sum of squares of distances). Show that then $\mu = (1/m)\sum_{i=1}^m y_i$, the empirical mean.

We end this section by two remarks:


Principal Component Analysis and sample covariance matrix – Another classical way to describe PCA (see, for example, Chapter 3.2 of [BSS23]) is to build the sample covariance matrix of the (centered) data, which is defined as:
$$\frac{1}{m-1} XX^\top = \frac{1}{m-1} \sum_{i=1}^{m} (y_i - \mu)(y_i - \mu)^\top.$$

PCA is then described as writing the data in the subspace generated by the leading eigenvectors
of XX ⊤ . Notice that this is the same as above, since XX ⊤ = U ΣV ⊤ (U ΣV ⊤ )⊤ = U Σ2 U ⊤ , where
X = U ΣV ⊤ is the SVD of X. Thus the leading eigenvectors of XX ⊤ correspond to the leading left
singular vectors of X: writing the data in the subspace they generate is therefore exactly what we did
in eq. (10)!
Principal Component Analysis and Gram matrix – While the basis in which we project the
points xi is given by the leading left singular vectors of X, we also saw that the leading right singular
vectors were related to the coordinates in that basis. We note here that they correspond to eigenvectors
of the Gram matrix of $\{x_i\}_{i=1}^m$, that is the matrix $M \in \mathbb{R}^{m\times m}$ whose entries are
$$M_{ij} := \langle x_i, x_j \rangle. \qquad (11)$$

Indeed, one has M = X ⊤ X = V Σ2 V ⊤ , so the right singular vectors of X are the eigenvectors of M .
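The derivation above translates directly into a few lines of NumPy. The sketch below is an added illustration (function and variable names are our own choices): it centers the data, computes the truncated SVD, and returns the basis $U_d$, the low-dimensional coordinates $\beta_i$, and the rank-d approximation of eq. (10).

import numpy as np

def pca(Y, d):
    # Y has shape (p, m): the m data points y_1, ..., y_m in R^p are its columns
    mu = Y.mean(axis=1, keepdims=True)
    X = Y - mu                                    # centered data
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    Ud = U[:, :d]                                 # basis of the d-dimensional subspace
    B = np.diag(s[:d]) @ Vt[:d, :]                # column i is beta_i, the low-dim coordinates
    Z = Ud @ B + mu                               # best rank-d approximation, eq. (10)
    return Ud, B, Z

# usage: points close to a 2-dimensional subspace of R^10
rng = np.random.default_rng(2)
Y = rng.standard_normal((10, 2)) @ rng.standard_normal((2, 200)) \
    + 0.05 * rng.standard_normal((10, 200))
Ud, B, Z = pca(Y, d=2)
print(np.linalg.norm(Y - Z) / np.linalg.norm(Y))  # small: the data is nearly 2-dimensional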

6 Kernel PCA (10&17.03.2023)
6.1 Kernel PCA
PCA aims to find the best low-dimensional linear representation of the data points. But what if the
data indeed has some low-dimensional structure, but it is non-linear? For instance, think of Figure 2:
PCA will not be able to find a representation of the data that can differentiate the two clusters.
A possible approach to overcome this limitation is to come back to eq. (11): one can interpret the
matrix M as Mij measuring affinity between point i and j; indeed ⟨xi , xj ⟩ is larger if xi and xj are
more similar. With this interpretation we can define versions of PCA with other notions of affinity
Mij = K(xi , xj ),
where the affinity function K is often called a Kernel. This is the idea behind Kernel PCA. Notice
that this can be defined even when the data points are not in Euclidean space. Moreover in Kernel
PCA we will consider the top eigenvectors of M : according to the previous section, this will give us
a low-dimensional representation of the data, but not how this representation is built. This is often
sufficient, as e.g. in clustering: we do not need to know how the low-dimensional representation is
built as long as we can use it to cluster the data points!
Example: Gaussian Kernel – A common choice of Kernel is the so-called Gaussian kernel
$$K(x_i, x_j) = \exp\left(-\|x_i - x_j\|^2 / \varepsilon^2\right),$$
for ε > 0. The intuition of why one would use this notion of affinity is that it tends to ignore distances
at a scale larger than ε; if data has a low dimensional structure embedded, with some curvature, in
a larger dimensional ambient space then small distances in the ambient space should be similar to
intrinsic distances, but larger distances are less reliable (recall Figure 2). In Fig. 4 we show how Kernel
PCA with a Gaussian Kernel allows to efficiently cluster data similar to that of Fig. 2. See Chapter
5 in [BSS23] for more on this, and some other illustrative pictures.

[Figure 4 shows two scatter plots: "Original data points" (left) and "Projection using Kernel PCA" (right).]

Figure 4: Clustering data points made of two (noisy) concentric circles using Kernel PCA with a
Gaussian Kernel. When considering the two top eigenvectors of the Kernel matrix (on the right), we
can easily cluster the data!
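A minimal sketch reproducing an experiment in the spirit of Fig. 4 is given below (an added illustration; the data generation and the bandwidth ε are arbitrary choices of ours): we build the Gaussian kernel matrix of two noisy concentric circles and take its two leading eigenvectors as a new representation of the points.

import numpy as np

rng = np.random.default_rng(3)
n = 200
radii = np.repeat([1.0, 3.0], n // 2)                 # two concentric circles
angles = rng.uniform(0.0, 2.0 * np.pi, n)
X = (radii * np.stack([np.cos(angles), np.sin(angles)])).T
X += 0.05 * rng.standard_normal(X.shape)              # a little noise

eps = 1.0
sqdist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
M = np.exp(-sqdist / eps**2)                          # kernel matrix M_ij = K(x_i, x_j)

eigval, eigvec = np.linalg.eigh(M)                    # eigenvalues in ascending order
representation = eigvec[:, -2:]                       # two leading eigenvectors, one row per point
# The two circles are now easy to separate from this 2-d representation,
# e.g. with k-means or a simple threshold on one of the coordinates.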

There is an alternative way of interpreting Kernel PCA: rather than seeing it as changing the affinity
measure, we can also think of it as changing the way data points are represented. Then Kernel PCA
is doing PCA on the new representation, i.e. K(xi , xj ) = ⟨ϕi , ϕj ⟩, where these new “representations”
(ϕi ) of the points (xi ) are often referred to as features, in Machine Learning. This observation will be
further explored below.
Importantly, in order for the interpretation above to apply we need M ⪰ 0 (M ⪰ 0 means M is
positive semidefinite, all eigenvalues are non-negative; we only use the notation M ⪰ 0 for symmetric
matrices). This motivates the definition of positive definite kernels:

Definition 6.1 (Positive definite kernels)
A kernel function K is positive definite if for any $n \ge 1$ and any $x_1, \dots, x_n \in \mathbb{R}^p$, the matrix $(K(x_i, x_j))_{i,j=1}^n$ is symmetric and positive semi-definite.

Note the unfortunate choice of wording in this definition: a kernel is positive definite iff the associated
matrices are positive semi-definite.
When this is the case, we can write the Cholesky decomposition of M = (K(xi , xj ))ni,j=1 as

M = Φ⊤ Φ,

for some matrix Φ. If ϕi is the i-th column of Φ then

Mij = K(xi , xj ) = ⟨ϕi , ϕj ⟩,

for this reason ϕi is commonly referred to, in the Machine Learning community, as the feature vector of i.

Challenge 6.1. Show that the Gaussian Kernel $K(x, y) := \exp(-\|x - y\|^2/\varepsilon^2)$ is positive definite.


Further reading 6.1. A very natural question is whether the feature vectors ϕi can be written as
ϕi = φ(xi ), where the function φ (called the feature map) depends only on the kernel K and not on the
data points. This turns out to be true, and is related to the celebrated Mercer Theorem (essentially
a spectral theorem for positive definite kernels).

Exploratory Challenge 6.2. Can you describe the feature map associated to the Gaussian Kernel?

References – A more advanced introduction to kernel methods can be found e.g. in the lecture notes
[Bac21], see also the references therein.

7 Fourier Transform and Bochner’s theorem (17.03.2023)
In this section we will introduce Bochner’s Theorem, which characterizes translation-invariant Positive
Definite Kernels, but before we need to take a small detour to introduce the Fourier Transform.

7.1 Fourier Transform


This is a brief introduction to Fourier Transform. Math BSc students at ETH have a detailed and
rigorous introduction in Analysis IV, others have many options for books on the subject ([SS03] is an
excellent one). In this subsection, functions are functions from R (or Rp ) to C.

7.1.1 Fourier Transform in R


Given $f \in L^1(\mathbb{R})$, a complex-valued integrable function, we can define its Fourier transform as:
$$\hat f(\xi) := \frac{1}{\sqrt{2\pi}} \int_{\mathbb{R}} f(t)\, e^{-i\xi t}\, dt \qquad (\xi \in \mathbb{R}). \qquad (12)$$

If $\hat f \in L^1(\mathbb{R})$, then we have a Fourier inversion theorem: for all $t \in \mathbb{R}$ which are continuity points of f we have:
$$f(t) = \frac{1}{\sqrt{2\pi}} \int_{\mathbb{R}} \hat f(\xi)\, e^{i\xi t}\, d\xi. \qquad (13)$$
Furthermore, the definition (12) can be extended to square-integrable functions $f \in L^2(\mathbb{R})$. An essential result in this context is Plancherel's Theorem, which states that the Fourier transform is an isometry of $L^2(\mathbb{R})$, i.e.
$$\int_{\mathbb{R}} |f(t)|^2\, dt = \int_{\mathbb{R}} |\hat f(\xi)|^2\, d\xi. \qquad (14)$$

Challenge 7.1. Prove these properties. You can assume f ∈ L1 (R) ∩ L2 (R) (or even f ∈ L1 (R) ∩
L∞ (R) if it makes it easier), and the same for fˆ. There are fascinating connections between the
regularity (smoothness, integrability, decay, etc...) of f and fˆ, but for a first introduction, try to prove
the properties above assuming whatever regularity you need. You can find out more in e.g. [SS03].
Challenge 7.2. Similarly to Plancherel's Theorem (14), can you show that the Fourier Transform also preserves inner-products in $L^2(\mathbb{R})$?
One reason the Fourier Transform is a central object in so many areas of Mathematics and beyond is
that it effectively diagonalizes translations (and so also differentiation, which is in a sense the reason
why it is such a useful tool when studying differential equations). This will be more clear once we talk
about the Discrete Fourier Transform, and in a more abstract sense once you study some representation
theory of groups.
For now, we observe an important property of the Fourier Transform that illustrates this fact: for $f \in L^1(\mathbb{R})$ and $t_0 \in \mathbb{R}$, let $h(t) := T_{t_0} f(t) = f(t - t_0)$. Then
$$\hat h(\xi) = \frac{1}{\sqrt{2\pi}} \int_{\mathbb{R}} f(t - t_0)\, e^{-it\xi}\, dt = \frac{1}{\sqrt{2\pi}} \int_{\mathbb{R}} f(t - t_0)\, e^{-i(t - t_0)\xi}\, e^{-it_0\xi}\, dt = e^{-it_0\xi} \hat f(\xi). \qquad (15)$$

The transformation $M_{t_0}: g(\xi) \mapsto e^{-it_0\xi} g(\xi)$ is known as a modulation. It is "diagonal" in the sense that the value of $M_{t_0}g(\xi)$ depends only on the value of $g(\xi)$ and not on the value of g at other arguments.
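The translation property (15) is easy to check numerically. The sketch below is an added illustration using a simple Riemann-sum approximation of the integral (12); the test function, the grid and the shift $t_0$ are arbitrary choices of ours.

import numpy as np

t = np.linspace(-30.0, 30.0, 20001)
dt = t[1] - t[0]
f = np.exp(-t**2 / 2.0)             # a smooth, rapidly decaying test function
t0 = 1.7
h = np.exp(-(t - t0)**2 / 2.0)      # h(t) = f(t - t0)

def fourier(g, xi):
    # hat g(xi) = (2 pi)^(-1/2) * integral of g(t) exp(-i xi t) dt, approximated on the grid
    return (g * np.exp(-1j * xi * t)).sum() * dt / np.sqrt(2.0 * np.pi)

for xi in [0.0, 0.5, 2.0]:
    lhs = fourier(h, xi)
    rhs = np.exp(-1j * t0 * xi) * fourier(f, xi)
    print(abs(lhs - rhs))           # essentially zero: hat h(xi) = exp(-i t0 xi) hat f(xi)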
Challenge 7.3. Derive a formula for the Fourier Transform of the derivative of f in terms of the
Fourier Transform of f . Is it also “diagonal”? (in the same sense as above)
Challenge 7.4. Derive formulas for the Fourier Transform of:
1. A dilation of a function f , i.e. h(x) := f (αx), for α ∈ R.
2. A modulation of a function f , i.e. h(x) := eiβx f (x), for β ∈ R.

7.1.2 Fourier Transform in Rp
The Fourier Transform can be analogously defined in $\mathbb{R}^p$. Indeed, given $f \in L^1(\mathbb{R}^p)$, we can define its Fourier transform as:
$$\hat f(u) := \frac{1}{(2\pi)^{p/2}} \int_{\mathbb{R}^p} f(x)\, e^{-iu^\top x}\, dx \qquad (u \in \mathbb{R}^p). \qquad (16)$$
The properties shown above for p = 1 have direct analogues in this setting; we include here the inversion formula. If $\hat f \in L^1(\mathbb{R}^p)$ then, for all $x \in \mathbb{R}^p$ which are continuity points of f:
$$f(x) = \frac{1}{(2\pi)^{p/2}} \int_{\mathbb{R}^p} \hat f(u)\, e^{iu^\top x}\, du. \qquad (17)$$

Challenge 7.5. Show analogues in this setting of the properties described above for the one dimen-
sional Fourier Transform.

7.2 Bochner’s theorem


In this section, we focus on the special case of translation-invariant kernels. They are kernels that
are a function only of the difference between the points, i.e. such that K(xi , xj ) = q(xi − xj ), for
some function q : Rp 7→ R. Note that this is the case e.g. for the Gaussian Kernel K(xi , xj ) =
exp(−∥xi − xj ∥2 /ε2 ).
In this specific setting, Bochner's theorem relates a kernel being positive definite to properties of its Fourier Transform. This theorem can be used to solve Challenge 6.1 (but there are other ways). We are now ready to state Bochner's theorem:

Theorem 7.1 (Bochner)


Let K(x, y) = q(x − y) be a translation invariant kernel, real-valued and symmetric. Assume that
q is continuous. Then the two following are equivalent:

(i) K is positive definite.

(ii) There exists a positive and finite measure µ on $\mathbb{R}^p$ such that q is the Fourier Transform of µ, i.e. for all $x \in \mathbb{R}^p$:
$$q(x) = \frac{1}{(2\pi)^{p/2}} \int_{\mathbb{R}^p} e^{-iu^\top x}\, d\mu(u).$$

The proof of Bochner’s theorem 7.1 can be found in several textbooks on harmonic analysis, cf. for
instance [Kat04]. In these notes we show a weaker version of Bochner’s Theorem.

Theorem 7.2 (Bochner, weak version)
Let K(x, y) = q(x − y) be a translation invariant kernel, real-valued and symmetric. Assume that
q is continuous, and that q, q̂ ∈ L1 (Rp ). Then the two following are equivalent:

(i) K is positive definite.

(ii) For all u ∈ Rp , q̂(u) ≥ 0.

Remark – Actually the hypothesis that q̂ ∈ L1 (Rp ) is not necessary in Theorem 7.2: one can show
that either (i) or (ii) imply that q̂ ∈ L1 (Rp ), cf e.g. the notes [Gub18].
In the lecture and in this section we prove only the easier implication (ii) ⇒ (i), see Appendix A
for the other implication. Note that the proof of (ii) ⇒ (i) is exactly the same in both Theorem 7.2
and Theorem 7.1 (in the lecture we showed it in the language of Theorem 7.1, here we do it in the
language of Theorem 7.2).
Proof of (ii) ⇒ (i) in Theorem 7.2 – Since $\hat q \in L^1(\mathbb{R}^p)$, we can use the Fourier inversion formula for all $x \in \mathbb{R}^p$ [3]:
$$q(x) = \frac{1}{(2\pi)^{p/2}} \int_{\mathbb{R}^p} e^{iu^\top x}\, \hat q(u)\, du. \qquad (18)$$
We fix $n \ge 1$ and $x_1, \dots, x_n \in \mathbb{R}^p$. Let $M_{ij} := q(x_i - x_j)$. Since q is even (because the Kernel is symmetric), M is symmetric. Let us show that M is positive semidefinite. We fix any $\alpha = (\alpha_1, \dots, \alpha_n) \in \mathbb{R}^n$, and show that $\alpha^\top M \alpha \ge 0$. We have:
$$\begin{aligned}
\alpha^\top M \alpha &= \sum_{j,k=1}^{n} \alpha_j \alpha_k\, q(x_j - x_k) \\
&= \frac{1}{(2\pi)^{p/2}} \sum_{j,k=1}^{n} \alpha_j \alpha_k \int_{\mathbb{R}^p} e^{iu^\top (x_j - x_k)}\, \hat q(u)\, du \\
&= \frac{1}{(2\pi)^{p/2}} \int_{\mathbb{R}^p} \sum_{j,k=1}^{n} \alpha_j \alpha_k\, e^{iu^\top (x_j - x_k)}\, \hat q(u)\, du \\
&= \frac{1}{(2\pi)^{p/2}} \int_{\mathbb{R}^p} \Big| \sum_{j=1}^{n} \alpha_j\, e^{iu^\top x_j} \Big|^2\, \hat q(u)\, du \ge 0.
\end{aligned}$$

[3] Since q is continuous, the formula is valid for all x.
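Numerically, the implication (ii) ⇒ (i) can be observed directly: since the Gaussian kernel has a non-negative Fourier transform, any kernel matrix it generates should be positive semidefinite up to round-off. The following short check is an added illustration (the points and the bandwidth are arbitrary choices of ours).

import numpy as np

rng = np.random.default_rng(4)
x = rng.standard_normal((50, 3))                            # 50 arbitrary points in R^3
eps = 0.8
sqdist = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=2)
M = np.exp(-sqdist / eps**2)                                # M_ij = q(x_i - x_j), Gaussian kernel

print(np.linalg.eigvalsh(M).min())   # non-negative up to round-off: M is positive semidefinite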

8 Fourier Series and Shannon Sampling (24.03.2023)
In this section we will shift gears somewhat and show an important application of Fourier theory in
Signal Processing.

8.1 Fourier Series and L2 ([−π, π])


Let us start by recalling [4] some properties of $L^2([-\pi, \pi])$. All of the sequel can be analogously done for $L^2([-\Omega\pi, \Omega\pi])$ for any Ω > 0 by appropriately scaling quantities. To ease notation, we set Ω = 1.
$L^2([-\pi, \pi])$ is the Hilbert space of square-integrable complex-valued functions on $[-\pi, \pi]$ with the inner-product given by [5]
$$\langle f, g \rangle := \int_{-\pi}^{\pi} f(x)\, \overline{g(x)}\, dx,$$
and the associated norm
$$\|f\|^2 = \int_{-\pi}^{\pi} |f(x)|^2\, dx.$$

A remarkable property of $L^2([-\pi, \pi])$ is that the harmonic functions
$$\left( x \mapsto \frac{1}{\sqrt{2\pi}}\, e^{inx} \right)_{n \in \mathbb{Z}} \qquad (19)$$
are an orthonormal basis for $L^2([-\pi, \pi])$.


Challenge 8.1. Show that the functions (19) are orthonormal, i.e.
$$\left\langle \frac{1}{\sqrt{2\pi}}\, e^{inx},\ \frac{1}{\sqrt{2\pi}}\, e^{imx} \right\rangle = \delta_{n,m}.$$
The fact that this is a basis means that for every function $f \in L^2([-\pi, \pi])$, there exists a sequence $\{a_n\}_{n\in\mathbb{Z}}$ such that
$$f(x) = \sum_{n=-\infty}^{\infty} a_n\, \frac{1}{\sqrt{2\pi}}\, e^{inx},$$
with equality in the sense of $L^2([-\pi, \pi])$. Since the basis (19) is orthonormal, the coefficients are given by
$$a_n = \left\langle f(x),\ \frac{1}{\sqrt{2\pi}}\, e^{inx} \right\rangle = \frac{1}{\sqrt{2\pi}} \int_{-\pi}^{\pi} f(x)\, e^{-inx}\, dx = \hat f(n).$$
Note that we are identifying a function $f \in L^2([-\pi, \pi])$ with the function in $L^2(\mathbb{R})$ that is equal to f in $[-\pi, \pi]$ and zero elsewhere. This expansion is known as the Fourier Series.
Definition 8.1 (Fourier Series)
Given $f \in L^2([-\pi, \pi])$ we define its Fourier Series as
$$f(x) = \frac{1}{\sqrt{2\pi}} \sum_{n=-\infty}^{\infty} \hat f(n)\, e^{inx}. \qquad (20)$$

Challenge 8.2. Try to show Parseval's theorem: For $f \in L^2([-\pi, \pi])$,
$$\|f\|^2 = \sum_{n \in \mathbb{Z}} |\hat f(n)|^2.$$
[4] The ETH Math BSc students see the proof of this in Analysis IV; others can see it in any of several excellent books on the theory of Hilbert spaces, functional analysis, or Fourier theory. A very good example is [SS03].
[5] Warning: In Physics it is more common to use the convention $\langle f, g\rangle = \int_{-\pi}^{\pi} \overline{f(x)}\, g(x)\, dx$. We use the classical convention in Mathematics, $\langle f, g\rangle = \int_{-\pi}^{\pi} f(x)\, \overline{g(x)}\, dx$.

8.2 Shannon Sampling Theorem

The Shannon Sampling Theorem is a key result in Signal Processing. In this section, we will consider functions $f: t \in \mathbb{R} \mapsto f(t) \in \mathbb{C}$, which we interpret as a signal, e.g. the sound of a piece of music, that is a function of the time $t \in \mathbb{R}$. Sometimes the functions are real-valued, although the theory below is naturally presented in the more general case of complex-valued functions.
Recall definition (12). Plancherel's Theorem (14) states that for $f \in L^2(\mathbb{R})$,
$$\int_{\mathbb{R}} |f(t)|^2\, dt = \int_{\mathbb{R}} |\hat f(\xi)|^2\, d\xi. \qquad (21)$$

When talking about signals, the quantity $\int_{\mathbb{R}} |f(t)|^2\, dt$ is sometimes called the energy of the signal, while the integrand $|\hat f(\xi)|^2$ on the right-hand side of eq. (14) is sometimes referred to as the spectral density of the signal. By eq. (14), the spectral density at ξ represents how the energy of the signal f is distributed across frequencies (i.e. what is the "contribution" of the frequency $t \mapsto e^{-i\xi t}$).
Bandlimited functions – We can often limit the range of frequencies ξ ∈ R that we consider. This can be motivated by two observations:
• In physical signals, the large majority of the energy is usually spread out over a finite range of frequencies, which we call its bandwidth. Physically, the spectral density of any $f \in L^2(\mathbb{R})$ has to decrease, simply because it is integrable by eq. (14). One can then put a cut-off on values of $|\hat f(\xi)|$ smaller than some threshold: this effectively creates a signal whose frequencies lie in a finite range [−Λ, Λ], for some Λ > 0.
• The observations of the signal we can make also effectively limit its bandwidth. Think for instance of the human ear or eye; the eye, for example, can only see light in the wavelength range of (approximately) 380 to 750 nanometers. Effectively, we are only observing a cut-off of the signal with a finite bandwidth.
Definition 8.2 (Bandlimited function)
For any Ω > 0, the space of Ω-bandlimited functions $B_\Omega$ is the set of all $f \in L^1(\mathbb{R}) \cap L^2(\mathbb{R})$ such that f is continuous and $\hat f(\xi) = 0$ for all $\xi \notin [-\Omega\pi, \Omega\pi]$.

Remark 8.1. An important theorem in Fourier analysis is the Paley-Wiener theorem, which relates decay properties of f(t) when |t| → ∞ with the analyticity of $\hat f(\xi)$ [6]. In this context, bandlimited functions possess strong regularity properties (in particular they are $C^\infty$), since they are the inverse Fourier transform of compactly-supported functions.

Whittaker-Kotelnikov-Shannon Sampling Theorem – We can now state the main theorem of


this section:
Theorem 8.1 (Whittaker-Kotelnikov-Shannon)
Let Ω > 0 and $f \in B_\Omega$. Then for all $t \in \mathbb{R}$:
$$f(t) = \sum_{k \in \mathbb{Z}} f\!\left(\frac{k}{\Omega}\right) \frac{\sin(\pi(\Omega t - k))}{\pi(\Omega t - k)}. \qquad (22)$$
Moreover we have:
$$\int_{\mathbb{R}} |f(t)|^2\, dt = \frac{1}{\Omega} \sum_{k \in \mathbb{Z}} f\!\left(\frac{k}{\Omega}\right)^2.$$

[6] For example, if $t f(t)$ is integrable, then one can show that $\hat f'(\xi) = (2\pi)^{-1/2} \int_{\mathbb{R}} (-it)\, f(t)\, e^{-it\xi}\, dt$.

Proof of Theorem 8.1 – We prove here the theorem for Ω = 1 for clarity of exposition. The proof for any Ω > 0 follows with straightforward adaptations (which involve factors of Ω in many places and make the mathematical expressions less elegant, while being conceptually the same). Since $f \in B_1$, $\hat f \in L^2([-\pi, \pi])$. The idea is to consider the Fourier Series of $\hat f$. We have
$$\hat f(\xi) = 1_{\{\xi \in [-\pi, \pi]\}}\, \frac{1}{\sqrt{2\pi}} \sum_{n \in \mathbb{Z}} a_n\, e^{in\xi}. \qquad (23)$$
Moreover we have:
$$a_n = \frac{1}{\sqrt{2\pi}} \int_{-\pi}^{\pi} \hat f(\xi)\, e^{-in\xi}\, d\xi = \frac{1}{\sqrt{2\pi}} \int_{\mathbb{R}} \hat f(\xi)\, e^{i(-n)\xi}\, d\xi = f(-n), \qquad (24)$$
where we used the Fourier inversion formula in the last step. Indeed, since $\hat f$ is continuous (this is easy to see since $f \in L^1$) and is compactly supported, we also have $\hat f \in L^1(\mathbb{R})$. Furthermore, by the Fourier inversion formula, we have, for any $t \in \mathbb{R}$,
$$f(t) = \frac{1}{\sqrt{2\pi}} \int_{\mathbb{R}} \hat f(\xi)\, e^{it\xi}\, d\xi,$$
and using (23) and (24) we have
$$f(t) = \frac{1}{\sqrt{2\pi}} \int_{-\pi}^{\pi} \left( \frac{1}{\sqrt{2\pi}} \sum_{n \in \mathbb{Z}} f(-n)\, e^{in\xi} \right) e^{it\xi}\, d\xi.$$
Using Fubini and the change of indexing n ↔ −n, we have
$$f(t) = \frac{1}{2\pi} \sum_{n \in \mathbb{Z}} f(n) \int_{-\pi}^{\pi} e^{-in\xi}\, e^{it\xi}\, d\xi.$$
A simple calculation gives
$$\frac{1}{2\pi} \int_{-\pi}^{\pi} e^{-in\xi}\, e^{it\xi}\, d\xi = \frac{\sin(\pi(t - n))}{\pi(t - n)},$$
which completes the proof of the first part of the Theorem. In Signal Processing this function is usually referred to as $\mathrm{sinc}(x) := \frac{\sin(x)}{x}$ (with sinc(0) = 1).
The second part of the theorem is obtained using that $\sum |a_k|^2 = \sum |f(k)|^2 = \int |\hat f(\xi)|^2\, d\xi$ and the Plancherel theorem. □

Remark 8.2. Theorem 8.1 shows that when f is Ω-bandlimited, it is uniquely determined (and can be reconstructed) from discrete samples taken at the frequency Ω (i.e. samples at the points of Z/Ω). This is known as the Nyquist rate. Conversely, Theorem 8.1 also shows that one can transmit a square-summable sequence of numbers $(a_k)$ at a frequency Ω by representing them as the samples of an Ω-bandlimited function for which we have an explicit form. The Nyquist rate is actually optimal, in the sense that a stable reconstruction of $f \in B_\Omega$ from samples taken with frequency ω < Ω is in general impossible. This was shown by Landau in a beautiful paper [Lan67].
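The reconstruction formula (22) can be tested numerically. In the sketch below (an added illustration for Ω = 1; the coefficients and sample range are arbitrary choices of ours), we build a 1-bandlimited signal as a finite combination of shifted sinc functions, sample it on the integers, and recover it at off-grid points via (22).

import numpy as np

def sinc_pi(x):
    return np.sinc(x)                # numpy's sinc is sin(pi x)/(pi x), with sinc(0) = 1

rng = np.random.default_rng(0)
ks = np.arange(-20, 21)              # sampling locations (the integers, since Omega = 1)
a = rng.standard_normal(ks.size)     # the samples f(k); any square-summable choice works

def f(t):
    # a 1-bandlimited signal whose samples at the integers are exactly the a_k
    t = np.atleast_1d(t)[:, None]
    return (a[None, :] * sinc_pi(t - ks[None, :])).sum(axis=1)

t_test = np.linspace(-5.0, 5.0, 7) + 0.3                                        # off-grid points
recon = (f(ks)[None, :] * sinc_pi(t_test[:, None] - ks[None, :])).sum(axis=1)   # formula (22)
print(np.max(np.abs(recon - f(t_test))))                                        # essentially zero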

9 The Discrete Fourier Transform (24.03.2023)
We will now introduce a related object that will appear again later in the course, the Discrete Fourier
Transform.
Let us consider Fourier Series ((20) and Definition 8.1) on a grid $x = \frac{k}{N} 2\pi$ for an integer N > 0 and $k = 0, \dots, N-1$. In this section it eases notation to identify f with a function on $[0, 2\pi]$ (both can be identified with a 2π-periodic function on $\mathbb{R}$).
$$f\!\left(k \frac{2\pi}{N}\right) = \frac{1}{\sqrt{2\pi}} \sum_{n=-\infty}^{\infty} \hat f(n)\, e^{\left(i n \frac{k}{N} 2\pi\right)} = \frac{1}{\sqrt{2\pi}} \sum_{n=-\infty}^{\infty} \hat f(n)\, e^{\left(i \frac{nk}{N} 2\pi\right)}. \qquad (25)$$
Since, for any integer a, $e^{\left(i \frac{nk}{N} 2\pi\right)} = e^{\left(i \frac{(n+aN)k}{N} 2\pi\right)}$, there are only N different exponentials in the sum.
We can rewrite the sum as
$$f\!\left(k \frac{2\pi}{N}\right) = \frac{1}{\sqrt{2\pi}} \sum_{n=0}^{N-1} e^{\left(i \frac{nk}{N} 2\pi\right)} \left[ \sum_{a=-\infty}^{\infty} \hat f(n + aN) \right]. \qquad (26)$$
Let us ease notation by taking $\omega_N = e^{-\frac{2\pi i}{N}}$ to be the N-th root of unity, $x \in \mathbb{C}^N$ given by $x_k = f\!\left(k \frac{2\pi}{N}\right)$ and $y \in \mathbb{C}^N$ given by $y_k = \frac{1}{\sqrt{2\pi}} \sum_{a=-\infty}^{\infty} \hat f(k + aN)$. Then
$$x = T y, \qquad (27)$$
where $T \in \mathbb{C}^{N\times N}$ is given by $T_{ab} := \omega_N^{-(a-1)(b-1)}$. A multiple of its Hermitian conjugate (also referred to as the adjoint) is the celebrated Discrete Fourier Transform.
Definition 9.1 (Discrete Fourier Transform)
The matrix $F \in \mathbb{C}^{N\times N}$ given by $F_{ab} = N^{-1/2}\, \omega_N^{(a-1)(b-1)} = N^{-1/2}\, e^{-\frac{2\pi i (a-1)(b-1)}{N}}$ is known as the Discrete Fourier Transform (DFT) Matrix.

It shares many of the properties of the objects described above (Fourier Transform and Fourier Series). Since it is a linear transformation in finite dimensions (a matrix), many of its properties are easily described in terms of classical matrix properties. For example, the fact that F is a unitary matrix immediately implies that F is an isometry (a Parseval/Plancherel-style theorem) and that its inverse is $F^*$ (the analogue of the Fourier inversion formula). Notice also that T above, in (27), is given by $T = \sqrt{N}\, F^*$.
Challenge 9.1. Show that the matrix F defined above is a Unitary matrix, i.e. F ∗ F = I.
Remark 9.1. As we mentioned a couple of lectures ago, one of the reasons Fourier theory is so
ubiquitous is that the (Discrete) Fourier Transform essentially corresponds to the change of basis
(of the space of functions) that diagonalizes translations. This can be readily viewed in the discrete
setting. The unitary matrix F simultaneously diagonalizes shift matrices (and thus all circulant
matrices — see the challenge below). This observation allows one to develop Fourier Theory to other
groups, which is tightly connected to “Representation Theory of Groups” (in particular, “characters of
Abelian groups"). Viewed in this abstract algebraic light, the Fourier Transform, Fourier Series, and the Discrete Fourier Transform correspond to different groups (translations in R, cyclic translations in S¹ (the torus [−π, π]), and cyclic translations in Z/nZ). If you are interested in learning more,
look up also “Harmonic Analysis”, “Pontryagin Duality”, “Spherical Harmonics”, and “Peter–Weyl
Theorem”.
Challenge 9.2. A matrix M is circulant if Mij = Mkl whenever i − j = k − l.
1. Show that for any circulant M we have that F M F ∗ is a diagonal matrix.
2. What are the diagonal entries of F M F ∗ ?

10 Graphs and Networks (31.03.2023)
In this section we will study networks, also called graphs.
Definition 10.1 (Graph)
A graph is a mathematical object consisting of a set of vertices V and a set of edges $E \subseteq \binom{V}{2}$. We will focus on undirected graphs. We say that i ∼ j (i is connected to j) if $(i, j) \in E$. We assume graphs have no self-loops, i.e. $(i, i) \notin E$ for all i.

In what follows the graph will have n nodes (|V| = n). It is sometimes useful to consider a weighted graph, in which an edge (i, j) has a non-negative weight $w_{ij}$. Essentially everything remains the same when considering weighted graphs; we focus on unweighted graphs to lighten the notation (see Chapter 4 in [BSS23] for a similar treatment that includes weighted graphs).
Definition 10.2 (Degree and d-regular graph)
The degree of a node i, deg(i), is the number of neighbors of node i. A graph is said to be d-regular
if deg(i) = d for all i ∈ V .

In this course we will focus on d-regular graphs, as this will make some of the exposition and deriva-
tions easier. Conceptually, not much changes when the graphs are not regular, once the objects are
normalized in the appropriate way (you can see e.g. Chapter 4 in [BSS23]).
A useful way to represent a graph is via its adjacency matrix. Given a graph G = (V, E) on n nodes (|V| = n), we define its adjacency matrix $A \in \mathbb{R}^{n\times n}$ as the symmetric matrix with entries
$$A_{ij} = \begin{cases} 1 & \text{if } (i, j) \in E, \\ 0 & \text{otherwise.} \end{cases}$$
Notice that for d-regular graphs, $A\mathbf{1} = d\mathbf{1}$, where $\mathbf{1}$ is the all-ones vector.
Proposition 10.1 (Spectral norm of a d-regular graph)
For A the adjacency of a d-regular graph, ∥A∥ ≤ d.

Challenge 10.1. Prove Proposition 10.1.


We denote by $\lambda_1(A) \ge \cdots \ge \lambda_n(A)$ the eigenvalues of A. Note that Proposition 10.1 means that $\lambda_1(A) = d$ and that the leading eigenvector of A is $v_1 = \frac{1}{\sqrt{n}}\mathbf{1}$.

Remark 10.1. Note that the matrix $K = I_n + \frac{1}{d} A$ is PSD. Motivated by the discussion a couple of weeks ago on Kernel PCA, it would be natural to do PCA on this matrix in an attempt to "draw" the graph in a low-dimensional space (after discarding the first "boring" principal component $v_1 = \frac{1}{\sqrt{n}}\mathbf{1}$). This has many names (they are slightly different variants that end up being the same in the case of regular graphs); it is known as, among other things, "Laplacian eigenmaps" and "Diffusion Maps" (see [BSS23]). If you have the idea to cluster the PCA-projected data points using k-means, you basically rediscover "Spectral Clustering"! More below.
Challenge 10.2. For which graphs do we have that λ2 (A) = λ1 (A)?
Challenge 10.3. For which graphs do we have that |λn (A)| = |λ1 (A)|?
Exploratory Challenge 10.4. For a d-regular graph, we call the Fiedler value
$$f_G := \max\{|\lambda_2(G)|, |\lambda_n(G)|\}.$$
Graphs with small Fiedler value are called Expanders, and are very important in many areas of Mathematics, Computer Science, and Engineering. Graphs for which $f_G \le 2\sqrt{d-1}$ are called Ramanujan graphs. The first constructions were based on Number Theory. To this day, we still don't know whether they exist for all degrees d, so here is a fascinating open problem:

• Is it true that for all integers d ≥ 3, and all integers $n_0$, there is a d-regular graph on $n \ge n_0$ nodes satisfying $f_G \le 2\sqrt{d-1}$?

Challenge 10.5. It is true that $2\sqrt{d-1}$ in the Exploratory Challenge above is unimprovable (this is known as the Alon-Boppana Theorem). A weaker version of this theorem is (relatively) easy to show, try it: show that, for any d-regular graph, we have $f_G \ge \sqrt{d-1}$.

A few definitions will be useful.


Definition 10.3 (Cut and Connectivity)
Given a subset S ⊆ V of the vertices, we call $S^c := V \setminus S$ the complement of S and we define
$$\mathrm{cut}(S) := \sum_{i \in S} \sum_{j \in S^c} 1_{(i,j)\in E},$$
as the number of edges "cut" by the partition $(S, S^c)$, where $1_X$ is the indicator of X. Furthermore, we say that a graph G is disconnected if there exists $\emptyset \subsetneq S \subsetneq V$ such that $\mathrm{cut}(S) = 0$.

It is useful to consider the following quadratic form on $\mathbb{R}^n$:
$$Q(x) := \sum_{(i,j)\in E} (x_i - x_j)^2.$$

Definition 10.4 (Graph Laplacian)


The symmetric matrix associated with this Quadratic Form is the celebrated Graph Laplacian L.
In other words, L ∈ Rn×n is the symmetric matrix that satisfies Q(x) = x⊤ Lx for all x ∈ Rn .

Proposition 10.2 (Properties of the Graph Laplacian)


Let G = (V, E) be a d-regular graph. The following three definitions of its graph Laplacian $L_G \in \mathbb{R}^{n\times n}$ are equivalent:

(i) $L_G$ is the symmetric matrix such that, for all $x \in \mathbb{R}^n$,
$$x^\top L_G x = \sum_{(i,j)\in E} (x_i - x_j)^2.$$

(ii) $L_G$ is given by
$$L_G = \sum_{(i,j)\in E} (e_i - e_j)(e_i - e_j)^\top,$$
where $e_i$ is the i-th element of the canonical basis.

(iii) $L_G = d I_n - A$.

Challenge 10.6. Prove Proposition 10.2.

Remark 10.2. Notice that, in the definition of the graph Laplacian, the matrix A appears with a negative sign; therefore the largest eigenvalues of A become the smallest ones of $L_G$. Since $L_G \succeq 0$, the eigenvalues of $L_G$ are usually ordered from smallest to largest, $\lambda_1(L_G) \le \cdots \le \lambda_n(L_G)$. Note that $\lambda_1(L_G) = 0$.
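The following short NumPy sketch (an added illustration; the choice of the cycle graph is ours) builds the adjacency matrix and the Laplacian $L_G = dI_n - A$ of a small d-regular graph, and checks the quadratic-form identity of Proposition 10.2 and the fact that $\lambda_1(L_G) = 0$.

import numpy as np

n, d = 10, 2
A = np.zeros((n, n))
for i in range(n):                                  # the cycle on n nodes, a 2-regular graph
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0

L = d * np.eye(n) - A                               # graph Laplacian, Proposition 10.2 (iii)

x = np.random.default_rng(6).standard_normal(n)
quad = sum((x[i] - x[j]) ** 2 for i in range(n) for j in range(i + 1, n) if A[i, j])
print(np.isclose(x @ L @ x, quad))                  # True: x^T L_G x = sum over edges of (x_i - x_j)^2

evals = np.sort(np.linalg.eigvalsh(L))
print(evals[0], evals[1])                           # lambda_1 = 0; lambda_2 > 0 since the cycle is connected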

11 Graph Cuts and Spectral Graph Theory (31.03.2023 - 21.04.2023)
If S ⊂ V and $x = 1_S - 1_{S^c}$ (a vector that takes the value 1 on S and −1 on $S^c$), then (show it!)
$$\mathrm{cut}(S) = \frac{1}{4} x^\top L_G x = \frac{1}{4} x^\top (d I_n - A) x = \frac{dn}{4} - \frac{1}{4} x^\top A x.$$
When n is an even number, the minimum bisection of a graph, $\mathrm{MinBis}_G$, is the minimum number of edges that are cut by a balanced partition of the nodes of the graph:
$$\mathrm{MinBis}_G = \min_{\substack{S \subseteq V, \\ |S| = n/2}} \mathrm{cut}(S) = \min_{\substack{S \subseteq V,\ |S| = n/2, \\ x = 1_S - 1_{S^c}}} \left( \frac{dn}{4} - \frac{1}{4} x^\top A x \right) = \frac{dn}{4} - \frac{1}{4} \max_{\substack{x \in \{\pm 1\}^n \\ x \perp \mathbf{1}}} x^\top A x. \qquad (28)$$

Notice that, by the variational principle for eigenvalues (Courant-Fischer) we have
$$\lambda_2(A) = \max_{\substack{\|y\|_2 = 1 \\ y \perp \mathbf{1}}} y^\top A y. \qquad (29)$$
For any $x \in \{\pm 1\}^n$ such that $x \perp \mathbf{1}$, the vector $z = \frac{x}{\sqrt{n}}$ satisfies the constraints in (29). This means that the search space of (29) is larger than the one in (28), so we must have
$$\max_{\substack{\|y\|_2 = 1 \\ y \perp \mathbf{1}}} y^\top A y \ge \frac{1}{n} \max_{\substack{x \in \{\pm 1\}^n \\ x \perp \mathbf{1}}} x^\top A x.$$

We have just proved the following Theorem, which is a first instance of a rigorous connection between
the geometry of G and the spectrum of A.

Theorem 11.1 (Min Bisection and spectrum of the graph)


For G a d-regular graph on n nodes (with n even) we have
$$\mathrm{MinBis}_G = \min_{\substack{S \subseteq V, \\ |S| = n/2}} \mathrm{cut}(S) \ge \frac{n}{4}\left[ d - \lambda_2(A) \right].$$

Let us suppose that the graph G does indeed have a non-trivial [7] partition of its nodes $(S, S^c)$ with a small number of edges connecting nodes in S with nodes in $S^c$, i.e. a small value of cut(S). If cut(S) = 0 then the graph is disconnected by definition. We will investigate what happens if the cut is small, but not necessarily zero, i.e. we assume that the graph is connected. We define the Ratio Cut as follows.
Definition 11.1 (Ratio Cut)
Let G = (V, E) be a d-regular graph. Given a vertex partition $(S, S^c)$, the Ratio Cut of S is defined as:
$$R(S) := \frac{\mathrm{cut}(S)}{|S|} + \frac{\mathrm{cut}(S)}{|S^c|}.$$
We call Ratio Cut of G the minimal R(S) over non-trivial partitions: $R_G := \min_{\emptyset \subsetneq S \subsetneq V} R(S)$.

Recall that we ordered the eigenvalues of LG = dIn − A as:


0 = λ1 (LG ) ≤ λ2 (LG ) ≤ · · · ≤ λn (LG ).
We will show the following relationship between the second eigenvalue (also called spectral gap) λ2 (LG )
and the ratio cut:
[7] Non-trivial here simply means that neither part is the empty set.

Theorem 11.2 (Ratio cut and spectral gap)
Let G = (V, E) be a d-regular graph. Then

λ2 (LG ) ≤ RG .

Remark 11.1. Since λ2 (LG ) = d−λ2 (A), notice that Theorem 11.2 implies Theorem 11.1. Moreover,
we recover that disconnected graphs have λ2 (LG ) = 0 (and the converse is true, see Challenge 10.2).
Remark 11.2. As everything we describe in this chapter, the Ratio Cut can be generalized to non-
regular graphs by defining the notion of volume of a set of vertices, and it is then known as the
Normalized Cut. In the case of d-regular graphs the volume essentially reduces to the cardinality of
the set, and the Normalized Cut to the Ratio Cut. For more details on extensions to non-regular
graphs, see e.g. the notes of the previous years [BZ22].
Proof of Theorem 11.2 – The key idea in this proof is that of a relaxation: a complicated minimization problem is lower bounded by taking the minimization over a larger, but simpler, set. By the Courant-Fischer variational principle of eigenvalues and Proposition 10.2 we know that
$$\lambda_2(L_G) = \min_{\substack{\|z\|=1, \\ z \perp \mathbf{1}_n}} z^\top L_G z = \min_{\substack{\|z\|=1, \\ z \perp \mathbf{1}_n}} \sum_{(i,j)\in E} (z_i - z_j)^2.$$

The key argument is that the Ratio Cut will correspond to the same minimum when we restrict the vector z to be of the form $z = a 1_S + b 1_{S^c}$, i.e. $z \in \{a, b\}^n$ for some $a, b \in \mathbb{R}$. More precisely, for a non-trivial subset S ⊂ V, let us consider the vector $y \in \mathbb{R}^n$ such that
$$y_i = \begin{cases} a & \text{if } i \in S, \\ b & \text{if } i \in S^c. \end{cases}$$
For the constraints $\|y\| = 1$ and $y \perp \mathbf{1}_n$ to be satisfied we must have
$$\begin{cases} a^2 |S| + b^2 |S^c| = 1, \\ a|S| + b|S^c| = 0, \end{cases}$$
and therefore $a = [|S^c|/(n|S|)]^{1/2}$ and $b = -[|S|/(n|S^c|)]^{1/2}$ (up to a global sign change). Note that
we used |S| + |S c | = n. The rest of the proof proceeds by computing y ⊤ LG y.

$$\begin{aligned}
y^\top L_G y &= \sum_{(i,j)\in E} (y_i - y_j)^2 = (a - b)^2 \sum_{i\in S}\sum_{j\in S^c} 1_{(i,j)\in E} \\
&= \frac{\mathrm{cut}(S)}{n} \left[ \sqrt{\frac{|S^c|}{|S|}} + \sqrt{\frac{|S|}{|S^c|}} \right]^2 = \frac{\mathrm{cut}(S)}{n} \left[ \frac{|S^c|}{|S|} + \frac{|S|}{|S^c|} + 2 \right] \\
&= \frac{\mathrm{cut}(S)}{n} \left[ \frac{|S^c|}{|S|} + \frac{|S|}{|S^c|} + \frac{|S|}{|S|} + \frac{|S^c|}{|S^c|} \right] \\
&= \mathrm{cut}(S) \left[ \frac{1}{|S|} + \frac{1}{|S^c|} \right] = R(S).
\end{aligned}$$
Finally we have:
$$\lambda_2(L_G) = \min_{\substack{\|y\|=1, \\ y \perp \mathbf{1}_n}} \sum_{(i,j)\in E} (y_i - y_j)^2 \le \min_{\substack{\|y\|=1,\ y \perp \mathbf{1}_n \\ y \in \{a,b\}^n \text{ for } a,b \in \mathbb{R}}} \sum_{(i,j)\in E} (y_i - y_j)^2 = \min_{\emptyset \subsetneq S \subsetneq V} R(S). \qquad (30)$$

There are (at least) two consequential ideas of this result:

1. The way cuts of partitions are measured in R(S) promotes somewhat balanced partitions (so that neither |S| nor $|S^c|$ is too small); this turns out to be beneficial to avoid trivial solutions such as partitioning a graph by splitting just one node from all the others.

2. There is an important algorithmic consequence of (30): when we want to cluster a network into two groups, what we want to minimize is the RHS of (30), which is unfortunately computationally intractable (in fact, it is known to be NP-hard). However, the LHS of the inequality is a spectral problem and so computationally tractable. This is the idea behind the popular algorithm of Spectral Clustering (Algorithm 11.1).

Algorithm 11.1 Spectral Clustering


Given a d-regular graph G = (V, E), let v2 be the eigenvector corresponding to the second smallest
eigenvalue of the Laplacian LG . Given a threshold τ ∈ R (one can try all different possibilities, or run
k-means in the entries of v2 for k = 2), set

S = {i ∈ V : v2 (i) ≤ τ }.

Algorithm 11.1 should be thought of as "projecting" the nodes of the graph into a one-dimensional space, before trying to cluster them using this one-dimensional projection.
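Below is a minimal sketch of Algorithm 11.1 (an added illustration). The graph is drawn from a stochastic block model with two planted communities; it is not exactly d-regular, but the recipe, taking the eigenvector of the second smallest Laplacian eigenvalue and thresholding it, is unchanged. All parameter choices are ours.

import numpy as np

rng = np.random.default_rng(7)
n = 100
truth = np.array([0] * (n // 2) + [1] * (n // 2))             # planted partition
p_in, p_out = 0.3, 0.02                                       # edge probabilities within / across
P = np.where(truth[:, None] == truth[None, :], p_in, p_out)
A = (rng.uniform(size=(n, n)) < P).astype(float)
A = np.triu(A, 1)
A = A + A.T                                                   # symmetric adjacency, no self-loops

L = np.diag(A.sum(axis=1)) - A                                # Laplacian (degrees on the diagonal)
evals, evecs = np.linalg.eigh(L)
v2 = evecs[:, 1]                                              # eigenvector of the 2nd smallest eigenvalue

S = v2 <= 0.0                                                 # threshold tau = 0
agreement = max((S == truth).mean(), (S != truth).mean())     # accuracy up to relabeling
print(agreement)                                              # close to 1: the communities are recovered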
Remark 11.3. With this interpretation in mind, Algorithm 11.1 can be generalized to cluster data
into k > 2 clusters. In that case one considers the k − 1 eigenvectors (from the 2nd to the kth) and
to each node i we associate the k − 1 dimensional representation

[v2 (i), v3 (i), · · · , vk (i)]⊤ ,

and use k-means on this representation.


Remark 11.4 (Spectral clustering and Kernel PCA). Spectral clustering should not appear “magical”:
it is simply doing k-means on the representation given by kernel PCA, in which the kernel matrix is
K = dIn + A (which is PSD, and has the same eigenvectors as the Laplacian), that is – up to a shift
– the adjacency matrix of the graph. And the adjacency is the most natural “affinity” kernel one can
design from a graph, so it is very natural to use it to do kernel PCA! See also Remark 10.1. This
is oftentimes referred to as “Diffusion Maps” or “Spectral Embedding”, see for example Chapter 5
in [BSS23].
A natural question is whether one can give a guarantee for spectral clustering: “Does Algorithm 11.1
produce a partition whose ratio cut is comparable with RG ?" Although the proof of such a guarantee
is outside the scope of this course, we will briefly describe it below, and highlight its relation to
the celebrated Cheeger’s Inequality, of which we proved one side (often called the “easy side”) in
Theorem 11.2.
Lemma 11.3 (Spectral clustering guarantee - not proved in the course, see e.g. [BSS23])
There is a threshold τ ∈ R in Algorithm 11.1 producing a partition S such that
$$R(S) \le \sqrt{8 d \lambda_2(L_G)}.$$
Using Theorem 11.2, this implies in particular that
$$R_G \le R(S) \le \sqrt{8 d R_G},$$
giving a guarantee on the performance of Algorithm 11.1.


Lemma 11.3 and Theorem 11.2 imply what is known as Cheeger’s inequality:

Theorem 11.4 (Cheeger’s Inequality)
Let G = (V, E) be a d-regular graph, and recall the definition of $R_G$ in Definition 11.1. The following holds:
$$\lambda_2(L_G) \le R_G \le \sqrt{8 d \lambda_2(L_G)}.$$

Remark 11.5. Cheeger's inequality is often stated in a more general way, not using the Ratio Cut $R_G$ but the Cheeger Cut $h_G = \min_{\emptyset \subsetneq S \subsetneq V} [\mathrm{cut}(S)/\min\{|S|, |S^c|\}]$, in which case one can show $(1/2)\lambda_2 \le h_G \le \sqrt{2 d \lambda_2}$. Like everything we described, it can also be generalized to non-regular graphs, taking care of additional technicalities that arise.

Cheeger's inequality was first established for manifolds by Jeff Cheeger in 1970 [Che70]; the graph version is due to Noga Alon and Vitaly Milman [Alo86, AM85] in the mid 80s. The upper bound in Cheeger's inequality (corresponding to Lemma 11.3) is more difficult to prove and outside the scope of this course; it is often referred to as the "difficult part" of Cheeger's inequality. There are several proofs of this inequality (see [Chu10] for four different proofs)! You can also see [BSS23] for a proof in notation close to these notes, although for a more general case than d-regular graphs.

12 Parsimony, compressed sensing and sparse recovery
(21.04.2023 - 28.04.2023)
12.1 Parsimony
In this section 12.1 of the notes we recall some general observations we made in the very first lecture.
Parsimony is an important principle in machine learning. The key idea is that oftentimes one wants to
learn (or recover) an object with a particular structure. It is also important in supervised learning, the
key idea there being that classifiers (or regression rules, as you will see in a Statistics course) that are
simple are in theory more likely to generalize to unseen data. We may see some of these phenomena
in the very last part of the course, see also the notes of the last years [BZ22].
Observations of this type date back at least eight centuries, the most notable instance being William of Ockham's celebrated Occam's Razor: "Entia non sunt multiplicanda praeter necessitatem (Entities must not be multiplied beyond necessity)", which is today used as a synonym for parsimony.
One example discussed in last year’s notes [BZ22] is recommendation systems, in which the goal is to
make recommendations of a product to users based both on the particular user's scores of other items, and the scores other users give to items. The score matrix whose rows correspond to users, columns to items, and entries to scores is known to be low rank, and this form of parsimony is key to perform
“matrix completion”, meaning to recover (or estimate) unseen scores (matrix entries) from the ones
that are available.
A simpler form of parsimony is sparsity (i.e. having few non-zero entries). Not only is sparsity present
in many problems, including signal and image processing, but the mathematics arising from its study
are crucial also to solve problems such as matrix completion. In what follows we will use image
processing as the driving motivation.

Sparse recovery – Most of us have noticed how saving an image in JPEG dramatically reduces the space it occupies on our hard drives (as opposed to file types that save the value of each pixel in the image, e.g. TIFF or BMP). The idea behind these compression methods is to exploit known structure in the images; although our cameras will record the pixel value (even three values in RGB) for each pixel, it is clear that most collections of pixel values will not correspond to pictures that we would expect to see. This special structure tends to be exploited via sparsity. Indeed, natural images are known to be sparse in certain bases (such as the wavelet basis) and this is the core idea behind JPEG (actually, JPEG2000; JPEG uses a different basis). There is an example illustrating this in the
jupyter notebook accompanying the class.
Let us think of x ∈ RN as the signal corresponding to the image already in the basis for which it is
sparse, meaning that it has few non-zero entries. We use the notation ∥x∥0 for the number of non-zero entries of x; it is common to refer to this as the ℓ0 norm, even though it is not actually a norm. Let
us assume that x ∈ RN is s-sparse, i.e. ∥x∥0 ≤ s. Usually we will assume s ≪ N . This means that,
when we take a picture, our camera makes N measurements (each corresponding to a pixel) but then,
after an appropriate change of basis, it keeps only s ≪ N non-zero coefficients and drops the others.
This motivates the question: "If only a few degrees of freedom are kept after compression, why not measure in a more efficient way and take considerably fewer than N measurements?". This question is at the heart of Compressed Sensing. It is particularly important in MRI imaging, as fewer measurements potentially mean less measurement time. The following book is a great reference on Compressed Sensing [FR13].

12.2 Compressed Sensing and Sparse Recovery

More precisely, given an s-sparse vector x ∈ K^N (with K ∈ {R, C}), we take M linear measurements y_i = ϕ_i^⊤ x, with s < M ≪ N , and measurement (or sensing) vectors {ϕ_i}_{i=1}^M. Our goal is to recover x from the underdetermined system

y = Φ x.

Here, Φ ∈ K^{M×N} is the matrix whose i-th row is ϕ_i^⊤. Since the system is underdetermined and we
know x is sparse, the natural thing to try in order to recover x is to solve
min ∥z∥0   s.t. Φz = y,    (31)
and hope that the optimal solution z corresponds to the signal in question x.
Remark 12.1. There is another useful way to think about (31), which we will discuss later in the
section on finite frame theory. We can think of the columns of Φ as a redundant “dictionary”. In that
case, the goal becomes to represent a vector y ∈ KM as a linear combination of the dictionary elements.
To leverage the redundancy a common choice is to use the sparsest representation, corresponding to
solving problem (31).
Definition 12.1 (Spark)
The spark of a matrix Φ is the minimum number of columns of the matrix that make up a linearly
dependent set.

Challenge 12.1. For a matrix Φ, show that spark(Φ) ≤ rk(Φ) + 1. Can you prove it in a single line?
We can give a first guarantee for the solution of eq. (31) to be actually x:
Proposition 12.1
If x is s-sparse and spark(Φ) > 2s then x is the unique solution to (31) for y = Φx.

Proof of Prop 12.1 – Assume that there exists x′ ̸= x such that y = Φx′ and ∥x′ ∥0 ≤ ∥x∥0 ≤ s.
Then Φ[x − x′ ] = 0. Since ∥x − x′ ∥0 ≤ 2s and x − x′ ̸= 0, this implies that there is a set of at most 2s
columns of Φ which are linearly dependent, in contradiction with our assumptions. □

Challenge 12.2. Can you construct Φ with large spark and small number of measurements M ?
There are two significant issues with (31): stability (as the ℓ0 norm is very brittle) and computation. In fact, (31) is known to be a computationally hard problem in general (provided P ̸= NP). Instead, the approach usually taken in sparse recovery is to consider a convex relaxation of the ℓ0 norm, the ℓ1 norm: ∥z∥1 := Σ_{i=1}^N |z_i|. Figure 5 depicts how the ℓ1 norm can be seen as a convex relaxation of the ℓ0 norm and how it promotes sparsity. This motivates one to consider the following optimization
problem (surrogate to (31)):
min ∥z∥1   s.t. Φz = y.    (32)
For (32) to be useful, two things are needed: (i) the solution of it needs to be meaningful (hopefully
to coincide with x) and (ii) (32) should be efficiently solvable. We first consider (ii) in Section 12.2.1,
and then discuss (i) in Section 12.2.2.

Figure 5: A two-dimensional depiction of ℓ0 and ℓ1 minimization. In ℓ1 minimization (the picture on the right), one inflates the ℓ1 ball (the diamond) until it hits the affine subspace of interest; this image conveys how this norm promotes sparsity, due to the pointy corners at sparse vectors.

12.2.1 Computational efficiency


To address computational efficiency we will focus on the real case (K = R) and formulate (32) as a
Linear Program (and thus show that it is efficiently solvable). Let us define ω + as the positive part
of x and ω − as the negative part of x, meaning that x = ω + − ω − and, for each i, either ωi− or ωi+ is
zero. Note that, in that case (for x ∈ RN ),
∥x∥1 = Σ_{i=1}^N [ω_i^+ + ω_i^-] = 1⊤ (ω^+ + ω^-).

Therefore, we are led to consider:


min 1⊤ (ω^+ + ω^-)
s.t. Φ (ω^+ − ω^-) = y,    (33)
     ω^+ ≥ 0,  ω^- ≥ 0,
which is a linear program. It is not difficult to see (prove it!) that the optimal solution of (33) will
indeed satisfy that, for each i, either ωi− or ωi+ is zero and the program above is indeed equivalent
to (32). Since linear programs are efficiently solvable [VB04], this means that (32) can be solved
efficiently.
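
As an illustration, the LP (33) can be solved with any off-the-shelf linear programming solver; the following sketch uses scipy.optimize.linprog (the helper name l1_minimization and the small random instance are ours, chosen only for illustration; recovery of the planted sparse vector is expected, but not guaranteed, at such small sizes).

    import numpy as np
    from scipy.optimize import linprog

    def l1_minimization(Phi, y):
        """Solve min ||z||_1 s.t. Phi z = y through the linear program (33),
        in the split variables (w+, w-) with z = w+ - w-."""
        M, N = Phi.shape
        c = np.ones(2 * N)                       # objective 1^T (w+ + w-)
        A_eq = np.hstack([Phi, -Phi])            # constraint Phi (w+ - w-) = y
        bounds = [(0, None)] * (2 * N)           # w+ >= 0, w- >= 0
        res = linprog(c, A_eq=A_eq, b_eq=y, bounds=bounds)
        w = res.x
        return w[:N] - w[N:]                     # z = w+ - w-

    # toy instance: recover a 1-sparse x from M = 4 random measurements (N = 10)
    rng = np.random.default_rng(0)
    N, M = 10, 4
    Phi = rng.standard_normal((M, N)) / np.sqrt(M)
    x = np.zeros(N); x[3] = 1.0
    print(np.round(l1_minimization(Phi, Phi @ x), 3))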
Remark 12.2. While (32) does not correspond to a linear program in the complex case K = C, it is nonetheless efficient to solve; the key property is that it is a convex problem, but a general discussion about convexity is outside the scope of this course.

12.2.2 Exact recovery via ℓ1 minimization


The goal now is to show that, under certain conditions, the solution of (32) for y = Φx indeed coincides
with x. There are several approaches to this, we refer to [BSS23] for a few alternatives. Here we will
discuss a deterministic approach based on the notion of coherence.
Let S = supp(x) (i.e. xi ̸= 0 ⇔ i ∈ S) and suppose that z ̸= x is an optimal solution of the ℓ1
minimization problem (32) with y = Φx. Let v := z − x, so z = v + x and notice that we must have:
∥v + x∥1 ≤ ∥x∥1 and Φ(v + x) = Φx,
so that Φv = 0. For a vector u ∈ R^N, we define u_S = (u_i)_{i∈S} ∈ R^{|S|}, and we let ∥u∥_S := ∥u_S∥1 = Σ_{i∈S} |u_i|. We have:

∥x∥_S = ∥x∥1 ≥ ∥v + x∥1 = ∥v + x∥_S + ∥v∥_{S^c} ≥ ∥x∥_S − ∥v∥_S + ∥v∥_{S^c},

where the last inequality follows from the reverse triangle inequality. This means that ∥v_S∥1 ≥ ∥v_{S^c}∥1,
but since |S| ≪ N it is unlikely for Φ to have vectors in its nullspace that are this concentrated on
such few entries. This motivates the following definition.
Definition 12.2 (Null Space Property)
Φ is said to satisfy the s-Null Space Property if, for all v ∈ ker(Φ)\{0} (the nullspace of Φ) and all S ⊆ [N ] with |S| = s, we have
∥vS ∥1 < ∥vS c ∥1 .

In the argument above, we have shown that if Φ satisfies the Null Space Property for s, then x will
indeed be the unique optimal solution to (32). In fact, the converse also holds

Theorem 12.2 (NSP and ℓ1 recovery)


The following are equivalent for Φ ∈ KM ×N :

1. For any s-sparse vector x ∈ KN , x is the unique optimal solution of (32) for y = Φx.

2. Φ satisfies the s-Null Space Property.

Challenge 12.3. We proved (1) ⇐ (2) in Theorem 12.2. Can you prove (1) ⇒ (2)?
We now prove the main Theorem of this section, which gives a sufficient condition for exact recovery
via ℓ1 minimization based on the notion of worst-case coherence of a matrix, or more precisely of its
columns. We need first to introduce this notion.
Definition 12.3 (Worst-case coherence)
Given a set of vectors ϕ1 , . . . , ϕN ∈ KM such that ∥ϕk ∥2 = 1 for all k ∈ [N ] we call the worst-case
coherence (sometimes also called dictionary coherence) the quantity

µ := max_{i≠j} |⟨ϕi , ϕj ⟩|.

We call the worst-case coherence of a matrix the worst-case coherence of its column vectors.
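
For later experiments it is convenient to have a small numpy helper computing this quantity (the function name worst_case_coherence is ours):

    import numpy as np

    def worst_case_coherence(Phi):
        """Worst-case coherence of the columns of Phi (Definition 12.3).
        Columns are renormalized to unit norm, in case they are not already."""
        Phi = Phi / np.linalg.norm(Phi, axis=0)
        G = np.abs(Phi.conj().T @ Phi)           # |<phi_i, phi_j>|
        np.fill_diagonal(G, 0.0)                 # ignore the diagonal terms i = j
        return G.max()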
We are now ready to state our main theorem:

Theorem 12.3 (Low coherence and ℓ1 recovery)


If the worst-case coherence µ of a matrix Φ with unit norm column vectors satisfies
s < (1/2) (1 + 1/µ),    (34)
then Φ satisfies the s-NSP.

Proof of Theorem 12.3 – If µ = 0 then the columns of Φ form an orthonormal set, thus ker(Φ) = {0} and Φ trivially satisfies the NSP for any s.
We now focus on µ > 0. Let v ∈ ker(Φ)\{0} and k ∈ [N ]; recall that ϕ_k is the k-th column of Φ. We have

Σ_{l=1}^N v_l ϕ_l = 0,

and so v_k ϕ_k = − Σ_{l≠k} v_l ϕ_l. Since ∥ϕ_k∥ = 1 we have (recall ϕ_k^† ϕ_k = ∥ϕ_k∥² = 1)

v_k = − ϕ_k^† ( Σ_{l≠k} v_l ϕ_l ) = − Σ_{l≠k} v_l (ϕ_k^† ϕ_l).

Thus,

|v_k| ≤ Σ_{l≠k} |v_l (ϕ_k^† ϕ_l)| ≤ µ Σ_{l≠k} |v_l| = µ (∥v∥1 − |v_k|).

This means that for all k ∈ [N ] we have

(1 + µ) |v_k| ≤ µ ∥v∥1.

Finally, for S ⊂ [N ] of size s we have

∥v_S∥1 = Σ_{k∈S} |v_k| ≤ s (µ/(1 + µ)) ∥v∥1 < (1/2) ∥v∥1,

where the last inequality follows from the hypothesis (34) of the Theorem. Since ∥v∥1 = ∥vS ∥1 +∥vS c ∥1
this completes the proof. □

In the next lectures we will study matrices with low worst-case coherence.

Remark 12.3. Different approaches are based on probability theory, and roughly follow the following path: since, due to Theorem 12.2, recovery is formulated in terms of certain vectors not belonging to the nullspace of Φ, if one draws Φ from an ensemble of random matrices the problem reduces to understanding when a random subspace (the nullspace of the random matrix) avoids certain vectors. This is the subject of the celebrated "Gordon's Escape through a Mesh Theorem" (see [BSS23]); you can see versions of this approach also in [CRPW12] or, for an interesting approach based on Integral Geometry, in [ALMT14].

13 Finite frame theory and the Welch bound (05.05.2023)
Motivated by Theorem 12.3 we will now try to build low-coherence matrices. In order to do so we
first introduce some basic elements of finite dimensional frame theory. For a reference on this topic,
see for example the first chapter of the book [Chr16]. Recall that K ∈ {R, C}. We also slightly change notation with respect to the previous section: the usual vector x will live in Kd instead of KN , and we will denote by ϕ1 , · · · , ϕm ∈ Kd the frame vectors.

13.1 Finite frame theory


If m = d and ϕ1 , . . . , ϕd ∈ Kd are a basis then any point x ∈ Kd is uniquely identified by the inner
products bk := ⟨ϕk , x⟩. In particular if ϕ1 , . . . , ϕd ∈ Kd form an orthonormal basis this representation
satisfies a Parseval identity: ∥[⟨ϕk , x⟩]dk=1 ∥ = ∥x∥. Using this identity on x − y yields:

∥{⟨ϕk , x − y⟩}_{k=1}^d∥ = ∥x − y∥   (∀x, y ∈ Kd ).    (35)

This identity ensures stability in the representation: when we perturb x slightly, we only change slightly
the inner products representation {⟨ϕk , x⟩}dk=1 . But what about when the set of vectors {ϕ1 , · · · , ϕm }
is not an orthonormal basis? In particular when m > d?
Redundancy – For instance, in signal processing and communication it is useful to include redun-
dancy. Indeed, if instead of a basis one considers a “redundant” spanning set ϕ1 , . . . , ϕm ∈ Kd with
m > d a few advantages arise: for example, if in a communication channel one of the coefficients bk
gets erased, it might still be possible to reconstruct x. Such sets are sometimes called redundant
dictionaries or overcomplete dictionaries.
Stability – Still, it is important to keep some form of stability of the type of the Parseval identity (35).
While this is particularly important for infinite dimensional vector spaces (more precisely Hilbert
spaces) we will focus our exposition on finite dimensions.
Definition 13.1 (Frame)
A set ϕ1 , . . . , ϕm ∈ Kd is called a frame of Kd if there exist constants 0 < A ≤ B such that, for all
x ∈ Kd :
A ∥x∥² ≤ Σ_{k=1}^m |⟨ϕk , x⟩|² ≤ B ∥x∥².

A and B are called respectively the lower and upper frame bound. The largest possible value of A
and the lowest possible value of B are called the optimal frame bounds.

Challenge 13.1. Show that ϕ1 , . . . , ϕm ∈ Kd is a frame if and only if it spans all of Kd .


Further reading 13.1. In infinite dimensions the situation is considerably more delicate than sug-
gested by Challenge 13.1, and it is tightly connected with the notion of stable sampling from signal
processing. You can see, e.g., [Chr16].
Given a frame ϕ1 , . . . , ϕm ∈ Kd , let
 
Φ := [ ϕ1 · · · ϕm ] .    (36)

The following are classical definitions in the frame theory literature (although for finite dimensions
the objects are essentially just matrices involving Φ and so the definitions are not as important; also
note that we are doing a slight abuse of notation using the same notation for a matrix and the linear
operator it represents – it will be clear from context which object we mean.)

Definition 13.2
Given a frame ϕ1 , . . . , ϕm ∈ Kd , we give the following definitions.

• The operator Φ : K^m → K^d corresponding to the matrix Φ, meaning Φ(c) = Σ_{k=1}^m c_k ϕ_k, is called the Synthesis Operator.

• Its adjoint operator Φ† : K^d → K^m, corresponding to the conjugate-transpose matrix Φ†, meaning Φ†(x) = {⟨x, ϕ_k⟩}_{k=1}^m, is called the Analysis Operator.

• The self-adjoint operator S : K^d → K^d given by S = ΦΦ† is called the Frame Operator.

Challenge 13.2. Show that S ⪰ 0 and that S is invertible.
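
A quick numerical illustration of these definitions (a sketch, with arbitrary sizes): for a generic random frame of R^d, the frame operator S = ΦΦ† has strictly positive eigenvalues, and the quantity Σ_k |⟨ϕk, x⟩|² appearing in Definition 13.1 stays between the extreme eigenvalues of S for every unit-norm x.

    import numpy as np

    rng = np.random.default_rng(1)
    d, m = 3, 7
    Phi = rng.standard_normal((d, m))            # a generic frame of R^d (spans R^d a.s.)

    S = Phi @ Phi.T                              # frame operator S = Phi Phi^dagger
    print(np.linalg.eigvalsh(S))                 # all eigenvalues > 0: S is PSD and invertible

    # the frame inequality of Definition 13.1, tested on random unit-norm vectors x
    x = rng.standard_normal((d, 1000))
    x /= np.linalg.norm(x, axis=0)
    vals = np.sum((Phi.T @ x) ** 2, axis=0)      # sum_k |<phi_k, x>|^2
    print(vals.min(), vals.max())                # within [lambda_min(S), lambda_max(S)]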


The following are interesting (and useful) definitions:
Definition 13.3 (Tight frame)
A frame is called a tight frame if the frame bounds can be taken to be equal A = B.

Challenge 13.3. What can you say about the Frame Operator S for a tight frame?
We recall now the definition of worst-case coherence, which we already gave in the matrix setting, now
in the language of frames (see Definition 12.3):
Definition 13.4 (Worst-case coherence)

(i) A frame ϕ1 , . . . , ϕm ∈ Kd is said to be unit normed (or unit norm) if for all k ∈ [m] we have
∥ϕk ∥ = 1.

(ii) Given a unit norm frame ϕ1 , . . . , ϕm ∈ Kd we call the worst-case coherence (sometimes also
called dictionary coherence) the quantity

µ := max_{i≠j} |⟨ϕi , ϕj ⟩|.

Challenge 13.4. In a very similar way one can define the spark of a frame as the spark of the matrix whose i-th column is given by ϕi , see Definition 12.1. Can you give a relationship between the spark and the worst-case coherence of a frame?

13.2 The Welch bound


Let us now come back to our original motivation: with Theorem 12.3 in mind, in this section we study
the worst-case coherence of frames with the goal of understanding how much savings (in measurements)
one can achieve with the technique described in Section 12. We start with a lower bound, due to Welch
[Wel74].

Theorem 13.1 (Welch Bound)


Let ϕ1 , . . . , ϕm ∈ Kd be a unit norm frame, with m ≥ d. Let µ be its worst case coherence

µ = max_{i≠j} |⟨ϕi , ϕj ⟩| .

Then

µ ≥ √( (m − d) / (d(m − 1)) ).

Proof of Theorem 13.1 – Note that we can assume K = C: indeed, the theorem for K = R will
then follow by simply viewing real vectors as elements of Cd .
Let G be the Gram matrix of the vectors, Gij := ⟨ϕi , ϕj ⟩ = ϕ†i ϕj . In other words, G = Φ† Φ. It is
positive semi-definite and its rank is at most d. Let λ1 , . . . , λd denote the largest eigenvalues of G, in
particular this includes all non-zero ones. We have
(Tr[G])² = ( Σ_{k=1}^d λ_k )² ≤ d Σ_{k=1}^d λ_k² = d Σ_{k=1}^m λ_k² = d ∥G∥_F²,

where the inequality follows from Cauchy-Schwarz between the vector of the λk 's and the all-ones vector. Note that since the vectors ϕi are unit normed, Tr(G) = Σ_{i=1}^m ∥ϕi ∥² = m, thus

Σ_{i,j=1}^m |⟨ϕi , ϕj ⟩|² = ∥G∥_F² ≥ (1/d) (Tr[G])² = m²/d.

Also,

Σ_{i,j=1}^m |⟨ϕi , ϕj ⟩|² = Σ_{i=1}^m |⟨ϕi , ϕi ⟩|² + Σ_{i≠j} |⟨ϕi , ϕj ⟩|² = m + Σ_{i≠j} |⟨ϕi , ϕj ⟩|² ≤ m + (m² − m) µ².

Putting everything together gives:

µ ≥ √( (m²/d − m) / (m² − m) ) = √( (m − d) / (d(m − 1)) ).

Remark 13.2. Notice that in the proof above two inequalities were used; if we track the cases when they are equalities we can see for which frames the Welch bound is tight. The Cauchy-Schwarz inequality is tight when the vector consisting of the first d eigenvalues of G is a multiple of the all-ones vector, which is the case exactly when Φ is a Tight Frame (recall Definition 13.3). The second inequality is tight when all the terms in the sum Σ_{i≠j} |⟨ϕi , ϕj ⟩|² are equal. The frames that satisfy these properties are called ETFs – Equiangular Tight Frames.

14 Equiangular Tight Frames (ETFs) (05.05.2023)
14.1 Definition and maximal size
In this section we continue the analysis started in the last sections, by studying equiangular tight
frames, which are tight frames with the lowest possible worst-case coherence.
Definition 14.1 (Equiangular Tight Frame)
A unit-normed tight frame ϕ1 , · · · , ϕm ∈ Kd is called an Equiangular Tight Frame (ETF) if there
exists µ ≥ 0 such that, for all i ̸= j,

|⟨ϕi , ϕj ⟩| = µ. (37)

Remark 14.1. Note that, as described in Remark 13.2, the only possible value of µ for an ETF is given by the Welch bound, i.e. µ = √( (m − d) / (d[m − 1]) ).
Proposition 14.1 (Maximum size of an ETF)
Let ϕ1 , . . . , ϕm by an equiangular tight frame in Kd . Then:

• If K = C then m ≤ d².

• If K = R then m ≤ d(d + 1)/2.

Proof of Proposition 14.1 – We start with a remark. Note that any real matrix M ∈ R^{d×d} can be written as a vector, called vec(M ) ∈ R^{d²}, which collects all its entries8 . Moreover
⟨vec(M1 ), vec(M2 )⟩ = Tr[M1 M2⊤ ].
Note that the vectors vec(M ) for M symmetric actually live in a subspace of dimension d(d + 1)/2. Similarly any complex matrix N ∈ C^{d×d} can be written as a complex vector vec(N ) ∈ C^{d²}, and the inner product is ⟨vec(N1 ), vec(N2 )⟩ = Tr[N1 N2† ]. And the vectors vec(N ) for N Hermitian actually live in a real subspace of dimension d².
We now come back to the proof, both in the real and complex case. Let ψi := vec(ϕi ϕi† ). It is easy to check that these are unit-norm vectors in K^{d²}, and moreover, their inner products are (for i ≠ j):

ψi† ψj = ⟨vec(ϕi ϕi† ), vec(ϕj ϕj† )⟩ = Tr((ϕi ϕi† )(ϕj ϕj† )† ) = |⟨ϕi , ϕj ⟩|² = µ².
This means that their Gram matrix H is given by
H = (1 − µ2 )Im + µ2 11⊤ .
Since µ < 1 we have rk(H) = m. However, we can also write H = Ψ† Ψ, for the matrix Ψ ∈ K^{d²×m} whose i-th column is given by ψi . Therefore, rk(H) ≤ rk(Ψ) (here the rank means the dimension over R of the image space of the matrix). But as we saw in the remark above, the image space of Ψ has real dimension:
• For K = R, at most d(d + 1)/2.

• For K = C, at most d².

Thus m ≤ d² for K = C, and m ≤ d(d + 1)/2 for K = R. □

Further reading 14.2. Equiangular Tight Frames in Cd with m = d2 are important objects in
Quantum Mechanics, where they are called SIC-POVM: Symmetric, Informationally Complete, Positive
Operator-Valued Measure. It is a central open problem to prove that they exist in all dimensions d,
see Open Problem 6.3. in [Ban16] (the conjecture that they do exist is known as Zauner’s Conjecture).
8. Here it is not important how the indexing of the entries is done, as long as it is consistent throughout.

14.2 First examples of low-coherence frames
Building ETFs with many vectors is a very non-trivial task. We will devote the next Section 15 to the construction of an ETF that arises from Number Theory, and you will also explore in Exercise Class (if time permits) connections with spectral graph theory (a very nice example of how different fields of mathematics interact when studying "data science"!).
On the other hand, there are simple families of vectors with worst-case coherence µ ∼ 1/√d.

Example 14.3 (Discrete Fourier Transform). We recall the definition of the Discrete Fourier transform
matrix F ∈ Cd×d (see Definition 9.1):
F_jk = (1/√d) exp [−2πi(j − 1)(k − 1)/d] .

The columns (F1 , · · · , Fd ) of F form an orthonormal basis of C^d. Notably, any Fk has an inner product of |⟨Fk , el ⟩| = 1/√d with any element el of the canonical basis. This means that the d × 2d matrix
Φ := [Id F ] (38)
has worst-case coherence 1/√d (to be compared with the Welch bound of 1/√(2d − 1) here). Theorem 12.3 then guarantees that, for Φ given by (38), ℓ1 minimization achieves exact recovery for sparsity levels

s < (1/2) (1 + √d).
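
A quick numerical check of this example (a sketch; the dimension d = 16 is arbitrary): the coherence of Φ = [Id F] equals 1/√d, slightly above the Welch bound for m = 2d unit-norm vectors.

    import numpy as np

    d = 16
    F = np.exp(-2j * np.pi * np.outer(np.arange(d), np.arange(d)) / d) / np.sqrt(d)   # DFT matrix
    Phi = np.hstack([np.eye(d), F])                         # the d x 2d frame of eq. (38)
    m = Phi.shape[1]
    G = np.abs(Phi.conj().T @ Phi)
    np.fill_diagonal(G, 0.0)
    mu = G.max()
    welch = np.sqrt((m - d) / (d * (m - 1)))
    print(mu, 1 / np.sqrt(d), welch)                        # mu = 1/sqrt(d), above the Welch bound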
Remark 14.4. While the DFT construction above has a "redundancy" coefficient m/d = 2, there are many constructions of unit norm frames with low coherence, with redundancy coefficient much larger than 2. There is a whole field of research involving these constructions; see for instance this article listing all constructions known in 2016 [FM15]. You can also take a look at the PhD thesis of Dustin Mixon [Mix12], which describes part of this field and discusses connections to Compressed Sensing; Dustin Mixon also has a blog in part devoted to these questions [Mix].

Exploratory Challenge 14.1 is meant to be solved later in the course, and shows that even randomly
picked vectors do quite well (it requires some of the probability tools introduced later on).

Exploratory Challenge 14.1. Towards the end of the course, equipped with a few more tools of probability (in particular concentration inequalities), you'll be able to show that by taking a frame made up of random (independent) vectors on the unit sphere, the coherence is comparable to the Welch bound. More precisely, this challenge is showing that m such vectors in d dimensions will have worst-case coherence polylog(m)/√d, where polylog(m) means a polynomial of the logarithm of m (you will also work out the actual dependency).

Further reading 14.5. Challenge 14.1 along with Theorem 12.3 shows that for matrices consisting of random (independent) columns, sparse recovery with ℓ1 minimization is possible up to sparsity levels s ≲ √d/polylog(m). It turns out that one can actually perform it for much larger levels of sparsity, s ≲ d/ log(m)! Proving this however is outside the scope of this course, as it requires heavier probability theory machinery. Interestingly, matching this performance with deterministic constructions seems notoriously difficult; in fact there is only one known construction "breaking the square-root bottleneck". You can read more about this in Open Problem 5.1. in [Ban16] (and references therein).

14.3 Mutually Unbiased Bases (MUBs)

Definition 14.2 (Mutually Unbiased Bases)


Construction (38) suggests the notion of Mutually Unbiased Bases (MUBs). Two orthonormal bases v1 , . . . , vd and u1 , . . . , ud of Cd are called mutually unbiased if for all i, j we have |vi† uj | = 1/√d. A set of k ≥ 2 bases is called mutually unbiased if the bases are pairwise mutually unbiased.

Challenge 14.2. Show that a matrix formed with two orthonormal bases (such as (38)) cannot have worst-case coherence smaller than 1/√d. This motivates the definition above as the "most possibly unbiased" bases.

Further reading 14.6. Mutually Unbiased Bases are an important object in quantum mechanics, communication, and signal processing; however there is still much that is not understood about them. A very nice and natural question to ask is: "what is the maximum number of bases that can be made mutually unbiased in Cd ?" Let us denote this number M(d). Remarkably, very little is known about M(d), besides a general bound M(d) ≤ d + 1, and that this bound is achievable if d is a power of a prime number.

Exploratory Challenge 14.3 (Open Problem). How many mutually unbiased bases exist in d = 6
dimensions ? The best known upper bound is M(6) ≤ 6 + 1 = 7 (see above), and the best known lower
bound is M(6) ≥ 3. See Open Problem 6.2. in [Ban16].

15 The Paley ETF (12.05.2023)
15.1 A bit of number theory
Recall that for any p ≥ 2, Zp (also denoted Z/pZ) is the cyclic group of order p (under addition) of integers modulo p. Moreover, if p is prime then Zp is a field, and we denote by Z_p^× := Zp\{0} the multiplicative group. In Appendix B we recall (and show) some basics of number theory, in particular the classical result that Z_p^× is a cyclic group, i.e. there is an element g ∈ Z_p^× (called a generator) such that Z_p^× = {g^k, 1 ≤ k ≤ p − 1}. For a reference on number theory, you can check out the webpage of the Bachelor course of this year at ETH.
Definition 15.1 (Quadratic residue)
Let p ≥ 3 be a prime number. We say that an integer x ∈ Z is a quadratic residue mod p if there
exists q ∈ Z such that x ≡ q 2 mod p. Otherwise, we say that x is a quadratic non-residue (mod p).

Definition 15.2 (Legendre symbol)


Let p ≥ 3 be a prime number, and a ∈ Z. We define the Legendre symbol as:

(a/p) :=  1   if a is a quadratic residue mod p and a ≢ 0 mod p,
         −1   if a is a quadratic non-residue mod p,                (39)
          0   if a ≡ 0 mod p.

We will need the following properties of the Legendre symbol (or of the quadratic residues).
Proposition 15.1 (Properties of quadratic residues)
Let p ≥ 3 be a prime number. Then:

(i) (Euler's criterion) For all a ∈ Z,

    (a/p) ≡ a^{(p−1)/2} mod p.    (40)

(ii) (Multiplicativity) For all a, b ∈ Z,

    (a/p) (b/p) = (ab/p).    (41)

(iii) Quadratic residues form exactly half the elements of Z_p^×:

    Σ_{a ∈ Z_p^×} (a/p) = 0.    (42)

(iv) We have

    (−1/p) = 1 if p ≡ 1 mod 4, and (−1/p) = −1 if p ≡ 3 mod 4.

Remark 15.1. Recall that Fermat's little theorem states that for all a ∈ Z, if a ≢ 0 mod p then a^{p−1} ≡ 1 mod p, which we recover from Euler's criterion (i).

Proof of Proposition 15.1 – We start by proving (i). If a ≡ 0 mod p the statement is clear, so we assume a ≢ 0 mod p. Assume first that a is a quadratic non-residue mod p. Then the collection of pairs {{x, ax^{−1}}}_{x ∈ Z_p^×} partitions Z_p^× into (p − 1)/2 pairs (since a is a non-residue, the two elements in the pair are always distinct). Moreover, the product of the elements in every pair is always given by a. Therefore we have 1 × 2 × · · · × (p − 1) ≡ a^{(p−1)/2} mod p. Since (p − 1)! ≡ −1 mod p when p is prime (this is called Wilson's theorem, see Theorem B.4), we reach −1 ≡ a^{(p−1)/2} mod p.
Let us now assume that a is a quadratic residue, so a ≡ r² mod p for some r ∈ Z_p^×. Since the equation a ≡ x² mod p has only two solutions x = ±r in the field Zp (see Theorem B.3)9, we can now partition Z_p^×\{−r, r} by using (p − 3)/2 pairs: {{x, ax^{−1}}}_{x ∈ Z_p^×\{−r,r}}. Multiplying all elements of Z_p^×, we reach (p − 1)! ≡ a^{(p−3)/2} × (−r²) mod p. Applying again Wilson's theorem, we reach 1 ≡ a^{(p−1)/2} mod p.
Given (i), statement (ii) is immediate, so we now prove (iii). Let g ∈ Z_p^× be a generator of Z_p^×, i.e. Z_p^× = {g^k, k ∈ {0, · · · , p − 2}} (see Lemma B.6). Since all g^k for 0 ≤ k ≤ p − 2 are distinct and (p − 1)/2 ≤ p − 2, we can not have g^{(p−1)/2} ≡ 1 mod p. Therefore, by (i), the only possibility is that g^{(p−1)/2} ≡ −1 mod p, i.e. g is a quadratic non-residue. Moreover, we have

Σ_{a ∈ Z_p^×} (a/p) ∈ [−(p − 1), p − 1],   and
Σ_{a ∈ Z_p^×} (a/p) = Σ_{k=0}^{p−2} (g^k/p) ≡ Σ_{k=0}^{p−2} g^{k(p−1)/2} ≡ Σ_{k=0}^{p−2} (−1)^k ≡ 0 mod p,

so that this sum, being an integer of absolute value at most p − 1 and divisible by p, must be 0.

Point (iv) is easy to check; we verify the case p ≡ 1 mod 4. Notice that since p = 4q + 1, we have by Euler's criterion

(−1/p) ≡ (−1)^{(p−1)/2} = (−1)^{2q} ≡ 1 mod p.
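
Euler's criterion also gives a direct way to compute Legendre symbols numerically, which can be used to sanity-check properties (iii) and (iv) (a minimal sketch; the function name legendre is ours):

    def legendre(a, p):
        """Legendre symbol (a/p) for an odd prime p, computed via Euler's criterion (40)."""
        a = a % p
        if a == 0:
            return 0
        return 1 if pow(a, (p - 1) // 2, p) == 1 else -1

    p = 13
    print([legendre(a, p) for a in range(p)])            # quadratic residues vs non-residues
    print(sum(legendre(a, p) for a in range(1, p)))      # = 0, property (iii)
    print(legendre(-1, p), p % 4)                        # = +1 since 13 = 1 mod 4, property (iv)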

Proposition 15.1 allows us to deduce an important property of a quantity which is a called the quadratic
Gauss sum:
Theorem 15.2 (Gauss sums)
Let p ≥ 3 be a prime number, and let ω := e^{2iπl/p}, for some l ≢ 0 mod p. Then

[ Σ_{k=0}^{p−1} (k/p) ω^k ]² = p (−1/p) = { p if p ≡ 1 mod 4;  −p if p ≡ 3 mod 4 }.

Proof of Theorem 15.2 – Let g_p(x) := Σ_{k=0}^{p−1} (k/p) x^k. We have

g_p(x)² = Σ_{j,k=0}^{p−1} (k/p) (j/p) x^{j+k}.
9. In this case it is very easy to see, since x² ≡ r² mod p ⇔ (x − r)(x + r) ≡ 0 mod p, and Zp is a field.

For any u such that u^p = 1 (in particular this holds for u = ω and u = 1), we have u^{j+k} = u^{j+k (mod p)}, thus we can group terms:

g_p(u)² = Σ_{n=0}^{p−1} [ Σ_{0≤j,k≤p−1, j+k≡n mod p} (k/p) (j/p) ] u^n.

Let us denote by a_n the coefficient in front of u^n in this sum, i.e. g_p(u)² = Σ_{n=0}^{p−1} a_n u^n. By (iii) of Proposition 15.1, g_p(1) = 0, thus Σ_{n=0}^{p−1} a_n = 0. Moreover, we can compute, using (ii) of Proposition 15.1:

a_0 = Σ_{j=0}^{p−1} (j/p) (−j/p) = Σ_{j=0}^{p−1} (j²/p) (−1/p) = Σ_{j=1}^{p−1} (−1/p) = (p − 1) (−1/p).

Finally, let 1 ≤ n ≤ p − 1. Letting j = nj′ and k = nk′ with j′, k′ ∈ Zp (and identifying the integers j, k, n with the corresponding elements of Zp ), we have

a_n = Σ_{0≤j′,k′≤p−1, j′+k′≡1 mod p} (nk′/p) (nj′/p) = Σ_{0≤j′,k′≤p−1, j′+k′≡1 mod p} (n²/p) (k′/p) (j′/p) = a_1.

Therefore a_1 = a_2 = · · · = a_{p−1}. Combining the different results above, we reach that

a_0 = (p − 1) (−1/p),   and   a_n = −(−1/p) for all n ∈ {1, · · · , p − 1}.

Thus

g_p(ω)² = p (−1/p) − (−1/p) Σ_{n=0}^{p−1} ω^n = p (−1/p),

since Σ_{n=0}^{p−1} ω^n = 0 is a basic property of p-th roots of unity. □

15.2 Definition of the Paley ETF


We now introduce the Paley ETF. Other brief descriptions of its construction (with slightly different
conventions) can be found e.g. in [Ban16, BFMW13].
Its construction is not mathematically complicated, but involves several steps. Let p be a prime number such that p ≡ 1 mod 4, and let M := (p + 1)/2 and N := 2M = p + 1. Recall the definition of the Discrete Fourier Transform matrix (Definition 9.1), and let F be the p × p DFT matrix. To lighten notation, we use {0, · · · , p − 1} rather than {1, · · · , p} to index its rows and columns. We have for 0 ≤ a, b ≤ p − 1:

F_ab = (1/√p) e^{−2iπab/p}.

Let S = {0} ∪ Q, with Q ⊆ {1, · · · , p − 1} the subset of quadratic residues modulo p, cf. Definition 15.1. By Proposition 15.1-(iii), |S| = M. We define G as the M × p matrix formed by picking the rows of F whose index is in S, i.e. if we denote S = {i0 , · · · , iM−1} ⊆ {0, · · · , p − 1}, we have for 0 ≤ k ≤ M − 1 and 0 ≤ l ≤ p − 1:

G_kl := F_{i_k, l} = (1/√p) e^{−2iπ i_k l / p}.    (43)

We end the construction by two steps:

(i) We let H := DG, with D ∈ R^{M×M} the diagonal matrix whose elements are D_11 = 1, and D_kk = √2 for k ≥ 2. Effectively, this multiplies all elements of G by √2, except the ones in the first row.

(ii) We build Φ ∈ C^{M×N} (i.e. an M × 2M matrix) by concatenating the columns of H with the canonical basis element (1, 0, · · · , 0) ∈ R^M.

As we will see below, these two steps are necessary: step (i) ensures that the columns of Φ have unit norm and satisfy |⟨ϕi , ϕj ⟩| = µ for all i ≠ j, and step (ii) ensures that the frame is tight. Anticipating what we will later prove, we call this construction the Paley ETF [BFMW13].
Definition 15.3 (Paley ETF)
For any prime p ≥ 2 such that p ≡ 1 mod 4, the columns of the matrix Φ built by the procedure
above are called the Paley Equiangular Tight Frame.

Example 15.2. Let p = 5. It is easy to check that S = {0, 1, 4}. Therefore we have:

G = (1/√5) [ 1        1             1             1             1
             1   e^{−2iπ/5}    e^{−4iπ/5}    e^{−6iπ/5}    e^{−8iπ/5}
             1   e^{−8iπ/5}    e^{−6iπ/5}    e^{−4iπ/5}    e^{−2iπ/5} ] .

This leads to the construction:

Φ = [ √(1/5)       √(1/5)               √(1/5)               √(1/5)               √(1/5)              1
      √(2/5)   √(2/5) e^{−2iπ/5}    √(2/5) e^{−4iπ/5}    √(2/5) e^{−6iπ/5}    √(2/5) e^{−8iπ/5}    0
      √(2/5)   √(2/5) e^{−8iπ/5}    √(2/5) e^{−6iπ/5}    √(2/5) e^{−4iπ/5}    √(2/5) e^{−2iπ/5}    0 ] .

Challenge 15.1. Check that the matrix Φ built above for the case p = 5 is indeed an ETF.
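
One way to approach Challenge 15.1 (and to experiment with larger primes) is to implement the construction numerically; the following numpy sketch (the function name paley_etf is ours) builds Φ for a prime p ≡ 1 mod 4 and checks unit norms, equiangularity at level 1/√p, and tightness ΦΦ† = 2IM.

    import numpy as np

    def paley_etf(p):
        """Paley frame for a prime p = 1 mod 4: an M x N matrix with M = (p+1)/2, N = p+1,
        built by steps (i)-(ii) above."""
        Q = sorted({(q * q) % p for q in range(1, p)})        # quadratic residues mod p
        S = [0] + Q                                           # row indices, |S| = (p+1)/2
        M = len(S)
        G = np.exp(-2j * np.pi * np.outer(S, np.arange(p)) / p) / np.sqrt(p)   # rows (43)
        D = np.diag([1.0] + [np.sqrt(2.0)] * (M - 1))         # step (i)
        e1 = np.zeros((M, 1)); e1[0, 0] = 1.0
        return np.hstack([D @ G, e1])                         # step (ii)

    p = 13
    Phi = paley_etf(p)
    Gram = np.abs(Phi.conj().T @ Phi)
    print(np.allclose(np.diag(Gram), 1.0))                    # unit-norm columns
    off = Gram[~np.eye(Gram.shape[0], dtype=bool)]
    print(off.min(), off.max(), 1 / np.sqrt(p))               # equiangular at level 1/sqrt(p)
    print(np.allclose(Phi @ Phi.conj().T, 2 * np.eye(Phi.shape[0])))   # tight: Phi Phi^dagger = 2 I_M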

We now prove the main theorem of this section, which is that the Paley ETF is indeed an equiangular
tight frame.

Theorem 15.3
For any prime p ≥ 2, the Paley ETF is a complex equiangular tight frame.

Remark 15.3. Theorem 15.3 does not actually require p ≡ 1 mod 4 to hold. On the other hand, it
is only in this case that the Paley ETF can be mapped to a real ETF, see below.

Proof of Theorem 15.3 – We denote ϕ1 , · · · , ϕN the columns of Φ. Trivially, ∥ϕN ∥ = 1. For any
i ∈ [N − 1], we have
∥ϕi ∥² = 1/p + Σ_{k=1}^{M−1} 2/p = (1 + 2(M − 1))/p = 1.

We now show that Φ is tight. One checks easily from the properties of the DFT matrix that the rows
of Φ are pairwise orthogonal, and that they all have squared norm 2. Therefore we have ΦΦ† = 2IM ,
i.e. Φ is a tight frame.
To prove that Φ is an ETF, it is thus sufficient to show that for all i ≠ j ∈ [N ], |⟨ϕi , ϕj ⟩| takes the same value. Let us denote µ = 1/√(2M − 1) = 1/√p, i.e. the minimal worst-case coherence given by the Welch bound 13.1 for a unit-norm frame of N = 2M vectors in C^M. In what follows, we show successively:

(i) For all j ∈ [N − 1], |⟨ϕj , ϕN ⟩| = µ.

(ii) For all j, j ′ ∈ [N − 1] with j ̸= j ′ , |⟨ϕj , ϕj ′ ⟩| = µ.



Property (i) is trivial, since |⟨ϕj , ϕN ⟩| = |(ϕj )1 | = 1/√p. We focus on (ii). We have
⟨ϕj , ϕj′ ⟩ = 1/p + (2/p) Σ_{k=1}^{p−1} 1_{(k/p)=1} e^{−2iπk(j−j′)/p}
           =(a) 1/p + (1/p) Σ_{k=1}^{p−1} [1 + (k/p)] e^{−2iπk(j−j′)/p}
           = (1/p) Σ_{k=0}^{p−1} e^{−2iπk(j−j′)/p} + (1/p) Σ_{k=1}^{p−1} (k/p) e^{−2iπk(j−j′)/p}
           = (1/p) Σ_{k=0}^{p−1} (k/p) e^{−2iπk(j−j′)/p},

where we used in (a) that (k/p) ∈ {±1} for k ≢ 0 mod p (so that 1_{(k/p)=1} = [1 + (k/p)]/2), and in the last step that Σ_{k=0}^{p−1} e^{−2iπk(j−j′)/p} = 0 since j ≢ j′ mod p, together with (0/p) = 0. Therefore by Theorem 15.2, we have ⟨ϕj , ϕj′ ⟩² = ±1/p, and thus |⟨ϕj , ϕj′ ⟩| = 1/√p. □

Remark 15.4 (The real Paley ETF). When p ≡ 1 mod 4, the proof above shows that the inner product between any two elements of the frame is real, i.e. that Φ† Φ is a real matrix (and recall it is positive semidefinite). We can thus write Φ† Φ = Ψ⊤ Ψ, in which Ψ is a real matrix given by the Cholesky decomposition of Φ† Φ. Since the columns of Φ form an ETF, one can then check easily that the columns of Ψ form a real Equiangular Tight Frame. For this reason, the Paley ETF is often studied in the case p ≡ 1 mod 4.

Further reading 15.5 (Real ETFs and strongly regular graphs). There is a fascinating and general
connection between real ETFs and a class of regular graphs known as strongly regular graphs, see e.g.
Theorem 19 in [BFMW13]. The latter are defined as d-regular graphs, such that the common number
of neighbors of any two vertices i, j only depends on whether i and j are adjacent or not. In particular,
the Paley ETF is mapped to a graph on p vertices, known as the Paley graph, such that vertices i and
j are connected iff i − j is a quadratic residue mod p. One can check that this definition is consistent
when p ≡ 1 mod 4 (i.e. i ∼ j ⇔ j ∼ i), and then prove that this is indeed a strongly regular graph.
For more details on the Paley graph and strongly regular graphs, see e.g. this book draft by Daniel
Spielman, which is also a great reference for more notions of spectral and algebraic graph theory. In a
future homework you might study the spectral properties of strongly regular graphs, and prove some
of the statements mentioned in this remark.

16 Elements of classification theory (19.05.2023)
A bit of history – In this last part of the lectures, which we will cover for approximately three weeks,
we introduce some of the fundamentals of learning theory. We start with the theory of classification, a foundational topic in Statistical Machine Learning. The direction we are discussing in this part of the
course was initiated by Vladimir Vapnik and Alexey Chervonenkis in the mid-60s and independently
by Leslie Valiant in the mid-80s. The results of Vapnik and Chervonenkis led to what we now call
Vapnik–Chervonenkis Theory. From the early 70s to the present day, their work has an ongoing
impact on Machine Learning, Statistics, Empirical Process Theory, Computational Geometry and
Combinatorics. In parallel, the work of Valiant looked at a similar problem from a more computational
perspective. In particular, Valiant developed the theory of Probably Approximately Correct (PAC)
Learning that led, among other contributions, to his 2010 Turing Award.
Classification model – The statistician is given a sequence of independent identically distributed
observations
X1 , . . . , Xn
taking values in X ⊆ Rp , each distributed according to some unknown distribution P 10 . In practice,
Xi might be seen as an image or a feature vector. For example, consider the problem of health
diagnostics. In this case these vectors can describe some medical information such as age, weight,
blood pressure and so on. An important part of our analysis is that the dimension of the space will
not play any role, and classification is possible even in abstract measurable spaces.
Contrary to e.g. the clustering tasks we have considered previously, classification models belong to the realm of supervised learning, meaning that the statistician also observes labels associated to the observations

f ∗ (X1 ), . . . , f ∗ (Xn ),

where f ∗ is an (unknown to her) target classifier11 mapping X → {0, 1}. These labels will depend on the application: they can represent cat/dog when classifying images, spam/not spam when classifying e-mails, disease/no disease when diagnosing a patient... Moreover, we restrict the classifier to have values in {0, 1}, but what we will describe can be generalized to a finite number of classes.
Classification task – Using the labeled sample

Sn = {(X1 , f ∗ (X1 )), . . . , (Xn , f ∗ (Xn ))}, (44)

the statistician’s aim is to construct a measurable classifier fb : X → {0, 1} that can be used to classify
any element x ∈ X , e.g. a new image. The risk (or the error) of a classifier f : X → {0, 1} is defined
by the probability of making an error on a random sample:

R(f ) := P(f (X) ̸= f ∗ (X)),

where X ∼ P . With this definition in mind, the statistician wants to find a classifier that has risk as
small as possible. Besides the labeled sample Sn , a second important piece of information is available to her: f ∗ belongs to some known class F of (measurable) classifiers12 mapping X to {0, 1}.
Definition 16.1 (Consistent classifier)
We say that a classifier fb : X → {0, 1} is consistent with the sample Sn if for all i ∈ {1, · · · , n}:

fb(Xi ) = f ∗ (Xi ).

10. We assume that there is a probability space (X , F, P ), where F is a Borel σ-algebra.
11. Moreover, we always assume standard measurability assumptions on f ∗ so that e.g., {f ∗ (x) = 1} is measurable.
12. Note that in the Computer Science literature these classifiers are sometimes called concepts.

Performance of consistent classifiers – Which strategy could the statistician adopt? Given that
the sample Sn is basically the only information in her possession, the most natural way is to choose
fb ∈ F consistent with Sn and use it as a guess, hoping that it will be close to the true classifier f ∗ .
Hopefully, if the number of samples n is large enough, this will be true. However, since the sample
Sn is random, we cannot guarantee this with certainty. Instead, we may only say that fb is close to f ∗
with high probability: this would mean intuitively that for a large fraction of all random realizations
of the sample Sn , any classifier consistent with a particular realization of the sample has a small risk.

Theorem 16.1 (Risk of consistent classifiers)


Assume that f ∗ ∈ F, and that F is finite. For the confidence parameter δ ∈ (0, 1) and the precision
parameter ε ∈ (0, 1), assume that we have

n ≥ (log |F|)/ε + (1/ε) log(1/δ).

Then (the probability being over the law of X1 , · · · , Xn ):

P(∀fb ∈ F consistent with the sample Sn : R(fb) < ε) ≥ 1 − δ. (45)

Equivalently, with probability at least 1 − δ, any classifier f such that R(f ) > ε cannot be consistent
with the sample Sn .

Proof of Theorem 16.1 – Let us denote Fε := {f ∈ F : R(f ) ≥ ε} ⊆ F, and fix any f ∈ Fε . If no


such function exists, the claim follows. By independence of (Xi)_{i=1}^n:

P[f is consistent with Sn ] = P(f (Xi ) = f ∗ (Xi ) for i = 1, . . . , n)
                            = Π_{i=1}^n P_{Xi}(f (Xi ) = f ∗ (Xi ))
                            = Π_{i=1}^n (1 − P_{Xi}(f (Xi ) ≠ f ∗ (Xi )))
                            ≤ (1 − ε)^n
                            ≤ exp(−nε),    (46)

where in the last line we used 1 − x ≤ exp(−x). We recall the union bound:
Proposition 16.2 (Union bound)
Suppose (Ω, F, P) is a probability space. For any countable sequence of events (An)_{n≥1} in F:

P( ∪_{n≥1} An ) ≤ Σ_{n≥1} P(An).

By Proposition 16.2 and eq. (46) we have

P(there is f ∈ F with R(f ) ≥ ε and such that f (Xi ) = f ∗ (Xi ) for i = 1, . . . , n)
   = P( ∪_{f ∈ Fε} {f (Xi ) = f ∗ (Xi ) for i = 1, . . . , n} )
   ≤ Σ_{f ∈ Fε} P(f (Xi ) = f ∗ (Xi ) for i = 1, . . . , n)
   ≤ |Fε | exp(−nε) ≤ |F| exp(−nε).    (47)

Since we want this probability to be smaller than δ, we see that if

n ≥ (log |F|)/ε + (1/ε) log(1/δ),

then by eq. (47) we can guarantee that, with probability at least 1 − δ, any classifier f ∈ F consistent with the samples has its risk smaller than ε. □

Remark 16.1 (Risk bound). One may rewrite the result of Theorem 16.1 as a risk bound. That is, we first fix the sample size n and want to estimate the precision ε of any consistent classifier. More precisely, eq. (45) implies that

P( sup_{fb ∈ F consistent with Sn} R(fb) ≤ (log |F|)/n + (1/n) log(1/δ) ) ≥ 1 − δ.

PAC learnability – The result of Theorem 16.1 inspires the following definition. In what follows,
PAC stands for Probably Approximately Correct. Indeed, we showed that for any finite class F, any
consistent classifier is approximately correct (i.e. has risk ≤ ε) with high probability.
Definition 16.2 (PAC learnability)
A (possibly infinite) class F of classifiers is PAC-learnable with the sample complexity n(δ, ε) if there
is a mapping

A : ∪_{m=0}^∞ (X × {0, 1})^m → {0, 1}^X

called the learning algorithm (given a sample S of any size it outputs a classifier A(S)) that satisfies
the following property: for

(i) every distribution P on X ,

(ii) every δ, ε ∈ (0, 1), and

(iii) every target classifier f ∗ ∈ F,

if the sample size n is greater than or equal to n(δ, ε), then

P(X1 ,...,Xn ) (R(A(Sn )) ≤ ε) ≥ 1 − δ.

Remark 16.2 (Measurability of classifiers). When considering finite classes we have few problems with measurability, and we may only request that for all f ∈ F the set {f (x) = 1} is measurable. The
notion of PAC-learnability allows infinite classes. In this case the question of measurability is more
subtle. However, as a rule of thumb, one may argue that measurability issues will almost never appear
in the analysis of real-life algorithms. In particular, starting from the late 80’s there is a useful and
formal notion of well-behaved classes: these are essentially the classes for which these measurability
issues do not appear. See also Remark 18.2 in Section 18.

An immediate outcome of Theorem 16.1 is the following result.
Corollary 16.3 (PAC learnability of finite classes)
Any finite class F is PAC learnable with the sample complexity

n(δ, ε) = (log |F|)/ε + (1/ε) log(1/δ).

Moreover, to learn this class, we simply need to output any consistent classifier fb.
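
As a concrete illustration of Corollary 16.3, the sample complexity is easy to evaluate numerically (a minimal sketch; the function name and the example class size are ours): note the mild, logarithmic dependence on |F| and on 1/δ.

    import math

    def finite_class_sample_complexity(card_F, eps, delta):
        """Sample complexity n(delta, eps) of Corollary 16.3 for a finite class of size card_F."""
        return math.ceil((math.log(card_F) + math.log(1.0 / delta)) / eps)

    # e.g. a class of 10^6 classifiers, precision eps = 1%, confidence 1 - delta = 99%
    print(finite_class_sample_complexity(10**6, eps=0.01, delta=0.01))   # roughly 1.8 * 10^3 samples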

An important limitation of Theorem 16.1 is that it only deals with finite classes. Working only with
discrete spaces of solutions is somewhat impractical: many modern machine learning techniques are
based on optimization (e.g. gradient descent) methods that require that the class F is parametrized in
a relatively smooth way by Rp . One of our main goals for the remaining of the class will therefore be
to see what we can say about PAC learnability of possibly infinite class of functions, and will culmi-
nate with the characterization of PAC learnability via a property known as the Vapnik-Chervonenkis
dimension in Section 19. First we will need to introduce an important result of probability theory
called Hoeffding’s inequality.

17 Hoeffding’s inequality (19-26.05.2023)
In this section, we prove a very useful bound for the sum of many independent random variables. In Section 17.2 (which is not covered in the lectures), we show that this allows us to prove that randomized constructions of frames have low coherence. Let us recall first a classical result of probability theory:
Proposition 17.1 (Markov’s inequality)
Suppose (Ω, F, P) is a probability space, and let X : Ω → R be a nonnegative random variable.
Then for all t > 0:
P[X ≥ t] ≤ E[X]/t.

An immediate consequence of Markov’s inequality (by applying it to |X − EX|2 ) is Chebyshev’s


inequality:
Proposition 17.2 (Chebyshev’s inequality)
Suppose (Ω, F, P) is a probability space, and let X : Ω → R be a random variable with finite mean
E[X]. Then for all t > 0:

P[|X − EX| ≥ t] ≤ Var[X]/t².

17.1 Concentration inequalities


Given a random variable X, we define the real-valued function MX (also called moment generating
function, or MGF) as
MX (λ) := E exp(λX),
whenever this expectation exists. For example, it is standard to verify that if X is distributed according
to the normal law with mean 0 and variance σ 2 , then for all λ ∈ R:

E exp(λX) = exp(λ2 σ 2 /2).

We now show that a similar upper bound holds for any zero-mean bounded random variable. This
result is originally due to Wassily Hoeffding. In this proof we use Jensen’s inequality: if ϕ is a convex
function, then ϕ(EX) ≤ E ϕ(X).
Lemma 17.3 (Hoeffding’s lemma)
Let X be a zero mean random variable (EX = 0) such that X ∈ [a, b] almost surely. Then for all
λ ∈ R:
MX (λ) ≤ exp(λ2 (b − a)2 /8).

Remark 17.1. Random variables X such that the MGF of X − EX is upper bounded by the MGF
of a Gaussian random variable are usually called sub-Gaussian random variables. A consequence of
Lemma 17.3 is that all bounded random variables are sub-Gaussian.

In the lecture, we will prove a weaker version of Lemma 17.3, in which the denominator 8 is replaced
by 2. For completeness, we include the proof of the weaker upper bound (proven in the class) and of
the stronger result.
Proof of Lemma 17.3 (weaker) – This weaker proof is interesting because it uses the idea of symmetrization, which is often useful in probability theory, and which we will encounter in Section 18.

Let us denote by X′ an independent copy of the random variable X, and by E′ the expectation with respect to X′ only. Since EX = 0, we have

MX (λ) = E exp(λ(X − E′X′)) = E exp(λ E′[X − X′]) ≤(a) E E′ exp(λ[X − X′]).

We used Jensen's inequality in (a). Note that X − X′ is a symmetric random variable, thus its distribution is equal to the one of

X − X′ =(d) ε(X − X′),

in which ε ∈ {±1} is a Rademacher random variable with P[ε = 1] = 1/2, independent of (X, X′). Thus we have (writing now E for the expectation with respect to all of X, X′ and ε):

MX (λ) ≤ E exp(ελ[X − X′]).

This is the core idea of the symmetrization method, as we can now invert the order of expectation by Fubini's theorem, and perform first the expectation over ε:

MX (λ) ≤ E { (1/2) [ exp(λ[X − X′]) + exp(−λ[X − X′]) ] } ≤ E exp(λ²[X − X′]²/2),

since cosh(x) ≤ exp(x²/2) for all x ∈ R. The proof is now over since |X − X′|² ≤ (b − a)² because a ≤ X, X′ ≤ b. □
Proof of Lemma 17.3 (original formulation) – Since x ↦ e^{λx} is a convex function, for all x ∈ [a, b] we have:

e^{λx} ≤ ((b − x)/(b − a)) e^{λa} + ((x − a)/(b − a)) e^{λb}.

By taking expectations, we have:

MX (λ) ≤ (b/(b − a)) e^{λa} − (a/(b − a)) e^{λb}
        ≤ e^{λa} [ 1 + (a/(b − a)) (1 − e^{λ(b−a)}) ]
        ≤ e^{F[λ(b−a)]},

in which F(x) := ax/(b − a) + log[1 + a(1 − e^x)/(b − a)]. In particular, F(0) = 0, F′(0) = 0, and F′′(x) = −ab e^x/(b − a e^x)². Note that since EX = 0, a ≤ 0 and b ≥ 0. Therefore the AM-GM inequality13 yields b − a e^x ≥ 2√(−ab e^x), and thus F′′(x) ≤ 1/4. We can then use Taylor's expansion around 0 and bound the remainder, which yields that for all x ∈ R,

F(x) ≤ F(0) + x F′(0) + (x²/2) sup_{h∈R} F′′(h) ≤ x²/8.

Applying this for x = λ(b − a) ends the proof. □

Let Y be a random variable. Denote its moment generating function by MY . For any t ∈ R and
λ > 0, we have
P(Y ≥ t) = P(λY ≥ λt) = P(exp(λY ) ≥ exp(λt)) ≤ exp(−λt)MY (λ),
where the last inequality follows from Markov’s inequality (Proposition 17.1). Therefore, we have
P(Y ≥ t) ≤ inf_{λ>0} { exp(−λt) MY (λ) }.

This very useful upper bound is usually called the Chernoff method. We are now ready to prove the
basic concentration inequality for bounded random variables.
13. For any x, y ≥ 0, √(xy) ≤ (x + y)/2.

Theorem 17.4 (Hoeffding’s inequality)
Let X1 , . . . , Xn be independent random variables such that Xi ∈ [ai , bi ] almost surely for i =
1, . . . , n. Then, for any t ≥ 0, it holds that
P( Σ_{i=1}^n (Xi − EXi ) ≥ t ) ≤ exp( −2t² / Σ_{i=1}^n (bi − ai )² ).

Moreover,

P( |Σ_{i=1}^n (Xi − EXi )| ≥ t ) ≤ 2 exp( −2t² / Σ_{i=1}^n (bi − ai )² ).

Proof of Theorem 17.4 – We proceed with the following lines. For any λ ≥ 0, it holds that

P( Σ_{i=1}^n (Xi − EXi ) ≥ t ) ≤ exp(−λt) E exp( λ Σ_{i=1}^n (Xi − EXi ) )     (by the Chernoff method)
                               = exp(−λt) Π_{i=1}^n E exp( λ(Xi − EXi ) )      (by independence)
                               ≤ exp(−λt) Π_{i=1}^n exp( λ²(bi − ai )²/8 )     (by Hoeffding's lemma)
                               = exp( −λt + (λ²/8) Σ_{i=1}^n (bi − ai )² ).    (48)

Observe that we used that the length of the interval to which Xi − EXi belongs is the same as the corresponding length for Xi . We can now choose λ ≥ 0 so as to minimize the right-hand side of eq. (48). One checks easily that the optimal choice is λ = 4t / Σ_{i=1}^n (bi − ai )², which proves the first inequality. To prove the second inequality, notice that we can apply the first inequality to Yi = −Xi , which yields

P( Σ_{i=1}^n (Xi − EXi ) ≤ −t ) ≤ exp( −2t² / Σ_{i=1}^n (bi − ai )² ).

Finally, by the union bound (Proposition 16.2),

P( |Σ_{i=1}^n (Xi − EXi )| ≥ t ) ≤ P( Σ_{i=1}^n (Xi − EXi ) ≥ t ) + P( Σ_{i=1}^n (Xi − EXi ) ≤ −t )
                                 ≤ 2 exp( −2t² / Σ_{i=1}^n (bi − ai )² ).

Example 17.2. Assume that ai = a and bi = b for all i ∈ [n]. Then Hoeffding's inequality gives:

P( |Σ_{i=1}^n (Xi − EXi )| ≥ t ) ≤ 2 exp( −2t² / (n(b − a)²) ).

In particular, the right-hand side goes to 0 as n → ∞ if t = ω(√n)14: informally, we see that the sum cannot fluctuate by more than O(√n), coherently with the picture given by the central limit theorem!
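
A quick Monte Carlo sanity check of Theorem 17.4 (a sketch with arbitrary parameters): for i.i.d. uniform random variables on [0, 1], the empirical frequency of a deviation of size t stays below the Hoeffding bound (which is typically quite loose).

    import numpy as np

    rng = np.random.default_rng(0)
    n, t, trials = 200, 12.0, 20_000
    X = rng.uniform(0.0, 1.0, size=(trials, n))          # X_i in [a, b] = [0, 1]
    deviations = np.abs(X.sum(axis=1) - n * 0.5)         # |sum_i (X_i - E X_i)|
    print(np.mean(deviations >= t))                      # empirical frequency of a deviation >= t
    print(2 * np.exp(-2 * t**2 / (n * 1.0**2)))          # Hoeffding bound, about 0.47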
14. With the classical notation x = ω(y) ⇔ y = o(x).

Further reading 17.3. The idea of deducing concentration inequalities from upper bounds on the moment generating function is very fruitful, and is used to prove a large part of classical concentration inequalities. Hoeffding's inequality appears in the foundational work of Hoeffding [Hoe63]. Similar techniques were used in the 1920s by Bernstein and Kolmogorov [Kol29].

17.2 Randomized low-coherence frames (not covered in the class)


An easy corollary of Hoeffding's inequality 17.4 shows that a set of random vectors with i.i.d. coordinates in {±1/√d} has a low worst-case coherence. We will not show the following corollary in the lecture,
but it is a classical application of combining a strong concentration inequality with the union bound,
a very versatile idea in probability theory, as you will see in the next two sections!
Corollary 17.5 (Random low-coherence frame)
Let d, m ≥ 1. Let ϕ1 , · · · , ϕm ∈ R^d be i.i.d. draws of the vector X ∈ R^d drawn with the following distribution: the entries (Xk)_{k=1}^d of X are i.i.d. and P[Xk = 1/√d] = P[Xk = −1/√d] = 1/2. Notice that ∥ϕi ∥2 = 1 almost surely for all i ∈ [m]. Moreover, for all t ≥ 0:

P[ max_{i≠j} |⟨ϕi , ϕj ⟩| ≥ t ] ≤ m(m − 1) exp( −dt²/2 ).    (49)

Therefore, for any δ ∈ (0, 1):

P[ max_{i≠j} |⟨ϕi , ϕj ⟩| ≤ √( (2/d) log(m(m − 1)/δ) ) ] ≥ 1 − δ.    (50)

Proof of Corollary 17.5 – Notice that, for any fixed i ≠ j:

⟨ϕi , ϕj ⟩ = Σ_{k=1}^d (ϕi )k (ϕj )k .

Since i ≠ j, the random variables Yk := (ϕi )k (ϕj )k are i.i.d. random variables, with zero mean, and P[Yk = 1/d] = P[Yk = −1/d] = 1/2. We can thus apply Hoeffding's inequality 17.4 to get

P[ |⟨ϕi , ϕj ⟩| ≥ t ] ≤ 2 exp( −dt²/2 ).

Notice that there are m(m − 1)/2 different inner products |⟨ϕi , ϕj ⟩|. By applying the union bound (Proposition 16.2) we thus get eq. (49):

P[ max_{i≠j} |⟨ϕi , ϕj ⟩| ≥ t ] ≤ m(m − 1) exp( −dt²/2 ).

Eq. (50) can be easily deduced by letting t = √( (2/d) log[m(m − 1)/δ] ). □

Remark 17.4. For large d and m, Corollary 17.5 shows that a random frame with ±1/√d coordinates has, with large probability, a worst-case coherence µ ∼ √((4 log m)/d). In particular, by Theorems 12.2 and 12.3, Φ will be a suitable matrix for ℓ1 recovery of s-sparse vectors as long as s ≲ √(d/ log m). Actually (see Further reading 14.5), they are known to work as long as s ≲ d/ log(m).
Actually (see Further reading 14.5), they are known to work as long as s ≲ d/ log(m).

Remark 17.5. If, for instance, we pick m = αd (with a fixed α ≥ 1), then as d → ∞, we have µ ∼ √(4(log d)/d), while the Welch bound is µ_min ∼ √((1 − α^{−1})/d). Random frames thus satisfy the Welch bound up to a factor O(√(log d)) in this setting15 .
15. More generally, they satisfy it up to a factor O(√( log m / (1 − d/m) )).
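
The following numpy sketch illustrates Corollary 17.5 and Remark 17.5 numerically (the sizes d = 128, m = 256 and δ = 0.1 are arbitrary): the observed coherence of a random ±1/√d frame typically sits between the Welch bound and the high-probability bound of eq. (50).

    import numpy as np

    rng = np.random.default_rng(0)
    d, m, delta = 128, 256, 0.1
    Phi = rng.choice([-1.0, 1.0], size=(d, m)) / np.sqrt(d)   # the random frame of Corollary 17.5
    G = np.abs(Phi.T @ Phi)
    np.fill_diagonal(G, 0.0)
    mu = G.max()                                              # observed worst-case coherence
    welch = np.sqrt((m - d) / (d * (m - 1)))
    bound = np.sqrt((2.0 / d) * np.log(m * (m - 1) / delta))  # eq. (50)
    print(mu, welch, bound)                                   # typically welch < mu < bound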

Further reading 17.6. Corollary 17.5 can be shown to hold for much more general distributions than random ±1/√d coordinates. In particular, similar results (perhaps up to some multiplicative
constants) hold for random vectors on the unit sphere, or i.i.d. vectors with distributions whose tail
decay at least as fast as a Gaussian (called sub-Gaussian distributions). This is related to concentration
results (like Hoeffding’s inequality) holding as well for these different cases, see e.g. a famous textbook
on high-dimensional probability [Ver18].

18 Uniform convergence and the Vapnik-Chervonenkis
theorem (26.05.2023 - 02.06.2023)
In this chapter, we use Hoeffding’s inequality to prove the Vapnik-Chervonenkis theorem. Funda-
mentally, it shows the uniform convergence of frequencies of events to their probabilities, but in the
context of classification theory, it will allow us to generalize Theorem 16.1 to possibly infinite classes,
by union bounding over a set whose cardinality might be much smaller than the whole class (since the
union bound over the whole class of eq. (47) fails for infinite classes F). In practice, we will obtain
a theorem very close to Theorem 16.1, replacing the size |F| of the class by a quantity known as the
growth function of this class.
In the final Section 19 of the lectures, we will see that the growth function can be bounded by a quantity that is easier to interpret, known as the Vapnik-Chervonenkis (VC) dimension. As many infinite classes have finite VC dimension, this significantly generalizes the results of Section 16 to infinite classes.

18.1 Motivation and statement of the VC theorem


We take again the setup of Section 16, in which we are trying to learn a classifier f ⋆ ∈ F, but now we
do not assume that F is finite. Note that a classifier f : X → {0, 1} can be equivalently represented as
a set {x ∈ X : f (x) = 1}. A different and more practical (but completely equivalent) representation
of f is given by the set Af := {x ∈ X : f (x) ̸= f ⋆ (x)}16 . We denote A := {Af : f ∈ F } ⊆ {0, 1}X .
For Af ∈ A, the risk of Af is naturally defined

R(Af ) = R(f ) = P[x ∈ Af ].

As in Theorem 16.1 we want to show that any consistent classifier has (with high probability) small
risk. That is we want to upper bound

P(there is f ∈ F with R(f ) ≥ ε and such that f (Xi ) = f ∗ (Xi ) for i = 1, . . . , n)
   = P(∃Af ∈ A with R(Af ) ≥ ε and such that Xi ∉ Af for all i = 1, . . . , n).

We now notice that R(Af ) = P[X ∈ Af ] = E[1_{X∈Af}]. Thus:

P(there is f ∈ F with R(f ) ≥ ε and such that f (Xi ) = f ∗ (Xi ) for i = 1, . . . , n)
   ≤ P( ∃Af ∈ A : |(1/n) Σ_{i=1}^n 1_{Xi∈Af} − R(Af )| ≥ ε )
   ≤ P( sup_{A∈A} |(1/n) Σ_{i=1}^n 1_{Xi∈A} − P(X ∈ A)| ≥ ε ).

As we will refer to this bound several times in the rest of the notes, we state it as a lemma:
Lemma 18.1
Let F be a set of classifiers, and f ⋆ ∈ F. Let A := {Af : f ∈ F }, with Af := {x ∈ X : f (x) ̸=
f ⋆ (x)}. Then for any ε > 0 and n ≥ 1:
\[
P(\exists f \in \mathcal{F} : R(f) \ge \varepsilon \text{ and } f(X_i) = f^\star(X_i)\ \forall i \in [n]) \le P\Big(\sup_{A \in \mathcal{A}} \Big|\frac{1}{n}\sum_{i=1}^{n} \mathbf{1}_{X_i \in A} - P(X \in A)\Big| \ge \varepsilon\Big).
\]

¹⁶Check that this is indeed a bijection from the set of classifiers to {0, 1}^X.

Remark 18.1. By the law of large numbers we know that for any given A ∈ A,
\[
\frac{1}{n}\sum_{i=1}^{n} \mathbf{1}_{X_i \in A} - P(X \in A) = \frac{1}{n}\sum_{i=1}^{n} \big(\mathbf{1}_{X_i \in A} - E[\mathbf{1}_{X_i \in A}]\big) \xrightarrow{a.s.} 0.
\]
However in Lemma 18.1, we are interested in the analysis of this convergence uniformly over all events
A ∈ A.
Remark 18.2 (Measurability). We require that the random variable appearing in Lemma 18.1 is
measurable. As we saw in Section 16, if F (or equivalently A) is finite or even countable, then no
problems of this sort appear. However, for infinite classes of events we need some relatively mild
assumptions, which we will not discuss. See Chapter 2 in [VPG15] for a more detailed discussion and
relevant references. We additionally refer to Appendix A.1 in [BEHW89].
When A is finite we can take the union bound in Lemma 18.1, as we did in Section 16. The analysis becomes more complicated when A is infinite. However, a key remark is that while the set A is infinite, a set A ∈ A only enters Lemma 18.1 through its projection (1_{X1∈A}, . . . , 1_{Xn∈A}). And for any given sample (X1, · · · , Xn), the number of such projections (over all possible A ∈ A) is at most 2^n, and thus finite. It might even be much smaller than 2^n, which motivates the following definition:
Definition 18.1 (Growth function)
Given a family of events A, the growth (shatter) function SA is defined by

\[
S_{\mathcal{A}}(n) := \sup_{x_1,\dots,x_n \in \mathcal{X}} |\{(\mathbf{1}_{x_1\in A},\dots,\mathbf{1}_{x_n\in A}) : A \in \mathcal{A}\}|.
\]

That is, the growth function bounds the number of projections of A on the sample x1 , . . . , xn .

Observe that SA(n) ≤ 2^n. Let us give some simple examples (if you prefer to think of classifiers, recall that any A ∈ A can be represented as one, for instance via f(x) = 1_{x∈A}):
1. The growth function of a finite family of events satisfies SA(n) ≤ |A|.
2. Assume that X = R and that A consists of the sets induced by all rays of the form x ≤ t, t ∈ R. Then SA(n) = n + 1 (see the short numerical check after this list).
3. Assume that X = R and A consists of all open sets in R. Then SA(n) = 2^n.
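To make the second example concrete (a small sketch added here, not from the notes), one can enumerate by brute force the projections of the class of rays on a sample of n points and check that exactly n + 1 distinct patterns appear:

```python
import numpy as np

def growth_rays(xs):
    """Number of distinct projections (1_{x_1 <= t}, ..., 1_{x_n <= t}) over all t in R."""
    xs = np.sort(np.asarray(xs, dtype=float))
    # It suffices to try one threshold below all points, one above all points,
    # and one between each pair of consecutive points.
    thresholds = np.concatenate(([xs[0] - 1.0], (xs[:-1] + xs[1:]) / 2, [xs[-1] + 1.0]))
    patterns = {tuple((xs <= t).astype(int)) for t in thresholds}
    return len(patterns)

rng = np.random.default_rng(1)
for n in [1, 2, 5, 10]:
    print(n, growth_rays(rng.normal(size=n)))   # prints n + 1 for distinct points
```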
Remark 18.3. Recall that when considering a set of classifiers F, we represented it as A := {Af : f ∈ F} ⊆ {0, 1}^X, with Af := {x ∈ X : f(x) ̸= f⋆(x)}. While the family A depends on f⋆, its growth function does not, as formalized in the following lemma:
Lemma 18.2 (Growth function of a class of classifiers)
Let f ⋆ ∈ F, with F a class of classifiers. Let Af := {x ∈ X : f (x) ̸= f ⋆ (x)} and A′f := {x ∈ X :
f (x) = 1} for f ∈ F, and we define A := {Af : f ∈ F } and A′ := {A′f : f ∈ F }. Then, for all
n ≥ 1:

SA (n) = SA′ (n).

In particular, we define SF (n) := SA (n), and it does not depend on f ⋆ .

Proof of Lemma 18.2 – One can check that for all x1, · · · , xn ∈ X and all (ε_i)_{i=1}^n ∈ {0, 1}^n:
\[
|\{(\mathbf{1}_{f(x_1)=1}, \dots, \mathbf{1}_{f(x_n)=1}) : f \in \mathcal{F}\}| = |\{(\mathbf{1}_{f(x_1)=\varepsilon_1}, \dots, \mathbf{1}_{f(x_n)=\varepsilon_n}) : f \in \mathcal{F}\}|. \tag{51}
\]
The claim follows by taking εi = 1 − f⋆(xi) and taking the supremum over x1, · · · , xn. □
We are now ready to formulate the main result of this lecture. It gives us a guarantee for the uniform
convergence of the frequencies of events A ∈ A to their probabilities (which appears in Lemma 18.1),
depending on the growth function SA (n), rather than the size of A.

Theorem 18.3 (Vapnik-Chervonenkis Theorem)
Consider a family of events A with growth function SA. For any t ≥ √(2/n), it holds that
\[
P\Big(\sup_{A\in\mathcal{A}}\Big|\frac{1}{n}\sum_{i=1}^{n} \mathbf{1}_{X_i\in A} - P(X\in A)\Big| \ge t\Big) \le 8\, S_{\mathcal{A}}(n)\, \exp(-nt^2/32).
\]
In particular, with probability at least 1 − δ, we have
\[
\sup_{A\in\mathcal{A}}\Big|\frac{1}{n}\sum_{i=1}^{n} \mathbf{1}_{X_i\in A} - P(X\in A)\Big| \le 4\sqrt{\frac{2}{n}\Big(\log(8 S_{\mathcal{A}}(n)) + \log\frac{1}{\delta}\Big)}.
\]

We have the following corollary for our initial classification problem by using Lemma 18.1 (recall the
definition of SF (n) in Lemma 18.2):
Corollary 18.4 (VC theorem for classification)
Let F be a (possibly infinite) class of classifiers, and let f⋆ ∈ F. Recall that R(f) := P[f(X) ̸= f⋆(X)]. Then for any ε ≥ √(2/n), we have:
\[
P(\exists f \in \mathcal{F} \text{ with } R(f) \ge \varepsilon \text{ and } f(X_i) = f^\star(X_i) \text{ for all } i = 1,\dots,n) \le 8\, S_{\mathcal{F}}(n)\, \exp(-n\varepsilon^2/32).
\]

Corollary 18.4 should be compared with eq. (47): we have managed to get a finite upper bound even for infinite classes of functions, expressed in terms of the growth function rather than the size of the class! Still, this upper bound is not very practical, as computing the growth function is in general not easy. In Section 19 we will see that it can be controlled in terms of an easier-to-handle quantity, known as the VC dimension.
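To illustrate Theorem 18.3 and Corollary 18.4 numerically (a sketch added here, with arbitrary simulation parameters), take the class of rays A = {(−∞, t] : t ∈ R} on X = [0, 1] with X uniform, for which SA(n) = n + 1 and the uniform deviation sup_{A∈A} |(1/n) Σ_i 1_{Xi∈A} − P(X ∈ A)| is simply sup_t |F_n(t) − t|, where F_n is the empirical CDF:

```python
import numpy as np

rng = np.random.default_rng(0)
n, delta, reps = 2000, 0.05, 200

def sup_deviation(x):
    """sup_t |F_n(t) - t| for the empirical CDF F_n of a Uniform(0,1) sample x."""
    x = np.sort(x)
    m = len(x)
    i = np.arange(1, m + 1)
    # The supremum is attained at sample points: compare F_n just after / just before each x_(i).
    return max(np.max(i / m - x), np.max(x - (i - 1) / m))

devs = np.array([sup_deviation(rng.uniform(size=n)) for _ in range(reps)])
bound = 4 * np.sqrt(2 / n * (np.log(8 * (n + 1)) + np.log(1 / delta)))

print(f"empirical 95% quantile of the sup deviation: {np.quantile(devs, 0.95):.4f}")
print(f"bound of Theorem 18.3 with delta = {delta}:  {bound:.4f}")
# The bound holds but is loose by a sizeable constant factor, which is typical of
# worst-case VC-type bounds.
```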

18.2 Proof of Theorem 18.3


The main ingredient of the proof is a symmetrization lemma, similar to the one we used to prove a weak form of Hoeffding's inequality in Section 17.
Lemma 18.5 (Symmetrization lemma)
Assume that ε1, . . . , εn are independent (from each other and from the Xi, i = 1, . . . , n) random variables taking the values ±1 each with probability 1/2. Then, for any t ≥ √(2/n), it holds that
\[
P_{X_1,\dots,X_n}\Big(\sup_{A\in\mathcal{A}}\Big|\frac{1}{n}\sum_{i=1}^{n} \big(\mathbf{1}_{X_i\in A} - P(A)\big)\Big| \ge t\Big) \le 4\, P_{X_1,\dots,X_n,\varepsilon_1,\dots,\varepsilon_n}\Big(\sup_{A\in\mathcal{A}}\Big|\frac{1}{n}\sum_{i=1}^{n} \varepsilon_i \mathbf{1}_{X_i\in A}\Big| \ge t/4\Big).
\]

Let us see how Lemma 18.5 allows us to conclude the proof. Using the symmetrization lemma, we consider the term
\[
4\, P_{X_1,\dots,X_n,\varepsilon_1,\dots,\varepsilon_n}\Big(\sup_{A\in\mathcal{A}}\Big|\frac{1}{n}\sum_{i=1}^{n} \varepsilon_i \mathbf{1}_{X_i\in A}\Big| \ge t/4\Big).
\]

As we mentioned above, the key observation is that even though the set of events A is infinite, there
are at most SA (n) realizations of (1X1 ∈A , . . . , 1Xn ∈A ) for a given sample X1 , . . . , Xn . To clarify, let
us fix X1 , · · · , Xn , and denote M(X1 , · · · , Xn ) := {(1X1 ∈A , . . . , 1Xn ∈A ) : A ∈ A}. By Definition 18.1,
|M| ≤ SA (n). Moreover:
\begin{align*}
P_{\varepsilon_1,\dots,\varepsilon_n}\Big(\sup_{A\in\mathcal{A}}\Big|\frac{1}{n}\sum_{i=1}^{n} \varepsilon_i \mathbf{1}_{X_i\in A}\Big| \ge t/4\Big)
&= P_{\varepsilon_1,\dots,\varepsilon_n}\Big(\sup_{y\in\mathcal{M}}\Big|\frac{1}{n}\sum_{i=1}^{n} \varepsilon_i y_i\Big| \ge t/4\Big), \\
&\le \sum_{y\in\mathcal{M}} P_{\varepsilon_1,\dots,\varepsilon_n}\Big(\Big|\frac{1}{n}\sum_{i=1}^{n} \varepsilon_i y_i\Big| \ge t/4\Big) \quad \text{(union bound)}, \\
&\le S_{\mathcal{A}}(n) \sup_{y\in\mathcal{M}} P_{\varepsilon_1,\dots,\varepsilon_n}\Big(\Big|\frac{1}{n}\sum_{i=1}^{n} \varepsilon_i y_i\Big| \ge t/4\Big), \\
&= S_{\mathcal{A}}(n) \sup_{A\in\mathcal{A}} P_{\varepsilon_1,\dots,\varepsilon_n}\Big(\Big|\frac{1}{n}\sum_{i=1}^{n} \varepsilon_i \mathbf{1}_{X_i\in A}\Big| \ge t/4\Big). \tag{52}
\end{align*}

We can then apply Hoeffding’s inequality in eq. (52) (observe that εi 1Xi ∈A ∈ [−1, 1] and are indepen-
dent), and we have
\begin{align*}
4\, P_{X_1,\dots,X_n,\varepsilon_1,\dots,\varepsilon_n}\Big(\sup_{A\in\mathcal{A}}\Big|\frac{1}{n}\sum_{i=1}^{n} \varepsilon_i \mathbf{1}_{X_i\in A}\Big| \ge t/4\Big)
&\le 4\, E_{X_1,\dots,X_n}\Big[S_{\mathcal{A}}(n) \sup_{A\in\mathcal{A}} P_{\varepsilon_1,\dots,\varepsilon_n}\Big(\Big|\frac{1}{n}\sum_{i=1}^{n} \varepsilon_i \mathbf{1}_{X_i\in A}\Big| \ge t/4\Big)\Big] && \text{(by eq. (52))} \\
&\le 4\, S_{\mathcal{A}}(n)\, E\big[2\exp(-2nt^2/(4\cdot 16))\big] && \text{(Hoeffding's inequality)} \\
&= 8\, S_{\mathcal{A}}(n) \exp(-nt^2/32).
\end{align*}

The claim follows.


Proof of Lemma 18.5 – As in Section 17, the key idea of symmetrization for a random variable Y is to introduce an independent copy Y′ and to consider the difference Y − Y′, which is a symmetric random variable and thus has the same distribution as ε(Y − Y′). Going back to the problem of interest here, assume that X1′, . . . , Xn′ is an independent copy of X1, . . . , Xn and set
\[
\nu_n(A) := \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}_{X_i \in A}, \qquad \nu_n'(A) := \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}_{X_i' \in A}.
\]

Given X1, . . . , Xn, assume that a (random) A⋆ ∈ A achieves the supremum. We may assume this without loss of generality: otherwise there is a sequence Aε, ε → 0, giving arbitrarily close values, and taking the limits carefully gives the same result, see Challenge 18.1. We have
\[
\sup_{A \in \mathcal{A}} |\nu_n(A) - P(A)| = |\nu_n(A^\star) - P(A^\star)|.
\]

Further, by the reverse triangle inequality

|νn(A⋆) − νn′(A⋆)| ≥ |νn(A⋆) − P(A⋆)| − |νn′(A⋆) − P(A⋆)|.

In particular, for any t ≥ 0, we have

1{|νn(A⋆) − P(A⋆)| ≥ t} × 1{|νn′(A⋆) − P(A⋆)| < t/2} ≤ 1{|νn(A⋆) − νn′(A⋆)| ≥ t/2}.

Now we take the expectation of both sides of this inequality with respect to X1′ , . . . , Xn′ . Observe that
A⋆ depends on X1 , . . . , Xn but does not depend on X1′ , . . . , Xn′ . Thus we reach:
 
1{|νn(A⋆) − P(A⋆)| ≥ t} · PX1′,...,Xn′(|νn′(A⋆) − P(A⋆)| < t/2) ≤ PX1′,...,Xn′(|νn(A⋆) − νn′(A⋆)| ≥ t/2). (53)
 

By Chebyshev’s inequality (Proposition 17.2) and independence of X1′ , . . . , Xn′ we have


\[
P_{X_1',\dots,X_n'}\big(|\nu_n'(A^\star) - P(A^\star)| \ge t/2\big) \le \frac{4}{t^2}\, \mathrm{Var}\Big(\frac{1}{n}\sum_{i=1}^{n} \big(\mathbf{1}_{X_i' \in A^\star} - P(A^\star)\big)\Big) \le \frac{1}{n t^2},
\]

where we used that the variance of a random variable taking its values in {0, 1} is at most 1/4 (show it!). Therefore, considering the complementary event, we have
\[
P_{X_1',\dots,X_n'}\big(|\nu_n'(A^\star) - P(A^\star)| < t/2\big) \ge 1 - \frac{1}{n t^2} \ge \frac{1}{2},
\]
whenever t ≥ √(2/n). Thus, for such values of t, we reach from eq. (53):
\[
\mathbf{1}\{|\nu_n(A^\star) - P(A^\star)| \ge t\} \le 2\, P_{X_1',\dots,X_n'}\big(|\nu_n(A^\star) - \nu_n'(A^\star)| \ge t/2\big). \tag{54}
\]

Eq. (54) is what we wanted: it will yield an upper bound on the probability that |νn (A⋆ ) − P (A⋆ )| is
big by considering the event in which |νn (A⋆ ) − νn′ (A⋆ )| is big, with νn′ an independent copy of νn .
In particular, we can use the symmetrization trick, that is, 1_{Xi∈A} − 1_{Xi′∈A} has the same distribution as εi(1_{Xi∈A} − 1_{Xi′∈A}), for εi ∼ Unif({±1}) i.i.d. Taking the expectation of eq. (54) with respect to X1, . . . , Xn, and using the symmetrization trick, we obtain

\begin{align*}
P_{X_1,\dots,X_n}\big(|\nu_n(A^\star) - P(A^\star)| \ge t\big) &\le 2\, P_{X_1,\dots,X_n,X_1',\dots,X_n'}\big(|\nu_n(A^\star) - \nu_n'(A^\star)| \ge t/2\big) \\
&\le 2\, P_{X_1,\dots,X_n,X_1',\dots,X_n'}\Big(\sup_{A\in\mathcal{A}} |\nu_n(A) - \nu_n'(A)| \ge t/2\Big) \\
&= 2\, P\Big(\sup_{A\in\mathcal{A}} \Big|\frac{1}{n}\sum_{i=1}^{n} \varepsilon_i\big(\mathbf{1}_{X_i\in A} - \mathbf{1}_{X_i'\in A}\big)\Big| \ge t/2\Big),
\end{align*}

where the last probability symbol corresponds to the joint distribution of Xi , Xi′ , εi for all i = 1, . . . , n.
Finally, using the triangle inequality and the union bound¹⁷, we obtain
\begin{align*}
P\Big(\sup_{A\in\mathcal{A}} \Big|\frac{1}{n}\sum_{i=1}^{n} \varepsilon_i\big(\mathbf{1}_{X_i\in A} - \mathbf{1}_{X_i'\in A}\big)\Big| \ge t/2\Big) &\le P\Big(\sup_{A\in\mathcal{A}} \Big|\frac{1}{n}\sum_{i=1}^{n} \varepsilon_i \mathbf{1}_{X_i\in A}\Big| + \sup_{A\in\mathcal{A}} \Big|\frac{1}{n}\sum_{i=1}^{n} \varepsilon_i \mathbf{1}_{X_i'\in A}\Big| \ge t/2\Big) \\
&\le 2\, P_{X_1,\dots,X_n,\varepsilon_1,\dots,\varepsilon_n}\Big(\sup_{A\in\mathcal{A}} \Big|\frac{1}{n}\sum_{i=1}^{n} \varepsilon_i \mathbf{1}_{X_i\in A}\Big| \ge t/4\Big).
\end{align*}

The claim follows. □

Challenge 18.1. Re-work the proof above without assuming that the supremum over A ∈ A is
achieved.

Challenge 18.2. Improve the constants in the uniform convergence theorem by directly analyzing P(sup_{A∈A} |(1/n) Σ_{i=1}^n (1_{Xi∈A} − 1_{Xi′∈A})| ≥ t/2) instead of introducing random signs εi.

Further reading 18.4. The uniform convergence theorem and the growth function appear in the
foundational work of Vapnik and Chervonenkis [VC71]. Symmetrization with random signs appears
in [GZ84]. A modern presentation of similar results can be found in the textbook [Ver18].

¹⁷In the form P[X + Y ≥ t] ≤ P[X ≥ t/2] + P[Y ≥ t/2], since X + Y ≥ t ⇒ (X ≥ t/2) ∨ (Y ≥ t/2).

19 The Vapnik-Chervonenkis dimension (02.06.2023)
19.1 Definition and first examples
We are now ready for the final lecture of this class, in which we will generalize PAC learnability
(Corollary 16.3) to infinite classes. To do so, we will upper bound the growth function appearing in
Corollary 18.4 using the concept of Vapnik-Chervonenkis (VC) dimension.
Definition 19.1 (Shattered set)
Given a family of events A, we say that a finite set {x1, . . . , xd} ⊂ X is shattered by A if the number of projections of A on {x1, . . . , xd} is equal to 2^d, that is if {(1_{x1∈A}, . . . , 1_{xd∈A}) : A ∈ A} = {0, 1}^d, or equivalently
\[
|\{(\mathbf{1}_{x_1\in A},\dots,\mathbf{1}_{x_d\in A}) : A \in \mathcal{A}\}| = 2^d.
\]

Recall the definition of the growth (or shatter) function SA in Definition 18.1.
Definition 19.2 (VC dimension)
Given a family of events A, the Vapnik-Chervonenkis (VC) dimension of A is the size of the largest subset of X that is shattered by A. Equivalently, it is the largest integer d such that:
\[
S_{\mathcal{A}}(d) = \sup_{x_1,\dots,x_d \in \mathcal{X}} |\{(\mathbf{1}_{x_1\in A},\dots,\mathbf{1}_{x_d\in A}) : A \in \mathcal{A}\}| = 2^d.
\]
If SA(n) = 2^n for all n ≥ 1, we set d = ∞.

First, we consider several simple examples (try to work them out yourself, to build some intuition).
Again here we consider general sets of events A ⊆ {0, 1}^X; however, one can always equivalently build classifiers as f(x) = 1{x ∈ A} for A ∈ A.
Example 19.1. The VC dimension of the family A = {[a, b] : a ≤ b} of closed intervals in R is equal to 2. Indeed, any pair of distinct points can be shattered, but no interval contains two points without containing every point between them, so no set of three points can be shattered. More formally:

\begin{align*}
S_{\mathcal{A}}(1) &= \sup_{x\in\mathbb{R}} |\{(\mathbf{1}_{x\in[a,b]}) : a \le b\}| = 2, \\
S_{\mathcal{A}}(2) &= \sup_{x,y\in\mathbb{R}} |\{(\mathbf{1}_{x\in[a,b]}, \mathbf{1}_{y\in[a,b]}) : a \le b\}| = 4, \\
S_{\mathcal{A}}(3) &= \sup_{x,y,z\in\mathbb{R}} |\{(\mathbf{1}_{x\in[a,b]}, \mathbf{1}_{y\in[a,b]}, \mathbf{1}_{z\in[a,b]}) : a \le b\}| = 7 < 2^3.
\end{align*}
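These values can also be checked by brute force (a small sketch added for illustration), enumerating the patterns cut out by intervals on a few points:

```python
import numpy as np
from itertools import product

def interval_growth(points):
    """Brute-force count of the patterns (1_{x_1 in [a,b]}, ..., 1_{x_n in [a,b]}) over a <= b."""
    pts = np.sort(np.asarray(points, dtype=float))
    # Candidate endpoints: a value below all points, the points themselves,
    # the midpoints between consecutive points, and a value above all points.
    mids = (pts[:-1] + pts[1:]) / 2
    cand = np.concatenate(([pts[0] - 1.0], pts, mids, [pts[-1] + 1.0]))
    patterns = set()
    for a, b in product(cand, repeat=2):
        if a <= b:
            patterns.add(tuple(((pts >= a) & (pts <= b)).astype(int)))
    return len(patterns)

for n in [1, 2, 3, 4, 5]:
    print(n, interval_growth(range(n)))   # prints 2, 4, 7, 11, 16
```

The values 2, 4, 7, 11, 16, . . . equal 1 + n + n(n − 1)/2, a quantity that will reappear in Section 19.2 as the Sauer-Shelah bound with d = 2.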

Example 19.2. The VC dimension of the family of events induced by halfspaces in R² (not necessarily passing through the origin) is equal to 3. Indeed, a set of three points in general position can be shattered in all 2³ possible ways. At the same time, no set of 4 points can be shattered: for 4 points forming a convex quadrilateral, the labeling that puts the two diagonals in two different halfspaces cannot be realized (draw it!).
Example 19.3. Extrapolating hastily from the above examples, one could think that the VC dimension is closely related to the number of parameters. However, there is a classical example of a family of events parametrized by a single parameter whose VC dimension is infinite. Consider the family of events A = {At : t > 0}, with
\[
A_t = \{x \in \mathbb{R}\setminus\{0\} : \sin(xt) \ge 0\} = \bigcup_{k\in\mathbb{Z}} \Big[\frac{2k\pi}{t}, \frac{(2k+1)\pi}{t}\Big] \setminus \{0\}.
\]

One can verify that a set of any size can be shattered by this family of sets. Therefore, its VC
dimension is infinite.
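To see the shattering concretely, here is a brute-force check (a sketch added here; the points x_i = 2^{-i} and the formula for t below are one standard construction, not spelled out in the notes): for any labels y ∈ {0, 1}^k on the points x_1, . . . , x_k, the parameter t = π(1 + Σ_{j=1}^k (1 − y_j) 2^j) realizes exactly these labels through the classifier x ↦ 1{sin(xt) ≥ 0}.

```python
import numpy as np
from itertools import product

k = 8
xs = np.array([2.0 ** (-i) for i in range(1, k + 1)])   # points 1/2, 1/4, ..., 2^{-k}

all_realized = True
for labels in product([0, 1], repeat=k):
    y = np.array(labels)
    # One standard choice of parameter realizing the labels y on the points 2^{-i}.
    t = np.pi * (1 + np.sum((1 - y) * 2.0 ** np.arange(1, k + 1)))
    realized = (np.sin(t * xs) >= 0).astype(int)
    all_realized = all_realized and np.array_equal(realized, y)

print(f"all {2 ** k} labelings of {k} points realized by some A_t:", all_realized)   # True
```

Since k is arbitrary here, this is consistent with the VC dimension being infinite despite the single parameter t.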

Example 19.4. The VC dimension of the family of events induced by non-homogeneous halfspaces in R^p is equal to p + 1. For a proof of this fact, see the notes of the previous year [BZ22].

19.2 Uniform convergence and the VC dimension


In order to relate the conclusion of VC's Theorem 18.3 (or Corollary 18.4) to the VC dimension, we need to bound the growth function SA(n) in terms of the VC dimension, for general values of n. We know that for n ≤ d, SA(n) = 2^n. The following theorem gives an upper bound on SA(n) for n ≥ d. Quite surprisingly, it was shown by several authors independently around the same time. While Vapnik and Chervonenkis were motivated by uniform convergence, other authors looked at it from a different perspective. Currently there are several known techniques that can be used to prove this result.
Theorem 19.1 (Sauer-Shelah-Vapnik-Chervonenkis)
Assume that the VC dimension of A is equal to d. Then for any n ≥ d:
\[
S_{\mathcal{A}}(n) \le \sum_{i=0}^{d} \binom{n}{i}.
\]

Proof of Theorem 19.1 – We use the approach based on the shifting technique. Fix any set of
points x1 , . . . , xn in X . Set V = {(1x1 ∈A , . . . , 1xn ∈A ) : A ∈ A} ⊆ {0, 1}n . For i = 1, . . . , n consider
the shifting operator Si,V acting on (v1 , . . . , vn ) ∈ V as follows:
\[
S_{i,V}((v_1,\dots,v_n)) = \begin{cases} (v_1,\dots,v_{i-1},0,v_{i+1},\dots,v_n), & \text{if } (v_1,\dots,v_{i-1},0,v_{i+1},\dots,v_n) \notin V; \\ (v_1,\dots,v_n), & \text{otherwise.} \end{cases}
\]
In words, Si,V replaces the i-th coordinate by 0 (when it equals 1) provided this does not yield a copy of a vector that is already in V. Define Si(V) = {Si,V(v) : v ∈ V}; that is, we apply the shifting operator to all vectors in V. By construction we have |Si(V)| = |V|. Moreover, note that since V ⊆ {0, 1}^n, it can be seen as a collection of subsets of {1, · · · , n} (identifying (v1, · · · , vn) with the set {j ∈ [n] : vj = 1}).
With this view in mind, we have the following lemma.
Lemma 19.2
Any set I ⊂ {1, . . . , n} shattered by Si (V ) is also shattered by V .

Proof of Lemma 19.2 – Take any set I shattered by Si(V). If i ∉ I, then the claim follows immediately, since the shifting operator only modifies coordinate i. Otherwise, without loss of generality assume that i = 1 and I = {1, . . . , k}. Since I is shattered by S1(V), for any u ∈ {0, 1}^k there is v ∈ S1(V) such that vj = uj for j = 1, . . . , k. If u1 = 1, then both v and v′ = (0, v2, . . . , vn) belong to V, since otherwise v would have been shifted; in particular both patterns (1, u2, . . . , uk) and (0, u2, . . . , uk) are realized by elements of V. Thus, for any u ∈ {0, 1}^k there is w ∈ V such that wj = uj for j = 1, . . . , k. This means that I is also shattered by V. □

Starting from the set V , we apply shifting repeatedly to all i ∈ {1, · · · , n} until no shifts are possible.
That is, we reach the set V ′ such that Si (V ′ ) = V ′ for all i = 1, . . . , n. This happens because whenever
a nontrivial shift happens, the total number of 1-s in V decreases, so this procedure has to stop.
Finally, we prove that V ′ contains no vector with more than d 1-s. Indeed, let us assume that there
is a vector v ∈ V ′ with k > d 1-s. Then the set of these k coordinates is shattered by V ′ : it is easy to
see that otherwise shifting would have reduced the number of 1-s in v. By Lemma 19.2, this implies
that the same subset of size k > d is also shattered by V . We obtain a contradiction with the fact
that the VC dimension of A is equal to d.
Since V ′ is included in the set of vectors with at most d 1-s, we have:
\[
|V'| \le \sum_{i=0}^{d} \binom{n}{i}.
\]

The claim follows since |V | = |V ′ |. □
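As a sanity check of Theorem 19.1 (a small sketch added here), one can compare the growth function of the intervals of Example 19.1, which have VC dimension d = 2, with the Sauer-Shelah bound and with the relaxation (en/d)^d used in the proof of Theorem 19.3 below; for intervals the Sauer-Shelah bound is in fact attained.

```python
from math import comb, e

def interval_growth_formula(n):
    # For n distinct points, intervals realize exactly the contiguous blocks of points
    # (including singletons) plus the empty pattern: 1 + n + n(n-1)/2 patterns in total.
    return 1 + n + n * (n - 1) // 2

d = 2
for n in [2, 5, 10, 50]:
    sauer = sum(comb(n, i) for i in range(d + 1))
    relaxed = (e * n / d) ** d
    print(n, interval_growth_formula(n), sauer, round(relaxed, 1))
# The first two columns coincide: for intervals the bound of Theorem 19.1 is attained.
# The last column, (en/d)^d, is the further (looser) upper bound used below.
```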

We may now present a key corollary of Theorem 19.1, which generalizes the conclusion of the uniform
convergence Theorem 18.3 to families of events with finite VC dimension.

Theorem 19.3 (VC Theorem with VC dimension)


Consider a family of events A with VC dimension d. If n ≥ d, then for any t ≥ √(2/n):
\[
P\Big(\sup_{A\in\mathcal{A}} |\nu_n(A) - P(A)| \ge t\Big) \le \exp\Big(d\log\frac{8en}{d} - \frac{nt^2}{32}\Big).
\]
In particular, with probability at least 1 − δ, we have
\[
\sup_{A\in\mathcal{A}} |\nu_n(A) - P(A)| \le 4\sqrt{\frac{2}{n}\Big(d\log\frac{8en}{d} + \log\frac{1}{\delta}\Big)}.
\]

Proof of Theorem 19.3 – The proof uses the uniform convergence Theorem 18.3 together with
Theorem 19.1. We use the elementary identity, for d ≤ n:
\[
\sum_{i=0}^{d}\binom{n}{i} \le \sum_{i=0}^{d}\Big(\frac{n}{d}\Big)^{d-i}\binom{n}{i} \le \sum_{i=0}^{n}\Big(\frac{n}{d}\Big)^{d-i}\binom{n}{i} = \Big(\frac{n}{d}\Big)^{d}\Big(1+\frac{d}{n}\Big)^{n} \le \Big(\frac{en}{d}\Big)^{d}.
\]
Therefore we have
\[
\log(8 S_{\mathcal{A}}(n)) \le \log\big(8\, (en/d)^d\big) \le d \log(8en/d).
\]
The claim follows by Theorem 18.3. □

19.3 Application in classification theory

Definition 19.3 (VC dimension of a set of classifiers)


For a class F of classifiers, we define the VC dimension of F as the VC dimension of the family A′ ⊆ {0, 1}^X whose elements are A′_f := {x ∈ X : f(x) = 1} for f ∈ F.

Remark 19.5. As a consequence of Lemma 18.2, for any f⋆ ∈ F the VC dimension of F is equal to the VC dimension of A = {Af : f ∈ F} with Af = {x ∈ X : f(x) ̸= f⋆(x)}, since SA(n) = SA′(n) for all n.
By using Theorem 19.3 in Lemma 18.1, we can now generalize PAC-learnability of finite classes to
any class with finite VC dimension.

Theorem 19.4 (PAC learnability of classifiers)


Any class F with finite VC dimension d is PAC learnable by any algorithm choosing a consistent classifier in F, with sample complexity n = n(ε, δ) such that:
\[
n \ge \frac{32}{\varepsilon^2}\Big[d\log\Big(\frac{8en}{d}\Big) + \log\frac{1}{\delta}\Big]. \tag{55}
\]

Proof of Theorem 19.4 – Let f⋆ ∈ F. Applying Theorem 19.3 in Lemma 18.1 (using Remark 19.5), we get for any ε ≥ √(2/n):
\[
P(\text{there is } f \in \mathcal{F} \text{ with } R(f) \ge \varepsilon \text{ and such that } f(X_i) = f^\star(X_i) \text{ for } i = 1,\dots,n) \le \exp\Big(d\log\frac{8en}{d} - \frac{n\varepsilon^2}{32}\Big).
\]
Equivalently, with probability at least 1 − δ (the sup being taken over consistent classifiers f̂):
\[
\sup_{\hat f} R(\hat f) \le 4\sqrt{\frac{2}{n}\Big(d\log\frac{8en}{d} + \log\frac{1}{\delta}\Big)}.
\]
Hence, if the sample size n(ε, δ) is such that 4√((2/n)(d log(8en/d) + log(1/δ))) ≤ ε, which is exactly eq. (55), then sup_{f̂} R(f̂) ≤ ε. □

Example 19.6. The classes F of halfspaces in R^p, and of intervals and rays in R, are PAC learnable.
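To get a rough sense of the numbers in eq. (55) (a sketch added here with illustrative values of ε and δ; the fixed-point iteration is just one simple way to handle the implicit dependence of the right-hand side on n):

```python
import math

def sample_complexity(d, eps, delta, iters=200):
    """A simple fixed-point iteration on the right-hand side of eq. (55); returns an n
    (approximately the smallest one) with n >= (32/eps^2)(d log(8en/d) + log(1/delta))."""
    n = float(d)
    for _ in range(iters):
        n = (32 / eps ** 2) * (d * math.log(8 * math.e * max(n, d) / d) + math.log(1 / delta))
    return math.ceil(n)

print(sample_complexity(d=2, eps=0.1, delta=0.05))   # intervals in R (Example 19.1)
print(sample_complexity(d=3, eps=0.1, delta=0.05))   # halfspaces in R^2 (Example 19.2)
```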

Further reading 19.7. The Sauer-Shelah-Vapnik-Chervonenkis lemma appears independently and


in different contexts in [VC71, Sau72, She72]. Further relations between PAC learning and the VC
dimensions were made in [BEHW89]. So far we have observed that finiteness of the VC dimension implies PAC-learnability. In the notes of the previous year [BZ22], another sufficient condition for
PAC-learnability is discussed, namely the existence of a finite sample compression scheme. Relating
the existence of finite compression schemes to the VC dimension leads to important conjectures in
learning theory.

A Rest of Proof of Bochner’s Theorem
Proof of (i) ⇒ (ii) in Theorem 7.2 – We give the proof for all dimensions p ≥ 1; on a first reading one may take p = 1, for which the notation is lighter while the principle is exactly the same. We will show that for all T > 0, we have
\[
H_T(u) := \int_{[-T,T]^p} q(x)\, e^{-iu^\top x} \prod_{j=1}^{p}\Big(1 - \frac{|x_j|}{T}\Big)\, dx \ge 0. \tag{56}
\]


Let us describe how this allows us to end the proof. Note that for all x ∈ R^p, we have q(x) e^{−iu⊤x} ∏_{j=1}^p (1 − |xj|/T) → q(x) e^{−iu⊤x} as T → ∞. We can use the dominated convergence theorem (check the domination hypothesis!) to take the limit T → ∞ in eq. (56). This yields that q̂(u) ≥ 0. Let us now prove eq. (56).
It is easy to see (prove it!) that for all x ∈ R, one has
\[
\Big(1 - \frac{|x|}{T}\Big)\mathbf{1}\{|x| \le T\} = \frac{1}{T}\int_{-T/2}^{T/2} \mathbf{1}\Big\{-\frac{T}{2} - x \le \theta \le \frac{T}{2} - x\Big\}\, d\theta = \frac{1}{T}\int_{-T/2}^{T/2} \mathbf{1}\Big\{-\frac{T}{2} - \theta \le x \le \frac{T}{2} - \theta\Big\}\, d\theta.
\]

Therefore
\begin{align*}
H_T(u) &= \frac{1}{T^p}\int_{\mathbb{R}^p} q(x)\, e^{-iu^\top x} \bigg[\int_{[-T/2,T/2]^p} \prod_{j=1}^{p} \mathbf{1}\Big\{-\frac{T}{2}-\theta_j \le x_j \le \frac{T}{2}-\theta_j\Big\}\, d\theta\bigg]\, dx, \\
&\stackrel{(a)}{=} \frac{1}{T^p}\int_{[-T/2,T/2]^p} \bigg[\int_{\mathbb{R}^p} q(x)\, e^{-iu^\top x} \prod_{j=1}^{p} \mathbf{1}\Big\{-\frac{T}{2}-\theta_j \le x_j \le \frac{T}{2}-\theta_j\Big\}\, dx\bigg]\, d\theta, \\
&\stackrel{(b)}{=} \frac{1}{T^p}\int_{[-T/2,T/2]^p} \int_{[-T/2,T/2]^p} q(y-\theta)\, e^{-iu^\top (y-\theta)}\, dy\, d\theta. \tag{57}
\end{align*}

In (a) we used Fubini's theorem to change the order of the integrals, and in (b) we changed variables x = y − θ. Since q is continuous, we can approximate the integral in eq. (57) by Riemann sums. For any N ≥ 1, we partition the set [−T/2, T/2]^p into N cells C1, · · · , CN, such that each cell has volume V(Ck) = T^p/N. For each k ∈ [N], we fix an arbitrary point rk ∈ Ck. The theory of Riemann sums then yields:
\[
H_T(u) = \lim_{N\to\infty} \frac{T^p}{N^2}\sum_{k=1}^{N}\sum_{l=1}^{N} q(r_k - r_l)\, e^{-iu^\top (r_k - r_l)} = \lim_{N\to\infty} \frac{T^p}{N^2}\sum_{k,l=1}^{N} \overline{e^{iu^\top r_k}}\, q(r_k - r_l)\, e^{iu^\top r_l}. \tag{58}
\]

Since K is positive definite, the matrix (q(rk − rl))_{k,l=1}^N is positive semi-definite. Since any real symmetric matrix is also Hermitian, for all z ∈ C^N we have Σ_{k,l} z̄_k q(rk − rl) z_l ≥ 0. Applying this with z_k = e^{iu⊤r_k} in eq. (58) shows that H_T(u) ≥ 0. □
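As a numerical illustration of eq. (56) in dimension p = 1 (a sketch added here; the choice q(x) = e^{−|x|}, a positive definite function whose Fourier transform is proportional to 1/(1 + u²), and the discretization parameters are arbitrary), one can check that H_T(u) indeed stays nonnegative:

```python
import numpy as np

def H_T(u, T, q, num=20001):
    """Numerical approximation of eq. (56) in dimension p = 1 (simple Riemann sum)."""
    x = np.linspace(-T, T, num)
    integrand = q(x) * np.exp(-1j * u * x) * (1 - np.abs(x) / T)
    return np.sum(integrand) * (x[1] - x[0])

q = lambda x: np.exp(-np.abs(x))   # positive definite: its Fourier transform is >= 0

T = 5.0
us = np.linspace(-20.0, 20.0, 401)
vals = np.array([H_T(u, T, q) for u in us])
print("min real part  :", vals.real.min())           # nonnegative (up to discretization error)
print("max |imag part|:", np.abs(vals.imag).max())    # ~ 0, by symmetry of q
```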

B Some elements of number theory
We first recall some basic definitions of group theory, here specialized to the case of the group Z_p^×.

B.1 Order of a group element

Definition B.1 (Order of an element)


Let a ∈ Z_p^×. The order of a, denoted |a|, is the smallest k ≥ 1 such that a^k ≡ 1 mod p.

By Fermat's little theorem, we know that the order of any element cannot be larger than p − 1:

Theorem B.1 (Fermat’s little theorem)


Let p ≥ 2 be a prime, and a ∈ Z_p^×. Then a^{p−1} ≡ 1 mod p.

This yields the easy corollary, a particular case of Lagrange’s theorem:


Corollary B.2
Let p ≥ 2 be a prime, and a ∈ Z_p^×. Then |a| divides p − 1.

Note that this fact is a general result in group theory, a corollary of Lagrange’s theorem: the order of
each element must divide the cardinality of the group, here p − 1.
Proof of Corollary B.2 – We know that |a| ≤ p − 1. Write p − 1 = k|a| + r for the Euclidean division of p − 1 by |a|, with 0 ≤ r < |a|. Then a^{p−1} ≡ a^{k|a|+r} ≡ a^r mod p. By Fermat's little theorem, we thus have 1 ≡ a^r mod p. But since r < |a|, we must have r = 0. □

B.2 Polynomials on Zp

Theorem B.3 (Roots of a polynomial)


Let p ≥ 2 be prime, and let f be a polynomial function over Zp (i.e. the coefficients of f are in Zp )
of degree n ≥ 1. Then the equation f (x) ≡ 0 mod p has at most n solutions in Zp .

Proof of Theorem B.3 – The proof is by induction over the degree n. If n = 1, then f (x) = ax + b
with a ̸≡ 0 mod p, and it has the unique root x = −a^{−1}b. Assume that n ≥ 2 and that the claim holds for n − 1. Let f be a polynomial over Z_p of degree n. Assume that f has at least one root a ∈ Z_p (otherwise the claim holds). Then we can write f(x) = (x − a)g(x) with g a polynomial over Z_p of degree n − 1¹⁸. Since Z_p is a field (because p is prime), the roots of f are thus exactly a and the roots of g, making at most n − 1 + 1 = n solutions by the induction hypothesis. □

B.3 Wilson’s theorem

Theorem B.4 (Wilson’s theorem)


Let p ≥ 2 be prime. Then

(p − 1)! ≡ −1 mod p.

¹⁸This follows from the Euclidean division of polynomials.

Note that this identity is actually equivalent to p being prime.
Proof of Theorem B.4 – For p = 2 the claim is immediate, so assume p ≥ 3. By Theorem B.3, the only solutions to x² ≡ 1 mod p are x ≡ ±1 mod p. Therefore, we can form the (p − 3)/2 pairs {a, a^{−1}} for a ∈ Z_p^× \ {−1, 1}, which are pairwise disjoint. Since all elements of Z_p^× \ {−1, 1} fall into such a pair, and each pair has product 1, we have
\[
(p-1)! = \prod_{a \in \mathbb{Z}_p^\times} a \equiv 1 \times (-1) \times 1^{(p-3)/2} \equiv -1 \mod p. \qquad \square
\]

B.4 Z_p^× is a cyclic group

We now completely characterize the orders of the elements of Z_p^×. We need to introduce Euler's function:
Definition B.2 (Euler’s totient function)
Euler's function ϕ is the function from N_{>0} to N_{>0} that maps each integer n ≥ 1 to the number of m ∈ {1, · · · , n} such that m and n are coprime.

In particular ϕ(n) ≥ 1, and by definition ϕ(p) = p − 1 if and only if p is prime. We moreover have the
following property of Euler’s function:
Proposition B.5
For any n ≥ 1 we have
\[
\sum_{d|n} \phi(d) = n.
\]

Proof of Proposition B.5 – For any d|n, we denote A(d) := {k ∈ [1, n] : gcd(k, n) = d}. Note
that k ∈ A(d) ⇔ k = dl for some l ∈ [1, n/d] which is coprime with n/d. Therefore |A(d)| = ϕ(n/d).
Moreover, the sets {A(d), d|n} are pairwise disjoint and their union is [1, n]. Therefore we have
\[
\sum_{d|n} \phi(n/d) = \sum_{d|n} \phi(d) = n,
\]

since if d ranges over the divisors of n, so does n/d. □

We can now prove the main result of this Appendix:


Lemma B.6 (Order of elements of Z_p^×)
Let p ≥ 2 be a prime number, and d ≥ 1 such that d|(p − 1). There are exactly ϕ(d) elements in Z_p^× with order d.

In particular Z_p^× is what we call a cyclic group, i.e. there is an element of order p − 1 (actually ϕ(p − 1) of them), that is, an element whose powers generate the whole group!
Proof of Lemma B.6 – If a ∈ Z_p^× has order d|(p − 1), then it is a root of the polynomial equation
\[
x^d - 1 \equiv 0 \mod p.
\]
By Theorem B.3, there are at most d solutions to this equation, and since a, a², · · · , a^d = 1 are all distinct solutions, they form all the solutions. Therefore the set {a^k, 1 ≤ k ≤ d} must contain all the elements of order d. However, one checks easily that for all k ∈ [1, d], |a^k| = d ⇔ d and k are coprime.

Therefore we have shown that if there is at least one element of order d, then there must be exactly
ϕ(d) elements of order d. Moreover, since all elements of Z_p^× have an order,
\[
p - 1 = \sum_{d|(p-1)} \#\{a \in \mathbb{Z}_p^\times : |a| = d\}. \tag{59}
\]
We have thus shown that each term of the sum on the right-hand side of eq. (59) can only be 0 or ϕ(d). However, by Proposition B.5, we know
\[
\sum_{d|(p-1)} \phi(d) = p - 1. \tag{60}
\]
Therefore, the only possibility is that #{a ∈ Z_p^× : |a| = d} = ϕ(d) for all d|(p − 1). □
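The statements of this appendix are easy to verify numerically for a small prime (a quick sketch added here, with p = 23 as an arbitrary choice): the code checks Fermat's little theorem and Wilson's theorem, and counts, for each divisor d of p − 1, the elements of order d, recovering ϕ(d) as in Lemma B.6.

```python
from math import gcd

def phi(n):
    return sum(1 for m in range(1, n + 1) if gcd(m, n) == 1)

def order(a, p):
    k, x = 1, a % p
    while x != 1:
        x = (x * a) % p
        k += 1
    return k

p = 23
# Fermat's little theorem and Wilson's theorem.
assert all(pow(a, p - 1, p) == 1 for a in range(1, p))
fact = 1
for a in range(1, p):
    fact = (fact * a) % p
assert fact == p - 1          # (p-1)! = -1 mod p

# Lemma B.6: exactly phi(d) elements of order d, for each divisor d of p - 1.
orders = [order(a, p) for a in range(1, p)]
for d in range(1, p):
    if (p - 1) % d == 0:
        print(d, orders.count(d), phi(d))   # the last two columns coincide
```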

References
[ALMT14] D. Amelunxen, M. Lotz, M. B. McCoy, and J. A. Tropp. Living on the edge: phase
transitions in convex programs with random data. Information and Inference, available
online, 2014.

[Alo86] N. Alon. Eigenvalues and expanders. Combinatorica, 6:83–96, 1986.

[AM85] N. Alon and V. Milman. Isoperimetric inequalities for graphs, and superconcentrators.
Journal of Combinatorial Theory, 38:73–88, 1985.

[Bac21] Francis Bach. Learning theory from first principles. Online version, 2021.

[Ban16] Afonso S. Bandeira. Ten lectures and forty-two open problems in the mathematics of data science. Available online at: http://www.cims.nyu.edu/~bandeira/TenLecturesFortyTwoProblems.pdf, 2016.

[BEHW89] Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K Warmuth. Learn-
ability and the vapnik-chervonenkis dimension. Journal of the ACM (JACM), 36(4):929–
965, 1989.

[BFMW13] Afonso S Bandeira, Matthew Fickus, Dustin G Mixon, and Percy Wong. The road to
deterministic matrices with the restricted isometry property. Journal of Fourier Analysis
and Applications, 19(6):1123–1149, 2013.

[BSS23] A. S. Bandeira, A. Singer, and T. Strohmer. Mathematics of data science. Book draft
available here, 2023.

[BZ22] Afonso S. Bandeira and Nikita Zhivotovskiy. Mathematics of machine learning. Lecture
notes available here, 2022.

[Che70] J. Cheeger. A lower bound for the smallest eigenvalue of the Laplacian. Problems in
analysis (Papers dedicated to Salomon Bochner, 1969), pp. 195–199. Princeton Univ.
Press, 1970.

[Chr16] Ole Christensen. An Introduction to Frames and Riesz Bases. Birkhäuser, 2016.

[Chu10] F. Chung. Four proofs for the cheeger inequality and graph partition algorithms. Fourth
International Congress of Chinese Mathematicians, pp. 331–349, 2010.

[CRPW12] V. Chandrasekaran, B. Recht, P.A. Parrilo, and A.S. Willsky. The convex geometry of
linear inverse problems. Foundations of Computational Mathematics, 12(6):805–849, 2012.

[FM15] Matthew Fickus and Dustin G Mixon. Tables of the existence of equiangular tight frames.
arXiv preprint arXiv:1504.00253, 2015.

[FR13] S. Foucart and H. Rauhut. A Mathematical Introduction to Compressive Sensing.


Birkhauser, 2013.

[Gub18] John A Gubner. Derivation of the Fourier inversion formula, Bochner’s theorem, and
Herglotz’s theorem, 2018.

[GZ84] Evarist Giné and Joel Zinn. Some limit theorems for empirical processes. The Annals of
Probability, pages 929–989, 1984.

[Hoe63] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal
of the American Statistical Association, 58(301):13–30, 1963.

[Kat04] Yitzhak Katznelson. An introduction to harmonic analysis. Cambridge University Press,
2004.

[Kol29] A Kolmogoroff. Über das gesetz des iterierten logarithmus. Mathematische Annalen,
101(1):126–135, 1929.

[Lan67] HJ Landau. Sampling, data transmission, and the nyquist rate. Proceedings of the IEEE,
55(10):1701–1706, 1967.

[Llo82] S. Lloyd. Least squares quantization in PCM. IEEE Trans. Inf. Theor., 28(2):129–137,
1982.

[Mix] D. G. Mixon. Short, Fat matrices BLOG.

[Mix12] Dustin G. Mixon. Sparse signal processing with frame theory. PhD Thesis, Princeton
University, also available at arXiv:1204.5958[math.FA], 2012.

[Sau72] Norbert Sauer. On the density of families of sets. Journal of Combinatorial Theory, Series
A, 13(1):145–147, 1972.

[She72] Saharon Shelah. A combinatorial problem; stability and order for models and theories in
infinitary languages. Pacific Journal of Mathematics, 41(1):247–261, 1972.

[SS03] Elias M. Stein and Rami Shakarchi. Fourier Analysis: An Introduction. Princeton Lectures
in Analysis, Princeton University Press, 2003.

[VB04] L. Vanderberghe and S. Boyd. Convex Optimization. Cambridge University Press, 2004.

[VC71] Vladimir Naumovich Vapnik and Aleksei Yakovlevich Chervonenkis. On uniform con-
vergence of the frequencies of events to their probabilities. Teoriya Veroyatnostei i ee
Primeneniya, 16(2):264–279, 1971.

[Ver18] Roman Vershynin. High-dimensional probability: An introduction with applications in


data science, volume 47. Cambridge university press, 2018.

[VPG15] Vladimir Vovk, Harris Papadopoulos, and Alexander Gammerman. Measures of Com-
plexity. Springer, 2015.

[Wel74] Lloyd Welch. Lower bounds on the maximum cross correlation of signals (corresp.). IEEE
Transactions on Information theory, 20(3):397–399, 1974.

