
3. A Primer on Generalization Theory

In this chapter, we introduce the notation and definitions related to generalization theory. We also put into context the framework of information-theoretic generalization with respect to other frameworks and motivate its adoption.
In Section 3.1, we start by describing the notation commonly employed in learning theory. After that, in Sections 3.2 and 3.3, we review classical approaches to generalization based on the study of the complexity of the hypothesis class. These sections include results based on the uniform convergence (or Glivenko–Cantelli) property of the class, and on its Rademacher complexity. These frameworks provide generalization guarantees for the whole hypothesis class, which is often a strong requirement. This leads us to the introduction of the minimum description length (MDL) principle and the associated parsimonious philosophical postulate of Occam's razor in Section 3.4.
The MDL principle provides guarantees for every hypothesis in a finite class, but these guarantees do not explicitly take into account the algorithm that returns said hypothesis. Hence, we discuss algorithmic stability next in Section 3.5, which gives guarantees for algorithms with a certain stability property. In particular, we first discuss the uniform stability framework, as it is very successful in providing deterministic algorithms with generalization guarantees. After that, we also discuss how privacy can be interpreted as a stability property of an algorithm, and therefore be linked to generalization. We conclude the section by discussing how one can employ different information measures to define an algorithm's stability. In this last subsection, we note that the “privacy as stability” interpretation can also be formalized in terms of information measures and how this particular way of understanding stability gives guarantees for particular algorithms and particular classes of data distributions.
Equipped with this knowledge, in Section 3.6, we describe the information-
theoretic framework for generalization. This framework can be seen as encompass-
ing the framework of algorithmic stability with information measures and therefore
also takes into account both the algorithm and the data distribution. Therefore,
the framework relaxes the requirements for the guarantees, and they do not need
to hold for every hypothesis in a class and every data distribution. This allows
the information-theoretic framework for generalization to yield tighter guarantees
than previous frameworks, although at the price of being more individualized, and
it motivates its adoption in this thesis. We conclude this section with a description of the different kinds of information-theoretic generalization guarantees based on their specificity level: either guarantees in expectation or in probability (both PAC-Bayesian and single-draw PAC-Bayesian).


Finally, in Section 3.7, the chapter concludes with a justification of why, for
parameterized models, one can express the results based on the parameters of the
models instead of their resulting hypothesis. In particular, this justification is
important for this thesis as the results hereafter will be presented in this form.
The intention of this chapter is to give a brief introduction to the field of generalization theory and the notation that we will employ to refer to the topics within the field. We also want to discuss the different theoretical frameworks to study generalization, along with their trade-offs, in order to clarify our reasoning for studying the information-theoretic generalization framework and how we position it with respect to other frameworks. Experts on the topic may choose to skip this chapter. Readers newer to the topic can benefit from digesting the concepts and intuitions given here to better understand the following chapters and to get a taste of the field and the relevant literature before delving deeper into it.

3.1 A Formal Model of Learning


Consider a problem we want to solve and a set of hypotheses H to solve the problem.
The instances (or examples) z of the problem lie in a space Z and are distributed
according to a distribution PZ . The performance of a hypothesis h ∈ H on an
instance z ∈ Z can be evaluated with a loss function ℓ : H × Z → R+ . Larger
values of the loss function denote a worse performance and a value of zero means
perfect performance. To solve the problem, we want to find a hypothesis h that
minimizes the population risk R(h). The population risk is defined as the expected
value of the loss of the hypothesis h on instances of the problem, namely

R(h) := E[ℓ(h, Z)].

For example, a common type of problem is the supervised learning problem. Here, the instances z = (x, y) are tuples formed by a feature and a label (or target), and the hypotheses h : X → Y are functions that return a label when given a feature. Then, the loss functions are often of the type ℓ(h, z) = ρ(h(x), y), where ρ is a measure of dissimilarity between the predicted label h(x) and the true label y associated to the feature x. Traditionally, a large volume of the theory has focused on classification tasks where the labels' set consists of k classes Y = [k] and the loss function is the 0–1 loss ℓ(h, z) = I_{E_h}(x, y), where E_h = {(x, y) ∈ X × Y : h(x) ≠ y}.
In a learning problem, we assume that we have access to a sequence of n data
instances s := (z1 , · · · , zn ) ∈ Z n , or training set. Usually, we assume that these
instances are all independent and identically distributed (i.i.d.); that is, they are
sampled from PZ⊗n . Unless we explicitly state it otherwise, this will also be assumed
throughout the thesis. Then, a learning algorithm A is a (possibly randomized) mechanism that generates a hypothesis H as a solution to the problem given a training set s. The algorithm A is characterized by the Markov kernel P_H^S that, for a given training dataset s, returns a distribution P_H^{S=s} on the hypothesis space.
An algorithm generalizes if the population risk of the hypothesis that it generates


is small. Then, a quantity of interest is the excess risk of a hypothesis h, which is


defined as the difference between its population risk and the smallest population risk of a hypothesis in that class, that is

excess(h, H) := R(h) − inf_{h⋆ ∈ H} R(h⋆).

Therefore, small values of excess(h, H) guarantee small values of the population risk of a hypothesis h relative to the optimal risk achievable in that class.
Often, we do not have full knowledge of the distribution PZ , so calculating the
population risk or the excess risk is not feasible. However, a good proxy of the
population risk is the empirical risk R̂(h, s), which is defined as the average loss of a hypothesis h on the samples from the training set s, namely

R̂(h, s) := (1/n) Σ_{i=1}^n ℓ(h, z_i).

Indeed, for every fixed hypothesis h, the empirical risk is an unbiased estimator of the population risk, that is, E[R̂(h, S)] = R(h). For this reason, many learning algorithms attempt to return a hypothesis minimizing the empirical risk, also known as performing empirical risk minimization (ERM). This is still a complicated task that only results in a low population risk if the difference between the population and empirical risks is small. Consequently, many generalization bounds, or bounds on the population risk, are obtained by bounding the generalization error (or generalization gap)

gen(h, s) := R(h) − R̂(h, s).

More precisely, consider the decomposition

R(h) = R̂(h, s) + gen(h, s).    (3.1)

Then, a bound on the generalization error gen(h, s) directly gives a computable bound on the population risk. Hence, if a hypothesis h has a small empirical risk and a small generalization error, we can say that the hypothesis h generalizes.
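To make these quantities concrete, the following minimal Python sketch (with purely illustrative choices: a one-dimensional threshold hypothesis and a synthetic Gaussian data distribution standing in for P_Z) estimates the population risk by Monte Carlo and contrasts it with the empirical risk on a small training set.

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_data(n):
        # Synthetic stand-in for P_Z: label y in {0, 1}, feature x ~ N(y, 1).
        y = rng.integers(0, 2, size=n)
        x = rng.normal(loc=y.astype(float), scale=1.0)
        return x, y

    def empirical_risk(threshold, x, y):
        # Average 0-1 loss of the threshold hypothesis h(x) = 1{x >= threshold}.
        return float(((x >= threshold).astype(int) != y).mean())

    x_train, y_train = sample_data(50)          # training set s of n = 50 instances
    h = 0.3                                     # a fixed hypothesis (e.g. picked on s)

    x_test, y_test = sample_data(1_000_000)     # large fresh sample to approximate R(h)
    pop_risk = empirical_risk(h, x_test, y_test)
    emp_risk = empirical_risk(h, x_train, y_train)

    print(f"empirical risk      ~ {emp_risk:.3f}")
    print(f"population risk     ~ {pop_risk:.3f}")
    print(f"generalization gap  ~ {pop_risk - emp_risk:+.3f}")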
The generalization error is also useful to bound the excess risk from above. Note
that the excess risk can be decomposed as

excess(h, H) = gen(h, s) + [R̂(h, s) − inf_{h̃_s ∈ H} R̂(h̃_s, s)] + [inf_{h̃_s ∈ H} R̂(h̃_s, s) − inf_{h⋆ ∈ H} R(h⋆)],    (3.2)

where the first bracketed term is the optimization error and the second is the approximation error.

Generally, the approximation error is non-positive as the ERM achieves a better


empirical risk than the smallest possible population risk. Therefore, if the opti-
mization error can be sufficiently controlled, bounding the generalization error is
sufficient to bound the excess risk.


3.2 Probably Approximately Correct Learning


The Probably Approximately Correct (PAC) framework from Valiant [97] was orig-
inally formulated for the 0–1 loss for classification tasks in the realizable setting.
This setting assumes that there exists a hypothesis h⋆ ∈ H such that R(h⋆) = 0. In this setting, every ERM h_ERM achieves zero empirical risk, that is, R̂(h_ERM, S) = 0 almost surely.
Informally, the framework states that a hypothesis class H is PAC learnable if,
for all α, β ∈ (0, 1), there exists a function nH : (0, 1)2 → N and a learning algorithm
A such that, after observing n ≥ nH (α, β) samples, it returns a hypothesis that
is approximately correct (has a population risk smaller than α) with probability at
least 1−β. If such an algorithm exists, then it is called a PAC learning algorithm for
H. PAC learnability can be extended outside the realizable setting and to general
loss functions [98, Chapter 3].
Definition 3.1. A hypothesis class H is agnostic PAC learnable if, for all α ∈ R+
and β ∈ (0, 1), and for every distribution PZ , there exists a function nH : R+ ×
(0, 1) → N and a learning algorithm A such that, after observing n ≥ nH (α, β)
samples from PZ , it returns a hypothesis h such that, with probability at least 1 − β,
excess(h, H) ≤ α.
The function nH : R+ × (0, 1) → N is called the sample complexity of learning
H and determines how many examples are required to guarantee a PAC solution.
To be precise, for an agnostic PAC learnable hypothesis class H, there are multiple functions n_H : R+ × (0, 1) → N that satisfy the requirements given in Definition 3.1. Hence, the sample complexity generally refers to the minimal of these functions.
Generally, PAC learning theory focuses on the study and bounding from above
of the sample complexity of hypothesis classes. However, we often have access to a
fixed, finite number of samples n and want to understand the amount of error that
our algorithm will suffer. Therefore, in order to have a better comparison between
the generalization guarantees of the different frameworks, instead of focusing on the
sample complexity, we will study a complementary concept: the error complexity
αH : N × (0, 1) → R+ , which is defined as the generalized inverse of the sample
complexity for a fixed confidence β, that is
αH (n, β) = inf{α ∈ R+ : nH (α, β) ≤ n}.
In this way, the generalization guarantees from the PAC learning framework
will be formulated as follows: for all n ∈ N and β ∈ (0, 1), there exists a function
αH : N × (0, 1) → R+ and a learning algorithm A such that, after observing n
samples, it returns a hypothesis h such that, with probability at least 1 − β,
excess(h, H) ≤ αH (n, β).
That is, the error complexity is an upper bound on the excess risk. If the error
complexity αH (n, β) is non-increasing in n and limn→∞ αH (n, β) = 0 for all β ∈
(0, 1), then the hypothesis class H is agnostic PAC learnable.
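As a small illustration of this inversion, the sketch below assumes a hypothetical sample complexity of the form n_H(α, β) = ⌈log(2|H|/β)/(2α²)⌉ (the finite-class rate of the next subsection) and recovers the corresponding error complexity numerically on a grid.

    import math

    def sample_complexity(alpha, beta, card_H):
        # Hypothetical n_H(alpha, beta) for a finite class (Hoeffding + union bound rate).
        return math.ceil(math.log(2 * card_H / beta) / (2 * alpha ** 2))

    def error_complexity(n, beta, card_H, grid=10_000):
        # Generalized inverse: smallest alpha (on a grid) such that n_H(alpha, beta) <= n.
        for k in range(1, grid + 1):
            alpha = k / grid
            if sample_complexity(alpha, beta, card_H) <= n:
                return alpha
        return 1.0

    for n in (100, 1_000, 10_000):
        print(n, error_complexity(n, beta=0.05, card_H=1_000))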


3.2.1 Uniform Convergence


Above, we saw that for a hypothesis class H to be agnostic PAC learnable, there
needs to exist an algorithm such that, for every data distribution PZ , it attains an
arbitrarily small population risk when given a sufficiently large number of instances.
Similarly, the uniform convergence property of a hypothesis class H states that,
for every data distribution PZ , the absolute generalization error of every hypothesis
in the class H is arbitrarily small for a sufficiently large number of instances [98,
Chapter 4].
Definition 3.2. A hypothesis class H has the uniform convergence property (or
is a uniformly Glivenko–Cantelli class [99]) if, for all n ∈ N and β ∈ (0, 1), for
every distribution PZ , for every dataset S of n instances sampled from PZ , and for
every hypothesis h ∈ H, there exists a function αUC : N × (0, 1) → R+ such that
limn→∞ αUC (n, β) = 0 and that, with probability at least 1 − β,

gen(h, S) ≤ αUC (n, β).

Therefore, based on the decomposition (3.1), the uniform convergence property


guarantees that the population risk will be close to the observed empirical risk
regardless of the learning algorithm used and the underlying data distribution.
Moreover, based on the decomposition (3.2), controlling the optimization error is
sufficient to ensure that the uniform convergence property also gives guarantees
on the excess risk. In fact, if a hypothesis class H has the uniform convergence
property, then the class is agnostic PAC learnable and the ERM algorithm is an
agnostic PAC learner for H [98, Corollary 4.4]. Moreover, for binary classification
problems, if a hypothesis class is agnostic PAC learnable, then it has the uniform
convergence property [98, Theorem 6.7], although this is not true for more general
problems [100, Section 4].
For instance, if the loss function has a range contained in [a, b] and the hypothesis
class is finite |H| < ∞, then it can be shown that H has the uniform convergence
property by the union bound and Hoeffding’s inequality [98, Chapter 4]. More
precisely, in this case

αUC(n, β) ≤ (b − a) √( 2 log(2|H|/β) / n ).    (3.3)
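The bound in (3.3) is straightforward to evaluate. The sketch below (all numbers are purely illustrative) computes it for a small finite class and for the 2^{64d} discretization of a d-parameter model discussed next, showing how quickly the guarantee degrades as d grows.

    import math

    def finite_class_bound(n, beta, log_card_H, loss_range=1.0):
        # Right-hand side of (3.3), written in terms of log|H|:
        # (b - a) * sqrt(2 * log(2|H| / beta) / n).
        return loss_range * math.sqrt(2 * (math.log(2) + log_card_H + math.log(1 / beta)) / n)

    # A small finite class: |H| = 1000 hypotheses, n = 10_000 samples.
    print(finite_class_bound(n=10_000, beta=0.05, log_card_H=math.log(1_000)))

    # A model with d parameters stored at 64-bit precision: log|H| <= 64 d log 2.
    for d in (10, 1_000, 1_000_000):
        print(d, finite_class_bound(n=60_000, beta=0.05, log_card_H=64 * d * math.log(2)))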
Practically, this can be extended to real-valued parameterized hypothesis classes. Namely, if the class is parameterized with d parameters and the computer uses 64-bit precision, the number of hypotheses is upper bounded by 2^{64d} and log |H| ≤ 64d log 2. Unfortunately, the requirements for this property are too strict for current learning algorithms such as parameterized, differentiable networks, where the number of parameters d is very large [101, 102]. However, it seems that for interpolating algorithms, that is, algorithms that output a hypothesis h such that R̂(h, S) = 0 a.s., it is still possible to avoid the problems outlined by Nagarajan and Kolter

[102]. The idea is to consider a surrogate hypothesis h̃ that belongs to a structural Glivenko–Cantelli hypothesis class (a weaker notion of uniform convergence for sequences of learning problems) and study its deviation from the original hypothesis [103].

3.2.2 Vapnik–Chervonenkis and Natarajan Dimensions


In the previous subsection, we established that finite hypothesis classes H are agnostic PAC learnable and that their generalization gap is in O(√(log(|H|/β)/n)) with probability at least 1 − β. However, we know that there are infinite hypothesis classes that also have a vanishing generalization gap. The Vapnik–Chervonenkis (VC) and the Natarajan dimensions give a sharp characterization of this kind of hypothesis classes for classification problems.
Let us focus first on the binary classification setting. In this setting, a hypothesis
corresponds to a function h from X to Y = {0, 1}, where we recall that Z = X × Y.
The maximum number of distinct ways of classifying n instances using hypotheses
in a hypothesis class H is known as the growth function ΠH (n) of that class, and
provides us with a measure of the richness of the class H.

Definition 3.3. The growth function ΠH : N → N of a hypothesis class H is

ΠH(n) := max_{{x_1,...,x_n} ∈ X^n} |{ (h(x_1), . . . , h(x_n)) : h ∈ H }|.

The growth function ΠH (n) of a hypothesis class H can be used to bound the
absolute generalization error and, hence, establish a uniform convergence property
of the class [104, Section 3.2]. More precisely, if a hypothesis class H has a growth
function ΠH (n), then
αUC(n, β) = √( 8 log(4 ΠH(2n)/β) / n ).    (3.4)

However, calculating the growth function is cumbersome, as it requires the


calculation of ΠH(n) for each n ∈ N by definition. The growth function has a trivial bound of ΠH(n) ≤ 2^n. This bound does not provide us with any interesting results on generalization as (3.4) becomes vacuous. Nonetheless, it is useful to define the concept of shattering. A set {x_1, . . . , x_n} of n instances is said to be shattered by a hypothesis class H if the instances of the set can be classified by elements in the class H in every possible way, that is, if ΠH(n) = 2^n. The VC dimension of a hypothesis class H is defined as the largest size of a set shattered by H. More precisely,

VCdim(H) = sup{ n ∈ N : ΠH(n) = 2^n }.





The Sauer–Shelah–Perles lemma [98, Lemma 6.10] bounds the growth function
of a hypothesis class H in terms of its VC dimension. Formally, it states that

ΠH(n) ≤ Σ_{i=0}^{VCdim(H)} C(n, i) ≤ 2^{VCdim(H)} if n ≤ VCdim(H), and ΠH(n) ≤ (en/VCdim(H))^{VCdim(H)} otherwise, where C(n, i) denotes the binomial coefficient.

Hence, the VC dimension characterizes the uniform convergence of a hypothesis


class and, since we are considering the binary classification problem, also its agnostic
PAC learnability. To be more exact, there exist absolute constants c1 and c2 such
that [98, Theorem 6.8]
c1 · √( (VCdim(H) + log(1/β)) / n ) ≤ αUC(n, β) ≤ c2 · √( (VCdim(H) + log(1/β)) / n ).
The VC dimension is particularly useful since it allows us to obtain uniform
convergence generalization bounds for infinite hypothesis classes. A canonical ex-
ample is the class of threshold functions H = {I{x≤a} : a ∈ R}. Clearly, there are
an infinite number of thresholds on the real line. However, their VC dimension is 1,
and therefore they have the uniform convergence property. Other classical examples of hypothesis classes with finite VC dimension are intervals, hyperplanes, or axis-aligned rectangles [98, 104].
The notion of VC dimension can be extended to non-binary classification prob-
lems with k classes [98, Chapter 29]. A set A ⊆ X is shattered by the hypothesis
class H if there exist two functions f0 , f1 : A → [k] such that
• For every x ∈ A, f0 (x) ̸= f1 (x) .
• For every subset B ⊆ A, there exists a function h ∈ H such that h(x) = f0 (x)
for all x ∈ B and h(x) = f1 (x) for all x ∈ A \ B.
Then, the Natarajan dimension Ndim(H) of a hypothesis class H is defined as
the maximal size of a set shattered by H. Moreover, the Natarajan dimension
also characterizes the uniform convergence and the agnostic PAC learnability of a
general classification problem. More precisely, there exist absolute constants c1 and
c2 such that [98, Theorem 29.3]
c1 · √( (Ndim(H) + log(1/β)) / n ) ≤ αUC(n, β) ≤ c2 · √( (Ndim(H) log k + log(1/β)) / n ).
In the realizable setting, both the VC and the Natarajan dimension of a hy-
pothesis class H characterize more strongly the uniform convergence and the PAC
learnability of a general classification problem. More precisely, there exist absolute constants c1 and c2 such that [98, Theorem 29.3]

c1 · (Ndim(H) + log(1/β)) / n ≤ αUC(n, β) ≤ c2 · (Ndim(H)/n) · W( (1/β)^{1/Ndim(H)} · kn / (c2 Ndim(H)) ),


where W is the Lambert function, and the 0 branch can be bounded from above as W(x) ≤ log(x + 1) [105, Theorem 2.3]. Therefore, in the realizable setting the uniform convergence of a class is of the order (Ndim(H) + log(1/β))/n up to logarithmic terms.
Unfortunately, the VC or the Natarajan dimensions still do not help us to characterize the generalization behavior of deep learning. For a feed-forward network with sigmoid activation functions, p parameters, and c connections between the parameters, the VC dimension is lower bounded in Ω(c^2) and upper bounded in O(p^2 c^2) [98, Chapter 20].

3.3 Rademacher Complexity


The requirements needed for the PAC learning and uniform convergence based
guarantees are very strong. They need to hold simultaneously for all hypotheses in
a hypothesis class H and all data distributions PZ , independently of the draw of the
training data S. A first relaxation of these requirements comes from the empirical Rademacher complexity of a hypothesis class H with respect to a training dataset S [106, 107].

Definition 3.4. The empirical Rademacher complexity of a hypothesis class H with respect to a fixed training set s = (z_1, . . . , z_n) of n instances and a loss function ℓ is

Rad(ℓ ◦ H, s) := E[ sup_{h∈H} (1/n) Σ_{i=1}^n R_i · ℓ(h, z_i) ],

where the R_i are independent and identically distributed random variables such that P[R_i = −1] = P[R_i = 1] = 1/2 for all i ∈ [n]. Such variables are known as Rademacher random variables.
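For a small finite class, the expectation over the Rademacher variables in Definition 3.4 can be approximated by Monte Carlo. The sketch below does so for the threshold class of Section 3.2.2 under the 0–1 loss; the dataset, the threshold grid, and the number of draws are all illustrative choices.

    import numpy as np

    rng = np.random.default_rng(0)

    def empirical_rademacher(loss_matrix, num_draws=2_000):
        # loss_matrix[j, i] = loss of hypothesis j on training instance i.
        # Monte Carlo estimate of E[ sup_h (1/n) sum_i R_i * l(h, z_i) ].
        n = loss_matrix.shape[1]
        values = []
        for _ in range(num_draws):
            r = rng.choice([-1.0, 1.0], size=n)     # Rademacher random variables
            values.append(np.max(loss_matrix @ r) / n)
        return float(np.mean(values))

    # Toy training set and threshold class h_a(x) = 1{x <= a} with the 0-1 loss.
    x = rng.normal(size=50)
    y = rng.integers(0, 2, size=50)
    thresholds = np.linspace(-2.0, 2.0, 41)
    loss_matrix = np.array([((x <= a).astype(int) != y).astype(float) for a in thresholds])

    print("estimated Rad(l o H, s):", empirical_rademacher(loss_matrix))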

The Rademacher complexity can be understood in two different, but equally


valid ways:

1. As a measure of richness of the hypothesis class by measuring the degree to


which it can fit random noise [104, Section 3.1]. More precisely, let R :=
(R1 , · · · , Rn ) and ℓs (h) := (ℓ(h, z1 ), . . . , ℓ(h, zn )) be the vectors describing
the uniform noise from the Rademacher random variables and the losses of a
hypothesis h on the instances of the training set s. Then, the Rademacher
complexity can be written as

Rad(ℓ ◦ H, s) = E[ sup_{h∈H} (1/n) R^⊺ ℓ_s(h) ],

where the inner product R⊺ ℓs (h) measures the correlation between the ran-
dom noise and the losses. Thus, the Rademacher complexity measures how
well, on average, the hypothesis class correlates with random noise. Richer
or more complex hypothesis classes H can generate a wider variety of loss vectors and correlate


better with random noise. This understanding is not unique to Rademacher


random variables, and there is an analogue to the Rademacher complexity
with standard Gaussian random variables named Gaussian complexity, and
the two are equivalent up to constants [83, Exercise 5.5] [107]. As in Sec-
tion 3.2.1, more complex hypothesis classes will have a larger generalization
error than simpler ones.
2. As a measure of the discrepancy between fictitious training and test sets [98,
Section 26.1]. Indeed, for a fixed training set s, one can construct a fictitious
training set as S1 = {zi ∈ s : Ri = 1} and a fictitious test set as S2 =
{zi ∈ s : Ri = −1}, where the randomness of the sets only comes from their
construction using the Rademacher random variables. Then, the Rademacher
complexity calculates what is, on average, the worst difference between these
two sets. Namely, we may write

Rad(ℓ ◦ H, s) = E[ sup_{h∈H} (1/n) ( Σ_{z∈S_1} ℓ(h, z) − Σ_{z∈S_2} ℓ(h, z) ) ].

In this way, the connection with generalization becomes apparent.


In both interpretations it is clear that the Rademacher complexity depends
on the training dataset s, and that the larger the number of samples, the better
it represents either (i) the richness or complexity of the hypothesis class, or (ii)
the discrepancy between fictitious training and test sets. This is reflected in the
following theorem relating the Rademacher complexity with the generalization error
of hypotheses from a class H.
Theorem 3.1 ([104, Theorem 3.3]). Consider a loss function with a range contained in [a, b]. Then, for all β ∈ (0, 1) and all h ∈ H, with probability no less than 1 − β,

gen(h, S) ≤ 2 Rad(ℓ ◦ H, S) + (b − a) √( 9 log(2/β) / (2n) ).
The Rademacher complexity can be used to derive non-vacuous generalization
bounds for important hypothesis classes such as support vector machines (SVMs)
and other kernel based hypotheses [98, Chapter 26] [104, Chapters 5 and 6]. How-
ever, even if the empirical Rademacher complexity is data dependent and could
in theory be calculated with the training dataset, the expectation with respect to the Rademacher random variables R_i requires performing 2^n empirical risk minimizations. This is computationally hard for some hypothesis classes, and often one resorts to results either using the expected Rademacher complexity or bounding this empirical measure with the growth function or the VC dimension, hence losing the advantage of having a data-dependent measure.
The usage of the Rademacher complexity to explain the generalization of deep
learning models seems to be complicated. Even if the measure takes advantage


of the collected training dataset, it is still a quantity that holds uniformly for all
hypotheses in a class H, which is a very strong requirement. For example, although
there are works that obtain upper bounds on the Rademacher complexity of feed-
forward networks that depend on different norms of the weights [108–112], these
are still not tight enough to characterize their generalization. Moreover, there are
reasons to believe that this measure will not be sufficient for this task. Indeed, the
first interpretation of the Rademacher complexity of the class H describes how well
the hypotheses from the class can fit random binary label assignments, and it has
been shown that parameterized, differentiable networks can perfectly fit random
labels [101]. Hence, the Rademacher complexity is expected to be close to 1/2 for binary classification with the 0–1 loss, so that the bound of Theorem 3.1 is close to 1 and therefore trivial.

3.4 Minimum Description Length and Occam’s Razor


So far, all frameworks to guarantee the generalization of a hypothesis h only de-
pended on the hypothesis class H and held simultaneously for all hypotheses in
the class. When the hypothesis class is “simple”, this leads to good generalization
guarantees, but for more “complex” classes the guarantees are vacuous.
A first step towards considering a more “per hypothesis” specialized measure
of complexity is done in structural risk minimization (SRM) [28, 113]. The idea behind SRM is to decompose a complex hypothesis class H into a countable union of hypothesis classes H = ∪_{k=1}^∞ H_k such that H_k ⊆ H_{k+1} for all k ∈ N and where the complexity of each class H_k is non-decreasing in k. The complexity of the class H_k can be measured with any of the previous methods, either uniform convergence [98, Section 7.2] or the Rademacher complexity [104, Section 4.3]. Then, the bounds on the population risk of a hypothesis h depend on the class H_k it belongs to: they trade off a smaller empirical risk for more complex classes (or larger k) against a larger uniform convergence or Rademacher complexity term.
A further step is given by the minimum description length (MDL) principle [29,
91] [98, Section 7.3], where a different generalization guarantee is given to each
hypothesis depending on the “preference” given to it. To be more precise, consider a countable hypothesis class such that H = ∪_{k=1}^∞ {h_k}. Furthermore, assume that we have a preference for each hypothesis h_k determined by a probability distribution Q: larger values of Q[h_k] mean a larger preference for that hypothesis. Then, an application of Hoeffding's inequality and the union bound yields that, for all losses with a range contained in [a, b] and all hypotheses h_k ∈ H, with probability no less than 1 − β [33, Theorem 2] [30–32, 34, 35],

gen(h_k, S) ≤ √( (− log Q[h_k] + log(1/β)) / (2n) ).    (3.5)
A difficult question is how to select the preference over the hypotheses in the
class H when no prior knowledge is available. For instance, if the hypothesis class is
finite |H| < ∞ and we have no preference for any hypothesis (that is, we consider a


 
uniform preference distribution Q[h_k] = 1/|H|), then the guarantee is equal for each hypothesis and (3.5) reduces to the same guarantee (up to constants) given by the uniform convergence property in (3.3). A simple way to express a preference is to favour hypotheses that are simpler to describe, or that have a smaller description length. A binary string is a finite sequence of zeros and ones. A description language for a hypothesis class H is a function mapping each hypothesis h ∈ H to a string desc(h) with length |desc(h)|, where desc(h) is called the description of h and |desc(h)| is its description length. If the language is prefix-free, then Kraft's inequality holds [71, Theorem 5.2.1], namely Σ_{h∈H} 2^{−|desc(h)|} ≤ 1. In this way, one may consider the simple prior Q[h_k] ∝ 2^{−|desc(h_k)|}, resulting in the bound

gen(h_k, S) ≤ √( (|desc(h_k)| log 2 + log(1/β)) / (2n) ),    (3.6)

where the description language could be any prefix-free compression algorithm and |desc(h_k)| is its length (in bits). When the description length of a hypothesis h is the smallest possible, it is known as its Kolmogorov complexity [91] and the prior Q is referred to as the universal prior [71, Section 14.6].
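To get a feeling for the numbers that (3.6) produces, the sketch below evaluates the bound for hypotheses with different description lengths; the sample size, confidence, and bit counts are purely illustrative.

    import math

    def mdl_bound(description_bits, n, beta):
        # Right-hand side of (3.6) for a hypothesis described with description_bits bits.
        return math.sqrt((description_bits * math.log(2) + math.log(1 / beta)) / (2 * n))

    n, beta = 60_000, 0.05
    for bits in (100, 10_000, 1_000_000):
        print(f"{bits:>9} bits -> generalization bound {mdl_bound(bits, n, beta):.3f}")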
Then, following the MDL principle leads to selecting hypotheses that trade off
a good empirical performance (small empirical risk) and a small complexity or
description length (small generalization error). This is aligned with Occam's razor, which states that “it is futile to do with more, what can be done with fewer” [114, 115]. This has been incorporated into the methodology of science in the following form [98, 116]: “given two explanations of the data, all other things being equal, the simpler explanation is preferable”. Indeed, following the MDL principle, given two hypotheses h_k and h_l with the same empirical risk R̂(h_k, s) = R̂(h_l, s) for a fixed dataset s, the one that is easier to describe (or has a smaller description length) is preferable.
The parsimonious philosophical principle of Occam's razor will also resonate with the rest of the generalization guarantees described in this thesis, although with other characterizations of the hypotheses' complexity and not necessarily their description length.
The MDL principle has strong connections to other theoretical frameworks for generalization, such as PAC-Bayesian theory [33–35, 117]. More precisely, the two are equivalent when the considered algorithms are deterministic and the hypothesis class is discrete, as shown below in Section 3.6.1. In fact, under the PAC-Bayesian
umbrella, the MDL has been employed to obtain non-vacuous bounds for deep
learning algorithms [118]. The idea is to describe the parameters of the networks
with a tunable prefix-free variable-length code that acts as the description language
desc. Then, both the quantized parameters and the quantization levels of the code
are learnt simultaneously to minimize a variant of the MDL generalization guarantee
in (3.6).


3.5 Algorithmic Stability


The MDL principle takes a step away from previous frameworks like PAC learning
and uniform convergence and establishes generalization guarantees that are spe-
cific for each hypothesis in the hypothesis class H. However, it requires that the
hypothesis class is discrete (or belongs to a subclass Hk from a countable set of
subclasses that cover the whole class in the case of SRM) and it does not take into
account the algorithm that selects the hypothesis. Even though one, in theory,
could study specific algorithms and give bounds on the description length of their
selected hypotheses, this is not embedded into the definition of the framework.
Algorithmic stability [119–121], on the other hand, makes the algorithm the
central object of its framework. To put it simply, an algorithm is stable if, when
presented with similar input training datasets, it outputs similar hypotheses. The
connection to generalization is then simple. If the hypothesis returned by an algorithm does not change much for similar training datasets, then, given a training dataset sampled from the data distribution, the output hypothesis would be similar to the ones obtained from other training datasets from the same distribution. Hence, the generalization error will depend on how well the training dataset represents the data distribution. This framework still follows Occam's razor if we measure the complexity of the hypothesis returned by an algorithm by how unstable the algorithm is.
Precisely describing what it means for the input training datasets and for the
output hypotheses to be “similar” is complicated, and there are different notions of
stability with different generalization guarantees. Below, we will describe a partic-
ularly successful notion named uniform stability [122]. Then, we will discuss how
privacy can be understood as a stability notion: if an algorithm is private, then it
is stable. This will serve as a connection point to Chapter 6. Finally, we will men-
tion how the stability of an algorithm can be described with information-theoretic
measures of dependence of the output hypothesis on the training data. This final
subsection will be the link to Section 3.6 discussing information-theoretic gen-
eralization and, more broadly, to the rest of the thesis. We omit the discussion
around other notions of stability, such as the leave-one-out, the leave-one-out cross
validation, or hypothesis stability, among others [100, 122–125] due to space and
scope constraints.

3.5.1 Uniform Stability


Consider a deterministic algorithm A : Z^n → H. Bousquet and Elisseeff [122] described the stability of the algorithm as the largest difference in performance of hypotheses generated by the algorithm when presented with neighbouring datasets. Therefore, for them, two input training datasets are “similar” if they are neighbours, that is, if they differ in at most one element; and two output hypotheses are “similar” if the absolute difference in their performance is bounded.


Definition 3.5. A deterministic algorithm A is uniformly stable with parameter


γ if for every training dataset s = (z_1, . . . , z_n) ∈ Z^n, every neighbouring dataset s^(i) = (z_1, . . . , z_{i−1}, z_i′, z_{i+1}, . . . , z_n) ∈ Z^n for all i ∈ [n], and every sample z ∈ Z,

|ℓ(A(s), z) − ℓ(A(s^(i)), z)| ≤ γ.
By the definition of uniform stability, and the fact that E[gen(A(S), S)] = E[ℓ(A(S), z) − ℓ(A(S^(i)), z)] [122, Lemma 7], it directly follows that E[gen(A(S), S)] ≤ γ. There have been increasingly better bounds on the generalization error based on the clear intuition that if an algorithm is uniformly stable, then it generalizes [122, 126–128]. To our knowledge, the following is the one that best characterizes the generalization error of uniformly stable algorithms, although there exist better characterizations of the excess risk [129].
Theorem 3.2 (Bousquet et al. [128, Corollary 8]). Consider a loss function with
a range contained in [a, b]. Let h = A(S), where A is a deterministic, uniformly
stable algorithm with parameter γ. Then, there exist universal constants c1 , c2 ∈ R+
such that, for all β ∈ (0, 1), with probability no less than 1 − β,
|gen(h, S)| ≤ c1 · γ log(n) log(1/β) + c2 · (b − a) · √( log(1/β) / n ).
The bound in Theorem 3.2 tells us that if an algorithm is uniformly stable, as
long as the stability parameter γ decreases with respect to the number of samples
n faster than logarithmically, then the algorithm will generalize for a large enough
training dataset. The framework of uniform stability has been relatively successful
in providing us with generalization error guarantees for known algorithms, that
is, we know that many known algorithms are uniformly stable. For example, the
ERM solution to convex learning problems with a strongly convex regularization term [100] is uniformly stable with γ ∈ Θ(1/√n). Moreover, there is a line of work
establishing the uniform stability of SGD under different combinations of conditions
such as the Bernstein or Polyak–Lojasiewicz conditions, or smoothness, convexity,
or Lipschitzness, among others [129–135], which is an encouraging direction to
better understand the generalization in deep learning.
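To see the decay promised by Theorem 3.2, the following sketch plugs the Θ(1/√n) stability rate of regularized ERM into the bound; the universal constants c1 and c2 are unknown, so they are set to 1 here purely for illustration.

    import math

    def uniform_stability_bound(n, gamma, beta, loss_range=1.0, c1=1.0, c2=1.0):
        # Right-hand side of Theorem 3.2 (c1, c2 are unknown universal constants).
        return (c1 * gamma * math.log(n) * math.log(1 / beta)
                + c2 * loss_range * math.sqrt(math.log(1 / beta) / n))

    beta = 0.05
    for n in (100, 10_000, 1_000_000):
        gamma = 1.0 / math.sqrt(n)   # stability parameter of order 1/sqrt(n)
        print(n, uniform_stability_bound(n, gamma, beta))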

3.5.2 Privacy as a Stability Measure


Recall from Section 2.2.5 that a privacy mechanism is an algorithm that answers
queries about a given dataset in a way that is informative about the queries them-
selves but not the specific records (or instances) in the dataset. Private learning
algorithms are a special class of learning algorithms that employ a privacy mecha-
nism to analyze the training dataset. Hence, these algorithms produce hypotheses
that are uninformative about the dataset with which they were trained. In this
way, private learning algorithms are stable, as ideally the hypotheses generated by
the algorithm when presented with similar training datasets are also similar.


[Figure: S → A ≡ P_H^S → H]
Figure 3.1: Illustration of a learning algorithm A viewed as a channel processing a dataset S to obtain a hypothesis H.

To visualize this concept more clearly, let us use Dwork et al.’s definition of
differential privacy [58, 59]. A randomized algorithm A is (ε, δ)-differentially private
if for all subsets of hypotheses A ⊆ H and all neighbouring datasets s and s′,

P[A(s) ∈ A] ≤ e^ε P[A(s′) ∈ A] + δ.    (3.7)

If δ = 0, then the algorithm is just ε-differentially private. This definition clarifies


the idea that private algorithms are stable: like for uniform stability, two input
training datasets are “similar” if they are neighbours, and two output hypotheses
are “similar” if their distributions are close as measured by (3.7).
Since differentially private algorithms are stable, under certain requirements
on the privacy parameters (ε, δ), they should generalize. The connection between
differential privacy (and other privacy notions like maximal leakage) and gener-
alization has been previously studied [2, 24, 25, 27, 60, 61, 136–139] and will be
further discussed in more detail in Chapter 6.

3.5.3 Stability via Information Measures


The stability notion borrowed from differential privacy in the previous subsection
is inherently information-theoretic. To illustrate this, consider a (possibly randomized) algorithm A characterized by the Markov kernel P_H^S, that is, given a fixed input dataset s, the output of the algorithm is a random hypothesis H := A(s) distributed according to P_H^{S=s}. If the algorithm is deterministic, then P_H^{S=s}(h) = I_{h=h_s}(h) and A(s) = h_s. From an information-theoretic perspective, an algorithm is a channel processing a dataset into a hypothesis (Figure 3.1). Then, recall from Section 2.2.5 that an algorithm is ε-differentially private if, for every two neighbouring training datasets s and s′, the output distributions of the hypotheses resulting from the channel processing are close in Rényi divergence of order ∞, that is,

D∞(P_H^{S=s} ∥ P_H^{S=s′}) ≤ ε.

In this way, an algorithm is ε-IM stable if, given two neighbouring input datasets,
the difference between the two output hypotheses, as measured by some informa-
tion measure IM like the ones presented in Section 2.2, is smaller than ε. This
definition, like uniform stability, considers two input datasets to be “similar” if they
are neighbours, and understands that two output hypotheses are “similar” if their
distributions are close according to some information measure. Other common ex-
amples of this information-theoretic stability are total variation, relative entropy,


and Wasserstein distance stability [136, 140]. As for the previous definitions of sta-
bility, if an algorithm satisfies any of these notions, its generalization error is also
bounded [2, 24, 136, 140].
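As a concrete illustration of these notions, the sketch below takes two hypothetical output distributions of a randomized algorithm over a small finite hypothesis set, corresponding to two neighbouring training datasets, and computes the total variation distance, the relative entropy, and the order-infinity Rényi divergence between them (the latter is exactly the quantity controlled by ε-differential privacy).

    import numpy as np

    # Hypothetical output distributions over 4 hypotheses for two neighbouring datasets s, s'.
    p = np.array([0.50, 0.30, 0.15, 0.05])   # P_H^{S=s}
    q = np.array([0.45, 0.32, 0.17, 0.06])   # P_H^{S=s'}

    tv = 0.5 * float(np.abs(p - q).sum())    # total variation distance
    kl = float(np.sum(p * np.log(p / q)))    # relative entropy D_KL(p || q)
    d_inf = float(np.log(np.max(p / q)))     # Renyi divergence of order infinity

    print(f"TV    = {tv:.4f}")
    print(f"KL    = {kl:.4f}")
    print(f"D_inf = {d_inf:.4f}  (epsilon-DP style stability requires D_inf <= epsilon)")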
So far, the stability definitions were agnostic to the data distribution PZ . Ragin-
sky et al. [136] introduced the concept of (ε, PZ )-IM stability to take into account
the effect of the data distribution on the stability of an algorithm. Given a dataset s = (z_1, . . . , z_n), define s^{−i} = (z_1, . . . , z_{i−1}, z_{i+1}, . . . , z_n) as the dataset obtained by removing the i-th sample from s. In this way, the distribution

P_H^{S^{−i}=s^{−i}} = P_H^{S=(z_1,...,z_{i−1},Z_i,z_{i+1},...,z_n)} ◦ P_Z

represents the expected distribution of the output hypothesis after processing a dataset S where Z_i is distributed according to the data distribution P_Z and the other instances are fixed, S^{−i} = s^{−i}. Then, (ε, P_Z)-IM stability guarantees that, on average, the output hypothesis does not change much when each instance of the training dataset is replaced by another instance from the data distribution. For example, an algorithm is (ε, P_Z)-total variation stable if¹

(1/n) Σ_{i=1}^n E[ TV(P_H^S, P_H^{S^{−i}}) ] ≤ ε.
n i=1

Taking this argument to the extreme, one may consider how much, on average,
the distribution of the output hypothesis P_H^{S=s} for a given realization S = s of the training dataset changes with respect to the prototypical distribution on samples from the data distribution, P_H = P_H^S ◦ P_S. For example, this was the rationale consid-
ered in [141, 142] to define the stability of an algorithm and to prove generalization
guarantees in terms of the total variation. This argument can also be interpreted
as to how much the algorithm’s output distribution depends on the input training
dataset. Both of these interpretations are paramount in the information-theoretic
generalization framework introduced in the next section. Therefore, the stability
framework to characterize the generalization of learning algorithms, when the sta-
bility notion is defined via privacy or information measures, can be understood
as belonging to the information-theoretic generalization framework and vice versa.
The lines between the different frameworks are blurry and, in the end, the terminology largely comes down to personal taste.

3.6 Information-Theoretic Generalization


The MDL principle provides us with guarantees that are specific to each hypothesis h
in the class H, unlike those given to us by uniform convergence or the Rademacher
complexity. Algorithmic stability takes the algorithm into account and gives us
¹ The original definition in [136] was written in a different, but equivalent form. We choose this presentation to ease the understanding of the concepts.


generalization guarantees that are specific to the algorithm A used to find the
hypothesis. Information-theoretic generalization also provides us with guarantees
that are specific to the learning algorithm and, depending on the level of specificity,
that are also specific to each hypothesis. Not needing to provide guarantees that
hold uniformly for all elements in the hypothesis class H allows these frameworks
to attain tighter characterizations of the generalization of learning algorithms.
Often, the guarantees the information-theoretic generalization framework provides us do not hold for every data distribution P_Z, and they are specialized to different classes of data distributions. These classes are chosen depending on the behavior of the loss random variable ℓ(h, Z) for hypotheses h ∈ H. This consid-
eration allows the framework to derive bounds on the population risk for potentially
unbounded losses and separates it from frameworks like uniform stability.
The information-theoretic framework, like algorithmic stability via information
measures, considers the (possibly randomized) algorithm as a channel processing a
dataset into a hypothesis (Figure 3.1). Essentially, this framework encompasses
all generalization guarantees that depend on information measures like those presented in Section 2.2 involving the algorithm's Markov kernel P_H^S and the data
distribution PZ in some capacity. A common theme in the guarantees obtained in
this framework is that they can be interpreted with classically information-theoretic
concepts like information or compression.
Intuitively, the more information the algorithm’s output hypothesis captures
about the dataset that it used for training, the worse it will generalize. The reason is
closely related to the concept of overfitting, which is a phenomenon that occurs when
the output hypothesis describes very well the training data but fails to describe the
underlying distribution. The idea is that, in order to perfectly describe the training
data, the algorithm adapted to the sampling noise for the specific data it observes,
which may differ from future observations. This can be quantified, for example, with the dissimilarity between the distribution of the algorithm's output P_H^{S=s} and the smoothed, prototypical distribution on samples from the data distribution P_H = P_H^S ◦ P_S, for instance D_KL(P_H^S ∥ P_H) [33–35, 70, 117, 136]. This example is purposely chosen to highlight again the connection between stability via information measures and information-theoretic generalization.
This intuition also follows the parsimonious philosophical principle of Occam's razor from Section 3.4: given two algorithms that output hypotheses with
the same empirical risk, the one that extracts the least information, or needs the
least bits to be compressed, is preferred.
In practice, the foundational generalization bounds from this framework are
usually obtained by combining a change of measure (or a decoupling lemma) and a concentration inequality. For randomized algorithms, the population risk R(H) is a random variable that depends on the joint distribution P_S ⊗ P_H^S, where S is the random training dataset and P_S = P_Z^{⊗n}. This dependence between the hypothesis H
and the dataset S makes the usage of standard concentration inequalities around the
empirical risk impossible. For this reason, a common first step is to use a change of
measure to consider the population risk R(H ′ ) of an auxiliary random hypothesis


H ′ ∼ Q that is independent (or decoupled ) of the training data S. Ideally, the


chosen distribution Q is one that allows us to control the new population risk using
standard concentration inequalities like those in [94]. The penalty for studying the
risk of the auxiliary hypothesis H ′ instead of the real one H is captured by an
information measure that describes the dissimilarity between their distributions.
When the chosen distribution is the prototypical distribution on samples from the data distribution P_H, then we recover the intuition and interpretation given above. We
will go deeper into this in Chapters 4 to 6, but the reader is also referred to the
review papers [143, 144].
The main criticism of this framework is that in order to find computable gen-
eralization guarantees it is necessary to evaluate or further bound the information
measures between distributions. This is often difficult or impractical, leading to
crude bounds that may be vacuous [4, 145–147]. However, they can still lead to
non-vacuous generalization statements in deep learning [5, 42, 118, 148–151], to
recover from below known results for hypothesis classes with a bounded VC or
Natarajan dimension [24, 152, 153], or to help us better understand noisy, iterative algorithms like stochastic gradient descent (SGD) [154], among other upsides. More
examples of the benefits obtained from guarantees derived from this framework are
described in the following chapters.

3.6.1 Levels of Specificity


There are different levels of specificity when it comes to the guarantees provided to
us by information-theoretic generalization. The specificity level depends on whether we require the results to hold for a specific hypothesis returned by the algorithm given a specific training dataset, or if we only need them to hold on average for the outputs of the algorithm given a specific training dataset, or, finally, if we need them to hold on average for the outputs of the algorithm given training datasets from a certain distribution. These specificity levels are also referred to as “flavours” of generalization [27, 144] and are further described below, from less to more specific, together with a classical example. All guarantees are provided for a given algorithm described by a Markov kernel P_H^S and a data distribution P_Z.

• Generalization guarantees in expectation (or “PAC-Bayesian guarantees in


expectation”, “expectedly approximately correct (EAC) guarantees” [155–
157], or “mean approximately correct (MAC) guarantees” [152]). This is the
least specific level. This kind of guarantee states that

E[gen(H, S)] ≤ αexp ,

where the expectation is taken with respect to the joint distribution P_S ⊗ P_H^S (whose marginal on H is the prototypical distribution P_H = P_H^S ◦ P_S), and where we recall that P_S = P_Z^{⊗n}.


For a loss with a range contained in [a, b], the classical example of this guar-
antee using the Donsker and Varadhan Lemma 2.1 is [136]
r
I(H; S)
αexp = (b − a) .
2n

• PAC-Bayesian guarantees. These guarantees are specific to the observed


realization of the training dataset S. More precisely, this kind of guarantee
states that, for all β ∈ (0, 1), with probability no less than 1 − β,

E_{P_H^S}[gen(H, S)] ≤ α_PAC-Bayes,

where the probability is taken with respect to the draw of the training data from P_S and the expectation is taken with respect to the conditional distribution P_H^S of the output hypothesis for that training data.
bution of the output hypothesis for that training data PH .
For a loss with a range contained in [a, b], the classical example of this guar-
antee using the Donsker and Varadhan Lemma 2.1 is [33–35]
α_PAC-Bayes = (b − a) √( (D_KL(P_H^S ∥ Q) + log(ξ(n)/β)) / (2n) ),    (3.8)

where ξ(n) ∈ [√n, 2 + √(2n)] [5, 158, 159] and Q is any distribution on H such that P_H^S ≪ Q. Note that compared to the guarantee in expectation, this
includes a penalty for the confidence 1 − β of the statement.
Note that if the learning algorithm is deterministic (that is, if the algorithm's Markov kernel is P_H^S(h) = I_{h=h_S}(h)) and the hypothesis class is discrete, then D_KL(P_H^S ∥ Q) = − log Q[h_S] and the PAC-Bayesian guarantees are equivalent to the guarantees from the MDL framework (see Section 3.4 above).

• Single-draw PAC-Bayesian guarantees (or “pointwise or derandomized PAC-


Bayesian guarantees” [160, 161])2 . These guarantees are specific to the ob-
served realization of the training dataset S and the particular output hypoth-
esis returned by the algorithm. To be precise, this kind of guarantee states
that, for all β ∈ (0, 1), with probability no less than 1 − β,

gen(H, S) ≤ αsdPAC ,

where the probability is taken with respect to the draw of the training data from P_S and the subsequent draw of the algorithm's output hypothesis from P_H^S.

² We choose this name as we feel it is the one that best clarifies that the bounds are on a single draw of the hypothesis rather than on the algorithm's distribution. This is also the name employed in [27, 144].


For a loss with a range contained in [a, b], the classical example³ of this guarantee using the change of measure stemming from the Radon–Nikodym theorem (2.1) is [162]

α_sdPAC = (b − a) √( (log (dP_H^S/dQ)(H) + log(ξ(n)/β)) / (2n) ),

where we recall that dP_H^S/dQ is the Radon–Nikodym derivative between the algorithm's output distribution and the reference distribution Q. Note that, compared to the standard PAC-Bayesian and in-expectation guarantees, this bound changes depending on the particular realization of the output hypothesis; a small numerical comparison of the three classical bounds is sketched below.
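The three classical bounds above are easy to compare numerically once the information quantities are known or estimated. The sketch below does so with hypothetical values of I(H; S), D_KL(P_H^S ∥ Q), and log dP_H^S/dQ (H), taking ξ(n) at its upper value 2 + √(2n); none of these numbers come from a real algorithm.

    import math

    def classical_bounds(n, beta, mutual_info, kl_posterior_prior, log_rn_derivative, loss_range=1.0):
        # Classical in-expectation, PAC-Bayesian, and single-draw PAC-Bayesian bounds.
        # All information quantities are in nats and are hypothetical inputs here.
        xi = 2.0 + math.sqrt(2.0 * n)
        in_expectation = loss_range * math.sqrt(mutual_info / (2 * n))
        pac_bayes = loss_range * math.sqrt((kl_posterior_prior + math.log(xi / beta)) / (2 * n))
        single_draw = loss_range * math.sqrt((log_rn_derivative + math.log(xi / beta)) / (2 * n))
        return in_expectation, pac_bayes, single_draw

    # Illustrative values: I(H; S) = 5 nats, D_KL(P_H^S || Q) = 20 nats for the observed
    # dataset, and log dP_H^S/dQ (H) = 35 nats for the particular hypothesis drawn.
    bounds = classical_bounds(n=60_000, beta=0.05, mutual_info=5.0,
                              kl_posterior_prior=20.0, log_rn_derivative=35.0)
    print("in expectation: %.4f, PAC-Bayes: %.4f, single draw: %.4f" % bounds)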

The more specific guarantees give us more practical information. For example,
the single-draw PAC-Bayesian guarantees hold for the particular realization of
the hypothesis that an algorithm returns for a particular dataset. However, they
are often harder to calculate or control. On the other hand, the less specific
guarantees give us more abstract information and are often easier to calculate
and control. For instance, the mutual information can be upper bounded by other
quantities relevant to a particular algorithm, giving us a concrete understanding
of the important elements for that algorithm to generalize. For example, for the
stochastic gradient Langevin dynamics (SGLD) [163, 164] algorithm, we learnt that the gradient incoherence between samples is important in determining its
generalization [1, 23, 165] (see Section 4.5). More examples of these bounds’
usefulness are given in Chapter 4.
Before moving to some bibliographic remarks we clarify some notation. We say
that a bound has a fast rate if it decreases linearly with the number of samples,
that is, if it is in O(1/n); for example, the guarantees for the realizable setting in classification problems provided by uniform convergence have a fast rate. Similarly, we say that a bound has a slow rate if it is in O(1/√n). If a bound has a rate different than these
two, we will refer to the rate in comparison to them. For probabilistic bounds, we
say that a bound is of high probability if its dependence on the confidence parameter β is logarithmic, that is, it depends on log(1/β) like (3.8). Other dependencies, such as the linear one 1/β, will simply be referred to as not of high probability.⁴

3.6.2 Bibliographic Remarks


Arguably, the first relation between information-theoretic concepts and generalization guarantees is due to Vapnik [113, Section 4.6], prior to 1995. He formalized a
³ The bound was not originally given in this form in [162], but one can recover it after routine manipulations.
⁴ There are other notions of “high probability”. For example, a common definition states that a bound is of high probability if the probability of failure β can be made arbitrarily close to 0 as the number of samples tends to infinity [83]. We choose this nomenclature, similarly to Hellström and Durisi [27], since it helps us distinguish between an exponential decay in the probability of failure (when the dependence is logarithmic) and weaker decays such as polynomial.


relationship between the MDL principle from Section 3.4 and compression in the form of a generalization bound similar to (3.6). After that, Shawe-Taylor, Bartlett, and
Williamson [28, 117] introduced the first PAC-Bayesian bounds using a luckiness
factor in 1996-1997, and those were further developed by McAllester [33, 34, 35] in
1998-2003. These latter results first obtained another formal relationship between
the MDL principle and generalization, now in the form of (3.5), and then evolved
into the more general PAC-Bayesian bound from (3.8). Interestingly, McAllester
used neither the Donsker and Varadhan nor the Gibbs Lemma 2.1 to obtain
the result and, to our knowledge, the first to do so in the context of generalization
was Seeger [36] in 2002, who actually re-discovered it. After that, the usage of this
result became customary in the PAC-Bayesian literature, popularized by Audibert
[166], Catoni [38, 39], Zhang [70], and Germain and others [167, 168], although some
re-derived it again and gave it a different name for the context of generalization,
such as the information exponential inequality from Zhang [70] in 2006.
Information-theoretic generalization bounds in expectation were derived in the
PAC-Bayesian community, at least, since 2006 [39, 45, 169]. These were popular-
ized in 2016 after Xu and Raginsky [20] extended the results from Russo and Zou
[21] from the bias of adaptive data analysis to the generalization error or learning
algorithms. The popularity of these results came, in part, for the simplicity of the
proofs that directly used the Donsker and Varadhan Lemma 2.1 and the fact that
the bound explicitly featured the mutual information. The combination of these
two facts allowed for the development of many results and interpretations after that,
some of which will be discussed later in Chapter 4.

Remark 3.1. Another, previous conceptual connection between information-


theoretic concepts and generalization guarantees came from Dudley’s metric en-
tropy [170] to bound the rate of uniform convergence [171] in 1984. For a metric
hypothesis space (H, ρ), an ε-net is a set of hypotheses N(ε, H, ρ) that covers the hypothesis class with ε precision, that is, such that for every hypothesis h ∈ H there is a hypothesis h′ ∈ N(ε, H, ρ) with ρ(h, h′) ≤ ε. The metric entropy is defined as the logarithm of the covering number, log |N⋆(ε, H, ρ)|, which is defined as the smallest cardinality of an ε-net. This concept measures the spread of the hypotheses
in the class H, relating it with the entropy. Moreover, it coincides with the entropy
of a uniform distribution on the elements of the cover.

3.7 Parameterized Models


Modern learning algorithms are often a combination of a parameterized model
and an optimization algorithm, where the parameters are usually referred to as
the weights. For example, a parameterized, differentiable network and stochastic
gradient descent (SGD).
Consider for example a supervised learning problem where we recall that the
instances z = (x, y) are a tuple formed by a feature and a label, and the hypotheses


h : X → Y are functions that return a label when given a feature. In a parameter-


ized model, the hypotheses hw : X → Y are completely described by the weights
w ∈ W. In this way, the hypothesis space is

H = {hw : w ∈ W}. (3.9)

Therefore, each weight w uniquely determines a hypothesis hw ∈ H, but the reverse


is not necessarily true: there can be some hypothesis h that results from multiple
weights. For example, consider the simple parameterized model hw (x) = sign(xw)
with X = W = R. Here, every choice of w ∈ W uniquely determines the hypothesis
hw . On the other hand, the hypothesis hw (x) = sign(x) results from every positive
weight w > 0 and the hypothesis hw (x) = −sign(x) results from every negative one
w < 0. In other words, the relationship between weights and hypotheses is onto
but not one-to-one.
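As a quick illustration of this point, for the model h_w(x) = sign(wx) every positive weight induces exactly the same hypothesis, as the following minimal sketch shows.

    import numpy as np

    def h(w, x):
        # Parameterized hypothesis h_w(x) = sign(w * x).
        return np.sign(w * x)

    x = np.array([-2.0, -0.5, 0.3, 1.7])
    print(h(0.1, x))    # [-1. -1.  1.  1.]
    print(h(42.0, x))   # identical output: every w > 0 gives the hypothesis sign(x)
    print(h(-3.0, x))   # flipped output:   every w < 0 gives the hypothesis -sign(x)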
This onto relationship between weights and hypotheses allows us to find generalization guarantees by considering the weights instead of the hypotheses themselves.

• Recall the results stemming from uniform convergence or the Rademacher


complexity from Sections 3.2 and 3.3: these results needed to hold for every hypothesis h in the hypothesis class H. If, instead, we ensure that they hold for every weight w in the weight space W, they will necessarily hold for all hypotheses in H as per (3.9).
• The reasoning is similar for the guarantees coming from the MDL principle
in Section 3.4, as one can describe every hypothesis hw by describing the
weights w. When multiple weights result in the same hypothesis, then the
one that is more easily described will be selected, as per Occam's razor.
• For deterministic algorithms and uniform stability (Section 3.5.1), the al-
gorithm is at the center, and therefore whether we consider the weights that uniquely determine the hypothesis or the hypothesis itself is irrelevant.
• When privacy is considered as a stability measure (Section 3.5.2) the ratio-
nale is slightly different. When the model is parameterized, usually the private
algorithm returns the weights W , and these are the ones that enjoy the pri-
vacy guarantees. Luckily, most privacy frameworks like differential privacy or
maximal leakage have post-processing guarantees [55, 59, 78]. This guarantee
states that no amount of post-processing can degrade the privacy guaran-
tees. Therefore, the hypothesis hW maintains the same privacy guarantees
and therefore also the same generalization guarantees derived from them.
• Finally, for algorithmic stability based on information measures or
information-theoretic generalization (Sections 3.5.3 and 3.6), the maintenance of the generalization guarantees comes from the data processing
inequality (see Proposition 2.3 and the rest of Section 2.2). Consider,
for example, the relative entropy: for two weights W and W ′ , this inequality


guarantees that DKL (PhW ∥PhW ′ ) ≤ DKL (PW ∥PW ′ ). Therefore, the guarantees
obtained for the weights also translate to the hypotheses.
For this reason, a good volume of the information-theoretic generalization liter-
ature presents its results with an abuse of notation, describing the random hypothesis returned by the algorithm with the letter W and using it interchangeably with the
model’s weights. Henceforth, in this thesis, we will also present our results in this
way. However, it is important to mention that, until the weights are employed to
find specific guarantees for some particular algorithm like the SGLD in Section 4.5,
the results hold for any hypothesis and not only for parameterized models.

Remark 3.2. In information theory, the letter H is historically reserved for the Shannon entropy H, and expressions like H(H) can be confusing. Pragmatically, this is also a reason why in information-theoretic generalization the letter W is employed instead of the classical H in learning theory.
