An Information-Theoretic Approach To Generalization Theory - Part2
3. A Primer on Generalization Theory
Finally, in Section 3.7, the chapter concludes with a justification of why, for parameterized models, one can express the results in terms of the models' parameters instead of the hypotheses they induce. This justification is important for this thesis, as the results hereafter will be presented in this form.
The intention of this chapter is to give a brief introduction to the field of generalization theory and to the notation that we will employ to refer to the topics within the field. We also discuss the different theoretical frameworks to study generalization, along with their trade-offs, in order to clarify why we study the information-theoretic generalization framework and how we position it with respect to other frameworks. Experts on the topic may choose to skip this chapter. Readers newer to the field may benefit from digesting the concepts and intuitions given here to better understand the following chapters and to get a taste of the field and the relevant literature before delving deeper into it.
For example, a common type of problem is the supervised learning problem. Here, each instance z = (x, y) is a tuple formed by a feature and a label (or target), and the hypotheses h : X → Y are functions that return a label when given a feature. Then, the loss functions are often of the type ℓ(h, z) = ρ(h(x), y), where ρ is a measure of dissimilarity between the predicted label h(x) and the true label y associated with the feature x. Traditionally, a large volume of the theory has focused on classification tasks where the label set consists of k classes Y = [k] and the loss function is the 0–1 loss ℓ(h, z) = 𝕀_{E_h}(x, y), where E_h = {(x, y) ∈ X × Y : h(x) ≠ y}.
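As a concrete illustration, the following minimal Python sketch (with a hypothetical threshold classifier and a toy training set; the names and data are illustrative choices, not notation from the thesis) computes the empirical 0–1 risk of a hypothesis.

```python
import numpy as np

def zero_one_loss(h, x, y):
    """0-1 loss: 1 if the hypothesis mislabels the feature, 0 otherwise."""
    return float(h(x) != y)

def empirical_risk(h, s):
    """Average 0-1 loss of hypothesis h over the training set s = [(x1, y1), ...]."""
    return np.mean([zero_one_loss(h, x, y) for x, y in s])

# Toy example: a threshold classifier on scalar features with labels in {0, 1}.
h = lambda x: int(x > 0.5)
s = [(0.2, 0), (0.7, 1), (0.9, 1), (0.4, 1)]  # the last instance is misclassified
print(empirical_risk(h, s))  # 0.25
```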
In a learning problem, we assume that we have access to a sequence of n data instances s := (z_1, · · · , z_n) ∈ Z^n, or training set. Usually, we assume that these instances are all independent and identically distributed (i.i.d.); that is, they are sampled from P_Z^{⊗n}. Unless we explicitly state otherwise, this will also be assumed throughout the thesis. Then, a learning algorithm A is a (possibly randomized) mechanism that generates a hypothesis H as a solution to the problem given a training set s. The algorithm A is characterized by the Markov kernel P_H^S that, for a given training dataset s, returns a distribution P_H^{S=s} on the hypothesis space.
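To make the role of the Markov kernel concrete, here is a minimal sketch (not from the thesis) of a randomized algorithm for which P_H^{S=s} is explicit: a noisy mean estimator whose output, for a fixed dataset s, is a Gaussian centred at the empirical mean. The noise scale sigma is an arbitrary illustrative choice.

```python
import numpy as np

def noisy_mean_algorithm(s, sigma=0.5, rng=None):
    """A randomized algorithm A: given a dataset s of scalars, it draws a
    hypothesis (an estimate of the mean) from P_H^{S=s} = N(mean(s), sigma^2)."""
    rng = np.random.default_rng() if rng is None else rng
    return rng.normal(loc=np.mean(s), scale=sigma)

s = [0.1, 0.4, 0.3, 0.9]
# Three draws from the same kernel P_H^{S=s}: the algorithm's output is a distribution.
print([noisy_mean_algorithm(s, rng=np.random.default_rng(i)) for i in range(3)])
```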
An algorithm generalizes if the population risk of the hypothesis that it generates is close to its empirical risk on the training data, that is, if the generalization error is small. These two quantities are related via the decomposition
R(h) = R̂(h, s) + gen(h, s).    (3.1)

Similarly, the excess risk of h with respect to the class H decomposes as

excess(h, H) = gen(h, s) + [ R̂(h, s) − inf_{h̃_s ∈ H} R̂(h̃_s, s) ] + [ inf_{h̃_s ∈ H} R̂(h̃_s, s) − inf_{h⋆ ∈ H} R(h⋆) ],    (3.2)

where the first bracketed term is the optimization error and the second is the approximation error.
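As a toy illustration of the decomposition in (3.1), the sketch below (all modeling choices are hypothetical) learns a threshold classifier by empirical risk minimization and approximates the population risk with a large held-out Monte Carlo sample, so that the generalization error is the difference of the two estimates.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_data(n):
    """Toy data distribution P_Z: x ~ N(0, 1) and y = 1{x > 0} with 10% label noise."""
    x = rng.normal(size=n)
    y = (x > 0).astype(int)
    flip = rng.random(n) < 0.1
    return x, np.where(flip, 1 - y, y)

def risk(h, x, y):
    """Average 0-1 loss of h on the samples (x, y)."""
    return float(np.mean(h(x) != y))

# "Learning": pick the threshold with the smallest empirical 0-1 risk on s.
x_tr, y_tr = sample_data(30)
thresholds = np.linspace(-2, 2, 201)
emp = [risk(lambda x, t=t: (x > t).astype(int), x_tr, y_tr) for t in thresholds]
t_hat = thresholds[int(np.argmin(emp))]
h = lambda x: (x > t_hat).astype(int)

# Monte Carlo proxy for the population risk R(h), so gen(h, s) ~= R(h) - R_hat(h, s).
x_te, y_te = sample_data(100_000)
print("empirical risk      :", risk(h, x_tr, y_tr))
print("population risk (MC):", risk(h, x_te, y_te))
print("generalization error:", risk(h, x_te, y_te) - risk(h, x_tr, y_tr))
```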
The growth function ΠH (n) of a hypothesis class H can be used to bound the
absolute generalization error and, hence, establish a uniform convergence property
of the class [104, Section 3.2]. More precisely, if a hypothesis class H has a growth
function ΠH (n), then
α_UC(n, β) = √( 8 log( 4 Π_H(2n) / β ) / n ).    (3.4)
The Sauer–Shelah–Perles lemma [98, Lemma 6.10] bounds the growth function
of a hypothesis class H in terms of its VC dimension. Formally, it states that
Π_H(n) ≤ Σ_{i=0}^{VCdim(H)} (n choose i) ≤ { 2^n, if n ≤ VCdim(H);  ( en / VCdim(H) )^{VCdim(H)}, otherwise. }
where W is the Lambert function, whose 0 branch can be bounded from above as W(x) ≤ log(x + 1) [105, Theorem 2.3]. Therefore, in the realizable setting the uniform convergence rate of a class is of the order (Ndim(H) + log(1/β))/n up to logarithmic terms.
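As a quick numerical sketch relating the Sauer–Shelah–Perles lemma to the uniform convergence rate (3.4), the snippet below (with an arbitrary VC dimension and confidence level) evaluates the growth-function bound and plugs it into α_UC.

```python
from math import comb, log, sqrt

def growth_bound(m, d):
    """Sauer-Shelah-Perles bound on the growth function of a class with VC dimension d."""
    return sum(comb(m, i) for i in range(d + 1))

def alpha_uc(n, beta, d):
    """Uniform-convergence rate (3.4), with the growth function at 2n replaced by
    its Sauer-Shelah-Perles upper bound."""
    return sqrt(8 * log(4 * growth_bound(2 * n, d) / beta) / n)

for n in (100, 1_000, 10_000):
    print(n, round(alpha_uc(n, beta=0.05, d=10), 4))
```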
Unfortunately, the VC or the Natarajan dimensions still do not help us to characterize the generalization behavior of deep learning. For a feed-forward network with sigmoid activation functions, p parameters, and c connections between the parameters, the VC dimension is lower bounded in Ω(c²) and upper bounded in O(p²c²) [98, Chapter 20].
where the inner product R^⊺ ℓ_s(h) measures the correlation between the ran-
dom noise and the losses. Thus, the Rademacher complexity measures how
well, on average, the hypothesis class correlates with random noise. Richer
or more complex hypothesis classes H can generate more losses and correlate
of the collected training dataset, it is still a quantity that holds uniformly for all
hypotheses in a class H, which is a very strong requirement. For example, although
there are works that obtain upper bounds on the Rademacher complexity of feed-
forward networks that depend on different norms of the weights [108–112], these
are still not tight enough to characterize their generalization. Moreover, there are
reasons to believe that this measure will not be sufficient for this task. Indeed, the
first interpretation of the Rademacher complexity of the class H describes how well
the hypotheses from the class can fit random binary label assignments, and it has
been shown that parameterized, differentiable networks can perfectly fit random
labels [101]. Hence, it is expected that the Rademacher complexity is close to 1 for
binary classification with the 0–1 loss, resulting in a trivial generalization bound.
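To make the "correlation with random noise" interpretation concrete, the following sketch estimates the empirical Rademacher complexity of the 0–1 loss class of a small family of threshold classifiers by Monte Carlo over the random signs; the class, data, and sample sizes are toy choices.

```python
import numpy as np

rng = np.random.default_rng(1)

def empirical_rademacher(losses, num_draws=2_000):
    """Monte Carlo estimate of E_sigma[ sup_h (1/n) sum_i sigma_i * losses[h, i] ],
    where losses is a (|H|, n) matrix with the per-sample losses of each hypothesis."""
    n = losses.shape[1]
    estimates = []
    for _ in range(num_draws):
        sigma = rng.choice([-1.0, 1.0], size=n)       # Rademacher (random sign) vector
        estimates.append(np.max(losses @ sigma) / n)  # best correlation over the class
    return float(np.mean(estimates))

# Toy class: 21 threshold classifiers on n = 50 scalar features, 0-1 loss.
x = rng.normal(size=50)
y = (x > 0).astype(int)
thresholds = np.linspace(-2, 2, 21)
losses = np.array([((x > t).astype(int) != y).astype(float) for t in thresholds])
print(empirical_rademacher(losses))
```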
uniform preference distribution Q(h_k) = 1/|H|), then the guarantee is equal for each hypothesis and (3.5) reduces to the same guarantee (up to constants) given by the uniform convergence property in (3.3). A simple way to express a preference is to favour hypotheses that are simpler to describe, or that have a smaller description length. A binary string is a finite sequence of zeros and ones. A description language for a hypothesis class H is a function mapping each hypothesis h ∈ H to a string desc(h) with length |desc(h)|, where desc(h) is called the description of h and |desc(h)| is its description length. If the language is prefix-free, then Kraft's inequality holds [71, Theorem 5.2.1], namely Σ_{h∈H} 2^{−|desc(h)|} ≤ 1. In this way, one may consider the simple prior Q(h_k) ∝ 2^{−|desc(h_k)|}, resulting in the bound

gen(h_k, S) ≤ √( ( |desc(h_k)| log 2 + log(1/δ) ) / (2n) ),    (3.6)
where the description language could be given by any prefix-free compression algorithm and |desc(h_k)| is the resulting length (in bits). When the description length of a hypothesis h is the smallest possible, it is known as its Kolmogorov complexity [91] and the prior Q is referred to as the universal prior [71, Section 14.6].
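As a sketch of how (3.6) is evaluated in practice, the snippet below uses the length of a zlib-compressed serialization of some hypothetical model weights as a stand-in description length; zlib is not literally a prefix-free description language over a fixed hypothesis class, so this is only an illustration of the arithmetic, with δ and n chosen arbitrarily.

```python
import zlib
from math import log, sqrt

def mdl_bound(description_bits, n, delta=0.05):
    """Evaluate the MDL generalization guarantee (3.6) for a given description length."""
    return sqrt((description_bits * log(2) + log(1 / delta)) / (2 * n))

# Stand-in description: compress a serialized hypothesis (here, a string of rounded
# weights) and use the compressed length in bits as |desc(h)|.
weights = [0.125, -0.5, 0.0, 0.75]
desc_bits = 8 * len(zlib.compress(repr(weights).encode()))
print(desc_bits, "bits ->", mdl_bound(desc_bits, n=10_000))
```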
Then, following the MDL principle leads to selecting hypotheses that trade off
a good empirical performance (small empirical risk) and a small complexity or
description length (small generalization error). This is aligned with Occam's razor, which states that "it is futile to do with more, what can be done with fewer" [114, 115]. This has been incorporated into the methodology of science in the following form [98, 116]: "given two explanations of the data, all other things being equal, the simpler explanation is preferable". Indeed, following the MDL principle, given two hypotheses h_k and h_l with the same empirical risk R̂(h_k, s) = R̂(h_l, s) for a fixed dataset s, the one that is easier to describe (or has a smaller description length) is preferable.
The parsimonious philosophical principle of Occam's razor will also resonate with the rest of the generalization guarantees described in this thesis, although with other characterizations of the hypotheses' complexity and not necessarily their description length.
The MDL principle has deep connections with other theoretical frameworks for generalization, such as PAC-Bayesian theory [33–35, 117]. More precisely, the two are equivalent when the considered algorithms are deterministic and the hypothesis class is discrete, as shown below in Section 3.6.1. In fact, under the PAC-Bayesian umbrella, the MDL principle has been employed to obtain non-vacuous bounds for deep
learning algorithms [118]. The idea is to describe the parameters of the networks
with a tunable prefix-free variable-length code that acts as the description language
desc. Then, both the quantized parameters and the quantization levels of the code
are learnt simultaneously to minimize a variant of the MDL generalization guarantee
in (3.6).
By the definition of uniform stability, and the fact that E[gen(A(S), S)] = (1/n) Σ_{i=1}^{n} E[ ℓ(A(S), Z_i′) − ℓ(A(S^{(i)}), Z_i′) ] [122, Lemma 7], where S^{(i)} denotes the dataset S with its i-th sample replaced by an independent copy Z_i′, it directly follows that E[gen(A(S), S)] ≤ γ. There have been increasingly better bounds on the gen-
eralization error based on the clear intuition that if an algorithm is uniformly sta-
ble, then it generalizes [122, 126–128]. To our knowledge, the following one is the
one that best characterizes the generalization error of uniformly stable algorithms,
although there exist better characterizations of the excess risk [129].
Theorem 3.2 (Bousquet et al. [128, Corollary 8]). Consider a loss function with
a range contained in [a, b]. Let h = A(S), where A is a deterministic, uniformly
stable algorithm with parameter γ. Then, there exist universal constants c1 , c2 ∈ R+
such that, for all β ∈ (0, 1), with probability no less than 1 − β,
|gen(h, S)| ≤ c_1 · γ log(n) log(1/β) + c_2 · (b − a) · √( log(1/β) / n ).
The bound in Theorem 3.2 tells us that if an algorithm is uniformly stable and its stability parameter γ decreases with the number of samples n faster than 1/log(n), then the algorithm will generalize for a large enough training dataset. The framework of uniform stability has been relatively successful
in providing us with generalization error guarantees for known algorithms, that
is, we know that many known algorithms are uniformly stable. For example, the
ERM solution to convex learning problems with a strongly convex regularization
term [100] is uniformly stable with γ ∈ Θ(1/√n). Moreover, there is a line of work
establishing the uniform stability of SGD under different combinations of conditions
such as the Bernstein or Polyak–Lojasiewicz conditions, or smoothness, convexity,
or Lipschitzness, among others [129–135], which is an encouraging direction to
better understand the generalization in deep learning.
[Figure 3.1: The learning algorithm A, characterized by the Markov kernel P_H^S, depicted as a channel that processes the training dataset S into the hypothesis H.]
To visualize this concept more clearly, let us use Dwork et al.’s definition of
differential privacy [58, 59]. A randomized algorithm A is (ε, δ)-differentially private
if for all subsets of hypotheses A ⊆ H and all neighbouring datasets s and s′
P[A(s) ∈ A] ≤ e^ε · P[A(s′) ∈ A] + δ.    (3.7)
In this way, an algorithm is ε-IM stable if, given two neighbouring input datasets,
the difference between the two output hypotheses, as measured by some informa-
tion measure IM like the ones presented in Section 2.2, is smaller than ε. This
definition, like uniform stability, considers two input datasets to be “similar” if they
are neighbours, and understands that two output hypotheses are “similar” if their
distributions are close according to some information measure. Other common ex-
amples of this information-theoretic stability are total variation, relative entropy,
and Wasserstein distance stability [136, 140]. As for the previous definitions of sta-
bility, if an algorithm satisfies any of these notions, its generalization error is also
bounded [2, 24, 136, 140].
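As a small sketch of stability measured with an information measure (all numbers are toy choices), consider a noisy-mean algorithm like the one sketched earlier: for two neighbouring datasets, both output distributions are Gaussians with the same scale, so their relative entropy is available in closed form and shrinks quickly with n.

```python
import numpy as np

def kl_gaussians_same_scale(mu1, mu2, sigma):
    """KL divergence between N(mu1, sigma^2) and N(mu2, sigma^2) in nats."""
    return (mu1 - mu2) ** 2 / (2 * sigma ** 2)

sigma = 0.5  # noise scale of the noisy-mean algorithm A(s) ~ N(mean(s), sigma^2)
for n in (10, 100, 1_000):
    s = np.zeros(n)
    s_prime = s.copy()
    s_prime[0] = 1.0  # neighbouring dataset: one sample replaced (worst case in [0, 1])
    print(n, kl_gaussians_same_scale(s.mean(), s_prime.mean(), sigma))
```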
So far, the stability definitions were agnostic to the data distribution PZ . Ragin-
sky et al. [136] introduced the concept of (ε, P_Z)-IM stability to take into account the effect of the data distribution on the stability of an algorithm. Given a dataset s = (z_1, . . . , z_n), define s^{−i} = (z_1, . . . , z_{i−1}, z_{i+1}, . . . , z_n) as the dataset obtained by removing the i-th sample from s. In this way, the distribution

P_H^{S^{−i}=s^{−i}} = P_H^{S=(z_1,...,z_{i−1},Z_i,z_{i+1},...,z_n)} ∘ P_Z
Taking this argument to the extreme, one may consider how much, on average, the distribution of the output hypothesis P_H^{S=s} for a given realization S = s of the training dataset changes with respect to the prototypical distribution on samples from the data distribution P_H = P_H^S ∘ P_S. For example, this was the rationale consid-
ered in [141, 142] to define the stability of an algorithm and to prove generalization
guarantees in terms of the total variation. This argument can also be interpreted
in terms of how much the algorithm's output distribution depends on the input training
dataset. Both of these interpretations are paramount in the information-theoretic
generalization framework introduced in the next section. Therefore, the stability
framework to characterize the generalization of learning algorithms, when the sta-
bility notion is defined via privacy or information measures, can be understood
as belonging to the information-theoretic generalization framework and vice versa.
The lines between the different frameworks are blurry and, in the end, the terminology choice largely comes down to personal taste.
generalization guarantees that are specific to the algorithm A used to find the
hypothesis. Information-theoretic generalization also provides us with guarantees
that are specific to the learning algorithm and, depending on the level of specificity,
that are also specific to each hypothesis. Not needing to provide guarantees that
hold uniformly for all elements in the hypothesis class H allows these frameworks
to attain tighter characterizations of the generalization of learning algorithms.
Often, the guarantees that the information-theoretic generalization framework provides do not hold for every data distribution P_Z; rather, they are specialized to different classes of data distributions. These classes are chosen depending on the behavior of the loss random variable ℓ(h, Z) for hypotheses h ∈ H. This consid-
eration allows the framework to derive bounds on the population risk for potentially
unbounded losses and separates it from frameworks like uniform stability.
The information-theoretic framework, like algorithmic stability via information
measures, considers the (possibly randomized) algorithm as a channel processing a
dataset into a hypothesis (Figure 3.1). Essentially, this framework encompasses
all generalization guarantees that depend on information measures, like those presented in Section 2.2, involving the algorithm's Markov kernel P_H^S and the data distribution P_Z in some capacity. A common theme in the guarantees obtained in this framework is that they can be interpreted through classical information-theoretic concepts like information or compression.
Intuitively, the more information the algorithm’s output hypothesis captures
about the dataset that it used for training, the worse it will generalize. The reason is
closely related to the concept of overfitting, which is a phenomenon that occurs when
the output hypothesis describes very well the training data but fails to describe the
underlying distribution. The idea is that, in order to perfectly describe the training
data, the algorithm adapts to the sampling noise of the specific data it observes, which may differ from future observations. This can be quantified, for example, with the dissimilarity between the distribution of the algorithm's output P_H^{S=s} and the smoothed, prototypical distribution on samples from the data distribution P_H = P_H^S ∘ P_S, for instance D_KL(P_H^S ∥ P_H) [33–35, 70, 117, 136]. This example is purposely chosen to highlight again the connection between stability via information measures and the information-theoretic generalization framework.
This intuition also follows the parsimonious philosophical principle of Occam's razor from Section 3.4: given two algorithms that output hypotheses with
the same empirical risk, the one that extracts the least information, or needs the
fewest bits to be compressed, is preferred.
In practice, the foundational generalization bounds from this framework are
usually obtained combining a change of measure (or a decoupling lemma) and a
concentration inequality. For randomized algorithms, the population risk R(H) is
a random variable that depends on the joint distribution P_S ⊗ P_H^S, where S is the random training dataset and P_S = P_Z^{⊗n}. This dependence between the hypothesis H
and the dataset S makes the usage of standard concentration inequalities around the
empirical risk impossible. For this reason, a common first step is to use a change of
measure to consider the population risk R(H ′ ) of an auxiliary random hypothesis
For a loss with a range contained in [a, b], the classical example of this guar-
antee using the Donsker and Varadhan Lemma 2.1 is [136]
α_exp = (b − a) √( I(H; S) / (2n) ).
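For intuition on the quantities in this rate, consider again a noisy-mean algorithm like the one sketched earlier and assume, purely for illustration, that the data are Gaussian: then the mutual information I(H; S) has a closed form, since the output depends on S only through the sample mean. The snippet below computes it and plugs it into the rate above with an illustrative loss range of 1 (the boundedness assumption of the displayed guarantee is set aside here).

```python
from math import log, sqrt

def mi_noisy_mean(n, var_z, var_noise):
    """Closed-form I(H; S) in nats when Z_i ~ N(mu, var_z) i.i.d. and
    H = mean(S) + N(0, var_noise): the sample mean is N(mu, var_z / n),
    so I(H; S) = 0.5 * log(1 + var_z / (n * var_noise))."""
    return 0.5 * log(1 + var_z / (n * var_noise))

def alpha_exp(n, mi, loss_range=1.0):
    """The in-expectation rate (b - a) * sqrt(I(H; S) / (2n))."""
    return loss_range * sqrt(mi / (2 * n))

for n in (10, 100, 1_000):
    mi = mi_noisy_mean(n, var_z=1.0, var_noise=0.25)
    print(n, round(mi, 5), round(alpha_exp(n, mi), 5))
```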
where the probability is taken with respect to the draw of the training data
from P_S and the expectation is taken with respect to the conditional distribution of the output hypothesis for that training data, P_H^S.
For a loss with a range contained in [a, b], the classical example of this guar-
antee using the Donsker and Varadhan Lemma 2.1 is [33–35]
α_PAC-Bayes = (b − a) √( ( D_KL(P_H^S ∥ Q) + log( ξ(n)/β ) ) / (2n) ),    (3.8)

where ξ(n) ∈ [√n, 2 + √(2n)] [5, 158, 159] and Q is any distribution on H such that P_H^S ≪ Q. Note that, compared to the guarantee in expectation, this includes a penalty for the confidence 1 − β of the statement.
Note that if the learning algorithm is deterministic (that is, if the algorithm's Markov kernel is P_H^S(h) = 𝕀_{h = h_S}(h)) and the hypothesis class is discrete, then D_KL(P_H^S ∥ Q) = − log Q[h_S] and the PAC-Bayesian guarantees are equivalent to the guarantees from the MDL framework (see Section 3.4 above).
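A common way to make (3.8) concrete for parameterized models is to take Q and P_H^S to be Gaussians over the weights, for which the relative entropy has a closed form. The sketch below (all dimensions, variances, and sample sizes are illustrative, and ξ(n) is set to the upper end of the interval quoted above) evaluates the resulting bound.

```python
import numpy as np
from math import log, sqrt

def kl_isotropic_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """KL( N(mu_q, sigma_q^2 I) || N(mu_p, sigma_p^2 I) ) in nats."""
    d = mu_q.size
    return (d * log(sigma_p / sigma_q)
            + (d * sigma_q ** 2 + float(np.sum((mu_q - mu_p) ** 2))) / (2 * sigma_p ** 2)
            - d / 2)

def pac_bayes_bound(kl, n, beta, loss_range=1.0):
    """Evaluate (3.8) with xi(n) = 2 + sqrt(2n)."""
    return loss_range * sqrt((kl + log((2 + sqrt(2 * n)) / beta)) / (2 * n))

# Toy posterior/prior over d = 100 weights: posterior centred at the learnt weights,
# prior centred at zero.
rng = np.random.default_rng(0)
mu_q = 0.1 * rng.normal(size=100)
kl = kl_isotropic_gaussians(mu_q, sigma_q=0.1, mu_p=np.zeros(100), sigma_p=1.0)
print(round(kl, 3), round(pac_bayes_bound(kl, n=50_000, beta=0.05), 5))
```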
gen(H, S) ≤ α_sdPAC,

where the probability is taken with respect to the draw of the training data from P_S and the subsequent draw of the algorithm's output hypothesis from P_H^S.
2 We choose this name as we feel it is the one that best clarifies that the bounds are on a single draw of the hypothesis rather than on the algorithm's distribution. This is also the name employed in [27, 144].
For a loss with a range contained in [a, b], the classical example3 of this
guarantee using the change of measure stemming from the Radon–Nikodym
theorem (2.1) is [162]
α_sdPAC = (b − a) √( ( log( (dP_H^S/dQ)(H) ) + log( ξ(n)/β ) ) / (2n) ),

where we recall that dP_H^S/dQ is the Radon–Nikodym derivative between the algorithm's output distribution and the reference distribution Q. Note that, compared to the standard PAC-Bayesian and the expectation guarantees, this changes depending on the particular realization of the output hypothesis.
The more specific guarantees give us more practical information. For example,
the single-draw PAC-Bayesian guarantees hold for the particular realization of
the hypothesis that an algorithm returns for a particular dataset. However, they
are often harder to calculate or control. On the other hand, the less specific
guarantees give us more abstract information and are often easier to calculate
and control. For instance, the mutual information can be upper bounded by other
quantities relevant to a particular algorithm, giving us a concrete understanding
of the important elements for that algorithm to generalize. For example, for the
stochastic gradient Langevin dynamics (SGLD) [163, 164] algorithm, we learnt
that the gradient incoherence between samples is important in determining its
generalization [1, 23, 165] (see Section 4.5). More examples of these bounds’
usefulness are given in Chapter 4.
Before moving to some bibliographic remarks, we clarify some notation. We say that a bound has a fast rate if it decreases linearly with the number of samples, that is, if it is in O(1/n); an example is the guarantee for the realizable setting in classification problems provided by uniform convergence. Similarly, we say that a bound has a slow rate if it is in O(1/√n). If a bound has a rate different from these two, we will refer to the rate in comparison to them. For probabilistic bounds, we say that a bound is of high probability if its dependence on the confidence parameter β is logarithmic, that is, if it depends on log(1/β) like (3.8). Other dependencies, such as the linear one 1/β, will simply be referred to as not of high probability.4
4 There are other notions of “high probability”. For example, a common definition states that
a bound is of high probability if the probability of failure β can be made arbitrarily close to 0 as
the number of samples tends to infinity [83]. We choose this nomenclature, similarly to Hellström
and Durisi [27], since it helps us distinguish between an exponential decay in the probability of
failure (when the dependence is logarithmic) and weaker decays such as polynomial.
relationship between the MDL principle from Section 3.4 and compression to
a generalization bound similar to (3.6). After that, Shawe-Taylor, Bartlett, and
Williamson [28, 117] introduced the first PAC-Bayesian bounds using a luckiness
factor in 1996-1997, and those were further developed by McAllester [33, 34, 35] in
1998-2003. These latter results first obtained another formal relationship between
the MDL principle and generalization, now in the form of (3.5), and then evolved
into the more general PAC-Bayesian bound from (3.8). Interestingly, McAllester
used neither the Donsker and Varadhan nor the Gibbs Lemma 2.1 to obtain
the result and, to our knowledge, the first to do so in the context of generalization
was Seeger [36] in 2002, who actually re-discovered it. After that, the usage of this
result became customary in the PAC-Bayesian literature, popularized by Audibert
[166], Catoni [38, 39], Zhang [70], and Germain and others [167, 168], although some
re-derived it and gave it a different name in the context of generalization,
such as the information exponential inequality from Zhang [70] in 2006.
Information-theoretic generalization bounds in expectation were derived in the
PAC-Bayesian community at least since 2006 [39, 45, 169]. These were popular-
ized in 2016 after Xu and Raginsky [20] extended the results from Russo and Zou
[21] from the bias in adaptive data analysis to the generalization error of learning algorithms. The popularity of these results came, in part, from the simplicity of the proofs, which directly used the Donsker and Varadhan Lemma 2.1, and from the fact that the bound explicitly featured the mutual information. The combination of these two facts allowed for the development of many results and interpretations after that,
some of which will be discussed later in Chapter 4.
guarantees that D_KL(P_{h_W} ∥ P_{h_{W′}}) ≤ D_KL(P_W ∥ P_{W′}). Therefore, the guarantees
obtained for the weights also translate to the hypotheses.
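The inequality above is an instance of the data-processing inequality for the deterministic map W ↦ h_W; the following toy check (with arbitrary discrete distributions and a map that merges two weight values into one hypothesis) illustrates it numerically.

```python
import numpy as np

def kl(p, q):
    """KL divergence between two discrete distributions (same support), in nats."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Two distributions over four weight values; the (deterministic) map w -> h_w merges
# the first two weights into the same hypothesis, so there are three hypotheses.
p_w = np.array([0.4, 0.1, 0.3, 0.2])
q_w = np.array([0.2, 0.3, 0.1, 0.4])
group = np.array([0, 0, 1, 2])          # h_w index for each weight value
p_h = np.array([p_w[group == g].sum() for g in range(3)])
q_h = np.array([q_w[group == g].sum() for g in range(3)])
print(kl(p_h, q_h) <= kl(p_w, q_w))     # True: data-processing inequality
```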
For this reason, a good volume of the information-theoretic generalization liter-
ature presents their results abusing notation and describing the random hypothesis
returned by the algorithm with the letter W and using it interchangeably with the
model’s weights. Henceforth, in this thesis, we will also present our results in this
way. However, it is important to mention that, until the weights are employed to
find specific guarantees for some particular algorithm like the SGLD in Section 4.5,
the results hold for any hypothesis and not only for parameterized models.