0% found this document useful (0 votes)

4 views

generalize_DL_2023

This paper explores the theoretical foundations of generalization in deep learning, addressing why deep learning models can generalize well despite their high capacity and complexity. It discusses various approaches to establish non-vacuous generalization guarantees and proposes new open problems for further research. The authors reconcile existing paradoxes in the literature and present insights into the behavior of deep learning models in relation to overfitting and generalization errors.

Uploaded by

Đức Mạnh Dương

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

4 views

generalize_DL_2023

Uploaded by

Đức Mạnh Dương

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 28

Generalization in Deep Learning

Kenji Kawaguchi Leslie Pack Kaelbling Yoshua Bengio

MIT MIT University of Montreal
arXiv:1710.05468v9 [stat.ML] 22 Aug 2023

Abstract
This paper provides theoretical insights into why and how deep learning can generalize well, despite
its large capacity, complexity, possible algorithmic instability, nonrobustness, and sharp minima,
responding to an open question in the literature. We also discuss approaches to provide non-vacuous
generalization guarantees for deep learning. Based on theoretical observations, we propose new
open problems and discuss the limitations of our results.

1. Introduction
Deep learning has seen significant practical success and has had a profound impact on the conceptual
bases of machine learning and artificial intelligence. Along with its practical success, the theoretical
properties of deep learning have been a subject of active investigation. For expressivity of neural
networks, there are classical results regarding their universality (Leshno et al., 1993) and exponential
advantages over hand-crafted features (Barron, 1993). Another series of theoretical studies have
considered how trainable (or optimizable) deep hypothesis spaces are, revealing structural properties
that may enable non-convex optimization (Choromanska et al., 2015; Kawaguchi, 2016a). However,
merely having an expressive and trainable hypothesis space does not guarantee good performance in
predicting the values of future inputs, because of possible over-fitting to training data. This leads to
the study of generalization, which is the focus of this paper.
Some classical theory work attributes generalization ability to the use of a low-capacity class
of hypotheses (Vapnik, 1998; Mohri et al., 2012). From the viewpoint of compact representation,
which is related to small capacity, it has been shown that deep hypothesis spaces have an exponential
advantage over shallow hypothesis spaces for representing some classes of natural target functions
(Pascanu et al., 2014; Montufar et al., 2014; Livni et al., 2014; Telgarsky, 2016; Poggio et al., 2017).
In other words, when some assumptions implicit in the hypothesis space (e.g., deep composition of
piecewise linear transformations) are approximately satisfied by the target function, one can achieve
very good generalization, compared to methods that do not rely on that assumption. However, a
recent paper (Zhang et al., 2017) has empirically shown that successful deep hypothesis spaces
have sufficient capacity to memorize random labels. This observation has been called an “apparent
paradox” and has led to active discussion by many researchers (Arpit et al., 2017; Krueger et al.,
2017; Hoffer et al., 2017; Wu et al., 2017; Dziugaite and Roy, 2017; Dinh et al., 2017). Zhang
et al. (2017) concluded with an open problem stating that understanding such observations require
rethinking generalization, while Dinh et al. (2017) stated that explaining why deep learning models
can generalize well, despite their overwhelming capacity, is an open area of research.

BibTeX of this paper is available at: https://people.csail.mit.edu/kawaguch/bibtex.html

K AWAGUCHI , K AELBLING , AND B ENGIO

We begin, in Section 3, by illustrating that, even in the case of linear models, hypothesis spaces
with overwhelming capacity can result in arbitrarily small test errors and expected risks. Here, test
error is the error of a learned hypothesis on data that it was not trained on, but which is often drawn
from the same distribution. Test error is a measure of how well the hypothesis generalizes to new data.
We closely examine this phenomenon, extending the original open problem from previous papers
(Zhang et al., 2017; Dinh et al., 2017) into a new open problem that strictly includes the original.
We reconcile the possible apparent paradox by checking theoretical consistency and identifying a
difference in the underlying assumptions. Considering a difference in the focuses of theory and
practice, we outline possible practical roles that generalization theory can play.
Towards addressing these issues, Section 4 presents generalization bounds based on validation
datasets, which can provide non-vacuous and numerically-tight generalization guarantees for deep
learning in general. Section 5 analyzes generalization errors based on training datasets, focusing
on a specific case of feed-forward neural networks with ReLU units and max-pooling. Under
these conditions, the developed theory provides quantitatively tight theoretical insights into the
generalization behavior of neural networks.

2. Background
Let x ∈ X be an input and y ∈ Y be a target. Let L be a loss function. Let R[f ] be the expected
risk of a function f , R[f ] = Ex,y∼P(X,Y ) [L(f (x), y)], where P(X,Y ) is the true distribution. Let
fA(S) : X → Y be a model learned by a learning algorithm A (including random seeds for simplicity)
using a training dataset S := Sm := {(x1 , y1 ), . . . , (xm , ym )} of size m. Let RS [f ] be the empirical
1 Pm m
risk of f as RS [f ] = m i=1 L(f (xi ), yi ) with {(xi , yi )}i=1 = S. Let F be a set of functions
endowed with some structure or equivalently a hypothesis space. Let LF be a family of loss functions
associated with F, defined by LF = {g : f ∈ F, g(x, y) = L(f (x), y)}. All vectors are column
vectors in this paper. For any given variable v, let dv be the dimensionality of the variable v.
A goal in machine learning is typically framed as the minimization of the expected risk R[fA(S) ].
We typically aim to minimize the non-computable expected risk R[fA(S) ] by minimizing the com-
putable empirical risk RS [fA(S) ] (i.e., empirical risk minimization). One goal of generalization
theory is to explain and justify when and how minimizing RS [fA(S) ] is a sensible approach to
minimizing R[fA(S) ] by analyzing

the generalization gap := R[fA(S) ] − RS [fA(S) ].

In this section only, we use the typical assumption that S is generated by independent and identically
distributed (i.i.d.) draws according to the true distribution P(X,Y ) ; the following sections of this
paper do not utilize this assumption. Under this assumption, a primary challenge of analyzing the
generalization gap stems from the dependence of fA(S) on the same dataset S used in the definition of
RS . Several approaches in statistical learning theory have been developed to handle this dependence.
The hypothesis-space complexity approach handles this dependence by decoupling fA(S) from
the particular S by considering the worst-case gap for functions in the hypothesis space as

R[fA(S) ] − RS [fA(S) ] ≤ sup R[f ] − RS [f ],

f ∈F

and by carefully analyzing the right-hand side. Because the cardinality of F is typically (uncountably)
infinite, a direct use of the union bound over all elements in F yields a vacuous bound, leading to

2
G ENERALIZATION IN D EEP L EARNING

the need to consider different quantities to characterize F; e.g., Rademacher complexity and the
Vapnik–Chervonenkis (VC) dimension. For example, if the codomain of L is in [0, 1], we have
(Mohri et al., 2012, Theorem 3.1) that for any δ > 0, with probability at least 1 − δ,
s
ln 1δ
sup R[f ] − RS [f ] ≤ 2Rm (LF ) + ,
f ∈F 2m
where Rm (LF ) is the Rademacher complexity of LF , which then can be bounded by the Rademacher
complexity of F, Rm (F). For the deep-learning hypothesis spaces F, there are several well-known
bounds on Rm (F) including those with explicit exponential dependence on depth (Sun et al., 2016;
Neyshabur et al., 2015b; Xie et al., 2015) and explicit linear dependence on the number of trainable
parameters (Shalev-Shwartz and Ben-David, 2014). There has been significant work on improving the
bounds in this approach, but all existing solutions with this approach still depend on the complexity
of a hypothesis space or a sequence of hypothesis spaces.
The stability approach deals with the dependence of fA(S) on the dataset S by considering the
stability of algorithm A with respect to different datasets. The considered stability is a measure of
how much changing a data point in S can change fA(S) . For example, if the algorithm A has uniform
stability β (w.r.t. L) and if the codomain of L is in [0, M ], we have (Bousquet and Elisseeff, 2002)
that for any δ > 0, with probability at least 1 − δ,
s
ln 1δ
R[fA(S) ] − RS [fA(S) ] ≤ 2β + (4mβ + M ) .
2m
Based on previous work on stability (e.g., Hardt et al. 2015; Kuzborskij and Lampert 2017; Gonen
and Shalev-Shwartz 2017), one may conjecture some reason for generalization in deep learning.
The robustness approach avoids dealing with certain details of the dependence of fA(S) on S by
considering the robustness of algorithm A for all possible datasets. In contrast to stability, robustness
is the measure of how much the loss value can vary w.r.t. the input space of (x, y). For example, if
algorithm A is (Ω, ζ(·))-robust and the codomain of L is upper-bounded by M , given a dataset S,
we have (Xu and Mannor, 2012) that for any δ > 0, with probability at least 1 − δ,
s
2Ω ln 2 + 2 ln 1δ
|R[fA(S) ] − RS [fA(S) ]| ≤ ζ(S) + M .
m
The robustness approach requires an a priori known and fixed partition of the input space such that
the number of sets in the partition is Ω and the change of loss values in each set of the partition
is bounded by ζ(S) for all S (Definition 2 and the proof of Theorem 1 in Xu and Mannor 2012).
In classification, if the margin is ensured to be large, we can fix the partition with balls of radius
corresponding to the large margin, filling the input space. Recently, this idea was applied to deep
learning (Sokolic et al., 2017a,b), producing insightful and effective generalization bounds, while
still suffering from the curse of the dimensionality of the priori-known fixed input manifold.
With regard to the above approaches, flat minima can be viewed as the concept of low variation
in the parameter space; i.e., a small perturbation in the parameter space around a solution results
in a small change in the loss surface. Several studies have provided arguments for generalization
in deep learning based on flat minima (Keskar et al., 2017). However, Dinh et al. (2017) showed
that flat minima in practical deep learning hypothesis spaces can be turned into sharp minima
via re-parameterization without affecting the generalization gap, indicating that it requires further
investigation.

3
K AWAGUCHI , K AELBLING , AND B ENGIO

3. Rethinking generalization
Zhang et al. (2017) empirically demonstrated that several deep hypothesis spaces can memorize
random labels, while having the ability to produce zero training error and small test errors for
particular natural datasets (e.g., CIFAR-10). They also empirically observed that regularization
on the norm of weights seemed to be unnecessary to obtain small test errors, in contradiction to
conventional wisdom. These observations suggest the following open problem:
Open Problem 1. Tightly characterize the expected risk R[f ] or the generalization gap R[f ]−RS [f ]
with a sufficiently complex deep-learning hypothesis space F ∋ f , producing theoretical insights
and distinguishing the case of “natural” problem instances (P(X,Y ) , S) (e.g., images with natural
labels) from the case of other problem instances (P′(X,Y ) , S ′ ) (e.g., images with random labels).
Supporting and extending the empirical observations by Zhang et al. (2017), we provide a theorem
(Theorem 1) stating that the hypothesis space of over-parameterized linear models can memorize any
training data and decrease the training and test errors arbitrarily close to zero (including to zero) with
the norm of parameters being arbitrarily large, even when the parameters are arbitrarily far from the
ground-truth parameters. Furthermore, Corollary 2 shows that conventional wisdom regarding the
norm of the parameters w can fail to explain generalization, even in linear models that might seem
not to be over-parameterized. All proofs for this paper are presented in the appendix.

Theorem 1. Consider a linear model with the training prediction Ŷ (w) = Φw ∈ Rm×dy , where
Φ ∈ Rm×n is a fixed feature matrix of the training inputs. Let Ŷtest (w) = Φtest w ∈ Rmtest ×dy
be the test prediction, where Φtest ∈ Rmtest ×n is a fixed feature matrix of the test inputs. Let
M = [Φ⊤ , Φ⊤ ⊤
test ] . Then, if n > m and if rank(Φ) = m and rank(M ) < n,

(i) For any Y ∈ Rm×dy , there exists a parameter w′ such that Ŷ (w′ ) = Y , and

(ii) if there exists a ground truth w∗ satisfying Y = Φw∗ and Ytest = Φtest w∗ , then for any
ϵ, δ ≥ 0, there exists a parameter w such that

(a) Ŷ (w) = Y + ϵA for some matrix A with ∥A∥F ≤ 1, and

(b) Ŷtest (w) = Ytest + ϵB for some matrix B with ∥B∥F ≤ 1, and
(c) ∥w∥F ≥ δ and ∥w − w∗ ∥F ≥ δ.

Corollary 2. If n ≤ m and if rank(M ) < n, then statement (ii) in Theorem 1 holds.

Whereas Theorem 1 and Corollary 2 concern test errors instead of expected risk (in order to be
consistent with empirical studies), Remark 3 shows the existence of the same phenomena for expected
risk for general machine learning models not limited to deep learning and linear hypothesis spaces;
i.e., Remark 3 shows that none of small capacity, low complexity, stability, robustness, and flat minima
is necessary for generalization in machine learning for each given problem instance (P(X,Y ) , S),
although one of them can be sufficient for generalization. This statement does not contradict necessary
conditions and no free lunch theorem from previous learning theory, as explained in the subsequent
subsections.

Remark 3. Given a pair (P(X,Y ) , S) and a desired ϵ > inf f ∈Y X R[f ] − RS [f ], let fϵ∗ be a function
such that ϵ ≥ R[fϵ∗ ] − RS [fϵ∗ ]. Then,

4
G ENERALIZATION IN D EEP L EARNING

(i) For any hypothesis space F whose hypothesis-space complexity is large enough to memorize
any dataset and which includes fϵ∗ possibly at an arbitrarily sharp minimum, there exist learning
algorithms A such that the generalization gap of fA(S) is at most ϵ, and

(ii) There exist arbitrarily unstable and arbitrarily non-robust algorithms A such that the general-
ization gap of fA(S) is at most ϵ.
To see this, first consider statement (i). Given such a F, consider any A such that A takes F and S
as input and outputs fϵ∗ . Clearly, there are many such algorithms A. For example, given a S, fix A
such that A takes F and S as input and outputs fϵ∗ (which already proves the statement), or even
fϵ∗ + δ where δ becomes zero by the right choice of hyper-parameters and of small variations of F
(e.g., architecture search in deep learning) such that F still satisfy the condition in the statement.
This establishes statement (i).
Consider statement (ii). Given any dataset S ′ , consider a look-up algorithm A′ that always outputs
fϵ∗ if S = S ′ , and outputs f1 otherwise such that f1 is arbitrarily non-robust and |L(fϵ∗ (x), y) −
L(f1 (x), y)| is arbitrarily large (i.e., arbitrarily non-stable). This proves statement (ii). Note that
while A′ here suffices to prove statement (ii), we can also generate other non-stable and non-robust
algorithms by noticing the essence captured in Remark 4.
We capture the essence of all the above observations in the following remark.
Remark 4. The expected risk R[f ] and the generalization gap R[f ] − RS [f ] of a hypothesis f with
a true distribution P(X,Y ) and a dataset S are completely determined by the tuple (P(X,Y ) , S, f ),
independently of other factors, such as a hypothesis space F (and hence its properties such as
capacity, Rademacher complexity, pre-defined bound on norms, and flat-minima) and properties of
random datasets different from the given S (e.g., stability and robustness of the learning algorithm
A). In contrast, the conventional wisdom states that these other factors are what matter. This has
created the “apparent paradox” in the literature.
From these observations, we propose the following open problem:
Open Problem 2. Tightly characterize the expected risk R[f ] or the generalization gap R[f ]−RS [f ]
of a hypothesis f with a pair (P(X,Y ) , S) of a true distribution and a dataset, producing theoretical
insights, based only on properties of the hypothesis f and the pair (P(X,Y ) , S).
Solving Open Problem 2 for deep learning implies solving Open Problem 1, but not vice versa.
Open Problem 2 encapsulates the essence of Open Problem 1 and all the issues from our Theorem 1,
Corollary 2 and Proposition 3.

3.1 Consistency of theory

The empirical observations in (Zhang et al., 2017) and our results above may seem to contradict
the results of statistical learning theory. However, there is no contradiction, and the apparent
inconsistency arises from the misunderstanding and misuse of the precise meanings of the theoretical
statements.
Statistical learning theory can be considered to provide two types of statements relevant to the
scope of this paper. The first type (which comes from upper bounds) is logically in the form of
“p implies q,” where p := “the hypothesis-space complexity is small” (or another statement about
stability, robustness, or flat minima), and q := “the generalization gap is small.” Notice that “p

5
K AWAGUCHI , K AELBLING , AND B ENGIO

Figure 1: An illustration of differences in assumptions. Statistical learning theory analyzes the

generalization behaviors of fA(S) over randomly-drawn unspecified datasets S ∈ D according to
some unspecified distribution P(X,Y ) ∈ P. Intuitively, statistical learning theory concerns more
about questions regarding a set P × D because of the unspecified nature of (P(X,Y ) , S), whereas
certain empirical studies (e.g., Zhang et al. 2017) can focus on questions regarding each specified
point (P(X,Y ) , S) ∈ P × D.

implies q” does not imply “q implies p.” Thus, based on statements of this type, it is entirely possible
that the generalization gap is small even when the hypothesis-space complexity is large or the learning
mechanism is unstable, non-robust, or subject to sharp minima.
The second type (which comes from lower bounds) is logically in the following form: in a set
Uall of all possible problem configurations, there exists a subset U ⊆ Uall such that “q implies p”
in U (with the same definitions of p and q as in the previous paragraph). For example, Mohri et al.
(2012, Section 3.4) derived lower bounds on the generalization gap by showing the existence of
a “bad” distribution that characterizes U . Similarly, the classical no free lunch theorems are the
results with the existence of a worst-case distribution for each algorithm. However, if the problem
instance at hand (e.g., object classification with MNIST or CIFAR-10) is not in such a U in the
proofs (e.g., if the data distribution is not among the “bad” ones considered in the proofs), q does not
necessarily imply p. Thus, it is still naturally possible that the generalization gap is small with large
hypothesis-space complexity, instability, non-robustness, and sharp minima. Therefore, there is no
contradiction or paradox.

3.2 Difference in assumptions and problem settings

Under certain assumptions, many results in statistical learning theory have been shown to be tight
and insightful (e.g., Mukherjee et al. 2006; Mohri et al. 2012). Hence, the need of rethinking
generalization partly comes from a difference in the assumptions and problem settings.
Figure 1 illustrates the differences in assumptions in statistical learning theory and some empirical
studies. On one hand, in statistical learning theory, a distribution P(X,Y ) and a dataset S are usually
unspecified except that P(X,Y ) is in some set P and a dataset S ∈ D is drawn randomly according to
P(X,Y ) (typically with the i.i.d. assumption). On the other hand, in most empirical studies and in our
theoretical results (Theorem 1 and Proposition 3), a distribution P(X,Y ) is still unknown yet specified
(e.g., via a real world process) and a dataset S is specified and usually known (e.g., CIFAR-10 or
ImageNet). Intuitively, whereas statistical learning theory considers a set P × D because of weak
assumptions, some empirical studies can focus on a specified point (P(X,Y ) , S) in a set P × D
because of stronger assumptions. Therefore, by using the same terminology such as “expected risk”
and “generalization” in both cases, we are susceptible to confusion and apparent contradiction.
Lower bounds, necessary conditions and tightness in statistical learning theory are typically
defined via a worst-case distribution Pworst
(X,Y ) ∈ P. For instance, classical “no free lunch” theorems
and certain lower bounds on the generalization gap (e.g., Mohri et al. 2012, Section 3.4) have been

6
G ENERALIZATION IN D EEP L EARNING

proven for the worst-case distribution Pworst

(X,Y ) ∈ P. Therefore, “tight” and “necessary” typically
mean “tight” and “necessary” for the set P × D (e.g., through the worst or average case), but not for
each particular point (P(X,Y ) , S) ∈ P × D. From this viewpoint, we can understand that even if the
quality of the set P × D is “bad” overall, there may exist a “good” point (P(X,Y ) , S) ∈ P × D.
Several approaches in statistical learning theory, such as the data dependent and Bayesian
approaches (Herbrich and Williamson, 2002; Dziugaite and Roy, 2017), use more assumptions on
the set P × D to take advantage of more prior and posterior information; these have an ability to
tackle Open Problem 1. However, these approaches do not apply to Open Problem 2 as they still
depend on other factors than the given (P(X,Y ) , S, f ). For example, data-dependent bounds with
the luckiness framework (Shawe-Taylor et al., 1998; Herbrich and Williamson, 2002) and empirical
Rademacher complexity (Koltchinskii and Panchenko, 2000; Bartlett et al., 2002) still depend on a
concept of hypothesis spaces (or the sequence of hypothesis spaces), and the robustness approach
(Xu and Mannor, 2012) depend on different datasets than a given S via the definition of robustness
(i.e., in Section 2, ζ(S) is a data-dependent term, but the definition of ζ itself and Ω depend on other
datasets than S).
We note that analyzing a set P × D is of significant interest for its own merits and is a natural task
in the field of computational complexity (e.g., categorizing a set of problem instances into subsets
with or without polynomial solvability). Indeed, the situation where theory focuses more on a set
and many practical studies focus on each element in the set is prevalent in computer science (see the
discussion in Appendix A2 for more detail).

3.3 Practical role of generalization theory

From the discussions above, we can see that there is a logically expected difference between the
scope in theory and the focus in practice; it is logically expected that there are problem instances
where theoretical bounds are pessimistic. In order for generalization theory to have maximal impact
in practice, we must be clear on a set of different roles it can play regarding practice, and then work
to extend and strengthen it in each of these roles. We have identified the following practical roles for
theory:

Role 1 Provide guarantees on expected risk.

Role 2 Guarantee generalization gap

Role 2.1 to be small for a given fixed S, and/or

Role 2.2 to approach zero with a fixed model class as m increases.

Role 3 Provide theoretical insights to guide the search over model classes.

4. Generalization bounds via validation

In practical deep learning, we typically adopt the training–validation paradigm, usually with a
held-out validation set. We then search over hypothesis spaces by changing architectures (and other
hyper-parameters) to obtain low validation error. In this view, we can conjecture the reason that
deep learning can sometimes generalize well as follows: it is partially because we can obtain a good
model via search using a validation dataset. Indeed, Proposition 5 states that if the validation error
of a hypothesis is small, it is guaranteed to generalize well, regardless of its capacity, Rademacher

7
K AWAGUCHI , K AELBLING , AND B ENGIO

(val)
complexity, stability, robustness, and flat minima. Let Smval be a held-out validation dataset of size
mval , which is independent of the training dataset S.
(val)
Proposition 5. (generalization guarantee via validation error) Assume that Smval is generated by i.i.d.
(val)
draws according to a true distribution P(X,Y ) . Let κf,i = R[f ] − L(f (xi ), yi ) for (xi , yi ) ∈ Smval .
Suppose that E[κ2f,i ] ≤ γ 2 and |κf,i | ≤ C almost surely, for all (f, i) ∈ Fval × {1, . . . , mval }. Then,
for any δ > 0, with probability at least 1 − δ, the following holds for all f ∈ Fval :
s
|Fval |
2C ln( δ ) 2γ 2 ln( |Fδval | )
R[f ] ≤ RS (val) [f ] + + .
mval 3mval mval
Here, Fval is defined as a set of models f that is independent of a held-out validation dataset
(val)
Smval , but can depend on the training dataset S. For example, Fval can contain a set of models f
such that each element f is a result at the end of each epoch during training with at least 99.5%
training accuracy. In this example, |Fval | is at most (the number of epochs) × (the cardinality of the
set of possible hyper-parameter settings), and is likely much smaller than that because of the 99.5%
training accuracy criteria and the fact that a space of many hyper-parameters is narrowed down by
using the training dataset as well as other datasets from different tasks. If a hyper-parameter search
depends on the validation dataset, Fval must be the possible space of the search instead of the space
(j)
actually visited by the search. We can also use a sequence {Fval }j (see Appendix A).
The bound in Proposition 5 is non-vacuous and tight enough to be practically meaningful.
For example, consider a classification task with 0–1 loss. Set mval = 10, 000 (e.g., MNIST and
CIFAR-10) and δ = 0.1. Then, even in the worst case with C = 1 and γ 2 = 1 and even with
|Fval | = 1, 000, 000, 000, we have with probability at least 0.9 that R[f ] ≤ RS (val) [f ] + 6.94% for
mval

all f ∈ Fval . In a non-worst-case scenario, for example, with C = 1 and γ 2 = (0.05)2 , we can
replace 6.94% by 0.49%. With a larger validation set (e.g., ImageNet) and/or more optimistic C and
γ 2 , we can obtain much better bounds.
Although Proposition 5 poses the concern of increasing the generalization bound when p using a
single validation dataset with too large |Fval |, the rate of increase is only ln |Fval | and ln |Fval |.
We can also avoid dependence on the cardinality of Fval using Remark 6.
(val)
Remark 6. Assume that Smval is generated by i.i.d. draws according to P(X,Y ) . Let LFval = {g :
f ∈ Fval , g(x, y) := L(f (x), y)}. By applying (Mohri et al., 2012, Theorem 3.1) to LFval , if the
codomain of L is in [0, 1], with probability at least 1 − δ, for all f ∈ Fval , R[f ] ≤ RS (val) [f ] +
mval
p
2Rm (LFval ) + (ln 1/δ)/mval .
Unlike the standard use of Rademacher complexity with a training dataset, the set Fval cannot
depend on the validation set Smval , but can depend on the training dataset S in any manner, and
hence Fval differs significantly from the typical hypothesis space defined by the parameterization
of models. We can thus end up with a very different effective capacity and hypothesis complexity
(as selected by model search using the validation set) depending on whether the training data are
random or have interesting structure which the neural network can capture.

5. Direct analyses of neural networks

Unlike the previous section, this section analyzes the generalization gap with a training dataset S. In
Section 3, we extended Open Problem 1 to Open Problem 2, and identified the different assumptions

8
G ENERALIZATION IN D EEP L EARNING

in theoretical and empirical studies. Accordingly, this section aims to address these problems to
some extent, both in the case of particular specified datasets and the case of random unspecified
datasets. To achieve this goal, this section presents a direct analysis of neural networks, rather than
deriving results about neural networks from more generic theories based on capacity, Rademacher
complexity, stability, or robustness.
Sections 5.2 and 5.3 deals with the squared loss, while Section 5.4 considers 0-1 loss with
multi-labels.

5.1 Model description via deep paths

We consider general neural networks of any depth that have the structure of a directed acyclic graph
(DAG) with ReLU nonlinearity and/or max pooling. This includes any structure of a feedforward
network with convolutional and/or fully connected layers, potentially with skip connections. For
pedagogical purposes, we first discuss our model description for layered networks without skip
connections, and then describe it for DAGs.
Layered nets without skip connections Let z [l] (x, w) ∈ Rnl be the pre-activation vector of
the l-th hidden layer, where nl is the width of the l-th hidden layer, and w represents the trainable
parameters. Let L−1 be the number of hidden layers. For layered networks without skip connections,
the pre-activation (or pre-nonlinearity) vector of the l-th layer can be written as

z [l] (x, w) = W [l] σ (l−1) z [l−1] (x, w) ,

with a boundary definition σ (0) z [0] (x, w) ≡ x, where σ (l−1) represents nonlinearity via ReLU

and/or max pooling at the (l − 1)-th hidden layer, and W [l] ∈ Rnl ×nl−1 is a matrix of weight
parameters connecting the (l − 1)-th layer to the l-th layer. Here, W [l] can have any structure (e.g.,
shared and sparse weights to represent a convolutional layer). Let σ̇ [l] (x, w) be a vector with each
element being 0 or 1 such that σ z (x, w) = σ̇ (x, w) ◦ z [l] (x, w), which is an element-wise
[l] [l] [l]

product of the vectors σ̇ [l] (x, w) and z [l] (x, w). Then, we can write the pre-activation of the k-th
output unit at the last layer l = L as
nL−1
[L] [L] (L−1) [L−1]
X
zk (x, w) = WkjL−1 σ̇jL−1 (x, w)zjL−1 (x, w).
jL−1 =1

By expanding z [l] (x, w) repeatedly and exchanging the sum and product via the distributive law of
multiplication,
nL−1 nL−2 n0
[L]
X X X
zk (x, w) = ··· W kjL−1 jL−2 ...j0 σ̇jL−1 jL−2 ...j1 (x, w)xj0 ,
jL−1 =1 jL−2 =1 j0 =1

[L] [l] QL−1 [l]

where W kjL−1 jL−2 ...j0 = WkjL−1 L−1
Q
l=1 Wjl jl−1 and σ̇jL−1 jL−2 ...j1 (x, w) = l=1 σ̇jl (x, w). By
merging the indices j0 , . . . , jL−1 into j with some bijection between {1, . . . , n0 } × · · · × {1, . . . ,
nL−1 } ∋ (j0 , . . . , jL−1 ) and {1, . . . , n0 n1 · · · nL−1 } ∋ j,
[L] P
zk (x, w) = j w̄k,j σ̄j (x, w)x̄j ,

9
K AWAGUCHI , K AELBLING , AND B ENGIO

where w̄k,j , σ̄j (x, w) and x̄j represent W kjL−1 jL−2 ...j0 , σ̇jL−1 jL−2 ...j1 (x, w) and xj0 respectively with
the change of indices (i.e., σj (x, w) and x̄j respectively contain the n0 P numbers and n1 · · · nL−1
numbers of the same copy of each σ̇jL−1 jL−2 ...j1 (x, w) and xj0 ). Note that j represents summation
over all the paths from the input x to the k-th output unit.
DAGs Remember that every DAG has at least one topological ordering, which can be used to
to create a layered structure with possible skip connections (e.g., see Healy and Nikolov 2001;
Neyshabur et al. 2015b). In other words, we consider DAGs such that the pre-activation vector of the
l-th layer can be written as
l−1 ′
′ ′
X
[l]
z (x, w) = W (l,l ) σ [l ] z [l ] (x, w)
l′ =0
′
with a boundary definition σ (0) z [0] (x, w) ≡ x, where W (l,l ) ∈ Rnl ×nl′ is a matrix of weight

′
parameters connecting the l′ -th layer to the l-th layer. Again, W (l,l ) can have any structure. Thus, in
the same way as with layered networks without skip connections, for all k ∈ {1, . . . , dy },
[L] P
zk (x, w) = j w̄k,j σ̄j (x, w)x̄j ,
P
where j represents the summation over all paths from the input x to the k-th output unit; i.e.,
w̄k,j σ̄j (x, w)x̄j is the contribution from the j-th path to the k-th output unit. Each of w̄k,j , σ̄j (x, w)
and x̄j is defined in the same manner as in the case of layered networks without skip connections. In
other words, the j-th path weight w̄k,j is the product of the weight parameters in the j-th path, and
σ̄j (x, w) is the product of the 0-1 activations in the j-th path, corresponding to ReLU nonlinearity
and max pooling; σ̄j (x, w) = 1 if all units in the j-th path are active, and σ̄j (x, w) = 0 otherwise.
Also, x̄j is the input used in the j-th path. Therefore, for DAGs, including layered networks without
skip connections,
[L]
zk (x, w) = [x̄ ◦ σ̄(x, w)]⊤ w̄k , (1)

where [x̄ ◦ σ̄(x, w)]j = x̄j σ̄j (x, w) and (w̄k )j = w̄k,j are the vectors of the size of the number of
the paths.

5.2 Theoretical insights via tight theory for every pair (P, S)
Theorem 7 solves Open Problem 2 (and hence Open Problem 1) for neural networks with squared loss
by stating that the generalization gap of a w with respect to a problem (P(X,Y ) , S) is tightly analyzable
with theoretical insights, based only on the quality of the w and the pair (P(X,Y ) , S). We do not
assume that S is generated randomly based on some relationship with P(X,Y ) ; the theorem holds for
any dataset, regardless of how it was generated. Let wS and w̄kS be the parameter vectors w and w̄k
learned with a dataset S. Let R[wS ] and RS [wS ] be the expect risk and empirical risk ofP the model
with the learned parameter w . Let zi = [x̄i ◦ σ̄(xi , w )]. Let G = Ex,y∼P(X,Y ) [zz ] − m m
S S ⊤ 1 ⊤
i=1 zi zi
1 Pm
and v = m i=1 yik zi −Ex,y∼P(X,Y ) [yk z]. Given a matrix M , let λmax (M ) be the largest eigenvalue
of M .
Theorem 7. Let {λj }j and {uj }j be a set of eigenvalues and a corresponding orthonormal set of
(1) (2)
eigenvectors of G. Let θw̄k ,j be the angle between uj and w̄k . Let θw̄k be the angle between v and

10
G ENERALIZATION IN D EEP L EARNING

w̄k . Then (deterministically),

!
dy
S S (2) (1)
R[w ] − RS [w ] − cy = ∥w̄k ∥22 λj cos2 θw̄S ,j
P P
2∥v∥2 ∥w̄k ∥2 cos θw̄S
S
+ S
k j k
k=1
dy
X
2∥v∥2 ∥w̄kS ∥2 + λmax (G)∥w̄kS ∥22 ,

≤
k=1

1 Pm
where cy = Ey [∥y∥22 ] − m
2
i=1 ∥yi ∥2 .

Proof idea. From Equation (1) with squared loss, we can decompose the generalization gap into
three terms:
dy " m
! #
X 1 X
R[wS ] − RS [wS ] = (w̄kS )⊤ E[zz ⊤ ] − zi zi⊤ w̄kS (2)
m
k=1 i=1
dy " m
! #
X 1 X ⊤ ⊤
+2 yik zi − E[yk z ] w̄kS
m
k=1 i=1
m
1 X
+ E[y ⊤ y] − yi⊤ yi .
m
i=1

By manipulating each term, we obtain the desired statement. See Appendix C3 for a complete proof.
□

In Theorem 7, there is no concept of a hypothesis space. Instead, it indicates that if the norm
of the weights ∥w̄kS ∥2 at the end of learning process with the actual given S is small, then the
generalization gap is small, even if the norm ∥w̄kS ∥2 is unboundedly large at anytime with any dataset
other than S.
Importantly, in Theorem 7, there are two other significant factors in addition to the norm of the
weights ∥w̄kS ∥2 . First, the eigenvalues of G and v measure the concentration of the given dataset S
with respect to the (unknown) P(X,Y ) in the space of the learned representation zi = [x̄i ◦ σ̄(xi , wS )].
Here, we can see the benefit of deep learning from the viewpoint of “deep-path” feature learning:
even if a given S is not concentrated in the original space, optimizing w can result in concentrating it
in the space of z. Similarly, cy measures the concentration of ∥y∥22 , but cy is independent of w and
unchanged after a pair (P(X,Y ) , S) is given. Second, the cos θ terms measure the similarity between
w̄kS and these concentration terms. Because the norm of the weights ∥w̄kS ∥2 is multiplied by those
other factors, the generalization gap can remain small, even if ∥w̄kS ∥2 is large, as long as some of
those other factors are small.
Based on a generic bound-based theory, Neyshabur et al. (2015a,b) proposed to control the norm
of the path weights ∥w̄k ∥2 , which is consistent with our direct bound-less result (and which is as
computationally tractable as a standard forward-backward pass). Unlike the previous results, we
do not require a pre-defined bound on ∥w̄k ∥2 over different datasets, but depend only on its final
From the derivation of Equation (1), one can compute ∥w̄kS ∥22 with a single forward pass using element-wise squared
weights, an identity input, and no nonlinearity. One can also follow the previous paper (Neyshabur et al., 2015a) for its
computation.

11
K AWAGUCHI , K AELBLING , AND B ENGIO

1.02

Test accuracy ratio

1
0.98
0.96
0.94 MNIST (ND)
MNIST
0.92 CIFAR10
0.9
0 0.2 0.4 0.6 0.8 1
α
Figure 2: Test accuracy ratio (Two-phase/Base). Notice that the y-axis starts with high initial
accuracy, even with a very small dataset size αm for learning wσ .

value with each S as desired, in addition to more tight insights (besides the norm) via equality as
discussed above. In addition to the pre-defined norm bound, these previous results have an explicit
exponential dependence on the depth of the network, which does not appear in our Theorem 4.
Similarly, some previous results specific to layered networks without skip connections (Sun et al.,
2016; XieQet al., 2015) contain theP2L−1 factor and Pa bound on theQproduct of the norm of weight
matrices, l=1 ∥W ∥, instead of k ∥w̄kS ∥2 . Here, k ∥w̄k ∥22 ≤ L
L (l) (l) 2
l=1 ∥W ∥F because the latter
contains all of the same terms as the former as well as additional non-negative additive terms after
expanding the sums in the definition of the norms.
Therefore, unlike previous bounds, Theorem 7 generates these new theoretical insights based on
the tight equality (in the first line of the equation in Theorem 7). Notice that without manipulating the
generalization gap, we can always obtain equality. However, the question answered here is whether
or not we can obtain competitive theoretical insights (the path norm bound) via equality instead of
inequality. From a practical view point, if the obtained insights are the same (e.g., regularize the
norm), then equality-based theory has the obvious advantage of being more precise.

5.3 Probabilistic bound over random datasets

While the previous subsection tightly analyzed each given point (P(X,Y ) , S), this subsection considers
the set P × D ∋ (P(X,Y ) , S), where D is the set of possible datasets S endowed with an i.i.d. product
measure Pm (X,Y ) where P(X,Y ) ∈ P (see Section 3.2).

In Equation (2), the generalization gap is decomposed into three terms, each of which contains
the difference between a sum of dependent random variables and its expectation. The dependence
comes from the fact that zi = [x̄i ◦ σ̄(xi , wS )] are dependent over the sample index i, because of
[L]
the dependence of wS on the entire dataset S. We then observe the following: in zk (x, w) =
[x̄ ◦ σ̄(x, w)]⊤ w̄, the derivative of z = [x̄ ◦ σ̄(x, w)] with respect to w is zero everywhere (except
for the measure zero set, where the derivative does not exist). Therefore, each step of the (stochastic)
gradient decent greedily chooses the best direction in terms of w̄ (with the current z = [x̄ ◦ σ̄(x, w)]),
but not in terms of w in z = [x̄ ◦ σ̄(x, w)] (see Appendix A3 for more detail). This observation leads
to a conjecture that the dependence of zi = [x̄i ◦ σ̄(xi , wS )] via the training process with the whole
dataset S is not entirely “bad”in terms of the concentration of the sum of the terms with zi .

12
G ENERALIZATION IN D EEP L EARNING

5.3.1 E MPIRICAL OBSERVATIONS

As a first step to investigate the dependence of zi , we evaluated the following novel two-phase training
procedure that explicitly breaks the dependence of zi over the sample index i. We first train a network
in a standard way, but only using a partial training dataset Sαm = {(x1 , y1 ), . . . , (xαm , yαm )} of
size αm, where α ∈ (0, 1) (standard phase). We then assign the value of wSαm to a new placeholder
wσ := wSαm and freeze wσ , meaning that as w changes, wσ does not change. At this point, we
[L]
have that zk (x, wSαm ) = [x̄ ◦ σ̄(x, wσ )]⊤ w̄kSαm . We then keep training only the w̄kSαm part with the
entire training dataset of size m (freeze phase), yielding the final model via this two-phase training
procedure as

[L]
z̃k (x, wS ) = [x̄ ◦ σ̄(x, wσ )]⊤ w̄kS . (3)

[L]
Note that the vectors wσ = wSαm and w̄kS contain the untied parameters in z̃k (x, wS ). See Ap-
pendix A4 for a simple implementation of this two-phase training procedure that requires at most
(approximately) twice as much computational cost as the normal training procedure.
We implemented the two-phase training procedure with the MNIST and CIFAR-10 datasets. The
test accuracies of the standard training procedure (base case) were 99.47% for MNIST (ND), 99.72%
for MNIST, and 92.89% for CIFAR-10. MNIST (ND) indicates MNIST with no data augmentation.
The experimental details are in Appendix B.
Our source code is available at: http://lis.csail.mit.edu/code/gdl.html
Figure 2 presents the test accuracy ratios for varying α: the test accuracy of the two-phase
training procedure divided by the test accuracy of the standard training procedure. The plot in Figure
2 begins with α = 0.05, for which αm = 3000 in MNIST and αm = 2500 in CIFAR-10. Somewhat
surprisingly, using a much smaller dataset for learning wσ still resulted in competitive performance.
A dataset from which we could more easily obtain a better generalization (i.e., MNIST) allowed us to
use smaller αm to achieve competitive performance, which is consistent with our discussion above.

5.3.2 T HEORETICAL RESULTS

We now prove a probabilistic bound for the hypotheses resulting from the two-phase training
algorithm. Let z̃i = [x̄i ◦ σ̄(xi , wσ )] where wσ := wSαm , as defined in the two-phase training
procedure above. Our two-phase training procedure forces z̃αm+1 , . . . , z̃m over samples to be
independent random variables (each z̃i is dependent over coordinates, which is taken care of in our
[L]
proof), while maintaining the competitive practical performance of the output model z̃k (·, wS ).
As a result, we obtain the following bound on the generalization gap for the practical deep models
[L]
z̃k (·, wS ). Let mσ = (1 − α)m. Given a matrix M , let ∥M ∥2 be the spectral norm of M .
(i) (i)
Assumption 1. Let G(i) = Ex [z̃ z̃ ⊤ ] − z̃i z̃i⊤ , Vkk′ = yik z̃i,k′ − Ex,y [yk z̃k′ ], and cy = Ey [∥y∥22 ] −
∥yi ∥22 . Assume that for all i ∈ {αm + 1, . . . , m},

• Czz ≥ λmax (G(i) ) and γzz

2 ≥ ∥E [(G(i) )2 ]∥
x 2

(i) 2 (i) 2
• Cyz ≥ maxk,k′ |Vkk ′ | and γyz ≥ maxk,k ′ Ex [(Vkk ′ ) ])

(i) (i)
• Cy ≥ |cy | and γy2 ≥ Ex [(cy )2 ].

13
K AWAGUCHI , K AELBLING , AND B ENGIO

Theorem 8. Suppose that Assumption 1 holds. Assume that S \ Sαm is generated by i.i.d. draws
according to true distribution P(X,Y ) . Assume that S \ Sαm is independent of Sαm . Let fA(S) be
the model learned by the two-phase training procedure with S. Then, for each wσ := wSαm , for any
δ > 0, with probability at least 1 − δ,
dy dy
2
X X
S
R[fA(S) ] − RS\Sαm [fA(S) ] ≤ β1 w̄k 1
+ 2β2 w̄kS 2
+ β3 ,
k=1 k=1
q q 2
2Czz 2
2γzz 2Cyz 6dy dz γyz 6dy dz 2Cy
where β1 = 3mσ ln 3dδ z + mσ ln 3dδ z , β2 = 3mσ ln δ + mσ ln δ , and β2 = 3mσ ln 3δ +
q 2
2γy 3
mσ ln δ .

Our proof does not require independence over the coordinates of z̃i and the entries of the random
matrices z̃i z̃i⊤ (see the proof of Theorem 8).
The bound in Theorem 8 is data-dependent because the norms of the weights w̄kS depend on each
particular S. Similarly to Theorem 7, the bound in Theorem 8 does not contain a pre-determined
bound on the norms of weights and can be independent of the concept of hypothesis space, as
desired; i.e., Assumption 1 can be also satisfied without referencing a hypothesis space of w, because
z̃ = [x̄i ◦ σ̄(xi , wσ )] with σ̄j (xi , wσ ) ∈ {0, 1}. However, unlike Theorem 7, Theorem 8 implicitly
contains the properties of datasets different from a given S, via the pre-defined bounds in Assumption
1. This is expected since Theorem 8 makes claims about the set of random datasets S instead of each
instantiated S. Therefore, while Theorem 8 presents a strongly-data-dependent bound (over random
datasets), Theorem 7 is tighter for each given S; indeed, the main equality of Theorem 7 is as tight as
possible.
Theorems 7 and 8 provide generalization bounds for practical deep learning models that do
not necessarily have explicit dependence on the number of weights, or exponential dependence on
depth or effective input dimensionality. Although the size of the vector w̄kS can be exponentially
[L]
large in the depth of the network, the norms of the vector need not be. Because z̃k (x, wS ) =
[L]
∥x̄ ◦ σ̄(x, wσ )∥2 ∥w̄kS ∥2 cos θ, we have that ∥w̄kS ∥2 = zk (x, w)/(∥x̄ ◦ σ̄(x, wσ )∥2 cos θ) (unless the
denominator is zero), where θ is the angle between x̄ ◦ σ̄(x, wσ ) and w̄kS . Additionally, as discussed
in Section 5.2, k ∥w̄k ∥22 ≤ L (l) 2
P Q
l=1 ∥W ∥F .

5.4 Probabilistic bound for 0-1 loss with multi-labels

For the 0–1 loss with multi-labels, the two-phase training procedure in Section 5.3 yields the
generalization bound in Theorem 9. Similarly to the bounds in Theorems 7 and 8, the generalization
bound in Theorem 9 does not necessarily have dependence on the number of weights, and exponential
dependence on depth and effective input dimensionality.
Theorem 9. Assume that S \ Sαm is generated by i.i.d. draws according to true distribution P(X,Y ) .
Assume that S \ Sαm is independent of Sαm . Fix ρ > 0 and wσ . Let F be the set of the models with
the two-phase training procedure. Suppose that Ex [∥x̄ ◦ σ̄(x, wσ )∥22 ] ≤ Cσ2 and maxk ∥w̄k ∥2 ≤ Cw
for all f ∈ F. Then, for any δ > 0, with probability at least 1 − δ, the following holds for all f ∈ F:
s
(ρ) 2d2y (1 − α)−1/2 Cσ Cw ln 1δ
R[f ] ≤ RS\Sαm [f ] + √ + .
ρ mσ 2mσ

14
G ENERALIZATION IN D EEP L EARNING

(ρ) (ρ) 1 Pm
Here, the empirical margin loss RS [f ] is defined as RS [f ] = m i=1 Lmargin,ρ (f (xi ), yi ),
where Lmargin,ρ is defined as follows:
(2) (1)
Lmargin,ρ (f (x), y) = Lmargin,ρ (Lmargin,ρ (f (x), y))

where
(1) [L]
Lmargin,ρ (f (x), y) = zy[L] (x) − max′ zy′ (x) ∈ R,
y̸=y

and 
0
 if ρ ≤ t
(2)
Lmargin,ρ (t) = 1 − t/ρ if 0 ≤ t ≤ ρ

1 if t ≤ 0.


6. Discussions and open problems

It is very difficult to make a detailed characterization of how well a specific hypotheses generated
by a certain learning algorithm will generalize, in the absence of detailed information about the
given problem instance. Traditional learning theory addresses this very difficult question and has
developed bounds that are as tight as possible given the generic information available. In this paper,
we have worked toward drawing stronger conclusions by developing theoretical analyses tailored for
the situations with more detailed information, including actual neural network structures, and actual
performance on a validation set.
Optimization and generalization in deep learning are closely related via the following observation:
if we make optimization easier by changing model architectures, generalization performance can be
degraded, and vice versa. Hence, non-pessimistic generalization theory discussed in this paper might
allow more architectural choices and assumptions in optimization theory.
Define the partial order of problem instances (P, S, f ) as

(P, S, f ) ≤ (P′ , S ′ , f ′ ) ⇔ RP [f ] − RS [f ] ≤ RP′ [f ′ ] − RS ′ [f ′ ]

where RP [f ] is the expected risk with probability measure P. Then, any theoretical insights without
the partial order preservation can be misleading as it can change the ranking of the preference
of (P, S, f ). For example, theoretically motivated algorithms can degrade actual performances
when compared with heuristics, if the theory does not preserve the partial order of (P, S, f ). This
observation suggests the following open problem.
Open Problem 3. Tightly characterize the expected risk R[f ] or the generalization gap R[f ]−RS [f ]
of a hypothesis f with a pair (P, S), producing theoretical insights while partially yet provably
preserving the partial order of (P, S, f ).
Theorem 7 partially addresses Open Problem 3 by preserving the exact ordering via equality
without bounds. However, it would be beneficial to consider a weaker notion of order preservation to
gain analyzability with more useful insights as stated in Open Problem 3.
Our discussion with Proposition 5 and Remark 6 suggests another open problem: analyzing the
role and influence of human intelligence on generalization. For example, human intelligence seems
to be able to often find good architectures (and other hyper-parameters) that get low validation errors
(without non-exponentially large |Fval | in Proposition 5, or a low complexity of LFval in Remark

15
K AWAGUCHI , K AELBLING , AND B ENGIO

6). A close look at the deep learning literature seems to suggest that this question is fundamentally
related to the process of science and engineering, because many successful architectures have been
designed based on the physical properties and engineering priors of the problems at hand (e.g.,
hierarchical nature, convolution, architecture for motion such as that by Finn et al. 2016, memory
networks, and so on). While this is a hard question, understanding it would be beneficial to further
automate the role of human intelligence towards the goal of artificial intelligence.

Acknowledgements
We gratefully acknowledge support from NSF grants 1420316, 1523767 and 1723381, from AFOSR
FA9550-17-1-0165, from ONR grant N00014-14-1-0486, and from ARO grant W911 NF1410433,
as well as support from NSERC, CIFAR and Canada Research Chairs. Any opinions, findings,
and conclusions or recommendations expressed in this material are those of the authors and do not
necessarily reflect the views of our sponsors.

References
Devansh Arpit, Stanislaw Jastrzebski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder S
Kanwal, Tegan Maharaj, Asja Fischer, Aaron Courville, Yoshua Bengio, et al. A closer look at
memorization in deep networks. In International Conference on Machine Learning, 2017.

Andrew R Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE
Transactions on Information theory, 39(3):930–945, 1993.

Peter L Bartlett, Stéphane Boucheron, and Gábor Lugosi. Model selection and error estimation.
Machine Learning, 48(1):85–113, 2002.

Olivier Bousquet and André Elisseeff. Stability and generalization. Journal of Machine Learning
Research, 2(Mar):499–526, 2002.

Anna Choromanska, MIkael Henaff, Michael Mathieu, Gerard Ben Arous, and Yann LeCun. The
loss surfaces of multilayer networks. In Proceedings of the Eighteenth International Conference
on Artificial Intelligence and Statistics, pages 192–204, 2015.

Laurent Dinh, Razvan Pascanu, Samy Bengio, and Yoshua Bengio. Sharp minima can generalize for
deep nets. In International Conference on Machine Learning, 2017.

Gintare Karolina Dziugaite and Daniel M Roy. Computing nonvacuous generalization bounds for
deep (stochastic) neural networks with many more parameters than training data. In Proceedings
of the Thirty-Third Conference on Uncertainty in Artificial Intelligence, 2017.

Chelsea Finn, Ian Goodfellow, and Sergey Levine. Unsupervised learning for physical interaction
through video prediction. In Advances in Neural Information Processing Systems, pages 64–72,
2016.

Alon Gonen and Shai Shalev-Shwartz. Fast rates for empirical risk minimization of strict saddle
problems. arXiv preprint arXiv:1701.04271, 2017.

16
G ENERALIZATION IN D EEP L EARNING

Moritz Hardt, Benjamin Recht, and Yoram Singer. Train faster, generalize better: Stability of
stochastic gradient descent. arXiv preprint arXiv:1509.01240, 2015.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual
networks. In European Conference on Computer Vision, pages 630–645. Springer, 2016.

Patrick Healy and Nikola S Nikolov. How to layer a directed acyclic graph. In International
Symposium on Graph Drawing, pages 16–30. Springer, 2001.

Ralf Herbrich and Robert C Williamson. Algorithmic luckiness. Journal of Machine Learning
Research, 3:175–212, 2002.

Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the gen-
eralization gap in large batch training of neural networks. arXiv preprint arXiv:1705.08741,
2017.

Kenji Kawaguchi. Deep learning without poor local minima. In Advances in Neural Information
Processing Systems, 2016a.

Kenji Kawaguchi. Bounded optimal exploration in MDP. In Proceedings of the 30th AAAI Conference
on Artificial Intelligence (AAAI), 2016b.

Kenji Kawaguchi and Yoshua Bengio. Generalization in machine learning via analytical learning
theory. arXiv preprint arXiv:1802.07426, 2018.

Kenji Kawaguchi, Leslie Pack Kaelbling, and Tomás Lozano-Pérez. Bayesian optimization with
exponential convergence. In Advances in Neural Information Processing (NIPS), 2015.

Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter
Tang. On large-batch training for deep learning: Generalization gap and sharp minima. In
International Conference on Learning Representations, 2017.

Vladimir Koltchinskii and Dmitriy Panchenko. Rademacher processes and bounding the risk of
function learning. In High dimensional probability II, pages 443–457. Springer, 2000.

Vladimir Koltchinskii and Dmitry Panchenko. Empirical margin distributions and bounding the
generalization error of combined classifiers. Annals of Statistics, pages 1–50, 2002.

David Krueger, Nicolas Ballas, Stanislaw Jastrzebski, Devansh Arpit, Maxinder S Kanwal, Tegan
Maharaj, Emmanuel Bengio, Asja Fischer, and Aaron Courville. Deep nets don’t learn via
memorization. In Workshop Track of International Conference on Learning Representations, 2017.

Ilja Kuzborskij and Christoph Lampert. Data-dependent stability of stochastic gradient descent.
arXiv preprint arXiv:1703.01678, 2017.

Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to
document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Moshe Leshno, Vladimir Ya Lin, Allan Pinkus, and Shimon Schocken. Multilayer feedforward
networks with a nonpolynomial activation function can approximate any function. Neural networks,
6(6):861–867, 1993.

17
K AWAGUCHI , K AELBLING , AND B ENGIO

Roi Livni, Shai Shalev-Shwartz, and Ohad Shamir. On the computational efficiency of training
neural networks. In Advances in Neural Information Processing Systems, pages 855–863, 2014.
Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of machine learning. MIT
press, 2012.
Guido F Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. On the number of linear
regions of deep neural networks. In Advances in neural information processing systems, pages
2924–2932, 2014.
Sayan Mukherjee, Partha Niyogi, Tomaso Poggio, and Ryan Rifkin. Learning theory: stability
is sufficient for generalization and necessary and sufficient for consistency of empirical risk
minimization. Advances in Computational Mathematics, 25(1):161–193, 2006.
Behnam Neyshabur, Ruslan R Salakhutdinov, and Nati Srebro. Path-sgd: Path-normalized opti-
mization in deep neural networks. In Advances in Neural Information Processing Systems, pages
2422–2430, 2015a.
Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. Norm-based capacity control in neural
networks. In Proceedings of The 28th Conference on Learning Theory, pages 1376–1401, 2015b.
Razvan Pascanu, Guido Montufar, and Yoshua Bengio. On the number of response regions of deep
feed forward networks with piece-wise linear activations. In International Conference on Learning
Representations, 2014.
Tomaso Poggio, Hrushikesh Mhaskar, Lorenzo Rosasco, Brando Miranda, and Qianli Liao. Why and
when can deep-but not shallow-networks avoid the curse of dimensionality: A review. International
Journal of Automation and Computing, pages 1–17, 2017.
Ikuro Sato, Hiroki Nishimura, and Kensuke Yokoi. Apac: Augmented pattern classification with
neural networks. arXiv preprint arXiv:1505.03229, 2015.
Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning: From theory to
algorithms. Cambridge university press, 2014.
John Shawe-Taylor, Peter L Bartlett, Robert C Williamson, and Martin Anthony. Structural risk
minimization over data-dependent hierarchies. IEEE transactions on Information Theory, 44(5):
1926–1940, 1998.
Jure Sokolic, Raja Giryes, Guillermo Sapiro, and Miguel Rodrigues. Generalization error of invariant
classifiers. In Artificial Intelligence and Statistics, pages 1094–1103, 2017a.
Jure Sokolic, Raja Giryes, Guillermo Sapiro, and Miguel RD Rodrigues. Robust large margin deep
neural networks. IEEE Transactions on Signal Processing, 2017b.
Shizhao Sun, Wei Chen, Liwei Wang, Xiaoguang Liu, and Tie-Yan Liu. On the depth of deep
neural networks: a theoretical view. In Proceedings of the Thirtieth AAAI Conference on Artificial
Intelligence, pages 2066–2072. AAAI Press, 2016.
Matus Telgarsky. Benefits of depth in neural networks. In 29th Annual Conference on Learning
Theory, pages 1517–1539, 2016.

18
G ENERALIZATION IN D EEP L EARNING

Joel A Tropp. User-friendly tail bounds for sums of random matrices. Foundations of computational
mathematics, 12(4):389–434, 2012.

Joel A Tropp et al. An introduction to matrix concentration inequalities. Foundations and Trends®
in Machine Learning, 8(1-2):1–230, 2015.

Vladimir Vapnik. Statistical learning theory, volume 1. Wiley New York, 1998.

Li Wan, Matthew Zeiler, Sixin Zhang, Yann L Cun, and Rob Fergus. Regularization of neural
networks using dropconnect. In Proceedings of the 30th international conference on machine
learning (ICML-13), pages 1058–1066, 2013.

Lei Wu, Zhanxing Zhu, et al. Towards understanding generalization of deep learning: Perspective of
loss landscapes. arXiv preprint arXiv:1706.10239, 2017.

Pengtao Xie, Yuntian Deng, and Eric Xing. On the generalization error bounds of neural networks
under diversity-inducing mutual angular regularization. arXiv preprint arXiv:1511.07110, 2015.

Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual
transformations for deep neural networks. arXiv preprint arXiv:1611.05431, 2016.

Huan Xu and Shie Mannor. Robustness and generalization. Machine learning, 86(3):391–423, 2012.

Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understand-
ing deep learning requires rethinking generalization. In International Conference on Learning
Representations, 2017.

A. Appendix: Additional discussions

Theorem 7 address Open Problem 2 with the limited applicability to certain neural networks with
squared loss. In contrast, a parallel study (Kawaguchi and Bengio, 2018) presents a novel generic
learning theory to address Open Problem 2 for general cases in machine learning. It would be
beneficial to explore both a generic analysis (Kawaguchi and Bengio, 2018) and a concrete analysis
in deep learning to get theoretical insights that are tailored for each particular case.
In previous bounds with a hypothesis space F, if we try different hypothesis spaces F depending
on S, the basic proof breaks down. An easy recovery at the cost of an extra quantity in a bound is
to take a union bound over all possible Fj for j = 1, 2, . . . , where we pre-decide {Fj }j without
dependence on S (because simply considering the “largest” F ⊇ Fj can result in a very loose bound
for each Fj ). Similarly, if we need to try many wσ := wSαm depending on the whole S in Theorem
(j) (j)
8, we can take a union bound over wσ for j = 1, 2, . . . , where we pre-determine {wσ }j without
dependence on S \ Sαm but with dependence on Sαm . We can do the same with Proposition 5 and
(val)
Remark 6 to use many Fval depending on the validation dataset Smval with a predefined sequence.

A1 Simple regularization algorithm

In general, theoretical bounds from statistical learning theory can be too loose to be directly used
in practice. In addition, many theoretical results in statistical learning theory end up suggesting to

19
K AWAGUCHI , K AELBLING , AND B ENGIO

Table 1: Test error (%)

Method MNIST CIFAR-10

Baseline 0.26 3.52
DARC1 0.20 3.43

simply regularize some notion of smoothness of a hypothesis class. Indeed, by upper bounding a
distance between two functions (e.g., a hypothesis and the ground truth function corresponding to
expected true labels), one can immediately see without statistical learning theory that regularizing
some notion of smoothness of the hypothesis class helps guarantees on generalization. Then, by
the Occam’s razor argument, one might prefer a simpler (yet still rigor) theory and a corresponding
simpler algorithm.
Accordingly, this subsection examines another simple regularization algorithm that directly regu-
larize smoothness of the learned hypothesis. In this subsection, we focus on multi-class classification
problems with dy classes, such as object classification with images. Accordingly, we analyze the
expected risk with 0–1 loss as R[f ] = Ex [1{f (x) = y(x)}], where f (x) = argmaxk∈{1,...,dy } (
[L]
zk (x)) is the model prediction, and y(x) ∈ {1, . . . , dy } is the true label of x.
This subsection proposes the following family of simple regularization algorithms: given any
architecture and method, add a new regularization term for each mini-batch as
m̄
λ X [L]
loss = original loss + max ξi zk (xi ) ,
m̄ k
i=1

where xi is drawn from some distribution approximating the true distribution of x, ξ1 , . . . , ξm̄ are
independently and uniformly drawn from {−1, 1}, m̄ is a mini-batch size and λ is a hyper-parameter.
Importantly, the approximation of the true distribution of x is only used for regularization purposes
and hence needs not be precisely accurate (as long as it plays its role for regularization). For example,
it can be approximated by populations generated by a generative neural network and/or an extra data
augmentation process. For simplicity, we call this family of methods as Directly Approximately
Regularizing Complexity (DARC).
In this paper, we evaluated only a very simple version of the proposed family of methods as a
first step. That is, our experiments employed the following simple and easy-to-implement method,
called DARC1:
m̄
!
λ X [L]
loss = original loss + max |zk (xi )| , (A.1)
m̄ k
i=1

where xi is the i-th sample in the training mini-batch. The additional computational cost and
[L]
programming effort due to this new regularization is almost negligible because zk (xi ) is already
used in computing the original loss. This simplest version was derived by approximating the true
distribution of x with the empirical distribution of the training data.
We evaluated the proposed method (DARC1) by simply adding the new regularization term in
equation (A.1) to existing standard codes for MNIST and CIFAR-10. A standard variant of LeNet

20
G ENERALIZATION IN D EEP L EARNING

Table 2: Test error ratio (DARC1/Base)

MNIST (ND) MNIST CIFAR-10

mean stdv mean stdv mean stdv
Ratio 0.89 0.61 0.95 0.67 0.97 0.79

1 Pm [L]
Table 3: Values of m maxk i=1 |zk (xi )|

MNIST (ND) MNIST CIFAR-10

Method
mean stdv mean stdv mean stdv
Base 17.2 2.40 8.85 0.60 12.2 0.32
DARC1 1.30 0.07 1.35 0.02 0.96 0.01

(LeCun et al., 1998) and ResNeXt-29(16 × 64d) (Xie et al., 2016) are used for MNIST and CIFAR-
10, and compared with the addition of the studied regularizer. For all the experiments, we fixed
(λ/m̄) = 0.001 with m̄ = 64. We used a single model without ensemble methods. The experimental
details are in Appendix B. The source code is available at: http://lis.csail.mit.edu/code/gdl.html
Table 1 shows the error rates comparable with previous results. To the best of our knowledge, the
previous state-of-the-art classification error is 0.23% for MNIST with a single model (Sato et al.,
2015) (and 0.21% with an ensemble by Wan et al. 2013). To further investigate the improvement,
we ran 10 random trials with computationally less expensive settings, to gather mean and standard
deviation (stdv). For MNIST, we used fewer epochs with the same model. For CIFAR-10, we
used a smaller model class (pre-activation ResNet with only 18 layers). Table 2 summarizes the
improvement ratio: the new model’s error divided by the base model’s error. We observed the
improvements for all cases. The test errors (standard deviations) of the base models were 0.53 (0.029)
for MNIST (ND), 0.28 (0.024) for MNIST, and 7.11 (0.17) for CIFAR-10 (all in %).
[L]
1
(maxk m
P
Table 3 summarizes the values of the regularization term m i=1 |zk (xi )|) for each
obtained model. The models learned with the proposed method were significantly different from
the base models in terms of this value. Interestingly, a comparison of the base cases for MNIST
(ND) and MNIST shows that data augmentation by itself implicitly regularized what we explicitly
regularized in the proposed method.

A2 Relationship to other fields

The situation where theoretical studies focus on a set of problems and practical applications care
about each element in the set is prevalent in machine learning and computer science literature, not
limited to the field of learning theory. For example, for each practical problem instance q ∈ Q,
the size of the set Q that had been analyzed in theory for optimal exploration in Markov decision
processes (MDPs) were demonstrated to be frequently too pessimistic, and a methodology to partially
mitigate the issue was proposed (Kawaguchi, 2016b). Bayesian optimization would suffer from a
pessimistic set Q regarding each problem instance q ∈ Q, the issue of which was partially mitigated
(Kawaguchi et al., 2015).

21
K AWAGUCHI , K AELBLING , AND B ENGIO

Moreover, characterizing a set of problems Q only via a worst-case instance q ′ ∈ Q (i.e., worst-
case analysis) is known to have several issues in theoretical computer science, and so-called beyond
worst-case analysis (e.g., smoothed analysis) is an active area of research to mitigate the issues.

A3 SGD chooses direction in terms of w̄

Recall that
[L]
zk (x, w) = z ⊤ w̄ = [x̄ ◦ σ̄(x, w)]⊤ w̄.
Note that σ(x, w) is 0 or 1 for max-pooling and/or ReLU nonlinearity. Thus, the derivative of
z = [x̄ ◦ σ̄(x, w)] with respect to w is zero everywhere (except at the measure zero set where the
derivative does not exists). Thus, by the chain rule (and power rule), the gradient of the loss with
[L]
respect to w only contain the contribution from the derivative of zk with respect to w̄, but not with
respect to w in z.

A4 Simple implementation of two-phase training procedure

Directly implementing Equation (3) requires the summation over all paths, which can be computa-
tionally expensive. To avoid it, we implemented it by creating two deep neural networks, one of
which defines w̄ paths hierarchically, and another of which defines wσ paths hierarchically, resulting
in the computational cost at most (approximately) twice as much as the original cost of training
standard deep learning models. We tied wσ and w̄ in the two networks during standard phase, and
untied them during freeze phase.
Our source code is available at: http://lis.csail.mit.edu/code/gdl.html
The computation of the standard network without skip connection can be re-written as:

z [l] (x, w) = σ [l] (W [l] z [l−1] (x, w))

= σ̇ [l] (W [l] z [l−1] (x, w)) ◦ W [l] z [l−1] (x, w)
= σ̇ [l] (Wσ[l] zσ[l−1] (x, w)) ◦ W [l] z [l−1] (x, w)
[l] [l−1] [l] [l−1] [l]
where Wσ := W [l] , zσ := σ(Wσ zσ (x, w)) and σ̇j (W [l] z [l−1] (x, w)) = 1 if the j-th unit at
[l] [l]
the l-th layer is active, σ̇j (W [l] z [l−1] (x, w)) = 0 otherwise. Note that because Wσ = W [l] , we
[l−1]
have that zσ = z [l] in standard phase and standard models.
[l] [l−1]
In the two-phase training procedure, we created two networks for Wσ zσ (x, w) and W [l]
[l] [l]
z [l−1] (x, w) separately. We then set Wσ = W [l] during standard phase, and frozen Wσ and only
trained W [l] during freeze phase. By following the same derivation of Equation (1), we can see that
this defines the desired computation without explicitly computing the summation over all paths. By
the same token, this applies to DAGs.

B. Appendix: Experimental details

For MNIST:
We used the following fixed architecture:

(i) Convolutional layer with 32 filters with filter size of 5 by 5, followed by max pooling of size
of 2 by 2 and ReLU.

22
G ENERALIZATION IN D EEP L EARNING

(ii) Convolution layer with 32 filters with filter size of 5 by 5, followed by max pooling of size of
2 by 2 and ReLU.

(iii) Fully connected layer with output 1024 units, followed by ReLU and Dropout with its proba-
bility being 0.5.

(iv) Fully connected layer with output 10 units.

Layer 4 outputs z [L] in our notation. For training purpose, we use softmax of z [L] . Also,
f (x) = argmax(z [L] (x)) is the label prediction.
We fixed learning rate to be 0.01, momentum coefficient to be 0.5, and optimization algorithm to
be (standard) stochastic gradient decent (SGD). We fixed data augmentation process as: random crop
with size 24, random rotation up to ±15 degree, and scaling of 15%. We used 3000 epochs for Table
1, and 1000 epochs for Tables 2 and 3.

For CIFAR-10:
For data augmentation, we used random horizontal flip with probability 0.5 and random crop of
size 32 with padding of size 4.
For Table 1, we used ResNeXt-29(16 × 64d) (Xie et al., 2016). We set initial learning rate to
be 0.05 and decreased to 0.005 at 150 epochs, and to 0.0005 at 250 epochs. We fixed momentum
coefficient to be 0.9, weight decay coefficient to be 5 × 10−4 , and optimization algorithm to be
stochastic gradient decent (SGD) with Nesterov momentum. We stopped training at 300 epochs.
For Tables 2 and 3, we used pre-activation ResNet with only 18 layers (pre-activation ResNet-18)
(He et al., 2016). We fixed learning rate to be 0.001 and momentum coefficient to be 0.9, and
optimization algorithm to be (standard) stochastic gradient decent (SGD). We used 1000 epochs.

C. Appendix: Proofs
We use the following lemma in the proof of Theorem 7.

Lemma 10. (Matrix Bernstein inequality: corollary to theorem 1.4 in Tropp 2012) Consider a finite
sequence {Mi } of independent, random, self-adjoint matrices with dimension d. AssumeP that each
random matrix satisfies that E[Mi ] = 0 and λmax (Mi ) ≤ R almost surely. Let γ 2 = ∥ i E[Mi2 ]∥2 .
Then, for any δ > 0, with probability at least 1 − δ,
! r
X 2R d d
λmax Mi ≤ ln + 2γ 2 ln .
3 δ δ
i

Proof. Theorem 1.4 by Tropp (2012) states that for all t ≥ 0,

" ! #
−t2 /2
X
P λmax Mi ≥ t ≤ d · exp .
γ 2 + Rt/3
i
2 /2

Setting δ = d exp − γ 2 t+Rt/3 implies

2
−t2 + R(ln d/δ)t + 2γ 2 ln d/δ = 0.
3

23
K AWAGUCHI , K AELBLING , AND B ENGIO

Solving for t with the quadratic formula

√ and bounding
√ the solution with the subadditivity of square
√
root on non-negative terms (i.e., a + b ≤ a + b for all a, b ≥ 0),

2
t ≤ R(ln d/δ) + 2γ 2 ln d/δ.
3

C1 Proof of Theorem 1
Proof. For any matrix M , let Col(M ) and Null(M ) be the column space and null space of M . Since
rank(Φ) ≥ m and Φ ∈ Rm×n , the set of its columns span Rm , which proves statement (i). Let
w∗ = w1∗ + w2∗ where Col(w1∗ ) ⊆ Col(M T ) and Col(w2∗ ) ⊆ Null(M ). For statement (ii), set the
parameter as w := w1∗ + ϵC1 + αC2 where Col(C1 ) ⊆ Col(M T ), Col(C2 ) ⊆ Null(M ), α ≥ 0 and
C2 = α1 w2∗ + C̄2 . Since rank(M ) < n, Null(M ) ̸= {0} and there exist non-zero C̄2 . Then,

Ŷ (w) = Y + ϵΦC1 ,

and
Ŷtest (w) = Ytest + ϵΦtest C1 .

By setting A = ΦC1 and B = Φtest C1 with a proper normalization of C1 yields (a) and (b) in
statement (ii) (note that C1 has an arbitrary freedom in the bound on its scale because its only
condition is Col(C1 ) ⊆ Col(M T )). At the same time with the same parameter, since Col(w1∗ +
ϵC1 ) ⊥ Col(C2 ),
∥w∥2F = ∥w1∗ + ϵC1 ∥2F + α2 ∥C2 ∥2F ,

and
∥w − w∗ ∥2F = ∥ϵC1 ∥2F + α2 ∥C̄2 ∥2F ,

which grows unboundedly as α → ∞ without changing A and B, proving (c) in statement (ii).

C2 Proof of Corollary 2
Proof. It follows the fact that the proof in Theorem 1 uses the assumption of n > m and rank(Φ) ≥
m only for statement (i).

C3 Proof of Theorem 7
Proof. From Equation (1), the squared loss of deep models for each point (x, y) can be rewritten as

dy dy
X X
⊤ 2
(z w̄k − yk ) = w̄k⊤ (zz ⊤ )w̄k − 2yk z ⊤ w̄k + yk2 .
k=1 k=1

24
G ENERALIZATION IN D EEP L EARNING

Thus, from Equation (1) with the squared loss, we can decompose the generalization gap into
three terms as

dy " m
! #
X 1 X
R[wS ] − RS [wS ] = (w̄kS )⊤ E[zz ⊤ ] − zi zi⊤ w̄kS
m
k=1 i=1
dy " m
! #
X 1 X
+2 yik zi⊤ − E[yk z ⊤ ] w̄kS
m
k=1 i=1
m
1 X
+ E[y ⊤ y] − yi⊤ yi .
m
i=1

As G is a real symmetric matrix, we denote an eigendecomposition of G as G = U ΛU ⊤

where the diagonal matrix Λ contains eigenvalues as Λjj = λj with the corresponding orthogonal
eigenvector matrix U ; uj is the j-th column of U . Then,

(1)
X X
(w̄kS )⊤ Gw̄kS = λj (u⊤ S 2 S 2
j w̄k ) = ∥w̄k ∥2 λj cos2 θw̄S ,j ,
k
j j

and
X
λj (u⊤ S 2 ⊤ S 2 S 2
j w̄k ) ≤ λmax (G)∥U w̄k ∥2 = λmax (G)∥w̄k ∥2 .
j

Also,
(2)
v ⊤ w̄kS = ∥v∥2 ∥w̄kS ∥2 cos θw̄S ≤ ∥v∥2 ∥w̄kS ∥2 .
k

By using these,

R[wS ] − RS [wS ] − cy
Pdy (2) (1)

2∥v∥2 ∥w̄kS ∥2 cos θw̄S + ∥w̄kS ∥22 j λj cos2 θw̄S ,j
P
= k=1
k k
dy
X
2∥v∥2 ∥w̄kS ∥2 + λmax (G)∥w̄kS ∥22 .

≤
k=1

C4 Proof of Theorem 8
Proof. We do not require the independence over the coordinates of z̃i and the entries of random
matricesP z̃i z̃i⊤ because of the definition of independence required for matrix Bernstein inequality
(for m1σ m σ ⊤
i=1 z̃i z̃i ) (e.g., see section 2.2.3 of Tropp et al. 2015) and because of a union bound
over coordinates (for m1σ m
P σ
i=1 yik z̃i ). We use the fact that z̃αm+1 , . . . , z̃m are independent random
variables over the sample index (although dependent over the coordinates), because a wσ := wSαm is
fixed and independent from xαm+1 , . . . , xm .

25
K AWAGUCHI , K AELBLING , AND B ENGIO

From Equation (2), with the definition of induced matrix norm and the Cauchy-Schwarz inequal-
ity,
R[fA(S) ] − RS\Sαm [fA(S) ] (C.1)
dy m
!
X 2 1 X
≤ w̄kS λ
2 max
E[z̃ z̃ ⊤ ] − z̃i z̃i⊤
mσ
k=1 i=αm+1
dy m
X 1 X
+2 ∥w̄kS ∥1 yik z̃i − E[yk z̃]
mσ
k=1 i=αm+1 ∞
m
1 X
+ E[y ⊤ y] − yi⊤ yi .
mσ
i=αm+1

In the below, we bound each term of the right-hand side of the above with concentration
inequalities.
For the first term: Matrix Bernstein inequality (Lemma 10) states that for any δ > 0, with
probability at least 1 − δ/3,
s
m
!
1 2C 3d 2
2γzz 3dz
zz z
X
λmax E[z̃ z̃ ⊤ ] − z̃i z̃i⊤ ≤ ln + ln .
mσ 3mσ δ mσ δ
i=αm+1

Here, Matrix Bernstein inequality was applied as follows. Let Mi = ( m1σ G(i) ). Then, m
P
i=αm+1
Mi = E[z̃ z̃ ⊤ ] − m1σ m ⊤ . We have that E[M ] = 0 for all i. Also, λ 1
P
z̃
i=αm+1 i i z̃ i max (M i ) ≤ mσ Czz
and ∥ i E[Mi2 ]∥2 ≤ m1σ γzz 2 .
P
For the second term: We apply Bernstein inequality to each (k, k ′ ) ∈ {1, . . . , dy } × {1, . . . , dz }
and take union bound over dy dz events, obtaining that for any δ > 0, with probability at least 1 − δ/3,
for all k ∈ {1, 2, . . . , dy },
s
m 2
1 X 2Cyz 6dy dz γyz 6dy dz
yik z̃i − E[yk z̃] ≤ ln + ln
mσ 3mσ δ mσ δ
i=αm+1 ∞

For the third term: From Bernstein inequality, with probability at least 1 − δ/3,
s
m
1 X 2Cy 3 2γy2 3
E[y ⊤ y] − yi⊤ yi ≤ ln + ln .
mσ 3m δ m δ
i=αm+1

Putting together: Putting together, for a fixed (or frozen) wσ , with probability at least 1 − δ
(probability over S \ Sαm = {(xαm+1 , yαm+1 ), . . . , (xm , ym )}), we have that
m
!
⊤ 1 X
⊤
λmax E[z̃ z̃ ] − z̃i z̃i ≤ β1
mσ
i=αm+1

1 Pm
, mσ i=αm+1 yik z̃i − E[yk z̃] ≤ β2 (for all k), and
∞
m
⊤ 1 X
E[y y] − yi⊤ yi ≤ β3 .
mσ
i=αm+1

26
G ENERALIZATION IN D EEP L EARNING

Since Equation (C.1) always hold deterministically (with or without such a dataset), the desired
statement of this theorem follows.

C5 Proof of Theorem 9
Proof. Define Smσ as

Smσ = S \ Sαm = {(xαm+1 , yαm+1 ), . . . , (xm , ym )}.

Recall the following fact: using the result by Koltchinskii and Panchenko (2002), we have that for
any δ > 0, with probability at least 1 − δ, the following holds for all f ∈ F:
s
2d2y ′ ln 1δ
R[f ] ≤ RSmσ ,ρ [f ] + R (F) + ,
ρmσ mσ 2mσ

where R′mσ (F) is Rademacher complexity defined as

mσ
" #
[L]
X
R′mσ (F) = ESmσ ,ξ sup ξi zk (xi , w) .
k,w i=1

Here, ξi is the Rademacher variable, and the supremum is taken over all k ∈ {1, . . . , dy } and all w
allowed in F. Then, for our parameterized hypothesis spaces with any frozen wσ ,
mσ
" #
X
R′mσ (F) = ESmσ ,ξ sup ξi [x̄i ◦ σ̄(xi , wσ )]⊤ w̄k
k,w̄k i=1
mσ
" #
X
≤ ESmσ ,ξ sup ξi [x̄i ◦ σ̄(xi , wσ )] ∥w̄k ∥2
k,w̄k i=1 2
mσ
" #
X
≤ Cw ESmσ ,ξ ξi [x̄i ◦ σ̄(xi , wσ )] .
i=1 2

Because square root is concave in its domain, by using Jensen’s inequality and linearity of expectation,
" m #
X σ

ESmσ ,ξ ξi [x̄i ◦ σ̄(xi , wσ )]

i=1 2
 1/2
mσ X
X mσ
≤ ESmσ Eξ [ξi ξj ][x̄i ◦ σ̄(xi , wσ )]⊤ [x̄j ◦ σ̄(xj , wσ )]
i=1 j=1

mσ i 1/2
!
X h
= ESmσ ∥[x̄i ◦ σ̄(xi , wσ )]∥22
i=1
√
≤ Cσ mσ .
√
Putting together, we have that R′m (F) ≤ Cσ Cw mσ .

27
K AWAGUCHI , K AELBLING , AND B ENGIO

C6 Proof of Proposition 5
Proof. Consider a single fixed f ∈ Fval . Since Fval is independent from the validation dataset,
κf,1 , . . . , κf,mval are independent zero-mean random variables, given a fixed f ∈ Fval . Thus, we can
apply Bernstein inequality, yielding
mval
!
ϵ2 mval /2

1 X
P κf,i > ϵ ≤ exp − 2 .
mval γ + ϵC/3
i=1

By taking union bound over all elements in Fval ,

By noticing that the solution of ϵ with the minus sign results in ϵ < 0, which is invalid for Bernstein
inequality, we obtain the valid solution with the plus sign. Then, we have
s
2C ln( |Fδval | ) 2γ 2 ln( |Fδval | )
ϵ≤ + ,
3mval mval
√ √ √
where we used that a + b ≤ a + b. By tanking the negation of the statement, we obtain that
for any δ > 0, with probability at least 1 − δ, for all f ∈ Fval ,
s
m |F |
1 X val
2C ln( δ ) val
2γ 2 ln( |Fδval | )
κf,i ≤ + ,
mval 3mval m
i=1

1 Pmval
where mval i=1 κf,i = R[f ] − RS (val) [f ].
mval

Understanding The Management of Cyber Resilient Systems
No ratings yet
Understanding The Management of Cyber Resilient Systems
18 pages
2503.02113v1
No ratings yet
2503.02113v1
20 pages
1611 03530 PDF
No ratings yet
1611 03530 PDF
15 pages
Fit without fear- remarkable mathematical phenomena of deep learning through the prism of interpolation
No ratings yet
Fit without fear- remarkable mathematical phenomena of deep learning through the prism of interpolation
51 pages
Classification With Deep Neural Networks and Logistic Loss: Zihan Zhang
No ratings yet
Classification With Deep Neural Networks and Logistic Loss: Zihan Zhang
117 pages
Credal Learning Theory
No ratings yet
Credal Learning Theory
30 pages
Generalization
No ratings yet
Generalization
10 pages
Generalization
No ratings yet
Generalization
10 pages
SML_Lecture2
No ratings yet
SML_Lecture2
35 pages
Statistical Learning Theory
No ratings yet
Statistical Learning Theory
4 pages
transformer_turing
No ratings yet
transformer_turing
13 pages
HW 1
No ratings yet
HW 1
6 pages
Towards A Mathematical Understanding of Neural Network-Based Machine Learning: What We Know and What We Don't
No ratings yet
Towards A Mathematical Understanding of Neural Network-Based Machine Learning: What We Know and What We Don't
56 pages
Deep Neural Networks Are Lazy: On The Inductive Bias of Deep Learning
No ratings yet
Deep Neural Networks Are Lazy: On The Inductive Bias of Deep Learning
78 pages
Lec-02-03
No ratings yet
Lec-02-03
39 pages
Recent Progress in The Theory of Deep Learning: Tengyu Ma Facebook AI Research/Stanford
No ratings yet
Recent Progress in The Theory of Deep Learning: Tengyu Ma Facebook AI Research/Stanford
50 pages
Cheatsheets For Deep Learning 1650192034
No ratings yet
Cheatsheets For Deep Learning 1650192034
95 pages
Theory of Deep Learning 1652786371
No ratings yet
Theory of Deep Learning 1652786371
118 pages
A Selective Overview of Deep Learning: Jianqing Fan Cong Ma Yiqiao Zhong April 16, 2019
No ratings yet
A Selective Overview of Deep Learning: Jianqing Fan Cong Ma Yiqiao Zhong April 16, 2019
37 pages
cs229 Notes4 PDF
No ratings yet
cs229 Notes4 PDF
11 pages
Six Lectures On NN - Montanari
No ratings yet
Six Lectures On NN - Montanari
77 pages
hypothesis_in_ml
No ratings yet
hypothesis_in_ml
8 pages
Unit 1-1
No ratings yet
Unit 1-1
75 pages
L D S M: Earning EEP Tructured Odels
No ratings yet
L D S M: Earning EEP Tructured Odels
11 pages
A Mathematical Theory of Generalization: Part I: David H. Wolpert
No ratings yet
A Mathematical Theory of Generalization: Part I: David H. Wolpert
50 pages
A Bayesian Perspective On Generalization and Stochastic Gradient Descent
No ratings yet
A Bayesian Perspective On Generalization and Stochastic Gradient Descent
4 pages
Mathematical Theory of Generalization: Part II: Comp Systems
No ratings yet
Mathematical Theory of Generalization: Part II: Comp Systems
49 pages
Notes On Deep Learning Theory
No ratings yet
Notes On Deep Learning Theory
68 pages
RADL TQKhoat
No ratings yet
RADL TQKhoat
50 pages
DSA5105 Lecture1
No ratings yet
DSA5105 Lecture1
51 pages
Chapter 08
100% (2)
Chapter 08
202 pages
The Lack of A Priori Distinctions Between Learning Algorithms
No ratings yet
The Lack of A Priori Distinctions Between Learning Algorithms
51 pages
Lecture 03 - Feedforward Networks - 4p
No ratings yet
Lecture 03 - Feedforward Networks - 4p
19 pages
s41467-024-54813-x
No ratings yet
s41467-024-54813-x
9 pages
On The Power and Limitations of Random Features For Understanding Neural Networks
No ratings yet
On The Power and Limitations of Random Features For Understanding Neural Networks
30 pages
2010 07140 PDF
No ratings yet
2010 07140 PDF
34 pages
Dropout As A Bayesian Approximation
No ratings yet
Dropout As A Bayesian Approximation
10 pages
DSA5102X_lecture1
No ratings yet
DSA5102X_lecture1
51 pages
3048 Greedy Layer Wise Training of Deep Networks
No ratings yet
3048 Greedy Layer Wise Training of Deep Networks
8 pages
Matematics and Machine Learning
No ratings yet
Matematics and Machine Learning
156 pages
Maths For ML
No ratings yet
Maths For ML
156 pages
hw1 PDF
No ratings yet
hw1 PDF
6 pages
What Neural Networks Memorize and Why: Discovering The Long Tail Via Influence Estimation
No ratings yet
What Neural Networks Memorize and Why: Discovering The Long Tail Via Influence Estimation
18 pages
DSA5102_lecture1
No ratings yet
DSA5102_lecture1
60 pages
ML MODULE - 1-1
No ratings yet
ML MODULE - 1-1
25 pages
Week 3
No ratings yet
Week 3
56 pages
Supervised Learning
No ratings yet
Supervised Learning
5 pages
4.machine Learning Basics (C)
No ratings yet
4.machine Learning Basics (C)
9 pages
Lecture1
No ratings yet
Lecture1
56 pages
Deep Learning Unit 2
No ratings yet
Deep Learning Unit 2
25 pages
Lec 3
No ratings yet
Lec 3
21 pages
A Bayesian/Information Theoretic Model of Learning To Learn Via Multiple Task Sampling
No ratings yet
A Bayesian/Information Theoretic Model of Learning To Learn Via Multiple Task Sampling
33 pages
DLbook
No ratings yet
DLbook
165 pages
NIPS 2017 Information Theoretic Analysis of Generalization Capability of Learning Algorithms Paper
No ratings yet
NIPS 2017 Information Theoretic Analysis of Generalization Capability of Learning Algorithms Paper
10 pages
HW 1 Eeowh 3
No ratings yet
HW 1 Eeowh 3
6 pages
1810 01075 PDF
No ratings yet
1810 01075 PDF
59 pages
MLSM Lecture2 120923
No ratings yet
MLSM Lecture2 120923
35 pages
Statistical Machine Learning-The Basic Approach and Current Research Challenges
No ratings yet
Statistical Machine Learning-The Basic Approach and Current Research Challenges
35 pages
Shape Theory: Categorical Methods of Approximation
From Everand
Shape Theory: Categorical Methods of Approximation
J. M. Cordier
No ratings yet
Elementary Point-Set Topology: A Transition to Advanced Mathematics
From Everand
Elementary Point-Set Topology: A Transition to Advanced Mathematics
Andre L. Yandl
5/5 (1)
Stationary and Related Stochastic Processes: Sample Function Properties and Their Applications
From Everand
Stationary and Related Stochastic Processes: Sample Function Properties and Their Applications
Harald Cramér
4/5 (2)
Shibi
No ratings yet
Shibi
144 pages
AccountStatement Report 6038802623 12022024 23 56
No ratings yet
AccountStatement Report 6038802623 12022024 23 56
2 pages
Operating Manual CTC
No ratings yet
Operating Manual CTC
68 pages
Dewalt Spare Parts
No ratings yet
Dewalt Spare Parts
711 pages
Engineering Data: GPO-3 8/11/2011 H900
No ratings yet
Engineering Data: GPO-3 8/11/2011 H900
2 pages
Custom Search Engine Project Presentation
No ratings yet
Custom Search Engine Project Presentation
18 pages
AI-and-Legal-Education
No ratings yet
AI-and-Legal-Education
2 pages
Euroglass Catalogue June2015 Door and Patch Fittings Section
No ratings yet
Euroglass Catalogue June2015 Door and Patch Fittings Section
18 pages
Apelem Kristal X-Ray Table - Technical Manual
No ratings yet
Apelem Kristal X-Ray Table - Technical Manual
430 pages
The Specifications of RTS5188-GR
No ratings yet
The Specifications of RTS5188-GR
1 page
ADET - Lesson 1
No ratings yet
ADET - Lesson 1
21 pages
1.X. An Introduction To Analytical Fuzzy Plane Geometry (Debdas Ghosh - Debjani Chakraborty - (Springer) ) 2019
No ratings yet
1.X. An Introduction To Analytical Fuzzy Plane Geometry (Debdas Ghosh - Debjani Chakraborty - (Springer) ) 2019
224 pages
CHAPTER II
No ratings yet
CHAPTER II
5 pages
Volvo Tooth System Handbook - 1 - 1
No ratings yet
Volvo Tooth System Handbook - 1 - 1
36 pages
Please Make Sure Minimum Manuscript Requirement Are Met
No ratings yet
Please Make Sure Minimum Manuscript Requirement Are Met
7 pages
Case Study - Turbine Bearing Housing Fire Accident Due To Hydrogen Inflow Into Nitrogen Line of Dry Gas Seal
No ratings yet
Case Study - Turbine Bearing Housing Fire Accident Due To Hydrogen Inflow Into Nitrogen Line of Dry Gas Seal
16 pages
What Is Difference Between Ground Stabilization & Sea Stabilization of Radar MarineGyaan
100% (2)
What Is Difference Between Ground Stabilization & Sea Stabilization of Radar MarineGyaan
1 page
s14904.r4 Acorn Rev3, Estun Pronet
No ratings yet
s14904.r4 Acorn Rev3, Estun Pronet
1 page
12) College Fest Organizer (Abstract)
0% (1)
12) College Fest Organizer (Abstract)
4 pages
Rough TT
No ratings yet
Rough TT
3 pages
2020 Colocation Data Centers in India
No ratings yet
2020 Colocation Data Centers in India
12 pages
Automatic Lens Distortion Correction Using One-Parameter Division Models
No ratings yet
Automatic Lens Distortion Correction Using One-Parameter Division Models
17 pages
PuraSeva Centre User Manual
No ratings yet
PuraSeva Centre User Manual
69 pages
Safety Integrity Level (SIL) Determination Using LOPA Methods To Comply With IEC 61511 and ISA 84
No ratings yet
Safety Integrity Level (SIL) Determination Using LOPA Methods To Comply With IEC 61511 and ISA 84
3 pages
CloudComputing MotivationAndContext
No ratings yet
CloudComputing MotivationAndContext
8 pages
SL210-211 - Migrating To OO Programming With Java - Oh - 0499
No ratings yet
SL210-211 - Migrating To OO Programming With Java - Oh - 0499
175 pages
Asme Fatigue Usage Factor
100% (1)
Asme Fatigue Usage Factor
43 pages
MG15cfx 100cfx
No ratings yet
MG15cfx 100cfx
7 pages
RJ-SERIES PotablE WatER CaRtS - Aeroservicios
No ratings yet
RJ-SERIES PotablE WatER CaRtS - Aeroservicios
34 pages

generalize_DL_2023

Uploaded by

generalize_DL_2023

Uploaded by

Generalization in Deep Learning

Kenji Kawaguchi Leslie Pack Kaelbling Yoshua Bengio

BibTeX of this paper is available at: https://people.csail.mit.edu/kawaguch/bibtex.html

the generalization gap := R[fA(S) ] − RS [fA(S) ].

R[fA(S) ] − RS [fA(S) ] ≤ sup R[f ] − RS [f ],

(a) Ŷ (w) = Y + ϵA for some matrix A with ∥A∥F ≤ 1, and

Corollary 2. If n ≤ m and if rank(M ) < n, then statement (ii) in Theorem 1 holds.

3.1 Consistency of theory

Figure 1: An illustration of differences in assumptions. Statistical learning theory analyzes the

3.2 Difference in assumptions and problem settings

proven for the worst-case distribution Pworst

3.3 Practical role of generalization theory

Role 1 Provide guarantees on expected risk.

Role 2 Guarantee generalization gap

Role 2.1 to be small for a given fixed S, and/or

4. Generalization bounds via validation

5. Direct analyses of neural networks

5.1 Model description via deep paths

[L] [l] QL−1 [l]

w̄k . Then (deterministically),

Test accuracy ratio

5.3 Probabilistic bound over random datasets

5.3.1 E MPIRICAL OBSERVATIONS

5.3.2 T HEORETICAL RESULTS

• Czz ≥ λmax (G(i) ) and γzz

5.4 Probabilistic bound for 0-1 loss with multi-labels

6. Discussions and open problems

(P, S, f ) ≤ (P′ , S ′ , f ′ ) ⇔ RP [f ] − RS [f ] ≤ RP′ [f ′ ] − RS ′ [f ′ ]

A. Appendix: Additional discussions

A1 Simple regularization algorithm

Table 1: Test error (%)

Method MNIST CIFAR-10

Table 2: Test error ratio (DARC1/Base)

MNIST (ND) MNIST CIFAR-10

MNIST (ND) MNIST CIFAR-10

A2 Relationship to other fields

A3 SGD chooses direction in terms of w̄

A4 Simple implementation of two-phase training procedure

z [l] (x, w) = σ [l] (W [l] z [l−1] (x, w))

B. Appendix: Experimental details

(iv) Fully connected layer with output 10 units.

Proof. Theorem 1.4 by Tropp (2012) states that for all t ≥ 0,

Solving for t with the quadratic formula

As G is a real symmetric matrix, we denote an eigendecomposition of G as G = U ΛU ⊤

Smσ = S \ Sαm = {(xαm+1 , yαm+1 ), . . . , (xm , ym )}.

where R′mσ (F) is Rademacher complexity defined as

ESmσ ,ξ ξi [x̄i ◦ σ̄(xi , wσ )]

By taking union bound over all elements in Fval ,

You might also like