spectrum_dependent_learning
Learning Curves in Kernel Regression and Wide Neural Networks

We derive analytical expressions for the generalization performance of kernel regression as a function of the number of training samples using theoretical methods from Gaussian processes and statistical physics. Our expressions apply to wide neural networks due to an equivalence between training them and kernel regression with the Neural Tangent Kernel (NTK). By computing the decomposition of the total generalization error due to different spectral components of the kernel, we identify a new spectral principle: as the size of the training set grows, kernel machines and neural networks fit successively higher spectral modes of the target function. When data are sampled from a uniform distribution on a high-dimensional hypersphere, dot product kernels, including NTK, exhibit learning stages where different frequency modes of the target function are learned. We verify our theory with simulations on synthetic data and the MNIST dataset.

1 John A. Paulson School of Engineering and Applied Sciences, Harvard University, Cambridge, MA, USA. 2 Department of Physics, Harvard University, Cambridge, MA, USA. 3 Center for Brain Science, Harvard University, Cambridge, MA, USA. Correspondence to: Cengiz Pehlevan <[email protected]>.

Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

Code: https://github.com/Pehlevan-Group/NTK_Learning_Curves

1. Introduction

Finding statistical patterns in data that generalize beyond a training set is a main goal of machine learning. Generalization performance depends on factors such as the number of training examples, the complexity of the learning task, and the nature of the learning machine. Identifying precisely how these factors impact the performance poses a theoretical challenge. Here, we present a theory of generalization in kernel machines (Schölkopf & Smola, 2001) and neural networks.

The goal of our theory is not to provide worst case bounds on generalization performance in the sense of statistical learning theory (Vapnik, 1999), but to provide analytical expressions that explain the average or a typical performance in the spirit of statistical physics. The techniques we use are a continuous approximation to learning curves previously used in Gaussian processes (Sollich, 1999; 2002; Sollich & Halees, 2002) and the replica method of statistical physics (Sherrington & Kirkpatrick, 1975; Mézard et al., 1987).

We first develop an approximate theory of generalization in kernel regression that is applicable to any kernel. We then use our theory to gain insight into neural networks by using a correspondence between kernel regression and neural network training. When the hidden layers of a neural network are taken to infinite width with a certain initialization scheme, recent influential work (Jacot et al., 2018; Arora et al., 2019; Lee et al., 2019) showed that training a feedforward neural network with gradient descent to zero training loss is equivalent to kernel interpolation (or ridgeless kernel regression) with a kernel called the Neural Tangent Kernel (NTK) (Jacot et al., 2018). Our kernel regression theory contains kernel interpolation as a special limit (ridge parameter going to zero).

Our contributions and results are summarized below:

• Using a continuous approximation to learning curves adapted from Gaussian process literature (Sollich, 1999; 2002), we derive analytical expressions for learning curves for each spectral component of a target function learned through kernel regression.
• We present another way to arrive at the same analytical expressions using the replica method of statistical physics and a saddle-point approximation (Sherrington & Kirkpatrick, 1975; Mézard et al., 1987).
• Analysis of our theoretical expressions shows that different spectral modes of a target function are learned with different rates. Modes corresponding to higher kernel eigenvalues are learned faster, in the sense that a marginal training data point causes a greater percent reduction in generalization error for higher eigenvalue modes than for lower eigenvalue modes.
• When data are sampled from a uniform distribution on a hypersphere, dot product kernels, which include NTK, admit a degenerate Mercer decomposition in spherical harmonics, $Y_{km}$. In this case, our theory predicts that the generalization error of lower frequency modes of the target function decreases more quickly than that of higher frequency modes as the dataset size grows. Different learning stages are visible in the sense described below.
• As the dimension of the data, $d$, goes to infinity, learning curves exhibit different learning stages. For a training set of size $p \sim O(d^l)$, modes with $k < l$ are perfectly learned, modes with $k = l$ are being learned, and modes with $k > l$ are not learned.
• We verify the predictions of our theory using numerical simulations for kernel regression and kernel interpolation with NTK, and wide and deep neural networks trained with gradient descent. Our theory fits experiments remarkably well on synthetic datasets and MNIST.

1.1. Related Work

Our main approximation technique comes from the literature on Gaussian processes, which is related to kernel regression in a certain limit. Total learning curves for Gaussian processes, but not their spectral decomposition as we do here, have been studied in a limited teacher-student setting where both student and teacher were described by the same Gaussian process and the same noise (Opper & Vivarelli, 1998; Sollich, 1999). We allow arbitrary teacher distributions. Sollich also considered mismatched models where teacher and student kernels had different eigenspectra and different noise levels (Sollich, 2002). The total learning curve from this model is consistent with our results when the teacher noise is sent to zero, but we also consider, provide expressions for, and analyze generalization in spectral modes. We use an analogue of the "lower-continuous" approximation scheme introduced in (Sollich & Halees, 2002), the results of which we reproduce through the replica method (Mézard et al., 1987).

Generalization bounds for kernel ridge regression have been obtained in many contexts (Schölkopf & Smola, 2001; Cucker & Smale, 2002; Vapnik, 1999; Gyorfi et al., 2003), but the rates of convergence often crucially depend on the explicit ridge parameter $\lambda$ and do not provide guarantees in the ridgeless case. Using a teacher-student setting, Spigler et al. (2019) showed that learning curves for kernel regression asymptotically decay with a power law determined by the decay rate of the teacher and the student. Such power law decays have been observed empirically on standard datasets (Hestness et al., 2017; Spigler et al., 2019). Recently, interest in explaining the phenomenon of interpolation has led to the study of generalization bounds on ridgeless regression (Belkin et al., 2018b;a; 2019b; Liang & Rakhlin, 2018). Here, our aim is to capture the average case performance of kernel regression, as opposed to a bound on it, that remains valid for the ridgeless case and finite sample sizes.

In the statistical physics domain, Dietrich et al. (1999) calculated learning curves for support vector machines, but not kernel regression, in the limit of number of training samples going to infinity for dot product kernels with binary inputs using a replica method. Our theory applies to general kernels and finite size datasets. In the infinite training set limit, they observed several learning stages where each spectral mode is learned with a different rate. We observe similar phenomena in kernel regression. In a similar spirit, Cohen et al. (2019) calculate learning curves for infinite-width neural networks using a path integral formulation and a replica analysis but do not discuss the spectral dependence of the generalization error.

In the infinite width limit, neural networks have many more parameters than training samples yet they do not overfit (Zhang et al., 2017). Some authors suggested that this is a consequence of the training procedure since stochastic gradient descent is implicitly biased towards choosing the simplest functions that interpolate the training data (Belkin et al., 2019a; 2018b; Xu et al., 2019a; Jacot et al., 2018). Other studies have shown that neural networks fit the low frequency components of the target before the high frequency components during training with gradient descent (Xu et al., 2019b; Rahaman et al., 2019; Zhang et al., 2019; Luo et al., 2019). In addition to training dynamics, recent works such as (Yang & Salman, 2019; Bietti & Mairal, 2019; Cao et al., 2019) have discussed how the spectrum of a kernel impacts its smoothness and approximation properties. Here we explore similar ideas by explicitly calculating average case learning curves for kernel regression and studying their dependence on the kernel's eigenspectrum.

2. Kernel Regression Learning Curves

We start with a general theory of kernel regression. Implications of our theory for dot product kernels including NTK and trained neural networks are described in Section 3.

2.1. Notation and Problem Setup

We start by defining our notation and setting up our problem. Our initial goal is to derive a mathematical expression for generalization error in kernel regression, which we will analyze in the subsequent sections using techniques from the Gaussian process literature (Sollich, 1999; 2002; Sollich & Halees, 2002) and statistical physics (Sherrington & Kirkpatrick, 1975; Mézard et al., 1987).

The goal of kernel regression is to learn a function $f : \mathcal{X} \to \mathbb{R}^C$ from a finite number of observations (Wahba, 1990; Schölkopf & Smola, 2001). In developing our theory, we will first focus on the case where $C = 1$, and later extend
our results to $C > 1$ as we discuss in Section 2.5. Let $\{\mathbf{x}_i, y_i\} \in \mathcal{X} \times \mathbb{R}$, where $\mathcal{X} \subseteq \mathbb{R}^d$, be one of the $p$ training examples and let $\mathcal{H}$ be a Reproducing Kernel Hilbert space (RKHS) with inner product $\langle \cdot, \cdot \rangle_{\mathcal{H}}$. To avoid confusion with our notation for averaging, we will always decorate angular brackets for Hilbert inner product with $\mathcal{H}$ and a comma. Kernel ridge regression is defined as:

$$\min_{f \in \mathcal{H}} \sum_{i=1}^{p} (f(\mathbf{x}_i) - y_i)^2 + \lambda \|f\|_{\mathcal{H}}^2. \tag{1}$$

The $\lambda \to 0$ limit is referred to as interpolating kernel regression, and, as we will discuss later, relevant to training wide neural networks. The unique minimum of the convex optimization problem is given by

$$f(\mathbf{x}) = \mathbf{y}^\top (\mathbf{K} + \lambda \mathbf{I})^{-1} \mathbf{k}(\mathbf{x}), \tag{2}$$

where $K(\cdot, \cdot)$ is the reproducing kernel for $\mathcal{H}$, $\mathbf{K}$ is the $p \times p$ kernel gram matrix $\mathbf{K}_{ij} = K(\mathbf{x}_i, \mathbf{x}_j)$, and $\mathbf{k}(\mathbf{x})_i = K(\mathbf{x}, \mathbf{x}_i)$. Lastly, $\mathbf{y} \in \mathbb{R}^p$ is the vector of target values $y_i = f^*(\mathbf{x}_i)$. For interpolating kernel regression, when the kernel is invertible, the solution is the same except that $\lambda = 0$, meaning that training data is fit perfectly. The proof of this optimal solution is provided in the Supplementary Information (SI) Section 1.

Let $p(\mathbf{x})$ be the probability density function from which the input data are sampled. The generalization error is defined as the expected risk with expectation taken over new test points sampled from the same density $p(\mathbf{x})$. For a given dataset $\{\mathbf{x}_i\}$ and target function $f^*(\mathbf{x})$, let $f_K(\mathbf{x}; \{\mathbf{x}_i\}, f^*)$ represent the function learned with kernel regression. The generalization error for this dataset and target function is

$$E_g(\{\mathbf{x}_i\}, f^*) = \int d\mathbf{x}\, p(\mathbf{x}) \left( f_K(\mathbf{x}; \{\mathbf{x}_i\}, f^*) - f^*(\mathbf{x}) \right)^2. \tag{3}$$

To calculate the average case performance of kernel regression, we average this generalization error over the possible datasets $\{\mathbf{x}_i\}$ and target functions $f^*$

$$E_g = \left\langle E_g(\{\mathbf{x}_i\}, f^*) \right\rangle_{\{\mathbf{x}_i\}, f^*}. \tag{4}$$

Our aim is to calculate $E_g$ for a general kernel and a general distribution over teacher functions.

For our theory, we will find it convenient to work with the feature map defined by the Mercer decomposition. By Mercer's theorem (Mercer, 1909; Rasmussen & Williams, 2005), the kernel admits a representation in terms of its $M$ kernel eigenfunctions $\{\phi_\rho(\mathbf{x})\}$,

$$K(\mathbf{x}, \mathbf{x}') = \sum_{\rho=1}^{M} \lambda_\rho \phi_\rho(\mathbf{x}) \phi_\rho(\mathbf{x}') = \sum_{\rho=1}^{M} \psi_\rho(\mathbf{x}) \psi_\rho(\mathbf{x}'), \tag{5}$$

where $\psi_\rho(\mathbf{x}) = \sqrt{\lambda_\rho}\, \phi_\rho(\mathbf{x})$ is the feature map we will work with. In our analysis, $M$ will be taken to be infinite, but for the derivation of the learning curves, we will first consider $M$ as a finite integer. The eigenfunctions and eigenvalues are defined with respect to the probability measure that generates the data, $d\mu(\mathbf{x}) = p(\mathbf{x})\, d\mathbf{x}$:

$$\int d\mathbf{x}'\, p(\mathbf{x}') K(\mathbf{x}, \mathbf{x}') \phi_\rho(\mathbf{x}') = \lambda_\rho \phi_\rho(\mathbf{x}). \tag{6}$$

We will also find it convenient to work with a vector representation of the RKHS functions in the feature space. Kernel eigenfunctions form a complete orthonormal basis, allowing the expansion of the target function $f^*$ and learned function $f$ in terms of features $\{\psi_\rho(\mathbf{x})\}$

$$f^*(\mathbf{x}) = \sum_\rho \bar{w}_\rho \psi_\rho(\mathbf{x}), \qquad f(\mathbf{x}) = \sum_\rho w_\rho \psi_\rho(\mathbf{x}). \tag{7}$$

Hence, the $M$-dimensional vectors $\mathbf{w}$ and $\bar{\mathbf{w}}$ constitute a representation of $f$ and $f^*$ respectively in the feature space.

We can also obtain a feature space expression for the optimal kernel regression function (2). Let $\boldsymbol{\Psi} \in \mathbb{R}^{M \times p}$ be the feature matrix for the sample so that $\boldsymbol{\Psi}_{\rho,i} = \psi_\rho(\mathbf{x}_i)$. With this representation, kernel ridge regression (1) can be recast as the optimization problem $\min_{\mathbf{w} \in \mathbb{R}^M,\ \|\mathbf{w}\|_2 < \infty} \|\boldsymbol{\Psi}^\top \mathbf{w} - \mathbf{y}\|^2 + \lambda \|\mathbf{w}\|^2$, whose solution is

$$\mathbf{w} = (\boldsymbol{\Psi} \boldsymbol{\Psi}^\top + \lambda \mathbf{I})^{-1} \boldsymbol{\Psi} \mathbf{y}. \tag{8}$$

Another novelty of our theory is the decomposition of the generalization error into its contributions from different eigenmodes. The feature space expression of the generalization error after averaging over the data distribution can be written as:

$$E_g = \sum_\rho E_\rho, \qquad E_\rho \equiv \lambda_\rho \left\langle (w_\rho - \bar{w}_\rho)^2 \right\rangle_{\{\mathbf{x}_i\}, \bar{\mathbf{w}}}, \tag{9}$$

where we identify $E_\rho$ as the generalization error in mode $\rho$.

Proof.

$$\begin{aligned}
E_g &= \left\langle (f(\mathbf{x}) - f^*(\mathbf{x}))^2 \right\rangle_{\mathbf{x}, \{\mathbf{x}_i\}, f^*} \\
&= \sum_{\rho, \gamma} \left\langle (w_\rho - \bar{w}_\rho)(w_\gamma - \bar{w}_\gamma) \right\rangle_{\{\mathbf{x}_i\}, f^*} \left\langle \psi_\rho(\mathbf{x}) \psi_\gamma(\mathbf{x}) \right\rangle_{\mathbf{x}} \\
&= \sum_\rho \lambda_\rho \left\langle (w_\rho - \bar{w}_\rho)^2 \right\rangle_{\{\mathbf{x}_i\}, \bar{\mathbf{w}}} = \sum_\rho E_\rho.
\end{aligned} \tag{10}$$

We introduce a matrix notation for RKHS eigenvalues $\boldsymbol{\Lambda}_{\rho,\gamma} \equiv \delta_{\rho,\gamma} \lambda_\rho$ for convenience. Finally, with our notation set up, we can present our first result about generalization error.
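As a brief numerical aside, the agreement between the kernel-space solution (2) and the feature-space solution (8) can be checked on a small example. The sketch below is illustrative only: the random Gaussian feature matrix, the sizes `M`, `p`, `n_test`, the ridge value `lam`, and the function names are all assumptions made for the example, not quantities from the paper. It relies only on $\mathbf{K} = \boldsymbol{\Psi}^\top \boldsymbol{\Psi}$ and $\mathbf{k}(\mathbf{x}) = \boldsymbol{\Psi}^\top \boldsymbol{\psi}(\mathbf{x})$, which follow from eq. (5).

```python
import numpy as np

def krr_kernel_form(Psi_train, y, psi_test, lam):
    """Eq. (2): f(x) = y^T (K + lam I)^{-1} k(x), with K = Psi^T Psi, k(x) = Psi^T psi(x)."""
    K = Psi_train.T @ Psi_train                      # p x p Gram matrix
    k = Psi_train.T @ psi_test                       # p x n_test kernel evaluations
    alpha = np.linalg.solve(K + lam * np.eye(K.shape[0]), y)
    return alpha @ k

def krr_feature_form(Psi_train, y, psi_test, lam):
    """Eq. (8): w = (Psi Psi^T + lam I)^{-1} Psi y, then f(x) = w^T psi(x)."""
    M = Psi_train.shape[0]
    w = np.linalg.solve(Psi_train @ Psi_train.T + lam * np.eye(M), Psi_train @ y)
    return w @ psi_test

# Hypothetical toy problem: random features stand in for psi_rho(x_i).
rng = np.random.default_rng(0)
M, p, n_test, lam = 6, 10, 5, 0.1
Psi_train = rng.normal(size=(M, p))                  # feature matrix, M x p
psi_test = rng.normal(size=(M, n_test))              # features of test points
w_bar = rng.normal(size=M)                           # teacher weights
y = Psi_train.T @ w_bar                              # noiseless targets y_i = f*(x_i)

f_kernel = krr_kernel_form(Psi_train, y, psi_test, lam)
f_feature = krr_feature_form(Psi_train, y, psi_test, lam)
max_gap = float(np.max(np.abs(f_kernel - f_feature)))
```

Under these assumptions `max_gap` should be at the level of floating-point round-off, since the two forms are related by the identity $(\boldsymbol{\Psi}\boldsymbol{\Psi}^\top + \lambda \mathbf{I})^{-1} \boldsymbol{\Psi} = \boldsymbol{\Psi} (\boldsymbol{\Psi}^\top \boldsymbol{\Psi} + \lambda \mathbf{I})^{-1}$.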
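Proposition 1. For the $\mathbf{w}$ that minimizes the training error (eq. (8)), the generalization error (eq. (4)) is given by

$$E_g = \mathrm{Tr}\left( \mathbf{D} \left\langle \mathbf{G}^2 \right\rangle_{\{\mathbf{x}_i\}} \right), \tag{11}$$

which can be decomposed into modal generalization errors

$$E_\rho = \sum_\gamma \mathbf{D}_{\rho,\gamma} \left\langle \mathbf{G}^2_{\gamma,\rho} \right\rangle_{\{\mathbf{x}_i\}}, \tag{12}$$

where

$$\mathbf{G} = \left( \frac{1}{\lambda} \boldsymbol{\Phi} \boldsymbol{\Phi}^\top + \boldsymbol{\Lambda}^{-1} \right)^{-1}, \qquad \boldsymbol{\Phi} = \boldsymbol{\Lambda}^{-1/2} \boldsymbol{\Psi}, \tag{13}$$

and

$$\mathbf{D} = \boldsymbol{\Lambda}^{-1/2} \left\langle \bar{\mathbf{w}} \bar{\mathbf{w}}^\top \right\rangle_{\bar{\mathbf{w}}} \boldsymbol{\Lambda}^{-1/2}. \tag{14}$$

We leave the proof to SI Section 2 but provide a few cursory observations of this result. First, note that all of the dependence on the teacher function comes in the matrix $\mathbf{D}$ whereas all of the dependence on the empirical samples is in $\mathbf{G}$. In the rest of the paper, we will develop multiple theoretical methods to calculate the generalization error given by expression (11).

Averaging over the target weights in the expression for $\mathbf{D}$ is easily done for generic weight distributions. The case of a fixed target is included by choosing a delta-function distribution over $\bar{\mathbf{w}}$.

We present two methods for computing the nontrivial average of the matrix $\mathbf{G}^2$ over the training samples $\{\mathbf{x}_i\}$. First, we consider the effect of adding a single new sample to $\mathbf{G}$ to derive a recurrence relation for $\mathbf{G}$ at different numbers of data points. This method generates a partial differential equation that must be solved to compute the generalization error. Second, we use a replica method and a saddle point approximation to calculate the matrix elements of $\mathbf{G}$. These approaches give identical predictions for the learning curves of kernel machines.

For notational simplicity, in the rest of the paper, we will use $\langle \ldots \rangle$ to mean $\langle \ldots \rangle_{\{\mathbf{x}_i\}, \bar{\mathbf{w}}}$ unless stated otherwise. In all cases, the quantity inside the brackets will depend either on the data distribution or the distribution of target weights, but not both.

2.2. Continuous Approximation to Learning Curves

First, we adopt a method following Sollich (1999; 2002) and Sollich & Halees (2002) to calculate the generalization error. We generalize the definition of $\mathbf{G}$ by introducing an auxiliary parameter $v$, and make explicit its dependence on the dataset size $p$:

$$\tilde{\mathbf{G}}(p, v) = \left( \frac{1}{\lambda} \boldsymbol{\Phi} \boldsymbol{\Phi}^\top + \boldsymbol{\Lambda}^{-1} + v \mathbf{I} \right)^{-1}. \tag{15}$$

Note that the quantity we want to calculate is given by

$$\left\langle \mathbf{G}^2(p) \right\rangle = - \frac{\partial}{\partial v} \left\langle \tilde{\mathbf{G}}(p, v) \right\rangle \Big|_{v=0}. \tag{16}$$

By considering the effect of adding a single randomly sampled input $\mathbf{x}'$, and treating $p$ as a continuous parameter, we can derive an approximate quasi-linear partial differential equation (PDE) for the average elements of $\mathbf{G}$ as a function of the number of data points $p$ (see below for a derivation):

$$\frac{\partial \left\langle \tilde{\mathbf{G}}(p, v) \right\rangle}{\partial p} = \frac{1}{\lambda + \mathrm{Tr} \left\langle \tilde{\mathbf{G}}(p, v) \right\rangle} \frac{\partial}{\partial v} \left\langle \tilde{\mathbf{G}}(p, v) \right\rangle, \tag{17}$$

with the initial condition $\tilde{\mathbf{G}}(0, v) = (\boldsymbol{\Lambda}^{-1} + v \mathbf{I})^{-1}$, which follows from $\boldsymbol{\Phi} \boldsymbol{\Phi}^\top = 0$ when there is no data. Since $\tilde{\mathbf{G}}$ is initialized as a diagonal matrix, the off-diagonal elements will not vary under the dynamics and $\langle \tilde{\mathbf{G}}(p, v) \rangle$ will remain diagonal for all $(p, v)$. We will use the solutions to this PDE and relation (16) to arrive at an approximate expression for the generalization error $E_g$ and the mode errors $E_\rho$.

Derivation of the PDE approximation (17). Let $\boldsymbol{\phi} \in \mathbb{R}^M$ represent the new feature to be added to $\mathbf{G}^{-1}$ so that $\boldsymbol{\phi}_\rho = \phi_\rho(\mathbf{x}')$ where $\mathbf{x}' \sim p(\mathbf{x}')$ is a random sample from the data distribution. Let $\langle \tilde{\mathbf{G}}(p, v) \rangle_{\boldsymbol{\Phi}}$ denote the matrix $\tilde{\mathbf{G}}$ averaged over its $p$-sample design matrix $\boldsymbol{\Phi}$. By the Woodbury matrix inversion formula

$$\begin{aligned}
\left\langle \tilde{\mathbf{G}}(p+1, v) \right\rangle_{\boldsymbol{\Phi}, \boldsymbol{\phi}} &= \left\langle \left( \tilde{\mathbf{G}}(p, v)^{-1} + \frac{1}{\lambda} \boldsymbol{\phi} \boldsymbol{\phi}^\top \right)^{-1} \right\rangle_{\boldsymbol{\Phi}, \boldsymbol{\phi}} \\
&= \left\langle \tilde{\mathbf{G}}(p, v) \right\rangle_{\boldsymbol{\Phi}} - \left\langle \frac{\tilde{\mathbf{G}}(p, v) \boldsymbol{\phi} \boldsymbol{\phi}^\top \tilde{\mathbf{G}}(p, v)}{\lambda + \boldsymbol{\phi}^\top \tilde{\mathbf{G}}(p, v) \boldsymbol{\phi}} \right\rangle_{\boldsymbol{\Phi}, \boldsymbol{\phi}}.
\end{aligned} \tag{18}$$

Performing the average of the last term on the right hand side is difficult so we resort to an approximation, where the numerator and denominator are averaged separately:

$$\left\langle \tilde{\mathbf{G}}(p+1, v) \right\rangle_{\boldsymbol{\Phi}, \boldsymbol{\phi}} \approx \left\langle \tilde{\mathbf{G}}(p, v) \right\rangle_{\boldsymbol{\Phi}} - \frac{\left\langle \tilde{\mathbf{G}}(p, v)^2 \right\rangle_{\boldsymbol{\Phi}}}{\lambda + \mathrm{Tr} \left\langle \tilde{\mathbf{G}}(p, v) \right\rangle_{\boldsymbol{\Phi}}}, \tag{19}$$

where we used the fact that $\langle \phi_\rho(\mathbf{x}') \phi_\gamma(\mathbf{x}') \rangle_{\mathbf{x}' \sim p(\mathbf{x}')} = \delta_{\rho,\gamma}$.

Treating $p$ as a continuous variable and taking a continuum limit of the finite differences given above, we arrive at (17).

Next, we present the solution to the PDE (17) and the resulting generalization error.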
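The two exact matrix identities used in the derivation above can be checked numerically before any averaging or approximation enters: the rank-one update in eq. (18), and the relation $\partial_v \tilde{\mathbf{G}} = -\tilde{\mathbf{G}}^2$ that underlies eq. (16). The following is a minimal sketch under assumed toy values; the spectrum $\lambda_\rho = \rho^{-2}$, the sizes, the Gaussian features, and the helper name `g_tilde` are hypothetical and chosen only for illustration.

```python
import numpy as np

def g_tilde(Phi, Lambda_inv, lam, v):
    """Eq. (15): (Phi Phi^T / lam + Lambda^{-1} + v I)^{-1} for a given design matrix."""
    M = Phi.shape[0]
    return np.linalg.inv(Phi @ Phi.T / lam + Lambda_inv + v * np.eye(M))

# Hypothetical toy setup (not from the paper).
rng = np.random.default_rng(1)
M, p, lam, v = 5, 8, 0.5, 0.2
Lambda_inv = np.diag(np.arange(1, M + 1) ** 2.0)   # assumed spectrum lambda_rho = rho^-2
Phi = rng.normal(size=(M, p))                      # p existing samples
phi = rng.normal(size=M)                           # one new sample's features

# Rank-one update of eq. (18), for a single realization (no averaging).
G_p = g_tilde(Phi, Lambda_inv, lam, v)
G_direct = g_tilde(np.column_stack([Phi, phi]), Lambda_inv, lam, v)
G_update = G_p - (G_p @ np.outer(phi, phi) @ G_p) / (lam + phi @ G_p @ phi)
update_gap = float(np.max(np.abs(G_direct - G_update)))

# Central finite difference in v, compared against -G^2 (the identity behind eq. (16)).
eps = 1e-5
dG_dv = (g_tilde(Phi, Lambda_inv, lam, v + eps)
         - g_tilde(Phi, Lambda_inv, lam, v - eps)) / (2 * eps)
derivative_gap = float(np.max(np.abs(dG_dv + G_p @ G_p)))
```

Both gaps should be small (round-off for the update, finite-difference error for the derivative). The check says nothing about the quality of the separate averaging of numerator and denominator in eq. (19), which is the actual approximation.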
Proposition 2. Let $g_\rho(p, v) = \langle \tilde{\mathbf{G}}(p, v)_{\rho\rho} \rangle$ represent the diagonal elements of the average matrix $\langle \tilde{\mathbf{G}}(p, v) \rangle$. These matrix elements satisfy the implicit relationship

$$g_\rho(p, v) = \left( \frac{1}{\lambda_\rho} + v + \frac{p}{\lambda + \sum_{\gamma=1}^{M} g_\gamma(p, v)} \right)^{-1}. \tag{20}$$

This implicit solution is obtained from the method of characteristics which we provide in Section 3 of the SI.

Proposition 3. Under the PDE approximation (17), the average error $E_\rho$ associated with mode $\rho$ is

$$E_\rho(p) = \frac{\langle \bar{w}_\rho^2 \rangle}{\lambda_\rho} \left( \frac{1}{\lambda_\rho} + \frac{p}{\lambda + t(p)} \right)^{-2} \left( 1 - \frac{p\, \gamma(p)}{(\lambda + t(p))^2} \right)^{-1}, \tag{21}$$

where $t(p) \equiv \sum_\rho g_\rho(p, 0)$ is the solution to the implicit equation

$$t(p) = \sum_\rho \left( \frac{1}{\lambda_\rho} + \frac{p}{\lambda + t(p)} \right)^{-1}, \tag{22}$$

and $\gamma(p)$ is defined as

$$\gamma(p) = \sum_\rho \left( \frac{1}{\lambda_\rho} + \frac{p}{\lambda + t(p)} \right)^{-2}. \tag{23}$$

The full proof of this proposition is provided in Section 3 of the SI. We show the steps required to compute theoretical learning curves numerically in Algorithm 1.

Algorithm 1 Computing Theoretical Learning Curves
  Input: RKHS spectrum $\{\lambda_\rho\}$, target function weights $\{\bar{w}_\rho\}$, regularizer $\lambda$, sample sizes $\{p_i\}$, $i = 1, \ldots, m$;
  for $i = 1$ to $m$ do
    Solve numerically $t_i = \sum_\rho \left( \frac{1}{\lambda_\rho} + \frac{p_i}{\lambda + t_i} \right)^{-1}$
    Compute $\gamma_i = \sum_\rho \left( \frac{1}{\lambda_\rho} + \frac{p_i}{\lambda + t_i} \right)^{-2}$
    $E_{\rho,i} = \frac{\langle \bar{w}_\rho^2 \rangle}{\lambda_\rho} \left( \frac{1}{\lambda_\rho} + \frac{p_i}{\lambda + t_i} \right)^{-2} \left( 1 - \frac{p_i \gamma_i}{(\lambda + t_i)^2} \right)^{-1}$
  end for

In eq. (21), the target function sets the overall scale of $E_\rho$. That $E_\rho$ depends only on $\bar{w}_\rho$, but not other target modes, is an artifact of our approximation scheme, and in a full treatment may not necessarily hold. The spectrum of the kernel affects all modes in a nontrivial way. When we apply this theory to neural networks in Section 3, the information about the architecture of the network will be in the spectrum $\{\lambda_\rho\}$. The dependence on number of samples $p$ is also nontrivial, but we will consider various informative limits below.

We note that though the mode errors fall asymptotically like $p^{-2}$ (SI Section 4), the total generalization error $E_g$ can scale with $p$ in a nontrivial manner. For instance, if $\bar{w}_\rho^2 \lambda_\rho \sim \rho^{-a}$ and $\lambda_\rho \sim \rho^{-b}$ then a simple computation (SI Section 4) shows that $E_g \sim p^{-\min\{a-1, 2b\}}$ as $p \to \infty$ for ridgeless regression and $E_g \sim p^{-\min\{a-1, 2b\}/b}$ for explicitly regularized regression. This is consistent with recent observations that total generalization error for neural networks and kernel regression falls in a power law $E_g \sim p^{-\beta}$ with $\beta$ dependent on kernel and target function (Hestness et al., 2017; Spigler et al., 2019).

2.3. Computing Learning Curves with Replica Method

The result of the continuous approximation can be obtained using another approximation method, which we outline here and detail in SI Section 5. We perform the average of matrix $\mathbf{G}(p, v)$ over the training data, using the replica method (Sherrington & Kirkpatrick, 1975; Mézard et al., 1987) from statistical physics and a finite size saddle-point approximation, and obtain identical learning curves to Proposition 3. Our starting point is a Gaussian integral representation of the matrix inverse

$$\begin{aligned}
\left\langle \mathbf{G}(p, v)_{\rho,\gamma} \right\rangle &= \frac{\partial^2}{\partial h_\rho \partial h_\gamma} \left\langle R(p, v, \mathbf{h}) \right\rangle \Big|_{\mathbf{h}=0}, \\
R(p, v, \mathbf{h}) &\equiv \frac{1}{Z} \int d\mathbf{u}\, e^{-\frac{1}{2} \mathbf{u}^\top \left( \frac{1}{\lambda} \boldsymbol{\Phi} \boldsymbol{\Phi}^\top + \boldsymbol{\Lambda}^{-1} + v \mathbf{I} \right) \mathbf{u} + \mathbf{h} \cdot \mathbf{u}},
\end{aligned} \tag{24}$$

where $Z = \int d\mathbf{u}\, e^{-\frac{1}{2} \mathbf{u}^\top \left( \frac{1}{\lambda} \boldsymbol{\Phi} \boldsymbol{\Phi}^\top + \boldsymbol{\Lambda}^{-1} + v \mathbf{I} \right) \mathbf{u}}$. Since $Z$ also depends on the dataset (quenched disorder) $\boldsymbol{\Phi}$, to make the average over $\boldsymbol{\Phi}$ tractable, we use the following limiting procedure: $Z^{-1} = \lim_{n \to 0} Z^{n-1}$. As is common in the physics of disordered systems (Mézard et al., 1987), we compute $R(p, v, \mathbf{h})$ for integer $n$ and analytically continue the expressions in the $n \to 0$ limit under a symmetry ansatz. This procedure produces the same average matrix elements as the continuous approximation discussed in Proposition 2, and therefore the same generalization error given in Proposition 3. Further detail is provided in SI Section 5.

2.4. Spectral Dependency of Learning Curves

We can get insight about the behavior of learning curves by considering ratios between errors in different modes:

$$\frac{E_\rho}{E_\gamma} = \frac{\langle \bar{w}_\rho^2 \rangle}{\langle \bar{w}_\gamma^2 \rangle} \frac{\lambda_\gamma}{\lambda_\rho} \frac{\left( \frac{1}{\lambda_\gamma} + \frac{p}{\lambda + t} \right)^2}{\left( \frac{1}{\lambda_\rho} + \frac{p}{\lambda + t} \right)^2}. \tag{25}$$

For small $p$ this ratio approaches $\frac{E_\rho}{E_\gamma} \sim \frac{\lambda_\rho \langle \bar{w}_\rho^2 \rangle}{\lambda_\gamma \langle \bar{w}_\gamma^2 \rangle}$. For large $p$, $\frac{E_\rho}{E_\gamma} \sim \frac{\langle \bar{w}_\rho^2 \rangle / \lambda_\rho}{\langle \bar{w}_\gamma^2 \rangle / \lambda_\gamma}$, indicating that asymptotically ($p \to \infty$), the amount of relative error in mode $\rho$ grows with the ratio $\langle \bar{w}_\rho^2 \rangle / \lambda_\rho$, showing that the asymptotic mode error is relatively large if the teacher function places large amounts of power in modes that have small RKHS eigenvalues $\lambda_\rho$.

We can also examine how the RKHS spectrum affects the evolution of the error ratios with $p$. Without loss of generality, we take $\lambda_\gamma > \lambda_\rho$ and show in SI Section 6 that

$$\frac{d}{dp} \log E_\rho > \frac{d}{dp} \log E_\gamma. \tag{26}$$

In this sense, the marginal training data point causes a greater percent reduction in generalization error for modes with larger RKHS eigenvalues.

[Figure plot not recovered. Legend: k = 1, ..., 8; vertical axis labeled "kN(d, k)" in the extracted text; horizontal axis d. The caption was not recovered.]
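The steps of Algorithm 1 can be sketched in Python as follows. This is an illustrative sketch rather than the authors' released code: the algorithm only says to solve eq. (22) "numerically", and bisection is our choice here; the sketch assumes an explicit ridge $\lambda > 0$ so that the root is bracketed on $[0, \sum_\rho \lambda_\rho]$. The power-law spectrum, the unit target weights, and the names `solve_t` and `mode_errors` are hypothetical.

```python
import numpy as np

def solve_t(spectrum, p, lam, n_iter=200):
    """Solve eq. (22), t = sum_rho (1/lambda_rho + p/(lam + t))^(-1), by bisection.

    Assumes lam > 0: the residual is then positive at t = 0 and non-positive
    at t = sum(spectrum), so the root lies in that interval.
    """
    lo, hi = 0.0, float(np.sum(spectrum))
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        residual = np.sum(1.0 / (1.0 / spectrum + p / (lam + mid))) - mid
        if residual > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def mode_errors(spectrum, w_bar_sq, lam, sample_sizes):
    """Return an array E[i, rho] of mode errors from eq. (21) for each sample size p_i."""
    spectrum = np.asarray(spectrum, dtype=float)
    w_bar_sq = np.asarray(w_bar_sq, dtype=float)
    errors = np.empty((len(sample_sizes), spectrum.size))
    for i, p in enumerate(sample_sizes):
        t = solve_t(spectrum, p, lam)
        denom = 1.0 / spectrum + p / (lam + t)            # 1/lambda_rho + p/(lam + t)
        gamma = np.sum(denom ** -2)                       # eq. (23)
        prefactor = 1.0 / (1.0 - p * gamma / (lam + t) ** 2)
        errors[i] = (w_bar_sq / spectrum) * denom ** -2 * prefactor
    return errors

# Hypothetical inputs: power-law spectrum and equal target power in every mode.
rho = np.arange(1, 51)
spectrum = 1.0 / rho ** 2
w_bar_sq = np.ones_like(spectrum)
lam = 0.1
sample_sizes = [0, 10, 100, 1000]

errors = mode_errors(spectrum, w_bar_sq, lam, sample_sizes)
normalized = errors / errors[0]        # E_rho(p) / E_rho(0), as plotted in the figures
total_error = errors.sum(axis=1)       # E_g(p) = sum_rho E_rho(p)
```

At $p = 0$ the sketch reduces to $E_\rho(0) = \lambda_\rho \langle \bar{w}_\rho^2 \rangle$, and for $p > 0$ the normalized error of a mode with a larger eigenvalue is smaller, in line with eq. (26). The ridgeless case $\lambda = 0$ needs separate care, because $t = 0$ then also solves eq. (22) and the bracketing argument used here no longer applies.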
[Figure 2 plots not recovered. Panels: (a) 3-layer NTK, d = 15, λ = 0: E_k(p)/E_k(0) versus p for k = 1, ..., 7. (b) 3-layer NTK, k = 1, λ = 1: E_1(p)/E_1(0) versus p for d = 15, 30, 45, 75. (c) 10-layer NTK, d = 15, k = 1: E_1(p)/E_1(0) versus p for λ = 0, 1, 3, 5, 10.]

Figure 2. Learning curves for kernel regression with NTK averaged over 50 trials compared to theory. Error bars are standard deviation. Solid lines are theoretical curves calculated using eq. (21). Dashed vertical lines indicate the degeneracy N(d, k). (a) Normalized learning curves for different spectral modes. Sequential fitting of mode errors is visible. (b) Normalized learning curves for varying data dimension, d. (c) Learning curves for varying regularization parameter, λ.

[Figure 3 plots not recovered. Panels: (a) 2-layer NN, N = 10000: E_k(p)/E_k(0) versus p for NN and kernel, k = 1, 2, 4. (b) 4-layer NN, N = 500: E_k(p)/E_k(0) versus p for NN and kernel, k = 1, 2, 4. (c) 2-layer NN student-teacher, N = 8000: E_g versus p, theory and experimental test error.]

Figure 3. (a) and (b) Learning curves for neural networks (NNs) on "pure modes" as defined in eq. (35). (c) Learning curve for the student-teacher setup defined in (36). The theory curves shown as solid lines are again computed with eq. (21). The test errors for the finite width neural networks and NTK are shown with dots and triangles respectively. The generalization error was estimated by taking a random test sample of 1000 data points. The average was taken over 25 trials and the standard deviations are shown with error bars. The networks were initialized with the default Gaussian NTK parameterization (Jacot et al., 2018) and trained with stochastic gradient descent (details in SI Section 13).
degree. For "pure mode" $k$, the teacher is constructed with the following rule:

$$f^*(\mathbf{x}) = \sum_{i=1}^{p'} \alpha_i Q_k(\mathbf{x}^\top \mathbf{x}_i), \tag{35}$$

where again $\alpha_i \sim \mathcal{B}(1/2)$ and $\mathbf{x}_i \sim p(\mathbf{x})$ are sampled randomly. Figure 3(a) shows the learning curve for a fully connected 2-layer ReLU network with width $N = 10000$, input dimension $d = 30$ and $p' = 10000$. As before, we see that the lower $k$ pure modes require less data to be fit. Experimental test errors for kernel regression with NTK on the same synthetic datasets are plotted as triangles. Our theory perfectly fits the experiments.

Results from a 4-layer NN simulation are provided in Figure 3(b). Each hidden layer had $N = 500$ hidden units. We again see that the $k = 2$ mode is only learned for $p > 200$. The $k = 4$ mode is not learned at all in this range. Our theory again perfectly fits the experiments.

Lastly, we show that our theory also works for composite functions that contain spherical harmonics of many different degrees. In this setup, we randomly initialize a two layer teacher neural network and train a student neural network

$$f^*(\mathbf{x}) = \bar{\mathbf{r}}^\top \sigma(\bar{\boldsymbol{\Theta}} \mathbf{x}), \qquad f(\mathbf{x}) = \mathbf{r}^\top \sigma(\boldsymbol{\Theta} \mathbf{x}), \tag{36}$$

where $\boldsymbol{\Theta}, \bar{\boldsymbol{\Theta}} \in \mathbb{R}^{M \times d}$ are the feedforward weights for the student and teacher respectively, $\sigma$ is an activation function and $\mathbf{r}, \bar{\mathbf{r}} \in \mathbb{R}^M$ are the student and teacher readout weights. Chosen in this way with ReLU activations, the teacher is composed of spherical harmonics of many different degrees (Section 13 in SI). The total generalization error for this teacher-student setup as well as the theoretical prediction of our theory is provided in Figure 3(c) for $d = 25$, $N = 8000$. They agree excellently. Results from additional neural network simulations are provided in Section 13 of the SI.

[Figure 4 plots not recovered. Panels: (a) Gaussian kernel and measure in d = 20: E_g versus p for several values of λ. (b) 3-layer NN on MNIST, N = 800: E_g versus p. (c) NTK regression on MNIST, λ = 0: E_g versus p.]

Figure 4. (a) Learning curves for Gaussian kernel in d = 20 dimensions with varying λ. For small λ, learning stages are visible at p = N(d, k) for k = 1, 2 (p = 20, 210, vertical dashed lines) but the stages are obscured for non-negligible λ. (b) Learning curve for 3-layer NTK regression and a neural network (NN) on a subset of 8000 randomly sampled images of handwritten digits from MNIST. (c) Aggregated NTK regression mode errors for the setup in (b). Eigenmodes of MNIST with larger eigenvalues are learned more rapidly with increasing p.

4.3. Gaussian Kernel Regression and Interpolation

We next test our theory on another widely-used kernel. The setting where the probability measure and kernel are Gaussian, $K(\mathbf{x}, \mathbf{x}') = e^{-\frac{1}{2\ell^2} \|\mathbf{x} - \mathbf{x}'\|^2}$, allows analytical computation of the eigenspectrum, $\{\lambda_k\}$ (Rasmussen & Williams, 2005). In $d$ dimensions, the $k$-th distinct eigenvalue corresponds to a set of $N(d, k) = \binom{d+k-1}{k} \sim d^k / k!$ degenerate eigenmodes. The spectrum itself decays exponentially.

In Figure 4(a), experimental learning curves for $d = 20$ dimensional standard normal random vector data and a Gaussian kernel with $\ell = 50$ are compared to our theoretical predictions for varying ridge parameters $\lambda$. The target function $f^*(\mathbf{x})$ is constructed with the same rule we used for the NTK experiments, shown in eq. (34). When $\lambda$ is small, sharp drops in the generalization error occur when $p \approx N(d, k)$ for $k = 1, 2$. These drops are suppressed by the explicit regularization $\lambda$.

4.4. MNIST: Discrete Data Measure and Kernel PCA

We can also test our theory for finite datasets by defining a probability measure with equal point mass on each of the data points $\{\mathbf{x}_i\}_{i=1}^{\tilde{p}}$ in the dataset (including training and test sets):

$$p(\mathbf{x}) = \frac{1}{\tilde{p}} \sum_{i=1}^{\tilde{p}} \delta(\mathbf{x} - \mathbf{x}_i). \tag{37}$$

With this measure, the eigenvalue problem (6) becomes a $\tilde{p} \times \tilde{p}$ kernel PCA problem (see SI 14).

Once the eigenvalues $\boldsymbol{\Lambda}$ and eigenvectors $\boldsymbol{\Phi}^\top$ have been identified, we compute the target function coefficients by projecting the target data $\mathbf{y}_c$ onto these principal components, $\bar{\mathbf{w}}_c = \boldsymbol{\Lambda}^{-1/2} \boldsymbol{\Phi} \mathbf{y}_c$ for each target $c = 1, \ldots, C$. Once all of these ingredients are obtained, theoretical learning curves can be computed using Algorithm 1 and the multiple class formalism described in Section 2.5, providing estimates of the error on the entire dataset incurred when training with a subsample of $p < \tilde{p}$ data points. An example where the discrete measure is taken as $\tilde{p} = 8000$ images of handwritten digits from MNIST (Lecun et al., 1998) and the kernel is NTK with 3 layers is provided in Figures 4(b) and 4(c). For total generalization error, we find perfect agreement between kernel regression and neural network experiments, and our theory.

5. Conclusion

In this paper, we presented an approximate theory of the average generalization performance for kernel regression. We studied our theory in the ridgeless limit to explain the behavior of trained neural networks in the infinite width limit (Jacot et al., 2018; Arora et al., 2019; Lee et al., 2019). We demonstrated how the RKHS eigenspectrum of NTK encodes a preferential bias to learn high spectral modes only after the sample size $p$ is sufficiently large. Our theory fits kernel regression experiments remarkably well. We further experimentally verified that the theoretical learning curves obtained in the infinite width limit provide a good approximation of the learning curves for wide but finite-width neural networks. Our MNIST result suggests that our theory can be applied to datasets with practical value.
Cucker, F. and Smale, S. Best choices for regularization parameters in learning theory: On the bias-variance problem. Foundations of Computational Mathematics, 2:413–428, 2002.

Mézard, M., Parisi, G., and Virasoro, M. Spin glass theory and beyond: An Introduction to the Replica Method and Its Applications, volume 9. World Scientific Publishing Company, 1987.

Novak, R., Xiao, L., Hron, J., Lee, J., Alemi, A. A., Sohl-Dickstein, J., and Schoenholz, S. S. Neural tangents: Fast and easy infinite neural networks in python. In International Conference on Learning Representations, 2020.

Opper, M. and Vivarelli, F. General bounds on bayes errors for regression with gaussian processes. In Advances in Neural Information Processing Systems 11, pp. 302–308, 1998.

Pehlevan, C., Ali, F., and Ölveczky, B. P. Flexibility in motor timing constrains the topology and dynamics of pattern generator circuits. Nature communications, 9(1):1–15, 2018.

Rahaman, N., Baratin, A., Arpit, D., Draxler, F., Lin, M., Hamprecht, F., Bengio, Y., and Courville, A. On the spectral bias of neural networks. In Proceedings of the 36th International Conference on Machine Learning, pp. 5301–5310, 2019.

Rasmussen, C. E. and Williams, C. K. I. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press, 2005.

Schölkopf, B. and Smola, A. J. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2001.

Vapnik, V. N. An overview of statistical learning theory. IEEE transactions on neural networks, 10(5):988–999, 1999.

Wahba, G. Spline Models for Observational Data. Society for Industrial and Applied Mathematics, Philadelphia, 1990.

Xu, Z.-Q. J., Zhang, Y., Luo, T., Xiao, Y., and Ma, Z. Frequency principle: Fourier analysis sheds light on deep neural networks, 2019a.

Xu, Z.-Q. J., Zhang, Y., and Xiao, Y. Training behavior of deep neural network in frequency domain. In Gedeon, T., Wong, K. W., and Lee, M. (eds.), Neural Information Processing, pp. 264–274, 2019b.

Yang, G. and Salman, H. A fine-grained spectral perspective on neural networks, 2019.

Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations, 2017.

Zhang, Y., Xu, Z.-Q. J., Luo, T., and Ma, Z. Explicitizing an implicit bias of the frequency principle in two-layer neural networks, 2019.
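1. Background on Kernel Machines

1.1. Reproducing Kernel Hilbert Space

Let $\mathcal{X} \subseteq \mathbb{R}^d$ and let $p(x)$ be a probability distribution over $\mathcal{X}$. Let $\mathcal{H}$ be a Hilbert space with inner product $\langle \cdot, \cdot \rangle_{\mathcal{H}}$. A kernel $K(\cdot, \cdot)$ is said to be reproducing for $\mathcal{H}$ if function evaluation at any $x \in \mathcal{X}$ is equivalent to the Hilbert inner product with $K(\cdot, x)$; that is, if for all $g \in \mathcal{H}$ and all $x \in \mathcal{X}$

$$\langle K(\cdot, x), g \rangle_{\mathcal{H}} = g(x). \tag{SI.1}$$

If such a kernel exists for a Hilbert space, then it is unique and is called the reproducing kernel of the RKHS (Evgeniou et al., 1999; Schölkopf & Smola, 2001).

1.2. Mercer’s Theorem

Let $\mathcal{H}$ be a RKHS with kernel $K$. Mercer’s theorem (Mercer, 1909; Rasmussen & Williams, 2005) allows the eigendecomposition of $K$

$$K(x, x') = \sum_\rho \lambda_\rho \phi_\rho(x) \phi_\rho(x'), \tag{SI.2}$$

where the eigenvalue statement is

$$\int dx' \, p(x') K(x, x') \phi_\rho(x') = \lambda_\rho \phi_\rho(x). \tag{SI.3}$$

1.3. Representer Theorem

Let $\mathcal{H}$ be a RKHS with inner product $\langle \cdot, \cdot \rangle_{\mathcal{H}}$. Consider the regularized learning problem

$$\min_{f \in \mathcal{H}} \hat{L}[f] + \lambda \|f\|_{\mathcal{H}}^2. \tag{SI.4}$$

Specializing to the case of least squares regression, let

$$\hat{L}[f] = \sum_{i=1}^{p} (f(x_i) - y_i)^2. \tag{SI.6}$$

Using the representer theorem, we may write $f(x) = \sum_{j=1}^{p} \alpha_j K(x, x_j)$ and reformulate the entire objective in terms of the $p$ coefficients $\alpha_i$:

$$\begin{aligned}
L[f] &= \sum_{i=1}^{p} (f(x_i) - y_i)^2 + \lambda \|f\|_{\mathcal{H}}^2 \\
&= \sum_{i=1}^{p} \Big( \sum_{j=1}^{p} \alpha_j K(x_i, x_j) - y_i \Big)^2 + \lambda \sum_{ij} \alpha_i \alpha_j \big\langle K(x_i, \cdot), K(x_j, \cdot) \big\rangle_{\mathcal{H}} \\
&= \alpha^\top K^2 \alpha - 2 y^\top K \alpha + y^\top y + \lambda \alpha^\top K \alpha,
\end{aligned} \tag{SI.7}$$

where $K_{ij} = K(x_i, x_j)$ is the kernel Gram matrix. Optimizing this loss with respect to $\alpha$ gives

$$\alpha = (K + \lambda I)^{-1} y. \tag{SI.8}$$

Therefore the optimal function evaluated at a test point is

$$f(x) = \alpha^\top k(x) = y^\top (K + \lambda I)^{-1} k(x), \tag{SI.9}$$

where $k(x)$ is the vector with entries $k(x)_i = K(x, x_i)$.

2. Derivation of the Generalization Error

Let the RKHS $\mathcal{H}$ have eigenvalues $\lambda_\rho$ for $\rho \in \mathbb{Z}^+$. Define $\psi_\rho(x) = \sqrt{\lambda_\rho}\, \phi_\rho(x)$, where $\phi_\rho$ are the eigenfunctions of the reproducing kernel for $\mathcal{H}$. Let the target function have the following expansion in terms of the kernel eigenfunctions: $f^*(x) = \sum_\rho \bar{w}_\rho \psi_\rho(x)$. Define the design matrices $\Phi_{\rho,i} = \phi_\rho(x_i)$ and $\Lambda_{\rho\gamma} = \lambda_\rho \delta_{\rho\gamma}$. Then the average generalization error for kernel regression is

$$E_g = \mathrm{Tr}\left( D \left\langle G^2 \right\rangle_{\{x_i\}} \right), \tag{SI.10}$$

where

$$G = \left( \frac{1}{\lambda} \Phi \Phi^\top + \Lambda^{-1} \right)^{-1}, \qquad \Phi = \Lambda^{-1/2} \Psi, \tag{SI.11}$$

and

$$D = \Lambda^{-1/2} \left\langle \bar{w} \bar{w}^\top \right\rangle_{\bar{w}} \Lambda^{-1/2}. \tag{SI.12}$$

Next, it suffices to calculate the weights $w$ learned through kernel regression. Define a matrix with elements $\Psi_{\rho,i} = \psi_\rho(x_i)$. The training error for kernel regression is

$$E_{tr} = \|\Psi^\top w - y\|^2 + \lambda \|w\|_2^2. \tag{SI.14}$$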
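The closed-form solution (SI.8) and the predictor (SI.9) can be illustrated with a short sketch. This is a minimal pure-Python example, not the implementation used for the experiments: the Gaussian kernel, the toy data, and the helper names `rbf_kernel`, `solve_linear`, `fit_alpha` and `predict` are all assumptions introduced here for illustration.

```python
import math


def rbf_kernel(x, z, width=1.0):
    """Gaussian kernel, used only as a stand-in for a generic K(x, x')."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * width ** 2))


def solve_linear(A, b):
    """Solve A c = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    sol = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * sol[j] for j in range(i + 1, n))
        sol[i] = (M[i][n] - tail) / M[i][i]
    return sol


def fit_alpha(X, y, lam, kernel=rbf_kernel):
    """alpha = (K + lam I)^{-1} y, following (SI.8)."""
    p = len(X)
    K_reg = [[kernel(X[i], X[j]) + (lam if i == j else 0.0) for j in range(p)]
             for i in range(p)]
    return solve_linear(K_reg, y)


def predict(x, X, alpha, kernel=rbf_kernel):
    """f(x) = alpha^T k(x) with k(x)_i = K(x, x_i), following (SI.9)."""
    return sum(a * kernel(x, xi) for a, xi in zip(alpha, X))


# Toy data (hypothetical): four scalar inputs with a smooth target.
X_train = [[0.0], [0.5], [1.0], [1.5]]
y_train = [math.sin(x[0]) for x in X_train]
alpha = fit_alpha(X_train, y_train, lam=1e-8)
train_fit = [predict(x, X_train, alpha) for x in X_train]
```

With a very small ridge the predictor nearly interpolates the training labels, while a very large ridge shrinks it toward zero, which is the behaviour expected from (SI.8).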
which we corrected after their paper.

$$\delta(Q_{ab} - u^a \cdot u^b) = \int d\hat{Q}_{ab} \, e^{i Q_{ab} \hat{Q}_{ab} - i \hat{Q}_{ab} u^a \cdot u^b}. \tag{SI.49}$$
[Figure SI.1: two log-log panels plotted against p. Panel (a): z = t + λ, numerical solution compared with the scaling $p^{1-b}$. Panel (b): $E_g(p)$, numerical solution compared with the scalings $p^{1-a}$ and $p^{(1-a)/b}$.]

Figure SI.1. Approximate scaling of learning curve for spectra that decay as power laws $\lambda_k \sim k^{-b}$ and $a_k^2 \equiv \bar{w}_k^2 \lambda_k = k^{-a}$. Figure (a) shows a comparison of the numerical solution to the implicit equation for $t + \lambda$ as a function of $p$ and its comparison to approximate scalings. There are two regimes which are separated by $p \approx \lambda^{-1/(b-1)}$. For small $p$, $z \sim p^{1-b}$ but for large $p$, $z \sim \lambda$. The total generalization error is shown in (b) which scales like $p^{1-a}$ for small $p$ and $p^{(1-a)/b}$ for large $p$.
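The numerical solution referred to in the caption can be sketched as follows. This is a minimal illustration, assuming the implicit equation for t in the form given in (SI.73) with unit degeneracy and the mode error in the form given in (SI.72); the exponents, the ridge, the truncation of the spectrum, and the function names `solve_t` and `generalization_error` are hypothetical choices, not the settings used for the figure.

```python
def solve_t(p, lam, eigs, iters=100):
    """Solve t = sum_k eig_k (lam + t) / (lam + t + p eig_k) by bisection."""
    def residual(t):
        return t - sum(e * (lam + t) / (lam + t + p * e) for e in eigs)

    lo, hi = 0.0, sum(eigs)  # residual(lo) <= 0 <= residual(hi) for lam > 0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if residual(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


def generalization_error(p, lam, eigs, a2):
    """Total E_g = sum_k E_k, where a2[k] = <w_k^2> * eig_k."""
    t = solve_t(p, lam, eigs)
    z = lam + t
    gamma = sum((e * z / (z + p * e)) ** 2 for e in eigs)
    prefactor = z ** 2 / (1.0 - p * gamma / z ** 2)
    return sum(prefactor * a / (z + p * e) ** 2 for a, e in zip(a2, eigs))


# Hypothetical settings: exponents, ridge and truncation are illustrative.
b, a, lam, n_modes = 2.0, 2.0, 1e-3, 2000
eigs = [k ** (-b) for k in range(1, n_modes + 1)]
a2 = [k ** (-a) for k in range(1, n_modes + 1)]

curve = [(p, lam + solve_t(p, lam, eigs), generalization_error(p, lam, eigs, a2))
         for p in (0, 1, 10, 100)]
for p, z, e_g in curve:
    print(f"p={p:4d}  z=t+lam={z:.4e}  E_g={e_g:.4e}")
```

Bisection is used here only because the right-hand side is bounded by the trace of the spectrum, which gives a simple bracket for the root; any scalar root finder would serve equally well.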
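After inserting delta functions to enforce order parameter definitions, we are left with integrals over the thermal degrees of freedom

$$\int \prod_{a=1}^{n} du^a \, e^{-\frac{1}{2} \sum_a u^a \Lambda^{-1} u^a - i \sum_{ab} \hat{Q}_{ab} u^a \cdot u^b + u^{(1)} \cdot h} = e^{-\frac{1}{2} \sum_\rho \log\det\left( \frac{1}{\lambda_\rho} I + 2i\hat{Q} \right) + \frac{1}{2} \sum_\rho h_\rho^2 \left( \frac{1}{\lambda_\rho} I + 2i\hat{Q} \right)^{-1}_{11}}. \tag{SI.50}$$

We now make a replica symmetric ansatz $Q_{ab} = q\delta_{ab} + q_0$ and $2i\hat{Q}_{ab} = \hat{q}\delta_{ab} + \hat{q}_0$. Under this ansatz $R(h)$ can be rewritten as

$$R(h) = \int dq \, d\hat{q} \, dq_0 \, d\hat{q}_0 \; e^{-pnF(q, q_0, \hat{q}, \hat{q}_0)} \, e^{\frac{1}{2} \sum_\rho h_\rho^2 \left( \frac{1}{\lambda_\rho} I + 2i\hat{Q} \right)^{-1}_{11}}, \tag{SI.51}$$

where the free energy is

$$2pF(q, q_0, \hat{q}, \hat{q}_0) = p \log\left( 1 + \frac{q}{\lambda} \right) + p \frac{q_0}{\lambda + q} + v(q + q_0) - (q + q_0)(\hat{q} + \hat{q}_0) + q_0 \hat{q}_0 + \sum_\rho \left[ \log\left( \frac{1}{\lambda_\rho} + \hat{q} \right) + \frac{\hat{q}_0}{\frac{1}{\lambda_\rho} + \hat{q}} \right]. \tag{SI.52}$$

In the limit $p \to \infty$, $R(h)$ is dominated by the saddle point of the free energy where $\nabla F(q, \hat{q}, q_0, \hat{q}_0) = 0$. The saddle point equations are

$$\hat{q}^* = \frac{p}{q^* + \lambda} + v, \qquad q^* = \sum_\rho \frac{1}{\frac{1}{\lambda_\rho} + \hat{q}^*} = \sum_\rho \frac{1}{\frac{1}{\lambda_\rho} + v + \frac{p}{q^* + \lambda}}, \qquad q_0^* = \hat{q}_0^* = 0. \tag{SI.53}$$

We see that $q^*$ is exactly equivalent to $t(p, v)$ defined in SI.29 for the continuous approximation. Under the saddle point approximation we find

$$R(h) \approx e^{-npF(q^*, q_0^*, \hat{q}^*, \hat{q}_0^*)} \, e^{\frac{1}{2} \sum_\rho h_\rho^2 \frac{1}{\frac{1}{\lambda_\rho} + \hat{q}^*}}. \tag{SI.54}$$

Taking the $n \to 0$ limit as promised, we obtain the normalized average

$$\tilde{R}(h) \equiv \lim_{n \to 0} R(h) = e^{\frac{1}{2} \sum_\rho h_\rho^2 \frac{1}{\frac{1}{\lambda_\rho} + \hat{q}^*}}, \tag{SI.55}$$

so that the matrix elements are

$$\left\langle \tilde{G}(p, v)_{\rho,\gamma} \right\rangle = \frac{\partial^2}{\partial h_\rho \partial h_\gamma} \tilde{R}(h) \Big|_{h=0} = \frac{\delta_{\rho,\gamma}}{\frac{1}{\lambda_\rho} + v + \frac{p}{\lambda + q^*}}, \qquad q^* = \sum_\rho \frac{1}{\frac{1}{\lambda_\rho} + v + \frac{p}{\lambda + q^*}}. \tag{SI.56}$$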
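As a consistency check, the saddle-point equations (SI.53) can be recovered by differentiating the free energy (SI.52). The following worked step is a sketch under the assumption that (SI.52) has the form reconstructed from the surrounding definitions; the derivatives are evaluated at $q_0 = \hat{q}_0 = 0$.

```latex
\begin{align}
\frac{\partial (2pF)}{\partial q}\bigg|_{q_0 = \hat{q}_0 = 0}
  &= \frac{p}{\lambda + q} + v - \hat{q} = 0
  &&\Rightarrow\quad \hat{q}^* = \frac{p}{q^* + \lambda} + v, \\
\frac{\partial (2pF)}{\partial \hat{q}}\bigg|_{q_0 = \hat{q}_0 = 0}
  &= -q + \sum_\rho \frac{1}{\frac{1}{\lambda_\rho} + \hat{q}} = 0
  &&\Rightarrow\quad q^* = \sum_\rho \frac{1}{\frac{1}{\lambda_\rho} + \hat{q}^*}, \\
\frac{\partial (2pF)}{\partial q_0}
  &= \frac{p}{\lambda + q} + v - \hat{q}, \qquad
\frac{\partial (2pF)}{\partial \hat{q}_0}
  = -q + \sum_\rho \frac{1}{\frac{1}{\lambda_\rho} + \hat{q}}.
\end{align}
```

The last line shows that the derivatives with respect to $q_0$ and $\hat{q}_0$ reproduce the same two conditions, so the choice $q_0^* = \hat{q}_0^* = 0$ is consistent with all four stationarity equations.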
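Using our formula for the mode errors, we find

$$\begin{aligned}
E_\rho &= \sum_\gamma D_{\rho,\gamma} \left\langle \tilde{G}(p, v)^2_{\gamma,\rho} \right\rangle
= -D_{\rho,\rho} \frac{\partial}{\partial v} \left\langle \tilde{G}(p, v)_{\rho,\rho} \right\rangle \Big|_{v=0} \\
&= \frac{\langle \bar{w}_\rho^2 \rangle}{\lambda_\rho} \, \frac{(\lambda + q^*)^2}{(\lambda + q^*)^2 - \gamma p} \left( \frac{1}{\lambda_\rho} + \frac{p}{\lambda + q^*} \right)^{-2},
\end{aligned} \tag{SI.57}$$

consistent with our result from the continuous approximation.

6. Spectral Dependence of Learning Curves

We want to calculate how different mode errors change as we add one more sample. We study:

$$\frac{1}{2} \frac{d}{dp} \log \frac{E_\rho}{E_\gamma}, \tag{SI.58}$$

where $E_\rho$ is given by eq. (21). Evaluating the derivative, we find:

$$\frac{1}{2} \frac{d}{dp} \log \frac{E_\rho}{E_\gamma} = -\left( \frac{1}{\frac{1}{\lambda_\rho} + \frac{p}{\lambda + t}} - \frac{1}{\frac{1}{\lambda_\gamma} + \frac{p}{\lambda + t}} \right) \frac{\partial}{\partial p} \left( \frac{p}{\lambda + t} \right). \tag{SI.59}$$

Using eq. (22),

$$\frac{\partial t}{\partial p} = -\frac{\partial}{\partial p} \left( \frac{p}{\lambda + t} \right) \sum_\rho \left( \frac{1}{\lambda_\rho} + \frac{p}{\lambda + t} \right)^{-2} = -\gamma \frac{\partial}{\partial p} \left( \frac{p}{\lambda + t} \right), \tag{SI.60}$$

where we identified the sum with $\gamma$. Inserting this, we obtain:

$$\frac{1}{2} \frac{d}{dp} \log \frac{E_\rho}{E_\gamma} = \left[ \frac{1}{\frac{1}{\lambda_\rho} + \frac{p}{\lambda + t}} - \frac{1}{\frac{1}{\lambda_\gamma} + \frac{p}{\lambda + t}} \right] \frac{1}{\gamma} \frac{\partial t}{\partial p}. \tag{SI.61}$$

Finally, solving for $\partial t / \partial p$ from (SI.60), we get:

$$\frac{\partial t}{\partial p} = -\frac{1}{\lambda + t} \, \frac{(\lambda + t)^2 \gamma}{(\lambda + t)^2 - p\gamma} = -\frac{1}{\lambda + t} \mathrm{Tr} \left\langle G^2 \right\rangle, \tag{SI.62}$$

proving that $\partial t / \partial p < 0$. Taking $\lambda_\gamma > \lambda_\rho$ without loss of generality, it follows that

$$\frac{d}{dp} \log \frac{E_\rho}{E_\gamma} > 0 \quad \Rightarrow \quad \frac{d}{dp} \log E_\rho > \frac{d}{dp} \log E_\gamma. \tag{SI.63}$$

7. Spherical Harmonics

Let $-\Delta$ represent the Laplace-Beltrami operator in $\mathbb{R}^d$. Spherical harmonics $\{Y_{km}\}$ in dimension $d$ are harmonic ($-\Delta Y_{km}(x) = 0$), homogeneous ($Y_{km}(tx) = t^k Y_{km}(x)$) polynomials that are orthonormal with respect to the uniform measure on $S^{d-1}$ (Efthimiou & Frye, 2014; Dai & Xu, 2013). The number of spherical harmonics of degree $k \geq 1$ in dimension $d$, denoted by $N(d, k)$, is

$$N(d, k) = \frac{2k + d - 2}{k} \binom{k + d - 3}{k - 1}, \tag{SI.64}$$

with $N(d, 0) = 1$.

The Laplace-Beltrami operator can be decomposed into the radial and angular parts, allowing

$$-\Delta = -\Delta_r - \Delta_{S^{d-1}}. \tag{SI.65}$$

Using this decomposition, the spherical harmonics are eigenfunctions of the surface Laplacian

$$-\Delta_{S^{d-1}} Y_{km}(x) = k(k + d - 2) Y_{km}(x). \tag{SI.66}$$

The spherical harmonics are related to the Gegenbauer polynomials $\{Q_k\}$, which are orthogonal with respect to the measure $d\tau(z) = (1 - z^2)^{(d-3)/2} dz$ of inner products $z = x^\top x'$ of uniformly sampled pairs on the sphere $x, x' \sim S^{d-1}$. The Gegenbauer polynomials can be constructed with the Gram-Schmidt procedure and have the following properties

$$Q_k(x^\top x') = \frac{1}{N(d, k)} \sum_{m=1}^{N(d,k)} Y_{km}(x) Y_{km}(x'), \qquad \int_{-1}^{1} Q_k(z) Q_\ell(z) \, d\tau(z) = \frac{\omega_{d-1}}{\omega_{d-2}} \frac{\delta_{k,\ell}}{N(d, k)}, \tag{SI.67}$$

where $\omega_{d-1} = \frac{2\pi^{d/2}}{\Gamma(d/2)}$ is the surface area of $S^{d-1}$.

8. Decomposition of Dot Product Kernels on $S^{d-1}$

For inputs sampled from the uniform measure on $S^{d-1}$, dot product kernels can be decomposed into Gegenbauer polynomials introduced in SI Section 7.

Let $K(x, x') = \kappa(x^\top x')$. The kernel’s orthogonal decomposition is

$$\kappa(z) = \sum_{k=0}^{\infty} \lambda_k N(d, k) Q_k(z), \qquad \lambda_k = \frac{\omega_{d-2}}{\omega_{d-1}} \int_{-1}^{1} \kappa(z) Q_k(z) \, d\tau(z). \tag{SI.68}$$

To numerically calculate the kernel eigenvalues of $\kappa$, we use Gauss-Gegenbauer quadrature (Abramowitz & Stegun,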
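1972) for the measure $d\tau(z)$ so that for a quadrature scheme of order $r$

$$\int_{-1}^{1} \kappa(z) Q_k(z) \, d\tau(z) \approx \sum_{i=1}^{r} w_i Q_k(z_i) \kappa(z_i), \tag{SI.69}$$

where $z_i$ are the $r$ roots of $Q_r(z)$ and the weights $w_i$ are chosen with

$$w_i = \frac{\Gamma(r + \alpha + 1)^2}{\Gamma(r + 2\alpha + 1)} \, \frac{2^{2r + 2\alpha + 1} \, r!}{V_r'(z_i) V_{r+1}(z_i)}, \tag{SI.70}$$

where

$$V_r(z) = 2^r r! (-1)^r Q_r(z). \tag{SI.71}$$

For our calculations we take $r = 1000$.

9. Frequency Dependence of Learning Curves in d → ∞ Limit

Here, we consider an informative limit where the input dimension $d$ goes to infinity.

Denoting the index $\rho = (k, m)$, we can write the mode error (SI.37), after some rearranging, as:

$$E_{km} = \frac{(\lambda + t)^2}{1 - \frac{p\gamma}{(\lambda + t)^2}} \, \frac{\lambda_k \langle \bar{w}_{km}^2 \rangle}{(\lambda + t + p\lambda_k)^2}, \tag{SI.72}$$

where $t$ and $\gamma$, after performing the sum over degenerate indices, are:

$$t = \sum_m \frac{N(d, m)(\lambda + t)\lambda_m}{\lambda + t + p\lambda_m}, \qquad \gamma = \sum_m \frac{N(d, m)(\lambda + t)^2 \lambda_m^2}{(\lambda + t + p\lambda_m)^2}. \tag{SI.73}$$

In the limit $d \to \infty$, the degeneracy factor (SI.64) scales as $N(d, k) \sim O(d^k)$. We note that for dot-product kernels $\lambda_k$ scales with $d$ as $\lambda_k \sim d^{-k}$ (Smola et al., 2001) (Figure 1), which leads us to define the $O(1)$ parameter $\bar{\lambda}_k = d^k \lambda_k$. Plugging these in, we get:

$$E_{km}(g_k) = \frac{d^{-k}(t + \lambda)^2}{1 - \tilde{\gamma}} \, \frac{\bar{\lambda}_k \langle \bar{w}_{km}^2 \rangle}{(t + \lambda + g_k \bar{\lambda}_k)^2}, \qquad
t = \sum_m \frac{(t + \lambda)\bar{\lambda}_m}{t + \lambda + g_m \bar{\lambda}_m}, \qquad
\tilde{\gamma} = \sum_m \frac{g_m \bar{\lambda}_m^2}{(t + \lambda + g_m \bar{\lambda}_m)^2}, \tag{SI.74}$$

where $g_k = p/d^k$ is the ratio of sample size to the degeneracy. Furthermore, we want to calculate the ratio $E_{km}(p)/E_{km}(0)$ to probe how much the mode errors move from their initial value:

$$\frac{E_{km}(p)}{E_{km}(0)} = \frac{1}{1 - \tilde{\gamma}} \, \frac{1}{\left( 1 + \frac{g_k \bar{\lambda}_k}{t + \lambda} \right)^2}. \tag{SI.75}$$

Let us consider an integer $l$ such that the scaling $p = \alpha d^l$ holds. This leads to three different asymptotic behaviors of the $g_k$:

$$\begin{aligned}
g_k &\sim O(d^{l-k}) \gg O(1), && k < l, \\
g_k &= \alpha \sim O(1), && k = l, \\
g_k &\sim O(d^{l-k}) \ll O(1), && k > l.
\end{aligned} \tag{SI.76}$$

If we assume $t \sim O(1)$, we get an asymptotically consistent set of equations:

$$t \approx \sum_{m > l} \bar{\lambda}_m + a(\alpha, t, \lambda, \bar{\lambda}_l) \sim O(1), \qquad \tilde{\gamma} \approx b(\alpha, t, \lambda, \bar{\lambda}_l) \sim O(1), \tag{SI.77}$$

where $a$ and $b$ are the $l$th terms in the sums in $t$ and $\tilde{\gamma}$, respectively, and are given by:

$$a(\alpha, t, \lambda, \bar{\lambda}_l) = \frac{(t + \lambda)\bar{\lambda}_l}{t + \lambda + \alpha\bar{\lambda}_l}, \qquad b(\alpha, t, \lambda, \bar{\lambda}_l) = \frac{\alpha\bar{\lambda}_l^2}{(t + \lambda + \alpha\bar{\lambda}_l)^2}. \tag{SI.78}$$

Then using (SI.75), (SI.76) and (SI.77), we find the errors associated with different modes as:

$$\begin{aligned}
k < l: \quad & \frac{E_{km}(\alpha)}{E_{km}(0)} \sim O(d^{2(k-l)}) \approx 0, \\
k > l: \quad & \frac{E_{km}(\alpha)}{E_{km}(0)} \approx \frac{1}{1 - \tilde{\gamma}(\alpha)}, \\
k = l: \quad & \frac{E_{km}(\alpha)}{E_{km}(0)} = s(\alpha) \sim O(1),
\end{aligned} \tag{SI.79}$$

where $s(\alpha)$ is given by:

$$s(\alpha) = \frac{1}{1 - \tilde{\gamma}(\alpha)} \, \frac{1}{\left( 1 + \alpha \frac{\bar{\lambda}_l}{t + \lambda} \right)^2}. \tag{SI.80}$$

Note that $\lim_{\alpha \to 0} \tilde{\gamma}(\alpha) = \lim_{\alpha \to \infty} \tilde{\gamma}(\alpha) = 0$ and $\tilde{\gamma}$ is non-zero in between. Then, for large $\alpha$, in the limit we are considering

$$\begin{aligned}
k < l: \quad & \frac{E_{km}(\alpha)}{E_{km}(0)} \approx 0, \\
k > l: \quad & \frac{E_{km}(\alpha)}{E_{km}(0)} \approx 1, \\
k = l: \quad & \frac{E_{km}(\alpha)}{E_{km}(0)} \approx \frac{\left( \lambda + \sum_{m > l} \bar{\lambda}_m \right)^2}{\bar{\lambda}_l^2 \alpha^2}.
\end{aligned} \tag{SI.81}$$

10. Neural Tangent Kernel

The neural tangent kernel is

$$K_{NTK}(x, x') = \sum_i \left\langle \frac{\partial f_\theta(x)}{\partial \theta_i} \frac{\partial f_\theta(x')}{\partial \theta_i} \right\rangle_\theta. \tag{SI.82}$$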
et al., 2019). We will restrict our attention to networks with zero bias and nonlinear activation function $\sigma$. Then

$$\begin{aligned}
K^{(1)}_{NTK}(x, x') &= K^{(1)}_{NNGP}(x, x'), \\
K^{(2)}_{NTK}(x, x') &= K^{(2)}_{NNGP}(x, x') + K^{(1)}_{NTK}(x, x') \dot{K}^{(2)}(x, x'), \\
&\;\;\vdots \\
K^{(L)}_{NTK}(x, x') &= K^{(L)}_{NNGP}(x, x') + K^{(L-1)}_{NTK}(x, x') \dot{K}^{(L)}(x, x'),
\end{aligned} \tag{SI.83}$$

where

$$\begin{aligned}
K^{(L)}_{NNGP}(x, x') &= \mathbb{E}_{(\alpha, \beta) \sim p^{(L-1)}_{x,x'}} \, \sigma(\alpha)\sigma(\beta), \\
\dot{K}^{(L)}(x, x') &= \mathbb{E}_{(\alpha, \beta) \sim p^{(L-1)}_{x,x'}} \, \dot{\sigma}(\alpha)\dot{\sigma}(\beta), \\
p^{(L-1)}_{x,x'} &= \mathcal{N}\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} K^{(L-1)}(x, x) & K^{(L-1)}(x, x') \\ K^{(L-1)}(x, x') & K^{(L-1)}(x', x') \end{pmatrix} \right), \\
K^{(1)}_{NNGP}(x, x') &= x^\top x'.
\end{aligned} \tag{SI.84}$$

If $\sigma$ is chosen to be the ReLU activation, then we can analytically simplify the expression. Defining the following function

$$f(z) = \arccos\left( \frac{1}{\pi} \sqrt{1 - z^2} + \left( 1 - \frac{1}{\pi} \arccos(z) \right) z \right), \tag{SI.85}$$

we obtain

$$K^{(L)}_{NNGP}(x, x') = \cos\left( f^{\circ (L-1)}(x^\top x') \right), \qquad \dot{K}^{(L)}(x, x') = 1 - \frac{1}{\pi} f^{\circ (L-2)}(x^\top x'), \tag{SI.86}$$

where $f^{\circ (L-1)}(z)$ is the function $f$ composed into itself $L - 1$ times.

This simplification gives an exact recursive formula to compute the kernel as a function of $z = x^\top x'$, which is what we use to compute the eigenspectrum with the quadrature scheme described in the previous section.

11. Spectra of Fully Connected ReLU NTK

A plot of the RKHS spectra of fully connected ReLU NTK’s of varying depth is shown in Figure SI.2. As the depth increases, the spectrum becomes more white, eventually, the kernel’s trace $\langle K(x, x) \rangle_x = \sum_k \lambda_k N(d, k)$ begins to diverge. Inference with such a kernel is equivalent to learning a function with infinite variance. Constraints on the variance of derivatives $\|\nabla^n_{S^{d-1}} f(x)\|^2$ correspond to more restrictive constraints on the eigenspectrum of the RKHS. Specifically, $\lambda_k N(d, k) \sim O(k^{-n-1/2})$ implies that the $n$-th gradient has finite variance $\|\nabla^n_{S^{d-1}} f(x)\|^2 < \infty$.

[Figure SI.2: log-log plot of the spectrum $\lambda_k N(d, k)$ against $k$ for depths ℓ = 5, 10, 25, 50, 100, 500.]

Figure SI.2. Spectrum of fully connected ReLU NTK without bias for varying depth ℓ. As the depth increases, the spectrum whitens, causing derivatives of lower order to have infinite variance. As ℓ → ∞, $\lambda_k N(d, k) \sim 1$ implying that the kernel becomes non-analytic at the origin.

Proof. By the representer theorem, let $f(x) = \sum_{i=1}^{p} \alpha_i K(x, x_i)$. By Green’s theorem, the variance of the $n$-th derivative can be rewritten as

$$\begin{aligned}
\|\nabla^n_{S^{d-1}} f(x)\|^2 &= \left\langle f(x) (-\Delta_{S^{d-1}})^n f(x) \right\rangle \\
&= \sum_{kk'mm'ij} \alpha_i \alpha_j \lambda_k \lambda_{k'} Y_{km}(x_i) Y_{k'm'}(x_j) \left\langle Y_{km}(x) (-\Delta_{S^{d-1}})^n Y_{k'm'}(x) \right\rangle \\
&= \sum_{kij} \lambda_k^2 k^n (k + d - 2)^n N(d, k) \, \alpha_i \alpha_j Q_k(x_i^\top x_j) \\
&\leq C p^2 (\alpha^*)^2 \sum_k \lambda_k^2 k^n (k + d - 2)^n N(d, k)^2,
\end{aligned} \tag{SI.87}$$

where $\alpha^* = \max_j |\alpha_j|$ and $|Q_k(z)| \leq C N(d, k)$ for a universal constant $C$. A sufficient condition for this sum to converge is that $\lambda_k^2 k^n (k + d - 2)^n N(d, k)^2 \sim O(k^{-1})$ which is equivalent to demanding $\lambda_k N(d, k) \sim O(k^{-n-1/2})$ since $(k + d - 2)^n \sim k^n$ as $k \to \infty$.

12. Decomposition of Risk for Numerical Experiments

As we describe in Section 4.1 of the main text, the teacher functions for the kernel regression experiments are chosen
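Using the Mercer decomposition of the kernel we can identify the coefficients

$$f^*(x) = \sum_{i=1}^{p'} \bar{\alpha}_i K(x, \bar{x}_i) = \sum_\rho \Big( \sum_i \bar{\alpha}_i \psi_\rho(\bar{x}_i) \Big) \psi_\rho(x). \tag{SI.90}$$

Here barred quantities ($\bar{\alpha}$, $\bar{x}_i$, $\bar{X}$) refer to the teacher and unbarred ones to the student. Comparing each term in these two expressions, we identify the coefficient of the $\rho$-th eigenfunction

$$\bar{w}_\rho = \sum_i \bar{\alpha}_i \psi_\rho(\bar{x}_i). \tag{SI.91}$$

We now need to compute the $D_{\rho\rho}$, by averaging $\bar{w}_\rho^2$ over all possible teachers

$$D_{\rho\rho} = \frac{1}{\lambda_\rho} \left\langle \bar{w}_\rho^2 \right\rangle = \frac{1}{\lambda_\rho} \sum_{ij} \langle \bar{\alpha}_i \bar{\alpha}_j \rangle \left\langle \psi_\rho(\bar{x}_i) \psi_\rho(\bar{x}_j) \right\rangle$$

$$E_k = \lambda_k^2 N(d, k) \left( \alpha^\top Q_k(X^\top X) \alpha - 2 \alpha^\top Q_k(X^\top \bar{X}) \bar{\alpha} + \bar{\alpha}^\top Q_k(\bar{X}^\top \bar{X}) \bar{\alpha} \right). \tag{SI.95}$$

We randomly sample the $\bar{\alpha}$ variables for the teacher and fit $\alpha = (K + \lambda I)^{-1} y$ to the training data. Once these coefficients are known, we can obtain empirical mode errors.

13. Neural Network Experiments

For the “pure mode” experiments with neural networks, the target function was

$$f^*(x) = \sum_{i=1}^{P'} \bar{\alpha}_i Q_k(x^\top \bar{x}_i) = \frac{1}{N(d, k)} \sum_{m=1}^{N(d,k)} \Big( \sum_{i=1}^{P'} \bar{\alpha}_i Y_{km}(\bar{x}_i) \Big) Y_{km}(x). \tag{SI.96}$$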
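A pure-mode target of the form (SI.96) can be sketched in a few lines. This is a minimal illustration under stated assumptions: the Gegenbauer polynomial is normalised so that $Q_k(1) = 1$ (which is consistent with the orthogonality relation (SI.67)), it is evaluated with the standard three-term recurrence with parameter $(d-2)/2$ for $d \geq 3$, and the dimension, degree, number of teacher points and all function names are hypothetical choices made here.

```python
import math
import random


def degeneracy(d, k):
    """N(d, k) from (SI.64), with N(d, 0) = 1."""
    if k == 0:
        return 1
    return (2 * k + d - 2) * math.comb(k + d - 3, k - 1) // k


def gegenbauer_q(k, d, z):
    """Q_k(z) for d >= 3, normalised (assumption) so that Q_k(1) = 1."""
    a = (d - 2) / 2.0

    def c(n, x):
        # Three-term recurrence for the Gegenbauer polynomial C_n^{(a)}(x).
        prev, curr = 1.0, 2.0 * a * x
        if n == 0:
            return prev
        for m in range(2, n + 1):
            prev, curr = curr, (2.0 * x * (m + a - 1.0) * curr
                                - (m + 2.0 * a - 2.0) * prev) / m
        return curr

    return c(k, z) / c(k, 1.0)


def random_unit_vector(d, rng):
    """Uniform sample on the sphere S^{d-1} via a normalised Gaussian."""
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def pure_mode_target(k, d, centers, coeffs):
    """f*(x) = sum_i coeffs[i] * Q_k(x . centers[i]), following (SI.96)."""
    def f_star(x):
        return sum(c * gegenbauer_q(k, d, sum(u * v for u, v in zip(x, xi)))
                   for c, xi in zip(coeffs, centers))
    return f_star


rng = random.Random(0)
d, k, n_centers = 5, 2, 8  # hypothetical sizes, for illustration only
centers = [random_unit_vector(d, rng) for _ in range(n_centers)]
coeffs = [rng.gauss(0.0, 1.0) for _ in range(n_centers)]
f_star = pure_mode_target(k, d, centers, coeffs)
```

Because the target is built from a single degree $k$, all of its power sits in one eigenspace of the dot product kernel, which is what makes it a convenient probe of the mode-by-mode learning curves.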
13.1. Hyperparameters
The choice of the number of hidden units $N$ was based primarily on computational considerations. For two layer neural networks, the total number of parameters scales linearly with $N$, so to approach the overparameterized regime, we aimed to have $N \approx 10\, p_{\max}$ where $p_{\max}$ is the largest sample size used in our experiment. For $p_{\max} = 500$, we chose $N = 4000, 10000$.

For the three and four layer networks, the number of parameters scales quadratically with $N$, making simulations with $N > 10^3$ computationally expensive. We chose $N$ to give training time comparable to the 2 layer case, which corresponded to $N = 500$ after experimenting with {100, 250, 500, 1000, 5000}.

We found that the learning rate needed to be quite large for the training loss to be reduced by a factor of $\approx 10^6$. For the 2 layer networks, we tried learning rates $\{10^{-3}, 10^{-2}, 1, 10, 32\}$ and found that a learning rate of 32 gave the lowest training error. For the three and four layer networks, we found that lower learning rates worked better and used learning rates in the range [0.5, 3].
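The linear versus quadratic scaling of the parameter count with the width can be made concrete with a small sketch. It assumes bias-free fully connected networks with a scalar output, counts "layers" as weight matrices (so a 2 layer network has one hidden layer), and uses a hypothetical input dimension; none of these details are taken from the experiments beyond the widths quoted above.

```python
def parameter_count(input_dim, hidden_width, n_layers):
    """Number of weights in a bias-free fully connected net with scalar output.

    n_layers counts weight matrices, so n_layers = 2 means one hidden layer.
    """
    n_hidden = n_layers - 1
    first = input_dim * hidden_width                 # input -> first hidden
    middle = (n_hidden - 1) * hidden_width ** 2      # hidden -> hidden
    last = hidden_width                              # last hidden -> output
    return first + middle + last


input_dim = 30  # hypothetical input dimension, not taken from the text
for n_layers, width in [(2, 4000), (2, 10000), (3, 500), (4, 500)]:
    print(n_layers, width, parameter_count(input_dim, width, n_layers))
```

For a 2 layer network the count is proportional to the width, while each additional hidden layer contributes a term that grows as the square of the width.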
[Figure SI.3: training error $E_{tr}$ (log scale) against SGD iteration, 0 to 10000. Panel (a): 3 Layer Training Loss; lr = 2. Panel (b): 4 Layer Training Loss; lr = 0.5.]

Figure SI.3. Training error for different pure mode target functions on neural networks with 500 hidden units per hidden layer on a sample of size p = 500. Generally, we find that the low frequency modes have an initial rapid reduction in the training error but the higher frequencies k ≥ 4 are trained at a slower rate.
[Figure SI.4: six log-log panels plotted against p. Panels (a) to (d) show $E_k(p)/E_k(0)$ for k = 1, 2, 4, for both the neural network (NN) and the kernel: (a) 2 layer NN, N = 4000; (b) 2 layer NN, N = $10^4$; (c) 3 layer, N = 500; (d) 4 layer, N = 500. Panels (e) and (f) show $E_g$, theory against experimental test error: (e) 2 Layer NN Student-Teacher, N = 2000; (f) 2 Layer NN Student-Teacher, N = 8000.]

Figure SI.4. Learning curves for neural networks on “pure modes” and on student teacher experiments. The theory curves are shown as solid lines. For the pure mode experiments, the test error for the finite width neural networks and NTK are shown with dots and triangles respectively. Logarithms are evaluated with base 10.