0% found this document useful (0 votes)
8 views

spectrum_dependent_learning

Uploaded by

ee19b114
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
8 views

spectrum_dependent_learning

Uploaded by

ee19b114
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 22

Spectrum Dependent Learning Curves in Kernel Regression and Wide Neural

Networks

Blake Bordelon 1 Abdulkadir Canatar 2 Cengiz Pehlevan 1 3

networks (LeCun et al., 2015) with wide hidden layers that


Abstract addresses these questions.
arXiv:2002.02561v7 [cs.LG] 25 Feb 2021

We derive analytical expressions for the gener- The goal of our theory is not to provide worst case bounds
alization performance of kernel regression as a on generalization performance in the sense of statistical
function of the number of training samples us- learning theory (Vapnik, 1999), but to provide analytical ex-
ing theoretical methods from Gaussian processes pressions that explain the average or a typical performance
and statistical physics. Our expressions apply to in the spirit of statistical physics. The techniques we use are
wide neural networks due to an equivalence be- a continuous approximation to learning curves previously
tween training them and kernel regression with used in Gaussian processes (Sollich, 1999; 2002; Sollich &
the Neural Tangent Kernel (NTK). By computing Halees, 2002) and the replica method of statistical physics
the decomposition of the total generalization error (Sherrington & Kirkpatrick, 1975; Mézard et al., 1987).
due to different spectral components of the kernel,
We first develop an approximate theory of generalization in
we identify a new spectral principle: as the size of
kernel regression that is applicable to any kernel. We then
the training set grows, kernel machines and neural
use our theory to gain insight into neural networks by us-
networks fit successively higher spectral modes of
ing a correspondence between kernel regression and neural
the target function. When data are sampled from
network training. When the hidden layers of a neural net-
a uniform distribution on a high-dimensional hy-
work are taken to infinite width with a certain initialization
persphere, dot product kernels, including NTK,
scheme, recent influential work (Jacot et al., 2018; Arora
exhibit learning stages where different frequency
et al., 2019; Lee et al., 2019) showed that training a feedfor-
modes of the target function are learned. We ver-
ward neural network with gradient descent to zero training
ify our theory with simulations on synthetic data
loss is equivalent to kernel interpolation (or ridgeless kernel
and MNIST dataset.
regression) with a kernel called the Neural Tangent Kernel
(NTK) (Jacot et al., 2018). Our kernel regression theory con-
tains kernel interpolation as a special limit (ridge parameter
1. Introduction going to zero).
Finding statistical patterns in data that generalize beyond a Our contributions and results are summarized below:
training set is a main goal of machine learning. Generaliza-
tion performance depends on factors such as the number of • Using a continuous approximation to learning curves
training examples, the complexity of the learning task, and adapted from Gaussian process literature (Sollich, 1999;
the nature of the learning machine. Identifying precisely 2002), we derive analytical expressions for learning
how these factors impact the performance poses a theoreti- curves for each spectral component of a target function
cal challenge. Here, we present a theory of generalization learned through kernel regression.
in kernel machines (Schölkopf & Smola, 2001) and neural • We present another way to arrive at the same analytical
expressions using the replica method of statistical physics
1
John A. Paulson School of Engineering and Applied Sci- and a saddle-point approximation (Sherrington & Kirk-
ences, Harvard University, Cambridge, MA, USA 2 Department of
Physics, Harvard University, Cambridge, MA, USA 3 Center for
patrick, 1975; Mézard et al., 1987).
Brain Science, Harvard University, Cambridge, MA, USA. Corre- • Analysis of our theoretical expressions show that differ-
spondence to: Cengiz Pehlevan <[email protected]>. ent spectral modes of a target function are learned with
different rates. Modes corresponding to higher kernel
Proceedings of the 37 th International Conference on Machine eigenvalues are learned faster, in the sense that a marginal
Learning, Online, PMLR 119, 2020. Copyright 2020 by the au-
thor(s).
training data point causes a greater percent reduction in
Code: https://github.com/Pehlevan-Group/ generalization error for higher eigenvalue modes than for
NTK_Learning_Curves lower eigenvalue modes.
Learning Curves in Kernel Regression and Wide Neural Networks

• When data is sampled from a uniform distribution on a kernel regression, as opposed to a bound on it, that remains
hypersphere, dot product kernels, which include NTK, valid for the ridgeless case and finite sample sizes.
admit a degenerate Mercer decomposition in spherical
In statistical physics domain, Dietrich et al. (1999) calcu-
harmonics, Ykm . In this case, our theory predicts that
lated learning curves for support vector machines, but not
generalization error of lower frequency modes of the tar-
kernel regression, in the limit of number of training samples
get function decrease more quickly than higher frequency
going to infinity for dot product kernels with binary inputs
modes as the dataset size grows. Different learning stages
using a replica method. Our theory applies to general ker-
are visible in the sense described below.
nels and finite size datasets. In the infinite training set limit,
• As the dimensions of data, d, go to infinity, learning
they observed several learning stages where each spectral
curves exhibit different learning stages. For a training
mode is learned with a different rate. We observe similar
set of size p ∼ O(dl ), modes with k < l are perfectly
phenomena in kernel regression. In a similar spirit, (Cohen
learned, k = l are being learned, and k > l are not
et al., 2019) calculates learning curves for infinite-width neu-
learned.
ral networks using a path integral formulation and a replica
• We verify the predictions of our theory using numerical
analysis but does not discuss the spectral dependence of the
simulations for kernel regression and kernel interpolation
generalization error.
with NTK, and wide and deep neural networks trained
with gradient descent. Our theory fits experiments remark- In the infinite width limit, neural networks have many more
ably well on synthetic datasets and MNIST. parameters than training samples yet they do not overfit
(Zhang et al., 2017). Some authors suggested that this is
1.1. Related Work a consequence of the training procedure since stochastic
gradient descent is implicitly biased towards choosing the
Our main approximation technique comes from the literature simplest functions that interpolate the training data (Belkin
on Gaussian processes, which is related to kernel regression et al., 2019a; 2018b; Xu et al., 2019a; Jacot et al., 2018).
in a certain limit. Total learning curves for Gaussian pro- Other studies have shown that neural networks fit the low
cesses, but not their spectral decomposition as we do here, frequency components of the target before the high fre-
have been studied in a limited teacher-student setting where quency components during training with gradient descent
both student and teacher were described by the same Gaus- (Xu et al., 2019b; Rahaman et al., 2019; Zhang et al., 2019;
sian process and the same noise in (Opper & Vivarelli, 1998; Luo et al., 2019). In addition to training dynamics, recent
Sollich, 1999). We allow arbitrary teacher distributions. works such as (Yang & Salman, 2019; Bietti & Mairal, 2019;
Sollich also considered mismatched models where teacher Cao et al., 2019) have discussed how the spectrum of kernels
and student kernels had different eigenspectra and different impacts its smoothness and approximation properties. Here
noise levels (Sollich, 2002). The total learning curve from we explore similar ideas by explicitly calculating average
this model is consistent with our results when the teacher case learning curves for kernel regression and studying its
noise is sent to zero, but we also consider, provide expres- dependence on the kernel’s eigenspectrum.
sions for, and analyze generalization in spectral modes. We
use an analogue of the “lower-continuous” approximation
scheme introduced in (Sollich & Halees, 2002), the results 2. Kernel Regression Learning Curves
of which we reproduce through the replica method (Mézard We start with a general theory of kernel regression. Implica-
et al., 1987). tions of our theory for dot product kernels including NTK
Generalization bounds for kernel ridge regression have and trained neural networks are described in Section 3.
been obtained in many contexts (Schölkopf & Smola, 2001;
Cucker & Smale, 2002; Vapnik, 1999; Gyorfi et al., 2003), 2.1. Notation and Problem Setup
but the rates of convergence often crucially depend on the ex-
We start by defining our notation and setting up our prob-
plicit ridge parameter λ and do not provide guarantees in the
lem. Our initial goal is to derive a mathematical expression
ridgeless case. Using a teacher-student setting, Spigler et al.
for generalization error in kernel regression, which we will
(2019) showed that learning curves for kernel regression
analyze in the subsequent sections using techniques from
asymptotically decay with a power law determined by the
the Gaussian process literature (Sollich, 1999; 2002; Sol-
decay rate of the teacher and the student. Such power law
lich & Halees, 2002) and statistical physics (Sherrington &
decays have been observed empirically on standard datasets
Kirkpatrick, 1975; Mézard et al., 1987).
(Hestness et al., 2017; Spigler et al., 2019). Recently, inter-
est in explaining the phenomenon of interpolation has led The goal of kernel regression is to learn a function f : X →
to the study of generalization bounds on ridgeless regres- RC from a finite number of observations (Wahba, 1990;
sion (Belkin et al., 2018b;a; 2019b; Liang & Rakhlin, 2018). Schölkopf & Smola, 2001). In developing our theory, we
Here, our aim is to capture the average case performance of will first focus on the case where C = 1, and later extend
Learning Curves in Kernel Regression and Wide Neural Networks
p
our results to C > 1 as we discuss in Section 2.5. Let where ψρ (x) = λρ φ(x) is the feature map we will work
{xi , yi } ∈ X × R, where X ⊆ Rd , be one of the p training with. In our analysis, M will be taken to be infinite, but for
examples and let H be a Reproducing Kernel Hilbert space the derivation of the learning curves, we will first consider
(RKHS) with inner product h·, ·iH . To avoid confusion with M as a finite integer. The eigenfunctions and eigenvalues
our notation for averaging, we will always decorate angular are defined with respect to the probability measure that
brackets for Hilbert inner product with H and a comma. generates the data dµ(x) = p(x)dx
Kernel ridge regression is defined as: Z
p dx0 p(x0 )K(x, x0 )φρ (x0 ) = λρ φρ (x). (6)
X
min (f (xi ) − yi )2 + λ||f ||2H . (1)
f ∈H
i=1 We will also find it convenient to work with a vector repre-
The λ → 0 limit is referred to as interpolating kernel re- sentation of the RKHS functions in the feature space. Kernel
gression, and, as we will discuss later, relevant to training eigenfunctions form a complete orthonormal basis, allowing
wide neural networks. The unique minimum of the convex the expansion of the target function f ∗ and learned function
optimization problem is given by f in terms of features {ψρ (x)}
X X
f (x) = y> (K + λI)−1 k(x), (2) f ∗ (x) = wρ ψρ (x), f (x) = wρ ψρ (x). (7)
ρ ρ
where K(·, ·) is the reproducing kernel for H, K is the
Hence, M -dimensional vectors w and w constitute a repre-
p × p kernel gram matrix Kij = K(xi , xj ), and k(x)i =
sentation of f and f ∗ respectively in the feature space.
K(x, xi ). Lastly, y ∈ Rp is the vector of target values
yi = f ∗ (xi ). For interpolating kernel regression, when We can also obtain a feature space expression for the optimal
the kernel is invertible, the solution is the same except that kernel regression function (2). Let Ψ ∈ RM ×p be feature
λ = 0, meaning that training data is fit perfectly. The proof matrix for the sample so that Ψρ,i = ψρ (xi ). With this
of this optimal solution is provided in the Supplementary representation, kernel ridge regression (1) can be recast
Information (SI) Section 1. as the optimization problem minw∈RM , kwk2 <∞ kΨ> w −
yk2 + λkwk2 , whose solution is
Let p(x) be the probability density function from which the
input data are sampled. The generalization error is defined w = (ΨΨ> + λI)−1 Ψy. (8)
as the expected risk with expectation taken over new test
points sampled from the same density p(x). For a given
Another novelty of our theory is the decomposition of the
dataset {xi } and target function f ∗ (x), let fK (x; {xi }, f ∗ )
generalization error into its contributions from different
represent the function learned with kernel regression. The
eigenmodes. The feature space expression of the generaliza-
generalization error for this dataset and target function is
tion error after averaging over the data distribution can be
Z
2
written as:
Eg ({xi }, f ) = dx p(x) (fK (x; {xi }, f ∗ ) − f ∗ (x)) .

X
(3) Eg = Eρ , Eρ ≡ λρ (wρ − wρ )2 {xi },w , (9)
ρ
To calculate the average case performance of kernel regres-
sion, we average this generalization error over the possible where we identify Eρ as the generalization error in mode ρ.
datasets {xi } and target functions f ∗
Proof.
Eg = hEg ({xi }, f ∗ )i{xi },f ∗ . (4)
Eg = (f (x) − f ∗ (x))2 x,{xi },f ∗
Our aim is to calculate Eg for a general kernel and a general X
distribution over teacher functions. = h(wρ − wρ )(wγ − wγ )i{xi },f ∗ hψρ (x)ψγ (x)ix
ρ,γ
For our theory, we will find it convenient to work with X X
the feature map defined by the Mercer decomposition. By = λρ (wρ − wρ )2 {xi },w
= Eρ . (10)
Mercer’s theorem (Mercer, 1909; Rasmussen & Williams, ρ ρ
2005) , the kernel admits a representation in terms of its M
kernel eigenfunctions {φρ (x)},
M
X M
X We introduce a matrix notation for RKHS eigenvalues
0
K(x, x ) = λρ φρ (x)φρ (x ) = 0
ψρ (x)ψρ (xi ), Λρ,γ ≡ δρ,γ λρ for convenience. Finally, with our notation
ρ=1 ρ=1 set up, we can present our first result about generalization
(5) error.
Learning Curves in Kernel Regression and Wide Neural Networks

Proposition 1. For the w that minimizes the training error Note that the quantity we want to calculate is given by
(eq. (8)), the generalization error (eq. (4)) is given by
∂ D E
  G2 (p) = − G̃(p, v) . (16)
Eg = Tr D G2 {xi } , (11) ∂v v=0

which can be decomposed into modal generalization errors By considering the effect of adding a single randomly sam-
X pled input x0 , and treating p as a continuous parameter, we
Eρ = Dρ,γ G2γ,ρ {x } , (12)
i can derive an approximate quasi-linear partial differential
γ
equation (PDE) for the average elements of G as a function
where of the number of data points p (see below for a derivation):
 −1 D E
1
G= ΦΦ> + Λ−1 , Φ = Λ−1/2 Ψ. (13) ∂ G̃(p, v) 1 ∂ D E
λ = D E G̃(p, v) , (17)
∂p λ + Tr G̃(p, v) ∂v
and
D = Λ−1/2 ww> Λ−1/2 . (14)
w
with the initial condition G̃(0, v) = (Λ−1 + vI)−1 , which
We leave the proof to SI Section 2 but provide a few cur- follows from ΦΦ> = 0 when there is no data. Since G̃ is
sory observations of this result. First, note that all of the initialized as a diagonal matrix, the Doff-diagonal
E elements
dependence on the teacher function comes in the matrix D will not vary under the dynamics and G̃(p, v) will remain
whereas all of the dependence on the empirical samples is diagonal for all (p, v). We will use the solutions to this PDE
in G. In the rest of the paper, we will develop multiple the- and relation (16) to arrive at an approximate expression for
oretical methods to calculate the generalization error given the generalization error Eg and the mode errors Eρ .
by expression (11).
Averaging over the target weights in the expression for D Derivation of the PDE approximation (17). Let φ ∈ RM
is easily done for generic weight distributions. The case represent the new feature to be added to G−1 so that
of a fixed target is included by choosing a delta-function φρ = φρ (x0 ) where x0 ∼ p(x 0
D ) is aErandom sample from
distribution over w. the data distribution. Let G̃(p, v) denote the matrix
Φ
We present two methods for computing the nontrivial aver- G̃ averaged over it’s p-sample design matrix Φ. By the
age of the matrix G2 over the training samples {xi }. First, Woodbury matrix inversion formula
we consider the effect of adding a single new sample to G * −1 +
to derive a recurrence relation for G at different number D E
−1 1 >
G̃(p + 1, v) = G̃(p, v) + φφ
of data points. This method generates a partial differential Φ,φ λ
Φ,φ
equation that must be solved to compute the generalization * +
>
error. Second, we use a replica method and a saddle point D E G̃(p, v)φφ G̃(p, v)
= G̃(p, v) − . (18)
approximation to calculate the matrix elements of G. These Φ λ + φ> G̃(p, v)φ Φ,φ
approaches give identical predictions for the learning curves
of kernel machines. Performing the average of the last term on the right hand
For notational simplicity, in the rest of the paper, we will side is difficult so we resort to an approximation, where the
use h. . .i to mean h. . .i{xi },w unless stated otherwise. In all numerator and denominator are averaged separately.
cases, the quantity inside the brackets will depend either on D E
the data distribution or the distribution of target weights, but D E D E G̃(p, v)2
ΦE
not both. G̃(p + 1, v) ≈ G̃(p, v) − D ,
Φ,φ Φ λ + Tr G̃(p, v)
Φ
2.2. Continuous Approximation to Learning Curves (19)

First, we adopt a method following Sollich (1999; 2002) where we used the fact that hφρ (x0 )φγ (x0 )ix0 ∼p(x0 ) = δρ,γ .
and Sollich & Halees (2002) to calculate the generalization
error. We generalize the definition of G by introducing an Treating p as a continuous variable and taking a continuum
auxiliary parameter v, and make explicit its dataset size, p, limit of the finite differences given above, we arrive at (17).
dependence:
 −1
1 Next, we present the solution to the PDE (17) and the result-
G̃(p, v) = ΦΦ> + Λ−1 + vI . (15) ing generalization error.
λ
Learning Curves in Kernel Regression and Wide Neural Networks
D E
Proposition 2. Let gρ (p, v) = G̃(p, v)ρρ represent the We note that though the mode errors fall asymptotically
D E like p−2 (SI Section 4), the total generalization error Eg
diagonal elements of the average matrix G̃(p, v) . These can scale with p in a nontrivial manner. For instance, if
matrix elements satisfy the implicit relationship w2ρ λρ ∼ ρ−a and λρ ∼ ρ−b then a simple computation (SI
!−1 Section 4) shows that Eg ∼ p− min{a−1,2b} as p → ∞ for
1 p ridgeless regression and Eg ∼ p− min{a−1,2b}/b for explic-
gρ (p, v) = +v+ PM . (20)
λρ λ + γ=1 gγ (p, v) itly regularized regression. This is consistent with recent
observations that total generalization error for neural net-
works and kernel regression falls in a power law Eg ∼ p−β
This implicit solution is obtained from the method of char- with β dependent on kernel and target function (Hestness
acteristics which we provide in Section 3 of the SI. et al., 2017; Spigler et al., 2019).
Proposition 3. Under the PDE approximation (17), the
average error Eρ associated with mode ρ is 2.3. Computing Learning Curves with Replica Method
−2  −1 The result of the continuous approximation can be obtained
hw2ρ i

1 p pγ(p)
Eρ (p) = + 1− , using another approximation method, which we outline here
λρ λρ λ + t(p) (λ + t(p))2
(21) and detail in SI Section 5. We perform the average of ma-
where t(p) ≡
P
g (p, 0) is the solution to the implicit trix G(p, v) over the training data, using the replica method
ρ ρ
equation (Sherrington & Kirkpatrick, 1975; Mézard et al., 1987) from
statistical physics and a finite size saddle-point approxima-
X 1 −1 tion, and obtain identical learning curves to Proposition 3.
p
t(p) = + , (22) Our starting point is a Gaussian integral representation of
ρ
λρ λ + t(p)
the matrix inverse
and γ(p) is defined as ∂2
hG(p, v)ρ,γ i = R(p, v, h)|h=0 ,
−2 ∂hρ ∂hγ
X 1 p  Z 
γ(p) = + . (23) 1 − 21 u> ( λ
1
ΦΦ> +Λ−1 +vI)u+h·u
λρ λ + t(p) R(p, v, h) ≡ du e ,
ρ Z
(24)
The full proof of this proposition is provided in Section 3 of 1 > 1 > −1
where Z = du e− 2 u ( λ ΦΦ +Λ +vI)u . Since Z also
R
the SI. We show the steps required to compute theoretical
learning curves numerically in Algorithm 1. depends on the dataset (quenched disorder) Φ, to make the
average over Φ tractable, we use the following limiting
procedure: Z −1 = limn→0 Z n−1 . As is common in the
Algorithm 1 Computing Theoretical Learning Curves
physics of disordered systems (Mézard et al., 1987), we
Input: RKHS spectrum {λρ }, target function weights compute R(p, v, h) for integer n and analytically continue
{wρ }, regularizer λ, sample sizes {pi }, i = 1, ..., m; the expressions in the n → 0 limit under a symmetry ansatz.
for i = 1 to m do −1 This procedure produces the same average matrix elements
P  pi
Solve numerically ti = ρ λ1ρ + λ+t i as the continuous approximation discussed in Proposition 2,
P 1 −2 and therefore the same generalization error given in Propo-
pi
Compute γi = ρ λρ + λ+ti
−2  −1 sition 3. Further detail is provided in SI Section 5.
hw2 i

pi p i γi
Eρ,i = λρρ λ1ρ + λ+t i
1 − (λ+ti ) 2
2.4. Spectral Dependency of Learning Curves
end for
We can get insight about the behavior of learning curves by
In eq. (21), the target function sets the overall scale of Eρ . considering ratios between errors in different modes:
That Eρ depends only on w̄ρ , but not other target modes, 1 p 2
is an artifact of our approximation scheme, and in a full Eρ hw2ρ i λγ ( λγ + λ+t )
= p 2. (25)
treatment may not necessarily hold. The spectrum of the Eγ hw2γ i λρ ( λ1ρ + λ+t )
kernel affects all modes in a nontrivial way. When we apply
Eρ λρ hw2ρ i
this theory to neural networks in Section 3, the information For small p this ratio approaches ∼ . For large
Eγ λγ hw2γ i
about the architecture of the network will be in the spectrum Eρ hw2ρ i/λρ
{λρ }. The dependence on number of samples p is also p, Eγ ∼ hw2γ i/λγ
,indicating that asymptotically (p →
nontrivial, but we will consider various informative limits ∞), the amount of relative error in mode ρ grows with the
below. ratio hw2ρ i /λρ , showing that the asymptotic mode error is
Learning Curves in Kernel Regression and Wide Neural Networks

relatively large if the teacher function places large amounts k=1 k=5
of power in modes that have small RKHS eigenvalues λρ . 1.0 k=2 k=6
k=3 k=7
We can also examine how the RKHS spectrum affects the k=4 k=8
evolution of the error ratios with p. Without loss of general- 0.8

kN(d, k)
ity, we take λγ > λρ and show in SI Section 6 that
0.6
d d
log Eρ > log Eγ . (26)
dp dp 0.4
In this sense, the marginal training data point causes a
greater percent reduction in generalization error for modes 0.2
101 102
with larger RKHS eigenvalues. d

2.5. Multiple Outputs Figure 1. Spectrum of 10-layer NTK multiplied by degeneracy


as a function of dimension for various k, calculated by numerical
The learning curves we derive for a scalar function can be
integration (SI Section 8). λk N (d, k) stays constant as input
straightforwardly extended to the case where the function dimension increases, confirming that λk N (d, k)−1 ∼ Od (1) at
outputs are multivariate: f : Rd → RC . For least squares large d.
regression, this case is equivalent to solving C separate
learning problems for each component functions fc (x), c =
1, ..., C. Let yc ∈ Rp be the corresponding vectors of target We briefly comment on another fact that will later be
values possibly generated by different target functions, fc∗ . used in our numerical simulations. Dot product kernels
The learning problem in this case is admit an expansion in terms of Gegenbauer polynomials
C
X
" p
X
# {Qk }, which form a complete and orthonormal basis for
minc (fc (xi ) − yc,i )2 + λ||fc ||2H . (27) the uniform
P∞measure on the sphere (Dai & Xu, 2013):
f ∈H
c=1 i=1 κ(z) = k=0 λk N (d, k)Qk (z). The Gegenbauer poly-
nomials are related to spherical harmonics {Ykm } through
The solution to the learning problem depends on the same PN (d,k)
kernel but different targets for each function: Qk (x> x0 ) = N (d,k)
1 0
m=1 Ykm (x)Ykm (x ) (Dai & Xu,
2013) (see SI Sections 7 and 8 for a review).
fc (x) = yc> (K + λI)−1 k(x), c = 1, . . . , C. (28)
Our theory can be used to generate predictions for the gen- 3.1. Frequency Dependence of Learning Curves
eralization error of each of the C learned functions, fc (x), In the special case of dot product kernels with monotoni-
and then summed to obtain the total error. cally decaying spectra, results given in Section 2.4 indicate
that the marginal training data point causes greater reduc-
3. Dot Product Kernels on Sd−1 and NTK tion in relative error for low frequency modes than for high
frequency modes. Monotonic RKHS spectra represent an
For the remainder of the paper, we specialize to the case inductive bias that preferentially favors fitting lower frequen-
where our inputs are drawn uniformly on X = Sd−1 , a cies as more data becomes available. More rapid decay in
(d − 1)-dimensional unit hyper-sphere. In addition, we will the spectrum yields a stronger bias to fit low frequencies
assume that the kernel is a dot product kernel (K(x, x0 ) = first.
κ(x> x0 )), as is the case for NTK. In this setting, the kernel
eigenfunctions are spherical harmonics {Ykm } (Bietti & To make this intuition more precise, we now discuss an
Mairal, 2019; Efthimiou & Frye, 2014), and the Mercer informative limit d → ∞ where the degeneracy factor ap-
decomposition is given by proaches to N (d, k) ∼ dk /k!. In the following, we replace
eigenfunction index ρ with index pair (k, m). Eigenvalues

X N (d,k)
X of the kernel scales with d as λk ∼ N (d, k)−1 (Smola et al.,
K(x, x0 ) = λk Ykm (x)Ykm (x0 ). (29) 2001) in the d → ∞ limit, as we verify numerically in Fig-
k=0 m=1 ure 1 for NTK. If we take p = αd` for some integer degree
Here, N (d, k) is the dimension of the subspace spanned by `, then Ekm exhibits three distinct learning stages. Leaving
d-dimensional spherical harmonics of degree k. Rotation the details to SI Section 9, we find that in this limit, for large
invariance renders the eigenspectrum degenerate since each α:

of the N (d, k) modes of frequency k share the same eigen- 1, k>`
Ekm (α)  const.
value λk . A review of these topics is given in SI Sections 7 ≈ α 2 , k=` , (30)
and 8. Ekm (0)
0, k<`

Learning Curves in Kernel Regression and Wide Neural Networks

where the constant is given in SI Section 9. In other words, 4. Experiments


k < l modes are perfectly learned, k = l are being learned
with an asymptotic 1/α2 rate, and k > l are not learned. In this section, we test our theoretical results for kernel re-
gression, kernel interpolation and wide networks for various
This simple calculation demonstrates that the lower modes kernels and datasets.
are learned earlier with increasing sample complexity since
the higher modes stays stationary until p reaches to the 4.1. NTK Regression and Interpolation
degeneracy of that mode.
We first test our theory in a kernel regression task with
3.2. Neural Tangent Kernel and its Spectrum NTK demonstrating the spectral decomposition. In this
experiment, the target function is a linear combination of a
For fully connected architectures, the NTK is a rotation kernel evaluated at randomly sampled points {xi }:
invariant kernel that describes how the predictions of in-
0
finitely wide neural networks evolve under gradient flow p
X

(Jacot et al., 2018). Let θi index all of the parameters of the f (x) = αi K(x, xi ), (34)
neural network and let fθ (x) be the output of the network. i=1
Here, we focus on scalar network outputs for simplicity,
where αi ∼ B(1/2) are sampled randomly from a Bernoulli
but generalization to multiple outputs is straightforward, as
distribution on {±1} and xi are sampled uniformly from
discussed in Section 2.5. Then the neural tangent kernel is
Sd−1 . The points xi are independent samples from Sd−1
defined as
X D ∂fθ (x) ∂fθ (x0 ) E and are different than the training set {xi }. The student
KNTK (x, x0 ) = . (31) function is learned with kernel regression by inverting the
i
∂θi ∂θi θ Gram matrix K defined on the training samples {xi } ac-
Let uθ ∈ Rp be the current predictions of fθ on the training cording to eq. (2). With this choice of target function,
P exact
data. If the parameters of the model are updated via gradient computation of the mode wise errors Ek = m Ekm in
flow on a quadratic loss, dθ terms of Gegenbauer polynomials is possible; the formula
dt = −∇θ uθ · (uθ − y), then the
predictions on the training data evolve with the following and its derivation are provided in Section 12.2 of the SI. We
dynamics (Pehlevan et al., 2018; Jacot et al., 2018; Arora compare these experimental mode-errors to those predicted
et al., 2019; Lee et al., 2019) by our theory and find perfect agreement. For these exper-
iments, both the target and student kernels are taken to be
duθ
= −KNTK · (uθ − y) . (32) NTK of a 4-layer fully connected ReLU without bias.
dt
Figure 2 shows the errors for each frequency k as a function
When the width of the neural network is taken to infinity of sample size p. In Figure 2(a), we show that the mode
with proper initialization, where the weights at layer ` are errors sequentially start falling when p ∼ N (d, k). Figure
sampled W (`) ∼ N 0, 1/n(`) where n(`) is the number 2(b) shows the mode error corresponding to k = 1 for kernel
of hidden units in layer `, the NTK becomes independent regression with 3-layer NTK across different dimensions.
of the particular realization of parameters and approaches Higher input dimension causes the frequency modes to be
a deterministic function of the inputs and the nonlinear learned at larger p. We observe an asymptotic ∼ 1/α2 decay
activation function (Jacot et al., 2018). Further, the kernel in modal errors. Finally, we show the effect of regularization
is approximately fixed throughout gradient descent (Jacot on mode errors with a 10-layer NTK in Figure 2(c). With
et al., 2018; Arora et al., 2019). If we assume that uθ = 0 increasing λ, learning begins at larger p values.
at t = 0, then the final learned function is
f (x) = y> K−1
NTK k(x). (33) 4.2. Learning Curves for Finite Width Neural
Networks
Note that this corresponds to ridgeless, interpolating regres-
sion where λ = 0. We will use this correspondence and our Having established that our theory accurately predicts the
kernel regression theory to explain neural network learning generalization error of kernel regression with NTK, we now
curves in the next section. For more information about NTK compare the generalization error of finite width neural net-
for fully connected architectures see SI Sections 10 and 11. works trained on a quadratic loss with the theoretical learn-
ing curves for NTK. For these experiments, we use the
To generate theoretical learning curves, we need the eigen-
Neural-Tangents Library (Novak et al., 2020) which sup-
spectrum of the kernels involved. For X = Sd−1 , it suffices
ports training and inference for both finite and infinite width
to calculate the projections of the kernel on the Gegenbauer
neural networks.
basis hKNTK (x), Qk (x)ix , which we evaluate numerically
with Gauss-Gegenbauer quadrature (SI Section 8). Further First, we use “pure mode” teacher functions, meaning the
details on NTK spectrum is presented in SI Section 11. teacher is composed only of spherical harmonics of the same
Learning Curves in Kernel Regression and Wide Neural Networks

100 100
10 1
1
10 1
2 10 1 1
2

E1(p)/E1(0)

E1(p)/E1(0)
k=1 10
Ek(p)/Ek(0)

2
10 3 k=2 10 2
k=3 10 3 =0
k=4 d=15 10 3 =1
10 5 k=5 10 4 d=30 =3
k=6 d=45 10 4 =5
k=7 10 5
d=75 = 10
100 101 102 103 104 100 101 102 103 104 100 101 102 103 104
p p p
(a) 3-layer NTK d = 15 λ = 0 (b) 3-layer NTK k = 1 λ = 1 (c) 10-layer NTK d = 15 k = 1

Figure 2. Learning curves for kernel regression with NTK averaged over 50 trials compared to theory. Error bars are standard deviation.
Solid lines are theoretical curves calculated using eq. (21). Dashed vertical lines indicate the degeneracy N (d, k). (a) Normalized learning
curves for different spectral modes. Sequential fitting of mode errors is visible. (b) Normalized learning curves for varying data dimension,
d. (c) Learning curves for varying regularization parameter, λ.

101 10 1
theory
101 expt test error

100 100
Ek(p)/Ek(0)

Ek(p)/Ek(0)

10 1
10 2
10 1

Eg
10 2 NN k = 1 NN k = 1
NN k = 2 NN k = 2
10 3 NN k = 4
Kernel k=1
10 2 NN k = 4
Kernel k=1
Kernel k=2 Kernel k=2
10 4 Kernel k=4 Kernel k=4 10 3
10 3
101 102 103 101 102 103 101 102 103
p p p
(a) 2-layer NN N = 10000 (b) 4-layer NN N = 500 (c) 2-Layer NN Student-Teacher; N = 8000

Figure 3. (a) and (b) Learning curves for neural networks (NNs) on “pure modes” as defined in eq. (35). (c) Learning curve for the student
teacher setup defined in (36). The theory curves shown as solid lines are again computed with eq. (21). The test error for the finite width
neural networks and NTK are shown with dots and triangles respectively. The generalization error was estimated by taking a random test
sample of 1000 data points. The average was taken over 25 trials and the standard deviations are shown with errorbars. The networks
were initialized with the default Gaussian NTK parameterization (Jacot et al., 2018) and trained with stochastic gradient descent (details
in SI Section 13).

degree. For “pure mode” k, the teacher is constructed with k = 4 mode is not learned at all in this range. Our theory
the following rule: again perfectly fits the experiments.
p
X
0 Lastly, we show that our theory also works for composite

f (x) = αi Qk (x> xi ), (35) functions that contain many different degree spherical har-
i=1 monics. In this setup, we randomly initialize a two layer
teacher neural network and train a student neural network
where again αi ∼ B(1/2) and xi ∼ p(x) are sampled
randomly. Figure 3(a) shows the learning curve for a fully f ∗ (x) = r> σ(Θx), f (x) = r> σ(Θx), (36)
connected 2-layer ReLU network with width N = 10000, M ×d
where Θ, Θ ∈ R are the feedforward weights for the
input dimension d = 30 and p0 = 10000. As before, we
student and teacher respectively, σ is an activation function
see that the lower k pure modes require less data to be fit.
and r, r ∈ RM are the student and teacher readout weights.
Experimental test errors for kernel regression with NTK on
Chosen in this way with ReLU activations, the teacher is
the same synthetic datasets are plotted as triangles. Our
composed of spherical harmonics of many different degrees
theory perfectly fits the experiments.
(Section 13 in SI). The total generalization error for this
Results from a 4-layer NN simulation are provided in Figure teacher student setup as well as the theoretical prediction of
3(b). Each hidden layer had N = 500 hidden units. We our theory is provided in Figure 3(c) for d = 25, N = 8000.
again see that the k = 2 mode is only learned for p > 200. They agree excellently. Results from additional neural net-
Learning Curves in Kernel Regression and Wide Neural Networks

100 100 100


MNIST Theory k=1-100
10 2 Kernel k=101-500
k=501-1000
NN Expt k=1001-5000
10 4 10 1

10 6

Eg
Eg

Eg
10 8 10 2
=0
10 10 = 10 6
10 1
= 10 4
= 10 2
10 12 10 3
101 102 101 102 103 101 102 103
p p p
(a) Gaussian kernel and measure in d = 20 (b) 3-Layer NN on MNIST, N = 800 (c) NTK regression on MNIST, λ = 0.

Figure 4. (a) Learning curves for Gaussian kernel in d = 20 dimensions with varying λ. For small λ, learning stages are visible at
p = N (d, k) for k = 1, 2 (p = 20, 210, vertical dashed lines) but the stages are obscured for non-negligible λ. (b) Learning curve for
3-layer NTK regression and a neural network (NN) on a subset of 8000 randomly sampled images of handwritten digits from MNIST. (c)
Aggregated NTK regression mode errors for the setup in (b). Eigenmodes of MNIST with larger eigenvalues are learned more rapidly
with increasing p.

work simulations are provided in Section 13 of the SI. Once the eigenvalues Λ and eigenvectors Φ> have been
identified, we compute the target function coefficients by
4.3. Gaussian Kernel Regression and Interpolation projecting the target data yc onto these principal compo-
nents wc = Λ−1/2 Φyc for each target c = 1, . . . , C. Once
We next test our theory on another widely-used kernel. The all of these ingredients are obtained, theoretical learning
setting where the probability measure and kernel are Gaus- curves can be computed using Algorithm 1 and multiple
1 0 2
sian, K(x, x0 ) = e− 2`2 ||x−x || , allows analytical compu- class formalism described in Section 2.5, providing esti-
tation of the eigenspectrum, {λk } (Rasmussen & Williams, mates of the error on the entire dataset incurred when train-
2005). In d dimensions, the k-th distinct
 eigenvalue corre- ing with a subsample of p < p̃ data points. An example
sponds to a set of N (d, k) = d+k−1 k ∼ dk
/k! degenerate where the discrete measure is taken as p̃ = 8000 images
eigenmodes. The spectrum itself decays exponentially. of handwritten digits from MNIST (Lecun et al., 1998)
In Figure 4(a), experimental learning curves for d = 20 di- and the kernel is NTK with 3 layers is provided in Figures
mensional standard normal random vector data and a Gaus- 4(c)and 4(b). For total generalization error, we find perfect
sian kernel with ` = 50 are compared to our theoretical agreement between kernel regression and neural network
predictions for varying ridge parameters λ. The target func- experiments, and our theory.
tion f ∗ (x) is constructed with the same rule we used for the
NTK experiments, shown in eq. 34. When λ is small, sharp 5. Conclusion
drops in the generalization error occur when p ≈ N (d, k)
for k = 1, 2. These drops are suppressed by the explicit In this paper, we presented an approximate theory of the
regularization λ. average generalization performance for kernel regression.
We studied our theory in the ridgeless limit to explain the
4.4. MNIST: Discrete Data Measure and Kernel PCA behavior of trained neural networks in the infinite width
limit (Jacot et al., 2018; Arora et al., 2019; Lee et al., 2019).
We can also test our theory for finite datasets by defining a We demonstrated how the RKHS eigenspectrum of NTK
probability measure with equal point mass on each of the encodes a preferential bias to learn high spectral modes
data points {xi }p̃i=1 in the dataset (including training and only after the sample size p is sufficiently large. Our theory
test sets): fits kernel regression experiments remarkably well. We
p̃ further experimentally verified that the theoretical learning
1X curves obtained in the infinite width limit provide a good
p(x) = δ(x − xi ). (37)
p̃ i=1 approximation of the learning curves for wide but finite-
width neural networks. Our MNIST result suggests that our
With this measure, the eigenvalue problem (6) becomes a theory can be applied to datasets with practical value.
p̃ × p̃ kernel PCA problem (see SI 14)

KΦ> = p̃Φ> Λ. (38)


Learning Curves in Kernel Regression and Wide Neural Networks

Acknowledgements Dai, F. and Xu, Y. Approximation Theory and Harmonic


Analysis on Spheres and Balls. Springer New York, 2013.
We thank Matthieu Wyart and Stefano Spigler for comments
and pointing to a recent version of their paper (Spigler et al., Dietrich, R., Opper, M., and Sompolinsky, H. Statistical
2019) with an independent derivation of the generalization mechanics of support vector networks. Physical Review
error scaling for power law kernel and target spectra (see Letters, 82(14):2975–2978, 1999.
(SI.43)). C. Pehlevan thanks the Harvard Data Science Ini-
tiative, Google and Intel for support. Efthimiou, C. and Frye, C. Spherical Harmonics In P
Dimensions. World Scientific Publishing Company, 2014.
References Evgeniou, T., Pontil, M., and Poggio, T. Regularization
Abramowitz, M. and Stegun, I. Handbook of Mathematical networks and support vector machines. Advances in Com-
Functions: With Formulas, Graphs, and Mathematical putational Mathematics, 13, 1999.
Tables. U.S. Department of Commerce, National Bureau
of Standards, 1972. Gyorfi, L., Kohler, M., Krzyzak, A., and Walk, H. A
distribution-free theory of nonparametric regression.
Arfken, G. Mathematical Methods for Physicists. Academic Journal of the American Statistical Association, 98(464):
Press, Inc., San Diego, third edition, 1985. 1084–1084, 2003.
Arora, S., Du, S. S., Hu, W., Li, Z., Salakhutdinov, R. R., Hestness, J., Narang, S., Ardalani, N., Diamos, G., Jun, H.,
and Wang, R. On exact computation with an infinitely Kianinejad, H., Patwary, M. M. A., Yang, Y., and Zhou, Y.
wide neural net. In Advances in Neural Information Pro- Deep learning scaling is predictable, empirically, 2017.
cessing Systems, pp. 8139–8148, 2019.
Jacot, A., Gabriel, F., and Hongler, C. Neural tangent kernel:
Belkin, M., Hsu, D. J., and Mitra, P. Overfitting or per-
Convergence and generalization in neural networks. In
fect fitting? risk bounds for classification and regression
Advances in neural information processing systems, pp.
rules that interpolate. In Advances in neural information
8571–8580, 2018.
processing systems, pp. 2300–2311, 2018a.
Belkin, M., Ma, S., and Mandal, S. To understand deep Lecun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-
learning we need to understand kernel learning. In Pro- based learning applied to document recognition. Proceed-
ceedings of the 35th International Conference on Ma- ings of the IEEE, 86(11):2278–2324, 1998.
chine Learning, pp. 541–549, 2018b.
LeCun, Y., Bengio, Y., and Hinton, G. Deep learning. Na-
Belkin, M., Hsu, D., Ma, S., and Mandal, S. Reconciling ture, 521(7553):436–444, 2015.
modern machine-learning practice and the classical bias–
variance trade-off. Proceedings of the National Academy Lee, J., Xiao, L., Schoenholz, S., Bahri, Y., Novak, R., Sohl-
of Sciences, 116(32):15849–15854, 2019a. Dickstein, J., and Pennington, J. Wide neural networks of
any depth evolve as linear models under gradient descent.
Belkin, M., Rakhlin, A., and Tsybakov, A. B. Does data In Advances in neural information processing systems,
interpolation contradict statistical optimality? In Interna- pp. 8570–8581, 2019.
tional Conference on Artificial Intelligence and Statistics,
pp. 1611–1619, 2019b. Liang, T. and Rakhlin, A. Just interpolate: Kernel “ridge-
less” regression can generalize, 2018.
Bietti, A. and Mairal, J. On the inductive bias of neural
tangent kernels. In Advances in Neural Information Pro- Luo, T., Ma, Z., Xu, Z.-Q. J., and Zhang, Y. Theory of
cessing Systems, pp. 12873–12884, 2019. the frequency principle for general deep neural networks,
2019.
Cao, Y., Fang, Z., Wu, Y., Zhou, D.-X., and Gu, Q. Towards
understanding the spectral bias of deep learning, 2019. Mercer, J. Functions of positive and negative type, and their
Cohen, O., Malka, O., and Ringel, Z. Learning curves for connection with the theory of integral equations. Philo-
deep neural networks: A gaussian field theory perspective, sophical Transactions of the Royal Society of London.
2019. Series A, 209:415–446, 1909.

Cucker, F. and Smale, S. Best choices for regularization pa- Mézard, M., Parisi, G., and Virasoro, M. Spin glass theory
rameters in learning theory: On the bias-variance problem. and beyond: An Introduction to the Replica Method and
Foundations of Computational Mathematics, 2:413–428, Its Applications, volume 9. World Scientific Publishing
2002. Company, 1987.
Learning Curves in Kernel Regression and Wide Neural Networks

Novak, R., Xiao, L., Hron, J., Lee, J., Alemi, A. A., Sohl- Vapnik, V. N. An overview of statistical learning theory.
Dickstein, J., and Schoenholz, S. S. Neural tangents: IEEE transactions on neural networks, 10(5):988–999,
Fast and easy infinite neural networks in python. In 1999.
International Conference on Learning Representations,
2020. Wahba, G. Spline Models for Observational Data. Society
for Industrial and Applied Mathematics, Philadelphia,
Opper, M. and Vivarelli, F. General bounds on bayes errors 1990.
for regression with gaussian processes. In Advances in
Xu, Z.-Q. J., Zhang, Y., Luo, T., Xiao, Y., and Ma, Z. Fre-
Neural Information Processing Systems 11, pp. 302–308,
quency principle: Fourier analysis sheds light on deep
1998.
neural networks, 2019a.
Pehlevan, C., Ali, F., and Ölveczky, B. P. Flexibility in
Xu, Z.-Q. J., Zhang, Y., and Xiao, Y. Training behavior of
motor timing constrains the topology and dynamics of
deep neural network in frequency domain. In Gedeon,
pattern generator circuits. Nature communications, 9(1):
T., Wong, K. W., and Lee, M. (eds.), Neural Information
1–15, 2018.
Processing, pp. 264–274, 2019b.
Rahaman, N., Baratin, A., Arpit, D., Draxler, F., Lin, M., Yang, G. and Salman, H. A fine-grained spectral perspective
Hamprecht, F., Bengio, Y., and Courville, A. On the on neural networks, 2019.
spectral bias of neural networks. In Proceedings of the
36th International Conference on Machine Learning, pp. Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals,
5301–5310, 2019. O. Understanding deep learning requires rethinking gen-
eralization. In International Conference on Learning
Rasmussen, C. E. and Williams, C. K. I. Gaussian Pro- Representations, 2017.
cesses for Machine Learning (Adaptive Computation and
Machine Learning). The MIT Press, 2005. Zhang, Y., Xu, Z.-Q. J., Luo, T., and Ma, Z. Explicitizing
an implicit bias of the frequency principle in two-layer
Schölkopf, B. and Smola, A. J. Learning with Kernels: neural networks, 2019.
Support Vector Machines, Regularization, Optimization,
and Beyond. MIT Press, 2001.

Sherrington, D. and Kirkpatrick, S. Solvable model of a


spin-glass. Phys. Rev. Lett., 35:1792–1796, 1975.

Smola, A. J., Ovari, Z. L., and Williamson, R. C. Regular-


ization with dot-product kernels. In Advances in neural
information processing systems, pp. 308–314, 2001.

Sollich, P. Learning curves for gaussian processes. In


Advances in neural information processing systems, pp.
344–350, 1999.

Sollich, P. Gaussian process regression with mismatched


models. In Advances in Neural Information Processing
Systems, pp. 519–526, 2002.

Sollich, P. and Halees, A. Learning curves for gaussian


process regression: Approximations and bounds. Neural
Computation, 14(6):1393–1428, 2002.

Spigler, S., Geiger, M., and Wyart, M. Asymptotic learning


curves of kernel methods: empirical data v.s. Teacher-
Student paradigm. arXiv e-prints, art. arXiv:1905.10843,
2019.

Spigler, S., Geiger, M., and Wyart, M. Asymptotic learning


curves of kernel methods: empirical data v.s. teacher-
student paradigm, 2019.
Learning Curves in Kernel Regression and Wide Neural Networks

1. Background on Kernel Machines Using the representer theorem, we may reformulate the
entire objective in terms of the p coefficients αi
1.1. Reproducing Kernel Hilbert Space
p
Let X ⊆ Rd and p(x) be a probability distribution over
X
L[f ] = (f (xi ) − yi )2 + λ||f ||2H
X . Let H be a Hilbert space with inner product h·, ·iH . A i=1
kernel K(·, ·) is said to be reproducing for H if function p X
X p
evaluation at any x ∈ X is the equivalent to the Hilbert = ( αj K(xi , xj ) − yi )2
inner product with K(·, x): K is reproducing for H if for i=1 j=1
all g ∈ H and all x ∈ X X D E
+λ αi αj K(xi , ·), K(xj , ·)
H
ij
hK(·, x), giH = g(x). (SI.1)
= α> K2 α − 2y> Kα + y> y + λα> Kα. (SI.7)

If such a kernel exists for a Hilbert space, then it is unique Optimizing this loss with respect to α gives
and defined as the reproducing kernel for the RKHS (Evge-
α = (K + λI)−1 y. (SI.8)
niou et al., 1999; Schölkopf & Smola, 2001).
Therefore the optimal function evaluated at a test point is
1.2. Mercer’s Theorem
f (x) = α> k(x) = y> (K + λI)−1 k(x). (SI.9)
Let H be a RKHS with kernel K. Mercer’s theorem (Mercer,
1909; Rasmussen & Williams, 2005) allows the eigendecom- 2. Derivation of the Generalization Error
position of K
+
p H have eigenvalues λρ for ρ ∈ Z . Define
Let the RKHS
K(x, x0 ) =
X
λρ φρ (x)φρ (x), (SI.2) ψρ (x) = λρ φρ (x), where φρ are the eigenfunctions of
ρ
the reproducing kernel for H. Let the target function have
the following expansion in terms of the kernel eigenfunc-
tions f ∗ (x) = ρ wρ ψρ (x). Define the design matrices
P
where the eigenvalue statement is
Φρ,i = φρ (xi ) and Λργ = λρ δργ . Then the average gener-
Z alization error for kernel regression is
dx0 p(x0 )K(x, x0 )φρ (x0 ) = λρ φρ (x). (SI.3)  
Eg = Tr D G2 {xi } (SI.10)

where
1.3. Representer Theorem  −1
1
Let H be a RKHS with inner product h., .iH . Consider the G= ΦΦ> + Λ−1 , Φ = Λ−1/2 Ψ. (SI.11)
λ
regularized learning problem
and
minf ∈H L̂[f ] + λ||f ||2H , (SI.4) D = Λ−1/2 ww> w
Λ−1/2 . (SI.12)

Proof. Define the student’s eigenfunction expansion


where L̂[f ] is an empirical cost defined on the discrete sup- P
f (x) = ρ ρ ψρ (x) and decompose the risk in the ba-
w
port of the dataset and λ > 0. The optimal solution to
sis of eigenfunctions:
the optimization problem above can always be written as
(Schölkopf & Smola, 2001) Eg ({xi }, f ∗ ) = (f (x) − y(x))2 x
X
p = (wρ − wρ )(wγ − wγ ) hψρ (x)ψγ (x)ix
X
f (x) = αi K(xi , x). (SI.5) ρ,γ
X
i=1 = λρ (wρ − wρ )2
ρ

1.4. Solution to Least Squares = (w − w)> Λ(w − w). (SI.13)

Specializing to the case of least squares regression, let Next, it suffices to calculate the weights w learned through
kernel regression. Define a matrix with elements Ψρ,i =
p
X ψρ (xi ). The training error for kernel regression is
L̂[f ] = (f (xi ) − yi )2 . (SI.6)
i=1
Etr = ||Ψ> w − y||2 + λ||w||22 (SI.14)
Learning Curves in Kernel Regression and Wide Neural Networks

The `2 norm on w is equivalent


P to the Hilbert norm on the 3. Solution of the PDE Using Method of
student function. If f (x) = ρ wρ ψρ (x) then Characteristics
||f ||2H = hf, f iH Here we derive the solution to the PDE in equation 17 of the
X X main text by adapting the method used by (Sollich, 1999).
= wρ wγ hψρ (·), ψγ (·)iH = wρ2 , (SI.15) We will prove both Propositions 2 and 3.
ργ ρ
Let
since hψρ (·), ψγ (·)iH = δρ,γ (Bietti & Mairal, 2019). This D E
fact can be verified by invoking the reproducing property gρ (p, v) ≡ G̃(p, v)ρρ , (SI.23)
of the kernel and it’s Mercer decomposition. Let g(·) =
P
ρ aρ ψρ (·). By the reproducing property and
X X
hK(·, x), g(·)iH = aγ ψρ (x) hψρ (·), ψγ (·)iH t(p, v) ≡ Tr hG(p, v)i = gρ (p, v). (SI.24)
ρ,γ ρ
X
= g(x) = aρ ψρ (x) (SI.16) It follows from equation 17 that t obeys the PDE
ρ
∂t(p, v) 1 ∂t(p, v)
Demanding equality of each term, we find = , (SI.25)
∂p λ + t ∂v
X
aγ hψρ (·), ψγ (·)iH = aρ (SI.17) with an initial condition t(0, v) = Tr(Λ−1 + vI)−1 . The
γ
solution to first order PDEs of the form is given by the
Due to the arbitrariness of aρ , we must have method of characteristics (Arfken, 1985), which we describe
hψρ (·), ψγ (·)iH = δρ,γ . We stress the difference be- below, and prove Proposition 2.
tween the action of the Hilbert inner product and averaging
feature functions over a dataset hψρ (x)ψγ (x)ix = λρ δρ,γ Proof of Proposition 2. The solution to (SI.25) is a surface
which produce different results. We will always decorate (t, p, v) ⊂ R3 that passes through the line (Tr(Λ−1 +
angular brackets with H to denote Hilbert inner product. vI)−1 , 0, v) and satisfies the PDE at all points. The
tangent plane to the solution surface at a point (t, p, v)
The training error has a unique minimum is span{( ∂p ∂t ∂t
, 1, 0), ( ∂v , 0, 1)}. Therefore a vector a =
3
w = (ΨΨ> + λI)−1 Ψy = (ΨΨ> + λI)−1 ΨΨ> w (at , ap , av ) ∈ R normal to the solution surface must satisfy
= w − λ(ΨΨ> + λI)−1 w, (SI.18) ∂t ∂t
at + ap = 0, at + av = 0.
where the target function is produced according to y = ∂p ∂v
Ψ> w. ∂t ∂t
One such normal vector is (−1, ∂p , ∂v ).
Plugging in the w that minimizes the training error into the
The PDE can be written as a dot product involving this
formula for the generalization error, we find normal vector,
> −1 > −1
Eg ({xi }, w) = λ2 w(ΨΨ + λI) Λ(ΨΨ + λI) w . 
∂t ∂t
 
1

(SI.19) −1, , · 0, 1, − = 0, (SI.26)
∂p ∂v λ+t
Defining
1
 −1 demonstrating that (0, 1, − λ+t ) is tangent to the solution
1
G = λΛ1/2 (ΨΨ> +λI)−1 Λ1/2 = >
ΦΦ + Λ−1
, surface. This allows us to parameterize one dimensional
λ curves along the solution in these tangent directions. Such
(SI.20)
curves are known as characteristics. Introducing a parameter
and
s ∈ R that varies along the one dimensional characteristic
D = Λ−1/2 ww> Λ−1/2 , (SI.21)
curves, we get
and identifying the terms in (SI.19) with these definitions,
we obtain the desired result. Then each component of the dt dp dv 1
= 0, = 1, =− . (SI.27)
mode error is given by: ds ds ds λ+t
X The first of these equations indicate that t is constant along
Eρ = Dρ,γ hG2γ,ρ i (SI.22)
each characteristic curve. Integrating along the parameter,
γ s
p = s + p0 and v = − λ+t + v0 where p0 is the value of p
when s = 0 and v0 is the value of v at s = 0. Without loss
Learning Curves in Kernel Regression and Wide Neural Networks

of generality, take p0 = 0 so that s = p. At s = 0, we have The error in mode ρ is therefore


our initial condition −2  −1
hw2ρ i

−1 1 p pγ(p)
t(0, v) = Tr Λ−1 + v0 I . (SI.28) Eρ = + 1− ,
λρ λρ λ + t(p) (λ + t(p))2
Since t takes on the same value for each characteristic (SI.37)
   −1
−1 p so it suffices to numerically solve for t(p, 0) to recover
t(p, v) = Tr Λ + v + I , (SI.29)
λ + t(p, v) predictions of the mode errors. Equations (SI.29) (evaluated
at v = 0), (SI.34) and (SI.37) collectively prove Proposition
which gives an implicit solution for t(p, v). Now that we
3.
have solved for t(p, v), remembering (SI.24), we may write
 −1
1 p 4. Learning Curve for Power Law Spectra
gρ (p, v) = +v+ . (SI.30)
λρ λ + t(p, v)
For λ > 0, the mode errors asymptotically satisfy Eρ ∼
This equation proves Proposition 2 of the main text. O(p−2 ) since λ+t p (λ+t)2
∼ λp and (λ+t) 2 −γp ∼ Op (1) (see be-

low). Although each mode error decays asymptotically like


Next, we compute the modal generalization errors Eρ and
p−2 , the total generalization error can have nontrivial scal-
prove Proposition 3.
ing with p that depends on both the kernel and the target
function.
Proof of Proposition 3. Computing generalization error of
kernel regression requires the differentiation with respect to To illustrate the dependence of the learning curves on the
v at v = 0 (eq.s (11) and (16) of main text). Since G2 is choice of kernel and target function, we consider a case
diagonal, the mode errors only depend on the diagonals of where both have power law spectra. Specifically, we assume
∂g
D and on G2ρ,ρ = − ∂vρ |v=0 : that λρ = ρ−b and a2ρ ≡ w2ρ λρ = ρ−a for ρ = 1, 2, .... We
introduce the variable z = t+λ to simplify the computations
X hw2ρ i ∂gρ below. We further approximate the sums over modes with
Eρ = Dρ,γ G2γ,ρ = − . (SI.31) integrals
γ
λρ ∂v v=0

z2 dρ ρ−a
Z
We proceed with calculating the derivative in the above Eg ≈ 2 2 . (SI.38)
z − pγ p −b
zρ +1
1
equation.
−2
We use the same approximation technique to study the be-

∂gρ (p, 0) 1 p
=− + havior of z(p)
∂v λρ λ + t(p, 0)
 
p ∂t(p, 0)  1− 1b Z ∞
× 1− . (SI.32) z ∞ dρ
Z
z du
(λ + t)2 ∂v z =λ+ z b =λ+
p 1 1 + pρ p b
(z/p)1/b 1 + u
We need to calculate ∂t(p,v)
∂v |v=0
 1− 1b
z
  =λ+ F (b, p, z), (SI.39)
∂t(p, 0) p ∂t(p, 0) p
= −γ 1 − , (SI.33)
∂v (λ + t)2 ∂v R∞ du −1/(b−1)
where F (b, p, z) = (z/p)1/b 1+u b . If p  λ then
where 1−b b
−2 z ≈ λ, otherwise z ≈ p F (b, p, z) . Further, the scaling
X 1 p z ∼ O(p1−b ) is self-consistent since the lower endpoint of
γ≡ + . (SI.34)
λρ λ + t(p, 0) integration (z/p)1/b ∼ p−1 → 0 so F (b, z, p) approaches
ρ
a constant F (b) for p → ∞
Solving for the derivative, we get Z ∞
1−b b du
∂t(p, 0) γ z∼p F (b) , F (b, z, p) ∼ F (b) = .
=− p , (SI.35) 0 1 + ub
∂v 1 − γ (λ+t) 2 (SI.40)
and We similarly find that pγ(p) ∼ O(p2−2b ) if p  λ−1/(b−1) .
 −2  −1 The mode-independent prefactor is approximately constant
∂gρ (p, 0) 1 p γp z2
=− + 1− . z 2 −γp ∼ Op (1).
∂v λρ λ+t (λ + t)2
(SI.36) We can use all of these facts to identify scalings of Eg . We
Learning Curves in Kernel Regression and Wide Neural Networks

We can use all of these facts to identify the scalings of E_g. We will first consider the case where p \ll \lambda^{-1/(b-1)}:

E_g \sim \int_1^\infty \frac{d\rho\,\rho^{-a}}{\left(p^b\rho^{-b} + 1\right)^2}
\approx p^{-2b}\int_1^p d\rho\,\rho^{-a+2b} + \int_p^\infty d\rho\,\rho^{-a}
= \frac{1}{a - 1 - 2b}\,p^{-2b} + \frac{2b}{(a-1)(2b + 1 - a)}\,p^{-(a-1)} .  (SI.41)

If 2b > a - 1 then the second term dominates, indicating that higher frequency modes k > p provide a greater contribution to the error due to the slow decay in the target power. In this case E_g \sim p^{-(a-1)}. If, on the other hand, 2b < a - 1, then lower frequency modes k < p dominate the error and E_g \sim p^{-2b}.

Now, suppose that p > \lambda^{-1/(b-1)}. In this regime

E_g \sim \int_1^\infty \frac{d\rho\,\rho^{-a}}{\left(\frac{p}{\lambda}\rho^{-b} + 1\right)^2}
\approx \frac{\lambda^2}{p^2}\int_1^{(p/\lambda)^{1/b}} d\rho\,\rho^{2b-a} + \int_{(p/\lambda)^{1/b}}^\infty d\rho\,\rho^{-a}
= \frac{\lambda^2}{p^2}\frac{1}{2b - a + 1}\left[\left(\frac{p}{\lambda}\right)^{(2b-a+1)/b} - 1\right] + \frac{1}{a-1}\left(\frac{p}{\lambda}\right)^{(1-a)/b} .  (SI.42)

Here there are two possible scalings. If 2b > a - 1 then E_g \sim p^{-(a-1)/b}, while 2b < a - 1 implies E_g \sim p^{-2}. So the total error scales like

E_g \sim p^{-\min\{a-1,\,2b\}} , \qquad p < \lambda^{-1/(b-1)} ,
E_g \sim p^{-\min\{a-1,\,2b\}/b} , \qquad p > \lambda^{-1/(b-1)} .  (SI.43)

A verification of this scaling is provided in Figure SI.1, which shows the behavior of z and E_g in these two regimes. When the explicit regularization is low (or zero), so that p < \lambda^{-1/(b-1)}, our equations reproduce the power law scalings derived with Fourier analysis in (Spigler et al., 2019).^2 The slower asymptotic decay of the generalization error when the explicit regularization \lambda is large relative to the sample size indicates that explicit regularization hurts performance. The decay exponents also indicate that the RKHS eigenspectrum should decay with an exponent at least as large as b^* > \frac{a-1}{2} for optimal asymptotics. Kernels with slow decays in their RKHS spectra induce larger errors.

^2 We note that in a recent version of their paper, Spigler et al. (2019) used our formalism to independently derive the scalings in (SI.43) for the ridgeless (\lambda = 0) case. Our calculation in an earlier preprint had missed the possible \sim p^{-2b} and \sim p^{-2} scalings, which we corrected after their paper.
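As a quick numerical sanity check of (SI.43) (again an illustrative sketch rather than the original experiments), one can sum the predicted mode errors over a grid of sample sizes in the small-ridge regime and compare the fitted log-log slope with -min{a - 1, 2b}; the values of a, b, the ridge and the mode cutoff below are arbitrary.

```python
import numpy as np

def total_error(p, lam, a, b, M=100_000, n_iter=200):
    """Theoretical E_g for power-law spectra, from (SI.29) at v = 0 and the mode errors (SI.37)."""
    rho = np.arange(1, M + 1, dtype=float)
    eigs = rho ** (-b)
    t = np.sum(eigs)
    for _ in range(n_iter):
        t = np.sum(1.0 / (1.0 / eigs + p / (lam + t)))
    z = lam + t
    gamma = np.sum(1.0 / (1.0 / eigs + p / z) ** 2)
    return np.sum(rho ** (-a) / (1.0 + p * eigs / z) ** 2) / (1.0 - p * gamma / z ** 2)

a, b, lam = 2.0, 1.5, 1e-10            # small ridge, so p << lam^{-1/(b-1)} throughout
ps = np.array([100, 200, 400, 800, 1600])
Eg = np.array([total_error(p, lam, a, b) for p in ps])
slope = np.polyfit(np.log(ps), np.log(Eg), 1)[0]
print(f"fitted exponent {slope:.2f} vs predicted {-min(a - 1, 2 * b):.2f}")
```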

[Figure SI.1 appears here: log-log plots of (a) z = t + \lambda and (b) E_g(p) as functions of p, comparing the numerical solution with the approximate scalings p^{1-b}, p^{1-a} and p^{(1-a)/b}.]

Figure SI.1. Approximate scaling of the learning curve for spectra that decay as power laws, \lambda_k \sim k^{-b} and a^2_k \equiv \langle w^2_k \rangle \lambda_k = k^{-a}. Panel (a) shows the numerical solution of the implicit equation for t + \lambda as a function of p, compared with the approximate scalings. There are two regimes, separated by p \approx \lambda^{-1/(b-1)}: for small p, z \sim p^{1-b}, while for large p, z \sim \lambda. The total generalization error is shown in (b); it scales like p^{1-a} for small p and p^{(1-a)/b} for large p.

5. Replica Calculation

In this section, we present the replica trick and the saddle-point approximation summarized in Section 2.3 of the main text. Our goal is to show that the continuous approximation of the main paper and the previous section can be interpreted as a finite-size saddle-point approximation to the replicated system under a replica symmetry ansatz. We will present a detailed treatment of the thermodynamic limit and the replica symmetric ansatz in a different paper.

Let \tilde{G}(p, v) = \left\langle \left(\frac{1}{\lambda}\Phi\Phi^\top + \Lambda^{-1} + vI\right)^{-1} \right\rangle_D. To obtain the average elements \langle \tilde{G}(p, v)_{\rho,\gamma} \rangle we will use a Gaussian integral representation of the matrix inverse,

\langle \tilde{G}(p, v)_{\rho,\gamma} \rangle = \frac{\partial^2}{\partial h_\rho \partial h_\gamma}\left\langle \frac{1}{Z}\int du\, e^{-\frac{1}{2}u^\top\left(\frac{1}{\lambda}\Phi\Phi^\top + \Lambda^{-1} + vI\right)u + h\cdot u} \right\rangle_\Phi ,  (SI.44)

where

Z = \int du\, e^{-\frac{1}{2}u^\top\left(\frac{1}{\lambda}\Phi\Phi^\top + \Lambda^{-1} + vI\right)u} ,  (SI.45)

and make use of the identity Z^{-1} = \lim_{n\to 0} Z^{n-1} to rewrite the entire average in the form

R(h) = \int \prod_{a=1}^n du^a \left\langle e^{-\frac{1}{2}\sum_a u^{a\top}\left(\frac{1}{\lambda}\Phi\Phi^\top + \Lambda^{-1} + vI\right)u^a + h\cdot u^{(1)}} \right\rangle ,  (SI.46)

with the identification that

\langle \tilde{G}(p, v)_{\rho,\gamma} \rangle = \lim_{n\to 0}\frac{\partial^2}{\partial h_\rho \partial h_\gamma} R(h)\Big|_{h=0} .  (SI.47)

Following the replica method from the physics of disordered systems, we will first restrict ourselves to integer n and then analytically continue the resulting expressions to take the limit n \to 0.

Averaging over the quenched disorder (the dataset), with the assumption that the residual error (\bar{w} - w)\cdot\Psi(x_i) is a Gaussian process, we find

\left\langle e^{-\frac{1}{2\lambda}\sum_a u^{a\top}\Phi\Phi^\top u^a} \right\rangle = e^{-\frac{p}{2}\log\det\left(I + \frac{1}{\lambda}Q\right)} ,  (SI.48)

where the order parameters Q_{ab} = u^a\cdot u^b have been introduced. To enforce the definition of these order parameters, Dirac delta functions are inserted into the expression for R. We then represent each delta function as a Fourier integral so that the integrals over the u^a can be computed:

\delta(Q_{ab} - u^a\cdot u^b) = \int d\hat{Q}_{ab}\, e^{iQ_{ab}\hat{Q}_{ab} - i\hat{Q}_{ab}u^a\cdot u^b} .  (SI.49)

After inserting delta functions to enforce the order parameter definitions, we are left with integrals over the thermal degrees of freedom:

\int \prod_{a=1}^n du^a\, e^{-\frac{1}{2}\sum_a u^a\cdot\Lambda^{-1}u^a - i\sum_{ab}\hat{Q}_{ab}u^a\cdot u^b + u^{(1)}\cdot h}
= e^{-\frac{1}{2}\sum_\rho \log\det\left(\frac{1}{\lambda_\rho}I + 2i\hat{Q}\right) + \frac{1}{2}\sum_\rho h_\rho^2\left[\left(\frac{1}{\lambda_\rho}I + 2i\hat{Q}\right)^{-1}\right]_{11}} .  (SI.50)

We now make a replica symmetric ansatz, Q_{ab} = q\delta_{ab} + q_0 and 2i\hat{Q}_{ab} = \hat{q}\delta_{ab} + \hat{q}_0. Under this ansatz R(h) can be rewritten as

R(h) = \int dq\, dq_0\, d\hat{q}\, d\hat{q}_0\, e^{-pn\mathcal{F}(q, q_0, \hat{q}, \hat{q}_0)}\, e^{\frac{1}{2}\sum_\rho h_\rho^2\left[\left(\frac{1}{\lambda_\rho}I + 2i\hat{Q}\right)^{-1}\right]_{11}} ,  (SI.51)

where the free energy is

2p\mathcal{F}(q, q_0, \hat{q}, \hat{q}_0) = p\log\left(1 + \frac{q}{\lambda}\right) + p\frac{q_0}{\lambda + q} + v(q + q_0) - (q + q_0)(\hat{q} + \hat{q}_0) + q_0\hat{q}_0 + \sum_\rho\left[\log\left(\frac{1}{\lambda_\rho} + \hat{q}\right) + \frac{\hat{q}_0}{\frac{1}{\lambda_\rho} + \hat{q}} + 1\right] .  (SI.52)

In the limit p \to \infty, R(h) is dominated by the saddle point of the free energy, where \nabla\mathcal{F}(q, \hat{q}, q_0, \hat{q}_0) = 0. The saddle point equations are

\hat{q}^* = \frac{p}{q^* + \lambda} + v , \qquad
q^* = \sum_\rho \frac{1}{\frac{1}{\lambda_\rho} + \hat{q}^*} = \sum_\rho \frac{1}{\frac{1}{\lambda_\rho} + v + \frac{p}{q^* + \lambda}} , \qquad
q_0^* = \hat{q}_0^* = 0 .  (SI.53)

We see that q^* is exactly equivalent to t(p, v) defined in (SI.29) for the continuous approximation. Under the saddle point approximation we find

R(h) \approx e^{-np\mathcal{F}(q^*, q_0^*, \hat{q}^*, \hat{q}_0^*)}\, e^{\frac{1}{2}\sum_\rho h_\rho^2\frac{1}{\frac{1}{\lambda_\rho} + \hat{q}^*}} .  (SI.54)

Taking the n \to 0 limit as promised, we obtain the normalized average

\tilde{R}(h) \equiv \lim_{n\to 0} R(h) = e^{\frac{1}{2}\sum_\rho h_\rho^2\frac{1}{\frac{1}{\lambda_\rho} + \hat{q}^*}} ,  (SI.55)

so that the matrix elements are

\langle \tilde{G}(p, v)_{\rho,\gamma} \rangle = \frac{\partial^2}{\partial h_\rho \partial h_\gamma}\tilde{R}(h)\Big|_{h=0} = \frac{\delta_{\rho,\gamma}}{\frac{1}{\lambda_\rho} + v + \frac{p}{\lambda + q^*}} , \qquad
q^* = \sum_\rho \frac{1}{\frac{1}{\lambda_\rho} + v + \frac{p}{\lambda + q^*}} .  (SI.56)

Using our formula for the mode errors, we find

E_\rho = \sum_\gamma D_{\rho,\gamma}\langle \tilde{G}(p, v)^2_{\gamma,\rho} \rangle = -D_{\rho,\rho}\frac{\partial}{\partial v}\langle \tilde{G}(p, v)_{\rho,\rho} \rangle\Big|_{v=0} = \frac{\langle w^2_\rho \rangle}{\lambda_\rho}\frac{(\lambda + q^*)^2}{(\lambda + q^*)^2 - \gamma p}\left(\frac{1}{\lambda_\rho} + \frac{p}{\lambda + q^*}\right)^{-2} ,  (SI.57)

consistent with our result from the continuous approximation.
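The saddle-point result (SI.56) can be checked against a direct Monte-Carlo average under the Gaussian-feature surrogate assumed in (SI.48), i.e., with the feature values replaced by i.i.d. standard Gaussians so that the features are orthonormal on average. The sketch below is ours; the toy spectrum, mode cutoff and sample counts are arbitrary, and the agreement is only approximate at finite p.

```python
import numpy as np

rng = np.random.default_rng(0)
M, p, lam, v = 30, 200, 1e-2, 1e-3
eigs = 1.0 / np.arange(1, M + 1) ** 2           # lambda_rho, an arbitrary toy spectrum
Lam_inv = np.diag(1.0 / eigs)

# Monte-Carlo estimate of <(Phi Phi^T / lambda + Lambda^{-1} + v I)^{-1}> with Gaussian surrogate features
diag_mc = np.zeros(M)
n_samples = 200
for _ in range(n_samples):
    Phi = rng.standard_normal((M, p))           # surrogate for phi_rho(x_i), <phi_rho phi_gamma> = delta
    G = np.linalg.inv(Phi @ Phi.T / lam + Lam_inv + v * np.eye(M))
    diag_mc += np.diag(G) / n_samples

# Saddle-point prediction (SI.56): solve for q* by fixed-point iteration
q = np.sum(eigs)
for _ in range(1000):
    q = np.sum(1.0 / (1.0 / eigs + v + p / (lam + q)))
diag_theory = 1.0 / (1.0 / eigs + v + p / (lam + q))

print(np.max(np.abs(diag_mc - diag_theory) / diag_theory))   # maximum relative deviation
```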

6. Spectral Dependence of Learning Curves

We want to calculate how different mode errors change as we add one more sample. We study

\frac{1}{2}\frac{d}{dp}\log\frac{E_\rho}{E_\gamma} ,  (SI.58)

where E_\rho is given by Eq. (21). Evaluating the derivative, we find

\frac{1}{2}\frac{d}{dp}\log\frac{E_\rho}{E_\gamma} = -\left(\frac{1}{\frac{1}{\lambda_\rho} + \frac{p}{\lambda + t}} - \frac{1}{\frac{1}{\lambda_\gamma} + \frac{p}{\lambda + t}}\right)\frac{\partial}{\partial p}\left(\frac{p}{\lambda + t}\right) .  (SI.59)

Using Eq. (22),

\frac{\partial t}{\partial p} = -\frac{\partial}{\partial p}\left(\frac{p}{\lambda + t}\right)\sum_\rho\left(\frac{1}{\lambda_\rho} + \frac{p}{\lambda + t}\right)^{-2} = -\gamma\,\frac{\partial}{\partial p}\left(\frac{p}{\lambda + t}\right) ,  (SI.60)

where we identified the sum with \gamma. Inserting this, we obtain

\frac{1}{2}\frac{d}{dp}\log\frac{E_\rho}{E_\gamma} = \left(\frac{1}{\frac{1}{\lambda_\rho} + \frac{p}{\lambda + t}} - \frac{1}{\frac{1}{\lambda_\gamma} + \frac{p}{\lambda + t}}\right)\frac{1}{\gamma}\frac{\partial t}{\partial p} .  (SI.61)

Finally, solving for \partial t/\partial p from (SI.60), we get

\frac{\partial t}{\partial p} = -\frac{1}{\lambda + t}\,\frac{(\lambda + t)^2\gamma}{(\lambda + t)^2 - p\gamma} = -\frac{1}{\lambda + t}\,\mathrm{Tr}\,G^2 ,  (SI.62)

proving that \partial t/\partial p < 0. Taking \lambda_\gamma > \lambda_\rho without loss of generality, it follows that

\frac{d}{dp}\log\frac{E_\rho}{E_\gamma} > 0 \;\Rightarrow\; \frac{d}{dp}\log E_\rho > \frac{d}{dp}\log E_\gamma .  (SI.63)

7. Spherical Harmonics

Let -\Delta represent the Laplace-Beltrami operator in R^d. Spherical harmonics \{Y_{km}\} in dimension d are harmonic (-\Delta Y_{km}(x) = 0), homogeneous (Y_{km}(tx) = t^k Y_{km}(x)) polynomials that are orthonormal with respect to the uniform measure on S^{d-1} (Efthimiou & Frye, 2014; Dai & Xu, 2013). The number of spherical harmonics of degree k in dimension d, denoted by N(d, k), is

N(d, k) = \frac{2k + d - 2}{k}\binom{k + d - 3}{k - 1} .  (SI.64)

The Laplace-Beltrami operator can be decomposed into radial and angular parts,

-\Delta = -\Delta_r - \Delta_{S^{d-1}} .  (SI.65)

Using this decomposition, the spherical harmonics are eigenfunctions of the surface Laplacian:

-\Delta_{S^{d-1}} Y_{km}(x) = k(k + d - 2)\, Y_{km}(x) .  (SI.66)

The spherical harmonics are related to the Gegenbauer polynomials \{Q_k\}, which are orthogonal with respect to the measure d\tau(z) = (1 - z^2)^{(d-3)/2} dz of inner products z = x^\top x' of uniformly sampled pairs x, x' \sim S^{d-1}. The Gegenbauer polynomials can be constructed with the Gram-Schmidt procedure and have the following properties:

Q_k(x^\top x') = \frac{1}{N(d, k)}\sum_{m=1}^{N(d,k)} Y_{km}(x) Y_{km}(x') , \qquad
\int_{-1}^1 Q_k(z) Q_\ell(z)\, d\tau(z) = \frac{\omega_{d-1}}{\omega_{d-2}}\frac{\delta_{k,\ell}}{N(d, k)} ,  (SI.67)

where \omega_{d-1} = \frac{2\pi^{d/2}}{\Gamma(d/2)} is the surface area of S^{d-1}.

8. Decomposition of Dot Product Kernels on S^{d-1}

For inputs sampled from the uniform measure on S^{d-1}, dot product kernels can be decomposed into the Gegenbauer polynomials introduced in SI Section 7.

Let K(x, x') = \kappa(x^\top x'). The kernel's orthogonal decomposition is

\kappa(z) = \sum_{k=0}^\infty \lambda_k N(d, k) Q_k(z) , \qquad
\lambda_k = \frac{\omega_{d-2}}{\omega_{d-1}}\int_{-1}^1 \kappa(z) Q_k(z)\, d\tau(z) .  (SI.68)

To numerically calculate the kernel eigenvalues of \kappa, we use Gauss-Gegenbauer quadrature (Abramowitz & Stegun, 1972) for the measure d\tau(z), so that for a quadrature scheme of order r

\int_{-1}^1 \kappa(z) Q_k(z)\, d\tau(z) \approx \sum_{i=1}^r w_i Q_k(z_i)\kappa(z_i) ,  (SI.69)

where the z_i are the r roots of Q_r(z), \alpha = \frac{d-2}{2} is the Gegenbauer index associated with d\tau(z), and the weights w_i are chosen as

w_i = \frac{\Gamma(r + \alpha + 1)}{\Gamma(r + 2\alpha + 1)}\,\frac{2^{2r + 2\alpha + 1}\, r!^2}{V_r'(z_i)\, V_{r+1}(z_i)} ,  (SI.70)

where

V_r(z) = 2^r r! (-1)^r Q_r(z) .  (SI.71)

For our calculations we take r = 1000.
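A minimal sketch (ours, not the paper's implementation) of (SI.68)-(SI.69): SciPy's roots_gegenbauer returns nodes and weights for the weight (1 - z^2)^{\alpha - 1/2}, which coincides with d\tau(z) for \alpha = (d - 2)/2, and the Gaussian rule for a given weight is unique, so it implements the same quadrature as (SI.70)-(SI.71). The example kernel is an arbitrary placeholder.

```python
import numpy as np
from scipy.special import roots_gegenbauer, eval_gegenbauer, gamma

def gegenbauer_Q(k, d, z):
    """Q_k normalized so that Q_k(1) = 1, the convention of (SI.67)."""
    alpha = (d - 2) / 2
    return eval_gegenbauer(k, alpha, z) / eval_gegenbauer(k, alpha, 1.0)

def kernel_eigenvalues(kappa, d, k_max, r=1000):
    """lambda_k of a dot-product kernel on S^{d-1} via (SI.68)-(SI.69)."""
    alpha = (d - 2) / 2
    z, w = roots_gegenbauer(r, alpha)      # nodes/weights for the weight (1 - z^2)^{alpha - 1/2} = dtau
    omega_ratio = gamma(d / 2) / gamma((d - 1) / 2) / np.sqrt(np.pi)   # omega_{d-2} / omega_{d-1}
    return np.array([omega_ratio * np.sum(w * gegenbauer_Q(k, d, z) * kappa(z))
                     for k in range(k_max + 1)])

# Example with an arbitrary smooth dot-product kernel
d = 10
lam = kernel_eigenvalues(lambda z: np.exp(z - 1.0), d, k_max=10)
print(lam)
```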

9. Frequency Dependence of Learning Curves in the d → ∞ Limit

Here, we consider an informative limit where the input dimension d goes to infinity.

Denoting the index \rho = (k, m), we can write the mode error (SI.37), after some rearranging, as

E_{km} = \frac{(\lambda + t)^2}{1 - \frac{p\gamma}{(\lambda + t)^2}}\,\frac{\lambda_k\langle w^2_{km} \rangle}{(\lambda + t + p\lambda_k)^2} ,  (SI.72)

where t and \gamma, after performing the sum over degenerate indices, are

t = \sum_m \frac{N(d, m)(\lambda + t)\lambda_m}{\lambda + t + p\lambda_m} , \qquad
\gamma = \sum_m \frac{N(d, m)(\lambda + t)^2\lambda_m^2}{(\lambda + t + p\lambda_m)^2} .  (SI.73)

In the limit d \to \infty, the degeneracy factor (SI.64) behaves as N(d, k) \sim O(d^k). We note that for dot-product kernels \lambda_k scales with d as \lambda_k \sim d^{-k} (Smola et al., 2001) (Figure 1), which leads us to define the O(1) parameter \bar{\lambda}_k = d^k\lambda_k. Plugging these in, we get

E_{km}(g_k) = \frac{d^{-k}(t + \lambda)^2}{1 - \tilde{\gamma}}\,\frac{\bar{\lambda}_k\langle \bar{w}^2_{km} \rangle}{(t + \lambda + g_k\bar{\lambda}_k)^2} , \qquad
t = \sum_m \frac{(t + \lambda)\bar{\lambda}_m}{t + \lambda + g_m\bar{\lambda}_m} , \qquad
\tilde{\gamma} = \sum_m \frac{g_m\bar{\lambda}_m^2}{(t + \lambda + g_m\bar{\lambda}_m)^2} ,  (SI.74)

where g_k = p/d^k is the ratio of the sample size to the degeneracy. Furthermore, we want to calculate the ratio E_{km}(p)/E_{km}(0) to probe how much the mode errors move from their initial value:

\frac{E_{km}(p)}{E_{km}(0)} = \frac{1}{1 - \tilde{\gamma}}\,\frac{1}{\left(1 + \frac{g_k\bar{\lambda}_k}{t + \lambda}\right)^2} .  (SI.75)

Let us consider an integer l such that the scaling p = \alpha d^l holds. This leads to three different asymptotic behaviors for the g_k:

g_k \sim O(d^{l-k}) \gg O(1) , \quad k < l ;
g_k = \alpha \sim O(1) , \quad k = l ;
g_k \sim O(d^{l-k}) \ll O(1) , \quad k > l .  (SI.76)

If we assume t \sim O(1), we get an asymptotically consistent set of equations:

t \approx \sum_{m > l}\bar{\lambda}_m + a(\alpha, t, \lambda, \bar{\lambda}_l) \sim O(1) , \qquad
\tilde{\gamma} \approx b(\alpha, t, \lambda, \bar{\lambda}_l) \sim O(1) ,  (SI.77)

where a and b are the l-th terms in the sums for t and \tilde{\gamma}, respectively, and are given by

a(\alpha, t, \lambda, \bar{\lambda}_l) = \frac{(t + \lambda)\bar{\lambda}_l}{t + \lambda + \alpha\bar{\lambda}_l} , \qquad
b(\alpha, t, \lambda, \bar{\lambda}_l) = \frac{\alpha\bar{\lambda}_l^2}{(t + \lambda + \alpha\bar{\lambda}_l)^2} .  (SI.78)

Then, using (SI.75), (SI.76) and (SI.77), we find the errors associated with the different modes:

k < l: \quad \frac{E_{km}(\alpha)}{E_{km}(0)} \sim O(d^{2(k-l)}) \approx 0 ;
k > l: \quad \frac{E_{km}(\alpha)}{E_{km}(0)} \approx \frac{1}{1 - \tilde{\gamma}(\alpha)} ;
k = l: \quad \frac{E_{km}(\alpha)}{E_{km}(0)} = s(\alpha) \sim O(1) ,  (SI.79)

where s(\alpha) is given by

s(\alpha) = \frac{1}{\left(1 + \frac{\alpha\bar{\lambda}_l}{t + \lambda}\right)^2}\,\frac{1}{1 - \tilde{\gamma}(\alpha)} .  (SI.80)

Note that \lim_{\alpha\to 0}\tilde{\gamma}(\alpha) = \lim_{\alpha\to\infty}\tilde{\gamma}(\alpha) = 0, and \tilde{\gamma} is non-zero in between. Then, for large \alpha, in the limit we are considering,

k < l: \quad \frac{E_{km}(\alpha)}{E_{km}(0)} \approx 0 ;
k > l: \quad \frac{E_{km}(\alpha)}{E_{km}(0)} \approx 1 ;
k = l: \quad \frac{E_{km}(\alpha)}{E_{km}(0)} \approx \frac{\left(\lambda + \sum_{m>l}\bar{\lambda}_m\right)^2}{\bar{\lambda}_l^2\,\alpha^2} .  (SI.81)
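The stage-like behavior implied by (SI.77)-(SI.80) can be visualized by solving the O(1) self-consistent equation for t at each \alpha. The following sketch is ours and purely illustrative; the choice of \bar{\lambda}_k, the ridge and the truncation are arbitrary.

```python
import numpy as np

def stage_ratio(alpha, l, lam_bar, lam=1e-3, n_iter=200):
    """E_{lm}(alpha)/E_{lm}(0) for the on-stage mode k = l, from (SI.77), (SI.78) and (SI.80)."""
    t = np.sum(lam_bar[l + 1:])                      # modes k > l contribute their full lambda_bar
    for _ in range(n_iter):                          # add the l-th term a(alpha, t, ...) self-consistently
        t = np.sum(lam_bar[l + 1:]) + (t + lam) * lam_bar[l] / (t + lam + alpha * lam_bar[l])
    gamma_t = alpha * lam_bar[l] ** 2 / (t + lam + alpha * lam_bar[l]) ** 2   # b(alpha, t, ...)
    return 1.0 / ((1.0 + alpha * lam_bar[l] / (t + lam)) ** 2 * (1.0 - gamma_t))

lam_bar = 1.0 / (1.0 + np.arange(10)) ** 2           # an arbitrary O(1) spectrum bar{lambda}_k
for alpha in [0.1, 1.0, 10.0, 100.0]:
    print(alpha, [round(float(stage_ratio(alpha, l, lam_bar)), 3) for l in range(4)])
```

For each degree l, the printed ratio moves from roughly 1 at small \alpha toward 0 at large \alpha, which is the stage-wise learning described above.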

10. Neural Tangent Kernel

The neural tangent kernel is

K_{NTK}(x, x') = \sum_i \left\langle \frac{\partial f_\theta(x)}{\partial\theta_i}\frac{\partial f_\theta(x')}{\partial\theta_i} \right\rangle_\theta .  (SI.82)

For a neural network, it is convenient to compute this recursively in terms of the Neural Network Gaussian Process (NNGP) kernel, which corresponds to only training the readout weights of the final layer (Jacot et al., 2018; Arora et al., 2019). We will restrict our attention to networks with zero bias and nonlinear activation function \sigma. Then

K^{(1)}_{NTK}(x, x') = K^{(1)}_{NNGP}(x, x') ,
K^{(2)}_{NTK}(x, x') = K^{(2)}_{NNGP}(x, x') + K^{(1)}_{NTK}(x, x')\dot{K}^{(2)}(x, x') ,
\vdots
K^{(L)}_{NTK}(x, x') = K^{(L)}_{NNGP}(x, x') + K^{(L-1)}_{NTK}(x, x')\dot{K}^{(L)}(x, x') ,  (SI.83)

where

K^{(L)}_{NNGP}(x, x') = E_{(\alpha,\beta)\sim p^{(L-1)}_{x,x'}}\,\sigma(\alpha)\sigma(\beta) , \qquad
\dot{K}^{(L)}(x, x') = E_{(\alpha,\beta)\sim p^{(L-1)}_{x,x'}}\,\dot{\sigma}(\alpha)\dot{\sigma}(\beta) ,
p^{(L-1)}_{x,x'} = \mathcal{N}\!\left(\begin{pmatrix}0\\0\end{pmatrix},\begin{pmatrix}K^{(L-1)}(x, x) & K^{(L-1)}(x, x')\\ K^{(L-1)}(x, x') & K^{(L-1)}(x', x')\end{pmatrix}\right) , \qquad
K^{(1)}_{NNGP}(x, x') = x^\top x' .  (SI.84)

If \sigma is chosen to be the ReLU activation, then we can analytically simplify the expression. Defining the function

f(z) = \frac{1}{\pi}\sqrt{1 - z^2} + \left(1 - \frac{1}{\pi}\arccos(z)\right) z ,  (SI.85)

we obtain

K^{(L)}_{NNGP}(x, x') = f^{\circ(L-1)}(x^\top x') , \qquad
\dot{K}^{(L)}(x, x') = 1 - \frac{1}{\pi}\arccos\!\left(f^{\circ(L-2)}(x^\top x')\right) ,  (SI.86)

where f^{\circ(L-1)}(z) is the function f composed with itself L - 1 times.

This simplification gives an exact recursive formula to compute the kernel as a function of z = x^\top x', which is what we use to compute the eigenspectrum with the quadrature scheme described in the previous section.

11. Spectra of Fully Connected ReLU NTK

A plot of the RKHS spectra of fully connected ReLU NTKs of varying depth is shown in Figure SI.2.

[Figure SI.2 appears here: log-log plot of \lambda_k N(d, k) versus degree k for depths \ell = 5, 10, 25, 50, 100, 500.]

Figure SI.2. Spectrum of the fully connected ReLU NTK without bias for varying depth \ell. As the depth increases, the spectrum whitens, causing derivatives of lower order to have infinite variance. As \ell \to \infty, \lambda_k N(d, k) \sim 1, implying that the kernel becomes non-analytic at the origin.

As the depth increases, the spectrum becomes more white; eventually, the kernel's trace \langle K(x, x) \rangle_x = \sum_k \lambda_k N(d, k) begins to diverge. Inference with such a kernel is equivalent to learning a function with infinite variance. Constraints on the variance of derivatives ||\nabla^n_{S^{d-1}} f(x)||^2 correspond to more restrictive constraints on the eigenspectrum of the RKHS. Specifically, \lambda_k N(d, k) \sim O(k^{-n-1/2}) implies that the n-th gradient has finite variance, ||\nabla^n_{S^{d-1}} f(x)||^2 < \infty.

Proof. By the representer theorem, let f(x) = \sum_{i=1}^p \alpha_i K(x, x_i). By Green's theorem, the variance of the n-th derivative can be rewritten as

||\nabla^n_{S^{d-1}} f(x)||^2 = \langle f(x)(-\Delta_{S^{d-1}})^n f(x) \rangle
= \sum_{kk'mm'ij}\alpha_i\alpha_j\lambda_k\lambda_{k'} Y_{km}(x_i) Y_{k'm'}(x_j)\,\langle Y_{km}(x)(-\Delta_{S^{d-1}})^n Y_{k'm'}(x) \rangle
= \sum_{kij}\lambda_k^2\, k^n (k + d - 2)^n N(d, k)\,\alpha_i\alpha_j\, Q_k(x_i^\top x_j)
\leq C p^2 (\alpha^*)^2 \sum_k \lambda_k^2\, k^n (k + d - 2)^n N(d, k)^2 ,  (SI.87)

where \alpha^* = \max_j|\alpha_j| and |Q_k(z)| \leq C N(d, k) for a universal constant C. A sufficient condition for this sum to converge is that \lambda_k^2 k^n (k + d - 2)^n N(d, k)^2 \sim O(k^{-1}), which is equivalent to demanding \lambda_k N(d, k) \sim O(k^{-n-1/2}), since (k + d - 2)^n \sim k^n as k \to \infty.
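Chaining the recursion (SI.85)-(SI.86) with the quadrature of (SI.69) gives a compact way to reproduce the qualitative whitening of the spectrum with depth. The sketch below is ours (not the code behind Figure SI.2); the dimension, depths and truncations are arbitrary.

```python
import numpy as np
from scipy.special import roots_gegenbauer, eval_gegenbauer, gamma, comb

def relu_ntk(z, depth):
    """Depth-`depth` fully connected ReLU NTK as a function of z = x.x', via (SI.83), (SI.85), (SI.86)."""
    f = lambda u: np.sqrt(np.clip(1 - u**2, 0, None)) / np.pi + (1 - np.arccos(np.clip(u, -1, 1)) / np.pi) * u
    nngp, ntk = z, z                              # layer 1: K_NNGP = K_NTK = x.x'
    for _ in range(2, depth + 1):
        kdot = 1 - np.arccos(np.clip(nngp, -1, 1)) / np.pi   # Kdot^(L): uses f composed (L-2) times
        nngp = f(nngp)                                       # K_NNGP^(L): f composed (L-1) times
        ntk = nngp + ntk * kdot                              # (SI.83)
    return ntk

def spectrum(kappa, d, k_max, r=1000):
    """lambda_k via the Gauss-Gegenbauer quadrature of (SI.68)-(SI.69)."""
    alpha = (d - 2) / 2
    z, w = roots_gegenbauer(r, alpha)
    omega_ratio = gamma(d / 2) / gamma((d - 1) / 2) / np.sqrt(np.pi)   # omega_{d-2}/omega_{d-1}
    Qk = lambda k: eval_gegenbauer(k, alpha, z) / eval_gegenbauer(k, alpha, 1.0)
    return np.array([omega_ratio * np.sum(w * Qk(k) * kappa(z)) for k in range(k_max + 1)])

d, k_max = 10, 20
N = np.array([1.0] + [(2*k + d - 2) / k * comb(k + d - 3, k - 1) for k in range(1, k_max + 1)])
for depth in [2, 5, 10]:
    lam = spectrum(lambda z: relu_ntk(z, depth), d, k_max)
    print(depth, (lam * N)[:8])                   # lambda_k N(d,k) flattens ("whitens") as depth grows
```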

12. Decomposition of Risk for Numerical Experiments

As we describe in Section 4.1 of the main text, the teacher functions for the kernel regression experiments are chosen as

f^*(x) = \sum_{i=1}^{p_0}\alpha_i K(x, x_i) ,  (SI.88)

where the coefficients \alpha_i \sim B(1/2) are randomly sampled from a centered Bernoulli distribution on \{\pm 1\} and the points x_i \sim p(x) are drawn from the same distribution as the training data. In general p_0 is not the same as the number of samples p. Choosing a function of this form is very convenient for producing theoretical predictions of the mode errors, as we discuss below.

12.1. Theoretical Mode Errors

Since the matrix elements G^2_{\rho\rho} are determined completely by the kernel eigenvalues \{\lambda_\rho\}, it suffices to calculate the diagonal elements of D to find the generalization error. For a teacher function sampled in the way described above, there is a convenient expression for D_{\rho\rho}.

The teacher function admits an expansion in the basis of kernel eigenfunctions:

f^*(x) = \sum_\rho w_\rho \psi_\rho(x) .  (SI.89)

Using the Mercer decomposition of the kernel, we can identify the coefficients:

f^*(x) = \sum_{i=1}^{p_0}\alpha_i K(x, x_i) = \sum_\rho\left(\sum_i \alpha_i\psi_\rho(x_i)\right)\psi_\rho(x) .  (SI.90)

Comparing each term in these two expressions, we identify the coefficient of the \rho-th eigenfunction:

w_\rho = \sum_i \alpha_i\psi_\rho(x_i) .  (SI.91)

We now need to compute D_{\rho\rho} by averaging w^2_\rho over all possible teachers:

D_{\rho\rho} = \frac{1}{\lambda_\rho}\langle w^2_\rho \rangle = \frac{1}{\lambda_\rho}\sum_{ij}\langle\alpha_i\alpha_j\rangle\langle\psi_\rho(x_i)\psi_\rho(x_j)\rangle = \frac{1}{\lambda_\rho}\sum_i\langle\psi_\rho(x_i)\psi_\rho(x_i)\rangle = \frac{p_0\lambda_\rho}{\lambda_\rho} = p_0 ,  (SI.92)

since \langle\psi_\rho(x)\psi_\rho(x)\rangle = \lambda_\rho. Thus it suffices to calculate \partial g_\rho(p, v)/\partial v for each mode and then compute the mode errors with

E_\rho = -d_\rho\left.\frac{\partial g_\rho(p, v)}{\partial v}\right|_{v=0} ,  (SI.93)

where \partial g_\rho/\partial v|_{v=0} is evaluated in terms of the numerical solution for t(p, 0).

12.2. Empirical Mode Errors

By the representer theorem, we may represent the student function as f(x) = \sum_{i=1}^P \alpha_i K(x, x_i), where the x_i are the P training inputs. Then the generalization error is given by

E_g = \langle (f(x) - f^*(x))^2 \rangle
= \sum_{\rho\gamma}\lambda_\rho\lambda_\gamma\left(\sum_{j=1}^P\alpha_j\phi_\rho(x_j) - \sum_{i=1}^{P_0}\bar{\alpha}_i\phi_\rho(\bar{x}_i)\right)\left(\sum_{j=1}^P\alpha_j\phi_\gamma(x_j) - \sum_{i=1}^{P_0}\bar{\alpha}_i\phi_\gamma(\bar{x}_i)\right)\langle\phi_\rho(x)\phi_\gamma(x)\rangle
= \sum_\rho\lambda_\rho^2\left(\sum_{j,j'}\alpha_j\alpha_{j'}\phi_\rho(x_j)\phi_\rho(x_{j'}) - 2\sum_{i,j}\alpha_j\bar{\alpha}_i\phi_\rho(x_j)\phi_\rho(\bar{x}_i) + \sum_{i,i'}\bar{\alpha}_i\bar{\alpha}_{i'}\phi_\rho(\bar{x}_i)\phi_\rho(\bar{x}_{i'})\right) ,  (SI.94)

where \bar{\alpha}_i and \bar{x}_i denote the teacher's coefficients and points, and \alpha_j and x_j those of the student.

On the d-sphere, by defining E_k = \sum_{m=1}^{N(d,k)} E_{km}, we arrive at the formula

E_k = \lambda_k^2 N(d, k)\left(\alpha^\top Q_k(X^\top X)\alpha - 2\,\alpha^\top Q_k(X^\top\bar{X})\bar{\alpha} + \bar{\alpha}^\top Q_k(\bar{X}^\top\bar{X})\bar{\alpha}\right) ,  (SI.95)

where X and \bar{X} collect the student's and teacher's inputs as columns. We randomly sample the teacher coefficients \bar{\alpha} and fit \alpha = (K + \lambda I)^{-1} y to the training data. Once these coefficients are known, we can obtain the empirical mode errors.

13. Neural Network Experiments

For the "pure mode" experiments with neural networks, the target function was

f^*(x) = \sum_{i=1}^{p_0}\alpha_i Q_k(x^\top x_i) = \sum_{m=1}^{N(d,k)}\left(\frac{1}{N(d, k)}\sum_{i=1}^{p_0}\alpha_i Y_{km}(x_i)\right) Y_{km}(x) ,  (SI.96)

whereas, for the composite experiment, the target function was a randomly sampled two-layer neural network with ReLU activations,

f^*(x) = r^\top\sigma(\Theta x) .  (SI.97)

This target model is a special case of Eq. (SI.90), so the same technology can be used to compute the theoretical learning curves. We can use a similar trick to the one shown in equation (SI.92) to determine w_\rho for the NN teacher experiment.
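For concreteness, a short sketch (ours; the parameter values are placeholders) of how a degree-k "pure mode" target of the form (SI.96) can be sampled:

```python
import numpy as np
from scipy.special import eval_gegenbauer

def sample_sphere(n, d, rng):
    x = rng.standard_normal((n, d))
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def pure_mode_target(k, d, p0, rng):
    """Returns a callable f*(X) = sum_i alpha_i Q_k(x . x_i), as in (SI.96)."""
    alpha = rng.choice([-1.0, 1.0], size=p0)        # centered Bernoulli coefficients
    xi = sample_sphere(p0, d, rng)
    a = (d - 2) / 2
    Qk = lambda z: eval_gegenbauer(k, a, z) / eval_gegenbauer(k, a, 1.0)
    return lambda X: Qk(X @ xi.T) @ alpha           # X: rows are query points on the sphere

rng = np.random.default_rng(0)
f_star = pure_mode_target(k=2, d=10, p0=300, rng=rng)
X_train = sample_sphere(500, 10, rng)
y_train = f_star(X_train)                           # training labels for the experiments
```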
Let the Gegenbauer polynomial expansion of \sigma(z) be \sigma(z) = \sum_{k=0}^\infty a_k N(d, k) Q_k(z). Then the mode error for mode k is E_k = \frac{a_k^2}{\lambda_k^2} g_k^2, where g_k^2 is computed with equation (SI.37).

A sample of training and generalization errors from the pure mode experiments is provided below in Figures SI.3 and SI.4.

13.1. Hyperparameters
The choice of the number of hidden units N was based primarily on computational considerations. For two-layer neural networks, the total number of parameters scales linearly with N, so to approach the overparameterized regime we aimed to have N ≈ 10 p_max, where p_max is the largest sample size used in our experiments. For p_max = 500, we chose N = 4000 and N = 10000.

For the three- and four-layer networks, the number of parameters scales quadratically with N, making simulations with N > 10^3 computationally expensive. We chose N to give a training time comparable to the two-layer case, which corresponded to N = 500 after experimenting with {100, 250, 500, 1000, 5000}.

We found that the learning rate needed to be quite large for the training loss to be reduced by a factor of ≈ 10^6. For the two-layer networks, we tried learning rates {10^{-3}, 10^{-2}, 1, 10, 32} and found that a learning rate of 32 gave the lowest training error. For the three- and four-layer networks, we found that lower learning rates worked better and used learning rates in the range [0.5, 3].

14. Discrete Measure and Kernel PCA

We consider a special case of a discrete probability measure with equal mass on each point in a dataset of size \tilde{p}:

p(x) = \frac{1}{\tilde{p}}\sum_{i=1}^{\tilde{p}}\delta(x - x_i) .  (SI.98)

For this measure, the integral eigenvalue equation becomes

\int dx\, p(x) K(x, x')\phi_\rho(x) = \frac{1}{\tilde{p}}\sum_{i=1}^{\tilde{p}}\int dx\,\delta(x - x_i) K(x, x')\phi_\rho(x) = \frac{1}{\tilde{p}}\sum_{i=1}^{\tilde{p}} K(x_i, x')\phi_\rho(x_i) = \lambda_\rho\phi_\rho(x') .  (SI.99)

Evaluating x' at each of the points x_i in the dataset yields a matrix equation. Let \Phi_{\rho,i} = \phi_\rho(x_i) and \Lambda_{\rho,\gamma} = \delta_{\rho,\gamma}\lambda_\rho; then

K\Phi^\top = \tilde{p}\,\Phi^\top\Lambda .  (SI.100)
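Equation (SI.100) is the usual kernel PCA statement: under the empirical measure, the eigenvalues of the Gram matrix divided by \tilde{p} estimate the kernel eigenvalues, and the eigenvectors give the eigenfunction values on the sample. A minimal sketch (ours; the kernel and data below are placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 10))                 # p_tilde = 500 points in d = 10
X /= np.linalg.norm(X, axis=1, keepdims=True)

K = np.exp(X @ X.T - 1.0)                          # Gram matrix of an arbitrary dot-product kernel

# (SI.100): K Phi^T = p_tilde Phi^T Lambda, so eigenvalues of K / p_tilde estimate lambda_rho
evals, evecs = np.linalg.eigh(K / X.shape[0])
evals, evecs = evals[::-1], evecs[:, ::-1]         # sort descending
lambda_hat = evals                                 # empirical kernel eigenvalues
phi_hat = np.sqrt(X.shape[0]) * evecs              # phi_rho(x_i), scaled so (1/p_tilde) Phi Phi^T = I
print(lambda_hat[:10])
```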

[Figure SI.3 appears here: training loss E_tr versus SGD iteration for pure mode targets with k = 1, 2, 4. Panels: (a) 3-layer training loss, learning rate 2; (b) 4-layer training loss, learning rate 0.5.]

Figure SI.3. Training error for different pure mode target functions on neural networks with 500 hidden units per hidden layer on a sample of size p = 500. Generally, we find that the low frequency modes have an initial rapid reduction in the training error, but the higher frequencies k ≥ 4 are trained at a slower rate.

[Figure SI.4 appears here: E_k(p)/E_k(0) and E_g versus p, comparing neural networks and kernel regression for k = 1, 2, 4. Panels: (a) 2-layer NN, N = 4000; (b) 2-layer NN, N = 10^4; (c) 3-layer, N = 500; (d) 4-layer, N = 500; (e) 2-layer NN student-teacher, N = 2000; (f) 2-layer NN student-teacher, N = 8000.]

Figure SI.4. Learning curves for neural networks on "pure modes" and on student-teacher experiments. The theory curves are shown as solid lines. For the pure mode experiments, the test errors for the finite width neural networks and the NTK are shown with dots and triangles, respectively. Logarithms are evaluated with base 10.
