Neural Operator: Learning Maps Between Function Spaces With Applications to PDEs
Abstract
The classical development of neural networks has primarily focused on learning mappings be-
tween finite dimensional Euclidean spaces or finite sets. We propose a generalization of neural net-
works to learn operators, termed neural operators, that map between infinite dimensional function
spaces. We formulate the neural operator as a composition of linear integral operators and non-
linear activation functions. We prove a universal approximation theorem for our proposed neural
operator, showing that it can approximate any given nonlinear continuous operator. The proposed
neural operators are also discretization-invariant, i.e., they share the same model parameters among
different discretizations of the underlying function spaces. Furthermore, we introduce four classes
of efficient parameterization, viz., graph neural operators, multi-pole graph neural operators, low-
rank neural operators, and Fourier neural operators. An important application for neural operators
is learning surrogate maps for the solution operators of partial differential equations (PDEs). We
consider standard PDEs such as the Burgers, Darcy subsurface flow, and the Navier-Stokes equa-
tions, and show that the proposed neural operators have superior performance compared to existing
machine learning based methodologies, while being several orders of magnitude faster than con-
ventional PDE solvers.
1. Introduction
Learning mappings between function spaces has widespread applications in science and engineer-
ing. For instance, for solving differential equations, the input is a coefficient function and the
output is a solution function. A straightforward solution to this problem is to simply discretize the
infinite-dimensional input and output function spaces into finite-dimensional grids, and apply stan-
dard learning models such as neural networks. However, this limits applicability since the learned
neural network model may not generalize well to different discretizations, beyond the discretization
grid of the training data.
To overcome these limitations of standard neural networks, we formulate a new deep-learning
framework for learning operators, called neural operators, which directly map between function
spaces on bounded domains. Since our neural operator is designed on function spaces, they can be
discretized by a variety of different methods, and at different levels of resolution, without the need
for re-training. In contrast, standard neural network architectures depend heavily on the discretiza-
tion of training data: new architectures with new parameters may be needed to achieve the same
error for data with varying discretization. We also propose the notion of discretization-invariant
models and prove that our neural operators satisfy this property, while standard neural networks do
not.
To make this precise, we say that a model is discretization invariant if it:
1. acts on any discretization of the input function, i.e. accepts any set of points in the input domain,
2. can be evaluated at any point of the output domain, and
3. converges to a continuum operator as the discretization is refined.
The first two requirements, accepting any set of input points and allowing evaluation at any output point, are natural for discretization invariance, while the last one ensures consistency in the limit as the
discretization is refined. For example, families of graph neural networks (Scarselli et al., 2008) and
transformer models (Vaswani et al., 2017) are resolution invariant, i.e., they can receive inputs at any
resolution, but they fail to converge to a continuum operator as discretization is refined. Moreover,
we require the models to have a fixed number of parameters; otherwise, the number of parameters
becomes unbounded in the limit as the discretization is refined, as shown in Figure 1. Thus the
notion of discretization invariance allows us to define neural operator models that are consistent
in function spaces and can be applied to data given at any resolution and on any mesh. We also
establish that standard neural network models are not discretization invariant.
Neural Operators. We introduce the concept of neural operators for learning operators that are mappings between infinite-dimensional function spaces. We propose neural operator architectures composed of multiple layers, where each layer is itself an operator composed with a non-linear activation. This ensures that the overall end-to-end composition is an operator, and thus satisfies the discretization invariance property. The key design choice for neural operators is the operator layers. To keep it simple, we limit ourselves to layers that are linear operators. Since these layers are composed with
non-linear activations, we obtain neural operator models that are expressive and able to capture any
continuous operator. The latter property is known as universal approximation.
The above line of reasoning for neural operator design follows closely the design of standard
neural networks, where linear layers (e.g. matrix multiplication, convolution) are composed with
non-linear activations, and we have universal approximation of continuous functions defined on
compact domains (Hornik et al., 1989). Neural operators replace finite-dimensional linear layers in
neural networks with linear operators in function spaces.
We formally establish that neural operator models with a fixed number of parameters satisfy
discretization invariance. We further show that neural operator models are universal approximators
of continuous operators acting between Banach spaces, and can uniformly approximate any contin-
uous operator defined on a compact set of a Banach space. Neural operators are the only known
class of models that guarantee both discretization-invariance and universal approximation.
See Table 1 for a comparison among the deep learning models. Previous deep learning models are
mostly defined on a fixed grid, and removing, adding, or moving grid points generally makes these
models no longer applicable. Thus, they are not discretization invariant.
We propose several design choices for the linear operator layers in neural operators, such as a parameterized integral operator or multiplication in the spectral domain, as shown in Figure 2. Specifically, we propose four practical methods for implementing the neural operator framework: graph-based operators, low-rank operators, multipole graph-based operators, and Fourier operators. For graph-based operators, we develop a Nyström extension to connect the integral operator formulation of the neural operator to families of graph neural networks (GNNs) on arbitrary
grids. For Fourier operators, we consider the spectral domain formulation of the neural operator
which leads to efficient algorithms in settings where fast transform methods are applicable.
We include an exhaustive numerical study of the four formulations of neural operators. Numer-
ically, we show that the proposed methodology consistently outperforms all existing deep learning
methods even on the resolutions for which the standard neural networks were designed. For the
two-dimensional Navier-Stokes equation, when learning the entire flow map, the method achieves
< 1% error for a Reynolds number of 20 and 8% error for a Reynolds number of 200.
The proposed Fourier neural operator (FNO) has an inference time that is three orders of magni-
tude faster than the pseudo-spectral method used to generate the data for the Navier-Stokes problem
(Chandler and Kerswell, 2013) – 0.005s compared to the 2.2s on a 256 × 256 uniform spatial grid.
Despite its tremendous speed advantage, the method does not suffer from accuracy degradation
when used in downstream applications such as solving Bayesian inverse problems. Furthermore,
we demonstrate that FNO is robust to noise on the testing problems we consider here.
Data-driven approaches for solving PDEs. Over the past decades, significant progress has been
made in formulating (Gurtin, 1982) and solving (Johnson, 2012) the governing PDEs in many sci-
entific fields from micro-scale problems (e.g., quantum and molecular dynamics) to macro-scale
applications (e.g., civil and marine engineering). Despite the success in the application of PDEs to
solve real-world problems, two significant challenges remain: (1) identifying the governing model
for complex systems; (2) efficiently solving large-scale nonlinear systems of equations.
Table 1: Comparison of deep learning models. The first row indicates whether the model is dis-
cretization invariant. The second and third rows indicate whether the output and input are functions. The fourth row indicates whether the model class is a universal approximator of operators.
Neural Operators are discretization invariant deep learning methods that output functions and can
approximate any operator.
Identifying and formulating the underlying PDEs appropriate for modeling a specific problem
usually requires extensive prior knowledge in the corresponding field which is then combined with
universal conservation laws to design a predictive model. For example, modeling the deformation
and failure of solid structures requires detailed knowledge of the relationship between stress and
strain in the constituent material. For complicated systems such as living cells, acquiring such
knowledge is often elusive and formulating the governing PDE for these systems remains pro-
hibitive, or the models proposed are too simplistic to be informative. The possibility of acquiring
such knowledge from data can revolutionize these fields. Second, solving complicated nonlinear
PDE systems (such as those arising in turbulence and plasticity) is computationally demanding and
can often make realistic simulations intractable. Again the possibility of using instances of data to
design fast approximate solvers holds great potential for accelerating numerous problems.
Learning PDE Solution Operators. In PDE applications, the governing differential equations
are by definition local, whilst the solution operator exhibits non-local properties. Such non-local
effects can be described by integral operators explicitly in the spatial domain, or by means of spec-
tral domain multiplication; convolution is an archetypal example. For integral equations, the graph
approximations of Nyström type (Belongie et al., 2002) provide a consistent way of connecting
different grid or data structures arising in computational methods and understanding their contin-
uum limits (Von Luxburg et al., 2008; Trillos and Slepčev, 2018; Trillos et al., 2020). For spectral
domain calculations, there are well-developed tools that exist for approximating the continuum
(Boyd, 2001; Trefethen, 2000). However, these approaches for approximating integral operators
are not data-driven. Neural networks present a natural approach for learning-based integral opera-
tor approximations since they can incorporate non-locality. However, standard neural networks are
limited to the discretization of training data and hence, offer a poor approximation to the integral
operator. We tackle this issue here by proposing the framework of neural operators.
Properties of existing deep-learning models. Previous deep learning models are mostly defined
on a fixed grid, and removing, adding, or moving grid points generally makes these models no longer
applicable, as seen in Table 1. Thus, they are not discretization invariant. In general, standard neural
networks (NNs) (such as multilayer perceptrons (MLPs), convolutional neural networks (CNNs), ResNets, and Vision Transformers (ViTs)) that take the input grid and output grid as finite-dimensional vectors
are not discretization-invariant since their inputs and outputs must lie on a fixed grid with fixed locations. On the other hand, the pointwise neural networks used in PINNs (Raissi et al., 2019), which take each coordinate as input, are discretization-invariant since they can be applied at each location in parallel. However, PINNs only represent the solution function of a single instance and do not learn the map from the input functions to the output solution functions. A special class of neural networks is convolutional neural networks (CNNs). CNNs also do not converge with grid refinement since their receptive fields change with different input grids. On the other hand, if normalized by the grid size, CNNs can be applied to uniform grids with different resolutions, in which case they converge to differential operators, in a similar fashion to the finite difference method. Interpolation is a baseline approach to achieving discretization invariance. While NNs+Interpolation (or, in general, any finite-dimensional neural network combined with interpolation) are resolution invariant and their outputs can be queried at any point, they are not universal approximators of operators since the input and output dimensions of the internal network are fixed to a bounded number. DeepONets (Lu et al., 2019)
are a class of operators that have the universal approximation property. DeepONets consist of a
branch net and a trunk net. The trunk net allows queries at any point, but the branch net constrains
the input to fixed locations; however it is possible to modify the branch net to make the methodology
discretization invariant, for example by using the PCA-based approach as used in (De Hoop et al.,
2022).
Furthermore, we show transformers (Vaswani et al., 2017) are special cases of neural operators
with structured kernels that can be used with varying grids to represent the input function. However,
the commonly used vision-based extensions of transformers, e.g., ViT (Dosovitskiy et al., 2020),
use convolutions on patches to generate tokens, and therefore, they are not discretization-invariant
models.
We also show that when our proposed neural operators are applied only on fixed grids, the re-
sulting architectures coincide with neural networks and other operator learning frameworks. In such
reductions, point evaluations of the input functions are available on the grid points. In particular, we
show that the recent work on DeepONets (Lu et al., 2019), which are maps from finite-dimensional spaces to infinite-dimensional spaces, are special cases of the neural operator architecture when neural operators are restricted to fixed input grids. Moreover, by introducing an adjustment to the
DeepONet architecture, we propose the DeepONet-Operator model that fits into the full operator
learning framework of maps between function spaces.
2. Learning Operators
In subsection 2.1, we describe the generic setting of PDEs to make the discussion in the following sections concrete. In subsection 2.2, we outline the general problem of operator learning as well as
our approach to solving it. In subsection 2.3, we discuss the functional data that is available and
how we work with it numerically.
Suppose we have observations {a^(i), u^(i)}_{i=1}^{N}, where the a^(i) ∼ µ are i.i.d. samples drawn from a probability measure µ supported on A and u^(i) = G†(a^(i)) is possibly corrupted with noise. We aim to build an approximation of G† by constructing a parametric map
Gθ : A → U, θ ∈ Rp (2)
with parameters from the finite-dimensional space Rp and then choosing θ† ∈ Rp so that Gθ† ≈ G † .
We will be interested in controlling the error of the approximation on average with respect to µ.
In particular, assuming G † is µ-measurable, we will aim to control the L2µ (A; U) Bochner norm of
the approximation
∥G† − G_θ∥²_{L²_µ(A;U)} = E_{a∼µ} ∥G†(a) − G_θ(a)∥²_U = ∫_A ∥G†(a) − G_θ(a)∥²_U dµ(a).   (3)
This is a natural framework for learning in infinite-dimensions as one could seek to solve the asso-
ciated empirical-risk minimization problem
min_{θ∈R^p} E_{a∼µ} ∥G†(a) − G_θ(a)∥²_U ≈ min_{θ∈R^p} (1/N) Σ_{i=1}^{N} ∥u^(i) − G_θ(a^(i))∥²_U   (4)
which directly parallels the classical finite-dimensional setting (Vapnik, 1998). As well as using
error measured in the Bochner norm, we will also consider the setting where error is measured
uniformly over compact sets of A. In particular, given any compact K ⊂ A, we consider
sup_{a∈K} ∥G†(a) − G_θ(a)∥_U,   (5)
which is a more standard error metric in the approximation theory literature. Indeed, the classic approximation theory of neural networks is formulated analogously to equation (5) (Hornik et al.,
1989).
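For concreteness, the following is a minimal NumPy sketch of how the empirical risk in (4) is evaluated once functions are given on a grid: the U-norm is taken to be the L²(D) norm on D = [0, 1] and approximated by a Riemann sum. The toy operator, model, and data are placeholders, not the architectures introduced later.

```python
import numpy as np

def l2_norm_sq(f_vals, dx):
    """Squared L2(D) norm of a function from its values on a uniform 1-D grid."""
    return np.sum(f_vals**2) * dx

def empirical_risk(model, a_samples, u_samples, dx):
    """Empirical analogue of E_{a~mu} ||G^dagger(a) - G_theta(a)||_U^2, cf. (4)."""
    errs = [l2_norm_sq(model(a) - u, dx) for a, u in zip(a_samples, u_samples)]
    return float(np.mean(errs))

# Toy usage on D = [0, 1] with a placeholder "true" operator and an imperfect model.
x = np.linspace(0.0, 1.0, 128)
dx = x[1] - x[0]
a_samples = [np.sin(2 * np.pi * k * x) for k in (1, 2, 3)]   # draws standing in for a ~ mu
u_samples = [a**2 for a in a_samples]                        # pretend G^dagger(a) = a^2
model = lambda a: a**2 + 0.01                                # a deliberately imperfect G_theta
print(empirical_risk(model, a_samples, u_samples, dx))       # ~1e-4
```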
In Section 9 we show that, for the architecture we propose and given any desired error tolerance,
there exists p ∈ N and an associated parameter θ† ∈ Rp , so that the loss (3) or (5) is less than the
specified tolerance. However, we do not address the challenging open problems of characterizing
the error with respect to either (a) a fixed parameter dimension p or (b) a fixed number of training
samples N . Instead, we approach this in the empirical test-train setting where we minimize (4)
based on a fixed training set and approximate (3) from new samples that were not seen during
training. Because we conceptualize our methodology in the infinite-dimensional setting, all finite-
dimensional approximations can share a common set of network parameters which are defined in the
(approximation-free) infinite-dimensional setting. In particular, our architecture does not depend on
the way the functions a^(i), u^(i) are discretized. The notation used throughout this paper, along with a useful summary table, may be found in Appendix A.
2.3 Discretization
Since our data a(i) and u(i) are, in general, functions, to work with them numerically, we assume
access only to their point-wise evaluations. To illustrate this, we will continue with the example
of the preceding paragraph. For simplicity, assume D = D′ and suppose that the input and output functions are both real-valued. Let D^(i) = {x^(i)_ℓ}_{ℓ=1}^{L} ⊂ D be an L-point discretization of the domain D and assume we have observations a^(i)|_{D^(i)}, u^(i)|_{D^(i)} ∈ R^L, for a finite collection of input-output pairs indexed by i. In the next section, we propose a kernel-inspired graph neural
network architecture which, while trained on the discretized data, can produce the solution u(x) for
any x ∈ D given an input a ∼ µ. In particular, our discretized architecture maps into the space
U and not into a discretization thereof. Furthermore, our parametric operator class is consistent, in that, given a fixed set of parameters, refinement of the input discretization converges to the true function space operator. We make this notion precise in what follows and refer to architectures that possess it as function space architectures, mesh-invariant architectures, or discretization-invariant architectures.
Definition 1 We call a discrete refinement of the domain D ⊂ Rd any sequence of nested sets
D1 ⊂ D2 ⊂ · · · ⊂ D with |DL | = L for any L ∈ N such that, for any ϵ > 0, there exists a number
L = L(ϵ) ∈ N such that
D ⊆ ⋃_{x∈D_L} {y ∈ R^d : ∥y − x∥₂ < ϵ}.
K ⊂ A,
lim_{L→∞} R_K(G(·, θ), Ĝ_L(·, ·, θ), D_L) = 0.
We prove that the architectures proposed in Section 3 are discretization-invariant. We further verify
this claim numerically by showing that the approximation error is approximately constant as we
refine the discretization. Such a property is highly desirable as it allows a transfer of solutions
between different grid geometries and discretization sizes with a single architecture that has a fixed
number of parameters.
We note that, while the application of our methodology is based on having point-wise evalua-
tions of the function, it is not limited by it. One may, for example, represent a function numerically
as a finite set of truncated basis coefficients. Invariance of the representation would then be with
respect to the size of this set. Our methodology can, in principle, be modified to accommodate this
scenario through a suitably chosen architecture. We do not pursue this direction in the current work.
From the construction of neural operators, when the input and output functions are evaluated on
fixed grids, the architecture of neural operators on these fixed grids coincides with the class of neural networks.
3. Neural Operators
In this section, we outline the neural operator framework. We assume that the input functions a ∈ A
are R^{d_a}-valued and defined on the bounded domain D ⊂ R^d while the output functions u ∈ U are R^{d_u}-valued and defined on the bounded domain D′ ⊂ R^{d′}. The proposed architecture G_θ : A → U
has the following overall structure:
1. Lifting: Using a pointwise function R^{d_a} → R^{d_{v_0}}, map the input {a : D → R^{d_a}} ↦ {v_0 : D → R^{d_{v_0}}} to its first hidden representation. Usually, we choose d_{v_0} > d_a and hence this is a lifting operation performed by a fully local operator.
2. Iterative kernel integration: For t = 0, . . . , T − 1, map each hidden representation to the next, {v_t : D_t → R^{d_{v_t}}} ↦ {v_{t+1} : D_{t+1} → R^{d_{v_{t+1}}}}, via the action of the sum of a local linear operator, a non-local integral kernel operator, and a bias function, composing the sum with a fixed, pointwise nonlinearity. Here we set D_0 = D and D_T = D′, and each D_t ⊂ R^{d_t} is a bounded domain.
3. Projection: Using a pointwise function R^{d_{v_T}} → R^{d_u}, map the last hidden representation {v_T : D′ → R^{d_{v_T}}} ↦ {u : D′ → R^{d_u}} to the output function. Analogously to the first step, we usually pick d_{v_T} > d_u and hence this is a projection step performed by a fully local operator.
†. The indexing of sets D• here differs from the two previous indexings used in Subsection 2.3. The index t is not the
physical time, but the iteration (layer) in the model architecture.
The outlined structure mimics that of a finite dimensional neural network where hidden repre-
sentations are successively mapped to produce the final output. In particular, we have
G_θ := Q ∘ σ_T(W_{T−1} + K_{T−1} + b_{T−1}) ∘ · · · ∘ σ_1(W_0 + K_0 + b_0) ∘ P,   (6)
where P : R^{d_a} → R^{d_{v_0}}, Q : R^{d_{v_T}} → R^{d_u} are the local lifting and projection mappings respectively,
Wt ∈ Rdvt+1 ×dvt are local linear operators (matrices), Kt : {vt : Dt → Rdvt } → {vt+1 : Dt+1 →
Rdvt+1 } are integral kernel operators, bt : Dt+1 → Rdvt+1 are bias functions, and σt are fixed
activation functions acting locally as maps R^{d_{v_{t+1}}} → R^{d_{v_{t+1}}} in each layer. The output dimensions
dv0 , . . . , dvT as well as the input dimensions d1 , . . . , dT −1 and domains of definition D1 , . . . , DT −1
are hyperparameters of the architecture. By local maps, we mean that the action is pointwise, in
particular, for the lifting and projection maps, we have (P(a))(x) = P(a(x)) for any x ∈ D
and (Q(vT ))(x) = Q(vT (x)) for any x ∈ D′ and similarly, for the activation, (σ(vt+1 ))(x) =
σ(vt+1 (x)) for any x ∈ Dt+1 . The maps, P, Q, and σt can thus be thought of as defining Nemitskiy
operators (Dudley and Norvaisa, 2011, Chapters 6,7) when each of their components are assumed to
be Borel measurable. This interpretation allows us to define the general neural operator architecture
when pointwise evaluation is not well-defined in the spaces A or U e.g. when they are Lebesgue,
Sobolev, or Besov spaces.
The crucial difference between the proposed architecture (6) and a standard feed-forward neural
network is that all operations are directly defined in function space (noting that the activation functions, P, and Q are all interpreted through their extension to Nemitskiy operators) and therefore do
not depend on any discretization of the data. Intuitively, the lifting step locally maps the data to a
space where the non-local part of G † is easier to capture. We confirm this intuition numerically in
Section 7; however, we note that for the theory presented in Section 9 it suffices that P is the identity
map. The non-local part of G † is then learned by successively approximating using integral kernel
operators composed with a local nonlinearity. Each integral kernel operator is the function space
analog of the weight matrix in a standard feed-forward network since they are infinite-dimensional
linear operators mapping one function space to another. We turn the biases, which are normally vectors, into functions and, using intuition from the ResNet architecture (He et al., 2016), we further add
a local linear operator acting on the output of the previous layer before applying the nonlinearity.
The final projection step simply gets us back to the space of our output function. We concatenate
in θ ∈ Rp the parameters of P, Q, {bt } which are usually themselves shallow neural networks, the
parameters of the kernels representing {Kt } which are again usually shallow neural networks, and
the matrices {Wt }. We note, however, that our framework is general and other parameterizations
such as polynomials may also be employed.
Integral Kernel Operators We define three versions of the integral kernel operator K_t used in (6).
For the first, let κ(t) ∈ C(Dt+1 × Dt ; Rdvt+1 ×dvt ) and let νt be a Borel measure on Dt . Then we
define K_t by
(K_t(v_t))(x) = ∫_{D_t} κ^(t)(x, y) v_t(y) dν_t(y)   ∀x ∈ D_{t+1}.   (7)
Normally, we take νt to simply be the Lebesgue measure on Rdt but, as discussed in Section 4,
other choices can be used to speed up computation or aid the learning process by building in a
priori information. The choice of integral kernel operator in (7) defines the basic form of the neural
operator and is the one we analyze in Section 9 and study most in the numerical experiments of
Section 7.
10
N EURAL O PERATOR : L EARNING M APS B ETWEEN F UNCTION S PACES W ITH A PPLICATIONS TO PDE S
For the second, let κ(t) ∈ C(Dt+1 × Dt × Rda × Rda ; Rdvt+1 ×dvt ). Then we define Kt by
(K_t(v_t))(x) = ∫_{D_t} κ^(t)(x, y, a(Π^D_{t+1}(x)), a(Π^D_t(y))) v_t(y) dν_t(y)   ∀x ∈ D_{t+1},   (8)
where Π^D_t : D_t → D are fixed mappings. We have found numerically that, for certain PDE prob-
lems, the form (8) outperforms (7) due to the strong dependence of the solution u on the parameters
a, for example, the Darcy flow problem considered in subsection 7.2.1. Indeed, if we think of (6)
as a discrete time dynamical system, then the input a ∈ A only enters through the initial condi-
tion hence its influence diminishes with more layers. By directly building in a-dependence into the
kernel, we ensure that it influences the entire architecture.
Lastly, let κ(t) ∈ C(Dt+1 × Dt × Rdvt × Rdvt ; Rdvt+1 ×dvt ). Then we define Kt by
(K_t(v_t))(x) = ∫_{D_t} κ^(t)(x, y, v_t(Π_t(x)), v_t(y)) v_t(y) dν_t(y)   ∀x ∈ D_{t+1},   (9)
where Πt : Dt+1 → Dt are fixed mappings. Note that, in contrast to (7) and (8), the integral
operator (9) is nonlinear since the kernel can depend on the input function vt . With this definition
and a particular choice of kernel κt and measure νt , we show in Section 5.2 that neural operators
are a continuous input/output space generalization of the popular transformer architecture (Vaswani
et al., 2017).
Single Hidden Layer Construction Having defined possible choices for the integral kernel oper-
ator, we are now in a position to explicitly write down a full layer of the architecture defined by (6).
For simplicity, we choose the integral kernel operator given by (7), but note that the other definitions
(8), (9) work analogously. We then have that a single hidden layer update is given by
v_{t+1}(x) = σ_{t+1}( W_t v_t(Π_t(x)) + ∫_{D_t} κ^(t)(x, y) v_t(y) dν_t(y) + b_t(x) )   ∀x ∈ D_{t+1}   (10)
where Πt : Dt+1 → Dt are fixed mappings. We remark that, since we often consider functions on
the same domain, we usually take Πt to be the identity.
We will now give an example of a full single hidden layer architecture i.e. when T = 2. We
choose D1 = D, take σ2 as the identity, and denote σ1 by σ, assuming it is any activation function.
Furthermore, for simplicity, we set W1 = 0, b1 = 0, and assume that ν0 = ν1 is the Lebesgue
measure on Rd . Then (6) becomes
(G_θ(a))(x) = Q( ∫_D κ^(1)(x, y) σ( W_0 P(a(y)) + ∫_D κ^(0)(y, z) P(a(z)) dz + b_0(y) ) dy )   (11)
for any x ∈ D′ . In this example, P ∈ C(Rda ; Rdv0 ), κ(0) ∈ C(D×D; Rdv1 ×dv0 ), b0 ∈ C(D; Rdv1 ),
W0 ∈ Rdv1 ×dv0 , κ(1) ∈ C(D′ × D; Rdv2 ×dv1 ), and Q ∈ C(Rdv2 ; Rdu ). One can then parametrize
the continuous functions P, Q, κ(0) , κ(1) , b0 by standard feed-forward neural networks (or by any
other means) and the matrix W0 simply by its entries. The parameter vector θ ∈ Rp then becomes
the concatenation of the parameters of P, Q, κ(0) , κ(1) , b0 along with the entries of W0 . One can
then optimize these parameters by minimizing with respect to θ using standard gradient based min-
imization techniques. To implement this minimization, the functions entering the loss need to be
discretized; but the learned parameters may then be used with other discretizations. In Section 4,
we discuss various choices for parametrizing the kernels, picking the integration measure, and how
those choices affect the computational complexity of the architecture.
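As an illustration of how (11) is evaluated on discretized data, the following is a minimal NumPy sketch on D = D′ = [0, 1] with d_a = d_u = 1. The functions P, Q, κ^(0), κ^(1), b_0 are replaced by explicit closed-form stand-ins rather than trained neural networks, and the integrals are approximated by Riemann sums on the training grid; all dimensions and kernel choices are illustrative assumptions, not the parameterizations used in the experiments.

```python
import numpy as np

# Illustrative dimensions (assumptions): d_a = d_u = 1, d_v0 = d_v1 = 8, d_v2 = 4.
dv0, dv1, dv2 = 8, 8, 4
rng = np.random.default_rng(0)

# Closed-form stand-ins for the learnable components of (11); in practice they are small MLPs.
P_mat = rng.standard_normal((dv0, 1))               # lifting P : R -> R^{dv0}
Q_mat = rng.standard_normal((1, dv2))               # projection Q : R^{dv2} -> R
W0 = rng.standard_normal((dv1, dv0))                # local linear operator
A0 = rng.standard_normal((dv1, dv0))                # "parameters" of kernel kappa^(0)
A1 = rng.standard_normal((dv2, dv1))                # "parameters" of kernel kappa^(1)
kappa0 = lambda x, y: A0 * np.exp(-abs(x - y))      # R^{dv1 x dv0}-valued kernel
kappa1 = lambda x, y: A1 * np.exp(-abs(x - y))      # R^{dv2 x dv1}-valued kernel
b0 = lambda y: np.zeros(dv1)                        # bias function (zero for simplicity)
sigma = np.tanh

def G_theta(a_vals, grid):
    """Evaluate (11) on the grid, with both integrals replaced by Riemann sums (O(J^2) cost)."""
    J = len(grid)
    dx = grid[1] - grid[0]
    v0 = a_vals[:, None] * P_mat[:, 0]              # lifted representation, shape (J, dv0)
    # Inner layer: v1(y) = sigma(W0 v0(y) + int_D kappa0(y, z) v0(z) dz + b0(y)).
    v1 = np.stack([
        sigma(W0 @ v0[j]
              + sum(kappa0(y, grid[l]) @ v0[l] for l in range(J)) * dx
              + b0(y))
        for j, y in enumerate(grid)
    ])
    # Outer layer and projection: u(x) = Q( int_D kappa1(x, y) v1(y) dy ).
    return np.array([
        (Q_mat @ (sum(kappa1(x, grid[l]) @ v1[l] for l in range(J)) * dx))[0]
        for x in grid
    ])

grid = np.linspace(0.0, 1.0, 64)
u = G_theta(np.sin(2 * np.pi * grid), grid)
print(u.shape)                                      # (64,)
```

The learned parameters here would be the entries of P_mat, Q_mat, W0, A0, A1 (and the parameters of b0), optimized by gradient descent on the discretized loss; the same parameters can then be reused on a different grid.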
Preprocessing It is often beneficial to manually include features into the input functions a to help
facilitate the learning process. For example, instead of considering the Rda -valued vector field a
as input, we use the R^{d+d_a}-valued vector field (x, a(x)). By including the identity map x ↦ x as a feature, information about the geometry of the spatial domain D is directly incorporated into the architecture.
This allows the neural networks direct access to information that is already known in the problem
and therefore eases learning. We use this idea in all of our numerical experiments in Section 7.
Similarly, when learning a smoothing operator, it may be beneficial to include a smoothed version
of the inputs aϵ using, for example, Gaussian convolution. Derivative information may also be
of interest and thus, as input, one may consider, for example, the Rd+2da +dda -valued vector field
(x, a(x), aϵ (x), ∇x aϵ (x)). Many other possibilities may be considered on a problem-specific basis.
Discretization Invariance and Approximation In light of discretization invariance Theorem 8
and universal approximation Theorems 11, 12, 13, and 14, whose formal statements are given in Sec-
tion 9, we may obtain a decomposition of the total error made by a neural operator as a sum of the
discretization error and the approximation error. In particular, given a finite dimensional instantia-
tion of a neural operator Ĝθ : RLd × RLda → U, for some L-point discretization of the input, we
have
∥Ĝ_θ(D_L, a|_{D_L}) − G†(a)∥_U ≤ ∥Ĝ_θ(D_L, a|_{D_L}) − G_θ(a)∥_U + ∥G_θ(a) − G†(a)∥_U,
where the first term on the right-hand side is the discretization error and the second is the approximation error.
Our approximation theoretic Theorems imply that we can find parameters θ so that the approxi-
mation error is arbitrarily small while the discretization invariance Theorem states that we can find
a fine enough discretization (large enough L) so that the discretization error is arbitrarily small.
Therefore, with a fixed set of parameters independent of the input discretization, a neural operator
that is able to be implemented on a computer can approximate operators to arbitrary accuracy.
The integral in (13) can be approximated using any other integral approximation method, including the celebrated Riemann sum, for which u(x_j) = Σ_{l=1}^{J} κ(x_j, x_l) v(x_l) Δx_l, where Δx_l is the Riemann sum coefficient associated with ν at x_l. For such approximation methods, computing u on the entire grid requires O(J²) matrix-vector multiplications. Each of these matrix-vector multiplications
requires O(mn) operations; for the rest of the discussion, we treat mn = O(1) as constant and
consider only the cost with respect to J the discretization parameter since m and n are fixed by
the architecture choice whereas J varies depending on required discretization accuracy and hence
may be arbitrarily large. This cost is not specific to the Monte Carlo approximation but is generic
for quadrature rules which use the entirety of the data. Therefore, when J is large, computing (13)
becomes intractable and new ideas are needed in order to alleviate this. Subsections 4.1-4.4 propose
different approaches to the solution to this problem, inspired by classical methods in numerical
analysis. We finally remark that, in contrast, computations with W , b, and σ only require O(J)
operations which justifies our focus on computation with the kernel integral operator.
Kernel Matrix. It will often times be useful to consider the kernel matrix associated to κ for the
discrete points {x1 , . . . , xJ } ⊂ D. We define the kernel matrix K ∈ RmJ×nJ to be the J × J block
matrix with each block given by the value of the kernel, i.e. K_{jl} = κ(x_j, x_l) ∈ R^{m×n} for j, l = 1, . . . , J,
where we use (j, l) to index an individual block rather than a matrix element. Various numerical
algorithms for the efficient computation of (13) can be derived based on assumptions made about
the structure of this matrix, for example, bounds on its rank or sparsity.
Nyström Approximation. A simple way to alleviate the cost of computing (13) is to subsample points x_{k_1}, . . . , x_{k_{J′}}, chosen uniformly at random from {x_1, . . . , x_J}, and approximate
u(x_{k_j}) ≈ (1/J′) Σ_{l=1}^{J′} κ(x_{k_j}, x_{k_l}) v(x_{k_l}),   j = 1, . . . , J′.
From the perspective of the kernel matrix, this corresponds to the approximation
K ≈ K_{JJ′} K_{J′J′} K_{J′J},   (15)
where K_{J′J′} is a J′ × J′ block matrix and K_{JJ′}, K_{J′J} are interpolation matrices, for example,
linearly extending the function to the whole domain from the random nodal points. The complexity
of this computation is O(J ′2 ) hence it remains quadratic but only in the number of subsampled
points J ′ which we assume is much less than the number of points J in the original discretization.
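The following is a minimal NumPy sketch contrasting the full O(J²) quadrature with the Nyström approximation just described, on a one-dimensional grid with an illustrative Gaussian kernel; linear interpolation plays the role of the matrices K_{JJ′}, K_{J′J} in (15). The kernel choice and sizes are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
J, J_sub = 2048, 128                                  # full and subsampled discretizations
x = np.linspace(0.0, 1.0, J)
v = np.sin(2 * np.pi * x)                             # input function values on the grid
kappa = lambda X, Y: np.exp(-(X[:, None] - Y[None, :])**2)   # illustrative scalar kernel

# Full quadrature: u(x_j) ~ (1/J) sum_l kappa(x_j, x_l) v(x_l), an O(J^2) computation.
u_full = kappa(x, x) @ v / J

# Nystrom: evaluate on J' randomly chosen nodes, then linearly extend back to the full grid.
idx = np.sort(rng.choice(J, size=J_sub, replace=False))
x_sub, v_sub = x[idx], v[idx]
u_sub = kappa(x_sub, x_sub) @ v_sub / J_sub           # only O(J'^2) kernel evaluations
u_nystrom = np.interp(x, x_sub, u_sub)                # interpolation plays the role of K_{JJ'}

print(np.max(np.abs(u_full - u_nystrom)))             # small for this smooth kernel
```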
Truncation. Another simple method to alleviate the cost of computing (13) is to truncate the
integral to a sub-domain of D which depends on the point of evaluation x ∈ D. Let s : D → B(D)
be a mapping of the points of D to the Lebesgue measurable subsets of D denoted B(D). Define
dν(x, y) = 1_{s(x)}(y) dy; then (13) becomes
u(x) = ∫_{s(x)} κ(x, y) v(y) dy   ∀x ∈ D.   (16)
If the size of each set s(x) is smaller than D then the cost of computing (16) is O(cs J 2 ) where
cs < 1 is a constant depending on s. While the cost remains quadratic in J, the constant cs can
have a significant effect in practical computations, as we demonstrate in Section 7. For simplicity
and ease of implementation, we only consider s(x) = B(x, r) ∩ D where B(x, r) = {y ∈ Rd :
∥y − x∥Rd < r} for some fixed r > 0. With this choice of s and assuming that D = [0, 1]d , we can
explicitly calculate that cs ≈ rd .
Furthermore notice that we do not lose any expressive power when we make this approximation
so long as we combine it with composition. To see this, consider the example of the previous paragraph where, if we let r = √2, then (16) reverts to (13). Pick r < 1 and let L ∈ N with
L ≥ 2 be the smallest integer such that 2L−1 r ≥ 1. Suppose that u(x) is computed by composing
the right hand side of (16) L times with a different kernel every time. The domain of influence
of u(x) is then B(x, 2L−1 r) ∩ D = D hence it is easy to see that there exist L kernels such that
14
N EURAL O PERATOR : L EARNING M APS B ETWEEN F UNCTION S PACES W ITH A PPLICATIONS TO PDE S
computing this composition is equivalent to computing (13) for any given kernel with appropriate
regularity. Furthermore the cost of this computation is O(Lrd J 2 ) and therefore the truncation is
beneficial if rd (log2 1/r + 1) < 1 which holds for any r < 1/2 when d = 1 and any r < 1
when d ≥ 2. Therefore we have shown that we can always reduce the cost of computing (13) by
truncation and composition. From the perspective of the kernel matrix, truncation enforces a sparse,
block diagonally-dominant structure at each layer. We further explore the hierarchical nature of this
computation using the multipole method in subsection 4.3.
Besides being a useful computational tool, truncation can also be interpreted as explicitly build-
ing local structure into the kernel κ. For problems where such structure exists, explicitly enforcing
it makes learning more efficient, usually requiring less data to achieve the same generalization er-
ror. Many physical systems such as interacting particles in an electric potential exhibit strong local
behavior that quickly decays, making truncation a natural approximation technique.
Graph Neural Networks. We utilize the standard architecture of message passing graph net-
works employing edge features as introduced in Gilmer et al. (2017) to efficiently implement (13)
on arbitrary discretizations of the domain D. To do so, we treat a discretization {x1 , . . . , xJ } ⊂ D
as the nodes of a weighted, directed graph and assign edges to each node using the function
s : D → B(D) which, recall from the section on truncation, assigns to each point a domain of
integration. In particular, for j = 1, . . . , J, we assign the node xj the value v(xj ) and emanate
from it edges to the nodes s(xj ) ∩ {x1 , . . . , xJ } = N (xj ) which we call the neighborhood of xj .
If s(x) = D then the graph is fully-connected. Generally, the sparsity structure of the graph deter-
mines the sparsity of the kernel matrix K, indeed, the adjacency matrix of the graph and the block
kernel matrix have the same zero entries. The weights of each edge are assigned as the arguments
of the kernel. In particular, for the case of (13), the weight of the edge between nodes xj and xk is
simply the concatenation (xj , xk ) ∈ R2d . More complicated weighting functions are considered for
the implementation of the integral kernel operators (8) or (9).
With the above definition the message passing algorithm of Gilmer et al. (2017), with averaging
aggregation, updates the value v(xj ) of the node xj to the value u(xj ) as
u(x_j) = (1/|N(x_j)|) Σ_{y∈N(x_j)} κ(x_j, y) v(y),   j = 1, . . . , J,
which corresponds to the Monte-Carlo approximation of the integral (16). More sophisticated
quadrature rules and adaptive meshes can also be implemented using the general framework of
message passing on graphs, see, for example, Pfaff et al. (2020). We further utilize this framework
in subsection 4.3.
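A minimal NumPy sketch of the message-passing update above with averaging aggregation over the truncation neighbourhoods follows; the matrix-valued kernel is an illustrative closed form standing in for the small neural network used in practice, and the graph is built by brute-force distance thresholding.

```python
import numpy as np

rng = np.random.default_rng(0)
J, n, m, r = 256, 3, 3, 0.15                       # nodes, input/output channels, radius
x = rng.random((J, 2))                             # scattered discretization of D = [0, 1]^2
v = rng.standard_normal((J, n))                    # node features v(x_j)
A = rng.standard_normal((m, n))                    # stand-in parameters of the kernel network

def kappa(xi, yj):
    """Illustrative matrix-valued kernel kappa(x, y) in R^{m x n} (a neural network in practice)."""
    return A * np.exp(-np.linalg.norm(xi - yj))

# Edges: N(x_j) = B(x_j, r) intersected with the discretization (every node neighbours itself).
dists = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
neighbors = [np.nonzero(dists[j] < r)[0] for j in range(J)]

# Message passing with averaging aggregation:
# u(x_j) = (1/|N(x_j)|) sum_{y in N(x_j)} kappa(x_j, y) v(y).
u = np.zeros((J, m))
for j in range(J):
    u[j] = np.mean([kappa(x[j], x[l]) @ v[l] for l in neighbors[j]], axis=0)

print(u.shape)                                     # (256, 3)
```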
Convolutional Neural Networks. Lastly, we compare and contrast the GNO framework to stan-
dard convolutional neural networks (CNNs). In computer vision, the success of CNNs has largely
been attributed to their ability to capture local features such as edges that can be used to distinguish
different objects in a natural image. This property is obtained by enforcing the convolution kernel to
have local support, an idea similar to our truncation approximation. Furthermore by directly using a
translation invariant kernel, a CNN architecture becomes translation equivariant; this is a desirable
feature for many vision models e.g. ones that perform segmentation. We will show that similar
ideas can be applied to the neural operator framework to obtain an architecture with built-in local
properties and translational symmetries that, unlike CNNs, remain consistent in function space.
15
KOVACHKI , L I , L IU , A ZIZZADENESHELI , B HATTACHARYA , S TUART, A NANDKUMAR
To that end, let κ(x, y) = κ(x − y) and suppose that κ : Rd → Rm×n is supported on B(0, r).
Let r∗ > 0 be the smallest radius such that D ⊆ B(x∗ , r∗ ) where x∗ ∈ Rd denotes the center of
mass of D and suppose r ≪ r∗ . Then (13) becomes the convolution
u(x) = (κ ∗ v)(x) = ∫_{B(x,r)∩D} κ(x − y) v(y) dy   ∀x ∈ D.   (17)
Notice that (17) is precisely (16) when s(x) = B(x, r) ∩ D and κ(x, y) = κ(x − y). When the
kernel is parameterized by e.g. a standard neural network and the radius r is chosen independently
of the data discretization, (17) becomes a layer of a convolution neural network that is consistent in
function space. Indeed the parameters of (17) do not depend on any discretization of v. The choice
κ(x, y) = κ(x − y) enforces translational equivariance in the output while picking r small enforces
locality in the kernel; hence we obtain the distinguishing features of a CNN model.
We will now show that, by picking a parameterization that is inconsistent in function space
and applying a Monte Carlo approximation to the integral, (17) becomes a standard CNN. This is
most easily demonstrated when D = [0, 1] and the discretization {x1 , . . . , xJ } is equispaced i.e.
|xj+1 − xj | = h for any j = 1, . . . , J − 1. Let k ∈ N be an odd filter size and let z1 , . . . , zk ∈ R
be the points zj = (j − 1 − (k − 1)/2)h for j = 1, . . . , k. It is easy to see that {z1 , . . . , zk } ⊂
B̄(0, (k − 1)h/2) which we choose as the support of κ. Furthermore, we parameterize κ directly
by its pointwise values which are m × n matrices at the locations z1 , . . . , zk thus yielding kmn
parameters. Then (17) becomes
u(x_j)_p ≈ (1/k) Σ_{l=1}^{k} Σ_{q=1}^{n} κ(z_l)_{pq} v(x_j − z_l)_q,   j = 1, . . . , J,  p = 1, . . . , m,
where we define v(x) = 0 if x ̸∈ {x1 , . . . , xJ }. Up to the constant factor 1/k which can be re-
absorbed into the parameterization of κ, this is precisely the update of a stride 1 CNN with n input
channels, m output channels, and zero-padding so that the input and output signals have the same
length. This example can easily be generalized to higher dimensions and different CNN structures; we made the current choices for simplicity of exposition. Notice that if we double the number of
discretization points for v i.e. J 7→ 2J and h 7→ h/2, the support of κ becomes B̄(0, (k − 1)h/4)
hence the model changes due to the discretization of the data. Indeed, if we take the limit to the
continuum J → ∞, we find B̄(0, (k − 1)h/2) → {0} hence the model becomes completely local.
To fix this, we may try to increase the filter size k (or equivalently add more layers) simultaneously
with J, but then the number of parameters in the model goes to infinity as J → ∞ since, as we
previously noted, there are kmn parameters in this layer. Therefore standard CNNs are not consis-
tent models in function space. We demonstrate their inability to generalize to different resolutions
in Section 7.
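The following small NumPy calculation illustrates the inconsistency described above: as the grid is refined, the physical support of a fixed k-tap CNN filter shrinks towards a point, while a function-space kernel of fixed radius r would require an ever larger number of taps (and hence parameters) to be represented as a standard convolution. The values of k and r are arbitrary.

```python
import numpy as np

k = 5                                              # fixed number of filter taps (CNN)
r = 0.1                                            # fixed physical radius (function-space kernel)
for J in (64, 128, 256, 512):                      # successively refined uniform grids on [0, 1]
    h = 1.0 / (J - 1)
    cnn_radius = (k - 1) * h / 2                   # physical support of the k-tap filter
    taps_needed = 2 * int(np.floor(r / h)) + 1     # taps required to cover the fixed radius r
    print(f"J={J:4d}: k-tap filter covers radius {cnn_radius:.4f}; "
          f"radius-{r} kernel needs {taps_needed} taps")
```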
A low-rank structure can be imposed directly by choosing the kernel κ(x, y) = Σ_{j=1}^{r} φ^(j)(x) ψ^(j)(y) for some functions φ^(1), ψ^(1), . . . , φ^(r), ψ^(r) : D → R that are normally given as the components of two neural networks φ, ψ : D → R^r or a single neural network Ξ : D → R^{2r} which couples all functions through its parameters. With this definition, and supposing that n = m = 1, we have that
(13) becomes
u(x) = ∫_D Σ_{j=1}^{r} φ^(j)(x) ψ^(j)(y) v(y) dy
     = Σ_{j=1}^{r} ( ∫_D ψ^(j)(y) v(y) dy ) φ^(j)(x)
     = Σ_{j=1}^{r} ⟨ψ^(j), v⟩ φ^(j)(x),
where ⟨·, ·⟩ denotes the L2 (D; R) inner product. Notice that the inner products can be evaluated
independently of the evaluation point x ∈ D hence the computational complexity of this method is
O(rJ) which is linear in the discretization.
We may also interpret this choice of kernel as directly parameterizing a rank r ∈ N operator on L²(D; R). Indeed, we have
u = Σ_{j=1}^{r} (φ^(j) ⊗ ψ^(j)) v,   (18)
which corresponds precisely to applying the SVD of a rank r operator to the function v. Equation (18) makes natural the vector-valued generalization. Assume m, n ≥ 1 and φ^(j) : D → R^m and ψ^(j) : D → R^n for j = 1, . . . , r; then (18) defines an operator mapping L²(D; R^n) → L²(D; R^m) that can be evaluated as
u(x) = Σ_{j=1}^{r} ⟨ψ^(j), v⟩_{L²(D;R^n)} φ^(j)(x)   ∀x ∈ D.
We again note the linear computational complexity of this parameterization. Finally, we observe
that this method can be interpreted as directly imposing a rank r structure on the kernel matrix.
Indeed,
K = K_{Jr} K_{rJ},
where K_{Jr}, K_{rJ} are J × r and r × J block matrices respectively. This construction is similar to
the DeepONet construction of Lu et al. (2019) discussed in Section 5.1, but parameterized to be
consistent in function space.
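A minimal NumPy sketch of the low-rank evaluation u(x) = Σ_j ⟨ψ^(j), v⟩ φ^(j)(x) on a grid follows, with φ and ψ taken as fixed closed-form functions in place of trained networks and the L² inner products replaced by Riemann sums; note the O(rJ) cost, linear in the discretization size.

```python
import numpy as np

r, J = 4, 1024                                     # rank and discretization size
x = np.linspace(0.0, 1.0, J)
dx = x[1] - x[0]
v = np.exp(-10.0 * (x - 0.5)**2)                   # input function values on the grid

# Closed-form stand-ins for the networks phi, psi : D -> R^r.
phi = np.stack([np.cos(2 * np.pi * j * x) for j in range(1, r + 1)])   # shape (r, J)
psi = np.stack([np.sin(2 * np.pi * j * x) for j in range(1, r + 1)])   # shape (r, J)

# u(x) = sum_j <psi^(j), v> phi^(j)(x): r inner products, then an expansion -- O(rJ) total.
coeffs = psi @ v * dx                              # Riemann-sum inner products <psi^(j), v>
u = coeffs @ phi                                   # expand in the phi "basis"
print(u.shape)                                     # (1024,)
```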
and a hierarchy of low-rank structures is imposed on the long-range components. We employ this
idea to construct hierarchical, multi-scale graphs, without being constrained to particular forms of
the kernel. We will elucidate the workings of the FMM through matrix factorization. This ap-
proach was first outlined in Li et al. (2020b) and is referred to as the Multipole Graph Neural Operator
(MGNO).
The key to the fast multipole method’s linear complexity lies in the subdivision of the kernel
matrix according to the range of interaction, as shown in Figure 3:
K = K1 + K2 + . . . + KL , (19)
Figure 4: V-cycle
Left: the multi-level discretization. Right: one V-cycle iteration for the multipole neural operator.
2005): starting from a discretization with J1 = J nodes, we impose inducing points of size
J2 , J3 , . . . , JL which all admit a low-rank kernel matrix decomposition of the form (15). The orig-
inal J × J kernel matrix Kl is represented by a much smaller Jl × Jl kernel matrix, denoted by
Kl,l . As shown in Figure 3, K1 is full-rank but very sparse while KL is dense but low-rank. Such
structure can be achieved by applying equation (15) recursively to equation (19), leading to the
multi-resolution matrix factorization (Kondor et al., 2014):
K ≈ K1,1 + K1,2 K2,2 K2,1 + K1,2 K2,3 K3,3 K3,2 K2,1 + · · · (20)
where K_{1,1} = K_1 represents the shortest range, K_{1,2} K_{2,2} K_{2,1} ≈ K_2 represents the second shortest range, etc. The center matrix K_{l,l} is a J_l × J_l kernel matrix corresponding to the l-th level of the discretization described above. The matrices K_{l+1,l}, K_{l,l+1} are J_{l+1} × J_l and J_l × J_{l+1} block transition matrices, respectively. Denote by v_l ∈ R^{J_l×n} the representation of the input v at each level of the discretization for l = 1, . . . , L, and by u_l ∈ R^{J_l×n} the output (assuming the inputs and outputs have the same dimension). We define the matrices K_{l+1,l}, K_{l,l+1} as moving the
representation vl between different levels of the discretization via an integral kernel that we learn.
Combining with the truncation idea introduced in subsection 4.1, we define the transition matrices
as discretizations of the following integral kernel operators:
K_{l,l} : v_l ↦ u_l = ∫_{B(x, r_{l,l})} κ_{l,l}(x, y) v_l(y) dy   (21)
K_{l+1,l} : v_l ↦ u_{l+1} = ∫_{B(x, r_{l+1,l})} κ_{l+1,l}(x, y) v_l(y) dy   (22)
K_{l,l+1} : v_{l+1} ↦ u_l = ∫_{B(x, r_{l,l+1})} κ_{l,l+1}(x, y) v_{l+1}(y) dy   (23)
where each kernel κl,l′ : D × D → Rn×n is parameterized as a neural network and learned.
V-cycle Algorithm We present a V-cycle algorithm, see Figure 4, for efficiently computing (20).
It consists of two steps: the downward pass and the upward pass. Denote the representation in
downward pass and upward pass by v̌ and v̂ respectively. In the downward step, the algorithm starts
from the fine discretization representation v̌1 and updates it by applying a downward transition
19
KOVACHKI , L I , L IU , A ZIZZADENESHELI , B HATTACHARYA , S TUART, A NANDKUMAR
v̌_{l+1} = K_{l+1,l} v̌_l. In the upward step, the algorithm starts from the coarse representation v̂_L and
updates it by applying an upward transition and the center kernel matrix v̂l = Kl,l−1 v̂l−1 + Kl,l v̌l .
Notice that applying one level downward and upward exactly computes K1,1 + K1,2 K2,2 K2,1 , and
a full L-level V-cycle leads to the multi-resolution decomposition (20).
Employing (21)-(23), we use L neural networks κ1,1 , . . . , κL,L to approximate the kernel oper-
ators associated to Kl,l , and 2(L − 1) neural networks κ1,2 , κ2,1 , . . . to approximate the transitions
Kl+1,l , Kl,l+1 . Following the iterative architecture (6), we introduce the linear operator W ∈ Rn×n
(denoting it by Wl for each corresponding resolution) to help regularize the iteration, as well as the
nonlinear activation function σ to increase the expressiveness. Since W acts pointwise (requiring that J remains the same for input and output), we employ it only along with the kernel K_{l,l} and not the
transitions. At each layer t = 0, . . . , T − 1, we perform a full V-cycle as:
• Downward Pass:  For l = 1, . . . , L :  v̌^{(t+1)}_{l+1} = σ(v̂^{(t)}_{l+1} + K_{l+1,l} v̌^{(t+1)}_l)   (24)
• Upward Pass:  For l = L, . . . , 1 :  v̂^{(t+1)}_l = σ((W_l + K_{l,l}) v̌^{(t+1)}_l + K_{l,l−1} v̂^{(t+1)}_{l−1}).   (25)
Notice that one full pass of the V-cycle algorithm defines a mapping v 7→ u.
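To make the structure of the pass concrete, here is a compact NumPy sketch of one V-cycle evaluating the multi-resolution decomposition (20) with L = 3 levels and scalar channels. The kernel, restriction, and prolongation matrices are random dense stand-ins for the learned, truncated integral operators, and the per-level W_l and nonlinearity of (24)-(25) are omitted so the downward/upward structure stays visible.

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [64, 32, 16]                           # J_1 (fine) > J_2 > J_3 (coarse)
L = len(sizes)

# Dense random stand-ins for the learned kernel and transition matrices (scalar channels).
K_center = [rng.standard_normal((J, J)) / J for J in sizes]               # K_{l,l}
K_restrict = [rng.standard_normal((sizes[l + 1], sizes[l])) / sizes[l]    # K_{l+1,l}
              for l in range(L - 1)]
K_prolong = [rng.standard_normal((sizes[l], sizes[l + 1])) / sizes[l + 1] # K_{l,l+1}
             for l in range(L - 1)]

def v_cycle(v1):
    """One V-cycle computing
    u = K_{1,1} v + K_{1,2} K_{2,2} K_{2,1} v + K_{1,2} K_{2,3} K_{3,3} K_{3,2} K_{2,1} v."""
    # Downward pass: restrict the representation to coarser and coarser levels.
    v_check = [v1]
    for l in range(L - 1):
        v_check.append(K_restrict[l] @ v_check[l])
    # Upward pass: apply the centre kernels and prolong back towards the fine level.
    u = K_center[L - 1] @ v_check[L - 1]
    for l in range(L - 2, -1, -1):
        u = K_center[l] @ v_check[l] + K_prolong[l] @ u
    return u

u = v_cycle(rng.standard_normal(sizes[0]))
print(u.shape)                                 # (64,): output on the finest discretization
```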
Multi-level Graphs. We emphasize that we view the discretization {x1 , . . . , xJ } ⊂ D as a graph
in order to facilitate an efficient implementation through the message passing graph neural network
architecture. Since the V-cycle algorithm works at different levels of the discretization, we build
multi-level graphs to represent the coarser and finer discretizations. We present and utilize two con-
structions of multi-level graphs, the orthogonal multipole graph and the generalized random graph.
The orthogonal multipole graph is the standard grid construction used in the fast multipole method
which is adapted to a uniform grid, see e.g. (Greengard and Rokhlin, 1997). In this construction, the
decomposition in (19) is orthogonal in that the finest graph only captures the closest range interac-
tion, the second finest graph captures the second closest interaction minus the part already captured
in the previous graph and so on, recursively. In particular, the ranges of interaction for each kernel
do not overlap. While this construction is usually efficient, it is limited to uniform grids which
may be a bottleneck for certain applications. Our second construction is the generalized random
graph as shown in Figure 3 where the ranges of the kernels are allowed to overlap. The generalized
random graph is very flexible as it can be applied on any domain geometry and discretization. Fur-
ther it can also be combined with random sampling methods to work on problems where J is very
large or combined with an active learning method to adaptively choose the regions where a finer
discretization is needed.
Linear Complexity. Each term in the decomposition (19) is represented by the kernel matrix
Kl,l for l = 1, . . . , L, and Kl+1,l , Kl,l+1 for l = 1, . . . , L − 1 corresponding to the appropri-
ate sub-discretization. Therefore the complexity of the multipole method is Σ_{l=1}^{L} O(J_l² r_l^d) + Σ_{l=1}^{L−1} O(J_l J_{l+1} r_l^d) = Σ_{l=1}^{L} O(J_l² r_l^d). By designing the sub-discretization so that O(J_l² r_l^d) ≤ O(J), we can obtain complexity linear in J. For example, when d = 2, pick r_l = 1/√J_l and J_l = O(2^{−l} J) such that r_L is large enough so that there exists a ball of radius r_L containing D. Then clearly Σ_{l=1}^{L} O(J_l² r_l^d) = O(J). By combining with a Nyström approximation, we can obtain O(J′) complexity for some J′ ≪ J.
We note that the set Z_{k_max} is not the canonical choice for the low frequency modes of v_t. Indeed, the low frequency modes are usually defined by placing an upper bound on the ℓ¹-norm of k ∈ Z^d. We choose Z_{k_max} as above since it allows for an efficient implementation. Figure 5 gives a pictorial
representation of an entire Neural Operator architecture employing Fourier layers.
Figure 5: top: The architecture of the neural operators; bottom: Fourier layer.
(a) The full architecture of neural operator: start from input a. 1. Lift to a higher dimension channel
space by a neural network P. 2. Apply T (typically T = 4) layers of integral operators and activation
functions. 3. Project back to the target dimension by a neural network Q. Output u. (b) Fourier layers:
Start from input v. On top: apply the Fourier transform F; a linear transform R on the lower Fourier modes
which also filters out the higher modes; then apply the inverse Fourier transform F −1 . On the bottom: apply
a local linear transform W .
The Discrete Case and the FFT. Assuming the domain D is discretized with J ∈ N points, we
can treat v ∈ CJ×n and F(v) ∈ CJ×n . Since we convolve v with a function which only has kmax
Fourier modes, we may simply truncate the higher modes to obtain F(v) ∈ Ckmax ×n . Multiplication
by the weight tensor R ∈ C^{k_max×m×n} is then
(R · (Fv_t))_{k,l} = Σ_{j=1}^{n} R_{k,l,j} (Fv_t)_{k,j},   k = 1, . . . , k_max,  l = 1, . . . , m.   (27)
When the discretization is uniform with resolution s1 × · · · × sd = J, F can be replaced by the Fast
Fourier Transform. For v ∈ CJ×n , k = (k1 , . . . , kd ) ∈ Zs1 × · · · × Zsd , and x = (x1 , . . . , xd ) ∈ D,
the FFT F̂ and its inverse F̂ −1 are defined as
(F̂v)_l(k) = Σ_{x_1=0}^{s_1−1} · · · Σ_{x_d=0}^{s_d−1} v_l(x_1, . . . , x_d) e^{−2iπ Σ_{j=1}^{d} x_j k_j / s_j},
(F̂^{−1}v)_l(x) = Σ_{k_1=0}^{s_1−1} · · · Σ_{k_d=0}^{s_d−1} v_l(k_1, . . . , k_d) e^{2iπ Σ_{j=1}^{d} x_j k_j / s_j}.
Choices for R. In general, R can be defined to depend on (Fa), the Fourier transform of the
input a ∈ A to parallel our construction (8). Indeed, we can define Rϕ : Zd × Cda → Cm×n as
a parametric function that maps (k, (Fa)(k)) to the values of the appropriate Fourier modes. We have experimented with the following parameterizations of R_ϕ:
• Direct. Define the parameters ϕ_k ∈ C^{m×n} for each wave number k: R_ϕ(k, (Fa)(k)) := ϕ_k.
• Linear. Define the parameters ϕ_{k1} ∈ C^{m×n×d_a}, ϕ_{k2} ∈ C^{m×n} for each wave number k: R_ϕ(k, (Fa)(k)) := ϕ_{k1} (Fa)(k) + ϕ_{k2}.
• Feed-forward neural network. Let Φ_ϕ : Z^d × C^{d_a} → C^{m×n} be a neural network with parameters ϕ: R_ϕ(k, (Fa)(k)) := Φ_ϕ(k, (Fa)(k)).
We find that the linear parameterization has a similar performance to the direct parameterization
above, however, it is not as efficient both in terms of computational complexity and the number of
parameters required. On the other hand, we find that the feed-forward neural network parameteriza-
tion has a worse performance. This is likely due to the discrete structure of the space Zd ; numerical
evidence suggests neural networks are not adept at handling this structure. Our experiments in this
work focus on the direct parameterization presented above.
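The following is a minimal NumPy sketch of a single Fourier layer on a uniform two-dimensional grid with the direct parameterization: FFT, truncation to the lowest k_max modes along each dimension (a simplification of the symmetric mode set used in practice), per-mode complex multiplication as in (27), inverse FFT, and the local linear term W. All sizes and the random weights are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
s1, s2, n, m, kmax = 64, 64, 3, 3, 12           # grid resolution, channel widths, kept modes

# Direct parameterization: an independent complex m x n matrix per retained Fourier mode.
R = rng.standard_normal((kmax, kmax, m, n)) + 1j * rng.standard_normal((kmax, kmax, m, n))
W = rng.standard_normal((m, n))                 # local linear operator acting pointwise

def fourier_layer(v):
    """Map a real field v of shape (s1, s2, n) to a real field of shape (s1, s2, m)."""
    v_hat = np.fft.fft2(v, axes=(0, 1))                        # spectral coefficients
    out_hat = np.zeros((s1, s2, m), dtype=complex)
    # Multiply the retained low modes by R (cf. (27)); all other modes are filtered out.
    out_hat[:kmax, :kmax] = np.einsum("xylj,xyj->xyl", R, v_hat[:kmax, :kmax])
    spectral = np.fft.ifft2(out_hat, axes=(0, 1)).real         # back to physical space
    local = v @ W.T                                            # pointwise term W v(x)
    return np.tanh(spectral + local)

v = rng.standard_normal((s1, s2, n))
print(fourier_layer(v).shape)                                  # (64, 64, 3)
```

Because the weights R act on a fixed number of Fourier modes rather than on grid values, the same parameters can be applied to inputs sampled at any resolution s1 × s2 ≥ k_max.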
Invariance to Discretization. The Fourier layers are discretization-invariant because they can
learn from and evaluate functions which are discretized in an arbitrary way. Since parameters are
learned directly in Fourier space, resolving the functions in physical space simply amounts to pro-
jecting on the basis elements e2πi⟨x,k⟩ ; these are well-defined everywhere on Cd .
Quasi-linear Complexity. The weight tensor R contains kmax < J modes, so the inner multipli-
cation has complexity O(kmax ). Therefore, the majority of the computational cost lies in computing
the Fourier transform F(v) and its inverse. General Fourier transforms have complexity O(J 2 ),
however, since we truncate the series the complexity is in fact O(Jkmax ), while the FFT has com-
plexity O(J log J). Generally, we have found using FFTs to be very efficient, however, a uniform
discretization is required.
Non-uniform and Non-periodic Geometry. The Fourier neural operator model is defined based
on Fourier transform operations accompanied by local residual operations and potentially additive
bias function terms. These operations are mainly defined on general geometries, function spaces,
and choices of discretization. They are not limited to rectangular domains, periodic functions, or
uniform grids. In this paper, we instantiate these operations on uniform grids and periodic functions
in order to develop fast implementations that enjoy spectral convergence and utilize methods such as
fast Fourier transform. In order to maintain a fast and memory-efficient method, our implementation
of the Fourier neural operator relies on the fast Fourier transform which is only defined on uniform
mesh discretizations of D = Td , or for functions on the square satisfying homogeneous Dirichlet
(fast Fourier sine transform) or homogeneous Neumann (fast Fourier cosine transform) boundary
conditions. However, the fast implementation of the Fourier neural operator can be applied in more
general geometries via Fourier continuations. Given any compact manifold D = M, we can always
embed it into a periodic cube (torus),
i : M → Td
where the regular FFT can be applied. Conventionally, in numerical analysis applications, the em-
bedding i is defined through a continuous extension by fitting polynomials (Bruno et al., 2007).
However, in the Fourier neural operator, the idea can be applied simply by padding the input with
zeros. The loss is computed only on the original space during training. The Fourier neural operator
will automatically generate a smooth extension to the padded domain in the output space.
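A small NumPy illustration of the zero-padding form of Fourier continuation described above: a non-periodic input on [0, 1] is extended by zeros onto a larger, implicitly periodic domain before the FFT-based operations are applied, and only the original portion of the output is retained (e.g. when computing the loss). The padding width and the toy spectral filter are arbitrary choices.

```python
import numpy as np

J, pad, kmax = 128, 32, 16                      # grid size, padding width, retained modes
x = np.linspace(0.0, 1.0, J)
v = x**2                                        # non-periodic input on [0, 1]

v_padded = np.pad(v, (0, pad))                  # embed into a larger periodic domain
v_hat = np.fft.fft(v_padded)
v_hat[kmax:-kmax] = 0.0                         # toy spectral operation on the low modes only
out_padded = np.fft.ifft(v_hat).real            # the padded region carries the smooth extension
out = out_padded[:J]                            # keep only the original domain [0, 1]
print(out.shape)                                # (128,)
```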
4.5 Summary
We summarize the main computational approaches presented in this section and their complexity:
• GNO: Subsample J′ points from the J-point discretization and compute the truncated integral
u(x) = ∫_{B(x,r)} κ(x, y) v(y) dy   (28)
at a O(JJ′) complexity.
• LNO: Decompose the kernel function in tensor product form and compute
u(x) = Σ_{j=1}^{r} ⟨ψ^(j), v⟩ φ^(j)(x)   (29)
at a O(J) complexity.
• MGNO: Compute the multi-resolution decomposition (20) of the kernel via the V-cycle algorithm (24)-(25)
at a O(J) complexity.
• FNO: Parameterize the kernel in the Fourier domain and compute it using the FFT,
u(x) = F^{−1}(R · (Fv))(x)   ∀x ∈ D,
at a O(J log J) complexity.
5.1 DeepONets
We will now draw a parallel between the recently proposed DeepONet architecture in Lu et al.
(2019), a map from finite-dimensional spaces to function spaces, and the neural operator frame-
work. We will show that if we use a particular, point-wise parameterization of the first kernel in a
NO and discretize the integral operator, we obtain a DeepONet. However, such a parameterization
breaks the notion of discretization invariance because the number of parameters depends on the dis-
cretization of the input function. Therefore such a model cannot be applied to arbitrarily discretized
functions and its number of parameters goes to infinity as we take the limit to the continuum. This
phenomenon is similar to our discussion in subsection 4.1 where a NO parametrization which is
inconsistent in function space and breaks discretization invariance yields a CNN. We propose a
modification to the DeepONet architecture, based on the idea of the LNO, which addresses this
issue and gives a discretization invariant neural operator.
Proposition 5 A neural operator with a point-wise parameterized first kernel and discretized inte-
gral operators yields a DeepONet.
Proof We work with (11) where we choose W0 = 0 and denote b0 by b. For simplicity, we will
consider only real-valued functions i.e. da = du = 1 and set dv0 = dv1 = n and dv2 = p for
some n, p ∈ N. Define P : R → R^n by P(x) = (x, . . . , x) and Q : R^p → R by Q(x) = x_1 + · · · + x_p. Furthermore let κ^(1) : D′ × D → R^{p×n} be defined by components κ^(1)_{jk} : D′ × D → R for j = 1, . . . , n and k = 1, . . . , p. Similarly let κ^(0) : D × D → R^{n×n} be given as κ^(0)(x, y) = diag(κ^(0)_1(x, y), . . . , κ^(0)_n(x, y)) for some κ^(0)_1, . . . , κ^(0)_n : D × D → R. Then (11) becomes
$$(\mathcal{G}_\theta(a))(x) = \sum_{k=1}^{p}\sum_{j=1}^{n} \int_{D} \kappa^{(1)}_{jk}(x, y)\, \sigma\!\left( \int_{D} \kappa^{(0)}_{j}(y, z)\, a(z)\, dz + b_j(y) \right) dy$$
where b(y) = (b_1(y), . . . , b_n(y)) for some b_1, . . . , b_n : D → R. Let x_1, . . . , x_q ∈ D be the points at which the input function a is evaluated and denote by ã = (a(x_1), . . . , a(x_q)) ∈ R^q the vector of evaluations. Choose κ^(0)_j(y, z) = 1(y) w_j(z) for some w_1, . . . , w_n : D → R where 1 denotes the constant function taking the value one. Let
$$w_j(x_l) = \frac{q}{|D|}\, \tilde{w}_{jl}$$
for j = 1, . . . , n and l = 1, . . . , q where w̃jl ∈ R are some constants. Furthermore let bj (y) =
b̃j 1(y) for some constants b̃j ∈ R. Then the Monte Carlo approximation of the inner-integral yields
$$(\mathcal{G}_\theta(a))(x) = \sum_{k=1}^{p}\sum_{j=1}^{n} \int_{D} \kappa^{(1)}_{jk}(x, y)\, \sigma\!\left( \langle \tilde{w}_j, \tilde{a} \rangle_{\mathbb{R}^q} + \tilde{b}_j 1(y) \right) dy$$
where w̃_j = (w̃_{j1}, . . . , w̃_{jq}). Choose κ^(1)_{jk}(x, y) = (c̃_{jk}/|D|) φ_k(x) 1(y) for some constants c̃_{jk} ∈ R
and functions φ1 , . . . , φp : D′ → R. Then we obtain
$$(\mathcal{G}_\theta(a))(x) = \sum_{k=1}^{p}\sum_{j=1}^{n} \tilde{c}_{jk}\, \sigma\!\left( \langle \tilde{w}_j, \tilde{a} \rangle_{\mathbb{R}^q} + \tilde{b}_j \right) \varphi_k(x) = \sum_{k=1}^{p} G_k(\tilde{a})\, \varphi_k(x) \qquad (32)$$
where Gk : Rq → R can be viewed as the components of a single hidden layer neural network
G : Rq → Rp with parameters w̃jl , b̃j , c̃jk . The set of maps φ1 , . . . , φp form the trunk net while
G is the branch net of a DeepONet. Our construction above can clearly be generalized to yield
arbitrary depth branch nets by adding more kernel integral layers, and, similarly, the trunk net can
be chosen arbitrarily deep by parameterizing each φk as a deep neural network.
Since the mappings w1, . . . , wn are parametrized point-wise through their values at the evaluation points of the input a, it is
clear that the construction in the above proof is not discretization invariant. In order to make this
model a discretization invariant neural operator, we propose DeepONet-Operator where, for each
j, we replace the inner product in the finite dimensional space ⟨w̃j , ã⟩Rq with an appropriate inner
product in the function space ⟨wj , a⟩.
$$(\mathcal{G}_\theta(a))(x) = \sum_{k=1}^{p}\sum_{j=1}^{n} \tilde{c}_{jk}\, \sigma\!\left( \langle w_j, a \rangle + \tilde{b}_j \right) \varphi_k(x) \qquad (33)$$
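To make the distinction concrete, the following hedged sketch (all names and the choice of activation are illustrative, not the paper's code) contrasts the plain DeepONet branch, whose inner products ⟨w̃_j, ã⟩ against raw point values tie the parameter count to the number q of evaluation points, with the branch of (33), which approximates the function-space inner product ⟨w_j, a⟩ by quadrature so that the parameters live in the functions w_j and are independent of the discretization.

```python
import torch

def deeponet_branch(a_vals, W_tilde, b_tilde):
    # Plain DeepONet branch: a_vals (batch, q), W_tilde (n, q), b_tilde (n,).
    # The size of W_tilde is tied to the q evaluation points of the input.
    return torch.sigmoid(a_vals @ W_tilde.T + b_tilde)

def operator_branch(a_vals, w_funcs, x, b_tilde, domain_size=1.0):
    # DeepONet-Operator branch of (33): w_funcs maps grid points x (q, d) to (q, n),
    # so the parameter count is independent of how a is discretized.
    quad_weight = domain_size / x.shape[0]
    inner = quad_weight * (a_vals @ w_funcs(x))   # quadrature approximation of <w_j, a>
    return torch.sigmoid(inner + b_tilde)

def combine_with_trunk(branch_out, trunk_out, C_tilde):
    # Sum over j and k as in (33): branch_out (batch, n), C_tilde (n, p), trunk_out (Q, p).
    return (branch_out @ C_tilde) @ trunk_out.T   # (batch, Q) outputs at Q query points
```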
(11) (usually four to five layers are used in the experiments of Section 7), perform better. The ben-
efits of depth are likewise not captured in our analysis in Section 9. We leave further theoretical
studies of approximation properties as an interesting avenue of investigation for future work.
a “word”-valued function on, for example, the domain [0, 1]. Assuming our function is linked to
a sentence with a fixed semantic meaning, adding or removing words from the sentence simply
corresponds to refining or coarsening the discretization of [0, 1]. We will now make this intuition
precise in the proof of the following statement.
Proposition 6 The attention mechanism in transformer models is a special case of a neural opera-
tor layer.
Proof We will show that by making a particular choice of the nonlinear integral kernel operator
(9) and discretizing the integral by a Monte-Carlo approximation, a neural operator layer reduces
to a pre-normalized, single-headed attention, transformer block as originally proposed in (Vaswani
et al., 2017). For simplicity, we assume dvt = n ∈ N and that Dt = D for any t = 0, . . . , T , the
bias term is zero, and W = I is the identity. Furthermore, to simplify notation, we will drop the
layer index t from (10) and, employing (9), obtain
$$u(x) = \sigma\!\left( v(x) + \int_{D} \kappa_{v}\big(x, y, v(x), v(y)\big)\, v(y)\, dy \right) \quad \forall x \in D \qquad (34)$$
a single layer of the neural operator where v : D → Rn is the input function to the layer and we
denote by u : D → Rn the output function. We use the notation κv to indicate that the kernel
depends on the entirety of the function v as well as on its pointwise values v(x) and v(y). While
this is not explicitly done in (9), it is a straightforward generalization. We now pick a specific form for the kernel; in particular, we assume κ_v : R^n × R^n → R^{n×n} does not explicitly depend on the spatial variables (x, y) but only on the input pair (v(x), v(y)). Furthermore, we let
$$\kappa_v\big(v(x), v(y)\big) = g_v\big(v(x), v(y)\big)\, R$$
where R ∈ Rn×n is a matrix of free parameters i.e. its entries are concatenated in θ so they are
learned, and gv : Rn × Rn → R is defined as
$$g_v\big(v(x), v(y)\big) = \left( \int_{D} \exp\!\left( \frac{\langle A v(s),\, B v(y) \rangle}{\sqrt{m}} \right) ds \right)^{-1} \exp\!\left( \frac{\langle A v(x),\, B v(y) \rangle}{\sqrt{m}} \right).$$
Here A, B ∈ Rm×n are again matrices of free parameters, m ∈ N is a hyperparameter, and ⟨·, ·⟩ is
the Euclidean inner-product on Rm . Putting this together, we find that (34) becomes
$$u(x) = \sigma\!\left( v(x) + \int_{D} \frac{\exp\!\left( \langle A v(x),\, B v(y) \rangle / \sqrt{m} \right)}{\int_{D} \exp\!\left( \langle A v(s),\, B v(y) \rangle / \sqrt{m} \right) ds}\; R\, v(y)\, dy \right) \quad \forall x \in D. \qquad (35)$$
Equation (35) can be thought of as the continuum limit of a transformer block. To see this, we will
discretize to obtain the usual transformer block.
To that end, let {x1 , . . . , xk } ⊂ D be a uniformly-sampled, k-point discretization of D and
denote vj = v(xj ) ∈ Rn and uj = u(xj ) ∈ Rn for j = 1, . . . , k. Approximating the inner-integral
in (35) by Monte-Carlo, we have
$$\int_{D} \exp\!\left( \frac{\langle A v(s),\, B v(y) \rangle}{\sqrt{m}} \right) ds \;\approx\; \frac{|D|}{k} \sum_{l=1}^{k} \exp\!\left( \frac{\langle A v_l,\, B v(y) \rangle}{\sqrt{m}} \right).$$
Plugging this into (35) and using the same approximation for the outer integral yields
$$u_j = \sigma\!\left( v_j + \sum_{q=1}^{k} \frac{\exp\!\left( \langle A v_j,\, B v_q \rangle / \sqrt{m} \right)}{\sum_{l=1}^{k} \exp\!\left( \langle A v_l,\, B v_q \rangle / \sqrt{m} \right)}\; R\, v_q \right), \quad j = 1, \ldots, k. \qquad (36)$$
Equation (36) can be viewed as a Nyström approximation of (35). Define the vectors zq ∈ Rk by
$$z_q = \frac{1}{\sqrt{m}} \big( \langle A v_1, B v_q \rangle, \ldots, \langle A v_k, B v_q \rangle \big), \quad q = 1, \ldots, k.$$
Furthermore, if we re-parametrize R = Rout Rval where Rout ∈ Rn×m and Rval ∈ Rm×n are
matrices of free parameters, we obtain
$$u_j = \sigma\!\left( v_j + \sum_{q=1}^{k} R_{\mathrm{out}}\, S_j(z_q)\, R_{\mathrm{val}}\, v_q \right), \quad j = 1, \ldots, k$$
which is precisely the single-headed attention, transformer block with no layer normalization ap-
plied inside the activation function. In the language of transformers, the matrices A, B, and Rval
correspond to the queries, keys, and values functions respectively. We note that tricks such as layer
normalization (Ba et al., 2016) can be adapted in a straightforward manner to the continuum setting
and incorporated into (35). Furthermore multi-headed self-attention can be realized by simply al-
lowing κv to be a sum over multiple functions with form gv R all of which have separate trainable
parameters. Including such generalizations yields the continuum limit of the transformer as imple-
mented in practice. We do not pursue this here as our goal is simply to draw a parallel between the
two methods.
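As a concrete illustration of (36), the sketch below (illustrative names, not the authors' code) forms the normalized kernel weights from the point values v_1, . . . , v_k and applies R = R_out R_val, recovering a pre-normalized single-headed attention update; note that, following the Monte Carlo derivation above, the normalization runs over the sampled points l for each fixed q.

```python
import torch

def attention_as_neural_operator(v, A, B, R_out, R_val):
    # v: (k, n) point values v_j = v(x_j); A, B: (m, n); R_out: (n, m); R_val: (m, n).
    m = A.shape[0]
    scores = (v @ A.T) @ (v @ B.T).T / m**0.5   # scores[j, q] = <A v_j, B v_q> / sqrt(m)
    weights = torch.softmax(scores, dim=0)      # normalize over l for fixed q, as in (36)
    update = weights @ (v @ R_val.T) @ R_out.T  # sum_q weights[j, q] * R_out R_val v_q
    return torch.relu(v + update)               # sigma applied outside the sum
```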
Even though transformers are special cases of neural operators, the standard attention mech-
anism is memory and computation intensive, as seen in Section 6, compared to neural operator
architectures developed here (7)-(9). The high computational complexity of transformers is evident
in (35) since we must evaluate a nested integral of v for each x ∈ D. Recently, efficient attention
mechanisms have been explored, e.g. long-short Zhu et al. (2021) and adaptive FNO-based atten-
tion mechanisms (Guibas et al., 2021). However, many of the efficient vision transformer architec-
tures (Choromanski et al., 2020; Dosovitskiy et al., 2020) like ViTs are not special cases of neural
operators since they use CNN layers to generate tokens, which are not discretization invariant.
6. Test Problems
A central application of neural operators is learning solution operators defined by parametric partial
differential equations. In this section, we define four test problems for which we numerically study
the approximation properties of neural operators. To that end, let (A, U, F) be a triplet of Banach
spaces. The first two problem classes considered are derived from the following general class of
PDEs:
$$L_a u = f \qquad (37)$$
where, for every a ∈ A, La : U → F is a, possibly nonlinear, partial differential operator, and u ∈ U
corresponds to the solution of the PDE (37) when f ∈ F and appropriate boundary conditions are
imposed. The second class will be evolution equations with initial condition a ∈ A and solution
u(t) ∈ U at every time t > 0. We seek to learn the map from a to u := u(τ ) for some fixed time
τ > 0; we will also study maps on paths (time-dependent solutions).
Our goal will be to learn the mappings
G † : a 7→ u or G † : f 7→ u;
we will study both cases, depending on the test problem considered. We will define a probability
measure µ on A or F which will serve to define a model for likely input data. Furthermore, measure
µ will define a topology on the space of mappings in which G † lives, using the Bochner norm (3).
We will assume that each of the spaces (A, U, F) are Banach spaces of functions defined on a
bounded domain D ⊂ Rd . All reported errors will be Monte-Carlo estimates of the relative error
$$\mathbb{E}_{a \sim \mu}\!\left[ \frac{\|\mathcal{G}^\dagger(a) - \mathcal{G}_\theta(a)\|_{L^2(D)}}{\|\mathcal{G}^\dagger(a)\|_{L^2(D)}} \right]$$
or equivalently replacing a with f in the above display and with the assumption that U ⊆ L2 (D).
The domain D will be discretized, usually uniformly, with J ∈ N points.
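A minimal sketch of this Monte Carlo error estimate on a uniform grid (assuming batched tensors of predicted and true solutions) is given below; on a uniform grid the quadrature weights cancel between numerator and denominator.

```python
import torch

def relative_l2_error(pred, true):
    # pred, true: (N, J) solutions evaluated on the J-point discretization of D.
    num = torch.linalg.norm(pred - true, dim=-1)
    den = torch.linalg.norm(true, dim=-1)
    return (num / den).mean()   # Monte Carlo average over the N test inputs
```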
for some source function f : (0, 1) → R). In particular, for D(L) := H01 ((0, 1); R) ∩ H 2 ((0, 1); R),
we have L : D(L) → L^2((0, 1); R) defined as −d^2/dx^2, noting that L has no dependence on
any parameter a ∈ A in this case. We will consider the weak form of (38) with source function
f ∈ H −1 ((0, 1); R) and therefore the solution operator G † : H −1 ((0, 1); R) → H01 ((0, 1); R)
defined as
G † : f 7→ u.
We define the probability measure µ = N (0, C) where
$$C = (L + I)^{-2},$$
defined through the spectral theory of self-adjoint operators. Since µ charges a subset of L2 ((0, 1); R),
we will learn G † : L2 ((0, 1); R) → H01 ((0, 1); R) in the topology induced by (3).
In this setting, G † has a closed-form solution given as
$$\mathcal{G}^\dagger(f) = \int_0^1 G(\cdot, y)\, f(y)\, dy$$
where
$$G(x, y) = \frac{1}{2}\big(x + y - |y - x|\big) - xy, \quad \forall (x, y) \in [0, 1]^2$$
is the Green’s function. Note that while G † is a linear operator, the Green’s function G is non-linear
as a function of its arguments. We will consider only a single layer of (6) with σ1 = Id, P = Id,
Q = Id, W0 = 0, b0 = 0, and
$$\mathcal{K}_0(f) = \int_0^1 \kappa_\theta(\cdot, y)\, f(y)\, dy$$
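A minimal sketch of this single-layer construction, assuming a uniform grid on (0, 1), a small feed-forward network for κ_θ, and a simple Riemann-sum quadrature (all illustrative choices), is:

```python
import torch
import torch.nn as nn

class SingleLayerKernelOperator(nn.Module):
    """Sketch of u(x) = int_0^1 kappa_theta(x, y) f(y) dy on a uniform grid."""

    def __init__(self, width=64):
        super().__init__()
        self.kappa = nn.Sequential(
            nn.Linear(2, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, 1),
        )

    def forward(self, f, x):
        # f: (batch, J) values of the source on the grid x; x: (J,) uniform grid on (0, 1).
        J = x.shape[0]
        xx, yy = torch.meshgrid(x, x, indexing="ij")
        K = self.kappa(torch.stack([xx, yy], dim=-1).reshape(-1, 2)).reshape(J, J)
        return f @ K.T / J   # Riemann sum with uniform quadrature weight 1/J
```

After training against data generated from the closed-form solution, the learned kernel κ_θ can be compared directly with the Green's function G, as in Figure 7.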
where D = (0, 1)2 is the unit square. In this setting A = L∞ (D; R+ ), U = H01 (D; R), and
F = H −1 (D; R). We fix f ≡ 1 and consider the weak form of (39) and therefore the solution
operator G † : L∞ (D; R+ ) → H01 (D; R) defined as
G † : a 7→ u. (40)
Note that while (39) is a linear PDE, the solution operator G † is nonlinear. We define the probability
measure µ = T♯ N (0, C) as the pushforward of a Gaussian measure under the operator T where the
covariance of the Gaussian is
C = (−∆ + 9I)−2
with D(−∆) defined to impose zero Neumann boundary conditions on the Laplacian. We define T to be the Nemytskii operator acting on functions, induced by the map T : R → R+ given by
$$T(x) = \begin{cases} 12, & x \geq 0 \\ 3, & x < 0. \end{cases}$$
The random variable a ∼ µ is a piecewise-constant function with random interfaces given by the
underlying Gaussian random field. Such constructions are prototypical models for many physical
systems such as permeability in sub-surface flows and (in a vector generalization) material mi-
crostructures in elasticity.
To create the dataset used for training, solutions to (39) are obtained using a second-order finite
difference scheme on a uniform grid of size 421 × 421. All other resolutions are downsampled from
this data set. We use N = 1000 training examples.
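A hedged sketch of this construction (not the authors' data-generation code) draws a Gaussian random field with covariance (−∆ + 9I)^{−2} via a truncated Karhunen-Loève expansion in the zero-Neumann cosine basis and pushes it through T; the truncation level and grid size below are illustrative.

```python
import numpy as np

def sample_darcy_coefficient(s=64, k_max=16, seed=0):
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 1.0, s)
    X, Y = np.meshgrid(x, x, indexing="ij")
    field = np.zeros((s, s))
    for k1 in range(k_max):
        for k2 in range(k_max):
            if k1 == 0 and k2 == 0:
                continue
            lam = (np.pi**2 * (k1**2 + k2**2) + 9.0) ** -2          # eigenvalue of C
            norm = np.sqrt(2.0) if (k1 == 0 or k2 == 0) else 2.0    # L2-normalized cosine basis
            phi = np.cos(np.pi * k1 * X) * np.cos(np.pi * k2 * Y)
            field += rng.standard_normal() * np.sqrt(lam) * norm * phi
    return np.where(field >= 0.0, 12.0, 3.0)   # Nemytskii map T: piecewise-constant coefficient
```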
$$\frac{\partial}{\partial t} u(x, t) + \frac{1}{2} \frac{\partial}{\partial x}\big(u(x, t)^2\big) = \nu \frac{\partial^2}{\partial x^2} u(x, t), \qquad x \in (0, 2\pi),\ t \in (0, \infty) \qquad (41)$$
$$u(x, 0) = u_0(x), \qquad x \in (0, 2\pi)$$
with periodic boundary conditions and a fixed viscosity ν = 10^{-1}. Let Ψ : L^2_per((0, 2π); R) × R+ → H^s_per((0, 2π); R), for any s > 0, be the flow map associated to (41), in particular,
We consider the solution operator defined by evaluating Ψ at a fixed time: fix any s ≥ 0, then we may define G† : L^2_per((0, 2π); R) → H^s_per((0, 2π); R) as the evaluation of Ψ at that fixed time. We define the probability measure µ = N(0, C) where
$$C = 625\left( -\frac{d^2}{dx^2} + 25\, I \right)^{-2}$$
with domain of the Laplacian defined to impose periodic boundary conditions. We chose the initial
condition for (41) by drawing u0 ∼ µ, noting that µ charges a subset of L2per ((0, 2π); R).
To create the dataset used for training, solutions to (41) are obtained using a pseudo-spectral
split step method where the heat equation part is solved exactly in Fourier space and then the non-
linear part is advanced using a forward Euler method with a very small time step. We use a uniform
spatial grid with 213 = 8192 collocation points and subsample all other resolutions from this data
set. We use N = 1000 training examples.
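A minimal sketch of such a split-step scheme, under assumed (illustrative) step sizes rather than the exact settings used to generate the data set:

```python
import numpy as np

def burgers_split_step(u0, nu=0.1, T=1.0, dt=1e-4):
    # u0: initial condition sampled on a uniform grid of (0, 2*pi) with periodic BCs.
    J = u0.shape[0]
    k = 2 * np.pi * np.fft.fftfreq(J, d=2 * np.pi / J)   # integer wavenumbers
    u = u0.copy()
    for _ in range(int(T / dt)):
        # Nonlinear step, forward Euler: u_t = -d/dx (u^2 / 2).
        u = u - dt * np.real(np.fft.ifft(1j * k * np.fft.fft(0.5 * u**2)))
        # Heat step, solved exactly in Fourier space.
        u = np.real(np.fft.ifft(np.exp(-nu * k**2 * dt) * np.fft.fft(u)))
    return u
```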
where T2 is the unit torus i.e. [0, 1]2 equipped with periodic boundary conditions, and ν ∈ R+ is a
fixed viscosity. Here u : T2 × R+ → R2 is the velocity field, p : T2 × R+ → R is the pressure
field, and f : T2 → R is a fixed forcing function.
Equivalently, we study the vorticity-streamfunction formulation of the equation
G † : w0 7→ Ψ(w0 , T ) (45)
for some fixed T > 0. In the second, we will map an initial part of the trajectory to a later part of the
trajectory. In particular, we define G† : L^2(T^2; R) × C((0, 10]; H^s(T^2; R)) → C((10, T]; H^s(T^2; R)) by
$$\mathcal{G}^\dagger : \big( w_0,\, \Psi(w_0, t)|_{t \in (0,10]} \big) \mapsto \Psi(w_0, t)|_{t \in (10,T]} \qquad (46)$$
for some fixed T > 10. We define the probability measure µ = N (0, C) where
with periodic boundary conditions on the Laplacian. We model the initial vorticity for (44) as w0 ∼ µ, noting that µ charges a subset of L^2(T^2; R). Its pushforward onto Ψ(w0, t)|_{t∈(0,10]} is required to define the measure on the input space in the second case, defined by (46).
To create the dataset used for training, solutions to (44) are obtained using a pseudo-spectral
split step method where the viscous terms are advanced using a Crank–Nicolson update and the
nonlinear and forcing terms are advanced using Heun’s method. Dealiasing is used with the 2/3
rule. For further details on this approach see (Chandler and Kerswell, 2013). Data is obtained on a
uniform 256 × 256 grid and all other resolutions are subsampled from this data set. We experiment
with different viscosities ν, final times T , and amounts of training data N .
$$y = \mathcal{O}\big(\mathcal{G}^\dagger(w_0)\big) + \eta \qquad (47)$$
6.4.2 S PECTRA
Because of the constant-in-time forcing term the energy reaches a non-zero equilibrium in time
which is statistically reproducible for different initial conditions. To compare the complexity of the
solution to the Navier-Stokes problem outlined in subsection 6.4 we show, in Figure 6, the Fourier
spectrum of the solution data at time t = 50 for three different choices of the viscosity ν. The
figure demonstrates that, for a wide range of wavenumbers k, which grows as ν decreases, the rate
of decay of the spectrum is −5/3, matching what is expected in the turbulent regime (Kraichnan,
1967). This is a statistically stationary property of the equation, sustained for all positive times.
7. Numerical Results
In this section, we compare the proposed neural operator with other supervised learning approaches,
using the four test problems outlined in Section 6. In Subsection 7.1 we study the Poisson equation,
and learning a Green's function; Subsection 7.2 considers the coefficient-to-solution map for steady
Darcy flow, and the initial condition to solution at positive time map for Burgers equation. In
subsection 7.3 we study the Navier-Stokes equation.
We compare with a variety of architectures found by discretizing the data and applying finite-
dimensional approaches, as well as with other operator-based approximation methods; further de-
tailed comparison of other operator-based approximation methods may be found in De Hoop et al.
(2022), where the issue of error versus cost (with cost defined in various ways such as evaluation
time of the network, amount of data required) is studied. We do not compare against traditional
solvers (FEM/FDM/Spectral), although our methods, once trained, enable evaluation of the input
to output map orders of magnitude more quickly than by use of such traditional solvers on com-
plex problems. We demonstrate the benefits of this speed-up in a prototypical application, Bayesian
inversion, in Subsubsection 7.3.4.
All the computations are carried out on a single Nvidia V100 GPU with 16GB memory. The code is available at https://github.com/zongyi-li/graph-pde and https://github.com/zongyi-li/fourier_neural_operator.
Setup of the Four Methods: We construct the neural operator by stacking four integral operator
layers as specified in (6) with the ReLU activation. No batch normalization is needed. Unless
otherwise specified, we use N = 1000 training instances and 200 testing instances. We use the
Adam optimizer to train for 500 epochs with an initial learning rate of 0.001 that is halved every
100 epochs. We set the channel dimensions dv0 = · · · = dv3 = 64 for all one-dimensional problems
and dv0 = · · · = dv3 = 32 for all two-dimensional problems. The kernel networks κ(0) , . . . , κ(3)
are standard feed-forward neural networks with three layers and widths of 256 units. We use the
following abbreviations to denote the methods introduced in Section 4.
• GNO: The method introduced in subsection 4.1, truncating the integral to a ball with radius
r = 0.25 and using the Nyström approximation with J ′ = 300 sub-sampled nodes.
• LNO: The low-rank method introduced in subsection 4.2 with rank r = 4.
• MGNO: The multipole method introduced in subsection 4.3. On the Darcy flow problem,
we use the random construction with three graph levels, each sampling J1 = 400, J2 =
100, J3 = 25 nodes respectively. On the Burgers' equation problem, we use the or-
thogonal construction without sampling.
• FNO: The Fourier method introduced in subsection 4.4. We set kmax,j = 16 for all one-
dimensional problems and kmax,j = 12 for all two-dimensional problems.
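The shared training setup described above can be sketched as follows; the relative L2 loss and the data-loader interface are illustrative assumptions rather than the exact released scripts.

```python
import torch

def train(model, train_loader, epochs=500, lr=1e-3):
    # Adam with initial learning rate 1e-3, halved every 100 epochs.
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=100, gamma=0.5)
    for _ in range(epochs):
        for a, u in train_loader:
            optimizer.zero_grad()
            pred = model(a)
            loss = torch.linalg.norm(pred - u) / torch.linalg.norm(u)   # relative L2 loss
            loss.backward()
            optimizer.step()
        scheduler.step()
    return model
```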
Remark on the Resolution. Traditional PDE solvers such as FEM and FDM approximate a single
function and therefore their error to the continuum decreases as the resolution is increased. The fig-
ures we show here exhibit something different: the error is independent of resolution, once enough
resolution is used, but is not zero. This reflects the fact that there is a residual approximation error,
in the infinite dimensional limit, from the use of a finite-parametrized neural operator, trained on a
finite amount of data. Invariance of the error with respect to (sufficiently fine) resolution is a de-
sirable property that demonstrates that an intrinsic approximation of the operator has been learned,
independent of any specific discretization; see Figure 8. Furthermore, resolution-invariant operators
can do zero-shot super-resolution, as shown in Subsubsection 7.3.1.
Figure 7: Kernel for the one-dimensional Green's function, learned with the Nyström approximation method. Left: learned kernel function; right: the analytic Green's function. This is a proof of concept of the graph kernel network on the one-dimensional Poisson equation, comparing the learned kernel with the true one.
• FCN is a state-of-the-art neural network method based on Fully Convolutional Networks (Zhu and Zabaras, 2018). It has the best performance on the small grid s = 61, but fully convolutional networks are mesh-dependent and therefore their error grows when moving to a larger grid.
(a) Benchmarks on Burgers' equation; (b) benchmarks on Darcy flow for different resolutions; train and test on the same resolution. For acronyms, see Section 7; details in Tables 3 and 2.
• RBM is the classical Reduced Basis Method (using a PCA basis), which is widely used in
applications and provably obtains mesh-independent error (DeVore, 2014). This method has
good performance, but the solutions can only be evaluated on the same mesh as the training
data and one needs knowledge of the PDE to employ it.
• DeepONet is the Deep Operator network (Lu et al., 2019) that comes equipped with an ap-
proximation theory (Lanthaler et al., 2021). We use the unstacked version with width 200
which is precisely defined in the original work (Lu et al., 2019). We use standard fully con-
nected neural networks with 8 layers and width 200.
• U-Net: A popular choice for image-to-image regression tasks consisting of four blocks with
2-d convolutions and deconvolutions Ronneberger et al. (2015).
• TF-Net: A network designed for learning turbulent flows based on a combination of spatial
and temporal convolutions Wang et al. (2020).
• FNO-2d: 2-d Fourier neural operator with an auto-regressive structure in time. We use the
Fourier neural operator to model the local evolution from the previous 10 time steps to the
next time step, and iteratively apply the model to get the long-term trajectory. We set
kmax,j = 12, dv = 32.
• FNO-3d: 3-d Fourier neural operator that directly convolves in space-time. We use the
Fourier neural operator to model the global evolution from the initial 10 time steps directly to
the long-term trajectory. We set kmax,j = 12, dv = 20.
As shown in Table 4, the FNO-3D has the best performance when there is sufficient data (ν =
10−3 , N = 1000 and ν = 10−4 , N = 10000). For the configurations where the amount of data
is insufficient (ν = 10−4 , N = 1000 and ν = 10−5 , N = 1000), all methods have > 15% error
with FNO-2D achieving the lowest error in our hyperparameter search. Note that we only present
results for spatial resolution 64 × 64 since all benchmarks we compare against are designed for this
resolution. Increasing the spatial resolution degrades their performance while FNO achieves the
same errors.
Auto-regressive (2D) and Temporal Convolution (3D). We investigate two standard formulations
to model the time evolution: the auto-regressive model (2D) and the temporal convolution model
(3D). Auto-regressive models: FNO-2D, U-Net, TF-Net, and ResNet all do 2D-convolution in the
Table 4: Benchmarks on Navier Stokes (fixing resolution 64 × 64 for both training and testing).
spatial domain and recurrently propagate in the time domain (2D+RNN). The operator maps the
solution at previous time steps to the next time step (2D functions to 2D functions). Temporal
convolution models: on the other hand, FNO-3D performs convolution in space-time – it approx-
imates the integral in time by a convolution. FNO-3D maps the initial time interval directly to the
full trajectory (3D functions to 3D functions). The 2D+RNN structure can propagate the solution to
any arbitrary time T in increments of a fixed interval length ∆t, while the Conv3D structure is fixed
to the interval [0, T] but can transfer the solution to an arbitrary time-discretization. We find that the 2D method works better for short time sequences, while the 3D method is more expressive and easier to train on longer sequences.
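A sketch of the auto-regressive (2D + recurrence) rollout described above, assuming a model that maps the previous 10 time steps to the next one (shapes illustrative):

```python
import torch

def rollout_2d(model, initial_steps, n_future):
    # initial_steps: (batch, 10, H, W) solution at the first 10 time steps.
    history = initial_steps
    frames = []
    for _ in range(n_future):
        next_step = model(history)                               # (batch, 1, H, W)
        frames.append(next_step)
        history = torch.cat([history[:, 1:], next_step], dim=1)  # slide the 10-step window
    return torch.cat(frames, dim=1)                              # (batch, n_future, H, W)
```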
and the final decoder network Q recover the high frequency modes. As an example, consider a solu-
tion to the Navier-Stokes equation with viscosity ν = 10−3 . Truncating this function at 20 Fourier
modes yields an error around 2% as shown in Figure 13, while the Fourier neural operator learns
the parametric dependence and produces approximations to an error of ≤ 1% with only kmax,j = 12
parameterized modes.
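The truncation comparison can be reproduced in sketch form: the function below keeps only the lowest k_max modes (in each dimension) of a 2-d field and reports the relative L2 reconstruction error; the masking pattern is a rough illustration rather than the exact experiment.

```python
import torch

def fourier_truncation_error(w, k_max=20):
    # w: (H, W) real field, e.g. a Navier-Stokes vorticity snapshot.
    w_hat = torch.fft.fft2(w)
    mask = torch.zeros_like(w_hat)
    mask[:k_max, :k_max] = 1    # keep low positive/negative frequencies in each dimension
    mask[:k_max, -k_max:] = 1
    mask[-k_max:, :k_max] = 1
    mask[-k_max:, -k_max:] = 1
    w_trunc = torch.fft.ifft2(w_hat * mask).real
    return torch.linalg.norm(w_trunc - w) / torch.linalg.norm(w)
```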
Traditional Fourier methods work only with periodic boundary conditions. However, the Fourier
neural operator does not have this limitation. This is due to the local linear transform W (and the bias term), which keeps track of the non-periodic boundary. As an example, the Darcy flow problem and the time
domain of Navier-Stokes have non-periodic boundary conditions, and the Fourier neural operator
still learns the solution operator with excellent accuracy.
As discussed in Section 6.4.1, we use the pCN method of Cotter et al. (2013) to draw samples from
the posterior distribution of initial vorticities in the Navier-Stokes equation given sparse, noisy ob-
servations at time T = 50. We compare the Fourier neural operator acting as a surrogate model with
the traditional solvers used to generate our train-test data (both run on GPU). We generate 25,000
samples from the posterior (with a 5,000 sample burn-in period), requiring 30,000 evaluations of
the forward operator.
As shown in Figure 14, FNO and the traditional solver recover almost the same posterior mean
which, when pushed forward, recovers well the later-time solution of the Navier-Stokes equation.
In sharp contrast, FNO takes 0.005s to evaluate a single instance while the traditional solver, after
being optimized to use the largest possible internal time-step which does not lead to blow-up, takes
2.2s. This amounts to 2.5 minutes for the MCMC using FNO and over 18 hours for the traditional
solver. Even if we account for data generation and training time (offline steps) which take 12 hours,
using FNO is still faster. Once trained, FNO can be used to quickly perform multiple MCMC
runs for different initial conditions and observations, while the traditional solver will take 18 hours
for every instance. Furthermore, since FNO is differentiable, it can easily be applied to PDE-
constrained optimization problems in which adjoint calculations are used as part of the solution
procedure.
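A hedged sketch of the pCN loop with the FNO acting as the forward map; the observation operator, noise level, proposal step size, and prior sampler are illustrative assumptions, not the exact settings used in the experiment.

```python
import torch

def pcn_mcmc(fno, observe, y, sample_prior, n_samples=25000, beta=0.1, noise_std=1.0):
    # Misfit Phi(w0) = ||observe(fno(w0)) - y||^2 / (2 sigma^2); pCN proposal preserves the prior.
    def phi(w0):
        return 0.5 * torch.sum((observe(fno(w0)) - y) ** 2) / noise_std**2
    w = sample_prior()
    samples = []
    for _ in range(n_samples):
        prop = (1 - beta**2) ** 0.5 * w + beta * sample_prior()
        accept_prob = torch.exp(torch.clamp(phi(w) - phi(prop), max=0.0))
        if torch.rand(()) < accept_prob:
            w = prop
        samples.append(w)
    return samples
```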
Figure 14: Results of the Bayesian inverse problem for the Navier-Stokes equation.
The top left panel shows the true initial vorticity while bottom left panel shows the true observed vorticity at
T = 50 with black dots indicating the locations of the observation points placed on a 7 × 7 grid. The top
middle panel shows the posterior mean of the initial vorticity given the noisy observations estimated with
MCMC using the traditional solver, while the top right panel shows the same thing but using FNO as a
surrogate model. The bottom middle and right panels show the vorticity at T = 50 when the respective
approximate posterior means are used as initial conditions.
In this section we compare the four methods in terms of expressiveness, complexity, refinability, and ingenuity.
7.4.1 I NGENUITY
First we will discuss ingenuity, in other words, the design of the frameworks. The first method,
GNO, relies on the Nyström approximation of the kernel, or the Monte Carlo approximation of the
integration. It is the simplest and most straightforward method. The second method, LNO, relies on
the low-rank decomposition of the kernel operator. It is efficient when the kernel has a near low-
rank structure. The third method, MGNO, is the combination of the first two. It has a hierarchical,
multi-resolution decomposition of the kernel. The last one, FNO, is different from the first three; it
restricts the integral kernel to induce a convolution.
GNO and MGNO are implemented using graph neural networks, which helps to define sampling and integration. The graph network library also allows sparse and distributed message passing. The LNO and FNO do not use sampling; they are faster since they do not rely on the graph library.
Table 6: Ingenuity.
7.4.2 E XPRESSIVENESS
We measure the expressiveness by the training and testing error of the method. The full O(J 2 )
integration always has the best results, but it is usually too expensive. As shown in the experiments
7.2.1 and 7.2.2, GNO usually has good accuracy, but its performance suffers from sampling. LNO
works the best on the 1d problem (Burgers equation). It has difficulty on the 2d problem because it
does not employ sampling to speed up evaluation. MGNO has a multi-level structure, which gives it the benefits of the first two. Finally, FNO has the best overall performance. It is also the only
method that can capture the challenging Navier-Stokes equation.
7.4.3 C OMPLEXITY
The complexities of the four methods are listed in Table 7. GNO and MGNO use sampling; their complexity depends on the number of sampled nodes J′. When using all nodes, they are still quadratic. LNO has the lowest complexity O(J). FNO, when using the fast Fourier transform, has complexity O(J log J).
In practice, FNO is faster than the other three methods because it does not have the kernel network κ. MGNO is relatively slower because of its multi-level graph structure.
7.4.4 R EFINABILITY
Refinability measures the number of parameters used in the framework. Table 8 lists the relative error on Darcy flow with respect to different numbers of parameters. Because GNO, LNO, and MGNO have kernel networks, the slope of their error curves is flat: they can work with a very small number of parameters. On the other hand, FNO does not have such a sub-network; it needs a larger number of parameters to obtain an acceptable error rate.
Table 8: Refinability.
The relative error on Darcy flow with respect to different numbers of parameters. The errors above are approximate values rounded to 0.05. They are the lowest test errors achieved by the model, given that the model's number of parameters |θ| is bounded by 10^3, 10^4, 10^5, 10^6 respectively.
7.4.5 ROBUSTNESS
We conclude with experiments investigating the robustness of Fourier neural operator to noise. We
study: a) training on clean (noiseless) data and testing with clean and noisy data; b) training on noisy data and testing with clean and noisy data. When creating noisy data we map a to
noisy a′ as follows: at every grid-point x we set
where ξ ∼ N (0, 1) is drawn i.i.d. at every grid point; this is similar to the setting adopted in Lu et al.
(2021b). We also study the 1d advection equation as an additional test case, following the setting in
Lu et al. (2021b) in which the input data is a random square wave, defined by an R3 -valued random
variable.
As shown in the top half of Table 9 and Figure 15, we observe the Fourier neural operator is
robust with respect to the (test) noise level on all four problems. In particular, on the advection prob-
lem, it has about 10% error with 10% noise. The Darcy and Navier-Stokes operators are smoothing, and the Fourier neural operator obtains lower than 10% error in all scenarios. However, the FNO is less robust on the advection equation, which is not smoothing, and on the Burgers equation which, whilst smoothing, also forms steep fronts.
Table 9: Robustness.
A straightforward approach to enhance the robustness is to train the model with noise. As shown
in the bottom half of Table 9, the Fourier neural operator has no gap between the clean data and noisy
data when training with noise. However, noise in training may degrade the performance on the clean
data, as a trade-off. In general, augmenting the training data with noise leads to robustness. For
example, in the auto-regressive modeling of dynamical systems, training the model with noise will
reduce error accumulation in time, and thereby help the model to predict over longer time-horizons
(Pfaff et al., 2020). We also observed that other regularization techniques such as early-stopping
and weight decay improve robustness. Using a higher spatial resolution also helps.
The advection problem is a hard problem for the FNO since it has discontinuities; similar issues
arise when using spectral methods for conservation laws. One can modify the architecture to address
such discontinuities accordingly. For example, Wen et al. (2021) enhance the FNO by composing a
CNN or UNet branch with the Fourier layer; the resulting composite model outperforms the basic
FNO on multiphase flow with high contrast and sharp shocks. However the CNN and UNet take
the method out of the realm of discretization-invariant methods; further work is required to design
discretization-invariant image-processing tools, such as the identification of discontinuities.
8. Literature Review
We outline the major neural network-based approaches for the solution of PDEs.
the ability to transfer the solution between meshes. The work Ummenhofer et al. (2020) proposed a
continuous convolution network for fluid problems, where off-grid points are sampled and linearly
interpolated. However the continuous convolution method is still constrained by the underlying
grid which prevents generalization to higher resolutions. Similarly, to get finer resolution solution,
Jiang et al. (2020) proposed learning super-resolution with a U-Net structure for fluid mechanics
problems. However fine-resolution data is needed for training, while neural operators are capable
of zero-shot super-resolution with no new data.
DeepONet A novel operator regression architecture, named DeepONet, was recently proposed by
Lu et al. (2019, 2021a); it builds an iterated or deep structure on top of the shallow architecture
proposed in Chen and Chen (1995). The architecture consists of two neural networks: a branch
net applied on the input functions and a trunk net applied on the querying locations in the output
space. The original work of Chen and Chen (1995) provides a universal approximation theorem,
and more recently Lanthaler et al. (2021) developed an error estimate for DeepONet itself. The
standard DeepONet structure is a linear approximation of the target operator, where the trunk net
and branch net learn the coefficients and basis. On the other hand, the neural operator setting is
heavily inspired by the advances in deep learning and is a non-linear approximation, which makes it
constructively more expressive. A detailed discussion of DeepONet is provided in Section 5.1, as well as a numerical comparison to DeepONet in Section 7.2.
Physics Informed Neural Networks (PINNs), Deep Ritz Method (DRM), and Deep Galerkin
Method (DGM). A different approach is to directly parameterize the solution u as a neural net-
work u : D̄ × Θ → R (E and Yu, 2018; Raissi et al., 2019; Sirignano and Spiliopoulos, 2018;
Bar and Sochen, 2019; Smith et al., 2020; Pan and Duraisamy, 2020; Beck et al., 2021). This
approach is designed to model one specific instance of the PDE, not the solution operator. It is
mesh-independent, but for any given new parameter coefficient function a ∈ A, one would need
to train a new neural network ua which is computationally costly and time consuming. Such an
approach closely resembles classical methods such as finite elements, replacing the linear span of a
finite set of local basis functions with the space of neural networks.
ML-based Hybrid Solvers Similarly, another line of work proposes to enhance existing numeri-
cal solvers with neural networks by building hybrid models (Pathak et al., 2020; Um et al., 2020a;
Greenfeld et al., 2019). These approaches suffer from the same computational issue as classical
methods: one needs to solve an optimization problem for every new parameter similarly to the
PINNs setting. Furthermore, the approaches are limited to a setting in which the underlying PDE is
known. Purely data-driven learning of a map between spaces of functions is not possible.
Reduced Basis Methods. Our methodology most closely resembles the classical reduced basis
method (RBM) (DeVore, 2014) or the method of Cohen and DeVore (2015). The method intro-
duced here, along with the contemporaneous work introduced in the papers (Bhattacharya et al.,
2020; Nelsen and Stuart, 2021; Opschoor et al., 2020; Schwab and Zech, 2019; O’Leary-Roseberry
et al., 2020; Lu et al., 2019; Fresca and Manzoni, 2022), are, to the best of our knowledge, amongst
the first practical supervised learning methods designed to learn maps between infinite-dimensional
spaces. Our methodology addresses the mesh-dependent nature of the approach in the papers (Guo
et al., 2016; Zhu and Zabaras, 2018; Adler and Oktem, 2017; Bhatnagar et al., 2019) by produc-
ing a single set of network parameters that can be used with different discretizations. Furthermore,
it has the ability to transfer solutions between meshes and indeed between different discretization
methods. Moreover, it needs only to be trained once on the equation set {a_j, u_j}_{j=1}^N. Then, obtain-
ing a solution for a new a ∼ µ only requires a forward pass of the network, alleviating the major
computational issues incurred in (E and Yu, 2018; Raissi et al., 2019; Herrmann et al., 2020; Bar
and Sochen, 2019) where a different network would need to be trained for each input parameter.
Lastly, our method requires no knowledge of the underlying PDE: it is purely data-driven and there-
fore non-intrusive. Indeed the true map can be treated as a black-box, perhaps to be learned from
experimental data or from the output of a costly computer simulation, not necessarily from a PDE.
Continuous Neural Networks. Using continuity as a tool to design and interpret neural networks
is gaining currency in the machine learning community, and the formulation of ResNet as a con-
tinuous time process over the depth parameter is a powerful example of this (Haber and Ruthotto,
2017; E, 2017). The concept of defining neural networks in infinite-dimensional spaces is a central
problem that has long been studied (Williams, 1996; Neal, 1996; Roux and Bengio, 2007; Glober-
son and Livni, 2016; Guss, 2016). The general idea is to take the infinite-width limit which yields a
non-parametric method and has connections to Gaussian Process Regression (Neal, 1996; Matthews
et al., 2018; Garriga-Alonso et al., 2018), leading to the introduction of deep Gaussian processes
(Damianou and Lawrence, 2013; Dunlop et al., 2018). Thus far, such methods have not yielded
efficient numerical algorithms that can parallel the success of convolutional or recurrent neural net-
works for the problem of approximating mappings between finite dimensional spaces. Despite the
superficial similarity with our proposed work, this body of work differs substantially from what
we are proposing: in our work we are motivated by the continuous dependence of the data, in the
input or output spaces, in spatial or spatio-temporal variables; in contrast the work outlined in this
paragraph uses continuity in an artificial algorithmic depth or width parameter to study the network
architecture when the depth or width approaches infinity, but the input and output spaces remain of
fixed finite dimension.
Nyström Approximation, GNNs, and Graph Neural Operators (GNOs). The graph neural op-
erators (Section 4.1) have an underlying Nyström approximation formulation (Nyström, 1930) which
links different grids to a single set of network parameters. This perspective relates our continuum
approach to Graph Neural Networks (GNNs). GNNs are a recently developed class of neural net-
works that apply to graph-structured data; they have been used in a variety of applications. Graph
networks incorporate an array of techniques from neural network design such as graph convolu-
tion, edge convolution, attention, and graph pooling (Kipf and Welling, 2016; Hamilton et al., 2017;
Gilmer et al., 2017; Veličković et al., 2017; Murphy et al., 2018). GNNs have also been applied to
the modeling of physical phenomena such as molecules (Chen et al., 2019) and rigid body systems
(Battaglia et al., 2018) since these problems exhibit a natural graph interpretation: the particles are
the nodes and the interactions are the edges. The work (Alet et al., 2019) performs an initial study
that employs graph networks on the problem of learning solutions to Poisson’s equation, among
other physical applications. They propose an encoder-decoder setting, constructing graphs in the
latent space, and utilizing message passing between the encoder and decoder. However, their model
uses a nearest neighbor structure that is unable to capture non-local dependencies as the mesh size
is increased. In contrast, we directly construct a graph in which the nodes are located on the spatial
domain of the output function. Through message passing, we are then able to directly learn the ker-
nel of the network which approximates the PDE solution. When querying a new location, we simply
add a new node to our spatial graph and connect it to the existing nodes, avoiding interpolation error
by leveraging the power of the Nyström extension for integral operators.
Low-rank Kernel Decomposition and Low-rank Neural Operators (LNOs). Low-rank de-
composition is a popular method used in kernel methods and Gaussian processes (Kulis et al., 2006;
Bach, 2013; Lan et al., 2017; Gardner et al., 2018). We present the low-rank neural operator in
Section 4.2 where we structure the kernel network as a product of two factor networks inspired by
Fredholm theory. The low-rank method, while simple, is very efficient and easy to train especially
when the target operator is close to linear. Khoo and Ying (2019) proposed a related neural network
with low-rank structure to approximate the inverse of differential operators. The framework of two
factor networks is also similar to the trunk and branch network used in DeepONet (Lu et al., 2019).
But in our work, the factor networks are defined on the physical domain and non-local information
is accumulated through integration with respect to the Lebesgue measure. In contrast, DeepONet(s)
integrate against delta measures at a set of pre-defined nodal points that are usually taken to be the
grid on which the data is given. See Section 5.1 for further discussion.
Fan et al. (2019c,b); He and Xu (2019) propose a similar multipole expansion for solving parametric
PDEs on structured grids. However, the classical FMM requires nested grids as well as the explicit
form of the PDEs. In Section 4.3, we propose the multipole graph neural operator (MGNO) by
generalizing this idea to arbitrary graphs in the data-driven setting, so that the corresponding graph
neural networks can learn discretization-invariant solution operators which are fast and can work on
complex geometries.
Fourier Transform, Spectral Methods, and Fourier Neural Operators (FNOs). The Fourier
transform is frequently used in spectral methods for solving differential equations since differen-
tiation is equivalent to multiplication in the Fourier domain. Fourier transforms have also played
an important role in the development of deep learning. They are used in theoretical work, such as
the proof of the neural network universal approximation theorem (Hornik et al., 1989) and related
results for random feature methods (Rahimi and Recht, 2008); empirically, they have been used
to speed up convolutional neural networks (Mathieu et al., 2013). Neural network architectures
involving the Fourier transform or the use of sinusoidal activation functions have also been pro-
posed and studied (Bengio et al., 2007; Mingo et al., 2004; Sitzmann et al., 2020). Recently, some
spectral methods for PDEs have been extended to neural networks (Fan et al., 2019a,c; Kashinath
et al., 2020). In Section 4.4, we build on these works by proposing the Fourier neural operator
architecture defined directly in Fourier space with quasi-linear time complexity and state-of-the-art
approximation capabilities.
Sources of Error In this paper we will study the error resulting from approximating an operator
(mapping between Banach spaces) from within a class of finitely-parameterized operators. We show
that the resulting error, expressed in terms of universal approximation of operators over a compact
set or in terms of a resulting risk, can be driven to zero by increasing the number of parameters, and
refining the approximations inherent in the neural operator architecture. In practice there will be two
other sources of approximation error: firstly from the discretization of the data; and secondly from
the use of empirical risk minimization over a finite data set to determine the parameters. Balancing
all three sources of error is key to making algorithms efficient. However we do not study these other
two sources of error in this work. Furthermore we do not study how the number of parameters in
our approximation grows as the error tolerance is refined. Generally, this growth may be super-
exponential as shown in (Kovachki et al., 2021). However, for certain classes of operators and
related approximation methods, it is possible to beat the curse of dimensionality; we refer the reader
to the works (Lanthaler et al., 2021; Kovachki et al., 2021) for detailed analyses demonstrating this.
Finally we also emphasize that there is a potential source of error from the optimization procedure
which attempts to minimize the empirical risk: it may not achieve the global minimum. Analysis
of this error in the context of operator approximation has not been undertaken.
9. Approximation Theory
The paper by Chen and Chen (1995) provides the first universal approximation theorem for operator
approximation via neural networks, and the paper by Bhattacharya et al. (2020) provides an alter-
native architecture and approximation result. The analysis of Chen and Chen (1995) was recently
extended in significant ways in the paper by Lanthaler et al. (2021) where, for the first time, the curse
of dimensionality is addressed, and resolved, for certain specific operator learning problems, using
the DeepONet generalization Lu et al. (2019, 2021a) of Chen and Chen (1995). The paper Lanthaler
et al. (2021) was generalized to study operator approximation, and the curse of dimensionality, for
the FNO, in Kovachki et al. (2021).
Unlike the finite-dimensional setting, the choice of input and output spaces A and U for the
mapping G † play a crucial role in the approximation theory due to the distinctiveness of the induced
norm topologies. In this section, we prove universal approximation theorems for neural operators
both with respect to the topology of uniform convergence over compact sets and with respect to
the topology induced by the Bochner norm (3). We focus our attention on the Lebesgue, Sobolev,
continuous, and continuously differentiable function classes as they have numerous applications
in scientific computing and machine learning problems. Unlike the results of Bhattacharya et al.
(2020); Kovachki et al. (2021) which rely on the Hilbertian structure of the input and output spaces
or the results of Chen and Chen (1995); Lanthaler et al. (2021) which rely on the continuous func-
tions, our results extend to more general Banach spaces as specified by Assumptions 9 and 10 (stated
in Section 9.3) and are, to the best of our knowledge, the first of their kind to apply at this level of
generality.
Our method of proof proceeds by making use of the following two observations. First we estab-
lish the Banach space approximation property Grothendieck (1955) for the input and output spaces
of interest, which allows for a finite dimensionalization of the problem. In particular, we prove that
the Banach space approximation property holds for various function spaces defined on Lipschitz
domains; the precise result we need, while unsurprising, seems to be missing from the functional
analysis literature and so we provide statement and proof. Details are given in Appendix A. Second,
we establish that integral kernel operators with smooth kernels can be used to approximate linear
functionals of various input spaces. In doing so, we establish a Riesz-type representation theorem
for the continuously differentiable functions. Such a result is not surprising and mimics the well-
known result for Sobolev spaces; however in the form we need it we could not find the result in the
functional analysis literature and so we provide statement and proof. Details are given in Appendix
B. With these two facts, we construct a neural operator which linearly maps any input function to a
finite vector then non-linearly maps this vector to a new finite vector which is then used to form the
coefficients of a basis expansion for the output function. We reemphasize that our approximation
theory uses the fact that neural operators can be reduced to a linear method of approximation (as
pointed out in Section 5.1) and does not capture any benefits of nonlinear approximation. However
these benefits are present in the architecture and are exploited by the trained networks we find in
practice. Exploiting their nonlinear nature to potentially obtain improved rates of approximation
remains an interesting direction for future research.
The rest of this section is organized as follows. In Subsection 9.1, we define allowable activation
functions and the set of neural operators used in our theory, noting that they constitute a subclass of
the neural operators defined in Section 5. In Subsection 9.3, we state and prove our main universal
approximation theorems.
We define the set of R^{d′}-valued neural networks simply by stacking real-valued networks
$$\mathcal{N}_n(\sigma; \mathbb{R}^d, \mathbb{R}^{d'}) := \big\{ f : \mathbb{R}^d \to \mathbb{R}^{d'} \ :\ f(x) = \big(f_1(x), \ldots, f_{d'}(x)\big),\ f_1, \ldots, f_{d'} \in \mathcal{N}_n(\sigma; \mathbb{R}^d) \big\}.$$
We remark that we could have defined N_n(σ; R^d, R^{d′}) by letting W_n ∈ R^{d′×d_n} and b_n ∈ R^{d′} in the definition of N_n(σ; R^d) because we allow arbitrary width, making the two definitions equivalent; however, the definition as presented is more convenient for our analysis. We also employ the preceding definition with R^d and R^{d′} replaced by spaces of matrices. For any m ∈ N_0, we define the set A_m of allowable activation functions as the continuous R → R maps which make neural networks dense in C^m(R^d) on compacta at any fixed depth.
It is shown in (Pinkus, 1999, Theorem 4.1) that {σ ∈ C m (R) : σ is not a polynomial} ⊆ Am with
n = 1. Clearly Am+1 ⊆ Am .
We define the set of linearly bounded activations as
$$A^L_m := \left\{ \sigma \in A_m \ :\ \sigma \text{ is Borel measurable},\ \sup_{x \in \mathbb{R}} \frac{|\sigma(x)|}{1 + |x|} < \infty \right\},$$
noting that any globally Lipschitz, non-polynomial, C^m-function is contained in A^L_m. Most activation functions used in practice fall within this class; for example, ReLU ∈ A^L_0 and ELU ∈ A^L_1, while tanh, sigmoid ∈ A^L_m for any m ∈ N_0.
For approximation in a Bochner norm, we will be interested in constructing globally bounded
neural networks which can approximate the identity over compact sets as done in (Lanthaler et al.,
2021; Bhattacharya et al., 2020). This allows us to control the potential unboundedness of the
support of the input measure by exploiting the fact that the probability of an input must decay
to zero in unbounded regions. Following (Lanthaler et al., 2021), we introduce the forthcoming
definition which uses the notation of the diameter of a set. In particular, the diameter of any set S ⊆ R^d is defined as, for |·|_2 the Euclidean norm on R^d,
$$\mathrm{diam}_2(S) := \sup_{x, y \in S} |x - y|_2.$$
Definition 7 We denote by BA the set of maps σ ∈ A_0 such that, for any compact set K ⊂ R^d, ϵ > 0, and C ≥ diam_2(K), there exists a number n ∈ N and a neural network f ∈ N_n(σ; R^d, R^d) such that
$$|f(x) - x|_2 \leq \epsilon \quad \forall x \in K, \qquad |f(x)|_2 \leq C \quad \forall x \in \mathbb{R}^d.$$
It is shown in (Lanthaler et al., 2021, Lemma C.1) that ReLU ∈ AL0 ∩ BA with n = 3.
We will now define the specific class of neural operators for which we prove a universal approx-
imation theorem. It is important to note that the class with which we work is a simplification of
the one given in (6). In particular, the lifting and projection operators Q, P, together with the final
activation function σn , are set to the identity, and the local linear operators W0 , . . . , Wn−1 are set to
zero. In our numerical studies we have in any case typically set σn to the identity. However we have
found that learning the local operators Q, P and W0 , . . . , Wn−1 is beneficial in practice; extending
the universal approximation theorems given here to explain this benefit would be an important but
non-trivial development of the analysis we present here.
Let D ⊂ R^d be a domain. For any σ ∈ A_0, we define the set of affine kernel integral operators by
$$\mathrm{IO}(\sigma; D, \mathbb{R}^{d_1}, \mathbb{R}^{d_2}) = \Big\{ f \mapsto \int_D \kappa(\cdot, y)\, f(y)\, dy + b \ :\ \kappa \in \mathcal{N}_{n_1}(\sigma; \mathbb{R}^d \times \mathbb{R}^d, \mathbb{R}^{d_2 \times d_1}),\ b \in \mathcal{N}_{n_2}(\sigma; \mathbb{R}^d, \mathbb{R}^{d_2}),\ n_1, n_2 \in \mathbb{N} \Big\},$$
for any d_1, d_2 ∈ N. Clearly, since σ ∈ A_0, any S ∈ IO(σ; D, R^{d_1}, R^{d_2}) acts as S : L^p(D; R^{d_1}) → L^p(D; R^{d_2}) for any 1 ≤ p ≤ ∞ since κ ∈ C(D̄ × D̄; R^{d_2×d_1}) and b ∈ C(D̄; R^{d_2}). For any n ∈ N_{≥2}, d_a, d_u ∈ N, D ⊂ R^d, D′ ⊂ R^{d′} domains, and σ_1 ∈ A^L_0, σ_2, σ_3 ∈ A_0, we define the set of n-layer neural operators by
$$\mathrm{NO}_n(\sigma_1, \sigma_2, \sigma_3; D, D', \mathbb{R}^{d_a}, \mathbb{R}^{d_u}) = \Big\{ f \mapsto \int_D \kappa_n(\cdot, y)\, \big( S_{n-1}\, \sigma_1( \cdots S_2\, \sigma_1( S_1( S_0 f ) ) \cdots ) \big)(y)\, dy \ :\ S_0 \in \mathrm{IO}(\sigma_2, D; \mathbb{R}^{d_a}, \mathbb{R}^{d_1}),\ \ldots,\ S_{n-1} \in \mathrm{IO}(\sigma_2, D; \mathbb{R}^{d_{n-1}}, \mathbb{R}^{d_n}),\ \kappa_n \in \mathcal{N}_l(\sigma_3; \mathbb{R}^{d'} \times \mathbb{R}^d, \mathbb{R}^{d_u \times d_n}),\ d_1, \ldots, d_n, l \in \mathbb{N} \Big\}.$$
When da = du = 1, we will simply write NOn (σ1 , σ2 , σ3 ; D, D′ ). Since σ1 is linearly bounded, we
can use a result about compositions of maps in Lp spaces such as (Dudley and Norvaiša, 2010, The-
orem 7.13) to conclude that any G ∈ NOn (σ1 , σ2 , σ3 , D, D′ ; Rda , Rdu ) acts as G : Lp (D; Rda ) →
Lp (D′ ; Rdu ). Note that it is only in the last layer that we transition from functions defined over
domain D to functions defined over domain D′ .
When the input space of an operator of interest is C m (D̄), for m ∈ N, we will need to take in
derivatives explicitly as they cannot be learned using kernel integration as employed in the current
construction given in Lemma 30; note that this is not the case for W m,p (D) as shown in Lemma 28.
We will therefore define the set of m-th order neural operators by
NO_n^m(σ_1, σ_2, σ_3; D, D′, R^{d_a}, R^{d_u}) = { (∂^{α_1} f, . . . , ∂^{α_{J_m}} f) ↦ G(∂^{α_1} f, . . . , ∂^{α_{J_m}} f) : G ∈ NO_n(σ_1, σ_2, σ_3; D, D′, R^{J_m d_a}, R^{d_u}) },
where α_1, . . . , α_{J_m} ∈ N_0^d is an enumeration of the set {α ∈ N_0^d : 0 ≤ |α|_1 ≤ m}. Since we only use the m-th order operators when dealing with spaces of continuous functions, each element of NO_n^m can be thought of as a mapping from a product space of spaces of the form C^{m−|α_j|}(D̄; R^{d_a}).
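For example, when d = 2 and m = 1, the enumeration may be taken as α_1 = (0, 0), α_2 = (1, 0), α_3 = (0, 1), so that J_1 = 3 and an element of NO_n^1 acts on the triple (f, ∂_{x_1} f, ∂_{x_2} f).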
[Figure 16: Schematic of the approximation used in the universal approximation proofs. Inputs in A are mapped by F to R^J, the map ψ : R^J → R^{J′} acts on the finite-dimensional representation, and G maps R^{J′} into U, so that G ◦ ψ ◦ F approximates G† : A → U.]
The proof, provided in appendix E, constructs a sequence of finite-dimensional maps which approximate the neural operator by Riemann sums and shows uniform convergence of the error over compact sets of A.
Assumption 9 Let D ⊂ R^d be a Lipschitz domain for some d ∈ N. One of the following holds:
3. A = C(D̄).
Assumption 10 Let D′ ⊂ R^{d′} be a Lipschitz domain for some d′ ∈ N. One of the following holds:
3. U = C^{m_2}(D̄′) and m_2 ∈ N_0.
We first show that neural operators are dense in the continuous operators G† : A → U in the topology of uniform convergence on compacta. The proof proceeds by making three main approximations which are schematically shown in Figure 16. First, inputs are mapped to a finite-dimensional representation through a set of appropriate linear functionals on A denoted by F : A → R^J. We show in Lemmas 21 and 23 that, when A satisfies Assumption 9, elements of A* can be approximated by integration against smooth functions. This generalizes the idea from (Chen and Chen, 1995) where functionals on C(D̄) are approximated by a weighted sum of Dirac measures. We then show in Lemma 25 that, by lifting the dimension, this representation can be approximated by a single element of IO. Second, the representation is non-linearly mapped to a new representation by a continuous function ψ : R^J → R^{J′} which finite-dimensionalizes the action of G†. We show, in Lemma 28, that this map can be approximated by a neural operator by reducing the architecture to that of a standard neural network. Third, the new representation is used as the coefficients of an expansion onto representers of U, the map denoted G : R^{J′} → U, which we show can be
approximated by a single IO layer in Lemma 27 using density results for continuous functions. The
structure of the overall approximation is similar to (Bhattacharya et al., 2020) but generalizes the
ideas from working on Hilbert spaces to the spaces in Assumptions 9 and 10. Statements and proofs
of the lemmas used in the theorems are given in the appendices.
Theorem 11 Let Assumptions 9 and 10 hold and suppose G† : A → U is continuous. Let σ_1 ∈ A_0^L, σ_2 ∈ A_0, and σ_3 ∈ A_{m_2}. Then for any compact set K ⊂ A and 0 < ϵ ≤ 1, there exists a number N ∈ N and a neural operator G ∈ NO_N(σ_1, σ_2, σ_3; D, D′) such that
sup_{a∈K} ∥G†(a) − G(a)∥_U ≤ ϵ.
Furthermore, if U is a Hilbert space and σ_1 ∈ BA and, for some M > 0, we have that ∥G†(a)∥_U ≤ M for all a ∈ A, then G can be chosen so that
∥G(a)∥_U ≤ 4M, ∀a ∈ A.
The proof is provided in appendix F. In the following theorem, we extend this result to the case A = C^{m_1}(D̄), showing density of the m_1-th order neural operators.
Furthermore, if U is a Hilbert space and σ1 ∈ BA and, for some M > 0, we have that ∥G † (a)∥U ≤
M for all a ∈ A then G can be chosen so that
∥G(a)∥U ≤ 4M, ∀a ∈ A.
Proof The proof follows as in Theorem 11, replacing the use of Lemma 32 with Lemma 33.
With these results in hand, we show density of neural operators in the space L^2_µ(A; U) where µ is a probability measure and U is a separable Hilbert space. The Hilbertian structure of U allows us to uniformly control the norm of the approximation due to the isomorphism with ℓ^2, as shown in Theorem 11. It remains an interesting future direction to obtain similar results for Banach spaces. The proof follows the ideas in (Lanthaler et al., 2021) where similar results are obtained for DeepONet(s) on L^2(D) by using Lusin's theorem to restrict the approximation to a large enough compact set and exploit the decay of µ outside it. Bhattacharya et al. (2020) also employ a similar approach but explicitly construct the necessary compact set after finite-dimensionalizing.
Theorem 13 Let D′ ⊂ R^{d′} be a Lipschitz domain, m_2 ∈ N_0, and suppose Assumption 9 holds. Let µ be a probability measure on A and suppose G† : A → H^{m_2}(D′) is µ-measurable and G† ∈ L^2_µ(A; H^{m_2}(D′)). Let σ_1 ∈ A_0^L ∩ BA, σ_2 ∈ A_0, and σ_3 ∈ A_{m_2}. Then for any 0 < ϵ ≤ 1, there exists a number N ∈ N and a neural operator G ∈ NO_N(σ_1, σ_2, σ_3; D, D′) such that
∥G† − G∥_{L^2_µ(A; H^{m_2}(D′))} ≤ ϵ.
The proof is provided in appendix G. In the following we extend this result to the case A = C^{m_1}(D̄) using the m_1-th order neural operators, obtaining G such that
∥G† − G∥_{L^2_µ(C^{m_1}(D̄); U)} ≤ ϵ.
Proof The proof follows as in Theorem 13 by replacing the use of Theorem 11 with Theorem 12.
10. Conclusions
We have introduced the concept of Neural Operator, the goal being to construct a neural network architecture adapted to the problem of mapping elements of one function space into elements of another function space. The network is comprised of three steps which, in turn, (i) extract features from the input functions, (ii) iterate a recurrent neural network on feature space, defined through composition of a sigmoid function and a nonlocal operator, and (iii) map from feature space into the output function.
We have studied four nonlocal operators in step (ii), one based on graph kernel networks, one based on a low-rank decomposition, one based on a multi-level graph structure, and the last one based on convolution in Fourier space. The designed network architectures are constructed to be mesh-free and our numerical experiments demonstrate that they have the desired property of being able to train and generalize on different meshes. This is because the networks learn the mapping between infinite-dimensional function spaces, which can then be shared across approximations at different levels of discretization. A further advantage of the integral operator approach is that data may be incorporated on unstructured grids, using the Nyström approximation; these methods, however, are quadratic in the number of discretization points. We describe variants on this methodology, using low-rank and multiscale ideas, to reduce this complexity. The Fourier approach, on the other hand, leads directly to fast methods, with cost that is log-linear in the number of discretization points, provided structured grids are used. We demonstrate that our methods achieve competitive performance with other mesh-free approaches developed in the numerical analysis community. Specifically, the Fourier neural operator achieves the best numerical performance among our experiments, potentially due to the smoothness of the solution function and the underlying uniform grids. The methods developed in the numerical analysis community are less flexible than the approach we introduce here, relying heavily on the structure of an underlying PDE mapping input to output; our method is entirely data-driven.
functions, following the example of the Bayesian inverse problem in Section 7.3.4, or when the underlying model is unknown, as in computer vision or robotics; secondly, the development of more advanced methodologies, beyond the four approximation schemes presented in Section 4, that are more efficient or perform better in specific situations; and thirdly, the development of an underpinning theory which captures the expressive power and approximation error properties of the proposed neural network, following Section 9, and quantifies the computational complexity required to achieve a given error.
10.1.1 New Applications
The proposed neural operator is a black-box surrogate model for function-to-function mappings. It naturally fits into solving PDEs for physics and engineering problems. In the paper we mainly studied three partial differential equations: Darcy flow, Burgers' equation, and the Navier-Stokes equation, which cover a broad range of scenarios. Due to its black-box structure, the neural operator is easily applied to other problems. We foresee applications to more challenging turbulent flows, such as those arising in subgrid models within climate GCMs, to high-contrast media in geological models generalizing the Darcy model, and to general physics simulation for games and visual effects. The operator setting leads to an efficient and accurate representation, and the resolution-invariant properties make it possible to train on a lower-resolution dataset and evaluate at arbitrarily high resolution.
The operator learning setting is not restricted to scientific computing. For example, in computer
vision, images can naturally be viewed as real-valued functions on 2D domains and videos simply
add a temporal structure. Our approach is therefore a natural choice for problems in computer vision
where invariance to discretization is crucial. We leave this as an interesting future direction.
10.1.2 New Methodologies
Despite their excellent performance, there is still room for improvement upon the current methodologies. For example, the full O(J²) integration method still outperforms the FNO by about 40%, albeit at greater cost. It is of potential interest to develop more advanced integration techniques or approximation schemes that follow the neural operator framework. For example, one can use adaptive graphs or probability estimation in the Nyström approximation. It is also possible to use bases other than the Fourier basis, such as the PCA basis or the Chebyshev basis.
Another direction for new methodologies is to combine the neural operator with other settings. The current problem is set up as a supervised learning problem. Instead, one can combine the neural operator with solvers (Pathak et al., 2020; Um et al., 2020b), augmenting and correcting the solvers to obtain faster and more accurate approximations. Similarly, one can combine operator learning with physics constraints (Wang et al., 2021; Li et al., 2021).
10.1.3 Theory
In this work, we develop a universal approximation theory (Section 9) for neural operators. As in the work of Lu et al. (2019) studying universal approximation for DeepONet, we use linear approximation techniques. The power of non-linear approximation (DeVore, 1998), which is likely intrinsic to the success of neural operators in some settings, is still less studied, as discussed in Section 5.1; we note that DeepONet is intrinsically limited by linear approximation properties. For functions between Euclidean spaces, it is well known that, by combining two layers of linear functions with one layer of non-linear activation, a neural network can approximate arbitrary
continuous functions, and that deep neural networks can be exponentially more expressive compared to shallow networks (Poole et al., 2016). However, the issues are less clear when it comes to the choice of architecture and the scaling of the number of parameters within neural operators between Banach spaces. The approximation theory of operators is much more complex and challenging compared to that of functions over Euclidean spaces. It is important to study the class of neural operators with respect to their architecture: what spaces the true solution operators lie in, and which classes of PDEs neural operators approximate efficiently. We leave these as exciting, but open, research directions.
Acknowledgements
Z. Li gratefully acknowledges the financial support from the Kortschak Scholars, PIMCO Fellows, and Amazon AI4Science Fellows programs. A. Anandkumar is supported in part by the Bren endowed chair. K. Bhattacharya, N. B. Kovachki, B. Liu and A. M. Stuart gratefully acknowledge the financial support of the Army Research Laboratory through the Cooperative Agreement Number W911NF-12-0022. Research was sponsored by the Army Research Laboratory and was accomplished under Cooperative Agreement Number W911NF-12-2-0022. AMS is also supported by NSF (award DMS-1818977). Part of this research was developed while K. Azizzadenesheli was with Purdue University. The authors are grateful to Siddhartha Mishra for his valuable feedback on this work.
The views and conclusions contained in this document are those of the authors and should not be
interpreted as representing the official policies, either expressed or implied, of the Army Research
Laboratory or the U.S. Government. The U.S. Government is authorized to reproduce and distribute
reprints for Government purposes notwithstanding any copyright notation herein.
The computations presented here were conducted on the Resnick High Performance Cluster at
the California Institute of Technology.
References
J. Aaronson. An Introduction to Infinite Ergodic Theory. Mathematical surveys and monographs.
American Mathematical Society, 1997. ISBN 9780821804940.
Jonas Adler and Ozan Oktem. Solving ill-posed inverse problems using iterative deep neural
networks. Inverse Problems, nov 2017. doi: 10.1088/1361-6420/aa9581. URL https:
//doi.org/10.1088%2F1361-6420%2Faa9581.
Fernando Albiac and Nigel J. Kalton. Topics in Banach space theory. Graduate Texts in Mathemat-
ics. Springer, 1 edition, 2006.
Ferran Alet, Adarsh Keshav Jeewajee, Maria Bauza Villalonga, Alberto Rodriguez, Tomas Lozano-
Perez, and Leslie Kaelbling. Graph element networks: adaptive, structured computation and
memory. In 36th International Conference on Machine Learning. PMLR, 2019. URL http:
//proceedings.mlr.press/v97/alet19a.html.
Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint
arXiv:1607.06450, 2016.
Francis Bach. Sharp analysis of low-rank kernel matrix approximations. In Conference on Learning
Theory, pages 185–209, 2013.
Leah Bar and Nir Sochen. Unsupervised deep learning algorithm for PDE-based forward and inverse
problems. arXiv preprint arXiv:1904.05417, 2019.
Peter W Battaglia, Jessica B Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi,
Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, et al.
Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261,
2018.
Christian Beck, Sebastian Becker, Philipp Grohs, Nor Jaafari, and Arnulf Jentzen. Solving the
kolmogorov pde by means of deep learning. Journal of Scientific Computing, 88(3), 2021.
Serge Belongie, Charless Fowlkes, Fan Chung, and Jitendra Malik. Spectral partitioning with indef-
inite kernels using the nyström extension. In European conference on computer vision. Springer,
2002.
Yoshua Bengio, Yann LeCun, et al. Scaling learning algorithms towards ai. Large-scale kernel
machines, 34(5):1–41, 2007.
Saakaar Bhatnagar, Yaser Afshar, Shaowu Pan, Karthik Duraisamy, and Shailendra Kaushik. Predic-
tion of aerodynamic flow fields using convolutional neural networks. Computational Mechanics,
pages 1–21, 2019.
Kaushik Bhattacharya, Bamdad Hosseini, Nikola B Kovachki, and Andrew M Stuart. Model reduc-
tion and neural networks for parametric PDEs. arXiv preprint arXiv:2005.03180, 2020.
Andrea Bonito, Albert Cohen, Ronald DeVore, Diane Guignard, Peter Jantsch, and Guergana
Petrova. Nonlinear methods for model reduction. arXiv preprint arXiv:2005.02565, 2020.
Steffen Börm, Lars Grasedyck, and Wolfgang Hackbusch. Hierarchical matrices. Lecture notes, 21:
2003, 2003.
George EP Box. Science and statistics. Journal of the American Statistical Association, 71(356):
791–799, 1976.
John P Boyd. Chebyshev and Fourier spectral methods. Courier Corporation, 2001.
Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal,
Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are
few-shot learners. arXiv preprint arXiv:2005.14165, 2020.
Alexander Brudnyi and Yuri Brudnyi. Methods of Geometric Analysis in Extension and Trace
Problems, volume 1. Birkhäuser Basel, 2012.
Oscar P Bruno, Youngae Han, and Matthew M Pohlman. Accurate, high-order representation of
complex three-dimensional surfaces via fourier continuation analysis. Journal of computational
Physics, 227(2):1094–1125, 2007.
Gary J. Chandler and Rich R. Kerswell. Invariant recurrent solutions embedded in a turbulent two-
dimensional kolmogorov flow. Journal of Fluid Mechanics, 722:554–595, 2013.
Chi Chen, Weike Ye, Yunxing Zuo, Chen Zheng, and Shyue Ping Ong. Graph networks as a uni-
versal machine learning framework for molecules and crystals. Chemistry of Materials, 31(9):
3564–3572, 2019.
Tianping Chen and Hong Chen. Universal approximation to nonlinear operators by neural networks
with arbitrary activation functions and its application to dynamical systems. IEEE Transactions
on Neural Networks, 6(4):911–917, 1995.
Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas
Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, et al. Rethinking attention
with performers. arXiv preprint arXiv:2009.14794, 2020.
Z. Ciesielski and J. Domsta. Construction of an orthonormal basis in C^m(I^d) and W^m_p(I^d). Studia Mathematica, 41:211–224, 1972.
Albert Cohen and Ronald DeVore. Approximation of high-dimensional parametric PDEs. Acta
Numerica, 2015. doi: 10.1017/S0962492915000033.
Albert Cohen, Ronald Devore, Guergana Petrova, and Przemyslaw Wojtaszczyk. Optimal stable
nonlinear approximation. arXiv preprint arXiv:2009.09907, 2020.
S. L. Cotter, G. O. Roberts, A. M. Stuart, and D. White. Mcmc methods for functions: Modifying
old algorithms to make them faster. Statistical Science, 28(3):424–446, Aug 2013. ISSN 0883-
4237. doi: 10.1214/13-sts421. URL http://dx.doi.org/10.1214/13-STS421.
Simon L Cotter, Massoumeh Dashti, James Cooper Robinson, and Andrew M Stuart. Bayesian
inverse problems for functions and applications to fluid mechanics. Inverse problems, 25(11):
115008, 2009.
Andreas Damianou and Neil Lawrence. Deep gaussian processes. In Artificial Intelligence and
Statistics, pages 207–215, 2013.
Maarten De Hoop, Daniel Zhengyu Huang, Elizabeth Qian, and Andrew M Stuart. The cost-
accuracy trade-off in operator learning with neural networks. Journal of Machine Learning,
to appear; arXiv preprint arXiv:2203.13181, 2022.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep
bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
Ronald A. DeVore. Chapter 3: The Theoretical Foundation of Reduced Basis Methods. 2014. doi:
10.1137/1.9781611974829.ch3.
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas
Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An
image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint
arXiv:2010.11929, 2020.
R. M. Dudley and Rimas Norvaiša. Concrete Functional Calculus, volume 149. Springer, 2011. ISBN 978-1-4419-6949-1.
R.M. Dudley and R. Norvaiša. Concrete Functional Calculus. Springer Monographs in Mathemat-
ics. Springer New York, 2010.
Matthew M Dunlop, Mark A Girolami, Andrew M Stuart, and Aretha L Teckentrup. How deep are
deep gaussian processes? The Journal of Machine Learning Research, 19(1):2100–2145, 2018.
Weinan E and Bing Yu. The deep ritz method: A deep learning-based numerical algorithm for
solving variational problems. Communications in Mathematics and Statistics, 3 2018. ISSN
2194-6701. doi: 10.1007/s40304-018-0127-z.
Yuwei Fan, Cindy Orozco Bohorquez, and Lexing Ying. Bcr-net: A neural network based on the
nonstandard wavelet form. Journal of Computational Physics, 384:1–15, 2019a.
Yuwei Fan, Jordi Feliu-Faba, Lin Lin, Lexing Ying, and Leonardo Zepeda-Núnez. A multiscale
neural network based on hierarchical nested bases. Research in the Mathematical Sciences, 6(2):
21, 2019b.
Yuwei Fan, Lin Lin, Lexing Ying, and Leonardo Zepeda-Núnez. A multiscale neural network based
on hierarchical matrices. Multiscale Modeling & Simulation, 17(4):1189–1213, 2019c.
Stefania Fresca and Andrea Manzoni. POD-DL-ROM: Enhancing deep learning-based reduced order models for nonlinear parametrized PDEs by proper orthogonal decomposition. Computer Methods in Applied Mechanics and Engineering, 388:114181, 2022.
Jacob R Gardner, Geoff Pleiss, Ruihan Wu, Kilian Q Weinberger, and Andrew Gordon Wilson.
Product kernel interpolation for scalable gaussian processes. arXiv preprint arXiv:1802.08903,
2018.
Adrià Garriga-Alonso, Carl Edward Rasmussen, and Laurence Aitchison. Deep Convolutional Net-
works as shallow Gaussian Processes. arXiv e-prints, art. arXiv:1808.05587, Aug 2018.
Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural
message passing for quantum chemistry. In Proceedings of the 34th International Conference on
Machine Learning, 2017.
Amir Globerson and Roi Livni. Learning infinite-layer networks: Beyond the kernel trick. CoRR,
abs/1606.05316, 2016. URL http://arxiv.org/abs/1606.05316.
Daniel Greenfeld, Meirav Galun, Ronen Basri, Irad Yavneh, and Ron Kimmel. Learning to optimize
multigrid PDE solvers. In International Conference on Machine Learning, pages 2415–2423.
PMLR, 2019.
Leslie Greengard and Vladimir Rokhlin. A new version of the fast multipole method for the laplace
equation in three dimensions. Acta numerica, 6:229–269, 1997.
John Guibas, Morteza Mardani, Zongyi Li, Andrew Tao, Anima Anandkumar, and Bryan Catan-
zaro. Adaptive fourier neural operators: Efficient token mixers for transformers. arXiv preprint
arXiv:2111.13587, 2021.
Xiaoxiao Guo, Wei Li, and Francesco Iorio. Convolutional neural networks for steady flow ap-
proximation. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, 2016.
William H. Guss. Deep Function Machines: Generalized Neural Networks for Topological Layer
Expression. arXiv e-prints, art. arXiv:1612.04799, Dec 2016.
Eldad Haber and Lars Ruthotto. Stable architectures for deep neural networks. Inverse Problems,
34(1):014004, 2017.
Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs.
In Advances in neural information processing systems, pages 1024–1034, 2017.
Juncai He and Jinchao Xu. Mgnet: A unified framework of multigrid and convolutional neural
network. Science china mathematics, 62(7):1331–1354, 2019.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recog-
nition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages
770–778, 2016.
L Herrmann, Ch Schwab, and J Zech. Deep relu neural network expression rates for data-to-qoi
maps in bayesian PDE inversion. 2020.
Kurt Hornik, Maxwell Stinchcombe, Halbert White, et al. Multilayer feedforward networks are
universal approximators. Neural networks, 2(5):359–366, 1989.
Chiyu Max Jiang, Soheil Esmaeilzadeh, Kamyar Azizzadenesheli, Karthik Kashinath, Mustafa
Mustafa, Hamdi A Tchelepi, Philip Marcus, Anima Anandkumar, et al. Meshfreeflownet: A
physics-constrained deep continuous space-time super-resolution framework. arXiv preprint
arXiv:2005.01463, 2020.
Claes Johnson. Numerical solution of partial differential equations by the finite element method.
Courier Corporation, 2012.
Karthik Kashinath, Philip Marcus, et al. Enforcing physical constraints in cnns through differen-
tiable PDE layer. In ICLR 2020 Workshop on Integration of Deep Neural Models and Differential
Equations, 2020.
Yuehaw Khoo and Lexing Ying. Switchnet: a neural network model for forward and inverse scat-
tering problems. SIAM Journal on Scientific Computing, 41(5):A3182–A3201, 2019.
Yuehaw Khoo, Jianfeng Lu, and Lexing Ying. Solving parametric PDE problems with artificial
neural networks. European Journal of Applied Mathematics, 32(3):421–435, 2021.
Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional net-
works. arXiv preprint arXiv:1609.02907, 2016.
Risi Kondor, Nedelina Teneva, and Vikas Garg. Multiresolution matrix factorization. In Interna-
tional Conference on Machine Learning, pages 1620–1628, 2014.
Nikola Kovachki, Samuel Lanthaler, and Siddhartha Mishra. On universal approximation and error
bounds for Fourier Neural Operators. arXiv preprint arXiv:2107.07562, 2021.
Robert H. Kraichnan. Inertial ranges in two-dimensional turbulence. The Physics of Fluids, 10(7):
1417–1423, 1967.
Brian Kulis, Mátyás Sustik, and Inderjit Dhillon. Learning low-rank kernel matrices. In Proceedings
of the 23rd international conference on Machine learning, pages 505–512, 2006.
Gitta Kutyniok, Philipp Petersen, Mones Raslan, and Reinhold Schneider. A theoretical analysis of
deep neural networks and parametric pdes. Constructive Approximation, 55(1):73–125, 2022.
Liang Lan, Kai Zhang, Hancheng Ge, Wei Cheng, Jun Liu, Andreas Rauber, Xiao-Li Li, Jun Wang,
and Hongyuan Zha. Low-rank decomposition meets kernel learning: A generalized nyström
method. Artificial Intelligence, 250:1–15, 2017.
Samuel Lanthaler, Siddhartha Mishra, and George Em Karniadakis. Error estimates for deeponets:
A deep learning framework in infinite dimensions. arXiv preprint arXiv:2102.09618, 2021.
G. Leoni. A First Course in Sobolev Spaces. Graduate studies in mathematics. American Mathe-
matical Soc., 2009.
Zongyi Li, Nikola Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Kaushik Bhattacharya, An-
drew Stuart, and Anima Anandkumar. Fourier neural operator for parametric partial differential
equations, 2020a.
Zongyi Li, Nikola Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Kaushik Bhattacharya, An-
drew Stuart, and Anima Anandkumar. Multipole graph neural operator for parametric partial
differential equations, 2020b.
Zongyi Li, Nikola Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Kaushik Bhattacharya, An-
drew Stuart, and Anima Anandkumar. Neural operator: Graph kernel network for partial differ-
ential equations. arXiv preprint arXiv:2003.03485, 2020c.
Zongyi Li, Hongkai Zheng, Nikola Kovachki, David Jin, Haoxuan Chen, Burigede Liu, Kamyar
Azizzadenesheli, and Anima Anandkumar. Physics-informed neural operator for learning partial
differential equations. arXiv preprint arXiv:2111.03794, 2021.
Lu Lu, Pengzhan Jin, and George Em Karniadakis. Deeponet: Learning nonlinear operators for
identifying differential equations based on the universal approximation theorem of operators.
arXiv preprint arXiv:1910.03193, 2019.
Lu Lu, Pengzhan Jin, Guofei Pang, Zhongqiang Zhang, and George Em Karniadakis. Learning
nonlinear operators via deeponet based on the universal approximation theorem of operators.
Nature Machine Intelligence, 3(3):218–229, 2021a.
Lu Lu, Xuhui Meng, Shengze Cai, Zhiping Mao, Somdatta Goswami, Zhongqiang Zhang, and
George Em Karniadakis. A comprehensive and fair comparison of two neural operators (with
practical extensions) based on fair data. arXiv preprint arXiv:2111.05512, 2021b.
Michael Mathieu, Mikael Henaff, and Yann LeCun. Fast training of convolutional networks through
ffts, 2013.
Alexander G. de G. Matthews, Mark Rowland, Jiri Hron, Richard E. Turner, and Zoubin Ghahra-
mani. Gaussian Process Behaviour in Wide Deep Neural Networks. Apr 2018.
Luis Mingo, Levon Aslanyan, Juan Castellanos, Miguel Diaz, and Vladimir Riazanov. Fourier
neural networks: An approach with sinusoidal activation functions. 2004.
Ryan L Murphy, Balasubramaniam Srinivasan, Vinayak Rao, and Bruno Ribeiro. Janossy pool-
ing: Learning deep permutation-invariant functions for variable-size inputs. arXiv preprint
arXiv:1811.01900, 2018.
Radford M. Neal. Bayesian Learning for Neural Networks. Springer-Verlag, 1996. ISBN
0387947248.
Nicholas H Nelsen and Andrew M Stuart. The random feature model for input-output maps between
banach spaces. SIAM Journal on Scientific Computing, 43(5):A3212–A3243, 2021.
Evert J Nyström. Über die praktische auflösung von integralgleichungen mit anwendungen auf
randwertaufgaben. Acta Mathematica, 1930.
Thomas O’Leary-Roseberry, Umberto Villa, Peng Chen, and Omar Ghattas. Derivative-informed
projected neural networks for high-dimensional parametric maps governed by pdes. arXiv
preprint arXiv:2011.15110, 2020.
Joost A.A. Opschoor, Christoph Schwab, and Jakob Zech. Deep learning in high dimension: Relu
network expression rates for bayesian PDE inversion. SAM Research Report, 2020-47, 2020.
Shaowu Pan and Karthik Duraisamy. Physics-informed probabilistic learning of linear embeddings
of nonlinear dynamics with guaranteed stability. SIAM Journal on Applied Dynamical Systems,
19(1):480–509, 2020.
Jaideep Pathak, Mustafa Mustafa, Karthik Kashinath, Emmanuel Motheau, Thorsten Kurth, and
Marcus Day. Using machine learning to augment coarse-grid computational fluid dynamics sim-
ulations, 2020.
Aleksander Pełczyński and Michał Wojciechowski. Contribution to the isomorphic classification of
sobolev spaces lpk(omega). Recent Progress in Functional Analysis, 189:133–142, 2001.
Tobias Pfaff, Meire Fortunato, Alvaro Sanchez-Gonzalez, and Peter W. Battaglia. Learning mesh-
based simulation with graph networks, 2020.
Allan Pinkus. Approximation theory of the mlp model in neural networks. Acta Numerica, 8:
143–195, 1999.
Ben Poole, Subhaneil Lahiri, Maithra Raghu, Jascha Sohl-Dickstein, and Surya Ganguli. Exponen-
tial expressivity in deep neural networks through transient chaos. Advances in neural information
processing systems, 29:3360–3368, 2016.
Joaquin Quiñonero Candela and Carl Edward Rasmussen. A unifying view of sparse approximate
gaussian process regression. J. Mach. Learn. Res., 6:1939–1959, 2005.
Ali Rahimi and Benjamin Recht. Uniform approximation of functions with random bases. In 2008
46th Annual Allerton Conference on Communication, Control, and Computing, pages 555–561.
IEEE, 2008.
Maziar Raissi, Paris Perdikaris, and George E Karniadakis. Physics-informed neural networks: A
deep learning framework for solving forward and inverse problems involving nonlinear partial
differential equations. Journal of Computational Physics, 378:686–707, 2019.
Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedi-
cal image segmentation. In International Conference on Medical image computing and computer-
assisted intervention, pages 234–241. Springer, 2015.
Nicolas Le Roux and Yoshua Bengio. Continuous neural networks. In Marina Meila and Xiaotong
Shen, editors, Proceedings of the Eleventh International Conference on Artificial Intelligence and
Statistics, 2007.
Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini.
The graph neural network model. IEEE transactions on neural networks, 20(1):61–80, 2008.
Christoph Schwab and Jakob Zech. Deep learning in high dimension: Neural network expression
rates for generalized polynomial chaos expansions in UQ. Analysis and Applications, 17(01):
19–55, 2019.
Justin Sirignano and Konstantinos Spiliopoulos. Dgm: A deep learning algorithm for solving partial
differential equations. Journal of computational physics, 375:1339–1364, 2018.
Vincent Sitzmann, Julien NP Martel, Alexander W Bergman, David B Lindell, and Gordon
Wetzstein. Implicit neural representations with periodic activation functions. arXiv preprint
arXiv:2006.09661, 2020.
Jonathan D Smith, Kamyar Azizzadenesheli, and Zachary E Ross. Eikonet: Solving the eikonal
equation with deep neural networks. arXiv preprint arXiv:2004.00361, 2020.
Elias M. Stein. Singular Integrals and Differentiability Properties of Functions. Princeton Univer-
sity Press, 1970.
Nicolas Garcia Trillos and Dejan Slepčev. A variational approach to the consistency of spectral
clustering. Applied and Computational Harmonic Analysis, 45(2):239–281, 2018.
Nicolás García Trillos, Moritz Gerlach, Matthias Hein, and Dejan Slepčev. Error estimates for
spectral convergence of the graph laplacian on random geometric graphs toward the laplace–
beltrami operator. Foundations of Computational Mathematics, 20(4):827–887, 2020.
Kiwon Um, Philipp Holl, Robert Brand, Nils Thuerey, et al. Solver-in-the-loop: Learning from
differentiable physics to interact with iterative PDE-solvers. arXiv preprint arXiv:2007.00016,
2020a.
Kiwon Um, Raymond, Fei, Philipp Holl, Robert Brand, and Nils Thuerey. Solver-in-the-loop:
Learning from differentiable physics to interact with iterative PDE-solvers, 2020b.
Benjamin Ummenhofer, Lukas Prantl, Nils Thürey, and Vladlen Koltun. Lagrangian fluid simu-
lation with continuous convolutions. In International Conference on Learning Representations,
2020.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. V. Luxburg,
S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural
Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua
Bengio. Graph attention networks. 2017.
Ulrike Von Luxburg, Mikhail Belkin, and Olivier Bousquet. Consistency of spectral clustering. The
Annals of Statistics, pages 555–586, 2008.
Rui Wang, Karthik Kashinath, Mustafa Mustafa, Adrian Albert, and Rose Yu. Towards physics-
informed deep learning for turbulent flow prediction. In Proceedings of the 26th ACM SIGKDD
International Conference on Knowledge Discovery & Data Mining, pages 1457–1466, 2020.
Sifan Wang, Hanwen Wang, and Paris Perdikaris. Learning the solution operator of parametric
partial differential equations with physics-informed deeponets. arXiv preprint arXiv:2103.10974,
2021.
Gege Wen, Zongyi Li, Kamyar Azizzadenesheli, Anima Anandkumar, and Sally M Benson. U-
fno–an enhanced fourier neural operator based-deep learning model for multiphase flow. arXiv
preprint arXiv:2109.03697, 2021.
Christopher K. I. Williams. Computing with infinite networks. In Proceedings of the 9th Interna-
tional Conference on Neural Information Processing Systems, Cambridge, MA, USA, 1996. MIT
Press.
Chen Zhu, Wei Ping, Chaowei Xiao, Mohammad Shoeybi, Tom Goldstein, Anima Anandkumar,
and Bryan Catanzaro. Long-short transformer: Efficient transformers for language and vision. In
Advances in Neural Information Processing Systems, 2021.
Yinhao Zhu and Nicholas Zabaras. Bayesian deep convolutional encoder–decoder networks
for surrogate modeling and uncertainty quantification. Journal of Computational Physics,
2018. ISSN 0021-9991. doi: https://doi.org/10.1016/j.jcp.2018.04.018. URL http://www.
sciencedirect.com/science/article/pii/S0021999118302341.
Appendix A.
Notation Meaning
Operator Learning
D ⊂ R^d The spatial domain for the PDE.
x ∈ D Points in the spatial domain.
a ∈ A = A(D; R^{d_a}) The input functions (coefficients, boundaries, and/or initial conditions).
u ∈ U = U(D; R^{d_u}) The target solution functions.
D_j The discretization of (a_j, u_j).
G† : A → U The operator mapping the coefficients to the solutions.
µ A probability measure from which the a_j are sampled.
Neural Operator
v(x) ∈ R^{d_v} The neural network representation of u(x).
d_a Dimension of the input a(x).
d_u Dimension of the output u(x).
d_v The dimension of the representation v(x).
t = 0, . . . , T The layer (iteration) in the neural operator.
P, Q The pointwise linear transformations P : a(x) ↦ v_0(x) and Q : v_T(x) ↦ u(x).
K The integral operator in the iterative update v_t ↦ v_{t+1}.
κ : R^{2(d+1)} → R^{d_v×d_v} The kernel, mapping (x, y, a(x), a(y)) to a d_v × d_v matrix.
K ∈ R^{n×n×d_v×d_v} The kernel matrix with K_{xy} = κ(x, y).
W ∈ R^{d_v×d_v} The pointwise linear transformation used as the bias term in the iterative update.
σ The activation function.
For any Banach spaces X and Y, we denote by L(X; Y) the Banach space of continuous linear maps T : X → Y with the operator norm
∥T∥_{L(X;Y)} = sup_{∥x∥_X ≤ 1} ∥T x∥_Y.
We will abuse notation and write ∥ · ∥ for any operator norm when there is no ambiguity about the spaces in question.
Let d ∈ N. We say that D ⊂ Rd is a domain if it is a bounded and connected open set that is
topologically regular i.e. int(D̄) = D. Note that, in the case d = 1, a domain is any bounded, open
interval. For d ≥ 2, we say D is a Lipschitz domain if ∂D can be locally represented as the graph of
a Lipschitz continuous function defined on an open ball of Rd−1 . If d = 1, we will call any domain
a Lipschitz domain. For any multi-index α ∈ Nd0 , we write ∂ α f for the α-th weak partial derivative
of f when it exists.
Let D ⊂ R^d be a domain. For any m ∈ N_0, we define the following spaces:
C(D) = {f : D → R : f is continuous},
C^m(D) = {f : D → R : ∂^α f ∈ C^{m−|α|_1}(D) ∀ 0 ≤ |α|_1 ≤ m},
C_b^m(D) = {f ∈ C^m(D) : max_{0≤|α|_1≤m} sup_{x∈D} |∂^α f(x)| < ∞},
C^m(D̄) = {f ∈ C_b^m(D) : ∂^α f is uniformly continuous ∀ 0 ≤ |α|_1 ≤ m},
and make the equivalent definitions when D is replaced with Rd . Note that any function in C m (D̄)
has a unique, bounded, continuous extension from D to D̄ and is hence uniquely defined on ∂D.
We will work with this extension without further notice. We remark that when D is a Lipschitz
domain, the following definition for C m (D̄) is equivalent
C^m(D̄) = {f : D̄ → R : ∃ F ∈ C^m(R^d) such that f ≡ F|_{D̄}},
see Whitney (1934); Brudnyi and Brudnyi (2012). We define C^∞(D) = ∩_{m=0}^∞ C^m(D) and, similarly, C_b^∞(D) and C^∞(D̄). We further define
C_c^∞(D) = {f ∈ C^∞(D) : supp(f) ⊂ D is compact}
and, again, note that all definitions hold analogously for R^d. We denote by ∥·∥_{C^m} : C_b^m(D) → R_{≥0} the norm
∥f∥_{C^m} = max_{0≤|α|_1≤m} sup_{x∈D} |∂^α f(x)|,
which makes C_b^m(D) (also with D = R^d) and C^m(D̄) Banach spaces. For any n ∈ N, we write C(D; R^n) for the n-fold Cartesian product of C(D) and similarly for all other spaces we have defined or will define subsequently. We will continue to write ∥ · ∥_{C^m} for the norm on C_b^m(D; R^n) and C^m(D̄; R^n) defined as
∥f∥_{C^m} = max_{j∈{1,...,n}} ∥f_j∥_{C^m}.
For any m ∈ N and 1 ≤ p ≤ ∞, we use the notation W m,p (D) for the standard Lp -type Sobolev
space with m derivatives; we refer the reader to Adams and Fournier (2003) for a formal definition.
Furthermore, we, at times, use the notation W 0,p (D) = Lp (D) and W m,2 (D) = H m (D). Since
we use the standard definitions of Sobolev spaces that can be found in any reference on the subject,
we do not give the specifics here.
Appendix B.
In this section we gather various results on the approximation property of Banach spaces. The main
results are Lemma 22 which states that if two Banach spaces have the approximation property then
continuous maps between them can be approximated in a finite-dimensional manner, and Lemma 26
which states the spaces in Assumptions 9 and 10 have the approximation property.
Definition 15 A Banach space X has a Schauder basis if there exist some {φ_j}_{j=1}^∞ ⊂ X and {c_j}_{j=1}^∞ ⊂ X* such that
x = Σ_{j=1}^∞ c_j(x) φ_j, ∀x ∈ X,
with convergence in the norm of X.
We remark that Definition 15 is equivalent to the following. The elements {φ_j}_{j=1}^∞ ⊂ X are called a Schauder basis for X if, for each x ∈ X, there exists a unique sequence {α_j}_{j=1}^∞ ⊂ R such that
lim_{n→∞} ∥x − Σ_{j=1}^n α_j φ_j∥_X = 0.
For the equivalence, see, for example, (Albiac and Kalton, 2006, Theorem 1.1.3). Throughout this paper we will simply write the term basis to mean Schauder basis. Furthermore, we note that if {φ_j}_{j=1}^∞ is a basis then so is {φ_j/∥φ_j∥_X}_{j=1}^∞, so we will assume that any basis we use is normalized.
Definition 16 Let X be a Banach space and U ∈ L(X ; X ). U is called a finite rank operator if
U (X ) ⊆ X is finite dimensional.
By noting that any finite dimensional subspace has a basis, we may equivalently define a finite rank
operator U ∈ L(X ; X ) to be one such that there exists a number n ∈ N and some {φj }nj=1 ⊂ X
and {cj }nj=1 ⊂ X ∗ such that
U x = Σ_{j=1}^n c_j(x) φ_j, ∀x ∈ X.
Definition 17 A Banach space X is said to have the approximation property (AP) if, for any com-
pact set K ⊂ X and ϵ > 0, there exists a finite rank operator U : X → X such that
∥x − U x∥X ≤ ϵ, ∀x ∈ K.
We now state and prove some well-known results about the relationship between basis and the
AP. We were unable to find the statements of the following lemmas in the form given here in the
literature and therefore we provide full proofs.
Lemma 18 Let X be a Banach space with a basis. Then X has the AP.
Proof Since X has a basis {φ_j}_{j=1}^∞ with coordinate functionals {c_j}_{j=1}^∞, there exists a constant C > 0 such that ∥Σ_{j=1}^n c_j(x) φ_j∥_X ≤ C ∥x∥_X for every n ∈ N and x ∈ X; see, for example, (Albiac and Kalton, 2006, Remark 1.1.6). Assume, without loss of generality, that C ≥ 1. Let K ⊂ X be compact and ϵ > 0. Since K is compact, we can find a number
n = n(ϵ, C) ∈ N and elements y1 , . . . , yn ∈ K such that for any x ∈ K there exists a number
l ∈ {1, . . . , n} with the property that
∥x − y_l∥_X ≤ ϵ/(3C).
We can then find a number J = J(ϵ, n) ∈ N such that
max_{j∈{1,...,n}} ∥y_j − Σ_{k=1}^J c_k(y_j) φ_k∥_X ≤ ϵ/3.
Define the finite rank operator U : X → X by U x = Σ_{k=1}^J c_k(x) φ_k. Then, for any x ∈ K, choosing l as above,
∥x − U x∥_X ≤ ∥x − y_l∥_X + ∥y_l − U y_l∥_X + ∥U(y_l − x)∥_X ≤ ϵ/(3C) + ϵ/3 + C ϵ/(3C) ≤ ϵ
as desired.
Lemma 19 Let X be a Banach space with a basis and Y be any Banach space. Suppose there
exists a continuous linear bijection T : X → Y. Then Y has a basis.
Proof Let y ∈ Y and ϵ > 0. Since T is a bijection, there exists an element x ∈ X so that T x = y and T^{-1} y = x. Since X has a basis, we can find {φ_j}_{j=1}^∞ ⊂ X and {c_j}_{j=1}^∞ ⊂ X* and a number n = n(ϵ, ∥T∥) ∈ N such that
∥x − Σ_{j=1}^n c_j(x) φ_j∥_X ≤ ϵ/∥T∥.
Note that
∥y − Σ_{j=1}^n c_j(T^{-1} y) T φ_j∥_Y = ∥T x − T Σ_{j=1}^n c_j(x) φ_j∥_Y ≤ ∥T∥ ∥x − Σ_{j=1}^n c_j(x) φ_j∥_X ≤ ϵ,
hence {T φ_j}_{j=1}^∞ ⊂ Y and {c_j(T^{-1} ·)}_{j=1}^∞ ⊂ Y* form a basis for Y by linearity and continuity of T and T^{-1}.
Lemma 20 Let X be a Banach space with the AP and Y be any Banach space. Suppose there exists
a continuous linear bijection T : X → Y. Then Y has the AP.
Proof Let K ⊂ Y be a compact set and ϵ > 0. The set R = T −1 (K) ⊂ X is compact since T −1
is continuous. Since X has the AP, there exists a finite rank operator U : X → X such that
∥x − U x∥_X ≤ ϵ/∥T∥, ∀x ∈ R.
Define the operator W : Y → Y by W = T U T^{-1}. Clearly W is a finite rank operator since U is a finite rank operator. Let y ∈ K; then, since K = T(R), there exists x ∈ R such that T x = y and x = T^{-1} y. Then
∥y − W y∥_Y = ∥T x − T U x∥_Y ≤ ∥T∥ ∥x − U x∥_X ≤ ϵ,
which shows that Y has the AP.
The following lemma shows that the infinite union of compact sets is compact if each set is the image of a fixed compact set under a convergent sequence of continuous maps. The result is
instrumental in proving Lemma 22.
Lemma 21 Let X, Y be Banach spaces, let K ⊂ X be a compact set, and let F, F_1, F_2, . . . : X → Y be continuous maps such that F_n → F uniformly on K. Then W = ∪_{n=1}^∞ F_n(K) ∪ F(K) is a compact subset of Y.
Proof Let ϵ > 0. Then there exists a number N = N(ϵ) ∈ N such that
sup_{x∈K} ∥F(x) − F_n(x)∥_Y ≤ ϵ/2, ∀n ≥ N.
Define the set
W_N = ∪_{n=1}^N F_n(K) ∪ F(K),
which is compact since F and each Fn are continuous. We can therefore find a number J =
J(ϵ, N ) ∈ N and elements y1 , . . . , yJ ∈ WN such that, for any z ∈ WN , there exists a number
l = l(z) ∈ {1, . . . , J} such that
∥z − y_l∥_Y ≤ ϵ/2.
Let y ∈ W \ WN then there exists a number m > N and an element x ∈ K such that y = Fm (x).
Since F (x) ∈ WN , we can find a number l ∈ {1, . . . , J} such that
∥F(x) − y_l∥_Y ≤ ϵ/2.
Therefore,
∥y − yl ∥Y ≤ ∥Fm (x) − F (x)∥Y + ∥F (x) − yl ∥Y ≤ ϵ
hence {yj }Jj=1 forms a finite ϵ-net for W , showing that W is totally bounded.
We will now show that W is closed. To that end, let {p_n}_{n=1}^∞ be a convergent sequence in W; in particular, p_n ∈ W for every n ∈ N and p_n → p ∈ Y as n → ∞. We can thus find convergent sequences {x_n}_{n=1}^∞ and {α_n}_{n=1}^∞ such that x_n ∈ K, α_n ∈ N_0, and p_n = F_{α_n}(x_n), where we define F_0 := F. Since K is closed, lim_{n→∞} x_n = x ∈ K thus, for each fixed n ∈ N,
The following lemma shows that any continuous operator acting between two Banach spaces
with the AP can be approximated in a finite-dimensional manner. The approximation proceeds
in three steps which are shown schematically in Figure 16. First an input is mapped to a finite-
dimensional representation via the action of a set of functionals on X . This representation is then
mapped by a continuous function to a new finite-dimensional representation which serves as the set
of coefficients onto representers of Y. The resulting expansion is an element of Y that is ϵ-close to
the action of G on the input element. A similar finite-dimensionalization was used in (Bhattacharya
et al., 2020) by using PCA on X to define the functionals acting on the input and PCA on Y to define
the output representers. However the result in that work is restricted to separable Hilbert spaces;
here, we generalize it to Banach spaces with the AP.
Lemma 22 Let X, Y be two Banach spaces with the AP and let G : X → Y be a continuous map. For every compact set K ⊂ X and ϵ > 0, there exist numbers J, J′ ∈ N and continuous linear maps F_J : X → R^J, G_{J′} : R^{J′} → Y as well as φ ∈ C(R^J; R^{J′}) such that
sup_{x∈K} ∥G(x) − (G_{J′} ◦ φ ◦ F_J)(x)∥_Y ≤ ϵ.
Furthermore, there exist elements β_1, . . . , β_{J′} ∈ Y such that
G_{J′}(v) = Σ_{j=1}^{J′} v_j β_j, ∀v ∈ R^{J′}.
If Y admits a basis then {β_j}_{j=1}^{J′} can be picked so that there is an extension {β_j}_{j=1}^∞ ⊂ Y which is a basis for Y.
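The finite-dimensionalization in Lemma 22 can be illustrated numerically. The following sketch (NumPy; the synthetic data, the choice of PCA coefficients as the linear functionals, and the linear least-squares stand-in for the continuous map φ are all our own assumptions, in the spirit of the PCA-based construction of Bhattacharya et al., 2020) encodes inputs by J linear functionals, maps the coefficients to R^{J′}, and decodes onto J′ output representers β_j:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 64))                      # 200 input functions on a 64-point grid
Y = np.tanh(X @ rng.standard_normal((64, 64)) / 8.0)    # surrogate outputs G(X)

J = Jp = 10
# F_J: X -> R^J, projection onto the leading J input PCA modes (J linear functionals)
_, _, Vx = np.linalg.svd(X - X.mean(0), full_matrices=False)
encode = lambda x: (x - X.mean(0)) @ Vx[:J].T

# G_{J'}: R^{J'} -> Y, expansion onto J' output representers beta_j (output PCA modes)
_, _, Vy = np.linalg.svd(Y - Y.mean(0), full_matrices=False)
decode = lambda v: v @ Vy[:Jp] + Y.mean(0)

# phi: R^J -> R^{J'}; here a linear least-squares stand-in for the continuous map
A, *_ = np.linalg.lstsq(encode(X), (Y - Y.mean(0)) @ Vy[:Jp].T, rcond=None)
phi = lambda v: v @ A

Y_hat = decode(phi(encode(X)))                          # approximation of G on the samples
print("relative error:", np.linalg.norm(Y_hat - Y) / np.linalg.norm(Y))
```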
Proof Since X has the AP, there exists a sequence of finite rank operators {U_n^X : X → X}_{n=1}^∞ such that
lim_{n→∞} sup_{x∈K} ∥x − U_n^X x∥_X = 0.
Define the set Z = ∪_{n=1}^∞ U_n^X(K) ∪ K, which is compact by Lemma 21. Therefore, G is uniformly continuous on Z hence there exists a
modulus of continuity ω : R≥0 → R≥0 which is non-decreasing and satisfies ω(t) → ω(0) = 0 as
t → 0 as well as
∥G(z_1) − G(z_2)∥_Y ≤ ω(∥z_1 − z_2∥_X), ∀z_1, z_2 ∈ Z.
We can thus find a number N = N(ϵ) ∈ N such that
sup_{x∈K} ω(∥x − U_N^X x∥_X) ≤ ϵ/2.
Let J = dim U_N^X(X) < ∞. There exist elements {α_j}_{j=1}^J ⊂ X and {w_j}_{j=1}^J ⊂ X* such that
U_N^X x = Σ_{j=1}^J w_j(x) α_j, ∀x ∈ X.
for some {β_j}_{j=1}^{J′} ⊂ Y and {q_j}_{j=1}^{J′} ⊂ Y* such that U_{J′}^Y = G_{J′}^Y ◦ F_{J′}^Y. Clearly, if Y admits a basis then we could have defined F_{J′}^Y and G_{J′}^Y through it instead of through U_{J′}^Y. Define φ : R^J → R^{J′} by
φ(v) = (F_{J′}^Y ◦ G ◦ G_J^X)(v), ∀v ∈ R^J,
and a computation using the triangle inequality, the modulus of continuity ω, and the choices made above shows that
sup_{x∈K} ∥G(x) − (G_{J′} ◦ φ ◦ F_J)(x)∥_Y ≤ ϵ
as desired.
We now state and prove some results about isomorphisms of function spaces defined on different
domains. These results are instrumental in proving Lemma 26.
Lemma 23 Let D, D′ ⊂ R^d be domains and let τ : D̄′ → D̄ be a bijection such that τ ∈ C^m(D̄′; D̄) and τ^{-1} ∈ C^m(D̄; D̄′) for some m ∈ N_0. Then the map T : C^m(D̄) → C^m(D̄′) defined by T f = f ◦ τ is a continuous linear bijection.
Proof Clearly T is linear since the evaluation functional is linear. To see that it is continuous, note that by the chain rule we can find a constant Q = Q(m) > 0 such that
∥T f∥_{C^m(D̄′)} ≤ Q ∥f∥_{C^m(D̄)}, ∀f ∈ C^m(D̄).
We will now show that it is bijective. Let f, g ∈ C^m(D̄) so that f ≠ g. Then there exists a point
x ∈ D̄ such that f(x) ≠ g(x). Then T(f)(τ^{-1}(x)) = f(x) and T(g)(τ^{-1}(x)) = g(x), hence T(f) ≠ T(g), thus T is injective. Now let g ∈ C^m(D̄′) and define f : D̄ → R by f = g ◦ τ^{-1}. Since τ^{-1} ∈ C^m(D̄; D̄′), we have that f ∈ C^m(D̄). Clearly, T(f) = g hence T is surjective.
Corollary 24 Let M > 0 and m ∈ N0 . There exists a continuous linear bijection T : C m ([0, 1]d ) →
C m ([−M, M ]d ).
Proof Let 1 ∈ Rd denote the vector in which all entries are 1. Define the map τ : Rd → Rd by
τ(x) = (1/(2M)) x + (1/2) 1, ∀x ∈ R^d. (48)
Clearly τ is a C ∞ -diffeomorphism between [−M, M ]d and [0, 1]d hence Lemma 23 implies the
result.
Lemma 25 Let M > 0 and m ∈ N. There exists a continuous linear bijection T : W m,1 ((0, 1)d ) →
W m,1 ((−M, M )d ).
Proof Define the map τ : Rd → Rd by (48). We have that τ ((−M, M )d ) = (0, 1)d . Define the
operator T by
T f = f ◦ τ, ∀f ∈ W m,1 ((0, 1)d ).
which is clearly linear since composition is linear. We compute that, for any 0 ≤ |α|1 ≤ m,
∂^α (f ◦ τ) = (2M)^{−|α|_1} (∂^α f) ◦ τ.
This shows that T : W m,1 ((0, 1)d ) → W m,1 ((−M, M )d ) is continuous and injective. Now let
g ∈ W m,1 ((−M, M )d ) and define f = g ◦ τ −1 . A similar argument shows that f ∈ W m,1 ((0, 1)d )
and, clearly, T f = g hence T is surjective.
We now show that the spaces in Assumptions 9 and 10 have the AP. While the result is well-
known when the domain is (0, 1)d or Rd , we were unable to find any results in the literature for
Lipschitz domains and we therefore give a full proof here. The essence of the proof is to either
exhibit an isomorphism to a space that is already known to have AP or to directly show AP by
embedding the Lipschitz domain into a hypercube for which there are known basis constructions.
Our proof shows the stronger result that W m,p (D) for m ∈ N0 and 1 ≤ p < ∞ has a basis, but,
for C m (D̄), we only establish the AP and not necessarily a basis. The discrepancy comes from the
fact that there is an isomorphism between W m,p (D) and W m,p (Rd ) while there is not one between
C m (D̄) and C m (Rd ).
Lemma 26 Let Assumptions 9 and 10 hold. Then A and U have the AP.
Proof It is enough to show that the spaces W m,p (D), and C m (D̄) for any 1 ≤ p < ∞ and m ∈ N0
with D ⊂ Rd a Lipschitz domain have the AP. Consider first the spaces W 0,p (D) = Lp (D). Since
the Lebesgue measure on D is σ-finite and has no atoms, Lp (D) is isometrically isomorphic to
Lp ((0, 1)) (see, for example, (Albiac and Kalton, 2006, Chapter 6)). Hence by Lemma 20, it is
enough to show that Lp ((0, 1)) has the AP. Similarly, consider the spaces W m,p (D) for m > 0 and
p > 1. Since D is Lipschitz, there exists a continuous linear operator W m,p (D) → W m,p (Rd )
(Stein, 1970, Chapter 6, Theorem 5) (this also holds for p = 1). We can therefore apply (Pełczyński
and Wojciechowski, 2001, Corollary 4) (when p > 1) to conclude that W m,p (D) is isomorphic
to Lp ((0, 1)). By (Albiac and Kalton, 2006, Proposition 6.1.3), Lp ((0, 1)) has a basis hence
Lemma 18 implies the result.
Now, consider the spaces C m (D̄). Since D is bounded, there exists a number M > 0 such that
D̄ ⊆ [−M, M ]d . Hence, by Corollary 24, C m ([0, 1]d ) is isomorphic to C m ([−M, M ]d ). Since
C m ([0, 1]d ) has a basis (Ciesielski and Domsta, 1972, Theorem 5), Lemma 19 then implies that
C m ([−M, M ]d ) has a basis. By (Fefferman, 2007, Theorem 1), there exists a continuous linear
operator E : C m (D̄) → Cbm (Rd ) such that E(f )|D̄ = f for all f ∈ C(D̄). Define the restriction
operators RM : Cbm (Rd ) → C m ([−M, M ]d ) and RD : C m ([−M, M ]d ) → C m (D̄) which are
both clearly linear and continuous and ∥R_M∥ = ∥R_D∥ = 1. Let {c_j}_{j=1}^∞ ⊂ C^m([−M, M]^d)* and {φ_j}_{j=1}^∞ ⊂ C^m([−M, M]^d) be a basis for C^m([−M, M]^d). As in the proof of Lemma 18, there exists a constant C_1 > 0 such that, for any n ∈ N and f ∈ C^m([−M, M]^d),
∥Σ_{j=1}^n c_j(f) φ_j∥_{C^m([−M,M]^d)} ≤ C_1 ∥f∥_{C^m([−M,M]^d)}.
Suppose, without loss of generality, that C1 ∥E∥ ≥ 1. Let K ⊂ C m (D̄) be a compact set and ϵ > 0.
Since K is compact, we can find a number n = n(ϵ) ∈ N and elements y1 , . . . , yn ∈ K such that,
for any f ∈ K there exists a number l ∈ {1, . . . , n} such that
∥f − y_l∥_{C^m(D̄)} ≤ ϵ/(3 C_1 ∥E∥).
For every l ∈ {1, . . . , n}, define gl = RM (E(yl )) and note that gl ∈ C m ([−M, M ]d ) hence there
exists a number J = J(ϵ, n) ∈ N such that
max_{l∈{1,...,n}} ∥g_l − Σ_{j=1}^J c_j(g_l) φ_j∥_{C^m([−M,M]^d)} ≤ ϵ/3.
We are left with the case W m,1 (D). A similar argument as for the C m (D̄) case holds. In par-
ticular the basis from (Ciesielski and Domsta, 1972, Theorem 5) is also a basis for W m,1 ((0, 1)d ).
Lemma 25 gives an isomorphism between W m,1 ((0, 1)d ) and W m,1 ((−M, M )d ) hence we may
use the extension operator W m,1 (D) → W m,1 (Rd ) from (Stein, 1970, Chapter 6, Theorem 5) to
complete the argument. In fact, the same construction yields a basis for W m,1 (D) due to the iso-
morphism with W m,1 (Rd ), see, for example (Pełczyński and Wojciechowski, 2001, Theorem 1).
Appendix C.
In this section, we prove various results about the approximation of linear functionals by kernel in-
tegral operators. Lemma 27 establishes a Riesz-representation theorem for C m . The proof proceeds
exactly as in the well-known result for W m,p but, since we did not find it in the literature, we give
full details here. Lemma 28 shows that linear functionals on W m,p can be approximated uniformly
over compact set by integral kernel operators with a C ∞ kernel. Lemmas 30 and 31 establish similar
results for C and C m respectively by employing Lemma 27. These lemmas are crucial in showing
that NO(s) are universal since they imply that the functionals from Lemma 22 can be approximated
by elements of IO.
Lemma 27 Let D ⊂ R^d be a domain and m ∈ N_0. For every L ∈ (C^m(D̄))* there exist finite, signed, Radon measures {λ_α}_{0≤|α|_1≤m} such that
L(f) = Σ_{0≤|α|_1≤m} ∫_{D̄} ∂^α f dλ_α, ∀f ∈ C^m(D̄).
Proof The case m = 0 follows directly from (Leoni, 2009, Theorem B.111), so we assume that m > 0. Let α_1, . . . , α_J be an enumeration of the set {α ∈ N_0^d : |α|_1 ≤ m}. Define the mapping T : C^m(D̄) → C(D̄; R^J) by
T f = (∂^{α_1} f, . . . , ∂^{α_J} f), ∀f ∈ C^m(D̄).
Clearly ∥T f∥_{C(D̄;R^J)} = ∥f∥_{C^m(D̄)} hence T is an injective, continuous linear operator. Define W := T(C^m(D̄)) ⊂ C(D̄; R^J); then T^{-1} : W → C^m(D̄) is a continuous linear operator since T^{-1} preserves the norm. Thus W = (T^{-1})^{-1}(C^m(D̄)) is closed as the pre-image of a closed set under a continuous map. In particular, W is a Banach space since C(D̄; R^J) is a Banach space and T is an isometric isomorphism between C^m(D̄) and W. Therefore, there exists a continuous linear functional L̃ ∈ W* such that
L(f) = L̃(T f), ∀f ∈ C^m(D̄).
By the Hahn-Banach theorem, L̃ can be extended to a continuous linear functional L̄ ∈ (C(D̄; R^J))* such that ∥L∥_{(C^m(D̄))*} = ∥L̃∥_{W*} = ∥L̄∥_{(C(D̄;R^J))*}. Since
(C(D̄; R^J))* ≅ (∏_{j=1}^J C(D̄))* ≅ ⊕_{j=1}^J C(D̄)*,
we have, by applying (Leoni, 2009, Theorem B.111) J times, that there exist finite, signed, Radon
measures {λα }0≤|α|1 ≤m such that
L̄(T f) = Σ_{0≤|α|_1≤m} ∫_{D̄} ∂^α f dλ_α, ∀f ∈ C^m(D̄)
as desired.
Lemma 28 Let D ⊂ Rd be a bounded, open set and L ∈ (W m,p (D))∗ for some m ≥ 0 and
1 ≤ p < ∞. For any closed and bounded set K ⊂ W m,p (D) (compact if p = 1) and ϵ > 0, there
exists a function κ ∈ Cc∞ (D) such that
sup_{u∈K} |L(u) − ∫_D κ u dx| < ϵ.
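A small numerical illustration of Lemma 28 in the case m = 0, p = 2 is sketched below (NumPy; the 1D domain, the particular rough representer v, and the Gaussian smoothing standing in for a compactly supported mollifier are all our own assumptions). It approximates the functional L(u) = ∫_D v u dx by integration against a smoothed kernel κ, with error controlled by ∥v − κ∥_{L^q}∥u∥_{L^p}:

```python
import numpy as np

x = np.linspace(0.0, 1.0, 2001)
dx = x[1] - x[0]
v = np.sign(np.sin(7 * np.pi * x))                  # a rough (discontinuous) representer of L
eta = 0.01                                          # smoothing width
g = np.exp(-0.5 * ((x - 0.5) / eta) ** 2)
kappa = np.convolve(v, g / g.sum(), mode="same")    # smooth approximation of v

u = np.sin(3 * np.pi * x)                           # a test input from a bounded set K
L_u = np.sum(v * u) * dx                            # L(u) = int v u dx
L_kappa = np.sum(kappa * u) * dx                    # int kappa u dx
print(abs(L_u - L_kappa))                           # small, consistent with the lemma
```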
Proof First consider the case m = 0 and 1 ≤ p < ∞. By the Riesz Representation Theorem
(Conway, 2007, Appendix B), there exists a function v ∈ Lq (D) such that
L(u) = ∫_D v u dx.
Since K is bounded, there exists a constant M > 0 such that
sup_{u∈K} ∥u∥_{L^p} ≤ M.
Suppose p > 1, so that 1 < q < ∞. Density of Cc∞ (D) in Lq (D) (Adams and Fournier, 2003,
Corollary 2.30) implies there exists a function κ ∈ Cc∞ (D) such that
∥v − κ∥_{L^q} < ϵ/M.
By the Hölder inequality,
|L(u) − ∫_D κ u dx| ≤ ∥u∥_{L^p} ∥v − κ∥_{L^q} < ϵ.
Suppose that p = 1 then q = ∞. Since K is totally bounded, there exists a number n ∈ N and
functions g1 , . . . , gn ∈ K such that, for any u ∈ K,
∥u − g_l∥_{L^1} < ϵ/(3∥v∥_{L^∞})
for some l ∈ {1, . . . , n}. Let ψη ∈ Cc∞ (D) denote a standard mollifier for any η > 0. We can find
η > 0 small enough such that
max_{l∈{1,...,n}} ∥ψ_η ∗ g_l − g_l∥_{L^1} < ϵ/(9∥v∥_{L^∞}).
Define f = ψη ∗ v ∈ C(D) and note that ∥f ∥L∞ ≤ ∥v∥L∞ . By Fubini’s theorem, we find
|∫_D (f − v) g_l dx| = |∫_D v (ψ_η ∗ g_l − g_l) dx| ≤ ∥v∥_{L^∞} ∥ψ_η ∗ g_l − g_l∥_{L^1} < ϵ/9.
Since gl ∈ L1 (D), by Lusin’s theorem, we can find a compact set A ⊂ D such that
max_{l∈{1,...,n}} ∫_{D\A} |g_l| dx < ϵ/(18∥v∥_{L^∞}).
Since Cc∞ (D) is dense in C(D) over compact sets (Leoni, 2009, Theorem C.16), we can find a
function κ ∈ Cc∞ (D) such that
sup_{x∈A} |κ(x) − f(x)| ≤ ϵ/(9M)
and ∥κ∥L∞ ≤ ∥f ∥L∞ ≤ ∥v∥L∞ . We have,
|∫_D (κ − v) g_l dx| ≤ ∫_A |(κ − v) g_l| dx + ∫_{D\A} |(κ − v) g_l| dx
≤ ∫_A |(κ − f) g_l| dx + ∫_D |(f − v) g_l| dx + 2∥v∥_{L^∞} ∫_{D\A} |g_l| dx
≤ sup_{x∈A} |κ(x) − f(x)| ∥g_l∥_{L^1} + 2ϵ/9
< ϵ/3.
Finally,
|L(u) − ∫_D κ u dx| ≤ |∫_D v u dx − ∫_D v g_l dx| + |∫_D v g_l dx − ∫_D κ u dx|
≤ ∥v∥_{L^∞} ∥u − g_l∥_{L^1} + |∫_D κ u dx − ∫_D κ g_l dx| + |∫_D κ g_l dx − ∫_D v g_l dx|
≤ ϵ/3 + ∥κ∥_{L^∞} ∥u − g_l∥_{L^1} + |∫_D (κ − v) g_l dx|
≤ 2ϵ/3 + ∥v∥_{L^∞} ∥u − g_l∥_{L^1}
< ϵ.
Suppose $m \ge 1$. By the Riesz Representation Theorem (Adams and Fournier, 2003, Theorem 3.9), there exist elements $(v_\alpha)_{0 \le |\alpha|_1 \le m}$ of $L^q(D)$, where $\alpha \in \mathbb{N}^d$ is a multi-index, such that
$$L(u) = \sum_{0 \le |\alpha|_1 \le m} \int_D v_\alpha \partial^\alpha u \, dx.$$
Suppose $p > 1$, so that $1 < q < \infty$. Density of $C_c^\infty(D)$ in $L^q(D)$ implies there exist functions $(f_\alpha)_{0 \le |\alpha|_1 \le m}$ in $C_c^\infty(D)$ such that
$$\|f_\alpha - v_\alpha\|_{L^q} < \frac{\epsilon}{MJ}$$
where $J = |\{\alpha \in \mathbb{N}^d : |\alpha|_1 \le m\}|$. Let
$$\kappa = \sum_{0 \le |\alpha|_1 \le m} (-1)^{|\alpha|_1} \partial^\alpha f_\alpha.$$
Integrating by parts and applying the Hölder inequality to each term, as in the case $m = 0$, then gives the result.

Suppose that $p = 1$; then $q = \infty$ and $v_\alpha \in L^\infty(D)$ for every $\alpha$. Set $C_v := \sum_{0 \le |\alpha|_1 \le m} \|v_\alpha\|_{L^\infty}$.
Since $K$ is totally bounded, there exists a number $n \in \mathbb{N}$ and functions $g_1, \ldots, g_n \in K$ such that, for any $u \in K$,
$$\|u - g_l\|_{W^{m,1}} < \frac{\epsilon}{3C_v}$$
for some $l \in \{1, \ldots, n\}$. Let $\psi_\eta \in C_c^\infty(D)$ denote a standard mollifier for any $\eta > 0$. We can find $\eta > 0$ small enough such that
$$\max_{\alpha} \max_{l \in \{1,\ldots,n\}} \|\psi_\eta * \partial^\alpha g_l - \partial^\alpha g_l\|_{L^1} < \frac{\epsilon}{9C_v}.$$
Define $f_\alpha = \psi_\eta * v_\alpha \in C(D)$ and note that $\|f_\alpha\|_{L^\infty} \le \|v_\alpha\|_{L^\infty}$. By Fubini's theorem, we find
$$\begin{aligned}
\Big| \sum_{0 \le |\alpha|_1 \le m} \int_D (f_\alpha - v_\alpha) \partial^\alpha g_l \, dx \Big| &= \Big| \sum_{0 \le |\alpha|_1 \le m} \int_D v_\alpha (\psi_\eta * \partial^\alpha g_l - \partial^\alpha g_l) \, dx \Big| \\
&\le \sum_{0 \le |\alpha|_1 \le m} \|v_\alpha\|_{L^\infty} \|\psi_\eta * \partial^\alpha g_l - \partial^\alpha g_l\|_{L^1} \\
&< \frac{\epsilon}{9}.
\end{aligned}$$
Since $\partial^\alpha g_l \in L^1(D)$, by Lusin's theorem, we can find a compact set $A \subset D$ such that
$$\max_{\alpha} \max_{l \in \{1,\ldots,n\}} \int_{D \setminus A} |\partial^\alpha g_l| \, dx < \frac{\epsilon}{18C_v}.$$
Since $C_c^\infty(D)$ is dense in $C(D)$ over compact sets, we can find functions $w_\alpha \in C_c^\infty(D)$ such that
$$\sup_{x \in A} |w_\alpha(x) - f_\alpha(x)| \le \frac{\epsilon}{9MJ}$$
where $J = |\{\alpha \in \mathbb{N}^d : |\alpha|_1 \le m\}|$ and $\|w_\alpha\|_{L^\infty} \le \|f_\alpha\|_{L^\infty} \le \|v_\alpha\|_{L^\infty}$. We have
$$\begin{aligned}
\sum_{0 \le |\alpha|_1 \le m} \int_D |(w_\alpha - v_\alpha)\partial^\alpha g_l| \, dx &= \sum_{0 \le |\alpha|_1 \le m} \Big( \int_A |(w_\alpha - v_\alpha)\partial^\alpha g_l| \, dx + \int_{D \setminus A} |(w_\alpha - v_\alpha)\partial^\alpha g_l| \, dx \Big) \\
&\le \sum_{0 \le |\alpha|_1 \le m} \Big( \int_A |(w_\alpha - f_\alpha)\partial^\alpha g_l| \, dx + \int_D |(f_\alpha - v_\alpha)\partial^\alpha g_l| \, dx + 2\|v_\alpha\|_{L^\infty} \int_{D \setminus A} |\partial^\alpha g_l| \, dx \Big) \\
&\le \sum_{0 \le |\alpha|_1 \le m} \sup_{x \in A} |w_\alpha(x) - f_\alpha(x)| \, \|\partial^\alpha g_l\|_{L^1} + \frac{2\epsilon}{9} \\
&< \frac{\epsilon}{3}.
\end{aligned}$$
Let
$$\kappa = \sum_{0 \le |\alpha|_1 \le m} (-1)^{|\alpha|_1} \partial^\alpha w_\alpha.$$
Finally,
$$\begin{aligned}
\Big| L(u) - \int_D \kappa u \, dx \Big| &\le \sum_{0 \le |\alpha|_1 \le m} \int_D |v_\alpha \partial^\alpha u - w_\alpha \partial^\alpha u| \, dx \\
&\le \sum_{0 \le |\alpha|_1 \le m} \Big( \int_D |v_\alpha(\partial^\alpha u - \partial^\alpha g_l)| \, dx + \int_D |v_\alpha \partial^\alpha g_l - w_\alpha \partial^\alpha u| \, dx \Big) \\
&\le \sum_{0 \le |\alpha|_1 \le m} \Big( \|v_\alpha\|_{L^\infty} \|u - g_l\|_{W^{m,1}} + \int_D |(v_\alpha - w_\alpha)\partial^\alpha g_l| \, dx + \int_D |(\partial^\alpha g_l - \partial^\alpha u)w_\alpha| \, dx \Big) \\
&< \frac{2\epsilon}{3} + \sum_{0 \le |\alpha|_1 \le m} \|w_\alpha\|_{L^\infty} \|u - g_l\|_{W^{m,1}} \\
&< \epsilon.
\end{aligned}$$
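To make this construction concrete, the following is a minimal numerical sketch (not part of the original argument): on $D = (0,1)$, a functional $L(u) = \int_D v u\,dx$ is replaced by $\int_D \kappa u\,dx$ with $\kappa$ a mollified, boundary-cutoff surrogate for $v$. The grid, the choice of $v$, and the test functions are assumptions made purely for illustration.

```python
# Minimal illustrative sketch (not from the paper): Lemma 28 on D = (0, 1).
# L(u) = integral of v*u is replaced by the integral of kappa*u, where kappa is a
# mollified, boundary-cutoff surrogate for v. All concrete choices are assumptions.
import numpy as np

n = 2000
x = (np.arange(n) + 0.5) / n            # midpoint grid on D = (0, 1)
dx = 1.0 / n
integral = lambda f: np.sum(f) * dx     # simple quadrature on the uniform grid

v = np.sign(x - 0.5)                    # a discontinuous representer v

# Mollify v with a normalized Gaussian bump, then cut off near the boundary so the
# surrogate vanishes at the edges of D (mimicking kappa in C_c^infty(D)).
eta = 0.02
bump = np.exp(-0.5 * ((x - 0.5) / eta) ** 2)
bump /= bump.sum()
kappa = np.convolve(v, bump, mode="same")
kappa *= np.clip(x * (1.0 - x) / 0.01, 0.0, 1.0)

L = lambda u: integral(v * u)            # the original functional
L_kappa = lambda u: integral(kappa * u)  # its smooth-kernel surrogate

tests = [np.sin(2 * np.pi * x), x ** 2, np.cos(5 * x)]   # a bounded family of test functions
print(max(abs(L(u) - L_kappa(u)) for u in tests))         # small for small eta
```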
Lemma 29 Let $D \subset \mathbb{R}^d$ be a domain and $L \in (C^m(\bar{D}))^*$ for some $m \in \mathbb{N}_0$. For any compact set $K \subset C^m(\bar{D})$ and $\epsilon > 0$, there exist distinct points $y_{11}, \ldots, y_{1n_1}, \ldots, y_{Jn_J} \in D$ and numbers $c_{11}, \ldots, c_{1n_1}, \ldots, c_{Jn_J} \in \mathbb{R}$ such that
$$\sup_{u \in K} \Big| L(u) - \sum_{j=1}^J \sum_{k=1}^{n_j} c_{jk} \, \partial^{\alpha_j} u(y_{jk}) \Big| \le \epsilon.$$
Proof By Lemma 27, there exist finite, signed, Radon measures $\{\lambda_\alpha\}_{0 \le |\alpha|_1 \le m}$ such that
$$L(u) = \sum_{0 \le |\alpha|_1 \le m} \int_{\bar{D}} \partial^\alpha u \, d\lambda_\alpha, \quad \forall u \in C^m(\bar{D}).$$
Let $\alpha_1, \ldots, \alpha_J$ be an enumeration of the set $\{\alpha \in \mathbb{N}_0^d : 0 \le |\alpha|_1 \le m\}$. By weak density of the Dirac measures (Bogachev, 2007, Example 8.1.6), we can find points $y_{11}, \ldots, y_{1n_1}, \ldots, y_{J1}, \ldots, y_{Jn_J} \in \bar{D}$ as well as numbers $c_{11}, \ldots, c_{Jn_J} \in \mathbb{R}$ such that
$$\Big| \int_{\bar{D}} \partial^{\alpha_j} u \, d\lambda_{\alpha_j} - \sum_{k=1}^{n_j} c_{jk} \, \partial^{\alpha_j} u(y_{jk}) \Big| \le \frac{\epsilon}{4J}, \quad \forall u \in C^m(\bar{D}),$$
and therefore
$$\Big| \sum_{j=1}^J \int_{\bar{D}} \partial^{\alpha_j} u \, d\lambda_{\alpha_j} - \sum_{j=1}^J \sum_{k=1}^{n_j} c_{jk} \, \partial^{\alpha_j} u(y_{jk}) \Big| \le \frac{\epsilon}{4}, \quad \forall u \in C^m(\bar{D}).$$
Since $K$ is compact, we can find functions $g_1, \ldots, g_N \in K$ such that, for any $u \in K$, there exists $l \in \{1, \ldots, N\}$ such that
$$\|u - g_l\|_{C^m} \le \frac{\epsilon}{4Q}.$$
Suppose that some $y_{jk} \in \partial D$. By uniform continuity, we can find a point $\tilde{y}_{jk} \in D$ such that
$$\max_{l \in \{1,\ldots,N\}} |\partial^{\alpha_j} g_l(y_{jk}) - \partial^{\alpha_j} g_l(\tilde{y}_{jk})| \le \frac{\epsilon}{4Q}.$$
Denote
$$S(u) = \sum_{j=1}^J \sum_{k=1}^{n_j} c_{jk} \, \partial^{\alpha_j} u(y_{jk})$$
and by $\tilde{S}(u)$ the sum $S(u)$ with $y_{jk}$ replaced by $\tilde{y}_{jk}$. Then, for any $u \in K$, combining the preceding bounds with the triangle inequality gives $|L(u) - \tilde{S}(u)| \le \epsilon$. Since there are a finite number of points, this implies that all points $y_{jk}$ can be chosen in $D$. Suppose now that $y_{jk} = y_{qp}$ for some $(j,k) \ne (q,p)$. As before, we can always find a point $\tilde{y}_{jk}$, distinct from all others, such that
$$\max_{l \in \{1,\ldots,N\}} |\partial^{\alpha_j} g_l(y_{jk}) - \partial^{\alpha_j} g_l(\tilde{y}_{jk})| \le \frac{\epsilon}{4Q}.$$
Repeating the previous argument then shows that all points $y_{jk}$ can be chosen distinctly, as desired.
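The point-evaluation approximation of Lemma 29 can be illustrated with a short sketch: a hypothetical functional built from measures with assumed Lebesgue densities is replaced by finitely many weighted evaluations of $u$ and $u'$, with midpoint-rule nodes and weights standing in for the $y_{jk}$ and $c_{jk}$. None of the concrete choices below are taken from the paper.

```python
# Illustrative sketch (not from the paper): Lemma 29 in one dimension with m = 1.
# L(u) = integral of u against lambda_0 plus integral of u' against lambda_1, where the
# measures are assumed to have the densities rho0 and rho1 on [0, 1].
import numpy as np

rho0 = lambda y: np.exp(-y)        # assumed density of lambda_0
rho1 = lambda y: np.cos(y)         # assumed density of lambda_1

def L(u, du, n=200000):
    """Reference value of the functional via a fine midpoint rule."""
    y = (np.arange(n) + 0.5) / n
    return (rho0(y) * u(y)).sum() / n + (rho1(y) * du(y)).sum() / n

def L_points(u, du, n_pts=64):
    """Finitely many weighted point evaluations: weights c_k = rho(y_k) / n_pts."""
    y_k = (np.arange(n_pts) + 0.5) / n_pts
    return (rho0(y_k) / n_pts * u(y_k)).sum() + (rho1(y_k) / n_pts * du(y_k)).sum()

u, du = np.sin, np.cos             # a smooth test function and its derivative
print(abs(L(u, du) - L_points(u, du)))   # small already for modest n_pts
```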
Lemma 30 Let $D \subset \mathbb{R}^d$ be a domain and $L \in C(\bar{D})^*$. For any compact set $K \subset C(\bar{D})$ and $\epsilon > 0$, there exists a function $\kappa \in C_c^\infty(D)$ such that
$$\sup_{u \in K} \Big| L(u) - \int_D \kappa u \, dx \Big| < \epsilon.$$
Proof By Lemma 29, we can find distinct points $y_1, \ldots, y_n \in D$ as well as numbers $c_1, \ldots, c_n \in \mathbb{R}$ such that
$$\sup_{u \in K} \Big| L(u) - \sum_{j=1}^n c_j u(y_j) \Big| \le \frac{\epsilon}{3}.$$
Since $K$ is compact, there exist functions $g_1, \ldots, g_J \in K$ such that, for any $u \in K$, there exists some $l \in \{1, \ldots, J\}$ such that
$$\|u - g_l\|_{C} \le \frac{\epsilon}{6nQ}.$$
Let $r > 0$ be such that the open balls $B_r(y_j)$ are contained in $D$ and are pairwise disjoint. Let $\psi_\eta \in C_c^\infty(\mathbb{R}^d)$ denote the standard mollifier with parameter $\eta > 0$, noting that $\operatorname{supp} \psi_\eta \subseteq \bar{B}_\eta(0)$. We can find a number $0 < \gamma \le r$ such that
$$\max_{\substack{l \in \{1,\ldots,J\} \\ j \in \{1,\ldots,n\}}} \Big| \int_D \psi_\gamma(x - y_j) g_l(x) \, dx - g_l(y_j) \Big| \le \frac{\epsilon}{3nQ}.$$
Define $\kappa : \mathbb{R}^d \to \mathbb{R}$ by
$$\kappa(x) = \sum_{j=1}^n c_j \psi_\gamma(x - y_j), \quad \forall x \in \mathbb{R}^d.$$
Since $\operatorname{supp} \psi_\gamma(\cdot - y_j) \subseteq B_r(y_j)$, we have that $\kappa \in C_c^\infty(D)$. Then, for any $u \in K$,
$$\begin{aligned}
\Big| L(u) - \int_D \kappa u \, dx \Big| &\le \Big| L(u) - \sum_{j=1}^n c_j u(y_j) \Big| + \Big| \sum_{j=1}^n c_j u(y_j) - \int_D \kappa u \, dx \Big| \\
&\le \frac{\epsilon}{3} + \sum_{j=1}^n |c_j| \, \Big| u(y_j) - \int_D \psi_\gamma(x - y_j) u(x) \, dx \Big| \\
&\le \frac{\epsilon}{3} + Q \sum_{j=1}^n \Big( |u(y_j) - g_l(y_j)| + \Big| g_l(y_j) - \int_D \psi_\gamma(x - y_j) u(x) \, dx \Big| \Big) \\
&\le \frac{\epsilon}{3} + nQ\|u - g_l\|_{C} + Q \sum_{j=1}^n \Big( \Big| g_l(y_j) - \int_D \psi_\gamma(x - y_j) g_l(x) \, dx \Big| + \Big| \int_D \psi_\gamma(x - y_j)\big(g_l(x) - u(x)\big) \, dx \Big| \Big) \\
&\le \frac{\epsilon}{3} + nQ\|u - g_l\|_{C} + nQ\frac{\epsilon}{3nQ} + Q\|g_l - u\|_{C} \sum_{j=1}^n \int_D \psi_\gamma(x - y_j) \, dx \\
&\le \frac{2\epsilon}{3} + 2nQ\|u - g_l\|_{C} \\
&\le \epsilon,
\end{aligned}$$
where we use the fact that mollifiers are non-negative and integrate to one.
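The kernel $\kappa(x) = \sum_j c_j \psi_\gamma(x - y_j)$ built in this proof can be checked numerically with a small sketch, assuming $D = (0,1)$, a Gaussian stand-in for the mollifier, and arbitrary points and weights (none of these choices come from the paper).

```python
# Illustrative sketch (not from the paper): the kernel of Lemma 30 on D = (0, 1).
# kappa(x) = sum_j c_j psi_gamma(x - y_j), so the integral of kappa*u approximates
# sum_j c_j u(y_j). Points, weights, and the Gaussian bump are illustrative choices.
import numpy as np

n = 20000
x = (np.arange(n) + 0.5) / n
dx = 1.0 / n

y_pts = np.array([0.2, 0.5, 0.8])   # the points y_j, well inside D
c = np.array([1.0, -2.0, 0.5])      # the weights c_j
gamma = 0.01                        # bump width, small relative to the distance to the boundary

def psi(t):
    """Narrow bump normalized to integrate to one on the grid."""
    b = np.exp(-0.5 * (t / gamma) ** 2)
    return b / (b.sum() * dx)

kappa = sum(cj * psi(x - yj) for cj, yj in zip(c, y_pts))

u = lambda t: np.sin(3 * t) + t ** 2          # an arbitrary continuous test function
point_sum = np.sum(c * u(y_pts))              # sum_j c_j u(y_j)
smeared = np.sum(kappa * u(x)) * dx           # integral of kappa*u over D
print(abs(point_sum - smeared))               # shrinks as gamma -> 0
```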
Lemma 31 Let $D \subset \mathbb{R}^d$ be a domain and $L \in (C^m(\bar{D}))^*$. For any compact set $K \subset C^m(\bar{D})$ and $\epsilon > 0$, there exist functions $\kappa_1, \ldots, \kappa_J \in C_c^\infty(D)$ such that
$$\sup_{u \in K} \Big| L(u) - \sum_{j=1}^J \int_D \kappa_j \, \partial^{\alpha_j} u \, dx \Big| < \epsilon.$$
Proof By Lemma 29, we find distinct points $y_{11}, \ldots, y_{1n_1}, \ldots, y_{Jn_J} \in D$ and numbers $c_{11}, \ldots, c_{Jn_J} \in \mathbb{R}$ such that
$$\sup_{u \in K} \Big| L(u) - \sum_{j=1}^J \sum_{k=1}^{n_j} c_{jk} \, \partial^{\alpha_j} u(y_{jk}) \Big| \le \frac{\epsilon}{2}.$$
Applying the proof of Lemma 30 $J$ times, once to each of the inner sums, we find functions $\kappa_1, \ldots, \kappa_J \in C_c^\infty(D)$ such that
$$\max_{j \in \{1,\ldots,J\}} \Big| \int_D \kappa_j \, \partial^{\alpha_j} u \, dx - \sum_{k=1}^{n_j} c_{jk} \, \partial^{\alpha_j} u(y_{jk}) \Big| \le \frac{\epsilon}{2J},$$
and the triangle inequality gives the result, as desired.
Appendix D.
The following lemmas show that the three pieces used in constructing the approximation from Lemma 22, which are schematically depicted in Figure 16, can all be approximated by NO(s). Lemma 32 shows that $F_J : \mathcal{A} \to \mathbb{R}^J$ can be approximated by an element of IO by mapping to a vector-valued constant function. Similarly, Lemma 34 shows that $G_{J'} : \mathbb{R}^{J'} \to \mathcal{U}$ can be approximated by an element of IO by mapping a vector-valued constant function to the coefficients of a basis expansion. Finally, Lemma 35 shows that NO(s) can exactly represent any standard neural network by viewing the inputs and outputs as vector-valued constant functions.
Lemma 32 Let Assumption 9 hold. Let $\{c_j\}_{j=1}^n \subset \mathcal{A}^*$ for some $n \in \mathbb{N}$. Define the map $F : \mathcal{A} \to \mathbb{R}^n$ by
$$F(a) = \big(c_1(a), \ldots, c_n(a)\big), \quad \forall a \in \mathcal{A}.$$
Then, for any compact set $K \subset \mathcal{A}$, $\sigma \in A_0$, and $\epsilon > 0$, there exists a number $L \in \mathbb{N}$ and a neural network $\kappa \in N_L(\sigma; \mathbb{R}^d \times \mathbb{R}^d, \mathbb{R}^{n \times 1})$ such that
$$\sup_{a \in K} \sup_{y \in \bar{D}} \Big| F(a) - \int_D \kappa(y,x) a(x) \, dx \Big|_1 \le \epsilon.$$
Proof Since $K$ is compact, there is a constant $M > 0$ such that
$$\sup_{a \in K} \|a\|_{\mathcal{A}} \le M,$$
and a constant $Q > 0$ such that $\sup_{a \in K} \|a\|_{L^1(D)} \le Q$. By Lemma 28 or Lemma 30 (according to the space $\mathcal{A}$ of Assumption 9), there exist functions $f_1, \ldots, f_n \in C_c^\infty(D)$ such that each functional $c_j$ is uniformly approximated on $K$ by $a \mapsto \int_D f_j a \, dx$, and, since $\sigma \in A_0$, a number $L \in \mathbb{N}$ and neural networks $\psi_1, \ldots, \psi_n \in N_L(\sigma; \mathbb{R}^d)$ approximating $f_1, \ldots, f_n$ in the uniform norm, with accuracies sufficient for the estimate below.
By setting all weights associated to the first argument to zero, we can modify each neural network $\psi_j$ to a neural network $\psi_j \in N_L(\sigma; \mathbb{R}^d \times \mathbb{R}^d)$ so that
$$\psi_j(y,x) = \psi_j(x) 1(y), \quad \forall y, x \in \mathbb{R}^d.$$
Define $\kappa \in N_L(\sigma; \mathbb{R}^d \times \mathbb{R}^d, \mathbb{R}^{n \times 1})$ by
$$\kappa(y,x) = [\psi_1(y,x), \ldots, \psi_n(y,x)]^T.$$
Then, for any $a \in K$ and $y \in \bar{D}$, we have
$$\begin{aligned}
\Big| F(a) - \int_D \kappa(y,x) a \, dx \Big|_p^p &= \sum_{j=1}^n \Big| c_j(a) - \int_D 1(y)\psi_j(x) a(x) \, dx \Big|^p \\
&\le 2^{p-1} \sum_{j=1}^n \Big( \Big| c_j(a) - \int_D f_j a \, dx \Big|^p + \Big| \int_D (f_j - \psi_j) a \, dx \Big|^p \Big) \\
&\le \frac{\epsilon^p}{2} + 2^{p-1} n Q^p \max_j \|f_j - \psi_j\|_{C}^p \\
&\le \epsilon^p,
\end{aligned}$$
and the result follows by finite dimensional norm equivalence.
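A minimal numerical sketch of this encoder construction (with closed-form functions standing in for the neural networks $\psi_j$, and every concrete choice being an assumption made for illustration): a kernel that ignores its first argument maps an input function to a constant vector-valued function approximating $F(a)$.

```python
# Illustrative sketch (not from the paper): the encoder of Lemma 32 on D = (0, 1).
# kappa(y, x) = [psi_1(x), ..., psi_n(x)]^T 1(y), so the integral operator sends a to
# the constant vector-valued function y -> (int psi_1 a dx, ..., int psi_n a dx).
import numpy as np

n = 1000
x = (np.arange(n) + 0.5) / n
dx = 1.0 / n

psis = np.stack([np.sin(np.pi * x), np.cos(np.pi * x), x])   # rows: stand-ins for psi_j

def encode(a_vals):
    """Apply a -> int_D kappa(., x) a(x) dx; the output is constant in y."""
    return (psis * a_vals).sum(axis=1) * dx                   # shape (3,)

a_vals = np.exp(-x)                   # a sample input function sampled on the grid
print(encode(a_vals))                 # the constant value of the encoded function
```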
Lemma 33 Suppose $D \subset \mathbb{R}^d$ is a domain and let $\{c_j\}_{j=1}^n \subset (C^m(\bar{D}))^*$ for some $m, n \in \mathbb{N}$. Define the map $F : C^m(\bar{D}) \to \mathbb{R}^n$ by
$$F(a) = \big(c_1(a), \ldots, c_n(a)\big), \quad \forall a \in C^m(\bar{D}).$$
Then, for any compact set $K \subset C^m(\bar{D})$, $\sigma \in A_0$, and $\epsilon > 0$, there exists a number $L \in \mathbb{N}$ and a neural network $\kappa \in N_L(\sigma; \mathbb{R}^d \times \mathbb{R}^d, \mathbb{R}^{n \times J})$ such that
$$\sup_{a \in K} \sup_{y \in \bar{D}} \Big| F(a) - \int_D \kappa(y,x) \big(\partial^{\alpha_1} a(x), \ldots, \partial^{\alpha_J} a(x)\big) \, dx \Big|_1 \le \epsilon.$$
Proof The proof follows as in Lemma 32 by replacing the use of Lemmas 28 and 30 by Lemma 31.
Lemma 34 Let Assumption 10 hold. Let $\{\varphi_j\}_{j=1}^n \subset \mathcal{U}$ for some $n \in \mathbb{N}$. Define the map $G : \mathbb{R}^n \to \mathcal{U}$ by
$$G(w) = \sum_{j=1}^n w_j \varphi_j, \quad \forall w \in \mathbb{R}^n.$$
Then, for any compact set $K \subset \mathbb{R}^n$, $\sigma \in A_{m_2}$, and $\epsilon > 0$, there exists a neural network $\kappa \in N_1(\sigma; \mathbb{R}^{d'} \times \mathbb{R}^{d'}, \mathbb{R}^{1 \times n})$ such that
$$\sup_{w \in K} \Big\| G(w) - \int_{D'} \kappa(\cdot, x) \, w 1(x) \, dx \Big\|_{\mathcal{U}} \le \epsilon.$$
Proof Since $K$ is compact, there is a constant $M > 0$ such that $\sup_{w \in K} |w|_\infty \le M$.
If $\mathcal{U} = L^{p_2}(D')$, then density of $C_c^\infty(D')$ implies there are functions $\tilde{\psi}_1, \ldots, \tilde{\psi}_n \in C^\infty(\bar{D}')$ such that
$$\max_{j \in \{1,\ldots,n\}} \|\varphi_j - \tilde{\psi}_j\|_{\mathcal{U}} \le \frac{\epsilon}{2nM}.$$
Similarly, if $\mathcal{U} = W^{m_2,p_2}(D')$, then density of the restriction of functions in $C_c^\infty(\mathbb{R}^{d'})$ to $D'$ (Leoni, 2009, Theorem 11.35) implies the same result. If $\mathcal{U} = C^{m_2}(\bar{D}')$, then we set $\tilde{\psi}_j = \varphi_j$ for any $j \in \{1, \ldots, n\}$. Define $\tilde{\kappa} : \mathbb{R}^{d'} \times \mathbb{R}^{d'} \to \mathbb{R}^{1 \times n}$ by
$$\tilde{\kappa}(y,x) = \frac{1}{|D'|}\big[\tilde{\psi}_1(y), \ldots, \tilde{\psi}_n(y)\big].$$
Then, for any $w \in K$,
$$\begin{aligned}
\Big\| G(w) - \int_{D'} \tilde{\kappa}(\cdot,x) \, w 1(x) \, dx \Big\|_{\mathcal{U}} &= \Big\| \sum_{j=1}^n w_j \varphi_j - \sum_{j=1}^n w_j \tilde{\psi}_j \Big\|_{\mathcal{U}} \\
&\le \sum_{j=1}^n |w_j| \, \|\varphi_j - \tilde{\psi}_j\|_{\mathcal{U}} \\
&\le \frac{\epsilon}{2}.
\end{aligned}$$
Since $\sigma \in A_{m_2}$, there exist neural networks $\psi_1, \ldots, \psi_n \in N_1(\sigma; \mathbb{R}^{d'})$ such that
$$\max_{j \in \{1,\ldots,n\}} \|\tilde{\psi}_j - \psi_j\|_{C^{m_2}} \le \frac{\epsilon}{2nM(J|D'|)^{1/p_2}}$$
where, if $\mathcal{U} = C^{m_2}(\bar{D}')$, we set $J = 1/|D'|$ and $p_2 = 1$, and otherwise $J = |\{\alpha \in \mathbb{N}^{d'} : |\alpha|_1 \le m_2\}|$. By setting all weights associated to the second argument to zero, we can modify each neural network $\psi_j$ to a neural network $\psi_j \in N_1(\sigma; \mathbb{R}^{d'} \times \mathbb{R}^{d'})$ so that
$$\psi_j(y,x) = \psi_j(y) 1(x), \quad \forall y, x \in \mathbb{R}^{d'}.$$
Define $\kappa \in N_1(\sigma; \mathbb{R}^{d'} \times \mathbb{R}^{d'}, \mathbb{R}^{1 \times n})$ as
$$\kappa(y,x) = \frac{1}{|D'|}\big[\psi_1(y,x), \ldots, \psi_n(y,x)\big].$$
Then, for any $w \in \mathbb{R}^n$,
$$\int_{D'} \kappa(y,x) \, w 1(x) \, dx = \sum_{j=1}^n w_j \psi_j(y).$$
Hence, for any $w \in K$,
$$\Big\| G(w) - \int_{D'} \kappa(\cdot,x) \, w 1(x) \, dx \Big\|_{\mathcal{U}} \le \Big\| G(w) - \int_{D'} \tilde{\kappa}(\cdot,x) \, w 1(x) \, dx \Big\|_{\mathcal{U}} + \sum_{j=1}^n |w_j| \, \|\tilde{\psi}_j - \psi_j\|_{\mathcal{U}} \le \frac{\epsilon}{2} + nM(J|D'|)^{1/p_2} \max_j \|\tilde{\psi}_j - \psi_j\|_{C^{m_2}} \le \epsilon$$
as desired.
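Dually to the encoder sketch above, here is a minimal sketch of the decoder: because the kernel depends only on its first argument, integrating it against a constant function returns the basis expansion $G(w)$. The basis functions below are arbitrary stand-ins for the $\tilde{\psi}_j$, not taken from the paper.

```python
# Illustrative sketch (not from the paper): the decoder of Lemma 34 on D' = (0, 1).
# kappa(y, x) = [psi_1(y), ..., psi_n(y)] / |D'| ignores x, so integrating it against the
# constant function x -> w over D' returns y -> sum_j w_j psi_j(y).
import numpy as np

n = 1000
y = (np.arange(n) + 0.5) / n
psis = np.stack([np.sin(np.pi * y), np.sin(2 * np.pi * y), np.sin(3 * np.pi * y)])

def decode(w):
    """Since |D'| = 1 and the integrand is constant in x, the integral collapses to a sum."""
    return psis.T @ w                 # samples of G(w) = sum_j w_j psi_j on the grid

w = np.array([1.0, -0.5, 0.25])
print(decode(w)[:5])
```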
Lemma 35 is proved by writing the given network $\varphi : \mathbb{R}^d \to \mathbb{R}^{d'}$ as
$$\varphi(x) = W_N \sigma_1\big(\cdots W_1 \sigma_1(W_0 x + b_0) + b_1 \cdots\big) + b_N, \quad \forall x \in \mathbb{R}^d,$$
where $W_0 \in \mathbb{R}^{d_0 \times d}$, $W_1 \in \mathbb{R}^{d_1 \times d_0}$, \ldots, $W_N \in \mathbb{R}^{d' \times d_{N-1}}$ and $b_0 \in \mathbb{R}^{d_0}$, $b_1 \in \mathbb{R}^{d_1}$, \ldots, $b_N \in \mathbb{R}^{d'}$ for some $d_0, \ldots, d_{N-1} \in \mathbb{N}$. By setting all parameters to zero except for the last bias term, we can find $\kappa_0 \in N_1(\sigma_2; \mathbb{R}^p \times \mathbb{R}^p, \mathbb{R}^{d_0 \times d})$ such that
$$\kappa_0(x,y) = \frac{1}{|D|} W_0, \quad \forall x, y \in \mathbb{R}^p,$$
and, similarly, a bias network $\tilde{b}_0$ such that
$$\tilde{b}_0(x) = b_0, \quad \forall x \in \mathbb{R}^p.$$
Then
$$\int_D \kappa_0(y,x) \, w 1(x) \, dx + \tilde{b}_0(y) = (W_0 w + b_0) 1(y), \quad \forall w \in \mathbb{R}^d, \ \forall y \in D.$$
Continuing a similar construction for all layers clearly yields the result.
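This layer-by-layer construction can be checked numerically with a small sketch, assuming $D = (0,1)$ so that $|D| = 1$ and using random stand-in weights: each affine layer of an MLP is reproduced exactly by an integral operator with a constant kernel and a constant bias function acting on constant vector-valued functions.

```python
# Illustrative sketch (not from the paper): Lemma 35 with D = (0, 1), so |D| = 1.
# A two-layer MLP is reproduced by integral layers whose kernels are the constant
# matrices W_l / |D| and whose biases are constant functions.
import numpy as np

rng = np.random.default_rng(0)
W0, b0 = rng.standard_normal((8, 3)), rng.standard_normal(8)
W1, b1 = rng.standard_normal((2, 8)), rng.standard_normal(2)
sigma = np.tanh

mlp = lambda v: W1 @ sigma(W0 @ v + b0) + b1

n = 101                                           # grid points on D
def integral_layer(W, b, f_vals):
    """Map samples of a constant function f to samples of y -> int_D (W/|D|) f(x) dx + b."""
    avg = f_vals.mean(axis=0)                     # equals the constant value of f exactly
    return np.tile(W @ avg + b, (n, 1))           # again a constant function on the grid

v = np.array([0.3, -1.2, 0.7])
f = np.tile(v, (n, 1))                            # the constant input function 1(x) v

h = sigma(integral_layer(W0, b0, f))              # first integral layer + activation
out = integral_layer(W1, b1, h)                   # final integral layer
print(np.allclose(out[0], mlp(v)))                # True: the MLP is reproduced exactly
```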
Appendix E.
Proof [of Theorem 8] Without loss of generality, we will assume that $D = D'$ and, by continuous embedding, that $\mathcal{A} = \mathcal{U} = C(\bar{D})$. Furthermore, note that, by continuity, it suffices to show the result for the single layer
$$\mathrm{NO} = \Big\{ f \mapsto \sigma_1\Big( \int_D \kappa(\cdot, y) f(y) \, dy + b \Big) : \kappa \in N_{n_1}(\sigma_2; \mathbb{R}^d \times \mathbb{R}^d), \ b \in N_{n_2}(\sigma_2; \mathbb{R}^d), \ n_1, n_2 \in \mathbb{N} \Big\}.$$
We can do this since the points in each discretization $D_j$ are pairwise distinct. For any $G \in \mathrm{NO}$ with parameters $\kappa, b$, define the sequence of maps $\hat{G}_j : \mathbb{R}^{jd} \times \mathbb{R}^j \to Y$ by
$$\hat{G}_j(y_1, \ldots, y_j, w_1, \ldots, w_j) = \sigma_1\Big( \sum_{k=1}^j \kappa(\cdot, y_k) w_k |P_j^{(k)}| + b(\cdot) \Big)$$
for any $y_k \in \mathbb{R}^d$ and $w_k \in \mathbb{R}$. Since $K$ is compact, there is a constant $M > 0$ such that
$$\sup_{a \in K} \|a\|_{\mathcal{U}} \le M.$$
Therefore,
$$\sup_{x \in \bar{D}} \sup_{j \in \mathbb{N}} \ \Big| \int_D \kappa(x,y) a(y) \, dy \Big| + \Big| \sum_{k=1}^j \kappa(x, y_k) a(y_k) |P_j^{(k)}| \Big| + 2|b(x)| \le 2\big( M|D| \|\kappa\|_{C(\bar{D} \times \bar{D})} + \|b\|_{C(\bar{D})} \big) =: R.$$
Hence we need only consider $\sigma_1$ as a map $[-R, R] \to \mathbb{R}$. Thus, by uniform continuity, there exists a modulus of continuity $\omega : \mathbb{R}_{\ge 0} \to \mathbb{R}_{\ge 0}$ which is continuous, non-negative, and non-decreasing on $\mathbb{R}_{\ge 0}$, satisfies $\omega(z) \to \omega(0) = 0$ as $z \to 0$, and
$$|\sigma_1(s) - \sigma_1(t)| \le \omega(|s - t|), \quad \forall s, t \in [-R, R].$$
Let $\epsilon > 0$. Equation (49) and the non-decreasing property of $\omega$ imply that, in order to prove the result, it suffices to show that there exists $Q = Q(\epsilon) \in \mathbb{N}$ such that
$$\sup_{a \in K} \sup_{x \in \bar{D}} \Big| \int_D \kappa(x,y) a(y) \, dy - \sum_{k=1}^m \kappa(x, y_k) a(y_k) |P_m^{(k)}| \Big| < \epsilon$$
for any $m \ge Q$. Since $K$ is compact, we can find functions $a_1, \ldots, a_N \in K$ such that, for any $a \in K$, there is some $n \in \{1, \ldots, N\}$ such that
$$\|a - a_n\|_{C(\bar{D})} \le \frac{\epsilon}{4|D|\|\kappa\|_{C(\bar{D} \times \bar{D})}}.$$
Since $(D_j)$ is a discrete refinement, by convergence of Riemann sums, we can find some $q \in \mathbb{N}$ such that, for any $t \ge q$, we have
$$\sup_{x \in \bar{D}} \Big| \sum_{k=1}^t \kappa(x, y_k) |P_t^{(k)}| - \int_D \kappa(x,y) \, dy \Big| < |D|\|\kappa\|_{C(\bar{D} \times \bar{D})}.$$
Similarly, for each $n \in \{1, \ldots, N\}$, convergence of Riemann sums gives some $p_n \in \mathbb{N}$ such that, for any $t \ge p_n$,
$$\sup_{x \in \bar{D}} \Big| \sum_{k=1}^t \kappa(x, y_k) a_n(y_k) |P_t^{(k)}| - \int_D \kappa(x,y) a_n(y) \, dy \Big| < \frac{\epsilon}{4},$$
where $D_t = \{y_1, \ldots, y_t\}$. Let $m \ge \max\{q, p_1, \ldots, p_N\}$ and denote $D_m = \{y_1, \ldots, y_m\}$.
Note that
$$\sup_{x \in \bar{D}} \Big| \int_D \kappa(x,y)\big(a(y) - a_n(y)\big) \, dy \Big| \le |D|\|\kappa\|_{C(\bar{D} \times \bar{D})} \|a - a_n\|_{C(\bar{D})}.$$
Furthermore,
$$\begin{aligned}
\sup_{x \in \bar{D}} \Big| \sum_{k=1}^m \kappa(x, y_k)\big(a_n(y_k) - a(y_k)\big)|P_m^{(k)}| \Big| &\le \|a_n - a\|_{C(\bar{D})} \sup_{x \in \bar{D}} \Big| \sum_{k=1}^m \kappa(x, y_k)|P_m^{(k)}| \Big| \\
&\le \|a_n - a\|_{C(\bar{D})} \Big( \sup_{x \in \bar{D}} \Big| \sum_{k=1}^m \kappa(x, y_k)|P_m^{(k)}| - \int_D \kappa(x,y) \, dy \Big| + \sup_{x \in \bar{D}} \Big| \int_D \kappa(x,y) \, dy \Big| \Big) \\
&\le 2|D|\|\kappa\|_{C(\bar{D} \times \bar{D})} \|a_n - a\|_{C(\bar{D})}.
\end{aligned}$$
Therefore, for any $a \in K$, by repeated application of the triangle inequality, we find that
$$\begin{aligned}
\sup_{x \in \bar{D}} \Big| \int_D \kappa(x,y) a(y) \, dy - \sum_{k=1}^m \kappa(x, y_k) a(y_k) |P_m^{(k)}| \Big| &\le \sup_{x \in \bar{D}} \Big| \sum_{k=1}^m \kappa(x, y_k) a_n(y_k) |P_m^{(k)}| - \int_D \kappa(x,y) a_n(y) \, dy \Big| \\
&\quad + 3|D|\|\kappa\|_{C(\bar{D} \times \bar{D})} \|a - a_n\|_{C(\bar{D})} \\
&< \frac{\epsilon}{4} + \frac{3\epsilon}{4} = \epsilon,
\end{aligned}$$
which completes the proof.
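The Riemann-sum argument above admits a short numerical sketch, with a closed-form kernel and input standing in for the learned $\kappa$ and the function $a$ (both are assumptions made only for illustration): the same kernel evaluated on finer and finer uniform discretizations of $D = (0,1)$ approaches the continuum integral, which is the sense in which the layer is discretization-invariant.

```python
# Illustrative sketch (not from the paper): the Riemann-sum argument behind Theorem 8.
# A single kernel layer evaluated on refined uniform discretizations of D = (0, 1)
# converges to the continuum integral, so the same parameters work at every resolution.
import numpy as np

kappa = lambda x, y: np.exp(-np.abs(x - y))   # stand-in for the learned kernel
a = lambda y: np.sin(2 * np.pi * y)           # stand-in for an input function

x_eval = np.linspace(0.0, 1.0, 7)             # points at which the layer output is queried

def layer(j):
    """sum_k kappa(x, y_k) a(y_k) |P_j^(k)| on a uniform grid of j midpoints."""
    y = (np.arange(j) + 0.5) / j              # each cell P_j^(k) has measure 1/j
    return (kappa(x_eval[:, None], y[None, :]) * a(y)[None, :]).sum(axis=1) / j

coarse, fine = layer(16), layer(4096)
print(np.max(np.abs(coarse - fine)))          # shrinks as the discretization is refined
```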
Appendix F.
Proof [of Theorem 11] The statement in Lemma 26 allows us to apply Lemma 22 to find a mapping $G_1 : \mathcal{A} \to \mathcal{U}$ such that
$$\sup_{a \in K} \|\mathcal{G}^\dagger(a) - G_1(a)\|_{\mathcal{U}} \le \frac{\epsilon}{2},$$
where $G_1 = G \circ \psi \circ F$ with $F : \mathcal{A} \to \mathbb{R}^J$, $G : \mathbb{R}^{J'} \to \mathcal{U}$ continuous linear maps and $\psi \in C(\mathbb{R}^J; \mathbb{R}^{J'})$ for some $J, J' \in \mathbb{N}$. By Lemma 32, we can find a sequence of maps $F_t \in \mathrm{IO}(\sigma_2; D, \mathbb{R}, \mathbb{R}^J)$ for $t = 1, 2, \ldots$ such that
$$\sup_{a \in K} \sup_{x \in \bar{D}} \big| F_t(a)(x) - F(a) \big|_1 \le \frac{1}{t}.$$
In particular, $F_t(a)(x) = w_t(a) 1(x)$ for some $w_t : \mathcal{A} \to \mathbb{R}^J$ which is constant in space. We can therefore identify the range of $F_t(a)$ with $\mathbb{R}^J$. Define the set
$$Z := \bigcup_{t=1}^\infty F_t(K) \cup F(K) \subset \mathbb{R}^J,$$
which is compact since each $F_t(K)$ and $F(K)$ are compact and $F_t \to F$ uniformly on $K$. Hence $\psi$ is uniformly continuous on $Z$: there is a continuous, non-decreasing modulus of continuity $\omega$ with $\omega(0) = 0$ such that $|\psi(z_1) - \psi(z_2)|_1 \le \omega(|z_1 - z_2|_1)$ for all $z_1, z_2 \in Z$. Fix $T \in \mathbb{N}$ large enough that $\|G\|\,\omega(1/T) \le \epsilon/6$.
Since $F_T$ is continuous, $F_T(K)$ is compact. Since $\psi$ is a continuous function on the compact set $F_T(K) \subset \mathbb{R}^J$ mapping into $\mathbb{R}^{J'}$, we can use any classical neural network approximation theorem, such as (Pinkus, 1999, Theorem 4.1), to find a uniformly $\epsilon$-close neural network. Since Lemma 35 shows that neural operators can exactly mimic standard neural networks, it follows that we can find $S_1 \in \mathrm{IO}(\sigma_1; D, \mathbb{R}^J, \mathbb{R}^{d_1}), \ldots, S_{N-1} \in \mathrm{IO}(\sigma_1; D, \mathbb{R}^{d_{N-1}}, \mathbb{R}^{J'})$ for some $N \in \mathbb{N}_{\ge 2}$ and $d_1, \ldots, d_{N-1} \in \mathbb{N}$ such that
$$\tilde{\psi}(f) := \big(S_{N-1} \circ \sigma_1 \circ \cdots \circ S_2 \circ \sigma_1 \circ S_1\big)(f), \quad \forall f \in L^1(D; \mathbb{R}^J),$$
satisfies
$$\sup_{q \in F_T(K)} \sup_{x \in \bar{D}} \big| \psi(q) - \tilde{\psi}(q1)(x) \big|_1 \le \frac{\epsilon}{6\|G\|}.$$
By construction, $\tilde{\psi}$ maps constant functions into constant functions and is continuous in the appropriate subspace topology of constant functions; hence we can identify it as an element of $C(\mathbb{R}^J; \mathbb{R}^{J'})$ for any input constant function taking values in $\mathbb{R}^J$. Then $(\tilde{\psi} \circ F_T)(K) \subset \mathbb{R}^{J'}$ is compact. Therefore, by Lemma 34, we can find a neural network $\kappa \in N_L(\sigma_3; \mathbb{R}^{d'} \times \mathbb{R}^{d'}, \mathbb{R}^{1 \times J'})$ for some $L \in \mathbb{N}$ such that
$$\tilde{G}(f) := \int_{D'} \kappa(\cdot, y) f(y) \, dy, \quad \forall f \in L^1(D; \mathbb{R}^{J'}),$$
satisfies
$$\sup_{y \in (\tilde{\psi} \circ F_T)(K)} \|G(y) - \tilde{G}(y1)\|_{\mathcal{U}} \le \frac{\epsilon}{6}.$$
Define
$$\mathcal{G}(a) := \big(\tilde{G} \circ \tilde{\psi} \circ F_T\big)(a) = \int_{D'} \kappa(\cdot, y)\big( (S_{N-1} \circ \sigma_1 \circ \cdots \circ \sigma_1 \circ S_1 \circ F_T)(a) \big)(y) \, dy, \quad \forall a \in \mathcal{A},$$
noting that $\mathcal{G} \in \mathrm{NO}_N(\sigma_1, \sigma_2, \sigma_3; D, D')$. For any $a \in K$, define $a_1 := (\psi \circ F)(a)$ and $\tilde{a}_1 := (\tilde{\psi} \circ F_T)(a)$ so that $G_1(a) = G(a_1)$ and $\mathcal{G}(a) = \tilde{G}(\tilde{a}_1 1)$; then
$$\begin{aligned}
\|G_1(a) - \mathcal{G}(a)\|_{\mathcal{U}} &\le \|G(a_1) - G(\tilde{a}_1)\|_{\mathcal{U}} + \|G(\tilde{a}_1) - \tilde{G}(\tilde{a}_1 1)\|_{\mathcal{U}} \\
&\le \|G\|\,|a_1 - \tilde{a}_1|_1 + \sup_{y \in (\tilde{\psi} \circ F_T)(K)} \|G(y) - \tilde{G}(y1)\|_{\mathcal{U}} \\
&\le \frac{\epsilon}{6} + \|G\|\,\big|(\psi \circ F)(a) - (\psi \circ F_T)(a)\big|_1 + \|G\|\,\big|(\psi \circ F_T)(a) - (\tilde{\psi} \circ F_T)(a)\big|_1 \\
&\le \frac{\epsilon}{6} + \|G\|\,\omega\big(|F(a) - F_T(a)|_1\big) + \|G\| \sup_{q \in F_T(K)} |\psi(q) - \tilde{\psi}(q)|_1 \\
&\le \frac{\epsilon}{2}.
\end{aligned}$$
Finally, we have
$$\|\mathcal{G}^\dagger(a) - \mathcal{G}(a)\|_{\mathcal{U}} \le \|\mathcal{G}^\dagger(a) - G_1(a)\|_{\mathcal{U}} + \|G_1(a) - \mathcal{G}(a)\|_{\mathcal{U}} \le \frac{\epsilon}{2} + \frac{\epsilon}{2} = \epsilon$$
as desired.
To show boundedness, we will exhibit a neural operator $\tilde{\mathcal{G}}$ that is $\epsilon$-close to $\mathcal{G}$ on $K$ and is uniformly bounded by $4M$. Note first that
$$\|\mathcal{G}(a)\|_{\mathcal{U}} \le \|\mathcal{G}(a) - \mathcal{G}^\dagger(a)\|_{\mathcal{U}} + \|\mathcal{G}^\dagger(a)\|_{\mathcal{U}} \le \epsilon + M \le 2M, \quad \forall a \in K,$$
where, without loss of generality, we assume that $M \ge 1$. By construction, we have that
$$\mathcal{G}(a) = \sum_{j=1}^{J'} \tilde{\psi}_j(F_T(a)) \varphi_j, \quad \forall a \in \mathcal{A},$$
for some neural network $\varphi : \mathbb{R}^{d'} \to \mathbb{R}^{J'}$. Since $\mathcal{U}$ is a Hilbert space and by linearity, we may assume that the components $\varphi_j$ are orthonormal, since orthonormalizing them only requires multiplying the last layer of $\tilde{\psi}$ by an invertible linear map. Therefore
$$|\tilde{\psi}(F_T(a))|_2 = \|\mathcal{G}(a)\|_{\mathcal{U}} \le 2M, \quad \forall a \in K.$$
Define the set $W := (\tilde{\psi} \circ F_T)(K) \subset \mathbb{R}^{J'}$, which is compact as before. We have
$$\mathrm{diam}_2(W) = \sup_{x,y \in W} |x - y|_2 \le \sup_{x,y \in W} |x|_2 + |y|_2 \le 4M.$$
Since $\sigma_1 \in \mathrm{BA}$, there exists a number $R \in \mathbb{N}$ and a neural network $\beta \in N_R(\sigma_1; \mathbb{R}^{J'}, \mathbb{R}^{J'})$ such that
$$|\beta(x) - x|_2 \le \epsilon, \quad \forall x \in W,$$
$$|\beta(x)|_2 \le 4M, \quad \forall x \in \mathbb{R}^{J'}.$$
Define
$$\tilde{\mathcal{G}}(a) := \sum_{j=1}^{J'} \beta_j\big(\tilde{\psi}(F_T(a))\big) \varphi_j, \quad \forall a \in \mathcal{A}.$$
Then, for any $a \in K$, orthonormality of the $\varphi_j$ gives
$$\|\tilde{\mathcal{G}}(a) - \mathcal{G}(a)\|_{\mathcal{U}} = \big|\beta\big(\tilde{\psi}(F_T(a))\big) - \tilde{\psi}(F_T(a))\big|_2 \le \epsilon.$$
Furthermore, writing $q := \tilde{\psi}(F_T(a))$ for any $a \in \mathcal{A}$,
$$\|\tilde{\mathcal{G}}(a)\|_{\mathcal{U}} = |\beta(q)|_2 \le 4M$$
as desired.
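The encode/finite-dimensional-map/decode structure used in this proof can be summarized in a short sketch; every ingredient below (basis functions, weights) is an arbitrary stand-in chosen only to show how the three pieces compose on $D = D' = (0,1)$.

```python
# Illustrative sketch (not from the paper): encode to a constant vector F_T(a), apply a
# finite-dimensional network, then decode into a basis expansion.
import numpy as np

n = 1000
x = (np.arange(n) + 0.5) / n
dx = 1.0 / n
enc = np.stack([np.sin(np.pi * x), np.cos(np.pi * x), x])          # kernels of F_T
dec = np.stack([np.sin(np.pi * x), np.sin(2 * np.pi * x)])         # basis functions of G~

rng = np.random.default_rng(1)
W0, b0 = rng.standard_normal((16, 3)), rng.standard_normal(16)
W1, b1 = rng.standard_normal((2, 16)), rng.standard_normal(2)
psi_tilde = lambda w: W1 @ np.tanh(W0 @ w + b0) + b1               # finite-dimensional map

def G(a_vals):
    w = (enc * a_vals).sum(axis=1) * dx    # F_T(a): value of a constant vector-valued function
    q = psi_tilde(w)                       # psi~ applied to that constant value
    return dec.T @ q                       # G~: q -> sum_j q_j phi_j, sampled on the grid

print(G(np.exp(-x))[:5])                   # samples of the operator output
```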
Appendix G.
Proof [of Theorem 13] Let $\mathcal{U} = H^{m_2}(D)$. For any $R > 0$, define
$$\mathcal{G}_R^\dagger(a) := \begin{cases} \mathcal{G}^\dagger(a), & \|\mathcal{G}^\dagger(a)\|_{\mathcal{U}} \le R, \\[4pt] \dfrac{R}{\|\mathcal{G}^\dagger(a)\|_{\mathcal{U}}}\,\mathcal{G}^\dagger(a), & \text{otherwise}, \end{cases}$$
for any $a \in \mathcal{A}$. Since $\mathcal{G}_R^\dagger \to \mathcal{G}^\dagger$ as $R \to \infty$ $\mu$-almost everywhere, $\mathcal{G}^\dagger \in L^2_\mu(\mathcal{A}; \mathcal{U})$, and clearly $\|\mathcal{G}_R^\dagger(a)\|_{\mathcal{U}} \le \|\mathcal{G}^\dagger(a)\|_{\mathcal{U}}$ for any $a \in \mathcal{A}$, we can apply the dominated convergence theorem for Bochner integrals to find $R > 0$ large enough such that
$$\|\mathcal{G}_R^\dagger - \mathcal{G}^\dagger\|_{L^2_\mu(\mathcal{A};\mathcal{U})} \le \frac{\epsilon}{3}.$$
Since $\mathcal{A}$ and $\mathcal{U}$ are Polish spaces, by Lusin's theorem (Aaronson, 1997, Theorem 1.0.0) we can find a compact set $K \subset \mathcal{A}$ such that
$$\mu(\mathcal{A} \setminus K) \le \frac{\epsilon^2}{153R^2}$$
and $\mathcal{G}_R^\dagger|_K$ is continuous. Since $K$ is closed, by a generalization of the Tietze extension theorem (Dugundji, 1951, Theorem 4.1), there exists a continuous mapping $\tilde{\mathcal{G}}_R^\dagger : \mathcal{A} \to \mathcal{U}$ such that $\tilde{\mathcal{G}}_R^\dagger(a) = \mathcal{G}_R^\dagger(a)$ for all $a \in K$ and
$$\sup_{a \in \mathcal{A}} \|\tilde{\mathcal{G}}_R^\dagger(a)\|_{\mathcal{U}} \le \sup_{a \in \mathcal{A}} \|\mathcal{G}_R^\dagger(a)\|_{\mathcal{U}} \le R.$$
Applying Theorem 11 to $\tilde{\mathcal{G}}_R^\dagger$, we find that there exists a number $N \in \mathbb{N}$ and a neural operator $\mathcal{G} \in \mathrm{NO}_N(\sigma_1, \sigma_2, \sigma_3; D, D')$ such that
$$\sup_{a \in K} \|\mathcal{G}(a) - \mathcal{G}_R^\dagger(a)\|_{\mathcal{U}} \le \frac{\sqrt{2}\,\epsilon}{3}$$
and
$$\sup_{a \in \mathcal{A}} \|\mathcal{G}(a)\|_{\mathcal{U}} \le 4R.$$
We then have
$$\begin{aligned}
\|\mathcal{G}^\dagger - \mathcal{G}\|_{L^2_\mu(\mathcal{A};\mathcal{U})} &\le \|\mathcal{G}^\dagger - \mathcal{G}_R^\dagger\|_{L^2_\mu(\mathcal{A};\mathcal{U})} + \|\mathcal{G}_R^\dagger - \mathcal{G}\|_{L^2_\mu(\mathcal{A};\mathcal{U})} \\
&\le \frac{\epsilon}{3} + \Big( \int_K \|\mathcal{G}_R^\dagger(a) - \mathcal{G}(a)\|_{\mathcal{U}}^2 \, d\mu(a) + \int_{\mathcal{A} \setminus K} \|\mathcal{G}_R^\dagger(a) - \mathcal{G}(a)\|_{\mathcal{U}}^2 \, d\mu(a) \Big)^{\frac{1}{2}} \\
&\le \frac{\epsilon}{3} + \Big( \frac{2\epsilon^2}{9} + 2 \sup_{a \in \mathcal{A}} \big( \|\mathcal{G}_R^\dagger(a)\|_{\mathcal{U}}^2 + \|\mathcal{G}(a)\|_{\mathcal{U}}^2 \big) \, \mu(\mathcal{A} \setminus K) \Big)^{\frac{1}{2}} \\
&\le \frac{\epsilon}{3} + \Big( \frac{2\epsilon^2}{9} + 34R^2 \, \mu(\mathcal{A} \setminus K) \Big)^{\frac{1}{2}} \\
&\le \frac{\epsilon}{3} + \Big( \frac{4\epsilon^2}{9} \Big)^{\frac{1}{2}} \\
&= \epsilon
\end{aligned}$$
as desired.
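The radial clipping used to define $\mathcal{G}_R^\dagger$ admits a one-line finite-dimensional sketch, with $\mathcal{U}$ replaced by $\mathbb{R}^n$ and the Euclidean norm (an assumption made only for illustration): outputs of norm larger than $R$ are rescaled onto the ball of radius $R$, while outputs of small norm are left untouched.

```python
# Illustrative sketch (not from the paper): the radial clipping behind G_R^dagger,
# shown in R^n with the Euclidean norm standing in for the norm of U.
import numpy as np

def clip_radially(u, R):
    norm = np.linalg.norm(u)
    return u if norm <= R else (R / norm) * u

print(clip_radially(np.array([0.3, 0.4]), R=1.0))     # unchanged: [0.3 0.4]
print(clip_radially(np.array([30.0, 40.0]), R=1.0))   # rescaled to norm 1: [0.6 0.8]
```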