
Latent Neural Operator Pretraining for Solving Time-Dependent PDEs

Tian Wang1,2 and Chuang Wang1,2


1 Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
{wangtian2022,wangchuang}@ia.ac.cn
2 School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China

Abstract. Pretraining methods have recently attracted increasing attention for solving PDEs with neural operators. They alleviate the data scarcity problem encountered when learning a neural operator for a single PDE by training on large-scale datasets consisting of various PDEs and exploiting patterns shared among different PDEs to improve solution precision. In this work, we propose the Latent Neural Operator Pretraining (LNOP) framework based on the Latent Neural Operator (LNO) backbone. We learn a universal transformation by pretraining on a hybrid time-dependent PDE dataset to extract representations of different physical systems, and solve various time-dependent PDEs in the latent space by finetuning on individual PDE datasets. Our proposed LNOP framework reduces the solution error by 31.7% on four problems, and the reduction further increases to 57.1% after finetuning. On out-of-distribution datasets, our LNOP model achieves roughly 50% lower error and 3× data efficiency on average across different dataset sizes. These results show that our method is more competitive in terms of solution precision, transfer capability and data efficiency compared to non-pretrained neural operators.

Keywords: Latent Neural Operator · Pretraining · PDE

1 Introduction
Partial Differential Equations (PDEs) describe the underlying principles of numerous phenomena in the real world, with broad applications ranging from weather forecasting[2,26] and pollution detection[27] to industrial design[42]. Consequently, solving PDEs accurately and efficiently remains a pivotal research area. Traditional numerical methods such as the finite element method and the spectral method solve PDEs by transforming continuous differential equations into discrete difference equations, which requires specialized knowledge and substantial computational resources. With the advent of deep learning, employing surrogate models based on neural networks as alternatives to traditional numerical methods offers a fresh approach to PDE solving, with lower computational cost and potentially better generalizability to characterize real-world dynamics even beyond the scope of PDE-based models.

Neural operator[15], which adopts the data-driven paradigm to directly learn infinite-dimensional mappings from input functions to output functions, stands out as a promising surrogate modeling approach for solving PDEs. Compared to its peer methods, e.g., physics-informed neural networks[29,14,30], neural operators demonstrate faster inference speeds and superior generalization ability but lower accuracy. Existing neural operator methods mostly rely on training different models for the corresponding PDE problems based on simulated data generated by numerical methods, restricting neural operators to specialized, case-by-case problems rather than leveraging extensive data from diverse PDE problems to learn common representations for solving general problems.
The pretraining framework has become de facto groundwork for building universally capable large models in many machine learning fields such as computer vision[6,13,28] and natural language processing[31,4]; such models utilize unlabeled data collected from differently designed datasets to explore common representations among the data. In the realm of PDE solving, pretraining promises to alleviate the data-scarcity problem for neural operator learning.
In this work, we follow the idea of Latent Neural Operator[38] (LNO) and
establish pretraining for multiple time-dependent PDEs in the latent space. The
major contributions are summarized as follows.

– We introduce the Latent Neural Operator Pretraining (LNOP) framework, which extracts representations of different PDEs in a shared latent space through a universal transformation module implemented as Physics-Cross-Attention (PhCA).
– Numerical experiments on various time-dependent PDE problems indicate that our proposed framework exploits common representations from various physical systems and achieves better solution precision than methods without pretraining.
– The proposed pretraining framework also exhibits strong transfer learning capability and data efficiency, as demonstrated by experiments that apply the pretrained universal transformation to out-of-distribution time-dependent PDE problems.

The rest of the paper is organized as follows. First, we review existing work on neural operators and PDE pretraining in Section 2. We illustrate the working principles of the LNO backbone and our proposed LNOP framework in Section 3.
We describe the dataset details in Section 4. Subsequently, we present a series of
experiments and corresponding analysis to study the performance of our LNOP
framework in terms of solution precision, transfer capability and data efficiency
in Section 5. Finally, we provide concluding remarks of our work in Section 6.

2 Related Work
2.1 Neural Operator
Neural operator methods aim to solve PDEs by learning mappings between
functions. For instance, they map the coefficients, initial conditions or boundary
conditions of PDEs, which serve as input functions, to the solution, which serves
as the output function. DeepONet[24] designs trunk and branch structures for
encoding query positions of the output function and observed values of the
input function respectively, and the results from these two parts are combined
to predict the output function. FNO[21] utilizes the Fourier transform to learn transformations between functions in the frequency domain and has inspired a series of variants including Geo-FNO[20], U-FNO[39], and F-FNO[35], which extend the applicability or enhance the precision and efficiency of FNO.
With the tremendous success of Transformer[36] structures in fields of com-
puter vision and natural language processing, Transformer-based neural operator
methods have also been proposed. Galerkin-Transformer[5] first introduces Galerkin-type and Fourier-type attention mechanisms as kernels in neural operators. OFormer[18] extends Galerkin-type attention to the case of cross-attention.
GNOT[11] further proposes Heterogeneous Normalized Cross-Attention to ac-
commodate multiple input functions.
To address the significant computational cost of quadratic-complexity attention mechanisms when applied to PDE problems with large spatial grids, many works have been proposed. FactFormer[19] projects high-
dimensional PDEs into multiple single-dimensional functions. Transolver[41]
uses physical attention to allocate geometric features to a constant number
of physical slices in each Transformer block. LNO[38] employs Physics-Cross-
Attention (PhCA) to solve PDEs in the latent space. Our LNOP framework follows the idea of LNO: we train the encoder and decoder to learn a universal transformation that extracts the common representations of multiple PDEs in a shared latent space.
Despite the strong nonlinear approximation capability of neural operators, they face the challenge of insufficient training data. Simulated data generated using traditional numerical methods is often computationally expensive, while real-world data is difficult to collect. Some approaches such as PINO[22] and PI-DeepONet[37] incorporate physical priors into neural operators to alleviate
data scarcity. Other works attempt to construct pretrained foundation models
for various downstream PDE tasks involving scarce data.

2.2 PDE Pretraining

Pretraining has been a crucial driver of recent breakthroughs in computer vision[12,17] and natural language processing[7,34,4]. Models are first pretrained
on pretext tasks to learn generic and task-agnostic representations from vast
amounts of raw data in an unsupervised manner. The task-specific components
are then finetuned with minimal data to complete the downstream task. This
paradigm significantly enhances data efficiency and enables effective utilization
of resources.
In the field of PDE solving, pretraining strategies are also gradually being adopted. MPP[25] first proposes an autoregressive task to pretrain video Transformer[1] models on a dataset consisting of various time-dependent PDEs in fluid mechanics. DPOT[10] extends MPP by incorporating a denoising objective into the autoregressive task and utilizes a Fourier Transformer[8,16] backbone. PDEformer[43] trains graph Transformers on a dataset containing 1D time-dependent PDEs with diverse conditions. Unisolver[9] extends the idea of equation-conditional modulation to 2D time-dependent PDEs and applies domain-wise and point-wise conditions to modulate PDE representations using different attention heads. These efforts and our proposed framework all follow the same pretraining-and-finetuning paradigm as in computer vision and natural language processing.

3 Method
We first provide the formal definition of solving time-dependent PDEs, and then
introduce the Latent Neural Operator (LNO) backbone. Finally, we present the
framework of our Latent Neural Operator Pretraining (LNOP) approach.

3.1 Problem Setup


We consider the time-dependent PDE defined on D ⊆ Ω × [0, T],
$$\mathcal{L}_a \circ u = 0, \quad \text{with} \quad u(x, 0) = u_0(x),\; x \in \Omega, \quad \text{and} \quad u(x, t) = b(x),\; x \in \partial\Omega,$$
where $\mathcal{L}_a$ is an operator containing partial differentials parameterized by coefficients a; u_0(x) and b(x) represent the initial condition and boundary condition, respectively, and u(x, t) = u_t(x) is the solution to the PDE.
A set of classic linear time-dependent PDEs such as the heat equation,
Laplace’s equation and the wave equation can be decoupled and solved by a
proper functional transformation, following a general procedure that i) first
transform the spatial domain of the system into another new domain; ii) then
predict the evolution of the system over time in the new domain; iii) finally
transform back from the new domain to the spatial domain. This method can be
formalized as
$$\hat{u}_0(w) = \mathcal{F}(u_0)(w), \qquad \hat{u}_{t+\Delta t} = \mathcal{P}(\hat{u}_t), \qquad u_T(x) = \mathcal{F}^{-1}(\hat{u}_T)(x),$$
where x ∈ Ω and w ∈ Ω′. For different time-dependent PDE systems, the transforma-
tion F can be, for example, the Fourier transform or Laplace transform. Inspired
by this, we treat the PhCA encoder and decoder modules in LNO[38] as a universal feature transformation and its inverse, respectively. We consider the intermediate Transformer layers as a short-term propagator, and train the LNO end-to-end to learn a feature transformation that is applicable across different PDE problems.
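As a concrete illustration of this transform-propagate-inverse pattern, the sketch below solves the 2D heat equation with periodic boundaries, using the Fourier transform as F and a diagonal multiplier as the short-term propagator P. The grid size, viscosity and step size are illustrative choices, not values used in this paper.

```python
import numpy as np

def solve_heat_fft(u0, nu=0.01, dt=0.01, n_steps=100):
    """Evolve the 2D heat equation du/dt = nu * Laplacian(u) on [0, 1]^2 (periodic)."""
    n = u0.shape[0]
    k = 2 * np.pi * np.fft.fftfreq(n, d=1.0 / n)   # angular wavenumbers
    kx, ky = np.meshgrid(k, k, indexing="ij")
    k2 = kx**2 + ky**2

    u_hat = np.fft.fft2(u0)                        # i) transform: F(u0)
    propagator = np.exp(-nu * k2 * dt)             # ii) short-term evolution in the new domain
    for _ in range(n_steps):
        u_hat = propagator * u_hat
    return np.fft.ifft2(u_hat).real                # iii) inverse transform: F^{-1}

u_T = solve_heat_fft(np.random.randn(64, 64))
```

In LNO the fixed analytic transform is replaced by the learned PhCA encoder/decoder, and the diagonal multiplier by learned Transformer layers.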
3.2 Latent Neural Operator
We briefly introduce our previous work, LNO[38], for solving forward and inverse PDE problems, which serves as the pretraining backbone. In contrast to the original work[38], which trains models to predict different PDEs separately in a case-by-case manner, here we explore the pretraining and universal representation ability of LNO.
LNO consists of five modules: an input projector to lift the dimension of the
input data, an encoder to transform the input embedding into a learnable latent
space, a sequence of Transformer layers for modeling the operator in the latent space, a decoder to recover the latent representation back to the real-world space, and an output projector to project the lifted output embedding back to the dimension of the output data.
Physics-Cross-Attention (PhCA) is the core of LNO, used for transforming
between N embeddings in the real geometric space and M representation tokens
in the latent space. Since the latent space is much more compact than the large
geometric space, PhCA significantly reduces the computational load when solving
PDEs.
The input projector contains two parts, a branch projector and a trunk projector, following the convention of DeepONet[24]. In the encoding phase: i) the branch projector converts the N observation positions and corresponding values of the input function into N embeddings in the real geometric space, which serve as the value matrix; ii) the trunk projector lifts the dimension of each observation position of the input function, which is then converted into M attention scores by the MLP in the PhCA encoder, so the N observation positions yield an M × N attention score matrix; iii) the row-normalized attention score matrix is multiplied by the value matrix to obtain the M representation tokens in the latent space. Conversely, the decoding phase is the inverse of the encoding. Specifically, i) the representation tokens transformed through multiple Transformer layers serve as the value matrix; ii) the trunk projector lifts the dimension of each query position of the output function, which is then converted into M attention scores by the MLP in the PhCA decoder, so the N query positions yield an M × N attention score matrix; iii) the column-normalized attention score matrix is transposed and multiplied by the value matrix to map the representation tokens in the latent space back to the real geometric space, where they are finally converted into output function values by the output projector.
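A minimal PyTorch sketch of this encode/decode pattern is given below. The module names, hidden sizes and MLP depth are our own illustrative assumptions rather than the exact LNO implementation; the sketch only reproduces the row-normalized encoding and column-normalized decoding described above.

```python
import torch
import torch.nn as nn

class PhCASketch(nn.Module):
    def __init__(self, pos_dim=2, hidden=128, n_latent=64):
        super().__init__()
        # trunk-style MLPs mapping a position to M attention logits
        self.enc_mlp = nn.Sequential(nn.Linear(pos_dim, hidden), nn.GELU(),
                                     nn.Linear(hidden, n_latent))
        self.dec_mlp = nn.Sequential(nn.Linear(pos_dim, hidden), nn.GELU(),
                                     nn.Linear(hidden, n_latent))

    def encode(self, obs_pos, obs_emb):
        # obs_pos: (B, N, pos_dim); obs_emb: (B, N, hidden) from the branch projector
        scores = self.enc_mlp(obs_pos).transpose(1, 2)    # (B, M, N) attention scores
        attn = scores.softmax(dim=-1)                     # row-normalized over the N positions
        return attn @ obs_emb                             # (B, M, hidden) latent tokens

    def decode(self, query_pos, latent_tokens):
        # query_pos: (B, N, pos_dim); latent_tokens: (B, M, hidden) after the Transformer layers
        scores = self.dec_mlp(query_pos).transpose(1, 2)  # (B, M, N) attention scores
        attn = scores.softmax(dim=-2)                     # column-normalized over the M tokens
        return attn.transpose(1, 2) @ latent_tokens       # (B, N, hidden), fed to the output projector
```

Softmax is used here as one reasonable choice of normalization; the paper only specifies that the score matrix is row- or column-normalized.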

3.3 Framework Design

Overall Our proposed LNOP framework is an end-to-end method for solving multiple time-dependent PDEs based on the LNO backbone. Unlike the original LNO
work, we adopt a hybrid pretraining strategy to guide the PhCA mechanism
in learning the universal transformation applicable to different heterogeneous
physical systems. This extends the LNO, which was designed to solve a single
physical system, into the LNOP capable of solving multiple physical systems in
higher precision.
LNOP consists of three components: PhCA encoder/decoder and the propa-
gator. As illustrated in Figure 1, we pretrain the LNO backbone on the hybrid
dataset consisting of various time-dependent PDEs, and then finetune it on
downstream tasks for solving other PDEs.

Pretraining PhCA Encoder/Decoder In this work, we extend the role of PhCA from efficiency, as in [38], to universality. Unlike the LNO method, where multiple models are required to construct separate latent spaces for different
PDE problems, the LNOP framework uses a single model to construct the shared
latent space for various PDE problems during pretraining.

Fig. 1. The overall architecture of Latent Neural Operator Pretraining. We pretrain the Latent Neural Operator (LNO) on datasets containing various time-dependent PDE problems. This enables the PhCA encoder and decoder within the LNO to learn a general transformation, which is used to extract PDE representations in the latent space. Subsequently, we finetune the LNO on out-of-distribution downstream tasks to apply the learned universal transformation.

In this context, the PhCA encoder/decoder can be seen as a learnable universal transformation. The
PhCA encoder transforms the sampled functions in the real geometric space
into representations in the latent space, while the PhCA decoder reconstructs
the function information from these representations. PhCA encoder and decoder
operate on the PDE spatial states and do not involve the temporal evolution
process.
The PhCA encoder and decoder compress high-dimensional PDE representations in the raw real-world space of a single physical system into representations that retain the essence of the PDE spatial states with higher information density in the latent space. This enhances the efficiency of simulating interactions between different spatial locations of the PDEs. In LNOP, the objective of solving multiple physical systems simultaneously on the hybrid dataset further constrains the
universal transformation learned by the PhCA encoder and decoder. As a result,
the representations capture the commonalities of different PDE spatial states,
facilitating their transfer to other downstream tasks.

Finetuning Propagator Since solving time-dependent PDEs essentially involves calculating the system's response based on its spatial state at the current moment and its inherent spatial interaction rules, we need the propagator to compute the representations of the spatial state at the next moment from the representations extracted by the PhCA encoder/decoder from the current PDE spatial state. Different physical systems have distinct spatial interaction rules, so in principle a different propagator is needed for each physical system. However, during the pretraining process, we use only one propagator to help align the representation space derived from the PhCA encoder/decoder with the solution space used for PDE temporal evolution prediction. This propagator predicts the new state of the PDE system at the next time step in each forward call of the entire model to reduce the approximation errors caused by differences among various physical systems. This means that each short-time evolution prediction of the PDE goes through the complete process involving the PhCA encoder/decoder and the propagator. During the finetuning process, we adjust the propagator's parameters for different PDE problems to approximate the corresponding spatial interaction rules.
Considering the attention mechanism’s ability to model interactions among
multiple vectors and its excellent performance as a kernel function in operator
learning, we choose to use a series of Transformer layers as the propagator.

In the proposed LNOP framework above, we pretrain the LNO backbone on a hybrid dataset containing multiple time-dependent PDEs, rather than training a separate LNO model for each PDE as done in the original LNO paper[38].
Throughout this pretraining process, we expect that the PhCA encoder/decoder
can learn a universal transformation which maps spatial domain features of
different PDEs to a shared latent space, so that the propagator can perform
evolution prediction in the temporal domain within this latent space. Details on the training procedure and comparisons are presented in the experiment section.
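The sketch below shows one way the finetuning stage could select parameter groups, assuming an LNOP model exposing phca_encoder, phca_decoder, propagator and projector submodules (all names hypothetical). The ablation in Section 5.4 compares finetuning all parameters, only the PhCA encoder/decoder, or only the remaining components.

```python
import itertools

def finetune_parameters(model, mode="all"):
    """Pick the parameters to update during finetuning on a downstream PDE."""
    if mode == "all":                      # finetune every parameter
        return list(model.parameters())
    if mode == "phca":                     # only the universal transformation
        return list(itertools.chain(model.phca_encoder.parameters(),
                                    model.phca_decoder.parameters()))
    # "others": propagator plus input/output projectors, PhCA kept frozen
    return list(itertools.chain(model.propagator.parameters(),
                                model.in_proj.parameters(),
                                model.out_proj.parameters()))
```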

4 Dataset
We consider the hybrid dataset containing multiple physical systems which are
all time-dependent PDEs in 2D space, including the Navier-Stokes equation,
Shallow-Water equation, Burgers’ equation and Reaction-Diffusion equation.
All these PDEs describe time-varying systems whose response is determined
by interactions among different spatial locations. Therefore, we can solve them with a neural network that extracts representations of PDE spatial states and
approximates the temporal evolution of the representations.
Navier-Stokes Equation We use the Navier-Stokes equation in the FNO dataset[21]
with the form of
$$\partial_t u(x, t) + w(x, t) \cdot \nabla u(x, t) = \nu \Delta u(x, t) + f(x),$$
$$\nabla \cdot w(x, t) = 0, \qquad x \in \Omega,\ t \in [0, T],$$
where w is the velocity, u = ∇ × w is the vorticity, ν is the viscosity coefficient and f(x) is the forcing term. This is the fundamental equation used to explain and predict the behavior of fluids under various conditions.
We set Ω = [0, 1]^2, T = 20 and f(x) = 0.1(sin(2π(x_1 + x_2)) + cos(2π(x_1 + x_2))). The initial condition is generated according to u(x, 0) ∼ N(7^{2/3}(−∆ + 49I)^{−2.5}). A periodic boundary condition is applied. We use the data under three different viscosity coefficient values ν = 10^{-5}, 10^{-4}, 10^{-3}. For ν = 10^{-5}, there are 1200 trajectories, each containing 20 frames on 64 × 64 spatial grids. We use 1100 trajectories for training and the remaining 100 for testing. For ν = 10^{-4} and ν = 10^{-3}, there are 1100 trajectories each, containing 25 frames on 64 × 64 spatial grids. We use 1000 trajectories for training and the remaining 100 for testing. The data with viscosity coefficient ν = 10^{-5} is used during pretraining, while the data with viscosity coefficients ν = 10^{-4}, 10^{-3} is used to evaluate the model's transfer capability.

Shallow-Water Equation We use the Shallow-Water equation in the PDEBench[33] dataset, which has the form
$$\partial_t u(x, t) + \partial_x \big(u(x, t)\, v(x, t)\big) = 0,$$
$$\partial_t \big(u(x, t)\, v(x, t)\big) + \partial_x \Big(u(x, t)\, v^2(x, t) + \tfrac{1}{2} g_r u^2(x, t)\Big) = -g_r u(x, t)\, \partial_x b, \qquad x \in \Omega,\ t \in [0, T],$$
where u is the water depth, v denotes the velocities in the horizontal and vertical directions, b is the spatially varying bathymetry and g_r is the gravitational acceleration.
This equation can be used to describe the fluid motion in shallow water regions.
We set Ω = [−2.5, 2.5]^2 and T = 1. The initial condition is
$$u(x, 0) = \begin{cases} 2.0, & \|x\|_2^2 > r \\ 1.0, & \|x\|_2^2 \le r, \end{cases}$$
where r ∼ U(0.3, 0.7) is the radius of the circular bump in the center of the spatial
domain. A Dirichlet boundary condition is applied. There are 1000 trajectories, each containing 100 frames on 128 × 128 spatial grids. We downsample the temporal dimension to 20 frames and the spatial dimensions to 64 × 64. We use 900 trajectories for training and the remaining 100 for testing. Although the generated data exhibits spatial symmetry, we do not incorporate this characteristic as a prior into the model architecture or the prediction process.
Burgers’ Equation We generate the data of Burgers’ equation following
$$\partial_t u(x, t) = D\, \partial_{xx} u(x, t) - u(x, t)\, \partial_x u(x, t), \qquad x \in \Omega,\ t \in [0, T],$$
where D is the diffusion coefficient. This equation is used to simulate the formation
of shock waves.
We set Ω = [−1, 1]^2, T = 1 and D = 0.001/π. The initial condition is generated according to u(x, 0) ∼ N(7^{2/3}(−∆ + 49I)^{−2.5}). A periodic boundary condition is applied. There are 1200 trajectories, each containing 20 frames on 64 × 64 spatial grids. We use 1100 trajectories for training and the remaining 100 for testing.
Reaction-Diffusion Equation We generate the data of the Reaction-Diffusion equation following
$$\partial_t u_1(x, t) = u_1 - u_1^3 - k - u_2 + D_1 \Delta u_1(x, t),$$
$$\partial_t u_2(x, t) = u_1 - u_2 + D_2 \Delta u_2(x, t), \qquad x \in \Omega,\ t \in [0, T],$$
where D_1 and D_2 are the diffusion coefficients for the activator and inhibitor respectively. This equation can be used to describe and analyze material diffusion and chemical reaction processes. We set Ω = [−1, 1]^2, T = 10, D_1 = 10^{-3}, D_2 = 5 × 10^{-3} and k = 5 × 10^{-3}. The initial condition u(x, 0) is generated according to u(x, 0) ∼ N(7^{2/3}(−∆ + 49I)^{−2.5}). A Neumann boundary condition is applied. There are 1200 trajectories, each containing 20 frames on 64 × 64 spatial grids. We use 1100 trajectories for training and the remaining 100 for testing.
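For reference, the hybrid pretraining set can be assembled by simply concatenating the four single-PDE datasets above; the sketch assumes each dataset is already wrapped as a torch Dataset of training pairs (all variable names are hypothetical).

```python
from torch.utils.data import ConcatDataset, DataLoader

# mix trajectories from all four physical systems into one pretraining set
hybrid_dataset = ConcatDataset([navier_stokes_train, shallow_water_train,
                                burgers_train, reaction_diffusion_train])
hybrid_loader = DataLoader(hybrid_dataset, batch_size=8, shuffle=True)
```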

5 Experiment
We conduct a series of experiments on our dataset and compare the results with
both classical and newly proposed neural operator methods. We demonstrate that our LNOP framework effectively improves the solution precision for time-dependent PDEs and exhibits strong transfer capability and data efficiency.

Baselines We choose to compare our framework with FNO[21], Transolver[41] and LNO[38]. FNO is a classical neural operator method that stacks Fourier layers
to learn mappings between functions. Leveraging the significant role of Fourier
transform in time-frequency analysis, FNO has found wide application across
various tasks[3,26,40]. Transolver is a recent state-of-the-art neural operator method,
which performs allocation and re-allocation between geometric space and physical
slices in each Transformer layer to help exploit the physical interactions between
different spatial regions. LNO has been introduced in Section 3.2. All of FNO,
Transolver and LNO are trained on each single PDE problem in our dataset.

Implementation All models are trained for 500 epochs using the AdamW[23] optimizer and the OneCycleLR[32] scheduler with an initial learning rate of 0.001. We choose the relative L2 error as the loss function. For FNO, we set the mode number to 12. For Transolver, we set the slice number to 32. We construct both small-scale and large-scale versions of LNO. The small-scale version (marked with the suffix -S) consists of 4 Transformer layers and has 64 representation tokens, each of dimension 128. The large-scale version (marked with the suffix -L) consists of 8 Transformer layers and has 256 representation tokens, each of dimension 256. All experiments are conducted on a single RTX 3090 GPU, with batch sizes adjusted from 4 to 16 based on memory usage. FNO and Transolver have about 0.9 and 1.6 million model parameters respectively, while the two versions of LNO have 0.8 and 5.0 million model parameters respectively.
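A minimal sketch of this training configuration (AdamW, OneCycleLR with learning rate 0.001, relative L2 loss) is shown below; the model and data loader are assumed to be defined elsewhere.

```python
import torch

def relative_l2(pred, target):
    # batch-averaged relative L2 error, also used as the training loss
    diff = (pred - target).flatten(1).norm(dim=1)
    return (diff / target.flatten(1).norm(dim=1)).mean()

def train(model, train_loader, epochs=500, lr=1e-3):
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    scheduler = torch.optim.lr_scheduler.OneCycleLR(
        optimizer, max_lr=lr, epochs=epochs, steps_per_epoch=len(train_loader))
    for _ in range(epochs):
        for inputs, targets in train_loader:
            loss = relative_l2(model(inputs), targets)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            scheduler.step()
    return model
```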

5.1 Solution Precision


We train models of different methods to autoregressively solve various time-
dependent PDEs given the initial 10 time steps and compare the solution precision,
as shown in Table 1.
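The autoregressive evaluation can be sketched as follows: the model receives the most recent 10 frames and predicts the next frame, which is fed back as input until the trajectory is complete. The exact model interface is an assumption on our part.

```python
import torch

@torch.no_grad()
def rollout(model, trajectory, n_context=10):
    """trajectory: (T, H, W); returns the predicted frames n_context .. T-1."""
    frames = list(trajectory[:n_context])
    preds = []
    for _ in range(trajectory.shape[0] - n_context):
        history = torch.stack(frames[-n_context:])         # most recent 10 frames
        next_frame = model(history.unsqueeze(0)).squeeze(0)
        preds.append(next_frame)                            # feed the prediction back
        frames.append(next_frame)
    return torch.stack(preds)
```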
LNOP trained on the hybrid dataset achieves higher or comparable solution precision on various PDE problems, even without finetuning, compared to LNO trained on the individual datasets. The small-scale and large-scale versions of the LNOP method reduce the error by an average of 5.6% and 31.7%, respectively, compared to LNO across all problems.
As the finetuning epochs on each PDE problem's dataset increase from 100 to 500, the solution precision of LNOP continues to improve. After finetuning for 500 epochs, the error reduction of the LNOP method in the two versions further improves to 46.8% and 57.1%, respectively. These results demonstrate that the universal transformation learned by LNOP during pretraining can extract common representations across different PDE problems, thereby effectively improving solution precision.

LNOP achieves more significant precision improvements through pretraining in the large-scale version than in the small-scale version. This indicates that as the model's capacity to process data increases, the data from different PDE problems can complement each other. This further underscores the necessity of learning a universal transformation from multiple physical systems.

Table 1. The solution precision of different models on various PDE problems. The relative L2 error is recorded. The best result in each group is in bold. Values in parentheses indicate the change in error relative to the LNO model of the same scale, where '-' denotes a reduction and '+' denotes an increase.

Model                  Navier-Stokes    Shallow-Water    Burgers'         Reaction-Diffusion
FNO[21]                0.1214           0.0017           0.0174           0.0572
Transolver[41]         0.1012           0.0015           0.0193           0.0473
LNO-S[38]              0.0949           0.0013           0.0148           0.0467
LNOP-S(pretrain)       0.0730(-23.1%)   0.0014(+7.7%)    0.0153(+3.4%)    0.0419(-10.3%)
LNOP-S(finetune-100)   0.0664(-30.0%)   0.0013(-0%)      0.0130(-12.2%)   0.0367(-21.4%)
LNOP-S(finetune-500)   0.0456(-52.0%)   0.0005(-61.5%)   0.0112(-24.3%)   0.0236(-49.5%)
LNO-L[38]              0.0845           0.0014           0.0037           0.0052
LNOP-L(pretrain)       0.0328(-61.2%)   0.0010(-28.6%)   0.0029(-21.6%)   0.0044(-15.4%)
LNOP-L(finetune-100)   0.0302(-64.3%)   0.0005(-64.3%)   0.0025(-32.4%)   0.0040(-23.1%)
LNOP-L(finetune-500)   0.0269(-68.2%)   0.0003(-78.6%)   0.0021(-43.2%)   0.0032(-38.5%)

5.2 Transfer Capability


An important goal of pretraining is to allow models to learn general representations
from large amounts of data and provide better parameter initialization for
downstream tasks, resulting in improvements in data utilization compared to
training from scratch where parameters are randomly initialized. We expect
LNOP to have strong transfer capability that helps improve solution precision and data efficiency in downstream tasks involving out-of-distribution PDEs that are unseen during the pretraining process.
We finetune the pretrained LNOP model on Navier-Stokes equations with viscosity coefficients of 10^{-4} and 10^{-3} (as opposed to 10^{-5} during the pretraining process) for 500 epochs using different proportions of the total data amount to validate its data efficiency, and compare the solution precision with the baselines, which are trained from scratch.
The results in Tables 2 and 3 show that pretrained LNOP achieves higher solution precision than the baselines when finetuned with varying proportions of data on Navier-Stokes equations with out-of-distribution viscosity coefficients. For the viscosity coefficient of 10^{-4}, the two versions of the LNOP model reduce the average error by 38.7% and 49.8%, respectively, compared to LNO across different training data sizes. For the viscosity coefficient of 10^{-3}, the error reductions are 57.9% and 59.6%, respectively. The LNOP model finetuned using only 30% of the data can achieve higher solution precision than the LNO model trained from scratch using 100% of the data. Roughly, LNOP has 3× the data efficiency of LNO.
The results demonstrate that the universal transformation learned from in-distribution time-dependent physical systems can be effectively adapted to out-of-distribution time-dependent PDE problems even in low-data situations, proving that our proposed LNOP framework possesses strong transfer learning capability and highly efficient data utilization.

Table 2. The solution precision of different models on the Navier-Stokes equation with viscosity coefficient ν = 10^{-4} under varying dataset scales. The relative L2 error is recorded. The best result in each group is in bold. Values in parentheses indicate the change in error relative to the LNO model of the same scale, where '-' denotes a reduction and '+' denotes an increase.
Model 10% data 30% data 50% data 80% data 100% data
FNO[21] 0.6618 0.6289 0.6327 0.6076 0.5632
Transolver[41] 0.4142 0.3872 0.3003 0.2498 0.2217
LNO-S[38] 0.3072 0.2224 0.1931 0.1571 0.1548
LNOP-S(finetune-500) 0.2424(-21.1%) 0.1460(-34.4%) 0.1090(-43.6%) 0.0835(-46.8%) 0.0815(-47.4%)
LNO-L[38] 0.2912 0.1964 0.1702 0.1402 0.1292
LNOP-L(finetune-500) 0.2259(-22.4%) 0.1227(-37.5%) 0.0826(-51.5%) 0.0477(-66.0%) 0.0367(-71.6%)

Table 3. The solution precision of different models on the Navier-Stokes equation with viscosity coefficient ν = 10^{-3} under varying dataset scales. The relative L2 error is recorded. The best result in each group is in bold. Values in parentheses indicate the change in error relative to the LNO model of the same scale, where '-' denotes a reduction and '+' denotes an increase.
Model 10% data 30% data 50% data 80% data 100% data
FNO[21] 0.0203 0.0097 0.0053 0.0041 0.0038
Transolver[41] 0.0354 0.0112 0.0054 0.0039 0.0036
LNO-S[38] 0.0218 0.0073 0.0046 0.0033 0.0031
LNOP-S(finetune-500) 0.0048(-78.0%) 0.0026(-64.4%) 0.0020(-56.5%) 0.0018(-45.5%) 0.0017(-45.2%)
LNO-L[38] 0.0128 0.0036 0.0023 0.0019 0.0016
LNOP-L(finetune-500) 0.0024(-81.3%) 0.0013(-63.9%) 0.0010(-56.5%) 0.0009(-52.6%) 0.0009(-43.8%)

5.3 Scaling
We conduct scaling experiments to examine the solution precision of our proposed LNOP framework on all time-dependent PDE problems as the number and dimension of representation tokens vary. The results in Figure 2(a) indicate that
the solution precision of the Navier-Stokes equation consistently improves as the
token dimension increases from 32 to as large as 256, while that of the other
three PDEs gradually saturates when the token dimension reaches 192. The
results in Figure 2(b) show that, aside from the Shallow-Water equation, which consistently maintains a precise solution, the solution precision of the other three
PDEs improves continuously with an increasing number of representation tokens.
5.4 Ablation Study
We conduct an ablation study to investigate the impact of finetuning different
components in the LNOP framework on the solution precision. Specifically, we
compare the following three finetuning scenarios: i) all parameters; ii) only
the PhCA encoder/decoder; iii) only the components other than the PhCA
encoder/decoder, including the input and output projector and the propagator.

Table 4. The solution precision of LNOP pretrained with different approaches or finetuned on different components, on various PDE problems. The relative L2 error is recorded. The best result is in bold.
Model Navier-Stokes Shallow-Water Burgers’ Reaction-Diffusion
LNOP-S(pretrain) 0.0730 0.0014 0.0153 0.0419
LNOP-S(finetune-All) 0.0456 0.0004 0.0112 0.0236
LNOP-S(finetune-PhCA) 0.0722 0.0014 0.0151 0.0411
LNOP-S(finetune-Others) 0.0526 0.0010 0.0117 0.0263
LNOP-S(two-stage) 0.2236 0.0122 0.0817 0.1274

Fig. 2. Results of scaling experiments (relative L2 errors). (a) Impact of representation token dimension on the solution precision of various PDE problems. (b) Impact of representation token quantity on the solution precision of various PDE problems.

The results in Table 4 show that, although finetuning all parameters achieves
the highest solution precision, finetuning the components other than the PhCA
encoder/decoder can yield higher precision than finetuning only the PhCA
encoder/decoder. This indicates that the PhCA encoder/decoder effectively learns a universal transformation for extracting representations from multiple physical systems.
We also try modifying the LNOP framework into a two-stage approach. In the
first stage, we pretrain a PhCA-based autoencoder on the hybrid dataset using
reconstruction task. The autoencoder takes several frames of time-dependent
PDEs as input, extracts representations in the latent space, and reconstructs
them back into PDE information. In the second stage, we train propagators using
autoregressive task on each single PDE problem to predict the temporal evolution.
The propagator iterates the PDE representations from the initial state to the final
state in the latent space. This implies that the PhCA encoder/decoder is only
used at the initial and final moments of the PDE system, with the intermediate
temporal evolution prediction relying solely on the propagator.
The result in the last row of Table 4 indicates that the two-stage LNOP approach does not perform as well as the end-to-end one. This may be due to the
discrepancy between the representation spaces required by the reconstruction task
and PDE solution task, which introduces a mismatch between the autoencoder
and the propagator.

6 Conclusion
We propose the Latent Neural Operator Pretraining (LNOP) framework to learn a universal transformation for extracting representations of different PDEs in a shared latent space across multiple physical systems. We pretrain the LNO backbone on a hybrid dataset comprising multiple time-dependent PDE problems and compare its solution precision under different finetuning conditions for both in-distribution and out-of-distribution time-dependent PDE problems. Through a series of experiments, we verify the precision improvement gained from learning shared representations through pretraining, and also validate the transfer capability and data efficiency brought by the learned universal transformation.
Our work also has some limitations. First, our purely data-driven method
does not leverage prior knowledge from different PDEs, which may compromise
the solution precision. Additionally, our method does not completely separate
PDE representation learning from PDE time evolution prediction, which slows
down the pretraining process.
Future work should focus on how to incorporate physical prior knowledge
as constraints or additional modalities into the PDE solving process, how to
improve PDE representation learning, and how to achieve PDE time evolution
estimation entirely in the latent space.

References
1. Arnab, A., Dehghani, M., Heigold, G., Sun, C., Lučić, M., Schmid, C.: ViViT: A
video vision transformer. In: Proceedings of the IEEE/CVF International Conference
on Computer Vision (ICCV). pp. 6836–6846 (2021)
2. Bi, K., Xie, L., Zhang, H., Chen, X., Gu, X., Tian, Q.: Accurate medium-range
global weather forecasting with 3d neural networks. Nature 619(7970), 533–538
(2023)
3. Bonev, B., Kurth, T., Hundt, C., Pathak, J., Baust, M., Kashinath, K., Anandkumar,
A.: Spherical fourier neural operators: Learning stable dynamics on the sphere.
In: Proceedings of the International Conference on Machine Learning (ICML). pp.
2806–2823 (2023)
4. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Nee-
lakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot
learners. In: Advances in Neural Information Processing Systems (NeurIPS) (2020)
5. Cao, S.: Choose a transformer: Fourier or Galerkin. In: Advances in Neural Infor-
mation Processing Systems (NeurIPS) (2021)
6. Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive
learning of visual representations. In: Proceedings of the International Conference
on Machine Learning (ICML) (2020)
7. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidi-
rectional Transformers for language understanding. arXiv preprint arXiv:1810.04805
(2018)
8. Guibas, J., Mardani, M., Li, Z., Tao, A., Anandkumar, A., Catanzaro, B.: Adaptive
Fourier neural operators: efficient token mixers for Transformers. arXiv preprint
arXiv:2111.13587 (2021)

9. Hang, Z., Ma, Y., Wu, H., Wang, H., Long, M.: Unisolver: PDE-conditional Trans-
formers are universal PDE solvers. arXiv preprint arXiv:2405.17527 (2024)
10. Hao, Z., Su, C., Liu, S., Berner, J., Ying, C., Su, H., Anandkumar, A., Song, J.,
Zhu, J.: DPOT: Auto-regressive denoising operator transformer for large-scale PDE
pre-training. arXiv preprint arXiv:2403.03542 (2024)
11. Hao, Z., Wang, Z., Su, H., Ying, C., Dong, Y., Liu, S., Cheng, Z., Song, J., Zhu, J.:
GNOT: a general neural operator Transformer for operator learning. In: Proceedings
of the International Conference on Machine Learning (ICML) (2023)
12. He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are
scalable vision learners. In: Proceedings of the IEEE/CVF conference on Computer
Vision and Pattern Recognition (CVPR). pp. 16000–16009 (2022)
13. He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.: Momentum contrast for unsupervised
visual representation learning. In: Proceedings of the IEEE/CVF conference on
Computer Vision and Pattern Recognition (CVPR) (2020)
14. Karlbauer, M., Praditia, T., Otte, S., Oladyshkin, S., Nowak, W., Butz, M.V.:
Composing partial differential equations with physics-aware neural networks. In:
Proceedings of the International Conference on Machine Learning (ICML) (2022)
15. Kovachki, N., Li, Z., Liu, B., Azizzadenesheli, K., Bhattacharya, K., Stuart, A.,
Anandkumar, A.: Neural operator: learning maps between function spaces with
applications to PDEs. Journal of Machine Learning Research 24(89), 1–97 (2023)
16. Lee-Thorp, J., Ainslie, J., Eckstein, I., Ontanon, S.: FNet: mixing tokens with
Fourier transforms. arXiv preprint arXiv:2105.03824 (2021)
17. Li, J., Li, D., Savarese, S., Hoi, S.: Blip-2: Bootstrapping language-image pre-
training with frozen image encoders and large language models. In: Proceedings
of the International Conference on Machine Learning (ICML). pp. 19730–19742
(2023)
18. Li, Z., Meidani, K., Farimani, A.B.: Transformer for partial differential equations’
operator learning. arXiv preprint arXiv:2205.13671 (2022)
19. Li, Z., Shu, D., Barati Farimani, A.: Scalable Transformer for PDE surrogate
modeling. In: Advances in Neural Information Processing Systems (NeurIPS) (2024)
20. Li, Z., Huang, D.Z., Liu, B., Anandkumar, A.: Fourier neural operator with learned
deformations for PDEs on general geometries. Journal of Machine Learning Research
24(388), 1–26 (2023)
21. Li, Z., Kovachki, N., Azizzadenesheli, K., Liu, B., Bhattacharya, K., Stuart, A.,
Anandkumar, A.: Fourier neural operator for parametric partial differential equa-
tions. arXiv preprint arXiv:2010.08895 (2020)
22. Li, Z., Zheng, H., Kovachki, N., Jin, D., Chen, H., Liu, B., Azizzadenesheli, K.,
Anandkumar, A.: Physics-informed neural operator for learning partial differential
equations. ACM/IMS Journal of Data Science 1(3), 1–27 (2024)
23. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: Proceedings
of the International Conference on Learning Representations (ICLR) (2018)
24. Lu, L., Jin, P., Karniadakis, G.E.: DeepONet: learning nonlinear operators for
identifying differential equations based on the universal approximation theorem of
operators. arXiv preprint arXiv:1910.03193 (2019)
25. McCabe, M., Blancard, B.R.S., Parker, L.H., Ohana, R., Cranmer, M., Bietti, A.,
Eickenberg, M., Golkar, S., Krawezik, G., Lanusse, F., et al.: Multiple physics
pretraining for physical surrogate models. arXiv preprint arXiv:2310.02994 (2023)
26. Pathak, J., Subramanian, S., Harrington, P., Raja, S., Chattopadhyay, A., Mardani,
M., Kurth, T., Hall, D., Li, Z., Azizzadenesheli, K., et al.: FourCastNet: a global
data-driven high-resolution weather model using adaptive Fourier neural operators.
arXiv preprint arXiv:2202.11214 (2022)

27. Praditia, T., Karlbauer, M., Otte, S., Oladyshkin, S., Butz, M.V., Nowak, W.:
Learning groundwater contaminant diffusion-sorption processes with a finite volume
neural network. Water Resources Research 58(12), e2022WR033149 (2022)
28. Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G.,
Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from
natural language supervision. In: Proceedings of the International Conference on
Machine Learning (ICML) (2021)
29. Raissi, M., Perdikaris, P., Karniadakis, G.E.: Physics-informed neural networks:
a deep learning framework for solving forward and inverse problems involving
nonlinear partial differential equations. Journal of Computational Physics 378,
686–707 (2019)
30. Rao, C., Ren, P., Wang, Q., Buyukozturk, O., Sun, H., Liu, Y.: Encoding physics
to learn reaction–diffusion processes. Nature Machine Intelligence 5(7), 765–779
(2023)
31. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution
image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition (CVPR) (2022)
32. Smith, L.N., Topin, N.: Super-convergence: very fast training of neural networks
using large learning rates. In: Artificial Intelligence and Machine Learning for
Multi-Domain Operations Applications (2019)
33. Takamoto, M., Praditia, T., Leiteritz, R., MacKinlay, D., Alesiani, F., Pflüger, D.,
Niepert, M.: PDEBench: an extensive benchmark for scientific machine learning.
Advances in Neural Information Processing Systems (NeurIPS) 35, 1596–1611
(2022)
34. Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.A., Lacroix, T.,
Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: open and efficient
foundation language models. arXiv preprint arXiv:2302.13971 (2023)
35. Tran, A., Mathews, A., Xie, L., Ong, C.S.: Factorized Fourier neural operators.
arXiv preprint arXiv:2111.13802 (2021)
36. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser,
Ł., Polosukhin, I.: Attention is all you need. In: Advances in Neural Information
Processing Systems (NeurIPS) (2017)
37. Wang, S., Wang, H., Perdikaris, P.: Learning the solution operator of parametric
partial differential equations with physics-informed DeepONets. Science Advances
7(40), eabi8605 (2021)
38. Wang, T., Wang, C.: Latent neural operator for solving forward and inverse PDE
problems. arXiv preprint arXiv:2406.03923 (2024)
39. Wen, G., Li, Z., Azizzadenesheli, K., Anandkumar, A., Benson, S.M.: U-FNO–an
enhanced Fourier neural operator-based deep-learning model for multiphase flow.
Advances in Water Resources 163, 104180 (2022)
40. Wen, G., Li, Z., Long, Q., Azizzadenesheli, K., Anandkumar, A., Benson, S.M.:
Real-time high-resolution CO2 geological storage prediction using nested Fourier
neural operators. Energy & Environmental Science 16(4), 1732–1741 (2023)
41. Wu, H., Luo, H., Wang, H., Wang, J., Long, M.: Transolver: a fast Transformer
solver for PDEs on general geometries. arXiv preprint arXiv:2402.02366 (2024)
42. Yang, Z., Yu, C.H., Buehler, M.J.: Deep learning model to predict complex stress
and strain fields in hierarchical composites. Science Advances 7(15), eabd7416
(2021)
43. Ye, Z., Huang, X., Chen, L., Liu, H., Wang, Z., Dong, B.: PDEformer: Towards a
foundation model for one-dimensional partial differential equations. arXiv preprint
arXiv:2402.12652 (2024)
