0% found this document useful (0 votes)
61 views

Self-adaptive physics-informed neural networks

Uploaded by

lkangwk
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
61 views

Self-adaptive physics-informed neural networks

Uploaded by

lkangwk
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 23

Journal of Computational Physics 474 (2023) 111722

Contents lists available at ScienceDirect

Journal of Computational Physics


journal homepage: www.elsevier.com/locate/jcp

Self-adaptive physics-informed neural networks


Levi D. McClenny, Ulisses M. Braga-Neto ∗
Texas A&M University, United States of America

a r t i c l e i n f o a b s t r a c t

Article history: Physics-Informed Neural Networks (PINNs) have emerged recently as a promising appli-
Received 3 April 2022 cation of deep neural networks to the numerical solution of nonlinear partial differential
Received in revised form 16 September equations (PDEs). However, it has been recognized that adaptive procedures are needed to
2022
force the neural network to fit accurately the stubborn spots in the solution of “stiff” PDEs.
Accepted 21 October 2022
Available online 14 November 2022
In this paper, we propose a fundamentally new way to train PINNs adaptively, where the
adaptation weights are fully trainable and applied to each training point individually, so
Keywords: the neural network learns autonomously which regions of the solution are difficult and is
Physics-informed neural networks forced to focus on them. The self-adaptation weights specify a soft multiplicative soft atten-
Scientific machine learning tion mask, which is reminiscent of similar mechanisms used in computer vision. The basic
Numerical methods for PDE idea behind these SA-PINNs is to make the weights increase as the corresponding losses
increase, which is accomplished by training the network to simultaneously minimize the
losses and maximize the weights. In addition, we show how to build a continuous map of
self-adaptive weights using Gaussian Process regression, which allows the use of stochastic
gradient descent in problems where conventional gradient descent is not enough to pro-
duce accurate solutions. Finally, we derive the Neural Tangent Kernel matrix for SA-PINNs
and use it to obtain a heuristic understanding of the effect of the self-adaptive weights
on the dynamics of training in the limiting case of infinitely-wide PINNs, which suggests
that SA-PINNs work by producing a smooth equalization of the eigenvalues of the NTK ma-
trix corresponding to the different loss terms. In numerical experiments with several linear
and nonlinear benchmark problems, the SA-PINN outperformed other state-of-the-art PINN
algorithm in L2 error, while using a smaller number of training epochs.
© 2022 Elsevier Inc. All rights reserved.

1. Introduction

As part of the burgeoning field of scientific machine learning [1], physics-informed neural networks (PINNs) have
emerged recently as an alternative to traditional numerical methods for partial different equations (PDE) [2–5]. Typical
data-driven deep learning methodologies do not take into account physical understanding of the problem domain. The PINN
approach is based on a strong physics prior that constrains the output of a deep neural network by means of a system of
PDEs. The potential of using neural networks as universal function approximators to solve PDEs had been recognized since
the 1990’s [6]. However, PINNs promise to take this approach to a different level by using deep neural networks, which is
made possible by the vast advances in computational capabilities and training algorithms since that time [7,8], as well as
the availability of automatic differentiation methods [9,10].

* Corresponding author.
E-mail address: [email protected] (U.M. Braga-Neto).

https://doi.org/10.1016/j.jcp.2022.111722
0021-9991/© 2022 Elsevier Inc. All rights reserved.
L.D. McClenny and U.M. Braga-Neto Journal of Computational Physics 474 (2023) 111722

A great advantage of PINNs over traditional time-stepping PDE solvers is that it is possible to obtain the solution over
the entire spatial-temporal domain at once, using training points distributed irregularly across the domain, obviating the
need of constructing computationally-expensive grids. In addition, the PINN solution defines a function over the continuous
domain, rather that a discrete solution on a grid as in traditional methods. Finally, PINNs allow sample data assimilation in
a natural and efficient way.
The continuous PINN algorithm proposed in [2], henceforth referred to as the “baseline PINN” algorithm, is effective at
estimating solutions that are reasonably smooth with simple boundary conditions, such as the viscous Burgers, Poisson and
Schrödinger PDEs. On the other hand, it has been observed that the baseline PINN has convergence and accuracy problems
when solving “stiff” PDEs [11], with solutions that contain sharp space transitions or fast time evolution [12]. This is known
to be the case, for example, when attempting to solve the nonlinear Allen-Cahn equation with the baseline PINN [4]. As we
will see in this paper, this may occur even in the case of the linear wave and advection PDEs.
This paper introduces Self-Adaptive PINNs (SA-PINNs), a fundamentally new method to train PINNs adaptively, which
addresses the issues mentioned previously. SA-PINNs applies trainable weights on each training point, in a way that is
reminiscent of soft multiplicative attention masks used in computer vision [13,14]. The adaptation weights are trained
concurrently with the network weights. As a result, initial, boundary or residue points in difficult regions of the solution are
automatically weighted more in the loss function, forcing the approximation to improve on those points. The basic principle
in SA-PINNs is to make the weights increase as the corresponding losses do, which is accomplished by training the network
to simultaneously minimize the losses and maximize the weights, i.e., to find a saddle point in the cost surface.
We also propose a methodology to build a continuous map of self-adaptive weights based on Gaussian Process regression,
in order to allow the use of stochastic gradient descent in training self-adaptive PINNs. This is illustrated by application to
a 1-D wave PDE that is challenging to non-SGD training.
Finally, we derive the Neural Tangent Kernel matrix for self-adaptive PINNs and use it to obtain a heuristic understanding
of the effect of the self-adaptive weights on the dynamics of training in the limiting case of infinitely-wide PINNs. We
examine the effect of the self-adaptive weights on the eigenvalues of the NTK matrix in the solution of a linear advection
PDE, and observe that it not only equalizes the magnitudes between the different loss components, but also smooths the
shape of the distribution of eigenvalues. This provides preliminary theoretical justification of the success of self-adaptive
PINNs.
Comprehensive experimental results presented throughout the test, based on several well-known benchmarks, show that
self-adaptive PINNs can solve “stiff” PDEs with significantly better accuracy than other state-of-the-art PINN algorithms,
while using a smaller number of training epochs.

2. Background

2.1. Physics-informed neural networks

Consider the initial-boundary value problem:

Nx,t [u (x, t )] = f (x, t ) , x ∈  , t ∈ (0, T ] , (1)


Bx,t [u (x, t )] = g (x, t ) , x ∈ ∂ , t ∈ (0, T ] , (2)
u ( x , 0) = h ( x ) , x ∈  . (3)

Here, the domain  ⊂ R d in a open set,  is its closure, u :  × [0, T ] → R is the desired solution, x ∈  is a spatial
vector variable, t is time, and Nx,t and Bx,t are spatial-temporal differential operators. The problem data is provided by the
forcing function f :  → R, the boundary condition function g : ∂  × (0, T ], and the initial condition function h :  → R.
Additionally, sensor data in the interior of the domain may be available. In any case, we assume that the data are sufficient
and appropriate for a well-posed problem. Time-independent problems and other types of data can be handled similarly, so
we will use the equations (1)-(3) as a model.
Following [2], let u (x, t ) be approximated by the output u (x, t ; w ) of a deep neural network with inputs x and t
(in the case of a PDE system, this would be a neural network with multiple outputs). The value of Nx ,t [u (x, t ; w )] and
Bx,t [u (x, t ; w )] can be computed quickly and accurately using reverse-mode automatic differentiation [9,10].
The network weights w are trained by minimizing a loss function that penalizes the output for not satisfying (1)-(3):

L( w ) = Ls ( w ) + Lr ( w ) + Lb ( w ) + L0 ( w ) , (4)

where Ls is the loss term corresponding to sample data (if any), while Lr , Lb , and L0 are loss terms corresponding to not
satisfying the PDE (1), the boundary condition (2), and the initial condition (3), respectively:

1
Ns
Ls ( w ) = |u (xis , t si ; w ) − y is |2 , (5)
2
i =1

2
L.D. McClenny and U.M. Braga-Neto Journal of Computational Physics 474 (2023) 111722

1
Nr
Lr ( w ) = |Nx,t [u (xri , tri ; w )] − f (xri , tri )|2 , (6)
2
i =1

1
Nb
Lb ( w ) = |Bx,t [u (xbi , tbi ; w )] − g (xbi , tbi )|2 , (7)
2
i =1

1
N0
L0 ( w ) = |u (x0i , 0; w ) − h(x0i )|2 , (8)
2
i =1
N N N
where {xis , t si , y is }i =s 1 are sensor data (if any), {x0i }i =01 are initial condition points, {xbi , tbi }i =b1 are boundary condition points,
{xri , tri }iN=r 1
are residue (“collocation”) points randomly distributed in the domain , and N s , N 0 , N b and N r denote the total
number of sensor, initial, boundary, and residue points, respectively. The network weights w can be tuned by minimizing
the total training loss L( w ) via standard gradient descent procedures used in deep learning.

2.2. Related work

The baseline PINN algorithm described in the previous section, though remarkably successful in the solution of many
linear and nonlinear PDEs, can produce inaccurate approximations, or fail to converge entirely, in the solution of certain
“stiff” PDEs. A large amount of evidence has accumulated indicating that this happens due to the shortcomings of gradient
descent applied to the multi-part or multi-objective loss function (4); e.g., see [4,15,12,5]. This occurs because gradient
descent is a greedy procedure that may latch on some of the components at the expense of others, which creates imbalance
in the rate of descent among the different loss components and prevents convergence to the correct solution. The standard
approach in the literature of PINNs to try to correct the imbalance is the introduction of weights in (4):

L( w ) = λs Ls ( w ) + λr Lr ( w ) + λb Lb ( w ) + λ0 L0 ( w ) . (9)
Several methods, from very simple to complex, have been advanced to set the values of these weights; we mention a few
below.

Nonadaptive weighting. In [4], it was pointed out that a premium should be put on forcing the neural network to satisfy
the initial conditions closely, especially for PDEs describing time-irreversible processes, where the solution has to be ap-
proximated well early. Accordingly, a loss function of the form L(θ) = Lr (θ) + Lb (θ) + C L0 (θ) was suggested, where C  1
is a hyperparameter.

Learning rate annealing. In [12], it is argued that the optimal value of the weight C in the previous scheme may vary
wildly among different PDEs so that choosing its value would be difficult. Instead they propose to use weights that are
tuned during training using statistics of the backpropagated gradients of the loss function. It is noteworthy that the weights
themselves are not adjusted by backpropagation. Instead, they behave as learning rate coefficients, which are updated after
each epoch of training.

Adaptive resampling. In [4], a strategy to adaptively resample the residual collocation points based on the magnitude of the
residual is proposed. While this approach improves the approximation, the training process must be interrupted and the
MSE evaluated on the residual points to deterministically resample the ones with the highest error. After each resampling
step, the number of residual points grows, increasing computational complexity. In [16], resampling of the collocation points
for solving the steady-state Fokker-Planck PDE is performed using an approximate density function, while in [17], this work
is extended to time-evolution problems by means of an adaptive density approximation method based on normalizing flows.

Neural tangent kernel (NTK) weighting. Recently, [5] derived the NTK kernel matrix for PINNs, and used a heuristic argument
to set the weights adaptively based on the evolution of the eigenvalues of the NTK matrix during training.

Mimimax weighting. In [18], a methodology was proposed to update the weights during training using gradient descent
for the network weights, and gradient ascent for the loss weights, seeking to find a saddle point in weight space. Loss
components that do not decrease are assigned larger weights.
More generally, the need to use weighting to correct imbalance in multi-part loss functions has been recognized in
the general deep learning literature [19–21]. Note that the multi-part loss (9) corresponds to a linear scalarization of this
multiple-objective problem [22].
All the previous methods employ a linearly-scalarized function such as (9), the only difference among them being the
way the weights are updated. The self-adaptive weighting method proposed in this paper is fundamentally different in that
the weights apply to individual training points in the different loss components, rather than the entire loss component.
The previous methods can be seen as a special case of this, when all self-adaptive weights for a particular loss component

3
L.D. McClenny and U.M. Braga-Neto Journal of Computational Physics 474 (2023) 111722

are updated in tandem. Among the previous methods, the independently-developed Minimax weighting scheme [18] is the
closest to SA-PINNs, as it also updates its weights via gradient ascent; however, these weights still apply to the whole loss
components. This paper presents empirical and theoretical evidence that having the flexibility of weighting each training
point in the various loss terms brings additional flexibility that can lead to better performance.

3. Self-adaptive physics-informed neural networks

While previously proposed weighting methods produce improvements in stability and accuracy over the baseline PINN,
they are either nonadaptive or introduce inflexible adaptation. Here we propose a simple procedure that applies fully-
trainable weights to produce a multiplicative soft attention mask, in a manner that is reminiscent of attention mechanisms
used in computer vision [13,14]. Instead of hard-coding weights at particular regions of the solution, the proposed method
is in agreement with the neural network philosophy of self-adaptation, where the weights in the loss function are updated
by gradient descent side-by-side with the network weights.
Using the PDE in (1)-(3) as reference, the proposed self-adaptive PINN utilizes the following loss function

L( w , λr , λb , λ0 ) = Ls ( w ) + Lr ( w , λr ) + Lb ( w , λb ) + L0 ( w , λ0 ) , (10)
N
where λr = (λr1 , . . . , λrNr ),
λb = and λ0 =
(λb1 , . . . , λb b ), (λ10 , . . . , λ0N 0 ) are trainable, nonnegative self-adaptation weights for the
initial, boundary, and residue points, respectively, and

1
Nr
Lr ( w , λr ) = m(λri ) |Nx,t [u (xri , tri ; w )] − f (xri , tri )|2 (11)
2
i =1

1
Nb
Lb ( w , λb ) = m(λbi ) |Bx,t [u (xri , tri ; w )] − g (xbi , tbi )|2 (12)
2
i =1

1
N0
L0 ( w , λ0 ) = m(λ0i ) |u (x0i , 0; w ) − h(x0i )|2 , (13)
2
i =1

where the self-adaptation mask function m(λ) defined on [0, ∞) is a nonnegative, differentiable on (0, +∞), strictly increas-
ing function of λ. A key feature of self-adaptive PINNs is that the loss L( w , λr , λb , λ0 ) is minimized with respect to the
network weights w, as usual, but is maximized with respect to the self-adaptation weights λr , λb , λ0 . The corresponding
gradient descent/ascent steps are:

w k +1 = w k − ηk ∇ w L( w k , λkr , λkb , λk0 ) (14)


λkr +1 = λkr +ρ k k k k k
r ∇λr L( w , λr , λb , λ0 ) (15)
λkb+1 = λkb +ρ k k k k k
b ∇λb L( w , λr , λb , λ0 ) (16)
λk0+1 = λk0 +ρ k k k k k
0 ∇λ0 L( w , λr , λb , λ0 ) , (17)

where ηk > 0 is the learning rate for the neural network weights at step k, ρ kp > 0 is a separate learning rate for the
self-adaption weights, for p = r , b, 0, and
⎡  2 ⎤
m (λr ) Nx,t [u (xri , tri ; w k )] − f (xr1 , tr1 )
k,1
1⎢ ···

∇λr L( w k , λkr , λkb , λk0 ) = ⎢ 

2 ⎦ , (18)
2⎣ k, N r  Nr Nr 
m (λr ) Nx,t [u (xr , tr ; w )] − f (xr , tr )
i i k

⎡  2 ⎤
m (λb ) Bx,t [u (xbi , tbi ; w k )] − g (xb1 , tb1 )
k,1
1⎢ ···

∇λb L( w k , λkr , λkb , λk0 ) = ⎢ 

2 ⎦ , (19)
2⎣ k, N b  Nb Nb 
m (λb ) Bx,t [u (xb , tb ; w )] − g (xb , tb )
i i k

⎡  2 ⎤
m (λ0 ) u (x10 , 0; w k ) − h(x10 , t 01 )
k,1
1⎢ ···

∇λ0 L( w k , λkr , λkb , λk0 ) = ⎢⎣  
⎥.
⎦ (20)
2 k, N 0   2
N
m (λ0 ) u (x0i , 0; w k )] − h(x0 0 )

Hence, since m (λ) > 0 (the mask function is strictly increasing, by assumption), then ∇λr L, ∇λb L, ∇λ0 L ≥ 0, and any gra-
dient component is zero if and only if the corresponding unmasked loss is zero. This shows that the sequences of weights
{λkr ; k = 1, 2, . . .}, {λkb ; k = 1, 2, . . .}, {λk0 ; k = 1, 2, . . .} (and the associated mask values) are monotonically increasing, provided

4
L.D. McClenny and U.M. Braga-Neto Journal of Computational Physics 474 (2023) 111722

Fig. 1. Mask function examples. From the upper left to the bottom right: polynomial mask, q = 2; polynomial mask, q = 4; smooth logistic mask; sharp
logistic mask.

that the corresponding unmasked losses are nonzero. Furthermore, the magnitude of the gradients ∇λr L, ∇λb L, ∇λ0 L, and
therefore of the updates, are larger if the corresponding unmasked losses are larger. In addition, the magnitude of updates
can be controlled by specifying a schedule for the learning rates ρ kp , for p = r , b, 0, adding extra flexibility. This progressively
penalizes the network more for not fitting the residual, boundary, and initial points closely.
We remark that any of the weights can be set to fixed, non-trainable values, if desired. For example, by setting λkb ≡ 1,
only the weights of the initial and residue points would be trained. The sensor data loss is not masked in this formulation,
since if these data consist of noisy observations, weighting them requires extra care to avoid overfitting (though this remains
an open research problem).
Notice that the self-adaptive weights need to be initialized at the beginning of training. They could be initialized to 1
(no weighting) or to a different value, depending on the problem. They could also be initialized randomly over an interval,
similarly as to how neural network weights are often initialized. Here, prior knowledge plays an important role; e.g., if it
is known that the initial conditions in a problem are hard to fit, then the initial condition weights could be initialized to a
larger value than the other weights (alternatively, it could be initialized at the same value as the other weights, but employ
a larger learning rate).
The shape of the function g affects mask sharpness and training of the PINN. Examples include polynomial masks
m(λ) = c λq , for c , q > 0, and sigmoidal masks. See Fig. 1 for a few examples. In practice, the polynomial mask functions
have to be kept below a suitable (large) value, to avoid numerical overflow. The sigmoidal masks do not have this issue,
and can be used to produce sharp masks. For example, in the bottom right example in Fig. 1, the mask is essentially binary;
it starts small for small starting values of the self-adaptive weight λ, and after these exceed a certain threshold, the mask
value will quickly take on the upper saturation value. Similarly to neural network nonlinearities, sigmoid mask functions
can suffer from vanishing gradients during training. This is particularly a problem at the lower starting value. Therefore,
excessively sharp sigmoidal mask functions should be avoided.
The gradient ascent/descent step can be implemented easily using off-the-self neural network software, by simply flip-
ping the sign of ∇λr L, ∇λb L, and ∇λ0 L. In our implementation of SA-PINNs, we use Tensorflow 2.3 with a fixed number of
iterations of Adam [23]. In some case, these are followed by another fixed number of iterations of the L-BFGS quasi-newton
method [24]. This is consistent with the baseline PINN formulation in [2], as well as follow-up literature [4]. However, the
adaptive weights are only updated in the Adam training steps, and are held constant during L-BFGS training, if any. A full
implementation of the methodology described here has been made publicly available by the authors1 and it is included in
the open-source software TensorDiffEq [25]. Finally, we remark that there are some similarities between SA-PINN training
and penalty methods in optimization, which introduce a sequence of increasing penalty costs [26].

1
https://github.com/levimcclenny/SA-PINNs.

5
L.D. McClenny and U.M. Braga-Neto Journal of Computational Physics 474 (2023) 111722

Fig. 2. High-fidelity (left) vs. predicted (right) solutions for the viscous Burgers PDE. (For interpretation of the colors in the figure(s), the reader is referred
to the web version of this article.)

4. Numerical examples

In this section we present numerical experiments demonstrating the SA-PINN performance on various benchmarks. The
main figure of merit used is the L2-error:

NU
i =1
|u ( xi , t i ) − U (xi , t i )|2
L 2 error = , (21)
NU
i =1
|U (xi , t i )|2
where u (x, t ) is the trained approximation, and U (x, t ) is a high-fidelity solution over a fine mesh {xi , t i } containing N U
points. In all cases below, we repeat the training process over 10 random restarts and report the average L2 error and its
standard deviation.

4.1. Viscous Burgers equation

The viscous Burgers PDE considered here is

ut + uu x − (0.01/π )u xx = 0 , x ∈ [−1, 1], t ∈ [0, 1] , (22)


u (0, x) = − sin(π x) , (23)
u (t , −1) = u (t , 1) = 0 . (24)

All results for the viscous Burgers PDE were generated from a fully-connected network with input layer size 2 corre-
sponding to the x and t inputs, 8 hidden layers of 20 neurons each, and an output layer of size 1 corresponding to the
output of the approximation u (x, t ). This directly mimics the setup of the viscous Burgers PDE result presented in [2]. All
training is done for 10k iterations of Adam, followed by 10k iterations of L-BFGS to fine tune the network weights, consis-
tent with related work. Additionally, the number of points selected for the trials shown are N 0 = 100, N b = 200, and N r
= 10000. Training with this architecture took 96 ms/iteration on a single Nvidia V100 GPU. We initialize the self adaptive
weights on the IC and the residual points to be U (0, 1) and the learning rates for all self-adaptive weights were set to 5e−3.
We achieved an L2-error of 4.80e − 4 ± 1.01e − 4, which is smaller value than the 6.7e − 4 L2 error reported in [2], while
using only 20% as many training iterations and an identical neural network architecture. The high-fidelity and predicted
solutions are displayed in Fig. 2. Fig. 3 demonstrate the accuracy of the proposed approach, using a significantly shorter
training horizon than the baseline PINN.
Fig. 4 shows that the sharp discontinuity at x = 0 in the solution has correspondingly large weights, indicating that the
model must pay extra attention to those particular points in its solution, resulting in an increase in approximation accuracy
and training efficiency.

4.2. Helmholtz equation

The Helmholtz PDE is typically used to describe the behavior of wave and diffusion processes, and can be employed
to model evolution in a spatial domain or combined spatial-temporal domain. Here we study a particular Helmholtz PDE
existing only in the spatial (x, y ) domain, described as:

6
L.D. McClenny and U.M. Braga-Neto Journal of Computational Physics 474 (2023) 111722

Fig. 3. Top: predicted solution of the viscous Burgers PDE. Middle: Cross-sections of the approximated vs. actual solutions for various x-domain snapshots.
Bottom left: Residual r (x, t ) across the spatial-temporal domain. Bottom right: Absolute error between prediction and high-fidelity solution across the spatial-
temporal domain.

Fig. 4. Trained weights for residue points across the domain . Larger/brighter colored points correspond to larger weights.

7
L.D. McClenny and U.M. Braga-Neto Journal of Computational Physics 474 (2023) 111722

Fig. 5. Exact (left) vs. predicted (right) solutions for the Helmholtz PDE.

u xx + u y y + k2 u − q(x, y ) = 0 (25)
u (−1, y ) = u (1, y ) = u (x, −1) = u (x, 1) = 0 (26)
where x ∈ [−1, 1], y ∈ [−1, 1] and

q(x, y ) = − (a1 π )2 sin(a1 π x) sin(a2 π y )


− (a2 π )2 sin(a1 π x) sin(a2 π y )
+ k2 sin(a1 π x) sin(a2 π y ) w (27)
is a forcing term that results in a closed-form analytical solution

u (x, y ) = sin(a1 π x) sin(a2 π y ) . (28)


To allow a direct comparison to the Helmholtz PDE result reported in [12], we take a1 = 1 and a2 = 4 and use the same
neural network architecture with layer sizes [2, 50, 50, 50, 50, 1]. Our architecture is trained for 10k Adam and 10k L-
BFGS iterations, again keeping the self-adaptive mask weights constant through the L-BFGS training iterations and only
allowing those to train via Adam. We sample N b = 400 (100 points per boundary). Given the steady-state initialization and
constant forcing term, there is no applicable initial condition and consequently no N 0 . We create a mesh of size (1001,1001)
corresponding to the x ∈ [−1, 1], y ∈ [−1, 1] range, yielding 1,002,001 total mesh points, from which we select N r =100k
residue points. We initialize the self adaptive weights on the BC and the residual points to be U (0, 1) and the learning rates
for all self-adaptive weights were set to 5e-3.
We can see in Fig. 5 that the SA-PINN prediction is very accurate and indistinguishable from the exact solution, with an
L2 error of 3.2e − 3 ± 2.2e − 4. With a larger neural network, the results reported in [12] (row 5 of Table 2) are 1.4e − 1 for
the baseline PINN, and between 2.54e − 3 and 2.74e − 2 for various learning-rate annealing weighted schemes proposed in
that paper. We would add that the SA-PINN is trained for a smaller number of iterations (10k Adam and 10k L-BFGS) with
respect to that in [12] (40k Adam).
Fig. 6 shows individual cross-sections of the Helmholtz solution, demonstrating the SA-PINN’s ability to accurately ap-
proximate the sinusoidal solution on the whole domain. Fig. 7 shows that the Self-Adaptive PINN largely ignores the flat
areas in the solution, while focusing its attention on the nonflat areas.

4.3. Allen-Cahn reaction-diffusion equation

In this section, we report experimental results obtained with the Allen-Cahn PDE, which contrast the performance of the
proposed SA-PINN algorithm against the baseline PINN and two of the PINN algorithms mentioned in Section 2.2, namely,
the nonadaptive weighting and time-adaptive schemes (for the latter, Approach 1 in [4] was used).
The Allen-Cahn reaction-diffusion PDE is typically encountered in phase-field models, which can be used, for instance,
to simulate the phase separation process in the microstructure evolution of metallic alloys [27–29]. The Allen-Cahn PDE
considered here is specified as follows:

ut − 0.0001u xx + 5u 3 − 5u = 0 , x ∈ [−1, 1], t ∈ [0, 1] , (29)


2
u (x, 0) = x cos(π x) , (30)
u (t , −1) = u (t , 1) , (31)

8
L.D. McClenny and U.M. Braga-Neto Journal of Computational Physics 474 (2023) 111722

Fig. 6. Top predicted solution of Helmholtz equation. Bottom Cross-sections of the approximated vs. actual solutions for various x-domain snapshots.

Fig. 7. Self-learned weights after training via Adam for the Helmholtz system. Brighter/larger points correspond to larger weights.

9
L.D. McClenny and U.M. Braga-Neto Journal of Computational Physics 474 (2023) 111722

u x (t , −1) = u x (t , 1) . (32)
The Allen-Cahn PDE is an interesting benchmark for PINNs for multiple reasons. It is a “stiff” PDE that challenges PINNs to
approximate solutions with sharp space and time transitions, and is also introduces periodic boundary conditions (31), (32).
In order to deal with the latter, the boundary loss function Lb ( w , λb ) in (12) is replaced by

1 
Nb
Lb ( w , λb ) = g (λbi )(|u (1, tbi ) − u (−1, tbi )|2 + |u x (1, tbi ) − u x (−1, tbi )|2 ) (33)
Nb
i =1

The neural network architecture is fully connected with layer sizes [2, 128, 128, 128, 128, 1]. This architecture is identical
to the one used in the Allen-Cahn PDE result reported in [4], in order to allow a direct comparison of performance. We
set the number of residue, initial, and boundary points to N r = 20, 000, N 0 = 100 and N b = 100, respectively (due to the
periodic boundary condition, there are in fact 200 boundary points). Here we hold the boundary weights w bi at 1, while
the initial weights w 0i and residue weights w ri are trained. The initial and residue weights are initialized from a uniform
distribution in the intervals [0, 100] and [0, 1], respectively. Training took 65 ms/iteration on a single Nvidia V100 GPU.
Numerical results obtained with the SA-PINN are displayed in Fig. 8. The average L2-error across 10 runs with random
restarts was 2.1e −2 ± 1.21e −2, while the L2-error on 10 runs obtained by the time-adaptive approach in [4] was 8.0e −2 ±
0.56e −2. Neither the baseline PINN nor the nonadaptive weighted scheme, with initial condition weight C = 100, were able
to solve this PDE satisfactorily, with L2 errors 96.15e − 2 ± 6.45e − 2 and 49.61e − 2 ± 2.50e − 2, respectively (these numbers
matched almost exactly those reported in [4]).
Fig. 9 is unique to the proposed SA-PINN algorithm. It displays the trained self-adaptive weights for the residue points
across the spatio-temporal domain. These are the weights of the multiplicative soft attention mask self-imposed by the PINN.
This plot stays remarkably constant across different runs with random restarts, which is an indication that it is a property
of the particular PDE being solved. We can observe that in this case, more attention is needed early in the solution, but not
uniformly across the space variable. In [4], this observation was justified by the fact that the Allen-Cahn PDEs describes a
time-irreversible diffusion-reaction processes, where the solution has to be approximated well early. However, here this fact
is “discovered” by the SA-PINN itself.
In order to study the behavior of the SA-PINN more closely, we plot in Fig. 10 the average value of the residue weights
from various partitions of the solution domain. While all the weights are increasing, as must be the case since the mask
function is required to be monotone, the rate of increase is of importance. Notice that the initial condition weights grow
much faster than the residue weights, as expected, since the initial condition tends to be neglected by the PINN, otherwise.
As for the residue weights, we see that, for small values of t, they increase faster than for large values of t. This shows
that the SA-PINN has learned that the early part of the evolution is the most critical part of the solution. (This agrees with
what was seen in the map of Fig. 9.) In contrast with traditional time-marching approaches, where earlier time steps are
solved prior to later ones, the SA-PINN solves the PDE over the entire space-time domain at once; however, the self-adaptive
weights allow it to concentrate on the early part of the evolution.
Finally, Fig. 12 displays the training loss for the baseline PINN and SA-PINN as a function of training iteration. For the
SA-PINN, the weights were removed from the loss value to provide a direct comparison to the baseline. These plots are
generated from 10 random restarts of the SA and baseline PINN training cycles over 10k Adam training iterations with
consistent learning rates. We can see that the SA-PINN achieves significantly lower initial condition loss than the baseline
PINN for the initial. Indeed, this is the major issue faced by the baseline PINN in the AC problem. As for the residual loss,
we see that the baseline PINN decreases it fast (at the expense of the initial condition loss), but that eventually the SA PINN
is able to achieve a loss two orders of magnitude smaller. The oscillatory behavior of the residue loss in the SA-PINN reveals
the dynamics of the competing self-adaptive weighted initial condition and residue loss terms.

4.4. 2D Burgers equation

Here we demonstrate the efficacy of SA-PINN in the solution of a three-dimensional problem (two spatial dimensions
plus time), namely, a 2D viscous Burgers nonlinear PDE system with velocity fields u (x, y , t ) and v (x, y , t ), which satisfy:

ut + uu x + vu y = ν (u xx + u y y ) , (34)
v t + uv x + v v y = ν ( v xx + v y y ) , (35)
in the domain (x, y , t ) ∈ (0, 1)3 , where the ν is the kinematic viscosity (in this example, ν = 0.002). Any velocity fields such
that u (x, y , t ) + v (x, y , t ) is constant provide a solution of this PDE system. Following [30], we take:
−1
3 1 −4x + 4 y − t
u (x, y , t ) = − 1 + exp , (36)
4 4 32ν
−1
3 1 −4x + 4 y − t
v (x, y , t ) = + 1 + exp , (37)
4 4 32ν

10
L.D. McClenny and U.M. Braga-Neto Journal of Computational Physics 474 (2023) 111722

Fig. 8. Top: Plot of the approximation u (x, t ) via the SA-PINN. Middle: Snapshots of the approximation u (x, t ) vs. the high-fidelity solution U (x, t ) at various
time points through the temporal evolution. Bottom left: Residual r (x, t ) across the spatial-temporal domain. As expected, it is close to 0 for the whole
domain . Bottom right: Absolute error between approximation and high-fidelity solution across the spatial-temporal domain.

with matching initial and Dirichlet boundary conditions. We apply the SA-PINN without enforcing the boundary conditions,
i.e., only the initial conditions are enforced. Despite this, the SA-PINN was able to capture the attenuated shock effectively as
shown in Fig. 11. These results were obtained with 35k Adam iterations at a learning rate of 1e − 5, followed by 20k L-BFGS
iterations to fine-tune the solution. Training of the baseline PINN for the same amount of iterations was not successful, as
the L-BFGS optimizer failed consistently to converge. Notably, SA-PINN appears to stabilize the training process, allowing
the L-BFGS optimizer to converge.

4.5. Self-adaptive weight training hyperparameters

We comment here on the choice of hyperparameter settings made in the experiments reported in the previous sections.
In all experiments, a constant learning rate of 5e-3 for the gradient ascent of the self-adaptive learning rates, and Adam
optimization was used. Better results could potentially be obtained by using learning rate scheduling.
Empirically, we observed that effective training strategies for self-adaptive PINNs tend to require smaller values of learn-
ing rate for the neural network weights (i.e., 1e-5), and larger values for the self-adaptive weights (i.e., 1e-3 to 1e-1).

11
L.D. McClenny and U.M. Braga-Neto Journal of Computational Physics 474 (2023) 111722

Fig. 9. Learned self-adaptive weights across the spatio-temporal domain. Brighter colors and larger points indicate larger weights.

Fig. 10. Average learned residue weights across various partitions of the solution domain. Note that earlier times require heavier weighting, with the highest
average weights being the initial condition weights. This is consistent with the rationale that earlier solutions must be correctly learned for time-diffusive
processes.

As specified in Section 4.3, the results shown in Fig. 8 employ random initialization of the initial condition and residue
self-adaptive weights in the intervals [0, 100] and [0, 1], respectively, and the learning rates are held constant and equal. This
choice is dictated by the prior knowledge that heavier weighting of the initial condition is needed in the AC problem [4].
On the other hand, in the results displayed in Fig. 10, all weights were initialized randomly in the interval [0, 1]. In this
case, it is observed that the initial conditions increase faster on their own, even though all learning rates are held constant
and equal.

5. Self-adaptive PINNs with stochastic gradient descent

Stochastic gradient descent (SGD) [31] uses randomly sampled subsets of the training data to compute approximations
to the gradient for training neural networks by gradient descent [32]. It has been claimed that the empirical superior
performance of stochastic gradient descent over large-batch training is due to a tendency of the latter to converge to
“sharp” minima in the loss surface, which have poor performance, while SGD with small batches converge to better “flat”
minima [33].
The issue has not been well studied in the context of PINNs at the time of writing, though there is some empirical
evidence that SGD can indeed improve the L 2 performance of PINNs with some PDEs. It should be pointed out that PINNs

12
L.D. McClenny and U.M. Braga-Neto Journal of Computational Physics 474 (2023) 111722

Fig. 11. SA-PINN solution of 2D Burgers equation example. The L2 error for both u and v at t = 1/3 and t = 2/3 is approximately 3e − 3.

are well-suited to SGD since a new set of residue, initial and boundary points can be sampled each time rather than
subsampling a given set of training data points as in conventional machine learning.
The baseline SA-PINN algorithm described previously cannot take advantage of small-batch SGD since the self-adaptive
weights are attached to specific training points. In this section, we examine an extension of SA-PINN that allows the use of
SGD. The basic idea is to use a spatial-temporal predictor of the value of self-adaptive weights for the newly sampled points.
Here we use standard Gaussian process regression due to its predictive power. (However, simpler regression approaches
could be equally used, in cases where GP regression is unwieldy, e.g., due to large sample size.)
A problem where SGD has been found empirically to have a strong impact is the 1D wave equation:

utt (x, t ) − 4u xx (x, t ) = 0 , x ∈ [0, 1] , t ∈ [0, 1] , (38)


u (0, t ) = 0 , u (1, t ) = 0, t ∈ [0, 1] , (39)
ut (x, 0) = 0 , x ∈ [0, 1] , (40)
1
u (x, 0) = sin(π x) + sin(4π x) , x ∈ [0, 1] . (41)
2
This problem was considered in [5] to study their NTK weighting scheme. The problem has an analytical solution:
1
u (x, t ) = sin(π x) cos(2π t ) + sin(4π x) cos(8π t ) , x ∈ [0, 1] , t ∈ [0, 1] . (42)
2
The baseline PINN struggles in this problem due to its stiffness. Here, we investigate the improvement provided by
SGD, fixed weights, and self-adaptive weights. The architecture of the neural network consists of 5 layers of 500 neurons
each with the tanh nonlinearity, and the number of residue, initial, and boundary points were set to 300, 100, and 100,
respectively (these are the same hyperparameters used in [5]). The small sample sizes are appropriate to study the impact
of SGD.
In all experiments, the learning rate for the neural network weights is kept fixed at 10−5 for a total of 80,000 iterations.
The self-adaptive weights are all initialized to 1.0, with learning rate 0.01 for the residue points, 0.05 for the initial condition
on ut , and 0.25 for all other initial and boundary conditions on u. In the fixed-weight experiment, the weights were kept

13
L.D. McClenny and U.M. Braga-Neto Journal of Computational Physics 474 (2023) 111722

Fig. 12. Average values for the initial condition loss, residue loss, and total loss, over 10k Adam training iterations. For the SA-PINN, the weights were
removed from the loss value to provide a direct comparison to the baseline.

constant at 1.0 for the residue points, 5.0 for the initial condition on ut , and 50.0 for all other initial and boundary conditions
on u. These values make the final average values taken by the self-adaptive weights at the end of training approximately
match the fixed weights. SGD is applied by resampling all training points every 100 iterations. The GPs were trained using
fixed hyperparameters (no automatic tuning is performed).

14
L.D. McClenny and U.M. Braga-Neto Journal of Computational Physics 474 (2023) 111722

Table 1
Wave PDE results. The L2 error mean and standard deviation are based on 10 indepen-
dent runs. The training time is an average over the 10 runs.

PINN method No SGD SGD

L2 error time (sec) L2 error time (sec)

baseline 0.3792 ± 0.0162 879.97 0.4513 ± 0.0255 1057.04


fixed weights 0.7296 ± 0.1421 850.75 0.2079 ± 0.0624 1012.06
self-adaptive 0.8105 ± 0.1591 961.27 0.0295 ± 0.0070 1207.85

Results based on 10 independent random initializations of the neural network weights are displayed in Table 1. We can
observe that all methods fail in the absence of SGD. On the other hand, while SGD is not able to improve the performance of
the baseline PINN, it produces a significant improvement to the fixed-weight PINN, and a large improvement to the SA-PINN.
In fact, the SA-PINN achieves an average L2-error of 2.95%, which is an order of magnitude better than the fixed-weight
result. This L2-error is however larger than the one reported in [5]. Optimizations to the SGD process, including adaptive
tuning of the GP hyperparameters, will be part of future work, and are expected to improve performance.
These results can perhaps be better appreciated in the plots in Fig. 13. As an extra feature, the Gaussian Process predictor
produces a continuous self-adaptive weight map. Fig. 14 displays the self-adaptive weight GP maps for the initial condition
and residue points. We can observe in the 1D map that the self-adaptive weights become larger at the (high and low) peaks
of the initial condition u (x, 0), which is where the curvature is maximum in magnitude; these are the most difficult regions
to approximate with the neural network. In the 2D map, we can see that the self-adaptive weights are larger in the initial
time, once again indicating the importance of approximating the solution early in time-evolution problems.

6. Neural tangent kernel analysis of self-adaptive PINNs

In this section, we investigate the dynamics of SA-PINN training by studying its neural tangent kernel (NTK). We derive
the expression for the NTK matrix for self-adaptive PINNs and then use it to obtain a heuristic understanding of the effect
of the self-adaptive weights on the dynamics of training in the limiting case of infinitely-wide PINNs. We examine the effect
of the self-adaptive weights on the eigenvalues of the NTK matrix in the solution of a linear advection PDE.
First, note that (14) can be written as

w k +1 − w k
= −∇ w L( w k , λkr , λkb , λk0 ) . (43)
ηk
In the limit as the learning rate ηk tends to zero, the previous expression yields the gradient flow differential equation [34]:
dw (τ )
= −∇ w L( w (τ ), λr (τ ), λb (τ ), λ0 (τ )) , (44)

where τ ≥ 0 denotes the (continuous) training time. Notice that the usual gradient descent step corresponds to a forward
Euler discretization of (44). It follows that the properties of gradient descent optimization can be investigated by studying
this differential equation.
Under this vanishing learning-rate limit, the neural tangent kernel (NTK) [35] characterizes the training dynamics of the
neural network, i.e., the evolution of the output u (x, t ; w (τ )) as a function of training time τ . In [5], the NTK for PINNs was
derived and its properties were studied. Here we show how that a simple modification to their derivation produces the NTK
for SA-PINNs.
For definiteness, consider the PDE problem:

Nx [u (x)] = f (x, t ) , x ∈  , (45)


u ( x) = g ( x) , x ∈ ⊆ ∂ . (46)
For a time-evolution problem, t becomes one of the components of x, and the set typically includes an initial condition
at t = 0. More complex boundary conditions and sample data can be added to the analysis below in a straightforward way.
N N
Given residue points {xri }i =r 1 and boundary condition points {xbi }i =b1 , let the response vectors be

ur (τ ) = [ N x [u (xr1 ; w (τ ))], . . . , N x [u (xrN r ; w (τ ))]] T , (47)


N
ub (τ ) = [u (xb1 ; w ( τ )), . . . , u (xr b ; w (
τ ))] . T
(48)
Likewise, the data vectors are denoted by

vr = [ f (xr1 ), . . . , f (xrN r )] T , (49)


N
vb = [ g (xb1 ), . . . , g (xb b )]T . (50)

15
L.D. McClenny and U.M. Braga-Neto Journal of Computational Physics 474 (2023) 111722

Fig. 13. Top: Exact solution of the wave problem. Left: Approximations obtained without SGD. Right: Approximations with SGD. From top to bottom: baseline,
fixed weights, and self-adaptive weights.

16
L.D. McClenny and U.M. Braga-Neto Journal of Computational Physics 474 (2023) 111722

Fig. 14. Gaussian-Process maps of self-adaptive weights. Top: 1D map of initial condition weights. The red dots indicate the values of the weights at the
actual data locations. Bottom: 2D map of PDE residue weights.

Np Np
We write u p (τ ) = (u 1p (τ ), . . . , u p (τ )) and v p = ( v 1p , . . . , v p ) to identify the individual responses u ip (τ ) and data point v ip ,
for p = r , b.
The loss function at training time τ can be written similarly to (11)–(13):

Nq
1   j j j
L( w (τ ), λr (τ ), λb (τ )) = m(λq (τ )) |u q (τ ) − v q |2 (51)
2
q=r ,b j =1

Hence, the gradient flow in (44) becomes

Nq
dw  j j j
=− ∇ w uq (τ )m(λq (τ )) (uq (τ ) − v qi ) (52)

q=r ,b j =1

=− JqT (τ )q (τ )(uq (τ ) − vq ) (53)
q=r ,b

where Jq (τ ) is the Jacobian of uq (τ ) with respect to w, for q = r , b, and q (τ ) is a diagonal matrix of dimension N q × N q
Nq
containing the self-adaptive mask values m(λq1 (τ )), . . . , m(λq (τ )) in the diagonal, for q = r , b.

17
L.D. McClenny and U.M. Braga-Neto Journal of Computational Physics 474 (2023) 111722

It follows that
du p (τ ) dw(τ ) 
= J p (τ ) · =− J p (τ )JqT (τ )q (τ )(uq (τ ) − vq ) , (54)
dτ dτ
q=r ,b

for p = r , b.
Now define K pq (τ ) = J p (τ )JqT (τ ), for p , q = r , b. Notice that these are matrices of dimensions N p × N q , with i , j elements

  j
 du ip (τ ) duqj (τ )
K pq ij
(τ ) = ∇ w u ip (τ )T · ∇ w uq (τ ) = · . (55)
w ∈w
dw dw

It is clear from the definition that the matrices K pp (τ ) are symmetric and positive semi-definite, and that K pq (τ ) = Kqp (τ ) T ,
for p , q = r , b.
This allows us to collect the previous results in the following differential equation describing the evolution of the output
of the SA-PINN in the vanishing learning-rate limit:

du(τ )
= −K(τ ) · (u(τ ) − v) , (56)

where
 
ur (τ ) vr
u(τ ) = , v= , (57)
ub (τ ) vb
and

Krr (τ )r (τ ) Krb (τ )b (τ )
K(τ ) = (58)
Kbr (τ )r (τ ) Kbb (τ )b (τ )

is the empirical neural tangent kernel matrix for the SA-PINN. (When all the mask values are 1, this reduces to the expression
in Lemma 3.1 of [5].)
Next, we employ the previous result to perform a heuristic analysis of the self-adaptive weights through their effects
in the gradient flow ODE system in (56). The analysis is based on the infinite-width limit of neural networks, when it
can be shown that, under the vanishing learning rate regime, the NTK matrix converges to a constant deterministic value
throughout training [35]. Under certain regularity conditions, it is shown in [5] that this result still holds in the case of
PINNs, i.e., the matrices Krr (τ ), Krb (τ ) = Kbr
T
(τ ) and Kbb (τ ) are constant and equal to their respective values at initialization
(τ = 0) throughout training. In [5], this was proved for PINNs with one hidden layer and linear PDEs, though the authors
conjectured that this result also holds for multiple-layer PINNs and nonlinear PDEs.
We thus make the assumption that for a wide PINN under a small learning rate, the NTK matrix and self-adaptive
weights change little during training, i.e., K pq (τ ) ≈ K pq and  p (τ ) ≈  p , for p , q = r , b, τ ≥ 0. In addition, we make the
simplifying approximation that the ODE system (56) can be decoupled, so that

du p (τ )
≈ −K pp  p · (u p (τ ) − v) , (59)

for p = r , b. (This approximation is also made, implicitly, in Section 7.3 of [5].) Some justification for the decoupling ap-
proximation comes from empirical evidence (not shown) that the matrix norms of the cross-terms Krb b and Kbr r are
smaller than those of Krr r and Kbb b and, in some cases, much smaller. This approximation allows us to gain a qualitative
understanding of the importance of the residual and boundary loss components separately.
For p = r , b, matrix K pp is real symmetric and positive semi-definite, and thus diagonalizable with nonnegative eigen-
values. However, matrix K pp  p is not symmetric and not diagonalizable, in general. Fortunately, with the extra minor
assumption that K pp is positive definite, and thus invertible, K pp  p is diagonalizable. To see this, note that

−1 1 1 1
K pp2 K pp  p K pp
2
= K pp
2 2
 p K pp . (60)
1 1
2 2
But K pp  p K pp is a product of symmetric matrices, and thus symmetric itself. Hence, K pp  p is similar to a real symmetric
Np
matrix, and thus diagonalizable. Furthermore, it is fairly simple fact of matrix theory that if γ p1 ≥ · · · ≥ γ p and μ1p ≥ · · · ≥
Np
μ are the ordered eigenvalues of K pp and K pp  p , respectively, and λ1p , ≥ · · · ≥ λ N p are the self-adaptive weights sorted
p
by magnitude, then

μ1p ≤ m(λ1 ) γ p1 , (61)

μnp ≥ m(λn ) γ pn . (62)

18
L.D. McClenny and U.M. Braga-Neto Journal of Computational Physics 474 (2023) 111722

In particular, (62) implies that all eigenvalues of K pp  p are nonnegative (and positive, if K pp is positive definite).
It follows that, under the assumption that u p (0) ≈ 0 (this can be achieved with proper initialization of the neural
network weights), the solution of the ODE (59) is given by

u p (τ ) = (I − e −K pp  p t ) · v p , (63)

which can be rewritten as

u p (τ )− v p = −Q T e −Mt Q · v p , (64)

that is

Q · (u p (τ )− v p ) = −e −Mt Q · v p , (65)
N
where Q is the matrix of eigenvectors and M is the diagonal matrix of eigenvalues μ1p , . . . , μ p p of K pp  p , for p = r , b. This
−mu ip Np
implies that the training error u ip (τ ) − v ip decreases at a rate e rate. Large variation among the eigenvalues μ1p , . . . , μ p ,
both across the different loss terms p = r , b and the different data points in each loss term, will potentially lead to training
imbalances and loss of convergence.
The standard weighted loss function in (9) corresponds to the case when all the self-adaptive weights for each loss
component are equal, with  p = λ p I , in which case the eigenvalues of the NTK matrix are simply scaled by λ p : μip = λip γ pi ,
for i = 1, . . . , N p . On the other hand, the transformation effected on the eigenvalues of the NTK matrix by the self-adaptive
weights is nonlinear. In general, little can be said about it other than the transformed eigenvalues are in the interval
determined by m(λ1 )γ p1 and m(λn )γ pn , as stated in (61)–(62). The simple linear scaling introduced by traditional weighting
can certainly help reduce the imbalance among the various terms, but it is less flexible than the transformation introduced
by the pointwise self-adaptive weights, which can also change the shape of the eigenvalue distribution.
Next, we illustrate this analysis with the classical univariate advection PDE: [36]:

qt (x, t ) + ūq x (x, t ) = 0 , x ∈ [0, L ] , t ∈ [0, T ] , (66)


u (0, t ) = u ( L , t ) = 0, t ∈ [0, T ] , (67)
u (x, 0) = g (x) , x ∈ [0, L ] , (68)

where q(x, t ) is for example the concentration of a tracer being transported in a fluid in a tube of length L, where ū > 0
is the fluid constant velocity. For simplicity, it is assumed that g (x) = 0 outside an interval in [0, L ], and that T is short
enough that the Dirichlet boundary condition is satisfied. (This could be changed at the expense of more complex boundary
conditions.) In this scenario, the problem has a simple solution:

q(x, t ) = g (x − ūt ) , x ∈ [0, L ] , t ∈ [0, T ] , (69)


i.e., the initial concentration profile is simply translated to the right at constant speed ū. Here, we adopt a fairly complex
initial condition, containing several discontinuities, which makes the problem rather difficult to solve with the baseline PINN
— or indeed numerical methods in general [36].
The results presented in Figs. 15, 16, and 17 are generated with a neural network architecture of [2, 400, 400, 400,
400, 1], trained for 10k Adam iterations with a neural network weight learning rate of 0.001 (hence, a wide PINN with a
small learning rate, as required by the theory). The learning rate for all self-adaptive weight was set at 0.1. Glorot Normal
initialization was utilized, and all training was completed in Tensorflow on a single V100 GPU with an average training
time of 7 seconds for 10k iterations. At the end of 10k training iterations, the baseline PINN failed to grasp even the rough
structure of the solution, while the SA-PINN was able to approximate the solution within 5% L2 error. (More accurate results
could have obtained by using more training epochs and a decreasing learning rate schedule.)
An analysis of NTK eigenvalues similar to that performed in [5] is demonstrated in Fig. 17. (In this example, the bound-
ary condition weights were fixed and equal to 1.0, and we disregarded this component in the analysis.) We can see that
the eigenvalues become closely matched in scale between K uu and K rr , removing the imbalance between these two loss
components and enabling convergence to the solution (here we are only looking at the initial and. Importantly, the shape
of the eigenvalue distribution is also nicely equalized, as opposed to simply being scaled up as would be the case with
traditional weighting of the entire loss component.

7. Conclusion

In this paper, we introduced Self-Adaptive Physics-Informed Neural Networks, a novel class of physics-constrained neural
networks. This approach uses a similar conceptual framework as soft self-attention mechanisms in computer vision, in that
the network identifies which inputs are most important to its own training. It was shown that training of the SA-PINN is
formally equivalent to solving a PDE-constrained optimization problem using penalty-based method, though in a way where

19
L.D. McClenny and U.M. Braga-Neto Journal of Computational Physics 474 (2023) 111722

Fig. 15. Top: Plot of the approximation u (x, t ) via the baseline PINN, showing the exact solution vs predicted solution vs absolute error. Bottom: The SA-PINN
results, L2 error decreases by an order of magnitude and the SA-PINN closely captures the exact solution.

Fig. 16. Top: Plot of the approximation u (x, t ) via the baseline PINN, showing cross sections of the spatial domain at t = 0.02, 0.10, 0.18 Bottom: The
SA-PINN results at the same time steps, with the same number of epochs (10k Adam) and all other parameters held constant.

the monotonically-nondecreasing penalty coefficients are trainable. Experimental results with several linear and nonlinear
PDE benchmarks indicate that SA-PINNs produce more accurate solutions than other state-of-the-art PINN algorithms. It was
seen that SA-PINNs can employ stochastic gradient training, through continuous Gaussian-process interpolated self-adaptive
maps, which allows the solution of a difficult wave PDE. These experimental results were complemented by a theoretical
analysis based on the Neural Tangent Kernel for SA-PINNs.
We believe that SA-PINNs open up new possibilities for the use of deep neural networks in forward and inverse modeling
in engineering and science. However, there is much that is not known yet about this class of algorithms. While the empirical
results observed here show significant promise, and despite the fact that an initial analysis based on the NTK is provided,

20
L.D. McClenny and U.M. Braga-Neto Journal of Computational Physics 474 (2023) 111722

Fig. 17. NTK eigenvalues of the baseline PINN (solid) vs. the SA-PINN (dashed) for τ =1000, 5000, and 10000 training iterations. It can be observed that
the SA-PINN accurately matches the magnitudes of the NTK eigenvalues between terms of the loss function, in this case the initial condition K uu and the
residual loss K rr .

more theoretical justification is desirable, which will be part of future work. In addition, the use of standard off-the-shelf
optimization algorithms for training deep neural networks, such as Adam, may not be appropriate, since those algorithms
were mostly developed for traditional deep learning applications; obtaining optimization algorithms specifically tailored to
Self-Adaptive PINNs, and indeed PINNs in general, in an open problem. Finally, the relationship between Self-Adaptive PINNs
and constrained-optimization problems is likely a fruitful topic of future study.

21
L.D. McClenny and U.M. Braga-Neto Journal of Computational Physics 474 (2023) 111722

CRediT authorship contribution statement

Levi McClenny co-authored the idea of SA-PINNs, co-wrote the manuscript, wrote most of the code, to include the
proposed SA-PINN code/algorithm, and performed the bulk of the numerical experiments presented in the paper.
Ulisses Braga-Neto co-authored the idea of SA-PINNs, co-wrote the manuscript, performed most of the mathematical
analyses presented in the paper, to include deriving the modifications to the PINN NTK based on the addition of self-adaptive
weights and the mathematical analysis of the SA-PINN optimization, as well as provided commentary in the manuscript and
numerical experiments in an advisory capacity. All authors read and approved the manuscript.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have
appeared to influence the work reported in this paper.

Data availability

No data was used for the research described in the article.

Acknowledgements

The authors would like to acknowledge the support of the D3 EM program funded through NSF Award DGE-1545403.
The authors would further like to thank the US Army CCDC Army Research Lab (CA number W911NF-16-2-0008) for their
generous support and affiliation.

References

[1] Nathan Baker, Frank Alexander, Timo Bremer, Aric Hagberg, Yannis Kevrekidis, Habib Najm, Manish Parashar, Abani Patra, James Sethian, Stefan Wild,
Karen Willcox, Steven Lee, Workshop report on basic research needs for scientific machine learning: Core technologies for artificial intelligence, 2 2019.
[2] Maziar Raissi, Paris Perdikaris, George E. Karniadakis, Physics-informed neural networks: a deep learning framework for solving forward and inverse
problems involving nonlinear partial differential equations, J. Comput. Phys. 378 (2019) 686–707.
[3] Maziar Raissi, Forward-backward stochastic neural networks: deep learning of high-dimensional partial differential equations, arXiv preprint arXiv:
1804.07010, 2018.
[4] Colby L. Wight, Jia Zhao, Solving Allen-Cahn and Cahn-Hilliard equations using the adaptive physics informed neural networks, arXiv preprint arXiv:
2007.04542, 2020.
[5] Sifan Wang, Xinling Yu, Paris Perdikaris, When and why pinns fail to train: a neural tangent kernel perspective, arXiv preprint arXiv:2007.14527, 2020.
[6] M.W.M.G. Dissanayake, N. Phan-Thien, Neural-network-based approximations for solving partial differential equations, Commun. Numer. Methods Eng.
10 (3) (1994) 195–201.
[7] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard,
et al., Tensorflow: a system for large-scale machine learning, in: 12th {USENIX} Symposium on Operating Systems Design and Implementation ({OSDI}
16), 2016, pp. 265–283.
[8] Jarrett Revels, Miles Lubin, Theodore Papamarkou, Forward-mode automatic differentiation in Julia, arXiv preprint arXiv:1607.07892, 2016.
[9] Atılım Günes Baydin, Barak A. Pearlmutter, Alexey Andreyevich Radul, Jeffrey Mark Siskind, Automatic differentiation in machine learning: a survey, J.
Mach. Learn. Res. 18 (1) (2017) 5595–5637.
[10] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, Adam Lerer,
Automatic Differentiation in Pytorch, 2017.
[11] Richard L. Burden, Douglas J. Faires, Numerical Analysis, 1985.
[12] Sifan Wang, Yujun Teng, Paris Perdikaris, Understanding and mitigating gradient pathologies in physics-informed neural networks, arXiv preprint
arXiv:2001.04536, 2020.
[13] Fei Wang, Mengqing Jiang, Chen Qian, Shuo Yang, Cheng Li, Honggang Zhang, Xiaogang Wang, Xiaoou Tang, Residual attention network for image
classification, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3156–3164.
[14] Yanwei Pang, Jin Xie, Muhammad Haris Khan, Rao Muhammad Anwer, Fahad Shahbaz Khan, Ling Shao, Mask-guided attention network for occluded
pedestrian detection, in: Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 4967–4975.
[15] Yeonjong Shin, Jerome Darbon, George Em Karniadakis, On the convergence and generalization of physics informed neural networks, arXiv preprint
arXiv:2004.01806, 2020.
[16] Kejun Tang, Xiaoliang Wan, Qifeng Liao, Adaptive deep density approximation for Fokker-Planck equations, J. Comput. Phys. 457 (2022) 111080.
[17] Xiaodong Feng, Li Zeng, Tao Zhou, Solving time dependent Fokker-Planck equations via temporal normalizing flow, arXiv preprint arXiv:2112.14012,
2021.
[18] Dehao Liu, Yan Wang, A dual-dimer method for training physics-constrained neural networks with minimax architecture, Neural Netw. 136 (2021)
112–125.
[19] Softadapt: Techniques for adaptive loss weighting of neural networks with multi-part loss functions, arXiv preprint arXiv:1912.12355, 2019.
[20] Conrado Silva Miranda, Fernando José Von Zuben, Multi-objective optimization for self-adjusting weighted gradient in machine learning tasks, arXiv
preprint arXiv:1506.01113, 2015.
[21] Haowen Xu, Hao Zhang, Zhiting Hu, Xiaodan Liang, Ruslan Salakhutdinov, Eric Xing, Autoloss: learning discrete schedules for alternate optimization,
arXiv preprint arXiv:1810.02442, 2018.
[22] Michael Emmerich, André H. Deutz, A tutorial on multiobjective optimization: fundamentals and evolutionary methods, Nat. Comput. 17 (3) (2018)
585–609.
[23] Diederik P. Kingma, Jimmy Ba, Adam: a method for stochastic optimization, arXiv preprint arXiv:1412.6980, 2014.
[24] Dong C. Liu, Jorge Nocedal, On the limited memory bfgs method for large scale optimization, Math. Program. 45 (1–3) (1989) 503–528.

22
L.D. McClenny and U.M. Braga-Neto Journal of Computational Physics 474 (2023) 111722

[25] Levi D. McClenny, Mulugeta A. Haile, Ulisses M. Braga-Neto, Tensordiffeq: scalable multi-gpu forward and inverse solvers for physics informed neural
networks, arXiv preprint arXiv:2103.16034, 2021.
[26] David G. Luenberger, Yinyu Ye, Linear and Nonlinear Programming, 3rd edition, Springer, 2008.
[27] Nele Moelans, Bart Blanpain, Patrick Wollants, An introduction to phase-field modeling of microstructure evolution, Calphad 32 (2) (2008) 268–294.
[28] Jie Shen, Xiaofeng Yang, Numerical approximations of Allen-Cahn and Cahn-Hilliard equations, Discrete Contin. Dyn. Syst., Ser. A 28 (4) (2010) 1669.
[29] Courtney Kunselman, Vahid Attari, Levi McClenny, Ulisses Braga-Neto, Raymundo Arroyave, Semi-supervised learning approaches to class assignment
in ambiguous microstructures, Acta Mater. 188 (2020) 49–62.
[30] Lu Lu, Xuhui Meng, Zhiping Mao, George Em Karniadakis, Deepxde: a deep learning library for solving differential equations, SIAM Rev. 63 (1) (2021)
208–228.
[31] Herbert Robbins, Sutton Monro, A stochastic approximation method, Ann. Math. Stat. (1951) 400–407.
[32] Sebastian Ruder, An overview of gradient descent optimization algorithms, arXiv preprint arXiv:1609.04747, 2016.
[33] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, Ping Tak, Peter Tang, On large-batch training for deep learning: gener-
alization gap and sharp minima, arXiv preprint arXiv:1609.04836, 2016.
[34] Andrew M. Saxe, James L. McClelland, Surya Ganguli, Exact solutions to the nonlinear dynamics of learning in deep linear neural networks, arXiv
preprint arXiv:1312.6120, 2013.
[35] Arthur Jacot, Franck Gabriel, Clément Hongler, Neural tangent kernel: convergence and generalization in neural networks, Adv. Neural Inf. Process. Syst.
31 (2018).
[36] Randall J. LeVeque, et al., Finite Volume Methods for Hyperbolic Problems, vol. 31, Cambridge University Press, 2002.

23

You might also like