by
Andres Ross

Master of Science in Mathematics
April 2022

Submitted by Andres Ross in partial fulfillment of the requirements for the degree of Master of Science in Mathematics.
Examining Committee:
Christoph Ortner, Professor, Mathematics, UBC
Supervisor
Chad Sinclair, Professor, Materials Engineering, UBC
Supervisory Committee Member
Khanh Dao Duc, Professor, Mathematics, UBC
Supervisory Committee Member
Abstract
Lay Summary
Modelling materials at the atomic scale has become a crucial part of scientific re-
search. However, simulating thousands or even millions of atoms directly with
quantum mechanics is highly costly. A way to reduce such a cost is to gener-
ate a surrogate model trained by machine learning with data from a high fidelity
model. In this thesis, we explore a particular class of surrogate models. We develop
the mathematical theory, describe their practical implementation, and test them on
benchmark data.
Preface
This thesis is original, unpublished, independent work by the author, Andres Ross
under the supervision of Dr. Christoph Ortner.
Table of Contents
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
Lay Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv
Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
Table of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
Nomenclature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x
2.3 Parameter estimation . . . . . . . . . . . . . . . . . . . . . . . . 16
2.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
4 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.1 List of neighbours . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.2 Reverse mode differentiation: Example in ACE . . . . . . . . . . 32
4.3 Introduction to the code base . . . . . . . . . . . . . . . . . . . . 35
4.4 Implementing nonlinear combinations . . . . . . . . . . . . . . . 37
4.4.1 Multiple properties . . . . . . . . . . . . . . . . . . . . . 37
4.4.2 Forward pass . . . . . . . . . . . . . . . . . . . . . . . . 37
4.4.3 Making models differentiable . . . . . . . . . . . . . . . 38
4.5 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.5.1 Multiprocessing . . . . . . . . . . . . . . . . . . . . . . 40
4.5.2 Interacting with optimization packages in Julia . . . . . . 40
4.6 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
List of Tables
Table 2.1 RMSE for energy and forces. κ represents the basis size used
for that value. . . . . . . . . . . . . . . . . . . . . . . . . . . 20
List of Figures
Figure 4.1 Example situation of fi for 14 atoms. The left and the right represent fi (R) for two different atoms i. The number of atoms n inside the cutoff radius can be (and usually is) different for different i. . . . . . 32
Nomenclature
F Nonlinear embedding.
N correlation order.
ϵ RRQR tolerance.
φi An atomic property.
F The forces.
f Function to find the neighbours of an atom.
J Number of atoms.
R := {r1 , ..., rJ }.
Ri := {rij }i̸=j .
rj Position of an atom j.
rij := rj − ri .
Chapter 1
[Figure 1.1 labels: geometrical, topological, qualitative; energy scales 1 eV / 40 kT, 0.1 eV / 4 kT, 0.0001 eV / 0.004 kT.]
E = \frac{\langle \Psi | H | \Psi \rangle}{\|\Psi\|^2}. \qquad (1.1)
This allows for a complete description of the system, but it requires numerical
approximation. For J atoms, let R := {r_1, ..., r_J} be an atomic structure, with r_j = (x, y, z) the position of atom j, r_j = |r_j| its magnitude, and r_{j_1 j_2} := r_{j_1} − r_{j_2} the relative position of atoms j_1 and j_2. Then we can formulate the time-
independent electronic Schrödinger equation with the Born-Oppenheimer PES
as an eigenvalue problem where the energy levels are the eigenvalues of the sys-
tem. However, solving this PDE with simple discretization results in exponential
growth of the degrees of freedom with the number of electrons (the cause of high
dimensionality), which is very costly. As we see in Figure 1.1, Quantum Chemistry
is limited to only small atomic structures, on the order of at most a few atoms [6].
Although highly accurate, its cost makes it unfeasible for almost all applications.
In 1964 Hohenberg and Kohn [21] proved the existence of a universal density
functional that allowed for the calculation of energy. This functional is based on
the electronic density and serves as an approximation of E in 1.2. It is the basis of
what we now know as DFT (Density Functional Theory) and is faster than quantum
chemistry while still retaining good accuracy [21]. We briefly present the general
idea of DFT but quickly move on to methods without electrons.
In DFT models, the energy is given as the sum of external potential energy (V ),
kinetic energy (T ), and interaction energy of the atoms (U ) and is described by the
Hamiltonian operator H = T + V + U , which is evaluated with (1.1). Assuming
a non-degenerate ground state (i.e. a unique quantum state represents that energy),
let us denote the electronic density by ρ(r). This allows us to define a universal
functional:
We can then replace the wavefunction by the electronic density ρ(r) and use the
density functional F [ρ(r)] to replace the other energy contributions, which will
leave us with a representation of the energy that only depends on the external po-
tential v(r) and the electronic density
E_v[\rho] := \int v(r)\, \rho(r)\, dr + F[\rho]. \qquad (1.5)
If the functionals v and F were chosen to depend only on the atomic position r, then the energy will also depend only on r. One way to choose ρ(r) is the Kohn-Sham
method [25]. It assumes that the electron density of a system of J electrons can be
written as the sum of one-electron orbitals ψi :
\rho(r) = \sum_{i=1}^{J} |\psi_i(r)|^2. \qquad (1.6)
Even when optimized, the ab-initio DFT approach scales as O(J^3), since the ψ_i solve our eigenvalue problem (1.2) [35], which limits the size of simulations one can realistically run with this approach and is the reason why interatomic potentials are useful.
E(R) \approx V_0 + \sum_{j_1} V_1(r_{j_1}) + \sum_{j_1 < j_2} V_2(r_{j_1}, r_{j_2}) + \dots + \sum_{j_1 < \dots < j_N} V_N(r_{j_1}, r_{j_2}, \dots, r_{j_N}) + \dots \qquad (1.7)
There are several ways to model VN , which depend on the type of interac-
tions we care about in a specific case. For example, since the density of electrons
around an atom decreases exponentially with distance, short-range interactions can
be modeled with a repulsive functional V2 (rj1 j2 ) = Ae−αrj1 j2 , for some param-
eters A and α. Similarly, if we wanted to account for Van der Waals interactions, we could use a functional of the form V_2(r_{j_1 j_2}) = A / r_{j_1 j_2}^6 for some material-dependent parameter A [25]. Several energy models are derived using physics-inspired functions and truncating the series expansion. These models tend to be material-
dependent, thus some will perform better than others in different cases.
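As an illustration of how such a truncated expansion is evaluated, here is a minimal Julia sketch of a two-body model with a short-range repulsion of the form A e^{-αr}; the parameter values and the naive O(J²) double loop are purely illustrative and not a fitted potential.

using LinearAlgebra

# Minimal sketch of a truncated two-body model from (1.7): a repulsive
# pair term V2(r) = A*exp(-α*r) summed over all ordered pairs j1 < j2.
V2(r; A = 1.0e3, α = 3.0) = A * exp(-α * r)

function pair_energy(R::Vector{<:AbstractVector}; V0 = 0.0)
    E = V0
    for j1 in eachindex(R), j2 in eachindex(R)
        j1 < j2 || continue            # the ordered sum j1 < j2 in (1.7)
        E += V2(norm(R[j1] - R[j2]))
    end
    return E
end

R = [rand(3) .* 5 for _ in 1:10]       # 10 atoms in a toy 5 Å box
pair_energy(R)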
A well-known example is the Lennard-Jones potential, which models pair interactions as V_2(r_{j_1 j_2}) = 4ϵ((r_0 / r_{j_1 j_2})^{12} − (r_0 / r_{j_1 j_2})^6), where r_0 is the equilibrium distance that minimizes the potential. The long-range contribution is represented by r_{j_1 j_2}^{−6}, but when this potential was originally developed, short-range decay was not known, so it was approximated with r_{j_1 j_2}^{−12}. We would get the Born-Mayer potential
if we wanted to use exponential decay for short-range interactions. There are many
other pair potentials that can perform well in some cases, but most of them fail to
describe the properties of metals adequately because they fail to capture the local
electron density [25]. Embedded-atom model potentials (EAM) were developed to
address this by adding an energy functional of the local electron density. Motivated
by DFT, they have the general form:
V = \sum_{j_1} F\Big( \sum_{j_1 \neq j_2} f_{j_1 j_2}(r_{j_1 j_2}) \Big) + \frac{1}{2} \sum_{j_1 \neq j_2} V_2(r_{j_1 j_2}), \qquad (1.8)
where f is a function that approximates the electron density and F models the cost of embedding a nucleus into an electron cloud. A major limitation of EAM-type
potentials is that they do not reflect dynamic changes that arise from changing the
local environment [25]. Finnis and Sinclair introduced a similar potential which,
based on a second-moment approximation to the tight-binding density of states,
uses f_{j_1 j_2}(r_{j_1 j_2}) ∼ \sqrt{r_{j_1 j_2}} [14].
Adding more interaction orders improves performance, which leads to higher-
order potentials like the Stillinger-Weber potential [22] and modified embedded-
atom method (MEAM) potential [24]. However, increasing the order of interaction N increases the cost of evaluation as \binom{J}{N}, where J is the number of atoms, which makes potentials with N > 3 very expensive.
nowhere near the capabilities of interatomic potentials. In an attempt to bridge
this gap, Ercolessi and Adams proposed using data created by first principles to
train interatomic potentials, arguing that a richer dataset would allow for increased
transferability [17]. This was the beginning of machine-learned interatomic potentials, which sparked a new movement to employ data from high-fidelity models
to fit interatomic potentials. In 2007 Behler and Parrinello proposed the use of a
neural network representation of a potential energy surface using DFT data [3].
This network provided the energy and the forces directly as a function of all the
atomic positions in a system. Their method was orders of magnitude faster than
DFT, and they demonstrated high accuracy for bulk silicon compared to standard
empirical interatomic potentials. They used a densely connected neural network
with radial G1i and angular G2i functions as the input and energy as the output. The
radial functions were constructed as a sum of Gaussians with parameters η and rs ,
G_i^1(r) = \sum_{j \neq i} e^{-\eta (r_{ij} - r_s)^2} f_{\mathrm{cut}}(r_{ij}) \qquad (1.9)
where fcut is a cutoff function that is 0 for values rij > rcut for some cutoff value
r_cut. Define θ_ijk as the angle between r_ij and r_ik for a central atom i. Then the
angular functions were constructed for all triplets of atoms by summing over the
cosine values of θijk ,
G_i^2(r) = 2^{1-\zeta} \sum_{j,k \neq i} (1 + \lambda \cos(\theta_{ijk}))^{\zeta} \, e^{-\eta (r_{ij}^2 + r_{ik}^2 + r_{jk}^2)} f_{\mathrm{cut}}(r_{ij}) f_{\mathrm{cut}}(r_{ik}) f_{\mathrm{cut}}(r_{jk}), \qquad (1.10)
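As a concrete illustration, the following is a minimal Julia sketch of the radial symmetry function (1.9) for a single atomic environment; the cosine-shaped cutoff function and the parameter values are assumptions made for this example, not taken from [3].

# assumed cutoff shape: smooth decay to 0 at rcut
fcut(r, rcut) = r < rcut ? 0.5 * (cos(pi * r / rcut) + 1) : 0.0

# G1 for one atom i, given the list of neighbour distances r_ij
G1(rij; η = 1.0, rs = 0.0, rcut = 6.0) =
    sum(exp(-η * (r - rs)^2) * fcut(r, rcut) for r in rij)

G1([1.9, 2.3, 3.7])    # a single scalar descriptor for the environment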
1.4.1 Gaussian approximation potentials
In 2010 Bartók, Payne, Kondor and Csányi introduced Gaussian approximation
potentials (GAP), a class of interatomic potentials without a fixed functional form.
These were created to maintain special symmetries of the system and to be auto-
matically generated from DFT energies and forces data. We start by writing the
total energy of a system as the sum of atomic energies Ei (introduced by Behler
and Parrinello),
E := \sum_{i=1}^{J} E(\{r_{ij}\}_{i \neq j}), \qquad (1.11)
where {rij }i̸=j is the set of distances between the central atom i and the neigh-
bouring atoms j. We then define a local density for atom i and its neighbours,
\rho_i(r) := \delta(r) + \sum_{j} \delta(r - r_{ij}) f_{\mathrm{cut}}(|r_{ij}|), \qquad (1.12)
with fcut a cutoff function as previously shown. With (1.12) we build a kernel G
such that
E(\{r_{ij}\}_{i \neq j}) := \sum_{n} \alpha_n G(b, b_n), \qquad (1.13)
\{\alpha_n\} = \alpha = C^{-1} y. \qquad (1.15)
This produces a symmetry preserving method that improves with more data.
However, two problems that arise are that the data contain only total energies and forces, and that these will be heavily correlated. To solve these, the authors propose
using a sparsification procedure that reduces the data to a much smaller set of
configurations and replaces y with a linear combination of all data values. With
this approach, they reached, for example, an RMSE of 1 meV per atom for Silicon
energies [2].
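To make the kernel regression in (1.13) and (1.15) concrete, here is a minimal Julia sketch of fitting the coefficients α with a Gaussian kernel; the descriptors, kernel width and jitter regularisation are illustrative assumptions, and no sparsification is performed.

using LinearAlgebra

gauss_kernel(b1, b2; σ = 1.0) = exp(-sum(abs2, b1 .- b2) / (2σ^2))

function fit_gap(descriptors, y; σ = 1.0, jitter = 1e-8)
    n = length(descriptors)
    C = [gauss_kernel(descriptors[a], descriptors[b]; σ = σ) for a in 1:n, b in 1:n]
    return (C + jitter * I) \ y               # α = C⁻¹ y, cf. (1.15)
end

predict(α, descriptors, bnew; σ = 1.0) =
    sum(α[n] * gauss_kernel(bnew, descriptors[n]; σ = σ) for n in eachindex(α))

bs = [randn(4) for _ in 1:20]                  # toy environment descriptors b_n
Es = randn(20)                                 # toy target energies
α = fit_gap(bs, Es)
predict(α, bs, randn(4))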
V(R) \approx \sum_{B \in \mathbf{B}} c_B B(R). \qquad (1.16)
The idea of using symmetrized polynomials in MTP was the precursor for the
Atomic Cluster Expansion (ACE) which is at the centre of this thesis. We therefore
omit the derivation of B(R) in favour of a detailed description of ACE in the next
chapter.
Chapter 2
In this chapter we derive the Atomic Cluster Expansion (ACE) and describe its use
to model the PES of an atomic configuration. We present a linear model based on
the ACE and show results on benchmark data sets for silicon, copper, and molybdenum.
[Figure 2.1: An atomic structure R = {r_1, ..., r_22} and the atomic environment R_i of a central atom; coordinate axes x, y, z.]
Now let us assume that E can be represented in terms of the following body order
expansion:
E(R) = \sum_{i=1}^{J} E(R_i), \qquad E(R_i) = V_0 + \sum_{N \geq 1} \; \sum_{j_1 < j_2 < \dots < j_N} V_N(r_{i j_1}, \dots, r_{i j_N}), \qquad (2.1)
E(R_i) \approx V_0 + \sum_{j_1} V_1(r_{i j_1}) + \sum_{j_1 < j_2} V_2(r_{i j_1}, r_{i j_2}) + \dots + \sum_{j_1 < \dots < j_N} V_N(r_{i j_1}, \dots, r_{i j_N}). \qquad (2.2)
In this form, we can see the role of N in the expansion. We are representing the
site energy E as a summation of increasing body order terms, i.e. terms that account
for higher-order interactions. For example, V1 is a pair potential that accounts for
all pairwise interactions in the atomic environment, and similarly, VN accounts for
N + 1 particle interactions. The goal of this expansion is to truncate at N << J,
which significantly reduces evaluation cost.
When we defined E, we required isometry invariance and noticed it already
possessed permutation invariance. Therefore, we will assume that our components
VN are also isometry and permutation invariant. Moreover, we will further assume
regularity and locality.
where rcut is a cutoff radius to limit the range of interaction (Figure 2.1). Similarly,
we also restrict the domain by introducing a minimal radius r0 > 0 since we are
not interested in atomic collisions. We will include rcut in E such that E(Ri ) takes
Ri = {rij }j̸=i , but only considers {rij }rij <rcut .
\phi_{\mathbf{n}\mathbf{l}\mathbf{m}}(r_1, \dots, r_N) := \prod_{\alpha=1}^{N} \phi_{n_\alpha l_\alpha m_\alpha}(r_\alpha), \qquad \phi_{nlm}(r) := P_n(r)\, Y_l^m(\hat{r}), \qquad (2.4)
where n = (n_1, ..., n_N) and similarly for l and m; n = 0, 1, 2, ... dictates the radial functions, while l = 0, 1, 2, ... and m = −l, ..., l (the azimuthal and magnetic quantum numbers respectively) dictate the angular functions, which in our case are the spherical harmonics. We also denote r̂ as the unit vector of r, r as its
magnitude, and R̂ = (r̂1 , ..., r̂N ). The choice of spherical harmonics will later
allow us to conveniently impose rotational symmetries, but the choice of Pn has
considerable freedom. This allows us to play with different choices to improve
convergence, but we will not pursue this freedom in the present work. Let the
radial basis
where span_{C^t} P denotes the closure of P with respect to the norm ‖·‖_{C^t}. These two assumptions on the radial functions mean that V_N can be approximated with a linear combination of the tensor products ϕ,
\tilde{V}_N \approx \sum_{\mathbf{n},\mathbf{l},\mathbf{m}} c_{\mathbf{n}\mathbf{l}\mathbf{m}} \, \phi_{\mathbf{n}\mathbf{l}\mathbf{m}}. \qquad (2.7)
It has been shown in [16] that since the ϕ_{nlm} are linearly independent and Ṽ_N is permutation invariant (Ṽ_N = Ṽ_N ∘ σ), we can assume c_{nlm} = c_{σn, σl, σm} without loss of accuracy. This allows us to write
\tilde{V}_N \approx \sum_{(\mathbf{n},\mathbf{l},\mathbf{m})\ \mathrm{ordered}} \; \sum_{\sigma \in S_N} c_{\mathbf{n}\mathbf{l}\mathbf{m}} \, \phi_{\mathbf{n}\mathbf{l}\mathbf{m}} \circ \sigma, \qquad (2.8)
where the c_{nlm} could be different coefficients, and "(n, l, m) ordered" denotes the lexicographically ordered tuples. Since we assumed point reflection symmetry for V_N, all basis functions ϕ_{nlm} for which ∑l is odd vanish. Hence
\tilde{V}_N(R) \approx \sum_{\substack{(\mathbf{n},\mathbf{l},\mathbf{m})\ \mathrm{ordered} \\ \sum \mathbf{l}\ \mathrm{even}}} \; \sum_{\sigma \in S_N} c_{\mathbf{n}\mathbf{l}\mathbf{m}} \, (\phi_{\mathbf{n}\mathbf{l}\mathbf{m}} \circ \sigma)(R). \qquad (2.9)
To make (2.9) rotationally invariant, we integrate over all rotations using the Haar integral [16],
\tilde{V}_N \approx \sum_{\substack{(\mathbf{n},\mathbf{l},\mathbf{m})\ \mathrm{ordered} \\ \sum \mathbf{l}\ \mathrm{even}}} \; \sum_{\sigma \in S_N} c_{\mathbf{n}\mathbf{l}\mathbf{m}} \int_{SO(3)} (\phi_{\mathbf{n}\mathbf{l}\mathbf{m}} \circ \sigma)(QR)\, dQ. \qquad (2.10)
Recall that the radial functions P are already rotationally invariant, so we focus
on Ylm . Now we represent the rotated spherical harmonics in terms of the Wigner
D-matrices [16]
Y_{\mathbf{l}}^{\mathbf{m}}(Q\hat{R}) = \sum_{\mu \in M_{\mathbf{l}}} D^{\mathbf{l}}_{\mu \mathbf{m}}(Q)\, Y_{\mathbf{l}}^{\mu}(\hat{R}) \qquad \forall Q \in SO(3), \qquad (2.11)
D^{\mathbf{l}}_{\mu \mathbf{m}}(Q) = \prod_{\alpha=1}^{N} D^{l_\alpha}_{\mu_\alpha m_\alpha}(Q). \qquad (2.12)
b_{\mathbf{l}\mathbf{m}}(\hat{R}) := \sum_{\mu \in M_{\mathbf{l}}} \bar{D}^{\mathbf{l}}_{\mu \mathbf{m}}\, Y_{\mathbf{l}}^{\mu}(\hat{R}), \qquad (2.13)
where
\bar{D}^{\mathbf{l}}_{\mu \mathbf{m}} = \int_{SO(3)} D^{\mathbf{l}}_{\mu \mathbf{m}}(Q)\, dQ. \qquad (2.14)
The \bar{D}^{\mathbf{l}}_{\mu \mathbf{m}} coefficients can be efficiently computed with a recursive formula involving Clebsch-Gordan coefficients [16]. Then we reduce this set to a basis by defining \tilde{U}^{\mathbf{l}}_{\mu i} and \tilde{n}_{\mathbf{l}} := \mathrm{rank}\, \bar{D}^{\mathbf{l}},
b_{\mathbf{l} i}(\hat{R}) := \sum_{\mu \in M_{\mathbf{l}}} \tilde{U}^{\mathbf{l}}_{\mu i}\, Y_{\mathbf{l}}^{\mu}(\hat{R}), \qquad i = 1, \dots, \tilde{n}_{\mathbf{l}}, \qquad (2.15)
\tilde{B}_{\mathbf{n}\mathbf{l} i}(R) := \sum_{\sigma \in S_N} \sum_{\mathbf{m} \in M_{\mathbf{l}}^{0}} \tilde{U}^{\mathbf{l}}_{\mathbf{m} i}\, (\phi_{\mathbf{n}\mathbf{l}\mathbf{m}} \circ \sigma)(R), \qquad i = 1, \dots, \tilde{n}_{\mathbf{l}}. \qquad (2.16)
However, these are not linearly independent. Therefore we define a new set of coefficients by diagonalizing the Gramian G^{\mathbf{n}\mathbf{l}}_{i,i'} = ⟨⟨\tilde{B}_{\mathbf{n}\mathbf{l} i}, \tilde{B}_{\mathbf{n}\mathbf{l} i'}⟩⟩ with respect to the abstract inner product ⟨⟨ϕ_{nlm}, ϕ_{n'l'm'}⟩⟩ := δ_{nn'} δ_{ll'} δ_{mm'}.
U^{\mathbf{n}\mathbf{l}}_{\mathbf{m} i} := \frac{1}{\Sigma_{ii}} \sum_{\alpha=1}^{\tilde{n}_{\mathbf{l}}} [V_{\alpha i}]^{*}\, \tilde{U}^{\mathbf{l}}_{\mathbf{m} \alpha}, \qquad i = 1, \dots, n_{\mathbf{n}\mathbf{l}}, \qquad (2.17)
where n_{\mathbf{n}\mathbf{l}} = \mathrm{rank}(G^{\mathbf{n}\mathbf{l}}) and we diagonalized G^{\mathbf{n}\mathbf{l}} = V \Sigma V^{T}. With these new coefficients we obtain
B_{\mathbf{n}\mathbf{l} i}(R) := \sum_{\mathbf{m} \in M_{\mathbf{l}}} \sum_{\sigma \in S_N} U^{\mathbf{n}\mathbf{l}}_{\mathbf{m} i}\, (\phi_{\mathbf{n}\mathbf{l}\mathbf{m}} \circ \sigma)(R). \qquad (2.18)
So far, we have only treated a single correlation order N . We now move back
to treating all atoms J by defining Bnli as
B_{\mathbf{n}\mathbf{l} i}(r_1, \dots, r_J) := \sum_{j_1 < j_2 < \dots < j_N} B_{\mathbf{n}\mathbf{l} i}(r_{j_1}, \dots, r_{j_N}). \qquad (2.19)
We currently have (2.19) scale as \binom{J}{N}, which is terribly inefficient. We will now leverage the tensor products to reduce the computational cost of our current basis. We start by completing the summation from \sum_{j_1 < j_2 < \dots < j_N} to \sum_{j_1, j_2, \dots, j_N} and use (2.18) to get
B_{\mathbf{n}\mathbf{l} i}(R) = \sum_{j_1, \dots, j_N} \sum_{\mathbf{m} \in M_{\mathbf{l}}} \sum_{\sigma \in S_N} U^{\mathbf{n}\mathbf{l}}_{\mathbf{m} i}\, \phi_{\mathbf{n}\mathbf{l}\mathbf{m}}(r_{j_{\sigma 1}}, \dots, r_{j_{\sigma N}}) \qquad (2.20)

= \sum_{\mathbf{m} \in M_{\mathbf{l}}} U^{\mathbf{n}\mathbf{l}}_{\mathbf{m} i} \sum_{j_1, \dots, j_N} \phi_{\mathbf{n}\mathbf{l}\mathbf{m}}(r_{j_1}, \dots, r_{j_N}) \qquad (2.21)

= \sum_{\mathbf{m} \in M_{\mathbf{l}}} U^{\mathbf{n}\mathbf{l}}_{\mathbf{m} i} \sum_{j_1, \dots, j_N = 1}^{J} \prod_{\alpha=1}^{N} \phi_{n_\alpha l_\alpha m_\alpha}(r_{j_\alpha}). \qquad (2.22)
Since we sum over tensor products, we may interchange the summations and the
product in the following way
\dots = \sum_{\mathbf{m} \in M_{\mathbf{l}}} U^{\mathbf{n}\mathbf{l}}_{\mathbf{m} i} \prod_{\alpha=1}^{N} \sum_{j=1}^{J} \phi_{n_\alpha l_\alpha m_\alpha}(r_j). \qquad (2.23)
B_{\mathbf{n}\mathbf{l} i}(R) =: \sum_{\mathbf{m} \in M_{\mathbf{l}}} U^{\mathbf{n}\mathbf{l}}_{\mathbf{m} i} \prod_{\alpha=1}^{N} A_{i\, n_\alpha l_\alpha m_\alpha}(R), \qquad (2.24)
with the atomic basis A_{i\, nlm}(R) := \sum_{j=1}^{J} \phi_{nlm}(r_j) and a product basis \mathbf{A}_{i\mathbf{n}\mathbf{l}\mathbf{m}}(R) := \prod_{\alpha=1}^{N} A_{i\, n_\alpha l_\alpha m_\alpha}(R). This avoids the N! cost of symmetrising the basis as well as the \binom{J}{N} cost of summation over all clusters in (2.19). We denote the resulting basis by
\mathbf{B}_N := \big\{ B_{\mathbf{n}\mathbf{l} i} \;\big|\; (\mathbf{n}, \mathbf{l}) \in \mathbb{N}^{2N}\ \mathrm{ordered},\ \textstyle\sum \mathbf{l}\ \mathrm{even},\ i = 1, \dots, n_{\mathbf{n}\mathbf{l}} \big\}, \qquad (2.25)

E(R_i) := \sum_{N=0}^{N} \mathbf{B}_N(R_i). \qquad (2.26)
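The computational payoff of (2.23)-(2.24) is that the neighbour sum is performed once per (n, l, m) channel and the correlations are then formed by cheap products. The following Julia sketch illustrates that evaluation order with a toy one-particle basis; the basis functions, channel list, and product specification are placeholders, not the spherical-harmonic basis used in ACE.jl.

using LinearAlgebra

# toy stand-in for ϕ_nlm(r) = P_n(r) Y_lm(r̂)
onepb(r, (n, l, m)) = norm(r)^n * cos(l * atan(r[2], r[1]))

# A_i,nlm = Σ_j ϕ_nlm(r_ij): one pass over the neighbours per channel
atomic_basis(Ri, specs) = [sum(onepb(r, v) for r in Ri) for v in specs]

# a correlation-order-N feature is a product of N entries of A
product_basis(A, idx) = prod(A[k] for k in idx)

Ri = [randn(3) for _ in 1:8]                        # neighbour positions r_ij
specs = [(0, 0, 0), (1, 0, 0), (1, 1, 1), (2, 1, 0)]
A = atomic_basis(Ri, specs)
B2 = product_basis(A, (2, 3))                       # one N = 2 correlation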
Finally, we choose a subset of (2.25) by further restricting (n, l, m). We define a maximal degree d ∈ ℝ and then choose all n_α and l_α such that
\sum_{\alpha} (n_\alpha + w_L l_\alpha) \leq d, \qquad (2.27)
where wL is a relative weighting of the angular and radial basis functions. Higher
wL leads to higher resolution in the radial component. With this choice of (n, l, m)
we define a set of basis functions B ∈ B and their coefficients cB as
E(R_i) =: \sum_{B \in \mathbf{B}} c_B B(R_i), \qquad (2.28)
E(R) = \sum_{i=1}^{J} \sum_{B \in \mathbf{B}} c_B B(R_i). \qquad (2.29)
Let κ be the number of basis functions, i.e. the length of B and the number of
parameters cB . We can then switch the summations and define a new basis function
B, but now over whole configurations, resulting in the linear model
E(R) = \sum_{B \in \mathbf{B}} c_B B(R) = \mathbf{c} \cdot \mathbf{B}(R). \qquad (2.30)
F(R) := \Big( -\frac{\partial E(R)}{\partial r_i} \Big)_{i=1}^{J}. \qquad (2.31)
2.3 Parameter estimation
\Psi \cdot \mathbf{c} :=
\begin{pmatrix}
\alpha_E B_1(R^{(1)}) & \dots & \alpha_E B_\kappa(R^{(1)}) \\
-\alpha_F \frac{\partial B_1(R^{(1)})}{\partial r_1} & \dots & -\alpha_F \frac{\partial B_\kappa(R^{(1)})}{\partial r_1} \\
\vdots & & \vdots \\
-\alpha_F \frac{\partial B_1(R^{(1)})}{\partial r_{J_1}} & \dots & -\alpha_F \frac{\partial B_\kappa(R^{(1)})}{\partial r_{J_1}} \\
\alpha_E B_1(R^{(2)}) & \dots & \alpha_E B_\kappa(R^{(2)}) \\
\vdots & & \vdots \\
-\alpha_F \frac{\partial B_1(R^{(T)})}{\partial r_{J_T}} & \dots & -\alpha_F \frac{\partial B_\kappa(R^{(T)})}{\partial r_{J_T}}
\end{pmatrix}
\begin{pmatrix} c_1 \\ \vdots \\ c_\kappa \end{pmatrix}
=
\begin{pmatrix}
\alpha_E E(R^{(1)}) \\ \alpha_F F(R^{(1)}) \\ \alpha_E E(R^{(2)}) \\ \vdots \\ \alpha_F F(R^{(T)})
\end{pmatrix}
\qquad (2.32)
where the length of \mathbf{B} is κ, i.e. the number of basis functions and parameters. We also define α_E, α_F ∈ ℝ as weightings to multiply the energies and forces in Ψ. Now define a set of DFT calculations to act as targets in the training. For each training configuration R^{(t)}, let y^{(t)} = (y_E^{(t)}, y_F^{(t)}) be the corresponding DFT-calculated energy and forces. Then let y = [y_E^{(1)}, y_F^{(1)}, y_E^{(2)}, ..., y_F^{(T)}]. We seek parameters c such that the loss
(RMSE) on a test set for both energies and forces separately. For energies, we use
\mathrm{RMSE}_E = \sqrt{ \frac{1}{T} \sum_{t=1}^{T} \frac{ \big(\Psi_E^{(t)} \cdot \mathbf{c} - y_E^{(t)}\big)^2 }{ J_t^2 } } \qquad (2.35)
where \Psi_E^{(t)} is the t-th row of a matrix Ψ_E containing only the energies, y_E^{(t)} ∈ y_E are the elements of a vector with only the DFT energies, and T is the total number of energies. For forces we use
\mathrm{RMSE}_F = \sqrt{ \frac{ \| \Psi_F \cdot \mathbf{c} - y_F \|_2^2 }{ 3 \|\mathbf{J}\|_1 } }, \qquad (2.36)
where similarly, ΨF is a matrix containing only the forces of Ψ and yF are the
forces in y.
2.4 Results
To test the ACE model, we use the data sets from [37], which contain molecular dynamics, elastic, surface, and vacancy structures for bulk silicon, copper, and molybdenum (as well as other materials). The data set contains training sets of sizes 214, 262, and 194, and test sets of sizes 25, 31, and 23, of DFT energies and forces, for Si, Cu, and Mo, respectively. It is worth mentioning that these data sets are quite small and should only be used as a proof of concept. Practical data sets
can be much larger (order of thousands of structures). It is also common to include
virials in ACE fits in addition to energies and forces.
We used the codes supplied by ACE1pack.jl [33]. There are several param-
eters one can modify to achieve better results, but in this work, we focus on (i) the
RRQR tolerance ϵ, (ii) the relative weighting of energies and forces αE /αF , and
(iii) the basis size κ. Further parameters such as the cutoff radius are taken from
[37]. We used LowRankApprox.jl [23] for the RRQR factorization, where we
can manually set the tolerance on the error of the factorization.
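The following Julia sketch shows the shape of the linear fit behind (2.32): per-configuration energy and force blocks are weighted and stacked into Ψ, and the coefficients are obtained from a least-squares solve. The plain backslash solve is a stand-in for the RRQR factorization from LowRankApprox.jl with tolerance ϵ, and all array contents here are random placeholders.

using LinearAlgebra

# rows_E[t]: κ-vector of basis values B_k(R⁽ᵗ⁾); rows_F[t]: (3J_t)×κ matrix of −∂B_k/∂r
function assemble(rows_E, rows_F, yE, yF; αE = 1.0, αF = 10.0)
    Ψ = vcat([vcat(αE .* rows_E[t]', αF .* rows_F[t]) for t in eachindex(rows_E)]...)
    y = vcat([vcat(αE * yE[t], αF .* yF[t]) for t in eachindex(yE)]...)
    return Ψ, y
end

κ, T, J = 20, 5, 4
rows_E = [randn(κ) for _ in 1:T]
rows_F = [randn(3J, κ) for _ in 1:T]
yE, yF = randn(T), [randn(3J) for _ in 1:T]

Ψ, y = assemble(rows_E, rows_F, yE, yF)
c = Ψ \ y                                  # stand-in for the RRQR solve with tolerance ϵ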
Since the data set is small and linear models are relatively fast, we generate a
heat map with ϵ and αE /αF as parameters. We set the correlation order to N = 3
and choose a basis size of κ = 964 to keep computational cost low. With higher
Figure 2.2: Log RMSE heat maps for energy and forces against ϵ and α_E/α_F (panels for Si, Cu, and Mo).
Table 2.1: RMSE for energy and forces. κ represents the basis size used for
that value.
basis sizes we could have overfitting as well as more costly simulations. We used
a cutoff radius of 5.5Å for silicon, 4.1Å for copper, and 5.2Å for molybdenum.
Figure 2.2 shows the log of the RMSE of the test set for energy and forces in a
heat map against the parameters ϵ and αE /αF . We see in Figure 2.2 that both low
ϵ and low αE /αF , as well as high ϵ and high αE /αF , give higher RMSE, likely
due to over-fitting. Therefore, we chose parameters closer to the middle: (ϵ = 10^{-7}, α_E/α_F = 10), (ϵ = 10^{-6}, α_E/α_F = 30), and (ϵ = 10^{-7}, α_E/α_F = 10) were chosen visually for silicon, copper, and molybdenum respectively.
Using these parameters, we calculated the RMSE of energy and forces for cor-
relation order N = 3 and increasing polynomial degrees, meaning increasing basis
sizes. Figure 2.3 shows the log scaled RMSE of the test set against the log scaled
basis size. The lowest RMSE for energy and for forces occurred at different κ. Therefore, we chose κ visually from Figure 2.3 to give both a low RMSE for energies and for forces; these results can be found in Table 2.1.
We can see in Figure 2.3 that for basis sizes below 1000, a larger basis size
leads to better accuracy, which we expect since increasing basis size increases the
number of parameters. However, for κ bigger than 1000, the RMSE for forces
increases again, likely due to over-fitting. Since we report results on the test data
set, increasing parameters without adjusting regularization overfits to the training
set, hence dropping performance on the test set. We likely only see this in the forces and not in the energy since Ψ contains significantly more entries for forces, which indicates we could use a larger α_E/α_F for larger κ.
Figure 2.3: Log RMSE vs. log basis size for energy and forces (panels for Si, Cu, and Mo).
Chapter 3
In this chapter we extend the models seen in Chapter 2 to allow for a nonlinear
composition of linear models. We describe the model as well as the efficient com-
putation of its derivatives. We present results based on the same data sets studied
in Chapter 2.
\varphi_i := \sum_{B \in \mathbf{B}} c_B B(R_i) = \mathbf{c} \cdot \mathbf{B}(R_i), \qquad (3.1)

E(R) = \sum_{i=1}^{J} E(R_i) := \sum_{i=1}^{J} \mathcal{F}\big(\varphi_i^{(1)}, \varphi_i^{(2)}, \dots, \varphi_i^{(P)}\big), \qquad (3.2)
where p ∈ {1, ..., P} indexes the atomic properties, and F : ℝ^P → ℝ is a nonlinear embedding function. The forces would be
F(R) := \Big( -\frac{\partial E(R)}{\partial r_i} \Big)_{i=1}^{J}, \qquad (3.3)
inspired by the Finnis-Sinclair [14] and embedded atom models, or its generaliza-
tion
\mathcal{F}\big(\varphi_i^{(1)}, \dots, \varphi_i^{(P)}\big) = \sum_{p=1}^{P} \big|\varphi_i^{(p)}\big|^{\alpha_p}, \qquad (3.5)
where {α_p}_{p=1}^{P} is a set of exponents. Finally, one could consider a neural network
parametrization of F, e.g.,
\mathcal{F}\big(\varphi_i^{(1)}, \dots, \varphi_i^{(P)}\big) = W_3\big(W_2\big(W_1 \varphi_i^{T} + b_1\big) + b_2\big) + b_3, \qquad (3.6)
a feed-forward neural network with 3 dense layers. In (3.6), φ_i^T = [φ_i^{(1)}, ..., φ_i^{(P)}]^T,
while W = [W1 , W2 , W3 ] and b = [b1 , b2 , b3 ] are the weights and biases of the
dense layers respectively. Notice that in (3.6) and (3.5) F has trainable parameters.
We denote the trainable parameters of a nonlinear model by θ = [θ_F, θ_φ], where F has parameters θ_F and φ_i has parameters θ_φ := [c^{(1)}, ..., c^{(P)}] for all i.
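As a sketch of the two kinds of embedding F used later, the snippet below defines a fixed Finnis-Sinclair-style function of two properties (the same form used in the example of Section 4.6) and a small trainable network of dense layers, assuming Flux.jl; the layer widths are illustrative and the activations are left as the identity, as in (3.6).

using Flux

P = 2
fs_embed(phi) = phi[1] - sqrt(abs(phi[2]) + 1/100) - 1/10       # fixed FS-style F, no θ_F

nn_embed = Chain(Dense(P, 10), Dense(10, 5), Dense(5, 1))        # trainable F as in (3.6)

phi = randn(Float32, P)         # stand-in for the atomic properties φ_i
fs_embed(phi), nn_embed(phi)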
3.1.1 Loss function
As in Section 2.3, we define a training set R = [R(1) , ..., R(T ) ], with Jt the number
of atoms in R(t) , and its respective DFT energies and forces Y = [y (1) , ..., y (T ) ],
with y^{(t)} = (y_E^{(t)}, y_F^{(t)}). Recall that y_E^{(t)} ∈ ℝ and y_F^{(t)} ∈ ℝ^{3×J}. We then define a
loss function to minimize as follows
\mathcal{L}(\mathbf{R}, Y) = \frac{ \sum_{t=1}^{T} L(R^{(t)}, y^{(t)}) }{ T } + \lambda \|\theta\|_2^2, \qquad (3.7)
L(R, y) = \alpha_E^2 \big(E(R) - y_E\big)^2 + \alpha_F^2 \sum_{i=1}^{J_t} |F_i - y_F|^2 =: L_E(R, y_E) + L_F(R, y_F), \qquad (3.8)
where R = {r1 , ..., rJ }, λ, αE and αF are hyperparameters. We define the loss
as an explicit function of the training set (R, Y ) with implicit dependence on the
trainable parameters θ. Therefore, we seek parameters θ such that
\frac{\partial \mathcal{L}(\mathbf{R}, Y)}{\partial \theta} = \frac{ \sum_{t=1}^{T} \frac{\partial L(R^{(t)}, y^{(t)})}{\partial \theta} }{ T } + 2\lambda\theta, \qquad (3.10)
where
\frac{\partial L(R, y)}{\partial \theta} = 2\alpha_E^2 \big(E(R) - y_E\big) \frac{\partial E(R)}{\partial \theta} + 2\alpha_F^2 \sum_{i=1}^{J_t} |F_i - y_F| \, \frac{\partial^2 E(R)}{\partial r_i \partial \theta}, \qquad (3.11)
where we used (3.3). A naive evaluation of the gradients is very costly due to the computation of ∂²E(R)/∂r_i∂θ. We will review an efficient way to evaluate it in Section
3.2 via backpropagation.
Similar to Section 2.3, we evaluate the performance of the model by measuring
the RMSE of the energy and the forces for a test set. For the energy we measure
\mathrm{RMSE}_E = \sqrt{ \frac{1}{T} \sum_{t=1}^{T} \frac{ \big(E(R^{(t)}) - y_E^{(t)}\big)^2 }{ J_t^2 } }, \qquad (3.12)
and for the forces
\mathrm{RMSE}_F = \sqrt{ \sum_{t=1}^{T} \frac{ \sum_{i=1}^{J_t} |F_i - y_F|^2 }{ 3 J_t } }. \qquad (3.13)
\frac{\partial L_F(R, y)}{\partial \theta} = 2\alpha_F^2 \sum_{i=1}^{J} |F_i - y_F| \, \frac{\partial^2 \mathcal{F}(\varphi_i)}{\partial r_{ij} \partial \theta}
= 2\alpha_F^2 \sum_{i=1}^{J} |F_i - y_F| \, \frac{\partial^2 \mathcal{F}}{\partial \varphi_i \partial \theta_{\mathcal{F}}} \, \frac{\partial^2 \varphi_i}{\partial r_{ij} \partial \theta_\varphi} \qquad (3.14)
=: \sum_{i=1}^{J} \omega_i \, \frac{\partial^2 \varphi_i}{\partial r_{ij} \partial \theta_\varphi},
where we defined ωi implicitly. Then using (2.28) and (2.24) we can compute
\frac{\partial \varphi_i^{(p)}}{\partial r_{ij}} = \sum_{N=0}^{N} \sum_{\mathbf{v}} \tilde{c}_{\mathbf{v}}^{(p)} \, \frac{\partial \mathbf{A}_{i\mathbf{v}}}{\partial r_{ij}} \qquad (3.15)
\frac{\partial \varphi_i^{(p)}}{\partial r_{ij}} = \sum_{\mathbf{v}} \tilde{c}_{\mathbf{v}}^{(p)} \, \frac{\partial}{\partial r_{ij}} \prod_{\alpha=1}^{N_{\mathbf{v}}} \sum_{s=1}^{J} \phi_{v_\alpha}(r_{is}) = \sum_{\mathbf{v}} \tilde{c}_{\mathbf{v}}^{(p)} \sum_{\alpha=1}^{N_{\mathbf{v}}} \Big( \prod_{s \neq \alpha} A_{i v_s} \Big) \nabla \phi_{v_\alpha}(r_{ij}). \qquad (3.16)
Expression (3.16) has cost equal to #c̃ × N² × J × P. We have our first cost
reduction by switching the order of the expression above.
\dots = \sum_{v} \nabla \phi_v(r_{ij}) \sum_{\mathbf{v}} \tilde{c}_{\mathbf{v}}^{(p)} \sum_{\alpha=1}^{N_{\mathbf{v}}} \delta_{v v_\alpha} \prod_{s \neq \alpha} A_{i v_s} =: \sum_{v} \nabla \phi_v(r_{ij}) \cdot \omega_v^{\phi}, \qquad (3.17)
\frac{\partial L_F(R, y)}{\partial \theta} = \sum_{i=1}^{J} \omega_i \sum_{v} \nabla \phi_v(r_{ij}) \, \frac{\partial \omega_v^{\phi}}{\partial \theta_\varphi}
= \sum_{v} \sum_{i=1}^{J} \omega_i \cdot \nabla \phi_v(r_{ij}) \, \frac{\partial \omega_v^{\phi}}{\partial [c^{(1)}, \dots, c^{(P)}]} \qquad (3.18)
=: \sum_{v} \omega_v' \, \frac{\partial \omega_v^{\phi}}{\partial [c^{(1)}, \dots, c^{(P)}]},
where we defined ωv′ to hold all the i dependence. The cost of evaluating ωv′ over
all v is J × #v. Then, using (3.19)
\dots = \sum_{v} \omega_v' \, \frac{\partial}{\partial [c^{(1)}, \dots, c^{(P)}]} \sum_{\mathbf{v}} \tilde{c}_{\mathbf{v}}^{(p)} \sum_{\alpha=1}^{N_{\mathbf{v}}} \delta_{v v_\alpha} \prod_{s \neq \alpha} A_{i v_s}
= \frac{\partial}{\partial [c^{(1)}, \dots, c^{(P)}]} \sum_{\mathbf{v}} \tilde{c}_{\mathbf{v}}^{(p)} \sum_{\alpha=1}^{N_{\mathbf{v}}} \sum_{v} \delta_{v v_\alpha} \, \omega_v' \prod_{s \neq \alpha} A_{i v_s} \qquad (3.19)
= \frac{\partial}{\partial [c^{(1)}, \dots, c^{(P)}]} \sum_{\mathbf{v}} \tilde{c}_{\mathbf{v}}^{(p)} \sum_{\alpha=1}^{N_{\mathbf{v}}} \omega_{v_\alpha}' \prod_{s \neq \alpha} A_{i v_s}
=: \frac{\partial}{\partial [c^{(1)}, \dots, c^{(P)}]} \sum_{\mathbf{v}} \tilde{c}_{\mathbf{v}}^{(p)} A_{\mathbf{v}}'
=: \frac{\partial}{\partial [c^{(1)}, \dots, c^{(P)}]} \, \tilde{\mathbf{c}}^{(p)} \cdot \mathbf{A}', \qquad (3.20)
and using (2.24)
= \frac{\partial}{\partial [c^{(1)}, \dots, c^{(P)}]} \, \mathbf{c}^{(p)} \cdot U \mathbf{A}'. \qquad (3.21)
The cost of computing A'_v is #c̃ × N² for all A' = {A'_v}_v, but it can be further reduced to N [26]. The final cost of the expression is cost(U × A') + #c̃ × N² × P + J × #v. The cost of U can be reduced further through symmetries, which make it a sparse matrix [16].
3.3 Results
Using the techniques mentioned in this chapter, we minimized equation (3.7) using two different embeddings F: (i) a Finnis-Sinclair-inspired model from [26], and (ii) three dense layers (equation 3.6). For both of them we used 2 properties with a basis size of κ = 489, which gave θ_φ ∈ ℝ^{978×2}.
(i) For the Finnis-Sinclair embedding we use a function very closely inspired
by what was used in the copper model in [26]. We defined F as:
[Figure 3.1: The dense feed-forward architecture used for embedding (ii).]
\mathcal{F}\big(\varphi_i^{(1)}, \varphi_i^{(2)}\big) = \varphi_i^{(1)} + \mathrm{sign}\big(\varphi_i^{(2)}\big) \sqrt{ \big|\varphi_i^{(2)}\big| + \frac{e^{-|\varphi_i^{(2)}|}}{4} } - \frac{e^{-|\varphi_i^{(2)}|}}{2}. \qquad (3.22)
(ii) For the dense layers we used the architecture in Figure 3.1 and (3.6). In our case W_1 ∈ ℝ^{2×10}, W_2 ∈ ℝ^{10×5}, W_3 ∈ ℝ^{5×1}, b_1 ∈ ℝ^{10}, b_2 ∈ ℝ^{5}, and b_3 ∈ ℝ^{1}. This gives θ_F = [W_1, b_1, W_2, b_2, W_3, b_3].
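As reconstructed above, (3.22) translates directly into a one-line Julia function; this is only a transcription of the embedding, with the two atomic properties passed as scalars.

F_fs(φ1, φ2) = φ1 + sign(φ2) * sqrt(abs(φ2) + exp(-abs(φ2)) / 4) - exp(-abs(φ2)) / 2

F_fs(0.3, -1.2)     # site-energy contribution for one pair of properties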
We solve (3.9) using the same data sets and cutoff radii as we did for the linear model, and correlation order N = 3. We used BFGS [31] on (3.7) for 500 iterations. Usually, models would be run for more iterations, but we chose 500 as a proof of concept. We empirically chose α_E/α_F = 1/10 for all models and λ = {10^{-7}, 10^{-6}, 10^{-7}} for Si, Cu, and Mo respectively.
We present 6 plots in Figure 3.2, where we plot the log of the RMSE for energy and forces for both embeddings. We compare them against the best linear RMSE in Table 2.1. For all materials we beat the best linear forces, but we do not beat the energies. There is similar performance for the two embeddings, except for the forces of copper, where we see (ii) converge faster, and for the forces of molybdenum, where (ii) retains a low RMSE while (i) seems to over-fit.
Figure 3.2: Log RMSE vs. iterations for energy and forces (panels for Si, Cu, and Mo).
These results should serve as a proof of concept. However, more work is
needed for our current implementation to be competitive.
Chapter 4
Implementation
4.1 List of neighbours

f(R_i) works by finding a list of neighbours of atom r_i. Figure 4.1 shows an exam-
ple of f (Ri ) for two different central atoms i with J = 14 and cutoff radius rcut .
We can then define f (R) = [f1 (R), ..., fJ (R)], which returns the neighbour lists
for all atoms. With this new definition, we implement
E(R) = \sum_{i=1}^{J} \mathcal{F}\big(\varphi(f_i(R))\big), \qquad (4.2)
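A minimal Julia sketch of the neighbour map f_i is shown below: it collects the relative positions r_ij = r_j − r_i inside the cutoff. The naive O(J²) search is for illustration only; the production code relies on JuLIP.jl's neighbour-list machinery instead.

using LinearAlgebra

function neighbours(R, i, rcut)
    return [R[j] - R[i] for j in eachindex(R) if j != i && norm(R[j] - R[i]) < rcut]
end

R = [rand(3) .* 10 for _ in 1:14]      # 14 atoms, as in Figure 4.1
Ri = neighbours(R, 1, 5.5)             # the environment f_i(R) of atom 1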
Figure 4.1: Example situation of f_i for 14 atoms. The left and the right represent f_i(R) for two different atoms i. The number of atoms n inside the cutoff radius can be (and usually is) different for different i.
with φ({r_ij}_{r_ij < r_cut}) = [φ_i^{(1)}, ..., φ_i^{(P)}].
\omega_\theta(g, h) := g \, \frac{\partial h}{\partial \theta}. \qquad (4.4)
Forward mode differentiation defines a JVP for each function to differentiate. We also call JVPs push-forwards to be consistent with the naming convention in the Julia package ChainRules.jl. We can represent the computation of (4.3) in the following order
w_{f_i} := \frac{\partial f_i(R)}{\partial r_{ij}} \qquad (4.5)
w_\varphi := \frac{\partial \varphi}{\partial f_i} \, w_{f_i} \qquad (4.6)
w_{\mathcal{F}} := \frac{\partial \mathcal{F}}{\partial \varphi} \, w_\varphi \qquad (4.7)
w_E := \frac{\partial E(R)}{\partial \mathcal{F}} \, w_{\mathcal{F}} \qquad (4.8)
\frac{\partial E(R)}{\partial r_{ij}} := 1 \cdot w_E \qquad (4.9)
where all w’s are placeholders for numerical values, not symbolic expressions.
Then we can represent the same computation as the composition of push-forwards,
\frac{\partial E(R)}{\partial r_{ij}} = \omega_{r_{ij}}\Big( \omega_{f_i}\big( \omega_\varphi\big( \omega_{\mathcal{F}}(1, E), \mathcal{F}\big), \varphi\big), f_i\Big). \qquad (4.10)
For reverse mode differentiation we further define functions
\omega_\theta^{T}(g, h) := g \, \Big(\frac{\partial h}{\partial \theta}\Big)^{T} \qquad (4.11)
to carry out the Jacobian-transpose vector product (JᵀVP), and call them pullbacks. The naming convention comes from the Julia package ChainRules.jl and has a definition broadly in agreement with their use in differential geometry [36]. We will continue to use the word "pullback" throughout this chapter, and also its ChainRules.jl functional implementation name "rrule". Let us look at the same example, but now with reverse mode differentiation. We compute the derivative in a "reverse" order
w_E(1) := 1 \cdot \frac{\partial E}{\partial E} \qquad (4.12)
w_{\mathcal{F}}(w_E) := w_E \, \frac{\partial E}{\partial \mathcal{F}} \qquad (4.13)
w_\varphi(w_{\mathcal{F}}) := w_{\mathcal{F}} \, \frac{\partial \mathcal{F}}{\partial \varphi} \qquad (4.14)
w_{f_i}(w_\varphi) := w_\varphi \, \frac{\partial \varphi}{\partial f_i} \qquad (4.15)
\frac{\partial E(R)}{\partial r_{ij}} = \omega_{\mathcal{F}}^{T}\Big( \omega_\varphi^{T}\big( \omega_{f_i}^{T}\big( \omega_{r_{ij}}^{T}(1, f_i), \varphi\big), \mathcal{F}\big), E\Big). \qquad (4.17)
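To make the pullback machinery concrete, here is a minimal sketch of a custom rrule in the ChainRulesCore.jl style for a toy scalar function; the actual rrules in our codes wrap the ACE basis evaluation, so this only illustrates the mechanism of returning a forward value together with its pullback.

using ChainRulesCore

sq_norm(x) = sum(abs2, x)              # toy stand-in for a site function

function ChainRulesCore.rrule(::typeof(sq_norm), x)
    y = sq_norm(x)                                      # forward pass
    sq_norm_pullback(ȳ) = (NoTangent(), 2 .* x .* ȳ)    # JᵀVP: ∂y/∂x = 2x
    return y, sq_norm_pullback
end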
In our implementation we use reverse mode differentiation, and in Section 4.4.3
we will see how the pullbacks were implemented. We used reverse mode differen-
tiation to allow for the computation shown in Section 3.2. As an example, let us
consider (3.14). We start with
4.3 Introduction to the code base
We first introduce the most relevant repositories and their functionality. Most of
these can be found under the Github organization ACEsuit [10].
– Provides ”rrules” which are custom pullback functions for each func-
tion we want to differentiate.
– We coded the optimized differentiation (Section 3.2) by creating cus-
tom pullbacks ω within this package.
4.4 Implementing nonlinear combinations
ACEflux.jl is a wrapper around ACE.jl and JuLIP.jl to allow compatibil-
ity with Flux.jl. This bridge allows us to leverage Flux.jl layers to generate
and compose nonlinear functions on an ACE model φi . However, as we will see in
this section, there were several caveats and issues when bringing all the packages
together. The package is now operational but limited in what it supports. New ef-
forts are now placed into making ACE.jl more general by splitting φi into several
layers (the one-particle basis, the product basis, and the symmetric basis). This new model is outside the scope of the current work, but we will briefly overview it in Chapter 5.
forward pass of this layer to simply call evaluate() on a configuration f (Ri ). This
layer is then equivalent to φi . It can be composed with a nonlinearity F to produce
site energies. F can be a user specified function or could be a Flux.jl layer. We
can then define F ◦ φ ◦ f through Chain(), a composition function in Flux.jl.
This creates a structure that holds θ and evaluates E. We usually call this object a
model. The gradients are then computed by taking the derivative of model() with
respect to {ri }Ji=1 .
To obtain total energy E and forces F we rely on two functions from JuLIP.jl,
energy() and forces(). These were extended in ACEflux.jl to support a FluxPotential, which is a structure containing a model and a cutoff radius r_cut. energy() is (4.2), with x → F(φ(x)) being model() and f using the r_cut saved in the FluxPotential. forces() uses the pullback of f to add and subtract contributions of the gradients of each atom according to the following equation:
F_i = -\frac{\partial E(R)}{\partial r_i} + \sum_{\{j \,|\, r_{ij} < r_{\mathrm{cut}}\}} \frac{\partial E(R)}{\partial r_j}, \qquad (4.20)
where we use the negative gradient centered at i, like before, but now we add the contributions of the gradients of atoms inside the atomic neighbourhood {j | r_ij < r_cut, 1 ≤ j ≤ J}. The set of neighbours is computed with a function similar to f, and the gradients are computed using Zygote.jl.
These functions can then be placed inside a loss function to optimize. In our
case, the force contribution to the forward pass for (3.8) is given as follows:

αF * sum(sum((forces(pot, R) - yF).^2))
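Since the energy part of the expression is cut off here, the following hedged Julia sketch assembles the full loss (3.8) from energy() and forces(); it is a reconstruction from (3.8), not the thesis code for (4.21) verbatim, and the weights are placeholders.

# y = (yE, yF): reference energy and per-atom reference forces for configuration R
function loss(pot, R, y; αE = 1.0, αF = 10.0)
    yE, yF = y
    LE = αE * (energy(pot, R) - yE)^2
    LF = αF * sum(sum(abs2, f - yf) for (f, yf) in zip(forces(pot, R), yF))
    return LE + LF
end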
4.4.3 Making models differentiable

Zygote.jl is our differentiation framework. Simply put, ChainRules.jl stores pullbacks for dif-
ferent functions (called rrules) and Zygote.jl reads a function and calls the
necessary pullbacks in the proper order. However, Zygote.jl has reduced per-
formance when differentiating complex functionals, for example φi , which makes
differentiation slow and, in some rare cases, impossible. Therefore, we imple-
mented custom pullbacks. Defining pullback functions with efficient gradient com-
putations allows our codes to be fully differentiable efficiently.
An rrule is a ChainRules.jl function that dispatches on the type of h and computes both the value of h(θ) (the forward pass) and the pullback ω_θ^T(g, h). We had to define a custom rrule for the derivative of the forces with respect to the parameters, since Zygote.jl does not support second derivatives. We defined a helper function adj_evaluate as the pullback of Linear_ACE. Then we defined two rrules: one that simply calls adj_evaluate, and one that computes its pullback. This second rrule returns the second-order derivatives.
Using Zygote.jl provides great flexibility. Even though we provide some
custom pullbacks, Zygote.jl has built-in functionality to differentiate a great
variety of functions. This allows us to compose several functions on top of energy() and forces(), without defining rrules for them, which allows users to define
the loss functions without having to implement their derivatives. Furthermore, the
rrules that we define allow us to optimize the derivatives. For example, in Section
3.2, we saw how one could implement efficient derivatives of the forces, and by
defining the rrule of forces() explicitly, we ensure efficient computation.
While Zygote.jl is crucial for differentiation and training of parameters,
Flux.jl is not. Currently Flux.jl manages the parameters and creates the
Linear_ACE layer. However, one could manage the parameters manually with get_params() and create Linear_ACE as a simple structure. The original motiva-
tion to make ACE.jl compatible with Flux.jl was to allow for the immediate
use of machine learning utilities in our models, which would allow us to implement
models with θF . At the time, it seemed that moulding our codes to fit this frame-
work would allow for great flexibility and connection to well tested and maintained
codes. However, in hindsight, these same features could have been implemented in
comparable time without the use of Flux.jl. As we will see in the next chapter,
the current path of ACE models is to allow for even more general nonlinearities.
With this goal in mind, it is likely that the framework enforced by Flux.jl will
be too restrictive. Nonetheless, these ideas are still novel, and there is currently no
clear best path.
4.5 Training
4.5.1 Multiprocessing
To speed up simulations, we parallelize computations among different workers.
Since most of the cost of computation is the gradient evaluation, we decided to par-
allelize across training points (R, y) ∈ (R, Y ). Similarly, the time spent sending
and retrieving data from available cores is small compared to the time of a gradient
evaluation. Therefore we settled on a multiprocessing implementation, in contrast
to multithreading. We also evaluated using a GPU to perform optimization, how-
ever, our matrices are very sparse because of the symmetries on the (n, l, m), and there is not enough support for sparse GPU operations in Julia.
We begin by dividing the training data (R, Y ) into subsets (Rρ , Yρ ) where
ρ ∈ [1, 2, ..., ρc ] and ρc is the number of processes available minus 1. This is so that
we have a main process to feed the others and take optimization steps. The main
process constructs a model, and sets random starting parameters θ. This model is
then shared with all cores ρ, along with their corresponding (R_ρ, Y_ρ). The simulation
then starts by sharing a starting θ0 with all the processes, which subsequently set
their local model’s parameters to θ0 . Once done, each process computes its piece
of the loss function and its gradient. Then all processes send their current losses
and gradients to the main process, which adds them together. This is equivalent to
splitting the summation over T in (3.7) and (3.10) into ρ_c parts, and then adding them
together. The main process then adds the regularization and its gradient to create
the total loss and gradient. These are then used to take an optimization step, and
progress is logged into JLD files through JLD.jl.
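A hedged sketch of this scheme using Distributed.jl is shown below; local_loss_and_grad is a hypothetical per-worker function standing in for the local loss and gradient evaluation, and the chunking mirrors the split of the sum over T in (3.7) and (3.10).

using Distributed
addprocs(3)

@everywhere local_loss_and_grad(θ, chunk) =
    (sum(abs2, θ) * length(chunk), 2 .* θ .* length(chunk))   # placeholder worker

chunks = [1:5, 6:10, 11:15]           # training-configuration indices per worker
θ, λ, T = randn(10), 1e-6, 15

results = pmap(ρ -> local_loss_and_grad(θ, chunks[ρ]), 1:length(chunks))
L = sum(first.(results)) / T + λ * sum(abs2, θ)      # total loss, cf. (3.7)
g = sum(last.(results)) ./ T .+ 2λ .* θ              # total gradient, cf. (3.10)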
versions of the ADAM algorithm. However, we wanted to use BFGS to optimize
the Finnis-Sinclair model, and this method is not contained in Flux. Optim.jl
offers BFGS, but we cannot simply plug our codes into it because of the non-
standard structures in which Zygote.jl stores gradients. When taking gradients
according to the attributes of our layer, Zygote.jl returns a Grads(...) object,
which is a dictionary with the parameters as keys and the gradients as values, but
Optim.jl expects the gradients in a flattened array. In order to make this work,
we defined flattening and reshaping functions for both the gradients and the pa-
rameters. The main process performs these operations before and after taking an
optimization step. We could theoretically use any optimizer in a similar fashion, as
long as we can reshape the parameters and gradients accordingly.
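The following sketch shows one way to do this bridging, assuming Flux.destructure for the flatten/rebuild step and Optim.jl's BFGS; the model and objective are toy placeholders rather than the thesis training driver.

using Flux, Optim

model = Chain(Dense(2, 10), Dense(10, 5), Dense(5, 1))
θ0, rebuild = Flux.destructure(model)            # flat parameter vector + reshaper

x, y = randn(Float32, 2, 16), randn(Float32, 1, 16)
f(θ) = sum(abs2, rebuild(θ)(x) .- y)             # toy objective standing in for (3.7)
g!(G, θ) = copyto!(G, Flux.gradient(f, θ)[1])    # gradient in the flat layout

res = Optim.optimize(f, g!, θ0, BFGS(), Optim.Options(iterations = 500))
θopt = Optim.minimizer(res)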
4.6 Example
We will now go over an example of the functionality of ACEflux.jl. We begin
by importing the necessary packages.
using ACE, ACEflux, Zygote, Flux, StaticArrays, ASE, JuLIP
using Zygote: gradient
Recall that a nonlinear model is of the form (4.2), then the function call to construct
a model for φi is
phi = Linear_ACE(max_deg, cor_order, num_props)

where Linear_ACE() takes the maximum polynomial degree, the correlation order N, and the number of properties P to evaluate. Calling phi(R_i) will return P atomic properties. To add a nonlinearity, for example a Finnis-
Sinclair-like embedding, we simply compose the nonlinearity with phi(). We use Flux.jl's Chain():
FS(phi) = phi[1] - sqrt(abs(phi[2]) + 1/100) - 1/10
E_i = Chain(phi, GenLayer(FS))
where GenLayer creates a Flux.jl structure to surround the nonlinear embed-
ding. E_i now represents the function F ∘ φ_i. We can do more complex embeddings
with trainable parameters:
E_i = Chain(phi, Dense(2,7), Dense(7,2), GenLayer(FS), sum)
where we use several Dense layers and a Finnis-Sinclair model. To compute gradi-
ents, we simply need to call the gradient() function in Zygote.jl. There are two
ways to call this function, with explicit and with implicit parameters. The syntax is
a little confusing at first, but the implicit parameters section in Zygote’s docu-
mentation is very helpful [28]. For an explicit call, we simply call gradient(function, arguments) with the function and the arguments we want to differentiate with respect to:
g_configs = gradient(E_i, R_i)
In this case, we are differentiating site energy with respect to configurations, which
we use to calculate forces.
Now, to get a gradient with respect to the parameters, we need to do it implicitly
because the parameters are defined implicitly in a Flux.jl layer. This is identical
to the way Flux.jl is differentiated, so their documentation could prove helpful
[30]. The function params() is Flux.jl native and allows us to extract the parame-
ters of the model. To get the derivative of the site energy with respect to its parameters, we would call:
g_params = gradient(() -> E_i(R_i), params(E_i))
E_i() and g_configs() can be combined into a loss function or composed with
more complex functions. The advantage of utilizing Zygote.jl and Flux.jl
is that all these outer functions can be differentiated out of the box. However,
Zygote.jl will still call our custom pullbacks when necessary, meaning the
derivative will leverage the efficient adjoints. However, Zygote.jl does not dif-
ferentiate functions with object mutation, so keep this in mind when creating these
functions.
Even though these functions exist, the goal is not to differentiate site energies by hand, but rather to call energy() and forces(), and for that we need to create a
FluxPotential(model, cutoff):

pot = FluxPotential(E_i, 6.0)
We can now evaluate the energy and forces with our potential. This is the same
syntax JuLIP and ACEatoms use.
e = energy(pot, at)
f = forces(pot, at)
Now we can create a loss function with e and f as in (4.21) and compute its deriva-
tive via
gl = gradient(() -> loss(pot, R, y), params(E_i))
Once we have this, we can flatten the gradients as mentioned in Section 4.5.2 and plug them into any optimizer. For the results in Section 3.3 we used BFGS in Optim.jl and used ACEatoms.jl to create R and y from the imported data sets [37].
We implemented multiprocessing by calling @spawnat and @everywhere from
Distributed.jl.
Chapter 5
5.1 Conclusion
Machine learned interatomic potentials continue to be a hot area of research, and
nonlinear models might be crucial in the future. In this thesis, we provided an
overview of the atomic cluster expansion as an atomic descriptor. We showcased
results on the test data sets [37] for silicon, copper and molybdenum. To that
end, we explored the parameter space of the weighting of energy versus forces
and the tolerance for RRQR. Furthermore, we presented accuracy as a function of
the number of parameters κ. We then extended this model by composing several
linear models inside a nonlinearity. We demonstrated efficient evaluation of the
gradients and comprehensively explained the Julia implementation of the models.
We showed results for silicon, copper and molybdenum as a proof of concept. The
codes are still in the experimental phase, and more work is needed for them to be
competitive.
5.2 Outlook
In chapter 4 we went over the current implementation of nonlinear ACE models
in Julia, which are still experimental. Much of the code had to be implemented
manually, especially the gradients, which made our implementation efficient and
flexible. In the future, we strive to capitalize on the generality of ACE.jl, rather
than focusing on a specialized type of model. Moving in this direction, the im-
plementation of gradients through Zygote.jl will stay as the primary way to
differentiate models, but the Flux.jl wrapper is subject to change. This is be-
cause a Linear ACE layer is quite a restrictive structure. We treat the computation
of the basis B as a black box. This allows for the nonlinearities implemented in this
thesis but restricts the implementation of other types of physics-inspired nonlinear-
ities. A few examples are (i) parametrization of the radial basis, (ii) composition
of layers, and (iii) changing architecture.
To understand how these nonlinearities would work, we need to consider the
basis evaluation in layers. Let us start with the input layer R = {R^{(1)}, ..., R^{(T)}}, where we have T atomic environments defined by their atomic positions R^{(t)} = {r_1, ..., r_J}. One can extend the set R^{(t)} to hold more properties. In fact, this is already implemented in ACE.jl. A user can define an input
where Wk can be other atomic features like magnetic properties, spin or even the
output of another nonlinear ACE model. The treatment of features Wk would have
to be defined in each layer and is already being implemented.
The input layer is then fed into a one-particle basis ϕ using f. We can compare the action of f to a kernel (sometimes called a filter) in a convolutional neural network (Figure 5.1). ϕ is then passed to an atomic basis A_{inlm}, which gives the product basis 𝐀_{inlm}, and finally an atomic property φ_i. We do this for every node i in the J_t starting nodes for every R^{(t)} in the training set. This is finally fed into a nonlinearity, and then everything is summed over to generate an energy (Figure 5.2).
To calculate the forces, we would need to backpropagate and differentiate with respect to r_ij. Then we can combine the forward pass and the backward pass in a loss
function to train energies and forces. This implementation is more flexible since
we now have every computation step as a standalone layer. For example, to imple-
ment (i), one simply needs to create a parametric structure to substitute ϕ, where
P contains parameters to train. For (ii), we would simply need to compose the
structures, and similarly in (iii). With all the steps defined as layers, we can com-
Figure 5.1: Example of f on a configuration for two atoms i = {7, 4}. On the right we see the atomic environment, and on the left we see the action of f on the input layer.
[Figure 5.2: Evaluation layers: one-particle basis ϕ_nlm = P_n Y_lm → atomic basis A_{inlm} → product basis 𝐀_{inlm} → atomic property φ_i.]
pose them however we want. We could compose several atomic properties φi , or
neural networks, or even switch layers like the one-particle basis for other descrip-
tors. This new framework could be extremely general, but will require changes
in the current Julia packages. We will need to implement the different layers as
structures as well as a way to manage their parameters and the required pullbacks
to differentiate through them with Zygote.jl.
Bibliography
[8] A. R. Christoph Ortner. ACEflux.jl. https://github.com/ACEsuit/ACEflux.jl, 2022. → page 36
[9] D. P. K. Christoph Ortner. ACEatoms.jl. https://github.com/ACEsuit/ACEatoms.jl, 2022. → page 35
[20] M. Hellström, V. Quaranta, and J. Behler. One-dimensional vs.
two-dimensional proton transport processes at solid–liquid zinc-oxide–water
interfaces. Chemical Science, 10(4):1232–1243, 2019.
doi:10.1039/c8sc03033b. → page 1
[24] B.-J. Lee, W.-S. Ko, H.-K. Kim, and E.-H. Kim. The modified
embedded-atom method interatomic potentials and recent progress in
atomistic simulations. Calphad, 34(4):510–522, 2010. ISSN 0364-5916.
doi:10.1016/j.calphad.2010.10.007. URL
https://www.sciencedirect.com/science/article/pii/S0364591610000817. →
page 5
[25] R. LeSar. Introduction to computational materials science: Fundamentals to
applications. Cambridge University Press, 2016. → pages 3, 4, 5
[26] Y. Lysogorskiy, C. van der Oord, A. Bochkarev, S. Menon, M. Rinaldi,
T. Hammerschmidt, M. Mrovec, A. Thompson, G. Csányi, C. Ortner, and
R. Drautz. Performant implementation of the atomic cluster expansion
(pace): Application to copper and silicon, 2021. → page 27
[27] M. J. Innes, M. Abbott, et al. Zygote.jl. https://github.com/FluxML/Zygote.jl, 2022. → page 36
[31] J. Nocedal and S. J. Wright. Numerical optimization. Springer, 2006. →
page 28