Atomistic Graph Networks For Experimental Materials Property Prediction
Tian Xie¹,²,*, Victor Bapst¹,*, Alexander L. Gaunt¹, Annette Obika¹, Trevor Back¹, Demis Hassabis¹, Pushmeet Kohli¹ and James Kirkpatrick¹

* Equal contribution
¹ DeepMind, London, UK
² Department of Materials Science and Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA

arXiv:2103.13795v1 [cond-mat.mtrl-sci] 25 Mar 2021
Machine Learning (ML) has the potential to accelerate discovery of new materials and shed
light on useful properties of existing materials. A key difficulty when applying ML in Ma-
terials Science is that experimental datasets of material properties tend to be small. In this
work we show how material descriptors can be learned from the structures present in large
scale datasets of material simulations; and how these descriptors can be used to improve the
prediction of an experimental property, the energy of formation of a solid. The material de-
scriptors are learned by training a Graph Neural Network to regress simulated formation
energies from a material’s atomistic structure. Using these learned features for experimental
property predictions outperforms existing methods that are based solely on chemical compo-
sition. Moreover, we find that the advantage of our approach increases as the generalization
requirements of the task are made more stringent, for example when limiting the amount of
training data or when generalizing to unseen chemical spaces.
1 Introduction
Deep learning has established itself as the dominant method for image recognition1 , speech
recognition2 , machine translation3 and planning4 . More recently, deep learning based methods have
been improving the state of the art in physical models of the natural world5 , for example in protein
folding6 . The key to deep learning’s success in all these domains is that the composition of several
layers of a neural network extract highly expressive and abstract features from data. The downside
of such learning approaches is that plentiful data or accurate simulators are required to achieve the
impressive milestones cited above. Applying such methods to accelerate materials discovery7 is
limited by the fact that high quality experimental data requires expensive laboratory measurements,
meaning datasets usually only cover limited material diversity8 . An alternative source of data is
quantum mechanical simulation (e.g. density functional theory (DFT)), which has already been used
to create datasets of material properties covering a significantly larger material space. However, DFT
methods have large systematic errors9, 10 due to the limitations of the underlying density functional
approximations11, 12 as well as other effects like the 0 K approximation13 . The challenge is therefore
to leverage the information in large simulation databases to learn a transferable embedding that
enables accurate experimental property prediction even from small amounts of experimental data.
Considerable advances have been made in demonstrating that neural networks and other
machine learning models can regress properties computed from DFT, both in the context of organic
molecular systems14–17 and for bulk materials18–23 . The accuracy of such networks depends crucially
on the choice of architecture and on the representation of the inputs. In the case of bulk materials,
some models, like ElemNet21, represent materials by their chemical composition alone. However, they are unable to distinguish different phases of a material, where the atomistic structure is different but the chemical composition is the same. Alternatively, approaches based on graph
networks24 are capable of using the material’s atomistic structure as the input representation. Since
the unit cell structure of a material underpins its quantum mechanical properties, the power of such
structured models is that they instill physically plausible inductive biases in the network25 , which
explains why these models can achieve high accuracy22, 23 . Physically motivated inductive biases
also allow networks to be trained more effectively with small amounts of data.
In this work, we develop an atomistic graph neural network that learns a transferable embedding of materials from DFT calculations to improve predictions on experimental properties. Recently, Jha et al.26 showed such improvements by pre-training ElemNet, which learns only from the composition of materials, on large simulation datasets. By learning from the atomistic
structures that underpin these simulated values, our approach can more accurately predict experi-
mental quantities, even when trained on small amounts of data. Specifically, we implement these
ideas in a practical algorithm to regress experimental energies of formation (EOF)¹. We demon-
strate that our network architecture outperforms models that do not use structural information, and
furthermore show that it is more data efficient, meaning that the test error degrades in a less pro-
nounced way when limiting the training set or splitting training and testing data in more challenging
ways. A major difficulty with learning from atomistic structure is that structural information is
often not recorded in experimental datasets due to the additional experimental cost of collecting
accurate material structures. To overcome this we learn an embedding of atomistic structures into
a continuous vector space using simulated structure-property pair data. The embedding for any
unknown experimental structure can be obtained by interpolation in embedding space of known
nearby structures using an interpolation scheme inspired by the convex hull decomposition in
compositional phase diagrams. Specifically, our network is composed of a graph network ‘trunk’, which computes an embedding of a material structure, and two ‘heads’: one to regress experimental
values and another for simulation values. Knowledge is transferred from the simulation data to
the experiment head by the shared embedding trunk: the trunk and the simulation head are trained
on simulation data to learn structure-property relations similar to refs.22, 23 , and the trunk and the
experimental head are then fine-tuned on experimental data.
We find that incorporating structural information allows us to achieve a new state of the art
mean absolute error (MAE) of 0.059 ± 0.004 eV/atom when predicting the formation energy on an
experimental dataset of 1963 compounds. Importantly, the method still generalizes well in the small
training data regime compared to the baseline which uses a simple multilayer perceptron (MLP) as
¹ Experiments typically measure finite temperature enthalpies while simulations focus on zero temperature energies; for simplicity we refer to both as energies in this work.
a trunk and does not include structural information. When training on only 157 experimental points
our structured approach achieves 17% lower MAE than the previous state-of-the-art method26. We also find improved generalization ability to new chemical spaces, achieving
31% lower MAE when holding out copper from the experimental training set. Finally, although
trained on EOF, we demonstrate that the model provides estimates for the decomposition energy (i.e.
the energy of formation difference with respect to all the other compounds in the chemical space),
which are comparable in accuracy to values computed from simulation27 .
Figure 1: Illustration of the transfer learning process. a) The two families of networks compared in this work. Structured approaches (AGN) use a graph representation of a crystal to compute an embedding of the compound. Unstructured approaches (MLP) use a vector representation of a chemical formula to compute an embedding of the compound. b) For compounds in the simulation
dataset the simulation head computes a prediction for the simulated property from the embedding.
For compounds in the experimental dataset that have a stable correspondence in the simulation
dataset, the experimental head computes an experimental energy from the embedding. c) For the
experimental compounds that do not have a correspondence in the simulation dataset, a phase
diagram is first constructed using the simulated materials (black dots) and their formation energies.
The simulated materials are fed into a graph network, and a representation for the experimental material is synthesized as a weighted average of the embeddings of its neighbors in the phase diagram. This is used to predict the experimental property.
Results
Atomistic graph network Our aim is to predict experimental formation energies eexp (x) for
compounds x in an experimental dataset E by leveraging the information in another dataset S
containing simulated data. This represents a scenario where researchers hope to experimentally
measure the formation energies of all compounds in E, and our goal is to accelerate this process
by training an ML model that utilizes a large open simulation dataset S and a few experimental
measurements on some compounds in E.
The simulation dataset S contains calculations of various physical quantities as well as the
physical structure of the material. To leverage both the structure and property information, for
every example y ∈ S, we can construct a graph representation g(y) of the corresponding crystal by connecting atoms within a certain threshold distance with edges, and adding the atomic number as
node features and distances as edge features22 . This allows us to feed y into a graph network24, 25 T
in order to learn an embedding vT (y) = T (g(y)). Note that a recent work, MEGNet23, also uses
graph networks to encode material structures. Supervision for this embedding comes from passing
vT (y) through the simulation head MLP, Hsim , to produce a predicted energy êsim (y) and regressing it towards the simulated energy esim (y). The learned embedding based on the physical structure
of material y is the key element that will allow us to transfer from simulation data to experimental
data.
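As an illustration, here is a minimal sketch of this graph construction using pymatgen (the 7 Å cutoff and cap of 12 neighbors follow the Methods section; the function name and output format are ours, and we assume a recent pymatgen where neighbors expose `index` and `nn_distance`):

```python
import numpy as np
from pymatgen.core import Structure

def crystal_to_graph(structure: Structure, cutoff: float = 7.0, max_neighbors: int = 12):
    """Build node/edge arrays for a crystal graph g(y) from a unit cell."""
    nodes = np.array([site.specie.Z for site in structure])  # atomic numbers as node features
    senders, receivers, distances = [], [], []
    for i, neighbors in enumerate(structure.get_all_neighbors(cutoff)):
        # keep at most `max_neighbors` outgoing edges per node, closest first
        for nn in sorted(neighbors, key=lambda n: n.nn_distance)[:max_neighbors]:
            senders.append(i)
            receivers.append(nn.index)
            distances.append(nn.nn_distance)  # interatomic distance as edge feature
    return nodes, np.array(senders), np.array(receivers), np.array(distances)
```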
When we transfer to the experimental data, we first need to compute an embedding vector
vT (x) for each example x ∈ E. Note that there is no prior knowledge of the structure of x, so we
need to find a way to match the learned embedding of structures in S and x. If there is a single
stable entry y in the simulation database matching the chemical composition of x, we assume that
the physical structure of y is the true structure of x, although this assumption may be incorrect as
the stability and structure of y are obtained from simulation data alone. In this stable match case, we
can use the embedding layer T shared from the simulation architecture to compute the embedding
for y as an effective embedding for x (Figure 1b). We pass this through two experiment-specific
MLPs: the “composition” network Cexp and the experimental head Hexp to generate the predicted
energy as
v̄T (x) = Cexp (vT (y)), êexp (x) = Hexp (v̄T (x)). (1)
If there is no stable compound in the simulation dataset matching the formula of x, one way
to guess the structure embedding of x is to interpolate the structure embeddings of neighboring
compounds in the phase diagram. We decompose x according to the phase diagram construction
(obtained with the pymatgen software28). This yields a set of stable materials y1, . . . , yn ∈ S and weights p1, . . . , pn ∈ (0, 1] for each compound x (see Figure 1c). We then compute the effective
experimental embedding as:
v̄T (x) = Σ_{i=1}^{n} wi Cexp (vT (yi)),    wi = pi^α / Σ_{j=1}^{n} pj^α,    (2)

Note that the weights wi and structures yi form an unordered set and are therefore invariant to permutations of the elements29. We denote the approach described here as AGN in the following. We emphasize
that this approach is merely an Ansatz which we found to work well in our case, especially as the structure and formation energy of a compound are often already close to those given by the phase diagram decomposition, allowing our network to learn corrections over a good initial estimate.
In a further refinement to this method (AGN+²), we feed as an additional input to Cexp the simulated energies of each material, i.e. the experimental embedding is computed as v̄T (x) = Σ_{i=1}^{n} wi Cexp ([vT (yi), esim (yi)]) (where [ , ] denotes the concatenation operation). This provides more information to the network, allowing it to learn corrections over the DFT predictions depending on the structure of the compound, or of its phase diagram neighbours30.
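A hedged sketch of this interpolation: we use pymatgen's phase diagram to obtain the fractions pi, and an `embed` callable standing in for the trained Cexp(vT(·)); for AGN+, `embed` would also receive the neighbor's simulated energy.

```python
import numpy as np
from pymatgen.core import Composition
from pymatgen.analysis.phase_diagram import PhaseDiagram

def effective_embedding(formula, entries, embed, alpha=1.0):
    """Weighted average of phase-diagram neighbor embeddings, per Eq. (2)."""
    phase_diagram = PhaseDiagram(entries)
    decomposition = phase_diagram.get_decomposition(Composition(formula))  # {entry: fraction p_i}
    neighbors, p = zip(*decomposition.items())
    p = np.asarray(p)
    w = p ** alpha / np.sum(p ** alpha)  # weights w_i from Eq. (2)
    return sum(w_i * embed(y_i) for w_i, y_i in zip(w, neighbors))
```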
Both the simulation and the experimental heads are trained at the same time; the total loss
that we minimize by gradient descent is
L(t) = ωsim (t)Ex∈S [|êsim (x) − esim (x)|] + ωexp (t)Ex∈E [|êexp (x) − eexp (x)|] , (3)
where t represents the training iteration, and ωexp,sim (t) control the relative weight of the experimental
loss with respect to the simulation loss. Note that different schedules for ωexp,sim (t) allow us to trial
either a multi-task learning or a fine-tuning approach. In practice we found that annealing ωsim (t)
to a small value over a fixed schedule during training works best.
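For concreteness, a minimal numpy sketch of this loss with the step schedule reported in the Methods section (the function names are ours):

```python
import numpy as np

def loss_weights(t, beta, t_switch=2e5):
    # Step schedule from Methods: omega_sim drops from beta to 0.1*beta
    # after t_switch iterations; omega_exp is held constant at 0.1*beta.
    omega_sim = beta if t <= t_switch else 0.1 * beta
    return omega_sim, 0.1 * beta

def total_loss(t, e_sim_pred, e_sim, e_exp_pred, e_exp, beta):
    # Eq. (3): weighted sum of mean absolute errors on the two datasets.
    omega_sim, omega_exp = loss_weights(t, beta)
    return (omega_sim * np.mean(np.abs(e_sim_pred - e_sim))
            + omega_exp * np.mean(np.abs(e_exp_pred - e_exp)))
```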
DFT There have been many efforts to compute the formation energies of inorganic compounds
using various forms of density functional theory (DFT) with known structures (similar to the stable
match cases) 31–33 . The formation energies are usually obtained by computing the total energy
differences between the compound and their elemental reference states, using structures obtained
from the Inorganic Crystal Structure Database (ICSD)34. Since some reference states, like O2 and N2, are in the gas phase, their reference energies are adjusted by fitting to experimental formation
energies. A study by Kirklin et al. reports MAEs between 0.081 eV/atom and 0.136 eV/atom for
DFT computed formation energies depending on different fitting schemes in the Open Quantum
Materials Database (OQMD)9 . To evaluate the DFT error in our dataset, we further perform a
linear fit from DFT to experimental formation energies to correct the systematic underestimation of DFT calculations35. We find an MAE of 0.145 eV/atom for a total of 1499 stable match
compounds. There are also 464 compounds that do not have a stable match in our simulation dataset.
We estimate the formation energies of these compounds by averaging the DFT formation energies of their neighbors, Σ_{i=1}^{n} wi esim (yi). We find the MAE between DFT and experimental formation energies for the compounds without a stable match to be 0.158 eV/atom, and the MAE for all compounds to be 0.148 eV/atom.
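The linear correction used in this evaluation amounts to a one-dimensional least-squares fit; a minimal sketch (assuming aligned arrays `e_dft` and `e_exp` in eV/atom):

```python
import numpy as np

def fitted_dft_mae(e_dft, e_exp):
    """MAE of DFT formation energies after a linear fit to experiment."""
    slope, intercept = np.polyfit(e_dft, e_exp, 1)  # fit e_exp ~ slope * e_dft + intercept
    return np.mean(np.abs(slope * e_dft + intercept - e_exp))
```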
² In the rest of the paper, every model whose name has a + has the simulated energies as an additional input.
MLP without simulation data The most direct approach is to train an MLP only on the
experimental data, similarly to the ElemNet model21, 26 . The network directly predicts êexp (x) from
its formula. We used a 3-layer MLP, and found that this simpler architecture performed better than
the more complex ElemNet model in the low data regime that we are considering in this work.
AGN without simulation data This approach uses our AGN architecture but does not train the
simulation head (ωsim = 0 in Eq. (3)). The model does not access simulated properties, but it still
uses the structures and phase diagram matching information from the simulation dataset.
Automatminer A recent approach36 that automatically selects material descriptors and machine
learning algorithms based on the dataset and model performance. The approach is one of the state-
of-the-art methods based on human-designed descriptors. The framework is highly customizable
and we use the “express” setting in our study. The approach does not have access to the simulation data or to structure information.
MLP A more recent approach26 combines ElemNet’s modelling strength with the abundance
of simulation data by performing transfer learning between the simulation dataset and the experimen-
tal dataset. For a more direct comparison with our method, we implement this baseline by training
two separate heads that come after a common MLP embedding layer, i.e. êexp (x) = Hexp (M (x))
and êsim (x) = Hsim (M (x)), where M, Hexp and Hsim are MLPs, and the loss is as in Eq. (3). We
also use shallower networks than in the original ElemNet model as we find this slightly improves
on the previously published results.
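A minimal sketch of this shared-trunk, two-headed baseline (we use PyTorch purely for illustration; layer sizes follow the Methods section and the class name is ours):

```python
import torch.nn as nn

class TwoHeadMLP(nn.Module):
    """Shared embedding M with separate simulation and experimental heads."""
    def __init__(self, in_dim=100, hidden=128, head_hidden=16):
        super().__init__()
        self.embed = nn.Sequential(  # M: 3 layers of 128 neurons (per Methods)
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU())
        self.sim_head = nn.Sequential(nn.Linear(hidden, head_hidden), nn.ReLU(),
                                      nn.Linear(head_hidden, 1))  # H_sim
        self.exp_head = nn.Sequential(nn.Linear(hidden, head_hidden), nn.ReLU(),
                                      nn.Linear(head_hidden, 1))  # H_exp

    def forward(self, u, head="sim"):
        v = self.embed(u)  # shared embedding of the composition vector u(x)
        return self.sim_head(v) if head == "sim" else self.exp_head(v)
```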
MLP+ This approach is similar to the MLP, but the transfer head takes as additional input the average feature Σ_{i=1}^{n} wi esim (yi), where the weights are computed as in Eq. (2).
Datasets We use the same experimental dataset as in the most recent state of the art work26 , the
SGTE Solid SUBstance dataset37 . This dataset initially contains 2090 compounds. Performing the
same standard filtering as was done in prior work26 (i.e. removing compounds with formation energy
more than 5 standard deviations away from the mean), we obtain a dataset of 1963 compounds. We
note that the resulting dataset contains a substantial fraction of duplicate compounds: the dataset only contains 1642 unique formulas, with 197 compounds duplicated at least three times. For the
simulation dataset, we use data retrieved from the Materials Project31 , with no filtering – giving us a
dataset of 120,612 compounds.
Performance on the full dataset We first compare the performance of our approaches on the full experimental dataset. We create a train/test split of our dataset in proportions 8:2, and then further divide the training set into 10 splits, one of which is used as validation during training and for early stopping. Because of the presence of repeated compounds, we randomly split the compounds in a way that places compounds with the same formula in the same split of the dataset. This avoids the issue of contamination of the test set by the training set.
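A minimal sketch of this formula-grouped split, assuming a pandas DataFrame `df` with a `formula` column:

```python
import numpy as np
import pandas as pd

def split_by_formula(df: pd.DataFrame, test_fraction: float = 0.2, seed: int = 0):
    """Place all compounds sharing a formula in the same partition."""
    rng = np.random.default_rng(seed)
    formulas = df["formula"].unique()
    rng.shuffle(formulas)
    test_formulas = set(formulas[: int(len(formulas) * test_fraction)])
    test_mask = df["formula"].isin(test_formulas)
    return df[~test_mask], df[test_mask]
```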
Table 1: Mean absolute error (MAE) on the experimental test set. The error bars are computed as
the standard deviation over the 10 different validation splits, for the 10 different models that we
trained.
Method MAE (eV/atom)
DFT (match) 0.145
DFT (all) 0.148
MLP w/o sim 0.128 ± 0.004
AGN w/o sim 0.128 ± 0.003
Automatminer 0.128 ± 0.007
MLP 0.065 ± 0.003
MLP+ 0.064 ± 0.002
AGN 0.063 ± 0.003
AGN+ 0.059 ± 0.004
We report the performance of our models and several baselines in Table 1. The MLP without simulated data baseline reaches an MAE of 0.13 eV/atom, similar to the values previously reported in Ref.26. Automatminer performs similarly on this dataset because it also does not have access to
the simulation data, and it seems that the human-designed features do not improve the performance
of the model compared with a simpler MLP. All the approaches that perform transfer learning obtain a significantly lower error compared with DFT, the MLP without simulated data, and Automatminer. This shows that transferring information from simulation can improve the prediction of experimental properties, consistent with previous observations26. Interestingly, the errors of these
models are significantly smaller than the MAE between DFT and experiments even after a linear fit
for all stable match cases (0.145 eV/atom), indicating that they might learn non-trivial corrections
to the DFT formation energies, in agreement with previous studies30 . We find that the graph based
approaches (AGN and AGN+) outperform the unstructured approaches, yielding a new state of
the art mean absolute error of 0.059 ± 0.004 eV/atom, compared with 0.07 eV/atom in Ref.26 . We
also observe that giving the transfer model access to the values computed by DFT systematically
improves results, both for the MLP and for the graph network based approaches. Directly feeding
the DFT computed energies of formation to the networks without performing transfer learning yields performance that is close to, but slightly worse than, our transfer approaches (including those that do not have access to DFT data), as demonstrated in Supplementary Figure 2. This emphasizes
the role of feature learning which happens during the transfer learning: the networks can learn, for
each material, a representation from the simulations which is richer than the scalar number it is
regressing to.
[Figure 2: panels a) Reduced training data (MAE vs. fraction of training data), b) Split by composition, and c1) Holdout Cu, c2) Holdout Fe, c3) Holdout Ni; all y-axes show MAE (eV/atom); curves compare MLP, MLP+, AGN and AGN+.]
Figure 2: Generalization performance on the experimental dataset. (a) MAE as a function of the
fraction of experimental training data used (the size of the validation set is kept constant and is not
counted as training data). The performance of the MLP without simulation data is not shown here
as it deteriorates even more quickly than the transfer based approaches. (b) MAE on a split of the dataset that respects composition groups (see text for details). (c) Performance on splits of the experimental
datasets for which all compounds in the test set contain a chemical element not present in any of
the training data (from left to right, copper, iron and nickel). The MLP without simulation data
approach is not shown here as it yields results with an error several times larger.
Generalization performance The key issue we wish to address is how data efficient our model
is as we decrease the size of the experimental training set while keeping the size of the simulation
datasets unchanged, and how well it generalizes beyond the training set. Data efficiency is vital
in material science applications where datasets of experimental properties have few examples and
collecting thousands of datapoints might be unfeasible for all properties of interest. Our hypothesis
is that the strong inductive bias of our structured approaches should require less data to achieve
the same error on a fixed test set. In Figure 2a, we show the performance of the various models
with reduced experimental training set size, and we find that the performance of the structured
approaches degrades more gracefully as the amount of training data is reduced. For instance, AGN+
achieves the same error with only 157 examples as MLP+ does with twice the data, and when using fewer than 400 training examples the AGN method (without direct access to simulated DFT energies) outperforms the MLP+, which has access to DFT. In addition, even with
just 157 training examples, the MAE of the AGN+ method is still significantly smaller than the
MAE of DFT with respect to the experiments in both stable match and all cases. This shows that
structured approaches can correct DFT computed formation energies even with limited data from
experiments.
Beyond data efficiency, an important property required to use machine learned models for
materials discovery is the ability to generalize to test sets that contain examples that are significantly
different from the training set. In Figure 2b, we consider a train/test split where the test set does not
contain materials involving the same set of atoms as in the training set (as recently proposed27 ). For
instance, if Hf2 Si is in the test set, then no other materials composed of both hafnium and silicon
(such as Hf3 Si2 and HfSi) would be allowed in the training set. We observe that the corresponding
task is harder for all methods, but that the graph based approach with access to the simulation data
performs best in this setup. An even more extreme test of generalization is when all materials
involving a certain element are removed from the experimental training set. In the bottom row of
Figure 2 we show the test error on compounds containing a particular element when the network has
never seen an experimental compound containing that element during training. We picked copper,
iron, and nickel as the elements for this generalization test, and find that the gap in performance
between the structured approaches and the unstructured ones is wide: up to 30% in the case of
copper. Note that these generalizations are possible because the models are learning features from the simulation dataset, which covers a broader range of materials; these features are useful for experimental property prediction even for elements that are unseen in the experimental dataset.
The fact that AGN and AGN+ outperform MLP and MLP+ in both data efficiency and
generalization shows that the structured approaches are learning more powerful embeddings from
the DFT data. It also indicates that such approaches are likely to be useful in the materials discovery
scenario, where we might need to extrapolate accurately from few experimental points even when
materials are significantly different from the ones that have already been measured.
Analysing the choice of interpolation method. In the absence of exact experimental structures
we rely on interpolation to produce effective embeddings from neighbours. To investigate the
efficacy of these interpolated embeddings and the effect of α in Equation 2, we study the mean error
for materials with or without a stable match in the simulation database (Figure 3a and b). The MLP without simulation data baseline (which is indifferent to whether the experimental data has a stable match in the simulation data or not) shows better performance on the data without a stable match, indicating that this subset of our test set is intrinsically ‘easier’ (see also Supplementary
Figure 3). Conversely, the graph based approaches find it slightly harder because without a stable
match we rely on an assumption that phase diagram neighbours give an effective embedding that is
a good approximation to the true structure embedding. We present a more detailed analysis in the
Supplementary Information where we show that the error increases as soon as there is no stable match, but beyond that does not vary strongly with the number of neighbors in the phase diagram.
Given that we find stable matches easier, it is reasonable to ask if the phase diagram de-
composition provides any useful information at all. To investigate this, in Figure 3c we show
[Figure 3: panels a) All compounds (predicted vs. experimental formation energy, eV/atom; AGN+ exact match MAE: 56 meV/atom), b) Stable match / no stable match (MAE bars for MLP w/o sim., MLP+ and AGN+), c) Phase diagram weighting (MAE vs. alpha).]
Figure 3: (a) predicted formation energy for the AGN+ method, for the compounds with a stable
match (green) and with no stable match (orange), plotted against the experimental formation energy.
(b) MAE on the test dataset, split by formula, depending on whether an experimental compound has
a stable match in the simulation dataset (solid bars) or has none (dashed bars). For reference, the
mean (resp. median) DFT error in the single neighbor case is 0.47 eV/atom (resp 0.14 eV/atom).
(c) MAE as a function of the parameter α of Eq. (2). In panels (b) and (c), each bar shows the average mean absolute error for 10 models trained with 10 different heldout validation splits. Error bars
are computed as the standard deviations over the 10 different validation splits, for the 10 different
models that we trained.
the effect of varying the parameter α used to compute the weights w in Eq. (2). This allows us to
interpolate between the limit α = 0, which corresponds to a uniform mixture of the phase diagram
neighbor embeddings, and α → ∞, which corresponds to only considering the nearest neighbor(s)
embedding(s). We find that the mean absolute error is lowest when α = 1, which corresponds to a
picture where weighted averages of properties using weights from the phase diagram decomposition
provide a good proxy for the property of a compound. This is consistent with the observation that
the difference between the energy of a compound and the energy obtained by averaging its phase
diagram neighbors (the energy of decomposition) is typically one order of magnitude smaller than
the EOF, so the phase diagram average should be a good proxy for the EOF.
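The two limits of the weighting can be checked with a small numeric example (the decomposition fractions below are made up for illustration):

```python
import numpy as np

def weights(p, alpha):
    """Eq. (2) weights; alpha=0 is a uniform mixture, alpha=inf keeps only the largest-fraction neighbor(s)."""
    p = np.asarray(p, dtype=float)
    w = (p == p.max()).astype(float) if np.isinf(alpha) else p ** alpha
    return w / w.sum()

p = [0.7, 0.2, 0.1]
for a in [0, 0.25, 0.5, 1, 2, 4, np.inf]:
    print(a, weights(p, a))  # alpha=1 recovers the raw phase-diagram fractions
```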
Energy of decomposition While available experimental datasets typically contain the energy of
formation, the relevant quantity for the stability of a material is the energy of decomposition, i.e.
the energy of formation with respect to all the other compounds, not just isolated constituents. A
negative value indicates that a material is stable, while a positive value indicates it is unstable or metastable. The decomposition energy is typically one order of magnitude smaller than the EOF and is often not correctly predicted by
machine learning models. With the exception of graph based approaches, those models tend to fail
to capture the subtle correlations in formation energies required for a good decomposition energy
prediction, while DFT provides a more competitive baseline because of cancellation of its systematic
errors27 . The error of our method on the experimental energy of formation is comparable to the
[Figure 4: scatter plots of predicted vs. reference decomposition energy (eV/atom). Panel a) Unstable match: AGN+ MAE: 63 meV/atom, AGN+ (treating as decomposed) MAE: 85 meV/atom, against simulation. Panel b) All matches: MLP+ MAE: 98 meV/atom, AGN+ MAE: 82 meV/atom, against simulation. Panel c) All matches: MLP+ MAE: 65 meV/atom, AGN+ MAE: 64 meV/atom, DFT MAE: 49 meV/atom, against experiment.]
Figure 4: (a) for the matches in the simulation dataset which are unstable, comparison of the output
of the pre-trained network obtained from the phase diagram decomposition with the one obtained
by feeding the structure of the unstable compound (note that the network was trained using the
former procedure). (b) predicted decomposition energy for all the compounds in the test set that
have a match in the simulation dataset (including the unstable ones), against the values predicted by
simulation. The pre-trained graph network is fed the crystal structure of the unstable compound
where applicable. (c) predicted decomposition energy (pre-trained network and DFT) against
experimentally determined ones for the experimental compounds that are part of a non-trivial phase
diagram in the intersection of the experimental and of the simulation data. In all cases, the network
had been trained with our main procedure on the formation energy prediction task.
error of models that give predictive approximations to the simulated energy of decomposition; we
therefore ask whether our structured models have sufficient correlations in their errors to potentially
be a good predictor of the experimental energy of decomposition. While the true experimental
energy of decomposition is often unknown, we perform two indirect evaluations to estimate the
performance of our proposed method.
The first indirect evaluation is to compare the predicted experimental energy of decomposition
(see Methods) to the one computed from DFT. On Figure 4a, we investigate whether the phase
diagram decomposition method is capable of obtaining the fine-grained estimate of the energy of
formation that would lead to a good energy of decomposition prediction, by looking only at the
compounds whose formula matches an unstable compound in the simulation dataset³. For those, we
have two options: either feed the graph network the phase diagram decomposition of the compound
(following Eq. (2)), or feed it the crystal structure of the most stable of its unstable counterparts
(following Eq. (1)). We find that the predicted decomposition energy correlates with the simulated
one only in the latter case, although the correlation is not very strong due to the narrow energy
distribution of most compounds. We see this as evidence of the correlation between the network
errors on neighboring compounds when their structures are known. Interestingly, results on the
³ Since they exist in the experiment dataset, they are either metastable or correspond to a stable phase that is not in the simulation dataset.
energy of formation do not improve when following this same procedure, in agreement with the fact
that the energy of the hull provides a good approximation to the energy of formation.
In Figure 4b, we report the predicted experimental energy of decomposition against one
computed from DFT for all the compounds that have a match in the simulation dataset (keeping
the procedure of the previous paragraph for the unstable ones). We find that in this test the graph
based approach achieves a lower error than the unstructured approaches; the median error of the
AGN+ model (0.037 eV/atom) is under the typical DFT error (reported38 around 0.07 eV/atom on
646 reactions from the Materials Project data) and therefore it is impossible to decide whether our
model has learned a predictor of the decomposition energy which is slightly worse, equivalent or
strictly better than the estimate that can be obtained from DFT.
We consider the second indirect evaluation in Figure 4c by comparing the predicted experi-
mental energy of decomposition against the ones computed from experimental formation energies.
This time we compute phase diagram decompositions only from the compounds present in both the
simulation and the experimental dataset (we use the test set and the 10 heldout validation sets to
maximize the number of generated phase diagrams). Due to the sparsity of the experimental data,
the generated phase diagrams may be incorrect, but the values attached to them are experimentally
valid. We find that the DFT computed values provide the best estimate for this metric, but that
network based approaches provide estimates with an error of the same order of magnitude (both the
MLP+ and AGN+ have the same error for this metric).
Discussion
We have demonstrated that predictions on experimental material datasets can be systematically im-
proved by transferring both structural information and DFT computed properties from a simulation
dataset. By comparing with several baselines we find that including structural information improves
both the prediction performance and generalization ability of the method, consistent with our initial
hypothesis that graph networks have an appropriate inductive bias for the computation of material
properties. In particular, we find that using a structured approach is helpful in hard generalization
tasks, for example when the training set size is limited or when the test set is very distinct from
the training set. We have found that the phase diagram decomposition (α = 1) is the optimum
approach for predicting the energy of formation, but it is hard to determine whether this would
remain true for all properties. We think the approach might work well empirically because the simulation dataset covers a significant part of materials space, so most materials are likely to decompose into structurally similar compounds. In addition, it is known that
the ground state structure determined by the DFT may not be the true structure of a compound, yet
the systematic improvements in performance suggest that even imperfect structural information
might still be helpful in predicting experimental properties. This highlights the potential to augment
experimental datasets with DFT computed structures using crystal structure search algorithms39–41 .
It also underlines the hypothesis that structure determination is a foundational subject in theoretical materials science and that advances in this field would also have repercussions for property prediction
based on learned neural network methods.
This approach is a first attempt to incorporate material structure from atomistic simulations
as a physics motivated inductive bias to improve predictions on small experimental datasets, but
there remain many possibilities for further improvement. The first is to extend to predicting more complicated material properties like band gap, elasticity, and thermal conductivity, especially for
those compounds without a stable match of simulation structures. We have shown that interpolating
the structural representations according to the phase diagram weights is a good inductive bias for
predicting formation energies, but better methods might be proposed for other properties with
very different characteristics. Another avenue for improvement is to achieve transfer learning
between different properties, since there are many experimentally measured properties that are
hard to simulate quickly. There are already several studies aiming to transfer between related
material properties 23, 42, 43 , but transferring to a highly different property remains challenging and
may require other physics motivated inductive biases.
Methods
Input preparation The unstructured models take the formula u(x) ∈ R¹⁰⁰ of a compound x as input (where x can belong either to the simulation or to the experimental dataset), obtained from the stoichiometric composition of x and normalized such that Σ_{i=1}^{100} ui (x) = 1 (none of our compounds involve an atom with atomic number greater than 100).
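A minimal sketch of this composition vector; the input format (a map from atomic number to count) is our assumption:

```python
import numpy as np

def composition_vector(composition: dict) -> np.ndarray:
    """100-dimensional vector of atomic fractions, summing to 1."""
    u = np.zeros(100)
    for z, count in composition.items():  # e.g. {13: 2, 8: 3} for Al2O3
        u[z - 1] = count  # atomic numbers are one-based
    return u / u.sum()
```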
The structured models instead take a graph representation of the crystal as input, constructed as follows:
• Each node corresponds to an atom in the crystal unit cell, and has the corresponding atomic number as a one-hot feature in a vector of dimension 100,
• Edges are added between pairs of nodes at a distance less than δ, with at most n edges
outgoing each node (we used δ = 7 Å and n = 12),
• Each edge has the distance between the two nodes that it connects as a scalar feature,
• The graph has the chemical formula of the compound, u(x), as an additional “global” feature.
The targets of the networks are normalized to have zero mean and unit variance based on the simulation data, i.e. the simulation and experimental heads are MLPs trained against targets rescaled as x → (x − α)/β, where α = E_{x∈S}[esim (x)] and β² = Var_{x∈S}[esim (x)].
Models architecture We optimized our hyper-parameters on the MAE on the validation set for
the random split of the dataset. For the MLP without simulation data, we used an MLP with 3 layers of 128 neurons. For the MLP approaches, we used an embedding layer M with 3 layers of 128
neurons; the heads have a single hidden layer of 16 neurons. For the AGN approaches, we use a
graph network25 with node, edge and global models, each made of a 2-layer MLP of depth 64. The
composition model Cexp is a linear layer with dimension 32 followed by a Rectified Linear Unit (ReLU). The experimental head Hexp consists of an MLP with 1 hidden layer of depth 16, and the simulation head is a simple linear network. We used ReLU activations and layer normalization44 in all
our models.
Training and evaluation We train all our methods with an Adam optimizer45 and a learning rate which is decayed from 2 × 10⁻³ to 2 × 10⁻⁴ over a fixed schedule of N iterations, where N was adjusted to the method: we use N = 2 × 10⁶ for the AGN model, N = 10⁷ for the AGN+ model, and N = 2 × 10⁵ for the MLP and MLP+ models. We use a batch size of 16 for the MLP without simulation data and for the simulation head of the MLP and AGN approaches, and a batch size of 2 for the experimental head of the MLP and AGN approaches. The schedule over the loss, ωsim (t) in Eq. (3), is a step function with value β (where β² = Var_{x∈S}[esim (x)]) for t ≤ 2 × 10⁵ and 0.1β for t > 2 × 10⁵, while we used a constant ωexp (t) = 0.1β.
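A sketch of the decayed learning rate; the exponential shape of the interpolation is our assumption, as the text only states the endpoints and the schedule length N:

```python
def learning_rate(t, n_steps, lr_start=2e-3, lr_end=2e-4):
    """Exponential interpolation from lr_start to lr_end over n_steps iterations."""
    frac = min(t / n_steps, 1.0)
    return lr_start * (lr_end / lr_start) ** frac
```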
Data availability
No data sets were generated during the current study. All the data sets used in the current study are available from their corresponding public repositories: Materials Project (https://materialsproject.org) and experimental observations (https://github.com/wolverton-research-group/qmpy/blob/master/qmpy/data/thermodata/ssub.dat).
Code availability
We will make the graph based model and the code necessary to create the graph inputs publicly
available upon publication.
Acknowledgements
We would like to thank Trevor Back, Tim Green and Alvaro Sanchez-Gonzalez for useful discussions
related to this work.
Bibliography
1. Szegedy, C., Ioffe, S., Vanhoucke, V. & Alemi, A. A. Inception-v4, inception-resnet and
the impact of residual connections on learning. In Thirty-first AAAI conference on artificial
intelligence (2017).
2. Van Den Oord, A., Kalchbrenner, N. & Kavukcuoglu, K. Pixel recurrent neural networks. In
Proceedings of the 33rd International Conference on International Conference on Machine
Learning - Volume 48, ICML’16, 1747–1756 (JMLR.org, 2016).
3. Vaswani, A. et al. Attention is all you need. In Advances in neural information processing
systems, 5998–6008 (2017).
4. Silver, D. et al. A general reinforcement learning algorithm that masters chess, shogi, and Go
through self-play. Science 362, 1140–1144 (2018).
5. Carleo, G. et al. Machine learning and the physical sciences. Rev. Mod. Phys. 91, 045002
(2019).
6. Senior, A. W. et al. Improved protein structure prediction using potentials from deep learning.
Nature 1–5 (2020).
7. Schmidt, J., Marques, M. R. G., Botti, S. & Marques, M. A. L. Recent advances and applications
of machine learning in solid-state materials science. npj Computational Materials 5, 83 (2019).
8. Butler, K. T., Davies, D. W., Cartwright, H., Isayev, O. & Walsh, A. Machine learning for
molecular and materials science. Nature 559, 547–555 (2018).
9. Kirklin, S. et al. The open quantum materials database (oqmd): assessing the accuracy of dft
formation energies. npj Computational Materials 1, 1–15 (2015).
10. Kim, G., Meschel, S. V., Nash, P. & Chen, W. Experimental formation enthalpies for inter-
metallic phases and other inorganic compounds. Scientific Data 4, 170162 (2017).
11. Cohen, A. J., Mori-Sánchez, P. & Yang, W. Insights into current limitations of density functional
theory. Science 321, 792–794 (2008).
12. Cohen, A. J., Mori-Sánchez, P. & Yang, W. Challenges for density functional theory. Chemical
reviews 112, 289–320 (2012).
13. Lany, S. Semiconductor thermochemistry in density functional calculations. Physical Review
B 78, 245207 (2008).
14. Duvenaud, D. K. et al. Convolutional networks on graphs for learning molecular fingerprints.
In Advances in neural information processing systems, 2224–2232 (2015).
15. Faber, F. A. et al. Prediction errors of molecular machine learning models lower than hybrid
dft error. Journal of chemical theory and computation 13, 5255–5264 (2017).
16. Schütt, K. T., Sauceda, H. E., Kindermans, P.-J., Tkatchenko, A. & Müller, K.-R. Schnet–a
deep learning architecture for molecules and materials. The Journal of Chemical Physics 148,
241722 (2018).
17. Yang, K. et al. Analyzing learned molecular representations for property prediction. Journal of
chemical information and modeling 59, 3370–3388 (2019).
18. Ghiringhelli, L. M., Vybiral, J., Levchenko, S. V., Draxl, C. & Scheffler, M. Big data of
materials science: critical role of the descriptor. Physical review letters 114, 105503 (2015).
19. Isayev, O. et al. Universal fragment descriptors for predicting properties of inorganic crystals.
Nature communications 8, 1–12 (2017).
20. Ouyang, R., Curtarolo, S., Ahmetcik, E., Scheffler, M. & Ghiringhelli, L. M. Sisso: A
compressed-sensing method for identifying the best low-dimensional descriptor in an immensity
of offered candidates. Physical Review Materials 2, 083802 (2018).
21. Jha, D. et al. Elemnet: Deep learning the chemistry of materials from only elemental composi-
tion. Scientific Reports 8, 17593 (2018).
22. Xie, T. & Grossman, J. C. Crystal graph convolutional neural networks for an accurate and
interpretable prediction of material properties. Physical review letters 120, 145301 (2018).
23. Chen, C., Ye, W., Zuo, Y., Zheng, C. & Ong, S. P. Graph networks as a universal machine
learning framework for molecules and crystals. Chemistry of Materials 31, 3564–3572 (2019).
24. Scarselli, F., Gori, M., Tsoi, A., Hagenbuchner, M. & Monfardini, G. The graph neural network
model. IEEE Transactions on Neural Networks 20, 61–80 (2009).
25. Battaglia, P. W. et al. Relational inductive biases, deep learning, and graph networks.
arXiv:1806.01261 (2018).
26. Jha, D. et al. Enhancing materials property prediction by leveraging computational and
experimental data using deep transfer learning. Nature Communications 10, 5316 (2019).
27. Bartel, C. J. et al. A critical examination of compound stability predictions from machine-
learned formation energies. arXiv:2001.10591 (2020).
28. Ong, S. P. et al. Python materials genomics (pymatgen): A robust, open-source python library for materials analysis. Computational Materials Science 68, 314–319 (2013).
29. Zaheer, M. et al. Deep sets. In Guyon, I. et al. (eds.) Advances in Neural Information Processing
Systems 30, 3391–3401 (2017).
30. Zhang, Y. & Ling, C. A strategy to apply machine learning to small datasets in materials
science. npj Computational Materials 4, 1–8 (2018).
31. Jain, A. et al. Commentary: The materials project: A materials genome approach to accelerating
materials innovation. APL Materials 1, 011002 (2013).
32. Saal, J. E., Kirklin, S., Aykol, M., Meredig, B. & Wolverton, C. Materials design and discovery
with high-throughput density functional theory: The open quantum materials database (oqmd).
JOM 65, 1501–1509 (2013).
33. Curtarolo, S. et al. Aflow: an automatic framework for high-throughput materials discovery.
Computational Materials Science 58, 218–226 (2012).
34. Hellenbrandt, M. The inorganic crystal structure database (icsd)—present and future. Crystal-
lography Reviews 10, 17–22 (2004).
35. Jain, A. et al. Formation enthalpies by mixing gga and gga+ u calculations. Physical Review B
84, 045115 (2011).
36. Dunn, A., Wang, Q., Ganose, A., Dopp, D. & Jain, A. Benchmarking materials property predic-
tion methods: the matbench test set and automatminer reference algorithm. npj Computational
Materials 6, 1–10 (2020).
38. Bartel, C. J., Weimer, A. W., Lany, S., Musgrave, C. B. & Holder, A. M. The role of decompo-
sition reactions in assessing first-principles predictions of solid stability. npj Computational
Materials 5, 4 (2019).
39. Pickard, C. J. & Needs, R. Ab initio random structure searching. Journal of Physics: Condensed
Matter 23, 053201 (2011).
40. Wang, Y., Lv, J., Zhu, L. & Ma, Y. Crystal structure prediction via particle-swarm optimization.
Physical Review B 82, 094116 (2010).
41. Lonie, D. C. & Zurek, E. Xtalopt: An open-source evolutionary algorithm for crystal structure
prediction. Computer Physics Communications 182, 372–387 (2011).
42. Yamada, H. et al. Predicting materials properties with little data using shotgun transfer learning.
ACS central science 5, 1717–1730 (2019).
43. Sanyal, S. et al. MT-CGCNN: Integrating crystal graph convolutional neural network with
multitask learning for material property prediction. arXiv:1811.05660 (2018).
44. Ba, J. L., Kiros, J. R. & Hinton, G. E. Layer normalization. arxiv:1607.06450 (2016).
45. Kingma, D. P. & Ba, J. Adam: A method for stochastic optimization. arXiv:1412.6980 (2014).
Supplementary material
[Supplementary Figure 1: bar chart of MAE (eV/atom) on a simulation test set for models trained on MP vs. MP(*).]
Supplementary Figure 1: Mean absolute error on a simulation test set as a function of the training dataset used: the full Materials Project (MP), or the dataset where duplicate compounds with the same formula are removed from the simulation data, keeping only the most stable one (MP*). The error bars are computed over 10 different validation splits, for the 10 different models that we trained.
We observe that all approaches that have access to simulation data, either in the form of
transfer learning or by being fed the simulated formation energies directly, perform significantly
better than the baselines, regardless of the type of network used. Transfer learning performs better
than simply getting the DFT computed target, a testament to the importance of the features that can be learned during this process. However, the best approaches combine transfer learning with the
ground truth targets and, as alluded to in the main text, are capable of learning significant corrections over them. In this case, the superior inductive bias of graph networks becomes more apparent.
[Supplementary Figure 2: bar chart titled “Split by formula”; y-axis MAE (eV/atom); bars for Baseline, GN Baseline, Baseline+, GN Baseline+, MLP, AGN, MLP+ and AGN+.]
Supplementary Figure 2: Mean absolute error on the experimental test set for the approaches described in the text: the first four approaches do not perform transfer learning. The approaches denoted by a
+ use the DFT-computed formation energies as extra inputs. The Baseline and MLP approaches use
an MLP as a network, while the GN Baseline and AGN approaches use a graph network.
Supplementary Figures 3d and e investigate the predicted energies in more detail for the AGN+ model. We find that the error (panel e) is relatively independent of the energy of the compound, although there is a slight increasing trend with the energy. In Supplementary Figure 3f, we plot the error against the inverse participation ratio ipr(w) = (Σ_{i=1}^{n} wi²)⁻¹ – a continuous measure of the number of neighbors in the phase diagram⁴. This provides a finer analysis of the error in the case where there is no stable match in the simulation dataset. We see that the error is relatively independent of the
inverse participation ratio, with the main increase happening very near ipr(w) = 1. This shows
that the prediction difficulty increases more sharply when the crystal structure becomes unknown,
but beyond that, the number of compounds in the phase diagram and the distance to the nearest compound do not have a strong influence.
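For reference, a one-line implementation of this measure:

```python
import numpy as np

def ipr(w):
    """1 for a single neighbor; n for n uniformly weighted neighbors."""
    w = np.asarray(w)
    return 1.0 / np.sum(w ** 2)
```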
Ablations of the graph network models Finally, we consider a series of ablations of our graph network models to unveil which features of the graphs are the most useful for the prediction.
⁴ For instance, if a compound is uniformly spread over n neighbors, w = (1/n, 1/n, . . . , 1/n), then ipr(w) = n.
[Supplementary Figure 3: panels a) Stable match, b) No stable match, c) Phase diagram weighting (MAE in eV/atom for MLP w/o sim., MLP, MLP+, AGN and AGN+; panel c as a function of alpha), and panels d), e), f): scatter plots described in the caption.]
Supplementary Figure 3: (a, b): mean absolute error on the test dataset, split by formula, depending
on whether an experimental compound has a (resp. has no) stable match in the simulation dataset.
The diamonds on the central panel are a reminder of the values of the leftmost panel, highlighting the
fact that the performance of the AGN approaches degrades when there is no stable match, while
the MLP without simulation and MLP approaches improve. (c): mean absolute error as a function
of the parameter α of Eq. (2). In panels (a)–(c), each bar shows the average mean absolute error for 10 models trained with 10 different heldout validation splits. Error bars are computed over the 10
different validation splits, for the 10 different models that we trained. (d-f): Scatter plot for the
AGN+ model of (d) predicted formation energies against ground truth energies on the 10 validation
splits, (e) mean absolute error against the formation energy, (f) mean absolute error against the
inverse participation ratio (defined in the text). The black lines are binned averages of the scattered points, while the grey lines show the density of examples for a given value of the x-axis.
Ablating the global composition node from the graph by replacing it with a zero vector of the same shape (denoted “-globals” on Supplementary Figure 4) has the least severe effect, as the graph
network can easily extract this information from the other nodes. The converse is not true, however,
and wiping the atomic number from the nodes (“-nodes” on Supplementary Figure 4) degrades the performance more substantially. Indeed, the nodes contain information not only about the global formula
but also (when combined with the edges) about the distance patterns between atoms. Ablating
the edges of the graph (“-edges”, which contained distance information) only slightly deteriorates
performance as well – but at this stage it is important to remember that the graph that we feed to the
network already contains some distance information via the thresholding procedure which is used to
connect vertices together or not. Finally, we experimented with feeding vector information rather
than distances on the edges of the graph (“+3d” on Supplementary Figure 4). Because this quantity
is not invariant to a global rotation of the crystal, in this case we perform data augmentation in the form of a random 3d rotation of the network’s input. Perhaps surprisingly, we find that this did
not improve the results (it makes them slightly worse for the pure AGN model). We hypothesize that
this may be related to a subtle balance between performance on the simulation data and generalization
to the experimental data of this machine learning setup.
[Supplementary Figure 4: bar charts of MAE (eV/atom) for AGN (left) and AGN+ (right) under the ablation conditions -nodes, -edges, -globals, - (no ablation) and +3d.]
Supplementary Figure 4: Performance of the AGN (left) and AGN+ (right) for various ablation
conditions, described in the text.