
www.nature.com/npjcompumats

ARTICLE OPEN

Bayesian optimization with adaptive surrogate models for automated experimental design

Bowen Lei1, Tanner Quinn Kirk2, Anirban Bhattacharya1, Debdeep Pati1, Xiaoning Qian3,4, Raymundo Arroyave5 ✉ and Bani K. Mallick1

1Department of Statistics, Texas A&M University, College Station, TX, USA. 2Department of Mechanical Engineering, Texas A&M University, College Station, TX, USA. 3Department of Electrical & Computer Engineering, Texas A&M University, College Station, TX, USA. 4Department of Computer Science & Engineering, Texas A&M University, College Station, TX, USA. 5Department of Materials Science & Engineering, Texas A&M University, College Station, TX, USA. ✉email: [email protected]

Bayesian optimization (BO) is an indispensable tool for optimizing objective functions that either do not have known functional forms or are expensive to evaluate. Currently, optimal experimental design is often conducted within the workflow of BO, leading to more efficient exploration of the design space compared to traditional strategies. This can have a significant impact on modern scientific discovery, in particular autonomous materials discovery, which can be viewed as an optimization problem aimed at looking for the maximum (or minimum) point of the desired materials properties. The performance of BO-based experimental design depends not only on the adopted acquisition function but also on the surrogate models that help to approximate the underlying objective functions. In this paper, we propose a fully autonomous experimental design framework that uses more adaptive and flexible Bayesian surrogate models in a BO procedure, namely Bayesian multivariate adaptive regression splines and Bayesian additive regression trees. They can overcome the weaknesses of widely used Gaussian process-based methods when faced with relatively high-dimensional design spaces or non-smooth patterns in the objective functions. Both simulation studies and real-world materials science case studies demonstrate their enhanced search efficiency and robustness.

npj Computational Materials (2021) 7:194; https://doi.org/10.1038/s41524-021-00662-x

INTRODUCTION
The concept of optimal experimental design, within the overall framework of Bayesian optimization (BO), has been put forward as a design strategy to circumvent the limitations of traditional (costly) exploration of (arbitrary) design spaces. BO utilizes a flexible surrogate model to stochastically approximate the (generally) expensive objective function. This surrogate, in turn, undergoes Bayesian updates as new information about the design space is acquired, according to a predefined acquisition policy. The use of a Bayesian surrogate model does not impose any a priori restrictions (such as concavity or convexity) on the objective function. BO was mainly introduced by Mockus1 and Kushner2 and pioneered by Jones et al.3, who developed a framework that balanced the need to exploit available knowledge of the design space with the objective to explore it, using a metric or policy that selects the best next experiment to carry out with the end goal of accelerating the iterative design process. Multiple extensions have been developed to make the algorithm more efficient4–7. This popular tool has been successfully used in a wide range of applications8,9, and extensive surveys of the method and its applications are available10–12.

Materials discovery (MD) can be mapped to an optimization problem in which the goal is to maximize or minimize some desired properties of a material by varying certain features/structural motifs that are ultimately controlled by changing the overall chemistry and processing conditions. A typical task in MD is to predict the material properties based on a collection of features and then use such predictions in an inverse manner to identify the specific set of features leading to a desired, optimal performance. The major goal is then to identify how to search the complex material space spanning the elements in the periodic table, arranged in a virtually infinite number of possible configurations and microstructures, as generated by arbitrary synthesis/processing methods, to meet the target properties. Recently, a design paradigm has been proposed, optimal experimental design, built upon the foundation of BO13–18, which seeks to circumvent the limits of traditional (costly) exploration of the materials design space. Early examples were demonstrated by Frazier and Wang19, who took into account both the need to harness the knowledge that exists about the design space and the goal of exploring and identifying the best experiment to speed up the iterative design process. The other important task, beyond discovering the target position in the space, is the identification of the key factors responsible for most of the variance in the properties of interest during MD20–22. This helps us better understand the underlying physical/chemical mechanisms controlling the properties or phenomena of interest, which in turn results in better strategies for MD and design17,23. There have been several follow-up papers, mainly extending the algorithm in different applied directions13–15.

The BO algorithm consists of two major components10,12: (i) modeling a (potentially) high-dimensional black-box function, f, as a surrogate of the (expensive-to-query) objective function, and (ii) optimizing the selected criterion, considering uncertainty based on the posterior distribution of f, to obtain the design points in the feature space Ω. In the procedure, we repeat the two steps until we satisfy the stopping criteria or, as is often the case in experimental settings, we exhaust the available resources. A critical aspect of BO is the choice of the probabilistic surrogate model used to fit f. A Gaussian process (GP) is the typical choice, as it is a powerful stochastic interpolation method that is distinguished from others by its mathematical explicitness and computational flexibility, together with straightforward uncertainty quantification, which makes it broadly applicable to many problems12,13,24.
Fig. 1 Plots of black-box functions. a The valley of a two-dimensional Rosenbrock function which has the formula y ¼ 100ðx2  x 21 Þ þ ðx1  1Þ2 .
b The P frequent and regularly distributed local minima of a two-dimensional Rastrigin function which has the formula
1234567890():,;

y ¼ 20 þ 2i¼1 ½x2i  10 cosð2πxi Þ.

Oftentimes, stationary or isotropic GP-based BO may be challenging when faced with (even moderately) high-dimensional design spaces, particularly when very little initial information about the design space is available. In some fields of science and engineering where BO is being used14,25,26, data sparsity is, in fact, the norm rather than the exception. In MD problems, data sparsity is exacerbated by the (apparent) high dimensionality of the design space, as a priori it is possible that many controllable features could be responsible for the materials' behavior/property of interest. In practice, however, the potentially high-dimensional MD space may actually be reduced, as in materials science it is often the case that a small subset of all available degrees of freedom actually controls the materials' behavior of interest. Searching over a large-dimensional space when only a small subspace is of interest may be highly computationally inefficient. A challenge is then how to discover the dominant degrees of freedom when very little data is available and no proper feature selection can be carried out at the outset of the discovery process. The problem may become more complex due to the existence of interaction effects among the covariates, since such interactions are extremely challenging to discover when the available data is very sparse.

We note that there are some more flexible GP-based models, like automatic relevance determination (ARD)27, which introduces a different scale parameter for each input variable inside the covariance function to facilitate the removal of unimportant variables and may alleviate the problem. Recently, Talapatra et al.14 proposed a robust model for f, based on Gaussian mixtures and Bayesian model averaging, as a strategy to deal with the data dimension and sparsity problem. Their framework was capable of detecting the subspaces most correlated with the optimization objectives by evaluating the Bayesian evidence of competing feature subsets. However, their covariance functions, and in general most commonly used covariance functions for GPs, induce a smoothness property and assume continuity of f, which may not necessarily be warranted and limits performance when f is non-smooth or has sudden transitions; this may be a common occurrence in many MD challenges. Also, GP-based methods may still not perform well when the dimension of the predictors is relatively high or the choice of kernel is not suitable for the unknown function10,12. Apart from these solutions, there is a broad literature on flexible nonstationary covariance kernels28. The deep network kernel is a prominent recent example29, although its strength may be limited when faced with sparse datasets.

The focus of this paper is to replace GP-based machine learning models with other, potentially more adaptive and flexible, Bayesian models. More specifically, we explore Bayesian spline-based models and Bayesian ensemble-learning methods as surrogate models in a BO setting. Bayesian multivariate adaptive regression splines (BMARS)30,31 and Bayesian additive regression trees (BART)32 are used in this paper as they can potentially be superior alternatives to GP-based surrogates, particularly when the objective function, f, requires more flexible models. BMARS is a flexible nonparametric approach based on product spline basis functions. BART belongs to Bayesian ensemble-learning-based methods and fits unknown patterns through a sum of small trees. Both are well equipped with automatic feature selection techniques.

In this article, we present a fully automated experimental design framework that adopts BART and BMARS as the surrogate models used to predict the outcome(s) of yet-to-be-made observations/queries of/to the expensive "black-box" function. The surrogates are used to evaluate the acquisition policy within the context of BO. Automated algorithm-based experimental design is a growing technology used in many fields, such as materials informatics and biosystems design22,33,34. It combines the principles of specific domains with the use of machine learning to accelerate scientific discovery. We compare the performance of this BO approach using non-GP surrogate models against other GP-based BO methods on standard analytic functions, and then present results in which the framework has been applied to realistic materials science discovery problems. We then discuss the possible underlying reasons for the remarkable improvements in performance associated with using more flexible surrogate models, particularly when the (unknown) objective function is very complex and does not follow the underlying assumptions motivating the use of GPs as surrogates.

[Figure 2: four panels (a–d) plotting the minimum observed y against iteration for BART, BMA1, BMA2, BMARS, GP (RBK), GP (RBK ARD), GP (Dot), and GP (DKNet); the curves themselves are not recoverable from the extracted text.]

Fig. 2 The average minimum y observed for each model at each iteration. a Rosenbrock function with an initial set of sample size N = 10. b Rosenbrock function with an initial set of sample size N = 20. c Rastrigin function with an initial set of sample size N = 10. d Rastrigin function with an initial set of sample size N = 20.

RESULTS
Simulation studies
In this section, we present two simulation studies in which we set the Rosenbrock35 and the Rastrigin36,37 functions, respectively, as the black-box functions to optimize. Figure 1 shows a two-dimensional example of each. They are two commonly used test functions in optimization benchmark studies. In both optimization tasks, the goal is to find the global minimum of the unknown function. Therefore, we record the minimum value of the observed response y at each iteration for each model for comparison. As shown in Fig. 2, a faster decline of the curve indicates a more efficient search for the target point and better performance.

As for the probabilistic models, we compare our proposal, which uses BART32 and BMARS30, with the popularly used baseline GP regression with the radial-basis function kernel (GP RBK)38. At the same time, the RBK kernel with ARD (GP RBK ARD)27 is considered, as well as nonstationary kernels like the dot-product kernel (GP Dot)39 and the more flexible deep network kernel (GP DKNet)29. We also compare them with the Bayesian model average using GP (BMA1 and BMA2)14, which showed an edge over the benchmark method. BMA1 and BMA2 refer to the use of a first- or second-order Laplace approximation to calculate the relevant marginal probabilities of the mixture model. We use a constant mean function for all the GP-based models above. For the acquisition function, we choose the expected improvement (EI) metric3 for all the models. To ensure a fair comparison, we also use a random search within the inner optimization problem of the acquisition function.

In order to have a comprehensive performance evaluation, we begin the optimization of the above models with five different sizes of initial datasets (N = 2, 5, 10, 15, 20) that are uniformly sampled from the search space14. For each N and each algorithm, the results are based on 100 replicates. To reduce the number of iterations, we choose two samples at a time in the workflow. The stopping criterion is exhausting the budget, which is set at 80 function evaluations. Relevant results for N = 10, 20 are depicted in Fig. 2, and those for other N values can be found in Supplementary Note 1.

The Rosenbrock function, also called the Valley or Banana function, is often used as a test case for optimization algorithms36,40. The formula of a d-dimensional Rosenbrock function is as follows:

$$f(\mathbf{x}) = \sum_{i=1}^{d-1}\left[100\,(x_{i+1} - x_i^2)^2 + (x_i - 1)^2\right]. \tag{1}$$

This function is unimodal, with the global minimum at x* = (1, …, 1), where f(x*) = 0, lying inside a long, narrow, parabolic flat valley, as shown in Fig. 1a. The search space is continuous, and the range for each x_i is [−2, 2]. During the workflow, locating the valley is trivial; however, further convergence to the global optimum is difficult, which makes this a good test problem.

Here, we set d = 4 and simulate the data. Apart from these four important predictors, we also add four uninformative predictors that follow the standard normal distribution. These four added features do not affect the response but are designed to augment the dimensionality of the problem, potentially obfuscating the solution and "frustrating" the optimizers. This enables us to check whether the frameworks can properly reveal the true factors and lead to an efficient exploration. Otherwise, many unnecessary searches along the insignificant directions can slow down the process of locating the optimum, and the quality of the predictions may suffer as a result.
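As a concrete reference for this setup, the following sketch (Python with NumPy; the function and variable names are ours, not from the paper) builds the 4-dimensional Rosenbrock objective of Eq. (1) and pads the design with four uninformative standard-normal predictors:

```python
import numpy as np

def rosenbrock(x):
    """d-dimensional Rosenbrock function, Eq. (1); global minimum f(1, ..., 1) = 0."""
    x = np.asarray(x, dtype=float)
    return float(np.sum(100.0 * (x[1:] - x[:-1] ** 2) ** 2 + (x[:-1] - 1.0) ** 2))

rng = np.random.default_rng(0)

def initial_design(n, d_true=4, d_noise=4, low=-2.0, high=2.0):
    """Uniformly sampled initial dataset: 4 informative inputs on [-2, 2]
    plus 4 uninformative standard-normal predictors that never enter y."""
    x_true = rng.uniform(low, high, size=(n, d_true))
    x_noise = rng.standard_normal(size=(n, d_noise))
    y = np.array([rosenbrock(row) for row in x_true])
    return np.hstack([x_true, x_noise]), y

X0, y0 = initial_design(n=10)  # e.g., an initial set of size N = 10
```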
As seen in Fig. 2a, b, the solid blue curves for BMARS show the sharpest decrease, suggesting that it is the most efficient optimizer. BART-based BO (solid red curve) also exhibits competitive performance relative to popular GP-based techniques like GP (RBK ARD) (solid gray curve), BMA1 (solid dark green curve), and BMA2 (dotted orange curve). Meanwhile, GP (DKNet) (dotted purple curve) cannot show its strength and drops slowly, as it requires considerably more data to be trained properly. As seen from (1), the Rosenbrock function is overall a polynomial with a good smoothness property, which may explain why the GP-based surrogates perform competitively despite not being the best. Turning to the final stable stage, the blue curve (BMARS) is the first to flatten and is closest to the optimum value f(x*) = 0.

Table 1. The mean value and interquartile range (IQR) of the number of experiments needed by each model to find the maximum bulk modulus K in MAX phases, with the initial set of sample size N ∈ {2, 5, 10, 15, 20}. The bold values represent the top two models in terms of search performance.

| Model | N=2 (Mean) | N=5 (Mean) | N=10 (Mean) | N=15 (Mean) | N=20 (Mean) | N=2 (IQR) | N=5 (IQR) | N=10 (IQR) | N=15 (IQR) | N=20 (IQR) |
|---|---|---|---|---|---|---|---|---|---|---|
| BART | **36.18** | **39.33** | **32.76** | **38.67** | **43.42** | 21 | 31 | 22 | 20.5 | 18.5 |
| BMA1 | 55.2 | 51.66 | 67.02 | 59.12 | 65.78 | 78 | 83.5 | 63 | 67.5 | 62 |
| BMA2 | 57.36 | 55.48 | 60.42 | 66.59 | 71.7 | 76.5 | 77 | 74 | 60 | 52.5 |
| BMARS | **36.82** | **34.94** | 41.5 | 48.94 | 50.9 | 24 | 22 | 26 | 24 | 22.5 |
| GP (RBK) | 77.46 | 76.95 | 76.86 | 73.04 | 75.3 | 49 | 59 | 61 | 69 | 62 |
| GP (RBK ARD) | 73.3 | 50.2 | **35.5** | **40.05** | **39.8** | 67.5 | 31 | 13.5 | 8.5 | 7.5 |
| GP (Dot) | 45.5 | 45.25 | 63.1 | 57.2 | 65.8 | 29.5 | 24 | 21 | 32.5 | 17 |
| GP (DKNet) | 91.6 | 78.35 | 62.9 | 62.55 | 81.6 | 15 | 43.5 | 69.5 | 71 | 50 |

Turning to the Rastrigin function36,40, it is a nonconvex function used to measure the performance of optimization workflows. The formula of a d-dimensional Rastrigin function reads as follows:

$$f(\mathbf{x}) = 10\,d + \sum_{i=1}^{d}\left[x_i^2 - 10\cos(2\pi x_i)\right]. \tag{2}$$

It is based on a quadratic function with an added cosine modulation, which produces frequent and regularly distributed local minima, as depicted in Fig. 1b. As in Rosenbrock's case, the search space is continuous, and we focus on [−2, 2] in each direction. The test function is highly multimodal, making it a challenging task in which algorithms easily get stuck in local minima. The global minimum is at x* = (0, …, 0), where f(x*) = 0.

For the simulated data, we set d = 10, and again we add five uninformative features following a standard normal distribution. With these five additional variables (or design degrees of freedom), we can assess whether the frameworks are capable of detecting the factors that are truly correlated with the objective function, enabling an efficient exploration of the design space.

As seen in Fig. 2c, d, the solid blue curves for BMARS again exhibit the fastest decline, indicating the best performance. The BART-based BO (solid red curves) follows, with a rate of decrease similar to most of the GP-based methods. The dotted brown curve, for the baseline GP (RBK), is the slowest. Considering the convergent stage, the blue curve reaches it between 50 and 60 iterations, and the minimum observed y is very close to the global optimum value f(x*) = 0. The other methods remain in a decreasing pattern with larger values of the minimum observed y. It is no surprise that GP-based methods suffer under this scenario, for which the Rastrigin function's quick switches between different local minima may be the reason, especially for GP (RBK). In contrast, with their flexibly constructed bases and multiple tree models, BMARS and BART are able to capture this complex trend of f. We note that BART might need a few more training samples to gain a competitive advantage over more flexible GPs like BMA1 and BMA2, due to the block patterns of the Rastrigin function.

Having established the better overall performance of our proposed non-GP base functions applied to complex BO problems, we now turn our attention to two materials science-motivated problems.

MD in the MAX phase space
MAX phases (ternary layered carbides/nitrides)14,41 constitute an adequate system for investigating the behavior of autonomous materials design frameworks, as a result of both their chemical richness and the wide range of their properties. The pure ternary MAX phase composition palette has so far been explored to a limited degree, so there is also significant potential to reveal promising chemistries with optimal property sets14,42,43. For these reasons, we compared different algorithms for searching among the Mn+1AXn phases, where M refers to a transition metal, A refers to group IV and VA elements, and X corresponds to carbon or nitrogen.

Specifically, the materials design space for this work consists of the conventional MAX phases M2AX and M3AX2, where M ∈ {Sc, Ti, V, Cr, Zr, Nb, Mo, Hf, Ta}, A ∈ {Al, Si, P, S, Ga, Ge, As, Cd, In, Sn, Tl, Pd}, and X ∈ {C, N}. The space is discrete and includes 403 stable MAX phases in total, aligned with Talapatra et al.14; more discussion of discrete spaces in BO can be found in Supplementary Note 4. The goal of the automated algorithm is to provide a fast exploration of the material space, namely to find the most appropriate material design, which has either (i) the maximum bulk modulus K or (ii) the minimum shear modulus G. The results in the following sections are obtained for aim (i), while those for (ii) can be found in Supplementary Note 2. We point out that, while the material design space is small, knowledge of the ground truth can assist significantly in verifying the solutions arrived at by the different optimization algorithms.

For the predictors, we follow the setting in Talapatra et al.14 and consider 13 possible features in the model: the empirical constants C and m, which link the elements of the material to its bulk modulus; the valence electron concentration Cv; the electron-to-atom ratio e/a; the lattice parameters a and c; the atomic number Z; the interatomic distance Idist; the periodic-table groups of the M, A, and X elements, denoted ColM, ColA, and ColX, respectively; the order O of the MAX phase (order 1 for M2AX or order 2 for M3AX2); and the atomic packing factor (APF). We note that the features above can potentially be correlated with the intrinsic mechanical properties of MAX phases, although a priori we assume no knowledge of how such features are correlated. In practice, as was found in ref. 14, only a small subset of the feature space is correlated with the target properties. We note that in ref. 14 the motivation for using Bayesian model averaging was precisely to be able to detect the subsets within the larger feature set most effectively correlated with the target properties to optimize.

For the probabilistic model, we align with the simulation study above and compare our suggested framework, which uses BART32 and BMARS30, to the widely used baselines, including GP (RBK)38, GP (RBK ARD)27, GP (Dot)39, the Bayesian model average using GP (BMA1 and BMA2)14, and GP (DKNet)29. For the acquisition function, we choose EI for each of them to ensure a fair comparison. To get a comprehensive picture, we follow the structure of the previous section (where we studied the
benchmark Rosenbrock and Rastrigin functions) and start the above models with five different sizes of initial samples (N = 2, 5, 10, 15, 20), which are randomly chosen from the design space. For each N, the results are based on 100 replicates. To avoid an excessive number of iterations, we add two materials at a time to the platform. The stopping criterion is either successfully locating the material with the ideal properties or running out of the budget, which is set at 80 evaluations, roughly 20% of the available space. For those replicates that do not converge within the budget, we follow Talapatra et al.14 and record their number of calculations as 100 to avoid an excessive number of evaluations.

Due to the high cost per experiment, the framework performs better if it needs fewer experiments to find the candidate with the desired properties. Therefore, we use this as a vital criterion for evaluating model capabilities. Table 1 shows the mean value and interquartile range (IQR) of the total number of evaluations when searching for the maximum bulk modulus K within the MAX phase design space. Smaller values of the mean and IQR indicate a more efficient and stable platform.

As depicted in Table 1, while GP (Dot), BMA1, and BMA2 are more efficient than GP (RBK) and GP (DKNet) when looking for the maximum bulk modulus K, BART and BMARS further greatly reduce the number of experiments and maintain a more stable performance compared to the GP-based models. GP (RBK ARD) achieves good speed when N is larger than 10, but shows poor and unstable performance for small N. Also, considering the interquartile range of each model, BART and BMARS tend to be more robust in each setting and can achieve the goal within 80 iterations, while the other five are more likely to run out of the budget without achieving the objective.

Two possible reasons could explain why BMARS and BART improve the search speed so much more efficiently than competing strategies. On the one hand, BMARS and BART are known to be more flexible surrogates compared to GP-based methods and are more powerful when faced with unknown and complex mechanisms in real-world data. On the other hand, BMARS and BART usually scale better with the dimension and can be more robust when handling high- or even moderately-dimensional design spaces.

In MD problems, beyond the identification of optimal regions in the materials design space, it is also desirable to understand the factors/features most correlated with the properties of interest. By taking these predictor and interaction rankings into account, researchers can gain a deeper understanding of the connection between features and material properties. We present the relevant results for the maximum bulk modulus K; those for the minimum shear modulus G are in Supplementary Note 2. BART and BMARS are endowed with automatic feature selection based on feature appearances in the corresponding surrogate models, while the baselines GP (RBK) and GP (Dot) cannot identify feature importance relative to the BO objective. Although BMA1 and BMA2 can utilize the coefficient of each component to provide some information about feature importance, they cannot directly tell the exact order of individual variables and interactions.

Under five different scenarios N ∈ {2, 5, 10, 15, 20}, Tables 2 and 3 list the top 5 important factors for the maximum value of K using BART and BMARS, respectively. The rankings are based on the median inclusion times over the 100 replicates from the last model when the workflow stops. When using BART, ColA, e/a, ColM, APF, and Idist are the most useful. Turning to BMARS, ColA, e/a, c, a, APF, and Idist always play a key role. We can see a similar pattern in the top-ranked features between the two models for different N, although some differences exist in their order. Regarding the interactions among features, we measure their importance by counting the coexistence of pairs of features within each basis function: the more frequently two features are used in the same basis function, the greater their influence on material improvement. The detailed results for the interaction selection can be found in Supplementary Note 2.

Table 2. The top 5 important factors selected by BART for the maximum bulk modulus K in MAX phases with the initial set of sample size N ∈ {2, 5, 10, 15, 20}.

| Setting | Top 1 | Top 2 | Top 3 | Top 4 | Top 5 |
|---|---|---|---|---|---|
| N = 2 | ColA | e/a | APF | ColM | c |
| N = 5 | ColA | e/a | ColM | APF | Idist |
| N = 10 | ColA | APF | ColM | e/a | Idist |
| N = 15 | ColA | e/a | APF | ColM | Idist |
| N = 20 | ColA | APF | e/a | ColM | Idist |

Table 3. The top 5 important factors selected by BMARS for the maximum bulk modulus K in MAX phases with the initial set of sample size N ∈ {2, 5, 10, 15, 20}.

| Setting | Top 1 | Top 2 | Top 3 | Top 4 | Top 5 |
|---|---|---|---|---|---|
| N = 2 | e/a | ColA | APF | Idist | c |
| N = 5 | ColA | Idist | APF | e/a | c |
| N = 10 | ColA | e/a | APF | Idist | a |
| N = 15 | ColA | e/a | APF | Idist | c |
| N = 20 | ColA | APF | e/a | Idist | a |

During the material development process, we may not know in advance which features should be added to the model. In light of this, it is usually the case that one considers all possible features during training and optimization to avoid missing important ones. This brings an important challenge, because it is often not possible to carry out any sort of feature selection ahead of the experimental campaign. Moreover, GP-based BO frameworks tend to become less efficient as the dimension of the design space increases, since the coverage required to ensure adequate learning of the response surface grows exponentially with the number of features11. In addition, the sparse nature of the sampling scheme (BO, after all, is used when there are stringent resource constraints on querying the problem space) makes the learned response surface very flat over wide regions of the design space, with some interspersed, locally highly nonconvex landscapes44. These issues make high-dimensional BO very hard. In materials science problems, a key challenge is that many of the potential dimensions of the problem space are uninformative, i.e., they are not correlated with the objective of interest.

It is thus desirable to develop frameworks that are robust against the existence of possibly many uninformative or redundant features. To further check the platform's ability to distill useful information and maintain its speed, we simulate 16 random predictors following the standard normal distribution and mix them with the 13 predictors described above. With these new non-informative features, we use the same automated framework and explore the space for the materials with the ideal properties.

As shown in Table 4, BART's performance is not degraded by the newly added unhelpful information; it is still the most efficient choice, indicating its robustness. At the same time, although BMARS is slower than the best, it is still competitive compared to other GP-based approaches like BMA1 and BMA2. BART-based BO is clearly capable of detecting non-informative features in a very effective manner.

Table 4. The mean value and interquartile range (IQR) of the number of experiments needed by each model to find the maximum bulk modulus K in MAX phases with additional non-informative features, with the initial set of sample size N ∈ {2, 5, 10, 15, 20}. The bold values correspond to the top two models in terms of search performance.

| Model | N=2 (Mean) | N=5 (Mean) | N=10 (Mean) | N=15 (Mean) | N=20 (Mean) | N=2 (IQR) | N=5 (IQR) | N=10 (IQR) | N=15 (IQR) | N=20 (IQR) |
|---|---|---|---|---|---|---|---|---|---|---|
| BART | **32.04** | **28.96** | **34.22** | **34.7** | **39.24** | 24.5 | 22 | 24 | 26 | 22.5 |
| BMA1 | 61.48 | 58.48 | 69.62 | 70.6 | 75.06 | 64 | 72 | 55 | 53.5 | 46 |
| BMA2 | 63.02 | 62.94 | 68.9 | 67 | 74.94 | 80 | 75 | 64 | 63.5 | 56 |
| BMARS | 63.8 | 62.35 | 67.52 | 66.77 | **70.6** | 58.5 | 34.5 | 56 | 59 | 54 |
| GP (RBK) | 63.7 | 67.29 | 66.7 | 70.36 | 72 | 70.5 | 65.5 | 64.5 | 55.5 | 48.5 |
| GP (RBK ARD) | 73.7 | 81.25 | 82.8 | 74.5 | 78.3 | 46 | 26.5 | 41.5 | 45.5 | 32 |
| GP (Dot) | **58.1** | **48.45** | **57.85** | **58.75** | 72.7 | 13 | 28.5 | 12.5 | 14.5 | 36.5 |
| GP (DKNet) | 81.1 | 81.2 | 95 | 94.65 | 92.5 | 36.5 | 47.5 | 25.5 | 18 | 20.5 |

We also find the top 5 features, as well as interaction effects, for both BART and BMARS. For the 16 newly added unimportant features, we denote them by n1, …, n16. Tables 5 and 6 summarize the most significant features. We can see that the results do not include n1, …, n16, indicating a good ability to filter out useless information. Compared with Table 2, we can also notice that the outputs of BART are very similar to those without the additional non-informative data: ColA, e/a, ColM, APF, and Idist are again frequently chosen across different N, showing a robust performance. Compared with Table 3, the selections from BMARS experience more changes and are more influenced by this uncorrelated information.

Moving to the interaction effects, BART successfully neglects the unimportant features and maintains its performance. At the same time, BMARS filters out (almost) all non-informative features and only retains a small portion of the interactions between the new predictors and the original data. Exact selection results can be found in Supplementary Note 2.

Table 5. The top 5 important factors selected by BART for the maximum bulk modulus K in MAX phases with additional non-informative features, with the initial set of sample size N ∈ {2, 5, 10, 15, 20}.

| Setting | Top 1 | Top 2 | Top 3 | Top 4 | Top 5 |
|---|---|---|---|---|---|
| N = 2 | APF | ColA | ColM | Idist | e/a |
| N = 5 | APF | ColA | Idist | ColM | e/a |
| N = 10 | APF | ColM | ColA | Idist | e/a |
| N = 15 | APF | ColA | Idist | ColM | e/a |
| N = 20 | APF | ColA | ColM | Idist | e/a |

Table 6. The top 5 important factors selected by BMARS for the maximum bulk modulus K in MAX phases with additional non-informative features, with the initial set of sample size N ∈ {2, 5, 10, 15, 20}.

| Setting | Top 1 | Top 2 | Top 3 | Top 4 | Top 5 |
|---|---|---|---|---|---|
| N = 2 | ColA | e/a | APF | ColM | Idist |
| N = 5 | ColA | APF | ColM | Idist | e/a |
| N = 10 | ColA | e/a | ColM | APF | Idist |
| N = 15 | ColA | e/a | APF | Idist | ColM |
| N = 20 | ColA | e/a | APF | ColM | Idist |

Optimal design for stacking fault energy in high entropy alloy spaces
To further demonstrate our model's advantage, we search a much larger discrete material design space, whose size is 36,273 instead of the 403 of the previous section. This dataset represents face-centred cubic (FCC) compositions in the 7-element CoCrFeMnNiV-Al high entropy alloy (HEA) space. Specifically, we focus on the task of exploring the stacking fault energy (SFE) chemical landscape in this system. The SFE is an intrinsic property of crystals that measures their inherent resistance to adjacent crystal planes shearing against each other. Its value can be a good indicator of the (dominant) plastic deformation mechanism of the alloy and is thus a valuable alloy design parameter45,46. The SFE in this alloy system has been predicted for each composition using a support vector regressor trained on 498 high-fidelity SFE calculations from density functional theory (DFT), using the axial-next-nearest-neighbor Ising model47,48, which relates the SFE to the lattice energies of (disordered) FCC, hexagonal close-packed (HCP), and double HCP (DHCP) crystals of the same chemical composition. In addition to the SFE, stoichiometrically weighted averages and variances were calculated for each composition for 17 pure-element properties, generating a total of 34 property-based features.

In this new analysis, we have two goals: to find the global minimum and the global maximum in the SFE landscape. Thus, we record the minimum value and the maximum value of the observed response at each iteration for each model for comparison. A faster curve increase in Fig. 3a, b and a sharper curve decline in Fig. 3c, d indicate a more efficient search for the target point and better performance.

For the predictors, we choose 41 potential predictors (the 34 property-based features in addition to the compositions of the seven constituent elements), which provides a larger set of candidate features. This set is a mixture of informative and (potentially) uninformative features, and some of the informative features are correlated with each other, which may bring about a more challenging feature selection task. We follow the analysis above and compare our suggested framework, which uses BART32 and BMARS30, to the GP regressions (RBK, RBK ARD, Dot, and DKNet)27,29,38,39 and the Bayesian model average using GP (BMA1 and BMA2)14. For the acquisition function, we continue using EI for each of them to maintain a fair comparison. Also, we start the above models with five different sizes of initial samples (N = 2, 5, 10, 15, 20), which are randomly chosen from the design space. For each N, the results are based on 100 replicates. Curves for N = 10 and 20 are presented here; outputs for other initial sample sizes are summarized in Supplementary Note 3.

As seen in Fig. 3a, b, when looking for the maximum SFE, the solid blue curves for BMARS, the solid red curves for BART, and the dotted light blue curves for GP (Dot) show the sharpest increase, indicating the best performance, while the remaining curves, representing the other GP-based surrogates, tend to move slowly.

Fig. 3 The average maximum or minimum stacking fault energy (SFE) [mJ⋅m−2] observed for each model at each iteration. a The maximum SFE with the initial set of sample size N = 10. b The maximum SFE with the initial set of sample size N = 20. c The minimum SFE with the initial set of sample size N = 10. d The minimum SFE with the initial set of sample size N = 20.

Table 7. The top five important factors selected by BART for the maximum stacking fault energy (SFE) with the initial set of sample size N ∈ {2, 5, 10, 15, 20}.

| Setting | Top 1 | Top 2 | Top 3 | Top 4 | Top 5 |
|---|---|---|---|---|---|
| N = 2 | Specific.Heat_Avg | Pauling.EN_Var | Mn | Ni | C_11_Avg |
| N = 5 | Mn | Pauling.EN_Var | C._Avg | C_11_Avg | Specific.Heat_Avg |
| N = 10 | Specific.Heat_Avg | Pauling.EN_Var | Mn | Ni | Co |
| N = 15 | Mn | Specific.Heat_Avg | Co | Pauling.EN_Var | SGTE.LSE_Avg |
| N = 20 | Mn | Specific.Heat_Avg | Pauling.EN_Var | Ni | SGTE.LSE_Avg |

The figure shows that BART- and BMARS-based BO are capable of finding materials with SFE values close to the ground-truth maximum in the dataset within ~80 iterations, corresponding to just 0.25% of the total materials design space that could be explored. This is an impressive performance that is eclipsed when considering the performance of BART/BMARS-BO on the minimization problem, as shown in Fig. 3c, d. There, the blue curves for BMARS and the red curves for BART drop much faster than the other curves, which confirms the more efficient search ability of our methods. In this case, by about ~40 iterations the optimizer has converged to points extremely close to the ground-truth minimum in the dataset, corresponding to about 0.125% of the total materials design space. Here, the performance of the proposed frameworks is much better than most of the alternatives. We note that, although GP (Dot) performs better than BMARS or BART in a few settings, an additional advantage of the latter methods is the automatic detection of important features, detailed below.

In this case study, not only has the design space become much larger, but the number of candidate design features has also increased. Using other approaches, it would be more difficult to evaluate the significance of the different features (or degrees of freedom) as well as their interactions. Here, we present the corresponding results for finding the maximum SFE; those for the minimum SFE are in Supplementary Note 3.

Table 8. The top five important factors selected by BMARS for the maximum stacking fault energy (SFE) with the initial set of sample size N ∈ {2, 5, 10, 15, 20}.

| Setting | Top 1 | Top 2 | Top 3 | Top 4 | Top 5 |
|---|---|---|---|---|---|
| N = 2 | SGTE.LSE_Avg | Pauling.EN_Var | C_11_Avg | Ni | C._Avg |
| N = 5 | Pauling.EN_Var | Specific.Heat_Avg | C_12_Avg | C._Avg | C_11_Avg |
| N = 10 | C_11_Avg | Specific.Heat_Avg | Mn | Pauling.EN_Var | C._Var |
| N = 15 | Specific.Heat_Avg | Mn | Pauling.EN_Var | Ni | C_11_Avg |
| N = 20 | Pauling.EN_Var | Specific.Heat_Avg | SGTE.LSE_Avg | Fe | Mn |

Under five different scenarios N ∈ {2, 5, 10, 15, 20}, Tables 7 and 8 list the top five factors most correlated with the maximum SFE using BART and BMARS, respectively. The rankings are based on the median inclusion times over the 100 replicates from the last model when the workflow stops. For BART, Specific.Heat_Avg, Pauling.EN_Var, Mn, and Ni are the most important features. For BMARS, Specific.Heat_Avg, Pauling.EN_Var, C_11_Avg, and Mn always play vital roles. Comparing the top-ranked features across different N, we observe similar patterns, with a few differences in order. Immediately, one can see that only a few chemical elements are detected to be strongly correlated with the SFE in this HEA system and that, instead, other (atomically averaged) intrinsic properties may be more informative when attempting to predict this important quantity. This implies that focusing exclusively on chemistry, as opposed to derived features, may not have been an optimal strategy for BO-based exploration of this space. Notably, Ni figures as the feature most highly correlated with the SFE in almost all scenarios considered. This is not surprising, as Ni is also highly correlated with the stability of FCC over competing phases (such as HCP); thus, higher Ni content in an alloy should be correlated with higher stability of FCC and a higher SFE49. Co and Mn also appear as important covariates. In the case of Co, limited experimental studies have shown that increased Co tends to result in lower SFEs in FCC-based HEAs50. While understanding the underlying reasons why other covariates (specific heat, Pauling electronegativity, etc.) seem to be highly correlated with the SFE is beyond the scope of this work, what is notable is that, in this framework, such insights can be gleaned at the same time that the materials problem space is being explored. Thus, in an admittedly limited manner, the BART/BMARS-BO framework not only assists in the (very) efficient exploration of materials design spaces but also enhances our understanding of the underpinnings of material behavior. More results about the interactions among features are presented in Supplementary Note 3.

DISCUSSION
In general, there are two major categories of BO: (i) acquisition-based BO (ABO) and (ii) partitioning-based BO (PBO). ABO10,12,36 is the most traditional and broadly used form. The key idea is to pick an acquisition function, derived from the posterior, which is optimized at each iteration to specify the next experiment. PBO36,51, on the other hand, avoids the optimization of acquisition functions by intelligently partitioning the space based on observed experiments and exploring promising areas, greatly reducing computation. Compared to PBO, ABO usually makes better use of the available knowledge and makes higher-quality decisions, leading to fewer required experiments. In this study, we focused on ABO to construct the autonomous workflow for materials discovery.

GP-based BO has been widely used in a number of areas and has gradually become a benchmark method12,13,24 for the optimization of expensive "black-box" functions. However, its power can be limited by the intrinsic weaknesses of GPs10,12. Isotropic covariance functions, such as the Matérn and Gaussian kernels commonly employed in the literature, have continuous sample paths, which is undesirable in many problems, including materials discovery, where it is well known that the behavior of materials often changes abruptly with minute changes in chemical make-up or (multiscale) microstructural arrangement. Moreover, such isotropic kernels are provably suboptimal52 for function estimation when there are spurious covariates or anisotropic smoothness. While remedies have been proposed in the literature, involving more flexible kernel functions with additional hyperparameters53 and sparse additive GPs54,55, tuning and computing such models can be significantly challenging, especially given a modest amount of data. Thus, in complex materials science problems such as ours, Bayesian approaches based on additive regression trees or multivariate splines constitute an attractive alternative to GPs. Attractive theoretical properties of BART, including adaptivity to the underlying anisotropy and roughness, have recently appeared56.

In this paper, we proposed a fully automated experimental design pipeline that takes advantage of more adaptive and flexible Bayesian models, namely BMARS30,31 and BART32, within an otherwise conventional BO procedure. A wide range of problems in scientific studies, including MD, can be handled with this algorithm-based workflow. Both the simulation studies and the real data analyses applied to scientifically relevant materials problems demonstrate that BO with BMARS and BART outperforms GP-based methods in terms of search speed for the optimal design and automatic feature importance determination. To be more specific, due to its well-designed spline basis, BMARS is able to capture challenging patterns such as sudden transitions in the response surface. At the same time, BART ensembles multiple individual trees into a strong regression algorithm. As a result of their recursive partitioning structures, both are equipped with a model-free variable selection based on feature inclusion frequencies in their basis functions and trees. This enables them to recognize trends more accurately and correctly reveal the true factors.

We would like to close by briefly discussing potential applications of the framework in the context of autonomous materials research (AMR). Recently, the concept of autonomous experimentation for MD57 has quickly emerged as an active area of research58–60. Going beyond traditional high-throughput approaches to MD61–63, AMR aims to deploy robotic-assisted platforms capable of the automated exploration of complex materials spaces. Autonomy, in the context of AMR, can be achieved by developing systems capable of automatically selecting the experimental points to explore in a principled manner, with as little human intervention as possible. Our proposed non-GP BO methods show robust performance across a wide range of problems. It is thus conceivable that the experimental design engines of AMR platforms could benefit from algorithms such as those proposed here.

METHODS
Bayesian optimization
BO10 is a procedure intended to determine, sequentially and optimally, the global minimum (or maximum, via a similar procedure) x* of an unknown objective function f, where $\mathcal{X}$ denotes the search space:

$$\mathbf{x}^{*} = \underset{\mathbf{x} \in \mathcal{X}}{\operatorname{argmin}}\, f(\mathbf{x}). \tag{3}$$

In the common setting of BO, the target function f is either a "black box" or expensive to evaluate, as such a function may represent a resource-intensive experiment or a very complex set of numerical simulations. Thus, we would like to reduce the number of function evaluations as we explore the design space and search for the optimal point. BO mainly includes two steps: (i) fitting the hidden pattern of the target function, f, given the data D observed so far, based on some surrogate model, and (ii) optimizing a selected utility or acquisition function u(x∣D), based on the posterior distribution of the surrogate estimates of f, to decide the next sample point to evaluate in the design space $\mathcal{X}$. More specifically, BO generally follows Algorithm 1.

Algorithm 1. Bayesian optimization (BO).
Input: initial observed dataset D = {(y_i, x_i), i = 1, …, N}. Output: candidate with desired properties.
1: Begin with s = 1.
2: while stopping criteria are not satisfied do (steps 3 to 7)
3:   Train the chosen probabilistic model based on data D.
4:   Calculate the selected acquisition function u(x∣D).
5:   Choose the next experiment point by $\mathbf{x}_{s+1} = \operatorname{argmax}_{\mathbf{x}_{s+1} \in \mathcal{X}} u(\mathbf{x} \mid D)$.
6:   Get the new point (y_{s+1}, x_{s+1}) and add it to the observed dataset D.
7:   s = s + 1.
8: return the candidate with desired properties.
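A minimal Python sketch of Algorithm 1 follows. Here `fit_surrogate`, `acquisition`, and `sample_candidates` are placeholders for any Bayesian surrogate (GP, BMARS, or BART), any utility such as EI, and the random inner search used in this paper; the sketch adds one point per iteration, whereas our experiments add two.

```python
import numpy as np

def bo_loop(f, X, y, fit_surrogate, acquisition, sample_candidates, budget=80):
    """Algorithm 1: sequential Bayesian optimization (minimization, Eq. (3)).

    fit_surrogate(X, y)    -> trained probabilistic model (step 3)
    acquisition(model, Xc) -> u(x | D) for candidate points Xc (step 4)
    sample_candidates()    -> random candidates from the search space
    """
    X = np.atleast_2d(X).copy()
    y = np.asarray(y, dtype=float).copy()
    while len(y) < budget:                              # stopping criterion: budget
        model = fit_surrogate(X, y)                     # step 3: train surrogate on D
        Xc = sample_candidates()                        # random inner search
        x_next = Xc[np.argmax(acquisition(model, Xc))]  # step 5: maximize u(x | D)
        y_next = f(x_next)                              # step 6: run the experiment
        X = np.vstack([X, x_next])
        y = np.append(y, y_next)
    best = np.argmin(y)                                 # return the best design found
    return X[best], y[best]
```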


Fig. 4 Schematic illustration of Bayesian optimization (BO). Four subplots (a–d) give an example of the sequential automated experimental
design using BO. They describe the true unknown function (blue curve), expected improvement (red curve), fitted values using GP (orange
curve), 95% confidence interval (orange area), observed samples (black points), and the next experiment design (gray triangle) in each
iteration.

A schematic illustration of BO is shown in Fig. 4; we note that such an algorithm can be implemented in autonomous experimental design platforms. Each of the subplots presents the state after one BO iteration, including the true unknown function (blue curve), the utility function, in this case EI (red curve), the fitted values using a GP (orange curve), the 95% confidence interval (orange shaded area), the observed samples (black points), and the next experiment recommended by the utility/acquisition function (gray triangle).

In this sequential optimization strategy, one of the key components is the Bayesian surrogate model for f, which is used to fit the available data34 and to predict the outcome, with a measure of uncertainty, of experiments yet to be carried out. Another important determinant of BO efficiency is the choice of the acquisition function34. It helps set our expectations regarding how much we can learn and gain from a new candidate design. The next design to be tested is usually the one that maximizes the acquisition function, balancing the trade-off between exploration and exploitation of the design space. There are many commonly used acquisition functions, such as EI, probability of improvement, upper confidence bound, and Thompson sampling10,11. Here, we choose to use EI as the acquisition function, which finds the point that, in expectation, improves on $f_n^*$ the most:

$$u(\mathbf{x}) = \mathrm{EI}_n(\mathbf{x}) := \mathbb{E}_n\left[(f(\mathbf{x}) - f_n^*)_+\right], \tag{4}$$

where $f_n^*$ is the maximum value observed so far, $\mathbb{E}_n[\cdot] = \mathbb{E}[\cdot \mid \mathbf{x}_{1:n}, y_{1:n}]$ is the expectation taken under the posterior distribution given the observed data, and $b_+ = \max(b, 0)$. We note that we have explored other acquisition functions, and the relative performance of the corresponding methods with the same surrogates was not significantly different.
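For a surrogate whose predictive distribution at x is Gaussian with mean μ(x) and standard deviation σ(x), Eq. (4) has the familiar closed form sketched below; for sample-based surrogates such as BMARS or BART, the same expectation can be approximated by averaging the thresholded improvement over posterior draws. This is our illustrative sketch, not code from the paper.

```python
import numpy as np
from scipy.stats import norm

def ei_gaussian(mu, sigma, f_best):
    """Closed-form EI of Eq. (4) under a Gaussian predictive distribution
    (maximization convention: improvement over the incumbent f_best)."""
    sigma = np.maximum(sigma, 1e-12)          # guard against zero variance
    z = (mu - f_best) / sigma
    return (mu - f_best) * norm.cdf(z) + sigma * norm.pdf(z)

def ei_monte_carlo(draws, f_best):
    """Monte Carlo EI: `draws` holds posterior samples of f at each
    candidate point, shape (n_draws, n_candidates)."""
    return np.mean(np.maximum(draws - f_best, 0.0), axis=0)
```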
The choice of the surrogate model in BO has a considerable impact on its performance, including the cost and time involved. As mentioned above, GPs64 have been widely applied in BO across many applications, including MD19. In this work, we utilize BMARS and Bayesian ensemble-learning models, in particular BART, to help guide the search through the design space more efficiently. We briefly introduce the potential surrogate models in BO below; more detailed technical descriptions are included in the Supplementary Methods.

GP and model mixing
One popular approach in BO is to use GP regression as the surrogate model to approximate the unknown f. Given $\mathbf{x}_i \in \mathbb{R}^p$ (design feature vectors) and $y_i$ (i = 1, …, n) (evaluated f values at the corresponding $\mathbf{x}_i$'s, which can be noisy), we aim to fit the pattern of f and predict a new y* associated with x*. Usually, we assume that $y_i$ is a function of $\mathbf{x}_i$ with additional noise: $y_i = f(\mathbf{x}_i) + \epsilon_i$, $\epsilon_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma^2)$. In GP regression, a GP prior is put on the unknown function f, and $\mathbf{f} = (f(\mathbf{x}_1), \ldots, f(\mathbf{x}_n))^{\top}$ follows a joint Gaussian distribution:

$$p(\mathbf{f}) \sim \mathcal{N}(\mathbf{f} \mid m(\mathbf{x}), \mathbf{K}), \qquad [\mathbf{K}]_{ij} = k(\mathbf{x}_i, \mathbf{x}_j), \tag{5}$$

where m(⋅) is the mean function and k(⋅, ⋅) is the kernel function. A common choice for m(⋅) is a constant mean function. For k(⋅, ⋅), there are various candidates, and the choice can be made based on the task at hand. The radial-basis function (RBF) kernel is popular for capturing stationary and isotropic patterns. RBF kernels with ARD27 assign a different scale parameter to each feature instead of using a common value, which can help to identify the key covariates determining f. There are also nonstationary kernels, such as dot-product kernels39 and the more flexible deep network kernels29. For simplicity, we use $D = \{\mathbf{x}_{1:n}, y_{1:n}\}$ to denote the data collected so far. For a new input x*, the predictive distribution of the response y* is:

$$p(y^* \mid \mathbf{x}^*, D) = \mathcal{N}(\mu^*, \sigma^{*2}), \tag{6}$$
$$\mu^* = m(\mathbf{x}^*) + k(\mathbf{x}^*, \mathbf{x}_{1:n})(\mathbf{K} + \sigma^2 \mathbf{I})^{-1}(y_{1:n} - m(\mathbf{x}_{1:n})), \tag{7}$$
$$\sigma^{*2} = k(\mathbf{x}^*, \mathbf{x}^*) + \sigma^2 - k(\mathbf{x}^*, \mathbf{x}_{1:n})(\mathbf{K} + \sigma^2 \mathbf{I})^{-1} k(\mathbf{x}_{1:n}, \mathbf{x}^*). \tag{8}$$

GP-based nonparametric regression approaches have gained a lot of popularity and have been widely used in various applications12,13,24. However, turning to sequential experiments in MD, the model may be imprecise and the search inefficient if we do not have enough information about the predictive performance of each experimental degree of freedom. Talapatra et al.14 address this issue with model mixing: they develop multiple GP regression models based on different combinations of the covariates and weigh all the potential models according to their likelihood of being the true model. In this way, they incorporate model uncertainty, leading to a more robust framework capable of adaptively discovering the subset of covariates most predictive of the objective function to be optimized.
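Equations (7) and (8) translate directly into a few lines of linear algebra. The sketch below assumes a constant mean and a generic kernel; the Cholesky solve is our implementation detail for numerical stability, not a prescription from the paper.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    """Radial-basis function kernel, the stationary baseline discussed above."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length_scale ** 2)

def gp_predict(X, y, X_star, kernel, noise_var, mean=0.0):
    """GP predictive mean and variance, Eqs. (6)-(8), with m(x) = mean."""
    K = kernel(X, X) + noise_var * np.eye(len(X))   # K + sigma^2 I
    L = np.linalg.cholesky(K)                       # stable factorization
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y - mean))
    K_s = kernel(X_star, X)                         # k(x*, x_{1:n})
    mu = mean + K_s @ alpha                         # Eq. (7)
    V = np.linalg.solve(L, K_s.T)
    var = (np.diag(kernel(X_star, X_star)) + noise_var
           - np.sum(V ** 2, axis=0))                # Eq. (8), diagonal only
    return mu, var
```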


Fig. 5 Workflow of the automated experimental design framework. The overall workflow of the automated experimental design framework
is based on Bayesian optimization with adaptive Bayesian learning.

Bayesian multivariate adaptive regression splines
BMARS30,65 is a Bayesian version of the classical MARS model31, a flexible nonparametric regression technique. It uses product spline basis functions to model f and automatically identifies the nonlinear interactions among covariates. The regression develops a relationship between the covariates $\mathbf{x}_i \in \mathbb{R}^p$ and the response $y_i$ (i = 1, …, n) as

$$y_i = f(\mathbf{x}_i) + \epsilon_i, \qquad \hat{f}(\mathbf{x}_i) = \sum_{j=1}^{l} \alpha_j B_j(\mathbf{x}_i), \qquad \epsilon_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma^2), \tag{9}$$

where $\alpha_j$ denotes the coefficient of the basis function $B_j$, which takes the form

$$B_j(\mathbf{x}_i) = \begin{cases} 1, & j = 1, \\ \prod_{q=1}^{Q_j} \left[s_{qj}\,(x_{i,v(q,j)} - t_{qj})\right]_+, & j \in \{2, \ldots, l\}, \end{cases} \tag{10}$$

with $s_{qj} \in \{-1, 1\}$, $v(q, j)$ denoting the index of the variables, and the set $\{v(q, j);\ q = 1, \ldots, Q_j\}$ containing no repeats. Here $t_{qj}$ gives the partition location, $(\cdot)_+ = \max(0, \cdot)$, and $Q_j$ is the polynomial degree of the basis $B_j$, which also corresponds to the number of predictors involved in $B_j$. The number of parameters is O(l), and we set the maximum value of l to 500.

To obtain samples from the joint posterior distribution, the computation is mainly based on reversible jump Metropolis–Hastings algorithms66. The sampling scheme only draws the important covariates; hence, automatic feature selection is naturally performed within this procedure.
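To make Eqs. (9) and (10) concrete, the sketch below evaluates a product spline basis function from its sign, variable-index, and knot parameters (all names are ours) and forms the BMARS mean fit as a weighted sum of such bases:

```python
import numpy as np

def spline_basis(X, signs, var_idx, knots):
    """One product basis B_j(x) of Eq. (10):
    prod_q [ s_qj * (x_{v(q,j)} - t_qj) ]_+ over the Q_j factors."""
    out = np.ones(X.shape[0])
    for s, v, t in zip(signs, var_idx, knots):
        out *= np.maximum(s * (X[:, v] - t), 0.0)  # hinge factor
    return out

def bmars_fit(X, coefs, bases):
    """Eq. (9): alpha_1 * B_1 + sum_j alpha_j B_j(x), with B_1 = 1.
    `bases` is a list of (signs, var_idx, knots) triples for j >= 2."""
    f_hat = np.full(X.shape[0], coefs[0])          # intercept term, B_1 = 1
    for a, (s, v, t) in zip(coefs[1:], bases):
        f_hat += a * spline_basis(X, s, v, t)
    return f_hat
```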
model on the observed dataset and collect the relevant posterior samples.
Using these samples, the acquisition function for each potential experiment to
Ensemble learning and BART perform is calculated. After obtaining the values of the acquisition function, we
Apart from model mixing, ensemble learning67 provides an alternative way select the candidates with top scores and do experiments at these points. With
of combining models, which is a popular procedure that constructs the new outcomes, the observed dataset is augmented and the stopping
multiple weak learners and aggregates them into a stronger learner68–70. In criteria are checked. If the criteria are fulfilled, we stop the workflow and return
several circumstances, it is challenging for an individual model to capture the candidate with the desired properties. Otherwise, we update the surrogate
the unknown complex mechanism connecting inputs to the output(s) by model by making use of the augmented dataset and use the updated belief to
itself. Therefore, it is a better strategy to use a divide-and-conquer method guide the next round of experiments.
in the ensemble-learning framework, which allows each of the models to Within this fully automated framework, what we need to provide is the
fit a small part of the function. This is the key difference of our adopted initial sample and the stopping criteria. The beginning dataset can be
Bayesian ensemble learning from the GP-based model mixing strategy in some available data before this project. If we do not have this kind of
Talapatra et al.14. Ensemble learning’s robust performance to handle information, we can randomly conduct a small number of experiments to
complex data makes it a great candidate for BO71. However, it has not populate the database and initialize the surrogate models used in the
been explored to its full potential in the context of optimal experimental sequential experimental protocol. For the stopping criteria, it can be
design yet. Hence, we choose to combine BO with the Bayesian ensemble arriving at the desired properties or running out of the experimental
learning72, in particular, BART32. As BART is a tree-based model without budget14.
inherent smoothness assumptions, it is also a more flexible surrogate
model when modeling objective functions that are non-smooth, often
encountered in MD. This strategy is effective and efficient due to its ability DATA AVAILABILITY
to take advantage of both the ensemble-learning procedure and the The data files for materials discovery in the MAX phase space and optimal design for
Bayesian paradigm. stacking fault energy in high entropy alloy space are available upon reasonable
BART32 is a nonparametric regression method utilizing the Bayesian request.
ensemble-learning technique. Many simulations and real-world applica-
tions confirmed its flexible fitting capabilities73–75. Given xi 2 Rp and yi(i =
1, …, n), where it approximates the target function f by aggregating a set Received: 2 July 2021; Accepted: 3 November 2021;
of regression trees:
$$y_i = f(\mathbf{x}_i) + \epsilon_i, \qquad \hat{f}(\mathbf{x}_i) = \sum_{j=1}^{l} g_j(\mathbf{x}_i;\, T_j, M_j), \qquad \epsilon_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma^2), \tag{11}$$

where $T_j$ denotes a binary regression tree, $M_j = (\mu_{j1}, \ldots, \mu_{j b_j})^\top$ denotes the vector of means corresponding to the $b_j$ leaf nodes of $T_j$, and $g_j(\mathbf{x}_i; T_j, M_j)$ is the function that assigns $\mu_{jt} \in M_j$ to $\mathbf{x}_i$.
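A minimal sketch of how a fitted sum-of-trees model produces the point prediction in Eq. (11) is given below; the `Node` class and function names are illustrative assumptions rather than the interface of an existing BART implementation, and a full BART prediction would additionally average such sums over posterior draws.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    """One node of a binary regression tree T_j."""
    var: Optional[int] = None          # splitting variable; None marks a leaf
    cut: float = 0.0                   # splitting value of the rule x[var] <= cut
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    mu: float = 0.0                    # leaf mean mu_jt (only used at leaves)

def g(x, tree):
    """Route x down one tree and return the leaf mean it lands in."""
    node = tree
    while node.var is not None:
        node = node.left if x[node.var] <= node.cut else node.right
    return node.mu

def bart_predict(x, trees):
    """f_hat(x) = sum_j g_j(x; T_j, M_j), the sum-of-trees fit of Eq. (11)."""
    return sum(g(x, tree) for tree in trees)
```

For example, a depth-one tree splitting on the first covariate at 0.5 would be written `Node(var=0, cut=0.5, left=Node(mu=-1.0), right=Node(mu=2.0))`.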
Using regularization priors on those trees is critical for the superior performance of this ensemble regression model. That way, each tree will be regularized to explain a small and distinct part of f. This aligns with the essence of ensemble learning, which is about combining weak learners into a stronger model. The number of parameters is correlated with the number of trees l as well as the tree depth $d_j$ and is $O(l \cdot 2^d)$. In our analysis, l is set as 50 and $d_j$ is usually smaller than 6.

At the same time, one can use this regression model for automatic variable selection, which greatly expands its scope of use. The importance of each predictor is based on the average variable inclusion frequency in all splitting rules32. Bleich et al.76 further put forward a permutation-based inferential approach, which is a good alternative for determining factor significance.
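Reusing the illustrative `Node` class from above, this importance measure can be sketched as follows: tally, within each posterior draw of the ensemble, the proportion of splitting rules that use each predictor, then average across draws. This mirrors the inclusion-frequency idea32 but is not code from any particular BART package.

```python
import numpy as np

def split_vars(tree):
    """Yield the splitting variable of every internal node of one tree."""
    if tree is None or tree.var is None:
        return
    yield tree.var
    yield from split_vars(tree.left)
    yield from split_vars(tree.right)

def inclusion_frequency(draws, p):
    """For each posterior draw (a list of trees), compute the proportion of
    splitting rules using each of the p predictors; then average over draws."""
    props = np.zeros((len(draws), p))
    for d, ensemble in enumerate(draws):
        used = [v for tree in ensemble for v in split_vars(tree)]
        if used:
            props[d] = np.bincount(used, minlength=p) / len(used)
    return props.mean(axis=0)      # one importance score per predictor
```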

Automated experimental design framework
With BO using BMARS or BART, we propose an autonomous platform for efficient experimental design, aiming at significantly reducing the number of required trials and the total expense needed to find the best candidate in MD. The framework is depicted in Fig. 5, and the detailed description is as follows.

Fig. 5 Workflow of the automated experimental design framework. The overall workflow is based on Bayesian optimization with adaptive Bayesian learning.

In this workflow, we begin with an initially observed dataset, whose sample size can be as small as two. Then, we train our surrogate Bayesian learning model on the observed dataset and collect the relevant posterior samples. Using these samples, the acquisition function for each potential experiment to perform is calculated. After obtaining the values of the acquisition function, we select the candidates with top scores and do experiments at these points. With the new outcomes, the observed dataset is augmented and the stopping criteria are checked. If the criteria are fulfilled, we stop the workflow and return the candidate with the desired properties. Otherwise, we update the surrogate model by making use of the augmented dataset and use the updated belief to guide the next round of experiments.

Within this fully automated framework, all we need to provide is the initial sample and the stopping criteria. The initial dataset can be some data available before the project; if we do not have this kind of information, we can randomly conduct a small number of experiments to populate the database and initialize the surrogate models used in the sequential experimental protocol. For the stopping criteria, one can stop upon arriving at the desired properties or upon running out of the experimental budget14. A minimal sketch of this closed loop is given below.
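In this sketch, `fit_surrogate` is any routine returning a posterior object with a hypothetical `sample_functions` method (it could wrap BMARS or BART posterior draws), `run_experiment` performs the costly evaluation, and a Monte Carlo expected-improvement score stands in for whichever acquisition function is preferred. None of these names come from an existing package.

```python
import numpy as np

def design_loop(X_init, y_init, candidates, fit_surrogate, run_experiment,
                target, budget, batch_size=1):
    """Closed-loop design: fit surrogate -> score candidates -> run the top
    experiments -> augment the data -> check the stopping criteria."""
    X, y = list(X_init), list(y_init)
    for _ in range(budget):                        # criterion 2: experimental budget
        if max(y) >= target:                       # criterion 1: property reached
            break
        posterior = fit_surrogate(np.asarray(X), np.asarray(y))
        draws = posterior.sample_functions(candidates)   # (n_draws, n_candidates)
        # Monte Carlo expected improvement over the current best observation
        ei = np.maximum(draws - max(y), 0.0).mean(axis=0)
        picks = np.argsort(ei)[-batch_size:]       # candidates with top scores
        for i in picks:
            X.append(candidates[i])
            y.append(run_experiment(candidates[i]))
        candidates = np.delete(candidates, picks, axis=0)
    return X[int(np.argmax(y))], max(y)
```

The same skeleton accommodates other acquisition rules: only the two lines computing `ei` need to change.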
DATA AVAILABILITY
The data files for materials discovery in the MAX phase space and optimal design for stacking fault energy in high entropy alloy space are available upon reasonable request.

Received: 2 July 2021; Accepted: 3 November 2021;

REFERENCES
1. Mockus, J. In Bayesian Approach to Global Optimization, 125–156 (Springer, Dordrecht, 1989).
2. Kushner, H. J. A new method of locating the maximum point of an arbitrary multipeak curve in the presence of noise. J. Basic Eng. 86, 97–106 (1964).
3. Jones, D. R., Schonlau, M. & Welch, W. J. Efficient global optimization of expensive black-box functions. J. Glob. Optim. 13, 455–492 (1998).
4. Kaufmann, E., Cappé, O. & Garivier, A. On Bayesian upper confidence bounds for bandit problems. In Proc. 15th International Conference on Artificial Intelligence and Statistics (AISTATS), 592–600 (JMLR, 2012).
5. Garivier, A. & Cappé, O. The KL-UCB algorithm for bounded stochastic bandits and beyond. In Proc. 24th Annual Conference on Learning Theory, 359–376 (JMLR Workshop and Conference Proceedings, 2011).
6. Maillard, O.-A., Munos, R. & Stoltz, G. A finite-time analysis of multi-armed bandits problems with Kullback-Leibler divergences. In Proc. 24th Annual Conference on Learning Theory, 497–514 (JMLR Workshop and Conference Proceedings, 2011).
7. Auer, P., Cesa-Bianchi, N. & Fischer, P. Finite-time analysis of the multiarmed bandit problem. Mach. Learn. 47, 235–256 (2002).
8. Negoescu, D. M., Frazier, P. I. & Powell, W. B. The knowledge-gradient algorithm for sequencing experiments in drug discovery. INFORMS J. Comput. 23, 346–363 (2011).
9. Lizotte, D. J., Wang, T., Bowling, M. H. & Schuurmans, D. Automatic gait optimization with Gaussian process regression. In Proc. Int. Joint Conf. on Artificial Intelligence, 7, 944–949 (2007).
10. Frazier, P. I. Bayesian optimization. In Recent Advances in Optimization and Modeling of Contemporary Problems, 255–278 (INFORMS, 2018).
11. Shahriari, B., Swersky, K., Wang, Z., Adams, R. P. & De Freitas, N. Taking the human out of the loop: a review of Bayesian optimization. Proc. IEEE 104, 148–175 (2015).
12. Snoek, J., Larochelle, H. & Adams, R. P. Practical Bayesian optimization of machine learning algorithms. Adv. Neural Inform. Process. Syst. 25, 2960–2968 (2012).
13. Iyer, A. et al. Data-centric mixed-variable Bayesian optimization for materials design. In International Design Engineering Technical Conferences and Computers and Information in Engineering Conference, Vol. 59186, V02AT03A066 (American Society of Mechanical Engineers, 2019).
14. Talapatra, A. et al. Autonomous efficient experiment design for materials discovery with Bayesian model averaging. Phys. Rev. Mater. 2, 113803 (2018).
15. Ju, S. et al. Designing nanostructures for phonon transport via Bayesian optimization. Phys. Rev. X 7, 021024 (2017).
16. Ghoreishi, S. F., Molkeri, A., Srivastava, A., Arroyave, R. & Allaire, D. Multi-information source fusion and optimization to realize ICME: application to dual-phase materials. J. Mech. Des. 140, 111409 (2018).
17. Khatamsaz, D. et al. Efficiently exploiting process-structure-property relationships in material design by multi-information source fusion. Acta Mater. 206, 116619 (2021).
18. Ghoreishi, S. F., Molkeri, A., Arróyave, R., Allaire, D. & Srivastava, A. Efficient use of multiple information sources in material design. Acta Mater. 180, 260–271 (2019).
19. Frazier, P. I. & Wang, J. Bayesian optimization for materials design. In Information Science for Materials Discovery and Design, 45–75 (Springer, 2016).
20. Liu, Y., Wu, J.-M., Avdeev, M. & Shi, S.-Q. Multi-layer feature selection incorporating weighted score-based expert knowledge toward modeling materials with targeted properties. Adv. Theory Simul. 3, 1900215 (2020).
21. Janet, J. P. & Kulik, H. J. Resolving transition metal chemical space: feature selection for machine learning and structure–property relationships. J. Phys. Chem. A 121, 8939–8954 (2017).
22. Ramprasad, R., Batra, R., Pilania, G., Mannodi-Kanakkithodi, A. & Kim, C. Machine learning in materials informatics: recent applications and prospects. npj Comput. Mater. 3, 1–13 (2017).
23. Honarmandi, P., Hossain, M., Arroyave, R. & Baxevanis, T. A top-down characterization of NiTi single-crystal inelastic properties within confidence bounds through Bayesian inference. Shap. Mem. Superelasticity 7, 50–64 (2021).
24. Ceylan, Z. Estimation of municipal waste generation of Turkey using socio-economic indicators by Bayesian optimization tuned Gaussian process regression. Waste Manag. Res. 38, 840–850 (2020).
25. Moriconi, R., Deisenroth, M. P. & Kumar, K. S. High-dimensional Bayesian optimization using low-dimensional feature spaces. Mach. Learn. 109, 1925–1943 (2020).
26. Wang, Z., Hutter, F., Zoghi, M., Matheson, D. & de Feitas, N. Bayesian optimization in a billion dimensions via random embeddings. J. Artif. Intell. Res. 55, 361–387 (2016).
27. Aye, S. A. & Heyns, P. An integrated Gaussian process regression for prediction of remaining useful life of slow speed bearings based on acoustic emission. Mech. Syst. Signal Process. 84, 485–498 (2017).
28. Paciorek, C. J. & Schervish, M. J. Nonstationary covariance functions for Gaussian process regression. In Advances in Neural Information Processing Systems, 273–280 (Citeseer, 2003).
29. Wilson, A. G., Hu, Z., Salakhutdinov, R. & Xing, E. P. Deep kernel learning. In Artificial Intelligence and Statistics, 370–378 (PMLR, 2016).
30. Denison, D. G., Mallick, B. K. & Smith, A. F. Bayesian MARS. Stat. Comput. 8, 337–346 (1998).
31. Friedman, J. H. Multivariate adaptive regression splines. Ann. Stat. 19, 1–67 (1991).
32. Chipman, H. A., George, E. I. & McCulloch, R. E. BART: Bayesian additive regression trees. Ann. Appl. Stat. 4, 266–298 (2010).
33. HamediRad, M. et al. Towards a fully automated algorithm driven platform for biosystems design. Nat. Commun. 10, 1–10 (2019).
34. Mateos, C., Nieves-Remacha, M. J. & Rincón, J. A. Automated platforms for reaction self-optimization in flow. React. Chem. Eng. 4, 1536–1544 (2019).
35. Bashir, L. Z. & Hasan, R. S. M. Solving banana (rosenbrock) function based on fitness function. World Sci. News 12, 41–56 (2015).
36. Merrill, E., Fern, A., Fern, X. & Dolatnia, N. An empirical study of Bayesian optimization: acquisition versus partition. J. Mach. Learn. Res. 22, 1–25 (2021).
37. Pohlheim, H. GEATbx: Genetic and Evolutionary Algorithm Toolbox for use with MATLAB Documentation. http://www.geatbx.com/docu/algindex-03.html (2008).
38. Vert, J.-P., Tsuda, K. & Schölkopf, B. A primer on kernel methods. Kernel Methods Comput. Biol. 47, 35–70 (2004).
39. Williams, C. K. & Rasmussen, C. E. Gaussian Processes for Machine Learning, Vol. 2 (MIT Press, 2006).
40. Molga, M. & Smutnicki, C. Test functions for optimization needs. Test. Funct. Optim. Needs 101, 48 (2005).
41. Barsoum, M. W. MAX Phases: Properties of Machinable Ternary Carbides and Nitrides (Wiley, 2013).
42. Aryal, S., Sakidja, R., Barsoum, M. W. & Ching, W.-Y. A genomic approach to the stability, elastic, and electronic properties of the MAX phases. Phys. Stat. Sol. 251, 1480–1497 (2014).
43. Barsoum, M. W. & Radovic, M. Elastic and mechanical properties of the MAX phases. Annu. Rev. Mater. Res. 41, 195–227 (2011).
44. Rana, S., Li, C., Gupta, S., Nguyen, V. & Venkatesh, S. High dimensional Bayesian optimization with elastic Gaussian process. In International Conference on Machine Learning, 2883–2891 (PMLR, 2017).
45. Chaudhary, N., Abu-Odeh, A., Karaman, I. & Arróyave, R. A data-driven machine learning approach to predicting stacking faulting energy in austenitic steels. J. Mater. Sci. 52, 11048–11076 (2017).
46. Hu, Y.-J., Sundar, A., Ogata, S. & Qi, L. Screening of generalized stacking fault energies, surface energies and intrinsic ductile potency of refractory multicomponent alloys. Acta Mater. 210, 116800 (2021).
47. Denteneer, P. & Soler, J. Energetics of point and planar defects in aluminium from first-principles calculations. Solid State Commun. 78, 857–861 (1991).
48. Denteneer, P. & Van Haeringen, W. Stacking-fault energies in semiconductors from first-principles calculations. J. Phys. C 20, L883 (1987).
49. Cockayne, D., Jenkins, M. & Ray, I. The measurement of stacking-fault energies of pure face-centred cubic metals. Philos. Mag. 24, 1383–1392 (1971).
50. Liu, S. et al. Transformation-reinforced high-entropy alloys with superior mechanical properties via tailoring stacking fault energy. J. Alloys Compd. 792, 444–455 (2019).
51. Wang, S. & Ng, S. H. Partition-based Bayesian optimization for stochastic simulations. In 2020 Winter Simulation Conference (WSC), 2832–2843 (IEEE, 2020).
52. Bhattacharya, A., Pati, D. & Dunson, D. Anisotropic function estimation using multi-bandwidth Gaussian processes. Ann. Stat. 42, 352 (2014).
53. Cheng, L. et al. An additive Gaussian process regression model for interpretable non-parametric analysis of longitudinal data. Nat. Commun. 10, 1–11 (2019).
54. Qamar, S. & Tokdar, S. T. Additive Gaussian process regression. Preprint at https://arxiv.org/abs/1411.7009 (2014).
55. Vo, G. & Pati, D. Sparse additive Gaussian process with soft interactions. Open J. Stat. 7, 567 (2017).
56. Ročková, V. & van der Pas, S. Posterior concentration for Bayesian regression trees and forests. Ann. Stat. 48, 2108–2131 (2020).
57. Nikolaev, P. et al. Autonomy in materials research: a case study in carbon nanotube growth. npj Comput. Mater. 2, 1–6 (2016).
58. Kusne, A. G. et al. On-the-fly closed-loop materials discovery via Bayesian active learning. Nat. Commun. 11, 5966 (2020).
59. Aldeghi, M., Häse, F., Hickman, R. J., Tamblyn, I. & Aspuru-Guzik, A. Golem: an algorithm for robust experiment and process optimization. Chem. Sci. 12, 14792–14807 (2021).
60. Häse, F. et al. Olympus: a benchmarking framework for noisy optimization and experiment planning. Mach. Learn. Sci. Technol. 2, 035021 (2021).
61. Liu, P. et al. High throughput materials research and development for lithium ion batteries. High-throughput Exp. Model. Res. Adv. Batter. 3, 202–208 (2017).
62. Melia, M. A. et al. High-throughput additive manufacturing and characterization of refractory high entropy alloys. Appl. Mater. Today 19, 100560 (2020).
63. Potyrailo, R. et al. Combinatorial and high-throughput screening of materials libraries: review of state of the art. ACS Comb. Sci. 13, 579–633 (2011).
64. Schulz, E., Speekenbrink, M. & Krause, A. A tutorial on Gaussian process regression: modelling, exploring, and exploiting functions. J. Math. Psychol. 85, 1–16 (2018).
65. Denison, D. G., Holmes, C. C., Mallick, B. K. & Smith, A. F. Bayesian Methods for Nonlinear Classification and Regression, Vol. 386 (John Wiley & Sons, 2002).
66. Green, P. J. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 82, 711–732 (1995).
67. Sagi, O. & Rokach, L. Ensemble learning: a survey. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 8, e1249 (2018).

68. Krawczyk, B., Minku, L. L., Gama, J., Stefanowski, J. & Woźniak, M. Ensemble learning for data stream analysis: a survey. Inf. Fusion 37, 132–156 (2017).
69. Laradji, I. H., Alshayeb, M. & Ghouti, L. Software defect prediction using ensemble learning on selected features. Inf. Softw. Technol. 58, 388–402 (2015).
70. Chen, X. M., Zahiri, M. & Zhang, S. Understanding ridesplitting behavior of on-demand ride services: an ensemble learning approach. Transp. Res. Part C 76, 51–70 (2017).
71. Zhang, W., Wu, C., Zhong, H., Li, Y. & Wang, L. Prediction of undrained shear strength using extreme gradient boosting and random forest based on Bayesian optimization. Geosci. Front. 12, 469–477 (2021).
72. Fersini, E., Messina, E. & Pozzi, F. A. Sentiment analysis: Bayesian ensemble learning. Decis. Support Syst. 68, 26–38 (2014).
73. Hill, J., Linero, A. & Murray, J. Bayesian additive regression trees: a review and look forward. Annu. Rev. Stat. Appl. 7, 251–278 (2020).
74. McCord, S. E., Buenemann, M., Karl, J. W., Browning, D. M. & Hadley, B. C. Integrating remotely sensed imagery and existing multiscale field data to derive rangeland indicators: application of Bayesian additive regression trees. Rangel. Ecol. Manag. 70, 644–655 (2017).
75. Sparapani, R. A., Logan, B. R., McCulloch, R. E. & Laud, P. W. Nonparametric survival analysis using Bayesian additive regression trees (BART). Stat. Med. 35, 2741–2753 (2016).
76. Bleich, J., Kapelner, A., George, E. I. & Jensen, S. T. Variable selection for BART: an application to gene regulation. Ann. Appl. Stat. 8, 1750–1781 (2014).

ACKNOWLEDGEMENTS
B.K.M., A.B., and D.P. acknowledge support by NSF through Grant No. NSF CCF-1934904 (TRIPODS). T.Q.K. acknowledges the NSF through Grant No. NSF-DGE-1545403. X.Q. and R.A. acknowledge NSF through Grants Nos. 1835690 and 2119103 (DMREF). The authors also acknowledge Texas A&M's Vice President for Research for partial support through the X-Grants program. Dr. Prashant Singh (Ames Laboratory) is acknowledged for his DFT calculations of SFE in FCC HEAs. Dr. Anjana Talapatra and Dr. Shahin Boluki are acknowledged for facilitating the BMA code. DFT calculations of the SFEs were conducted with the computing resources provided by Texas A&M High Performance Research Computing.

AUTHOR CONTRIBUTIONS
B.L. and B.K.M. conceived of the concept for non-GP BO. B.L. implemented the algorithms and carried out the experiments. T.Q.K. provided the model for SFE as a function of composition. X.Q., A.B., and D.P. contributed to the discussion on the ML/BO aspects of the work. R.A. and T.Q.K. provided the materials science context and designed the materials science examples. All authors analyzed the results, contributed to the manuscript, and edited it. All authors reviewed the final version of the manuscript.

COMPETING INTERESTS
The authors declare no competing interests.

ADDITIONAL INFORMATION
Supplementary information The online version contains supplementary material available at https://doi.org/10.1038/s41524-021-00662-x.

Correspondence and requests for materials should be addressed to Raymundo Arroyave.

Reprints and permission information is available at http://www.nature.com/reprints

Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

© The Author(s) 2021