
Full Paper www.molinf.com

DOI: 10.1002/minf.201900107

iQSPR in XenonPy: A Bayesian Molecular Design Algorithm


Stephen Wu+,*[a, b] Guillaume Lambard+,[c] Chang Liu+,[a] Hironao Yamada,[a, d] and Ryo Yoshida*[a, b, c]

Abstract: iQSPR is an inverse molecular design algorithm based on Bayesian inference that was developed in our previous study. Here, the algorithm is integrated in Python as a new module called iQSPR-X in the all-in-one materials informatics platform XenonPy. Our new software provides a flexible, easy-to-use, and extensible platform for users to build customized molecular design algorithms using pre-set modules and a pre-trained model library in XenonPy. In this paper, we describe key features of iQSPR-X and provide guidance on its use, illustrated by an application to a polymer design that targets a specific range of bandgap and dielectric constant.

Keywords: molecular design · machine learning · Bayesian inference · open source · polymer

1 Introduction

Inverse molecular design is the process of computationally creating new chemical structures that exhibit desired properties, and this approach has been one of the most important research subjects in materials science. For decades, scientists have searched for efficient methods of discovering novel materials for a wide variety of industrial and engineering applications. Conventional approaches have often relied on expert knowledge to investigate new structures by trial and error, starting from known materials and considering a relatively small sub-region in the whole search space. Although the chemical space of small organic molecules consists of approximately 10^60 candidates,[1] the total number of currently known compounds is at most on the order of 10^8.[2] Hence, most of the chemical space remains unexplored, and the concept of computer-aided molecular design (CAMD) has emerged to accelerate this extremely slow discovery process.[3]

An early attempt by Joback and Stephanopoulos framed CAMD as an optimization problem with rule-based molecule enumeration.[4] In many subsequent works, materials properties were optimized within a search space that was pre-constrained to a small subspace built from expert-selected molecular fragments or chemical rules, using heuristic optimization algorithms such as genetic algorithms[5,6] and Monte Carlo based stochastic optimization.[7] For example, Miyao et al.[8,9] used a set of chemically favorable fragments and designed templates of specific molecular graphs that were combined with some mixture models for property predictions to generate a desired class of candidate molecules. Although these methods were a major step forward in the history of CAMD, they still suffered from a lack of capability to handle the large and highly diverse discrete spaces of candidate molecules.

In recent years, a new family of CAMD algorithms has emerged, inspired by the great success of modern machine learning (ML) methods. In particular, to broaden the search space, ML methods that use probabilistic language models based on deep neural networks (DNNs) have proliferated intensively since 2017.[10] In these methods, a language model is trained on a given set of existing molecules, the chemical structures of which are translated into a set of strings according to the simplified molecular-input line-entry system (SMILES) chemical language.[11] Models trained to recognize chemically realistic structures are then used to refine chemical strings in the molecular design calculation.

[a] S. Wu,+ C. Liu,+ H. Yamada, R. Yoshida
The Institute of Statistical Mathematics, Research Organization of Information and Systems
10-3 Midori-cho, Tachikawa, Tokyo 190-8562, Japan
phone/fax +81 (0)50-5533-8534
E-mail: [email protected]
[email protected]
[b] S. Wu,+ R. Yoshida
The Graduate University for Advanced Studies, SOKENDAI
10-3 Midori-cho, Tachikawa, Tokyo 190-8562, Japan
[c] G. Lambard,+ R. Yoshida
Center for Materials Research by Information Integration (CMI2), Research and Services Division of Materials Data and Integrated System (MaDIS), National Institute for Materials Science (NIMS)
1-2-1 Sengen, Tsukuba, Ibaraki 305-0047, Japan
[d] H. Yamada
School of Pharmacy, Tokyo University of Pharmacy and Life Sciences
1432-1 Horinouchi, Hachioji, Tokyo 192-0392, Japan
[+] Equally contributed to this work
Supporting information for this article is available on the WWW under https://doi.org/10.1002/minf.201900107
© 2019 The Authors. Published by Wiley-VCH Verlag GmbH & Co. KGaA. This is an open access article under the terms of the Creative Commons Attribution Non-Commercial License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited and is not used for commercial purposes.
The copyright line for this article was changed on December 16, 2019 after original online publication.

© 2019 The Authors. Published by Wiley-VCH Verlag GmbH & Co. KGaA Mol. Inf. 2020, 39, 1900107 (1 of 9) 1900107
18681751, 2020, 1-2, Downloaded from https://onlinelibrary.wiley.com/doi/10.1002/minf.201900107 by Kanazawa University, Wiley Online Library on [29/05/2024]. See the Terms and Conditions (https://onlinelibrary.wiley.com/terms-and-conditions) on Wiley Online Library for rules of use; OA articles are governed by the applicable Creative Commons License
Promising examples have included various types of variational autoencoders,[12–15] generative adversarial networks,[16] recurrent neural networks,[17,18] and so on. These methods have been able to produce diverse chemical structures; however, they often require large training datasets to obtain a DNN-based generator that can produce chemically realistic molecules with grammatically valid SMILES. Datasets this large are unavailable in many applications. Furthermore, many of these methods generate chemically or grammatically invalid representations of molecules at relatively high rates, unless their hyperparameters are carefully tuned.[19–21]

Some previous works considered simpler generative models to avoid the need to train the model with large datasets. Yoshikawa et al.[22] exploited a grammatical evolution method with parallel computation to generate a diverse set of candidate molecules conditional on arbitrarily given design targets. Ikebata et al.[23] combined a simple probabilistic language model based on an n-gram representation of SMILES sequences with a Bayesian inference framework to sequentially modify a population of molecules into promising candidate molecules that would exhibit desired properties. For a more complete review of the above methods, Schwalbe-Koda and Gómez-Bombarelli[24] have provided a detailed overview of recent developments in inverse molecular design.

In this paper, we introduce iQSPR-X, a flexible software constructed to implement the Bayesian molecular design algorithm iQSPR, which was developed in our previous work.[23] The algorithm was implemented in XenonPy, a Python package with an integrated platform of materials informatics.[25] In contrast to the original iQSPR algorithm developed in R, the new version allows users to exploit various features of XenonPy as described below. The basic computational workflow consists of a two-step iteration: (1) current chemical structures are modified to new ones using a generator and (2) candidate molecules that show promise for desired properties are selected using an evaluator, which is a set of ML models for predicting material properties. The generator and the evaluator can be pre-trained separately with given training instances. Users can either train new models from scratch or reuse relevant pre-trained models from a model library in XenonPy, which covers a broad array of material properties for small molecules and polymers. In addition, when the available data on the structure-property relationship for a target task are limited, directly obtaining a reliable prediction model is difficult. However, an ML technique called transfer learning can be used to extract knowledge relevant to the target task from a large set of pre-trained models to help training new models more efficiently.[26] Successful application of the iQSPR method, in conjunction with transfer learning to overcome limited polymeric properties data, was demonstrated in our previous study, which achieved the discovery of new polymers with high thermal conductivity.[27] A set of tutorials distributed as Jupyter notebooks is available at the website of XenonPy,[25] and these include detailed explanations and sample codes for building customized generators and evaluators, performing the inverse design calculations, and using some of the convenient modules in XenonPy. In this paper, we highlight some key features of iQSPR-X and describe its application to the task of designing polymers using data from Polymer Genome (PG).[28,29]

2 Computational Methods

2.1 Bayesian Molecular Design

The primary task of the Bayesian molecular design is to draw a set of samples from the posterior distribution P(S | Y ∈ U), which represents the conditional probability of observing a chemical structure S, given material properties Y = {Yi | i = 1, …, m}, that lies in a target region U. In the iQSPR-X implementation, S is encoded as a SMILES string; i.e., S = s1s2…sn, where si is any valid character in SMILES. For example, phenol (C6H6O) can be represented by the SMILES string "C1=CC=C(C=C1)O", where C and O denote the carbon and oxygen atoms, respectively; "=" denotes a double bond; the two "1" digits denote the opening and closing of the ring structure; and the parentheses denote the beginning and ending of the branching component. According to Bayes' theorem, a posterior distribution is proportional to the product of a likelihood function and a prior distribution:

P(S | Y ∈ U) ∝ P(Y ∈ U | S) P(S)

where P(Y ∈ U | S) represents the likelihood function that evaluates the goodness-of-fit of S with respect to the given property requirement Y ∈ U, and P(S) represents the prior probability that S belongs to a predefined search space of SMILES strings. Thus, P(S) will deliver a small or even zero probability when presented with an unfavorable or chemically unrealistic structure, thereby acting as a filter for such out-of-scope or invalid structures. In iQSPR-X, a sequential Monte Carlo algorithm proposed by Ikebata et al.[23] is implemented. This algorithm is somewhat similar to a genetic algorithm. With a given set of initial samples S0 = {S0,i | i = 1, …, N} of size N, the pre-trained prior is used as a generator to propose a new set of samples S0'. A fitness score is then assigned to each sample in S0' using the likelihood, which is the evaluator in iQSPR-X. By resampling N samples from S0' in proportion to the fitness scores, a refined set S1 is obtained and once again modified by the generator. This cycle is repeated T times to obtain a final sample set ST.

There are three important building blocks in this algorithm: the generator (prior), the evaluator (likelihood), and the descriptor φ(S). When building models for the evaluator, we encode a chemical structure into a descriptor vector φ(S) using, for example, a molecular fingerprinting algorithm.
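The propose–score–resample cycle described above can be sketched in a few lines of generic Python. This is an illustrative toy, not the iQSPR-X API: `smc_design`, `propose`, and `fitness` are hypothetical names standing in for the trained n-gram generator and the likelihood-based evaluator, and the integer "structures" are placeholders for SMILES strings.

```python
import math
import random

def smc_design(population, propose, fitness, n_steps, rng):
    """Toy sequential Monte Carlo loop: propose, score, resample."""
    for _ in range(n_steps):
        # (1) generator step: modify every current structure
        candidates = [propose(s) for s in population]
        # (2) evaluator step: assign a fitness score to each candidate
        scores = [fitness(s) for s in candidates]
        # resample N structures in proportion to their fitness scores
        population = rng.choices(candidates, weights=scores, k=len(population))
    return population

# Toy usage: integer "structures" drift toward a target property value of 10.
rng = random.Random(0)
final = smc_design(
    population=[0] * 100,
    propose=lambda s: s + rng.choice([-1, 0, 1]),
    fitness=lambda s: math.exp(-abs(s - 10)),
    n_steps=50,
    rng=rng,
)
```

In iQSPR-X itself, the generator is the pre-trained n-gram prior and the scores come from the Gaussian likelihood; only the resampling logic is generic.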


Using training instances {(Yk, Sk) | k = 1, …, Ndata} on the structure-property relationships, we then derive a model that describes the materials properties Y as a function of the descriptor φ(S), defining Ŷ = μ(φ(S)) with the trained model μ. Although iQSPR-X allows users to plug in customized functions for each building block, we also provide some commonly used functions internally, and these can be directly called from the package. For the descriptor, all available fingerprint types in RDKit[30] and the Python descriptor package Mordred[31] are available by default. Users can alternatively use a set of features extracted from pre-trained neural networks in the XenonPy model library, as described in the next section. For the evaluator, a Gaussian likelihood is given as a choice with any user-defined model μi(φ(S)) and the standard deviation σi(φ(S)), which represents the uncertainty of predicted properties:

P(Y ∈ U | S) = ∫_U ∏_{i=1}^{m} N(yi | μi(φ(S)), σi(φ(S))²) dy

where U is the target region in the m-dimensional space, and μi(φ(S)) and σi(φ(S)) are the mean and standard deviation for the ith property, respectively, obtained from ML models with input φ(S). For the generator, the extended n-gram model developed by Ikebata et al.[23] can be used by training it with any chemical structures given in SMILES. The model takes the form P(S) = P(s1) ∏_{i=2}^{n} P(si | si−1, …, s1). Figure 1 summarizes the computational workflow of iQSPR-X.

2.2 Generator: Extended n-gram Model

The role of the generator is to propose new candidate molecules modified from a set of initial molecules. We implemented the extended n-gram model as an internally available function in iQSPR-X. This model consists of two components: (1) a table that records the probability of observing a subsequent character given a substring and (2) a function that modifies a given SMILES string based on the stored n-gram probability table. The table can be trained by supplying a set of SMILES strings sampled from the desired search space. The maximum length of a substring to be considered and stored in the table is controlled by the "order" parameter. In the extended n-gram model, SMILES strings are internally tokenized into a list of characters. For example, "=O" and "%10" are considered as one character, and a terminal character is automatically added at the end of each string. When proposing a new candidate molecule, the modifier function deletes a random number of characters from the end of the SMILES string, and then elongates the shortened string based on the n-gram table. Because the representation of a molecule in SMILES is not unique, a reordering of the SMILES string is probabilistically performed to avoid constantly modifying the same part of the chemical structure.

In short, the most important parameters in modelling the generator include the probability required to trigger reordering, the range of the number of letters to be deleted, and the order parameter controlling the maximum length of a substring in training and sampling the n-gram model. Users can adjust these parameters based on the expected molecule size in the targeted search space. Although SMILES is a powerful representation of chemical structures, as exemplified by its ability to handle chirality using the "@" symbol, the non-uniqueness of SMILES representations may lead to subtle effects in certain usages. For example, the aromatic ring in phenol can be represented as "C1=CC=CC=C1" or "c1ccccc1". We recommend that users not mix different representations of the same molecular structure when training the extended n-gram model.

2.3 Evaluator: Likelihood Function

The role of the evaluator is to provide a fitness score for a candidate molecule to estimate how likely the candidate possesses the desired properties. iQSPR-X allows users to write their own evaluator, which receives a list of molecules, converts them to a set of descriptors using a pre-set descriptor conversion function, and returns a list of corresponding log-likelihood values. A Gaussian likelihood function can also be used if users select a desired descriptor and provide an ML model that returns the mean and standard deviation for a given set of descriptors as input.

Figure 1. Computational workflow in iQSPR-X with three main building blocks that users can flexibly construct: the generator, the evaluator, and the converter that translates an input chemical structure into a descriptor vector.
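The two components of the extended n-gram generator in Section 2.2 (the probability table and the delete-then-elongate modifier) can be illustrated with a simplified character-level sketch. This is a hedged approximation, not the iQSPR-X implementation: the function names `train_ngram` and `modify` are hypothetical, tokenization is plain per-character (the real model treats units such as "%10" as single tokens), and the probabilistic reordering step is omitted.

```python
import random
from collections import defaultdict

TERMINAL = "$"  # stand-in for the terminal character appended to each string

def train_ngram(smiles_list, order):
    """Count next-character frequencies for every context up to length `order`."""
    table = defaultdict(lambda: defaultdict(int))
    for s in smiles_list:
        s += TERMINAL
        for i in range(1, len(s)):
            for k in range(1, min(order, i) + 1):
                table[s[i - k:i]][s[i]] += 1
    return table

def modify(smiles, table, order, rng, max_delete=10, max_len=80):
    """Delete a random number of trailing characters, then elongate from the
    n-gram table (assumes len(smiles) >= 2)."""
    prefix = smiles[:len(smiles) - rng.randint(1, min(max_delete, len(smiles) - 1))]
    while len(prefix) < max_len:
        for k in range(min(order, len(prefix)), 0, -1):  # back off to shorter contexts
            counts = table.get(prefix[-k:])
            if counts:
                chars, weights = zip(*counts.items())
                nxt = rng.choices(chars, weights=weights, k=1)[0]
                break
        else:
            return prefix  # no known context: stop elongating
        if nxt == TERMINAL:
            return prefix
        prefix += nxt
    return prefix
```

Raising `order` lets the sampler reproduce longer observed substrings, which is why the paper ties its value to the expected molecule size.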


2.4 Pre-trained Neural Descriptors in XenonPy

One of the most distinctive features of our software is the availability in XenonPy of a comprehensive set of pre-trained neural features for use as the descriptor φ(S). The sampling efficiency of iQSPR-X is highly influenced by the reliability of the evaluator that predicts the material properties for any given chemical structure. Building such models from scratch is often time-consuming and requires a large set of training data, which is not available in many applications. XenonPy currently provides 140,000 pre-trained neural networks for the prediction of physical, chemical, electronic, thermodynamic, and mechanical properties of small organic molecules, polymers, and inorganic crystalline materials, with models for 15, 18, and 12 properties of these material types, respectively. The models are distributed as MXNet[32] (R) and/or PyTorch[33] (Python) model objects. The distributed API (application programming interface) allows users to query the XenonPy.MDL database. Users can directly use a retrieved model relevant to the target task, if available, or can re-train a pre-trained model on the target task using a transfer learning technique as described below. Transfer learning has significant potential to overcome the problem of limited materials property data, as demonstrated in our previous study[26] for various materials science tasks. Other studies have also shown promising applications of transfer learning in materials informatics.[18,34–41]

In this study, we applied a specific type of transfer learning using pre-trained neural networks. For a target property, a neural network pre-trained on proxy properties is available in the library, where the source datasets are sufficiently large. If the two properties are physically or chemically interrelated, the pre-trained models can be expected to autonomously acquire common features relevant to the proxy properties. The features learned by solving the related tasks are partially transferable to the descriptor φ(S) in a model constructed for the target task. In general, earlier or shallower layers in a neural network tend to acquire general features to form the basis of the material descriptions, and only the last one or two layers identify specific features for the prediction of a source property. In iQSPR-X, we freeze the shallower layers for use as a feature extractor. A subnetwork φ(S) of such a pre-trained model can be reused in the supervised learning of the target property. To simplify the implementation of the repetitious tasks of neural descriptor extraction, XenonPy provides users with an internal function to extract values from any hidden layer in a pre-trained neural network. With its large library of pre-trained models and wide range of built-in descriptors, XenonPy provides a strong foundation for flexibly arranging the necessary building blocks of the iQSPR algorithm.

3 Results and Discussion

3.1 Data

We used data from PG to illustrate the use of iQSPR-X based on an example motivated from a previous study on polymer design.[28] PG is an open database for polymeric properties that currently contains 854 polymers composed of nine types of atoms (H, C, O, N, S, F, Cl, Br, and I) with experimental data for three material properties (glass transition temperature, density, and solubility parameter) and computational data from density functional theory (DFT) for four material properties (bandgap (Egap), refractive index, dielectric constant (ɛtot), and atomization energy). Using a subset of the data (4-block polymers composed of CH2, NH, CO, C6H4, C4H2S, CS, and O), Mannodi-Kanakkithodi et al.[28] designed 6- to 12-block polymers with high ɛtot for insulator applications using ML models and a genetic algorithm. They were specifically interested in polymers with higher ɛtot and Egap, and this goal was adopted in our example. The given data of the chemical structures S and their materials properties were used to train the generator and the evaluator. Here, we considered S to be the SMILES strings of the repeating polymer units. The connection points, i.e., the head and tail of a monomer, were denoted as "*".

In PG, the lowest-energy crystal structures of the polymers were used for the DFT calculation. For each polymer, Egap was computed using a hybrid Heyd-Scuseria-Ernzerhof (HSE06) electronic exchange-correlation functional, and ɛtot, which is the sum of the electronic and ionic dielectric constants, was computed using density functional perturbation theory (DFPT). Mannodi-Kanakkithodi et al.[28] have detailed this computational procedure. As shown in Figure 2a, we observed an inverse relation between ɛtot and Egap. Polymers containing thiophene (C4H2S) tended to reach high ɛtot, but generally had low Egap. In contrast, polymers containing fluorine (F) atoms tended to reach high Egap, but generally had low ɛtot. However, in contrast to the enrichment offered by either C4H2S or F atoms, polymers exhibiting high ɛtot and high Egap tended to be composed of CH2, NH, CO, C6H4 and O.[28] The design objective was to solve this nontrivial trade-off problem.

3.2 Training the Generator

In this study, we considered two ways to train the extended n-gram model. First, we used all 854 polymers in PG as a training set, which covered a wide variety of polymers. Second, we focused only on specific types of chemical structures that shared some common features, taken from other data sources. In practice, users may often be interested in designing a specific class of molecules. Here, we explored F-containing polymers with high ɛtot and Egap.
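The freeze-and-reuse scheme of Section 2.4 can be sketched with plain NumPy, independent of any deep-learning framework. Everything here is a hedged illustration: the "pre-trained" hidden layer is a random stand-in for a subnetwork taken from XenonPy.MDL, the data are synthetic, and only the final linear readout is refit on the small target dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a frozen, pre-trained subnetwork: one hidden tanh layer whose
# weights are imagined to have been learned on a large proxy-property dataset.
W = rng.normal(size=(16, 1))
b = rng.normal(size=16)

def phi(x):
    """Frozen feature extractor φ(S): the shallower layer is never updated."""
    return np.tanh(x @ W.T + b)

# Small target-property dataset (toy: y is a smooth function of x).
x = np.linspace(-2, 2, 40).reshape(-1, 1)
y = np.sin(x).ravel()

# Transfer step: train only a linear readout on top of the frozen features.
features = phi(x)
coef, *_ = np.linalg.lstsq(features, y, rcond=None)
pred = features @ coef
```

In iQSPR-X, φ would be the values extracted from a chosen hidden layer of a pre-trained XenonPy model, and a Bayesian linear regressor could replace the plain least-squares readout.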


In particular, we focused on a training set containing the fragment "C(F)(F)N", which was taken from PubChem.[2,42] Because of the extremely high diversity of chemical structures in PubChem, we used multiple steps to extract training molecules: (1) we screened molecules in PubChem continuously until 5,000 molecules having the desired fragment were found (screening a total of over 36,940,000 molecules); (2) we reduced the number of molecules to 3,860 that consisted only of C, O, N, F, and/or S atoms; (3) we finally extracted 2,485 molecules by filtering out those that had more than six F atoms or included more than one molecule in a single SMILES string (SMILES strings with "."). The final training set was formed by the union of these selected PubChem molecules and the set of PG polymers. The order parameter controlling the length of a substring for training and sampling the n-gram tables was set to 20 for both cases, after examining the distribution of the SMILES lengths of the molecules in PG (see Figure 2b). In the construction of a training set, duplicates of each SMILES string were generated by performing random reordering of the string at most 15 times. This step is important to avoid the occurrence of unseen substring patterns during the generation of new molecules, considering that we set the reorder probability to 0.5 during the molecular generation. Figure 3 illustrates how the two different generators modified molecules step-by-step starting from the same initial chemical structures. The generator trained on the PubChem molecules showed a stronger tendency to include F-containing fragments during the modification process.

3.3 Training the Evaluator

We conducted a series of experiments to obtain a model for the Gaussian likelihood function. As a descriptor, we used pre-trained neural network models in XenonPy that were trained with 10 different types of fingerprints available in RDKit: atom pair and topological torsion fingerprints, Morgan fingerprints (both feature-based and not feature-based), basic fingerprints in RDKit, and five more that were obtained by adding the MACCS keys to the five listed fingerprints.

Figure 2. Summary of observed data in PG. (a) Joint distribution of ɛtot and Egap. Red dots denote all polymers containing F atoms, green dots denote those having C4H2S as fragments, and blue dots denote all other polymers. (b) Histogram of the lengths of SMILES strings in PG.

Figure 3. Modification of molecules using extended n-gram models trained with different datasets. The same five chemical structures in PG were successively modified five times according to generators that were trained on 854 polymers from PG (top) and with the 854 polymers from PG and 2,485 F-containing molecules from PubChem (bottom).
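The three-step PubChem screening described in Section 3.2 amounts to a chain of simple filters. The sketch below is a rough approximation under stated assumptions: in the actual workflow the fragment test would be a substructure match (e.g., via RDKit), whereas here it is naive SMILES substring containment, and the element check is a simplified character scan that does not handle two-letter elements, bracket atoms, or stereochemistry.

```python
ALLOWED = set("CONFScons")              # C, O, N, F, S plus aromatic lowercase forms
STRUCTURE = set("()=#%.\\/0123456789")  # bonds, rings, branches, dot separator

def allowed_elements(smiles):
    """True if every atom symbol is C, O, N, F, or S (simplified parse)."""
    return all(c in ALLOWED or c in STRUCTURE for c in smiles)

def screen(smiles_list, fragment, max_f=6):
    """Steps (1)-(3): fragment hit, element filter, then F-count and '.' filter."""
    hits = [s for s in smiles_list if fragment in s]   # (1) naive fragment match
    hits = [s for s in hits if allowed_elements(s)]    # (2) C/O/N/F/S only
    # (3) drop >6 fluorine atoms or multi-molecule records ('.' in SMILES)
    return [s for s in hits if s.count("F") <= max_f and "." not in s]
```

The chained-filter structure mirrors the paper's counts (5,000 fragment hits, then 3,860, then 2,485), though the numbers themselves depend on the real substructure matcher.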


With each of the 10 fingerprints, 100 randomly constructed neural networks, each having six fully connected layers, were trained with either the ɛtot or Egap datasets; the number of epochs was 2,000 and the dropout rate was 0.1. The dataset was randomly separated into training and validation sets at a ratio of 8:2. Figure 4 shows a comparison of the validated mean absolute errors (MAEs) across the different fingerprint descriptors. The atom pair fingerprints with the MACCS keys showed consistently high performance on both ɛtot and Egap, and were therefore selected for use in the Bayesian molecular design.

The default model in the Gaussian likelihood function is set to be a Bayesian linear regression model. Users can directly train the model with their input data using a one-line Python script. In this paper, we considered three more approaches to constructing models that return μ and σ: (1) bagging to calculate a bootstrap variance σ for any deterministic model, (2) random forests combined with a jackknife method,[43] and (3) Bayesian linear models with neural descriptors extracted from pre-trained models.[26] In our example, we tested these different methods to select the best prediction model. Five-fold cross validation (CV) was performed on both ɛtot and Egap, and the model with the best prediction performance was selected.

For the bagging approach, the gradient boosting method in scikit-learn[44] was used, and the training data in each fold of the five-fold CV were further divided into 10 non-overlapping bags. Each bag produced a gradient boosting regressor under the default setting in scikit-learn. The mean and standard deviation of the predicted values from the 10 trained models were taken as μ and σ, respectively.

For the random forest approach, the forestci package was used along with the random forest method in scikit-learn to calculate μ and σ. The number of trees was set to 500, and the "max_features" option was set to "sqrt".

For the Bayesian linear regression with neural descriptors, we began by selecting pre-trained models from the model library in XenonPy for each of the two target properties. The 100 pre-trained neural networks of ɛtot and Egap were modified such that the last hidden layers were connected to Bayesian linear regressors, and the prediction performances of the models were then evaluated by 10-fold CV applied to the training data within each fold of the five-fold CV. For each of ɛtot and Egap, the model that achieved the overall lowest MAE was selected, and their last hidden layers were concatenated to form a new neural descriptor. This descriptor was used to replace the originally selected descriptor in the default Gaussian likelihood function. Finally, this evaluator was trained with the full training data within each fold of the five-fold CV.

Figure 5 shows the performance of each model on the five-fold CV for the ɛtot and Egap datasets. The bagging approach with the gradient boosting model achieved the best overall performance and was therefore selected for the inverse design calculation.

Figure 4. Box-plots of the MAEs across different fingerprint descriptors evaluated on the validation datasets of either ɛtot or Egap. APFP denotes the atom pair fingerprints, ECFP denotes the non-feature-based radius-3 Morgan fingerprints, FCFP denotes the feature-based radius-3 Morgan fingerprints, TTFP denotes the topological torsion fingerprints, RDKit denotes the basic fingerprints in RDKit, and +M denotes the addition of the MACCS keys.

Figure 5. Prediction performance of different models on the five-fold CV for the ɛtot and Egap datasets. GB denotes bagging with gradient boosting, RF denotes random forests with jackknife-based uncertainty quantification, and NN denotes pre-trained neural networks with their last hidden layers connected to Bayesian linear regressors.
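The bagging construction of μ and σ used for the Gaussian likelihood can be sketched generically: split the training data into 10 non-overlapping bags, fit one deterministic regressor per bag, and take the mean and standard deviation of the 10 predictions. The sketch below substitutes a simple least-squares line for the gradient boosting regressor, so it illustrates the bagging logic only, not the actual model.

```python
import numpy as np

def fit_bags(x, y, n_bags=10, rng=None):
    """Fit one simple regressor (here: a straight line) per non-overlapping bag."""
    rng = rng or np.random.default_rng(0)
    idx = rng.permutation(len(x))
    models = []
    for bag in np.array_split(idx, n_bags):
        slope, intercept = np.polyfit(x[bag], y[bag], deg=1)
        models.append((slope, intercept))
    return models

def predict_mu_sigma(models, x_new):
    """μ = mean and σ = std of the per-bag predictions (bootstrap-style spread)."""
    preds = np.array([a * x_new + b for a, b in models])
    return preds.mean(axis=0), preds.std(axis=0)

# Toy data: y = 2x + 1 with small noise.
x = np.linspace(0, 10, 200)
y = 2.0 * x + 1.0 + np.random.default_rng(1).normal(scale=0.1, size=200)
mu, sigma = predict_mu_sigma(fit_bags(x, y), np.array([5.0]))
```

Any deterministic regressor can replace the line fit inside the loop; the σ it yields is the between-bag spread, which the paper further inflates by a small pre-set constant in the design runs.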


3.4 Design Results

In this example, we set the target property region to be


ɛtot > 4.5 and Egap > 5 eV. Three rounds of iQSPR-X were
executed using different setups to compare the effect of
various components of the algorithm. The first run used the
generator trained with molecules in PG, and in the inverse
design calculation, 100 initial samples were randomly
selected from the 854 molecules in PG. The second run
used the same generator, but the 100 initial samples were
randomly selected from a subset of the molecules in PG
that had a relatively low ɛtot or Egap (ɛtot < 4 or Egap < 4.5 eV).
The third run used the same initial samples as in the first
run, but the generator was trained with the PG and
PubChem molecules, as detailed in the previous section.
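The initial-sample selection for the first two runs amounts to a simple filter-then-sample step, which can be sketched with pandas. The column names and the synthetic property values below are placeholders standing in for the PG data, not the actual dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-in for the 854 PG polymers with their two properties
pg = pd.DataFrame({
    "smiles": [f"mol_{i}" for i in range(854)],
    "e_tot": rng.uniform(2.0, 6.0, 854),
    "e_gap": rng.uniform(2.0, 8.0, 854),
})

# Run 1: 100 initial samples drawn uniformly from all of PG
init_run1 = pg.sample(n=100, random_state=42)

# Run 2: restrict the pool to relatively low e_tot or E_gap first
low = pg[(pg["e_tot"] < 4.0) | (pg["e_gap"] < 4.5)]
init_run2 = low.sample(n=100, random_state=42)
```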
Other components of iQSPR-X were kept the same for all three runs. The n-gram order parameter was set to 20, the number of letters to be deleted ranged from 1 to 10, and the reordering probability was set to 0.5.
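The modify step behind these parameters, deleting the last few letters of a SMILES string and regrowing it from an n-gram model, can be illustrated with a toy character-level model. This sketch deliberately ignores SMILES validity and XenonPy's actual generator class; the small order of 3 and the tiny training set are illustrative only.

```python
import random
from collections import defaultdict

def train_ngram(strings, order=3):
    """Count next-character candidates for each trailing context."""
    counts = defaultdict(list)
    for s in strings:
        s = s + "$"                      # end-of-string marker
        for i in range(len(s)):
            ctx = s[max(0, i - order):i]
            counts[ctx].append(s[i])
    return counts

def modify(s, counts, order=3, del_range=(1, 10), rng=random):
    """Delete 1-10 trailing characters, then regrow from the model."""
    n_del = rng.randint(*del_range)
    s = s[:max(1, len(s) - n_del)]
    while len(s) < 80:                   # safety cap on string length
        ctx = s[-order:]
        if ctx not in counts:
            break
        c = rng.choice(counts[ctx])
        if c == "$":                     # model chose to terminate
            break
        s += c
    return s

train = ["CCO", "CCN", "CCCC", "c1ccccc1", "CC(=O)O"]
model = train_ngram(train)
random.seed(0)
new = modify("CCCO", model)
```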
The descriptor was selected to be a concatenation of the atom pair fingerprints and the MACCS keys in RDKit.

Figure 6. Comparison of the best candidate molecules from a previous study[28] and the top 25 candidate molecules generated from iQSPR-X. The optimal combinations of 8 to 12 building blocks that were proposed in the previous study are shown as a comparison.

The evaluator was selected to be the Gaussian likelihood with 10 "bags" of gradient boosting models trained on 10-fold CV of the full PG datasets for ɛtot and Egap. The mean function μ was given by the mean of the predictions from the 10 models. For practical purposes, the variance function
σ2 was composed of the bootstrap variance plus a small pre-set constant (0.04 for ɛtot and 0.09 for Egap). To avoid becoming trapped in a local region of the search space, an annealing schedule was applied: the likelihood scores were raised to a sequence of powers increasing from 0 to 1, which corresponds to gradually transforming the sampling distribution from a uniform distribution into the actual posterior distribution. Empirically, a slow cooling schedule is recommended. We started with 20 steps of powers linearly
increasing from 0 to 0.2, 10 additional steps linearly
increasing from 0.2 to 0.4, and another 10 steps linearly
increasing from 0.4 to 1. Finally, we performed another
60 steps with the power fixed at 1, and these samples were
recorded as candidate molecules.
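The annealing schedule above can be written down directly, together with a target-region score. Reading the "Gaussian likelihood" for a region such as ɛtot > 4.5 as the Gaussian tail probability P(y > threshold) under N(μ, σ²) is our assumption here for illustration, not necessarily XenonPy's exact implementation; the schedule itself follows the step counts stated in the text.

```python
import numpy as np
from math import erf, sqrt

def tail_prob(mu, var, threshold):
    """P(y > threshold) under N(mu, var): score for an 'above target' region."""
    sigma = sqrt(var)
    return 0.5 * (1.0 - erf((threshold - mu) / (sigma * sqrt(2.0))))

# Annealing schedule from the text: 20 steps 0 -> 0.2, 10 steps 0.2 -> 0.4,
# 10 steps 0.4 -> 1, then 60 steps fixed at 1 (samples recorded as candidates).
betas = np.concatenate([
    np.linspace(0.0, 0.2, 20),
    np.linspace(0.2, 0.4, 10),
    np.linspace(0.4, 1.0, 10),
    np.full(60, 1.0),
])

# Annealed likelihood at step t is likelihood ** betas[t]; e.g. a molecule
# predicted at mu = 5.0 with bootstrap variance 0.1 plus the 0.04 constant:
score = tail_prob(mu=5.0, var=0.1 + 0.04, threshold=4.5) ** betas[25]
```

At beta = 0 every molecule scores 1 (uniform sampling); at beta = 1 the scores are the full posterior weights.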
Movies S1, S2, and S3 in the supplementary materials
demonstrate how the candidate molecules proposed in
each step of the sequential Monte Carlo approach the
target region. In the first run, one of the initial samples was
observed to reach the target region, and a number of the
samples continued to explore structures similar to that
molecule, whereas other samples pursued alternative
possibilities. In the end, the best proposed candidate
molecules converged to molecules similar to those found in
the previous study[28] (see Figure 6).
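The sequential Monte Carlo loop visible in these movies can be sketched generically: score the current samples, resample them in proportion to the annealed likelihood, then perturb each survivor with the generator. The `score` and `mutate` functions below are toy stand-ins for the property evaluator and the n-gram modify step.

```python
import random

def smc(init_samples, score, mutate, betas, rng=random):
    """Generic sequential Monte Carlo over an annealing schedule."""
    samples = list(init_samples)
    for beta in betas:
        weights = [score(s) ** beta for s in samples]
        total = sum(weights)
        if total == 0:                   # degenerate case: all scores zero
            weights = [1.0] * len(samples)
        # Resample with replacement in proportion to the annealed scores
        samples = rng.choices(samples, weights=weights, k=len(samples))
        # Perturb each survivor (the generator's modify-and-regrow step)
        samples = [mutate(s) for s in samples]
    return samples

# Toy problem: drive integers toward a target value of 10
random.seed(0)
target = 10
score = lambda x: 1.0 / (1.0 + abs(x - target))
mutate = lambda x: x + random.choice([-1, 0, 1])
betas = [i / 19 for i in range(20)]
final = smc(range(5), score, mutate, betas)
```

The resampling step explains the behavior described above: once one sample reaches the target region, its neighborhood attracts copies, while the remaining weight keeps other samples exploring alternatives.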
Figure 7. Comparison of the top 25 candidate molecules generated from iQSPR-X with initial samples randomly drawn from polymers in PG with ɛtot < 4 or Egap < 4.5 eV and the top 25 candidate molecules generated using an extended n-gram model trained with samples from both PG and PubChem.

In contrast, with a significantly different set of initial samples, the second run struggled to converge to candidate molecules similar to those of the first run. Instead, it became trapped at molecules with complex ring structures (see Figure 7). For an intractable trade-off problem such as this example, a small finite set of samples cannot support full


exploration of the search space. Increasing the number of samples is an intuitive yet computationally intensive solution. An alternative is to adjust parameters in the iQSPR-X algorithm, such as the initial samples, the n-gram order, the number of letters to be deleted, and so on.

iQSPR-X can also be used to search a specific molecular subspace intensively. In the third run, the generator showed a clear tendency to attach F-containing fragments to chemical structures after being trained with thousands more samples from PubChem. As a result, we observed the frequent appearance of molecules with relatively higher Egap during the sequential Monte Carlo iterations. The best candidate molecules were composed of the F-containing fragments with different combinations of the CH2, NH, and CO blocks (see Figure 7).

4 Conclusions

iQSPR-X is an ML engine for generating a target-specific molecular library. XenonPy provides an all-in-one ML-based materials design platform in which descriptor calculations, property prediction models for high-throughput screening, molecular library generators, and inverse design algorithms are all available as independent modules. Users can either take these as pre-existing functions in XenonPy or build them flexibly to accommodate their own needs in conjunction with other major ML and materials informatics Python packages. Moreover, transfer learning offers further capability and convenience to the ML technique. By implementing the iQSPR algorithm on the XenonPy platform, users can fully enjoy the benefits of a wide range of pre-existing functions and models that greatly simplify the process of establishing a Bayesian inverse design algorithm. Detailed tutorials for each component are available on the XenonPy website.[25]

In this paper, we demonstrated some basic functionalities of iQSPR-X by applying it to the task of designing polymers exhibiting high ɛtot and Egap. We showed how changes to the setup of iQSPR-X, such as the initial sample sets and the generator, can influence the outcome of the computational workflow. One of our runs identified chemical structures similar to the best candidate molecules proposed in the original study. Furthermore, we demonstrated that by including a focused set of molecules in the training process of the generator, we were able to guide the algorithm to search a particular subspace of the large molecular space. Moreover, although users can quickly start the inverse design process using the default functions and setups, the true potential of the algorithm is realized by building customized modules for a variety of tasks in materials science. The XenonPy project aims to gather contributions from various users in diverse fields of materials and data science. Contributors are highly welcome to share and implement their own code in XenonPy as off-the-shelf modules.

5 Supplementary Materials

1) Movie S1-PG_basic.avi: This video shows the evolution of the material property values of the proposed candidate molecules in each step of the XenonPy-iQSPR iterations for the first run of our example. Blue dots denote the original data from PG, and red dots denote the proposed candidate molecules, with the radius of the dots proportional to the sum of the predicted variances of ɛtot and Egap. Beta refers to the power value used in the annealing schedule.
2) Movie S2-PG_lowVal.avi: This video is of the same type as Movie S1, for the second run of our example.
3) Movie S3-PG_Pubchem.avi: This video is of the same type as Movie S1, for the third run of our example.

Conflict of Interest

None declared.

Acknowledgements

This work was supported in part by the Materials Research Information Integration Initiative (MI2I) of the Support Program for Starting Up Innovation Hub from the Japan Science and Technology Agency (JST). R.Y. acknowledges financial support from a Grant-in-Aid for Scientific Research (B) 15H02672 and a Grant-in-Aid for Scientific Research (A) 19H01132 from the Japan Society for the Promotion of Science (JSPS), JST CREST Grant Number JPMJCR19I1, Japan, and JSPS KAKENHI Grant Number JP19H05820. S.W. gratefully acknowledges financial support from JSPS KAKENHI Grant Number JP18K18017.


[10] B. Sanchez-Lengeling, A. Aspuru-Guzik, Science 2018, 361, 360–365.
[11] D. Weininger, J. Chem. Inf. Comput. Sci. 1988, 28, 31–36.
[12] M. J. Kusner, B. Paige, J. M. Hernández-Lobato, PMLR 2017, 70, 1945–1954.
[13] A. Kadurin, S. Nikolenko, K. Khrabrov, A. Aliper, A. Zhavoronkov, Mol. Pharmaceutics 2017, 14, 3098–3104.
[14] J. Lim, S. Ryu, J. W. Kim, W. Y. Kim, J. Cheminf. 2018, 10, 31.
[15] R. Gómez-Bombarelli, J. N. Wei, D. Duvenaud, J. M. Hernández-Lobato, B. Sánchez-Lengeling, D. Sheberla, J. Aguilera-Iparraguirre, T. D. Hirzel, R. P. Adams, A. Aspuru-Guzik, ACS Cent. Sci. 2018, 4, 268–276.
[16] B. Sanchez-Lengeling, C. Outeiral, G. L. Guimaraes, A. Aspuru-Guzik, ChemRxiv 2017, 10.26434/chemrxiv.5309668.v3.
[17] X. Yang, J. Zhang, K. Yoshizoe, K. Terayama, K. Tsuda, Sci. Technol. Adv. Mater. 2017, 18, 972–976.
[18] M. H. S. Segler, T. Kogej, C. Tyrchan, M. P. Waller, ACS Cent. Sci. 2018, 4, 120–131.
[19] W. Jin, R. Barzilay, T. Jaakkola, PMLR 2018, 80, 2323–2332.
[20] H. Kajino, arXiv 2018, 1809.02745.
[21] N. Brown, M. Fiscato, M. H. S. Segler, A. C. Vaucher, J. Chem. Inf. Model. 2019, 59, 1096–1108.
[22] N. Yoshikawa, K. Terayama, M. Sumita, T. Homma, K. Oono, K. Tsuda, Chem. Lett. 2018, 47, 1431–1434.
[23] H. Ikebata, K. Hongo, T. Isomura, R. Maezono, R. Yoshida, J. Comput.-Aided Mol. Des. 2017, 31, 379–391.
[24] D. Schwalbe-Koda, R. Gómez-Bombarelli, arXiv 2019, 1907.01632.
[25] XenonPy, https://xenonpy.readthedocs.io/en/latest/, last accessed on July 20, 2019.
[26] H. Yamada, C. Liu, S. Wu, Y. Koyama, S. Ju, J. Shiomi, J. Morikawa, R. Yoshida, ACS Cent. Sci. 2019, online pre-release.
[27] S. Wu, Y. Kondo, M.-A. Kakimoto, B. Yan, H. Yamada, I. Kuwajima, G. Lambard, K. Hongo, Y. Xu, J. Shiomi, C. Schick, J. Morikawa, R. Yoshida, npj Comput. Mater. 2019, 5, 66.
[28] A. Mannodi-Kanakkithodi, G. Pilania, T. D. Huan, T. Lookman, R. Ramprasad, Sci. Rep. 2016, 6, 20952.
[29] C. Kim, A. Chandrasekaran, T. D. Huan, D. Das, R. Ramprasad, J. Phys. Chem. C 2018, 122, 17575–17585.
[30] G. Landrum, http://www.rdkit.org, last accessed on July 20, 2019.
[31] H. Moriwaki, Y.-S. Tian, N. Kawashita, T. Takagi, J. Cheminf. 2018, 10, 4.
[32] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, Z. Zhang, arXiv 2015, 1512.01274.
[33] PyTorch, https://pytorch.org/, last accessed on July 20, 2019.
[34] M. L. Hutchinson, E. Antono, B. M. Gibbons, S. Paradiso, J. Ling, B. Meredig, arXiv 2017, 1711.05099.
[35] H. Oda, S. Kiyohara, K. Tsuda, T. Mizoguchi, J. Phys. Soc. Jpn. 2017, 86, 123601.
[36] R. Jalem, K. Kanamori, I. Takeuchi, M. Nakayama, H. Yamasaki, T. Saito, Sci. Rep. 2018, 8, 5845.
[37] T. Yonezu, T. Tamura, I. Takeuchi, M. Karasuyama, Phys. Rev. Mater. 2018, 2, 113802.
[38] B. Kailkhura, B. Gallagher, S. Kim, A. Hiszpanski, T. Y.-J. Han, arXiv 2019, 1901.02717.
[39] E. D. Cubuk, A. D. Sendek, E. J. Reed, J. Chem. Phys. 2019, 150, 214701.
[40] X. Li, Y. Zhang, H. Zhao, C. Burkhart, L. C. Brinson, W. Chen, Sci. Rep. 2018, 8, 13461.
[41] M. Kaya, S. Hajimirza, Sci. Rep. 2019, 9, 5034.
[42] S. Kim, J. Chen, T. Cheng, A. Gindulyte, J. He, S. He, Q. Li, B. A. Shoemaker, P. A. Thiessen, B. Yu, L. Zaslavsky, J. Zhang, E. E. Bolton, Nucleic Acids Res. 2019, 47, D1102–1109.
[43] S. Wager, T. Hastie, B. Efron, J. Mach. Learn. Res. 2014, 15, 1625–1651.
[44] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, J. Mach. Learn. Res. 2011, 12, 2825–2830.

Received: August 16, 2019
Accepted: October 14, 2019
Published online on November 5, 2019

