Schwaller 2021 Mach. Learn. Sci. Technol. 2 015016
PAPER
Published 31 March 2021
Supplementary material for this article is available online
1. Introduction
Chemical reactions in organic chemistry are described by writing the structural formula of reactants and
products separated by an arrow, representing the chemical transformation by specifying how the atoms
rearrange between one or several reactant molecules and one or several product molecules [1]. Economic,
logistic, and energetic considerations drive chemists to prefer chemical transformations capable of
converting all reactant molecules into products with the highest yield possible. However, side-reactions, degradation of reactants, reagents, or products in the course of the reaction, equilibrium processes with incomplete conversion to a product, or simply losses during product isolation and purification undermine the quantitative conversion of reactants into products, which rarely reaches optimal performance.
Reaction yields are usually reported as a percentage of the theoretical chemical conversion, i.e. the fraction of reactant molecules successfully converted into the desired product relative to the theoretical maximum. It is not uncommon for chemists to synthesise a molecule in a dozen or more reaction
steps. Hence, low-yield reactions may have a disastrous effect on the overall route yield because of the
individual steps’ multiplicative effect. Therefore, it is not surprising that designing new reactions with yields
higher than existing ones attracts much effort in organic chemistry research.
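The multiplicative effect of step yields is easy to quantify: the overall yield of a linear route is simply the product of the individual step yields. A minimal sketch (illustrative numbers only):

```python
# Overall yield of a linear synthesis route: the product of step yields.
def route_yield(step_yields):
    """Fractional overall yield of a sequence of reaction steps."""
    total = 1.0
    for y in step_yields:
        total *= y
    return total

# Twelve steps at a respectable 80% each retain under 7% of the material.
print(f"{route_yield([0.80] * 12):.3f}")  # 0.069
```

Even a single low-yield step dominates: replacing one 80% step with a 20% step cuts the overall route yield by a further factor of four.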
In practice, specific chemical reaction classes are characterised by lower or higher yields, with the actual
value depending on the reaction conditions (temperature, concentrations, etc) and on the specific substrates.
Estimating the reaction yield can be a game-changing asset for synthesis planning. It provides chemists
with the ability to evaluate the overall yield of complex reaction paths, addressing possible shortcomings well
ahead of investing hours and materials in wet-lab experiments. Computational models predicting reaction
yields could support synthetic chemists in choosing an appropriate synthesis route among many predicted by
data-driven algorithms. Moreover, reaction yields prediction models could also be employed as scoring
functions in computer-assisted retrosynthesis route planning tools [2–5], to complement forward prediction
models [4, 6] and in-scope filters [2].
Most of the existing efforts in constructing models for the prediction of reactivity or of reaction yields
focused on a particular reaction class: oxidative dehydrogenations of ethylbenzene with tin oxide catalysts
[7], reactions of vanadium selenites [8], Buchwald–Hartwig aminations [9–11], and Suzuki–Miyaura
cross-coupling reactions [12–14]. To the best of our knowledge, there has been only one attempt to design a
general-purpose prediction model for reactivity and yields, without applicability constraints to a specific
reaction class [15]. In that work, the authors designed a model predicting whether the reaction yield is above or below a threshold value and concluded that the models and descriptors they considered could not deliver satisfactory results.
Here, we build on our legacy of treating organic chemistry as a language to introduce a new model that
predicts reaction yields starting from reaction SMILES [16]. More specifically, we fine-tune the rxnfp models
by Schwaller et al [17] based on a bidirectional encoder representations from transformers (BERT)-encoder
[18] by extending it with a regression layer to predict reaction yields. BERT encoders belong to the
transformer model family, which has revolutionised natural language processing [18, 19]. These models take
sequences of tokens as input to compute contextualised representations of all the input tokens, and can be
applied to reactions represented in the SMILES [20] format. In this work, we demonstrate, for the first time,
that these natural language architectures are very useful not only when working with language tokens but
also in providing descriptors of high quality to predict reaction properties such as reaction yields.
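As a sketch of this architecture, the snippet below pools a transformer encoder's first-token representation into a single scalar. It uses a tiny, randomly initialised encoder so that it is self-contained; the actual model fine-tunes the pretrained rxnfp BERT encoder, and all sizes here are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class YieldRegressor(nn.Module):
    """Transformer encoder + regression head (illustrative sizes)."""
    def __init__(self, vocab_size=100, d_model=64, dropout=0.1):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, nhead=4, dropout=dropout, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Sequential(nn.Dropout(dropout), nn.Linear(d_model, 1))

    def forward(self, token_ids):
        h = self.encoder(self.embed(token_ids))
        # Pool the first ([CLS]-like) token and regress it to a yield.
        return self.head(h[:, 0, :]).squeeze(-1)

model = YieldRegressor()
batch = torch.randint(0, 100, (2, 16))  # two tokenised reaction SMILES
print(model(batch).shape)               # one predicted yield per reaction
```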
Our approach can be trained either on data specific to a given reaction class or on data representing
different reaction types. Thus, we initially trained the model on two high-throughput experimentation
(HTE) data sets. Among the few HTE reaction data sets published in recent years, we selected the data
sets for palladium-catalysed Buchwald–Hartwig reactions provided by Ahneman et al [9] and for
Suzuki–Miyaura coupling reactions provided by Perera et al [21]. Finally, we trained our model on patent
data available in the USPTO data set [22, 23].
HTE and patent data sets are very different in terms of content and quality. HTE data sets typically cover
a very narrow region in the chemical reaction space, with chemical reaction data related to one or a few
reaction templates applied to large combinations of selected precursors (reactants, solvents, bases, catalysts,
etc). In contrast, patent reactions cover a much wider reaction space. In terms of quality, HTE data sets
report reactions represented uniformly and with yields measured using the same analytical equipment, thus
providing a consistent and high-quality collection of knowledge. In comparison, the yields from patents were
measured by different scientists using different equipment. Incomplete information in the original documents, such as unreported reagents or reaction conditions, together with the limitations of current text-mining technologies, makes the entire set of patent reactions quite noisy and sparse. An extensive analysis of the
USPTO data set revealed that the experimental conditions and reaction parameters, such as scale of the
reaction, concentrations, temperature, pressure, or reaction duration, may have a significant effect on the
measured reaction yields. The functional dependency of the yields on the reaction conditions poses
additional constraints, as the model presented in this work does not consider those values explicitly in the
reaction descriptor. The basic assumption is that every reaction yield reported in the data set is optimised for
the reaction parameters.
Our best-performing model reached an R2 score of 0.956 on a random split of the Buchwald–Hartwig
data set, while the highest R2 score on the smoothed USPTO data was 0.388. These numbers reflect how the
intrinsic data set limitations increase the complexity of training a sufficiently good performing model on the
patent data, resulting in a more difficult challenge than training a model for the HTE data set.
We base our models directly on the reaction fingerprint (rxnfp) models by Schwaller et al [17]. We keep the encoder size fixed, tuning only the dropout rate and the learning rate, thus avoiding the difficulties often encountered with neural networks that have numerous hyperparameters. During our
experiments, we observed good performances for a wide range of dropout rates (from 0.1 to 0.8) and
conclude that the initial learning rate is the most important hyperparameter to tune. Figures S26–S30 show
hyperparameter optimisation plots (available online at stacks.iop.org/MLST/2/015016/mmedia). To facilitate
the training, our work uses simpletransformers [24], the huggingface transformers library [25] and the PyTorch
framework [26]. The overall pipeline is shown in figure 1.
To provide an input compatible with the rxnfp model we use the same RDKit [27] reaction
canonicalisation and SMILES tokenization [6] as in the rxnfp work [17].
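For illustration, the tokenization step can be reproduced with the regular expression published with the molecular transformer work [6], which keeps bracket atoms, two-letter halogens, and two-digit ring closures as single tokens (pure-Python sketch; the actual pipeline also canonicalises the reaction SMILES with RDKit first):

```python
import re

# Token pattern from the molecular transformer work [6].
SMILES_REGEX = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\."
    r"|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

def tokenize(rxn_smiles):
    """Split a (canonicalised) reaction SMILES into model tokens."""
    return SMILES_REGEX.findall(rxn_smiles)

print(" ".join(tokenize("CCO.[Na+]>>CC=O")))
# C C O . [Na+] > > C C = O
```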
Table 1. Comparing methods on the Buchwald–Hartwig data set. All results shown in this table used the rxnfp pretrained model as base
encoder.
Table 2. Summary of the average R2 scores on the Suzuki–Miyaura reactions data set using a Yield-BERT with different base encoders.
We used 10 different random folds (70/30).
(12 in total), bases (8), and solvents (4), resulting in a total of 5760 measured yields. The same data set was
also investigated in the work of Granda et al [12].
Here, we first trained our yield prediction models with the same hyperparameters as for the
Buchwald–Hartwig reaction experiment above, achieving an R2 score of 0.79 ± 0.01. Second, we tuned the
dropout probability and learning rate, similarly to the previous experiment, using a split of the training set of
the first random split. The resulting hyperparameters were then used for all the splits. The hyperparameter
tuning did not lead to better performance compared to the parameters used for the Buchwald–Hartwig
reactions. This shows that the models have a stable performance for a wide range of parameters and that they
are transferable from one data set to another related data set.
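The evaluation protocol used here (repeated random 70/30 splits, reporting mean and standard deviation of the R2 score) can be sketched as follows. The fine-tuned model itself is out of scope for a short snippet, so a noisy copy of the true yields stands in for its predictions; all data below are synthetic.

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Coefficient of determination."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def random_folds(n, n_folds=10, train_frac=0.7, seed=0):
    """Yield (train_idx, test_idx) pairs for repeated random splits."""
    rng = np.random.default_rng(seed)
    n_train = int(round(train_frac * n))
    for _ in range(n_folds):
        perm = rng.permutation(n)
        yield perm[:n_train], perm[n_train:]

y = np.random.default_rng(1).uniform(0, 100, size=500)  # synthetic yields
scores = []
for fold, (train_idx, test_idx) in enumerate(random_folds(len(y))):
    y_test = y[test_idx]
    # Stand-in for "fine-tune on train_idx, predict on test_idx".
    y_pred = y_test + np.random.default_rng(fold).normal(0, 5, len(y_test))
    scores.append(r2_score(y_test, y_pred))
print(f"R2 = {np.mean(scores):.2f} +/- {np.std(scores):.2f}")
```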
We also compared two different base encoder models that are available from the rxnfp library [17],
namely the BERT model pretrained with a masked language modelling task, and the BERT model
subsequently fine-tuned on a reaction class prediction task. The results are displayed in table 2. In contrast to
the Buchwald–Hartwig data set, where no difference between the two base encoders was observed, the ft
model achieves an R2 score of 0.81 ± 0.01, outperforming the pretrained base encoder on the
Suzuki–Miyaura reactions. Detailed results on the different Suzuki–Miyaura test sets are shown in figures
S15-S24.
Figure 2. Average and standard deviation of the yields for the 10, 50, and 100 reactions predicted to have the highest yields after
training on a fraction of the data set (5%, 10%, 20%). The ideal reaction selection and a random selection are plotted for
comparison.
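The selection experiment in figure 2 can be emulated on synthetic data: rank reactions by predicted yield, keep the top k, and compare their mean true yield against the ideal (oracle) and random baselines. Everything below is hypothetical stand-in data, not the paper's results.

```python
import numpy as np

def top_k_mean(true_yields, predicted_yields, k):
    """Mean true yield of the k reactions the model ranks highest."""
    top = np.argsort(predicted_yields)[::-1][:k]
    return float(true_yields[top].mean())

rng = np.random.default_rng(0)
true = rng.uniform(0, 100, size=1000)              # hypothetical yields
pred = true + rng.normal(0, 15, size=1000)         # imperfect "model"

k = 50
model_sel = top_k_mean(true, pred, k)              # model's selection
ideal_sel = float(np.sort(true)[::-1][:k].mean())  # oracle upper bound
random_sel = float(true.mean())                    # expected random pick
print(f"model {model_sel:.1f}  ideal {ideal_sel:.1f}  random {random_sel:.1f}")
```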
In this section, we analyse the yields in the USPTO data set [22, 23]. We started from the same set as in our previous work [28], keeping only reactions for which yields and product mass were reported. In contrast to HTE, where reactions are typically performed at sub-gram scale, the patent data contains reactions spanning a wider range, from gram to sub-gram scale.
Figure 4. Reaction atlases. Top: Gram scale. Bottom: Sub-gram scale. Left: Reaction superclass distribution, reactions belonging to
the same superclass have the same colour. Right: Corresponding reaction yields.
Nearest-neighbour reactions in the atlases often have extremely diverse reaction yields. Those diverse yields make it challenging for the model to learn anything but yield averages for similar reactions and hence explain the low performance on the patent reactions. This analysis opens up relevant questions on the quality of the reported information (relative to the mass scale) and on the accuracy of its extraction from text, both of which could severely hamper the development of reaction yield predictive models. The need for a clean and consistent reaction yield data set is even more pressing here than for other reaction prediction tasks.
In table 3, the ‘random split (smoothed)’ row shows an experiment inspired by the observations above. As some of the yield values in the data set are probably incorrect, we smoothed the yields by replacing each value with the average of its three nearest-neighbour yields plus twice the reaction’s own yield. The nearest neighbours were estimated using rxnfp ft fingerprints [17] and faiss [31]. Figure S25 shows the yield distributions after smoothing. On the smoothed data sets, the performance of our models more than triples on the gram scale and doubles on the sub-gram scale, achieving R2 scores of 0.277 and 0.388, respectively. The removal of noisy reactions [32] or reaction data augmentation techniques [33] could potentially lead to further improvements.
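A minimal version of this smoothing, assuming each reaction is represented by a fingerprint vector (rxnfp in the paper) and using brute-force distances in place of the faiss index needed at scale:

```python
import numpy as np

def smooth_yields(fps, yields, k=3):
    """Replace each yield by the average of its k nearest-neighbour
    yields with the reaction's own yield counted twice, i.e.
    (2 * y_i + sum of k neighbour yields) / (k + 2)."""
    d2 = ((fps[:, None, :] - fps[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)        # a reaction is not its own neighbour
    nn = np.argsort(d2, axis=1)[:, :k]  # indices of the k nearest neighbours
    return (2 * yields + yields[nn].sum(axis=1)) / (k + 2)

fps = np.array([[0.0], [0.1], [0.2], [5.0]])  # toy 1D "fingerprints"
yields = np.array([80.0, 10.0, 85.0, 40.0])
print(smooth_yields(fps, yields))             # outliers are pulled inward
```

For k = 3 this matches the description above: the reaction's own yield carries weight 2/5 and each neighbour 1/5.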
5. Conclusion
In this work, we combined a reaction SMILES encoder with a reaction regression task to design a reaction
yield predictive model. We analysed two HTE reaction data sets, showing excellent results. On the Buchwald–Hartwig reaction data set, our models outperform previous work on random splits and perform similarly to models trained on DFT-computed chemical descriptors on test sets where specific additives were held out from the training set. Unlike random forest models, our approach does not directly expose feature importances. Future work could (visually) investigate the attention weights to find out which tokens and molecules contribute the most to the predictions [34, 35].
We analysed the yields in the public patent data and show that the distribution of reported yields strongly
differs depending on the reaction scale. Because of the intrinsic lack of consistency and quality in the patent
data, our proposed method fails to predict patent reaction yields accurately. While we cannot rule out that another architecture might perform better than the one presented in this manuscript, we stress the need for a more consistent, higher-quality public data set for the development of reaction yield prediction models. The suspicion that the patent yields are inconsistently reported is substantiated by the large variability of the methods used to purify products and report yields across the different reaction mass scales, and by the varying degree of optimisation of each reported reaction. Our reaction atlases [17, 29, 30] reveal globally higher-yielding reaction classes. However, nearest neighbours often have significantly scattered yields. We show that better results can be achieved by smoothing the patent data yields using the nearest neighbours.
Our approach to yield predictions can be extended to any reaction regression task, for example, for
predicting reaction activation energies [36–38], and is expected to have a broad impact in the field of organic
chemistry.
References
[1] Schwaller P, Hoover B, Reymond J-L, Strobelt H and Laino T 2020 Unsupervised Attention-Guided Atom-Mapping ChemRxiv
preprint (https://doi.org/10.26434/chemrxiv.12298559.v1)
[2] Segler M H, Preuss M and Waller M P 2018 Planning chemical syntheses with deep neural networks and symbolic AI Nature
555 604–10
[3] Coley C W et al 2019 A robotic platform for flow synthesis of organic compounds informed by AI planning Science 365 eaax1566
[4] Schwaller P et al 2020 Predicting retrosynthetic pathways using transformer-based models and a hyper-graph exploration strategy
Chem. Sci. 11 3316–25
[5] Genheden S et al 2020 AiZynthFinder: a fast, robust and flexible open-source software for retrosynthetic planning J. Cheminform.
12 70
[6] Schwaller P et al 2019 Molecular transformer: a model for uncertainty-calibrated chemical reaction prediction ACS Cent. Sci.
5 1572–83
[7] Kite S, Hattori T and Murakami Y 1994 Estimation of catalytic performance by neural network—product distribution in oxidative
dehydrogenation of ethylbenzene Appl. Catal. A 114 L173–78
[8] Raccuglia P et al 2016 Machine-learning-assisted materials discovery using failed experiments Nature 533 73–6
[9] Ahneman D T, Estrada J G, Lin S, Dreher S D and Doyle A G 2018 Predicting reaction performance in C–N cross-coupling using
machine learning Science 360 186–90
[10] Chuang K V and Keiser M J 2018 Comment on “Predicting reaction performance in C–N cross-coupling using machine learning”
Science 362 6416
[11] Sandfort F, Strieth-Kalthoff F, Kühnemund M, Beecks C and Glorius F 2020 A structure-based platform for predicting chemical
reactivity Chem. 6 1379–90
[12] Granda J M, Donina L, Dragone V, Long D-L and Cronin L 2018 Controlling an organic synthesis robot with machine learning to
search for new reactivity Nature 559 377–81
[13] Fu Z et al 2020 Optimizing chemical reaction conditions using deep learning: a case study for the Suzuki–Miyaura cross-coupling
reaction Org. Chem. Front. 7 2269–77
[14] Eyke N S, Green W H and Jensen K F 2020 Iterative experimental design based on active machine learning reduces the
experimental burden associated with reaction screening React. Chem. Eng. 5 1963–72
[15] Skoraczyński G et al 2017 Predicting the outcomes of organic reactions via machine learning: are current descriptors sufficient? Sci.
Rep. 7 3582
[16] Schwaller P, Gaudin T, Lanyi D, Bekas C and Laino T 2018 “Found in translation”: predicting outcomes of complex organic
chemistry reactions using neural sequence-to-sequence models Chem. Sci. 9 6091–8
[17] Schwaller P et al 2021 Mapping the space of chemical reactions using attention-based neural networks Nat. Mach. Intell. 3 144–52
[18] Devlin J, Chang M-W, Lee K and Toutanova K 2019 BERT: pre-training of deep bidirectional transformers for language
understanding Proc. of the 2019 Conf. of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Stroudsburg, PA: Association for Computational Linguistics) 4171–86
[19] Vaswani A et al 2017 Attention is all you need Advances in Neural Information Processing Systems 30 (Red Hook, NY: Curran
Associates, Inc.) 5998–6008 (https://papers.nips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html)
[20] Weininger D 1988 SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules
J. Chem. Inf. Model. 28 31–6
[21] Perera D et al 2018 A platform for automated nanomole-scale reaction screening and micromole-scale synthesis in flow Science
359 429–34
[22] Lowe D M 2012 Extraction of chemical structures and reactions from the literature PhD Thesis University of Cambridge (https://doi.org/10.17863/CAM.16293)
[23] Lowe D 2017 Chemical reactions from US patents (1976–Sep2016) (https://doi.org/10.6084/m9.figshare.5104873.v1)
[24] Simpletransformers (available at: https://simpletransformers.ai) (Accessed: 2 July 2020)
[25] Wolf T et al 2020 Transformers: State-of-the-art natural language processing Proc. 2020 Conference on Empirical Methods in Natural
Language Processing: System Demonstrations (Stroudsburg, PA: Association for Computational Linguistics) 38–45
[26] Paszke A et al 2019 PyTorch: an imperative style, high-performance deep learning library Proc. of Advances in Neural Information
Processing Systems 32 (Red Hook, NY: Curran Associates, Inc.) 8026–37 (https://papers.nips.cc/paper/2019/hash/bdbca288fee7f92f2bfa9f7012727740-Abstract.html)
[27] Landrum G et al 2019 rdkit/rdkit: 2019_03_4 (q1 2019) release (https://doi.org/10.5281/zenodo.3366468)
[28] Pesciullesi G, Schwaller P, Laino T and Reymond J-L 2020 Transfer learning enables the molecular transformer to predict regio-and
stereoselective reactions on carbohydrates Nat. Commun. 11 1–8
[29] Probst D and Reymond J-L 2020 Visualization of very large high-dimensional data sets as minimum spanning trees J. Cheminform.
12 1–13
[30] Probst D and Reymond J-L 2017 FUn: a framework for interactive visualizations of large, high-dimensional datasets on the web
Bioinformatics 34 1433–5
[31] Johnson J, Douze M and Jégou H 2019 Billion-scale similarity search with GPUs IEEE Trans. Big Data (https://doi.org/10.1109/TBDATA.2019.2921572)
[32] Toniato A, Schwaller P, Cardinale A, Geluykens J and Laino T 2020 Unassisted noise-reduction of chemical reactions data sets
ChemRxiv preprint (https://doi.org/10.26434/chemrxiv.12395120.v1)
[33] Tetko I V, Karpov P, Van Deursen R and Godin G 2020 State-of-the-art augmented NLP transformer models for direct and
single-step retrosynthesis Nat. Commun. 11 5575
[34] Hoover B, Strobelt H and Gehrmann S 2019 Exbert: a visual analysis tool to explore learned representations in transformers
models Proc. 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations (Stroudsburg, PA:
Association for Computational Linguistics) 187–96
[35] Vig J and Belinkov Y 2019 Analyzing the structure of attention in a transformer language model Proc. of the 2019 ACL Workshop
BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP (Stroudsburg, PA: Association for Computational Linguistics)
63–76
[36] Grambow C A, Pattanaik L and Green W H 2020 Reactants, products and transition states of elementary chemical reactions based
on quantum chemistry Sci. Data 7 1–8
[37] von Rudorff G F, Heinen S, Bragato M and von Lilienfeld A 2020 Thousands of reactants and transition states for competing E2
and SN2 reactions Mach. Learn.: Sci. Technol. 1 045026
[38] Jorner K, Brinck T, Norrby P-O and Buttar D 2020 Machine learning meets mechanistic modelling for accurate prediction of
experimental activation energies Chem. Sci. 12 1163–75