
Letter | https://doi.org/10.1038/s41586-019-1335-8

Unsupervised word embeddings capture latent knowledge from materials science literature

Vahe Tshitoyan1,3*, John Dagdelen1,2, Leigh Weston1, Alexander Dunn1,2, Ziqin Rong1, Olga Kononova2, Kristin A. Persson1,2, Gerbrand Ceder1,2* & Anubhav Jain1*

The overwhelming majority of scientific knowledge is published as text, which is difficult to analyse by either traditional statistical analysis or modern machine learning methods. By contrast, the main source of machine-interpretable data for the materials research community has come from structured property databases1,2, which encompass only a small fraction of the knowledge present in the research literature. Beyond property values, publications contain valuable knowledge regarding the connections and relationships between data items as interpreted by the authors. To improve the identification and use of this knowledge, several studies have focused on the retrieval of information from scientific literature using supervised natural language processing3–10, which requires large hand-labelled datasets for training. Here we show that materials science knowledge present in the published literature can be efficiently encoded as information-dense word embeddings11–13 (vector representations of words) without human labelling or supervision. Without any explicit insertion of chemical knowledge, these embeddings capture complex materials science concepts such as the underlying structure of the periodic table and structure–property relationships in materials. Furthermore, we demonstrate that an unsupervised method can recommend materials for functional applications several years before their discovery. This suggests that latent knowledge regarding future discoveries is to a large extent embedded in past publications. Our findings highlight the possibility of extracting knowledge and relationships from the massive body of scientific literature in a collective manner, and point towards a generalized approach to the mining of scientific literature.
Assignment of high-dimensional vectors (embeddings) to words in a text corpus in a way that preserves their syntactic and semantic relationships is one of the most fundamental techniques in natural language processing (NLP). Word embeddings are usually constructed using machine learning algorithms such as GloVe13 or Word2vec11,12, which use information about the co-occurrences of words in a text corpus. For example, when trained on a suitable body of text, such methods should produce a vector representing the word ‘iron’ that is closer by cosine distance to the vector for ‘steel’ than to the vector for ‘organic’. To train the embeddings, we collected and processed approximately 3.3 million scientific abstracts published between 1922 and 2018 in more than 1,000 journals deemed likely to contain materials-related research, resulting in a vocabulary of approximately 500,000 words. We then applied the skip-gram variation of Word2vec, which is trained to predict context words that appear in the proximity of the target word as a means to learn the 200-dimensional embedding of that target word, to our text corpus (Fig. 1a). The key idea is that, because words with similar meanings often appear in similar contexts, the corresponding embeddings will also be similar. More details about the model are included in the Methods and in Supplementary Information sections S1 and S2, where we also discuss alternative algorithm options such as GloVe. We find that, even though no chemical information or interpretation is added to the algorithm, the obtained word embeddings behave consistently with chemical intuition when they are combined using various vector operations (projection, addition, subtraction). For example, many words in our corpus represent chemical compositions of materials, and the five materials most similar to LiCoO2 (a well-known lithium-ion cathode compound) can be determined through a dot product (projection) of normalized word embeddings. According to our model, the compositions with the highest similarity to LiCoO2 are LiMn2O4, LiNi0.5Mn1.5O4, LiNi0.8Co0.2O2, LiNi0.8Co0.15Al0.05O2 and LiNiO2—all of which are also lithium-ion cathode materials.
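This kind of similarity query can be reproduced with the gensim library and the pretrained embeddings released with this work; a minimal sketch in Python, assuming the model file from the mat2vec repository (the path shown is illustrative and may differ):

```python
from gensim.models import Word2Vec

# Load the pretrained mat2vec embeddings (path follows the layout of
# https://github.com/materialsintelligence/mat2vec; adjust as needed).
model = Word2Vec.load("mat2vec/training/models/pretrained_embeddings")

# most_similar ranks the vocabulary by cosine similarity of normalized
# word vectors, i.e. the dot product of normalized embeddings used above.
print(model.wv.most_similar("LiCoO2", topn=5))
# Expected to surface other lithium-ion cathode compositions, e.g. LiMn2O4.
```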
Similar to the observation made in the original Word2vec paper11, these embeddings also support analogies, which in our case can be domain-specific. For instance, ‘NiFe’ is to ‘ferromagnetic’ as ‘IrMn’ is to ‘?’, where the most appropriate response is ‘antiferromagnetic’. Such analogies are expressed and solved in the Word2vec model by finding the nearest word to the result of subtraction and addition operations between the embeddings. Hence, in our model,

ferromagnetic − NiFe + IrMn ≈ antiferromagnetic
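In gensim, the same subtraction-and-addition query is expressed through the positive and negative arguments of most_similar; a short sketch, reusing the model loaded above:

```python
# ferromagnetic - NiFe + IrMn -> nearest word to the resulting vector
result = model.wv.most_similar(positive=["ferromagnetic", "IrMn"],
                               negative=["NiFe"], topn=1)
print(result)  # expected top answer: 'antiferromagnetic'
```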
To better visualize such embedded relationships, we projected the embeddings of Zr, Cr and Ni, as well as their corresponding oxides and crystal structures, onto two dimensions using principal component analysis (Fig. 1b). Even in reduced dimensions, there is a consistent operation in vector space for the concepts ‘oxide of’ (Zr − ZrO2 ≈ Cr − Cr2O3 ≈ Ni − NiO) and ‘structure of’ (Zr − HCP ≈ Cr − BCC ≈ Ni − FCC). This suggests that the positions of the embeddings in space encode materials science knowledge, such as the fact that zirconium has a hexagonal close packed (HCP) crystal structure under standard conditions and that its principal oxide is ZrO2. Other types of materials analogies captured by the model, such as functional applications and crystal symmetries, are listed in Extended Data Table 1. The accuracies for each category are close to 50%—similar to the baseline set in the original Word2vec study12. We stress that Word2vec treats these entities simply as strings, and no chemical interpretation is explicitly provided to the model; rather, materials knowledge is captured through the positions of the words in scientific abstracts. Notably, we also found that embeddings of chemical elements are representative of their positions in the periodic table when projected onto two dimensions (Extended Data Fig. 1a, b, Supplementary Information sections S4 and S5) and can serve as effective feature vectors in quantitative machine learning models such as formation energy prediction—outperforming several previously reported curated feature vectors (Extended Data Fig. 1c, d, Supplementary Information section S6).
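A two-dimensional projection of this kind can be produced with an off-the-shelf PCA implementation, for example scikit-learn's; a sketch, again assuming the model from the first example:

```python
import numpy as np
from sklearn.decomposition import PCA

words = ["Zr", "ZrO2", "HCP", "Cr", "Cr2O3", "BCC", "Ni", "NiO", "FCC"]
vectors = np.array([model.wv[w] for w in words])
coords = PCA(n_components=2).fit_transform(vectors)

# If the 'oxide of' relationship is encoded consistently, the difference
# vectors Zr-ZrO2, Cr-Cr2O3 and Ni-NiO should point in similar directions.
for word, (x, y) in zip(words, coords):
    print(f"{word}: ({x:.2f}, {y:.2f})")
```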
The main advantage and novelty of this representation, however, is that application keywords such as ‘thermoelectric’ have the same representation as material formulae such as ‘Bi2Te3’. When the cosine similarity of a material embedding and the embedding of ‘thermoelectric’ is high, one might expect that the text corpus necessarily includes abstracts reporting on the thermoelectric behaviour of this material14,15. However, we found that a number of materials that have relatively high cosine similarities to the word ‘thermoelectric’ never appeared explicitly in the same abstract with this word, or any other words that unequivocally identify materials as thermoelectric (Fig. 2a). Rather than dismissing these instances as spurious, we investigated whether such cases could be usefully interpreted as predictions of novel thermoelectric materials.

1Lawrence Berkeley National Laboratory, Berkeley, CA, USA. 2Department of Materials Science and Engineering, University of California, Berkeley, CA, USA. 3Present address: Google LLC, Mountain View, CA, USA. *e-mail: [email protected]; [email protected]; [email protected]


Fig. 1 | Word2vec skip-gram and analogies. a, Target words ‘LiCoO2’ and ‘LiMn2O4’ are represented as vectors with ones at their corresponding vocabulary indices (for example, 5 and 8 in the schematic) and zeros everywhere else (one-hot encoding). These one-hot encoded vectors are used as inputs for a neural network with a single linear hidden layer (for example, 200 neurons), which is trained to predict all words mentioned within a certain distance (context words) from the given target word. For similar battery cathode materials such as LiCoO2 and LiMn2O4, the context words that occur in the text are mostly the same (for example, ‘cathodes’, ‘electrochemical’, and so on), which leads to similar hidden layer weights after the training is complete. These hidden layer weights are the actual word embeddings. The softmax function is used at the output layer to normalize the probabilities. b, Word embeddings for Zr, Cr and Ni, their principal oxides and crystal symmetries (at standard conditions) projected onto two dimensions using principal component analysis and represented as points in space. The relative positioning of the words encodes materials science relationships, such that there exist consistent vector operations between words that represent concepts such as ‘oxide of’ and ‘structure of’.

As a first test, we compared our predicted thermoelectric compositions with available computational data. Specifically, we identified compounds mentioned in our text corpus more than three times that are also present in a dataset16 that reports the thermoelectric power factors (an important component of the overall thermoelectric figure of merit, zT) of approximately 48,000 compounds calculated using density functional theory (DFT)17,18 (see Methods). A total of 9,483 compounds overlap between the two datasets, of which 7,663 were never mentioned alongside thermoelectric keywords in our text corpus and can be considered candidates for prediction. To obtain the predictions, we ranked each of these 7,663 compounds by the dot product of their normalized output embedding with the word embedding of ‘thermoelectric’ (see Supplementary Information sections S1 and S3 regarding the use of output versus word embeddings). This ranking can be interpreted as the likelihood that that material will co-occur with the word ‘thermoelectric’ in a scientific abstract, despite this never occurring explicitly in the text corpus.
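A sketch of this ranking step, assuming a gensim skip-gram model trained with negative sampling (gensim exposes the output-layer weights as the syn1neg array, corresponding to the output embeddings mentioned above) and a hypothetical list of candidate formula tokens:

```python
import numpy as np

def rank_by_target(model, candidates, target="thermoelectric"):
    """Rank candidate formulae by the dot product of their normalized
    output embedding with the normalized word embedding of `target`."""
    t = model.wv[target]
    t = t / np.linalg.norm(t)
    scores = {}
    for formula in candidates:
        idx = model.wv.key_to_index[formula]  # gensim >= 4 vocabulary lookup
        v = model.syn1neg[idx]                # output (context) embedding
        scores[formula] = float(np.dot(v / np.linalg.norm(v), t))
    return sorted(candidates, key=scores.get, reverse=True)
```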
The distributions of DFT maximum power factor values for all 9,483 materials (separated into known thermoelectrics and candidates) are plotted in Fig. 2b, and the values of the 10 highest ranked candidates from the word embedding approach are indicated with dashed lines. We find that the top ten predictions all exhibit computed power factors significantly greater than the average of candidate materials (green), and even slightly higher than the average of known thermoelectrics (purple). The average maximum power factor of 40.8 μW K^−2 cm^−1 for these top ten predictions is 3.6 times larger than the average of candidate materials (11.5 μW K^−2 cm^−1) and 2.4 times larger than the average of known thermoelectrics (17.0 μW K^−2 cm^−1). Moreover, the three highest power factors from the top ten predictions are at the 99.6th, 96.5th and 95.3rd percentiles of known thermoelectrics. We note that, in contrast to supervised methods, our embeddings are based only on the text corpus and are not trained or modified in any manner using the DFT data.
Next, we compared the same model directly against experimentally measured power factors and zTs19. Because our approach does not provide numerical estimations of these quantities, we compared the relative ranking of candidates through the Spearman rank correlation20 for the 83 materials that appear both in our text corpus and in the experimental dataset. We obtained a 59% and 52% rank correlation of experimental results with the embedding-based ranking for maximum power factor and maximum zT, respectively. Unexpectedly, our model outperformed the DFT dataset of power factors used in the previous paragraph, which exhibits only a 31% rank correlation with the experimental maximum power factors.
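The comparison uses the standard Spearman rank coefficient; a minimal sketch with scipy, where the two lists would hold the embedding-based scores and the maximum measured values for the same materials (toy numbers shown):

```python
from scipy.stats import spearmanr

embedding_scores = [0.41, 0.35, 0.28, 0.17, 0.09]      # toy similarity scores
measured_power_factors = [38.0, 22.0, 30.0, 8.0, 5.0]  # toy values, uW K^-2 cm^-1

rho, p_value = spearmanr(embedding_scores, measured_power_factors)
print(f"Spearman rank correlation: {rho:.2f}")
```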
Finally, we tested whether our model—if trained at various points in the past—would have correctly predicted thermoelectric materials reported later in the literature. Specifically, we generated 18 different ‘historical’ text corpora consisting only of abstracts published before cutoff years between 2001 and 2018. We trained separate word embeddings for each historical dataset, and used these embeddings to predict the top 50 thermoelectrics that were likely to be reported in future (test) years. For every year past the date of prediction, we tabulated the cumulative percentage of predicted thermoelectric compositions that were reported in the literature alongside a thermoelectric keyword. Figure 3a depicts the result from each such ‘historical’ dataset as a thin grey line. For example, the light grey line labelled ‘2015’ depicts the percentage of the top 50 predictions, made using the model trained only on scientific abstracts published before 1 January 2015, that were subsequently reported in the literature alongside a thermoelectric keyword after one, two, three or four years (that is, in the years 2015–2018). Overall, our results indicate that materials from the top 50 word embedding-based predictions (red line) were on average eight times more likely to have been studied as thermoelectrics within the next five years than a randomly chosen unstudied material from our corpus at that time (blue), and three times more likely than a random material with a non-zero DFT bandgap (green). The use of larger corpora that incorporate data from more recent years improved the rate of successful predictions, as indicated by the steeper gradients for later years in Fig. 3a.

To examine these results in more detail, we focus on the fate of the top five predictions determined using only abstracts published before the year 2009. Figure 3b plots the evolution of the prediction rank of these top five compounds as more abstracts are added in subsequent years. One of these compounds, CuGaTe2, represents one of the best present-day thermoelectrics and would have been predicted as a top five compound four years before its publication in 201221. Two of the other predictions, ReS2 and CdIn2Te4, were suggested in the literature to be good thermoelectrics22,23 only approximately 8–9 years after the point at which they would have first appeared in the top five list from our algorithm. We note that the sharp increase in the rank of layered ReS2 in 2015 coincides with the discovery of a record zT for SnSe24—also a layered material.
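A sketch of the bookkeeping behind each grey line in Fig. 3a, assuming a hypothetical mapping from each predicted formula to the year it was first reported alongside a thermoelectric keyword:

```python
def cumulative_hit_rate(top50, first_report_year, cutoff_year, n_years):
    """Percentage of the top-50 predictions, made with a pre-`cutoff_year`
    corpus, that were reported as thermoelectrics within `n_years` after."""
    hits = sum(1 for formula in top50
               if formula in first_report_year
               and cutoff_year <= first_report_year[formula] < cutoff_year + n_years)
    return 100.0 * hits / len(top50)
```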


Fig. 2 | Prediction of new thermoelectric materials. a, A ranking of thermoelectric materials can be produced using cosine similarities of material embeddings with the embedding of the word ‘thermoelectric’. Highly ranked materials that have not yet been studied for thermoelectric applications (do not appear in the same abstracts as the words ‘ZT’, ‘zT’, ‘seebeck’, ‘thermoelectric’, ‘thermoelectrics’, ‘thermoelectrical’, ‘thermoelectricity’, ‘thermoelectrically’ or ‘thermopower’) are considered to be predictions that can be tested in the future. b, Distributions of the power factors computed using density functional theory (see Methods) for 1,820 known thermoelectrics in the literature (purple) and 7,663 candidate materials not yet studied as thermoelectric (green). Power factors of the first ten predictions not studied as thermoelectrics in our text corpus and for which computational data are available (Li2CuSb, CuBiS2, CdIn2Te4, CsGeI3, PdSe2, KAg2SbS4, LuRhO3, MgB2C2, Li3Sb and TlSbSe2) are shown with black dashed lines. c, A graph showing how the context words of materials predicted to be thermoelectrics connect to the word ‘thermoelectric’. The width of the edges between ‘thermoelectric’ and the context words (blue) is proportional to the cosine similarity between the word embeddings of the nodes, whereas the width of the edges between the materials and the context words (red, green and purple) is proportional to the cosine similarity between the word embeddings of the context words and the output embedding of the material. The materials are the first (Li2CuSb), third (CsAgGa2Se4) and fourth (Cu7Te5) predictions. The context words are the top context words according to the sum of the edge weights between the material and the word ‘thermoelectric’. Wider paths are expected to make larger contributions to the predictions. Examination of the context words demonstrates that the algorithm is making predictions on the basis of crystal structure associations, co-mentions with other materials for the same application, associations between different applications, and key phrases that describe the material’s known properties.

The final two predictions, HgZnTe and SmInO3, contain expensive (Sm, In) or toxic (Hg) elements and have not been studied yet, and SmInO3 has dropped appreciably in ranking with the addition of more data. The top 10 predictions for each year between 2001 and 2018 are available in Supplementary Table S3.
To illustrate how materials never mentioned next to the word ‘thermoelectric’ are identified as thermoelectrics with high expected probability, we investigated the series of connections that can lead to a prediction. In Fig. 2c, we present three materials from our top five predictions (Extended Data Table 2) alongside some of the key context words that connect these materials to ‘thermoelectric’. For instance, CsAgGa2Se4 has a high likelihood of appearing next to ‘chalcogenide’, ‘band gap’, ‘optoelectronic’ and ‘photovoltaic applications’: many good thermoelectrics are chalcogenides, the existence of a bandgap is crucial for the majority of thermoelectrics, and there is a large overlap between optoelectronic, photovoltaic and thermoelectric materials (see Supplementary Information section S8). Consequently, the correlations between these keywords and CsAgGa2Se4 led to the prediction. This direct interpretability is a major advantage over many other machine learning methods for materials discovery. We also note that several predictions were found to exhibit promising properties despite not being in any well known thermoelectric material classes (see Supplementary Information section S10). This demonstrates that word embeddings go beyond trivial compositional or structural similarity and have the potential to unlock latent knowledge not directly accessible to human scientists.

As a final step, we verified the generalizability of our approach by performing historical validation of predictions for three additional keywords—‘photovoltaics’, ‘topological insulator’ and ‘ferroelectric’. We emphasize that the word embeddings used for these predictions are the same as those used for the thermoelectric predictions; we simply took the dot product with a different target word. Notably, with almost no change in procedure, we find trends similar to the ones in Fig. 3a for all three functional applications, with the results summarized in Extended Data Fig. 2 and Extended Data Table 3.

Fig. 3 | Validation of the predictions. a, Results of prediction of thermoelectric materials using word embeddings obtained from various historical datasets. Each grey line uses only abstracts published before that year to make predictions (for example, predictions for 2001 are performed using abstracts from 2000 and earlier). The lines plot the cumulative percentage of predicted materials subsequently reported as thermoelectrics in the years following their predictions; earlier predictions can be analysed over longer test periods, resulting in longer grey lines. The results are averaged (red) and compared to baseline percentages from either all materials (blue) or non-zero DFT bandgap27 materials (green). b, The top five predictions from the year 2009 dataset, and the evolution of their prediction ranks as more data are collected. The marker indicates the year of the first published report of one of the initial top five predictions as a thermoelectric.


The success of our unsupervised approach can partly be attributed to the choice of the training corpus. The main purpose of abstracts is to communicate information in a concise and straightforward manner, avoiding unnecessary words that may increase noise in embeddings during training. The importance of corpus selection is demonstrated in Extended Data Table 4, where we show that discarding abstracts unrelated to inorganic materials science improves performance, and that models trained on the set of all Wikipedia articles (about ten times more text than our corpus) perform substantially worse on materials science analogies. Contrary to what might seem like the conventional machine learning mantra, throwing more data at the problem is not always the solution. Instead, the quality and domain-specificity of the corpus determine the utility of the embeddings for domain-specific tasks.

We suggest that the methodology described here can also be generalized to other language models, such that the probability of an entity (such as a material or molecule) co-occurring with words that represent a target application or property can be treated as an indicator of performance. Such language-based inference methods could become an entirely new field of research at the intersection between natural language processing and science, going beyond simply extracting entities and numerical values from text and instead leveraging the collective associations present in the research literature. Substitution of Word2vec with context-aware embeddings such as BERT25 or ELMo26 could lead to improvements for functional material predictions, as these models are able to change the embedding of a word based on its context. They substantially outperform context-independent embeddings such as Word2vec or GloVe across all conventional NLP tasks. In addition to co-occurrences, these models can also capture more complex relationships between words in a sentence, such as negation. In the current study, the effects of negation are somewhat mitigated because scientific abstracts often emphasize positive relationships. However, a natural extension of this work is to parse the full texts of articles. We expect that full texts will contain more negative relationships and, in general, more variable and complex sentences, and will therefore require more powerful methods.

Scientific progress relies on the efficient assimilation of existing knowledge in order to choose the most promising way forward and to minimize re-invention. As the amount of scientific literature grows, this is becoming increasingly difficult, if not impossible, for an individual scientist. We hope that this work will pave the way towards making the vast amount of information found in scientific literature accessible to individuals in ways that enable a new paradigm of machine-assisted scientific breakthroughs.

Online content
Any methods, additional references, Nature Research reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information, details of author contributions and competing interests, and statements of data and code availability are available at https://doi.org/10.1038/s41586-019-1335-8.

Received: 19 December 2018; Accepted: 8 May 2019; Published online: 3 July 2019.

References
1. Hill, J. et al. Materials science with large-scale data and informatics: unlocking new opportunities. MRS Bull. 41, 399–409 (2016).
2. Butler, K. T., Davies, D. W., Cartwright, H., Isayev, O. & Walsh, A. Machine learning for molecular and materials science. Nature 559, 547–555 (2018).
3. Friedman, C., Kra, P., Yu, H., Krauthammer, M. & Rzhetsky, A. GENIES: a natural-language processing system for the extraction of molecular pathways from journal articles. Bioinformatics 17, S74–S82 (2001).
4. Müller, H. M., Kenny, E. E. & Sternberg, P. W. Textpresso: an ontology-based information retrieval and extraction system for biological literature. PLoS Biol. 2, e309 (2004).
5. Swain, M. C. & Cole, J. M. ChemDataExtractor: a toolkit for automated extraction of chemical information from the scientific literature. J. Chem. Inf. Model. 56, 1894–1904 (2016).
6. Eltyeb, S. & Salim, N. Chemical named entities recognition: a review on approaches and applications. J. Cheminform. 6, 17 (2014).
7. Kim, E. et al. Materials synthesis insights from scientific literature via text extraction and machine learning. Chem. Mater. 29, 9436–9444 (2017).
8. Leaman, R., Wei, C. H. & Lu, Z. tmChem: a high performance approach for chemical named entity recognition and normalization. J. Cheminform. 7, S3 (2015).
9. Krallinger, M., Rabal, O., Lourenço, A., Oyarzabal, J. & Valencia, A. Information retrieval and text mining technologies for chemistry. Chem. Rev. 117, 7673–7761 (2017).
10. Spangler, S. et al. Automated hypothesis generation based on mining scientific literature. In Proc. 20th ACM SIGKDD Intl Conf. Knowledge Discovery and Data Mining 1877–1886 (ACM, 2014).
11. Mikolov, T., Corrado, G., Chen, K. & Dean, J. Efficient estimation of word representations in vector space. Preprint at https://arxiv.org/abs/1301.3781 (2013).
12. Mikolov, T., Sutskever, I., Chen, K., Corrado, G. & Dean, J. Distributed representations of words and phrases and their compositionality. Preprint at https://arxiv.org/abs/1310.4546 (2013).
13. Pennington, J., Socher, R. & Manning, C. GloVe: global vectors for word representation. In Proc. 2014 Conf. Empirical Methods in Natural Language Processing (EMNLP) 1532–1543 (Association for Computational Linguistics, 2014).
14. Liu, W. et al. New trends, strategies and opportunities in thermoelectric materials: a perspective. Mater. Today Phys. 1, 50–60 (2017).
15. He, J. & Tritt, T. M. Advances in thermoelectric materials research: looking back and moving forward. Science 357, eaak9997 (2017).
16. Ricci, F. et al. An ab initio electronic transport database for inorganic materials. Sci. Data 4, 170085 (2017).
17. Hohenberg, P. & Kohn, W. Inhomogeneous electron gas. Phys. Rev. 136, B864–B871 (1964).
18. Kohn, W. & Sham, L. J. Self-consistent equations including exchange and correlation effects. Phys. Rev. 140, A1133–A1138 (1965).
19. Gaultois, M. W. et al. Data-driven review of thermoelectric materials: performance and resource considerations. Chem. Mater. 25, 2911–2920 (2013).
20. Spearman, C. The proof and measurement of association between two things. Am. J. Psychol. 15, 72–101 (1904).
21. Plirdpring, T. et al. Chalcopyrite CuGaTe2: a high-efficiency bulk thermoelectric material. Adv. Mater. 24, 3622–3626 (2012).
22. Tian, H. et al. Low-symmetry two-dimensional materials for electronic and photonic applications. Nano Today 11, 763–777 (2016).
23. Pandey, C., Sharma, R. & Sharma, Y. Thermoelectric properties of defect chalcopyrites. AIP Conf. Proc. 1832, 110009 (2017).
24. Zhao, L.-D. et al. Ultralow thermal conductivity and high thermoelectric figure of merit in SnSe crystals. Nature 508, 373–377 (2014).
25. Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. Preprint at https://arxiv.org/abs/1810.04805 (2018).
26. Peters, M. E. et al. Deep contextualized word representations. Preprint at https://arxiv.org/abs/1802.05365 (2018).
27. Jain, A. et al. The Materials Project: a materials genome approach to accelerating materials innovation. APL Mater. 1, 011002 (2013).

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© The Author(s), under exclusive licence to Springer Nature Limited 2019


METHODS
Data collection and processing. We obtained approximately 3.3 million abstracts, primarily focused on materials science, physics and chemistry, through a combination of Elsevier’s Scopus and Science Direct application programming interfaces (APIs) (https://dev.elsevier.com/), the Springer Nature API (https://dev.springernature.com/) and web scraping. Parts of abstracts (or full abstracts) that were in foreign languages were removed using text search and regular expression matching, as were articles with metadata types corresponding to ‘Announcement’, ‘BookReview’, ‘Erratum’, ‘EditorialNotes’, ‘News’, ‘Events’ and ‘Acknowledgement’. Abstracts with titles containing the keywords ‘Foreword’, ‘Prelude’, ‘Commentary’, ‘Workshop’, ‘Conference’, ‘Symposium’, ‘Comment’, ‘Retract’, ‘Correction’, ‘Erratum’ and ‘Memorial’ were also selectively removed from the corpus. Some abstracts contained leading or trailing copyright information, which was removed using regular expression matching and heuristic rules. Leading words and phrases such as ‘Abstract:’ were also removed using similar methods. We further retained only abstracts related to inorganic materials according to a binary classifier (see ‘Abstract classification’ below). We tuned the classifier for high recall to guarantee the presence of the majority of relevant abstracts at the expense of retaining some irrelevant ones. Removing irrelevant abstracts substantially improved the performance of our algorithm, as discussed in more detail in Supplementary Information section S2. The 1.5 million abstracts that were classified as relevant were tokenized using ChemDataExtractor5 to produce the individual words. The tokens that were identified as valid chemical formulae using pymatgen28, combined with regular expression and rule-based techniques, were normalized such that the order of elements and common multipliers did not matter (NiFe is the same as Fe50Ni50). Valence states of elements were split into separate tokens (for example, Fe(III) becomes two separate tokens, Fe and (III)). We also performed selective lower-casing and deaccenting. If the token was not a chemical formula or an element symbol, and if only the first letter was uppercase, we lower-cased the word. Thus, chemical formulae and abbreviations stayed in their common form, whereas words at the beginning of sentences and proper nouns were lower-cased. Numbers with units were often not tokenized correctly by ChemDataExtractor. We addressed this in the processing step by splitting the common units from numbers and converting all numbers to a special token <nUm>. This reduced the vocabulary size by approximately 20,000 words. We found that correct preprocessing, especially the choice of phrases to include as individual tokens, substantially improved the results. The code used for preprocessing is available at https://github.com/materialsintelligence/mat2vec.
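Two of these normalization rules can be illustrated in a few lines; a simplified sketch (the real pipeline additionally validates formulae with pymatgen, splits units from numbers and handles valence states and deaccenting):

```python
import re
from pymatgen.core import Composition

NUMBER_RE = re.compile(r"^[+-]?\d+(\.\d+)?([eE][+-]?\d+)?$")

def normalize_formula(token):
    """Order-insensitive canonical form: 'NiFe' and 'Fe50Ni50' both map to 'FeNi'."""
    return Composition(token).reduced_formula

def normalize_token(token):
    if NUMBER_RE.match(token):
        return "<nUm>"        # all bare numbers collapse to a single token
    if token[:1].isupper() and token[1:].islower():
        return token.lower()  # selective lower-casing of ordinary words
                              # (formula/element-symbol checks omitted here)
    return token
```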
Abstract classification. This work focuses on inorganic materials science. However, our corpus contained some abstracts that fell outside this scope (for example, articles on polymer science). We removed articles outside our targeted area of research literature by training a binary classifier that could label abstracts as ‘relevant’ or ‘not relevant’. We annotated 1,094 randomly selected abstracts; of these, 588 were labelled ‘relevant’ and 494 were labelled ‘not relevant’. The labelled abstracts were used as data to train a classifier; we used a linear classifier based on logistic regression, where each document is described by a term frequency–inverse document frequency (tf–idf) vector. The classifier achieved an accuracy (f1-score) of 89% using fivefold cross-validation.
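A minimal version of such a classifier with scikit-learn, using placeholder data in place of the 1,094 annotated abstracts:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Placeholder corpus; the study used 1,094 hand-annotated abstracts.
texts = ["perovskite oxide thin film cathode"] * 50 + ["polymer blend rheology study"] * 50
labels = [1] * 50 + [0] * 50  # 1 = relevant, 0 = not relevant

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
print(cross_val_score(clf, texts, labels, cv=5, scoring="f1").mean())
```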
Word2vec training. We used the Word2vec implementation in gensim (https://radimrehurek.com/gensim/) with a few modifications. We found that skip-gram with negative sampling loss (n = 15) performed best (see Supplementary Information section S2 for a comparison between models). The vocabulary consisted of all words that occurred more than five times, as well as normalized chemical formulae, independent of the number of mentions. The phrases were generated using a minimum phrase count of 10, a score threshold of 15 (ref. 12) and a phrase depth of 2. The latter meant that we repeated the process twice, allowing generation of up to four-grams. We also included common terms such as ‘-’, ‘of’, ‘to’, ‘a’ and ‘the’, which in exceptional cases led to phrases with more tokens. For example, ‘state-of-the-art thermoelectric’ is one of the five 8-token phrases in our vocabulary. At the end of each phrase generation cycle, we removed phrases that contained punctuation and numbers. The size of the vocabulary approximately doubled after phrase generation. The rest of the hyperparameters were as follows: we used 200-dimensional embeddings, a learning rate of 0.01 decreasing to 0.0001 in 30 epochs, a context window of 8 and subsampling with a 10^−4 threshold, which subsamples approximately the 400 most common words. Hyperparameters were optimized for performance on approximately 15,000 grammatical and 15,000 materials science analogies, with the score defined as the percentage of correctly ‘solved’ analogies from the two sets. Hyperparameter optimization and the choice of the corpus are also discussed in more detail in Supplementary Information section S2. The code used for the training and the full list of analogies used in this study are available at https://github.com/materialsintelligence/mat2vec.
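These settings map directly onto gensim's API; a sketch of the training call using the hyperparameters listed above (gensim ≥ 4 argument names; the placeholder corpus stands in for the tokenized abstracts):

```python
from gensim.models import Word2Vec
from gensim.models.phrases import Phrases, Phraser

sentences = [["thermoelectric", "materials", "exhibit", "high", "zT"]] * 100  # placeholder

# Two phrase-generation passes (phrase depth of 2) allow up to four-grams.
bigram = Phraser(Phrases(sentences, min_count=10, threshold=15))
fourgram = Phraser(Phrases(bigram[sentences], min_count=10, threshold=15))
corpus = [fourgram[bigram[s]] for s in sentences]

model = Word2Vec(
    corpus,
    vector_size=200,               # embedding dimension
    window=8,                      # context window
    sg=1, negative=15,             # skip-gram with negative-sampling loss
    sample=1e-4,                   # subsampling threshold
    alpha=0.01, min_alpha=0.0001,  # learning rate decayed over 30 epochs
    min_count=5,
    epochs=30,
)
```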
Thermoelectric power factors. Each materials structure optimization and band structure calculation was performed with density functional theory (DFT) using the projector augmented wave (PAW)29 pseudopotentials and the Perdew–Burke–Ernzerhof (PBE)30 generalized-gradient approximation (GGA), implemented in the Vienna Ab initio Simulation Package (VASP)31,32. A +U correction was applied to transition metal oxides16. The Seebeck coefficient (S) and electrical conductivity (σ) were calculated using the BoltzTraP package33 with a constant relaxation time of 10^−14 s, at simulated temperatures between 300 K and 1,300 K and for carrier concentrations (doping) between 10^16 cm^−3 and 10^22 cm^−3. A 48,770-material subset of the calculations was taken from a previous work16; the remaining calculations were performed in this work using the software atomate34. All calculations used the pymatgen28 Python library within the FireWorks35 workflow management framework. To more realistically evaluate the thermoelectric potential of a candidate material, we devised a simple strategy to condense the complex behaviour of the S and σ tensors into a single power factor metric. For each semiconductor type η ∈ {n, p}, temperature T and doping level c, the S and σ tensors were averaged over the three crystallographic directions, and the average power factor, PF_avg, was computed. PF_avg is a crude estimation of the polycrystalline power factor from the power factor of a perfect single crystal. To account for the complex behaviour of S and σ with T, c and η, we then took the maximum average power factor over T, c and η, constrained to a maximum cutoff temperature Tcut and maximum cutoff doping ccut. Formally, this is

PF_avg,max^(Tcut, ccut) ≡ max PF_avg(η, T, c) such that T ≤ Tcut and c ≤ ccut

We chose Tcut = 600 K and ccut = 10^20 cm^−3 because these values resulted in better correspondence with the experimental dataset than more optimistic values, owing to the limitations of the constant relaxation time approximation. The resulting power factor, PF_avg,max^(600 K, 10^20 cm^−3), is equated with ‘computed power factor’ in this study. To rank materials according to experimental power factors (or zT), we used the maximum value for a given stoichiometry across all experimental conditions present in the dataset from Gaultois et al.19.
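A sketch of this final reduction, assuming the direction-averaged power factors have been assembled into an array indexed by semiconductor type, temperature and doping level:

```python
import numpy as np

def max_avg_power_factor(pf_avg, temperatures, dopings, T_cut=600.0, c_cut=1e20):
    """pf_avg[eta, i, j]: direction-averaged power factor for semiconductor
    type eta (0 = n, 1 = p) at temperatures[i] (K) and dopings[j] (cm^-3).
    Returns the maximum over eta, T and c subject to T <= T_cut, c <= c_cut."""
    T_mask = np.asarray(temperatures) <= T_cut
    c_mask = np.asarray(dopings) <= c_cut
    return pf_avg[:, T_mask][:, :, c_mask].max()
```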
Data availability
The scientific abstracts used in this study are available via Elsevier’s Scopus and Science Direct APIs (https://dev.elsevier.com/) and the Springer Nature API (https://dev.springernature.com/). The list of DOIs used in this study, the pretrained word embeddings and the analogies used for validation of the embeddings are available at https://github.com/materialsintelligence/mat2vec. All other data generated and analysed during the current study are available from the corresponding authors on reasonable request.

Code availability
The code used for text preprocessing and Word2vec training is available at https://github.com/materialsintelligence/mat2vec.

28. Ong, S. P. et al. Python Materials Genomics (pymatgen): a robust, open-source Python library for materials analysis. Comput. Mater. Sci. 68, 314–319 (2013).
29. Kresse, G. & Joubert, D. From ultrasoft pseudopotentials to the projector augmented-wave method. Phys. Rev. B 59, 1758–1775 (1999).
30. Perdew, J. P., Burke, K. & Ernzerhof, M. Generalized gradient approximation made simple. Phys. Rev. Lett. 77, 3865–3868 (1996).
31. Kresse, G. & Furthmüller, J. Efficient iterative schemes for ab initio total-energy calculations using a plane-wave basis set. Phys. Rev. B 54, 11169–11186 (1996).
32. Kresse, G. & Furthmüller, J. Efficiency of ab-initio total energy calculations for metals and semiconductors using a plane-wave basis set. Comput. Mater. Sci. 6, 15–50 (1996).
33. Madsen, G. K. & Singh, D. J. BoltzTraP. A code for calculating band-structure dependent quantities. Comput. Phys. Commun. 175, 67–71 (2006).
34. Mathew, K. et al. Atomate: a high-level interface to generate, execute, and analyze computational materials science workflows. Comput. Mater. Sci. 139, 140–152 (2017).
35. Jain, A. et al. FireWorks: a dynamic workflow system designed for high-throughput applications. Concurr. Comput. 27, 5037–5059 (2015).
36. Yang, X., Dai, Z., Zhao, Y., Liu, J. & Meng, S. Low lattice thermal conductivity and excellent thermoelectric behavior in Li3Sb and Li3Bi. J. Phys. Condens. Matter 30, 425401 (2018).
37. Wang, Y., Gao, Z. & Zhou, J. Ultralow lattice thermal conductivity and electronic properties of monolayer 1T phase semimetal SiTe2 and SnTe2. Physica E 108, 53–59 (2019).
38. Mukherjee, M., Yumnam, G. & Singh, A. K. High thermoelectric figure of merit via tunable valley convergence coupled low thermal conductivity in AIIBIVC2V chalcopyrites. J. Phys. Chem. C 122, 29150–29157 (2018).
39. Kim, E. et al. Machine-learned and codified synthesis parameters of oxide materials. Sci. Data 4, 170127 (2017).
40. Faber, F. A., Lindmaa, A., von Lilienfeld, O. A. & Armiento, R. Machine learning energies of 2 million elpasolite (ABC2D6) crystals. Phys. Rev. Lett. 117, 135502 (2016).
41. Zhou, Q. et al. Learning atoms for materials discovery. Proc. Natl Acad. Sci. USA 115, E6411–E6417 (2018).
Acknowledgements This work was supported by Toyota Research Institute through the Accelerated Materials Design and Discovery program. We thank T. Botari, M. Horton, D. Mrdjenovich, N. Mingione and A. Faghaninia for discussions.

Author contributions All authors contributed to the conception and design of the study, as well as writing of the manuscript. V.T. developed the data processing pipeline, trained and optimized the Word2vec embeddings, trained the machine learning models for property predictions and generated the thermoelectric predictions. V.T., J.D. and L.W. analysed the results and developed the software infrastructure for the project. J.D. trained and optimized the GloVe embeddings and developed the data acquisition infrastructure. L.W. performed the abstract classification. A.D. performed the DFT calculation of thermoelectric power factors. Z.R. contributed to data acquisition. O.K. developed the code for normalization of material formulae. A.D., Z.R. and O.K. contributed to the analysis of the results. K.A.P., G.C. and A.J. supervised the work.

Competing interests The authors declare no competing interests.

Additional information
Extended data is available for this paper at https://doi.org/10.1038/s41586-019-1335-8.
Supplementary information is available for this paper at https://doi.org/10.1038/s41586-019-1335-8.
Correspondence and requests for materials should be addressed to V.T., G.C. and A.J.
Reprints and permissions information is available at http://www.nature.com/reprints.
Extended Data Fig. 1 | Chemistry is captured by word embeddings. a, Two-dimensional t-distributed stochastic neighbour embedding (t-SNE) projection of the word embeddings of 100 chemical element names (for example, ‘hydrogen’), labelled with the corresponding element symbols and grouped according to their classification. Chemically similar elements are seen to cluster together, and the overall distribution exhibits a topology reminiscent of the periodic table itself (compare to b). Arranged from top left to bottom right are the alkali metals, alkaline earth metals, transition metals and noble gases, while the trend from top right to bottom left generally follows increasing atomic number (see Supplementary Information section S4 for a more detailed discussion). b, The periodic table coloured according to the classification shown in a. c, Predicted versus actual (DFT) values of formation energies of approximately 10,000 ABC2D6 elpasolite compounds40 using a simple neural network model with word embeddings of elements as features (see Supplementary Information section S6 for the details of the model). The data points in the plot use fivefold cross-validation. d, Error distribution for the 10% test set of elpasolite formation energies. With no extensive optimization, the word embeddings achieve a mean absolute error (MAE) of 0.056 eV per atom, which is substantially smaller than the 0.1 eV per atom error reported for the same task in the original study using hand-crafted features40 and the 0.15 eV per atom achieved in a recent study using element features automatically learned from crystal structures of more than 60,000 compounds41.
Extended Data Fig. 2 | Historical validations of functional material predictions. a–c, Ferroelectric (a), photovoltaic (b) and topological insulator (c) predictions using word embeddings obtained from various historical datasets, similar to Fig. 3a. For ferroelectrics and photovoltaics, the range of prediction years is 2001–2018. The phrase ‘topological insulator’ obtained its own embedding in our corpus only in 2011 (owing to count and vocabulary size limits), so it is possible to analyse the results only over a shorter time period (2011–2018). Each grey line uses only abstracts published before a certain year to make predictions. The lines show the cumulative percentage of predicted materials studied in the years following their predictions; earlier predictions can be analysed over longer test periods. The results are averaged in red and compared to baseline percentages from all materials. d, The target word or phrase used to rank materials for each application (based on cosine similarity), and the corresponding words used as indicators for a potentially existing study.
Extended Data Table 1 | Materials science analogies

Examples of verified word analogies corresponding to various materials science concepts. The first column lists the types of tested analogies. The second column is an example vector operation for
the corresponding analogy type, with the observed answer listed in the third column. The fourth column gives the number of pairs used for scoring the corresponding analogy task, with the resulting
score of our model shown in the fifth column. Application analogies were not tested quantitatively and the example is for demonstration purposes only. The full list of tested analogies is available at
https://github.com/materialsintelligence/mat2vec.
Extended Data Table 2 | Top 50 thermoelectric predictions

The top 50 thermoelectric predictions using the full text corpus available at the time of writing. Some of these have practical limitations (for example, the presence of air-sensitive species or toxic and
expensive elements), but others appear to be experimentally testable candidates. An exhaustive manual literature search revealed that, from the first 150 predictions using the full corpus of collected
abstracts published through 2018, 48 materials (32%) had already been studied as thermoelectrics in papers that were not represented in our corpus, many of which were published within the last
two years. In the top 50 listed here we have excluded any predictions for which we could find thermoelectric reports outside our corpus. 
*Materials reported as good thermoelectrics while this manuscript was being prepared and reviewed36–38.
Extended Data Table 3 | Top five functional material predictions and context words

The top five predictions and the top ten most important context words leading to each prediction for topological insulators, photovoltaics and ferroelectrics, using the full text corpus. Context words that could indicate prior study in the target domain had already been excluded in the process of making the predictions, as mentioned in Extended Data Fig. 2d. Furthermore, we have excluded any predictions for which we could find reports outside our corpus for the target application.
Extended Data Table 4 | Importance of the text corpus

The top analogy scores, in per cent, for materials science and grammatical analogy tasks for different corpora. All models except that of Kim et al.39 were trained using CBOW (continuous bag of words, the other variant of Word2vec alongside skip-gram) with the same hyperparameters (negative sampling loss with 15 samples, 10^−4 downsampling, window 8, size 200, initial learning rate 0.01, 30 training epochs, minimum word count 5) and no phrases. We used the English Wikipedia dump from 1 March 2018. ‘Wikipedia elements’ corresponds to the subset of articles that mention a chemical element name (for example, ‘gold’), whereas ‘Wikipedia materials’ corresponds to the subset that mention at least one material formula. The smallest corpus on which we trained our model has the best performance on materials-related analogies, whereas the largest corpus has the best performance for grammar. We believe this is due to the highly specialized nature of the relevant abstracts, which suits the tested analogy pairs. We used the ‘Relevant abstracts’ corpus throughout this study.
