Discovering advanced materials for energy
applications by mining the scientific literature
Anubhav Jain
Energy Technologies Area
Lawrence Berkeley National Laboratory
Berkeley, CA
AFRL meeting, Jan 2020
Slides (already) posted to hackingmaterials.lbl.gov
• Often, materials are known for several decades
before their functional applications are known
– MgB2 sitting on lab shelves for 50 years before its
identification as a superconductor in 2001
– LiFePO4 known since 1938, only identified as a Li-ion
battery cathode in 1997
• Even after discovery, optimization and
commercialization still take decades
• To get a sense for why this is so hard, let’s look at
the problem in more detail …
2
Typically, both new materials discovery and optimization
take decades
What constrains traditional approaches to materials design?
3
“[The Chevrel] discovery resulted from a lot of
unsuccessful experiments of Mg ions insertion
into well-known hosts for Li+ ions insertion, as
well as from the thorough literature analysis
concerning the possibility of divalent ions
intercalation into inorganic materials.”
-Aurbach group, on discovery of Chevrel cathode
for multivalent (e.g., Mg2+) batteries
Levi, Levi, Chasid, Aurbach
J. Electroceramics (2009)
4
Researchers are starting to fundamentally re-think how we
invent the materials that make up our devices
Next-generation materials design
Computer-aided materials design
Natural language processing
“Self-driving laboratories”
Outline
5
① Natural language processing - where are
we right now?
② What’s next for the NLP work?
6
Can ML help us work through our backlog of information we
need to assimilate from text sources?
papers to read “someday”
NLP algorithms
• It is difficult to look up all information on any given material
due to the many different ways chemical compositions
are written
– a search for “TiNiSn” will give different results than “NiTiSn”
– a search for “GaSb” won’t match text that reads “Ga0.5Sb0.5”
– a search for “SnBi4Te7” won’t match text that reads “we studied
SnBi4X7 (X=S, Se, Te)”.
– a search for “AgCrSe2”, if it doesn’t have any hits, won’t suggest
“CuCrSe2” as a similar result
• It is difficult to compile summaries, e.g.:
– A list of all materials studied for an application
– A list of all synthesis methods for a material
7
Traditional search doesn’t answer the questions we want
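One way to see why the first two search failures above are avoidable is to compare compositions rather than strings. The following is a minimal sketch, not the Matscholar implementation: `normalize_formula` is a hypothetical helper that reduces a simple formula to element fractions, so that element order and overall scaling no longer matter. It assumes formulas without parentheses or placeholders such as "X".

```python
import re
from fractions import Fraction

# Matches an element symbol followed by an optional amount, e.g. "Ga0.5" or "Te7".
ELEMENT_AMOUNT = re.compile(r"([A-Z][a-z]?)(\d+(?:\.\d+)?)?")

def normalize_formula(formula):
    """Return a canonical, order-independent form of a simple formula.

    The result is a sorted tuple of (element, atomic fraction) pairs.
    Parentheses and placeholders such as "X" are not handled here.
    """
    amounts = {}
    for element, amount in ELEMENT_AMOUNT.findall(formula):
        value = Fraction(amount) if amount else Fraction(1)
        amounts[element] = amounts.get(element, Fraction(0)) + value
    total = sum(amounts.values())
    return tuple(sorted((el, amt / total) for el, amt in amounts.items()))

print(normalize_formula("TiNiSn") == normalize_formula("NiTiSn"))    # True
print(normalize_formula("GaSb") == normalize_formula("Ga0.5Sb0.5"))  # True
```

Exact fractions are used instead of floats so that equal compositions compare equal without rounding tolerance.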
What is matscholar?
• Matscholar is an attempt to organize the world’s
information on materials science, connecting
together topics of study, synthesis and
characterization methods, and specific materials
compositions
• It is also an effort to use state-of-the-art natural
language processing to make collective use of
the information in millions of articles
One of our main projects concerns named entity
recognition, or automatically labeling text
9
10
> 4 million papers collected
31 million properties
19 million materials mentions
8.8 million characterization methods
7.5 million applications
5 million synthesis methods
• Data collection: over 4 million full papers*
collected from more than 2100 journals.
* Entities only extracted from abstracts deemed relevant to inorganic materials
science (~2M) so far.
11
Now we can search!
Live on www.matscholar.com
12
Another example …
13
Ok so how does this work? High-level view
Weston, L. et al. Named Entity
Recognition and Normalization
Applied to Large-Scale
Information Extraction from
the Materials Science
Literature. J. Chem. Inf. Model.
(2019).
Extracted 4 million
abstracts of relevant
scientific articles using
various APIs from
journal publishers
Some are more difficult
than others to obtain.
Abstract collection
continues …
14
Step 1 – data collection
• First split the text into sentences
– Seems simple, but remember edge cases: the period in “et al.” or
“etc.” does not necessarily signify the end of a sentence
• Then split the sentences into words
– Tricky things are detecting and normalizing chemical
formulas, selective lowercasing (“Battery” vs “battery” or
“BaS” vs “BAs”), homogenizing numbers, etc.
• Done largely with ChemDataExtractor*, plus
some custom improvements
– We may move to a fully custom tokenizer soon
16
Step 2 - tokenization
*http://chemdataextractor.org
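The two tokenization steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not ChemDataExtractor's behavior: `split_sentences` and `normalize_token` are hypothetical helpers, the abbreviation list is deliberately tiny, and "formula-like" is approximated by a regular expression rather than a real element table.

```python
import re

# Illustrative abbreviation list; a real tokenizer would need many more entries.
ABBREVIATIONS = ("et al.", "etc.", "e.g.", "i.e.")

# One or more "element symbol + optional number" groups, e.g. "BaS" or "Ga0.5Sb0.5".
FORMULA_LIKE = re.compile(r"(?:[A-Z][a-z]?(?:\d+(?:\.\d+)?)?)+")

def split_sentences(text):
    """Split on a period followed by whitespace and a capital letter,
    unless the period belongs to a known abbreviation."""
    sentences, start = [], 0
    for match in re.finditer(r"\.\s+(?=[A-Z])", text):
        candidate = text[start:match.start() + 1]
        if candidate.endswith(ABBREVIATIONS):
            continue  # the period is part of an abbreviation, keep going
        sentences.append(candidate.strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

def normalize_token(token):
    """Lowercase ordinary words but keep formula-like tokens case-sensitive."""
    uppercase_count = sum(ch.isupper() for ch in token)
    has_digit = any(ch.isdigit() for ch in token)
    if FORMULA_LIKE.fullmatch(token) and (uppercase_count >= 2 or has_digit):
        return token
    return token.lower()

print(split_sentences("Weston et al. Reported a model. It works well."))
print([normalize_token(t) for t in ["Battery", "battery", "BaS", "BAs"]])
```

Two limitations of this heuristic are worth noting: a sentence that genuinely ends in “etc.” is not split, and acronyms such as "DFT" are kept uppercase as a side effect of the formula pattern.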
• Part A is marking abstracts
as relevant / non-relevant
to inorganic materials
science
• Part B is tediously labeling
~600 abstracts
– Largely done by one person
– Spot-check of 25 abstracts
by a second person gave
87.4% agreement
18
Step 3 – hand label abstracts
• We use the word2vec
algorithm (Google) to turn
each unique word in our
corpus into a 200-dimensional
vector
• These vectors encode the
meaning of each word,
learned by trying to
predict the context words
around a target word
20
Step 4a: the word2vec algorithm is used to “featurize” words
Barazza, L. How does Word2Vec’s Skip-Gram work? Becominghuman.ai. 2017
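To make "predicting context words around the target" concrete, the sketch below shows only how skip-gram training pairs are formed from a tokenized sentence. It does not train any embeddings, and the window size used here is an assumption chosen for illustration, not the setting used for the 200-dimensional vectors.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for every token and each neighbor
    within `window` positions on either side."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

print(skipgram_pairs(["the", "band", "gap"], window=1))
# [('the', 'band'), ('band', 'the'), ('band', 'gap'), ('gap', 'band')]
```

A word2vec model is then trained so that each target word's vector is good at predicting its paired context words.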
“You shall know a word by
the company it keeps”
- John Rupert Firth (1957)
• The classic example is:
– “king” - “man” + “woman” = ? → “queen”
22
Word embeddings trained on “normal” text learn
relationships between words
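The analogy arithmetic can be reproduced with a toy example. The vectors below are made up by hand (two dimensions loosely standing for "royalty" and "gender") purely to show the mechanics; real embeddings are learned from text and have far more dimensions. `analogy` is a hypothetical helper name.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def analogy(vectors, a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Made-up 2-D vectors: axis 0 ~ "royalty", axis 1 ~ "gender".
toy_vectors = {
    "king": (1.0, 1.0),
    "queen": (1.0, -1.0),
    "man": (0.0, 1.0),
    "woman": (0.0, -1.0),
    "apple": (-1.0, 0.0),
}

print(analogy(toy_vectors, "king", "man", "woman"))  # queen
```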
• If you read this sentence:
“The band gap of ___ is 4.5 eV”
It is clear that the blank should be filled in with a
material word (not a synthesis method, characterization
method, etc.)
How do we get a neural network to take into account
context (as well as properties of the word itself)?
24
Step 4b: How do we train a model to recognize context?
25
Step 4b. An LSTM neural net classifies words by reading
word sequences
Weston, L. et al. Named Entity
Recognition and Normalization
Applied to Large-Scale
Information Extraction from
the Materials Science
Literature. J. Chem. Inf. Model.
(2019).
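How an LSTM carries context from earlier words can be illustrated with a single textbook LSTM cell. This is a sketch with scalar inputs and state and arbitrary weights, written for readability; it is not the architecture or the parameters from Weston et al., and `lstm_step` and `read_sequence` are hypothetical names.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, weights):
    """One step of a textbook LSTM cell with scalar input and state.

    weights maps a gate name to (input weight, recurrent weight, bias).
    """
    def gate(name, activation):
        w_x, w_h, bias = weights[name]
        return activation(w_x * x + w_h * h_prev + bias)

    forget = gate("forget", sigmoid)         # how much old memory to keep
    write = gate("input", sigmoid)           # how much new information to store
    output = gate("output", sigmoid)         # how much memory to expose
    candidate = gate("candidate", math.tanh) # proposed new memory content
    c = forget * c_prev + write * candidate
    h = output * math.tanh(c)
    return h, c

def read_sequence(inputs, weights):
    """Feed a sequence through the cell and return the final hidden state."""
    h, c = 0.0, 0.0
    for x in inputs:
        h, c = lstm_step(x, h, c, weights)
    return h

# Arbitrary illustrative weights, identical for every gate.
toy_weights = {name: (1.0, 1.0, 0.0)
               for name in ("forget", "input", "output", "candidate")}

# Same final input, different preceding context, different hidden state.
print(read_sequence([1.0, 0.5], toy_weights))
print(read_sequence([-1.0, 0.5], toy_weights))
```

The two printed values differ even though the last input is the same, which is the property that lets a sequence model label a word using the words around it.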
27
Step 5. Sit back and let the model label things for you!
Named Entity Recognition
• Custom machine learning models to
extract the most valuable materials-related
information.
• Utilizes a long short-term memory (LSTM)
network trained on ~1000 hand-annotated
abstracts.
• F1 scores are ~0.9; for inorganic materials
extraction, the F1 score is >0.9.
Weston, L. et al. Named Entity
Recognition and Normalization
Applied to Large-Scale
Information Extraction from
the Materials Science
Literature. J. Chem. Inf. Model.
(2019).
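For readers unfamiliar with the F1 score quoted above, it is the harmonic mean of precision and recall for a given label. The sketch below computes it per token for one label; this is an assumption made for simplicity, since the evaluation in the cited paper may be defined at the entity level. The label names "MAT" and "O" are hypothetical.

```python
def f1_score(true_labels, predicted_labels, positive):
    """Token-level F1 for one label: harmonic mean of precision and recall."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

hand_labels = ["MAT", "O", "MAT", "MAT"]
model_labels = ["MAT", "MAT", "MAT", "O"]
print(f1_score(hand_labels, model_labels, positive="MAT"))
```

In this toy case there are two true positives, one false positive, and one false negative, so precision and recall are both 2/3.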
28
Live online …
29
Could these techniques also be used to predict which
materials we might want to screen for an application?
papers to read “someday”
NLP algorithms
• The classic example is:
– “king” - “man” + “woman” = ? → “queen”
30
Remember that word embeddings seem to learn
relationships in text
31
For scientific text, it learns scientific concepts as well
crystal structures of the elements
Tshitoyan, V. et al. Unsupervised word embeddings capture latent
knowledge from materials science literature. Nature 571, 95–98 (2019).
32
There seems to be materials knowledge encoded in the
word vectors
Tshitoyan, V. et al. Unsupervised word embeddings capture latent
knowledge from materials science literature. Nature 571, 95–98 (2019).
33
Note that more data is not always better!
We want relevance
Tshitoyan, V. et al. Unsupervised word embeddings capture latent
knowledge from materials science literature. Nature 571, 95–98 (2019).
34
Word embeddings also have the periodic table encoded in them
with no prior knowledge
“word embedding”
periodic table
• Dot product of a composition word with
the word “thermoelectric” essentially
predicts how likely that word is to appear
in an abstract with the word
thermoelectric
• Compositions with high dot products are
typically known thermoelectrics
• Sometimes, compositions have a high dot
product with “thermoelectric” but have
never been studied as a thermoelectric
• These compositions usually have high
computed power factors!
(DFT+BoltzTraP)
35
Making predictions: dot products measure likelihood for
words to co-occur
Tshitoyan, V. et al. Unsupervised word embeddings capture latent knowledge from
materials science literature. Nature 571, 95–98 (2019).
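The ranking idea on this slide can be sketched in a few lines. The embeddings below are invented three-dimensional vectors and the composition names are placeholders, so the numbers carry no scientific meaning; `rank_by_dot_product` is a hypothetical helper, not the code behind the published predictions.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def rank_by_dot_product(embeddings, query_word, candidates, already_studied=()):
    """Sort candidates by dot product with the query word, highest first,
    skipping anything already studied for the application."""
    query = embeddings[query_word]
    unexplored = [c for c in candidates if c not in already_studied]
    return sorted(unexplored, key=lambda c: dot(embeddings[c], query), reverse=True)

# Invented vectors and placeholder names, for illustration only.
toy_embeddings = {
    "thermoelectric": (0.9, 0.1, 0.3),
    "known_TE": (0.8, 0.2, 0.1),
    "candidate_X": (0.7, 0.0, 0.4),
    "unrelated_Y": (-0.2, 0.9, 0.1),
}
compositions = ["unrelated_Y", "candidate_X", "known_TE"]

print(rank_by_dot_product(toy_embeddings, "thermoelectric", compositions))
print(rank_by_dot_product(toy_embeddings, "thermoelectric", compositions,
                          already_studied={"known_TE"}))
```

Filtering out compositions that already co-occur with the application word leaves the high-scoring but unstudied candidates, which are the predictions of interest.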
36
Try “going back in time” and ranking materials, and follow
what happens in later years
Tshitoyan, V. et al.
Unsupervised word
embeddings capture latent
knowledge from materials
science literature. Nature
571, 95–98 (2019).
– For every year since
2001, see which
compounds we would
have predicted using
only literature data until
that point in time
– Make predictions of
which materials are the
most promising
thermoelectrics using
data up to that year
– See if those materials
were actually studied as
thermoelectrics in
subsequent years
37
A more comprehensive “back in time” test
Tshitoyan, V. et al. Unsupervised word embeddings capture
latent knowledge from materials science literature. Nature
571, 95–98 (2019).
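The scoring step of such a "back in time" test can be sketched as follows. It assumes the ranking for a given cutoff year has already been produced from literature up to that year; the material names and years are placeholders, and `backtest_hit_rate` is a hypothetical helper rather than the procedure used in the cited paper.

```python
def backtest_hit_rate(ranked_candidates, first_reported_year, cutoff_year, top_k):
    """Fraction of the top_k predictions made at cutoff_year that were
    reported for the application in a later year."""
    top = ranked_candidates[:top_k]
    if not top:
        return 0.0
    hits = [m for m in top
            if m in first_reported_year and first_reported_year[m] > cutoff_year]
    return len(hits) / len(top)

# Placeholder data: a ranking made with literature up to 2001, and the year
# each material was first reported for the application (if ever).
ranking_2001 = ["A", "B", "C", "D"]
first_reported = {"A": 2005, "C": 2010}

print(backtest_hit_rate(ranking_2001, first_reported, cutoff_year=2001, top_k=2))  # 0.5
```

Repeating this for each cutoff year gives a curve of how often the top predictions were later studied.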
• Thus far, 2 of our top 20 predictions made in
~August 2018 have already been reported in the
literature for the first time as thermoelectrics
– Li3Sb was the subject of a computational study
(predicted zT=2.42) in Oct 2018
– SnTe2 was experimentally found to be a moderately
good thermoelectric (expt zT=0.71) in Dec 2018
• We are working with an experimentalist on one
of the predictions (as a “spare time” project)
38
How about “forward” predictions?
[1] Yang et al. "Low lattice thermal conductivity and
excellent thermoelectric behavior in Li3Sb and Li3Bi."
Journal of Physics: Condensed Matter 30.42 (2018):
425401
[2] Wang et al. "Ultralow lattice thermal conductivity and
electronic properties of monolayer 1T phase semimetal
SiTe2 and SnTe2." Physica E: Low-dimensional Systems and
Nanostructures 108 (2019): 53-59
39
How is this working?
“Context
words” link
together
information
from different
sources
Outline
40
① Natural language processing - where are
we right now?
② What’s next for the NLP work?
• Currently, we only have word vectors for
compositions that explicitly appear in abstracts
• We can rank known materials for an application,
but for materials with zero or little mention in the
scientific literature, we are stuck!
• How do we get word embeddings for
compositions that do not exist in the text?
41
Making predictions for entirely new compositions
42
“Hidden representation learning”
43
Initial results – predicting experimental band gap from
composition (~3000 data points)
44
Going beyond entity recognition towards relationship
extraction
45
Current approach is not good enough
• E.g., automatically generate databases from the
literature
– Materials and their numerical band gaps (or thermal
conductivities, or bulk modulus, or superconducting
temperature, etc.)
– If materials can be made n-type, p-type, or both
– Which synthesis techniques led to various sample
descriptors
• Will likely require more powerful techniques, e.g.,
attention-based algorithms (BERT, Google XLNet …)
– To be investigated …
46
Once the accuracy improves, we can start to make much
more powerful searches
47
D2S2 - data driven synthesis science (just starting)
Can we combine natural language processing with theory
and experiments to control synthesis?
Title auto-generated from abstract vs. published title
Auto-generated: Dynamics of molecular hydrogen confined in narrow nanopores
Published: Restricted dynamics of molecular hydrogen confined in activated carbon nanopores
Auto-generated: Microfluidic Generation of Polydisperse Solid Foams
Published: Generation of Solid Foams with Controlled Polydispersity Using Microfluidics
Auto-generated: Minimum variance unbiased estimator of product performance
Published: Assessing the lifetime performance index of gamma lifetime products in the manufacturing industry
Auto-generated: Angle resolved ultraviolet photoemission study of fluorescein films on Ag 110
Published: The growth of thin fluorescein films on Ag 110
48
... and also some fun things, like automatic title generation
49
Acknowledgements
• High-throughput DFT
– Gerbrand Ceder and “BURP” team
– Funding: Bosch / Umicore
• Natural language processing
– Gerbrand Ceder, Kristin Persson, and “Matscholar” team
– Funding: Toyota Research Institute
• Overall work funded by US Department of Energy
50
The Matscholar team
Kristin Persson, Anubhav Jain, Gerbrand Ceder
John Dagdelen
Leigh Weston
Vahe Tshitoyan
Amalie Trewartha
Alex Dunn
Viktoriia Baibakova
Funding from
(now at Google) (now at Medium)

More Related Content

PDF
Materials design using knowledge from millions of journal articles via natura...
Anubhav Jain
 
PDF
Capturing and leveraging materials science knowledge from millions of journal...
Anubhav Jain
 
PDF
Discovering advanced materials for energy applications (with high-throughput ...
Anubhav Jain
 
PDF
Accelerating materials design through natural language processing
Anubhav Jain
 
PDF
Open Source Tools for Materials Informatics
Anubhav Jain
 
PDF
Materials discovery through theory, computation, and machine learning
Anubhav Jain
 
PDF
Progress Towards Leveraging Natural Language Processing for Collecting Experi...
Anubhav Jain
 
PDF
Overview of accelerated materials design efforts in the Hacking Materials res...
Anubhav Jain
 
Materials design using knowledge from millions of journal articles via natura...
Anubhav Jain
 
Capturing and leveraging materials science knowledge from millions of journal...
Anubhav Jain
 
Discovering advanced materials for energy applications (with high-throughput ...
Anubhav Jain
 
Accelerating materials design through natural language processing
Anubhav Jain
 
Open Source Tools for Materials Informatics
Anubhav Jain
 
Materials discovery through theory, computation, and machine learning
Anubhav Jain
 
Progress Towards Leveraging Natural Language Processing for Collecting Experi...
Anubhav Jain
 
Overview of accelerated materials design efforts in the Hacking Materials res...
Anubhav Jain
 

What's hot (20)

PDF
Machine learning for materials design: opportunities, challenges, and methods
Anubhav Jain
 
PDF
Combined Theory and Data-Driven Approaches to Thermoelectrics Materials Disco...
Anubhav Jain
 
PDF
Natural Language Processing for Materials Design - What Can We Extract From t...
Anubhav Jain
 
PDF
Data dissemination and materials informatics at LBNL
Anubhav Jain
 
PDF
Introduction (Part I): High-throughput computation and machine learning appli...
Anubhav Jain
 
PDF
Computational Materials Design and Data Dissemination through the Materials P...
Anubhav Jain
 
PDF
High-throughput computation and machine learning methods applied to materials...
Anubhav Jain
 
PDF
Data Mining to Discovery for Inorganic Solids: Software Tools and Applications
Anubhav Jain
 
PDF
Methods, tools, and examples (Part II): High-throughput computation and machi...
Anubhav Jain
 
PDF
Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...
Anubhav Jain
 
PDF
Conducting and Enabling Data-Driven Research Through the Materials Project
Anubhav Jain
 
PDF
Combining density functional theory calculations, supercomputing, and data-dr...
Anubhav Jain
 
PDF
Open-source tools for generating and analyzing large materials data sets
Anubhav Jain
 
PDF
Software tools, crystal descriptors, and machine learning applied to material...
Anubhav Jain
 
PDF
Combining density functional theory calculations, supercomputing, and data-dr...
Anubhav Jain
 
PDF
Software tools for high-throughput materials data generation and data mining
Anubhav Jain
 
PDF
Prediction and Experimental Validation of New Bulk Thermoelectrics Compositio...
Anubhav Jain
 
PDF
Software Tools, Methods and Applications of Machine Learning in Functional Ma...
Anubhav Jain
 
PDF
Computational screening of tens of thousands of compounds as potential thermo...
Anubhav Jain
 
PDF
Density functional theory calculations and data mining for new thermoelectric...
Anubhav Jain
 
Machine learning for materials design: opportunities, challenges, and methods
Anubhav Jain
 
Combined Theory and Data-Driven Approaches to Thermoelectrics Materials Disco...
Anubhav Jain
 
Natural Language Processing for Materials Design - What Can We Extract From t...
Anubhav Jain
 
Data dissemination and materials informatics at LBNL
Anubhav Jain
 
Introduction (Part I): High-throughput computation and machine learning appli...
Anubhav Jain
 
Computational Materials Design and Data Dissemination through the Materials P...
Anubhav Jain
 
High-throughput computation and machine learning methods applied to materials...
Anubhav Jain
 
Data Mining to Discovery for Inorganic Solids: Software Tools and Applications
Anubhav Jain
 
Methods, tools, and examples (Part II): High-throughput computation and machi...
Anubhav Jain
 
Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...
Anubhav Jain
 
Conducting and Enabling Data-Driven Research Through the Materials Project
Anubhav Jain
 
Combining density functional theory calculations, supercomputing, and data-dr...
Anubhav Jain
 
Open-source tools for generating and analyzing large materials data sets
Anubhav Jain
 
Software tools, crystal descriptors, and machine learning applied to material...
Anubhav Jain
 
Combining density functional theory calculations, supercomputing, and data-dr...
Anubhav Jain
 
Software tools for high-throughput materials data generation and data mining
Anubhav Jain
 
Prediction and Experimental Validation of New Bulk Thermoelectrics Compositio...
Anubhav Jain
 
Software Tools, Methods and Applications of Machine Learning in Functional Ma...
Anubhav Jain
 
Computational screening of tens of thousands of compounds as potential thermo...
Anubhav Jain
 
Density functional theory calculations and data mining for new thermoelectric...
Anubhav Jain
 
Ad

Similar to Discovering advanced materials for energy applications by mining the scientific literature (20)

PDF
Applications of Natural Language Processing to Materials Design
Anubhav Jain
 
PDF
Extracting and Making Use of Materials Data from Millions of Journal Articles...
Anubhav Jain
 
PDF
Data Mining to Discovery for Inorganic Solids: Software Tools and Applications
aimsnist
 
PDF
The Computer Science Ontology: A Large-Scale Taxonomy of Research Areas
Angelo Salatino
 
PPTX
The Computer Science Ontology: A Large-Scale Taxonomy of Research Areas
Angelo Salatino
 
PDF
The Materials Project: Applications to energy storage and functional materia...
Anubhav Jain
 
PPTX
Connected Data for Machine Learning | Paul Groth
Connected Data World
 
PPTX
anifield.pptx
Wessam Fekry
 
PPTX
Knowledge graph construction for research & medicine
Paul Groth
 
PDF
The Materials Project: A Community Data Resource for Accelerating New Materia...
Anubhav Jain
 
PPT
Chemspider Presentation at the ACS Meeting in New orleans
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
 
PPTX
kkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk
NguynDuyPhong3
 
PPTX
Using a keyword extraction pipeline to understand concepts in future work sec...
Kai Li
 
PDF
Discovering new functional materials for clean energy and beyond using high-t...
Anubhav Jain
 
PDF
anifield.pdf
Wessam Fekry
 
PPTX
Learning Systems for Science
Ian Foster
 
PPTX
Text recycling research project
C0pe
 
PPT
Getting Reading for the Next Generation Science Standards Part 3: Crosscuttin...
The Ohio State University, College of Education and Human Ecology
 
PDF
Applying machine learning techniques to big data in the scholarly domain
Angelo Salatino
 
Applications of Natural Language Processing to Materials Design
Anubhav Jain
 
Extracting and Making Use of Materials Data from Millions of Journal Articles...
Anubhav Jain
 
Data Mining to Discovery for Inorganic Solids: Software Tools and Applications
aimsnist
 
The Computer Science Ontology: A Large-Scale Taxonomy of Research Areas
Angelo Salatino
 
The Computer Science Ontology: A Large-Scale Taxonomy of Research Areas
Angelo Salatino
 
The Materials Project: Applications to energy storage and functional materia...
Anubhav Jain
 
Connected Data for Machine Learning | Paul Groth
Connected Data World
 
anifield.pptx
Wessam Fekry
 
Knowledge graph construction for research & medicine
Paul Groth
 
The Materials Project: A Community Data Resource for Accelerating New Materia...
Anubhav Jain
 
kkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk
NguynDuyPhong3
 
Using a keyword extraction pipeline to understand concepts in future work sec...
Kai Li
 
Discovering new functional materials for clean energy and beyond using high-t...
Anubhav Jain
 
anifield.pdf
Wessam Fekry
 
Learning Systems for Science
Ian Foster
 
Text recycling research project
C0pe
 
Getting Reading for the Next Generation Science Standards Part 3: Crosscuttin...
The Ohio State University, College of Education and Human Ecology
 
Applying machine learning techniques to big data in the scholarly domain
Angelo Salatino
 
Ad

More from Anubhav Jain (20)

PDF
A Career at a U.S. National Lab: Perspective from a Mid-Career Scientist
Anubhav Jain
 
PDF
Research opportunities in materials design using AI/ML
Anubhav Jain
 
PDF
Accelerating materials discovery with big data and machine learning
Anubhav Jain
 
PDF
Predicting the Synthesizability of Inorganic Materials: Convex Hulls, Literat...
Anubhav Jain
 
PDF
Discovering advanced materials for energy applications: theory, high-throughp...
Anubhav Jain
 
PDF
Applications of Large Language Models in Materials Discovery and Design
Anubhav Jain
 
PDF
An AI-driven closed-loop facility for materials synthesis
Anubhav Jain
 
PDF
Best practices for DuraMat software dissemination
Anubhav Jain
 
PDF
Best practices for DuraMat software dissemination
Anubhav Jain
 
PDF
Available methods for predicting materials synthesizability using computation...
Anubhav Jain
 
PDF
Efficient methods for accurately calculating thermoelectric properties – elec...
Anubhav Jain
 
PDF
Natural Language Processing for Data Extraction and Synthesizability Predicti...
Anubhav Jain
 
PDF
Machine Learning for Catalyst Design
Anubhav Jain
 
PDF
Natural language processing for extracting synthesis recipes and applications...
Anubhav Jain
 
PDF
Accelerating New Materials Design with Supercomputing and Machine Learning
Anubhav Jain
 
PDF
DuraMat CO1 Central Data Resource: How it started, how it’s going …
Anubhav Jain
 
PDF
The Materials Project
Anubhav Jain
 
PDF
Evaluating Chemical Composition and Crystal Structure Representations using t...
Anubhav Jain
 
PDF
Perspectives on chemical composition and crystal structure representations fr...
Anubhav Jain
 
PDF
Discovering and Exploring New Materials through the Materials Project
Anubhav Jain
 
A Career at a U.S. National Lab: Perspective from a Mid-Career Scientist
Anubhav Jain
 
Research opportunities in materials design using AI/ML
Anubhav Jain
 
Accelerating materials discovery with big data and machine learning
Anubhav Jain
 
Predicting the Synthesizability of Inorganic Materials: Convex Hulls, Literat...
Anubhav Jain
 
Discovering advanced materials for energy applications: theory, high-throughp...
Anubhav Jain
 
Applications of Large Language Models in Materials Discovery and Design
Anubhav Jain
 
An AI-driven closed-loop facility for materials synthesis
Anubhav Jain
 
Best practices for DuraMat software dissemination
Anubhav Jain
 
Best practices for DuraMat software dissemination
Anubhav Jain
 
Available methods for predicting materials synthesizability using computation...
Anubhav Jain
 
Efficient methods for accurately calculating thermoelectric properties – elec...
Anubhav Jain
 
Natural Language Processing for Data Extraction and Synthesizability Predicti...
Anubhav Jain
 
Machine Learning for Catalyst Design
Anubhav Jain
 
Natural language processing for extracting synthesis recipes and applications...
Anubhav Jain
 
Accelerating New Materials Design with Supercomputing and Machine Learning
Anubhav Jain
 
DuraMat CO1 Central Data Resource: How it started, how it’s going …
Anubhav Jain
 
The Materials Project
Anubhav Jain
 
Evaluating Chemical Composition and Crystal Structure Representations using t...
Anubhav Jain
 
Perspectives on chemical composition and crystal structure representations fr...
Anubhav Jain
 
Discovering and Exploring New Materials through the Materials Project
Anubhav Jain
 

Recently uploaded (20)

PDF
A water-rich interior in the temperate sub-Neptune K2-18 b revealed by JWST
Sérgio Sacani
 
PDF
PPT-7-Rocks-and-Minerals Lesson 5 Quarter 1
CarlVillanueva11
 
PDF
Migrating Katalon Studio Tests to Playwright with Model Driven Engineering
ESUG
 
PPTX
INTRO-TO-CRIM-THEORIES-OF-CRIME-2023 (1).pptx
ChrisFlickIII
 
PDF
Pakistan Journal of Zoological Sciences, Volume 1, Issue 1 (2025)
IJSmart Publishing Company
 
PDF
Directing Generative AI for Pharo Documentation
ESUG
 
PPTX
How to access global TV channels with a VPN easily.pptx
harshitseo1
 
PPTX
Earth's mechanism (plate tectonics and seafloor spreading).pptx
josephangeles001
 
PPT
An Introduction to Particle Accelerators.ppt
mowehe5553
 
PDF
Even Lighter Than Lightweiht: Augmenting Type Inference with Primitive Heuris...
ESUG
 
PPTX
biomolecules-class12th chapter board classespptx
SapnaTiwari58
 
PPTX
WEEK 4-MONO HYBRID AND DIHYBRID CROSS OF GREGOR MENDEL
AliciaJamandron1
 
PDF
urticaria-1775-rahulkalal-250606145215-0ff37bc9.pdf
GajananPatil761074
 
PPT
oscillatoria known as blue -green algae
Baher El-Nogoumy
 
PPTX
Embark on a journey of cell division and it's stages
sakyierhianmontero
 
PDF
Analysing Python Machine Learning Notebooks with Moose
ESUG
 
PDF
N-enhancement in GN-z11: First evidence for supermassive stars nucleosynthesi...
Sérgio Sacani
 
PPTX
2019 Upper Respiratory Tract Infections.pptx
jackophyta10
 
PDF
Little Red Dots As Late-stage Quasi-stars
Sérgio Sacani
 
PDF
MIRIDeepImagingSurvey(MIDIS)oftheHubbleUltraDeepField
Sérgio Sacani
 
A water-rich interior in the temperate sub-Neptune K2-18 b revealed by JWST
Sérgio Sacani
 
PPT-7-Rocks-and-Minerals Lesson 5 Quarter 1
CarlVillanueva11
 
Migrating Katalon Studio Tests to Playwright with Model Driven Engineering
ESUG
 
INTRO-TO-CRIM-THEORIES-OF-CRIME-2023 (1).pptx
ChrisFlickIII
 
Pakistan Journal of Zoological Sciences, Volume 1, Issue 1 (2025)
IJSmart Publishing Company
 
Directing Generative AI for Pharo Documentation
ESUG
 
How to access global TV channels with a VPN easily.pptx
harshitseo1
 
Earth's mechanism (plate tectonics and seafloor spreading).pptx
josephangeles001
 
An Introduction to Particle Accelerators.ppt
mowehe5553
 
Even Lighter Than Lightweiht: Augmenting Type Inference with Primitive Heuris...
ESUG
 
biomolecules-class12th chapter board classespptx
SapnaTiwari58
 
WEEK 4-MONO HYBRID AND DIHYBRID CROSS OF GREGOR MENDEL
AliciaJamandron1
 
urticaria-1775-rahulkalal-250606145215-0ff37bc9.pdf
GajananPatil761074
 
oscillatoria known as blue -green algae
Baher El-Nogoumy
 
Embark on a journey of cell division and it's stages
sakyierhianmontero
 
Analysing Python Machine Learning Notebooks with Moose
ESUG
 
N-enhancement in GN-z11: First evidence for supermassive stars nucleosynthesi...
Sérgio Sacani
 
2019 Upper Respiratory Tract Infections.pptx
jackophyta10
 
Little Red Dots As Late-stage Quasi-stars
Sérgio Sacani
 
MIRIDeepImagingSurvey(MIDIS)oftheHubbleUltraDeepField
Sérgio Sacani
 

Discovering advanced materials for energy applications by mining the scientific literature

  • 1. Discovering advanced materials for energy applications by mining the scientific literature Anubhav Jain Energy Technologies Area Lawrence Berkeley National Laboratory Berkeley, CA AFRL meeting, Jan 2020 Slides (already) posted to hackingmaterials.lbl.gov
  • 2. • Often, materials are known for several decades before their functional applications are known – MgB2 sitting on lab shelves for 50 years before its identification as a superconductor in 2001 – LiFePO4 known since 1938, only identified as a Li-ion battery cathode in 1997 • Even after discovery, optimization and commercialization still take decades • To get a sense for why this is so hard, let’s look at the problem in more detail … 2 Typically, both new materials discovery and optimization take decades
  • 3. What constrains traditional approaches to materials design? 3 “[The Chevrel] discovery resulted from a lot of unsuccessful experiments of Mg ions insertion into well-known hosts for Li+ ions insertion, as well as from the thorough literature analysis concerning the possibility of divalent ions intercalation into inorganic materials.” -Aurbach group, on discovery of Chevrel cathode for multivalent (e.g., Mg2+) batteries Levi, Levi, Chasid, Aurbach J. Electroceramics (2009)
  • 4. 4 Researchers are starting to fundamentally re-think how we invent the materials that make up our devices Next- generation materials design Computer- aided materials design Natural language processing “Self-driving laboratories”
  • 5. Outline 5 ① Natural language processing - where are we right now? ② What’s next for the NLP work?
  • 6. 6 Can ML help us work through our backlog of information we need to assimilate from text sources? papers to read “someday” NLP algorithms
  • 7. • It is difficult to look up all information any given material due to the many different ways chemical compositions are written – a search for “TiNiSn” will give different results than “NiTiSn” – a search for “GaSb” won’t match text that reads “Ga0.5Sb0.5” – a search for “SnBi4Te7” won’t match text that reads “we studied SnBi4X7 (X=S, Se, Te)”. – a search for “AgCrSe2”, if it doesn’t have any hits, won’t suggest “CuCrSe2” as a similar result • It is difficult to compile summaries, e.g.: – A list of all materials studied for an application – A list of all synthesis methods for a material 7 Traditional search doesn’t answer the questions we want
  • 8. What is matscholar? • Matscholar is an attempt to organize the world’s information on materials science, connecting together topics of study, synthesis and characterization methods, and specific materials compositions • It is also an effort to use state-of-the-art natural language processing to make collective use of the information in millions of articles
  • 9. One of our main projects concerns named entity recognition, or automatically labeling text 9
  • 10. 1 0 > 4 million Papers Collected 31 million Properties 19 million Materials Mentions 8.8 million Characterization Methods 7.5 million Applications 5 million Synthesis Methods •Data Collection: Over 4 million full papers* collected from more than 2100 journals. * Entities only extracted from abstracts deemed relevant to inorganic materials science (~2M) so far.
  • 11. 11 Now we can search! Live on www.matscholar.com
  • 13. 13 Ok so how does this work? High-level view Weston, L. et al Named Entity Recognition and Normalization Applied to Large-Scale Information Extraction from the Materials Science Literature. J. Chem. Inf. Model. (2019).
  • 14. Extracted 4 million abstracts of relevant scientific articles using various APIs from journal publishers Some are more difficult than others to obtain. Abstract collection continues … 14 Step 1 – data collection
  • 16. • First split the text into sentences – Seems simple, but remember edge cases like “et al.” or “etc.”, which do not necessarily signify the end of a sentence despite the period • Then split the sentences into words – Tricky things are detecting and normalizing chemical formulas, selective lowercasing (“Battery” vs “battery” or “BaS” vs “BAs”), homogenizing numbers, etc. • Done largely with the ChemDataExtractor* with some custom improvements – We may move to a fully custom tokenizer soon 16 Step 2 - tokenization *http://chemdataextractor.org
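The edge cases on this slide can be made concrete with a small sketch. This is not ChemDataExtractor and not the Matscholar tokenizer: the abbreviation list, the `<NUM>` placeholder, and the rule "keep case if the token has a digit or an uppercase letter after the first character" are all assumptions chosen for illustration.

```python
import re

# Abbreviations whose period should not end a sentence (illustrative list).
ABBREVIATIONS = ("et al.", "etc.", "e.g.", "i.e.")
NUMBER_TOKEN = "<NUM>"  # hypothetical placeholder for homogenized numbers


def split_sentences(text):
    """Split on sentence-ending punctuation, protecting known abbreviations."""
    protected = text
    for abbr in ABBREVIATIONS:
        protected = protected.replace(abbr, abbr.replace(".", "<DOT>"))
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", protected)
    return [part.replace("<DOT>", ".") for part in parts if part]


def normalize_token(token):
    """Lowercase ordinary words; leave formula-like tokens untouched."""
    if re.fullmatch(r"\d+(?:\.\d+)?", token):
        return NUMBER_TOKEN
    # "BaS", "BAs", "eV", "Ga0.5Sb0.5": case carries meaning, so keep it.
    if any(ch.isupper() for ch in token[1:]) or any(ch.isdigit() for ch in token):
        return token
    return token.lower()


def tokenize(sentence):
    """Split a sentence into normalized word tokens (punctuation dropped)."""
    raw_tokens = re.findall(r"\w+(?:\.\w+)*", sentence)
    return [normalize_token(tok) for tok in raw_tokens]


text = ("The Weston et al. LSTM model labels entities. "
        "Battery cathodes and BaS differ from BAs.")
for sentence in split_sentences(text):
    print(tokenize(sentence))
```

A heuristic this simple has obvious gaps: a sentence that genuinely ends in “etc.” is not split, and a lone element symbol such as “Ba” would be lowercased. That is presumably part of why a dedicated chemistry-aware tokenizer is used in practice.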
  • 18. • Part A is marking abstracts as relevant / non-relevant to inorganic materials science • Part B is tediously labeling ~600 abstracts – Largely done by one person – Spot-check of 25 abstracts by a second person gave 87.4% agreement 18 Step 3 – hand label abstracts
  • 20. • We use the word2vec algorithm (Google) to turn each unique word in our corpus into a 200-dimensional vector • These vectors encode the meaning of each word, based on trying to predict the context words around the target 20 Step 4a: the word2vec algorithm is used to “featurize” words Barazza, L. How does Word2Vec’s Skip-Gram work? Becominghuman.ai. 2017
  • 21. • We use the word2vec algorithm (Google) to turn each unique word in our corpus into a 200-dimensional vector • These vectors encode the meaning of each word, based on trying to predict the context words around the target 21 Step 4a: the word2vec algorithm is used to “featurize” words Barazza, L. How does Word2Vec’s Skip-Gram work? Becominghuman.ai. 2017 “You shall know a word by the company it keeps” - John Rupert Firth (1957)
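"Predicting context words around the target" comes down to training on (target, context) pairs drawn from a sliding window. The sketch below only generates those pairs; it does not train any vectors. The function name and the window size are assumptions, since the slides do not state the window used.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for every token within `window` positions.

    These pairs are the training examples a skip-gram model learns from;
    the window size here is an assumed value, not one taken from the talk.
    """
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


tokens = ["the", "band", "gap", "of", "GaSb"]
for target, context in skipgram_pairs(tokens, window=1):
    print(target, "->", context)
```

Words that keep appearing with the same context words end up with similar vectors, which is the Firth quote in operational form.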
  • 22. • The classic example is: – “king” - “man” + “woman” = ? → “queen” 22 Word embeddings trained on “normal” text learn relationships between words
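The analogy is ordinary vector arithmetic followed by a nearest-neighbor lookup. The sketch below uses hand-made three-dimensional toy vectors, invented purely so the arithmetic is easy to follow; real embeddings would be the learned 200-dimensional vectors. The function names are hypothetical.

```python
import math

# Toy vectors invented for illustration; not learned from any corpus.
TOY_VECTORS = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.5, 0.5, 0.5],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(vectors, a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    query = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (word for word in vectors if word not in (a, b, c))
    return max(candidates, key=lambda word: cosine(vectors[word], query))


print(analogy(TOY_VECTORS, "king", "man", "woman"))
```

Excluding the three input words from the candidates is the usual convention; without it, the nearest neighbor of the query is often one of the inputs.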
  • 24. • If you read this sentence: “The band gap of ___ is 4.5 eV” It is clear that the blank should be filled in with a material word (not a synthesis method, characterization method, etc.) How do we get a neural network to take into account context (as well as properties of the word itself)? 24 Step 4b: How do we train a model to recognize context?
  • 25. 25 Step 4b. An LSTM neural net classifies words by reading word sequences Weston, L. et al Named Entity Recognition and Normalization Applied to Large-Scale Information Extraction from the Materials Science Literature. J. Chem. Inf. Model. (2019).
  • 27. 27 Step 5. Sit back and let the model label things for you! Named Entity Recognition • Custom machine learning models to extract the most valuable materials-related information. • Utilizes a long short-term memory (LSTM) network trained on ~1000 hand-annotated abstracts. • F1 scores of ~0.9. F1 score for inorganic materials extraction is >0.9. Weston, L. et al Named Entity Recognition and Normalization Applied to Large-Scale Information Extraction from the Materials Science Literature. J. Chem. Inf. Model. (2019).
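For reference, the F1 score quoted here is the harmonic mean of precision and recall. The sketch below computes it per label at the token level. That granularity is an assumption (the cited paper may score whole entities instead), and the label names `MAT`, `PRO`, and `O` are hypothetical stand-ins for the real tag set.

```python
def f1_for_label(true_labels, predicted_labels, label):
    """Token-level precision, recall, and F1 for a single label."""
    pairs = list(zip(true_labels, predicted_labels))
    true_pos = sum(1 for t, p in pairs if t == label and p == label)
    false_pos = sum(1 for t, p in pairs if t != label and p == label)
    false_neg = sum(1 for t, p in pairs if t == label and p != label)

    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy example with hypothetical tags: MAT = material, PRO = property, O = other.
truth = ["MAT", "O", "PRO", "O", "MAT", "O"]
guess = ["MAT", "O", "PRO", "MAT", "O", "O"]
print(f1_for_label(truth, guess, "MAT"))
print(f1_for_label(truth, guess, "PRO"))
```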
  • 29. 29 Could these techniques also be used to predict which materials we might want to screen for an application? papers to read “someday” NLP algorithms
  • 30. • The classic example is: – “king” - “man” + “woman” = ? → “queen” 30 Remember that word embeddings seem to learn relationships in text
  • 31. 31 For scientific text, it learns scientific concepts as well crystal structures of the elements Tshitoyan, V. et al. Unsupervised word embeddings capture latent knowledge from materials science literature. Nature 571, 95–98 (2019).
  • 32. 32 There seems to be materials knowledge encoded in the word vectors Tshitoyan, V. et al. Unsupervised word embeddings capture latent knowledge from materials science literature. Nature 571, 95–98 (2019).
  • 33. 33 Note that more data is not always better! We want relevance Tshitoyan, V. et al. Unsupervised word embeddings capture latent knowledge from materials science literature. Nature 571, 95–98 (2019).
  • 34. 34 Word embeddings also have the periodic table encoded in them with no prior knowledge “word embedding” periodic table
  • 35. • Dot product of a composition word with the word “thermoelectric” essentially predicts how likely that word is to appear in an abstract with the word thermoelectric • Compositions with high dot products are typically known thermoelectrics • Sometimes, compositions have a high dot product with “thermoelectric” but have never been studied as a thermoelectric • These compositions usually have high computed power factors! (DFT+BoltzTraP) 35 Making predictions: dot products measure likelihood for words to co-occur Tshitoyan, V. et al. Unsupervised word embeddings capture latent knowledge from materials science literature. Nature 571, 95–98 (2019).
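The ranking step described on this slide can be sketched in a few lines. Everything concrete below is invented for illustration: the two-dimensional vectors, the placeholder names `known_TE`, `candidate_X`, and `unrelated_Y`, and the function name `rank_candidates`. Only the idea of scoring by dot product with the “thermoelectric” vector and then filtering out already-studied compositions comes from the slide.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def rank_candidates(target_vector, composition_vectors, already_studied):
    """Rank compositions by dot product with the target word's vector.

    Returns (name, score, is_new) tuples, highest score first, where is_new
    marks compositions not yet reported for the application.
    """
    scored = [(name, dot(vec, target_vector))
              for name, vec in composition_vectors.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return [(name, score, name not in already_studied) for name, score in scored]


# Toy data: placeholder names and made-up 2-d vectors.
thermoelectric = [1.0, 0.5]
compositions = {
    "known_TE": [0.9, 0.8],
    "candidate_X": [0.8, 0.6],
    "unrelated_Y": [-0.2, 0.1],
}
ranking = rank_candidates(thermoelectric, compositions, already_studied={"known_TE"})
new_predictions = [name for name, _, is_new in ranking if is_new]
print(new_predictions)
```

The interesting entries are those that score close to known thermoelectrics but are flagged as new, since those are the ones the slide reports as usually having high computed power factors.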
  • 36. 36 Try “going back in time” and ranking materials, and follow what happens in later years Tshitoyan, V. et al. Unsupervised word embeddings capture latent knowledge from materials science literature. Nature 571, 95–98 (2019).
  • 37. – For every year since 2001, see which compounds we would have predicted using only literature data until that point in time – Make predictions of what materials are the most promising thermoelectrics for data until that year – See if those materials were actually studied as thermoelectrics in subsequent years 37 A more comprehensive “back in time” test Tshitoyan, V. et al. Unsupervised word embeddings capture latent knowledge from materials science literature. Nature 571, 95–98 (2019).
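The bookkeeping behind such a test might look like the sketch below. It assumes the per-year predictions have already been produced by embeddings trained only on literature up to that year; the function name `hit_rate`, the five-year horizon, and the placeholder materials "A" to "E" with their dates are all invented for illustration and are not results from the paper.

```python
def hit_rate(predictions_by_year, first_reported_year, horizon=5):
    """Fraction of each year's predictions first reported within `horizon` years.

    predictions_by_year: {year: [materials predicted using data up to that year]}
    first_reported_year: {material: year it was first studied for the application}
    The horizon is an assumed evaluation window.
    """
    results = {}
    for year, predicted in predictions_by_year.items():
        hits = [
            material for material in predicted
            if year < first_reported_year.get(material, float("inf")) <= year + horizon
        ]
        results[year] = len(hits) / len(predicted) if predicted else 0.0
    return results


# Toy data with placeholder material names.
predictions = {2001: ["A", "B", "C", "D"], 2005: ["C", "E"]}
first_reported = {"A": 2003, "C": 2008, "E": 2006}
print(hit_rate(predictions, first_reported))
```

Comparing this rate against the same quantity for randomly chosen unstudied materials would indicate whether the ranking carries real signal.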
  • 38. • Thus far, 2 of our top 20 predictions made in ~August 2018 have already been reported in the literature for the first time as thermoelectrics – Li3Sb was the subject of a computational study (predicted zT=2.42) in Oct 2018 – SnTe2 was experimentally found to be a moderately good thermoelectric (expt zT=0.71) in Dec 2018 • We are working with an experimentalist on one of the predictions (as a “spare time” project) 38 How about “forward” predictions? [1] Yang et al. "Low lattice thermal conductivity and excellent thermoelectric behavior in Li3Sb and Li3Bi." Journal of Physics: Condensed Matter 30.42 (2018): 425401 [2] Wang et al. "Ultralow lattice thermal conductivity and electronic properties of monolayer 1T phase semimetal SiTe2 and SnTe2." Physica E: Low-dimensional Systems and Nanostructures 108 (2019): 53-59
  • 39. 39 How is this working? “Context words” link together information from different sources
  • 40. Outline 40 ① Natural language processing - where are we right now? ② What’s next for the NLP work?
  • 41. • Currently, we only have word vectors for compositions that explicitly appear in abstracts • We can rank known materials for an application, but for materials with zero or little mention in the scientific literature, we are stuck! • How do we get word embeddings for compositions that do not exist in the text? 41 Making predictions for entirely new compositions
  • 43. 43 Initial results – predicting experimental band gap from composition (~3000 data points)
  • 44. 44 Going beyond entity recognition towards relationship extraction
  • 45. 45 Current approach is not good enough
  • 46. • E.g., automatically generate databases from the literature – Materials and their numerical band gaps (or thermal conductivities, or bulk modulus, or superconducting temperature, etc.) – If materials can be made n-type, p-type, or both – Which synthesis techniques led to various sample descriptors • Will likely require more powerful techniques, e.g., attention-based algorithms (BERT, Google XLNet …) – To be investigated … 46 Once the accuracy improves, we can start to make much more powerful searches
  • 47. 47 D2S2 - data driven synthesis science (just starting) Can we combine natural language processing with theory and experiments to control synthesis?
  • 48. Title auto-generated from abstract → published title: – “Dynamics of molecular hydrogen confined in narrow nanopores” → “Restricted dynamics of molecular hydrogen confined in activated carbon nanopores” – “Microfluidic Generation of Polydisperse Solid Foams” → “Generation of Solid Foams with Controlled Polydispersity Using Microfluidics” – “Minimum variance unbiased estimator of product performance” → “Assessing the lifetime performance index of gamma lifetime products in the manufacturing industry” – “Angle resolved ultraviolet photoemission study of fluorescein films on Ag 110” → “The growth of thin fluorescein films on Ag 110” 48 ... and also some fun things, like automatic title generation
  • 49. 49 Acknowledgements Slides (already) posted to hackingmaterials.lbl.gov • High-throughput DFT – Gerbrand Ceder and “BURP” team – Funding: Bosch / Umicore • Natural language processing – Gerbrand Ceder, Kristin Persson, and “Matscholar” team – Funding: Toyota Research Institutes • Overall work funded by US Department of Energy
  • 50. 50 The Matscholar team: Kristin Persson, Anubhav Jain, Gerbrand Ceder, John Dagdelen, Leigh Weston, Vahe Tshitoyan, Amalie Trewartha, Alex Dunn, Viktoriia Baibakova Funding from (now at Google) (now at Medium)