Computational Materials Science 152 (2018) 60–69

Contents lists available at ScienceDirect

Computational Materials Science


journal homepage: www.elsevier.com/locate/commatsci

Matminer: An open source toolkit for materials data mining


Logan Ward (a,b), Alexander Dunn (c,d), Alireza Faghaninia (c), Nils E.R. Zimmermann (c), Saurabh Bajaj (c,e), Qi Wang (c), Joseph Montoya (c), Jiming Chen (f), Kyle Bystrom (d), Maxwell Dylla (g), Kyle Chard (a,b), Mark Asta (d), Kristin A. Persson (c), G. Jeffrey Snyder (g), Ian Foster (a,b), Anubhav Jain (c)

a Computation Institute, University of Chicago, Chicago, IL 60637, United States
b Data Science and Learning Division, Argonne National Laboratory, Argonne, IL 60439, United States
c Lawrence Berkeley National Laboratory, Energy Technologies Area, 1 Cyclotron Road, Berkeley, CA 94720, United States
d Department of Materials Science and Engineering, University of California, Berkeley, CA 94720, United States
e Citrine Informatics, Redwood City, CA 94063, United States
f Department of Chemical Engineering, University of Illinois, Urbana, IL 61801, United States
g Department of Materials Science and Engineering, Northwestern University, Evanston, IL 60208, United States

ARTICLE INFO

Keywords: Data mining; Open source software; Machine learning; Materials informatics

ABSTRACT

As materials data sets grow in size and scope, the role of data mining and statistical learning methods to analyze these materials data sets and build predictive models is becoming more important. This manuscript introduces matminer, an open-source, Python-based software platform to facilitate data-driven methods of analyzing and predicting materials properties. Matminer provides modules for retrieving large data sets from external databases such as the Materials Project, Citrination, Materials Data Facility, and Materials Platform for Data Science. It also provides implementations for an extensive library of feature extraction routines developed by the materials community, with 47 featurization classes that can generate thousands of individual descriptors and combine them into mathematical functions. Finally, matminer provides a visualization module for producing interactive, shareable plots. These functions are designed in a way that integrates closely with machine learning and data analysis packages already developed and in use by the Python data science community. We explain the structure and logic of matminer, provide a description of its various modules, and showcase several examples of how matminer can be used to collect data, reproduce data mining studies reported in the literature, and test new methodologies.

1. Introduction

Recently, the materials community has placed a renewed emphasis on collecting and organizing large data sets for research, materials design, and the eventual application of statistical or “machine learning” techniques. For example, the mining of databases composed of density functional theory (DFT) calculations has been used to identify materials for batteries [1,2], to aid the design of metal alloys [3,4], and for many other applications [5]. Importantly, such data sets present new opportunities to develop predictive models through machine learning techniques: rather than designing and programming such models manually, such techniques produce predictive models by learning from a body of examples. Machine learning models have been demonstrated to predict properties of crystalline materials much faster than DFT [6–9], estimate properties that are difficult to access via other computational tools [10,11], and guide the search for new materials [12–16]. With the continued development of general-purpose data mining methods for many types of materials data [17–19] and the proliferation of material property databases [20], this emerging field of “materials informatics” is positioned to have a continued impact on materials design.

In this paper, we describe a new software library, “matminer”, for applying data-driven techniques to the materials domain. The main roles of matminer are depicted in Fig. 1: matminer assists the user in retrieving large data sets from common databases, extracts features to transform the raw data into representations suitable for machine learning, and produces interactive visualizations of the data for exploratory analysis. We note that matminer does not itself implement common machine learning algorithms; industry-standard tools (e.g., scikit-learn or Keras) are already developed and maintained by the larger data science community for this purpose. Instead, matminer's role is to connect these advanced machine learning tools to the materials domain.


Corresponding author at: Lawrence Berkeley National Laboratory, Energy Technologies Area, 1 Cyclotron Road, Berkeley, CA 94720, United States.
E-mail addresses: [email protected] (L. Ward), [email protected] (A. Jain).

https://doi.org/10.1016/j.commatsci.2018.05.018
Received 16 April 2018; Accepted 7 May 2018
Available online 25 May 2018
0927-0256/ © 2018 Elsevier B.V. All rights reserved.
L. Ward et al. Computational Materials Science 152 (2018) 60–69

Fig. 1. Overview of the capabilities of matminer. Matminer aids the user in constructing a data pipeline for materials informatics and is composed of three main
components: (1) tools for retrieving data from a variety of materials databases, (2) tools for extracting features (or descriptors) from materials data, and (3) re-useable
and customizable recipes for visualizing materials data. Data is retrieved and processed in a way that makes it simple to integrate matminer with external machine
learning libraries such as scikit-learn and Keras.

Matminer solves many problems encountered when conducting data-driven research. For example, learning the Application Programming Interface (API) for each data source and preprocessing retrieved data adds significant complexity to the task of building new machine learning models. Matminer provides a simplified interface that abstracts the details of these API interactions, making it easy for the user to query and organize large data sets into the standard pandas [21] data format used by the Python data science community. Furthermore, as discussed later in the text, matminer implements a suite of 47 distinct feature extraction modules capable of producing thousands of physically relevant descriptors that can be leveraged by machine learning algorithms to more efficiently determine input-output relationships. Although many such feature extraction methods are reported in the literature, many lack an open source implementation. Matminer not only implements these domain-specific feature extraction methods but also provides a unified interface for their use, making it trivial to reproduce or compare (and, eventually, extend) these methods. Finally, matminer contains many pre-defined recipes of visualizations for exploring and discovering different data relationships. In aggregate, these features allow for cutting edge materials informatics research to be conducted with a high-level, easy-to-use interface.

We note that prior efforts have produced software for computing features for materials (e.g., Magpie [22,23], pyMKS [24]), building deep learning models of molecular materials (e.g., deepchem [25,26]), providing turnkey machine learning estimates of various properties, or integrating machine learning with other software [27–29]. In contrast to these prior efforts (which have their own intended applications and scope), matminer is designed to interact and integrate with standard Python data mining tools such as pandas and scikit-learn [30], implements a library of feature generation methods (“featurizers”) for a wide variety of materials science entities (e.g., compositions, crystal structures, and electronic structures), and includes tools to assist with data retrieval and visualization.

The source code for the version of matminer described in this manuscript (version 0.3.2) and examples of its use are available as supplementary information. Updated versions are regularly published to the Python Package Index (https://pypi.python.org/pypi/matminer). The actively developed version of matminer is available on GitHub at https://github.com/hackingmaterials/matminer. Matminer also includes a dedicated repository of examples and tutorials (many in an interactive, runnable Jupyter notebook format [31]) for using the data retrieval, featurization, and visualization tools, located at https://github.com/hackingmaterials/matminer_examples. Full documentation for matminer is also available from https://hackingmaterials.github.io/matminer/. The matminer code currently contains 109 unit tests to ensure the integrity of the code, which are run automatically with each code commit through a continuous integration process. A help forum for matminer is available at: https://groups.google.com/forum/#!forum/matminer.

2. Software architecture and design principles

A guiding principle of matminer is to integrate domain-specific knowledge and data about materials into the larger ecosystem of Python data analysis software. The Python community has developed a rich suite of interoperable tools for data science, which are broadly used across the data science community and occasionally known as the “PyData” or “SciPy” stacks [32]. These libraries include NumPy and SciPy [33], which provide a suite of high-performance numerical methods, and Jupyter [31], which facilitates interactive data analysis. Matminer is designed to allow users to leverage these professional-level data science libraries for materials science studies.

A central tool in the PyData stack is the pandas DataFrame, which is a tabular representation of data similar to (but more powerful than) a virtual spreadsheet [21]. Pandas makes it possible, for example, to load a data set and perform many common data post-processing procedures, such as filtering, grouping, joining, computing rolling averages, and producing descriptive statistics. Additionally, data formatted into a pandas DataFrame can be easily used with other Python data analysis libraries, such as scikit-learn, NumPy, and matplotlib. DataFrames can also be visualized as interactive tables within Jupyter notebooks, and they can be serialized into multiple formats to allow them to be archived and shared. Because of all the benefits and features that are achieved by transforming data into the DataFrame format, matminer's data retrieval API automatically formats data that it retrieves from external sources into this format. Data retrieved through matminer is thus immediately ready for a wide variety of tasks, including data cleaning, data exploration, data transformations, data visualization, and machine learning. As described in later sections, all data extraction, featurization, and visualization tools in matminer can generate or operate on


pandas DataFrame objects.

Matminer is also designed to integrate closely with the scikit-learn machine learning library [30]. Scikit-learn is the de facto standard machine learning library for Python. In addition to its rich suite of machine learning algorithms, scikit-learn contains utilities useful for all aspects of the machine learning process (e.g., data preprocessing, model selection, hyperparameter tuning). Other machine learning libraries, such as Keras [34] and TensorFlow [35], also provide scikit-learn-compatible wrappers for their models, which further motivates the importance of making matminer easily compatible with scikit-learn. Matminer achieves integration with scikit-learn in two ways. First, the pandas DataFrame objects produced by matminer are tightly integrated with scikit-learn through the interoperability built into the PyData stack. Second, the feature extraction methods implemented by matminer follow the same model as (and, more formally, subclass) scikit-learn’s preprocessing methods. This allows matminer feature extraction methods to be used with scikit-learn's Pipeline functionality and makes it easy to combine data processing methods present in the two libraries.

Matminer also heavily leverages the pymatgen [36] materials science library. Matminer's use of the pymatgen library makes it unnecessary to recreate complex or materials-science-specific algorithms (e.g., space group determination) when implementing new feature extraction methods. Overall, the software architecture of matminer is designed to bridge the gap between the professional-level data science tools developed by the Python community and the tools, techniques, and data specific to the materials domain.

3. Components of matminer

We now describe the main functions of matminer. We describe each of the three major components separately: data retrieval, featurization, and visualization.

3.1. Data retrieval

The first step in data mining is to obtain a data set that is ideally large and diverse. There are several efforts underway in the materials community to build such databases of materials properties [37–44]. However, while the proliferation of databases is a great benefit to materials informatics, the use of these data sources is complicated by the fact that each database implements a different API, authentication method, and schema. One core function of matminer is to provide a consistent API around different databases and return the data in a form that is suitable for use in data mining tools.

At the time of writing, matminer supports data retrieval from four commonly used materials databases: Citrination [40,43], Materials Project (MP) [39], Materials Data Facility (MDF) [44], and Materials Platform for Data Science (MPDS) [45]. In addition, a generic MongoDB interface supports data retrieval from any MongoDB resource [46]. Below, we describe these data retrieval tools in detail:

(i) Citrination, developed by Citrine Informatics [40], is a centralized database that contains a variety of materials data, including experimental measurements and computational results, all in a common data schema, the “pif” [47]. The matminer data retrieval tool uses Citrine’s citrination-client library to retrieve data from Citrination, and then converts the data from the hierarchical pif format to a tabular DataFrame format. In the process of converting the pif records, matminer retrieves all details describing a material (e.g., composition), its known properties, and how these properties were determined.
(ii) The Materials Project (MP) [39] primarily contains DFT [48,49] computed properties for over 60,000 compounds. In a similar fashion to the Citrination data extractor, matminer uses the existing MP API [50] (as implemented in the “MPRester” class of the Python Materials Genomics (pymatgen) library [36]) to query the database. MPDataRetrieval allows users to access a wide variety of properties of crystalline materials, including their crystal structures, electronic band structure, phonon dispersion, and piezoelectric, dielectric, and elastic constants.
(iii) The Materials Data Facility (MDF) is geared towards enabling researchers to publish their own data sets across a wide array of data types and materials subdisciplines. Matminer contains an MDFDataRetrieval class that uses the MDF's own Forge library [51] to perform the bulk of the search function but assists the user in formatting the final data to a standardized pandas DataFrame object.
(iv) The Materials Platform for Data Science (MPDS) [45] is a commercial database that includes phase diagram data (∼60,000 entries), crystal structure data (∼400,000 entries), and materials property values (∼800,000 entries). The MPDSDataRetrieval class in matminer can retrieve and format information from this database.
(v) MongoDB is a popular tool in the data mining community due to its efficient and flexible data model [46]. For example, data generated through the atomate [52] computational suite is stored in such databases. The “MongoDataRetrieval” class of matminer converts MongoDB documents to rows of a pandas DataFrame.

All database tools are consistent in that they (i) contain a “get_dataframe” method that makes a query to the database and (ii) return the data in a pandas DataFrame object. The “get_dataframe” method for each source takes query instructions in a simple, standard format. We also provide the ability to run queries in the language specific to each source. In doing so, we both provide a novice-friendly route for using new data sources and maintain the ability for experts to access all features of a familiar data source. However, matminer does standardize the output such that data mining tools written for one database can be easily applied to another. One benefit of the uniformity of the APIs and output formats provided by matminer is that these features make it easy to combine data from multiple sources. The data merging tools built into the pandas DataFrame object facilitate this procedure. For example, it is straightforward to retrieve experimental band gap energies from Citrination and then compare those values with computed band gap energies from Materials Project or the OQMD (this specific example is described in detail in Section 4.2).

Matminer also contains several built-in datasets that can be loaded directly with a single line of Python and do not require external database calls or setting any options. These built-in datasets include: 1181 DFT-based elastic tensors [53], 941 DFT-based piezoelectric tensors [54], 1056 DFT-based dielectric constants [55], and 3938 DFT-based formation energies [39,56]. The built-in data sets make it simple to begin testing and developing data mining methods.

Finally, a user can load their own data set using the built-in tools of the pandas library, which can load data from CSV, Excel, or various other formats. This process can be conducted independently of matminer, but the final data format will be compatible with the subsequent data featurization tools of matminer.

3.2. Data featurization: Transforming materials-related quantities into physically relevant descriptors

Typically, machine learning employs an intermediate step between compiling raw data and applying a machine learning algorithm. This step converts data from a raw format (often specialized for parsing by a particular software package or formatted for human readability) into a numerical representation that is useful for visualization or machine learning software. This process is called “feature extraction”, “featurization”, or generating “descriptors”. Featurization transforms or augments the raw data (which might have a very complicated and difficult to learn relationship between inputs and outputs) into a set of physically relevant quantities that reflect the relationships between the input


and output variables. The feature extraction step is one of the main ways in which one can exploit domain knowledge to vastly improve the performance of a machine learning algorithm. For example, common features that are extracted from a chemical composition include the differences in electronegativities of the component elements or the sum of atomic radii of the various elements.

Many generalizable featurization approaches have been proposed in the literature for different types of materials data [18,22,25,56–61]. However, the software required to use them is often unavailable, not open-source, or distributed across many repositories. The lack of published software means that employing these methods in practice requires a significant time investment. Through matminer, we make these community developments in machine learning available to the community by providing open-source implementations of various featurization methods. Furthermore, despite the diversity of methodologies, matminer provides a uniform interface to all featurizers, freeing researchers to rapidly iterate through different approaches and determine the method best suited to their application.

All featurizer classes in matminer follow a common code-design pattern by inheriting from a base class, BaseFeaturizer, which defines the template for all featurization classes. BaseFeaturizer prescribes the four methods that must be implemented by each new featurizer:

1. The “featurize” method does the core work. It transforms materials data (e.g., a composition) into the desired feature values (e.g., element properties such as atomic weight, atomic radii, and Mendeleev number).
2. The “feature_labels” method provides descriptive labels that correspond to the feature values computed in the “featurize” method. These feature_labels can be thought of as column labels for the various features (and are indeed used as column labels when featurizing an entire DataFrame).
3. The “citations” method returns a list of BibTeX-formatted references that a user should read to fully understand the features and cite if they are used. The citations method thus provides background and context for the featurizers and appropriate attribution to the original developers of the methodology.
4. The “implementors” method provides the name of the person(s) who implemented and are responsible for maintaining the featurizer. This is useful if one has a question, comment, or suggestion regarding the specific implementation details of a featurization method.

BaseFeaturizer provides additional functions that a user can call once these four methods are implemented. For example, the “featurize_dataframe” method uses the “featurize” and “feature_labels” operations to add the features to an entire pandas DataFrame. That is, featurize_dataframe will process potentially thousands or millions of rows of data, exploiting Python's multiprocessing functionality to parallelize over available cores. The BaseFeaturizer class also follows the pattern used by featurizers in the scikit-learn machine learning library, which allows matminer featurization classes to be integrated easily with existing scikit-learn tools. For example, one can build a data processing pipeline that mixes some of the data normalization tools present in scikit-learn with the materials-specific features implemented in matminer.

Matminer contains, at the time of writing, a total of 47 featurizers that support the generation of features for diverse types of materials data. Each of these featurizers can produce many individual features/descriptors, such that it is possible to generate thousands of total features with the matminer code. For example, the ElementProperty featurizer will convert a chemical composition into various summary statistics of the properties of that composition's component elements (e.g., average ionic radius or standard deviation of elemental melting points). The BandFeaturizer will convert a complex electronic band structure into quantities such as band gap and the norm of k point coordinates at which the conduction band minimum and valence band maximum occur.

We have grouped the featurizers into five different Python modules based on the input data type: (i) composition, (ii) (crystal) structure, (iii) density of (electronic) states, (iv) band structure, and (v) (atomic) site. The featurizers available in matminer in each module are presented in Fig. 2. In Table 1, we briefly describe each featurizer and provide the canonical reference(s). The complete source code for each featurizer is available in matminer such that users can employ, fully inspect, and modify the implementations of these methods.

Fig. 2. Overview of the 47 featurizers that are currently available in five different modules (composition, site, structure, bandstructure, dos) of matminer. Each featurizer can generate one or hundreds of features, such that matminer as a whole is capable of producing thousands of individual features.

In addition to these individual featurizers, we provide a FunctionFeaturizer that combines individual features into functions such as products, quotients, logarithms, or any arbitrary mathematical expression. This procedure allows one to generate a large space of candidate features from even a small number of initial input features and has been observed to be useful in several previous works in the materials domain [18,62]. The implementation in matminer leverages the sympy library [63], which can eliminate symbolically redundant features.

3.3. Data visualization

A crucial step of a materials informatics workflow is visualizing data, which is helpful in understanding outliers, selecting features, and guiding the machine learning process. Many data-driven materials studies generate a standard suite of similar charts, such as heatmaps or two-dimensional scatter plots, which condense multiple complex relationships into simple, informative figures. For example, visualizing distributions of data (such as histograms and violin plots) at intermediate steps in the workflow process is a useful tool for pruning data and identifying outliers. Matminer drastically simplifies making many common visualizations.

Although there exist several excellent plotting libraries in Python (e.g., matplotlib [81] and seaborn [82]), these libraries are not designed to generate interactive plots that are also easy to share and serialize to a raw data format. Fortunately, the Plotly library [83] provides the needed functionality; however, its integration with standard


Table 1
A list of the featurizers currently implemented in matminer. Each row in the table provides the name of the relevant Python class, a concise description of the features
it computes, and the appropriate references to the original methodology.
Featurizer Description Reference

composition.py
AtomicOrbitals Highest occupied molecular orbital (HOMO) and lowest unoccupied molecular orbital (LUMO) using orbital energies from [64]
NIST.
AtomicPackingEfficiency Packing efficiency based on a geometric theory of the amorphous packing [90]
BandCenter Estimation of absolute position of band center using geometric mean of electronegativity. [65]
CationProperty Element property attributes of cations in a composition [66]
CohesiveEnergy Cohesive energy per atom of a compound by adding known elemental cohesive energies from the formation energy of the [67]
compound.
ElectronAffinity Average electron affinity times formal charge of anion elements. [66]
ElectronegativityDiff Statistics on electronegativity difference between anions and cations. [66]
ElementFraction Fraction of each element in a composition. –
ElementProperty Statistics of various element properties [22,36,66]
IonProperty Maximum and average ionic character, whether a composition is charge-balanced [22]
Miedema Formation enthalpies of intermetallic compounds, solid solutions, and amorphous phases using semi-empirical Miedema [68–70]
model (and some extensions).
OxidationStates Statistics of oxidation states. [66]
Stoichiometry Lp norm-based stoichiometric attributes. [22]
TMetalFraction Fraction of magnetic transition metals. [66]
ValenceOrbital Valence orbital attributes such as the mean number of electrons in each shell. [22,66]
YangSolidSolution Mixing thermochemistry and size mismatch terms of Yang and Zhang (2012) [91]

structure.py
BagofBonds Representation where each structure is represented based on the types of and distances between each pair of sites [71]
BondFraction Fraction of nearest neighbors between each element (e.g., C-O vs C-C) bonds [71]
ChemicalOrdering How much the ordering of species in the structure differs from random [6]
ColoumbMatrix Coulomb matrix (Mij = Zi Zj /|Ri – Rj| for i ≠ j, Zi2.4/2 for i = j, with Zi and Ri the nuclear charge and the position of atom i). [7]
ElectronicRadialDistributionFunction RDF in which the positions of neighboring sites are weighted by electrostatic interactions inferred from atomic partial [72]
charges.
EwaldEnergy Energy from Coulombic interactions based on charge states of each site [73]
GlobalSymmetryFeatures Symmetry information such as spacegroup number and (enumerated) crystal system type. –
MaximumPackingEfficiency Maximum possible packing efficiency of this structure [6]
MinimumRelativeDistances Closest neighbor distances for all sites, where relative distance are used fij = rij/(riatom + rjatom) with riatom being radius of [74]
atom or ion i.
OrbitalFieldMatrix Average of the 32 by 32 matrix descriptions of the chemical environment of each atom in the unit cell, based on the group [75]
numbers, row numbers (optional), distances of coordinating atoms, and Voronoi Polyhedra weights.
PartialRadialDistributionFunction Frequency of bonds across varied ranges of length between certain pairs of elements [58]
RadialDistributionFunction Conventional radial distribution function (RDF) of a crystal structure. –
RadialDistributionFunctionPeaks Distances of the largest peaks in the RDF of a structure –
StructuralHeterogeneity Variance in the bond lengths and atomic volumes in a structure [6]
SineCoulombMatrix Same as the CoulombMatrix, except the nondiagonal elements are weighted by B· ∑k = {x ,y,z } ek̂ sin2 [πek̂ B −1·rij ]−1
2 , where rij
[56]
is the vector between atoms i and j and B is the lattice matrix, rather than 1/rij.
SiteStatsFingerprint Generates features pertaining to an entire structure by computing statistics across the features of all sites in the unit cell. –

bandstructure.py
BandFeaturizer Non-zero band gap, direct band gap, k-point degeneracy, relative energy to CBM/VBM at arbitrary list of k-points and at conduction/valence bands. –
BranchPointEnergy Branch-point energy by averaging the energy of arbitrary number of conduction and valence bands throughout the full Brillouin zone. [76]

dos.py
DopingFermi Fermi level associated with a specified carrier concentration and temperature. –
DOSFeaturizer The top N contributors to the density of states at the valence and conduction band edges. Includes chemical species, orbital character, and orbital location information. –

site.py
AGNIFingerprints Fingerprints based on integrating the distances product of the radial distribution function with a Gaussian window function. [77]
AngularFourierSeries Encodes both radial and angular information about site neighbors. Each feature is a sum of the product of two distance functions between atoms that share the central site and the cosine of the angle between them. [17]
ChemEnvSiteFingerprint Local site environment fingerprint computed with the chemenv module in pymatgen. [74,78]
ChemicalSRO Chemical short-range ordering features to evaluate deviation of local chemistry with the nominal composition of entire structure. [79]
CoordinationNumber Number of first nearest neighbors of a site. [74]
CrystalSiteFingerprint Coordination number percentage and local structure order parameters computed from the neighbor environment of a site; Voronoi decomposition-based neighbor finding. [74]
GaussianSymmFunc Gaussian radial and angular symmetry functions originally proposed for fitting machine learning potentials. [28,80]
GeneralizedRadialDistributionFunction A radial distribution function where the bins do not need to act in a “histogram” mode. The bins can be any arbitrary function such as Gaussians, Bessel functions, or trig functions. [17]
LocalPropertyDifference Differences in elemental properties between site and its neighboring sites. [6]
OPSiteFingerprint Local structure order parameters computed from the neighbor environment of a site; distance-based neighbor finding. [74]
VoronoiFingerprint Voronoi indices, i-fold symmetries and statistics of Voronoi facet areas, sub-polyhedron volumes and distances derived by Voronoi tessellation analysis. [79]
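
To make the SineCoulombMatrix entry in the table above concrete, the following is a minimal pure-Python sketch of the periodic weight and of the matrix built from it. It is not matminer's implementation. It assumes that the columns of `lattice` are the lattice vectors (so that B⁻¹·r gives fractional coordinates), that the diagonal uses the usual Coulomb-matrix term 0.5·Z^2.4, and that each off-diagonal element is Z_i·Z_j times the weight. All function names are hypothetical.

```python
import math


def mat_vec(m, v):
    """Multiply a 3x3 matrix (list of rows) by a length-3 vector."""
    return [sum(m[i][k] * v[k] for k in range(3)) for i in range(3)]


def inverse_3x3(m):
    """Invert a 3x3 matrix with the adjugate (cofactor) formula."""
    (a, b, c), (d, e, f), (g, h, i) = m
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    if abs(det) < 1e-12:
        raise ValueError("lattice matrix is singular")
    return [
        [(e * i - f * h) / det, (c * h - b * i) / det, (b * f - c * e) / det],
        [(f * g - d * i) / det, (a * i - c * g) / det, (c * d - a * f) / det],
        [(d * h - e * g) / det, (b * g - a * h) / det, (a * e - b * d) / det],
    ]


def sine_weight(lattice, r_ij):
    """Periodic stand-in for 1/|r_ij|: || B . sum_k e_k sin^2(pi e_k B^-1 . r_ij) ||^-1.

    Assumes the columns of `lattice` are the lattice vectors and that r_ij
    is not itself a lattice vector (the norm would then be zero).
    """
    frac = mat_vec(inverse_3x3(lattice), r_ij)        # B^-1 . r_ij
    s = [math.sin(math.pi * x) ** 2 for x in frac]    # sin^2 along each lattice direction
    cart = mat_vec(lattice, s)                        # B . (...)
    return 1.0 / math.sqrt(sum(x * x for x in cart))  # inverse Euclidean norm


def sine_coulomb_matrix(lattice, numbers, positions):
    """Build the matrix for atomic numbers `numbers` at Cartesian `positions`."""
    n = len(numbers)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                # Diagonal term of the ordinary Coulomb matrix (assumed unchanged)
                matrix[i][j] = 0.5 * numbers[i] ** 2.4
            else:
                r = [positions[i][k] - positions[j][k] for k in range(3)]
                matrix[i][j] = numbers[i] * numbers[j] * sine_weight(lattice, r)
    return matrix


# Toy example: cubic cell with edge 2, atoms with Z = 11 and Z = 17 half a cell apart
cubic = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]
scm = sine_coulomb_matrix(cubic, [11, 17], [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
```

Because the weight depends on r_ij only through sin² of its fractional coordinates, adding any lattice vector to r_ij leaves it unchanged, which is the property that makes the representation suitable for periodic crystals.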

L. Ward et al. Computational Materials Science 152 (2018) 60–69

Fig. 3. Examples of plots based on a built-in data set of elastic tensors [53] and generated through the FigRecipes interface. Clockwise from top-left: a scatter matrix,
a heat map, a violin plot, and an x-y plot with color dimension that represents Poisson ratio.

Python data libraries such as pandas remains minimal. Thus, to accelerate visualization, matminer includes its own module, FigRecipes, that provides a set of pre-defined methods for creating well-formatted, common figures (Fig. 3). Plotly was selected as the backend of FigRecipes because (1) its interactivity enables the rapid identification (via Plotly “hoverinfo”) of outliers in data sets, which are frequently the most important data points in materials informatics studies, and (2) it uses a portable JSON representation of Plotly plots, which enables FigRecipes to output fine-tunable Plotly figure templates with a few lines of code. Furthermore, interactive Plotly figures can be shared easily on the web via URL, which facilitates making figures collaboratively.
The PlotlyFig class in matminer's FigRecipes module supports seven types of plots: x-y plots, scatter matrices, histograms, bar charts, heatmaps, parallel plots, and violin plots. FigRecipes also facilitates generating often-overlooked figures, such as parallel coordinate plots [84], which have been found to be useful in materials science applications as they provide a technique for representing relationships between variables in high dimensional spaces. PlotlyFig can generate several plots using the same DataFrame content, automatically determining relevant labels and legend information from DataFrame column headers. PlotlyFig can also automatically bin and transform data to be compatible with the selected plot type; for example, PlotlyFig can automatically bin data in a DataFrame to create a heatmap and can generate multiple violin plots from a DataFrame lacking an explicit 'group' column. PlotlyFig's succinct syntax and automatic conversions provide robust extensions of Plotly's plotting functionality.
PlotlyFig interfaces with several Plotly options for visualization, such as interactive offline plotting, static images, and the online Plotly interface. All figures generated with FigRecipes can be returned as a PlotlyDict object, a JSON-like dict representation of a figure that can be serialized and stored for reproducibility and sharing. This ability makes FigRecipes a useful plotting tool for creating scientific representations of data; complex data can first be easily converted into a PlotlyDict template, and this figure template can then be specifically edited to create custom-made publication-quality images.

4. Examples of using matminer

Next, we present four usage examples that showcase the capabilities of matminer. The source code for these and other examples is available as part of the matminer_examples GitHub repository (https://github.com/hackingmaterials/matminer_examples). Users can download, inspect, and execute the full code for these examples themselves and modify them for their own applications.

4.1. Retrieving data sets and visualizing them

In our first example, we use matminer's CitrineDataRetrieval tool to collect the experimental thermoelectric materials properties reported by Gaultois et al. [85] and compiled in the Citrine database. We then, with the help of FigRecipes, visualize this data in just a few lines of code. An example output is depicted in Fig. 4, in which electrical conductivity, Seebeck coefficient, thermal conductivity and the figure of merit of thermoelectric materials (zT) are visualized in a single plot. This example effectively recreates Fig. 3 of Ref. [85] but allows the user
to process the data locally, perhaps adding in their own data filtering or featurization procedure. Once the data set is loaded into a DataFrame called “df_te”, re-creating this figure can be accomplished by two Python commands (after importing PlotlyFig), as follows:

    from matminer.figrecipes.plot import PlotlyFig

    # Command 1: bind the DataFrame and set the axis scale and titles
    pf = PlotlyFig(df_te, x_scale='log',
                   x_title='Electrical Resistivity (cm/S)',
                   y_title='Seebeck Coefficient (uV/K)',
                   colorbar_title='Thermal Conductivity (W/m.K)')
    # Command 2: choose the columns used for x/y, labels, marker size, and color
    pf.xy(('Electrical resistivity', 'Seebeck coefficient'),
          labels='chemicalFormula', sizes='zT',
          colors='Thermal conductivity', color_range=[0, 5])

The first command defines the data used by the charts and names for the axes. The second command defines the data being plotted. Further details are handled automatically. For example, zT values are normalized for better visualization. In addition, because the user specified a color_range of [0, 5] for the thermal conductivity values, all thermal conductivity values equal to or greater than 5 are denoted by a bright yellow color, and a “5+” tick label is automatically added to the colorbar. Thus, FigRecipes includes both automatic and customizable options that balance speed and flexibility of visualization.

Fig. 4. Thermoelectric properties of nearly 1000 materials compiled by Gaultois et al. [85] and as retrieved and visualized with matminer. The marker size is scaled according to the figure of merit, zT.

4.2. Comparing experiment and theory data

In another example, we retrieve all the experimental band gap data available in Citrine and compare them with the calculated values available in the Materials Project [39]. Comparing data from two different sources is often complicated by the need to match records from one system to another. In this example, we need to find records in Materials Project with the same composition. As many entries in Citrination lack an associated crystal structure, we match each band gap to the ground-state structure with the same composition in Materials Project. Merging these data sources also demonstrates how combining data sources can fill in missing information from each database. Owing to the CitrineDataRetrieval class, the Materials Project API and Pandas, merging the two data sources requires only 9 lines of code (not counting imports):

    from matminer.data_retrieval.retrieve_Citrine import CitrineDataRetrieval
    from pymatgen.core.composition import Composition
    from pymatgen.ext.matproj import MPRester

    c = CitrineDataRetrieval()  # Create an adapter to the Citrine Database.
    df = c.get_dataframe(prop='band gap',
                         data_type='EXPERIMENTAL',
                         show_columns=['chemicalFormula', 'Band gap'])
    mpr = MPRester()
    def get_MP_bandgap(formula):
        # Reduce the formula to integer form before querying Materials Project
        formula = Composition(formula).get_integer_formula_and_factor()[0]
        strcs = mpr.get_data(formula)
        if strcs:
            # Use the lowest-energy (ground-state) entry for this composition
            return sorted(strcs, key=lambda e: e['energy_per_atom'])[0]['band_gap']
    df['DFT Band gap'] = df['chemicalFormula'].apply(get_MP_bandgap)

As shown in Fig. 5, most computed DFT band gaps are lower than the experimental values, which is a known drawback of DFT calculations performed using LDA or GGA functionals [86–88]. Because the comparison is performed automatically, minimal human effort is required to update the result as new experimental band gaps are added to Citrination or new calculations are performed by Materials Project. As this example shows, the tools matminer provides to automate data-driven analyses can make reproducing data-driven materials studies much simpler.

4.3. Building a machine learning model using OQMD data

To demonstrate how matminer can facilitate the process of machine learning, we recreate a machine learning model from a 2016 paper by Ward et al. [22]. In this work, the authors trained a machine learning model using data from the Open Quantum Materials Database (OQMD) [42,92] to predict the formation enthalpy of crystalline materials given their composition.
The first step is to retrieve the OQMD data used by Ward et al., which is available through the Materials Data Facility [44]. We can use matminer’s data retrieval tools to access this data directly with only three lines of code (plus an import):

    from matminer.data_retrieval import retrieve_MDF

    mdf = retrieve_MDF.MDFDataRetrieval(anonymous=True)
    query_string = ('mdf.source_name:oqmd_v3 AND '
                    '(oqmd_v3.configuration:static OR '
                    'oqmd_v3.configuration:standard) AND '
                    'dft.converged:True')
    data = mdf.get_data(query_string, unwind_arrays=False)

The next step is to process the dataset to create a suitable training set:
removing errors, duplicates, and outliers. For example, removing all entries which lack a computed formation enthalpy can be achieved in a single line of Python:

    data = data[~data['oqmd_v3.delta_e.value'].isnull()]

The third step in building a machine learning model is computing a representation. We have implemented the techniques developed by Ward et al. into matminer as Featurizer classes. These Featurizers, which operate on DataFrame objects, are also simple to run:

    from matminer.featurizers.base import MultipleFeaturizer
    from matminer.featurizers import composition as cf

    featurizer = MultipleFeaturizer([
        cf.Stoichiometry(),
        cf.ElementProperty.from_preset("magpie"),
        cf.ValenceOrbital(props=['avg']),
        cf.IonProperty()])
    featurizer.featurize_dataframe(data, col_id='composition_obj')

These two statements generate the 145 features used by Ward et al. and store them within the DataFrame object. At this point, the data are in a form that is compatible with existing machine learning libraries, such as scikit-learn or Keras. After using scikit-learn’s Random Forest implementation and cross-validation utilities, we find that our model achieves an MAE of 0.071 eV/atom in 10-fold cross-validation, which is consistent with the results reported by Ward et al. (as low as 0.088 eV/atom using a different tree-based ML method). Overall this example serves to demonstrate how matminer, combined with community-standard data analysis and machine learning libraries, facilitates the construction of machine learning models from materials data.

Fig. 5. Comparison of experimentally-measured band gap energies retrieved from the Citrine database to DFT-PBE computed electronic band gaps retrieved from the Materials Project. As expected, the data set demonstrates that computed band gaps underestimate experimental values [86–88].

4.4. Comparing crystal structure featurization methods

Another benefit of matminer is that it simplifies comparing machine learning methods. To illustrate, we used matminer to compare three methods for predicting the formation energy for a given crystal structure: the Sine Coulomb Matrix (SCM) [56], the Orbital Field Matrix (OFM) [75], and a recent modification to the OFM in development that also includes the row of each element in the periodic table in addition to the column (OFMR).
The first step in comparing the models is to gather training sets. For this task, we use the original 3938 structures selected by Faber et al. from the Materials Project (FLLA) [56] and a dataset of all 7735 stable ternary oxides in the Materials Project with unit cell size at most 30 atoms (TER_OX). Gathering the data is simple with matminer. The FLLA data set is built into matminer and the TER_OX dataset can be gathered with a single MPDataRetrieval query:

    from matminer.data_retrieval.retrieve_MP import MPDataRetrieval

    mpr = MPDataRetrieval()
    criteria = '*-*-O'
    properties = ['structure', 'nsites',
                  'formation_energy_per_atom',
                  'e_above_hull']
    df = mpr.get_dataframe(criteria=criteria,
                           properties=properties)
    df = df[df['e_above_hull'] < 0.1]
    df = df[df['nsites'] <= 30]  # unit cells with at most 30 atoms

Each of the three methods uses Kernel Ridge Regression (KRR) as the machine learning algorithm; we employ the implementation of this method from scikit-learn. scikit-learn includes a well-optimized implementation of KRR, and has a tool, GridSearchCV, for easily selecting the optimum kernel and regularization parameter for KRR [30]. We tested each method using five-fold cross-validation, and used four-fold cross-validation when optimizing hyperparameters for each fold. We tested Laplacian and RBF (radial basis function) kernels for both features, and used the r² value of the formation energy per atom predictions to score each hyperparameter set [30].
The orbital field matrix can be time consuming to calculate for a large dataset because of its size; however, the process can be accelerated by the parallelization feature of matminer. Matminer automatically runs in parallel across all available CPU cores using Python’s multiprocessing package. The following code computes the OFM representation and automatically runs in parallel:

    from matminer.featurizers.structure import OrbitalFieldMatrix

    ofm = OrbitalFieldMatrix()
    df = ofm.featurize_dataframe(df, 'structure')

The cross-validation results for the FLLA and TER_OX datasets are presented in Table 2. We find very close agreement between the Mean Absolute Error (MAE) reported by Faber et al. for the SCM (0.37 eV/atom) and our result with matminer of 0.387 eV/atom, despite minor differences in the cross-validation procedure [56]. This demonstrates that we are able to reproduce the methodology of a published machine learning paper and compare it with a new featurization method (OFMR) with very little effort.
Our results indicate that for both data sets, the OFMR outperforms the OFM featurizer, which in turn outperforms the SCM (Table 2). All methods perform better on the TER_OX dataset than the FLLA dataset, demonstrating that the specific data set influences both absolute and relative model performance. Featurization and evaluation of the OFM and OFMR take much longer than for the SCM because of the size of the
descriptors, which may result in a time-accuracy tradeoff in some applications. We also note that Faber et al. have been developing updated structure representations [89] that in the future might be further compared to the current results. Being able to probe the applicability of different featurization methods for different data sets is significantly simplified by the ability to easily swap out different machine learning methods and datasets within a machine learning pipeline. This allows for rapid testing of new methods against various data sets.

Table 2
Performance (in terms of both accuracy and time needed to featurize) of several machine learning methods on two different datasets: the FLLA [56] and TER_OX datasets. We compare the Sine Coulomb Matrix (SCM) [56], Orbital Field Matrix (OFM) [75], and Orbital Field Matrix + row in periodic table (OFMR). The performance scores are for each model in 5-fold cross-validation. Each model was run on 24 processor cores (2.3 GHz) on a system with 64 GB of RAM.

Dataset   Descriptor   MAE (eV/atom)   RMSE (eV/atom)   r²      Featurize time (s)   Cross-validation time (h:mm:ss)
FLLA      SCM          0.387           0.575            0.708   2.0                  0:07:42
          OFM          0.229           0.346            0.894   138                  0:50:40
          OFMR         0.171           0.277            0.932   138                  1:20:14
TER_OX    SCM          0.123           0.220            0.917   5.0                  0:30:16
          OFM          0.090           0.140            0.967   366                  4:30:16
          OFMR         0.059           0.100            0.983   363                  7:06:42

5. Conclusion

Performing materials informatics requires developing a data pipeline that encompasses data retrieval, feature extraction, and visualization prior to the actual machine learning step. The matminer software described in this manuscript is designed to facilitate the development, reuse, and reproducibility of data pipelines for materials informatics applications. We have designed matminer to connect the domain-specific aspects of materials informatics (i.e., materials data extraction, feature extraction of materials science concepts, common plotting routines) with the professional level machine learning and data processing software already developed and in use by the Python community. It is our hope that matminer can serve as a community repository for new materials data analytics techniques as they become available such that researchers can rapidly develop and test new methods against standard techniques, accelerating the use of data mining in the materials community at large.

Acknowledgements

This code was intellectually led and primarily developed using funding provided by U.S. Department of Energy, Office of Basic Energy Sciences, Early Career Research Program, which funded the efforts of AJ, AD, AF, SB, and QW. LW and IF were supported by financial assistance award 70NANB14H012 from U.S. Department of Commerce, National Institute of Standards and Technology as part of the Center for Hierarchical Material Design (CHiMaD), by the National Science Foundation as part of the Midwest Big Data Hub under NSF Award Number: 1636950 “BD Spokes: SPOKE: MIDWEST: Collaborative: Integrative Materials Design (IMaD): Leverage, Innovate, and Disseminate,” and by the Department of Energy contract DE-AC02-06CH11357. NER, JM, MA, and KAP were funded by the U.S. Department of Energy, Office of Science, Office of Basic Energy Sciences, Materials Sciences and Engineering Division under Contract No. DE-AC02-05-CH11231: Materials Project program KC23MP. JC and KC were supported by NSF, United States grant 1541450 (CC*DNI DIBBS: Merging Science and Cyberinfrastructure Pathways: The Whole Tale). KWB acknowledges the University of California, Berkeley College of Chemistry for a summer research stipend. MD and GJS were funded by NSF DMR program Grant nos. 1334713 and 1333335.
This research used the Savio computational cluster resource provided by the Berkeley Research Computing program at the University of California, Berkeley (supported by the UC Berkeley Chancellor, Vice Chancellor for Research, and Chief Information Officer). This research used resources of the National Energy Research Scientific Computing Center, a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.
We thank all those in the materials community who have contributed code commits to matminer, including Ashwin Aggarwal, Evgeny Blokhin, Jason Frost, Matthew Horton, Kiran Mathew, Shyue Ping Ong, Sayan Rowchowdhury, and Donny Winston.

Appendix A. Supplementary material

Supplementary data associated with this article can be found, in the online version, at http://dx.doi.org/10.1016/j.commatsci.2018.05.018.

References

[1] H. Chen, G. Hautier, A. Jain, C. Moore, B. Kang, R. Doe, L. Wu, Y. Zhu, Y. Tang, G. Ceder, Chem. Mater. 24 (2012) 2009.
[2] M. Aykol, S. Kim, V.I. Hegde, D. Snydacker, Z. Lu, S. Hao, S. Kirklin, D. Morgan, C. Wolverton, Nat. Commun. 7 (2016) 13779.
[3] C. Nyshadham, C. Oses, J.E. Hansen, I. Takeuchi, S. Curtarolo, G.L.W. Hart, Acta Mater. 122 (2017) 438.
[4] S. Kirklin, J.E. Saal, V.I. Hegde, C. Wolverton, Acta Mater. 102 (2016) 125.
[5] A. Jain, K.A. Persson, G. Ceder, APL Mater. 4 (2016) 53102.
[6] L. Ward, R. Liu, A. Krishna, V.I. Hegde, A. Agrawal, A. Choudhary, C. Wolverton, Phys. Rev. B 96 (2017) 24104.
[7] M. Rupp, A. Tkatchenko, K.-R. Müller, O.A. von Lilienfeld, Phys. Rev. Lett. 108 (2012) 58301.
[8] J. Carrete, W. Li, N. Mingo, S. Wang, S. Curtarolo, Phys. Rev. X 4 (2014) 11019.
[9] L. Ward, C. Wolverton, Curr. Opin. Solid State Mater. Sci. 21 (2017) 167.
[10] J.C. Mauro, A. Tandia, K.D. Vargheese, Y.Z. Mauro, M.M. Smedskjaer, Chem. Mater. 28 (2016) 4267.
[11] E.W. Bucholz, C.S. Kong, K.R. Marchman, W.G. Sawyer, S.R. Phillpot, S.B. Sinnott, K. Rajan, Tribol. Lett. 47 (2012) 211.
[12] T.D. Sparks, M.W. Gaultois, A. Oliynyk, J. Brgoch, B. Meredig, Scr. Mater. 111 (2015) 10.
[13] R. Yuan, Z. Liu, P.V. Balachandran, D. Xue, Y. Zhou, X. Ding, J. Sun, D. Xue, T. Lookman, Adv. Mater. 1702884 (2018) 1702884.
[14] A. Mannodi-Kanakkithodi, A. Chandrasekaran, C. Kim, T.D. Huan, G. Pilania, V. Botu, R. Ramprasad, Mater. Today (2017), http://dx.doi.org/10.1016/j.mattod.2017.11.021.
[15] F.A. Faber, A. Lindmaa, O.A. von Lilienfeld, R. Armiento, Phys. Rev. Lett. 117 (2016) 135502.
[16] F. Ren, L. Ward, T. Williams, K.J. Laws, C. Wolverton, J. Hattrick-Simpers, A. Mehta, Sci. Adv. 4 (2018) eaaq1566.
[17] A. Seko, H. Hayashi, K. Nakayama, A. Takahashi, I. Tanaka, Phys. Rev. B 95 (2017) 144110.
[18] R. Ramprasad, R. Batra, G. Pilania, A. Mannodi-Kanakkithodi, C. Kim, Npj Comput. Mater. 3 (2017) 54.
[19] S.R. Kalidindi, ISRN Mater Sci. 2012 (2012) 1.
[20] J. Hill, G. Mulholland, K. Persson, R. Seshadri, C. Wolverton, B. Meredig, MRS Bull. 41 (2016) 399.
[21] W. McKinney, Proc. 9th Python Sci. Conf. 1697900 (2010) 51.
[22] L. Ward, A. Agrawal, A. Choudhary, C. Wolverton, Npj Comput. Mater. 2 (2016) 16028.
[23] http://bitbucket.org/wolverton/magpie.
[24] W. Daniel, B. David, F. Tony, K. Surya, R. Andrew, PyMKS: Materials Knowledge System in Python, 2014. doi: 10.6084/m9.figshare.1015761.
[25] Z. Wu, B. Ramsundar, E.N. Feinberg, J. Gomes, C. Geniesse, A.S. Pappu, K. Leswing, V. Pande, Chem. Sci. 9 (2018) 513.
[26] https://github.com/deepchem/deepchem.
[27] E. Gossett, C. Toher, C. Oses, O. Isayev, F. Legrain, F. Rose, E. Zurek, J. Carrete, N. Mingo, A. Tropsha, S. Curtarolo (2017) arXiv:1711.10744v1.
[28] A. Khorshidi, A.A. Peterson, Comput. Phys. Commun. 207 (2016) 310.
[29] https://github.com/libAtoms/QUIP.
[30] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, J. Mach. Learn. Res. 12 (2011) 2825.
[31] F. Perez, B.E. Granger, Comput. Sci. Eng. 9 (2007) 21.
[32] K.J. Millman, M. Aivazis, Comput. Sci. Eng. 13 (2011) 9.
[33] S. van der Walt, S.C. Colbert, G. Varoquaux, Comput. Sci. Eng. 13 (2011) 22.
[34] https://github.com/keras-team/keras.
[35] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G.S. Corrado, A.
Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mane, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viegas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, X. Zheng, 2015. <https://www.tensorflow.org/>.
[36] S.P. Ong, W.D. Richards, A. Jain, G. Hautier, M. Kocher, S. Cholia, D. Gunter, V.L. Chevrier, K.A. Persson, G. Ceder, Comput. Mater. Sci. 68 (2013) 314.
[37] A. Frantzen, J. Scheidtmann, G. Frenzer, W.F. Maier, J. Jockel, T. Brinz, D. Sanders, U. Simon, Angew. Chemie Int. Ed. 43 (2004) 752.
[38] Y. Xu, M. Yamazaki, P. Villars, Jpn. J. Appl. Phys 50 (2011) 11RH02.
[39] A. Jain, S.P. Ong, G. Hautier, W. Chen, W.D. Richards, S. Dacek, S. Cholia, D. Gunter, D. Skinner, G. Ceder, K.A. Persson, APL Mater. 1 (2013) 11002.
[40] https://citrination.com.
[41] S. Curtarolo, W. Setyawan, S. Wang, J. Xue, K. Yang, R.H. Taylor, L.J. Nelson, G.L.W. Hart, S. Sanvito, M. Buongiorno-Nardelli, N. Mingo, O. Levy, Comput. Mater. Sci. 58 (2012) 227.
[42] J.E. Saal, S. Kirklin, M. Aykol, B. Meredig, C. Wolverton, JOM 65 (2013) 1501.
[43] J. O’Mara, B. Meredig, K. Michel, JOM 68 (2016) 2031.
[44] B. Blaiszik, K. Chard, J. Pruyne, R. Ananthakrishnan, S. Tuecke, I. Foster, JOM 68 (2016) 2045.
[45] https://mpds.io/.
[46] https://www.mongodb.com/.
[47] K. Michel, B. Meredig, MRS Bull. 41 (2016) 617.
[48] P. Hohenberg, W. Kohn, Phys. Rev. 136 (1964) B864.
[49] L.O. Wagner, T.E. Baker, E.M. Stoudenmire, K. Burke, S.R. White, Phys. Rev. B 90 (2014) 45109.
[50] S.P. Ong, S. Cholia, A. Jain, M. Brafman, D. Gunter, G. Ceder, K.A. Persson, Comput. Mater. Sci. 97 (2015) 209.
[51] https://github.com/materials-data-facility/forge.
[52] K. Mathew, J.H. Montoya, A. Faghaninia, S. Dwarakanath, M. Aykol, H. Tang, I. Chu, T. Smidt, B. Bocklund, M. Horton, J. Dagdelen, B. Wood, Z.-K. Liu, J. Neaton, S.P. Ong, K. Persson, A. Jain, Comput. Mater. Sci. 139 (2017) 140.
[53] M. De Jong, W. Chen, T. Angsten, A. Jain, R. Notestine, A. Gamst, M. Sluiter, C.K. Ande, S. Van Der Zwaag, J.J. Plata, C. Toher, S. Curtarolo, G. Ceder, K. a Persson, M. Asta, Sci. Data (2015) 1.
[54] M. de Jong, W. Chen, H. Geerlings, M. Asta, K.A. Persson, Sci. Data 2 (2015) 150053.
[55] I. Petousis, W. Chen, G. Hautier, T. Graf, T.D. Schladt, K.A. Persson, F.B. Prinz, Phys. Rev. B 93 (2016) 115151.
[56] F. Faber, A. Lindmaa, O.A. von Lilienfeld, R. Armiento, Int. J. Quantum Chem. 115 (2015) 1094.
[57] T. Fast, S.R. Kalidindi, Acta Mater. 59 (2011) 4595.
[58] K.T. Schütt, H. Glawe, F. Brockherde, A. Sanna, K.R. Müller, E.K.U. Gross, Phys. Rev. B 89 (2014) 205118.
[59] A. Seko, A. Takahashi, I. Tanaka, Phys. Rev. B 90 (2014) 24101.
[60] O. Isayev, C. Oses, C. Toher, E. Gossett, S. Curtarolo, A. Tropsha, Nat. Commun. 8 (2017) 15679.
[61] K.T. Schütt, F. Arbabzadah, S. Chmiela, K.R. Mu, A. Tkatchenko, Nat. Commun. 8 (2017) 13890.
[62] L.M. Ghiringhelli, J. Vybiral, S.V. Levchenko, C. Draxl, M. Scheffler, Phys. Rev. Lett. 114 (2015) 105503.
[63] A. Meurer, C.P. Smith, M. Paprocki, O. Čertík, S.B. Kirpichev, M. Rocklin, Am. Kumar, S. Ivanov, J.K. Moore, S. Singh, T. Rathnayake, S. Vig, B.E. Granger, R.P. Muller, F. Bonazzi, H. Gupta, S. Vats, F. Johansson, F. Pedregosa, M.J. Curry, A.R. Terrel, Š. Roučka, A. Saboo, I. Fernando, S. Kulal, R. Cimrman, A. Scopatz, PeerJ Comput. Sci. 3 (2017) e103.
[64] S. Kotochigova, Z.H. Levine, E.L. Shirley, M.D. Stiles, C.W. Clark, Phys. Rev. A 55 (1997) 191.
[65] M.A. Butler, J. Electrochem. Soc. 125 (1978) 228.
[66] A.M. Deml, R.O. Hayre, C. Wolverton, V. Stevanovic, Phys. Rev. B 93 (2016) 85142.
[67] C. Kittel, Introduction to Solid State Physics, 8th ed., Wiley, 2005.
[68] F.R. de Boer, Cohesion in Metals: Transition Metal Alloys, North-Holland, Amsterdam, 1988.
[69] R.F. Zhang, S.H. Zhang, Z.J. He, J. Jing, S.H. Sheng, Comput. Phys. Commun. 209 (2016) 58.
[70] L.J. Gallego, J.A. Somoza, J.A. Alonso, J. Phys. Condens. Matter 2 (1990) 6245.
[71] K. Hansen, F. Biegler, R. Ramakrishnan, W. Pronobis, O.A. Von Lilienfeld, K.-R.R. Müller, A. Tkatchenko, J. Phys. Chem. Lett. 6 (2015) 2326.
[72] E.L. Willighagen, R. Wehrens, P. Verwer, R. de Gelder, L.M.C. Buydens, Acta Crystallogr. Sect. B Struct. Sci. 61 (2005) 29.
[73] P.P. Ewald, Ann. Phys. 369 (1921) 253.
[74] N.E.R. Zimmermann, M.K. Horton, A. Jain, M. Haranczyk, Front. Mater. 4 (2017) 1.
[75] T. Lam Pham, H. Kino, K. Terakura, T. Miyake, K. Tsuda, I. Takigawa, H. Chi Dam, Sci. Technol. Adv. Mater. 18 (2017) 756.
[76] A. Schleife, F. Fuchs, C. Rödl, J. Furthmüller, F. Bechstedt, Appl. Phys. Lett. 94 (2009) 12104.
[77] V. Botu, R. Ramprasad, Phys. Rev. B 92 (2015) 94306.
[78] D. Waroquiers, X. Gonze, G.-M. Rignanese, C. Welker-Nieuwoudt, F. Rosowski, M. Göbel, S. Schenk, P. Degelmann, R. André, R. Glaum, G. Hautier, Chem. Mater. 29 (2017) 8346.
[79] A. Okabe, B. Boots, K. Sugihara, S.N. Chiu, Spatial Tesselations, 2009.
[80] J. Behler, J. Chem. Phys. 134 (2011) 74106.
[81] J.D. Hunter, Comput. Sci. Eng. 9 (2007) 90.
[82] M. Waskom, O. Botvinnik, D. O’Kane, P. Hobson, S. Lukauskas, D.C. Gemperline, T. Augspurger, Y. Halchenko, J.B. Cole, J. Warmenhoven, J. de Ruiter, C. Pye, S. Hoyer, J. Vanderplas, S. Villalba, G. Kunter, E. Quintero, P. Bachant, M. Martin, K. Meyer, A. Miles, Y. Ram, T. Yarkoni, M.L. Williams, C. Evans, C. Fitzgerald, Brian, C. Fonnesbeck, A. Lee, A. Qalieh, 2017. doi: 10.5281/ZENODO.883859.
[83] https://plot.ly/.
[84] J.M. Rickman, Npj Comput. Mater. 4 (2018) 5.
[85] M.W. Gaultois, T.D. Sparks, C.K.H. Borg, R. Seshadri, W.D. Bonificio, D.R. Clarke, Chem. Mater. 25 (2013) 2911.
[86] J.P. Perdew, M. Levy, Phys. Rev. Lett. 51 (1983) 1884.
[87] L.J. Sham, M. Schlüter, Phys. Rev. Lett. 51 (1983) 1888.
[88] M.K.Y. Chan, G. Ceder, Phys. Rev. Lett. 105 (2010) 196403.
[89] F.A. Faber, A.S. Christensen, B. Huang, O.A. von Lilienfeld, J. Chem. Phys. 148 (2018) 241717.
[90] K.J. Laws, D.B. Miracle, M. Ferry, A predictive structural model for bulk metallic glasses, Nat. Commun. 6 (2015) 8123.
[91] X. Yang, Y. Zhang, Prediction of high-entropy stabilized solid-solution in multi-component alloys, Mater. Chem. Phys. 132 (2012) 233–238.
[92] S. Kirklin, J.E. Saal, B. Meredig, A. Thompson, J.W. Doak, M. Aykol, et al., The Open Quantum Materials Database (OQMD): assessing the accuracy of DFT formation energies, Npj Comput. Mater. 1 (2015) 15010, http://dx.doi.org/10.1038/npjcompumats.2015.10.
