
Computational Materials Science 152 (2018) 60–69

Contents lists available at ScienceDirect

Computational Materials Science

journal homepage: www.elsevier.com/locate/commatsci

Matminer: An open source toolkit for materials data mining

Logan Ward (a,b), Alexander Dunn (c,d), Alireza Faghaninia (c), Nils E.R. Zimmermann (c), Saurabh Bajaj (c,e),
Qi Wang (c), Joseph Montoya (c), Jiming Chen (f), Kyle Bystrom (d), Maxwell Dylla (g), Kyle Chard (a,b),
Mark Asta (d), Kristin A. Persson (c), G. Jeffrey Snyder (g), Ian Foster (a,b), Anubhav Jain (c,⁎)

(a) Computation Institute, University of Chicago, Chicago, IL 60637, United States
(b) Data Science and Learning Division, Argonne National Laboratory, Argonne, IL 60439, United States
(c) Lawrence Berkeley National Laboratory, Energy Technologies Area, 1 Cyclotron Road, Berkeley, CA 94720, United States
(d) Department of Materials Science and Engineering, University of California, Berkeley, CA 94720, United States
(e) Citrine Informatics, Redwood City, CA 94063, United States
(f) Department of Chemical Engineering, University of Illinois, Urbana, IL 61801, United States
(g) Department of Materials Science and Engineering, Northwestern University, Evanston, IL 60208, United States

ARTICLE INFO

Keywords: Data mining; Open source software; Machine learning; Materials informatics

ABSTRACT

As materials data sets grow in size and scope, the role of data mining and statistical learning methods to analyze these materials data sets and build predictive models is becoming more important. This manuscript introduces matminer, an open-source, Python-based software platform to facilitate data-driven methods of analyzing and predicting materials properties. Matminer provides modules for retrieving large data sets from external databases such as the Materials Project, Citrination, Materials Data Facility, and Materials Platform for Data Science. It also provides implementations for an extensive library of feature extraction routines developed by the materials community, with 47 featurization classes that can generate thousands of individual descriptors and combine them into mathematical functions. Finally, matminer provides a visualization module for producing interactive, shareable plots. These functions are designed in a way that integrates closely with machine learning and data analysis packages already developed and in use by the Python data science community. We explain the structure and logic of matminer, provide a description of its various modules, and showcase several examples of how matminer can be used to collect data, reproduce data mining studies reported in the literature, and test new methodologies.

1. Introduction

Recently, the materials community has placed a renewed emphasis on collecting and organizing large data sets for research, materials design, and the eventual application of statistical or “machine learning” techniques. For example, the mining of databases comprised of density functional theory (DFT) calculations has been used to identify materials for batteries [1,2], to aid the design of metal alloys [3,4], and for many other applications [5]. Importantly, such data sets present new opportunities to develop predictive models through machine learning techniques: rather than designing and programming such models manually, such techniques produce predictive models by learning from a body of examples. Machine learning models have been demonstrated to predict properties of crystalline materials much faster than DFT [6–9], estimate properties that are difficult to access via other computational tools [10,11], and guide the search for new materials [12–16]. With the continued development of general-purpose data mining methods for many types of materials data [17–19] and the proliferation of material property databases [20], this emerging field of “materials informatics” is positioned to have a continued impact on materials design.

In this paper, we describe a new software library, “matminer”, for applying data-driven techniques to the materials domain. The main roles of matminer are depicted in Fig. 1: matminer assists the user in retrieving large data sets from common databases, extracts features to transform the raw data into representations suitable for machine learning, and produces interactive visualizations of the data for exploratory analysis. We note that matminer does not itself implement common machine learning algorithms; industry-standard tools (e.g., scikit-learn or Keras) are already developed and maintained by the larger data science community for this purpose. Instead, matminer's role is to connect these advanced machine learning tools to the materials domain.

⁎ Corresponding author at: Lawrence Berkeley National Laboratory, Energy Technologies Area, 1 Cyclotron Road, Berkeley, CA 94720, United States.
E-mail addresses: [email protected] (L. Ward), [email protected] (A. Jain).

https://doi.org/10.1016/j.commatsci.2018.05.018
Received 16 April 2018; Accepted 7 May 2018
Available online 25 May 2018
0927-0256/ © 2018 Elsevier B.V. All rights reserved.

Fig. 1. Overview of the capabilities of matminer. Matminer aids the user in constructing a data pipeline for materials informatics and is composed of three main
components: (1) tools for retrieving data from a variety of materials databases, (2) tools for extracting features (or descriptors) from materials data, and (3) re-useable
and customizable recipes for visualizing materials data. Data is retrieved and processed in a way that makes it simple to integrate matminer with external machine
learning libraries such as scikit-learn and Keras.
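The data flow summarized in Fig. 1 can be sketched with plain Python stand-ins. This is a toy illustration under stated assumptions, not matminer code: the function names retrieve_records, featurize, and build_feature_table are hypothetical, the records are hard-coded rather than retrieved from a database, and the only real data are tabulated Pauling electronegativities. It shows rows of records gaining descriptor columns and then being reduced to a numeric matrix that any machine learning library could consume.

```python
from statistics import mean

# Pauling electronegativities for a few elements (standard tabulated values).
ELECTRONEGATIVITY = {"Li": 0.98, "Na": 0.93, "Mg": 1.31, "O": 3.44, "F": 3.98, "Cl": 3.16}


def retrieve_records():
    """Stage 1 stand-in: a real retrieval tool would query a database here."""
    return [
        {"formula": "NaCl", "composition": {"Na": 1, "Cl": 1}},
        {"formula": "MgO", "composition": {"Mg": 1, "O": 1}},
        {"formula": "Li2O", "composition": {"Li": 2, "O": 1}},
    ]


def featurize(record):
    """Stage 2 stand-in: turn a composition into numeric descriptors.

    The statistics are taken over the distinct elements (unweighted by amount).
    """
    chi = [ELECTRONEGATIVITY[element] for element in record["composition"]]
    return {"mean_chi": mean(chi), "chi_range": max(chi) - min(chi)}


def build_feature_table(records):
    """Append the descriptor columns to every row of the table."""
    return [{**record, **featurize(record)} for record in records]


table = build_feature_table(retrieve_records())
feature_names = ["mean_chi", "chi_range"]
# Numeric feature matrix, one row per material, ready for an external ML library.
X = [[row[name] for name in feature_names] for row in table]
```

Keeping each material as one row and each descriptor as one named column mirrors the tabular layout that the rest of the paper builds on.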

Matminer solves many problems encountered when conducting data-driven research. For example, learning the Application Programming Interface (API) for each data source and preprocessing retrieved data adds significant complexity to the task of building new machine learning models. Matminer provides a simplified interface that abstracts the details of these API interactions, making it easy for the user to query and organize large data sets into the standard pandas [21] data format used by the Python data science community. Furthermore, as discussed later in the text, matminer implements a suite of 47 distinct feature extraction modules capable of producing thousands of physically relevant descriptors that can be leveraged by machine learning algorithms to more efficiently determine input-output relationships. Although many such feature extraction methods are reported in the literature, many lack an open source implementation. Matminer not only implements these domain-specific feature extraction methods but provides a unified interface for their use, making it trivial to reproduce or compare (and, eventually, extend) these methods. Finally, matminer contains many pre-defined recipes of visualizations for exploring and discovering different data relationships. In aggregate, these features allow for cutting edge materials informatics research to be conducted with a high-level, easy-to-use interface.

We note that prior efforts have produced software for computing features for materials (e.g., Magpie [22,23], pyMKS [24]), building deep learning models of molecular materials (e.g., deepchem [25,26]), providing turnkey machine learning estimates of various properties, or integrating machine learning with other software [27–29]. In contrast to these prior efforts (which have their own intended applications and scope), matminer is designed to interact and integrate with standard Python data mining tools such as pandas and scikit-learn [30], implements a library of feature generation methods (“featurizers”) for a wide variety of materials science entities (e.g., compositions, crystal structures, and electronic structures), and includes tools to assist with data retrieval and visualization.

The source code for the version of matminer described in this manuscript (version 0.3.2) and examples of its use are available as supplementary information. Updated versions are regularly published to the Python Package Index (https://pypi.python.org/pypi/matminer). The actively developed version of matminer is available on GitHub at https://github.com/hackingmaterials/matminer. Matminer also includes a dedicated repository of examples and tutorials (many in an interactive, runnable Jupyter notebook format [31]) for using the data retrieval, featurization, and visualization tools, located at https://github.com/hackingmaterials/matminer_examples. Full documentation for matminer is also available from https://hackingmaterials.github.io/matminer/. The matminer code currently contains 109 unit tests to ensure the integrity of the code, which are run automatically with each code commit through a continuous integration process. A help forum for matminer is available at: https://groups.google.com/forum/#!forum/matminer.

2. Software architecture and design principles

A guiding principle of matminer is to integrate domain-specific knowledge and data about materials into the larger ecosystem of Python data analysis software. The Python community has developed a rich suite of interoperable tools for data science, which are broadly used across the data science community and occasionally known as the “PyData” or “SciPy” stacks [32]. These libraries include NumPy and SciPy [33], which provide a suite of high-performance numerical methods, and Jupyter [31], which facilitates interactive data analysis. Matminer is designed to allow users to leverage these professional-level data science libraries for materials science studies.

A central tool in the PyData stack is the pandas DataFrame, which is a tabular representation of data similar to (but more powerful than) a virtual spreadsheet [21]. Pandas makes it possible, for example, to load a data set and perform many common data post-processing procedures, such as filtering, grouping, joining, computing rolling averages, and producing descriptive statistics. Additionally, data formatted into a pandas DataFrame can be easily used with other Python data analysis libraries, such as scikit-learn, NumPy, and matplotlib. DataFrames can also be visualized as interactive tables within Jupyter notebooks. They can also be serialized into multiple formats to allow them to be archived and shared. Because of all the benefits and features that are achieved by transforming data into the DataFrame format, matminer's data retrieval API automatically formats data that it retrieves from external sources into this format. Data retrieved through matminer is thus immediately ready for a wide variety of tasks, including data cleaning, data exploration, data transformations, data visualization, and machine learning. As described in later sections, all data extraction, featurization, and visualization tools in matminer can generate or operate on


pandas DataFrame objects.

Matminer is also designed to integrate closely with the scikit-learn machine learning library [30]. Scikit-learn is the de facto standard machine learning library for Python. In addition to its rich suite of machine learning algorithms, scikit-learn contains utilities useful for all aspects of the machine learning process (e.g., data preprocessing, model selection, hyperparameter tuning). Other machine learning libraries, such as Keras [34] and TensorFlow [35], also provide scikit-learn-compatible wrappers for their models, which further motivates the importance of making matminer easily compatible with scikit-learn. Matminer achieves integration with scikit-learn in two ways. First, the pandas DataFrame objects produced by matminer are tightly integrated with scikit-learn through the interoperability built in to the PyData stack. Second, the feature extraction methods implemented by matminer follow the same model as (and, more formally, subclass) scikit-learn's preprocessing methods. This allows matminer feature extraction methods to be used with scikit-learn's Pipeline functionality and makes it easy to combine data processing methods present in the two libraries.

Matminer also heavily leverages the pymatgen [36] materials science library. Matminer's use of the pymatgen library makes it unnecessary to recreate complex or materials-science-specific algorithms (e.g., space group determination) when implementing new feature extraction methods. Overall, the software architecture of matminer is designed to bridge the gap between the professional-level data science tools developed by the Python community and the tools, techniques, and data specific to the materials domain.

3. Components of matminer

We now describe the main functions of matminer. We describe each of the three major components (data retrieval, featurization, and visualization) separately.

3.1. Data retrieval

The first step in data mining is to obtain a data set that is ideally large and diverse. There are several efforts underway in the materials community to build such databases of materials properties [37–44]. However, while the proliferation of databases is a great benefit to materials informatics, the use of these data sources is complicated by the fact that each database implements a different API, authentication method, and schema. One core function of matminer is to provide a consistent API around different databases and return the data in a form that is suitable for use in data mining tools.

At the time of writing, matminer supports data retrieval from four commonly used materials databases: Citrination [40,43], Materials Project (MP) [39], Materials Data Facility (MDF) [44], and Materials Platform for Data Science (MPDS) [45]. In addition, a generic MongoDB interface supports data retrieval from any MongoDB resource [46]. Below, we describe these data retrieval tools in detail:

(i) Citrination, developed by Citrine Informatics [40], is a centralized database that contains a variety of materials data, including experimental measurements and computational results, all in a common data schema, the “pif” [47]. The matminer data retrieval tool uses Citrine’s citrination-client library to retrieve data from Citrination, and then converts the data from the hierarchical pif format to a tabular DataFrame format. In the process of converting the pif records, matminer retrieves all details describing a material (e.g., composition), its known properties, and how these properties were determined.
(ii) The Materials Project (MP) [39] primarily contains DFT [48,49] computed properties for over 60,000 compounds. In a similar fashion to the Citrination data extractor, matminer uses the existing MP API [50] (as implemented in the “MPRester” class of the Python Materials Genomics (pymatgen) library [36]) to query the database. MPDataRetrieval allows users to access a wide variety of properties of crystalline materials, including their crystal structures, electronic band structure, phonon dispersion, and piezoelectric, dielectric, and elastic constants.
(iii) The Materials Data Facility (MDF) is geared towards enabling researchers to publish their own data sets across a wide array of data types and materials subdisciplines. Matminer contains an MDFDataRetrieval class that uses the MDF's own Forge library [51] to perform the bulk of the search function but assists the user in formatting the final data to a standardized pandas DataFrame object.
(iv) The Materials Platform for Data Science (MPDS) [45] is a commercial database that includes phase diagram data (∼60,000 entries), crystal structure data (∼400,000 entries), and materials property values (∼800,000 entries). The MPDSDataRetrieval class in matminer can retrieve and format information from this database.
(v) MongoDB is a popular tool in the data mining community due to its efficient and flexible data model [46]. For example, data generated through the atomate [52] computational suite is stored in such databases. The “MongoDataRetrieval” class of matminer converts MongoDB documents to rows of a pandas DataFrame.

All database tools are consistent in that they (i) contain a “get_dataframe” method that makes a query to the database and (ii) return the data in a pandas DataFrame object. The “get_dataframe” method for each source takes query instructions in a simple, standard format. We also provide the ability to run queries in the language specific to each source. In doing so, we provide a novice-friendly route for using new data sources while maintaining the ability for experts to access all features of a familiar data source. However, matminer does standardize the output such that data mining tools written for one database can be easily applied to another. One benefit of the uniformity of the APIs and output formats provided by matminer is that these features make it easy to combine data from multiple sources. The data merging tools built into the pandas DataFrame object facilitate this procedure. For example, it is straightforward to retrieve experimental band gap energies from Citrination and then easily compare those values with computed band gap energies from Materials Project or the OQMD (this specific example is described in detail in Section 4.2).

Matminer also contains several built-in datasets that can be loaded directly with a single line of Python and do not require external database calls or setting any options. These built-in datasets include: 1181 DFT-based elastic tensors [53], 941 DFT-based piezoelectric tensors [54], 1056 DFT-based dielectric constants [55], and 3938 DFT-based formation energies [39,56]. The built-in data sets make it simple to begin testing and developing data mining methods.

Finally, a user can load their own data set using the built-in tools of the pandas library, which can load data from CSV, Excel, or various other formats. This process can be conducted independently of matminer but the final data format will be compatible with the subsequent data featurization tools of matminer.

3.2. Data featurization: Transforming materials-related quantities into physically relevant descriptors

Typically, machine learning employs an intermediate step between compiling raw data and applying a machine learning algorithm. This step converts data from a raw format (often specialized for parsing by a particular software package or formatted for human readability) into a numerical representation that is useful for visualization or machine learning software. This process is called “feature extraction”, “featurization”, or generating “descriptors”. Featurization transforms or augments the raw data (which might have a very complicated and difficult-to-learn relationship between inputs and outputs) into a set of physically relevant quantities that reflect the relationships between the input


and output variables. The feature extraction step is one of the main ways in which one can exploit domain knowledge to vastly improve the performance of a machine learning algorithm. For example, common features that are extracted from a chemical composition include the differences in electronegativities of the component elements or the sum of atomic radii of the various elements.

Many generalizable featurization approaches have been proposed in the literature for different types of materials data [18,22,25,56–61]. However, the software required to use them is often unavailable, not open-source, or distributed across many repositories. The lack of published software means that employing these methods in practice requires a significant time investment. Through matminer, we make these community developments in machine learning available to the community by providing open-source implementations of various featurization methods. Furthermore, despite the diversity of methodologies, matminer provides a uniform interface to all featurizers, freeing researchers to rapidly iterate through different approaches and determine the method best suited to their application.

All featurizer classes in matminer follow a common code-design pattern by inheriting from a base class, BaseFeaturizer, which defines the template for all featurization classes. BaseFeaturizer prescribes the four methods that must be implemented by each new featurizer:

1. The “featurize” method does the core work. It transforms materials data (e.g., a composition) into the desired feature values (e.g., element properties such as atomic weight, atomic radii, and Mendeleev number).
2. The “feature_labels” method provides descriptive labels that correspond to the feature values computed in the “featurize” method. These feature_labels can be thought of as column labels for the various features (and are indeed used as column labels when featurizing an entire DataFrame).
3. The “citations” method returns a list of BibTeX-formatted references that a user should read to fully understand the features and cite if they are used. The citations method thus provides background and context for the featurizers and appropriate attribution to the original developers of the methodology.
4. The “implementors” method provides the name of the person(s) who implemented and are responsible for maintaining the featurizer. This is useful if one has a question, comment, or suggestion regarding the specific implementation details of a featurization method.

BaseFeaturizer provides additional functions that a user can call once these four methods are implemented. For example, the “featurize_dataframe” method uses the “featurize” and “feature_labels” operations to add the features to an entire pandas DataFrame. That is, featurize_dataframe will process potentially thousands or millions of rows of data, exploiting Python's multiprocessing functionality to parallelize over available cores. The BaseFeaturizer class also follows the pattern used by featurizers in the scikit-learn machine learning library, which allows matminer featurization classes to be integrated easily with existing scikit-learn tools. For example, one can build a data processing pipeline that mixes some of the data normalization tools present in scikit-learn with the materials-specific features implemented in matminer.

Matminer contains, at the time of writing, a total of 47 featurizers that support the generation of features for diverse types of materials data. Each of these featurizers can produce many individual features/descriptors, such that it is possible to generate thousands of total features with the matminer code. For example, the ElementProperty featurizer will convert a chemical composition into various summary statistics of the properties of that composition's component elements (e.g., average ionic radius or standard deviation of elemental melting points). The BandFeaturizer will convert a complex electronic band structure into quantities such as band gap and the norm of k-point coordinates at which the conduction band minimum and valence band maximum occur.

We have grouped the featurizers into five different Python modules based on the input data type: (i) composition, (ii) (crystal) structure, (iii) density of (electronic) states, (iv) band structure, and (v) (atomic) site. The featurizers available in matminer in each module are presented in Fig. 2. In Table 1, we briefly describe each featurizer and provide the canonical reference(s). The complete source code for each featurizer is available in matminer such that users can employ, fully inspect, and modify the implementations of these methods.

Fig. 2. Overview of the 47 featurizers that are currently available in five different modules (composition, site, structure, bandstructure, dos) of matminer. Each featurizer can generate one or hundreds of features, such that matminer as a whole is capable of producing thousands of individual features.

In addition to these individual featurizers, we provide a FunctionFeaturizer that combines individual features into functions such as products, quotients, logarithms, or any arbitrary mathematical expression. This procedure allows one to generate a large space of candidate features from even a small number of initial input features and has been observed to be useful in several previous works in the materials domain [18,62]. The implementation in matminer leverages the SymPy library [63], which can eliminate symbolically redundant features.

3.3. Data visualization

A crucial step of a materials informatics workflow is visualizing data, which is helpful in understanding outliers, selecting features, and guiding the machine learning process. Many data-driven materials studies generate a standard suite of similar charts, such as heatmaps or two-dimensional scatter plots, which condense multiple complex relationships into simple, informative figures. For example, visualizing distributions of data (such as histograms and violin plots) at intermediate steps in the workflow process is a useful tool for pruning data and identifying outliers. Matminer drastically simplifies making many common visualizations.

Although there exist several excellent plotting libraries in Python (e.g., matplotlib [81] and seaborn [82]), these libraries are not designed to generate interactive plots that are also easy to share and serialize to a raw data format. Fortunately, the Plotly library [83] provides the needed functionality; however, its integration with standard
vides the needed functionality; however, its integration with standard


Table 1
A list of the featurizers currently implemented in matminer. Each row in the table provides the name of the relevant Python class, a concise description of the features
it computes, and the appropriate references to the original methodology.
Featurizer Description Reference

composition.py
AtomicOrbitals Highest occupied molecular orbital (HOMO) and lowest unoccupied molecular orbital (LUMO) using orbital energies from [64]
NIST.
AtomicPackingEfficiency Packing efficiency based on a geometric theory of the amorphous packing [90]
BandCenter Estimation of absolute position of band center using geometric mean of electronegativity. [65]
CationProperty Element property attributes of cations in a composition [66]
CohesiveEnergy Cohesive energy per atom of a compound by adding known elemental cohesive energies from the formation energy of the [67]
compound.
ElectronAffinity Average electron affinity times formal charge of anion elements. [66]
ElectronegativityDiff Statistics on electronegativity difference between anions and cations. [66]
ElementFraction Fraction of each element in a composition. –
ElementProperty Statistics of various element properties [22,36,66]
IonProperty Maximum and average ionic character, whether a composition is charge-balanced [22]
Miedema Formation enthalpies of intermetallic compounds, solid solutions, and amorphous phases using semi-empirical Miedema [68–70]
model (and some extensions).
OxidationStates Statistics of oxidation states. [66]
Stoichiometry Lp norm-based stoichiometric attributes. [22]
TMetalFraction Fraction of magnetic transition metals. [66]
ValenceOrbital Valence orbital attributes such as the mean number of electrons in each shell. [22,66]
YangSolidSolution Mixing thermochemistry and size mismatch terms of Yang and Zhang (2012) [91]

structure.py
BagofBonds Representation where each structure is represented based on the types of and distances between each pair of sites [71]
BondFraction Fraction of nearest neighbors between each element (e.g., C-O vs C-C) bonds [71]
ChemicalOrdering How much the ordering of species in the structure differs from random [6]
ColoumbMatrix Coulomb matrix (Mij = Zi Zj /|Ri – Rj| for i ≠ j, Zi2.4/2 for i = j, with Zi and Ri the nuclear charge and the position of atom i). [7]
ElectronicRadialDistributionFunction RDF in which the positions of neighboring sites are weighted by electrostatic interactions inferred from atomic partial [72]
charges.
EwaldEnergy Energy from Coulombic interactions based on charge states of each site [73]
GlobalSymmetryFeatures Symmetry information such as spacegroup number and (enumerated) crystal system type. –
MaximumPackingEfficiency Maximum possible packing efficiency of this structure [6]
MinimumRelativeDistances Closest neighbor distances for all sites, where relative distance are used fij = rij/(riatom + rjatom) with riatom being radius of [74]
atom or ion i.
OrbitalFieldMatrix Average of the 32 by 32 matrix descriptions of the chemical environment of each atom in the unit cell, based on the group [75]
numbers, row numbers (optional), distances of coordinating atoms, and Voronoi Polyhedra weights.
PartialRadialDistributionFunction Frequency of bonds across varied ranges of length between certain pairs of elements [58]
RadialDistributionFunction Conventional radial distribution function (RDF) of a crystal structure. –
RadialDistributionFunctionPeaks Distances of the largest peaks in the RDF of a structure –
StructuralHeterogeneity Variance in the bond lengths and atomic volumes in a structure [6]
SineCoulombMatrix Same as the CoulombMatrix, except the nondiagonal elements are weighted by B· ∑k = {x ,y,z } ek̂ sin2 [πek̂ B −1·rij ]−1
2 , where rij
[56]
is the vector between atoms i and j and B is the lattice matrix, rather than 1/rij.
SiteStatsFingerprint Generates features pertaining to an entire structure by computing statistics across the features of all sites in the unit cell. –

bandstructure.py
BandFeaturizer Non-zero band gap, direct band gap, k-point degeneracy, relative energy to CBM/VBM at arbitrary list of k-points and at conduction/valence bands. –
BranchPointEnergy Branch-point energy by averaging the energy of arbitrary number of conduction and valence bands throughout the full Brillouin zone. [76]

dos.py
DopingFermi Fermi level associated with a specified carrier concentration and temperature. –
DOSFeaturizer The top N contributors to the density of states at the valence and conduction band edges. Includes chemical species, orbital character, and orbital location information. –

site.py
AGNIFingerprints Fingerprints based on integrating the product of the radial distribution function with a Gaussian window function. [77]
AngularFourierSeries Encodes both radial and angular information about site neighbors. Each feature is a sum of the product of two distance functions between atoms that share the central site and the cosine of the angle between them. [17]
ChemEnvSiteFingerprint Local site environment fingerprint computed with the chemenv module in pymatgen. [74,78]
ChemicalSRO Chemical short-range ordering features to evaluate deviation of local chemistry from the nominal composition of the entire structure. [79]
CoordinationNumber Number of first nearest neighbors of a site. [74]
CrystalSiteFingerprint Coordination number percentage and local structure order parameters computed from the neighbor environment of a site; Voronoi decomposition-based neighbor finding. [74]
GaussianSymmFunc Gaussian radial and angular symmetry functions originally proposed for fitting machine learning potentials. [28,80]
GeneralizedRadialDistributionFunction A radial distribution function where the bins do not need to act in a “histogram” mode. The bins can be any arbitrary function such as Gaussians, Bessel functions, or trig functions. [17]
LocalPropertyDifference Differences in elemental properties between a site and its neighboring sites. [6]
OPSiteFingerprint Local structure order parameters computed from the neighbor environment of a site; distance-based neighbor finding. [74]
VoronoiFingerprint Voronoi indices, i-fold symmetries and statistics of Voronoi facet areas, sub-polyhedron volumes and distances derived by Voronoi tessellation analysis. [79]
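As a rough illustration of the SineCoulombMatrix entry in the table above, the sketch below evaluates the matrix in plain Python. It is not matminer's implementation. The function name `sine_coulomb_matrix`, the use of fractional coordinates (so that B⁻¹·r_ij reduces to a difference of fractional coordinates), the convention that the columns of `lattice` are the lattice vectors, and the diagonal term 0.5·Z^2.4 (taken from the standard Coulomb matrix definition) are all assumptions made for this example.

```python
import math

def sine_coulomb_matrix(lattice, frac_coords, atomic_numbers):
    """Illustrative sine Coulomb matrix (not matminer's implementation).

    lattice: 3x3 nested list whose COLUMNS are the lattice vectors (assumption).
    frac_coords: list of fractional coordinates, one [a, b, c] per atom.
    atomic_numbers: list of atomic numbers Z, one per atom.
    """
    n = len(atomic_numbers)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                # Diagonal of the ordinary Coulomb matrix: 0.5 * Z^2.4
                matrix[i][j] = 0.5 * atomic_numbers[i] ** 2.4
                continue
            # B^-1 . r_ij is the difference in fractional coordinates
            dfrac = [frac_coords[i][k] - frac_coords[j][k] for k in range(3)]
            # sum_k e_k sin^2(pi * e_k . B^-1 . r_ij): one component per axis
            s = [math.sin(math.pi * d) ** 2 for d in dfrac]
            # Multiply by the lattice matrix B and take the Euclidean norm
            v = [sum(lattice[r][k] * s[k] for k in range(3)) for r in range(3)]
            norm = math.sqrt(sum(x * x for x in v))
            # Note: norm is zero if two atoms sit on the same periodic site
            matrix[i][j] = atomic_numbers[i] * atomic_numbers[j] / norm
    return matrix

# Hypothetical example: two hydrogen atoms in a cubic cell with a = 2 (arbitrary units)
cell = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]
scm = sine_coulomb_matrix(cell, [[0.0, 0.0, 0.0], [0.5, 0.0, 0.0]], [1, 1])
```

Because the sine-squared term is periodic in the fractional coordinates, translating either atom by a full lattice vector leaves the off-diagonal element unchanged, which is the property that makes this weighting suitable for periodic crystals.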

L. Ward et al. Computational Materials Science 152 (2018) 60–69

Fig. 3. Examples of plots based on a built-in data set of elastic tensors [53] and generated through the FigRecipes interface. Clockwise from top-left: a scatter matrix,
a heat map, a violin plot, and an x-y plot with color dimension that represents Poisson ratio.

Python data libraries such as pandas remains minimal. Thus, to accelerate visualization, matminer includes its own module, FigRecipes, that provides a set of pre-defined methods for creating well-formatted, common figures (Fig. 3). Plotly was selected as the backend of FigRecipes because (1) its interactivity enables the rapid identification (via Plotly “hoverinfo”) of outliers in data sets, which are frequently the most important data points in materials informatics studies, and (2) it uses a portable JSON representation of Plotly plots, which enables FigRecipes to output fine-tunable Plotly figure templates with a few lines of code. Furthermore, interactive Plotly figures can be shared easily on the web via URL, which facilitates making figures collaboratively.

The PlotlyFig class in matminer's FigRecipes module supports seven types of plots: x-y plots, scatter matrices, histograms, bar charts, heatmaps, parallel plots, and violin plots. FigRecipes also facilitates generating often-overlooked figures, such as parallel coordinate plots [84], which have been found to be useful in materials science applications as they provide a technique for representing relationships between variables in high dimensional spaces. PlotlyFig can generate several plots using the same DataFrame content, automatically determining relevant labels and legend information from DataFrame column headers. PlotlyFig can also automatically bin and transform data to be compatible with the selected plot type; for example, PlotlyFig can automatically bin data in a DataFrame to create a heatmap and can generate multiple violin plots from a DataFrame lacking an explicit 'group' column. PlotlyFig's succinct syntax and automatic conversions provide robust extensions of Plotly's plotting functionality.

PlotlyFig interfaces with several Plotly options for visualization, such as interactive offline plotting, static images, and the online Plotly interface. All figures generated with FigRecipes can be returned as a PlotlyDict object, a JSON-like dict representation of a figure that can be serialized and stored for reproducibility and sharing. This ability makes FigRecipes a useful plotting tool for creating scientific representations of data; complex data can first be easily converted into a PlotlyDict template, and this figure template can then be edited to create custom-made publication-quality images.

4. Examples of using matminer

Next, we present four usage examples that showcase the capabilities of matminer. The source code for these and other examples is available as part of the matminer_examples GitHub repository (https://github.com/hackingmaterials/matminer_examples). Users can download, inspect, and execute the full code for these examples themselves and modify them for their own applications.

4.1. Retrieving data sets and visualizing them

In our first example, we use matminer's CitrineDataRetrieval tool to collect the experimental thermoelectric materials properties reported by Gaultois et al. [85] and compiled in the Citrine database. We then, with the help of FigRecipes, visualize this data in just a few lines of code. An example output is depicted in Fig. 4, in which electrical conductivity, Seebeck coefficient, thermal conductivity and the figure of merit of thermoelectric materials (zT) are visualized in a single plot. This example effectively recreates Fig. 3 of Ref. [85] but allows the user


Fig. 4. Thermoelectric properties of nearly 1000 materials compiled by Gaultois et al. [85] and as retrieved and visualized with matminer. The marker size is scaled according to the figure of merit, zT.
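The figure of merit used for the marker size combines the other three plotted quantities through the standard definition zT = S²T/(ρκ), with Seebeck coefficient S, absolute temperature T, electrical resistivity ρ, and thermal conductivity κ. The following minimal sketch evaluates it in the units used on the plot axes. The function name `figure_of_merit` and the example values are hypothetical and are not taken from the data set.

```python
def figure_of_merit(seebeck_uV_per_K, resistivity_ohm_cm,
                    thermal_cond_W_per_mK, temperature_K):
    """Dimensionless thermoelectric figure of merit zT = S^2 * T / (rho * kappa)."""
    seebeck = seebeck_uV_per_K * 1e-6        # uV/K -> V/K
    resistivity = resistivity_ohm_cm * 1e-2  # Ohm.cm -> Ohm.m
    return seebeck ** 2 * temperature_K / (resistivity * thermal_cond_W_per_mK)

# Hypothetical material: S = 200 uV/K, rho = 1e-3 Ohm.cm,
# kappa = 1.5 W/(m.K), T = 300 K
zt = figure_of_merit(200.0, 1e-3, 1.5, 300.0)
print(round(zt, 3))
```

For these illustrative inputs the result is about 0.8. Because the Seebeck coefficient enters squared, n-type (negative S) and p-type (positive S) materials with the same magnitude of S give the same zT.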

to process the data locally, perhaps adding in their own data filtering or featurization procedure. Once the data set is loaded into a DataFrame called “df_te”, re-creating this figure can be accomplished by two Python commands, as follows:

pf = PlotlyFig(df_te, x_scale='log',
               x_title='Electrical Resistivity (cm/S)',
               y_title='Seebeck Coefficient (uV/K)',
               colorbar_title='Thermal Conductivity (W/m.K)')
pf.xy(('Electrical resistivity', 'Seebeck coefficient'),
      labels='chemicalFormula', sizes='zT',
      colors='Thermal conductivity', color_range=[0, 5])

The first line defines the data used by the charts and names for the axes. The second line defines the data being plotted. Further details are handled automatically. For example, zT values are normalized for better visualization. In addition, because the user specified a color_range of [0, 5] for the thermal conductivity values, all thermal conductivity values equal to or greater than 5 are denoted by a bright yellow color, and a “5+” tick label is automatically added to the colorbar. Thus, FigRecipes includes both automatic and customizable options that balance speed and flexibility of visualization.

4.2. Comparing experiment and theory data

In another example, we retrieve all the experimental band gap data available in Citrine and compare them with the calculated values available in the Materials Project [39]. Comparing data from two different sources is often complicated by the need to match records from one system to another. In this example, we need to find records in Materials Project with the same composition. As many entries in Citrination lack an associated crystal structure, we match each band gap to the ground-state structure with the same composition in Materials Project. Merging these data sources also demonstrates how combining data sources can fill in missing information from each database. Owing to the CitrineDataRetrieval class, the Materials Project API and Pandas, merging the two data sources requires only 9 lines of code:

c = CitrineDataRetrieval()  # Create an adapter to the Citrine Database.
df = c.get_dataframe(prop='band gap', data_type='EXPERIMENTAL',
                     show_columns=['chemicalFormula', 'Band gap'])
mpr = MPRester()
def get_MP_bandgap(formula):
    # Reduce the formula to integer form and fetch all matching entries
    formula = Composition(formula).get_integer_formula_and_factor()[0]
    strcs = mpr.get_data(formula)
    if strcs:
        # The lowest-energy entry is taken as the ground state
        return sorted(strcs, key=lambda e: e['energy_per_atom'])[0]['band_gap']
df['DFT Band gap'] = df['chemicalFormula'].apply(get_MP_bandgap)

As shown in Fig. 5, most computed DFT band gaps are lower than the experimental values, which is a known drawback of DFT calculations performed using LDA or GGA functionals [86–88]. Because the comparison is performed automatically, minimal human effort is required to update the result as new experimental band gaps are added to Citrination or new calculations are performed by Materials Project. As exemplified by this example, the tools matminer provides to automate data-driven analyses can make reproducing data-driven materials studies much simpler.

4.3. Building a machine learning model using OQMD data

To demonstrate how matminer can facilitate the process of machine learning, we recreate a machine learning model from a 2016 paper by Ward et al. [22]. In this work, the authors trained a machine learning model using data from the Open Quantum Materials Database (OQMD) [42,92] to predict the formation enthalpy of crystalline materials given their composition.

The first step is to retrieve the OQMD data used by Ward et al., which is available through the Materials Data Facility [44]. We can use matminer’s data retrieval tools to access this data directly with only three lines of code:

mdf = retrieve_MDF.MDFDataRetrieval(anonymous=True)
query_string = ('mdf.source_name:oqmd_v3 AND '
                '(oqmd_v3.configuration:static OR '
                'oqmd_v3.configuration:standard) AND '
                'dft.converged:True')
data = mdf.get_data(query_string, unwind_arrays=False)

The next step is to process the dataset to create a suitable training set:


Fig. 5. Comparison of experimentally-measured band gap energies retrieved from the Citrine database to DFT-PBE computed electronic band gaps retrieved from the
Materials Project. As expected, the data set demonstrates that computed band gaps underestimate experimental values [86–88].
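The comparison summarized in Fig. 5 rests on two simple operations: picking the lowest-energy computed entry for a composition and comparing its band gap with the measured one. The sketch below mimics both with plain Python dictionaries instead of live database queries. The helper names and all numerical values are hypothetical; only the keys `energy_per_atom` and `band_gap` follow the band gap listing in the text.

```python
def ground_state_band_gap(entries):
    """Return the band gap of the lowest-energy entry, or None if there are no entries."""
    if not entries:
        return None
    return min(entries, key=lambda e: e["energy_per_atom"])["band_gap"]

def mean_signed_error(pairs):
    """Mean of (computed - experimental); a negative value means underestimation."""
    diffs = [computed - measured for measured, computed in pairs]
    return sum(diffs) / len(diffs)

# Hypothetical computed entries for one composition (eV/atom and eV)
entries = [
    {"energy_per_atom": -5.10, "band_gap": 1.9},
    {"energy_per_atom": -5.25, "band_gap": 2.3},  # lowest energy: ground state
]
dft_gap = ground_state_band_gap(entries)

# Hypothetical (experimental, computed) band gap pairs in eV
pairs = [(3.2, dft_gap), (1.1, 0.6), (5.5, 4.0)]
bias = mean_signed_error(pairs)
```

A negative `bias` corresponds to the trend visible in the figure, where the computed gaps fall below the experimental ones. Using `min` rather than a full sort returns the same ground-state entry while avoiding the cost of ordering every candidate.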

removing errors, duplicates, and outliers. For example, removing all entries which lack a computed formation enthalpy can be achieved in a single line of Python:

data = data[~data['oqmd_v3.delta_e.value'].isnull()]

The third step in building a machine learning model is computing a representation. We have implemented the techniques developed by Ward et al. into matminer as Featurizer classes. These Featurizers, which operate on DataFrame objects, are also simple to run:

featurizer = MultipleFeaturizer([
    cf.Stoichiometry(),
    cf.ElementProperty.from_preset("magpie"),
    cf.ValenceOrbital(props=['avg']),
    cf.IonProperty()])
featurizer.featurize_dataframe(data, col_id='composition_obj')

These two lines of code generate the 145 features used by Ward et al. and store them within the DataFrame object. At this point, the data are in a form that is compatible with existing machine learning libraries, such as scikit-learn or Keras. After using scikit-learn’s Random Forest implementation and cross-validation utilities, we find that our model achieves a MAE of 0.071 eV/atom in 10-fold cross-validation, which is consistent with the results reported by Ward et al. (as low as 0.088 eV/atom using a different tree-based ML method). Overall this example serves to demonstrate how matminer, combined with community-standard data analysis and machine learning libraries, facilitates the construction of machine learning models from materials data.

4.4. Comparing crystal structure featurization methods

Another benefit of matminer is that it simplifies comparing machine learning methods. To illustrate, we used matminer to compare three methods for predicting the formation energy for a given crystal structure: the Sine Coulomb Matrix (SCM) [56], the Orbital Field Matrix (OFM) [75], and a recent modification to the OFM in development that also includes the row of each element in the periodic table in addition to the column (OFMR).

The first step in comparing the models is to gather training sets. For this task, we use the original 3938 structures selected by Faber et al. from the Materials Project (FLLA) [56] and a dataset of all 7735 stable ternary oxides in the Materials Project with unit cell size at most 30 atoms (TER_OX). Gathering the data is simple with matminer. The FLLA data set is built into matminer and the TER_OX dataset can be gathered with a single MPDataRetrieval query:

from matminer.data_retrieval.retrieve_MP import MPDataRetrieval
mpr = MPDataRetrieval()
criteria = '*-*-O'
properties = ['structure', 'nsites',
              'formation_energy_per_atom',
              'e_above_hull']
df = mpr.get_dataframe(criteria=criteria,
                       properties=properties)
df = df[df['e_above_hull'] < 0.1]
df = df[df['nsites'] <= 30]

Each of the three methods uses Kernel Ridge Regression (KRR) as the machine learning algorithm; we employ the implementation of this method from scikit-learn. scikit-learn includes a well-optimized implementation of KRR, and has a tool, GridSearchCV, for easily selecting the optimum kernel and regularization parameter for KRR [30]. We tested each method using five-fold cross-validation, and used four-fold cross-validation when optimizing hyperparameters for each fold. We tested Laplacian and RBF (radial basis function) kernels for both features, and used the r² value of the formation energy per atom predictions to score each hyperparameter set [30].

The orbital field matrix can be time consuming to calculate for a large dataset because of its size; however, the process can be accelerated by the parallelization feature of matminer. Matminer automatically runs in parallel across all available CPU cores using Python’s multiprocessing package. The following code computes the OFM representation and automatically runs in parallel:

from matminer.featurizers.structure import OrbitalFieldMatrix
ofm = OrbitalFieldMatrix()
df = ofm.featurize_dataframe(df, 'structure')

The cross-validation results for the FLLA and TER_OX datasets are presented in Table 2. We find very close agreement between the Mean Absolute Error (MAE) reported by Faber et al. for the SCM (0.37 eV/atom) and our result with matminer of 0.387 eV/atom, despite minor differences in the cross-validation procedure [56]. This demonstrates that we are able to reproduce the methodology of a published machine learning paper and compare it with a new featurization method (OFMR) with very little effort.

Our results indicate that for both data sets, the OFMR outperforms the OFM featurizer, which in turn outperforms the SCM (Table 2). All methods perform better on the TER_OX dataset than the FLLA dataset, demonstrating that the specific data set influences both absolute and relative model performance. Featurization and evaluation of the OFM and OFMR take much longer than for the SCM because of the size of the


descriptors, which may result in a time-accuracy tradeoff in some applications. We also note that Faber et al. have been developing updated structure representations [89] that in the future might be further compared to the current results. Being able to probe the applicability of different featurization methods for different data sets is significantly simplified by the ability to easily swap out different machine learning methods and datasets within a machine learning pipeline. This allows for rapid testing of new methods against various data sets.

Table 2
Performance (in terms of both accuracy and time needed to featurize) of several machine learning methods on two different datasets: the FLLA [56] and TER_OX datasets. We compare the Sine Coulomb Matrix (SCM) [56], Orbital Field Matrix (OFM) [75], and Orbital Field Matrix + row in periodic table (OFMR). The performance scores are for each model in 5-fold cross-validation. Each model was run on 24, 2.3 GHz processor cores on a system with 64 GB of RAM.

Dataset   Descriptor   MAE (eV/atom)   RMSE (eV/atom)   r²      Featurize time (s)   Cross-validation time (h:mm:ss)
FLLA      SCM          0.387           0.575            0.708   2.0                  0:07:42
FLLA      OFM          0.229           0.346            0.894   138.                 0:50:40
FLLA      OFMR         0.171           0.277            0.932   138.                 1:20:14
TER_OX    SCM          0.123           0.220            0.917   5.0                  0:30:16
TER_OX    OFM          0.090           0.140            0.967   366.                 4:30:16
TER_OX    OFMR         0.059           0.100            0.983   363.                 7:06:42

5. Conclusion

Performing materials informatics requires developing a data pipeline that encompasses data retrieval, feature extraction, and visualization prior to the actual machine learning step. The matminer software described in this manuscript is designed to facilitate the development, reuse, and reproducibility of data pipelines for materials informatics applications. We have designed matminer to connect the domain-specific aspects of materials informatics (i.e., materials data extraction, feature extraction of materials science concepts, common plotting routines) with the professional level machine learning and data processing software already developed and in use by the Python community. It is our hope that matminer can serve as a community repository for new materials data analytics techniques as they become available such that researchers can rapidly develop and test new methods against standard techniques, accelerating the use of data mining in the materials community at large.

Acknowledgements

This code was intellectually led and primarily developed using funding provided by U.S. Department of Energy, Office of Basic Energy Sciences, Early Career Research Program, which funded the efforts of AJ, AD, AF, SB, and QW. LW and IF were supported by financial assistance award 70NANB14H012 from U.S. Department of Commerce, National Institute of Standards and Technology as part of the Center for Hierarchical Material Design (CHiMaD), by the National Science Foundation as part of the Midwest Big Data Hub under NSF Award Number: 1636950 “BD Spokes: SPOKE: MIDWEST: Collaborative: Integrative Materials Design (IMaD): Leverage, Innovate, and Disseminate,” and by the Department of Energy contract DE-AC02-06CH11357. NER, JM, MA, and KAP were funded by the U.S. Department of Energy, Office of Science, Office of Basic Energy Sciences, Materials Sciences and Engineering Division under Contract No. DE-AC02-05-CH11231: Materials Project program KC23MP. JC and KC were supported by NSF, United States grant 1541450 (CC*DNI DIBBS: Merging Science and Cyberinfrastructure Pathways: The Whole Tale). KWB acknowledges the University of California, Berkeley College of Chemistry for a summer research stipend. MD and GJS were funded by NSF DMR program Grant nos. 1334713 and 1333335.

This research used the Savio computational cluster resource provided by the Berkeley Research Computing program at the University of California, Berkeley (supported by the UC Berkeley Chancellor, Vice Chancellor for Research, and Chief Information Officer). This research used resources of the National Energy Research Scientific Computing Center, a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.

We thank all those in the materials community who have contributed code commits to matminer, including Ashwin Aggarwal, Evgeny Blokhin, Jason Frost, Matthew Horton, Kiran Mathew, Shyue Ping Ong, Sayan Rowchowdhury, and Donny Winston.

Appendix A. Supplementary material

Supplementary data associated with this article can be found, in the online version, at http://dx.doi.org/10.1016/j.commatsci.2018.05.018.

References

[1] H. Chen, G. Hautier, A. Jain, C. Moore, B. Kang, R. Doe, L. Wu, Y. Zhu, Y. Tang, G. Ceder, Chem. Mater. 24 (2012) 2009.
[2] M. Aykol, S. Kim, V.I. Hegde, D. Snydacker, Z. Lu, S. Hao, S. Kirklin, D. Morgan, C. Wolverton, Nat. Commun. 7 (2016) 13779.
[3] C. Nyshadham, C. Oses, J.E. Hansen, I. Takeuchi, S. Curtarolo, G.L.W. Hart, Acta Mater. 122 (2017) 438.
[4] S. Kirklin, J.E. Saal, V.I. Hegde, C. Wolverton, Acta Mater. 102 (2016) 125.
[5] A. Jain, K.A. Persson, G. Ceder, APL Mater. 4 (2016) 53102.
[6] L. Ward, R. Liu, A. Krishna, V.I. Hegde, A. Agrawal, A. Choudhary, C. Wolverton, Phys. Rev. B 96 (2017) 24104.
[7] M. Rupp, A. Tkatchenko, K.-R. Müller, O.A. von Lilienfeld, Phys. Rev. Lett. 108 (2012) 58301.
[8] J. Carrete, W. Li, N. Mingo, S. Wang, S. Curtarolo, Phys. Rev. X 4 (2014) 11019.
[9] L. Ward, C. Wolverton, Curr. Opin. Solid State Mater. Sci. 21 (2017) 167.
[10] J.C. Mauro, A. Tandia, K.D. Vargheese, Y.Z. Mauro, M.M. Smedskjaer, Chem. Mater. 28 (2016) 4267.
[11] E.W. Bucholz, C.S. Kong, K.R. Marchman, W.G. Sawyer, S.R. Phillpot, S.B. Sinnott, K. Rajan, Tribol. Lett. 47 (2012) 211.
[12] T.D. Sparks, M.W. Gaultois, A. Oliynyk, J. Brgoch, B. Meredig, Scr. Mater. 111 (2015) 10.
[13] R. Yuan, Z. Liu, P.V. Balachandran, D. Xue, Y. Zhou, X. Ding, J. Sun, D. Xue, T. Lookman, Adv. Mater. 1702884 (2018) 1702884.
[14] A. Mannodi-Kanakkithodi, A. Chandrasekaran, C. Kim, T.D. Huan, G. Pilania, V. Botu, R. Ramprasad, Mater. Today (2017), http://dx.doi.org/10.1016/j.mattod.2017.11.021.
[15] F.A. Faber, A. Lindmaa, O.A. von Lilienfeld, R. Armiento, Phys. Rev. Lett. 117 (2016) 135502.
[16] F. Ren, L. Ward, T. Williams, K.J. Laws, C. Wolverton, J. Hattrick-Simpers, A. Mehta, Sci. Adv. 4 (2018) eaaq1566.
[17] A. Seko, H. Hayashi, K. Nakayama, A. Takahashi, I. Tanaka, Phys. Rev. B 95 (2017) 144110.
[18] R. Ramprasad, R. Batra, G. Pilania, A. Mannodi-Kanakkithodi, C. Kim, Npj Comput. Mater. 3 (2017) 54.
[19] S.R. Kalidindi, ISRN Mater Sci. 2012 (2012) 1.
[20] J. Hill, G. Mulholland, K. Persson, R. Seshadri, C. Wolverton, B. Meredig, MRS Bull. 41 (2016) 399.
[21] W. McKinney, Proc. 9th Python Sci. Conf. 1697900 (2010) 51.
[22] L. Ward, A. Agrawal, A. Choudhary, C. Wolverton, Npj Comput. Mater. 2 (2016) 16028.
[23] http://bitbucket.org/wolverton/magpie.
[24] W. Daniel, B. David, F. Tony, K. Surya, R. Andrew, PyMKS: Materials Knowledge System in Python, 2014. doi: 10.6084/m9.figshare.1015761.
[25] Z. Wu, B. Ramsundar, E.N. Feinberg, J. Gomes, C. Geniesse, A.S. Pappu, K. Leswing, V. Pande, Chem. Sci. 9 (2018) 513.
[26] https://github.com/deepchem/deepchem.
[27] E. Gossett, C. Toher, C. Oses, O. Isayev, F. Legrain, F. Rose, E. Zurek, J. Carrete, N. Mingo, A. Tropsha, S. Curtarolo (2017) arXiv:1711.10744v1.
[28] A. Khorshidi, A.A. Peterson, Comput. Phys. Commun. 207 (2016) 310.
[29] https://github.com/libAtoms/QUIP.
[30] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, J. Mach. Learn. Res. 12 (2011) 2825.
[31] F. Perez, B.E. Granger, Comput. Sci. Eng. 9 (2007) 21.
[32] K.J. Millman, M. Aivazis, Comput. Sci. Eng. 13 (2011) 9.
[33] S. van der Walt, S.C. Colbert, G. Varoquaux, Comput. Sci. Eng. 13 (2011) 22.
[34] https://github.com/keras-team/keras.
[35] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G.S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mane, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viegas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, X. Zheng, 2015. <https://www.tensorflow.org/>.
[36] S.P. Ong, W.D. Richards, A. Jain, G. Hautier, M. Kocher, S. Cholia, D. Gunter, V.L. Chevrier, K.A. Persson, G. Ceder, Comput. Mater. Sci. 68 (2013) 314.
[37] A. Frantzen, J. Scheidtmann, G. Frenzer, W.F. Maier, J. Jockel, T. Brinz, D. Sanders, U. Simon, Angew. Chemie Int. Ed. 43 (2004) 752.
[38] Y. Xu, M. Yamazaki, P. Villars, Jpn. J. Appl. Phys 50 (2011) 11RH02.
[39] A. Jain, S.P. Ong, G. Hautier, W. Chen, W.D. Richards, S. Dacek, S. Cholia, D. Gunter, D. Skinner, G. Ceder, K.A. Persson, APL Mater. 1 (2013) 11002.
[40] https://citrination.com.
[41] S. Curtarolo, W. Setyawan, S. Wang, J. Xue, K. Yang, R.H. Taylor, L.J. Nelson, G.L.W. Hart, S. Sanvito, M. Buongiorno-Nardelli, N. Mingo, O. Levy, Comput. Mater. Sci. 58 (2012) 227.
[42] J.E. Saal, S. Kirklin, M. Aykol, B. Meredig, C. Wolverton, JOM 65 (2013) 1501.
[43] J. O’Mara, B. Meredig, K. Michel, JOM 68 (2016) 2031.
[44] B. Blaiszik, K. Chard, J. Pruyne, R. Ananthakrishnan, S. Tuecke, I. Foster, JOM 68 (2016) 2045.
[45] https://mpds.io/.
[46] https://www.mongodb.com/.
[47] K. Michel, B. Meredig, MRS Bull. 41 (2016) 617.
[48] P. Hohenberg, W. Kohn, Phys. Rev. 136 (1964) B864.
[49] L.O. Wagner, T.E. Baker, E.M. Stoudenmire, K. Burke, S.R. White, Phys. Rev. B 90 (2014) 45109.
[50] S.P. Ong, S. Cholia, A. Jain, M. Brafman, D. Gunter, G. Ceder, K.A. Persson, Comput. Mater. Sci. 97 (2015) 209.
[51] https://github.com/materials-data-facility/forge.
[52] K. Mathew, J.H. Montoya, A. Faghaninia, S. Dwarakanath, M. Aykol, H. Tang, I. Chu, T. Smidt, B. Bocklund, M. Horton, J. Dagdelen, B. Wood, Z.-K. Liu, J. Neaton, S.P. Ong, K. Persson, A. Jain, Comput. Mater. Sci. 139 (2017) 140.
[53] M. De Jong, W. Chen, T. Angsten, A. Jain, R. Notestine, A. Gamst, M. Sluiter, C.K. Ande, S. Van Der Zwaag, J.J. Plata, C. Toher, S. Curtarolo, G. Ceder, K.A. Persson, M. Asta, Sci. Data (2015) 1.
[54] M. de Jong, W. Chen, H. Geerlings, M. Asta, K.A. Persson, Sci. Data 2 (2015) 150053.
[55] I. Petousis, W. Chen, G. Hautier, T. Graf, T.D. Schladt, K.A. Persson, F.B. Prinz, Phys. Rev. B 93 (2016) 115151.
[56] F. Faber, A. Lindmaa, O.A. von Lilienfeld, R. Armiento, Int. J. Quantum Chem. 115 (2015) 1094.
[57] T. Fast, S.R. Kalidindi, Acta Mater. 59 (2011) 4595.
[58] K.T. Schütt, H. Glawe, F. Brockherde, A. Sanna, K.R. Müller, E.K.U. Gross, Phys. Rev. B 89 (2014) 205118.
[59] A. Seko, A. Takahashi, I. Tanaka, Phys. Rev. B 90 (2014) 24101.
[60] O. Isayev, C. Oses, C. Toher, E. Gossett, S. Curtarolo, A. Tropsha, Nat. Commun. 8 (2017) 15679.
[61] K.T. Schütt, F. Arbabzadah, S. Chmiela, K.R. Mu, A. Tkatchenko, Nat. Commun. 8 (2017) 13890.
[62] L.M. Ghiringhelli, J. Vybiral, S.V. Levchenko, C. Draxl, M. Scheffler, Phys. Rev. Lett. 114 (2015) 105503.
[63] A. Meurer, C.P. Smith, M. Paprocki, O. Čertík, S.B. Kirpichev, M. Rocklin, Am. Kumar, S. Ivanov, J.K. Moore, S. Singh, T. Rathnayake, S. Vig, B.E. Granger, R.P. Muller, F. Bonazzi, H. Gupta, S. Vats, F. Johansson, F. Pedregosa, M.J. Curry, A.R. Terrel, Š. Roučka, A. Saboo, I. Fernando, S. Kulal, R. Cimrman, A. Scopatz, PeerJ Comput. Sci. 3 (2017) e103.
[64] S. Kotochigova, Z.H. Levine, E.L. Shirley, M.D. Stiles, C.W. Clark, Phys. Rev. A 55 (1997) 191.
[65] M.A. Butler, J. Electrochem. Soc. 125 (1978) 228.
[66] A.M. Deml, R.O. Hayre, C. Wolverton, V. Stevanovic, Phys. Rev. B 93 (2016) 85142.
[67] C. Kittel, Introduction to Solid State Physics, 8th ed., Wiley, 2005.
[68] F.R. de Boer, Cohesion in Metals: Transition Metal Alloys, North-Holland, Amsterdam, 1988.
[69] R.F. Zhang, S.H. Zhang, Z.J. He, J. Jing, S.H. Sheng, Comput. Phys. Commun. 209 (2016) 58.
[70] L.J. Gallego, J.A. Somoza, J.A. Alonso, J. Phys. Condens. Matter 2 (1990) 6245.
[71] K. Hansen, F. Biegler, R. Ramakrishnan, W. Pronobis, O.A. Von Lilienfeld, K.-R.R. Müller, A. Tkatchenko, J. Phys. Chem. Lett. 6 (2015) 2326.
[72] E.L. Willighagen, R. Wehrens, P. Verwer, R. de Gelder, L.M.C. Buydens, Acta Crystallogr. Sect. B Struct. Sci. 61 (2005) 29.
[73] P.P. Ewald, Ann. Phys. 369 (1921) 253.
[74] N.E.R. Zimmermann, M.K. Horton, A. Jain, M. Haranczyk, Front. Mater. 4 (2017) 1.
[75] T. Lam Pham, H. Kino, K. Terakura, T. Miyake, K. Tsuda, I. Takigawa, H. Chi Dam, Sci. Technol. Adv. Mater. 18 (2017) 756.
[76] A. Schleife, F. Fuchs, C. Rödl, J. Furthmüller, F. Bechstedt, Appl. Phys. Lett. 94 (2009) 12104.
[77] V. Botu, R. Ramprasad, Phys. Rev. B 92 (2015) 94306.
[78] D. Waroquiers, X. Gonze, G.-M. Rignanese, C. Welker-Nieuwoudt, F. Rosowski, M. Göbel, S. Schenk, P. Degelmann, R. André, R. Glaum, G. Hautier, Chem. Mater. 29 (2017) 8346.
[79] A. Okabe, B. Boots, K. Sugihara, S.N. Chiu, Spatial Tesselations, 2009.
[80] J. Behler, J. Chem. Phys. 134 (2011) 74106.
[81] J.D. Hunter, Comput. Sci. Eng. 9 (2007) 90.
[82] M. Waskom, O. Botvinnik, D. O’Kane, P. Hobson, S. Lukauskas, D.C. Gemperline, T. Augspurger, Y. Halchenko, J.B. Cole, J. Warmenhoven, J. de Ruiter, C. Pye, S. Hoyer, J. Vanderplas, S. Villalba, G. Kunter, E. Quintero, P. Bachant, M. Martin, K. Meyer, A. Miles, Y. Ram, T. Yarkoni, M.L. Williams, C. Evans, C. Fitzgerald, Brian, C. Fonnesbeck, A. Lee, A. Qalieh, 2017. doi: 10.5281/ZENODO.883859.
[83] https://plot.ly/.
[84] J.M. Rickman, Npj Comput. Mater. 4 (2018) 5.
[85] M.W. Gaultois, T.D. Sparks, C.K.H. Borg, R. Seshadri, W.D. Bonificio, D.R. Clarke, Chem. Mater. 25 (2013) 2911.
[86] J.P. Perdew, M. Levy, Phys. Rev. Lett. 51 (1983) 1884.
[87] L.J. Sham, M. Schlüter, Phys. Rev. Lett. 51 (1983) 1888.
[88] M.K.Y. Chan, G. Ceder, Phys. Rev. Lett. 105 (2010) 196403.
[89] F.A. Faber, A.S. Christensen, B. Huang, O.A. von Lilienfeld, J. Chem. Phys. 148 (2018) 241717.
[90] K.J. Laws, D.B. Miracle, M. Ferry, A predictive structural model for bulk metallic glasses, Nat. Commun. 6 (2015) 8123.
[91] X. Yang, Y. Zhang, Prediction of high-entropy stabilized solid-solution in multi-component alloys, Mater. Chem. Phys. 132 (2012) 233–238.
[92] S. Kirklin, J.E. Saal, B. Meredig, A. Thompson, J.W. Doak, M. Aykol, et al., The Open Quantum Materials Database (OQMD): assessing the accuracy of DFT formation energies, Npj Comput. Mater. 1 (2015) 15010, http://dx.doi.org/10.1038/npjcompumats.2015.10.

Small Data Machine Learning in Materials Science: Review Article
15 pages
Machine Learning in Materials Science
No ratings yet
Machine Learning in Materials Science
21 pages
Machine Learning for Advanced Functional Materials Nirav Joshi - Read the ebook online or download it to own the full content
100% (1)
Machine Learning for Advanced Functional Materials Nirav Joshi - Read the ebook online or download it to own the full content
75 pages
Machine Learning For Advanced Functional Materials 1st Ed 2023 Nirav Joshi instant download
No ratings yet
Machine Learning For Advanced Functional Materials 1st Ed 2023 Nirav Joshi instant download
79 pages
MatterGen (6)
No ratings yet
MatterGen (6)
33 pages
2502.04984v1
No ratings yet
2502.04984v1
9 pages
s41597-024-03039-z
No ratings yet
s41597-024-03039-z
9 pages
An intelligent computing system to detect material
No ratings yet
An intelligent computing system to detect material
5 pages
Chen Et Al 2024 Accelerating Computational Materials Discovery With Machine Learning and Cloud High Performance
No ratings yet
Chen Et Al 2024 Accelerating Computational Materials Discovery With Machine Learning and Cloud High Performance
10 pages
Moses Et Al 2021 Machine Learning Screening of Metal Ion Battery Electrode Materials
No ratings yet
Moses Et Al 2021 Machine Learning Screening of Metal Ion Battery Electrode Materials
8 pages
Machine Learning For Chemistry
No ratings yet
Machine Learning For Chemistry
4 pages
Adv Funct Materials - 2022 - Liu - Toward Excellence of Electrocatalyst Design by Emerging Descriptor‐Oriented Machine
No ratings yet
Adv Funct Materials - 2022 - Liu - Toward Excellence of Electrocatalyst Design by Emerging Descriptor‐Oriented Machine
25 pages
Array Programming With Numpy: Review
No ratings yet
Array Programming With Numpy: Review
6 pages
NumPy Review
No ratings yet
NumPy Review
6 pages
Paper 2
No ratings yet
Paper 2
10 pages
acs.jcim.3c00643
No ratings yet
acs.jcim.3c00643
28 pages
L L M M K: U S M S GPT: Arge Anguage Odels As Aster EY Nlocking THE Ecrets of Aterials Cience With
No ratings yet
L L M M K: U S M S GPT: Arge Anguage Odels As Aster EY Nlocking THE Ecrets of Aterials Cience With
17 pages
Institute For Defense Analyses: This Content Downloaded From 157.37.139.229 On Wed, 29 Apr 2020 07:38:00 UTC
No ratings yet
Institute For Defense Analyses: This Content Downloaded From 157.37.139.229 On Wed, 29 Apr 2020 07:38:00 UTC
7 pages
Plagiarism
No ratings yet
Plagiarism
18 pages
Predicting The Electronic and Structural Properties of Two-Dimensional Materials Using Machine Learning
No ratings yet
Predicting The Electronic and Structural Properties of Two-Dimensional Materials Using Machine Learning
14 pages
Machine-Learning-Driven-Optimization-of-Battery-Materials-via-Quantum-Computing
No ratings yet
Machine-Learning-Driven-Optimization-of-Battery-Materials-via-Quantum-Computing
17 pages
Nie Poster
No ratings yet
Nie Poster
1 page
A Review
No ratings yet
A Review
15 pages
Machine Learning in Nuclear Materials Research: Ddmorgan@wisc - Edu Liju@mit - Edu
No ratings yet
Machine Learning in Nuclear Materials Research: Ddmorgan@wisc - Edu Liju@mit - Edu
64 pages
Vtu 5TH Sem Cse DBMS Notes
85% (20)
Vtu 5TH Sem Cse DBMS Notes
34 pages
Contract Search
No ratings yet
Contract Search
183 pages
Maximo Configuration Customization Best Practices
No ratings yet
Maximo Configuration Customization Best Practices
16 pages
991978-5 MyCalls Enterprise Installation Manual
No ratings yet
991978-5 MyCalls Enterprise Installation Manual
38 pages
ADMT Assignment2 (Ans)
No ratings yet
ADMT Assignment2 (Ans)
10 pages
Foundation - Microsoft SQL Server 2016
No ratings yet
Foundation - Microsoft SQL Server 2016
8 pages
Final Internship Daniel and Beamlak
100% (1)
Final Internship Daniel and Beamlak
42 pages
Sharepoint2019 PDF
No ratings yet
Sharepoint2019 PDF
3,775 pages
Github Com Aman0046 LastMinuteRevision DBMS
No ratings yet
Github Com Aman0046 LastMinuteRevision DBMS
8 pages
Installation and 5ProVision Administration Manual 6.11.
No ratings yet
Installation and 5ProVision Administration Manual 6.11.
378 pages
Inventory Management System DBMS Project
No ratings yet
Inventory Management System DBMS Project
27 pages
02 Exchange Profile
No ratings yet
02 Exchange Profile
16 pages
Custom Code Migration Guide For SAP S/4HANA 1909: Feature Package Stack 02
No ratings yet
Custom Code Migration Guide For SAP S/4HANA 1909: Feature Package Stack 02
72 pages
Radhe Resume (1)
No ratings yet
Radhe Resume (1)
7 pages
Tcs Interview Experiences
No ratings yet
Tcs Interview Experiences
25 pages
COMMANDS San Switch Cheat Sheet
No ratings yet
COMMANDS San Switch Cheat Sheet
6 pages
Risk Management Policy Morocco Highlights
No ratings yet
Risk Management Policy Morocco Highlights
26 pages
IOOP Asia Pacific University Assignment
100% (1)
IOOP Asia Pacific University Assignment
24 pages
ML Lab Programs PDF
No ratings yet
ML Lab Programs PDF
15 pages
Bpel and Ebiz
No ratings yet
Bpel and Ebiz
51 pages
Patran 2023.4 User Guide
0% (1)
Patran 2023.4 User Guide
220 pages
Cloud Computing Lab Manual Jan-jun 2025
No ratings yet
Cloud Computing Lab Manual Jan-jun 2025
56 pages
Mootl
No ratings yet
Mootl
26 pages

1. Introduction

continued development of general-purpose data mining methods for

4. The “implementors” method provides the name of the person(s) who

BaseFeaturizer provides additional functions that a user can call
