Logan Ward¹ (loganw@uchicago.edu),
Ben Blaiszik¹,² (blaiszik@uchicago.edu),
Ian Foster¹,² (foster@uchicago.edu), Ryan Chard²,
Jonathon Gaff¹, Kyle Chard¹, Jim Pruyne¹,
Rachana Ananthakrishnan¹, Steven Tuecke¹,
Michael Ondrejcek³, Kenton McHenry³, John Towns³
¹University of Chicago, ²Argonne National Laboratory, ³University of Illinois at Urbana-Champaign
materialsdatafacility.org
globus.org
Materials Data Facility:
A Distributed Model for
the Materials Data Community
15 August 2017
The Materials Data Facility Team
UC/Argonne
Ian Foster (PI) Ben Blaiszik Steve Tuecke
Kyle Chard Jim Pruyne
Logan Ward Jonathon Gaff
Illinois (Urbana-Champaign)
Rachana Ananthakrishnan
John Towns (PI) Kenton McHenry
Michal Ondrejcek
Stephen Rosen
Ryan Chard
Data-Intensive Materials Science
Materials Databases High-Throughput Screening
Machine Learning Multi-scale Modeling
Kirklin et al. Acta Mat. (2016)
de Jong et al. Sci Rep. (2016)
Sparks et al. Scr. Mat. (2015)
https://www.mpg.de/
Data-Intensive Materials Science
Science is becoming limited by the ability to handle data
- Where to get it?
- How to selectively share it?
- Where to store it?
- How do we know what it is?
- How to build software that uses it?
- How to get others to share theirs?
- How to keep track of provenance?
- ….?
Our goal is to create easy answers to these questions
Why create the MDF?
1. Make your data shareable
Custom access control, using institution credentials
2. Make your data open
Access to >100TB of storage space
3. Make your data accessible
Search across distributed resources
Automatic, domain-specific metadata extraction
4. Make your data computable
Tight integration with computing resources
5. Make your data valuable
Citable with DOIs, measured with usage stats
What is the MDF?
Data discovery service: Deep indexing, Query, Browse, Aggregate
Data publication service: Publish, Mint DOIs, Associate metadata
Databases, Datasets, APIs, LIMS, etc.
Distributed data storage
SHAREABLE AND OPEN DATA
Globus and the research data lifecycle
1. Researcher initiates transfer request; or requested automatically by script, science gateway
2. Globus transfers files reliably, securely
3. Researcher selects files to share, selects user or group, and sets access permissions
4. Globus controls access to shared files on existing storage; no need to move files to cloud storage!
5. Collaborator logs in to Globus and accesses shared files; no local account required; download via Globus
6. Researcher assembles data set; describes it using metadata (Datacite & domain-specific)
7. Curator reviews and approves; data set published on campus or other system
8. Peers, collaborators search and discover datasets; transfer and share using Globus
Locations shown: Instrument, Compute Facility, Personal Computer, Publication Repository
Stages: Transfer, Share, Publish, Discover
• Only a Web browser required
• Use storage system of your choice
• Access using your campus credentials
Data sharing and Globus
Easily control who gains access to your data:
- Globus can use University/Laboratory credentials
- You can establish groups of authorized users
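As a rough illustration of the group-based sharing model on this slide, the sketch below checks whether a user may read a shared folder. It is a toy in-memory model and not the Globus API: `Share`, `can_read`, the group name, and the example identities are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Share:
    """Toy model of a shared folder with per-user and per-group read access."""
    path: str
    allowed_users: set = field(default_factory=set)
    allowed_groups: set = field(default_factory=set)


def can_read(share, user, memberships):
    """Return True if `user` is listed directly or belongs to an allowed group.

    `memberships` maps a group name to the set of its members.
    """
    if user in share.allowed_users:
        return True
    return any(user in memberships.get(group, set())
               for group in share.allowed_groups)


# Hypothetical identities and group, for illustration only.
memberships = {"alloy-project": {"alice@uchicago.edu", "bob@anl.gov"}}
share = Share("/data/alloy-run/", allowed_groups={"alloy-project"})

print(can_read(share, "bob@anl.gov", memberships))        # True
print(can_read(share, "carol@example.org", memberships))  # False
```

Granting access through a group rather than user by user means a new collaborator only needs to be added to the group once.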
Data sharing and Globus
Simple to move data to/from any resource
Open data and Globus
Bottom Line: Globus provides a
robust, highly-developed, well-
supported platform for sharing and
managing open data
DATA ACCESSIBILITY
What do I mean by “accessibility”?
Need: Simplify finding and acquiring materials data
Major Challenges:
1. Data are spread across many resources
   - Have to search each repository individually
   - Different services, different APIs to get data
2. Contents of resources are poorly described
   - Lack domain-specific metadata
Goal: Link together the world’s materials data resources, with enough metadata to make them useful
Part 1: Linking with the Data Community
Materials Project
Citrination
Materials Commons
Other Facilities (APS, SNS, NSLS, …), Institutional Repositories, Publishers!
Link labels: Metadata; Publishing; Metadata; MD, Pub., Compute; Metadata; Publishing
NCSA-PIRE, HV/TMS, MBDH
MDF data discovery ecosystem
Diagram labels:
- Data Sources (Data Input)
- Data discovery service: Harvest, Deep index, Register / Sync
- NIST MRR
- Services: Bots, MDF Pub Service
- Automate: Process, Refine, Analyze
- User Interfaces (Data Output): Query, Browse, Aggregate
Identify resources for indexing
MDF + NIST Database Tools
Data discovery service, MDCS, NIST MRR
Ref: Dima, et al. JOM. 68 (2016), 2053. doi: 10.1007/s11837-016-2000-4
MDF automates publicizing data and provides a uniform search interface
Piping DFT data from MDF to Citrine
1. User publishes DFT dataset
2. Bot requests open DFT data periodically
3. Bot accesses data, runs DFT parser to refine data
4. Push metadata to Citrine
5. Ingest DFT data quality report
…
Example record:
{ "category": "system.chemical",
  "chemicalFormula": "MgO2",
  "properties": {
    "units": "eV", "name": "Band gap",
    "scalars": [ { "value": 7.8 } ] } }
Our datasets are discoverable through many tools
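The record on this slide can be assembled and serialized in a few lines. This is a minimal sketch that only reproduces the fields shown in the example; the helper name `band_gap_record` is hypothetical, and the parsing and upload steps a real bot would perform are not shown.

```python
import json


def band_gap_record(formula, band_gap_ev):
    """Build a record with the same fields as the example on the slide."""
    return {
        "category": "system.chemical",
        "chemicalFormula": formula,
        "properties": {
            "name": "Band gap",
            "units": "eV",
            "scalars": [{"value": band_gap_ev}],
        },
    }


# Values taken from the slide's example record.
record = band_gap_record("MgO2", 7.8)
print(json.dumps(record, indent=2))
```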
Part 2: A Materials Data Search Engine
Goal: Simplify finding useful data
Key Issue: Lack of metadata
Approaches:
1. Simplifying metadata capture from the source
2. Extracting useful information from datasets
Route 1: Integrating with LIMS/Workflow Tools
MAST
Materials Commons (MC)
T2C2 (4CeeD)
Engagement
• Build connections to international materials efforts and registries (e.g., NIMS, RDA, NIST, EUDAT, NDS)
• Promote IMaD data services, tools, and accomplishments to the community
• Develop video tutorials, webinars, and shared code repositories
• Interface with the Materials Accelerator Network (MAN)
• Engage with colleges, industry, and consortiums
  - (Wisconsin) Regional Materials and Manufacturing Network (RM2N)
  - (Illinois) Digital Manufacturing and Design Innovation Institute (DMDII)
  - (Michigan) LIFT consortium
Linking Software and Services
PIs: I. Foster¹,², J. Allison³, D. Morgan⁴, D. Trinkle⁵, P. Voorhees⁶
¹University of Chicago, ²Argonne National Laboratory, ³University of Michigan, ⁴University of Wisconsin-Madison, ⁵University of Illinois at Urbana-Champaign, ⁶Northwestern University
Overview
• NSF Midwest Big Data Spoke
Linking Data from Major Facilities
• Argonne Leadership Computing Facility (>1000 users/year)
  - Working with datasets that comprise ~300M core hours, with 200M more identified for the near term
  - New joint effort to roll out MDF-like capabilities to ALCF users
• Advanced Photon Source (>5000 users/year)
  - Building pipelines and procedures to index and publish data from 15 beamlines (~1/3 of the facility) in conjunction with the APS software team (Schwartz)
• Advanced Light Source (>2000 users/year)
  - Integration with CAMERA project and associated tomography beamlines
Working with user facilities to facilitate capturing data/metadata
Ripple: Home automation for research data
Ryan Chard, doi:10.1109/ICDCSW.2017.30
Procedure for automating tomography experiments:
1. At ALS: Detect new beamline data, and transfer it to NERSC
2. At NERSC: Submit, run jobs on Edison, transfer data back to ALS
3. At ALS: Create a shared endpoint, notify collaborators of result via email
Automate capturing results and metadata
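The tomography procedure above reads naturally as a set of "if this event, then that action" rules. The sketch below shows that pattern in miniature. It is not Ripple's interface: `Rule`, `dispatch`, the event names, and the file names are hypothetical, and each action only returns a description of the step it stands in for.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Rule:
    site: str                      # where the event is observed, e.g. "ALS"
    event: str                     # event type that triggers the rule
    action: Callable[[Dict], str]  # returns a description of what was done


def dispatch(rules, event):
    """Run every rule whose site and event type match `event`."""
    return [rule.action(event) for rule in rules
            if rule.site == event["site"] and rule.event == event["type"]]


# Placeholder actions mirroring the three stages on the slide.
rules = [
    Rule("ALS", "file_created", lambda e: f"transfer {e['path']} to NERSC"),
    Rule("NERSC", "file_created", lambda e: f"submit Edison job for {e['path']}"),
    Rule("NERSC", "job_finished", lambda e: f"transfer {e['path']} to ALS"),
    Rule("ALS", "result_arrived", lambda e: f"share {e['path']} and email collaborators"),
]

event = {"site": "ALS", "type": "file_created", "path": "scan_001.h5"}
print(dispatch(rules, event))  # ['transfer scan_001.h5 to NERSC']
```

Keeping each rule small and independent is what lets the stages run at different sites without a central script.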
Route 2: Deep Indexing Materials Data
MDF Index
- Data resources indexed: 116
- Records: >3.4M
- Repositories harvested: 6
  • MDF
  • NIST MML Repo
  • MATIN
  • Materials Commons
  • CXIDB
  • NIST Materials Resource Registry
- Made discoverable: ~200 datasets, ~260 TB
Adding More Metadata to NIST MatDL
Dataset As Published: Limited Metadata, Querying Difficult
Deep-Indexed into the MDF: Data Available Programmatically, Can be used for scripting
Another benefit: domain-specific querying
Example service possible with DFT data files
Answer questions like:
“Do we have any data about anatase-TiO2?”
“Who else has studied Li-MnO3 batteries with DFT?”
Input: Crystal Structure File (.cif, VASP, etc.)
Output: Entries from MDF that are structurally similar
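To make the first question on this slide concrete, here is a toy query over a handful of indexed records. It is a sketch under strong assumptions: the records, field names, and `find` helper are invented for illustration, and matching is by exact field value, whereas the service described on the slide would compare crystal structures.

```python
# Invented example records; a real index would be populated by parsing data files.
records = [
    {"source": "dataset-a", "formula": "TiO2", "phase": "anatase", "method": "DFT"},
    {"source": "dataset-b", "formula": "TiO2", "phase": "rutile", "method": "DFT"},
    {"source": "dataset-c", "formula": "Al", "phase": "fcc", "method": "DFT"},
]


def find(records, **criteria):
    """Return the records whose fields equal every given criterion."""
    return [r for r in records
            if all(r.get(key) == value for key, value in criteria.items())]


hits = find(records, formula="TiO2", phase="anatase")
print([r["source"] for r in hits])  # ['dataset-a']
```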
Skluma: A Statistical Learning Pipeline for Taming Unkempt Data Repositories
Tyler Skluzacek, doi:10.1145/3085504.3091116
Goal: Build intelligent search indexes with minimal human effort
Method: Employ machine learning to extract metadata from file repositories
- Classify data files
- Detect file types
Search Otherwise-Unusable Data Repositories
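The "detect file types" step can be pictured with a much simpler stand-in. The slide describes a machine-learning approach; the sketch below deliberately swaps in a few hand-written rules instead, so it shows the shape of the task and nothing about Skluma's actual method. The function name and category labels are hypothetical.

```python
import csv
import io
import json


def guess_file_type(text):
    """Guess a coarse type for a file's text content using simple rules."""
    stripped = text.strip()
    if not stripped:
        return "empty"
    try:
        # Only treat structured JSON (object or array) as JSON, not bare scalars.
        if isinstance(json.loads(stripped), (dict, list)):
            return "json"
    except ValueError:
        pass
    lines = stripped.splitlines()
    # Tabular if there are several lines and every row has the same number (>1) of fields.
    widths = {len(row) for row in csv.reader(io.StringIO(stripped))}
    if len(lines) > 1 and len(widths) == 1 and widths.pop() > 1:
        return "tabular"
    return "free text"


print(guess_file_type('{"a": 1}'))          # json
print(guess_file_type("x,y\n1,2\n3,4"))     # tabular
print(guess_file_type("notes on the run"))  # free text
```

Rules like these break down quickly on messy repositories, which is the motivation the slide gives for a learned classifier.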
APIs, Automation, and Examples
MDF Forge Python package (under development)
• Interface to MDF services
• Helper functions for common tasks
https://github.com/materials-data-facility/forge
Tools for using these capabilities will be available soon
COMPUTABLE DATA
Computable Data
Reproducing data-driven science should be trivial
It often is not. Common problems:
- Datasets, if available at all, lack documentation
- Algorithms/methods are not open sourced
- Models are rarely published
- Software installation/configuration requires expertise
Our goal: Simplify publishing data-driven science
- Storing software and models
- Integrating them with compute resources
Integrating analytics tools with MDF
From Data Repositories: MDF Data Publication, MATIN (GT), MML Repository (NIST), Materials Commons (UM PRISMS), Coherent X-Ray Tomography Database (LNL)
To Compute Resources: Jetstream, a self-provisioned, scalable science and engineering cloud environment operated by Indiana University for the National Science Foundation: jetstream-cloud.org
To End Users
MATIN (GT): ~10 datasets, used in education
Result: Scientists connected with data, analytics tools, and compute capability
Building a machine learning model using MDF
A simple web service to train ML forcefields
Example: Building force-field potentials from different datasets
Data resources: 3 DFT datasets with Aluminum data
1 dataset from khazana.uconn.edu, 2 datasets from materialsdata.nist.gov
Method: Botu et al. JPCC. (2017)
[Figure: Training Set and Holdout Set performance in three panels: Using only original data; Including Diffusion Dataset; Including 𝐷 + 𝑇# Dataset]
Result: Improved performance by integrating data sources
Better performance in original application: No new DFT calculations
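The data-handling side of this example, pooling entries from several datasets and keeping a holdout set aside, can be sketched without any machine learning code. The entries below are made-up placeholders, the function names are hypothetical, and the random split is an assumption: the slides do not say how the holdout set was chosen.

```python
import random


def combine(datasets):
    """Merge named datasets into one list, tagging each entry with its source."""
    return [dict(entry, source=name)
            for name, entries in datasets.items()
            for entry in entries]


def split_holdout(entries, holdout_fraction=0.2, seed=0):
    """Shuffle deterministically and split into (training, holdout) lists."""
    shuffled = list(entries)
    random.Random(seed).shuffle(shuffled)
    n_holdout = int(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]


# Placeholder entries; real ones would hold atomic environments and DFT forces.
datasets = {
    "original": [{"id": i} for i in range(10)],
    "diffusion": [{"id": i} for i in range(10, 15)],
}
train, holdout = split_holdout(combine(datasets))
print(len(train), len(holdout))  # 12 3
```

Tagging every entry with its source keeps provenance intact after pooling, so a model's errors can later be broken down by dataset.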
Reproducing data-driven MSE with MDF
• Summer Intern (Jiming Chen, UIUC) reproducing and extending materials and ML papers with the MDF
• Joined our team with the NSF WholeTale project
Users publish data to the MDF… and code to WholeTale
Long-term goals:
- Assemble community-driven resource for ML tools/examples
- Use MDF/WholeTale to create benchmark challenges
Replicating Ward et al. 2016
DLHub: Advancing Deep Learning Adoption
• Publish and share models and code linked with full training datasets
• Link database with HPC/Cloud computing resources
• Provide uniform interface for training, running models
INCREASING VALUE OF DATA
What is the MDF?
Data discovery service: Deep indexing, Query, Browse, Aggregate
Data publication service: Publish, Mint DOIs, Associate metadata
Databases, Datasets, APIs, LIMS, etc.
Distributed data storage
Data publication service
• Mechanisms to create and enforce schemas and logical collections
• Web UI to create datasets and manage curation and admin tasks
• Tools to automate publication process
• Each dataset record has a permanent landing page that the DOI links to
• The record shows some metadata and links to the rest
• Direct link to underlying files
• Download statistics
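Enforcing a schema at publication time can be as simple as refusing records with missing fields. In the sketch below the field names loosely follow DataCite's mandatory properties, flattened into a plain dictionary; this is an assumption for illustration and not the schema the MDF publication service actually uses. All values, including the DOI, are placeholders.

```python
REQUIRED_FIELDS = ("identifier", "creators", "titles", "publisher",
                   "publicationYear", "resourceType")


def missing_fields(record):
    """Return the required fields that are absent or empty in `record`."""
    return [name for name in REQUIRED_FIELDS if not record.get(name)]


# Placeholder values; the DOI is a made-up example, not a real dataset.
record = {
    "identifier": "10.5555/example-dataset",
    "creators": ["A. Researcher"],
    "titles": ["Example DFT dataset"],
    "publisher": "Materials Data Facility",
    "publicationYear": 2017,
    "resourceType": "Dataset",
}
print(missing_fields(record))                    # []
print(missing_fields({"titles": ["Untitled"]}))  # every field except 'titles'
```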
Published Data Highlights
~30 datasets, ~6.5 TB
MATIN (GT): ~10 datasets, used in education
• X-ray Scattering Image Classification Using Deep Learning (Yager et al.), http://dx.doi.org/10.18126/M2Z30Z
• Electron Backscattering and Diffraction Datasets for Ni, Mg, Fe, Si (Marc De Graef et al.)
• Phase Field Benchmark I Dataset (Jokisaari et al.)
• Grain Structure, Grain-averaged Lattice Strains, and Macro-scale Strain Data for Superelastic Nickel-Titanium Shape Memory Alloy Polycrystal Loaded in Tension (Paranjape et al.)
  - Largest dataset to date (>1.5 TB). Showcases MDF's unique capabilities and makes a unique dataset discoverable for code development, analysis, and benchmarking
Datasets Are Citable
Streamline & automate data publication
Data volumes: 12.5 TB, 12.4 TB out
Publication
- Authors: 94
- Institutions: 14
- Accesses: >1000
- Total datasets: 50
- CHiMaD datasets: 16
Pipeline
- CHiMaD datasets: +14
- Total datasets: +30
Advantages of Globus Publish
Capable of handling large datasets
- Publish data in place
- Integration with Globus Transfer/HTTPS
Deep indexing of materials-specific metadata
- Parse common materials data types
- Make data searchable at the file level
Automatically re-publishing data elsewhere
- Publishing dataset metadata to MRR, Google Scholar, etc.
- Sending fine-grained metadata to other databases (e.g., Citrine)
In Progress: Know how often your data is used
- Track when it is used in analytics tools
All of these capabilities increase the value of your data
Why create the MDF?
http://materialsdatafacility.org
1. Make your data shareable
Custom access control, using institution credentials
2. Make your data open
Access to >100TB of storage space
3. Make your data accessible
Search across distributed resources
Automatic, domain-specific metadata extraction
4. Make your data computable
Tight integration with computing resources
5. Make your data valuable
Citable with DOIs, measured with usage stats
Thanks to our sponsors!
U.S. Department of Energy
