Data Automation
at Light Sources
Ian Foster
Argonne National Laboratory & The University of Chicago
Advanced Photon Source
Argonne Leadership
Computing Facility
1 km
5 μsec
von Laszewski et al., Real-time
analysis, visualization, and
steering of microtomography
experiments at photon
sources, SIAM Parallel
Processing, 1999
I have been working with light sources for some time!
“the data rates and
compute power
required ... are
prodigious, easily
reaching one gigabit per second
and a teraflop per second [respectively]”
Ptychography: Use GPU cluster for 360x speedup,
from 7 hours to 72 s
[Deng, Vine, Chen, Nashed, Philips, Jin,
Peterka, Ross, Jacobsen]
→ Enable online analysis and use of fly scans
Microtomography: Use 32K Mira BG/Q nodes to
reduce reconstruction time from days to 2 mins
[Bicer, Gursoy, Kettimuthu, De Carlo, Agrawal]
→ Identify and correct experimental
misconfiguration
High-energy diffraction microscopy: 10K BG/Q
nodes to reconstruct in 10 minutes
[Sharma, Almer, Wozniak, Wilde, Foster]
→ Zoom in on crack locations (switch far field → near field)
Coherence
Brightness
High Energy
Micrometer porosity structure of shale samples
Microstructure of a copper wire, 0.2mm diameter
Work on high-speed analysis continues
We face a data crisis (and opportunity)
New instrumentation means that data rates
are growing much faster than Moore’s Law
→ Neither humans nor computers can cope by
using current methods
We need new methods for designing
experiments, managing data, analyzing data,
and creating and delivering software
→ “A knowledge-based society, connected by the
Internet and powered by AI …”
— Chen Chien-jen
How industry deals with scale
https://bit.ly/2l4gfgu
Automate and outsource:
(1) Data distribution
Needs: Usable, efficient, reliable,
secure, sustainable
Outsource:
(a) Petrel data store to hold data
prior to/during distribution
Petrel online store
petrel.alcf.anl.gov
94 Gbit/s Petrel—Blue Waters
2 petabytes
100 Gbps
(b) Globus service for data
transfer and sharing
Globus APIs
globus.org
Automate:
(c) DMagic script uses Globus
APIs to transfer data and
configure permissions
http://dmagic.readthedocs.io
Francesco de Carlo
Given an experiment date:
• Retrieve user info from APS scheduler
• Create Globus “shared endpoint” and
configure permissions
• Monitor directory at beamline and use
Globus to copy new files to endpoint
• Email link to shared endpoint for data
retrieval
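The monitoring step above can be sketched in a few lines. This is a minimal illustration, not the DMagic code: it assumes a polling loop over a local directory, and the `transfer` callable is a hypothetical stand-in for a Globus transfer request. The file names and the temporary directory are made up for the demonstration.

```python
import tempfile
from pathlib import Path
from typing import Callable, List, Set


def copy_new_files(directory: Path, seen: Set[str],
                   transfer: Callable[[Path], None]) -> List[str]:
    """One polling pass: hand every not-yet-seen file to `transfer`."""
    copied = []
    for path in sorted(directory.iterdir()):
        if path.is_file() and path.name not in seen:
            transfer(path)       # hypothetical stand-in for a Globus transfer
            seen.add(path.name)  # mark as handled only after the hand-off
            copied.append(path.name)
    return copied


def demo() -> List[List[str]]:
    """Simulate three polling passes over a temporary 'beamline' directory."""
    transferred: List[str] = []
    seen: Set[str] = set()
    passes = []
    with tempfile.TemporaryDirectory() as tmp:
        beamline = Path(tmp)
        (beamline / "scan_001.h5").write_bytes(b"a")
        (beamline / "scan_002.h5").write_bytes(b"b")
        passes.append(copy_new_files(beamline, seen,
                                     lambda p: transferred.append(p.name)))
        # A new file arrives between passes.
        (beamline / "scan_003.h5").write_bytes(b"c")
        passes.append(copy_new_files(beamline, seen,
                                     lambda p: transferred.append(p.name)))
        # Nothing new: the third pass copies nothing.
        passes.append(copy_new_files(beamline, seen,
                                     lambda p: transferred.append(p.name)))
    return passes


if __name__ == "__main__":
    print(demo())
```

Keeping the transfer behind a callable means the same loop can be exercised without network access and later pointed at a real transfer service.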
Automate and outsource:
(2) Publication and discovery
Move to permanent location
(or publish in place)
Compute and record checksums
Obtain and record metadata
Assign persistent identifier
Index for discovery
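As a rough illustration of the publication steps listed above, the sketch below computes SHA-256 checksums, stores them with user-supplied metadata, assigns an identifier, and supports a simple metadata lookup. It is an assumption-laden toy: the in-memory dictionary stands in for a real index, and the `local:` UUID is a placeholder, not a DOI or handle issued by any service.

```python
import hashlib
import uuid
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Checksum a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def publish(files: List[Path], metadata: Dict[str, str],
            index: Dict[str, dict]) -> str:
    """Record checksums and metadata under a new identifier and return it."""
    identifier = f"local:{uuid.uuid4()}"  # placeholder, NOT a real DOI or handle
    index[identifier] = {
        "metadata": dict(metadata),
        "checksums": {path.name: sha256_of(path) for path in files},
    }
    return identifier


def search(index: Dict[str, dict], key: str, value: str) -> List[str]:
    """Return identifiers of records whose metadata has key == value."""
    return [identifier for identifier, record in index.items()
            if record["metadata"].get(key) == value]
```

A caller would pass the dataset's files and a metadata dictionary to `publish`, keep the returned identifier, and later use `search` to find the record again.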
Data Publication
Indexing
materialsdatafacility.org
Programmatic access (REST, Python, Jupyter)
Web browse and search
Automate and outsource:
(3) End-to-end data pipelines
For each dataset, we must apply quality control, assign identifiers, move to
compute, extract features, and eventually publish to a public repository
Building a different custom pipeline for every situation is impractical
Automate: Trigger-action programming (“if this happens, then do that”)
Outsource: Cloud-based trigger-action service for reliability,
scalability, ease of use, security, sustainability
Automate and outsource:
(3) End-to-end pipelines with trigger-action programming
National Facility
Beamline Instrument
Local Storage and Compute
• Quality Control
• Assign Handle
• Email / SMS notification
Globus Transfer
Central Storage and Compute (CSC)
• Feature extraction
• Aggregate and convert format
Globus Transfer
Archive
• Set sharing ACLs
• Set timer for publication to Materials Data Facility
Data publication
Rules
• IF new files THEN run quality control scripts
• IF quality is good THEN send email and transfer data to CSC
• IF new files THEN run feature extraction
• IF feature detected THEN transfer data to archival storage
• IF time since ingest > 6 months THEN publish dataset to
Materials Data Facility
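The IF/THEN rules above can be read as a small event-driven program. The sketch below is one possible reading, not the service described in the talk: events are plain dictionaries, each rule pairs a condition with an action, and an action may emit follow-up events. The event fields (`type`, `site`, `score`), the `QUALITY_THRESHOLD` value, and the idea that a quality score arrives with the event are all assumptions made for illustration.

```python
from collections import deque
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Event = Dict[str, Any]


@dataclass
class Rule:
    name: str
    condition: Callable[[Event], bool]       # the IF part
    action: Callable[[Event], List[Event]]   # the THEN part; returns new events


def run_rules(rules: List[Rule], first_event: Event) -> List[str]:
    """Process events in arrival order; return names of the rules that fired."""
    fired = []
    queue = deque([first_event])
    while queue:
        event = queue.popleft()
        for rule in rules:
            if rule.condition(event):
                fired.append(rule.name)
                queue.extend(rule.action(event))
    return fired


QUALITY_THRESHOLD = 0.8  # hypothetical cut-off for "quality is good"

RULES = [
    Rule(
        name="run quality control",
        condition=lambda e: e["type"] == "new_files" and e["site"] == "local",
        action=lambda e: [{"type": "quality_checked", "site": "local",
                           "score": e["score"]}],
    ),
    Rule(
        name="send email and transfer data to CSC",
        condition=lambda e: (e["type"] == "quality_checked"
                             and e["score"] >= QUALITY_THRESHOLD),
        # The transfer makes the files appear as new files at the CSC.
        action=lambda e: [{"type": "new_files", "site": "csc",
                           "score": e["score"]}],
    ),
    Rule(
        name="run feature extraction",
        condition=lambda e: e["type"] == "new_files" and e["site"] == "csc",
        action=lambda e: [{"type": "features_extracted", "site": "csc"}],
    ),
]

if __name__ == "__main__":
    print(run_rules(RULES, {"type": "new_files", "site": "local", "score": 0.9}))
```

Because each rule only reacts to events, a low-quality dataset stops after the first rule, while a good one flows on to the central site without any pipeline-specific glue code.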
Automate and outsource:
(3) End-to-end pipelines with trigger-action programming
Another example: Mosaic tomography for neurocartography
(N. Kasthuri, R. Chard, et al.)
APS beamline 32-ID
Data Source
Collector Storage and Compute
• Capture dataset creation
• Review center position
ALCF Cooley Cluster
• Generate preview and center images
• Reconstruct image
• Extract metadata
ALCF Petrel Archive
• Ingest in Globus Search
• Set sharing ACLs
Data publication
Visualize with Neuroglancer
Rules
• IF new HDF5 files THEN transfer to Cooley
• IF new center_pos THEN initiate reconstruction
• IF transfer complete THEN execute preview and center finding
• IF results THEN return data to APS
• IF reconstruction THEN transfer data to Petrel AND publish dataset
globus.org
Automate and outsource:
(4) Data transformation and analysis
“beam misaligned”
“…”
Say you want to use a deep neural network for online identification
of problems when running diffraction experiments
https://doi.org/10.1109/NYSDS.2017.8085045
▪ Where are the model and trained weights?
▪ How do I run the model on my data?
▪ Should I run the model on my data?
▪ How can I retrain the model on new data?
DLHub
[“beam off image”, …]
model/xray/batch_predict
Data and Learning Hub (DLHub): Overview
• Collect, publish, categorize models/code/weights/data from many sources
• Serve models via API to foster sharing, consumption, and access to data,
training sets, and models
• Automate training of models (using HPC as needed) as new data are available
• Enable new science through reuse and synthesis of existing models
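To make the collect and serve ideas concrete, here is a toy in-memory registry. It is not the DLHub API: `ModelRegistry`, its method names, the `local:` identifier, and the threshold "model" are all hypothetical, chosen only to show how registering a model once lets others invoke it by name on a batch of inputs.

```python
import uuid
from typing import Callable, Dict, List, Sequence

Predictor = Callable[[Sequence[float]], str]


class ModelRegistry:
    """Toy stand-in for a model hub: register once, invoke by name."""

    def __init__(self) -> None:
        self._models: Dict[str, Predictor] = {}
        self._identifiers: Dict[str, str] = {}

    def register(self, name: str, predict: Predictor) -> str:
        """Store a model and return a placeholder identifier (not a real DOI)."""
        identifier = f"local:{uuid.uuid4()}"
        self._models[name] = predict
        self._identifiers[name] = identifier
        return identifier

    def batch_predict(self, name: str,
                      inputs: Sequence[Sequence[float]]) -> List[str]:
        """Run the named model on each input and collect the labels."""
        predict = self._models[name]
        return [predict(item) for item in inputs]


def toy_beam_check(pixels: Sequence[float]) -> str:
    """Hypothetical classifier: a near-dark image is labelled 'beam off'."""
    mean = sum(pixels) / len(pixels)
    return "beam off" if mean < 0.1 else "ok"


if __name__ == "__main__":
    registry = ModelRegistry()
    registry.register("xray", toy_beam_check)
    print(registry.batch_predict("xray", [[0.0, 0.0], [0.9, 0.8]]))
```

A real deployment would replace the threshold function with a trained network and the dictionary with a service, but the caller's view (a model name plus a batch of inputs) stays the same.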
Train, Collect, Serve
DLHub: Collect, serve, train community models
1) Register a model
Collect data → Train model → Register model → Send to DLHub → Receive DOI
2) Run a model
Find model → Call DLHub → Send compositions → Receive predicted properties
Model / transform containers
Invoke model on data
I reported on the work of many talented people
Ben Blaiszik, Steve Tuecke, Kyle Chard, Jim Pruyne, Logan Ward, Rachana Ananthakrishnan,
Ryan Chard, Mike Papka, Rick Wagner
Thanks also to:
• Jon Almer, Francesco de Carlo, Hemant Sharma, Brian Toby, Stefan Vogt, Stephen Streiffer,
Nicholas Schwarz, Doga Gursoy, and others, Advanced Photon Source
• Tekin Bicer, Jonathan Gaff, Raj Kettimuthu, Justin Wozniak, and others, Argonne Computing
We are grateful to our sponsors
DLHub Globus
IMaD
Petrel
Argonne Leadership
Computing Facility
In summary
More data demands new methods for designing experiments,
managing data, analyzing data, and creating and delivering software
We must automate and outsource to manage data, run pipelines,
and train and run (machine learning) models
I presented examples that illustrate what can be done:
• High-speed storage services for data staging and distribution: Petrel
• Cloud-based services for data transfer and sharing: Globus Transfer
• Data publication and discovery services: Materials Data Facility
• Cloud-based automation services: Globus Automate
• Model and transformation services to encapsulate software: DLHub
There are many opportunities, and great need, for collaboration
To follow up: foster@anl.gov
Data Automation at Light Sources

  • 1. Data Automation at Light Sources Ian Foster Argonne National Laboratory & The University of Chicago 1
  • 2. Advanced Photon Source Argonne Leadership Computing Facility 1 km 5 μsec 2
  • 3. von Laszewski et al., Real-time analysis, visualization, and steering of microtomography experiments at photon sources, SIAM Parallel Processing, 1999 I have been working with light sources for some time! “the data rates and compute power required ... are prodigious, easily reaching one gigabit per second and a teraflop per second [respectively]”
  • 4. Ptychography: Use GPU cluster for 360x speedup, from 7 hours to 72 s [Deng, Vine, Chen, Nashed, Philips, Jin, Peterka, Ross, Jacobsen]  Enable online analysis and use of fly scans Microtomography: Use 32K Mira BG/Q nodes to reduce reconstruction time from days to 2 mins [Bicer, Gursoy, Kettimuthu, De Carlo, Agrawal]  Identify and correct experimental misconfiguration High-energy diffraction microscopy: 10K BG/Q nodes to reconstruct in 10 minutes [Sharma, Almer, Wozniak, Wilde, Foster]  Zoom in on crack locations (switch far field  near field) Coherence Brightness High Energy Micrometer porosity structure of shale samples Microstructure of a copper wire, 0.2mm diameter Work on high-speed analysis continues
  • 5. We face a data crisis (and opportunity). New instrumentation means that data rates are growing much faster than Moore’s Law → Neither humans nor computers can cope by using current methods. We need new methods for designing experiments, managing data, analyzing data, and creating and delivering software → “A knowledge-based society, connected by the Internet and powered by AI …” (Chen Chien-jen)
  • 8. Automate and outsource: (1) Data distribution. Needs: Usable, efficient, reliable, secure, sustainable
  • 9. Automate and outsource: (1) Data distribution. Needs: Usable, efficient, reliable, secure, sustainable. Outsource: (a) Petrel data store to hold data prior to/during distribution. Petrel online store (petrel.alcf.anl.gov): 2 petabytes, 100 Gbps; Petrel to Blue Waters: 94 Gbit/s
  • 10. Automate and outsource: (1) Data distribution. Needs: Usable, efficient, reliable, secure, sustainable. Outsource: (a) Petrel data store to hold data prior to/during distribution; (b) Globus service for data transfer and sharing. [Diagram: 2 petabytes, 100 Gbps, Globus APIs]
  • 12. Automate and outsource: (1) Data distribution. Needs: Usable, efficient, reliable, secure, sustainable. Outsource: (a) Petrel data store to hold data prior to/during distribution; (b) Globus service for data transfer and sharing. Automate: (c) DMagic script uses Globus APIs to transfer data and configure permissions: http://dmagic.readthedocs.io (Francesco de Carlo). Given an experiment date: • Retrieve user info from APS scheduler • Create Globus “shared endpoint” and configure permissions • Monitor directory at beamline and use Globus to copy new files to endpoint • Email link to shared endpoint for data retrieval
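    The "monitor directory and copy new files" step on slide 12 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not DMagic's actual code: `find_new_files` and `sync_once` are hypothetical names, and the `transfer` callback is a stand-in for whatever call submits a file to Globus.

```python
from pathlib import Path
from typing import Callable, List, Set


def find_new_files(directory: Path, seen: Set[Path]) -> List[Path]:
    """Return files in `directory` that are not yet in `seen`, then mark them seen."""
    new_files = sorted(
        p for p in directory.iterdir() if p.is_file() and p not in seen
    )
    seen.update(new_files)
    return new_files


def sync_once(
    directory: Path, seen: Set[Path], transfer: Callable[[Path], None]
) -> List[Path]:
    """Run one polling pass: hand every newly detected file to `transfer`.

    `transfer` is a hypothetical callback; in a real deployment it would
    submit the file to a data transfer service.
    """
    new_files = find_new_files(directory, seen)
    for path in new_files:
        transfer(path)
    return new_files
```

    Calling `sync_once` on a timer (or from a file-system notification) with the same `seen` set means each file is handed to the transfer callback only once.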
  • 13. Automate and outsource: (2) Publication and discovery. Move to permanent location (or publish in place); compute and record checksums; obtain and record metadata; assign persistent identifier; index for discovery. [Diagram: 2 petabytes, 100 Gbps, Globus APIs]
  • 14. Automate and outsource: (2) Publication and discovery. Move to permanent location (or publish in place); compute and record checksums; obtain and record metadata; assign persistent identifier; index for discovery. [Diagram: Data Publication, Indexing, materialsdatafacility.org, 2 petabytes, 100 Gbps, Globus APIs]
  • 15. Automate and outsource: (2) Publication and discovery. Programmatic access (REST, Python, Jupyter); web browse and search. [Diagram: Data Publication, Indexing, materialsdatafacility.org, 2 petabytes, 100 Gbps, Globus APIs]
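    The "compute and record checksums" and "obtain and record metadata" steps from the publication slides can be illustrated with a small sketch. It assumes a dataset is a local directory and that the record is a plain dictionary; `sha256_of`, `build_manifest`, and the record layout are hypothetical and are not the Materials Data Facility's schema or API. The identifier is passed in by the caller, since assigning a persistent identifier is a service-side step.

```python
import hashlib
from pathlib import Path
from typing import Any, Dict


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(
    root: Path, metadata: Dict[str, Any], identifier: str
) -> Dict[str, Any]:
    """Build a publication record: identifier, metadata, and per-file checksums."""
    files = [
        {
            "path": p.relative_to(root).as_posix(),
            "bytes": p.stat().st_size,
            "sha256": sha256_of(p),
        }
        for p in sorted(root.rglob("*"))
        if p.is_file()
    ]
    return {"identifier": identifier, "metadata": metadata, "files": files}
```

    A record like this can be serialized with `json.dumps` and stored next to the data, so that a later reader can verify each file against its recorded checksum.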
  • 16. Automate and outsource: (3) End-to-end data pipelines. For each dataset, we must apply quality control, assign identifiers, move to compute, extract features, eventually publish to public repository, … Building a different custom pipeline for every situation is impractical
  • 17. Automate and outsource: (3) End-to-end data pipelines. For each dataset, we must apply quality control, assign identifiers, move to compute, extract features, eventually publish to public repository. Building a different custom pipeline for every situation is impractical. Automate: Trigger-action programming (“if this happens, then do that”). Outsource: Cloud-based trigger-action service for reliability, scalability, ease of use, security, sustainability
  • 18. Automate and outsource: (3) End-to-end pipelines with trigger-action programming. [Diagram: National Facility: Beamline Instrument; Local Storage and Compute (Quality Control, Assign Handle). Globus Transfer. Central Storage and Compute (CSC) (Feature extraction, Aggregate and convert format). Archive]
  • 19. Automate and outsource: (3) End-to-end pipelines with trigger-action programming. [Diagram: National Facility: Beamline Instrument; Local Storage and Compute (Quality Control, Assign Handle, Email / SMS notification). Globus Transfer. Central Storage and Compute (CSC) (Feature extraction, Aggregate and convert format). Archive] Rules (1): • IF new files THEN run quality control scripts • IF quality is good THEN send email and transfer data to CSC
  • 20. Automate and outsource: (3) End-to-end pipelines with trigger-action programming. [Diagram: National Facility: Beamline Instrument; Local Storage and Compute (Quality Control, Assign Handle, Email / SMS notification). Globus Transfer. Central Storage and Compute (CSC) (Feature extraction, Aggregate and convert format). Globus Transfer. Archive (Set sharing ACLs, Set timer for publication to Materials Data Facility). Data publication] Rules (1): • IF new files THEN run quality control scripts • IF quality is good THEN send email and transfer data to CSC. Rules (2): • IF new files THEN run feature extraction • IF feature detected THEN transfer data to archival storage • IF time since ingest > 6 months THEN publish dataset to Materials Data Facility
  • 21. Another example: Mosaic tomography for neurocartography (N. Kasthuri, R. Chard, et al.). [Diagram: Data Source: APS beamline 32-ID. Collector (Capture dataset creation, Review center position). Storage and Compute: ALCF Cooley Cluster (Generate preview and center images, Reconstruct image, Extract metadata). Data publication: ALCF Petrel Archive (Ingest in Globus Search, Set sharing ACLs); Visualize with Neuroglancer] Rules: • IF new HDF5 files THEN transfer to Cooley • IF new center_pos THEN initiate reconstruction • IF transfer complete THEN execute preview and center finding • IF results THEN return data to APS • IF reconstruction THEN transfer data to Petrel AND publish dataset
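    The IF/THEN rules on slides 19 to 21 follow a trigger-action pattern that can be sketched in a few lines. The sketch below assumes events are plain dictionaries and that actions only append to a log; `Rule`, `dispatch`, the event fields, and the rule names are hypothetical and do not represent the Globus Automate interface.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Event = Dict[str, Any]
log: List[str] = []  # stand-in for real side effects (scripts, email, transfers)


@dataclass
class Rule:
    name: str
    condition: Callable[[Event], bool]  # the IF part
    action: Callable[[Event], None]     # the THEN part


def dispatch(event: Event, rules: List[Rule]) -> List[str]:
    """Run the action of every rule whose condition matches; return their names."""
    fired = []
    for rule in rules:
        if rule.condition(event):
            rule.action(event)
            fired.append(rule.name)
    return fired


# Two rules modeled on slide 19 (event field names are assumptions).
RULES = [
    Rule(
        name="quality_control",
        condition=lambda e: e["type"] == "new_files",
        action=lambda e: log.append(f"run quality control on {e['path']}"),
    ),
    Rule(
        name="notify_and_transfer",
        condition=lambda e: e["type"] == "quality_result" and e["good"],
        action=lambda e: log.append(f"send email; transfer {e['path']} to CSC"),
    ),
]
```

    Keeping conditions and actions as separate callables is what allows one generic dispatcher to serve many pipelines: adding a stage means adding a rule, not writing a new pipeline.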
  • 23. Automate and outsource: (4) Data transformation and analysis “beam misaligned” “…” Say you want to use a deep neural network for online identification of problems when running diffraction experiments
  • 24. Automate and outsource: (4) Data transformation and analysis https://doi.org/10.1109/NYSDS.2017.8085045
  • 25. Automate and outsource: (4) Data transformation and analysis ▪ Where are the model and trained weights? ▪ How do I run the model on my data? ▪ Should I run the model on my data? ▪ How can I retrain the model on new data? https://doi.org/10.1109/NYSDS.2017.8085045
  • 26. Automate and outsource: (4) Data transformation and analysis ▪ Where are the model and trained weights? ▪ How do I run the model on my data? ▪ Should I run the model on my data? ▪ How can I retrain the model on new data? [Diagram: DLHub; model/xray/batch_predict; [“beam off image”, …]] https://doi.org/10.1109/NYSDS.2017.8085045
  • 28. Data and Learning Hub (DLHub): Overview • Collect, publish, categorize models/code/weights/data from many sources • Serve models via API to foster sharing, consumption, and access to data, training sets, and models • Automate training of models (using HPC as needed) as new data are available • Enable new science through reuse and synthesis of existing models. [Diagram: Collect, Serve, Train]
  • 29. DLHub: Collect, serve, train community models. 1) Register a model: Collect Data; Train Model; Register Model; Send to DLHub; Receive DOI. [Diagram: DLHub; Model / transform containers]
  • 30. DLHub: Collect, serve, train community models. 1) Register a model: Collect Data; Train Model; Register Model; Send to DLHub; Receive DOI. 2) Run a model: Collect Data; Find Model; Call DLHub; Send compositions; Receive predicted properties. [Diagram: DLHub; Model / transform containers]
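    The register-then-run flow on slides 29 and 30 can be mimicked with an in-memory stand-in. This is only a toy sketch of the pattern and not the DLHub API: `ModelRegistry`, its methods, the `local-model-N` identifiers (used in place of a DOI), and the example classifier with its threshold are all hypothetical.

```python
import itertools
from typing import Any, Callable, Dict, List, Sequence


class ModelRegistry:
    """Toy in-memory registry: register a callable model, then run it by name."""

    def __init__(self) -> None:
        self._models: Dict[str, Callable[[Any], Any]] = {}
        self._identifiers: Dict[str, str] = {}
        self._counter = itertools.count(1)

    def register(self, name: str, model: Callable[[Any], Any]) -> str:
        """Store the model and return a locally unique identifier for it."""
        identifier = f"local-model-{next(self._counter)}"
        self._models[name] = model
        self._identifiers[name] = identifier
        return identifier

    def run(self, name: str, inputs: Sequence[Any]) -> List[Any]:
        """Apply the named model to each input and return the predictions."""
        model = self._models[name]
        return [model(item) for item in inputs]


def classify_frame(pixels: Sequence[float]) -> str:
    """Hypothetical classifier: a nearly dark frame is labeled 'beam off'."""
    mean_intensity = sum(pixels) / len(pixels)
    return "beam off" if mean_intensity < 1.0 else "ok"


registry = ModelRegistry()
model_id = registry.register("example/frame_check", classify_frame)
predictions = registry.run("example/frame_check", [[0.0, 0.1], [50.0, 70.0]])
```

    Looking models up by name keeps callers independent of where the model code and weights live, which is the point of serving models through a shared hub.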
  • 34. I reported on the work of many talented people: Ben Blaiszik, Steve Tuecke, Kyle Chard, Jim Pruyne, Logan Ward, Rachana Ananthakrishnan, Ryan Chard, Mike Papka, Rick Wagner. Thanks also to: • Jon Almer, Francesco de Carlo, Hemant Sharma, Brian Toby, Stefan Vogt, Stephen Streiffer, Nicholas Schwarz, Doga Gursoy, and others, Advanced Photon Source • Tekin Bicer, Jonathan Gaff, Raj Kettimuthu, Justin Wozniak, and others, Argonne Computing. We are grateful to our sponsors. DLHub; Globus; IMaD; Petrel; Argonne Leadership Computing Facility
  • 35. In summary: More data demands new methods for designing experiments, managing data, analyzing data, and creating and delivering software. We must automate and outsource to manage data, run pipelines, and train and run (machine learning) models. I presented examples that illustrate what can be done: • High-speed storage services for data staging and distribution: Petrel • Cloud-based services for data transfer and sharing: Globus Transfer • Data publication and discovery services: Materials Data Facility • Cloud-based automation services: Globus Automate • Model and transformation services to encapsulate software: DLHub. There are many opportunities, and great need, for collaboration. To follow up: [email protected]