Converting Scripts into Reproducible
Workflow Research Objects
Lucas A. M. C. Carvalho, Khalid Belhajjame, Claudia Bauzer Medeiros
lucas.carvalho@ic.unicamp.br
Baltimore, Maryland, USA
October 23-26, 2016
4
Background and Motivation
● Data-Intensive Experiments
– Collection of scripts, programs and (big) data
– Manual collection and organization of data provenance
– Papers
● How to understand, reproduce or reuse data and models of
experiments?
5
Background and Motivation
● Script-based experiments
– What are the inputs and outputs?
– How to replace this local program with a similar web service?
– Difficult to understand, to reuse, and to reproduce.
Example of script code.
6
Background and Motivation
● Scientific Workflows
Example of Scientific Workflow Management System.
9
Overview
● Goals: Create, Understand, Reuse, Reproduce
● Methodology: Step 1, Step 2, Step 3, Step 4, Step 5
10
Related Work
● Script-language specific.
● Workflow-engine specific.
● A new language is needed.
● Outcome is not an executable workflow.
● Do not collect provenance data of the
conversion process.
11
Two Kinds of Experts
● Scientists:
– Domain experts who understand the experiment and
the script (sometimes called users).
● Curators:
– Scientists who are also familiar with workflow and
script programming; or
– Computer scientists who are familiar enough with the
domain to be able to implement our methodology.
– Responsible for authoring, documenting and
publishing workflows and associated resources.
12
Requirements
● Requirement 1: Produce a workflow-like view of the script.
● Requirement 2: Create an executable workflow and compare
the execution of the workflow and the script.
● Requirement 3: Modify the workflow resources.
● Requirement 4: Record provenance data.
● Requirement 5: Aggregate all resources to support
Reproducibility and Reuse.
13
Requirements
● Requirement 1: Produce a workflow-like view of the script.
Figure: a script-based experiment shown as an abstract workflow, in which activities (Activity 1, Activity 2, ..., Activity n) are connected through ports (Port 1, Port 2, Port 3, ..., Port n).
14
Requirements
● Requirement 2: Create an executable workflow and compare
the execution of the workflow and the script.
Figure: executable workflow and script-based experiment.
15
Requirements
● Requirement 3: Modify the workflow resources.
Figure: workflow modifications in two panels, (a) and (b); labels shown: Local, Algorithm A, Algorithm B.
16
Requirements
● Requirement 4: Record provenance data.
Figure: provenance graph with nodes Activity 1, Activity 2, Sample, Output 1, Output 2, Lucas, Workflow Run and the timestamp “2012-06-01”, linked by the relations used, wasGeneratedBy, wasStartedAt and wasAssociatedWith.
17
Requirements
● Requirement 5: Aggregate all resources to support
Reproducibility and Reuse.
– Scripts
– Data
– Abstract workflows
– Concrete workflows
– Annotations
– Provenance
– Papers and Reports
– Authors
18
Methodology
● Step 1: Generate Abstract Workflow (from the script to the abstract workflow)
● Step 2: Create an executable workflow (from the abstract workflow to the concrete workflow)
● Step 3: Refine workflow
● Step 4: Annotate and check quality
● Step 5: Bundle Resources into a Research Object
19
Workflow Research Object (WRO)
● Research Objects are
semantically rich
aggregations of resources
that bring together data,
methods and people in
scientific investigations.
● WROs encapsulate scientific
workflows and additional
information regarding their
context and resources.
Research Object Model
20
Running Example
● Molecular Dynamics Simulations
– Used in many branches of materials science, computational
engineering, physics and chemistry.
– Scripts (shell script), programs (NAMD, VMD, Fortran)
– Phases: set up, simulation and analysis of trajectories.
– Inputs: protein structure, simulation parameters and
force field files.
– Output: trajectories and analysis results.
23
Step 1: Generate Abstract Workflow
● Manually annotate the script code, producing annotated script code.
● Create a workflow-like view of the annotated script code: the abstract workflow.
24
Step 1: Generate Abstract Workflow
● YesWorkflow (McPhillips et al., 2015)
– Annotations are written as code comments.
– They mark code blocks and their inputs/outputs.
– Tags: @begin, @end, @desc, @in, @out, ...
● Create a workflow-like view: from the annotated script code to the abstract workflow.
T. McPhillips et al., “YesWorkflow: A user-oriented, language-independent tool for recovering workflow information from scripts,” International Journal of Digital Curation, vol. 10, no. 1, pp. 298–313, 2015.
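To make the annotation idea concrete, here is a minimal sketch in Python of how comment tags such as @begin, @end, @desc, @in and @out could be read back into a workflow-like view. It is not the YesWorkflow tool and does not use its API; the annotated script, the block names (`setup`, `simulate`), the port names and the helper functions are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical shell script annotated with YesWorkflow-style comment tags.
# Block and port names are illustrative, not taken from the slides.
ANNOTATED_SCRIPT = """\
# @begin setup
# @desc Prepare the system for simulation
# @in structure
# @out prepared_system
echo "preparing"
# @end setup
# @begin simulate
# @desc Run the simulation
# @in prepared_system
# @in parameters
# @out trajectory
echo "simulating"
# @end simulate
"""

# Matches a comment line that carries exactly one of the supported tags.
TAG = re.compile(r"#\s*@(begin|end|in|out|desc)\b\s*(.*)")


def extract_blocks(script: str) -> dict:
    """Collect the description, inputs and outputs declared for each block."""
    blocks = {}
    current = None
    for line in script.splitlines():
        match = TAG.match(line.strip())
        if not match:
            continue  # ordinary code line, not an annotation
        tag, value = match.group(1), match.group(2).strip()
        if tag == "begin":
            current = value
            blocks[current] = {"desc": "", "in": [], "out": []}
        elif tag == "end":
            current = None
        elif current is not None:
            if tag == "desc":
                blocks[current]["desc"] = value
            else:
                blocks[current][tag].append(value)
    return blocks


def data_flow_edges(blocks: dict) -> list:
    """Link each output port to every block that declares it as an input."""
    edges = []
    for producer, spec in blocks.items():
        for port in spec["out"]:
            for consumer, other in blocks.items():
                if port in other["in"]:
                    edges.append((producer, port, consumer))
    return edges


blocks = extract_blocks(ANNOTATED_SCRIPT)
edges = data_flow_edges(blocks)
```

The resulting `blocks` and `edges` play the role of the abstract workflow: activities with named ports and the data links between them, recovered without executing the script.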
29
Step 2: Create an executable workflow
● Create an implementation of the activities in the abstract workflow.
● Copy code blocks from the script into these implementations.
Figure: script code, abstract workflow and executable workflow.
31
Step 3: Refine executable workflow
● Create a new version of the executable workflow.
● Modify resources:
– Algorithms
– Data Sets
– Parallelization
– Web Services
– ...
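As an illustration of refinement, the following Python sketch creates a new workflow version by replacing the implementation of one activity while leaving the earlier version untouched. The dictionary layout, the `new_version` helper and the two `analyze_*` functions are assumptions made for this example only; they do not come from the slides or from any workflow system.

```python
from typing import Callable, Dict


def analyze_local(trajectory: str) -> str:
    # Stand-in for a locally installed analysis program.
    return f"local-analysis({trajectory})"


def analyze_alternative(trajectory: str) -> str:
    # Stand-in for a replacement resource, e.g. another algorithm or a web service.
    return f"alternative-analysis({trajectory})"


def new_version(workflow: Dict, activity: str, implementation: Callable) -> Dict:
    """Return a new workflow version with one activity replaced.

    The input version is not modified, so both versions can be kept.
    """
    if activity not in workflow["activities"]:
        raise KeyError(f"unknown activity: {activity}")
    activities = dict(workflow["activities"])  # shallow copy, original untouched
    activities[activity] = implementation
    return {
        "version": workflow["version"] + 1,
        "derived_from": workflow["version"],
        "activities": activities,
    }


v1 = {"version": 1, "derived_from": None, "activities": {"analyze": analyze_local}}
v2 = new_version(v1, "analyze", analyze_alternative)
```

Keeping `derived_from` on each version gives the provenance of the refinement a place to live, which later steps rely on.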
32
Steps 2 and 3
Record provenance data: execution traces.
Figure (W3C PROV): provenance graph of a run of the executable workflow, with nodes split, psgen, Sample, Output 1, Output 2, Lucas, Workflow Run and the timestamp “2012-06-01”, linked by the relations used, wasGeneratedBy, wasStartedAt, wasAssociatedWith, wasEnactedBy and hasSpecification.
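A minimal sketch of recording an execution trace, written in Python with plain tuples rather than a PROV serialization or library. The relation names are copied from the labels on the slide; the `Trace` class, the `split` implementation and the identifiers `workflow_run` and `curator` are hypothetical.

```python
from datetime import datetime, timezone


class Trace:
    """Collects (subject, relation, object) statements for one workflow run."""

    def __init__(self, run_id: str, agent: str):
        self.statements = [(run_id, "wasAssociatedWith", agent)]

    def record(self, activity: str, func, inputs: dict) -> dict:
        """Run one activity and record what it used and what it generated."""
        started = datetime.now(timezone.utc).isoformat()
        self.statements.append((activity, "wasStartedAt", started))
        for name in inputs:
            self.statements.append((activity, "used", name))
        outputs = func(**inputs)
        for name in outputs:
            self.statements.append((name, "wasGeneratedBy", activity))
        return outputs


def split(sample: str) -> dict:
    # Illustrative activity: cut the sample into two halves.
    half = len(sample) // 2
    return {"output_1": sample[:half], "output_2": sample[half:]}


trace = Trace("workflow_run", "curator")
outputs = trace.record("split", split, {"sample": "ABCDEF"})
```

Each call to `record` wraps one activity, so the trace grows as the workflow executes and can later be exported to whatever provenance format the curator uses.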
33
Steps 2 and 3
Record provenance data: conversion process.
Figure (W3C PROV): the script code, the executable workflow and the new workflow version are linked through wasDerivedFrom relations, and the conversion wasAssociatedWith the Curator.
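Provenance of the conversion process can be queried to trace any artifact back to the script it came from. The Python sketch below assumes one possible set of wasDerivedFrom links; the exact links, the artifact identifiers and the `lineage` helper are assumptions for illustration, not a transcription of the figure.

```python
from typing import Dict, List

# Assumed derivation records: each artifact maps to the artifact it was derived from.
WAS_DERIVED_FROM: Dict[str, str] = {
    "abstract_workflow": "script",
    "executable_workflow": "abstract_workflow",
    "workflow_v2": "executable_workflow",
}


def lineage(artifact: str, derived_from: Dict[str, str]) -> List[str]:
    """Follow wasDerivedFrom links from an artifact back to its origin."""
    chain = [artifact]
    seen = {artifact}
    while artifact in derived_from:
        artifact = derived_from[artifact]
        if artifact in seen:
            raise ValueError("cycle in derivation records")
        chain.append(artifact)
        seen.add(artifact)
    return chain


history = lineage("workflow_v2", WAS_DERIVED_FROM)
```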
34
Step 4: Annotate and check quality
● Annotations describing the workflow.
● Use provenance data
– To check the quality of the conversion process.
● Run checks to verify the soundness of the
workflow.
35
Step 4: Annotate and check quality
Figure: script code and executable workflow.
36
Step 4: Annotate and check quality
Figure: workflow version and initial executable workflow.
37
Step 4: Annotate and check quality
● Common mistakes during the conversion:
– failing to clearly identify the main logical processing
units in the script;
– introducing an error when migrating script code into the
corresponding activity;
– not providing the correct input files and parameters;
– errors in the coding of the workflow itself.
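One simple way to catch such mistakes is to compare what the script produced with what the workflow produced. The Python sketch below assumes that outputs are available as bytes keyed by name and that byte-for-byte equality is a meaningful test; for non-deterministic simulations a tolerance-based comparison would be needed instead. The function names and the sample data are hypothetical.

```python
import hashlib
from typing import Dict


def digest(data: bytes) -> str:
    """SHA-256 hex digest used as a compact fingerprint of an output."""
    return hashlib.sha256(data).hexdigest()


def compare_outputs(script_outputs: Dict[str, bytes],
                    workflow_outputs: Dict[str, bytes]) -> Dict[str, str]:
    """Classify each script output as 'match', 'differs' or 'missing'."""
    report = {}
    for name, data in script_outputs.items():
        if name not in workflow_outputs:
            report[name] = "missing"
        elif digest(data) == digest(workflow_outputs[name]):
            report[name] = "match"
        else:
            report[name] = "differs"
    return report


report = compare_outputs(
    {"trajectory": b"frames-1-100", "analysis": b"rmsd=1.2", "log": b"done"},
    {"trajectory": b"frames-1-100", "analysis": b"rmsd=1.3"},
)
```

A "differs" entry points at a migration or parameter error in the activity that generated that output, and a "missing" entry suggests a processing unit that was not identified during annotation.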
38
Step 5: Bundle Resources into a Research Object
● Script
● Abstract workflow
● Concrete workflow(s)
● Annotations
● Paper
● Provenance
● Data
● Attributions
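The bundling step can be approximated in Python with a ZIP archive that holds the resources together with a manifest listing them. This is a deliberately simplified sketch: the manifest layout, the file paths and the `build_bundle` helper are assumptions for illustration and do not follow the Research Object Bundle specification.

```python
import io
import json
import zipfile
from typing import Dict, Tuple


def build_bundle(resources: Dict[str, Tuple[str, str]]) -> bytes:
    """Pack resources and a manifest that lists them into a ZIP archive.

    `resources` maps an archive path to a (role, content) pair.
    """
    manifest = {
        "aggregates": [
            {"path": path, "role": role}
            for path, (role, _content) in sorted(resources.items())
        ]
    }
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("manifest.json", json.dumps(manifest, indent=2))
        for path, (_role, content) in resources.items():
            archive.writestr(path, content)
    return buffer.getvalue()


bundle = build_bundle({
    "script/run.sh": ("script", "echo simulate\n"),
    "workflow/abstract.txt": ("abstract workflow", "setup -> simulate\n"),
    "provenance/trace.txt": ("provenance", "split used sample\n"),
})

# Read the archive back to confirm what was aggregated.
with zipfile.ZipFile(io.BytesIO(bundle)) as archive:
    names = sorted(archive.namelist())
    manifest = json.loads(archive.read("manifest.json"))
```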
39
Contributions
● A methodology that guides curators in a
principled manner to transform scripts into
reproducible and reusable WROs.
● This addresses an important issue in the area
of script provenance.
40
Conclusions
● We addressed issues with respect to understanding, reuse and
reproducibility of script-based experiments.
● The methodology was:
– elaborated based on requirements;
– showcased via a real-world use case from the field of Molecular
Dynamics.
● We exploited tools and standards from the scientific
community:
– Scientific Workflows, YesWorkflow, Research Objects, the W3C
PROV recommendations and the Web Annotation Data Model.
● The bundle is available at http://w3id.org/w2share/s2rwro/
41
Next Steps
● Evaluation using other case studies;
● Evaluation of the cost-effectiveness of
our methodology;
● Extension of YesWorkflow to support the
semantic annotation of blocks;
● Implementation of tools.
42
Acknowledgments
● FAPESP (grant # 2014/23861-4)
● CCES/CEPID (grant # 2013/08293-7)
– Center for Computational Engineering & Sciences
● LIS (Laboratory of Information Systems)
● Prof. Munir Skaf and his group from the Institute of
Chemistry, Unicamp.
Converting Scripts into Reproducible
Workflow Research Objects
Lucas A. M. C. Carvalho, Khalid Belhajjame, Claudia Bauzer Medeiros
lucas.carvalho@ic.unicamp.br
Baltimore, Maryland, USA
October 23-26, 2016
Ad

Recommended

A Sightseeing Tour of Prov and Some of its Extensions
A Sightseeing Tour of Prov and Some of its Extensions
Khalid Belhajjame
 
FAIR Workflows: A step closer to the Scientific Paper of the Future
FAIR Workflows: A step closer to the Scientific Paper of the Future
dgarijo
 
BDE SC1 Workshop 3 - Open PHACTS Pilot (Kiera McNeice)
BDE SC1 Workshop 3 - Open PHACTS Pilot (Kiera McNeice)
BigData_Europe
 
Towards Automating Data Narratives
Towards Automating Data Narratives
dgarijo
 
Overview of OSLC - INCOSE IW 2018 MBSE Workshop
Overview of OSLC - INCOSE IW 2018 MBSE Workshop
Axel Reichwein
 
New PID developments
New PID developments
OpenAIRE
 
Enabling the digital thread using open OSLC standards
Enabling the digital thread using open OSLC standards
Axel Reichwein
 
Role of PIDs in connecting scholarly works
Role of PIDs in connecting scholarly works
OpenAIRE
 
The CIARD RINGValeri
The CIARD RINGValeri
CIARD Movement
 
FOOPS!: An Ontology Pitfall Scanner for the FAIR principles
FOOPS!: An Ontology Pitfall Scanner for the FAIR principles
dgarijo
 
Open Services for Lifecycle Collaboration (OSLC) - Extending REST APIs to Con...
Open Services for Lifecycle Collaboration (OSLC) - Extending REST APIs to Con...
Axel Reichwein
 
How OpenAIRE uses persistent identifiers for discovery, enrichment, and linki...
How OpenAIRE uses persistent identifiers for discovery, enrichment, and linki...
OpenAIRE
 
Improving the chemistry content of Wikipedia using workflow tools
Improving the chemistry content of Wikipedia using workflow tools
Mitch Miller
 
Achieving the digital thread through PLM and ALM integration using oslc
Achieving the digital thread through PLM and ALM integration using oslc
Axel Reichwein
 
Open Services for Lifecycle Collaboration (OSLC)
Open Services for Lifecycle Collaboration (OSLC)
Axel Reichwein
 
Research Object Community Update
Research Object Community Update
Carole Goble
 
Research Object Composer: A Tool for Publishing Complex Data Objects in the C...
Research Object Composer: A Tool for Publishing Complex Data Objects in the C...
Anita de Waard
 
OKG-Soft: An Open Knowledge Graph With Mathine Readable Scientific Software M...
OKG-Soft: An Open Knowledge Graph With Mathine Readable Scientific Software M...
dgarijo
 
SOMEF: a metadata extraction framework from software documentation
SOMEF: a metadata extraction framework from software documentation
dgarijo
 
Introduction to Open Services for Lifecycle Collaboration (OSLC)
Introduction to Open Services for Lifecycle Collaboration (OSLC)
Axel Reichwein
 
Coming to terms to FAIR semantics
Coming to terms to FAIR semantics
María Poveda Villalón
 
A Template-Based Approach for Annotating Long-Tailed Datasets
A Template-Based Approach for Annotating Long-Tailed Datasets
dgarijo
 
FAIRness through a novel combination of Web technologies
FAIRness through a novel combination of Web technologies
Research Data Alliance
 
Standard Web APIs for Multidisciplinary Collaboration
Standard Web APIs for Multidisciplinary Collaboration
Axel Reichwein
 
2017 06-01-eswc2017-ug
2017 06-01-eswc2017-ug
Monika Solanki
 
Towards Knowledge Graphs of Reusable Research Software Metadata
Towards Knowledge Graphs of Reusable Research Software Metadata
dgarijo
 
OSLC & The Future of Interoperability
OSLC & The Future of Interoperability
Koneksys
 
Implementing chemistry platform for OpenPHACTS
Implementing chemistry platform for OpenPHACTS
Valery Tkachenko
 
元OracleMasterPlatinumがCloudSpanner触ってみた
元OracleMasterPlatinumがCloudSpanner触ってみた
Kumano Ryo
 
2017年春のPerl
2017年春のPerl
charsbar
 

More Related Content

What's hot (20)

The CIARD RINGValeri
The CIARD RINGValeri
CIARD Movement
 
FOOPS!: An Ontology Pitfall Scanner for the FAIR principles
FOOPS!: An Ontology Pitfall Scanner for the FAIR principles
dgarijo
 
Open Services for Lifecycle Collaboration (OSLC) - Extending REST APIs to Con...
Open Services for Lifecycle Collaboration (OSLC) - Extending REST APIs to Con...
Axel Reichwein
 
How OpenAIRE uses persistent identifiers for discovery, enrichment, and linki...
How OpenAIRE uses persistent identifiers for discovery, enrichment, and linki...
OpenAIRE
 
Improving the chemistry content of Wikipedia using workflow tools
Improving the chemistry content of Wikipedia using workflow tools
Mitch Miller
 
Achieving the digital thread through PLM and ALM integration using oslc
Achieving the digital thread through PLM and ALM integration using oslc
Axel Reichwein
 
Open Services for Lifecycle Collaboration (OSLC)
Open Services for Lifecycle Collaboration (OSLC)
Axel Reichwein
 
Research Object Community Update
Research Object Community Update
Carole Goble
 
Research Object Composer: A Tool for Publishing Complex Data Objects in the C...
Research Object Composer: A Tool for Publishing Complex Data Objects in the C...
Anita de Waard
 
OKG-Soft: An Open Knowledge Graph With Mathine Readable Scientific Software M...
OKG-Soft: An Open Knowledge Graph With Mathine Readable Scientific Software M...
dgarijo
 
SOMEF: a metadata extraction framework from software documentation
SOMEF: a metadata extraction framework from software documentation
dgarijo
 
Introduction to Open Services for Lifecycle Collaboration (OSLC)
Introduction to Open Services for Lifecycle Collaboration (OSLC)
Axel Reichwein
 
Coming to terms to FAIR semantics
Coming to terms to FAIR semantics
María Poveda Villalón
 
A Template-Based Approach for Annotating Long-Tailed Datasets
A Template-Based Approach for Annotating Long-Tailed Datasets
dgarijo
 
FAIRness through a novel combination of Web technologies
FAIRness through a novel combination of Web technologies
Research Data Alliance
 
Standard Web APIs for Multidisciplinary Collaboration
Standard Web APIs for Multidisciplinary Collaboration
Axel Reichwein
 
2017 06-01-eswc2017-ug
2017 06-01-eswc2017-ug
Monika Solanki
 
Towards Knowledge Graphs of Reusable Research Software Metadata
Towards Knowledge Graphs of Reusable Research Software Metadata
dgarijo
 
OSLC & The Future of Interoperability
OSLC & The Future of Interoperability
Koneksys
 
Implementing chemistry platform for OpenPHACTS
Implementing chemistry platform for OpenPHACTS
Valery Tkachenko
 
FOOPS!: An Ontology Pitfall Scanner for the FAIR principles
FOOPS!: An Ontology Pitfall Scanner for the FAIR principles
dgarijo
 
Open Services for Lifecycle Collaboration (OSLC) - Extending REST APIs to Con...
Open Services for Lifecycle Collaboration (OSLC) - Extending REST APIs to Con...
Axel Reichwein
 
How OpenAIRE uses persistent identifiers for discovery, enrichment, and linki...
How OpenAIRE uses persistent identifiers for discovery, enrichment, and linki...
OpenAIRE
 
Improving the chemistry content of Wikipedia using workflow tools
Improving the chemistry content of Wikipedia using workflow tools
Mitch Miller
 
Achieving the digital thread through PLM and ALM integration using oslc
Achieving the digital thread through PLM and ALM integration using oslc
Axel Reichwein
 
Open Services for Lifecycle Collaboration (OSLC)
Open Services for Lifecycle Collaboration (OSLC)
Axel Reichwein
 
Research Object Community Update
Research Object Community Update
Carole Goble
 
Research Object Composer: A Tool for Publishing Complex Data Objects in the C...
Research Object Composer: A Tool for Publishing Complex Data Objects in the C...
Anita de Waard
 
OKG-Soft: An Open Knowledge Graph With Mathine Readable Scientific Software M...
OKG-Soft: An Open Knowledge Graph With Mathine Readable Scientific Software M...
dgarijo
 
SOMEF: a metadata extraction framework from software documentation
SOMEF: a metadata extraction framework from software documentation
dgarijo
 
Introduction to Open Services for Lifecycle Collaboration (OSLC)
Introduction to Open Services for Lifecycle Collaboration (OSLC)
Axel Reichwein
 
A Template-Based Approach for Annotating Long-Tailed Datasets
A Template-Based Approach for Annotating Long-Tailed Datasets
dgarijo
 
FAIRness through a novel combination of Web technologies
FAIRness through a novel combination of Web technologies
Research Data Alliance
 
Standard Web APIs for Multidisciplinary Collaboration
Standard Web APIs for Multidisciplinary Collaboration
Axel Reichwein
 
2017 06-01-eswc2017-ug
2017 06-01-eswc2017-ug
Monika Solanki
 
Towards Knowledge Graphs of Reusable Research Software Metadata
Towards Knowledge Graphs of Reusable Research Software Metadata
dgarijo
 
OSLC & The Future of Interoperability
OSLC & The Future of Interoperability
Koneksys
 
Implementing chemistry platform for OpenPHACTS
Implementing chemistry platform for OpenPHACTS
Valery Tkachenko
 

Viewers also liked (19)

元OracleMasterPlatinumがCloudSpanner触ってみた
元OracleMasterPlatinumがCloudSpanner触ってみた
Kumano Ryo
 
2017年春のPerl
2017年春のPerl
charsbar
 
Blenderで学ぶスカルプト勉強会
Blenderで学ぶスカルプト勉強会
kurosaurus
 
The Next Generation of AI and Deep Learning - GTC17
The Next Generation of AI and Deep Learning - GTC17
NVIDIA
 
55 New Features in JDK 9
55 New Features in JDK 9
Simon Ritter
 
Technology Vision 2017 - Overview
Technology Vision 2017 - Overview
Accenture Technology
 
15 Tips for Compelling Company Updates on LinkedIn
15 Tips for Compelling Company Updates on LinkedIn
LinkedIn
 
The Marketer's Guide To Customer Interviews
The Marketer's Guide To Customer Interviews
Good Funnel
 
The Be-All, End-All List of Small Business Tax Deductions
The Be-All, End-All List of Small Business Tax Deductions
Wagepoint
 
ELSA France "Teaching is us!"
ELSA France "Teaching is us!"
Adrian Scarlett
 
Delhi State Report - February 2017
Delhi State Report - February 2017
India Brand Equity Foundation
 
Sharing fängt mit geben an - Interview im HR Performance Magazin
Sharing fängt mit geben an - Interview im HR Performance Magazin
Harald Schirmer
 
Introduction to computer virus
Introduction to computer virus
YouQue ™
 
Powering of bangladesh- Vision 2021
Powering of bangladesh- Vision 2021
Mukhlasur Rahman
 
Retrato do Brasil 2015
Retrato do Brasil 2015
Talita Vasconcelos
 
Nuts and Bolts of GMOs - Harold Trick
Nuts and Bolts of GMOs - Harold Trick
Mary-Katherine Kearney
 
Luchtvaartmaatschappijen: al miljoen euro Brusselse boetes
Luchtvaartmaatschappijen: al miljoen euro Brusselse boetes
Thierry Debels
 
Sustainability Day Leeds 2017
Sustainability Day Leeds 2017
4 All of Us
 
Gramatrix: El despertar (I)
Gramatrix: El despertar (I)
guest020711
 
元OracleMasterPlatinumがCloudSpanner触ってみた
元OracleMasterPlatinumがCloudSpanner触ってみた
Kumano Ryo
 
2017年春のPerl
2017年春のPerl
charsbar
 
Blenderで学ぶスカルプト勉強会
Blenderで学ぶスカルプト勉強会
kurosaurus
 
The Next Generation of AI and Deep Learning - GTC17
The Next Generation of AI and Deep Learning - GTC17
NVIDIA
 
55 New Features in JDK 9
55 New Features in JDK 9
Simon Ritter
 
15 Tips for Compelling Company Updates on LinkedIn
15 Tips for Compelling Company Updates on LinkedIn
LinkedIn
 
The Marketer's Guide To Customer Interviews
The Marketer's Guide To Customer Interviews
Good Funnel
 
The Be-All, End-All List of Small Business Tax Deductions
The Be-All, End-All List of Small Business Tax Deductions
Wagepoint
 
ELSA France "Teaching is us!"
ELSA France "Teaching is us!"
Adrian Scarlett
 
Sharing fängt mit geben an - Interview im HR Performance Magazin
Sharing fängt mit geben an - Interview im HR Performance Magazin
Harald Schirmer
 
Introduction to computer virus
Introduction to computer virus
YouQue ™
 
Powering of bangladesh- Vision 2021
Powering of bangladesh- Vision 2021
Mukhlasur Rahman
 
Luchtvaartmaatschappijen: al miljoen euro Brusselse boetes
Luchtvaartmaatschappijen: al miljoen euro Brusselse boetes
Thierry Debels
 
Sustainability Day Leeds 2017
Sustainability Day Leeds 2017
4 All of Us
 
Gramatrix: El despertar (I)
Gramatrix: El despertar (I)
guest020711
 
Ad

Similar to Converting scripts into reproducible workflow research objects (20)

Advances in Scientific Workflow Environments
Advances in Scientific Workflow Environments
Carole Goble
 
Creating abstractions from scientific workflows: PhD symposium 2015
Creating abstractions from scientific workflows: PhD symposium 2015
dgarijo
 
Credible workshop
Credible workshop
Khalid Belhajjame
 
FAIR Computational Workflows
FAIR Computational Workflows
Carole Goble
 
From Scientific Workflows to Research Objects: Publication and Abstraction of...
From Scientific Workflows to Research Objects: Publication and Abstraction of...
dgarijo
 
2016-10-20 BioExcel: Advances in Scientific Workflow Environments
2016-10-20 BioExcel: Advances in Scientific Workflow Environments
Stian Soiland-Reyes
 
Towards Workflow Ecosystems Through Semantic and Standard Representations
Towards Workflow Ecosystems Through Semantic and Standard Representations
dgarijo
 
From Scientific Workflows to Research Objects: Publication and Abstraction of...
From Scientific Workflows to Research Objects: Publication and Abstraction of...
dgarijo
 
Towards an Infrastructure for Enabling Systematic Development and Research of...
Towards an Infrastructure for Enabling Systematic Development and Research of...
Rafael Ferreira da Silva
 
Data legend dh_benelux_2017.key
Data legend dh_benelux_2017.key
Richard Zijdeman
 
2013-07-19 myExperiment research objects, beyond workflows and packs (PPTX)
2013-07-19 myExperiment research objects, beyond workflows and packs (PPTX)
Stian Soiland-Reyes
 
FAIR Computational Workflows
FAIR Computational Workflows
Carole Goble
 
ISI work
ISI work
dgarijo
 
Scientific Workflow Systems for accessible, reproducible research
Scientific Workflow Systems for accessible, reproducible research
Peter van Heusden
 
Small Is Beautiful: Summarizing Scientific Workflows Using Semantic Annotat...
Small Is Beautiful: Summarizing Scientific Workflows Using Semantic Annotat...
Khalid Belhajjame
 
Requirements for Supporting the Iterative Exploration of Scientific Workflow ...
Requirements for Supporting the Iterative Exploration of Scientific Workflow ...
Lucas Augusto Carvalho
 
EOSC-Life Workflow Collaboratory
EOSC-Life Workflow Collaboratory
Carole Goble
 
Workflows, provenance and reporting: a lifecycle perspective at BIH 2013, Rome
Workflows, provenance and reporting: a lifecycle perspective at BIH 2013, Rome
Carole Goble
 
WorkflowHub: Community Framework for Enabling Scientific Workflow Research a...
WorkflowHub: Community Framework for Enabling Scientific Workflow Research a...
Rafael Ferreira da Silva
 
Sharing massive data analysis: from provenance to linked experiment reports
Sharing massive data analysis: from provenance to linked experiment reports
Gaignard Alban
 
Advances in Scientific Workflow Environments
Advances in Scientific Workflow Environments
Carole Goble
 
Creating abstractions from scientific workflows: PhD symposium 2015
Creating abstractions from scientific workflows: PhD symposium 2015
dgarijo
 
FAIR Computational Workflows
FAIR Computational Workflows
Carole Goble
 
From Scientific Workflows to Research Objects: Publication and Abstraction of...
From Scientific Workflows to Research Objects: Publication and Abstraction of...
dgarijo
 
2016-10-20 BioExcel: Advances in Scientific Workflow Environments
2016-10-20 BioExcel: Advances in Scientific Workflow Environments
Stian Soiland-Reyes
 
Towards Workflow Ecosystems Through Semantic and Standard Representations
Towards Workflow Ecosystems Through Semantic and Standard Representations
dgarijo
 
From Scientific Workflows to Research Objects: Publication and Abstraction of...
From Scientific Workflows to Research Objects: Publication and Abstraction of...
dgarijo
 
Towards an Infrastructure for Enabling Systematic Development and Research of...
Towards an Infrastructure for Enabling Systematic Development and Research of...
Rafael Ferreira da Silva
 
Data legend dh_benelux_2017.key
Data legend dh_benelux_2017.key
Richard Zijdeman
 
2013-07-19 myExperiment research objects, beyond workflows and packs (PPTX)
2013-07-19 myExperiment research objects, beyond workflows and packs (PPTX)
Stian Soiland-Reyes
 
FAIR Computational Workflows
FAIR Computational Workflows
Carole Goble
 
ISI work
ISI work
dgarijo
 
Scientific Workflow Systems for accessible, reproducible research
Scientific Workflow Systems for accessible, reproducible research
Peter van Heusden
 
Small Is Beautiful: Summarizing Scientific Workflows Using Semantic Annotat...
Small Is Beautiful: Summarizing Scientific Workflows Using Semantic Annotat...
Khalid Belhajjame
 
Requirements for Supporting the Iterative Exploration of Scientific Workflow ...
Requirements for Supporting the Iterative Exploration of Scientific Workflow ...
Lucas Augusto Carvalho
 
EOSC-Life Workflow Collaboratory
EOSC-Life Workflow Collaboratory
Carole Goble
 
Workflows, provenance and reporting: a lifecycle perspective at BIH 2013, Rome
Workflows, provenance and reporting: a lifecycle perspective at BIH 2013, Rome
Carole Goble
 
WorkflowHub: Community Framework for Enabling Scientific Workflow Research a...
WorkflowHub: Community Framework for Enabling Scientific Workflow Research a...
Rafael Ferreira da Silva
 
Sharing massive data analysis: from provenance to linked experiment reports
Sharing massive data analysis: from provenance to linked experiment reports
Gaignard Alban
 
Ad

More from Khalid Belhajjame (19)

Provenance witha purpose
Provenance witha purpose
Khalid Belhajjame
 
Lineage-Preserving Anonymization of the Provenance of Collection-Based Workflows
Lineage-Preserving Anonymization of the Provenance of Collection-Based Workflows
Khalid Belhajjame
 
Privacy-Preserving Data Analysis Workflows for eScience
Privacy-Preserving Data Analysis Workflows for eScience
Khalid Belhajjame
 
Irpb workshop
Irpb workshop
Khalid Belhajjame
 
Aussois bda-mdd-2018
Aussois bda-mdd-2018
Khalid Belhajjame
 
Anr cair meeting feb 2016
Anr cair meeting feb 2016
Khalid Belhajjame
 
Ikc 2015
Ikc 2015
Khalid Belhajjame
 
Linking the prospective and retrospective provenance of scripts
Linking the prospective and retrospective provenance of scripts
Khalid Belhajjame
 
Reproducibility 1
Reproducibility 1
Khalid Belhajjame
 
Introduction to ProvBench @ Provenance Week 2014
Introduction to ProvBench @ Provenance Week 2014
Khalid Belhajjame
 
Tapp 2014 (belhajjame)
Tapp 2014 (belhajjame)
Khalid Belhajjame
 
Edbt2014 talk
Edbt2014 talk
Khalid Belhajjame
 
Why Workflows Break
Why Workflows Break
Khalid Belhajjame
 
D-prov use-case
D-prov use-case
Khalid Belhajjame
 
Detecting Duplicate Records in Scientific Workflow Results
Detecting Duplicate Records in Scientific Workflow Results
Khalid Belhajjame
 
Research Object Model in Sepublica
Research Object Model in Sepublica
Khalid Belhajjame
 
Case studyworkshoponprovenance
Case studyworkshoponprovenance
Khalid Belhajjame
 
Intégration incrémentale de données (Valenciennes juin 2010)
Intégration incrémentale de données (Valenciennes juin 2010)
Khalid Belhajjame
 
Edbt 2010, Belhajjame
Edbt 2010, Belhajjame
Khalid Belhajjame
 
Lineage-Preserving Anonymization of the Provenance of Collection-Based Workflows
Lineage-Preserving Anonymization of the Provenance of Collection-Based Workflows
Khalid Belhajjame
 
Privacy-Preserving Data Analysis Workflows for eScience
Privacy-Preserving Data Analysis Workflows for eScience
Khalid Belhajjame
 
Linking the prospective and retrospective provenance of scripts
Linking the prospective and retrospective provenance of scripts
Khalid Belhajjame
 
Introduction to ProvBench @ Provenance Week 2014
Introduction to ProvBench @ Provenance Week 2014
Khalid Belhajjame
 
Detecting Duplicate Records in Scientific Workflow Results
Detecting Duplicate Records in Scientific Workflow Results
Khalid Belhajjame
 
Research Object Model in Sepublica
Research Object Model in Sepublica
Khalid Belhajjame
 
Case studyworkshoponprovenance
Case studyworkshoponprovenance
Khalid Belhajjame
 
Intégration incrémentale de données (Valenciennes juin 2010)
Intégration incrémentale de données (Valenciennes juin 2010)
Khalid Belhajjame
 

Recently uploaded (20)

BINARY files CSV files JSON files with example.pptx
BINARY files CSV files JSON files with example.pptx
Ramakrishna Reddy Bijjam
 
Overview of Employee in Odoo 18 - Odoo Slides
Overview of Employee in Odoo 18 - Odoo Slides
Celine George
 
Energy Balances Of Oecd Countries 2011 Iea Statistics 1st Edition Oecd
Energy Balances Of Oecd Countries 2011 Iea Statistics 1st Edition Oecd
razelitouali
 
Measuring, learning and applying multiplication facts.
Measuring, learning and applying multiplication facts.
cgilmore6
 
Plate Tectonic Boundaries and Continental Drift Theory
Plate Tectonic Boundaries and Continental Drift Theory
Marie
 
How to Manage & Create a New Department in Odoo 18 Employee
How to Manage & Create a New Department in Odoo 18 Employee
Celine George
 
BUSINESS QUIZ PRELIMS | QUIZ CLUB OF PSGCAS | 9 SEPTEMBER 2024
BUSINESS QUIZ PRELIMS | QUIZ CLUB OF PSGCAS | 9 SEPTEMBER 2024
Quiz Club of PSG College of Arts & Science
 
ABCs of Bookkeeping for Nonprofits TechSoup.pdf
ABCs of Bookkeeping for Nonprofits TechSoup.pdf
TechSoup
 
How to Implement Least Package Removal Strategy in Odoo 18 Inventory
How to Implement Least Package Removal Strategy in Odoo 18 Inventory
Celine George
 
Introduction to Generative AI and Copilot.pdf
Introduction to Generative AI and Copilot.pdf
TechSoup
 
Nice Dream.pdf /
Nice Dream.pdf /
ErinUsher3
 
ROLE PLAY: FIRST AID -CPR & RECOVERY POSITION.pptx
ROLE PLAY: FIRST AID -CPR & RECOVERY POSITION.pptx
Belicia R.S
 
Battle of Bookworms 2025 - U25 Literature Quiz by Pragya
Battle of Bookworms 2025 - U25 Literature Quiz by Pragya
Pragya - UEM Kolkata Quiz Club
 

Converting scripts into reproducible workflow research objects

  • 1. Converting Scripts into Reproducible Workflow Research Objects Lucas A. M. C. Carvalho, Khalid Belhajjame, Claudia Bauzer Medeiros lucas.carvalho@ic.unicamp.br Baltimore, Maryland, USA October 23-26, 2016
  • 2. Background and Motivation ● Data-Intensive Experiments – Collection of scripts, programs and (big) data Papers
  • 3. Background and Motivation ● Data-Intensive Experiments – Collection of scripts, programs and (big) data Papers How to understand, reproduce or reuse data and models of experiments?
  • 4. Background and Motivation ● Data-Intensive Experiments – Collection of scripts, programs and (big) data Manual collection and organization of data provenance Papers How to understand, reproduce or reuse data and models of experiments?
  • 5. Background and Motivation ● Script-based experiments What are the inputs and outputs? How to change this local program for a similar web service? Example of script code. Difficult to understand, to reuse, and to reproduce.
  • 6. Background and Motivation ● Scientific Workflows Example of Scientific Workflow Management System.
  • 10. Related Work ● Script-language specific. ● Workflow-engine specific. ● A new language is needed. ● Outcome is not an executable workflow. ● Do not collect provenance data of the conversion process.
  • 11. Two Kinds of Experts ● Scientists: – Domain experts who understand the experiment and the script (sometimes called users); ● Curators: – Scientists who are also familiar with workflow and script programming; or – Computer scientists who are familiar enough with the domain to be able to implement our methodology; – Responsible for authoring, documenting and publishing workflows and associated resources.
  • 12. Requirements ● (1) Produce workflow-like view of the script. ● (2) Create an executable workflow and compare execution of workflow and script. ● (3) Modify the workflow resources. ● (4) Record provenance data. ● (5) Aggregate all resources to support Reproducibility and Reuse.
  • 13. Requirements ● (1) Produce workflow-like view of the script. [Figure: script-based experiment and abstract workflow; the abstract workflow shows Activity 1, Activity 2, ..., Activity n with their ports.]
  • 14. Requirements ● (2) Create executable workflow and compare execution of workflow and script. [Figure: executable workflow; script-based experiment.]
  • 15. Requirements ● (3) Modify the workflow resources. [Figure: panels (a) and (b); labels: Local, Algorithm A, Algorithm B.]
  • 16. Requirements ● (4) Record provenance data. [Figure: provenance graph; nodes: Sample, Activity 1, Activity 2, Output 1, Output 2, Workflow Run, Lucas; relations: used, wasGeneratedBy, wasAssociatedWith, wasStartedAt “2012-06-01”.]
  • 17. Requirements ● (5) Aggregate all resources to support Reproducibility and Reuse. [Figure: aggregated resources: abstract workflows, concrete workflows, annotations, papers and reports, provenance, authors, scripts, data.]
  • 18. Methodology ● Step 1: Generate Abstract Workflow ● Step 2: Create an executable workflow ● Step 3: Refine workflow ● Step 4: Annotate and check quality ● Step 5: Bundle Resources into a Research Object [Figure labels: Script, Abstract workflow, Concrete workflow.]
  • 19. Workflow Research Object (WRO) ● Research Objects are semantically rich aggregations of resources that bring together data, methods and people in scientific investigations. ● WROs encapsulate scientific workflows and additional information regarding their context and resources. [Figure: Research Object Model.]
  • 20. Running Example ● Molecular Dynamics Simulations – Many branches of material sciences, computational engineering, physics and chemistry. – Scripts (shell script), programs (NAMD, VMD, Fortran) – Phases: set up, simulation and analysis of trajectories. – Inputs: protein structure, simulation parameters and force field files. – Output: trajectories and analysis results.
  • 24. Step 1: Generate Abstract Workflow ● Create workflow-like view ● Code blocks ● Input/output ● YesWorkflow (McPhillips et al., 2015): – Code comments – Tags: @begin, @end, @desc, @in, @out, ... ● T. McPhillips et al. (2015), “YesWorkflow: A user-oriented, language-independent tool for recovering workflow information from scripts,” International Journal of Digital Curation, vol. 10, no. 1, pp. 298–313. [Figure: annotated script code; abstract workflow.]
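To make the tagging idea in Step 1 concrete, here is a minimal sketch in Python. It is not YesWorkflow itself: the annotated script, the block names (`setup`, `simulation`) and the port names are hypothetical, and the small parser only illustrates how @begin, @end, @in and @out comments could be turned into an abstract workflow of activities and data links.

```python
import re

# Hypothetical shell script annotated with YesWorkflow-style comment tags.
# Block and port names are illustrative assumptions, not taken from the slides.
ANNOTATED_SCRIPT = """\
# @begin setup @desc Prepare the simulation inputs
# @in protein_structure
# @in force_field
# @out system_files
echo "building system"
# @end setup

# @begin simulation @desc Run the molecular dynamics simulation
# @in system_files
# @in simulation_parameters
# @out trajectory
echo "running simulation"
# @end simulation
"""

TAG_PATTERN = re.compile(r"#\s*@(begin|end|in|out)\s+(\S+)")


def extract_blocks(script: str) -> list:
    """Collect each @begin/@end block with its declared input and output ports."""
    blocks = []
    current = None
    for line in script.splitlines():
        match = TAG_PATTERN.match(line.strip())
        if match is None:
            continue  # ordinary code or untagged comment
        tag, name = match.groups()
        if tag == "begin":
            current = {"name": name, "in": [], "out": []}
        elif tag == "end" and current is not None:
            blocks.append(current)
            current = None
        elif tag in ("in", "out") and current is not None:
            current[tag].append(name)
    return blocks


def data_links(blocks: list) -> list:
    """Link a producer block to every block that consumes one of its outputs."""
    links = []
    for producer in blocks:
        for port in producer["out"]:
            for consumer in blocks:
                if port in consumer["in"]:
                    links.append((producer["name"], port, consumer["name"]))
    return links


if __name__ == "__main__":
    workflow_blocks = extract_blocks(ANNOTATED_SCRIPT)
    for source, port, target in data_links(workflow_blocks):
        print(f"{source} --{port}--> {target}")
```

The activities and links recovered this way correspond to the "workflow-like view" of the script; the real tool supports more tags and nested blocks than this sketch handles.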
  • 26. Step 2: Create an executable workflow [Figure: abstract workflow.]
  • 27. Step 2: Create an executable workflow ● Create implementation of activities ● Copy code blocks from the script. [Figure: abstract workflow; executable workflow.]
  • 28. Step 2: Create an executable workflow ● Create implementation of activities ● Copy code blocks from the script. [Figure: abstract workflow; executable workflow.]
  • 29. Step 2: Create an executable workflow ● Create implementation of activities ● Copy code blocks from the script. [Figure: abstract workflow; executable workflow; script code.]
  • 30. Step 3: Refine executable workflow ● Modify resources: – Algorithms – Data Sets – Parallelization – Web Services – ... [Figure: executable workflow; new workflow version.]
  • 31. Step 3: Refine executable workflow ● Create new version ● Modify resources: – Algorithms – Data Sets – Parallelization – Web Services – ... [Figure: executable workflow; new workflow version.]
  • 32. Steps 2 and 3: Record provenance data: execution traces (W3C PROV). [Figure: provenance graph of the executable workflow; nodes: split, psgen, Sample, Output 1, Output 2, Workflow Run, Lucas; relations: used, wasGeneratedBy, wasAssociatedWith, wasEnactedBy, hasSpecification, wasStartedAt “2012-06-01”.]
  • 33. Steps 2 and 3: Record provenance data: conversion process (W3C PROV). [Figure: script code, executable workflow and new workflow version; relations: wasDerivedFrom, wasAssociatedWith; agent: Curator.]
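An execution trace of this kind can be pictured as a set of subject-relation-object statements. The sketch below is an illustrative assumption rather than the authors' tooling: it stores a hypothetical trace as plain Python tuples, reusing the relation names shown on the slide (`used`, `wasGeneratedBy`, `wasAssociatedWith`), and answers a simple lineage question. The exact edges of the slide's graph cannot be recovered from the transcript, so the ones below are invented for illustration.

```python
# Hypothetical execution trace as (subject, relation, object) statements.
# Node names echo the slide (psgen, Sample, Output 1, Output 2, Lucas),
# but the edges themselves are assumed for this example.
TRACE = [
    ("psgen", "used", "sample"),
    ("output_1", "wasGeneratedBy", "psgen"),
    ("output_2", "wasGeneratedBy", "psgen"),
    ("workflow_run", "wasAssociatedWith", "Lucas"),
]


def objects_of(trace, subject, relation):
    """Return every object linked to `subject` through `relation`."""
    return [o for s, r, o in trace if s == subject and r == relation]


def upstream_inputs(trace, entity):
    """List the entities used by the activities that generated `entity`."""
    inputs = []
    for activity in objects_of(trace, entity, "wasGeneratedBy"):
        inputs.extend(objects_of(trace, activity, "used"))
    return inputs


if __name__ == "__main__":
    print(upstream_inputs(TRACE, "output_1"))
```

A production setup would more likely serialize such statements with a PROV-aware library or as RDF; the tuple form is used here only to keep the example self-contained.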
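Provenance of the conversion process can be sketched the same way. The example below assumes, hypothetically, that each resource records the single resource it was derived from (`wasDerivedFrom`); the resource names are placeholders. Following those records backwards shows which script a refined workflow ultimately came from.

```python
# Hypothetical wasDerivedFrom records: each key was derived from its value.
# Resource names are placeholders chosen for this sketch.
DERIVED_FROM = {
    "workflow_v2": "workflow_v1",
    "workflow_v1": "script.sh",
}


def derivation_chain(resource, derived_from):
    """Follow wasDerivedFrom records from `resource` back to its origin."""
    chain = [resource]
    while chain[-1] in derived_from:
        parent = derived_from[chain[-1]]
        if parent in chain:
            # Guard against malformed records that would loop forever.
            raise ValueError("cycle in derivation records")
        chain.append(parent)
    return chain


if __name__ == "__main__":
    print(" <- ".join(reversed(derivation_chain("workflow_v2", DERIVED_FROM))))
```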
  • 34. Step 4: Annotate and check quality ● Annotations describing the workflow. ● Use provenance data – To check the quality of the conversion process. ● Run checks to verify the soundness of the workflow.
  • 35. Step 4: Annotate and check quality [Figure: script code; executable workflow.]
  • 36. Step 4: Annotate and check quality [Figure: workflow version; initial executable workflow.]
  • 37. Step 4: Annotate and check quality ● Common mistakes during the conversion: – failing to clearly identify the main logical processing units in the script; – making a mistake when migrating script code into the corresponding activity; – not providing the correct input files and parameters; – introducing errors in the coding of the workflow itself.
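One way to catch such conversion mistakes is to compare what the script and the workflow produce for the same inputs. The sketch below is an assumption about how that check could look, not the procedure used in the talk: it compares the files in two hypothetical output directories by SHA-256 digest and reports matches, differences and missing files.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def compare_outputs(script_dir: Path, workflow_dir: Path) -> dict:
    """Compare same-named output files of a script run and a workflow run."""
    script_files = {p.name for p in script_dir.iterdir() if p.is_file()}
    workflow_files = {p.name for p in workflow_dir.iterdir() if p.is_file()}
    report = {}
    for name in sorted(script_files | workflow_files):
        if name not in script_files:
            report[name] = "missing from script run"
        elif name not in workflow_files:
            report[name] = "missing from workflow run"
        elif file_digest(script_dir / name) == file_digest(workflow_dir / name):
            report[name] = "match"
        else:
            report[name] = "differs"
    return report
```

Byte-level equality may be too strict for simulations whose results vary between runs; in that case a domain-specific comparison of the analysis results would be a more suitable check.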
  • 38. Step 5: Bundle Resources into a Research Object [Figure: bundled resources: Script, Abstract workflow, Concrete workflow(s), Annotations, Paper, Provenance, Data, Attributions.]
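The bundling step can be sketched as packing the resources into a single archive together with a manifest that lists what is aggregated. The layout below (a ZIP file with a top-level `manifest.json` holding an `aggregates` list) is a simplified assumption made for illustration; it is not claimed to follow the Research Object Bundle format, and the file names are placeholders.

```python
import io
import json
import zipfile


def build_bundle(resources: dict) -> bytes:
    """Pack named resources and a manifest listing them into a ZIP archive.

    `resources` maps an archive path (str) to file content (bytes).
    """
    manifest = {"aggregates": [{"path": name} for name in sorted(resources)]}
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as bundle:
        # The manifest is written first so readers can find it quickly.
        bundle.writestr("manifest.json", json.dumps(manifest, indent=2))
        for name in sorted(resources):
            bundle.writestr(name, resources[name])
    return buffer.getvalue()


if __name__ == "__main__":
    archive = build_bundle({
        "script.sh": b"echo simulate\n",
        "workflow.xml": b"<workflow/>",
        "provenance.ttl": b"# provenance statements\n",
    })
    with zipfile.ZipFile(io.BytesIO(archive)) as bundle:
        print(bundle.namelist())
```

Keeping the manifest inside the archive means the script, workflows, provenance and annotations travel together, which is the point of aggregating them into one Research Object.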
  • 39. Contributions ● A methodology that guides curators in a principled manner to transform scripts into reproducible and reusable WROs; ● This addresses an important issue in the area of script provenance.
  • 40. Conclusions ● We addressed issues with respect to understanding, reuse and reproducibility of script-based experiments. ● The methodology was: – elaborated based on requirements; – showcased via a real-world use case from the field of Molecular Dynamics. ● We exploited tools and standards from the scientific community: – Scientific Workflows, YesWorkflow, Research Objects, the W3C PROV recommendations and the Web Annotation Data Model. ● The bundle is available at http://w3id.org/w2share/s2rwro/
  • 41. Next Steps ● Evaluation using other case studies; ● Evaluation of the cost-effectiveness of our methodology; ● Extension of YesWorkflow to support the semantic annotation of blocks; ● Implementation of tools.
  • 42. Acknowledgments ● FAPESP (grant # 2014/23861-4) ● CCES/CEPID (grant # 2013/08293-7) – Center for Computational Engineering & Sciences ● LIS (Laboratory of Information Systems) ● Prof. Munir Skaf and his group from the Institute of Chemistry, Unicamp.