Advances in Scientific Workflow Environments

2016-09-04 BioExcel SIG, ECCB, Amsterdam
Carole Goble, Stian Soiland-Reyes
The University of Manchester
carole.goble@manchester.ac.uk
http://esciencelab.org.uk/

What is a Workflow?
• Orchestrating multiple
computational tasks
• Managing the control and
data flow between them
• In a world that is
homogeneous or
heterogeneous
• Tasks
– Local / remote
– Local / third party
– White, grey or black boxes
– Reliable / fragile
– Reserved / dynamic
– Various underpinning
infrastructure
– Various access controls
BioExcel: Biomolecular recognition
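
The definition above (tasks plus the control and data flow between them) can be sketched as a small dependency graph. This is only an illustration, assuming tasks are plain Python functions; `run_workflow` and the three task names are hypothetical and do not come from any workflow system mentioned in these slides.

```python
from graphlib import TopologicalSorter

def run_workflow(tasks, deps):
    """Run tasks in dependency order, passing upstream outputs downstream.

    tasks: name -> function taking a dict of upstream results
    deps:  name -> list of upstream task names (control and data flow)
    """
    results = {}
    order = TopologicalSorter({name: deps.get(name, []) for name in tasks})
    for name in order.static_order():
        upstream = {d: results[d] for d in deps.get(name, [])}
        results[name] = tasks[name](upstream)
    return results

# Hypothetical three-step pipeline: fetch -> clean -> summarise
tasks = {
    "fetch": lambda up: [3, 1, 2, None],
    "clean": lambda up: [x for x in up["fetch"] if x is not None],
    "summarise": lambda up: sum(up["clean"]) / len(up["clean"]),
}
deps = {"clean": ["fetch"], "summarise": ["clean"]}

results = run_workflow(tasks, deps)
print(results["summarise"])
```

A real workflow system adds what this sketch leaves out: remote or third-party tasks, access control, and recovery when a fragile task fails.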

What is a Workflow?
Automation
– Automate computational aspects
– Repetitive pipelines, sweep campaigns
Scaling – compute cycles
– Make use of computational infrastructure
& handle large data
Abstraction – people cycles
– Shield complexity and incompatibilities
– Report, re-use, evolve, share, compare
– Repeat, tweak, repeat
– First class commodities
Provenance - reporting
– Capture, report and utilize log and data
lineage auto-documentation
– Traceable evolution, audit, transparency
– Compare
With thanks to Bertram Ludascher: WORKS 2015 Keynote
Findable
Accessible
Interoperable
Reusable
(Reproducible)

https://pegasus.isi.edu/2016/02/11/pegasus-powers-ligo-gravitational-waves-detection-analysis/
Laser Interferometer Gravitational-Wave
Observatory – first detection of gravitational
waves from colliding black holes

Morphological, hemodynamic and
structural analyses linked to aneurysm
genesis, growth and rupture.
[Susheel Varma] http://www.vph-share.eu/
http://taverna.org.uk

Galaxy
https://usegalaxy.org/

Marine metagenomics
+ Bespoke Scripts
[Rob Finn]

Open PHACTS
https://www.knime.org/
BioExcel workflow
https://www.openphacts.org/
Targets
Pharmacological queries
target, compound and pathway data
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0115460

Scripts, Ensemble toolkit, execution patterns
http://www.extasy-project.org/

http://www.myexperiment.org
WF Zoo

Workflow Patterns, templates
[Diagram: Data wrangling & analytics + Simulations + Instrument pipelines]
http://tpeterka.github.io/maui-project/
The Future of Scientific Workflows, Report of DOE Workshop 2015,
http://science.energy.gov/~/media/ascr/pdf/programdocuments/docs/workflows_final_report.pd

[Diagram: Data wrangling & analytics + Simulations + Instrument pipelines]
Garijo et al., Common Motifs in Scientific Workflows: An Empirical Analysis, FGCS, 36, July 2014, 338–351

• Long running and complex code
• Tunable parameters and input sets
• Simulation sweeps / iterations
• Ensembles, comparisons
• Tricky set-ups, human-in-the-loop
interaction
• Computational steering
• In situ workflows – multiple tasks, same
box, within fixed time
– data locality.
– human-in-the-loop.
– capture provenance.
[Diagram: Data wrangling & analytics + Simulations + Instrument pipelines]
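
A simulation sweep over tunable parameters can be sketched in a few lines. This is a rough illustration, not how any particular simulation code is driven: `simulate` is a hypothetical stand-in for a long-running code, and the parameter names and values are made up.

```python
from itertools import product

def simulate(temperature, pressure):
    """Stand-in for a long-running simulation code (hypothetical)."""
    return temperature * pressure

def sweep(param_grid, run):
    """Run `run` once per combination of the tunable parameters."""
    names = sorted(param_grid)
    campaign = []
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        campaign.append({"params": params, "result": run(**params)})
    return campaign

grid = {"temperature": [280, 300, 320], "pressure": [1.0, 2.0]}
campaign = sweep(grid, simulate)

# Ensemble comparison: pick the run with the largest result.
best = max(campaign, key=lambda r: r["result"])
print(len(campaign), best["params"])
```

Keeping the parameters next to each result is what later makes ensemble comparison and provenance capture possible.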

Traction + Examples
Reuse behaviours
Exploratory vs Production
Different kinds of user / deployment
Developer – User Ratios
Biologist, Developer, Computational Scientist

Existing computational research workflow systems
https://github.com/common-workflow-language/common-workflow-language/wiki/Existing-Workflow-systems
WFMS Zoo

“Multi-scale” WFMS
• Workflow
Management
System
– Its design and reporting
environment
– Its execution
environment
• The tasks
– tools, codes and services
and their execution
environments
• Stack layer
– App level, infrastructure
level

“Multi-scale” WFMS
Component making
DAIC (Distributed Area/Instrument Computing): tasks loosely coupled through files
• execute on geographically distributed
clusters, clouds, grids across systems
• execute on multiple facilities
• call host services (web / grid services)
HPC: tasks tightly coupled
• exchanging info over memory/storage
• network of supercomputers
• In situ workflows – multiple tasks, same
box, within fixed time
Interoperability, Portability, Granularity, Maintenance
Workflow Environment Ecosystem

Copernicus workflow engine for
parallel adaptive molecular dynamics
• Peer-to-peer distributed
computing platform
– high-level parallelization of
statistical sampling problems
• Consolidation of heterogeneous
compute resources
• Automatic resource matching of
jobs against compute resources
• Automatic fault tolerance of
distributed work
• Workflow execution engine to
define a problem (reporting) and
trace its results live (provenance)
• Flexible plugin facilities
– programs to be integrated to the
workflow execution engine
Free Energy
Workflow using
GROMACS
http://copernicus-computing.org/

COMPSs/PyCOMPSs:
Programmer Productivity
framework
• Sequential programming
– Parallelisation and
distribution heavy-lifting
– Dependency detection
• Infrastructure unaware
– Abstract application from
underlying infrastructure
– Portability
• Standard Programming
Languages
– Java, Python, C/C++
• No (or few!) APIs
– Standard Java

Shield the
user/programmer
Exposure to the
infrastructure
System Design
Manage/minimize data transfers

Stop Press!
GUIs not essential!
• Canvas, drag-drop blocks, arrows,
run button
• Command-line & embedding in
developer or user applications
Scripts can be workflows!
• WMS <-> Scripts
• Script vs Workflows/ASAP:
– Automation: *****
– Scaling: **
– Abstraction: *
– Provenance: **
Work close to a problem-
specific ad-hoc data model
Domain Specific Language
"programming-lite" scripts
• wire with declarative
"makefile"-like DAG
Plus
• procedural scripting and
expressions in languages
like Javascript and Python
Nextflow, Snakemake,
Common Workflow Language
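
The combination of declarative "makefile"-like wiring with procedural expressions can be sketched as follows. This is a toy resolver, not the syntax or semantics of Nextflow, Snakemake or CWL; the `build` function and the rule names are hypothetical.

```python
def build(target, rules, available, log):
    """Resolve `target` by recursively running the rule that produces it.

    rules:     output name -> (list of input names, function of those inputs)
    available: name -> value, pre-seeded with the raw inputs
    log:       list that records the order in which rules fire
    """
    if target in available:
        return available[target]
    inputs, action = rules[target]
    args = [build(name, rules, available, log) for name in inputs]
    available[target] = action(*args)
    log.append(target)
    return available[target]

# Hypothetical rules: the wiring is declarative, the actions are plain Python.
rules = {
    "trimmed": (["reads"], lambda reads: [r.strip("N") for r in reads]),
    "lengths": (["trimmed"], lambda trimmed: [len(r) for r in trimmed]),
    "report": (["lengths", "trimmed"],
               lambda lengths, trimmed: dict(zip(trimmed, lengths))),
}
available = {"reads": ["NNACGT", "GGCNN"]}
log = []
report = build("report", rules, available, log)
print(log, report)
```

The user asks only for the final target; the execution order falls out of the declared inputs and outputs, and each rule runs at most once.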

GUIs Are Essential
for take-up by the user base

Workflowising script software eco-systems
prime example: provenance
• Common, interoperable provenance recording
– W3C PROV
• YesWorkflow.org
– Annotations in script yield workflow view
• Library profilers
– noWorkflow
• Runtime provenance recorders
– Sumatra, RDataTracker
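
How annotations in a script can yield a workflow view is sketched below. The comment keywords `@begin`, `@in`, `@out` and `@end` follow the YesWorkflow convention, but the parser is a simplified illustration written for these notes, not the YesWorkflow tool: it ignores nesting and the other keywords, and `extract_blocks` and `data_edges` are hypothetical names.

```python
import re

ANNOTATION = re.compile(r"#\s*@(begin|in|out|end)\s+(\w+)")

def extract_blocks(script):
    """Collect @begin/@in/@out/@end comment annotations into blocks."""
    blocks, current = [], None
    for keyword, name in ANNOTATION.findall(script):
        if keyword == "begin":
            current = {"name": name, "in": [], "out": []}
        elif keyword == "end":
            blocks.append(current)
            current = None
        elif current is not None:
            current[keyword].append(name)
    return blocks

def data_edges(blocks):
    """Link a block that writes a name to every block that reads it."""
    return [
        (producer["name"], data, consumer["name"])
        for producer in blocks
        for data in producer["out"]
        for consumer in blocks
        if data in consumer["in"]
    ]

script = '''
# @begin load
# @in raw_file
# @out table
table = [1, 2, 3]
# @end load

# @begin analyse
# @in table
# @out mean
mean = sum(table) / len(table)
# @end analyse
'''

blocks = extract_blocks(script)
edges = data_edges(blocks)
print(edges)
```

The script itself is never executed here; the workflow view comes purely from the comments, which is why this approach works across scripting languages.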

Provenance: the link between computation and results
W3C PROV model standard
record for reporting
compare diffs/discrepancies
provenance analytics
track changes, adapt
partial repeat/reproduce
carry attributions
compute credits
compute data quality/trust
select data to keep/release
optimisation and debugging
Metadata propagation – where was the
physical sample collected, and who
should be attributed?
Task-based abstractions: simplifying
provenance using motifs and tool
annotations
“Free energy calculation” rather than 5
steps including preparation of PDB files
and GROMACS execution
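
Recording provenance as a run proceeds, then querying the data lineage, can be sketched with plain tuples. The relation names `used`, `wasGeneratedBy` and `wasDerivedFrom` are taken from the W3C PROV model; everything else (the `ProvenanceLog` class, the step and entity names) is a hypothetical illustration and not a PROV serialisation.

```python
class ProvenanceLog:
    """Collects PROV-style statements as (relation, subject, object) tuples."""

    def __init__(self):
        self.statements = []
        self.values = {}

    def add_input(self, name, value):
        self.values[name] = value

    def run(self, activity, fn, inputs, output):
        """Run fn on the named input entities and record what happened."""
        result = fn(*(self.values[name] for name in inputs))
        self.values[output] = result
        for name in inputs:
            self.statements.append(("used", activity, name))
            self.statements.append(("wasDerivedFrom", output, name))
        self.statements.append(("wasGeneratedBy", output, activity))
        return result

    def lineage(self, entity):
        """All entities that `entity` was directly or indirectly derived from."""
        parents = {o for rel, s, o in self.statements
                   if rel == "wasDerivedFrom" and s == entity}
        ancestors = set(parents)
        for parent in parents:
            ancestors |= self.lineage(parent)
        return ancestors

log = ProvenanceLog()
log.add_input("sample", [2.0, 4.0])
log.run("normalise", lambda xs: [x / max(xs) for x in xs], ["sample"], "normalised")
log.run("average", lambda xs: sum(xs) / len(xs), ["normalised"], "mean")
print(sorted(log.lineage("mean")))
```

A lineage query like this is the basis for the uses listed above, such as carrying attributions forward or deciding which steps need a partial repeat.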

Provenance: the link between workflow variants,
and workflow reuse and repurpose
W3C PROV model standard?
record for reporting
compare diffs/discrepancies
provenance analytics
track changes, adapt
carry attributions
compute design credits
versioning, forking, cloning
Nested workflows
functions by stealth
Copy and paste fragmentation
Designing for reuse
Find and Go
Software practices
Systematic reuse
Guidelines for persistently identifying
software using DataCite
https://epubs.stfc.ac.uk/work/24058274
https://www.force11.org/software-citation-principles

ASAP WFMS for FAIR Science
Automate: workflows,
programs and services folks
already use or want to use
Scale: Enable computational
productivity
Abstract: Enable human
productivity
Provenance: Record and use
Usability
Workflow, Plugged in Code
Reporting, Comparison
Thanks to Bertram Ludascher

Dependency Management
Codes Behaviours & Reliability

● Task-specific “mini-workflow”
fragments
– e.g. using Gromacs, CPMD,
HADDOCK
● Packaged
– EGI VM images and Docker
containers
● Backed by existing registries
– ELIXIR’s bio.tools and EGI App DB
● Instantiated as cloud instances
– private (OpenNebula, OpenStack)
– public (e.g. Amazon AWS)
Application Building Blocks
BioExcel Virtualised Software Library
“transversal workflow units”, higher level operations

BioExcel Use cases
● Genomics
● Ensemble molecular
simulations
● Free Energy simulations
● Multiscale modelling of
molecular basis for odor
and taste
● Biomolecular recognition
● Pharmacological queries
● Virtual Screening

Finding valid pathways through free-energy
landscapes: implementation of the “string of
swarms” method using Copernicus as a
workflow manager, and GROMACS as a
compute engine.

Workflow Interoperability.
• Common format for bioinformatics tool &
workflow execution
• Community based standards effort
• Designed for clusters & clouds
• Supports the use of containers (e.g. Docker)
• Specify data dependencies between steps
• Scatter/gather on steps
• Nest workflows in steps
• Develop your pipeline on your local computer
(optionally with Docker)
• Execute on your research cluster or in the cloud
• Deliver to users via workbenches
• EDAM ontology (ELIXIR-DK) to specify file
formats and reason about them: “FASTQ
Sanger” encoding is a type of FASTQ file
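
Scatter/gather on a step can be sketched in Python as below. This only illustrates the pattern; it is not CWL syntax and not how a CWL engine schedules work. `scatter_gather` and `gc_content` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor

def scatter_gather(step, items, gather, max_workers=4):
    """Apply `step` to each item independently, then combine the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        parts = list(pool.map(step, items))  # order of `items` is preserved
    return gather(parts)

def gc_content(sequence):
    """Fraction of G and C bases in one sequence."""
    return sum(base in "GC" for base in sequence) / len(sequence)

sequences = ["GGCC", "ATAT", "GATC"]
mean_gc = scatter_gather(
    gc_content, sequences, lambda parts: sum(parts) / len(parts)
)
print(mean_gc)
```

Because each scattered item is independent, the same description can run on a laptop, a cluster or a cloud without changing the step itself.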

Workflow Research Object Bundle
researchobject.org
Belhajjame et al (2015) Using a suite of ontologies for preserving workflow-centric research objects,
J Web Semantics doi:10.1016/j.websem.2015.01.003
application/vnd.wf4ever.robundle+zip
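
Packaging a workflow and its outputs as a ZIP-based bundle with the media type above can be sketched as follows. The layout is an assumption made for illustration: a leading uncompressed `mimetype` entry and a simplified `.ro/manifest.json`. The file names are hypothetical, and the manifest omits most fields a real Research Object manifest would carry.

```python
import io
import json
import zipfile

MIMETYPE = "application/vnd.wf4ever.robundle+zip"

def write_bundle(target, files):
    """Write `files` (path -> bytes) plus a manifest into a ZIP container."""
    manifest = {"aggregates": [{"uri": "/" + path} for path in sorted(files)]}
    with zipfile.ZipFile(target, "w") as bundle:
        # Media type entry goes first and is stored uncompressed.
        bundle.writestr("mimetype", MIMETYPE, compress_type=zipfile.ZIP_STORED)
        for path in sorted(files):
            bundle.writestr(path, files[path], compress_type=zipfile.ZIP_DEFLATED)
        bundle.writestr(".ro/manifest.json", json.dumps(manifest, indent=2),
                        compress_type=zipfile.ZIP_DEFLATED)

# Hypothetical contents: a workflow definition and one result file.
buffer = io.BytesIO()
write_bundle(buffer, {
    "workflow.txt": b"# hypothetical workflow definition\n",
    "outputs/result.txt": b"42\n",
})

with zipfile.ZipFile(buffer) as bundle:
    names = bundle.namelist()
    media_type = bundle.read("mimetype").decode()
    manifest = json.loads(bundle.read(".ro/manifest.json"))
print(names)
```

The workflow, its data and the manifest describing them travel as a single file that can be shared, archived or cited.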

Z. Zhao et al., “Workflow bus for e-Science”, in IEEE e-Science 2006, Amsterdam

http://bioexcel.eu/events/bioexcel-workflow-training-for-computational-biomolecular-research/
Adam Hospital (IRB), Anna Montras (IRB), Stian Soiland-Reyes (UNIMAN), Alexandre Bonvin
(UU), Adrien Melquiond (UU), Josep Lluís Gelpí (BSC), Daniele Lezzi (BSC), Steven Newhouse
(EBI), Jose A. Dianes (EBI), Mark Abraham (KTH), Rossen Apostolov (KTH), Emiliano Ippoliti
(Jülich), Adam Carter (UEDIN), Darren J. White (UEDIN)
Slides: Bertram Ludascher, Ewa Deelman, Vasa Curcin, Paolo Missier, Pinar Alper, Susheel
Varma, Rob Finn, Michael Crusoe, Rizos Sakellariou
Sign up
ASAP!
