Being FAIR:
Enabling Reproducible
Data Science
Professor Carole Goble
The University of Manchester, UK
carole.goble@manchester.ac.uk
2018 Early Detection of Cancer Conference, OHSU, Portland, Oregon USA, 2-4 Oct 2018
Disclosure
Knowledge management
Computational workflows
Sharing and exchange
Reproducibility
Large e-Infrastructure
projects for life science data
The Learning Health System
Phenotypic
Patient Records
Patient cohort building
Patient stratification
Case notes
Discharge notes
Patient cohorts
Patient Multi-omics
Public Reference
repositories
text mining, data mining
data & vocabulary linking
data analytics
Single cell omics
Clinical genomics
Quantitative biology
e-Health
Predictive
models
Sensors Diagnostics
Biomarkers
Imaging
Research Clinical
Biobanks
Scientific
Literature
Patient
Public Health
[Friedman]
An Inspiration
http://fora.tv/2010/04/23/Sage_Commons_Josh_Sommer_Chordoma_Foundation
Josh Sommer
http://www.chordomafoundation.org/
Accelerate a cure
Accelerate knowledge
exchange
Barriers to Cure
• Access to scientific resources
• Coordination, Collaboration
• Flow of Information
• FAIR Data, FAIR Methods
• FAIR Object Commons
[Josh Sommer]
Goble C., De Roure D., Bechhofer S. (2013) Accelerating Scientists' Knowledge Turns, https://doi.org/10.1007/978-3-642-37186-8_1
Research Commons
Accelerate inter-lab
knowledge turns
Accumulate knowledge
1. A Research Commons
“… a “cloud-based” platform where investigators can store, share, access, and interact
with digital objects (data, software, [models, SOPs], etc.) generated from …. research.
By connecting the digital objects and making them accessible, the Data Commons is
intended to allow novel scientific research that was not possible before, including
hypothesis generation, discovery, and validation.” https://commonfund.nih.gov/commons
Pooled Resources
Federated
Find and Access
Many entry points
Data + Methods + Models
Clear steps
Transparent
Comprehensible
Replicable
Logged
Accessible
Provenance
Standardised
Harmonised
Combined
Method
Materials
Variations X N
Repeat. Compare.
Log & Track
Provenance
Scale
2. Data-driven Science, Predictive Science
is Software-driven, Method-Driven
3. Reuse and Reproducibility
is hard for in vivo/in vitro and
even for in silico analysis
• OS version
• Revision of scripts
• Data analysis software
versions
• Version of data files
• Command line parameters
written on a napkin
• “Magic” the grad student
knows….
[Keiichiro Ono, Scripps Institute]
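As a sketch of what "recording the napkin" could look like in practice (file names, paths and packages below are placeholders, not taken from the slide deck), a few lines of Python can capture the OS, script revision, parameters and input checksums alongside the results:

```python
# Sketch: capture the run context that usually ends up on a napkin -- OS,
# script revision, command-line parameters and input-file fingerprints -- into
# a small JSON provenance record saved next to the results.
import hashlib
import json
import platform
import subprocess
import sys

def sha256(path):
    """Fingerprint a data file so the exact version used can be identified."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def git_revision():
    """Record which revision of the analysis scripts actually ran."""
    try:
        result = subprocess.run(["git", "rev-parse", "HEAD"],
                                capture_output=True, text=True, check=True)
        return result.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return "unknown"

record = {
    "os": platform.platform(),              # OS version
    "python": platform.python_version(),    # interpreter version
    "script_revision": git_revision(),      # revision of the scripts
    "parameters": sys.argv[1:],             # command line parameters, off the napkin
    "inputs": {},                           # name -> checksum of each data file
}
for path in sys.argv[1:]:
    try:
        record["inputs"][path] = sha256(path)
    except OSError:
        pass                                # argument was not a readable file

with open("run_provenance.json", "w") as handle:
    json.dump(record, handle, indent=2)
```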
Findable (Citable)
Accessible (Trackable)
Interoperable (Intelligible)
Reusable (Reproducible)
Record
Automate
Contain
Access
FAIR
provenance
portability
preservation
robustness
access
description
standards, common APIs
licensing
standards,
common metadata
versioning, deviation
variation sensitivity
discrepancy handling
parametric spaces
packaging, containers
dependencies
steps
ids
Reproduce and reuse
computations
Transparently communicate
the way computations are
performed
Disambiguate interpretation
of inputs/parameters/results
Safely (re)run computations
ported onto different
platforms
Human and computer
readable definitions for the
provenance of computation,
types for the data and results
Cancer Data Integrator
[Várna, Davies, NIHR Health Informatics Collaborative, UK]
Being FAIR: Enabling Reproducible Data Science
Objects: data + methods + models + provenance +
Scharm M, Wendland F, Peters M, Wolfien M, Theile T, Waltemath D, SEMS, University of Rostock. A zip-like file with a manifest & metadata
- Bundle files - Keep provenance
- Exchange data - Ship results
Bergmann, F.T. (2014). COMBINE archive and OMEX format: one file to share all information
to reproduce a modeling project. BMC Bioinformatics, 15(1), 1.
COMBINE Archive
Systems Biology
Systems Medicine
https://sems.unirostock.de/projects/combinearchive/
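A minimal sketch of the idea behind a COMBINE/OMEX archive: a zip file whose manifest.xml lists every entry and its format. The file names and format URIs below are illustrative; the OMEX specification defines the normative identifiers.

```python
# Sketch of a COMBINE/OMEX-style archive: a zip file whose manifest.xml lists
# every entry and its format. File names and format URIs are illustrative.
import zipfile

entries = {
    "model.xml": "http://identifiers.org/combine.specifications/sbml",
    "simulation.sedml": "http://identifiers.org/combine.specifications/sed-ml",
}

manifest = ['<?xml version="1.0" encoding="UTF-8"?>',
            '<omexManifest xmlns="http://identifiers.org/combine.specifications/omex-manifest">']
for location, fmt in entries.items():
    manifest.append(f'  <content location="./{location}" format="{fmt}"/>')
manifest.append('</omexManifest>')

with zipfile.ZipFile("study.omex", "w") as archive:
    archive.writestr("manifest.xml", "\n".join(manifest))
    for location in entries:
        archive.writestr(location, "<placeholder/>")  # the real model/simulation files go here
```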
Research Object Framework
Bechhofer et al (2013) Why linked data is not enough for scientists https://doi.org/10.1016/j.future.2011.08.004
Bechhofer et al (2010) Research Objects: Towards Exchange and Reuse of Digital Knowledge, https://eprints.soton.ac.uk/268555/
carry machine
processable metadata
in common and specific
to different object
types.
bundle together and
relate digital
resources with their
context into a unit.
snapshot, cite,
exchange
run, evolve
accumulate
interlink
Standards-based generic
metadata framework
Container
Metadata
Object
metadata, ontologies,
identifiers
“Unbounded” Objects
Bags of things and external references to things
Data used and results produced …
Methods employed to produce and
analyse that data …
Provenance and settings …
People involved …
Annotations understanding & interpretation …
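A rough sketch of such a bundle's manifest as JSON: local files and external references are aggregated into one unit, with annotations and attribution alongside. Field names echo the Research Object Bundle manifest but are illustrative rather than normative, and the accession and ORCID shown are hypothetical.

```python
# Sketch of a research-object-style manifest: one unit aggregating local files
# and external references, with annotations and attribution. Values are
# placeholders for illustration only.
import json

manifest = {
    "@context": ["https://w3id.org/bundle/context"],
    "createdBy": {"name": "Jane Researcher",
                  "orcid": "https://orcid.org/0000-0000-0000-0000"},
    "aggregates": [
        {"uri": "data/variants.vcf"},                                # data produced
        {"uri": "workflow/detect_variants.cwl"},                     # method employed
        {"uri": "provenance/run-2018-10-02.prov.json"},              # provenance & settings
        {"uri": "https://www.ebi.ac.uk/ena/data/view/SAMEA123456"},  # external reference (hypothetical)
    ],
    "annotations": [
        {"about": "workflow/detect_variants.cwl",
         "content": "annotations/workflow-notes.md"}                 # human interpretation
    ],
}

with open("manifest.json", "w") as handle:
    json.dump(manifest, handle, indent=2)
```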
• Co-localizing massive
genomics datasets, like The
Cancer Genome Atlas (TCGA),
alongside secure and
scalable computational
resources to analyze them.
• Analyze your own data alongside
TCGA using predefined
analytical workflows or
your own tools.
• Petabyte of multi-
dimensional data available
to authorized researchers.
• Fully reproducible
execution
• Secure team collaboration.
http://www.cancergenomicscloud.org/
NCI Cancer Genomics Cloud (CGC) Pilot
HTS pipelines for precision medicine
GATK: Tumor-Normal Paired Exome-Sequencing pipeline
[Durga Addepalli, Seven Bridges]
HTS pipelines for precision medicine
GATK: Tumor-Normal Paired Exome-Sequencing pipeline
[Durga Addepalli, Seven Bridges]
Inputs Outputs Analysis
Workflow
Input Data
(Files)
Output Data
(Files)
Software
Component
Settings
(Annotation)
Workflow is defined using Common Workflow Language (CWL)
Software components are Docker images
http://www.cancergenomicscloud.org/
Analysis
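A minimal sketch of this pattern (the tool, image tag and input file are placeholders, not part of the CGC pipelines; cwltool and Docker are assumed to be installed): a CWL CommandLineTool pins its software to a Docker image, and any CWL-compliant engine, such as the reference runner cwltool, can execute it.

```python
# Sketch: write a minimal CWL CommandLineTool that pins its software to a
# Docker image, then run it with the reference CWL engine (cwltool).
import pathlib
import subprocess

tool_cwl = """\
cwlVersion: v1.0
class: CommandLineTool
baseCommand: [wc, -l]
requirements:
  DockerRequirement:
    dockerPull: ubuntu:18.04
inputs:
  in_file:
    type: File
    inputBinding: {position: 1}
outputs:
  line_count:
    type: stdout
stdout: line_count.txt
"""

job_yml = """\
in_file:
  class: File
  path: reads.txt
"""

pathlib.Path("count_lines.cwl").write_text(tool_cwl)
pathlib.Path("job.yml").write_text(job_yml)
pathlib.Path("reads.txt").write_text("ACGT\nTTGA\n")

# Any CWL-compliant engine could run this; cwltool is the reference runner.
subprocess.run(["cwltool", "count_lines.cwl", "job.yml"], check=True)
```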
Output Files Input Files Intermediates
Parameters
Configurations
Workflow
Run
Provenance
Narrative
Execution
Workflow Engine
Tools / Codes
Resources
Author Workflow
Container
Metadata
Analysis
Parameters
Configurations
Workflow
Provenance
Workflow
Engine
Algorithms,
Pipelines
Definitions
of the
Metadata
Instances
Data files
Computation
metadata
Tools / Codes
metadata
Biocompute
workflow
Data formats
Ontologies
Data files
Results
Container
Stratified,
Shareable
Objects
Scientifically reliable
interpretation
Verifiable results within
acceptable uncertainty/error
Comparable results
Parameters
Configurations
Workflow
Provenance
Workflow
Engine
Algorithms,
Pipelines
Definitions
of the
Metadata
Instances
Data files
Computation
metadata
Tools / Codes
metadata
Biocompute
workflow
Data formats
Ontologies
Data files
Results
Container
Biocontainers
bio.tools
CWLViewer
Open standards,
commodity systems
Describe and run workflows, and the
command line tools they orchestrate,
supporting containers to be portable,
transparent and interoperable.
Describe the workflow inputs,
outputs, tools and data with
controlled vocabularies / ontologies
EDAM
Describe the provenance of
the workflow
Software components are
containerised to be portable
Workflow systems run the CWL workflow
Gathers the CWL workflow descriptions
together with rich context and provenance
using multi-tiered descriptions
Snapshots the workflow.
Relates it to other objects.
Uses archive formats to contain the object
A community-driven project
https://www.commonwl.org/
https://view.commonwl.org/workflows/github.com/mnneveau/cancer-genomics-workflow/blob/master/detect_variants/detect_variants.cwl
Manifest
CWL
Annotations
Under the hood
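This gathering step can be automated; a sketch assuming cwltool's --provenance option and a CWLProv-style layout with metadata/manifest.json (paths and options may differ between versions), reusing the files from the earlier CWL sketch:

```python
# Sketch: ask the reference CWL runner to capture a provenance research object
# for a run, then list what it aggregated. Assumes cwltool's --provenance
# option and a CWLProv-style layout; treat paths as illustrative.
import json
import pathlib
import subprocess

subprocess.run(
    ["cwltool", "--provenance", "run_ro", "count_lines.cwl", "job.yml"],
    check=True,
)

manifest_path = pathlib.Path("run_ro") / "metadata" / "manifest.json"
manifest = json.loads(manifest_path.read_text())

# Print the URIs the research object aggregates: workflow, inputs, outputs, provenance.
for item in manifest.get("aggregates", []):
    print(item.get("uri") if isinstance(item, dict) else item)
```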
FAIR Methods, different workflow systems & clouds
Living
Products
https://osf.io/h59uh/
Personalized medicine regulation
Standardize exchange of HTS workflows for regulatory submissions between
FDA, pharma, bioinformatics platform providers and researchers
Inspect and replicate the computational analytical workflow to review and
approve the bioinformatics
Domain-specific object model captures essential information without going into
the details of the actual execution.
A community-driven project
Emphasis on robust, safe reuse
Technical Reproducibility
packaging software and providing
required datasets
Human understanding of what has been done
higher level steps of the workflow, their
parameter spaces and algorithm settings
Alterovitz, Dean II, Goble, Crusoe, Soiland-Reyes et al Enabling Precision Medicine via standard communication
of NGS provenance, analysis, and results, biorxiv.org, 2017, https://doi.org/10.1101/191783
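A sketch of the shape of such a domain-specific object: a JSON record of the higher-level pipeline steps, parameter space, inputs/outputs and provenance, without low-level execution detail. Domain names loosely follow the published BioCompute structure; all values are illustrative placeholders.

```python
# Sketch of a BioCompute-style object for an HTS analysis. Domain names loosely
# follow the BioCompute structure; values are placeholders for illustration.
import json

bco = {
    "provenance_domain": {
        "name": "Tumor-normal exome variant calling (illustrative)",
        "version": "1.0",
        "contributors": [{"name": "Jane Researcher", "contribution": ["authoredBy"]}],
    },
    "description_domain": {
        "pipeline_steps": [
            {"step_number": 1, "name": "alignment", "version": "bwa 0.7.x"},
            {"step_number": 2, "name": "somatic variant calling", "version": "GATK 3.x"},
        ],
    },
    "parametric_domain": [
        {"step": 2, "param": "min_base_quality", "value": "20"},
    ],
    "io_domain": {
        "input_subdomain": [{"uri": "data/tumor.bam"}, {"uri": "data/normal.bam"}],
        "output_subdomain": [{"uri": "results/somatic_variants.vcf"}],
    },
}

print(json.dumps(bco, indent=2))
```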
analysis
and review
sample
archival
sequencing run
file transfer
regulation
computation
pipelines
produced files
are massive in
size
transfer is
slow
too large to keep
forever; not
standardized
difficult to
validate/verify
how can
industry and
FDA work
together to
avoid
mistakes?
HTS lifecycle: from a biological sample
to biomedical research and regulation
[Vahan Simonyan] FDA BAA contract HHSF223201510129C (PI: Raja Mazumder)
https://osf.io/h59uh/
https://doi.org/10.1101/191783
https://doi.org/10.1101/191783
identifiers.org
Under the hood
BioCompute Framework
to advance Regulatory Science to support NGS analysis
Emphasis on robust, safe reuse.
Describe and validate the
metadata of packages, and their
contents, both inside and outside
Standardise data formats and
elements and exchange of
Electronic Health Records
Describe and
validate analysis
workflows, to be
portable and
interoperable
Standardise and support
sharing and analysis of
Genomic data
Ontologies
Controlled vocabularies for
describing all of the above
APIs
Programmable interfaces for
accessing all of the above
Alterovitz, Dean II, Goble, Crusoe, Soiland-Reyes et al Enabling Precision Medicine via standard communication
of NGS provenance, analysis, and results, biorxiv.org, 2017, https://doi.org/10.1101/191783
http://sites.ieee.org/sagroups-2791/
Standardisation
Living Objectisms: grow, evolve, mutate
• RO life cycles
– Fixed snapshot
– Living objects
– Rot, mutate, clone
• Arose from workflow
sharing and preservation
• Research Objects are
analogous to software
artefacts and practices
rather than data or
articles
Snapshot Fork
Combine
Validate
Container
Manifest Profile
Descriptions
what
else
is needed
Dependencies
Versioning
its evolution
what
should
be there
Checklists
Provenance
where it
came from
ids
metadata that describes Research Object
general purpose to drive scalable infrastructure
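A toy sketch of a profile as a machine-checkable checklist, here enforced over the manifest.json written in the earlier Research Object sketch using JSON Schema; the schema is illustrative, not a published Research Object profile.

```python
# Toy sketch: a "profile" as a machine-checkable checklist of what should be
# in a research object manifest, enforced with JSON Schema.
import json
from jsonschema import ValidationError, validate  # pip install jsonschema

workflow_ro_profile = {
    "type": "object",
    "required": ["createdBy", "aggregates"],       # what should be there
    "properties": {
        "aggregates": {
            "type": "array",
            "minItems": 1,
            "items": {"type": "object", "required": ["uri"]},
        },
    },
}

with open("manifest.json") as handle:
    manifest = json.load(handle)

try:
    validate(instance=manifest, schema=workflow_ro_profile)
    print("manifest satisfies the profile checklist")
except ValidationError as err:
    print("missing or malformed:", err.message)
```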
All
Type Specific
Implementation
specific
Container
Manifest Profile
Descriptions
what
else
is needed
Dependencies
Versioning
its evolution
what
should
be there
Checklists
Provenance
where it
came from
ids
metadata that describes Research Object
Container Profile
Under the hood building blocks:
metadata that describes metadata
general purpose to drive scalable infrastructure
Manifest
Construction
Profile
Construction
IDENTIFIER
Many other kinds of objects
Multiple object types in an
investigation
Structured collections of objects
Physical objects, SOPs
These examples were Workflow Objects…
[Sansone]
Asthma Research e-Lab
[Phil Crouch, John
Ainsworth, Iain Buchan]
Chard et al: I'll take that to go: Big data bags and minimal identifiers for exchange of large, complex datasets, https://doi.org/10.1109/BigData.2016.7840618
DNase Hypersensitivity Analysis
using ENCODE (Encyclopedia of
DNA Elements) access, analysis
and publishing using Galaxy
images and
genome sequences
assembled from diverse
repositories
data distributed across
multiple locations,
referenced because big
and persisted, efficiently
and safely moved on
demand
Assemble and share large scale, multi-element
datasets.
[Chard, Kesselman, Foster, Madduri, 2016]
Richly structured
descriptions of content in the bag and outside it
Transfer and archive very large HTS datasets in a location-
independent way. Secure referencing and moving of patient data.
Big Data
collections of
arbitrary
referenced
content
annotations,
provenance,
relations
checksums
Simple, location independent
persistent identifiers
Define a dataset and its contents by enumerating its elements, regardless of their location
Verify and validate content
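A sketch of the idea using the BagIt layer that BDBags build on: package a directory with fixity checksums and reference a large remote file in fetch.txt rather than copying it. Metadata values, the remote URL and its size are illustrative, and real BDBag tooling also keeps the tag manifests in sync.

```python
# Sketch: package a directory as a BagIt bag (the layer BDBags build on) with
# fixity checksums, and reference a large remote file in fetch.txt instead of
# copying it. All names, URLs and sizes are illustrative.
import pathlib
import bagit  # pip install bagit

dataset = pathlib.Path("encode_analysis")
dataset.mkdir(exist_ok=True)
(dataset / "peaks.bed").write_text("chr1\t100\t200\n")   # small local result

bag = bagit.make_bag(
    str(dataset),
    {"Source-Organization": "Example Lab", "External-Identifier": "minid:example"},
    checksums=["sha256"],
)

# Big files stay where they are and are fetched on demand: URL, length, path.
fetch_entry = "https://example.org/encode/ENCFF000XYZ.bam 1073741824 data/ENCFF000XYZ.bam\n"
(pathlib.Path(bag.path) / "fetch.txt").write_text(fetch_entry)

print("Bag created at", bag.path, "Payload-Oxum:", bag.info.get("Payload-Oxum"))
```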
FAIR Data Commons
3. Everything is a research object: all
the (distributed) components of an
investigation (models, data,
pipelines, SOPs, provenance...) into
citable, exchangeable, publishable,
preserved, nested objects
1. Assemble and share
large scale, multi-
element datasets.
Secure referencing and
moving of patient data.
2. Reproduce, port,
share, and execute
HTS pipelines (and
other analytics …)
The Knowledge Object Reference Ontology (KORO): A formalism to support management and
sharing of computable biomedical knowledge for learning health systems
Flynn, Friedman, Boisvert, Landis‐Lewis, Lagoze (2018), https://doi.org/10.1002/lrh2.10054
Graphs of Research
Objects
Track Research
Objects
Combine and enrich
Research Objects
Learning Health Systems
International Efforts:
FAIR Life Science Data Infrastructure
• EGA in a Box for storing,
coordinating and distributing
human data
• Human Data Beacons discovery
service
• Authentication and
Authorization Infrastructure
Interoperability, Compute, Data,
Tools, Training
Tools and Workflow collaboratory
for EOSC
https://www.elixir-europe.org/use-cases/human-data
Summary: help knowledge turning
• Data Science is underpinned by data
access + transparent methods to
enable reproducible and FAIR
knowledge exchange.
• FAIR First.
• Research Objects as the currency of
reproducibility and exchange
• A bunch of tech, standards, tooling,
best practices, grass roots and
international activities going on.
• Tech isn’t the issue.
• e-Infrastructure matters. Please care
about it.
http://www.researchobject.org/ro2018/
Melissa Haendel, PhD
Director of Translational Data Science, Oregon
State University
Director of the Center for Data to Health,
Oregon Health & Science University
Acknowledgements
Barend Mons
Sean Bechhofer
Matthew Gamble
Raul Palma
Jun Zhao
Mark Robinson
Alan Williams
Norman Morrison
Stian Soiland-Reyes
Tim Clark
Alejandra Gonzalez-Beltran
Philippe Rocca-Serra
Ian Cottam
Susanna Sansone
Kristian Garza
Daniel Garijo
Catarina Martins
Iain Buchan
Michael Crusoe
Rob Finn
Carl Kesselman
Ian Foster
Kyle Chard
Vahan Simonyan
Ravi Madduri
Raja Mazumder
Gil Alterovitz
Denis Dean II
Durga Addepalli
Wouter Haak
Anita De Waard
Paul Groth
Oscar Corcho
Josh Sommer
Project ID: 675728