SlideShare a Scribd company logo
Data Provenance: Principles and Why it matters
for BioMedical Applications
PART-2: Tools and Techniques
Informatics for Health 2017 Preconference Tutorial
Vasa Curcin, Lecturer, HSCR & Department of Informatics KCL
Pinar Alper, Postdoc at HSCR KCL
22.Apr.2017
Manchester/UK
1
Part-2 Outline
Model
(Part-1)
Capture
(Part-2)
Manage
(Part-2)
W3C PROV
Different approaches in
Scientific Computing
Storage and Query Data model
Tools, APIs
Explore
2
Provenance Recap
Provenance has a Particular Subject & has a Standard Model:
• “information about entities, activities, and people involved in producing a
piece of data or thing…” (W3C PROV).
edit
wasGeneratedBy
page1
used
Wikipedia Editor
wasAssociatedWith
page2
entity activity agent
• PROV-DM Core vocabulary:
• Actor,Activity, Entity
• Causal relations among elements
• Conceptually a graph.
• Constraints:
• Typing: if two things are linked with actedOnBehalf of they
are of type actor
• Impossibility : activities and entities are disjoint.
Specialisation is not reflexive
• Ordering: Usage must occur between start and end event of
activities
• Human and Machine understandable representation:
• PROV-O,
• PROV-N
• Extensibility points 3
Part-2 Outline
Model
(Part-1)
Capture
(Part-2)
Manage
(Part-2)
W3C PROV
Different approaches in
Scientific Computing
Storage and Query Data model
Tools, APIs
Explore
4
Provenance Capture
• Rigorously studied in the context of Scientific Computing.
• Workflows
• Scripts
• Command-Line Tools
• System-Level (File and Operating System)
• Databases
• Templates
5
• Became popular in the recent decade.
• Pipelines of tasks with dataflow
dependencies.
• Automation, Resource Access Client
• Somewhat disruptive, need to wrap
resources.
• Provenance for the output scientific data:
• An outline of the method followed
• Resources used (repositories, tools,
services).
• Parameter configurations and
intermediary results.
Scientific Workflows
Analysis
Data
Analysis
Visualization
Analysis
Adaptation
Community
Data Repo
Community
Tools
Local
Tools & Data
6
parameter
parameter
Data Data
7
• Backward-Looking history
of what the workflow
engine observed during
the run:
• Data nodes with identifiers
minted by WF engine. Data
often stored separately.
• Task invocations,
timestamp.
• Data usage and generation
by tasks.
• Actor is primarily the WF
engine.
• Inferred provenance:
• Task causal dependencies
• Data causal dependencies
Workflows Execution Provenance
WF Provenance -Transparency
outputactivityinput
• Annotated grey-box provenance.
• Annotate workflow, get auto-annotated provenance
• Annotate provenance
outputactivityinput
BLAST Report
DNA SEquence Sequence
Alignment
8
• Wraps resources to incorporate into workflow, hence an “Observer”
perspective. In the most basic case this provides black-box provenance.
• We also have white-box provenance will come to that later!!
CL tool execution
Script execution
Web service invocation
prov:type
prov:type
prov:type
PROV Extensibility Points:
type
hadRole
attributes
hadRole
WF Provenance - Perspective
• Workflow Provenance is the first
to be referred to as “Prospective
Provenance” ‼
• Prospective provenance is often
viewed as the provenance by end-
users (scientists).
• Prospective Provenance can be
useful as an abstraction over
(bulky) retrospective provenance.
Prospective
Retrospective
lineage
9
10
Workflow Provenance
• End use
• Debugging
• Monitoring
• Present Results
Workflow Provenance
• End use
• Comparison
Exportable as OPM
11
Workflow Provenance
• End use
• Publishing analyses
12
A workflow may not necessarily be implemented by a
workflow system
13
• Provenance Challenge Series 1-4
• Provenance Challenge Workflow: 5-step computational process using Functional
Magnetic Resonance Imaging (fMRI) data.
• Have been realized with WF, tool, and system-level provenance
http://twiki.ipaw.info/bin/view/Challenge/FirstProvenanceChallenge
Domain data
Patterns of inquiry e.g. lineage
traversal
Scripts
• Popular, established method of data processing
• Integration into statistical, numerical analysis libraries
• Visualization libraries
• Researchers have recently started paying attention to script
provenance.
• Currently research prototypes rather than out-of-box features.
• Minimally disrupting existing practices. No technology change, No
wrapping!
14
Script Provenance - RDataTracker
Barbara Lerner and Emery Boose. RDataTracker: Collecting provenance
in an interactive scripting environment. In 6th USENIX Workshop on
the Theory and Practice of Provenance (TaPP 2014).
15
• Extend R scripts with logging
statements
• Provenance ON/OFF:
ddg.init, ddg.save
• Abstract multiple commands:
ddg.procedure
• Post-execution visualize the
Data Derivation Graph
Process
Data
Data flow
Control flow
https://github.com/End-to-end-provenance/RDataTracker
Script Provenance- YesWorkflow
16
• Annotate method declarations in R,
Matlab scripts
• Makes “latent workflow information
from scripts explicit”.
• Prospective provenance, a workflow
abstraction over the script.
• Visualized in process, data and
combined views
• @begin, @end
• @in, @out, @as
• Inputs/outputs can be concrete files
• Or can refer to prospective
resources identified by templates.
Timothy M. McPhillips, Tianhong Song et al. YesWorkflow: A User-Oriented, Language-Independent Tool for Recovering Workflow Information from Scripts. CoRR, abs/1502.02403, 2015.
Script Provenance - YesWorkflow
Visualization of annotations – combined view
17
https://github.com/yesworkflow-org/
Script Provenance - noWorkflow
• Definition: Analyze the AST of
Python script.
• Arguments, function calls, global vars
• User defined functions
• Deployment: Environment and
library dependencies.
• Uses Python os, socket and
platform libraries
• Collected right before script execution
Leonardo Murta, Vanessa Braganholo, Fernando Chirigati, David Koop, and Juliana Freire. noWorkflow: Capturing and analyzing provenance of scripts. In 5th International Provenance and
Anno- tation Workshop (IPAW), Cologne, Jun 2014.
18
Script Provenance -noWorkflow
• Execution: Runtime information
• Uses python’s runtime profiling and reflection capabilities
• the start time of the function activation, together with the values of every argument,
return, and globals
• open() csystem call to record depedency to files and snapshot file contents at the
time of call
19
https://github.com/gems-uff/noworkflow
Script Provenance -noWorkflow
• A lot of information!
• Analysis
1. Highly summarized Activation Graph. Nodes in one control flow block (e.g. loop) merged.
2. Diff two executions (report diff in dates, argument values, environment settings)
3. Query in Prolog access_influence(File, ’output.png’)
20
https://github.com/gems-uff/noworkflow
Script Provenance -CXXR
• R Interpreter with audit features
• Tracks provenance for bindings generated in the user workspace
• provenance(x): Returns a list comprising: expression, symbol, timestamp, parents, children
• pedigree(x): Displays the sequence of commands issued, which results in x’s current state
21
https://github.com/timothyjgraham/cxxr
Script Provenance
22
• End-use:
• Minimally disruptive, but, yet to prove its utility.
• Black-box provenance.
Too fine grained process abstraction?
Are variables, arguments or files a
useful data abstractions?
Command Line Tools
• Popular interface for many scientific analysis libraries
• Web-Based Reproducibility Frameworks, a layer between User and OS
command shell.
• Register tools and with metadata to framework. Grey-box provenance.
• Layer directs tool execution
• Layer records provenance
Sumatra: automated tracking of scientific computations.https://pythonhosted.org/Sumatra/
Galaxy Data intensive biology for everyone https://galaxyproject.org/
23
Tool Provenance
• End use:
• Convert exploratory steps into workflows
• Compare different invocations
24
Database Provenance
• Databases widely used for
managing scientific resource
metadata:
• Data Catalogs
• Service Registries
• White-box Provenance
• Why: witness tuples
• How: the way witness tuple contribute
to result.
• Where: cell-level data copying from
witness tuples to result
James Cheney, Laura Chiticariu, and Wang Chiew Tan. Provenance in databases: Why, how, and where. Foundations and Trends in
Databases, 1(4):379–474, 2009.
25
Database Provenance
• Why provenance:
• Query debugging: Why did I get this result?
• View maintenance: If I update this record, do I need to refresh this
(materialized) view?
• How provenance:
• Trust and uncertainty computation
• Where provenance:
• Annotation propagation
• If desired white-box provenance can be computed post query
execution, when needed.
• Limited to research prototypes
26
http://infolab.stanford.edu/trio/
System-Level Provenance
• Every application runs on some OS and uses some storage system
• Not disruptive Zero modification to applications.
• An audit mode for systems, explore feasibility and overhead.
27
System-Level Provenance - SPADE
• A consumer of audit APIS of different OS (Windows, Mac OS, Linux, Android)
• Collects process information: name, owner, creation time, command line,
environment vars, files read/written.
• Reporters:
• Default for each OS. E.g. in Linux exec(), fork(), clone(), exit(), open(),
close(), read(), write(), clone(), truncate() is picked up by reporters
• Domain specific reporters. Add onto default.
• Reporting overhead is variable, dependes on nature of application and target OS.
• Compile Apache Web Server on Windows (53%)
• Run BLAST tool on Linux (5%), on MacOS (10%)
• OPM compliant
A. Gehani and D. Tariq. SPADE: Support for Provenance Auditing in Distributed Environments. In Middleware 2012 - Proceedings, pages 101–120, 2012.
28
System-level Provenance - PASS
• Modified Linux Kernel. Relies on the tracking of system calls.
• Focuses on files a detailed change record.
29
• For each file in the filesystem,
• The executable that created it
• Any input files
• “Complete” hardware platform
description
• Command line
• Process environment
• Other data such as random seeds
> sort a > b
Kiran-Kumar Muniswamy-Reddy, David A. Holland, Uri Braun, and Margo Seltzer. Provenance-aware storage systems. In Proceedings of the Annual Conference on
USENIX ’06 Annual Technical Conference, ATEC ’06, pages 4–4.
System-Level Provenance - PASS
• End Use
• Script generation
• Generate a Makefile that reproduces a file
• Detecting system changes
• Compare provenance of two files to detect changes in
environment, libraries, etc.
• Intrusion detection
• Detailed logs of how objects have changed
• Very fine grained provenance!!
• 5000 Objects in response to a lineage inquiry over trace with 50
elements.
30
To this end
• We focused on capturing provenance from a particular computational
system
• System dictates granularity
Data: Tuples, files, variables
Activity: Query operator, sys-call, wf activity
• System dictates transparency
White-box for DB vs Black/Grey Box for WFs and Tools
31
In reality
• We have heterogeneity and legacy:
• We use heterogeneous technologies.
• Provenance may be recorded in mixed granularities.
• Existing audit capabilities and provenance-like information recorded.
• Audited processes can span long time frames.
32
DB
Consider the case
33
Medical Practice
Management
(MPM)
A global
Provenance
Picture
Designing a decision aid to improve secondary prevention for stroke survivors with multimorbidity: A stakeholder engagement study (T Porat, I Marshall, E Sadler, MA Vadillo, V Curcin, C
McKevitt, C Wolfe) - Abstract presentationat Farr 2017 Tuesday 4pm
portal usage info
authentication trace
clinician
patient
• Have there been an update on patient’s EHR in MPM between initial assessment and follow up in SDA
• Do patients use any part of the portal after initial assessment activity in SDA?
• What are all the prior authentication records that lead to MPM activities updating this erroneous EHR.
• What is the average time between initial assessments and follow-ups?
DB
Stroke Decision
Aid (SDA)
Log files
Log files
Authentication
Service
Portal
A Solution Approach – Provenance Templates
• Focus on Provenance rather than its capture from a specific system.
• How can we combine provenance of multiple sources in a controlled manner?
• Focus on incorporating Domain-Specific information
• Heterogeneous systems interoperate via domain ontologies. How can we obtain
domain annotated provenance?
• Two groups actively working on it
34
“Templates as a method for implementing data provenance in
decision support systems.” Vasa Curcin, Elliot Fairweather,
Roxana Danger, Derek Corrigan. Journal of Biomedical
Informatics 65 (2017) 1-21
https://bitbucket.org/kclbig/templates
https://provenance.ecs.soton.ac.uk/prov-template/
In publication.
var:input
var:output
var:assessment
wasGeneratedBy
ex:guideline1
var:clinician
wasAssociatedWith
used
ex:practice1
actedOnBehalfOf
ex:startTime=“vvar:start”
ex:endTime=“vvar:end”
used
Template: A provenance graph with abstract parts
Clinical Decision Support TEMPLATE
35
variable
value variable
document
prefix ex <http://example.org>
prefix var <http://pgt.inf.kcl.ac.uk/var>
prefix vvar <http://pgt.inf.kcl.ac.uk/vvar>
entity(var:inSymbol)
entity(ex:guideline1)
activity(var:assessment)
used(var:assessment,ex:guideline1, -, [ex:startTime=”vvar:start”, ...])
......
......
endDocument
var:input
var:output
var:assessment
wasGeneratedBy
ex:guideline1
var:clinician
wasAssociatedWith
used
ex:practice1
actedOnBehalfOf
type=myOntology#StrokeAssessment
startTime=“vvar:start”
endTime=“vvar:end”
used
• Node semantic types
Domain Specific Annotations
type=myOntology#StrokeRisk
level=“vvar:riskLevel”
Clinical Decision Support TEMPLATE
36
var:a
var:c
var:b
wasGeneratedBy
var:x
wasAssociatedWith
used
Zone
• A connected sub-graph of a
template graph, which can be
instantiated multiple times
• Series/Parallel type of zones
• Restrictions on Minimum and
Maximum number of
instantiations
var:d
wasDerviedFrom
id=zone1
type=parallel
min=1
max=5
Example TEMPLATE
37
Zone variable
External
variable
entity0
entity1
activity1
wasGeneratedBy
agent1
wasAssociatedWith
used
activity2
wasGeneratedBy wasGeneratedBy
• Parallel Zone instantiated 3 times
used
used
wasAssociatedWith
wasAssociatedWith
entity4
wasDerviedFrom
wasDerviedFrom
wasDerviedFrom
38
entity2
entity3
activity3
entity0
entity1
activity1
wasGeneratedBy
agent1
wasAssociatedWith
used
activity2
used
entity2
wasDerviedFrom
wasGeneratedBy
entity3• Series Zone instantiated 2 times
39
Template Instantiation
Generation
Valid Substitutions
Template Instance
Template
(PROV-
N)
(PROV-
N)
(PROV-
N)
• Single-Step : Template Base+ All Zones
• Incremental Generation
1. Template Base
2. Zone
3. Zone
4. …
40
What constitutes a Valid Substitution?
• For Template Base
• Distinct values among all external variables
• Distinct values among the global node if space for non-graft variables
• Values for all value variables
• For each Zone substitution
• Distinct values for all zone variables among each other and among the global
node id space
• Values for all value variables
41
entity1
entity3
activity1
wasGeneratedBy
agent1
wasAssociatedWith
used
entity4
activity2
wasGeneratedBy
entity5
activity3
wasGeneratedBy
• Parallel Zone incremental instantiation
used
used
wasAssociatedWith
wasAssociatedWith
entity2
wasDerivedFrom
wasDerivedFrom
wasDerivedFrom
42
var:output
var:assessment
wasGeneratedBy
guideline1
var:clinician
wasAssociatedWith
used
practice1
actedOnBehalfOf
• Nodes shared among instantiations
• In real-life, actors or entities may participate in processes multiple times
Template – Graft nodes
GRAFT NODE
Clinical Decision Support TEMPLATE
43
result1
assessment1
guideline1
doctor1
wasAssociatedWith
used
practice1
actedOnBehalfOf
doctor2
wasGeneratedBy
actedOnBehalfOf
result2
wasGeneratedBy
result3
used
wasGeneratedBy
result4
used
wasGeneratedBy
used
doctor3
actedOnBehalfOf
wasAssociatedWith
wasAssociatedWith
44
wasAssociatedWith
Graft nodes in concrete graph
assessment2
assessment3
assessment4
Recurring node
Architectural Approach
Provenance
Authentication
Clinical Decision Support
Portal
Authentication
Service
Application
Template
Server
TEMPLATES
• Streams of substitutions from disparate heterogeneous systems 45
46
In summary, the benefits of Templates are:
1. Enforce structure over provenance
• Templates can be seen as
Provenance Schemas. Acceptable
patterns of provenance.
• Strict : there is no optionality. If a
variable is included in a template
bindings will be expected for it.
• Loose: Templates may not encode all
possible patterns. (Grafts)
• Prospective Provenance in its true
form!
47
2. Go beyond single-system restriction
• Combine provenance in different granularities- importance of identification.
• User identity
• EHR associated with identity. Track CRUD at EHRlevel.
• Fields of EHR relevant for provenance. Track Updates at selected HER field level
• Build up provenance over time
1. Authentication
2. Activity on portal
3. Activity within application
4. Authentication
5. Activity on portal
6. Activity within application
…..
....
48
3. Support other forms of Provenance
• Provenance can be found in other forms (XLS files, logs)
• We can create substitutions from these other forms
• Hence migrate legacy provenance to a standards compliant (graph-
based) representation
49
Rows  Substitutions
Columns  Variables
MIAME TEMPLATE
Generation
Minimum Information for Biological and Biomedical Investigations
4. Possibility of devising Model-Driven techniques
for introducing provenance support to existing tools
• Identify parts of the system (Actor, Entity, Activity) that are relevant for
provenance tracking
• Data Flow Diagrams to understand the data entities
• Use cases that should be tracked
• Create sample (concrete) provenance
• Refactor templates from concrete provenance
50
• Map templates to service end points
on the template server
• Add service client code to existing
tools
• Map templates to (Java) beans
• Add bean population code to existing
tools
Part-2 Outline
Model
(Part-1)
Capture
(Part-2)
Manage
(Part-2)
W3C PROV
Different approaches in
Scientific Computing
Storage and Query Data model
Tools, APIs
Explore
51
Querying Provenance
Most common pattern of access:
1. Locate node(s) of interest:
• based on attributes. e.g. Process nodes of a particular name, Data nodes produced after
some date, nodes with timestamp in a range
• based on involved in relationship(s) (consumption, generation). e.g. Data generated by a
particular buggy process.
2. Traverse lineage, causation (w transitive closure)
locate
traverse
lineage
52
A Less common pattern of access:
Locate sub-graph of interest:
• These are advanced queries that may involve regular path expressions
Querying Provenance
locate
subgraph
locate
53
Data Model for Provenance
• Relational
• Traversal require several self-joins in SQL, can be costly in case of deep lineage
traces.
• Yet still very popular!
• Joint querying of application data and provenance
• Graph
• Graph DB neo4j
• Cypher Query Language (Variable length relations) +Free text search
• Visualization
• Triple stores:
• W3C SPARQL based querying + free text search
• Ontology integration
• Reasoning and Rules support,
54
http://sig.biostr.washington.edu/projects/ontviews/gleen/index.htmlhttps://neo4j.com/product/
http://rdf4j.org/
https://jena.apache.org/
Exploring provenance graphs
• Browsing Layers of provenance
• WF Design - Manual
• (Semi) automated - grouping
Manish Kumar Anand ; Shawn Bowers ; Bertram
Ludäscher. Provenance browser: Displaying and
querying scientific workflow provenance graphs.
Proceedings of ICDE 2010 Conference.
Interactive Visualization of Provenance
Graphs for Reproducible Biomedical
Research. Stefan Lugeret al. 5th Symposium
on Biological Data Visualization.
55
Extract a sub-graph priot to
browsing
Olivier Biton, Sarah Cohen-Boulakia, Susan B. Davidson, and
Carmem S. Hara. Querying and Managing Provenance through
User Views in Scientific Workflows. pages 1072–1081, ICDE 2008.
Exploring provenance graphs
• Aggregating graph elements
Aggregation by Provenance Types: A Technique forSummarising Provenance
Graphs. Luc Moreau. In Proceedings GaM 2015, arXiv:1504.02448
Rinke Hoekstra, Paul Groth. PROV-O-Viz - Understanding the Role of Activities in
Provenance. IPAW 2014: pp 215-220 56
Exploring provenance
• A move towards customized visualizations:
E.g. Provenance of annotation sentence “inactivated by cyanide”. It
originates in TrEMBL, but ends up in Swiss-Prot.
Bell MJ, Collison M, Lord P (2013) Can Inferred Provenance and Its Visualisation Be Used to Detect Erroneous Annotation?
A Case Study Using UniProtKB. PLoS ONE 8(10): e75541. https://doi.org/10.1371/journal.pone.0075541
57
PROV implementations
58The full list is at: https://www.w3.org/2011/prov/wiki/ProvImplementations
59
Komadu
• Collect and Query API as a web service
• Visualize provenance, export to CSV format viewable in Cytoscape
• Export as PROV-XML
60
https://github.com/Data-to-Insight-Center/komadu/blob/master/docs/KomaduUserGuide.pdf
The Oh-Yeah Button
61
• PROV-AQ Implementation
• Browser add on
http://users.ugent.be/~tdenies/OhYeah/
PROV-Constraints
62
https://github.com/jamescheney/prov-constraints
https://github.com/pgroth/prov-check
PROV Generating Systems -Workflows
63
PROV Generating Systems -Git
64
http://git2prov.org
Sample PROV Datasets
65
PROV BENCH, PROV RECONSTRUCTION
1st ProvBench: Benchmarking Provenance Management Systems. In K. Belhajjame and J. Zhao,
editors, Proceedings of the Joint EDBT/ICDT 2013 Workshops.
Data2Semantics. Provenance reconstruction challenge. http:// data2semantics.github.io/ 2014.
2nd ProvBench: Benchmarking Provenance Management Systems. K. Belhajjame and A.
Chapman. https://sites.google.com/ site/provbench/home/provbench- provenance-week-
2014, 2014.
Follow State of the Art in Provenance
IPAW and TAPP have been held jointly as Provenance Weeks.
IPAW 2018 will be held at King’s College London !!
(Every two years)
(Annual)
66
67
Acknowledgements
• ProvTemp EPSRC Project (EP/N027426/1). Vasa Curcin, PinarAlper,
Elliot Fairweather.
• Innovate UK Project EmProv. Vasa Curcin, Shen Xu.
Ad

More Related Content

What's hot (20)

Patterns of Streaming Applications
Patterns of Streaming ApplicationsPatterns of Streaming Applications
Patterns of Streaming Applications
C4Media
 
Big Data Science with H2O in R
Big Data Science with H2O in RBig Data Science with H2O in R
Big Data Science with H2O in R
Anqi Fu
 
Ga4 gh meeting at the the sanger institute
Ga4 gh meeting at the the sanger instituteGa4 gh meeting at the the sanger institute
Ga4 gh meeting at the the sanger institute
Matt Massie
 
The Materials Project - Combining Science and Informatics to Accelerate Mater...
The Materials Project - Combining Science and Informatics to Accelerate Mater...The Materials Project - Combining Science and Informatics to Accelerate Mater...
The Materials Project - Combining Science and Informatics to Accelerate Mater...
University of California, San Diego
 
PTU: Using Provenance for Repeatability
PTU: Using Provenance for RepeatabilityPTU: Using Provenance for Repeatability
PTU: Using Provenance for Repeatability
Tanu Malik
 
High Performance Machine Learning in R with H2O
High Performance Machine Learning in R with H2OHigh Performance Machine Learning in R with H2O
High Performance Machine Learning in R with H2O
Sri Ambati
 
Scalable Parallel Programming in Python with Parsl
Scalable Parallel Programming in Python with ParslScalable Parallel Programming in Python with Parsl
Scalable Parallel Programming in Python with Parsl
Globus
 
Genome Analysis Pipelines with Spark and ADAM
Genome Analysis Pipelines with Spark and ADAMGenome Analysis Pipelines with Spark and ADAM
Genome Analysis Pipelines with Spark and ADAM
Allen Day, PhD
 
Why is Bioinformatics a Good Fit for Spark?
Why is Bioinformatics a Good Fit for Spark?Why is Bioinformatics a Good Fit for Spark?
Why is Bioinformatics a Good Fit for Spark?
Timothy Danford
 
Good Practices for Developing Scientific Software Frameworks: The WRENCH fram...
Good Practices for Developing Scientific Software Frameworks: The WRENCH fram...Good Practices for Developing Scientific Software Frameworks: The WRENCH fram...
Good Practices for Developing Scientific Software Frameworks: The WRENCH fram...
Rafael Ferreira da Silva
 
Tdwg14 fp-kurator-ludaescher
Tdwg14 fp-kurator-ludaescherTdwg14 fp-kurator-ludaescher
Tdwg14 fp-kurator-ludaescher
Bertram Ludäscher
 
The Materials API
The Materials APIThe Materials API
The Materials API
University of California, San Diego
 
Data cleaning with the Kurator toolkit: Bridging the gap between conventional...
Data cleaning with the Kurator toolkit: Bridging the gap between conventional...Data cleaning with the Kurator toolkit: Bridging the gap between conventional...
Data cleaning with the Kurator toolkit: Bridging the gap between conventional...
Timothy McPhillips
 
Search at Twitter: Presented by Michael Busch, Twitter
Search at Twitter: Presented by Michael Busch, TwitterSearch at Twitter: Presented by Michael Busch, Twitter
Search at Twitter: Presented by Michael Busch, Twitter
Lucidworks
 
Group meeting: UniSan - Proactive Kernel Memory Initialization to Eliminate D...
Group meeting: UniSan - Proactive Kernel Memory Initialization to Eliminate D...Group meeting: UniSan - Proactive Kernel Memory Initialization to Eliminate D...
Group meeting: UniSan - Proactive Kernel Memory Initialization to Eliminate D...
Yu-Hsin Hung
 
Lightning fast genomics with Spark, Adam and Scala
Lightning fast genomics with Spark, Adam and ScalaLightning fast genomics with Spark, Adam and Scala
Lightning fast genomics with Spark, Adam and Scala
Andy Petrella
 
Efficient top-k queries processing in column-family distributed databases
Efficient top-k queries processing in column-family distributed databasesEfficient top-k queries processing in column-family distributed databases
Efficient top-k queries processing in column-family distributed databases
Rui Vieira
 
H2O World - Sparkling Water - Michal Malohlava
H2O World - Sparkling Water - Michal MalohlavaH2O World - Sparkling Water - Michal Malohlava
H2O World - Sparkling Water - Michal Malohlava
Sri Ambati
 
Ontology-based multi-domain metadata for research data management using tripl...
Ontology-based multi-domain metadata for research data management using tripl...Ontology-based multi-domain metadata for research data management using tripl...
Ontology-based multi-domain metadata for research data management using tripl...
João Rocha da Silva
 
Spark Summit East 2015
Spark Summit East 2015Spark Summit East 2015
Spark Summit East 2015
Timothy Danford
 
Patterns of Streaming Applications
Patterns of Streaming ApplicationsPatterns of Streaming Applications
Patterns of Streaming Applications
C4Media
 
Big Data Science with H2O in R
Big Data Science with H2O in RBig Data Science with H2O in R
Big Data Science with H2O in R
Anqi Fu
 
Ga4 gh meeting at the the sanger institute
Ga4 gh meeting at the the sanger instituteGa4 gh meeting at the the sanger institute
Ga4 gh meeting at the the sanger institute
Matt Massie
 
The Materials Project - Combining Science and Informatics to Accelerate Mater...
The Materials Project - Combining Science and Informatics to Accelerate Mater...The Materials Project - Combining Science and Informatics to Accelerate Mater...
The Materials Project - Combining Science and Informatics to Accelerate Mater...
University of California, San Diego
 
PTU: Using Provenance for Repeatability
PTU: Using Provenance for RepeatabilityPTU: Using Provenance for Repeatability
PTU: Using Provenance for Repeatability
Tanu Malik
 
High Performance Machine Learning in R with H2O
High Performance Machine Learning in R with H2OHigh Performance Machine Learning in R with H2O
High Performance Machine Learning in R with H2O
Sri Ambati
 
Scalable Parallel Programming in Python with Parsl
Scalable Parallel Programming in Python with ParslScalable Parallel Programming in Python with Parsl
Scalable Parallel Programming in Python with Parsl
Globus
 
Genome Analysis Pipelines with Spark and ADAM
Genome Analysis Pipelines with Spark and ADAMGenome Analysis Pipelines with Spark and ADAM
Genome Analysis Pipelines with Spark and ADAM
Allen Day, PhD
 
Why is Bioinformatics a Good Fit for Spark?
Why is Bioinformatics a Good Fit for Spark?Why is Bioinformatics a Good Fit for Spark?
Why is Bioinformatics a Good Fit for Spark?
Timothy Danford
 
Good Practices for Developing Scientific Software Frameworks: The WRENCH fram...
Good Practices for Developing Scientific Software Frameworks: The WRENCH fram...Good Practices for Developing Scientific Software Frameworks: The WRENCH fram...
Good Practices for Developing Scientific Software Frameworks: The WRENCH fram...
Rafael Ferreira da Silva
 
Data cleaning with the Kurator toolkit: Bridging the gap between conventional...
Data cleaning with the Kurator toolkit: Bridging the gap between conventional...Data cleaning with the Kurator toolkit: Bridging the gap between conventional...
Data cleaning with the Kurator toolkit: Bridging the gap between conventional...
Timothy McPhillips
 
Search at Twitter: Presented by Michael Busch, Twitter
Search at Twitter: Presented by Michael Busch, TwitterSearch at Twitter: Presented by Michael Busch, Twitter
Search at Twitter: Presented by Michael Busch, Twitter
Lucidworks
 
Group meeting: UniSan - Proactive Kernel Memory Initialization to Eliminate D...
Group meeting: UniSan - Proactive Kernel Memory Initialization to Eliminate D...Group meeting: UniSan - Proactive Kernel Memory Initialization to Eliminate D...
Group meeting: UniSan - Proactive Kernel Memory Initialization to Eliminate D...
Yu-Hsin Hung
 
Lightning fast genomics with Spark, Adam and Scala
Lightning fast genomics with Spark, Adam and ScalaLightning fast genomics with Spark, Adam and Scala
Lightning fast genomics with Spark, Adam and Scala
Andy Petrella
 
Efficient top-k queries processing in column-family distributed databases
Efficient top-k queries processing in column-family distributed databasesEfficient top-k queries processing in column-family distributed databases
Efficient top-k queries processing in column-family distributed databases
Rui Vieira
 
H2O World - Sparkling Water - Michal Malohlava
H2O World - Sparkling Water - Michal MalohlavaH2O World - Sparkling Water - Michal Malohlava
H2O World - Sparkling Water - Michal Malohlava
Sri Ambati
 
Ontology-based multi-domain metadata for research data management using tripl...
Ontology-based multi-domain metadata for research data management using tripl...Ontology-based multi-domain metadata for research data management using tripl...
Ontology-based multi-domain metadata for research data management using tripl...
João Rocha da Silva
 

Similar to "Data Provenance: Principles and Why it matters for BioMedical Applications" (20)

Advances in Scientific Workflow Environments
Advances in Scientific Workflow EnvironmentsAdvances in Scientific Workflow Environments
Advances in Scientific Workflow Environments
Carole Goble
 
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
Ilkay Altintas, Ph.D.
 
2016-10-20 BioExcel: Advances in Scientific Workflow Environments
2016-10-20 BioExcel: Advances in Scientific Workflow Environments2016-10-20 BioExcel: Advances in Scientific Workflow Environments
2016-10-20 BioExcel: Advances in Scientific Workflow Environments
Stian Soiland-Reyes
 
Introduction to Galaxy and RNA-Seq
Introduction to Galaxy and RNA-SeqIntroduction to Galaxy and RNA-Seq
Introduction to Galaxy and RNA-Seq
Enis Afgan
 
Spark Summit EU talk by Ruben Pulido and Behar Veliqi
Spark Summit EU talk by Ruben Pulido and Behar VeliqiSpark Summit EU talk by Ruben Pulido and Behar Veliqi
Spark Summit EU talk by Ruben Pulido and Behar Veliqi
Spark Summit
 
Spark Summit - Watson Analytics for Social Media: From single tenant Hadoop t...
Spark Summit - Watson Analytics for Social Media: From single tenant Hadoop t...Spark Summit - Watson Analytics for Social Media: From single tenant Hadoop t...
Spark Summit - Watson Analytics for Social Media: From single tenant Hadoop t...
Behar Veliqi
 
Proactive ops for container orchestration environments
Proactive ops for container orchestration environmentsProactive ops for container orchestration environments
Proactive ops for container orchestration environments
Docker, Inc.
 
2013 06-24 Wf4Ever: Annotating research objects (PPTX)
2013 06-24 Wf4Ever: Annotating research objects (PPTX)2013 06-24 Wf4Ever: Annotating research objects (PPTX)
2013 06-24 Wf4Ever: Annotating research objects (PPTX)
Stian Soiland-Reyes
 
2013 06-24 Wf4Ever: Annotating research objects (PDF)
2013 06-24 Wf4Ever: Annotating research objects (PDF)2013 06-24 Wf4Ever: Annotating research objects (PDF)
2013 06-24 Wf4Ever: Annotating research objects (PDF)
Stian Soiland-Reyes
 
Globus Labs: Forging the Next Frontier
Globus Labs: Forging the Next FrontierGlobus Labs: Forging the Next Frontier
Globus Labs: Forging the Next Frontier
Globus
 
Spark Summit EU talk by Ruben Pulido Behar Veliqi
Spark Summit EU talk by Ruben Pulido Behar VeliqiSpark Summit EU talk by Ruben Pulido Behar Veliqi
Spark Summit EU talk by Ruben Pulido Behar Veliqi
Spark Summit
 
Building and deploying LLM applications with Apache Airflow
Building and deploying LLM applications with Apache AirflowBuilding and deploying LLM applications with Apache Airflow
Building and deploying LLM applications with Apache Airflow
Kaxil Naik
 
Interpreting Performance Test Results
Interpreting Performance Test ResultsInterpreting Performance Test Results
Interpreting Performance Test Results
Eric Proegler
 
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Herman Wu
 
Distributed Tracing with Jaeger
Distributed Tracing with JaegerDistributed Tracing with Jaeger
Distributed Tracing with Jaeger
Inho Kang
 
Opentracing jaeger
Opentracing jaegerOpentracing jaeger
Opentracing jaeger
Oracle Korea
 
Sharing massive data analysis: from provenance to linked experiment reports
Sharing massive data analysis: from provenance to linked experiment reportsSharing massive data analysis: from provenance to linked experiment reports
Sharing massive data analysis: from provenance to linked experiment reports
Gaignard Alban
 
04 open source_tools
04 open source_tools04 open source_tools
04 open source_tools
Marco Quartulli
 
Workshop: Big Data Visualization for Security
Workshop: Big Data Visualization for SecurityWorkshop: Big Data Visualization for Security
Workshop: Big Data Visualization for Security
Raffael Marty
 
RDF Validation in a Linked Data World - A vision beyond structural and value ...
RDF Validation in a Linked Data World - A vision beyond structural and value ...RDF Validation in a Linked Data World - A vision beyond structural and value ...
RDF Validation in a Linked Data World - A vision beyond structural and value ...
Nandana Mihindukulasooriya
 
Advances in Scientific Workflow Environments
Advances in Scientific Workflow EnvironmentsAdvances in Scientific Workflow Environments
Advances in Scientific Workflow Environments
Carole Goble
 
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
Ilkay Altintas, Ph.D.
 
2016-10-20 BioExcel: Advances in Scientific Workflow Environments
2016-10-20 BioExcel: Advances in Scientific Workflow Environments2016-10-20 BioExcel: Advances in Scientific Workflow Environments
2016-10-20 BioExcel: Advances in Scientific Workflow Environments
Stian Soiland-Reyes
 
Introduction to Galaxy and RNA-Seq
Introduction to Galaxy and RNA-SeqIntroduction to Galaxy and RNA-Seq
Introduction to Galaxy and RNA-Seq
Enis Afgan
 
Spark Summit EU talk by Ruben Pulido and Behar Veliqi
Spark Summit EU talk by Ruben Pulido and Behar VeliqiSpark Summit EU talk by Ruben Pulido and Behar Veliqi
Spark Summit EU talk by Ruben Pulido and Behar Veliqi
Spark Summit
 
Spark Summit - Watson Analytics for Social Media: From single tenant Hadoop t...
Spark Summit - Watson Analytics for Social Media: From single tenant Hadoop t...Spark Summit - Watson Analytics for Social Media: From single tenant Hadoop t...
Spark Summit - Watson Analytics for Social Media: From single tenant Hadoop t...
Behar Veliqi
 
Proactive ops for container orchestration environments
Proactive ops for container orchestration environmentsProactive ops for container orchestration environments
Proactive ops for container orchestration environments
Docker, Inc.
 
2013 06-24 Wf4Ever: Annotating research objects (PPTX)
2013 06-24 Wf4Ever: Annotating research objects (PPTX)2013 06-24 Wf4Ever: Annotating research objects (PPTX)
2013 06-24 Wf4Ever: Annotating research objects (PPTX)
Stian Soiland-Reyes
 
2013 06-24 Wf4Ever: Annotating research objects (PDF)
2013 06-24 Wf4Ever: Annotating research objects (PDF)2013 06-24 Wf4Ever: Annotating research objects (PDF)
2013 06-24 Wf4Ever: Annotating research objects (PDF)
Stian Soiland-Reyes
 
Globus Labs: Forging the Next Frontier
Globus Labs: Forging the Next FrontierGlobus Labs: Forging the Next Frontier
Globus Labs: Forging the Next Frontier
Globus
 
Spark Summit EU talk by Ruben Pulido Behar Veliqi
Spark Summit EU talk by Ruben Pulido Behar VeliqiSpark Summit EU talk by Ruben Pulido Behar Veliqi
Spark Summit EU talk by Ruben Pulido Behar Veliqi
Spark Summit
 
Building and deploying LLM applications with Apache Airflow
Building and deploying LLM applications with Apache AirflowBuilding and deploying LLM applications with Apache Airflow
Building and deploying LLM applications with Apache Airflow
Kaxil Naik
 
Interpreting Performance Test Results
Interpreting Performance Test ResultsInterpreting Performance Test Results
Interpreting Performance Test Results
Eric Proegler
 
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Herman Wu
 
Distributed Tracing with Jaeger
Distributed Tracing with JaegerDistributed Tracing with Jaeger
Distributed Tracing with Jaeger
Inho Kang
 
Opentracing jaeger
Opentracing jaegerOpentracing jaeger
Opentracing jaeger
Oracle Korea
 
Sharing massive data analysis: from provenance to linked experiment reports
Sharing massive data analysis: from provenance to linked experiment reportsSharing massive data analysis: from provenance to linked experiment reports
Sharing massive data analysis: from provenance to linked experiment reports
Gaignard Alban
 
Workshop: Big Data Visualization for Security
Workshop: Big Data Visualization for SecurityWorkshop: Big Data Visualization for Security
Workshop: Big Data Visualization for Security
Raffael Marty
 
RDF Validation in a Linked Data World - A vision beyond structural and value ...
RDF Validation in a Linked Data World - A vision beyond structural and value ...RDF Validation in a Linked Data World - A vision beyond structural and value ...
RDF Validation in a Linked Data World - A vision beyond structural and value ...
Nandana Mihindukulasooriya
 
Ad

Recently uploaded (20)

Chromatography, types, techniques, ppt.pptx
Chromatography, types, techniques, ppt.pptxChromatography, types, techniques, ppt.pptx
Chromatography, types, techniques, ppt.pptx
Dr Showkat Ahmad Wani
 
Lipids: Classification, Functions, Metabolism, and Dietary Recommendations
Lipids: Classification, Functions, Metabolism, and Dietary RecommendationsLipids: Classification, Functions, Metabolism, and Dietary Recommendations
Lipids: Classification, Functions, Metabolism, and Dietary Recommendations
Sarumathi Murugesan
 
biochemistry amino acid from chemistry to life machinery
biochemistry amino acid from chemistry to life machinerybiochemistry amino acid from chemistry to life machinery
biochemistry amino acid from chemistry to life machinery
chaitanyaa4444
 
RAPID DIAGNOSTIC TEST (RDT) overviewppt.pptx
RAPID DIAGNOSTIC TEST (RDT)  overviewppt.pptxRAPID DIAGNOSTIC TEST (RDT)  overviewppt.pptx
RAPID DIAGNOSTIC TEST (RDT) overviewppt.pptx
nietakam
 
Causes of mortalities of eggs and spawn and remedies.pptx
Causes of mortalities of eggs and spawn and remedies.pptxCauses of mortalities of eggs and spawn and remedies.pptx
Causes of mortalities of eggs and spawn and remedies.pptx
anshumanmohanty9090
 
Chapter 4_Part 2_Infection and Immunity.ppt
Chapter 4_Part 2_Infection and Immunity.pptChapter 4_Part 2_Infection and Immunity.ppt
Chapter 4_Part 2_Infection and Immunity.ppt
JessaBalanggoyPagula
 
Direct Evidence for r-process Nucleosynthesis in Delayed MeV Emission from th...
Direct Evidence for r-process Nucleosynthesis in Delayed MeV Emission from th...Direct Evidence for r-process Nucleosynthesis in Delayed MeV Emission from th...
Direct Evidence for r-process Nucleosynthesis in Delayed MeV Emission from th...
Sérgio Sacani
 
Zoonosis, Types, Causes. A comprehensive pptx
Zoonosis, Types, Causes. A comprehensive pptxZoonosis, Types, Causes. A comprehensive pptx
Zoonosis, Types, Causes. A comprehensive pptx
Dr Showkat Ahmad Wani
 
Concise Notes on tree and graph data structure
Concise Notes on tree and graph data structureConcise Notes on tree and graph data structure
Concise Notes on tree and graph data structure
YekoyeTigabu2
 
Polytene chromosomes. A Practical Lecture.pptx
Polytene chromosomes. A Practical Lecture.pptxPolytene chromosomes. A Practical Lecture.pptx
Polytene chromosomes. A Practical Lecture.pptx
Dr Showkat Ahmad Wani
 
Botany-Finals-Patterns-of-Inheritance-DNA-Synthesis.pdf
Botany-Finals-Patterns-of-Inheritance-DNA-Synthesis.pdfBotany-Finals-Patterns-of-Inheritance-DNA-Synthesis.pdf
Botany-Finals-Patterns-of-Inheritance-DNA-Synthesis.pdf
JseleBurgos
 
Structure formation with primordial black holes: collisional dynamics, binari...
Structure formation with primordial black holes: collisional dynamics, binari...Structure formation with primordial black holes: collisional dynamics, binari...
Structure formation with primordial black holes: collisional dynamics, binari...
Sérgio Sacani
 
DNA Profiling and STR Typing in Forensics: From Molecular Techniques to Real-...
DNA Profiling and STR Typing in Forensics: From Molecular Techniques to Real-...DNA Profiling and STR Typing in Forensics: From Molecular Techniques to Real-...
DNA Profiling and STR Typing in Forensics: From Molecular Techniques to Real-...
home
 
Skin_Glands_Structure_Secretion _Control
Skin_Glands_Structure_Secretion _ControlSkin_Glands_Structure_Secretion _Control
Skin_Glands_Structure_Secretion _Control
muralinath2
 
Influenza-Understanding-the-Deadly-Virus.pptx
Influenza-Understanding-the-Deadly-Virus.pptxInfluenza-Understanding-the-Deadly-Virus.pptx
Influenza-Understanding-the-Deadly-Virus.pptx
diyapadhiyar
 
Polymerase Chain Reaction (PCR).Poer Pint
Polymerase Chain Reaction (PCR).Poer PintPolymerase Chain Reaction (PCR).Poer Pint
Polymerase Chain Reaction (PCR).Poer Pint
Dr Showkat Ahmad Wani
 
4. Chapter 4 - FINAL Promoting Inclusive Culture (2).pdf
4. Chapter 4 - FINAL Promoting Inclusive Culture (2).pdf4. Chapter 4 - FINAL Promoting Inclusive Culture (2).pdf
4. Chapter 4 - FINAL Promoting Inclusive Culture (2).pdf
abayamargaug
 
Antonie van Leeuwenhoek- Father of Microbiology
Antonie van Leeuwenhoek- Father of MicrobiologyAntonie van Leeuwenhoek- Father of Microbiology
Antonie van Leeuwenhoek- Father of Microbiology
Anoja Kurian
 
Skin function_protective_absorptive_Presentatation.pptx
Skin function_protective_absorptive_Presentatation.pptxSkin function_protective_absorptive_Presentatation.pptx
Skin function_protective_absorptive_Presentatation.pptx
muralinath2
 
Gender Bias and Empathy in Robots: Insights into Robotic Service Failures
Gender Bias and Empathy in Robots:  Insights into Robotic Service FailuresGender Bias and Empathy in Robots:  Insights into Robotic Service Failures
Gender Bias and Empathy in Robots: Insights into Robotic Service Failures
Selcen Ozturkcan
 
Chromatography, types, techniques, ppt.pptx
Chromatography, types, techniques, ppt.pptxChromatography, types, techniques, ppt.pptx
Chromatography, types, techniques, ppt.pptx
Dr Showkat Ahmad Wani
 
Lipids: Classification, Functions, Metabolism, and Dietary Recommendations
Lipids: Classification, Functions, Metabolism, and Dietary RecommendationsLipids: Classification, Functions, Metabolism, and Dietary Recommendations
Lipids: Classification, Functions, Metabolism, and Dietary Recommendations
Sarumathi Murugesan
 
biochemistry amino acid from chemistry to life machinery
biochemistry amino acid from chemistry to life machinerybiochemistry amino acid from chemistry to life machinery
biochemistry amino acid from chemistry to life machinery
chaitanyaa4444
 
RAPID DIAGNOSTIC TEST (RDT) overviewppt.pptx
RAPID DIAGNOSTIC TEST (RDT)  overviewppt.pptxRAPID DIAGNOSTIC TEST (RDT)  overviewppt.pptx
RAPID DIAGNOSTIC TEST (RDT) overviewppt.pptx
nietakam
 
Causes of mortalities of eggs and spawn and remedies.pptx
Causes of mortalities of eggs and spawn and remedies.pptxCauses of mortalities of eggs and spawn and remedies.pptx
Causes of mortalities of eggs and spawn and remedies.pptx
anshumanmohanty9090
 
Chapter 4_Part 2_Infection and Immunity.ppt
Chapter 4_Part 2_Infection and Immunity.pptChapter 4_Part 2_Infection and Immunity.ppt
Chapter 4_Part 2_Infection and Immunity.ppt
JessaBalanggoyPagula
 
Direct Evidence for r-process Nucleosynthesis in Delayed MeV Emission from th...
Direct Evidence for r-process Nucleosynthesis in Delayed MeV Emission from th...Direct Evidence for r-process Nucleosynthesis in Delayed MeV Emission from th...
Direct Evidence for r-process Nucleosynthesis in Delayed MeV Emission from th...
Sérgio Sacani
 
Zoonosis, Types, Causes. A comprehensive pptx
Zoonosis, Types, Causes. A comprehensive pptxZoonosis, Types, Causes. A comprehensive pptx
Zoonosis, Types, Causes. A comprehensive pptx
Dr Showkat Ahmad Wani
 
Concise Notes on tree and graph data structure
Concise Notes on tree and graph data structureConcise Notes on tree and graph data structure
Concise Notes on tree and graph data structure
YekoyeTigabu2
 
Polytene chromosomes. A Practical Lecture.pptx
Polytene chromosomes. A Practical Lecture.pptxPolytene chromosomes. A Practical Lecture.pptx
Polytene chromosomes. A Practical Lecture.pptx
Dr Showkat Ahmad Wani
 
Botany-Finals-Patterns-of-Inheritance-DNA-Synthesis.pdf
Botany-Finals-Patterns-of-Inheritance-DNA-Synthesis.pdfBotany-Finals-Patterns-of-Inheritance-DNA-Synthesis.pdf
Botany-Finals-Patterns-of-Inheritance-DNA-Synthesis.pdf
JseleBurgos
 
Structure formation with primordial black holes: collisional dynamics, binari...
Structure formation with primordial black holes: collisional dynamics, binari...Structure formation with primordial black holes: collisional dynamics, binari...
Structure formation with primordial black holes: collisional dynamics, binari...
Sérgio Sacani
 
DNA Profiling and STR Typing in Forensics: From Molecular Techniques to Real-...
DNA Profiling and STR Typing in Forensics: From Molecular Techniques to Real-...DNA Profiling and STR Typing in Forensics: From Molecular Techniques to Real-...
DNA Profiling and STR Typing in Forensics: From Molecular Techniques to Real-...
home
 
Skin_Glands_Structure_Secretion _Control
Skin_Glands_Structure_Secretion _ControlSkin_Glands_Structure_Secretion _Control
Skin_Glands_Structure_Secretion _Control
muralinath2
 
Influenza-Understanding-the-Deadly-Virus.pptx
Influenza-Understanding-the-Deadly-Virus.pptxInfluenza-Understanding-the-Deadly-Virus.pptx
Influenza-Understanding-the-Deadly-Virus.pptx
diyapadhiyar
 
Polymerase Chain Reaction (PCR).Poer Pint
Polymerase Chain Reaction (PCR).Poer PintPolymerase Chain Reaction (PCR).Poer Pint
Polymerase Chain Reaction (PCR).Poer Pint
Dr Showkat Ahmad Wani
 
4. Chapter 4 - FINAL Promoting Inclusive Culture (2).pdf
4. Chapter 4 - FINAL Promoting Inclusive Culture (2).pdf4. Chapter 4 - FINAL Promoting Inclusive Culture (2).pdf
4. Chapter 4 - FINAL Promoting Inclusive Culture (2).pdf
abayamargaug
 
Antonie van Leeuwenhoek- Father of Microbiology
Antonie van Leeuwenhoek- Father of MicrobiologyAntonie van Leeuwenhoek- Father of Microbiology
Antonie van Leeuwenhoek- Father of Microbiology
Anoja Kurian
 
Skin function_protective_absorptive_Presentatation.pptx
Skin function_protective_absorptive_Presentatation.pptxSkin function_protective_absorptive_Presentatation.pptx
Skin function_protective_absorptive_Presentatation.pptx
muralinath2
 
Gender Bias and Empathy in Robots: Insights into Robotic Service Failures
Gender Bias and Empathy in Robots:  Insights into Robotic Service FailuresGender Bias and Empathy in Robots:  Insights into Robotic Service Failures
Gender Bias and Empathy in Robots: Insights into Robotic Service Failures
Selcen Ozturkcan
 
Ad

"Data Provenance: Principles and Why it matters for BioMedical Applications"

  • 1. Data Provenance: Principles and Why it matters for BioMedical Applications PART-2: Tools and Techniques Informatics for Health 2017 Preconference Tutorial Vasa Curcin, Lecturer, HSCR & Department of Informatics KCL Pinar Alper, Postdoc at HSCR KCL 22.Apr.2017 Manchester/UK 1
  • 2. Part-2 Outline Model (Part-1) Capture (Part-2) Manage (Part-2) W3C PROV Different approaches in Scientific Computing Storage and Query Data model Tools, APIs Explore 2
  • 3. Provenance Recap Provenance has a Particular Subject & has a Standard Model: • “information about entities, activities, and people involved in producing a piece of data or thing…” (W3C PROV). edit wasGeneratedBy page1 used Wikipedia Editor wasAssociatedWith page2 entity activity agent • PROV-DM Core vocabulary: • Actor,Activity, Entity • Causal relations among elements • Conceptually a graph. • Constraints: • Typing: if two things are linked with actedOnBehalf of they are of type actor • Impossibility : activities and entities are disjoint. Specialisation is not reflexive • Ordering: Usage must occur between start and end event of activities • Human and Machine understandable representation: • PROV-O, • PROV-N • Extensibility points 3
  • 4. Part-2 Outline Model (Part-1) Capture (Part-2) Manage (Part-2) W3C PROV Different approaches in Scientific Computing Storage and Query Data model Tools, APIs Explore 4
  • 5. Provenance Capture • Rigorously studied in the context of Scientific Computing. • Workflows • Scripts • Command-Line Tools • System-Level (File and Operating System) • Databases • Templates 5
  • 6. • Became popular in the recent decade. • Pipelines of tasks with dataflow dependencies. • Automation, Resource Access Client • Somewhat disruptive, need to wrap resources. • Provenance for the output scientific data: • An outline of the method followed • Resources used (repositories, tools, services). • Parameter configurations and intermediary results. Scientific Workflows Analysis Data Analysis Visualization Analysis Adaptation Community Data Repo Community Tools Local Tools & Data 6 parameter parameter Data Data
  • 7. 7 • Backward-Looking history of what the workflow engine observed during the run: • Data nodes with identifiers minted by WF engine. Data often stored separately. • Task invocations, timestamp. • Data usage and generation by tasks. • Actor is primarily the WF engine. • Inferred provenance: • Task causal dependencies • Data causal dependencies Workflows Execution Provenance
  • 8. WF Provenance -Transparency outputactivityinput • Annotated grey-box provenance. • Annotate workflow, get auto-annotated provenance • Annotate provenance outputactivityinput BLAST Report DNA SEquence Sequence Alignment 8 • Wraps resources to incorporate into workflow, hence an “Observer” perspective. In the most basic case this provides black-box provenance. • We also have white-box provenance will come to that later!! CL tool execution Script execution Web service invocation prov:type prov:type prov:type PROV Extensibility Points: type hadRole attributes hadRole
  • 9. WF Provenance - Perspective • Workflow Provenance is the first to be referred to as “Prospective Provenance” ‼ • Prospective provenance is often viewed as the provenance by end- users (scientists). • Prospective Provenance can be useful as an abstraction over (bulky) retrospective provenance. Prospective Retrospective lineage 9
  • 10. 10 Workflow Provenance • End use • Debugging • Monitoring • Present Results
  • 11. Workflow Provenance • End use • Comparison Exportable as OPM 11
  • 12. Workflow Provenance • End use • Publishing analyses 12
  • 13. A workflow may not necessarily be implemented by a workflow system 13 • Provenance Challenge Series 1-4 • Provenance Challenge Workflow: 5-step computational process using Functional Magnetic Resonance Imaging (fMRI) data. • Have been realized with WF, tool, and system-level provenance http://twiki.ipaw.info/bin/view/Challenge/FirstProvenanceChallenge Domain data Patterns of inquiry e.g. lineage traversal
  • 14. Scripts • Popular, established method of data processing • Integration into statistical, numerical analysis libraries • Visualization libraries • Researchers have recently started paying attention to script provenance. • Currently research prototypes rather than out-of-box features. • Minimally disrupting existing practices. No technology change, No wrapping! 14
  • 15. Script Provenance - RDataTracker Barbara Lerner and Emery Boose. RDataTracker: Collecting provenance in an interactive scripting environment. In 6th USENIX Workshop on the Theory and Practice of Provenance (TaPP 2014). 15 • Extend R scripts with logging statements • Provenance ON/OFF: ddg.init, ddg.save • Abstract multiple commands: ddg.procedure • Post-execution visualize the Data Derivation Graph Process Data Data flow Control flow https://github.com/End-to-end-provenance/RDataTracker
  • 16. Script Provenance- YesWorkflow 16 • Annotate method declarations in R, Matlab scripts • Makes “latent workflow information from scripts explicit”. • Prospective provenance, a workflow abstraction over the script. • Visualized in process, data and combined views • @begin, @end • @in, @out, @as • Inputs/outputs can be concrete files • Or can refer to prospective resources identified by templates. Timothy M. McPhillips, Tianhong Song et al. YesWorkflow: A User-Oriented, Language-Independent Tool for Recovering Workflow Information from Scripts. CoRR, abs/1502.02403, 2015.
  • 17. Script Provenance - YesWorkflow Visualization of annotations – combined view 17 https://github.com/yesworkflow-org/
  • 18. Script Provenance - noWorkflow • Definition: Analyze the AST of Python script. • Arguments, function calls, global vars • User defined functions • Deployment: Environment and library dependencies. • Uses Python os, socket and platform libraries • Collected right before script execution Leonardo Murta, Vanessa Braganholo, Fernando Chirigati, David Koop, and Juliana Freire. noWorkflow: Capturing and analyzing provenance of scripts. In 5th International Provenance and Anno- tation Workshop (IPAW), Cologne, Jun 2014. 18
  • 19. Script Provenance -noWorkflow • Execution: Runtime information • Uses python’s runtime profiling and reflection capabilities • the start time of the function activation, together with the values of every argument, return, and globals • open() csystem call to record depedency to files and snapshot file contents at the time of call 19 https://github.com/gems-uff/noworkflow
  • 20. Script Provenance -noWorkflow • A lot of information! • Analysis 1. Highly summarized Activation Graph. Nodes in one control flow block (e.g. loop) merged. 2. Diff two executions (report diff in dates, argument values, environment settings) 3. Query in Prolog access_influence(File, ’output.png’) 20 https://github.com/gems-uff/noworkflow
  • 21. Script Provenance -CXXR • R Interpreter with audit features • Tracks provenance for bindings generated in the user workspace • provenance(x): Returns a list comprising: expression, symbol, timestamp, parents, children • pedigree(x): Displays the sequence of commands issued, which results in x’s current state 21 https://github.com/timothyjgraham/cxxr
  • 22. Script Provenance 22 • End-use: • Minimally disruptive, but, yet to prove its utility. • Black-box provenance. Too fine grained process abstraction? Are variables, arguments or files a useful data abstractions?
  • 23. Command Line Tools • Popular interface for many scientific analysis libraries • Web-Based Reproducibility Frameworks, a layer between User and OS command shell. • Register tools and with metadata to framework. Grey-box provenance. • Layer directs tool execution • Layer records provenance Sumatra: automated tracking of scientific computations.https://pythonhosted.org/Sumatra/ Galaxy Data intensive biology for everyone https://galaxyproject.org/ 23
  • 24. Tool Provenance • End use: • Convert exploratory steps into workflows • Compare different invocations 24
  • 25. Database Provenance • Databases widely used for managing scientific resource metadata: • Data Catalogs • Service Registries • White-box Provenance • Why: witness tuples • How: the way witness tuple contribute to result. • Where: cell-level data copying from witness tuples to result James Cheney, Laura Chiticariu, and Wang Chiew Tan. Provenance in databases: Why, how, and where. Foundations and Trends in Databases, 1(4):379–474, 2009. 25
  • 26. Database Provenance • Why provenance: • Query debugging: Why did I get this result? • View maintenance: If I update this record, do I need to refresh this (materialized) view? • How provenance: • Trust and uncertainty computation • Where provenance: • Annotation propagation • If desired white-box provenance can be computed post query execution, when needed. • Limited to research prototypes 26 http://infolab.stanford.edu/trio/
  • 27. System-Level Provenance • Every application runs on some OS and uses some storage system • Not disruptive Zero modification to applications. • An audit mode for systems, explore feasibility and overhead. 27
  • 28. System-Level Provenance - SPADE • A consumer of audit APIS of different OS (Windows, Mac OS, Linux, Android) • Collects process information: name, owner, creation time, command line, environment vars, files read/written. • Reporters: • Default for each OS. E.g. in Linux exec(), fork(), clone(), exit(), open(), close(), read(), write(), clone(), truncate() is picked up by reporters • Domain specific reporters. Add onto default. • Reporting overhead is variable, dependes on nature of application and target OS. • Compile Apache Web Server on Windows (53%) • Run BLAST tool on Linux (5%), on MacOS (10%) • OPM compliant A. Gehani and D. Tariq. SPADE: Support for Provenance Auditing in Distributed Environments. In Middleware 2012 - Proceedings, pages 101–120, 2012. 28
  • 29. System-level Provenance - PASS • Modified Linux Kernel. Relies on the tracking of system calls. • Focuses on files a detailed change record. 29 • For each file in the filesystem, • The executable that created it • Any input files • “Complete” hardware platform description • Command line • Process environment • Other data such as random seeds > sort a > b Kiran-Kumar Muniswamy-Reddy, David A. Holland, Uri Braun, and Margo Seltzer. Provenance-aware storage systems. In Proceedings of the Annual Conference on USENIX ’06 Annual Technical Conference, ATEC ’06, pages 4–4.
  • 30. System-Level Provenance - PASS • End Use • Script generation • Generate a Makefile that reproduces a file • Detecting system changes • Compare provenance of two files to detect changes in environment, libraries, etc. • Intrusion detection • Detailed logs of how objects have changed • Very fine grained provenance!! • 5000 Objects in response to a lineage inquiry over trace with 50 elements. 30
  • 31. To this end • We focused on capturing provenance from a particular computational system • System dictates granularity Data: Tuples, files, variables Activity: Query operator, sys-call, wf activity • System dictates transparency White-box for DB vs Black/Grey Box for WFs and Tools 31
  • 32. In reality • We have heterogeneity and legacy: • We use heterogeneous technologies. • Provenance may be recorded in mixed granularities. • Existing audit capabilities and provenance-like information recorded. • Audited processes can span long time frames. 32
  • 33. DB Consider the case 33 Medical Practice Management (MPM) A global Provenance Picture Designing a decision aid to improve secondary prevention for stroke survivors with multimorbidity: A stakeholder engagement study (T Porat, I Marshall, E Sadler, MA Vadillo, V Curcin, C McKevitt, C Wolfe) - Abstract presentationat Farr 2017 Tuesday 4pm portal usage info authentication trace clinician patient • Have there been an update on patient’s EHR in MPM between initial assessment and follow up in SDA • Do patients use any part of the portal after initial assessment activity in SDA? • What are all the prior authentication records that lead to MPM activities updating this erroneous EHR. • What is the average time between initial assessments and follow-ups? DB Stroke Decision Aid (SDA) Log files Log files Authentication Service Portal
  • 34. A Solution Approach – Provenance Templates • Focus on Provenance rather than its capture from a specific system. • How can we combine provenance of multiple sources in a controlled manner? • Focus on incorporating Domain-Specific information • Heterogeneous systems interoperate via domain ontologies. How can we obtain domain annotated provenance? • Two groups actively working on it 34 “Templates as a method for implementing data provenance in decision support systems.” Vasa Curcin, Elliot Fairweather, Roxana Danger, Derek Corrigan. Journal of Biomedical Informatics 65 (2017) 1-21 https://bitbucket.org/kclbig/templates https://provenance.ecs.soton.ac.uk/prov-template/ In publication.
  • 35. var:input var:output var:assessment wasGeneratedBy ex:guideline1 var:clinician wasAssociatedWith used ex:practice1 actedOnBehalfOf ex:startTime=“vvar:start” ex:endTime=“vvar:end” used Template: A provenance graph with abstract parts Clinical Decision Support TEMPLATE 35 variable value variable document prefix ex <http://example.org> prefix var <http://pgt.inf.kcl.ac.uk/var> prefix vvar <http://pgt.inf.kcl.ac.uk/vvar> entity(var:inSymbol) entity(ex:guideline1) activity(var:assessment) used(var:assessment,ex:guideline1, -, [ex:startTime=”vvar:start”, ...]) ...... ...... endDocument
  • 37. var:a var:c var:b wasGeneratedBy var:x wasAssociatedWith used Zone • A connected sub-graph of a template graph, which can be instantiated multiple times • Series/Parallel type of zones • Restrictions on Minimum and Maximum number of instantiations var:d wasDerviedFrom id=zone1 type=parallel min=1 max=5 Example TEMPLATE 37 Zone variable External variable
  • 38. entity0 entity1 activity1 wasGeneratedBy agent1 wasAssociatedWith used activity2 wasGeneratedBy wasGeneratedBy • Parallel Zone instantiated 3 times used used wasAssociatedWith wasAssociatedWith entity4 wasDerviedFrom wasDerviedFrom wasDerviedFrom 38 entity2 entity3 activity3
  • 40. Template Instantiation Generation Valid Substitutions Template Instance Template (PROV- N) (PROV- N) (PROV- N) • Single-Step : Template Base+ All Zones • Incremental Generation 1. Template Base 2. Zone 3. Zone 4. … 40
  • 41. What constitutes a Valid Substitution? • For Template Base • Distinct values among all external variables • Distinct values among the global node if space for non-graft variables • Values for all value variables • For each Zone substitution • Distinct values for all zone variables among each other and among the global node id space • Values for all value variables 41
  • 42. entity1 entity3 activity1 wasGeneratedBy agent1 wasAssociatedWith used entity4 activity2 wasGeneratedBy entity5 activity3 wasGeneratedBy • Parallel Zone incremental instantiation used used wasAssociatedWith wasAssociatedWith entity2 wasDerivedFrom wasDerivedFrom wasDerivedFrom 42
  • 43. var:output var:assessment wasGeneratedBy guideline1 var:clinician wasAssociatedWith used practice1 actedOnBehalfOf • Nodes shared among instantiations • In real-life, actors or entities may participate in processes multiple times Template – Graft nodes GRAFT NODE Clinical Decision Support TEMPLATE 43
  • 45. Architectural Approach Provenance Authentication Clinical Decision Support Portal Authentication Service Application Template Server TEMPLATES • Streams of substitutions from disparate heterogeneous systems 45
  • 46. 46 In summary, the benefits of Templates are:
  • 47. 1. Enforce structure over provenance • Templates can be seen as Provenance Schemas. Acceptable patterns of provenance. • Strict : there is no optionality. If a variable is included in a template bindings will be expected for it. • Loose: Templates may not encode all possible patterns. (Grafts) • Prospective Provenance in its true form! 47
  • 48. 2. Go beyond single-system restriction • Combine provenance in different granularities- importance of identification. • User identity • EHR associated with identity. Track CRUD at EHRlevel. • Fields of EHR relevant for provenance. Track Updates at selected HER field level • Build up provenance over time 1. Authentication 2. Activity on portal 3. Activity within application 4. Authentication 5. Activity on portal 6. Activity within application ….. .... 48
  • 49. 3. Support other forms of Provenance • Provenance can be found in other forms (XLS files, logs) • We can create substitutions from these other forms • Hence migrate legacy provenance to a standards compliant (graph- based) representation 49 Rows  Substitutions Columns  Variables MIAME TEMPLATE Generation Minimum Information for Biological and Biomedical Investigations
  • 50. 4. Possibility of devising Model-Driven techniques for introducing provenance support to existing tools • Identify parts of the system (Actor, Entity, Activity) that are relevant for provenance tracking • Data Flow Diagrams to understand the data entities • Use cases that should be tracked • Create sample (concrete) provenance • Refactor templates from concrete provenance 50 • Map templates to service end points on the template server • Add service client code to existing tools • Map templates to (Java) beans • Add bean population code to existing tools
  • 51. Part-2 Outline Model (Part-1) Capture (Part-2) Manage (Part-2) W3C PROV Different approaches in Scientific Computing Storage and Query Data model Tools, APIs Explore 51
  • 52. Querying Provenance Most common pattern of access: 1. Locate node(s) of interest: • based on attributes. e.g. Process nodes of a particular name, Data nodes produced after some date, nodes with timestamp in a range • based on involved in relationship(s) (consumption, generation). e.g. Data generated by a particular buggy process. 2. Traverse lineage, causation (w transitive closure) locate traverse lineage 52
  • 53. A Less common pattern of access: Locate sub-graph of interest: • These are advanced queries that may involve regular path expressions Querying Provenance locate subgraph locate 53
  • 54. Data Model for Provenance • Relational • Traversal require several self-joins in SQL, can be costly in case of deep lineage traces. • Yet still very popular! • Joint querying of application data and provenance • Graph • Graph DB neo4j • Cypher Query Language (Variable length relations) +Free text search • Visualization • Triple stores: • W3C SPARQL based querying + free text search • Ontology integration • Reasoning and Rules support, 54 http://sig.biostr.washington.edu/projects/ontviews/gleen/index.htmlhttps://neo4j.com/product/ http://rdf4j.org/ https://jena.apache.org/
  • 55. Exploring provenance graphs • Browsing Layers of provenance • WF Design - Manual • (Semi) automated - grouping Manish Kumar Anand ; Shawn Bowers ; Bertram Ludäscher. Provenance browser: Displaying and querying scientific workflow provenance graphs. Proceedings of ICDE 2010 Conference. Interactive Visualization of Provenance Graphs for Reproducible Biomedical Research. Stefan Lugeret al. 5th Symposium on Biological Data Visualization. 55 Extract a sub-graph priot to browsing Olivier Biton, Sarah Cohen-Boulakia, Susan B. Davidson, and Carmem S. Hara. Querying and Managing Provenance through User Views in Scientific Workflows. pages 1072–1081, ICDE 2008.
  • 56. Exploring provenance graphs • Aggregating graph elements Aggregation by Provenance Types: A Technique forSummarising Provenance Graphs. Luc Moreau. In Proceedings GaM 2015, arXiv:1504.02448 Rinke Hoekstra, Paul Groth. PROV-O-Viz - Understanding the Role of Activities in Provenance. IPAW 2014: pp 215-220 56
  • 57. Exploring provenance • A move towards customized visualizations: E.g. Provenance of annotation sentence “inactivated by cyanide”. It originates in TrEMBL, but ends up in Swiss-Prot. Bell MJ, Collison M, Lord P (2013) Can Inferred Provenance and Its Visualisation Be Used to Detect Erroneous Annotation? A Case Study Using UniProtKB. PLoS ONE 8(10): e75541. https://doi.org/10.1371/journal.pone.0075541 57
  • 58. PROV implementations 58The full list is at: https://www.w3.org/2011/prov/wiki/ProvImplementations
  • 59. 59
  • 60. Komadu • Collect and Query API as a web service • Visualize provenance, export to CSV format viewable in Cytoscape • Export as PROV-XML 60 https://github.com/Data-to-Insight-Center/komadu/blob/master/docs/KomaduUserGuide.pdf
  • 61. The Oh-Yeah Button 61 • PROV-AQ Implementation • Browser add on http://users.ugent.be/~tdenies/OhYeah/
  • 63. PROV Generating Systems -Workflows 63
  • 64. PROV Generating Systems -Git 64 http://git2prov.org
  • 65. Sample PROV Datasets 65 PROV BENCH, PROV RECONSTRUCTION 1st ProvBench: Benchmarking Provenance Management Systems. In K. Belhajjame and J. Zhao, editors, Proceedings of the Joint EDBT/ICDT 2013 Workshops. Data2Semantics. Provenance reconstruction challenge. http:// data2semantics.github.io/ 2014. 2nd ProvBench: Benchmarking Provenance Management Systems. K. Belhajjame and A. Chapman. https://sites.google.com/ site/provbench/home/provbench- provenance-week- 2014, 2014.
  • 66. Follow State of the Art in Provenance IPAW and TAPP have been held jointly as Provenance Weeks. IPAW 2018 will be held at King’s College London !! (Every two years) (Annual) 66
  • 67. 67 Acknowledgements • ProvTemp EPSRC Project (EP/N027426/1). Vasa Curcin, PinarAlper, Elliot Fairweather. • Innovate UK Project EmProv. Vasa Curcin, Shen Xu.

Editor's Notes

  • #3: In the first part of the tutorial we presented: What constitutes provenance information, The W3C standard PROV vocabulary to represent this information. In the second part we will present the existing set of tools and technqiues (coming out of research) that helps us to Capture and Manage provenance information.
  • #14: You do not need to use a workflow system to have a workflow.