Tutorial given at the Informatics for Health 2017 Conference. These slides are for the second part of the tutorial, describing provenance capture and management tools.
Course: Bioinformatics for Biomedical Research (2014).
Session: 2.2- Introduction to Galaxy. A web-based genome analysis platform.
Statistics and Bioinformatics Unit (UEB) & High Technology Unit (UAT) from Vall d'Hebron Research Institute (www.vhir.org), Barcelona.
Adding Transparency and Automation into the Galaxy Tool Installation ProcessEnis Afgan
The talk will discuss the process of unifying the tool installation approach within the Galaxy project and how it can be used by anyone to install potentially hundreds of tools in an automated fashion.
The Taverna Suite provides tools for interactive and batch workflow execution. It includes a workbench for graphical workflow construction, various client interfaces, and servers for multi-user workflow execution. The suite utilizes a plug-in framework and supports a variety of domains, infrastructures, and tools through custom plug-ins.
Genomics Is Not Special: Towards Data Intensive BiologyUri Laserson
Genomics and the life sciences are using antiquated technology for processing data. As data volumes increase in the life sciences, many in the biology community are reinventing the wheel, without realizing that a rich ecosystem of tools for processing large data sets already exists: Hadoop.
Computational workflows for omics analyses at the IARCMatthieu Foll
This document discusses the use of computational workflows and nextflow at the International Agency for Research on Cancer (IARC). IARC uses high throughput sequencing and omics data to study cancer causes and prevention. While research groups have different scientific questions, nextflow helps avoid duplicating bioinformatics efforts and promotes best practices. Key benefits of nextflow include easy installation, reproducibility, and ability to run pipelines on any machine or cluster. Challenges include debugging, handling multiple inputs, and deleting intermediate files. Overall nextflow has changed how IARC conducts bioinformatics research for the better.
Hadoop for Bioinformatics: Building a Scalable Variant StoreUri Laserson
Talk at Mount Sinai School of Medicine. Introduction to the Hadoop ecosystem, problems in bioinformatics data analytics, and a specific use case of building a genome variant store backed by Cloudera Impala.
In this talk at the CECAM 2015 Workshop on Future Technologies in Automated Atomistic Simulations, I will discuss the Materials Project Ecosystem, an initiative to develop a comprehensive set of open-source software and data tools for materials informatics. The Materials Project is a US Department of Energy-funded initiative to make the computed properties of all known inorganic materials publicly available to all materials researchers to accelerate materials innovation. Today, the Materials Project database boasts more than 58,000 materials, covering a broad range of properties, including energetic properties (e.g., phase and aqueous stability, reaction energies), electronic structure (bandstructures, DOSs) and structural and mechanical properties (e.g., elastic constants).
A linchpin of the Materials Project is its robust data and software infrastructure, built on best open-source software development practices such as continuous testing and integration, and comprehensive documentation. I will provide an overview of the open-source software modules that have been developed for materials analysis (Python Materials Genomics), error handling (Custodian) and scientific workflow management (FireWorks), as well as the Materials API, a first-of-its-kind interface for accessing materials data based on REpresentational State Transfer (REST) principles. I will show how a materials researcher may use and build on these software and data tools for materials informatics, as well as to accelerate their own research.
Taverna workflows: provenance and reproducibility - STFC/NERC workshop 2013anpawlik
Slides on Taverna (www.taverna.org.uk) from the talk given at the STFC/NERC workshop "Workflow approaches to investigation of biological complexity", 15-16 October 2013.
Video and slides synchronized, mp3 and slide download available at URL https://bit.ly/2UkZRIC.
Monal Daxini presents a blueprint for streaming data architectures and a review of desirable features of a streaming engine. He also talks about streaming application patterns and anti-patterns, and use cases and concrete examples using Apache Flink. Filmed at qconsf.com.
Monal Daxini is the Tech Lead for Stream Processing platform for business insights at Netflix. He helped build the petabyte scale Keystone pipeline running on the Flink powered platform. He introduced Flink to Netflix, and also helped define the vision for this platform. He has over 17 years of experience building scalable distributed systems.
GA4GH meeting at the Sanger InstituteMatt Massie
ADAM is a fast, scalable genome analysis platform built using Apache Spark and data formats like Avro and Parquet. It provides tools for read processing, variant calling, and multi-sample analysis on whole genome, high coverage data. The platform is designed to be easy to use for developers while leveraging existing open-source systems and deploying on both local and cloud infrastructures.
The Materials Project is an open initiative that makes calculated materials property data publicly available to accelerate materials innovation. It has calculated properties for over 30,000 materials using over 10 million CPU hours. The project provides a Python library and API to access and analyze materials data, as well as a workflow manager to run calculations on supercomputers. It aims to calculate all known inorganic materials and establish collaborations to develop new materials design tools.
This document proposes an approach called PTU (Provenance-To-Use) to improve the repeatability of scientific experiments by minimizing computation time during repeatability testing. PTU builds a package containing the software, input data, and provenance trace from a reference execution. Testers can then selectively replay parts of the provenance graph using the ptu-exec tool, reducing testing time compared to full re-execution. The document describes the PTU components, including tools for auditing reference runs, building provenance packages, and selectively replaying parts of the provenance graph. Examples applying PTU to the PEEL0 and TextAnalyzer applications show reductions in testing time.
High Performance Machine Learning in R with H2OSri Ambati
This document summarizes a presentation by Erin LeDell from H2O.ai about machine learning using the H2O software. H2O is an open-source machine learning platform that provides APIs for R, Python, Scala and other languages. It allows distributed machine learning on large datasets across clusters. The presentation covers H2O's architecture, algorithms like random forests and deep learning, and how to use H2O within R including loading data, training models, and running grid searches. It also discusses H2O on Spark via Sparkling Water and real-world use cases with customers.
Scalable Parallel Programming in Python with ParslGlobus
Parsl is a Python library that allows for the natural expression of parallelism in Python programs. It allows Python functions to be executed concurrently while respecting data dependencies. Parsl returns "futures" as proxies for results that may not yet be available. It decomposes parallel execution into a task dependency graph. Parsl scripts can run on local machines, grids, clouds, or supercomputers without changes to the code.
Genome Analysis Pipelines with Spark and ADAMAllen Day, PhD
Spark is a powerful new tool for processing large volumes of data quickly across a cluster of networked computers.
Typical bioinformatics workflow requirements are well-matched to Spark’s capabilities. However, Spark is not commonly used because many legacy bioinformatics applications make assumptions about their computing environment. These assumptions present a barrier to integrating the tools into more modern computing environments.
These barriers are quickly coming down. ADAM is a software library and set of tools built on top of Spark that makes it easy to work with file formats commonly used for genome analysis like FastQ, BAM, and VCF.
In this presentation, we’ll explore how a step that is common to many bioinformatics workflows, sequence alignment, can be done with Bowtie and ADAM inside a Spark environment to quickly align short reads to a reference genome. A complete code example is demonstrated and provided at https://github.com/allenday/spark-genome-alignment-demo
DNA sequencing is producing a wave of data which will change the way that drugs are developed, patients diagnosed, and our understanding of human biology. To fulfill this promise, however, the tools for interpretation and analysis must scale to match the quantity and diversity of "big data genomics."
ADAM is an open-source genomics processing engine, built using Spark, Apache Avro, and Parquet. This talk will discuss some of the advantages that the Spark platform brings to genomics, the benefits of using technologies like Parquet in conjunction with Spark, and the challenges of adapting new technologies for existing tools in bioinformatics.
These are slides for a talk given at the Apache Spark Meetup in Boston on October 20, 2014.
Good Practices for Developing Scientific Software Frameworks: The WRENCH fram...Rafael Ferreira da Silva
The document provides guidelines for best practices in developing scientific software frameworks. It discusses hosting open source projects on version control platforms and ensuring documentation, testing, continuous integration/delivery, and other development practices are followed. Specific examples mentioned include the WRENCH simulation framework, Pegasus workflow system, and scikit-learn machine learning library. The document emphasizes practices like writing tests, tracking issues, reviewing code quality, and releasing versions in a semantic and citable manner.
Workflow Support for Continuous Data Quality Control in a FilteredPush Network
J. Hanken, D. Lowery, B. Ludäscher, J. Macklin, T. McPhillips, P. Morris, B. Morris, T. Song
Presentation given at TDWG 2014
Jönköping, Sweden
These slides provide a quick overview of the Materials API, an open platform for materials researchers to access data from the Materials Project. A few simple examples are provided, as well as links where more information can be obtained.
Search at Twitter: Presented by Michael Busch, TwitterLucidworks
Twitter processes over 500 million tweets per day and more than 2 billion search queries per day. The company uses a search architecture based on Lucene with custom extensions. This includes an in-memory real-time index optimized for concurrency without locks, and a schema-based document factory. Future work includes support for parallel index segments and additional Lucene features.
Group meeting: UniSan - Proactive Kernel Memory Initialization to Eliminate D...Yu-Hsin Hung
UniSan is a compiler-based approach that uses static program analysis to identify unsafe kernel memory allocations that have potential to leak sensitive data. It instruments the code to initialize only the unsafe allocations with memset to zero. The evaluation found it effectively prevented 43 recent Linux kernel uninitialized data leaks while having low overhead for both system operations and user space programs. Future work could focus on custom heap allocators, analyzing more kernel modules, and applying the technique beyond kernels.
Lightning fast genomics with Spark, Adam and ScalaAndy Petrella
This document discusses using Apache Spark and ADAM to perform scalable genomic analysis. It provides an overview of genomics and challenges with existing approaches. ADAM uses Apache Spark and Parquet to efficiently store and query large genomic datasets. The document demonstrates clustering genomic data from the 1000 Genomes Project to predict populations, showing ADAM and Spark can handle large genomic workloads. It concludes these tools provide scalable genomic data processing but future work is needed to implement more advanced algorithms.
Efficient top-k queries processing in column-family distributed databasesRui Vieira
The document discusses efficient top-k query processing on distributed column family databases. It begins by introducing top-k queries and their uses. It then discusses challenges with naive solutions and prior work using batch processing. The document proposes three algorithms - TPUT, Hybrid Threshold, and KLEE - to enable real-time top-k queries on distributed data in a memory, bandwidth, and computation efficient manner. It also discusses implementation considerations for Cassandra's data model and CQL.
H2O World - Sparkling Water - Michal MalohlavaSri Ambati
H2O World 2015 - Michal Malohlava
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
Ontology-based multi-domain metadata for research data management using tripl...João Rocha da Silva
A presentation given on the IDEAS 2014 Conference about database modelling using triple stores for research data management.
IDEAS '14, July 07 - 09 2014, Porto, Portugal.
Paper Abstract:
Most current research data management solutions rely on a fixed set of descriptors (e.g. Dublin Core Terms) for the description of the resources that they manage. While these are easy to understand and use, their semantics are limited to general concepts, leaving out domain-specific metadata and representing values as sets of text values. While this enables retrieval through free-text search, faceted search and dataset interlinking become limited. From the point of view of the relational database schema modeler, designing a more flexible metadata model represents a non-trivial challenge because of the open nature of the model. This work examines the approaches followed by current open-source platforms and proposes a graph-based model for achieving modular, ontology-based metadata for interlinked data assets in the Semantic Web. The proposed model was implemented in a collaborative research data management platform currently under development at the University of Porto.
Slides presented at the Spark Summit East 2015 (http://spark-summit.org/east). Video should be available through their site, at some point in the future.
(Some of these slides were adapted from an earlier talk "Why is Bioinformatics a Good Fit for Spark?", given to a Spark meetup audience.)
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...Ilkay Altintas, Ph.D.
Scientific workflows are used by many scientific communities to capture, automate and standardize computational and data practices in science. Workflow-based automation is often achieved through a craft that combines people, process, computational and Big Data platforms, and application-specific purpose and programmability, leading to provenance-aware archival and publication of the results. This talk summarizes varying and changing requirements for distributed workflows influenced by Big Data and heterogeneous computing architectures and presents a methodology for workflow-driven science based on these maturing requirements.
2016-10-20 BioExcel: Advances in Scientific Workflow EnvironmentsStian Soiland-Reyes
Carole Goble, Stian Soiland-Reyes
http://orcid.org/0000-0001-9842-9718
Presented at 2016-10-20 BioExcel Workflow Training, BSC, Barcelona
http://bioexcel.eu/events/bioexcel-workflow-training-for-computational-biomolecular-research/
NOTE: Although these slides are licensed as CC Attribution, it includes various logos which are covered by their own licenses and copyrights.
Spark Summit EU talk by Ruben Pulido and Behar VeliqiSpark Summit
The document describes IBM's transition from a single-tenant Hadoop architecture for their Watson Analytics for Social Media product to a multitenant Apache Spark architecture supporting over 3000 tenants. Key aspects of the new architecture included splitting analytics into tenant-specific and language-specific components, aggregating social media feeds from all tenants into a single stream for processing, and removing tenant state from processing components to enable low-latency switching between tenants. This resulted in a scalable, robust pipeline for real-time social media analytics based on Spark, Kafka and Zookeeper.
Spark Summit - Watson Analytics for Social Media: From single tenant Hadoop t...Behar Veliqi
- WHAT IS WATSON ANALYTICS FOR SOCIAL MEDIA
- PREVIOUS ARCHITECTURE ON HADOOP
- THOUGHT PROCESS TOWARDS MULTITENANCY
- NEW ARCHITECTURE ON TOP OF APACHE SPARK
- LESSONS LEARNED
Proactive ops for container orchestration environmentsDocker, Inc.
This document discusses different approaches to monitoring systems from manual and reactive to proactive monitoring using container orchestration tools. It provides examples of metrics to monitor at the host/hardware, networking, application, and orchestration layers. The document emphasizes applying the principles of observability including structured logging, events and tracing with metadata, and monitoring the monitoring systems themselves. Speakers provide best practices around failure prediction, understanding failure modes, and using chaos engineering to build system resilience.
Open Annotation Rollout, Manchester, 2013-06-25
See also PDF version: http://www.slideshare.net/soilandreyes/2013-0624annotatingr-osopenannotationmeeting-23289491
Open Annotation Rollout, Manchester, 2013-06-25
See also PPTX version with Notes: http://www.slideshare.net/soilandreyes/2013-0624annotatingr-osopenannotationmeeting
Building and deploying LLM applications with Apache AirflowKaxil Naik
Behind the growing interest in Generative AI and LLM-based enterprise applications lies an expanded set of requirements for data integration and ML orchestration. Enterprises want to use proprietary data to power LLM-based applications that create new business value, but they face challenges in moving beyond experimentation. The pipelines that power these models need to run reliably at scale, bringing together data from many sources and reacting continuously to changing conditions.
This talk focuses on the design patterns for using Apache Airflow to support LLM applications created using private enterprise data. We’ll go through a real-world example of what this looks like, as well as a proposal to improve Airflow and to add additional Airflow Providers to make it easier to interact with LLMs such as the ones from OpenAI (such as GPT4) and the ones on HuggingFace, while working with both structured and unstructured data.
In short, this shows how these Airflow patterns enable reliable, traceable, and scalable LLM applications within the enterprise.
https://airflowsummit.org/sessions/2023/keynote-llm/
This document provides guidance on interpreting and reporting performance test results. It discusses collecting various metrics like load, errors, response times and system resources during testing. It emphasizes aggregating the raw data into meaningful statistics and visualizing the results in graphs to gain insights. Key steps in the process include interpreting observations and correlations to develop hypotheses, assessing conclusions to make recommendations, and reporting the findings to stakeholders in a clear and actionable manner. The overall approach is to turn large amounts of data into a few insightful pictures and conclusions that can guide technical or business decisions.
Azure Machine Learning - Deep Learning with Python, R, Spark, and CNTK Herman Wu
The document discusses Microsoft's Cognitive Toolkit (CNTK), an open source deep learning toolkit developed by Microsoft. It provides the following key points:
1. CNTK uses computational graphs to represent machine learning models like DNNs, CNNs, RNNs in a flexible way.
2. It supports CPU and GPU training and works on Windows and Linux.
3. CNTK achieves state-of-the-art accuracy and is efficient, scaling to multi-GPU and multi-server settings.
This presentation covers the basic concepts of distributed tracing and OpenTracing, and walks through Jaeger's hands-on sample application (HotROD).
Distributed tracing allows requests to be tracked across multiple services in a distributed system. The Jaeger distributed tracing system was used with the HotROD sample application to visualize and analyze the request flow. Key aspects like latency bottlenecks and non-parallel processing were identified. Traditional logs lack the request context provided by distributed tracing.
Sharing massive data analysis: from provenance to linked experiment reportsGaignard Alban
The document discusses scientific workflows, provenance, and linked data. It covers:
1) Scientific workflows can automate data analysis at scale, abstract complex processes, and capture provenance for transparency.
2) Provenance represents the origin and history of data and can be represented using standards like PROV. It allows reasoning about how results were produced.
3) Capturing and publishing provenance as linked open data can help make scientific results more reusable and queryable, but challenges remain around multi-site studies and producing human-readable reports.
This document discusses tools for distributed data analysis including Apache Spark. It is divided into three parts:
1) An introduction to cluster computing architectures like batch processing and stream processing.
2) The Python data analysis library stack including NumPy, Matplotlib, Scikit-image, Scikit-learn, Rasterio, Fiona, Pandas, and Jupyter.
3) The Apache Spark cluster computing framework and examples of its use including contexts, HDFS, telemetry, MLlib, streaming, and deployment on AWS.
Workshop: Big Data Visualization for SecurityRaffael Marty
Big Data is the latest hype in the security industry. We will have a closer look at what big data is comprised of: Hadoop, Spark, ElasticSearch, Hive, MongoDB, etc. We will learn how to best manage security data in a small Hadoop cluster for different types of use-cases. Doing so, we will encounter a number of big-data open source tools, such as LogStash and Moloch that help with managing log files and packet captures.
As a second topic we will look at visualization and how we can leverage visualization to learn more about our data. In the hands-on part, we will use some of the big data tools, as well as a number of visualization tools to actively investigate a sample data set.
RDF Validation in a Linked Data World - A vision beyond structural and value ...Nandana Mihindukulasooriya
This document discusses RDF validation in a Linked Data context. It outlines factors to consider in designing an RDF validation process, including data source dynamics, publication strategy, and access control. It also covers procedural factors like the number of data sources and validation scope. Context factors like the validation purpose and data provenance must also be taken into account. The conclusion is that RDF validation for Linked Data needs to accommodate the particularities of the data sources, processes, and context involved.
"Data Provenance: Principles and Why it matters for BioMedical Applications"
1. Data Provenance: Principles and Why it matters for BioMedical Applications
PART-2: Tools and Techniques
Informatics for Health 2017 Preconference Tutorial
Vasa Curcin, Lecturer, HSCR & Department of Informatics KCL
Pinar Alper, Postdoc at HSCR KCL
22.Apr.2017
Manchester/UK
1
3. Provenance Recap
Provenance has a Particular Subject & has a Standard Model:
• “information about entities, activities, and people involved in producing a
piece of data or thing…” (W3C PROV).
Figure: the activity "edit" used entity "page1" and generated (wasGeneratedBy) entity "page2"; the activity wasAssociatedWith the agent "Wikipedia Editor".
• PROV-DM Core vocabulary:
• Agent, Activity, Entity
• Causal relations among elements
• Conceptually a graph.
• Constraints:
• Typing: if two things are linked with actedOnBehalfOf, they are of type agent
• Impossibility: activities and entities are disjoint; specialisation is not reflexive
• Ordering: Usage must occur between the start and end events of activities
• Human and Machine understandable representation:
• PROV-O,
• PROV-N
• Extensibility points
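As a concrete illustration of the PROV-DM core, the Wikipedia example above can be built and serialized to PROV-N programmatically. This is a minimal sketch using the community-maintained "prov" Python package (pip install prov); the ex: namespace and node names are illustrative, mirroring the figure rather than coming from the slides.

    from prov.model import ProvDocument

    doc = ProvDocument()
    doc.add_namespace('ex', 'http://example.org/')

    doc.entity('ex:page1')                     # entity
    doc.entity('ex:page2')                     # entity
    doc.activity('ex:edit')                    # activity
    doc.agent('ex:WikipediaEditor')            # agent

    doc.used('ex:edit', 'ex:page1')            # edit used page1
    doc.wasGeneratedBy('ex:page2', 'ex:edit')  # page2 wasGeneratedBy edit
    doc.wasAssociatedWith('ex:edit', 'ex:WikipediaEditor')

    print(doc.get_provn())                     # human-readable PROV-N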
5. Provenance Capture
• Rigorously studied in the context of Scientific Computing.
• Workflows
• Scripts
• Command-Line Tools
• System-Level (File and Operating System)
• Databases
• Templates
5
6. • Became popular over the past decade.
• Pipelines of tasks with dataflow
dependencies.
• Automation, Resource Access Client
• Somewhat disruptive, need to wrap
resources.
• Provenance for the output scientific data:
• An outline of the method followed
• Resources used (repositories, tools,
services).
• Parameter configurations and
intermediary results.
Scientific Workflows
Figure: a workflow of Analysis, Visualization and Adaptation steps connected by data and parameter links, drawing on a community data repository, community tools, and local tools & data.
7.
• Backward-Looking history
of what the workflow
engine observed during
the run:
• Data nodes with identifiers
minted by WF engine. Data
often stored separately.
• Task invocations,
timestamp.
• Data usage and generation
by tasks.
• Actor is primarily the WF
engine.
• Inferred provenance:
• Task causal dependencies
• Data causal dependencies
Workflows Execution Provenance
8. WF Provenance - Transparency
• Annotated grey-box provenance.
• Annotate workflow, get auto-annotated provenance
• Annotate provenance
Figure: input, activity, output; e.g. a DNA Sequence used by a Sequence Alignment activity that generates a BLAST Report.
• Wraps resources to incorporate them into the workflow, hence an "Observer"
perspective. In the most basic case this provides black-box provenance.
• We also have white-box provenance; we will come to that later!
Figure: PROV extensibility points: prov:type marks an activity as a CL tool execution, script execution or web service invocation; hadRole and other attributes annotate inputs and outputs.
9. WF Provenance - Perspective
• Workflow Provenance is the first
to be referred to as “Prospective
Provenance” ‼
• Prospective provenance is often
viewed as the provenance by end-
users (scientists).
• Prospective Provenance can be
useful as an abstraction over
(bulky) retrospective provenance.
Figure: prospective provenance layered above retrospective provenance, connected by lineage.
13. A workflow may not necessarily be implemented by a
workflow system
13
• Provenance Challenge Series 1-4
• Provenance Challenge Workflow: 5-step computational process using Functional
Magnetic Resonance Imaging (fMRI) data.
• Have been realized with WF, tool, and system-level provenance
http://twiki.ipaw.info/bin/view/Challenge/FirstProvenanceChallenge
Figure: lineage queries combine domain data with patterns of inquiry, e.g. lineage traversal.
14. Scripts
• Popular, established method of data processing
• Integration into statistical, numerical analysis libraries
• Visualization libraries
• Researchers have recently started paying attention to script
provenance.
• Currently research prototypes rather than out-of-the-box features.
• Minimally disrupting existing practices. No technology change, No
wrapping!
14
15. Script Provenance - RDataTracker
Barbara Lerner and Emery Boose. RDataTracker: Collecting provenance
in an interactive scripting environment. In 6th USENIX Workshop on
the Theory and Practice of Provenance (TaPP 2014).
15
• Extend R scripts with logging
statements
• Provenance ON/OFF:
ddg.init, ddg.save
• Abstract multiple commands:
ddg.procedure
• Post-execution visualize the
Data Derivation Graph
Figure legend: process nodes, data nodes, data-flow edges, control-flow edges.
https://github.com/End-to-end-provenance/RDataTracker
16. Script Provenance- YesWorkflow
16
• Annotate method declarations in R,
Matlab scripts
• Makes “latent workflow information
from scripts explicit”.
• Prospective provenance, a workflow
abstraction over the script.
• Visualized in process, data and
combined views
• @begin, @end
• @in, @out, @as
• Inputs/outputs can be concrete files
• Or can refer to prospective
resources identified by templates.
Timothy M. McPhillips, Tianhong Song et al. YesWorkflow: A User-Oriented, Language-Independent Tool for Recovering Workflow Information from Scripts. CoRR, abs/1502.02403, 2015.
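To make the annotation style concrete, here is a minimal sketch of a YesWorkflow-annotated script. The slides mention R and Matlab scripts; the tags are language-independent comments, shown here in Python, and the file and step names are invented for illustration.

    # @begin analysis
    # @in raw.csv @as raw_data
    # @out plot.png @as result_plot

    # @begin clean_data
    # @in raw.csv @as raw_data
    # @out clean.csv @as clean_data
    def clean_data():
        pass  # e.g. filter malformed rows from raw.csv into clean.csv
    # @end clean_data

    # @begin visualize
    # @in clean.csv @as clean_data
    # @out plot.png @as result_plot
    def visualize():
        pass  # e.g. render clean.csv as plot.png
    # @end visualize

    # @end analysis

YesWorkflow recovers the two-step dataflow graph from these comments alone, without executing the script.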
18. Script Provenance - noWorkflow
• Definition: Analyze the AST of
Python script.
• Arguments, function calls, global vars
• User defined functions
• Deployment: Environment and
library dependencies.
• Uses Python os, socket and
platform libraries
• Collected right before script execution
Leonardo Murta, Vanessa Braganholo, Fernando Chirigati, David Koop, and Juliana Freire. noWorkflow: Capturing and analyzing provenance of scripts. In 5th International Provenance and Annotation Workshop (IPAW), Cologne, Jun 2014.
18
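The definition-time analysis can be pictured with Python's standard ast module. This is an illustrative sketch of the kind of static facts such a tool can harvest (function definitions and call sites), not noWorkflow's actual implementation; the analyzed source is an invented toy script.

    import ast, textwrap

    source = textwrap.dedent('''
        def double(x):
            return 2 * x

        result = double(21)
    ''')
    tree = ast.parse(source)

    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            # user-defined function and its argument names
            print('defines:', node.name, [a.arg for a in node.args.args])
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            # direct function call site
            print('calls:', node.func.id, 'line', node.lineno)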
19. Script Provenance -noWorkflow
• Execution: Runtime information
• Uses python’s runtime profiling and reflection capabilities
• the start time of the function activation, together with the values of every argument,
return, and globals
• open() system calls to record dependencies on files and snapshot file contents at the
time of the call
19
https://github.com/gems-uff/noworkflow
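The runtime side can be sketched with Python's profiling hook, which observes every function activation. This is a toy illustration of the mechanism (timestamps, argument values, return values), not noWorkflow's code.

    import sys, time

    def observer(frame, event, arg):
        # record activations: name, argument values, return value, timestamp
        if event == 'call':
            print(time.time(), 'call', frame.f_code.co_name,
                  dict(frame.f_locals))
        elif event == 'return':
            print(time.time(), 'return', frame.f_code.co_name, arg)

    def double(x):
        return 2 * x

    sys.setprofile(observer)   # start auditing
    double(21)
    sys.setprofile(None)       # stop auditing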
20. Script Provenance -noWorkflow
• A lot of information!
• Analysis
1. Highly summarized Activation Graph. Nodes in one control flow block (e.g. loop) merged.
2. Diff two executions (report diff in dates, argument values, environment settings)
3. Query in Prolog access_influence(File, ’output.png’)
20
https://github.com/gems-uff/noworkflow
21. Script Provenance -CXXR
• R Interpreter with audit features
• Tracks provenance for bindings generated in the user workspace
• provenance(x): Returns a list comprising: expression, symbol, timestamp, parents, children
• pedigree(x): Displays the sequence of commands issued which result in x's current state
21
https://github.com/timothyjgraham/cxxr
22. Script Provenance
22
• End-use:
• Minimally disruptive, but yet to prove its utility.
• Black-box provenance.
Too fine-grained a process abstraction?
Are variables, arguments or files useful data abstractions?
23. Command Line Tools
• Popular interface for many scientific analysis libraries
• Web-based reproducibility frameworks: a layer between the user and the OS
command shell.
• Register tools with metadata in the framework. Grey-box provenance.
• Layer directs tool execution
• Layer records provenance
Sumatra: automated tracking of scientific computations. https://pythonhosted.org/Sumatra/
Galaxy Data intensive biology for everyone https://galaxyproject.org/
23
24. Tool Provenance
• End use:
• Convert exploratory steps into workflows
• Compare different invocations
24
25. Database Provenance
• Databases widely used for
managing scientific resource
metadata:
• Data Catalogs
• Service Registries
• White-box Provenance
• Why: witness tuples
• How: the way witness tuples contribute to the result.
• Where: cell-level data copying from
witness tuples to result
James Cheney, Laura Chiticariu, and Wang Chiew Tan. Provenance in databases: Why, how, and where. Foundations and Trends in
Databases, 1(4):379–474, 2009.
25
26. Database Provenance
• Why provenance:
• Query debugging: Why did I get this result?
• View maintenance: If I update this record, do I need to refresh this
(materialized) view?
• How provenance:
• Trust and uncertainty computation
• Where provenance:
• Annotation propagation
• If desired, white-box provenance can be computed after query execution, when needed.
• Limited to research prototypes
26
http://infolab.stanford.edu/trio/
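A toy sketch of why-provenance over two in-memory "tables" (invented data; not the Trio system): each result of a select-join query is returned together with its witness tuples, the input tuples whose presence explains it.

    runs    = [('r1', 'blast'), ('r2', 'bowtie')]      # (run_id, tool)
    outputs = [('r1', 'hits.txt'), ('r2', 'aln.bam')]  # (run_id, file)

    def join_with_why(runs, outputs):
        # SELECT tool, file FROM runs JOIN outputs USING (run_id)
        for run_id, tool in runs:
            for run_id2, fname in outputs:
                if run_id == run_id2:
                    witnesses = [('runs', (run_id, tool)),
                                 ('outputs', (run_id2, fname))]
                    yield (tool, fname), witnesses

    for result, why in join_with_why(runs, outputs):
        print(result, '<-', why)   # result plus its why-provenance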
27. System-Level Provenance
• Every application runs on some OS and uses some storage system
• Not disruptive: zero modification to applications.
• An audit mode for systems, explore feasibility and overhead.
27
28. System-Level Provenance - SPADE
• A consumer of the audit APIs of different OSs (Windows, Mac OS, Linux, Android)
• Collects process information: name, owner, creation time, command line,
environment vars, files read/written.
• Reporters:
• Default for each OS. E.g. in Linux exec(), fork(), clone(), exit(), open(),
close(), read(), write(), truncate() are picked up by reporters
• Domain specific reporters. Add onto default.
• Reporting overhead is variable; it depends on the nature of the application and the target OS.
• Compile Apache Web Server on Windows (53%)
• Run BLAST tool on Linux (5%), on MacOS (10%)
• OPM compliant
A. Gehani and D. Tariq. SPADE: Support for Provenance Auditing in Distributed Environments. In Middleware 2012 - Proceedings, pages 101–120, 2012.
28
29. System-level Provenance - PASS
• Modified Linux Kernel. Relies on the tracking of system calls.
• Focuses on files, keeping a detailed change record.
29
• For each file in the filesystem,
• The executable that created it
• Any input files
• “Complete” hardware platform
description
• Command line
• Process environment
• Other data such as random seeds
> sort a > b
Kiran-Kumar Muniswamy-Reddy, David A. Holland, Uri Braun, and Margo Seltzer. Provenance-aware storage systems. In Proceedings of the Annual Conference on
USENIX ’06 Annual Technical Conference, ATEC ’06, pages 4–4.
30. System-Level Provenance - PASS
• End Use
• Script generation
• Generate a Makefile that reproduces a file
• Detecting system changes
• Compare provenance of two files to detect changes in
environment, libraries, etc.
• Intrusion detection
• Detailed logs of how objects have changed
• Very fine-grained provenance!
• 5000 objects in response to a lineage inquiry over a trace with 50 elements.
30
31. To this end
• We focused on capturing provenance from a particular computational
system
• System dictates granularity
Data: Tuples, files, variables
Activity: Query operator, sys-call, wf activity
• System dictates transparency
White-box for DB vs Black/Grey Box for WFs and Tools
31
32. In reality
• We have heterogeneity and legacy:
• We use heterogeneous technologies.
• Provenance may be recorded in mixed granularities.
• Existing audit capabilities and provenance-like information recorded.
• Audited processes can span long time frames.
32
33. Consider the case
Figure: a Medical Practice Management (MPM) system and a Stroke Decision Aid (SDA), each with its own DB and log files, linked through a Portal and an Authentication Service used by clinicians and patients; the goal is a global provenance picture spanning portal usage info and authentication traces.
Designing a decision aid to improve secondary prevention for stroke survivors with multimorbidity: A stakeholder engagement study (T Porat, I Marshall, E Sadler, MA Vadillo, V Curcin, C McKevitt, C Wolfe) - Abstract presentation at Farr 2017, Tuesday 4pm
• Has there been an update to the patient's EHR in MPM between the initial assessment and the follow-up in SDA?
• Do patients use any part of the portal after the initial assessment activity in SDA?
• What are all the prior authentication records that led to MPM activities updating this erroneous EHR?
• What is the average time between initial assessments and follow-ups?
34. A Solution Approach – Provenance Templates
• Focus on Provenance rather than its capture from a specific system.
• How can we combine provenance of multiple sources in a controlled manner?
• Focus on incorporating Domain-Specific information
• Heterogeneous systems interoperate via domain ontologies. How can we obtain
domain annotated provenance?
• Two groups actively working on it
34
“Templates as a method for implementing data provenance in decision support systems.” Vasa Curcin, Elliot Fairweather, Roxana Danger, Derek Corrigan. Journal of Biomedical Informatics 65 (2017) 1-21
https://bitbucket.org/kclbig/templates
https://provenance.ecs.soton.ac.uk/prov-template/
In publication.
37. Zone
• A connected sub-graph of a template graph, which can be instantiated multiple times
• Series/parallel types of zones
• Restrictions on minimum and maximum number of instantiations
Example template (figure): variables var:a, var:b, var:c, var:d and var:x linked by used, wasGeneratedBy, wasDerivedFrom and wasAssociatedWith relations; a zone (id=zone1, type=parallel, min=1, max=5) marks which variables are zone variables and which are external variables.
41. What constitutes a Valid Substitution?
• For Template Base
• Distinct values among all external variables
• Distinct values among the global node id space for non-graft variables
• Values for all value variables
• For each Zone substitution
• Distinct values for all zone variables among each other and among the global
node id space
• Values for all value variables
41
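A minimal sketch of what instantiating a template with a substitution might look like (illustrative only; not the KCL or Southampton implementations, and the identifiers are invented). A bindings map replaces the var:* placeholders, and the distinctness rules above are checked before the concrete provenance edges are emitted.

    template_edges = [
        ('var:c', 'used', 'var:a'),
        ('var:b', 'wasGeneratedBy', 'var:c'),
        ('var:c', 'wasAssociatedWith', 'var:x'),
    ]

    bindings = {                        # one candidate substitution
        'var:a': 'ehr:record-123-v1',   # hypothetical identifiers
        'var:b': 'ehr:record-123-v2',
        'var:c': 'app:update-456',
        'var:x': 'user:clinician-7',
    }

    def instantiate(edges, bindings):
        # validity check: every variable bound, bound values distinct
        assert all(v in bindings for e in edges for v in (e[0], e[2]))
        assert len(set(bindings.values())) == len(bindings)
        return [(bindings[s], rel, bindings[o]) for s, rel, o in edges]

    for edge in instantiate(template_edges, bindings):
        print(edge)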
47. 1. Enforce structure over provenance
• Templates can be seen as
Provenance Schemas. Acceptable
patterns of provenance.
• Strict: there is no optionality. If a variable is included in a template, bindings will be expected for it.
• Loose: Templates may not encode all
possible patterns. (Grafts)
• Prospective Provenance in its true
form!
47
48. 2. Go beyond single-system restriction
• Combine provenance at different granularities - the importance of identification.
• User identity
• EHR associated with identity. Track CRUD at the EHR level.
• Fields of the EHR relevant for provenance. Track updates at the selected EHR field level
• Build up provenance over time
1. Authentication
2. Activity on portal
3. Activity within application
4. Authentication
5. Activity on portal
6. Activity within application
48
49. 3. Support other forms of Provenance
• Provenance can be found in other forms (XLS files, logs)
• We can create substitutions from these other forms
• Hence migrate legacy provenance to a standards-compliant (graph-based) representation
49
Figure: a MIAME spreadsheet mapped to a provenance template; rows become substitutions, columns become variables, and template generation produces standard provenance.
Minimum Information for Biological and Biomedical Investigations
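A sketch of how such tabular legacy records could be turned into template substitutions (the file name and column headers are hypothetical): each spreadsheet row becomes one substitution, each column one template variable.

    import csv

    with open('miame_records.csv') as f:          # legacy spreadsheet
        for row in csv.DictReader(f):
            # one row -> one substitution; columns -> template variables
            bindings = {'var:' + column: value
                        for column, value in row.items()}
            print(bindings)   # feed into template instantiation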
50. 4. Possibility of devising Model-Driven techniques
for introducing provenance support to existing tools
• Identify parts of the system (Actor, Entity, Activity) that are relevant for
provenance tracking
• Data Flow Diagrams to understand the data entities
• Use cases that should be tracked
• Create sample (concrete) provenance
• Refactor templates from concrete provenance
50
• Map templates to service end points
on the template server
• Add service client code to existing
tools
• Map templates to (Java) beans
• Add bean population code to existing
tools
52. Querying Provenance
Most common pattern of access:
1. Locate node(s) of interest:
• based on attributes. e.g. Process nodes of a particular name, Data nodes produced after
some date, nodes with timestamp in a range
• based on involved in relationship(s) (consumption, generation). e.g. Data generated by a
particular buggy process.
2. Traverse lineage, causation (with transitive closure)
Figure: locate nodes of interest, then traverse lineage.
53. A Less common pattern of access:
Locate sub-graph of interest:
• These are advanced queries that may involve regular path expressions
Querying Provenance
Figure: locate a sub-graph of interest.
54. Data Model for Provenance
• Relational
• Traversals require several self-joins in SQL, which can be costly in the case of deep
lineage traces.
• Yet still very popular!
• Joint querying of application data and provenance
• Graph
• Graph DB neo4j
• Cypher Query Language (Variable length relations) +Free text search
• Visualization
• Triple stores:
• W3C SPARQL based querying + free text search
• Ontology integration
• Reasoning and Rules support,
54
http://sig.biostr.washington.edu/projects/ontviews/gleen/index.html
https://neo4j.com/product/
http://rdf4j.org/
https://jena.apache.org/
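As a sketch of the locate-then-traverse pattern on a graph store (using the official neo4j Python driver; the labels, relationship name, property key and connection details are assumptions, not a prescribed schema), a Cypher variable-length path computes the transitive lineage closure:

    from neo4j import GraphDatabase

    driver = GraphDatabase.driver('bolt://localhost:7687',
                                  auth=('neo4j', 'password'))

    LINEAGE = """
    MATCH (d:Entity {name: $name})-[:wasDerivedFrom*1..]->(a:Entity)
    RETURN DISTINCT a.name AS ancestor
    """

    with driver.session() as session:
        # locate the node of interest, then traverse its lineage
        for record in session.run(LINEAGE, name='output.png'):
            print(record['ancestor'])
    driver.close()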
55. Exploring provenance graphs
• Browsing Layers of provenance
• WF Design - Manual
• (Semi) automated - grouping
Manish Kumar Anand, Shawn Bowers, Bertram Ludäscher. Provenance browser: Displaying and querying scientific workflow provenance graphs. Proceedings of the ICDE 2010 Conference.
Interactive Visualization of Provenance Graphs for Reproducible Biomedical Research. Stefan Luger et al. 5th Symposium on Biological Data Visualization.
55
Extract a sub-graph prior to browsing
Olivier Biton, Sarah Cohen-Boulakia, Susan B. Davidson, and
Carmem S. Hara. Querying and Managing Provenance through
User Views in Scientific Workflows. pages 1072–1081, ICDE 2008.
56. Exploring provenance graphs
• Aggregating graph elements
Aggregation by Provenance Types: A Technique for Summarising Provenance Graphs. Luc Moreau. In Proceedings GaM 2015, arXiv:1504.02448
Rinke Hoekstra, Paul Groth. PROV-O-Viz - Understanding the Role of Activities in
Provenance. IPAW 2014: pp 215-220 56
57. Exploring provenance
• A move towards customized visualizations:
E.g. Provenance of annotation sentence “inactivated by cyanide”. It
originates in TrEMBL, but ends up in Swiss-Prot.
Bell MJ, Collison M, Lord P (2013) Can Inferred Provenance and Its Visualisation Be Used to Detect Erroneous Annotation?
A Case Study Using UniProtKB. PLoS ONE 8(10): e75541. https://doi.org/10.1371/journal.pone.0075541
57
58. PROV implementations
The full list is at: https://www.w3.org/2011/prov/wiki/ProvImplementations
60. Komadu
• Collect and Query API as a web service
• Visualize provenance, export to CSV format viewable in Cytoscape
• Export as PROV-XML
60
https://github.com/Data-to-Insight-Center/komadu/blob/master/docs/KomaduUserGuide.pdf
61. The Oh-Yeah Button
61
• PROV-AQ Implementation
• Browser add-on
http://users.ugent.be/~tdenies/OhYeah/
65. Sample PROV Datasets
65
PROV BENCH, PROV RECONSTRUCTION
1st ProvBench: Benchmarking Provenance Management Systems. In K. Belhajjame and J. Zhao,
editors, Proceedings of the Joint EDBT/ICDT 2013 Workshops.
Data2Semantics. Provenance reconstruction challenge. http://data2semantics.github.io/, 2014.
2nd ProvBench: Benchmarking Provenance Management Systems. K. Belhajjame and A. Chapman. https://sites.google.com/site/provbench/home/provbench-provenance-week-2014, 2014.
66. Follow State of the Art in Provenance
IPAW (every two years) and TaPP (annual) have been held jointly as Provenance Weeks.
IPAW 2018 will be held at King's College London!
66
#3: In the first part of the tutorial we presented:
What constitutes provenance information,
The W3C standard PROV vocabulary to represent this information.
In the second part we will present the existing set of tools and techniques (coming out of research) that help us to capture and manage provenance information.
#14: You do not need to use a workflow system to have a workflow.