SlideShare a Scribd company logo
Data cleaning with the Kurator toolkit
Bridging the gap between conventional scripting and
high-performance workflow automation
Timothy McPhillips, David Lowery, James Hanken,
Bertram Ludäscher, James A. Macklin, Paul J. Morris,
Robert A. Morris, Tianhong Song, and John Wieczorek
TDWG 2015 - Biodiversity Informatics Services and Workflows Symposium
Nairobi, Kenya - September 30, 2015
Kurator: workflow automation for
cleaning biodiversity data
Project  aims
§ Facilitate cleaning of  biodiversity  data.
§ Support both  traditional scripting  and  
high-­performance  scientific  workflows.
§ Deliver  much  more  than a  fixed  set  of  
configurable  workflows.
Technical  strategy
§ Wrap  Akka actor toolkit in  a  curation-­oriented  
scientific  workflow language and  runtime.
§ Enable  scientists  who  write  scripts  to  add their  
own  new data  validation  and  cleaning  actors.  
§ Bring  to  scripts  major  advantages  of  workflow  
automation:    prospective and  retrospective  
provenance.
§ Bridge gaps  between  data  validation  services,  
data  cleaning  scripts,  and  pipelined  data  curation  
workflows.
Empower  users  and  developers  of  scripts,  actors,  and  workflows.
What some of us think of when we hear
the term ‘scientific workflows’
Phylogenetics workflow  in  Kepler (2005)
Graphical  interface
§ Canvas  for  assembling  and  
displaying  the  workflow.
§ Library  of  workflow  blocks  
(‘actors’)  that  can  be  
dragged  onto  the  canvas  
and  connected.
§ Arrows  that  represent  
control  dependencies  or  
paths  of  data  flow.
§ A  run  button.
These  features  are  not  
essential  to  managing  
actual  scientific  
workflows.
10 essential functions
of a scientific workflow system
1. Automate  programs  and  services  scientists  already  use.  
2. Schedule invocations  of  programs  and  services  correctly  and  efficiently  
– in  parallel  where  possible.
3. Manage  data  flow  to,  from,  and  between  programs  and  services.
4. Enable scientists  (not  just  developers)  to  author  or  modify workflows  easily.
5. Predict what  a  workflow  will  do  when  executed:  prospective  provenance.
6. Record what  actually  happens  during  workflow  execution.
7. Reveal retrospective provenance – how  workflow  products  were  derived  
from  inputs  via  programs  and  services.
8. Organize intermediate  and  final  data  products  as  desired  by  users.
9. Enable scientists  to  version,  share and  publish their  workflows.  
10. Empower  scientists who  wish  to  automate additional  programs  and  
services  themselves.
These  functions–not  actors—distinguish  scientific  workflow  
automation from  general  scientific  software  development.
Why build yet another system?
Available  systems
§ Kepler  (Ptolemy  II),  Taverna,  VisTrails…
§ Familiar  graphical  programming  
environments.  
Limitations
§ Often  little  support for  organizing
intermediate  and  final  data products  in  
ways  familiar  to  scientists.
§ Professional  software  developers
frequently  are  needed to  develop  new  
components or  workflows.
Huge  gap  between  how  these  systems  
are  used  and  how  scientists  already  
automate  their  analyses—via  scripting.
Part  of  a  Kepler  workflow  for  
inferring  phylogenetic   trees  from  
protein  sequences.
Avoiding actor ‘overuse injuries’
Overuse  the  actor  paradigm…
§ In  many  systems  workflows  can  be  reused  as  
actors,  or  ‘subworkflows’  in  other  workflows.  
§ This  is  a  necessary  abstraction  when  
workflow  systems  are  not  well-­integrated  
with  scripting  languages.
§ Each  actor  at  left  is  a  page  of  Java  code.    But  
this  whole  ‘subworkflow’  could  be  written  in  a  
half  a  page  of  Python!
…or  use  the  right  tools  for  the  right  job!
§ In  Kurator we  want  to  enable  the  actor
abstraction where  it  pays  off  the  most—as  
the  unit  of  parallelism.
§ For  specifying  behavior  inside  of  actors  why  
not  use  easy  to  understand  scripts?
New  actors  and  workflows  must be  
easy  and  fast  to  develop.
Part  of  a  Kepler  workflow  for  
inferring  phylogenetic   trees  from  
protein  sequences.
Data curation workflow using Kepler
FilteredPush explored  using  workflows  for  data  cleaning
§ First  used  COMAD workflow  model  supported  by  Kepler.
§ Enabled  graphical assembly  and  configuration  of  custom workflows
from  library of  actors.
Highlighted  potential  performance  limitations  of  workflow  engines.
FP-­Akka workflows
load  data
check  scientific  name
check  basis  of  record
check  date  collected
check  lat/long
write  out  results
§ FilteredPush next  investigated  use  of    
the  Akka actor  toolkit and  platform.
§ Widely  used in  industry,  well  
supported,  and  rapidly  advancing.
§ Efficient  parallel  execution across  
multiple  processors  and  compute  nodes.
§ Improved  performance and  scalability
compared  to  Kepler.  
Limitations  of  directly using  Akka
§ Writing  new  Akka actors  and  programs  
requires  Java (or  Scala)  experience.
§ Must  address  many  parallel    
programming  challenges from  scratch.  
Advanced  programming  skills  
required  to  write  Akka programs  
(‘workflows’)  that  run  correctly.
Akka partly supports two essential
workflow platform requirements
The  Kurator toolkit  will  satisfy  the  rest  as  needed.  
1. Automate  programs  and  services  scientists  already  use.  
2. Schedule  invocations  of  programs  and  services  correctly  and  
efficiently–in  parallel  where  possible.
3. Manage  data  flow  to,  from,  and  between  programs  and  services.
4. Enable  scientists  (not  just  developers)  to  author  or  modify  workflows  easily.
5. Predict  what  a  workflow  will  do  when  executed:  prospective  provenance.
6. Record  what  actually  happens  during  workflow  execution.
7. Reveal  retrospective  provenance  – how  workflow  products  were  derived  from  
inputs  via  programs  and  services.
8. Organize  intermediate  and  final  data  products  as  desired  by  users.
9. Enable  scientists  to  version,  share  and  publish  their  workflows.  
10. Empower  scientists  who  wish  to  automate  additional  programs  and  services  
themselves.
The Kurator Toolkit
YesWorkflow (YW)
§ Add  YW  annotations  to  any  script  or  program  
that  supports  text  comments.    Highlight  the  
workflow  structure  in  the  script.
§ Visualize or  query  prospective provenance
before running  a  script.
§ Reconstruct,  visualize,  and  query  retrospective
provenance after running  the  script.
§ Integrate provenance gathered  from  file  names  
and  paths,  log  files,  data  file  headers,  run  
metadata,  and  records of  run-­time  events.
Kurator-­Akka
§ Write functions or  classes  in  Python or  Java.    Mark  up  with  YesWorkflow.
§ Declare how  scripts  or  Java  code  can  be  used  as  actors.    Short  YAML blocks.
§ Declare workflows.    List  actors  and  specify  their  connections.    More  YAML.
§ Run workflow.    Use  Akka for  parallelization  transparently—and  correctly.
§ Reconstruct retrospective  provenance of  workflow  products.
Example: data validation and
cleaning using WoRMS web services
1) Write simple  Python functions  (or  a  class)  that  wrap the  web  services  provided  
by  the  World  Register  of  Marine  Species  (WoRMS).
2) Develop a  Python script  that  uses  the  service  wrapper  functions  from  (1)  to  
clean  a  set  of  records provided  in  CSV  format.
3) Mark  up  the  script written  in  (2)  with  YesWorkflow annotations   and  
graphically  display the  script  as  a  workflow.
4) Factor  out  of  (2) a  Python function  that  can  clean a  single  record  using  the  
WoRMS wrapper  functions  in  (1).
5) Write a  block  of  YAML that  declares how  the  record-­cleaning  function  in  (4)  can  
be  used  as  an  actor in  a  Kurator-­Akka workflow.
6) Declare using  YAML a  workflow that  uses  the  actor  declared  in  (5)  along  with  
actors  for  reading  and  writing  CSV  files.
7) Run the  workflow (6)  on  a  sample  data  set  with  the  CSV  reader,  CSV  writer  
and  WoRMS validation  actors  all  running  in  parallel.
README  at  https://github.com/kurator-­org/kurator-­validation  shows  how  to:
Illustrates  how  Kurator aims  to  facilitate  composition  of  actors  and  
high-­performance  workflows  from  simple  functions  and  scripts.
class   WoRMSService(object):  
"""
Class  for  accessing   the  WoRMS taxonomic   name  database   via  the  AphiaNameService.
The  Aphia names   services   are  described   at  http://marinespecies.org/aphia.php?p=soap.  
"""
WORMS_APHIA_NAME_SERVICE_URL   =  'http://marinespecies.org/aphia.php?p=soap&wsdl=1’
def __init__(self,   marine_only=False):
"""  Initialize   a  SOAP  client   using  the  WSDL  for  the  WoRMS Aphia names  service"""                 
self._client =  Client(self.WORMS_APHIA_NAME_SERVICE_URL)
self._marine_only =  marine_only
def aphia_record_by_exact_taxon_name(self,   name):
"""
Perform   an  exact   match  on  the  input   name  against   the  taxon   names  in  WoRMS.
This   function   first   invokes   an  Aphia names   service   to  lookup   the  Aphia ID  for
the  taxon   name.    If  exactly   one  match   is  returned,   this  function   retrieves   the
Aphia record   for  that  ID  and  returns   it.  
"""                
aphia_id =  self._client.service.getAphiaID(name,   self._marine_only);
if  aphia_id is  None  or  aphia_id ==  -­‐999:                   #  -­‐999  indicates   multiple   matches
return   None
else:
return   self._client.service.getAphiaRecordByID(aphia_id)
def aphia_record_by_fuzzy_taxon_name(self,   name):
:
:
Python class wrapping WoRMS services
WoRMS services
#  @BEGIN   find_matching_worms_record
#  @IN  original_scientific_name
#  @OUT  matching_worms_record
#  @OUT  worms_lsid
worms_match_result =  None
worms_lsid =  None
#  first  try  exact  match   of  the  scientific   name  against   WoRMS
matching_worms_record =  worms.aphia_record_by_exact_taxon_name(original_scientific_name)
if  matching_worms_record is  not  None:
worms_match_result =  'exact'
#  otherwise   try  a  fuzzy   match
else:
matching_worms_record =  worms.aphia_record_by_fuzzy_taxon_name(original_scientific_name)
if  matching_worms_record is  not  None:
worms_match_result =  'fuzzy’
#  if  either   match  succeeds   extract   the  LSID   for  the  taxon
if  matching_worms_record is  not  None:
worms_lsid =  matching_worms_record['lsid']
#  @END  find_matching_worms_record
Finding matching WoRMS records
in the data cleaning script
YesWorkflow annotation  
marking  start of  workflow  step.
End of  workflow  step.
Variables  serving  as  inputs and  
outputs to  workflow  step.
Calls  to  WoRMS service  
wrapper  functions.
YW rendering of script
Workflow  steps each
delimited  by  @BEGIN  
and  @END  annotations  
in  script.
Data flowing  into  and  out  of  
find_matching_worms_record
workflow  step.
YesWorkflow infers  connections  
between  workflow  steps  and  what  
data  flows  through  them  by  matching  
@IN  and @OUT  annotations.
#  @BEGIN   compose_cleaned_record
#  @IN  original_record
#  @IN  worms_lsid
#  @IN  updated_scientific_name
#  @IN  original_scientific_name
#  @IN  updated_authorship
#  @IN  original_authorship
#  @OUT  cleaned_record
cleaned_record =  original_record
cleaned_record['LSID']   =  worms_lsid
cleaned_record['WoRMsMatchResult']   =  worms_match_result
if  updated_scientific_name is  not  None:  
cleaned_record['scientificName']   =  updated_scientific_name
cleaned_record['originalScientificName']   =  original_scientific_name
if  updated_authorship is  not  None:
cleaned_record['scientificNameAuthorship']   =  updated_authorship
cleaned_record['originalAuthor']   =  original_authorship
#  @END  compose_cleaned_record
The compose_cleaned_record step in
the data cleaning script
A function for cleaning one record
def curate_taxon_name_and_author(self,   input_record):
#  look  up    record   for  input   taxon  name  in  WoRMS taxonomic   database
is_exact_match,   aphia_record =  (
self._worms.aphia_record_by_taxon_name(input_record['TaxonName']))
if  aphia_record is  not  None:
#  save  taxon   name  and  author   values   from  input  record   in  new  fields
input_record['OriginalName']   =  input_record['TaxonName']
input_record['OriginalAuthor']   =  input_record['Author']
#  replace   taxon  name  and  author   fields   in  input   record   with  values   in  aphia record
input_record['TaxonName']   =  aphia_record['scientificname']
input_record['Author']   =  aphia_record['authority']
#  add  new  fields
input_record['WoRMsExactMatch']   =  is_exact_match
input_record['lsid']   =  aphia_record['lsid’]
else:
input_record['OriginalName']   =  None
input_record['OriginalAuthor']   =  None
input_record['WoRMsExactMatch']   =  None
input_record['lsid']   =  None
return   input_record
Factoring  out  the  core  functionality  
of  a  script  into  a  reusable  function  
is  a  natural  step  in  script  evolution.
Declaring the function as an actor
-­‐ id: WoRMSNameCurator
type: PythonClassActor
properties:
pythonClass:   org.kurator.validation.actors.WoRMSCurator.WoRMSCurator
onData: curate_taxon_name_and_author
Actor  type  identifier
referenced  when  composing  a  
workflow  that  uses  the  actor.
Python  class declaring  
the  function  as  a  
method  (optional).
Name  of  Python  function called  
for  each  data  item  received  by  
the  actor  at  run  time.  
Besides  this  block  of  YAML,  no  additional  code  needs  
to  be  written  to  convert  the  function  into  an  actor.  
Simple workflow using the new actor
components:
-­‐ id: ReadInput
type: CsvFileReader
-­‐ id: CurateRecords
type: WoRMSNameCurator
properties:
listensTo:
-­‐ !ref  ReadInput
-­‐ id: WriteOutput
type: CsvFileWriter
properties:
listensTo:
-­‐ !ref  CurateRecords
-­‐ id: WoRMSNameValidationWorkflow
type: Workflow
properties:
actors:
-­‐ !ref  ReadInput
-­‐ !ref  CurateRecords
-­‐ !ref  WriteOutput
Actor  that  reads input  from  a  CSV
file.    Emits  records  one  at  a  time.
The  WoRMSNameCurator actor  
declared  on  the  previous  slide.    
Processes  each  received  record  in  turn.
Actor  that  writes received  records  
to  an  output  CSV file  one  by  one.
The  listensTo property  is  used  to  
declare  how  actors  are  connected.
Declaration  of  workflow  as  a  
composition  of  the  three  actors.
Actors  run  concurrently,  each  working  on  
different  records  at  the  same  time,  when  the  
workflow  is  executed  by  Kurator-­Akka.  
CsvFileReader
WoRMSNameCurator
CsvFileWriter
Correspondence of
steps in script to
actors in workflow
Running the workflow
$  ka -­‐f  WoRMS_name_validation.yaml <  WoRMS_name_validation_input.csv
ID,TaxonName,Author,OriginalName,OriginalAuthor,WoRMsExactMatch,lsid
37929,Architectonica   reevei,"(Hanley,   1862)",Architectonica
reevi,,false,urn:lsid:marinespecies.org:taxname:588206
37932,Rapana   rapiformis,"(Born,   1778)",Rapana rapiformis,"(Von   Born,  
1778)",true,urn:lsid:marinespecies.org:taxname:140415
180593,Buccinum   donomani,"(Linnaeus,   1758)",,,,
179963,Codakia   paytenorum,"(Iredale,   1937)",Codakia paytenorum,"Iredale,  
1937",true,urn:lsid:marinespecies.org:taxname:215841
0,Rissoa   venusta,"Garrett,   1873",Rissoa  
venusta,,true,urn:lsid:marinespecies.org:taxname:607233
62156,Rissoa   venusta,"Garrett,   1873",Rissoa       
venusta,Phil.,true,urn:lsid:marinespecies.org:taxname:607233
$
Workflow  can  be  run  
at  the  command  line.
Actors  can  be  created  from  simple  scripts,  and  workflows  can  be  
run  like  scripts.    Simple  YAML  files  wire  everything  together.
Actors  can  read  and  write  standard  
input  and  output  like  any  script.
The road ahead
The  immediate  future
§ YesWorkflow support  for  graphically  rendering Kurator-­Akka workflows.
§ Combining  prospective  and  retrospective  provenance  from  workflow
declarations  and YesWorkflow-­annotated  scripts used  as  actors.
§ Enhancements to  YAML  workflow  declarations  enabling  Akka support  for  
creating  and  managing  multiple  instances of  each  actor for  higher  
throughput.
§ Exploration  of  advanced  Akka features  in  FP-­Akka framework,  followed  
by  generalization    of  these  features  in  Kurator-­Akka so  that  all  users  can  
benefit.
Future  possibilities
§ Support  for  additional actor  scripting  languages (e.g.,  R).
§ Actor  function  wrappers  for  other  workflow and  dataflow  systems.
§ Graphical  user  interface for  composing  and  running  workflows.  
§ Your  suggestions?
Ad

More Related Content

What's hot (20)

Meetup ml spark_ppt
Meetup ml spark_pptMeetup ml spark_ppt
Meetup ml spark_ppt
Snehal Nagmote
 
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Spark Summit
 
Monitor Apache Spark 3 on Kubernetes using Metrics and Plugins
Monitor Apache Spark 3 on Kubernetes using Metrics and PluginsMonitor Apache Spark 3 on Kubernetes using Metrics and Plugins
Monitor Apache Spark 3 on Kubernetes using Metrics and Plugins
Databricks
 
Simplify and Boost Spark 3 Deployments with Hypervisor-Native Kubernetes
Simplify and Boost Spark 3 Deployments with Hypervisor-Native KubernetesSimplify and Boost Spark 3 Deployments with Hypervisor-Native Kubernetes
Simplify and Boost Spark 3 Deployments with Hypervisor-Native Kubernetes
Databricks
 
Productizing Structured Streaming Jobs
Productizing Structured Streaming JobsProductizing Structured Streaming Jobs
Productizing Structured Streaming Jobs
Databricks
 
Microservices, Containers, and Machine Learning
Microservices, Containers, and Machine LearningMicroservices, Containers, and Machine Learning
Microservices, Containers, and Machine Learning
Paco Nathan
 
Apache Flink Crash Course by Slim Baltagi and Srini Palthepu
Apache Flink Crash Course by Slim Baltagi and Srini PalthepuApache Flink Crash Course by Slim Baltagi and Srini Palthepu
Apache Flink Crash Course by Slim Baltagi and Srini Palthepu
Slim Baltagi
 
ROCm and Distributed Deep Learning on Spark and TensorFlow
ROCm and Distributed Deep Learning on Spark and TensorFlowROCm and Distributed Deep Learning on Spark and TensorFlow
ROCm and Distributed Deep Learning on Spark and TensorFlow
Databricks
 
Improvements to Flink & it's Applications in Alibaba Search
Improvements to Flink & it's Applications in Alibaba SearchImprovements to Flink & it's Applications in Alibaba Search
Improvements to Flink & it's Applications in Alibaba Search
DataWorks Summit/Hadoop Summit
 
Build a deep learning pipeline on apache spark for ads optimization
Build a deep learning pipeline on apache spark for ads optimizationBuild a deep learning pipeline on apache spark for ads optimization
Build a deep learning pipeline on apache spark for ads optimization
Craig Chao
 
OGCE TeraGrid 2010 Science Gateway Tutorial Intro
OGCE TeraGrid 2010 Science Gateway Tutorial IntroOGCE TeraGrid 2010 Science Gateway Tutorial Intro
OGCE TeraGrid 2010 Science Gateway Tutorial Intro
marpierc
 
Building Continuous Application with Structured Streaming and Real-Time Data ...
Building Continuous Application with Structured Streaming and Real-Time Data ...Building Continuous Application with Structured Streaming and Real-Time Data ...
Building Continuous Application with Structured Streaming and Real-Time Data ...
Databricks
 
Dataservices - Processing Big Data The Microservice Way
Dataservices - Processing Big Data The Microservice WayDataservices - Processing Big Data The Microservice Way
Dataservices - Processing Big Data The Microservice Way
Josef Adersberger
 
Apache Eagle: Architecture Evolvement and New Features
Apache Eagle: Architecture Evolvement and New FeaturesApache Eagle: Architecture Evolvement and New Features
Apache Eagle: Architecture Evolvement and New Features
Hao Chen
 
OGCE Project Overview
OGCE Project OverviewOGCE Project Overview
OGCE Project Overview
marpierc
 
Real-Time Log Analysis with Apache Mesos, Kafka and Cassandra
Real-Time Log Analysis with Apache Mesos, Kafka and CassandraReal-Time Log Analysis with Apache Mesos, Kafka and Cassandra
Real-Time Log Analysis with Apache Mesos, Kafka and Cassandra
Joe Stein
 
Java SE 8 & EE 7 Launch
Java SE 8 & EE 7 LaunchJava SE 8 & EE 7 Launch
Java SE 8 & EE 7 Launch
Digicomp Academy AG
 
Accumulo Summit 2015: Alternatives to Apache Accumulo's Java API [API]
Accumulo Summit 2015: Alternatives to Apache Accumulo's Java API [API]Accumulo Summit 2015: Alternatives to Apache Accumulo's Java API [API]
Accumulo Summit 2015: Alternatives to Apache Accumulo's Java API [API]
Accumulo Summit
 
Accumulo Summit 2015: Ambari and Accumulo: HDP 2.3 Upcoming Features [Sponsored]
Accumulo Summit 2015: Ambari and Accumulo: HDP 2.3 Upcoming Features [Sponsored]Accumulo Summit 2015: Ambari and Accumulo: HDP 2.3 Upcoming Features [Sponsored]
Accumulo Summit 2015: Ambari and Accumulo: HDP 2.3 Upcoming Features [Sponsored]
Accumulo Summit
 
Hadoop and Big Data Overview
Hadoop and Big Data OverviewHadoop and Big Data Overview
Hadoop and Big Data Overview
Prabhu Thukkaram
 
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Spark Summit
 
Monitor Apache Spark 3 on Kubernetes using Metrics and Plugins
Monitor Apache Spark 3 on Kubernetes using Metrics and PluginsMonitor Apache Spark 3 on Kubernetes using Metrics and Plugins
Monitor Apache Spark 3 on Kubernetes using Metrics and Plugins
Databricks
 
Simplify and Boost Spark 3 Deployments with Hypervisor-Native Kubernetes
Simplify and Boost Spark 3 Deployments with Hypervisor-Native KubernetesSimplify and Boost Spark 3 Deployments with Hypervisor-Native Kubernetes
Simplify and Boost Spark 3 Deployments with Hypervisor-Native Kubernetes
Databricks
 
Productizing Structured Streaming Jobs
Productizing Structured Streaming JobsProductizing Structured Streaming Jobs
Productizing Structured Streaming Jobs
Databricks
 
Microservices, Containers, and Machine Learning
Microservices, Containers, and Machine LearningMicroservices, Containers, and Machine Learning
Microservices, Containers, and Machine Learning
Paco Nathan
 
Apache Flink Crash Course by Slim Baltagi and Srini Palthepu
Apache Flink Crash Course by Slim Baltagi and Srini PalthepuApache Flink Crash Course by Slim Baltagi and Srini Palthepu
Apache Flink Crash Course by Slim Baltagi and Srini Palthepu
Slim Baltagi
 
ROCm and Distributed Deep Learning on Spark and TensorFlow
ROCm and Distributed Deep Learning on Spark and TensorFlowROCm and Distributed Deep Learning on Spark and TensorFlow
ROCm and Distributed Deep Learning on Spark and TensorFlow
Databricks
 
Improvements to Flink & it's Applications in Alibaba Search
Improvements to Flink & it's Applications in Alibaba SearchImprovements to Flink & it's Applications in Alibaba Search
Improvements to Flink & it's Applications in Alibaba Search
DataWorks Summit/Hadoop Summit
 
Build a deep learning pipeline on apache spark for ads optimization
Build a deep learning pipeline on apache spark for ads optimizationBuild a deep learning pipeline on apache spark for ads optimization
Build a deep learning pipeline on apache spark for ads optimization
Craig Chao
 
OGCE TeraGrid 2010 Science Gateway Tutorial Intro
OGCE TeraGrid 2010 Science Gateway Tutorial IntroOGCE TeraGrid 2010 Science Gateway Tutorial Intro
OGCE TeraGrid 2010 Science Gateway Tutorial Intro
marpierc
 
Building Continuous Application with Structured Streaming and Real-Time Data ...
Building Continuous Application with Structured Streaming and Real-Time Data ...Building Continuous Application with Structured Streaming and Real-Time Data ...
Building Continuous Application with Structured Streaming and Real-Time Data ...
Databricks
 
Dataservices - Processing Big Data The Microservice Way
Dataservices - Processing Big Data The Microservice WayDataservices - Processing Big Data The Microservice Way
Dataservices - Processing Big Data The Microservice Way
Josef Adersberger
 
Apache Eagle: Architecture Evolvement and New Features
Apache Eagle: Architecture Evolvement and New FeaturesApache Eagle: Architecture Evolvement and New Features
Apache Eagle: Architecture Evolvement and New Features
Hao Chen
 
OGCE Project Overview
OGCE Project OverviewOGCE Project Overview
OGCE Project Overview
marpierc
 
Real-Time Log Analysis with Apache Mesos, Kafka and Cassandra
Real-Time Log Analysis with Apache Mesos, Kafka and CassandraReal-Time Log Analysis with Apache Mesos, Kafka and Cassandra
Real-Time Log Analysis with Apache Mesos, Kafka and Cassandra
Joe Stein
 
Accumulo Summit 2015: Alternatives to Apache Accumulo's Java API [API]
Accumulo Summit 2015: Alternatives to Apache Accumulo's Java API [API]Accumulo Summit 2015: Alternatives to Apache Accumulo's Java API [API]
Accumulo Summit 2015: Alternatives to Apache Accumulo's Java API [API]
Accumulo Summit
 
Accumulo Summit 2015: Ambari and Accumulo: HDP 2.3 Upcoming Features [Sponsored]
Accumulo Summit 2015: Ambari and Accumulo: HDP 2.3 Upcoming Features [Sponsored]Accumulo Summit 2015: Ambari and Accumulo: HDP 2.3 Upcoming Features [Sponsored]
Accumulo Summit 2015: Ambari and Accumulo: HDP 2.3 Upcoming Features [Sponsored]
Accumulo Summit
 
Hadoop and Big Data Overview
Hadoop and Big Data OverviewHadoop and Big Data Overview
Hadoop and Big Data Overview
Prabhu Thukkaram
 

Similar to Data cleaning with the Kurator toolkit: Bridging the gap between conventional scripting and high-performance workflow automation (20)

XSEDE14 SciGaP-Apache Airavata Tutorial
XSEDE14 SciGaP-Apache Airavata TutorialXSEDE14 SciGaP-Apache Airavata Tutorial
XSEDE14 SciGaP-Apache Airavata Tutorial
marpierc
 
Advances in Scientific Workflow Environments
Advances in Scientific Workflow EnvironmentsAdvances in Scientific Workflow Environments
Advances in Scientific Workflow Environments
Carole Goble
 
Modern application development with oracle cloud sangam17
Modern application development with oracle cloud sangam17Modern application development with oracle cloud sangam17
Modern application development with oracle cloud sangam17
Vinay Kumar
 
Big Data Open Source Security LLC: Realtime log analysis with Mesos, Docker, ...
Big Data Open Source Security LLC: Realtime log analysis with Mesos, Docker, ...Big Data Open Source Security LLC: Realtime log analysis with Mesos, Docker, ...
Big Data Open Source Security LLC: Realtime log analysis with Mesos, Docker, ...
DataStax Academy
 
"Data Provenance: Principles and Why it matters for BioMedical Applications"
"Data Provenance: Principles and Why it matters for BioMedical Applications""Data Provenance: Principles and Why it matters for BioMedical Applications"
"Data Provenance: Principles and Why it matters for BioMedical Applications"
Pinar Alper
 
Streaming Processing with a Distributed Commit Log
Streaming Processing with a Distributed Commit LogStreaming Processing with a Distributed Commit Log
Streaming Processing with a Distributed Commit Log
Joe Stein
 
LarKC Tutorial at ISWC 2009 - Introduction
LarKC Tutorial at ISWC 2009 - IntroductionLarKC Tutorial at ISWC 2009 - Introduction
LarKC Tutorial at ISWC 2009 - Introduction
LarKC
 
resumePdf
resumePdfresumePdf
resumePdf
Amit Kumar
 
The Taverna Software Suite
The Taverna Software SuiteThe Taverna Software Suite
The Taverna Software Suite
myGrid team
 
A DevOps guide to Kubernetes
A DevOps guide to KubernetesA DevOps guide to Kubernetes
A DevOps guide to Kubernetes
Paul Czarkowski
 
Distributed tracing in OpenStack
Distributed tracing in OpenStackDistributed tracing in OpenStack
Distributed tracing in OpenStack
Ilya Shakhat
 
Secure Kafka at scale in true multi-tenant environment ( Vishnu Balusu & Asho...
Secure Kafka at scale in true multi-tenant environment ( Vishnu Balusu & Asho...Secure Kafka at scale in true multi-tenant environment ( Vishnu Balusu & Asho...
Secure Kafka at scale in true multi-tenant environment ( Vishnu Balusu & Asho...
confluent
 
NextGenML
NextGenML NextGenML
NextGenML
Moldovan Radu Adrian
 
Web Scale Reasoning and the LarKC Project
Web Scale Reasoning and the LarKC ProjectWeb Scale Reasoning and the LarKC Project
Web Scale Reasoning and the LarKC Project
Saltlux Inc.
 
TDC Connections 2023 - A High-Speed Data Ingestion Service in Java Using MQTT...
TDC Connections 2023 - A High-Speed Data Ingestion Service in Java Using MQTT...TDC Connections 2023 - A High-Speed Data Ingestion Service in Java Using MQTT...
TDC Connections 2023 - A High-Speed Data Ingestion Service in Java Using MQTT...
Juarez Junior
 
OpenStack Identity - Keystone (liberty) by Lorenzo Carnevale and Silvio Tavilla
OpenStack Identity - Keystone (liberty) by Lorenzo Carnevale and Silvio TavillaOpenStack Identity - Keystone (liberty) by Lorenzo Carnevale and Silvio Tavilla
OpenStack Identity - Keystone (liberty) by Lorenzo Carnevale and Silvio Tavilla
Lorenzo Carnevale
 
Innovate2014 Better Integrations Through Open Interfaces
Innovate2014 Better Integrations Through Open InterfacesInnovate2014 Better Integrations Through Open Interfaces
Innovate2014 Better Integrations Through Open Interfaces
Steve Speicher
 
Bluemix paas 기반 saas 개발 사례
Bluemix paas 기반 saas 개발 사례Bluemix paas 기반 saas 개발 사례
Bluemix paas 기반 saas 개발 사례
uEngine Solutions
 
Building and deploying LLM applications with Apache Airflow
Building and deploying LLM applications with Apache AirflowBuilding and deploying LLM applications with Apache Airflow
Building and deploying LLM applications with Apache Airflow
Kaxil Naik
 
Distributed & Highly Available server applications in Java and Scala
Distributed & Highly Available server applications in Java and ScalaDistributed & Highly Available server applications in Java and Scala
Distributed & Highly Available server applications in Java and Scala
Max Alexejev
 
XSEDE14 SciGaP-Apache Airavata Tutorial
XSEDE14 SciGaP-Apache Airavata TutorialXSEDE14 SciGaP-Apache Airavata Tutorial
XSEDE14 SciGaP-Apache Airavata Tutorial
marpierc
 
Advances in Scientific Workflow Environments
Advances in Scientific Workflow EnvironmentsAdvances in Scientific Workflow Environments
Advances in Scientific Workflow Environments
Carole Goble
 
Modern application development with oracle cloud sangam17
Modern application development with oracle cloud sangam17Modern application development with oracle cloud sangam17
Modern application development with oracle cloud sangam17
Vinay Kumar
 
Big Data Open Source Security LLC: Realtime log analysis with Mesos, Docker, ...
Big Data Open Source Security LLC: Realtime log analysis with Mesos, Docker, ...Big Data Open Source Security LLC: Realtime log analysis with Mesos, Docker, ...
Big Data Open Source Security LLC: Realtime log analysis with Mesos, Docker, ...
DataStax Academy
 
"Data Provenance: Principles and Why it matters for BioMedical Applications"
"Data Provenance: Principles and Why it matters for BioMedical Applications""Data Provenance: Principles and Why it matters for BioMedical Applications"
"Data Provenance: Principles and Why it matters for BioMedical Applications"
Pinar Alper
 
Streaming Processing with a Distributed Commit Log
Streaming Processing with a Distributed Commit LogStreaming Processing with a Distributed Commit Log
Streaming Processing with a Distributed Commit Log
Joe Stein
 
LarKC Tutorial at ISWC 2009 - Introduction
LarKC Tutorial at ISWC 2009 - IntroductionLarKC Tutorial at ISWC 2009 - Introduction
LarKC Tutorial at ISWC 2009 - Introduction
LarKC
 
The Taverna Software Suite
The Taverna Software SuiteThe Taverna Software Suite
The Taverna Software Suite
myGrid team
 
A DevOps guide to Kubernetes
A DevOps guide to KubernetesA DevOps guide to Kubernetes
A DevOps guide to Kubernetes
Paul Czarkowski
 
Distributed tracing in OpenStack
Distributed tracing in OpenStackDistributed tracing in OpenStack
Distributed tracing in OpenStack
Ilya Shakhat
 
Secure Kafka at scale in true multi-tenant environment ( Vishnu Balusu & Asho...
Secure Kafka at scale in true multi-tenant environment ( Vishnu Balusu & Asho...Secure Kafka at scale in true multi-tenant environment ( Vishnu Balusu & Asho...
Secure Kafka at scale in true multi-tenant environment ( Vishnu Balusu & Asho...
confluent
 
Web Scale Reasoning and the LarKC Project
Web Scale Reasoning and the LarKC ProjectWeb Scale Reasoning and the LarKC Project
Web Scale Reasoning and the LarKC Project
Saltlux Inc.
 
TDC Connections 2023 - A High-Speed Data Ingestion Service in Java Using MQTT...
TDC Connections 2023 - A High-Speed Data Ingestion Service in Java Using MQTT...TDC Connections 2023 - A High-Speed Data Ingestion Service in Java Using MQTT...
TDC Connections 2023 - A High-Speed Data Ingestion Service in Java Using MQTT...
Juarez Junior
 
OpenStack Identity - Keystone (liberty) by Lorenzo Carnevale and Silvio Tavilla
OpenStack Identity - Keystone (liberty) by Lorenzo Carnevale and Silvio TavillaOpenStack Identity - Keystone (liberty) by Lorenzo Carnevale and Silvio Tavilla
OpenStack Identity - Keystone (liberty) by Lorenzo Carnevale and Silvio Tavilla
Lorenzo Carnevale
 
Innovate2014 Better Integrations Through Open Interfaces
Innovate2014 Better Integrations Through Open InterfacesInnovate2014 Better Integrations Through Open Interfaces
Innovate2014 Better Integrations Through Open Interfaces
Steve Speicher
 
Bluemix paas 기반 saas 개발 사례
Bluemix paas 기반 saas 개발 사례Bluemix paas 기반 saas 개발 사례
Bluemix paas 기반 saas 개발 사례
uEngine Solutions
 
Building and deploying LLM applications with Apache Airflow
Building and deploying LLM applications with Apache AirflowBuilding and deploying LLM applications with Apache Airflow
Building and deploying LLM applications with Apache Airflow
Kaxil Naik
 
Distributed & Highly Available server applications in Java and Scala
Distributed & Highly Available server applications in Java and ScalaDistributed & Highly Available server applications in Java and Scala
Distributed & Highly Available server applications in Java and Scala
Max Alexejev
 
Ad

Recently uploaded (20)

Microbial Genetics for Advanced Genetics
Microbial Genetics for Advanced GeneticsMicrobial Genetics for Advanced Genetics
Microbial Genetics for Advanced Genetics
Enoch Caryl Taclan
 
when is CT scan need in breast cancer patient.pptx
when is CT scan need in breast cancer patient.pptxwhen is CT scan need in breast cancer patient.pptx
when is CT scan need in breast cancer patient.pptx
Rukhnuddin Al-daudar
 
RAPID DIAGNOSTIC TEST (RDT) overviewppt.pptx
RAPID DIAGNOSTIC TEST (RDT)  overviewppt.pptxRAPID DIAGNOSTIC TEST (RDT)  overviewppt.pptx
RAPID DIAGNOSTIC TEST (RDT) overviewppt.pptx
nietakam
 
Multydisciplinary Nature of Environmental Studies
Multydisciplinary Nature of Environmental StudiesMultydisciplinary Nature of Environmental Studies
Multydisciplinary Nature of Environmental Studies
Ashokrao Mane college of Pharmacy Peth-Vadgaon
 
whole ANATOMY OF EYE with eye ball .pptx
whole ANATOMY OF EYE with eye ball .pptxwhole ANATOMY OF EYE with eye ball .pptx
whole ANATOMY OF EYE with eye ball .pptx
simranjangra13
 
Quiz 3 Basic Nutrition 1ST Yearcmcmc.pptx
Quiz 3 Basic Nutrition 1ST Yearcmcmc.pptxQuiz 3 Basic Nutrition 1ST Yearcmcmc.pptx
Quiz 3 Basic Nutrition 1ST Yearcmcmc.pptx
NutriGen
 
Polymerase Chain Reaction (PCR).Poer Pint
Polymerase Chain Reaction (PCR).Poer PintPolymerase Chain Reaction (PCR).Poer Pint
Polymerase Chain Reaction (PCR).Poer Pint
Dr Showkat Ahmad Wani
 
DIGESTIVE SYSTEM (Overview of Metabolism)
DIGESTIVE SYSTEM (Overview of Metabolism)DIGESTIVE SYSTEM (Overview of Metabolism)
DIGESTIVE SYSTEM (Overview of Metabolism)
Elvis K. Goodridge
 
Influenza-Understanding-the-Deadly-Virus.pptx
Influenza-Understanding-the-Deadly-Virus.pptxInfluenza-Understanding-the-Deadly-Virus.pptx
Influenza-Understanding-the-Deadly-Virus.pptx
diyapadhiyar
 
Parallel resonance circuits of science.pdf
Parallel resonance circuits of science.pdfParallel resonance circuits of science.pdf
Parallel resonance circuits of science.pdf
rk5867336912
 
06-Molecular basis of transformation.pptx
06-Molecular basis of transformation.pptx06-Molecular basis of transformation.pptx
06-Molecular basis of transformation.pptx
LanaQadumii
 
4. Chapter 4 - FINAL Promoting Inclusive Culture (2).pdf
4. Chapter 4 - FINAL Promoting Inclusive Culture (2).pdf4. Chapter 4 - FINAL Promoting Inclusive Culture (2).pdf
4. Chapter 4 - FINAL Promoting Inclusive Culture (2).pdf
abayamargaug
 
Application of Microbiology- Industrial, agricultural, medical
Application of Microbiology- Industrial, agricultural, medicalApplication of Microbiology- Industrial, agricultural, medical
Application of Microbiology- Industrial, agricultural, medical
Anoja Kurian
 
Gel Electrophorosis, A Practical Lecture.pptx
Gel Electrophorosis, A Practical Lecture.pptxGel Electrophorosis, A Practical Lecture.pptx
Gel Electrophorosis, A Practical Lecture.pptx
Dr Showkat Ahmad Wani
 
Chapter 4_Part 2_Infection and Immunity.ppt
Chapter 4_Part 2_Infection and Immunity.pptChapter 4_Part 2_Infection and Immunity.ppt
Chapter 4_Part 2_Infection and Immunity.ppt
JessaBalanggoyPagula
 
Causes of mortalities of eggs and spawn and remedies.pptx
Causes of mortalities of eggs and spawn and remedies.pptxCauses of mortalities of eggs and spawn and remedies.pptx
Causes of mortalities of eggs and spawn and remedies.pptx
anshumanmohanty9090
 
STR Analysis and DNA Typing in Forensic Science: Techniques, Steps & Applicat...
STR Analysis and DNA Typing in Forensic Science: Techniques, Steps & Applicat...STR Analysis and DNA Typing in Forensic Science: Techniques, Steps & Applicat...
STR Analysis and DNA Typing in Forensic Science: Techniques, Steps & Applicat...
home
 
DNA Profiling and STR Typing in Forensics: From Molecular Techniques to Real-...
DNA Profiling and STR Typing in Forensics: From Molecular Techniques to Real-...DNA Profiling and STR Typing in Forensics: From Molecular Techniques to Real-...
DNA Profiling and STR Typing in Forensics: From Molecular Techniques to Real-...
home
 
Effect of nutrition in Entomophagous Insectson
Effect of nutrition in Entomophagous InsectsonEffect of nutrition in Entomophagous Insectson
Effect of nutrition in Entomophagous Insectson
JabaskumarKshetri
 
Nutritional Diseases in poultry.........
Nutritional Diseases in poultry.........Nutritional Diseases in poultry.........
Nutritional Diseases in poultry.........
Bangladesh Agricultural University,Mymemsingh
 
Microbial Genetics for Advanced Genetics
Microbial Genetics for Advanced GeneticsMicrobial Genetics for Advanced Genetics
Microbial Genetics for Advanced Genetics
Enoch Caryl Taclan
 
when is CT scan need in breast cancer patient.pptx
when is CT scan need in breast cancer patient.pptxwhen is CT scan need in breast cancer patient.pptx
when is CT scan need in breast cancer patient.pptx
Rukhnuddin Al-daudar
 
RAPID DIAGNOSTIC TEST (RDT) overviewppt.pptx
RAPID DIAGNOSTIC TEST (RDT)  overviewppt.pptxRAPID DIAGNOSTIC TEST (RDT)  overviewppt.pptx
RAPID DIAGNOSTIC TEST (RDT) overviewppt.pptx
nietakam
 
whole ANATOMY OF EYE with eye ball .pptx
whole ANATOMY OF EYE with eye ball .pptxwhole ANATOMY OF EYE with eye ball .pptx
whole ANATOMY OF EYE with eye ball .pptx
simranjangra13
 
Quiz 3 Basic Nutrition 1ST Yearcmcmc.pptx
Quiz 3 Basic Nutrition 1ST Yearcmcmc.pptxQuiz 3 Basic Nutrition 1ST Yearcmcmc.pptx
Quiz 3 Basic Nutrition 1ST Yearcmcmc.pptx
NutriGen
 
Polymerase Chain Reaction (PCR).Poer Pint
Polymerase Chain Reaction (PCR).Poer PintPolymerase Chain Reaction (PCR).Poer Pint
Polymerase Chain Reaction (PCR).Poer Pint
Dr Showkat Ahmad Wani
 
DIGESTIVE SYSTEM (Overview of Metabolism)
DIGESTIVE SYSTEM (Overview of Metabolism)DIGESTIVE SYSTEM (Overview of Metabolism)
DIGESTIVE SYSTEM (Overview of Metabolism)
Elvis K. Goodridge
 
Influenza-Understanding-the-Deadly-Virus.pptx
Influenza-Understanding-the-Deadly-Virus.pptxInfluenza-Understanding-the-Deadly-Virus.pptx
Influenza-Understanding-the-Deadly-Virus.pptx
diyapadhiyar
 
Parallel resonance circuits of science.pdf
Parallel resonance circuits of science.pdfParallel resonance circuits of science.pdf
Parallel resonance circuits of science.pdf
rk5867336912
 
06-Molecular basis of transformation.pptx
06-Molecular basis of transformation.pptx06-Molecular basis of transformation.pptx
06-Molecular basis of transformation.pptx
LanaQadumii
 
4. Chapter 4 - FINAL Promoting Inclusive Culture (2).pdf
4. Chapter 4 - FINAL Promoting Inclusive Culture (2).pdf4. Chapter 4 - FINAL Promoting Inclusive Culture (2).pdf
4. Chapter 4 - FINAL Promoting Inclusive Culture (2).pdf
abayamargaug
 
Application of Microbiology- Industrial, agricultural, medical
Application of Microbiology- Industrial, agricultural, medicalApplication of Microbiology- Industrial, agricultural, medical
Application of Microbiology- Industrial, agricultural, medical
Anoja Kurian
 
Gel Electrophorosis, A Practical Lecture.pptx
Gel Electrophorosis, A Practical Lecture.pptxGel Electrophorosis, A Practical Lecture.pptx
Gel Electrophorosis, A Practical Lecture.pptx
Dr Showkat Ahmad Wani
 
Chapter 4_Part 2_Infection and Immunity.ppt
Chapter 4_Part 2_Infection and Immunity.pptChapter 4_Part 2_Infection and Immunity.ppt
Chapter 4_Part 2_Infection and Immunity.ppt
JessaBalanggoyPagula
 
Causes of mortalities of eggs and spawn and remedies.pptx
Causes of mortalities of eggs and spawn and remedies.pptxCauses of mortalities of eggs and spawn and remedies.pptx
Causes of mortalities of eggs and spawn and remedies.pptx
anshumanmohanty9090
 
STR Analysis and DNA Typing in Forensic Science: Techniques, Steps & Applicat...
STR Analysis and DNA Typing in Forensic Science: Techniques, Steps & Applicat...STR Analysis and DNA Typing in Forensic Science: Techniques, Steps & Applicat...
STR Analysis and DNA Typing in Forensic Science: Techniques, Steps & Applicat...
home
 
DNA Profiling and STR Typing in Forensics: From Molecular Techniques to Real-...
DNA Profiling and STR Typing in Forensics: From Molecular Techniques to Real-...DNA Profiling and STR Typing in Forensics: From Molecular Techniques to Real-...
DNA Profiling and STR Typing in Forensics: From Molecular Techniques to Real-...
home
 
Effect of nutrition in Entomophagous Insectson
Effect of nutrition in Entomophagous InsectsonEffect of nutrition in Entomophagous Insectson
Effect of nutrition in Entomophagous Insectson
JabaskumarKshetri
 
Ad

Data cleaning with the Kurator toolkit: Bridging the gap between conventional scripting and high-performance workflow automation

  • 1. Data cleaning with the Kurator toolkit Bridging the gap between conventional scripting and high-performance workflow automation Timothy McPhillips, David Lowery, James Hanken, Bertram Ludäscher, James A. Macklin, Paul J. Morris, Robert A. Morris, Tianhong Song, and John Wieczorek TDWG 2015 - Biodiversity Informatics Services and Workflows Symposium Nairobi, Kenya - September 30, 2015
  • 2. Kurator: workflow automation for cleaning biodiversity data Project  aims § Facilitate cleaning of  biodiversity  data. § Support both  traditional scripting  and   high-­performance  scientific  workflows. § Deliver  much  more  than a  fixed  set  of   configurable  workflows. Technical  strategy § Wrap  Akka actor toolkit in  a  curation-­oriented   scientific  workflow language and  runtime. § Enable  scientists  who  write  scripts  to  add their   own  new data  validation  and  cleaning  actors.   § Bring  to  scripts  major  advantages  of  workflow   automation:    prospective and  retrospective   provenance. § Bridge gaps  between  data  validation  services,   data  cleaning  scripts,  and  pipelined  data  curation   workflows. Empower  users  and  developers  of  scripts,  actors,  and  workflows.
  • 3. What some of us think of when we hear the term ‘scientific workflows’ Phylogenetics workflow  in  Kepler (2005) Graphical  interface § Canvas  for  assembling  and   displaying  the  workflow. § Library  of  workflow  blocks   (‘actors’)  that  can  be   dragged  onto  the  canvas   and  connected. § Arrows  that  represent   control  dependencies  or   paths  of  data  flow. § A  run  button. These  features  are  not   essential  to  managing   actual  scientific   workflows.
  • 4. 10 essential functions of a scientific workflow system 1. Automate  programs  and  services  scientists  already  use.   2. Schedule invocations  of  programs  and  services  correctly  and  efficiently   – in  parallel  where  possible. 3. Manage  data  flow  to,  from,  and  between  programs  and  services. 4. Enable scientists  (not  just  developers)  to  author  or  modify workflows  easily. 5. Predict what  a  workflow  will  do  when  executed:  prospective  provenance. 6. Record what  actually  happens  during  workflow  execution. 7. Reveal retrospective provenance – how  workflow  products  were  derived   from  inputs  via  programs  and  services. 8. Organize intermediate  and  final  data  products  as  desired  by  users. 9. Enable scientists  to  version,  share and  publish their  workflows.   10. Empower  scientists who  wish  to  automate additional  programs  and   services  themselves. These  functions–not  actors—distinguish  scientific  workflow   automation from  general  scientific  software  development.
  • 5. Why build yet another system? Available  systems § Kepler  (Ptolemy  II),  Taverna,  VisTrails… § Familiar  graphical  programming   environments.   Limitations § Often  little  support for  organizing intermediate  and  final  data products  in   ways  familiar  to  scientists. § Professional  software  developers frequently  are  needed to  develop  new   components or  workflows. Huge  gap  between  how  these  systems   are  used  and  how  scientists  already   automate  their  analyses—via  scripting. Part  of  a  Kepler  workflow  for   inferring  phylogenetic   trees  from   protein  sequences.
  • 6. Avoiding actor ‘overuse injuries’ Overuse  the  actor  paradigm… § In  many  systems  workflows  can  be  reused  as   actors,  or  ‘subworkflows’  in  other  workflows.   § This  is  a  necessary  abstraction  when   workflow  systems  are  not  well-­integrated   with  scripting  languages. § Each  actor  at  left  is  a  page  of  Java  code.    But   this  whole  ‘subworkflow’  could  be  written  in  a   half  a  page  of  Python! …or  use  the  right  tools  for  the  right  job! § In  Kurator we  want  to  enable  the  actor abstraction where  it  pays  off  the  most—as   the  unit  of  parallelism. § For  specifying  behavior  inside  of  actors  why   not  use  easy  to  understand  scripts? New  actors  and  workflows  must be   easy  and  fast  to  develop. Part  of  a  Kepler  workflow  for   inferring  phylogenetic   trees  from   protein  sequences.
  • 7. Data curation workflow using Kepler FilteredPush explored  using  workflows  for  data  cleaning § First  used  COMAD workflow  model  supported  by  Kepler. § Enabled  graphical assembly  and  configuration  of  custom workflows from  library of  actors. Highlighted  potential  performance  limitations  of  workflow  engines.
  • 8. FP-­Akka workflows load  data check  scientific  name check  basis  of  record check  date  collected check  lat/long write  out  results § FilteredPush next  investigated  use  of     the  Akka actor  toolkit and  platform. § Widely  used in  industry,  well   supported,  and  rapidly  advancing. § Efficient  parallel  execution across   multiple  processors  and  compute  nodes. § Improved  performance and  scalability compared  to  Kepler.   Limitations  of  directly using  Akka § Writing  new  Akka actors  and  programs   requires  Java (or  Scala)  experience. § Must  address  many  parallel     programming  challenges from  scratch.   Advanced  programming  skills   required  to  write  Akka programs   (‘workflows’)  that  run  correctly.
  • 9. Akka partly supports two essential workflow platform requirements The  Kurator toolkit  will  satisfy  the  rest  as  needed.   1. Automate  programs  and  services  scientists  already  use.   2. Schedule  invocations  of  programs  and  services  correctly  and   efficiently–in  parallel  where  possible. 3. Manage  data  flow  to,  from,  and  between  programs  and  services. 4. Enable  scientists  (not  just  developers)  to  author  or  modify  workflows  easily. 5. Predict  what  a  workflow  will  do  when  executed:  prospective  provenance. 6. Record  what  actually  happens  during  workflow  execution. 7. Reveal  retrospective  provenance  – how  workflow  products  were  derived  from   inputs  via  programs  and  services. 8. Organize  intermediate  and  final  data  products  as  desired  by  users. 9. Enable  scientists  to  version,  share  and  publish  their  workflows.   10. Empower  scientists  who  wish  to  automate  additional  programs  and  services   themselves.
  • 10. The Kurator Toolkit YesWorkflow (YW) § Add  YW  annotations  to  any  script  or  program   that  supports  text  comments.    Highlight  the   workflow  structure  in  the  script. § Visualize or  query  prospective provenance before running  a  script. § Reconstruct,  visualize,  and  query  retrospective provenance after running  the  script. § Integrate provenance gathered  from  file  names   and  paths,  log  files,  data  file  headers,  run   metadata,  and  records of  run-­time  events. Kurator-­Akka § Write functions or  classes  in  Python or  Java.    Mark  up  with  YesWorkflow. § Declare how  scripts  or  Java  code  can  be  used  as  actors.    Short  YAML blocks. § Declare workflows.    List  actors  and  specify  their  connections.    More  YAML. § Run workflow.    Use  Akka for  parallelization  transparently—and  correctly. § Reconstruct retrospective  provenance of  workflow  products.
  • 11. Example: data validation and cleaning using WoRMS web services 1) Write simple  Python functions  (or  a  class)  that  wrap the  web  services  provided   by  the  World  Register  of  Marine  Species  (WoRMS). 2) Develop a  Python script  that  uses  the  service  wrapper  functions  from  (1)  to   clean  a  set  of  records provided  in  CSV  format. 3) Mark  up  the  script written  in  (2)  with  YesWorkflow annotations   and   graphically  display the  script  as  a  workflow. 4) Factor  out  of  (2) a  Python function  that  can  clean a  single  record  using  the   WoRMS wrapper  functions  in  (1). 5) Write a  block  of  YAML that  declares how  the  record-­cleaning  function  in  (4)  can   be  used  as  an  actor in  a  Kurator-­Akka workflow. 6) Declare using  YAML a  workflow that  uses  the  actor  declared  in  (5)  along  with   actors  for  reading  and  writing  CSV  files. 7) Run the  workflow (6)  on  a  sample  data  set  with  the  CSV  reader,  CSV  writer   and  WoRMS validation  actors  all  running  in  parallel. README  at  https://github.com/kurator-­org/kurator-­validation  shows  how  to: Illustrates  how  Kurator aims  to  facilitate  composition  of  actors  and   high-­performance  workflows  from  simple  functions  and  scripts.
  • 12. class   WoRMSService(object):   """ Class  for  accessing   the  WoRMS taxonomic   name  database   via  the  AphiaNameService. The  Aphia names   services   are  described   at  http://marinespecies.org/aphia.php?p=soap.   """ WORMS_APHIA_NAME_SERVICE_URL   =  'http://marinespecies.org/aphia.php?p=soap&wsdl=1’ def __init__(self,   marine_only=False): """  Initialize   a  SOAP  client   using  the  WSDL  for  the  WoRMS Aphia names  service"""                 self._client =  Client(self.WORMS_APHIA_NAME_SERVICE_URL) self._marine_only =  marine_only def aphia_record_by_exact_taxon_name(self,   name): """ Perform   an  exact   match  on  the  input   name  against   the  taxon   names  in  WoRMS. This   function   first   invokes   an  Aphia names   service   to  lookup   the  Aphia ID  for the  taxon   name.    If  exactly   one  match   is  returned,   this  function   retrieves   the Aphia record   for  that  ID  and  returns   it.   """                 aphia_id =  self._client.service.getAphiaID(name,   self._marine_only); if  aphia_id is  None  or  aphia_id ==  -­‐999:                  #  -­‐999  indicates   multiple   matches return   None else: return   self._client.service.getAphiaRecordByID(aphia_id) def aphia_record_by_fuzzy_taxon_name(self,   name): : : Python class wrapping WoRMS services WoRMS services
  • 13. #  @BEGIN   find_matching_worms_record #  @IN  original_scientific_name #  @OUT  matching_worms_record #  @OUT  worms_lsid worms_match_result =  None worms_lsid =  None #  first  try  exact  match   of  the  scientific   name  against   WoRMS matching_worms_record =  worms.aphia_record_by_exact_taxon_name(original_scientific_name) if  matching_worms_record is  not  None: worms_match_result =  'exact' #  otherwise   try  a  fuzzy   match else: matching_worms_record =  worms.aphia_record_by_fuzzy_taxon_name(original_scientific_name) if  matching_worms_record is  not  None: worms_match_result =  'fuzzy’ #  if  either   match  succeeds   extract   the  LSID   for  the  taxon if  matching_worms_record is  not  None: worms_lsid =  matching_worms_record['lsid'] #  @END  find_matching_worms_record Finding matching WoRMS records in the data cleaning script YesWorkflow annotation   marking  start of  workflow  step. End of  workflow  step. Variables  serving  as  inputs and   outputs to  workflow  step. Calls  to  WoRMS service   wrapper  functions.
  • 14. YW rendering of script Workflow  steps each delimited  by  @BEGIN   and  @END  annotations   in  script. Data flowing  into  and  out  of   find_matching_worms_record workflow  step. YesWorkflow infers  connections   between  workflow  steps  and  what   data  flows  through  them  by  matching   @IN  and @OUT  annotations.
  • 15. #  @BEGIN   compose_cleaned_record #  @IN  original_record #  @IN  worms_lsid #  @IN  updated_scientific_name #  @IN  original_scientific_name #  @IN  updated_authorship #  @IN  original_authorship #  @OUT  cleaned_record cleaned_record =  original_record cleaned_record['LSID']   =  worms_lsid cleaned_record['WoRMsMatchResult']   =  worms_match_result if  updated_scientific_name is  not  None:   cleaned_record['scientificName']   =  updated_scientific_name cleaned_record['originalScientificName']   =  original_scientific_name if  updated_authorship is  not  None: cleaned_record['scientificNameAuthorship']   =  updated_authorship cleaned_record['originalAuthor']   =  original_authorship #  @END  compose_cleaned_record The compose_cleaned_record step in the data cleaning script
  • 16. A function for cleaning one record def curate_taxon_name_and_author(self,   input_record): #  look  up    record   for  input   taxon  name  in  WoRMS taxonomic   database is_exact_match,   aphia_record =  ( self._worms.aphia_record_by_taxon_name(input_record['TaxonName'])) if  aphia_record is  not  None: #  save  taxon   name  and  author   values   from  input  record   in  new  fields input_record['OriginalName']   =  input_record['TaxonName'] input_record['OriginalAuthor']   =  input_record['Author'] #  replace   taxon  name  and  author   fields   in  input   record   with  values   in  aphia record input_record['TaxonName']   =  aphia_record['scientificname'] input_record['Author']   =  aphia_record['authority'] #  add  new  fields input_record['WoRMsExactMatch']   =  is_exact_match input_record['lsid']   =  aphia_record['lsid’] else: input_record['OriginalName']   =  None input_record['OriginalAuthor']   =  None input_record['WoRMsExactMatch']   =  None input_record['lsid']   =  None return   input_record Factoring  out  the  core  functionality   of  a  script  into  a  reusable  function   is  a  natural  step  in  script  evolution.
  • 17. Declaring the function as an actor -­‐ id: WoRMSNameCurator type: PythonClassActor properties: pythonClass:   org.kurator.validation.actors.WoRMSCurator.WoRMSCurator onData: curate_taxon_name_and_author Actor  type  identifier referenced  when  composing  a   workflow  that  uses  the  actor. Python  class declaring   the  function  as  a   method  (optional). Name  of  Python  function called   for  each  data  item  received  by   the  actor  at  run  time.   Besides  this  block  of  YAML,  no  additional  code  needs   to  be  written  to  convert  the  function  into  an  actor.  
  • 18. Simple workflow using the new actor components: -­‐ id: ReadInput type: CsvFileReader -­‐ id: CurateRecords type: WoRMSNameCurator properties: listensTo: -­‐ !ref  ReadInput -­‐ id: WriteOutput type: CsvFileWriter properties: listensTo: -­‐ !ref  CurateRecords -­‐ id: WoRMSNameValidationWorkflow type: Workflow properties: actors: -­‐ !ref  ReadInput -­‐ !ref  CurateRecords -­‐ !ref  WriteOutput Actor  that  reads input  from  a  CSV file.    Emits  records  one  at  a  time. The  WoRMSNameCurator actor   declared  on  the  previous  slide.     Processes  each  received  record  in  turn. Actor  that  writes received  records   to  an  output  CSV file  one  by  one. The  listensTo property  is  used  to   declare  how  actors  are  connected. Declaration  of  workflow  as  a   composition  of  the  three  actors. Actors  run  concurrently,  each  working  on   different  records  at  the  same  time,  when  the   workflow  is  executed  by  Kurator-­Akka.  
  • 20. Running the workflow $  ka -­‐f  WoRMS_name_validation.yaml <  WoRMS_name_validation_input.csv ID,TaxonName,Author,OriginalName,OriginalAuthor,WoRMsExactMatch,lsid 37929,Architectonica   reevei,"(Hanley,   1862)",Architectonica reevi,,false,urn:lsid:marinespecies.org:taxname:588206 37932,Rapana   rapiformis,"(Born,   1778)",Rapana rapiformis,"(Von   Born,   1778)",true,urn:lsid:marinespecies.org:taxname:140415 180593,Buccinum   donomani,"(Linnaeus,   1758)",,,, 179963,Codakia   paytenorum,"(Iredale,   1937)",Codakia paytenorum,"Iredale,   1937",true,urn:lsid:marinespecies.org:taxname:215841 0,Rissoa   venusta,"Garrett,   1873",Rissoa   venusta,,true,urn:lsid:marinespecies.org:taxname:607233 62156,Rissoa   venusta,"Garrett,   1873",Rissoa       venusta,Phil.,true,urn:lsid:marinespecies.org:taxname:607233 $ Workflow  can  be  run   at  the  command  line. Actors  can  be  created  from  simple  scripts,  and  workflows  can  be   run  like  scripts.    Simple  YAML  files  wire  everything  together. Actors  can  read  and  write  standard   input  and  output  like  any  script.
  • 21. The road ahead The  immediate  future § YesWorkflow support  for  graphically  rendering Kurator-­Akka workflows. § Combining  prospective  and  retrospective  provenance  from  workflow declarations  and YesWorkflow-­annotated  scripts used  as  actors. § Enhancements to  YAML  workflow  declarations  enabling  Akka support  for   creating  and  managing  multiple  instances of  each  actor for  higher   throughput. § Exploration  of  advanced  Akka features  in  FP-­Akka framework,  followed   by  generalization    of  these  features  in  Kurator-­Akka so  that  all  users  can   benefit. Future  possibilities § Support  for  additional actor  scripting  languages (e.g.,  R). § Actor  function  wrappers  for  other  workflow and  dataflow  systems. § Graphical  user  interface for  composing  and  running  workflows.   § Your  suggestions?