Exploration of Workflow Management Systems

Abstract—There has been a recent emergence of new workflow applications focused on data analytics and machine learning. This emergence has precipitated a change in the workflow management landscape, causing the development of new data-oriented workflow management systems (WMSs) in addition to the earlier standard of task-oriented WMSs. In this paper, we summarize three general workflow use-cases and explore the unique requirements of each use-case in order to understand how WMSs from both workflow management models meet the requirements of each workflow use-case from the user's perspective. We analyze the applicability of the two models by carefully describing each model and by providing an examination of the different variations of WMSs that fall under the task-driven model. To illustrate the strengths and weaknesses of each workflow management model, we summarize the key features of four production-ready WMSs: Pegasus, Makeflow, Apache Airflow, and Pachyderm. To deepen our analysis of the four WMSs examined in this paper, we implement three real-world use-cases to highlight the specifications and features of each WMS. We present our final assessment of each WMS after considering the following factors: usability, performance, ease of deployment, and relevance. The purpose of this work is to offer insights from the user's perspective into the research challenges that WMSs currently face due to the evolving workflow landscape.

Index Terms—Scientific workflow, Workflow Management System, Task-driven, Data-driven.

I. INTRODUCTION

In the last two decades, scientific workflows have become mainstream thanks to their ability to empower scientific discoveries in virtually all fields of science [1]. During this time, key engineering challenges have been solved and a rich set of abstractions and interoperable software implementations have been developed [2]. These advancements have allowed scientists across various fields to begin reaping the benefits of workflow systems [3]. Traditionally, scientific workflows are described as directed acyclic graphs (DAGs), in which nodes represent computational tasks and edges represent the dependencies of those tasks [4]. The traditional approach for orchestrating DAG-based workflows is to use task-based scheduling algorithms that spawn tasks for execution once their dependencies are satisfied. To keep up with increased computing requirements, workflow systems have developed mechanisms to manage the distribution and execution of tasks on and across varied computing infrastructures such as local servers, campus computing clusters, high-performance computing resources (such as XSEDE [5]), and even popular cloud computing platforms. These developments in workflow management are of primary concern to the field of scientific computing, where scientists often run complex pipelines that scale over hundreds and thousands of tasks [2].

There are four general workflow system use-cases that have been identified [3]:
• Traditional scientific compute workflows, as discussed above;
• Data analytics workflows (including big data and machine learning);
• Sensor and Internet-of-Things (IoT) workflows; and
• Commercial, developer, and business-related workflows.

One of the major trends among scientific applications recently concerns big data analytics and machine learning [4]. These data-oriented workflows pose different challenges when compared to traditional workflow structures, and they often require special features such as data provenance, data reproducibility, and special data ingestion features.

A second type of data-oriented workflow that is gaining prominence in the scientific field is sensor-based workflows. Such workflows process data in a continuous fashion, with data being ingested in near real-time from distributed sources (e.g., sensors that stream data to a central endpoint). Here, the ability to trigger computations based on the arrival of new data is paramount, in addition to the ability to replay processing on previous datasets.

Similar to sensor-based workflows, the rapid expansion of the Internet-of-Things (IoT) field also raises the need for new workflow orchestration models [6]. In contrast to large-scale data analytics (which process large amounts of data in parallel), sensor-based and IoT-based workflows generally process smaller amounts of data at one time with a higher data arrival rate, placing a greater importance on new data as compared to old or late data.

Furthermore, in the developer and commercial communities, workflows are becoming increasingly important as developers have started to automate more complex tasks in the software development lifecycle. On the commercial side, businesses are turning to in-house workflow management systems to analyze their data, and have started to create a new generation of workflow management systems that are DAG oriented and are able to perform batch processing to their own specification (e.g., LinkedIn's Azkaban [7] or AirBnb's Apache Airflow [8]).
In addition to being used by companies for management or development purposes, these custom-developed solutions are also being used by scientists as building blocks [9], enabling them to create their own light-weight workflow management solutions.

While all of these general use-cases play an important part in defining the next generation of workflows and their requirements, this paper intends to focus on use-cases that apply to the scientific community, specifically traditional compute-based scientific workflows, big data workflows, and sensor-based workflows.

Based on the distinct characteristics and requirements of the different use-cases described, we can generally classify workflow management systems (WMSs) into two distinct categories: traditional, task-driven WMSs and modern, data-driven WMSs. In the traditional task-driven approach, workflow tasks are triggered for execution once all of their parent tasks have completed. A more recent development is the data-driven approach, in which tasks in the workflow are triggered by data input and output, rather than task completion dependencies. In this paper, we conduct a study on the key requirements and features that have driven the development of this new paradigm. We explore how workflow systems have addressed recent challenges presented by these new workflow use-cases, and identify open questions that have not yet been addressed by today's workflow management solutions. We first describe the paradigms in a WMS-agnostic and platform-agnostic manner, and we then present real-world use-cases that have benefited from these state-of-the-art workflow system implementations. Note that we do not aim at performing a feature-by-feature comparison of workflow systems. Instead, our goal is to provide our own hands-on experience in dealing with such challenges from the WMS's and user's perspectives.

It is also important to note that, during the last two decades, many of the authors of this paper have been involved in the scientific workflow community and have contributed to the development of workflow management systems (most notable of which is the Pegasus Workflow Management System [10]). Though Pegasus falls into the traditional, task-driven workflow management model, the authors of this paper are excited and intrigued by the new approaches and use-cases that have recently emerged.

For each general use-case, we have tried to focus on representative workflow systems. We are aware that a plethora of workflow systems have been developed in recent years, but it is not possible to account for every one of them. The purpose of this paper is to broadly classify use-cases from a scientific computing perspective and help the reader identify the workflow system that is suited for their research needs.

More specifically, this work makes the following contributions:
1) We depict the differences between the traditional task-driven approach and the next-generation data-driven approach for workflow systems. We present several existing WMSs, along with their respective features, that exemplify each model. We also provide a detailed discussion about new trends and innovations with the task-driven model.
2) We describe three real-world use-cases, match them with a representative WMS, and evaluate selected features of each WMS that benefit the given use-case. We also present the potential challenges that users may face when deploying these use-cases on different cyberinfrastructures.
3) We summarize our discussions and present our findings in a manner that aims to assist an end-user in classifying their own workflows and choosing a WMS.

This paper is organized as follows. In Section II, we provide an overview of the background and related work. Section III provides an overview of the requirements for both workflow management models by describing systems which implement each paradigm. In Section IV, we describe real-world use-cases and how a particular WMS (Pegasus, Makeflow, Apache Airflow, or Pachyderm) benefits their implementation. Section V is dedicated to illustrating the experience from the perspective of both the user and the WMS. This is done by comparing the usability, performance, and relevance of each use-case. Finally, Section VI summarizes our findings about future scientific workflow management developments.

II. BACKGROUND AND RELATED WORK

A. Traditional Scientific Workflows

One of the first models to represent a sequence of different computations is the directed acyclic graph (DAG) model [11]. An example of the task-driven approach, this model treats computational tasks, represented by nodes in the DAG, as the primary units. Many popular WMSs in the scientific community, such as Kepler [12], Makeflow [13], Taverna [14], and Pegasus [10], rely on a task-driven approach using a DAG representation. A DAG is a very simple and natural representation that allows the WMS to apply efficient scheduling [3] and data management optimizations.

In the past, scientific workflows were traditionally compute-intensive, but thanks to GPUs and new memory technologies, many data-intensive scientific workflows have been developed [4]. In addition, from biology to astronomy, the number of scientific domains embracing WMSs is constantly growing [2]. The requirements and uses from one community to another, however, are not consistent [3].

Following these trends, new requirements have emerged in the scientific workflow user community, some of which include easy deployment on several cloud and HPC platforms and efficient data management with a strong data reproducibility aspect. In this vein, workflow systems have evolved and adopted new technologies to ensure better reproducibility and easier deployment, including support for containers and cloud-based execution. Several new workflow systems have also adopted an API approach in which users programmatically define the workflow instead of giving it an abstract definition.
An example of such a system is Parsl [15], a Python scripting workflow system that enables users to quickly define their workflows.
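A minimal sketch of this programmatic style, assuming Parsl's python_app decorator and its bundled local-threads configuration, is shown below; the application logic is purely illustrative.

import parsl
from parsl import python_app
from parsl.configs.local_threads import config

parsl.load(config)

# Tasks are plain Python functions decorated as apps; each call returns a
# future, and passing futures between apps implicitly defines the DAG.
@python_app
def simulate(x):
    return x * x

@python_app
def combine(inputs=[]):
    return sum(inputs)

futures = [simulate(i) for i in range(4)]
total = combine(inputs=futures)
print(total.result())  # blocks until all upstream apps have completed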
Apache Airflow [8] is a recent task-driven WMS that aims to provide a lightweight workflow management solution to easily model, maintain, and monitor workflows. In contrast to the traditional task-driven model in which a DAG describes data and/or control exchanges, in Airflow a task is not supposed to exchange data with other tasks—they can only exchange metadata (i.e., only control flow) [8]. Airflow is very modular and provides many pre-built interfaces (Hooks) to common clouds and database systems such as Amazon S3, Google Cloud, or HDFS, among others, and has a modular execution engine for computational tasks (Operators). Users are able to utilize multiple clouds on a single deployment. Airflow also supports containerized execution and orchestration by way of a Kubernetes [23] operator.

Airflow is an excellent example of a next-generation WMS that has unique features as discussed above. For example, in Airflow [8], each DAG is associated with parameters such as a workflow's start date, an end date, a number of retries in case of failure, and the delay between each retry, among others. A cron expression allows the user to describe a schedule interval. The scheduler runs in the background as a daemon and will pick up or kick off any DAG according to its start date, end date, and schedule interval. Another interesting task-triggering concept extending the possibilities of the task-driven model is the ability to trigger tasks without satisfying dependencies (e.g., users can execute a task if one or all of its predecessors have failed, see the red task in Figure 1(b)). This ability can be seen as an exception handling mechanism for workflow execution. Finally, another major refinement when compared to the traditional task-driven approach is the support of conditional execution, where a branch of the workflow is executed only if a given condition is satisfied (see Figure 1(b)).
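As an illustration, the sketch below declares such a DAG with the Airflow 1.x Python API; the task names, shell commands, and schedule are hypothetical and stand in for the features discussed above (retries, a cron schedule interval, and a trigger rule that fires on upstream failure).

from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# Per-task defaults: first run date, number of retries, and delay between retries.
default_args = {
    "start_date": datetime(2019, 1, 1),
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
}

# The cron expression asks the scheduler daemon to launch a DAG run daily at 02:00.
dag = DAG("example_workflow", default_args=default_args,
          schedule_interval="0 2 * * *")

ingest = BashOperator(task_id="ingest", bash_command="./ingest.sh", dag=dag)
process = BashOperator(task_id="process", bash_command="./process.sh", dag=dag)

# Trigger rule relaxing the usual "all parents succeeded" semantics: this task
# runs as soon as one upstream task has failed, acting as an exception handler.
notify_failure = BashOperator(task_id="notify_failure",
                              bash_command="./notify.sh",
                              trigger_rule="one_failed", dag=dag)

ingest >> process >> notify_failure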
B. Data-driven Model

Compared to the previous approach, data-driven WMS in general provide better data provenance and versioning, better support for cloud-based storage, easier data ingestion, and good scaling capability via the adoption of highly scalable orchestration solutions such as Kubernetes [23].

Model: In this model, data are the primary units, as opposed to tasks. A data-driven workflow can be represented as a DAG, but instead of tasks being individual nodes, a node represents a data repository and the edges are the computational tasks. A data repository can be seen as a directory where data are structured as objects or files. A task describes the processing steps to be performed on the data in the incoming repository. A pipeline corresponds to the edge in the classic DAG representation (see Figure 2). Let v_i and v_j be two distinct data repositories, and let e_{i,j} be the edge from v_i to v_j. Then, v_i stores the input data used by the pipeline e_{i,j} and v_j stores the output data produced by this pipeline. A successful pipeline will create a new repository for output files, which can be used as the input directory for another pipeline. Connecting pipelines to each other by way of data repositories defines a complete workflow.

Often, in data-oriented workflows, users start processing newer data as soon as they become available, without having to trigger the workflow themselves. In certain cases, WMS using the data-driven model can be considered "active," as they proactively monitor each repository for new data and trigger individual computational pipelines as needed. Notice that this model differs from stream-based workflows, since computational tasks are not constantly running on computing nodes waiting for the next chunk of data to process.

Figure 2. An example of a data-driven workflow (repo1 → Pipeline1 → repo2 → Pipeline2 → repo3). A piece of data passes through different tasks, here called pipelines, that compute on the data.

Workflow management system: Pachyderm [20], [24] aims to enable reproducible, collaborative, and scalable data science through a more innovative approach to workflow management. A Pachyderm workflow, called a pipeline, is organized around data repositories (nodes in the DAG) containing data. Whereas Hadoop-based solutions are usually optimized for MapReduce processing, Pachyderm is data- and language-agnostic, meaning that it is not limited to a given data format or programming language to process the data. Using the active approach previously described, Pachyderm runs a pipeline on its data and waits asynchronously for new commits (i.e., new data to be processed by the pipeline). Pachyderm is built on top of numerous software layers and runs on widely-used commercial cloud providers (Amazon S3, Microsoft Azure, Google Cloud, etc.).
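To make this structure concrete, the sketch below builds a minimal pipeline specification of the kind Pachyderm consumes, expressed as a Python dictionary serialized to JSON; the repository, image, and command names are hypothetical, and the field layout follows the Pachyderm 1.x pipeline specification (an assumption that may not hold for other releases).

import json

# Hypothetical pipeline: watch a "raw-sensor-data" repository and run a
# containerized conversion step on every new datum; results are written to
# /pfs/out, which Pachyderm exposes as the pipeline's output repository.
pipeline_spec = {
    "pipeline": {"name": "convert-units"},
    "transform": {
        "image": "example/convert:1.0",  # user-supplied Docker image (hypothetical)
        "cmd": ["python3", "/convert.py", "/pfs/raw-sensor-data", "/pfs/out"],
    },
    "input": {
        # The glob pattern controls how a commit is split into datums,
        # i.e., at what granularity new data triggers the pipeline.
        "pfs": {"repo": "raw-sensor-data", "glob": "/*"},
    },
}

with open("convert-units.json", "w") as f:
    json.dump(pipeline_spec, f, indent=2)
# The resulting file would then be registered with Pachyderm's pachctl CLI.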
C. Reproducibility

As scientific workflows use increasing amounts of data, reproducibility of results and data provenance become crucial. Task-driven workflows often take the approach that the input data and the description of the workflow are sufficient to reproduce the results [1]. In the data-driven model, each data repository is versioned, ensuring complete data reproducibility and allowing users to execute workflows on each data version available. Note that this feature has a non-negligible cost in terms of storage. For example, Pachyderm uses a Git-inspired [25] data-versioning system, so users add data to repositories via a commit and these data are then processed by the tasks (see Figure 2). Pachyderm versions the pipeline specifications and all the data processed, allowing users to roll back and execute any pipeline on any data that have been versioned. Coupled with native containerized execution, this enables fully reproducible data pipelines.

D. Cloud-based orchestration

Many task-driven workflows pre-date the cloud, thus these virtual resources were often treated as an additional execution environment. On the other hand, data-driven workflows were born cloud-ready and utilize cloud-based container orchestration solutions such as Kubernetes [23], which allow efficient handling of large amounts of data and provide a reproducible and easy deployment procedure on different cloud providers. In Pachyderm's case, each pipeline is contained in a Docker image and all pipelines are managed through Kubernetes worker pods.
IV. REAL-WORLD USE CASES

To examine each workflow management paradigm and its associated workflow management systems, we chose three real-world workflow use-cases based on the four general workflow categories described in the introduction.

The 1000 Genome workflow itself can be considered an example of a "classic" scientific DAG-based workflow because of its consistent dependencies and static nature. The workflow satisfies all DAG properties—each computational task in the workflow depends on the completion of previous parent tasks, and no changes to the workflow structure occur. Additionally, the workflow is not dynamically triggered, i.e., it starts based on a user's command, and all workflow input data are known a priori. This workflow use-case has no particular requirements, only requiring data-flow dependencies and an available network connection to retrieve the dataset.

B. CASA: Streaming Data Workflow

Streaming data workflows have become increasingly popular in recent years. The University of Massachusetts' Collaborative Adaptive Sensing of the Atmosphere (CASA) project utilizes a sensor-based streaming data workflow for their Dallas/Fort Worth (DFW) weather radar testbed, which aggregates data from eight short-range weather radar sensors, providing higher data precision, accuracy, and timeliness versus other, longer-range radar systems [27].

C. NEON: Sensor-based Data-driven Workflow

The National Ecological Observatory Network (NEON) is an NSF open-science facility collecting ecological data from sensors across the US with the objective of studying ecological processes and changes. NEON's instrument data pipeline takes raw sensor data from terrestrial and aquatic sensors and processes it for publication. Raw sensor data ranges from resistance values and voltage at a low frequency of collection, to high-frequency sonic anemometer data. The workflow converts this data into the appropriate unit of measurement using calibration coefficients, and performs QA/QC steps on the data to ensure its quality (see Figure 5).
The workflow is a linear workflow that, as long as data is coming through the pipeline, gathers data from multiple sources (metadata, calibration data, and raw sensor data) and processes them to create significant output results.

An essential requirement of the NEON instrument data pipeline is the ability to reprocess data. This is necessary for several reasons, including improved algorithms, more recent calibration data, or late data. Being able to process data that arrives at a later date is of utmost importance to this workflow, and NEON notates missing data with a "null flag" in their data repositories as a placeholder.

Figure 5. NEON processing workflow. The rectangles represent the pipelines and the cylinders the data repositories.

V. HOLISTIC EVALUATION

In this section, we present the results of a holistic evaluation of WMS for the models discussed above. We do not attempt to make a singular recommendation, nor is this an analysis of the different features and limitations of each WMS. As expressed above, different use-cases have different feature requirements, and it is impossible to dictate a singular WMS that meets all possible requirements. Rather, this work aims to help users decide which workflow management model, task-driven or data-driven, fits their needs best, and to provide an example WMS for each model to further demonstrate how a particular model can complement a given workflow. In order to perform a thorough evaluation of each workflow management system, we used the following criteria.

Setup and deployment: We tested the installation process on a local cluster and using a cloud platform (AWS), using publicly-available documentation and user-facing support channels.

Workflow implementation: We examined the level of knowledge and effort required to model the relevant use-case and workflow.

Workflow execution: We researched features relating to workflow triggering, scalability, workflow resiliency, and how each WMS handles failures.

Data management: We studied how data is managed throughout the workflow execution, including whether data is transferred between computational tasks or workflows.

To test each WMS, two types of deployments were used in order to account for the majority of use-cases. First, initial WMS installation, testing, and workflow modeling was done on a local system. This deployment was used for initial testing, as many WMS intended for scientific workflows are still deployed and used on local computers or clusters, and this use-case has its own unique challenges. Second, Airflow and Pachyderm, both considered cloud-ready WMS, are discussed with special attention paid to cloud deployment, which presents separate challenges and is increasingly important as some scientific users have begun transitioning their workflows from local to cloud deployments.

A. Traditional Task-driven WMS: Pegasus and Makeflow

In the case of the 1000 Genome use-case, the workflow uses a Python workflow generation script to enumerate files in the 1000 Genome dataset and assign them as inputs of various tasks inside the workflow pipeline without knowing the exact file names, facilitating an easy workflow modeling process. Pegasus and Makeflow are excellent WMS for the 1000 Genome use-case, due to its traditional workflow structure. Pegasus' Python API gives the user the flexibility and ease of use of the Python language in creating a workflow pipeline. Whereas Pegasus targets flexible composition of DAG workflows via APIs with an emphasis on functionality, Makeflow has a more rigid, yet simple, workflow modeling structure based on GNU make. In this case, dependencies between workflow tasks are automatically inferred from the data flow specified in the workflow description file, which alleviates the user's burden of defining task dependencies.

Setup and deployment: Pegasus installation is relatively simple due to its availability on different official repositories. Since Pegasus relies on HTCondor [28] as a task scheduler and interface to other cluster managers, additional effort is required to properly configure and describe the resources. Pegasus also supports cloud deployments on commercial cloud providers and NSF-cloud infrastructures [27]. Similarly, due to its simple workflow structure model, Makeflow installation is relatively easy—though it is assumed a workload manager is already available (e.g., WorkQueue). Neither Pegasus nor Makeflow was natively designed for cloud support; rather, cloud resources generally are deployed manually (e.g., on AWS or Azure).

Workflow implementation: Pegasus provides a rich set of APIs (Python, Java, R, and Perl) for modeling workflows. These APIs provide a versatile mechanism for modeling large-scale workflows (on the order of 10^6 tasks), which is not often practical via graphical interfaces—though the entry barrier for non-expert users is higher. In Makeflow, workflows are defined like 'Makefiles'. This structure is fairly simple for defining workflows where the data flow drives the task dependencies. The drawback of this approach is the limited flexibility for defining complex workflow patterns or control flows compared to more complex APIs.
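For illustration, the sketch below models a two-task fragment in the spirit of the Pegasus DAX3 Python API (as shipped with Pegasus 4.x); the class and method names are an assumption tied to that API generation, and the job and file names are hypothetical rather than taken from the actual 1000 Genome generator script.

from Pegasus.DAX3 import ADAG, File, Job, Link

# Two chained jobs standing in for a fragment of the 1000 Genome pipeline.
dax = ADAG("thousand-genome")

vcf = File("ALL.chr1.vcf")
individuals_out = File("chr1.individuals")

individuals = Job(name="individuals")
individuals.addArguments("ALL.chr1.vcf", "1")
individuals.uses(vcf, link=Link.INPUT)
individuals.uses(individuals_out, link=Link.OUTPUT, transfer=False)
dax.addJob(individuals)

overlap = Job(name="mutation_overlap")
overlap.addArguments("chr1.individuals")
overlap.uses(individuals_out, link=Link.INPUT)
dax.addJob(overlap)

# Explicit control-flow dependency between the two jobs.
dax.depends(parent=individuals, child=overlap)

# Write the abstract workflow description that Pegasus later plans and executes.
with open("workflow.dax", "w") as f:
    dax.writeXML(f)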
Workflow execution: Both WMSs support different workflow execution environments, such as containers and batch schedulers, which are key for enabling large-scale executions. Pegasus is built for reliability and integrity, featuring several different types of workflow recovery methods, provenance data, and checkpointing abilities. Makeflow is simpler: it automatically retries failed tasks, but does not provide support for checkpointing or sophisticated error recovery methods. Pegasus workflows also feature several "catalogs" listing the locations of key data resources, executable 'transformation' resources, and compute resources, allowing for easy workflow portability, with which only the resource configuration needs to be changed.
Data management: Pegasus provides advanced mechanisms to efficiently manage data movement during workflow execution. During the workflow planning phase, Pegasus identifies data locations and augments the workflows with data transfer jobs for staging input and output data from/to storage resources. A wide range of protocols are supported, including access to cloud object storage, via Globus services, etc. Data throttling allows for increased throughput performance. In Makeflow, data is assumed to be directly accessible from the computing node (e.g., a shared filesystem) or fetched from a remote source—support is limited to common Internet protocols. However, depending on the execution resources used (e.g., Docker containers), Makeflow does support data staging in/out of individual computational tasks.

B. Recent Task-driven WMS

One example of a more recent (i.e., more flexible, supporting containerized execution, and cloud-oriented) task-driven WMS is Airflow. Airflow's operators allow users to develop their own notions of what a workflow "task" means. While many traditional WMS limit computational tasks to executables, in Airflow the notion of a task ranges from simple tasks, such as sending an email, to more complex tasks, such as running executables inside a container or managing a cloud deployment. In addition to many built-in operators ready to use, the triggering system used by Airflow is also extremely beneficial to complex workflows. More robust than a cron-based implementation on top of a traditional WMS, Airflow's scheduler allows many tasks to be run at custom times or intervals, and monitors them as such. This concept is extremely beneficial for the CASA workflow, which is started every 75 seconds in order to process new streaming data.
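A minimal sketch of such a sub-minute schedule, assuming the Airflow 1.x API and a hypothetical processing script, could look as follows.

from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# A timedelta-based schedule interval starts a new DAG run every 75 seconds,
# matching the arrival rate of new radar data; catchup=False avoids
# backfilling runs for past intervals.
dag = DAG("casa_nowcast",
          default_args={"start_date": datetime(2019, 1, 1), "retries": 1},
          schedule_interval=timedelta(seconds=75),
          catchup=False)

# Hypothetical processing step operating on the most recently received
# radar_*.netcdf.gz files.
process_radar = BashOperator(task_id="process_radar",
                             bash_command="./run_nowcast.sh /data/incoming",
                             dag=dag)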
Setup and deployment: Airflow is written in Python, which is the only required software prerequisite. Installing Airflow on a local cluster is relatively easy via the pip package manager. However, additional software such as the high-performance execution queue Celery might be needed to adapt Airflow to the user's requirements. From a cloud perspective, Airflow primarily supports Google Cloud (GC), with extended support for Azure and AWS. Deployment on cloud platforms is simple, thanks to provided scripts (with a slight edge toward GC), but complex features such as cloud autoscaling still require configuration or external tools.

Workflow implementation: Through the extensive library of built-in operators, workflow implementation in Airflow is extremely flexible, and the Python API structure used to write workflows is easy to learn and understand.

Workflow execution: Airflow's scheduling features allow users a robust way to trigger their workflows. Airflow also has a built-in task retries parameter, like many other WMS, and external software such as Apache Mesos enables checkpointing features.

Data management: Airflow features a basic way to pass data from task to task inside a workflow, but this is not as robust as other WMSs' data management features. As such, tracking provenance is also more difficult.

C. Data-driven WMS: Pachyderm

As described in the previous sections, Pachyderm is a prime example of a recently-developed WMS using the data-driven management paradigm. Written to allow portable data pipelines with reproducibility and data provenance, Pachyderm is not explicitly designed for scientific workflows, but can be co-opted for scientific tasks. As detailed in Section III, Pachyderm's workflow modeling process is interesting in that each task is defined individually as a data pipeline, and several pipelines can be combined together to form a complete workflow.

One important feature of Pachyderm with regard to the NEON workflow pipeline is Pachyderm's robust data provenance tracking. The NEON project aims at publishing versioned data sets on an online portal to support open-science research. Although NEON can produce versions of these data sets without a solution like Pachyderm, Pachyderm's provenance tracking features allow NEON to inform the public what has changed between versions of the data sets, and why. NEON also needs to re-run their pipelines with field calibration data. For sensors that are calibrated regularly in the field this is inconvenient, as NEON has to reprocess when those calibrations are stored. For instruments that self-calibrate, and have no defined calibration period, this is borderline impossible without an on-demand reprocessing capability. Pachyderm allows NEON to do this by triggering only the pieces of the pipeline necessary to run when a calibration change comes in via the commit system.

Setup and deployment: Similarly to other WMS, Pachyderm requires several dependencies, including Kubernetes, which itself requires expert knowledge (just as with HTCondor). Pachyderm is intended for cloud deployments, and several utilities exist to automate installation on cloud platforms like AWS. However, Pachyderm has limited local cluster deployment, requiring a Kubernetes cluster and S3-compatible storage. Compared to Airflow, which allows more fine-grained resource management, Pachyderm is more restrictive—once its Kubernetes cluster is deployed on a given cloud, all pipelines execute on that cloud.

Workflow implementation: Because Pachyderm triggers pipelines whenever data is committed to a given input repository, file and folder structures inside of these repositories might need to be organized for optimal data flow and pipeline triggering. Another resulting requirement of this commit-based triggering system is that each pipeline must output files or objects to an output repository. These triggering features can be very beneficial to scientific workflows, as highlighted with the NEON use-case.
Workflow execution: Pachyderm has the ability to scale over a Kubernetes cluster, which is extremely simple. Pachyderm automatically retries each data pipeline based upon task exit code. However, for certain workflows, error management might not be as easy as with other WMS, since the actual workflows are buried in containers inside Kubernetes pods.

Data management: Data passing is a strong point of Pachyderm, with data flow and provenance being among the headline features of the WMS. Provenance can also be tracked back through a workflow, with PFS repositories tracking files and their respective commits.

VI. CONCLUSION

This work's main objective is to describe interesting trends and concepts in next-generation scientific and commercial workflow management systems, from a user's perspective. To this end, we analyzed how two different workflow management paradigms, namely the task-driven and data-driven paradigms, can be applied to real-world use-cases.

From the four generic use-cases detailed in the introduction, we carefully described three real-world use-cases. With a traditional scientific workflow with the 1000 Genome Project workflow, and two different sensor-based workflows with the CASA and NEON soil workflows, we presented a cross-section of traditional and next-generation workflows to evaluate trends and developments in the workflow space.

For the task-driven and data-driven paradigms, we selected representative workflow management systems. Pegasus, Makeflow, and Apache Airflow represent the task-driven model, while Pachyderm represents the data-driven model. Each model is thoroughly and holistically evaluated, along with its associated workflow management systems. While performing this evaluation, we examined the rise of new technologies and innovations, including containers and the cloud, and new workflow management use-cases, such as big data analytics, large-scale science, and machine learning. Using several real-world use-cases, we highlighted how each WMS's unique features can be an asset to certain next-generation workflows, and emphasized how these features set each WMS apart from one another.

Future work consists of exploring more real-world use-cases, such as IoT workflows or large-scale data analytics, as well as more WMS solutions and approaches, e.g., Apache Kafka for real-time data streaming. We would also like to evaluate how techniques developed by these next-generation WMS could benefit traditional scientific workflows.

Acknowledgments. This work is funded by NSF contract #1842042: "Pilot Study for a Cyberinfrastructure Center of Excellence". The National Ecological Observatory Network is a program sponsored by the National Science Foundation and operated under cooperative agreement by Battelle Memorial Institute.

REFERENCES

[1] E. Deelman, T. Peterka et al., "The future of scientific workflows," The International Journal of High Performance Computing Applications, 2018.
[2] C. S. Liew, M. P. Atkinson, M. Galea, T. F. Ang, P. Martin, and J. I. V. Hemert, "Scientific workflows: Moving across paradigms," ACM Computing Surveys (CSUR), vol. 49, no. 4, p. 66, 2017.
[3] A. Barker and J. Van Hemert, "Scientific workflow: A survey and research directions," in International Conference on Parallel Processing and Applied Mathematics. Springer, 2007, pp. 746–753.
[4] J. Liu, E. Pacitti, P. Valduriez, and M. Mattoso, "A survey of data-intensive scientific workflow management," Journal of Grid Computing, vol. 13, no. 4, pp. 457–493, 2015.
[5] A. Rouhani, E. Bernhardsson, and E. Freider. Extreme Science and Engineering Discovery Environment (XSEDE).
[6] M. Nardelli, S. Nastic, S. Dustdar, M. Villari, and R. Ranjan, "Osmotic flow: Osmotic computing + IoT workflow," IEEE Cloud Computing, vol. 4, no. 2, pp. 68–75, 2017.
[7] R. Sumbaly, J. Kreps, and S. Shah, "The big data ecosystem at LinkedIn," in Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data. ACM, 2013, pp. 1125–1134.
[8] M. Beauchemin. (2014) Apache Airflow Project.
[9] M. Kotliar et al., "CWL-Airflow: A lightweight pipeline manager supporting Common Workflow Language," bioRxiv, 2018.
[10] E. Deelman, K. Vahi et al., "Pegasus: A workflow management system for science automation," Future Generation Computer Systems, 2015.
[11] R. L. Graham et al., "Optimization and approximation in deterministic sequencing and scheduling: A survey," in Annals of Discrete Mathematics. Elsevier, 1979.
[12] B. Ludäscher, I. Altintas, C. Berkley et al., "Scientific workflow management and the Kepler system," Concurrency and Computation: Practice and Experience, vol. 18, no. 10, pp. 1039–1065, 2006.
[13] M. Albrecht et al., "Makeflow: A portable abstraction for data intensive computing on clusters, clouds, and grids," in Proceedings of the 1st ACM SIGMOD Workshop on Scalable Workflow Execution Engines and Technologies. ACM, 2012, p. 1.
[14] T. Oinn, M. Addis, J. Ferris et al., "Taverna: A tool for the composition and enactment of bioinformatics workflows," Bioinformatics, vol. 20, no. 17, pp. 3045–3054, 2004.
[15] Y. N. Babuji et al., "Parsl: Scalable parallel scripting in Python," in IWSG, 2018.
[16] J. Dean and S. Ghemawat, "MapReduce: A flexible data processing tool," Communications of the ACM, vol. 53, no. 1, pp. 72–77, 2010.
[17] V. K. Vavilapalli, A. C. Murthy, C. Douglas et al., "Apache Hadoop YARN: Yet another resource negotiator," in Proceedings of the 4th Annual Symposium on Cloud Computing. ACM, 2013, p. 5.
[18] H. Pathak, M. Rathi, and A. Parekh, "Introduction to real-time processing in Apache Apex," Int. J. Res. Advent Technol., p. 19, 2016.
[19] P. Di Tommaso, M. Chatzou, E. W. Floden, P. P. Barja, E. Palumbo, and C. Notredame, "Nextflow enables reproducible computational workflows," Nature Biotechnology, vol. 35, no. 4, p. 316, 2017.
[20] Pachyderm, Inc. (2017) Pachyderm. [Online]. Available: https://www.pachyderm.io/
[21] G. Bosilca et al., "DAGuE: A generic distributed DAG engine for high performance computing," Parallel Computing, 2012.
[22] R. Sethi, "Scheduling graphs on two processors," SIAM Journal on Computing, pp. 73–82, 1976.
[23] D. Bernstein, "Containers and cloud: From LXC to Docker to Kubernetes," IEEE Cloud Computing, vol. 1, no. 3, pp. 81–84, 2014.
[24] J. A. Novella et al., "Container-based bioinformatics with Pachyderm," Bioinformatics, vol. 35, no. 5, pp. 839–846, 2018.
[25] L. Torvalds and J. Hamano. (2005) Git: Fast version control system.
[26] The 1000 Genomes Project Consortium et al., "A global reference for human genetic variation," Nature, 2015.
[27] E. Lyons et al., "Toward a dynamic network-centric distributed cloud platform for scientific workflows: A case study for adaptive weather sensing," in 15th eScience Conference, 2019.
[28] D. Thain, T. Tannenbaum, and M. Livny, "Distributed computing in practice: The Condor experience," Concurrency and Computation: Practice and Experience, vol. 17, no. 2–4, pp. 323–356, 2005.