Exploration of Workflow Management Systems

Abstract—There has been a recent emergence of new workflow applications focused on data analytics and machine learning. This emergence has precipitated a change in the workflow management landscape, causing the development of new data-oriented workflow management systems (WMSs) in addition to the earlier standard of task-oriented WMSs. In this paper, we summarize three general workflow use-cases and explore the unique requirements of each use-case in order to understand how WMSs from both workflow management models meet the requirements of each workflow use-case from the user's perspective. We analyze the applicability of the two models by carefully describing each model and by providing an examination of the different variations of WMSs that fall under the task-driven model. To illustrate the strengths and weaknesses of each workflow management model, we summarize the key features of four production-ready WMSs: Pegasus, Makeflow, Apache Airflow, and Pachyderm. To deepen our analysis of the four WMSs examined in this paper, we implement three real-world use-cases to highlight the specifications and features of each WMS. We present our final assessment of each WMS after considering the following factors: usability, performance, ease of deployment, and relevance. The purpose of this work is to offer insights from the user's perspective into the research challenges that WMSs currently face due to the evolving workflow landscape.

Index Terms—Scientific workflow, Workflow Management System, Task-driven, Data-driven.

I. INTRODUCTION

In the last two decades, scientific workflows have become mainstream thanks to their ability to empower scientific discoveries in virtually all fields of science [1]. During this time, key engineering challenges have been solved and a rich set of abstractions and interoperable software implementations have been developed [2]. These advancements have allowed scientists across various fields to begin reaping the benefits of workflow systems [3]. Traditionally, scientific workflows are described as directed acyclic graphs (DAGs), in which nodes represent computational tasks and edges represent the dependencies of those tasks [4]. The traditional approach for orchestrating DAG-based workflows is to use task-based scheduling algorithms that spawn tasks for execution once their dependencies are satisfied. To keep up with increased computing requirements, workflow systems have developed mechanisms to manage the distribution and execution of tasks on and across varied computing infrastructures such as local servers, campus computing clusters, high-performance computing resources (such as XSEDE [5]), and even popular cloud computing platforms. These developments in workflow management are of primary concern to the field of scientific computing, where scientists often run complex pipelines that scale over hundreds and thousands of tasks [2].

There are four general workflow system use-cases that have been identified [3]:
• Traditional scientific compute workflows, as discussed above;
• Data analytics workflows (including big data and machine learning);
• Sensor and Internet-of-Things (IoT) workflows; and
• Commercial, developer, and business-related workflows.

One of the major trends among scientific applications recently concerns big data analytics and machine learning [4]. These data-oriented workflows pose different challenges when compared to traditional workflow structures, and they often require special features such as data provenance, data reproducibility, and special data ingestion features.

A second type of data-oriented workflow that is gaining prominence in the scientific field is sensor-based workflows. Such workflows process data in a continuous fashion, with data being ingested in near real-time from distributed sources (e.g., sensors that stream data to a central endpoint). Here, the ability to trigger computations based on the arrival of new data is paramount, in addition to the ability to replay processing on previous datasets.

Similar to sensor-based workflows, the rapid expansion of the Internet-of-Things (IoT) field also raises the need for new workflow orchestration models [6]. In contrast to large-scale data analytics (which process large amounts of data in parallel), sensor-based and IoT-based workflows generally process smaller amounts of data at one time with a higher data arrival rate, placing a greater importance on new data as compared to old or late data.

Furthermore, in the developer and commercial communities, workflows are becoming increasingly important as developers have started to automate more complex tasks in the software development lifecycle. On the commercial side, businesses are turning to in-house workflow management systems to analyze their data, and have started to create a new generation of workflow management systems that are DAG oriented and are able to perform batch processing to their own specification (e.g., LinkedIn's Azkaban [7] or AirBnb's Apache Airflow [8]).
In addition to being used by companies for management or development purposes, these custom-developed solutions are also being used by scientists as building blocks [9], enabling them to create their own light-weight workflow management solutions.

While all of these general use-cases play an important part in defining the next generation of workflows and their requirements, this paper intends to focus on use-cases that apply to the scientific community, specifically traditional compute-based scientific workflows, big data workflows, and sensor-based workflows.

Based on the distinct characteristics and requirements of the different use-cases described, we can generally classify workflow management systems (WMSs) into two distinct categories: traditional, task-driven WMSs and modern, data-driven WMSs. In the traditional task-driven approach, workflow tasks are triggered for execution once all of their parent tasks have completed. A more recent development is the data-driven approach, in which tasks in the workflow are triggered by data input and output, rather than task completion dependencies. In this paper, we conduct a study on the key requirements and features that have driven the development of this new paradigm. We explore how workflow systems have addressed recent challenges presented by these new workflow use-cases, and identify open questions that have not yet been addressed by today's workflow management solutions. We first describe the paradigms in a WMS-agnostic and platform-agnostic manner, and we then present real-world use-cases that have benefited from these state-of-the-art workflow system implementations. Note that we do not aim at performing a feature-by-feature comparison of workflow systems. Instead, our goal is to provide our own hands-on experience in dealing with such challenges from the WMS's and user's perspectives.

It is also important to note that, during the last two decades, many of the authors of this paper have been involved in the scientific workflow community and have contributed to the development of workflow management systems (most notable of which is the Pegasus Workflow Management System [10]). Though Pegasus falls into the traditional, task-driven workflow management model, the authors of this paper are excited and intrigued by the new approaches and use-cases that have recently emerged.

For each general use-case, we have tried to focus on representative workflow systems. We are aware that a plethora of workflow systems have been developed in recent years, but it is not possible to account for every one of them. The purpose of this paper is to broadly classify use-cases from a scientific computing perspective and help the reader identify the workflow system that is suited for their research needs.

More specifically, this work makes the following contributions:
1) We depict the differences between the traditional task-driven approach and the next-generation data-driven approach for workflow systems. We present several existing WMSs, along with their respective features, that exemplify each model. We also provide a detailed discussion about new trends and innovations with the task-driven model.
2) We describe three real-world use-cases, match them with a representative WMS, and evaluate selected features of each WMS that benefit the given use-case. We also present the potential challenges that users may face when deploying these use-cases on different cyberinfrastructures.
3) We summarize our discussions and present our findings in a manner that aims to assist an end-user in classifying their own workflows and choosing a WMS.

This paper is organized as follows. In Section II, we provide an overview of the background and related work. Section III provides an overview of the requirements for both workflow management models by describing systems which implement each paradigm. In Section IV, we describe real-world use-cases and how a particular WMS (Pegasus, Makeflow, Apache Airflow, or Pachyderm) benefits their implementation. Section V is dedicated to illustrating the experience from the perspective of both the user and the WMS. This is done by comparing the usability, performance, and relevance of each use-case. Finally, Section VI summarizes our findings about future scientific workflow management developments.

II. BACKGROUND AND RELATED WORK

A. Traditional Scientific Workflows

One of the first models to represent a sequence of different computations is the directed acyclic graph (DAG) model [11]. An example of the task-driven approach, this model treats computational tasks, represented by nodes in the DAG, as the primary units. Many popular WMSs in the scientific community, such as Kepler [12], Makeflow [13], Taverna [14], and Pegasus [10], rely on a task-driven approach using a DAG representation. A DAG is a very simple and natural representation that allows the WMS to apply efficient scheduling [3] and data management optimizations.

In the past, scientific workflows were traditionally compute-intensive, but thanks to GPUs and new memory technologies, many data-intensive scientific workflows have been developed [4]. In addition, from biology to astronomy, the number of scientific domains embracing WMSs is constantly growing [2]. The requirements and uses from one community to another, however, are not consistent [3].

Following these trends, new requirements have emerged in the scientific workflow user community, some of which include easy deployment on several cloud and HPC platforms and efficient data management with a strong data reproducibility aspect. In this vein, workflow systems have evolved and adopted new technologies to ensure better reproducibility and easier deployment, including support for containers and cloud-based execution. Several new workflow systems have also adopted an API approach in which users programmatically define the workflow instead of giving it an abstract definition.
An example of such a system is Parsl [15], a Python scripting workflow system that enables users to quickly define their workflows.
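A minimal sketch of this programmatic style, assuming Parsl's python_app decorator and its bundled local-threads configuration, is shown below; the application logic is purely illustrative.

import parsl
from parsl import python_app
from parsl.configs.local_threads import config

parsl.load(config)

# Tasks are plain Python functions decorated as apps; each call returns a
# future, and passing futures between apps implicitly defines the DAG.
@python_app
def simulate(x):
    return x * x

@python_app
def combine(inputs=[]):
    return sum(inputs)

futures = [simulate(i) for i in range(4)]
total = combine(inputs=futures)
print(total.result())  # blocks until all upstream apps have completed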
Apache Airflow [8] is a recent task-driven WMS that aims to provide a lightweight workflow management solution to easily model, maintain, and monitor workflows. In contrast to the traditional task-driven model in which a DAG describes data and/or control exchanges, in Airflow a task is not supposed to exchange data with other tasks—they can only exchange metadata (i.e., only control flow) [8]. Airflow is very modular and provides many pre-built interfaces (Hooks) to common clouds and database systems such as Amazon S3, Google Cloud, or HDFS, among others, and has a modular execution engine for computational tasks (Operators). Users are able to utilize multiple clouds on a single deployment. Airflow also supports containerized execution and orchestration by way of a Kubernetes [23] operator.

Airflow is an excellent example of a next-generation WMS that has unique features as discussed above. For example, in Airflow [8], each DAG is associated with parameters such as a workflow's start date, an end date, a number of retries in case of failure, and the delay between each retry, among others. A cron expression allows the user to describe a schedule interval. The scheduler runs in the background as a daemon and will pick up or kick off any DAG according to its start date, end date, and schedule interval. Another interesting task-triggering concept extending the possibilities of the task-driven model is the ability to trigger tasks without satisfying dependencies (e.g., users can execute a task if one or all of its predecessors have failed, see the red task in Figure 1(b)). This ability can be seen as an exception handling mechanism for workflow execution. Finally, another major refinement when compared to the traditional task-driven approach is the support of conditional execution, where a branch of the workflow is executed only if a given condition is satisfied (see Figure 1(b)).
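As an illustration, the sketch below declares such a DAG with the Airflow 1.x Python API; the task names, shell commands, and schedule are hypothetical and stand in for the features discussed above (retries, a cron schedule interval, and a trigger rule that fires on upstream failure).

from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# Per-task defaults: first run date, number of retries, and delay between retries.
default_args = {
    "start_date": datetime(2019, 1, 1),
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
}

# The cron expression asks the scheduler daemon to launch a DAG run daily at 02:00.
dag = DAG("example_workflow", default_args=default_args,
          schedule_interval="0 2 * * *")

ingest = BashOperator(task_id="ingest", bash_command="./ingest.sh", dag=dag)
process = BashOperator(task_id="process", bash_command="./process.sh", dag=dag)

# Trigger rule relaxing the usual "all parents succeeded" semantics: this task
# runs as soon as one upstream task has failed, acting as an exception handler.
notify_failure = BashOperator(task_id="notify_failure",
                              bash_command="./notify.sh",
                              trigger_rule="one_failed", dag=dag)

ingest >> process >> notify_failure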
B. Data-driven Model

Compared to the previous approach, data-driven WMS in general provide better data provenance and versioning, better support for cloud-based storage, easier data ingestion, and good scaling capability via the adoption of highly scalable orchestration solutions such as Kubernetes [23].

Model: In this model, data are the primary units, as opposed to tasks. A data-driven workflow can be represented as a DAG, but instead of tasks being individual nodes, a node represents a data repository and the edges are the computational tasks. A data repository can be seen as a directory where data are structured as objects or files. A task describes the processing steps to be performed on the data in the incoming repository. A pipeline corresponds to the edge in the classic DAG representation (see Figure 2). Let v_i and v_j be two distinct data repositories, and let e_{i,j} be the edge from v_i to v_j. Then, v_i stores the input data used by the pipeline e_{i,j} and v_j stores the output data produced by this pipeline. A successful pipeline will create a new repository for output files, which can be used as the input directory for another pipeline. Connecting pipelines to each other by way of data repositories defines a complete workflow.

Often, in data-oriented workflows, users start processing newer data as soon as they become available, without having to trigger the workflow themselves. In certain cases, WMS using the data-driven model can be considered "active," as they proactively monitor each repository for new data and trigger individual computational pipelines as needed. Notice that this model differs from stream-based workflows, since computational tasks are not constantly running on computing nodes waiting for the next chunk of data to process.

Figure 2. An example of a data-driven workflow (repo1 → Pipeline1 → repo2 → Pipeline2 → repo3). A piece of data passes through different tasks, here called pipelines, that compute on the data.

Workflow management system: Pachyderm [20], [24] aims to enable reproducible, collaborative, and scalable data science through a more innovative approach to workflow management. A Pachyderm workflow, called a pipeline, is organized around data repositories (nodes in the DAG) containing data. Whereas Hadoop-based solutions are usually optimized for MapReduce processing, Pachyderm is data- and language-agnostic, meaning that it is not limited to a given data format or programming language to process the data. Using the active approach previously described, Pachyderm runs a pipeline on its data and waits asynchronously for new commits (i.e., new data to be processed by the pipeline). Pachyderm is built on top of numerous software layers and runs on widely-used commercial cloud providers (Amazon S3, Microsoft Azure, Google Cloud, etc.).
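To make this structure concrete, the sketch below builds a minimal pipeline specification of the kind Pachyderm consumes, expressed as a Python dictionary serialized to JSON; the repository, image, and command names are hypothetical, and the field layout follows the Pachyderm 1.x pipeline specification (an assumption that may not hold for other releases).

import json

# Hypothetical pipeline: watch a "raw-sensor-data" repository and run a
# containerized conversion step on every new datum; results are written to
# /pfs/out, which Pachyderm exposes as the pipeline's output repository.
pipeline_spec = {
    "pipeline": {"name": "convert-units"},
    "transform": {
        "image": "example/convert:1.0",  # user-supplied Docker image (hypothetical)
        "cmd": ["python3", "/convert.py", "/pfs/raw-sensor-data", "/pfs/out"],
    },
    "input": {
        # The glob pattern controls how a commit is split into datums,
        # i.e., at what granularity new data triggers the pipeline.
        "pfs": {"repo": "raw-sensor-data", "glob": "/*"},
    },
}

with open("convert-units.json", "w") as f:
    json.dump(pipeline_spec, f, indent=2)
# The resulting file would then be registered with Pachyderm's pachctl CLI.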
C. Reproducibility

As scientific workflows use increasing amounts of data, reproducibility of results and data provenance become crucial. Task-driven workflows often take the approach that the input data and the description of the workflow are sufficient to reproduce the results [1]. In the data-driven model, each data repository is versioned, ensuring complete data reproducibility and allowing users to execute workflows on each data version available. Note that this feature has a non-negligible cost in terms of storage. For example, Pachyderm uses a Git-inspired [25] data-versioning system, so users add data to repositories via a commit and these data are then processed by the tasks (see Figure 2). Pachyderm versions the pipeline specifications and all the data processed, allowing users to roll back and execute any pipeline on any data that have been versioned. Coupled with native containerized execution, this enables fully reproducible data pipelines.

D. Cloud-based orchestration

Many task-driven workflows pre-date the cloud, thus these virtual resources were often treated as an additional execution environment. On the other hand, data-driven workflows were born cloud-ready and utilize cloud-based container orchestration solutions such as Kubernetes [23], which allow efficient handling of large amounts of data and provide a reproducible and easy deployment procedure on different cloud providers. In Pachyderm's case, each pipeline is contained in a Docker image and all pipelines are managed through Kubernetes worker pods.
IV. REAL-WORLD USE CASES

To examine each workflow management paradigm and its associated workflow management systems, we chose three real-world workflow use-cases based on the four general workflow categories described in the introduction.

The 1000 Genome workflow itself can be considered an example of a "classic" scientific DAG-based workflow because of its consistent dependencies and static nature. The workflow satisfies all DAG properties—each computational task in the workflow depends on the completion of previous parent tasks, and no changes to the workflow structure occur. Additionally, the workflow is not dynamically triggered, i.e., it starts based on a user's command, and all workflow input data are known a priori. This workflow use-case has no particular requirements, only requiring data-flow dependencies and an available network connection to retrieve the dataset.

B. CASA: Streaming Data Workflow

Streaming data workflows have become increasingly popular in recent years. The University of Massachusetts' Collaborative Adaptive Sensing of the Atmosphere (CASA) project utilizes a sensor-based streaming data workflow for their Dallas/Fort Worth (DFW) weather radar testbed, which aggregates data from eight short-range weather radar sensors, providing higher data precision, accuracy, and timeliness versus other, longer-range radar systems [27].

C. NEON: Sensor-based Data-driven Workflow

The National Ecological Observatory Network (NEON) is an NSF open-science facility collecting ecological data from sensors across the US with the objective of studying ecological processes and changes. NEON's instrument data pipeline takes raw sensor data from terrestrial and aquatic sensors and processes it for publication. Raw sensor data ranges from resistance values and voltage at a low frequency of collection, to high-frequency sonic anemometer data. The workflow converts this data into the appropriate unit of measurement using calibration coefficients, and performs QA/QC steps on the data to ensure its quality (see Figure 5).
The workflow is a linear workflow that, as long as data is coming through the pipeline, gathers data from multiple sources (metadata, calibration data, and raw sensor data) and processes them to create significant output results.

An essential requirement of the NEON instrument data pipeline is the ability to reprocess data. This is necessary for several reasons, including improved algorithms, more recent calibration data, or late data. Being able to process data that arrives at a later date is of utmost importance to this workflow, and NEON notates missing data with a "null flag" in their data repositories as a placeholder.

Figure 5. NEON processing workflow. The rectangles represent the pipelines and the cylinders the data repositories.

V. HOLISTIC EVALUATION

In this section, we present the results of a holistic evaluation of WMS for the models discussed above. We do not attempt to make a singular recommendation, nor is this an analysis of the different features and limitations of each WMS. As expressed above, different use-cases have different feature requirements, and it is impossible to dictate a singular WMS that meets all possible requirements. Rather, this work aims to help users decide which workflow management model, task-driven or data-driven, fits their needs best, and to provide an example WMS for each model to further demonstrate how a particular model can complement a given workflow. In order to perform a thorough evaluation of each workflow management system, we used the following criteria.

Setup and deployment: We tested the installation process on a local cluster and using a cloud platform (AWS), using publicly-available documentation and user-facing support channels.

Workflow implementation: We examined the level of knowledge and effort required to model the relevant use-case and workflow.

Workflow execution: We researched features relating to workflow triggering, scalability, workflow resiliency, and how each WMS handles failures.

Data management: We studied how data is managed throughout the workflow execution, including whether data is transferred between computational tasks or workflows.

To test each WMS, two types of deployments were used in order to account for the majority of use-cases. First, initial WMS installation, testing, and workflow modeling was done on a local system. This deployment was used for initial testing, as many WMS intended for scientific workflows are still deployed and used on local computers or clusters, and this use-case has its own unique challenges. Second, Airflow and Pachyderm, both considered cloud-ready WMS, are discussed with special attention paid to cloud deployment, which presents separate challenges and is increasingly important as some scientific users have begun transitioning their workflows from local to cloud deployments.

A. Traditional Task-driven WMS: Pegasus and Makeflow

In the case of the 1000 Genome use-case, the workflow uses a Python workflow generation script to enumerate files in the 1000 Genome dataset and assign them as inputs of various tasks inside the workflow pipeline without knowing the exact file names, facilitating an easy workflow modeling process. Pegasus and Makeflow are excellent WMS for the 1000 Genome use-case, due to its traditional workflow structure. Pegasus' Python API gives the user the flexibility and ease of use of the Python language in creating a workflow pipeline. Whereas Pegasus targets flexible composition of DAG workflows via APIs with an emphasis on functionality, Makeflow has a more rigid, yet simple, workflow modeling structure based on GNU make. In this case, dependencies between workflow tasks are automatically inferred from the data flow specified in the workflow description file, which alleviates the user's burden of defining task dependencies.

Setup and deployment: Pegasus installation is relatively simple due to its availability on different official repositories. Since Pegasus relies on HTCondor [28] as a task scheduler and interface to other cluster managers, additional effort is required to properly configure and describe the resources. Pegasus also supports cloud deployments on commercial cloud providers and NSF-cloud infrastructures [27]. Similarly, due to its simple workflow structure model, Makeflow installation is relatively easy—though it is assumed a workload manager is already available (e.g., WorkQueue). Neither Pegasus nor Makeflow was natively designed for cloud support; rather, cloud resources generally are deployed manually (e.g., on AWS or Azure).

Workflow implementation: Pegasus provides a rich set of APIs (Python, Java, R, and Perl) for modeling workflows. These APIs provide a versatile mechanism for modeling large-scale workflows (on the order of 10^6 tasks), which is not often practical via graphical interfaces—though the entry barrier for non-expert users is higher. In Makeflow, workflows are defined like 'Makefiles'. This structure is fairly simple for defining workflows where the data flow drives the task dependencies. The drawback of this approach is the limited flexibility for defining complex workflow patterns or control flows compared to more complex APIs.
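For illustration, the sketch below models a two-task fragment in the spirit of the Pegasus DAX3 Python API (as shipped with Pegasus 4.x); the class and method names are an assumption tied to that API generation, and the job and file names are hypothetical rather than taken from the actual 1000 Genome generator script.

from Pegasus.DAX3 import ADAG, File, Job, Link

# Two chained jobs standing in for a fragment of the 1000 Genome pipeline.
dax = ADAG("thousand-genome")

vcf = File("ALL.chr1.vcf")
individuals_out = File("chr1.individuals")

individuals = Job(name="individuals")
individuals.addArguments("ALL.chr1.vcf", "1")
individuals.uses(vcf, link=Link.INPUT)
individuals.uses(individuals_out, link=Link.OUTPUT, transfer=False)
dax.addJob(individuals)

overlap = Job(name="mutation_overlap")
overlap.addArguments("chr1.individuals")
overlap.uses(individuals_out, link=Link.INPUT)
dax.addJob(overlap)

# Explicit control-flow dependency between the two jobs.
dax.depends(parent=individuals, child=overlap)

# Write the abstract workflow description that Pegasus later plans and executes.
with open("workflow.dax", "w") as f:
    dax.writeXML(f)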
Workflow execution: Both WMSs support different workflow execution environments, such as containers and batch schedulers, which are key for enabling large-scale executions. Pegasus is built for reliability and integrity, featuring several different types of workflow recovery methods, provenance data, and checkpointing abilities. Makeflow is simpler: it automatically retries failed tasks, but does not provide support for checkpointing or sophisticated error recovery methods. Pegasus workflows also feature several "catalogs" listing the locations of key data resources, executable 'transformation' resources, and compute resources, allowing for easy workflow portability, with which only the resource configuration needs to be changed.
Data management: Pegasus provides advanced mechanisms to efficiently manage data movement during workflow execution. During the workflow planning phase, Pegasus identifies data locations and augments the workflows with data transfer jobs for staging input and output data from/to storage resources. A wide range of protocols are supported, including access to cloud object storage, via Globus services, etc. Data throttling allows for increased throughput performance. In Makeflow, data is assumed to be directly accessible from the computing node (e.g., a shared filesystem) or fetched from a remote source—support is limited to common Internet protocols. However, depending on the execution resources used (e.g., Docker containers), Makeflow does support data staging in/out of individual computational tasks.

B. Recent Task-driven WMS

One example of a more recent (i.e., more flexible, supporting containerized execution, and cloud-oriented) task-driven WMS is Airflow. Airflow's operators allow users to develop their own notions of what a workflow "task" means. While many traditional WMS limit computational tasks to executables, in Airflow the notion of a task ranges from simple tasks, such as sending an email, to more complex tasks, such as running executables inside a container or managing a cloud deployment. In addition to many built-in operators ready to use, the triggering system used by Airflow is also extremely beneficial to complex workflows. More robust than a cron-based implementation on top of a traditional WMS, Airflow's scheduler allows many tasks to be run at custom times or intervals, and monitors them as such. This concept is extremely beneficial for the CASA workflow, which is started every 75 seconds in order to process new streaming data.
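A minimal sketch of such a sub-minute schedule, assuming the Airflow 1.x API and a hypothetical processing script, could look as follows.

from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# A timedelta-based schedule interval starts a new DAG run every 75 seconds,
# matching the arrival rate of new radar data; catchup=False avoids
# backfilling runs for past intervals.
dag = DAG("casa_nowcast",
          default_args={"start_date": datetime(2019, 1, 1), "retries": 1},
          schedule_interval=timedelta(seconds=75),
          catchup=False)

# Hypothetical processing step operating on the most recently received
# radar_*.netcdf.gz files.
process_radar = BashOperator(task_id="process_radar",
                             bash_command="./run_nowcast.sh /data/incoming",
                             dag=dag)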
Setup and deployment: Airflow is written in Python, which is the only required software prerequisite. Installing Airflow on a local cluster is relatively easy via the pip package manager. However, additional software such as the high-performance execution queue Celery might be needed to adapt Airflow to the user's requirements. From a cloud perspective, Airflow primarily supports Google Cloud (GC), with extended support for Azure and AWS. Deployment on cloud platforms is simple, thanks to provided scripts (with a slight edge toward GC), but complex features such as cloud autoscaling still require configuration or external tools.

Workflow implementation: Through the extensive library of built-in operators, workflow implementation in Airflow is extremely flexible, and the Python API structure used to write workflows is easy to learn and understand.

Workflow execution: Airflow's scheduling features allow users a robust way to trigger their workflows. Airflow also has a built-in task retries parameter, like many other WMS, and external software such as Apache Mesos enables checkpointing features.

Data management: Airflow features a basic way to pass data from task to task inside a workflow, but this is not as robust as other WMSs' data management features. As such, tracking provenance is also more difficult.

C. Data-driven WMS: Pachyderm

As described in the previous sections, Pachyderm is a prime example of a recently-developed WMS using the data-driven management paradigm. Written to allow portable data pipelines with reproducibility and data provenance, Pachyderm is not explicitly designed for scientific workflows, but can be co-opted for scientific tasks. As detailed in Section III, Pachyderm's workflow modeling process is interesting in that each task is defined individually as a data pipeline, and several pipelines can be combined together to form a complete workflow.

One important feature of Pachyderm with regard to the NEON workflow pipeline is Pachyderm's robust data provenance tracking. The NEON project aims at publishing versioned data sets on an online portal to support open-science research. Although NEON can produce versions of these data sets without a solution like Pachyderm, Pachyderm's provenance tracking features allow NEON to inform the public what has changed between versions of the data sets, and why. NEON also needs to re-run their pipelines with field calibration data. For sensors that are calibrated regularly in the field this is inconvenient, as NEON has to reprocess when those calibrations are stored. For instruments that self-calibrate, and have no defined calibration period, this is borderline impossible without an on-demand reprocessing capability. Pachyderm allows NEON to do this by triggering only the pieces of the pipeline necessary to run when a calibration change comes in via the commit system.

Setup and deployment: Similarly to other WMS, Pachyderm requires several dependencies, including Kubernetes, which itself requires expert knowledge (just as with HTCondor). Pachyderm is intended for cloud deployments, and several utilities exist to automate installation on cloud platforms like AWS. However, Pachyderm has limited local cluster deployment, requiring a Kubernetes cluster and S3-compatible storage. Compared to Airflow, which allows more fine-grained resource management, Pachyderm is more restrictive—once its Kubernetes cluster is deployed on a given cloud, all pipelines execute on that cloud.

Workflow implementation: Because Pachyderm triggers pipelines whenever data is committed to a given input repository, file and folder structures inside of these repositories might need to be organized for optimal data flow and pipeline triggering. Another resulting requirement of this commit-based triggering system is that each pipeline must output files or objects to an output repository. These triggering features can be very beneficial to scientific workflows, as highlighted with the NEON use-case.
Workflow execution: Pachyderm has the ability to scale over a Kubernetes cluster, which is extremely simple. Pachyderm automatically retries each data pipeline based upon task exit code. However, for certain workflows, error management might not be as easy as with other WMS, since the actual workflows are buried in containers inside Kubernetes pods.

Data management: Data passing is a strong point of Pachyderm, with data flow and provenance being among the headline features of the WMS. Provenance can also be tracked back through a workflow, with PFS repositories tracking files and their respective commits.

VI. CONCLUSION

This work's main objective is to describe interesting trends and concepts in next-generation scientific and commercial workflow management systems, from a user's perspective. To this end, we analyzed how two different workflow management paradigms, namely the task-driven and data-driven paradigms, can be applied to real-world use-cases.

From the four generic use-cases detailed in the introduction, we carefully described three real-world use-cases. With a traditional scientific workflow with the 1000 Genome Project workflow, and two different sensor-based workflows with the CASA and NEON soil workflows, we presented a cross-section of traditional and next-generation workflows to evaluate trends and developments in the workflow space.

For the task-driven and data-driven paradigms, we selected representative workflow management systems. Pegasus, Makeflow, and Apache Airflow represent the task-driven model, while Pachyderm represents the data-driven model. Each model is thoroughly and holistically evaluated, along with its associated workflow management systems. While performing this evaluation, we examined the rise of new technologies and innovations, including containers and the cloud, and new workflow management use-cases, such as big data analytics, large-scale science, and machine learning. Using several real-world use-cases, we highlighted how each WMS's unique features can be an asset to certain next-generation workflows, and emphasized how these features set each WMS apart from one another.

Future work consists of exploring more real-world use-cases, such as IoT workflows or large-scale data analytics, as well as more WMS solutions and approaches, e.g., Apache Kafka for real-time data streaming. We would also like to evaluate how techniques developed by these next-generation WMS could benefit traditional scientific workflows.

Acknowledgments. This work is funded by NSF contract #1842042: "Pilot Study for a Cyberinfrastructure Center of Excellence". The National Ecological Observatory Network is a program sponsored by the National Science Foundation and operated under cooperative agreement by Battelle Memorial Institute.

REFERENCES

[1] E. Deelman, T. Peterka et al., "The future of scientific workflows," The International Journal of High Performance Computing Applications, 2018.
[2] C. S. Liew, M. P. Atkinson, M. Galea, T. F. Ang, P. Martin, and J. I. V. Hemert, "Scientific workflows: Moving across paradigms," ACM Computing Surveys (CSUR), vol. 49, no. 4, p. 66, 2017.
[3] A. Barker and J. Van Hemert, "Scientific workflow: A survey and research directions," in International Conference on Parallel Processing and Applied Mathematics. Springer, 2007, pp. 746–753.
[4] J. Liu, E. Pacitti, P. Valduriez, and M. Mattoso, "A survey of data-intensive scientific workflow management," Journal of Grid Computing, vol. 13, no. 4, pp. 457–493, 2015.
[5] A. Rouhani, E. Bernhardsson, and E. Freider. Extreme Science and Engineering Discovery Environment (XSEDE).
[6] M. Nardelli, S. Nastic, S. Dustdar, M. Villari, and R. Ranjan, "Osmotic flow: Osmotic computing + IoT workflow," IEEE Cloud Computing, vol. 4, no. 2, pp. 68–75, 2017.
[7] R. Sumbaly, J. Kreps, and S. Shah, "The big data ecosystem at LinkedIn," in Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data. ACM, 2013, pp. 1125–1134.
[8] M. Beauchemin. (2014) Apache Airflow Project.
[9] M. Kotliar et al., "CWL-Airflow: A lightweight pipeline manager supporting Common Workflow Language," bioRxiv, 2018.
[10] E. Deelman, K. Vahi et al., "Pegasus: A workflow management system for science automation," Future Generation Computer Systems, 2015.
[11] R. L. Graham et al., "Optimization and approximation in deterministic sequencing and scheduling: A survey," in Annals of Discrete Mathematics. Elsevier, 1979.
[12] B. Ludäscher, I. Altintas, C. Berkley et al., "Scientific workflow management and the Kepler system," Concurrency and Computation: Practice and Experience, vol. 18, no. 10, pp. 1039–1065, 2006.
[13] M. Albrecht et al., "Makeflow: A portable abstraction for data intensive computing on clusters, clouds, and grids," in Proceedings of the 1st ACM SIGMOD Workshop on Scalable Workflow Execution Engines and Technologies. ACM, 2012, p. 1.
[14] T. Oinn, M. Addis, J. Ferris et al., "Taverna: A tool for the composition and enactment of bioinformatics workflows," Bioinformatics, vol. 20, no. 17, pp. 3045–3054, 2004.
[15] Y. N. Babuji et al., "Parsl: Scalable parallel scripting in Python," in IWSG, 2018.
[16] J. Dean and S. Ghemawat, "MapReduce: A flexible data processing tool," Communications of the ACM, vol. 53, no. 1, pp. 72–77, 2010.
[17] V. K. Vavilapalli, A. C. Murthy, C. Douglas et al., "Apache Hadoop YARN: Yet another resource negotiator," in Proceedings of the 4th Annual Symposium on Cloud Computing. ACM, 2013, p. 5.
[18] H. Pathak, M. Rathi, and A. Parekh, "Introduction to real-time processing in Apache Apex," Int. J. Res. Advent Technol., p. 19, 2016.
[19] P. Di Tommaso, M. Chatzou, E. W. Floden, P. P. Barja, E. Palumbo, and C. Notredame, "Nextflow enables reproducible computational workflows," Nature Biotechnology, vol. 35, no. 4, p. 316, 2017.
[20] Pachyderm, Inc. (2017) Pachyderm. [Online]. Available: https://www.pachyderm.io/
[21] G. Bosilca et al., "DAGuE: A generic distributed DAG engine for high performance computing," Parallel Computing, 2012.
[22] R. Sethi, "Scheduling graphs on two processors," SIAM Journal on Computing, pp. 73–82, 1976.
[23] D. Bernstein, "Containers and cloud: From LXC to Docker to Kubernetes," IEEE Cloud Computing, vol. 1, no. 3, pp. 81–84, 2014.
[24] J. A. Novella et al., "Container-based bioinformatics with Pachyderm," Bioinformatics, vol. 35, no. 5, pp. 839–846, 2018.
[25] L. Torvalds and J. Hamano. (2005) Git: Fast version control system.
[26] The 1000 Genomes Project Consortium et al., "A global reference for human genetic variation," Nature, 2015.
[27] E. Lyons et al., "Toward a dynamic network-centric distributed cloud platform for scientific workflows: A case study for adaptive weather sensing," in 15th eScience Conference, 2019.
[28] D. Thain, T. Tannenbaum, and M. Livny, "Distributed computing in practice: The Condor experience," Concurrency and Computation: Practice and Experience, vol. 17, no. 2–4, pp. 323–356, 2005.