Azure Data Factory usage at Aucfanlab
1. Abstract
This report describes how the Aucfanlab team used Azure’s Data Factory service to
implement the orchestration and monitoring of all data pipelines for our “Aucfan
Datalake” project. Azure Data Factory is a service for creating scheduled data
pipelines. We used these pipelines to internally execute the complex data
workflows that the Datalake’s users can create.
2. Introduction
One of the Aucfanlab team’s projects is the big data management platform
“Aucfan Datalake”, which was introduced in an earlier report. The Aucfan Datalake
allows users to easily create and monitor their data workloads while hiding all
technical implementation details.
As a central part of this system, scheduling and management of the users’
workflows had to be implemented. An example of such a workflow is importing data
from cloud-based file stores and exporting it to a SQL database.
After reviewing past approaches that used cron jobs, and after prototyping a Java
task scheduler to manage the data workflows ourselves, the final choice fell on Data
Factory, a fully managed data pipeline orchestration service on Azure that allowed us
to outsource the most cumbersome parts of pipeline management.
3. Azure Data Factory
Azure Data Factory is a service that lets the user define data workflows in a
JSON-based description format. Besides handling the scheduling of user-defined jobs
(e.g. Hadoop, Hive), it also executes data movement between cloud resources
directly. This allows for a cost-effective and hassle-free implementation of ETL
processes. The following gives a short overview of the main concepts in Azure Data
Factory to make the explanations in later chapters easier to follow.
3.1 Dataset
A dataset defines the schema, location, and other relevant attributes of data. It
has a type, which refers to one of the supported resources (for example Azure Blob
storage, Azure SQL Database, or other cloud and on-premises resources).
Furthermore, a dataset has so-called slices, which are temporal partition units. For
example, the input and output datasets of a daily pipeline have one slice per day.
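For illustration, a daily-sliced blob dataset definition might look like the following sketch. All names and paths are hypothetical, and since JSON allows no comments, the caveats live here: the availability section is what produces the daily slices, and external marks data produced outside Data Factory.

```json
{
  "name": "UserBlobInput",
  "properties": {
    "type": "AzureBlob",
    "linkedServiceName": "UserStorageLinkedService",
    "typeProperties": {
      "folderPath": "imports/{Slice}/",
      "partitionedBy": [
        {
          "name": "Slice",
          "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyyMMdd" }
        }
      ],
      "format": { "type": "TextFormat", "columnDelimiter": "," }
    },
    "external": true,
    "availability": { "frequency": "Day", "interval": 1 }
  }
}
```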
3.2 Pipeline
A pipeline contains one or more activities that take datasets as input and output.
An activity can copy data from one resource to another, execute a Hadoop or Hive
job, or perform similar data workloads. While Hadoop jobs require a provisioned
cluster, all copy activities are executed directly by Data Factory and therefore
require no custom implementation or management of compute resources.
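A sketch of a pipeline with a single copy activity, reusing the hypothetical dataset names from above, could look like this (the active period and retry policy are placeholder values):

```json
{
  "name": "CopyUserDataPipeline",
  "properties": {
    "activities": [
      {
        "name": "CopyBlobToRaw",
        "type": "Copy",
        "inputs": [ { "name": "UserBlobInput" } ],
        "outputs": [ { "name": "RawDataset" } ],
        "typeProperties": {
          "source": { "type": "BlobSource" },
          "sink": { "type": "BlobSink" }
        },
        "policy": { "retry": 3, "timeout": "01:00:00" },
        "scheduler": { "frequency": "Day", "interval": 1 }
      }
    ],
    "start": "2016-04-01T00:00:00Z",
    "end": "2017-04-01T00:00:00Z"
  }
}
```

The scheduler frequency has to match the availability of the output dataset, which is how the per-slice execution described above comes about.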
3.3 Linked Service
A Linked Service is a resource managed by the user. This can be a data store such
as a Blob storage account or an Azure SQL database. Additionally, compute
resources can be defined as Linked Services; an example is an HDInsight Hadoop
cluster that executes MapReduce jobs.
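A data-store Linked Service is little more than a named connection. A hypothetical storage account definition could look as follows (the connection string is a placeholder):

```json
{
  "name": "UserStorageLinkedService",
  "properties": {
    "type": "AzureStorage",
    "typeProperties": {
      "connectionString": "DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>"
    }
  }
}
```

Compute Linked Services such as an HDInsight cluster are defined the same way, just with a different type and cluster-specific properties.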
With the elements described above, arbitrarily complex pipelines can be created and
managed. Each pipeline is executed in slices that are tracked individually, so status
and errors can be monitored per slice. Additionally, slices depend on each other,
which means that each step of a pipeline waits for the previous step to finish. If a
step fails, its dependent slices start automatically once the error is fixed and the
previous slice has completed successfully.
4. Implementation using Data Factory
For a better understanding of the problems that needed to be solved, an introduction
to a typical user workflow will be given.
The user first calls the Datalake’s API to register a data source, then creates an
importer that schedules an import job for the dataset. The user then creates a parser
to specify the parsing information for the dataset. After the parsing step is completed,
the user can export the data to a variety of data marts.
The actual implementation of this workflow in Azure Data Factory works as
follows. A DF dataset (DF refers to Azure Data Factory here) points to the blob
storage account registered by the user. This DF dataset becomes the input of a copy
pipeline that is executed by Data Factory and copies the data into the Datalake’s
raw data storage. This data is referred to by the raw DF dataset.
The next step uses this raw dataset as the input to a Hive activity pipeline.
This pipeline executes a Hive job that parses the data according to its source format
and stores it as an external Hive table in gzipped Avro format. The parsed data is
stored in Azure Blob storage and referenced by the parsed DF dataset.
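As a rough sketch of what such a parsing step looks like as a Data Factory activity, the Hive activity below runs a script on an HDInsight cluster; the script path, linked service, and dataset names are hypothetical, and the $$Text.Format expressions pass the slice date into the script:

```json
{
  "name": "ParseRawData",
  "type": "HDInsightHive",
  "linkedServiceName": "HDInsightCluster",
  "inputs": [ { "name": "RawDataset" } ],
  "outputs": [ { "name": "ParsedDataset" } ],
  "typeProperties": {
    "scriptPath": "scripts/parse-source-format.hql",
    "scriptLinkedService": "UserStorageLinkedService",
    "defines": {
      "inputPath": "$$Text.Format('raw/{0:yyyyMMdd}', SliceStart)",
      "outputPath": "$$Text.Format('parsed/{0:yyyyMMdd}', SliceStart)"
    }
  },
  "scheduler": { "frequency": "Day", "interval": 1 }
}
```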
Because Data Factory supports Avro natively, the parsed dataset can be used
directly in Azure Data Factory copy activities. Additionally, it can be used in custom
jobs, such as a Hadoop DistCp job that copies the data to an external target system.
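One possible way to run such a job from Data Factory is a MapReduce activity pointed at Hadoop’s DistCp tool class. The sketch below is purely illustrative: the jar location, cluster name, dataset names, and target URI are placeholders, and in practice the credentials for the external target would also have to be supplied, for example as Hadoop configuration arguments.

```json
{
  "name": "ExportParsedData",
  "type": "HDInsightMapReduce",
  "linkedServiceName": "HDInsightCluster",
  "inputs": [ { "name": "ParsedDataset" } ],
  "outputs": [ { "name": "ExternalTargetDataset" } ],
  "typeProperties": {
    "className": "org.apache.hadoop.tools.DistCp",
    "jarFilePath": "scripts/hadoop-distcp.jar",
    "jarLinkedService": "UserStorageLinkedService",
    "arguments": [
      "wasb://parsed@<account>.blob.core.windows.net/data/",
      "s3n://<bucket>/data/"
    ]
  },
  "scheduler": { "frequency": "Day", "interval": 1 }
}
```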
5. Encountered problems and limitations
While Data Factory proved useful in many regards, we also encountered a few
limitations while working with it.
With system development at Aucfanlab being done in Java, the absence of a
Java SDK for Data Factory meant substantial extra development effort. To
automate resource creation in Data Factory, we created a REST client that models
the Data Factory JSON definitions and handles all API calls.
Another Java-related limitation we encountered is that custom code executed in a
Data Factory activity can only be written in .NET. Fortunately, Hadoop and Spark jobs
can be written in Java, so custom code can be executed by those means.
With some features being rather cutting-edge, we sometimes had to resort to support
tickets to get explanations of undocumented features or to notify Microsoft about
bugs in newly released features.
Another problem we encountered, and a general downside of using a cloud
provider, is that while Data Factory makes it very easy to work with Azure
resources, it makes integrating external services harder. For most connections to
outside services and resources, custom solutions have to be created. In our case this
meant, for example, that for Amazon S3 import and export we had to use Hadoop
DistCp instead of the Azure Data Factory copy activity.
6. Closing remarks
Despite the above-mentioned limitations, using Azure’s Data Factory service took
away a lot of headache overall. In particular, the ability to have persistent pipelines
with automatic handling of daily schedules saved development time and freed up
resources for other tasks.
We look forward to seeing how the service develops in the future and hope for
many new features that make working with different data sources even easier.