Azure Data Factory usage at Aucfanlab
1. Abstract
This report describes how the Aucfanlab team used Azure’s Data Factory service to
implement the orchestration and monitoring of all data pipelines for our “Aucfan
Datalake” project. Azure Data Factory is a service for creating scheduled data
pipelines. We used these pipelines to internally execute the complex data
workflows that the Datalake’s users can create.
2. Introduction
One of the Aucfanlab team’s projects is the big data management platform
“Aucfan Datalake”, which was introduced in an earlier report. The Aucfan Datalake
allows users to easily create and monitor their data workloads while hiding all
technical implementation details.
As a central part of this system, scheduling and management of the users’
workflows had to be implemented. An example of such a workflow is importing data
from cloud-based file stores and exporting it to a SQL database.
After reviewing past approaches that used cron jobs, and after prototyping a Java
task scheduler to manage the data workflows ourselves, the final choice fell on Data
Factory, a fully managed data pipeline orchestration service on Azure that allowed us
to outsource the most cumbersome parts of pipeline management.
3. Azure Data Factory
Azure Data Factory is a service that lets the user define data workflows in a
JSON-based description format. Besides handling the scheduling of user-defined jobs
(e.g. Hadoop, Hive), it also executes data movement between cloud resources
directly. This allows for a cost-effective and hassle-free implementation of ETL
processes. The following gives a short overview of the main concepts in Azure Data
Factory to make the explanations in later chapters easier to follow.
3.1 Dataset
A dataset defines the schema, location, and other relevant attributes of data. It
has a type, which refers to one of the supported resources (for example Azure Blob
storage, Azure SQL Database, or other cloud and on-premises resources).
Furthermore, a dataset has so-called slices, which are temporal partition units. For
example, the input and output datasets of a daily pipeline have one slice per day.
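For illustration, a daily-sliced blob dataset definition might look like the following sketch. All names and paths are hypothetical, and since JSON allows no comments, the caveats live here: the availability section is what produces the daily slices, and external marks data produced outside Data Factory.

```json
{
  "name": "UserBlobInput",
  "properties": {
    "type": "AzureBlob",
    "linkedServiceName": "UserStorageLinkedService",
    "typeProperties": {
      "folderPath": "imports/{Slice}/",
      "partitionedBy": [
        {
          "name": "Slice",
          "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyyMMdd" }
        }
      ],
      "format": { "type": "TextFormat", "columnDelimiter": "," }
    },
    "external": true,
    "availability": { "frequency": "Day", "interval": 1 }
  }
}
```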
3.2 Pipeline
A pipeline contains one or more activities that take datasets as input and output.
An activity can copy data from one resource to another, execute a Hadoop or Hive
job, or perform similar data workloads. While Hadoop jobs require a provisioned
cluster, all copy activities are executed directly by Data Factory and therefore
require no custom implementation or management of compute resources.
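A sketch of a pipeline with a single copy activity, reusing the hypothetical dataset names from above, could look like this (the active period and retry policy are placeholder values):

```json
{
  "name": "CopyUserDataPipeline",
  "properties": {
    "activities": [
      {
        "name": "CopyBlobToRaw",
        "type": "Copy",
        "inputs": [ { "name": "UserBlobInput" } ],
        "outputs": [ { "name": "RawDataset" } ],
        "typeProperties": {
          "source": { "type": "BlobSource" },
          "sink": { "type": "BlobSink" }
        },
        "policy": { "retry": 3, "timeout": "01:00:00" },
        "scheduler": { "frequency": "Day", "interval": 1 }
      }
    ],
    "start": "2016-04-01T00:00:00Z",
    "end": "2017-04-01T00:00:00Z"
  }
}
```

The scheduler frequency has to match the availability of the output dataset, which is how the per-slice execution described above comes about.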
3.3 Linked Service
A Linked Service is a resource managed by the user. This can be a data store such
as a Blob storage account or an Azure SQL database. Additionally, compute
resources can be defined as Linked Services; an example is an HDInsight Hadoop
cluster that executes MapReduce jobs.
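A data-store Linked Service is little more than a named connection. A hypothetical storage account definition could look as follows (the connection string is a placeholder):

```json
{
  "name": "UserStorageLinkedService",
  "properties": {
    "type": "AzureStorage",
    "typeProperties": {
      "connectionString": "DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>"
    }
  }
}
```

Compute Linked Services such as an HDInsight cluster are defined the same way, just with a different type and cluster-specific properties.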
With the elements described above, arbitrarily complex pipelines can be created and
managed. Each pipeline is executed in slices that are tracked individually, so status
and errors can be monitored per slice. Additionally, slices depend on each other,
which means that each step of a pipeline waits for the previous step to finish. If a
step fails, its dependent slices start automatically once the error is fixed and the
previous slice has completed successfully.
4. Implementation using Data Factory
For a better understanding of the problems that needed to be solved, an introduction
to a typical user workflow will be given.
The user first calls the Datalake’s API to register a data source, then creates an
importer that schedules an import job for the dataset. The user then creates a parser
to specify the parsing information for the dataset. After the parsing step is completed,
the user can export the data to a variety of data marts.
The actual implementation of this workflow in Azure Data Factory works as
follows. A DF dataset (DF refers to Azure Data Factory here) points to the blob
storage account registered by the user. This DF dataset becomes the input of a copy
pipeline that is executed by Data Factory and copies the data into the Datalake’s
raw data storage. This data is referred to by the raw DF dataset.
The next step uses this raw dataset as the input to a Hive activity pipeline.
This pipeline executes a Hive job that parses the data according to its source format
and stores it as an external Hive table in gzipped Avro format. The parsed data is
stored in Azure Blob storage and referenced by the parsed DF dataset.
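As a rough sketch of what such a parsing step looks like as a Data Factory activity, the Hive activity below runs a script on an HDInsight cluster; the script path, linked service, and dataset names are hypothetical, and the $$Text.Format expressions pass the slice date into the script:

```json
{
  "name": "ParseRawData",
  "type": "HDInsightHive",
  "linkedServiceName": "HDInsightCluster",
  "inputs": [ { "name": "RawDataset" } ],
  "outputs": [ { "name": "ParsedDataset" } ],
  "typeProperties": {
    "scriptPath": "scripts/parse-source-format.hql",
    "scriptLinkedService": "UserStorageLinkedService",
    "defines": {
      "inputPath": "$$Text.Format('raw/{0:yyyyMMdd}', SliceStart)",
      "outputPath": "$$Text.Format('parsed/{0:yyyyMMdd}', SliceStart)"
    }
  },
  "scheduler": { "frequency": "Day", "interval": 1 }
}
```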
Because Data Factory supports Avro natively, the parsed dataset can be used
directly in Azure Data Factory copy activities. Additionally, it can be used in custom
jobs, such as a Hadoop DistCp job that copies the data to an external target system.
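One possible way to run such a job from Data Factory is a MapReduce activity pointed at Hadoop’s DistCp tool class. The sketch below is purely illustrative: the jar location, cluster name, dataset names, and target URI are placeholders, and in practice the credentials for the external target would also have to be supplied, for example as Hadoop configuration arguments.

```json
{
  "name": "ExportParsedData",
  "type": "HDInsightMapReduce",
  "linkedServiceName": "HDInsightCluster",
  "inputs": [ { "name": "ParsedDataset" } ],
  "outputs": [ { "name": "ExternalTargetDataset" } ],
  "typeProperties": {
    "className": "org.apache.hadoop.tools.DistCp",
    "jarFilePath": "scripts/hadoop-distcp.jar",
    "jarLinkedService": "UserStorageLinkedService",
    "arguments": [
      "wasb://parsed@<account>.blob.core.windows.net/data/",
      "s3n://<bucket>/data/"
    ]
  },
  "scheduler": { "frequency": "Day", "interval": 1 }
}
```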
5. Encountered problems and limitations
While Data Factory proved useful in many regards, we also encountered a few
limitations while working with it.
With system development at Aucfanlab being done in Java, the absence of a
Java SDK for Data Factory meant substantial extra development effort. To
automate resource creation in Data Factory, we created a REST client that models
the Data Factory JSON definitions and handles all API calls.
Another Java-related limitation we encountered is that custom code executed in a
Data Factory activity can only be written in .NET. Fortunately, Hadoop and Spark jobs
can be written in Java, so custom code can be executed by those means.
With some features being rather cutting-edge, we sometimes had to resort to support
tickets to get explanations of undocumented features or to notify Microsoft about
bugs in newly released features.
Another problem we encountered, and a general downside of using a cloud
provider, is that while Data Factory makes it very easy to work with Azure
resources, it makes integrating external services harder. For most connections to
outside services and resources, custom solutions have to be created. In our case this
meant, for example, that for Amazon S3 import and export we had to use Hadoop
DistCp instead of the Azure Data Factory copy activity.
6. Closing remarks
Despite the above-mentioned limitations, using Azure’s Data Factory service took
away a lot of headache overall. In particular, the ability to have persistent pipelines
with automatic handling of daily schedules saved development time and freed up
resources for other tasks.
We look forward to seeing how the service develops in the future and hope for
many new features that make working with different data sources even easier.