
Azure Data Factory

- Mahesh Pandit

© 2018 YASH Technologies | www.yash.com | Confidential


Agenda

 Why Azure Data Factory
 Introduction
 Steps involved in ADF
 ADF Components
 ADF Activities
 Linked Services
 Integration Runtime and its types
 How Azure Data Factory works
 Azure Data Factory V1 vs V2
 System Variables
 Functions in ADF
 Expressions in ADF
 Question-Answers
Why Azure Data Factory

 Modern DW for BI
 Modern DW for SaaS Apps
 Lift & Shift existing SSIS packages to Cloud



Why Azure Data Factory

[Diagram: Azure Data Factory moving data between Azure Data Lake and Azure SQL DW]
Modern DW for Business Intelligence

[Diagram: Ingest > Store > Prep & Train > Model & Serve > Intelligence. Logs, files & media (unstructured), on-premises/cloud apps & data, and business/custom apps (structured) are ingested by Data Factory into Azure Storage, prepared and trained with Azure Databricks (Spark), modeled and served through Azure SQL Data Warehouse and Azure Analysis Services, and consumed in analytical dashboards (Power BI).]

Azure Data Factory orchestrates data pipeline activity workflow & scheduling.
Modern DW for SaaS Apps

[Diagram: Ingest > Store > Prep & Train > Model & Serve > Intelligence. Logs, files & media (unstructured), on-premises/cloud apps & data, and business/custom apps (structured) are ingested by Data Factory into Azure Storage, prepared and trained with Azure Databricks (Spark), then served through app storage to a SaaS app consumed in the browser or on devices.]

Azure Data Factory orchestrates data pipeline activity workflow & scheduling.
Lift & Shift existing SSIS packages to Cloud

[Diagram: Data Factory runs SSIS packages in a SQL DB Managed Instance inside an Azure VNET, reading from cloud data sources and from on-premises data sources such as SQL Server.]

Azure Data Factory orchestrates data pipeline activity workflow & scheduling.
Introduction

Azure Data Factory: a cloud-based integration service.

 It is a cloud-based integration service that allows you to create data-driven workflows in the cloud for
orchestrating and automating data movement and data transformation.
 Scheduled data-driven workflows.
 Sources and destinations can be either on-premises or cloud.
 Transformation can be done using Azure HDInsight Hadoop, Spark, Azure Data Lake Analytics and Machine Learning.
How does it work?

 The pipelines (data-driven workflows) in Azure Data Factory typically perform the following four steps:
Steps involved in ADF

 Connect and collect
 Transform and enrich
 Publish
 Monitor



Connect and collect

 The first step in building an information production system is to connect to all the required sources of data
and processing, such as software-as-a-service (SaaS) services, databases, file shares, and FTP web services.
 With Data Factory, you can use the Copy Activity in a data pipeline to move data from both on-premises and
cloud source data stores to a centralized data store in the cloud for further analysis.
 For example, you can collect data in Azure Data Lake as well as in Azure Blob storage.
Transform and enrich

 After data is present in a centralized data store in the cloud, process or transform the collected data by using
compute services such as:
  HDInsight Hadoop
  Spark
  Data Lake Analytics
  Machine Learning
Publish

 After the raw data has been refined into a business-ready consumable form, load the data into Azure SQL Data
Warehouse, Azure SQL Database, Azure Cosmos DB, or other targets as per the user's need.
Monitor

 After you have successfully built and deployed your data integration pipeline, providing business value from refined
data, monitor the scheduled activities and pipelines for success and failure rates.
 Azure Data Factory has built-in support for pipeline monitoring via Azure Monitor, API, PowerShell, Log
Analytics, and health panels on the Azure portal.
ADF Components

 Pipeline
 Activity
 Datasets
 Linked Services



ADF Components

[Diagram: a DATASET (table, file) represents a data item stored in a LINKED SERVICE (SQL Server, Hadoop cluster); an ACTIVITY (hive, copy) consumes and produces datasets and runs on a linked service; a PIPELINE (schedule, monitor) is a logical group of activities.]

 An Azure subscription might have one or more Azure Data Factory instances (or data factories).
 Azure Data Factory is composed of four key components.
 These components work together to provide the platform on which you can compose data-driven workflows with
steps to move and transform data.
Pipeline

 A data factory might have one or more pipelines.
 A pipeline is a logical grouping of activities that performs a unit of work.
 Together, the activities in a pipeline perform a task.
 The pipeline allows you to manage the activities as a set instead of managing each one individually.
 The activities in a pipeline can be chained together to operate sequentially, or they can operate independently in parallel.
 To create a Data Factory pipeline, we can use any one of the following methods (a minimal pipeline sketch follows the list):

Data Factory UI | Copy Data Tool | Azure PowerShell | REST API | Resource Manager Template | .NET | Python
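A minimal sketch, assuming hypothetical names and abbreviated typeProperties, of how a pipeline JSON groups activities and chains them with dependsOn:

{
  "name": "SamplePipeline",
  "properties": {
    "description": "Illustrative sketch: two chained activities",
    "parameters": { "inputPath": { "type": "String" } },
    "activities": [
      {
        "name": "StageData",
        "type": "Copy",
        "typeProperties": { }
      },
      {
        "name": "TransformData",
        "type": "HDInsightHive",
        "dependsOn": [ { "activity": "StageData", "dependencyConditions": [ "Succeeded" ] } ],
        "typeProperties": { }
      }
    ]
  }
}

Because TransformData depends on StageData succeeding, the two activities run sequentially; activities without a dependsOn entry can run in parallel.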


Pipeline Execution

Triggers
 Triggers represent the unit of processing that determines when a pipeline execution needs to be kicked off.
 There are different types of triggers for different types of events.

Pipeline Runs
 A pipeline run is an instance of the pipeline execution.
 Pipeline runs are typically instantiated by passing the arguments to the parameters that are defined in pipelines.
 The arguments can be passed manually or within the trigger definition.

Parameters
 Parameters are key-value pairs of read-only configuration.
 Parameters are defined in the pipeline.
 The arguments for the defined parameters are passed during execution from the run context that was created by a
trigger or a pipeline that was executed manually.
 Activities within the pipeline consume the parameter values.

Control Flow
 Control flow is an orchestration of pipeline activities that includes chaining activities in a sequence, branching,
defining parameters at the pipeline level, and passing arguments while invoking the pipeline on-demand or from a trigger.
 It also includes custom-state passing and looping containers, that is, For-each iterators.
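As a hedged illustration of a trigger supplying arguments to pipeline parameters (trigger, pipeline, and parameter names are hypothetical), a schedule trigger definition looks roughly like this:

{
  "name": "DailyLoadTrigger",
  "properties": {
    "description": "Illustrative sketch",
    "type": "ScheduleTrigger",
    "typeProperties": {
      "recurrence": {
        "frequency": "Day",
        "interval": 1,
        "startTime": "2018-01-01T00:00:00Z",
        "timeZone": "UTC"
      }
    },
    "pipelines": [
      {
        "pipelineReference": { "type": "PipelineReference", "referenceName": "SamplePipeline" },
        "parameters": { "inputPath": "raw/daily" }
      }
    ]
  }
}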
ADF Activities

 Data Movement Activities
 Data Transformation Activities
 Control Activities



Activity

 Activities represent a processing step in a pipeline.
 For example, you might use a copy activity to copy data from one data store to another data store.
 Data Factory supports three types of activities:

1. Data movement activities
2. Data transformation activities
3. Control activities

[Diagram: a Copy activity lands data in Azure Blob, a transformation activity produces output data, and another Copy activity loads it into Azure SQL Data Warehouse, where a BI tool consumes it.]
Data Movement Activities

 Copy Activity in Data Factory copies data from a source data store to a sink data store.
 Data from any supported source can be written to any supported sink, as sketched below.
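A hedged sketch of a Copy activity's JSON; the dataset names "InputBlobDataset" and "OutputSqlDataset" are hypothetical, and source and sink types vary by connector:

{
  "name": "CopyBlobToSql",
  "description": "Illustrative sketch only",
  "type": "Copy",
  "inputs":  [ { "referenceName": "InputBlobDataset", "type": "DatasetReference" } ],
  "outputs": [ { "referenceName": "OutputSqlDataset", "type": "DatasetReference" } ],
  "typeProperties": {
    "source": { "type": "BlobSource" },
    "sink": { "type": "SqlSink", "writeBatchSize": 10000 }
  }
}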
Data Transformation Activities

 Azure Data Factory supports the following transformation activities, which can be added to pipelines either
individually or chained with another activity.

Data Transformation Activity | Compute Environment
Hive | HDInsight [Hadoop]
Pig | HDInsight [Hadoop]
MapReduce | HDInsight [Hadoop]
Hadoop Streaming | HDInsight [Hadoop]
Spark | HDInsight [Hadoop]
Stored Procedure | Azure SQL, Azure SQL DW or SQL Server
U-SQL | Azure Data Lake Analytics
Custom Activity | Azure Batch
Machine Learning activities | Azure VM
Databricks Notebook | Azure Databricks
Control Activities

 The following control flow activities are supported:

Execute Pipeline Activity: It allows a Data Factory pipeline to invoke another pipeline.
For Each Activity: It defines a repeating control flow in your pipeline (sketched after this list).
Web Activity: It can be used to call a custom REST endpoint from a Data Factory pipeline.
Lookup Activity: It can be used to read or look up a record, table name, or value from any external source.
Get Metadata Activity: It can be used to retrieve metadata of any data in Azure Data Factory.
Until Activity: It implements a Do-Until loop that is similar to the Do-Until looping structure in programming languages. It executes a set of activities in a loop until the condition associated with the activity evaluates to true.
If Condition Activity: It can be used to branch based on a condition that evaluates to true or false.
Wait Activity: When you use a Wait activity in a pipeline, the pipeline waits for the specified period of time before continuing with execution of subsequent activities.
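A rough sketch of a ForEach activity wrapping a Copy activity; the parameter, dataset, and activity names are hypothetical, and the inner dataset is assumed to expose a "path" parameter:

{
  "name": "ForEachInputFile",
  "description": "Illustrative sketch only",
  "type": "ForEach",
  "typeProperties": {
    "isSequential": false,
    "items": { "value": "@pipeline().parameters.fileList", "type": "Expression" },
    "activities": [
      {
        "name": "CopyOneFile",
        "type": "Copy",
        "inputs":  [ { "referenceName": "InputBlobDataset", "type": "DatasetReference",
                       "parameters": { "path": "@item()" } } ],
        "outputs": [ { "referenceName": "OutputSqlDataset", "type": "DatasetReference" } ],
        "typeProperties": { "source": { "type": "BlobSource" }, "sink": { "type": "SqlSink" } }
      }
    ]
  }
}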
Linked Services

 Linked services are much like connection strings, which define the connection information that's needed for Data Factory
to connect to external resources.
 A linked service defines the connection to the data source.
 For example, an Azure Storage linked service specifies a connection string to connect to the Azure Storage account.
 Linked services are used for two purposes in Data Factory:
  To represent a data store, which includes data stores located on-premises and in the cloud, e.g. tables, files,
folders or documents.
  To represent a compute resource that can host the execution of an activity. For example, the HDInsight Hive
activity runs on an HDInsight Hadoop cluster.

Data Stores: Tables, Files, ...
Compute Resources: HDInsight, Apache Spark, ...
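As a hedged example, an Azure Storage linked service definition looks roughly like this; the storage account name and key are placeholders:

{
  "name": "AzureStorageLinkedService",
  "properties": {
    "description": "Illustrative sketch only",
    "type": "AzureStorage",
    "typeProperties": {
      "connectionString": {
        "type": "SecureString",
        "value": "DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>"
      }
    }
  }
}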
Integration Runtime

 Think of it as a bridge between two networks.
 It is the compute infrastructure which provides the following capabilities across different network environments:

Data Movement
 Copy data across data stores in a public network and data stores in a private network (on-premises or virtual private network).
 It provides support for built-in connectors, format conversion, column mapping and scalable data transfer.

Activity Dispatch
 These capabilities are used when compute services such as Azure HDInsight, Azure Machine Learning, Azure SQL Database,
SQL Server, and more are used for transformation activities.

SSIS Package Execution
 These capabilities are used when SSIS packages need to be executed in the managed Azure compute environment.
Integration runtime types

 The three types are:

IR Type | Public Network | Private Network
Azure | Data movement, Activity dispatch | -
Self-hosted | Data movement, Activity dispatch | Data movement, Activity dispatch
Azure-SSIS | SSIS package execution | SSIS package execution
How Azure Data Factory Works

[Diagram: an on-premises SQL Server database is exposed through a dataset; a pipeline of chained activities runs on Integration Runtimes, consuming and producing datasets.]
Data Factory V1 vs. V2



Data Factory V1 vs. V2

Data Factory V1:
 Datasets
 Linked Services
 Pipelines
 On-Premises Gateway
 Schedule on Dataset availability and Pipeline start/end time

Data Factory V2:
 Datasets
 Linked Services
 Pipelines
 Self-hosted Integration Runtime
 Schedule triggers (time or tumbling window)
 Host and execute SSIS packages
 Parameters
 New Control Flow Activities
System Variables

 Pipeline scope
 Schedule Trigger scope
 Tumbling Window Trigger scope



Pipeline Scope

 These system variables can be referenced anywhere in the pipeline JSON.

@pipeline().DataFactory | Name of the data factory the pipeline run is running within
@pipeline().Pipeline | Name of the pipeline
@pipeline().RunId | ID of the specific pipeline run
@pipeline().TriggerType | Type of the trigger that invoked the pipeline (Manual, Scheduler)
@pipeline().TriggerId | ID of the trigger that invoked the pipeline
@pipeline().TriggerName | Name of the trigger that invoked the pipeline
@pipeline().TriggerTime | Time when the trigger invoked the pipeline. The trigger time is the actual fired time, not the scheduled time.
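For example, a hedged snippet (the folder layout is made up for illustration) that stamps each run's output folder with the data factory name and run ID:

"folderPath": "@concat('logs/', pipeline().DataFactory, '/', pipeline().RunId)"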
Schedule Trigger Scope

 These system variables can be referenced anywhere in the trigger JSON if the trigger is of type "ScheduleTrigger".

@trigger().scheduledTime | Time when the trigger was scheduled to invoke the pipeline run. For example, for a trigger that fires every 5 minutes, this variable would return 2017-06-01T22:20:00Z, 2017-06-01T22:25:00Z, 2017-06-01T22:29:00Z respectively.

@trigger().startTime | Time when the trigger actually fired to invoke the pipeline run. For example, for a trigger that fires every 5 minutes, this variable might return something like 2017-06-01T22:20:00.4061448Z, 2017-06-01T22:25:00.7958577Z, 2017-06-01T22:29:00.9935483Z respectively.
Tumbling Window Trigger Scope

 These system variables can be referenced anywhere in the trigger JSON if the trigger is of type "TumblingWindowTrigger".

@trigger().outputs.windowStartTime | Start of the window when the trigger was scheduled to invoke the pipeline run. If the tumbling window trigger has a frequency of "hourly", this is the time at the beginning of the hour.

@trigger().outputs.windowEndTime | End of the window when the trigger was scheduled to invoke the pipeline run. If the tumbling window trigger has a frequency of "hourly", this is the time at the end of the hour.
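A hedged sketch of a tumbling window trigger passing its window boundaries to pipeline parameters; the trigger, pipeline, and parameter names are hypothetical:

{
  "name": "HourlyWindowTrigger",
  "properties": {
    "description": "Illustrative sketch",
    "type": "TumblingWindowTrigger",
    "typeProperties": {
      "frequency": "Hour",
      "interval": 1,
      "startTime": "2018-01-01T00:00:00Z",
      "maxConcurrency": 1
    },
    "pipeline": {
      "pipelineReference": { "type": "PipelineReference", "referenceName": "HourlyLoadPipeline" },
      "parameters": {
        "windowStart": "@trigger().outputs.windowStartTime",
        "windowEnd": "@trigger().outputs.windowEndTime"
      }
    }
  }
}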
Functions in Azure Data Factory

 String Functions
 Collection Functions
 Logical Functions
 Conversion Functions
 Math Functions
 Date Functions



String Functions

Function | Description | Example
concat | Combines any number of strings together. | concat('Hi ', 'team') : 'Hi team'
substring | Returns a subset of characters from a string. | substring('somevalue', 1, 3) : 'ome'
replace | Replaces a string with a given string. | replace('Hi team', 'Hi', 'Hey') : 'Hey team'
guid | Generates a globally unique string. | guid() : c2ecc88d-88c8-4096-912c-d6
toLower | Converts a string to lowercase. | toLower('Two') : 'two'
toUpper | Converts a string to uppercase. | toUpper('Two') : 'TWO'
indexof | Finds the index of a value within a string, case-insensitively. | indexof('Hi team', 'Hi') : 0
endswith | Checks if the string ends with a value, case-insensitively. | endswith('Hi team', 'team') : true
startswith | Checks if the string starts with a value, case-insensitively. | startswith('Hi team', 'team') : false
split | Splits the string using a separator. | split('Hi;team', ';') : ['Hi', 'team']
lastindexof | Finds the last index of a value within a string, case-insensitively. | lastindexof('foofoo', 'foo') : 3
Collection Functions

Function | Description | Example
contains | Returns true if a dictionary contains a key, a list contains a value, or a string contains a substring. | contains('abacaba', 'aca') : true
length | Returns the number of elements in an array or string. | length('abc') : 3
empty | Returns true if the object, array, or string is empty. | empty('') : true
intersection | Returns a single array or object with the common elements between the arrays or objects passed to it. | intersection([1, 2, 3], [101, 2, 1, 10], [6, 8, 1, 2]) : [1, 2]
union | Returns a single array or object with all of the elements that are in either array or object passed to it. | union([1, 2, 3], [101, 2, 1, 10]) : [1, 2, 3, 10, 101]
first | Returns the first element in the array or string passed in. | first([0, 2, 3]) : 0
last | Returns the last element in the array or string passed in. | last('0123') : 3
take | Returns the first Count elements from the array or string passed in. | take([1, 2, 3, 4], 2) : [1, 2]
skip | Returns the elements in the array starting at index Count. | skip([1, 2, 3, 4], 2) : [3, 4]
Conversion Functions

Function | Description | Example
int | Converts the parameter to an integer. | int('100') : 100
string | Converts the parameter to a string. | string(10) : '10'
json | Converts the parameter to a JSON type value. | json('[1,2,3]') : [1,2,3]; json('{"bar" : "baz"}') : { "bar" : "baz" }
float | Converts the parameter argument to a floating-point number. | float('10.333') : 10.333
bool | Converts the parameter to a Boolean. | bool(0) : false
coalesce | Returns the first non-null object in the arguments passed in. Note: an empty string is not null. | coalesce(pipeline().parameters.parameter1, pipeline().parameters.parameter2, 'fallback') : 'fallback'
array | Converts the parameter to an array. | array('abc') : ["abc"]
createArray | Creates an array from the parameters. | createArray('a', 'c') : ["a", "c"]
Math Functions

Function | Description | Example
add | Returns the result of the addition of the two numbers. | add(10, 10.333) : 20.333
sub | Returns the result of the subtraction of the two numbers. | sub(10, 10.333) : -0.333
mul | Returns the result of the multiplication of the two numbers. | mul(10, 10.333) : 103.33
div | Returns the result of the division of the two numbers. | div(10.333, 10) : 1.0333
mod | Returns the remainder after the division of the two numbers (modulo). | mod(10, 4) : 2
min | Returns the minimum value; there are two different patterns for calling this function. Note: all values must be numbers. | min([0, 1, 2]) : 0; min(0, 1, 2) : 0
max | Returns the maximum value; there are two different patterns for calling this function. Note: all values must be numbers. | max([0, 1, 2]) : 2; max(0, 1, 2) : 2
range | Generates an array of integers starting from a certain number, with the length of the returned array defined by the second argument. | range(3, 4) : [3, 4, 5, 6]
rand | Generates a random integer within the specified range. | rand(-1000, 1000) : 42
Date Functions

Function | Description | Example
utcnow | Returns the current timestamp as a string. | utcnow() : 2019-02-21T13:27:36Z
addseconds | Adds an integer number of seconds to a string timestamp passed in. The number of seconds can be positive or negative. | addseconds('2015-03-15T13:27:36Z', -36) : 2015-03-15T13:27:00Z
addminutes | Adds an integer number of minutes to a string timestamp passed in. The number of minutes can be positive or negative. | addminutes('2015-03-15T13:27:36Z', 33) : 2015-03-15T14:00:36Z
addhours | Adds an integer number of hours to a string timestamp passed in. The number of hours can be positive or negative. | addhours('2015-03-15T13:27:36Z', 12) : 2015-03-16T01:27:36Z
adddays | Adds an integer number of days to a string timestamp passed in. The number of days can be positive or negative. | adddays('2015-03-15T13:27:36Z', -20) : 2015-02-23T13:27:36Z
formatDateTime | Returns the timestamp as a string in the specified format. | formatDateTime('2015-03-15T13:27:36Z', 'o') : 2015-03-15T13:27:36.0000000Z
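These functions compose inside a single expression. For example, a hedged sketch of a date-partitioned folder path built from the trigger's scheduled time (the folder layout is illustrative):

"folderPath": "@concat('raw/', formatDateTime(trigger().scheduledTime, 'yyyy/MM/dd'))"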
Expressions in Azure Data Factory

 JSON values in the definition can be literals or expressions that are evaluated at runtime.
E.g. "name": "value" or "name": "@pipeline().parameters.password"
 Expressions can appear anywhere in a JSON string value and always result in another JSON value.
 If a JSON value is an expression, the body of the expression is extracted by removing the at-sign (@).

JSON value | Result
"parameters" | The characters 'parameters' are returned.
"parameters[1]" | The characters 'parameters[1]' are returned.
"@@" | A 1-character string that contains '@' is returned.
" @" | A 2-character string that contains ' @' is returned.
A dataset with a parameter

 Suppose the BlobDataset takes a parameter named path.
 Its value is used to set a value for the folderPath property by using the following expression:

"folderPath": "@dataset().path"

A pipeline with a parameter

 In the following example, the pipeline takes inputPath and outputPath parameters.
 The path for the parameterized blob dataset is set by using the values of these parameters.
 The syntax used here is:

"path": "@pipeline().parameters.inputPath"
Question-Answers

Feel free to write to me at [email protected] in case of any queries / clarifications.

