Databricks
- Mahesh Pandit
Modern DW for BI
(Diagram: business/custom apps and data, on-premises and in the cloud (structured sources), flow through Data Factory into Azure SQL Data Warehouse, then Azure Analysis Services, and finally analytical dashboards in Power BI.)
Azure Data Factory orchestrates data pipeline activity workflow & scheduling.
Modern DW for SaaS Apps
(Diagram: business/custom apps and data, on-premises and in the cloud (structured sources), flow through Data Factory into app storage that backs a SaaS app consumed from browsers and devices.)
Azure Data Factory orchestrates data pipeline activity workflow & scheduling.
Lift & Shift existing SSIS packages to Cloud
(Diagram: existing on-premises SSIS packages are lifted into the cloud and connected back to the on-premises network through a VNET.)
Azure Data Factory orchestrates data pipeline activity workflow & scheduling.
Introduction
(Diagram: the Data Factory workflow of connect & collect, transform & enrich, publish, and monitor.)
Connect and collect
The first step in building an information production system is to connect to all the
required sources of data and processing, such as software-as-a-service (SaaS) services,
databases, file shares, and FTP web services.
With Data Factory, you can use the Copy Activity in a data pipeline to move data from both on-premises and cloud source data stores to a centralized data store in the cloud for further analysis.
For example, you can collect data in Azure Data Lake as well as in Azure Blob storage.
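As a rough illustration, a copy activity inside a pipeline definition might look like the sketch below; the pipeline and dataset names are hypothetical placeholders.

{
  "name": "CollectRawDataPipeline",
  "properties": {
    "activities": [
      {
        "name": "CopyBlobToDataLake",
        "type": "Copy",
        "inputs": [ { "referenceName": "SourceBlobDataset", "type": "DatasetReference" } ],
        "outputs": [ { "referenceName": "RawDataLakeDataset", "type": "DatasetReference" } ],
        "typeProperties": {
          "source": { "type": "BlobSource" },
          "sink": { "type": "AzureDataLakeStoreSink" }
        }
      }
    ]
  }
}

The inputs and outputs are dataset references, so the same activity definition can be repointed at different stores without changing the pipeline logic.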
Transform and enrich
After the data is in a centralized data store in the cloud, transform and enrich the collected data by using compute services such as:
HDInsight Hadoop
Spark
Data Lake Analytics
Machine Learning
Publish
After the raw data has been refined into a business-ready, consumable form, load the data into Azure SQL Data Warehouse, Azure SQL Database, Azure Cosmos DB, or any other store your users need.
Monitor
After you have successfully built and deployed your data integration pipeline, providing business value from refined
data, monitor the scheduled activities and pipelines for success and failure rates.
Azure Data Factory has built-in support for pipeline monitoring via Azure Monitor, API, PowerShell, Log
Analytics, and health panels on the Azure portal.
ADF Components
Pipeline
Activity
Datasets
Linked Services
(Diagram: how the components relate.
A pipeline is a logical group of activities and is what gets scheduled and monitored.
An activity, such as a Hive or copy activity, is a processing step that produces and consumes datasets and runs on a linked service.
A dataset, such as a table or a file, represents a data item stored in a linked service.
A linked service, such as SQL Server or a Hadoop cluster, is the connection the other components resolve to.)
An Azure subscription might have one or more Azure Data Factory instances (or data factories).
Azure Data Factory is composed of four key components.
These components work together to provide the platform on which you can compose data-driven workflows with
steps to move and transform data.
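To make the relationships concrete, here is a minimal sketch of a linked service and a dataset that references it; the names and the connection-string placeholder are hypothetical.

{
  "name": "AzureStorageLinkedService",
  "properties": {
    "type": "AzureBlobStorage",
    "typeProperties": {
      "connectionString": "<storage account connection string>"
    }
  }
}

{
  "name": "SourceBlobDataset",
  "properties": {
    "type": "AzureBlob",
    "linkedServiceName": { "referenceName": "AzureStorageLinkedService", "type": "LinkedServiceReference" },
    "typeProperties": { "folderPath": "rawdata/sales" }
  }
}

The dataset points at the linked service by name; activities in a pipeline then point at the dataset the same way.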
Pipeline
ADF Activities
Activities represent a processing step in a pipeline.
For example, you might use a copy activity to copy data from one data store to another data store.
Data Factory supports three types of activities: data movement activities, data transformation activities, and control activities.
(Diagram: a pipeline in which a copy activity brings data in from Azure Blob storage, a transformation activity refines it, and another copy activity loads the output data into Azure SQL Data Warehouse for consumption by a BI tool.)
Data Movement Activities
Copy Activity in Data Factory copies data from a source data store to a sink data store.
Data from any source can be written to any sink.
…….
Data Transformation Activities
Azure Data Factory supports the following transformation activities that can be added to pipelines either
individually or chained with another activity.
(Table: each data transformation activity paired with the compute environment it runs on; for example, Spark activities run on HDInsight, and other activities run on compute such as an Azure VM.)
(Diagram: datasets such as tables and files live in data stores; activities run on compute resources such as HDInsight and Apache Spark.)
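As a hedged sketch of such a chained transformation, here is a Hive activity that runs on HDInsight after the copy activity from the earlier sketch succeeds; the linked service and script names are hypothetical.

{
  "name": "TransformWithHive",
  "type": "HDInsightHive",
  "dependsOn": [
    { "activity": "CopyBlobToDataLake", "dependencyConditions": [ "Succeeded" ] }
  ],
  "linkedServiceName": { "referenceName": "HDInsightLinkedService", "type": "LinkedServiceReference" },
  "typeProperties": {
    "scriptPath": "scripts/transform.hql",
    "scriptLinkedService": { "referenceName": "AzureStorageLinkedService", "type": "LinkedServiceReference" }
  }
}

The dependsOn block is what chains the activity: it only starts once the named copy activity has succeeded.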
Integration Runtime
Copy data across data This capabilities are use when When SSIS packages need
stores in public network compute services such as to be executed in the
and data stores in private Azure HDInsight, Azure managed Azure Compute
network (on-premises or Machine Learning, Azure SQL Environment like HDInsight
virtual private network). Database, SQL Server, and then this capabilities are
more get used for used.
It provides support for transformation activities.
built-in connectors, format
conversion, column
mapping and scalable data
transfer.
Integration runtime types
Azure Data Factory offers three types of integration runtime: Azure, self-hosted, and Azure-SSIS.
(Diagram: a pipeline's datasets resolve through linked services and an integration runtime; a self-hosted integration runtime reaches an on-premises SQL Server database.)
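For example, a linked service that reaches an on-premises SQL Server through a self-hosted integration runtime references the runtime via connectVia; the runtime, server, and database names below are hypothetical.

{
  "name": "OnPremSqlServerLinkedService",
  "properties": {
    "type": "SqlServer",
    "typeProperties": {
      "connectionString": "Server=onprem-sql01;Database=SalesDb;Integrated Security=True"
    },
    "connectVia": {
      "referenceName": "MySelfHostedIR",
      "type": "IntegrationRuntimeReference"
    }
  }
}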
Pipelines: System Variables
Pipeline scope
These system variables can be referenced anywhere in the pipeline JSON.
@pipeline().DataFactory: Name of the data factory that the pipeline run is running within.
@pipeline().TriggerType: Type of the trigger that invoked the pipeline (Manual, Scheduler).
@pipeline().TriggerTime: Time of the trigger run that invoked the pipeline. The trigger time is the actual fired time, not the scheduled time.
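These variables are used inside expressions in the pipeline JSON. For instance, a hypothetical sink folder path could be built from them like this:

"folderPath": "@concat('landing/', pipeline().DataFactory, '/', formatDateTime(pipeline().TriggerTime, 'yyyy/MM/dd'))"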
Schedule Trigger Scope
These system variables can be referenced anywhere in the trigger JSON if the trigger is of type "ScheduleTrigger".
@trigger().scheduledTime: Time when the trigger was scheduled to invoke the pipeline run. For example, for a trigger that fires every 5 minutes, this variable would return 2017-06-01T22:20:00Z, 2017-06-01T22:25:00Z, 2017-06-01T22:30:00Z respectively.
@trigger().startTime: Time when the trigger actually fired to invoke the pipeline run. For example, for a trigger that fires every 5 minutes, this variable might return something like 2017-06-01T22:20:00.4061448Z, 2017-06-01T22:25:00.7958577Z, 2017-06-01T22:30:00.9935483Z respectively.
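A minimal sketch of such a trigger, assuming a hypothetical pipeline named CollectRawDataPipeline that accepts a scheduledRunTime parameter:

{
  "name": "Every5MinTrigger",
  "properties": {
    "type": "ScheduleTrigger",
    "typeProperties": {
      "recurrence": {
        "frequency": "Minute",
        "interval": 5,
        "startTime": "2017-06-01T22:20:00Z",
        "timeZone": "UTC"
      }
    },
    "pipelines": [
      {
        "pipelineReference": { "referenceName": "CollectRawDataPipeline", "type": "PipelineReference" },
        "parameters": { "scheduledRunTime": "@trigger().scheduledTime" }
      }
    ]
  }
}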
Tumbling window Trigger Scope
These system variables can be referenced anywhere in the trigger JSON if the trigger is of type "TumblingWindowTrigger".
@trigger().outputs.windowStartTime: Start of the window when the trigger was scheduled to invoke the pipeline run. If the tumbling window trigger has a frequency of "hourly", this is the time at the beginning of the hour.
@trigger().outputs.windowEndTime: End of the window when the trigger was scheduled to invoke the pipeline run. If the tumbling window trigger has a frequency of "hourly", this is the time at the end of the hour.
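These two variables are typically passed into a parameterized pipeline to drive incremental loads. A sketch, assuming a hypothetical pipeline named IncrementalLoadPipeline with windowStart and windowEnd parameters:

{
  "name": "HourlyWindowTrigger",
  "properties": {
    "type": "TumblingWindowTrigger",
    "typeProperties": {
      "frequency": "Hour",
      "interval": 1,
      "startTime": "2017-06-01T00:00:00Z",
      "maxConcurrency": 1
    },
    "pipeline": {
      "pipelineReference": { "referenceName": "IncrementalLoadPipeline", "type": "PipelineReference" },
      "parameters": {
        "windowStart": "@trigger().outputs.windowStartTime",
        "windowEnd": "@trigger().outputs.windowEndTime"
      }
    }
  }
}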
Functions in Azure Data Factory
String Functions
Collection Functions
Logical Functions
Conversion Functions
Math Functions
Date Functions
indexof: Finds the index of a value within a string, case insensitively. indexof('Hi team', 'Hi') : 0
endswith: Checks if the string ends with a value, case insensitively. endswith('Hi team', 'team') : true
startswith: Checks if the string starts with a value, case insensitively. startswith('Hi team', 'team') : false
split: Splits the string using a separator. split('Hi;team', ';') : ["Hi", "team"]
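These functions can also be combined inside a single expression. For instance, a hypothetical folder path built from a pipeline parameter named fileName:

"folderPath": "@concat('out/', split(pipeline().parameters.fileName, '.')[0])"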
JSON values in the definition can be literal or expressions that are evaluated at runtime.
E.g. "name": "value" OR "name": "@pipeline().parameters.password"
Expressions can appear anywhere in a JSON string value and always result in another JSON value.
If a JSON value is an expression, the body of the expression is extracted by removing the at-sign (@).
Parameters
Suppose the BlobDataset takes a parameter named path.
Its value is used to set a value for the folderPath property by using the following expressions:
"folderPath": "@dataset().path"
"path": "@pipeline().parameters.inputPath"
Questions & Answers
Feel free to write to me at:
[email protected]