
AZURE DATA FACTORY
Basic Azure Data Factory Interview Questions for Freshers

1. Why do we need Azure Data Factory?


The amount of data generated these days is huge, and it comes from many different sources. When we move this data to the cloud, a few things need to be taken care of.

Because the data comes from different sources, it arrives in different forms and formats, and each source transfers or channels it in its own way. When we bring this data into the cloud or into a particular storage system, we need to make sure it is well managed: we need to transform it and remove the unnecessary parts. As far as moving the data is concerned, we need to make sure that data is picked up from the different sources, brought to one common place, stored, and, if required, transformed into something more meaningful.

This can be done with a traditional data warehouse, but it has certain disadvantages. Sometimes we are forced to build custom applications that handle each of these processes individually, which is time-consuming, and integrating all these sources is a huge pain. We need a way to automate this process or create proper workflows.

Azure Data Factory helps orchestrate this complete process in a more manageable and organized manner.

2. What is Azure Data Factory?


It is a cloud-based data integration service that allows the creation of data-driven workflows in the cloud for orchestrating and automating data movement and data transformation.

Using Azure Data Factory, you can create and schedule data-driven workflows (called pipelines) that ingest data from disparate data stores.
It can process and transform data using compute services such as Azure HDInsight (Hadoop), Spark, Azure Data Lake Analytics, and Azure Machine Learning.
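As a rough, simplified sketch (all names here are hypothetical), a pipeline is authored as a JSON document that lists its activities; in this example a copy activity ingests data and an HDInsight Spark activity transforms it once the copy succeeds:

```json
{
  "name": "DailyIngestPipeline",
  "properties": {
    "activities": [
      {
        "name": "CopyRawData",
        "type": "Copy",
        "inputs": [ { "referenceName": "SourceDataset", "type": "DatasetReference" } ],
        "outputs": [ { "referenceName": "StagingDataset", "type": "DatasetReference" } ],
        "typeProperties": {
          "source": { "type": "DelimitedTextSource" },
          "sink": { "type": "DelimitedTextSink" }
        }
      },
      {
        "name": "TransformOnSpark",
        "type": "HDInsightSpark",
        "dependsOn": [
          { "activity": "CopyRawData", "dependencyConditions": [ "Succeeded" ] }
        ],
        "linkedServiceName": { "referenceName": "HDInsightLS", "type": "LinkedServiceReference" },
        "typeProperties": {
          "rootPath": "scripts",
          "entryFilePath": "transform.py"
        }
      }
    ]
  }
}
```

The dependsOn block is what lets Data Factory orchestrate the activities into a single data-driven workflow.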

3. What is the integration runtime?


The integration runtime (IR) is the compute infrastructure that Azure Data Factory uses to provide data integration capabilities across different network environments. There are three types of integration runtime:

1. Azure Integration Runtime: The Azure IR can copy data between cloud data stores and dispatch activities to a variety of compute services, such as Azure HDInsight or SQL Server, where the transformation takes place.
2. Self-Hosted Integration Runtime: A self-hosted IR is software with essentially the same code as the Azure IR, but you install it on an on-premises machine or on a virtual machine inside a virtual network. A self-hosted IR can run copy activities between a public cloud data store and a data store on a private network, and it can also dispatch transformation activities against compute resources on a private network. We use a self-hosted IR when Data Factory cannot directly access on-premises data sources because they sit behind a firewall. It is sometimes possible to establish a direct connection between Azure and on-premises data sources by configuring the firewall in a specific way; if we do that, we do not need a self-hosted IR.
3. Azure-SSIS Integration Runtime: With the Azure-SSIS IR, you can natively execute SSIS packages in a managed environment. We therefore use the Azure-SSIS IR when we lift and shift SSIS packages to Data Factory.

4. What is the limit on the number of integration runtimes?


There is no hard limit on the number of integration runtime instances
you can have in a data factory. There is, however, a limit on the number
of VM cores that the integration runtime can use per subscription for
SSIS package execution.

5. What are the top-level concepts of Azure Data Factory?


Pipeline: A pipeline acts as a carrier in which various processes take place; an individual process is an activity.
Activities: Activities represent the processing steps in a pipeline. A pipeline can have one or more activities, and an activity can be any process, such as querying a dataset or moving a dataset from one source to another.
Datasets: In simple words, a dataset is a data structure that holds or points to our data.
Linked Services: Linked services store the connection information needed to connect to external sources.

For example, to connect to a SQL Server database you need a connection string; the linked service holds it, and the datasets identify the source and destination of your data.
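To make these concepts concrete (the names and the storage account placeholders are hypothetical), a linked service holds the connection information and a dataset points at the actual data through that linked service:

```json
{
  "name": "AzureBlobStorageLS",
  "properties": {
    "type": "AzureBlobStorage",
    "typeProperties": {
      "connectionString": "DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>"
    }
  }
}
```

```json
{
  "name": "CarsCsvDataset",
  "properties": {
    "type": "DelimitedText",
    "linkedServiceName": { "referenceName": "AzureBlobStorageLS", "type": "LinkedServiceReference" },
    "typeProperties": {
      "location": { "type": "AzureBlobStorageLocation", "container": "raw", "fileName": "cars.csv" },
      "columnDelimiter": ",",
      "firstRowAsHeader": true
    }
  }
}
```

A pipeline's activities then reference the dataset, and the dataset in turn resolves to the store described by the linked service.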

6. How can I schedule a pipeline?


You can use the schedule trigger or the tumbling window trigger to schedule a pipeline.
The trigger uses a wall-clock calendar schedule, which can schedule pipelines periodically or in calendar-based recurrent patterns (for example, on Mondays at 6:00 PM and Thursdays at 9:00 PM).
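A hedged sketch of a schedule trigger (names are hypothetical, and the schedule is simplified to a single weekly slot, Mondays at 6:00 PM UTC) that starts a pipeline:

```json
{
  "name": "MondayEveningTrigger",
  "properties": {
    "type": "ScheduleTrigger",
    "typeProperties": {
      "recurrence": {
        "frequency": "Week",
        "interval": 1,
        "startTime": "2024-01-01T18:00:00Z",
        "timeZone": "UTC",
        "schedule": {
          "weekDays": [ "Monday" ],
          "hours": [ 18 ],
          "minutes": [ 0 ]
        }
      }
    },
    "pipelines": [
      {
        "pipelineReference": { "referenceName": "DailyIngestPipeline", "type": "PipelineReference" }
      }
    ]
  }
}
```

Covering the Thursday 9:00 PM slot from the question's example would need a second trigger or additional schedule entries, since the hours and weekDays lists are combined as a cross product.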

7. Can I pass parameters to a pipeline run?


Yes. Parameters are a first-class, top-level concept in Data Factory. You can define parameters at the pipeline level and pass arguments as you execute the pipeline run on demand or by using a trigger.

8. Can I define default values for the pipeline parameters?


You can define default values for the parameters in the pipelines.
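As an illustrative sketch (parameter names are hypothetical), pipeline parameters and their default values are declared in the pipeline's JSON definition, with the activities omitted here for brevity:

```json
{
  "name": "ParameterizedPipeline",
  "properties": {
    "parameters": {
      "sourceContainer": { "type": "String", "defaultValue": "raw" },
      "retryCount": { "type": "Int", "defaultValue": 3 }
    },
    "activities": []
  }
}
```

When the pipeline is run on demand or by a trigger without arguments, the defaults apply; passing, for example, sourceContainer = "staging" at run time overrides the default for that run only.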

9. Can an activity's output property be consumed in another activity?

An activity's output can be consumed in a subsequent activity with the @activity construct.

10. How do I handle null values in an activity output?


You can use the @coalesce construct in the expressions to handle the
null values.
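For illustration (activity and variable names are hypothetical), both constructs appear inside activity expressions; here a Set Variable activity reads the rowsCopied output of an upstream copy activity and falls back to 0 if it is null:

```json
{
  "name": "SetRowCount",
  "type": "SetVariable",
  "typeProperties": {
    "variableName": "copiedRowCount",
    "value": "@string(coalesce(activity('CopyRawData').output.rowsCopied, 0))"
  }
}
```

This assumes a variable named copiedRowCount is declared in the pipeline's variables section and that the upstream activity is named CopyRawData.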

11. Which Data Factory version do I use to create data flows?


Use Data Factory version 2 (V2) to create data flows.

12. What are datasets in Azure Data Factory?


Datasets are defined as named views of data that simply point to or
reference the data to be used in activities as inputs or outputs.

13. How are pipelines monitored in Azure Data Factory?


Azure Data Factory provides a monitoring experience through the "Monitor and Manage" tile in the data factory blade of the Azure portal.

14. What are the three types of integration runtime?


The three types of integration runtime are:

1. Azure Integration Runtime
2. Self-Hosted Integration Runtime
3. Azure-SSIS Integration Runtime
15. What are the types of data integration design patterns?
There are four common data integration design patterns:

1. Broadcast
2. Bi-directional syncs
3. Correlation
4. Aggregation

Intermediate Azure Data Factory Interview Questions

16. What is the difference between Azure Data Lake and Azure
Data Warehouse?
The data warehouse is a traditional way of storing data that is still widely used. The data lake is complementary to the data warehouse: if the data in your data lake is to be stored in the data warehouse, it has to follow specific rules.

Data Lake vs. Data Warehouse:

Relationship: The data lake is complementary to the data warehouse; the data warehouse may be sourced from the data lake.
Data: In the data lake, data is either detailed or raw and can be in any particular form; you take the data as it is and put it into the lake. In the data warehouse, data is filtered, summarized, and refined.
Schema: The data lake uses schema on read (not structured; you can define your schema in any number of ways), while the data warehouse uses schema on write (data is written in a structured form or with a particular schema).
Processing language: The data lake can use one language (U-SQL) to process data of any format; the data warehouse uses SQL.

17. What is blob storage in Azure?


Azure Blob Storage is a service for storing large amounts of
unstructured object data, such as text or binary data. You can use Blob
Storage to expose data publicly to the world or to store application data
privately. Common uses of Blob Storage are as follows:

Serving images or documents directly to a browser
Storing files for distributed access
Streaming video and audio
Storing data for backup and restore, disaster recovery, and archiving
Storing data for analysis by an on-premises or Azure-hosted service

18. What is the difference between Azure Data Lake store and
Blob storage?
Azure Data Lake Storage Gen1 vs. Azure Blob Storage:

Purpose: Data Lake Storage Gen1 is optimized storage for big data analytics workloads; Blob Storage is a general-purpose object store for a wide variety of storage scenarios, including big data analytics.

Structure: Data Lake Storage Gen1 is a hierarchical file system; Blob Storage is an object store with a flat namespace.

Key concepts: A Data Lake Storage Gen1 account contains folders, which in turn contain data stored as files; a Blob Storage account has containers, which in turn hold data in the form of blobs.

Use cases: Data Lake Storage Gen1 is used for batch, interactive, and streaming analytics, and for machine learning data such as log files, IoT data, clickstreams, and large datasets. Blob Storage stores any type of text or binary data, such as application back ends, backup data, media storage for streaming, and general-purpose data, and it also fully supports analytics workloads (batch, interactive, and streaming analytics, and machine learning data such as log files, IoT data, clickstreams, and large datasets).

Server-side API: Data Lake Storage Gen1 exposes a WebHDFS-compatible REST API; Blob Storage exposes the Azure Blob Storage REST API.

Authentication for data operations: Data Lake Storage Gen1 is based on Azure Active Directory identities; Blob Storage is based on shared secrets (account access keys and shared access signature keys).

19. What are the steps for creating an ETL process in Azure Data Factory?
Suppose we want to extract some data from an Azure SQL Server database; if the data has to be processed, it is processed and then stored in Data Lake Storage.
Steps for Creating the ETL Process

1. Create a linked service for the source data store, which is the SQL Server database.
2. Assume that we have a cars dataset; create a dataset for this source data.
3. Create a linked service for the destination data store, which is Azure Data Lake Storage (ADLS).
4. Create a dataset for saving the data.
5. Create the pipeline and add a copy activity (a sketch of such a pipeline follows these steps).
6. Schedule the pipeline by adding a trigger.
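A simplified sketch of step 5 (dataset names are hypothetical): a pipeline whose copy activity reads from the Azure SQL source dataset and writes Parquet files to the ADLS sink dataset.

```json
{
  "name": "CopyCarsToDataLake",
  "properties": {
    "activities": [
      {
        "name": "CopyCarsData",
        "type": "Copy",
        "inputs": [ { "referenceName": "CarsSqlDataset", "type": "DatasetReference" } ],
        "outputs": [ { "referenceName": "CarsAdlsDataset", "type": "DatasetReference" } ],
        "typeProperties": {
          "source": { "type": "AzureSqlSource" },
          "sink": { "type": "ParquetSink" }
        }
      }
    ]
  }
}
```

Step 6 would then attach a trigger to this pipeline, along the lines of the schedule trigger shown earlier under question 6.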

20. What is the difference between HDInsight and Azure Data Lake Analytics?

HDInsight: If we want to process a data set, we first have to configure the cluster with predefined nodes and then use a language like Pig or Hive to process the data. Since we configure the cluster ourselves, we can create it as we want and control it as we want, and all Hadoop subprojects, such as Spark and Kafka, can be used without limitation.
Azure Data Lake Analytics: It is all about passing queries written to process the data. Azure Data Lake Analytics creates the necessary compute nodes on demand, per our instructions, and processes the data set. It does not give much flexibility in terms of provisioning the cluster, because Microsoft Azure takes care of that: we do not need to worry about cluster creation, and the assignment of nodes is done based on the instructions we pass. In addition, we can use U-SQL, taking advantage of .NET, for processing data.

21. Can an activity in a pipeline consume arguments that are passed to a pipeline run?

In a pipeline, an activity can indeed consume arguments that are
passed to a pipeline run. Arguments serve as input values that can be
provided when triggering or scheduling a pipeline run. These
arguments can be used by activities within the pipeline to customize
their behavior or perform specific tasks based on the provided values.
This flexibility allows for dynamic and parameterized execution of
pipeline activities, enhancing the versatility and adaptability of the
pipeline workflow.

Each activity within the pipeline can consume the parameter value
that’s passed to the pipeline and run with the @parameter construct.
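As a hedged example (dataset and parameter names are hypothetical), an activity references a pipeline parameter through the @pipeline().parameters construct, here to parameterize the folder a copy activity reads from:

```json
{
  "name": "CopyFromParameterizedFolder",
  "type": "Copy",
  "inputs": [
    {
      "referenceName": "SourceDataset",
      "type": "DatasetReference",
      "parameters": { "folderName": "@pipeline().parameters.sourceFolder" }
    }
  ],
  "outputs": [ { "referenceName": "SinkDataset", "type": "DatasetReference" } ],
  "typeProperties": {
    "source": { "type": "DelimitedTextSource" },
    "sink": { "type": "DelimitedTextSink" }
  }
}
```

This assumes SourceDataset itself declares a folderName parameter; the value flows from the run's arguments into the activity at execution time.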

22. What has changed from private preview to limited public preview in regard to data flows?

You will no longer have to bring your own Azure Databricks clusters.
Data Factory will manage cluster creation and teardown.
Blob datasets and Azure Data Lake Storage Gen2 datasets are
separated into delimited text and Apache Parquet datasets.
You can still use Data Lake Storage Gen2 and Blob Storage to store
those files. Use the appropriate linked service for those storage
engines.

23. How do I access data using the other 80 dataset types in Data Factory?

The mapping data flow feature currently allows Azure SQL
Database, Azure SQL Data Warehouse, delimited text files from
Azure Blob Storage or Azure Data Lake Storage Gen2, and Parquet
files from Blob Storage or Data Lake Storage Gen2 natively for
source and sink.
Use the copy activity to stage data from any of the other connectors,
and then execute a Data Flow activity to transform the data after it’s
been staged. For example, your pipeline will first copy into Blob
Storage, and then a Data Flow activity will use a dataset in the
source to transform that data.

24. What is the Get Metadata activity in ADF?


The Get Metadata activity is utilized for getting the metadata of any
data in the Synapse pipeline or ADF. To perform validation or
consumption, we can utilize the output from the Get Metadata activity
in conditional expressions. It takes a dataset as input and returns
metadata information as output. The maximum size of the returned
metadata is 4 MB.
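A minimal sketch (the dataset name is hypothetical) of a Get Metadata activity that asks for a few metadata fields of a dataset:

```json
{
  "name": "GetInputFileMetadata",
  "type": "GetMetadata",
  "typeProperties": {
    "dataset": { "referenceName": "CarsCsvDataset", "type": "DatasetReference" },
    "fieldList": [ "exists", "size", "lastModified", "childItems" ]
  }
}
```

A downstream If Condition or Filter activity can then use an expression such as @activity('GetInputFileMetadata').output.exists for validation.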

25. List any 5 types of data sources that Azure Data Factory
supports.
Azure supports the following data sources:

Azure Blob Storage: Azure Blob Storage is a cloud storage solution for storing large-scale unstructured data.
Azure SQL Database: It is a managed, secured, intelligent service
that uses the SQL Server Database engine in ADF.
Azure Data Lake Storage: It is a service that can store data of any
size, shape, and speed and perform all kinds of processing and
analytics across platforms and languages.
Azure Cosmos DB: It is a service that works entirely on NoSQL and
relational databases for modern app development.
Azure Table Storage: It is a service used for storing structured
NoSQL data; it provides a key/attribute with no schema design.

26. How can one set up data sources and destinations in Azure
Data Factory?
To connect with a data source or destination, one needs to set up a linked service. A linked service is a configuration containing the connection information required to connect to a data source or destination. The following steps show how to set up a linked service:

Navigate to your Azure Data Factory instance in the Azure portal.
Select "Author and Monitor" to open the ADF UI.
From the left-hand menu, select Connections and create a new linked service.
Choose the type of data source you want to connect with: Azure Blob Storage, Azure SQL Database, Amazon S3, etc.
Configure and test the connection.

27. How can one set up a linked service?


To set up a linked service, follow the steps below:

1. Click the "Author & Monitor" tab in the ADF portal.
2. Next, click the "Author" button to launch the ADF authoring interface.
3. Click the "Linked Services" tab to create a new linked service.
4. Select the type of service corresponding to the data source or destination you want to connect with.
5. Provide the connection information, such as server name, database name, and credentials.
6. Test the connection to ensure it works.
7. Save the linked service.
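For illustration only (the server, database, and credential placeholders are hypothetical), the resulting linked service for an Azure SQL Database source is stored as a JSON definition along these lines:

```json
{
  "name": "AzureSqlDatabaseLS",
  "properties": {
    "type": "AzureSqlDatabase",
    "typeProperties": {
      "connectionString": "Server=tcp:<server>.database.windows.net,1433;Database=<database>;User ID=<user>;Password=<password>;Encrypt=true;"
    }
  }
}
```

In practice the password would normally be referenced from Azure Key Vault rather than stored inline.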

28. What is a Synapse workspace, and where is it required?


Azure Synapse Analytics (its data warehousing component was previously called Azure SQL Data Warehouse) is a service that manages and integrates enterprise data warehousing, big data analytics, and data integration capabilities in a single platform; a Synapse workspace is the environment in which these capabilities are created and managed. It supports role-based access control (RBAC), encryption, and auditing capabilities to ensure data protection and compliance with regulatory requirements.

Use cases:

1. Collaboration on analytical projects by data engineers, data scientists, and business analysts, leveraging its capabilities for data querying, analysis, and visualization.
2. Analyzing and visualizing data, creating reports and dashboards, and gaining insight into business performance and trends, supporting business intelligence and reporting needs.

29. What are the general connector errors in Azure Data Factory? Mention their causes.


The general connector errors are:

1. UserErrorOdbcInvalidQueryString

Cause: The user submits a wrong or invalid query when fetching the data or schemas.

2. FailedToResolveParametersInExploratoryController

Cause: This error arises from a limitation: a linked service that references another linked service with parameters is not supported for test connection or data preview.

Advanced Azure Data Factory Interview Questions for Experienced

30. Explain the two levels of security in ADLS Gen2.


The two levels of security applicable to ADLS Gen2 were also in effect for ADLS Gen1. Even though this is not new, it is worth calling out the two levels of security because they are a fundamental piece of getting started with the data lake, and they are confusing for many people starting out.

Role-Based Access Control (RBAC): RBAC includes built-in Azure roles such as Reader, Contributor, and Owner, as well as custom roles. Typically, RBAC is assigned for two reasons: one is to specify who can manage the service itself (i.e., update settings and properties of the storage account); another is to permit the use of built-in data explorer tools, which require reader permissions.
Access Control Lists (ACLs): Access control lists specify exactly which data objects a user may read, write, or execute (execute is required to browse the directory structure). ACLs are POSIX-compliant and thus familiar to those with a Unix or Linux background.

POSIX does not operate on a security inheritance model, which means that access ACLs are specified for every object. The concept of default ACLs is critical for new files within a directory to obtain the correct security settings, but it should not be thought of as inheritance. Because of the overhead of assigning ACLs to every object, and because there is a limit of 32 ACLs per object, it is extremely important to manage data-level security in ADLS Gen1 or Gen2 via Azure Active Directory groups.

31. How is the performance of pipelines optimized in Azure Data Factory?

Optimizing the performance of Azure Data Factory pipelines involves
strategically enhancing data movement, transformation, and overall
pipeline execution. Some ways to optimize performance are:

1. Choose the appropriate integration runtime for data movement activities based on the location of the data source and destination; integration runtimes help optimize performance by providing compute resources closer to the data.
2. Use parallelism, for example by breaking data into smaller chunks and processing them in parallel across pipelines or within data flow activities.
3. In mapping data flows, minimize unnecessary transformations and data shuffling.

32. What are triggers in ADF, and how can they be used to automate pipeline executions? What is their significance in pipeline development?
In ADF, triggers are components that enable the automated execution of pipeline activities based on predefined conditions or schedules. Triggers play a crucial role in orchestrating data workflows and in automating data integration and transformation tasks within ADF.

Significance of triggers in pipeline development:

1. Automation: Triggers enable automated execution of pipelines, eliminating manual intervention in scheduling and running tasks.
2. Scheduling: Schedule triggers let users define recurring schedules for pipeline execution, ensuring that tasks are performed and data is integrated at regular intervals.
3. Event-Driven Architecture: Event triggers enable an event-driven architecture in ADF, where pipelines are triggered in response to specific data events or business events.

33. How many types of datasets are supported in ADF?


The datasets supported in ADF are as follows:

CSV
Excel
Binary
Avro
JSON
ORC
XML
Parquet

34. What are the prerequisites for Data Factory SSIS execution?

The prerequisites include an Azure SQL managed instance or an Azure
SQL Database for hosting the SSIS IR and SSISDB catalog.

35. What are the differences between the transformation procedures called mapping data flows and wrangling data flows in ADF?

Mapping data flows let you design and transform data graphically: the user builds data transformation logic in a graphical user interface without needing a professional programmer, which eventually makes it cost-effective.

Wrangling data flows, on the contrary, are a code-free method of data preparation: they give the user the data manipulation capabilities of Power Query M, executed at scale on Spark, because the experience is compatible with Power Query Online.

36. What is an ARM template? Where is it used?


ARM stands for Azure Resource Manager. In ADF, an ARM template is a JSON file that allows the user to define and deploy Azure infrastructure, covering not only virtual machines but also storage systems and other resources.
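As a minimal, hedged sketch (the parameter name is hypothetical), an ARM template that deploys an empty data factory looks like this:

```json
{
  "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
  "contentVersion": "1.0.0.0",
  "parameters": {
    "factoryName": { "type": "string" }
  },
  "resources": [
    {
      "type": "Microsoft.DataFactory/factories",
      "apiVersion": "2018-06-01",
      "name": "[parameters('factoryName')]",
      "location": "[resourceGroup().location]",
      "identity": { "type": "SystemAssigned" }
    }
  ]
}
```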

37. Mention a few functionalities of the ARM template in ADF.


ARM templates provide a number of functions that can be used in deployments to a resource group, subscription, or management group, such as:

CIDR functions
Array functions
Comparison functions
Resource functions
Deployment value functions
Subscription scope functions
Logical functions

38. What are the functions in ARM with respect to CIDR and comparison functions?

A few of these ARM template functions are as follows:

CIDR functions: These belong to the sys namespace and include:

parseCidr
cidrSubnet
cidrHost

Comparison Functions: These help compare values within templates and include:

coalesce
equals
less
lessOrEquals
greater
greaterOrEquals

39. What is a Bicep in ADF?


Bicep is a domain-specific language that uses declarative syntax to deploy Azure resources. A Bicep file describes the infrastructure to be deployed to Azure, and the file is used throughout the development lifecycle to repeatedly deploy that infrastructure.

40. What is ETL in ADF?


ETL stands for Extract, Transform, and Load process. It is a data
pipeline used for the collection of data from various resources. The data
is then transformed according to the business rules. After transforming
the data, it is then loaded into the destination data store.

The data transformation is based on filtering, sorting, aggregating, joining, cleaning, deduplicating, and validating the data.

41. What is a V2 data factory?


V2 in ADF is Azure Data Factory version 2, which allows one to
create and schedule pipelines, which are data-driven workflows that
can ingest data from disparate data stores.
It processes or transforms data by using compute services such as
Azure HDInsight Hadoop, Spark, and Azure Data Lake Analytics.
It publishes output data to data stores such as Azure SQL Data
Warehouse for BI applications to consume.

42. What are the three most important tasks that you can
complete with ADF?
The three most important tasks that you can complete with ADF are data movement, data transformation, and control.

1. Data movement activities facilitate the flow of data from one data store to another using the Data Factory copy activity.
2. Data transformation activities modify the loaded data as it moves toward its final destination. Some examples are stored procedure, U-SQL, and Azure Functions activities.
3. Control activities govern the flow of the pipeline itself, for example through conditions, loops, and the sequencing of other activities.

43. How is Azure Data Factory different from Databricks?


Azure Data Factory mainly excels at ETL (Extract, Transform, Load) workflows, streamlining data movement and data transformation. Databricks, which is built on Apache Spark, is focused on advanced analytics and large-scale data processing.

44. What do you mean by Azure SSIS integration runtime?


The cluster of virtual machines hosted in Azure and dedicated to running SSIS packages in the data factory is termed the Azure-SSIS integration runtime (IR). The size of the nodes can be configured to scale up, while the number of nodes in the virtual machine cluster can be configured to scale out.

45. What is Data Flow Debug?


Data Flow Debug is a feature that Azure Data Factory provides to developers, allowing them to observe and analyze the transformations applied to the data while a data flow is being built, during both the design and debugging phases. It gives the user real-time feedback on the shape of the data at each step of execution as it flows through the pipeline.

46. How are email notifications sent on a Pipeline Failure?


There are multiple options for sending an email notification to the developer in case of a pipeline failure:

1. Logic App with a Web/Webhook activity: A Logic App can be configured so that, upon receiving an HTTP request, it quickly notifies the required set of people about the failure (see the sketch below).
2. Alerts and Metrics: These can be set up on the pipeline itself, where a number of options are available to send an email when a failure is detected.
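A rough sketch of option 1 (the Logic App URL and the upstream activity name are hypothetical): a Web activity that runs only when a preceding activity fails and posts the run details to the Logic App's HTTP endpoint:

```json
{
  "name": "NotifyOnFailure",
  "type": "WebActivity",
  "dependsOn": [
    { "activity": "CopyCarsData", "dependencyConditions": [ "Failed" ] }
  ],
  "typeProperties": {
    "url": "https://<logic-app-http-trigger-url>",
    "method": "POST",
    "body": {
      "pipeline": "@{pipeline().Pipeline}",
      "runId": "@{pipeline().RunId}",
      "message": "Pipeline run failed"
    }
  }
}
```

The Logic App would then format and send the actual email to the required recipients.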

47. What do you mean by an Azure SQL database?


Azure SQL Database is an integral part of the Azure family: a fully managed, secure, and intelligent service that uses the SQL Server database engine to store data in the Azure cloud.

48. What do you understand from a data flow map?


A data flow map, also called a data flow diagram (DFD), depicts the flow of data inside a system or organization. It shows the movement of data from one process or entity to another, highlighting the sources and destinations of the data and the transformations it undergoes along the way. Data flow diagrams come in handy in system analysis and design for visualizing and understanding how data flows.

49. What is the capability of Lookup Activity in ADF?


Lookup Activity can retrieve a dataset from any of the data sources
supported by the data factory and synapse pipelines. Some capabilities
of lookup are:

It can return up to 5000 rows at once; if there are more than 5000
rows, it will return the first 5000 data values.
The supported activity size for lookup is 4 MB; the activity fails if
the size exceeds the limit.
The longest duration before timeout for the Lookup activity is 24 hours.
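A brief sketch (dataset, table, and column names are hypothetical) of a Lookup activity that reads a single watermark value from Azure SQL Database:

```json
{
  "name": "LookupWatermark",
  "type": "Lookup",
  "typeProperties": {
    "source": {
      "type": "AzureSqlSource",
      "sqlReaderQuery": "SELECT MAX(ModifiedDate) AS Watermark FROM dbo.Cars"
    },
    "dataset": { "referenceName": "CarsSqlDataset", "type": "DatasetReference" },
    "firstRowOnly": true
  }
}
```

Downstream activities can then reference the result with @activity('LookupWatermark').output.firstRow.Watermark.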
https://www.linkedin.com/in/madhumitha-podishetty-842561119/
