Azure Data Engineer Interview Questions

1. Difference between RDD, DataFrame, and Dataset. How and what have you used in Databricks for data analysis?

2. What are the key components in ADF? Which of them have you used in your pipeline?

3. Do you create any encryption key in Databricks? Cluster size in Databricks.

4. Difference between ADLS Gen1 and Gen2?

https://beetechnical.com/cloud-computing/azure-data-lake-storageadls-gen1-vs-gen2-complete-guide-2021/

5. What is a semantic layer?

6. How do you choose a cluster to process the data? What are Azure services?

7. How do you create mount points? How do you load a data source into ADLS?

8. What are accumulators? What are groupByKey and reduceByKey?

9. What is the Spark architecture? What is Azure SQL?

10. What is serialization? What is a broadcast join?

11. What is a DAG? What is an RDD?

12. How do you connect Databricks with a storage account?

13. Online hackathon test covering Azure ADF, Databricks, Spark, and Python (enough for big data), plus advanced SQL (analytical functions).

14. Case-study scenario questions.

15. Explain the Copy activity in ADF, slowly changing dimensions, and data warehousing.

16. Azure Databricks cluster types; pipeline trigger types.

17. Difference between OLAP and OLTP.

18. Difference between DataFrame and RDD.

19. On-premises Oracle server with a daily incremental load of 10 GB of data. How do you move it to the cloud using Azure?

20. How do you design/implement database solutions in the cloud?

21. How would you convince a client to migrate to the cloud?

22. Asked about Azure basics and Data Factory.

23. Asked about SQL basics.

24. How to read a Parquet file, how to call a notebook from ADF, Azure DevOps CI/CD process, system variables in ADF.

25. Whole architecture of an Azure project, ETL solutions using ADF and ADB, a few SQL queries, Azure SQL pool.
More Interview Questions:
https://azurede.com/2021/05/29/azure-data-engineering-interview-questions/

https://www.vcubesoftsolutions.com/azure-data-engineers-interview-questions/

Azure Data Engineer Interview Questions


1. What is Data Engineering?
Data engineering focuses on the collection and analysis of data. The information gathered from numerous sources is merely raw data. Data engineering helps transform this unusable data into useful information. In a nutshell, it is the process of transforming, cleansing, profiling, and aggregating large data sets.

2. What is Azure Synapse analytics?


Azure Synapse is an enterprise analytics service that accelerates time to insight across data warehouses and big data systems. Azure Synapse combines the best of the SQL (Structured Query Language) technologies used in enterprise data warehousing, Spark technologies used for big data, Pipelines for data integration and ETL/ELT, and deep integration with other Azure services such as Power BI, Cosmos DB, and Azure ML.

3. Explain the data masking feature of Azure?


Data masking helps prevent unauthorized access to sensitive data by letting customers specify how much of the sensitive data to reveal, with minimal impact on the application layer. Dynamic data masking limits sensitive data exposure by masking it for non-privileged users. It is a policy-based security feature that hides sensitive data in the result set of a query over designated database fields, while the data in the database itself is not changed.
A few data masking policies are:

👉SQL users excluded from masking - A set of SQL users or Azure Active Directory identities that get unmasked data in the SQL query results. Users with administrator privileges are always excluded from masking and see the original data without any mask.

👉Masking rules - A set of rules defining the designated fields to be masked and the
masking function used. The selected fields can be determined using a database
schema, table, and column names.

👉Masking functions - A set of methods that control data exposure for different
scenarios.

4. Difference between Azure Synapse Analytics and Azure Data Lake Storage?
Azure Synapse Analytics:
 It is optimized for processing structured data with a well-defined schema.
 Built on SQL (Structured Query Language) Server technology.
 Has built-in data pipelines and data streaming capabilities.
 Compliant with regulatory standards.
 Used for business analytics.

Azure Data Lake:
 It is optimized for storing and processing both structured and unstructured data.
 Built to work with Hadoop.
 Handles data streaming using Azure Stream Analytics.
 No built-in regulatory compliance.
 Used for data analytics and exploration by data scientists and engineers.

5. Describe various windowing functions of Azure Stream Analytics?


A window in Azure Stream Analytics is a block of time-stamped events that enables users to perform various operations on the event data. To analyze and partition event streams, Azure Stream Analytics provides four windowing functions:

🌼 Hopping Window: In these windows, the data segments can overlap. To define a hopping window, we need to specify two parameters:
 Hop (how far each window moves forward in time)
 Window size (length of the data segment)

🌼 Tumbling Window: The data stream is segmented into distinct, non-overlapping time segments of fixed length.

🌼 Session Window: This function groups events based on arrival time, so there is no
fixed window size. Its purpose is to eliminate quiet periods in the data stream.
🌼 Sliding Window: This windowing function does not necessarily produce aggregation
after a fixed time interval, unlike the tumbling and hopping window functions.
Aggregation occurs every time an existing event falls out of the time window, or a new
event occurs.
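The same windowing ideas can be illustrated with Spark Structured Streaming in a Databricks notebook (relevant to the Databricks questions above). This is a minimal PySpark sketch as an analogy, not Azure Stream Analytics syntax; the rate source and the 10-second/5-second durations are arbitrary choices for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import window, col

spark = SparkSession.builder.appName("windowing-demo").getOrCreate()

# A toy streaming source that emits (timestamp, value) rows.
events = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

# Tumbling window: fixed, non-overlapping 10-second segments.
tumbling = events.groupBy(window(col("timestamp"), "10 seconds")).count()

# Hopping-style window: 10-second windows that start every 5 seconds, so they overlap.
hopping = events.groupBy(window(col("timestamp"), "10 seconds", "5 seconds")).count()

# Spark 3.2+ also offers session_window() for session-style grouping.

# Write one of the aggregations to the console to inspect the window boundaries.
query = (tumbling.writeStream
         .outputMode("complete")
         .format("console")
         .start())
```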

6. What are the different storage types in Azure?


Azure provides the following storage types:

Files: Azure Files is an organized way of storing data in the cloud. The main advantage of Azure Files over Azure Blobs is that Azure Files allows organizing the data in a folder structure. Azure Files is also SMB (Server Message Block) protocol compliant, so it can be used as a file share.

Blobs: Blob stands for binary large object. This storage solution supports all kinds of files, including text files, videos, images, documents, binary data, etc.

Queues: Azure Queue is a cloud-based messaging store for establishing and brokering communication between various applications and components.

Disks: Azure Disks are used as the storage solution for Azure VMs (Virtual Machines).

Tables: Tables are NoSQL storage structures for storing structured data that does not meet the standard RDBMS (relational database) schema.

7. What are the different security options available in the Azure SQL database?
Security plays a vital role in databases. Some of the security options available in the
Azure SQL database are:

🌸 Azure SQL Firewall Rules: Azure provides two levels of security. Server-level firewall rules are stored in the SQL master database and determine access to the Azure database server. Users can also create database-level firewall rules that govern access to the individual databases.

🌸 Azure SQL TDE (Transparent Data Encryption): TDE is the technology used to
encrypt stored data. TDE is also available for Azure Synapse Analytics and Azure SQL
Managed Instances. With TDE, the encryption and decryption of databases, backups,
and transaction log files, happens in real-time.

🌸 Always Encrypted: It is a feature designed to protect sensitive data stored in the


Azure SQL database, such as credit card numbers. This feature encrypts data within
the client applications using Always Encrypted-enabled driver. Encryption keys are not
shared with SQL Database, which means database admins do not have access to
sensitive data.
🌸 Database Auditing: Azure provides comprehensive auditing capabilities along with
the SQL Database. It is also possible to declare the audit policy at the individual
database level, allowing users to choose based on the requirements.

8. How data security is implemented in Azure Data Lake Storage(ADLS) Gen2?


Data security is one of the primary concerns for most organizations for moving data to
cloud storage. Azure data lake storage gen2 provides a multi-layered and robust
security model. This model has 6 data security layers:

🔯 Authentication: The first layer covers user account security. ADLS Gen2 provides three authentication modes: Azure Active Directory (AAD), Shared Access Signature (SAS), and Shared Key.

🔯 Access Control: The next layer restricts access to individual containers or files. This can be managed using roles and Access Control Lists (ACLs).

🔯 Network Isolation: This layer enables administrators to manage access by disabling access, or by allowing access only from particular virtual networks (VNets) or IP addresses.

🔯 Data Protection: This is achieved by encrypting in-transit data using


HTTPS(Hypertext Transfer Protocol Secure). Options to encrypt stored data are also
available.

🔯 Advanced Threat Protection: If enabled, ADLS Gen2 will monitor any unauthorized
attempts to access or exploit the storage account.

🔯 Auditing: This is the sixth and final layer of security. ADLS Gen2 provides
comprehensive auditing features in which all account management activities are logged.
These logs can be later reviewed to ensure the highest level of security.
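As a concrete illustration of the first two layers (authentication and access control), here is a minimal Python sketch using the azure-identity and azure-storage-file-datalake packages. The storage account name, container, and folder are placeholders, and what the call returns depends on the RBAC roles and ACLs granted to the caller.

```python
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

# Layer 1 (Authentication): authenticate with Azure Active Directory via
# DefaultAzureCredential; SAS tokens or shared keys are the other options.
credential = DefaultAzureCredential()

service = DataLakeServiceClient(
    account_url="https://<storage-account>.dfs.core.windows.net",  # placeholder account
    credential=credential,
)

# Layer 2 (Access Control): the caller only sees paths that RBAC roles and ACLs allow.
file_system = service.get_file_system_client("raw")   # placeholder container name

for path in file_system.get_paths(path="sales"):      # placeholder folder
    print(path.name, path.is_directory)
```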

9. What are the various data flow partition schemes available in Azure?

Round Robin: The most straightforward partition scheme; it spreads data evenly across partitions. Use it when no good key candidates are available in the data.

Hash: A hash of columns creates uniform partitions such that rows with similar values fall in the same partition. After applying it, check for partition skew.

Dynamic Range: Spark determines dynamic ranges based on the provided columns or expressions. Select the column that will be used for partitioning.

Fixed Range: A fixed range of values based on a user-created expression for distributing data across partitions. A good understanding of the data is required to avoid partition skew.

Key: A partition is created for each unique value in the selected column. A good understanding of the data cardinality is required.
10. Why Azure data factory is needed?
The amount of data generated these days is vast, coming from different sources. When
we move this particular data to the cloud, a few things must be taken care of-

🏭 Data can come from different sources in different forms, and these various sources transfer or channel the data in different ways and in different formats. When we bring this data to the cloud or to particular storage, we need to make sure it is well managed, i.e., we need to transform the data and delete unnecessary parts. As far as moving the data is concerned, we need to make sure that data is picked up from the different sources, brought to one common place, stored, and, if required, transformed into something more meaningful.

🏭 A traditional data warehouse can also do this, but certain disadvantages exist.
Sometimes we are forced to go ahead and have custom applications that deal with all
these processes individually, which is time-consuming, and integrating all these sources
is a huge pain.

🏭 A data factory helps to orchestrate this complete process into a more manageable or
organizable manner.

11. What do you mean by data modeling?


Data Modeling is the process of creating a visual representation of an entire information system, or parts of it, to express the relationships between data points and structures. The purpose is to show
the many types of data used and stored in the system, the relationships between them,
how the data can be classified and arranged, and its formats and features. Data can be
modeled according to the needs and requirements at various degrees of abstraction.
The process begins with stakeholders and end-users providing information about
business requirements. These business rules are then converted into data structures to
create a concrete database design.
There are two design schemas available in data modeling:
 Star Schema
 Snowflake Schema

12. What is the difference between Snowflake and Star Schema?


Both are multidimensional models of the data warehouses. The main differences are:
Snowflake Schema:
 It contains fact tables, dimension tables, and sub-dimension tables.
 It is a type of bottom-up model.
 It uses both normalization and denormalization.
 Data redundancy is lower.
 The design is complex.
 Execution time for queries is higher.
 It uses less space.

Star Schema:
 It contains fact and dimension tables.
 It is a type of top-down model.
 It does not use normalization.
 Data redundancy is higher.
 The design is straightforward.
 Execution time for queries is lower.
 It uses more space.

13. What are the 2 levels of security in Azure data lake storage Gen2?
The two levels of security available in Azure Data Lake Storage Gen2 also apply to Azure Data Lake Gen1. Although this is not new, it is worth calling out these two levels of security because they are a fundamental piece of getting started with the Azure data lake.
The two levels of security are defined as:

🦉 Role-Based Access Control (RBAC): RBAC includes built-in Azure roles such as
reader, owner, contributor, or custom. Typically, RBAC is assigned due to two reasons.
One is to permit the use of built-in data explorer tools that require reader permissions.
Another is to specify who can manage the service (i.e., update properties and settings
for the storage account).

🦉 Access Control Lists (ACLs): ACLs specify exactly which data objects a user may write,
read, and execute (execution is required for browsing the directory structure). ACLs are
POSIX (Portable Operating System Interface) - compliant, thus familiar to those with a
Linux or Unix background.

14. Explain a few important concepts of the Azure data factory?

🎈Pipeline: It acts as a carrier for the various processes taking place; an individual process is considered an activity.

🎈Activities: It represents the processing steps of a pipeline. A pipeline can have one or
many activities. It can be a process like moving the dataset from one source to another
or querying a data set.

🎈Datasets: It is the source of data or, we can say it is a data structure that holds our
data.

🎈Linked services: It stores information that is very important when connecting to an


external source.

15. Differences between Azure data lake analytics and HDInsight?


Azure Data Lake Analytics:
 It is software as a service.
 It creates the required compute nodes on demand, as instructed, and processes the dataset.
 It does not give much flexibility in provisioning or managing the cluster.

HDInsight:
 It is a platform as a service.
 The cluster is configured with predefined nodes, and a language like Hive or Pig is then used for data processing.
 It provides more flexibility, as we can create and control the cluster according to our choice.

16. Explain the process of creating ETL(Extract, Transform, Load)?


The process of creating ETL are:

🎀 Build a Linked Service for source data store (SQL Server Database). Suppose that we
have a cars dataset.

🎀 Formulate a Linked Service for the destination data store, which is Azure Data Lake Store.

🎀 Build a dataset for Data Saving.

🎀 Formulate the pipeline and attach copy activity.

🎀 Schedule the pipeline by attaching a trigger.

17. What is Azure Synapse Runtime?


Apache Spark pools in Azure Synapse use runtimes to tie together essential component
versions, Azure Synapse optimizations, packages, and connectors with a specific
Apache Spark version. These runtimes will be upgraded periodically to include new
improvements, features, and patches.
These runtimes have the following advantages:
 Faster session startup times.
 Tested compatibility with specific Apache Spark versions.
 Access to popular, compatible connectors and open-source packages.

18. What is SerDe in the hive?


Serializer/Deserializer is popularly known as SerDe. For IO(Input/Output), Hive employs
the SerDe protocol. Serialization and deserialization are handled by the interface, which
also interprets serialization results as separate fields for processing.
The Deserializer turns a record into a Hive-compatible Java object. The Serializer now
turns this Java object into an HDFS (Hadoop Distributed File System) -compatible
format. The storage role is then taken over by HDFS. Anyone can create their own
SerDe for their own data format.
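For example, a built-in or custom SerDe is referenced in the table's ROW FORMAT clause. The sketch below issues HiveQL from PySpark (so it stays in Python) and uses the built-in OpenCSVSerde; the table name, columns, and properties are illustrative, and Hive support must be enabled on the Spark session.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("serde-demo")
         .enableHiveSupport()   # SerDes are a Hive feature, so Hive support is required
         .getOrCreate())

# OpenCSVSerde deserializes each CSV line into columns on read
# and serializes rows back to CSV text on write.
spark.sql("""
    CREATE TABLE IF NOT EXISTS raw_trips (
        trip_id STRING,
        fare    STRING
    )
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
    WITH SERDEPROPERTIES ('separatorChar' = ',', 'quoteChar' = '"')
    STORED AS TEXTFILE
""")

spark.sql("SELECT * FROM raw_trips LIMIT 10").show()
```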

19. What are the different types of integration runtime?


🎃 Azure Integration Runtime: It can copy data between cloud data stores and dispatch activities to a variety of compute services, such as SQL Server or Azure HDInsight, where the transformation takes place.

🎃 Self-Hosted Integration Runtime: It is software with essentially the same code as the Azure Integration Runtime, except that you install it on an on-premises machine or on a virtual machine inside a virtual network. A self-hosted IR can run copy activities between a data store in a private network and a public cloud data store.

🎃 Azure-SSIS Integration Runtime: With this, one can natively execute SSIS (SQL Server Integration Services) packages in a managed environment. So when we lift and shift SSIS packages to the data factory, we use the Azure-SSIS Integration Runtime.

20. Mention some common applications of Blob storage?


Common uses of Blob Storage include:

🔔 Serving images or documents directly to a browser.

🔔 Storing files for shared access.

🔔 Streaming audio and video.

🔔 Storing data for backup and restore, disaster recovery, and archiving.

🔔 Storing data for analysis by an on-premises or Azure-hosted service.

21. What are the main characteristics of Hadoop?

⚽ It is an open-source framework that is freely available.

⚽ Hadoop is compatible with many types of hardware, and it is easy to add new hardware within a particular node.

⚽ It supports faster distributed processing of data.

⚽ It stores data in the cluster, independently of the rest of the operations.

⚽ Hadoop creates replicas of every data block on separate nodes.

22. What is the Star scheme?


Star Join Schema, or Star Schema, is the simplest type of Data Warehouse schema. It is called a star schema because its structure resembles a star: at the center of the star there is one fact table, surrounded by various connected dimension tables. This schema is used for querying large data sets.

23. How would you validate data moved from one database to another?
The integrity of the data, and guaranteeing that no data is dropped, should be of the highest priority for a data engineer. Hiring managers ask this question to understand your thought process on how data validation would be carried out.
The candidate should be able to talk about appropriate validation approaches for different situations. For example, you could suggest that validation could be a simple comparison, or that it could happen after the complete data migration.

24. Discriminate between structured and unstructured data?


Storage: Structured data is stored in a DBMS (Database Management System); unstructured data is kept in unmanaged file structures.

Standards: Structured data uses standards such as ADO.NET, ODBC, and SQL; unstructured data uses standards such as SMTP, XML, CSV, and SMS.

Scaling: Schema scaling is hard for structured data and easy for unstructured data.

Integration tool: Structured data is integrated using ETL (Extract, Transform, Load) tools; unstructured data relies on manual data entry or batch processing that incorporates code.

25. What do you mean by data pipeline?


A data pipeline is a system for transporting data from one location (the source) to
another (the destination), such as a data warehouse. Data is converted and optimized
along the journey, and it eventually reaches a state that can be evaluated and used to
produce business insights. The procedures involved in aggregating, organizing, and
transporting data are referred to as a data pipeline. Many of the manual tasks needed in
processing and improving continuous data loads are automated by modern data
pipelines.

1. What is Azure Data Factory?


Azure Data Factory is a cloud-based, fully managed, serverless ETL and data integration service offered by Microsoft Azure. It automates data movement from where the data natively resides to, say, a data lake or a data warehouse, using ETL (extract-transform-load) or ELT (extract-load-transform). It lets you create and run data pipelines that move and transform data, and run them on a schedule.

2. Is Azure Data Factory ETL or ELT tool?


It is a cloud-based Microsoft tool that provides a cloud-based integration service for data
analytics at scale and supports ETL and ELT paradigms.
3. Why is ADF needed?
With an increasing amount of big data, there is a need for a service like ADF that can
orchestrate and operationalize processes to refine the enormous stores of raw business
data into actionable business insights.

4. What sets Azure Data Factory apart from conventional ETL tools?
Azure Data Factory stands out from other ETL tools as it provides: -

i) Enterprise Readiness: Data integration at Cloud Scale for big data analytics!

ii) Enterprise Data Readiness: There are 90+ connectors supported to get your data
from any disparate sources to the Azure cloud!

iii) Code-Free Transformation: UI-driven mapping dataflows.

iv) Ability to run Code on Any Azure Compute: Hands-on data transformations

v) Ability to rehost on-prem services on Azure Cloud in 3 Steps: Many SSIS packages
run on Azure cloud.

vi) Making DataOps seamless: with Source control, automated deploy & simple
templates.

vii) Secure Data Integration: Managed virtual networks protect against data exfiltration,
which, in turn, simplifies your networking.

Data Factory contains a series of interconnected systems that together provide a complete end-to-end platform for data engineers.
5. What are the major components of a Data Factory?
To work with Data Factory effectively, one must be aware of the below concepts/components associated with it: -

i) Pipelines: Data Factory can contain one or more pipelines, which is a logical grouping
of tasks/activities to perform a task. e.g., An activity can read data from Azure blob
storage and load it into Cosmos DB or Synapse DB for analytics while transforming the
data according to business logic.

This way, one can work with a set of activities using one entity rather than dealing with
several tasks individually.

ii) Activities: Activities represent a processing step in a pipeline. For example, you might
use a copy activity to copy data between data stores. Data Factory supports data
movement, transformations, and control activities.

iii) Datasets: Datasets represent data structures within the data stores, which simply
point to or reference the data you want to use in your activities as inputs or outputs.

iv) Linked service: This is more like a connection string; it holds the information Data Factory needs to connect to external resources. For example, when reading from Azure Blob storage, the storage linked service specifies the connection string to connect to the blob, and the Azure Blob dataset selects the container and folder containing the data.

v) Integration Runtime: Integration runtime instances provide the bridge between the
activity and linked Service. It is referenced by the linked service or activity and provides
the computing environment where the activity either runs on or gets dispatched. This
way, the activity can be performed in the region closest to the target data stores or
compute service in the most performant way while meeting security (no exposing of
data publicly) and compliance needs.

vi) Data Flows: These are objects you build visually in Data Factory, which transform
data at scale on backend Spark services. You do not need to understand programming
or Spark internals. Just design your data transformation intent using graphs (Mapping)
or spreadsheets (Power query activity).

Refer to the documentation for more


details: https://docs.microsoft.com/en-us/azure/data-factory/frequently-asked-questions

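To make the relationship between these components concrete, here is a minimal sketch using the azure-mgmt-datafactory Python SDK that defines a pipeline containing one copy activity wired to two pre-existing datasets (which in turn reference linked services). All resource names are placeholders, and model constructors can differ slightly between SDK versions, so treat this as illustrative rather than copy-paste ready.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    PipelineResource, CopyActivity, DatasetReference, BlobSource, BlobSink
)

subscription_id = "<subscription-id>"                      # placeholder
rg_name, df_name = "<resource-group>", "<data-factory>"    # placeholders

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

# The datasets (and the linked services they reference) are assumed to exist already.
source_ds = DatasetReference(reference_name="InputBlobDataset")
sink_ds = DatasetReference(reference_name="OutputBlobDataset")

# One copy activity reading from a blob source and writing to a blob sink.
copy_activity = CopyActivity(
    name="CopyBlobToBlob",
    inputs=[source_ds],
    outputs=[sink_ds],
    source=BlobSource(),
    sink=BlobSink(),
)

# The pipeline is a logical grouping of activities.
pipeline = PipelineResource(activities=[copy_activity])
adf_client.pipelines.create_or_update(rg_name, df_name, "DemoPipeline", pipeline)
```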
6. What are the different ways to execute pipelines in Azure Data Factory?
There are three ways in which we can execute a pipeline in Data Factory:

i) Debug mode can be helpful when trying out pipeline code and acts as a tool to test
and troubleshoot our code.

ii) Manual Execution is what we do by clicking on the ‘Trigger now’ option in a pipeline.
This is useful if you want to run your pipelines on an ad-hoc basis.

iii) We can schedule our pipelines at predefined times and intervals via a Trigger. As we
will see later in this article, there are three types of triggers available in Data Factory.
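For reference, manual execution ("Trigger now") also has a programmatic equivalent. The sketch below (azure-mgmt-datafactory, with placeholder names and an existing pipeline assumed) starts a run on demand and then checks its status.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
rg_name, df_name = "<resource-group>", "<data-factory>"   # placeholders

# Kick off an on-demand ("Trigger now") run of an existing pipeline.
run_response = adf_client.pipelines.create_run(rg_name, df_name, "DemoPipeline", parameters={})

# Poll the run status: "InProgress", "Succeeded", "Failed", etc.
pipeline_run = adf_client.pipeline_runs.get(rg_name, df_name, run_response.run_id)
print(pipeline_run.status)
```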


7. What is the purpose of Linked services in Azure Data Factory?


Linked services are used majorly for two purposes in Data Factory:

1. For a Data Store representation, i.e., any storage system like Azure Blob storage
account, a file share, or an Oracle DB/ SQL Server instance.

2. For Compute representation, i.e., the underlying VM will execute the activity
defined in the pipeline.

8. Can you Elaborate more on Data Factory Integration Runtime?


The Integration Runtime or IR is the compute infrastructure for Azure Data Factory
pipelines. It is the bridge between activities and linked services. It's referenced by the
linked service or activity and provides the compute environment where the activity is run
directly or dispatched. This allows the activity to be performed in the closest region to
the target data stores or compute service.

The following diagram shows the location settings for Data Factory and its integration
runtimes:

Source:docs.microsoft.com/en-us/azure/data-factory/concepts-integration-runtime

There are three types of integration runtime supported by Azure Data Factory, and one
should choose based on their data integration capabilities and network environment
requirements.

1. Azure Integration Runtime: To copy data between cloud data stores and send
activity to various computing services such as SQL Server, Azure HDInsight, etc.

2. Self-Hosted Integration Runtime: Used for running copy activity between cloud
data stores and data stores in the private networks. Self-hosted integration
runtime is software with the same code as the Azure Integration Runtime, but it is
installed on your local system or virtual machine over a virtual network.

3. Azure SSIS Integration Runtime: It allows you to run SSIS packages in a


managed environment. So, when we lift and shift SSIS packages to the data
factory, we use Azure SSIS Integration Runtime.
9. What is required to execute an SSIS package in Data Factory?
We need to create an SSIS integration runtime and an SSISDB catalog hosted in an Azure SQL Database or an Azure SQL Managed Instance before we can execute an SSIS package.

10. What is the limit on the number of Integration Runtimes, if any?


Within a Data Factory, the default limit on any entities is set to 5000, including pipelines,
data sets, triggers, linked services, Private Endpoints, and integration runtimes. One
can create an online support ticket to raise the limit to a higher number if required.

Refer to the documentation for more


details: https://docs.microsoft.com/en-us/azure/azure-resource-manager/management/
azure-subscription-service-limits#azure-data-factory-limits

11. What are ARM Templates in Azure Data Factory? What are they used for?
An ARM template is a JSON (JavaScript Object Notation) file that defines the
infrastructure and configuration for the data factory pipeline, including pipeline activities,
linked services, datasets, etc. The template will contain essentially the same code as
our pipeline.

ARM templates are helpful when we want to migrate our pipeline code to higher
environments, say Production or Staging from Development, after we are convinced
that the code is working correctly.

12. How can we deploy code to higher environments in Data Factory?


At a very high level, we can achieve this with the below set of steps:

1. Create a feature branch that will store our code base.

2. Create a pull request to merge the code into the Dev branch after we are sure it works.

3. Publish the code from dev to generate ARM templates.

4. This can trigger an automated CI/CD DevOps pipeline to promote code to higher
environments like Staging or Production.

13. Which three activities can you run in Microsoft Azure Data Factory?
As discussed earlier, Data Factory supports three types of activities: data movement, transformation, and control activities.
1. Data movement activities: As the name suggests, these activities help move data
from one place to another.
e.g., Copy Activity in Data Factory copies data from a source to a sink data store.

2. Data transformation activities: These activities help transform the data while we
load it into the data's target or destination.
e.g., Stored Procedure, U-SQL, Azure Functions, etc.

3. Control flow activities: Control (flow) activities help control the flow of any activity
in a pipeline. e.g., Wait activity makes the pipeline wait for a specified amount of
time.

14. What are the two types of compute environments supported by Data Factory

to execute the transform activities?


Below are the types of compute environments that Data Factory supports for executing
transformation activities: -

i) On-Demand Computing Environment: This is a fully managed environment provided by ADF. This type of compute creates a cluster on demand to perform the transformation activity and automatically deletes it when the activity is complete.

ii) Bring Your Own Environment: In this environment, you can use ADF to manage your
computing environment if you already have the infrastructure for on-premises services.

15. What are the steps involved in an ETL process?


The ETL (Extract, Transform, Load) process follows four main steps:

i) Connect and Collect: Connect to the data source(s) and move the data to a centralized location for subsequent processing.

ii) Transform: Transform the data using computing services such as HDInsight, Hadoop, Spark, etc.

iii) Publish: To load data into Azure data lake storage, Azure SQL data warehouse,
Azure SQL databases, Azure Cosmos DB, etc.

iv)Monitor: Azure Data Factory has built-in support for pipeline monitoring via Azure
Monitor, API, PowerShell, Azure Monitor logs, and health panels on the Azure portal.

16. If you want to use the output by executing a query, which activity shall you

use?
The Lookup activity can return the result of executing a query or a stored procedure.

The output can be a singleton value or an array of attributes, which can be consumed in
subsequent copy data activity, or any transformation or control flow activity like ForEach
activity.

17. Can we pass parameters to a pipeline run?


Yes, parameters are a first-class, top-level concept in Data Factory. We can define
parameters at the pipeline level and pass arguments as you execute the pipeline run on
demand or using a trigger.

18. Have you used Execute Notebook activity in Data Factory? How to pass

parameters to a notebook activity?


We can use the Execute Notebook activity to run a notebook on our Databricks cluster from a pipeline. We can pass parameters to a notebook activity using the baseParameters property. If the parameters are not defined/specified in the activity, the default values defined in the notebook are used.
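On the Databricks side, values passed through baseParameters are typically read inside the notebook with widgets. The sketch below uses a hypothetical parameter name and default value; dbutils is only available inside a Databricks notebook.

```python
# Inside the Databricks notebook called by the ADF Execute Notebook activity.
# "load_date" is a hypothetical parameter; its default is used when ADF passes nothing.
dbutils.widgets.text("load_date", "2024-01-01")
load_date = dbutils.widgets.get("load_date")

print(f"Processing data for {load_date}")

# Optionally return a value to the ADF activity output (surfaced as runOutput).
dbutils.notebook.exit(load_date)
```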

19. What are some useful constructs available in Data Factory?


i) parameter: Each activity within the pipeline can consume the parameter value passed
to the pipeline and run with the @parameter construct.

ii) coalesce: We can use the @coalesce construct in the expressions to handle null
values gracefully.

iii) activity: An activity output can be consumed in a subsequent activity with


the @activity construct.

20. Is it possible to push code and have CI/CD (Continuous Integration and

Continuous Delivery) in ADF?


Data Factory offers full support for CI/CD of your data pipelines using Azure DevOps
and GitHub. This allows you to develop and deliver your ETL processes incrementally
before publishing the finished product. After the raw data has been refined into a business-ready, consumable form, you can load it into Azure Synapse Analytics (SQL Data Warehouse), Azure SQL Database, Azure Data Lake, Azure Cosmos DB, or whichever analytics engine your business intelligence tools point to.
21. What do you mean by variables in the Azure Data Factory?
Variables in the Azure Data Factory pipeline provide the functionality to hold the values.
They are used for a similar reason as we use variables in any programming language
and are available inside the pipeline.

Set Variable and append variable are two activities used for setting or manipulating the
values of the variables. There are two types of the variables in a data factory: -

i) System variables: These are fixed variables provided by the Azure pipeline, for example, the pipeline name, pipeline ID, trigger name, etc. You mostly need these to get system information that might be needed in your use case.

ii) User variable: A user variable is declared manually in your code based on your
pipeline logic.

22. What are mapping data flows?


Mapping data flows are visually designed data transformations in Azure Data Factory.
Data flows allow data engineers to develop data transformation logic without writing
code. The resulting data flows are executed as activities within Azure Data Factory
pipelines that use scaled-out Apache Spark clusters. Data flow activities can be
operationalized using existing Azure Data Factory scheduling, control flow, and
monitoring capabilities.

Mapping data flows provide an entirely visual experience with no coding required. Data
flows run on ADF-managed execution clusters for scaled-out data processing. Azure
Data Factory manages all the code translation, path optimization, and execution of the
data flow jobs.

23. What is copy activity in the azure data factory?


Copy activity is one of the most popular and universally used activities in Azure Data Factory. It is used for ETL or lift-and-shift scenarios where you want to move data from one data source to another. While copying the data, you can also apply a transformation; for example, you read the data from a txt/csv file that contains 12 columns; however, while
example, you read the data from txt/csv file, which contains 12 columns; however, while
writing to your target data source, you want to keep only seven columns. You can
transform it and send only the required number of columns to the destination data
source.
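The same column-pruning idea, done in a Databricks notebook instead of the Copy activity's column mapping, looks like the sketch below; the file paths and column names are made up for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("column-pruning").getOrCreate()

# Source file with 12 columns (path and schema are illustrative).
df = spark.read.option("header", True).csv("/mnt/raw/sales.csv")

# Keep only the seven columns the target needs before writing to the sink.
wanted = ["order_id", "order_date", "customer_id", "product_id",
          "quantity", "unit_price", "country"]
df.select(*wanted).write.mode("overwrite").parquet("/mnt/curated/sales")
```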

24. Can you elaborate more on the Copy activity?


The copy activity performs the following steps at high-level:

i) Read data from the source data store. (e.g., blob storage)

ii) Perform the following tasks on the data:


 Serialization/deserialization

 Compression/decompression

 Column mapping

iii) Write data to the destination data store or sink. (e.g., azure data lake)

This is summarized in the below graphic:

Source: docs.microsoft.com/en-us/learn/modules/intro-to-azure-data-factory/3-how-
azure-data-factory-works

Azure Data Factory (ADF) Interview Questions and Answers for 2-5 Years Experienced

This section covers Azure Data Factory interview questions for mid-level experienced professionals.
25. What are the different activities you have used in Azure Data Factory?
Here you can share some of the major activities if you have used them in your career be
it your work or college project. Here are a few of the most used activities :

1. Copy Data Activity to copy the data between datasets.

2. ForEach Activity for looping.

3. Get Metadata Activity which can provide metadata about any data source.

4. Set Variable Activity to define and initiate variables within pipelines.

5. Lookup Activity to do a lookup to get some values from a table/file.

6. Wait Activity to wait for a specified amount of time before/in between the pipeline
run.

7. Validation Activity will validate the presence of files within the dataset.
8. Web Activity to call a custom REST endpoint from an ADF pipeline.


26. How can I schedule a pipeline?


You can use the time window trigger or scheduler trigger to schedule a pipeline. The
trigger uses a wall-clock calendar schedule, which can schedule pipelines periodically
or in calendar-based recurrent patterns (for example, on Mondays at 6:00 PM and
Thursdays at 9:00 PM).

Currently, the service supports three types of triggers:

 Tumbling window trigger: A trigger that operates on a periodic interval while


retaining a state.

 Schedule Trigger: A trigger that invokes a pipeline on a wall-clock schedule.

 Event-Based Trigger: A trigger that responds to an event. e.g., a file getting


placed inside a blob.
Pipelines and triggers have a many-to-many relationship (except for the tumbling
window trigger). Multiple triggers can kick off a single pipeline, or a single trigger
can kick off numerous pipelines.
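As an illustration of a schedule trigger created programmatically, the sketch below uses azure-mgmt-datafactory model classes. All names are placeholders, and the exact model and parameter names can vary between SDK versions, so check the SDK reference before relying on it.

```python
from datetime import datetime, timedelta
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    TriggerResource, ScheduleTrigger, ScheduleTriggerRecurrence,
    TriggerPipelineReference, PipelineReference
)

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
rg_name, df_name = "<resource-group>", "<data-factory>"   # placeholders

# Run the pipeline once a day, starting tomorrow.
recurrence = ScheduleTriggerRecurrence(
    frequency="Day",
    interval=1,
    start_time=datetime.utcnow() + timedelta(days=1),
    time_zone="UTC",
)

trigger = TriggerResource(properties=ScheduleTrigger(
    recurrence=recurrence,
    pipelines=[TriggerPipelineReference(
        pipeline_reference=PipelineReference(reference_name="DemoPipeline"),
        parameters={},
    )],
))

adf_client.triggers.create_or_update(rg_name, df_name, "DailyTrigger", trigger)
adf_client.triggers.begin_start(rg_name, df_name, "DailyTrigger").result()
```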

27. When should you choose Azure Data Factory?


One should consider using Data Factory-
i) When working with big data, there is a need for a data warehouse to be implemented;
you might require a cloud-based integration solution like ADF for the same.

ii) Not all the team members are experienced in coding and may prefer graphical tools
to work with data.

iii) When raw business data is stored at diverse data sources, which can be on-prem
and on the cloud, we would like to have one analytics solution like ADF to integrate
them all in one place.

iv) We would like to use readily available data movement and processing solutions and
like to be light in terms of infrastructure management. So, a managed solution like ADF
makes more sense in this case.

28. How can you access data using the other 90 dataset types in Data Factory?
The mapping data flow feature natively supports Azure SQL Database, Azure Synapse Analytics, delimited text files from an Azure storage account or Azure Data Lake Storage Gen2, and Parquet files from Blob storage or Data Lake Storage Gen2 as source and sink data sources.

Use the Copy activity to stage data from any other connectors, and then execute a Data
Flow activity to transform data after it's been staged.

29. What is the difference between mapping and wrangling data flow (Power

query activity)?
Mapping data flows transform data at scale without requiring coding. You can design a
data transformation job in the data flow canvas by constructing a series of
transformations. Start with any number of source transformations followed by data
transformation steps. Complete your data flow with a sink to land your results in a
destination. It is excellent at mapping and transforming data with known and unknown
schemas in the sinks and sources.

Power Query Data Wrangling allows you to do agile data preparation and exploration
using the Power Query Online mashup editor at scale via spark execution. With the rise
of data lakes, sometimes you just need to explore a data set or create a dataset in the
lake.

It currently supports 24 SQL data types from char, nchar to int, bigint and timestamp,
xml, etc.

Refer to the documentation here for more


details: https://docs.microsoft.com/en-us/azure/data-factory/frequently-asked-
questions#supported-sql-types
30. Is it possible to calculate a value for a new column from the existing column

from mapping in ADF?


We can derive transformations in the mapping data flow to generate a new column
based on our desired logic. We can create a new derived column or update an existing
one when generating a derived one. Enter the name of the column you’re creating in the
Column textbox.

You can use the column dropdown to override an existing column in your schema. Click
the Enter expression textbox to start creating the derived column’s expression. You can
input or use the expression builder to build your logic.
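Since mapping data flows execute on Spark, the derived-column transformation is conceptually the same as PySpark's withColumn, shown below with made-up column names for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("derived-column").getOrCreate()
df = spark.createDataFrame(
    [(1, 10.0, 3), (2, 4.5, 2)],
    ["order_id", "unit_price", "quantity"],
)

# Add a new derived column; reusing an existing name would overwrite that column instead.
df = df.withColumn("line_total", col("unit_price") * col("quantity"))
df.show()
```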

31. How is lookup activity useful in the Azure Data Factory?


In the ADF pipeline, the Lookup activity is commonly used for configuration lookup
purposes, and the source dataset is available. Moreover, it is used to retrieve the data
from the source dataset and then send it as the output of the activity. Generally, the
output of the lookup activity is further used in the pipeline for taking some decisions or
presenting any configuration as a result.

In simple terms, lookup activity is used for data fetching in the ADF pipeline. The way
you would use it entirely relies on your pipeline logic. It is possible to obtain only the first
row, or you can retrieve the complete rows depending on your dataset or query.


32. Elaborate more on the Get Metadata activity in Azure Data Factory.
The Get Metadata activity is used to retrieve the metadata of any data in the Azure Data
Factory or a Synapse pipeline. We can use the output from the Get Metadata activity in
conditional expressions to perform validation or consume the metadata in subsequent
activities.

It takes a dataset as an input and returns metadata information as output. Currently, the
following connectors and the corresponding retrievable metadata are supported. The
maximum size of returned metadata is 4 MB.

Please refer to the snapshot below for supported metadata which can be retrieved
using the Get Metadata activity.
Source: docs.microsoft.com/en-us/azure/data-factory/control-flow-get-metadata-
activity#metadata-options

33. How to debug an ADF pipeline?


Debugging is one of the crucial aspects of any coding-related activity, needed to test the code for any issues it might have. Azure Data Factory provides a Debug option that lets you run and test a pipeline interactively from the authoring canvas, without publishing it or attaching a trigger.

34. What does it mean by the breakpoint in the ADF pipeline?


To understand better, for example, you are using three activities in the pipeline, and
now you want to debug up to the second activity only. You can do this by placing the
breakpoint at the second activity. To add a breakpoint, you can click the circle present
at the top of the activity.
35. What is the use of the ADF Service?
ADF is primarily used to orchestrate data copying between various relational and non-relational data sources hosted on-premises in data centers or in the cloud. Moreover, you can
use ADF Service to transform the ingested data to fulfill business requirements. In most
Big Data solutions, ADF Service is used as an ETL or ELT tool for data ingestion.

36. Explain the data source in the azure data factory.


The data source is the source or destination system that contains the data intended to be used or processed. The data can be binary, text, CSV files, JSON files, image files, video, or audio, or it might be a proper database.

Examples of data sources include azure data lake storage, azure blob storage, or any
other database such as mysql db, azure sql database, postgres, etc.

37. Can you share any difficulties you faced while getting data from on-premises

to Azure cloud using Data Factory?


One of the significant challenges we face while migrating from on-prem to cloud is
throughput and speed. When we try to copy the data using Copy activity from on-prem,
the speed of the process is relatively slow, and hence we don’t get the desired
throughput.

There are some configuration options for a copy activity, which can help in tuning this
process and can give desired results.

i) We should use the compression option to get the data in a compressed mode while
loading from on-prem servers, which is then de-compressed while writing on the cloud
storage.

ii) Staging area should be the first destination of our data after we have enabled the
compression. The copy activity can decompress before writing it to the final cloud
storage buckets.

iii) Degree of Copy Parallelism is another option to help improve the migration process.
This is identical to having multiple threads processing data and can speed up the data
copy process.

There is no right fit-for-all here, so we must try out different numbers like 8, 16, or 32
and see which gives a good performance.

iv) Data Integration Unit is loosely the number of CPUs used, and increasing it may
improve the performance of the copy process.
38. How to copy multiple sheet data from an Excel file?
When we use an Excel connector within a data factory, we must provide the sheet name from which to load the data. This approach is fine when dealing with a single sheet or a handful of sheets, but when there are lots of sheets (say 10+), it becomes a tedious task, as we have to change the hard-coded sheet name every time!

However, we can use a data factory binary data format connector for this and point it to
the excel file and need not provide the sheet name/s. We’ll be able to use copy activity
to copy the data from all the sheets present in the file.
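An alternative, if the workload ends up in a Python notebook rather than a Copy activity, is to read every sheet at once with pandas. The file path below is illustrative, and an Excel engine such as openpyxl must be installed.

```python
import pandas as pd

# sheet_name=None loads *all* sheets into a dict of {sheet_name: DataFrame}.
sheets = pd.read_excel("/dbfs/mnt/raw/sales_report.xlsx", sheet_name=None)

# Stack the sheets into one DataFrame, keeping track of which sheet each row came from.
combined = pd.concat(
    (df.assign(source_sheet=name) for name, df in sheets.items()),
    ignore_index=True,
)
print(combined.shape)
```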

39. Is it possible to have nested looping in Azure Data Factory?


There is no direct support for nested looping in the data factory for any looping activity
(for each / until). However, we can use one for each/until loop activity which will contain
an execute pipeline activity that can have a loop activity. This way, when we call the
loop activity it will indirectly call another loop activity, and we’ll be able to achieve nested
looping.

40. How to copy multiple tables from one datastore to another datastore?
An efficient approach to complete this task would be:

i) Maintain a lookup table/ file which will contain the list of tables and their source,
which needs to be copied.

ii) Then, we can use a Lookup activity and a ForEach loop activity to scan through the list.

iii) Inside the for each loop activity, we can use a copy activity or a mapping dataflow to
accomplish the task of copying multiple tables to the destination datastore.
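If the same pattern is implemented in a Databricks notebook instead of ADF activities, it reduces to a simple loop over the lookup list. The connection details, table names, and output path below are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("multi-table-copy").getOrCreate()

jdbc_url = "jdbc:sqlserver://<server>.database.windows.net:1433;database=<db>"  # placeholder
connection = {
    "user": "<user>",
    "password": "<password>",
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
}

# The "lookup table": source tables and their destination folder names.
tables = [("dbo.customers", "customers"), ("dbo.orders", "orders")]

for source_table, target_folder in tables:
    df = spark.read.jdbc(url=jdbc_url, table=source_table, properties=connection)
    df.write.mode("overwrite").parquet(
        f"abfss://curated@<account>.dfs.core.windows.net/{target_folder}"
    )
```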

41. What are some performance tuning techniques for Mapping Data Flow
activity?
We could consider the below set of parameters for tuning the performance of a
Mapping Data Flow activity we have in a pipeline.

i) We should try to leverage partitioning in the source, sink, or transformation whenever


possible.

Microsoft, however, recommends that we use the default partition (size 128 MB)
selected by the Data Factory as it intelligently chooses one based on our pipeline
configuration.

Still, one should try out different partitions and see if they can have improved
performance.
ii) We should not use a data flow activity for each loop activity. Instead, suppose we
have multiple files similar in terms of structure and the processing need. In that case,
we should use a wildcard path inside the data flow activity, enabling the processing of
all the files within a folder.

iii) The recommended file format is Parquet ('.parquet'). The pipeline executes by spinning up Spark clusters, and Parquet is a native file format for Apache Spark; thus, it will generally give good performance.

iv) Multiple logging modes are available: Basic, Verbose, and None.

We should not use verbose mode unless essential, as it will log all the details about
each operation the activity is performing. e.g., It will log all the details of the operations
performed for all the partitions we have. This one is useful when troubleshooting issues
with the data flow.

The basic mode will give out all the necessary basic details in the log, so try to use this
one whenever possible.

v) Try to break down a complex data flow activity into multiple data flow activities. Let’s
say we have n number of transformations between source and sink, and by adding
more, we think the design has become complex. In this case, try to have it in multiple
such activities, which will give two advantages:

a) All activities will run on separate spark clusters, so the run time will come down for
the whole task.

b) The whole pipeline will be easy to understand and maintain in the future.

42. What are some of the limitations of ADF?


Azure Data Factory provides great functionalities for data movement and
transformations. However, there are some limitations as well.

i) We can’t have nested looping activities in the data factory, and we must use some
workaround if we have that sort of structure in our pipeline. All the looping activities
come under this: If, Foreach, switch, and until activities.

ii) The lookup activity can retrieve only 5000 rows at a time and not more than that.
Again, we need to use some other loop activity along with SQL with the limit to achieve
this sort of structure in the pipeline.

iii) We can have a maximum of 40 activities in a single pipeline, including everything:


inner activity, containers, etc. To overcome this, we should try to modularize the
pipelines regarding the number of datasets, activities, etc.
43. How can one stay updated with new features of Azure Data Factory?
https://docs.microsoft.com/en-us/azure/data-factory/whats-new page is updated every
month with all the features added, bugs fixed, and issues resolved with the Azure data
factory:

44. How are all the components of Azure Data Factory combined to complete the
purpose?
The below diagram depicts how all these components can be clubbed together to fulfill
Azure Data Factory ADF tasks.

Source:docs.microsoft.com/en-us/learn/modules/intro-to-azure-data-factory/3-how-
azure-data-factory-works

45. How do you send email notifications on pipeline failure?


There are multiple ways to do this:

1. Using Logic Apps with Web/Web hook activity.


Configure a logic app that, upon getting an HTTP request, can send an email to
the required set of people for failure. In the pipeline, configure the failure option
to hit the URL generated by the logic app.

2. Using Alerts and Metrics from pipeline options.


We can set up this from the pipeline itself, where we get numerous options for
email on any activity failure within the pipeline.
46. Can we integrate Data Factory with Machine learning data?
Yes, we can train and retrain the model on machine learning data from the pipelines
and publish it as a web service.

Checkout:https://docs.microsoft.com/en-us/azure/data-factory/transform-data-using-
machine-learning#using-machine-learning-studio-classic-with-azure-data-factory-or-
synapse-analytics

47. What is Azure SQL database? Can you integrate it with Data Factory?
Part of the Azure SQL family, Azure SQL Database is an always up-to-date, fully
managed relational database service built for the cloud for storing data. We can easily
design data pipelines to read and write to SQL DB using the Azure data factory.

Checkout:https://docs.microsoft.com/en-us/azure/data-factory/connector-azure-sql-
database?tabs=data-factory

48. Can you host SQL Server instances on Azure?


Yes. Azure SQL Managed Instance is an intelligent, scalable cloud database service that combines the broadest SQL Server instance/database engine compatibility with all the benefits of a fully managed and evergreen platform as a service.

49. Explain the Azure Data Factory Architecture.


Check:https://docs.microsoft.com/en-us/azure/data-factory/media/introduction/data-
factory-visual-guide.png

50. What is Azure Data Lake Analytics?


Azure Data Lake Analytics is an on-demand analytics job service that simplifies storing
data and processing big data.

Basic Azure Data Engineer Interview Questions and Answers


If you’re someone who’s just starting, here are some basic Azure data engineer interview questions:

1. Define Microsoft Azure.


Microsoft Azure is a cloud computing platform that offers both hardware and software, providing managed services that allow users to access the services they need on demand.

2. List the data masking features Azure has.


When it comes to data security, dynamic data masking plays several vital roles and limits the exposure of sensitive data to a specific set of users. Some of its features are:

 It’s available for Azure SQL Database, Azure SQL Managed Instance, and Azure Synapse
Analytics.
 It can be carried out as a security policy on all the different SQL databases across the Azure
subscription.
 The levels of masking can be controlled per the users' needs.
3. What is meant by a Polybase?
PolyBase is used to optimize data ingestion into PDW (Parallel Data Warehouse) and to support T-SQL. It lets developers query external data transparently from supported data stores, regardless of the storage architecture of the external data store.

4. Define reserved capacity in Azure.


Microsoft has included a reserved capacity option in Azure storage to optimize costs. The reserved
storage gives its customers a fixed amount of capacity during the reservation period on the Azure cloud.

5. What is meant by the Azure Data Factory?


Azure Data Factory is a cloud-based integration service that lets users build data-driven workflows
within the cloud to arrange and automate data movement and transformation. Using Azure Data
Factory, you can:

 Develop and schedule data-driven workflows that can take data from different data stores.
 Process and transform data with the help of computing services such as HDInsight Hadoop,
Spark, Azure Data Lake Analytics, and Azure Machine Learning.

Sample Basic Azure Data Engineer Interview Questions

1. Explain the main ETL service in Azure.


2. Why is the Azure Data Factory important?
3. What is the limit on the number of integration runtimes?
4. Differentiate between Azure Data Lake and Azure Data Warehouse.
5. Define the integration runtime.
You can also look at these top Data Engineer Interview Questions for practice.

Intermediate Azure Data Engineer Interview Questions and Answers


When applying for intermediate-level roles, these are the Azure data engineer interview questions you
can expect:

1. What do you mean by blob storage in Azure?


It is a service for storing massive amounts of unstructured object data such as text or binary
data. The data can be exposed publicly or kept private to the application. Blob storage is
commonly used for:

 Serving images or documents directly to a browser
 Audio and video streaming
 Data storage for backup, restore, and disaster recovery
 Data storage for analysis by an on-premises or Azure-hosted service
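A minimal sketch of uploading and downloading a blob from Python, assuming the azure-storage-blob (v12) SDK; the container and blob names and the connection-string environment variable are hypothetical.

```python
import os
from azure.storage.blob import BlobServiceClient

# The connection string is read from an environment variable (name chosen for this example).
service = BlobServiceClient.from_connection_string(os.environ["AZURE_STORAGE_CONNECTION_STRING"])
container = service.get_container_client("backups")  # hypothetical container

# Upload a local file as a block blob.
with open("report.pdf", "rb") as data:
    container.upload_blob(name="2024/report.pdf", data=data, overwrite=True)

# Download it back into memory.
content = container.download_blob("2024/report.pdf").readall()
print(len(content), "bytes downloaded")
```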

2. Define the steps involved in creating the ETL process in Azure Data Factory.
The steps involved in creating the ETL process in Azure Data Factory are:

 Create a Linked Service for the source data store, e.g. a SQL Server database
 Create a Linked Service for the destination data store, e.g. Azure Data Lake Store
 Create datasets that describe the data to be copied
 Build the pipeline and add a copy activity
 Schedule the pipeline by attaching a trigger
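The same steps can also be scripted with the azure-mgmt-datafactory Python SDK. The sketch below is hedged: the class names come from that SDK, but the resource names, connection strings, and subscription ID are placeholders, and exact constructor arguments can vary between SDK versions.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    LinkedServiceResource, AzureSqlDatabaseLinkedService, AzureBlobStorageLinkedService,
    DatasetReference, PipelineResource, CopyActivity, AzureSqlSource, BlobSink, SecureString,
)

rg, df = "my-rg", "my-adf"  # hypothetical resource group and factory names
adf = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Steps 1-2: linked services for the source database and the destination store.
adf.linked_services.create_or_update(rg, df, "SqlSourceLS", LinkedServiceResource(
    properties=AzureSqlDatabaseLinkedService(
        connection_string=SecureString(value="<sql-connection-string>"))))
adf.linked_services.create_or_update(rg, df, "BlobSinkLS", LinkedServiceResource(
    properties=AzureBlobStorageLinkedService(
        connection_string=SecureString(value="<storage-connection-string>"))))

# Step 3: datasets named "SqlDataset" and "BlobDataset" would be created here
# with adf.datasets.create_or_update(...) (definitions omitted for brevity).

# Step 4: pipeline with a single copy activity.
copy = CopyActivity(
    name="CopySqlToBlob",
    inputs=[DatasetReference(type="DatasetReference", reference_name="SqlDataset")],
    outputs=[DatasetReference(type="DatasetReference", reference_name="BlobDataset")],
    source=AzureSqlSource(),
    sink=BlobSink())
adf.pipelines.create_or_update(rg, df, "SqlToBlobPipeline", PipelineResource(activities=[copy]))

# Step 5: attach a schedule or tumbling window trigger, or start an on-demand run.
adf.pipelines.create_run(rg, df, "SqlToBlobPipeline", parameters={})
```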

3. Define serverless database computing in Azure.


Traditionally, program code runs on infrastructure that sits either on the client side or on a
server you manage. With serverless computing, the code is stateless and does not need any dedicated
infrastructure: the platform provisions compute only while the code runs.

Users pay only for the compute resources consumed during the short period in which the code
executes, which makes the model cost-effective.
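The answer above describes the general serverless compute model. A minimal sketch in Python, assuming the Azure Functions programming model, illustrates it; the function name and greeting logic are just an example.

```python
import azure.functions as func

def main(req: func.HttpRequest) -> func.HttpResponse:
    # Stateless handler: no servers to manage, it runs only when an HTTP request arrives,
    # and on the consumption plan you are billed only for the time it spends executing.
    name = req.params.get("name", "world")
    return func.HttpResponse(f"Hello, {name}!", status_code=200)
```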

4. Explain the top-level concepts of Azure Data Factory.

1. Pipeline
A pipeline acts as a carrier for a group of processes that run together; every individual process
within it is known as an activity.

2. Activities
Activities represent the individual processing steps in a pipeline. A pipeline has one or more
activities, which can be anything from querying a dataset to copying it from one source to
another.

3. Datasets
Simply put, a dataset is a structure that holds or points to the data used by pipeline activities.

4. Linked Services
Linked services store the connection information (such as connection strings and credentials)
needed to connect to external sources.

Check out these articles to prepare for FAANG Data Engineering interviews:

 Amazon Data Engineer Interview Questions
 Facebook Data Engineer Interview Questions

Sample Intermediate Azure Data Engineer Interview Questions

1. Differentiate between HDInsight and Azure Data Lake Analytics.
2. Elaborate on the best way to transfer data from an on-premise database to Azure.
3. Give some ways to ingest data from on-premise storage to Azure.
4. What is data redundancy in Azure?
5. In Azure SQL DB, what are the different data security options available?
6. Define the Azure table storage.
7. What is Azure Databricks, and what separates it from regular Databricks?
8. What is the Azure storage explorer and its uses?
9. List the various kinds of storage in Azure.
10. What are the different kinds of windowing functions in Azure Stream Analytics?

Advanced Azure Data Engineer Interview Questions and Answers

You need to prepare these Azure data engineer interview questions for experienced professionals
when applying for more advanced positions:

1. How is a pipeline scheduled?


To schedule a pipeline, you can use the schedule trigger or the tumbling window trigger. The
schedule trigger uses a wall-clock calendar schedule and can run pipelines at periodic intervals
or on calendar-based recurring patterns.
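A hedged sketch of attaching a schedule trigger with the azure-mgmt-datafactory SDK follows; the class names come from that SDK, while the resource group, factory, pipeline, and trigger names are hypothetical, and constructor details may differ between SDK versions.

```python
from datetime import datetime, timezone

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    TriggerResource, ScheduleTrigger, ScheduleTriggerRecurrence,
    TriggerPipelineReference, PipelineReference,
)

adf = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Run the pipeline once every hour, starting now.
recurrence = ScheduleTriggerRecurrence(
    frequency="Hour", interval=1,
    start_time=datetime.now(timezone.utc), time_zone="UTC")

trigger = TriggerResource(properties=ScheduleTrigger(
    recurrence=recurrence,
    pipelines=[TriggerPipelineReference(
        pipeline_reference=PipelineReference(
            type="PipelineReference", reference_name="SqlToBlobPipeline"),
        parameters={})]))

adf.triggers.create_or_update("my-rg", "my-adf", "HourlyTrigger", trigger)
adf.triggers.begin_start("my-rg", "my-adf", "HourlyTrigger").result()  # triggers start stopped
```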

2. What’s the significance of the Azure Cosmos DB synthetic partition key?


To distribute data uniformly across logical partitions, it is important to select a good partition
key. A synthetic partition key can be constructed when no existing property has well-distributed
values.

Here are the three ways in which a synthetic partition key can be created:

1. Concatenate Properties: Combine several property values to create a synthetic partition key.
2. Random Suffix: A random number is added at the end of the partition key's value.
3. Pre-calculated Suffix: Add a pre-calculated (deterministic) suffix to the end of the partition
key's value to improve read performance.
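An illustrative sketch in plain Python (no Cosmos DB SDK) of the three approaches above; the item fields are hypothetical.

```python
import random

item = {"deviceId": "sensor-17", "date": "2024-06-01", "reading": 21.5}

# 1. Concatenate properties into one synthetic key.
item["partitionKey"] = f"{item['deviceId']}-{item['date']}"      # "sensor-17-2024-06-01"

# 2. Random suffix: spread writes for a hot key across N sub-partitions.
item["partitionKey"] = f"{item['date']}-{random.randint(0, 9)}"  # e.g. "2024-06-01-4"

# 3. Pre-calculated suffix: derive the suffix deterministically so point reads
#    can recompute it (here from deviceId).
suffix = sum(ord(c) for c in item["deviceId"]) % 10
item["partitionKey"] = f"{item['date']}-{suffix}"
```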

3. Which Data Factory version needs to be used to create data flows?


Data flows are available in Data Factory V2, so the V2 version should be used when creating them.

4. How to pass the parameters to a pipeline run?


In Data Factory, parameters are a top-level concept. They are defined at the pipeline level, and
arguments are passed for them when the pipeline run is executed on demand or by a trigger.
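For illustration, a hedged example of passing parameter values at run time with the azure-mgmt-datafactory SDK; the pipeline and parameter names are hypothetical.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

adf = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Arguments for parameters declared on the pipeline are supplied per run.
run = adf.pipelines.create_run(
    "my-rg", "my-adf", "SqlToBlobPipeline",
    parameters={"windowStart": "2024-06-01", "windowEnd": "2024-06-02"})
print(run.run_id)  # use with adf.pipeline_runs.get(...) to monitor the run
```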
Sample Advanced Azure Data Engineer Interview Questions

1. Can default values for the pipeline parameters be defined?
2. About data flow, what has changed from private preview to limited public preview?
3. What are the two levels of security in ADLS Gen2?
4. What are the data flow partitioning schemes in Azure?
5. What are multi-model databases?
These are some important Azure data engineer interview questions that will give you an idea of
what to expect in the interview. Also, ensure that you prepare these topics: security, DevOps,
CI/CD, Infrastructure as Code best practices, subscription and billing management, etc.

As you prepare for your DE interview, it would be best to study Azure using a holistic approach that
extends beyond the fundamentals of the role. Don’t forget to prep your resume as well with the help of
the Data Engineer Resume Guide.

Finally, here are answers to a few frequently asked questions about the Azure data engineer role and interview process:

Q1. What does an Azure Data Engineer do?

Azure data engineers are responsible for the integration, transformation, operation, and consolidation
of data from structured or unstructured data systems.

Q2. What skills are needed to become an Azure data engineer?

As an Azure data engineer, you'll need skills such as database system management (SQL or NoSQL),
data warehousing, ETL (Extract, Transform, and Load) tools, machine learning, knowledge of
programming language basics (Python/Java), and so on.

Q3. How to prepare for the Azure data engineer interview?

Get a good understanding of Azure’s Modern Enterprise Data and Analytics Platform and build your
knowledge across its other specialties. Further, you should also be able to communicate the business
value of the Azure Data Platform.

Q4. What are the important Azure data engineer interview questions?

Some important questions are: What is the difference between Azure Data Lake Store and Blob storage?
Differentiate between Control Flow activities and Data Flow Transformations. How is the Data factory
pipeline manually executed?
