Azure Data Engineer Interview Questions
2. What are the key components in ADF? Which of them have you used in your pipelines?
https://beetechnical.com/cloud-computing/azure-data-lake-storageadls-gen1-vs-gen2-complete-guide-2021/
13. Online hackathon test. 2) Azure ADF, Databricks, Spark, and Python knowledge (enough for big data). 3) Advanced SQL (analytical functions).
15. Explain the Copy activity in ADF, slowly changing dimensions, and data warehousing concepts.
19. An on-premises Oracle server receives a daily incremental load of 10 GB of data. How do you move it to the cloud using Azure?
20. How do you design/implement database solutions in the cloud?
24. How to read a Parquet file, how to call a notebook from ADF, the Azure DevOps CI/CD process, and system variables in ADF.
25. Overall architecture of an Azure project, ETL solutions using ADF and ADB (Azure Databricks), a few SQL queries, and Azure SQL pools.
More Interview Questions:
https://azurede.com/2021/05/29/azure-data-engineering-interview-questions/
https://www.vcubesoftsolutions.com/azure-data-engineers-interview-questions/
👉SQL users excluded from masking - A set of SQL users or Azure Active Directory identities that receive unmasked data in SQL query results. Users with administrator privileges are always excluded from masking and see the original data without any mask.
👉Masking rules - A set of rules defining the designated fields to be masked and the
masking function used. The selected fields can be determined using a database
schema, table, and column names.
👉Masking functions - A set of methods that control data exposure for different
scenarios.
4. Difference between Azure Synapse Analytics and Azure Data Lake Storage?
Azure Synapse Analytics: Built-in data pipelines and data streaming capabilities; used for business analytics.
Azure Data Lake Storage: Handles data streaming using Azure Stream Analytics; used for data analytics and exploration by data scientists and engineers.
🌼 Hopping Window: In these windows, the data segments can overlap. To define a hopping window, we need to specify two parameters:
Hop size (how often a new window starts, which determines the overlap)
Window size (length of the data segment)
🌼 Tumbling Window: In this, the data stream is segmented into distinct time segments
of fixed length in the tumbling window function.
🌼 Session Window: This function groups events based on arrival time, so there is no
fixed window size. Its purpose is to eliminate quiet periods in the data stream.
🌼 Sliding Window: This windowing function does not necessarily produce aggregation
after a fixed time interval, unlike the tumbling and hopping window functions.
Aggregation occurs every time an existing event falls out of the time window, or a new
event occurs.
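Azure Stream Analytics expresses these windows in its SQL-like query language. Purely as an illustration, here is a minimal PySpark Structured Streaming sketch of analogous tumbling, hopping, and session windows; the source, column names, and durations are hypothetical.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("windowing-sketch").getOrCreate()

# Hypothetical streaming source with an event-time column; the built-in "rate"
# source emits a 'timestamp' column, renamed here for clarity.
events = (spark.readStream.format("rate").load()
          .withColumnRenamed("timestamp", "eventTime"))

# Tumbling window: fixed, non-overlapping 10-minute segments.
tumbling = events.groupBy(F.window("eventTime", "10 minutes")).count()

# Hopping window: 10-minute windows that start every 5 minutes, so segments overlap.
hopping = events.groupBy(F.window("eventTime", "10 minutes", "5 minutes")).count()

# Session window: a new window opens after 5 minutes of inactivity (Spark 3.2+).
sessions = events.groupBy(F.session_window("eventTime", "5 minutes")).count()

# (A watermark on 'eventTime' would be required before writing these streams out.)
```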
Files: Azure Files is an organized way of storing data in the cloud. The main advantage of Azure Files over Azure Blobs is that it allows organizing the data in a folder structure. Also, Azure Files is SMB (Server Message Block) protocol compliant, so it can be used as a file share.
Blobs: Blob stands for binary large object. This storage solution supports all kinds of files, including text files, videos, images, documents, binary data, etc.
Queues: Azure Queue is a cloud-based messaging store for establishing and brokering communication between various applications and components.
Disks: Azure Disks are used as a storage solution for Azure VMs (Virtual Machines).
Tables: Tables are NoSQL storage structures for storing structured, non-relational data that does not require a standard RDBMS (relational database management system) schema.
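As a small, hedged illustration of working with the Blob service from code, the sketch below uploads a local file with the azure-storage-blob Python SDK; the connection string, container, blob, and file names are placeholders.

```python
from azure.storage.blob import BlobServiceClient

# Placeholder connection string, container, and blob names.
conn_str = "DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>;EndpointSuffix=core.windows.net"

service = BlobServiceClient.from_connection_string(conn_str)
blob = service.get_blob_client(container="raw-data", blob="sales/2021/orders.csv")

# Upload a local file; overwrite if the blob already exists.
with open("orders.csv", "rb") as data:
    blob.upload_blob(data, overwrite=True)
```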
7. What are the different security options available in the Azure SQL database?
Security plays a vital role in databases. Some of the security options available in the
Azure SQL database are:
🌸 Azure SQL Firewall Rules: Azure provides two levels of firewall security. Server-level firewall rules are stored in the SQL master database and determine access to the Azure SQL database server. Users can also create database-level firewall rules that govern access to the individual databases.
🌸 Azure SQL TDE (Transparent Data Encryption): TDE is the technology used to encrypt stored data. TDE is also available for Azure Synapse Analytics and Azure SQL Managed Instance. With TDE, the encryption and decryption of database files, backups, and transaction log files happen in real time.
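To make the two firewall levels concrete, here is a hedged Python sketch using pyodbc with placeholder server and credential values; to the best of my knowledge, server-level rules are exposed through sys.firewall_rules in the master database and database-level rules through sys.database_firewall_rules in each user database.

```python
import pyodbc

# Placeholder server, database, and credential values.
server = "<yourserver>.database.windows.net"
user, password = "<admin-user>", "<password>"

def fetch_rules(database, query):
    conn = pyodbc.connect(
        "Driver={ODBC Driver 18 for SQL Server};"
        f"Server={server};Database={database};Uid={user};Pwd={password};Encrypt=yes;"
    )
    try:
        return conn.cursor().execute(query).fetchall()
    finally:
        conn.close()

# Server-level rules live in the master database.
server_rules = fetch_rules("master", "SELECT name, start_ip_address, end_ip_address FROM sys.firewall_rules")

# Database-level rules live in each individual database.
db_rules = fetch_rules("<your-db>", "SELECT name, start_ip_address, end_ip_address FROM sys.database_firewall_rules")

print(server_rules, db_rules)
```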
🔯 Authentication: The first layer covers user account security. ADLS Gen2 provides three authentication modes: Azure Active Directory (AAD), Shared Access Signatures (SAS), and Shared Key.
🔯 Access Control: The next layer for restricting access to individual containers or files.
This can be managed using Roles and Access Control Lists (ACLs)
🔯 Advanced Threat Protection: If enabled, ADLS Gen2 will monitor any unauthorized
attempts to access or exploit the storage account.
🔯 Auditing: This is the final layer of security. ADLS Gen2 provides comprehensive auditing features in which all account management activities are logged. These logs can later be reviewed to ensure the highest level of security.
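As a hedged sketch of the authentication layer, the snippet below connects to an ADLS Gen2 account with the azure-identity and azure-storage-file-datalake Python packages using Azure AD credentials; the account URL and file system name are placeholders, and a shared key or SAS token could be passed as the credential instead.

```python
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

# Placeholder storage account; AAD is one of the three authentication modes.
account_url = "https://<account>.dfs.core.windows.net"

# DefaultAzureCredential picks up environment, managed identity, or CLI credentials.
service = DataLakeServiceClient(account_url=account_url, credential=DefaultAzureCredential())

# List paths in a placeholder file system (container) to verify access.
fs = service.get_file_system_client("raw")
for path in fs.get_paths():
    print(path.name)
```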
9. What are the various data flow partition schemes available in Azure?
Round Robin: The most straightforward scheme; it spreads data evenly across partitions. Usage: when no good key candidates are available in the data.
Hash: A hash of the chosen columns creates uniform partitions such that rows with similar values fall in the same partition. Usage: check for partition skew after applying it.
Dynamic Range: Uses Spark dynamic ranges based on the provided columns or expressions. Usage: select the column that will be used for partitioning.
Fixed Range: A fixed range of values based on a user-created expression for distributing data across partitions. Usage: a good understanding of the data is required to avoid partition skew.
Key: Creates a partition for each unique value in the selected column. Usage: a good understanding of data cardinality is required.
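Mapping data flows run on Spark, so these schemes can be loosely illustrated with a PySpark sketch (the dataset path and column names are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitioning-sketch").getOrCreate()
df = spark.read.parquet("/data/orders")  # hypothetical dataset

# Round robin style: spread rows evenly across 16 partitions with no key.
evenly_spread = df.repartition(16)

# Hash partitioning: rows with the same 'customer_id' land in the same partition.
hashed = df.repartition(16, "customer_id")

# Range partitioning: partitions are built from ranges of 'order_date' values.
ranged = df.repartitionByRange(16, "order_date")
```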
10. Why is Azure Data Factory needed?
The amount of data generated these days is vast, coming from different sources. When
we move this particular data to the cloud, a few things must be taken care of-
🏭 Data can be in any form, since it comes from different sources, and these various sources will transfer or channel the data in different ways and in different formats. When we bring this data to the cloud or to a particular storage location, we need to make sure it is well managed, i.e., we need to transform the data and delete the unnecessary parts. As far as moving the data is concerned, we need to make sure it is picked up from the different sources, brought to one common place, stored, and, if required, transformed into something more meaningful.
🏭 A traditional data warehouse can also do this, but certain disadvantages exist.
Sometimes we are forced to go ahead and have custom applications that deal with all
these processes individually, which is time-consuming, and integrating all these sources
is a huge pain.
🏭 A data factory helps to orchestrate this complete process into a more manageable or
organizable manner.
Snowflake schema: Contains dimension, sub-dimension, and fact tables. Data redundancy is lower. Query execution time is higher.
Star schema: Contains fact and dimension tables. Data redundancy is higher. Query execution time is lower.
13. What are the 2 levels of security in Azure data lake storage Gen2?
The two levels of security available in Azure data lake storage Gen2 are also adequate
for Azure data lake Gen1. Although this is not new, it is worth calling it two levels of
security because it’s a fundamental piece for getting started with the Azure data lake.
The two levels of security are defined as:
🦉 Role-Based Access Control (RBAC): RBAC includes built-in Azure roles such as
reader, owner, contributor, or custom. Typically, RBAC is assigned due to two reasons.
One is to permit the use of built-in data explorer tools that require reader permissions.
Another is to specify who can manage the service (i.e., update properties and settings
for the storage account).
🦉 Access Control Lists (ACLs): ACLs specify exactly which data objects a user may write, read, and execute (execute is required for browsing the directory structure). ACLs are POSIX (Portable Operating System Interface) compliant and thus familiar to those with a Linux or Unix background.
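As a hedged illustration of the ACL level, the sketch below applies a POSIX-style ACL to a directory with the azure-storage-file-datalake Python SDK; the account, file system, directory, and AAD object ID are placeholders, and method names may differ slightly between SDK versions.

```python
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://<account>.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)

# Placeholder file system and directory.
directory = service.get_file_system_client("raw").get_directory_client("sales/2021")

# Grant read+execute to a specific AAD object ID, keep owner full access, deny others.
acl = "user::rwx,group::r-x,other::---,user:<aad-object-id>:r-x"
directory.set_access_control(acl=acl)

# Read the effective ACL back for verification.
print(directory.get_access_control())
```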
🎈Activities: It represents the processing steps of a pipeline. A pipeline can have one or
many activities. It can be a process like moving the dataset from one source to another
or querying a data set.
🎈Datasets: A dataset is the source of the data, or, we can say, a data structure that holds our data.
Azure Data Lake Analytics: Creates the required compute nodes on demand, per instruction, and processes the dataset. It does not give much flexibility in provisioning the cluster.
HDInsight: Configures the cluster with predefined nodes and then uses a language like Hive or Pig for data processing. It provides more flexibility, as we can create and control the cluster according to our needs.
🎀 Build a Linked Service for the source data store (a SQL Server database). Suppose we have a cars dataset.
🎀 Build a Linked Service for the destination data store, which is Azure Data Lake Store.
🎃 Self-Hosted Integration Runtime: This is software with essentially the same code as the Azure Integration Runtime, except that you install it on an on-premises machine or on a virtual machine inside a virtual network. A self-hosted IR can run copy activities between a data store in a private network and a public cloud data store.
🎃 Azure-SSIS Integration Runtime: With this, one can natively run SSIS (SQL Server Integration Services) packages in a managed environment. So when we lift and shift SSIS packages to the data factory, we use the Azure-SSIS Integration Runtime.
🔔 Collecting data for backup, disaster recovery, and archiving.
⚽ Hadoop is compatible with various types of hardware, and it is simple to add new hardware within a particular node.
⚽ It stores data in a cluster, independently of the rest of the operations.
⚽ Hadoop supports creating replicas of every data block on separate nodes.
23. How would you validate data moving from one database to another?
The integrity of the data and guaranteeing that no data is dropped should be of the highest priority for a data engineer. Hiring managers ask this question to understand your thought process on how data validation would occur.
The candidate should be able to talk about appropriate validation approaches in different situations. For example, you could suggest that validation could be a simple comparison, or it could happen after the complete data migration.
Standard: ADO.NET, ODBC, and SQL | SMTP, XML, CSV, and SMS
Integration Tool: ETL (Extract, Transform, Load) | Manual data entry or batch processing that incorporates codes
4. What sets Azure Data Factory apart from conventional ETL tools?
Azure Data Factory stands out from other ETL tools as it provides: -
i) Enterprise Readiness: Data integration at Cloud Scale for big data analytics!
ii) Enterprise Data Readiness: There are 90+ connectors supported to get your data from any disparate source to the Azure cloud!
iii) Ability to run Code on Any Azure Compute: hands-on data transformations.
iv) Ability to rehost on-prem services on the Azure Cloud in 3 steps: many SSIS packages run on the Azure cloud.
v) Making DataOps seamless: with source control, automated deployment & simple templates.
vi) Secure Data Integration: managed virtual networks protect against data exfiltration, which, in turn, simplifies your networking.
Data Factory contains a series of interconnected systems that provide a complete end-
to-end platform for data engineers. The below snippet summarizes the same.
5. What are the major components of a Data Factory?
To work with Data Factory effectively, one must be aware of below
concepts/components associated with it: -
i) Pipelines: Data Factory can contain one or more pipelines, which is a logical grouping
of tasks/activities to perform a task. e.g., An activity can read data from Azure blob
storage and load it into Cosmos DB or Synapse DB for analytics while transforming the
data according to business logic.
This way, one can work with a set of activities using one entity rather than dealing with
several tasks individually.
ii) Activities: Activities represent a processing step in a pipeline. For example, you might
use a copy activity to copy data between data stores. Data Factory supports data
movement, transformations, and control activities.
iii) Datasets: Datasets represent data structures within the data stores, which simply
point to or reference the data you want to use in your activities as inputs or outputs.
iv) Linked service: This is more like a connection string, which will hold the information
that Data Factory can connect to various sources. In the case of reading from Azure
Blob storage, the storage-linked service will specify the connection string to connect to
the blob, and the Azure blob dataset will select the container and folder containing the
data.
v) Integration Runtime: Integration runtime instances provide the bridge between the
activity and linked Service. It is referenced by the linked service or activity and provides
the computing environment where the activity either runs on or gets dispatched. This
way, the activity can be performed in the region closest to the target data stores or
compute service in the most performant way while meeting security (no exposing of
data publicly) and compliance needs.
vi) Data Flows: These are objects you build visually in Data Factory, which transform
data at scale on backend Spark services. You do not need to understand programming
or Spark internals. Just design your data transformation intent using graphs (Mapping)
or spreadsheets (Power query activity).
The below snapshot explains the relationship between pipeline, activity, dataset, and
linked service.
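To make the pipeline/activity/dataset/linked service relationship concrete, here is a hedged sketch using the azure-mgmt-datafactory Python SDK to define a pipeline with a single Copy activity. The resource names are placeholders, the referenced datasets and linked services are assumed to already exist, and exact model names and client construction can vary between SDK versions.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    PipelineResource, CopyActivity, DatasetReference, BlobSource, BlobSink,
)

# Placeholder subscription, resource group, and factory names.
adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
rg, factory = "<resource-group>", "<data-factory-name>"

# Copy activity wiring: input/output datasets (assumed to exist) plus source and sink types.
copy_activity = CopyActivity(
    name="CopyBlobToBlob",
    inputs=[DatasetReference(type="DatasetReference", reference_name="InputBlobDataset")],
    outputs=[DatasetReference(type="DatasetReference", reference_name="OutputBlobDataset")],
    source=BlobSource(),
    sink=BlobSink(),
)

# A pipeline is just a logical grouping of such activities.
pipeline = PipelineResource(activities=[copy_activity])
adf_client.pipelines.create_or_update(rg, factory, "CopyPipeline", pipeline)
```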
6. What are the different ways to execute pipelines in Azure Data Factory?
There are three ways in which we can execute a pipeline in Data Factory:
i) Debug mode can be helpful when trying out pipeline code and acts as a tool to test
and troubleshoot our code.
ii) Manual Execution is what we do by clicking on the ‘Trigger now’ option in a pipeline.
This is useful if you want to run your pipelines on an ad-hoc basis.
iii) We can schedule our pipelines at predefined times and intervals via a Trigger. As we
will see later in this article, there are three types of triggers available in Data Factory.
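The manual (ad-hoc) run and its monitoring can also be invoked programmatically; a hedged sketch with the azure-mgmt-datafactory Python SDK, using placeholder names, might look like this:

```python
import time
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
rg, factory = "<resource-group>", "<data-factory-name>"

# Kick off an ad-hoc run of a placeholder pipeline, optionally passing parameters.
run = adf_client.pipelines.create_run(rg, factory, "CopyPipeline", parameters={})

# Poll the run status until it finishes.
while True:
    status = adf_client.pipeline_runs.get(rg, factory, run.run_id)
    if status.status not in ("Queued", "InProgress"):
        break
    time.sleep(15)

print(status.status)  # e.g. Succeeded or Failed
```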
1. For a Data Store representation, i.e., any storage system like Azure Blob storage
account, a file share, or an Oracle DB/ SQL Server instance.
2. For Compute representation, i.e., the underlying VM that will execute the activity defined in the pipeline.
The following diagram shows the location settings for Data Factory and its integration
runtimes:
Source: docs.microsoft.com/en-us/azure/data-factory/concepts-integration-runtime
There are three types of integration runtime supported by Azure Data Factory, and one
should choose based on their data integration capabilities and network environment
requirements.
1. Azure Integration Runtime: Used to copy data between cloud data stores and to dispatch activities to various computing services such as SQL Server, Azure HDInsight, etc.
2. Self-Hosted Integration Runtime: Used for running copy activities between cloud data stores and data stores in private networks. The self-hosted integration runtime is software with the same code as the Azure Integration Runtime, but it is installed on your local system or on a virtual machine over a virtual network.
3. Azure-SSIS Integration Runtime: Used to natively execute SSIS (SQL Server Integration Services) packages in a managed Azure environment, as described earlier.
11. What are ARM Templates in Azure Data Factory? What are they used for?
An ARM template is a JSON (JavaScript Object Notation) file that defines the
infrastructure and configuration for the data factory pipeline, including pipeline activities,
linked services, datasets, etc. The template will contain essentially the same code as
our pipeline.
ARM templates are helpful when we want to migrate our pipeline code to higher
environments, say Production or Staging from Development, after we are convinced
that the code is working correctly.
2. Create a pull request to merge the code into the Dev (collaboration) branch after we're sure it works correctly.
4. This can trigger an automated CI/CD DevOps pipeline to promote the code to higher environments like Staging or Production.
13. Which three activities can you run in Microsoft Azure Data Factory?
As we discussed in question #3, Data Factory supports three activities: data movement,
transformation, and control activities.
1. Data movement activities: As the name suggests, these activities help move data
from one place to another.
e.g., Copy Activity in Data Factory copies data from a source to a sink data store.
2. Data transformation activities: These activities help transform the data while we load it into its target or destination data store.
e.g., Stored Procedure, U-SQL, Azure Functions, etc.
3. Control flow activities: Control (flow) activities help control the flow of any activity
in a pipeline. e.g., Wait activity makes the pipeline wait for a specified amount of
time.
14. What are the two types of compute environments supported by Data Factory to execute transform activities?
i) On-Demand Computing Environment: This is a fully managed environment provided by ADF, in which the compute is created when a job is submitted and removed once the job completes.
ii) Bring Your Own Environment: In this option, you use ADF to manage your computing environment if you already have the infrastructure for on-premises services.
i) Connect and Collect: Connect to the data source(s) and move the data to a centralized data store.
ii) Data transformation using computing services such as HDInsight, Hadoop, Spark etc.
iii) Publish: To load data into Azure data lake storage, Azure SQL data warehouse,
Azure SQL databases, Azure Cosmos DB, etc.
iv)Monitor: Azure Data Factory has built-in support for pipeline monitoring via Azure
Monitor, API, PowerShell, Azure Monitor logs, and health panels on the Azure portal.
16. If you want to use the output by executing a query, which activity shall you
use?
Look-up activity can return the result of executing a query or stored procedure.
The output can be a singleton value or an array of attributes, which can be consumed in
subsequent copy data activity, or any transformation or control flow activity like ForEach
activity.
18. Have you used the Execute Notebook activity in Data Factory? How do you pass parameters to a notebook?
ii) coalesce: We can use the @coalesce construct in the expressions to handle null
values gracefully.
20. Is it possible to push code and have CI/CD (Continuous Integration and Continuous Delivery) in ADF?
Set Variable and Append Variable are two activities used for setting or manipulating the values of variables. There are two types of variables in a data factory:
i) System variables: These are fixed variables provided by the Azure pipeline, for example the pipeline name, pipeline ID, trigger name, etc. (e.g., @pipeline().RunId returns the run ID of the current pipeline run). You mostly need these to get system information that might be needed in your use case.
ii) User variables: A user variable is declared manually in your code based on your pipeline logic.
Mapping data flows provide an entirely visual experience with no coding required. Data
flows run on ADF-managed execution clusters for scaled-out data processing. Azure
Data Factory manages all the code translation, path optimization, and execution of the
data flow jobs.
i) Read data from the source data store (e.g., blob storage).
ii) Perform the following operations on the data: compression/decompression, column mapping, etc.
iii) Write data to the destination data store or sink (e.g., Azure Data Lake).
Source: docs.microsoft.com/en-us/learn/modules/intro-to-azure-data-factory/3-how-
azure-data-factory-works
3. Get Metadata Activity which can provide metadata about any data source.
6. Wait Activity to wait for a specified amount of time before/in between the pipeline
run.
7. Validation Activity will validate the presence of files within the dataset.
8. Web Activity to call a custom REST endpoint from an ADF pipeline.
ii) Not all the team members are experienced in coding and may prefer graphical tools
to work with data.
iii) When raw business data is stored at diverse data sources, which can be on-prem
and on the cloud, we would like to have one analytics solution like ADF to integrate
them all in one place.
iv) We would like to use readily available data movement and processing solutions and
like to be light in terms of infrastructure management. So, a managed solution like ADF
makes more sense in this case.
28. How can you access data using the other 90 dataset types in Data Factory?
The mapping data flow feature natively supports Azure SQL Database, Azure Synapse Analytics, delimited text files from an Azure storage account or Azure Data Lake Storage Gen2, and Parquet files from blob storage or Data Lake Storage Gen2 as source and sink data sources.
Use the Copy activity to stage data from any other connectors, and then execute a Data
Flow activity to transform data after it's been staged.
29. What is the difference between mapping and wrangling data flow (Power
query activity)?
Mapping data flows transform data at scale without requiring coding. You can design a
data transformation job in the data flow canvas by constructing a series of
transformations. Start with any number of source transformations followed by data
transformation steps. Complete your data flow with a sink to land your results in a
destination. It is excellent at mapping and transforming data with known and unknown
schemas in the sinks and sources.
Power Query Data Wrangling allows you to do agile data preparation and exploration
using the Power Query Online mashup editor at scale via spark execution. With the rise
of data lakes, sometimes you just need to explore a data set or create a dataset in the
lake.
It currently supports 24 SQL data types, from char and nchar to int, bigint, timestamp, xml, etc.
You can use the column dropdown to override an existing column in your schema. Click
the Enter expression textbox to start creating the derived column’s expression. You can
input or use the expression builder to build your logic.
In simple terms, lookup activity is used for data fetching in the ADF pipeline. The way
you would use it entirely relies on your pipeline logic. It is possible to obtain only the first
row, or you can retrieve the complete rows depending on your dataset or query.
32. Elaborate more on the Get Metadata activity in Azure Data Factory.
The Get Metadata activity is used to retrieve the metadata of any data in the Azure Data
Factory or a Synapse pipeline. We can use the output from the Get Metadata activity in
conditional expressions to perform validation or consume the metadata in subsequent
activities.
It takes a dataset as an input and returns metadata information as output. Currently, the
following connectors and the corresponding retrievable metadata are supported. The
maximum size of returned metadata is 4 MB.
Please refer to the snapshot below for supported metadata which can be retrieved
using the Get Metadata activity.
Source: docs.microsoft.com/en-us/azure/data-factory/control-flow-get-metadata-
activity#metadata-options
Examples of data sources include Azure Data Lake Storage, Azure Blob Storage, or any other database such as MySQL, Azure SQL Database, PostgreSQL, etc.
37. Can you share any difficulties you faced while getting data from on-premises sources to the Azure cloud, and how did you overcome them?
There are some configuration options for a copy activity, which can help in tuning this
process and can give desired results.
i) We should use the compression option to get the data in a compressed mode while
loading from on-prem servers, which is then de-compressed while writing on the cloud
storage.
ii) Staging area should be the first destination of our data after we have enabled the
compression. The copy activity can decompress before writing it to the final cloud
storage buckets.
iii) Degree of Copy Parallelism is another option to help improve the migration process.
This is identical to having multiple threads processing data and can speed up the data
copy process.
There is no right fit-for-all here, so we must try out different numbers like 8, 16, or 32
and see which gives a good performance.
iv) Data Integration Unit is loosely the number of CPUs used, and increasing it may
improve the performance of the copy process.
38. How to copy multiple sheet data from an Excel file?
When we use the Excel connector within a data factory, we must provide the sheet name from which we want to load data. This approach works when we deal with data from a single sheet or a handful of sheets, but when we have many sheets (say 10+), it can become tedious, as we have to change the hard-coded sheet name every time!
However, we can use the data factory binary format connector for this, point it at the Excel file, and skip providing the sheet name(s). We'll be able to use a copy activity to copy the data from all the sheets present in the file.
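Outside ADF, the same "all sheets at once" idea can be illustrated with a small pandas sketch (file and sheet contents are hypothetical); passing sheet_name=None reads every sheet of the workbook into a dict of DataFrames:

```python
import pandas as pd

# sheet_name=None loads every sheet of the workbook into a dict keyed by sheet name
# (requires an Excel engine such as openpyxl to be installed).
sheets = pd.read_excel("sales_workbook.xlsx", sheet_name=None)

# Stack all sheets into one DataFrame, tagging each row with its source sheet.
combined = pd.concat(
    (df.assign(source_sheet=name) for name, df in sheets.items()),
    ignore_index=True,
)
print(combined.shape)
```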
40. How to copy multiple tables from one datastore to another datastore?
An efficient approach to complete this task would be:
i) Maintain a lookup table/file containing the list of tables and their sources that need to be copied.
ii) Then, we can use the Lookup activity and a ForEach loop activity to scan through the list.
iii) Inside the for each loop activity, we can use a copy activity or a mapping dataflow to
accomplish the task of copying multiple tables to the destination datastore.
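A hedged Python sketch of this metadata-driven pattern is shown below; it uses placeholder names, assumes a parameterized single-table copy pipeline already exists in the factory, and drives it with the azure-mgmt-datafactory SDK much like a Lookup plus ForEach would inside ADF.

```python
import csv
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
rg, factory = "<resource-group>", "<data-factory-name>"

# The lookup file listing tables to copy (schema,table per row), i.e. the control list.
with open("tables_to_copy.csv", newline="") as f:
    tables = list(csv.DictReader(f))  # e.g. [{"schema": "dbo", "table": "customers"}, ...]

# Mirror Lookup + ForEach: one run of a parameterized copy pipeline per table.
for t in tables:
    adf_client.pipelines.create_run(
        rg, factory, "copy_single_table",
        parameters={"schema_name": t["schema"], "table_name": t["table"]},
    )
```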
41. What are some performance tuning techniques for Mapping Data Flow
activity?
We could consider the below set of parameters for tuning the performance of a
Mapping Data Flow activity we have in a pipeline.
i) Partitioning: Microsoft recommends that we use the default partitioning (size 128 MB) selected by the Data Factory, as it intelligently chooses one based on our pipeline configuration. Still, one should try out different partitions and see if they give better performance.
ii) We should not use a data flow activity for each loop activity. Instead, suppose we
have multiple files similar in terms of structure and the processing need. In that case,
we should use a wildcard path inside the data flow activity, enabling the processing of
all the files within a folder.
iii) The recommended file format is '.parquet', because the pipeline executes by spinning up Spark clusters, and Parquet is a native file format for Apache Spark; thus, it will generally give good performance.
iv) Multiple logging modes are available: Basic, Verbose, and None.
We should not use verbose mode unless essential, as it will log all the details about
each operation the activity is performing. e.g., It will log all the details of the operations
performed for all the partitions we have. This one is useful when troubleshooting issues
with the data flow.
The basic mode will give out all the necessary basic details in the log, so try to use this
one whenever possible.
v) Try to break down a complex data flow activity into multiple data flow activities. Let’s
say we have n number of transformations between source and sink, and by adding
more, we think the design has become complex. In this case, try to have it in multiple
such activities, which will give two advantages:
a) All activities will run on separate spark clusters, so the run time will come down for
the whole task.
b) The whole pipeline will be easy to understand and maintain in the future.
i) We can't nest looping activities in the data factory, and we must use a workaround if we have that sort of structure in our pipeline. All the iteration and conditional activities come under this: If Condition, ForEach, Switch, and Until.
ii) The lookup activity can retrieve only 5000 rows at a time and not more than that.
Again, we need to use some other loop activity along with SQL with the limit to achieve
this sort of structure in the pipeline.
44. How are all the components of Azure Data Factory combined to complete the
purpose?
The below diagram depicts how all these components can be clubbed together to fulfill
Azure Data Factory ADF tasks.
Source: docs.microsoft.com/en-us/learn/modules/intro-to-azure-data-factory/3-how-azure-data-factory-works
Check out: https://docs.microsoft.com/en-us/azure/data-factory/transform-data-using-machine-learning#using-machine-learning-studio-classic-with-azure-data-factory-or-synapse-analytics
47. What is Azure SQL database? Can you integrate it with Data Factory?
Part of the Azure SQL family, Azure SQL Database is an always up-to-date, fully
managed relational database service built for the cloud for storing data. We can easily
design data pipelines to read and write to SQL DB using the Azure data factory.
Check out: https://docs.microsoft.com/en-us/azure/data-factory/connector-azure-sql-database?tabs=data-factory
It’s available for Azure SQL Database, Azure SQL Managed Instance, and Azure Synapse
Analytics.
It can be carried out as a security policy on all the different SQL databases across the Azure
subscription.
The levels of masking can be controlled per the users' needs.
3. What is meant by PolyBase?
PolyBase is used for optimizing data ingestion into the PDW (Parallel Data Warehouse) and supports T-SQL. It lets developers query external data transparently from supported data stores, no matter the storage architecture of the external data store.
Develop and schedule data-driven workflows that can take data from different data stores.
Process and transform data with the help of computing services such as HDInsight Hadoop,
Spark, Azure Data Lake Analytics, and Azure Machine Learning.
2. Define the steps involved in creating the ETL process in Azure Data Factory.
The steps involved in creating the ETL process in Azure Data Factory are:
Create a Linked Service for the source data store (a SQL Server database)
Create a Linked Service for the destination data store (Azure Data Lake Store)
Create a dataset for saving the data
Build the pipeline and add the copy activity
Schedule the pipeline by attaching a trigger
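As a hedged sketch of the first two steps with the azure-mgmt-datafactory Python SDK (placeholder names and connection strings; exact model classes may vary by SDK version), creating the source and destination linked services could look like this:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    LinkedServiceResource, SqlServerLinkedService, AzureBlobFSLinkedService, SecureString,
)

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
rg, factory = "<resource-group>", "<data-factory-name>"

# Source: the SQL Server database holding the cars dataset.
sql_ls = LinkedServiceResource(properties=SqlServerLinkedService(
    connection_string=SecureString(value="Server=<host>;Database=cars;User ID=<user>;Password=<pwd>;"),
))
adf_client.linked_services.create_or_update(rg, factory, "SourceSqlServer", sql_ls)

# Destination: ADLS Gen2 account (AzureBlobFS is the ADLS Gen2 linked service type).
# In practice the key would come from Key Vault rather than being inlined.
adls_ls = LinkedServiceResource(properties=AzureBlobFSLinkedService(
    url="https://<account>.dfs.core.windows.net",
    account_key="<storage-account-key>",
))
adf_client.linked_services.create_or_update(rg, factory, "DestinationAdls", adls_ls)
```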
Users have to pay to access the compute resources the code uses within the brief period in which the
code is being executed. It's cost-effective, and users need to pay only for the resources they have used.
1. Pipeline
Used as a carrier for the numerous processes taking place. Every individual process is known as an
activity.
2. Activities
Activities stand for the processing steps involved in a pipeline. A pipeline has one or more activities, and an activity can be anything from querying a data set to transferring a dataset from one source to another.
3. Datasets
Simply put, it’s a structure that holds the data.
4. Linked Services
Used for storing critical information required to connect to an external source.
You need to prepare these Azure data engineer interview questions for experienced professionals when
applying for more advanced positions:
Here are the three ways in which a synthetic partition key can be created:
1. Concatenate Properties: Combine several property values to create a synthetic partition key.
2. Random Suffix: A random number is added at the end of the partition key's value.
3. Pre-calculated Suffix: Add a pre-calculated number to the end of the partition to enhance read
performance.
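The three strategies are easy to sketch in Python; the property names and bucket count below are hypothetical and only illustrate how a synthetic Cosmos DB partition key value might be built.

```python
import random

def concatenated_key(item: dict) -> str:
    # 1. Concatenate properties: combine several property values into one key.
    return f"{item['tenant_id']}-{item['region']}"

def random_suffix_key(base: str, buckets: int = 10) -> str:
    # 2. Random suffix: append a random bucket number to spread writes across partitions.
    return f"{base}-{random.randint(0, buckets - 1)}"

def precalculated_suffix_key(base: str, order_id: int, buckets: int = 10) -> str:
    # 3. Pre-calculated suffix: derive the suffix deterministically so reads can
    #    recompute the same partition key value without scanning every bucket.
    return f"{base}-{order_id % buckets}"

item = {"tenant_id": "contoso", "region": "westeurope", "order_id": 42}
print(concatenated_key(item))
print(random_suffix_key("contoso"))
print(precalculated_suffix_key("contoso", item["order_id"]))
```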
As you prepare for your DE interview, it would be best to study Azure using a holistic approach that
extends beyond the fundamentals of the role. Don’t forget to prep your resume as well with the help of
the Data Engineer Resume Guide.
Azure data engineers are responsible for the integration, transformation, operation, and consolidation
of data from structured or unstructured data systems.
As an Azure data engineer, you'll need skills such as database system management (SQL or NoSQL), data warehousing, ETL (Extract, Transform, and Load) tools, machine learning, knowledge of programming language basics (Python/Java), and so on.
Get a good understanding of Azure’s Modern Enterprise Data and Analytics Platform and build your
knowledge across its other specialties. Further, you should also be able to communicate the business
value of the Azure Data Platform.
Q4. What are the important Azure data engineer interview questions?
Some important questions are: What is the difference between Azure Data Lake Store and Blob storage?
Differentiate between Control Flow activities and Data Flow Transformations. How is the Data factory
pipeline manually executed?