Data Engineering Top 100 Questions
TR Raveendra https://www.youtube.com/@TRRaveendra
A typical data engineering architecture may involve several layers, including:
Data Sources: This layer includes all the systems and applications that generate data, such as databases, sensors, and web services.
Data Ingestion: This layer is responsible for collecting and aggregating data from various sources and bringing it into the data processing pipeline. Tools such
as Apache Kafka, Apache Flume, and AWS Kinesis are often used for this purpose.
Data Storage: This layer involves storing data in a structured, semi-structured, or unstructured format, depending on the nature of the data. Common data
storage systems include relational databases, NoSQL databases, and data lakes.
Data Processing: This layer is responsible for transforming, cleaning, and enriching the data to make it useful for downstream analytics and machine learning.
Tools such as Apache Spark, Apache Beam, and AWS Glue are commonly used for this purpose.
Data Analytics: This layer involves using various tools and techniques to analyze the data and gain insights. This may involve data visualization, machine
learning, and other data analysis techniques.
Data Delivery: This layer is responsible for delivering the data and insights to end-users or downstream applications, such as dashboards, reports, and APIs.
Overall, a good data engineering architecture should be scalable, reliable, secure, and flexible enough to accommodate changing data requirements and
business needs.
3. How does the day to day look like for a Data Engineer?
The day-to-day responsibilities of a data engineer can vary depending on the organization, but here are some common tasks that a data engineer may
perform:
Data Ingestion: Collecting data from various sources and systems and bringing it into the data processing pipeline. This may involve setting up ETL (Extract,
Transform, Load) pipelines or working with tools such as ADF, AZCopy, FTP Tools, API Calls, Apache Kafka, Apache Flume, or AWS Kinesis.
Data Storage: Designing and implementing data storage solutions that can accommodate large volumes of structured, semi-structured, or unstructured data in data lakes like Azure Data Lake Gen2, AWS S3, Google Cloud Storage, and HDFS. This may involve working with cloud data warehouses like Azure SQL DWH, AWS Redshift, Google BigQuery, and Snowflake, as well as NoSQL databases.
Data Transformation: Transforming and cleaning the data to make it useful for downstream analytics and machine learning. This may involve using tools such
as Apache Spark, Databricks Spark SQL, Synapse Analytics with SQL/Spark, Apache Beam, or AWS Glue.
Data Quality: Ensuring the quality and consistency of the data by implementing data validation, verification, and monitoring processes using PySpark transformations in Databricks, Synapse Analytics, or other cloud data warehouses using SQL.
Performance Optimization: Improving the performance of data processing pipelines by optimizing queries, improving data partitioning, or using caching
strategies.
Data Security: Ensuring the security of the data by implementing encryption, access controls, and other security measures.
Documentation: Creating and maintaining documentation for the data processing pipelines and data storage systems to help ensure that they can be easily
understood and maintained by other members of the team.
Collaboration: Working closely with data analysts, data scientists, and other stakeholders to understand their requirements and to ensure that the data
engineering systems meet their needs.
Overall, the day-to-day responsibilities of a data engineer involve designing, building, and maintaining the infrastructure and systems that enable an
organization to effectively manage and process large volumes of data.
5. What are the various data engineering tools and technologies you used?
The following Azure services have been used in the architecture:
✅ Azure Synapse Analytics
✅ Azure DataBricks
✅ Databricks SQL Analytics
✅ Azure Data Lake Gen2
✅ Azure Cosmos DB
✅ Azure Cognitive Services
✅ Azure Machine Learning
✅ Azure Event Hubs
✅ Azure IoT Hub
✅ Azure Stream Analytics
✅ Microsoft Purview
✅ Azure Data Share
✅ Microsoft Power BI
✅ Azure Active Directory
✅ Azure Cost Management
✅ Azure Key Vault
✅ Azure Monitor
✅ Microsoft Defender for Cloud
✅ Azure DevOps
✅ Azure Policy
✅ GitHub
6. What are the various cloud data platforms you have worked with? Explain the data services available in one of the cloud providers.
Amazon Web Services (AWS): AWS offers a wide range of data-related services, including data storage (S3, EBS, EFS, etc.), data processing (EC2, EMR, Glue, etc.), and data analytics (Athena, Redshift, QuickSight, etc.).
Microsoft Azure: Azure offers a range of data-related services, including data storage (Blob Storage, Azure Files, etc.), data processing (HDInsight, Azure Data Factory, etc.), and data analytics (Azure Synapse Analytics, Power BI, etc.).
Google Cloud Platform (GCP): GCP offers a range of data-related services, including data storage (Cloud Storage, Cloud SQL, etc.), data processing (Cloud Dataproc, Dataflow, etc.), and data analytics (BigQuery, Data Studio, etc.).
Snowflake: Snowflake is a cloud-based data warehousing platform that enables users to store, process, and analyze large volumes of data.
IBM Cloud: IBM Cloud offers a range of data-related services, including data storage (Cloud Object Storage, Cloud Databases, etc.), data processing (IBM Cloud Pak for Data, Watson Studio, etc.), and data analytics (IBM Cognos Analytics, Watson Discovery, etc.).
7. What are the various programming languages used in data engineering?
There are several programming languages that are commonly used in data engineering, including:
Python: Python is a popular language for data engineering due to its ease of use, versatility, and large selection of data-related libraries and frameworks, such
as Pandas, NumPy, and Apache Spark. Python can be used for data processing, data analysis, and building data pipelines.
SQL: SQL (Structured Query Language) is a language used for querying and manipulating data in relational databases. It is a standard language that is widely
used in data engineering, particularly for data warehousing and data analytics.
Java: Java is a commonly used language for building large-scale data processing systems, particularly those that use distributed computing frameworks such
as Apache Hadoop and Apache Spark.
Scala: Scala is a high-performance language that is used in distributed computing frameworks such as Apache Spark. It is often used in conjunction with
Java to build large-scale data processing systems.
R: R is a language that is commonly used for statistical computing and data analysis. It has a large selection of libraries and frameworks that make it well-
suited for data engineering tasks such as data cleaning, data visualization, and data analysis.
JavaScript: Snowflake supports the use of JavaScript in Snowflake stored procedures and user-defined functions (UDFs). JavaScript can be used in stored procedures and UDFs to implement custom business logic and data transformations within Snowflake. This can include manipulating data, calling external
APIs, and performing calculations.
Other programming languages that are commonly used in data engineering include C++, Perl, and Go. The choice of language will depend on factors such as
the specific requirements of the project, the technical expertise of the team, and the available tools and libraries.
Ease of use: Python is an easy-to-learn language with a simple syntax that makes it accessible for beginners. This means that data engineers can quickly get
up to speed with Python and start building data pipelines.
Versatility: Python has a wide range of libraries and frameworks that are specifically designed for data processing, such as Pandas, NumPy, and Apache
Spark. This makes it a powerful language for building data engineering pipelines, as well as for performing data analysis and machine learning tasks.
Interoperability: Python can easily interface with other languages and technologies commonly used in data engineering, such as SQL databases and big data
platforms like Hadoop and Spark.
Community: Python has a large and active community of data engineers and data scientists, who contribute to open-source libraries and tools that make
data engineering tasks faster and more efficient.
Scalability: Python's ability to scale horizontally through distributed computing frameworks like Apache Spark and Dask makes it suitable for processing large
data sets in parallel.
Flexibility: Python is a multi-purpose language that can be used for a variety of tasks beyond data engineering and data science. For example, it is used for
web development, automation, and scripting.
Better support for deep learning: Python has gained popularity in recent years as a language for deep learning, and as such, it has several widely used deep
learning libraries, such as TensorFlow, Keras, and PyTorch. These libraries are well-supported in Databricks, making Python a natural choice for data
engineers and data scientists who are working on deep learning projects.
Integration with Jupyter Notebooks: Databricks supports the use of Jupyter Notebooks, which are a popular platform for interactive data analysis and
visualization using Python. This integration enables data engineers and data scientists to work seamlessly in a single environment.
Data extraction: SQL can be used to extract data from databases and load it into data pipelines for further processing.
Data transformation: SQL can be used to transform data by manipulating or joining tables, filtering data based on certain conditions, or aggregating data
into new forms.
Data quality: SQL can be used to ensure data quality by performing data validation checks, such as checking for missing or duplicate values.
Data warehousing: SQL can be used to create and manage data warehouses, which are centralized repositories of data used for reporting and analysis.
ETL (Extract, Transform, Load): SQL can be used in ETL pipelines to extract data from different sources, transform it into the desired format, and load it into
the target data store.
SQL is a widely used and powerful tool for data engineering, and it is supported by many different relational database management systems (RDBMS). Additionally, the use of SQL can help ensure data integrity and consistency across different systems and applications.
10. What are the different data sources / targets you used in your project?
Data engineering pipelines can involve any number of sources and targets, such as:
Relational databases: These are databases that store data in tables with predefined relationships between them. Common examples include MySQL,
PostgreSQL, Oracle, and Microsoft SQL Server.
NoSQL databases: These are databases that store data in non-tabular structures, such as key-value stores, document databases, and graph databases.
Common examples include MongoDB, Cassandra, and Neo4j.
Cloud storage: Cloud storage services like Amazon S3, Google Cloud Storage, and Microsoft Azure Blob Storage are commonly used as sources and targets
for data pipelines.
APIs: APIs (Application Programming Interfaces) allow data to be accessed and integrated from various online services and applications. Common examples
include REST APIs and SOAP APIs.
File formats: Data can be stored and exchanged in a variety of file formats, such as CSV, JSON, XML, Parquet, and Avro. These formats are often used as
sources and targets for data pipelines.
Message queues: Message queues, such as Apache Kafka and RabbitMQ, are used to send and receive data between different systems.
11. What does data ingestion mean? What tools have you used for data ingestion?
Ingestion in cloud computing refers to the process of bringing data from external sources into a cloud-based storage or processing system. This is a critical
step in data engineering, as it enables organizations to centralize their data and make it accessible for analysis, reporting, and other purposes.
There are various types of ingestion tools that are commonly used in cloud computing, including:
Batch ingestion tools: These tools are used for processing large volumes of data in batches, typically on a daily or weekly basis. Examples of batch ingestion
tools include Apache Hadoop, Apache Spark, and Apache Flume.
Real-time ingestion tools: These tools are used for processing data in real-time or near real-time, typically with low latency and high throughput. Examples of
real-time ingestion tools include Azure Event Hubs, Azure IoT Hub, Apache Kafka, AWS Kinesis, and Google Cloud Pub/Sub.
Cloud-native ingestion tools: These tools are specifically designed for use in cloud environments and often provide seamless integration with cloud-based storage and processing systems. Examples of cloud-native ingestion tools include AWS Glue, Google Cloud Dataflow, and Microsoft Azure Data Factory.
Database replication tools: These tools are used to replicate data from one database to another, typically for disaster recovery or high availability purposes.
Examples of database replication tools include AWS Database Migration Service, Google Cloud SQL, and Microsoft SQL Server Replication.
File transfer tools: These tools are used for transferring files from one location to another, typically over the internet. Examples of file transfer tools include
AWS Transfer for SFTP, Google Cloud Storage Transfer Service, and Microsoft Azure File Sync.
12. How do you read / move data from on-premises to the cloud? Did you work with any tools?
Cloud Data Migration Tools: These tools are specifically designed for use in cloud environments and often provide seamless integration with cloud-based storage and processing systems. Examples of cloud-native ingestion tools include AWS Glue, Google Cloud Dataflow, and Microsoft Azure Data Factory.
Third-party data integration tools: There are many third-party data integration tools available that can help you ingest data from on-premises to the cloud.
These tools offer features such as data transformation, data validation, and data cleansing, and support a wide range of data sources and targets.
Custom scripts and APIs: If you have specific data integration requirements, you can develop custom scripts or APIs to ingest data from on-premises to the cloud. This approach can be time-consuming and complex, but it provides more flexibility and control over the data integration process.
13. What are the different file types you used in your day-to-day work?
In data engineering, there are several types of data files that are commonly used to store and process data. Here are a few examples:
CSV files: CSV (Comma-Separated Values) files are a simple and widely used format for storing tabular data. Each row represents a record, and each column represents a field in the record. CSV files can be easily imported into a variety of data processing tools and databases.
JSON files: JSON (JavaScript Object Notation) files are a lightweight data format that is easy to read and write. They are commonly used for web-based applications and for exchanging data between different programming languages. JSON files are structured as key-value pairs and can contain nested objects and arrays.
Parquet files: Parquet is a columnar storage format that is optimized for big data processing. It is designed to be highly efficient for analytics workloads, allowing for faster query performance and reduced storage costs. Parquet files are often used in data warehousing and data lake environments.
Avro files: Avro is a binary data format that is designed to be compact and efficient. It supports schema evolution, meaning that the schema can be changed over time without breaking existing applications. Avro files are often used in Hadoop and other big data processing frameworks.
ORC files: ORC (Optimized Row Columnar) files are another columnar storage format that is designed for fast data processing. ORC files are highly compressed, making them efficient to store and transmit over networks. They are commonly used in Hadoop and other big data processing environments.
Delta files: Delta is a file format created by Databricks that is designed for building data lakes and data warehouses. Delta files are based on Parquet files but add transactional capabilities to support updates, deletes, and merges. Delta files are designed to be highly scalable and performant, making them well-suited for big data processing.
XML files: XML (Extensible Markup Language) is a markup language that is used to store and transport data. XML files are structured as nested elements, with each element representing a record or data item. XML is a flexible and self-describing format that can be used for a wide range of data processing needs, including web services, document exchange, and database integration.
The choice of data file format depends on the specific needs of the data processing and storage system. Data engineers need to consider factors such as performance, scalability, flexibility, and interoperability when selecting the appropriate format for a particular use case.
15. What are the differences between ORC, Avro, and Parquet files?
ORC, Avro, and Parquet are three popular file formats used in data engineering. Here are some key differences between them:
Compression: All three file formats support compression, but the specific compression algorithms and settings vary. Parquet and ORC support more advanced compression techniques like Zstandard and Snappy, which can lead to higher compression ratios and faster query performance. Avro supports simpler compression techniques like deflate and snappy, which are less efficient but more widely supported.
Schema evolution: Avro has strong support for schema evolution, meaning that the schema can be changed over time without breaking existing applications. Parquet and ORC also support schema evolution, but the process is more complex and requires more care to avoid breaking compatibility.
Performance: Parquet and ORC are optimized for big data processing and can handle large volumes of data quickly and efficiently. Avro is more lightweight and may be better suited for smaller datasets or applications that require more flexible schema management.
Integration with data processing tools: All three file formats are widely supported by data processing tools and platforms, including Hadoop, Spark, and others. However, some tools may have better performance or compatibility with certain file formats.
16. What are the best use cases for ORC, Avro, and Parquet files? (Explain the most suitable data scenarios for each file type.)
If you have large-scale data processing needs: ORC and Parquet are both designed for efficient storage and processing of large datasets. They can provide high performance and scalability for data processing jobs that require processing of large volumes of data.
If you need advanced compression capabilities: Parquet and ORC support more advanced compression techniques like Zstandard and Snappy, which can lead to higher compression ratios and faster query performance. If you have specific compression requirements, these file formats may be a better choice than Avro.
If you need strong schema evolution support: Avro has strong support for schema evolution, meaning that the schema can be changed over time without breaking existing applications. If you expect your schema to evolve frequently, Avro may be a better choice than Parquet or ORC.
If you need a lightweight and flexible file format: Avro is a lightweight and flexible file format that can be used for a wide range of data processing needs. If you have a small dataset or require more flexible schema management, Avro may be a better choice than Parquet or ORC.
17. How do you convert data from one file format to another? For example, CSV to ORC, ORC to Parquet, or JSON to CSV?
You can use the following steps to convert a file in ORC format to CSV:
Read the ORC file using the spark.read.orc() function and store it in a DataFrame.
Write the DataFrame to a CSV file using the DataFrame.write() function, specifying the format as "csv".
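A minimal PySpark sketch of these steps (the input and output paths are illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("orc_to_csv").getOrCreate()

    # Read the ORC data into a DataFrame (path is illustrative)
    df = spark.read.orc("/mnt/data/input_orc")

    # Write the DataFrame out as CSV with a header row
    df.write.format("csv").option("header", "true").mode("overwrite").save("/mnt/data/output_csv")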
This will save the contents of the ORC file in CSV format to the specified file path.
Similarly, to convert a CSV file to ORC format, you can use the following steps:
Read the CSV file using the spark.read.csv() function and store it in a DataFrame.
Write the DataFrame to an ORC file using the DataFrame.write() function, specifying the format as "orc".
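The reverse direction follows the same pattern; a sketch with illustrative paths:

    # Read the CSV file into a DataFrame, inferring the schema from the data
    df = spark.read.csv("/mnt/data/input_csv", header=True, inferSchema=True)

    # Write the DataFrame out in ORC format
    df.write.format("orc").mode("overwrite").save("/mnt/data/output_orc")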
Metadata is essential for effective data discovery, integration, transformation, and analysis. It helps to ensure that data is accurate, consistent, and usable across different applications and systems.
19. What are typical scenarios where you get data in JSON file format?
JSON (JavaScript Object Notation) is a widely used data format that is simple, lightweight, and easy to parse. It is used for a variety of data scenarios,
including:
Web APIs: JSON is commonly used as a data format for web APIs, which allow applications to exchange data over the internet. APIs can return JSON data in
response to requests from client applications, which can then parse and use the data as needed.
Big data: JSON is often used for storing and processing big data, especially in NoSQL databases like MongoDB and Cassandra. JSON allows for flexible schema design, which can be beneficial for handling unstructured or semi-structured data.
IoT devices: JSON is used to transmit and store data from Internet of Things (IoT) devices, which can generate large amounts of data in real time. JSON can
be used to encode data such as sensor readings, location data, and other metadata.
Configuration files: JSON can be used to store configuration data for applications or systems. This can include data such as server settings, user preferences, or other application-specific settings.
Log files: JSON can be used to store log data, which can be analyzed and monitored for application performance, security, or other purposes. JSON log data
can be easier to parse and analyze than other formats like plain text or XML.
Overall, JSON is a versatile and widely used data format that can be used in a variety of data scenarios. Its simplicity, flexibility, and compatibility with many
programming languages and tools make it a popular choice for developers and data engineers.
20. What are the different types of data you worked with in your project?
Explain the different types of source and target data files in your project, such as CSV, TSV, Parquet, ORC, and Delta file formats.
Architecture: Pandas is a Python library that runs on a single machine, whereas Spark is a distributed computing system that can run on a cluster of
machines.
Data Size: Pandas is suitable for working with small to medium-sized datasets that can fit into memory, while Spark is designed to handle larger datasets that may be too big to fit into memory.
Performance: Due to its distributed architecture, Spark is generally faster than Pandas for large-scale data processing tasks. However, for small datasets that
can fit into memory, Pandas can be faster due to its optimized data structures.
Ease of Use: Pandas has a simpler and more user-friendly API compared to Spark, which can require more complex code for certain operations.
In summary, if you are working with large datasets that require distributed processing, Spark is likely the better choice. However, if your data fits into memory
and you are comfortable working with Python, Pandas can be a more convenient option.
22. Which is better between Pandas and Spark? Explain why.
If you are working with large datasets that require distributed processing, Spark is likely the better choice. However, if your data fits into memory and you are comfortable working with Python, Pandas can be a more convenient option.
Each cluster in Databricks is assigned a specific set of resources, including CPU, memory, and storage, depending on the selected instance type and cluster configuration. The cluster can also be configured with additional libraries, runtime environments, and settings to optimize performance and ensure
compatibility with the data being processed.
Clusters in Databricks are typically used to run distributed computing jobs, such as processing large datasets, training machine learning models, or
performing real-time data analytics. Multiple users can share a cluster, and the platform automatically manages resource allocation and scheduling to ensure
efficient use of the available resources.
Azure Data Factory provides a graphical interface and a set of tools to enable you to build and manage data pipelines that can connect to various data
sources, including on-premises and cloud-based sources. You can use the tool to perform a variety of data integration tasks, such as data ingestion, data
transformation, and data loading.
Integration with other Azure services, such as Azure Blob Storage, Azure Data Lake Storage, Azure SQL Database, and more.
Support for a variety of data sources, including structured, unstructured, and semi-structured data.
Built-in data transformation and processing using Azure Databricks or HDInsight.
Advanced security and monitoring features, including role-based access control, auditing, and logging.
Overall, Azure Data Factory simplifies the process of building and managing data integration pipelines, allowing organizations to easily move, transform, and analyze data from various sources in a scalable, cost-effective way.
25. How do you connect Data Factory with on-premises data sources?
To connect Azure Data Factory with on-premises data sources, you can use the following two methods:
Self-hosted integration runtime: This method involves installing a self-hosted integration runtime (IR) on an on-premises machine, which acts as a gateway
between your on-premises data sources and Azure Data Factory. Once the self-hosted IR is set up, you can use it to connect to your on-premises data
sources and move data between them and Azure.
Azure Data Gateway: This method uses the Azure Data Gateway to connect to on-premises data sources. The Azure Data Gateway is a cloud-based service
that provides secure access to on-premises data sources, allowing you to move data between them and Azure Data Factory.
To set up the connection, you would need to follow these general steps:
Create a linked service in Azure Data Factory for the on-premises data source you want to connect to.
For the self-hosted IR method, download and install the self-hosted IR on an on-premises machine, and register it with Azure Data Factory.
For the Azure Data Gateway method, download and install the gateway, and register it with Azure Data Factory.
Configure the linked service to use the self-hosted IR or Azure Data Gateway.
Use the linked service in your data pipeline to move data between the on-premises data source and Azure.
26. Did you work with custom activity / transformation in data factory?
There are two types of activities that you can use in an Azure Data Factory or Synapse pipeline.
Data movement activities to move data between supported source and sink data stores.
Data transformation activities to transform data using compute services such as Azure HDInsight and Azure Batch.
To move data to/from a data store that the service does not support, or to transform/process data in a way that isn't supported by the service, you can create
a Custom activity with your own data movement or transformation logic and use the activity in a pipeline. The custom activity runs your customized code logic
on an Azure Batch pool of virtual machines.
29. Can you write a custom activity / transformation in Data Factory using Python? Pandas? Spark?
Use a Databricks notebook for any kind of custom transformation, using SQL, Python, PySpark, or Scala.
Azure Data Lake Storage Gen2 is a set of capabilities dedicated to big data analytics, built on Azure Blob Storage.
Data Lake Storage Gen2 converges the capabilities of Azure Data Lake Storage Gen1 with Azure Blob Storage. For example, Data Lake Storage Gen2
provides file system semantics, file-level security, and scale. Because these capabilities are built on Blob storage, you'll also get low-cost, tiered storage, with
high availability/disaster recovery capabilities.
33. What are event driven data solutions?
An event-driven architecture consists of event producers that generate a stream of events, and event consumers that listen for the events.
An event driven architecture can use a publish/subscribe (also called pub/sub) model or an event stream model.
Pub/sub: The messaging infrastructure keeps track of subscriptions. When an event is published, it sends the event to each subscriber. After an event is
received, it cannot be replayed, and new subscribers do not see the event.
Event streaming: Events are written to a log. Events are strictly ordered (within a partition) and durable. Clients don't subscribe to the stream, instead a client
can read from any part of the stream. The client is responsible for advancing its position in the stream. That means a client can join at any time, and can replay
events.
A Kafka cluster can have multiple topics. Similarly, an Event Hubs namespace can have multiple Event Hubs: an Azure Event Hubs namespace is a logical container that can hold multiple Event Hub instances. So, Kafka and Event Hubs are similar in this respect.
Copy and transform data from and to a REST endpoint by using Azure Data Factory.
We can use the two activities below:
1) Copy activity with a REST dataset.
2) Web activity with a REST API URL, using the GET, PUT, or POST methods.
Web Activity can be used to call a custom REST endpoint from an Azure Data Factory or Synapse pipeline. You can pass datasets and linked services to be
consumed and accessed by the activity.
Batch processing is a technique for automating and processing multiple transactions as a single group. Batch processing helps in handling tasks like payroll,
end-of-month reconciliation, or settling trades overnight.
Batch data processing is a method of processing data in which data is collected over a period of time and then processed as a group (or "batch") rather than
in real-time. This involves storing the data in a file or database until there is enough data to process, and then running a job or script to process the data in
bulk. Batch processing is commonly used for large-scale data processing and analysis, such as data warehousing, ETL (extract, transform, load), and report
generation.
Azure Stream Analytics: a fully managed real-time analytics service that can process streaming data from various sources and deliver insights in real-time.
Azure Event Hubs: a highly scalable data streaming platform that can collect and process millions of events per second from various sources.
Azure IoT Hub: a cloud-based platform that can connect, monitor, and manage IoT devices and process real-time device telemetry data.
Azure Notification Hubs: a service that enables push notifications to be sent to mobile and web applications at scale.
Azure Media Services: a platform that provides cloud-based media processing and delivery services for streaming video and audio content.
Azure Data Explorer: a fast and highly scalable data exploration and analytics service that can process and analyze large volumes of streaming data in real-
time.
40. What is Spark SQL?
Spark SQL is a component of the Apache Spark open-source big data processing framework that enables developers to run SQL-like queries on large
datasets stored in distributed storage systems like Hadoop Distributed File System (HDFS) and Apache Cassandra. It provides a programming interface to
work with structured data using SQL queries, DataFrame APIs, and Datasets APIs.
Spark SQL allows users to combine the benefits of both relational and procedural programming paradigms to work with data in a distributed environment. It also provides support for reading and writing data in various file formats such as Parquet, ORC, Avro, JSON, and CSV.
Spark SQL includes an optimizer that can optimize SQL queries to improve query performance by pushing down filters, aggregations, and other operations to the data source. This optimization enables Spark SQL to process large datasets efficiently in a distributed environment.
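As a small illustration, a DataFrame can be registered as a temporary view and queried with Spark SQL (the file path and column names below are assumptions for the example):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("spark_sql_demo").getOrCreate()

    # Load a Parquet dataset and expose it to Spark SQL as a temporary view
    orders = spark.read.parquet("/mnt/data/orders")   # path is illustrative
    orders.createOrReplaceTempView("orders")

    # Run a SQL query; filters and aggregations can be pushed down by the optimizer
    result = spark.sql("""
        SELECT country, COUNT(*) AS order_count
        FROM orders
        WHERE amount > 100
        GROUP BY country
    """)
    result.show()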
41. How do you convert a Pandas DataFrame to a Spark DataFrame and vice versa?
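A common approach is spark.createDataFrame() for Pandas-to-Spark and DataFrame.toPandas() for the reverse; a minimal sketch (the sample data is made up):

    import pandas as pd
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("pandas_spark").getOrCreate()

    # Pandas DataFrame -> Spark DataFrame
    pdf = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
    sdf = spark.createDataFrame(pdf)

    # Spark DataFrame -> Pandas DataFrame
    # (collects all data to the driver, so only do this for data that fits in memory)
    pdf_back = sdf.toPandas()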
42. How do you slice a data file into 10 smaller files (a large CSV file with 10 million lines, split into 1 million lines each) using Pandas?
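One approach with Pandas is to read the large CSV in chunks of one million rows and write each chunk to its own file. A minimal sketch (file names are illustrative):

    import pandas as pd

    chunk_size = 1_000_000                                         # 1 million rows per output file
    reader = pd.read_csv("large_file.csv", chunksize=chunk_size)   # file name is illustrative

    for i, chunk in enumerate(reader, start=1):
        # Each chunk is an ordinary DataFrame; write it out as its own CSV
        chunk.to_csv(f"slice_{i}.csv", index=False)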
43. How do you slice a data file into 10 smaller files (a large CSV file with 10 million lines, split into 1 million lines each) using Spark?
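With Spark, a common approach is to repartition the DataFrame into 10 roughly equal partitions and let Spark write one part file per partition (paths are illustrative; Spark chooses the part-file names itself):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("split_csv").getOrCreate()

    df = spark.read.csv("large_file.csv", header=True)   # path is illustrative

    # Repartition into 10 roughly equal partitions; each partition is written as one part file
    df.repartition(10).write.format("csv").option("header", "true").mode("overwrite").save("sliced_output/")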
44. What is Databricks? Why is it required? What Databricks runtimes have you used?
Databricks Runtime is a managed computing environment for Apache Spark, which is an open-source distributed computing framework for big data
processing. Databricks Runtime is optimized for running Spark-based workloads in a cloud-based environment and includes pre-configured clusters, drivers, and tools that make it easy to set up and manage Spark applications. It provides a unified platform for data engineers, data scientists, and business analysts to collaborate on big data processing and analytics tasks. Databricks Runtime is part of the Databricks Unified Analytics Platform, which also includes data
integration, machine learning, and visualization tools.
Databricks provides a Docker image for the Databricks Runtime environment. The image can be used to run Databricks workloads locally or in a containerized
environment. The image can be obtained from Docker Hub or built from source using the Databricks open-source repository on GitHub.
45. What is a cluster in Databricks? What are the different cluster types available in Databricks?
A Databricks cluster is a managed computing environment that allows users to run distributed data processing workloads on the Databricks platform. It is a
group of virtual machines that are provisioned and configured to work together to execute distributed data processing tasks, such as data ingestion, transformation, machine learning, and deep learning. The cluster resources, such as the number and type of virtual machines, can be adjusted based on the workload requirements, and users can choose from various cluster configurations to optimize performance and cost. Databricks clusters are typically used to
process large volumes of data and to train and deploy machine learning models at scale.
Databricks offers two types of compute specifically designed for running batch workloads: All-purpose compute and Job compute.
All-purpose clusters: These clusters are designed to run long-running batch jobs and are optimized for high availability, fault tolerance, and resource
isolation. They can be created and managed using the Databricks UI or API and can be used to run data processing jobs, ETL pipelines, and machine learning
workflows.
Job clusters: Job clusters are a type of ephemeral cluster that are created on-demand to run a specific job and are terminated automatically once the job is
complete. Job clusters are optimized for cost and performance, as they are created with the minimum required resources to run the job. They are typically
used for running ad-hoc or one-time batch jobs, such as data transformations or model training.
Standard clusters: These are the most common type of cluster and are used for general-purpose data processing and analytics workloads. Standard clusters
are highly customizable and can be configured with various virtual machine types, network settings, and storage options.
High Concurrency clusters: These clusters are optimized for running interactive workloads and serving multiple users concurrently. They are designed to
handle small to medium-sized queries and are highly scalable, allowing users to increase or decrease the cluster size based on demand.
GPU clusters: These clusters are used for running deep learning workloads and training machine learning models that require high-performance GPUs. GPU
clusters can be configured with different types of GPUs, such as NVIDIA V100, P100, and K80, and are optimized for running TensorFlow, PyTorch, and other
deep learning frameworks.
Serverless clusters: These clusters are designed to provide a highly scalable and cost-effective computing environment for ad-hoc workloads and bursty
data processing. Serverless clusters automatically scale up or down based on the workload demand and are charged based on the actual usage.
Kubernetes clusters: These clusters allow users to run Databricks workloads on a Kubernetes cluster, giving them more control over the cluster environment
and enabling them to leverage Kubernetes features such as auto-scaling, load balancing, and rolling updates.
Standard mode clusters are now called No Isolation Shared access mode clusters.
High Concurrency with Tables ACLs are now called Shared access mode clusters.
46. How do you connect (Mount) Data lake storage with Databricks?
47. How do you enable SFTP service into your Azure data lake storage?
48. How do you securely mount your data from ADLS to Databricks?
Mount ADLS Gen1 or Gen2 to Databricks: To mount ADLS to Databricks, you can use the Databricks UI or API. You will need to provide the ADLS account
name and key or an Azure Active Directory token, along with the mount point and configuration options. This will create a virtual filesystem on Databricks that
points to the ADLS storage account.
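A minimal sketch of a secure mount using a service principal with OAuth, where the client ID and secret are read from a Databricks secret scope rather than hard-coded. All names below (secret scope, keys, tenant, storage account, container, mount point) are placeholders:

    # Sketch: mount ADLS Gen2 using a service principal via OAuth; all names are placeholders.
    configs = {
        "fs.azure.account.auth.type": "OAuth",
        "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id": dbutils.secrets.get(scope="my-scope", key="sp-client-id"),
        "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="my-scope", key="sp-client-secret"),
        "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
    }

    dbutils.fs.mount(
        source="abfss://<container>@<storage-account>.dfs.core.windows.net/",
        mount_point="/mnt/datalake",
        extra_configs=configs,
    )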
Read data from ADLS: Once the ADLS account is mounted, you can read data from it using the file APIs in Databricks, such as spark.read or dbutils.fs. For example, you can read a CSV file from ADLS using the following code:
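A minimal sketch (the mount point and file path are placeholders):

    # Read a CSV file from the mounted ADLS path into a Spark DataFrame
    df = spark.read.csv("/mnt/datalake/raw/sales.csv", header=True, inferSchema=True)
    df.show(5)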
Write data to ADLS: To write data to ADLS, you can use the same file APIs, but specify the ADLS mount point as the output directory. For example, to write a DataFrame to a Parquet file on ADLS, you can use the following code:
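A minimal sketch (the output path under the mount point is a placeholder):

    # Write the DataFrame to ADLS in Parquet format via the mount point
    df.write.format("parquet").mode("overwrite").save("/mnt/datalake/curated/sales_parquet")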
50. What are the different ways to schedule a data engineering job in Azure?
There are several ways to schedule a data engineering job in Azure, depending on the requirements and use case:
Azure Data Factory: Azure Data Factory (ADF) is a cloud-based data integration service that allows you to create, schedule, and orchestrate data pipelines.
ADF provides a visual interface for building pipelines using drag-and-drop components, as well as code-based pipeline definition using Azure Resource Manager (ARM) templates. ADF also supports various data sources and destinations, including cloud and on-premises databases, files, and big data stores.
Azure Logic Apps: Azure Logic Apps is a cloud-based service that allows you to create and schedule workflows that integrate with various systems and services. Logic Apps provides a visual workflow designer that allows you to create workflows using pre-built connectors and custom code. Logic Apps can
integrate with Azure services, as well as third-party services, such as Salesforce, Slack, and Twilio.
Azure Functions: Azure Functions is a serverless compute service that allows you to run event-driven code in response to various triggers. Functions can be
used to run data processing or data integration code on a schedule or in response to an event, such as a file upload or a message in a queue. Functions can
be written in several languages, including C#, Python, and JavaScript.
Azure Batch: Azure Batch is a cloud-based job scheduling and compute management service that allows you to run large-scale parallel and high-
performance computing (HPC) workloads. Batch provides a managed environment for running jobs on clusters of virtual machines, with support for job
scheduling, job dependencies, and scaling. Batch can be used to run data processing or machine learning workloads on a large scale.
Azure Kubernetes Service: Azure Kubernetes Service (AKS) is a managed Kubernetes service that allows you to deploy and manage containerized
applications and services. AKS provides a scalable and highly available environment for running batch jobs or data processing workloads using containerized
applications. AKS can be integrated with Azure services, such as Azure Container Registry, for a seamless end-to-end experience.
Azure VM cron job: In Azure, you can schedule a cron job on a virtual machine (VM) using the built-in Linux cron service.
SSH into your VM: Open a terminal and SSH into your VM using the ssh command and your VM's public IP address or DNS name.
Open the cron configuration file: Once you're connected to the VM, open the cron configuration file (typically with crontab -e) and add your scheduled entry.
Databricks Workflow jobs: These are used to automate and schedule data processing and analysis tasks. Here are some key features of Databricks jobs:
Scheduling: You can create jobs to run on a schedule, such as daily, weekly, or monthly. You can also specify the start time and end time for the job and the
time zone in which it should run.
Recurrence: You can set the job to recur at a specific interval, such as every 15 minutes, every hour, or every day.
51. Explain some interesting Spark SQL functions you worked with?
coalesce
coalesce(expr1, expr2, ...) - Returns the first non-null argument if it exists. Otherwise, null.
nvl
nvl(expr1, expr2) - Returns expr2 if expr1 is null, or expr1 otherwise
SELECT nvl(NULL, array('2'));
to_date
to_date(date_str[, fmt]) - Parses the date_str expression with the fmt expression to a date. Returns null with invalid input. By default, it follows casting rules to
a date if the fmt is omitted.
months_between
months_between(timestamp1, timestamp2[, roundOff]) - If timestamp1 is later than timestamp2, then the result is positive. If timestamp1 and timestamp2 are on the same day of month, or both are the last day of month, time of day will be ignored. Otherwise, the difference is calculated based on 31 days per month, and rounded to 8 digits unless roundOff=false.
SELECT months_between('1997-02-28 10:30:00', '1996-10-30');
trim
trim(str) - Removes the leading and trailing space characters from str.
52. How do you read the filename while reading data from ADLS into Databricks?
To read the filename while reading data from ADLS into Databricks, you can use the input_file_name() function in PySpark. Here's an example code snippet:
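A minimal sketch (the source folder is illustrative):

    from pyspark.sql.functions import input_file_name

    # Read all CSV files under a folder and add a column holding each row's source file
    df = (
        spark.read.csv("/mnt/datalake/raw/", header=True)
             .withColumn("source_file", input_file_name())
    )
    df.select("source_file").distinct().show(truncate=False)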
A Delta table is a table that is stored in Delta Lake format. Delta tables can be queried using standard SQL commands, and can be accessed by a variety of
different data processing frameworks and tools, including Apache Spark, Python, R, SQL, and machine learning frameworks like TensorFlow and Scikit-Learn. Delta tables can be used for both batch and streaming workloads, and support a variety of data sources and file formats, including Parquet, CSV, JSON, and
Avro. With Delta Lake, users can easily build data pipelines, perform data engineering tasks, and build analytics applications with high reliability and
performance.
54. How do you read the schema from a particular file using Spark / Pandas?
To read the schema from a file using Spark, you can use the printSchema() method of a DataFrame object. This method prints the schema of the DataFrame in
a tree format, showing the data types and structure of each column. Here's an example using PySpark:
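A minimal sketch (paths and file names are illustrative); the rough Pandas equivalent is the dtypes attribute:

    # Spark: read a file and print its inferred schema as a tree
    df = spark.read.parquet("/mnt/datalake/curated/sales_parquet")
    df.printSchema()

    # Pandas: the dtypes attribute gives the column names and data types
    import pandas as pd
    pdf = pd.read_csv("sales.csv")
    print(pdf.dtypes)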
Delta Lake provides several advantages over traditional data storage solutions for big data analytics. Here are some of the key advantages of using Delta
Lake:
ACID transactions: Delta Lake provides ACID transactions, which ensure that data is processed reliably and without conflicts. This helps to eliminate
common data integrity issues that can arise in distributed environments.
Data versioning: Delta Lake keeps track of every change that is made to the data, allowing users to revert to earlier versions of the data if necessary. This
makes it easy to recover from data corruption or accidental data loss.
Schema enforcement: Delta Lake enforces schema on write, which ensures that all data written to the data lake conforms to a consistent schema. This helps
to eliminate data quality issues and makes it easier to manage and analyze data.
Query optimization: Delta Lake provides features like data skipping, predicate pushdown, and Z-ordering that improve query performance by reducing the
amount of data that needs to be scanned.
Stream processing: Delta Lake supports both batch and streaming workloads, allowing users to process data in real-time and perform near-real-time
analysis.
Open source: Delta Lake is an open-source project that is supported by a large and active community. This makes it easy to get help and support, and to
contribute to the development of the technology.
Overall, Delta Lake provides a reliable, scalable, and performant data storage solution for big data analytics, making it an attractive choice for many
organizations.
56. How do you list all databases and all tables in Databricks?
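One way to do this is with SQL commands or the Spark catalog API; a minimal sketch:

    # SQL approach
    spark.sql("SHOW DATABASES").show()
    spark.sql("SHOW TABLES IN default").show()

    # Catalog API approach: loop over every database and list its tables
    for db in spark.catalog.listDatabases():
        for tbl in spark.catalog.listTables(db.name):
            print(db.name, tbl.name)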
You can use the language magic command %<language> at the beginning of a cell. The supported magic
commands are: %python, %r, %scala, and %sql.
%fs: Access the Databricks File System (DBFS) and interact with files.
%sh: Run shell commands in a notebook cell.
%md: Write markdown in a notebook cell.
%sql: Execute SQL queries against tables in a database.
%python: Switch to Python language mode in a notebook cell.
%scala: Switch to Scala language mode in a notebook cell.
%r: Switch to R language mode in a notebook cell.
%run: Run a notebook, passing arguments if necessary.
%pip: Install Python packages using pip.
%conda: Install Python packages using conda.
%load: Load external code into a notebook cell.
%lsmagic: List all available magic commands.
58. How do you connect Databricks with a SQL datastore?
Azure Databricks supports connecting to external databases using JDBC. The basic syntax for configuring and using these connections can be written in Python, SQL, or Scala.
Read data with JDBC
You must configure a number of settings to read data using JDBC. Note that each database uses a different format for the <jdbc_url>.
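A minimal PySpark sketch of a JDBC read against Azure SQL Database; the JDBC URL, table name, and secret scope/key names are placeholders, and the exact <jdbc_url> format depends on the database:

    # Example: read a table from Azure SQL Database over JDBC (all connection details are placeholders)
    jdbc_url = "jdbc:sqlserver://<server-name>.database.windows.net:1433;database=<db-name>"

    df = (
        spark.read.format("jdbc")
             .option("url", jdbc_url)
             .option("dbtable", "dbo.customers")
             .option("user", dbutils.secrets.get(scope="my-scope", key="sql-user"))
             .option("password", dbutils.secrets.get(scope="my-scope", key="sql-password"))
             .load()
    )
    df.show(5)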
59. How do you connect Azure Data Factory with Azure Databricks securely?
Create an Azure Databricks linked service in ADF to connect to and integrate Databricks in Data Factory pipelines.
Generate an ADB token in User Settings and use token-based authentication in the ADF linked service.
Create a new pipeline and use the Databricks Notebook activity in the pipeline activities.
Add a Databricks notebook to the pipeline by expanding the "Databricks" activity group, then dragging and dropping a Databricks Notebook activity onto the pipeline design canvas.
60. What is Azure Purview?
Microsoft Purview provides a unified data governance solution to help manage and govern your on-premises, multicloud, and software as a service (SaaS) data. Easily create a holistic, up-to-date map of your data landscape with automated data discovery, sensitive data classification, and end-to-end data lineage. Enable data consumers to access valuable, trustworthy data.
Azure Purview is a cloud-based data governance solution from Microsoft that helps organizations discover, manage, and secure their data assets across on-
premises, multi-cloud, and SaaS environments. It provides a unified view of an organization's data landscape, enabling users to understand their data assets
and their relationships, and to discover new data sources and insights.
Azure Purview is designed to help organizations address the challenges of data discovery, cataloging, and governance. It includes features such as automated
data discovery and classification, data lineage and impact analysis, metadata management, data cataloging, and policy enforcement. These features help
organizations ensure the accuracy, completeness, consistency, and security of their data assets throughout their lifecycle.
Azure Purview also provides integration with other Microsoft cloud services, such as Azure Data Factory, Azure Databricks, and Azure Synapse Analytics, as
well as with third-party services, enabling organizations to manage and govern their data assets across a wide range of environments and tools.
Overall, Azure Purview is a comprehensive solution for data governance and management that helps organizations gain greater visibility and control over their
data assets, while also improving compliance and reducing risks associated with data management.
Data governance is concerned with ensuring that the right people have access to the right data at the right time, and that data is used in a responsible and
ethical way. It also involves managing data quality, metadata, data standards, and data security.
The goals of data governance include improving the accuracy and consistency of data, ensuring compliance with regulations and standards, reducing risks
associated with data management, and maximizing the value of an organization's data assets.
Data governance is typically managed by a dedicated team or department within an organization, which is responsible for defining policies and procedures,
monitoring compliance, and enforcing standards. This team works closely with other departments, such as IT, legal, and business operations, to ensure that
data governance is integrated throughout the organization.
Data lineage typically includes information about the data's origins, its processing and transformations, and any other data that it may be related to or depend
on. It can also include metadata about the data, such as its format, structure, and quality.
Data lineage is important for a number of reasons. It helps to ensure data accuracy and consistency, by identifying any potential sources of errors or
discrepancies in the data. It also helps to meet regulatory and compliance requirements, by providing a clear audit trail of data usage and handling.
Additionally, it helps to improve data governance and management, by providing a better understanding of how data is used within an organization, and who
has access to it.
64. How do you store REST API data in Databricks Delta Lake?
You can use Python to read REST API data and store it in JSON format.
Then read the JSON data into a DataFrame and write the target table as a Delta table using the DataFrame write API.
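A minimal sketch of that flow; the API URL, the response shape (a list of JSON objects), and the target table name are assumptions for illustration:

    import json
    import requests

    # 1) Call the REST API and collect the JSON response (URL is illustrative)
    response = requests.get("https://api.example.com/v1/orders")
    records = response.json()                       # assumed to be a list of JSON objects

    # 2) Parallelize the JSON strings and let Spark infer the schema
    rdd = spark.sparkContext.parallelize([json.dumps(r) for r in records])
    df = spark.read.json(rdd)

    # 3) Write the DataFrame out as a Delta table (schema/table name is a placeholder)
    df.write.format("delta").mode("overwrite").saveAsTable("bronze.orders")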
66. Can you use Databricks Delta as an operational store? For example, for an ordering system or a real-time booking system?
Databricks Delta is primarily designed for OLAP (Online Analytical Processing) workloads, which involve analyzing large datasets to gain insights and make
informed business decisions. OLAP workloads typically involve complex queries that aggregate and summarize data, and require fast access to large volumes
of data.
While Delta can be used for OLTP (Online Transaction Processing) workloads as well, it is not optimized for this type of workload. OLTP workloads involve
managing high volumes of small transactions, typically involving the insertion, deletion, and updating of individual records. These workloads require high
throughput, low latency, and high concurrency, and typically involve relatively small datasets.
While Delta does provide transactional capabilities, including ACID transactions and data versioning, its design and performance characteristics are better
suited for OLAP workloads, which involve larger datasets and more complex queries.
That being said, Delta can be used in combination with other tools and platforms to support OLTP workloads. For example, you could use Delta to store and
manage historical data, and use a separate database or data store to handle real-time transactional processing.
Overall, Databricks Delta is a powerful and flexible platform for managing and processing large datasets, and can be a great choice for OLAP workloads that
require fast access to large volumes of data.
CSV/JSON datasources use the pattern string for parsing and formatting datetime content.
Datetime functions convert StringType to/from DateType or TimestampType. Examples include unix_timestamp, date_format, to_unix_timestamp, from_unixtime, to_date, to_timestamp, from_utc_timestamp, and to_utc_timestamp.
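A quick Spark SQL sketch of a few of these functions (the literal values are made up for illustration):

    spark.sql("""
        SELECT
            to_date('2023-05-01', 'yyyy-MM-dd')                   AS as_date,
            date_format(current_timestamp(), 'yyyy-MM-dd HH:mm')  AS formatted_now,
            unix_timestamp('2023-05-01 10:30:00')                 AS as_unix_seconds,
            from_unixtime(1682937000)                             AS from_unix
    """).show(truncate=False)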
69. Which is best to use within Databricks: Python, Scala, or SQL?
The choice of programming language to use in Databricks depends on a variety of factors, including the nature of the data and the processing that you need
to perform, as well as your personal preferences and expertise.
In general, Databricks supports several programming languages, including Python, Scala, SQL, R, and Java. Each language has its own strengths and
weaknesses, and may be better suited to certain types of tasks.
Here are a few factors to consider when deciding which language to use in Databricks:
Data types and processing: Python is a popular choice for data analysis and machine learning tasks, as it has a large number of libraries and tools for these
tasks, including NumPy, Pandas, and Scikit-learn. Scala, on the other hand, is a good choice for tasks that require high-performance data processing, such as
streaming or distributed computing. SQL is best for tasks that require querying and processing data stored in databases or data warehouses.
Integration with Spark: Scala is a native language for Apache Spark, and is often used for developing Spark applications. Python, on the other hand, has a
Spark API that allows you to write Spark applications using Python. SQL is used to express relational queries in Spark SQL, which can be used to process
large-scale data sets.
Team skills and preferences: The choice of language may also depend on the skills and preferences of your team. If your team has more experience with
Python, it may be more efficient to use Python. If your team has more experience with SQL, it may be more efficient to use SQL.
Overall, the best language to use in Databricks depends on your specific use case and the nature of the data and processing that you need to perform. It's
often a good idea to experiment with different languages to see which one works best for your needs. Databricks provides an environment that supports
multiple languages and makes it easy to switch between them, so you can choose the language that is most appropriate for each task.
70. What are various data storage / processing services in Azure?
Azure offers a variety of data storage and processing services, including:
Azure Blob Storage: A massively scalable object storage service that can store large amounts of unstructured data such as text, images, and videos.
Azure Data Lake Storage: A highly scalable and secure data lake service that allows you to store and analyze large amounts of data.
Azure SQL Database: A fully managed relational database service that offers high availability, security, and scalability.
Azure Cosmos DB: A globally distributed, multi-model database service that supports NoSQL data models, including key-value, graph, and document.
Azure HDInsight: A fully managed cloud service that makes it easy to process big data using popular open-source frameworks such as Hadoop, Spark, Hive,
and HBase.
Azure Stream Analytics: A real-time data processing service that allows you to analyze and gain insights from streaming data.
Azure Databricks: A collaborative, cloud-based platform for data engineering, machine learning, and analytics that is based on Apache Spark.
Azure Synapse Analytics: An analytics service that allows you to analyze large amounts of structured and unstructured data using both serverless and
provisioned resources.
Azure Machine Learning: A cloud-based machine learning service that allows you to build, train, and deploy machine learning models.
Azure Cognitive Search: A fully managed search-as-a-service that allows you to add search capabilities to your applications using natural language
processing and machine learning.
Globally Distributed: With Azure regions spread out globally, the data can be replicated globally.
Scalability: Cosmos DB is horizontally scalable to support hundreds of millions of reads and writes per second.
Schema-Agnostic Indexing: This enables the automatic indexing of data without schema and index management.
Multi-Model: It can store data in Key-value Pairs, Document-based, Graph-based, Column Family-based databases. Global distribution, horizontal
partitioning, and automatic indexing capabilities are the same irrespective of the data model.
High Availability: It has 99.99 % availability for reads and writes for both multi region and single region Azure Cosmos DB accounts.
Low Latency: The global availability of Azure regions allows for the global distribution of data, which further makes it available nearest to the customers. This
reduces the latency in retrieving data.
73. What is delta time travel? did you work with delta time travel? explain where you used it
Delta Time Travel is a feature of Databricks Delta that allows users to access and query previous versions of a Delta table. With Delta Time Travel, users can
query a table as it appeared at any point in time, without having to create and manage multiple versions of the table manually.
To use Delta Time Travel, you can specify a version number or a timestamp when querying a Delta table. For example, you can use the AS OF syntax in a SQL
query to query the table as it appeared at a specific timestamp, like this:
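A minimal sketch (the table name, timestamp, and version number are placeholders):

# Query the table as it appeared at a specific point in time using the AS OF syntax.
df_at_time = spark.sql("SELECT * FROM customers TIMESTAMP AS OF '2023-01-15 00:00:00'")

# Or query a specific version number from the table history.
df_at_version = spark.sql("SELECT * FROM customers VERSION AS OF 5")

# The DataFrame reader exposes the same capability through options.
df_older = spark.read.option("timestampAsOf", "2023-01-15").table("customers")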
I have used Delta Time Travel in a project where we needed to maintain a history of changes to a customer database, so we could track changes over time
and understand how the data had evolved. We used Delta Time Travel to query the database as it appeared at different points in time, and to compare
versions of the data to identify changes and trends. This allowed us to gain insights into the data and make better decisions based on historical trends.
74. What is the difference between OLAP and OLTP?
Purpose: OLAP is designed for complex, ad-hoc queries that involve aggregation and summarization of large data sets, and is optimized for data analysis and
reporting. OLTP, on the other hand, is designed for managing transactional data, such as sales, orders, and inventory, and is optimized for fast, efficient data
processing.
Data model: OLAP databases typically use a multidimensional model, where data is organized into dimensions and measures. This allows users to slice and
dice data from different angles and explore relationships between different data elements. OLTP databases, on the other hand, typically use a normalized data
model, where data is organized into tables that represent entities and relationships between them.
Query complexity: OLAP queries tend to be more complex than OLTP queries, as they involve more aggregation, grouping, and filtering of data. OLTP queries,
on the other hand, tend to be simpler, as they typically involve selecting, inserting, updating, or deleting individual records.
Performance requirements: OLAP systems are designed to handle large, complex queries that may involve scanning large amounts of data, so performance is
optimized for read-heavy workloads. OLTP systems, on the other hand, are optimized for write-heavy workloads, with fast response times for individual
transactions.
In summary, OLAP is designed for complex data analysis and reporting, while OLTP is designed for efficient transaction processing. While some systems may
incorporate elements of both OLAP and OLTP, the two approaches are fundamentally different and are optimized for different types of workloads.
75. Do you have an option to work with Spark without databricks in Azure?
Yes, you can work with Apache Spark on Azure without using Databricks. Azure provides several services for running Spark workloads, including:
Azure HDInsight: This is a fully-managed cloud service that makes it easy to process big data using popular open-source frameworks such as Hadoop,
Spark, Hive, and HBase. With HDInsight, you can deploy and manage Spark clusters in Azure, and run Spark jobs using familiar tools and languages.
Azure Synapse Analytics: This is an analytics service that allows you to analyze large amounts of structured and unstructured data using both serverless and
provisioned resources. Synapse Analytics includes a Spark pool that allows you to run Spark jobs and Spark SQL queries on large data sets.
Azure Data Factory: This is a cloud-based data integration service that allows you to create and schedule data pipelines that can move and transform data
between various sources and destinations, including Spark clusters.
Azure Kubernetes Service (AKS): This is a fully managed Kubernetes service that allows you to deploy and manage containerized applications and services,
including Spark applications. With AKS, you can deploy Spark clusters as Kubernetes pods and manage them using Kubernetes tools and APIs.
These services provide different ways of running Spark workloads on Azure, depending on your needs and requirements. They offer varying levels of
scalability, performance, and cost, and support different programming languages, data sources, and data processing frameworks.
76. What is Azure Synapse Analytics?
With Synapse Analytics, users can:
Ingest data from various sources, including streaming data, batch data, and big data.
Store data in a scalable and flexible data lake that uses Azure Blob Storage or Azure Data Lake Storage Gen2.
Analyze data using a variety of tools and services, including Apache Spark, Power BI, Azure Machine Learning, and Azure Databricks.
Build data pipelines and workflows to automate data integration and processing using Azure Data Factory.
Use a serverless SQL pool or dedicated SQL pool to run fast, scalable SQL queries on large data sets.
Secure data using advanced security and compliance features, including Azure Active Directory integration, network isolation, and row-level security.
Synapse Analytics also offers an integrated development environment (IDE) called Synapse Studio, which provides a unified workspace for data engineers,
data scientists, and business analysts to collaborate on data-related projects. The IDE includes tools for data preparation, data transformation, data
visualization, and machine learning, as well as a notebook environment for running code and exploring data.
77. How do you get all reporting employees of a manager as a list in Spark SQL?
To get all reporting employees of a manager as a list in Spark SQL, you can use a self-join and a GROUP BY clause. Here's an example SQL query:
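A sketch of such a query, wrapped in spark.sql for use in a notebook (the table and column names are assumptions):

reporting = spark.sql("""
    SELECT e.manager_id,
           collect_list(e.employee_id) AS reporting_employee_ids
    FROM employees e
    JOIN employees m
      ON e.manager_id = m.employee_id
    GROUP BY e.manager_id
""")
reporting.show(truncate=False)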
This query joins the employees table with itself using the manager_id column to match each employee with their manager. The collect_list function is used to
aggregate the employee_id values for each manager into a list. The GROUP BY clause groups the results by manager_id.
This query will return a result set with two columns: manager_id and reporting_employee_ids. The manager_id column contains the ID of each manager, and
the reporting_employee_ids column contains a list of the IDs of all the employees reporting to that manager. You can further process this result set in Spark
SQL or other Spark APIs to generate the desired output format.
78. How do you loop through all records of a delta table using SQL and Python?
You can loop through all records of a Delta table using SQL and Python by using the Delta Lake pySpark API in a Python script.
Here's an example script:
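A minimal sketch (the table and column names are placeholders):

from delta.tables import DeltaTable

# Load the Delta table through the Delta Lake Python API.
df = DeltaTable.forName(spark, "customers").toDF()

# collect() pulls every row back to the driver, so this pattern suits small tables.
for row in df.collect():
    print(row["customer_id"], row["customer_name"])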
The script loads the table as a DataFrame, loops through each row using the df.collect() method, and prints the values of each row.
Alternatively, you can use Spark SQL to loop through all records of a Delta table. Here's an example SQL query:
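For example, again wrapped in spark.sql (the table name is a placeholder):

for row in spark.sql("SELECT * FROM customers").collect():
    print(row)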
80. How do you convert a DataFrame to a GraphFrame?
To convert a DataFrame to a GraphFrame in Databricks, you can use the GraphFrame.fromEdges method to create a GraphFrame from a DataFrame that
represents the edges of the graph. Here's an example of how to do this:
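A minimal PySpark sketch, assuming the graphframes library is installed on the cluster and using illustrative edge values; the vertex DataFrame is derived from the edge endpoints, which is the step the fromEdges shortcut performs automatically where it is available:

from graphframes import GraphFrame

# Edges of the graph: one row per relationship.
edges = spark.createDataFrame([("a", "b"), ("b", "c"), ("c", "a")], ["src", "dst"])

# Build a vertex DataFrame with an "id" column from the edge endpoints.
vertices = (edges.selectExpr("src AS id")
                 .union(edges.selectExpr("dst AS id"))
                 .distinct())

g = GraphFrame(vertices, edges)
display(g.vertices)
display(g.edges)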
In this example, the createDataFrame method is used to create a DataFrame edges that contains two columns, src and dst, which represent the edges of the
graph. A vertex DataFrame is derived from the edge endpoints (the fromEdges helper performs this step automatically), and a GraphFrame g is then created
from the vertices and edges. Finally, the display method is used to show the vertices and edges of the graph.
Note that before you can convert a DataFrame to a GraphFrame, you need to make sure that the DataFrame has the correct schema and contains the
appropriate columns to represent the vertices and edges of the graph. In addition, you may need to perform additional transformations on the DataFrame to
prepare it for use as a graph, such as adding vertex properties or filtering out irrelevant data.
81. How do you stream data from IoT devices in Azure databricks?
Data Ingest - stream real-time raw sensor data from Azure IoT Hubs into the Delta format in Azure Storage
Data Processing - stream process sensor data from raw (Bronze) to silver (aggregated) to gold (enriched) Delta tables on Azure Storage
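A minimal sketch of the ingest (Bronze) step, assuming the azure-event-hubs-spark connector is installed and the IoT Hub's Event Hub-compatible connection string is stored in a secret scope (scope, key, and storage paths are placeholders):

# Connection string for the IoT Hub's Event Hub-compatible endpoint.
conn = dbutils.secrets.get("iot-scope", "iothub-connection-string")

# Recent versions of the connector expect the connection string to be encrypted.
ehConf = {
    "eventhubs.connectionString":
        sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(conn)
}

# Stream raw sensor messages from IoT Hub into a Bronze Delta table on Azure Storage.
raw = (spark.readStream
            .format("eventhubs")
            .options(**ehConf)
            .load())

(raw.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/datalake/checkpoints/iot_bronze")
    .outputMode("append")
    .start("/mnt/datalake/bronze/iot_raw"))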
82. What does the DESCRIBE HISTORY command return for a Delta table?
DESCRIBE HISTORY returns provenance information, including the operation, user, and so on, for each write to a Delta table. Table history is retained for 30 days by default.
83. How do you find data history from Delta tables using a timestamp or version?
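A minimal sketch (the table name, version number, and timestamp are placeholders):

# Inspect the change history of a Delta table.
history = spark.sql("DESCRIBE HISTORY orders")
history.select("version", "timestamp", "operation", "userName").show(truncate=False)

# Read the table as of a specific version or timestamp.
v5  = spark.read.option("versionAsOf", 5).table("orders")
jan = spark.read.option("timestampAsOf", "2023-01-15 00:00:00").table("orders")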
In Spark SQL, the following types of joins can be used to combine data from two or more tables:
Inner Join: Returns only the rows where there is a match between the keys in both tables. Syntax: SELECT ... FROM table1 JOIN table2 ON table1.key =
table2.key
Left Outer Join: Returns all the rows from the left table and the matching rows from the right table. If there is no match in the right table, it returns null values
for the right table's columns. Syntax: SELECT ... FROM table1 LEFT JOIN table2 ON table1.key = table2.key
Right Outer Join: Returns all the rows from the right table and the matching rows from the left table. If there is no match in the left table, it returns null values
for the left table's columns. Syntax: SELECT ... FROM table1 RIGHT JOIN table2 ON table1.key = table2.key
Full Outer Join: Returns all the rows from both tables and fills null values for the columns that do not have a matching key in the other table. Syntax:
SELECT ... FROM table1 FULL OUTER JOIN table2 ON table1.key = table2.key
Left Semi Join: This join returns all the rows from the left table for which there is a match in the right table, and it does not return any columns from the right
table. In other words, it is similar to an inner join, but it only returns the columns from the left table. Syntax: SELECT ... FROM table1 LEFT SEMI JOIN table2
ON table1.key = table2.key
Left Anti Join: This join returns all the rows from the left table for which there is no match in the right table, and it does not return any columns from the right
table. In other words, it returns all the rows from the left table that do not have a corresponding match in the right table. Syntax: SELECT ... FROM table1
LEFT ANTI JOIN table2 ON table1.key = table2.key
A partitioned table is divided into segments, called partitions, that make it easier to manage and query your data. By dividing a large table into smaller
partitions, you can improve query performance and control costs by reducing the number of bytes read by a query. You partition tables by specifying a
partition column which is used to segment the table.
If a query uses a qualifying filter on the value of the partitioning column, BigQuery can scan the partitions that match the filter and skip the remaining
partitions. This process is called pruning.
In a partitioned table, data is stored in physical blocks, each of which holds one partition of data. Each partitioned table maintains various metadata about the
sort properties across all operations that modify it.
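The same pruning idea can be sketched in Delta/Spark terms, which is the engine used elsewhere in this document (paths and column names are illustrative):

# A small illustrative DataFrame with a date column to partition on.
df = spark.createDataFrame(
    [(1, "2023-01-01", 120.0), (2, "2023-01-02", 75.5)],
    ["order_id", "order_date", "amount"])

# Write a Delta table partitioned by the date column.
(df.write
   .format("delta")
   .partitionBy("order_date")
   .mode("overwrite")
   .save("/mnt/datalake/silver/orders"))

# A filter on the partition column lets the engine prune partitions and
# read only the matching folders.
recent = (spark.read.format("delta")
               .load("/mnt/datalake/silver/orders")
               .filter("order_date >= '2023-01-02'"))
recent.show()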
88. Read a text that contains millions of words and find the top 10 words from the entire text, excluding prepositions and articles.
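A minimal PySpark sketch, assuming the text sits in a file in the data lake (the path and the stop-word list are illustrative):

from pyspark.sql import functions as F

# A small illustrative stop list of articles and prepositions to exclude.
stop_words = ["a", "an", "the", "in", "on", "at", "of", "for", "to", "with", "by", "from"]

# Read the text, split it into lowercase words, drop empty tokens and stop words.
lines = spark.read.text("/mnt/datalake/raw/big_text.txt")
words = (lines
         .select(F.explode(F.split(F.lower("value"), "\\W+")).alias("word"))
         .filter((F.col("word") != "") & (~F.col("word").isin(stop_words))))

# Count occurrences and keep the 10 most frequent words.
top10 = words.groupBy("word").count().orderBy(F.desc("count")).limit(10)
top10.show()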
89. How do you find all possible substrings of a given name or string, excluding spaces and special characters? For example, “data” → “d”, “a”, “t”, “da”,
“dt”, “td”, “ta”, “at”, “aa”, …, “data”.
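The wording says substrings, but the sample output also includes re-orderings such as "dt" and "td", so a plain-Python sketch that generates character permutations of every length covers both (the function name is an assumption; the result grows factorially, so it suits short strings):

import re
from itertools import permutations

def all_character_combinations(text):
    # Keep only letters and digits, ignoring spaces and special characters.
    chars = re.sub(r"[^A-Za-z0-9]", "", text)
    results = set()
    for length in range(1, len(chars) + 1):
        for combo in permutations(chars, length):
            results.add("".join(combo))
    return sorted(results, key=lambda s: (len(s), s))

print(all_character_combinations("data"))
# ['a', 'd', 't', 'aa', 'ad', 'at', 'da', 'dt', ..., 'data', ...]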
90. Read data from multiple CSV files of the same schema and display the number of records from each file with the filename; get the file name into a
DataFrame / table storage.
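A minimal sketch using the input_file_name() function (paths and table names are placeholders):

from pyspark.sql import functions as F

# Read every CSV in the folder with the shared schema and capture the source file name.
df = (spark.read
           .option("header", True)
           .csv("/mnt/datalake/raw/sales/*.csv")
           .withColumn("file_name", F.input_file_name()))

# Number of records per source file.
df.groupBy("file_name").count().show(truncate=False)

# Optionally persist the data, file name included, as a Delta table.
df.write.format("delta").mode("overwrite").saveAsTable("sales_with_filename")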
91. There are 10 million records in the table and the schema does not contain the ModifiedDate column. One cell was modified the next day in the table.
How will you fetch that particular information that needs to be loaded into the warehouse?
In Spark, a left anti join is a type of join operation that returns all the rows from the left DataFrame that do not have a matching key in the right DataFrame. In
other words, it returns the rows in the left DataFrame that are not present in the right DataFrame.
The left anti join operation is useful when you want to find the rows in one DataFrame that are not present in another DataFrame based on a common key
column. This operation can be used to filter out the rows in the left DataFrame that do not have a corresponding key in the right DataFrame.
The left anti join operation can be performed using the join method of the DataFrame API, with the how parameter set to "left_anti".
Here's an example:
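A sketch of the pattern, comparing yesterday's and today's snapshots (the table names are placeholders); joining on every column surfaces rows where any single cell changed:

df_old = spark.table("customers_snapshot_yesterday")
df_new = spark.table("customers_snapshot_today")

# Rows from today's snapshot that do not appear identically in yesterday's snapshot,
# i.e. modified rows (plus any newly inserted rows).
changed = df_new.join(df_old, on=df_new.columns, how="left_anti")
changed.show(truncate=False)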
92. Read COVID data from https://data.cdc.gov/ and provide the number of diseased patients per day for the last 5 days.
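A hedged sketch using the Socrata endpoint pattern that data.cdc.gov exposes; the dataset id and the column names below are placeholders to replace with the actual dataset you are pointed at:

import requests
from pyspark.sql import functions as F

# data.cdc.gov datasets follow the https://data.cdc.gov/resource/<dataset-id>.json pattern.
url = "https://data.cdc.gov/resource/<dataset-id>.json"
records = requests.get(url, params={"$limit": 50000}).json()

df = spark.createDataFrame(records)

# Assuming the dataset exposes submission_date and new_case columns,
# aggregate patients per day and keep the most recent 5 days.
daily = (df.withColumn("submission_date", F.to_date("submission_date"))
           .withColumn("new_case", F.col("new_case").cast("int"))
           .groupBy("submission_date")
           .agg(F.sum("new_case").alias("patients_per_day"))
           .orderBy(F.desc("submission_date"))
           .limit(5))
daily.show()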
93. Make a secure connection from Databricks to ADLS, read data from a CSV file, and convert it into a Delta table with an appropriate schema.
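A minimal sketch using a service principal (OAuth) with credentials pulled from a Databricks secret scope; the storage account, container, secret scope, and table names are placeholders:

# Service principal credentials stored in a Databricks secret scope.
client_id     = dbutils.secrets.get("adls-scope", "sp-client-id")
client_secret = dbutils.secrets.get("adls-scope", "sp-client-secret")
tenant_id     = dbutils.secrets.get("adls-scope", "sp-tenant-id")

storage_account = "mydatalake"   # placeholder storage account name

# Configure the ABFS driver for OAuth against the storage account.
spark.conf.set(f"fs.azure.account.auth.type.{storage_account}.dfs.core.windows.net", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{storage_account}.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{storage_account}.dfs.core.windows.net", client_id)
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{storage_account}.dfs.core.windows.net", client_secret)
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{storage_account}.dfs.core.windows.net",
               f"https://login.microsoftonline.com/{tenant_id}/oauth2/token")

# Read the CSV (header plus schema inference as a starting point) and convert it to Delta.
path = f"abfss://raw@{storage_account}.dfs.core.windows.net/customers/customers.csv"
df = (spark.read
           .option("header", True)
           .option("inferSchema", True)
           .csv(path))

df.write.format("delta").mode("overwrite").saveAsTable("customers_delta")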
94. If you are running a large number of pipelines and they are taking a long time to execute, how do you resolve this type of issue?
If you are experiencing slow pipeline execution times in Azure Data Factory, there are several steps you can take to optimize performance:
Use parallelism: If your pipelines are processing large amounts of data, consider using parallelism to split the workload across multiple activities or pipelines.
This can help speed up processing times and reduce the overall time required to complete the pipeline.
Optimize data movement: Data movement can be a bottleneck for pipeline performance, so it's important to optimize the data movement as much as
possible. This could involve compressing data before transfer, using partitioning to move data in smaller chunks, or using a dedicated transfer service like
Azure Data Box.
Optimize data transformation: If your pipelines involve complex data transformation, consider using more efficient data processing technologies such as
Databricks or Synapse Analytics. This can help reduce the time required to transform data and speed up overall pipeline execution.
Optimize infrastructure: Consider upgrading the infrastructure used to run the pipelines. This might involve upgrading the VM size, increasing the number of
nodes in a cluster, or scaling out by adding more worker nodes.
Monitor and optimize: Monitor the performance of your pipelines regularly and use performance metrics to identify areas for improvement. This could involve
using tools like Azure Monitor or third-party monitoring solutions to track metrics such as execution time, data throughput, and resource utilization.
In Azure Data Factory, an Integration Runtime (IR) is a compute infrastructure used to provide data integration capabilities across different network
environments. The AutoResolve Integration Runtime is a type of Integration Runtime that provides the ability to automatically select the most appropriate
Integration Runtime for a given data movement task.
When using the AutoResolve Integration Runtime, you don't need to manually specify which Integration Runtime to use for a specific task. Instead, the Data
Factory service will automatically select the appropriate Integration Runtime based on the source and destination data store types, the connectivity
requirements, and other factors.
This can be useful in scenarios where you have a variety of data stores and different data movement tasks with varying requirements. The AutoResolve
Integration Runtime can help simplify the process of configuring and executing these tasks by automatically selecting the right Integration Runtime for each
one.
Note that there are other types of Integration Runtimes available in Azure Data Factory, including the Azure-SSIS Integration Runtime for running SQL Server
Integration Services (SSIS) packages in the cloud, and the Self-Hosted Integration Runtime for connecting to on-premises data stores.
Schedule Trigger: A Schedule Trigger allows you to run a pipeline on a recurring schedule, such as once a day or once an hour. You can define the frequency
and start time for the trigger, as well as any additional parameters, and the trigger will automatically execute the pipeline at the specified times.
Event-Based Trigger: An Event-Based Trigger allows you to execute a pipeline in response to an event, such as a file being added to a data store, a message
being posted to a queue, or an HTTP request being received. You can configure the trigger to monitor specific events and trigger the pipeline when those
events occur.
Tumbling Window Trigger: A Tumbling Window Trigger allows you to execute a pipeline on a recurring schedule, but with a more complex definition of the
trigger time. You can define a start time and end time for the trigger, as well as the duration of the window, and the trigger will execute the pipeline at the start
of each window.
Each of these trigger types has its own specific use cases and benefits. For example, a Schedule Trigger might be appropriate for a pipeline that needs to run
on a set schedule, while an Event-Based Trigger might be useful for a pipeline that needs to process data in real-time as it's generated. The Tumbling Window
Trigger might be used when you need to process data in regular, contiguous, non-overlapping time windows.
There are several ways to execute pipelines in Azure Data Factory, including:
Trigger-based execution: You can create a trigger that automatically executes a pipeline on a specified schedule or when an event occurs (such as when
new data is added to a data store).
Ad-hoc execution: You can manually execute a pipeline from the Azure Data Factory user interface or programmatically using the REST API or Azure
PowerShell.
Event-based execution: You can use Azure Event Grid to trigger a pipeline when an event occurs in an Azure service such as Blob Storage, Event Hub, or IoT
Hub.
External execution: You can use a third-party scheduling tool or an orchestration tool such as Azure Logic Apps or Azure Functions to execute pipelines.
Continuous integration and delivery (CI/CD): You can use Azure DevOps or another CI/CD tool to automatically build, test, and deploy Data Factory
pipelines.
Overall, these execution options provide a lot of flexibility for running Data Factory pipelines in different scenarios, whether it's on a set schedule, in response
to an event, or manually triggered as needed.
98. What is the best way to copy large data from on-premises to a data lake using ADF?
The performance of a self-hosted integration runtime in Azure Data Factory can be influenced by a variety of factors, including the configuration of the runtime
and the performance of the hardware it is running on. Here are some tips to optimize the performance of a self-hosted integration runtime:
Use a dedicated machine: To optimize performance, use a dedicated machine for the self-hosted integration runtime. Avoid sharing the machine with other
workloads that may affect its performance.
Optimize machine resources: Ensure that the machine used for the self-hosted integration runtime has sufficient CPU, memory, and disk space to handle
the workloads.
Use compression: Compressing the data before transferring it can reduce the amount of data that needs to be transferred, which can result in faster transfer
speeds.
Use multi-part uploads: If the file is very large, consider using multi-part uploads. This allows the file to be split into smaller chunks, which can be transferred
in parallel. This can significantly improve the transfer speed.
Binary copy is a feature in Azure Data Factory that allows you to copy files between different file-based data stores, such as Azure Blob Storage, Azure Data
Lake Storage, and on-premises file systems. Binary copy enables you to copy large files quickly and efficiently, without having to read the entire file into
memory.
A self-hosted integration runtime can be installed on-premises and used to securely transfer data between on-premises data stores and cloud-based data
stores. The self-hosted integration runtime can take advantage of the faster network speeds and reduced latency of an on-premises network.
Use Azure ExpressRoute: Azure ExpressRoute provides a dedicated, private connection between an on-premises network and Azure. This can improve the
performance of data transfer by providing faster and more reliable connectivity.
Use Azure Data Box: If the data is very large, consider using Azure Data Box. Azure Data Box is a physical appliance that can be shipped to the on-premises
location to transfer large amounts of data.
Optimize the on-premises network: Ensure that the on-premises network is optimized for data transfer, with sufficient bandwidth and low latency. Consider
upgrading the network infrastructure if necessary.
99. How to get the latest added file in a folder using Azure Data Factory?
To get the latest added file in a folder using Azure Data Factory, you can use the "Get Metadata" activity with a child item of "Child Items" and sort the results
by creation or modification time. Here are the steps to do this:
Add a "Get Metadata" activity to your pipeline and configure it to connect to the folder you want to monitor.
In the "Get Metadata" activity, select the "Child Items" option as the child item.
Under the "Field List" tab, add the "creationTime" and/or "lastModi ed" elds.
Under the "Field List" tab, click on the "Add dynamic content" button to add an expression that sorts the les by the creation or modi cation time. The
expression should look like this:
This will sort the files in descending order by their creation or modification time.
Finally, add an "If Condition" activity to check if any files were found in the folder. The expression should be:
If the output is empty, you can use a "Set Variable" activity to set a default value. If there are files, you can use a "Set Variable" activity to set the latest file
name and/or path using the expression:
100. What is the meaning of a hierarchical folder structure in data engineering, and how is it used in organizing and managing data within a data lake?
In data engineering, a hierarchical folder structure refers to the organization of data files and folders in a hierarchical or tree-like structure, where files and folders
are arranged in a parent-child relationship. In this structure, each folder can contain one or more sub-folders, and each sub-folder can contain one or more
files or additional sub-folders. This allows for a logical organization of data files and facilitates efficient storage and retrieval of data.
In the context of data lakes, a hierarchical folder structure can be used to organize data files in a way that reflects the different data domains, data sources, or
business units. For example, a data lake may have a top-level folder for each data domain (such as sales, finance, or customer data), and within each domain
folder, there may be sub-folders for each data source (such as ERP systems, CRM systems, or social media platforms).
A hierarchical folder structure can also be used to enforce data governance and access controls, by setting permissions at the folder or file level to control
who can view or modify data.
@TRRaveendra
Learn & Lead