Top 100+ Data Engineer Interview Questions and Answers For 2022
This blog is your one-stop solution for the top 100+ Data Engineer Interview Questions and Answers. In this blog, we have collated the frequently asked data engineer interview questions based on tools and technologies that are highly useful for a data engineer in the Big Data industry. You will also find some interesting data engineer interview questions that have been asked at different companies (like Facebook, Amazon, Walmart, etc.) that leverage big data analytics and tools.
Preparing for data engineer interviews makes even the bravest of us anxious. One good way to stay calm and composed for an interview is to thoroughly prepare answers to questions frequently asked in interviews. If you have an interview for a data engineer role coming up, here are some data engineer interview questions and answers, organized by the skill set required, that you can refer to while preparing to nail your future data engineer interviews.
Table of Contents
Top 100+ Data Engineer Interview Questions and Answers
Data Engineer Interview Questions on Big Data
Data Engineer Interview Questions on SQL
EY Data Engineer Interview Questions
Behavioral Data Engineering Questions
How Data Engineering Helps Businesses? | Why is Data Engineering In Demand?
Get Set Go For Your Interview with ProjectPro's Top Data Engineer Interview Questions
How can I pass a data engineer interview?
What are the roles and responsibilities of a data engineer?
What are the 4 key questions a data engineer is likely to hear during an interview?
Data Engineer Interview Questions on Big Data
Any organization that relies on data must perform big data engineering to stand out from the crowd. But data collection, storage, and large-scale data processing are only the first steps in the complex process of big data analysis. Complex algorithms, specialized professionals, and high-end technologies are required to leverage big data in businesses, and big data engineering ensures that organizations can utilize the power of data.

Below are some big data interview questions for data engineers based on the fundamental concepts of big data, such as data modeling, data analysis, data migration, data processing architecture, data storage, and big data analytics.
Relational Databases | Non-relational Databases
Relational databases primarily work with structured data using SQL (Structured Query Language). SQL works on data arranged in a predefined schema. | Non-relational databases support dynamic schema for unstructured data. Data can be graph-based, column-oriented, document-oriented, or even stored as a key-value store.
RDBMS follow the ACID properties - atomicity, consistency, isolation, and durability. | Non-RDBMS follow Brewer's CAP theorem - consistency, availability, and partition tolerance.
RDBMS are usually vertically scalable. A single server can handle more load by increasing resources such as RAM, CPU, or SSD. | Non-RDBMS are horizontally scalable and can handle more traffic by adding more servers to handle the data.
RDBMS are a better fit if the data requires multi-row transactions to be performed on it, since relational databases are table-oriented. | Non-RDBMS are a better fit for documents without a fixed schema. Since non-RDBMS are horizontally scalable, they can become more powerful and suitable for large or constantly changing datasets.
E.g., PostgreSQL, MySQL, Oracle, Microsoft SQL Server. | E.g., Redis, MongoDB, Cassandra, HBase, Neo4j.
Data Warehouse | Operational Database
A data warehouse is designed for analytical data processing - OLAP. | An operational database is designed for transaction processing, typically OLTP.
You may add new data regularly, but once you add the data, it does not change very frequently. | Data is regularly updated.
Data warehouses are optimized to handle complex queries, which can access multiple rows across many tables. | Operational databases are ideal for queries that return single rows at a time per table.
There is a large amount of data involved. | The amount of data is usually less.
A data warehouse is usually suitable for fast retrieval of data from relatively large volumes of data. | Operational databases are optimized to handle fast inserts and updates on a smaller scale of data.
Veracity: the quality of the data to be analyzed. The data has to be able to contribute in a meaningful way to generate results.
5. Differentiate between Star schema and Snowflake schema.
Star Schema | Snowflake Schema
The star schema is a data warehouse schema that contains the fact tables and the associated dimension tables. | The snowflake schema is a data warehouse schema that contains fact tables, dimension tables, and sub-dimension tables.
The design and understanding are simpler than the Snowflake schema, and the Star schema has low query complexity. | The design and understanding are a little more complex. The Snowflake schema has higher query complexity than the Star schema.
There are fewer foreign keys. | There are many foreign keys.

6. Differentiate between OLTP and OLAP.

OLTP | OLAP
OLTP queries require less transactional time. | OLAP queries require more transactional time.
Tables in OLTP are normalized. | Tables in OLAP are not normalized.
7. What are some differences between a data engineer and a data scientist?

Data engineers and data scientists work very closely together, but there are some differences in their roles and responsibilities.
Data Engineer | Data Scientist
The primary role is to design and implement highly maintainable database management systems. | The primary role of a data scientist is to take the raw data presented to them and apply analytic tools and modeling techniques to analyze the data and provide insights to the business.
Data engineers transform the big data into a structure that one can analyze. | Data scientists perform the actual analysis of Big Data.
Data engineers ensure that the design of the databases meets industry requirements and caters to the business. | Data scientists write statements that can process the data to help the business.
Data engineers have to take care of the safety, security, and backing up of the data, and they work as gatekeepers of the data. | Data scientists should have good data visualization and communication skills to convey the results of their data analysis to various stakeholders.
Proficiency in the field of big data and strong database management skills are required. | Proficiency in machine learning is a requirement.
Both the data scientist and the data engineer roles require professionals with a computer science and engineering background, or a closely related field such as mathematics, statistics, or economics. A sound command over software and programming languages is important for a data scientist and a data engineer. Read more for a detailed comparison between data scientists and data engineers.
8. How is a data architect different from a data engineer?
Data Architect | Data Engineer
Data architects require practical skills with data management tools, including data modeling, ETL tools, and data warehousing. | Data engineers must possess skills in software engineering and be able to maintain and build database management systems.
Data architects help the organization understand how changes in data acquisitions will impact the data in use. | Data engineers take the vision of the data architects and use this to build, maintain, and process the architecture for further use by other data professionals.
9. Differentiate between structured and unstructured data.

Structured Data | Unstructured Data
Structured data usually consists of only text. | Unstructured data can be present in other formats besides text.
It is easy to query structured data and perform further analysis on it. | It is difficult to query the required unstructured data.
Relational databases and data warehouses contain structured data. | Data lakes and non-relational databases can contain unstructured data. A data warehouse can contain unstructured data too.
10. How does Network File System (NFS) differ from Hadoop Distributed File System (HDFS)?

Network File System | Hadoop Distributed File System
The data in an NFS exists on a single dedicated machine. | The data blocks exist in a distributed format across the machines in a cluster.
NFS is not very fault tolerant. In case of a machine failure, you cannot recover the data. | HDFS is fault tolerant, and you may recover the data if one of the nodes fails.
There is no data redundancy, as NFS runs on a single machine. | Due to replication across machines in a cluster, there is data redundancy in HDFS.
Deleting rows with missing values: You simply delete the rows or columns in a table with missing values from
the dataset. You can drop the entire column from the analysis if a column has more than half of the rows with
null values. You can use a similar method for rows with missing values in more than half of the columns. This
method may not work very well in cases where a large number of values are missing.
Using mean/median for missing values: If a column in a dataset has missing values and the column's data type is numeric, you can fill in the missing values using the mean or median of the remaining values in the column.
Imputation method for categorical data: If you can classify the data in a column, you can replace the missing
values with the most frequently used category in that particular column. If more than half of the column values
are empty, you can use a new categorical variable to place the missing values.
Predicting missing values: Regression or classification techniques can predict the values based on the nature
of the missing values.
Last Observation Carried Forward (LOCF) method: The last valid observation can fill in the missing value in data variables that display a longitudinal behavior.
Using algorithms that support missing values: Some algorithms, such as k-NN, can work in the presence of missing values. Another such algorithm is Naive Bayes. The RandomForest algorithm can also work with non-linear and categorical data. (A minimal imputation sketch using pandas follows this list.)
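To make the simpler strategies above concrete, here is a minimal, hedged sketch using pandas; the DataFrame and its 'age' and 'city' columns are made up for illustration and are not from the original discussion.

import pandas as pd
import numpy as np

# hypothetical data with missing values
df = pd.DataFrame({
    "age": [25, np.nan, 31, 40, np.nan],
    "city": ["Delhi", "Mumbai", None, "Delhi", "Delhi"],
})

# 1. drop rows (or columns) that contain missing values
dropped = df.dropna()

# 2. fill a numeric column with its median
df["age"] = df["age"].fillna(df["age"].median())

# 3. fill a categorical column with its most frequent category (mode)
df["city"] = df["city"].fillna(df["city"].mode()[0])

print(df)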
15. What is the difference between the KNN and k-means methods?
The k-means method is an unsupervised learning algorithm used as a clustering technique, whereas the K-
nearest-neighbor is a supervised learning algorithm for classification and regression problems.
KNN algorithm uses feature similarity, whereas the K-means algorithm refers to dividing data points into
clusters so that each data point is placed precisely in one cluster and not across many.
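A minimal sketch contrasting the two with scikit-learn (assuming scikit-learn is installed; the toy data and labels below are purely illustrative):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [9, 8], [8, 9]])
y = np.array([0, 0, 0, 1, 1, 1])  # labels are needed for KNN (supervised)

# KNN: supervised classification using feature similarity to labeled neighbors
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(knn.predict([[2, 2], [9, 9]]))   # predicted class labels

# k-means: unsupervised clustering - no labels, each point gets exactly one cluster
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
print(kmeans.fit_predict(X))           # cluster assignments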
18. What are some biases that can happen while sampling?
Some popular types of bias that occur while sampling are:
Undercoverage- The undercoverage bias occurs when there is an inadequate representation of some
members of a particular population in the sample.
Observer Bias- Observer bias occurs when researchers unintentionally project their expectations on the
research. There may be occurrences where the researcher unintentionally influences surveys or interviews.
Self-Selection Bias- Self-selection bias, also known as volunteer response bias, happens when the research
study participants take control over the decision to participate in the survey. The individuals may be biased and
are likely to share some opinions that are different from those who choose not to participate. In such cases,
the survey will not represent the entire population.
Survivorship Bias- The survivorship bias occurs when a sample is more concentrated on subjects that
passed the selection process or criterion and ignore the subjects who did not pass the selection criteria.
Survivorship biases can lead to overly optimistic results.
Recall Bias- Recall bias occurs when a respondent fails to remember things correctly.
Exclusion Bias- The exclusion bias occurs due to the exclusion of certain groups while building the sample.
20. Explain how Big Data and Hadoop are related to each other.
Apache Hadoop is a collection of open-source libraries for processing large amounts of data. Hadoop supports
distributed computing, where you process data across multiple computers in clusters. Previously, if an organization
had to process large volumes of data, it had to buy expensive hardware. Hadoop has made it possible to shift the
dependency from hardware to achieve high performance, reliability, and fault tolerance through the software itself.
Hadoop can be useful when there is Big Data and insights generated from the Big Data. Hadoop also has robust
community support and is evolving to process, manage, manipulate and visualize Big Data in new ways.
Hadoop Common: This comprises all the tools and libraries typically used by the Hadoop application.
Hadoop Distributed File System (HDFS): When using Hadoop, all data is present in the HDFS, or Hadoop
Distributed File System. It offers an extremely high bandwidth distributed file system.
Hadoop YARN: The Hadoop system uses YARN, or Yet Another Resource Negotiator, to manage resources.
YARN can also be useful for task scheduling.
Hadoop MapReduce: Hadoop MapReduce is a framework that gives users access to large-scale, parallel data processing.
Hadoop is highly scalable. Hadoop can handle any sort of dataset effectively, including structured (MySQL data), semi-structured (XML, JSON), and unstructured data (images and videos).
Hadoop ensures data availability even if one of your systems crashes by copying data across several
DataNodes in a Hadoop cluster.
setup(): This method is mostly used to set up input data variables and cache protocols.
reduce(): This method is called once for each key and is the most crucial component of the entire reducer. (A streaming-style illustration follows below.)
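For intuition only, here is a minimal word-count reducer written for Hadoop Streaming in Python. It is not the Java Reducer API that the methods above belong to, but it illustrates the per-key aggregation that reduce() performs; it assumes the mapper emits tab-separated "word<TAB>1" lines already sorted by key.

import sys

current_word, current_count = None, 0

for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    count = int(count)
    if word == current_word:
        current_count += count                         # same key: keep aggregating
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")  # emit the finished key
        current_word, current_count = word, count

if current_word is not None:
    print(f"{current_word}\t{current_count}")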
Star Schema- The star schema is the most basic type of data warehouse schema. Its structure is similar to
that of a star, where the star's center may contain a single fact table and several associated dimension tables.
The star schema is efficient for data modeling tasks such as analyzing large data sets.
Snowflake Schema- The snowflake schema is an extension of the star schema. In terms of structure, it adds
more dimensions and has a snowflake-like appearance. Data is split into additional tables, and the dimension
tables are normalized.
26. What are the components that the Hive data model has to offer?
Some major components in a Hive data model are
Buckets
Tables
Partitions.
You can go through many more detailed Hadoop Interview Questions here.
**kwargs in a function definition is used to pass a variable number of keyword arguments to a function when calling it. The double star allows passing any number of keyword arguments.
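A short illustrative example (the function and argument names are hypothetical):

def describe_person(**kwargs):
    # kwargs arrives as a dict of whatever keyword arguments the caller passed
    for key, value in kwargs.items():
        print(f"{key}: {value}")

describe_person(name="Asha", role="Data Engineer", experience=4)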
a = [1,2,3]
b = [1,2,3]
c = b
a == b
evaluates to True since the values contained in list a and list b are the same, but
a is b
evaluates to False because a and b refer to two different objects in memory, whereas
c is b
evaluates to True because c and b refer to the same object.
The objects and data structures initialized in a Python program are present in a private heap, and
programmers do not have permission to access the private heap space.
You can allocate heap space for Python objects using the Python memory manager. The core API of the
memory manager gives the programmer access to some of the tools for coding purposes.
Python has a built-in garbage collector that recycles unused memory and frees up memory for heap space.
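A small sketch showing reference counting and the garbage collector at work (illustrative only):

import gc
import sys

data = [1, 2, 3]
alias = data
# sys.getrefcount reports one extra reference created by the function call itself
print(sys.getrefcount(data))

del alias   # drop one reference; the object survives because 'data' still points to it
del data    # no references remain, so the memory manager can reclaim the object

print(gc.collect())   # force a collection pass; returns the number of unreachable objects found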
E.g., to remove duplicate values from a list:
list1 = [5,9,4,8,5,3,7,3,9]
list2 = list(set(list1))
set() may not maintain the order of items within the list.
The argument passed to extend() is iterated over, and each element of the argument adds to the list. The length of
the list increases by the number of elements in the argument passed to extend(). The time complexity for extend is
O(n), where n is the number of elements in the argument passed to extend.
Consider list1 = [1, 2, 3] and list2 = [4, 5]:
list1.append(list2)   # list1 becomes [1, 2, 3, [4, 5]] - list2 is added as a single element
list1.extend(list2)   # starting again from [1, 2, 3], list1 becomes [1, 2, 3, 4, 5] - each element is added
The continue statement forces control to stop the current iteration of the loop and execute the next iteration rather
than terminating the loop completely. If a continue statement is present within a loop, it leads to skipping the code
following it for that iteration, and the next iteration gets executed.
The pass statement in Python does nothing when it executes; it is useful when a statement is syntactically required but no command or code needs to run. The pass statement can be used to write empty loops, empty control statements, functions, and classes.
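A short illustration of both statements:

for number in range(5):
    if number == 2:
        continue      # skip the rest of this iteration; 2 is never printed
    print(number)

def not_implemented_yet():
    pass              # a syntactically required body that does nothing

class EmptyConfig:
    pass              # placeholder class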
30. How can you check if a given string contains only letters and
numbers?
str.isalnum() can be used to check whether a string ‘str’ contains only letters and numbers.
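For example (the strings are illustrative):

print("DataEngineer2022".isalnum())   # True - only letters and digits
print("Data Engineer!".isalnum())     # False - the space and '!' are neither letters nor digits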
31. Mention some advantages of using NumPy arrays over Python lists.
NumPy arrays take up less space in memory than lists.
NumPy arrays have built-in functions optimized for various techniques such as linear algebra, vector, and
matrix operations.
Lists in Python do not allow element-wise operations, but NumPy arrays can perform element-wise operations.
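A quick sketch of the element-wise behavior mentioned above:

import numpy as np

py_list = [1, 2, 3]
np_array = np.array([1, 2, 3])

print(py_list * 2)                          # [1, 2, 3, 1, 2, 3] - the list is repeated
print(np_array * 2)                         # [2 4 6] - each element is multiplied
print(np_array + np.array([10, 20, 30]))    # [11 22 33] - element-wise addition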
import pandas as pd
df = pd.DataFrame(days)
can be used to create the data frame from a list or dictionary named days; the index and columns parameters of pd.DataFrame() let you set the index labels and column names.
33. In Pandas, how can you find the median value in a column “Age”
from a dataframe “employees”?
The median() function can be used to find the median value in a column, e.g., employees["Age"].median()
employees.rename(columns=dict(address_line_1='region', address_line_2='city'))
It returns a dataframe of boolean values of the same size as the data frame on which it is called. The missing values in the original data frame are mapped to True, and non-missing values are mapped to False.
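For instance, assuming a small DataFrame with one missing value:

import pandas as pd
import numpy as np

df = pd.DataFrame({"Age": [25, np.nan, 31]})
print(df.isnull())
#      Age
# 0  False
# 1   True
# 2  False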
37. Given a 5x5 matrix in NumPy, how will you inverse the matrix?
The function numpy.linalg.inv() can help you inverse a matrix. It takes a matrix as the input and returns its inverse.
You can calculate the inverse of a matrix M as:
M⁻¹ = adjoint(M) / determinant(M), provided det(M) != 0
If det(M) equals 0, the matrix is singular and its inverse does not exist; numpy.linalg.inv() raises a LinAlgError in that case.
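A minimal sketch for the 5x5 case (the random values are purely for illustration):

import numpy as np

matrix = np.random.rand(5, 5)       # a random 5x5 matrix
inverse = np.linalg.inv(matrix)

# the product of a matrix and its inverse is (numerically) the identity matrix
print(np.allclose(matrix @ inverse, np.eye(5)))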
39. Using NumPy, create a 2-D array of random integers between 0 and 500 with 4 rows and 7 columns.

from numpy import random
arr = random.randint(0, 500, size=(4, 7))
print(arr)
40. Find all the indices in an array of NumPy where the value is greater
than 5.
import numpy as np
array = np.array([5,9,6,3,2,1,9])
print(np.where(array>5))
1. First, select the cell to the right of the columns and below the rows to be kept visible.
2. Then go to View -> Freeze Panes -> Freeze Panes to lock the selected rows and columns in place.
43. How can you prevent someone from copying the data in your
spreadsheet?
In Excel, you can protect a worksheet, which prevents the data in its cells from being copied and pasted elsewhere. To be able to copy and paste data from a protected worksheet, you must remove the sheet protection, unlock all cells, and then lock only those cells that are not to be changed or removed. To protect a worksheet, go to Menu -> Review -> Protect Sheet -> Password. Using a unique password, you can protect the sheet from being copied by others.
=SUM(A5:F5) can be useful to find the sum of values in the columns A-F of the 5th row.
P - Parentheses
E - Exponent
M - Multiplication
D - Division
A - Addition
S - Subtraction
SUBSTITUTE syntax: SUBSTITUTE(text, old_text, new_text, [instance_num])
Where
text refers to the text in which you want to perform the replacements, old_text is the existing text to be replaced, new_text is the replacement text, and instance_num (optional) specifies which occurrence of old_text to replace.
REPLACE syntax: REPLACE(old_text, start_num, num_chars, new_text)
50. What filter will you use if you want more than two conditions or if you
want to analyze the list using the database function?
You can use the Advanced Criteria Filter to analyze a list or in cases where you need to test more than two
conditions.
51. What does it mean if there is a red triangle at the top right-hand
corner of a cell?
A red triangle at the top right-hand corner of a cell indicates a comment associated with that particular cell. You can
view the comment by hovering the cursor over it.
SELECT column_name, COUNT(column_name)
FROM table_name
GROUP BY column_name
HAVING COUNT(column_name) > 1
will display all the records in a column which share the same value.

SELECT column1, column2, COUNT(*)
FROM table_name
GROUP BY column1, column2
HAVING COUNT(*) > 1
will display all the records with the same values in column1 and column2.
(INNER) JOIN: returns the records that have matching values in both tables.
LEFT (OUTER) JOIN: returns all records from the left table with their corresponding matching records from the
right table.
RIGHT (OUTER) JOIN: returns all records from the right table and their corresponding matching records from
the left table.
FULL (OUTER) JOIN: returns all records with a matching record in either the left or right table.
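A self-contained sketch of the first two join types using Python's built-in sqlite3 module; the table names and rows are made up for illustration (RIGHT and FULL OUTER joins additionally require SQLite 3.39 or newer):

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE employees (id INTEGER, name TEXT, dept_id INTEGER);
    CREATE TABLE departments (id INTEGER, dept_name TEXT);
    INSERT INTO employees VALUES (1, 'Asha', 10), (2, 'Ravi', 20), (3, 'Meera', NULL);
    INSERT INTO departments VALUES (10, 'Analytics'), (30, 'Finance');
""")

# INNER JOIN: only rows with a matching dept_id in both tables
print(cur.execute("""
    SELECT e.name, d.dept_name
    FROM employees e INNER JOIN departments d ON e.dept_id = d.id
""").fetchall())

# LEFT JOIN: every employee, with NULL where no department matches
print(cur.execute("""
    SELECT e.name, d.dept_name
    FROM employees e LEFT JOIN departments d ON e.dept_id = d.id
""").fetchall())

conn.close()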
The IN operator tests whether an expression matches any value specified in a list of values. It helps to eliminate the need for using multiple OR conditions. The NOT IN operator may be used to exclude certain rows from the query result. The IN operator may also be used with SELECT, INSERT, UPDATE, and DELETE statements. The syntax is:

SELECT column_name(s)
FROM table_name
WHERE column_name IN (value1, value2, ...);
Implicit Cursors: they are allocated by the SQL server when users perform DML operations.
Explicit Cursors: Users create explicit cursors based on requirements. Explicit cursors allow you to fetch table
data in a row-by-row method.
Some rules followed in database normalization, also known as normal forms, are the first normal form (1NF), the second normal form (2NF), the third normal form (3NF), and the Boyce-Codd normal form (BCNF).

A stored procedure can be created with the following syntax:

CREATE PROCEDURE procedure_name
AS
sql_statement
GO;

A stored procedure can take parameters at the time of execution so that the stored procedure can execute based on the values passed as parameters.
65. Write a query to select all records that contain "ind" in their name from a table named places.

SELECT *
FROM places
WHERE name LIKE '%ind%';
66. Which SQL query can be used to delete a table from the database
but keep its structure intact?
The TRUNCATE command helps delete all the rows from a table but keeps its structure intact. The column, indexes,
and constraints remain intact when using the TRUNCATE statement.
67. Write an SQL query to find the second highest sales from an "
Apparels " table.
select min(sales) from
(select distinct sales from Apparels order by sales desc limit 2) as top_two_sales;
68. Is a blank space or a zero value treated the same way as the
operator NULL?
NULL in SQL is not the same as zero or a blank space. NULL is used in the absence of any value and is said to be unavailable, unknown, unassigned, or not applicable. Zero is a number, and a blank space gets treated as a character. You can compare a blank space or zero to another blank space or zero, but you cannot compare one NULL with another NULL.
69. What is the default ordering of the ORDER BY clause, and how can this be changed?

The ORDER BY clause is useful for sorting the query result in ascending or descending order. By default, the query sorts in ascending order. The following statement changes the order to descending:

SELECT column_name(s)
FROM table_name
WHERE conditions
ORDER BY column_name DESC;
73. Write an SQL query to find all students' names from a table named 'Students' that end with 'T'.

SELECT * FROM Students WHERE stud_name LIKE '%T';
DELETE | TRUNCATE
In the case of the DELETE statement, rows are removed one at a time, and the DELETE statement records an entry in the transaction log for each deleted row. | Truncating a table removes the data associated with the table by deallocating the data pages that store the table data. Only the page deallocations are recorded in the transaction log.
The DELETE command is slower than the TRUNCATE command. | The TRUNCATE command is faster than the DELETE command.
Using the DELETE statement requires DELETE permission on the table. | Using the TRUNCATE command requires ALTER permission on the table.
CREATE TRIGGER trigger_name
(AFTER | BEFORE) (INSERT | UPDATE | DELETE)
ON table_name FOR EACH ROW
BEGIN
-- variable declarations
-- trigger code
END;
An easy-to-use interface gives you access to many Azure data stores, including ADLS Gen2, Cosmos DB,
Blobs, Queues, Tables, etc.
One of the most significant aspects of Azure Storage Explorer is that it enables users to work despite being
disconnected from the Azure cloud service using local emulators.
The first group comprises Queue Storage, Table Storage, and Blob Storage. It is built with data storage,
scalability, and connectivity and is accessible through a REST API.
The second group comprises File Storage and Disk Storage, which boosts the functionalities of the Microsoft
Azure Virtual Machine environment and is only accessible through Virtual Machines.
Queue Storage enables you to create versatile applications that comprise independent components
depending on asynchronous message queuing. Azure Queue storage stores massive volumes of messages
accessible by authenticated HTTP or HTTPS queries anywhere.
Table Storage in Microsoft Azure holds structured NoSQL data. The storage is highly extensible while also
being efficient in storing data. However, if you access temporary files frequently, it becomes more expensive.
This storage can be helpful to those who find Microsoft Azure SQL too costly and don't require the SQL
structure and architecture.
Blob Storage supports unstructured data/huge data files such as text documents, images, audio, video files,
etc. In Microsoft Azure, you can store blobs in three ways: Block Blobs, Append Blobs, and Page Blobs.
File Storage serves the needs of the Azure VM environment. You can use it to store huge data files
accessible from multiple Virtual Machines. File Storage allows users to share any data file via the SMB (Server
Message Block) protocol.
Disk Storage serves as a storage option for Azure virtual machines. It enables you to construct virtual
machine disks. Only one virtual machine can access a disk in Disk Storage.
Azure SQL Firewall Rules: There are two levels of security available in Azure.
The first are server-level firewall rules, which are present in the SQL Master database and specify which Azure
database servers are accessible.
The second type of firewall rule is database-level firewall rules, which monitor database access.
Azure SQL Database Auditing: The SQL Database service in Azure offers auditing features. It allows you to
define the audit policy at the database server or database level.
Azure SQL Transparent Data Encryption: TDE encrypts and decrypts databases and performs backups and
transactions on log files in real-time.
Azure SQL Always Encrypted: This feature safeguards sensitive data in the Azure SQL database, such as
credit card details.
PolyBase allows you to access data in Hadoop, Azure Blob Storage, or Azure Data Lake Store from Azure
SQL Database or Azure Synapse Analytics.
PolyBase uses relatively easy T-SQL queries to import data from Hadoop, Azure Blob Storage, or Azure Data
Lake Store without any third-party ETL tool.
PolyBase allows you to export and retain data to external data repositories.
It enables you to extend the query language's capabilities by introducing new Machine Learning functions.
Azure Stream Analytics can analyze a massive volume of structured and unstructured data at around a million
events per second and provide relatively low latency outputs.
Tumbling window functions take a data stream and divide it into discrete temporal segments, then apply a
function to each. Tumbling windows often recur, do not overlap, and one event cannot correspond to more
than one tumbling window.
Hopping window functions progress in time by a set period. Think of them as Tumbling windows that can
overlap and emit more frequently than the window size allows. Events can appear in multiple Hopping window
result sets. Set the hop size to the same as the window size to make a Hopping window look like a Tumbling
window.
Unlike Tumbling or Hopping windows, Sliding windows only emit events when the window's content changes.
As a result, each window contains at least one event, and events, like hopping windows, can belong to many
sliding windows.
Session window functions combine events that coincide and filter out periods when no data is available. The
three primary variables in Session windows are timeout, maximum duration, and partitioning key.
Snapshot windows bring together events having the same timestamp. You can implement a snapshot
window by adding System.Timestamp() to the GROUP BY clause, unlike most windowing function types that
involve a specialized window function (such as SessionWindow()).
Strong- It ensures linearizability, i.e., serving multiple requests simultaneously. The reads will always return
the item's most recent committed version. Uncommitted or incomplete writes are never visible to the client, and
users will always be able to read the most recent commit.
Bounded staleness- It guarantees the reads to follow the consistent prefix guarantee. Reads may lag writes
by "K" versions (that is, "updates") of an item or "T" time interval, whichever comes first.
Session- It guarantees reads to honor the consistent prefix, monotonic reads and writes, read-your-writes, and
write-follows-reads guarantees in a single client session. This implies that only one "writer" session or several
authors share the same session token.
Consistent prefix- It returns updates with a consistent prefix throughout all updates and has no gaps. Reads
will never detect out-of-order writes if the prefix consistency level is constant.
Eventual- There is no guarantee for ordering of reads in eventual consistency. The replicas gradually
converge in the lack of further writes.
83. What are the various types of Queues that Azure offers?
Storage queues and Service Bus queues are the two queue techniques that Azure offers.
Storage queues- Azure Storage system includes storage queues. You can save a vast quantity of messages
on them. Authorized HTTP or HTTPS calls allow you to access messages from anywhere. A queue can hold
millions of messages up to the storage account's overall capacity limit. Queues can build a backlog of work for
asynchronous processing.
Service Bus queues are present in the Azure messaging infrastructure, including queuing, publish/subscribe,
and more advanced integration patterns. They mainly connect applications or parts of applications that
encompass different communication protocols, data contracts, trust domains, or network settings.
84. What are the different data redundancy options in Azure Storage?
When it comes to data replication in the primary region, Azure Storage provides two choices:
Locally redundant storage (LRS) replicates your data three times synchronously in a single physical location
in the primary area. Although LRS is the cheapest replication method, it is unsuitable for high availability or
durability applications.
Zone-redundant storage (ZRS) synchronizes data across three Azure availability zones in the primary region.
Microsoft advises adopting ZRS in the primary region and replicating it in a secondary region for high-
availability applications.
Azure Storage provides two options for moving your data to a secondary area:
Geo-redundant storage (GRS) synchronizes three copies of your data within a single physical location using
LRS in the primary area. It moves your data to a single physical place in the secondary region asynchronously.
Geo-zone-redundant storage (GZRS) uses ZRS to synchronize data across three Azure availability zones in
the primary region. It then asynchronously moves your data to a single physical place in the secondary region.
Amazon S3 Access Logs record individual requests made to Amazon S3 buckets and can be used for monitoring traffic patterns, troubleshooting, and security and access audits. They can also assist a business in gaining a better understanding of its client base, establishing lifecycle policies, defining access policies, and determining Amazon S3 costs.
Amazon VPC Flow Logs record IP traffic between Amazon Virtual Private Cloud (Amazon VPC) network
interfaces at the VPC, subnet, or individual Elastic Network Interface level. You can store Flow log data in
Amazon CloudWatch Logs and export it to Amazon CloudWatch Streams for enhanced network traffic
analytics and visualization.
86. How can Amazon Route 53 ensure high availability while maintaining
low latency?
Route 53 is built on AWS's highly available and reliable infrastructure. The widely distributed design of its DNS servers helps maintain a consistent ability to route end users to your application by avoiding internet or network-related issues. Route 53 delivers the level of dependability that critical systems demand. Route 53 uses a worldwide anycast network of DNS servers to automatically answer queries from the optimal location available based on network conditions. As a result, your end users experience low query latency.
It's intended to be a highly flexible, simple-to-use, and cost-effective solution for developers and organizations
to transform (or "transcode") media files from their original format into versions suitable for smartphones,
tablets, and computers.
Amazon Elastic Transcoder also includes transcoding presets for standard output formats, so you don't have to guess which settings will work best on particular devices.
Reserved Instances- When deployed in a specific Availability Zone, Amazon EC2 Reserved Instances (RI)
offer a significant reduction (up to 72%) over On-Demand pricing and a capacity reservation.
Spot Instances- You can request additional Amazon EC2 computing resources for up to 90% off the On-
Demand price using Amazon EC2 Spot instances.
The eventual consistency model is ideal for systems where data update doesn’t occur in real-time. It's
Amazon DynamoDB's default consistency model, boosting read throughput. However, the outcomes of a
recently completed write may not necessarily reflect in an eventually consistent read.
In Amazon DynamoDB, a strongly consistent read yields a result that includes all writes that received a successful response before the read. You can provide additional parameters in a request to get a strongly consistent read result. Processing a strongly consistent read takes more resources than an eventually consistent read.
90. What do you understand about Amazon Virtual Private Cloud (VPC)?
The Amazon Virtual Private Cloud (Amazon VPC) enables you to deploy AWS resources into a custom virtual
network.
This virtual network is like a typical network run in your private data center, but with the added benefit of AWS's
scalable infrastructure.
Amazon VPC allows you to create a virtual network in the cloud without VPNs, hardware, or real data centers.
You can also use Amazon VPC's advanced security features to give more selective access to and from your
virtual network's Amazon EC2 instances.
Network Access Analyzer- The Network Access Analyzer tool assists you in ensuring that your AWS network
meets your network security and compliance standards. Network Access Analyzer allows you to establish your
network security and compliance standards.
Traffic Mirroring- You can directly access the network packets running through your VPC via Traffic Mirroring.
This functionality enables you to route network traffic from Amazon EC2 instances' elastic network interface to
security and monitoring equipment for packet inspection.
Recovery point objective (RPO): The maximum allowed time since the previous data recovery point. This
establishes the level of data loss that is acceptable.
93. What are the benefits of using AWS Identity and Access
Management (IAM)?
AWS Identity and Access Management (IAM) supports fine-grained access management throughout the AWS
infrastructure.
IAM Access Analyzer allows you to control who has access to which services and resources and under what
circumstances. IAM policies let you control rights for your employees and systems, ensuring they have the
least amount of access.
It also provides federated access, enabling you to grant resource access to systems and users without having to create IAM users for them.
94. What are the various types of load balancers available in AWS?
1. An Application Load Balancer makes routing decisions at the application layer (HTTP/HTTPS), supports path-based routing, and can route requests to one or more ports on each container instance in your cluster. Dynamic host port mapping is available with Application Load Balancers.
2. The transport layer (TCP/SSL) is where a Network Load Balancer decides the routing path. It processes
millions of requests per second, and dynamic host port mapping is available with Network Load Balancers.
3. Gateway Load Balancer distributes traffic while scaling your virtual appliances to match demands by
combining a transparent network gateway.
You create queries to change your data and get essential insights instead of deploying, configuring, and
optimizing hardware.
The analytics service can instantaneously handle jobs of any complexity by provisioning the amount of processing power you require.
Also, it's cost-effective because you only pay for your task when it's operating.
96. Compare Azure Data Lake Gen1 vs. Azure Data Lake Gen2.
Azure Data Lake Gen1 | Azure Data Lake Gen2
The hot/cold storage tier isn't available. | The hot/cold storage tier is available.
U-SQL scales out custom code (.NET/C#/Python) from a Gigabyte to a Petabyte scale using typical SQL
techniques and language.
Big data processing techniques like "schema on reads," custom processors, and reducers are available in U-
SQL.
The language allows you to query and integrate structured and unstructured data from various data sources,
including Azure Data Lake Storage, Azure Blob Storage, Azure SQL DB, Azure SQL Data Warehouse, and
SQL Server instances on Azure VMs.
U-SQL can process any structured and unstructured data using SQL syntax and Azure custom functions to set
up new ADFS driver functions.
It offers a highly accessible on-premise data warehouse service for exploring data for analytics, reporting,
monitoring, and Business Intelligence using various tools.
99. What are the different blob storage access tiers in Azure?
Hot tier - An online tier that stores regularly viewed or updated data. The Hot tier has the most expensive
storage but the cheapest access.
Cool tier - An online tier designed for storing data that is rarely accessed or modified. The Cool tier offers lower storage costs but higher access charges than the Hot tier.
Archive tier - An offline tier designed for storing data accessed rarely and with variable latency requirements.
You should keep the Archive tier's data for at least 180 days.
A block scanner runs on every DataNode to verify that the data blocks stored on that DataNode are intact and have not been corrupted.
101. How does a block scanner deal with a corrupted data block?
When the block scanner detects a corrupted data block, the DataNode notifies the NameNode. The NameNode then starts creating a new replica of the block from an uncorrupted copy on another DataNode. The corrupted data block is deleted only once the number of good replicas matches the replication factor.
core-site.xml
yarn-site.xml
mapred-site.xml
103. How would you check the validity of data migration between
databases?
A data engineer's primary concerns should be maintaining the accuracy of the data and preventing data loss. The
purpose of this question is to help the hiring managers understand how you would validate data.
You must be able to explain the suitable validation types in various instances. For instance, you might suggest that
validation can be done through a basic comparison or after the complete data migration.
117. Why are you opting for a career in data engineering, and why
should we hire you?
118. What are the daily responsibilities of a data engineer?
119. What problems did you face while trying to aggregate data from
multiple sources? How did you go about resolving this?
120. Do you have any experience working on Hadoop, and how did you
enjoy it?
121. Do you have any experience working in a cloud computing
environment? What are some challenges that you faced?
122. What are the fundamental characteristics that make a good data
engineer?
123. How would you approach a new project as a data engineer?
124. Do you have any experience working with data modeling
techniques?
Facebook Data Engineer Interview Questions
As per Glassdoor, here are some Data Engineer interview questions asked in Facebook:
124. Given a list containing a None value, replace the None value with
the previous value in the list.
125. Print the key in a dictionary corresponding to the nth highest value
in the dictionary. Print just the first one if there is more than one record
associated with the nth highest value.
126. Given two sentences, print the words that are present in only one of
the two sentences.
127. Create a histogram using values from a given list.
128. Write a program to flatten the given list: [1, 2, 3, [4, 5, [6, 7, [8, 9]]]] (see the sketch after this list).
129. Write a program to remove duplicates from any given list.
130. Write a program to count the number of words in a given sentence.
131. Find the number of occurrences of a letter in a string.
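As an illustration of question 128 above, here is one hedged sketch of a recursive flatten; it is one of several possible approaches, not an official solution from the article.

def flatten(nested):
    # recursively unpack inner lists while keeping left-to-right order
    flat = []
    for item in nested:
        if isinstance(item, list):
            flat.extend(flatten(item))
        else:
            flat.append(item)
    return flat

print(flatten([1, 2, 3, [4, 5, [6, 7, [8, 9]]]]))   # [1, 2, 3, 4, 5, 6, 7, 8, 9]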
Amazon Data Engineer Interview Questions
Data Engineer interview questions that are most commonly asked at Amazon
132. How can you tune a query? If a query takes longer than it initially
did, what may be the reason, and how will you find the cause?
133. In Python, how can you find non-duplicate numbers in the first list and create a new list preserving the order of the non-duplicates? (A sketch follows this list.)
134. Consider a large table containing three columns corresponding to
DateTime, Employee, and customer_response. The customer_response
column is a free text column. Assuming a phone number is embedded in
the customer_response column, how can you find the top 10 employees
with the most phone numbers in the customer_response column?
135. Sort an array in Python so that it produces only odd numbers.
136. How can you achieve performance tuning in SQL? Find the
numbers which have the maximum count in a list?
137. Generate a new list containing the numbers repeated in two
existing lists.
138. How would you tackle a data pipeline performance problem as a
data engineer?
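One possible sketch for question 133, assuming "non-duplicate" means keeping only the first occurrence of each value while preserving order; this interpretation and the helper name are assumptions, not the article's official answer.

def dedupe_preserve_order(values):
    seen = set()
    result = []
    for value in values:
        if value not in seen:      # keep only the first occurrence
            seen.add(value)
            result.append(value)
    return result

print(dedupe_preserve_order([3, 1, 3, 2, 1, 5]))   # [3, 1, 2, 5]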
How Data Engineering helps Businesses? | Why is Data
Engineering In Demand?
Data engineering is more significant than data science. Data engineering maintains the framework that enables data
scientists to analyze data and create models. Without data engineering, data science is not possible. A successful
data-driven company relies on data engineering. Data engineering makes it easier to build a data processing stack
for data collection, storage, cleaning, and analysis in batches or in real time, making it ready for further data analysis.
Furthermore, as businesses learn more about the significance of big data engineering, they turn towards AI-driven
methodologies for end-to-end Data Engineering rather than employing the older techniques. Data engineering aids
in finding useful data residing in any data warehouse with the help of advanced analytic methods. Data Engineering
also allows businesses to collaborate with data and leads to efficient data processing.
Based on Glassdoor, the average salary of a data engineer in the United States is $112,493 per annum. In India, the
average data engineer salary is ₹925,000. According to Indeed, Data Engineer is the 5th highest paying job in the
United States across all the sectors. These stats clearly state that the demand for the role of a Data Engineer will
only increase with lucrative paychecks.
1. SQL: Data engineers are responsible for handling large amounts of data. Structured Query Language (SQL) is
required to work on structured data in relational database management systems (RDBMS). As a data
engineer, it is essential to be thorough with using SQL for simple and complex queries and optimize queries as
per requirements.
2. Data Architecture and Data Modeling: Data engineers are responsible for building complex database
management systems. They are considered the gatekeepers of business-relevant data and must design and
develop safe, secure, and efficient systems for data collection and processing.
3. Data Warehousing: It is important for data engineers to grasp building data warehouses and to work with
them. Data warehouses allow the aggregation of unstructured data from different sources, which can be used
for further efficient processing and analysis.
4. Programming Skills: The most popular programming languages used in Big Data Engineering are Python and
R, which is why it is essential to be well versed in at least one of these languages.
5. Microsoft Excel: Excel allows developers to arrange their data into tables. It is a commonly used tool to
organize and update data regularly if required. Excel provides many tools that can be used for data analysis,
manipulation, and visualization.
6. Apache Hadoop-Based Analytics: Apache Hadoop is a prevalent open-source tool used extensively in Big
Data Engineering. The Hadoop ecosystem provides support for distributed computing, allows storage,
manipulation, security, and processing of large amounts of data, and is a necessity for anyone applying for the
role of a data engineer.
7. Operating Systems: Data engineers are often required to be familiar with operating systems like Linux, Solaris, UNIX, and Microsoft Windows.
8. Machine Learning: Machine learning techniques are primarily required for data scientists. However, since data
scientists and data engineers work closely together, knowledge of machine learning tools and techniques will
help a data engineer.
Brush up your skills: Here are some skills that are expected in a data engineer role:
Technical skills: Data Engineers have to be familiar with database management systems, SQL, Microsoft
Excel, programming languages especially R and Python, working with Big Data tools including Apache
Hadoop and Apache Spark.
Analytical Skills: Data Engineering requires individuals with strong mathematical and statistical skills who
can make sense of the large amounts of data that they constantly have to deal with.
Understanding business requirements: To design optimum databases, it is important that data engineers
understand what is expected of them, and design databases as per requirements.
Be familiar with the specific company with which you are interviewing. Understand the goals and objectives of
the company, some of their recent accomplishments, and any ongoing projects you can find out about. The
more specific your answers to questions like “Why have you chosen Company X?”, the more you will be able
to convince your interviewers that you have truly come prepared for the interview.
Have a thorough understanding of the projects you have worked on. Be prepared to answer questions based
on these projects, primarily if the projects are related to Big Data and data engineering. You may be asked
questions about the technology used in the data engineering projects, the datasets you used, how you
obtained the required data samples, and the algorithms you used to approach the end goal. Try to recall any
difficulties that you encountered during the execution of the project and how you went about solving them.
Spend time working on building up your project profile and in the process, your confidence. By working on
projects, you can expand your knowledge by gaining hands-on experience. Projects can be showcased to your
interviewer but will also help build up your skillset and give you a deeper understanding of the tools and
techniques used in the market in the field of Big Data and data engineering.
Make sure to get some hands-on practice with ProjectPro’s solved big data projects with reusable source code that
can be used for further practice with complete datasets. At any time, if you feel that you require some assistance, we
provide one-to-one industry expert guidance to help you understand the code and ace your data engineering skills.
Create and implement ETL data pipeline for a variety of clients in various sectors.
Generate accurate and useful data-driven solutions using data modeling and data warehousing techniques.
Interact with other teams (data scientists, etc.) and help them by delivering relevant datasets for analysis.
Build data pipelines for extraction and storage tasks by employing a range of big data engineering tools and
various cloud service platforms.
Do you have any experience working on Hadoop, and how did you enjoy it?
Do you have any experience working in a cloud computing environment, what are some challenges that you
faced?