Top 100+ Data Engineer Interview Questions and Answers For 2022

Last Updated: 30 Jun 2022

This blog is your one-stop solution for the top 100+ Data Engineer Interview Questions and Answers. In this blog, we have collated the frequently asked data engineer interview questions based on tools and technologies that are highly useful for a data engineer in the Big Data industry. You will also find some interesting data engineer interview questions that have been asked at companies (like Facebook, Amazon, Walmart, etc.) that leverage big data analytics and tools.


Preparing for data engineer interviews makes even the bravest of us anxious. One good way to stay calm and composed for an interview is to thoroughly prepare answers to questions frequently asked in interviews. If you have an interview for a data engineer role coming up, here are some data engineer interview questions and answers, organized by the skill set required, that you can refer to as you prepare to nail your future data engineer interviews.

Table of Contents

Top 100+ Data Engineer Interview Questions and Answers

Data Engineer Interview Questions on Big Data

Data Engineer Interview Questions on Python

Data Engineer Interview Questions on Excel

Data Engineer Interview Questions on SQL

Data Engineer Interview Questions on Azure

Data Engineer Interview Questions on AWS

Data Engineer Interview Questions on Data Lake

Data Engineer Technical Interview Questions

Databricks Data Engineer Interview Questions

Walmart Data Engineer Interview Questions

EY Data Engineer Interview Questions

Behavioral Data Engineering Questions

Facebook Data Engineer Interview Questions

Amazon Data Engineer Interview Questions

How Does Data Engineering Help Businesses? | Why is Data Engineering in Demand?

Data Engineer Job Growth and Demand in 2022

What Skills Does a Data Engineer Need?

Get Set Go For Your Interview with ProjectPro's Top Data Engineer Interview Questions

FAQs on Data Engineer Interview Questions

How can I pass a data engineer interview?

What are the roles and responsibilities of a data engineer?

What are the 4 key questions a data engineer is likely to hear during an interview?


Top 100+ Data Engineer Interview Questions and Answers

The following sections consist of the top 100+ data engineer interview questions, divided based on big data fundamentals, big data tools/technologies, and big data cloud computing platforms. Furthermore, you will find a few sections on data engineer interview questions commonly asked at various companies leveraging the power of big data and data engineering.
Data Engineer Interview Questions on Big Data

Any organization that relies on data must perform big data engineering to stand out from the crowd. But data collection, storage, and large-scale processing are only the first steps in the complex process of big data analysis. Complex algorithms, specialized professionals, and high-end technologies are required to leverage big data in businesses, and big data engineering ensures that organizations can utilize the power of data.

Below are some big data interview questions for data engineers based on the fundamental concepts of big data, such as data modeling, data analysis, data migration, data processing architecture, data storage, and big data analytics.

1. Differentiate between relational and non-relational database management systems.

Relational Database Management Systems (RDBMS) vs. non-relational database management systems:

Data model: Relational databases primarily work with structured data using SQL (Structured Query Language), which operates on data arranged in a predefined schema. Non-relational databases support dynamic schemas for unstructured data; data can be graph-based, column-oriented, document-oriented, or stored as a key-value store.

Guarantees: RDBMS follow the ACID properties: atomicity, consistency, isolation, and durability. Non-relational databases follow Brewer's CAP theorem: consistency, availability, and partition tolerance.

Scaling: RDBMS are usually vertically scalable; a single server can handle more load by adding resources such as RAM, CPU, or SSD. Non-relational databases are horizontally scalable and can handle more traffic by adding more servers.

Use cases: Relational databases are a better option if the data requires multi-row transactions, since relational databases are table-oriented. Non-relational databases are ideal if you need flexibility in storing the data, since you can create documents without a fixed schema; because they scale horizontally, they can become more powerful and suit large or constantly changing datasets.

Examples: RDBMS: PostgreSQL, MySQL, Oracle, Microsoft SQL Server. Non-relational: Redis, MongoDB, Cassandra, HBase, Neo4j, CouchDB.
2. What is data modeling?

Data modeling is a technique that defines and analyzes the data requirements needed to support business processes. It involves creating a visual representation of an entire system of data or a part of it. The process of data modeling begins with stakeholders providing business requirements to the data engineering team.


3. How is a data warehouse different from an operational database?

Workload: Data warehouses generally support high-volume analytical data processing (OLAP); operational databases support high-volume transaction processing (OLTP).

Updates: In a data warehouse, you may add new data regularly, but once added, the data does not change very frequently. In an operational database, data is updated regularly.

Queries: Data warehouses are optimized to handle complex queries, which can access multiple rows across many tables; operational databases are ideal for queries that return single rows at a time per table.

Volume: Data warehouses involve large amounts of data; operational databases usually hold less.

Optimization: A data warehouse is suited for fast retrieval from relatively large volumes of data; operational databases are optimized to handle fast inserts and updates on a smaller scale of data.


4. What are the big four V's of big data?

Volume: refers to the size of the data sets to be analyzed or processed, generally in terabytes or petabytes.

Velocity: the speed at which data is generated, often faster than traditional data handling techniques can keep up with.

Variety: the data can come from various sources and contain structured, semi-structured, or unstructured data.

Veracity: the quality of the data to be analyzed; the data has to be able to contribute in a meaningful way to generate results.

5. Differentiate between Star schema and Snowflake schema.

Structure: The star schema is a simple top-down data warehouse schema that contains fact tables and dimension tables. The snowflake schema is a bottom-up data warehouse schema that contains fact tables, dimension tables, and sub-dimension tables.

Storage: The star schema takes up more space; the snowflake schema takes up less space.

Query time: The star schema takes less time for query execution; the snowflake schema takes more.

Normalization: Normalization is not used in a star schema, so there is high data redundancy; in the snowflake schema, normalization and denormalization are used, so there is less data redundancy.

Complexity: The star schema is simpler to design and understand and has low query complexity; the snowflake schema is a little more complex and has higher query complexity.

Foreign keys: The star schema has fewer foreign keys; the snowflake schema has many.


6. What are the differences between OLTP and OLAP?

OLTP (Online Transaction Processing) systems are for modifying online databases; OLAP (Online Analytical Processing) systems are for querying them.

OLTP supports insert, update, and delete transactions on the database; OLAP supports the extraction of data from the database for further analysis.

OLTP systems generally run simpler queries that require less transaction time; OLAP queries are generally more complex and take longer.

Tables in OLTP are normalized; tables in OLAP are not normalized.

7. What are some differences between a data engineer and a data scientist?

Data engineers and data scientists work very closely together, but there are some differences in their roles and responsibilities.

Primary role: A data engineer designs and implements highly maintainable database management systems; a data scientist takes the raw data, applies analytic tools and modeling techniques to analyze it, and provides insights to the business.

Focus: Data engineers transform big data into a structure that can be analyzed; data scientists perform the actual analysis of big data.

Responsibilities: Data engineers must ensure that the infrastructure of the databases meets industry requirements and caters to the business; data scientists must analyze the data and develop problem statements that can process the data to help the business.

Other duties: Data engineers take care of the safety, security, and backup of the data, and they work as gatekeepers of the data; data scientists should have good data visualization and communication skills to convey the results of their data analysis to various stakeholders.

Skills: Data engineers need proficiency in the field of big data and strong database management skills; data scientists need proficiency in machine learning.

Both the data scientist and data engineer roles require professionals with a computer science and engineering background, or a closely related field such as mathematics, statistics, or economics. A sound command of software and programming languages is important for both. Read more for a detailed comparison between data scientists and data engineers.


8. How is a data architect different from a data engineer?

Data architects visualize and conceptualize data frameworks; data engineers build and maintain them.

Data architects provide the organizational blueprint of data; data engineers use that blueprint to collect, maintain, and prepare the required data.

Data architects require practical skills with data management tools, including data modeling, ETL tools, and data warehousing; data engineers must possess skills in software engineering and be able to build and maintain database management systems.

Data architects help the organization understand how changes in data acquisition will impact the data in use; data engineers take the vision of the data architects and use it to build, maintain, and process the architecture for further use by other data professionals.

9. Differentiate between structured and unstructured data.

Structured data usually fits into a predefined data model; unstructured data does not.

Structured data usually consists of only text; unstructured data can be text, images, sounds, videos, or other formats.

It is easy to query structured data and perform further analysis on it; it is difficult to query the required unstructured data.

Relational databases and data warehouses contain structured data; data lakes and non-relational databases can contain unstructured data (a data warehouse can hold some unstructured data too).

10. How does Network File System (NFS) differ from Hadoop Distributed File System (HDFS)?

NFS can store and process only small volumes of data; the Hadoop Distributed File System, or HDFS, primarily stores and processes large amounts of data, i.e., Big Data.

In NFS, the data resides on a single dedicated piece of hardware; in HDFS, the data blocks are distributed across the local hardware drives of the cluster's machines.

NFS is not very fault tolerant: in case of a machine failure, you cannot recover the data. HDFS is fault tolerant: you may recover the data if one of the nodes fails.

NFS runs on a single machine, so there is no data redundancy; HDFS replicates data across the machines in a cluster, so there is data redundancy.

11. What is meant by feature selection?


Feature selection is the process of identifying and selecting only the features relevant to the prediction variable or desired output for model creation. A subset of the features that contribute the most to the desired output is selected, either automatically or manually.
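As an illustration, one simple automatic heuristic is a variance threshold: near-constant columns carry little information and can be dropped before model creation. A minimal pure-Python sketch (the dataset and threshold below are made up; real pipelines also use richer criteria such as mutual information or model-based importance):

```python
from statistics import pvariance

def select_features(rows, threshold=0.1):
    """Keep the indices of columns whose population variance exceeds
    the threshold; near-constant columns are dropped."""
    n_cols = len(rows[0])
    columns = [[row[i] for row in rows] for i in range(n_cols)]
    return [i for i, col in enumerate(columns) if pvariance(col) > threshold]

# Column 1 is almost constant, so only columns 0 and 2 survive.
data = [[1.0, 5.0, 10.0],
        [2.0, 5.0, 30.0],
        [3.0, 5.1, 20.0]]
print(select_features(data))  # → [0, 2]
```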

12. How can missing values be handled in Big Data?


Some ways you can handle missing values in Big Data are as follows:

Deleting rows with missing values: You simply delete the rows or columns in a table with missing values from
the dataset. You can drop the entire column from the analysis if a column has more than half of the rows with
null values. You can use a similar method for rows with missing values in more than half of the columns. This
method may not work very well in cases where a large number of values are missing.

Using mean/median for missing values: If a column with missing values has a numeric data type, you can fill in the missing values using the mean or median of the remaining values in the column.

Imputation method for categorical data: If you can classify the data in a column, you can replace the missing
values with the most frequently used category in that particular column. If more than half of the column values
are empty, you can use a new categorical variable to place the missing values.

Predicting missing values: Regression or classification techniques can predict the values based on the nature
of the missing values.

Last Observation Carried Forward (LOCF) method: The last valid observation can be used to fill in the missing value in data variables that display a longitudinal behavior.

Using algorithms that support missing values: Some algorithms, such as the k-NN algorithm, can ignore a column if values are missing. Another such algorithm is Naive Bayes. The Random Forest algorithm can also work with non-linear and categorical data.
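For numeric columns, the mean/median strategy described above can be sketched in pure Python (the ages column is a made-up example; None stands in for a missing value):

```python
from statistics import mean, median

def impute(values, strategy="mean"):
    """Replace None entries in a numeric column with the mean or
    median of the observed (non-missing) values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if strategy == "mean" else median(observed)
    return [fill if v is None else v for v in values]

ages = [25, None, 31, 40, None]
print(impute(ages, "mean"))    # → [25, 32, 31, 40, 32]
print(impute(ages, "median"))  # → [25, 31, 31, 40, 31]
```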

13. What is meant by outliers?


In a dataset, an outlier is an observation that lies at an abnormal distance from the other values in a random sample from a particular data set. It is left up to the analyst to determine what can be considered abnormal: before you classify data points as abnormal, you must first identify and characterize the normal observations. Outliers may occur due to variability in measurement or a particular experimental error, and they must be identified and removed so that they do not cause problems in further analysis of the data.
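One common rule of thumb for "abnormal distance" is Tukey's IQR fence: flag points more than 1.5 times the interquartile range outside the quartiles. A rough pure-Python sketch (using a crude index-based quartile estimate; the data is invented):

```python
def iqr_outliers(values, k=1.5):
    """Flag points lying more than k * IQR outside the quartiles
    (Tukey's fence, a common rule of thumb for outlier detection)."""
    s = sorted(values)
    n = len(s)
    q1, q3 = s[n // 4], s[(3 * n) // 4]   # crude quartile estimates
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lo or v > hi]

data = [10, 12, 11, 13, 12, 95, 11, 10]
print(iqr_outliers(data))  # → [95]
```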

14. What is meant by logistic regression?


Logistic regression is a classification rather than a regression model, which involves modeling the probability of a
discrete outcome given an input variable. It is a simple and efficient method that can approach binary and linear
classification problems. Logistic regression is a statistical method that works well with binary classifications but can
be generalized to multiclass classifications.
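At the core of logistic regression is the sigmoid function, which maps a linear combination of the inputs to a probability between 0 and 1. A minimal sketch (the weights and bias here are made up, not fitted to any data):

```python
import math

def sigmoid(z):
    """Squash any real number into the (0, 1) interval."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, weights, bias):
    """Probability of the positive class for one input vector."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return sigmoid(z)

p = predict_proba([2.0, 1.0], weights=[0.4, -0.2], bias=0.1)
print(round(p, 3))  # → 0.668
```

In practice the weights are learned by maximizing the likelihood of the training labels; predictions above a chosen threshold (commonly 0.5) are assigned to the positive class.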

15. Briefly define the Star Schema.


The star join schema, one of the most basic design schemas in the Data Warehousing concept, is also known as the
star schema. It looks like a star, with fact tables and related dimension tables. The star schema is useful when
handling huge amounts of data.

16. Briefly define the Snowflake Schema.


The snowflake schema, one of the popular design schemas, is a basic extension of the star schema that includes
additional dimensions. The term comes from the way it resembles the structure of a snowflake. In the snowflake
schema, the data is organized and, after normalization, divided into additional tables.

17. What is the difference between the KNN and k-means methods?
The k-means method is an unsupervised learning algorithm used as a clustering technique, whereas the K-
nearest-neighbor is a supervised learning algorithm for classification and regression problems.

KNN algorithm uses feature similarity, whereas the K-means algorithm refers to dividing data points into
clusters so that each data point is placed precisely in one cluster and not across many.

18. What is the purpose of A/B testing?


A/B testing is a randomized experiment performed on two variants, ‘A’ and ‘B.’ It is a statistics-based process
involving applying statistical hypothesis testing, also known as “two-sample hypothesis testing.” In this process, the
goal is to evaluate a subject’s response to variant A against its response to variant B to determine which variants are
more effective in achieving a particular outcome.
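To make the "two-sample hypothesis testing" concrete, here is a minimal two-proportion z-test in pure Python (the visitor and conversion counts are made up):

```python
import math

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """z-statistic for comparing the conversion rates of variants A and B
    (two-sample hypothesis test on proportions)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)          # pooled rate under H0
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Variant A: 200 conversions out of 2000; variant B: 260 out of 2000.
z = two_proportion_z(conv_a=200, n_a=2000, conv_b=260, n_b=2000)
print(round(z, 2))  # → 2.97
```

Here |z| > 1.96, which corresponds to statistical significance at the 5% level for a two-sided test, so variant B's higher conversion rate is unlikely to be due to chance alone.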

19. What do you mean by collaborative filtering?


Collaborative filtering is a method used by recommendation engines. In the narrow sense, collaborative filtering is a technique for automatically predicting a user's tastes by collecting various information about the interests or preferences of many other users. It works on the logic that if person 1 and person 2 have the same opinion on one particular issue, then person 1 is more likely to share person 2's opinion on another issue than the opinion of a randomly chosen person. In general, collaborative filtering is the process of filtering information using techniques involving collaboration among multiple data sources and viewpoints.
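The intuition can be made concrete with cosine similarity between users' rating vectors (the users and ratings below are invented; 0 means "not rated"):

```python
import math

def cosine(u, v):
    """Cosine similarity between two rating vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

ratings = {
    "alice": [5, 4, 0, 1],
    "bob":   [4, 5, 0, 1],
    "carol": [1, 0, 5, 4],
}

# Bob's tastes are closer to Alice's than to Carol's, so items Alice
# rated highly are better candidates to recommend to Bob.
sim_ab = cosine(ratings["alice"], ratings["bob"])
sim_cb = cosine(ratings["carol"], ratings["bob"])
print(sim_ab > sim_cb)  # → True
```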

20. What are some biases that can happen while sampling?
Some common types of bias that occur while sampling are:

Undercoverage- The undercoverage bias occurs when there is an inadequate representation of some
members of a particular population in the sample.

Observer Bias- Observer bias occurs when researchers unintentionally project their expectations on the
research. There may be occurrences where the researcher unintentionally influences surveys or interviews.

Self-Selection Bias- Self-selection bias, also known as volunteer response bias, happens when the research
study participants take control over the decision to participate in the survey. The individuals may be biased and
are likely to share some opinions that are different from those who choose not to participate. In such cases,
the survey will not represent the entire population.

Survivorship Bias- Survivorship bias occurs when a sample is concentrated on subjects that passed the selection process or criterion and ignores the subjects who did not pass the selection criteria. Survivorship bias can lead to overly optimistic results.

Recall Bias- Recall bias occurs when a respondent fails to remember things correctly.

Exclusion Bias- The exclusion bias occurs due to the exclusion of certain groups while building the sample.

21. What is a distributed cache?


A distributed cache pools the RAM in multiple computers networked into a single in-memory data store to provide
fast access to data. Most traditional caches tend to be in a single physical server or hardware component.
Distributed caches, however, grow beyond the memory limits of a single computer as they link multiple computers,
providing larger and more efficient processing power. Distributed caches are useful in environments that involve
large data loads and volumes. They allow scaling by adding more computers to the cluster and allowing the cache to
grow based on requirements.
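A toy model of the idea (plain dicts stand in for cache servers; this is an illustrative sketch, not a real cache client):

```python
class ShardedCache:
    """Toy distributed cache: each key is hashed to one of several
    'nodes', so adding nodes grows the total in-memory capacity."""

    def __init__(self, n_nodes=3):
        self.nodes = [{} for _ in range(n_nodes)]

    def _node_for(self, key):
        # Pick the node responsible for this key by hashing it.
        return self.nodes[hash(key) % len(self.nodes)]

    def put(self, key, value):
        self._node_for(key)[key] = value

    def get(self, key):
        return self._node_for(key).get(key)

cache = ShardedCache(n_nodes=3)
cache.put("user:42", {"name": "Ada"})
print(cache.get("user:42"))  # → {'name': 'Ada'}
```

Real distributed caches typically use consistent hashing rather than the plain modulo sharding shown here, so that adding or removing a node remaps only a fraction of the keys instead of reshuffling most of them.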

22. Explain how Big Data and Hadoop are related to each other.
Apache Hadoop is a collection of open-source libraries for processing large amounts of data. Hadoop supports distributed computing, in which data is processed across clusters of multiple computers. Previously, an organization that had to process large volumes of data had to buy expensive specialized hardware; Hadoop makes it possible to achieve high performance, reliability, and fault tolerance through software running on commodity machines. Hadoop is useful wherever there is Big Data and insights to be generated from it, and it has robust community support and keeps evolving to process, manage, manipulate, and visualize Big Data in new ways.

23. Briefly define COSHH.


COSHH is an acronym for Classification and Optimization-based Scheduling for Heterogeneous Hadoop systems.
As the name implies, it offers scheduling at both the cluster and application levels to speed up job completion.

24. Give a brief overview of the major Hadoop components.


Working with Hadoop involves many different components, some of which are listed below:

Hadoop Common: This comprises all the tools and libraries typically used by the Hadoop application.

Hadoop Distributed File System (HDFS): When using Hadoop, all data is present in the HDFS, or Hadoop
Distributed File System. It offers an extremely high bandwidth distributed file system.

Hadoop YARN: The Hadoop system uses YARN, or Yet Another Resource Negotiator, to manage resources.
YARN can also be useful for task scheduling.

Hadoop MapReduce: Hadoop MapReduce is a framework that gives users access to large-scale parallel data processing based on the map and reduce programming model.

25. List some of the essential features of Hadoop.


Hadoop is a user-friendly open source framework.

Hadoop is highly scalable and can handle any sort of dataset effectively, including unstructured (images and videos), semi-structured (XML, JSON), and structured (MySQL data) formats.

Parallel computing ensures efficient data processing in Hadoop.

Hadoop ensures data availability even if one of your systems crashes by copying data across several
DataNodes in a Hadoop cluster.

26. What methods does Reducer use in Hadoop?


The three primary methods of a Reducer in Hadoop are as follows:

setup(): This method runs once before any keys are processed and is mostly used to initialize input data variables and cache protocols.

cleanup(): This method runs once at the end and is used to delete the temporary files that were saved.

reduce(): This method is called once for each key and is the most crucial component of the entire reducer.
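To illustrate the lifecycle, here is a pure-Python sketch of the pattern (not the actual Hadoop Java API): setup() runs once before any keys are processed, reduce() runs once per key with all of that key's grouped values, and cleanup() runs once at the end.

```python
class WordCountReducer:
    """Mimics the Hadoop reducer lifecycle: setup -> reduce per key -> cleanup."""

    def setup(self):
        self.totals = {}                  # state initialized once, before any key

    def reduce(self, key, values):
        self.totals[key] = sum(values)    # called once per key with its values

    def cleanup(self):
        # Finalize results / release resources at the end of the run.
        return dict(sorted(self.totals.items()))

# Shuffled, grouped input as a reducer would receive it.
grouped = {"hadoop": [1, 1, 1], "spark": [1, 1]}
r = WordCountReducer()
r.setup()
for key, values in grouped.items():
    r.reduce(key, values)
print(r.cleanup())  # → {'hadoop': 3, 'spark': 2}
```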

27. What are the various design schemas in data modeling?


There are two fundamental design schemas in data modeling: star schema and snowflake schema.

Star Schema- The star schema is the most basic type of data warehouse schema. Its structure is similar to
that of a star, where the star's center may contain a single fact table and several associated dimension tables.
The star schema is efficient for data modeling tasks such as analyzing large data sets.

Snowflake Schema- The snowflake schema is an extension of the star schema. In terms of structure, it adds
more dimensions and has a snowflake-like appearance. Data is split into additional tables, and the dimension
tables are normalized.

28. What are the components that the Hive data model has to offer?
The major components of a Hive data model are:

Buckets

Tables

Partitions

You can go through many more detailed Hadoop Interview Questions here.

Learn more about Big Data Tools and Technologies with Innovative and Exciting Big Data Projects
Examples.

Data Engineer Interview Questions on Python


Python is crucial in implementing data engineering techniques. Pandas, NumPy, NLTK, SciPy, and other Python
libraries are ideal for various data engineering tasks such as faster data processing and other machine learning
activities. Data engineers primarily focus on data modeling and data processing architecture but also need a
fundamental understanding of algorithms and data structures. Take a look at some of the data engineer interview
questions based on various Python concepts, including Python libraries, algorithms, data structures, etc. These data
engineer interview questions cover Python libraries like Pandas, NumPy, and SciPy.


29. Differentiate between *args and **kwargs.


*args in function definitions is used to pass a variable number of positional arguments to a function when calling the function. The arguments are collected into a tuple, so the variable associated with * is iterable.

**kwargs in function definitions is used to pass a variable number of keyword arguments to a function when calling the function. The double star collects any number of keyword arguments into a dictionary.
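A short demonstration of both (the function and argument names are illustrative):

```python
def describe(*args, **kwargs):
    """*args collects extra positional arguments into a tuple;
    **kwargs collects extra keyword arguments into a dict."""
    return args, kwargs

positional, keyworded = describe(1, 2, 3, tool="spark", mode="batch")
print(positional)  # → (1, 2, 3)
print(keyworded)   # → {'tool': 'spark', 'mode': 'batch'}

# The * also unpacks an iterable at the call site:
nums = [4, 5]
print(describe(*nums)[0])  # → (4, 5)
```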

30. What is the difference between “is” and “==”?


Python's “is” operator checks whether two variables point to the same object. “==” is used to check whether the
values of two variables are the same.

E.g. consider the following code:

a = [1, 2, 3]

b = [1, 2, 3]

c = b

a == b evaluates to True since the values contained in list a and list b are the same, but a is b evaluates to False since a and b refer to two different objects. c is b evaluates to True since c and b point to the same object.

31. How is memory managed in Python?


Memory in Python is managed in the following way:

The objects and data structures initialized in a Python program are present in a private heap, and
programmers do not have permission to access the private heap space.

You can allocate heap space for Python objects using the Python memory manager. The core API of the
memory manager gives the programmer access to some of the tools for coding purposes.

Python has a built-in garbage collector that recycles unused memory and frees up memory for heap space.

32. What is a decorator?


A decorator is a tool in Python which allows programmers to wrap another function around a function or a class to
extend the behavior of the wrapped function without making any permanent modifications to it. Functions in Python
are first-class objects, meaning functions can be passed or used as arguments. A function works as the argument for
another function in a decorator, which you can call inside the wrapper function.
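A minimal sketch of the pattern described above (the log_calls decorator and add function are made-up examples):

```python
import functools

def log_calls(func):
    """Wrap func to count its calls without permanently modifying it."""
    @functools.wraps(func)          # preserve the wrapped function's metadata
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        return func(*args, **kwargs)
    wrapper.calls = 0
    return wrapper

@log_calls
def add(a, b):
    return a + b

add(1, 2)
add(3, 4)
print(add.calls)     # → 2
print(add.__name__)  # → add
```

functools.wraps is what keeps add.__name__ (and its docstring) intact; without it the decorated function would report its name as "wrapper".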

25. Are lookups faster with dictionaries or lists in Python?


The time complexity to look up a value in a list in Python is O(n) since the whole list iterates through to find the value.
Since a dictionary is a hash table, the time complexity to find the value associated with a key is O(1). Hence, a
lookup is generally faster with a dictionary, but a limitation is that dictionaries require unique keys to store the values.
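The difference can be measured with the standard timeit module (the collection size and repeat count here are arbitrary):

```python
import timeit

n = 100_000
data_list = list(range(n))
data_dict = dict.fromkeys(data_list)

# membership test for an element near the end of the collection
list_time = timeit.timeit(lambda: n - 1 in data_list, number=100)
dict_time = timeit.timeit(lambda: n - 1 in data_dict, number=100)
# dict_time is typically orders of magnitude smaller than list_time
```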

26. How can you return the binary of an integer?


The built-in bin() function returns the binary equivalent of an integer as a string. For example, bin(10) returns '0b1010'.

27. How can you remove duplicates from a list in Python?


A list can be converted into a set and then back into a list to remove the duplicates. Sets do not contain duplicate
data in Python.

E.g.

list1 = [5,9,4,8,5,3,7,3,9]

list2 = list(set(list1))

list2 will contain [5,9,4,8,3,7]

Note that converting to a set does not preserve the original order of the items in the list.

28. What is the difference between append and extend in Python?


The argument passed to append() is added as a single element to a list in Python. The list length increases by one,
and the time complexity for append is O(1).

The argument passed to extend() is iterated over, and each element of the argument adds to the list. The length of
the list increases by the number of elements in the argument passed to extend(). The time complexity for extend is
O(n), where n is the number of elements in the argument passed to extend.

Consider:

list1 = ["Python", "data", "engineering"]

list2 = ["projectpro", "interview", "questions"]

list1.append(list2)

list1 will now be: ["Python", "data", "engineering", ["projectpro", "interview", "questions"]]

The length of list1 is 4.

If extend is used instead of append on the original list1:

list1.extend(list2)

list1 will now be: ["Python", "data", "engineering", "projectpro", "interview", "questions"]

The length of list1, in this case, becomes 6.

29. When do you use pass, continue and break?


The break statement in Python terminates the innermost loop that contains it. If a break statement is present in a nested loop, it terminates only the loop in which it appears, and control passes to the first statement following that loop.

The continue statement forces control to stop the current iteration of the loop and execute the next iteration rather
than terminating the loop completely. If a continue statement is present within a loop, it leads to skipping the code
following it for that iteration, and the next iteration gets executed.

The pass statement in Python does nothing when executed. It is useful where a statement is syntactically required but no code needs to run, such as writing empty loops, control statements, functions, and classes.
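The three statements can be seen side by side in a short loop:

```python
collected = []
for i in range(10):
    if i == 3:
        continue   # skip this iteration, move on to the next
    if i == 6:
        break      # terminate the loop entirely
    collected.append(i)
# collected == [0, 1, 2, 4, 5]

def not_implemented_yet():
    pass           # a body is syntactically required; pass does nothing
```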

30. How can you check if a given string contains only letters and
numbers?
str.isalnum() can be used to check whether a string ‘str’ contains only letters and numbers.

31. Mention some advantages of using NumPy arrays over Python lists.
NumPy arrays take up less space in memory than lists.

NumPy arrays are faster than lists.

NumPy arrays have built-in functions optimized for various techniques such as linear algebra, vector, and
matrix operations.

Lists in Python do not allow element-wise operations, but NumPy arrays can perform element-wise operations.
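The element-wise difference can be seen directly:

```python
import numpy as np

nums = [1, 2, 3]
arr = np.array(nums)

doubled = arr * 2    # element-wise multiplication: array([2, 4, 6])
repeated = nums * 2  # a plain list is repeated instead: [1, 2, 3, 1, 2, 3]
```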

32. In Pandas, how can you create a dataframe from a list?


import pandas as pd

days = ['Sunday', 'Monday', 'Tuesday', 'Wednesday']

# Calling DataFrame constructor on list

df = pd.DataFrame(days)

df is the data frame created from the list 'days'.

df = pd.DataFrame(days, index=['1', '2', '3', '4'], columns=['Days'])

can be used to create the data frame with custom values for the index and columns.

33. In Pandas, how can you find the median value in a column “Age”
from a dataframe “employees”?
The median() function can be used to find the median value in a column. E.g.- employees["Age"].median()

34. In Pandas, how can you rename a column?


The rename() function can be used to rename columns of a data frame. By default, it returns a new data frame; pass inplace=True to modify the original.

To rename address_line_1 to 'region' and address_line_2 to 'city':

employees.rename(columns=dict(address_line_1='region', address_line_2='city'))

35. How can you identify missing values in a data frame?


The isnull() function helps identify missing values in a given data frame.

The syntax is DataFrame.isnull()

It returns a data frame of boolean values of the same size as the original data frame. The missing values in the original data frame are mapped to True, and non-missing values are mapped to False.
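A small sketch (the column names and data are hypothetical):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"name": ["Ann", "Ben"], "age": [34, np.nan]})

mask = df.isnull()                       # True where a value is missing
missing_per_column = df.isnull().sum()   # count of missing values per column
# mask["age"] is [False, True]; missing_per_column["age"] is 1
```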

36. What is SciPy?


SciPy is an open-source Python library that is useful for scientific computations. SciPy is short for Scientific Python
and is used to solve complex mathematical and scientific problems. SciPy is built on top of NumPy and provides
effective, user-friendly functions for numerical optimization. The SciPy library comes equipped with functions to
support integration, ordinary differential equation solvers, special functions, and support for several other technical
computing functions.

37. Given a 5x5 matrix in NumPy, how will you invert the matrix?
The function numpy.linalg.inv() can help you inverse a matrix. It takes a matrix as the input and returns its inverse.
You can calculate the inverse of a matrix M as:

if det(M) != 0:

    M^-1 = adjoint(M) / det(M)

else:

    the inverse does not exist
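The same check can be sketched with NumPy (a 2x2 matrix is used here for brevity; the call is identical for a 5x5 matrix):

```python
import numpy as np

M = np.array([[4.0, 7.0],
              [2.0, 6.0]])

if np.linalg.det(M) != 0:
    M_inv = np.linalg.inv(M)
    # multiplying a matrix by its inverse yields the identity matrix
    assert np.allclose(M @ M_inv, np.eye(2))
else:
    print("Inverse does not exist")
```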

38. What is an ndarray in NumPy?


In NumPy, an array is a grid of elements that are all of the same type and are indexed by a tuple of non-negative integers. To create an array in NumPy, you create an n-dimensional array object. An ndarray is the n-dimensional array object defined in NumPy to store a collection of elements of the same data type.

39. Using NumPy, create a 2-D array of random integers between 0 and
500 with 4 rows and 7 columns.
from numpy import random

x = random.randint(500, size=(4, 7))

40. Find all the indices in an array of NumPy where the value is greater
than 5.
import numpy as np

array = np.array([5,9,6,3,2,1,9])

To find the indices of values greater than 5

print(np.where(array>5))

Gives the output (array([1, 2, 6]),)


Data Engineer Interview Questions on Excel


Microsoft Excel is one of the most popular data engineering tools in the big data industry. In contrast to BI tools,
which ingest processed data supplied by the data engineering pipeline, Excel gives data engineers flexibility and
control over data entry. Here are some data engineer interview questions on Microsoft Excel and its features.


41. What are Freeze Panes in MS Excel?


Freeze panes are used in MS Excel to lock a particular row or column. The rows or columns you lock will be visible
on the screen even when scrolling the sheet horizontally or vertically.

To freeze panes on Excel:

1. First, select the cell to the right of the columns and below the rows to be kept visible.

2. Select View > Freeze Panes > Freeze Panes.

42. What is meant by a ribbon?


In Excel, the ribbon is the strip across the top of the window that contains the toolbars and menu items available in Excel. The ribbon has multiple tabs, each with its own command set. You can switch the ribbon between shown and hidden using CTRL+F1.

43. How can you prevent someone from copying the data in your
spreadsheet?
In Excel, you can protect a worksheet so that no data can be copied from its cells. To copy and paste data from a protected worksheet, you must remove the sheet protection, unlock all cells, and then lock only those cells that should not be changed or removed. To protect a worksheet, go to Review -> Protect Sheet and set a password. With a unique password, you can prevent the sheet's contents from being copied by others.

44. How can you find the sum of columns in Excel?


The SUM function may be useful for finding the sum of columns in an Excel spreadsheet.

=SUM(A5:F5) returns the sum of the values in cells A5 through F5, i.e., columns A-F of the 5th row.

45. Explain macros in Excel.


A macro in Excel is an action or a set of actions that can be recorded and run as often as required. Macros can be given names and save time on frequently repeated tasks. Excel stores macros as VBA code, which you can view using the VBA editor. You can assign macros to objects, including shapes, graphics, or controls.

46. What is the order of operations followed for evaluating expressions in Excel?
Excel follows the same order of operations as in standard mathematics, which is indicated by “PEMDAS” where:

P - Parentheses

E - Exponent

M - Multiplication

D - Division

A - Addition

S - Subtraction

47. Explain pivot tables in Excel.


A pivot table is a tool consisting of a table of grouped values where individual items of a larger, more extensive table
aggregate within one or more discrete categories. It is useful for quick summarization of large unstructured data. It
can automatically perform sort, total, count, or average of the data in the spreadsheet and display the results in
another spreadsheet. Pivot tables save time and allow linking external data sources to Excel.

48. Mention some differences between SUBSTITUTE and REPLACE functions in Excel.
The SUBSTITUTE function in Excel is useful to find a match for a particular text and replace it. The REPLACE
function replaces the text, which you can identify using its position.

SUBSTITUTE syntax

=SUBSTITUTE (text, text_to_be_replaced, text_to_replace_old_text_with, [instance_number])

Where

text refers to the text in which you can perform the replacements

instance_number refers to the number of times you need to replace a match.

E.g. consider a cell A5 which contains “Bond007”

=SUBSTITUTE(A5, "0", "1", 1) gives the result "Bond107"

=SUBSTITUTE(A5, "0", "1", 2) gives the result "Bond017"

=SUBSTITUTE(A5, "0", "1") gives the result "Bond117"

REPLACE syntax

=REPLACE(old_text, start_num, num_chars, new_text)

Where start_num - starting position within old_text at which replacement begins

num_chars - number of characters to be replaced with new_text

E.g. consider a cell A5 which contains “Bond007”

=REPLACE(A5, 5, 1, “99”) gives the result “Bond9907”

49. What is the use of the IF function in Excel?


The IF function in Excel performs the logic test and is used to check whether a given condition is true or false, then
perform further operations based on the result.

The syntax is:

=IF (test condition, value if true, value if false)

50. What filter will you use if you want more than two conditions or if you
want to analyze the list using the database function?
You can use the Advanced Criteria Filter to analyze a list or in cases where you need to test more than two
conditions.

51. What does it mean if there is a red triangle at the top right-hand
corner of a cell?
A red triangle at the top right-hand corner of a cell indicates a comment associated with that particular cell. You can
view the comment by hovering the cursor over it.

Data Engineer Interview Questions on SQL


You will spend most of your career using SQL if you are a Data Engineer working in an organization. Building a
strong foundation in SQL is crucial since you may easily save time and effort if you can leverage its various features
effectively. Also, acquire a solid knowledge of databases such as the NoSQL or Oracle database. Questions
addressing data modeling and database architecture test your understanding of entity-relationship modeling,
normalization and denormalization, dimensional modeling, and relevant ideas. Below are a few data engineer
interview questions on SQL concepts, queries on data storage, data retrieval, and a lot more.


52. What is meant by Aggregate Functions in SQL?


In SQL, aggregate functions group values from multiple rows into a single value with a more significant meaning. Aggregate functions in SQL include COUNT(), MIN(), MAX(), SUM(), and AVG().

53. How would you find duplicates using an SQL query?


To find duplicates in a single column:

SELECT column_name, COUNT(column_name)

FROM table_name

GROUP BY column_name

HAVING COUNT(column_name)>1

Will display all the records in a column which have the same value.

To find duplicates in multiple columns of a table:

SELECT column1_name, column2_name, COUNT(*)

FROM table_name

GROUP BY column1_name, column2_name

HAVING COUNT(*)>1

Will display all the records with the same values in column1 and column2.
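The single-column query can be demonstrated with Python's built-in sqlite3 module (the table and sample data are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (email TEXT)")
conn.executemany("INSERT INTO users VALUES (?)",
                 [("a@x.com",), ("b@x.com",), ("a@x.com",)])

rows = conn.execute("""
    SELECT email, COUNT(email)
    FROM users
    GROUP BY email
    HAVING COUNT(email) > 1
""").fetchall()
# rows == [('a@x.com', 2)]  -- only the duplicated email is returned
```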

54. What is a primary key in SQL?


In SQL, a primary key is a field in a table that uniquely identifies each row in that table. Primary key values must be unique, and a primary key value cannot be NULL. A table can have only one primary key, which can consist of a single field or multiple fields. When multiple fields are used as the primary key, they are collectively known as a composite key.

55. What is meant by the UNIQUE constraint in SQL?


The UNIQUE constraint is used on columns in SQL to ensure that all the values in a particular column are different. Both the UNIQUE constraint and the PRIMARY KEY ensure that a column holds only unique values. However, there can be only one PRIMARY KEY per table, while the UNIQUE constraint can be specified for multiple columns. You can add or drop UNIQUE constraints after the table is created.

56. What are the different kinds of joins in SQL?


A JOIN clause combines rows across two or more tables with a related column. The different kinds of joins
supported in SQL are:

(INNER) JOIN: returns the records that have matching values in both tables.

LEFT (OUTER) JOIN: returns all records from the left table with their corresponding matching records from the
right table.

RIGHT (OUTER) JOIN: returns all records from the right table and their corresponding matching records from
the left table.

FULL (OUTER) JOIN: returns all records with a matching record in either the left or right table.

57. What do you mean by index and indexing in SQL?


In SQL, an index is a special lookup table that the database search engine uses to speed up data retrieval. Indexes speed up SELECT queries and WHERE clauses but slow down data modification statements such as UPDATE and INSERT. Indexes can be created or dropped without affecting the data. Indexing is a technique for optimizing database performance by reducing the number of disk accesses required during query execution.

58. How is a clustered index different from a non-clustered index in SQL?
Clustered indexes in SQL modify how you store records in the database based on the indexed column. They are
useful for the speedy retrieval of data from the database. Non-clustered indexes create a different entity within the
table that references the original table. They are relatively slower than clustered indexes, and SQL allows only a
single clustered index but multiple non-clustered indexes.

59. Differentiate between IN and BETWEEN operators.


The BETWEEN operator in SQL tests if a particular expression lies between a range of values. The values can be in
the form of text, dates, or numbers. You can use the BETWEEN operator with SELECT, INSERT, UPDATE, and
DELETE statements. In a query, the BETWEEN condition helps to return all values that lie within the range. The
range is inclusive. The syntax is of BETWEEN is as follows:

SELECT column_name(s)

FROM table_name

WHERE column_name BETWEEN value1 AND value2;

The IN operator tests whether an expression matches the values specified in a list of values. It helps to eliminate the
need of using multiple OR conditions. NOT IN operator may exclude certain rows from the query return. IN operator
may also be used with SELECT, INSERT, UPDATE, and DELETE statements. The syntax is:

SELECT column_name(s)

FROM table_name

WHERE column_name IN (list_of_values);

60. What is a foreign key in SQL?


A foreign key is a field or a collection of fields in one table that can refer to the primary key in another table. The table
which contains the foreign key is the child table, and the table containing the primary key is the parent table or the
referenced table. The purpose of the foreign key constraint is to prevent actions that would destroy links between
tables.

61. What is a cursor?


A cursor is a temporary work area created in memory when SQL statements are executed. The server allocates it when the user performs DML operations on a table, and it holds the rows returned by a query. SQL provides two types of cursors:

Implicit Cursors: they are allocated by the SQL server when users perform DML operations.

Explicit Cursors: Users create explicit cursors based on requirements. Explicit cursors allow you to fetch table
data in a row-by-row method.

62. What is an alias in SQL?


An alias enables you to give a table or a particular column in a table a temporary name to make the table or column
name more readable for that specific query. Aliases only exist for the duration of the query.

The syntax for creating a column alias

SELECT column_name AS alias_name

FROM table_name;

The syntax for creating a table alias

SELECT column_name(s)

FROM table_name AS alias_name;

63. What is meant by normalization in SQL?


Normalization is a method used to minimize redundancy, inconsistency, and dependency in a database by
organizing the fields and tables. It involves adding, deleting, or modifying fields that can go into a single table.
Normalization allows you to break the tables into smaller partitions and link these partitions through different
relationships to avoid redundancy.

Some rules followed in database normalization, known as normal forms, are:

1NF - first normal form

2NF - second normal form

3NF - third normal form

BCNF - Boyce-Codd normal form

64. What is a stored procedure?


Stored procedures are used in SQL to run a particular task several times. You can save or reuse stored procedures
when required.

The syntax for creating a stored procedure:

CREATE PROCEDURE procedure_name *params*

AS

sql_statement

GO;

Syntax for executing a stored procedure

EXEC procedure_name *params*;

A stored procedure can take parameters at the time of execution so that the stored procedure can execute based on
the values passed as parameters.


65. Write a query to select all records that contain "ind" in their name from a table named places.
SELECT *

FROM places

WHERE name LIKE '%ind%'

66. Which SQL query can be used to delete a table from the database
but keep its structure intact?
The TRUNCATE command helps delete all the rows from a table but keeps its structure intact. The column, indexes,
and constraints remain intact when using the TRUNCATE statement.

67. Write an SQL query to find the second highest sales from an "
Apparels " table.
select min(sales) from

(select distinct sales from Apparels order by sales desc)

where rownum < 3;

This is Oracle syntax: ROWNUM limits the ordered subquery to its top two rows, so MIN returns the second highest value.
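ROWNUM is Oracle-specific. The same result can be sketched in SQLite from Python, where LIMIT/OFFSET plays the equivalent role (the table and sample data are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Apparels (sales INTEGER)")
conn.executemany("INSERT INTO Apparels VALUES (?)",
                 [(500,), (900,), (900,), (700,)])

# skip the highest distinct value, take the next one
second = conn.execute("""
    SELECT DISTINCT sales FROM Apparels
    ORDER BY sales DESC
    LIMIT 1 OFFSET 1
""").fetchone()[0]
# second == 700
```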

68. Is a blank space or a zero value treated the same way as the
operator NULL?
NULL in SQL is not the same as zero or a blank space. NULL represents the absence of any value and is described as unavailable, unknown, unassigned, or inapplicable. Zero is a number, and a blank space is treated as a character. You can compare a blank space or a zero to another blank space or zero, but one NULL cannot be compared with another NULL.

69. What is the default ordering of the ORDER BY clause and how can this be changed?

The ORDER BY clause is useful for sorting the query result in ascending or descending order. By default, the query
sorts in ascending order. The following statement can change the order:

SELECT expressions FROM table_name

WHERE conditions

ORDER BY expression DESC;

70. Will the following query return an output?

SELECT employee_id, AVG (sales)


FROM Employees
WHERE AVG(sales) > 70000
GROUP BY month;
No, the above query will not return an output since you cannot use the WHERE clause to restrict the groups. To
generate output in this query, you should use the HAVING clause.
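A working variant using HAVING can be verified with Python's sqlite3 module (the sample data is hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employees (month TEXT, sales INTEGER)")
conn.executemany("INSERT INTO Employees VALUES (?, ?)",
                 [("Jan", 60000), ("Jan", 90000), ("Feb", 50000)])

# HAVING filters groups after aggregation, which WHERE cannot do
rows = conn.execute("""
    SELECT month, AVG(sales)
    FROM Employees
    GROUP BY month
    HAVING AVG(sales) > 70000
""").fetchall()
# rows == [('Jan', 75000.0)]
```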

71. What is meant by SQL injection?


SQL injection (also known as an SQLi attack) is a vulnerability that allows attackers to interfere with back-end database operations and access, retrieve, or destroy sensitive data present in databases. It involves inserting malicious SQL code into a database entry field; when that code is executed, the attacker gains unintended access to the database.
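The standard defense is parameterized queries. A sketch with sqlite3 (the table and data are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (user TEXT, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100)")

user_input = "alice' OR '1'='1"  # a typical injection payload

# parameter binding treats the input as a literal value, never as SQL
rows = conn.execute(
    "SELECT * FROM accounts WHERE user = ?", (user_input,)
).fetchall()
# rows == []  -- the payload matches no real user name
```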

72. What statement does the system execute whenever a database is modified?
Whenever a database is modified, the system executes a trigger command.

73. Write an SQL query to find all students’ names from a table named
‘Students’ that end with ‘T’.
SELECT * FROM Students WHERE stud_name LIKE '%T';

74. Mention some differences between the DELETE and TRUNCATE statements in SQL.
The DELETE command deletes one specific row or multiple rows corresponding to a certain condition, while the TRUNCATE command deletes all the rows of a table.

DELETE is a Data Manipulation Language (DML) command, while TRUNCATE is a Data Definition Language (DDL) command.

With the DELETE statement, rows are removed one at a time, and an entry is recorded in the transaction log for each deleted row. Truncating a table removes the data by deallocating the data pages that store the table data, and only the page deallocations are recorded in the transaction log.

The DELETE command is slower than the TRUNCATE command.

Using the DELETE statement requires DELETE permission on the table, while using the TRUNCATE command requires ALTER permission on the table.

75. What is a trigger in SQL?


In SQL, a trigger refers to a set of statements in a system catalog that runs whenever DML (Data Manipulation
Language) commands run on a system. It is a special stored procedure that gets called automatically in response to
an event. Triggers allow the execution of a batch of code whenever an insert, update or delete command is executed
for a specific table. You can create a trigger by using the CREATE TRIGGER statement. The syntax is:

CREATE TRIGGER trigger_name

(AFTER|BEFORE) (INSERT|UPDATE|DELETE)

ON table_name FOR EACH ROW

BEGIN

Variable declarations

Trigger code

END;
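SQLite supports the same pattern, so the syntax can be tried from Python (the table names and data are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER, amount INTEGER);
CREATE TABLE audit (order_id INTEGER, note TEXT);

-- fires automatically after every INSERT on orders
CREATE TRIGGER log_insert AFTER INSERT ON orders
FOR EACH ROW
BEGIN
    INSERT INTO audit VALUES (NEW.id, 'inserted');
END;
""")

conn.execute("INSERT INTO orders VALUES (1, 250)")
audit_rows = conn.execute("SELECT * FROM audit").fetchall()
# audit_rows == [(1, 'inserted')]
```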

Data Engineer Interview Questions on Azure


Most businesses are switching to cloud infrastructure these days. Organizations employ a variety of providers
including AWS, Google Cloud, and Azure for their BI and Machine Learning applications. Microsoft Azure allows
data engineers to build and deploy applications using various solutions. Check out these common data engineer
interview questions on various Microsoft Azure concepts, tools, and frameworks.


76. Explain the features of Azure Storage Explorer.


It's a robust stand-alone application that lets you manage Azure Storage from any platform, including
Windows, Mac OS, and Linux.

An easy-to-use interface gives you access to many Azure data stores, including ADLS Gen2, Cosmos DB,
Blobs, Queues, Tables, etc.

One of the most significant aspects of Azure Storage Explorer is that it enables users to work despite being
disconnected from the Azure cloud service using local emulators.

77. What are the various types of storage available in Azure?


In Microsoft Azure, there are five storage types classified into two categories.

The first group comprises Queue Storage, Table Storage, and Blob Storage. It is built with data storage,
scalability, and connectivity and is accessible through a REST API.

The second group comprises File Storage and Disk Storage, which boosts the functionalities of the Microsoft
Azure Virtual Machine environment and is only accessible through Virtual Machines.

Queue Storage enables you to create versatile applications that comprise independent components
depending on asynchronous message queuing. Azure Queue storage stores massive volumes of messages
accessible by authenticated HTTP or HTTPS queries anywhere.

Table Storage in Microsoft Azure holds structured NoSQL data. The storage is highly extensible while also
being efficient in storing data. However, if you access temporary files frequently, it becomes more expensive.
This storage can be helpful to those who find Microsoft Azure SQL too costly and don't require the SQL
structure and architecture.

Blob Storage supports unstructured data/huge data files such as text documents, images, audio, video files,
etc. In Microsoft Azure, you can store blobs in three ways: Block Blobs, Append Blobs, and Page Blobs.

File Storage serves the needs of the Azure VM environment. You can use it to store huge data files
accessible from multiple Virtual Machines. File Storage allows users to share any data file via the SMB (Server
Message Block) protocol.

Disk Storage serves as a storage option for Azure virtual machines. It enables you to construct virtual
machine disks. Only one virtual machine can access a disk in Disk Storage.

78. What data security solutions does Azure SQL DB provide?


In Azure SQL DB, there are several data security options:

Azure SQL Firewall Rules: There are two levels of security available in Azure.

The first are server-level firewall rules, which are present in the SQL Master database and specify which Azure
database servers are accessible.

The second type of firewall rule is database-level firewall rules, which monitor database access.

Azure SQL Database Auditing: The SQL Database service in Azure offers auditing features. It allows you to
define the audit policy at the database server or database level.

Azure SQL Transparent Data Encryption: TDE encrypts and decrypts databases and performs backups and
transactions on log files in real-time.

Azure SQL Always Encrypted: This feature safeguards sensitive data in the Azure SQL database, such as
credit card details.

79. What do you understand by PolyBase?


Polybase is a system that uses the Transact-SQL language to access external data stored in Azure Blob storage,
Hadoop, or the Azure Data Lake repository. This is the most efficient way to load data into an Azure Synapse SQL
Pool. Polybase facilitates bidirectional data movement between Synapse SQL Pool and external resources, resulting
in faster load performance.

PolyBase allows you to access data in Hadoop, Azure Blob Storage, or Azure Data Lake Store from Azure
SQL Database or Azure Synapse Analytics.

PolyBase uses relatively easy T-SQL queries to import data from Hadoop, Azure Blob Storage, or Azure Data
Lake Store without any third-party ETL tool.

PolyBase allows you to export and retain data to external data repositories.

80. What is the best way to capture streaming data in Azure?


Azure has a separate analytics service called Azure Stream Analytics, which supports the Stream Analytics
Query Language, a primary SQL-based language.

It enables you to extend the query language's capabilities by introducing new Machine Learning functions.

Azure Stream Analytics can analyze a massive volume of structured and unstructured data at around a million
events per second and provide relatively low latency outputs.

81. Discuss the different windowing options available in Azure Stream Analytics.
Stream Analytics has built-in support for windowing functions, allowing developers to quickly create complicated
stream processing jobs. Five types of temporal windows are available: Tumbling, Hopping, Sliding, Session, and
Snapshot.

Tumbling window functions take a data stream and divide it into discrete temporal segments, then apply a
function to each. Tumbling windows often recur, do not overlap, and one event cannot correspond to more
than one tumbling window.

Hopping window functions progress in time by a set period. Think of them as Tumbling windows that can
overlap and emit more frequently than the window size allows. Events can appear in multiple Hopping window
result sets. Set the hop size to the same as the window size to make a Hopping window look like a Tumbling
window.

Unlike Tumbling or Hopping windows, Sliding windows only emit events when the window's content changes.
As a result, each window contains at least one event, and events, like hopping windows, can belong to many
sliding windows.

Session window functions combine events that coincide and filter out periods when no data is available. The
three primary variables in Session windows are timeout, maximum duration, and partitioning key.

Snapshot windows bring together events having the same timestamp. You can implement a snapshot
window by adding System.Timestamp() to the GROUP BY clause, unlike most windowing function types that
involve a specialized window function (such as SessionWindow()).

82. Discuss the different consistency models in Cosmos DB.


There are five distinct consistency models/levels in Azure Cosmos DB, ordered from strongest to weakest:

Strong- It ensures linearizability, i.e., serving multiple requests simultaneously. The reads will always return
the item's most recent committed version. Uncommitted or incomplete writes are never visible to the client, and
users will always be able to read the most recent commit.

Bounded staleness- It guarantees the reads to follow the consistent prefix guarantee. Reads may lag writes
by "K" versions (that is, "updates") of an item or "T" time interval, whichever comes first.

Session- It guarantees reads to honor the consistent prefix, monotonic reads and writes, read-your-writes, and
write-follows-reads guarantees in a single client session. This implies that only one "writer" session or several
authors share the same session token.

Consistent prefix- It returns updates with a consistent prefix throughout all updates and has no gaps. Reads
will never detect out-of-order writes if the prefix consistency level is constant.

Eventual- There is no guarantee for ordering of reads in eventual consistency. The replicas gradually
converge in the lack of further writes.

83. What are the various types of Queues that Azure offers?
Storage queues and Service Bus queues are the two queue techniques that Azure offers.

Storage queues- Azure Storage system includes storage queues. You can save a vast quantity of messages
on them. Authorized HTTP or HTTPS calls allow you to access messages from anywhere. A queue can hold
millions of messages up to the storage account's overall capacity limit. Queues can build a backlog of work for
asynchronous processing.

Service Bus queues are present in the Azure messaging infrastructure, including queuing, publish/subscribe,
and more advanced integration patterns. They mainly connect applications or parts of applications that
encompass different communication protocols, data contracts, trust domains, or network settings.

84. What are the different data redundancy options in Azure Storage?
When it comes to data replication in the primary region, Azure Storage provides two choices:

Locally redundant storage (LRS) replicates your data three times synchronously within a single physical location
in the primary region. Although LRS is the cheapest replication option, it is unsuitable for applications that
require high availability or durability.

Zone-redundant storage (ZRS) synchronizes data across three Azure availability zones in the primary region.
Microsoft advises adopting ZRS in the primary region and replicating it in a secondary region for high-
availability applications.

Azure Storage provides two options for moving your data to a secondary area:

Geo-redundant storage (GRS) synchronizes three copies of your data within a single physical location using
LRS in the primary area. It moves your data to a single physical place in the secondary region asynchronously.

Geo-zone-redundant storage (GZRS) uses ZRS to synchronize data across three Azure availability zones in
the primary region. It then asynchronously moves your data to a single physical place in the secondary region.


Data Engineer Interview Questions on AWS


Data engineers may leverage cloud-based services like AWS to help enterprises overcome some of the issues they
face as they deal with large data volumes. Practice these data engineering interview questions below to impress
your hiring manager with your data engineering skills in cloud computing.


85. What logging capabilities does AWS Security offer?


AWS CloudTrail allows security analysis, resource change tracking, and compliance auditing of an AWS
environment by providing a history of AWS API calls for an account. CloudTrail sends log files to a chosen
Amazon Simple Storage Service (Amazon S3) bucket, with optional log file integrity validation.

Amazon S3 Access Logs record individual requests made to Amazon S3 buckets and are useful for monitoring
traffic patterns, troubleshooting, and security and access auditing. They can also help a business gain a
better understanding of its client base, establish lifecycle policies, define access policies, and determine
Amazon S3 costs.

Amazon VPC Flow Logs record IP traffic to and from Amazon Virtual Private Cloud (Amazon VPC) network
interfaces at the VPC, subnet, or individual Elastic Network Interface level. You can publish flow log data to
Amazon CloudWatch Logs or Amazon S3 for enhanced network traffic analytics and visualization.

86. How can Amazon Route 53 ensure high availability while maintaining
low latency?
Route 53 is built on AWS's highly available and reliable infrastructure. The widely distributed design of its DNS
servers helps ensure a consistent ability to route end users to your application, even in the face of internet or
network-related issues. Route 53 delivers the level of dependability that critical systems demand. It uses a global
anycast network of DNS servers that automatically answers queries from the optimal location given current network
conditions. As a result, your end users experience low query latency.

87. What is Amazon Elastic Transcoder, and how does it work?


Amazon Elastic Transcoder is a cloud-based media transcoding service.

It's intended to be a highly flexible, simple-to-use, and cost-effective solution for developers and organizations
to transform (or "transcode") media files from their original format into versions suitable for smartphones,
tablets, and computers.

Amazon Elastic Transcoder also includes transcoding presets for standard output formats, so you don't have
to guess which settings will work best on particular devices.

88. Discuss the different types of EC2 instances available.


On-Demand Instances- You pay for computing capacity by the hour or second with On-Demand instances,
depending on the instances you run. There are no long-term obligations or upfront payments required. You can
scale up or down your compute capacity based on your application's needs, and you only pay the per-hour
prices for the instance you utilize.

Reserved Instances- When deployed in a specific Availability Zone, Amazon EC2 Reserved Instances (RI)
offer a significant reduction (up to 72%) over On-Demand pricing and a capacity reservation.

Spot Instances- You can request additional Amazon EC2 computing resources for up to 90% off the On-
Demand price using Amazon EC2 Spot instances.

89. Mention the AWS consistency models for modern DBs.


A database consistency model specifies how and when a successful write or change reflects in a future read of the
same data.

The eventual consistency model is ideal for systems that can tolerate reads briefly lagging behind writes. It's
Amazon DynamoDB's default consistency model and maximizes read throughput. However, the results of a
recently completed write may not be reflected immediately in an eventually consistent read.

In Amazon DynamoDB, a strongly consistent read returns a result that reflects all writes that received a
successful response before the read. You can set an additional parameter in the request to obtain a strongly
consistent read. Processing a strongly consistent read consumes more throughput than an eventually consistent
read.

90. What do you understand about Amazon Virtual Private Cloud (VPC)?
The Amazon Virtual Private Cloud (Amazon VPC) enables you to deploy AWS resources into a custom virtual
network.

This virtual network is like a typical network run in your private data center, but with the added benefit of AWS's
scalable infrastructure.

Amazon VPC allows you to create a virtual network in the cloud without VPNs, hardware, or real data centers.

You can also use Amazon VPC's advanced security features to give more selective access to and from your
virtual network's Amazon EC2 instances.

91. Outline some security products and features available in a virtual
private cloud (VPC).
Flow Logs- Analyze your VPC flow logs in Amazon S3 or Amazon CloudWatch to obtain operational visibility
into your network dependencies and traffic patterns, discover abnormalities, prevent data leakage, etc.

Network Access Analyzer- The Network Access Analyzer tool helps you verify that your AWS network meets your
network security and compliance requirements, by letting you define those requirements and identify unintended
network access.

Traffic Mirroring- You can directly access the network packets running through your VPC via Traffic Mirroring.
This functionality enables you to route network traffic from Amazon EC2 instances' elastic network interface to
security and monitoring equipment for packet inspection.

92. What do you mean by RTO and RPO in AWS?


Recovery time objective (RTO): The maximum acceptable delay between the interruption of service and its
restoration. This determines the longest window of service downtime you can tolerate.

Recovery point objective (RPO): The maximum acceptable amount of time since the last data recovery point. This
determines how much data loss is acceptable.
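As a concrete illustration with hypothetical timestamps and targets, checking whether an incident stayed within the RTO and RPO is a simple time comparison:

```python
from datetime import datetime, timedelta

# Hypothetical targets: restore service within 4 hours (RTO),
# lose at most 1 hour of data (RPO).
RTO = timedelta(hours=4)
RPO = timedelta(hours=1)

last_backup = datetime(2022, 6, 30, 11, 30)      # last recovery point
outage_start = datetime(2022, 6, 30, 12, 0)      # service goes down
service_restored = datetime(2022, 6, 30, 15, 0)  # service comes back

downtime = service_restored - outage_start      # compare against RTO
data_loss_window = outage_start - last_backup   # compare against RPO

print(downtime <= RTO)           # True: restored within 4 hours
print(data_loss_window <= RPO)   # True: lost at most 30 minutes of data
```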

93. What are the benefits of using AWS Identity and Access
Management (IAM)?
AWS Identity and Access Management (IAM) supports fine-grained access management throughout the AWS
infrastructure.

IAM Access Analyzer allows you to control who has access to which services and resources and under what
circumstances. IAM policies let you control rights for your employees and systems, ensuring they have the
least amount of access.

It also supports federated access, enabling you to grant systems and users access to resources without
creating an IAM user for each identity.

94. What are the various types of load balancers available in AWS?
1. An Application Load Balancer makes routing decisions at the application layer (HTTP/HTTPS), supports
path-based routing, and can route requests to one or more ports on each container instance in your cluster.
Dynamic host port mapping is available with Application Load Balancers.

2. The transport layer (TCP/SSL) is where a Network Load Balancer decides the routing path. It processes
millions of requests per second, and dynamic host port mapping is available with Network Load Balancers.

3. A Gateway Load Balancer combines a transparent network gateway with load balancing, distributing traffic
while scaling your virtual appliances to meet demand.

Data Engineer Interview Questions on Data Lake


Data lakes are the ideal way to store a company's historical data because they can hold large volumes of data at a
low cost. A data lake also lets users move back and forth between data engineering and use cases like interactive
analytics and machine learning. Azure Data Lake, a cloud platform, supports big data analytics by providing
unlimited storage for structured, semi-structured, and unstructured data. Take a look at some important data
engineering interview questions on Azure Data Lake.

95. What do you understand by Azure Data Lake Analytics?


Azure Data Lake Analytics is an on-demand analytics job service that makes big data easier to work with.

Instead of deploying, configuring, and tuning hardware, you write queries to transform your data and extract
essential insights.

The service can instantly handle jobs of any complexity by scaling to the amount of processing power you
require.

Also, it is cost-effective because you only pay for a job while it is running.

96. Compare Azure Data Lake Gen1 vs. Azure Data Lake Gen2.
Azure Data Lake Gen1:

It is file system storage: data is organized in a hierarchical file system and distributed in blocks.

The hot/cold storage tiers are not available.

It does not support storage redundancy.

Azure Data Lake Gen2:

It combines a hierarchical file system (for efficiency and reliability) with flexible object storage.

The hot/cold storage tiers are available.

It supports storage redundancy.

97. What do you mean by U-SQL?


Azure Data Lake Analytics uses U-SQL as a big data query language and execution infrastructure.

U-SQL scales out custom code (.NET/C#/Python) from a Gigabyte to a Petabyte scale using typical SQL
techniques and language.

Big data processing techniques like schema-on-read, custom processors, and reducers are available in U-
SQL.

The language allows you to query and integrate structured and unstructured data from various data sources,
including Azure Data Lake Storage, Azure Blob Storage, Azure SQL DB, Azure SQL Data Warehouse, and
SQL Server instances on Azure VMs.

98. Outline some of the features of Azure Data Lake Analytics.


Azure Data Lake offers high throughput for raw or other data types for analytics, real-time reporting, and
monitoring.

It's highly flexible and auto-scalable, with payment handling flexibility.

U-SQL can process structured and unstructured data using SQL syntax combined with custom code and
user-defined functions.

It offers a highly accessible service for exploring data for analytics, reporting, monitoring, and Business
Intelligence using various tools.

99. What are the different blob storage access tiers in Azure?
Hot tier - An online tier that stores regularly viewed or updated data. The Hot tier has the most expensive
storage but the cheapest access.

Cool tier - An online tier designed for storing data that is rarely accessed or modified. The Cool tier offers
lower storage costs but higher access charges than the Hot tier.

Archive tier - An offline tier designed for storing data accessed rarely and with variable latency requirements.
You should keep the Archive tier's data for at least 180 days.


Data Engineer Technical Interview Questions
Here are some data engineering interview questions that hiring managers commonly use to test your technical skills
and knowledge.

100. What do you mean by Blocks and Block Scanner?


A block is the smallest unit of data that HDFS stores and treats as a single entity. When Hadoop encounters a large
data file, it automatically splits it into smaller pieces called blocks.

A block scanner runs on each DataNode and periodically verifies the blocks stored there: it recomputes each block's
checksum and compares it against the checksum recorded when the block was written, so corrupted blocks can be
detected.
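Conceptually, the scanner's corruption check works like the following Python sketch (this is not HDFS code; the block IDs and data are made up for illustration):

```python
import hashlib

def checksum(data: bytes) -> str:
    """Checksum of a block's contents (HDFS uses CRC32; MD5 here for brevity)."""
    return hashlib.md5(data).hexdigest()

# Blocks as stored on a hypothetical DataNode, each with the checksum
# recorded when the block was originally written.
blocks = {
    "blk_001": {"data": b"hello", "stored_checksum": checksum(b"hello")},
    "blk_002": {"data": b"wor1d", "stored_checksum": checksum(b"world")},  # bit rot
}

def scan(blocks):
    """Report blocks whose recomputed checksum no longer matches the stored one."""
    return [bid for bid, b in blocks.items()
            if checksum(b["data"]) != b["stored_checksum"]]

print(scan(blocks))  # ['blk_002']
```

In real HDFS, the DataNode would then report `blk_002` to the NameNode for re-replication.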

101. How does a block scanner deal with a corrupted data block?
The DataNode notifies the NameNode about a particular file when the block scanner detects a corrupted data block.
After that, NameNode processes the data file by replicating it using the original, corrupted file. The corrupted data
block is not deleted if there is a match between the replicas made and the replication block.

102. List some of the XML configuration files present in Hadoop.


Some of the XML configuration files present in Hadoop are

hdfs-site.xml (one of the most important XML configuration files)

core-site.xml

yarn-site.xml

mapred-site.xml

103. How would you check the validity of data migration between
databases?
A data engineer's primary concerns should be maintaining the accuracy of the data and preventing data loss. The
purpose of this question is to help the hiring managers understand how you would validate data.

You must be able to explain the suitable validation types in various situations. For instance, you might explain that
validation can range from simple row counts and checksums run during the migration to a full record-by-record
comparison after the migration completes.
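As an illustrative sketch (hypothetical table and data, using Python's built-in sqlite3), a basic validation might compare row counts and content fingerprints between the source and target databases:

```python
import sqlite3

def table_fingerprint(conn, table):
    """Row count plus an order-independent hash of all rows in the table."""
    rows = conn.execute(f"SELECT * FROM {table}").fetchall()
    return len(rows), hash(tuple(sorted(rows)))

# Simulate a source database and a migrated copy.
src = sqlite3.connect(":memory:")
dst = sqlite3.connect(":memory:")
for conn in (src, dst):
    conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
    conn.executemany("INSERT INTO users VALUES (?, ?)",
                     [(1, "Ada"), (2, "Grace")])

# Validation: counts and content fingerprints must match after migration.
print(table_fingerprint(src, "users") == table_fingerprint(dst, "users"))  # True
```

Real migrations would add per-column checksums and sampling, but the principle is the same.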

104. How does a SQL query handle duplicate data points?


In SQL, the most common way to remove duplicate rows from a result set is the DISTINCT keyword (some
databases, such as Oracle, also accept UNIQUE as a synonym). Alternatively, GROUP BY collapses duplicates while
allowing aggregation, and window functions such as ROW_NUMBER() can identify and filter duplicate rows.
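The keywords above can be demonstrated with SQLite from Python's standard library (the table and data are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("alice", 10), ("alice", 10), ("bob", 5)])

# DISTINCT removes duplicate rows from the result set.
distinct = conn.execute(
    "SELECT DISTINCT customer, amount FROM orders ORDER BY customer").fetchall()
print(distinct)  # [('alice', 10), ('bob', 5)]

# GROUP BY collapses duplicates while letting you aggregate them.
grouped = conn.execute(
    "SELECT customer, COUNT(*) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
print(grouped)  # [('alice', 2), ('bob', 1)]
```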

Databricks Data Engineer Interview Questions


Below are the Data Engineer interview questions asked in Databricks-

105. What is Databricks Runtime?


106. What Spark components are included in Azure Databricks?
107. What are the types of runtimes Azure Databricks offers?
108. What is Databricks File System?
109. How do you access Azure Data Lake Storage from a Notebook?
Walmart Data Engineer Interview Questions
Some of the Data Engineer interview questions asked in Walmart are

110. What is a case class in Scala?


111. Elaborate on the Hive architecture.
112. What are the various types of data models?
113. Can we use Hadoop commands to load data in the backend to a
particular partition table?
114. How can we truncate a table in Hive?
115. What is Spark? How is it different from Hive?
EY Data Engineer Interview Questions
Here are the most commonly asked Data Engineer interview questions at EY -

116. When should you not use a pie chart?


117. What is database dynamic?
118. Explain the Spark architecture.
119. Explain joins in SQL.
120. Difference between map and flatMap.
121. Difference between RDD and DataFrame.
122. Given this role sits within the EY Data Analytics Team, please let us
know about your recent experience and exposure to Data and Analytics.
What data-related projects, tools, platforms, and technologies have you
worked on?
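For the map vs. flat-map question above, the distinction can be illustrated in plain Python; PySpark's map and flatMap transformations behave analogously on RDDs:

```python
lines = ["hello world", "data engineering"]

# map: exactly one output element per input element (here, a list per line).
mapped = [line.split() for line in lines]
print(mapped)       # [['hello', 'world'], ['data', 'engineering']]

# flatMap: each input element may expand to zero or more output elements,
# and the results are flattened into a single sequence.
flat_mapped = [word for line in lines for word in line.split()]
print(flat_mapped)  # ['hello', 'world', 'data', 'engineering']
```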
Behavioral Data Engineering Questions
These are some of the behavioral Data Engineer interview questions asked in almost every data engineering
interview.

123. Why are you opting for a career in data engineering, and why
should we hire you?
124. What are the daily responsibilities of a data engineer?
125. What problems did you face while trying to aggregate data from
multiple sources? How did you go about resolving this?
126. Do you have any experience working on Hadoop, and how did you
enjoy it?
127. Do you have any experience working in a cloud computing
environment? What are some challenges that you faced?
128. What are the fundamental characteristics that make a good data
engineer?
129. How would you approach a new project as a data engineer?
130. Do you have any experience working with data modeling
techniques?
Facebook Data Engineer Interview Questions
As per Glassdoor, here are some Data Engineer interview questions asked in Facebook:

131. Given a list containing a None value, replace the None value with
the previous value in the list.
132. Print the key in a dictionary corresponding to the nth highest value
in the dictionary. Print just the first one if there is more than one record
associated with the nth highest value.
133. Given two sentences, print the words that are present in only one of
the two sentences.
134. Create a histogram using values from a given list.
135. Write a program to flatten the given list: [1,2,3,[4,5,[6,7,[8,9]]]]
136. Write a program to remove duplicates from any given list.
137. Write a program to count the number of words in a given sentence.
138. Find the number of occurrences of a letter in a string.
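One possible sketch for the list-flattening question above (a recursive approach; illustrative, not the only solution):

```python
def flatten(items):
    """Recursively flatten arbitrarily nested lists into one flat list."""
    flat = []
    for item in items:
        if isinstance(item, list):
            flat.extend(flatten(item))  # descend into the nested list
        else:
            flat.append(item)
    return flat

print(flatten([1, 2, 3, [4, 5, [6, 7, [8, 9]]]]))  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```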
Amazon Data Engineer Interview Questions
Data Engineer interview questions that are most commonly asked at Amazon

139. How can you tune a query? If a query takes longer than it initially
did, what may be the reason, and how will you find the cause?
140. In Python, how can you find non-duplicate numbers in the first list
and create a new list preserving the order of the non-duplicates?
141. Consider a large table containing three columns corresponding to
DateTime, Employee, and customer_response. The customer_response
column is a free text column. Assuming a phone number is embedded in
the customer_response column, how can you find the top 10 employees
with the most phone numbers in the customer_response column?
142. Sort an array in Python so that it produces only odd numbers.
143. How can you achieve performance tuning in SQL? Find the
numbers which have the maximum count in a list?
144. Generate a new list containing the numbers repeated in two
existing lists.
145. How would you tackle a data pipeline performance problem as a
data engineer?
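For the two list-manipulation questions above, here is one possible reading: keep the elements of the first list that are absent from the second (order preserved), and collect the numbers common to both lists. The function names are illustrative:

```python
def non_duplicates(first, second):
    """Elements of `first` absent from `second`, original order preserved."""
    seen = set(second)
    return [x for x in first if x not in seen]

def repeated_in_both(a, b):
    """Numbers present in both lists, in order of first appearance in `a`."""
    b_set = set(b)
    seen = set()
    out = []
    for x in a:
        if x in b_set and x not in seen:
            out.append(x)
            seen.add(x)
    return out

print(non_duplicates([1, 2, 3, 4, 5], [2, 4]))   # [1, 3, 5]
print(repeated_in_both([1, 2, 2, 3], [2, 3, 5])) # [2, 3]
```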
How Does Data Engineering Help Businesses? | Why is Data
Engineering in Demand?
Data engineering is no less significant than data science: it maintains the framework that enables data scientists to
analyze data and create models, so without data engineering, data science is not possible. A successful data-driven
company relies on data engineering, which makes it easier to build a data processing stack for data collection,
storage, cleaning, and analysis in batches or in real time, making the data ready for further analysis.

Furthermore, as businesses learn more about the significance of big data engineering, they are turning towards
AI-driven methodologies for end-to-end data engineering rather than older techniques. Data engineering helps
surface useful data residing in any data warehouse with the help of advanced analytic methods, allows businesses
to collaborate on data, and leads to efficient data processing.


Data Engineer Job Growth and Demand in 2022


When compared to data science, data engineering does not receive as much media coverage. However, data
engineering is a career field that is rapidly expanding and in great demand. It can be a highly exciting career for
people who enjoy assembling the "pieces of a puzzle" that build complex data pipelines to ingest raw data, convert it,
and then optimize it for various data users. According to a LinkedIn Search as of June 2022, there are over 229,000
jobs for data engineering in the United States, and over 41,000 jobs for the same in India.

Image for Data Engineer job vacancies in the US

Image for Data Engineer job vacancies in India

Based on Glassdoor, the average salary of a data engineer in the United States is $112,493 per annum. In India, the
average data engineer salary is ₹925,000. According to Indeed, Data Engineer is the 5th highest paying job in the
United States across all the sectors. These stats clearly state that the demand for the role of a Data Engineer will
only increase with lucrative paychecks.

Image showing Data Engineer annual average salary in the US

Image showing Data Engineer annual average salary in India

What Skills Does a Data Engineer Need?


Below are some essential skills that a data engineer or any individual working in the data engineering field requires-

1. SQL: Data engineers are responsible for handling large amounts of data. Structured Query Language (SQL) is
required to work on structured data in relational database management systems (RDBMS). As a data
engineer, it is essential to be thorough with using SQL for simple and complex queries and optimize queries as
per requirements.

2. Data Architecture and Data Modeling: Data engineers are responsible for building complex database
management systems. They are considered the gatekeepers of business-relevant data and must design and
develop safe, secure, and efficient systems for data collection and processing.

3. Data Warehousing: It is important for data engineers to know how to build and work with data warehouses.
Data warehouses allow the aggregation of unstructured data from different sources, which can then be
processed and analyzed efficiently.

4. Programming Skills: The most popular programming languages used in Big Data Engineering are Python and
R, which is why it is essential to be well versed in at least one of these languages.

5. Microsoft Excel: Excel allows developers to arrange their data into tables. It is a commonly used tool to
organize and update data regularly if required. Excel provides many tools that can be used for data analysis,
manipulation, and visualization.

6. Apache Hadoop-Based Analytics: Apache Hadoop is a prevalent open-source tool used extensively in Big
Data Engineering. The Hadoop ecosystem provides support for distributed computing, allows storage,
manipulation, security, and processing of large amounts of data, and is a necessity for anyone applying for the
role of a data engineer.

7. Operating Systems: Data engineers are often required to be familiar with operating systems like Linux,
Solaris, UNIX, and Microsoft Windows.

8. Machine Learning: Machine learning techniques are primarily required for data scientists. However, since data
scientists and data engineers work closely together, knowledge of machine learning tools and techniques will
help a data engineer.

Get Set Go For Your Interview with ProjectPro’s Top Data
Engineer Interview Questions
We hope these questions will help you ace your interview and land a data engineer role in your dream organization.
Apart from the data engineer interview questions, here are some essential tips to keep you prepared for your next
data engineering interview:

Brush up your skills: Here are some skills that are expected in a data engineer role:

Technical skills: Data engineers have to be familiar with database management systems, SQL, Microsoft
Excel, programming languages (especially R and Python), and Big Data tools, including Apache Hadoop
and Apache Spark.

Analytical Skills: Data Engineering requires individuals with strong mathematical and statistical skills who
can make sense of the large amounts of data that they constantly have to deal with.

Understanding business requirements: To design optimum databases, it is important that data engineers
understand what is expected of them, and design databases as per requirements.

Be familiar with the specific company with which you are interviewing. Understand the goals and objectives of
the company, some of their recent accomplishments, and any ongoing projects you can find out about. The
more specific your answers to questions like “Why have you chosen Company X?”, the more you will be able
to convince your interviewers that you have truly come prepared for the interview.

Have a thorough understanding of the projects you have worked on. Be prepared to answer questions based
on these projects, primarily if the projects are related to Big Data and data engineering. You may be asked
questions about the technology used in the data engineering projects, the datasets you used, how you
obtained the required data samples, and the algorithms you used to approach the end goal. Try to recall any
difficulties that you encountered during the execution of the project and how you went about solving them.

Spend time working on building up your project profile and in the process, your confidence. By working on
projects, you can expand your knowledge by gaining hands-on experience. Projects can be showcased to your
interviewer but will also help build up your skillset and give you a deeper understanding of the tools and
techniques used in the market in the field of Big Data and data engineering.

Make sure to get some hands-on practice with ProjectPro’s solved big data projects with reusable source code that
can be used for further practice with complete datasets. At any time, if you feel that you require some assistance, we
provide one-to-one industry expert guidance to help you understand the code and ace your data engineering skills.

FAQs on Data Engineer Interview Questions


1. How can I pass a data engineer interview?
You can pass a data engineer interview if you have the right skill set and experience necessary for the job role. If you
want to crack the data engineer interview, acquire the essential skills like data modeling, data pipelines, data
analytics, etc., explore resources for data engineer interview questions, and build a solid portfolio of big data
projects. Practice real-world data engineering projects on ProjectPro, Github, etc. to gain hands-on experience.

2. What are the roles and responsibilities of a data engineer?


Some of the roles and responsibilities of a data engineer are

Create and implement ETL data pipelines for a variety of clients in various sectors.

Generate accurate and useful data-driven solutions using data modeling and data warehousing techniques.

Interact with other teams (data scientists, etc.) and help them by delivering relevant datasets for analysis.

Build data pipelines for extraction and storage tasks by employing a range of big data engineering tools and
various cloud service platforms.

3. What are the four key questions a data engineer is likely to
hear during an interview?


The four key questions a data engineer is likely to hear during an interview are

What is data modeling?

What are the four V’s of Big Data?

Do you have any experience working on Hadoop, and how did you enjoy it?

Do you have any experience working in a cloud computing environment? What are some challenges that you
faced?
