Top 60+ Data Warehouse Interview Questions and Answers.pdf

www.datacademy.ai
Knowledge world
Top 60+ Data Warehouse Interview Questions and
Answers
1. What is a data warehouse?
A data warehouse is a large, centralized repository of data that is specifically designed to support business intelligence and
decision-making activities. Data warehouses are optimized for querying and analysis, and are populated with data from a
variety of sources, including transactional systems, operational databases, and external sources.
Data warehouses typically use a relational database management system (RDBMS) and employ a multidimensional data
model, which allows for the efficient querying and analysis of data. Data warehouses also employ a process known as
extract, transform, and load (ETL), which is used to clean, transform, and load data into the warehouse.
Data warehouses are typically used by organizations to gain insights into their data and make informed business decisions.
They provide a single source of truth for data, allowing organizations to access, analyze, and report on their data in a
consistent and accurate manner. Additionally, data warehouses are designed to support ad-hoc reporting and analytics,
making them a valuable tool for decision-makers.
2. What is the difference between Data Warehousing and Data Mining?
A data warehouse is a centralized repository of structured data that is used for reporting and data analysis. It typically
contains data from multiple sources, such as transactional databases, log files, and API calls, and is designed to support
efficient querying and analysis of the data.
3. What is Data Transformation?
Data transformation is the process of converting data from one format or structure to another. This may involve cleaning
and formatting the data, as well as aggregating or summarizing the data in a specific way.
In a data warehouse or business intelligence system, data transformation is often a crucial step in the process of loading
data from various sources into a central repository. The data may need to be transformed to fit the structure of the data
warehouse schema, or to conform to certain standards or requirements. Data transformation may also be used to derive
new insights or to create new data sets by combining or manipulating existing data.
Data transformation can be performed using a variety of tools and techniques, such as SQL queries, programming
languages, or specialized data transformation software. It is an important part of the ETL (extract, transform, load)
process, which is used to populate a data warehouse or other data repository with data from multiple sources.
4. Why do we need a Data Warehouse?
There are several reasons why organizations might choose to implement a data warehouse:
1. Centralized data repository: A data warehouse provides a central location for storing and managing data from multiple
sources. This can make it easier to access and analyze data, as well as to create reports and dashboards.
2. Improved performance: Data warehouses are designed to support fast query performance, even when working with large
volumes of data. This can make it easier to extract insights and make decisions in a timely manner.
3. Data integration and transformation: A data warehouse allows organizations to integrate and transform data from multiple
sources, making it possible to analyze data in a consistent and standardized manner.
4. Historical data analysis: Data warehouses typically store data over a long period of time, making it possible to analyze
historical trends and patterns.
5. Support for business intelligence and analytics: A data warehouse is often used as the foundation for business intelligence and
analytics initiatives, as it provides a single source of data that can be used to generate reports, dashboards, and other analytics
outputs.
Overall, a data warehouse can help organizations better understand and make use of their data, enabling them to make
more informed decisions and improve their operations.

www.datacademy.ai
Knowledge world
5. What are the key characteristics of a Data Warehouse?
Some of the major key characteristics of a data warehouse are listed below:
• The part of data can be denormalized so that it can be simplified and improve the performance of the same.
• A huge volume of historical data is stored and used whenever needed.
• Many queries are involved where a lot of data is retrieved to support the queries.
• The data load is controlled.
• Ad hoc queries and planned queries are quite common when it comes to data extraction.
6.What is the difference between Database vs. Data Lake vs. Warehouse
vs. Data Mart?
Here is a summary of the main differences between databases, data lakes, data warehouses, and data marts:
In general, a database is a structured collection of data used for transactional processing or data management, while a data
lake is a central repository that allows storing structured and unstructured data at any scale. A data warehouse is a
centralized repository of structured data used for reporting and analysis, and a data mart is a subset of a data warehouse
that is designed for a specific group or department.
7. Data Warehouse
A data warehouse exists on several databases and is used for business intelligence. The data warehouse gathers the data
from all these databases and creates a layer to optimize data for analytics. It mainly stores processed, refined, highly
modeled, highly standardized, and cleansed data.
8. Data Lake
A data lake is a centralized repository for structure and unstructured data storage. It can be used to store raw data without
any structure schema, and there is no need to perform any ETL or transformation job.
Any type of data can be stored here, like images, text, files, and videos, and even it can store machine learning model
artifacts, real-time and analytics output, etc. Data retrieval processing can be done via export, so the schema is defined on
reading. It mainly stores raw and unprocessed data. The main focus is to capture and store as much data as possible.
9. Data Mart
Data Mart lies between the data warehouse and Data Lake. It’s a subset of filtered and structured essential data of a
specific domain or area for a specific business need.
10. What is a Data Model?
A data model is simply a diagram that displays a set of tables and the relationship between them. This helps in
understanding the purpose of the table as well as its dependency. A data model applies to any software development
involving creating database objects to store and manipulate data, including transactional and data warehouse systems. The
data model is being designed through three main stages: conceptual, logical, and physical data model.
A conceptual data model is a set of square shapes connected by a line. The square shape represents an entity, and the line
represents a relationship between the entities. This is very high level and highly abstract, and key attributes should be here.
The logical data model expands the conceptual model by adding more detail and identifying its key and non-key attributes.
Hence, key attributes or attributes define the uniqueness of that entity, such as in the time entity, it’s the date that’s a key
attribute. It also considers the relationship type, whether one-to-one, one to many, or many to many.
The physical data model looks similar to a logical data model; however, there are significant changes. Here entities will be
replaced by tables, and attributes will be referred to as columns. So tables and columns are words specific to a database. In

www.datacademy.ai
Knowledge world
contrast, entities and attributes are specific to a logical data model design, so a physical data model always refers to these
as tables and columns. It should be database technology compatible.
11. What is Data Modelling?
Data Modelling is a very simple step of simplifying an entity here in the concept of data engineering. It will simplify
complex software by simply breaking it up into diagrams and further breaking it into flow charts. Flowcharts are a simple
representation of how a complex entity can be broken down into a simple diagram. This will give a visual representation,
an easier understanding of the complex problem, and even better readability to a person who might not be proficient in that
particular software usage.
Data modeling is generally defined as a framework for data to be used within information systems by supporting specific
definitions and formats. It is a process used to define and analyze data requirements needed to support the business
processes within the boundary of respective information systems in organizations. Therefore, the creation of data modeling
involves experienced data modelers working closely with business stakeholders, as well as potential users of the
information system.
12. What is an ODS used for?
An ODS, or operational data store, is a database that is used to store current, real-time data for operational purposes. It is
typically used to support the day-to-day operations of an organization, such as processing transactions, managing
inventory, or tracking customer interactions.
An ODS is designed to support high volumes of read and write operations, and to provide fast access to data. It is typically
populated with data from operational systems, such as transactional databases, log files, and API calls. The data in an ODS
is usually updated in near real-time, and is used to support operational processes and decision-making.
An ODS is different from a data warehouse, which is a centralized repository of structured data used for reporting and data
analysis. While an ODS is optimized for fast read and write performance, a data warehouse is optimized for fast query
performance. An ODS is also different from a data lake, which is a central repository that allows storing structured and
unstructured data at any scale.
13. What is the difference between OLTP & OLAP?
Criteria OLTP OLAP
Abbreviation Online Transaction Processing Online Analytical Processing
Used for Day-to-day business transaction Analyzed or reported purpose
Used by End users, business users Business Analyst, Decision Makers, Management level users
Data Insertion/
Change
Frequency
Very frequent Mostly fixed number of times through scheduled jobs
Mostly Used
Statement
Select, Insert, Update, Delete Select
Type of System
or Source of
data
Source system, Main source of data
Target system, data are transferred from OLTP through
extraction, Transformation, and Loading process.
Database Type Normalized Denormalized
Data Volume Less compared to OLAP Very high
Processing
speed or
Very fast
Depending on the amount of data, report generation SLA time
can be a few seconds to a few hours.

www.datacademy.ai
Knowledge world
14. What is Metadata, and what is it used for?
The definition of Metadata is data about data. Metadata is the context that gives information a richer identity and forms the
foundation for its relationship with other data. It can also be a helpful tool that saves time, keeps organized, and helps
make the most of the files. Structural Metadata is information about how an object should be categorized to fit into a larger
system with other objects. Structural Metadata establishes relationships with other files to be organized and used in many
ways.
Administrative Metadata is information about the history of an object, who used to own it, and what can be done with it.
Things like rights, licenses, and permissions. This information is helpful for people managing and taking care of an object.
One data point gains its full meaning only when it’s put in the right context. And the better-organized Metadata will
reduce the searching time significantly.
15. What is the difference between ER Modelling vs. Dimensional
Modelling?
ER Modelling Dimension Modelling
Used for OLTP Application design.Optimized
for Select / Insert / Update / Delete
Used for OLAP Application design. Optimized for retrieving data and answering
business queries.
Revolves around entities and their relationships
to capture the process
Revolves around Dimensions for decision making, Doesn’t capture process
The unit of storage is a table. Cubes are units of storage.
Contains normalized data. Contains denormalized data
16. What is the difference between View and Materialized View?
A view is to access the data from its table that does not occupy space, and changes get affected in the corresponding tables.
In contrast, in the materialized view, pre-calculated data persists, and it has physical data space occupation in the memory,
and changes will not get affected in the corresponding tables. The material view concept came from database links, mainly
used earlier to make a copy of remote data sets. Nowadays, it’s widely used for performance tuning.
The view always holds the real-time data, whereas Materialized view contains a snapshot of data that may not be real-time.
Some methods are available to refresh the data in the Materialized view.
17. What does Data Purging mean?
The data purging name is quite straightforward. It is the process involving methods that can erase data permanently from
the storage. Several techniques and strategies can be used for data purging. The process of data forging often contrasts
with data deletion, so they are not the same as deleting data is more temporarily while data purging permanently removes
the data. This, in turn, frees up more storage and memory space which can be utilized for other purposes.
latency
Focus
More focus on ‘effective data storage’ and quick
completion of the request. Hence generally, a
limited number of indexes are used.
Focus on retrieval of data; hence more indexes are used.
Backup
A more frequent backup needs to be placed. Even
runtime incremental backup is always
recommended.
Time-to-time backup is less frequent, and no need for
incremental runtime backup.

www.datacademy.ai
Knowledge world
The purging process allows us to archive data even if it is permanently removed from the primary source, giving us an
option to recover that data in case we purge it. The deleting process also permanently removes the data but does not
necessarily involve keeping a ba, and Itp generally involves insignificant amounts of data.
18. Please provide a couple of current Data Warehouse solutions widely
used in the Industry.
There are a couple of solutions available in the market. Some of the major solutions are:
• Snowflakes
• Oracle Exadata
• Apache Hadoop
• SAP BW4HANA
• Microfocus Vertica
• Teradata
• AWS Redshift
• GCP Big Query
19. Provide a couple of renowned used ETL tools used in the Industry.
Some of the major ETL tools are
• Informatica
• Talend
• Pentaho
• Abnitio
• Oracle Data Integrator
• Xplenty
• Skyvia
• Microsoft – SQL Server Integrated Services (SSIS)
20. What is a Slowly Changing Dimension?
A slowly changing dimension (SCD) is one that appropriately manages changes of dimension members over time. It
applies when business entity value changes over time and in an ad-hoc manner.
21. What are the different types of SCD?
There are six sorts of Slowly Changing Dimensions that are commonly used. They are as follows:
Type 0 – Dimension never changes here, dimension is fixed, and no changes are permissible.
Type 1 – No History Update record directly. There’s no record of historical values, only the current state. A kind 1 SCD
always reflects the newest values, and the dimension table is overwritten when changes in source data are detected.
Type 2 – Row Versioning Track changes as version records which will be identified by the current flag & active dates, and
other metadata. If the source system doesn’t store versions, the info warehouse load process usually detects changes and
appropriately manages them during a dimension table.
Type 3 – Previous Value column Track change to a selected attribute, and add a column to point out the previous value,
which is updated as further changes occur.
Type 4 – History Table shows the current value in the dimension table, and all changes are tracked and stored in a separate
table.
Hybrid SCD – Hybrid SDC utilizes techniques from SCD Types 1, 2, and three to trace change.

www.datacademy.ai
Knowledge world
Only types 0, 1, and a couple of are widely used, while the others are applied for specific requirements.
22. What is a Factless Fact Table?
A factless fact table is a type of table in a data warehouse schema that does not contain any measure columns. Instead, it
contains only foreign keys that reference dimensions in the data warehouse. Factless fact tables are used to track events or
situations that do not have any associated measures, but that still need to be recorded and analyzed.
For example, a factless fact table might be used to track the attendance at a sporting event. The table would contain foreign
keys that reference the dimensions of time, location, and team, but it would not contain any measure columns, since there
is no numeric value associated with attendance. Instead, the factless fact table would simply record the fact that the event
took place and would allow analysts to track trends and patterns over time.
Factless fact tables can be useful for tracking events or situations that do not have any associated measures, such as the
attendance at a sporting event or the occurrence of a particular type of incident. They can be used to support a variety of
analysis and reporting tasks, such as tracking trends and patterns over time or identifying correlations between different
dimensions.
23. What is a Fact Table?
A fact table is a central table in a data warehouse schema that contains the measures or facts that are being tracked and
analyzed. It typically contains columns that store numeric values, such as sales, costs, or quantities, and is used to track
business transactions and events.
Fact tables are typically linked to dimension tables, which contain the context or background information for the measures
in the fact table. For example, a fact table might contain sales data, while the dimension tables might contain information
about the products being sold, the customers making the purchases, and the time and location of the sales.
Fact tables are often used to support a variety of analysis and reporting tasks, such as calculating totals, averages, and
other summary statistics, or identifying trends and patterns over time. They are a key component of a data warehouse
schema, and are typically optimized for fast query performance.
24. What are Non-additive Facts?
Non-additive facts are measures or facts in a data warehouse that cannot be added across different dimensions. This means
that the values of the measure cannot be accurately summed or aggregated across different levels of the dimensions.
For example, consider a sales fact table that contains data about the products being sold, the customers making the
purchases, and the time and location of the sales. The sales amount is an additive fact, since it can be accurately summed
or aggregated across different dimensions, such as by product, customer, or time period.
On the other hand, consider a measure such as profit margin, which is calculated as the ratio of profit to sales. Profit
margin cannot be accurately summed or aggregated across different dimensions, since it is a ratio rather than a raw value.
Therefore, profit margin would be considered a non-additive fact.
Non-additive facts can be more challenging to analyze and report on, as they cannot be easily aggregated or summarized.
In some cases, it may be necessary to calculate them separately for each level of the dimensions, or to use specialized
techniques such as ratio analysis to compare them across different levels.
25. What is a Conformed Fact?
A conformed fact is a table across multiple data marts and fact tables.
26. What is the Core Dimension?
The core dimension is a Dimension table, which is dedicated to a single fact table or Data Mart.

www.datacademy.ai
Knowledge world
27. What is Dimensional Data Modeling?
Dimensional modeling is a set of guidelines to design database table structures for easier and faster data retrieval. It is a
widely accepted technique. The benefits of using dimensional modeling are its simplicity and faster query performance.
Dimension modeling elaborates logical and physical data models to further detail model data and data-related
requirements. Dimensional models map the aspects of every process within the business.
Dimensional Modelling is a core design concept used by many data warehouse designers design data warehouses. During
this design model, all the info is stored in two sorts of tables.
• Facts table
• Dimension table
The fact table contains the facts or measurements of the business, and the dimension table contains the context of
measurements by which the facts are calculated. Dimension modeling is a method of designing a data warehouse.
Data Warehouse Interview Questions
28. What are the types of Dimensional Modelling?
Types of Dimensional Modelling are listed below:
• Conceptual Modelling
• Logical Modelling
• Physical Modelling
29. What is the difference between E-R modeling and Dimensional
modeling?
The basic difference is that E-R modeling has a logical and physical model while Dimensional modeling has only a
physical model. E-R modeling is required to normalize the OLTP database design, whereas dimensional modeling is
required to denormalize the ROLAP/MOLAP design.
30. What is a Dimension Table?
A dimension table is a type of table that contains attributes of measurements stored in fact tables. It contains hierarchies,
categories, and logic that can be used to traverse nodes.
31. What is a Degenerate Dimension?
In a data warehouse, a degenerate dimension is a dimension key in the fact table that does not have its dimension table.
Degenerate dimensions commonly occur when the fact table’s grain is a single transaction (or transaction line).
32. What is the purpose of Cluster Analysis and Data Warehousing?
One of the purposes of cluster analysis is to achieve scalability, so regardless of the quantity of data system will be able to
analyze its ability to deal with different kinds of attributes, so no matter the data type, the attributes present in the data set
can deal with its discovery of clusters with attribute shape high dimensionality which have multiple dimensions more than
2d to be precise ability to deal with noise, so any inconsistencies in the data to deal with that and interpretability.
33. What is the difference between Agglomerative and Divisive
Hierarchical Clustering?
The agglomerative hierarchical constraining method allows clusters to be read from bottom to top so that the program
always reads from the sub-component first and then moves to the parent in an upward direction. In contrast, divisive
hierarchical clustering uses a top-to-bottom approach in which the parent is visited first and then the child. The
agglomerative hierarchical method consists of objects in which each object creates its clusters. These clusters are grouped
to form a larger cluster.

www.datacademy.ai
Knowledge world
It is also the process of continuous merging until all the single clusters are merged into a complete big cluster that will
consist of the objects of the chart clusters; however, in divisive clustering, the parent cluster is divided into smaller
clusters. It keeps on dividing until each cluster has a singular object to represent.
34. What is ODS?
ODS is a database that integrates data from multiple sources for additional data operations. The full form of ODS is the
operational data source, unlike the master data source, where the data is not sent back to the operational systems. It may be
passed for further operations and to the data warehouse for reporting. In ODS, data can be scrubbed, resolved for
redundancy, and checked for compliance with the corresponding business rules, so whatever data is filtered out to see if
there is some data redundancy.
It is checked and shows whether the data complies with the organization’s business rules.
This data can be used for integrating disparate data from multiple sources so that business operations analysis and
reporting can be carried out. This is where most of the data used in the current operation are housed before it’s transferred
to the data warehouse for the longer term and storage and archiving.
For simple queries on small amounts of data, such as finding the status of a customer order, it is easier to find the details
from ODS rather than Data warehousing as it does not make sense to search a particular customer order status on a larger
dataset which will be more costly to fetch the single records. But for analyses like sentimental analysis, prediction, and
anomaly detection where data warehousing will perform the role to play with its large data volumes.
ODS is similar to short-term memory, where it only stores very recent information. On the contrary, the data warehouse is
more like a long-term memory storing relatively permanent information because a data warehouse is created permanently.
35. What is the level of granularity of a Fact Table?
A fact table is usually designed at a low level of granularity. This means we must find the lowest amount of information
stored in a fact table. For example, employee performance is a very high level of granularity. In contrast, employee
performance daily and employee performance weekly can be considered low levels of granularity because they are much
more frequently recorded data. The granularity is the lowest level of information stored in the fact table; the depth of the
data level is known as granularity in the date dimension.
The level could be a year, month, quarter, period, week, and day of granularity, so the day is the lowest, and the year is the
highest. The process consists of the following two steps determining the dimensions to be included and the location to find
the hierarchy of each dimension of that information. The above factors of determination will be resent as per the
requirements.
36. What’s the biggest difference between Inmon and Kimball’s
philosophies of Knowledge Warehousing?
These are two philosophies that we’ve in data warehousing. Within the Kimball philosophy, data warehousing is viewed
as a constituency of knowledge mods, so data mods are focused on delivering business objectives for departments in a
corporation. Therefore the data warehouse may be a confirmed dimension of the info mods; hence a unified view of the
enterprise is often obtained from the dimension modeling on a departmental area level.
Within the Inmon philosophy, we will create a knowledge warehouse on a topic-by-discipline basis; hence, the
information warehouse can start with the in-web store’s information. The subject areas are often added to the info
warehouse as their need arises point of sale, or pos data are often added later if management decides it’s required.
We first accompany data marts if we check it out algorithmically within the Kimball philosophy. We combine it, and we
get our data warehouse, while with Inmon philosophy, we create our data warehouse and then create our data marts.
Both differ within the concept of building the info Warehouse. – Kimball views Data Warehousing as a constituency of
knowledge marts. Data marts are focused on delivering business objectives for departments in a corporation, and therefore

www.datacademy.ai
Knowledge world
the Data Warehouse may be a conformed dimension of the info Marts. Hence, a unified view of the enterprise is often
obtained from the dimension modeling on a departmental area level. – Inmon explains creating a knowledge Warehouse
on a subject-by-subject area basis.
Hence, the event of the info Warehouse can start with data from the web store. Other subject areas are often added to the
info Warehouse as their needs arise. Point-of-sale (POS) data is often added later if management decides it’s necessary.
37. Explain the ETL cycles’ three-layer architecture.
ETL stands for extraction transformation and loading, so three phases are involved in it – the primary is the staging layer.
The info integration layer and the last layer is the access layer. So these are the three layers involved in the three specific
phases within the ETL cycle, so the staging layer is used for the info extraction from various source data structures.
Within the data integration layer, data from the staging layer is transformed and transferred to the info base using the
mixing layer. The data is arranged in hierarchical groups often mentioned as dimensions facts or aggregates during a data
warehousing system; the mixture of facts and dimension tables is called a schema, so basically, within the data integration
layer, once the info is loaded and data extracted and transformed within the staging layer and eventually the access layer
where the info is accessed and may be loaded for further analytics.
38. What’s an OLAP Cube?
The idea behind OLAP was to pre-compute all calculations needed for reporting. Generally, calculations are done through
a scheduled batch job processing at non-business hours when the database server is normally idle. The calculated fields are
stored in a special database called an OLAP Cube.
An OLAP Cube doesn’t need to loop through any transactions because all the calculations are pre-calculated, providing
instant access.
An OLAP Cube may be a snapshot of knowledge at a selected point in time, perhaps at the top of a selected day, week,
month, or year.
You’ll refresh the Cube at any time using the present values within the source tables.
With very large data sets, it could take an appreciable amount of your time for Excel to reconstruct the Cube.
But the method appears instantaneous with the info sets we’ve been using (just a few thousand rows).
39. Explain the chameleon method utilized in Data Warehousing.
Chameleon may be a methodology that may be a hierarchical clustering algorithm that overcomes the restrictions of the
prevailing models and methods in data warehousing. This method operates on the sparse graph having nodes representing
data items and edges representing the weights of the info items. This representation allows large data sets to be created and
operated successfully. The tactic finds the clusters utilized in the info set using the two-phase algorithm.
The primary phase consists of graph partitioning that permits the clustering of the info items into a larger number of sub-
clusters; the second phase, on the opposite hand, uses an agglomerative hierarchical clustering algorithm to look for the
clusters that are genuine and may be combined alongside the sub-clusters that are produced.
40. What’s virtual Data Warehousing?
A virtual data warehouse provides a collective view of the finished data. A virtual data warehouse has no historical data
and is often considered a logical data model of the given Metadata. Virtual data warehousing is the de facto data system
strategy for supporting analytical decisions. It’s one of the simplest ways of translating data and presenting it within the
form decision-makers will employ. It provides a semantic map that allows the top user viewing because the data is
virtualized.

Top 60+ Data Warehouse Interview Questions and Answers.pdf

More Related Content

Similar to Top 60+ Data Warehouse Interview Questions and Answers.pdf (20)

More from Datacademy.ai (16)

Recently uploaded (20)

Top 60+ Data Warehouse Interview Questions and Answers.pdf