DAM UNIT - IV
UNIT – IV
DATA WAREHOUSING
Data Warehousing: Identify the purpose of data warehousing - Identify the key components
of a data warehouse - Distinguish between data warehouses and data lakes - Determine the
role of different warehousing techniques - Data Warehousing Tools: Differentiate between the
utility of relational DW, cubes, and in-memory scenarios - Compare techniques for data
integration with regard to warehousing - Use warehousing tools - Use integration tools for
warehousing.
A data lake is a centralized repository that allows you to store all your structured and unstructured
data at any scale. You can store your data as-is, without having to first structure the data, and run
different types of analytics—from dashboards and visualizations to big data processing, real-time
analytics, and machine learning to guide better decisions.
Q) What about Data Warehouses, Data Marts, Data Lakes and Databases?
How are they different?
A) There are a lot of data sorting, storage, and accessing options available. Which will benefit your
business most depends on what you use your data for.
Data warehouse: A data warehouse is a centralized and integrated repository that stores large
volumes of structured and sometimes unstructured data from various sources within an organization.
The data stored in a data warehouse is used for analytical and reporting purposes rather than
operational transactions. It is designed to support complex querying, data analysis, and reporting,
providing a comprehensive view of an organization's historical and current data.
Data mart. As already indicated, a data mart is part of a data warehouse, generally geared towards
giving a group, team, or line of business the specific information it requires. Also called mini-
data warehouses, data marts both improve response time within the already low-latency data
warehouse and ensure queries are sufficiently focused to be useful to end users.
Data lake. Data lakes are simply repositories filled with unorganized, unclassified data; they're
generally helpful for collecting data whose value isn't yet known. Data lake data may not be
cleansed, corrected, or deduplicated; while this is useful for applications like machine learning, data
lake analytics queries can produce poor results for users looking for usable, trustworthy business insights.
Database. Databases log frequent transactions and provide quick access to specific, repetitive
business transactions. While designed to be good at receiving data, databases simply aren’t built to be
sources from which to pull insights.
Data cube: A data cube in a data warehouse is a multidimensional structure used to store
data. The data cube was initially designed for OLAP tools that could easily access
multidimensional data, but data cubes can also be used for data mining.
Q) What is Data warehouse?
A) A data warehouse is a data management system that stores current and historical data from
multiple sources in a business-friendly manner for easier insights and reporting.
Data warehouses are typically used for business intelligence (BI), reporting and data analysis.
Data warehouses make it possible to quickly and easily analyze business data uploaded from
operational systems such as point-of-sale systems, inventory management systems, or marketing or
sales databases. Data may pass through an operational data store and require data cleansing to ensure
data quality before it can be used in the data warehouse for reporting.
Data warehouses are used in BI, reporting, and data analysis to extract and summarize data from
operational databases. Information that is difficult to obtain directly from transactional databases can
be obtained via data warehouses. For example, management wants to know the total revenues
generated by each salesperson on a monthly basis for each product category. Transactional databases
may not capture this data, but the data warehouse does.
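To make the example above concrete, the short Python sketch below (using the pandas library, with hypothetical column names and toy figures) computes total revenue per salesperson, per month, and per product category from a small sales table - the kind of summary a data warehouse is built to answer. It is an illustrative sketch only, not part of the original material.

# Sketch: monthly revenue per salesperson per product category.
# Column names and figures are hypothetical.
import pandas as pd

# Pretend this frame was loaded from the warehouse's sales fact table.
sales = pd.DataFrame({
    "salesperson": ["Asha", "Asha", "Ravi", "Ravi", "Asha"],
    "category":    ["Books", "Toys", "Books", "Books", "Toys"],
    "sale_date":   pd.to_datetime(
        ["2024-01-05", "2024-01-20", "2024-01-11", "2024-02-02", "2024-02-14"]),
    "revenue":     [1200.0, 450.0, 800.0, 950.0, 300.0],
})

# The report management asked for: total revenue by salesperson, month, and category.
monthly = (
    sales
    .assign(month=sales["sale_date"].dt.to_period("M"))
    .groupby(["salesperson", "category", "month"])["revenue"]
    .sum()
    .reset_index()
)
print(monthly)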
A data warehouse is needed for the following reasons:
1) Business User: Business users require a data warehouse to view summarized data from the past.
Since these users are non-technical, the data should be presented to them in a simple, elementary form.
2) Store historical data: A data warehouse is required to store time-variant data from the past.
This input is used for various purposes.
3) Make strategic decisions: Some strategies may depend upon the data in the data
warehouse, so the data warehouse contributes to making strategic decisions.
4) For data consistency and quality: By bringing data from different sources to a common place,
the user can effectively ensure uniformity and consistency in the data.
5) Quick response time: The data warehouse has to be ready for somewhat unexpected loads and types
of queries, which demands a significant degree of flexibility and a quick response time.
The purpose of data warehousing is to provide a centralized and integrated repository for storing,
managing, and analyzing large volumes of structured and sometimes unstructured data from
various sources within an organization. Data warehousing serves several important purposes:
• Data Integration: Organizations often have data stored in different systems, databases,
and formats. Data warehousing allows for the integration of data from multiple sources into a
single, unified structure. This integration facilitates cross-functional analysis and reporting
by providing a consistent view of the data.
• Historical Analysis: Data warehouses store historical data over time, allowing
organizations to analyze trends, patterns, and changes in business operations. This historical
context is crucial for making informed decisions and understanding the evolution of the
organization.
• Business Intelligence and Reporting: Data warehouses provide a platform for
generating reports, dashboards, and visualizations that offer insights into business
performance, customer behavior, and market trends. These insights support data-driven
decision-making at various levels of the organization.
• Complex Queries: Data warehouses are optimized for complex queries and analytical
processing. Users can perform advanced analytics, such as data mining, statistical analysis,
and predictive modeling, to extract valuable insights from the data.
• Data Cleansing and Transformation: Before data is loaded into a data warehouse, it
often undergoes cleansing, transformation, and enrichment processes to ensure data
accuracy and consistency. This improves the quality of the data available for analysis.
• Support for Decision-Making: Data warehouses provide decision-makers with a
comprehensive view of the organization's data, enabling them to make informed choices that
align with business goals and strategies.
• Scalability: Data warehouses are designed to handle large volumes of data efficiently. As an
organization's data needs grow, a well-designed data warehouse can scale to accommodate
the increased data load.
• Data Security and Governance: Centralized data storage in a data warehouse can
improve data security and governance by providing a controlled environment for data access
and ensuring compliance with regulations and policies.
• Operational Performance: By separating analytical workloads from operational
databases, data warehousing reduces the impact on transactional systems, allowing them to
focus on core operations without being burdened by resource-intensive analytical queries.
• Support for Different User Roles: Data warehouses support different user roles, such as
executives, analysts, and business users, by providing them with tailored access to the data
and tools they need for their specific tasks.
Overall, the primary purpose of data warehousing is to enable organizations to harness the power of
their data for strategic decision-making, business insights, and improved operational efficiency.
Key Components of a Data Warehouse
ETL Tools
ETL stands for Extract, Transform, and Load. The staging layer uses ETL tools to extract the
needed data from various source formats and check its quality before loading it into the data warehouse.
The data coming from the data source layer can come in a variety of formats. Before merging all the
data collected from multiple sources into a single database, the system must clean and organize the
information.
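A minimal ETL sketch in Python is shown below, assuming a hypothetical CSV source file and an SQLite database standing in for the warehouse staging area; real ETL tools add scheduling, logging, and error handling on top of these three steps.

# Minimal ETL sketch: extract from a CSV source, clean the rows (transform),
# and load them into an SQLite table standing in for the warehouse.
# File, table, and column names are hypothetical.
import csv
import sqlite3

def extract(path):
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    cleaned = []
    for row in rows:
        # Basic quality checks: skip rows with a missing customer id,
        # normalise text case, and coerce the amount to a float.
        if not row.get("customer_id"):
            continue
        cleaned.append((
            row["customer_id"].strip(),
            row["country"].strip().upper(),
            float(row["amount"]),
        ))
    return cleaned

def load(rows, db_path="warehouse.db"):
    con = sqlite3.connect(db_path)
    con.execute("""CREATE TABLE IF NOT EXISTS sales_staging (
                       customer_id TEXT, country TEXT, amount REAL)""")
    con.executemany("INSERT INTO sales_staging VALUES (?, ?, ?)", rows)
    con.commit()
    con.close()

if __name__ == "__main__":
    load(transform(extract("daily_sales.csv")))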
The Database
The most crucial component and the heart of the data warehouse architecture is the database. This is
where the data is stored and accessed.
When creating the data warehouse system, you first need to decide what kind of database you want
to use.
Data
Once the system cleans and organizes the data, it stores it in the data warehouse. The data
warehouse represents the central repository that stores metadata, summary data, and raw data
coming from each source.
• Metadata is the information that defines the data. Its primary role is to simplify working
with data instances. It allows data analysts to classify, locate, and direct queries to the
required data.
• Summary data is generated by the warehouse manager. It updates as new data loads into
the warehouse. This component can include lightly or highly summarized data. Its main role
is to speed up query performance.
• Raw data is the actual data loaded into the repository, which has not yet been processed.
Having the data in its raw form makes it accessible for further processing and analysis.
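To make the three kinds of stored data above concrete, here is a small hypothetical Python sketch showing raw rows as they arrive from a source, a summary figure derived from them, and a metadata record describing the table; all names and values are invented for illustration.

# Hypothetical illustration of raw data, summary data, and metadata.
from collections import defaultdict
from datetime import date

# Raw data: unprocessed rows as loaded from a source system.
raw_rows = [
    {"store": "S1", "day": date(2024, 3, 1), "units": 10},
    {"store": "S1", "day": date(2024, 3, 2), "units": 7},
    {"store": "S2", "day": date(2024, 3, 1), "units": 4},
]

# Summary data: a lightly summarised view (units per store) that the
# warehouse manager might maintain to speed up queries.
units_per_store = defaultdict(int)
for row in raw_rows:
    units_per_store[row["store"]] += row["units"]

# Metadata: information describing the data itself, used to classify,
# locate, and direct queries to it.
metadata = {
    "table": "store_sales",
    "columns": {"store": "text", "day": "date", "units": "integer"},
    "source": "point-of-sale feed",
    "last_loaded": date(2024, 3, 2).isoformat(),
}

print(dict(units_per_store), metadata["columns"])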
Access Tools
Users interact with the gathered information through different tools and technologies. They can
analyze the data, gather insight, and create reports.
• Reporting tools. They play a crucial role in understanding how your business is doing and
what should be done next. Reporting tools include visualizations such as graphs and charts
showing how data changes over time.
• OLAP tools. Online analytical processing tools which allow users to analyze
multidimensional data from multiple perspectives. These tools provide fast processing and
valuable analysis. They extract data from numerous relational data sets and reorganize it into
a multidimensional format.
• Data mining tools. Examine data sets to find patterns within the warehouse and the
correlation between them. Data mining also helps establish relationships when analyzing
multidimensional data.
Data Marts
A Data Mart is a subset of an organizational data store, generally oriented to a specific purpose or
primary data subject, and it may be distributed to support business needs. Data Marts are analytical
record stores designed to focus on particular business functions for a specific community within an
organization. Data marts are derived from subsets of data in a data warehouse, though in the bottom-
up data warehouse design methodology, the data warehouse is created from the union of
organizational data marts.
The fundamental use of a data mart is Business Intelligence (BI) applications. BI is used to
gather, store, access, and analyze records. A data mart can be used by smaller businesses to utilize
the data they have accumulated, since it is less expensive than implementing a full data warehouse.
Data marts allow you to have multiple groups within the system by segmenting the data in the
warehouse into categories. A data mart partitions data, making it available to a particular user group.
For instance, you can use data marts to categorize information by departments within the company.
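One simple way to carve such a department-level data mart out of a warehouse table is with a database view. The sketch below uses SQLite and hypothetical table and column names purely for illustration; commercial warehouses expose the same idea through views or separate schemas.

# Hypothetical sketch: a "marketing" data mart exposed as a view over a
# warehouse-wide sales table (SQLite used here only for illustration).
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE sales (
                   id INTEGER PRIMARY KEY,
                   department TEXT, product TEXT, revenue REAL)""")
con.executemany(
    "INSERT INTO sales (department, product, revenue) VALUES (?, ?, ?)",
    [("marketing", "campaign-A", 500.0),
     ("finance", "audit", 900.0),
     ("marketing", "campaign-B", 750.0)],
)

# The data mart: only the rows the marketing team cares about.
con.execute("""CREATE VIEW marketing_mart AS
               SELECT product, revenue FROM sales
               WHERE department = 'marketing'""")

print(con.execute("SELECT * FROM marketing_mart").fetchall())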
Purpose:
Here are key differences between data lakes and data warehouses:
Key Benefits:
• Data Lake: They integrate different types of data to come up with entirely new questions; these users are not likely to use data warehouses because they may need to go beyond their capabilities.
• Data Warehouse: Most users in an organization are operational. These types of users only care about reports and key performance metrics.
Data Warehousing Techniques
1. Dimensional Modeling:
• Star Schema: In this technique, data is organized into a central fact table containing
quantitative measures and surrounding dimension tables that describe the context of the
measures (a minimal code sketch appears after this list of techniques).
• Snowflake Schema: It's an extension of the star schema, where dimension tables are
normalized into multiple related tables, reducing redundancy.
2. Data Integration:
• ETL (Extract, Transform, Load): ETL processes involve extracting data from source
systems, transforming it into a suitable format, and then loading it into the data warehouse.
• ELT (Extract, Load, Transform): ELT reverses the ETL process by first loading data
into the data warehouse and then transforming it as needed.
3. Data Storage:
• Data Warehouses: Traditional data warehousing systems store data in structured
databases optimized for analytical queries.
• Data Lakes: These store data in its raw, unstructured form, and offer flexibility for storing
both structured and unstructured data.
4. Data Processing:
• Batch Processing: Data is processed in batches at scheduled intervals, which is suitable for
historical reporting.
• Real-time Processing: Data is processed as it arrives, allowing for near real-time analytics
and decision-making.
5. Data Partitioning and Indexing:
• Partitioning: Data can be divided into partitions based on specific criteria like date, region,
or product. This enhances query performance and maintenance.
• Indexing: Indexes are created on columns to speed up data retrieval operations.
6. Data Compression and Archiving:
• Data Compression: Reduces the storage requirements of data while maintaining query
performance.
• Data Archiving: Moves older, less frequently accessed data to lower-cost storage to
optimize costs.
7. Data Security and Governance:
• Techniques to ensure data privacy, compliance with regulations (e.g., GDPR), and access
controls.
8. Data Quality and Cleansing:
• Ensuring data accuracy and consistency through processes like data profiling, data cleansing,
and data validation.
9. Scalability and Performance Optimization:
• Techniques like sharding, clustering, and distributed computing to scale data warehouses for
increased performance.
10. Cloud Data Warehousing:
• Utilizing cloud-based platforms and services like Amazon Redshift, Google BigQuery, or
Snowflake for flexible and scalable data warehousing.
11. Hybrid Data Warehousing:
• Combining on-premises and cloud-based data warehousing to leverage existing investments
while benefiting from cloud scalability.
12. Data Visualization and Reporting:
• Tools and techniques for creating dashboards, reports, and data visualizations to make data
insights accessible to business users.
13. Data Warehouse Automation:
• Using automation tools to accelerate the design, development, and maintenance of data
warehouses.
14. Data Warehouse as a Service (DWaaS):
• Outsourcing data warehousing to third-party providers who manage the infrastructure and
maintenance.
15. Streaming Data Warehousing:
• Handling real-time data streams for immediate analysis and decision-making.
The choice of data warehousing techniques depends on factors like data volume, complexity,
performance requirements, budget, and the specific needs of the organization. Data warehousing is
an evolving field, with new techniques and technologies continually emerging to meet the growing
demands of data-driven businesses.
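As referenced under the Star Schema technique above, the following is a minimal sketch of a star schema: one fact table holding measures, joined to two dimension tables that give those measures context. Table and column names are hypothetical, and SQLite is used only to keep the example self-contained.

# Minimal star schema sketch: a central fact table (sales_fact) with
# quantitative measures, surrounded by dimension tables that give the
# measures context. Names are hypothetical.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_date (
    date_key    INTEGER PRIMARY KEY,
    full_date   TEXT,
    month       TEXT,
    year        INTEGER
);
CREATE TABLE dim_product (
    product_key INTEGER PRIMARY KEY,
    name        TEXT,
    category    TEXT
);
CREATE TABLE sales_fact (
    date_key    INTEGER REFERENCES dim_date(date_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    units_sold  INTEGER,
    revenue     REAL
);
""")

con.execute("INSERT INTO dim_date VALUES (20240301, '2024-03-01', '2024-03', 2024)")
con.execute("INSERT INTO dim_product VALUES (1, 'Notebook', 'Stationery')")
con.execute("INSERT INTO sales_fact VALUES (20240301, 1, 12, 360.0)")

# A typical analytical query joins the fact table to its dimensions.
query = """
SELECT d.month, p.category, SUM(f.revenue) AS total_revenue
FROM sales_fact f
JOIN dim_date d    ON f.date_key = d.date_key
JOIN dim_product p ON f.product_key = p.product_key
GROUP BY d.month, p.category
"""
print(con.execute(query).fetchall())

A snowflake schema would extend this sketch by normalizing a dimension further, for example splitting the product category out of dim_product into its own related table.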
Data Integration
Data integration is the process of combining data from multiple sources into a single, unified view.
• It is done to provide data in a specific view as requested by users, applications, etc.
• The bigger the organization gets, the more data there is and the more that data needs integration.
• The need for data integration increases with the need for data sharing.
• Data integration is a key component of data-driven decision-making and the success of a
business.
Here's a comparison of some common techniques for data integration with regards to data
warehousing:
Data Integration Techniques
The following are the technologies used for data integration:
1. Data Interchange
a. It is the structured transmission of organizational data between two or more organizations
through electronic means; it is used for the transfer of electronic documents from one computer
system to another (i.e., from one corporate trading partner to another).
b. Data interchange must not be seen merely as email. For instance, organizations might want to
do away with bills of lading (or even checks) and use appropriate EDI messages instead.
2. Object Brokering
a. An ORB (Object Request Broker) is a type of middleware software. It gives
programmers the freedom to make calls from one computer to another over a computer
network.
b. It handles the transformation of in-process data structures to and from byte sequences for
transmission over the network.
3. Modeling Techniques: There are two logical design techniques:
a. ER Modeling: Entity Relationship (ER) Modeling is a logical design technique whose main focus
is to reduce data redundancy. It is basically used for transaction capture and can contribute to the
initial stages of constructing a data warehouse. The reduction in data redundancy solves the
problems of inserting, deleting, and updating data, but it leads to yet another problem: in our bid to
keep redundancy to the minimum extent possible, we end up creating a whole lot of tables.
This huge number of tables implies dozens of joins between them. The result is a massive spider web
of joins between tables.
What could be the problems posed by ER Modeling?
• End-users find it difficult to comprehend and traverse through the ER model.
• Not many software tools exist which can query a general ER model.
• ER Modeling cannot be used for data warehousing, where the focus is on access performance
and satisfying ad hoc, unanticipated queries.
Example: Consider a library transaction system of a department of DIIT. Every transaction (issue of a
book to a student or return of a book by a student) is recorded. Let us draw an ER model to represent
the above-stated scenario.
Steps to drawing an ER model:
• Identify entities.
• Identify relationships between various entities.
• Identify the key attribute.
• Identify the other relevant attributes for the entities.
• Draw the ER diagram.
• Review the ER diagram with business users and get their sign-off.
[ER diagram: a Book entity (Book_ID, Book_Name, Author) is related to a Transaction entity (Transaction_ID, Stud_ID, Item_ID, Issue_Return, Issue_Date, Damage_Fine) through an "Issued to" relationship.]
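Purely as a hedged illustration (the notes present the model as a diagram), the library ER model above could be translated into relational tables roughly as follows; the attribute names mirror those in the diagram.

# Rough relational translation of the library ER model sketched above.
# This is an illustration, not the official schema of the notes.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE book (
    book_id   INTEGER PRIMARY KEY,
    book_name TEXT,
    author    TEXT
);
CREATE TABLE book_transaction (              -- one row per issue/return
    transaction_id INTEGER PRIMARY KEY,
    stud_id        INTEGER,
    book_id        INTEGER REFERENCES book(book_id),
    issue_return   TEXT,                     -- 'ISSUE' or 'RETURN'
    issue_date     TEXT,
    damage_fine    REAL
);
""")
con.execute("INSERT INTO book VALUES (101, 'Data Warehousing Basics', 'A. Kumar')")
con.execute(
    "INSERT INTO book_transaction VALUES (1, 501, 101, 'ISSUE', '2024-03-01', 0.0)")
print(con.execute("""SELECT b.book_name, t.stud_id, t.issue_return
                     FROM book_transaction t JOIN book b
                       ON t.book_id = b.book_id""").fetchall())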
b. Dimensional Modeling: The second logical design technique organizes data around measurable facts and their descriptive dimensions (as in the star schema discussed earlier).
• If designed appropriately, it can give quick responses to ad hoc queries for information.
Q) What are Data Warehousing Tools?
A) Data warehousing tools are software platforms designed to facilitate the process of creating,
managing, and querying data warehouses. These tools are essential for organizations looking to
store, consolidate, and analyze large volumes of data from various sources to support business
intelligence, reporting, and data analytics. Here are some popular data warehousing tools:
1. Amazon Redshift: A fully managed, petabyte-scale data warehouse service offered by AWS.
It's known for its scalability, cost-effectiveness, and integration with other AWS services.
2. Snowflake: A cloud-based data warehousing platform that provides features like data
sharing, data lakes integration, and support for structured and semi-structured data.
3. Google BigQuery: Google Cloud's data warehousing solution that allows you to run super-
fast SQL queries on large datasets. It's serverless and integrates well with other Google Cloud
services.
4. Microsoft Azure Synapse Analytics (formerly SQL Data Warehouse): Part of the
Microsoft Azure ecosystem, Synapse Analytics is designed for data warehousing and analytics
workloads. It supports both data warehousing and big data analytics.
5. Teradata: Teradata offers a powerful on-premises and cloud-based data warehousing
solution known for its performance and scalability. It's often used by enterprises for data
analytics.
6. IBM Db2 Warehouse: IBM's data warehousing solution that supports hybrid cloud
deployments and provides advanced analytics capabilities.
7. Oracle Exadata: Oracle's engineered system for data warehousing and analytics. It offers a
combination of hardware and software optimized for performance and scalability.
8. SAP BW/4HANA: SAP's data warehousing solution, built on the HANA in-memory database
platform. It's designed for real-time data processing and analytics.
9. Yellowbrick Data: A data warehouse platform known for its high performance and hybrid
cloud capabilities. It's designed for data-intensive workloads.
10. Vertica: A columnar database and data warehousing platform known for its speed and
scalability, especially for real-time analytics.
11. Couchbase: While primarily known as a NoSQL database, Couchbase offers a multi-
dimensional scaling feature that allows it to be used as a data warehousing solution for JSON
and semi-structured data.
12. Exasol: A high-performance, in-memory data warehousing solution known for its speed and
efficiency in processing large volumes of data.
13. Actian Avalanche: A cloud-native data warehousing platform designed for high-speed
analytics and data integration.
14. Panoply: A cloud data platform that automates the ETL (Extract, Transform, Load) process
and offers a data warehouse as a service.
15. HPE Vertica: Hewlett Packard Enterprise's data warehousing and analytics platform known
for its speed and scalability.
When choosing a data warehousing tool, organizations should consider factors such as
scalability, cost, integration capabilities, security, and the specific needs of their data analytics
projects. Many organizations also opt for cloud-based data warehousing solutions due to their
flexibility and scalability, but on-premises options are still prevalent for certain use cases.
Q) What is a Relational Data Warehouse?
A) It is called relational because it is based on the relational model, a widely used approach to data
representation and organization for databases.
In the relational model, data is organized into tables (also known as relations, hence the name).
These tables consist of rows and columns, where each row represents an entity (such as a customer
or product), and each column represents an attribute of that entity (like name, price, or quantity).
It is called a data warehouse because it collects, stores, and manages massive volumes of structured
data from various sources, such as transactional databases, application systems, and external data
feeds.
In a relational data warehouse, you will do a lot of work up front to get the data to where you can use
it to create reports. Doing all this work beforehand is a design and implementation methodology
referred to as a top-down approach. This approach works well for historical-type reporting, in
which you’re trying to determine what happened (descriptive analytics) and why it happened
(diagnostic analytics).
In the top-down approach, you establish the overall planning, design, and architecture of the data
warehouse first, then develop specific components. This method emphasizes the importance of
defining an enterprise-wide vision and understanding the organization’s strategic goals and
information requirements before diving into the development of the data warehouse.
The major benefits or utilities you can get from using a relational data warehouse:
Q) What is a Data Cube? Explain the utilities of a Data Cube.
A) A data cube refers to a multi-dimensional data structure; that is, data within the data
cube is described by specific dimensional values.
Data cubes are a very convenient tool whenever one needs to build summaries or extract certain
portions of the entire dataset.
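A data cube can be approximated in Python with a pivot table: each axis of the pivot is a dimension and the cell values are aggregated measures. The sketch below uses the pandas library with invented data; the margins option adds roll-up totals, which correspond to the summaries a cube makes easy.

# A small "cube" over dimensions (region, product, quarter) with one
# measure (sales). margins=True adds roll-up totals. Data and names
# are hypothetical.
import pandas as pd

sales = pd.DataFrame({
    "region":  ["North", "North", "South", "South", "South"],
    "product": ["Pen", "Book", "Pen", "Book", "Pen"],
    "quarter": ["Q1", "Q1", "Q1", "Q2", "Q2"],
    "sales":   [100, 150, 90, 200, 120],
})

cube = pd.pivot_table(
    sales,
    values="sales",
    index="region",                   # one dimension on the rows
    columns=["product", "quarter"],   # two dimensions on the columns
    aggfunc="sum",
    fill_value=0,
    margins=True,                     # roll-up totals ("All")
)
print(cube)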
In-Memory Databases
An in-memory database keeps data in main memory (RAM) rather than on disk, which gives it the following characteristics.
Low latency, providing real-time responses
Latency is the lag between the request to access data and the application's response. In-memory
databases offer predictable low latencies irrespective of scale. They deliver microsecond read
latency, single-digit millisecond write latency, and high throughput.
As a result, in-memory storage allows enterprises to make data-based decisions in real-time. You
can design applications that process data and respond to changes before it's too late. For example,
in-memory computing of sensor data from self-driving vehicles can give the desired split-second
response time for emergency braking.
High throughput
In-memory databases are known for their high throughput. Throughput refers to the number of read
(read throughput) or write (write throughput) operations over a given period of time. Examples
include bytes/minute or transactions per second.
High scalability
You can scale your in-memory database to meet fluctuating application demands. Both write and
read scaling is possible without adversely impacting performance. The database stays online and
supports read-and-write operations during resizing.
In-memory databases can find their place in many different scenarios. Some of the typical use cases
could include:
• IoT data: IoT sensors can provide large amounts of data. An in-memory database could be
used for storing and computing data to later be stored in a traditional database.
• E-commerce: Some parts of e-commerce applications, such as the shopping cart, can be
stored in an in-memory database for faster retrieval on each page view, while the product
catalogue could be stored in a traditional database.
• Gaming: Leaderboards require quick updates and fast reads when millions of players are
accessing a game at the same time. In-memory databases can help to sort the results more
quickly than traditional databases (see the sketch after this list).
• Session management: In stateful web applications, a session is created to keep track of a
user's identity and recent actions. Storing this information in an in-memory database avoids a
round trip to the central database with each web request.
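As mentioned in the gaming use case above, here is a hedged sketch of a leaderboard kept in an in-memory store. It assumes a locally running Redis server and the redis-py client; the key and player names are made up for illustration.

# Leaderboard sketch using a Redis sorted set (an in-memory data store).
# Assumes a Redis server on localhost:6379 and the redis-py package.
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Record (or update) player scores; a sorted set keeps members ordered
# by score, so reads stay fast even with many players.
r.zadd("game:leaderboard", {"alice": 1520, "bob": 1480, "chen": 1610})

# Top three players, highest score first.
top = r.zrevrange("game:leaderboard", 0, 2, withscores=True)
print(top)   # e.g. [('chen', 1610.0), ('alice', 1520.0), ('bob', 1480.0)]

The same in-memory pattern serves the session management use case: the session object is stored under a key with a short expiry, so each web request avoids a round trip to the central database.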