Data Warehouse Week 1

This document provides an overview of key concepts related to data warehousing, including: 1. The difference between databases and data warehouses, with data warehouses being optimized for analysis rather than transactions. 2. Important features of data warehouses such as being subject-oriented, integrated, time-variant, and non-volatile. 3. The typical architecture of a data warehouse, including source systems, ETL processes, the data warehouse database, metadata management, and data access tools. 4. The role of data staging in the ETL process and how it facilitates data integration, cleansing, and transformation before loading into the data warehouse.

Uploaded by

bsit fall20
Copyright
© All Rights Reserved

Data warehouse

Week 1
Content
1. Introduction to data warehouse
2. Difference between database and data warehouse
3. Brief history of data warehouse
4. Features of data warehouse
5. Architecture of data warehouse
6. Data staging and ETL
7. Multidimensional model
8. OLAP, ROLAP, MOLAP
9. Metadata and accessing the data warehouse
Introduction to data warehouse
Data warehouse

A data warehouse is a large, centralized repository that stores integrated, historical
data from various sources within an organization. It is designed to support business
intelligence (BI) activities, including data analysis, reporting, and decision-making.
Data warehousing provides a structured and efficient way to manage and access
data for strategic and tactical decision-making processes.
Difference between database and data warehouse
Database
• A database is a system designed to efficiently store, manage, and retrieve structured data. It is
primarily used for transactional operations, where data is constantly being added, updated, and
deleted, such as in an online transaction processing (OLTP) system. Databases are optimized for
quick and reliable access to individual records.
• Databases store data in a normalized structure to minimize data redundancy and ensure data
integrity. This means that data is organized into separate tables, and related information is
spread across various tables, requiring joins to retrieve complete information.
• Databases are used for applications that require quick access to real-time data, such as handling
online transactions, managing inventory, or supporting user interactions on websites or applications.
• Databases are designed for high-performance transaction processing. They prioritize fast data
insertion, updates, and retrieval for individual records.
Data warehouse
• A data warehouse, on the other hand, is a large and centralized repository that stores historical
and aggregated data from multiple sources. It is used for analytical operations and is optimized
for complex queries and data analysis, making it suitable for online analytical processing (OLAP) tasks.
• Data warehouses typically use a denormalized structure to optimize query performance. Data from
different sources is transformed and integrated into a single, unified data model (typically a star
or snowflake schema) that allows for easier and faster analysis.
• Data warehouses are used by business analysts, data scientists, and decision-makers to gain insights
from historical data, perform data mining, create reports, and make strategic business decisions.
• Data warehouses are optimized for complex queries that involve aggregations, summarizations, and
large-scale data scanning. They prioritize query performance over fast data insertion.
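The contrast between the two access patterns can be sketched with Python's built-in sqlite3 module. This is only an illustration: the orders table and its rows are hypothetical, and a real warehouse would run on a dedicated analytical system rather than an in-memory database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table and rows, used only to contrast the two query styles.
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "alice", 20.0), (2, "bob", 35.0), (3, "alice", 15.0)],
)

# OLTP-style access: update and fetch one record by its key.
conn.execute("UPDATE orders SET amount = 25.0 WHERE order_id = 1")
single_order = conn.execute(
    "SELECT customer, amount FROM orders WHERE order_id = 1"
).fetchone()

# OLAP-style access: scan many rows and aggregate them.
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
```

The first query touches a single row through its primary key, while the second reads every row to produce a summary, which is the workload a warehouse is tuned for.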
Important features of DWH
1. Subject-Oriented: A data warehouse is organized around specific business subjects or areas, such as sales, customers, and
products. It provides a consolidated view of data related to these subjects, making it easier for users to understand and analyze.

2. Integrated: Data from various operational systems is extracted, transformed, and loaded into the data warehouse, ensuring that
the data is consistent and standardized across the entire organization.

3. Time-Variant: A data warehouse retains historical data, allowing analysts to perform trend analysis and track changes over time.

4. Non-Volatile: Once data is stored in the data warehouse, it is not altered. New data is added through periodic updates, preserving
the historical record.
Data warehouse architecture
The architecture of a data warehouse typically consists of the following components:
1. Source Systems: Data warehouses gather data from multiple source systems such as
transactional databases, spreadsheets, CRM systems, ERP systems, flat files, and more.
These sources can be both internal and external to the organization. Data extraction
from these systems is the initial step in the data warehouse architecture.
2. ETL (Extract, Transform, Load): The ETL process is fundamental in data warehousing. It involves
three main steps:
• Extract: Data is extracted from the source systems and moved to a staging area. The staging area acts
as an intermediary between the source systems and the data warehouse. Data is often extracted
incrementally to keep the warehouse up to date without overloading the source systems.
• Transform: In this step, data is cleaned, integrated, and transformed into a consistent format. Data
quality checks are performed to identify and rectify errors, missing values, or inconsistencies.
• Load: The transformed data is loaded into the data warehouse, where it is organized into fact
and dimension tables, ready for analytical querying.
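The three ETL steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production pipeline: the source rows, the fact_sales table, and the function names are all hypothetical, and the incremental extract is simulated by comparing an ISO-formatted "updated" date against the date of the last load.

```python
import sqlite3

# Hypothetical source rows; a real extract would read from an operational system.
SOURCE_ROWS = [
    {"id": 1, "customer": " Alice ", "amount": "20.50", "updated": "2024-01-01"},
    {"id": 2, "customer": "BOB", "amount": None, "updated": "2024-01-02"},
    {"id": 3, "customer": "carol", "amount": "7", "updated": "2024-01-03"},
]

def extract(rows, since):
    """Incremental extract: keep only rows changed after the last load date."""
    return [r for r in rows if r["updated"] > since]

def transform(rows):
    """Standardize customer names and skip rows with a missing amount."""
    clean = []
    for r in rows:
        if r["amount"] is None:
            continue  # a real pipeline would log or quarantine this row
        clean.append((r["id"], r["customer"].strip().lower(), float(r["amount"])))
    return clean

def load(conn, rows):
    """Insert the transformed rows into the warehouse table."""
    conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", rows)
    conn.commit()

warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE fact_sales (id INTEGER, customer TEXT, amount REAL)")
staged = extract(SOURCE_ROWS, since="2023-12-31")
load(warehouse, transform(staged))
loaded = warehouse.execute("SELECT * FROM fact_sales ORDER BY id").fetchall()
```

Here the row with the missing amount is dropped during the transform step, so only two of the three extracted rows reach the warehouse.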
3. Data Warehouse Database: The data warehouse database is the core
component of the architecture. It is a relational or multi-dimensional database that
stores the integrated and transformed data. The two primary architectural models
for data warehouses are:
Relational Model: It uses tables to store data and supports SQL queries.
Common relational database management systems (RDBMS) like Oracle, SQL
Server, and PostgreSQL are often used for this model.
Multi-dimensional Model: This model uses a structure called a data cube, which
allows data to be organized and viewed in multiple dimensions. Online Analytical
Processing (OLAP) tools are used to work with multi-dimensional data.
4. Metadata Management: Metadata is data about data and is crucial in
understanding the structure, meaning, and relationships within the data
warehouse. Metadata management involves storing and managing information
about the source systems, data transformations, data lineage, and other crucial
details. It helps users understand the data and ensures data governance.
5. Data Access Layer: The data access layer provides various tools and interfaces
for end-users to interact with the data warehouse. Common tools include Business
Intelligence (BI) platforms, reporting tools, data visualization tools, and OLAP tools.
These tools allow users to run queries, generate reports, and gain insights from the
data stored in the warehouse.
Data Warehouse Architecture w/staging
1. Data Staging: Data staging is an intermediate area where data from various source systems is
temporarily stored before being loaded into the data warehouse. The staging area plays a crucial role in
the ETL process, and it offers several benefits:
• Data Integration: The staging area facilitates the integration of data from different source systems,
which may have varying formats, structures, or data definitions. Here, data is transformed and
standardized to ensure consistency across the entire data warehouse.
• Data Cleansing: The staging area is where data cleansing and data quality checks take place. This
involves identifying and resolving issues like missing values, duplicate records, or data inconsistencies.
• Performance Optimization: By decoupling the data extraction process from the data warehouse, the
staging area reduces the impact on the operational systems. It also enables parallel processing of
data, enhancing performance during the ETL phase.
• Data Transformation: The staging area allows for complex data transformations, including data
aggregation, joining, and other operations necessary to prepare data for loading into the data
warehouse.
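The integration and cleansing work done in the staging area might look like the following sketch. The two source systems ("CRM" and "ERP"), their field names, and their date formats are assumptions made up for the example; the point is only that differently shaped inputs are standardized and de-duplicated before loading.

```python
from datetime import datetime

# Hypothetical extracts from two source systems with different conventions.
crm_rows = [{"email": "A@X.COM", "joined": "2024-01-05"},
            {"email": "b@x.com", "joined": "2024-02-10"}]
erp_rows = [{"Email": "a@x.com ", "JoinDate": "05/01/2024"},
            {"Email": "c@x.com", "JoinDate": "11/03/2024"}]

def standardize_crm(row):
    """Map a CRM row to the common staging format."""
    return {"email": row["email"].strip().lower(),
            "joined": datetime.strptime(row["joined"], "%Y-%m-%d").date()}

def standardize_erp(row):
    """Map an ERP row to the common format (assumes day/month/year dates)."""
    return {"email": row["Email"].strip().lower(),
            "joined": datetime.strptime(row["JoinDate"], "%d/%m/%Y").date()}

def deduplicate(rows, key):
    """Keep the first row seen for each key value."""
    seen, unique = set(), []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)
    return unique

staging = [standardize_crm(r) for r in crm_rows] + [standardize_erp(r) for r in erp_rows]
customers = deduplicate(staging, key="email")
```

After standardization, "A@X.COM" and "a@x.com " are recognized as the same customer, so the staged result holds three customers instead of four.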
Data Warehouse Architecture w/ staging
2. Data Warehouse: The data warehouse is the central repository that stores integrated, cleansed, and
transformed data. It follows either a relational or multi-dimensional data model, depending on the chosen
architecture. The key components of the data warehouse are:
• Data Integration Layer: This layer contains the transformed data from the staging area, organized into fact
tables and dimension tables. The fact tables contain numerical measures (e.g., sales revenue, quantity sold),
while the dimension tables provide descriptive attributes (e.g., time, product, location) for slicing and dicing
the data.
• Metadata Management: Metadata, which includes data definitions, relationships, and transformation rules, is
vital for understanding and managing the data within the warehouse effectively. Proper metadata
management ensures data lineage, data governance, and easier data exploration.
• Data Access Layer: The data access layer allows users to query and retrieve data from the data warehouse. It
includes various tools such as SQL-based query engines, OLAP tools, Business Intelligence (BI) applications, and
data visualization platforms.
• Scalability and Performance Optimization: To handle the increasing volume of data and user queries, the data
warehouse architecture should be designed with scalability and performance in mind. Techniques like
partitioning, indexing, and caching are often used to enhance performance.
Data Warehouse Architecture w/staging and data mart
3. Data Marts: Data marts are subsets of the data warehouse that focus on specific business areas
or user requirements. They are designed to cater to the needs of different departments or user
groups, providing a more tailored and optimized view of the data. Data marts offer several
advantages:
• Specialization: Data marts are specialized for specific business domains, allowing users to access
relevant data without the complexity of the entire data warehouse. This targeted approach often
improves query performance and user experience.
• Departmental Autonomy: Data marts can be created and managed independently by different
departments, reducing the risk of data conflicts and promoting departmental autonomy.
• Security and Access Control: Data marts can be configured with specific access controls,
ensuring that sensitive information is restricted to authorized users only.
• Performance Optimization: By aggregating data relevant to specific business units, data marts
can be optimized for faster query response times and improved analytical capabilities.
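One simple way to picture a data mart is as a filtered, pre-aggregated view over the warehouse. The sketch below uses an SQL view in sqlite3 for brevity; this is an assumption for illustration only, since real data marts are often separate physical tables or databases. The table, view, and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical warehouse fact table shared by all departments.
conn.execute("CREATE TABLE fact_sales (region TEXT, department TEXT, revenue REAL)")
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", [
    ("north", "marketing", 100.0), ("north", "finance", 40.0),
    ("south", "marketing", 60.0), ("south", "finance", 10.0)])

# The data mart exposes only marketing rows, already aggregated by region.
conn.execute("""
    CREATE VIEW mart_marketing AS
    SELECT region, SUM(revenue) AS revenue
    FROM fact_sales
    WHERE department = 'marketing'
    GROUP BY region
""")
mart_rows = conn.execute(
    "SELECT region, revenue FROM mart_marketing ORDER BY region"
).fetchall()
```

Users of the marketing mart query a small, relevant subset and never see the finance rows, which reflects both the specialization and the access-control benefits listed above.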
Multidimensional model
Multidimensional data model
A multidimensional model of a data warehouse is a data representation technique used
to organize and present data in a way that facilitates efficient and intuitive analysis.
It provides a more flexible and user-friendly approach to data exploration compared to
traditional relational database models.
The multidimensional model is designed to support Online Analytical Processing (OLAP)
operations, enabling users to perform complex queries and data analysis efficiently.
• OLAP (online analytical processing) and data warehousing use multidimensional
databases, which show users multiple dimensions of the data.
• The model represents data in the form of data cubes. A data cube allows data to be
modeled and viewed from many dimensions and perspectives. It is defined by dimensions
and facts and is represented by a fact table. Facts are numerical measures, and the fact
table contains the names of the facts (the measures) along with keys to each of the
related dimension tables.
Key components or features of the multidimensional model include
dimensions, measures, hierarchies, and cubes. Let's delve into each of these
components:
1. Dimensions: Dimensions are the descriptive attributes or characteristics
by which data is organized and categorized in a data warehouse.
They represent the various aspects of business data and form the axes of
the multidimensional model.
• Common dimensions in a business context include time, geography,
product, customer, and more. For example, in a sales data warehouse,
"Time" could be a dimension with attributes like year, quarter, month, and
day.
2. Measures: Measures are the numerical values or metrics that
represent the actual data in a data warehouse.
• These are the data points that users want to analyze. Measures are
typically aggregated and are often expressed as sums, averages,
counts, or other mathematical functions.
• In a sales data warehouse, measures could include "Revenue,"
"Quantity Sold," and "Profit."
3. Hierarchies: Hierarchies represent the drill-down paths or levels of
granularity within each dimension. They allow users to navigate
from summarized data to more detailed data or vice versa.
For example, the "Time" dimension hierarchy could have levels such
as Year > Quarter > Month > Day, enabling users to drill down from
annual data to daily data.
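The Time hierarchy can be sketched as a set of functions that map a date to its key at each level, so the same daily data can be summarized at any granularity. The sales figures and the names used here are hypothetical and serve only to illustrate the idea.

```python
from collections import defaultdict
from datetime import date

# Hypothetical daily sales: (day, revenue).
daily_sales = [(date(2024, 1, 15), 10.0), (date(2024, 2, 3), 5.0),
               (date(2024, 4, 20), 8.0), (date(2025, 1, 1), 2.0)]

# Each level maps a day to its key at that level of the Time hierarchy.
LEVELS = {
    "year": lambda d: (d.year,),
    "quarter": lambda d: (d.year, (d.month - 1) // 3 + 1),
    "month": lambda d: (d.year, d.month),
    "day": lambda d: (d.year, d.month, d.day),
}

def aggregate(sales, level):
    """Sum revenue at the requested level of the hierarchy."""
    totals = defaultdict(float)
    for day, revenue in sales:
        totals[LEVELS[level](day)] += revenue
    return dict(totals)

by_year = aggregate(daily_sales, "year")
by_quarter = aggregate(daily_sales, "quarter")
```

Moving from by_year to by_quarter is a drill-down; moving in the other direction is a roll-up.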
4. Cubes: A cube is a core concept in the multidimensional model. It is
a data structure that combines dimensions and measures to create
a multidimensional representation of data.
A cube allows users to perform slicing, dicing, drilling, and pivoting
operations for analytical purposes. A cube can be thought of as a
multidimensional array (three-dimensional in the simplest picture, although
a cube may have more dimensions), with dimensions on the axes and measures at
the intersecting points.
• Each cell in the cube holds a specific aggregated value. OLAP tools
utilize cubes to process complex queries and deliver fast query
response times.
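A cube can be approximated in plain Python as a dictionary whose keys are combinations of dimension values and whose values are the aggregated measure. The records, dimension names, and the slice_cube helper below are hypothetical; dedicated OLAP engines store cubes in far more compact structures.

```python
from collections import defaultdict

# Hypothetical sales records: (item, quarter, city, amount).
records = [("phone", "Q1", "Delhi", 5), ("phone", "Q1", "Delhi", 3),
           ("phone", "Q2", "Mumbai", 4), ("laptop", "Q1", "Delhi", 7)]

DIMENSIONS = ("item", "quarter", "city")

# Build the cube: one cell per combination of dimension values.
cube = defaultdict(int)
for item, quarter, city, amount in records:
    cube[(item, quarter, city)] += amount

def slice_cube(cube, dimension, value):
    """Fix one dimension to a single value and drop it from the cell keys."""
    axis = DIMENSIONS.index(dimension)
    return {key[:axis] + key[axis + 1:]: total
            for key, total in cube.items() if key[axis] == value}

q1 = slice_cube(cube, "quarter", "Q1")
```

Each cell holds an aggregated value (the two phone sales in Delhi for Q1 are summed into one cell), and slicing on the quarter leaves a two-dimensional table of item by city.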
Working on a Multidimensional Data Model
A multidimensional data model is built by working through a fixed sequence of stages.
Every project should follow these stages when building a multidimensional data model:
Stage 1: Assembling data from the client: In the first stage, the correct data is
collected from the client. Software professionals typically explain to the client the
range of data that can be obtained with the selected technology, and then collect
the complete data in detail.
Stage 2: Grouping different segments of the system: In the second stage, all the data
is identified and classified into the sections it belongs to, so that the model can be
applied step by step without problems.
Stage 3: Noticing the different proportions: The third stage forms the basis of the
system's design. Here the main factors are identified from the user's point of view.
These factors are known as "dimensions".
Stage 4: Preparing the actual-time factors and their respective qualities: In the
fourth stage, the factors identified in the previous stage are used to identify their
related qualities. These qualities are known as "attributes" in the database.
Stage 5: Finding the actuality of the factors listed previously and their qualities:
In the fifth stage, the facts are separated and distinguished from the factors collected
so far. These facts play a significant role in the arrangement of a multidimensional
data model.
Stage 6: Building the schema to place the data, with respect to the information
collected in the steps above: In the sixth stage, a schema is built on the basis of the
data collected previously.
For example:
1. Take the example of a firm. The firm's revenue can be analyzed on the basis of
different factors such as the geographical location of the firm's workplace, the
firm's products, the advertisements run, and the time taken to develop a product.
2. Take the example of a factory that sells products per quarter in Bangalore.
The data is represented in the table given below:

2D factory data
In the presentation above, the factory's sales for Bangalore are shown with respect to the time dimension, which is
organized into quarters, and the item dimension, which is sorted according to the kind of item sold. The facts are
represented in rupees (in thousands).
Now suppose we want to view the sales data in three dimensions. The data can still be represented as a series of
two-dimensional tables. Let us consider the data according to item, time, and location (such as Kolkata, Delhi,
Mumbai). Here is the table:

3D data representation as 2D
This data can be represented in the form of three dimensions conceptually, which is shown in the image below :

3D data representation
Fact Table:
Definition: A fact table is a central table in a data warehouse that stores
quantitative and numeric data, representing the measurable business events or
transactions. It contains the metrics or measures that provide insights into
business performance. Fact tables are typically large and contain a massive
amount of data.
Structure: A fact table consists of foreign keys that link to the dimensional tables
(described below) and the actual measurements or facts associated with those
combinations of dimensions.
Examples: In a retail data warehouse, a fact table may contain data such as
"Sales Revenue," "Quantity Sold," "Profit," "Discount Amount," and "Tax
Amount." Each row in the fact table represents a specific business event or
transaction.
Dimensional Table:
Definition: A dimensional table, also known as a dimension table, stores
descriptive attributes or characteristics that provide context to the data in the
fact table. Dimensional tables contain textual or categorical data that help
categorize and organize the facts.
Structure: A dimensional table typically consists of a primary key (surrogate
key) and descriptive attributes. The primary key uniquely identifies each
record in the table and acts as a reference point for the fact table.
Examples: In a retail data warehouse, dimensional tables may include "Time"
(with attributes like year, quarter, month, day, etc.), "Product" (with attributes
like category, brand, price, etc.), "Customer" (with attributes like name, age,
location, etc.), and "Store" (with attributes like location, size, etc.).
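The relationship between a fact table and its dimension tables can be sketched with sqlite3. All table names, column names, and rows below are hypothetical; the sketch only shows the structure described above, with foreign keys plus measures in the fact table and descriptive attributes in the dimension tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables: a surrogate primary key plus descriptive attributes.
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, year INTEGER, quarter INTEGER);
    -- Fact table: foreign keys to the dimensions plus numeric measures.
    CREATE TABLE fact_sales (
        product_key INTEGER REFERENCES dim_product(product_key),
        date_key INTEGER REFERENCES dim_date(date_key),
        quantity INTEGER,
        revenue REAL
    );
    INSERT INTO dim_product VALUES (1, 'milk', 'dairy'), (2, 'bread', 'bakery');
    INSERT INTO dim_date VALUES (10, 2024, 1), (11, 2024, 2);
    INSERT INTO fact_sales VALUES (1, 10, 2, 3.0), (1, 11, 1, 1.5), (2, 10, 4, 8.0);
""")

# Join the facts to their dimensions and aggregate by descriptive attributes.
revenue_by_category = conn.execute("""
    SELECT p.category, d.quarter, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_product p ON p.product_key = f.product_key
    JOIN dim_date d ON d.date_key = f.date_key
    GROUP BY p.category, d.quarter
    ORDER BY p.category, d.quarter
""").fetchall()
```

The fact table carries only keys and numbers; the category and quarter labels used for grouping come from the dimension tables through the joins.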
Fact table with dimensional table
Star schema
Example (data of grocery store sales transactions)
Sales all in one table
• Duplication of data
• Normalization
Customer and items table
Sales table
• Just id used (normalized)
Date table
Star schema
Snowflake schema
OLAP (Online Analytical Processing)
OLAP is a category of software tools and technologies used to perform
complex and multidimensional analysis of data. The main focus of OLAP
is to enable users to quickly and interactively explore and analyze large
datasets from multiple perspectives, facilitating deeper insights and
faster decision-making.
• OLAP systems allow users to perform operations like slicing (fixing one
dimension to a single value to select a subset of the data), dicing (selecting
a sub-cube by restricting two or more dimensions), drilling down
(navigating from summarized to more detailed data), and rolling up
(aggregating data to a higher level).
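The four operations can be illustrated on a small list of records. This is a sketch under simple assumptions: the rows, the column names, and the select and group_total helpers are hypothetical, and real OLAP tools apply the same ideas to much larger cubes.

```python
from collections import defaultdict

# Hypothetical rows: (year, quarter, item, city, sales).
rows = [(2024, "Q1", "phone", "Delhi", 5), (2024, "Q1", "laptop", "Delhi", 7),
        (2024, "Q2", "phone", "Mumbai", 4), (2024, "Q2", "laptop", "Delhi", 6)]
COLUMNS = ("year", "quarter", "item", "city")

def select(rows, **conditions):
    """Slice when one dimension is fixed; dice when several are restricted."""
    return [r for r in rows
            if all(r[COLUMNS.index(dim)] in allowed
                   for dim, allowed in conditions.items())]

def group_total(rows, *dims):
    """Aggregate sales at the granularity given by dims."""
    totals = defaultdict(int)
    for r in rows:
        totals[tuple(r[COLUMNS.index(d)] for d in dims)] += r[4]
    return dict(totals)

slice_q1 = select(rows, quarter={"Q1"})                        # slice
dice = select(rows, item={"phone", "laptop"}, city={"Delhi"})  # dice
rolled_up = group_total(rows, "year")                          # roll up
drilled_down = group_total(rows, "year", "quarter")            # drill down
```

Slicing and dicing choose which cells to look at, while rolling up and drilling down change the level of detail at which the same cells are summarized.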
ROLAP (Relational Online Analytical Processing)
ROLAP is an OLAP implementation that stores data in a traditional
relational database management system (RDBMS).
Unlike MOLAP, which precalculates and stores aggregations in a
multidimensional cube format, ROLAP systems generate queries on
the fly and directly access the relational database to retrieve the
required data.
ROLAP allows for more flexibility in terms of data sources, as it can
work with a wide range of database systems, but it may be slower
than MOLAP when dealing with large datasets or complex queries.
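The "queries on the fly" idea can be sketched as a function that turns a requested set of dimensions into an SQL statement and runs it against the relational table. The sales table, its columns, and the build_query helper are assumptions for illustration; actual ROLAP engines generate considerably more elaborate SQL.

```python
import sqlite3

ALLOWED_DIMENSIONS = {"item", "city", "quarter"}  # hypothetical column names

def build_query(dimensions):
    """Translate a requested list of dimensions into a GROUP BY query."""
    if not dimensions:
        raise ValueError("at least one dimension is required")
    unknown = set(dimensions) - ALLOWED_DIMENSIONS
    if unknown:
        raise ValueError(f"unknown dimensions: {sorted(unknown)}")
    cols = ", ".join(dimensions)
    return f"SELECT {cols}, SUM(sales) FROM sales GROUP BY {cols} ORDER BY {cols}"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (item TEXT, city TEXT, quarter TEXT, sales INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", [
    ("phone", "Delhi", "Q1", 5), ("phone", "Mumbai", "Q2", 4), ("laptop", "Delhi", "Q1", 7)])

# Nothing is precomputed: each request becomes SQL run against the base table.
by_city = conn.execute(build_query(["city"])).fetchall()
```

Because every request scans the base table, any combination of dimensions can be answered without preparation, at the cost of doing the aggregation work at query time. Column names are checked against a fixed set before being placed in the SQL text.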
MOLAP (Multidimensional Online Analytical Processing)
MOLAP is another OLAP implementation, but it stores data in a
specialized multidimensional database format.
This format, often referred to as a "cube," precalculates and stores
aggregations for various dimensions and hierarchies, resulting in
faster query response times compared to ROLAP.
MOLAP systems are optimized for analytical processing and can
efficiently handle large volumes of data, making them well-suited for
complex analysis and reporting tasks.
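Precalculation can be sketched by computing the totals for every subset of dimensions ahead of time, so that a later query is a lookup rather than a scan. The rows, dimension names, and the precompute function are hypothetical, and real MOLAP engines use specialized storage rather than Python dictionaries.

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("item", "city", "quarter")
# Hypothetical rows: (item, city, quarter, sales).
rows = [("phone", "Delhi", "Q1", 5), ("phone", "Mumbai", "Q2", 4),
        ("laptop", "Delhi", "Q1", 7)]

def precompute(rows):
    """Store totals for every subset of dimensions before any query arrives."""
    aggregates = {}
    for size in range(len(DIMENSIONS) + 1):
        for dims in combinations(DIMENSIONS, size):
            idx = [DIMENSIONS.index(d) for d in dims]
            totals = defaultdict(int)
            for row in rows:
                totals[tuple(row[i] for i in idx)] += row[3]
            aggregates[dims] = dict(totals)
    return aggregates

cube = precompute(rows)
# Queries are now dictionary lookups rather than scans of the rows.
delhi_total = cube[("city",)][("Delhi",)]
grand_total = cube[()][()]
```

With three dimensions there are already eight groupings to store, which hints at the trade-off: faster queries are paid for with extra storage and preparation time.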
HOLAP (Hybrid Online Analytical Processing)
HOLAP uses a hybrid approach. It leverages both MOLAP and ROLAP techniques to balance
the performance and flexibility of data analysis:
Aggregation: Aggregated and frequently used data is stored in a MOLAP format,
which provides high query performance for common operations.
On-Demand Aggregation: Less frequently used or more detailed data is stored in a
ROLAP format, allowing for on-the-fly aggregation when required, at some cost in
performance.
HOLAP systems aim to strike a balance between query speed and storage efficiency. They
provide faster response times for typical queries while still being able to handle more
complex and ad hoc queries, which may be slower because they are served by the ROLAP
components.
In summary, HOLAP is a data warehousing approach that combines the strengths of both
MOLAP and ROLAP to optimize query performance and flexibility in business intelligence and
analytical applications.
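The hybrid idea can be sketched as a query function that first checks a small store of precomputed summaries and falls back to aggregating the detail rows when no summary exists. Everything named here (the rows, the precomputed store, the served_from log) is hypothetical and only illustrates the routing decision.

```python
from collections import defaultdict

# Hypothetical detail rows: (item, city, sales). This plays the ROLAP role.
detail_rows = [("phone", "Delhi", 5), ("phone", "Mumbai", 4), ("laptop", "Delhi", 7)]
COLUMNS = ("item", "city")

def aggregate(dims):
    """Aggregate the detail rows on demand for the given dimensions."""
    idx = [COLUMNS.index(d) for d in dims]
    totals = defaultdict(int)
    for row in detail_rows:
        totals[tuple(row[i] for i in idx)] += row[2]
    return dict(totals)

# Precomputed summaries for frequently requested groupings (the MOLAP role).
precomputed = {("city",): aggregate(("city",))}
served_from = []  # records which path answered each query

def query(dims):
    dims = tuple(dims)
    if dims in precomputed:
        served_from.append("molap")
        return precomputed[dims]
    served_from.append("rolap")
    return aggregate(dims)  # fall back to aggregating detail data

by_city = query(["city"])
by_item_city = query(["item", "city"])
```

The common by-city question is answered from the stored summary, while the less common item-by-city question is computed from the detail rows when it is asked.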
