This chapter discusses data warehousing and online analytical processing (OLAP). It defines a data warehouse as a subject-oriented collection of integrated and nonvolatile data used to support management decision making. The chapter describes how data is extracted, transformed, and loaded from various sources into the data warehouse. It also explains how a multidimensional data model represents data as cubes that can be viewed along different dimensions like time or products. Finally, it discusses how metadata is stored to describe the structure and contents of the data warehouse.


Data Mining: Concepts and Techniques (3rd ed.)

— Chapter 4 —

Jiawei Han, Micheline Kamber, and Jian Pei

University of Illinois at Urbana-Champaign &
Simon Fraser University
©2011 Han, Kamber & Pei. All rights reserved.

Chapter 4: Data Warehousing and On-line Analytical Processing

■ Data Warehouse: Basic Concepts
■ Data Warehouse Modeling: Data Cube and OLAP
■ Data Warehouse Design and Usage
■ Data Warehouse Implementation
■ Data Generalization by Attribute-Oriented Induction
■ Summary

What is a Data Warehouse?
■ Defined in many different ways, but not rigorously.
■ A decision support database that is maintained separately from the organization’s operational database
■ Supports information processing by providing a solid platform of consolidated, historical data for analysis
■ “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.” (W. H. Inmon)
■ Data warehousing:
  ■ The process of constructing and using data warehouses

Data Warehouse: Subject-Oriented

■ Organized around major subjects, such as customer, product, sales
■ Focuses on the modeling and analysis of data for decision makers, not on daily operations or transaction processing
■ Provides a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process

Data Warehouse: Integrated

■ Constructed by integrating multiple, heterogeneous data sources
  ■ relational databases, flat files, on-line transaction records
■ Data cleaning and data integration techniques are applied.
  ■ Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
    ■ E.g., hotel price: currency, tax, breakfast covered, etc.
  ■ When data is moved to the warehouse, it is converted.
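The hotel price example can be made concrete with a small sketch of the conversion step. Everything specific here is an assumption made for illustration and does not come from the slides: the record fields, the exchange rates, and the flat breakfast surcharge.

```python
from dataclasses import dataclass

# Hypothetical exchange rates to USD (illustrative values only).
RATES_TO_USD = {"USD": 1.0, "EUR": 1.10, "CAD": 0.75}


@dataclass
class HotelPrice:
    amount: float             # price as quoted by the source
    currency: str             # currency code used by the source
    tax_included: bool        # does the quoted price already include tax?
    tax_rate: float           # tax rate that applies to this source
    breakfast_included: bool  # does the quoted price cover breakfast?


def to_warehouse_price(p: HotelPrice, breakfast_usd: float = 12.0) -> float:
    """Convert a source price to one warehouse convention:
    USD, tax included, breakfast included."""
    usd = p.amount * RATES_TO_USD[p.currency]
    if not p.tax_included:
        usd *= 1.0 + p.tax_rate
    if not p.breakfast_included:
        usd += breakfast_usd  # assumed flat surcharge
    return round(usd, 2)


# Two sources quoting "100" under different conventions.
source_a = HotelPrice(100.0, "USD", tax_included=False, tax_rate=0.10,
                      breakfast_included=True)
source_b = HotelPrice(100.0, "EUR", tax_included=True, tax_rate=0.0,
                      breakfast_included=False)

price_a = to_warehouse_price(source_a)
price_b = to_warehouse_price(source_b)
```

Both sources quote the same number, yet they land on different warehouse values once currency, tax, and breakfast are reconciled. This is why the conversion is applied when data is moved into the warehouse.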

Data Warehouse: Time Variant

■ The time horizon for the data warehouse is significantly longer than that of operational systems
  ■ Operational database: current value data
  ■ Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)
■ Every key structure in the data warehouse
  ■ Contains an element of time, explicitly or implicitly
  ■ But the key of operational data may or may not contain “time element”

Data Warehouse: Nonvolatile

■ A physically separate store of data transformed from the operational environment
■ Operational update of data does not occur in the data warehouse environment
  ■ Does not require transaction processing, recovery, and concurrency control mechanisms
  ■ Requires only two operations in data accessing:
    ■ initial loading of data and access of data

OLTP vs. OLAP

Why a Separate Data Warehouse?
■ High performance for both systems
  ■ DBMS: tuned for OLTP (access methods, indexing, concurrency control, recovery)
  ■ Warehouse: tuned for OLAP (complex OLAP queries, multidimensional view, consolidation)
■ Different functions and different data:
  ■ missing data: decision support (DS) requires historical data which operational DBs do not typically maintain
  ■ data consolidation: DS requires consolidation (aggregation, summarization) of data from heterogeneous sources
  ■ data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled
■ Note: There are more and more systems which perform OLAP analysis directly on relational databases
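
Data Warehouse: A Multi-Tiered Architecture

[Figure: multi-tiered data warehouse architecture, four tiers from left to right]
■ Data Sources: operational DBs and other sources
■ Data Storage: data are extracted, transformed, loaded, and refreshed through a monitor and integrator into the data warehouse and data marts, which are described by metadata
■ OLAP Engine: OLAP server, which serves the front-end tools
■ Front-End Tools: analysis, query, reports, data mining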

Three Data Warehouse Models
■ Enterprise warehouse
  ■ collects all of the information about subjects spanning the entire organization
■ Data mart
  ■ a subset of corporate-wide data that is of value to a specific group of users. Its scope is confined to specific, selected groups, such as a marketing data mart
  ■ Independent vs. dependent (sourced directly from the warehouse) data mart
■ Virtual warehouse
  ■ A set of views over operational databases
  ■ Only some of the possible summary views may be materialized
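The difference between a view and a materialized summary in a virtual warehouse can be sketched with Python's built-in sqlite3 module. The `orders` table, its columns, and the data are hypothetical. SQLite has no native materialized views, so the materialized summary is approximated here by a snapshot table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical operational table
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, region TEXT, amount REAL);
    INSERT INTO orders (region, amount) VALUES
        ('Europe', 120.0), ('Europe', 80.0), ('North_America', 200.0);

    -- Virtual warehouse: a summary view computed on demand
    CREATE VIEW sales_by_region AS
        SELECT region, SUM(amount) AS total FROM orders GROUP BY region;

    -- "Materialized" summary: a stored snapshot of the view (an approximation)
    CREATE TABLE sales_by_region_mat AS SELECT * FROM sales_by_region;
""")

# A new operational update arrives after the snapshot was taken.
conn.execute("INSERT INTO orders (region, amount) VALUES ('Europe', 50.0)")

view_rows = dict(conn.execute("SELECT region, total FROM sales_by_region"))
mat_rows = dict(conn.execute("SELECT region, total FROM sales_by_region_mat"))
```

The view reflects the new order immediately but is recomputed on every query. The snapshot is cheap to read but stays stale until it is refreshed. This trade-off is one reason why only some summary views are materialized.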

Extraction, Transformation, and Loading (ETL)
■ Data extraction
  ■ get data from multiple, heterogeneous, and external sources
■ Data cleaning
  ■ detect errors in the data and rectify them when possible
■ Data transformation
  ■ convert data from legacy or host format to warehouse format
■ Load
  ■ sort, summarize, consolidate, compute views, check integrity, and build indices and partitions
■ Refresh
  ■ propagate the updates from the data sources to the warehouse
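The extraction, cleaning, transformation, and load steps can be sketched as a tiny pipeline. This is a minimal illustration, not a production ETL tool. The two sources, their formats, and the `sales` table are hypothetical, and the refresh step is omitted.

```python
import sqlite3

# Hypothetical raw records from two heterogeneous sources.
SOURCE_A = [
    {"item": "TV", "sold_on": "2011-01-15", "usd": "499.00"},
    {"item": "PC", "sold_on": "2011-02-03", "usd": "n/a"},  # dirty value
]
SOURCE_B = [("VCR", "03/02/2011", 89.5)]  # legacy date format MM/DD/YYYY


def extract():
    """Extraction: gather records from all sources, tagged with their origin."""
    return [("A", r) for r in SOURCE_A] + [("B", r) for r in SOURCE_B]


def clean_and_transform(tagged):
    """Cleaning and transformation: drop rows that cannot be rectified
    and convert everything to one warehouse format."""
    rows = []
    for origin, r in tagged:
        if origin == "A":
            try:
                amount = float(r["usd"])
            except ValueError:
                continue  # unparseable amount, skip the row
            rows.append((r["item"], r["sold_on"], amount))
        else:
            item, date, amount = r
            mm, dd, yyyy = date.split("/")
            rows.append((item, f"{yyyy}-{mm}-{dd}", amount))
    return rows


def load(conn, rows):
    """Load: insert rows sorted by date and build an index."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (item TEXT, sold_on TEXT, dollars_sold REAL)"
    )
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?, ?)", sorted(rows, key=lambda r: r[1])
    )
    conn.execute("CREATE INDEX IF NOT EXISTS idx_sales_date ON sales (sold_on)")


conn = sqlite3.connect(":memory:")
load(conn, clean_and_transform(extract()))
loaded = conn.execute(
    "SELECT item, sold_on, dollars_sold FROM sales ORDER BY sold_on"
).fetchall()
```

The row with the unparseable amount is dropped during cleaning, and the legacy date is rewritten to the warehouse format before loading.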

Metadata Repository
■ Metadata is the data defining warehouse objects. It stores:
  ■ Description of the structure of the data warehouse
    ■ schema, view, dimensions, hierarchies, derived data definitions, data mart locations and contents
  ■ Operational metadata
    ■ data lineage (history of migrated data and transformation path), currency of data (active, archived, or purged), monitoring information (warehouse usage statistics, error reports, audit trails)
  ■ The algorithms used for summarization
  ■ The mapping from operational environment to the data warehouse
  ■ Data related to system performance
    ■ warehouse schema, view and derived data definitions
  ■ Business data
    ■ business terms and definitions, ownership of data, charging policies

From Tables and Spreadsheets to Data Cubes
■ A data warehouse is based on a multidimensional data model which views data in the form of a data cube
■ A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions
  ■ Dimension tables, such as item (item_name, brand, type), or time (day, week, month, quarter, year)
  ■ Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables
■ In data warehousing literature, an n-D base cube is called a base cuboid. The topmost 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid. The lattice of cuboids forms a data cube.

Cube: A Lattice of Cuboids

[Figure: lattice of cuboids for the dimensions time, item, location, supplier]
■ 0-D (apex) cuboid: all
■ 1-D cuboids: time; item; location; supplier
■ 2-D cuboids: (time, item); (time, location); (time, supplier); (item, location); (item, supplier); (location, supplier)
■ 3-D cuboids: (time, item, location); (time, item, supplier); (time, location, supplier); (item, location, supplier)
■ 4-D (base) cuboid: (time, item, location, supplier)
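The lattice can be enumerated mechanically: each cuboid corresponds to one subset of the dimensions, so n dimensions give 2^n cuboids when concept hierarchies are ignored. A minimal sketch follows; the function name `cuboid_lattice` is made up for this example.

```python
from itertools import combinations


def cuboid_lattice(dimensions):
    """Return {k: [all cuboids with k dimensions]} for every subset of dimensions."""
    return {
        k: list(combinations(dimensions, k))
        for k in range(len(dimensions) + 1)
    }


lattice = cuboid_lattice(["time", "item", "location", "supplier"])

# lattice[0] holds the apex cuboid (the empty group-by, i.e. "all");
# lattice[4] holds the single base cuboid.
total = sum(len(level) for level in lattice.values())
```

For the four dimensions on this slide, the levels contain 1, 4, 6, 4, and 1 cuboids, which matches the lattice in the figure.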

Conceptual Modeling of Data Warehouses

■ Modeling data warehouses: dimensions & measures
  ■ Star schema: A fact table in the middle connected to a set of dimension tables
  ■ Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake
  ■ Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation

Example of Star Schema

■ Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
■ time: time_key, day, day_of_the_week, month, quarter, year
■ item: item_key, item_name, brand, type, supplier_type
■ branch: branch_key, branch_name, branch_type
■ location: location_key, street, city, state_or_province, country
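A cut-down version of this star schema can be tried out with Python's built-in sqlite3 module. The sketch keeps only three of the dimension tables and a few of their attributes, and all rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables (subset of the attributes on the slide)
    CREATE TABLE time     (time_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
    CREATE TABLE item     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT);
    CREATE TABLE location (location_key INTEGER PRIMARY KEY, city TEXT, country TEXT);

    -- Fact table: one key per dimension plus the measures
    CREATE TABLE sales (
        time_key     INTEGER REFERENCES time(time_key),
        item_key     INTEGER REFERENCES item(item_key),
        location_key INTEGER REFERENCES location(location_key),
        units_sold   INTEGER,
        dollars_sold REAL
    );

    -- Hypothetical sample data
    INSERT INTO time VALUES (1, 'Q1', 2010), (2, 'Q2', 2010);
    INSERT INTO item VALUES (1, 'TV-A', 'BrandX', 'TV'), (2, 'PC-B', 'BrandY', 'PC');
    INSERT INTO location VALUES (1, 'Vancouver', 'Canada'), (2, 'Toronto', 'Canada');
    INSERT INTO sales VALUES
        (1, 1, 1, 2, 1000.0),
        (1, 1, 2, 1,  500.0),
        (2, 2, 1, 3, 2400.0);
""")

# A typical star query: join the fact table to the dimensions it needs,
# then group by the dimension attributes of interest.
rows = conn.execute("""
    SELECT t.quarter, i.type, SUM(s.dollars_sold)
    FROM sales AS s
    JOIN time AS t ON s.time_key = t.time_key
    JOIN item AS i ON s.item_key = i.item_key
    GROUP BY t.quarter, i.type
    ORDER BY t.quarter, i.type
""").fetchall()
```

Each dimension is reached by a single join from the fact table, which is what gives the schema its star shape.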

Example of Snowflake Schema

■ Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
■ time: time_key, day, day_of_the_week, month, quarter, year
■ item: item_key, item_name, brand, type, supplier_key
  ■ supplier: supplier_key, supplier_type
■ branch: branch_key, branch_name, branch_type
■ location: location_key, street, city_key
  ■ city: city_key, city, state_or_province, country

Example of Fact Constellation

■ Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
■ Shipping Fact Table: time_key, item_key, shipper_key, from_location, to_location; measures: dollars_cost, units_shipped
■ time: time_key, day, day_of_the_week, month, quarter, year
■ item: item_key, item_name, brand, type, supplier_type
■ branch: branch_key, branch_name, branch_type
■ location: location_key, street, city, province_or_state, country
■ shipper: shipper_key, shipper_name, location_key, shipper_type
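
A Concept Hierarchy: Dimension (location)

[Figure: concept hierarchy for the location dimension, one level per line]
■ all: all
■ region: Europe, ..., North_America
■ country: Germany, ..., Spain (in Europe); Canada, ..., Mexico (in North_America)
■ city: Frankfurt, ... (in Germany); Vancouver, ..., Toronto (in Canada)
■ office: L. Chan, ..., M. Wind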
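A concept hierarchy can be represented as a child-to-parent mapping per level, and climbing it is what a roll-up does. The sketch below uses member names from the location hierarchy. The sales figures and the helper name `roll_up` are hypothetical.

```python
from collections import defaultdict

# city -> country and country -> region, using members of the location hierarchy.
CITY_TO_COUNTRY = {"Frankfurt": "Germany", "Vancouver": "Canada", "Toronto": "Canada"}
COUNTRY_TO_REGION = {
    "Germany": "Europe", "Spain": "Europe",
    "Canada": "North_America", "Mexico": "North_America",
}

# Hypothetical city-level sales figures.
sales_by_city = {"Frankfurt": 300.0, "Vancouver": 120.0, "Toronto": 180.0}


def roll_up(measures, parent_of):
    """Aggregate a measure one level up a concept hierarchy."""
    out = defaultdict(float)
    for member, value in measures.items():
        out[parent_of[member]] += value
    return dict(out)


sales_by_country = roll_up(sales_by_city, CITY_TO_COUNTRY)
sales_by_region = roll_up(sales_by_country, COUNTRY_TO_REGION)
```

Applying the function twice climbs from city to country to region. One more step to "all" would sum everything into a single value.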

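A Sample Data Cube

[Figure: 3-D sales data cube with dimensions Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, PC, VCR, sum), and Country (U.S.A., Canada, Mexico, sum). The highlighted cell is the total annual sales of TVs in U.S.A.]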

Cuboids Corresponding to the Cube

■ 0-D (apex) cuboid: all
■ 1-D cuboids: product; date; country
■ 2-D cuboids: (product, date); (product, country); (date, country)
■ 3-D (base) cuboid: (product, date, country)

Typical OLAP Operations
■ Roll up (drill-up): summarize data
  ■ by climbing up hierarchy or by dimension reduction
■ Drill down (roll down): reverse of roll-up
  ■ from higher level summary to lower level summary or detailed data, or introducing new dimensions
■ Slice and dice: project and select
■ Pivot (rotate):
  ■ reorient the cube, visualization, 3D to series of 2D planes
■ Other operations
  ■ drill across: involving (across) more than one fact table
  ■ drill through: through the bottom level of the cube to its back-end relational tables (using SQL)
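The core operations can be sketched on a tiny in-memory cube stored as a dictionary from coordinates to a measure. This is an illustrative toy, not how an OLAP server is implemented. The data and function names are hypothetical, and `roll_up` here shows only the dimension-reduction variant, applied to the base cuboid.

```python
from collections import defaultdict

# Hypothetical base cuboid: (quarter, product, country) -> dollars_sold
DIMS = ("quarter", "product", "country")
cube = {
    ("Q1", "TV", "U.S.A."): 100.0,
    ("Q1", "PC", "U.S.A."): 200.0,
    ("Q2", "TV", "U.S.A."): 150.0,
    ("Q1", "TV", "Canada"): 80.0,
    ("Q2", "PC", "Canada"): 120.0,
}


def roll_up(base, drop):
    """Roll up by dimension reduction: sum out the dimension named `drop`."""
    i = DIMS.index(drop)
    out = defaultdict(float)
    for key, value in base.items():
        out[key[:i] + key[i + 1:]] += value
    return dict(out)


def slice_(base, dim, value):
    """Slice: select a single value on one dimension."""
    i = DIMS.index(dim)
    return {k: v for k, v in base.items() if k[i] == value}


def dice(base, **allowed):
    """Dice: select sets of values on two or more dimensions."""
    return {
        k: v for k, v in base.items()
        if all(k[DIMS.index(d)] in vals for d, vals in allowed.items())
    }


def pivot(table2d):
    """Pivot: swap the two axes of a 2-D view keyed by (row, column)."""
    return {(c, r): v for (r, c), v in table2d.items()}


by_quarter_product = roll_up(cube, "country")
q1_only = slice_(cube, "quarter", "Q1")
q1_usa = dice(cube, quarter={"Q1"}, country={"U.S.A."})
product_by_quarter = pivot(by_quarter_product)
```

Drill-down is not shown because it is the reverse of roll-up: it needs the more detailed cuboid (here, `cube` itself) to still be available.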

Fig. 3.10 Typical OLAP Operations


ETL vs ELT

[Figure: ETL and ELT pipelines compared]
■ ETL: used in Power BI or data warehouse solutions
■ ELT: used in cloud technologies, e.g., data lakes


References (I)
■ S. Agarwal, R. Agrawal, P. M. Deshpande, A. Gupta, J. F. Naughton, R. Ramakrishnan, and S. Sarawagi. On the computation of multidimensional aggregates. VLDB'96.
■ D. Agrawal, A. E. Abbadi, A. Singh, and T. Yurek. Efficient view maintenance in data warehouses. SIGMOD'97.
■ R. Agrawal, A. Gupta, and S. Sarawagi. Modeling multidimensional databases. ICDE'97.
■ S. Chaudhuri and U. Dayal. An overview of data warehousing and OLAP technology. ACM SIGMOD Record, 26:65-74, 1997.
■ E. F. Codd, S. B. Codd, and C. T. Salley. Beyond decision support. Computer World, 27, July 1993.
■ J. Gray, et al. Data cube: A relational aggregation operator generalizing group-by, cross-tab and sub-totals. Data Mining and Knowledge Discovery, 1:29-54, 1997.
■ A. Gupta and I. S. Mumick. Materialized Views: Techniques, Implementations, and Applications. MIT Press, 1999.
■ J. Han. Towards on-line analytical mining in large databases. ACM SIGMOD Record, 27:97-107, 1998.
■ V. Harinarayan, A. Rajaraman, and J. D. Ullman. Implementing data cubes efficiently. SIGMOD'96.
■ J. Hellerstein, P. Haas, and H. Wang. Online aggregation. SIGMOD'97.

References (II)
■ C. Imhoff, N. Galemmo, and J. G. Geiger. Mastering Data Warehouse Design: Relational and Dimensional Techniques. John Wiley, 2003.
■ W. H. Inmon. Building the Data Warehouse. John Wiley, 1996.
■ R. Kimball and M. Ross. The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling. 2ed. John Wiley, 2002.
■ P. O'Neil and G. Graefe. Multi-table joins through bitmapped join indices. SIGMOD Record, 24:8-11, Sept. 1995.
■ P. O'Neil and D. Quass. Improved query performance with variant indexes. SIGMOD'97.
■ Microsoft. OLEDB for OLAP programmer's reference version 1.0. In http://www.microsoft.com/data/oledb/olap, 1998.
■ S. Sarawagi and M. Stonebraker. Efficient organization of large multidimensional arrays. ICDE'94.
■ A. Shoshani. OLAP and statistical databases: Similarities and differences. PODS'00.
■ D. Srivastava, S. Dar, H. V. Jagadish, and A. V. Levy. Answering queries with aggregation using views. VLDB'96.
■ P. Valduriez. Join indices. ACM Trans. Database Systems, 12:218-246, 1987.
■ J. Widom. Research problems in data warehousing. CIKM'95.
■ K. Wu, E. Otoo, and A. Shoshani. Optimal Bitmap Indices with Efficient Compression. ACM Trans. on Database Systems (TODS), 31(1):1-38, 2006.

Surplus Slides

Compression of Bitmap Indices
■ Bitmap indexes must be compressed to reduce I/O costs and minimize CPU usage, since the majority of the bits are 0’s
■ Two compression schemes:
  ■ Byte-aligned Bitmap Code (BBC)
  ■ Word-Aligned Hybrid (WAH) code
■ The time and space required to operate on a compressed bitmap are proportional to the total size of the bitmap
■ Optimal on attributes of low cardinality as well as those of high cardinality.
■ WAH outperforms BBC by about a factor of two
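To see why mostly-zero bitmaps compress well, the sketch below builds a bitmap index and applies plain run-length encoding. This is a deliberately simplified stand-in and is neither BBC nor WAH: both of those pack runs into byte-aligned or word-aligned units, which is omitted here. The column data and function names are hypothetical.

```python
def bitmap_index(column):
    """Build one bit vector (a list of 0/1) per distinct value in the column."""
    return {v: [1 if x == v else 0 for x in column] for v in sorted(set(column))}


def rle_encode(bits):
    """Run-length encode a bit vector as [(bit, run_length), ...]."""
    runs = []
    for b in bits:
        if runs and runs[-1][0] == b:
            runs[-1] = (b, runs[-1][1] + 1)  # extend the current run
        else:
            runs.append((b, 1))              # start a new run
    return runs


def rle_decode(runs):
    """Expand [(bit, run_length), ...] back into the original bit vector."""
    return [b for b, n in runs for _ in range(n)]


# Hypothetical column of 12 rows in which one value is rare.
column = ["Canada"] * 6 + ["Mexico"] + ["Canada"] * 5
index = bitmap_index(column)
encoded = rle_encode(index["Mexico"])
```

The 12-bit vector for the rare value collapses into three runs, and decoding restores it exactly.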
