DM 6

The document discusses data warehousing and online analytical processing (OLAP). It covers topics like data warehouse modeling using star and snowflake schemas, data cube measures and operations, and designing and implementing a data warehouse.


Data Mining

(CS64120)

Dr. Anshul
(Assistant Professor)
Department of Computer Science & Engineering
NIT Patna, Ashok Rajpath, Bihar-800005
UNIT-II
Introduction to Data Mining
Chapter 4: Data Warehousing and On-line
Analytical Processing

◼ Data Warehouse: Basic Concepts

◼ Data Warehouse Modeling: Data Cube and OLAP
◼ Data Warehouse Design and Usage
◼ Data Warehouse Implementation
◼ Data Generalization by Attribute-Oriented
Induction
◼ Summary

Conceptual Modeling of Data Warehouses

◼ Modeling data warehouses: dimensions & measures

◼ Star schema: A fact table in the middle connected to a
set of dimension tables
◼ Snowflake schema: A refinement of the star schema
where some dimensional hierarchy is normalized into a
set of smaller dimension tables, forming a shape
similar to a snowflake
◼ Fact constellations: Multiple fact tables share
dimension tables, viewed as a collection of stars,
therefore called galaxy schema or fact constellation

Example of Star Schema
◼ Sales Fact Table
◼ Dimension keys: time_key, item_key, branch_key, location_key
◼ Measures: units_sold, dollars_sold, avg_sales
◼ Dimension tables
◼ time: time_key, day, day_of_the_week, month, quarter, year
◼ item: item_key, item_name, brand, type, supplier_type
◼ branch: branch_key, branch_name, branch_type
◼ location: location_key, street, city, state_or_province, country
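A minimal sketch of the star schema idea, using Python's built-in sqlite3 module. It keeps only two of the four dimensions (item and location) and a single fact table; the table name sales_fact, the sample rows, and the helper names build_star_schema and dollars_by_country are assumptions made for illustration, not part of the slides.

```python
import sqlite3

def build_star_schema(conn):
    """Create a cut-down sales star schema: one fact table, two dimensions."""
    conn.executescript("""
        CREATE TABLE item (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT);
        CREATE TABLE location (location_key INTEGER PRIMARY KEY, city TEXT, country TEXT);
        CREATE TABLE sales_fact (
            item_key INTEGER REFERENCES item(item_key),
            location_key INTEGER REFERENCES location(location_key),
            units_sold INTEGER,
            dollars_sold REAL
        );
    """)
    # Sample rows are invented purely for the example.
    conn.executemany("INSERT INTO item VALUES (?, ?, ?)",
                     [(1, "TV", "Acme"), (2, "PC", "Acme")])
    conn.executemany("INSERT INTO location VALUES (?, ?, ?)",
                     [(10, "Vancouver", "Canada"), (11, "Toronto", "Canada"),
                      (12, "Frankfurt", "Germany")])
    conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?)",
                     [(1, 10, 2, 800.0), (1, 11, 1, 400.0),
                      (2, 12, 3, 2400.0), (2, 10, 1, 800.0)])

def dollars_by_country(conn):
    """Join the fact table to the location dimension and aggregate a measure."""
    rows = conn.execute("""
        SELECT l.country, SUM(f.dollars_sold)
        FROM sales_fact AS f
        JOIN location AS l ON l.location_key = f.location_key
        GROUP BY l.country
        ORDER BY l.country
    """).fetchall()
    return dict(rows)

conn = sqlite3.connect(":memory:")
build_star_schema(conn)
totals = dollars_by_country(conn)
print(totals)
```

The fact table holds only keys and measures; every descriptive attribute lives in a dimension table and is reached through a single join.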

Example of Snowflake Schema
◼ Sales Fact Table
◼ Dimension keys: time_key, item_key, branch_key, location_key
◼ Measures: units_sold, dollars_sold, avg_sales
◼ Dimension tables
◼ time: time_key, day, day_of_the_week, month, quarter, year
◼ item: item_key, item_name, brand, type, supplier_key
◼ supplier (normalized out of item): supplier_key, supplier_type
◼ branch: branch_key, branch_name, branch_type
◼ location: location_key, street, city_key
◼ city (normalized out of location): city_key, city, state_or_province, country

Example of Fact Constellation
◼ Sales Fact Table
◼ Dimension keys: time_key, item_key, branch_key, location_key
◼ Measures: units_sold, dollars_sold, avg_sales
◼ Shipping Fact Table
◼ Dimension keys: time_key, item_key, shipper_key, from_location, to_location
◼ Measures: dollars_cost, units_shipped
◼ Dimension tables (time, item, and location are shared by both fact tables)
◼ time: time_key, day, day_of_the_week, month, quarter, year
◼ item: item_key, item_name, brand, type, supplier_type
◼ branch: branch_key, branch_name, branch_type
◼ location: location_key, street, city, province_or_state, country
◼ shipper: shipper_key, shipper_name, location_key, shipper_type
A Concept Hierarchy:
Dimension (location)

◼ all: all
◼ region: Europe, ..., North_America
◼ country: Germany, ..., Spain (Europe); Canada, ..., Mexico (North_America)
◼ city: Frankfurt, ... (Germany); Vancouver, ..., Toronto (Canada)
◼ office: L. Chan, ..., M. Wind
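The location hierarchy can be sketched as a child-to-parent map that is climbed one level at a time. This is only an illustration: the dictionary PARENT, the list LEVELS, and the function generalize are hypothetical names, the office level is left out, and the city-to-country links are assumed from ordinary geography.

```python
# Child -> parent links for the location dimension. Values come from the slide;
# the city-to-country assignments are an assumption based on geography.
PARENT = {
    "Frankfurt": "Germany",
    "Vancouver": "Canada",
    "Toronto": "Canada",
    "Germany": "Europe",
    "Spain": "Europe",
    "Canada": "North_America",
    "Mexico": "North_America",
    "Europe": "all",
    "North_America": "all",
}

# Levels ordered from most detailed to most general (office omitted).
LEVELS = ["city", "country", "region", "all"]

def generalize(value, from_level, to_level):
    """Climb the hierarchy from from_level up to to_level."""
    steps = LEVELS.index(to_level) - LEVELS.index(from_level)
    if steps < 0:
        raise ValueError("can only generalize upward")
    for _ in range(steps):
        value = PARENT[value]
    return value

print(generalize("Toronto", "city", "region"))
```

Replacing each city in a fact row by its country or region in this way is what a roll-up along the location dimension does before re-aggregating.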

Data Cube Measures: Three Categories

◼ Distributive: if the result derived by applying the function
to n aggregate values is the same as that derived by
applying the function on all the data without partitioning
◼ E.g., count(), sum(), min(), max()
◼ Algebraic: if it can be computed by an algebraic function
with M arguments (where M is a bounded integer), each of
which is obtained by applying a distributive aggregate
function
◼ E.g., avg(), min_N(), standard_deviation()
◼ Holistic: if there is no constant bound on the storage size
needed to describe a subaggregate.
◼ E.g., median(), mode(), rank()
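The three categories can be checked on a small partitioned list. The sketch below uses made-up numbers and a hypothetical helper partition; it shows sum() combining exactly from partial sums, avg() being rebuilt from two distributive sub-aggregates (sum and count), and median() failing to combine from per-partition medians.

```python
import statistics

def partition(values, size):
    """Split a list into consecutive chunks of at most `size` elements."""
    return [values[i:i + size] for i in range(0, len(values), size)]

values = [4, 8, 15, 16, 23, 42, 7]   # invented sample data
chunks = partition(values, 3)        # [[4, 8, 15], [16, 23, 42], [7]]

# Distributive: the sum of the per-chunk sums equals the sum over all data.
total = sum(sum(chunk) for chunk in chunks)

# Algebraic: avg() is computed from M = 2 distributive values, sum and count.
sub_aggregates = [(sum(chunk), len(chunk)) for chunk in chunks]
average = sum(s for s, _ in sub_aggregates) / sum(n for _, n in sub_aggregates)

# Holistic: combining per-chunk medians does not give the true median.
median_of_medians = statistics.median([statistics.median(c) for c in chunks])
true_median = statistics.median(values)

print(total, average, median_of_medians, true_median)
```

Here the median of the chunk medians is 8 while the true median is 15, which is why a holistic measure cannot be maintained from a bounded set of sub-aggregates.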
Multidimensional Data

◼ Sales volume as a function of product, month, and region
◼ Dimensions: Product, Location, Time
◼ Hierarchical summarization paths
◼ Product: Industry → Category → Product
◼ Location: Region → Country → City → Office
◼ Time: Year → Quarter → Month → Day, and Week → Day
A Sample Data Cube

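[Figure: a 3-D data cube of sales with dimensions Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, PC, VCR, sum), and Country (U.S.A., Canada, Mexico, sum). One highlighted cell holds the total annual sales of TVs in U.S.A.; the corner cell is (All, All, All).]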

Cuboids Corresponding to the Cube

◼ 0-D (apex) cuboid: all
◼ 1-D cuboids: (product), (date), (country)
◼ 2-D cuboids: (product, date), (product, country), (date, country)
◼ 3-D (base) cuboid: (product, date, country)
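The lattice for this cube can be enumerated mechanically: a k-D cuboid is just a choice of k of the dimensions. A small sketch, where the function name cuboids is an assumption for illustration:

```python
from itertools import combinations

def cuboids(dimensions):
    """Group every cuboid of the lattice by its number of dimensions."""
    return {k: list(combinations(dimensions, k))
            for k in range(len(dimensions) + 1)}

lattice = cuboids(["product", "date", "country"])
for k, group in lattice.items():
    print(f"{k}-D cuboids:", group)
```

For three dimensions without hierarchies this yields 1 + 3 + 3 + 1 = 8 cuboids, from the apex (the empty tuple) down to the base cuboid.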

Typical OLAP Operations
◼ Roll up (drill-up): summarize data
◼ by climbing up hierarchy or by dimension reduction
◼ Drill down (roll down): reverse of roll-up
◼ from higher level summary to lower level summary or
detailed data, or introducing new dimensions
◼ Slice and dice: project and select
◼ Pivot (rotate):
◼ reorient the cube, visualization, 3D to series of 2D planes
◼ Other operations
◼ drill across: involving (across) more than one fact table
◼ drill through: through the bottom level of the cube to its
back-end relational tables (using SQL)
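The core operations can be sketched on a tiny cube stored as a dictionary from (product, quarter, country) to units sold. The cell values and the function names roll_up, slice_cube, dice, and pivot are assumptions for illustration; a real OLAP server would work on precomputed cuboids rather than scanning cells.

```python
from collections import defaultdict

DIMS = ("product", "quarter", "country")

# Invented cell values: (product, quarter, country) -> units sold.
cube = {
    ("TV", "Q1", "U.S.A."): 10, ("TV", "Q2", "U.S.A."): 12,
    ("PC", "Q1", "U.S.A."): 7,  ("PC", "Q1", "Canada"): 3,
    ("TV", "Q1", "Canada"): 5,  ("VCR", "Q2", "Mexico"): 4,
}

def roll_up(cells, dims, drop):
    """Roll up by dimension reduction: sum over the dropped dimension."""
    keep = [i for i, d in enumerate(dims) if d != drop]
    out = defaultdict(int)
    for key, value in cells.items():
        out[tuple(key[i] for i in keep)] += value
    return dict(out)

def slice_cube(cells, dims, dim, value):
    """Slice: select one value on a single dimension."""
    i = dims.index(dim)
    return {k: v for k, v in cells.items() if k[i] == value}

def dice(cells, dims, allowed):
    """Dice: select a sub-cube by sets of values on several dimensions."""
    return {k: v for k, v in cells.items()
            if all(k[dims.index(d)] in vals for d, vals in allowed.items())}

def pivot(cells, order):
    """Pivot: reorder the axes of every cell key."""
    return {tuple(k[i] for i in order): v for k, v in cells.items()}

by_product_country = roll_up(cube, DIMS, "quarter")
q1 = slice_cube(cube, DIMS, "quarter", "Q1")
canada_tv_pc = dice(cube, DIMS, {"product": {"TV", "PC"}, "country": {"Canada"}})
rotated = pivot(cube, (2, 0, 1))   # keys become (country, product, quarter)
```

Drill-down is the reverse of roll_up and needs the more detailed cells to be available, which is why it cannot be derived from the rolled-up result alone.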

Browsing a Data Cube

◼ Visualization
◼ OLAP capabilities
◼ Interactive manipulation
Design of Data Warehouse: A Business
Analysis Framework
◼ Four views regarding the design of a data warehouse
◼ Top-down view
◼ allows selection of the relevant information necessary for the
data warehouse
◼ Data source view
◼ exposes the information being captured, stored, and
managed by operational systems
◼ Data warehouse view
◼ consists of fact tables and dimension tables
◼ Business query view
◼ sees the perspectives of data in the warehouse from the
viewpoint of the end user
Data Warehouse Design Process
◼ Top-down, bottom-up approaches or a combination of both
◼ Top-down: Starts with overall design and planning (mature)
◼ Bottom-up: Starts with experiments and prototypes (rapid)
◼ From a software engineering point of view
◼ Waterfall: structured and systematic analysis at each step before
proceeding to the next
◼ Spiral: rapid generation of increasingly functional systems, with
short turnaround time
◼ Typical data warehouse design process
◼ Choose a business process to model, e.g., orders, invoices, etc.
◼ Choose the grain (atomic level of data) of the business process
◼ Choose the dimensions that will apply to each fact table record
◼ Choose the measure that will populate each fact table record
Data Warehouse Usage
◼ Three kinds of data warehouse applications
◼ Information processing
◼ supports querying, basic statistical analysis, and reporting
using crosstabs, tables, charts and graphs
◼ Analytical processing
◼ multidimensional analysis of data warehouse data
◼ supports basic OLAP operations, slice-dice, drilling, pivoting
◼ Data mining
◼ knowledge discovery from hidden patterns
◼ supports associations, constructing analytical models,
performing classification and prediction, and presenting the
mining results using visualization tools

Efficient Data Cube Computation
◼ Data cube can be viewed as a lattice of cuboids
◼ The bottom-most cuboid is the base cuboid
◼ The top-most cuboid (apex) contains only one cell
◼ How many cuboids are there in an n-dimensional cube where
dimension i has L_i levels?
$T = \prod_{i=1}^{n} (L_i + 1)$
◼ Materialization of data cube
◼ Materialize every (cuboid) (full materialization),
none (no materialization), or some (partial
materialization)
◼ Selection of which cuboids to materialize
◼ Based on size, sharing, access frequency, etc.
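The cuboid count is a one-line product: each dimension contributes its number of levels plus one for the virtual top level "all". A quick sketch, where total_cuboids is a hypothetical helper name and the level counts in the second call are example values rather than figures from the slides:

```python
from math import prod

def total_cuboids(levels_per_dimension):
    """T = product over all dimensions of (L_i + 1)."""
    return prod(levels + 1 for levels in levels_per_dimension)

# Three dimensions with a single level each: the plain 2^3 lattice.
print(total_cuboids([1, 1, 1]))

# Example: time and location with 4 levels each, product with 3 levels.
print(total_cuboids([4, 4, 3]))
```

The count grows multiplicatively with every added dimension or level, which is the motivation for materializing only some of the cuboids.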
The “Compute Cube” Operator
◼ Cube definition and computation in DMQL
define cube sales [item, city, year]: sum (sales_in_dollars)
compute cube sales
◼ Transform it into a SQL-like language (with a new operator CUBE BY,
introduced by Gray et al.’96)
SELECT item, city, year, SUM (amount)
FROM SALES
CUBE BY item, city, year
◼ Need to compute the following group-bys
(item, city, year),
(item, city), (item, year), (city, year),
(item), (city), (year)
()
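What the cube operator has to produce can be sketched naively: one aggregation per subset of the grouping attributes. The function compute_cube and the sample rows below are assumptions for illustration, and the sketch rescans the base data for every group-by, which efficient cube computation methods avoid by sharing work between cuboids.

```python
from collections import defaultdict
from itertools import combinations

def compute_cube(rows, dims, measure):
    """Return {group_by_attributes: {group_values: sum of measure}}
    for all 2^n subsets of dims."""
    cube = {}
    for k in range(len(dims) + 1):
        for group in combinations(dims, k):
            totals = defaultdict(float)
            for row in rows:
                totals[tuple(row[d] for d in group)] += row[measure]
            cube[group] = dict(totals)
    return cube

# Invented sample rows for the SALES relation.
sales = [
    {"item": "TV", "city": "Toronto", "year": 2023, "amount": 400.0},
    {"item": "TV", "city": "Vancouver", "year": 2023, "amount": 800.0},
    {"item": "PC", "city": "Toronto", "year": 2024, "amount": 1200.0},
]

cube = compute_cube(sales, ("item", "city", "year"), "amount")
print(cube[()])           # apex cuboid: the grand total
print(cube[("item",)])    # 1-D cuboid on item
```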
Indexing OLAP Data: Bitmap Index
◼ Index on a particular column
◼ Each value in the column has a bit vector: bit-op is fast
◼ The length of the bit vector: # of records in the base table
◼ The i-th bit is set if the i-th row of the base table has the value for
the indexed column
◼ not suitable for high cardinality domains
◼ A recent bit compression technique, Word-Aligned Hybrid (WAH),
makes it work for high cardinality domain as well [Wu, et al. TODS’06]
Base table
Cust  Region   Type
C1    Asia     Retail
C2    Europe   Dealer
C3    Asia     Dealer
C4    America  Retail
C5    Europe   Dealer

Index on Region
RecID  Asia  Europe  America
1      1     0       0
2      0     1       0
3      1     0       0
4      0     0       1
5      0     1       0

Index on Type
RecID  Retail  Dealer
1      1       0
2      0       1
3      0       1
4      1       0
5      0       1
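The example tables translate directly into code if each bit vector is stored as a Python integer. This is a minimal, uncompressed sketch (no WAH encoding); the function names build_bitmap_index and matching_rows are assumptions, and bit i here corresponds to RecID i + 1.

```python
def build_bitmap_index(rows, column):
    """Map each distinct value to an int whose i-th bit marks row i (0-based)."""
    index = {}
    for i, row in enumerate(rows):
        index[row[column]] = index.get(row[column], 0) | (1 << i)
    return index

def matching_rows(rows, bits):
    """Return the rows whose bit is set in the given bit vector."""
    return [row for i, row in enumerate(rows) if (bits >> i) & 1]

base = [
    {"cust": "C1", "region": "Asia", "type": "Retail"},
    {"cust": "C2", "region": "Europe", "type": "Dealer"},
    {"cust": "C3", "region": "Asia", "type": "Dealer"},
    {"cust": "C4", "region": "America", "type": "Retail"},
    {"cust": "C5", "region": "Europe", "type": "Dealer"},
]

region_idx = build_bitmap_index(base, "region")
type_idx = build_bitmap_index(base, "type")

# Region = Asia AND Type = Dealer: a single bitwise AND of two vectors.
asia_dealers = matching_rows(base, region_idx["Asia"] & type_idx["Dealer"])

# Region = Europe OR Region = America: a single bitwise OR.
west = matching_rows(base, region_idx["Europe"] | region_idx["America"])
```

Selections, and combinations of them, reduce to bitwise operations, which is the sense in which "bit-op is fast" on the slide.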
Indexing OLAP Data: Join Indices

◼ Join index: JI(R-id, S-id) where R (R-id, …) ⋈ S (S-id, …)
◼ Traditional indices map the values to a list of record ids
◼ A join index materializes the relational join in the JI file and speeds up the relational join
◼ In data warehouses, a join index relates the values of the dimensions of a
star schema to rows in the fact table.
◼ E.g. fact table: Sales and two dimensions city and product
◼ A join index on city maintains for each distinct city a list of R-IDs
of the tuples recording the Sales in the city
◼ Join indices can span multiple dimensions
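A join index on city can be sketched as a precomputed list of (dimension row id, fact row id) pairs, so that a later query looks up matching fact rows instead of joining again. The tables, the column name city_key, and the function build_join_index are assumptions made for this example.

```python
from collections import defaultdict

# Invented dimension and fact rows; list positions act as record ids.
city_dim = [
    {"city_key": 1, "city": "Toronto"},
    {"city_key": 2, "city": "Vancouver"},
]
sales_fact = [
    {"city_key": 1, "product": "TV", "dollars_sold": 400.0},
    {"city_key": 2, "product": "PC", "dollars_sold": 1200.0},
    {"city_key": 1, "product": "PC", "dollars_sold": 900.0},
]

def build_join_index(dim_rows, fact_rows, key):
    """Return (dim_rid, fact_rid) pairs for rows that join on `key`."""
    by_key = defaultdict(list)
    for fact_rid, fact in enumerate(fact_rows):
        by_key[fact[key]].append(fact_rid)
    return [(dim_rid, fact_rid)
            for dim_rid, dim in enumerate(dim_rows)
            for fact_rid in by_key.get(dim[key], [])]

city_ji = build_join_index(city_dim, sales_fact, "city_key")

# Total sales for Toronto (dimension row 0) using only the join index.
toronto_total = sum(sales_fact[fact_rid]["dollars_sold"]
                    for dim_rid, fact_rid in city_ji if dim_rid == 0)
```

A join index spanning several dimensions would store one id per dimension alongside the fact row id in each entry.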

OLAP Server Architectures

◼ Relational OLAP (ROLAP)

◼ Use relational or extended-relational DBMS to store and manage
warehouse data and OLAP middleware
◼ Include optimization of DBMS backend, implementation of
aggregation navigation logic, and additional tools and services
◼ Greater scalability
◼ Multidimensional OLAP (MOLAP)
◼ Sparse array-based multidimensional storage engine
◼ Fast indexing to pre-computed summarized data
◼ Hybrid OLAP (HOLAP) (e.g., Microsoft SQL Server)
◼ Flexibility, e.g., low level: relational, high-level: array
◼ Specialized SQL servers (e.g., Redbricks)
◼ Specialized support for SQL queries over star/snowflake schemas
Summary
◼ Data warehousing: A multi-dimensional model of a data warehouse
◼ A data cube consists of dimensions and measures
◼ Star schema, snowflake schema, fact constellations
◼ OLAP operations: drilling, rolling, slicing, dicing and pivoting
◼ Data Warehouse Architecture, Design, and Usage
◼ Multi-tiered architecture
◼ Business analysis design framework
◼ Information processing, analytical processing, data mining, OLAM (Online
Analytical Mining)
◼ Implementation: Efficient computation of data cubes
◼ Partial vs. full vs. no materialization
◼ Indexing OLAP data: Bitmap index and join index
◼ OLAP query processing
◼ OLAP servers: ROLAP, MOLAP, HOLAP

Thank You

Questions
