Data Mining: Concepts and Techniques, Chapter 3
August 10, 2009 Data Mining: Concepts and Techniques 3
What is a Data Warehouse?
Defined in many different ways, but not rigorously.
A decision support database that is maintained separately from the organization’s operational database
Supports information processing by providing a solid platform of consolidated, historical data for analysis.
“A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.” (W. H. Inmon)
Data warehousing:
The process of constructing and using data warehouses
transaction records
Data cleaning and data integration techniques are applied.
Ensure consistency in naming conventions,
Cube: A Lattice of Cuboids
0-D (apex) cuboid: all
3-D cuboids: (time, item, location); (time, item, supplier); (time, location, supplier); (item, location, supplier)
4-D (base) cuboid: (time, item, location, supplier)
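The lattice can be enumerated mechanically: each cuboid is one subset of the dimension list, from the empty subset (the apex, "all") to the full set (the base cuboid). The sketch below is illustrative only; the function name cuboid_lattice is an assumption, not something from the slides.

```python
from itertools import combinations

def cuboid_lattice(dimensions):
    """Return {k: [cuboids with k dimensions]} for every group-by subset."""
    return {
        k: list(combinations(dimensions, k))
        for k in range(len(dimensions) + 1)
    }

dims = ["time", "item", "location", "supplier"]
lattice = cuboid_lattice(dims)

for k, cuboids in lattice.items():
    # k = 0 is the apex cuboid, k = len(dims) is the base cuboid
    print(f"{k}-D cuboids ({len(cuboids)}):", cuboids)

total_cuboids = sum(len(c) for c in lattice.values())
print("total:", total_cuboids)
```

With four dimensions and no concept hierarchies, the lattice has 2^4 cuboids in total.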
[Figure: schema diagram. Surviving labels: fact table keys branch_key, location_key; measures units_sold, dollars_sold, avg_sales; branch table (branch_key, branch_name, branch_type); location table (location_key, street, city_key); city table (city_key, city, state_or_province, country).]
<dimension_name_first_time> in cube
<cube_name_first_time>
Specification of hierarchies
Schema hierarchy: day < {month < quarter; week} < year
Set_grouping hierarchy: {1..10} <
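The schema hierarchy day < {month < quarter; week} < year is a partial order, not a single chain: a day rolls up to a month or to a week, and weeks do not roll up to months. A minimal sketch of that idea follows. The names PARENTS, generalize, and is_ancestor are hypothetical, and using ISO week numbering for the week level is an assumption.

```python
from datetime import date

# Partial order from the slide: day < {month < quarter; week} < year.
# Each level maps to the levels directly above it.
PARENTS = {
    "day": ["month", "week"],
    "month": ["quarter"],
    "quarter": ["year"],
    "week": ["year"],
    "year": [],
}

def generalize(d: date, level: str) -> str:
    """Map a calendar day to its value at the requested level."""
    if level == "day":
        return d.isoformat()
    if level == "week":
        iso = d.isocalendar()  # (ISO year, ISO week, weekday); assumed convention
        return f"{iso[0]}-W{iso[1]:02d}"
    if level == "month":
        return f"{d.year}-{d.month:02d}"
    if level == "quarter":
        return f"{d.year}-Q{(d.month - 1) // 3 + 1}"
    if level == "year":
        return str(d.year)
    raise ValueError(f"unknown level: {level}")

def is_ancestor(lower: str, upper: str) -> bool:
    """True if `upper` is reachable from `lower` in the partial order."""
    return any(p == upper or is_ancestor(p, upper) for p in PARENTS[lower])

d = date(2009, 8, 10)
print(generalize(d, "month"), generalize(d, "quarter"), generalize(d, "year"))
print(is_ancestor("day", "year"), is_ancestor("week", "month"))
```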
Multidimensional Data
[Figure residue: dimension hierarchy diagram; surviving labels: Office, Day, Month.]
A Sample Data Cube
[Figure: sample data cube. Product axis: TV, PC, VCR, sum. Country axis: U.S.A, Canada, Mexico, sum.]
0-D (apex) cuboid: all
1-D cuboids: product; date; country
3-D (base) cuboid: product, date, country
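The "sum" faces of the sample cube are the cells of the lower-dimensional cuboids. As a rough sketch, every fact row can be added into each cell it belongs to, with "*" standing for "all" on a dimension. The fact rows and sales figures below are made up for illustration, and compute_cube is a hypothetical helper, not an algorithm from the slides (it is the naive full materialization, not an efficient method).

```python
from collections import defaultdict
from itertools import combinations

DIMS = ("product", "date", "country")

# Hypothetical fact rows: (product, date, country, sales). Values are made up.
facts = [
    ("TV", "1Qtr", "U.S.A", 100),
    ("TV", "1Qtr", "Canada", 40),
    ("PC", "1Qtr", "U.S.A", 70),
    ("VCR", "2Qtr", "Mexico", 30),
]

def compute_cube(rows):
    """Aggregate sales for every cuboid; '*' stands for 'all' on a dimension."""
    cube = defaultdict(int)
    for *coords, sales in rows:
        for k in range(len(DIMS) + 1):
            for kept in combinations(range(len(DIMS)), k):
                cell = tuple(
                    coords[i] if i in kept else "*" for i in range(len(DIMS))
                )
                cube[cell] += sales
    return dict(cube)

cube = compute_cube(facts)
print(cube[("*", "*", "*")])       # apex cell: total sales
print(cube[("TV", "*", "*")])      # 1-D cell: all TV sales
print(cube[("TV", "*", "U.S.A")])  # 2-D cell: TV sales in the U.S.A
```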
Visualization
OLAP capabilities
Interactive
[Figure: radial diagram of dimensions and their abstraction levels. Surviving labels: Time (ANNUALLY, QTRLY, DAILY); Product (PRODUCT ITEM, PRODUCT GROUP, PRODUCT LINE); Location (CITY, COUNTRY, REGION); Organization (SALES PERSON, DISTRICT, DIVISION); Promotion; ORDER; TRUCK.]
Each circle is called a footprint
Chapter 3: Data Warehousing, Data Generalization, and On-line Analytical Processing
Data warehouse: Basic concept
Design of Data Warehouse: A Business Analysis Framework
Four views regarding the design of a data warehouse
Top-down view: allows selection of the relevant information necessary for the data warehouse
Data source view: exposes the information being captured, stored, and managed by operational systems
Data warehouse view: consists of fact tables and dimension tables
Business query view: sees the perspectives of data in the warehouse from the view of the end user
Data Warehouse Design Process
Top-down, bottom-up approaches or a combination of both
Top-down: starts with overall design and planning (mature)
Bottom-up: starts with experiments and prototypes (rapid)
From a software engineering point of view
Waterfall: structured and systematic analysis at each step before proceeding to the next
Spiral: rapid generation of increasingly functional systems with short turnaround time
Typical data warehouse design process
Choose a business process to model, e.g., orders, invoices, etc.
Choose the grain (atomic level of data) of the business process
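Data Warehouse: A Multi-Tiered Architecture
[Figure: data sources (operational DBs, other sources) pass through Extract, Transform, Load, and Refresh, under a Monitor & Integrator with Metadata, into the data storage tier (Data Warehouse, Data Marts); an OLAP Server serves the front-end tools (Analysis, Query, Reports, Data mining).]
[Figure: an Enterprise Data Warehouse with two Data Marts.]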
external sources
Data cleaning: detect errors in the data and rectify them when possible
Data transformation: convert data from legacy or host format to warehouse format
Load: sort, summarize, consolidate, compute views,
Business data
OLAP Server Architectures
Efficient Data Cube Computation
Data cube can be viewed as a lattice of cuboids
The bottom-most cuboid is the base cuboid
The top-most cuboid (apex) contains only one cell
How many cuboids are there in an n-dimensional cube where dimension i has L_i levels?
$T = \prod_{i=1}^{n} (L_i + 1)$
certain threshold
Avoid explosive growth of the cube
Suppose 100 dimensions, only 1 base cell. How many aggregate cells if count >= 1? What about count >= 2?
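Both counting questions can be checked with a few lines of arithmetic. The sketch assumes that L_i counts the levels of dimension i excluding the virtual top level "all" (hence the +1), and that an aggregate cell is any cell obtained from the base cell by replacing a non-empty subset of its values with "*". Function names and the level counts [4, 3, 2] are hypothetical.

```python
from math import prod

def total_cuboids(levels_per_dimension):
    """T = product over dimensions of (L_i + 1); the +1 is the virtual 'all' level."""
    return prod(levels + 1 for levels in levels_per_dimension)

def aggregate_cells_from_one_base_cell(n_dims):
    """Cells obtained by replacing any non-empty subset of the n values with '*'."""
    return 2 ** n_dims - 1

print(total_cuboids([1, 1, 1, 1]))             # no hierarchies: 2**4 cuboids
print(total_cuboids([4, 3, 2]))                # hypothetical level counts
print(aggregate_cells_from_one_base_cell(100)) # aggregate cells with count >= 1
```

Under these assumptions a single base cell in 100 dimensions yields 2^100 - 1 aggregate cells with count >= 1. Each of them covers only that one base tuple, so none reaches count >= 2, which is why a count threshold prunes the cube so sharply.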
Indexing OLAP Data: Bitmap Index
Index on a particular column
Each value in the column has a bit vector; bit operations are fast
The length of the bit vector equals the number of records in the base table
The i-th bit is set if the i-th row of the base table has the value for the indexed column
Not suitable for high-cardinality domains
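A bitmap index can be sketched with Python integers used as bit vectors: one bitmask per distinct column value, with bit i set when row i holds that value. The column contents and helper names below are hypothetical and only illustrate the mechanism.

```python
def build_bitmap_index(column):
    """Return {value: int bitmask}; bit i is set when row i holds that value."""
    index = {}
    for row, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << row)
    return index

def rows_of(bitmask):
    """Decode a bitmask back into the list of row positions that are set."""
    return [i for i in range(bitmask.bit_length()) if (bitmask >> i) & 1]

# Hypothetical base table columns (values are illustrative only).
region = ["Asia", "Europe", "Asia", "America", "Europe"]
kind = ["Retail", "Dealer", "Dealer", "Retail", "Dealer"]

region_idx = build_bitmap_index(region)
kind_idx = build_bitmap_index(kind)

# A conjunctive selection becomes a single bitwise AND of two bit vectors.
asia_dealers = region_idx["Asia"] & kind_idx["Dealer"]
print(rows_of(asia_dealers))
```

The index stores one bit vector per distinct value, which is why the technique becomes expensive when a column has many distinct values.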
dimensions
Efficient Processing of OLAP Queries
Determine which operations should be performed on the available cuboids
Transform drill, roll, etc. into corresponding SQL and/or OLAP operations, e.g., dice = selection + projection
Determine which materialized cuboid(s) should be selected for the OLAP operation
Let the query to be processed be on {brand, province_or_state} with the condition “year = 2004”, and suppose 4 materialized cuboids are available:
1) {year, item_name, city}
2) {year, brand, country}
3) {year, brand, province_or_state}
4) {item_name, province_or_state} where year = 2004
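One way to reason about the cuboid choice is to test, for each cuboid, whether it stores every dimension the query needs at the same or a finer level. The sketch below rests on three assumptions: the concept hierarchies item_name < brand and city < province_or_state < country, the reading that "where year = 2004" is a selection already applied to cuboid 4, and a hypothetical helper named can_answer. It only checks usability; it does not estimate which usable cuboid is cheapest.

```python
# Level rank within each dimension: lower rank = finer granularity.
# Assumed hierarchies: item_name < brand, city < province_or_state < country.
LEVELS = {
    "year": ("time", 0),
    "item_name": ("item", 0), "brand": ("item", 1),
    "city": ("location", 0),
    "province_or_state": ("location", 1),
    "country": ("location", 2),
}

def can_answer(cuboid_attrs, cuboid_filter, query_attrs, query_filter):
    """A cuboid is usable if it stores each needed dimension at the same or a
    finer level, and any selection baked into it matches the query's selection."""
    if any(query_filter.get(a) != v for a, v in cuboid_filter.items()):
        return False  # cuboid was restricted to different data
    stored = {LEVELS[a][0]: LEVELS[a][1] for a in cuboid_attrs}
    needed = list(query_attrs)
    for attr, value in query_filter.items():
        if cuboid_filter.get(attr) != value:
            needed.append(attr)  # selection must still be applied on this cuboid
    return all(
        LEVELS[a][0] in stored and stored[LEVELS[a][0]] <= LEVELS[a][1]
        for a in needed
    )

query_attrs = ["brand", "province_or_state"]
query_filter = {"year": 2004}
cuboids = {
    1: (["year", "item_name", "city"], {}),
    2: (["year", "brand", "country"], {}),
    3: (["year", "brand", "province_or_state"], {}),
    4: (["item_name", "province_or_state"], {"year": 2004}),
}
usable = [n for n, (attrs, flt) in cuboids.items()
          if can_answer(attrs, flt, query_attrs, query_filter)]
print(usable)
```

Cuboid 2 is rejected because country is coarser than province_or_state, so the finer grouping cannot be recovered from it.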
What is Concept Description?
Descriptive vs. predictive data mining
Descriptive mining: describes concepts or task-
use Big_University_DB
mine characteristics as “Science_Students”
in relevance to name, gender, major,
birth_place, birth_date, residence, phone#,
gpa
from student
where status in “graduate”
Corresponding SQL statement:
              Birth_Region
Gender    Canada   Foreign   Total
M             16        14      30
F             10        22      32
Total         26        36      62
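The margins of the crosstab follow directly from its four inner cells. The short sketch below recomputes the row and column totals from the counts shown in the table; the helper name with_totals and the derived share are illustrative additions.

```python
# Cell counts copied from the crosstab: rows are gender, columns are birth region.
counts = {
    "M": {"Canada": 16, "Foreign": 14},
    "F": {"Canada": 10, "Foreign": 22},
}

def with_totals(table):
    """Add a 'Total' column to every row and a 'Total' row across columns."""
    columns = list(next(iter(table.values())))
    result = {
        row: {**cells, "Total": sum(cells.values())}
        for row, cells in table.items()
    }
    result["Total"] = {
        col: sum(result[row][col] for row in table)
        for col in columns + ["Total"]
    }
    return result

crosstab = with_totals(counts)
for row, cells in crosstab.items():
    print(row, cells)

# Share of one gender within a birth region, e.g. males among the Canadian-born.
share_m_canada = crosstab["M"]["Canada"] / crosstab["Total"]["Canada"]
print(f"{share_m_canada:.1%}")
```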
Cj = target class
qa = a generalized tuple that covers some tuples of the target class but can also cover some tuples of contrasting classes
Count distribution between graduate and undergraduate students for a generalized tuple
Quantitative discriminant rule
$\forall X,\ graduate\_student(X) \Leftarrow birth\_country(X) = \text{"Canada"} \wedge age\_range(X) = \text{"25-30"} \wedge gpa(X) = \text{"good"} \quad [d : 30\%]$
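The [d : 30%] annotation is a d-weight: of all tuples covered by the generalized tuple, the fraction that belongs to the target class. A minimal sketch of that ratio follows. The counts 90 and 210 are assumptions chosen so that the result matches the 30% in the rule; they do not appear in the text above, and d_weight is a hypothetical helper name.

```python
def d_weight(counts_by_class, target):
    """Fraction of the tuples covered by one generalized tuple that belong to `target`."""
    total = sum(counts_by_class.values())
    if total == 0:
        raise ValueError("generalized tuple covers no tuples")
    return counts_by_class[target] / total

# Hypothetical counts for the tuple (Canada, 25-30, good); not taken from the slide.
covered = {"graduate": 90, "undergraduate": 210}
print(f"d-weight = {d_weight(covered, 'graduate'):.0%}")
```

A d-weight near 100% means the generalized tuple almost exclusively describes the target class; a low value means it mostly describes the contrasting classes.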
From On-Line Analytical Processing (OLAP) to On-Line Analytical Mining (OLAM)
Why online analytical mining?
High quality of data in data warehouses
DW contains integrated, consistent, cleaned data
Available information processing structure
Integration and swapping of multiple mining functions, algorithms, and tasks
An OLAM System Architecture
[Figure: four-layer architecture.
Layer 4, User Interface: user GUI API; mining queries go in, mining results come out.
Layer 3, OLAP/OLAM: OLAM Engine and OLAP Engine.
Layer 2, MDDB: multidimensional database with Meta Data, built from Layer 1 through Filtering & Integration and the Database API.
Layer 1, Data Repository: Databases and Data Warehouse, with data cleaning and data integration.]
Summary
Data generalization: Attribute-oriented induction
Data warehousing: A multi-dimensional model of a data
warehouse
Star schema, snowflake schema, fact constellations
A data cube consists of dimensions & measures
OLAP operations: drilling, rolling, slicing, dicing and pivoting
Data warehouse architecture
OLAP servers: ROLAP, MOLAP, HOLAP
Efficient computation of data cubes
Partial vs. full vs. no materialization
Indexing OLAP data: Bitmap index and join index
OLAP query processing
From OLAP to OLAM (on-line analytical mining)
References (I)
S. Agarwal, R. Agrawal, P. M. Deshpande, A. Gupta, J. F. Naughton, R.
Ramakrishnan, and S. Sarawagi. On the computation of multidimensional
aggregates. VLDB’96
D. Agrawal, A. E. Abbadi, A. Singh, and T. Yurek. Efficient view maintenance
in data warehouses. SIGMOD’97
R. Agrawal, A. Gupta, and S. Sarawagi. Modeling multidimensional
databases. ICDE’97
S. Chaudhuri and U. Dayal. An overview of data warehousing and OLAP
technology. ACM SIGMOD Record, 26:65-74, 1997
E. F. Codd, S. B. Codd, and C. T. Salley. Beyond decision support. Computer
World, 27, July 1993.
J. Gray, et al. Data cube: A relational aggregation operator generalizing
group-by, cross-tab and sub-totals. Data Mining and Knowledge Discovery,
1:29-54, 1997.
A. Gupta and I. S. Mumick. Materialized Views: Techniques, Implementations,
and Applications. MIT Press, 1999.
J. Han. Towards on-line analytical mining in large databases. ACM SIGMOD
Record, 27:97-107, 1998.
References (II)
C. Imhoff, N. Galemmo, and J. G. Geiger. Mastering Data Warehouse Design:
Relational and Dimensional Techniques. John Wiley, 2003
W. H. Inmon. Building the Data Warehouse. John Wiley, 1996
R. Kimball and M. Ross. The Data Warehouse Toolkit: The Complete Guide to
Dimensional Modeling. 2ed. John Wiley, 2002
P. O'Neil and D. Quass. Improved query performance with variant indexes.
SIGMOD'97
Microsoft. OLEDB for OLAP programmer's reference version 1.0. In
http://www.microsoft.com/data/oledb/olap, 1998
A. Shoshani. OLAP and statistical databases: Similarities and differences.
PODS’00.
S. Sarawagi and M. Stonebraker. Efficient organization of large
multidimensional arrays. ICDE'94
OLAP council. MDAPI specification version 2.0. In
http://www.olapcouncil.org/research/apily.htm, 1998
E. Thomsen. OLAP Solutions: Building Multidimensional Information Systems.
John Wiley, 1997
P. Valduriez. Join indices. ACM Trans. Database Systems, 12:218-246, 1987.