DataMining- Chapter2 - Data WareHouse
Data Warehousing and OLAP
Technology: An Overview
Data Warehouse—Integrated
Constructed by integrating multiple,
heterogeneous data sources
relational databases, flat files, on-line transaction
records
A multi-dimensional data model
The relational (tabular) model, by contrast, is a two-dimensional data model: rows and
columns.
Such a data model is appropriate for online
transaction processing.
…
In a data warehouse, fact tables (FTs) and dimension tables (DTs) are key components
for business data analysis and decision making.
Fact table
Is called referencing table.
It consists of foreign key references to dimension
tables.
A FT is a table in the database that stores
quantitative data (facts) about a business
process, usually numeric values such as sales
revenue, quantity sold, or profit.
Provides “How much”.
The fact table contains the names of the facts,
or measures, as well as keys to each of the related
dimension tables.
…
Dimension tables
A DT is a table in the database that stores descriptive
attributes, often text values that provide context to
the data in the fact table.
Provides “who, what, where, and when” of the
data.
Called referenced table.
It has a primary key that is referenced by a foreign key
in the fact table(s).
Example dimensions: time, location, supplier
[Figure: dimension hierarchy levels, including Office, Day, and Month]
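The relationship between a fact table and a dimension table can be sketched in a few lines of Python. This is only an illustration: the names `item_dim` and `sales_fact` and all of the values are invented here, and a real warehouse would hold these as database tables rather than in-memory structures.

```python
# Hypothetical dimension table: primary key -> descriptive attributes ("what").
item_dim = {
    1: {"item_name": "TV", "brand": "Acme"},
    2: {"item_name": "PC", "brand": "Zenith"},
}

# Hypothetical fact table: each row carries a foreign key into item_dim
# plus numeric measures ("how much").
sales_fact = [
    {"item_key": 1, "units_sold": 3, "dollars_sold": 1500.0},
    {"item_key": 2, "units_sold": 1, "dollars_sold": 900.0},
    {"item_key": 1, "units_sold": 2, "dollars_sold": 1000.0},
]


def dollars_by_item_name(facts, dim):
    """Resolve each fact row's foreign key and total the measure per item name."""
    totals = {}
    for row in facts:
        name = dim[row["item_key"]]["item_name"]  # foreign key -> dimension row
        totals[name] = totals.get(name, 0.0) + row["dollars_sold"]
    return totals


totals = dollars_by_item_name(sales_fact, item_dim)
```

The fact rows stay compact (keys and numbers), while the descriptive text lives once in the dimension table.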
A Sample Data Cube
[Figure: sample 3-D data cube with dimensions Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, PC, VCR, sum), and Country (U.S.A., Canada, Mexico, sum). The highlighted cell is the total annual sales of TV in U.S.A.]
Conceptual Modeling of Data
Warehouses
Example of Star Schema
Sales Fact Table
  Keys: time_key, item_key, branch_key, location_key
  Measures: units_sold, dollars_sold, avg_sales
Dimension tables
  time: time_key, day, day_of_the_week, month, quarter, year
  item: item_key, item_name, brand, type, supplier_type
  branch: branch_key, branch_name, branch_type
  location: location_key, street, city, state_or_province, country
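One way to make the star schema concrete is to build a cut-down version in an in-memory SQLite database. The sketch below assumes only two of the four dimensions (time and item) to stay short. The table names `time_dim`, `item_dim`, and `sales_fact` and the sample rows are illustrative choices, not part of the original schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, day INTEGER,
                       month INTEGER, quarter TEXT, year INTEGER);
CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT,
                       brand TEXT, type TEXT, supplier_type TEXT);
-- The fact table references each dimension and stores the measures.
CREATE TABLE sales_fact (
    time_key INTEGER REFERENCES time_dim(time_key),
    item_key INTEGER REFERENCES item_dim(item_key),
    units_sold INTEGER,
    dollars_sold REAL
);
""")

# Made-up sample data.
conn.executemany("INSERT INTO time_dim VALUES (?, ?, ?, ?, ?)",
                 [(1, 15, 1, "Q1", 2024), (2, 20, 4, "Q2", 2024)])
conn.executemany("INSERT INTO item_dim VALUES (?, ?, ?, ?, ?)",
                 [(1, "TV", "Acme", "electronics", "wholesale"),
                  (2, "PC", "Zenith", "electronics", "retail")])
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?)",
                 [(1, 1, 3, 1500.0), (1, 2, 1, 900.0), (2, 1, 2, 1000.0)])

# A typical star join: the fact table joined directly to each dimension.
rows = conn.execute("""
    SELECT t.quarter, i.item_name, SUM(f.dollars_sold)
    FROM sales_fact f
    JOIN time_dim t ON f.time_key = t.time_key
    JOIN item_dim i ON f.item_key = i.item_key
    GROUP BY t.quarter, i.item_name
    ORDER BY t.quarter, i.item_name
""").fetchall()
```

Every dimension is one join away from the fact table, which is what gives the schema its star shape.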
Conceptual Modeling of Data
Warehouses
Example of Snowflake
Schema
Sales Fact Table
  Keys: time_key, item_key, branch_key, location_key
  Measures: units_sold, dollars_sold, avg_sales
Dimension tables
  time: time_key, day, day_of_the_week, month, quarter, year
  item: item_key, item_name, brand, type, supplier_key
    supplier: supplier_key, supplier_type
  branch: branch_key, branch_name, branch_type
  location: location_key, street, city_key
    city: city_key, city, state_or_province, country
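The practical effect of the snowflake layout is that some attributes sit two joins away from the fact table. The sketch below shows this for the item-to-supplier branch only, using an in-memory SQLite database. The sample rows are invented, and the fact table is reduced to a single key and a single measure for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- supplier is split out of item: the dimension itself is normalized.
CREATE TABLE supplier (supplier_key INTEGER PRIMARY KEY, supplier_type TEXT);
CREATE TABLE item (item_key INTEGER PRIMARY KEY, item_name TEXT,
                   supplier_key INTEGER REFERENCES supplier(supplier_key));
CREATE TABLE sales_fact (item_key INTEGER REFERENCES item(item_key),
                         dollars_sold REAL);
INSERT INTO supplier VALUES (10, 'wholesale'), (20, 'retail');
INSERT INTO item VALUES (1, 'TV', 10), (2, 'PC', 20), (3, 'VCR', 10);
INSERT INTO sales_fact VALUES (1, 1500.0), (2, 900.0), (3, 200.0);
""")

# Reaching supplier_type now takes two joins: fact -> item -> supplier.
by_supplier_type = conn.execute("""
    SELECT s.supplier_type, SUM(f.dollars_sold)
    FROM sales_fact f
    JOIN item i ON f.item_key = i.item_key
    JOIN supplier s ON i.supplier_key = s.supplier_key
    GROUP BY s.supplier_type
    ORDER BY s.supplier_type
""").fetchall()
```

Compared with the star schema, the snowflake form stores `supplier_type` once per supplier rather than once per item, at the cost of the extra join.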
Conceptual Modeling of Data
Warehouses
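Fact constellation: multiple fact tables share dimension tables.
It can be viewed as a collection of stars, so it is also called a galaxy schema.
It is more complex than a star or snowflake schema.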
Example of Fact
Constellation
Sales Fact Table
  Keys: time_key, item_key, branch_key, …
Shipping Fact Table
  Keys: time_key, item_key, shipper_key, from_location, …
Shared dimension tables
  time: time_key, day, day_of_the_week, month, quarter, year
  item: item_key, item_name, brand, type, supplier_type
  …
<dimension_name_first_time> in cube
<cube_name_first_time>
Defining Star Schema in
DMQL
define cube sales_star [time, item, branch, location]:
    dollars_sold = sum(sales_in_dollars),
    avg_sales = avg(sales_in_dollars),
    units_sold = count(*)
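To see what the three measures in this cube definition compute, the following Python sketch groups some made-up base records by the four dimensions and derives `dollars_sold`, `avg_sales`, and `units_sold` for each cell. The record layout and the values are assumptions for illustration. DMQL itself is not executed here.

```python
from collections import defaultdict

# Hypothetical base records: (time, item, branch, location, sales_in_dollars).
records = [
    ("Q1", "TV", "B1", "Vancouver", 500.0),
    ("Q1", "TV", "B1", "Vancouver", 700.0),
    ("Q2", "PC", "B2", "Toronto", 900.0),
]

# Group the sales amounts by cube cell (one cell per dimension combination).
cells = defaultdict(list)
for time, item, branch, location, amount in records:
    cells[(time, item, branch, location)].append(amount)

# Apply the three aggregate functions named in the cube definition.
measures = {
    key: {
        "dollars_sold": sum(amounts),               # sum(sales_in_dollars)
        "avg_sales": sum(amounts) / len(amounts),   # avg(sales_in_dollars)
        "units_sold": len(amounts),                 # count(*)
    }
    for key, amounts in cells.items()
}
```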
Defining Snowflake Schema in
DMQL
DMQL?
Defining Fact Constellation in
DMQL
Roll up (drill-up)
Drill down (roll down)
Slice and dice
Pivot (rotate)
Typical OLAP Operations
Roll up (drill-up):
Performs aggregation on a data cube (summarizes data) by climbing up a
hierarchy or by dimension reduction
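Both forms of roll-up can be sketched on a tiny cube held in a Python dictionary. The hierarchy `city_to_country`, the cell values, and the function names are assumptions made for this example only.

```python
from collections import defaultdict

# Hypothetical concept hierarchy for the location dimension: city -> country.
city_to_country = {"Vancouver": "Canada", "Toronto": "Canada", "Chicago": "U.S.A."}

# Hypothetical cube cells keyed by (city, quarter), holding dollars_sold.
cube = {
    ("Vancouver", "Q1"): 100.0,
    ("Toronto", "Q1"): 150.0,
    ("Chicago", "Q1"): 300.0,
    ("Vancouver", "Q2"): 120.0,
}


def roll_up_location(cells, hierarchy):
    """Roll up by climbing the hierarchy: aggregate cities into countries."""
    out = defaultdict(float)
    for (city, quarter), value in cells.items():
        out[(hierarchy[city], quarter)] += value
    return dict(out)


def roll_up_drop_time(cells):
    """Roll up by dimension reduction: aggregate the time dimension away."""
    out = defaultdict(float)
    for (city, _quarter), value in cells.items():
        out[city] += value
    return dict(out)


by_country = roll_up_location(cube, city_to_country)
by_city = roll_up_drop_time(cube)
```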
Typical OLAP Operations
Drill down (roll down): the reverse of roll-up;
moves from a higher-level summary to a lower-level summary or detailed data
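A rough Python illustration of drill-down: start from quarterly totals, then step down the time hierarchy to see the months that make up one quarter. The month-to-quarter mapping and the amounts are invented. The sketch assumes the detailed (monthly) data is still available, since a summary alone cannot be drilled into.

```python
from collections import defaultdict

# Hypothetical detailed data: (month, dollars_sold).
base = [("Jan", 40.0), ("Feb", 60.0), ("Mar", 50.0), ("Apr", 70.0)]
month_to_quarter = {"Jan": "Q1", "Feb": "Q1", "Mar": "Q1", "Apr": "Q2"}


def summarize(rows, level):
    """Aggregate the base rows at the requested level of the time hierarchy."""
    totals = defaultdict(float)
    for month, amount in rows:
        key = month_to_quarter[month] if level == "quarter" else month
        totals[key] += amount
    return dict(totals)


# Higher-level summary.
by_quarter = summarize(base, "quarter")

# Drill down into Q1: the same data, one level lower in the hierarchy.
q1_by_month = {
    month: total
    for month, total in summarize(base, "month").items()
    if month_to_quarter[month] == "Q1"
}
```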
Typical OLAP Operations
Slice and dice: project and select
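As a small sketch of the difference, the Python below treats a cube as a dictionary keyed by (quarter, item, country). A slice fixes one value on one dimension and drops that dimension, while a dice selects ranges on two or more dimensions and keeps them all. The data and function names are illustrative assumptions.

```python
# Hypothetical cube cells keyed by (quarter, item, country).
cube = {
    ("Q1", "TV", "Canada"): 100.0,
    ("Q1", "PC", "Canada"): 80.0,
    ("Q2", "TV", "U.S.A."): 200.0,
    ("Q2", "PC", "Mexico"): 50.0,
}


def slice_cube(cells, quarter):
    """Slice: select a single quarter and project the time dimension away."""
    return {
        (item, country): value
        for (q, item, country), value in cells.items()
        if q == quarter
    }


def dice_cube(cells, quarters, items):
    """Dice: select on two dimensions at once, keeping a smaller sub-cube."""
    return {
        key: value
        for key, value in cells.items()
        if key[0] in quarters and key[1] in items
    }


q1_slice = slice_cube(cube, "Q1")
tv_dice = dice_cube(cube, quarters={"Q1", "Q2"}, items={"TV"})
```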
Typical OLAP Operations
Pivot (rotate):
Reorient the cube for visualization, for example turning a 3-D cube into a series
of 2-D planes
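The two ideas here, splitting a 3-D cube into 2-D planes and rotating the axes of a plane, can be sketched in Python as follows. The cube contents and the helper names `planes_by_country` and `pivot_plane` are hypothetical.

```python
# Hypothetical 3-D cube keyed by (quarter, item, country).
cube = {
    ("Q1", "TV", "Canada"): 100.0,
    ("Q1", "PC", "Canada"): 80.0,
    ("Q1", "TV", "Mexico"): 30.0,
}


def planes_by_country(cells):
    """Present the 3-D cube as a series of 2-D (quarter, item) planes."""
    planes = {}
    for (quarter, item, country), value in cells.items():
        planes.setdefault(country, {})[(quarter, item)] = value
    return planes


def pivot_plane(plane):
    """Rotate a 2-D plane by swapping its two axes; the values are unchanged."""
    return {(b, a): value for (a, b), value in plane.items()}


planes = planes_by_country(cube)
canada_rotated = pivot_plane(planes["Canada"])
```

Pivoting changes only how the cells are addressed and displayed. No measure is recomputed.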
Typical OLAP Operations: All in One
DW Architecture
The bottom tier is a warehouse database server that
is almost always a relational database system.
DW Architecture …
The middle tier is an OLAP server that is typically
implemented using either
(1) a relational OLAP (ROLAP) model (i.e., an
extended relational DBMS that maps operations on
multidimensional data to standard relational
operations); or
DW Architecture …
Data Warehouse: A Multi-Tiered Architecture
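[Figure: multi-tiered data warehouse architecture. Operational DBs and other sources are extracted, transformed, loaded, and refreshed (with a monitor and integrator, plus metadata) into the data warehouse and data marts; an OLAP server serves these to front-end tools for analysis, query, reports, and data mining.]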
external sources
Data cleaning
detect errors in the data and rectify them when
possible
Data transformation
convert data from legacy or host format to
warehouse format
Load
sort, summarize, consolidate, compute views,
the warehouse
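The cleaning, transformation, and load steps can be sketched as three small Python functions chained together. Everything here is an assumption for illustration: the legacy row format, the DD/MM/YYYY dates, and the choice to drop unparseable rows rather than repair them.

```python
# Hypothetical source rows in a legacy format: amounts as strings,
# dates as DD/MM/YYYY.
source_rows = [
    {"item": "TV", "amount": "1500.00", "date": "15/01/2024"},
    {"item": "PC", "amount": "n/a", "date": "20/01/2024"},  # bad amount
    {"item": "TV", "amount": "1000.00", "date": "03/02/2024"},
]


def clean(rows):
    """Data cleaning: detect rows whose amount is not numeric and drop them."""
    good = []
    for row in rows:
        try:
            float(row["amount"])
        except ValueError:
            continue  # this sketch discards the row instead of rectifying it
        good.append(row)
    return good


def transform(rows):
    """Data transformation: convert to numeric amounts and ISO dates."""
    out = []
    for row in rows:
        day, month, year = row["date"].split("/")
        out.append({"item": row["item"],
                    "amount": float(row["amount"]),
                    "date": f"{year}-{month}-{day}"})
    return out


def load(rows):
    """Load: sort by date and summarize the amount per item."""
    summary = {}
    for row in sorted(rows, key=lambda r: r["date"]):
        summary[row["item"]] = summary.get(row["item"], 0.0) + row["amount"]
    return summary


warehouse = load(transform(clean(source_rows)))
```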
Metadata Repository
Efficient computation of data cube
Compute cube
Indexing OLAP data
bitmap indexing
Join indexing
Efficient processing of OLAP queries
Compute cube
Compute cube operator
Data can be viewed as a lattice of cuboids.
The bottom-most cuboid is the base cuboid.
The top-most cuboid (the apex) contains only one
cell.
The total number of cuboids, or group-bys, that can be
computed for a data cube with n dimensions is 2^n (assuming no concept hierarchies on the dimensions).
If the dimensions are item, city, and year, there are 2^3 = 8 cuboids.
The possible group-bys are {(city, item, year), (city,
item), (city, year), (item, year), (city), (item), (year),
()}.
Cube definition and computation in DMQL:
define cube sales[item, city, year]: sum(sales_in_dollars)
[Figure: lattice of cuboids, from the apex () down through (city), (item), (year)]
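The count of 2^n group-bys can be checked by enumerating every subset of the dimensions. The short Python sketch below does this with `itertools.combinations`. The function name `cuboids` is an illustrative choice, and the enumeration assumes no concept hierarchies on the dimensions.

```python
from itertools import combinations


def cuboids(dimensions):
    """List every group-by, from the base cuboid down to the apex ()."""
    result = []
    for size in range(len(dimensions), -1, -1):
        # All subsets of the dimensions that have exactly `size` members.
        result.extend(combinations(dimensions, size))
    return result


all_cuboids = cuboids(("city", "item", "year"))
```

For three dimensions this yields eight tuples, starting with the base cuboid `("city", "item", "year")` and ending with the apex `()`.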
Indexing OLAP Data: Bitmap
Index
Join Indices: Example
Data Warehouse Usage
Three kinds of data warehouse applications
Information processing
supports querying, basic statistical analysis, and
reporting using crosstabs, tables, charts and graphs
Analytical processing
multidimensional analysis of data warehouse data
supports basic OLAP operations, slice-dice, drilling,
pivoting
Data mining
knowledge discovery from hidden patterns
supports associations, constructing analytical
models, performing classification and prediction, and