Ch2 Data Warehousing
Chapter 2
Data warehousing
Slide | 1
Scientific Databases
Digital Libraries
Different interfaces
Different data representations
Duplicate and inconsistent information
Slide | 2
[Figure: information sources. Labels: World Wide Web, Sales Administration, Finance, Manufacturing, ...]
Slide | 3
[Figure: Integration System over heterogeneous sources. Labels: World Wide Web, Digital Libraries, Scientific Databases, Personal Databases]
The Warehouse
[Figure: warehouse architecture. Clients query the Data Warehouse, which is populated by an Integration System (with Metadata); each Source has its own Extractor/Monitor.]
Slide | 5
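Data warehouse:
A subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management's decision-making process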
Data warehousing:
The process of constructing and using data warehouses
Slide | 6
Data Warehouse: Subject-Oriented
Organized around major subjects, such as customer, product, and sales.
Focuses on the modeling and analysis of data for decision making.
Slide | 7
Data Warehouse: Integrated
Constructed by integrating multiple, heterogeneous data sources,
e.g., relational databases, flat files, on-line transaction records
Slide | 8
Slide | 9
Data Warehouse: Non-Volatile
A physically separate store of data transformed from the
operational environment.
Operational update of data does not occur in the data
warehouse environment.
Does not require transaction processing, recovery, and
concurrency control mechanisms
[Figure: data warehouse environment. Labels: Client, Design Phase, Loading, Maintenance, Optimization, Warehouse, Metadata, Integrator, and one Extractor/Monitor per source]
Slide | 11
Slide | 12
OLTP vs. OLAP
Feature            | OLTP (on-line transaction processing) | OLAP (on-line analytical processing)
Support for        | ...                                   | Analysis, decision making
Data model         | Relational                            | Multidimensional
Age of data        | ...                                   | ...
Data modification  | ...                                   | Read/access only
Type of data       | Dynamic                               | Static
Database size      | Smaller                               | ...
Schema design      | Normalization                         | Denormalization
Slide | 13
Slide | 14
Slide | 15
OLAP servers (p. 135)
Relational OLAP (ROLAP): extended relational DBMS that maps
operations on multidimensional data to standard relational
operators
Multidimensional OLAP (MOLAP): special-purpose server that
directly implements multidimensional data and operations
Clients
Query and reporting tools
Analysis tools
Data mining tools
Slide | 17
[Figure: three-tier data warehouse architecture. Operational DBs and semistructured sources are extracted, transformed, loaded, and refreshed into the Data Warehouse and Data Marts (Tier 1: data warehouse server); OLAP servers, e.g., ROLAP or MOLAP (Tier 2), serve the clients (Tier 3) for query/reporting, analysis, and data mining.]
Slide | 18
Data Preprocessing
Real-world data is noisy, incomplete (missing values), and
inconsistent (why?)
Low-quality data leads to low-quality mining results
Data Cleaning
Data integration
Data transformations
Data reduction
Slide | 20
Data Cleaning
Missing values
No recorded value for some attributes, such as
income
How can missing data be filled in?
E.g., manually, with the attribute mean, or with the most probable value
Noisy Data
Contains errors or outlier values
How can noisy data be smoothed?
E.g., binning, regression, clustering
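The "fill with mean" option for missing values can be sketched as follows. This is a minimal illustration, assuming missing entries are represented as None; the function name and the income figures are hypothetical.

```python
def fill_missing_with_mean(values):
    """Replace None entries with the mean of the observed (non-missing) values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot compute a mean: every value is missing")
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


# Hypothetical income column with two missing entries.
incomes = [30000, None, 50000, 40000, None]
filled = fill_missing_with_mean(incomes)
print(filled)
```

The mean is a simple default; it shrinks the variance of the attribute, which is one reason the slide also lists the "most probable value" as an alternative.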
Slide | 21
Binning
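One common form of binning is smoothing by bin means: sort the values, split them into equal-frequency bins, and replace each value with the mean of its bin. The sketch below assumes that variant (the slide does not say which one was shown); the function name and price data are illustrative.

```python
def smooth_by_bin_means(values, bin_size):
    """Sort values, split them into equal-frequency bins of bin_size items,
    and replace each value with the mean of its bin.
    The last bin may be smaller if len(values) is not a multiple of bin_size."""
    if bin_size <= 0:
        raise ValueError("bin_size must be positive")
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        mean = sum(bin_values) / len(bin_values)
        smoothed.extend([mean] * len(bin_values))
    return smoothed


# Illustrative prices, already small enough to check by hand.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
smoothed = smooth_by_bin_means(prices, bin_size=3)
print(smoothed)  # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```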
Slide | 22
By: Sur
Data Integration
Combines data from multiple sources (e.g.,
databases, data cubes, or flat files) into the data
warehouse
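As a rough sketch of what integration involves, the example below merges customer records from two sources that use different field names and inconsistent name formatting. The source names (crm, billing), field names, and records are all hypothetical.

```python
def integrate(crm_rows, billing_rows):
    """Merge customer records from two hypothetical sources into one
    dictionary keyed by a shared customer id, normalizing name formatting."""
    merged = {}
    for row in crm_rows:
        merged[row["cust_id"]] = {
            "name": row["name"].strip().title(),
            "city": row["city"],
        }
    for row in billing_rows:
        # Reuse the existing record when both sources describe the same customer.
        record = merged.setdefault(
            row["customer_id"],
            {"name": row["full_name"].strip().title(), "city": None},
        )
        record["balance"] = row["balance"]
    return merged


crm = [{"cust_id": 1, "name": "ada lovelace", "city": "London"}]
billing = [
    {"customer_id": 1, "full_name": "Ada Lovelace", "balance": 120.0},
    {"customer_id": 2, "full_name": "alan turing ", "balance": 0.0},
]
customers = integrate(crm, billing)
print(customers)
```

Real integration systems also have to resolve conflicting values and match records that lack a shared key; this sketch assumes a common id exists.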
Slide | 23
Data Transformation
Data Reduction
Goal: make the mining process more efficient
without losing quality
E.g.
Slide | 25
Conceptual Modeling of DW
Dimensions & Measures
Star schema: A fact table in the middle connected to a set of
dimension tables
Fact table: time_key, item_key, branch_key, location_key
Measures: units_sold, dollars_sold, avg_sales
Dimension table time: time_key, day, month, quarter, year
Dimension table item: item_key, item_name, brand, supplier_type
Dimension table branch: branch_key, branch_name, branch_type
Dimension table location: location_key, street, city, country
Slide | 26
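A star schema of this shape can be sketched with Python's built-in sqlite3 module. The table names (time_dim, location_dim, sales_fact, and so on) and all rows are assumptions made for illustration; avg_sales is left out because it can be derived with AVG at query time.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, day INTEGER, month INTEGER,
                       quarter TEXT, year INTEGER);
CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT,
                       supplier_type TEXT);
CREATE TABLE branch_dim (branch_key INTEGER PRIMARY KEY, branch_name TEXT,
                         branch_type TEXT);
CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, street TEXT,
                           city TEXT, country TEXT);
-- The fact table holds one foreign key per dimension plus the measures.
CREATE TABLE sales_fact (
    time_key INTEGER REFERENCES time_dim(time_key),
    item_key INTEGER REFERENCES item_dim(item_key),
    branch_key INTEGER REFERENCES branch_dim(branch_key),
    location_key INTEGER REFERENCES location_dim(location_key),
    units_sold INTEGER,
    dollars_sold REAL
);
""")
conn.executemany("INSERT INTO time_dim VALUES (?, ?, ?, ?, ?)",
                 [(1, 15, 1, "Q1", 2024), (2, 20, 4, "Q2", 2024)])
conn.executemany("INSERT INTO item_dim VALUES (?, ?, ?, ?)",
                 [(1, "Laptop", "Acme", "wholesale"), (2, "Phone", "Acme", "retail")])
conn.execute("INSERT INTO branch_dim VALUES (1, 'B1', 'store')")
conn.executemany("INSERT INTO location_dim VALUES (?, ?, ?, ?)",
                 [(1, "1 Main St", "Vancouver", "Canada"),
                  (2, "2 High St", "Toronto", "Canada")])
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?, ?, ?)",
                 [(1, 1, 1, 1, 2, 2000.0), (1, 2, 1, 2, 5, 2500.0),
                  (2, 1, 1, 1, 1, 1000.0), (2, 2, 1, 1, 3, 1500.0)])

# Each dimension is exactly one join away from the fact table.
rows = conn.execute("""
    SELECT t.quarter, l.city, SUM(f.units_sold), SUM(f.dollars_sold)
    FROM sales_fact AS f
    JOIN time_dim AS t ON f.time_key = t.time_key
    JOIN location_dim AS l ON f.location_key = l.location_key
    GROUP BY t.quarter, l.city
    ORDER BY t.quarter, l.city
""").fetchall()
print(rows)
```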
Conceptual Modeling of DW
Snowflake schema
A refinement of the star schema in which some
dimensional hierarchies are normalized into a set
of smaller dimension tables, forming a shape
similar to a snowflake.
Slide | 27
Sales Fact Table: time_key, item_key, branch_key, location_key
Measures: units_sold, dollars_sold, avg_sales
Dimension table time: time_key, day, day_of_the_week, month, quarter, year
Dimension table item: item_key, item_name, brand, type, supplier_key
Dimension table supplier: supplier_key, supplier_type
Dimension table branch: branch_key, branch_name, branch_type
Dimension table location: location_key, street, city_key
Dimension table city: city_key, city, state_or_province, country
Slide | 28
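The practical effect of normalizing a dimension is an extra join at query time. The sketch below, again using sqlite3 with assumed table names (city_dim, location_dim, sales_fact) and made-up rows, rolls sales up to country level through the location and city tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- The location dimension is normalized: city attributes live in their own table.
CREATE TABLE city_dim (city_key INTEGER PRIMARY KEY, city TEXT,
                       state_or_province TEXT, country TEXT);
CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, street TEXT,
                           city_key INTEGER REFERENCES city_dim(city_key));
CREATE TABLE sales_fact (location_key INTEGER REFERENCES location_dim(location_key),
                         dollars_sold REAL);
""")
conn.executemany("INSERT INTO city_dim VALUES (?, ?, ?, ?)",
                 [(1, "Vancouver", "British Columbia", "Canada"),
                  (2, "Toronto", "Ontario", "Canada"),
                  (3, "Frankfurt", "Hesse", "Germany")])
conn.executemany("INSERT INTO location_dim VALUES (?, ?, ?)",
                 [(10, "1 Main St", 1), (11, "2 High St", 2), (12, "3 Zeil", 3)])
conn.executemany("INSERT INTO sales_fact VALUES (?, ?)",
                 [(10, 100.0), (11, 250.0), (12, 75.0), (10, 50.0)])

# Reaching the country attribute takes two joins instead of one.
rows = conn.execute("""
    SELECT c.country, SUM(f.dollars_sold)
    FROM sales_fact AS f
    JOIN location_dim AS l ON f.location_key = l.location_key
    JOIN city_dim AS c ON l.city_key = c.city_key
    GROUP BY c.country
    ORDER BY c.country
""").fetchall()
print(rows)
```

The normalized form stores each city's attributes once, at the cost of the additional join.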
Conceptual Modeling of DW
Fact constellations:
Multiple fact tables share dimension tables,
viewed as a collection of stars, therefore
called galaxy schema or fact constellation
Slide | 29
Sales Fact Table: time_key, item_key, branch_key, location_key
Measures: units_sold, dollars_sold, avg_sales
Shipping fact table: ..., to_location
Measures: dollars_cost, units_shipped
Dimension table time: time_key, day, day_of_the_week, month, quarter, year
Dimension table item: item_key, item_name, brand, type, supplier_type
Dimension table branch: branch_key, branch_name, branch_type
Dimension table location: location_key, street, city, province_or_state, country
Dimension table shipper: shipper_key, shipper_name, location_key, shipper_type
Slide | 30
Data Discretization
Three types of attributes:
Nominal: values from an unordered set, e.g., color, profession
Ordinal: values from an ordered set, e.g., military or academic rank
Continuous: numeric values, e.g., integers or real numbers
Data discretization:
Divide the range of a continuous attribute into intervals
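The simplest way to divide a continuous range into intervals is equal-width partitioning. The sketch below assumes that scheme (the slide does not name one); the function name and the age values are illustrative.

```python
def equal_width_bins(values, k):
    """Assign each value the index (0..k-1) of one of k equal-width
    intervals spanning [min(values), max(values)]."""
    if k <= 0:
        raise ValueError("k must be positive")
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:
        # All values are identical, so they share a single interval.
        return [0 for _ in values]
    # The maximum value would land at index k, so clamp it into the last interval.
    return [min(int((v - lo) / width), k - 1) for v in values]


ages = [18, 22, 35, 41, 58, 78]
labels = equal_width_bins(ages, 3)
print(labels)  # [0, 0, 0, 1, 2, 2]
```

Here the three intervals are [18, 38), [38, 58), and [58, 78], each 20 years wide.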
Slide | 31
or senior)
Slide | 32
Slide | 33
Entropy-Based Discretization
Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary
T, the expected information (weighted entropy) after partitioning is
$$I(S,T) = \frac{|S_1|}{|S|}\,\mathrm{Entropy}(S_1) + \frac{|S_2|}{|S|}\,\mathrm{Entropy}(S_2)$$
Entropy is calculated based on the class distribution of the samples in the set. Given m
classes, the entropy of S1 is
$$\mathrm{Entropy}(S_1) = -\sum_{i=1}^{m} p_i \log_2 p_i$$
where p_i is the probability of class i in S1
The boundary that minimizes the entropy function over all possible boundaries is
selected as a binary discretization
The process is recursively applied to the partitions obtained until some stopping criterion
is met
Such a boundary may reduce data size and improve classification accuracy
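A single step of this procedure, choosing the one boundary T that minimizes the size-weighted entropy of the two partitions, can be sketched as below. Taking candidate boundaries at midpoints between adjacent distinct values is an assumption of this sketch, as are the function names and sample data; the recursive application and stopping criterion are not shown.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy of a list of class labels: -sum(p_i * log2(p_i))."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def best_boundary(samples):
    """samples: list of (value, class_label) pairs.
    Returns (T, weighted_entropy) for the boundary with the lowest weighted
    entropy, or None if all values are identical."""
    ordered = sorted(samples)
    n = len(ordered)
    best = None
    for i in range(1, n):
        left_value, right_value = ordered[i - 1][0], ordered[i][0]
        if left_value == right_value:
            continue  # no boundary fits between equal values
        t = (left_value + right_value) / 2
        s1 = [label for _, label in ordered[:i]]
        s2 = [label for _, label in ordered[i:]]
        score = len(s1) / n * entropy(s1) + len(s2) / n * entropy(s2)
        if best is None or score < best[1]:
            best = (t, score)
    return best


samples = [(1, "a"), (2, "a"), (3, "a"), (10, "b"), (11, "b"), (12, "b")]
boundary, score = best_boundary(samples)
print(boundary, score)
```

For this data the two classes separate cleanly between 3 and 10, so the chosen boundary leaves both partitions pure.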
Slide | 34
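A Concept Hierarchy
Levels, from most general to most specific: all, region, country, city, branch
region: Europe, ..., North_America
country: Germany, ..., Spain (in Europe); Canada, ..., Mexico (in North_America)
city: Frankfurt, ... (in Germany); Vancouver, ..., Toronto (in Canada)
branch: L. Chan, ..., M. Wind
Slide | 35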
Slide | 36
lecture 2
Multidimensional Data
Sales volume as a function of product, month,
and region
Dimensions: Product, Location, Time
Hierarchical summarization paths (levels from general to specific):
Product: Industry, Product
Location: Region, City, Office
Time: Year, Month or Week, Day
Slide | 37
Slide | 38
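Roll-up: summarize data by climbing up a concept hierarchy or by reducing dimensions
Drill-down: the reverse of roll-up; move from higher-level summaries to more detailed data
Slice and dice: select on one dimension (slice) or on two or more dimensions (dice) to obtain a sub-cube
Pivot (rotate): reorient the cube to view the data from a different perspective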
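These operations can be illustrated on a tiny cube held in a dictionary. This is only a sketch: the cube contents, the city-to-country mapping, and the function names are hypothetical, and real OLAP servers use far more efficient structures.

```python
from collections import defaultdict

# Hypothetical cube: (item, quarter, city) -> units sold.
cube = {
    ("phone", "Q1", "Vancouver"): 5,
    ("phone", "Q1", "Toronto"): 3,
    ("phone", "Q2", "Vancouver"): 4,
    ("laptop", "Q1", "Vancouver"): 2,
    ("laptop", "Q2", "Frankfurt"): 6,
}
city_to_country = {"Vancouver": "Canada", "Toronto": "Canada", "Frankfurt": "Germany"}


def roll_up_city_to_country(cells):
    """Roll-up: climb the location hierarchy from city to country."""
    totals = defaultdict(int)
    for (item, quarter, city), units in cells.items():
        totals[(item, quarter, city_to_country[city])] += units
    return dict(totals)


def slice_quarter(cells, quarter):
    """Slice: fix the time dimension to one quarter, leaving a 2-D view."""
    return {(item, city): units
            for (item, q, city), units in cells.items() if q == quarter}


def pivot(table):
    """Pivot: swap the two axes of a 2-D view."""
    return {(b, a): value for (a, b), value in table.items()}


by_country = roll_up_city_to_country(cube)
q1 = slice_quarter(cube, "Q1")
q1_rotated = pivot(q1)
print(by_country)
print(q1)
print(q1_rotated)
```

Drill-down is not shown as a function because it is simply a return from by_country to the more detailed cube, which is why the detailed data must be kept.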
Slide | 39
Slide | 40
[Figure: drill-down on time (from quarters to months)]
Slide | 41
Slide | 42
Slide | 43
Slide | 44
Review
Data mining definitions, applications, issues, classifications
Data warehouse, architecture, benefits, DSS, preprocessing
data cube, OLAP operations
Read Chapters 1, 2, and 3
Questions
Slide | 45
Slide | 46