
Ch2 Data Warehousing

This document discusses data warehousing and data mining. It begins by describing some of the challenges of heterogeneous data sources and the goal of unified access. It then defines what a data warehouse is, including that it is subject-oriented, integrated, time-variant, and non-volatile. Common data warehouse architectures and components are also outlined, including the warehouse, metadata, extractors/monitors, and sources. Finally, it briefly touches on data preprocessing, conceptual modeling of data warehouses using star schemas and snowflake schemas, and decision support systems.

Uploaded by

DipeshKC

Data Mining and Data Warehousing

Chapter 2

Data warehousing


Problem: Heterogeneous Information Sources

Heterogeneities are everywhere
Sources: personal databases, scientific databases, digital libraries, the World Wide Web
Different interfaces
Different data representations
Duplicate and inconsistent information

Problem: Data Management in Large Enterprises


Vertical fragmentation of informational systems (vertical stove pipes)
[Figure: separate stovepipe systems for Sales Administration, Finance, Manufacturing, ...; component labels: Sales Planning, Suppliers, Num. Control, Stock Mngmt, Debt Mngmt, Inventory, ...]

Goal: Unified Access to Data

[Figure: an Integration System placed over the World Wide Web, digital libraries, scientific databases, and personal databases]

Collects and combines information
Provides integrated view, uniform user interface
Supports sharing

The Warehouse
[Figure: Clients, Data Warehouse, Integration System, Metadata, and one Extractor/Monitor per Source]

What is a Data Warehouse?

Defined in many different ways:
A decision support database that is maintained separately from the organization's operational database
Supports information processing by providing a solid platform of consolidated, historical data for analysis.
"A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process." (W. H. Inmon)

Data warehousing:
The process of constructing and using data warehouses

Data Warehouse: Subject-Oriented
Organized around major subjects, such as customer, product, sales.
Focuses on the modeling and analysis of data for decision makers, not on daily operations or transaction processing.
Provides a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process.

Data Warehouse: Integrated
Constructed by integrating multiple, heterogeneous data sources
relational databases, flat files, on-line transaction records
Data cleaning and data integration techniques are applied.
Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
E.g., hotel price: currency, tax, breakfast covered, etc.
When data is moved to the warehouse, it is converted.
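
The hotel-price example can be sketched in code. This is only an illustration of converting source records to one warehouse convention (here: US dollars, tax included). The exchange rates, the tax rate, and the names `HotelPrice` and `to_warehouse_price` are all hypothetical and not part of the slides.

```python
from dataclasses import dataclass

# Hypothetical conversion table and tax rate, for illustration only.
RATES_TO_USD = {"USD": 1.0, "EUR": 1.10}
TAX_RATE = 0.10


@dataclass
class HotelPrice:
    amount: float        # price as quoted by the source system
    currency: str        # currency code used by the source system
    tax_included: bool   # whether the quoted price already includes tax


def to_warehouse_price(p: HotelPrice) -> float:
    """Convert a source price to the warehouse convention: USD, tax included."""
    usd = p.amount * RATES_TO_USD[p.currency]
    if not p.tax_included:
        usd *= 1 + TAX_RATE
    return round(usd, 2)


# Two sources quote the same room under different conventions.
source_a = HotelPrice(100.0, "EUR", tax_included=True)
source_b = HotelPrice(100.0, "USD", tax_included=False)
```

After conversion both records are comparable, which is the point of the integration step: `to_warehouse_price(source_a)` and `to_warehouse_price(source_b)` are expressed in the same currency and tax convention.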

Data Warehouse: Time-Variant
The time horizon for the data warehouse is significantly longer than that of operational systems.
Operational database: current value data.
Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)

Data Warehouse: Non-Volatile
A physically separate store of data transformed from the operational environment.
Operational update of data does not occur in the data warehouse environment.
Does not require transaction processing, recovery, and concurrency control mechanisms
Requires only two operations in data accessing: initial loading of data and access of data.


Generic Warehouse Architecture

[Figure: Clients, Query & Analysis, Warehouse, Metadata, Integrator, and several Extractor/Monitor components; activity labels: Design Phase, Loading, Maintenance, Optimization]

Data Warehousing: Two Distinct Issues

1. How to get information into the warehouse: "data warehousing"
2. What to do with data once it's in the warehouse: "warehouse DBMS"
Both are rich research areas
Industry has focused on 2

Data Warehouse & Database

                                 Data Warehouse                          Database
Purpose                          Analysis, decision making               Day-to-day use
Support for                      OLAP (on-line analytical processing)    OLTP (on-line transaction processing)
Data model                       Multi-dimensional                       Relational
Age of data                      Current & time series                   Current & real time
Data modification                Read/access only                        Insert, update, delete
Type of data                     Static                                  Dynamic
Amount of data per transaction   Larger                                  Smaller
Schema design                    Denormalization                         Normalization

Data Warehousing: Benefit

Provides organizing framework
Allows simplified maintenance
Speeds up future development by aiding understanding of DM
Communication tool for roles and requirements
Coordinate data marts

Data Warehouse and Data Mart

Data warehouse: enterprise based, collects all information about subjects (customers, products, sales, assets, personnel) that span the entire organization
Concerned with decision subjects of the whole enterprise or organization
Requires extensive business modeling (may take years to design and build)

Data mart: department based, departmental subsets that focus on selected subjects
Specialized single-line-of-business warehouses, e.g. within departments or groups of people
Marketing data mart: customer, product, sales

Decision Support System

Information technology to help the knowledge worker (executive, manager, analyst) make faster & better decisions
What were the sales volumes by region and product category for the last year?
How did the share price of comp. manufacturers correlate with quarterly profits over the past 10 years?

On-line analytical processing (OLAP) is an element of decision support systems (DSS)

Three-Tier Decision Support Systems

Warehouse database server
Almost always a relational DBMS, rarely flat files

OLAP servers (pp. 135)
Relational OLAP (ROLAP): extended relational DBMS that maps operations on multidimensional data to standard relational operators
Multidimensional OLAP (MOLAP): special-purpose server that directly implements multidimensional data and operations

Clients
Query and reporting tools
Analysis tools
Data mining tools
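
As a rough illustration of the ROLAP idea (a multidimensional question answered with standard relational operators), the sketch below uses Python's built-in `sqlite3` module. The `sales` table, its columns, and the sample rows are invented for this example; a real ROLAP server adds much more (middleware, optimization, materialized summaries).

```python
import sqlite3

# In-memory relational store standing in for the warehouse database server.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (region TEXT, category TEXT, year INTEGER, units INTEGER)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("East", "TV", 2023, 10),
        ("East", "TV", 2023, 5),
        ("East", "Phone", 2023, 7),
        ("West", "TV", 2023, 3),
        ("West", "TV", 2022, 9),
    ],
)


def sales_by_region_and_category(year):
    """Answer 'sales volume by region and category for one year' with GROUP BY."""
    cursor = conn.execute(
        "SELECT region, category, SUM(units) FROM sales "
        "WHERE year = ? GROUP BY region, category "
        "ORDER BY region, category",
        (year,),
    )
    return cursor.fetchall()
```

Calling `sales_by_region_and_category(2023)` returns one row per (region, category) cell, which is the relational counterpart of reading a two-dimensional slice of the sales cube.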

The Complete Decision Support System

[Figure: three tiers. Information Sources (operational DBs, semistructured sources) feed the Data Warehouse Server (Tier 1: Data Warehouse and Data Marts; extract, transform, load, refresh, etc.), which serves the OLAP Servers (Tier 2: e.g., MOLAP, ROLAP), which serve the Clients (Tier 3: Query/Reporting, Analysis, Data Mining)]

Approaches to OLAP Servers

Three possibilities for OLAP servers

(1) Relational OLAP (ROLAP)
Relational and specialized relational DBMS to store and manage warehouse data
OLAP middleware to support missing pieces
Has greater scalability

(2) Multidimensional OLAP (MOLAP)
Array-based storage structures
Direct access to array data structures
Fast indexing to pre-computed summarized data

(3) Hybrid OLAP (HOLAP) server
- Combines both ROLAP and MOLAP
- E.g. Microsoft SQL Server 2000
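
To make the MOLAP bullets concrete, here is a minimal sketch of array-based storage with one pre-computed summary. The dimension members and facts are invented, and plain nested lists stand in for the specialized array structures a real MOLAP server would use.

```python
# Dimension members; a cell is addressed by its position along each dimension.
products = ["TV", "Phone"]
regions = ["East", "West"]
quarters = ["Q1", "Q2"]

# Dense 3-D array: cube[p][r][q] holds units sold for that cell.
cube = [[[0] * len(quarters) for _ in regions] for _ in products]


def store(product, region, quarter, units):
    """Add a fact to its cell using direct array indexing."""
    p = products.index(product)
    r = regions.index(region)
    q = quarters.index(quarter)
    cube[p][r][q] += units


def lookup(product, region, quarter):
    """Direct access to one cell of the array."""
    return cube[products.index(product)][regions.index(region)][quarters.index(quarter)]


for fact in [
    ("TV", "East", "Q1", 10),
    ("TV", "West", "Q1", 4),
    ("Phone", "East", "Q2", 7),
    ("TV", "East", "Q2", 6),
]:
    store(*fact)

# Pre-computed summary: total units per product, ready for fast lookup.
units_by_product = [sum(sum(row) for row in plane) for plane in cube]
```

The trade-off hinted at in the slide shows up even here: cell access is a direct index, but the array reserves space for every combination of members, including empty ones.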

Data Preprocessing
Real-world data: noisy, missing, and inconsistent (why??)
Low-quality data => low-quality mining results
Data cleaning
Data integration
Data transformation
Data reduction

Data Cleaning
Missing values
No recorded value for several attributes, such as income
How can missing data be filled?
E.g. manually, fill with the mean, fill with the most probable value

Noisy data
Containing errors or outlier values
How can data be smoothed?
E.g. binning, regression, clustering
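
One of the listed options, filling missing values with the attribute mean, can be sketched as follows. The function name `fill_with_mean`, the use of `None` to mark a missing value, and the income figures are assumptions made for this example.

```python
from statistics import mean


def fill_with_mean(values):
    """Replace missing entries (None) with the mean of the known entries."""
    known = [v for v in values if v is not None]
    attribute_mean = mean(known)
    return [attribute_mean if v is None else v for v in values]


# Hypothetical income column with two missing entries.
incomes = [30000, None, 50000, 40000, None]
filled = fill_with_mean(incomes)
```

Mean filling is simple but flattens the distribution; filling with the most probable value (for example from a model trained on the other attributes) keeps more of the structure at higher cost.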

Binning
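
The figure for this slide did not survive extraction. As a stand-in, the sketch below shows one common form of binning, equal-frequency bins followed by smoothing by bin means. The price list and function names are illustrative assumptions, not the slide's own example.

```python
def equal_frequency_bins(values, bin_size):
    """Sort the values and cut them into consecutive bins of bin_size items."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]


def smooth_by_bin_means(bins):
    """Replace every value in a bin with that bin's mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)
smoothed = smooth_by_bin_means(bins)
```

Other variants replace each value with the bin median or with the nearest bin boundary; the partitioning step is the same.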


By: Sur

Data Integration
Combines data from multiple sources (e.g. databases, data cubes, or flat files) into a data warehouse

Data Transformation

Data is transformed into a form appropriate for mining

Some of the methods:
Smoothing: remove noise
Aggregation: summary or aggregation operations are applied to the data.
Generalization: low-level => high-level concepts, e.g. age => youth, middle-aged, senior
Normalization: attribute data are scaled into a specified range such as -1.0 to 1.0 or 0.0 to 1.0 (e.g. how??)
Attribute construction: new features are constructed and added from the given set of attributes to help the mining process
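
One possible answer to the "how??" under normalization is min-max scaling, which maps a value v to (v - min) / (max - min) * (new_max - new_min) + new_min. The sketch below assumes that technique; the slide does not say which method was intended, and the function name is made up for the example.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so the smallest maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are equal, so there is no spread to rescale.
        return [new_min for _ in values]
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


ages = [10, 20, 30]
unit_range = min_max_normalize(ages)               # scaled into 0.0 to 1.0
signed_range = min_max_normalize(ages, -1.0, 1.0)  # scaled into -1.0 to 1.0
```

Min-max scaling preserves the relative spacing of the values but is sensitive to outliers, since a single extreme value stretches the range.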

Data Reduction
Goal: making the mining process more efficient without losing quality

Conceptual Modeling of DW
Dimensions & Measures
Star schema: A fact table in the middle connected to a set of dimension tables

Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
time: time_key, day, month, quarter, year
item: item_key, item_name, brand, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city, country
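
A reduced version of the star schema can be tried out with Python's built-in `sqlite3` module. Only the item and location dimensions are created here, the sample rows are invented, and the table name `sales_fact` is an assumption; the column names follow the slide.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript(
    """
    CREATE TABLE item (
        item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, supplier_type TEXT);
    CREATE TABLE location (
        location_key INTEGER PRIMARY KEY, street TEXT, city TEXT, country TEXT);
    -- Fact table: foreign keys to the dimensions plus the measures.
    CREATE TABLE sales_fact (
        time_key INTEGER, item_key INTEGER REFERENCES item(item_key),
        branch_key INTEGER, location_key INTEGER REFERENCES location(location_key),
        units_sold INTEGER, dollars_sold REAL);
    INSERT INTO item VALUES (1, 'TV-40', 'Acme', 'wholesale'), (2, 'Phone-X', 'Zenit', 'retail');
    INSERT INTO location VALUES (1, 'Main St', 'Vancouver', 'Canada'), (2, 'King St', 'Toronto', 'Canada');
    INSERT INTO sales_fact VALUES
        (1, 1, 1, 1, 2, 800.0), (1, 2, 1, 1, 1, 500.0),
        (2, 1, 2, 2, 3, 1200.0), (2, 1, 1, 1, 1, 400.0);
    """
)


def dollars_by_city_and_brand():
    """Join the fact table to two dimensions and aggregate a measure."""
    return db.execute(
        "SELECT l.city, i.brand, SUM(f.dollars_sold) "
        "FROM sales_fact f "
        "JOIN item i ON f.item_key = i.item_key "
        "JOIN location l ON f.location_key = l.location_key "
        "GROUP BY l.city, i.brand ORDER BY l.city, i.brand"
    ).fetchall()
```

Every analytical query follows the same pattern: start from the fact table, join one hop out to the dimensions needed, then group by dimension attributes and aggregate the measures.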

Conceptual Modeling of DW

Snowflake schema
A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake.

Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_key
supplier: supplier_key, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city_key
city: city_key, city, state_or_province, country

Conceptual Modeling of DW
Fact constellations:
Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation

Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
Shipping Fact Table: time_key, item_key, shipper_key, from_location, to_location; measures: dollars_cost, units_shipped
time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city, province_or_state, country
shipper: shipper_key, shipper_name, location_key, shipper_type

Data Discretization
Three types of attributes:
Nominal: values from an unordered set, e.g., color, profession
Ordinal: values from an ordered set, e.g., military or academic rank
Continuous: real numbers, e.g., integers or real numbers

Data discretization:
Divide the range of a continuous attribute into intervals
Some classification algorithms only accept categorical attributes.
Reduce data size by discretization
Prepare for further analysis

Discretization and Concept Hierarchy

Discretization
Reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals
Interval labels can then be used to replace actual data values
Supervised (use class information) vs. unsupervised
Split (top-down) vs. merge (bottom-up)
Discretization can be performed recursively on an attribute

Concept hierarchy formation
Recursively reduce the data by collecting and replacing low-level concepts (such as numeric values for age) by higher-level concepts (such as young, middle-aged, or senior)
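
The age example can be written as a small mapping from numeric values to interval labels. The cut-off ages (30 and 60) are arbitrary assumptions chosen for illustration; the slides do not give boundaries.

```python
from collections import Counter


def age_to_concept(age):
    """Map a numeric age to a higher-level concept (hypothetical cut-offs)."""
    if age < 30:
        return "young"
    if age < 60:
        return "middle-aged"
    return "senior"


ages = [22, 35, 41, 67, 29]
labels = [age_to_concept(a) for a in ages]

# After generalization the data can be summarized by concept instead of raw value.
counts = Counter(labels)
```

Five distinct numeric values collapse into three labels, which is the data-size reduction the slide refers to.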

Discretization and Concept Hierarchy Generation for Numeric Data
Typical methods:
Binning
Entropy-based discretization: supervised, top-down split

Entropy-Based Discretization

Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the entropy after partitioning is

$$I(S,T) = \frac{|S_1|}{|S|}\,\mathrm{Entropy}(S_1) + \frac{|S_2|}{|S|}\,\mathrm{Entropy}(S_2)$$

Entropy is calculated based on the class distribution of the samples in the set. Given m classes, the entropy of S1 is

$$\mathrm{Entropy}(S_1) = -\sum_{i=1}^{m} p_i \log_2 p_i$$

where pi is the probability of class i in S1

The boundary that minimizes the entropy function over all possible boundaries is selected as a binary discretization

The process is recursively applied to partitions obtained until some stopping criterion is met

Such a boundary may reduce data size and improve classification accuracy
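
The boundary search can be sketched as below. This is a minimal illustration of a single binary split: candidate boundaries are taken as midpoints between adjacent distinct values (an assumption, since the slide does not say how candidates are chosen), and the recursion and stopping criterion are left out. The function names and sample data are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy of a list of class labels: -sum(p_i * log2(p_i))."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def best_boundary(samples):
    """samples: list of (value, class_label) pairs.

    Returns (T, I(S, T)) for the candidate boundary T that minimizes the
    weighted entropy of the two intervals S1 (value <= T) and S2 (value > T).
    """
    values = sorted({v for v, _ in samples})
    best = None
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2  # midpoint between adjacent distinct values
        s1 = [c for v, c in samples if v <= t]
        s2 = [c for v, c in samples if v > t]
        i_st = (len(s1) / len(samples)) * entropy(s1) \
            + (len(s2) / len(samples)) * entropy(s2)
        if best is None or i_st < best[1]:
            best = (t, i_st)
    return best


# Hypothetical samples: (age, buys_product).
samples = [(20, "no"), (25, "no"), (30, "no"), (40, "yes"), (50, "yes"), (60, "yes")]
boundary, weighted_entropy = best_boundary(samples)
```

In this data set the classes separate cleanly between 30 and 40, so the chosen boundary leaves both intervals pure. To discretize further, the same function would be applied to each interval until a stopping rule (for example a minimum entropy reduction) is met.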

A Concept Hierarchy: Dimension (location)

Levels: all > region > country > city > branch

all
  Europe
    Germany
      Frankfurt
      ...
    ...
    Spain
  ...
  North_America
    Canada
      Vancouver
        L. Chan
        ...
        M. Wind
      ...
      Toronto
    ...
    Mexico

Multidimensional Data
Sales volume as a function of product, month, and region
Dimensions: Product, Location, Time
Hierarchical summarization paths
Product: Industry > Category > Product
Location: Region > Country > City > Office
Time: Year > Quarter > Month or Week > Day

3-D data cube representation from table
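
The figure for this slide is missing from the extracted text. As a rough stand-in, the sketch below builds a three-dimensional cube from a flat sales table by keying each cell on (product, quarter, region). The rows and the dictionary representation are assumptions for illustration.

```python
from collections import defaultdict

# Flat table: one row per sale (product, quarter, region, units).
rows = [
    ("TV", "Q1", "East", 10),
    ("TV", "Q1", "East", 5),
    ("TV", "Q2", "West", 4),
    ("Phone", "Q1", "West", 7),
]

# Cube: each cell is addressed by one member from each of the three dimensions.
cube = defaultdict(int)
for product, quarter, region, units in rows:
    cube[(product, quarter, region)] += units

# The members along each dimension (the cube's axes).
products = sorted({k[0] for k in cube})
quarters = sorted({k[1] for k in cube})
regions = sorted({k[2] for k in cube})
```

Rows that fall into the same cell are summed, so the cube holds one aggregated measure per combination of dimension members.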


OLAP Operations in Multidimensional Data Model

Roll-up
Drill-down
Slice and dice
Pivot (rotate)
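
The slide lists the operations without descriptions. The sketch below shows one conventional reading of each on a tiny dictionary-based cube: roll-up aggregates to a coarser level, slice fixes one dimension, dice selects a sub-cube on several dimensions, and pivot reorders the axes. The cube contents, the month-to-quarter table, and all function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical cube: (product, month, city) -> units sold.
cube = {
    ("TV", "Jan", "Toronto"): 5,
    ("TV", "Feb", "Toronto"): 7,
    ("TV", "Apr", "Vancouver"): 4,
    ("Phone", "Jan", "Vancouver"): 3,
    ("Phone", "May", "Toronto"): 6,
}
MONTH_TO_QUARTER = {"Jan": "Q1", "Feb": "Q1", "Mar": "Q1",
                    "Apr": "Q2", "May": "Q2", "Jun": "Q2"}


def roll_up_time(c):
    """Roll-up: climb the time hierarchy from months to quarters."""
    out = defaultdict(int)
    for (product, month, city), units in c.items():
        out[(product, MONTH_TO_QUARTER[month], city)] += units
    return dict(out)


def slice_city(c, city):
    """Slice: fix one dimension (city), leaving a 2-D view."""
    return {(p, m): u for (p, m, ct), u in c.items() if ct == city}


def dice(c, products, cities):
    """Dice: select a sub-cube by restricting two dimensions."""
    return {k: u for k, u in c.items() if k[0] in products and k[2] in cities}


def pivot(c):
    """Pivot: reorder the axes to (city, month, product)."""
    return {(ct, m, p): u for (p, m, ct), u in c.items()}


by_quarter = roll_up_time(cube)
```

Drill-down is the reverse of roll-up: moving from `by_quarter` back to the month-level `cube`, which is only possible because the finer-grained data was kept.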


OLAP Operations: roll-up

OLAP Operations: drill-down
[Figure: drill-down on time (from quarters to months)]

OLAP Operations: dice

OLAP Operations: slice

OLAP Operations: pivot

Review
Data mining definitions, applications, issues, classifications
Data warehouse, architecture, benefits, DSS, preprocessing
data cube, OLAP operations
Read Chapter 1, 2, 3
Questions
