Data Warehousing AND Data Mining
What is a Data Warehouse?
A single, complete and
consistent store of data
obtained from a variety
of different sources
made available to end
users in a way they can
understand and use in a
business context.
[Barry Devlin]
What is Data Warehousing?
A process of transforming data into
information and making
it available to users in a
timely enough manner
to make a difference
Very Large Data Bases
Terabytes -- 10^12 bytes: Walmart -- 24 Terabytes
Data Warehousing --
It is a process
Technique for assembling and
managing data from various
sources for the purpose of
answering business questions.
This enables decisions that
were not previously possible
A decision support database
maintained separately from
the organization’s operational
database
Data Warehouse
A data warehouse is a
subject-oriented,
integrated,
time-varying,
non-volatile
collection of data used to support decision making
Data Warehouse Architecture
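[Figure: source data (relational databases, ERP systems, purchased data, legacy data) passes through extraction and cleansing and an optimized loader into the data warehouse engine, which serves analysis and query; a metadata repository accompanies the warehouse.]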
Data Warehouse for Decision
Support & OLAP
Putting information technology to work to help the
knowledge worker make faster and better
decisions
Which of my customers are most likely to go to
the competition?
What product promotions have the biggest
impact on revenue?
How did the share price of software companies
correlate with profits over the last 10 years?
Decision Support
RDBMS used for OLTP
Operational Systems
Run the business in real time
Based on up-to-the-second data
Optimized to handle large
numbers of simple read/write
transactions
Optimized for fast response to
predefined transactions
Used by people who deal with
customers, products -- clerks,
salespeople etc.
They are increasingly used by
customers
Examples of Operational Data
Data                Industry            Usage                         Technology                                              Volumes
Customer File       All                 Track customer details        Legacy application, flat files, mainframes              Small-medium
Account Balance     Finance             Control account activities    Legacy applications, hierarchical databases, mainframe  Large
Point-of-Sale data  Retail              Generate bills, manage stock  ERP, client/server, relational databases                Very large
Call Record         Telecommunications  Billing                       Legacy application, hierarchical database, mainframe    Very large
Production Record   Manufacturing       Control production            ERP, relational databases, AS/400                       Medium
So, what’s different?
Application-Orientation vs.
Subject-Orientation
Application-orientation (operational database): Loans, Credit Card, Trust, Savings
Subject-orientation (data warehouse): Customer, Vendor, Product, Activity
OLTP vs. Data Warehouse
To summarize ...
OLTP Systems are
used to “run” a
business
The Data
Warehouse helps
to “optimize” the
business
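The contrast can be sketched with Python's built-in sqlite3 module. The `orders` table, its columns, and the sample values are hypothetical, chosen only for illustration. The first function is a short predefined write of the kind an operational system handles; the second scans the accumulated history and summarizes it, as a warehouse query would.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT, "
    "product TEXT, amount REAL, order_year INTEGER)"
)

def place_order(customer, product, amount, year):
    """OLTP-style work: one short, predefined write transaction."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT INTO orders (customer, product, amount, order_year) "
            "VALUES (?, ?, ?, ?)",
            (customer, product, amount, year),
        )
    return cur.lastrowid

def revenue_by_product():
    """Warehouse-style work: scan all history and summarize it."""
    rows = conn.execute(
        "SELECT product, SUM(amount) FROM orders GROUP BY product ORDER BY product"
    )
    return dict(rows.fetchall())

# Hypothetical activity: three small transactions, then one analytical question.
place_order("alice", "cola", 10.0, 2023)
place_order("bob", "cola", 5.0, 2024)
place_order("alice", "milk", 3.0, 2024)
summary = revenue_by_product()
```

In practice the two workloads run on separate databases, as the slides describe; a single in-memory database is used here only to keep the sketch short.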
Components of the Warehouse
Data Extraction and Loading
The Warehouse
Analyze and Query -- OLAP Tools
Metadata
Loading the Warehouse
Operational/source data: sequential, legacy, relational, external
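A minimal sketch of extraction, cleansing, and loading, assuming two hypothetical in-memory sources with inconsistent formatting. The source names, field names, and cleansing rules are invented for illustration; real loaders are far more involved.

```python
import sqlite3

# Hypothetical raw records from two sources with inconsistent formatting.
legacy_rows = [{"cust": " Alice ", "amt": "10.50"}, {"cust": "BOB", "amt": "n/a"}]
relational_rows = [{"cust": "alice", "amt": "4.50"}]

def extract():
    """Pull rows from every source and tag each with where it came from."""
    for source, rows in (("legacy", legacy_rows), ("relational", relational_rows)):
        for row in rows:
            yield source, row

def cleanse(source, row):
    """Normalize names and amounts; return None for rows that cannot be repaired."""
    try:
        amount = float(row["amt"])
    except ValueError:
        return None
    return (row["cust"].strip().lower(), amount, source)

def load(conn, records):
    """Bulk-insert the cleansed records into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount REAL, source TEXT)"
    )
    with conn:
        conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", records)

warehouse = sqlite3.connect(":memory:")
cleaned = [rec for rec in (cleanse(s, r) for s, r in extract()) if rec is not None]
load(warehouse, cleaned)
totals = dict(
    warehouse.execute("SELECT customer, SUM(amount) FROM sales GROUP BY customer")
)
```

The row with the unparseable amount is dropped during cleansing, and the two spellings of the same customer end up as one integrated record set in the warehouse.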
Schema Design
Database organization
must look like business
must be recognizable by business user
approachable by business user
Must be simple
Schema Types
Star Schema
Fact Constellation Schema
Snowflake schema
Dimension Tables
Dimension tables
Define business in terms already familiar to
users
Wide rows with lots of descriptive text
Small tables (about a million rows)
Joined to fact table by a foreign key
heavily indexed
typical dimensions
time periods, geographic region (markets, cities),
products, customers, salesperson, etc.
Fact Table
Central table
mostly raw numeric items
narrow rows, a few columns at most
large number of rows (millions to a billion)
Access via dimensions
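The dimension and fact tables described above can be sketched as a small star schema using Python's built-in sqlite3 module. All table names, column names, and rows below are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: descriptive attributes in business terms.
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_region  (region_id  INTEGER PRIMARY KEY, city TEXT, market TEXT);
-- Fact table: narrow rows of foreign keys plus numeric measures.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    region_id  INTEGER REFERENCES dim_region(region_id),
    units      INTEGER,
    revenue    REAL
);
INSERT INTO dim_product VALUES (1, 'Cola', 'Beverage'), (2, 'Milk', 'Dairy');
INSERT INTO dim_region  VALUES (1, 'NY', 'East'), (2, 'LA', 'West');
INSERT INTO fact_sales  VALUES (1, 1, 10, 20.0), (1, 2, 5, 10.0), (2, 1, 3, 6.0);
""")

def revenue_by_category():
    """Access the facts via a dimension: join on the foreign key, then group."""
    query = """
        SELECT p.category, SUM(f.revenue)
        FROM fact_sales AS f
        JOIN dim_product AS p ON p.product_id = f.product_id
        GROUP BY p.category
        ORDER BY p.category
    """
    return conn.execute(query).fetchall()

category_revenue = revenue_by_category()
```

The query never filters on the fact table directly. It reaches the numeric measures through a dimension attribute (`category`), which is the access pattern the slide describes.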
Star Schema
Fact Constellation
Multiple fact tables that share many dimension
tables
Booking and Checkout may share many
dimension tables in the hotel industry
[Figure: the fact tables Booking and Checkout share the dimension tables Promotion, Hotels, Travel Agents, Room Type, and Customer.]
Data Warehouse vs. Data Marts
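[Figure: pyramid running from Information at the top to Data at the bottom, with three levels: individually structured, departmentally structured, and organizationally structured (the data warehouse). History, normalization, and detail range from less at the top to more at the bottom.]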
Data Warehouse and Data Marts
OLAP data mart: lightly summarized, departmentally structured
Data warehouse: organizationally structured, atomic, detailed data
Characteristics of the
Departmental Data Mart
OLAP
Small
Flexible
Customized by department
Source is the departmentally structured data warehouse
Techniques for Creating
Departmental Data Mart
OLAP
Sales Finance Mktg.
Subset
Summarized
Superset
Indexed
Arrayed
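Two of the techniques listed, subsetting and summarizing, can be sketched in plain Python. The warehouse rows, department names, and the `build_data_mart` helper are hypothetical and stand in for whatever tooling a real site would use.

```python
from collections import defaultdict

# Hypothetical detailed warehouse rows: (department, product, month, amount).
warehouse_rows = [
    ("sales", "cola", "2024-01", 10.0),
    ("sales", "cola", "2024-01", 5.0),
    ("sales", "milk", "2024-02", 3.0),
    ("finance", "audit", "2024-01", 7.0),
]

def build_data_mart(rows, department):
    """Subset the warehouse to one department, then summarize by (product, month)."""
    mart = defaultdict(float)
    for dept, product, month, amount in rows:
        if dept == department:                # subset
            mart[(product, month)] += amount  # summarize
    return dict(mart)

sales_mart = build_data_mart(warehouse_rows, "sales")
```

The resulting mart is smaller than the warehouse and holds lightly summarized data for one department only.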
Data Mart Centric
Data Sources
Data Marts
Data Warehouse
II. On-Line Analytical Processing (OLAP)
Making Decision
Support Possible
Limitations of SQL
“A Freshman in
Business needs
a Ph.D. in SQL”
-- Ralph Kimball
Typical OLAP Queries
What Is OLAP?
Online Analytical Processing - term coined by
E. F. Codd in a 1993 paper commissioned by
Arbor Software*
Generally synonymous with earlier terms such as
Decision Support, Business Intelligence, Executive
Information System
OLAP = Multidimensional Database
MOLAP: Multidimensional OLAP (Arbor Essbase,
Oracle Express)
ROLAP: Relational OLAP (Informix MetaCube,
Microstrategy DSS Agent)
* Reference: http://www.arborsoft.com/essbase/wht_ppr/coddTOC.html
Strengths of OLAP
OLAP Is FASMI
Fast
Analysis
Shared
Multidimensional
Information
[Figure: data cube with axes Product, Region, and Month; the finest levels shown are Product, Office, and Day.]
A Visual Operation: Pivot (Rotate)
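[Figure: cube with axes Product (Juice, Cola, Milk, Cream), Region (NY, LA, SF), and Month; the visible cells show Juice 10, Cola 47, Milk 30, Cream 12.]
[Figure: cube with axes Product (Household, Telecomm, Video, Audio) and Regions (Europe, Far East, India).]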
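A pivot can be sketched as choosing which dimension labels the rows. The cell values below are hypothetical and are not taken from the figure; `as_table` is an illustrative helper, not part of any OLAP product.

```python
# Hypothetical cell values keyed by (product, region).
cells = {
    ("Juice", "NY"): 1,
    ("Juice", "LA"): 2,
    ("Cola", "NY"): 3,
    ("Cola", "LA"): 4,
}

def as_table(cells, row_axis):
    """Lay out the cells as nested dicts with the chosen axis (0 or 1) on the rows."""
    table = {}
    for key, value in cells.items():
        row, col = key[row_axis], key[1 - row_axis]
        table.setdefault(row, {})[col] = value
    return table

by_product = as_table(cells, row_axis=0)  # products on the rows
by_region = as_table(cells, row_axis=1)   # the pivoted (rotated) view
```

The data is unchanged by the pivot. Only the orientation presented to the user differs.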
Drill-Down and Roll Up
Hierarchy, from highest level to low-level details:
Region
Country
State
Location Address
Sales Representative
Drill-down moves down the hierarchy; roll up moves up.
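Roll up and drill-down can be sketched as totaling the same detail rows at different levels of the hierarchy. Only the first three levels (region, country, state) are modeled here, and the place names and amounts are hypothetical.

```python
from collections import defaultdict

# Upper levels of the location hierarchy, coarsest first.
LEVELS = ["region", "country", "state"]

# Hypothetical detail rows: one sales amount per location.
detail = [
    ({"region": "Asia", "country": "India", "state": "Maharashtra"}, 5.0),
    ({"region": "Asia", "country": "India", "state": "Kerala"}, 2.0),
    ({"region": "Americas", "country": "USA", "state": "California"}, 4.0),
]

def summarize(rows, level):
    """Total the amounts at the given level of the hierarchy."""
    depth = LEVELS.index(level) + 1
    totals = defaultdict(float)
    for location, amount in rows:
        key = tuple(location[name] for name in LEVELS[:depth])
        totals[key] += amount
    return dict(totals)

by_state = summarize(detail, "state")      # most detailed view
by_country = summarize(detail, "country")  # rolled up one level
by_region = summarize(detail, "region")    # rolled up two levels
```

Moving from `by_region` to `by_country` to `by_state` is a drill-down; the reverse direction is a roll up.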
Nature of OLAP Analysis
Aggregation -- total sales, percent-to-total
Comparison -- Budget vs. Expenses
Ranking -- Top 10, quartile analysis
Access to detailed and aggregate data
Complex criteria specification
Visualization
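Two of these analyses, percent-to-total and ranking, can be sketched in a few lines of Python. The product names and sales totals are hypothetical.

```python
# Hypothetical sales totals per product.
sales = {"cola": 50.0, "milk": 30.0, "juice": 15.0, "cream": 5.0}

def percent_to_total(values):
    """Share of each item in the grand total, as a percentage."""
    total = sum(values.values())
    return {name: 100.0 * amount / total for name, amount in values.items()}

def top_n(values, n):
    """Ranking: the n largest items, biggest first."""
    return sorted(values.items(), key=lambda item: item[1], reverse=True)[:n]

shares = percent_to_total(sales)
leaders = top_n(sales, 2)
```

Neither calculation is hard in a general-purpose language, but both are awkward to express in basic SQL, which is part of the motivation for OLAP tools.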
Relational OLAP: 3 Tier DSS
Three tiers: Data Warehouse, ROLAP Engine, Decision Support Client
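A rough sketch of the three tiers, assuming a one-table warehouse in SQLite. The `rolap_query` function, the table, and the data are hypothetical and do not reflect how any particular ROLAP product works. The middle tier translates a request phrased in dimensions into SQL, so the client never writes SQL itself.

```python
import sqlite3

# Tier 1: the data warehouse (a hypothetical one-table stand-in).
warehouse = sqlite3.connect(":memory:")
warehouse.executescript("""
CREATE TABLE sales (product TEXT, region TEXT, amount REAL);
INSERT INTO sales VALUES ('cola', 'NY', 10.0), ('cola', 'LA', 5.0), ('milk', 'NY', 3.0);
""")

ALLOWED_DIMENSIONS = {"product", "region"}

def rolap_query(dimensions):
    """Tier 2: turn a multidimensional request into SQL and run it on the warehouse."""
    unknown = set(dimensions) - ALLOWED_DIMENSIONS
    if not dimensions or unknown:
        raise ValueError(f"expected dimensions from {sorted(ALLOWED_DIMENSIONS)}")
    # Column names come only from the whitelist above, so formatting them is safe.
    cols = ", ".join(dimensions)
    sql = f"SELECT {cols}, SUM(amount) FROM sales GROUP BY {cols} ORDER BY {cols}"
    return warehouse.execute(sql).fetchall()

# Tier 3: the decision support client only names dimensions.
by_product = rolap_query(["product"])
by_product_and_region = rolap_query(["product", "region"])
```

The data stays in relational tables throughout, which is what distinguishes this relational approach from a multidimensional (MOLAP) store.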