Data Warehousing AND Data Mining

The document discusses data warehousing and data mining. It defines a data warehouse as a single, complete store of data from various sources made available to end users in a way they can understand. It describes the evolution of data analysis from batch reports in the 1960s to modern data warehousing with integrated online analytical processing tools. It also discusses the components of a data warehouse, including data extraction and loading, the warehouse itself, analytical tools, and metadata.

Uploaded by

Saurav Kumar
DATA WAREHOUSING

AND
DATA MINING
What is a Data Warehouse?
A single, complete and
consistent store of data
obtained from a variety
of different sources
made available to end
users in a way they can
understand and use in a
business context.

[Barry Devlin]
What is Data Warehousing?

A process of transforming data into
information and making it available to users in a
timely enough manner to make a difference

[Forrester Research, April 1996]
Evolution

• 60’s: Batch reports
  • hard to find and analyze information
  • inflexible and expensive, reprogram for every new request
• 70’s: Terminal-based DSS and EIS (executive information systems)
  • still inflexible, not integrated with desktop tools
• 80’s: Desktop data access and analysis tools
  • query tools, spreadsheets, GUIs
  • easier to use, but only access operational databases
• 90’s: Data warehousing with integrated OLAP engines and tools
Very Large Data Bases
• Terabytes -- 10^12 bytes: Walmart -- 24 Terabytes
• Petabytes -- 10^15 bytes: Geographic Information Systems
• Exabytes -- 10^18 bytes: National Medical Records
• Zettabytes -- 10^21 bytes: Weather images
• Yottabytes -- 10^24 bytes: Intelligence Agency Videos
Data Warehousing -- It is a process
• Technique for assembling and managing data from various sources for the purpose of answering business questions, thus enabling decisions that were not previously possible
• A decision support database maintained separately from the organization’s operational database
Data Warehouse

 A data warehouse is a
 subject-oriented
 integrated
 time-varying
 non-volatile

collection of data that is used primarily in organizational decision making.
-- Bill Inmon, Building the Data Warehouse, 1996
Data Warehouse Architecture
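[Figure: Data Warehouse Architecture. Sources (relational databases, ERP systems, purchased data, legacy data) pass through extraction, cleansing and an optimized loader into the data warehouse engine, which serves analyze and query tools. A metadata repository accompanies the warehouse.]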
Data Warehouse for Decision Support & OLAP
• Putting information technology to work to help the knowledge worker make faster and better decisions
  • Which of my customers are most likely to go to the competition?
  • What product promotions have the biggest impact on revenue?
  • How did the share price of software companies correlate with profits over the last 10 years?
Decision Support

 Used to manage and control business


 Data is historical or point-in-time
 Optimized for inquiry rather than update
 Use of the system is loosely defined and
can be ad-hoc
 Used by managers and end-users to
understand the business and make
judgements
Data Mining works with Warehouse Data
• Data Warehousing provides the Enterprise with a memory
• Data Mining provides the Enterprise with intelligence
What are Operational Systems?
 They are OLTP systems
 Run mission critical
applications
 Need to work with
stringent performance
requirements for
routine tasks
 Used to run a
business!

RDBMS used for OLTP

• Database Systems have been used traditionally for OLTP
  • clerical data processing tasks
  • detailed, up to date data
  • structured repetitive tasks
  • read/update a few records
  • isolation, recovery and integrity are critical
Operational Systems
 Run the business in real time
 Based on up-to-the-second data
 Optimized to handle large
numbers of simple read/write
transactions
 Optimized for fast response to
predefined transactions
 Used by people who deal with
customers, products -- clerks,
salespeople etc.
 They are increasingly used by
customers
Examples of Operational Data
Data               | Industry           | Usage                        | Technology                                             | Volumes
Customer File      | All                | Track Customer Details       | Legacy application, flat files, mainframes             | Small-medium
Account Balance    | Finance            | Control account activities   | Legacy applications, hierarchical databases, mainframe | Large
Point-of-Sale data | Retail             | Generate bills, manage stock | ERP, Client/Server, relational databases               | Very Large
Call Record        | Telecommunications | Billing                      | Legacy application, hierarchical database, mainframe   | Very Large
Production Record  | Manufacturing      | Control Production           | ERP, relational databases, AS/400                      | Medium
So, what’s different?
Application-Orientation vs.
Subject-Orientation

Application-Orientation (Operational Database) | Subject-Orientation (Data Warehouse)
Loans                                          | Customer
Credit Card                                    | Vendor
Trust                                          | Product
Savings                                        | Activity
OLTP vs. Data Warehouse

• OLTP systems are tuned for known transactions and workloads while workload is not known a priori in a data warehouse
• Special data organization, access methods and implementation methods are needed to support data warehouse queries (typically multidimensional queries)
  • e.g., average amount spent on phone calls between 9AM-5PM in Pune during the month of December
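The phone-call example above can be written as a single query that constrains several dimensions at once. The following is a minimal sketch using Python's built-in sqlite3 module; the calls table, its column names and the sample rows are assumptions made for illustration and do not come from the slides.

```python
import sqlite3

# Hypothetical call-detail table; schema and rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE calls (city TEXT, call_date TEXT, start_hour INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO calls VALUES (?, ?, ?, ?)",
    [
        ("Pune", "1996-12-03", 10, 12.0),
        ("Pune", "1996-12-15", 16, 8.0),
        ("Pune", "1996-12-20", 21, 30.0),    # outside 9AM-5PM
        ("Pune", "1996-11-28", 11, 50.0),    # not December
        ("Mumbai", "1996-12-05", 12, 40.0),  # different city
    ],
)

# Three dimensions are restricted together: place, time of day and month.
avg_amount = conn.execute(
    """
    SELECT AVG(amount)
    FROM calls
    WHERE city = 'Pune'
      AND start_hour >= 9 AND start_hour < 17
      AND strftime('%m', call_date) = '12'
    """
).fetchone()[0]
print(avg_amount)
```

A query like this scans many rows and touches no single record by key, which is why it fits a warehouse better than an OLTP system.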
OLTP vs Data Warehouse

OLTP                 | Warehouse (DSS)
Application Oriented | Subject Oriented
Used to run business | Used to analyze business
Detailed data        | Summarized and refined
Current up to date   | Snapshot data
Isolated Data        | Integrated Data
Repetitive access    | Ad-hoc access
Clerical User        | Knowledge User (Manager)
OLTP vs Data Warehouse

OLTP                                  | Data Warehouse
Performance Sensitive                 | Performance relaxed
Few Records accessed at a time (tens) | Large volumes accessed at a time (millions)
Read/Update Access                    | Mostly Read (Batch Update)
No data redundancy                    | Redundancy present
Database Size 100 MB - 100 GB         | Database Size 100 GB - few terabytes
OLTP vs Data Warehouse

OLTP                                             | Data Warehouse
Transaction throughput is the performance metric | Query throughput is the performance metric
Thousands of users                               | Hundreds of users
Managed in entirety                              | Managed by subsets
To summarize ...
 OLTP Systems are
used to “run” a
business

 The Data
Warehouse helps
to “optimize” the
business
Data Warehouse Architecture
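[Figure: Data Warehouse Architecture. Sources (relational databases, ERP systems, purchased data, legacy data) pass through extraction, cleansing and an optimized loader into the data warehouse engine, which serves analyze and query tools. A metadata repository accompanies the warehouse.]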
Components of the Warehouse
 Data Extraction and Loading
 The Warehouse
 Analyze and Query -- OLAP Tools
 Metadata

 Data Mining tools

Loading the Warehouse

Cleaning the data before it is loaded
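What cleaning before loading can involve is sketched below under stated assumptions: the field names (name, city, amount), the normalization rules and the reject rules are all hypothetical, chosen only to show trimming, case normalization, numeric parsing and duplicate removal on rows from different sources.

```python
def clean_record(raw):
    """Return a normalized copy of one source row, or None if it is unusable."""
    name = " ".join(raw.get("name", "").split()).title()
    city = raw.get("city", "").strip().title()
    try:
        amount = float(str(raw.get("amount", "")).replace(",", ""))
    except ValueError:
        return None  # reject rows whose amount cannot be parsed
    if not name:
        return None  # reject rows with no customer name
    return {"name": name, "city": city, "amount": amount}


def clean_batch(rows):
    """Clean every row and drop duplicates that appear after normalization."""
    seen, cleaned = set(), []
    for raw in rows:
        rec = clean_record(raw)
        if rec is None:
            continue
        key = (rec["name"], rec["city"], rec["amount"])
        if key not in seen:
            seen.add(key)
            cleaned.append(rec)
    return cleaned


# Illustrative rows as they might arrive from two different source systems.
source_rows = [
    {"name": "  asha  rao ", "city": "pune", "amount": "1,200.50"},
    {"name": "Asha Rao", "city": "Pune ", "amount": "1200.5"},
    {"name": "", "city": "Pune", "amount": "10"},
    {"name": "R. Mehta", "city": "Mumbai", "amount": "n/a"},
]
cleaned_rows = clean_batch(source_rows)
print(cleaned_rows)
```

The first two rows collapse into one record once spacing, case and number format are normalized; the other two are rejected.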
Source Data

[Figure: Operational/Source Data drawn from sequential, legacy, relational and external sources.]

 Typically host based, legacy applications


 Customized applications, COBOL, 3GL,
4GL
 Point of Contact Devices
 POS, ATM, Call switches
 External Sources
 Nielsen’s, Acxiom, CMIE, Vendors,
Partners

Schema Design

 Database organization
 must look like business
 must be recognizable by business user
 approachable by business user
 Must be simple
 Schema Types
 Star Schema
 Fact Constellation Schema
 Snowflake schema

Dimension Tables

 Dimension tables
 Define business in terms already familiar to
users
 Wide rows with lots of descriptive text
 Small tables (about a million rows)
 Joined to fact table by a foreign key
 heavily indexed
 typical dimensions
 time periods, geographic region (markets, cities),
products, customers, salesperson, etc.

Fact Table

 Central table
 mostly raw numeric items
 narrow rows, a few columns at most
 large number of rows (millions to a billion)
 Access via dimensions

Star Schema

• A single fact table and for each dimension one dimension table
• Does not capture hierarchies directly

[Figure: star schema. A central fact table (date, custno, prodno, cityname, ...) is linked to the dimension tables time, prod, cust and city.]
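A star schema along these lines can be sketched with Python's built-in sqlite3 module. The fact table keys (date, custno, prodno, cityname) follow the slide; the sales measure, the descriptive columns in each dimension table and all sample rows are assumptions added for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- One table per dimension; descriptive columns are hypothetical.
    CREATE TABLE time_dim (date TEXT PRIMARY KEY, month TEXT, year INTEGER);
    CREATE TABLE cust_dim (custno INTEGER PRIMARY KEY, custname TEXT);
    CREATE TABLE prod_dim (prodno INTEGER PRIMARY KEY, prodname TEXT);
    CREATE TABLE city_dim (cityname TEXT PRIMARY KEY, region TEXT);

    -- Central fact table: foreign keys to each dimension plus a numeric measure.
    CREATE TABLE fact (
        date     TEXT    REFERENCES time_dim(date),
        custno   INTEGER REFERENCES cust_dim(custno),
        prodno   INTEGER REFERENCES prod_dim(prodno),
        cityname TEXT    REFERENCES city_dim(cityname),
        sales    REAL
    );

    INSERT INTO time_dim VALUES ('1996-12-01', '1996-12', 1996),
                                ('1996-12-02', '1996-12', 1996);
    INSERT INTO cust_dim VALUES (1, 'Asha'), (2, 'Ravi');
    INSERT INTO prod_dim VALUES (10, 'Juice'), (20, 'Cola');
    INSERT INTO city_dim VALUES ('Pune', 'West'), ('Chennai', 'South');
    INSERT INTO fact VALUES ('1996-12-01', 1, 10, 'Pune', 100.0),
                            ('1996-12-01', 2, 20, 'Pune', 50.0),
                            ('1996-12-02', 1, 10, 'Chennai', 70.0),
                            ('1996-12-02', 2, 10, 'Pune', 30.0);
    """
)

# The fact table is reached through its dimensions: join, then group.
star_report = conn.execute(
    """
    SELECT c.region, p.prodname, SUM(f.sales)
    FROM fact f
    JOIN city_dim c ON f.cityname = c.cityname
    JOIN prod_dim p ON f.prodno = p.prodno
    GROUP BY c.region, p.prodname
    ORDER BY c.region, p.prodname
    """
).fetchall()
print(star_report)
```

Note that region is stored as a plain column of city_dim here, so the city-to-region hierarchy is implicit rather than modeled as its own table.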
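Snowflake schema

• Represent dimensional hierarchy directly by normalizing tables.
• Easy to maintain and saves storage

[Figure: snowflake schema. A central fact table (date, custno, prodno, cityname, ...) is linked to the dimension tables time, prod, cust and city; the city dimension is further linked to a region table.]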
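Normalizing the city dimension into a separate region table, as the figure suggests, might look like the sketch below. The column names beyond cityname, the country attribute and the sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- The hierarchy city -> region is made explicit by a separate table.
    CREATE TABLE region (regionname TEXT PRIMARY KEY, country TEXT);
    CREATE TABLE city (
        cityname   TEXT PRIMARY KEY,
        regionname TEXT REFERENCES region(regionname)
    );
    CREATE TABLE fact (
        date TEXT, custno INTEGER, prodno INTEGER,
        cityname TEXT REFERENCES city(cityname),
        sales REAL
    );

    INSERT INTO region VALUES ('West', 'India'), ('South', 'India');
    INSERT INTO city VALUES ('Pune', 'West'), ('Mumbai', 'West'), ('Chennai', 'South');
    INSERT INTO fact VALUES ('1996-12-01', 1, 10, 'Pune', 100.0),
                            ('1996-12-01', 2, 20, 'Mumbai', 50.0),
                            ('1996-12-02', 1, 10, 'Chennai', 70.0);
    """
)

# Reaching region attributes now costs one extra join through city.
sales_by_region = conn.execute(
    """
    SELECT r.regionname, SUM(f.sales)
    FROM fact f
    JOIN city c   ON f.cityname = c.cityname
    JOIN region r ON c.regionname = r.regionname
    GROUP BY r.regionname
    ORDER BY r.regionname
    """
).fetchall()
print(sales_by_region)
```

Each region's attributes are stored once instead of being repeated on every city row, which is the storage and maintenance benefit the slide mentions; the price is the additional join.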
Fact Constellation

 Fact Constellation
 Multiple fact tables that share many dimension
tables
 Booking and Checkout may share many
dimension tables in the hotel industry

[Figure: fact constellation. The fact tables Booking and Checkout share the dimension tables Promotion, Hotels, Travel Agents, Room Type and Customer.]
Data Warehouse vs. Data Marts

What comes first


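From the Data Warehouse to Data Marts

[Figure: pyramid running from Data at the base to Information at the top. The base is the organizationally structured data warehouse, the middle level is departmentally structured, and the top is individually structured. A side scale runs from More (at the base) to Less (at the top) for History, Normalized and Detailed.]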
Data Warehouse and Data Marts
[Figure: the OLAP data mart holds lightly summarized, departmentally structured data; the data warehouse beneath it holds atomic, detailed, organizationally structured data.]
Characteristics of the
Departmental Data Mart
 OLAP
 Small
 Flexible
 Customized by
Department
 Source is
departmentally
structured data
warehouse

Techniques for Creating Departmental Data Mart

• OLAP
• Subset
• Summarized
• Superset
• Indexed
• Arrayed

[Figure: departmental data marts for Sales, Finance and Mktg.]
Data Mart Centric

Data Sources

Data Marts

Data Warehouse

II. On-Line Analytical Processing (OLAP)

Making Decision
Support Possible
Limitations of SQL

“A Freshman in
Business needs
a Ph.D. in SQL”

-- Ralph Kimball

Typical OLAP Queries

• Write a multi-table join to compare sales for each product line YTD this year vs. last year.
• Repeat the above process to find the top 5 product contributors to margin.
• Repeat the above process to find the sales of a product line to new vs. existing customers.
• Repeat the above process to find the customers that have had negative sales growth.
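The first query in the list above, sales per product line year-to-date this year versus last year, can be sketched as follows with Python's built-in sqlite3 module. The product and sales tables, their columns, the sample rows and the ytd_comparison helper are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE product (prodno INTEGER PRIMARY KEY, product_line TEXT);
    CREATE TABLE sales (
        prodno INTEGER REFERENCES product(prodno),
        sale_date TEXT,
        amount REAL
    );
    INSERT INTO product VALUES (1, 'Beverages'), (2, 'Toiletries');
    INSERT INTO sales VALUES (1, '1995-02-10', 100.0),
                             (1, '1995-11-20', 500.0),
                             (1, '1996-03-05', 150.0),
                             (2, '1995-01-15', 80.0),
                             (2, '1996-02-01', 60.0);
    """
)


def ytd_comparison(conn, as_of):
    """Return (product_line, ytd_this_year, ytd_last_year) up to an ISO date."""
    year = int(as_of[:4])
    this_start = f"{year}-01-01"
    last_start = f"{year - 1}-01-01"
    last_end = f"{year - 1}{as_of[4:]}"  # same month and day, one year earlier
    return conn.execute(
        """
        SELECT p.product_line,
               SUM(CASE WHEN s.sale_date BETWEEN ? AND ? THEN s.amount ELSE 0.0 END),
               SUM(CASE WHEN s.sale_date BETWEEN ? AND ? THEN s.amount ELSE 0.0 END)
        FROM sales s
        JOIN product p ON s.prodno = p.prodno
        GROUP BY p.product_line
        ORDER BY p.product_line
        """,
        (this_start, as_of, last_start, last_end),
    ).fetchall()


ytd = ytd_comparison(conn, "1996-06-30")
print(ytd)
```

Even this simple comparison needs a join, date arithmetic and conditional aggregation, which illustrates the point of the Kimball quote about the SQL skill such questions demand. The sketch does not handle an as-of date of February 29.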
What Is OLAP?
 Online Analytical Processing - coined by
E. F. Codd in a 1994 paper contracted by
Arbor Software*
 Generally synonymous with earlier terms such as
Decision Support, Business Intelligence, Executive
Information System
 OLAP = Multidimensional Database
 MOLAP: Multidimensional OLAP (Arbor Essbase,
Oracle Express)
 ROLAP: Relational OLAP (Informix MetaCube,
Microstrategy DSS Agent)
* Reference: http://www.arborsoft.com/essbase/wht_ppr/coddTOC.html

Strengths of OLAP

• It is a powerful visualization paradigm
• It provides fast, interactive response times
• It is good for analyzing time series
• It can be useful to find some clusters and outliers
• Many vendors offer OLAP tools
OLAP Is FASMI
 Fast
 Analysis
 Shared
 Multidimensional
 Information

Nigel Pendse, Richard Creeth - The OLAP Report
Multi-dimensional Data
 “Hey…I sold $100M worth of goods”

Dimensions: Product, Region, Time

Hierarchical summarization paths:
Product: Industry > Category > Product
Region: Country > Region > City > Office
Time: Year > Quarter > Month or Week > Day

[Figure: data cube with axes Product (Juice, Cola, Milk, Cream, Toothpaste, Soap), Region (N, S, W) and Month (1 to 7).]
A Visual Operation: Pivot (Rotate)
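[Figure: pivot (rotate). A cube with axes Product (Juice, Cola, Milk, Cream), Region (NY, LA, SF) and Date (3/1 to 3/4, within Month) is rotated so that a different pair of dimensions faces the viewer. Sample cell values shown: Juice 10, Cola 47, Milk 30, Cream 12.]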
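A pivot can be thought of as re-tabulating the same cells with a different pair of dimensions on the rows and columns. The sketch below is a plain-Python illustration; the crosstab helper and most cell values are assumptions (only the four NY values echo numbers visible in the figure, and their placement on 3/1 is a guess).

```python
from collections import defaultdict

# Each cell: (product, region, date, sales). Values are illustrative.
cells = [
    ("Juice", "NY", "3/1", 10),
    ("Cola", "NY", "3/1", 47),
    ("Milk", "NY", "3/1", 30),
    ("Cream", "NY", "3/1", 12),
    ("Juice", "LA", "3/2", 5),
    ("Juice", "NY", "3/2", 7),
]

POSITION = {"product": 0, "region": 1, "date": 2}


def crosstab(cells, row_dim, col_dim):
    """Sum sales into a {row: {column: total}} grid for two chosen dimensions."""
    grid = defaultdict(lambda: defaultdict(int))
    for cell in cells:
        grid[cell[POSITION[row_dim]]][cell[POSITION[col_dim]]] += cell[3]
    return {row: dict(cols) for row, cols in grid.items()}


by_product_date = crosstab(cells, "product", "date")      # original view
by_product_region = crosstab(cells, "product", "region")  # pivoted view
print(by_product_date["Juice"])
print(by_product_region["Juice"])
```

The underlying data does not change between the two calls; only the choice of which dimensions are laid out on the axes does.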
“Slicing and Dicing”

[Figure: "The Telecomm Slice". A cube with axes Product (Household, Telecomm, Video, Audio), Regions (Europe, Far East, India) and Sales Channel (Retail, Direct, Special); the slice fixes Product to Telecomm.]
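Slicing fixes one dimension to a single value, and dicing restricts several dimensions to subsets. A minimal sketch, assuming a cube stored as a dictionary keyed by (product, region, channel); the helper names and the cell values are hypothetical, while the dimension members come from the figure.

```python
DIMS = ("product", "region", "channel")

# Cube cells keyed by coordinates; sales figures are made up for illustration.
cube = {
    ("Telecomm", "Europe", "Retail"): 20,
    ("Telecomm", "India", "Direct"): 15,
    ("Video", "Europe", "Retail"): 8,
    ("Audio", "Far East", "Special"): 5,
    ("Household", "India", "Retail"): 12,
}


def slice_cube(cube, dim, value):
    """Fix one dimension to a value and drop it from the coordinates."""
    i = DIMS.index(dim)
    return {key[:i] + key[i + 1:]: v for key, v in cube.items() if key[i] == value}


def dice_cube(cube, **allowed):
    """Keep cells whose coordinates fall inside the allowed set per dimension."""
    return {
        key: v
        for key, v in cube.items()
        if all(key[DIMS.index(dim)] in values for dim, values in allowed.items())
    }


telecomm_slice = slice_cube(cube, "product", "Telecomm")
europe_dice = dice_cube(cube, product={"Telecomm", "Video"}, region={"Europe"})
print(telecomm_slice)
print(europe_dice)
```

The slice has one dimension fewer than the cube, whereas the dice keeps all three dimensions and simply contains fewer cells.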
Roll-up and Drill Down

Hierarchy, from higher level of aggregation (top) to low-level details (bottom). Roll up moves toward the top, drill-down moves toward the bottom:
• Sales Channel
• Region
• Country
• State
• Location Address
• Sales Representative
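Rolling up and drilling down amount to aggregating the same facts at a shallower or deeper level of the hierarchy. The sketch below uses the first four levels named on the slide; the aggregate helper, the row layout and the figures are assumptions for illustration.

```python
from collections import defaultdict

# First four hierarchy levels from the slide, highest aggregation first.
HIERARCHY = ("channel", "region", "country", "state")

# Illustrative detail rows.
rows = [
    {"channel": "Retail", "region": "Asia", "country": "India",
     "state": "Maharashtra", "sales": 100.0},
    {"channel": "Retail", "region": "Asia", "country": "India",
     "state": "Karnataka", "sales": 40.0},
    {"channel": "Retail", "region": "Europe", "country": "France",
     "state": "Ile-de-France", "sales": 60.0},
    {"channel": "Direct", "region": "Asia", "country": "India",
     "state": "Maharashtra", "sales": 25.0},
]


def aggregate(rows, depth):
    """Total sales grouped by the first `depth` levels of the hierarchy."""
    totals = defaultdict(float)
    for row in rows:
        key = tuple(row[level] for level in HIERARCHY[:depth])
        totals[key] += row["sales"]
    return dict(totals)


by_channel = aggregate(rows, 1)  # rolled up to the top level
by_country = aggregate(rows, 3)  # drilled down two levels
print(by_channel)
print(by_country)
```

Increasing depth is a drill-down and decreasing it is a roll-up; the totals at any level always add up to the totals of the level above.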
Nature of OLAP Analysis
 Aggregation -- (total sales,
percent-to-total)
 Comparison -- Budget vs.
Expenses
 Ranking -- Top 10, quartile
analysis
 Access to detailed and
aggregate data
 Complex criteria
specification
 Visualization
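Two of the analysis types listed above, percent-to-total aggregation and top-N ranking, are small computations once totals per member exist. A minimal sketch follows; the function names and the product totals are hypothetical.

```python
def percent_to_total(totals):
    """Express each member's total as a percentage of the grand total."""
    grand_total = sum(totals.values())
    return {member: 100.0 * value / grand_total for member, value in totals.items()}


def top_n(totals, n):
    """Rank members by total, highest first, and keep the first n."""
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:n]


# Illustrative sales totals per product.
product_totals = {"Juice": 50.0, "Cola": 30.0, "Milk": 20.0}

shares = percent_to_total(product_totals)
leaders = top_n(product_totals, 2)
print(shares)
print(leaders)
```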
Relational OLAP: 3 Tier DSS
Tier                    | Layer                   | Role
Data Warehouse          | Database Layer          | Store atomic data in industry standard RDBMS.
ROLAP Engine            | Application Logic Layer | Generate SQL execution plans in the ROLAP engine to obtain OLAP functionality.
Decision Support Client | Presentation Layer      | Obtain multi-dimensional reports from the DSS Client.
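The middle tier's job of turning a multidimensional request into SQL can be illustrated with a very small generator. This is only a sketch under stated assumptions: the fact table, its columns, the whitelists and the build_rollup_sql helper are hypothetical, and a real ROLAP engine does far more (joins, caching, plan selection).

```python
import sqlite3

# Whitelists prevent arbitrary strings from being spliced into the SQL text.
ALLOWED_DIMS = ("product", "region", "month")
ALLOWED_MEASURES = ("sales",)


def build_rollup_sql(dims, measure="sales"):
    """Translate a list of requested dimensions into one GROUP BY statement."""
    if not dims or any(d not in ALLOWED_DIMS for d in dims):
        raise ValueError(f"unknown dimension in {dims!r}")
    if measure not in ALLOWED_MEASURES:
        raise ValueError(f"unknown measure {measure!r}")
    cols = ", ".join(dims)
    return (f"SELECT {cols}, SUM({measure}) FROM fact "
            f"GROUP BY {cols} ORDER BY {cols}")


# The database layer: atomic data in an ordinary relational table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact (product TEXT, region TEXT, month TEXT, sales REAL)")
conn.executemany(
    "INSERT INTO fact VALUES (?, ?, ?, ?)",
    [
        ("Juice", "N", "Jan", 10.0),
        ("Juice", "S", "Jan", 5.0),
        ("Cola", "N", "Feb", 7.0),
    ],
)

sql = build_rollup_sql(["product"])
report = conn.execute(sql).fetchall()
print(sql)
print(report)
```

Nothing is precomputed here: every report is answered by running SQL against the atomic data at request time.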
MD-OLAP: 2 Tier DSS
Tier                    | Layer                                      | Role
MDDB Engine             | Database Layer and Application Logic Layer | Store atomic data in a proprietary data structure (MDDB), pre-calculate as many outcomes as possible, obtain OLAP functionality via proprietary algorithms running against this data.
Decision Support Client | Presentation Layer                         | Obtain multi-dimensional reports from the DSS Client.
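The idea of pre-calculating as many outcomes as possible can be sketched by materializing every combination of dimensions ahead of time. The code below is an illustration in plain Python, not a description of any vendor's proprietary structure; the dimension names, the "*" marker for "all values" and the sample facts are assumptions.

```python
from collections import defaultdict
from itertools import combinations

DIMS = ("product", "region", "month")


def precompute_cube(facts):
    """Aggregate every fact into all 2**len(DIMS) group-by combinations.

    A coordinate of "*" means the dimension has been summed over.
    """
    cube = defaultdict(float)
    positions = range(len(DIMS))
    for *coords, value in facts:
        for size in range(len(DIMS) + 1):
            for kept in combinations(positions, size):
                key = tuple(coords[i] if i in kept else "*" for i in positions)
                cube[key] += value
    return dict(cube)


# Illustrative atomic facts: (product, region, month, sales).
facts = [
    ("Juice", "N", "Jan", 10.0),
    ("Juice", "S", "Jan", 5.0),
    ("Cola", "N", "Feb", 7.0),
]

cube = precompute_cube(facts)
print(cube[("*", "*", "*")])      # grand total
print(cube[("Juice", "*", "*")])  # total for one product across all regions and months
```

Once the cube is built, any aggregate is a dictionary lookup, which is the source of the fast response times; the cost is that the number of stored cells grows quickly with the number of dimensions and members.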
