DataMining- Chapter2 - Data WareHouse

The document provides an overview of data warehousing and OLAP technologies, defining a data warehouse as a centralized repository for enterprise data that supports decision-making. It discusses the characteristics of data warehouses, including their subject-oriented, integrated, time-variant, and nonvolatile nature, as well as the differences between operational databases and data warehouses. Additionally, it covers multi-dimensional data models, schema designs like star and snowflake, and OLAP operations such as roll-up, drill-down, slice and dice, and pivot.
Chapter II

DATA WAREHOUSING AND OLAP TECHNOLOGIES

Data Warehousing and OLAP Technology: An Overview

 What is a data warehouse?
 A multi-dimensional data model
 Data warehouse architecture
 Data warehouse implementation

What is a Data Warehouse?
 Defined in many different ways, but not rigorously.
 A centralized location where all enterprise data, from different decentralized database systems and file systems, is stored for further analysis.
 A decision support database that is maintained separately from the organization’s operational database.
 Supports information processing by providing a solid platform of consolidated, historical data for analysis.
What is a Data Warehouse? ...
 It is a repository of multiple heterogeneous data sources organized under a unified schema at a single site to facilitate management decision making.
 “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.” (W. H. Inmon)
Data Warehouse: Subject-Oriented
 Organized around major subjects, such as customer, product, sales
 Provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process.
Data Warehouse: Integrated
 Constructed by integrating multiple, heterogeneous data sources
 relational databases, flat files, on-line transaction records
 Data cleaning and data integration techniques are applied.
 Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
 When data is moved to the warehouse, it is
Data Warehouse: Time-Variant
 The time horizon for the data warehouse is significantly longer than that of operational systems.
 Operational database: current value data
 Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)
 Every key structure in the data warehouse
 Contains an element of time, explicitly or implicitly
 But the key of operational data may or may not contain a “time element”
Data Warehouse: Nonvolatile
 A physically separate store of data transformed from the operational environment.
 Operational update of data does not occur in the data warehouse environment.
 Does not require transaction processing, recovery, and concurrency control mechanisms
 Requires only two operations in data accessing: initial loading of data and access of data
Data Warehouse vs. Operational DBMS
 OLTP (on-line transaction processing)
 The major task of online operational database systems is to perform online transaction and query processing.
 Traditional relational DBMS
 These systems are called OLTP systems.
 They cover most of the day-to-day operations of an organization such as purchasing, inventory, manufacturing, banking, payroll, registration, and accounting.
 OLAP (on-line analytical processing)
 Major task of data warehouse system
 Data analysis and decision making
Data Warehouse vs. Operational DBMS …
 Distinct features (OLTP vs. OLAP):
 User and system orientation: customer vs. market
 Data contents: current, detailed vs. historical, consolidated
 Database design (schema): ER + application vs. star + subject
 View: current, local vs. evolutionary, integrated
 Access patterns: update vs. read-only but complex queries
Data Warehousing and OLAP Technology: An Overview
 What is a data warehouse?
 A multi-dimensional data model
 Data warehouse architecture
 Data warehouse implementation
 From data warehousing to data mining

A multi-dimensional data model
 The entity-relationship data model is commonly used in the design of relational databases,
 where a database schema consists of a set of entities and the relationships between them.
 It is a two-dimensional data model: rows and columns.
 Such a data model is appropriate for online transaction processing.
 How do we model a system whose data has more than two dimensions?
A multi-dimensional data model …
 A data warehouse is based on a multidimensional data model which views data in the form of a data cube.
 Data Cube: A Multidimensional Data Model
 A data cube allows data to be modeled and viewed in multiple dimensions.
 It is defined by dimensions and facts.
 In general terms, dimensions are the entities with respect to which an organization wants to keep records.
A multi-dimensional data model …
 For example, ‘AllElectronics’ may create a sales data warehouse in order to keep records of the store’s sales with respect to the dimensions:
Time,
Item,
Branch, and
Location
 These dimensions allow the store to keep track of things like monthly sales of items and the branches and locations at which the items were sold.
 Each dimension may have a table associated with it, called a dimension table.
A multi-dimensional data model …
 For example, a dimension table for item may contain the attributes item name, brand, and type.
 A multidimensional data model is typically organized around a central theme (subject), such as sales. This theme is represented by a fact table.
 Facts are numeric measures.

In a data warehouse, fact tables (FTs) and dimension tables (DTs) are key components for business data analysis and decision making.

Fact table
 Is called the referencing table.
 It consists of foreign key references to dimension tables.
 An FT is a table in the database that stores quantitative data (facts) about a business process, usually numeric values such as sales revenue, quantity sold, or profit.
 Provides “how much”.
 The fact table contains the names of the facts, or measures, as well as keys to each of the related dimension tables.


Dimension tables
 A DT is a table in the database that stores descriptive attributes, often text values, that provide context to the data in the fact table.
 Provides the “who, what, where, and when” of the data.
 Called the referenced table.
 It has a primary key that is referenced by a foreign key in the fact table(s).
 Note: FTs contain numeric data, while DTs provide context and background information. Both types of tables are necessary for effective data analysis and decision making.
 FTs and DTs are related by primary and foreign keys.
A multi-dimensional data model …
 In data warehousing literature, an n-D base cube is called a base cuboid. Each cuboid represents a different level of summarization.
 The topmost 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid.
 The lattice of cuboids forms a data cube.
Cube: A Lattice of Cuboids
0-D (apex) cuboid: all
1-D cuboids: time; item; location; supplier
2-D cuboids: time,item; time,location; time,supplier; item,location; item,supplier; location,supplier
3-D cuboids: time,item,location; time,item,supplier; time,location,supplier; item,location,supplier
4-D (base) cuboid: time, item, location, supplier
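The lattice above can be enumerated mechanically: each cuboid is one subset of the dimension set. The following is a minimal sketch, not part of the original slides; the function name `cuboid_lattice` is an assumption made for illustration.

```python
from itertools import combinations

def cuboid_lattice(dimensions):
    """Return {level: [cuboids]}; a cuboid is a tuple of dimension names."""
    return {
        k: list(combinations(dimensions, k))
        for k in range(len(dimensions) + 1)
    }

# The four dimensions used in the lattice slide.
lattice = cuboid_lattice(["time", "item", "location", "supplier"])
for level, cuboids in lattice.items():
    print(f"{level}-D cuboids:", cuboids)

# Total number of cuboids for n dimensions without hierarchies is 2**n.
total = sum(len(cuboids) for cuboids in lattice.values())
print("total cuboids:", total)
```

Level 0 holds the single apex cuboid (the empty tuple, i.e. "all") and level 4 holds the base cuboid.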
Multidimensional Data
 Sales volume as a function of product, month, and region
Dimensions: Product, Location, Time
Hierarchical summarization paths:
Product: Industry > Category > Product
Location: Region > Country > City > Office
Time: Year > Quarter > Month or Week > Day
[Figure: cube with axes Product, Region, and Month]
A Sample Data Cube
[Figure: data cube with dimensions Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, PC, VCR, sum), and Country (U.S.A., Canada, Mexico, sum); the highlighted cell is the total annual sales of TV in U.S.A.]
Conceptual Modeling of Data Warehouses
 Modeling data warehouses: dimensions & measures
 Star schema: A fact table in the middle connected to a set of dimension tables. The fact table can be normalized, whereas the dimension tables are kept denormalized.
 Each dimension table has a primary key.
 All the keys of the dimension tables are associated with the fact table.
 The primary keys of the dimension tables are used as foreign keys in the fact table.
Example of Star Schema
Sales Fact Table
keys: time_key, item_key, branch_key, location_key
measures: units_sold, dollars_sold, avg_sales
Dimension tables
time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city, state_or_province, country
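To make the key relationships in the star schema concrete, here is a minimal sketch using Python's built-in `sqlite3` module. It keeps only two of the four dimensions and a few columns; the sample rows and the brand names are invented for illustration and do not come from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: each has its own primary key.
CREATE TABLE item (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT);
CREATE TABLE location (location_key INTEGER PRIMARY KEY, city TEXT, country TEXT);
-- Fact table: foreign keys to the dimensions plus numeric measures.
CREATE TABLE sales_fact (
    item_key INTEGER REFERENCES item(item_key),
    location_key INTEGER REFERENCES location(location_key),
    units_sold INTEGER,
    dollars_sold REAL
);
INSERT INTO item VALUES (1, 'TV', 'BrandA', 'electronics'), (2, 'PC', 'BrandB', 'electronics');
INSERT INTO location VALUES (10, 'Toronto', 'Canada'), (20, 'Chicago', 'U.S.A.');
INSERT INTO sales_fact VALUES (1, 10, 2, 800.0), (1, 20, 1, 400.0), (2, 20, 3, 3000.0);
""")

# One join per dimension gives the fact rows their descriptive context.
rows = conn.execute("""
SELECT i.item_name, l.country, SUM(f.dollars_sold)
FROM sales_fact f
JOIN item i ON f.item_key = i.item_key
JOIN location l ON f.location_key = l.location_key
GROUP BY i.item_name, l.country
ORDER BY i.item_name, l.country
""").fetchall()
print(rows)
```

The fact table answers "how much", and each join to a dimension table adds the "what" and "where".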
Conceptual Modeling of Data Warehouses
 Snowflake schema: A modification of the star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake.
 Reduces the effectiveness of browsing because more joins are needed.
 Affects system performance
 Less popular than the star schema
Example of Snowflake Schema
Sales Fact Table
keys: time_key, item_key, branch_key, location_key
measures: units_sold, dollars_sold, avg_sales
Dimension tables
time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_key
supplier: supplier_key, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city_key
city: city_key, city, state_or_province, country
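The cost of snowflaking shows up as an extra join. The sketch below, again with `sqlite3` and invented sample rows, normalizes supplier out of item as in the example above; reaching supplier_type from the fact table now takes two joins, where the star schema needs one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- supplier is split out of item (the snowflaked part of the schema).
CREATE TABLE supplier (supplier_key INTEGER PRIMARY KEY, supplier_type TEXT);
CREATE TABLE item (
    item_key INTEGER PRIMARY KEY,
    item_name TEXT,
    supplier_key INTEGER REFERENCES supplier(supplier_key)
);
CREATE TABLE sales_fact (item_key INTEGER REFERENCES item(item_key), dollars_sold REAL);
INSERT INTO supplier VALUES (1, 'wholesale'), (2, 'direct');
INSERT INTO item VALUES (100, 'TV', 1), (101, 'PC', 2), (102, 'VCR', 1);
INSERT INTO sales_fact VALUES (100, 400.0), (101, 1000.0), (102, 150.0), (100, 400.0);
""")

# fact -> item -> supplier: two joins to group by supplier_type.
by_supplier_type = conn.execute("""
SELECT s.supplier_type, SUM(f.dollars_sold)
FROM sales_fact f
JOIN item i ON f.item_key = i.item_key
JOIN supplier s ON i.supplier_key = s.supplier_key
GROUP BY s.supplier_type
ORDER BY s.supplier_type
""").fetchall()
print(by_supplier_type)
```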
Conceptual Modeling of Data Warehouses
 Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation.
 Multiple fact tables
 Collection of stars, hence galaxy schema
 It is complicated
Example of Fact Constellation
Sales Fact Table
keys: time_key, item_key, branch_key, location_key
measures: units_sold, dollars_sold, avg_sales
Shipping Fact Table
keys: time_key, item_key, shipper_key, from_location, to_location
measures: dollars_cost, units_shipped
Dimension tables
time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city, province_or_state, country
shipper: shipper_key, shipper_name, location_key, shipper_type
Cube Definition Syntax (BNF) in DMQL
 Cube Definition (Fact Table)
define cube <cube_name> [<dimension_list>]: <measure_list>
 Dimension Definition (Dimension Table)
define dimension <dimension_name> as (<attribute_or_subdimension_list>)
 Special Case (Shared Dimension Tables)
 First time as “cube definition”
 define dimension <dimension_name> as <dimension_name_first_time> in cube <cube_name_first_time>
Defining Star Schema in DMQL
define cube sales_star [time, item, branch, location]:
    dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state, country)
Defining Snowflake Schema in DMQL
define cube sales_snowflake [time, item, branch, location]:
    dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier(supplier_key, supplier_type))
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street,
Discussion
 Based on the star and snowflake definitions, how is a fact constellation defined in DMQL?
Defining Fact Constellation in DMQL
define cube sales [time, item, branch, location]:
    dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state, country)
define cube shipping [time, item, shipper, from_location, to_location]:
    dollar_cost = sum(cost_in_dollars), unit_shipped = count(*)
define dimension time as time in cube sales
define dimension item as item in cube sales
define dimension shipper as (shipper_key, shipper_name, location as location in cube sales, shipper_type)
define dimension from_location as location in cube sales
define dimension to_location as location in cube sales
OLAP Operations
 Roll up (drill-up)
 Drill down (roll down)
 Slice and dice
 Pivot (rotate)

Typical OLAP Operations
 Roll up (drill-up):
 Performs aggregation on a data cube (summarizes data)
 by climbing up a concept hierarchy or by dimension reduction
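Both forms of roll-up can be sketched on a cube stored as a plain dictionary. This is an illustrative sketch only: the city-to-country hierarchy, the cell values, and the function names are assumptions, not taken from the slides.

```python
from collections import defaultdict

# Hypothetical concept hierarchy for the location dimension: city -> country.
CITY_TO_COUNTRY = {"Toronto": "Canada", "Vancouver": "Canada", "Chicago": "U.S.A."}

# Cells of a small (city, item) cuboid with dollars_sold as the measure.
sales_by_city_item = {
    ("Toronto", "TV"): 800,
    ("Vancouver", "TV"): 300,
    ("Chicago", "TV"): 400,
    ("Chicago", "PC"): 3000,
}

def roll_up_hierarchy(cells, mapping):
    """Climb the hierarchy: replace each city by its country and re-aggregate."""
    out = defaultdict(int)
    for (city, item), value in cells.items():
        out[(mapping[city], item)] += value
    return dict(out)

def roll_up_drop_dimension(cells, axis):
    """Dimension reduction: remove one dimension from the key and re-aggregate."""
    out = defaultdict(int)
    for key, value in cells.items():
        out[key[:axis] + key[axis + 1:]] += value
    return dict(out)

by_country_item = roll_up_hierarchy(sales_by_city_item, CITY_TO_COUNTRY)
by_item = roll_up_drop_dimension(sales_by_city_item, axis=0)
print(by_country_item)
print(by_item)
```

Drill-down runs in the opposite direction and needs the more detailed cells to be available, since they cannot be recovered from the totals.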
Typical OLAP Operations
 Drill down (roll down): reverse of roll-up
 from higher-level summary to lower-level summary or detailed data, or introducing new dimensions
Typical OLAP Operations
 Slice and dice: project and select (a slice selects on one dimension; a dice selects on two or more dimensions)
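A minimal sketch of slice and dice on a dictionary-based cube follows. The dimension order, the cell values, and the helper names `slice_cube` and `dice_cube` are assumptions for illustration; the sketch reads a slice as a selection on one dimension and a dice as a selection on several.

```python
DIMS = ("quarter", "item", "country")

# Hypothetical cube cells keyed by (quarter, item, country).
cube = {
    ("Q1", "TV", "Canada"): 100,
    ("Q1", "PC", "Canada"): 200,
    ("Q2", "TV", "Canada"): 150,
    ("Q1", "TV", "U.S.A."): 400,
    ("Q2", "PC", "U.S.A."): 900,
}

def slice_cube(cells, dim, value):
    """Fix one dimension to a single value and drop it from the key."""
    axis = DIMS.index(dim)
    return {
        key[:axis] + key[axis + 1:]: v
        for key, v in cells.items()
        if key[axis] == value
    }

def dice_cube(cells, criteria):
    """Keep cells whose value on every listed dimension is in the allowed set."""
    return {
        key: v
        for key, v in cells.items()
        if all(key[DIMS.index(d)] in allowed for d, allowed in criteria.items())
    }

q1_slice = slice_cube(cube, "quarter", "Q1")
tv_canada_dice = dice_cube(cube, {"item": {"TV"}, "country": {"Canada"}})
print(q1_slice)
print(tv_canada_dice)
```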
Typical OLAP Operations
 Pivot (rotate):
 Reorient the cube for visualization, e.g., turn a 3D cube into a series of 2D planes
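Pivoting changes only the orientation of the view, never the measures. The sketch below shows both readings from the slide under assumed data: swapping the axes of a 2D view, and cutting a 3D cube into 2D planes along one axis. The function names are hypothetical.

```python
def pivot(cells):
    """Swap the two axes of a 2D view; the measures are unchanged."""
    return {(col, row): v for (row, col), v in cells.items()}

def planes(cells3d, axis):
    """Split a 3D cube into a series of 2D planes along the given axis."""
    out = {}
    for key, v in cells3d.items():
        rest = key[:axis] + key[axis + 1:]
        out.setdefault(key[axis], {})[rest] = v
    return out

# Hypothetical (item, quarter) view.
table = {("TV", "Q1"): 100, ("TV", "Q2"): 150, ("PC", "Q1"): 200}
rotated = pivot(table)

# Hypothetical (quarter, item, country) cube, one plane per country.
cube3 = {("Q1", "TV", "Canada"): 1, ("Q1", "PC", "U.S.A."): 2, ("Q2", "TV", "Canada"): 3}
per_country = planes(cube3, axis=2)
print(rotated)
print(per_country)
```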
Typical OLAP Operations: All in One

DW Architecture
 The bottom tier is a warehouse database server that is almost always a relational database system.
 Back-end tools and utilities are used to feed data into the bottom tier from operational databases or other external sources.
 These tools and utilities perform data extraction, cleaning, and transformation (e.g., to merge similar data from different sources into a unified format), as well as load and refresh functions to update the data warehouse.
 This tier also contains a metadata repository, which stores information about the data warehouse and its contents.
DW Architecture …
 The middle tier is an OLAP server that is typically implemented using either
 (1) a relational OLAP (ROLAP) model (i.e., an extended relational DBMS that maps operations on multidimensional data to standard relational operations); or
 (2) a multidimensional OLAP (MOLAP) model (i.e., a special-purpose server that directly implements multidimensional data and operations).
DW Architecture …
 The top tier is a front-end client layer, which contains
 query and reporting tools,
 analysis tools, and/or
 data mining tools (e.g., trend analysis, prediction, and so on).
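Data Warehouse: A Multi-Tiered Architecture
[Figure: four stages from left to right. Data Sources: operational DBs and other sources. Data Storage: data is extracted, transformed, loaded, and refreshed into the data warehouse and data marts, alongside a monitor & integrator and a metadata repository. OLAP Engine: an OLAP server serves the stored data. Front-End Tools: analysis, query, reports, and data mining.]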
Data Warehouse Back-End Tools and Utilities
 Data extraction
 get data from multiple, heterogeneous, and external sources
 Data cleaning
 detect errors in the data and rectify them when possible
 Data transformation
 convert data from legacy or host format to warehouse format
 Load
 sort, summarize, consolidate, compute views, check integrity, and build indices and partitions
 Refresh
 propagate the updates from the data sources to the warehouse
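The extract, clean, transform, and load steps can be sketched end to end with the standard library. Everything specific here is an assumption for illustration: the two sources, the column names, the region code table, and the choice to drop records with a missing amount instead of repairing them. Refresh is not shown.

```python
import csv
import io
import sqlite3

# Two hypothetical heterogeneous sources: a flat file and operational records.
flat_file = io.StringIO("cust,region,amount\nC1,asia,10.5\nC2,EUROPE,\nC3,Asia,4.5\n")
operational_rows = [{"customer": "C4", "region_code": "AM", "amount_cents": 1200}]
REGION_CODES = {"AM": "America", "AS": "Asia", "EU": "Europe"}

def extract():
    """Extraction: pull records from every source, tagged with their origin."""
    yield from (("file", rec) for rec in csv.DictReader(flat_file))
    yield from (("oltp", rec) for rec in operational_rows)

def clean_and_transform(source, rec):
    """Cleaning and transformation into one warehouse format (cust, region, amount)."""
    if source == "file":
        if not rec["amount"]:
            return None  # cleaning: this sketch simply drops a missing measure
        return (rec["cust"], rec["region"].capitalize(), float(rec["amount"]))
    return (rec["customer"], REGION_CODES[rec["region_code"]], rec["amount_cents"] / 100)

def load(rows):
    """Load: insert the rows and build an index."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (cust TEXT, region TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.execute("CREATE INDEX idx_region ON sales(region)")
    return conn

rows = [t for t in (clean_and_transform(s, r) for s, r in extract()) if t is not None]
warehouse = load(rows)
totals = warehouse.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(totals)
```

Note how the transformation step reconciles inconsistent region encodings ("asia", "Asia", "AM") into one naming convention, which is the integration property described earlier.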
Metadata Repository
 Metadata is the data defining warehouse objects. It stores:
 Description of the structure of the data warehouse
schema, view, dimensions, hierarchies, derived data definitions, data mart locations
 The algorithms used for summarization
 The mapping from the operational environment to the data warehouse
 Data related to system performance
warehouse schema, view and derived data definitions
 Business data
business terms and definitions, ownership of data, charging policies
DW implementation
Efficient computation of data cube
 Compute cube
Indexing OLAP data
 Bitmap indexing
 Join indexing
Efficient processing of OLAP queries

Compute cube
 Compute cube operator
Data can be viewed as a lattice of cuboids.
The bottommost cuboid is the base cuboid.
The topmost cuboid (apex) contains only one cell.
The total number of cuboids, or group-by’s, that can be computed for a data cube with n dimensions is 2^n.
If the dimensions are item, city, and year, there are 2^3 = 8.
 Queries: “compute sum of sales group by city”
“Compute sum of sales group by city and item”
DW implementation
The compute cube operator computes aggregates over all subsets of the dimensions specified in the operation.
The possible group-by’s are {(city, item, year), (city, item), (city, year), (item, year), (city), (item), (year), ()}
Cube definition and computation in DMQL:
define cube sales [item, city, year]: sum(sales_in_dollars)
After defining the cube (above), we can compute it:
compute cube sales
Lattice of group-by’s:
()
(city) (item) (year)
(city, item) (city, year) (item, year)
(city, item, year)
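What a compute cube operation produces can be sketched naively in Python: one group-by per subset of the dimensions. The base rows and the name `compute_cube` are assumptions for illustration, and the sketch rescans the rows once per group-by, which real systems avoid.

```python
from collections import defaultdict
from itertools import combinations

CUBE_DIMS = ("item", "city", "year")

# Hypothetical base-table rows: (item, city, year, sales_in_dollars).
base_rows = [
    ("TV", "Toronto", 2023, 100),
    ("TV", "Chicago", 2023, 200),
    ("PC", "Toronto", 2024, 300),
]

def compute_cube(rows, dims):
    """Return {group-by dimensions: {group key: sum of the measure}}."""
    cube = {}
    for k in range(len(dims) + 1):
        for group in combinations(range(len(dims)), k):
            agg = defaultdict(int)
            for row in rows:
                # The measure is the last field of each row.
                agg[tuple(row[i] for i in group)] += row[-1]
            cube[tuple(dims[i] for i in group)] = dict(agg)
    return cube

sales_cube = compute_cube(base_rows, CUBE_DIMS)
print(len(sales_cube), "group-by's")
print(sales_cube[("city",)])
```

The entry for the empty tuple is the apex cuboid with its single cell, and the entry for all three dimensions is the base cuboid.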

 Transform it into a SQL-like language (with a new operator, cube by, introduced by Gray et al. ’96)
SELECT item, city, year, SUM (amount)
FROM SALES
CUBE BY item, city, year

Indexing OLAP Data: Bitmap Index
 Index on a particular column
 Each value in the column has a bit vector.
 The length of the bit vector: # of records in the base table

Base table
Cust  Region   Type
C1    Asia     Retail
C2    Europe   Dealer
C3    Asia     Dealer
C4    America  Retail
C5    Europe   Dealer

Index on Region
RecID  Asia  Europe  America
1      1     0       0
2      0     1       0
3      1     0       0
4      0     0       1
5      0     1       0

Index on Type
RecID  Retail  Dealer
1      1       0
2      0       1
3      0       1
4      1       0
5      0       1
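The bitmap index for the base table above can be sketched with Python integers used as bit vectors (bit i stands for the record with RecID i + 1). The helper names are assumptions for illustration; the point is that a query on two columns becomes a bitwise AND.

```python
# Base table from the slide: (Cust, Region, Type).
base_table = [
    ("C1", "Asia", "Retail"),
    ("C2", "Europe", "Dealer"),
    ("C3", "Asia", "Dealer"),
    ("C4", "America", "Retail"),
    ("C5", "Europe", "Dealer"),
]

def build_bitmap_index(rows, column):
    """Map each distinct value to an int whose bit i is set if row i has that value."""
    index = {}
    for position, row in enumerate(rows):
        index[row[column]] = index.get(row[column], 0) | (1 << position)
    return index

def matching_rec_ids(bits):
    """Return the 1-based RecIDs of the set bits."""
    return [i + 1 for i in range(bits.bit_length()) if (bits >> i) & 1]

region_index = build_bitmap_index(base_table, 1)
type_index = build_bitmap_index(base_table, 2)

# Region = Asia AND Type = Dealer is a single bitwise AND of two vectors.
asia_dealers = matching_rec_ids(region_index["Asia"] & type_index["Dealer"])
print(asia_dealers)
```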
Indexing OLAP Data: Join Indices
 In data warehouses, a join index relates the values of the dimensions of a star schema to rows in the fact table.
 E.g., fact table Sales and two dimensions, city and product
 A join index on city maintains, for each distinct city, a list of R-IDs of the tuples recording the sales in that city
 Join indices can span multiple dimensions
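A join index of this kind can be sketched as a mapping from dimension values to lists of fact-row identifiers. The Sales rows, the R-ID labels, and the function name below are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical Sales fact rows: (R-ID, city, product, dollars_sold).
sales = [
    ("R1", "Toronto", "TV", 400),
    ("R2", "Chicago", "PC", 1000),
    ("R3", "Toronto", "PC", 900),
    ("R4", "Toronto", "TV", 450),
]

def build_join_index(fact_rows, *columns):
    """Map each combination of dimension values to the R-IDs of matching fact rows."""
    index = defaultdict(list)
    for row in fact_rows:
        index[tuple(row[c] for c in columns)].append(row[0])
    return dict(index)

# Join index on city alone, and one spanning two dimensions (city, product).
city_index = build_join_index(sales, 1)
city_product_index = build_join_index(sales, 1, 2)
print(city_index[("Toronto",)])
print(city_product_index[("Toronto", "TV")])
```

With the index in place, finding the sales for a city is a lookup followed by fetching the listed rows, with no join computed at query time.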
Join Indices: Example

Data Warehouse Usage
 Three kinds of data warehouse applications
 Information processing
supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts and graphs
 Analytical processing
multidimensional analysis of data warehouse data
supports basic OLAP operations, slice-dice, drilling, pivoting
 Data mining
knowledge discovery from hidden patterns
supports associations, constructing analytical models, performing classification and prediction, and