Data Warehouse Models and OLAP Operations: Enrico Franconi

The document discusses data warehouse models and OLAP operations. It describes a three-tier architecture for decision support systems, with a data warehouse database server, OLAP servers, and clients. It also discusses dimensional modeling approaches like star schemas that organize data around measures and dimensions. Relational and multidimensional OLAP techniques are presented for implementing dimensional models and supporting analytical queries in data warehouses.

Uploaded by

Barty Waine
Data Warehouse Models and OLAP Operations

Enrico Franconi
CS 636
Data Warehouse Architecture

CS 336 2
Decision Support
• Information technology to help the knowledge worker (executive, manager, analyst) make faster and better decisions
  - “What were the sales volumes by region and product category for the last year?”
  - “How did the share price of computer manufacturers correlate with quarterly profits over the past 10 years?”
  - “Which orders should we fill to maximize revenues?”
• On-line analytical processing (OLAP) is an element of decision support systems (DSS)

Three-Tier Decision Support Systems
• Warehouse database server
  - Almost always a relational DBMS, rarely flat files
• OLAP servers
  - Relational OLAP (ROLAP): extended relational DBMS that maps operations on multidimensional data to standard relational operators
  - Multidimensional OLAP (MOLAP): special-purpose server that directly implements multidimensional data and operations
• Clients
  - Query and reporting tools
  - Analysis tools
  - Data mining tools

The Complete Decision Support System
[Figure: architecture diagram with four columns. Information Sources (semistructured sources, operational DBs) feed the Data Warehouse Server (Tier 1) through extract, transform, load, refresh, etc. The data warehouse and the data marts serve the OLAP Servers (Tier 2; e.g., MOLAP, ROLAP), which serve the Clients (Tier 3): analysis, query/reporting, data mining.]

Data Warehouse vs. Data Marts
• Enterprise warehouse: collects all information about subjects (customers, products, sales, assets, personnel) that span the entire organization
  - Requires extensive business modeling (may take years to design and build)
• Data marts: departmental subsets that focus on selected subjects
  - Marketing data mart: customer, product, sales
  - Faster roll-out, but complex integration in the long run
• Virtual warehouse: views over operational databases
  - Materialize selected summary views for efficient query processing
  - Easy to build, but requires excess capacity on operational database servers

Approaches to OLAP Servers
• Relational DBMS as warehouse servers
• Two possibilities for OLAP servers
• (1) Relational OLAP (ROLAP)
  - Relational and specialized relational DBMS to store and manage warehouse data
  - OLAP middleware to support missing pieces
• (2) Multidimensional OLAP (MOLAP)
  - Array-based storage structures
  - Direct access to array data structures

OLAP Server: Query Engine Requirements
• Aggregates (maintenance and querying)
  - Decide what to precompute and when
• Query language to support multidimensional operations
  - Standard SQL falls short
• Scalable query processing
  - Data-intensive and data-selective queries

OLAP for Decision Support
• OLAP = Online Analytical Processing
• Support (almost) ad-hoc querying for business analysts
• Think in terms of spreadsheets
  - View sales data by geography, time, or product
• Extend the spreadsheet analysis model to work with warehouse data
  - Large data sets
  - Semantically enriched to understand business terms
  - Combine interactive queries with reporting functions
• Multidimensional view of data is the foundation of OLAP
  - Data model, operations, etc.

Warehouse Models & Operators
• Data Models
  - relations
  - stars & snowflakes
  - cubes
• Operators
  - slice & dice
  - roll-up, drill down
  - pivoting
  - other

Multi-Dimensional Data
• Measures: numerical data being tracked
• Dimensions: business parameters that define a transaction
• Example: an analyst may want to view sales data (measure) by geography, by time, and by product (dimensions)
• Dimensional modeling is a technique for structuring data around the business concepts
  - ER models describe “entities” and “relationships”
  - Dimensional models describe “measures” and “dimensions”

The Multi-Dimensional Model
“Sales by product line over the past six months”
“Sales by store between 1990 and 1995”
[Figure: a fact table for measures, with key columns (Prod Code, Time Code, Store Code) joining the fact table to the dimension tables (Store Info, Product Info, Time Info, ...) and numerical measures (Sales, Qty).]

Dimensional Modeling
• Dimensions are organized into hierarchies
  - E.g., Time dimension: days → weeks → quarters
  - E.g., Product dimension: product → product line → brand
• Dimensions have attributes

Dimension Hierarchies

Store Dimension    Product Dimension
Total              Total
Region             Manufacturer
District           Brand
Stores             Products

ROLAP: Dimensional Modeling Using Relational DBMS
• Special schema design: star, snowflake
• Special indexes: bitmap, multi-table join
• Special tuning: maximize query throughput
• Proven technology (relational model, DBMS), tends to outperform specialized MDDB, especially on large data sets
• Products
  - IBM DB2, Oracle, Sybase IQ, RedBrick, Informix

MOLAP: Dimensional Modeling Using the Multi Dimensional Model
• MDDB: a special-purpose data model
• Facts stored in multi-dimensional arrays
• Dimensions used to index array
• Sometimes on top of relational DB
• Products
  - Pilot, Arbor Essbase, Gentia

Star Schema (in RDBMS)

Star Schema Example

Star Schema with Sample Data

The “Classic” Star Schema
• A single fact table, with detail and summary data
• Fact table primary key has only one key column per dimension
• Each key is generated
• Each dimension is a single table, highly denormalized

[Figure: star schema with one fact table and three dimension tables]
Fact Table: STORE KEY, PRODUCT KEY, PERIOD KEY, Dollars, Units, Price
Store Dimension: STORE KEY, Store Description, City, State, District ID, District Desc., Region_ID, Region Desc., Regional Mgr., Level
Time Dimension: PERIOD KEY, Period Desc, Year, Quarter, Month, Day, Current Flag, Resolution, Sequence
Product Dimension: PRODUCT KEY, Product Desc., Brand, Color, Size, Manufacturer, Level

Benefits: Easy to understand, easy to define hierarchies, reduces the number of physical joins, low maintenance, very simple metadata
Drawbacks: Summary data in the fact table yields poorer performance for summary levels; huge dimension tables are a problem

The “Classic” Star Schema
The biggest drawback: dimension tables must carry a level indicator for every record, and every query must use it. In the example below, without the level constraint, keys for all stores in the NORTH region, including aggregates for region and district, will be pulled from the fact table, resulting in error.

[Figure: the same star schema (Store, Time and Product Dimensions around the Fact Table), with a Level column in the Store and Product Dimensions.]

Example:
SELECT A.STORE_KEY, A.PERIOD_KEY, A.dollars
FROM Fact_Table A
WHERE A.STORE_KEY IN (SELECT STORE_KEY
                      FROM Store_Dimension B
                      WHERE region = 'North' AND Level = 2)
AND etc...

Level is needed whenever aggregates are stored with detail facts.

The “Level” Problem
• Level is a problem because it creates potential for error. If the query builder, human or program, forgets about it, perfectly reasonable-looking WRONG answers can occur.
• One alternative: the FACT CONSTELLATION model...

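The “Level” problem can be reproduced on a toy star schema. The following is only a sketch using Python's built-in sqlite3 module. The rows, the column names other than those on the slides, and the meaning of the Level codes (2 = individual store, 0 = region aggregate) are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Store_Dimension (STORE_KEY INTEGER PRIMARY KEY, description TEXT,
                              region TEXT, Level INTEGER);
CREATE TABLE Fact_Table (STORE_KEY INTEGER, PERIOD_KEY INTEGER, dollars REAL);
-- Hypothetical rows: Level 2 = individual store, Level 0 = region aggregate.
INSERT INTO Store_Dimension VALUES (1, 'Store 1', 'North', 2),
                                   (2, 'Store 2', 'North', 2),
                                   (3, 'North total', 'North', 0);
-- The aggregate row (key 3) is stored next to the detail rows (keys 1 and 2).
INSERT INTO Fact_Table VALUES (1, 1, 100.0), (2, 1, 50.0), (3, 1, 150.0);
""")

QUERY = """
SELECT SUM(A.dollars) FROM Fact_Table A
WHERE A.STORE_KEY IN (SELECT STORE_KEY FROM Store_Dimension B
                      WHERE region = 'North' {level})
"""
# With the level constraint only the two stores are summed.
with_level = conn.execute(QUERY.format(level="AND Level = 2")).fetchone()[0]
# Without it the region aggregate is pulled in too, so sales are counted twice.
without_level = conn.execute(QUERY.format(level="")).fetchone()[0]
print(with_level, without_level)
```

Both queries look reasonable, but only the first returns the true North total for this toy data.
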
The “Fact Constellation” Schema
[Figure: the star schema extended with separate aggregate fact tables]
Fact Table: STORE KEY, PRODUCT KEY, PERIOD KEY, Dollars, Units, Price
Store Dimension: STORE KEY, Store Description, City, State, District ID, District Desc., Region_ID, Region Desc., Regional Mgr.
Time Dimension: PERIOD KEY, Period Desc, Year, Quarter, Month, Day, Current Flag, Sequence
Product Dimension: PRODUCT KEY, Product Desc., Brand, Color, Size, Manufacturer
District Fact Table: District_ID, PRODUCT_KEY, PERIOD_KEY, Dollars, Units, Price
Region Fact Table: Region_ID, PRODUCT_KEY, PERIOD_KEY, Dollars, Units, Price

The “Fact Constellation” Schema
In the Fact Constellation, aggregate tables are created separately from the detail; therefore it is impossible to pick up, for example, Store detail when querying the District Fact Table.

[Figure: the fact constellation schema (Fact Table, Store, Time and Product Dimensions, District Fact Table, Region Fact Table).]

Major Advantage: No need for the “Level” indicator in the dimension tables, since no aggregated data is stored with lower-level detail

Disadvantage: Dimension tables are still very large in some cases, which can slow performance; the front-end must be able to detect the existence of aggregate facts, which requires more extensive metadata

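One way to picture the separate aggregate tables is to derive a District Fact Table from the detail facts with a GROUP BY. This is a sketch with sqlite3, not the slides' own load procedure. The rows are invented, and the Price column is left out because a price does not sum meaningfully.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Store_Dimension (STORE_KEY INTEGER PRIMARY KEY,
                              District_ID INTEGER, Region_ID INTEGER);
CREATE TABLE Fact_Table (STORE_KEY INTEGER, PRODUCT_KEY INTEGER,
                         PERIOD_KEY INTEGER, Dollars REAL, Units INTEGER);
-- Hypothetical data: stores 1 and 2 are in district 10, store 3 in district 20.
INSERT INTO Store_Dimension VALUES (1, 10, 100), (2, 10, 100), (3, 20, 100);
INSERT INTO Fact_Table VALUES (1, 7, 1, 100.0, 4), (2, 7, 1, 50.0, 2),
                              (3, 7, 1, 30.0, 1);
-- Aggregates live in their own table, keyed by District_ID, not STORE_KEY.
CREATE TABLE District_Fact_Table AS
  SELECT S.District_ID, F.PRODUCT_KEY, F.PERIOD_KEY,
         SUM(F.Dollars) AS Dollars, SUM(F.Units) AS Units
  FROM Fact_Table F JOIN Store_Dimension S ON S.STORE_KEY = F.STORE_KEY
  GROUP BY S.District_ID, F.PRODUCT_KEY, F.PERIOD_KEY;
""")

district_rows = conn.execute(
    "SELECT District_ID, Dollars, Units FROM District_Fact_Table "
    "ORDER BY District_ID").fetchall()
print(district_rows)
```

Because the District Fact Table has no STORE_KEY column, a query against it cannot pick up store detail by accident, which is why no Level indicator is needed.
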
Another Alternative to “Level”
• Fact Constellation is a good alternative to the Star, but when dimensions have very high cardinality, the sub-selects in the dimension tables can be a source of delay.
• An alternative is to normalize the dimension tables by attribute level, with each smaller dimension table pointing to an appropriate aggregated fact table: the “Snowflake Schema” ...

The “Snowflake” Schema
[Figure: the Store Dimension normalized by level, each level pointing to its own fact table]
Store Dimension: STORE KEY, Store Description, City, State, District ID, District Desc., Region_ID, Region Desc., Regional Mgr.
District table: District_ID, District Desc., Region_ID
Region table: Region_ID, Region Desc., Regional Mgr.
Store Fact Table: STORE KEY, PRODUCT KEY, PERIOD KEY, Dollars, Units, Price
District Fact Table: District_ID, PRODUCT_KEY, PERIOD_KEY, Dollars, Units, Price
Region Fact Table: Region_ID, PRODUCT_KEY, PERIOD_KEY, Dollars, Units, Price

The “Snowflake” Schema
• No LEVEL in dimension tables
• Dimension tables are normalized by decomposing at the attribute level
• Each dimension table has one key for each level of the dimension's hierarchy
• The lowest level key joins the dimension table to both the fact table and the lower level attribute table

[Figure: the snowflake schema (Store Dimension, District and Region tables, Store, District and Region Fact Tables).]

How does it work? The best way is for the query to be built by understanding which summary levels exist, and finding the proper snowflaked attribute tables, constraining there for keys, then selecting from the fact table.

The “Snowflake” Schema
• Additional features: The original Store Dimension table, completely de-normalized, is kept intact, since certain queries can benefit from its all-encompassing content.
• In practice, start with a Star Schema and create the “snowflakes” with queries. This eliminates the need to create separate extracts for each table, and referential integrity is inherited from the dimension table.

[Figure: the snowflake schema (Store Dimension, District and Region tables, Store, District and Region Fact Tables).]

Advantage: Best performance when queries involve aggregation

Disadvantage: Complicated maintenance and metadata, explosion in the number of tables in the database

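The two snowflake ideas above (create the “snowflakes” with queries, then constrain on the attribute table before selecting from the matching fact table) might look as follows. This is a sketch with sqlite3 under assumed data. The table name Region and the underscore column names are illustrative choices, not taken from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Store_Dimension (STORE_KEY INTEGER PRIMARY KEY, Region_ID INTEGER,
                              Region_Desc TEXT, Regional_Mgr TEXT);
CREATE TABLE Region_Fact_Table (Region_ID INTEGER, PRODUCT_KEY INTEGER,
                                PERIOD_KEY INTEGER, Dollars REAL);
-- Hypothetical rows.
INSERT INTO Store_Dimension VALUES (1, 100, 'North', 'Mgr A'),
                                   (2, 100, 'North', 'Mgr A'),
                                   (3, 200, 'South', 'Mgr B');
INSERT INTO Region_Fact_Table VALUES (100, 7, 1, 180.0), (100, 8, 1, 20.0),
                                     (200, 7, 1, 75.0);
-- "Snowflake" the region level out of the denormalized dimension with a query.
CREATE TABLE Region AS
  SELECT DISTINCT Region_ID, Region_Desc, Regional_Mgr FROM Store_Dimension;
""")

region_rows = conn.execute(
    "SELECT Region_ID, Region_Desc FROM Region ORDER BY Region_ID").fetchall()

# Constrain on the small attribute table for keys, then select from the
# fact table that already holds data at that summary level.
north_dollars = conn.execute("""
SELECT SUM(F.Dollars) FROM Region_Fact_Table F
WHERE F.Region_ID IN (SELECT Region_ID FROM Region WHERE Region_Desc = 'North')
""").fetchone()[0]
print(region_rows, north_dollars)
```

The sub-select now scans one row per region instead of one row per store, which is the delay the snowflake is meant to avoid.
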
Advantages of ROLAP Dimensional Modeling
• Define complex, multi-dimensional data with a simple model
• Reduces the number of joins a query has to process
• Allows the data warehouse to evolve with relatively low maintenance
• HOWEVER! Star schema and relational DBMS are not the magic solution
  - Query optimization is still problematic

Aggregates
• Add up amounts for day 1
• In SQL: SELECT sum(amt) FROM SALE WHERE date = 1

Table sale:
prodId  storeId  date  amt
p1      s1       1     12
p2      s1       1     11
p1      s3       1     50
p2      s2       1      8
p1      s1       2     44
p1      s2       2      4

Result: 81

Aggregates
• Add up amounts by day
• In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date

Table sale:
prodId  storeId  date  amt
p1      s1       1     12
p2      s1       1     11
p1      s3       1     50
p2      s2       1      8
p1      s1       2     44
p1      s2       2      4

Table ans:
date  sum
1     81
2     48

Another Example
• Add up amounts by day, product
• In SQL: SELECT prodId, date, sum(amt) FROM SALE GROUP BY date, prodId

Table sale:
prodId  storeId  date  amt
p1      s1       1     12
p2      s1       1     11
p1      s3       1     50
p2      s2       1      8
p1      s1       2     44
p1      s2       2      4

Result:
prodId  date  amt
p1      1     62
p2      1     19
p1      2     48

Rollup goes from the sale table to the aggregated result; drill-down goes back from the result to the detail.

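The three aggregate queries can be checked against the sale table from the slides. This is a small sketch using Python's built-in sqlite3 module. The ORDER BY clauses are additions so that the row order is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE SALE (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER)")
conn.executemany("INSERT INTO SALE VALUES (?, ?, ?, ?)", [
    ("p1", "s1", 1, 12), ("p2", "s1", 1, 11), ("p1", "s3", 1, 50),
    ("p2", "s2", 1, 8), ("p1", "s1", 2, 44), ("p1", "s2", 2, 4),
])

# Add up amounts for day 1.
day1_total = conn.execute(
    "SELECT sum(amt) FROM SALE WHERE date = 1").fetchone()[0]

# Add up amounts by day.
by_day = conn.execute(
    "SELECT date, sum(amt) FROM SALE GROUP BY date ORDER BY date").fetchall()

# Add up amounts by day and product (prodId must appear in the SELECT list
# for the result rows to be identifiable).
by_day_product = conn.execute(
    "SELECT prodId, date, sum(amt) FROM SALE "
    "GROUP BY date, prodId ORDER BY date, prodId").fetchall()

print(day1_total, by_day, by_day_product)
```
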
Aggregates
• Operators: sum, count, max, min, median, avg
• “Having” clause
• Using dimension hierarchy
  - average by region (within store)
  - maximum by month (within date)

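A HAVING clause and an aggregate along a dimension hierarchy can be sketched on the same sale table. The STORE table below is an assumed helper holding one level of the store hierarchy (store s1 in region A; stores s2 and s3 in region B). “Average by region” is read here as the average sale amount per region, which is only one possible reading of the bullet.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE SALE (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER);
INSERT INTO SALE VALUES ('p1','s1',1,12), ('p2','s1',1,11), ('p1','s3',1,50),
                        ('p2','s2',1,8), ('p1','s1',2,44), ('p1','s2',2,4);
-- Assumed store-to-region mapping (one level of the store hierarchy).
CREATE TABLE STORE (storeId TEXT PRIMARY KEY, region TEXT);
INSERT INTO STORE VALUES ('s1','A'), ('s2','B'), ('s3','B');
""")

# HAVING filters groups after aggregation: keep days whose total exceeds 50.
busy_days = conn.execute(
    "SELECT date, sum(amt) FROM SALE GROUP BY date "
    "HAVING sum(amt) > 50").fetchall()

# Aggregate along the hierarchy: join each sale to its region, then group.
avg_by_region = conn.execute(
    "SELECT region, avg(amt) FROM SALE JOIN STORE USING (storeId) "
    "GROUP BY region ORDER BY region").fetchall()

print(busy_days, avg_by_region)
```
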
ROLAP vs. MOLAP
• ROLAP: Relational On-Line Analytical Processing
• MOLAP: Multi-Dimensional On-Line Analytical Processing

The MOLAP Cube

Fact table view (table sale):
prodId  storeId  amt
p1      s1       12
p2      s1       11
p1      s3       50
p2      s2        8

Multi-dimensional cube:
      s1   s2   s3
p1    12        50
p2    11    8

dimensions = 2

3-D Cube

Fact table view (table sale):
prodId  storeId  date  amt
p1      s1       1     12
p2      s1       1     11
p1      s3       1     50
p2      s2       1      8
p1      s1       2     44
p1      s2       2      4

Multi-dimensional cube:
day 1   s1   s2   s3
p1      12        50
p2      11    8
day 2   s1   s2   s3
p1      44    4
p2

dimensions = 3

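The step from the fact table view to the cube view can be sketched with nested Python lists standing in for a MOLAP array. The layout cube[day][product][store] and the use of None for empty cells are illustrative choices, not how any particular MOLAP server stores data.

```python
# Fact table rows: (prodId, storeId, date, amt), as on the slide.
SALE = [
    ("p1", "s1", 1, 12), ("p2", "s1", 1, 11), ("p1", "s3", 1, 50),
    ("p2", "s2", 1, 8), ("p1", "s1", 2, 44), ("p1", "s2", 2, 4),
]

# The distinct values of each dimension become the array indexes.
products = sorted({p for p, _, _, _ in SALE})
stores = sorted({s for _, s, _, _ in SALE})
days = sorted({d for _, _, d, _ in SALE})

# cube[day][product][store]; None marks a cell with no fact.
cube = [[[None] * len(stores) for _ in products] for _ in days]
for prod, store, day, amt in SALE:
    cube[days.index(day)][products.index(prod)][stores.index(store)] = amt

for day, plane in zip(days, cube):
    print("day", day, plane)
```

A cell is reached directly by its three indexes, which is the “direct access to array data structures” that distinguishes MOLAP from a relational scan.
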
Example
[Figure: a 3-D cube with axes Store (NY, SF, LA), Product (Juice, Milk, Coke, Cream, Soap, Bread) and Time (M T W Th F S S). Cell values visible in the figure: 10, 34, 56, 32, 12, 56. Arrows mark roll-up to region (Store axis), roll-up to brand (Product axis) and roll-up to week (Time axis). Example cell: 56 units of bread sold in LA on M.]

Dimensions: Time, Product, Store
Attributes: Product (upc, price, …), Store …
Hierarchies:
  Product → Brand → …
  Day → Week → Quarter
  Store → Region → Country

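Cube Aggregation: Roll-up
Example: computing sums

Cube:
day 1   s1   s2   s3
p1      12        50
p2      11    8
day 2   s1   s2   s3
p1      44    4
p2
...

Sum over days:
      s1   s2   s3
p1    56    4   50
p2    11    8

Sum over days and products:
      s1   s2   s3
sum   67   12   50

Sum over days and stores:
      sum
p1    110
p2    19

Sum over everything: 129

Rollup moves toward the sums; drill-down moves back toward the detail.
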
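The sums on this slide can be recomputed by summing out the dimensions that are not kept. A minimal sketch follows; the helper name roll_up and its keep parameter are made up for this example.

```python
from collections import defaultdict

# Fact table rows: (prodId, storeId, date, amt), as on the slide.
SALE = [
    ("p1", "s1", 1, 12), ("p2", "s1", 1, 11), ("p1", "s3", 1, 50),
    ("p2", "s2", 1, 8), ("p1", "s1", 2, 44), ("p1", "s2", 2, 4),
]
DIMS = ("prodId", "storeId", "date")


def roll_up(facts, keep):
    """Sum amt over every dimension that is not named in keep."""
    totals = defaultdict(int)
    for row in facts:
        key = tuple(row[DIMS.index(d)] for d in keep)
        totals[key] += row[3]
    return dict(totals)


by_prod_store = roll_up(SALE, ("prodId", "storeId"))  # sum over days
by_store = roll_up(SALE, ("storeId",))                # sum over days, products
by_prod = roll_up(SALE, ("prodId",))                  # sum over days, stores
total = roll_up(SALE, ())                             # sum over everything
print(by_prod_store, by_store, by_prod, total)
```

Drill-down is the reverse direction: calling roll_up with more dimensions in keep returns the finer-grained table again.
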
Cube Operators for Roll-up
[Figure: the roll-up tables of the previous slide, with cells labeled by cube operators; * stands for all values of a dimension.]

sale(s1,*,*) = 67
sale(s2,p2,*) = 8
sale(*,*,*) = 129

Extended Cube

day 1   s1   s2   s3    *
p1      12        50   62
p2      11    8        19
*       23    8   50   81

day 2   s1   s2   s3    *
p1      44    4        48
p2
*       44    4        48

*       s1   s2   s3    *
p1      56    4   50  110
p2      11    8        19
*       67   12   50  129

sale(*,p2,*) = 19

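The cube operator notation sale(store, product, day), with * meaning “all values”, can be imitated by a small function, and the extended cube is then just that function evaluated for every combination of values and *. This is a sketch; unlike the slide, empty cells come out as 0 rather than blank.

```python
from itertools import product

# Fact table rows: (prodId, storeId, date, amt), as on the slide.
SALE = [
    ("p1", "s1", 1, 12), ("p2", "s1", 1, 11), ("p1", "s3", 1, 50),
    ("p2", "s2", 1, 8), ("p1", "s1", 2, 44), ("p1", "s2", 2, 4),
]


def sale(store="*", prod="*", day="*"):
    """Total amt over the facts matching each argument; '*' matches anything."""
    return sum(amt for p, s, d, amt in SALE
               if store in ("*", s) and prod in ("*", p) and day in ("*", d))


# Extended cube: every cell of the base cube plus all the '*' margins.
extended = {
    (s, p, d): sale(s, p, d)
    for s, p, d in product(["s1", "s2", "s3", "*"], ["p1", "p2", "*"], [1, 2, "*"])
}

print(sale("s1"), sale("s2", "p2"), sale(), sale(prod="p2"))
print(extended[("*", "*", 1)], extended[("*", "p1", 2)])
```

A real server would precompute some of these cells instead of rescanning the facts on each call; deciding which ones is the “what to precompute and when” question from the query engine requirements.
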
Aggregation Using Hierarchies

Hierarchy: store → region → country

Cube:
day 1   s1   s2   s3
p1      12        50
p2      11    8
day 2   s1   s2   s3
p1      44    4
p2

Rolled up from store to region:
      region A   region B
p1    56         54
p2    11          8

(store s1 in Region A; stores s2, s3 in Region B)

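Rolling up along the store hierarchy amounts to replacing each store by its parent region before summing. A minimal sketch, using the store-to-region assignment stated on the slide:

```python
from collections import defaultdict

# Fact table rows: (prodId, storeId, date, amt), as on the earlier slides.
SALE = [
    ("p1", "s1", 1, 12), ("p2", "s1", 1, 11), ("p1", "s3", 1, 50),
    ("p2", "s2", 1, 8), ("p1", "s1", 2, 44), ("p1", "s2", 2, 4),
]

# One level of the store hierarchy: store -> region.
REGION_OF = {"s1": "region A", "s2": "region B", "s3": "region B"}

by_region = defaultdict(int)
for prod, store, day, amt in SALE:
    # Map the store to its region, then sum over days and stores.
    by_region[(prod, REGION_OF[store])] += amt

print(dict(by_region))
```
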
Slicing

Cube:
day 1   s1   s2   s3
p1      12        50
p2      11    8
day 2   s1   s2   s3
p1      44    4
p2

Slice with TIME = day 1:
      s1   s2   s3
p1    12        50
p2    11    8

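A slice fixes one dimension to a single value and drops that dimension from the result. A minimal sketch on the fact table view; the helper name slice_facts is made up for this example.

```python
# Fact table rows: (prodId, storeId, date, amt), as on the earlier slides.
SALE = [
    ("p1", "s1", 1, 12), ("p2", "s1", 1, 11), ("p1", "s3", 1, 50),
    ("p2", "s2", 1, 8), ("p1", "s1", 2, 44), ("p1", "s2", 2, 4),
]
DIMS = ("prodId", "storeId", "date")


def slice_facts(facts, dim, value):
    """Keep the rows where dim equals value and drop that column."""
    i = DIMS.index(dim)
    return [row[:i] + row[i + 1:] for row in facts if row[i] == value]


day1 = slice_facts(SALE, "date", 1)
print(day1)
```

The result has one dimension fewer than the input: the 3-D cube becomes the 2-D table for day 1.
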
Slicing & Pivoting

Sales ($ millions), by store and product over time:
                         d1       d2
Store s1   Electronics   $5.2
           Toys          $1.9
           Clothing      $2.3
           Cosmetics     $1.1
Store s2   Electronics   $8.9
           Toys          $0.75
           Clothing      $4.6
           Cosmetics     $1.5

Sales ($ millions) for d1, pivoted so that the stores become columns:
              Store s1   Store s2
Electronics   $5.2       $8.9
Toys          $1.9       $0.75
Clothing      $2.3       $4.6
Cosmetics     $1.1       $1.5

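The pivot above only rearranges the d1 slice: the store moves from a row label to a column. A minimal sketch with plain dictionaries, using the figures from the slide:

```python
# The d1 slice as (store, product, sales in $ millions) rows.
SALES_D1 = [
    ("s1", "Electronics", 5.2), ("s1", "Toys", 1.9),
    ("s1", "Clothing", 2.3), ("s1", "Cosmetics", 1.1),
    ("s2", "Electronics", 8.9), ("s2", "Toys", 0.75),
    ("s2", "Clothing", 4.6), ("s2", "Cosmetics", 1.5),
]

# Pivot: one row per product, one column per store.
pivoted = {}
for store, prod, dollars in SALES_D1:
    pivoted.setdefault(prod, {})[store] = dollars

for prod, row in pivoted.items():
    print(f"{prod:12} {row['s1']:>6} {row['s2']:>6}")
```

No value is aggregated or changed by the pivot; only the orientation of the table differs.
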
Summary of Operations
• Aggregation (roll-up)
  - aggregate (summarize) data to the next higher dimension element
  - e.g., total sales by city, year → total sales by region, year
• Navigation to detailed data (drill-down)
• Selection (slice) defines a subcube
  - e.g., sales where city = 'Gainesville' and date = '1/15/90'
• Calculation and ranking
  - e.g., top 3% of cities by average income
• Visualization operations (e.g., Pivot)
• Time functions
  - e.g., time average

Query & Analysis Tools
• Query Building
• Report Writers (comparisons, growth, graphs, …)
• Spreadsheet Systems
• Web Interfaces
• Data Mining