04OLAP
— Chapter 4 —
1
Chapter 4: Data Warehousing and On-line Analytical
Processing
2
What is a Data Warehouse?
■ Defined in many different ways, but not rigorously.
■ A decision support database that is maintained separately from
the organization’s operational database
■ Supports information processing by providing a solid platform of
consolidated, historical data for analysis.
■ “A data warehouse is a subject-oriented, integrated, time-variant,
and nonvolatile collection of data in support of management’s
decision-making process.”—W. H. Inmon
■ Data warehousing:
■ The process of constructing and using data warehouses
3
Data Warehouse—Subject-Oriented
4
Data Warehouse—Integrated
■ Data cleaning and data integration techniques are
applied.
■ Ensure consistency in naming conventions, encoding
5
Data Warehouse—Time Variant
6
Data Warehouse—Nonvolatile
7
OLTP vs. OLAP
8
Why a Separate Data Warehouse?
■ High performance for both systems
■ DBMS: tuned for OLTP (access methods, indexing, concurrency
control, recovery)
■ Warehouse: tuned for OLAP (complex OLAP queries,
multidimensional view, consolidation)
■ Different functions and different data:
■ missing data: Decision support requires historical data which
operational DBs do not typically maintain
■ data consolidation: decision support requires consolidation (aggregation,
summarization) of data from heterogeneous sources
■ data quality: different sources typically use inconsistent data
representations, codes, and formats, which have to be reconciled
■ Note: A growing number of systems perform OLAP analysis directly on
relational databases
9
Data Warehouse: A Multi-Tiered Architecture
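[Figure: multi-tiered data warehouse architecture. Data sources (operational DBs, other sources) pass through Extract / Transform / Load / Refresh, with a monitor and integrator and a metadata repository, into the data storage tier (data warehouse, data marts). The storage tier serves an OLAP server, which feeds the front-end tools: analysis, query, reports, data mining.]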
11
Extraction, Transformation, and Loading (ETL)
■ Data extraction
■ get data from multiple, heterogeneous, and external
sources
■ Data cleaning
■ detect errors in the data and rectify them when possible
■ Data transformation
■ convert data from legacy or host format to warehouse
format
■ Load
■ sort, summarize, consolidate, compute views, check
integrity, and build indices and partitions
■ Refresh
■ propagate the updates from the data sources to the
warehouse
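The ETL stages listed on this slide can be sketched in a few lines of Python. This is a minimal illustration, not a description of any particular ETL tool: the two in-memory sources, their field names, the country-code table, and the dictionary standing in for the warehouse are all assumptions made up for the example.

```python
from collections import defaultdict

# Two hypothetical sources with inconsistent field names, codes, and formats.
SOURCE_A = [
    {"item": "TV", "country": "CA", "amount_usd": "120.50"},
    {"item": "TV", "country": "MX", "amount_usd": "80.00"},
    {"item": "PC", "country": "CA", "amount_usd": "bad"},  # dirty value
]
SOURCE_B = [
    {"product": "tv", "nation": "Canada", "sales": 99.5},
    {"product": "pc", "nation": "Mexico", "sales": 300.0},
]

COUNTRY_CODES = {"CA": "Canada", "MX": "Mexico"}  # assumed code table


def extract():
    """Extraction: gather raw records from every source, tagged by origin."""
    for row in SOURCE_A:
        yield "A", row
    for row in SOURCE_B:
        yield "B", row


def clean_and_transform(tagged_rows):
    """Cleaning and transformation: reject bad rows, map the rest to one format."""
    good, rejected = [], []
    for origin, row in tagged_rows:
        try:
            if origin == "A":
                rec = {"item": row["item"].upper(),
                       "country": COUNTRY_CODES[row["country"]],
                       "dollars_sold": float(row["amount_usd"])}
            else:
                rec = {"item": row["product"].upper(),
                       "country": row["nation"],
                       "dollars_sold": float(row["sales"])}
        except (KeyError, ValueError):
            rejected.append((origin, row))  # kept for an error report
            continue
        good.append(rec)
    return good, rejected


def load(warehouse, records):
    """Load: append detail rows and rebuild a summary keyed by (item, country)."""
    warehouse["facts"].extend(records)
    summary = defaultdict(float)
    for rec in warehouse["facts"]:
        summary[(rec["item"], rec["country"])] += rec["dollars_sold"]
    warehouse["summary"] = dict(summary)


warehouse = {"facts": [], "summary": {}}
good, rejected = clean_and_transform(extract())
load(warehouse, good)
print(warehouse["summary"])
print("rejected rows:", rejected)
```

In this sketch a refresh would amount to running the same clean, transform, and load steps on only the rows that changed at the sources since the last run.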
12
Metadata Repository
■ Metadata is the data that defines warehouse objects. It stores:
■ Description of the structure of the data warehouse
■ schema, view, dimensions, hierarchies, derived data definitions, data
mart locations and contents
■ Operational metadata
■ data lineage (history of migrated data and transformation path),
currency of data (active, archived, or purged), monitoring
information (warehouse usage statistics, error reports, audit trails)
■ The algorithms used for summarization
■ The mapping from operational environment to the data warehouse
■ Data related to system performance
■ warehouse schema, view and derived data definitions
■ Business data
■ business terms and definitions, ownership of data, charging policies
13
From Tables and Spreadsheets to
Data Cubes
■ A data warehouse is based on a multidimensional data model
which views data in the form of a data cube
■ A data cube, such as sales, allows data to be modeled and viewed in
multiple dimensions
■ Dimension tables, such as item (item_name, brand, type) or
time (day, week, month, quarter, year)
■ Fact table contains measures (such as dollars_sold) and keys
to each of the related dimension tables
■ In data warehousing literature, an n-D base cube is called a base
cuboid. The topmost 0-D cuboid, which holds the highest level of
summarization, is called the apex cuboid. The lattice of cuboids
forms a data cube.
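To make the lattice concrete, the following sketch enumerates every cuboid for the four dimensions time, item, location, and supplier. It assumes each dimension has a single level (no concept hierarchies), so each cuboid is simply a subset of the dimensions; the function name `cuboids` is made up for this illustration.

```python
from itertools import combinations


def cuboids(dimensions):
    """Return every cuboid (subset of dimensions), grouped by dimensionality."""
    lattice = {}
    for k in range(len(dimensions) + 1):
        lattice[k] = list(combinations(dimensions, k))
    return lattice


dims = ("time", "item", "location", "supplier")
lattice = cuboids(dims)

for k, level in lattice.items():
    print(f"{k}-D cuboids ({len(level)}):", level)

# With n single-level dimensions the lattice has 2**n cuboids.
total = sum(len(level) for level in lattice.values())
print("total cuboids:", total)
```

The 0-D level holds only the empty grouping, which is the apex cuboid, and the 4-D level holds only the full set of dimensions, which is the base cuboid.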
15
Cube: A Lattice of Cuboids
[Figure: lattice of cuboids; only part of the figure survives]
0-D (apex) cuboid: all
3-D cuboids: (time, item, location), (time, item, supplier), (time, location, supplier), (item, location, supplier)
16
Conceptual Modeling of Data Warehouses
17
Example of Star Schema
Sales Fact Table
■ Keys: time_key, item_key, branch_key, location_key
■ Measures: units_sold, dollars_sold, avg_sales
Dimension tables
■ time (time_key, day, day_of_the_week, month, quarter, year)
■ item (item_key, item_name, brand, type, supplier_type)
■ branch (branch_key, branch_name, branch_type)
■ location (location_key, street, city, state_or_province, country)
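A star schema like the one on this slide can be tried out with Python's built-in `sqlite3` module. The sketch below is a cut-down version under stated assumptions: it keeps only the time and item dimensions, renames the tables to `time_dim`, `item_dim`, and `sales_fact` for clarity, and uses invented rows. It shows the typical star join, in which the fact table is joined to its dimension tables and the measures are aggregated.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (
    time_key INTEGER PRIMARY KEY, day INTEGER, month INTEGER,
    quarter TEXT, year INTEGER
);
CREATE TABLE item_dim (
    item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT
);
-- The fact table holds the measures plus a key into each dimension table.
CREATE TABLE sales_fact (
    time_key INTEGER REFERENCES time_dim(time_key),
    item_key INTEGER REFERENCES item_dim(item_key),
    units_sold INTEGER,
    dollars_sold REAL
);
""")

# Invented sample rows, for illustration only.
conn.executemany("INSERT INTO time_dim VALUES (?, ?, ?, ?, ?)",
                 [(1, 15, 1, "Q1", 2024), (2, 20, 4, "Q2", 2024)])
conn.executemany("INSERT INTO item_dim VALUES (?, ?, ?, ?)",
                 [(1, "TV-40", "Acme", "television"),
                  (2, "PC-1", "Bolt", "computer")])
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?)",
                 [(1, 1, 2, 800.0), (1, 2, 1, 1200.0),
                  (2, 1, 3, 1200.0), (2, 1, 1, 400.0)])

# Star join: fact table joined to two dimensions, measures summed per group.
rows = conn.execute("""
    SELECT t.quarter, i.brand, SUM(f.units_sold), SUM(f.dollars_sold)
    FROM sales_fact AS f
    JOIN time_dim AS t ON f.time_key = t.time_key
    JOIN item_dim AS i ON f.item_key = i.item_key
    GROUP BY t.quarter, i.brand
    ORDER BY t.quarter, i.brand
""").fetchall()

for row in rows:
    print(row)
```

A snowflake variant would split a dimension further, for example by moving supplier attributes out of the item table into their own table referenced by a supplier key.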
18
Example of Snowflake Schema
Sales Fact Table
■ Keys: time_key, item_key, branch_key, location_key
■ Measures: units_sold, dollars_sold, avg_sales
Dimension tables
■ time (time_key, day, day_of_the_week, month, quarter, year)
■ item (item_key, item_name, brand, type, supplier_key)
■ supplier (supplier_key, supplier_type)
■ branch (branch_key, branch_name, branch_type)
■ location (location_key, street, city_key)
■ city (city_key, city, state_or_province, country)
19
Example of Fact Constellation
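[Figure only partially recovered; the remaining keys, measures, and dimension tables are missing]
Sales Fact Table
■ Keys: time_key, item_key, branch_key, …
Shipping Fact Table
■ Keys: time_key, item_key, shipper_key, from_location, …
Dimension tables
■ time (time_key, day, day_of_the_week, month, quarter, year)
■ item (item_key, item_name, brand, type, supplier_type)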
21
A Sample Data Cube
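[Figure: sample data cube. Surviving labels: a Country axis with members Canada and Mexico, and "sum" cells holding the aggregates.]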
22
Cuboids Corresponding to the Cube
[Figure: lattice of cuboids for the sample cube; only part of the figure survives]
0-D (apex) cuboid: all
1-D cuboids: product, date, country
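The cuboids of a cube over product, date, and country can be computed from a fact table by grouping on every subset of the dimensions and summing the measure. The sketch below does this by brute force over a tiny, invented fact table. The function name `compute_cube` and the use of the string "all" for an aggregated-out dimension are choices made for this illustration, and real systems use far more efficient cube computation methods.

```python
from collections import defaultdict
from itertools import combinations

DIMS = ("product", "date", "country")

# Invented fact rows: (product, date, country, dollars_sold).
FACTS = [
    ("TV", "1Qtr", "Canada", 100.0),
    ("TV", "2Qtr", "Mexico", 40.0),
    ("PC", "1Qtr", "Canada", 250.0),
]


def compute_cube(facts, dims):
    """Return {cuboid dimensions: {cell coordinates: summed measure}}."""
    n = len(dims)
    cube = {}
    for k in range(n + 1):
        for kept in combinations(range(n), k):
            cells = defaultdict(float)
            for *coords, measure in facts:
                # Dimensions outside this cuboid are collapsed to "all".
                key = tuple(coords[i] if i in kept else "all" for i in range(n))
                cells[key] += measure
            cube[tuple(dims[i] for i in kept)] = dict(cells)
    return cube


cube = compute_cube(FACTS, DIMS)

print("apex cuboid:", cube[()])
print("1-D cuboid on country:", cube[("country",)])
print("number of cuboids:", len(cube))
```

The apex cuboid holds the single grand total, and the cuboid keyed by all three dimensions reproduces the base data.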
23
Typical OLAP Operations
■ Roll up (drill-up): summarize data
■ by climbing up hierarchy or by dimension reduction
■ Drill down (roll down): reverse of roll-up
■ from higher level summary to lower level summary or
detailed data, or introducing new dimensions
■ Slice and dice: project and select
■ Pivot (rotate):
■ reorient the cube for visualization, e.g., present a 3-D cube as a series of 2-D planes
■ Other operations
■ drill across: run queries that involve (go across) more than one fact table
■ drill through: go through the bottom level of the cube to its
back-end relational tables (using SQL)
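The operations listed on this slide can be illustrated on a toy cube stored as a Python dictionary. Everything here is an assumption made for the example: the cell values, the city-to-country hierarchy, and the function names. Note that `slice_` keeps the fixed dimension in the cell keys rather than projecting it away, which some definitions of slice would do.

```python
from collections import defaultdict

# Toy cube: cells keyed by (city, quarter, item), holding dollars_sold.
CELLS = {
    ("Toronto", "Q1", "TV"): 100.0,
    ("Toronto", "Q2", "TV"): 150.0,
    ("Vancouver", "Q1", "PC"): 200.0,
    ("Mexico City", "Q1", "TV"): 50.0,
}
CITY_TO_COUNTRY = {"Toronto": "Canada", "Vancouver": "Canada",
                   "Mexico City": "Mexico"}


def roll_up(cells, axis, mapping):
    """Roll up by climbing a hierarchy: map values on `axis` to their parent."""
    out = defaultdict(float)
    for key, value in cells.items():
        out[key[:axis] + (mapping[key[axis]],) + key[axis + 1:]] += value
    return dict(out)


def drop_dimension(cells, axis):
    """Roll up by dimension reduction: sum out one axis entirely."""
    out = defaultdict(float)
    for key, value in cells.items():
        out[key[:axis] + key[axis + 1:]] += value
    return dict(out)


def dice(cells, criteria):
    """Dice: keep cells whose coordinate on each listed axis is allowed."""
    return {k: v for k, v in cells.items()
            if all(k[axis] in allowed for axis, allowed in criteria.items())}


def slice_(cells, axis, value):
    """Slice: select on a single dimension (a dice on one axis)."""
    return dice(cells, {axis: {value}})


def pivot(cells_2d):
    """Pivot a 2-D cuboid by swapping its two axes."""
    return {(b, a): v for (a, b), v in cells_2d.items()}


by_country = roll_up(CELLS, 0, CITY_TO_COUNTRY)     # city -> country
country_quarter = drop_dimension(by_country, 2)     # sum out item
q1_only = slice_(CELLS, 1, "Q1")                    # quarter = Q1
canada_tv = dice(CELLS, {0: {"Toronto", "Vancouver"}, 2: {"TV"}})
quarter_country = pivot(country_quarter)            # swap the two axes

print(country_quarter)
print(quarter_country)
```

Drill-down is the reverse of roll-up, so in this sketch it means going back from `by_country` to the more detailed `CELLS`, which is only possible because the detailed data was retained.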
24
Fig. 3.10 Typical OLAP
Operations
25
Example
26
Example
27
ETL vs ELT
■ ETL: used in Power BI or data warehouse solutions
■ ELT: used in cloud technologies, e.g., data lakes
28
29
30
Example Database
31
References (I)
■ S. Agarwal, R. Agrawal, P. M. Deshpande, A. Gupta, J. F. Naughton, R. Ramakrishnan, and S.
Sarawagi. On the computation of multidimensional aggregates. VLDB’96
■ D. Agrawal, A. E. Abbadi, A. Singh, and T. Yurek. Efficient view maintenance in data
warehouses. SIGMOD’97
■ R. Agrawal, A. Gupta, and S. Sarawagi. Modeling multidimensional databases. ICDE’97
■ S. Chaudhuri and U. Dayal. An overview of data warehousing and OLAP technology. ACM
SIGMOD Record, 26:65-74, 1997
■ E. F. Codd, S. B. Codd, and C. T. Salley. Beyond decision support. Computer World, 27, July
1993.
■ J. Gray, et al. Data cube: A relational aggregation operator generalizing group-by, cross-tab
and sub-totals. Data Mining and Knowledge Discovery, 1:29-54, 1997.
■ A. Gupta and I. S. Mumick. Materialized Views: Techniques, Implementations, and
Applications. MIT Press, 1999.
■ J. Han. Towards on-line analytical mining in large databases. ACM SIGMOD Record, 27:97-107,
1998.
■ V. Harinarayan, A. Rajaraman, and J. D. Ullman. Implementing data cubes efficiently.
SIGMOD’96
■ J. Hellerstein, P. Haas, and H. Wang. Online aggregation. SIGMOD'97
32
References (II)
■ C. Imhoff, N. Galemmo, and J. G. Geiger. Mastering Data Warehouse Design: Relational and
Dimensional Techniques. John Wiley, 2003
■ W. H. Inmon. Building the Data Warehouse. John Wiley, 1996
■ R. Kimball and M. Ross. The Data Warehouse Toolkit: The Complete Guide to Dimensional
Modeling. 2ed. John Wiley, 2002
■ P. O’Neil and G. Graefe. Multi-table joins through bitmapped join indices. SIGMOD Record, 24:8–
11, Sept. 1995.
■ P. O'Neil and D. Quass. Improved query performance with variant indexes. SIGMOD'97
■ Microsoft. OLEDB for OLAP programmer's reference version 1.0. In
http://www.microsoft.com/data/oledb/olap, 1998
■ S. Sarawagi and M. Stonebraker. Efficient organization of large multidimensional arrays. ICDE'94
■ A. Shoshani. OLAP and statistical databases: Similarities and differences. PODS’00.
■ D. Srivastava, S. Dar, H. V. Jagadish, and A. V. Levy. Answering queries with aggregation using
views. VLDB'96
■ P. Valduriez. Join indices. ACM Trans. Database Systems, 12:218-246, 1987.
■ J. Widom. Research problems in data warehousing. CIKM’95
■ K. Wu, E. Otoo, and A. Shoshani, Optimal Bitmap Indices with Efficient Compression, ACM Trans.
on Database Systems (TODS), 31(1): 1-38, 2006
33
Surplus Slides
34
Compression of Bitmap Indices
■ Bitmap indexes must be compressed to reduce I/O costs
and minimize CPU usage, since the majority of the bits are 0s
■ Two compression schemes:
■ Byte-aligned Bitmap Code (BBC)
■ Word-Aligned Hybrid (WAH) code
■ The time and space required to operate on a compressed
bitmap are proportional to the total size of the bitmap
■ Optimal on attributes of low cardinality as well as those of
high cardinality.
■ WAH outperforms BBC by about a factor of two
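The idea behind word-aligned compression can be sketched as follows. This is a simplified illustration in the spirit of WAH, not the published algorithm: it assumes 32-bit words in which a literal word carries 31 bitmap bits and a fill word carries a fill bit plus a run count of 31-bit groups, it pads the tail with zeros and stores the bit length separately instead of keeping an active word, and it decompresses before use rather than operating on the compressed form. The function names are made up for the example.

```python
GROUP = 31                 # bitmap bits carried by one 32-bit word
MAX_RUN = (1 << 30) - 1    # largest run count a fill word can hold


def wah_encode(bits):
    """Encode a list of 0/1 ints into 32-bit words (simplified WAH-style)."""
    padded = bits + [0] * (-len(bits) % GROUP)
    words = []
    for start in range(0, len(padded), GROUP):
        group = padded[start:start + GROUP]
        if all(b == group[0] for b in group):
            fill_bit = group[0]
            last = words[-1] if words else None
            # Extend the previous fill word if it has the same fill bit.
            if (last is not None and last >> 31 == 1
                    and (last >> 30) & 1 == fill_bit
                    and (last & MAX_RUN) < MAX_RUN):
                words[-1] = last + 1
            else:
                # Fill word: top bit 1, then the fill bit, then run count 1.
                words.append((1 << 31) | (fill_bit << 30) | 1)
        else:
            literal = 0
            for b in group:
                literal = (literal << 1) | b
            words.append(literal)  # top bit stays 0: only 31 bits are used
    return words, len(bits)


def wah_decode(words, n_bits):
    """Expand the words back into the original list of bits."""
    bits = []
    for w in words:
        if w >> 31:  # fill word
            fill_bit = (w >> 30) & 1
            bits.extend([fill_bit] * (GROUP * (w & MAX_RUN)))
        else:        # literal word
            bits.extend((w >> shift) & 1 for shift in range(GROUP - 1, -1, -1))
    return bits[:n_bits]


# A sparse bitmap: 1000 bits with only two bits set.
bitmap = [0] * 1000
bitmap[3] = 1
bitmap[990] = 1

words, n_bits = wah_encode(bitmap)
print(f"{n_bits} bits stored in {len(words)} words")
print([hex(w) for w in words])
```

Because most groups in a sparse bitmap are all zeros, long stretches collapse into a single fill word, which is where the space saving comes from.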
35