Data Warehousing & Data Mining Chapter 2
Data Warehousing
Part I
TBS 2020-2021
Origins of DW
▪ Operational processing (transactional processing)
captures, stores and manipulates data to support
daily operations.
▪ Information processing is the analysis of data or
other forms of information to support decision
making.
▪ Data warehouse can consolidate and integrate
information from many internal and external
sources and arrange it in a meaningful format for
making business decisions.
Origins of DW
Example:
Think, for instance, of a mine. If a
mine is not organized and built
properly, miners cannot reach the
places they need in order to mine
effectively.
In the same manner, the data
warehouse must be set up to best
serve the needs of those analyzing
the data.
What is a Data Warehouse ?
▪ According to Inmon (the father of data warehousing):
It is a collection of integrated, subject-oriented,
databases designed to support the DSS function,
where each unit of data is non-volatile and relevant
to some moment in time.
▪ Or a DW is : A subject-oriented, integrated, time-
variant, non-updatable collection of data used in
support of management decision-making processes:
▪ Subject-oriented: e.g. customers, patients, products
▪ Integrated: Consistent naming conventions, formats,
encoding structures; from multiple data sources
▪ Time-variant: Can study trends and changes
▪ Non-updatable: Read-only, periodically refreshed
What is a Data Warehouse ?
▪ DW-Subject-oriented
▪ Organized around major subjects, such as
customer, product, sales, student, patient.
▪ Focusing on the modeling and analysis of data
for decision makers, not on daily operations or
transaction processing.
▪ Providing a simple and concise view around
particular subject issues by excluding data that
are not useful in the decision support process.
What is a Data Warehouse ?
▪ DW-Integrated
▪ Focusing on the modeling and analysis of data for
decision makers: data cleaning and data integration
techniques are applied.
What is a Data Warehouse ?
▪ DW- Time Variant
▪ The time horizon for the DW is significantly
longer than that of operational systems.
▪ Data warehouse data: provide information
from a historical perspective (e.g., past 5-10
years)
▪ Every key structure in the data warehouse:
▪ Contains an element of time, explicitly or
implicitly. But the key of operational data
may or may not contain “time element”.
What is a Data Warehouse ?
▪ DW-Non Updatable / Non-volatility
▪ Typical activities such as deletes, inserts, and
changes that are performed in an operational
application environment are completely
nonexistent in a DW environment.
▪ Only two data operations are ever performed
in the DW: data loading and data access.
Need for Data Warehousing
Database, Data warehouse and
Data set
▪ DB : contains tables, rows refer to records and
columns to fields. Most DBs are relational DBs
(relating tables to reduce redundancy & improve
DB performance via the normalization process)
▪ DW : is a type of DB that has been denormalized
& archived.
▪ Denormalization is the process of combining
some tables into a single table. This may
introduce duplicate data, but will reduce the
number of joins a query has to process.
▪ Data set : a subset of a DW or a DB. It is usually
denormalized so that only one table is used.
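A minimal sketch of denormalization, using Python's standard sqlite3 module; the customers and orders tables and their columns are hypothetical, not part of the course material.

```python
# Minimal sketch: denormalizing two normalized tables into one wide table.
# Table and column names (customers, orders, ...) are hypothetical.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER,
                         amount REAL, FOREIGN KEY (customer_id) REFERENCES customers);
    INSERT INTO customers VALUES (1, 'Alice', 'Tunis'), (2, 'Bob', 'Sfax');
    INSERT INTO orders VALUES (10, 1, 99.5), (11, 1, 12.0), (12, 2, 45.0);

    -- Denormalization: pre-join customers and orders into a single table.
    -- Customer data is now duplicated per order, but queries need no join.
    CREATE TABLE orders_denorm AS
    SELECT o.order_id, o.amount, c.customer_id, c.name, c.city
    FROM orders o JOIN customers c ON o.customer_id = c.customer_id;
""")
for row in con.execute("SELECT * FROM orders_denorm"):
    print(row)
```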
How Do Data Warehouses Differ
From Operational Systems?
▪ Goals
▪ Structure
▪ Size
▪ Performance optimization
▪ Technologies used
Need to separate operational and
information systems
Three primary factors:
▪ A data warehouse centralizes data that are
scattered throughout disparate operational
systems and makes them available for DM.
▪ A well-designed data warehouse adds value to
data by improving their quality and consistency.
▪ A separate data warehouse eliminates much of the
contention for resources that results when
information applications are mixed with
operational processing.
Comparison of Database Types
From the Data Warehouse to Data
Marts
▪ A data mart contains only those data that are
specific to a particular group. For example, the
marketing data mart may contain only data
related to items, customers, and sales.
▪ Data marts are confined to subjects.
▪ Data marts are small in size.
▪ Data marts are customized by department.
How Data Warehousing works
▪ Data is loaded and periodically updated via
Extract/Transform/Load (ETL) tools (the ETL
pipeline).
How Data Warehousing works
Extraction Transformation Loading–ETL
▪ To get data out of the source and load it into the data
warehouse – simply a process of copying data from
one database to another.
▪ Data is extracted from an OLTP database, transformed
to match the data warehouse schema and loaded into
the data warehouse database.
▪ Many data warehouses also incorporate data from
non‐OLTP systems such as text files, legacy systems,
and spreadsheets; such data also requires extraction,
transformation, and loading.
▪ When defining ETL for a data warehouse, it is
important to think of ETL as a process, not a physical
implementation.
How Data Warehousing works
Extraction Transformation Loading–ETL tools
[Diagram: Sources → Extract → Transform & Clean
(in the Data Staging Area, DSA) → Load → DW]
ETL Tools
Data Transformation
▪ Extracted data is raw data and cannot be loaded
into the DW as-is
▪ The major effort within data transformation is the
improvement of data quality
▪ Data warehouses can fail if an appropriate data
transformation strategy is not developed
▪ The transformation process involves:
▪ Applying business rules (so-called derivations,
e.g., calculating new measures and dimensions),
▪ Cleaning (e.g., mapping NULL to 0 or "Male" to
"M", etc.),
▪ Filtering (e.g., selecting only certain columns to
load),
▪ Splitting a column into multiple columns and vice
versa,
▪ Joining together data from multiple sources (e.g.,
lookup, merge)
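As a small sketch of these steps in Python, the fragment below applies cleaning, a derivation, and column filtering to rows extracted as dicts; all field names and recodings are hypothetical.

```python
# Minimal sketch of the transformation steps listed above, on plain Python
# dicts standing in for extracted rows. Field names are hypothetical.
raw_rows = [
    {"id": 1, "gender": "Male",   "qty": 3,    "unit_price": 10.0, "note": "x"},
    {"id": 2, "gender": "Female", "qty": None, "unit_price": 4.5,  "note": "y"},
]

transformed = []
for row in raw_rows:
    qty = row["qty"] if row["qty"] is not None else 0          # cleaning: NULL -> 0
    gender = {"Male": "M", "Female": "F"}.get(row["gender"])   # cleaning: recode
    revenue = qty * row["unit_price"]                          # derivation: new measure
    transformed.append({                                       # filtering: keep only
        "id": row["id"], "gender": gender, "revenue": revenue  # the columns to load
    })

print(transformed)
```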
ETL Tools
Loading
▪ Loading the data into a data warehouse
▪ Terminology:
Initial Load : populating all the data warehouse
tables for the first time
Incremental Load : applying ongoing changes as
necessary in a periodic manner
Full Refresh : completely erasing the contents of
one or more tables and reloading with fresh
data (initial load is a refresh of all the tables)
ETL Tools
Data Loading
▪ Load,
▪ Append,
▪ Destructive merge,
▪ Constructive merge.
ETL Tools
Load
▪ If the target table to be loaded already exists and
data exists in the table, the load process wipes
out the existing data and applies the data from
the incoming file.
▪ If the table is already empty before loading, the
load process simply applies the data from the
incoming file.
ETL Tools
Append
▪ Extension of the load.
▪ If data already exists in the table, the append
process unconditionally adds the incoming data,
preserving the existing data in the target table.
▪ When an incoming record is a duplicate of an
already existing record, you may define how to
handle an incoming duplicate:
▪ The incoming record may be allowed to be
added as a duplicate.
▪ In the other option, the incoming duplicate
record may be rejected during the append
process.
ETL Tools
Destructive Merge
▪ Applies incoming data to the target data.
▪ If the primary key of an incoming record matches
with the key of an existing record, update the
matching target record.
▪ If the incoming record is a new record without a
match with any existing record, add the incoming
record to the target table.
ETL Tools
Constructive Merge
▪ It is slightly different from the destructive merge.
▪ If the primary key of an incoming record matches
with the key of an existing record, leave the
existing record, add the incoming record, and
mark the added record as superseding the old
record.
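As a sketch of how these four modes differ, the following Python fragment models the target table as a dict mapping a primary key to a list of record versions; the names and the versioning convention are illustrative assumptions, not a prescribed implementation.

```python
# Minimal sketch of the four loading modes. The target table is modeled as a
# dict mapping each primary key to a list of record versions; the last entry
# in a list is the current record. All names are hypothetical.

def load(target, incoming):
    """Load: wipe out any existing data, then apply the incoming records."""
    target.clear()
    for key, rec in incoming:
        target[key] = [rec]

def append(target, incoming, allow_duplicates=False):
    """Append: add incoming records while preserving existing data; an
    incoming duplicate is either added or rejected, depending on the option."""
    for key, rec in incoming:
        if key in target and not allow_duplicates:
            continue                         # reject the incoming duplicate
        target.setdefault(key, []).append(rec)

def destructive_merge(target, incoming):
    """Destructive merge: overwrite matching records, add new ones."""
    for key, rec in incoming:
        target[key] = [rec]

def constructive_merge(target, incoming):
    """Constructive merge: keep the existing record and add the incoming one,
    which supersedes it (it becomes the last, i.e., current, version)."""
    for key, rec in incoming:
        target.setdefault(key, []).append(rec)

table = {}
load(table, [(1, "v1"), (2, "v1")])
constructive_merge(table, [(1, "v2")])
print(table)   # {1: ['v1', 'v2'], 2: ['v1']}
```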
Refresh
▪ Propagate updates on source data to the
warehouse
▪ Issues:
– when to refresh
– how to refresh -- incremental refresh techniques
When to Refresh?
▪ Periodically (e.g., every night, every week) or
after significant events
▪ On every update: not warranted unless the
warehouse requires current data (e.g., up-to-
the-minute stock quotes)
▪ Refresh policy set by administrator based on
user needs and traffic
▪ Possibly different policies for different sources
Refresh techniques
▪ Incremental techniques
– Detect changes on base tables: replication
servers (e.g., Sybase, Oracle, IBM Data
Propagator)
• snapshots (Oracle)
• transaction shipping (Sybase)
– Compute changes to derived and summary
tables
– Maintain transactional correctness for
incremental load
How To Detect Changes
▪ Create a snapshot log table to record ids of
updated rows of source data and timestamp
▪ Detect changes by:
– Defining after row triggers to update snapshot
log when source table changes
– Using regular transaction log to detect changes
to source data
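A minimal sketch of the trigger-based approach, using Python's standard sqlite3 module; the product table, snapshot_log table, and trigger name are hypothetical.

```python
# Minimal sketch of change detection with an after-row trigger writing to a
# snapshot log (SQLite syntax; table and trigger names are hypothetical).
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE product (product_id INTEGER PRIMARY KEY, price REAL);
    CREATE TABLE snapshot_log (product_id INTEGER,
                               changed_at TEXT DEFAULT CURRENT_TIMESTAMP);

    -- After each row update, record the id of the changed row and a
    -- timestamp; the incremental refresh later reads only the logged ids.
    CREATE TRIGGER log_product_change AFTER UPDATE ON product
    BEGIN
        INSERT INTO snapshot_log (product_id) VALUES (NEW.product_id);
    END;

    INSERT INTO product VALUES (1, 9.99), (2, 5.00);
    UPDATE product SET price = 10.99 WHERE product_id = 1;
""")
print(con.execute("SELECT product_id, changed_at FROM snapshot_log").fetchall())
```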
Chapter 2:
Data Warehouse
conceptual modeling
Part II
TBS 2020-2021
ER Model vs. Multidimensional
Model
▪ Why don’t we use the entity-relationship (ER)
model in data warehousing?
▪ ER model: a data model for general purposes
– All types of data are treated equally, so it is difficult
to identify the data that is important for business
analysis:
• No difference between what is important and
what just describes the important
• Normalized databases spread details across many
tables (which can also affect privacy and security)
– Hard to get an overview of a large ER diagram (e.g.,
over 100 entities/relations for an enterprise)
ER Model vs. Multidimensional
Model
▪ Traditional DBs generally deal with two-dimensional
data. However, queries against a multi-dimensional
data storage model perform more efficiently.
▪ More built in “meaning”
– What is important
– What describes the important
– What we want to optimize
▪ Recognized by OLAP/BI tools : Tools that offer powerful
query facilities based on Multi-Dimensional (MD) design
Multidimensional Model
▪ Data is divided into: Facts and Dimensions
▪ A fact is the important entity: e.g., a sale
▪ Facts have measures that can be aggregated: e.g.,
sales price
▪ Dimensions describe facts
▪ Facts “live” in a MD cube
▪ Goal for dimensional modeling:
– Surround facts with as much context (dimensions) as
possible
– Hint: redundancy may be ok (in well-chosen places)
– But you should not try to model all relationships in the
data (unlike E/R and OO modeling!)
Dimension
▪ Dimensions are the core of MD databases
▪ Dimensions are used for
▪ Selection of data
▪ Grouping of data at the right level of detail
▪ Dimensions consist of dimension values
▪ Product dimension has values ”milk”, ”cream”, …
▪ Time dimension has values ”1/1/2001”, ”2/1/2001”,…
▪ Dimension values may have an ordering
▪ Used for comparing cube data across values
▪ Especially used for Time dimension
Dimension
▪ Dimensions have hierarchies with levels
▪ Typically 3-5 levels (of detail)
▪ Dimension values are organized in a tree structure
▪ Product: Product->Type->Category
▪ Store: Store->Area->City->County
▪ Time: Day->Month->Quarter->Year
▪ Dimensions have a bottom level and a top level
▪ Levels may have attributes
▪ Simple, non-hierarchical information
▪ Day has Workday as attribute
▪ Dimensions should contain much information
▪ Time dimension may contain holiday, season, events,…
▪ Good dimensions have 50-100 or more attributes/levels
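As a small illustration of hierarchy levels, the sketch below stores the Product → Type → Category hierarchy from above as one dimension row per bottom-level value; the example products are hypothetical.

```python
# Minimal sketch: the Product -> Type -> Category hierarchy, stored as one
# dimension row per bottom-level value (the example values are hypothetical).
product_dim = [
    {"product": "milk",  "type": "dairy",   "category": "food"},
    {"product": "cream", "type": "dairy",   "category": "food"},
    {"product": "soap",  "type": "hygiene", "category": "non-food"},
]

def members_at_level(rows, level):
    """Group bottom-level dimension values under a higher level's values."""
    groups = {}
    for row in rows:
        groups.setdefault(row[level], []).append(row["product"])
    return groups

print(members_at_level(product_dim, "type"))
# {'dairy': ['milk', 'cream'], 'hygiene': ['soap']}
```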
Facts
▪ Facts represent the subject of the desired analysis
• The important things in the business that should
be analyzed
▪ A fact is identified via its dimension values
• A fact is a non-empty cell
▪ Generally, a fact should:
• Be attached to exactly one dimension value in
each dimension
• Only be attached to dimension values in the
bottom levels
Measures
▪ Measures represent the fact property that the
users want to study and optimize
▪ Example: total sales price
▪ A measure has two components
▪ Numerical value (e.g., sales price)
▪ Aggregation formula (e.g., SUM): used for
aggregating/combining a number of measure values
into one
Multidimensional Model
Example: sales of supermarkets
• Facts and measures
– Each sales record is a fact, and its sales value is a
measure
• Dimensions
– Group correlated attributes into the same
dimension
– Each sales record is associated with its values of
the Product, Store, and Time dimensions
Granularity: Dimensionality Hierarchy
▪ Granularity of facts is important
▪ Level of detail
▪ Given by combination of bottom levels
▪ A dimensional hierarchy defines mappings from a set of
lower-level concepts to higher level concepts.
[Diagram: dimension hierarchies, e.g.,
Location: ZipCode → Area → City → Region → Country;
Time: Day → Week / Month → Quarter / Season → Year]
Schema Design
▪ A schema is a logical description of the entire
database.
▪ Much like a database, a data warehouse also
requires a schema to be maintained.
▪ A database uses the relational model, while a data
warehouse uses a Star, Snowflake, or Fact
Constellation schema.
Star schema
▪ A star schema consists of two types of tables:
• fact table
• dimension tables
▪ Each dimension in a star schema is represented
by only one dimension table.
▪ This dimension table contains the set of
attributes describing the dimension.
Star schema: Components
[Star schema diagram]
Sales Fact Table: time_key, item_key, branch_key, location_key;
measures: units_sold, dollars_sold, avg_sales
time dimension: time_key, day, day_of_the_week, month, quarter, year
item dimension: item_key, item_name, brand, type, supplier_type
branch dimension: branch_key, branch_name, branch_type
location dimension: location_key, street, city, state_or_province, country
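The star schema in the figure can be sketched as DDL, here executed via Python's sqlite3 module; the exact table names (time_dim, sales_fact, etc.) are illustrative.

```python
# Star schema from the figure above, sketched as SQLite DDL: one fact table
# whose foreign keys each point at a single, denormalized dimension table.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, day INTEGER,
        day_of_the_week TEXT, month INTEGER, quarter TEXT, year INTEGER);
    CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT,
        brand TEXT, type TEXT, supplier_type TEXT);
    CREATE TABLE branch_dim (branch_key INTEGER PRIMARY KEY,
        branch_name TEXT, branch_type TEXT);
    CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, street TEXT,
        city TEXT, state_or_province TEXT, country TEXT);

    CREATE TABLE sales_fact (
        time_key     INTEGER REFERENCES time_dim,
        item_key     INTEGER REFERENCES item_dim,
        branch_key   INTEGER REFERENCES branch_dim,
        location_key INTEGER REFERENCES location_dim,
        units_sold   INTEGER,   -- measures
        dollars_sold REAL,
        avg_sales    REAL
    );
""")
```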
Snowflake schema
▪ Snowflake schema is an expanded version of a
star schema in which dimension tables are
normalized into several related tables.
▪ Advantages
• Small saving in storage space
• Normalized structures are easier to update and
maintain
▪ Disadvantages
• A schema that is less intuitive
• The ability to browse through the content is difficult
• A degraded query performance because of additional
joins.
Snowflake schema : Example
[Snowflake schema diagram]
Sales Fact Table: time_key, item_key, branch_key, location_key;
measures: units_sold, dollars_sold, avg_sales
time dimension: time_key, day, day_of_the_week, month, quarter, year
item dimension: item_key, item_name, brand, type, supplier_key
supplier dimension: supplier_key, supplier_type
branch dimension: branch_key, branch_name, branch_type
location dimension: location_key, street, city_key
city dimension: city_key, city, province_or_state, country
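For contrast with the star schema, here is a sketch of only the snowflaked parts: the location dimension is normalized into a separate city table, and the item dimension references a supplier table (SQLite DDL; table names are illustrative).

```python
# Same schema, snowflaked: city-level attributes move into their own table
# referenced by a city_key, and supplier data into a supplier table.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE city_dim (city_key INTEGER PRIMARY KEY, city TEXT,
        province_or_state TEXT, country TEXT);
    CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, street TEXT,
        city_key INTEGER REFERENCES city_dim);   -- extra join at query time
    CREATE TABLE supplier_dim (supplier_key INTEGER PRIMARY KEY,
        supplier_type TEXT);
    CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT,
        brand TEXT, type TEXT,
        supplier_key INTEGER REFERENCES supplier_dim);
""")
```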
Fact Constellation Schema
▪ A fact constellation has multiple fact tables. It is
also known as galaxy schema.
▪ The following diagram shows two fact tables,
namely sales and shipping.
Fact Constellation Schema
[Fact constellation diagram: the Sales Fact Table and the
Shipping Fact Table share the time and item dimensions;
the Shipping Fact Table references time_key, item_key,
shipper_key, and from_location]
Multidimensional Model: Data Cubes
Data Cube
▪ Useful data analysis tool in DW
▪ Generalized GROUP BY queries
▪ Aggregate facts based on chosen dimensions
– Product, store, time dimensions
– Sales measures of Sales fact
Why data cube?
▪ Good for visualization (i.e., text results hard to
understand)
▪ MD, intuitive
▪ Support interactive OLAP operations
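A minimal sketch of "generalized GROUP BY": the fragment below computes one aggregate per subset of the chosen dimensions, which is what a CUBE operator produces; the fact rows are hypothetical.

```python
# Minimal sketch: a data cube as generalized GROUP BY queries, aggregating a
# sales measure over every subset of the chosen dimensions (pure Python).
from itertools import combinations

sales = [  # hypothetical fact rows: (product, store, quarter, amount)
    ("TV", "S1", "Q1", 100), ("TV", "S1", "Q2", 120), ("PC", "S2", "Q1", 80),
]
dims = ("product", "store", "quarter")

for r in range(len(dims) + 1):
    for group_by in combinations(range(len(dims)), r):  # one GROUP BY per subset
        totals = {}
        for row in sales:
            key = tuple(row[i] for i in group_by)       # omitted dims are rolled up
            totals[key] = totals.get(key, 0) + row[3]
        print([dims[i] for i in group_by], totals)
```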
Multidimensional Model: Data Cubes
▪ Sales volume as a function of product, month, and
region
▪ Dimensions: Product, Location, Time
[Diagram: hierarchical summarization paths along each
dimension, e.g., Office → City → Region → Country for
Location and Day → Month → Quarter → Year for Time]
A Sample of a Data Cube
[Diagram: a 3-D data cube of sales with dimensions Date
(1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), product (TV, PC, VCR, sum),
and Country (U.S.A., Canada, Mexico, sum); one highlighted
cell holds the total annual sales of TV in the U.S.A.]
On-Line Analytical Processing (OLAP)
▪ Original definition : The dynamic synthesis,
analysis, and consolidation of large volumes of
multi-dimensional data, [Codd, 1993].
On-Line Analytical Processing (OLAP)
▪ The analytical operations that can be performed
on data cubes include:
– Roll-up
– Drill-down
– Slice and Dice
– Pivot/rotate
– Switch
– Split
– Nest
– Select
– Projection
On-Line Analytical Processing (OLAP)
▪ Roll-up performs aggregation on a data cube in
any of the following ways:
– By climbing up a concept hierarchy for a
dimension
– By dimension reduction
The following diagram illustrates how roll-up works.
On-Line Analytical Processing (OLAP)
Roll-up
On-Line Analytical Processing (OLAP)
▪ Roll-up is performed by climbing up a
concept hierarchy for the dimension
location.
▪ Initially the concept hierarchy was:
"street < city < province < country".
▪ On rolling up, the data is aggregated by
ascending the location hierarchy from the
level of city to the level of country.
▪ The data is then grouped by country rather
than by city.
▪ When roll-up is performed by dimension
reduction, one or more dimensions are
removed from the data cube.
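A minimal sketch of this roll-up in Python: city-level sales are re-aggregated at the country level by climbing the location hierarchy (the figures and the city-to-country mapping are hypothetical).

```python
# Minimal sketch of roll-up: re-aggregating city-level sales at the country
# level by climbing the location hierarchy (data and mapping hypothetical).
city_sales = {"Toronto": 605, "Vancouver": 1087, "Tunis": 400}
city_to_country = {"Toronto": "Canada", "Vancouver": "Canada", "Tunis": "Tunisia"}

country_sales = {}
for city, amount in city_sales.items():
    country = city_to_country[city]                 # climb city -> country
    country_sales[country] = country_sales.get(country, 0) + amount

print(country_sales)   # {'Canada': 1692, 'Tunisia': 400}
```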
On-Line Analytical Processing (OLAP)
▪ Drill-down is the reverse of roll-up and involves
revealing the detailed data that forms the
aggregated data. Drill-down can be performed
by moving down the dimensional hierarchy or
by dimensional introduction, e.g., from 3-D
sales data to 4-D sales data.
On-Line Analytical Processing (OLAP)
Drill-down
On-Line Analytical Processing (OLAP)
▪ Drill-down is performed by stepping down a concept
hierarchy for the dimension time.
On-Line Analytical Processing (OLAP)
▪ Slice - ability to look at data from different
viewpoints. The slice operation performs a
selection on one dimension of the data, whereas
dice uses two or more dimensions. For example, a
slice of sales revenue (type = 'Flat') and a dice
(type = 'Flat' and time = 'Q1').
On-Line Analytical Processing (OLAP)
Slice
On-Line Analytical Processing (OLAP)
Dice
On-Line Analytical Processing (OLAP)
The dice operation on the cube based on the
following selection criteria involves three
dimensions:
▪ (location = "Toronto" or "Vancouver")
▪ (time = "Q1" or "Q2")
▪ (item =" Mobile" or "Modem")
On-Line Analytical Processing (OLAP)
▪ Pivot - ability to rotate the data to provide an
alternative view of the same data, e.g., sales
revenue data displayed using the location (city)
as the x-axis against time (quarter) as the y-axis
can be rotated so that time (quarter) is the x-axis
against location (city) as the y-axis.
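A minimal sketch of pivoting in Python: the same totals are re-indexed so the quarter axis and the city axis trade places (the numbers are hypothetical).

```python
# Minimal sketch of pivot/rotate: the same city-by-quarter totals viewed with
# the axes swapped; no data changes, only the presentation.
by_city = {"Toronto": {"Q1": 605, "Q2": 512},
           "Vancouver": {"Q1": 968, "Q2": 746}}

by_quarter = {}
for city, quarters in by_city.items():
    for quarter, value in quarters.items():
        by_quarter.setdefault(quarter, {})[city] = value   # swap row/column axes

print(by_quarter)   # {'Q1': {'Toronto': 605, 'Vancouver': 968}, ...}
```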
On-Line Analytical Processing (OLAP)
Pivot
On-Line Analytical Processing (OLAP)
Switch
On-Line Analytical Processing (OLAP)
Split
On-Line Analytical Processing (OLAP)
Nest
On-Line Analytical Processing (OLAP)
Selection
On-Line Analytical Processing (OLAP)
Projection
On-Line Analytical Processing (OLAP)
▪ OLAP is the use of a set of graphical tools that provide
users with MD views of their data and allow them to
analyze the data using simple windowing techniques.
▪ Relational OLAP (ROLAP)
▪ OLAP tools that view the database as a traditional
relational database, in either a star schema or another
normalized or denormalized set of tables
▪ Multidimensional OLAP (MOLAP)
▪ OLAP tools that load data into an intermediate
structure, usually a three or higher dimensional array.
(Cube structure)
▪ Hybrid OLAP (HOLAP)
▪ Combination of ROLAP and MOLAP tools
The Complete Decision Support
System
[Diagram: Operational DB's → extract, transform, load,
refresh → Data Warehouse and Data Marts → serve →
Query/Reporting, OLAP (e.g., ROLAP), and Data Mining]