0% found this document useful (0 votes)
2 views

Chapter1 Data Warehousing Intro

The document provides an overview of Data Warehousing (DWH) fundamentals, including its need, benefits, architecture, and modeling techniques such as Star and Snowflake schemas. It discusses the differences between Online Transaction Processing (OLTP) and Online Analytical Processing (OLAP), emphasizing the importance of integrated, accurate, and timely data for decision-making. Key components of DWH architecture, including data staging, ETL processes, and metadata roles, are also outlined to illustrate how data is managed and utilized for analytical purposes.

Uploaded by

Ashna Brito
Copyright
© © All Rights Reserved
Available Formats
Download as PPT, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
2 views

Chapter1 Data Warehousing Intro

The document provides an overview of Data Warehousing (DWH) fundamentals, including its need, benefits, architecture, and modeling techniques such as Star and Snowflake schemas. It discusses the differences between Online Transaction Processing (OLTP) and Online Analytical Processing (OLAP), emphasizing the importance of integrated, accurate, and timely data for decision-making. Key components of DWH architecture, including data staging, ETL processes, and metadata roles, are also outlined to illustrate how data is managed and utilized for analytical purposes.

Uploaded by

Ashna Brito
Copyright
© © All Rights Reserved
Available Formats
Download as PPT, PDF, TXT or read online on Scribd
You are on page 1/ 48

Module I

Data Warehouse (DWH)


Fundamentals

By : Aruna Khubalkar
Outline

Need for Data Warehousing

Benefits and features of Data Warehouse

Data warehouse Characteristics

Data warehouse Architecture

Dimensional Modeling : Star & Snowflake

OLAP operations

OLTP v/s OLAP

By : Ms. Aruna Khubalkar 2


Operational Sources (OLTP’s)
 Operational computer systems did provide information to run
day-to-day operations, and answer’s daily questions, but…
 Also called Online Transactional Processing System (OLTP)
 Data is read or manipulated with each transaction
 Transactions/queries are simple, and easy to write
 Usually for middle management
 Examples
 Sales systems

 Hotel reservation systems

 HRM Applications

 Etc.

By : Ms. Aruna Khubalkar 3


Typical decision queries
 Data set are mounting everywhere, but not useful for
decision support
 Decision-making require complex questions from
integrated data.
 Enterprise wide data is desired
 Decision makers want to know:
 Where to build new oil warehouse?
 Which market they should strengthen?
 Which customer groups are most profitable?
 How much is the total sale by month/ year/ quarter for each
offices?
 Is there any relation between promotion campaigns and sales
growth?

Can OLTP answer all such questions,  efficiently?


By : Ms. Aruna Khubalkar 4
*
Information crisis
 Integrated
 Must have a single, enterprise-wide view
 Data Integrity
 Information must be accurate and must conform to

business rules
 Accessible
 Easily accessible with intuitive access paths and

responsive for analysis


 Credible
 Every business factor must have one and only one

value
 Timely
 Information must be available within the stipulated

time frame
* Paulraj 2001.
By : Ms. Aruna Khubalkar 5
Expectations of new soln.
 DB designed for analytical tasks
 Data from multiple applications
 Easy to use
 Ability of what-if analysis
 Read-intensive data usage
 Direct interaction with system, without IT assistance
 Periodical updating contents & stable
 Current & historical data
 Ability for users to initiate reports

By : Ms. Aruna Khubalkar 7


DW meets expectations

 Provides enterprise view


 Current & historical data available
 Decision-transaction possible without affecting
operational source
 Reliable source of information
 Ability for users to initiate reports
 Acts as a data source for all analytical applications

By : Ms. Aruna Khubalkar 8


Definition of DW
Bill Inmon (Father of Data Warehousing) defined
“A DW is a subject-oriented, integrated, non-volatile,
time-variant collection of data in favor of decision-making”

Kelly said
“Separate available, integrated, time-stamped, subject-
oriented, non-volatile, accessible”

Four properties/characteristics of DW
subject-oriented, integrated,
non-volatile, time-variant
By : Ms. Aruna Khubalkar 9
Subject-oriented
 In operational sources data is organized by
applications, or business processes.
 In DW subject is the organization method
 Subjects vary with enterprise
 These are critical factors, that affect performance
 Example of Manufacturing Company
 Sales
 Shipment
 Inventory etc

By : Ms. Aruna Khubalkar 10


Integrated Data
 Data comes from several applications
 Problems of integration comes into play
 File layout, encoding, field names, systems, schema,
data heterogeneity are the issues
 Bank example, variance: naming convention,
attributes for data item, account no, account type,
size, currency
 In addition to internal, external data sources
 External companies data sharing
 Websites
 Others
 Removal of inconsistency
 So process of extraction, transformation & loading
By : Ms. Aruna Khubalkar 11
Time variant
 Operational data has current values
 Comparative analysis is one of the best techniques for
business performance evaluation
 Time is critical factor for comparative analysis
 Every data structure in DW contains time element
 In order to promote product in certain, analyst has to
know about current and historical values
 The advantages are
 Allows for analysis of the past
 Relates information to the present
 Enables forecasts for the future

By : Ms. Aruna Khubalkar 12


Non-volatile
 Data from operational systems are moved into DW
after specific intervals
 Data is persistent/ not removed i.e. non volatile
 Every business transaction don’t update in DW
 Data from DW is not deleted
 Data is neither changed by individual transactions
 Properties summary

Subject Oriented Time-Variant Non-Volatile

Organized along the lines of Every record in the data Refers to the inability of data
the subjects of the to be updated. Every record
warehouse has some
corporation. Typical subjects in the data warehouse is time
are customer, product, form of time variancy stamped in one form or
vendor and transaction. attached to it. another.
By : Ms. Aruna Khubalkar 13
Generic two-level architecture

L
One,
T company-
wide
warehouse
E

Periodic extraction  data is not completely current in warehouse


By : Ms. Aruna Khubalkar 15
DW Architecture

By : Ms. Aruna Khubalkar 17


Components of DW

Source Data Component

Data Staging Component

Data Storage Component

Information Delivery Component

Metadata Component

Management and Control Component

By : Ms. Aruna Khubalkar 18


Source Data Component
Source Data coming into the data warehouse
may be grouped into 4 broad categories :
1. Production data
2. Internal data
3. Archived data
4. External data

By : Ms. Aruna Khubalkar 19


Data Staging Component

Data staging provides a place and an area with
a set of functions to clean, change, combine,
convert, deduplicate and prepare source data for
storage and use in the data warehouse.

3 major functions of Extraction, Transformation
and preparation for Loading takes place in
staging area.

By : Ms. Aruna Khubalkar 20


The ETL Process
Capture
Scrub or data cleansing
Transform
Load and Index

ETL = Extract, transform, and load

By : Ms. Aruna Khubalkar 21


Data Transformation
Data transformation is the component of
data reconcilation that converts data from
the format of the source operational systems
to the format of enterprise data warehouse.
Data transformation consists of a variety of
different functions:
– record-level functions,
– field-level functions and
– more complex transformation.

By : Ms. Aruna Khubalkar 22


Role of Metadata (data catalog)

Identify subjects of the data mart


Identify dimensions and facts
Indicate how data is derived from enterprise data
warehouses, including derivation rules
Indicate how data is derived from operational
data store, including derivation rules
Identify available reports and predefined queries
Identify data analysis techniques (e.g. drill-down)
Identify responsible people

By : Ms. Aruna Khubalkar 23


Derived Data
 Objectives
– Ease of use for decision support applications
– Fast response to predefined user queries
– Customized data for particular target audiences
– Ad-hoc query support
– Data mining capabilities
 Characteristics
– Detailed (mostly periodic) data
– Aggregate (for summary)
– Distributed (to departmental servers)

Most common data model = star schema


(also called “dimensional model”)
By : Ms. Aruna Khubalkar 24

Data Warehouse Modeling :

Data Cube and OLAP
Design Decisions

Choosing the Process. Selecting the subjects from the
information packages for the first set of logical structures
to be designed.

Choosing the Grain. Determining the level of detail for
the data in the data structures.

Identifying and Conforming the Dimensions.
Choosing the business dimensions (such as product,
market, time, etc.)

Choosing the Facts. Selecting the metrics or units of
measurements (such as product sale units, dollar sales,
dollar revenue, etc.)

Choosing the Duration of the Database. Determining
how far back in time you should go for historical data. 26
By : Ms. Aruna Khubalkar
Dimension Table


Dimensions are perspectives or entities with
respect to which an organization wants to keep
records.
Time, item, branch, location

Each dimension have a table associated with it,
Dimension table, which further describes the
dimension.
Dimension tables, such as
item (item_name, brand, type), or
time(day, week, month, quarter, year)

By : Ms. Aruna Khubalkar 27


Fact Table


A multidimensional data model is typically
organized around a central theme, like sales,
represented by Fact Table

Fact table contains measures (such as
dollars_sold, units_sold) and keys to each
of the related dimension tables

By : Ms. Aruna Khubalkar 28


Example of Star Schema
time
time_key item
day item_key
day_of_the_week Sales Fact Table item_name
month brand
quarter time_key type
year supplier_type
item_key
branch_key
branch location
location_key
branch_key location_key
branch_name units_sold street
branch_type city
dollars_sold state_or_province
country
avg_sales
Measures
Example of Snowflake Schema
time
item
time_key
day item_key supplier
day_of_the_week Sales Fact Table item_name supplier_key
month brand supplier_type
quarter time_key type
year item_key supplier_key

branch_key
branch location
location_key
location_key
branch_key
units_sold street
branch_name
city_key
branch_type
dollars_sold city
city_key
avg_sales city
state_or_province
Measures country
Example of Fact Constellation
time
time_key item Shipping Fact Table
day item_key
day_of_the_week Sales Fact Table item_name time_key
month brand
quarter time_key type item_key
year supplier_type shipper_key
item_key
branch_key from_location

branch location_key location to_location


branch_key location_key dollars_cost
branch_name units_sold
street
branch_type dollars_sold city units_shipped
province_or_state
avg_sales country shipper
Measures shipper_key
shipper_name
location_key
shipper_type
Conceptual Modeling of Data Warehouses

Modeling data warehouses: dimensions & measures


• Star schema: A fact table in the middle connected to a
set of dimension tables
• Snowflake schema: A refinement of star schema where
some dimensional hierarchy is normalized into a set of
smaller dimension tables, forming a shape similar to
snowflake
• Fact constellations: Multiple fact tables share dimension
tables, viewed as a collection of stars, therefore called
galaxy schema or fact constellation
Defining Star Schema in DMQL
define cube sales_star [time, item, branch,
location]:
l
dollars_sold = sum(sales_in_dollars),
avg_sales = avg(sales_in_dollars), units_sold
= count(*)
define dimension time as (time_key, day,
day_of_week, month, quarter, year)
define dimension item as (item_key,
item_name, brand, type, supplier_type)
define dimension branch as (branch_key,
branch_name, branch_type)
define dimension location as (location_key,
street, city, province_or_state, country)
Defining Snowflake Schema in
DMQL
define cube sales_snowflake [time, item, branch,
location]:
l
dollars_sold = sum(sales_in_dollars), avg_sales =
avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day,
day_of_week, month, quarter, year)
define dimension item as (item_key, item_name,
brand, type, supplier(supplier_key, supplier_type))
define dimension branch as (branch_key,
branch_name, branch_type)
define dimension location as (location_key, street,
city(city_key, province_or_state, country))
Defining Fact Constellation in DMQL
define cube sales [time, item, branch, location]:
l
dollars_sold = sum(sales_in_dollars), avg_sales =
avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month,
quarter, year)
define dimension item as (item_key, item_name, brand, type,
supplier_type)
define dimension branch as (branch_key, branch_name,
branch_type)
define dimension location as (location_key, street, city,
province_or_state, country)
define cube shipping [time, item, shipper, from_location,
to_location]:
l
dollar_cost = sum(cost_in_dollars), unit_shipped = count(*)
define dimension time as time in cube sales
define dimension item as item in cube sales
define dimension shipper as (shipper_key, shipper_name, location
as location in cube sales, shipper_type)
define dimension from_location as location in cube sales
define dimension to_location as location in cube sales
A Concept Hierarchy: Dimension
(location)
all

region Europe ... North_America

Germany ... Spain Canada ... Mexico

city Frankfurt ... Vancouver ... Toronto

office L. Chan ... M. Wind


From Tables and Spreadsheets to
Data Cubes

A data warehouse is based on a
multidimensional data model which views
data in the form of a data cube

A data cube, such as sales, allows data to
be modeled and viewed in multiple
dimensions
o Defined by dimensions and facts
Multidimensional Data

Sales volume as a function of product,
month, and region

Region
Sales data warehouse
Dimensions: Product, Location, Time

Product

Month

In data warehousing literature, a data
cube (such as above) is referred as
cuboid.

Given a set of dimensions, we can
generate a cuboid for each of the
possible subsets of given dimensions.

The result is lattice of cuboids, each
showing the data at different level of
summarization.

The lattice of cuboids is referred as a
Data Cube
A Sample Data Cube
Date
Total annual sales
2Qtr of TVs in U.S.A.
1Qtr 3Qtr 4Qtr sum
TV
Product PC U.S.A
VCR
sum
Canada
Country
Mexico
sum

All, All, All


Cuboids Corresponding to
the Cube
all
0-D(apex) cuboid

product date country


1-D cuboids

product,date product,country date, country


2-D cuboids

product, date, country 3-D(base) cuboid



An n-D base cube is called a base
cuboid.

The top most 0-D cuboid, which
holds the highest-level of
summarization, is called the apex
cuboid.

The lattice of cuboids forms a data
cube.
Cube: A Lattice of Cuboids
all
0-D (apex) cuboid

time item location supplier


1-D cuboids

time,location item,location location,supplier


time,item 2-D cuboids
time,supplier item,supplier

time,location,supplier
3-D cuboids
time,item,location
time,item,supplier item,location,supplier

4-D (base) cuboid


time, item, location, supplier
On-Line Analytical Processing (OLAP)
 OLAP is the use of a set of graphical tools that
provides users with multidimensional views of their
data and allows them to analyze the data using
simple windowing techniques
 Relational OLAP (ROLAP)
– OLAP tools that view the database as a traditional
relational database in either a star schema or other
normalized or denormalized set of tables.
 Multidimensional OLAP (MOLAP)
– OLAP tools that load data into an intermediate
structure, usually a three or higher dimensional
array. (Cube structure)

By : Ms. Aruna Khubalkar 45


Typical OLAP Operations
 Roll up (drill-up): summarize data
◦ by climbing up hierarchy or by dimension
reduction
 Drill down (roll down): reverse of roll-up
◦ from higher level summary to lower level
summary or detailed data, or introducing new
dimensions
 Slice and Dice: project and select

 Pivot (rotate):
◦ reorient the cube, visualization, 3D to series of
2D planes

47
Fig. 3.10 Typical
OLAP Operations

48
Figure 11-22: Slicing a data cube

By : Ms. Aruna Khubalkar 50


Summary report

Example of drill-down

Drill-down with
color added

By : Ms. Aruna Khubalkar 51


Data Warehouse vs. Operational
DBMS
OLTP (on-line transaction processing)
◦ Major task of traditional relational DBMS
◦ Day-to-day operations: purchasing,
inventory, banking, manufacturing,
payroll, registration, accounting, etc.
OLAP (on-line analytical processing)
◦ Major task of data warehouse system
◦ Data analysis and decision making

Data Mining: Concepts and


17/01/23 Techniques 52
OLTP vs. OLAP Systems
Major Distinguishing features
◦ User and system orientation:
customer vs. market
◦ Data contents: current, detailed vs.
historical, consolidated
◦ Database design: application-oriented
+ ER vs. subject-oriented +
star/snowflake model
◦ View: current, local vs. evolutionary,
integrated
◦ Access patterns: short, atomic
transaction vs. read-only but complex
OLTP vs. OLAP Systems
Feature OLTP OLAP
Orientation transaction analysis
Function day to day operations decision support
Users clerk, DBA, IT professional knowledge worker(manager,analyst)
Data current, up-to-date historical,
detailed, flat relational summarized, multidimensional
isolated integrated, consolidated
DB design application-oriented subject-oriented
Operations index/hash on primary key lots of scans
Unit of work short, simple transaction complex query
# records accessed tens millions
# users thousands hundreds
DB size 100MB-GB 100GB-TB
Metric transaction throughput query throughput, response
Priority High performance, high High flexibility
Data Mining: Concepts and
availability 17/01/23 Techniques 54

You might also like