Introduction to Data Warehouse
Data Warehouse
One management and analytics platform
for product configuration, warranty, and
diagnostic readout data
[Figure: Data warehouse framework. Data sources (ERP, legacy, POS, other OLTP/Web, external data) feed an ETL process (select, extract, transform, integrate, load) into the enterprise data warehouse and its metadata. Data marts (marketing, operations, finance, ...) are optional and may be filled by replication ("no data marts option"). API/middleware connects the warehouse to applications (visualization): routine business reporting, data/text mining, OLAP, dashboard, Web, and custom-built applications.]
ETL Process in DW
• Data Sources: multiple independent operational legacy
systems, external data providers, OLTP, ERP, Web logs
• Extraction: extracted data is stored in a staging area rather than
loaded into the DW directly, because it arrives in various formats
and may be corrupted.
• Transformation: a set of rules or functions is applied to the
extracted data to convert it into a single standard format. This may
involve the following tasks: filtering, cleaning, joining, splitting,
and sorting.
• Loading: the transformed data is loaded into the DW. The rate and
period of loading depend on the requirements and
vary from system to system.
• Commonly used ETL tools include Hevo and products from Sybase and Oracle.
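The extract, transform, and load steps can be sketched in a few lines. This is a minimal illustration using an in-memory SQLite database, not a description of any particular ETL tool; the table names (`staging`, `dw_sales`), the sample records, and the cleaning rules are all assumptions made for the example.

```python
import sqlite3

# Hypothetical raw records from three sources, in inconsistent formats.
raw_rows = [
    {"source": "pos", "customer": " alice ", "amount": "19.90", "date": "2024-01-05"},
    {"source": "erp", "customer": "BOB", "amount": "5", "date": "2024-01-05"},
    {"source": "web", "customer": "carol", "amount": "n/a", "date": "2024-01-06"},  # corrupted
]

def extract(rows, conn):
    """Land raw data unchanged in a staging table, not in the warehouse."""
    conn.execute(
        "CREATE TABLE staging (source TEXT, customer TEXT, amount TEXT, date TEXT)"
    )
    conn.executemany(
        "INSERT INTO staging VALUES (:source, :customer, :amount, :date)", rows
    )

def transform(conn):
    """Filter out corrupted rows and convert the rest to one standard format."""
    clean = []
    for _source, customer, amount, date in conn.execute("SELECT * FROM staging"):
        try:
            value = float(amount)
        except ValueError:
            continue  # filtering: skip rows whose amount cannot be parsed
        clean.append((customer.strip().title(), round(value, 2), date))  # cleaning
    return sorted(clean, key=lambda r: (r[2], r[0]))  # sorting by date, then name

def load(rows, conn):
    """Write the standardized rows into the (hypothetical) warehouse table."""
    conn.execute("CREATE TABLE dw_sales (customer TEXT, amount REAL, date TEXT)")
    conn.executemany("INSERT INTO dw_sales VALUES (?, ?, ?)", rows)

conn = sqlite3.connect(":memory:")
extract(raw_rows, conn)
load(transform(conn), conn)
loaded = conn.execute(
    "SELECT customer, amount, date FROM dw_sales ORDER BY date, customer"
).fetchall()
print(loaded)
```

The corrupted "web" record stays in staging and never reaches `dw_sales`, which is the reason for landing data in a staging area first.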
DW Architecture
• DW architectures are called client/server or n-tier architectures
• Three-tier architecture
1. Data acquisition software (back-end)
2. The data warehouse that contains the data & software
3. Client (front-end) software that allows users to access and
analyze data from the warehouse (DSS/BI/BA engine)
• Two-tier architecture
– The first two tiers of the three-tier architecture are combined into one
… sometimes there is only one tier?
DW Architectures
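[Figure: Web-based DW architecture. A client (Web browser) connects over the Internet/intranet/extranet to a Web server (Web pages) and an application server, which access the data warehouse.]
[Figure: Data flows among a packaged application, a transient data source, other internal applications, the data warehouse, and data marts.]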
Data Migration
• Data sources may consist of
– Files from OLTP databases, spreadsheets, personal
databases (e.g., MS Access), or external files
• Staging tables: all input is stored here first to facilitate loading
• DW contains various rules:
– How data will be used
– Summarization rules
– Standardization of encoded attributes
– Calculation rules
• Metadata: these rules are held as metadata and applied to the DW centrally
• Quality issues in input files must be corrected before the data is loaded
• Two ways of loading: a data transformation tool or custom programming
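The idea of rules kept centrally as metadata and applied to staged input can be sketched as follows. This takes the "custom programming" route; the `METADATA` structure, the column names, and the `apply_rules` helper are hypothetical and chosen only for illustration.

```python
# Hypothetical central metadata: each rule is defined once and applied to every input row.
METADATA = {
    # Standardization of an encoded attribute: many source encodings, one DW code.
    "gender": {"standardize": {"m": "M", "male": "M", "f": "F", "female": "F"}},
    # Calculation rule: a derived column computed from other columns.
    "total": {"calculate": lambda row: row["quantity"] * row["unit_price"]},
}

def apply_rules(row, metadata=METADATA):
    """Return a copy of a staged row with all metadata rules applied."""
    out = dict(row)
    for column, rule in metadata.items():
        if "standardize" in rule and column in out:
            key = str(out[column]).strip().lower()
            out[column] = rule["standardize"].get(key, "UNKNOWN")
        if "calculate" in rule:
            out[column] = rule["calculate"](out)
    return out

# Rows as they might sit in a staging table (assumed sample data).
staging = [
    {"gender": "Male", "quantity": 2, "unit_price": 4.5},
    {"gender": "f", "quantity": 1, "unit_price": 10.0},
    {"gender": "?", "quantity": 3, "unit_price": 1.0},
]

warehouse_rows = [apply_rules(r) for r in staging]

# Summarization rule: one aggregate over the loaded rows.
total_revenue = sum(r["total"] for r in warehouse_rows)
print(warehouse_rows, total_revenue)
```

Because the rules live in one place, changing an encoding or a formula means editing the metadata rather than every loading script.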
Data Transformation
• Selecting a data transformation tool
– Data transformation tools are expensive
– Data transformation tools may have a long learning curve
• A data transformation tool should simplify the maintenance of an
organization's data warehouse
• Classification of ETL technologies
– Sophisticated, Enabler, Simple, Rudimentary
• Criteria for selecting an ETL tool
– Ability to handle unlimited data sources
– Automatic capturing and delivery of metadata
– Conforming to open standards
– Easy-to-use interface for the developer and user
Direct Benefits of Data Warehouse
• End users can perform extensive analysis in numerous ways
• A consolidated view of corporate data (a single version of truth)
is possible
• Better and more timely information is possible.
– DW lets information processing be offloaded from costly
operational systems onto low-cost servers, so more end-user
requests can be processed more quickly
• Enhanced system performance can result.
– DW frees production processing because some operational
system reporting requirements are moved to DSS
• Data access is simplified
Indirect Benefits of Data Warehouse
• Enhance business knowledge
• Present a competitive advantage
• Improve customer service and satisfaction
• Facilitate decision making
• Help in reforming business processes
• Taken together, these are the strongest contributions to competitive advantage
Cost-Benefit Analysis of Data Warehouse
• Given the potential benefits and the substantial investments in
time and money that a DW development project requires, it is
critical that an organization structure its DW project to maximize
the chances of success.
• Benefits: consider the money saved by the following groups
– Keepers (improving traditional decision support functions)
– Gatherers (automated collection/dissemination of information)
– Users (decisions made using data warehouse)
• Costs: hardware, software, network bandwidth, internal
development, internal support, training and external consulting
• Net present value should be calculated over the expected life of the DW
• It is important to involve users in the DW development process, as
this is one of the critical success factors.
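The net present value calculation can be illustrated with a short function. The cash flows and the 10% discount rate below are invented figures for the sake of the example, not values from any real DW project.

```python
def net_present_value(rate, cash_flows):
    """Discount a series of yearly cash flows back to year 0.

    cash_flows[0] is the up-front flow (usually negative: hardware, software,
    development); later entries are net benefits minus running costs per year.
    """
    return sum(cf / (1 + rate) ** year for year, cf in enumerate(cash_flows))

# Assumed example: 500,000 initial cost, 200,000 net benefit in each of 4 years.
flows = [-500_000, 200_000, 200_000, 200_000, 200_000]
npv = net_present_value(0.10, flows)
print(round(npv, 2))
```

A positive result suggests the discounted benefits over the expected life of the DW outweigh the costs; a negative one suggests they do not.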
Data Warehouse Development Approaches
• Inmon Model: EDW approach (top-down)
– Bill Inmon is often called the father of data warehousing
– Adapts traditional relational database tools to the
development needs of an enterprise-wide DW
• Kimball Model: DM approach (bottom-up)
– Employs dimensional modeling, which starts with tables
– Plan big, build small approach
– Subject-oriented or department-oriented DW
• Another alternative is a hosted data warehouse
DM Development – EDW vs DM – 1/2
• Data prerequisite for sharing
– DM: common (within business area)
– EDW: common (across enterprise)
• Benefits of a hosted data warehouse:
– Requires minimal investment in infrastructure
– Frees up capacity on in-house systems
– Frees up cash flow
– Makes powerful solutions affordable
– Enables solutions that provide for growth
– Offers better quality equipment and software
– Provides faster connections
Representation of Data in DW
• Dimensional Modeling
– A retrieval-based system that supports high-volume query
access
• Star schema
– The most commonly used and the simplest style of
dimensional modeling
– Contains a fact table surrounded by and connected to
several dimension tables
• Snowflake schema
– An extension of the star schema in which dimension tables are
normalized into further tables, so the diagram resembles
a snowflake in shape
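A star schema can be sketched with one fact table and three dimension tables. The example below uses an in-memory SQLite database; the table and column names (`fact_sales`, `dim_product`, and so on) and the sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: descriptive attributes, one row per member.
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_region  (region_id  INTEGER PRIMARY KEY, city TEXT, country TEXT);
CREATE TABLE dim_time    (time_id    INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);

-- Fact table: numeric measures plus one foreign key per dimension.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    region_id  INTEGER REFERENCES dim_region(region_id),
    time_id    INTEGER REFERENCES dim_time(time_id),
    units_sold INTEGER
);

INSERT INTO dim_product VALUES (1, 'Laptop', 'Computers'), (2, 'Phone', 'Mobile');
INSERT INTO dim_region  VALUES (1, 'Toronto', 'Canada'), (2, 'Chicago', 'USA');
INSERT INTO dim_time    VALUES (1, 'Q1', 2024), (2, 'Q2', 2024);
INSERT INTO fact_sales  VALUES (1, 1, 1, 10), (1, 2, 1, 7), (2, 1, 2, 4), (2, 2, 2, 9);
""")

# A typical query joins the fact table to the dimensions it needs and aggregates.
query = """
SELECT p.category, r.country, SUM(f.units_sold)
FROM fact_sales f
JOIN dim_product p ON p.product_id = f.product_id
JOIN dim_region  r ON r.region_id  = f.region_id
GROUP BY p.category, r.country
ORDER BY p.category, r.country
"""
totals = conn.execute(query).fetchall()
print(totals)
```

In a snowflake variant of this sketch, `country` would move out of `dim_region` into its own table referenced by a key.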
Multidimensionality
[Figure: A 3-dimensional OLAP cube (Time, Product, Geography) with slicing operations. One slice shows sales volumes of a specific product on variable time and region; another shows sales volumes of a specific time on variable region and products.]
OLAP Operations
• Roll-up
• Drill-down
• Slice
• Dice
• Pivot (rotate)
Roll-up
• Roll-up performs aggregation
on a data cube in any of the
following ways:
– By climbing up a concept
hierarchy for a dimension
– By dimension reduction
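Both forms of roll-up can be sketched on a tiny cube stored as a Python dictionary. The cube contents, the city-to-country hierarchy, and the function names are hypothetical; a real OLAP engine would do this internally.

```python
from collections import defaultdict

# Hypothetical cube cells: (city, quarter, product) -> units sold.
cube = {
    ("Toronto", "Q1", "Laptop"): 10,
    ("Toronto", "Q1", "Phone"): 3,
    ("Vancouver", "Q1", "Laptop"): 5,
    ("Chicago", "Q1", "Laptop"): 7,
    ("Chicago", "Q2", "Phone"): 9,
}

# Assumed concept hierarchy for the location dimension: city -> country.
CITY_TO_COUNTRY = {"Toronto": "Canada", "Vancouver": "Canada", "Chicago": "USA"}

def roll_up_hierarchy(cells):
    """Climb the location hierarchy from city to country, summing the measure."""
    out = defaultdict(int)
    for (city, quarter, product), units in cells.items():
        out[(CITY_TO_COUNTRY[city], quarter, product)] += units
    return dict(out)

def roll_up_drop_dimension(cells, axis):
    """Dimension reduction: aggregate away the dimension at position `axis`."""
    out = defaultdict(int)
    for key, units in cells.items():
        out[key[:axis] + key[axis + 1:]] += units
    return dict(out)

by_country = roll_up_hierarchy(cube)
without_product = roll_up_drop_dimension(cube, axis=2)
print(by_country)
print(without_product)
```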
Drill-down
• Drill-down is the reverse
operation of roll-up. It is
performed in either of the
following ways:
– By stepping down a
concept hierarchy for a
dimension
– By introducing a new
dimension.
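Drill-down needs the detailed data to still be available, so this sketch keeps base facts at the finest grain and builds each view from them. The facts, the month-to-quarter mapping, and the `view` helper are assumptions for illustration only.

```python
from collections import defaultdict

# Hypothetical base facts at the finest grain: (month, city) -> units sold.
facts = {
    ("Jan", "Toronto"): 3, ("Feb", "Toronto"): 4, ("Mar", "Toronto"): 3,
    ("Jan", "Chicago"): 2, ("Feb", "Chicago"): 1, ("Mar", "Chicago"): 4,
}
MONTH_TO_QUARTER = {"Jan": "Q1", "Feb": "Q1", "Mar": "Q1"}

def view(level, by_city=False):
    """Aggregate the facts at a time level ('quarter' or 'month').

    by_city=True keeps the city dimension in the result instead of summing over it.
    """
    out = defaultdict(int)
    for (month, city), units in facts.items():
        period = MONTH_TO_QUARTER[month] if level == "quarter" else month
        key = (period, city) if by_city else (period,)
        out[key] += units
    return dict(out)

summary = view("quarter")                        # starting point: one total per quarter
by_month = view("month")                         # step down the hierarchy: quarter -> month
by_quarter_city = view("quarter", by_city=True)  # introduce a new dimension: city
print(summary, by_month, by_quarter_city)
```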
Slice
• The slice operation selects a
single value on one dimension
of a given cube and provides
a new sub-cube with one
fewer dimension.
[Diagram: slice operation]
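A slice can be sketched as a filter on one dimension followed by dropping that dimension from the cell keys. The cube contents and the `slice_cube` helper are hypothetical.

```python
# Hypothetical cube cells: (quarter, city, product) -> units sold.
cube = {
    ("Q1", "Toronto", "Laptop"): 10,
    ("Q1", "Chicago", "Phone"): 7,
    ("Q2", "Toronto", "Laptop"): 4,
    ("Q2", "Chicago", "Laptop"): 9,
}

def slice_cube(cells, axis, value):
    """Keep cells where the dimension at `axis` equals `value`, then drop that dimension."""
    return {
        key[:axis] + key[axis + 1:]: units
        for key, units in cells.items()
        if key[axis] == value
    }

q1 = slice_cube(cube, axis=0, value="Q1")  # 2-D sub-cube: (city, product) for Q1 only
print(q1)
```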
Dice
• Dice selects values on two
or more dimensions of a
given cube and provides a
new sub-cube.
[Diagram: dice operation]
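A dice can be sketched as a filter with conditions on several dimensions at once; unlike a slice, all dimensions stay in the result. The cube contents and the `dice` helper are hypothetical.

```python
# Hypothetical cube cells: (quarter, city, product) -> units sold.
cube = {
    ("Q1", "Toronto", "Laptop"): 10,
    ("Q1", "Chicago", "Laptop"): 7,
    ("Q2", "Toronto", "Phone"): 4,
    ("Q3", "Toronto", "Laptop"): 6,
    ("Q3", "Chicago", "Phone"): 9,
}

def dice(cells, criteria):
    """Keep cells that satisfy every criterion.

    `criteria` maps a dimension position to the set of allowed values.
    """
    return {
        key: units
        for key, units in cells.items()
        if all(key[axis] in allowed for axis, allowed in criteria.items())
    }

# Select on two dimensions: quarter in {Q1, Q2} and city == Toronto.
sub_cube = dice(cube, {0: {"Q1", "Q2"}, 1: {"Toronto"}})
print(sub_cube)
```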
Pivot
• The pivot operation is also
known as rotation. It rotates
the data axes in view in
order to provide an
alternative presentation of
data.
[Diagram: pivot operation]
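On a two-dimensional view, a pivot amounts to swapping rows and columns without changing any values. The sample table and the `pivot` helper below are hypothetical.

```python
# Hypothetical 2-D view of a cube: rows are cities, columns are products.
table = {
    "Toronto": {"Laptop": 10, "Phone": 3},
    "Chicago": {"Laptop": 7, "Phone": 9},
}

def pivot(view):
    """Rotate the axes: former columns become rows and former rows become columns."""
    rotated = {}
    for row, columns in view.items():
        for column, value in columns.items():
            rotated.setdefault(column, {})[row] = value
    return rotated

by_product = pivot(table)
print(by_product)
```

Applying `pivot` twice returns the original layout, which reflects that the operation changes presentation only, not the data.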