Introduction to Data Warehouse

The document provides an extensive overview of data warehousing, detailing its purpose, characteristics, types, and processes involved in data integration and management. It discusses the importance of data warehouses in supporting decision-making through structured data analysis and highlights different architectures and development approaches. Additionally, it outlines the benefits and costs associated with data warehouses, emphasizing their role in enhancing business intelligence and operational efficiency.

Uploaded by

Pravalika Bura

Business Analytics

Topic 2: Introduction to Data Warehouse


Data Warehousing
• In data warehousing, a multitude of organizational and external
data is captured, transformed, and stored in a data warehouse (DW)
to support timely and accurate decisions through enriched
business insight.
• Repository of current and historical data of potential interest to
managers throughout the organization
• Data are usually structured to be available in a form ready for
analytical processing activities (OLAP)
– Mining, querying, reporting etc
• DW is a subject-oriented, integrated, time-variant, non-volatile
collection of data in support of management's decision-making
process
• Bill Inmon (1993) wrote the seminal book Building the Data
Warehouse and is considered the father of data warehousing
Data Warehousing – Historical Perspective
Characteristics of DWs
• Subject oriented: such as sales, products or customers
• Integrated: data from different sources into a consistent format
• Time-variant (time series): detect trends, deviations, relationships
• Nonvolatile: users cannot change or update data once it is loaded; obsolete data can only be discarded
• Web based: optimized for web-based applications
• Relational/multi-dimensional: uses either of the structures
• Client/server: architecture to provide easy access to end users
• Real-time: Newer DWs provide real-time, active data access
• Include Metadata: data about how data is organized
Types of Data Warehouses (DW)

• Three types of data warehouses
– Data Marts (DMs)
– Operational Data Stores (ODS)
– Enterprise Data Warehouses (EDW)
Data Marts
• A data mart is usually smaller than a DW and focuses on a particular
subject or department. It is a subset of a DW covering a single subject area
• A departmental small-scale “DW” that stores only limited/relevant
data
• Dependent data mart
– A subset that is created directly from a data warehouse
– Ensures the end user views the same version of the data as DW users
• Independent data mart
– A small data warehouse designed for a strategic business unit
or a department
– Used by small companies as a low-cost, scaled-down version
of DW
– Its source is not enterprise DW
Operational Data Stores (ODS)
• Used as an interim staging area for a data warehouse
• Contents of ODS are updated throughout the course of business
operations
• Used for short-term decisions involving mission-critical
applications rather than for the medium- and long-term
decisions associated with an EDW
• ODS is similar to short-term memory – it stores only very
recent information
• ODS consolidates data from multiple source systems and
provides a near real-time, integrated view of volatile, current
data.
Enterprise Data Warehouses (EDW)
• EDW is a large-scale data warehouse that is used across the
enterprise for decision support
• Provides integration of data from many sources into a standard
format for effective BI and decision support applications
• EDWs are used to provide data for many types of decision
support systems (DSS)
– Customer Relationship Management (CRM)
– Supply Chain Management (SCM)
– Business Performance Management (BPM)
– Business Activity monitoring
– Product life-cycle management
– Revenue management
– Knowledge management
Data Warehousing Process
• Organizations continuously collect data, information and
knowledge at an increasingly accelerated rate and store them
• Due to scalability issues, maintaining and using data and
information becomes extremely complex
• Due to improved reliability and availability of network access
and the internet, the number of users accessing information
continues to increase
• Working with multiple databases has become an extremely
difficult task requiring considerable expertise
• The benefits of DW far exceed its costs.
DW for Data-Driven Decision Making

• Data warehouse: one management and analytics platform for
product configuration, warranty, and diagnostic readout data
• Reduced infrastructure expenses: 2/3 cost reduction through
data mart consolidation
• Reduced warranty expenses: improved reimbursement accuracy
through improved claim data quality
• Improved cost of quality: faster identification, prioritization,
and resolution of quality issues
• Accurate environmental performance reporting
• IT architecture standardization: one strategic platform for
business intelligence and compliance reporting
A Generic DW Framework

[Figure: a generic DW framework]
• Data sources: ERP, legacy, POS, other OLTP/Web, external data
• ETL process: select, extract, transform, integrate, load
• Enterprise data warehouse with metadata, replicated to data marts
(Marketing, Operations, Finance, ...), or a no data marts option
• API / middleware
• Applications (visualization): routine business reporting, data/text
mining, OLAP, dashboard, Web, custom built applications
ETL Process in DW
• Data Sources: multiple independent operational legacy
systems, external data providers, OLTP, ERP, Web logs
• Extraction: extracted data is first stored in a staging area, not
in the DW directly, because it arrives in various formats and
may be corrupted.
• Transformation: a set of rules or functions is applied to the
extracted data to convert it into a single standard format. May
involve the following tasks: filtering, cleaning, joining, splitting,
sorting
• Loading: Transformed data is loaded into DW. Rate and
period of loading solely depends on the requirements and
varies from system to system.
• Most commonly used ETL tools: Hevo, Sybase, Oracle etc.
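
The extract, stage, transform, and load steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not how any particular ETL tool works: the two source extracts (ERP_CSV, POS_CSV), their column names, and the sales table are hypothetical, and an in-memory SQLite database stands in for the DW.

```python
import csv
import io
import sqlite3

# Hypothetical source extracts: two systems exporting similar facts in different formats.
ERP_CSV = "order_id,amount,region\n1,100.50,north\n2,,south\n3,75.00,NORTH\n"
POS_CSV = "id;total;area\n10;20,25;East\n11;5,00;east\n"


def extract(text, delimiter):
    """Extraction: read raw rows from a source without changing them."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


def transform(staged):
    """Transformation: filter, clean, and convert rows to one standard format."""
    clean = []
    for source, row in staged:
        if source == "erp":
            order_id, amount, region = row["order_id"], row["amount"], row["region"]
        else:  # "pos" uses other column names and a decimal comma
            order_id, amount, region = row["id"], row["total"].replace(",", "."), row["area"]
        if not amount:  # filtering: drop rows with a missing amount
            continue
        clean.append((f"{source}-{order_id}", float(amount), region.strip().lower()))
    return sorted(clean)  # sorting


def load(conn, rows):
    """Loading: write the transformed rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (order_key TEXT PRIMARY KEY, amount REAL, region TEXT)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()


# The staging area holds raw extracts, tagged by source, before anything reaches the DW.
staging = [("erp", r) for r in extract(ERP_CSV, ",")] + [("pos", r) for r in extract(POS_CSV, ";")]
warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(staging))
totals = warehouse.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(totals)
```

Keeping the raw rows in a staging list means a corrupted or oddly formatted extract can be inspected and rejected before it touches the warehouse table.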
DW Architecture
• DW architectures are called client/server or n-tier architectures
• Three-tier architecture
1. Data acquisition software (back-end)
2. The data warehouse that contains the data & software
3. Client (front-end) software that allows users to access and
analyze data from the warehouse (DSS/BI/BA engine)
• Two-tier architecture
– First two tiers in three-tier architecture are combined into one
… sometimes there is only one tier?
DW Architectures

[Figure: three-tier – Tier 1: Client workstation, Tier 2: Application server, Tier 3: Database server]
[Figure: two-tier – Tier 1: Client workstation, Tier 2: Application & database server]

• Advantage of 3-tier is its separation of functions of data
warehouse, which eliminates resource constraints and makes
it possible to easily create DWs
• 2-tier system is more economical as DSS engine physically
runs on same hardware as DW
• Limitation of 2-tier system: performance problems for large
data warehouses
A Web-based DW Architecture

[Figure: Client (Web browser), Internet/Intranet/Extranet, Web server (Web pages), Application server, Data warehouse]

• Advantages: ease of access, platform independence and
lower cost
Factors to Consider in DW Architectures
• Database management system (DBMS) should be used
– Oracle, SQL server, DB2
• Parallel processing: provides scalability
• Partitioning: splitting large tables into smaller ones improves
access efficiency
• Selection of data migration tools should be based on a
thorough assessment of existing data assets
• Selection of migration tools for data retrieval and analysis
– In-house tool development
– Third-party tool
– Default tool provided with DW system
Data Integration
• Decision maker needs access to multiple sources of data that
must be integrated
• Recognizing what data to access and providing them to decision
maker requires database specialists
• As DWs grow in size, the issues of integrating data grow as well
– Mergers, acquisitions, regulatory requirements, and new
channels are driving changes in DWs
– In addition to historical, cleansed, consolidated, point-in-time
data, decision makers are demanding access to real-time,
unstructured, and/or remote data
– Access through PDAs, speech recognition, and synthesis is
making integration more complicated
Data Integration – Three Major Processes
• Proper data integration permits data to be accessed and made
available to an array of ETL and analysis tools and the DW
environment.
• Data Access
– Ability to access and extract data from any data source
• Data Federation
– Integration of business views across multiple data stores
• Change Capture
– Identification, capture, and delivery of the changes made
to enterprise data sources
• SAS Institute and Oracle provide data integration tools that
improve data quality in the integration process
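
Change capture can be illustrated with a minimal sketch that compares two snapshots of a source table. Snapshot comparison is only one simple technique, chosen here for brevity; the function name capture_changes and the sample customer rows are hypothetical.

```python
def capture_changes(previous, current):
    """Compare two snapshots keyed by primary key and return what changed."""
    inserted = {k: current[k] for k in current.keys() - previous.keys()}
    deleted = {k: previous[k] for k in previous.keys() - current.keys()}
    updated = {
        k: current[k]
        for k in current.keys() & previous.keys()
        if current[k] != previous[k]
    }
    return {"insert": inserted, "update": updated, "delete": deleted}


# Hypothetical customer table (id -> (name, city)) on two consecutive days.
yesterday = {1: ("Alice", "Berlin"), 2: ("Bob", "Paris"), 3: ("Chen", "Oslo")}
today = {1: ("Alice", "Munich"), 3: ("Chen", "Oslo"), 4: ("Dana", "Rome")}

changes = capture_changes(yesterday, today)
print(changes)
```

Only the three changed rows would need to be delivered to the DW; the unchanged row for key 3 is skipped.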
Various Data Integration Technologies
• Enterprise Application Integration (EAI)
– Provides a vehicle for pushing data from source systems into the DW
• Service-Oriented Architecture (SOA)
– supports integrating business data and processes by creating
reusable components of functionality, or services
• Enterprise Information Integration (EII)
– Mechanism for pulling data from source systems on request
• Extraction, Transformation, and Load (ETL)
– Purpose is to load DW with integrated and cleansed data
– Supports the movement of data between sources and targets
– Documents how data elements (metadata) change
Extraction, Transformation, Load Process

[Figure: packaged application, legacy system, other internal applications, and a transient data source feed the ETL steps, which load the data warehouse and data marts]
Data Migration
• Data sources may consist of
– Files from OLTP databases, spreadsheets, personal
databases (e.g. MS Access) or external files
• Staging tables: all input is stored here to facilitate loading
• DW contains various rules:
– How data will be used
– Summarization rules
– Standardization of encoded attributes
– Calculation rules
• Metadata: these rules are stored and applied to the DW centrally
• Quality issues in input files must be corrected before loading data
• Two ways of loading: a data transformation tool or custom programming
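
The idea of keeping standardization, calculation, and summarization rules in one central place can be sketched as follows. The RULES dictionary, the column names, and the code mappings are hypothetical and exist only to illustrate the principle.

```python
# Hypothetical metadata: rules kept in one place and applied to every staged row.
RULES = {
    # Standardization of encoded attributes: many source codes, one warehouse code.
    "standardize": {
        "gender": {"m": "M", "male": "M", "1": "M", "f": "F", "female": "F", "0": "F"}
    },
    # Calculation rules: derived columns computed the same way for every source.
    "calculate": {"revenue": lambda row: row["quantity"] * row["unit_price"]},
    # Summarization rule: the column used to aggregate measures.
    "summarize_by": "gender",
}


def apply_rules(staged_rows, rules):
    """Apply standardization and calculation rules to copies of the staged rows."""
    prepared = []
    for row in staged_rows:
        row = dict(row)
        for column, mapping in rules["standardize"].items():
            # An unknown code raises KeyError, so bad input is noticed before loading.
            row[column] = mapping[str(row[column]).strip().lower()]
        for column, formula in rules["calculate"].items():
            row[column] = formula(row)
        prepared.append(row)
    return prepared


def summarize(rows, rules, measure):
    """Total a measure per value of the summarization column."""
    totals = {}
    for row in rows:
        key = row[rules["summarize_by"]]
        totals[key] = totals.get(key, 0) + row[measure]
    return totals


staged = [
    {"gender": "male", "quantity": 2, "unit_price": 10},
    {"gender": "F", "quantity": 1, "unit_price": 30},
    {"gender": 1, "quantity": 3, "unit_price": 5},
]
rows = apply_rules(staged, RULES)
summary = summarize(rows, RULES, "revenue")
print(summary)
```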
Data Transformation
• Selection of a data transformation tool
– Data transformation tools are expensive
– Data transformation tools may have a long learning curve
• A data transformation tool should simplify the maintenance of an
organization's data warehouse
• Classification of ETL technologies
– Sophisticated, Enabler, Simple, Rudimentary
• Criteria for Selecting ETL tool
– Ability to handle unlimited data sources
– Automatic capturing and delivery of metadata
– Conforming to open standards
– Easy-to-use interface for the developer and user
Direct Benefits of Data Warehouse
• End users can perform extensive analysis in numerous ways
• A consolidated view of corporate data (a single version of truth)
is possible
• Better and more timely information is possible.
– DW permits information processing to be relieved from costly
operational systems onto low-cost servers; so more end-
user requests can be processed more quickly
• Enhanced system performance can result.
– DW frees production processing because some operational
system reporting requirements are moved to DSS
• Data access is simplified
Indirect Benefits of Data Warehouse
• Enhance business knowledge
• Present a competitive advantage
• Improve customer service and satisfaction
• Facilitate decision making
• Help in reforming business processes
• Strongest contributions to competitive advantage
Cost-Benefit Analysis of Data Warehouse
• Given the potential benefits and the substantial investments in
time and money that a DW development project requires, it is
critical that an organization structure its DW project to maximize
the chances of success.
• Benefits – consider the money saved by the following groups
– Keepers (improving traditional decision support functions)
– Gatherers (automated collection/dissemination of information)
– Users (decisions made using data warehouse)
• Costs: hardware, software, network bandwidth, internal
development, internal support, training and external consulting
• Net present value to be calculated over expected life of DW
• It is important to involve users in the DW development process, as
it is one of the critical success factors.
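
The net present value calculation mentioned above can be sketched with the standard formula NPV = sum over years t of (benefit_t - cost_t) / (1 + r)^t. All figures below (costs, benefits, a four-year life, a 10% discount rate) are made-up illustrative assumptions, not data from any real project.

```python
def net_present_value(rate, cash_flows):
    """Discount yearly net cash flows (year 0 first) back to the present."""
    return sum(cf / (1 + rate) ** year for year, cf in enumerate(cash_flows))


# Hypothetical DW project over an expected life of four years (year 0 to year 3).
costs = [500_000, 100_000, 100_000, 100_000]   # hardware, software, support, training, ...
benefits = [0, 300_000, 350_000, 400_000]      # savings for keepers, gatherers, and users

net = [b - c for b, c in zip(benefits, costs)]
npv = net_present_value(0.10, net)
print(f"NPV at 10%: {npv:,.2f}")
```

A positive NPV under these assumptions would suggest that the discounted benefits outweigh the discounted costs over the life of the DW.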
Data Warehouse Development Approaches
• Inmon Model: EDW approach (top-down)
– Bill Inmon is called the father of data warehousing
– Adapts traditional relational database tools to the
development needs of an enterprise-wide DW
• Kimball Model: DM approach (bottom-up)
– Employs dimensional modeling, which starts with tables
– Plan big, build small approach
– Subject-oriented or department-oriented DW
• Another alternative is a hosted data warehouse
DW Development – EDW vs DM – 1/2

Effort                         DM Approach                                 EDW Approach
Scope                          One subject area                            Several subject areas
Development time               Months                                      Years
Development cost               $10,000 to $100,000+                        $1,000,000+
Development difficulty         Low to medium                               High
Data prerequisite for sharing  Common (within business area)               Common (across enterprise)
Sources                        Only some operational and external systems  Many operational and external systems
Size                           Megabytes to several gigabytes              Gigabytes to petabytes
Time horizon                   Near-current and historical data            Historical data
Data transformations           Low to medium                               High


DW Development – EDW vs DM – 2/2

Effort                        DM Approach                                     EDW Approach
Update frequency              Hourly, daily, weekly                           Weekly, monthly
Technology
Hardware                      Workstations and departmental servers           Enterprise servers and mainframe computers
Operating system              Windows and Linux                               Unix, Z/OS, OS/390
Databases                     Workgroup or standard database servers          Enterprise database servers
Usage
Number of simultaneous users  10s                                             100s to 1,000s
User types                    Business area analysts and managers             Enterprise analysts and senior executives
Business spotlight            Optimizing activities within the business area  Cross-functional optimization and decision making
Additional DW Considerations: Hosted Data Warehouses

• Benefits:
– Requires minimal investment in infrastructure
– Frees up capacity on in-house systems
– Frees up cash flow
– Makes powerful solutions affordable
– Enables solutions that provide for growth
– Offers better quality equipment and software
– Provides faster connections
Representation of Data in DW
• Dimensional Modeling
– A retrieval-based system that supports high-volume query
access
• Star schema
– The most commonly used and the simplest style of
dimensional modeling
– Contains a fact table surrounded by and connected to
several dimension tables
• Snowflake schema
– An extension of the star schema where the diagram resembles
a snowflake in shape
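
A star schema can be sketched with Python's built-in sqlite3 module: one fact table whose foreign keys point to several dimension tables. The table names, columns, and sample rows below are hypothetical and chosen only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables describe the "by what" of each fact.
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_region  (region_id  INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_date    (date_id    INTEGER PRIMARY KEY, year INTEGER, quarter TEXT);

-- The fact table sits in the middle of the star and holds the measures.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    region_id  INTEGER REFERENCES dim_region(region_id),
    date_id    INTEGER REFERENCES dim_date(date_id),
    units      INTEGER,
    revenue    REAL
);

INSERT INTO dim_product VALUES (1, 'Laptop', 'Hardware'), (2, 'Mouse', 'Hardware'),
                               (3, 'Antivirus', 'Software');
INSERT INTO dim_region  VALUES (1, 'North'), (2, 'South');
INSERT INTO dim_date    VALUES (1, 2024, 'Q1'), (2, 2024, 'Q2');
INSERT INTO fact_sales  VALUES (1, 1, 1, 2, 2000.0), (2, 1, 1, 10, 200.0),
                               (3, 2, 1, 5, 250.0), (1, 2, 2, 1, 1000.0);
""")

# A typical query joins the fact table to the dimensions it needs and aggregates.
query = """
SELECT p.category, r.name, SUM(f.revenue)
FROM fact_sales f
JOIN dim_product p ON p.product_id = f.product_id
JOIN dim_region  r ON r.region_id  = f.region_id
GROUP BY p.category, r.name
ORDER BY p.category, r.name
"""
result = conn.execute(query).fetchall()
print(result)
```

In a snowflake variant of this sketch, the category column would move out of dim_product into its own table referenced by a key, which normalizes the dimension at the cost of one more join.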
Multidimensionality

• The ability to organize, present, and analyze data by
several dimensions, such as sales by region, by product,
by salesperson, and by time (four dimensions)
• Multidimensional presentation
– Dimensions: products, salespeople, market
segments, business units, geographical locations,
distribution channels, country, or industry
– Measures: money, sales volume, head count,
inventory profit, actual versus forecast
– Time: daily, weekly, monthly, quarterly, or yearly
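
Viewing one measure by different combinations of dimensions can be sketched in plain Python. The records, names, and the helper total_by are hypothetical.

```python
from collections import defaultdict

# Hypothetical fact records: four dimensions (region, product, salesperson, quarter)
# and one measure (sales).
records = [
    {"region": "North", "product": "Laptop", "salesperson": "Ana", "quarter": "Q1", "sales": 1200},
    {"region": "North", "product": "Mouse", "salesperson": "Ben", "quarter": "Q1", "sales": 300},
    {"region": "South", "product": "Laptop", "salesperson": "Ana", "quarter": "Q2", "sales": 900},
    {"region": "South", "product": "Mouse", "salesperson": "Cy", "quarter": "Q2", "sales": 100},
]


def total_by(records, dimensions, measure="sales"):
    """Total a measure for every combination of the chosen dimensions."""
    totals = defaultdict(int)
    for rec in records:
        totals[tuple(rec[d] for d in dimensions)] += rec[measure]
    return dict(totals)


by_region = total_by(records, ["region"])
by_salesperson = total_by(records, ["salesperson"])
by_product_quarter = total_by(records, ["product", "quarter"])
print(by_region)
print(by_salesperson)
print(by_product_quarter)
```

The same records answer "sales by region", "sales by salesperson", and "sales by product and quarter" without being restructured, which is the point of a multidimensional presentation.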
Star Schema vs Snowflake Schema
Analysis of Data in DW

• OLTP (Online Transaction Processing)
– Capturing and storing data from ERP, CRM, POS, …
– The main focus is on efficiency of routine tasks
• OLAP (Online Analytical Processing)
– Converting data into information for decision support
– Data cubes, drill-down / rollup, slice & dice, …
– Requesting ad hoc reports
– Conducting statistical and other analyses
– Developing multimedia-based applications
OLTP vs. OLAP

• Purpose
– OLTP: To carry out day-to-day business functions
– OLAP: To support decision making and provide answers to
business and management queries
• Data source
– OLTP: Transaction database (a normalized data repository
primarily focused on efficiency and consistency)
– OLAP: Data warehouse or DM (a nonnormalized data repository
primarily focused on accuracy and completeness)
• Reporting
– OLTP: Routine, periodic, narrowly focused reports
– OLAP: Ad hoc, multidimensional, broadly focused reports and queries
• Resource requirements
– OLTP: Ordinary relational databases
– OLAP: Multiprocessor, large-capacity, specialized databases
• Execution speed
– OLTP: Fast (recording of business transactions and routine reports)
– OLAP: Slow (resource intensive, complex, large-scale queries)
OLAP Operations

• Slice - a subset of a multidimensional array
• Dice - a slice on more than two dimensions
• Drill Down/Up - navigating among levels of data ranging
from the most summarized (up) to the most detailed
(down)
• Roll Up - computing all of the data relationships for one
or more dimensions
• Pivot - used to change the dimensional orientation of a
report or an ad hoc query-page display
OLAP
• Slicing Operations on a Simple Three-Dimensional Data Cube
[Figure: a 3-dimensional OLAP cube with slicing operations; axes are Time, Product, and Geography; cells are filled with numbers representing sales volumes]
– Sales volumes of a specific Product on variable Time and Region
– Sales volumes of a specific Region on variable Time and Products
– Sales volumes of a specific Time on variable Region and Products
OLAP Operations
• Roll-up
• Drill-down
• Slice and dice
• Pivot (rotate)
Roll-up
• Roll-up performs aggregation
on a data cube in any of the
following ways −
– By climbing up a concept
hierarchy for a dimension
– By dimension reduction
Drill-down
• Drill-down is the reverse
operation of roll-up. It is
performed by either of the
following ways −
– By stepping down a
concept hierarchy for a
dimension
– By introducing a new
dimension.
Slice
• The slice operation selects
one particular dimension
from a given cube, fixes it
to a single value, and
provides a new sub-cube.
Consider the following
diagram that shows how
slice works
Dice
• Dice selects values on two
or more dimensions from a
given cube and provides a
new sub-cube. Consider the
following diagram that
shows the dice operation
Pivot
• The pivot operation is also
known as rotation. It rotates
the data axes in view in
order to provide an
alternative presentation of
data. Consider the following
diagram that shows the
pivot operation.
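
The operations described on these slides can be sketched on a tiny cube held in a Python dictionary. The cube contents, the dimension names in DIMS, the city-to-country hierarchy, and the function names are all hypothetical; real OLAP servers implement these operations very differently.

```python
# Hypothetical cube: (quarter, city, product) -> sales volume.
cube = {
    ("Q1", "Toronto", "Phone"): 10, ("Q1", "Toronto", "Modem"): 4,
    ("Q1", "Vancouver", "Phone"): 3, ("Q1", "Chicago", "Phone"): 7,
    ("Q2", "Toronto", "Phone"): 12, ("Q2", "Chicago", "Modem"): 5,
    ("Q2", "Chicago", "Phone"): 9,
}
DIMS = ("time", "location", "product")
# Concept hierarchy for the location dimension: city -> country.
COUNTRY_OF = {"Toronto": "Canada", "Vancouver": "Canada", "Chicago": "USA"}


def slice_cube(cube, dim, value):
    """Slice: fix one value on one dimension; that dimension drops out."""
    i = DIMS.index(dim)
    return {k[:i] + k[i + 1:]: v for k, v in cube.items() if k[i] == value}


def dice(cube, allowed):
    """Dice: keep cells whose values are in the allowed sets on several dimensions."""
    return {
        k: v for k, v in cube.items()
        if all(k[DIMS.index(d)] in values for d, values in allowed.items())
    }


def roll_up(cube, dim, hierarchy):
    """Roll-up: climb a concept hierarchy on one dimension and aggregate."""
    i = DIMS.index(dim)
    out = {}
    for k, v in cube.items():
        key = k[:i] + (hierarchy[k[i]],) + k[i + 1:]
        out[key] = out.get(key, 0) + v
    return out


def pivot(table):
    """Pivot: swap the two axes of a two-dimensional view."""
    return {(b, a): v for (a, b), v in table.items()}


q1 = slice_cube(cube, "time", "Q1")                        # (city, product) view for Q1
sub = dice(cube, {"time": {"Q1"}, "product": {"Phone"}})   # smaller sub-cube
by_country = roll_up(cube, "location", COUNTRY_OF)         # cities summed into countries
rotated = pivot(q1)                                        # (product, city) view for Q1
print(q1, sub, by_country, rotated, sep="\n")
```

Drill-down is the reverse of roll_up here: going from by_country back to the city-level cube, which is only possible because the detailed data is still kept.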
