
DW Data Warehousing

The document discusses introducing data warehousing concepts and why organizations implement data warehouses. It provides examples of scenarios where different companies need consolidated sales and customer data to generate reports for managers. The document also describes common data warehousing architectures, processes for extracting, transforming and loading data from operational systems into a data warehouse, and some of the key features of data warehouses like being subject-oriented, integrated, and time-variant.

Uploaded by

Sonali Agrawal
Copyright
© Attribution Non-Commercial (BY-NC)

Introduction to Data Warehousing

Why Data Warehouse?

Scenario 1
ABC Pvt Ltd is a company with branches in Mumbai, Delhi, Chennai and Bangalore. The Sales Manager wants a quarterly sales report, but each branch has a separate operational system.

Scenario 1: ABC Pvt Ltd.
Figure: the Mumbai, Delhi, Chennai and Bangalore branch systems each hold their own data; the Sales Manager needs sales per item type per branch for the first quarter.

Solution 1: ABC Pvt Ltd.


Extract sales information from each database. Store the information in a common repository at a single site.

Solution 1: ABC Pvt Ltd.
Figure: data from the Mumbai, Delhi, Chennai and Bangalore systems flows into a data warehouse; query and analysis tools produce the report for the Sales Manager.
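Solution 1 can be sketched in a few lines of Python. The branch names come from the scenario; the item types, quarters and amounts are made-up sample values, and `consolidate` and `sales_per_item_per_branch` are hypothetical helper names, so treat this as an illustration of the idea rather than a real extraction job.

```python
from collections import defaultdict

# Hypothetical extracts, one per branch system: (item_type, quarter, amount).
BRANCH_EXTRACTS = {
    "Mumbai": [("Stationery", "Q1", 1200.0), ("Stationery", "Q1", 300.0), ("Toys", "Q2", 500.0)],
    "Delhi": [("Toys", "Q1", 450.0)],
    "Chennai": [("Stationery", "Q1", 700.0)],
    "Bangalore": [("Toys", "Q1", 250.0), ("Toys", "Q1", 150.0)],
}


def consolidate(extracts):
    """Store every branch row in one common repository, tagged with its branch."""
    return [
        (branch, item_type, quarter, amount)
        for branch, rows in extracts.items()
        for item_type, quarter, amount in rows
    ]


def sales_per_item_per_branch(repository, quarter):
    """Total sales per (branch, item type) for one quarter."""
    totals = defaultdict(float)
    for branch, item_type, row_quarter, amount in repository:
        if row_quarter == quarter:
            totals[(branch, item_type)] += amount
    return dict(totals)


repository = consolidate(BRANCH_EXTRACTS)
q1_report = sales_per_item_per_branch(repository, "Q1")
for (branch, item_type), amount in sorted(q1_report.items()):
    print(f"{branch:10} {item_type:12} {amount:10.2f}")
```

Once the rows sit in one repository, the Sales Manager's report is a single aggregation instead of four separate queries against four systems.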

Scenario 2
One Stop Shopping Super Market has a huge operational database. Whenever executives want a report, the OLTP system slows down and the data entry operators have to wait.

Scenario 2: One Stop Shopping
Figure: management reports and data entry operators share the same operational database, so the operators wait while reports run.

Solution 2
Extract the data needed for analysis from the operational database and store it in a warehouse. Refresh the warehouse at regular intervals so that it contains up-to-date information for analysis. The warehouse will contain data with a historical perspective.

Solution 2
Figure: data entry operators run transactions against the operational database; data is extracted into the data warehouse, and the Manager's reports run against the warehouse.

Scenario 3
Cakes & Cookies is a small, new company. The President wants the company to grow and needs information to make correct decisions.

Solution 3
Improve the quality of data before loading it into the warehouse. Perform data cleaning and transformation before loading the data. Use query and analysis tools to support ad hoc queries.

Solution 3
Figure: the President uses a query and analysis tool on the data warehouse to view sales over time and to plan expansion and improvement.

Data Warehouses and Data Marts

Basic reports and simple OLAP analyses can be made directly from operational data. Many organizations instead choose to extract operational data into data warehouses and data marts, facilities that prepare, store, and manage data specifically for data mining and other analyses. Programs read operational data and extract, clean, and prepare that data for BI processing. The prepared data are stored in a data-warehouse database using a data-warehouse DBMS, which can be different from the organization's operational DBMS.

Very Large Data Bases


Terabytes -- 10^12 bytes: Walmart -- 24 Terabytes
Petabytes -- 10^15 bytes: Geographic Information Systems
Exabytes -- 10^18 bytes: National Medical Records
Zettabytes -- 10^21 bytes: Weather images
Yottabytes -- 10^24 bytes: Intelligence Agency Videos

Problems with Operational Data

Most operational and purchased data have problems that inhibit their usefulness for business intelligence.

Data Mining works with Warehouse Data


Data Warehousing provides the Enterprise with a memory

Data Mining provides the Enterprise with intelligence

Data Warehouse

Inmon's definition
A data warehouse is a subject-oriented, integrated, time-variant, nonvolatile collection of data in support of management's decision-making process.

Subject-oriented
A data warehouse is organized around subjects such as sales, product and customer. It focuses on modeling and analysis of data for decision makers, and excludes data that is not useful in the decision support process.

Integration
A data warehouse is constructed by integrating multiple heterogeneous sources. Data preprocessing is applied to ensure consistency.
Figure: RDBMS, legacy system and flat file sources pass through data processing and data transformation into the data warehouse.

Integration
In terms of data:
- encoding structures
- measurement of attributes
- physical attributes of data
- naming conventions
- data type format
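A small Python sketch of what integration means for encoding, measurement and naming. The two source systems, their field names and the target layout are all hypothetical; the only fixed fact used is that one inch is 2.54 cm.

```python
CM_PER_INCH = 2.54


def from_system_a(record):
    """Hypothetical system A: gender as 'm'/'f', length already in centimetres."""
    return {
        "customer_name": record["cust_name"].strip(),
        "gender": record["gender"].upper(),
        "length_cm": float(record["length_cm"]),
    }


def from_system_b(record):
    """Hypothetical system B: gender as 1/0, length in inches, camelCase names."""
    return {
        "customer_name": record["customerName"].strip(),
        "gender": "M" if record["sex"] == 1 else "F",
        "length_cm": round(float(record["lengthInches"]) * CM_PER_INCH, 2),
    }


# After integration every record uses one naming convention, one gender
# encoding and one unit of measurement, whichever system it came from.
integrated = [
    from_system_a({"cust_name": " Robert Smith ", "gender": "m", "length_cm": "30"}),
    from_system_b({"customerName": "Anita Rao", "sex": 0, "lengthInches": 10}),
]
for row in integrated:
    print(row)
```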

Time-variant
Provides information from a historical perspective, e.g. the past 5-10 years. Every key structure contains, either implicitly or explicitly, an element of time.

Nonvolatile
Data once recorded cannot be updated. A data warehouse requires two operations in data accessing:
- initial loading of data
- access of data
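The load/access idea can be illustrated with a minimal append-only store in Python. The `Warehouse` class and its method names are hypothetical; the sample rows are illustrative values only.

```python
from datetime import date


class Warehouse:
    """Append-only store: rows can be loaded and read, never updated in place."""

    def __init__(self):
        self._rows = []

    def load(self, rows, loaded_on):
        """Append a batch, stamping each row with its load date."""
        for row in rows:
            self._rows.append((loaded_on, *row))

    def access(self, predicate=lambda row: True):
        """Return matching rows as a new list; stored rows are immutable tuples."""
        return [row for row in self._rows if predicate(row)]


warehouse = Warehouse()
warehouse.load([("Mumbai", "Cheese", 3)], loaded_on=date(1998, 1, 31))
# A later figure arrives as a new row; the earlier snapshot is kept, not overwritten.
warehouse.load([("Mumbai", "Cheese", 16)], loaded_on=date(1998, 2, 28))

history = warehouse.access(lambda row: row[2] == "Cheese")
for row in history:
    print(row)
```

Because every row carries its load date, the same structure also gives the time-variant view: both snapshots remain available for analysis.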

Operational v/s Historic Data

Features | Operational | Historic
Characteristics | Operational processing | Informational processing
Orientation | Transaction | Analysis
User | Clerk, DBA, database professional | Knowledge workers
Function | Day-to-day operation | Decision support
Data | Current | Historical
View | Detailed, flat relational | Summarized, multidimensional
DB design | Application oriented | Subject oriented
Unit of work | Short, simple transaction | Complex query
Access | Read/write | Mostly read

Operational v/s Historic System

Features | Operational | Historic
Focus | Data in | Information out
Number of records accessed | Tens | Millions
Number of users | Thousands | Hundreds
DB size | 100 MB to GB | 100 GB to TB
Priority | High performance, high availability | High flexibility, end-user autonomy
Metric | Transaction throughput | Query throughput

Data Warehouse versus Data Marts


Enterprise data warehouse (EDW): Large-scale data repository that incorporates aggregated historical data for an entire company, division, or business unit. Built around many subjects; can support a wide range of decision tasks.
Data marts (DM): Small-scale data repository serving the needs of one department. Based on a limited number of subjects (sometimes one). Constructed from a few transactional databases or a subset of EDW data. Provides a buffer between managers and the EDW: managers work with DM data, so that even if the DM data is corrupted, the EDW data is unchanged.
Which is built first?
- Top-down development: The EDW is created first, and data is extracted from it to create one or more DMs.
- Bottom-up approach: Independent DMs are built as needed, and the overall EDW is built later from the existing DMs.

Dimensions of a Data Warehouse

Two-dimensional data warehouse
Three-dimensional data warehouse
Data warehouses can have four or more dimensions

Building a Data Warehouse


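Figure: source feeds (flat files, VSAM, Oracle, 3rd party feeds, API) pass through Extract, Transform and Load.

ETL Methodology:
Figure: data is extracted into a temporary data hub, transformed, and loaded into the data warehouse, which in turn feeds the data marts.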

ETL Methodology
Data extraction: Process of copying relevant data from a variety of transactional databases for inclusion in a DW. May occur at regular intervals (e.g., weekly, monthly) to add new data. Data from incompatible databases, flat files, text documents, etc. must be filtered through appropriate APIs (application programming interfaces) as needed.
Data transformation: Described under Building a Data Warehouse.
Data loading: Extracted, cleaned, and transformed data is loaded into the DW at a predetermined data refresh frequency.
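The three ETL steps can be sketched end to end with the Python standard library. Everything concrete here is an assumption made for illustration: the two feeds, their field names and date formats, the rejection rule, and the `sales` table layout. An in-memory SQLite database stands in for the warehouse.

```python
import csv
import io
import sqlite3
from datetime import datetime

# Two hypothetical sources with incompatible layouts and date formats.
CSV_FEED = "sale_date,product,units\n1998-01-05,Cheese,3\n1998-01-09,Swiss Rolls,4\n"
API_FEED = [
    {"date": "05/02/1998", "item": "cheese", "qty": "16"},
    {"date": "07/02/1998", "item": "", "qty": "2"},  # missing product name
]
DATE_FORMATS = {"csv": "%Y-%m-%d", "api": "%d/%m/%Y"}


def extract():
    """Copy raw rows from each source, tagged with where they came from."""
    csv_rows = [("csv", r["sale_date"], r["product"], r["units"])
                for r in csv.DictReader(io.StringIO(CSV_FEED))]
    api_rows = [("api", r["date"], r["item"], r["qty"]) for r in API_FEED]
    return csv_rows + api_rows


def transform(raw_rows):
    """Reconcile formats; rows with a missing product go to an exception list."""
    clean, rejected = [], []
    for source, date_text, product, units in raw_rows:
        if not product.strip():
            rejected.append((source, date_text, product, units))
            continue
        day = datetime.strptime(date_text, DATE_FORMATS[source]).date()
        clean.append((day.isoformat(), product.strip().title(), int(units)))
    return clean, rejected


def load(conn, rows):
    """Append the cleaned rows to the warehouse table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (sale_date TEXT, product TEXT, units INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
clean, rejected = transform(extract())
load(conn, clean)
totals = conn.execute(
    "SELECT product, SUM(units) FROM sales GROUP BY product ORDER BY product"
).fetchall()
print(totals, "rejected:", len(rejected))
```

Keeping rejected rows in a separate list mirrors the exception reports that DW analysts review, instead of silently dropping bad data.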

Building a Data Warehouse


Data transformation/cleaning:
- Data extracted from transactional databases must be cleaned (scrubbed) and transformed before loading into a DW.
- Format differences across different tables/databases must be reconciled.
- Missing or misspelled data values must be resolved.
- Erroneous data are identified using application programs, and scrutinized/corrected by DW analysts using system-generated exception reports.
- Transaction-level data is aggregated by business dimensions.
- This is a key step in DW construction, since a DW is very sensitive to data errors.

Challenges of Data Reconciliation
Figure: the same customer appears under different keys and names in three source systems.
Life Insurance Database | PK: SS# (123-45-6789) | Name (Robert G. Smith)
Auto Insurance Database | PK: DL# (FL-B12345678) | Name (Bob Smith)
Home Insurance Database | PK: Acc# (12345678905) | Name (R. G. Smith)

Data Cleaning Example

Good Reading Bookstores

(a) CUSTOMER Table.
Row | Customer Number | Customer Name | Street | City | State | Country
1 | 02847 | Mervis | 123 Oak St. | (blank) | TN | USA
2 | 03185 | Gomez | 345 Main Ave. | Columbus | OH | USA
3 | 03480 | Taylor | 50 Elm Rd. | San Diego | CA | USA
4 | 06837 | Stevens | 876 Leslie Ln. | Raleigh | NC | USA
5 | 08362 | Adams | 1200 Wallaby St. | Brisbane | (blank) | Australia
6 | 12739 | Gomez | 345 Main Ave. | Columbus | GA | USA
7 | 13848 | Lucas | 742 Ave. Louise | Brussels | (blank) | Belgium
8 | 15367 | Tailor | 50 Elm Rd. | San Diego | CA | USA
9 | 15933 | Chang | 48 Maple Ave. | Toronto | ON | Canada
10 | 18575 | Smith | 390 Martin Dr. | Columbus | RP | USA
11 | 21359 | Sanchez | 666 Ave. Bolivar | Santiago | (blank) | Chile

Problems in the CUSTOMER table:
- Missing data: City is blank in row 1.
- Questionable data: State for rows 2 & 6 could be the same.
- Possible misspelling: Do rows 3 & 8 refer to the same person?

(b) SALE Table.
Row | Book Number | Customer Number | Date | Price | Quantity
1 | 426478 | 03480 | May 19, 2003 | 32.99 | 1
2 | 077656 | 18575 | May 19, 2003 | 19.95 | 21
3 | 365905 | 06837 | May 19, 2003 | 24.99 | 3
4 | 645688 | 21359 | May 20, 2003 | 49.50 | 1
5 | 474640 | 15367 | May 34, 2003 | 3200.99 | 1
6 | 426478 | 08362 | June 03, 2003 | 32.99 | 2
7 | 276432 | 03480 | June 04, 2003 | 30.00 | 1
8 | 365905 | 12738 | June 04, 2003 | 24.99 | 1
9 | 276432 | 06837 | June 05, 2003 | 30.00 | 5
10 | 327467 | 18575 | June 12, 2003 | -32.99 | 2
11 | 426478 | 06837 | June 15, 2003 | 32.99 | 1

Problems in the SALE table:
- Questionable data: Is the book quantity in row 2 correct?
- Out-of-range data: A single book can't cost $3,200.99.
- Referential integrity problem: Customer# 12738 does not exist in the CUSTOMER table.
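The checks on the SALE table can be automated along the following lines. The rows and customer numbers are transcribed from the Good Reading Bookstores example; the thresholds `MAX_PRICE` and `MAX_QUANTITY` and the `check_sale` helper are assumptions chosen for this sketch, and the date check assumes English month names.

```python
from datetime import datetime

CUSTOMER_NUMBERS = {
    "02847", "03185", "03480", "06837", "08362", "12739",
    "13848", "15367", "15933", "18575", "21359",
}

# (book_number, customer_number, date, price, quantity), in SALE table order.
SALES = [
    ("426478", "03480", "May 19, 2003", 32.99, 1),
    ("077656", "18575", "May 19, 2003", 19.95, 21),
    ("365905", "06837", "May 19, 2003", 24.99, 3),
    ("645688", "21359", "May 20, 2003", 49.50, 1),
    ("474640", "15367", "May 34, 2003", 3200.99, 1),
    ("426478", "08362", "June 03, 2003", 32.99, 2),
    ("276432", "03480", "June 04, 2003", 30.00, 1),
    ("365905", "12738", "June 04, 2003", 24.99, 1),
    ("276432", "06837", "June 05, 2003", 30.00, 5),
    ("327467", "18575", "June 12, 2003", -32.99, 2),
    ("426478", "06837", "June 15, 2003", 32.99, 1),
]

MAX_PRICE = 500.00   # assumed upper bound for a single book
MAX_QUANTITY = 10    # assumed threshold for a "questionable" quantity


def check_sale(sale):
    """Return the list of problems found in one SALE row."""
    _book, customer, date_text, price, quantity = sale
    problems = []
    try:
        datetime.strptime(date_text, "%B %d, %Y")
    except ValueError:
        problems.append("invalid date")
    if not 0 < price <= MAX_PRICE:
        problems.append("price out of range")
    if quantity > MAX_QUANTITY:
        problems.append("questionable quantity")
    if customer not in CUSTOMER_NUMBERS:
        problems.append("unknown customer")
    return problems


# Exception report: row number -> problems, for analysts to review.
exceptions = {}
for row_no, sale in enumerate(SALES, start=1):
    problems = check_sale(sale)
    if problems:
        exceptions[row_no] = problems

for row_no, problems in exceptions.items():
    print(row_no, problems)
```

The report flags rows for review rather than fixing them automatically, since only an analyst can decide whether a quantity of 21 is a typo or a genuine bulk order.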

Case Study
Afco Foods & Beverages is a new company which produces dairy, bread and meat products, with its production unit located at Baroda. Their products are sold in the North, North West and Western regions of India. They have sales units at Mumbai, Pune, Ahmedabad, Delhi and Baroda. The President of the company wants sales information.

Sales Information
Report: The number of units sold.
113

Report: The number of units sold over time
January | February | March | April
14 | 41 | 33 | 25

Sales Information
Report: The number of items sold for each product with time
Product | Jan | Feb | Mar | Apr
Wheat Bread | - | - | 6 | 17
Cheese | 6 | 16 | 6 | 8
Swiss Rolls | 8 | 25 | 21 | -

Sales Information
Report: The number of items sold in each City for each product with time
City | Product | Jan | Feb | Mar | Apr
Mumbai | Wheat Bread | - | - | 3 | 10
Mumbai | Cheese | 3 | 16 | 6 | -
Mumbai | Swiss Rolls | 4 | 16 | 6 | -
Pune | Wheat Bread | - | - | 3 | 7
Pune | Cheese | 3 | - | - | 8
Pune | Swiss Rolls | 4 | 9 | 15 | -

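Sales Information
Report: The number of items sold and income in each region for each product with time (Rs = rupees, U = units).
City | Product | Jan Rs | Jan U | Feb Rs | Feb U | Mar Rs | Mar U | Apr Rs | Apr U
Mumbai | Wheat Bread | - | - | - | - | 7.44 | 3 | 24.80 | 10
Mumbai | Cheese | 7.95 | 3 | 42.40 | 16 | 15.90 | 6 | - | -
Mumbai | Swiss Rolls | 7.32 | 4 | 29.28 | 16 | 10.98 | 6 | - | -
Pune | Wheat Bread | - | - | - | - | 7.44 | 3 | 17.36 | 7
Pune | Cheese | 7.95 | 3 | - | - | - | - | 21.20 | 8
Pune | Swiss Rolls | 7.32 | 4 | 16.47 | 9 | 27.45 | 15 | - | -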

Sales Measures & Dimensions


Measures: Units sold, Amount.
Dimensions: Product, Time, Region.
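The relationship between measures and dimensions can be shown with a short Python sketch: a measure is what gets summed, and dimensions are what the sum is grouped by. The unit figures are transcribed from the city-level sales report of the case study; the `units_by` helper is a hypothetical name, and only the Units measure is used to keep the example short.

```python
from collections import defaultdict

# (city, product, month, units) from the city-level sales report.
SALES = [
    ("Mumbai", "Wheat Bread", "Mar", 3), ("Mumbai", "Wheat Bread", "Apr", 10),
    ("Mumbai", "Cheese", "Jan", 3), ("Mumbai", "Cheese", "Feb", 16),
    ("Mumbai", "Cheese", "Mar", 6), ("Mumbai", "Swiss Rolls", "Jan", 4),
    ("Mumbai", "Swiss Rolls", "Feb", 16), ("Mumbai", "Swiss Rolls", "Mar", 6),
    ("Pune", "Wheat Bread", "Mar", 3), ("Pune", "Wheat Bread", "Apr", 7),
    ("Pune", "Cheese", "Jan", 3), ("Pune", "Cheese", "Apr", 8),
    ("Pune", "Swiss Rolls", "Jan", 4), ("Pune", "Swiss Rolls", "Feb", 9),
    ("Pune", "Swiss Rolls", "Mar", 15),
]
DIMENSIONS = {"city": 0, "product": 1, "month": 2}


def units_by(*dims):
    """Sum the Units measure over every combination of the given dimensions."""
    totals = defaultdict(int)
    for row in SALES:
        key = tuple(row[DIMENSIONS[d]] for d in dims)
        totals[key] += row[3]
    return dict(totals)


total_units = units_by()                          # no dimensions: grand total
by_month = units_by("month")                      # one dimension
by_product_month = units_by("product", "month")   # two dimensions

print(total_units[()])
print(by_month)
```

Each report in the case study corresponds to one call: no dimensions for the grand total, `month` for units over time, and so on as more dimensions are added.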

Sales Data Warehouse Model

Fact Table
City | Product | Month | Units | Rupees
Mumbai | Wheat Bread | January | 3 | 7.95
Mumbai | Cheese | January | 4 | 7.32
Pune | Wheat Bread | January | 3 | 7.95
Pune | Cheese | January | 4 | 7.32
Mumbai | Swiss Rolls | February | 16 | 42.40

Sales Data Warehouse Model

City_ID | Prod_ID | Month | Units | Rupees
1 | 589 | 1/1/1998 | 3 | 7.95
1 | 1218 | 1/1/1998 | 4 | 7.32
2 | 589 | 1/1/1998 | 3 | 7.95
2 | 1218 | 1/1/1998 | 4 | 7.32
1 | 589 | 2/1/1998 | 16 | 42.40

Sales Data Warehouse Model

Product Dimension Tables
Prod_ID | Product_Name | Product_Category_ID
589 | Wheat Bread | 1
590 | White Bread | 1
288 | Coconut Cookies | 2

Product_Category_ID | Product_Category
1 | Bread
2 | Cookies

Sales Data Warehouse Model

Region Dimension Table
City_ID | City | Region | Country
1 | Mumbai | West | India
2 | Pune | NorthWest | India

Sales Data Warehouse Model

Figure: the Sales Fact table at the centre, linked to the Time, Product and Region dimensions, with Product linked in turn to Product Category.
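The model can be tried out with Python's built-in `sqlite3` module. The table contents are taken from the fact and dimension tables shown on the slides; the table names (`sales_fact`, `product`, `product_category`, `region`) and the query are choices made for this sketch. Prod_ID 1218 appears in the fact rows but not in the sample product dimension, so the inner join leaves those rows out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product_category (Product_Category_ID INTEGER PRIMARY KEY, Product_Category TEXT);
CREATE TABLE product (Prod_ID INTEGER PRIMARY KEY, Product_Name TEXT, Product_Category_ID INTEGER);
CREATE TABLE region (City_ID INTEGER PRIMARY KEY, City TEXT, Region TEXT, Country TEXT);
CREATE TABLE sales_fact (City_ID INTEGER, Prod_ID INTEGER, Month TEXT, Units INTEGER, Rupees REAL);
""")

conn.executemany("INSERT INTO product_category VALUES (?, ?)", [(1, "Bread"), (2, "Cookies")])
conn.executemany("INSERT INTO product VALUES (?, ?, ?)", [
    (589, "Wheat Bread", 1), (590, "White Bread", 1), (288, "Coconut Cookies", 2),
])
conn.executemany("INSERT INTO region VALUES (?, ?, ?, ?)", [
    (1, "Mumbai", "West", "India"), (2, "Pune", "NorthWest", "India"),
])
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?, ?)", [
    (1, 589, "1/1/1998", 3, 7.95),
    (1, 1218, "1/1/1998", 4, 7.32),
    (2, 589, "1/1/1998", 3, 7.95),
    (2, 1218, "1/1/1998", 4, 7.32),
    (1, 589, "2/1/1998", 16, 42.40),
])

# Join the fact table to its dimensions and roll units up to
# city and product category.
units_by_city_category = conn.execute("""
    SELECT r.City, c.Product_Category, SUM(f.Units)
    FROM sales_fact AS f
    JOIN region AS r ON r.City_ID = f.City_ID
    JOIN product AS p ON p.Prod_ID = f.Prod_ID
    JOIN product_category AS c ON c.Product_Category_ID = p.Product_Category_ID
    GROUP BY r.City, c.Product_Category
    ORDER BY r.City
""").fetchall()
print(units_by_city_category)
```

The fact table stores only keys and measures; names, categories and regions live once in the dimension tables and are attached at query time.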

Online Analytical Processing (OLAP)


It enables analysts, managers and executives to gain insight into data through fast, consistent, interactive access to a wide variety of possible views of information that has been transformed from raw data to reflect the real dimensionality of the enterprise as understood by the user.
Figure: the data warehouse viewed along the Product and Time dimensions.

OLAP Cube
City | Product | Time | Units | Dollars
All | All | All | 113 | 251.26
Mumbai | All | All | 64 | 146.07
Mumbai | White Bread | All | 38 | 98.49
Mumbai | Wheat Bread | All | 13 | 32.24
Mumbai | Wheat Bread | Qtr1 | 3 | 7.44
Mumbai | Wheat Bread | March | 3 | 7.44
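A cube of this kind can be built by pre-aggregating every fact row at each level, with "All" standing for a fully summarized dimension. The sketch below uses the unit figures from the city-level sales report of the case study; the `build_cube` helper and the month-to-quarter mapping are assumptions, and the White Bread row of the cube has no counterpart in that report, so it is not reproduced.

```python
from collections import defaultdict
from itertools import product as cartesian

# (city, product, month, units) from the city-level sales report.
SALES = [
    ("Mumbai", "Wheat Bread", "Mar", 3), ("Mumbai", "Wheat Bread", "Apr", 10),
    ("Mumbai", "Cheese", "Jan", 3), ("Mumbai", "Cheese", "Feb", 16),
    ("Mumbai", "Cheese", "Mar", 6), ("Mumbai", "Swiss Rolls", "Jan", 4),
    ("Mumbai", "Swiss Rolls", "Feb", 16), ("Mumbai", "Swiss Rolls", "Mar", 6),
    ("Pune", "Wheat Bread", "Mar", 3), ("Pune", "Wheat Bread", "Apr", 7),
    ("Pune", "Cheese", "Jan", 3), ("Pune", "Cheese", "Apr", 8),
    ("Pune", "Swiss Rolls", "Jan", 4), ("Pune", "Swiss Rolls", "Feb", 9),
    ("Pune", "Swiss Rolls", "Mar", 15),
]
# Assumed calendar quarters for the Time hierarchy.
QUARTER = {"Jan": "Qtr1", "Feb": "Qtr1", "Mar": "Qtr1", "Apr": "Qtr2"}


def build_cube(rows):
    """Add each row's units to every cell it belongs to, at every level."""
    cube = defaultdict(int)
    for city, prod, month, units in rows:
        levels = cartesian((city, "All"), (prod, "All"), (month, QUARTER[month], "All"))
        for cell in levels:
            cube[cell] += units
    return dict(cube)


cube = build_cube(SALES)
for cell in [
    ("All", "All", "All"),
    ("Mumbai", "All", "All"),
    ("Mumbai", "Wheat Bread", "All"),
    ("Mumbai", "Wheat Bread", "Qtr1"),
    ("Mumbai", "Wheat Bread", "Mar"),
]:
    print(cell, cube[cell])
```

Because the totals are computed once when the cube is built, a query for any level of summary becomes a single lookup.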

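OLAP Operations
Drill Down
Figure: the product hierarchy traversed downward against Time, from Product Category (e.g. Electrical Appliance) to Sub Category (e.g. Kitchen) to Product (e.g. Toaster).

OLAP Operations
Drill Up
Figure: the same hierarchy traversed upward against Time, from Product (e.g. Toaster) to Sub Category (e.g. Kitchen) to Product Category (e.g. Electrical Appliance).

OLAP Operations
Slice and Dice
Figure: the Product by Time view restricted to Product = Toaster.

OLAP Operations
Pivot
Figure: the Product by Time view rotated to a Product by Region view.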
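The four operations can be imitated on a small in-memory data set. The hierarchy (Electrical Appliance, Kitchen, Toaster) follows the slides; all other products, quarters, regions and unit counts are made up, and the helper names `aggregate`, `slice_rows` and `dice_rows` are hypothetical. This is a sketch of the concepts, not of how an OLAP server implements them.

```python
from collections import defaultdict

COLUMNS = ("category", "sub_category", "product", "quarter", "region")
# Hypothetical rows: the five columns above followed by units sold.
ROWS = [
    ("Electrical Appliance", "Kitchen", "Toaster", "Q1", "West", 5),
    ("Electrical Appliance", "Kitchen", "Toaster", "Q2", "West", 7),
    ("Electrical Appliance", "Kitchen", "Toaster", "Q1", "North", 2),
    ("Electrical Appliance", "Kitchen", "Mixer", "Q1", "West", 4),
    ("Electrical Appliance", "Laundry", "Iron", "Q2", "North", 6),
]


def aggregate(rows, dims):
    """Sum units for every combination of the chosen dimensions."""
    idx = [COLUMNS.index(d) for d in dims]
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[i] for i in idx)] += row[5]
    return dict(totals)


def slice_rows(rows, dim, value):
    """Slice: fix one dimension to a single value."""
    i = COLUMNS.index(dim)
    return [r for r in rows if r[i] == value]


def dice_rows(rows, **allowed):
    """Dice: keep rows whose values fall inside a set for each named dimension."""
    return [r for r in rows
            if all(r[COLUMNS.index(d)] in values for d, values in allowed.items())]


by_category = aggregate(ROWS, ["category", "quarter"])          # most summarized
by_sub_category = aggregate(ROWS, ["sub_category", "quarter"])  # drill down one level
by_product = aggregate(ROWS, ["product", "quarter"])            # drill down again
# Drill up is the reverse: go back from by_product to by_category.

toasters = slice_rows(ROWS, "product", "Toaster")
west_q1 = dice_rows(ROWS, region={"West"}, quarter={"Q1"})
toasters_by_region = aggregate(toasters, ["product", "region"])  # pivot: Time axis swapped for Region

print(by_category)
print(toasters_by_region)
```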

OLAP Server
An OLAP server is a high-capacity, multi-user data manipulation engine specifically designed to support and operate on multidimensional data structures. The available types of OLAP server are:
- MOLAP server
- ROLAP server
- HOLAP server

Presentation
Figure: a reporting tool turns the Product by Time data into a report.

Data Warehousing includes
- Building the data warehouse
- Online analytical processing (OLAP)
- Presentation
Figure: RDBMS and flat file sources go through cleaning, selection and integration into the warehouse and OLAP server, which serves the presentation client.

Need for Data Warehousing


Industry has a huge amount of operational data. Knowledge workers want to turn this data into useful information, which they use to support strategic decision making.

Need for Data Warehousing (contd..)


It is a platform for consolidated historical data for analysis. It stores data of good quality so that knowledge workers can make correct decisions.

Need for Data Warehousing (contd..)


From a business perspective:
- it is the latest marketing weapon
- it helps to keep customers by learning more about their needs
- it is a valuable tool in today's competitive, fast-evolving world

Data Warehousing Tools


Data Warehouse: SQL Server 2000 DTS, Oracle 8i Warehouse Builder
OLAP tools: SQL Server Analysis Services, Oracle Express Server
Reporting tools: MS Excel Pivot Chart, VB Applications
