0% found this document useful (0 votes)
11 views

Chapter 2

Uploaded by

Syed Kazmi
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
11 views

Chapter 2

Uploaded by

Syed Kazmi
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 5

Data Warehousing (CS614)

Lecture 2: Introduction to Data Ware Housing – Part II

Learning Goals
• Data recording and storage is growing.
• History is excellent predictor of the future.
• Gives total view of the organization.
• Data recording and storage is growing.
• Intelligent decision-support is required for decision-making.

2.1 Why a Data Warehouse (DWH)?

Moore’s law on increase in performance of CPUs and decrease in cost has been surpassed
by the increase in storage space and decrease in cost. Meaning, it is true that the cost of
CPUs is going down and the performance is going up, but this is applicable at a higher
rate to storage space and cost i.e. more and more cheap storage space is becoming
available as compared to fast CPUs.

As you would have experienced, when you (or your father’s) briefcase seems to be small
as compared to the contents carried in it, it seems a good idea to buy a new and larger
briefcase. However, after sometime the new briefcase too seems to be small for the
contents carried. On the practical side, it has been noted that the amount of data recorded
in an organization doubles every year and this is an exponential increase.

Reason-1: Data Sets are growing

How Much Data is that?


1 MB 220 or 106 bytes Small novel – 31/2 Disk
Paper rims that could fill the back of a pickup
1 GB 230 or 109 bytes
van
50,000 trees chopped and converted into
1 TB 240 or 1012 bytes
paper and printed

2 PB 1 PB = 250 or 1015 bytes Academic research libraries across the U.S.

5 EB 1 EB = 260 or 1018 bytes All words ever spoken by human beings

Table-2.1: Quantifying size of data

Size of Data Sets are going up ↑.


Cost of data storage is coming down ↓.

Total hardware and software cost to store and manage 1 Mbyte of data
1990: ~ $15
2002: ~ ¢15 (Down 100 times)
By 2007: < ¢1 (Down 150 times)

© Copyright Virtual University of Pakistan 15


Data Warehousing (CS614)

A Few Examples
WalMart: 24 TB (Tera Byte)
France Telecom: ~ 100 TB
CERN: Up to 20 PB by 2006 (Peta Byte)
Stanford Linear Accelerator Center (SLAC): 500TB

A Ware House of Data is NOT a Data Warehouse

Someone says I have a data set of size 1 GB so I have a DWH can you beat this?
Someone else says, I have a data set of size 100 GB, can you beat this?
Someone else says, I have a 1 TB data set, who can beat this?

Who has a data warehouse? Not enough information, it is much more than just the size, it
is a whole concept, it is NOT a shrink wrapped solution, it evolves. A company may have
a TB of data and not have a data warehouse; while on the other hand, a company may
have 500 GB of data and have a fully functional data warehouse.

Size is NOT Everything

History is excellent predictor of the future


Secondly as I mentioned earlier the data warehouse has the historical data. And one thing
that we have learned by using information is that, “past is the best predictor of the future”.
You use historical data, because it gives you an insight into how the environment is
changing. Also you must have heard that “history repeats itself”, however this repetition
of history is not likely to be constant for all businesses or all events. Note that you just
can’t use the historical data to predict the future; you have to have to bring your own
insight and experience to interpret how the environment is changing in order to predict
the future accurately and meaningfully.

Gives total view of the organization


So why would you want data warehouse in your organization? First of all a data
warehouse gives a total view of an organization. If you look at the operational system i.e.
the databases in most environments, the databases are designed around different lines of
business. Consider the case of a Bank; a bank will typically have current accounts and
savings accounts, foreign currency account etc. The bank will have an MIS system for
leasing, and another system for managing credit cards and another system for every
different kind of business they are in. However, nowhere they have the total view of the
environment from the customer’s perspective. The reason being, transaction processing
systems are typically designed around functional areas, within a business environment.
For good decision making you should be able to integrate the data across the organization
so as to cross the LoB (Line of Business). So the idea here is to give the total view of the
organization especially from a customer’s perspective within the data warehouse, as
shown in Figure-2.1

© Copyright Virtual University of Pakistan 16


Data Warehousing (CS614)

Leasing ATM
Savings
Account

DATA WAREHOUSE

Checking Account
Credit Card

Figure-2.1: A Data Warehouse crosses the LoB

Intelligent decision-support is required for decision-making


Consider a bank which is losing customers, for reasons not known. However, one thing is
for sure that the bank is losing business because of lost customers. Therefore, it is
important, actually critical to understand which customers have left and why they have
left. This will give you the ability to predict going forward (in time), to identify which
customers will leave you (i.e. the bank). We are going to talk about this in the course
using data mining algorithms, like clustering, classification, regression analysis etc.
However, this being another example of using historical data to predict the future. So I
can predict today, which customers will leave me in the next 3 months before they even
leave. There can be, and there are whole courses on data mining, but we will just have an
applied overview of data mining in this course.

Reason-2: Businesses demand intelligence

Complex questions from integrated data.


“Intelligent Enterprise”

DBMS Approach Intelligent Enterprise


List of all items that were sold last month? Which items sell together? Which items
to stock?
List of all items purchased by Khizar?
Where and how to place the items? What
The total sales of the last month grouped discounts to offer?
by branch?
How best to target customers to increase
How many sales transactions occurred sales at a branch?
during the month of January?
Which customers are most likely to
respond to my next promotional campaign,
and why?

Table-2.2: Comparison of queries


Let’s take a close look at the typical queries for a DBMS. They are either about listing the
contents of tables or running aggregates of values i.e. rather simple and straightforward

© Copyright Virtual University of Pakistan 17


Data Warehousing (CS614)

queries and fairly easy to program. The queries follow rather pre-defined paths into the
database and are unlikely to come up with something new or abnormal.

Reason-3: Businesses want much more…

1. What happened?
2. Why it happened?
3. What will happen?
4. What is happening?
5. What do you want to happen?

These questions primarily point to what is called as the different stages of a Data
Warehouse i.e. starting from the first stage, and going all the way to stage 5. The first
stage is not actually a data warehouse, but a pure batch processing system. Note that as
the stages evolve the amount of batching processing decreases, this being maximum in
the first stage and minimum in the last or 5th stage. At the same time the amount of ad-hoc
query processing increases. Finally in the most developed stage there is a high level of
event based triggering. As the system moves from stage-1 to stage-5 it becomes what is
called as an active data warehouse.

2.2 What is a DWH?

A complete repository of historical corporate data extracted from transaction systems


that is available for ad-hoc access by knowledge workers

The other key points in this standard definition that I have also underlined and listed
below are:

Complete repository
• All the data is present from all the branches/outlets of the business.
• Even the archived data may be brought online.
• Data from arcane and old systems is also brought online.

Transaction System
• Management Information System (MIS)
• Could be typed sheets (NOT transaction system)

Ad-Hoc access
• Does not have a certain predefined database access pattern.
• Queries not known in advance.
• Difficult to write SQL in advance.

Knowledge workers
• Typically NOT IT literate (Executives, Analysts, Managers).
• NOT clerical workers.
• Decision makers.

The users of data warehouse are knowledge workers in other words they are decision
makers in the organization. They are not the clerical people entering the data or

© Copyright Virtual University of Pakistan 18


Data Warehousing (CS614)

overseeing the transactions etc or doing programming or performing system


design/analysis. These are really decision makers in the organization like General
Manager Marketing, or Executive Director or CEO (Chief Operating Officer). Typically
those decision makers are people in areas like marketing, finance and strategic planning
etc.

Completeness: There is a misnomer here, about completeness. As per the standard


definition a data warehouse is a complete repository of corporate data. The reality is that
it can never be complete. We will discuss this in detail very shortly.

Transaction System: Unlike databases where data is directly entered, the input to the
data warehouse can come from OLTP or transactional systems or other third party
databases. This is not a rule, the data could come from typed or even hand filled sheets, as
was the case for the census data warehouse.

Ad-Hoc access: It dose not have a certain repeatable pattern and it’s not known in
advance. Consider financial transactions like a bank deposit, you know exactly what
records will be inserted deleted or updated. That’s in OLTP system and in ERP system.
But in a data warehouse there are really no fixed patterns. Say the marketing person, just
sits down and thinks about what questions he/she has about customers and there
behaviors and so on and they are typically using some tool to generate SQL dynamically
and then that SQL gets executed and that you don’t know in advance.

Although there may be some patterns of queries, but they are really not very predictable
and the query patterns may change over time. Hence there are no predefined access paths
into the database. That’s why relational databases are so important for the data
warehouse, because relational databases allow you to navigate the data in any direction
that is appropriate using the primary, foreign key structure within the data model.
Meaning, using a data warehouse, does not implies that we just forget about databases.

2.3 Another view of a DWH

Subject
Oriented

Integrated

Time
Variant

Non
Volatile
Figure-2.2: Another view of a Data Warehouse

© Copyright Virtual University of Pakistan 19

You might also like