Chapter 2
Learning Goals
• Data recording and storage is growing.
• History is an excellent predictor of the future.
• Gives a total view of the organization.
• Intelligent decision-support is required for decision-making.
The growth in storage capacity and the fall in storage cost have outpaced Moore’s law,
which describes the increase in CPU performance and the decrease in CPU cost. CPUs are
indeed getting cheaper and faster, but storage is getting cheaper and larger at a higher
rate, i.e. cheap storage space is becoming available faster than fast CPUs.
As you may have experienced, when your (or your father’s) briefcase seems too small
for the contents carried in it, it seems a good idea to buy a new and larger
briefcase. However, after some time the new briefcase also seems too small for its
contents. On the practical side, it has been noted that the amount of data recorded
in an organization doubles every year, which is an exponential increase.
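The doubling rule above can be made concrete with a little arithmetic. The following is a minimal sketch, assuming a hypothetical starting volume of 1 TB; the function name and the starting figure are illustrative and not taken from any real organization.

```python
def projected_size_tb(initial_tb: float, years: int) -> float:
    """Data volume after `years` years if the volume doubles every year."""
    return initial_tb * 2 ** years


# Hypothetical starting point: 1 TB of recorded data today.
for year in range(6):
    print(f"year {year}: {projected_size_tb(1.0, year):.0f} TB")
```

Under this assumption the volume after five years is 32 times the starting volume, which is why capacity that looks generous today tends to run out quickly.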
Total hardware and software cost to store and manage 1 MB of data:
1990: ~ $15
2002: ~ 15¢ (down 100 times)
By 2007: < 1¢ (down 150 times)
A Few Examples
WalMart: 24 TB (terabytes)
France Telecom: ~ 100 TB
CERN: Up to 20 PB (petabytes) by 2006
Stanford Linear Accelerator Center (SLAC): 500 TB
Someone says: I have a data set of size 1 GB, so I have a DWH; can you beat this?
Someone else says: I have a data set of size 100 GB; can you beat this?
Someone else says: I have a 1 TB data set; who can beat this?
Who has a data warehouse? There is not enough information to tell. A data warehouse is
much more than just size: it is a whole concept, it is NOT a shrink-wrapped solution, and
it evolves. A company may have a TB of data and not have a data warehouse, while another
company may have 500 GB of data and a fully functional data warehouse.
[Figure: the DATA WAREHOUSE shown together with the Leasing, ATM, Savings Account, Checking Account and Credit Card systems]
queries and fairly easy to program. The queries follow rather pre-defined paths into the
database and are unlikely to come up with something new or abnormal.
1. What happened?
2. Why did it happen?
3. What will happen?
4. What is happening?
5. What do you want to happen?
These questions point to what are called the different stages of a Data Warehouse,
starting from the first stage and going all the way to stage 5. The first stage is not
actually a data warehouse but a pure batch processing system. Note that as the stages
evolve, the amount of batch processing decreases: it is at its maximum in the first stage
and at its minimum in the fifth and last stage. At the same time, the amount of ad-hoc
query processing increases. Finally, in the most developed stage there is a high level of
event-based triggering. As the system moves from stage 1 to stage 5, it becomes what is
called an active data warehouse.
The other key points in this standard definition that I have also underlined and listed
below are:
Complete repository
• All the data is present from all the branches/outlets of the business.
• Even the archived data may be brought online.
• Data from arcane and old systems is also brought online.
Transaction System
• Management Information System (MIS)
• Could be typed sheets (NOT transaction system)
Ad-Hoc access
• Does not have a certain predefined database access pattern.
• Queries not known in advance.
• Difficult to write SQL in advance.
Knowledge workers
• Typically NOT IT literate (Executives, Analysts, Managers).
• NOT clerical workers.
• Decision makers.
The users of a data warehouse are knowledge workers; in other words, they are the decision
makers in the organization. They are not the clerical people entering the data or
Transaction System: Unlike databases, where data is entered directly, the input to the
data warehouse can come from OLTP or transactional systems or from other third-party
databases. This is not a rule; the data could also come from typed or even hand-filled
sheets, as was the case for the census data warehouse.
Ad-Hoc access: It does not have a certain repeatable pattern, and it is not known in
advance. Consider a financial transaction like a bank deposit: you know exactly what
records will be inserted, deleted or updated. That is the case in an OLTP system and in
an ERP system. But in a data warehouse there are really no fixed patterns. Say a marketing
person just sits down and thinks about what questions he/she has about customers, their
behaviors and so on. Such users are typically using some tool to generate SQL dynamically,
and then that SQL gets executed; that SQL you don’t know in advance.
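To illustrate what "a tool generating SQL dynamically" might look like, here is a minimal sketch using Python's built-in sqlite3 module. The `sales` table, its columns and the `build_query` helper are all hypothetical, invented for this illustration; real query tools are far more elaborate.

```python
import sqlite3

# Hypothetical columns the user is allowed to pick from.
DIMENSIONS = {"region", "product", "year"}
MEASURES = {"amount"}


def build_query(group_by: list[str], measure: str) -> str:
    """Build an aggregate query from choices the user makes at run time."""
    if not group_by or not set(group_by) <= DIMENSIONS or measure not in MEASURES:
        raise ValueError("unknown column requested")
    cols = ", ".join(group_by)
    return (
        f"SELECT {cols}, SUM({measure}) AS total "
        f"FROM sales GROUP BY {cols} ORDER BY {cols}"
    )


# Small in-memory table with made-up rows, only to run the generated SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, year INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [("North", "A", 2001, 10.0), ("North", "B", 2001, 5.0), ("South", "A", 2002, 7.0)],
)

# The SQL text exists only after the user has chosen what to ask.
sql = build_query(["region"], "amount")
print(sql)
print(conn.execute(sql).fetchall())
```

The point of the sketch is that the query text is assembled only when the question is asked, so nobody could have written or tuned it in advance. Column names are checked against a fixed set because they cannot be passed as bound parameters.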
There may be some patterns of queries, but they are really not very predictable, and
the query patterns may change over time. Hence there are no predefined access paths
into the database. That’s why relational databases are so important for the data
warehouse: relational databases allow you to navigate the data in any direction that is
appropriate, using the primary key/foreign key structure within the data model.
In other words, using a data warehouse does not imply that we just forget about databases.
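Navigating "in any direction" through primary and foreign keys can be sketched as follows, again with Python's built-in sqlite3 module. The `customer` and `account` tables and their rows are hypothetical and exist only for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE account (
    account_id  INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customer(customer_id),
    kind        TEXT
);
INSERT INTO customer VALUES (1, 'Ali'), (2, 'Sara');
INSERT INTO account VALUES (10, 1, 'Savings'), (11, 1, 'Credit Card'), (12, 2, 'Checking');
""")

# Direction 1: start from a customer and find the accounts they hold.
accounts_of_ali = conn.execute(
    "SELECT a.kind FROM customer c "
    "JOIN account a ON a.customer_id = c.customer_id "
    "WHERE c.name = ? ORDER BY a.kind",
    ("Ali",),
).fetchall()

# Direction 2: start from an account type and find the customers holding it.
holders_of_checking = conn.execute(
    "SELECT c.name FROM account a "
    "JOIN customer c ON c.customer_id = a.customer_id "
    "WHERE a.kind = ?",
    ("Checking",),
).fetchall()

print(accounts_of_ali)
print(holders_of_checking)
```

Both queries use the same key relationship; only the starting table differs. Neither direction had to be planned for when the tables were designed.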
[Figure labels: Subject Oriented, Integrated, Time Variant, Non Volatile]
Figure-2.2: Another view of a Data Warehouse