DWM Unit 1
1.1 Introduction
• Operational applications include order processing, general ledger, inventory, human resources, payroll,
in-patient billing, checking accounts, insurance claims, and so on.
• These applications are important systems that run businesses.
• They gather, store, and process all the data needed to successfully perform the daily routine
operations.
• They provide online information and produce a variety of reports to monitor and run the business.
• The operational computer systems did provide information to run the day-to-day operations but
what the executives needed were different kinds of information that could be used readily
to make strategic decisions.
• The decision makers wanted to know which geographic regions to focus on, which product
lines to expand, and which markets to strengthen.
• They needed the type of information with proper content and format that could help them
make such strategic decisions. We may call this type of information strategic information as
different from operational information. The operational systems, important as they were, could
not provide strategic information.
Businesses, therefore, were compelled to turn to new ways of getting strategic information.
Data warehousing is a new paradigm specifically intended to provide vital strategic information.
Figure 1-1 shows a sample of strategic areas where data warehousing had already produced results in
different industries.
Here are some examples of business objectives:
• Retain the present customer base
• Increase the customer base by 15% over the next 5 years
• Improve product quality levels in the top five product groups
For making decisions about these objectives, executives and managers need information for the
following purposes:
• to get in-depth knowledge of their company’s operations,
• review and monitor key performance indicators and note how these affect one another,
• keep track of how business factors change over time,
• compare their company’s performance relative to the competition and to industry
benchmarks.
Executives and managers need to focus their attention on:
• customers’ needs and preferences,
• emerging technologies,
• sales and marketing results,
• quality levels of products and services.
Strategic information is far more important for the continued health and survival of the
corporation.
Figure 1-2 lists the desired characteristics of strategic information.
History of Decision Support Systems
Suppose the marketing department of a company is concerned about the performance of a particular region because the
sales numbers in that month's report are drastically low. The marketing manager wants a report from the IT department
that analyzes performance over the past two years, product by product, compared with monthly targets, so that he can
take quick strategic decisions to rectify the situation. There may be no regular report that provides what the
marketing department wants, so the IT department has to gather the data from multiple applications and build the
results from scratch.
Depending on the size and nature of the business, most companies have gone through the following
stages of attempts to provide strategic information for decision making.
Ad hoc Reports: This was the earliest stage. Users, especially from marketing and finance, would send
requests to IT for special reports. IT would write special programs, typically one for each request, and
produce the ad hoc reports.
Special Extract Programs: IT would write a suite of programs and run the programs periodically to
extract data from the various applications.
Small Applications: IT would create simple applications based on the extracted files. The users could
stipulate the parameters for each special report. The report printing programs would print the
information based on user-specific parameters.
Information Centers: The information center typically was a place where users could go to request ad
hoc reports or view special information on screens. These were predetermined reports or screens.
Decision-Support Systems: The systems were menu-driven and provided online information and also
the ability to print special reports.
Executive Information Systems: This was an attempt to bring strategic information to the executive
desktop. The main criteria were simplicity and ease of use. The system would display key information
every day and provide the ability to request simple, straightforward reports. However, only
preprogrammed screens and reports were available.
Inability to Provide Information
Figure 1-4 depicts the inadequate attempts by IT to provide strategic information.
Here are some of the factors relating to the inability to provide strategic information:
• IT receives too many ad hoc requests, resulting in a large overload. With limited resources, IT
is unable to respond to the numerous requests in a timely fashion.
• Requests are too numerous; they also keep changing all the time. The users need more reports to
expand and understand the earlier reports.
• The users find that they get into the spiral of asking for more and more supplementary reports,
so they sometimes adapt by asking for every possible combination, which only increases the IT
load even further.
• The users have to depend on IT to provide the information. They are not able to access the
information themselves interactively.
• The information environment ideally suited for strategic decision-making has to be very flexible
and conducive for analysis. IT has been unable to provide such an environment.
5. Operational vs. Decision support system
The following table summarizes the differences between traditional operational systems and the
newer decision support (informational) systems that need to be built.
No.  Attribute                  Operational Systems                   Decision Support Systems
1    Data content               Current value                         Archived, summarized, derived
2    Data structure             Optimized for transactions            Optimized for complex queries
3    Access frequency           High                                  Medium to low
4    Access type                Read, update, delete                  Read
5    Usage                      Predictable, repetitive               Ad hoc, random
6    Response time              Sub-seconds                           Several seconds to minutes
7    User number                Large numbers                         Relatively small number
8    Characteristic             Operational processing                Informational processing
9    Orientation                Transaction                           Analysis
10   Users                      Clerk, DBA, database professional     Executives, managers, business executives
11   Function                   Day-to-day operations                 Long-term informational requirements
12   Database design            ER based, application oriented        Star/snowflake, subject oriented
13   Summarization              Primitive, highly detailed            Summarized, consolidated
14   View                       Detailed, flat relational             Summarized, multidimensional
15   Unit of work               Short, simple transaction             Complex query
16   Records accessed           Tens                                  Millions
17   Database size              100 MB to GB                          100 GB to TB
18   Priority                   High performance, high availability   High flexibility, end-user autonomy
19   Indexes                    Few                                   Many
20   Joins                      Many                                  Some
21   Duplicated data            Normalized DBMS                       Denormalized DBMS
22   Derived data & aggregates  Rare                                  Common
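The contrast in access type and unit of work can be illustrated with a small sketch. It uses Python's built-in sqlite3 module with an in-memory database; the orders table, its columns, and the figures are hypothetical and chosen only for illustration. A real warehouse would hold the analytical data in a separate store, which this sketch does not attempt to show.

```python
import sqlite3

# In-memory database with a hypothetical orders table (illustration only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, region TEXT, amount REAL, status TEXT)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [(1, "East", 100.0, "open"), (2, "East", 250.0, "open"), (3, "West", 75.0, "open")],
)

# Operational-style unit of work: short, touches one record, updates it.
conn.execute("UPDATE orders SET status = 'shipped' WHERE order_id = ?", (2,))

# Decision-support-style query: read-only, scans many records, summarizes them.
sales_by_region = dict(
    conn.execute("SELECT region, SUM(amount) FROM orders GROUP BY region").fetchall()
)
print(sales_by_region)
```

The update reads and writes a single row by key, while the summary query reads every row and returns derived totals, which matches rows 4, 15, and 22 of the table.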
1.3 Data Warehouse Defined
Definition of Data Warehouse
A data warehouse (DW) is a subject-oriented, integrated, time-variant, non-volatile collection of data
that is used primarily in organizational decision-making.
Reason for developing data warehouse
• The database designs of operational systems were not optimized for information analysis and
strategic decision making.
• The processing load of reporting affected the response time of operational systems.
• Most big organizations had a number of operational systems, so enterprise-wide reporting
could not be supported from a single system.
As a result, separate databases were built that were specifically designed to support management
information and analysis purposes.
The data warehouse is an informational environment that:
• Provides an integrated and total view of the enterprise.
• Makes the enterprise’s current and historical information easily available for strategic
decision making.
• Makes decision-support transactions possible without hindering (delaying) operational systems.
• Renders the organization’s information consistent.
• Presents a flexible and interactive source of strategic information.
1. What can a Data Warehouse Do? [Not in syllabus but read once]
2. What Can a Data Warehouse Not Do?
3. Data Warehouse: An Environment, Not a Product
4. A Blend of Many Technologies
1. What can a Data Warehouse Do?
1. Immediate information delivery:
Data warehouses reduce the time that elapses between a request for information and its actual
delivery to the users. For example, a sales report may previously have been produced once a
month, usually in the first week of the month. With a data warehouse, the same report can be
produced daily, enabling business analysts to exploit opportunities that would otherwise
have been missed.
2. Integration of data from within and outside the organization:
Data warehouses combine data from multiple sources. The data is collected from different
departments like sales, marketing, finance, and accounting. Besides this, data is also taken from
external sources like business magazines, news reports, surveys, etc.
3. Provides an insight into the future:
Data warehouses store large amounts of historical information that enables the decision makers
to analyze the prevailing trends in the market and produce goods according to the customers'
demands.
4. Enables users to look at the same data in different ways:
A data warehouse provides its users with tools for analyzing and manipulating data in many
different ways. It lets users drill down into detailed data with the click of a mouse, a task
that could otherwise have taken a few days with the traditional approach.
5. Provides freedom from the dependency on IT:
With data warehouses, users no longer have to depend on the availability of IT professionals
to answer their queries. A manager who needs an ad hoc report can produce it without the
assistance of a computer expert.
• Remove inconsistencies and transform the data.
• Store the data in formats suitable for easy access for decision making.
Different technologies are, therefore, needed to support these functions. Figure 1-9 shows how a data
warehouse is a blend of the many technologies needed for the various functions.
1.5 Features of DW
Subject-oriented Data
• Organized around major subjects, such as customer, product, sales
• Data warehouses are designed to help you analyze data; they do not focus on daily operations or
transaction processing.
• For example, to learn more about your company's sales data, you can build a warehouse that
concentrates on sales.
• Using this warehouse, you can answer questions like "Who was our best customer for this item
last year?" This ability to define a data warehouse by subject matter, sales in this case, makes
the data warehouse subject oriented.
• E.g. claims data are organized around the subject of claims and not by individual applications
of Auto Insurance and Workers’ Comp
• Provide a simple and concise view around particular subject issues by excluding data that
are not useful in the decision support process
Integrated Data
• Data warehouses must put data from disparate sources into a consistent format.
o relational databases, flat files, on-line transaction records
• Data cleaning and data transformation techniques are applied.
o Ensure consistency in naming conventions, encoding structures, attribute measures, etc.
among different data sources
• When they achieve this, they are said to be integrated.
Some of the items that would need to be standardized and made consistent:
• Naming conventions
• Codes
• Data attributes
• Measurements
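As a minimal sketch of what standardizing these items can look like, the example below maps records from two source systems onto one set of names, codes, and units. The source systems, field names, code values, and sample records are all hypothetical; they only illustrate the idea and are not taken from any particular tool.

```python
# Hypothetical records from two source systems that describe customers.
billing_rows = [{"cust_nm": "Asha Rao", "sex": "F", "height_in": 65.0}]
crm_rows = [{"customer_name": "Ravi Shah", "gender": "male", "height_cm": 180.0}]

# One agreed encoding for gender, whatever the source uses.
GENDER_CODES = {"f": "F", "female": "F", "m": "M", "male": "M"}

def from_billing(row):
    """Map a billing record onto the warehouse's standard names, codes, and units."""
    return {
        "customer_name": row["cust_nm"],                  # naming convention
        "gender": GENDER_CODES[row["sex"].lower()],       # code
        "height_cm": round(row["height_in"] * 2.54, 1),   # measurement: inches to cm
    }

def from_crm(row):
    """Map a CRM record onto the same standard layout."""
    return {
        "customer_name": row["customer_name"],
        "gender": GENDER_CODES[row["gender"].lower()],
        "height_cm": row["height_cm"],
    }

integrated = [from_billing(r) for r in billing_rows] + [from_crm(r) for r in crm_rows]
for record in integrated:
    print(record)
```

After the mapping, every record has the same field names, the same gender codes, and heights in the same unit, regardless of which source it came from.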
Time-Variant Data
• In order to discover trends in business, analysts need large amounts of data.
• The data are kept for many years so they can be used for trends, forecasting, and comparisons
over time.
• A data warehouse's focus on change over time is what is meant by the term time variant.
• The time horizon for the data warehouse is significantly longer than that of operational systems
o Operational database: current value data
o Data warehouse data: provide information from a historical perspective (e.g., past 5-10
years)
• Every key structure in the data warehouse
o Contains an element of time, explicitly or implicitly
Non-volatile Data
• A physically separate store of data transformed from the operational environment
• Operational update of data does not occur in the data warehouse environment
o Does not require transaction processing, recovery, and concurrency control mechanisms
o Requires only two operations in data accessing:
▪ initial loading of data and access of data
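The two-operation access pattern can be sketched as a store that exposes only load and read. The class name WarehouseTable and the sample rows are hypothetical; this is an illustration of the idea under those assumptions, not how any particular warehouse product is implemented.

```python
class WarehouseTable:
    """Append-only store: rows can be loaded and read, never updated in place."""

    def __init__(self):
        self._rows = []

    def load(self, rows):
        # Loading appends immutable copies of the incoming rows.
        self._rows.extend(tuple(sorted(row.items())) for row in rows)

    def read(self, predicate=lambda row: True):
        # Access returns fresh dicts, so callers cannot alter the stored data.
        return [dict(row) for row in self._rows if predicate(dict(row))]


# Hypothetical daily sales rows.
sales = WarehouseTable()
sales.load([{"day": "2024-01-01", "amount": 120}, {"day": "2024-01-02", "amount": 80}])

big_days = sales.read(lambda r: r["amount"] > 100)
print(big_days)
```

Because there is no update or delete method, the class needs none of the transaction, recovery, or concurrency machinery that an operational database requires.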
Data Granularity
• Data granularity refers to the level of detail.
• Depending on the requirements, multiple levels of detail may be present. Many data
warehouses have at least dual levels of granularity.
• Depending on the query, we can then go to the particular level of detail and satisfy the query
• More levels of granularity require more storage.
• Decide on the granularity levels based on the data types and the expected system performance
for queries.
Following are examples of data granularity in a typical data warehouse.
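As a minimal sketch of dual levels of granularity, the example below keeps daily detail rows and derives a monthly summary from them. The products, dates, and unit counts are hypothetical.

```python
from collections import defaultdict

# Hypothetical daily detail rows: (date, product, units sold).
daily_detail = [
    ("2024-01-05", "widget", 3),
    ("2024-01-19", "widget", 4),
    ("2024-02-02", "widget", 5),
    ("2024-01-07", "gadget", 2),
]

def summarize_monthly(rows):
    """Roll daily detail up to one row per (month, product)."""
    totals = defaultdict(int)
    for date, product, units in rows:
        month = date[:7]  # "YYYY-MM" prefix of an ISO date
        totals[(month, product)] += units
    return dict(totals)

monthly_summary = summarize_monthly(daily_detail)
print(monthly_summary)
```

A query about monthly totals can be answered from the smaller summary, while a query about a specific day goes to the detail rows. Keeping both levels costs extra storage, which is the trade-off noted above.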
1.6 Information Flow Mechanism
• Remove inconsistencies in common data. For example, a customer's status may be married in one table
and single in another. Resolve the inconsistency by determining the true value from the records.
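The notes do not say how the true value is determined. One possible rule, shown here purely as an assumption, is to trust the most recently updated record. The records and the "updated" field are hypothetical.

```python
# Hypothetical conflicting records for the same customer from two source tables.
records = [
    {"customer_id": 7, "marital_status": "single", "updated": "2019-03-01"},
    {"customer_id": 7, "marital_status": "married", "updated": "2023-08-15"},
]

def resolve(rows, field):
    """Pick the value from the most recently updated record (one possible rule)."""
    latest = max(rows, key=lambda r: r["updated"])  # ISO dates sort correctly as text
    return latest[field]

status = resolve(records, "marital_status")
print(status)
```

Other rules are equally plausible, such as preferring a designated system of record; the right choice depends on the business.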
Data Extraction
• Deal with numerous data sources
• Tools for data extraction
– Purchasing outside tools (e.g. Abinitio, Actawork, RT, Mapforce etc)
– Developing in-house programs
• Extract the source data into
– a group of flat files,
– or a data-staging relational database,
– or a combination of both
Data Transformation
• Perform a number of individual tasks
– Clean
– Standardization
– Combine
– Purging and separating out
– Sorting and merging
– Assignment of surrogate keys
• Results: a collection of integrated data that is cleaned, standardized, and summarized
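Several of the tasks above can be sketched in a few lines. The example assumes two hypothetical extracted sources with the same layout, and it covers cleaning, standardization, combining, purging duplicates, sorting, and surrogate key assignment. Summarization is left out to keep the sketch short.

```python
from itertools import count

# Hypothetical extracted rows from two source systems.
source_a = [{"cust": " alice ", "state": "ny"}, {"cust": "Bob", "state": "CA"}]
source_b = [{"cust": "ALICE", "state": "NY"}, {"cust": "carol", "state": "tx"}]

def clean(row):
    # Clean and standardize: trim whitespace and use one casing convention.
    return {"cust": row["cust"].strip().title(), "state": row["state"].strip().upper()}

def transform(*sources):
    combined = [clean(row) for source in sources for row in source]  # combine
    unique = {(r["cust"], r["state"]) for r in combined}             # purge duplicates
    ordered = sorted(unique)                                         # sort and merge
    keys = count(1)
    # Assign surrogate keys that are independent of any source-system key.
    return [{"customer_key": next(keys), "cust": c, "state": s} for c, s in ordered]

customers = transform(source_a, source_b)
for row in customers:
    print(row)
```

The two spellings of Alice collapse into one row after cleaning, and each surviving row receives a warehouse-generated key.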
Data Loading
• Two distinct groups of tasks
– The initial loading of the data into the data warehouse
– Refresh cycles
• Extract the changes to the source data
• Transform the data revisions
• Feed the incremental data revisions on an ongoing basis
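The two groups of tasks can be sketched as follows. The example assumes the source can be compared as two keyed snapshots, which is only one of several ways to detect changes. The transformation step is omitted, deleted source rows are not handled, and all names and data are hypothetical.

```python
def extract_changes(previous, current):
    """Return source rows that are new or changed since the last extract."""
    return {key: row for key, row in current.items() if previous.get(key) != row}

def load(warehouse, rows, cycle):
    """Append rows to the warehouse, tagging each with its load cycle."""
    for key, row in sorted(rows.items()):
        warehouse.append({"key": key, "cycle": cycle, **row})

# Hypothetical source snapshots taken at two points in time.
snapshot_1 = {101: {"city": "Pune"}, 102: {"city": "Delhi"}}
snapshot_2 = {101: {"city": "Mumbai"}, 102: {"city": "Delhi"}, 103: {"city": "Goa"}}

warehouse = []
load(warehouse, snapshot_1, cycle=1)                               # initial load
load(warehouse, extract_changes(snapshot_1, snapshot_2), cycle=2)  # refresh cycle

for row in warehouse:
    print(row)
```

The refresh appends only the changed and new rows, so the earlier value for key 101 stays in the warehouse as history rather than being overwritten.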
5) Metadata Component
Metadata in a data warehouse is similar to the data dictionary or the data catalog in a database
management system. In the data dictionary, we keep the information about the logical data structures,
the information about the files and addresses, the information about the indexes, and so on. Similarly, the
metadata component of the warehouse stores all the information about the contents of the warehouse. Metadata is
therefore the source of information for the management module.
Example of Metadata for Client
Role of Metadata
Metadata in the data warehouse is similar to the data dictionary. The metadata component stores data
about the data. The metadata is often used for building, maintaining, and using the data warehouse. It is
the key to providing users and developers with a road map to the information in the warehouse.
The three main functions that metadata performs in a data warehouse are:
• Connects the different parts of the data warehouse, acting as the glue that holds all the parts together.
• Provides information about the contents of the data and its underlying structure to the data
warehouse administrator and other users.
• Enables the end-users to search for the desired data in their own business terms.
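The third function, searching in business terms, can be sketched with a small catalog. Every table name, column name, source, and description below is invented for illustration; real metadata repositories hold far more detail.

```python
# Hypothetical metadata catalog keyed by business term (all names are invented).
METADATA = {
    "monthly revenue": {
        "table": "sales_summary",
        "column": "revenue_amt",
        "source": "order processing system",
        "description": "Total invoiced amount per calendar month",
    },
    "customer region": {
        "table": "customer_dim",
        "column": "region_cd",
        "source": "CRM extract",
        "description": "Sales region assigned to the customer",
    },
}

def find(term):
    """Return catalog entries whose business term contains the search text."""
    term = term.lower()
    return {name: entry for name, entry in METADATA.items() if term in name}

matches = find("Revenue")
for name, entry in matches.items():
    print(name, "->", entry["table"] + "." + entry["column"])
```

The user searches for "Revenue" and is pointed to the physical table and column without needing to know their technical names.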
Classification of Metadata
Operational Metadata
• Contains the following information about the operational data sources
– Data for the data warehouse comes from several operational systems
– The data elements have various field lengths and data types
– We split records, combine parts of records from different source files, and deal with
multiple coding schemes and field lengths
End-User Metadata
• The navigational map
– Enable the end-users to find information
– Allow the end-users to use their own business terminology
1.8 Data Warehouses and Data Marts
In 1998, Bill Inmon stated, “The single most important issue facing the IT manager this year is
whether to build the data warehouse first or the data mart first.”
Before deciding to build a data warehouse, we need to ask:
• Top-down or bottom-up approach?
• Enterprise-wide or department?
• Which first: data warehouse or data mart?
• Build pilot or go with a full-fledged implementation?
• Dependent or independent data marts?
Top-Down Approach
The advantages of this approach are:
• A truly corporate effort, an enterprise view of data
• Inherently architected—not a union of disparate data marts
• Single, central storage of data about the content
• Centralized rules and control
• May see quick results if implemented with iterations
The disadvantages are:
• Takes longer to build even with an iterative method
• High exposure/risk to failure
• Needs high level of cross-functional skills
• High outlay without proof of concept
Bottom-Up Approach
The advantages of this approach are:
• Faster and easier implementation of manageable pieces
• Favourable return on investment and proof of concept
• Less risk of failure
• Inherently incremental; can schedule important data marts first
• Allows project team to learn and grow
Various approaches by different Authors
1. Approach by Han and Kamber: Three-Tier Data Warehouse Architecture
Data warehouses adopt a three-tier architecture.
These three tiers are the bottom tier, the middle tier, and the top tier.
Data Sources:
All the data related to a business organization is stored in operational databases, external files, and flat
files. These sources are application oriented; together they hold the complete data of the organization, such as
training details, customer details, sales, departments, transactions, and employee details. The data here is in
different formats (or host formats) and is often not well documented.
Data Warehouse:
It is an optimized form of the operational database that contains only relevant information and provides fast
access to data. It is subject oriented, integrated, time variant, and non-volatile.
Metadata repository:
It describes what is available in the data warehouse.
It contains:
▪ Structure of data warehouse
▪ Data names and definitions
▪ Source of extracted data
▪ Algorithm used for data cleaning purpose
▪ Sequence of transformations applied on data
▪ Data related to system performance
Data Marts
A data mart is a subset of the data warehouse that contains only a small slice of its data.
E.g., data pertaining to a single department
Two types of data marts: Dependent & Independent
▪ MOLAP Model: Presents data in array-based structures; that is, the data maps directly to data
cube array structures.
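The direct mapping to an array can be sketched with nested lists, where each dimension member corresponds to an array position. The dimensions, members, and sales figures are hypothetical, and real MOLAP servers use far more compact storage than this.

```python
# Dimension members map to array positions; names and figures are hypothetical.
products = ["widget", "gadget"]
regions = ["East", "West"]
months = ["Jan", "Feb", "Mar"]

# cube[p][r][m] holds the sales measure for one (product, region, month) cell.
cube = [[[0 for _ in months] for _ in regions] for _ in products]

def store(product, region, month, value):
    cube[products.index(product)][regions.index(region)][months.index(month)] = value

def cell(product, region, month):
    return cube[products.index(product)][regions.index(region)][months.index(month)]

store("widget", "East", "Jan", 120)
store("widget", "East", "Feb", 90)
store("widget", "West", "Jan", 60)

# Aggregating along a dimension is a sum over one array axis.
widget_jan_all_regions = sum(cell("widget", r, "Jan") for r in regions)
print(widget_jan_all_regions)
```

A cell lookup is pure index arithmetic with no join or search over rows, which is the property the array-based model relies on for fast multidimensional queries.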
2. Approach by Paulraj
a) Centralized corporate Data Warehouse
b) Independent Data Marts
c) Federated
d) Hub & Spoke
e) Data Mart Bus
c) Federated
Common data elements in the various data marts and even data warehouses that compose the federation are
integrated physically and logically, so the resultant output is a centralized data warehouse.
d) Hub-and-Spoke
A centralized data warehouse is present, along with data marts that depend on the enterprise data
warehouse for their data feed. Information can therefore be delivered from both the centralized data
warehouse and the dependent data marts.
The DSS (decision support system) engine, the DSS client, and the RDBMS server all run on one single machine.
This is not scalable: all the load falls on a single machine, so the performance of other applications is also
degraded.
Two-Tier Architecture
Client                      DW Server
* GUI/Presentation logic    * Data logic
* Query specification       * Data services
* Data analysis             * Meta data
* Report formatting         * File services
* Summarizing
* Data access
The data warehouse resides on a dedicated RDBMS server, and both the DSS engine and the DSS client reside on
client hardware. It utilizes existing legacy systems as database servers and requires minimal investment
in additional hardware and software.
Drawbacks:
• Limited scalability.
• Can't support a large number of online end users without additional modifications.
• Congestion problems may occur.
Three-Tier Architecture
Client                      Application/Data Mart Server    DW Server
* GUI/Presentation logic    * Summarizing                   * Data logic
* Query specification       * Filtering                     * Data services
* Data analysis             * Meta data                     * Meta data
* Report formatting         * Multidimensional view         * File services
* Data access               * Data access
Advantages:
• Scalable, but at an associated cost.
• As data is put close to users, response time is fast.
• The system is transparent: users need not be aware of where the data is stored or how complex the storage is.
• Network traffic is lower.
Disadvantages:
• The DSS engine is complex because for every query it has to find the location of the data on the servers.
• Additional costs are imposed because the warehouse must be maintained and data must be replicated to the
local servers.
• Local data administration is needed because the data design has to be controlled and optimized for
different queries.
Four-Tier Architecture:
Data is stored in a data warehouse that includes both a relational database and a cache of multidimensional
data (analysis server). After the response data is converted into web page format on the Internet server,
the data is returned to the client.
IT design principles
Scalable: Design it in a way which will allow it to support expansions or upgrades. You should be able
to adapt it to a number of different business situations.
Design: Design should fall under the guidelines of information technology standards.