Database Vs Data Warehouse


Dr R.K. Singla
Professor, Department of Computer Science
Panjab University, Chandigarh

Database & Data Warehousing


A data warehouse is a computer system designed for archiving and analyzing an
organisation's historical data, such as sales, salaries, or other information from day-to-day
operations. Normally, an organisation copies information from its operational systems
(such as sales and human resources) to the data warehouse on a regular schedule, such as
every night or every weekend; after that, management can perform complex queries and
analysis (such as data mining) on the information without slowing down the operational
systems.

Definition
A data warehouse is the main repository of the organisation's historical data, or its
corporate memory. For example, an organisation would use the information stored
in its data warehouse to find out on which day of the week it sold the most widgets in May
2002, or how employee casual leave in the week before New Year differed between
Chandigarh and Delhi from 2001 to 2006. In other words, the data warehouse contains the
raw material for management's decision support system.
While operational systems are optimized for simplicity and speed of modification (online
transaction processing, or OLTP) through heavy use of database normalization and an
entity-relationship model, the data warehouse is optimized for reporting and analysis
(online analytical processing, or OLAP). Frequently data in Data Warehouses is heavily
denormalised, summarised and/or stored in a dimension-based model but this is not
always required to achieve acceptable query response times.
More formally, Bill Inmon (one of the earliest and most influential practitioners) defined
a data warehouse as follows:
• Subject-oriented, meaning that the data in the database is organized so that all the
data elements relating to the same real-world event or object are linked together;

• Time-variant, meaning that the changes to the data in the database are tracked and
recorded so that reports can be produced showing changes over time;

• Non-volatile, meaning that data in the database is never over-written or deleted,
but retained for future reporting; and,

• Integrated, meaning that the database contains data from most or all of an
organization's operational applications, and that this data is made consistent.
History of data warehousing
Data warehouses became a distinct type of computer database during the late 1980s and
early 1990s. They developed to meet a growing demand for management information
and analysis that could not be met by operational systems. Operational systems were
unable to meet this need for a range of reasons:
• The processing load of reporting reduced the response time of the operational
systems,
• The database designs of operational systems were not optimised for information
analysis and reporting,
• Most organizations had more than one operational system, so company-wide
reporting could not be supported from a single system, and
• Development of reports in operational systems often required writing specific
computer programs, which was slow and expensive.

As a result, separate computer databases began to be built that were specifically designed
to support management information and analysis purposes. These data warehouses were
able to bring in data from a range of different data sources, such as mainframe computers,
minicomputers, personal computers and office automation software such as
spreadsheets, and to integrate this information in a single place. This capability, coupled with
user-friendly reporting tools and freedom from operational impacts, has led to a growth
of this type of computer system.
As technology improved (lower cost for more performance) and user requirements
increased (faster data load cycle times and more features), data warehouses have evolved
through several fundamental stages:
• Offline Operational Databases - Data warehouses in this initial stage are
developed by simply copying the database of an operational system to an off-line
server where the processing load of reporting does not impact on the operational
system's performance.
• Offline Data Warehouse - Data warehouses in this stage of evolution are updated
on a regular time cycle (usually daily, weekly or monthly) from the operational
systems and the data is stored in an integrated reporting-oriented data structure.
• Real Time Data Warehouse - Data warehouses at this stage are updated on a
transaction or event basis, every time an operational system performs a
transaction (e.g. an order, a delivery or a booking).
• Integrated Data Warehouse - Data warehouses at this stage are used to generate
activity or transactions that are passed back into the operational systems for use in
the daily activity of the organization.

Components of a data warehouse


The primary components of the majority of data warehouses are shown in the attached
diagram and described in more detail below:
Data Sources
Data sources refers to any electronic repository of information that contains data of
interest for management use or analytics. This definition covers mainframe databases
(e.g. IBM DB2, ISAM, Adabas, Teradata, etc.), client-server databases (e.g. Teradata,
IBM DB2, Oracle database, Informix, Microsoft SQL Server, etc.), PC databases (e.g.
Microsoft Access, Alpha Five), spreadsheets (e.g. Microsoft Excel) and any other
electronic store of data. Data needs to be passed from these systems to the data
warehouse either on a transaction-by-transaction basis for real-time data warehouses or
on a regular cycle (e.g. daily or weekly) for offline data warehouses.

Data Transformation
The Data Transformation layer receives data from the data sources, cleans and
standardises it, and loads it into the data repository. This is often called "staging" data as
data often passes through a temporary database while it is being transformed. This
transformation can be performed either by hand-written code or by a
specialised type of software called an ETL tool. Regardless of the nature of
the software used, the following types of activities occur during data transformation:
• comparing data from different systems to improve data quality (e.g. the date of birth
for a customer may be blank in one system but contain valid data in a second
system. In this instance, the data warehouse would retain the date of birth field
from the second system)
• standardising data and codes (e.g. if one system refers to "Male" and "Female",
but a second refers to only "M" and "F", these code sets would need to be
standardised)
• integrating data from different systems (e.g. if one system keeps orders and
another stores customers, these data elements need to be linked)
• performing other system housekeeping functions such as determining change (or
"delta") files to reduce data load times, generating or finding surrogate keys for
data etc.
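
The transformation activities listed above can be sketched in a few lines of Python. This is only a minimal illustration, not a real ETL tool: the two source systems, the field names (cust_id, dob, gender) and the simple counter used for surrogate keys are all assumptions made up for the example.

```python
# Hypothetical extracts from two operational systems, keyed by customer id.
crm_rows = [
    {"cust_id": "C1", "dob": "", "gender": "Male"},
    {"cust_id": "C2", "dob": "1980-02-11", "gender": "Female"},
]
billing_rows = [
    {"cust_id": "C1", "dob": "1975-06-30", "gender": "M"},
    {"cust_id": "C2", "dob": "", "gender": "F"},
]

# Standardise codes: map every variant onto one code set.
GENDER_CODES = {"Male": "M", "Female": "F", "M": "M", "F": "F"}


def transform(primary, secondary):
    """Merge two extracts into warehouse-ready rows."""
    secondary_by_id = {row["cust_id"]: row for row in secondary}
    warehouse_rows = []
    # Assumption: a running counter is good enough as a surrogate key here.
    for surrogate_key, row in enumerate(primary, start=1):
        other = secondary_by_id.get(row["cust_id"], {})
        warehouse_rows.append({
            "customer_key": surrogate_key,
            "source_cust_id": row["cust_id"],
            # Compare systems: fall back to the second system if blank.
            "dob": row["dob"] or other.get("dob", ""),
            "gender": GENDER_CODES[row["gender"]],
        })
    return warehouse_rows


def delta(previous, current):
    """Return only rows that are new or changed since the previous load."""
    seen = {row["source_cust_id"]: row for row in previous}
    return [row for row in current if seen.get(row["source_cust_id"]) != row]


staged = transform(crm_rows, billing_rows)
```

Here customer C1 picks up the date of birth from the second system, both gender values end up in the "M"/"F" code set, and delta(staged, staged) returns an empty list because nothing has changed between the two loads.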

Data Warehouse
The data warehouse is normally (but does not have to be) a relational database. It must be
organized to hold information in a structure that best supports not only query and
reporting, but also advanced analysis techniques, like data mining. Most data warehouses
hold information for at least one year, and some retain as much as half a century of data,
depending on the business or operational data retention requirements. As a result these
databases can become very large.

Reporting
The data in the data warehouse must be available to the organisation's staff if the data
warehouse is to be useful. There are a very large number of software applications that
perform this function, or reporting can be custom-developed. Examples of types of
reporting tools include:
• Business intelligence tools: These are software applications that simplify
the process of development and production of business reports based on
data warehouse data.
• Executive information systems (known more widely as business dashboards):
These are software applications that are used to display
complex business metrics and information in a graphical way to allow
rapid understanding.
• OLAP Tools: OLAP tools form data into logical multi-dimensional
structures and allow users to select which dimensions to view data by.
• Data Mining: Data mining tools are software that allow users to perform
detailed mathematical and statistical calculations on detailed data
warehouse data to detect trends, identify patterns and analyse data.

Metadata
Metadata, or "data about data", is used not only to inform operators and users of the data
warehouse about its status and the information held within the data warehouse, but also
as a means of integration of incoming data and a tool to update and refine the underlying
DW model.
Examples of data warehouse metadata include table and column names, their detailed
descriptions, their connection to business meaningful names, the most recent data load
date, the business meaning of a data item and the number of users that are logged in
currently.
Operations
Data warehouse operations comprise the processes of loading, manipulating and
extracting data from the data warehouse. Operations also cover user management,
security, capacity management and related functions.

Optional Components
In addition, the following components exist in some data warehouses:
1. Dependent Data Marts: A dependent data mart is a physical database (either on
the same hardware as the data warehouse or on a separate hardware platform) that
receives all its information from the data warehouse. The purpose of a Data Mart
is to provide a sub-set of the data warehouse's data for a specific purpose or to a
specific sub-group of the organization. A data mart is exactly like a data
warehouse technically, but it serves a different business purpose: it either holds
information for only part of a company (such as a division), or it holds a small
selection of information for the entire company (to support extra analysis without
slowing down the main system). In either case, however, it is not the
organization's official repository, the way a data warehouse is.
2. Logical Data Marts: A logical data mart is a filtered view of the main data
warehouse but does not physically exist as a separate data copy. This approach to
data marts delivers the same benefits but has the additional advantages of not
requiring additional (costly) disk space and it is always as current with data as the
main data warehouse. The downside is that logical data marts can have slower
response times than physical ones.
3. Operational Data Store: An ODS is an integrated database of operational data. Its
sources include legacy systems, and it contains current or near-term data. An ODS
may contain 30 to 60 days of information, while a data warehouse typically
contains years of data. ODSs are used in some data warehouse architectures to
provide near-real-time reporting capability in the event that the Data Warehouse's
loading time or architecture prevents it from being able to provide near-real-time
reporting capability.
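
The difference between a dependent (physical) data mart and a logical one can be illustrated with SQLite through Python's built-in sqlite3 module. This is a sketch under stated assumptions: the table and view names (warehouse_sales, delhi_mart_logical, delhi_mart_physical) and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE warehouse_sales (region TEXT, product TEXT, amount REAL);
    INSERT INTO warehouse_sales VALUES
        ('Chandigarh', 'widget', 100.0),
        ('Delhi',      'widget', 250.0),
        ('Delhi',      'gadget',  75.0);

    -- Logical data mart: a filtered view, no separate copy of the data.
    CREATE VIEW delhi_mart_logical AS
        SELECT product, amount FROM warehouse_sales WHERE region = 'Delhi';

    -- Dependent data mart: a physical copy loaded from the warehouse.
    CREATE TABLE delhi_mart_physical AS
        SELECT product, amount FROM warehouse_sales WHERE region = 'Delhi';
""")

# A new row arrives in the warehouse after the marts were created.
conn.execute("INSERT INTO warehouse_sales VALUES ('Delhi', 'widget', 50.0)")

logical_count = conn.execute(
    "SELECT COUNT(*) FROM delhi_mart_logical").fetchone()[0]
physical_count = conn.execute(
    "SELECT COUNT(*) FROM delhi_mart_physical").fetchone()[0]
```

After the extra insert, the view sees three Delhi rows while the physical copy still holds two. That is the trade-off described above: the logical mart is always current and needs no extra disk space, whereas the physical mart must be reloaded but can be queried without touching the main warehouse.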

Different methods of storing data in a data warehouse


All data warehouses store their data grouped together by subject areas that reflect the
general usage of the data (Customer, Product, Finance etc.). The general principle used in
the majority of data warehouses is that data is stored at its most elemental level for use in
reporting and information analysis.
Within this generic intent, there are two primary approaches to organising the data in a
data warehouse.
The first is using a "dimensional" approach. In this style, information is stored as "facts"
which are numeric or text data that capture specific data about a single transaction or
event, and "dimensions" which contain reference information that allows each transaction
or event to be classified in various ways. As an example, a sales transaction would be
broken up into facts such as the number of products ordered, and the price paid, and
dimensions such as date, customer, product, geographical location and salesperson. The
main advantage of a dimensional approach is that the data warehouse is easy for
business staff with limited information technology experience to understand and use.
Also, because the data is pre-processed into the dimensional form, the data warehouse
tends to operate very quickly. The main disadvantage of the dimensional approach is that
the structure is quite difficult to extend or change later if the company changes the way
in which it does business.
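
A dimensional layout of this kind can be sketched with SQLite via Python's sqlite3 module. The schema below is an assumption for illustration only (the names fact_sales, dim_date and dim_product and the sample rows are invented); it answers the earlier example question of which day of the week sold the most widgets in May 2002.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimensions: reference information used to classify each sale.
    CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, day_name TEXT, month TEXT);
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT);

    -- Facts: numeric measures of a single transaction, keyed by dimension.
    CREATE TABLE fact_sales (
        date_key INTEGER REFERENCES dim_date(date_key),
        product_key INTEGER REFERENCES dim_product(product_key),
        quantity INTEGER,
        price_paid REAL
    );

    INSERT INTO dim_date VALUES
        (1, 'Monday', '2002-05'), (2, 'Tuesday', '2002-05'), (3, 'Monday', '2002-05');
    INSERT INTO dim_product VALUES (1, 'widget');
    INSERT INTO fact_sales VALUES (1, 1, 10, 50.0), (2, 1, 4, 20.0), (3, 1, 7, 35.0);
""")

best_day = conn.execute("""
    SELECT d.day_name, SUM(f.quantity) AS total
    FROM fact_sales AS f
    JOIN dim_date AS d ON d.date_key = f.date_key
    JOIN dim_product AS p ON p.product_key = f.product_key
    WHERE p.name = 'widget' AND d.month = '2002-05'
    GROUP BY d.day_name
    ORDER BY total DESC
    LIMIT 1
""").fetchone()
```

With the sample rows, best_day is ('Monday', 17). The query reads almost like the business question itself, which is the ease-of-use advantage claimed for the dimensional approach.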
The second approach uses database normalization. In this style, the data in the data
warehouse is stored in third normal form. The main advantage of this approach is that it
is quite straightforward to add new information into the database. The primary
disadvantage is that it can be rather slow to produce information and
reports.

Advantages of using data warehouse


There are many advantages to using a data warehouse. Some of them are:
• It enhances end-user access to a wide variety of data.
• Business decision makers can obtain various kinds of trend reports, e.g. the item
with the most sales in a particular area or country over the last two years. This may
be helpful for future investments in a particular item.
• It increases data consistency.
• It increases productivity and decreases computing costs.
• It is able to combine data from different sources in one place.
• It provides an infrastructure with the capability to support changes to data and to
replicate the changed data back into the operational systems.

Concerns in using data warehouse


• Extracting, cleaning and loading data can be time consuming, although
warehousing tools can make this easier.
• The scope of a data warehousing project might increase.
• There may be compatibility problems with systems already in place, e.g. a transaction
processing system.
• End-users may be given training and still end up not using the data warehouse.
• Security could develop into a serious issue, especially if the data warehouse is
web accessible.

So how is a data warehouse different from your regular database? After all, both are
databases, and both have some tables containing data. If you look deeper, you'd find that both
have indexes, keys, views, and the regular jing-bang. So is that 'Data warehouse' really
different from the tables in your application? And if the two aren't really different, maybe you
can just run your queries and reports directly from your application databases!
Well, to be fair, that may be just what you are doing right now, running some EOD (end-of-day)
reports as complex SQL queries and shipping them off to those who need them. And this
scheme might just be serving you fine right now. Nothing wrong with that if it works for you.
But before you start patting yourself on the back for having avoided a data warehouse
altogether, do spend a moment to understand the differences, and to appreciate the pros and
cons of either approach.

The primary difference between your application database and a data warehouse is that
while the former is designed (and optimized) to record, the latter has to be designed (and
optimized) to respond to analysis questions that are critical for your business.

Application databases are OLTP (On-Line Transaction Processing) systems where every
transaction has to be recorded, and super-fast at that. Consider the scenario where a bank ATM
has disbursed cash to a customer but was unable to record this event in the bank records. If
this started happening frequently, the bank wouldn't stay in business for too long. So the
banking system is designed to make sure that every transaction gets recorded within the time
you stand before the ATM. This system is write-optimized, and you shouldn't complain if
your analysis query (read operation) takes a lot of time on such a system.

A Data Warehouse (DW), on the other hand, is a database (yes, you are right, it's a database) that
is designed for facilitating querying and analysis. Often designed as OLAP (On-Line Analytical
Processing) systems, these databases contain read-only data that can be queried and analysed
far more efficiently as compared to your regular OLTP application databases. In this sense an
OLAP system is designed to be read-optimized.

Separation from your application database also ensures that your business intelligence solution
is scalable (your bank and ATMs don't go down just because the CEO asked for a report), better
documented and managed (god help the novice who is given the application database diagrams
and asked to locate the needle of data in the proverbial haystack of table proliferation), and
can answer questions far more efficiently and frequently.

Creation of a DW leads to a direct increase in quality of analyses as the table structures are
simpler (you keep only the needed information in simpler tables), standardized (well-
documented table structures), and often denormalized (to reduce the linkages between tables
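and the corresponding complexity of queries). A DW drastically reduces the 'cost-per-analysis'
and thus permits more analysis per FTE. A well-designed DW is the foundation on which successful
BI/Analytics initiatives are built.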

If you are still running your reports off the main application database, answer this simple
question: Would the solution still work next year with 20% more customers, 50% more business,
70% more users, and 300% more reports? What about the year after next? If you are sure that
your solution will run without any changes, great!! However, if you have already budgeted to
buy new state-of-the-art hardware and 25 new Oracle licenses with those partition-options,
and the 33 other cool-sounding features, good luck to you. (You can probably send me a ticket
to Hawaii, since it's going to cost you just a minute fraction of your budget.)
A database, in the traditional sense, could be compared to a Rolodex. It's filled with many
records (i.e. business cards) with many fields (e.g. name, address, phone), and most people
probably index (i.e. sort) it alphabetically. But, there are times when it might be good to have a
quick way to look up cards in the Rolodex by company, or by title. Instead of a single dimension
(sorted alphabetically), a multi-dimensional Rolodex would be faster and more valuable.

A data warehouse is a multi-dimensional database. With a multi-dimensional Rolodex, you flip
the lid up facing forward, and the cards are alphabetical. But, if you spin it around, and open the
lid from the back, the cards would be grouped and sorted by company. Turn it over, and open
the bottom and the cards are sorted by title. Similarly, a data warehouse stores your
organization's data in such a way that you can query for information in multiple dimensions, and
the results are fast and accurate. Therefore, the key to a good data warehouse is good data
management.

Extract Transform Load (ETL)


When you build a data warehouse, you most likely are drawing data from traditional databases.
Data experts develop ways to "Extract" data from different sources, "Transform" it so that the
data fits well with other data and isn't duplicative or full of holes (think of this as ensuring that all
the cards have all the necessary fields filled in for name, company and title, but none are
duplicated). Lastly, the data is "Loaded" into the warehouse, and then processes are run to
prepare the data to be queried in multiple dimensions. Since you know what KPIs you want to
create, and you've identified where the data comes from, the data warehouse is built with those
outcomes in mind. Retrieving data is fast, and the results are clean and accurate.

OLAP is an acronym for On Line Analytical Processing. It is an approach to quickly provide the
answer to analytical queries that are dimensional in nature.

Databases configured for OLAP employ a multidimensional data model, allowing for complex
analytical and ad-hoc queries with a rapid execution time.

The term OLAP was created as a slight modification of the traditional database term OLTP (On
Line Transaction Processing).

Extract, transform, and load (ETL) is a process in data warehousing that involves

• Extracting data from outside sources,
• Transforming it to fit business needs, and ultimately
• Loading it into the data warehouse.
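
The three ETL steps can be sketched end to end in Python using only the standard library. Everything specific here is an assumption for illustration: the CSV text stands in for an outside source, an in-memory SQLite database stands in for the warehouse, and the rule of dropping rows with a missing amount is just one possible business rule.

```python
import csv
import io
import sqlite3

# Hypothetical export from an operational system (stands in for an outside source).
SOURCE_CSV = """order_id,customer,amount
1, alice ,10.50
2,BOB,
3,Carol,7.25
"""


def extract(text):
    """Extract: read raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: tidy names, drop rows with missing amounts, convert types."""
    cleaned = []
    for row in rows:
        if not row["amount"].strip():
            continue  # incomplete record; a real pipeline might log or repair it
        cleaned.append((int(row["order_id"]),
                        row["customer"].strip().title(),
                        float(row["amount"])))
    return cleaned


def load(conn, rows):
    """Load: write the cleaned rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER, customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(extract(SOURCE_CSV)))
loaded = warehouse.execute(
    "SELECT order_id, customer, amount FROM orders ORDER BY order_id").fetchall()
```

With the sample data, order 2 is dropped because its amount is missing, and the two remaining rows reach the warehouse with tidied customer names and numeric amounts.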

Data Warehousing - An Overview
Information Technology (IT) has historically influenced organizational performance
and competitive standing. The increasing processing power and sophistication of
analytical tools and techniques have laid a strong foundation for the
data warehouse. There are a number of reasons why an organization should
consider a data warehouse, which can be a critical tool for maximizing the
organization’s investment in the information it has collected and stored throughout
the enterprise. IT managers need to understand the rationale and benefits of data
warehouses because they may need to design and implement, or procure, this
cornerstone of business intelligence.

Data warehouses are intended to provide storage, functionality and
responsiveness to queries beyond the capabilities of today's transaction-oriented
databases. They are also meant to improve data access performance.
Traditional databases balance the requirement of data access with the
need to ensure integrity of data. In present-day organizations, users of data are
often completely removed from the data sources. Many people need only read
access to data, but still need very rapid access to a larger volume of data than can
conveniently be downloaded to the desktop. Often such data comes from multiple
databases. Because many of the analyses performed are recurrent and predictable,
software vendors and systems support staff have begun to design systems to
support these functions. There is now a need to provide decision
makers from middle management upward with information at the correct level of
detail to support decision-making. Data warehousing, online analytical processing
(OLAP) and data mining provide this functionality.

Here we discuss what a data warehouse is and how it helps an organization gain a
competitive edge.

Data Warehouse - An Introduction


A data warehouse is defined as a subject-oriented, integrated, nonvolatile, time-
variant collection of data in support of management's decisions. More generally, data
warehousing is a collection of decision support technologies aimed at enabling the
knowledge worker (executive, manager or analyst) to arrive at better and
faster decisions. Data warehouses provide access to data for complex analysis,
knowledge discovery, and decision-making. They support high performance demands
on an organization's data and information, and they provide an enormous amount of
historical and static data from three tiers:

1. Relational databases
2. Multidimensional OLAP applications
3. Client analysis tools

Several types of applications are supported, such as online analytical processing (OLAP),
decision-support systems (DSS) and data mining. OLAP is a
term used to describe the analysis of complex data from the data warehouse.

OLAP is a software technology that allows users to easily and quickly analyze and
view data from multiple points-of-view. OLAP provides dynamic and multi-
dimensional support to executives and managers who need to understand different
aspects of the data. Activities that are supported include:

• Analyzing financial trends
• Creating slices of data
• Finding new relationships among the data
• Drilling down into sales statistics
• Doing calculations through different dimensions where each category of data
(that is, product, location, sales numbers, time period, etc.) is considered a
dimension.
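
Two of the activities above, creating a slice and drilling down, can be sketched in plain Python. This is a toy illustration rather than how an OLAP product is implemented: the fact records, the dimension names and the helper functions aggregate and slice_cube are all hypothetical.

```python
from collections import defaultdict

# Hypothetical fact records: (product, location, year, quarter, sales).
FACTS = [
    ("widget", "Delhi", 2005, "Q1", 120),
    ("widget", "Delhi", 2005, "Q2", 80),
    ("widget", "Chandigarh", 2005, "Q1", 60),
    ("gadget", "Delhi", 2005, "Q1", 40),
    ("gadget", "Chandigarh", 2006, "Q1", 90),
]
DIMENSIONS = ("product", "location", "year", "quarter")


def aggregate(facts, group_by):
    """Sum sales over the chosen dimensions (fewer = roll-up, more = drill-down)."""
    positions = [DIMENSIONS.index(name) for name in group_by]
    totals = defaultdict(int)
    for fact in facts:
        totals[tuple(fact[i] for i in positions)] += fact[-1]
    return dict(totals)


def slice_cube(facts, dimension, value):
    """Keep only the facts where one dimension has a fixed value."""
    position = DIMENSIONS.index(dimension)
    return [fact for fact in facts if fact[position] == value]


# High-level view: total sales per product.
by_product = aggregate(FACTS, ["product"])

# Slice on location, then drill down from product to product and quarter.
delhi_by_quarter = aggregate(
    slice_cube(FACTS, "location", "Delhi"), ["product", "quarter"])
```

The same facts are viewed from different points of view simply by changing the list of dimensions passed to aggregate, which is the core idea behind letting users select which dimensions to view data by.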

There are OLAP tools that use distributed computing capabilities for analyses that
require more storage and processing power than can be economically and efficiently
located on an individual desktop.

DSS support an organization's leading decision makers with higher-level data for
complex and critical decisions. A DSS queries a data warehouse or an OLAP database
for relevant information that can be compared in order to make a business decision
and predict the impact of that decision.

Finally, data mining is being used for knowledge discovery, the process of searching
data for unanticipated new knowledge.

Knowledge workers and decision makers use tools ranging from parametric queries
to ad hoc queries to data mining. Thus, the access component of the data warehouse
must provide support for structured queries (both parametric and ad hoc). These
together make up a managed query environment.

Databases Vs Data Warehouses


A database is a collection of related data and a database system is a database and
database software together.

A data warehouse is also a collection of information as well as a supporting system.

Transactional databases may be relational, object-oriented, network or
hierarchical. Traditional databases support on-line transaction processing (OLTP),
which includes insertions, updates, and deletions, while also supporting information
query requirements. Traditional databases are optimized to process queries that may
touch a small part of the database and transactions that deal with insertions or
updates of a few tuples per relation.

Thus databases must strike a balance between efficiency in transaction processing
and supporting query requirements (ad hoc user requests). That is, they cannot be
further optimized for applications such as OLAP, DSS and data mining.

A data warehouse, by contrast, is typically optimized for the access needs of decision
makers. Data warehouses are designed specifically to support efficient extraction,
processing and presentation for analytic and decision-making purposes.

In contrast to databases, data warehouses generally contain very large amounts of
data from multiple sources that may include databases from different data models
and sometimes files acquired from independent systems and platforms.
Multidatabases provide access to disjoint and usually heterogeneous databases and
are volatile, whereas a data warehouse is frequently a store of integrated data from
multiple sources, processed for storage in a multidimensional model, and is nonvolatile.
Data warehouses also support time-series and trend analysis, both of which require
more historical data.

In transactional systems, transactions are the unit and are the agent of change to
the database, but data warehouse information is much more coarse-grained and is
refreshed according to a careful choice of incremental refresh policy. Warehouse
updates are handled by the warehouse's acquisition component that provides all
required processing. Because data warehouses encompass large volumes of data, they can be
roughly double the size of the source databases.
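
One simple incremental refresh policy is to copy only the rows that are newer than the last row already in the warehouse (a "high-water mark"). The sketch below uses SQLite via Python's sqlite3 module and rests on assumptions that will not hold everywhere: the table names are hypothetical, the source is append-only, and order_id always increases. Updates or deletions in the source would need a different policy.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source_orders (order_id INTEGER PRIMARY KEY, amount REAL);
    CREATE TABLE warehouse_orders (order_id INTEGER PRIMARY KEY, amount REAL);
    INSERT INTO source_orders VALUES (1, 10.0), (2, 20.0);
""")


def incremental_refresh(conn):
    """Copy only the source rows newer than the warehouse's high-water mark."""
    watermark = conn.execute(
        "SELECT COALESCE(MAX(order_id), 0) FROM warehouse_orders"
    ).fetchone()[0]
    new_rows = conn.execute(
        "SELECT order_id, amount FROM source_orders WHERE order_id > ?",
        (watermark,),
    ).fetchall()
    conn.executemany("INSERT INTO warehouse_orders VALUES (?, ?)", new_rows)
    conn.commit()
    return len(new_rows)


first = incremental_refresh(conn)    # initial load copies both existing rows
conn.execute("INSERT INTO source_orders VALUES (3, 30.0)")
second = incremental_refresh(conn)   # later refresh copies only the new row
```

The first call copies two rows and the second copies one, so each refresh moves only what has changed since the previous one instead of reloading the whole source.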

The sheer volume of data, likely to be in terabytes, is an issue that has been dealt
with through enterprise-wide data warehouses, virtual data warehouses and data
marts. Enterprise-wide data warehouses are huge projects requiring a massive
investment of time and resources. Virtual data warehouses provide
views of operational databases that are materialized for efficient access. A data mart
is an easy-to-access repository of a subset of highly focused data for a single
function or department (e.g., finance, sales, marketing) and is considerably smaller
than a data warehouse. The data comes from operational information that is needed
by a particular group of employees for analysis, content and presentations, all in terms
that are familiar to them. Data for a data mart is derived from a data warehouse or
from more specialized access.

Distinctive Characteristics of Data Warehouses

Data warehouses have the following distinctive characteristics:

a. Multidimensional conceptual view and generic dimensionality,
b. Unlimited dimensions and aggregation levels and unrestricted cross-dimensional
operations,
c. Dynamic sparse matrix handling,
d. Client/server architecture and multi-user support,
e. Accessibility and transparency, intuitive data manipulation and consistent
reporting performance.

Because data warehouses are freed from the demands of transaction processing, they
can be optimized for query processing. Specialized tools and techniques for this
include query transformation, index intersection and union, special
ROLAP (Relational OLAP), MOLAP (multidimensional OLAP), DOLAP (Database
OLAP) and WOLAP (Web OLAP) functions, SQL extensions, advanced join methods,
and intelligent scanning.

Traditional OLAP products are also known as multidimensional OLAP (MOLAP).
Relational OLAP tools take data from traditional two-dimensional, or relational,
databases and create multidimensional views upon request, rather than preparing
them in advance as in MOLAP. ROLAP is often used on complex data with a large
number of fields, such as customer data. DOLAP is a relational database management
system designed to perform OLAP calculations. WOLAP refers to OLAP data that can
be reached from a Web server.

Parallel processing can enhance the performance of a data warehouse. Parallel server
architectures include symmetric

Data Modeling for Data Warehouses


Multidimensional models take advantage of inherent relationships in the data to
populate multidimensional matrices referred to as data cubes. If the matrix has
more than three dimensions, it is called a hypercube. For data that lend themselves
to dimensional formatting, query performance in multidimensional matrices can be
much better than in the relational data model. For a corporate data warehouse,
three examples of dimensions would be the corporation's fiscal periods, products
and regions.

Multidimensional models lend themselves readily to hierarchical views such as roll-up
display and drill-down display. Roll-up display moves up the hierarchy, grouping into
larger units along a dimension. A drill-down display provides the opposite capability,
furnishing a finer-grained view by disaggregating the data.
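
As a rough illustration of a data cube with roll-up and drill-down, the following
Python sketch stores facts in a dictionary keyed by (fiscal period, product, region).
The sales figures and the function names roll_up and drill_down are hypothetical and
are used only for this example; real OLAP engines use far more compact storage.

```python
from collections import defaultdict

# Hypothetical sales facts keyed by (fiscal period, product, region).
cube = {
    ("2006-Q1", "widget", "Chandigarh"): 120,
    ("2006-Q1", "widget", "Delhi"): 200,
    ("2006-Q1", "gadget", "Delhi"): 50,
    ("2006-Q2", "widget", "Chandigarh"): 90,
    ("2006-Q2", "gadget", "Delhi"): 70,
}

DIMENSIONS = ("period", "product", "region")


def roll_up(cells, keep):
    """Aggregate the cube, keeping only the dimensions named in `keep`."""
    positions = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(int)
    for key, amount in cells.items():
        totals[tuple(key[i] for i in positions)] += amount
    return dict(totals)


def drill_down(cells, **fixed):
    """Return the finer-grained cells behind one aggregated value."""
    return {
        key: amount
        for key, amount in cells.items()
        if all(key[DIMENSIONS.index(dim)] == value for dim, value in fixed.items())
    }


# Roll up to totals per fiscal period, then drill back into one period.
by_period = roll_up(cube, keep=("period",))
q1_detail = drill_down(cube, period="2006-Q1")
```

Rolling up to the period level collapses the product and region dimensions, while
drilling down on one period recovers the individual cells that made up its total.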

The multidimensional storage model involves two types of tables: dimension tables
and fact tables. A dimension table consists of tuples of attributes of the dimension.
A fact table can be thought of as having tuples, one per recorded fact. Each fact
contains some measured or observed variables and identifies them with pointers to
dimension tables.

Two common multidimensional schemas are the star schema and the snowflake
schema. The star schema consists of a fact table with a single table for each
dimension. The snowflake schema is a variation on the star schema in which the
dimensional tables from a star schema are organized into a hierarchy by normalizing
them. A fact constellation is a set of fact tables that share some dimension tables.
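
A minimal star schema can be sketched with Python's built-in sqlite3 module. The
table and column names (fact_sales, dim_product, dim_region, dim_period) and the data
are assumptions made for this illustration only. A snowflake variant would further
normalize a dimension table, for example by splitting regions into region and
country tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dim_region  (region_id  INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dim_period  (period_id  INTEGER PRIMARY KEY, fiscal_quarter TEXT);
    -- One row per recorded fact; the foreign keys point at the dimension tables.
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        region_id  INTEGER REFERENCES dim_region(region_id),
        period_id  INTEGER REFERENCES dim_period(period_id),
        units_sold INTEGER
    );
    INSERT INTO dim_product VALUES (1, 'widget'), (2, 'gadget');
    INSERT INTO dim_region  VALUES (1, 'North'), (2, 'South');
    INSERT INTO dim_period  VALUES (1, '2006-Q1'), (2, '2006-Q2');
    INSERT INTO fact_sales  VALUES (1, 1, 1, 120), (1, 2, 1, 200),
                                   (2, 2, 1, 50),  (1, 1, 2, 90);
""")

# A typical star-schema query: join the fact table to one dimension and aggregate.
rows = conn.execute("""
    SELECT r.name, SUM(f.units_sold)
    FROM fact_sales f
    JOIN dim_region r ON r.region_id = f.region_id
    GROUP BY r.name
    ORDER BY r.name
""").fetchall()
```

The fact table holds only keys and measures, so adding a new descriptive attribute
to a dimension does not require touching the much larger fact table.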

Data warehouse storage also utilizes indexing techniques to support high
performance access. A technique called bitmap indexing constructs a bit vector for
each value in a domain (column) being indexed. It works well for low-cardinality
domains, where it can provide considerable input/output and storage space
advantages. With bit vectors, a bitmap index can provide dramatic improvements in
comparison, aggregation, and join performance.
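
The idea of bitmap indexing can be sketched in a few lines of Python, using an
integer as the bit vector for each distinct value. The column contents and the helper
names build_bitmap_index and rows_of are hypothetical; production systems add
compression and work on disk pages rather than Python integers.

```python
def build_bitmap_index(values):
    """Map each distinct value to an int whose bit i is set when row i holds it."""
    index = {}
    for row, value in enumerate(values):
        index[value] = index.get(value, 0) | (1 << row)
    return index


def rows_of(bitmap):
    """Expand a bit vector back into a sorted list of row numbers."""
    return [i for i in range(bitmap.bit_length()) if bitmap >> i & 1]


# Hypothetical low-cardinality columns of a six-row fact table.
region = ["N", "S", "N", "E", "S", "N"]
size = ["L", "L", "S", "S", "L", "L"]

region_idx = build_bitmap_index(region)
size_idx = build_bitmap_index(size)

# The predicate region = 'N' AND size = 'L' becomes a single bitwise AND.
north_large = region_idx["N"] & size_idx["L"]
# Counting the rows with region = 'S' is a population count of one bit vector.
south_count = bin(region_idx["S"]).count("1")
```

Because each distinct value needs its own bit vector, the approach pays off only
when the column has few distinct values relative to the number of rows.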

In a star schema, dimensional data can be indexed to tuples in the fact table by join
indexing. Join indexes are traditional indexes to maintain relationships between
primary key and foreign key values. They relate the values of a dimension of a star
schema to rows in the fact table. Data warehouse storage can facilitate access to
summary data by taking further advantage of the nonvolatility of data warehouses
and a degree of predictability of the analyses that will be performed using them.
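
A join index can be pictured as a precomputed map from each dimension value to the
fact rows that reference it. The Python sketch below assumes a hypothetical region
dimension and a tiny fact table; the names are illustrative only.

```python
from collections import defaultdict

# Hypothetical dimension table: primary key -> attribute value.
dim_region = {1: "North", 2: "South"}
# Hypothetical fact table rows: (region_id foreign key, units_sold).
fact_sales = [(1, 120), (2, 200), (2, 50), (1, 90)]


def build_join_index(dimension, facts):
    """Map each dimension value to the positions of the matching fact rows."""
    index = defaultdict(list)
    for row_id, (region_id, _units) in enumerate(facts):
        index[dimension[region_id]].append(row_id)
    return dict(index)


join_index = build_join_index(dim_region, fact_sales)
# Answer a query on the dimension without scanning the whole fact table.
south_units = sum(fact_sales[row_id][1] for row_id in join_index["South"])
```

Since warehouse data is nonvolatile, such an index rarely needs to be rebuilt
between loads, which is what makes precomputing it worthwhile.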

Building a Data Warehouse


Warehouse builders should take a broad view of the anticipated use of the
warehouse. The design should specifically support ad hoc querying, that is, accessing
data with any meaningful combination of values for the attributes in the dimension or
fact tables. The data acquisition phase involves the following steps.

1. The data must be extracted from multiple, heterogeneous sources such as
databases or other data feeds.

2. Data must be formatted for consistency within the data warehouse. Names,
meanings and domains of data from unrelated sources must be reconciled.
3. The data must be cleaned to ensure validity. Data cleaning is an important part of
building a data warehouse and is widely regarded as the most labor-intensive one.
Input data must be cleaned before it is loaded into the warehouse. Because the input
data has to be examined and formatted consistently anyway, warehouse builders
should use that step to check for validity and quality.

4. The data must be fitted into the data model of the warehouse. Data from the
various sources must be installed in the data model of the warehouse. Data may
have to be converted from relational, object-oriented, or legacy databases.

5. The data must be loaded into the warehouse. The sheer volume of data in the
warehouse makes loading the data a significant task. Monitoring tools for loads as
well as methods to recover from incomplete or incorrect loads are required. With the
huge volume of data in the warehouse, incremental updating is usually the only
feasible approach. The refresh policy will probably emerge as a compromise that
takes into account the answers to the following questions.

How up-to-date must the data be?

Can the warehouse go off-line, and for how long?

What are the data interdependencies?

What is the storage availability?

What are the distribution requirements such as for replication and partitioning?

What is the loading time including cleaning, formatting, copying, transmitting and
overhead such as index rebuilding?
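
The five acquisition steps listed above can be sketched as a small pipeline. This
is only an outline under stated assumptions: the two source layouts, the field names
and the validity rule are invented for the example, and an in-memory SQLite database
stands in for the warehouse.

```python
import sqlite3

# Hypothetical extracts from two unrelated sources with different conventions.
source_a = [{"cust": "Asha", "amount": "100.50", "region": "north"}]
source_b = [
    {"customer_name": "Ravi", "amt": 75, "zone": "SOUTH"},
    {"customer_name": "", "amt": -5, "zone": "SOUTH"},
]


def extract():
    # Step 1: pull rows from each source, tagged with their origin.
    return [("a", r) for r in source_a] + [("b", r) for r in source_b]


def format_row(origin, row):
    # Step 2: reconcile names and domains into one convention.
    if origin == "a":
        return {"customer": row["cust"], "amount": float(row["amount"]),
                "region": row["region"].title()}
    return {"customer": row["customer_name"], "amount": float(row["amt"]),
            "region": row["zone"].title()}


def is_clean(row):
    # Step 3: reject rows that fail basic validity checks.
    return bool(row["customer"]) and row["amount"] >= 0


def load(rows):
    # Steps 4 and 5: fit the rows to the warehouse table and load them.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (customer TEXT, amount REAL, region TEXT)")
    conn.executemany("INSERT INTO sales VALUES (:customer, :amount, :region)", rows)
    return conn


formatted = [format_row(origin, row) for origin, row in extract()]
cleaned = [row for row in formatted if is_clean(row)]
warehouse = load(cleaned)
loaded = warehouse.execute(
    "SELECT customer, amount, region FROM sales ORDER BY customer"
).fetchall()
```

Keeping each step as a separate function makes it easier to monitor a load and to
rerun only the stage that failed.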

Data storage in a data warehouse involves the following processes:

Storing the data according to the data model of the warehouse

Creating and maintaining required data structures

Creating and maintaining appropriate access paths

Providing for time-variant data as new data are added

Supporting the updating of warehouse data

Refreshing and purging the data

The sheer volume of data in the warehouse generally makes it impossible to simply
reload the warehouse in its entirety later on. Alternatives include selective
refreshing of data and maintaining separate warehouse versions. When the warehouse
uses an incremental data refreshing mechanism, data may need to be periodically
purged.
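
Incremental refreshing followed by a periodic purge might look like the following
sketch. The row layout, the dates and the function names incremental_refresh and
purge are assumptions for illustration; a real warehouse would track the load
position in its metadata.

```python
from datetime import date


def incremental_refresh(warehouse, source_rows, last_loaded):
    """Append only the source rows dated after the previous load."""
    new_rows = [row for row in source_rows if row["day"] > last_loaded]
    warehouse.extend(new_rows)
    # Return the new high-water mark for the next refresh.
    return max((row["day"] for row in new_rows), default=last_loaded)


def purge(warehouse, cutoff):
    """Drop rows older than the retention cutoff, in place."""
    warehouse[:] = [row for row in warehouse if row["day"] >= cutoff]


warehouse = [
    {"day": date(2005, 1, 10), "units": 5},
    {"day": date(2006, 3, 1), "units": 7},
]
source = [
    {"day": date(2006, 3, 1), "units": 7},
    {"day": date(2006, 3, 2), "units": 4},
]

watermark = incremental_refresh(warehouse, source, last_loaded=date(2006, 3, 1))
purge(warehouse, cutoff=date(2006, 1, 1))
```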

Data warehouses should also be designed with full consideration of the environment
in which they will reside. Important design considerations include the following:

. Usage projections

. The fit of the data model

. Characteristics of available sources


. Design of the metadata component

. Modular component design

. Design for manageability and change

Large warehouses may be built as distributed or federated warehouses. For a
distributed data warehouse, all the issues of distributed databases, such as
replication, partitioning, communication and consistency, must be taken into
account. The usual benefits of distribution, such as load balancing, scalability
of performance and higher availability, also apply. A single replicated metadata
repository would reside at each distribution site.

The idea of the federated warehouse is like that of the federated database: a
decentralized confederation of autonomous data warehouses, each with its own
metadata repository. Given the magnitude of the challenge inherent to data
warehouses, it is likely that such federations will consist of smaller-scale
components, such as data marts. Large organizations may choose to federate data
marts rather than build huge data warehouses.

Functionality of Data Warehouses


Data warehouses exist to facilitate complex, data-intensive and frequent ad hoc
queries. Data warehouses must provide far greater and more efficient query support
than is demanded of transactional databases. The data warehouse access component
supports enhanced spreadsheet functionality, efficient query processing, structured
queries, ad hoc queries, data mining and materialized views. In particular, enhanced
spreadsheet functionality includes support for state-of-the-art spreadsheet
applications as well as for OLAP application programs. These provide
preprogrammed functionalities such as the following:

Roll-up: Data is summarized with increasing generalization

Drill-down: Increasing levels of detail are revealed

Pivot: Cross tabulation (also referred to as rotation) is performed

Slice and dice: Performing projection operations on the dimensions

Sorting: Data is sorted by ordinal value

Selection: Data is available by value or range

Derived or computed attributes: Attributes are computed by operations on stored
and derived values.
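
Several of the operations listed above can be sketched on a small in-memory fact
table. The rows, the column order and the function names are hypothetical and chosen
only to show the shape of each operation.

```python
from collections import defaultdict

# Hypothetical fact rows: (product, region, units, unit_price).
facts = [
    ("widget", "North", 10, 2.0),
    ("widget", "South", 4, 2.0),
    ("gadget", "North", 3, 5.0),
    ("gadget", "South", 6, 5.0),
]


def pivot(rows):
    """Cross-tabulate units: one row per product, one column per region."""
    table = defaultdict(dict)
    for product, region, units, _price in rows:
        table[product][region] = table[product].get(region, 0) + units
    return dict(table)


def slice_by(rows, region):
    """Slice: fix one dimension value and keep the remaining columns."""
    return [(p, u, price) for p, r, u, price in rows if r == region]


def select(rows, low, high):
    """Selection: keep rows whose units fall within a range."""
    return [row for row in rows if low <= row[2] <= high]


def with_revenue(rows):
    """Derived attribute: revenue computed from stored units and price."""
    return [row + (row[2] * row[3],) for row in rows]


crosstab = pivot(facts)
north = slice_by(facts, "North")
mid_volume = select(facts, 4, 6)
# Sorting: order rows by the derived revenue value, largest first.
ranked = sorted(with_revenue(facts), key=lambda row: row[4], reverse=True)
```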

Benefits of Data Warehouses


The core benefits include:

· Historical information for comparative and competitive analysis
· Enhanced data quality and completeness
· Supplementing disaster recovery plans with another data backup source

Among the greatest benefits of a data warehouse is the ability to analyze and
execute business decisions based on data from multiple sources. For example,
suppose an organization has collected valuable data and stored it in 30 databases.
A data warehouse is not only a convenient way to analyze and compare data across
all of those databases, but it can also provide historical data and perspective.
A data warehouse is thus a one-stop shop for both current and historical
information. Using a data warehouse, one can look at past trends, whether in
product sales, customers or anything else, and make predictions about what is
likely to happen in the future.

Also, data retrieved from multiple databases is not constrained by the tables in each
of those databases. A data warehouse receives application-neutral data. Whatever
database application supplies information to the data warehouse does not
precondition the data to be presented in the way the originator of the data requires.
That is, the data from the inventory system, the financial system, or the sales
system is sent to the data warehouse for processing as application-neutral data that
is not formatted to answer only queries from an inventory database, finance
database, or sales database program. If not for application-neutral data, the data
warehouse would be nothing more than a collection of data marts.

A data warehouse by itself does not create value; value comes from the use of
the data in the warehouse. In support of a low-cost strategy, the data warehouse can
provide savings in billing processes, reduce fraud losses, and reduce the cost of
reporting. Data warehouses can also provide analysts with precalculated reports and
graphs, which increases the productivity of business analysts.

Most companies can benefit from a data warehouse when the proper tools are in
place and users are trained in analysis of results.

Conclusion
Data warehouses are still an expensive solution and are typically found in large
firms. The development of a central warehouse is a huge, capital-intensive
undertaking with large, potentially unmanageable risks.

The explosion of e-business - and the massive amount of data it created - has made data management and
organization more important than ever. We often hear the terms database, data warehouse and data mart,
but the differences among them aren't always clear. Some experts say that the difference between, say, a
data mart and a data warehouse is more conceptual than real. Nonetheless, here are some general rules of
thumb to sort out these terms.

In the Beginning . . .

A datum is a raw piece of information that's capable of being moved and stored. In the broadest sense, a
database is a collection or aggregation of such data, along with information on how pieces of data relate to
one another.

A database is typically organized into records - one record per item, such as an order - that are themselves
divided into several fields, with each field containing information about a specific aspect or attribute of the
item. For an order, these could include customer data, part numbers, prices and discounts.

In theory, a database doesn't even require a computer, but it certainly makes its use a lot more scalable and
efficient, says Mike Schiff, an analyst at Current Analysis Inc. in Sterling, Va. A pocket address book is
certainly a database, but searching contact entries by city or industry requires flipping through each page.

Database management systems, such as those from Microsoft Corp., Oracle Corp. or IBM, act as the
underlying vault and retrieval technology.

In addition to storing data, a database management system handles security and access control, says
Schiff. Business intelligence tools then access this data for analysis. However, databases rarely exist just to
run analytical operations; in general, they're vital to running a business.

Database management systems can be organized in different ways. A relational database stores information
in tables and then joins or combines those tables across common fields [QuickStudy, Jan. 8, 2001]. A
hierarchical database stores data in a tree structure; an order record might have every line item underneath
it. An object-oriented database encapsulates both data and business logic [QuickStudy, Feb. 9, 1998].

Wholesale, Retail, Slice and Dice

Data warehouses [QuickStudy, Dec. 6, 1999] and data marts are very similar technologies, say experts, but
they usually service different types of clients. For instance, a warehouse typically contains a massive
amount of data from across an enterprise, says John Kopcke, chief technology officer at Hyperion Solutions
Corp., a maker of analytical software in Sunnyvale, Calif.

Data marts tend to be smaller and dedicated to a single division or line of business. Data warehouses are
"similar to a real food warehouse, storing massive amounts of food and then distributing subsets of food to
grocery stores [the marts] for people to access [or] purchase," says Kopcke.

A data mart can run in size from megabytes to gigabytes, says Tho Nguyen, director of data warehousing
strategy at SAS Institute Inc. in Cary, N.C., whereas data warehouses usually run from gigabytes to
terabytes.

Consider a data mart that supports a firm's cellophane-tape division. It might contain relevant facts about
making cellophane tape - suppliers, deliveries, rates, quality control information - says Schiff.

However, the uncontrolled proliferation of such data marts can become an IT nightmare unless each data
mart uses standard naming and cataloging schemes and compatible data types. The last thing you want is
data marts that can't talk to one another.

Users tend to assemble a warehouse from different pieces of technology, then customize it to meet their
needs, rather than just put it together out of the box. Schiff notes that warehouses are often built using
relational databases, because the relational model can more efficiently store and organize the huge
amounts of information that make up a high-volume, multipurpose data warehouse. However, getting data
from many large relational tables can require massive amounts of processing and storage.

For that kind of slice-and-dice analysis, data marts use multidimensional databases geared for quick
responses with multiple elements. Often-selected data from a data mart is fed into a smaller database called
a data cube for intensive processing.

====

What solutions does data warehousing provide?

The systems that contain operational data (the data that runs the daily transactions of your business)
contain information that is useful to business analysts. For example, analysts can use information
about which products were sold in which regions at which time of year to look for anomalies or to
project future sales.

However, several problems can arise when analysts access the operational data directly:

· Analysts might not have the expertise to query the operational database. For example,
querying IMS(TM) databases requires an application program that uses a specialized type of
data manipulation language. In general, the programmers who have the expertise to query
the operational database have a full-time job in maintaining the database and its
applications.
· Performance is critical for many operational databases, such as databases for a bank. The
system cannot handle users making ad hoc queries.
· The operational data generally is not in the best format for use by business analysts. For
example, sales data that is summarized by product, region, and season is much more
useful to analysts than the raw data.

Data warehousing solves these problems. In data warehousing, you create stores of informational
data. Informational data is data that is extracted from the operational data and then transformed for
decision making. For example, a data warehousing tool might copy all the sales data from the
operational database, clean the data, perform calculations to summarize the data, and write the
summarized data to a target in a separate database from the operational data. Users can query the
separate database (the warehouse) without impacting the operational databases.
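
The copy, clean, summarize and write sequence described above can be sketched with
two in-memory SQLite databases standing in for the operational system and the
warehouse. The table names, columns and the rule that rows with a missing amount are
discarded are assumptions made for this example.

```python
import sqlite3

# Hypothetical operational database holding raw sales transactions.
operational = sqlite3.connect(":memory:")
operational.executescript("""
    CREATE TABLE sales (product TEXT, region TEXT, season TEXT, amount REAL);
    INSERT INTO sales VALUES
        ('tape', 'East', 'winter', 10.0),
        ('tape', 'East', 'winter', 15.0),
        ('tape', 'West', 'summer', 20.0),
        ('glue', 'East', 'winter', NULL);
""")

# Separate warehouse database that analysts query instead.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE sales_summary (product TEXT, region TEXT, season TEXT, total REAL)"
)

# Copy from the operational store, drop rows with a missing amount (cleaning),
# and summarize by product, region and season.
summary = operational.execute("""
    SELECT product, region, season, SUM(amount)
    FROM sales
    WHERE amount IS NOT NULL
    GROUP BY product, region, season
    ORDER BY product, region, season
""").fetchall()

# Write the summarized rows to the warehouse target.
warehouse.executemany("INSERT INTO sales_summary VALUES (?, ?, ?, ?)", summary)
warehouse.commit()

report = warehouse.execute(
    "SELECT product, region, season, total FROM sales_summary ORDER BY region"
).fetchall()
```

Analysts then run their queries against sales_summary, so the operational database
is read only once per load rather than once per question.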

Figure 1. The path from operational data to data analysis

What the Warehouse Contains


This document describes how data in the Warehouse is organized and explains table and
data element terminology. It is subdivided as follows:
· Data Collections
· Tables
· Table Help
· Data Elements
· Data Element Help
· Summary Data
Data Collections
The Warehouse contains data from transaction systems. The information from each
transaction system is referred to as a data collection. For example, information from SRS
(Student Records System) is referred to as the Student Data Collection, while account
balance information from the BEN Financials General Ledger is referred to as the General
Ledger Data Collection.
Note that depending on your authorization level, you may or may not have access to a
specific data collection. If you have questions about a specific data collection or want
access to a collection, contact Data Administration.

Tables
Each data collection in the Warehouse is organized in a set of related tables. Each table
consists of data elements that describe or qualify an item of business significance. For
example, the Student Data Collection has an ADDRESS table with data elements such as
SSN (Social Security Number), street address, etc. The collection also has a PERSON
table with data elements such as SSN, name, birth date, etc. The ADDRESS table is
related to the PERSON table through the SSN. Because the data is stored in tables, it is
easy to access just the data you need, rather than having to plow through all the data in
the data collection. A table may be a part of more than one data collection.
Note that some data elements in tables are indexed (indexed columns are noted in the
documentation for each table). Indexing enables the system to execute queries faster. A
query with a record selection condition using an indexed data element tells the system to
go directly to the rows in the table that contain the value indicated and to stop retrieving
data when the value is no longer found. If a query does not select records based on an
indexed data element in its record selection condition, the system starts searching at the
first row in the table and works through every row until it reaches the last row in the
table. Tables can contain hundreds of thousands or even millions of rows (for example,
one table contains 93,000 students, about 250,000 addresses, and approximately 700,000
enrollments). Thus, queries that do not use indexed data elements for record selection
will run slowly.
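
The effect of selecting on an indexed data element can be observed in any relational
engine. The sketch below uses SQLite rather than the Warehouse's own database, with a
hypothetical enrollment table, and asks the engine for its query plan in each case.
The exact wording of the plan varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE enrollment (student_id INTEGER, course TEXT, term TEXT)")
conn.executemany(
    "INSERT INTO enrollment VALUES (?, ?, ?)",
    [(i % 1000, f"C{i % 50}", "2006A") for i in range(10_000)],
)
# Only student_id is indexed; course is not.
conn.execute("CREATE INDEX idx_enrollment_student ON enrollment (student_id)")


def plan(sql):
    """Return the plan text SQLite reports for a query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " ".join(row[-1] for row in rows)


# Selection on the indexed column can go directly to the matching rows.
indexed_plan = plan("SELECT * FROM enrollment WHERE student_id = 42")
# Selection on a non-indexed column has to read every row in the table.
unindexed_plan = plan("SELECT * FROM enrollment WHERE course = 'C7'")
```

Printing the two plans shows the first using idx_enrollment_student and the second
falling back to a full table scan.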
Table Help. Help documentation is available for each table in the Warehouse, accessible
via a hyperlink from the table name. Help describes the basic contents of the table. If
applicable, it also gives the following information:
Explanation. Describes the physical makeup or content of the table.
Common uses. Describes some queries that would make use of the table.
Primary key. Lists the data elements that are the primary keys in the table.
Indexed data elements. Lists the data elements that are indexed in the table. Since tables
can consist of many rows, queries that include record selection conditions based on
indexed data elements provide faster results.
Related tables. Identifies other tables that may be meaningful to your query. That is,
tables that are good candidates for containing information that you may want to include
in your results. For example, if you are using the Enrollment table to list students in a
specific course section, you may want to use the Person table to get the students' names.
Cautions. Provides additional guidance, help, or explanation about a table. It can also
include recommendations that must be followed to prevent poor query results.
Data Elements
The smallest unit of data that you can work with is called a data element. A data element
cannot be logically divided any further without losing its meaning or context. Zip code,
last name, and SSN are examples of data elements that cannot be logically divided any
further without becoming meaningless. In contrast, student and address are not data
elements because they can be logically divided into more units of data.
Data Element Help
Help documentation is available for each data element in the Warehouse. Help describes
the data element and indicates whether it is indexed, its format, and whether it must be non-null. If applicable, it
also provides a list of valid values for the data element. Values for data elements can be
listed in alphabetical order or in the order most frequently used.
The primary datatypes used in the Data Warehouse are CHAR (character), DATE and
NUMBER. Element formats are indicated by the datatype, length and, for NUMBER
types, precision and scale. The format of the Name column in the EMPLOYEE_GENERAL
table, for example, is listed as CHAR(30), meaning that the column is of character
datatype and holds a maximum of 30 characters. Numeric datatypes (such as
Payment_Amount in the EMPLOYEE_PAYMENT table) have a specified precision and
scale. Precision is the maximum total number of digits in the column, while scale is the
number of digits to the right of the decimal point. For example, the format for
EMPLOYEE_PAYMENT.Payment_Amount is represented by NUMBER(9,2), meaning
that the column is of numeric datatype, with a total of 9 digits of which 2 are to the
right of the decimal point; thus, the maximum value is 9999999.99.
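The relationship between precision, scale and the maximum value can be checked with
a short calculation. The function names below are hypothetical, and the rounding
applied in fits is an assumption of this sketch rather than a statement about how
the Warehouse's database treats out-of-scale input.

```python
from decimal import Decimal


def max_number_value(precision, scale):
    """Largest value a NUMBER(precision, scale) column can hold."""
    # All digits set to 9, then the decimal point shifted left by `scale`.
    return Decimal(10 ** precision - 1).scaleb(-scale)


def fits(value, precision, scale):
    """True when `value`, rounded to `scale` places, fits the declared format."""
    quantized = Decimal(str(value)).quantize(Decimal(1).scaleb(-scale))
    return abs(quantized) <= max_number_value(precision, scale)


limit = max_number_value(9, 2)
```

For NUMBER(9,2) this gives seven digits before the decimal point and two after it,
which matches the maximum stated above.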
Summary Data
One major advantage of the Data Warehouse is that it contains data at different levels of
summarization. For example, you could retrieve data as individual transactions or as
summaries by week, by month, or by year. Note that additional levels of summarization
can be added to the Warehouse as needed and as resources allow without impacting
existing data.
Summary data are different for every data collection in the Warehouse. For example, the
Student Data Collection includes a student detail level which consists of the basic SRS
tables, and the student census level which is a snapshot of student term activity taken at
the census date (one week after the end of the drop period). The student detail level
changes daily. The student census level, once loaded for a given term, remains static and
never changes for that term. The advantage of the student census level is that you can run
hundreds of queries on a hundred different days, and run them ten years from now and
still have the same numbers for comparison. The census level is the data Penn uses for
official enrollment statistics, for example, for providing data to the state and federal
governments. In the General Ledger Data Collection, the SUMMARY_BALANCES
table contains budget, encumbrance, and actual balances for summary-level Accounting
Flexfields by accounting period. Balances are available for the month, the fiscal year-to-
date, and the project year-to-date.
If you need to know what summary data are available for a specific data collection, refer
to the documentation for that data collection.
====
http://en.wikipedia.org/wiki/Data_warehouse
