Lesson 2. Data Warehouse Basic Concepts
DATA WAREHOUSE
BASIC CONCEPTS
OBJECTIVES:
Duration: 1 hour
A. Subject-oriented
Data warehouses are designed to help decision-makers analyze data. A data
warehouse is organized around significant subjects such as customers,
employees, suppliers, accounts, sales, and products, rather than around the
day-to-day operations and transactions of an organization. This
subject-specific design helps reduce query response time, because fewer
records have to be searched to answer the user's question. For example, to
learn more about the company's sales data, a data warehouse that concentrates
on sales can be built. Using this data warehouse, questions like "Who was our
best customer for this product last year?" or "Who is likely to be our best
customer next month?" can be answered. This ability to define a data warehouse
by subject matter (sales, in this case) makes the data warehouse
subject-oriented.
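As a minimal sketch of the first question above, assume a hypothetical sales
table with customer, product, year, and amount columns. The table, column
names, and figures are invented for illustration, and Python's built-in
sqlite3 module stands in for a real warehouse.

```python
import sqlite3

# Hypothetical sales-subject table; names and figures are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (customer TEXT, product TEXT, year INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("Ana", "laptop", 2023, 1200.0),
        ("Ben", "laptop", 2023, 900.0),
        ("Ana", "laptop", 2023, 300.0),
        ("Ben", "phone", 2023, 2500.0),
        ("Cara", "laptop", 2022, 5000.0),
    ],
)

def best_customer(product, year):
    """Return (customer, total) with the highest total sales for a product in a year."""
    return conn.execute(
        "SELECT customer, SUM(amount) AS total FROM sales "
        "WHERE product = ? AND year = ? "
        "GROUP BY customer ORDER BY total DESC LIMIT 1",
        (product, year),
    ).fetchone()

print(best_customer("laptop", 2023))
```

Because the table is organized around the sales subject, the question maps
onto a single grouped query.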
B. Integrated
Data warehouses establish consistency among different types of data drawn
from multiple sources, such as relational databases, flat files, and online
transaction records, by putting the data into one consistent location. Data
cleaning and data integration techniques are applied to ensure consistency in
naming conventions, encoding structures, attribute measures, and so on.
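The kind of harmonization described here can be sketched as follows. The two
source layouts, the field names, and the gender code mapping are all
assumptions made up for this example; they only illustrate differences in
naming conventions, encoding structures, and attribute measures.

```python
# Hypothetical records from two sources that describe customers differently.
crm_rows = [{"cust_name": "Ana Cruz", "gender": "F", "height_cm": 160.0}]
legacy_rows = [{"CustomerName": "Ben Diaz", "sex": 1, "height_in": 70.0}]

# Assumed mapping from each source's encoding to one warehouse encoding.
GENDER_CODES = {"F": "female", "M": "male", 0: "female", 1: "male"}

def from_crm(row):
    """Rename fields and recode gender; height is already in centimeters."""
    return {
        "customer_name": row["cust_name"],
        "gender": GENDER_CODES[row["gender"]],
        "height_cm": row["height_cm"],
    }

def from_legacy(row):
    """Rename fields, recode gender, and convert inches to centimeters."""
    return {
        "customer_name": row["CustomerName"],
        "gender": GENDER_CODES[row["sex"]],
        "height_cm": round(row["height_in"] * 2.54, 1),
    }

# After integration, every record uses the same names, codes, and units.
integrated = [from_crm(r) for r in crm_rows] + [from_legacy(r) for r in legacy_rows]
print(integrated)
```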
C. Time-variant
Data in the data warehouse is associated with time. Data warehouse analysis
focuses on reflecting historical changes, so the system records the
company's information starting from a certain point in time.
D. Nonvolatile
Data in a data warehouse remains stable and does not change. The operations
involved are mainly queries and analysis, without updates in the usual sense.
Once data enters the data warehouse, it is generally retained for a long
time. A data warehouse typically handles a large number of query operations.
Its usual work is to load, query, and analyze data; it generally does not
perform modification operations.
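The time-variant and nonvolatile properties can be illustrated together with
an append-only table of dated snapshots. This is only a sketch: the
price_history table, its columns, and the sample prices are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE price_history (product TEXT, price REAL, snapshot_date TEXT)"
)

def load_snapshot(snapshot_date, prices):
    """Append one dated snapshot; existing rows are never updated or deleted."""
    conn.executemany(
        "INSERT INTO price_history VALUES (?, ?, ?)",
        [(product, price, snapshot_date) for product, price in prices.items()],
    )

load_snapshot("2024-01-31", {"laptop": 1000.0})
load_snapshot("2024-02-29", {"laptop": 950.0})

def price_as_of(product, date):
    """Return the most recent recorded price on or before the given ISO date."""
    row = conn.execute(
        "SELECT price FROM price_history "
        "WHERE product = ? AND snapshot_date <= ? "
        "ORDER BY snapshot_date DESC LIMIT 1",
        (product, date),
    ).fetchone()
    return row[0] if row else None

print(price_as_of("laptop", "2024-02-15"))
```

The only operations are load and query; a price change is recorded as a new
row rather than as an update, so earlier states stay available for analysis.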
Since data in a data warehouse comes from historical data, the organization
can better understand the data that it is currently collecting and what data
it still needs to collect.
A data warehouse gives a centralized view of all data being collected across
the organization and provides a means for identifying inconsistent data.
Once the data has been confirmed to be consistent, the organization can
generate data without ambiguity.
With a data warehouse, the organization can have snapshots of data over
time.
A data warehouse provides tools to combine data, which can yield new
information and analysis.
Where to use this data?
Many organizations use the information taken from a data warehouse to
support business decision-making activities, including (Han, Kamber & Pei, 2012):
1. increasing customer focus, which includes the analysis of customer buying
patterns such as
a. buying preference,
b. buying time,
c. budget cycles, and
d. appetites for spending;
2. repositioning products and managing product portfolios by comparing the
performance of sales by quarter, by year, and by geographic region in order
to fine-tune production strategies;
3. analyzing operations of the organization and looking for sources of
profit; and
4. managing customer relationships, making environmental corrections, and
managing the cost of corporate assets.
LESSON 2:
OPERATIONAL DATABASE SYSTEM
VS. DATA WAREHOUSES
OBJECTIVES:
At the end of this lesson, the student will be able to:
Duration: 1 hour
Operational Database Systems vs. Data Warehouses
The data warehouse system serves users or knowledge workers in data
analysis and decision-making. Such a system can organize and present information in
specific formats to meet the diverse needs of various users. These systems are
called Online Analytical Processing (OLAP) systems. OLAP handles historical
or archived data collected over a long period. For example, if we collect
information about flight bookings for the last ten years, the data can reveal
meaningful patterns such as booking trends. This may provide useful
information, such as peak travel times and what kinds of people travel in each
class (economy/business).
The significant difference between Online Transaction Processing (OLTP) and
OLAP systems is the amount of data analyzed in a single transaction. OLTP
manages many concurrent clients and queries at the same time, and each of
these queries involves only a single record or a limited set of records. An
OLAP system must be able to process millions of records to answer a single
query.
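The contrast can be sketched with the flight booking example. The bookings
table, its columns, and the four sample rows are hypothetical; a real OLAP
query would scan far more records.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bookings (booking_id INTEGER PRIMARY KEY, passenger TEXT, "
    "travel_class TEXT, month INTEGER)"
)
conn.executemany(
    "INSERT INTO bookings (passenger, travel_class, month) VALUES (?, ?, ?)",
    [
        ("Ana", "economy", 12),
        ("Ben", "business", 12),
        ("Cara", "economy", 12),
        ("Dan", "economy", 6),
    ],
)

def lookup_booking(booking_id):
    """OLTP-style access: fetch a single record by its key."""
    return conn.execute(
        "SELECT passenger, travel_class FROM bookings WHERE booking_id = ?",
        (booking_id,),
    ).fetchone()

def bookings_per_month():
    """OLAP-style access: scan and aggregate every record."""
    return conn.execute(
        "SELECT month, COUNT(*) FROM bookings GROUP BY month ORDER BY month"
    ).fetchall()

print(lookup_booking(2))
print(bookings_per_month())
```

The first function touches one row; the second has to read the whole table
to find the peak travel month.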
The two kinds of systems differ in the following respects (Han, Kamber &
Pei, 2012).
1. Users and system orientation:
An OLTP system is customer-oriented and is designed for real-time
business transactions and processes.
An OLAP system is market-oriented and is used by knowledge workers to
analyze business indicators.
2. Data contents:
An OLTP system manages current data from simple transactions (CRUD) that
is too detailed to be easily used by an analyst.
An OLAP system manages large amounts of historical data and provides
facilities for summarization and aggregation, which make the data
easier to use for informed decision making.
3. Database design:
An OLTP system generally adopts an entity-relationship (ER) data model
and an application-oriented database design.
An OLAP system usually adopts a star or snowflake model and a
subject-oriented database design.
4. View:
An OLTP system mainly focuses on the current data in an enterprise or
department.
An OLAP system often spans multiple versions of a database schema.
OLAP systems also process data from various organizations and
integrate information from many data stores.
5. Access patterns:
Access to an OLTP system consists mainly of short atomic transactions.
Access to an OLAP system is mostly read-only, because data warehouses
store historical data.
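The star model mentioned under database design can be sketched as one fact
table surrounded by dimension tables. The table names (fact_sales,
dim_product, dim_date), columns, and values below are assumptions chosen for
illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER, quarter INTEGER);
-- The fact table holds the measure plus a foreign key to each dimension.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id INTEGER REFERENCES dim_date(date_id),
    amount REAL
);
INSERT INTO dim_product VALUES (1, 'laptop', 'computers'), (2, 'phone', 'mobile');
INSERT INTO dim_date VALUES (1, 2023, 1), (2, 2023, 2);
INSERT INTO fact_sales VALUES (1, 1, 100.0), (1, 2, 150.0), (2, 1, 80.0);
""")

def sales_by_category_and_quarter():
    """Join the fact table to its dimensions and summarize the measure."""
    return conn.execute(
        "SELECT p.category, d.quarter, SUM(f.amount) "
        "FROM fact_sales f "
        "JOIN dim_product p ON p.product_id = f.product_id "
        "JOIN dim_date d ON d.date_id = f.date_id "
        "GROUP BY p.category, d.quarter ORDER BY p.category, d.quarter"
    ).fetchall()

print(sales_by_category_and_quarter())
```

A snowflake model would further split a dimension such as dim_product into
normalized sub-tables (for example, a separate category table).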
Other features that distinguish between OLTP and OLAP systems include database
size, frequency of operations, and performance metrics. These are summarized in
Table 7.1.
Table 7.1. Differences between OLTP and OLAP (Han, Kamber, & Pei, 2012)
OBJECTIVES:
Middle Tier - The middle tier in a data warehouse is typically an OLAP server, which
is implemented using either of the following models:
1. ROLAP (Relational OLAP) – an extended relational database
management system that maps operations on multi-dimensional data to
standard relational operations.
2. MOLAP (Multi-dimensional OLAP) – a model that directly implements
multi-dimensional data and operations.
Top Tier - This tier is the front-end client layer. It holds the query and
reporting tools, analysis tools, and data mining tools.
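The difference between the two middle-tier models can be sketched with a
roll-up from (region, quarter) cells to region totals. This is a toy
illustration under stated assumptions: the sample rows are invented, and a
plain dictionary stands in for the specialized array storage a real MOLAP
server would use.

```python
import sqlite3
from collections import defaultdict

# Hypothetical measurements: (region, quarter, amount).
rows = [("north", 1, 10.0), ("north", 2, 20.0), ("south", 1, 5.0)]

# ROLAP-style: keep relational rows and express the roll-up as GROUP BY.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, quarter INTEGER, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)

def rollup_rolap():
    """Roll up to region using a standard relational operation."""
    return dict(conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region"))

# MOLAP-style: store cells directly in a multi-dimensional structure.
cube = defaultdict(float)
for region, quarter, amount in rows:
    cube[(region, quarter)] += amount

def rollup_molap():
    """Roll up to region by summing cube cells along the quarter dimension."""
    totals = defaultdict(float)
    for (region, _quarter), amount in cube.items():
        totals[region] += amount
    return dict(totals)

print(rollup_rolap())
print(rollup_molap())
```

Both functions answer the same multi-dimensional question; they differ only
in where the dimensional structure lives.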
Enterprise Warehouse
An enterprise warehouse stores and manages all historical records about
subjects (customers, products, sales, assets, personnel) across the entire
organization. It supports corporate-wide data integration, usually from one or more
operational systems or external data providers, and it is cross-functional in scope.
An enterprise warehouse contains detailed data as well as summarized data and can
range in size from a few gigabytes to hundreds of gigabytes, terabytes, or beyond.
An enterprise data warehouse may be implemented on traditional mainframes,
computer superservers, or parallel architecture platforms. It requires extensive
business modeling and may take years to design and build.
Data Mart
A data mart is a smaller, more focused data warehouse. It contains a
subset of corporate-wide data that is of value to a specific group of users. The scope
of a data mart is confined to specific selected subjects. Simply put, data flows
from the data warehouse into different departments to support the customized needs of
these departments. These department-level databases are called data marts; a data
mart is the data collection of a department. For example, a marketing data mart may
confine its subjects to customer, item, and sales. The marketing
department has its own data mart, and the finance department has its own as well. The
data marts of the two departments may be related, but they are distinct and
independent. In an independent data mart, data can also be collected directly from data
sources.
Unlike an enterprise warehouse, a data mart is usually implemented on low-cost
departmental servers that are Unix/Linux or Windows-based. The implementation
cycle of a data mart is more likely to be measured in weeks than in months or
years. However, it may involve complex integration in the long
run if its design and planning were not enterprise-wide.
Data marts can be categorized by the source of their data:
1. Independent data marts. In independent data marts, data comes from one
or more operational systems or external information providers, or from data
generated locally within a specific department or region.
2. Dependent data marts. The data in a dependent data mart comes
directly from the enterprise data warehouse.
Virtual warehouse
A virtual warehouse is a set of views over operational databases that can be queried
together so a user can effectively access all the data as if it were stored in one data
warehouse. For efficient query processing, only some of the possible summary views
may be materialized. A virtual warehouse is easy to build but requires excess
capacity on operational database servers.
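A virtual warehouse can be sketched as a view defined over operational
tables. In this simplified example, two hypothetical tables in one SQLite
database stand in for separate operational databases; the table names,
view name, and rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Two hypothetical operational tables standing in for separate databases.
CREATE TABLE store_orders (order_id INTEGER, customer TEXT, amount REAL);
CREATE TABLE web_orders (order_id INTEGER, customer TEXT, amount REAL);
INSERT INTO store_orders VALUES (1, 'Ana', 50.0);
INSERT INTO web_orders VALUES (1, 'Ana', 25.0), (2, 'Ben', 40.0);

-- The "virtual warehouse" is a view, so no data is copied.
CREATE VIEW all_orders AS
    SELECT 'store' AS channel, customer, amount FROM store_orders
    UNION ALL
    SELECT 'web' AS channel, customer, amount FROM web_orders;
""")

def total_by_customer():
    """Query the view as if all orders were stored in one warehouse table."""
    return conn.execute(
        "SELECT customer, SUM(amount) FROM all_orders "
        "GROUP BY customer ORDER BY customer"
    ).fetchall()

print(total_by_customer())
```

Every query against the view is executed on the operational tables
themselves, which is why this approach needs spare capacity on the
operational servers.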
LESSON 4:
EXTRACTION, TRANSFORMATION AND
LOADING
OBJECTIVES:
Duration: 1 hour
Extraction, Transformation, and Loading (ETL)
Figure 7.1 shows that data warehouse systems use back-end tools and
utilities to populate and refresh their data. These tools and utilities carry out the
Extraction, Transformation, and Loading (ETL) process. ETL is the process of
extracting, cleaning, and transforming business system data and then loading it into
the data warehouse. The purpose of ETL is to integrate scattered, messy, and
inconsistent data in the organization, providing an analytical basis for enterprise
decision-making. ETL is a recurring process in data warehouse systems; it can
be run daily, weekly, or monthly, and it needs to be flexible, automated, and
well-documented (Golfarelli & Rizzi, 2009).
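The three ETL stages can be sketched as three small functions. This is a
minimal illustration, not a production pipeline: the CSV text, the column
names, the cleaning rules, and the sales table are all assumptions made up
for this example.

```python
import csv
import io
import sqlite3

# Hypothetical source extract; in practice this would come from a business system.
SOURCE_CSV = """customer,amount,date
 ana cruz ,100.50,2024-01-05
BEN DIAZ,,2024-01-06
Cara Lim,75,2024-01-07
"""

def extract(text):
    """Read raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Clean names, drop rows with no amount, and convert amounts to numbers."""
    cleaned = []
    for row in rows:
        if not row["amount"].strip():
            continue  # skip incomplete records
        cleaned.append(
            (row["customer"].strip().title(), float(row["amount"]), row["date"])
        )
    return cleaned

def load(conn, rows):
    """Append the cleaned rows to the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount REAL, sale_date TEXT)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)

conn = sqlite3.connect(":memory:")
load(conn, transform(extract(SOURCE_CSV)))
print(conn.execute("SELECT * FROM sales ORDER BY sale_date").fetchall())
```

Keeping each stage in its own function makes the process easier to automate
on a schedule and to document.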
ETL Tools
1. Data Extraction
Extraction is the first step of ETL; it aims to extract data from the
source systems. The extraction process is usually one of the most
time-consuming tasks in ETL. Different systems tend to use different data
formats, which are converted into a common format for further processing. The
source system may be complex and inadequately documented, making it
difficult to determine what data needs to be extracted. The data must be
extracted repeatedly, at regular intervals, to supply all changed data to the
warehouse and keep it up to date.
2. Data cleaning
OBJECTIVES:
Duration: 1 hour
Metadata
Metadata refers to "data about data". In a data warehouse, metadata defines and
describes all the information on the warehouse subjects. Metadata runs through the
entire life cycle of the data warehouse and is used to drive its development,
making it possible to automate and visualize the data warehouse.
Types of Metadata
Metadata in a data warehouse falls into three major categories (Ponniah, 2010):
Operational metadata
Extraction and transformation metadata
End-user metadata