Lesson 2. Data Warehouse Basic Concepts

A data warehouse (DW) is a centralized repository for historical and current data that supports decision-making across an organization. It is characterized as subject-oriented, integrated, time-variant, and non-volatile, allowing for efficient data analysis and reporting. The document also contrasts operational databases (OLTP) with data warehouses (OLAP), outlines the architecture and models of data warehouses, and describes the ETL process for data integration.

Uploaded by

Aaron Gutierrez
Copyright
© All Rights Reserved

UNIT 05
DATA WAREHOUSE BASIC CONCEPTS

A data warehouse (DW or DWH) is a strategic collection of data that supports the
decision-making process at all levels of the enterprise.

A DW is a single data store created for analytical reporting and decision support.
It also provides guidance for business process improvement and for monitoring time,
cost, quality, and control in companies that need business intelligence. A data
warehouse usually contains a large amount of historical data and is used for
specialized analysis.

LESSON 1:
DATA WAREHOUSE
AND ITS CHARACTERISTICS

OBJECTIVES:

At the end of this lesson, the student will be able to:

• Define the concept of a data warehouse
• Determine each characteristic of a data warehouse
• Identify the benefits of having a data warehouse

Duration: 1 hour

What is a Data Warehouse?

A data warehouse is a collection of current and historical data of potential
interest to decision makers throughout the company, data that are difficult or
impossible to obtain from traditional operational databases. The data originate in
different operational transaction systems, such as systems for sales, customer
accounts, and manufacturing, and may include data from Web site transactions. The
data warehouse makes the data available for users to access as needed, but the data
cannot be altered. Before data enter the warehouse, they go through data cleaning,
extraction and conversion, integration, and loading: to guarantee correctness, the
data are cleaned, extracted, converted into the form the warehouse requires, and
then loaded (Laudon & Laudon, 2014).

A data warehouse is maintained separately from an organization's operational
databases. Data warehouse systems allow the integration of a variety of application
systems. They support information processing by providing a stable platform of
consolidated historical data for analysis (Han, Kamber & Pei, 2012).

Characteristics of a data warehouse


William H. Inmon, the recognized father of the data warehousing concept,
defines data warehouse systems as a collection of data that provides the following
characteristics:
a. subject-oriented
b. integrated
c. time-variant
d. non-volatile

A. Subject-oriented
Data warehouses are designed for decision-makers to analyze data. A data
warehouse environment is organized around significant subjects such as
customers, employees, suppliers, accounts, sales, and products, instead of
focusing on the day-to-day operations and transactions of an organization.
This subject-specific design helps reduce query response time, because only
a few records need to be searched to answer the user's question. For
example, to learn more about the company's sales data, a data warehouse that
concentrates on sales can be built. Using this data warehouse, questions
like "Who was our best customer for this product last year?" or "Who is
likely to be our best customer next month?" can be answered. This ability to
define a data warehouse by subject matter (sales, in this case) makes the
data warehouse subject-oriented.
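
The first question above can be sketched as a query against a sales-oriented table. This is only a minimal illustration in Python with the standard sqlite3 module; the table name sales, its columns, the function best_customer, and the sample rows are all hypothetical and are not part of the lesson.

```python
import sqlite3

# Hypothetical sales-subject table; names and rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (customer TEXT, product TEXT, sale_year INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("Ana", "laptop", 2023, 1200.0),
        ("Ben", "laptop", 2023, 900.0),
        ("Ana", "laptop", 2023, 300.0),
        ("Ben", "laptop", 2022, 5000.0),   # other year: excluded by the filter
        ("Cara", "phone", 2023, 2000.0),   # other product: excluded by the filter
    ],
)


def best_customer(conn, product, year):
    """Return (customer, total) with the highest total sales for one product and year."""
    return conn.execute(
        """
        SELECT customer, SUM(amount) AS total
        FROM sales
        WHERE product = ? AND sale_year = ?
        GROUP BY customer
        ORDER BY total DESC
        LIMIT 1
        """,
        (product, year),
    ).fetchone()


result = best_customer(conn, "laptop", 2023)
print(result)
```

Because the table holds only the sales subject, the question reduces to one filter and one aggregation. The forecasting question ("next month") would need a predictive model on top of the same data and is not covered by this sketch.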

B. Integrated
A data warehouse establishes consistency among data drawn from different
sources, such as relational databases, flat files, and online transaction
records, and places the data in one consistent location. Data cleaning and
integration techniques are applied to ensure consistency in naming
conventions, encoding structures, attribute measures, and so on.

C. Time-variant
Data in the data warehouse are linked to time. Data warehouse analysis
focuses on reflecting historical changes, so the system records the
company's information as of particular points in time.

D. Non-volatile
Data in a data warehouse remain stable and do not change. Operations on the
data are mainly queries and analysis, without updates in the usual sense.
Once data enter the data warehouse, they are generally retained for a long
time. A data warehouse therefore handles a large number of query operations:
its usual work is to load, query, and analyze data, and it generally does
not perform modification operations.

A well-designed data warehouse supports high-speed queries and high data
throughput. Based on this information, we can define data warehousing as the
process of constructing and using data warehouses. Building a data warehouse
requires cleaning, integration, and consolidation of data. Using a data warehouse
often involves a collection of different decision support technologies. With the aid
of these technologies, "knowledge workers" (e.g., managers, analysts, and
executives) can use the warehouse to obtain an overview of the data for management
analysis and business decision-making. The data warehouse can help transform the
company's operational data into high-value, available information (or knowledge),
which is delivered to the right people in the right way at the right time (Han,
Kamber & Pei, 2012).

Benefits of Data Warehouses

Data warehouses are beneficial to organizations for several reasons
(Bourgeois, 2014):

• Because a data warehouse is built from historical data, the organization gains
a better understanding of the data that it is currently collecting and of
what data it still needs to collect.
• A data warehouse gives a centralized view of all data being collected across
the organization and provides a means of identifying inconsistent data.
• Once the data has been identified as consistent, the organization can
generate data without ambiguity.
• By having a data warehouse, the organization can have snapshots of data
over time.
• A data warehouse provides tools to combine data, which can provide new
information and analysis.

Where to use this data?
Many organizations use the information taken from a data warehouse to
support business decision-making activities, including (Han, Kamber & Pei, 2012):
1. increasing customer focus, which includes the analysis of customer buying
patterns such as
a. buying preference,
b. buying time,
c. budget cycles, and
d. appetites for spending;
2. repositioning products and managing product portfolios by comparing sales
performance by quarter, by year, and by geographic region in order to
adjust production strategies;
3. analyzing the operations of the organization and looking for other sources
of profit; and
4. managing customer relationships, making environmental corrections, and
managing the cost of corporate assets.

LESSON 2:
OPERATIONAL DATABASE SYSTEM
VS. DATA WAREHOUSES

OBJECTIVES:
At the end of this lesson, the student will be able to:

• Describe the operational database system
• Compare OLTP and OLAP
• Identify the goals of OLTP and OLAP in different fields

Duration: 1 hour

Operational Database Systems vs. Data Warehouses

The operational database system is the primary source of data for the data
warehouse. It contains detailed information used to run the daily operations of the
organization, such as purchasing, inventory, manufacturing, banking, payroll,
registration, and accounting. As updates are applied, the data change frequently
and reflect the current values from the most recent transactions. The operational
database system is also called the Online Transaction Processing (OLTP) system,
which is used to manage dynamic data in real time. Operational data are the data
used in the operation of a specific system.

The data warehouse system serves users or knowledge workers in data analysis and
decision-making. This system can organize and present information in specific
formats to meet the diverse needs of various users. Such systems are called Online
Analytical Processing (OLAP) systems. OLAP handles historical or archived data
collected over a long period. For example, flight booking records collected over the
last ten years can reveal meaningful information such as booking trends, peak
travel times, and what kinds of people travel in the different classes
(economy/business).

The significant difference between OLTP and OLAP systems is the amount of data
analyzed in a single transaction. An OLTP system serves many concurrent clients and
queries at the same time, and each of these queries involves only a single record
or a limited set of records. An OLAP system must be able to process millions of
records to answer a single query.
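
The contrast between a single-record lookup and a query that reads the whole data set can be sketched with the flight-booking example. This is a rough illustration using Python's standard sqlite3 module; the bookings table, its columns, and the generated rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bookings (booking_id INTEGER PRIMARY KEY, travel_month INTEGER, cabin TEXT)"
)
# Generate 1000 hypothetical bookings; every fifth one is a business-class seat.
rows = [
    (i, (i % 12) + 1, "business" if i % 5 == 0 else "economy")
    for i in range(1, 1001)
]
conn.executemany("INSERT INTO bookings VALUES (?, ?, ?)", rows)

# OLTP-style access: fetch exactly one record through its primary key.
one_booking = conn.execute(
    "SELECT * FROM bookings WHERE booking_id = ?", (42,)
).fetchone()

# OLAP-style access: read every record to summarize bookings per cabin class.
per_cabin = dict(
    conn.execute("SELECT cabin, COUNT(*) FROM bookings GROUP BY cabin").fetchall()
)

print(one_booking)
print(per_cabin)
```

The first query touches one row no matter how large the table grows, while the cost of the second grows with the number of stored bookings.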

The goals of these two kinds of systems differ in the following areas (Han, Kamber
& Pei, 2012).
1. Users and system orientation:
• An OLTP system is customer-oriented and is designed for real-time
business transactions and processes.
• An OLAP system is market-oriented and is used by knowledge workers to
analyze business indicators.

2. Data contents:
• An OLTP system manages current data through simple transactions (CRUD);
the data are typically too detailed to be easily used by an analyst.
• An OLAP system manages large amounts of historical data and provides
facilities for summarization and aggregation, which make the data
easier to use for informed decision making.

3. Database design:
• An OLTP system generally adopts an entity-relationship (ER) data model
and an application-oriented database design.
• An OLAP system usually uses a star or snowflake model and a
subject-oriented database design.

4. View:
• An OLTP system mainly focuses on the current data in an enterprise or
department.
• An OLAP system often spans multiple versions of a database schema.
OLAP systems also process data from various organizations and
integrate information from many data stores.

5. Access patterns:
• Access to an OLTP system consists mainly of short, atomic transactions.
• Access to an OLAP system is mostly read-only, because the data
warehouse stores historical data.
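
The star model named under database design can be sketched as one central fact table whose rows point to small dimension tables. The following is a minimal sketch using Python's standard sqlite3 module; the table names fact_sales, dim_product, and dim_date, their columns, and the sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension tables describe the subjects (hypothetical names and rows).
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER, quarter INTEGER);

    -- The fact table sits at the center of the star and holds the measures.
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        date_id INTEGER REFERENCES dim_date(date_id),
        amount REAL
    );

    INSERT INTO dim_product VALUES (1, 'laptop'), (2, 'phone');
    INSERT INTO dim_date VALUES (10, 2023, 1), (11, 2023, 2);
    INSERT INTO fact_sales VALUES
        (1, 10, 100.0), (1, 11, 150.0), (2, 10, 80.0), (2, 10, 20.0);
    """
)

# A typical analytical query joins the fact table to its dimensions and aggregates.
sales_by_product_quarter = conn.execute(
    """
    SELECT p.name, d.quarter, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    JOIN dim_date AS d ON d.date_id = f.date_id
    GROUP BY p.name, d.quarter
    ORDER BY p.name, d.quarter
    """
).fetchall()

print(sales_by_product_quarter)
```

A snowflake model would further split a dimension into related tables (for example, a separate table for product categories); that variation is not shown here.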

Other features that distinguish between OLTP and OLAP systems include database
size, frequency of operations, and performance metrics. These are summarized in
Table 7.1.

Table 7.1. Differences between OLTP and OLAP (Han, Kamber, & Pei, 2012)

Feature                     OLTP                                 OLAP
Characteristic              operational processing               informational processing
Orientation                 transaction                          analysis
User                        clerk, DBA, database professional    knowledge worker (e.g., manager, executive, analyst)
Function                    day-to-day operations                long-term informational requirements, decision support
DB design                   ER-based, application-oriented       star/snowflake, subject-oriented
Data                        current, guaranteed up-to-date       historic, accuracy maintained over time
Summarization               primitive, highly detailed           summarized, consolidated
View                        detailed, flat relational            summarized, multi-dimensional
Unit of work                short, simple transaction            complex query
Access                      read/write                           mostly read
Focus                       data in                              information out
Operations                  index/hash on a primary key          lots of scans
Number of records accessed  tens                                 millions
Number of users             thousands                            hundreds
DB size                     GB to high-order GB                  ≥ TB
Priority                    high performance, high availability  high flexibility, end-user autonomy
Metric                      transaction throughput               query throughput, response time

LESSON 3:
DATA WAREHOUSE ARCHITECTURE
AND MODELS

OBJECTIVES:

At the end of this lesson, the student will be able to:

• Identify the three models of a data warehouse
• Discuss the requirements of each data warehouse model
• Differentiate the different types of data mart

Duration: 1.5 hours


Data Warehouse Architecture

Figure 7.1. A three-tier data warehousing architecture (Han, Kamber, & Pei, 2012)

Data warehouses often adopt a three-tier architecture, as depicted in Figure 7.1.

Bottom Tier - This is where the data warehouse database server resides. Typically,
it is a relational database system. Different back-end tools and utilities are used
to feed data into the bottom tier. These back-end tools perform the extract, clean,
load, and refresh functions.

Middle Tier - The middle tier in a data warehouse is typically an OLAP server, which
is implemented using either of the following models:
1. ROLAP (Relational OLAP) – an extended relational database
management system that maps operations on multi-dimensional data to
standard relational operations.
2. MOLAP (Multi-dimensional OLAP) – a special-purpose server that directly
implements multi-dimensional data and operations.

Top Tier - This tier is the front-end client layer. It holds the query and
reporting tools, analysis tools, and data mining tools.
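
The difference between the two middle-tier models can be sketched with one roll-up operation (total sales per region, summed over quarters). This is a simplified illustration in Python, not how any particular OLAP server works: the ROLAP side uses the standard sqlite3 module and a GROUP BY, and the MOLAP side uses a nested list as a stand-in for a multi-dimensional array. All names and figures are hypothetical.

```python
import sqlite3

# Hypothetical sales measures identified by two dimensions: (region, quarter).
cells = [("north", 1, 10.0), ("north", 2, 15.0), ("south", 1, 7.0), ("south", 2, 8.0)]

# ROLAP-style: the roll-up over quarters is mapped to a standard relational GROUP BY.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, quarter INTEGER, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", cells)
rolap_rollup = dict(
    conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region").fetchall()
)

# MOLAP-style: the cells live in an array indexed by dimension position,
# and the roll-up is a sum along the quarter axis.
regions = ["north", "south"]
quarters = [1, 2]
cube = [[0.0 for _ in quarters] for _ in regions]
for region, quarter, amount in cells:
    cube[regions.index(region)][quarters.index(quarter)] += amount
molap_rollup = {region: sum(cube[i]) for i, region in enumerate(regions)}

print(rolap_rollup)
print(molap_rollup)
```

Both approaches return the same totals; they differ in where the multi-dimensional structure lives, in relational tables or in the array itself.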

Data Warehouse Models


From the perspective of data warehouse architecture, there are three data
warehouse models:
• enterprise warehouse
• data mart
• virtual warehouse

Enterprise Warehouse
An enterprise warehouse stores and manages all historical records about
subjects (customers, products, sales, assets, personnel) across the entire
organization. It supports corporate-wide data integration, usually from one or more
operational systems or external data providers, and it is cross-functional in scope.
An enterprise warehouse contains detailed data as well as summarized data and can
range in size from a few gigabytes to hundreds of gigabytes, terabytes, or beyond.
It may be implemented on traditional mainframes, computer superservers, or parallel
architecture platforms. It requires extensive business modeling and may take years
to design and build.

Data Mart
A data mart is a smaller, more focused data warehouse. It contains a subset of
corporate-wide data that is of value to a specific group of users, and its scope is
restricted to particular selected subjects. Simply put, data flows from the data
warehouse to different departments to support their customized use, and these
department-level databases are called data marts. For example, a marketing data
mart may restrict its subjects to customer, item, and sales. The marketing
department has its own data mart, and so does the finance department. The data marts
of the two departments may be related, but they are different and independent. An
independent data mart can also collect data directly from data sources.

Unlike an enterprise warehouse, a data mart is usually implemented on low-cost
departmental servers that are Unix/Linux or Windows-based. Its implementation cycle
is measured in weeks rather than months or years. However, it may involve complex
integration in the long run if its design and planning were not enterprise-wide.

Data marts can be categorized depending on the source of data:
1. Independent data marts. In independent data marts, data come from one
or more operational systems or external information providers, or from data
generated in a specific department or region.
2. Dependent data marts. The data in a dependent data mart come
directly from the enterprise data warehouse.
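
A dependent data mart can be pictured as a subject-restricted copy taken from the enterprise warehouse. The sketch below uses Python's standard sqlite3 module and keeps both tables in one in-memory database for simplicity; the table names warehouse_sales and marketing_mart, the column cost_center, and the rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Stand-in for the enterprise warehouse (hypothetical table and rows).
    CREATE TABLE warehouse_sales (customer TEXT, item TEXT, amount REAL, cost_center TEXT);
    INSERT INTO warehouse_sales VALUES
        ('Ana', 'laptop', 1200.0, 'CC-1'),
        ('Ben', 'phone', 600.0, 'CC-2');

    -- Dependent data mart: only the marketing subjects (customer, item, sales)
    -- are copied from the warehouse; finance-only columns are left out.
    CREATE TABLE marketing_mart AS
        SELECT customer, item, amount FROM warehouse_sales;
    """
)

# PRAGMA table_info returns one row per column; index 1 holds the column name.
mart_columns = [row[1] for row in conn.execute("PRAGMA table_info(marketing_mart)")]
mart_row_count = conn.execute("SELECT COUNT(*) FROM marketing_mart").fetchone()[0]

print(mart_columns, mart_row_count)
```

An independent data mart would be loaded from operational or external sources instead of from warehouse_sales.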

Virtual Warehouse
A virtual warehouse is a set of views over operational databases that can be queried
together, so a user can effectively access all the data as if it were stored in one
data warehouse. For efficient query processing, only some of the possible summary
views may be materialized. A virtual warehouse is easy to build but requires excess
capacity on operational database servers.
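
The idea of a warehouse made of views can be sketched with one view defined over two operational tables. This is a minimal sketch using Python's standard sqlite3 module; in practice the operational databases would be separate systems, whereas here both hypothetical tables (store_orders and web_orders) live in a single in-memory database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Two hypothetical operational tables.
    CREATE TABLE store_orders (order_id INTEGER, amount REAL);
    CREATE TABLE web_orders (order_id INTEGER, amount REAL);
    INSERT INTO store_orders VALUES (1, 50.0), (2, 25.0);
    INSERT INTO web_orders VALUES (3, 10.0);

    -- The "virtual warehouse": a view that stores no data of its own.
    CREATE VIEW all_orders AS
        SELECT 'store' AS channel, order_id, amount FROM store_orders
        UNION ALL
        SELECT 'web' AS channel, order_id, amount FROM web_orders;
    """
)

total_before = conn.execute("SELECT SUM(amount) FROM all_orders").fetchone()[0]

# A new operational transaction is visible through the view immediately,
# because every query on the view reads the operational tables.
conn.execute("INSERT INTO web_orders VALUES (4, 15.0)")
total_after = conn.execute("SELECT SUM(amount) FROM all_orders").fetchone()[0]

print(total_before, total_after)
```

Each analytical query runs against the operational tables themselves, which is why this model needs spare capacity on the operational servers.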
LESSON 4:
EXTRACTION, TRANSFORMATION AND
LOADING

OBJECTIVES:

At the end of this lesson, the student will be able to:

• Define the ETL process
• Determine the data to be cleansed
• Enumerate the steps in loading the data to data warehouse

Duration: 1 hour

Extraction, Transformation, and Loading (ETL)

Figure 7.1 shows that data warehouse systems use back-end tools and
utilities to populate and refresh the data. These tools and utilities carry out the
Extraction, Transformation, and Loading (ETL) process: extracting, cleaning, and
transforming business system data and then loading it into the data warehouse. The
purpose of ETL is to integrate the scattered, messy, and inconsistent data in the
organization so that it provides an analytical basis for enterprise decision-making.
ETL is a systematic part of a data warehouse system. It can be run daily, weekly,
or monthly, and it needs to be flexible, automated, and well documented
(Golfarelli & Rizzi, 2009).

ETL Tools
1. Data Extraction

Extraction is the first step of ETL; it aims to extract information from the
source systems. The extraction process is usually one of the most
time-consuming tasks in ETL. Different systems tend to use different data
formats, which are converted into a standard format for further processing.
A source system may be complex and inadequately documented, making it
difficult to determine what data needs to be extracted. The data must be
extracted repeatedly, on a regular schedule, to deliver all changed data to
the warehouse and keep it up to date.

2. Data cleaning

Cleaning (also called cleansing or scrubbing) aims mainly to improve data
quality. Data quality rules defined for the extracted data are applied first
to remove erroneous records; the cleaning operations are then adjusted
according to the actual situation where possible. Below is the list of data
that needs to be cleansed:

• Duplicate data. For example, a student is recorded many times in a
university database system.
• Inconsistent values that are logically associated, such as addresses
and ZIP codes.
• Missing data, such as a student's last name.
• Unexpected use of fields. For example, a contactNumber field could be
misused to store a student number.
• Impossible or wrong values, such as 22/30/2020.
• Inconsistent values for a single entity because different practices were
used. For example, to specify a country, you can use an international
country abbreviation (PH) or a full country name (Philippines); similar
problems arise with addresses (Marcos St or Marcos Street).
• Inconsistent values for a single entity because of typing
mistakes, such as Broklyn Shop instead of Brooklyn Shop.
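
A few of the cleaning rules listed above (duplicates, missing data, impossible values, and inconsistent country names) can be sketched in plain Python. The record layout, the field names, the DD/MM/YYYY date format, and the COUNTRY_NAMES lookup are assumptions made for this illustration; a real cleaning step would apply many more rules.

```python
from datetime import datetime

# Hypothetical lookup that maps country abbreviations to full names.
COUNTRY_NAMES = {"PH": "Philippines"}


def is_valid_date(text):
    """Return True if text is a real calendar date in DD/MM/YYYY form (assumed format)."""
    try:
        datetime.strptime(text, "%d/%m/%Y")
        return True
    except ValueError:
        return False


def clean(records):
    """Split records into (cleaned, rejected) using a few simple quality rules."""
    seen = set()
    cleaned, rejected = [], []
    for rec in records:
        # Inconsistent values: store one spelling per country.
        rec = dict(rec, country=COUNTRY_NAMES.get(rec["country"], rec["country"]))
        # Missing data or impossible values: set the record aside for review.
        if not rec["last_name"] or not is_valid_date(rec["enrolled"]):
            rejected.append(rec)
            continue
        # Duplicate data: keep only the first record per student number.
        if rec["student_no"] in seen:
            continue
        seen.add(rec["student_no"])
        cleaned.append(rec)
    return cleaned, rejected


raw = [
    {"student_no": "S1", "last_name": "Cruz", "country": "PH", "enrolled": "05/08/2020"},
    {"student_no": "S1", "last_name": "Cruz", "country": "Philippines", "enrolled": "05/08/2020"},
    {"student_no": "S2", "last_name": "", "country": "PH", "enrolled": "05/08/2020"},
    {"student_no": "S3", "last_name": "Reyes", "country": "PH", "enrolled": "22/30/2020"},
]
cleaned, rejected = clean(raw)
print(len(cleaned), [r["student_no"] for r in rejected])
```

Rejected records are kept in a separate list rather than silently dropped, so that someone can inspect and correct them later.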
3. Data transformation
Data transformation converts data from its operational source format (legacy
or host format) into the specific data warehouse format.
The following are the major transformations involved:
• Conversion and normalization
• Matching the equivalent fields in different sources
• Selection, which reduces the number of source fields and records

4. Loading
Loading sorts, summarizes, and consolidates the data, computes views, checks
integrity, and builds indices and partitions. Loading is carried out in two
ways:
• Refresh. Data warehouse data is completely rewritten, which means the
older data is replaced. Refresh is generally used in combination with static
extraction to initially populate a data warehouse.
• Update. Only the changes applied to the source data are added to the data
warehouse. The update is carried out without deleting or modifying
preexisting data. This technique is used in combination with incremental
extraction.
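
The two loading modes can be sketched with a plain Python list standing in for a warehouse table. The function names refresh and update and the sample rows are hypothetical; the point is only the difference in how existing rows are treated.

```python
def refresh(warehouse, full_extract):
    """Refresh: discard the old contents and rewrite them from a full (static) extract."""
    warehouse.clear()
    warehouse.extend(full_extract)


def update(warehouse, changes):
    """Update: append rows from an incremental extract; existing rows stay untouched."""
    warehouse.extend(changes)


warehouse = []

# Initial population with a full extract.
refresh(warehouse, [("2024-01", 100), ("2024-02", 120)])

# Later, only the rows that changed at the source are added.
update(warehouse, [("2024-03", 90)])
after_update = list(warehouse)

# A second refresh replaces everything loaded so far.
refresh(warehouse, [("2024-03", 90)])
after_refresh = list(warehouse)

print(after_update)
print(after_refresh)
```

After the update the earlier months are still present, while the second refresh leaves only the rows of the latest full extract.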
LESSON 5:
METADATA REPOSITORY

OBJECTIVES:

At the end of this lesson, the student will be able to:

• Describe the usage of metadata in data warehouse
• Identify each category of metadata
• Discuss the importance of metadata in data warehouse

Duration: 1 hour

Metadata
Metadata refers to "data about data". In a data warehouse, metadata defines and
describes all the information about the warehouse's subjects. Metadata runs through
the entire life cycle of the data warehouse and is used to drive its development,
automation, and visualization.

Types of Metadata
Metadata in a data warehouse fall into three major categories (Ponniah, 2010):
• Operational metadata
• Extraction and transformation metadata
• End-user metadata

Operational Metadata. It contains information about the operational data sources,
including:
a. Structures of data that come from different operational systems
b. Field lengths and data types of data elements selected for the data
warehouse
c. Tasks involved in selecting data from the source systems for the data
warehouse (i.e., splitting records, combining parts of records from different
source files, and dealing with multiple coding schemes and field lengths)
d. Information that ties the output data back to the source data sets

Extraction and Transformation Metadata. These metadata contain data about
the extraction of data from the source systems, namely:
a. extraction frequencies,
b. extraction methods, and
c. business rules for data extraction.

End-User Metadata. It is the navigational map of the data warehouse. It enables
end-users to get information from the data warehouse, and it allows them to use
their own business terminology and to look for information in the ways in which
they usually think about the business.

Why is metadata especially crucial in a data warehouse (Ponniah, 2010)?

1. Metadata acts as the glue that connects all parts of the data warehouse.
2. It provides information about the contents and structures to the
developers.
3. It opens the door to the end-users and makes the contents recognizable in
their terms.

Metadata repository
The metadata itself is stored in the metadata repository. A metadata repository is
like a dictionary, which lists words together with their synonyms or definitions.
Metadata repository management software can be used to map source data to target
databases, integrate and transform data, generate code for data transformation, and
move data to the warehouse (Han, Kamber, & Pei, 2012).

The metadata repository includes the following:

1. Data warehouse structure description
• schema, view, dimensions, hierarchies, and data definitions
• data mart locations and contents
2. Operational metadata
• data lineage (history of migrated data and its transformation path)
• the currency of data (i.e., active, archived, or purged)
• monitoring information (data warehouse usage statistics, error reports,
and audit trails)
3. Algorithms used for summarization
• measure and dimension definition algorithms
• data on granularity
• predefined queries and reports
4. Mapping from the operational environment to the data warehouse
• source databases and their contents
• gateway descriptions, partitions of data, data extraction, cleaning,
transformation rules and defaults
• data refresh and purging rules
• security (user authorization and access control)
5. Data related to system performance
• indices and profiles that improve the access and retrieval performance
of data
• rules for the timing and scheduling of refresh, update, and replication cycles
6. Business metadata
• business terms and definitions
• data ownership information
• charging policies
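
A tiny in-memory stand-in can show how repository entries of the kinds listed above might be looked up. This is only a sketch: the dictionary layout, the element names, the field names, and the helper functions lineage and find_by_business_term are hypothetical and do not describe any real metadata repository product.

```python
# Hypothetical repository: one entry per warehouse data element.
metadata_repository = {
    "fact_sales.amount": {
        "business_term": "Sales amount",                 # business metadata
        "owner": "Sales department",                     # data ownership information
        "source": "orders_db.order_lines.line_total",    # source database and contents
        "transformation": "rounded to 2 decimals",       # transformation rule
        "refresh": "daily",                              # refresh schedule
        "currency": "active",                            # active, archived, or purged
    },
    "dim_customer.country": {
        "business_term": "Customer country",
        "owner": "Marketing department",
        "source": "crm_db.customers.country_code",
        "transformation": "abbreviation expanded to full country name",
        "refresh": "weekly",
        "currency": "active",
    },
}


def lineage(repo, element):
    """Describe where a warehouse element came from and how it was transformed."""
    entry = repo[element]
    return f'{entry["source"]} -> {entry["transformation"]} -> {element}'


def find_by_business_term(repo, term):
    """Let an end-user locate warehouse elements using business terminology."""
    wanted = term.lower()
    return [name for name, entry in repo.items() if entry["business_term"].lower() == wanted]


print(lineage(metadata_repository, "fact_sales.amount"))
print(find_by_business_term(metadata_repository, "customer country"))
```

The lineage helper answers a developer's question (where did this value come from?), while find_by_business_term plays the role of the end-user's navigational map.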
