Data Mining Warehousing I & II

This document provides an introduction to data warehousing and discusses key concepts. It defines a data warehouse as a collection of corporate data derived from operational systems and external sources, designed to support business analysis and decision-making. The document outlines the three main types of data warehouses - enterprise data warehouse, operational data store, and data mart. It also discusses characteristics of data warehouses such as being subject-oriented, integrated, and time-variant. Finally, it covers operational database systems and differentiates between OLTP and OLAP.

Uploaded by

tanvi kamani
Copyright: © All Rights Reserved

Module I Data Warehouse fundamentals

 The Data Warehouse – Introduction


Databases are real-time repositories of information, which are usually tied to specific applications.
Data warehouses pull information from various sources (including databases), with a focus on the
storage, filtering, retrieval and, specifically, analysis of huge volumes of structured data.
Data warehousing is the process of compiling and organizing data into one common database,
whereas data mining refers to the process of extracting meaningful data from that database. The two
concepts are interrelated; data mining begins only after data warehousing has taken place.
A data warehouse may be defined as a collection of corporate information and data derived from
operational systems and external data sources. A data warehouse is designed to support
business decisions by allowing data consolidation, analysis, and reporting at different
aggregate levels. Data is populated into the DW by extraction, transformation, and loading (ETL).
Data Warehousing incorporates data stores and conceptual, logical, and physical models to support
business goals and end-user information needs. Creating a DW requires mapping data between
sources and targets, then capturing the details of the transformation in a metadata repository. The
data warehouse provides a single, comprehensive source of current and historical information.
Data warehousing techniques and tools include DW appliances, platforms, architectures, data stores,
and spreadmarts; database architectures, structures, scalability, security, and services; and DW as a
service.
The three main types of Data Warehouses are:
 Enterprise Data Warehouse
 Operational Data Store
 Data Mart
Enterprise Data Warehouse: Enterprise Data Warehouse is a centralized warehouse, which
provides decision support service across the enterprise. It offers a unified approach to organizing
and representing data. It also provides the ability to classify data according to the subject and give
access according to those divisions.
Operational Data Store: An Operational Data Store, also called an ODS, is a data store used when
neither the data warehouse nor the OLTP systems support an organization's reporting needs. In an ODS,
the data is refreshed in real time. Hence, it is widely preferred for routine activities like storing
records of the employees.
Data Mart: A data mart is a subset of the data warehouse. It is specially designed for a specific
business segment, such as sales or finance. In an independent data mart, data can be collected
directly from the sources.
Data warehouses and data marts are mostly built on dimensional data modeling, where fact tables
relate to dimension tables. This makes data easier for users to access, since the database can be visualized
as a cube of several dimensions. A data warehouse allows a user to slice the cube along each of its
dimensions.
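As a rough illustration of how a fact table relates to dimension tables, the following sketch builds a tiny star schema in an in-memory SQLite database and totals the same facts along two different dimensions. The table names, column names and figures are hypothetical and are not taken from any particular warehouse product.

```python
import sqlite3

# Hypothetical star schema: one fact table (sales) and two dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_city    (city_id    INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    city_id    INTEGER REFERENCES dim_city(city_id),
    amount     REAL
);
INSERT INTO dim_product VALUES (1, 'Laptop'), (2, 'Phone');
INSERT INTO dim_city    VALUES (1, 'Mumbai'), (2, 'Pune');
INSERT INTO fact_sales  VALUES (1, 1, 900.0), (1, 2, 850.0),
                               (2, 1, 400.0), (2, 1, 450.0);
""")

def sales_by(dimension):
    """Total the sales measure grouped by one dimension ('product' or 'city')."""
    if dimension not in ("product", "city"):
        raise ValueError("unknown dimension")
    query = f"""
        SELECT d.name, SUM(f.amount)
        FROM fact_sales AS f
        JOIN dim_{dimension} AS d ON d.{dimension}_id = f.{dimension}_id
        GROUP BY d.name
        ORDER BY d.name
    """
    return conn.execute(query).fetchall()

by_product = sales_by("product")  # the facts viewed along the product dimension
by_city = sales_by("city")        # the same facts viewed along the city dimension
```

The fact table holds only keys and the numeric measure, while the descriptive attributes live in the dimension tables. This is what lets the same rows be summarized along whichever dimension the user picks.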

 Characteristics
Data warehouses are repositories of high-volume information. They are centralized stores of all the
data a company may generate, formed by relational databases and designed for query and analysis.
Data warehouses allow for quick, accurate access to structured data via predefined queries.

i. Subject-oriented:

The warehouse organizes data around the essential subjects of the business (customers and products)
rather than around applications such as inventory management or order processing.

ii. Integrated:

Data from several sources is extracted and transformed into a consistent, uniform format,
regardless of the original source. For example, coding conventions are standardized: M = male, F =
female.

iii. Time-variant:

Data are organized by various time periods (e.g. weekly, monthly, annually).

iv. Non-volatile:

The warehouse’s database is not updated in real time. It is periodically updated via the
uploading of data, protecting it from the influence of momentary change.

There are a number of steps and processes in building a warehouse.
First, you must identify where the relevant data is stored. This can be a challenge. When the
Commonwealth Bank opted to implement CRM in its retail banking business, it found that relevant
customer data were resident on over 80 separate systems.

Secondly, data must be extracted from those systems. It is possible that when these systems were
developed they were not expected to align with other systems. The data then needs to be transformed
into a standardized, consistent and clean format. Data in different systems may have been stored in
different forms. Also, the cleanliness of data from different parts of the business may vary.

The culture in sales may be very driven by quarterly performance targets. Getting sales
representatives to maintain their customer files may not be straightforward. Much of their
information may be in their heads. On the other hand, direct marketers may be very dedicated to
keeping their data in good shape.

After transformation, the data then needs to be uploaded into the warehouse. Archival data that have
little relevance to today’s operations may be set aside, or only uploaded if there is sufficient space.
Recent operational and transactional data from the various functions, channels and touch points will
most probably be prioritized for uploading. Refreshing the data in the warehouse is important. This
may be done on a daily or weekly basis depending upon the speed of change in the business and its
environment.
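The extract, transform and upload steps described above can be sketched in a few lines. This is only a minimal illustration under stated assumptions: the two source systems, their field names, the gender codes and the rule of matching customers by cleaned name are all hypothetical.

```python
import sqlite3

# Hypothetical extracts from two source systems that store the same kind of
# customer record with different field names and gender codes.
crm_rows = [{"name": " Asha Rao ", "gender": "F"},
            {"name": "Ravi Shah", "gender": "M"}]
billing_rows = [{"customer": "MEERA IYER", "sex": "female"},
                {"customer": "ravi shah", "sex": "male"}]

# Standardized coding convention used in the warehouse: M = male, F = female.
GENDER_CODES = {"m": "M", "male": "M", "f": "F", "female": "F"}

def transform(name, gender):
    """Clean one record into the standardized warehouse format."""
    clean_name = " ".join(name.split()).title()
    return clean_name, GENDER_CODES[gender.strip().lower()]

def extract_and_transform():
    """Read every source and yield records in the common format."""
    for row in crm_rows:
        yield transform(row["name"], row["gender"])
    for row in billing_rows:
        yield transform(row["customer"], row["sex"])

def load(records):
    """Upload the transformed records into an in-memory warehouse table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customer (name TEXT PRIMARY KEY, gender TEXT)")
    # Assumption: a customer seen in several systems is matched by cleaned name,
    # and INSERT OR IGNORE keeps only the first copy.
    conn.executemany("INSERT OR IGNORE INTO customer VALUES (?, ?)", records)
    conn.commit()
    return conn

warehouse = load(extract_and_transform())
customers = warehouse.execute(
    "SELECT name, gender FROM customer ORDER BY name").fetchall()
```

Matching customers by name alone is far too weak for real data. It stands in here for whatever matching rule the business chooses.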

 Its competitive advantages


1. The Enablement of Better Decision-Making
As companies are now able to get closer to their consumers than ever before, the corporate
decision-makers no longer have to hedge their bets or make important business decisions based on
partial or limited data. They're now backed up by facts and statistics housed within data warehouses
that can be recalled ad hoc.

2. Quick and Easy Data Access


If there's one thing the application economy has taught us, it's that speed is everything. Users can
access an array of information, stored across multiple sources, almost instantly. It means you won't be
wasting time attempting to manually pull information from various sources, or seeking help from
your IT department.

3. Consistent Quality Data


Data warehouses gather information from countless sources, but they convert it into a unified format
to be used throughout your organization. What does this mean? Well, you can have confidence that
each of your departments will be producing results which are in line and consistent with each other,
which in turn ensures company-wide accuracy.

A data warehouse maintains a copy of information from the source transaction systems.
This architectural complexity provides the opportunity to:

a. Maintain data history, even if the source transaction systems do not.

b. Integrate data from multiple source systems, enabling a central view across the enterprise. This
benefit is always valuable, but particularly so when the organization has grown by merger.

c. Improve data, by providing consistent codes and descriptions, flagging or even fixing bad data.

d. Present the organization’s information consistently.

e. Provide a single common data model for all data of interest regardless of the data’s source.

f. Restructure the data so that it makes sense to the business users.

g. Restructure the data so that it delivers excellent query performance, even for complex analytic
queries, without impacting the operational systems.

h. Add value to operational business applications, notably customer relationship management (CRM)
systems.

 Operational Database Systems


Operational database management systems (also referred to as OLTP, or Online Transaction Processing,
databases) are used to update data in real time. These types of databases allow users to do more than
simply view archived data. Operational databases allow you to modify that data (add, change or delete
data) in real time. OLTP databases provide transactions as their main abstraction; transactions
guarantee the so-called ACID properties (atomicity, consistency, isolation, durability). Basically,
the consistency of the data is guaranteed in the case of failures and/or concurrent access to the data.
In data warehousing, the term is even more specific: the operational database is the one which is
accessed by an operational system (for example a customer-facing website or the application used by
the customer service department) to carry out regular operations of an organization. Operational
databases usually use an online transaction processing database which is optimized for faster
transaction processing (create, read, update and delete operations). An operational database is the
source for a data warehouse.
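To make the transaction guarantee concrete, here is a small sketch that uses SQLite as a stand-in for an operational database. The account table and the transfer function are hypothetical. The point is only that the two updates of a transfer are applied together or not at all.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account (id INTEGER PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))")
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()

def transfer(src, dst, amount):
    """Move money between accounts; both updates succeed or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE id = ?",
                (amount, dst))
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE id = ?",
                (amount, src))
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected a negative balance; the credit made
        # earlier in the same transaction has been rolled back as well.
        return False

ok = transfer(1, 2, 30)       # enough funds: both rows change
failed = transfer(1, 2, 500)  # not enough funds: neither row changes
balances = conn.execute("SELECT id, balance FROM account ORDER BY id").fetchall()
```

After the failed transfer the balances are exactly what the successful transfer left behind, which is the atomicity and consistency behaviour the paragraph describes.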

 Data Warehouse (OLTP & OLAP)


We can divide IT systems into transactional (OLTP) and analytical (OLAP). In general, we can assume that
OLTP systems provide source data to data warehouses, whereas OLAP systems help to analyze it.
- OLTP (On-line Transaction Processing) is characterized by a large number of
short on-line transactions (INSERT, UPDATE, DELETE). The main emphasis for OLTP systems is
on very fast query processing, maintaining data integrity in multi-access environments, and
effectiveness measured by the number of transactions per second. An OLTP database holds detailed and
current data, and the schema used to store transactional data is the entity model (usually
3NF). Examples of OLTP uses are as follows:
 An ATM center is an OLTP application.
 An OLTP system enforces the ACID properties for each data transaction issued by the application.
 It is also used for online banking, online airline ticket booking, sending a text message, and adding a book
to a shopping cart.

- OLAP (On-line Analytical Processing) is characterized by a relatively low volume of transactions.
Queries are often very complex and involve aggregations. For OLAP systems, response time is the
effectiveness measure. OLAP applications are widely used by data mining techniques.
An OLAP database holds aggregated, historical data, stored in multi-dimensional schemas (usually a
star schema). Any type of data warehouse system is an OLAP system. Examples of OLAP uses are
as follows:
 Spotify analyzes the songs its users play to build a personalized home page of songs and
playlists.
 Netflix's movie recommendation system.
There are four types of OLAP servers:
 Relational OLAP (ROLAP)
 Multidimensional OLAP (MOLAP)
 Hybrid OLAP (HOLAP)
 Specialized SQL Servers
OLTP System: Online Transaction Processing (Operational System)
OLAP System: Online Analytical Processing (Data Warehouse)

Source of data
OLTP: Operational data; OLTPs are the original source of the data.
OLAP: Consolidated data; OLAP data comes from the various OLTP databases.

Purpose of data
OLTP: To control and run fundamental business tasks.
OLAP: To help with planning, problem solving, and decision support.

What the data reveals
OLTP: A snapshot of ongoing business processes.
OLAP: Multi-dimensional views of various kinds of business activities.

Inserts and updates
OLTP: Short and fast inserts and updates initiated by end users.
OLAP: Periodic long-running batch jobs refresh the data.

Queries
OLTP: Relatively standardized and simple queries returning relatively few records.
OLAP: Often complex queries involving aggregations.

Processing speed
OLTP: Typically very fast.
OLAP: Depends on the amount of data involved; batch data refreshes and complex queries may take many
hours; query speed can be improved by creating indexes.

Space requirements
OLTP: Can be relatively small if historical data is archived.
OLAP: Larger due to the existence of aggregation structures and history data; requires more indexes
than OLTP.

Database design
OLTP: Highly normalized with many tables.
OLAP: Typically de-normalized with fewer tables; use of star and/or snowflake schemas.

Backup and recovery
OLTP: Backup religiously; operational data is critical to run the business, and data loss is likely
to entail significant monetary loss and legal liability.
OLAP: Instead of regular backups, some environments may consider simply reloading the OLTP data as a
recovery method.
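The difference in workload can be illustrated with a short sketch. For brevity it runs both styles of query against one hypothetical orders table in SQLite. In practice the analytical query would normally run against a separate warehouse that is loaded from the operational system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT, "
    "order_month TEXT, amount REAL)")

# OLTP-style work: many short writes, each touching a single row.
new_orders = [
    ("asha", "2005-04", 120.0),
    ("ravi", "2005-04", 80.0),
    ("asha", "2005-05", 200.0),
    ("meera", "2005-05", 50.0),
]
for customer, month, amount in new_orders:
    with conn:  # one small transaction per order
        conn.execute(
            "INSERT INTO orders (customer, order_month, amount) VALUES (?, ?, ?)",
            (customer, month, amount))

# OLTP-style read: a simple lookup that returns one record by its key.
one_order = conn.execute(
    "SELECT customer, amount FROM orders WHERE order_id = ?", (3,)).fetchone()

# OLAP-style read: scan many rows and aggregate them.
monthly_totals = conn.execute(
    "SELECT order_month, COUNT(*), SUM(amount) FROM orders "
    "GROUP BY order_month ORDER BY order_month").fetchall()
```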

 Multidimensional Data Models: Types of Data, from Tables and Spreadsheets to Data Cubes

A multidimensional data model stores data in the form of a data cube. Most data warehouses support
two- or three-dimensional cubes.
A data cube allows data to be viewed in multiple dimensions. Dimensions are entities with respect to
which an organization wants to keep records. For example, in a store's sales records, dimensions allow the
store to keep track of things like monthly sales of items and the branches and locations.
A multidimensional database helps to provide data-related answers to complex business queries
quickly and accurately.
Data warehouses and Online Analytical Processing (OLAP) tools are based on a multidimensional
data model. OLAP in data warehousing enables users to view data from different angles and
dimensions.
A dimensional model is a data structure technique optimized for data warehousing tools and is
composed of "fact" and "dimension" tables.
A dimensional model is designed to read, summarize, and analyze numeric information like values,
balances, counts, weights, etc. in a data warehouse. In contrast, relational models are optimized for
addition, updating and deletion of data in a real-time online transaction system.
The dimensional and relational models each store data in their own way, and each has specific
advantages.
For instance, in the relational model, normalization and ER models reduce redundancy in data. In
contrast, the dimensional model arranges data in such a way that it is easier to retrieve information and
generate reports.
Hence, dimensional models are used in data warehouse systems and are not a good fit for relational
(transaction-processing) systems.
An OLAP cube is a multi-dimensional array of data. Online analytical processing (OLAP) is a
computer-based technique of analyzing data to look for insights. The term cube here refers to a
multi-dimensional dataset, which is also sometimes called a hypercube if the number of dimensions is
greater than 3.
A cube can be considered a multi-dimensional generalization of a two- or
three-dimensional spreadsheet. For example, a company might wish to summarize financial data by
product, by time-period, and by city to compare actual and budget expenses. Product, time, city and
scenario (actual and budget) are the data's dimensions.
Cube is a shorthand for multidimensional dataset, given that data can have an arbitrary number
of dimensions. The term hypercube is sometimes used, especially for data with more than three
dimensions. A cube is not a "cube" in the strict mathematical sense, as all the sides are not necessarily
equal. But this term is used widely.
Slice is a term for a dimension which is held constant for all cells so that multi-dimensional
information can be shown in a two-dimensional physical space of a spreadsheet or pivot table.
Each cell of the cube holds a number that represents some measure of the business, such as sales,
profits, expenses, budget and forecast.
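As a minimal sketch of a cube and a slice, the example below stores each cell in a dictionary keyed by its three dimension values and then holds one dimension constant to obtain a two-dimensional table. The dimensions (product, month, city) and the sales figures are hypothetical.

```python
# Hypothetical cube: each cell is keyed by (product, month, city) and holds
# one measure, the sales amount.
cube = {
    ("Laptop", "Apr", "Mumbai"): 900,
    ("Laptop", "May", "Mumbai"): 700,
    ("Phone", "Apr", "Mumbai"): 400,
    ("Laptop", "Apr", "Pune"): 850,
    ("Phone", "May", "Pune"): 300,
}

def slice_by_city(cube, city):
    """Hold the city dimension constant and return a product-by-month table."""
    table = {}
    for (product, month, cell_city), amount in cube.items():
        if cell_city == city:
            table.setdefault(product, {})[month] = amount
    return table

mumbai = slice_by_city(cube, "Mumbai")
```

The result has one row per product and one column per month, which is the form that fits in a spreadsheet or pivot table.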
OLAP data is typically stored in a star schema or snowflake schema in a relational data warehouse or
in a special-purpose data management system. Measures are derived from the records in the fact
table and dimensions are derived from the dimension tables.
The elements of a dimension can be organized as a hierarchy, a set of parent-child relationships,
typically where a parent member summarizes its children. Parent elements can further be aggregated
as the children of another parent.
For example, May 2005's parent is Second Quarter 2005 which is in turn the child of Year 2005.
Similarly cities are the children of regions; products roll into product groups and individual expense
items into types of expenditure.
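The parent-child roll-up along the time dimension can be sketched as follows. The monthly figures and the 'YYYY-MM' and 'YYYY-Qn' label formats are assumptions made for illustration, and summing is assumed as the way a parent summarizes its children.

```python
from collections import defaultdict

# Hypothetical monthly expense figures (the leaf level of the time dimension).
monthly = {"2005-04": 100, "2005-05": 150, "2005-06": 50, "2005-07": 200}

def parent_quarter(month):
    """Map 'YYYY-MM' to its parent quarter, e.g. '2005-05' -> '2005-Q2'."""
    year, mm = month.split("-")
    return f"{year}-Q{(int(mm) - 1) // 3 + 1}"

def parent_year(quarter):
    """Map 'YYYY-Qn' to its parent year, e.g. '2005-Q2' -> '2005'."""
    return quarter.split("-")[0]

def roll_up(values, parent_of):
    """Aggregate child members into their parents by summing their values."""
    totals = defaultdict(int)
    for member, value in values.items():
        totals[parent_of(member)] += value
    return dict(totals)

quarterly = roll_up(monthly, parent_quarter)  # months -> quarters
yearly = roll_up(quarterly, parent_year)      # quarters -> years
```

The same roll_up function serves every level of the hierarchy. Only the child-to-parent mapping changes, so cities could be rolled into regions or products into product groups in the same way.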
