Data Mining Warehousing I & II

This document provides an introduction to data warehousing and discusses key concepts. It defines a data warehouse as a collection of corporate data derived from operational systems and external sources, designed to support business analysis and decision-making. The document outlines the three main types of data warehouses - enterprise data warehouse, operational data store, and data mart. It also discusses characteristics of data warehouses such as being subject-oriented, integrated, and time-variant. Finally, it covers operational database systems and differentiates between OLTP and OLAP.

Uploaded by

tanvi kamani

Module I Data Warehouse fundamentals

➢ The Data Warehouse – Introduction

Databases are real-time repositories of information, which are usually tied to specific applications.
Data warehouses pull information from various sources (including databases), with a focus on the
storage, filtering, retrieval and, specifically, analysis of huge volumes of structured data.
Data Warehousing is the process of compiling and organizing data into one common database,
whereas data mining refers to the process of extracting meaningful data from that database. The two
concepts are interrelated; data mining begins only after data warehousing has taken place.
A data warehouse may be defined as a collection of corporate information and data derived from
operational systems and external data sources. It is designed to support business decisions by
allowing data consolidation, analysis, and reporting at different aggregate levels. Data is
populated into the data warehouse (DW) by extraction, transformation, and loading (ETL).
Data Warehousing incorporates data stores and conceptual, logical, and physical models to support
business goals and end-user information needs. Creating a DW requires mapping data between
sources and targets, then capturing the details of the transformation in a metadata repository. The
data warehouse provides a single, comprehensive source of current and historical information.
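
The source-to-target mapping described above can be sketched as follows. This is only a minimal illustration: the column names, the two transforms, and the dictionary standing in for a metadata repository are all assumptions made for the example, not part of any particular DW tool.

```python
# Hypothetical source-to-target mappings; every table and column name here
# is an illustrative assumption.
MAPPINGS = [
    {"source": "crm.cust_nm", "target": "dim_customer.customer_name", "transform": "strip"},
    {"source": "crm.gender_cd", "target": "dim_customer.gender", "transform": "upper"},
]

# Named transformations that a mapping may refer to.
TRANSFORMS = {
    "strip": str.strip,
    "upper": str.upper,
}


def apply_mappings(source_row, mappings, metadata_repo):
    """Build a target row from a source row and record how each column was derived."""
    target_row = {}
    for m in mappings:
        value = source_row[m["source"]]
        target_row[m["target"]] = TRANSFORMS[m["transform"]](value)
        # The "metadata repository" is just a dict keyed by target column.
        metadata_repo[m["target"]] = {"source": m["source"], "transform": m["transform"]}
    return target_row


metadata_repo = {}
row = apply_mappings(
    {"crm.cust_nm": "  Ada Lovelace ", "crm.gender_cd": "f"},
    MAPPINGS,
    metadata_repo,
)
```

After the call, `metadata_repo` can answer the question "where did this warehouse column come from, and what was done to it?" without reading the loading code.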
Data warehousing techniques and tools include DW appliances, platforms, architectures, data stores,
and spreadmarts; database architectures, structures, scalability, security, and services; and DW as a
service.
The three main types of data warehouses are:
• Enterprise Data Warehouse
• Operational Data Store
• Data Mart
Enterprise Data Warehouse: Enterprise Data Warehouse is a centralized warehouse, which
provides decision support service across the enterprise. It offers a unified approach to organizing
and representing data. It also provides the ability to classify data according to the subject and give
access according to those divisions.
Operational Data Store: An Operational Data Store (ODS) is a data store used when neither the data
warehouse nor the OLTP systems support an organization's reporting needs. An ODS is refreshed in
real time. Hence, it is widely preferred for routine activities such as storing employee records.
Data Mart: A data mart is a subset of the data warehouse. It is specially designed for specific
segments such as sales or finance. In an independent data mart, data can be collected directly from
sources.
Data warehouses and data marts are mostly built on dimensional data modeling, where fact tables
relate to dimension tables. This is useful for users to access data since a database can be visualized
as a cube of several dimensions. A data warehouse allows a user to slice the cube along each of its
dimensions.
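
As a rough sketch of how a fact table relates to dimension tables, the example below builds a tiny star schema in an in-memory SQLite database. The table names, columns, and sample rows are invented for illustration; a real warehouse would use its own schema and a dedicated date dimension.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_store   (store_id   INTEGER PRIMARY KEY, city TEXT);
-- The fact table holds the measure (amount) plus a key into each dimension.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    store_id   INTEGER REFERENCES dim_store(store_id),
    month      TEXT,
    amount     REAL
);
INSERT INTO dim_product VALUES (1, 'Laptop'), (2, 'Phone');
INSERT INTO dim_store   VALUES (1, 'Pune'), (2, 'Delhi');
INSERT INTO fact_sales  VALUES
    (1, 1, '2024-01', 1000.0),
    (2, 1, '2024-01',  400.0),
    (1, 2, '2024-01',  900.0),
    (1, 1, '2024-02', 1100.0);
""")

# Hold the time dimension constant at one month, then total sales per city.
slice_by_city = conn.execute("""
    SELECT s.city, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_store AS s ON s.store_id = f.store_id
    WHERE f.month = '2024-01'
    GROUP BY s.city
    ORDER BY s.city
""").fetchall()
```

Fixing the month and grouping by city is one way of cutting the cube along the time dimension; grouping by product name instead would view the same facts along another dimension.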

➢ Characteristics
Data warehouses are repositories of high-volume information. They are centralized stores of all the
data a company may generate, formed by relational databases and designed for query and analysis.
Data warehouses allow for quick, accurate access to structured data via predefined queries.

i. Subject-oriented:

The warehouse organizes data around the essential subjects of the business (customers and products)
rather than around applications such as inventory management or order processing.

ii. Integrated:

Data from several sources is extracted and transformed into a consistent, uniform format,
regardless of the original source. For example, coding conventions are standardized: M = male,
F = female.

iii. Time-variant:

Data are organized by various time periods (e.g. weekly, monthly, annually).

iv. Non-volatile:

The warehouse's database is not updated in real time. It is periodically updated via the
uploading of data, protecting it from the influence of momentary change.

There are a number of steps and processes in building a warehouse.
First, you must identify where the relevant data is stored. This can be a challenge. When the
Commonwealth Bank opted to implement CRM in its retail banking business, it found that relevant
customer data were resident on over 80 separate systems.

Secondly, data must be extracted from those systems. It is possible that when these systems were
developed they were not expected to align with other systems. The data then needs to be transformed
into a standardized, consistent and clean format. Data in different systems may have been stored in
different forms. Also, the cleanliness of data from different parts of the business may vary.

The culture in sales may be very driven by quarterly performance targets. Getting sales
representatives to maintain their customer files may not be straightforward. Much of their
information may be in their heads. On the other hand, direct marketers may be very dedicated to
keeping their data in good shape.

After transformation, the data then needs to be uploaded into the warehouse. Archival data that have
little relevance to today’s operations may be set aside, or only uploaded if there is sufficient space.
Recent operational and transactional data from the various functions, channels and touch points will
most probably be prioritized for uploading. Refreshing the data in the warehouse is important. This
may be done on a daily or weekly basis depending upon the speed of change in the business and its
environment.
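
The extract, transform, and upload steps above can be sketched as a small pipeline. The two "source systems", their field names, and the cleaning rules are assumptions chosen to show inconsistent formats being standardized; SQLite stands in for the warehouse.

```python
import sqlite3

# Hypothetical extracts from two source systems that follow different conventions.
SALES_SYSTEM = [{"id": "17", "name": " ada lovelace ", "sex": "female"}]
MARKETING_SYSTEM = [{"cust_id": 42, "full_name": "Alan Turing", "gender": "M"}]

# Standardized coding convention: M = male, F = female.
GENDER_CODES = {"m": "M", "male": "M", "f": "F", "female": "F"}


def transform(customer_id, name, gender):
    """Return one cleaned row in the warehouse's standard format."""
    return (int(customer_id), name.strip().title(), GENDER_CODES[gender.strip().lower()])


def extract_and_transform():
    """Read every source system and map its fields onto the common format."""
    rows = [transform(r["id"], r["name"], r["sex"]) for r in SALES_SYSTEM]
    rows += [transform(r["cust_id"], r["full_name"], r["gender"]) for r in MARKETING_SYSTEM]
    return rows


def load(conn, rows):
    """Upload cleaned rows; running it again acts as a periodic refresh."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customer "
        "(customer_id INTEGER PRIMARY KEY, name TEXT, gender TEXT)"
    )
    conn.executemany("INSERT OR REPLACE INTO customer VALUES (?, ?, ?)", rows)
    conn.commit()


warehouse = sqlite3.connect(":memory:")
load(warehouse, extract_and_transform())
load(warehouse, extract_and_transform())  # a second refresh does not duplicate rows
loaded = warehouse.execute("SELECT * FROM customer ORDER BY customer_id").fetchall()
```

In practice the refresh would be scheduled (daily or weekly, as noted above) and would read only the records that changed since the last load; this sketch reloads everything for simplicity.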

➢ Its competitive advantages

1. The Enablement of Better Decision-Making
As companies are now able to get closer to their consumers than ever before, the corporate
decision-makers no longer have to hedge their bets or make important business decisions based on
partial or limited data. They're now backed up by facts and statistics housed within data warehouses
that can be recalled ad hoc.

2. Quick and Easy Data Access

If there's one thing the application economy has taught us, it's that speed is everything. Users can
access an array of information, stored across multiple sources, almost instantly. It means you won't be
wasting time attempting to manually pull information from various sources, or seeking help from
your IT department.

3. Consistent Quality Data

Data warehouses gather information from countless sources, but they convert it into a unified format
to be used throughout your organization. What does this mean? Well, you can have confidence that
each of your departments will be producing results which are in line and consistent with each other,
which in turn ensures company-wide accuracy.

A data warehouse maintains a copy of information from the source transaction systems.
This architectural complexity provides the opportunity to:

a. Maintain data history, even if the source transaction systems do not.

b. Integrate data from multiple source systems, enabling a central view across the enterprise. This
benefit is always valuable, but particularly so when the organization has grown by merger.

c. Improve data, by providing consistent codes and descriptions, flagging or even fixing bad data.

d. Present the organization’s information consistently.

e. Provide a single common data model for all data of interest regardless of the data’s source.

f. Restructure the data so that it makes sense to the business users.

g. Restructure the data so that it delivers excellent query performance, even for complex analytic
queries, without impacting the operational systems.

h. Add value to operational business applications, notably customer relationship management (CRM)
systems.

➢ Operational Database Systems

Operational database management systems (also referred to as OLTP, or On-Line Transaction
Processing, databases) are used to update data in real time. These types of databases allow users
to do more than simply view archived data: operational databases allow you to modify that data
(add, change or delete data) in real time. OLTP databases provide transactions as their main
abstraction, and transactions guarantee the so-called ACID properties (atomicity, consistency,
isolation, durability). Basically, the consistency of the data is guaranteed in the case of
failures and/or concurrent access to the data.
In data warehousing, the term is even more specific: the operational database is the one which is
accessed by an operational system (for example a customer-facing website or the application used by
the customer service department) to carry out regular operations of an organization. Operational
databases usually use an online transaction processing database which is optimized for faster
transaction processing (create, read, update and delete operations). An operational database is the
source for a data warehouse.
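
The transactional guarantee mentioned above can be illustrated with a small sketch using Python's built-in sqlite3 module. The account table, the balance constraint, and the transfer function are assumptions made for the example; the point is only that a failed transaction leaves no partial update behind.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money between two accounts as one transaction; return True on success."""
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE id = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE id = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected an overdraft; the credit above is undone too.
        return False


ok = transfer(conn, 1, 2, 30)
failed = transfer(conn, 1, 2, 500)  # would overdraw account 1
balances = dict(conn.execute("SELECT id, balance FROM account"))
```

The credit is deliberately applied before the debit so that the failing transfer has a partial change to undo; after the rollback both balances are exactly as the first transfer left them.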

➢ Data Warehouse (OLTP & OLAP)

We can divide IT systems into transactional (OLTP) and analytical (OLAP). In general we can assume that
OLTP systems provide source data to data warehouses, whereas OLAP systems help to analyze it.
- OLTP (On-line Transaction Processing) is characterized by a large number of
short on-line transactions (INSERT, UPDATE, DELETE). The main emphasis for OLTP systems is
on very fast query processing, maintaining data integrity in multi-access environments, and
effectiveness measured by the number of transactions per second. An OLTP database holds detailed
and current data, and the schema used to store transactional data is the entity model (usually
3NF). Examples of OLTP uses are as follows:
• An ATM center is an OLTP application.
• OLTP handles the ACID properties during data transactions via the application.
• It is also used for online banking, online airline ticket booking, sending a text message, and
adding a book to a shopping cart.

- OLAP (On-line Analytical Processing) is characterized by a relatively low volume of transactions.
Queries are often very complex and involve aggregations. For OLAP systems, response time is the
effectiveness measure. OLAP applications are widely used by data mining techniques.
An OLAP database holds aggregated, historical data, stored in multi-dimensional schemas (usually
a star schema). Any type of data warehouse system is an OLAP system. Examples of OLAP uses are
as follows:
• Spotify analyzes the songs its users play to build a personalized homepage of songs and
playlists.
• Netflix's movie recommendation system.
There are four types of OLAP servers:
• Relational OLAP (ROLAP)
• Multidimensional OLAP (MOLAP)
• Hybrid OLAP (HOLAP)
• Specialized SQL Servers
Aspect | OLTP System: Online Transaction Processing (Operational System) | OLAP System: Online Analytical Processing (Data Warehouse)
------ | ---------------------------------------------------------------- | -----------------------------------------------------------
Source of data | Operational data; OLTPs are the original source of the data. | Consolidation data; OLAP data comes from the various OLTP databases.
Purpose of data | To control and run fundamental business tasks. | To help with planning, problem solving, and decision support.
What the data reveals | A snapshot of ongoing business processes. | Multi-dimensional views of various kinds of business activities.
Inserts and updates | Short and fast inserts and updates initiated by end users. | Periodic long-running batch jobs refresh the data.
Queries | Relatively standardized and simple queries returning relatively few records. | Often complex queries involving aggregations.
Processing speed | Typically very fast. | Depends on the amount of data involved; batch data refreshes and complex queries may take many hours; query speed can be improved by creating indexes.
Space requirements | Can be relatively small if historical data is archived. | Larger due to the existence of aggregation structures and history data; requires more indexes than OLTP.
Database design | Highly normalized with many tables. | Typically de-normalized with fewer tables; use of star and/or snowflake schemas.
Backup and recovery | Backup religiously; operational data is critical to run the business, and data loss is likely to entail significant monetary loss and legal liability. | Instead of regular backups, some environments may consider simply reloading the OLTP data as a recovery method.
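
To make the contrast in the comparison above concrete, the sketch below runs both kinds of workload against one small in-memory SQLite table. The orders table and its rows are invented for illustration, and a real deployment would keep the two workloads on separate systems.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders "
    "(order_id INTEGER PRIMARY KEY, customer TEXT, month TEXT, amount REAL)"
)

# OLTP-style writes: many short transactions, each touching a single row.
new_orders = [
    (1, "ada", "2024-01", 20.0),
    (2, "alan", "2024-01", 35.0),
    (3, "ada", "2024-02", 15.0),
]
for order in new_orders:
    with conn:  # one small transaction per order
        conn.execute("INSERT INTO orders VALUES (?, ?, ?, ?)", order)

# OLTP-style read: a simple lookup by key that returns few records.
one_order = conn.execute(
    "SELECT amount FROM orders WHERE order_id = ?", (2,)
).fetchone()

# OLAP-style read: an aggregation that scans many rows.
monthly_totals = conn.execute(
    "SELECT month, SUM(amount), COUNT(*) FROM orders GROUP BY month ORDER BY month"
).fetchall()
```

The lookup touches one row through the primary key, while the aggregate has to read every order; that difference is why the two workloads are tuned, indexed, and measured differently.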

➢ Multidimensional Data Models: Types of Data, from Tables and Spreadsheets to Data Cubes
A multidimensional data model stores data in the form of a data cube. Mostly, data warehousing
supports two- or three-dimensional cubes.
A data cube allows data to be viewed in multiple dimensions. Dimensions are entities with respect to
which an organization wants to keep records. For example, in a store's sales records, dimensions
allow the store to keep track of things like monthly sales of items and the branches and locations.
A multidimensional database helps to provide data-related answers to complex business queries
quickly and accurately.
Data warehouses and Online Analytical Processing (OLAP) tools are based on a multidimensional
data model. OLAP in data warehousing enables users to view data from different angles and
dimensions.
A dimensional model is a data structure technique optimized for Data warehousing tools and is
comprised of "fact" and "dimension" tables.
A dimensional model is designed to read, summarize, and analyze numeric information like values,
balances, counts, weights, etc. in a data warehouse. In contrast, relational models are optimized for
addition, updating and deletion of data in a real-time Online Transaction System.
These dimensional and relational models have their unique way of data storage that has specific
advantages.
For instance, in the relational model, normalization and ER models reduce redundancy in data. On the
contrary, a dimensional model arranges data in such a way that it is easier to retrieve information
and generate reports.
Hence, dimensional models are used in data warehouse systems and are not a good fit for relational
systems.
An OLAP cube is a multi-dimensional array of data. Online analytical processing (OLAP) is a
computer-based technique of analyzing data to look for insights. The term cube here refers to a
multi-dimensional dataset, which is also sometimes called a hypercube if the number of dimensions is
greater than 3.
A cube can be considered a multi-dimensional generalization of a two- or
three-dimensional spreadsheet. For example, a company might wish to summarize financial data by
product, by time-period, and by city to compare actual and budget expenses. Product, time, city and
scenario (actual and budget) are the data's dimensions.
Cube is a shorthand for multidimensional dataset, given that data can have an arbitrary number
of dimensions. The term hypercube is sometimes used, especially for data with more than three
dimensions. A cube is not a "cube" in the strict mathematical sense, as all the sides are not necessarily
equal. But this term is used widely.
A slice is the part of the cube obtained by holding one dimension constant for all cells, so that
multi-dimensional information can be shown in the two-dimensional physical space of a spreadsheet
or pivot table.
Each cell of the cube holds a number that represents some measure of the business, such as sales,
profits, expenses, budget and forecast.
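
A cube and a slice can be sketched with a plain dictionary whose keys are tuples of dimension values. The dimension names follow the product, time period, city, and scenario example above; the cell values and the two helper functions are assumptions for illustration, not how an OLAP server stores data.

```python
from collections import defaultdict

# Each cell is keyed by (product, period, city, scenario) and holds one measure.
DIMENSIONS = ("product", "period", "city", "scenario")
cube = {
    ("Laptop", "2005-Q2", "Pune", "actual"): 120.0,
    ("Laptop", "2005-Q2", "Pune", "budget"): 100.0,
    ("Laptop", "2005-Q2", "Delhi", "actual"): 80.0,
    ("Phone", "2005-Q2", "Pune", "actual"): 60.0,
    ("Phone", "2005-Q3", "Pune", "actual"): 70.0,
}


def slice_cube(cells, dimension, value):
    """Hold one dimension constant and drop it from the cell keys."""
    axis = DIMENSIONS.index(dimension)
    return {
        key[:axis] + key[axis + 1:]: measure
        for key, measure in cells.items()
        if key[axis] == value
    }


def pivot(cells, row_axis, col_axis):
    """Sum the measures into a two-dimensional table of rows and columns."""
    table = defaultdict(lambda: defaultdict(float))
    for key, measure in cells.items():
        table[key[row_axis]][key[col_axis]] += measure
    return {row: dict(cols) for row, cols in table.items()}


# Keep only actual figures, then lay them out as product rows by city columns.
actuals = slice_cube(cube, "scenario", "actual")
by_product_and_city = pivot(actuals, 0, 2)
```

After slicing on scenario, the remaining keys are (product, period, city), so axes 0 and 2 give a product-by-city table with the periods summed together, much like a pivot table in a spreadsheet.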
OLAP data is typically stored in a star schema or snowflake schema in a relational data warehouse or
in a special-purpose data management system. Measures are derived from the records in the fact
table and dimensions are derived from the dimension tables.
The elements of a dimension can be organized as a hierarchy, a set of parent-child relationships,
typically where a parent member summarizes its children. Parent elements can further be aggregated
as the children of another parent.
For example, May 2005's parent is Second Quarter 2005 which is in turn the child of Year 2005.
Similarly cities are the children of regions; products roll into product groups and individual expense
items into types of expenditure.
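
The parent-child roll-up described above can be sketched as follows. The member names mirror the month, quarter, and year example; the sales figures and the `roll_up` helper are hypothetical.

```python
# Parent of each member in a small time hierarchy (month -> quarter -> year).
PARENT = {
    "Apr 2005": "Q2 2005",
    "May 2005": "Q2 2005",
    "Jun 2005": "Q2 2005",
    "Jul 2005": "Q3 2005",
    "Q2 2005": "2005",
    "Q3 2005": "2005",
}


def roll_up(leaf_values, parent):
    """Return totals for every member: each parent summarizes its children."""
    totals = dict(leaf_values)
    for member, value in leaf_values.items():
        ancestor = parent.get(member)
        # Walk up the hierarchy, adding this leaf's value to every ancestor.
        while ancestor is not None:
            totals[ancestor] = totals.get(ancestor, 0) + value
            ancestor = parent.get(ancestor)
    return totals


monthly_sales = {"Apr 2005": 10, "May 2005": 20, "Jun 2005": 30, "Jul 2005": 5}
totals = roll_up(monthly_sales, PARENT)
```

The same function would work for cities rolling into regions or products into product groups, since it depends only on the parent lookup table.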
