7BCEE1A - Data Mining and Data Warehousing
UNIT – 1
CHAPTER 1: INTRODUCTION
➢ Introduction
➢ Overview
➢ Typical Process Flow within a Data
warehouse
➢ Extract and Load Process
➢ Clean and Transform Data
➢ Backup and Archive Process
➢ Query Management Process
➢ Introduction
➢ Load Manager
➢ Warehouse Manager
➢ Query Manager
CHAPTER 1: INTRODUCTION
DELIVERY METHOD:
IT STRATEGY
BUSINESS CASE
• Limit the scope of the first build phase to the minimum that
delivers business benefits.
TECHNICAL BLUEPRINT:
HISTORY LOAD
AD HOC QUERY
EXTENDING SCOPE
REQUIREMENTS EVOLUTION
INTRODUCTION
Data extraction takes data from the source systems. Data load
takes the extracted data and loads it into the data warehouse.
Note – Consistency checks are executed only when all the data
sources have been loaded into the temporary data store.
• Aggregation
INTRODUCTION:
Single-tier architecture
Two-tier architecture
LOAD MANAGER:
Fast Load
Simple Transformations
• Strip out all the columns that are not required within the
warehouse.
WAREHOUSE MANAGER:
• Backup/Recovery tool
• SQL Scripts
Warehouse Manager Architecture
• Transforms and merges the source data into the published data
warehouse.
• Archives the data that has reached the end of its captured life.
QUERY MANAGER:
• Stored procedures
UNIT – 2
➢ Introduction
➢ Why you need tools to Manage a Data Warehouse
➢ System Managers
➢ Data Warehouse Process Managers
➢ Load Manager
➢ Warehouse Manager
➢ Query Manager
➢ Introduction
➢ Process
➢ Estimating the Load
➢ Introduction
➢ Assessing Performance
➢ Tuning the Data Load
➢ Tuning Queries
CHAPTER 10: SYSTEM AND DATA WAREHOUSE PROCESS
MANAGERS
INTRODUCTION
Data warehouses are not just large databases; they are large,
complex environments that integrate many different technologies.
As such, they require a lot of maintenance and management.
The traditional approach of having a large team of administrators to manage a data warehouse does not work well. The system usage is generally too ad hoc and unpredictable to be manually administered. Therefore, intelligent tools are required to help the system and database administrators do their jobs.
The tools required can be divided into the two categories:
o System management tools
o Data warehouse process management tools
SYSTEM MANAGERS:
System management is mandatory for the successful implementation
of a data warehouse. The most important system managers are −
• Data load
• Data processing
• Index creation
• Backup
• Aggregation creation
• Data transformation
Events
Events are the actions that are generated by the user or the system itself. It may be noted that an event is a measurable, observable occurrence of a defined action.
• Hardware failure
• A process dying
The most important thing about events is that they should be capable
of executing on their own. Event packages define the procedures for
the predefined events. The code associated with each event is known
as event handler. This code is executed whenever an event occurs.
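A minimal sketch of this event/handler idea is shown below; the event name, handler body, and registry functions are hypothetical, not part of any particular scheduling or monitoring product.

```python
# Minimal sketch of an event/handler registry (hypothetical event names).
from typing import Callable, Dict, List

handlers: Dict[str, List[Callable[[dict], None]]] = {}

def register(event_name: str, handler: Callable[[dict], None]) -> None:
    """Associate a handler with a predefined event."""
    handlers.setdefault(event_name, []).append(handler)

def raise_event(event_name: str, details: dict) -> None:
    """Execute every handler registered for the event."""
    for handler in handlers.get(event_name, []):
        handler(details)

# Example: notify operations staff when a load process dies.
register("process_died", lambda d: print(f"ALERT: process {d['name']} died"))
raise_event("process_died", {"name": "nightly_load"})
```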
The backup and recovery tool makes it easy for operations and
management staff to back-up the data. Note that the system backup
manager must be integrated with the schedule manager software
being used. The important features that are required for the
management of backups are as follows −
• Scheduling
• Database awareness
Backups are taken only to protect against data loss. Following are the
important points to remember −
• The backup software will keep some form of database of where and when each piece of data was backed up.
PROCESS MANAGERS:
• Load manager
• Warehouse manager
• Query manager
Fast Load
Simple Transformations
• Strip out all the columns that are not required within the
warehouse.
WAREHOUSE MANAGER:
• Backup/Recovery tool
• SQL scripts
Functions of Warehouse Manager
• Generates normalizations.
• Archives the data that has reached the end of its captured life.
QUERY MANAGER:
The query manager is responsible for directing the queries to
suitable tables. By directing the queries to appropriate tables, it
speeds up the query request and response process. In addition, the
query manager is responsible for scheduling the execution of the
queries posted by the user.
• Stored procedures
INTRODUCTION
Any data warehouse solution will grow over time, sometimes quite dramatically.
PROCESS
Usage profiles:
▪ Initial Configuration
▪ How Much CPU Bandwidth
▪ How Much Memory
▪ How Much Disk
INITIAL CONFIGURATION
▪ Daily processing
▪ Overnight processing
• Backup
DAILY PROCESSING
DATA TRANSFORMATION
BACKUP
DATA LOAD
DATABASE REQUIREMENTS
o Administration
o User requirements
DATABASE SIZING
o Aggregations
o Indexes
o data dictionary
o journal files
o rollback space
o temporary requirements
OTHER FACTORS
AGGREGATIONS
INDEXES
The load manager needs disk space allocated for the source files.
INTRODUCTION
ASSESSING PERFORMANCE
• Scan rates
• It is also possible that a user will write a query you had not tuned for.
There are various approaches to tuning the data load; they are discussed below (a short sketch of the index-rebuild approach follows this list) −
• The first and most common approach is to insert data using the SQL layer. In this approach, normal checks and constraints need to be performed. When the data is inserted into the table, code runs to check whether there is enough space to insert the data. If sufficient space is not available, more space may have to be allocated to these tables. These checks take time to perform and are costly in CPU.
• A second approach is to bypass these checks and constraints and place the data directly into preformatted blocks that are then written to the database. This is faster, but it works only with whole blocks of data.
• The third approach is that, while loading the data into a table that already contains data, we can maintain the indexes.
• The fourth approach says that, to load the data into tables that already contain data, drop the indexes and re-create them when the data load is complete. The choice between the third and the fourth approach depends on how much data is already loaded and how many indexes need to be rebuilt.
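As referenced above, here is a minimal sketch of the fourth approach (drop, bulk load, re-create). SQLite is used purely for illustration, and the table, index, and row values are hypothetical; a production warehouse would use its own bulk-load utility.

```python
# Minimal sketch of the fourth approach: drop indexes, bulk load, recreate.
# SQLite is used only for illustration; names and values are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_id INTEGER, product_id INTEGER, amount REAL)")
conn.execute("CREATE INDEX idx_sales_product ON sales (product_id)")

new_rows = [(i, i % 50, 9.99) for i in range(10_000)]   # stands in for an extract file

conn.execute("DROP INDEX idx_sales_product")            # avoid index maintenance per row
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", new_rows)
conn.execute("CREATE INDEX idx_sales_product ON sales (product_id)")  # rebuild once
conn.commit()
```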
Integrity Checks
• Fixed queries
• Ad hoc queries
Fixed Queries
Fixed queries are well defined. Following are the examples of fixed
queries −
• regular reports
• Canned queries
• Common aggregations
Ad Hoc Queries
• It is also important that any tuning performed for these queries does not affect the performance of other queries.
• If these queries are identified, then the database can be changed and new indexes can be added for those queries.
UNIT – 3
CHAPTER 1: INTRODUCTION
➢ Introduction
Introduction:
Data mining is defined as finding hidden information in a database.
Alternatively, it has been called exploratory data analysis, data
driven discovery, and deductive learning.
Traditional database queries access a database using a well-defined query stated in a language such as SQL. The output of the query consists of the data from the database that satisfies the query. The output is usually a subset of the database.
Database Access
Descriptive Model:
Example:
Regression:
Example:
A person wants to reach a certain level of savings before her retirement. Periodically, she predicts what her retirement savings will be
based on its current value and several past values. She uses a simple linear
regression formula to predict this value by fitting past behavior to a linear
function and then using this function to predict the values at points in the
future. Based on these values, she then alters her investment portfolio.
Example:
Example:
Clustering:
Example:
Summarization:
Example:
One of the many criteria used to compare universities by the U.S. News &
World Report is the average SAT or ACT score. This is a summarization
used to estimate the type and intellectual level of the student body.
Association Rules:
Example:
Example:
The Webmaster at the XYZ Corp. periodically analyzes the Web log data to
determine how users of the XYZ's Web pages access them. He is interested in
determining what sequences of pages are frequently accessed. He
determines that 70 percent of the users of page A follow one of the following
patterns of behavior: (A, B, C) or (A, D, B, C) or (A, E, B, C). He then
determines to add a link directly from page A to page C.
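A minimal sketch of this kind of path analysis is shown below; the session tuples are invented stand-ins for parsed web log data.

```python
# Minimal sketch: count how often each page path occurs in web log sessions.
# The sessions below are invented; real input would come from the server log.
from collections import Counter

sessions = [
    ("A", "B", "C"),
    ("A", "D", "B", "C"),
    ("A", "E", "B", "C"),
    ("A", "B", "C"),
    ("F", "A"),
]

path_counts = Counter(sessions)
for path, count in path_counts.most_common():
    print(path, count)

# Fraction of sessions starting at page A that end at page C.
total_from_a = sum(c for p, c in path_counts.items() if p[0] == "A")
ending_at_c = sum(c for p, c in path_counts.items() if p[0] == "A" and p[-1] == "C")
print(ending_at_c / total_from_a)
```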
Data mining is not an easy task, as the algorithms used can get very complex
and data is not always available at one place. It needs to be integrated from
various heterogeneous data sources. These factors also create some issues.
Here in this tutorial, we will discuss the major issues regarding −
• Performance Issues
Performance Issues
For most of us, data mining has become a part of our daily lives. It affects everyday things, from the products stocked at our local supermarket, to the ads we see while surfing the Internet, to crime prevention.
IR Researcher
o An IR researcher probably would concentrate on the use of data
mining techniques to access text data.
Statistician
o A statistician might look primarily at the historical techniques.
Machine learning specialist
o A machine learning specialist might be interested primarily in
data mining algorithms that learn.
Algorithm Researcher
o An algorithm researcher would be interested in studying and
comparing algorithms based on type and complexity.
Update: Many data mining algorithms work with static datasets. This
is not a realistic assumption.
Ease of Use: Although some algorithms may work well, they may not
be well received by users.
UNIT – 4
➢ Information Retrieval
➢ Dimensional Modeling
➢ OLAP
➢ Introduction
➢ Similarity Measures
➢ Decision Trees
➢ Neural Networks
➢ Genetic Algorithms
CHAPTER 2: RELATED CONCEPTS
➢ Fuzzy Set - A fuzzy set is a set F in which the set membership function f is a real-valued function with output in the range [0,1].
o An element x is said to belong to F with probability f(x).
o The element x simultaneously belongs to ⌝F with probability 1 - f(x).
Fuzzy logic has been used in database systems to retrieve data with imprecise or missing values.
Fuzzy logic uses the following operators to perform its operations.
Operator Operation
⌝ Negation
^ Intersection
v Union
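In standard fuzzy logic these operators are usually realized as 1 - f for negation, min for intersection, and max for union; a minimal sketch, with invented membership values:

```python
# Minimal sketch of the standard fuzzy operators on membership values in [0, 1].
def fuzzy_not(fx: float) -> float:
    return 1.0 - fx                 # negation

def fuzzy_and(fx: float, gx: float) -> float:
    return min(fx, gx)              # intersection

def fuzzy_or(fx: float, gx: float) -> float:
    return max(fx, gx)              # union

f_tall = 0.7    # membership of a person in the fuzzy set "tall" (invented)
f_heavy = 0.4   # membership in "heavy" (invented)
print(fuzzy_not(f_tall), fuzzy_and(f_tall, f_heavy), fuzzy_or(f_tall, f_heavy))
```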
Fuzzy Classification:
• Approve
• Reject
• Probably Approve
• Probably Reject → Fuzzy classes
• Unknown
Information Retrieval:
Recall:
IR query results
➢ Many similarity measures have been proposed for use in information
retrieval.
▪ Sim(q, Di) → query to document similarity
▪ Sim(Di, Dj) → document to document similarity
▪ Sim(qi, qj) → query to query similarity
Dimensional Modeling:
➢ Cube view:
The same multidimensional data may also be viewed as a cube. Each dimension is an axis for the cube. This cube has one fact for each unique combination of dimension values.
▪ productID -> 123, 150, 200, 300, 500 (5 unique entries)
▪ LocationID -> Dallas, Houston, Fort Worth, Chicago, Seattle, Rochester, Bradenton (7 unique entries)
▪ Date -> 022900, 020100, 031500, 021000, 012000, 030100, 021500, 022000 (8 unique entries)
Cube View
o Here, Day < Month but Day ≮ Season. The aggregation can be applied only to levels that can be found in the same path as defined by the < relationship.
▪ Join indices:
These support joins by precomputing tuples from tables that join together and pointing to the tuples in those tables.
OLAP:
1. Roll-up
2. Drill-down
3. Slice and dice
4. Pivot (rotate)
1) Roll-up:
1. Reducing dimensions
2. Climbing up concept hierarchy. Concept hierarchy is a system of
grouping things based on their order or level.
2) Drill-down
3) Slice:
Dice:
4) Pivot
In pivot, you rotate the data axes to provide an alternative presentation of the data.
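A minimal sketch of roll-up, slice, and pivot on a toy fact table is shown below, using pandas purely for illustration; the column names and figures are invented.

```python
# Minimal sketch of roll-up, slice, and pivot on a toy fact table using pandas.
# Column names and figures are invented for illustration.
import pandas as pd

facts = pd.DataFrame({
    "product": ["P1", "P1", "P2", "P2"],
    "city":    ["Dallas", "Chicago", "Dallas", "Chicago"],
    "state":   ["TX", "IL", "TX", "IL"],
    "month":   ["Jan", "Jan", "Feb", "Feb"],
    "sales":   [100, 150, 120, 80],
})

# Roll-up: climb the location hierarchy from city to state.
rollup = facts.groupby(["product", "state"])["sales"].sum()

# Slice: fix one dimension value (month == "Jan").
jan_slice = facts[facts["month"] == "Jan"]

# Pivot: rotate the axes so products become rows and cities become columns.
pivot = facts.pivot_table(index="product", columns="city", values="sales", aggfunc="sum")

print(rollup, jan_slice, pivot, sep="\n\n")
```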
OLAP Operations
1. Web Crawler
2. Database
3. Search Interfaces
o Web crawler
It is also known as a spider or bot. It is a software component that traverses the web to gather information.
o Database
All the information on the web is stored in a database. It consists of huge web resources.
o Search Interfaces
This component is an interface between user and the database. It
helps the user to search through the database.
➢ Conventional search engines suffer from the following issues:
o Abundance – Most of the data on the web is of no interest to most people.
o Limited coverage – Search engines often provide results from only a subset of the web pages.
o Limited query – Most search engines provide access based only on simple keyword-based searching.
o Limited customization – Query results are often determined only by the query itself.
CHAPTER 3: DATA MINING TECHNIQUES
Introduction:
➢ There are many different methods used to perform data mining tasks.
Parametric Model:
A parametric model describes the relationship between input and output through the use of algebraic equations in which some parameters are not specified; the data are used to determine these parameters.
Non-Parametric Model:
A non-parametric model is data driven: no explicit equations are used to determine the model. This means that the modelling process adapts to the data at hand. Non-parametric techniques include neural networks, decision trees, and genetic algorithms.
Some of the statistical concepts that are basis for data mining techniques are,
• Point Estimation
• Bayes Theorem
• Hypothesis Testing
Point Estimation:
Here P(h1 | xi) is called the posterior probability, while P(h1) is the prior probability associated with hypothesis h1. P(xi) is the probability of the occurrence of data value xi, and P(xi | h1) is the conditional probability that, given a hypothesis, the tuple satisfies it. Thus, we have

P(h1 | xi) = ( P(xi | h1) P(h1) ) / P(xi)
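A minimal numeric sketch of the formula above; the probability values are invented.

```python
# Minimal sketch of Bayes' rule: posterior = likelihood * prior / evidence.
# The probabilities are invented for illustration.
p_h1 = 0.3            # prior probability of hypothesis h1
p_x_given_h1 = 0.8    # conditional probability of observing xi under h1
p_x = 0.5             # overall probability of observing xi

p_h1_given_x = p_x_given_h1 * p_h1 / p_x
print(p_h1_given_x)   # 0.48
```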
Hypothesis Testing:
➢ The similarity between two tuples ti and tj, sim(ti, tj), in a database D is a mapping from D x D to the range [0, 1]. Thus, sim(ti, tj) ∈ [0, 1].
➢ The objective is to define the similarity mapping such that documents that are more alike have a higher similarity value. Thus, the following are desirable characteristics of a good similarity measure:
• ∀ti ∈ D, sim(ti, ti) = 1
• ∀ti, tj ∈ D, sim(ti, tj) = 0 if ti and tj are not alike at all
• ∀ti, tj, tk ∈ D, sim(ti, tj) < sim(ti, tk) if ti is more like tk than it is like tj
➢ Here are some of the more common similarity measures used in traditional
IR systems and more recently in Internet search engines:
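Cosine and Jaccard similarity are two measures commonly used for this purpose; a minimal sketch over invented term-frequency vectors:

```python
# Minimal sketch of two common similarity measures over term-frequency vectors.
# The vectors are invented for illustration.
import math

q  = [1, 0, 2, 1]   # query term frequencies
d1 = [2, 1, 1, 0]   # document term frequencies

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def jaccard(a, b):
    a_terms = {i for i, x in enumerate(a) if x > 0}
    b_terms = {i for i, x in enumerate(b) if x > 0}
    return len(a_terms & b_terms) / len(a_terms | b_terms)

print(cosine(q, d1), jaccard(q, d1))
```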
This tree has as the root the first question asked. Each subsequent level in
the tree consists of questions at that stage in the game.
➢ Definition:
o A decision tree (DT) is a tree where the root and each internal
node is labeled with a question. The arcs emanating from each
node represent each possible answer to the associated
question. Each leaf node represents a prediction of a solution
to the problem under consideration.
o An algorithm to create the tree.
o An algorithm that applies the tree to data and solves the
problem under consideration.
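A minimal sketch of the second of these algorithms, applying an already-built tree to a tuple; the tree, questions, and tuple are hypothetical.

```python
# Minimal sketch: apply an existing decision tree to a tuple.
# The tree, questions, and tuple are hypothetical.
tree = {
    "question": "income > 50K?",
    "yes": {"question": "owns home?",
            "yes": "approve", "no": "probably approve"},
    "no": "reject",
}

def classify(node, answers):
    """Walk from the root, following the arc for each answer, until a leaf."""
    while isinstance(node, dict):
        node = node["yes"] if answers[node["question"]] else node["no"]
    return node

tuple_answers = {"income > 50K?": True, "owns home?": False}
print(classify(tree, tuple_answers))   # probably approve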
DTProc Algorithm
➢ Definition:
With the linear activation function f(S) = cS, where c is a constant positive value, the output value has no limits in terms of maximum or minimum values.
With the sigmoid activation function, c is a constant positive value that changes the slope of the function. The hyperbolic tangent function has an output centered at zero, which may help with learning.
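A minimal sketch of these activation functions applied to a summed input S, with c as the slope constant:

```python
# Minimal sketch of common activation functions applied to a summed input S.
import math

def linear(S: float, c: float = 1.0) -> float:
    return c * S                               # unbounded output

def sigmoid(S: float, c: float = 1.0) -> float:
    return 1.0 / (1.0 + math.exp(-c * S))      # output in (0, 1); c changes the slope

def tanh_act(S: float) -> float:
    return math.tanh(S)                        # output in (-1, 1), centered at zero

for S in (-2.0, 0.0, 2.0):
    print(linear(S), sigmoid(S), tanh_act(S))
```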
Genetic Algorithms:
UNIT – 5
➢ Introduction
➢ Large Itemsets
➢ Basic Algorithms
➢ Comparing Approaches
➢ Incremental Rules
Introduction:
➢ Association rules are used to show the relationships between data
items.
➢ The purchasing of one product when another product is purchased
represents an association rule. Association rules are frequently used
by retail stores to assist in marketing, advertising, floor placement,
and inventory control.
Sample Data to Illustrate Association Rules
➢ Here, there are five transactions {t1, t2,t3, t4, t5} and five items {Beer,
Bread, Jelly, Milk, Peanut Butter}.
➢ Association Rule:
Given a set of items I={I1,I2,...,Im} and a database of transaction
D={t1, t2,...,tn} where ti=(Ii1, Ii2,...,Iik) and Iij ∈ I , an association rule is
an implication of the form X => Y where X,Y ⊂ I are sets of items called
itemsets and X∩Y =⌀.
➢ Support:
The support (s) for an association rule X => Y is the percentage
of transactions in the database that contain X U Y.
➢ Confidence or Strength:
The confidence or strength (α) for an association rule X => Y is the ratio of the number of transactions that contain X U Y to the number of transactions that contain X.
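A minimal sketch computing the support and confidence of one rule; the transaction contents are illustrative reconstructions built from the five items named above.

```python
# Minimal sketch: support and confidence of X => Y over a small transaction set.
# The transactions are illustrative, built from the five items named above.
transactions = [
    {"Bread", "Jelly", "PeanutButter"},
    {"Bread", "PeanutButter"},
    {"Bread", "Milk", "PeanutButter"},
    {"Beer", "Bread"},
    {"Beer", "Milk"},
]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(X, Y):
    return support(X | Y) / support(X)

X, Y = {"Bread"}, {"PeanutButter"}
print(support(X | Y))     # support of Bread => PeanutButter
print(confidence(X, Y))   # confidence of Bread => PeanutButter
```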
Support and Confidence for Some Association Rules
Large Itemsets:
➢ The most common approach to finding association rules is to break up
the problem into two parts:
Example:
Suppose that the input support and confidence are s = 30% and α = 50%, respectively. Using this value of s, we obtain the following set of large itemsets:
We now look at what association rules are generated from the last large
itemset. Here l = {Bread, PeanutButter}. There are two nonempty subsets
of l: {Bread} and {PeanutButter}. With the first one we see:
Basic Algorithms:
Apriori Algorithm:
There are no candidates of size three because there is only one large
itemset of size two.
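A minimal sketch of the Apriori level-wise search; the candidate-generation step here simply joins pairs of large (k-1)-itemsets and prunes by support, and the sample transactions are the same illustrative ones used earlier.

```python
# Minimal sketch of Apriori: level-wise generation of large itemsets.
from itertools import combinations

def apriori(transactions, min_support):
    items = sorted({i for t in transactions for i in t})
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    large = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
    all_large = list(large)
    k = 2
    while large:
        # Join step: union pairs of (k-1)-itemsets that differ in exactly one item.
        candidates = {a | b for a, b in combinations(large, 2) if len(a | b) == k}
        # Prune step: keep only candidates meeting the minimum support.
        large = [c for c in candidates if support(c) >= min_support]
        all_large.extend(large)
        k += 1
    return all_large

sample = [frozenset(t) for t in [
    {"Bread", "Jelly", "PeanutButter"}, {"Bread", "PeanutButter"},
    {"Bread", "Milk", "PeanutButter"}, {"Beer", "Bread"}, {"Beer", "Milk"},
]]
print(apriori(sample, min_support=0.3))
```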
Sampling Algorithm:
Example:
Suppose the set of items is {A, B, C, D}. The set of large itemsets found to exist in a sample of the database is PL = {A, C, D, CD}. The first scan of the entire database, then, generates the set of candidates as follows: C = BD⁻(PL) ∪ PL = {B, AC, AD} ∪ {A, C, D, CD}. Here we add AC because both A and C are in PL. Likewise we add AD. We could not have added ACD because neither AC nor AD is in PL.
Negative border
Sampling Algorithm
Partitioning Algorithm:
Partition Algorithm
Partitioning example
Here the database is partitioned into two parts, the first containing two transactions and the second containing three transactions. Using a support of 10%, the resulting large itemsets L1 and L2 are shown. If the items are uniformly distributed across the partitions, then a large fraction of the itemsets will be large. However, if the data are not uniform, there may be a large percentage of false candidates.
Data Parallelism:
This figure illustrates the approach used by the CDA algorithm using
the grocery store data. Here there are three processors. The first two
transactions are counted at P1, the next two at P2, and the last one at P3.
When the local counts are obtained, they are then broadcast to the other
sites so that global counts can be generated.
Task Parallelism:
Comparing Approaches:
Algorithms can be classified along the following dimensions:
• Target: The algorithms we have examined generate all rules that satisfy a given support and confidence level. Alternatives to these are algorithms that generate only some subset of the rules, based on constraints that are given.
• Type: Algorithms may generate regular association rules or more
advanced association rules.
• Data type: We have examined rules generated for data in categorical
databases. Rules may also be derived for other types of data such as
plain text.
• Data source: Our investigation has been limited to the use of
association rules for market basket data. This assumes that data are
present in a transaction. The absence of data may also be important.
• Technique: The most common strategy to generate association rules
is that of finding large itemsets. Other techniques may also be used.
• Itemset strategy: Itemsets may be counted in different ways. The
most naive approach is to generate all itemsets and count them. As this
is usually too space intensive, the bottom-up approach used by Apriori,
which takes advantage of the large itemset property, is the most
common approach. A top-down technique could also be used.
• Transaction strategy: To count the itemsets, the transactions in the
database must be scanned. All transactions could be counted, only a
sample may be counted, or the transactions could be divided into
partitions.
• Itemset data structure: The most common data structure used to
store the candidate itemsets and their counts is a hash tree. Hash trees
provide an effective technique to store, access, and count itemsets.
They are efficient to search, insert, and delete itemsets. A hash tree is a
multiway search tree where the branch to be taken at each level in the
tree is determined by applying a hash function as opposed to
comparing key values to branching points in the node. A leaf node in
the hash tree contains the candidates that hash to it, stored in sorted
order.
• Transaction data structure: Transactions may be viewed as in a flat
file or as a TID list, which can be viewed as an inverted file.
• Optimization: These techniques look at how to improve on the
performance of an algorithm given data distribution (skewness) or
amount of main memory.
• Architecture: Sequential, parallel, and distributed algorithms have
been proposed.
• Parallelism strategy: Both data parallelism and task parallelism have
been used.
Incremental Rules:
All algorithms discussed so far assume a static database. However, in
reality we cannot assume this. With these prior algorithms, generating
association rules for a new database state requires a complete rerun
of the algorithm.
Several approaches have been proposed to address the issue of how to
maintain the association rules as the underlying database changes.
Most of the proposed approaches have addressed the issue of how to
modify the association rules as inserts are performed on the database.
These incremental updating approaches concentrate on determining the large itemsets for D U db, where D is a database state, db is a set of updates to it, and L, the set of large itemsets for D, is already known.
One incremental approach, fast update (FUP), is based on the Apriori algorithm. Each iteration, k, scans both db and D, with candidates generated from the large itemsets of the prior iteration, k - 1. In addition, the large itemsets Lk found in D are used as part of the candidate set for scan k.
For each scan k of db, Lk plus the counts for each itemset in Lk are used as input. Once the count for each itemset in Lk is found in db, we automatically know whether it will be large in the entire database without scanning D.
The association rule algorithms discussed so far assume that the data are categorical. A quantitative association rule is one that involves both categorical and quantitative data, for example a rule whose antecedent restricts a numeric attribute such as age to a range of values.
When looking at large databases with many types of data, using one
minimum support value can be a problem. Different items behave
differently. It certainly is easier to obtain a given support threshold
with an attribute that has only two values than it is with an attribute
that has hundreds of values. It might be more meaningful to find a rule
of the form
SkimMilk => WheatBread
with a support of 3% than it is to find
Milk => Bread
with a support of 6%.
If a larger support is used, we might miss out on generating
meaningful association rules. This problem is called the rare item
problem. If the minimum support is too high, then rules involving
items that rarely occur will not be generated. If it is set too low, then
too many rules may be generated, many of which are not important.
One approach, MISapriori, allows a different support threshold to be
indicated for each item. Here MIS stands for minimum item support.
The minimum support for a rule is the minimum of all the minimum
supports for each item in the rule. An interesting problem occurs when
multiple minimum supports are used.
Example:
Suppose we have three items, {A, B, C}, with minimum supports MIS(A) = 20%, MIS(B) = 3%, and MIS(C) = 4%. Because the required support for A is so large, A by itself may not be large, while both AB and AC may be large, because the required support for AB is min(MIS(A), MIS(B)) = 3% and for AC it is min(MIS(A), MIS(C)) = 4%.
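A minimal sketch of how the minimum support required for an itemset is taken from the per-item MIS values in this example:

```python
# Minimal sketch: with multiple minimum supports, the threshold for an itemset
# is the smallest MIS of any item it contains (values from the example above).
MIS = {"A": 0.20, "B": 0.03, "C": 0.04}

def itemset_min_support(itemset):
    return min(MIS[item] for item in itemset)

print(itemset_min_support({"A"}))        # 0.20
print(itemset_min_support({"A", "B"}))   # 0.03
print(itemset_min_support({"A", "C"}))   # 0.04
```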
Correlation Rules:
Support:
Confidence or Strength:
Lift / Interest:
The lift (or interest) of a rule A => B is lift(A => B) = P(A U B) / (P(A) P(B)). This measure takes into account both P(A) and P(B). A problem with this measure is that it is symmetric; thus, there is no difference between the value of interest(A => B) and the value of interest(B => A).
Conviction:
As with lift, conviction takes into account both P(A) and P(B). From
logic we know that implication A→ B = ⌝(A^⌝B). To take into account the
negation, the conviction measure inverts this ratio. The formula for
conviction is,
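A minimal sketch computing lift and conviction from the probabilities that appear in the formulas above; the probability values are invented.

```python
# Minimal sketch: lift and conviction from basic probabilities (invented values).
p_a = 0.4          # P(A)
p_b = 0.5          # P(B)
p_a_and_b = 0.3    # P(A and B)

lift = p_a_and_b / (p_a * p_b)                    # symmetric in A and B
p_a_and_not_b = p_a - p_a_and_b                   # P(A and not B)
conviction = (p_a * (1 - p_b)) / p_a_and_not_b    # inverts the negated implication

print(lift, conviction)   # 1.5, 2.0
```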
Surprise:
Chi-Squared statistic:
If all the values were independent, then the chi-squared statistic would be 0.
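A minimal sketch of the chi-squared statistic for a 2x2 contingency table of two items; the observed counts are invented.

```python
# Minimal sketch: chi-squared statistic for a 2x2 contingency table of two items.
# Observed counts are invented; a value near 0 suggests independence.
observed = [[30, 20],    # rows: A present / A absent
            [20, 30]]    # cols: B present / B absent

total = sum(sum(row) for row in observed)
row_totals = [sum(row) for row in observed]
col_totals = [sum(observed[r][c] for r in range(2)) for c in range(2)]

chi_squared = 0.0
for r in range(2):
    for c in range(2):
        expected = row_totals[r] * col_totals[c] / total
        chi_squared += (observed[r][c] - expected) ** 2 / expected

print(chi_squared)   # 4.0 for these counts
```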