
Data Mining Unit-2

Lecture Notes

------------------------------------------------------------------------------------------------------
Association Rule Mining: Mining Frequent Patterns–Associations and correlations – Mining
Methods– Mining Various kinds of Association Rules– Correlation Analysis– Constraint
based Association mining. Graph Pattern Mining, SPM.

Topic 1: Mining Frequent Patterns

Market Basket Analysis: A Motivating Example

Frequent itemset mining leads to the discovery of associations and correlations among items
in large transactional or relational data sets. With massive amounts of data continuously
being collected and stored, many industries are becoming interested in mining such patterns
from their databases. The discovery of interesting correlation relationships among huge
amounts of business transaction records can help in many business decision-making
processes such as catalog design, cross-marketing, and customer shopping behaviour
analysis.

A typical example of frequent itemset mining is market basket analysis. This process analyses
customer buying habits by finding associations between the different items that customers
place in their “shopping baskets”.

The discovery of these associations can help retailers develop marketing strategies by gaining
insight into which items are frequently purchased together by customers. For instance, if
customers are buying milk, how likely are they to also buy bread (and what kind of bread) on
the same trip to the supermarket? This information can lead to increased sales by helping
retailers do selective marketing and plan their shelf space.

Market basket analysis. Suppose, as manager of an AllElectronics branch, you would like to
learn more about the buying habits of your customers. Specifically, you wonder, “Which
groups or sets of items are customers likely to purchase on a given trip to the store?” To
answer your question, market basket analysis may be performed on the retail data of
customer transactions at your store. You can then use the results to plan marketing or
advertising strategies, or in the design of a new catalog.

For instance, market basket analysis may help you design different store layouts. In one
strategy, items that are frequently purchased together can be placed in proximity to further
encourage the combined sale of such items. If customers who purchase computers also tend
to buy antivirus software at the same time, then placing the hardware display close to the
software display may help increase the sales of both items.

In an alternative strategy, placing hardware and software at opposite ends of the store may
entice customers who purchase such items to pick up other items along the way. For instance,
after deciding on an expensive computer, a customer may observe security systems for sale
while heading toward the software display to purchase antivirus software, and may decide to
purchase a home security system as well.

If we think of the universe as the set of items available at the store, then each item has a
Boolean variable representing the presence or absence of that item. Each basket can then be
represented by a Boolean vector of values assigned to these variables. The Boolean vectors
can be analysed for buying patterns that reflect items that are frequently associated or
purchased together. These patterns can be represented in the form of association rules.

For example, the information that customers who purchase computers also tend to buy
antivirus software at the same time is represented in the following association rule:

computer => antivirus_software [support = 2%, confidence = 60%]

A support of 2% means that 2% of all the transactions under analysis show that computer and
antivirus software are purchased together, and a confidence of 60% means that 60% of the
customers who purchased a computer also bought the software.

Typically, association rules are considered interesting if they satisfy both a minimum support
threshold and a minimum confidence threshold. These thresholds can be set by users or
domain experts.

Frequent Itemsets, Closed Itemsets, and Association Rules

Let I = {i1, i2, ..., in} be a set of items. Let D, the task-relevant data, be a set of database
transactions, where each transaction T is a nonempty itemset such that T ⊆ I.

Let A be a set of items. A transaction T is said to contain A if A ⊆ T.

An association rule is an implication of the form A => B,

where A ⊂ I, B ⊂ I, A ≠ ∅, B ≠ ∅, and A ∩ B = ∅.

Rules that satisfy both a minimum support threshold (min sup) and a minimum confidence
threshold (min conf ) are called strong. By convention, we write support and confidence
values so as to occur between 0% and 100%, rather than 0 to 1.0.

The occurrence frequency of an itemset is the number of transactions that contain the itemset.
This is also known, simply, as the frequency, support count, or count of the itemset.

If the relative support of an itemset I satisfies a pre-specified minimum support threshold
(i.e., the absolute support of I satisfies the corresponding minimum support count threshold),
then I is a frequent itemset.
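
To make these definitions concrete, the following is a minimal Python sketch (the transaction
data are hypothetical, chosen only to echo the earlier computer/antivirus example) that computes
the relative support and the confidence of a rule A => B as support(A => B) = support_count(A ∪ B) / |D|
and confidence(A => B) = support_count(A ∪ B) / support_count(A).

```python
# Minimal sketch: support and confidence of a rule A => B
# over a small, hypothetical transaction database D.
transactions = [
    {"computer", "antivirus_software", "printer"},
    {"computer", "antivirus_software"},
    {"computer", "mouse"},
    {"printer", "paper"},
    {"computer", "antivirus_software", "mouse"},
]

A = {"computer"}
B = {"antivirus_software"}

count_A = sum(1 for t in transactions if A <= t)          # support count of A
count_AB = sum(1 for t in transactions if (A | B) <= t)   # support count of A union B

support = count_AB / len(transactions)   # relative support of A => B
confidence = count_AB / count_A          # P(B | A)

print(f"support = {support:.0%}, confidence = {confidence:.0%}")
# prints: support = 60%, confidence = 75%
```

The rule is strong with respect to a given min_sup and min_conf exactly when both printed
values meet those thresholds.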

In general, association rule mining can be viewed as a two-step process:

1. Find all frequent itemsets: By definition, each of these itemsets will occur at least as
frequently as a predetermined minimum support count, min sup.

2. Generate strong association rules from the frequent itemsets: By definition, these rules
must satisfy minimum support and minimum confidence.

Additional interestingness measures can be applied for the discovery of correlation
relationships between associated items. Because the second step is much less costly than the
first, the overall performance of mining association rules is determined by the first step.
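
As a rough illustration of step 2 (assuming step 1 has already produced the frequent itemsets
and their support counts; the itemsets below are hypothetical), strong rules can be generated by
splitting each frequent itemset l into a nonempty subset s and its complement l − s, and keeping
the rule s => (l − s) when support_count(l) / support_count(s) meets the minimum confidence.
Minimum support is automatically satisfied because l itself is frequent.

```python
from itertools import combinations

# Hypothetical output of step 1: frequent itemsets with their support counts.
freq = {
    frozenset({"milk"}): 6,
    frozenset({"bread"}): 5,
    frozenset({"milk", "bread"}): 4,
}

min_conf = 0.6

# Step 2: for every frequent itemset l and every nonempty proper subset s,
# output s => (l - s) if support_count(l) / support_count(s) >= min_conf.
for l, count_l in freq.items():
    if len(l) < 2:
        continue
    for r in range(1, len(l)):
        for subset in combinations(l, r):
            s = frozenset(subset)
            conf = count_l / freq[s]
            if conf >= min_conf:
                print(f"{set(s)} => {set(l - s)}  (confidence = {conf:.0%})")
# prints (in some order): {'milk'} => {'bread'}  (confidence = 67%)
#                         {'bread'} => {'milk'}  (confidence = 80%)
```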

If an itemset is frequent, each of its subsets is frequent as well.

For example, a frequent itemset of length 100, such as {a1, a2, ..., a100}, contains
C(100, 1) = 100 frequent 1-itemsets: {a1}, {a2}, ..., {a100};

C(100, 2) = 4950 frequent 2-itemsets: {a1, a2}, {a1, a3}, ..., {a99, a100}; and so on,
where C(n, k) denotes the number of ways to choose k items from n.

The total number of frequent itemsets that it contains is thus

C(100, 1) + C(100, 2) + ... + C(100, 100) = 2^100 − 1.

This is too huge a number of itemsets for any computer to compute or store. To overcome
this difficulty, we introduce the concepts of closed frequent itemset and maximal frequent
itemset.

Demonstrate Apriori algorithm with example.


To explain, let's use the data in this table and assume that the minimum support is 2.

We start by looking for single items that meet the support threshold.  In this case, it's simply
A, B, C, D, and E, because there are at least 2 of each of these in the table.  This is summarized
in the single item support table below.

Next, we take all of the items that meet the support requirements, everything so far in this
example, and make all of the patterns/combinations we can out of them: AB, AC, AD, AE,
BC, BD, BE, CD, CE, DE.  When we list all of these combinations in a table, and
determine the support for each, we get a table that looks like this.
Several of these patterns don't meet the support threshold of 2, so we remove them from the
list of options.

At this point, we use the surviving items to make other patterns that contain 3 items.  If you
logically work through all of the options, you'll get a list like this: ABC, ABD, ABE, BCD,
BCE, BDE (notice that I didn't list ABCD or BCDE here because they are 4 items long).

Before I create the support table for these let's look at these patterns.  The first one, ABC, was
created by combining AB and BC.  If you look in the 2 item support table (before or after
filtering), you'll find that AC doesn't have the minimum support required.  If AC isn't
supported, a more complicated pattern that includes AC (ABC) can't be supported
either.  This is a key point of the Apriori Principle.  So, without having to go back to the
original data, we can exclude some of the 3-item patterns.  When we do this, we eliminate
ABC (AC not supported), ABD (AD not supported), ABE (AE not supported), BCE (CE not
supported), and BDE (DE not supported).  This process of removing patterns that can't be
supported because their subsets (or shorter combinations) aren't supported is called pruning.
This pruning process leaves only BCD, with a support of 2.
The final list of all of the patterns that have support greater than or equal to 2 is summarized
here.
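
The whole level-wise procedure from this walkthrough can be written compactly in Python.
The following is an illustrative sketch only: the transaction list is hypothetical (the table
referred to above is not reproduced in these notes), and the function name and structure are
my own.

```python
from itertools import combinations

def apriori(transactions, min_support=2):
    """Return every itemset whose support count is >= min_support."""
    items = sorted({item for t in transactions for item in t})
    current = [frozenset([i]) for i in items]   # candidate 1-itemsets
    frequent = {}
    k = 1
    while current:
        # Scan the transactions to count the support of each candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        survivors = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(survivors)
        # Generate (k+1)-item candidates, pruning any candidate that has an
        # infrequent k-item subset (the Apriori principle).
        keys = list(survivors)
        candidates = set()
        for a, b in combinations(keys, 2):
            union = a | b
            if len(union) == k + 1 and all(
                frozenset(s) in survivors for s in combinations(union, k)
            ):
                candidates.add(union)
        current = list(candidates)
        k += 1
    return frequent

# Hypothetical transactions (not the table from the notes).
transactions = [frozenset(t) for t in (
    {"A", "B", "C", "D"}, {"B", "C", "D"}, {"B", "D", "E"},
    {"A", "B", "D"}, {"A", "C", "E"},
)]
for itemset, count in sorted(apriori(transactions).items(),
                             key=lambda kv: (len(kv[0]), -kv[1], sorted(kv[0]))):
    print(sorted(itemset), count)
```

A (k+1)-item pattern is generated only when every one of its k-item subsets survived the
previous pass, which is exactly the pruning step described above.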

Shortcomings Of Apriori Algorithm


• Using Apriori needs a generation of candidate itemsets. These itemsets may be large
in number if the number of itemsets in the database is huge.
• Apriori needs multiple scans of the database to check the support of each itemset
generated, and this leads to high costs.
These shortcomings can be overcome using the FP growth algorithm.
Frequent Pattern Growth Algorithm
This algorithm is an improvement over the Apriori method. Frequent patterns are generated
without the need for candidate generation. The FP growth algorithm represents the database in
the form of a tree called a frequent pattern tree or FP tree.
This tree structure maintains the association between the itemsets. The database is
fragmented using one frequent item, and each fragmented part is called a “pattern fragment”.
The itemsets of these fragmented patterns are then analyzed. Thus, with this method, the search
for frequent itemsets is reduced considerably.

FP Tree
Frequent Pattern Tree is a tree-like structure that is made with the initial itemsets of the
database. The purpose of the FP tree is to mine the most frequent patterns. Each node of the
FP tree represents an item of an itemset.
The root node represents null, while the lower nodes represent the itemsets. The association of
the nodes with the lower nodes, that is, of the itemsets with the other itemsets, is maintained
while forming the tree.

Frequent Pattern Algorithm Steps


The frequent pattern growth method lets us find the frequent pattern without candidate
generation.

Let us see the steps followed to mine the frequent pattern using frequent pattern growth
algorithm:

1) The first step is to scan the database to find the occurrences of the itemsets in the database.
This step is the same as the first step of Apriori. The count of 1-itemsets in the database is
called support count or frequency of 1-itemset.

2) The second step is to construct the FP tree. For this, create the root of the tree. The root is
represented by null.

3) The next step is to scan the database again and examine the transactions. Examine the first
transaction and find out the itemset in it. The item with the maximum count is taken at the top,
followed by the next item with a lower count, and so on. It means that the branch of the tree is
constructed with transaction itemsets in descending order of count.

4) The next transaction in the database is examined. Its itemsets are ordered in descending
order of count. If any itemset of this transaction is already present in another branch (for
example, in the 1st transaction), then this transaction's branch shares a common prefix with
that branch, starting from the root.
This means that the common itemset is linked to the new node of another itemset in this
transaction.

5) Also, the count of an itemset is incremented as it occurs in the transactions: a node shared
with an earlier transaction has its count increased by 1, while a newly created node starts with
a count of 1, as the nodes are created and linked according to the transactions. (A code sketch
of steps 1-5 appears after this list.)

6) The next step is to mine the created FP Tree. For this, the lowest node is examined first,
along with the links of the lowest nodes. The lowest node represents a frequent pattern of
length 1. From this, traverse the path in the FP Tree. This path or these paths are called a
conditional pattern base.

Conditional pattern base is a sub-database consisting of prefix paths in the FP tree occurring
with the lowest node (suffix).

7) Construct a Conditional FP Tree, which is formed by a count of itemsets in the path. The
itemsets meeting the threshold support are considered in the Conditional FP Tree.
8) Frequent Patterns are generated from the Conditional FP Tree.
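
Steps 1 through 5 (the two database scans and the insertion of support-ordered transactions
into the tree) can be sketched in Python roughly as follows. This is an illustrative sketch with
my own class and function names; the node links and header table that a full FP-growth
implementation uses for the mining phase (steps 6-8) are omitted.

```python
from collections import Counter

class FPNode:
    def __init__(self, item, parent):
        self.item = item        # item label; None for the root
        self.count = 1          # number of transactions sharing this prefix
        self.parent = parent
        self.children = {}      # item -> FPNode

def build_fp_tree(transactions, min_sup):
    # Step 1: the first scan counts the support of every 1-itemset.
    support = Counter(item for t in transactions for item in t)
    frequent = {i for i, c in support.items() if c >= min_sup}

    # Step 2: the root of the tree represents null.
    root = FPNode(None, None)

    # Steps 3-5: second scan; insert each transaction with its frequent
    # items sorted in descending order of support count.
    for t in transactions:
        ordered = sorted((i for i in t if i in frequent),
                         key=lambda i: (-support[i], i))
        node = root
        for item in ordered:
            if item in node.children:        # shared prefix: increment count
                node = node.children[item]
                node.count += 1
            else:                            # new branch: create a child node
                child = FPNode(item, node)
                node.children[item] = child
                node = child
    return root, support

# Example with hypothetical transactions:
tree, counts = build_fp_tree(
    [["I1", "I2", "I3"], ["I2", "I3", "I4"], ["I4", "I5"]], min_sup=2)
```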

Example Of FP-Growth Algorithm


Support threshold=50%, Confidence= 60%

Solution:
Support threshold=50% => 0.5*6= 3 => min_sup=3

1. Count of each item

2. Sort the itemset in descending order.


3. Build FP Tree

1. Considering the root node null.


2. The first scan of Transaction T1: I1, I2, I3 contains three items {I1:1}, {I2:1}, {I3:1},
where I2 is linked as a child to root, I1 is linked to I2 and I3 is linked to I1.
3. T2: I2, I3, I4 contains I2, I3, and I4, where I2 is linked to root, I3 is linked to I2 and
I4 is linked to I3. But this branch would share I2 node as common as it is already used
in T1.
4. Increment the count of I2 by 1 and I3 is linked as a child to I2, I4 is linked as a child
to I3. The count is {I2:2}, {I3:1}, {I4:1}.
5. T3: I4, I5. Similarly, a new branch with I5 is linked to I4 as a child is created.
6. T4: I1, I2, I4. The sequence will be I2, I1, and I4. I2 is already linked to the root node,
hence it will be incremented by 1. Similarly I1 will be incremented by 1 as it is
already linked with I2 in T1, thus {I2:3}, {I1:2}, {I4:1}.
7. T5: I1, I2, I3, I5. The sequence will be I2, I1, I3, and I5. Thus {I2:4}, {I1:3}, {I3:2},
{I5:1}.
8. T6: I1, I2, I3, I4. The sequence will be I2, I1, I3, and I4. Thus {I2:5}, {I1:4}, {I3:3},
{I4:1}.

4) Mining of FP-tree is summarized below:


1. The lowest node item I5 is not considered as it does not have a min support count,
hence it is deleted.
2. The next lower node is I4. I4 occurs in 2 branches: {I2, I1, I3, I4: 1} and {I2, I3, I4: 1}.
Therefore, considering I4 as the suffix, the prefix paths will be {I2, I1, I3: 1} and {I2, I3: 1}.
This forms the conditional pattern base (a code sketch of extracting such a base appears
after this list).
3. The conditional pattern base is considered a transaction database, and an FP-tree is
constructed from it. This will contain {I2:2, I3:2}; I1 is not considered as it does not meet the
min support count.
4. This path will generate all combinations of frequent patterns: {I2,I4:2}, {I3,I4:2},
{I2,I3,I4:2}.
5. For I3, the prefix paths would be {I2,I1:3} and {I2:1}; this will generate a 2-node FP-tree:
{I2:4, I1:3}, and the frequent patterns generated are {I2,I3:4}, {I1,I3:3}, {I2,I1,I3:3}.
6. For I1, the prefix path would be {I2:4}; this will generate a single-node FP-tree:
{I2:4}, and the frequent pattern generated is {I2,I1:4}.
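
In general, the conditional pattern base of a suffix item can be read off the support-ordered
transactions, since each ordered transaction corresponds to one root-to-leaf insertion path of
the FP tree. The sketch below uses its own small hypothetical data set (not the I1..I5 example
above) and is meant only to show the idea of collecting prefix paths for a suffix.

```python
from collections import Counter

def conditional_pattern_base(ordered_transactions, suffix):
    """Prefix paths (with counts) that occur together with `suffix`.

    `ordered_transactions` are assumed to contain only frequent items,
    already sorted in descending order of support count.
    """
    base = Counter()
    for t in ordered_transactions:
        if suffix in t:
            prefix = tuple(t[:t.index(suffix)])
            if prefix:                 # an empty prefix contributes nothing
                base[prefix] += 1
    return base

# Hypothetical support-ordered transactions.
ordered = [
    ["f", "c", "a", "m", "p"],
    ["f", "c", "a", "b", "m"],
    ["f", "b"],
    ["c", "b", "p"],
    ["f", "c", "a", "m", "p"],
]
print(conditional_pattern_base(ordered, "m"))
# Counter({('f', 'c', 'a'): 2, ('f', 'c', 'a', 'b'): 1})
```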

Advantages Of FP Growth Algorithm


1. This algorithm needs to scan the database only twice, whereas Apriori scans the
transactions once for each iteration.
2. The pairing of items is not done in this algorithm, and this makes it faster.
3. The database is stored in a compact version in memory.
4. It is efficient and scalable for mining both long and short frequent patterns.
Disadvantages Of FP-Growth Algorithm
1. The FP Tree is more cumbersome and difficult to build than Apriori.
2. It may be expensive.
3. When the database is large, the FP tree may not fit in main memory.

Graph Pattern Mining


Graph pattern mining is the mining of frequent subgraphs (also called (sub)graph patterns)
in one or a set of graphs. Methods for mining graph patterns can be categorized into Apriori-
based and pattern growth–based approaches. Alternatively, we can mine the set of closed
graphs where a graph g is closed if there exists no proper supergraph g′ that carries the same
support count as g. Moreover, there are many variant graph patterns, including approximate
frequent graphs, coherent graphs, and dense graphs. User-specified constraints can be pushed
deep into the graph pattern mining process to improve mining efficiency.
Graph pattern mining has many interesting applications.

For example, it can be used to generate compact and effective graph index structures based
on the concept of frequent and discriminative graph patterns. Approximate structure
similarity search can be achieved by exploring graph index structures and multiple graph
features. Moreover, classification of graphs can also be performed effectively using frequent
and discriminative subgraphs as features.
Graph Mining (GM) is essentially the problem of discovering repetitive subgraphs occurring
in the input graphs.
Motivation
• Finding subgraphs capable of compressing the data by abstracting instances of the
substructures
• Identifying conceptually interesting patterns

Graph, Graph, Everywhere

Examples are everywhere: the aspirin molecule (a chemical structure), yeast protein
interaction networks, the Internet, and co-author networks.


Application of Graph Mining:
1. Chemical compounds (Cheminformatics)
2. Protein structures, biological pathways/networks (Bioinformatics)
3. Program control flow, traffic flow, and workflow analysis
4. XML databases, Web, and social network analysis

Example figures: a graph dataset and the frequent patterns mined from it (min support is 2).


Sequential pattern mining (SPM)

Sequential pattern mining is the task of finding statistically relevant patterns between data
examples where the values are delivered in a sequence. It is usually presumed that the values
are discrete, and thus time series mining is closely related but usually considered a different
activity. Sequential pattern mining is a special case of structured data mining.

There are several key traditional computational problems addressed within this field. These
include building efficient databases and indexes for sequence information, extracting the
frequently occurring patterns, comparing sequences for similarity, and recovering missing
sequence members. In general, sequence mining problems can be classified as string mining
which is typically based on string processing algorithms and itemset mining which is
typically based on association rule learning. Local process models extend sequential pattern
mining to more complex patterns that can include (exclusive) choices, loops, and concurrency
constructs in addition to the sequential ordering construct.
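
As a toy illustration of the itemset-mining flavour of sequence mining (this is not any specific
published algorithm such as GSP or PrefixSpan, just a sketch over hypothetical data), the snippet
below counts how many sequences contain item a somewhere before item b and keeps the
ordered pairs that meet a minimum support.

```python
def frequent_ordered_pairs(sequences, min_sup):
    """Count ordered pairs (a, b) with a occurring before b in a sequence."""
    counts = {}
    for seq in sequences:
        seen = set()                      # count each pair once per sequence
        for i, a in enumerate(seq):
            for b in seq[i + 1:]:
                if a != b and (a, b) not in seen:
                    seen.add((a, b))
                    counts[(a, b)] = counts.get((a, b), 0) + 1
    return {pair: c for pair, c in counts.items() if c >= min_sup}

# Hypothetical customer purchase sequences.
sequences = [
    ["computer", "antivirus", "printer"],
    ["computer", "printer"],
    ["printer", "computer", "antivirus"],
]
print(frequent_ordered_pairs(sequences, min_sup=2))
# {('computer', 'antivirus'): 2, ('computer', 'printer'): 2}
```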
