Data Mining Unit 2
Lecture Notes
------------------------------------------------------------------------------------------------------
Association Rule Mining: Mining Frequent Patterns, Associations and Correlations; Mining
Methods; Mining Various Kinds of Association Rules; Correlation Analysis; Constraint-Based
Association Mining; Graph Pattern Mining; Sequential Pattern Mining (SPM).
Frequent itemset mining leads to the discovery of associations and correlations among items
in large transactional or relational data sets. With massive amounts of data continuously
being collected and stored, many industries are becoming interested in mining such patterns
from their databases. The discovery of interesting correlation relationships among huge
amounts of business transaction records can help in many business decision-making
processes such as catalog design, cross-marketing, and customer shopping behaviour
analysis.
A typical example of frequent itemset mining is market basket analysis. This process analyses
customer buying habits by finding associations between the different items that customers
place in their “shopping baskets”.
The discovery of these associations can help retailers develop marketing strategies by gaining
insight into which items are frequently purchased together by customers. For instance, if
customers are buying milk, how likely are they to also buy bread (and what kind of bread) on
the same trip to the supermarket? This information can lead to increased sales by helping
retailers do selective marketing and plan their shelf space.
Market basket analysis. Suppose, as manager of an AllElectronics branch, you would like to
learn more about the buying habits of your customers. Specifically, you wonder, “Which
groups or sets of items are customers likely to purchase on a given trip to the store?” To
answer your question, market basket analysis may be performed on the retail data of
customer transactions at your store. You can then use the results to plan marketing or
advertising strategies, or in the design of a new catalog.
For instance, market basket analysis may help you design different store layouts. In one
strategy, items that are frequently purchased together can be placed in proximity to further
encourage the combined sale of such items. If customers who purchase computers also tend
to buy antivirus software at the same time, then placing the hardware display close to the
software display may help increase the sales of both items.
In an alternative strategy, placing hardware and software at opposite ends of the store may
entice customers who purchase such items to pick up other items along the way. For instance,
after deciding on an expensive computer, a customer may observe security systems for sale
while heading toward the software display to purchase antivirus software, and may decide to
purchase a home security system as well.
If we think of the universe as the set of items available at the store, then each item has a
Boolean variable representing the presence or absence of that item. Each basket can then be
represented by a Boolean vector of values assigned to these variables. The
Boolean vectors can be analysed for buying patterns that reflect items that are frequently
associated or purchased together. These patterns can be represented in the form of association
rules.
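As a small illustration of this Boolean representation, here is a sketch in Python (the item universe and basket shown are hypothetical):

# Hypothetical universe of items available at the store.
universe = ["computer", "antivirus_software", "printer", "camera"]

# Items one customer placed in their basket on a single trip.
basket = {"computer", "antivirus_software"}

# Boolean vector: 1 if the item is present in the basket, 0 otherwise.
vector = [1 if item in basket else 0 for item in universe]
print(vector)  # [1, 1, 0, 0]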
For example, the information that customers who purchase computers also tend to buy
antivirus software at the same time is represented in the following association rule:
computer ⇒ antivirus_software [support = 2%, confidence = 60%]
A support of 2% means that 2% of all the transactions under analysis show that computer and
antivirus software are purchased together. A confidence of 60% means that 60% of the
customers who purchased a computer also bought the software.
Typically, association rules are considered interesting if they satisfy both a minimum support
threshold and a minimum confidence threshold. These thresholds can be set by users or
domain experts.
Rules that satisfy both a minimum support threshold (min_sup) and a minimum confidence
threshold (min_conf) are called strong. By convention, support and confidence values are
written as percentages between 0% and 100%, rather than as fractions between 0 and 1.0.
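Formally, using the standard definitions (here A ∪ B denotes the itemset containing the items of both A and B):

support(A ⇒ B) = P(A ∪ B)
confidence(A ⇒ B) = P(B | A) = support_count(A ∪ B) / support_count(A)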
The occurrence frequency of an itemset is the number of transactions that contain the itemset.
This is also known, simply, as the frequency, support count, or count of the itemset.
In general, association rule mining can be viewed as a two-step process:
1. Find all frequent itemsets: By definition, each of these itemsets will occur at least as
frequently as a predetermined minimum support count, min_sup.
2. Generate strong association rules from the frequent itemsets: By definition, these rules
must satisfy minimum support and minimum confidence.
For example, a frequent itemset of length 100, such as {a1, a2, ..., a100}, contains
C(100, 1) = 100 frequent 1-itemsets: {a1}, {a2}, ..., {a100};
C(100, 2) = 4950 frequent 2-itemsets: {a1, a2}, {a1, a3}, ..., {a99, a100}; and so on.
The total number of frequent itemsets it contains is thus 2^100 − 1.
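These counts can be checked quickly with Python's standard library:

import math

# Number of frequent k-itemsets contained in a frequent 100-itemset.
print(math.comb(100, 1))  # 100
print(math.comb(100, 2))  # 4950

# Total number of non-empty subsets, i.e. frequent itemsets overall.
print(2 ** 100 - 1)       # 1267650600228229401496703205375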
As a worked example, we start by looking for single items that meet the support threshold of
2. In this case, it's simply A, B, C, D, and E, because there are at least 2 occurrences of each
of these in the table. This is summarized in the single-item support table below.
Next, we take all of the items that meet the support requirements (everything so far in this
example) and make all of the patterns/combinations we can out of them: AB, AC, AD, AE,
BC, BD, BE, CD, CE, DE. When we list all of these combinations in a table and determine
the support for each, we get a table that looks like this.
Several of these patterns don't meet the support threshold of 2, so we remove them from the
list of options.
At this point, we use the surviving items to make other patterns that contain 3 items. If you
logically work through all of the options, you'll get a list like this: ABC, ABD, ABE, BCD,
BCE, BDE. (Notice that I didn't list ABCD or BCDE here because they are 4 items long.)
Before I create the support table for these, let's look at these patterns. The first one, ABC,
was created by combining AB and BC. If you look in the 2-item support table (before or after
filtering), you'll find that AC doesn't have the minimum support required. If AC isn't
supported, a more complicated pattern that includes AC (such as ABC) can't be supported
either. This is a key point of the Apriori Principle. So, without having to go back to the
original data, we can exclude some of the 3-item patterns. When we do this, we eliminate
ABC (AC not supported), ABD (AD not supported), ABE (AE not supported), BCE (CE not
supported), and BDE (DE not supported). This process of removing patterns that can't be
supported because their subsets (or shorter combinations) aren't supported is called pruning.
This pruning process leaves only BCD with a support of 2.
The final list of all of the patterns with support greater than or equal to 2 is summarized
here.
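The original transaction table is not reproduced in these notes, so the sketch below uses hypothetical transactions chosen to be consistent with the results described above (every single item has support of at least 2, and BCD is the only frequent 3-itemset). A minimal Apriori sketch in Python:

from itertools import combinations

# Hypothetical transactions, consistent with the worked example above.
transactions = [
    {"A", "B"},
    {"A", "B"},
    {"B", "C", "D"},
    {"B", "C", "D"},
    {"B", "E"},
    {"B", "E"},
]
MIN_SUP = 2  # minimum support count

def support(itemset):
    # Number of transactions containing every item of the itemset.
    return sum(1 for t in transactions if itemset <= t)

# Frequent 1-itemsets.
items = sorted({i for t in transactions for i in t})
levels = [{frozenset([i]) for i in items if support(frozenset([i])) >= MIN_SUP}]

k = 2
while levels[-1]:
    prev = levels[-1]
    # Join step: union pairs of frequent (k-1)-itemsets into k-candidates.
    candidates = {a | b for a in prev for b in prev if len(a | b) == k}
    # Prune step (Apriori Principle): drop any candidate with an
    # infrequent (k-1)-subset, then check support against the data.
    levels.append({c for c in candidates
                   if all(frozenset(s) in prev for s in combinations(c, k - 1))
                   and support(c) >= MIN_SUP})
    k += 1

for level in levels:
    for itemset in sorted(level, key=sorted):
        print("".join(sorted(itemset)), support(itemset))

Running this prints the same final pattern list as above: A, B, C, D, E, AB, BC, BD, BE, CD, and BCD.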
FP Tree
A Frequent Pattern (FP) tree is a tree-like structure built from the itemsets of the database.
The purpose of the FP tree is to mine the most frequent patterns. Each node of the FP tree
represents an item of an itemset.
The root node represents null, while the lower nodes represent the itemsets. The associations
of the nodes with the lower nodes, that is, of the itemsets with the other itemsets, are
maintained while forming the tree.
Let us see the steps followed to mine frequent patterns using the frequent pattern growth
algorithm:
1) The first step is to scan the database to find the occurrences of the itemsets in the database.
This step is the same as the first step of Apriori. The count of each 1-itemset in the database
is called its support count or frequency.
2) The second step is to construct the FP tree. For this, create the root of the tree. The root is
represented by null.
3) The next step is to scan the database again and examine the transactions. Examine the first
transaction and find out the items in it. The item with the maximum count is placed at the
top, followed by the next item with a lower count, and so on. This means that each branch of
the tree is constructed with transaction items in descending order of count.
4) The next transaction in the database is examined. Its items are ordered in descending
order of count. If any items of this transaction are already present in another branch (for
example, from the 1st transaction), then this transaction's branch shares a common prefix
starting from the root.
This means that the common items are linked to new nodes for the remaining items of this
transaction.
5) Also, the counts of the items are incremented as they occur in the transactions. The count
of each common node is incremented by 1 as a transaction passes through it, and new nodes
are created with a count of 1 and linked according to the transaction.
6) The next step is to mine the created FP tree. For this, the lowest node is examined first,
along with the links of the lowest nodes. The lowest node represents a frequent pattern of
length 1. From this, traverse the paths in the FP tree. These paths are called the conditional
pattern base.
A conditional pattern base is a sub-database consisting of the prefix paths in the FP tree that
occur with the lowest node (the suffix).
7) Construct a Conditional FP Tree, which is formed from the counts of the itemsets along
these paths. The itemsets meeting the threshold support are included in the Conditional FP Tree.
8) Frequent Patterns are generated from the Conditional FP Tree.
For example, with a support threshold of 50% and 6 transactions in the database, the
minimum support count is min_sup = 0.5 × 6 = 3, as used in the sketch below.
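The transaction table for this example is not shown in the notes, so the following sketch builds an FP tree over a hypothetical 6-transaction database with min_sup = 3, following steps 1 to 6 above, and prints the conditional pattern base of one item:

from collections import Counter, defaultdict

class Node:
    # One FP-tree node: item label, count, parent link, children by item.
    def __init__(self, item, parent):
        self.item, self.count, self.parent = item, 1, parent
        self.children = {}

def build_fp_tree(transactions, min_sup):
    # Step 1: scan once for the support count of each 1-itemset.
    counts = Counter(i for t in transactions for i in t)
    frequent = {i for i, c in counts.items() if c >= min_sup}
    root = Node(None, None)      # step 2: the root represents null
    header = defaultdict(list)   # item -> list of its nodes in the tree
    # Steps 3-5: insert each transaction with items in descending count
    # order; shared prefixes bump existing counts, new items start at 1.
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in frequent),
                           key=lambda i: (-counts[i], i)):
            if item in node.children:
                node.children[item].count += 1
            else:
                node.children[item] = Node(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
    return root, header

def conditional_pattern_base(item, header):
    # Step 6: prefix paths ending at `item` (the suffix), with counts.
    paths = []
    for node in header[item]:
        path, p = [], node.parent
        while p is not None and p.item is not None:
            path.append(p.item)
            p = p.parent
        paths.append((path[::-1], node.count))
    return paths

# Hypothetical 6-transaction database; min_sup = 3 as computed above.
transactions = [{"I1", "I2", "I3"}, {"I2", "I3", "I4"}, {"I4", "I5"},
                {"I1", "I2", "I4"}, {"I1", "I2", "I3", "I5"},
                {"I1", "I2", "I3", "I4"}]
root, header = build_fp_tree(transactions, min_sup=3)
print(conditional_pattern_base("I4", header))

For I4 this yields the prefix paths {I2, I3}: 1, {}: 1, {I2, I1}: 1, and {I2, I1, I3}: 1. Only I2 reaches min_sup in this base (count 3), so it alone would enter the Conditional FP Tree (steps 7 and 8).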
Graph Pattern Mining
Graph pattern mining has many practical applications. For example, it can be used to
generate compact and effective graph index structures based on the concept of frequent and
discriminative graph patterns. Approximate structure similarity search can be achieved by
exploring graph index structures and multiple graph features. Moreover, classification of
graphs can also be performed effectively using frequent and discriminative subgraphs as
features.
Graph Mining (GM) is essentially the problem of discovering repetitive subgraphs occurring
in the input graphs.
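As a deliberately simplified sketch of this idea (the dataset is hypothetical, and each graph is reduced to a set of labeled edges; real frequent-subgraph miners such as gSpan must handle subgraph isomorphism, not just edge-set containment):

# Hypothetical graph dataset: each graph is a set of undirected,
# labeled edges, written (label_u, label_v) with label_u <= label_v.
graphs = [
    {("A", "B"), ("B", "C"), ("A", "C")},
    {("A", "B"), ("B", "C")},
    {("A", "B"), ("C", "D")},
]

def support(pattern):
    # Support = number of input graphs containing all edges of the pattern.
    return sum(pattern <= g for g in graphs)

print(support({("A", "B")}))              # 3
print(support({("A", "B"), ("B", "C")}))  # 2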
Motivation
- Finding subgraphs capable of compressing the data by abstracting instances of the
substructures
- Identifying conceptually interesting patterns
Sequential Pattern Mining (SPM)
Sequential pattern mining finds statistically relevant patterns between data examples where
the values are delivered in a sequence. It is usually presumed that the values are discrete, and
thus time series mining is closely related but usually considered a different activity.
Sequential pattern mining is a special case of structured data mining.
There are several key traditional computational problems addressed within this field. These
include building efficient databases and indexes for sequence information, extracting the
frequently occurring patterns, comparing sequences for similarity, and recovering missing
sequence members. In general, sequence mining problems can be classified as string mining,
which is typically based on string processing algorithms, and itemset mining, which is
typically based on association rule learning. Local process models extend sequential pattern
mining to more complex patterns that can include (exclusive) choices, loops, and concurrency
constructs in addition to the sequential ordering construct.
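As a small illustration (the purchase sequences are hypothetical), the core operation in sequential pattern mining is counting how many sequences contain a pattern as an ordered, possibly non-contiguous, subsequence:

def is_subsequence(pattern, sequence):
    # True if pattern's items occur in sequence in order (gaps allowed).
    it = iter(sequence)
    return all(item in it for item in pattern)

# Hypothetical customer purchase sequences.
sequences = [
    ["bread", "milk", "butter", "jam"],
    ["bread", "butter", "jam"],
    ["milk", "bread", "butter"],
    ["bread", "jam"],
]

def support(pattern):
    # Number of sequences that contain the pattern.
    return sum(is_subsequence(pattern, s) for s in sequences)

print(support(["bread", "butter"]))         # 3
print(support(["bread", "butter", "jam"]))  # 2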