Module III Final

The document discusses association rule mining and the Apriori algorithm. Association rule mining aims to find rules that predict the occurrence of an item based on other items in a transaction. The Apriori algorithm uses an iterative approach to efficiently find frequent itemsets in a database. It generates candidate itemsets of length k from frequent itemsets of length k-1, and only keeps those candidates that meet a minimum support threshold. This pruning strategy significantly reduces the search space compared to a brute-force approach.

Uploaded by

shilpa veeru

DATA MINING AND DATA WARE HOUSING (15CS651) VI SEM CSE

MODULE-3

Data Mining Association Analysis: Basic Concepts and Algorithms

Association Rule Mining

Given a set of transactions, find rules that predict the occurrence of an item based on the
occurrences of other items in the transaction.

Market-Basket transactions

TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
Examples of association rules:

{Diaper} → {Beer},
{Milk, Bread} → {Eggs, Coke},
{Beer, Bread} → {Milk}

Definition: Itemset and Support Count

Let I = {i1, i2, ..., id} be the set of all items in the market-basket data and
T = {t1, t2, ..., tN} be the set of all transactions. Each transaction ti contains a subset of items
chosen from I.

In association analysis, a collection of zero or more items is termed an itemset.

If an itemset contains k items, it is called a k-itemset.

Example: {Milk, Bread, Diaper} is a 3-itemset.


Rule Evaluation Metrics:

Support count (σ)

– Frequency of occurrence of an itemset

– E.g. σ({Milk, Bread, Diaper}) = 2

Support(s)

– Fraction of transactions that contain an item set

– E.g. s({Milk, Bread, Diaper}) = 2/5

Frequent Item set

– An item set whose support is greater than or equal to a minsup threshold

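Example: for the rule {Milk, Diaper} → {Beer}, the support s and the confidence c are

$$s = \frac{\sigma(\{\text{Milk, Diaper, Beer}\})}{|T|} = \frac{2}{5} = 0.4$$

$$c = \frac{\sigma(\{\text{Milk, Diaper, Beer}\})}{\sigma(\{\text{Milk, Diaper}\})} = \frac{2}{3} \approx 0.67$$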
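
The support and confidence values above can be checked with a few lines of Python. This is only a minimal sketch: the function names `support_count`, `support`, and `confidence` are chosen here for illustration and do not come from the notes, and the transactions are the five market-basket rows from the table.

```python
# Market-basket transactions from the table above, one set per TID.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    """sigma(itemset): number of transactions that contain every item."""
    return sum(1 for t in transactions if set(itemset) <= t)

def support(itemset, transactions):
    """s(itemset): fraction of transactions that contain the itemset."""
    return support_count(itemset, transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """c(X -> Y) = sigma(X union Y) / sigma(X)."""
    both = set(antecedent) | set(consequent)
    return support_count(both, transactions) / support_count(antecedent, transactions)

s = support({"Milk", "Diaper", "Beer"}, transactions)
c = confidence({"Milk", "Diaper"}, {"Beer"}, transactions)
print(f"s = {s:.2f}, c = {c:.2f}")
```

Representing each transaction as a set keeps the containment test a single subset comparison.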

Association Rule Mining Task

Given a set of transactions T, the goal of association rule mining is to find all rules having

– support ≥ minsup threshold

– confidence ≥ minconf threshold

Brute-force approach:

– List all possible association rules

– Compute the support and confidence for each rule

– Prune rules that fail the minsup and minconf thresholds

⇒ Computationally prohibitive!

More specifically, the total number of possible rules that can be extracted from a data set containing d items is

$$R = 3^d - 2^{d+1} + 1$$

Even for a small data set with 6 items, this approach requires us to compute the support and
confidence for 3^6 − 2^7 + 1 = 602 rules.

More than 80% of the rules are discarded after applying minsup = 20% and minconf = 50%, so
most of the computation is wasted.
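
As a sanity check on the figure of 602 rules, the closed form R = 3^d − 2^(d+1) + 1 can be compared with a direct count. The sketch below is illustrative only (the function names are not from the notes): it counts every way of choosing a non-empty antecedent and a non-empty, disjoint consequent from d items.

```python
from math import comb

def count_rules_closed_form(d):
    """Closed form for the number of possible rules over d items."""
    return 3 ** d - 2 ** (d + 1) + 1

def count_rules_by_enumeration(d):
    """Pick k items for the antecedent, then j of the remaining d - k
    items for the consequent (both sides non-empty)."""
    return sum(
        comb(d, k) * comb(d - k, j)
        for k in range(1, d)
        for j in range(1, d - k + 1)
    )

print(count_rules_closed_form(6), count_rules_by_enumeration(6))
```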

To avoid performing needless computations, it would be useful to prune the rules early without
having to compute their support and confidence values

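For example, the six rules that can be formed from the itemset {Beer, Diaper, Milk}, such as
{Beer, Diaper} → {Milk} and {Milk} → {Beer, Diaper}, all have the same support, because the
support of a rule depends only on the itemset as a whole.
If the itemset is infrequent, then all six candidate rules can be pruned immediately without
our having to compute their confidence values.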
Therefore, a common strategy adopted by many association rule mining algorithms is
to decompose the problem into two major subtasks:

1. Frequent Itemset Generation

– Generate all itemsets whose support ≥ minsup

2. Rule Generation

– Generate high-confidence rules from each frequent itemset, where each
rule is a binary partitioning of a frequent itemset

Frequent itemset generation is still computationally expensive.


Frequent Itemset Generation:

A lattice structure can be used to enumerate the list of all possible item sets.
Figure 6.1 shows an itemset lattice for I = {a, b, c, d, e}. In general, a data set that contains k items
can potentially generate up to 2^k − 1 frequent itemsets, excluding the null set. Because k can be
very large in many practical applications, the search space of itemsets that need to be explored
is exponentially large.

Brute-force approach:
– Each itemset in the lattice is a candidate frequent itemset
– Count the support of each candidate by scanning the database
Such an approach can be very expensive because it requires O(NMw) comparisons, where N is
the number of transactions, M = 2^k − 1 is the number of candidate itemsets, and w is the maximum
transaction width.

[Figure: each of the N transactions is matched against each of the M candidate itemsets.]

There are several ways to reduce the computational complexity of frequent itemset generation.

Reduce the number of candidates (M)

– Complete search: M = 2^d

– Use pruning techniques to reduce M

Reduce the number of transactions (N)

– Reduce size of N as the size of itemset increases

Reduce the number of comparisons (NM)

– Use efficient data structures to store the candidates or transactions

– No need to match every candidate against every transaction


Apriori principle: Reducing the Number of Candidates

Apriori principle: If an itemset is frequent, then all of its subsets must also be frequent.
To illustrate the idea behind the Apriori principle, consider the itemset lattice shown in Figure
6.3. Suppose {c, d, e} is a frequent itemset. Clearly, any transaction that contains {c, d, e} must
also contain its subsets, {c, d}, {c, e}, {d, e}, {c}, {d}, and {e}. As a result, if {c, d, e} is frequent,
then all subsets of {c, d, e} (i.e., the shaded itemsets in this figure) must also be frequent.

Conversely, if an itemset such as {a, b} is infrequent, then all of its supersets must be
infrequent too. As illustrated in Figure 6.4, the entire subgraph containing the supersets of {a, b}
can be pruned immediately once {a, b} is found to be infrequent. This strategy of trimming the
exponential search space based on the support measure is known as support-based pruning.

Frequent Itemset Generation in the Apriori Algorithm: Illustration with example.

Figure 6.5 provides a high-level illustration of the frequent itemset generation part of the Apriori
algorithm for the transactions shown in Table 6.1. We assume that the support threshold is 60%,
which is equivalent to a minimum support count of 3.

Initially, every item is considered as a candidate 1-itemset. After counting their supports, the
candidate itemsets {Cola} (Coke in the transaction table) and {Eggs} are discarded because they
appear in fewer than three transactions.

In the next iteration, candidate 2-itemsets are generated using only the frequent 1-itemsets,
because the Apriori principle ensures that all supersets of the infrequent 1-itemsets must be
infrequent.

Because there are only four frequent 1-itemsets, the number of candidate 2-itemsets generated by
the algorithm is C(4,2) = 6. Two of these six candidates, {Beer, Bread} and {Beer, Milk}, are
subsequently found to be infrequent after computing their support values. The remaining four
candidates are frequent, and thus will be used to generate candidate 3-itemsets.

Without support-based pruning, there are C(6,3) = 20 candidate 3-itemsets that can be formed using
the six items given in this example. With the Apriori principle, we only need to keep candidate
3-itemsets whose subsets are frequent. The only candidate that has this property is {Bread, Diapers,
Milk}.

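The effectiveness of the Apriori pruning strategy can be shown by counting the number of
candidate itemsets generated.

A brute-force strategy of enumerating all itemsets (up to size 3) as candidates will produce
C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41 candidates. With the Apriori principle, this number
decreases to 6 + 6 + 1 = 13 candidates.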

Apriori Algorithm:

Input: set of items I, set of transactions T, number of transactions N, minimum support minsup.
Output: frequent k-itemsets Fk, k = 1, 2, ...
Method:
– k = 1
– Compute the support of each 1-itemset (item) by scanning the transactions
– F1 = items whose support is at least minsup
– Repeat until no new frequent itemsets are identified:
1. Ck+1 = candidate (k+1)-itemsets generated from the frequent k-itemsets Fk
2. Compute the support of each candidate in Ck+1 by scanning the
transactions T
3. Fk+1 = candidates in Ck+1 whose support is at least minsup
4. k = k + 1
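
The method above can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the function name `apriori`, the use of a support count instead of a fraction, and the simple "union of two frequent k-itemsets" candidate step are choices made here for brevity. The data is the market-basket table with a minimum support count of 3.

```python
from itertools import combinations

def apriori(transactions, minsup_count):
    """Return {itemset: support count} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        # One scan of the transactions per level.
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    items = {frozenset([i]) for t in transactions for i in t}
    frequent = {c: n for c, n in count(items).items() if n >= minsup_count}
    result = dict(frequent)
    k = 1
    while frequent:
        prev = set(frequent)
        # Candidate generation: unions of two frequent k-itemsets with k+1 items.
        candidates = {a | b for a, b in combinations(prev, 2) if len(a | b) == k + 1}
        # Candidate pruning: every k-subset of a candidate must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k))
        }
        frequent = {c: n for c, n in count(candidates).items() if n >= minsup_count}
        result.update(frequent)
        k += 1
    return result

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
frequent_itemsets = apriori(transactions, minsup_count=3)
for itemset, n in sorted(frequent_itemsets.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

On this data the only candidate 3-itemset, {Bread, Diaper, Milk}, survives pruning but occurs in just two transactions, so it is not reported as frequent.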

Candidate Generation and Pruning

Candidate Generation: This operation generates new candidate k-itemsets based on the frequent
(k − 1)-itemsets found in the previous iteration.

Candidate Pruning: This operation eliminates some of the candidate k-itemsets using the
support-based pruning strategy.

The following is a list of requirements for an effective candidate generation procedure:

– It should avoid generating too many unnecessary candidates.
– It must ensure that the candidate set is complete, i.e., no frequent itemsets are
left out by the candidate generation procedure.
– It should not generate the same candidate itemset more than once,
e.g. {a,b,c,d} can be generated by merging {a,b,c} with {d},
{b,d} with {a,c}, or {a,b} with {c,d}.

Several candidate generation strategies are discussed below.


Brute-Force Method: The brute-force method considers every k-itemset as a potential candidate
and then applies the candidate pruning step to remove any unnecessary candidates (see Figure 6.6).

Fk-1 x F1 Method:
Combine frequent (k−1)-itemsets with frequent 1-itemsets.

Figure 6.7 illustrates how a frequent 2-itemset such as {Beer, Diapers} can be augmented with a
frequent item such as Bread to produce a candidate 3-itemset {Beer, Diapers, Bread}.

Satisfaction of our requirements

1) Although many k-itemsets are never generated, the method can still produce unnecessary candidates,
e.g. merging {Beer, Diapers} with {Milk} is unnecessary, since {Beer, Milk} is infrequent.

2) The method is complete: every frequent k-itemset consists of a frequent (k−1)-itemset and a
frequent 1-itemset.

3) The method can generate the same candidate more than once,
e.g. {Bread, Diapers, Milk} can be generated by merging {Bread, Diapers} with {Milk},
{Bread, Milk} with {Diapers}, or {Diapers, Milk} with {Bread}.
This can be circumvented by keeping the items of every frequent itemset in lexicographical order
and extending an itemset only with items that come after all of its own items:
- e.g. {Bread, Diapers} can be merged with {Milk}, as 'Milk' comes after 'Bread' and
'Diapers' in lexicographical order
- {Diapers, Milk} is not merged with {Bread}, and {Bread, Milk} is not merged with
{Diapers}, as that would violate the lexicographical ordering
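
The lexicographic restriction can be sketched as below. The function name `candidates_fk1_x_f1` is hypothetical, itemsets are assumed to be stored as sorted tuples, and no support-based pruning is applied, so the unnecessary candidate {Beer, Diapers, Milk} mentioned in point 1 still appears. The frequent items and 2-itemsets are those of the running example.

```python
def candidates_fk1_x_f1(freq_prev, freq_items):
    """Extend each frequent (k-1)-itemset (a sorted tuple) with every
    frequent item that comes after its last item lexicographically."""
    candidates = []
    for itemset in sorted(freq_prev):
        for item in sorted(freq_items):
            if item > itemset[-1]:
                candidates.append(itemset + (item,))
    return candidates

f1 = ["Beer", "Bread", "Diapers", "Milk"]
f2 = [("Beer", "Diapers"), ("Bread", "Diapers"), ("Bread", "Milk"), ("Diapers", "Milk")]
c3 = candidates_fk1_x_f1(f2, f1)
print(c3)
```

Each candidate is produced exactly once, because only the (k−1)-prefix of a sorted k-itemset is allowed to generate it.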

Fk-1 x Fk-1 Method:

– Combine a frequent (k−1)-itemset with another frequent (k−1)-itemset
– Items are stored in lexicographical order in the itemset
– When considering pairs for merging, only pairs that share their first k−2 items are considered
  o e.g. {Bread, Diapers} is merged with {Bread, Milk}
  o if the pair shares fewer than k−2 items, the resulting itemset would be larger than
    k, so we do not need to generate it yet
– The resulting k-itemset has k subsets of size k−1, which are checked against the support
threshold
  o the merging ensures that at least two of the subsets are frequent
  o an additional check is made that the remaining k−2 subsets are frequent as well

In Figure 6.8, the frequent itemsets {Bread, Diapers} and {Bread, Milk} are merged to form a
candidate 3-itemset {Bread, Diapers, Milk}.

Satisfaction of our requirements

1) Avoids generating many of the unnecessary candidates produced by the Fk-1 x F1 method,
e.g. it will not generate {Beer, Diapers, Milk}, as {Beer, Milk} is infrequent.
2) The method is complete: every frequent k-itemset can be formed from two frequent (k−1)-itemsets
that differ only in their last item.
3) Each candidate itemset is generated only once.
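
A minimal sketch of the Fk-1 x Fk-1 merge with the additional subset check is given below. The function name is hypothetical, itemsets are assumed to be sorted tuples, and the input list is assumed to be non-empty. The input is the set of frequent 2-itemsets from the running example.

```python
from itertools import combinations

def candidates_fk1_x_fk1(freq_prev):
    """Merge pairs of frequent (k-1)-itemsets (sorted tuples) that share
    their first k-2 items, then keep a candidate only if all of its
    (k-1)-subsets are frequent."""
    prev = sorted(freq_prev)
    prev_set = set(prev)
    k = len(prev[0]) + 1
    candidates = []
    for a, b in combinations(prev, 2):
        if a[:-1] == b[:-1]:              # same first k-2 items, a[-1] < b[-1]
            cand = a + (b[-1],)
            if all(s in prev_set for s in combinations(cand, k - 1)):
                candidates.append(cand)
    return candidates

f2 = [("Beer", "Diapers"), ("Bread", "Diapers"), ("Bread", "Milk"), ("Diapers", "Milk")]
c3 = candidates_fk1_x_fk1(f2)
print(c3)
```

Because the list is sorted and a pair is merged only when its prefixes agree, {Beer, Diapers, Milk} is never formed: no other frequent 2-itemset starts with Beer.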

Support counting using hash tree:

Given the candidate itemsets Ck and the set of transactions T, we need to compute the support
counts σ(X) for each itemset X in Ck.
A brute-force algorithm would compare each transaction against each candidate itemset,
which requires a large number of comparisons.
An alternative approach:
– Divide the candidate itemsets Ck into buckets by using a hash function.
– For each transaction t:
  – Hash the itemsets contained in t into buckets using the same hash function.
  – Compare the corresponding buckets of candidates and the transaction.
  – Increment the support counts of each matching candidate itemset.

A hash tree is used to implement the hash function.

An alternative approach is to enumerate the itemsets contained in each transaction and use them
to update the support counts of their respective candidate itemsets. To illustrate, consider a
transaction t that contains five items, {1, 2, 3, 5, 6}.

Figure 6.9 shows a systematic way of enumerating the 3-itemsets contained in t. Assuming that
each itemset keeps its items in increasing lexicographic order, an itemset can be enumerated by
specifying the smallest item first, followed by the larger items. For instance, given t = {1, 2, 3, 5, 6},
all the 3-itemsets contained in t must begin with item 1, 2, or 3.
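
The enumeration can be reproduced with Python's standard library. This is only an illustrative sketch: `itertools.combinations` emits subsets of a sorted input in lexicographic order, which matches the smallest-item-first scheme described above.

```python
from itertools import combinations

t = (1, 2, 3, 5, 6)                      # transaction items in increasing order
three_itemsets = list(combinations(t, 3))

for itemset in three_itemsets:
    print(itemset)

# Group the subsets by their first (smallest) item.
first_items = sorted({itemset[0] for itemset in three_itemsets})
print(len(three_itemsets), first_items)
```

There are C(5,3) = 10 such subsets: six begin with item 1, three with item 2, and one with item 3.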

Figure 6.11 shows an example of a hash tree structure.

Each internal node of the tree uses the hash function h(p) = p mod 3 to determine
which branch of the current node should be followed next.

For example, items 1, 4, and 7 are hashed to the same branch (i.e., the leftmost branch) because
they have the same remainder after dividing the number by 3.

All candidate itemsets are stored at the leaf nodes of the hash tree. The hash tree shown in
Figure 6.11 contains 15 candidate 3-itemsets, distributed across 9 leaf nodes.

Consider a transaction t = {1, 2, 3, 5, 6}. To update the support counts of the candidate itemsets,
the hash tree must be traversed in such a way that all the leaf nodes containing candidate
3-itemsets belonging to t are visited at least once.

At the root node of the hash tree, the items 1, 2, and 3 of the transaction are hashed separately.
Item 1 is hashed to the left child of the root node, item 2 is hashed to the middle child, and item 3
is hashed to the right child.

At the next level of the tree, the transaction is hashed on the second item listed in the Level 2
structures shown in Figure 6.9.

For example, after hashing on item 1 at the root node, items 2, 3, and 5 of the transaction are
hashed. Items 2 and 5 are hashed to the middle child, while item 3 is hashed to the right child, as
shown in Figure 6.12. This process continues until the leaf nodes of the hash tree are reached.

The candidate item sets stored at the visited leaf nodes are compared against the transaction. If a
candidate is a subset of the transaction, its support count is incremented.

In this example, 5 out of the 9 leaf nodes are visited and 9 out of the 15 item sets are compared
against the transaction.
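
The traversal can be sketched with a small hash tree in Python. Everything in this block is an assumption made for illustration: the class and function names, the maximum leaf size of 3, and the list of 15 candidate 3-itemsets, which is not read from Figure 6.11. The tree this code builds may therefore differ in shape from the figure, so no leaf or comparison counts are claimed; only the resulting support counts are meaningful.

```python
class Node:
    def __init__(self):
        self.children = {}   # hash value -> child Node (internal nodes)
        self.bucket = []     # candidate itemsets (leaf nodes)
        self.is_leaf = True

def h(item):
    return item % 3          # hash function h(p) = p mod 3 from the text

def insert(node, cand, depth=0, max_leaf=3):
    if node.is_leaf:
        node.bucket.append(cand)
        # Split an overfull leaf while there is still an item left to hash on.
        if len(node.bucket) > max_leaf and depth < len(cand):
            node.is_leaf = False
            pending, node.bucket = node.bucket, []
            for c in pending:
                insert(node, c, depth, max_leaf)
    else:
        child = node.children.setdefault(h(cand[depth]), Node())
        insert(child, cand, depth + 1, max_leaf)

def visit(node, t, k, counts, seen, start=0, depth=0):
    if node.is_leaf:
        if id(node) not in seen:          # compare each leaf once per transaction
            seen.add(id(node))
            for c in node.bucket:
                if set(c) <= set(t):
                    counts[c] += 1
        return
    # At depth d, hash every item that can still be the (d+1)-th item of a k-subset.
    for i in range(start, len(t) - (k - depth) + 1):
        child = node.children.get(h(t[i]))
        if child is not None:
            visit(child, t, k, counts, seen, i + 1, depth + 1)

def support_counts(candidates, transactions, k):
    root = Node()
    for c in candidates:
        insert(root, tuple(sorted(c)))
    counts = {tuple(sorted(c)): 0 for c in candidates}
    for t in transactions:
        visit(root, tuple(sorted(t)), k, counts, seen=set())
    return counts

# Hypothetical candidate 3-itemsets, assumed for this sketch.
candidates = [(1, 4, 5), (1, 2, 4), (4, 5, 7), (1, 2, 5), (4, 5, 8),
              (1, 5, 9), (1, 3, 6), (2, 3, 4), (5, 6, 7), (3, 4, 5),
              (3, 5, 6), (3, 5, 7), (6, 8, 9), (3, 6, 7), (3, 6, 8)]
counts = support_counts(candidates, [{1, 2, 3, 5, 6}], k=3)
print([c for c, n in counts.items() if n > 0])
```

The `seen` set is needed because the same leaf can be reached along more than one path for a single transaction, and its candidates must not be counted twice.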

Rule Generation
Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f → L − f satisfies the
minimum confidence requirement.
– If {A,B,C,D} is a frequent itemset, the candidate rules are:
ABC → D, ABD → C, ACD → B, BCD → A,
A → BCD, B → ACD, C → ABD, D → ABC,
AB → CD, AC → BD, AD → BC, BC → AD,
BD → AC, CD → AB
If |L| = k, then there are 2^k − 2 candidate association rules (ignoring L → ∅ and ∅ → L).

How to efficiently generate rules from frequent itemsets?

– In general, confidence does not have an anti-monotone property:
c(ABC → D) can be larger or smaller than c(AB → D)
– But the confidence of rules generated from the same itemset does have an anti-monotone
property,
e.g., for L = {A,B,C,D}:

c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)

– Confidence is anti-monotone w.r.t. the number of items on the RHS of the rule

A candidate rule is generated by merging two rules that share the same prefix
in the rule consequent:
– join(CD → AB, BD → AC) would produce the candidate rule D → ABC
– Prune rule D → ABC if its subset AD → BC does not have
high confidence
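
A level-wise rule generator along these lines can be sketched as follows. This is a simplified illustration, not the notes' own procedure: the function names and the minconf value of 0.6 are assumptions, and consequents are merged by plain set union rather than by prefix matching. The itemset {Milk, Diaper, Beer} is used only to show the mechanics on the market-basket table.

```python
from itertools import combinations

def support_count(itemset, transactions):
    return sum(1 for t in transactions if itemset <= t)

def generate_rules(L, transactions, minconf):
    """Grow rule consequents one item at a time; a consequent is extended
    only if all of its smaller consequents gave high-confidence rules."""
    L = frozenset(L)
    sigma_L = support_count(L, transactions)
    rules = []
    consequents = [frozenset([i]) for i in sorted(L)]
    m = 1
    while consequents and m < len(L):
        kept = []
        for Y in consequents:
            X = L - Y
            conf = sigma_L / support_count(X, transactions)
            if conf >= minconf:
                rules.append((X, Y, conf))
                kept.append(Y)
        kept_set = set(kept)
        merged = {a | b for a, b in combinations(kept, 2) if len(a | b) == m + 1}
        # Confidence-based pruning: every m-item sub-consequent must have survived.
        consequents = [
            c for c in merged
            if all(frozenset(s) in kept_set for s in combinations(c, m))
        ]
        m += 1
    return rules

transactions = [
    frozenset({"Bread", "Milk"}),
    frozenset({"Bread", "Diaper", "Beer", "Eggs"}),
    frozenset({"Milk", "Diaper", "Beer", "Coke"}),
    frozenset({"Bread", "Milk", "Diaper", "Beer"}),
    frozenset({"Bread", "Milk", "Diaper", "Coke"}),
]
rules = generate_rules({"Milk", "Diaper", "Beer"}, transactions, minconf=0.6)
for X, Y, conf in rules:
    print(sorted(X), "->", sorted(Y), round(conf, 2))
```

The sketch assumes L occurs in at least one transaction, so every antecedent has a non-zero support count.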

Alternative Methods for Generating Frequent Itemsets

Traversal of Itemset Lattice: A search for frequent itemsets can be conceptually viewed as a
traversal of the itemset lattice.

The search strategy employed by an algorithm dictates how the lattice structure is traversed
during the frequent itemset generation process. Some search strategies are better than others,
depending on the configuration of frequent itemsets in the lattice.

Equivalence Classes: Equivalence classes can also be defined according to the prefix or suffix
labels of an itemset.

In this case, two itemsets belong to the same equivalence class if they share a common prefix or
suffix of length k. In the prefix-based approach, the algorithm can search for frequent itemsets
starting with the prefix a before looking for those starting with prefixes b, c, and so on.

Breadth-First versus Depth-First: The Apriori algorithm traverses the lattice in a breadth-first
manner, as shown in Figure 6.21(a). It first discovers all the frequent 1-itemsets, followed by the
frequent 2-itemsets, and so on, until no new frequent itemsets are generated.

The lattice can also be traversed in a depth-first manner. The algorithm can start from, say,
node a in Figure 6.22, and count its support to determine
whether it is frequent. If so, the algorithm progressively expands the next level of nodes, i.e., ab,
abc, and so on, until an infrequent node is reached, say, abcd. It then backtracks to another
branch, say, abce, and continues the search from there.

FP-Growth Algorithm
Apriori uses a generate-and-test approach: it generates candidate itemsets and tests whether they
are frequent.
– Generation of candidate itemsets is expensive (in both space and time)
– Support counting is expensive
  • Subset checking (computationally expensive)
  • Multiple database scans
FP-Growth allows frequent itemset discovery without candidate itemset generation. It is a
two-step approach:
– Step 1: Build a compact data structure called the FP-tree
  • Built using 2 passes over the data set
– Step 2: Extract frequent itemsets directly from the FP-tree
Step 1: FP-Tree Construction

The FP-tree is constructed using 2 passes over the data set:
Pass 1: Scan the data and find the support of each item.
– Discard infrequent items.
– Sort the frequent items in decreasing order of support.
– Use this order when building the FP-tree, so that common prefixes can
be shared.

Pass 2:
Nodes correspond to items and have a counter.
1. FP-Growth reads one transaction at a time and maps it to a path.
2. A fixed order is used, so paths can overlap when transactions share items (when they have
the same prefix).
– In this case, counters are incremented.
3. Pointers are maintained between nodes containing the same item, creating singly linked
lists (dotted lines).
– The more paths that overlap, the higher the compression. The FP-tree may fit in
memory.
4. Frequent itemsets are extracted from the FP-tree.

Figure 6.24 shows a data set that contains ten transactions and five items.
Initially, the FP-tree contains only the root node represented by the null symbol. The FP-tree is
subsequently extended in the following way:

1. The data set is scanned once to determine the support count of each item. Infrequent items are
discarded, while the frequent items are sorted in decreasing support counts. For the data set
shown in Figure 6.24, a is the most frequent item, followed by b, c, d, and e.

2. The algorithm makes a second pass over the data to construct the FP-tree.
After reading the first transaction, {a, b}, the nodes labeled a and b are created. A path is then
formed from null → a → b to encode the transaction. Every node along the path has a frequency
count of 1.

3. After reading the second transaction, {b, c, d}, a new set of nodes is created for items b, c, and
d. A path is then formed to represent the transaction by connecting the nodes null → b → c → d. Every
node along this path also has a frequency count equal to one. Although the first two transactions
have an item in common, which is b, their paths are disjoint because the transactions do not share
a common prefix.

The third transaction, {a, c, d, e}, shares a common prefix item (which is a) with the first
transaction. As a result, the path for the third transaction, null → a → c → d → e, overlaps with the
path for the first transaction, null → a → b. Because of their overlapping path, the frequency count
for node a is incremented to two, while the frequency counts for the newly created nodes, c, d, and e,
are equal to one.

This process continues until every transaction has been mapped onto one of the paths given in
the FP-tree. The resulting FP-tree after reading all the transactions is shown at the bottom of
Figure 6.25.
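
The two-pass construction can be sketched in Python as below. The names `FPNode` and `build_fp_tree`, the minimum support count of 2, and the use of plain lists in place of the linked node pointers are assumptions made for this sketch. The ten transactions are the ones listed in question 7 of the question bank; their first three match the walkthrough above.

```python
from collections import Counter

class FPNode:
    def __init__(self, item, parent):
        self.item = item
        self.parent = parent
        self.count = 0
        self.children = {}            # item -> FPNode

def build_fp_tree(transactions, min_count):
    # Pass 1: count item supports and fix a global order (decreasing support).
    counts = Counter(item for t in transactions for item in set(t))
    order = sorted((i for i in counts if counts[i] >= min_count),
                   key=lambda i: (-counts[i], i))
    rank = {item: r for r, item in enumerate(order)}

    # Pass 2: map each transaction to a path, sharing common prefixes.
    root = FPNode(None, None)         # the null root
    header = {item: [] for item in order}   # item -> nodes holding that item
    for t in transactions:
        node = root
        for item in sorted((i for i in set(t) if i in rank), key=rank.get):
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += 1
            node = child
    return root, header

transactions = [
    {"a", "b"}, {"b", "c", "d"}, {"a", "c", "d", "e"}, {"a", "d", "e"},
    {"a", "b", "c"}, {"a", "b", "c", "d"}, {"a"}, {"a", "b", "c"},
    {"a", "b", "d"}, {"b", "c", "e"},
]
root, header = build_fp_tree(transactions, min_count=2)
print({item: node.count for item, node in root.children.items()})
```

After all ten transactions, the root has two children: a, shared by the eight transactions that contain a, and b, which starts the paths of the two transactions that do not.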

Step 2: Frequent Itemset Generation

FP-growth is an algorithm that generates frequent itemsets from an FP-tree by exploring the tree
in a bottom-up fashion.

Given the example tree shown in Figure 6.24, the algorithm looks for frequent itemsets ending in
e first, followed by d, c, b, and finally, a. This bottom-up strategy for finding frequent itemsets
ending with a particular item is equivalent to the suffix-based approach.

Since every transaction is mapped onto a path in the FP-tree, we can derive the frequent itemsets
ending with a particular item, say e, by examining only the paths containing node e. These paths
can be accessed rapidly using the pointers associated with node e. The extracted paths are shown
in Figure 6.26(a).

After finding the frequent itemsets ending in e, the algorithm proceeds to look for frequent
itemsets ending in d by processing the paths associated with node d. The corresponding paths are
shown in Figure 6.26(b). This process continues until all the paths associated with nodes c, b,
and finally a are processed.
The paths for these items are shown in Figures 6.26(c), (d), and (e), while their corresponding
frequent itemsets are summarized in Table 6.6.
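
The first part of this step, collecting the paths that end in a given item, can be sketched as follows. This is not the full FP-growth recursion. The tree builder is repeated in compact form so that the block runs on its own; all names, the minimum support count of 2, and the list-based node pointers are assumptions, and the data is the ten-transaction set from question 7 of the question bank.

```python
from collections import Counter

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent, self.count = item, parent, 0
        self.children = {}

def build_fp_tree(transactions, min_count):
    counts = Counter(i for t in transactions for i in set(t))
    order = sorted((i for i in counts if counts[i] >= min_count),
                   key=lambda i: (-counts[i], i))
    rank = {item: r for r, item in enumerate(order)}
    root, header = FPNode(None, None), {item: [] for item in order}
    for t in transactions:
        node = root
        for item in sorted((i for i in set(t) if i in rank), key=rank.get):
            if item not in node.children:
                node.children[item] = FPNode(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += 1
    return root, header

def prefix_paths(header, item):
    """For every node holding `item`, walk up to the root and return
    (items above the node, count of the node)."""
    paths = []
    for node in header[item]:
        path, parent = [], node.parent
        while parent is not None and parent.item is not None:
            path.append(parent.item)
            parent = parent.parent
        paths.append((tuple(reversed(path)), node.count))
    return paths

transactions = [
    {"a", "b"}, {"b", "c", "d"}, {"a", "c", "d", "e"}, {"a", "d", "e"},
    {"a", "b", "c"}, {"a", "b", "c", "d"}, {"a"}, {"a", "b", "c"},
    {"a", "b", "d"}, {"b", "c", "e"},
]
root, header = build_fp_tree(transactions, min_count=2)
paths_e = sorted(prefix_paths(header, "e"))
print(paths_e)
```

For item e this yields three paths, one for each transaction that contains e; FP-growth would continue by mining these paths for frequent itemsets ending in e.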

Evaluation of Association Pattern :

Association rule algorithms tend to produce too many rules
– many of them are uninteresting or redundant
– rules are redundant if, e.g., {A,B,C} → {D} and {A,B} → {D}
have the same support and confidence
Interestingness measures can be used to prune or rank the derived patterns.

Objective measures of interestingness

Given a rule X → Y, the information needed to compute rule interestingness can be
obtained from a contingency table.
Contingency table for X → Y (¬ denotes absence of the itemset):

        Y      ¬Y
X       f11    f10    f1+
¬X      f01    f00    f0+
        f+1    f+0    N

Interest Factor

$$I(X, Y) = \frac{s(X, Y)}{s(X)\, s(Y)} = \frac{N f_{11}}{f_{1+} f_{+1}}$$

Example:

        Coffee   ¬Coffee
Tea     15       5         20
¬Tea    75       5         80
        90       10        100

Correlation Analysis
For binary variables, correlation can be measured using the φ-coefficient, which is defined as

$$\phi = \frac{f_{11} f_{00} - f_{01} f_{10}}{\sqrt{f_{1+}\, f_{+1}\, f_{0+}\, f_{+0}}}$$

The value of correlation ranges from −1 (perfect negative correlation) to +1 (perfect positive
correlation). If the variables are statistically independent, then it is 0.
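
Both measures can be evaluated on the Tea and Coffee table with a short Python sketch. The function names are illustrative, and the formulas used are the standard contingency-table definitions: interest factor I = N·f11 / (f1+ · f+1) and φ = (f11·f00 − f10·f01) / sqrt(f1+ · f+1 · f0+ · f+0).

```python
from math import sqrt

def interest_factor(f11, f10, f01, f00):
    """I = N * f11 / (f1+ * f+1); a value of 1 indicates independence."""
    n = f11 + f10 + f01 + f00
    return n * f11 / ((f11 + f10) * (f11 + f01))

def phi_coefficient(f11, f10, f01, f00):
    """phi = (f11*f00 - f10*f01) / sqrt(f1+ * f+1 * f0+ * f+0)."""
    f1p, f0p = f11 + f10, f01 + f00      # row totals
    fp1, fp0 = f11 + f01, f10 + f00      # column totals
    return (f11 * f00 - f10 * f01) / sqrt(f1p * fp1 * f0p * fp0)

# Tea/Coffee table: f11 = Tea and Coffee, f10 = Tea only,
# f01 = Coffee only, f00 = neither.
f11, f10, f01, f00 = 15, 5, 75, 5
conf_tea_coffee = f11 / (f11 + f10)
print(conf_tea_coffee, interest_factor(f11, f10, f01, f00), phi_coefficient(f11, f10, f01, f00))
```

The rule Tea → Coffee has 75% confidence, yet the interest factor is below 1 and φ is negative, because 90% of all customers drink coffee regardless of whether they drink tea.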

Question Bank:

1. What is association analysis? Define support and confidence with an example.

2. Develop the Apriori algorithm for frequent itemset generation, with an example.
3. Explain the various measures for evaluating association patterns.
4. Explain in detail frequent itemset generation and rule generation with reference to
Apriori, along with an example.
5. Define the following: a) Support b) Confidence.
6. Explain the FP-growth algorithm for discovering frequent itemsets. What are its limitations?
7. Consider the following transaction data set:
TID ITEM
1 {a, b}
2 {b, c, d}
3 {a, c, d, e}
4 {a, d, e}
5 {a, b, c}
6 {a, b, c, d}
7 {a}
8 {a, b, c}
9 {a, b, d}
10 {b, c, e}
Construct the FP-tree, showing the tree separately after reading each transaction.
8. Illustrate the limitations of the support-confidence framework for evaluating an association
rule.
9. Define cross-support pattern. Suppose the support for milk is 70%, the support for sugar is
10% and the support for bread is 0.04%. Given hc = 0.01, is the frequent itemset {milk, sugar,
bread} a cross-support pattern?
10. Which factors affect the computational complexity of the Apriori algorithm?
Explain them.
11. Define a frequent pattern tree. Discuss the method of computing an FP-tree, with an
algorithm.
12. Give an example to show that items in a strong association rule may actually be
negatively correlated.
13. A database has five transactions. Let min-sup = 60% and min-conf = 80%
TID ITEM
T1 {M, O, N, K, E, Y}
T2 {D, O, N, K, E, Y}
T3 {M, A, K, E}
T4 {M, U, C, K, Y}
T5 {C, O, O, K, I, E}
Find all frequent itemsets using Apriori and FP-growth, respectively.
14. Explain various alternative methods for generating frequent item sets.
15. A database has four transactions. Let min-sup = 40% and min-conf = 60%
TID DATE ITEM
T1 01/01/10 {K, A, D, B}
T2 01/01/10 {D, A, C, E, B}
T3 01/15/10 {C, A, B, E}
T4 01/22/10 {B, A, D}
Find all frequent itemsets using the Apriori and FP-growth algorithms. Compare the
efficiency of the two mining processes.
16. Explain various Candidate Generation and Pruning techniques.
17. Explain the various properties of objective measures.
18. Comprehend the Simpson’s Paradox.
19. Illustrate the nature of Simpson's paradox for the following two-way contingency table:

                 Buy Exercise Machine
Buy HDTV         yes      no       Total
yes              99       81       180
no               54       66       120
Total            153      147      300

20. What is the Apriori algorithm? Give an example. A database has six transactions of purchases
of books from a book shop, as given below:
TID ITEM
T1 {ANN, CC, JC, CG}
T2 {CC, D, CG}
T3 {ANN, D, CC, TC}
T4 {ANN, CC, D, CG}
T5 {ANN, CC, D, TC, CG}
T6 {C, D, TC}
Let X = {CC, TC} and Y = {ANN, TC, CC}. Find the confidence and support of the
association rule X → Y and of the inverse rule Y → X.

21. Consider the following transaction data set:

TID ITEM
T100 I1, I2, I5
T200 I2, I4
T300 I2, I3
T400 I1, I2, I4
T500 I1, I3
T600 I2, I3
T700 I1, I3
T800 I1, I2, I3, I5
T900 I1, I2, I3
Construct the FP-tree. Generate the list of frequent itemsets ordered by their corresponding suffixes.

22. Consider the following set of frequent 3-itemsets:

{1, 2, 3} {1, 3, 5}
{1, 2, 4} {2, 3, 4}
{1, 2, 5} {2, 3, 5}
{1, 3, 4} {3, 4, 5}

Assume that there are only 5 items in the data set.

a) List all candidate 4-itemsets obtained by a candidate generation procedure
using the Fk-1 x F1 merging strategy.
b) List all candidate 4-itemsets obtained by the candidate generation procedure
in Apriori.
23. Apply the Apriori algorithm to the following data set:
TID ITEM
101 Milk, Bread, Eggs
102 Milk, Juice
103 Juice, Butter
104 Milk, Bread, Eggs
105 Coffee, Eggs
106 Coffee
107 Coffee, Juice
108 Milk, Bread, Cookies, Eggs
109 Cookies, Butter
110 Milk, Bread

Item set = {Milk, Bread, Eggs, Cookies, Coffee, Butter, Juice}; use 0.2 for min-sup.
Source diginotes.in

84 pages
03.01.2022 Mca, Bca CST
No ratings yet
03.01.2022 Mca, Bca CST
1 page
Edel 415 Portfolio Picks Theorem
No ratings yet
Edel 415 Portfolio Picks Theorem
4 pages
Java Script
No ratings yet
Java Script
8 pages
Information Quality Parameters
No ratings yet
Information Quality Parameters
9 pages
Si Product Guide PDF
No ratings yet
Si Product Guide PDF
31 pages
The Imageprint 8.X For Windows Troubleshooting Guide: Revision 1.1
No ratings yet
The Imageprint 8.X For Windows Troubleshooting Guide: Revision 1.1
63 pages
Special Event Trigger For PIC32 by Bruce Misner
No ratings yet
Special Event Trigger For PIC32 by Bruce Misner
3 pages
Hacker Powered Security Report 2019 PDF
No ratings yet
Hacker Powered Security Report 2019 PDF
66 pages
Data Encryption Standard (DES)
No ratings yet
Data Encryption Standard (DES)
63 pages
Kiel Compiler
No ratings yet
Kiel Compiler
5 pages
PYDS 3150713 Unit-2
No ratings yet
PYDS 3150713 Unit-2
38 pages
Establishing A Successful Process Center of Excellence: What Is A Coe?
No ratings yet
Establishing A Successful Process Center of Excellence: What Is A Coe?
1 page
Customer Satisfaction Towards Jio
100% (1)
Customer Satisfaction Towards Jio
7 pages
FIU MAXIMO Work Orders Quick Reference Guide: G S U W O
No ratings yet
FIU MAXIMO Work Orders Quick Reference Guide: G S U W O
1 page
Halcon 12.0 Solution Guide I
No ratings yet
Halcon 12.0 Solution Guide I
345 pages
Embroidery Design PDF
0% (1)
Embroidery Design PDF
50 pages
Machine Learning Curriculum Berkley
100% (1)
Machine Learning Curriculum Berkley
12 pages
03b Practice Test Set 4 - Paper 3F Mark Scheme
No ratings yet
03b Practice Test Set 4 - Paper 3F Mark Scheme
8 pages
Tring Class and Its Objects
No ratings yet
Tring Class and Its Objects
12 pages
Aveva
100% (1)
Aveva
134 pages
Cbse Class 3 Maths Sample Paper Term 2 Model 1
0% (1)
Cbse Class 3 Maths Sample Paper Term 2 Model 1
3 pages
Artemis Data Sheet
No ratings yet
Artemis Data Sheet
2 pages
Project Supervisor: Dr. Muhammad Salman Khan: Health Monitoring and Management Using Internet-of-Things (IOT)
No ratings yet
Project Supervisor: Dr. Muhammad Salman Khan: Health Monitoring and Management Using Internet-of-Things (IOT)
20 pages
63 Sample Xi Production Tuning Parameters PDF
No ratings yet
63 Sample Xi Production Tuning Parameters PDF
55 pages
Abap Syllabus
No ratings yet
Abap Syllabus
5 pages

Module III Final

DATA MINING AND DATA WAREHOUSING (15CS651) VI SEM CSE

Data Mining Association Analysis: Basic Concepts and Algorithms

Association Rule Mining

Definition: Item Set and support count

In association analysis, a collection of zero or more items is termed an item set.

If an item set contains k items, it is called a k-item set.

Example: {Milk, Bread, Diaper}

Rule Evaluation Metrics:

Support count (σ)

– Frequency of occurrence of an item set

– E.g. σ({Milk, Bread, Diaper}) = 2

Support (s)

– Fraction of transactions that contain an item set

– E.g. s({Milk, Bread, Diaper}) = 2/5
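As a minimal sketch, the two metrics above can be computed directly from the five market-basket transactions in these notes. The function names `support_count` and `support` are illustrative choices, not part of the original notes.

```python
# The five market-basket transactions from the table in these notes.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """Fraction of transactions that contain `itemset`."""
    return support_count(itemset, transactions) / len(transactions)

print(support_count({"Milk", "Bread", "Diaper"}, transactions))  # 2
print(support({"Milk", "Bread", "Diaper"}, transactions))        # 0.4
```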

Frequent Item set

– An item set whose support is greater than or equal to a minsup threshold

{Milk, Diaper} → {Beer}

s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4
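The support of this rule, together with its confidence (the support count of the whole rule divided by the support count of its left-hand side), can be checked with a short sketch. The helper name `sigma` is an illustrative choice, and the transactions are the five market-basket rows from these notes.

```python
# The five market-basket transactions from the table in these notes.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def sigma(itemset):
    """Support count: number of transactions containing `itemset`."""
    return sum(1 for t in transactions if set(itemset) <= t)

lhs, rhs = {"Milk", "Diaper"}, {"Beer"}

# Support: fraction of all transactions that contain both sides of the rule.
s = sigma(lhs | rhs) / len(transactions)
# Confidence: how often the RHS appears in transactions that contain the LHS.
c = sigma(lhs | rhs) / sigma(lhs)

print(s)            # 0.4
print(round(c, 2))  # 0.67
```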

Association Rule Mining Task

– List all possible association rules

– Compute the support and confidence for each rule

– Prune rules that fail the minsup and minconf thresholds

1. Frequent Itemset Generation

Frequent itemset generation is still computationally expensive.

Frequent Itemset Generation:

Reduce the number of candidates (M)

– Complete search: M = 2^d

– Use pruning techniques to reduce M

Reduce the number of transactions (N)

– Reduce size of N as the size of itemset increases

Reduce the number of comparisons (NM)

– Use efficient data structures to store the candidates or transactions

– No need to match every candidate against every transaction

Apriori principle: Reducing Number of Candidates

Frequent Itemset Generation in the Apriori Algorithm: Illustration with example.

A brute-force strategy of enumerating all itemsets (up to size 3) as candidates will produce C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41 candidates.
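The figure of 41 can be verified with a few lines of Python, assuming the six distinct items of the market-basket table (Bread, Milk, Diaper, Beer, Eggs, Coke). The variable names are illustrative.

```python
from math import comb

d = 6  # distinct items in the market-basket table

# Number of candidate itemsets of size 1, 2 and 3 under brute-force enumeration.
sizes = [comb(d, k) for k in (1, 2, 3)]
total = sum(sizes)

print(sizes)   # [6, 15, 20]
print(total)   # 41
print(2 ** d)  # 64 subsets overall, counting the empty set
```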

Candidate Generation and Pruning

The following is a list of requirements for an effective candidate generation procedure:

– It should avoid generating too many unnecessary candidates.

Several candidate generation strategies are discussed below.

Satisfaction of our requirements

3) Can generate the same set twice

F(k-1) × F(k-1) Method:

– Combine a frequent (k-1)-itemset with another frequent (k-1)-itemset

Satisfaction of our requirements

3. Each candidate itemset is generated only once.
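A minimal sketch of this merge step follows, assuming itemsets are kept as lexicographically sorted tuples and that two frequent (k-1)-itemsets are merged only when their first k-2 items are identical. The function name `generate_candidates` and the example input are illustrative: the four 2-itemsets are the ones that reach a support count of 3 on the market-basket table, a threshold assumed here.

```python
from itertools import combinations

def generate_candidates(frequent_prev):
    """Build candidate k-itemsets from frequent (k-1)-itemsets of equal size."""
    prev = sorted(tuple(sorted(s)) for s in frequent_prev)
    prev_set = set(prev)
    candidates = []
    for i in range(len(prev)):
        for j in range(i + 1, len(prev)):
            a, b = prev[i], prev[j]
            if a[:-1] != b[:-1]:
                # Sorted order keeps equal prefixes adjacent, so no later
                # itemset can share the prefix of `a` either.
                break
            cand = a + (b[-1],)  # stays sorted because a[-1] < b[-1]
            # Candidate pruning: every (k-1)-subset must itself be frequent.
            if all(sub in prev_set for sub in combinations(cand, len(cand) - 1)):
                candidates.append(cand)
    return candidates

# Frequent 2-itemsets, assuming a minimum support count of 3.
f2 = [{"Bread", "Milk"}, {"Bread", "Diaper"}, {"Milk", "Diaper"}, {"Beer", "Diaper"}]
print(generate_candidates(f2))  # [('Bread', 'Diaper', 'Milk')]
```

Because each candidate arises from exactly one pair of itemsets with a shared prefix, no duplicate check is needed in this sketch.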

Support counting using hash tree:

A hash tree is used to implement the hash function.

Figure 6.11 shows an example of a hash tree structure.
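The hash tree itself is not reproduced here. As a simpler stand-in for the same idea (avoid comparing every candidate against every transaction), the sketch below enumerates the k-subsets of each transaction and looks them up in a hash-based set of candidates. The candidate list is hypothetical and the transactions are the market-basket rows from these notes.

```python
from collections import Counter
from itertools import combinations

# The five market-basket transactions from the table in these notes.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

# Hypothetical candidate 2-itemsets, stored as sorted tuples for hashing.
candidates = {("Beer", "Diaper"), ("Bread", "Milk"), ("Beer", "Bread")}
k = 2

counts = Counter()
for t in transactions:
    # Enumerate each k-subset of the transaction once, in sorted order,
    # and increment the matching candidate if there is one.
    for subset in combinations(sorted(t), k):
        if subset in candidates:
            counts[subset] += 1

print(counts[("Beer", "Diaper")])  # 3
print(counts[("Bread", "Milk")])   # 3
print(counts[("Beer", "Bread")])   # 2
```

A hash tree refines this by partitioning the candidates into buckets, so that each transaction subset is matched only against the candidates in one bucket.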

How to efficiently generate rules from frequent itemsets?

c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)

– Confidence is anti-monotone w.r.t. number of items on the RHS of the rule
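This property allows rules to be generated level by level: if a rule fails the confidence threshold, its consequent is not extended further. The sketch below is one possible reading of that idea, not the notes' exact procedure. The names `sigma` and `rules_from_itemset`, the threshold of 0.6 and the chosen itemset are all assumptions for illustration.

```python
from itertools import combinations

# The five market-basket transactions from the table in these notes.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def sigma(itemset):
    """Support count: number of transactions containing `itemset`."""
    return sum(1 for t in transactions if set(itemset) <= t)

def rules_from_itemset(itemset, minconf):
    """Return (lhs, rhs, confidence) rules, growing the RHS one item at a time."""
    itemset = frozenset(itemset)
    rules = []
    consequents = [frozenset([item]) for item in itemset]
    while consequents:
        kept = []
        for rhs in consequents:
            lhs = itemset - rhs
            if not lhs:
                continue  # a rule needs a non-empty left-hand side
            conf = sigma(itemset) / sigma(lhs)
            if conf >= minconf:
                rules.append((lhs, rhs, conf))
                kept.append(rhs)
        # Only consequents that passed are merged into larger ones, and a
        # merged consequent survives only if all its subsets passed too.
        kept_set = set(kept)
        size = len(consequents[0]) + 1
        merged = {a | b for a, b in combinations(kept, 2) if len(a | b) == size}
        consequents = [
            rhs for rhs in merged
            if all(frozenset(sub) in kept_set for sub in combinations(rhs, size - 1))
        ]
    return rules

rules = rules_from_itemset({"Milk", "Diaper", "Beer"}, minconf=0.6)
for lhs, rhs, conf in rules:
    print(sorted(lhs), "->", sorted(rhs), round(conf, 2))
```

On this data the three rules with a single-item consequent all pass, while among the two-item consequents only {Beer} → {Milk, Diaper} reaches the assumed threshold.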

Alternative Methods for Generating Frequent Itemsets

Step 1: FP-Tree Construction

– Sort frequent items in decreasing order based on their support.
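A minimal sketch of this construction step follows, reusing the market-basket transactions from these notes rather than the notes' own FP-tree example. The minimum support count of 3, the alphabetical tie-breaking among equally frequent items, and the `Node` class are assumptions made for illustration.

```python
from collections import Counter

# The five market-basket transactions from the table in these notes.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
minsup_count = 3  # assumed threshold

# First pass: count each item and keep only the frequent ones.
counts = Counter(item for t in transactions for item in t)
frequent = {item for item, c in counts.items() if c >= minsup_count}

def ordered(t):
    """Frequent items of `t`, by decreasing support (ties broken by name)."""
    return sorted((i for i in t if i in frequent), key=lambda i: (-counts[i], i))

class Node:
    """One FP-tree node: an item, a count, and child nodes keyed by item."""
    def __init__(self, item):
        self.item = item
        self.count = 0
        self.children = {}

# Second pass: insert each ordered transaction as a path from the root,
# sharing prefixes and incrementing counts along the way.
root = Node(None)
for t in transactions:
    node = root
    for item in ordered(t):
        if item not in node.children:
            node.children[item] = Node(item)
        node = node.children[item]
        node.count += 1

print(ordered(transactions[3]))       # ['Bread', 'Diaper', 'Milk', 'Beer']
print(root.children["Bread"].count)   # 4
```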

Step 2: Frequent Itemset Generation

Evaluation of Association Patterns:

Objective measures of interestingness

X f11 f10 f1+

X̄ f01 f00 f0+
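As an illustration, the four cell counts of such a contingency table can be tallied from the market-basket transactions, and one common objective measure, lift, can be derived from them. The choice of X = Diaper and Y = Beer, and the use of lift as the example measure, are assumptions for this sketch.

```python
# The five market-basket transactions from the table in these notes.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
x, y = "Diaper", "Beer"  # illustrative choice of X and Y

# Cell counts: first index is presence of X, second is presence of Y.
f11 = sum(1 for t in transactions if x in t and y in t)
f10 = sum(1 for t in transactions if x in t and y not in t)
f01 = sum(1 for t in transactions if x not in t and y in t)
f00 = sum(1 for t in transactions if x not in t and y not in t)

n = len(transactions)
f1_plus = f11 + f10  # row total: transactions containing X
f_plus1 = f11 + f01  # column total: transactions containing Y

# Lift compares the observed co-occurrence with what independence predicts.
lift = (f11 * n) / (f1_plus * f_plus1)

print(f11, f10, f01, f00)  # 3 1 0 1
print(lift)                # 1.25
```

A lift above 1 suggests that X and Y occur together more often than they would if they were independent.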

1. What is association analysis? Define support and confidence with an example.

Buy Exercise machine

21. Consider the following transaction data set:

T400 I1, I2, I4

22. Consider the following set of frequent 3-itemsets

Assume that there are only 5 items in the data set.

110 Milk, Bread
