06FPBasic

The document discusses mining frequent patterns, specifically focusing on association rule mining, which aims to identify rules predicting item occurrences based on transaction data. It outlines key concepts such as frequent itemsets, support, and confidence, and introduces the Apriori algorithm for generating frequent itemsets and association rules. The document emphasizes the computational challenges and strategies to optimize the mining process.


Based on slides from Han J., et al. (2013)
Chapter 6: Mining Frequent Patterns, Association and Correlations

• Basic Concepts
• Frequent Itemset Mining Methods
• Which Patterns Are Interesting? — Pattern Evaluation Methods
• Summary

Association Rule Mining

Given a set of transactions, find rules that predict the occurrence of an item based on the occurrences of other items in the transaction.

Market-basket transactions:

TID | Items
1   | Bread, Milk
2   | Bread, Diaper, Beer, Eggs
3   | Milk, Diaper, Beer, Coke
4   | Bread, Milk, Diaper, Beer
5   | Bread, Milk, Diaper, Coke

Examples of association rules:
  {Diaper} → {Beer}
  {Milk, Bread} → {Eggs, Coke}
  {Beer, Bread} → {Milk}

Implication means co-occurrence, not causality!
Definition: Frequent Itemset

• Itemset
  • A collection of one or more items
  • Example: {Milk, Bread, Diaper}
• k-itemset
  • An itemset that contains k items
• Support count (σ)
  • Frequency of occurrence of an itemset
  • E.g. σ({Milk, Bread, Diaper}) = 2
• Support (s)
  • Fraction of transactions that contain an itemset
  • E.g. s({Milk, Bread, Diaper}) = 2/5
• Frequent Itemset
  • An itemset whose support is greater than or equal to a minsup threshold

TID | Items
1   | Bread, Milk
2   | Bread, Diaper, Beer, Eggs
3   | Milk, Diaper, Beer, Coke
4   | Bread, Milk, Diaper, Beer
5   | Bread, Milk, Diaper, Coke

Note: itemsets are assumed to be ordered lexicographically.
Definition: Association Rule

Let D be a database of transactions, e.g.:

Transaction ID | Items Bought
2000           | A, B, C
1000           | A, C
4000           | A, D
5000           | B, E, F

Let I be the set of items that appear in the database, e.g., I = {A, B, C, D, E, F}.

A rule is defined by X → Y, where X ⊆ I, Y ⊆ I, and X ∩ Y = ∅.
  e.g.: {B, C} → {E} is a rule
Definition: Association Rule

• Association Rule
  • An implication expression of the form X → Y, where X and Y are itemsets
  • Example: {Milk, Diaper} → {Beer}
• Rule Evaluation Metrics
  • Support (s)
    • Fraction of transactions that contain both X and Y
  • Confidence (c)
    • Measures how often items in Y appear in transactions that contain X

TID | Items
1   | Bread, Milk
2   | Bread, Diaper, Beer, Eggs
3   | Milk, Diaper, Beer, Coke
4   | Bread, Milk, Diaper, Beer
5   | Bread, Milk, Diaper, Coke

Example: {Milk, Diaper} → {Beer}

  s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4
  c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 = 0.67
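These two metrics are easy to verify directly on the five market-basket transactions above; a minimal Python sketch (the helper name support_count and the variable names are mine, not from the slides):

  transactions = [
      {"Bread", "Milk"},
      {"Bread", "Diaper", "Beer", "Eggs"},
      {"Milk", "Diaper", "Beer", "Coke"},
      {"Bread", "Milk", "Diaper", "Beer"},
      {"Bread", "Milk", "Diaper", "Coke"},
  ]

  def support_count(itemset):
      """sigma(itemset): number of transactions containing every item of itemset."""
      return sum(1 for t in transactions if itemset <= t)

  X, Y = {"Milk", "Diaper"}, {"Beer"}
  support = support_count(X | Y) / len(transactions)      # 2/5 = 0.4
  confidence = support_count(X | Y) / support_count(X)    # 2/3 ≈ 0.67
  print(support, confidence)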
Rule Measures: Support and Confidence

(Venn diagram: customers who buy diapers, customers who buy beer, and customers who buy both.)

Find all the rules X → Y with minimum confidence and support:
• support, s: probability that a transaction contains X ∪ Y
• confidence, c: conditional probability that a transaction having X also contains Y

Transaction ID | Items Bought
2000           | A, B, C
1000           | A, C
4000           | A, D
5000           | B, E, F

With minimum support 50% and minimum confidence 50%, we have:
• A → C (support 50%, confidence 66.6%)
• C → A (support 50%, confidence 100%)
Example

TID | date     | items_bought
100 | 10/10/99 | {F, A, D, B}
200 | 15/10/99 | {D, A, C, E, B}
300 | 19/10/99 | {C, A, B, E}
400 | 20/10/99 | {B, A, D}

Remember: conf(X → Y) = sup(X ∪ Y) / sup(X)

What are the support and confidence of the rule {B, D} → {A}?

• Support:
  • percentage of tuples that contain {A, B, D} = 3/4 = 75%
• Confidence:
  • (number of tuples that contain {A, B, D}) / (number of tuples that contain {B, D}) = 3/3 = 100%
Association Rule Mining Task

Given a set of transactions T, the goal of association rule mining is to find all rules having
• support ≥ minsup threshold
• confidence ≥ minconf threshold

Brute-force approach:
• List all possible association rules
• Compute the support and confidence for each rule
• Prune rules that fail the minsup and minconf thresholds
• Computationally prohibitive!
Mining Association Rules

TID | Items
1   | Bread, Milk
2   | Bread, Diaper, Beer, Eggs
3   | Milk, Diaper, Beer, Coke
4   | Bread, Milk, Diaper, Beer
5   | Bread, Milk, Diaper, Coke

Example rules:
  {Milk, Diaper} → {Beer}   (s=0.4, c=0.67)
  {Milk, Beer} → {Diaper}   (s=0.4, c=1.0)
  {Diaper, Beer} → {Milk}   (s=0.4, c=0.67)
  {Beer} → {Milk, Diaper}   (s=0.4, c=0.67)
  {Diaper} → {Milk, Beer}   (s=0.4, c=0.5)
  {Milk} → {Diaper, Beer}   (s=0.4, c=0.5)

Observations:
• All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but can have different confidence
• Thus, we may decouple the support and confidence requirements
Mining Association Rules

Two-step approach:
1. Frequent Itemset Generation
   • Generate all itemsets whose support ≥ minsup
2. Rule Generation
   • Generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset

Frequent itemset generation is still computationally expensive.
(Figure: the itemset lattice over items A, B, C, D, E, from the null set at the top through all 1-, 2-, 3-, and 4-itemsets down to ABCDE.)

Given d items, there are 2^d possible candidate itemsets.
Frequent Itemset Generation

Brute-force approach:
• Each itemset in the lattice is a candidate frequent itemset
• Count the support of each candidate by scanning the database: match each of the N transactions (of average width w) against each of the M candidates
• Complexity ~ O(NMw), which is expensive since M = 2^d!
Computational Complexity

Given d unique items:
• Total number of itemsets = 2^d
• Total number of possible association rules:

  R = sum_{k=1}^{d-1} [ C(d, k) * sum_{j=1}^{d-k} C(d-k, j) ] = 3^d - 2^(d+1) + 1

• If d = 6, R = 602 rules
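A quick numerical check of the rule-count formula (a small sketch; the function name is illustrative only):

  from math import comb

  def rule_count(d):
      """R = sum over k of C(d, k) * sum over j of C(d - k, j)."""
      return sum(comb(d, k) * sum(comb(d - k, j) for j in range(1, d - k + 1))
                 for k in range(1, d))

  print(rule_count(6))          # 602
  print(3**6 - 2**7 + 1)        # 602, the closed form 3^d - 2^(d+1) + 1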


Frequent Itemset Generation Strategies

• Reduce the number of candidates (M)
  • Complete search: M = 2^d
  • Use pruning techniques to reduce M
• Reduce the number of transactions (N)
  • Reduce the size of N as the size of the itemset increases
  • Used by DHP and vertical-based mining algorithms
• Reduce the number of comparisons (NM)
  • Use efficient data structures to store the candidates or transactions
  • No need to match every candidate against every transaction
Chapter 6: Mining Frequent Patterns, Association and Correlations

• Basic Concepts
• Frequent Itemset Mining Methods
• Which Patterns Are Interesting? — Pattern Evaluation Methods
• Summary

Reducing Number of Candidates

Apriori principle:
• If an itemset is frequent, then all of its subsets must also be frequent

The Apriori principle holds due to the following property of the support measure:

  ∀ X, Y : (X ⊆ Y) ⇒ s(X) ≥ s(Y)

• Support of an itemset never exceeds the support of its subsets
• This is known as the anti-monotone property of support
Example

TID | Items
1   | Bread, Milk
2   | Bread, Diaper, Beer, Eggs
3   | Milk, Diaper, Beer, Coke
4   | Bread, Milk, Diaper, Beer
5   | Bread, Milk, Diaper, Coke

  s(Bread) > s(Bread, Beer)
  s(Milk) > s(Bread, Milk)
  s(Diaper, Beer) > s(Diaper, Beer, Coke)
Illustrating Apriori Principle

(Figure: the itemset lattice over A, B, C, D, E. Once an itemset such as AB is found to be infrequent, all of its supersets are pruned from the search space.)
Items (1-itemsets):
  Bread: 4, Coke: 2, Milk: 4, Beer: 3, Diaper: 4, Eggs: 1

Minimum Support = 3, so there is no need to generate candidates involving Coke or Eggs.

Pairs (2-itemsets):
  {Bread, Milk}: 3, {Bread, Beer}: 2, {Bread, Diaper}: 3,
  {Milk, Beer}: 2, {Milk, Diaper}: 3, {Beer, Diaper}: 3

Triplets (3-itemsets):
  {Bread, Milk, Diaper}: 3

If every subset is considered: 6C1 + 6C2 + 6C3 = 41 candidates.
With support-based pruning: 6 + 6 + 1 = 13 candidates.
The Apriori Algorithm (the general idea)

1. Find frequent 1-itemsets and put them into Lk (k = 1)
2. Use Lk to generate a collection Ck+1 of candidate itemsets of size (k + 1)
3. Scan the database to find which itemsets in Ck+1 are frequent and put them into Lk+1
4. If Lk+1 is not empty:
   • k = k + 1
   • GOTO 2

R. Agrawal, R. Srikant: "Fast Algorithms for Mining Association Rules", Proc. of the 20th Int'l Conference on Very Large Databases, Santiago, Chile, Sept. 1994.
The Apriori Algorithm

Pseudo-code:
  Ck: candidate itemsets of size k
  Lk: frequent itemsets of size k

  L1 = {frequent items};
  for (k = 1; Lk != ∅; k++) do begin
      Ck+1 = candidates generated from Lk;   // join and prune steps
      for each transaction t in database do
          increment the count of all candidates in Ck+1 that are contained in t
      Lk+1 = candidates in Ck+1 with min_support (frequent)
  end
  return ∪k Lk;

Important steps in candidate generation:
• Join step: Ck+1 is generated by joining Lk with itself
• Prune step: any k-itemset that is not frequent cannot be a subset of a frequent (k+1)-itemset
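A compact Python sketch of the same loop, under the assumption that counting is done with a plain subset test rather than the hash tree discussed later; all names are mine:

  from itertools import combinations

  def apriori(transactions, min_sup):
      """Return {frozenset(itemset): support count} for all frequent itemsets."""
      transactions = [frozenset(t) for t in transactions]
      items = sorted({i for t in transactions for i in t})

      def count(cands):
          return {c: sum(1 for t in transactions if c <= t) for c in cands}

      # L1: frequent 1-itemsets
      L = {c: s for c, s in count([frozenset([i]) for i in items]).items()
           if s >= min_sup}
      frequent = dict(L)
      k = 1
      while L:
          # Join step: merge frequent k-itemsets that differ in a single item.
          cands = {a | b for a in L for b in L if len(a | b) == k + 1}
          # Prune step: drop candidates that have an infrequent k-subset.
          cands = {c for c in cands
                   if all(frozenset(s) in L for s in combinations(c, k))}
          # Scan the database to count the surviving candidates.
          L = {c: s for c, s in count(cands).items() if s >= min_sup}
          frequent.update(L)
          k += 1
      return frequent

  # Example: the four-transaction database of the next slide, min_sup = 2.
  db = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
  print(apriori(db, 2))   # {1},{2},{3},{5},{1,3},{2,3},{2,5},{3,5},{2,3,5}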
The Apriori Algorithm — Example  (min_sup = 2, i.e. 50%)

Database D:
TID | Items
100 | 1 3 4
200 | 2 3 5
300 | 1 2 3 5
400 | 2 5

Scan D -> C1: {1}:2, {2}:3, {3}:3, {4}:1, {5}:3
L1 (frequent): {1}:2, {2}:3, {3}:3, {5}:3

C2 (from L1 join L1): {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}
Scan D -> counts: {1 2}:1, {1 3}:2, {1 5}:1, {2 3}:2, {2 5}:3, {3 5}:2
L2 (frequent): {1 3}:2, {2 3}:2, {2 5}:3, {3 5}:2

C3 (from L2 join L2): {2 3 5}
Scan D -> L3: {2 3 5}:2
How to Generate Candidates?

Suppose the items in Lk are listed in an order.

Step 1: self-joining Lk (in SQL)
  insert into Ck+1
  select p.item1, p.item2, ..., p.itemk, q.itemk
  from Lk p, Lk q
  where p.item1 = q.item1 and ... and p.itemk-1 = q.itemk-1 and p.itemk < q.itemk

Step 2: pruning
  forall itemsets c in Ck+1 do
      forall k-subsets s of c do
          if (s is not in Lk) then delete c from Ck+1
Example of Candidates Generation

L3 = {abc, abd, acd, ace, bcd}

Self-joining: L3 * L3
• abcd from abc and abd
• acde from acd and ace

Pruning:
• acde is removed because its subset ade is not in L3

C4 = {abcd}
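The join and prune steps can be reproduced in a few lines (a sketch with my own names; itemsets are kept as lexicographically sorted tuples, matching the ordering assumption above):

  from itertools import combinations

  def generate_candidates(L_k):
      """Join frequent k-itemsets sharing the first k-1 items, then prune."""
      k = len(L_k[0])
      joined = {tuple(sorted(set(a) | set(b)))
                for a in L_k for b in L_k
                if a[:k - 1] == b[:k - 1] and a[k - 1] < b[k - 1]}
      # Prune: every k-subset of a surviving candidate must be in L_k.
      return [c for c in joined
              if all(s in set(L_k) for s in combinations(c, k))]

  L3 = [('a','b','c'), ('a','b','d'), ('a','c','d'), ('a','c','e'), ('b','c','d')]
  print(generate_candidates(L3))   # [('a','b','c','d')]; acde is pruned (ade not in L3)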
How to Count Supports of Candidates?

Why is counting supports of candidates a problem?
• The total number of candidates can be huge
• One transaction may contain many candidates

Method:
• Candidate itemsets are stored in a hash-tree
• A leaf node of the hash-tree contains a list of itemsets and counts
• An interior node contains a hash table
• Subset function: finds all the candidates contained in a transaction
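A full hash-tree is too long for a short example, but the goal of the subset function (look up only the candidates that actually occur in a transaction) can be sketched with an ordinary hash table keyed by candidate itemsets. This is an illustrative simplification, not the hash-tree of the next slides:

  from itertools import combinations

  def count_supports(transactions, candidates, k):
      """candidates: iterable of frozensets, each of size k."""
      counts = {c: 0 for c in candidates}
      for t in transactions:
          # Enumerate only the k-subsets of t and look each one up,
          # instead of testing every candidate against every transaction.
          for sub in combinations(sorted(t), k):
              key = frozenset(sub)
              if key in counts:
                  counts[key] += 1
      return counts

  db = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
  C2 = [frozenset(p) for p in [(1,2), (1,3), (1,5), (2,3), (2,5), (3,5)]]
  print(count_supports(db, C2, 2))   # {1,3}:2, {2,3}:2, {2,5}:3, {3,5}:2, ...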
Example of the hash-tree for C3

(Figure: a hash-tree for the candidate 3-itemsets, built with the hash function h(item) = item mod 3. Interior nodes hash on the 1st, 2nd, and 3rd item in turn, with branches for items 1,4,..., items 2,5,..., and items 3,6,...; leaves hold candidate itemsets such as 145, 124, 125, 159, 457, 458, 234, 567, 345, 356, 367, 689, 368.)
Example of the hash-tree for C3 (cont'd)

(Figure: finding the candidates contained in transaction {1, 2, 3, 4, 5}. At the root, hash on the first item: the 1,4,... branch is searched for candidates starting with 1, the 2,5,... branch for those starting with 2, and the 3,6,... branch for those starting with 3.)
Example of the hash-tree for C3 (cont'd)

(Figure: continuing the search along the 1-branch, the second level hashes on the second item: sub-branches for candidates starting with 12, 13, and 14 are followed; the 13-sub-branch is empty (null) in this tree.)
AprioriTid: Use D only for the first pass

• The database is not used after the 1st pass.
• Instead, the set Ck' is used for each step, where Ck' = <TID, {Xk}>: each Xk is a potentially frequent itemset in the transaction with id = TID.
• At each step Ck' is generated from Ck-1' during the pruning step of constructing Ck, and is used to compute Lk.
• For small values of k, Ck' could be larger than the database!
AprioriTid Example (min_sup = 2)

Database D:
TID | Items
100 | 1 3 4
200 | 2 3 5
300 | 1 2 3 5
400 | 2 5

C1':
TID | Sets of itemsets
100 | {{1},{3},{4}}
200 | {{2},{3},{5}}
300 | {{1},{2},{3},{5}}
400 | {{2},{5}}

L1: {1}:2, {2}:3, {3}:3, {5}:3

C2: {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}

C2':
TID | Sets of itemsets
100 | {{1 3}}
200 | {{2 3},{2 5},{3 5}}
300 | {{1 2},{1 3},{1 5},{2 3},{2 5},{3 5}}
400 | {{2 5}}

L2: {1 3}:2, {2 3}:2, {2 5}:3, {3 5}:2

C3: {2 3 5}

C3':
TID | Sets of itemsets
200 | {{2 3 5}}
300 | {{2 3 5}}

L3: {2 3 5}:2
Methods to Improve Apriori's Efficiency

• Hash-based itemset counting: a k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent
• Transaction reduction: a transaction that does not contain any frequent k-itemset is useless in subsequent scans
• Partitioning: any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
• Sampling: mining on a subset of the given data, with a lower support threshold plus a method to determine the completeness
• Dynamic itemset counting: add new candidate itemsets only when all of their subsets are estimated to be frequent
Maximal Frequent Itemset

An itemset is maximal frequent if none of its immediate supersets is frequent.

(Figure: the itemset lattice over A, B, C, D, E with a border separating frequent from infrequent itemsets; the maximal frequent itemsets are the frequent itemsets lying directly on that border.)
Closed Itemset

An itemset is closed if none of its immediate supersets has the same support as the itemset.

TID | Items
1   | {A,B}
2   | {B,C,D}
3   | {A,B,C,D}
4   | {A,B,D}
5   | {A,B,C,D}

Itemset supports:
  {A}: 4   {B}: 5   {C}: 3   {D}: 4
  {A,B}: 4   {A,C}: 2   {A,D}: 3   {B,C}: 3   {B,D}: 4   {C,D}: 3
  {A,B,C}: 2   {A,B,D}: 3   {A,C,D}: 2   {B,C,D}: 3
  {A,B,C,D}: 2
(Figure: the itemset lattice over A, B, C, D, E annotated with the TIDs of the supporting transactions for the database below; itemsets supported by no transaction, such as ABCDE, are marked.)

TID | Items
1   | A, B, C
2   | A, B, C, D
3   | B, C, E
4   | A, C, D, E
5   | D, E
Minimum support = 2

(Figure: the same annotated lattice, marking which frequent itemsets are closed but not maximal and which are both closed and maximal. For this database there are 9 closed frequent itemsets and 4 maximal frequent itemsets.)
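Both counts can be verified by brute force on this small example (a sketch; enumerating every itemset is only feasible because there are just five items):

  from itertools import combinations

  # The lattice example above: five transactions over items A-E, min support = 2.
  db = [{'A','B','C'}, {'A','B','C','D'}, {'B','C','E'}, {'A','C','D','E'}, {'D','E'}]
  min_sup = 2
  items = sorted(set().union(*db))

  def support(itemset):
      return sum(1 for t in db if itemset <= t)

  frequent = {}
  for k in range(1, len(items) + 1):
      for combo in combinations(items, k):
          s = frozenset(combo)
          sup = support(s)
          if sup >= min_sup:
              frequent[s] = sup

  # Closed: no proper superset has the same support.
  closed = [s for s in frequent
            if not any(s < t and frequent[t] == frequent[s] for t in frequent)]
  # Maximal: no proper superset is frequent at all.
  maximal = [s for s in frequent if not any(s < t for t in frequent)]
  print(len(closed), len(maximal))   # 9 closed, 4 maximal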
Maximal vs Closed Itemsets

(Venn diagram: maximal frequent itemsets ⊆ closed frequent itemsets ⊆ frequent itemsets.)
Factors Affecting Complexity

• Choice of minimum support threshold
  • lowering the support threshold results in more frequent itemsets
  • this may increase the number of candidates and the max length of frequent itemsets
• Dimensionality (number of items) of the data set
  • more space is needed to store the support count of each item
  • if the number of frequent items also increases, both computation and I/O costs may increase
• Size of database
  • since Apriori makes multiple passes, the run time of the algorithm may increase with the number of transactions
• Average transaction width
  • transaction width increases with denser data sets
  • this may increase the max length of frequent itemsets and traversals of the hash tree (the number of subsets in a transaction increases with its width)
Rule Generation

Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f → L − f satisfies the minimum confidence requirement.

• If {A,B,C,D} is a frequent itemset, the candidate rules are:
  ABC→D, ABD→C, ACD→B, BCD→A,
  A→BCD, B→ACD, C→ABD, D→ABC,
  AB→CD, AC→BD, AD→BC, BC→AD, BD→AC, CD→AB

• If |L| = k, then there are 2^k − 2 candidate association rules (ignoring L → ∅ and ∅ → L)
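A direct enumeration of these candidate rules, filtered by confidence, might look as follows (a sketch reusing the market-basket transactions from earlier; all names are mine):

  from itertools import combinations

  transactions = [
      {"Bread", "Milk"},
      {"Bread", "Diaper", "Beer", "Eggs"},
      {"Milk", "Diaper", "Beer", "Coke"},
      {"Bread", "Milk", "Diaper", "Beer"},
      {"Bread", "Milk", "Diaper", "Coke"},
  ]

  def support_count(itemset):
      return sum(1 for t in transactions if itemset <= t)

  def rules_from_itemset(L, minconf):
      """Enumerate all rules f -> L - f with confidence >= minconf."""
      L = frozenset(L)
      for r in range(1, len(L)):                  # all non-empty proper subsets
          for f in combinations(L, r):
              f = frozenset(f)
              conf = support_count(L) / support_count(f)
              if conf >= minconf:
                  yield set(f), set(L - f), conf

  for lhs, rhs, conf in rules_from_itemset({"Milk", "Diaper", "Beer"}, 0.6):
      print(lhs, "->", rhs, round(conf, 2))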
Rule Generation

How to efficiently generate rules from frequent itemsets?
• In general, confidence does not have an anti-monotone property:
  c(ABC → D) can be larger or smaller than c(AB → D)
• But the confidence of rules generated from the same itemset has an anti-monotone property
  • e.g., for L = {A,B,C,D}:  c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)
  • Confidence is anti-monotone w.r.t. the number of items on the RHS of the rule
Lattice of rules

(Figure: the lattice of rules generated from the frequent itemset {A,B,C,D}, from ABCD => {} at the top down to rules with a single item on the left-hand side. If a rule is found to have low confidence, all rules obtained from it by moving further items from the left-hand side to the right-hand side are pruned.)
Rule Generation for Apriori Algorithm

• A candidate rule is generated by merging two rules that share the same prefix in the rule consequent
  • e.g., join(CD => AB, BD => AC) would produce the candidate rule D => ABC
• Prune the rule D => ABC if its subset rule AD => BC does not have high confidence
Is Apriori Fast Enough? — Performance Bottlenecks

The core of the Apriori algorithm:
• Use frequent (k − 1)-itemsets to generate candidate frequent k-itemsets
• Use database scan and pattern matching to collect counts for the candidate itemsets

The bottleneck of Apriori: candidate generation
• Huge candidate sets:
  • 10^4 frequent 1-itemsets will generate 10^7 candidate 2-itemsets
  • To discover a frequent pattern of size 100, e.g., {a1, a2, ..., a100}, one needs to generate 2^100 ≈ 10^30 candidates
• Multiple scans of the database:
  • Needs (n + 1) scans, where n is the length of the longest pattern
FP-growth: Mining Frequent Patterns Without Candidate Generation

Compress a large database into a compact Frequent-Pattern tree (FP-tree) structure
• highly condensed, but complete for frequent pattern mining
• avoids costly database scans

Develop an efficient, FP-tree-based frequent pattern mining method
• A divide-and-conquer methodology: decompose mining tasks into smaller ones
• Avoid candidate generation: sub-database test only!
FP-tree Construction from a Transactional DB  (min_support = 3)

TID | Items bought              | (Ordered) frequent items
100 | {f, a, c, d, g, i, m, p}  | {f, c, a, m, p}
200 | {a, b, c, f, l, m, o}     | {f, c, a, b, m}
300 | {b, f, h, j, o, w}        | {f, b}
400 | {b, c, k, s, p}           | {c, b, p}
500 | {a, f, c, e, l, p, m, n}  | {f, c, a, m, p}

Frequent items: f:4, c:4, a:3, b:3, m:3, p:3

Steps:
1. Scan DB once, find frequent 1-itemsets (single-item patterns)
2. Order frequent items in descending order of their frequency
3. Scan DB again, construct the FP-tree
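Steps 1 and 2 are a simple counting and ordering pass; a small sketch of how each transaction ends up in the ordered form shown above (names are mine, and ties between equally frequent items are broken as on the slide):

  from collections import Counter

  db = [
      ['f', 'a', 'c', 'd', 'g', 'i', 'm', 'p'],
      ['a', 'b', 'c', 'f', 'l', 'm', 'o'],
      ['b', 'f', 'h', 'j', 'o', 'w'],
      ['b', 'c', 'k', 's', 'p'],
      ['a', 'f', 'c', 'e', 'l', 'p', 'm', 'n'],
  ]
  min_support = 3

  # Step 1: count items; the frequent ones are f:4, c:4, a:3, b:3, m:3, p:3.
  counts = Counter(item for t in db for item in t)

  # Step 2: global item order, most frequent first (ties broken as on the slide).
  order = [i for i in ['f', 'c', 'a', 'b', 'm', 'p'] if counts[i] >= min_support]

  # Step 3 input: project each transaction onto the frequent items, in that order.
  def ordered_frequent(t):
      return [i for i in order if i in t]

  for t in db:
      print(ordered_frequent(t))
  # ['f','c','a','m','p'], ['f','c','a','b','m'], ['f','b'], ['c','b','p'], ['f','c','a','m','p']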
FP-tree Construction (after the first transaction)

(Figure: after inserting {f, c, a, m, p}, the tree is a single path root -> f:1 -> c:1 -> a:1 -> m:1 -> p:1.)
FP-tree Construction (after the second transaction)

(Figure: inserting {f, c, a, b, m} shares the f-c-a prefix, so the tree becomes root -> f:2 -> c:2 -> a:2, which then branches into m:1 -> p:1 and b:1 -> m:1.)
FP-tree Construction (after the third and fourth transactions)

(Figure: inserting {f, b} increments f to 3 and adds a b:1 child under f; inserting {c, b, p} starts a new branch under the root, c:1 -> b:1 -> p:1.)
FP-tree Construction (final tree, min_support = 3)

Header table (item : frequency, each entry heading a chain of node-links):
  f:4, c:4, a:3, b:3, m:3, p:3

(Figure: the completed FP-tree. Left branch: root -> f:4 -> c:3 -> a:3, with a:3 branching into m:2 -> p:2 and b:1 -> m:1, and f:4 also having a b:1 child. Right branch: root -> c:1 -> b:1 -> p:1. Each header-table entry links to all nodes carrying that item.)
Benefits of the FP-tree Structure

Completeness:
• never breaks a long pattern of any transaction
• preserves complete information for frequent pattern mining

Compactness:
• reduces irrelevant information: infrequent items are gone
• frequency-descending ordering: more frequent items are more likely to be shared
• never larger than the original database (not counting node-links and counts)
• Example: for the Connect-4 DB, the compression ratio can be over 100
Mining Frequent Patterns Using the FP-tree

General idea (divide-and-conquer):
• Recursively grow frequent patterns using the FP-tree

Method:
• For each item, construct its conditional pattern-base, and then its conditional FP-tree
• Repeat the process on each newly created conditional FP-tree
• Until the resulting FP-tree is empty, or it contains only one path (a single path generates all the combinations of its sub-paths, each of which is a frequent pattern)
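A self-contained Python sketch of this divide-and-conquer idea. For brevity it recurses directly on conditional pattern bases (lists of weighted prefix paths) instead of building pointer-based conditional FP-trees, but on the running example it produces exactly the patterns derived on the following slides; all names are mine:

  from collections import defaultdict

  def fp_growth(weighted_db, min_sup, suffix=()):
      """weighted_db: list of (items, count) pairs. Returns {pattern: support}."""
      counts = defaultdict(int)
      for items, cnt in weighted_db:
          for i in set(items):
              counts[i] += cnt
      freq = {i: c for i, c in counts.items() if c >= min_sup}

      # Every frequent item extends the current suffix into a frequent pattern.
      results = {tuple(sorted(suffix + (i,))): c for i, c in freq.items()}

      # Conditional pattern bases: for each frequent item, collect the prefix
      # (in frequency-descending order) of every transaction containing it.
      rank = {i: r for r, i in enumerate(sorted(freq, key=lambda i: (-freq[i], i)))}
      cond_bases = defaultdict(list)
      for items, cnt in weighted_db:
          path = sorted((i for i in set(items) if i in freq), key=rank.get)
          for pos, item in enumerate(path):
              if path[:pos]:
                  cond_bases[item].append((path[:pos], cnt))

      # Divide and conquer: mine each conditional pattern base recursively.
      for item, base in cond_bases.items():
          results.update(fp_growth(base, min_sup, suffix + (item,)))
      return results

  db = [({'f','a','c','d','g','i','m','p'}, 1), ({'a','b','c','f','l','m','o'}, 1),
        ({'b','f','h','j','o','w'}, 1), ({'b','c','k','s','p'}, 1),
        ({'a','f','c','e','l','p','m','n'}, 1)]
  patterns = fp_growth(db, min_sup=3)
  print(patterns[('c', 'p')], patterns[('a', 'c', 'f', 'm')])   # 3 and 3 (cp and fcam)
  print(len(patterns))                                          # 18 frequent patterns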
Mining Frequent Patterns Using the FP-tree (cont'd)

• Start with the last item in the order (i.e., p).
• Follow its node-links and traverse only the paths containing p.
• Accumulate all transformed prefix paths of that item to form its conditional pattern base.

Conditional pattern base for p: fcam:2, cb:1

Construct a new (conditional) FP-tree from this pattern base by merging all paths and keeping only nodes that appear at least min_support times. This leaves only one branch, c:3.

Thus we derive only one additional frequent pattern containing p: the pattern cp.
Mining Frequent Patterns Using the FP-tree (cont'd)

• Move to the next least frequent item in the order, i.e., m.
• Follow its node-links and traverse only the paths containing m.
• Accumulate all transformed prefix paths of that item to form its conditional pattern base.

m-conditional pattern base: fca:2, fcab:1

m-conditional FP-tree: contains only the single path f:3 -> c:3 -> a:3 (b is dropped, since it appears only once in the base).

All frequent patterns that include m: m, fm, cm, am, fcm, fam, cam, fcam
(Figure: the global FP-tree shown together with the conditional pattern bases and conditional FP-trees derived from it for each item, labeled (1) through (6) for p, m, b, a, c, and f. The same information is summarized in the conditional pattern-base table later in this section.)
Recursive projection view of the same process

Ordered DB:
1: f, c, a, m, p
2: f, c, a, b, m
3: f, b
4: c, b, p
5: f, c, a, m, p

Project on each item (for every transaction containing the item, keep only the items that precede it in the frequency order):

p-projected DB:  1: f, c, a, m    4: c, b    5: f, c, a, m
m-projected DB:  1: f, c, a      2: f, c, a, b    5: f, c, a
b-projected DB:  2: f, c, a      3: f      4: c
a-projected DB:  1: f, c         2: f, c   5: f, c
c-projected DB:  1: f            2: f      5: f    (4 contributes nothing)
f: occurs in transactions 1, 2, 3, 5

Each projected DB is mined recursively in the same way.
Frequent patterns obtained from each projection (min_sup = 3):

From the p-projected DB:  p: 3, cp: 3
From the m-projected DB:  m: 3, fm: 3, cm: 3, am: 3, fcm: 3, fam: 3, cam: 3, fcam: 3
From the b-projected DB:  b: 3
From the a-projected DB:  a: 3, fa: 3, ca: 3, fca: 3
From the c-projected DB:  c: 4, fc: 3
From f:                   f: 4
Properties of FP-tree for Conditional Pattern Base Construction

• Node-link property
  • For any frequent item ai, all the possible frequent patterns that contain ai can be obtained by following ai's node-links, starting from ai's head in the FP-tree header
• Prefix path property
  • To calculate the frequent patterns for a node ai in a path P, only the prefix sub-path of ai in P needs to be accumulated, and its frequency count should carry the same count as node ai
Conditional Pattern-Bases for the Example

Item   Conditional pattern-base        Conditional FP-tree
p      {(fcam:2), (cb:1)}              {(c:3)}|p
m      {(fca:2), (fcab:1)}             {(f:3, c:3, a:3)}|m
b      {(fca:1), (f:1), (c:1)}         Empty
a      {(fc:3)}                        {(f:3, c:3)}|a
c      {(f:3)}                         {(f:3)}|c
f      Empty                           Empty
Why Is Frequent Pattern Growth Fast?

Performance studies show:
• FP-growth is an order of magnitude faster than Apriori, and is also faster than tree-projection

Reasoning:
• No candidate generation, no candidate test
• Uses a compact data structure
• Eliminates repeated database scans
• The basic operations are counting and FP-tree building
FP-growth vs. Apriori: Scalability With the Support Threshold

(Figure: run time in seconds (0 to 100) vs. support threshold (0% to 3%) on data set T25I20D10K. Apriori's run time grows rapidly as the support threshold decreases, while FP-growth's run time remains low.)
Chapter 6: Mining Frequent Patterns, Association and Correlations

• Basic Concepts
• Frequent Itemset Mining Methods
• Which Patterns Are Interesting? — Pattern Evaluation Methods
• Summary

Interestingness Measurements

Objective measures
• Two popular measurements: support and confidence

Subjective measures
• A rule (pattern) is interesting if
  • it is unexpected (surprising to the user); and/or
  • actionable (the user can do something with it)
Computing Interestingness Measure

Given a rule X → Y, the information needed to compute rule interestingness can be obtained from a contingency table.

Contingency table for X → Y:

            Y      ¬Y
  X        f11    f10   | f1+
  ¬X       f01    f00   | f0+
           f+1    f+0   | |T|

  f11: support of X and Y        f10: support of X and ¬Y
  f01: support of ¬X and Y       f00: support of ¬X and ¬Y

Used to define various measures: support, confidence, lift, Gini, J-measure, etc.
Example contingency table:

            Coffee   ¬Coffee
  Tea         15        5     |  20
  ¬Tea        75        5     |  80
              90       10     | 100
Statistical Independence

• Population of 1000 students
  • 600 students know how to swim (S)
  • 700 students know how to bike (B)
  • 420 students know how to swim and bike (S, B)

• P(S ∧ B) = 420/1000 = 0.42
• P(S) × P(B) = 0.6 × 0.7 = 0.42

• P(S ∧ B) = P(S) × P(B)  =>  statistical independence
• P(S ∧ B) > P(S) × P(B)  =>  positively correlated
• P(S ∧ B) < P(S) × P(B)  =>  negatively correlated
Statistical-based Measures

Measures that take into account statistical dependence:

  Lift = P(Y|X) / P(Y)

  Interest = P(X,Y) / ( P(X) P(Y) )

  PS = P(X,Y) - P(X) P(Y)

  φ-coefficient = ( P(X,Y) - P(X) P(Y) ) / sqrt( P(X)[1 - P(X)] P(Y)[1 - P(Y)] )
            Coffee   ¬Coffee
  Tea         15        5     |  20
  ¬Tea        75        5     |  80
              90       10     | 100

Association rule: Tea → Coffee

• Confidence = P(Coffee | Tea) = 15/20 = 0.75
• but P(Coffee) = 0.9
• Lift = 0.75 / 0.9 = 0.8333 (< 1, therefore Tea and Coffee are negatively associated)
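These measures are straightforward to compute from a 2x2 contingency table; a small sketch using the Tea/Coffee numbers (function and variable names are mine):

  from math import sqrt

  def measures(f11, f10, f01, f00):
      """Lift, interest, PS and phi-coefficient from a 2x2 contingency table."""
      n = f11 + f10 + f01 + f00
      p_xy = f11 / n
      p_x  = (f11 + f10) / n
      p_y  = (f11 + f01) / n
      lift     = (p_xy / p_x) / p_y                 # P(Y|X) / P(Y)
      interest = p_xy / (p_x * p_y)                 # equal to lift for a 2x2 table
      ps       = p_xy - p_x * p_y
      phi      = ps / sqrt(p_x * (1 - p_x) * p_y * (1 - p_y))
      return lift, interest, ps, phi

  # Tea -> Coffee contingency table: f11=15, f10=5, f01=75, f00=5
  print(measures(15, 5, 75, 5))   # lift ≈ 0.833 (< 1: negatively associated)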
Are Lift and χ² Good Measures of Correlation?

• "Buy walnuts ⇒ buy milk [1%, 80%]" is misleading if 85% of customers buy milk
• Support and confidence are not good indicators of correlation
• Over 20 interestingness measures have been proposed (see Tan, Kumar, Srivastava @KDD'02)
• Which are the good ones?

• There are lots of measures proposed in the literature
• Some measures are good for certain applications, but not for others
• What criteria should we use to determine whether a measure is good or bad?
• What about Apriori-style support-based pruning? How does it affect these measures?
Example: φ-Coefficient

The φ-coefficient is analogous to the correlation coefficient for continuous variables.

            Y     ¬Y                          Y     ¬Y
  X        60     10   |  70        X        20     10   |  30
  ¬X       10     20   |  30        ¬X       10     60   |  70
           70     30   | 100                 30     70   | 100

  φ = (0.6 - 0.7 × 0.7) / sqrt(0.7 × 0.3 × 0.7 × 0.3) = 0.5238

  φ = (0.2 - 0.3 × 0.3) / sqrt(0.7 × 0.3 × 0.7 × 0.3) = 0.5238

The φ-coefficient is the same for both tables.
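A quick standalone check that both tables give the same φ (same caveats as the earlier sketches):

  from math import sqrt

  def phi(f11, f10, f01, f00):
      n = f11 + f10 + f01 + f00
      p_xy, p_x, p_y = f11 / n, (f11 + f10) / n, (f11 + f01) / n
      return (p_xy - p_x * p_y) / sqrt(p_x * (1 - p_x) * p_y * (1 - p_y))

  print(round(phi(60, 10, 10, 20), 4))   # 0.5238
  print(round(phi(20, 10, 10, 60), 4))   # 0.5238, the same value for both tables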
Null-Invariant Measures

(The slide's table of null-invariant measures is not reproduced in this extract.)
Comparison of Interestingness Measures

• Null-(transaction) invariance is crucial for correlation analysis
• Lift and χ² are not null-invariant
• There are 5 null-invariant measures; they are subtle and can disagree

              Milk     No Milk    Sum (row)
  Coffee      m, c     ~m, c      c
  No Coffee   m, ~c    ~m, ~c     ~c
  Sum (col.)  m        ~m         Σ

Null-transactions w.r.t. m and c are those containing neither milk nor coffee. The Kulczynski measure (1927) is null-invariant.
Analysis of DBLP Coauthor Relationships

Recent DB conferences, after removing balanced associations, low support, etc.

• Advisor-advisee relation: Kulc is high, coherence is low, cosine is in the middle
• Tianyi Wu, Yuguo Chen and Jiawei Han, "Association Mining in Large Databases: A Re-Examination of Its Measures", Proc. 2007 Int. Conf. Principles and Practice of Knowledge Discovery in Databases (PKDD'07), Sept. 2007
Which Null-Invariant Measure Is Better?

• IR (Imbalance Ratio): measures the imbalance of the two itemsets A and B in rule implications
• Kulczynski and the Imbalance Ratio (IR) together present a clear picture for all three datasets D4 through D6
  • D4 is balanced and neutral
  • D5 is imbalanced and neutral
  • D6 is very imbalanced and neutral
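For reference, Kulczynski and IR can be computed from a contingency table as follows. The formulas are the ones given in Han et al.'s textbook (Kulczynski averages the two conditional probabilities; IR measures how lopsided the two directions of the rule are), so treat this as a sketch rather than part of the slides:

  def kulczynski(f11, f10, f01):
      """Kulc(A, B) = 0.5 * (P(A|B) + P(B|A))."""
      return 0.5 * (f11 / (f11 + f01) + f11 / (f11 + f10))

  def imbalance_ratio(f11, f10, f01):
      """IR(A, B) = |sup(A) - sup(B)| / (sup(A) + sup(B) - sup(A and B))."""
      sup_a, sup_b = f11 + f10, f11 + f01
      return abs(sup_a - sup_b) / (sup_a + sup_b - f11)

  # Tea/Coffee table from earlier: f11=15, f10=5 (tea only), f01=75 (coffee only)
  print(kulczynski(15, 5, 75), imbalance_ratio(15, 5, 75))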
Chapter 6: Mining Frequent Patterns, Association and Correlations

• Basic Concepts
• Frequent Itemset Mining Methods
• Which Patterns Are Interesting? — Pattern Evaluation Methods
• Summary

Summary

• Basic concepts: association rules, the support-confidence framework, closed and max-patterns
• Scalable frequent pattern mining methods
  • Apriori (candidate generation and test)
  • Projection-based (FP-growth, CLOSET+, ...)
• Which patterns are interesting?
  • Pattern evaluation methods

