Chapter 6: Mining Frequent Patterns, Associations and Correlations
Basic Concepts
Methods
Summary
Association Rule Mining
Given a set of transactions, find rules that will predict the
occurrence of an item based on the occurrences of other items
in the transaction
Market-basket transactions:

  TID  Items
  1    Bread, Milk
  2    Bread, Diaper, Beer, Eggs
  3    Milk, Diaper, Beer, Coke
  4    Bread, Milk, Diaper, Beer
  5    Bread, Milk, Diaper, Coke

Example of association rules:
  {Diaper} → {Beer}
  {Milk, Bread} → {Eggs, Coke}
  {Beer, Bread} → {Milk}

Implication means co-occurrence, not causality!
Definition: Frequent Itemset

Itemset
  A collection of one or more items
  Example: {Milk, Bread, Diaper}
k-itemset
  An itemset that contains k items
Support count (σ)
  Frequency of occurrence of an itemset
  E.g. σ({Milk, Bread, Diaper}) = 2
Support (s)
  Fraction of transactions that contain an itemset
  E.g. s({Milk, Bread, Diaper}) = 2/5
Frequent Itemset
  An itemset whose support is greater than or equal to a minsup threshold

(Itemsets are assumed to be ordered lexicographically.)

  TID  Items
  1    Bread, Milk
  2    Bread, Diaper, Beer, Eggs
  3    Milk, Diaper, Beer, Coke
  4    Bread, Milk, Diaper, Beer
  5    Bread, Milk, Diaper, Coke
Definition: Association Rule

An association rule is an implication of the form X → Y, where X and Y are itemsets. For example, for the rule {B, D} → {A}:

Support:
  percentage of tuples that contain {A, B, D} (here, 75%)

Confidence:
  (number of tuples that contain {A, B, D}) / (number of tuples that contain {B, D}) × 100%
Association Rule Mining Task
Given a set of transactions T, the goal of association rule
mining is to find all rules having
support ≥ minsup threshold
confidence ≥ minconf threshold
Brute-force approach:
List all possible association rules
Compute the support and confidence for each rule
Prune rules that fail the minsup and minconf thresholds
Computationally prohibitive!
Mining Association Rules

  TID  Items
  1    Bread, Milk
  2    Bread, Diaper, Beer, Eggs
  3    Milk, Diaper, Beer, Coke
  4    Bread, Milk, Diaper, Beer
  5    Bread, Milk, Diaper, Coke

Example of rules:
  {Milk, Diaper} → {Beer}    (s=0.4, c=0.67)
  {Milk, Beer} → {Diaper}    (s=0.4, c=1.0)
  {Diaper, Beer} → {Milk}    (s=0.4, c=0.67)
  {Beer} → {Milk, Diaper}    (s=0.4, c=0.67)
  {Diaper} → {Milk, Beer}    (s=0.4, c=0.5)
  {Milk} → {Diaper, Beer}    (s=0.4, c=0.5)

Observations:
• All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but can have different confidence
• Thus, we may decouple the support and confidence requirements
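To make these numbers concrete, here is a minimal sketch that computes support and confidence over the market-basket transactions above (the function names are my own); it reproduces, for example, s = 0.4 and c = 0.67 for {Milk, Diaper} → {Beer}:

    transactions = [
        {'Bread', 'Milk'},
        {'Bread', 'Diaper', 'Beer', 'Eggs'},
        {'Milk', 'Diaper', 'Beer', 'Coke'},
        {'Bread', 'Milk', 'Diaper', 'Beer'},
        {'Bread', 'Milk', 'Diaper', 'Coke'},
    ]

    def support_count(itemset):
        # sigma(X): number of transactions that contain every item of X
        return sum(1 for t in transactions if itemset <= t)

    def support(itemset):
        # s(X): fraction of transactions that contain X
        return support_count(itemset) / len(transactions)

    def confidence(antecedent, consequent):
        # c(X -> Y) = sigma(X union Y) / sigma(X)
        return support_count(antecedent | consequent) / support_count(antecedent)

    print(support({'Milk', 'Diaper', 'Beer'}))       # 0.4
    print(confidence({'Milk', 'Diaper'}, {'Beer'}))  # 0.666...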
Mining Association Rules
Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup
2. Rule Generation
– Generate high confidence rules from each frequent itemset,
where each rule is a binary partitioning of a frequent itemset
Brute-force frequent itemset generation:

[Figure: lattice of all candidate itemsets over the items A–E; the transaction database (N transactions of maximum width w) is matched against the list of M candidate itemsets.]

Match each transaction against every candidate
Complexity ~ O(NMw) => expensive, since M = 2^d !!!
Computational Complexity
Given d unique items:
Total number of itemsets = 2^d
Total number of possible association rules:

  R = \sum_{k=1}^{d-1} \left[ \binom{d}{k} \sum_{j=1}^{d-k} \binom{d-k}{j} \right] = 3^d - 2^{d+1} + 1

If d = 6, R = 602 rules.
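A quick brute-force check of this closed form (a small sketch; the helper names are my own):

    from itertools import combinations

    def count_rules_brute_force(d):
        # count rules X -> Y with X and Y non-empty, disjoint subsets of d items
        items = range(d)
        total = 0
        for k in range(1, d):                      # size of the antecedent X
            for lhs in combinations(items, k):
                rest = d - k                       # items left over for the consequent Y
                total += 2 ** rest - 1             # any non-empty subset of them
        return total

    def count_rules_closed_form(d):
        return 3 ** d - 2 ** (d + 1) + 1

    for d in range(2, 8):
        assert count_rules_brute_force(d) == count_rules_closed_form(d)
    print(count_rules_closed_form(6))              # 602 rules for d = 6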
Basic Concepts
Evaluation Methods
Summary
Reducing Number of Candidates
Apriori principle:
If an itemset is frequent, then all of its subsets must also be
frequent
  \forall X, Y : (X \subseteq Y) \Rightarrow s(X) \ge s(Y)
Support of an itemset never exceeds the support of its subsets
This is known as the anti-monotone property of support
Example

  TID  Items
  1    Bread, Milk
  2    Bread, Diaper, Beer, Eggs
  3    Milk, Diaper, Beer, Coke
  4    Bread, Milk, Diaper, Beer
  5    Bread, Milk, Diaper, Coke

  s(Bread) > s(Bread, Beer)
  s(Milk) > s(Bread, Milk)
  s(Diaper, Beer) > s(Diaper, Beer, Coke)
Illustrating Apriori Principle
[Figure: itemset lattice over the items A–E; once an itemset is found to be infrequent, all of its supersets are pruned from the lattice.]
Items (1-itemsets):
  Item    Count
  Bread   4
  Coke    2
  Milk    4
  Beer    3
  Diaper  4
  Eggs    1

Minimum Support = 3

Pairs (2-itemsets), with no need to generate candidates involving Coke or Eggs:
  Itemset          Count
  {Bread, Milk}    3
  {Bread, Beer}    2
  {Bread, Diaper}  3
  {Milk, Beer}     2
  {Milk, Diaper}   3
  {Beer, Diaper}   3

Triplets (3-itemsets):
  {Bread, Milk, Diaper}  3
The Apriori Algorithm (pseudocode)

L1 = {frequent 1-itemsets};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;   // join and prune steps
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with support ≥ min_support (frequent);
end
return ∪k Lk;
Important steps in candidate generation:
Join Step: Ck+1 is generated by joining Lk with itself
Prune Step: Any k-itemset that is not frequent cannot be a subset of a
frequent (k+1)-itemset
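A runnable sketch of this pseudocode in Python, using frozensets rather than ordered itemsets (the function name and structure are my own). Run on the 4-transaction database D of the next slide with min_sup = 2, it reproduces the L1, L2 and L3 shown there:

    from itertools import combinations

    def apriori(transactions, min_support):
        transactions = [frozenset(t) for t in transactions]
        # L1: frequent 1-itemsets
        items = {i for t in transactions for i in t}
        counts = {frozenset([i]): sum(1 for t in transactions if i in t) for i in items}
        L = {s: c for s, c in counts.items() if c >= min_support}
        frequent = dict(L)
        k = 1
        while L:
            # join step: pairs of frequent k-itemsets that share k-1 items
            candidates = {a | b for a in L for b in L if len(a | b) == k + 1}
            # prune step: every k-subset of a candidate must itself be frequent
            candidates = {c for c in candidates
                          if all(frozenset(s) in L for s in combinations(c, k))}
            # scan the database once to count the surviving candidates
            counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
            L = {s: c for s, c in counts.items() if c >= min_support}
            frequent.update(L)
            k += 1
        return frequent

    D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]     # the example database below
    for itemset, sup in sorted(apriori(D, 2).items(),
                               key=lambda kv: (len(kv[0]), sorted(kv[0]))):
        print(sorted(itemset), sup)    # [1] 2 ... [2, 5] 3 ... [2, 3, 5] 2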
The Apriori Algorithm — Example  (min_sup = 2, i.e. 50%)

Database D:
  TID  Items
  100  1 3 4
  200  2 3 5
  300  1 2 3 5
  400  2 5

Scan D → C1:             L1:
  itemset  sup           itemset  sup
  {1}      2             {1}      2
  {2}      3             {2}      3
  {3}      3             {3}      3
  {4}      1             {5}      3
  {5}      3

C2 (generated from L1): {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}

Scan D → C2 with counts: L2:
  itemset  sup           itemset  sup
  {1 2}    1             {1 3}    2
  {1 3}    2             {2 3}    2
  {1 5}    1             {2 5}    3
  {2 3}    2             {3 5}    2
  {2 5}    3
  {3 5}    2

C3: {2 3 5}    Scan D → L3:
  itemset  sup
  {2 3 5}  2
How to Generate Candidates?
Suppose the items in Lk are listed in an order
Step 1: self-joining Lk (IN SQL)
insert into Ck+1
select p.item1, p.item2, …, p.itemk, q.itemk
from Lk p, Lk q
where p.item1 = q.item1 and … and p.itemk-1 = q.itemk-1 and p.itemk < q.itemk
Step 2: pruning
forall itemsets c in Ck+1 do
forall k-subsets s of c do
if (s is not in Lk) then delete c from Ck+1
Example of Candidate Generation
C4 = {abcd}
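A sketch of the join and prune steps that mirrors the SQL above (the L3 in the usage example is illustrative and not taken from the slides, but it yields the C4 = {abcd} shown here: the join also produces acde, which is pruned because its subset ade is not frequent):

    def apriori_gen(Lk):
        # Lk: a set of frequent k-itemsets, each an ordered tuple of items
        Lk = sorted(Lk)
        k = len(Lk[0])
        # join step: two k-itemsets agreeing on their first k-1 items give a (k+1)-candidate
        Ck1 = set()
        for p in Lk:
            for q in Lk:
                if p[:k - 1] == q[:k - 1] and p[k - 1] < q[k - 1]:
                    Ck1.add(p + (q[k - 1],))
        # prune step: drop candidates that have an infrequent k-subset
        Lk_set = set(Lk)
        return {c for c in Ck1
                if all(c[:i] + c[i + 1:] in Lk_set for i in range(len(c)))}

    L3 = {('a','b','c'), ('a','b','d'), ('a','c','d'), ('a','c','e'), ('b','c','d')}
    print(apriori_gen(L3))    # {('a', 'b', 'c', 'd')}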
How to Count Supports of Candidates?
Method:
Candidate itemsets are stored in a hash-tree
Leaf node of hash-tree contains a list of itemsets and counts
Interior node contains a hash table
Subset function: finds all the candidates contained in a transaction
Example of the hash-tree for C3
[Figure: hash-tree not reproduced.]

Candidates contained in each transaction (database D and candidates from the Apriori example above):

  C2: {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}

  TID  Candidate 2-itemsets contained in the transaction
  100  {{1 3}}
  200  {{2 3}, {2 5}, {3 5}}
  300  {{1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}}
  400  {{2 5}}

  L2:  {1 3} 2,  {2 3} 2,  {2 5} 3,  {3 5} 2

  C3: {2 3 5}

  TID  Candidate 3-itemsets contained in the transaction
  200  {{2 3 5}}
  300  {{2 3 5}}

  L3:  {2 3 5} 2
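A minimal dictionary-based sketch of this counting step (a real implementation would route each transaction through the hash-tree rather than testing every candidate). Applied to the database D and C2 above, it reproduces the per-transaction candidate sets and the L2 counts:

    from collections import defaultdict

    def count_candidates(transactions, candidates):
        contained = {}                  # TID -> candidates contained in that transaction
        support = defaultdict(int)
        for tid, t in transactions.items():
            hits = {c for c in candidates if c <= t}
            contained[tid] = hits
            for c in hits:
                support[c] += 1
        return contained, support

    D = {100: {1, 3, 4}, 200: {2, 3, 5}, 300: {1, 2, 3, 5}, 400: {2, 5}}
    C2 = [frozenset(p) for p in [(1, 2), (1, 3), (1, 5), (2, 3), (2, 5), (3, 5)]]
    contained, support = count_candidates(D, C2)
    print(len(contained[300]))              # 6: transaction 300 contains every C2 candidate
    print(support[frozenset({2, 5})])       # 3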
Methods to Improve Apriori’s Efficiency
Maximal Frequent Itemset
An itemset is maximal frequent if none of its immediate supersets is frequent.
[Figure: itemset lattice over the items A–E with a border separating the frequent from the infrequent itemsets; the maximal frequent itemsets are the frequent itemsets lying directly inside the border.]

Closed Itemset
An itemset is closed if none of its immediate supersets has the same support count.
[Figure: the same lattice with each itemset annotated by the IDs of the transactions containing it, with minimum support = 2; itemsets are marked as "closed but not maximal" or "closed and maximal", and itemsets such as ABCDE are not supported by any transaction. In this example, # Closed = 9 and # Maximal = 4.]
Maximal vs Closed Itemsets
Frequent
Itemsets
Closed
Frequent
Itemsets
Maximal
Frequent
Itemsets
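Both definitions can be checked mechanically. A small sketch, assuming the frequent-itemset dictionary produced by the apriori() sketch earlier (the function name is my own); on the small example database D with min_sup = 2 it finds 2 maximal and 4 closed frequent itemsets:

    def maximal_and_closed(frequent):
        # frequent: dict mapping frozenset itemsets to their support counts
        maximal, closed = set(), set()
        for itemset, sup in frequent.items():
            supersets = [s for s in frequent if itemset < s]       # proper frequent supersets
            if not supersets:                                      # maximal: no frequent superset
                maximal.add(itemset)
            if all(frequent[s] < sup for s in supersets):          # closed: no superset with equal support
                closed.add(itemset)
        return maximal, closed

    frequent = {frozenset(s): c for s, c in [
        ((1,), 2), ((2,), 3), ((3,), 3), ((5,), 3),
        ((1, 3), 2), ((2, 3), 2), ((2, 5), 3), ((3, 5), 2),
        ((2, 3, 5), 2),
    ]}
    maximal, closed = maximal_and_closed(frequent)
    print(sorted(tuple(sorted(s)) for s in maximal))   # [(1, 3), (2, 3, 5)]
    print(sorted(tuple(sorted(s)) for s in closed))    # [(1, 3), (2, 3, 5), (2, 5), (3,)]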
Factors Affecting Complexity
Choice of minimum support threshold
lowering support threshold results in more frequent itemsets
this may increase number of candidates and max length of frequent
itemsets
Dimensionality (number of items) of the data set
more space is needed to store support count of each item
if number of frequent items also increases, both computation and I/O
costs may also increase
Size of database
since Apriori makes multiple passes, run time of algorithm may increase
with number of transactions
Average transaction width
transaction width increases with denser data sets
This may increase max length of frequent itemsets and traversals of hash
tree (number of subsets in a transaction increases with its width)
Rule Generation
Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f → L − f satisfies the minimum confidence requirement.

If {A, B, C, D} is a frequent itemset, the candidate rules are:
  ABC → D,  ABD → C,  ACD → B,  BCD → A,
  A → BCD,  B → ACD,  C → ABD,  D → ABC,
  AB → CD,  AC → BD,  AD → BC,  BC → AD,
  BD → AC,  CD → AB
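A minimal sketch of this enumeration (it simply tests every binary partition rather than using confidence-based pruning). The support counts below are those of {Milk, Diaper, Beer} and its subsets in the market-basket example, so the output matches the rule confidences listed earlier:

    from itertools import combinations

    def generate_rules(L, support_count, min_conf):
        # emit rules f -> (L - f) for every non-empty proper subset f with conf >= min_conf
        L = frozenset(L)
        rules = []
        for size in range(1, len(L)):
            for antecedent in combinations(sorted(L), size):
                f = frozenset(antecedent)
                conf = support_count[L] / support_count[f]
                if conf >= min_conf:
                    rules.append((set(f), set(L - f), conf))
        return rules

    counts = {
        frozenset({'Milk'}): 4, frozenset({'Diaper'}): 4, frozenset({'Beer'}): 3,
        frozenset({'Milk', 'Diaper'}): 3, frozenset({'Milk', 'Beer'}): 2,
        frozenset({'Diaper', 'Beer'}): 3, frozenset({'Milk', 'Diaper', 'Beer'}): 2,
    }
    for lhs, rhs, conf in generate_rules({'Milk', 'Diaper', 'Beer'}, counts, min_conf=0.6):
        print(lhs, '->', rhs, round(conf, 2))    # e.g. {'Milk', 'Beer'} -> {'Diaper'} 1.0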
FP-tree Construction
min_support = 3

Transaction DB (frequent items only, each transaction sorted in frequency-descending order):
  TID  Frequent items
  1    f, c, a, m, p
  2    f, c, a, b, m
  3    f, b
  4    c, b, p
  5    f, c, a, m, p

Item frequencies (first scan):
  Item  frequency
  f     4
  c     4
  a     3
  b     3
  m     3
  p     3

Transactions are inserted one at a time; shared prefixes share nodes and only their counts are incremented. [Figures of the tree after inserting the first, second and third transactions are not reproduced.] The completed FP-tree:

  root
  ├─ f:4
  │   ├─ c:3
  │   │   └─ a:3
  │   │       ├─ m:2
  │   │       │   └─ p:2
  │   │       └─ b:1
  │   │           └─ m:1
  │   └─ b:1
  └─ c:1
      └─ b:1
          └─ p:1

  Header Table (each entry's head starts the node-links for that item):
  Item  frequency  node-links
  f     4          f:4
  c     4          c:3 → c:1
  a     3          a:3
  b     3          b:1 → b:1 → b:1
  m     3          m:2 → m:1
  p     3          p:2 → p:1
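A minimal sketch of the two-scan construction in Python (class and function names are my own). Ties between equally frequent items are broken by first appearance, so the exact node layout may differ slightly from the figure, but the counts agree:

    from collections import defaultdict

    class FPNode:
        def __init__(self, item, parent):
            self.item, self.count, self.parent = item, 0, parent
            self.children = {}                       # item -> FPNode

    def build_fptree(transactions, min_support):
        # first scan: count items and keep only the frequent ones
        freq = defaultdict(int)
        for t in transactions:
            for item in t:
                freq[item] += 1
        freq = {i: c for i, c in freq.items() if c >= min_support}
        # fix one global frequency-descending order for all transactions
        rank = {item: r for r, item in enumerate(sorted(freq, key=lambda i: -freq[i]))}
        # second scan: insert each transaction along a shared-prefix path
        root, header = FPNode(None, None), defaultdict(list)     # header: item -> node-links
        for t in transactions:
            node = root
            for item in sorted((i for i in t if i in freq), key=lambda i: rank[i]):
                if item not in node.children:
                    node.children[item] = FPNode(item, node)
                    header[item].append(node.children[item])
                node = node.children[item]
                node.count += 1
        return root, header

    transactions = [list('fcamp'), list('fcabm'), list('fb'), list('cbp'), list('fcamp')]
    root, header = build_fptree(transactions, min_support=3)
    print({item: node.count for item, node in root.children.items()})   # {'f': 4, 'c': 1}
    print(sum(node.count for node in header['m']))                      # 3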
Benefits of the FP-tree Structure
Completeness:
never breaks a long pattern of any transaction
preserves complete information for frequent pattern mining
Compactness
  reduces irrelevant information: infrequent items are gone
  frequency-descending ordering: more frequent items are more likely to be shared
  never larger than the original database (not counting node-links and counts)
  Example: for the Connect-4 DB, the compression ratio could be over 100
Mining Frequent Patterns Using FP-tree
Method
For each item, construct its conditional pattern-base, and then its
conditional FP-tree
Repeat the process on each newly created conditional FP-tree
Until the resulting FP-tree is empty, or it contains only one path (single
path will generate all the combinations of its sub-paths, each of which is a
frequent pattern)
Mining Frequent Patterns Using the FP-tree (cont’d)
Example (min_sup = 3). Start from the ordered frequent-item transactions:
  1  f, c, a, m, p
  2  f, c, a, b, m
  3  f, b
  4  c, b, p
  5  f, c, a, m, p

Project the database on each item, starting from the least frequent one (the item's conditional pattern base), and mine each projection recursively:

+ p:  1: f, c, a, m     4: c, b     5: f, c, a, m
      → frequent patterns: p:3, cp:3

+ m:  1: f, c, a     2: f, c, a, b     5: f, c, a
      → frequent patterns: m:3, fm:3, cm:3, am:3, fcm:3, fam:3, cam:3, fcam:3

+ b:  2: f, c, a     3: f     4: c
      → frequent patterns: b:3

+ a:  1: f, c     2: f, c     5: f, c
      → frequent patterns: a:3, fa:3, ca:3, fca:3

+ c:  1: f     2: f     4: (empty)     5: f
      → frequent patterns: c:4, fc:3

+ f:  1, 2, 3, 5: (empty)
      → frequent patterns: f:4

[Figures: the conditional FP-trees built from each conditional pattern base are not reproduced.]
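The projections above can be sketched with plain lists of prefix paths instead of an actual FP-tree (a simplification of FP-growth; the function and variable names are my own). On the five ordered transactions it produces exactly the 18 frequent patterns listed above:

    from collections import defaultdict

    def project_and_mine(pattern_base, min_sup, suffix, results):
        # pattern_base: list of (prefix_path, count); suffix: items already fixed
        freq = defaultdict(int)
        for path, count in pattern_base:
            for item in path:
                freq[item] += count
        for item, support in freq.items():
            if support < min_sup:
                continue
            pattern = (item,) + suffix
            results[pattern] = support
            # conditional pattern base of `pattern`: the prefixes ending just before `item`
            conditional = [(path[:path.index(item)], count)
                           for path, count in pattern_base
                           if item in path and path.index(item) > 0]
            project_and_mine(conditional, min_sup, pattern, results)

    transactions = [('f','c','a','m','p'), ('f','c','a','b','m'), ('f','b'),
                    ('c','b','p'), ('f','c','a','m','p')]
    results = {}
    project_and_mine([(t, 1) for t in transactions], 3, (), results)
    print(len(results))                                          # 18
    for pattern in sorted(results, key=lambda p: (len(p), p)):
        print(''.join(pattern), results[pattern])                # f 4, c 4, ..., fcam 3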
Properties of FP-tree for Conditional Pattern
Base Construction
Node-link property
For any frequent item ai, all the possible frequent patterns that contain ai
can be obtained by following ai's node-links, starting from ai's head in the
FP-tree header
Prefix path property
To calculate the frequent patterns for a node ai in a path P, only the prefix sub-path of ai in P needs to be accumulated, and its frequency count carries the same count as node ai
[Figure: performance as a function of the support threshold (%), for thresholds between 0% and 3%.]
Chapter 6: Mining Frequent Patterns, Associations and Correlations
Basic Concepts
Evaluation Methods
Summary
Interestingness Measurements
Objective measures
Two popular measurements:
support; and
confidence
Subjective measures
A rule (pattern) is interesting if
it is unexpected (surprising to the user); and/or
actionable (the user can do something with it)
Computing Interestingness Measure
Given a rule X → Y, the information needed to compute rule interestingness can be obtained from a contingency table. Some common measures:

  \mathrm{Lift} = \frac{P(Y \mid X)}{P(Y)}

  \mathrm{Interest} = \frac{P(X, Y)}{P(X)\, P(Y)}

  \mathrm{PS} = P(X, Y) - P(X)\, P(Y)

  \phi\text{-coefficient} = \frac{P(X, Y) - P(X)\, P(Y)}{\sqrt{P(X)\,[1 - P(X)]\; P(Y)\,[1 - P(Y)]}}
Example: Tea and Coffee

         Coffee   ¬Coffee
  Tea       15         5     20
  ¬Tea      75         5     80
            90        10    100

Here confidence(Tea → Coffee) = 15/20 = 75%, which looks high, but lift = 0.15 / (0.2 × 0.9) ≈ 0.83 < 1, so buying tea is in fact negatively correlated with buying coffee.
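A small sketch computing these measures from a 2×2 contingency table (function and parameter names are my own). For the Tea/Coffee table above it gives lift ≈ 0.83 and φ = −0.25:

    def measures(n_xy, n_x_only, n_y_only, n_neither):
        # support, confidence, lift and phi-coefficient of X -> Y from a 2x2 table of counts
        n = n_xy + n_x_only + n_y_only + n_neither
        p_x = (n_xy + n_x_only) / n
        p_y = (n_xy + n_y_only) / n
        p_xy = n_xy / n
        confidence = p_xy / p_x
        lift = p_xy / (p_x * p_y)
        phi = (p_xy - p_x * p_y) / (p_x * (1 - p_x) * p_y * (1 - p_y)) ** 0.5
        return p_xy, confidence, lift, phi

    # X = Tea, Y = Coffee: 15 both, 5 tea only, 75 coffee only, 5 neither
    print(measures(15, 5, 75, 5))    # (0.15, 0.75, 0.833..., -0.25)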
There are lots of
measures proposed in
the literature
Two further contingency tables for rules X → Y:

         Y    ¬Y                      Y    ¬Y
  X      60   10    70        X      20   10    30
  ¬X     10   20    30        ¬X     10   60    70
         70   30   100               30   70   100

X and Y co-occur in 60% of the transactions in the left table but in only 20% in the right one, yet lift is higher on the right (0.2 / (0.3 × 0.3) ≈ 2.22) than on the left (0.6 / (0.7 × 0.7) ≈ 1.22).
Comparison of Interestingness Measures
Null-(transaction) invariance is crucial for correlation analysis: a null-invariant measure is unaffected by transactions that contain neither X nor Y
Lift and χ² are not null-invariant
5 null-invariant measures
Subtle: they disagree
Analysis of DBLP Coauthor Relationships
Recent DB conferences, removing balanced associations, low sup, etc.
Basic Concepts
Evaluation Methods
Summary
Summary
Basic concepts: association rules, support-confidence framework, closed and max-patterns
Scalable frequent pattern mining methods
Apriori (Candidate generation & test)
Projection-based (FPgrowth, CLOSET+, ...)