Unit 3
Pattern Analysis?
• What products were often purchased together? Beer and diapers?!
• What are the subsequent purchases after buying a PC?
• What kinds of DNA are sensitive to this new drug?
• Can we automatically classify web documents?
• Applications
Basic Concepts: Association Rules

Tid  Items bought
10   Beer, Nuts, Diaper
20   Beer, Coffee, Diaper
30   Beer, Diaper, Eggs
40   Nuts, Eggs, Milk
50   Nuts, Coffee, Diaper, Eggs, Milk

• Find all the rules X → Y with minimum support and confidence (see the worked example below)
  • support, s, probability that a transaction contains X ∪ Y
  • confidence, c, conditional probability that a transaction having X also contains Y

[Figure: Venn diagram over the transactions: "Customer buys beer", "Customer buys diaper", and their intersection "Customer buys both"]

Let minsup = 50%, minconf = 50%
Freq. Pat.: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3
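A worked instance of these definitions on the table above: {Beer, Diaper} appears in transactions 10, 20, and 30, and Beer appears in exactly those three transactions, so

$$s(\mathrm{Beer}\Rightarrow\mathrm{Diaper}) = \frac{3}{5} = 60\%, \qquad c(\mathrm{Beer}\Rightarrow\mathrm{Diaper}) = \frac{3}{3} = 100\%,$$

and the rule Beer ⇒ Diaper passes both thresholds.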
The Apriori Algorithm (pseudo-code)

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
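A minimal runnable sketch of this loop in Python (our own code and names, not from the slides; the naive pairwise union below stands in for the Lk self-join and pruning detailed on the following slides):

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Level-wise search: L1, C2, L2, C3, ... until Lk is empty."""
    transactions = [frozenset(t) for t in transactions]

    # L1 = {frequent items}
    counts = {}
    for t in transactions:
        for item in t:
            counts[frozenset([item])] = counts.get(frozenset([item]), 0) + 1
    L = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(L)

    k = 1
    while L:
        # C(k+1): unions of two frequent k-itemsets that differ in one item
        candidates = {a | b for a, b in combinations(L, 2) if len(a | b) == k + 1}
        # count the candidates contained in each transaction
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:          # candidate contained in transaction t
                    counts[c] += 1
        L = {s: c for s, c in counts.items() if c >= min_count}
        frequent.update(L)
        k += 1
    return frequent                 # union of all Lk

db = [{"Beer", "Nuts", "Diaper"}, {"Beer", "Coffee", "Diaper"},
      {"Beer", "Diaper", "Eggs"}, {"Nuts", "Eggs", "Milk"},
      {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"}]
print(apriori(db, min_count=3))
# matches the earlier slide: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3
```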
Implementation of Apriori
• How to generate candidates?
• Step 1: self-joining Lk
• Step 2: pruning
• Example of Candidate-generation
• L3={abc, abd, acd, ace, bcd}
• Self-joining: L3*L3
• abcd from abc and abd
• acde from acd and ace
• Pruning:
• acde is removed because ade is not in L3
• C4 = {abcd}
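The same join-and-prune step as a runnable sketch (our own Python; itemsets are kept as sorted tuples, so the "agree on the first k-1 items" join condition from the SQL on the next slide becomes a simple prefix test):

```python
from itertools import combinations

def apriori_gen(Lk):
    """Self-join + prune: generate C(k+1) from the frequent k-itemsets Lk."""
    k = len(next(iter(Lk)))
    # Step 1: self-join Lk*Lk on the first k-1 items
    joined = {p + (q[-1],)
              for p in Lk for q in Lk
              if p[:-1] == q[:-1] and p[-1] < q[-1]}
    # Step 2: prune any candidate that has an infrequent k-subset
    return {c for c in joined
            if all(s in Lk for s in combinations(c, k))}

L3 = {("a","b","c"), ("a","b","d"), ("a","c","d"), ("a","c","e"), ("b","c","d")}
print(apriori_gen(L3))  # {('a','b','c','d')}: acde is pruned because ade is not in L3
```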
Counting Supports of Candidates
• Why is counting supports of candidates a problem?
• Method: candidate itemsets are stored in a hash-tree, and a subset function finds all the candidates contained in a transaction
[Figure: hash tree over candidate 3-itemsets, with buckets 1,4,7 / 2,5,8 / 3,6,9 (hash on item mod 3). Leaves hold candidates such as 145, 124, 457, 125, 458, 159, 234, 567, 345, 356, 357, 689, 367, 368, 136. Transaction 1 2 3 5 6 is matched by recursively expanding 1+2356, 12+356, 13+56, …]
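A full hash-tree is more code than fits on a slide; the sketch below (our own Python) shows the flat-hash simplification of the same idea: enumerate each transaction's k-subsets once and look each up in a dict of candidates, instead of testing every candidate against every transaction. The hash tree refines this further by pruning whole subtrees of candidates that cannot occur in the transaction.

```python
from itertools import combinations

def count_supports(transactions, candidates, k):
    """Support counting via hashing: a flat-dict stand-in for the hash tree."""
    counts = {tuple(sorted(c)): 0 for c in candidates}
    for t in transactions:
        for subset in combinations(sorted(t), k):  # k-subsets of the transaction
            if subset in counts:
                counts[subset] += 1
    return counts

# Transaction 1 2 3 5 6 from the figure reaches leaf candidates such as 125 and 136:
print(count_supports([{1, 2, 3, 5, 6}], [(1, 2, 5), (1, 3, 6), (4, 5, 7)], 3))
# {(1, 2, 5): 1, (1, 3, 6): 1, (4, 5, 7): 0}
```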
Candidate Generation: An SQL Implementation
• SQL Implementation of candidate generation
• Suppose the items in Lk-1 are listed in an order
• Step 1: self-joining Lk-1
insert into Ck
select p.item1, p.item2, …, p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1 = q.item1 and … and p.itemk-2 = q.itemk-2 and p.itemk-1 < q.itemk-1
• Step 2: pruning
forall itemsets c in Ck do
forall (k-1)-subsets s of c do
if (s is not in Lk-1) then delete c from Ck
• Use object-relational extensions like UDFs, BLOBs, and table functions for efficient implementation
Further Improvement of the Apriori Method

Major computational challenges:
• Multiple scans of transaction database
• Huge number of candidates
• Tedious workload of support counting for candidates
DIC (Dynamic Itemset Counting): Reduce Number of Scans
• Once both A and D are determined frequent, the counting of AD begins
• Once all length-2 subsets of BCD are determined frequent, the counting of BCD begins

[Figure: itemset lattice over items A, B, C, D, from {} up to ABCD. Apriori counts 1-itemsets, then 2-itemsets, … in separate passes over the transactions, while DIC starts counting 2- and 3-itemsets partway through a scan, as soon as their subsets are known to be frequent]

S. Brin, R. Motwani, J. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. SIGMOD'97
Scalable Frequent Itemset Mining Methods
Pattern-Growth Approach: Mining Frequent Patterns Without Candidate Generation

Header Table:
Item  frequency
f     4
c     4
a     3
b     3
m     3
p     3

[Figure: FP-tree rooted at {}, with a main branch f:4 → c:3 → a:3 → m:2 and side branches b:1, and c:1 → b:1 → p:1; node-links from the header table chain together the occurrences of each item]

Conditional pattern bases:
item  cond. pattern base
c     f:3
a     fc:3
b     fca:1, f:1, c:1
m     fca:2, fcab:1
[Figure: recursive conditional FP-trees. m-conditional FP-tree: {} → f:3 → c:3 → a:3. am-conditional FP-tree: {} → f:3 → c:3. Cond. pattern base of "cm": (f:3), giving the cm-conditional FP-tree {} → f:3]
Benefits of the FP-tree Structure
• Completeness
  • Preserves complete information for frequent pattern mining
  • Never breaks a long pattern of any transaction
• Compactness
  • Reduces irrelevant info: infrequent items are gone
  • Items in frequency-descending order: the more frequently occurring, the more likely to be shared
  • Never larger than the original database (not counting node-links and the count field)
The Frequent Pattern Growth Mining Method

[Figure: projected databases. am-proj DB: {fc, fc, fc}; cm-proj DB: {f, f, f}; …]
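To make tree construction and conditional pattern bases concrete, here is a minimal FP-tree sketch in Python (our own code and names; the demo reuses the Tid table from the Basic Concepts slide with an absolute min count of 3). It builds the tree with a header table of node-links, then reads off an item's conditional pattern base by walking each of its nodes' prefix paths:

```python
from collections import defaultdict

class Node:
    def __init__(self, item, parent):
        self.item, self.parent, self.count = item, parent, 0
        self.children = {}

def build_fp_tree(transactions, min_count):
    # pass 1: frequencies of single items
    freq = defaultdict(int)
    for t in transactions:
        for item in t:
            freq[item] += 1
    keep = {i: f for i, f in freq.items() if f >= min_count}

    # pass 2: insert each transaction, items in frequency-descending order
    root, header = Node(None, None), defaultdict(list)
    for t in transactions:
        items = sorted((i for i in t if i in keep), key=lambda i: (-keep[i], i))
        node = root
        for item in items:
            if item not in node.children:
                node.children[item] = Node(item, node)
                header[item].append(node.children[item])  # node-link
            node = node.children[item]
            node.count += 1
    return root, header

def conditional_pattern_base(header, item):
    # each occurrence of `item` contributes its prefix path, weighted by count
    base = []
    for node in header[item]:
        path, p = [], node.parent
        while p is not None and p.item is not None:
            path.append(p.item)
            p = p.parent
        if path:
            base.append((path[::-1], node.count))
    return base

db = [{"Beer", "Nuts", "Diaper"}, {"Beer", "Coffee", "Diaper"},
      {"Beer", "Diaper", "Eggs"}, {"Nuts", "Eggs", "Milk"},
      {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"}]
root, header = build_fp_tree(db, min_count=3)
print(conditional_pattern_base(header, "Beer"))
# [(['Diaper'], 3)], matching {Beer, Diaper}:3 from the earlier slide
```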
Performance of FPGrowth in Large Datasets
[Figure: run time (sec.) vs. support threshold (%). Left panel, data set T25I20D10K: D1 FP-growth runtime vs. D1 Apriori runtime. Right panel, data set T25I20D100K: D2 FP-growth vs. D2 TreeProjection]
Advantages of the Pattern Growth Approach
• Divide-and-conquer:
• Decompose both the mining task and DB according to the frequent
patterns obtained so far
• Lead to focused search of smaller databases
• Other factors
• No candidate generation, no candidate test
• Compressed database: FP-tree structure
• No repeated scan of entire database
• Basic ops: counting local freq items and building sub FP-tree, no pattern
search and matching
• A good open-source implementation and refinement of FPGrowth
• FPGrowth+ (G. Grahne and J. Zhu, FIMI'03)
Further Improvements of Mining Methods
Mining Frequent Closed Patterns: CLOSET
Interestingness Measure: Correlations (Lift)
• play basketball ⇒ not eat cereal [20%, 33.3%] is more accurate than the corresponding positive rule, although it has lower support and confidence
• Over 20 interestingness measures have been proposed (see Tan, Kumar, and Srivastava @KDD'02)
[Figure: null-transactions w.r.t. m and c; the Kulczynski measure (1927) is null-invariant]
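For reference, lift(A, B) = P(A ∪ B) / (P(A) P(B)), while Kulc(A, B) = (P(A|B) + P(B|A)) / 2. The sketch below (our own Python; the 2×2 counts are hypothetical, not from the slides) shows why Kulczynski is null-invariant: adding transactions that contain neither A nor B inflates lift but leaves Kulc unchanged.

```python
def lift(n_ab, n_a, n_b, n_total):
    """lift(A,B) = P(A and B) / (P(A) * P(B)); sensitive to null-transactions."""
    return (n_ab / n_total) / ((n_a / n_total) * (n_b / n_total))

def kulczynski(n_ab, n_a, n_b):
    """Kulc(A,B) = (P(A|B) + P(B|A)) / 2; no n_total term, so null-invariant."""
    return (n_ab / n_a + n_ab / n_b) / 2

# Hypothetical counts: 100 transactions contain A, 100 contain B, 80 contain both.
n_ab, n_a, n_b = 80, 100, 100
for n_total in (200, 100_000):   # add null-transactions (neither A nor B)
    print(n_total, lift(n_ab, n_a, n_b, n_total), kulczynski(n_ab, n_a, n_b))
# lift grows from 1.6 to 800 as null-transactions are added; Kulc stays at 0.8
```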
Summary
• Basic concepts: association rules, support-confidence framework, closed and max-patterns
• Scalable frequent pattern mining methods
• Apriori (Candidate generation & test)
• Projection-based (FPgrowth, CLOSET+, ...)
• Vertical format approach (ECLAT, CHARM, ...)
• G. Grahne and J. Zhu, Efficiently Using Prefix-Trees in Mining Frequent Itemsets, Proc. FIMI'03
• J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. SIGMOD'00
• J. Liu, Y. Pan, K. Wang, and J. Han. Mining Frequent Item Sets by Opportunistic Projection.
KDD'02
• J. Han, J. Wang, Y. Lu, and P. Tzvetkov. Mining Top-K Frequent Closed Patterns without
Minimum Support. ICDM'02
• J. Wang, J. Han, and J. Pei. CLOSET+: Searching for the Best Strategies for Mining Frequent
Closed Itemsets. KDD'03
Ref: Vertical Format and Row Enumeration Methods
• H. Liu, J. Han, D. Xin, and Z. Shao, Mining Interesting Patterns from Very High
Dimensional Data: A Top-Down Row Enumeration Approach, SDM'06.
Ref: Mining Correlations and Interesting Rules