Frequent Pattern Mining Overview: Data Mining Techniques: Frequent Patterns in Sets and Sequences
Association Rule Mining Task
• Given a transaction database DB, find all rules having support ≥ minsup and confidence ≥ minconf
• Brute-force approach:
  – List all possible association rules
  – Compute support and confidence for each rule
  – Remove rules that fail the minsup or minconf thresholds
  – Computationally prohibitive!

Mining Association Rules
TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example rules:
  {Milk, Diaper} → {Beer}   (s=0.4, c=0.67)
  {Milk, Beer} → {Diaper}   (s=0.4, c=1.0)
  {Diaper, Beer} → {Milk}   (s=0.4, c=0.67)
  {Beer} → {Milk, Diaper}   (s=0.4, c=0.67)
  {Diaper} → {Milk, Beer}   (s=0.4, c=0.5)
  {Milk} → {Diaper, Beer}   (s=0.4, c=0.5)

Observations:
• All the above rules are binary partitions of the same itemset {Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but can have different confidence
• Thus, we may decouple the support and confidence requirements
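The support and confidence computations behind the example rules can be sketched in Python (a minimal sketch over the example table; the function names are illustrative):

```python
# Transaction database from the example table
DB = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset):
    """Fraction of transactions containing every item of the itemset."""
    return sum(itemset <= t for t in DB) / len(DB)

def confidence(antecedent, consequent):
    """Conditional frequency of the consequent given the antecedent."""
    return support(antecedent | consequent) / support(antecedent)

# Rules from the same itemset share support but differ in confidence:
s = support({"Milk", "Diaper", "Beer"})          # 0.4 for all six rules
c1 = confidence({"Milk", "Diaper"}, {"Beer"})    # 2/3
c2 = confidence({"Milk", "Beer"}, {"Diaper"})    # 1.0
```

This makes the observation concrete: all six rules share s = 0.4 because they partition the same itemset, while the confidences differ with the antecedent.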
• Two-step approach: first find all frequent itemsets (support ≥ minsup), then generate high-confidence rules from each frequent itemset
• For d items A, B, C, D, E, …, there are 2^d possible candidate itemsets
  [Figure: itemset lattice from the empty set up to ABCDE]
Frequent Pattern Mining Overview
• Basic Concepts and Challenges
• Efficient and Scalable Methods for Frequent Itemsets and Association Rules
• Pattern Interestingness Measures
• Sequence Mining

Reducing Number of Candidates
• Apriori principle:
  – If an itemset is frequent, then all of its subsets must also be frequent
• Apriori principle holds due to the following property of the support measure:
    ∀X, Y: (X ⊆ Y) ⇒ s(X) ≥ s(Y)
  – Support of an itemset never exceeds the support of its subsets
  – This is known as the anti-monotone property of support
How to Generate Candidates?
• Step 1: self-joining L_{k-1}
    insert into C_k
    select p.item_1, p.item_2, …, p.item_{k-1}, q.item_{k-1}
    from L_{k-1} p, L_{k-1} q
    where p.item_1 = q.item_1 AND … AND p.item_{k-2} = q.item_{k-2}
          AND p.item_{k-1} < q.item_{k-1}
• Step 2: pruning
  – forall itemsets c in C_k do
      forall (k-1)-subsets s of c do
        if (s is not in L_{k-1}) then delete c from C_k

How to Count Supports of Candidates?
• Why is counting supports of candidates a problem?
  – Total number of candidates can be very large
  – One transaction may contain many candidates
• Method:
  – Candidate itemsets stored in a hash-tree
  – Leaf node contains list of itemsets
  – Interior node contains a hash table
  – Subset function finds all candidates contained in a transaction
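The self-join and prune steps above can be sketched in Python (a minimal sketch; the L3 example itemsets are illustrative):

```python
from itertools import combinations

def generate_candidates(L_prev, k):
    """Apriori-gen: self-join L_{k-1} with itself, then prune.

    L_prev holds the frequent (k-1)-itemsets as sorted tuples.
    """
    L_set = set(L_prev)
    candidates = []
    # Step 1: self-join -- merge pairs agreeing on the first k-2 items
    for p in L_prev:
        for q in L_prev:
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                c = p + (q[-1],)
                # Step 2: prune -- every (k-1)-subset must be frequent
                if all(s in L_set for s in combinations(c, k - 1)):
                    candidates.append(c)
    return candidates

# Example: joining {abc, abd, acd, ace, bcd} yields only abcd;
# acde is pruned because its subsets ade and cde are not frequent
L3 = [("a","b","c"), ("a","b","d"), ("a","c","d"), ("a","c","e"), ("b","c","d")]
C4 = generate_candidates(L3, 4)
```

Keeping itemsets as sorted tuples makes the "same first k-2 items" join condition a simple prefix comparison.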
Subset Operation Using Hash Tree
[Figure: hash tree for 3-item candidates with hash function 1,4,7 → left branch, 2,5,8 → middle branch, 3,6,9 → right branch. Transaction 1 2 3 5 6 is matched by recursively splitting it: 1+2356, 2+356, 3+56 at the root, then 12+356, 13+56, 15+6, … at the next level, until leaves holding candidates such as 234 and 567 are reached and checked.]
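A minimal sketch of such a hash tree, using the hash function from the figure (1,4,7 / 2,5,8 / 3,6,9); the class names, leaf-size limit, and candidate list are illustrative assumptions:

```python
class HashTreeNode:
    def __init__(self):
        self.children = {}   # hash bucket -> child node (interior node)
        self.itemsets = []   # candidate itemsets (leaf node)

MAX_LEAF = 3  # assumed: split a leaf once it holds more than this many candidates

def bucket(item):
    # Hash function from the figure: 1,4,7 -> 0; 2,5,8 -> 1; 3,6,9 -> 2
    return (item - 1) % 3

def insert(node, itemset, depth=0):
    if node.children:                      # interior node: descend by hashing
        insert(node.children.setdefault(bucket(itemset[depth]), HashTreeNode()),
               itemset, depth + 1)
        return
    node.itemsets.append(itemset)
    if len(node.itemsets) > MAX_LEAF and depth < len(itemset):  # split leaf
        old, node.itemsets = node.itemsets, []
        for s in old:
            insert(node.children.setdefault(bucket(s[depth]), HashTreeNode()),
                   s, depth + 1)

def contained_candidates(node, transaction, start=0, found=None):
    """Collect all stored candidates that are subsets of the transaction."""
    if found is None:
        found = set()
    if not node.children:                  # leaf: verify each candidate
        txn = set(transaction)
        for s in node.itemsets:
            if set(s) <= txn:
                found.add(s)
        return found
    # interior: hash each remaining transaction item (1+2356, 2+356, ...)
    for i in range(start, len(transaction)):
        child = node.children.get(bucket(transaction[i]))
        if child:
            contained_candidates(child, transaction, i + 1, found)
    return found

# Illustrative candidate set, then match transaction 1 2 3 5 6
cands = [(1,4,5),(1,2,4),(4,5,7),(1,2,5),(4,5,8),(1,5,9),(1,3,6),(2,3,4),
         (5,6,7),(3,4,5),(3,5,6),(3,5,7),(6,8,9),(3,6,7),(3,6,8)]
root = HashTreeNode()
for c in cands:
    insert(root, c)
matches = contained_candidates(root, (1, 2, 3, 5, 6))
```

The recursive item-by-item descent is exactly the 1+2356 / 2+356 / 3+56 expansion in the figure; only the reached leaves are scanned, so far fewer candidates are compared than a flat list would require.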
Association Rule Generation
• Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f → L − f satisfies the minimum confidence requirement
  – If {A,B,C,D} is a frequent itemset, candidate rules are:
    ABC→D, ABD→C, ACD→B, BCD→A,
    A→BCD, B→ACD, C→ABD, D→ABC,
    AB→CD, AC→BD, AD→BC, BC→AD, BD→AC, CD→AB
• If |L| = k, then there are 2^k − 2 candidate association rules (ignoring L→∅ and ∅→L)

Rule Generation
• How do we efficiently generate association rules from frequent itemsets?
  – In general, confidence does not have an anti-monotone property
    • c(ABC→D) can be larger or smaller than c(AB→D)
  – But confidence of rules generated from the same itemset has an anti-monotone property
    • For {A,B,C,D}, c(ABC→D) ≥ c(AB→CD) ≥ c(A→BCD)
    • Confidence is anti-monotone w.r.t. number of items on the right-hand side of the rule
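The 2^k − 2 count can be checked by enumerating the candidate rules directly (a minimal sketch; the function name is illustrative):

```python
from itertools import combinations

def candidate_rules(itemset):
    """All rules f -> L - f for non-empty proper subsets f of L."""
    L = frozenset(itemset)
    rules = []
    for r in range(1, len(L)):                 # antecedent sizes 1 .. k-1
        for f in combinations(sorted(L), r):
            f = frozenset(f)
            rules.append((f, L - f))           # consequent is the complement
    return rules

rules = candidate_rules({"A", "B", "C", "D"})
# |L| = 4, so 2^4 - 2 = 14 candidate rules, matching the list above
```

Each of the 2^k subsets of L gives one rule, minus the two degenerate cases with an empty antecedent or empty consequent.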
Rule Generation for Apriori Algorithm
• Candidate rule is generated by merging two rules that share the same prefix in the rule consequent
  – Join(CD→AB, BD→AC) would produce the candidate rule D→ABC
  – Prune rule D→ABC if its subset AD→BC does not have high confidence
[Figure: lattice of rules for the frequent itemset ABCD, from ABCD→{} at the top down to D→ABC, C→ABD, B→ACD, A→BCD at the bottom. If CD→AB is found to be a low-confidence rule, the rules below it in the lattice are pruned.]
How to Avoid Candidate Generation
• Grow long patterns from short ones using local frequent items
  – Assume {a,b,c} is a frequent pattern in transaction database DB
  – Get all transactions containing {a,b,c}
    • Notation: DB|{a,b,c}
  – {d} is a local frequent item in DB|{a,b,c} if and only if {a,b,c,d} is a frequent pattern in DB

Construct FP-tree from a Transaction Database
min_support = 3

TID  Items bought                (ordered) frequent items
100  {f, a, c, d, g, i, m, p}    {f, c, a, m, p}
200  {a, b, c, f, l, m, o}       {f, c, a, b, m}
300  {b, f, h, j, o, w}          {f, b}
400  {b, c, k, s, p}             {c, b, p}
500  {a, f, c, e, l, p, m, n}    {f, c, a, m, p}

1. Scan DB once, find frequent 1-itemsets (single item patterns)
2. Sort frequent items in frequency descending order, get f-list
   F-list = f-c-a-b-m-p
3. Scan DB again, construct FP-tree

Header Table: f 4, c 4, a 3, b 3, m 3, p 3
[Figure: FP-tree with root {} and the ordered transactions inserted as shared prefix paths (f:4 under the root with c:3, a:3, m:2, b:1 below it, plus a separate c:1, b:1, p:1 path); header-table links connect all nodes of each item.]
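The two-scan construction above can be sketched in Python (a minimal sketch; class and variable names are illustrative, and the order of equal-frequency items in the f-list may differ from the slide's tie-breaking):

```python
from collections import Counter, defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent, self.count = item, parent, 1
        self.children = {}

def build_fptree(transactions, min_support):
    # Scan 1: count items, keep frequent ones, order by descending count (f-list)
    counts = Counter(i for t in transactions for i in t)
    flist = [i for i, c in counts.most_common() if c >= min_support]
    rank = {item: r for r, item in enumerate(flist)}

    # Scan 2: insert each transaction's frequent items in f-list order,
    # sharing prefixes and incrementing counts along the way
    root = FPNode(None, None)
    header = defaultdict(list)          # item -> its nodes (node links)
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in rank), key=rank.get):
            child = node.children.get(item)
            if child:
                child.count += 1
            else:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
            node = child
    return root, header, flist

DB = [list("facdgimp"), list("abcflmo"), list("bfhjow"),
      list("bcksp"), list("afcelpmn")]
root, header, flist = build_fptree(DB, 3)
```

Because transactions sharing a frequent-item prefix share a path, the tree is a compressed representation of the database: here four of the five transactions start with f, so the root has just two children (f:4 and c:1).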
Partition Patterns and Databases
• Frequent patterns can be partitioned into subsets according to f-list
  – F-list = f-c-a-b-m-p
  – Patterns containing p
  – Patterns having m, but no p
  – Patterns having b, but neither m nor p
  – …
  – Patterns having c, but neither a, b, m, nor p
  – Pattern f
• This partitioning is complete and non-redundant

Construct Conditional Pattern Base for Item x
• Conditional pattern base = set of prefix paths in FP-tree that co-occur with x
• Traverse FP-tree by following link of frequent item x in header table
• Accumulate paths with their frequency counts

Conditional pattern bases (read off the FP-tree, header table f 4, c 4, a 3, b 3, m 3, p 3):
item  cond. pattern base
c     f:3
a     fc:3
b     fca:1, f:1, c:1
m     fca:2, fcab:1

• Method:
  – For each frequent item, construct its conditional pattern base, and then its conditional FP-tree
  – Repeat the process recursively on each newly created conditional FP-tree
  – Stop recursion when the resulting FP-tree is empty
    • Optimization if tree contains only one path: a single path will generate all the combinations of its sub-paths, each of which is a frequent pattern

[Figure: run time (sec.) vs. support threshold (%); details not recoverable.]
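The conditional pattern bases above can be computed directly from the ordered transactions of the example (a minimal sketch; following the header-table node links in the FP-tree and collecting prefix paths with their counts gives the same result):

```python
from collections import Counter

# Ordered frequent-item transactions (min_support = 3, f-list = f-c-a-b-m-p)
ordered = [list("fcamp"), list("fcabm"), list("fb"), list("cbp"), list("fcamp")]

def conditional_pattern_base(item, ordered_transactions):
    """Prefix paths co-occurring with `item`, with accumulated counts."""
    base = Counter()
    for t in ordered_transactions:
        if item in t:
            prefix = tuple(t[:t.index(item)])   # items before x on the path
            if prefix:
                base[prefix] += 1
    return base

cpb_m = conditional_pattern_base("m", ordered)   # {fca: 2, fcab: 1}
cpb_b = conditional_pattern_base("b", ordered)   # {fca: 1, f: 1, c: 1}
```

These match the table above; mining then recurses on the conditional FP-tree built from each base (e.g., m's base yields the conditional tree f:3, c:3, a:3).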
Why Is FP-Growth the Winner?
• Divide-and-conquer
  – Decompose both the mining task and DB according to the frequent patterns obtained so far
  – Leads to focused search of smaller databases
• Other factors
  – No candidate generation, no candidate test
  – Compressed database: FP-tree structure
  – No repeated scan of entire database
  – Basic operations: counting local frequent single items and building sub FP-tree
    • No pattern search and matching

Factors Affecting Mining Cost
• Choice of minimum support threshold
  – Lower support threshold => more frequent itemsets
    • More candidates, longer frequent itemsets
• Dimensionality (number of items) of the data set
  – More space needed to store support count of each item
  – If number of frequent items also increases, both computation and I/O costs may increase
• Size of database
  – Each pass over DB is more expensive
• Average transaction width
  – May increase max. length of frequent itemsets and traversals of hash tree (more subsets supported by transaction)
• How can we further reduce some of these costs?
• Need a compact representation of the set of frequent itemsets
• A frequent itemset is closed if none of its immediate supersets has the same support
Maximal vs Closed Frequent Itemsets
Minimum support = 2
[Figure: itemset lattice over {A, B, C, D, E}, each node annotated with the TIDs of its supporting transactions (A: 1,2,4; B: 1,2,3; C: 1,2,3,4; D: 2,4,5; E: 3,4,5). Closed frequent itemsets are marked, maximal frequent itemsets shaded; some itemsets are closed but not maximal.]
# Closed = 9, # Maximal = 4

• How to efficiently find maximal frequent itemsets? (similar for closed ones)
  – Naïve: first find all frequent itemsets, then remove non-maximal ones
  – Better: use maximality property for pruning
    • Effectiveness depends on itemset generation strategy
    • See book for details
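The naïve approach can be sketched in Python; the transaction database below is reconstructed from the TID lists shown in the lattice figure (an assumption about the underlying example):

```python
from itertools import combinations

# Transactions reconstructed from the TID lists in the figure
# (A: 1,2,4; B: 1,2,3; C: 1,2,3,4; D: 2,4,5; E: 3,4,5)
DB = [{"A", "B", "C"}, {"A", "B", "C", "D"}, {"B", "C", "E"},
      {"A", "C", "D", "E"}, {"D", "E"}]
MINSUP = 2
items = sorted(set().union(*DB))

def count(itemset):
    return sum(itemset <= t for t in DB)

# Naive: enumerate all itemsets, keep the frequent ones with their counts
frequent = {frozenset(s): count(set(s))
            for r in range(1, len(items) + 1)
            for s in combinations(items, r)
            if count(set(s)) >= MINSUP}

# Closed: no immediate superset has the same support count
closed = {f for f, c in frequent.items()
          if not any(f | {i} in frequent and frequent[f | {i}] == c
                     for i in items if i not in f)}

# Maximal: no immediate superset is frequent at all
maximal = {f for f in frequent
           if not any(f | {i} in frequent for i in items if i not in f)}
```

Since every frequent superset of an itemset with the same support must itself be frequent, it suffices to check immediate supersets; by definition every maximal itemset is also closed.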
Extension: Mining Multi-Dimensional Associations
• Single-dimensional rules: one type of predicate
  – buys(X, “milk”) → buys(X, “bread”)
• Multi-dimensional rules: more than one type of predicate
  – Interdimensional association rules (no repeated predicates)
    • age(X, “19-25”) ∧ occupation(X, “student”) → buys(X, “coke”)
  – Hybrid-dimensional association rules (repeated predicates)
    • age(X, “19-25”) ∧ buys(X, “popcorn”) → buys(X, “coke”)
• See book for efficient mining algorithms
• Does it find the right patterns?
• Does it result in an efficient mining algorithm?

[Table: interestingness measures compared against a list of desirable properties: support, confidence, Laplace, conviction, interest, IS (cosine), Piatetsky-Shapiro's, certainty factor, added value, collective strength, Jaccard, and Klosgen's measure, each with its value range (e.g., support and confidence in 0…1, Laplace in 0.5…1, Piatetsky-Shapiro's in −0.25…0…0.25, certainty factor in −1…0…1) and Yes/No entries per property; the full grid is not reliably recoverable here.]

The P's and O's are various desirable properties, e.g., symmetry under variable permutation (O1), which we do not cover in this class. Take-away message: no interestingness measure has all the desirable properties.
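A few of the measures in the table can be computed from the 2×2 contingency table of a rule A → B; the formulas below are the standard definitions, and the example counts are made up for illustration:

```python
import math

def measures(n11, n10, n01, n00):
    """Interestingness measures for a rule A -> B, given the contingency
    table of A and B (n11 = #transactions with both, n10 = A only, ...)."""
    n = n11 + n10 + n01 + n00
    pa, pb, pab = (n11 + n10) / n, (n11 + n01) / n, n11 / n
    return {
        "support": pab,
        "confidence": pab / pa,
        "interest": pab / (pa * pb),                # a.k.a. lift
        "is_cosine": pab / math.sqrt(pa * pb),      # the IS measure
        "piatetsky_shapiro": pab - pa * pb,
    }

# Example: 1000 transactions, A and B positively correlated
m = measures(n11=400, n10=100, n01=100, n00=400)
```

Here interest > 1 and Piatetsky-Shapiro > 0 both signal positive correlation, while support and confidence alone would not distinguish correlation from independence.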
Introduction
• Sequence mining: relevant for transaction, time-series, and sequence databases
• Applications of sequential pattern mining
  – Customer shopping sequences: first buy computer, then peripheral device within 3 months
  – Medical treatments, natural disasters (e.g., earthquakes), science & engineering processes, stocks and markets
  – Telephone calling patterns, Weblog click streams
  – DNA sequences and gene structures
Finding Length-1 Sequential Patterns
• Initial candidates: all singleton sequences <a>, <b>, <c>, <d>, <e>, <f>, <g>, <h>
• Scan database once, count support for candidates (min_sup = 2)

Seq. ID  Sequence
10       <(bd)cb(ac)>
20       <(bf)(ce)b(fg)>
30       <(ah)(bf)abf>
40       <(be)(ce)d>
50       <a(bd)bcb(ade)>

Cand  Sup
<a>   3
<b>   5
<c>   4
<d>   3
<e>   3
<f>   2
<g>   1
<h>   1

GSP: Generating Length-2 Candidates
• From the 6 frequent length-1 items, generate 51 length-2 candidates:
  – 36 two-element candidates <xy> for x, y ∈ {a, …, f}: <aa>, <ab>, …, <ff>
  – 15 one-element candidates <(xy)> for x < y: <(ab)>, <(ac)>, …, <(ef)>
• Without the Apriori property, all 8 items would yield 8·8 + 8·7/2 = 92 candidates; Apriori pruning removes 41 of them
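The candidate counts above (51 with Apriori pruning, 92 without) can be checked with a short sketch (the function name is illustrative):

```python
from itertools import combinations, product

def length2_candidates(freq_items):
    """GSP length-2 candidates from frequent length-1 items:
    two-element sequences <xy>, plus one-element itemsets <(xy)>."""
    seq = [(x, y) for x, y in product(freq_items, repeat=2)]        # <xy>
    itemset = [frozenset(p) for p in combinations(freq_items, 2)]   # <(xy)>
    return seq, itemset

# 6 frequent items survive the length-1 scan: 36 + 15 = 51 candidates
seq6, iset6 = length2_candidates("abcdef")
# without Apriori, all 8 items would be used: 64 + 28 = 92 candidates
seq8, iset8 = length2_candidates("abcdefgh")
```

Note that <xy> and <yx> are distinct candidates (order matters across elements), while <(xy)> is a set (order does not matter within an element), which is why one count is n² and the other n(n−1)/2.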
The GSP Mining Process
min_sup = 2

Seq. ID  Sequence
10       <(bd)cb(ac)>
20       <(bf)(ce)b(fg)>
30       <(ah)(bf)abf>
40       <(be)(ce)d>
50       <a(bd)bcb(ade)>

• Scan 1: 8 candidates (<a> <b> <c> <d> <e> <f> <g> <h>), 6 length-1 seq. patterns
• Scan 2: 51 candidates (<aa> <ab> … <af> <ba> <bb> … <ff> <(ab)> … <(ef)>), 19 length-2 seq. patterns, 10 candidates not in DB at all
• Scan 3: 47 candidates (<abb> <aab> <aba> <baa> <bab> …), 19 length-3 seq. patterns, 20 candidates not in DB at all
• Scan 4: 8 candidates (<abba> <(bd)bc> …), 6 length-4 seq. patterns
• Scan 5: 1 candidate, 1 length-5 seq. pattern <(bd)cba>
• A candidate may be dropped because it does not pass the support threshold, or because it does not occur in the DB at all

Candidate Generate-and-Test Drawbacks
• Huge set of candidate sequences generated
• Multiple scans of entire database needed
  – Length of each candidate grows by one at each database scan
Finding Seq. Patterns with Prefix <a>
SID  Sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>

• Only need to consider projections w.r.t. <a>
  – <a>-projected database: <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
• Find all length-2 frequent seq. patterns having prefix <a>: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
  – Further partition into those 6 subsets:
    • Having prefix <aa>
    • Having prefix <ab>
    • Having prefix <(ab)>
    • …
    • Having prefix <af>

Completeness of PrefixSpan
• Length-1 sequential patterns: <a>, <b>, <c>, <d>, <e>, <f>
• Partition the search space into subsets: having prefix <a>, having prefix <b>, …, having prefix <f>
• The <a>-projected database (<(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>) yields the length-2 sequential patterns with prefix <a>: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
• Recurse into the <aa>-projected DB, …, <af>-projected DB; similarly for the <b>-projected database, …, <f>-projected database
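The projection step can be sketched in Python, simplified to sequences of single items (itemset elements such as (abc), and the <(_d)…> partial-element suffixes they produce, are not modeled in this sketch):

```python
def project(database, prefix_item):
    """Simplified PrefixSpan projection: for each sequence containing
    the item, keep the suffix after its first occurrence."""
    projected = []
    for seq in database:
        if prefix_item in seq:
            suffix = seq[seq.index(prefix_item) + 1:]
            if suffix:
                projected.append(suffix)
    return projected

# Toy single-item sequence database (illustrative)
db = [list("abcacd"), list("acbcae"), list("abdcb"), list("afcbc")]
db_a = project(db, "a")     # suffixes after the first 'a' of each sequence
db_ab = project(db_a, "b")  # recurse: the <ab>-projected database
```

Counting item frequencies in each projected database gives the locally frequent items, which extend the prefix; the recursion explores exactly the prefix-partitioned subsets listed above, so no pattern is missed and none is generated twice.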
Performance on Data Set Gazelle
[Figure: run-time comparison of the algorithms on the Gazelle data set; details not recoverable.]

Effect of Pseudo-Projection
[Figure: effect of pseudo-projection on PrefixSpan run time; details not recoverable.]