Frequent Pattern Mining Overview: Data Mining Techniques: Frequent Patterns in Sets and Sequences

The document provides an overview of frequent pattern mining. It discusses finding frequent patterns, such as itemsets or sequences, that occur commonly in data. Association rule mining is described as finding rules that can predict occurrences based on other items. The key tasks are generating frequent itemsets that meet a minimum support threshold and generating high-confidence rules from those itemsets. Efficient and scalable algorithms are needed due to the computational expense of enumerating all possible patterns and rules.

Data Mining Techniques: Frequent Patterns in Sets and Sequences

Mirek Riedewald
Some slides based on presentations by Han/Kamber and Tan/Steinbach/Kumar

Frequent Pattern Mining Overview

• Basic Concepts and Challenges
• Efficient and Scalable Methods for Frequent Itemsets and Association Rules
• Pattern Interestingness Measures
• Sequence Mining

What Is Frequent Pattern Analysis?

• Find patterns (itemsets, sequences, structures, etc.) that occur frequently in a data set
• First proposed for frequent itemsets and association rule mining
• Motivation: find inherent regularities in data
  – What products were often purchased together?
  – What are the subsequent purchases after buying a PC?
  – What kinds of DNA are sensitive to a new drug?
• Applications
  – Market basket analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, DNA sequence analysis

Association Rule Mining

• Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction

Market-basket transactions:

TID | Items
1   | Bread, Milk
2   | Bread, Diaper, Beer, Eggs
3   | Milk, Diaper, Beer, Coke
4   | Bread, Milk, Diaper, Beer
5   | Bread, Milk, Diaper, Coke

Examples of association rules:
{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}

Implication means co-occurrence, not causality!

Definition: Frequent Itemset

• Itemset
  – A collection of one or more items
    • Example: {Milk, Bread, Diaper}
  – k-itemset: itemset that contains k items
• Support count (σ)
  – Frequency of occurrence of an itemset
  – E.g., σ({Milk, Bread, Diaper}) = 2
• Support (s)
  – Fraction of transactions that contain an itemset
  – E.g., s({Milk, Bread, Diaper}) = 2/5
• Frequent Itemset
  – An itemset whose support is greater than or equal to a minsup threshold

Definition: Association Rule

• Association rule = implication expression of the form X → Y, where X and Y are itemsets
  – Ex.: {Milk, Diaper} → {Beer}
• Rule evaluation metrics (over the market-basket transactions above)
  – Support (s) = P(X ∪ Y)
    • Estimated by the fraction of transactions that contain both X and Y
  – Confidence (c) = P(Y | X)
    • Estimated by the fraction of transactions that contain X and Y among all transactions containing X
• Example: {Milk, Diaper} → {Beer}

  s = σ({Milk, Diaper, Beer}) / |D| = 2/5 = 0.4
  c = σ({Milk, Diaper, Beer}) / σ({Milk, Diaper}) = 2/3 ≈ 0.67
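A quick sketch (not from the slides), assuming transactions are represented as Python sets: it computes the support count, support, and confidence defined above for the example market-basket data; function names are just illustrative.

# Support and confidence over the example market-basket data.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, db):
    """sigma(itemset): number of transactions containing the itemset."""
    return sum(1 for t in db if itemset <= t)

def support(itemset, db):
    """s(itemset): fraction of transactions containing the itemset."""
    return support_count(itemset, db) / len(db)

def confidence(lhs, rhs, db):
    """c(lhs -> rhs) = sigma(lhs union rhs) / sigma(lhs)."""
    return support_count(lhs | rhs, db) / support_count(lhs, db)

print(support({"Milk", "Diaper", "Beer"}, transactions))       # 0.4
print(confidence({"Milk", "Diaper"}, {"Beer"}, transactions))   # 0.666...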

Association Rule Mining Task

• Given a transaction database DB, find all rules having support ≥ minsup and confidence ≥ minconf
• Brute-force approach:
  – List all possible association rules
  – Compute support and confidence for each rule
  – Remove rules that fail the minsup or minconf thresholds
  – Computationally prohibitive!

Mining Association Rules

Example rules over the market-basket transactions above:
{Milk, Diaper} → {Beer}   (s=0.4, c=0.67)
{Milk, Beer} → {Diaper}   (s=0.4, c=1.0)
{Diaper, Beer} → {Milk}   (s=0.4, c=0.67)
{Beer} → {Milk, Diaper}   (s=0.4, c=0.67)
{Diaper} → {Milk, Beer}   (s=0.4, c=0.5)
{Milk} → {Diaper, Beer}   (s=0.4, c=0.5)

Observations:
• All the above rules are binary partitions of the same itemset {Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but can have different confidence
• Thus, we may decouple the support and confidence requirements

Mining Association Rules

• Two-step approach:
  1. Frequent Itemset Generation
     • Generate all itemsets that have support ≥ minsup
  2. Rule Generation
     • Generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of the frequent itemset
• Frequent itemset generation is still computationally expensive

Frequent Itemset Generation

[Figure: itemset lattice over items A–E, from the empty set (null) at the top down to ABCDE at the bottom.]

Given d items, there are 2^d possible candidate itemsets.

Frequent Itemset Generation

• Brute-force approach:
  – Each itemset in the lattice is a candidate frequent itemset
  – Count the support of each candidate by scanning the database
  – Match each of the N transactions (average width w) against every one of the M candidates
  – Complexity ~ O(N·M·w) => expensive since M = 2^d

Computational Complexity

• Given d unique items, total number of itemsets = 2^d
• Total number of possible association rules (C(n, k) denotes the binomial coefficient):

  R = Σ_{k=1}^{d−1} [ C(d, k) × Σ_{j=1}^{d−k} C(d−k, j) ] = 3^d − 2^(d+1) + 1

  If d = 6, R = 602 possible rules.
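As a quick sanity check (not from the slides), the closed form can be compared against the double sum directly; Python's math.comb gives the binomial coefficients.

# Verify that the double sum and the closed form agree for d = 6.
from math import comb

def num_rules(d):
    return sum(comb(d, k) * sum(comb(d - k, j) for j in range(1, d - k + 1))
               for k in range(1, d))

print(num_rules(6), 3**6 - 2**7 + 1)   # both print 602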

Frequent Pattern Mining Overview

• Basic Concepts and Challenges
• Efficient and Scalable Methods for Frequent Itemsets and Association Rules
• Pattern Interestingness Measures
• Sequence Mining

Reducing Number of Candidates

• Apriori principle:
  – If an itemset is frequent, then all of its subsets must also be frequent
• The Apriori principle holds due to the following property of the support measure:

  ∀ X, Y: (X ⊆ Y) ⇒ s(X) ≥ s(Y)

  – Support of an itemset never exceeds the support of its subsets
  – This is known as the anti-monotone property of support

Illustrating the Apriori Principle

[Figure: itemset lattice over A–E; once AB is found to be infrequent, all of its supersets (ABC, ABD, ..., ABCDE) are pruned.]

Illustrating the Apriori Principle (Minimum Support = 3)

Items (1-itemsets):

Item   | Count
Bread  | 4
Coke   | 2
Milk   | 4
Beer   | 3
Diaper | 4
Eggs   | 1

Pairs (2-itemsets); no need to generate candidates involving Coke or Eggs:

Itemset         | Count
{Bread, Milk}   | 3
{Bread, Beer}   | 2
{Bread, Diaper} | 3
{Milk, Beer}    | 2
{Milk, Diaper}  | 3
{Beer, Diaper}  | 3

Triplets (3-itemsets):

Itemset               | Count
{Bread, Milk, Diaper} | 3

If every subset is considered: C(6,1) + C(6,2) + C(6,3) = 41 candidates.
With support-based pruning: 6 + 6 + 1 = 13 candidates.

Apriori Algorithm

• Generate L1 = frequent itemsets of length k=1
• Repeat until no new frequent itemsets are found:
  – Generate Ck+1, the length-(k+1) candidate itemsets, from Lk
  – Prune candidate itemsets in Ck+1 containing subsets of length k that are not in Lk (and hence infrequent)
  – Count support of each remaining candidate by scanning the DB; eliminate infrequent ones from Ck+1
  – Lk+1 = Ck+1; k = k+1

Important Details of Apriori

• How to generate candidates?
  – Step 1: self-joining Lk
  – Step 2: pruning
• Example of candidate generation for L3 = { {a,b,c}, {a,b,d}, {a,c,d}, {a,c,e}, {b,c,d} }
  – Self-joining L3:
    • {a,b,c,d} from {a,b,c} and {a,b,d}
    • {a,c,d,e} from {a,c,d} and {a,c,e}
  – Pruning:
    • {a,c,d,e} is removed because {a,d,e} is not in L3
  – C4 = { {a,b,c,d} }
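A minimal Apriori sketch (not from the slides), assuming itemsets are kept as sorted tuples; it follows the loop above: self-join Lk, prune with the Apriori principle, then count the surviving candidates with one database scan. Names are illustrative.

from itertools import combinations

def apriori(transactions, minsup_count):
    """Return {itemset (sorted tuple): support count} for all frequent itemsets."""
    # L1: frequent 1-itemsets
    counts = {}
    for t in transactions:
        for item in set(t):
            counts[(item,)] = counts.get((item,), 0) + 1
    Lk = {s: c for s, c in counts.items() if c >= minsup_count}
    frequent = dict(Lk)
    k = 1
    while Lk:
        # Step 1: self-join Lk -- combine itemsets sharing their first k-1 items
        prev = sorted(Lk)
        candidates = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                if prev[i][:-1] == prev[j][:-1]:
                    candidates.add(prev[i] + (prev[j][-1],))
        # Step 2: prune candidates that have an infrequent length-k subset
        candidates = {c for c in candidates
                      if all(s in Lk for s in combinations(c, k))}
        # Count support of the surviving candidates with one DB scan
        counts = {c: 0 for c in candidates}
        for t in transactions:
            ts = set(t)
            for c in candidates:
                if ts >= set(c):
                    counts[c] += 1
        Lk = {c: n for c, n in counts.items() if n >= minsup_count}
        frequent.update(Lk)
        k += 1
    return frequent

# Example on the market-basket data with minimum support count 3:
db = [["Bread", "Milk"],
      ["Bread", "Diaper", "Beer", "Eggs"],
      ["Milk", "Diaper", "Beer", "Coke"],
      ["Bread", "Milk", "Diaper", "Beer"],
      ["Bread", "Milk", "Diaper", "Coke"]]
print(apriori(db, 3))   # e.g. ('Beer', 'Diaper') appears with count 3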

How to Generate Candidates?

• Step 1: self-joining Lk-1

  insert into Ck
  select p.item1, p.item2, …, p.itemk-1, q.itemk-1
  from Lk-1 p, Lk-1 q
  where p.item1 = q.item1 AND … AND p.itemk-2 = q.itemk-2
        AND p.itemk-1 < q.itemk-1

• Step 2: pruning

  forall itemsets c in Ck do
      forall (k-1)-subsets s of c do
          if (s is not in Lk-1) then delete c from Ck

How to Count Supports of Candidates?

• Why is counting supports of candidates a problem?
  – The total number of candidates can be very large
  – One transaction may contain many candidates
• Method:
  – Candidate itemsets are stored in a hash-tree
  – A leaf node contains a list of itemsets
  – An interior node contains a hash table
  – A subset function finds all candidates contained in a transaction

Generate Hash Tree

• Suppose we have 15 candidate itemsets of length 3:
  {1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}
• We need:
  – A hash function, here mapping items 1, 4, 7 to the first branch, items 2, 5, 8 to the second, and items 3, 6, 9 to the third
  – A max leaf size: the maximum number of itemsets stored in a leaf node (if the number of candidate itemsets in a leaf exceeds the max leaf size, split the node)

[Figure: hash tree over the 15 candidates, hashing on the first item at the root, the second item at depth 1, and so on; leaves hold small lists of candidates.]

Subset Operation Using Hash Tree

[Figure: matching transaction {1 2 3 5 6} against the hash tree. The transaction is expanded into prefixes 1+{2 3 5 6}, 2+{3 5 6}, 3+{5 6}, then 12+{3 5 6}, 13+{5 6}, 15+{6}, and so on; each expansion follows the corresponding hash branch down to a leaf, where the stored candidates are compared against the transaction.]

• The transaction is matched against only 9 of the 15 candidates.
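A small hash-tree sketch (not the slides' exact data structure), assuming length-3 candidates, an "item mod 3" hash function, and a max leaf size of 3; insert() splits overfull leaves one level down, and subset() walks the tree to collect the candidates contained in a sorted transaction.

MAX_LEAF = 3

class Node:
    def __init__(self):
        self.children = {}   # hash bucket -> child Node (interior nodes only)
        self.itemsets = []   # stored candidates (leaf nodes only)

def insert(node, itemset, depth=0):
    if node.children:                               # interior node: hash and descend
        child = node.children.setdefault(itemset[depth] % 3, Node())
        insert(child, itemset, depth + 1)
        return
    node.itemsets.append(itemset)                   # leaf node
    if len(node.itemsets) > MAX_LEAF and depth < len(itemset):
        stored, node.itemsets = node.itemsets, []   # split: push candidates one level down
        for s in stored:
            child = node.children.setdefault(s[depth] % 3, Node())
            insert(child, s, depth + 1)

def subset(node, transaction, start=0, found=None):
    """Return the candidates stored in the tree that the (sorted) transaction contains."""
    if found is None:
        found = set()
    if not node.children:                           # leaf: compare candidates directly
        tset = set(transaction)
        found.update(c for c in node.itemsets if set(c) <= tset)
        return found
    for i in range(start, len(transaction)):        # hash each remaining item and descend
        child = node.children.get(transaction[i] % 3)
        if child is not None:
            subset(child, transaction, i + 1, found)
    return found

candidates = [(1, 4, 5), (1, 2, 4), (4, 5, 7), (1, 2, 5), (4, 5, 8), (1, 5, 9), (1, 3, 6),
              (2, 3, 4), (5, 6, 7), (3, 4, 5), (3, 5, 6), (3, 5, 7), (6, 8, 9), (3, 6, 7), (3, 6, 8)]
root = Node()
for c in candidates:
    insert(root, c)
print(sorted(subset(root, [1, 2, 3, 5, 6])))   # [(1, 2, 5), (1, 3, 6), (3, 5, 6)]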

Association Rule Generation

• Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f → L − f satisfies the minimum confidence requirement
  – If {A,B,C,D} is a frequent itemset, the candidate rules are:
    ABC → D, ABD → C, ACD → B, BCD → A,
    A → BCD, B → ACD, C → ABD, D → ABC,
    AB → CD, AC → BD, AD → BC, BC → AD, BD → AC, CD → AB
• If |L| = k, then there are 2^k − 2 candidate association rules (ignoring L → ∅ and ∅ → L)

Rule Generation

• How do we efficiently generate association rules from frequent itemsets?
  – In general, confidence does not have an anti-monotone property
    • c(ABC → D) can be larger or smaller than c(AB → D)
  – But the confidence of rules generated from the same itemset has an anti-monotone property
    • For {A,B,C,D}: c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)
    • Confidence is anti-monotone w.r.t. the number of items on the right-hand side of the rule
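A minimal sketch (not from the slides) of rule generation from a single frequent itemset, growing consequents level by level and pruning with the anti-monotone property of confidence described above; the support_count dictionary below is filled with counts from the market-basket example, and all names are illustrative.

def rules_from_itemset(itemset, support_count, minconf):
    """itemset: sorted tuple; support_count: dict mapping sorted tuples to sigma."""
    rules = []
    sup_L = support_count[itemset]
    consequents = [(x,) for x in itemset]            # start with 1-item consequents
    while consequents:
        passed = []
        for rhs in consequents:
            lhs = tuple(x for x in itemset if x not in rhs)
            if not lhs:
                continue
            conf = sup_L / support_count[lhs]
            if conf >= minconf:
                rules.append((lhs, rhs, conf))
                passed.append(rhs)                    # only passing consequents are grown
        # Merge passing consequents that share all but their last item
        passed.sort()
        consequents = [passed[i] + (passed[j][-1],)
                       for i in range(len(passed)) for j in range(i + 1, len(passed))
                       if passed[i][:-1] == passed[j][:-1]]
    return rules

# Support counts from the market-basket example:
sc = {("Beer",): 3, ("Diaper",): 4, ("Milk",): 4,
      ("Beer", "Diaper"): 3, ("Beer", "Milk"): 2, ("Diaper", "Milk"): 3,
      ("Beer", "Diaper", "Milk"): 2}
# Four rules pass minconf = 0.6, e.g. {Milk, Diaper} -> {Beer} with c = 0.67:
print(rules_from_itemset(("Beer", "Diaper", "Milk"), sc, minconf=0.6))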

Rule Generation for Apriori Algorithm

[Figure: lattice of rules for the frequent itemset ABCD, from ABCD → {} at the top down to D → ABC, C → ABD, B → ACD, A → BCD at the bottom. If BCD → A is a low-confidence rule, all rules below it (those with A plus more items in the consequent) are pruned.]

• A candidate rule is generated by merging two rules that share the same prefix in the rule consequent
  – Join(CD → AB, BD → AC) would produce the candidate rule D → ABC
  – Prune rule D → ABC if its subset AD → BC does not have high confidence

Improving Apriori

• Challenges
  – Multiple scans of the transaction database
  – Huge number of candidates
  – Tedious workload of support counting for candidates
• General ideas
  – Reduce the number of passes over the transaction database
  – Further shrink the number of candidates
  – Facilitate support counting of candidates

Bottleneck of Frequent-Pattern Mining

• Apriori generates a very large number of candidates
  – 10^4 frequent 1-itemsets can result in more than 10^7 candidate 2-itemsets
  – Many candidates might have low support, or do not even exist in the database
• Apriori scans the entire transaction database for every round of support counting
• Bottleneck: candidate generation and test
• Can we avoid candidate generation?

How to Avoid Candidate Generation

• Grow long patterns from short ones using local frequent items
  – Assume {a,b,c} is a frequent pattern in transaction database DB
  – Get all transactions containing {a,b,c}
    • Notation: DB|{a,b,c}
  – {d} is a local frequent item in DB|{a,b,c} if and only if {a,b,c,d} is a frequent pattern in DB

Construct FP-tree from a Transaction Database (min_support = 3)

TID | Items bought             | (Ordered) frequent items
100 | {f, a, c, d, g, i, m, p} | {f, c, a, m, p}
200 | {a, b, c, f, l, m, o}    | {f, c, a, b, m}
300 | {b, f, h, j, o, w}       | {f, b}
400 | {b, c, k, s, p}          | {c, b, p}
500 | {a, f, c, e, l, p, m, n} | {f, c, a, m, p}

1. Scan the DB once, find frequent 1-itemsets (single-item patterns)
2. Sort frequent items in descending order of frequency: f:4, c:4, a:3, b:3, m:3, p:3, giving F-list = f-c-a-b-m-p
3. Scan the DB again and construct the FP-tree: insert each transaction's ordered frequent items as a path from the root; shared prefixes share nodes, whose counts are incremented, and a header table links all nodes of each frequent item

[Figure: FP-tree built incrementally. After transaction 100 the tree is the single path {} - f:1 - c:1 - a:1 - m:1 - p:1; after transaction 200 it is {} - f:2 - c:2 - a:2 with children m:1 - p:1 and b:1 - m:1; after all five transactions the root has children f:4 and c:1, with paths f:4 - c:3 - a:3 - m:2 - p:2, f:4 - c:3 - a:3 - b:1 - m:1, f:4 - b:1, and c:1 - b:1 - p:1.]

Benefits of the FP-tree Structure

• Completeness
  – Preserves complete information for frequent pattern mining
  – Never breaks a long pattern of any transaction
• Compactness
  – Reduces irrelevant info: infrequent items are gone
  – Items in frequency descending order: the more frequently occurring, the more likely to be shared
  – Never larger than the original database (if we do not count node-links and the count field)
  – For some example DBs, compression ratio over 100
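A compact FP-tree construction sketch (not the slides' exact data structure), assuming simple Python objects with parent pointers and a header table of node-links; ties in the f-list are broken arbitrarily, so the resulting tree may differ slightly from the figure.

from collections import Counter

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent, self.count = item, parent, 0
        self.children = {}

def build_fptree(transactions, min_support):
    # Pass 1: count items, keep the frequent ones in descending frequency (the f-list)
    counts = Counter(item for t in transactions for item in t)
    flist = [item for item, c in counts.most_common() if c >= min_support]
    rank = {item: r for r, item in enumerate(flist)}
    # Pass 2: insert each transaction's ordered frequent items as a path from the root
    root = FPNode(None, None)
    header = {item: [] for item in flist}            # item -> list of nodes (node-links)
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in rank), key=rank.get):
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += 1
            node = child
    return root, header, flist

db = [list("facdgimp"), list("abcflmo"), list("bfhjow"), list("bcksp"), list("afcelpmn")]
root, header, flist = build_fptree(db, 3)
print(flist)                               # f, c, a, b, m, p in some frequency-descending order
print(sum(n.count for n in header["m"]))   # 3: total occurrences of m in the tree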

Partition Patterns and Databases

• Frequent patterns can be partitioned into subsets according to the f-list
  – F-list = f-c-a-b-m-p
  – Patterns containing p
  – Patterns having m, but no p
  – Patterns having b, but neither m nor p
  – …
  – Patterns having c, but neither a, b, m, nor p
  – Pattern f
• This partitioning is complete and non-redundant

Construct Conditional Pattern Base for Item x

• Conditional pattern base = set of prefix paths in the FP-tree that co-occur with x
• Traverse the FP-tree by following the node-links of frequent item x in the header table
• Accumulate the prefix paths with their frequency counts

Conditional pattern bases for the FP-tree above:

Item | Conditional pattern base
c    | f:3
a    | fc:3
b    | fca:1, f:1, c:1
m    | fca:2, fcab:1
p    | fcam:2, cb:1
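Continuing the FPNode/build_fptree sketch above (this snippet assumes those definitions), the conditional pattern base of an item can be collected by following its node-links and climbing the parent pointers; the exact prefix paths depend on how f-list ties were broken, so they may differ from the table.

def conditional_pattern_base(item, header):
    """Return a list of (prefix_path, count) pairs for `item`."""
    base = []
    for node in header[item]:                  # all occurrences of the item in the tree
        path, parent = [], node.parent
        while parent is not None and parent.item is not None:
            path.append(parent.item)
            parent = parent.parent
        if path:
            base.append((list(reversed(path)), node.count))
    return base

# Prefix paths that co-occur with "p", each with its count (cf. the table above):
print(conditional_pattern_base("p", header))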

From Conditional Pattern Bases to Conditional FP-Trees

• For each conditional pattern base:
  – Accumulate the count for each item in the base
  – Construct the FP-tree for the frequent items of the pattern base
• Example: the m-conditional pattern base is fca:2, fcab:1; its frequent items are f:3, c:3, a:3, so the m-conditional FP-tree is the single path {} - f:3 - c:3 - a:3
• All frequent patterns having m, but not p: m, fm, cm, am, fcm, fam, cam, fcam

Recursion: Mining Conditional FP-Trees

[Figure: recursive mining of the m-conditional FP-tree. For item a, output "am"; the conditional pattern base of "am" is fc:3, giving the am-conditional FP-tree {} - f:3 - c:3. For item c, output "cm"; the conditional pattern base of "cm" is f:3, giving the cm-conditional FP-tree {} - f:3. For item f, output "fm"; its conditional pattern base is empty. Recursing on the am-conditional FP-tree outputs "cam" (conditional pattern base f:3), and so on until all patterns containing m are produced.]
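A compact pattern-growth sketch (not the slides' algorithm verbatim): it keeps each conditional pattern base as a plain list of (prefix path, count) pairs instead of building explicit conditional FP-trees, but it follows the same recursion over conditional pattern bases, so every frequent itemset is produced exactly once.

from collections import Counter

def pattern_growth(cond_db, min_support, suffix=()):
    """cond_db: list of (ordered item list, count) pairs; yields (pattern, support)."""
    counts = Counter()
    for items, cnt in cond_db:
        for i in items:
            counts[i] += cnt
    for item, sup in counts.items():
        if sup < min_support:
            continue
        pattern = suffix + (item,)
        yield pattern, sup
        # Conditional pattern base of `item`: the prefix path before `item`, with counts
        cond_base = [(items[:items.index(item)], cnt)
                     for items, cnt in cond_db if item in items]
        yield from pattern_growth([(p, c) for p, c in cond_base if p],
                                  min_support, pattern)

# Order each transaction by one global frequency-descending f-list, then mine:
db = [list("facdgimp"), list("abcflmo"), list("bfhjow"), list("bcksp"), list("afcelpmn")]
freq = Counter(i for t in db for i in t)
flist = sorted([i for i in freq if freq[i] >= 3], key=lambda i: (-freq[i], i))
rank = {i: r for r, i in enumerate(flist)}
ordered_db = [(sorted([i for i in t if i in rank], key=rank.get), 1) for t in db]
for pattern, sup in pattern_growth(ordered_db, 3):
    print(pattern, sup)     # every frequent itemset appears exactly once, e.g. ('p',) 3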

FP-Tree Algorithm Summary

• Idea: frequent pattern growth
  – Recursively grow frequent patterns by pattern and database partition
• Method
  – For each frequent item, construct its conditional pattern-base, and then its conditional FP-tree
  – Repeat the process recursively on each newly created conditional FP-tree
  – Stop the recursion when the resulting FP-tree is empty
    • Optimization if the tree contains only one path: the single path will generate all the combinations of its sub-paths, each of which is a frequent pattern

FP-Growth vs. Apriori: Scalability With Support Threshold

[Figure: run time (sec.) as a function of support threshold (%) on data set T25I20D10K, comparing D1 FP-growth runtime and D1 Apriori runtime.]

Why Is FP-Growth the Winner?

• Divide-and-conquer
  – Decompose both the mining task and the DB according to the frequent patterns obtained so far
  – Leads to focused search of smaller databases
• Other factors
  – No candidate generation, no candidate test
  – Compressed database: FP-tree structure
  – No repeated scan of the entire database
  – Basic operations: counting local frequent single items and building sub FP-trees
    • No pattern search and matching

Factors Affecting Mining Cost

• Choice of minimum support threshold
  – Lower support threshold => more frequent itemsets
    • More candidates, longer frequent itemsets
• Dimensionality (number of items) of the data set
  – More space needed to store the support count of each item
  – If the number of frequent items also increases, both computation and I/O costs may increase
• Size of database
  – Each pass over the DB is more expensive
• Average transaction width
  – May increase the maximum length of frequent itemsets and the number of hash-tree traversals (more subsets supported by a transaction)
• How can we further reduce some of these costs?

Compact Representation of Frequent Itemsets

• Some itemsets are redundant because they have identical support as their supersets
• Example: a database of 15 transactions over 30 items, where transactions 1-5 contain exactly the items A1-A10, transactions 6-10 contain exactly B1-B10, and transactions 11-15 contain exactly C1-C10
  – Every non-empty subset of A1-A10 (and likewise for the B and C blocks) has support 5, so for any minsup ≤ 5 the number of frequent itemsets is

    3 × Σ_{k=1}^{10} C(10, k) = 3 × (2^10 − 1) = 3069

• Need a compact representation

Maximal Frequent Itemset

• An itemset is maximal-frequent if none of its supersets is frequent

[Figure: itemset lattice over A–E showing the border between frequent and infrequent itemsets; the maximal frequent itemsets are the frequent itemsets directly below the border.]

Closed Itemset

• A frequent itemset is closed if none of its supersets has the same support
  – Lossless compression of the set of all frequent itemsets

Example (min_sup = 2):

TID | Items
1   | {A, B}
2   | {B, C, D}
3   | {A, B, C, D}
4   | {A, B, D}
5   | {A, B, C, D}

Itemset      | Support
{A}          | 4
{B}          | 5
{C}          | 3
{D}          | 4
{A, B}       | 4
{A, C}       | 2
{A, D}       | 3
{B, C}       | 3
{B, D}       | 4
{C, D}       | 3
{A, B, C}    | 2
{A, B, D}    | 3
{A, C, D}    | 2
{B, C, D}    | 3
{A, B, C, D} | 2

Maximal vs Closed Frequent Itemsets

[Figure: itemset lattice over A–E annotated with the IDs of the supporting transactions from the database TID 1: ABC, 2: ABCD, 3: BCE, 4: ACDE, 5: DE; itemsets not supported by any transaction are marked.]

Maximal vs Closed Frequent Itemsets

[Figure: the same lattice with minimum support = 2, highlighting the closed frequent itemsets, the maximal frequent itemsets, and the itemsets that are closed but not maximal. In this example, # Closed = 9 and # Maximal = 4.]

• How to efficiently find maximal frequent itemsets? (similar for closed ones)
  – Naïve: first find all frequent itemsets, then remove the non-maximal ones
  – Better: use the maximality property for pruning
    • Effectiveness depends on the itemset generation strategy
    • See book for details
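A brute-force sketch (not from the slides) that, given all frequent itemsets with their supports, marks the maximal and the closed ones by checking proper supersets; it uses the small ABCD database from the closed-itemset example above, so the counts differ from the A–E lattice figure.

from itertools import combinations

def closed_and_maximal(frequent):
    """frequent: dict mapping frozenset itemsets to support counts."""
    closed, maximal = set(), set()
    for itemset, sup in frequent.items():
        supersets = [s for s in frequent if itemset < s]   # proper frequent supersets
        if not supersets:
            maximal.add(itemset)
        if all(frequent[s] != sup for s in supersets):
            closed.add(itemset)
    return closed, maximal

# Enumerate all frequent itemsets of the ABCD example (min_sup = 2) by brute force:
db = [{"A", "B"}, {"B", "C", "D"}, {"A", "B", "C", "D"}, {"A", "B", "D"}, {"A", "B", "C", "D"}]
items = sorted(set().union(*db))
frequent = {}
for k in range(1, len(items) + 1):
    for combo in combinations(items, k):
        sup = sum(1 for t in db if set(combo) <= t)
        if sup >= 2:
            frequent[frozenset(combo)] = sup
closed, maximal = closed_and_maximal(frequent)
print(len(frequent), len(closed), len(maximal))   # 15 6 1: fifteen frequent, six closed, one maximal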

Alternative Methods for Frequent Itemset Generation

• Traversal of the itemset lattice
  – General-to-specific (Apriori), specific-to-general (good for pruning when searching for maximal frequent itemsets), or bidirectional
  – Equivalence classes: search one class (e.g., a prefix-tree or suffix-tree class of the lattice) completely before moving on to the next one
  – Breadth-first vs. depth-first
    • Apriori is breadth-first (good for pruning)
    • Depth-first is often good for maximal frequent itemsets: it discovers large frequent itemsets quickly, which can then be used for pruning

[Figures: itemset lattices illustrating (a) general-to-specific, (b) specific-to-general, and (c) bidirectional traversal of the frequent itemset border; (a) prefix-tree vs. (b) suffix-tree equivalence classes; (a) breadth-first vs. (b) depth-first traversal.]

Extension: Mining Multiple-Level Association Rules

• Items often form hierarchies
  – The most relevant pattern might only show at the right granularity
• Flexible support settings
  – Items at a lower level are expected to have lower support

Example: Milk at level 1 with support 10%; 2% Milk and Skim Milk at level 2 with support 6% and 4%.

                | Level 1 min_sup | Level 2 min_sup
Uniform support | 5%              | 5%
Reduced support | 5%              | 3%

Extension: Mining Multi-Dimensional Associations

• Single-dimensional rules: one type of predicate
  – buys(X, "milk") ⇒ buys(X, "bread")
• Multi-dimensional rules: ≥ 2 types of predicates
  – Interdimensional association rules (no repeated predicates)
    • age(X, "19-25") ∧ occupation(X, "student") ⇒ buys(X, "coke")
  – Hybrid-dimensional association rules (repeated predicates)
    • age(X, "19-25") ∧ buys(X, "popcorn") ⇒ buys(X, "coke")
• See book for efficient mining algorithms

Frequent Pattern Mining Overview

• Basic Concepts and Challenges
• Efficient and Scalable Methods for Frequent Itemsets and Association Rules
• Pattern Interestingness Measures
• Sequence Mining

Lift

• Example: 2000 transactions have bread and milk, 1000 have bread but no milk, 1750 have milk but no bread, 250 have neither
• The rule bread → milk has support 0.4 and confidence 0.67
• Does it mean that people who buy bread also tend to buy milk?
  – Misleading: 75% of all people buy milk, while among bread purchasers only 67% do
  – But bread → [no milk] only has support 0.2 and confidence 0.33
• Measure of dependent/correlated events: lift (A, B are itemsets)

  lift(A, B) = P(A ∪ B) / (P(A) · P(B))

  lift(Bread, Milk)    = (2000/5000) / ((3000/5000) · (3750/5000)) = 0.89
  lift(Bread, no Milk) = (1000/5000) / ((3000/5000) · (1250/5000)) = 1.33

Lift vs. Other Correlation Measures

• Intuition: are milk (m) and coffee (c) usually bought together? Consider the 2×2 contingency table with cells (m, c), (~m, c), (m, ~c), (~m, ~c)
  – In the A data sets, (m, c) > (~m, c) + (m, ~c): m and c are mostly bought together
  – In the B data sets, m and c are independent
  – In the C data sets, m and c are mostly not bought together
• All measures are good for the B data sets
• Lift and χ² are bad for the A and C data sets
  – Reason: they are strongly affected by the number of null-transactions (those containing neither m nor c)
• all_confidence and cosine are good for the A and C data sets
  – They are not affected by the number of null-transactions

  all_conf(A) = sup(A) / max_item_sup(A)
  cosine(A, B) = P(A ∪ B) / sqrt(P(A) · P(B))

  Cosine vs. lift: cosine does not depend on the size of the DB.
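A small sketch (not from the slides) computing lift, all_confidence, and cosine from the bread/milk counts of the example; variable names are illustrative.

from math import sqrt

n_bread_milk, n_bread_only, n_milk_only, n_neither = 2000, 1000, 1750, 250
N = n_bread_milk + n_bread_only + n_milk_only + n_neither   # 5000 transactions

p_bread_and_milk = n_bread_milk / N
p_bread = (n_bread_milk + n_bread_only) / N
p_milk = (n_bread_milk + n_milk_only) / N

lift = p_bread_and_milk / (p_bread * p_milk)
all_conf = p_bread_and_milk / max(p_bread, p_milk)   # sup(A) / max_item_sup(A)
cosine = p_bread_and_milk / sqrt(p_bread * p_milk)

print(round(lift, 2), round(all_conf, 2), round(cosine, 2))   # 0.89 0.53 0.6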

Which Measure Is Best?

• Does it identify the right patterns?
• Does it result in an efficient mining algorithm?

[Table: about twenty interestingness measures (correlation φ, lambda, odds ratio, Yule's Q, Yule's Y, Cohen's κ, mutual information, J-measure, Gini index, support, confidence, Laplace, conviction, interest, IS (cosine), Piatetsky-Shapiro's, certainty factor, added value, collective strength, Jaccard, Klosgen's) with their value ranges, checked against the desirable properties P1-P3 and O1-O4, e.g., symmetry under variable permutation (O1), which we do not cover in this class.]

Take-away message: no interestingness measure has all the desirable properties.

Frequent Pattern Mining Overview

• Basic Concepts and Challenges
• Efficient and Scalable Methods for Frequent Itemsets and Association Rules
• Pattern Interestingness Measures
• Sequence Mining

Introduction

• Sequence mining: relevant for transaction, time-series, and sequence databases
• Applications of sequential pattern mining
  – Customer shopping sequences: first buy a computer, then a peripheral device within 3 months
  – Medical treatments, natural disasters (e.g., earthquakes), science & engineering processes, stocks and markets
  – Telephone calling patterns, Web log click streams
  – DNA sequences and gene structures

What Is Sequential Pattern Mining?

• Given a set of sequences, find all frequent subsequences

A sequence database:

SID | sequence
10  | <a(abc)(ac)d(cf)>
20  | <(ad)c(bc)(ae)>
30  | <(ef)(ab)(df)cb>
40  | <eg(af)cbc>

• A sequence, e.g., <(ef)(ab)(df)cb>, is an ordered list of elements; an element may contain a set of items. Items within an element are unordered and we list them alphabetically.
• <a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>
• Given support threshold min_sup = 2, <(ab)c> is a sequential pattern

Challenges of Sequential Pattern Mining

• Huge number of possible patterns
• A mining algorithm should
  – find all patterns satisfying the minimum support threshold
  – be highly efficient and scalable
  – be able to incorporate user-specific constraints
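A minimal sketch (not from the slides) of the subsequence test used throughout this section, assuming a sequence is a list of elements and each element is a set of items.

def is_subsequence(sub, seq):
    """True if every element of `sub` is contained in some element of `seq`,
    in order (elements of `seq` may be skipped)."""
    i = 0
    for element in seq:
        if i < len(sub) and sub[i] <= element:
            i += 1
    return i == len(sub)

s10 = [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}, {"c", "f"}]   # <a(abc)(ac)d(cf)>
print(is_subsequence([{"a"}, {"b", "c"}, {"d"}, {"c"}], s10))   # True: <a(bc)dc>
print(is_subsequence([{"a", "d"}], s10))                        # False: no element contains both a and d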

Apriori Property of Sequential Patterns

• If a sequence S is not frequent, then none of the super-sequences of S is frequent
  – E.g., if <hb> is infrequent, then so are <hab> and <(ah)b>

Example sequence database (support threshold min_sup = 2; find all frequent subsequences):

Seq. ID | Sequence
10      | <(bd)cb(ac)>
20      | <(bf)(ce)b(fg)>
30      | <(ah)(bf)abf>
40      | <(be)(ce)d>
50      | <a(bd)bcb(ade)>

GSP: Generalized Sequential Pattern Mining

• Initially, every item in the DB is a candidate of length k=1
• For each level (i.e., sequences of length k):
  – Scan the database to collect the support count for each candidate sequence
  – Generate candidate length-(k+1) sequences from the length-k frequent sequences
    • Join phase: sequences s1 and s2 join if s1 without its first item is identical to s2 without its last item
    • Prune phase: delete candidates that contain a length-k subsequence that is not among the frequent ones
• Repeat until no frequent sequence or no candidate can be found
• Major strength: candidate pruning by the Apriori property
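A heavily simplified GSP-style sketch (not the full algorithm): it assumes single-item elements only, so it shows the level-wise scan, join, and prune structure but not the itemset-element cases; the toy database and names are purely illustrative.

def is_subseq(sub, seq):
    i = 0
    for x in seq:
        if i < len(sub) and sub[i] == x:
            i += 1
    return i == len(sub)

def gsp_single_items(db, min_sup):
    items = sorted({x for s in db for x in s})
    candidates = [(x,) for x in items]
    result = {}
    while candidates:
        # Scan DB: count the support of each candidate
        counts = {c: sum(1 for s in db if is_subseq(c, s)) for c in candidates}
        frequent = [c for c, n in counts.items() if n >= min_sup]
        result.update({c: counts[c] for c in frequent})
        freq_set = set(frequent)
        # Join phase: s1 without its first item equals s2 without its last item
        joined = [s1 + (s2[-1],) for s1 in frequent for s2 in frequent
                  if s1[1:] == s2[:-1]]
        # Prune phase: drop candidates with an infrequent subsequence (one item removed)
        candidates = [c for c in joined
                      if all(c[:i] + c[i + 1:] in freq_set for i in range(len(c)))]
    return result

db = [list("abcb"), list("bcbf"), list("ababf"), list("bcd"), list("abbcbad")]
print(gsp_single_items(db, min_sup=3))   # frequent single-item-element subsequences with counts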

Finding Length-1 Sequential Patterns

• Initial candidates: all singleton sequences <a>, <b>, <c>, <d>, <e>, <f>, <g>, <h>
• Scan the database once and count support for the candidates (min_sup = 2):

Cand | Sup
<a>  | 3
<b>  | 5
<c>  | 4
<d>  | 3
<e>  | 3
<f>  | 2
<g>  | 1
<h>  | 1

GSP: Generating Length-2 Candidates

• From the 6 frequent length-1 sequences, 51 length-2 candidates are generated:
  – 36 candidates of the form <xy> (two one-item elements), e.g., <aa>, <ab>, …, <ff>
  – 15 candidates of the form <(xy)> (one two-item element), e.g., <(ab)>, <(ac)>, …, <(ef)>
• Without the Apriori property, 8*8 + 8*7/2 = 92 candidates would be generated
• Apriori pruning thus removes 44.57% of the candidates

The GSP Mining Process

Bottom-up, level-wise mining with min_sup = 2 on the example database above:
• Scan 1: 8 candidates (<a> … <h>), 6 length-1 sequential patterns
• Scan 2: 51 candidates (<aa>, <ab>, …, <af>, <ba>, <bb>, …, <ff>, <(ab)>, …, <(ef)>), 19 length-2 sequential patterns, 10 candidates not in the DB at all
• Scan 3: 47 candidates (<abb>, <aab>, <aba>, <baa>, <bab>, …), 19 length-3 sequential patterns, 20 candidates not in the DB at all
• Scan 4: 8 candidates (<abba>, <(bd)bc>, …), 6 length-4 sequential patterns
• Scan 5: 1 candidate (<(bd)cba>), 1 length-5 sequential pattern
• Some candidates cannot pass the support threshold; others do not appear in the DB at all

Candidate Generate-and-Test Drawbacks

• Huge set of candidate sequences generated
• Multiple scans of the entire database needed
  – The length of each candidate grows by one at each database scan

Mining Sequential Patterns by Prefix Projections: Prefix and Suffix

• <a>, <aa>, <a(ab)> and <a(abc)> are prefixes of the sequence <a(abc)(ac)d(cf)>
• Given the sequence <a(abc)(ac)d(cf)>, we have:

Prefix | Suffix (prefix-based projection)
<a>    | <(abc)(ac)d(cf)>
<aa>   | <(_bc)(ac)d(cf)>
<ab>   | <(_c)(ac)d(cf)>
<(bc)> | <(ac)d(cf)>
<bc>   | <d(cf)>

Mining Sequential Patterns by Prefix Projections

• Step 1: find length-1 frequent sequential patterns
  – <a>, <b>, <c>, <d>, <e>, <f>
• Step 2: divide the search space. The complete set of sequential patterns can be partitioned into six subsets:
  – the ones having prefix <a>;
  – the ones having prefix <b>;
  – …
  – the ones having prefix <f>
• (Using the sequence database SID 10 <a(abc)(ac)d(cf)>, 20 <(ad)c(bc)(ae)>, 30 <(ef)(ab)(df)cb>, 40 <eg(af)cbc>.)
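A simplified PrefixSpan-style sketch (not the slides' full algorithm): it assumes single-item elements, so the (_x) within-element cases are not handled, but it shows the recursive prefix projection; the toy database flattens the example sequences into single items.

def prefixspan(db, min_sup, prefix=()):
    """db: list of item sequences (projected suffixes); yields (pattern, support)."""
    # Count each item's support: number of suffixes in which it occurs
    counts = {}
    for seq in db:
        for item in set(seq):
            counts[item] = counts.get(item, 0) + 1
    for item, sup in sorted(counts.items()):
        if sup < min_sup:
            continue
        pattern = prefix + (item,)
        yield pattern, sup
        # Project: keep the suffix after the first occurrence of `item` in each sequence
        projected = [seq[seq.index(item) + 1:] for seq in db if item in seq]
        yield from prefixspan([s for s in projected if s], min_sup, pattern)

db = [list("abcacdcf"), list("adcbcae"), list("efabdfcb"), list("egafcbc")]
for pattern, sup in prefixspan(db, 2):
    print(pattern, sup)    # e.g. ('a', 'c') 4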

Finding Sequential Patterns with Prefix <a>

• Only need to consider projections w.r.t. <a>
  – <a>-projected database: <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
• Find all length-2 frequent sequential patterns having prefix <a>: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
  – Further partition into those 6 subsets:
    • having prefix <aa>;
    • having prefix <ab>;
    • having prefix <(ab)>;
    • …
    • having prefix <af>

Completeness of PrefixSpan

• Starting from the sequence database (SID 10 <a(abc)(ac)d(cf)>, 20 <(ad)c(bc)(ae)>, 30 <(ef)(ab)(df)cb>, 40 <eg(af)cbc>):
  – Length-1 sequential patterns: <a>, <b>, <c>, <d>, <e>, <f>
  – Partition by prefix: the <a>-projected database (<(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>), the <b>-projected database, …, the <f>-projected database
  – In the <a>-projected database, the length-2 sequential patterns with prefix <a> are <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
  – Recurse on the <aa>-projected database, …, the <af>-projected database, and so on

Efficiency of PrefixSpan

• No candidate sequence needs to be generated
• Projected databases keep shrinking
• Major cost of PrefixSpan: constructing projected databases
  – Can be improved by pseudo-projection

Pseudo-Projection

• Major cost of PrefixSpan: projection
  – Postfixes of sequences often appear repeatedly in recursive projected databases
• When the (projected) database can be held in memory, use pointers instead of copies
  – Each projected entry is a pointer to the sequence plus the offset of the postfix
  – Example for s = <a(abc)(ac)d(cf)>:
    s|<a>  : (pointer to s, offset 2), representing <(abc)(ac)d(cf)>
    s|<ab> : (pointer to s, offset 4), representing <(_c)(ac)d(cf)>
• Why is this a bad idea when the (projected) database does not fit in memory?
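A tiny pseudo-projection sketch (not from the slides), assuming single-item elements and sequences stored as strings: each projected entry is a (sequence id, offset) pair, and postfixes are materialized only when needed.

db = ["abcacdcf", "adcbcae", "efabdfcb", "egafcbc"]

def project(entries, item):
    """Project each (sid, offset) entry on `item`; keep only a pointer plus a new offset."""
    out = []
    for sid, off in entries:
        pos = db[sid].find(item, off)
        if pos != -1:
            out.append((sid, pos + 1))
        # nothing is copied: the postfix is db[sid][pos + 1:] only when needed
    return out

full = [(sid, 0) for sid in range(len(db))]
a_proj = project(full, "a")
ab_proj = project(a_proj, "b")
print(ab_proj)                                   # [(0, 2), (1, 4), (2, 4), (3, 6)]
print([db[sid][off:] for sid, off in ab_proj])   # materialize the postfixes on demand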

Pseudo-Projection vs. Physical Projection

• Pseudo-projection avoids physically copying postfixes
  – Efficient in running time and space when the database can be held in main memory
• Not efficient when the database cannot fit in main memory
  – Disk-based random access
• Suggested approach:
  – Integration of physical and pseudo-projection
  – Swap to pseudo-projection when the data set fits in memory

Performance on Data Set C10T8S8I8

[Figure: runtime comparison on data set C10T8S8I8, not reproduced here.]

Performance on Data Set Gazelle / Effect of Pseudo-Projection

[Figures: runtime comparison on data set Gazelle and the effect of pseudo-projection, not reproduced here.]

Sequence Mining Variations

• Multidimensional and multilevel patterns
• Constraint-based mining of sequential patterns
• Periodicity analysis
• Mining biological sequences
  – Hot research area, major topic by itself
• All of these are not discussed in class; see book
• Some of my own research: finding relevant sequences in bursty data; see paper

Frequent-Pattern Mining: Summary

• Important task in data mining
• Scalable frequent pattern mining methods
  – Apriori (itemsets, candidate generation & test)
  – GSP (sequences, candidate generation & test)
  – Projection-based (FP-growth for itemsets, PrefixSpan for sequences)
• Mining a variety of rules and interesting patterns

Frequent-Pattern Mining: Research Problems

• Mining fault-tolerant frequent, sequential, and structured patterns
  – Patterns allow limited faults (insertion, deletion, mutation)
• Mining truly interesting patterns
  – Surprising, novel, concise, …
• Application exploration
  – E.g., DNA sequence analysis and bio-pattern classification
