DWDWM Unit 2
Mining Frequent Patterns, Association and
Correlations: Basic Concepts and Methods
Basic Concepts
Evaluation Methods
Summary
What Is Frequent Pattern Analysis?
Frequent pattern: a pattern (a set of items, subsequences, substructures,
etc.) that occurs frequently in a data set
First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of
frequent itemsets and association rule mining
Motivation: Finding inherent regularities in data
What products were often purchased together?— Beer and diapers?!
What are the subsequent purchases after buying a PC?
What kinds of DNA are sensitive to this new drug?
Can we automatically classify web documents?
Applications
Basket data analysis, cross-marketing, catalog design, sale campaign
analysis, Web log (click stream) analysis, and DNA sequence
analysis.
Why Is Freq. Pattern Mining Important?
Basic Concepts: Association Rules

Tid | Items bought
10  | Beer, Nuts, Diaper
20  | Beer, Coffee, Diaper
30  | Beer, Diaper, Eggs
40  | Nuts, Eggs, Milk
50  | Nuts, Coffee, Diaper, Eggs, Milk

Find all the rules X → Y with minimum support and confidence
support, s: probability that a transaction contains X ∪ Y
confidence, c: conditional probability that a transaction having X also contains Y
(Figure: Venn diagram of customers who buy beer, customers who buy diapers, and customers who buy both)
Let minsup = 50%, minconf = 50%
Freq. Pat.: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3
Association rules: (many more!)
Beer → Diaper (60%, 100%)
Diaper → Beer (60%, 75%)
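To make support and confidence concrete, here is a minimal sketch that recomputes the numbers above from the Tid table; the data and thresholds come from the slide, while the function and variable names are illustrative.

```python
# Transactions from the Tid/Items table above
transactions = [
    {"Beer", "Nuts", "Diaper"},                    # Tid 10
    {"Beer", "Coffee", "Diaper"},                  # Tid 20
    {"Beer", "Diaper", "Eggs"},                    # Tid 30
    {"Nuts", "Eggs", "Milk"},                      # Tid 40
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},  # Tid 50
]

def support(itemset):
    """s: fraction of transactions containing every item of the itemset."""
    return sum(set(itemset) <= t for t in transactions) / len(transactions)

def confidence(x, y):
    """c: conditional probability that a transaction having X also has Y."""
    return support(set(x) | set(y)) / support(x)

print(support({"Beer", "Diaper"}))       # 0.6  -> 60% rule support
print(confidence({"Beer"}, {"Diaper"}))  # 1.0  -> Beer => Diaper (60%, 100%)
print(confidence({"Diaper"}, {"Beer"}))  # 0.75 -> Diaper => Beer (60%, 75%)
```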
Closed Patterns and Max-Patterns
A long pattern contains a combinatorial number of sub-patterns, e.g., {a1, …, a100} contains (100 choose 1) + (100 choose 2) + … + (100 choose 100) = 2^100 − 1 ≈ 1.27 × 10^30 sub-patterns!
Solution: Mine closed patterns and max-patterns instead
An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X (proposed by Pasquier, et al. @ ICDT’99)
An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X (proposed by Bayardo @ SIGMOD’98)
Closed pattern is a lossless compression of freq. patterns
Reducing the # of patterns and rules
Closed Patterns and Max-Patterns
Exercise: Suppose a DB contains only two transactions
<a1, …, a100>, <a1, …, a50>
Let min_sup = 1
What is the set of closed itemsets?
{a1, …, a100}: 1
{a1, …, a50}: 2
What is the set of max-patterns?
{a1, …, a100}: 1
What is the set of all patterns?
{a1}: 2, …, {a1, a2}: 2, …, {a1, a51}: 1, …, {a1, a2, …, a100}: 1
A big number: 2^100 − 1. Why? (A scaled-down version is checked in the sketch below.)
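A brute-force check of the closed/max definitions on a scaled-down analogue of this exercise (5 items instead of 100, so full enumeration stays feasible); the two-transaction structure mirrors the slide, and all names are illustrative.

```python
from itertools import combinations

# Scaled-down analogue of the exercise: <a1..a5>, <a1..a3>, min_sup = 1
transactions = [frozenset(f"a{i}" for i in range(1, 6)),
                frozenset(f"a{i}" for i in range(1, 4))]
items = sorted(set().union(*transactions))
min_sup = 1

def sup(x):
    return sum(x <= t for t in transactions)

# All frequent itemsets (count is exponential in the number of items!)
frequent = {frozenset(c): sup(frozenset(c))
            for k in range(1, len(items) + 1)
            for c in combinations(items, k)
            if sup(frozenset(c)) >= min_sup}

# Closed: no proper superset with the same support
closed = [x for x in frequent
          if not any(x < y and frequent[y] == frequent[x] for y in frequent)]
# Max: no frequent proper superset at all
maximal = [x for x in frequent if not any(x < y for y in frequent)]

print(len(frequent))                               # 2^5 - 1 = 31 patterns
print([(sorted(x), frequent[x]) for x in closed])  # {a1..a5}:1 and {a1..a3}:2
print([sorted(x) for x in maximal])                # only {a1..a5}
```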
Chapter 5: Mining Frequent Patterns, Association
and Correlations: Basic Concepts and Methods
Basic Concepts
Evaluation Methods
Summary
Scalable Frequent Itemset Mining Methods
Apriori: A Candidate Generation & Test Approach
Pattern-Growth Approach: Mining Frequent Patterns Without Candidate Generation
ECLAT: Mining by Exploring Vertical Data Format
The Downward Closure Property and Scalable
Mining Methods
The downward closure property of frequent patterns
Any subset of a frequent itemset must be frequent
If {beer, diaper, nuts} is frequent, so is {beer, diaper}
i.e., every transaction having {beer, diaper, nuts} also
contains {beer, diaper}
Scalable mining methods: Three major approaches
Apriori (Agrawal & Srikant@VLDB’94)
Freq. pattern growth (FPgrowth—Han, Pei & Yin
@SIGMOD’00)
Vertical data format approach (Charm—Zaki & Hsiao
@SDM’02)
Apriori: A Candidate Generation & Test Approach
Apriori pruning principle: if there is any itemset that is infrequent, its supersets should not be generated or tested
Method:
Initially, scan the DB once to get the frequent 1-itemsets
Generate length-(k+1) candidate itemsets from the length-k frequent itemsets
Test the candidates against the DB
Terminate when no frequent or candidate set can be generated
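A minimal sketch of this level-wise generate-and-test loop, assuming the `transactions` list from the association-rule example earlier; the join and prune steps are condensed for brevity, and all names are illustrative.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Level-wise frequent itemset mining with downward-closure pruning."""
    # Level 1: count single items in one DB scan
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {x: c for x, c in counts.items() if c >= min_sup}
    all_frequent = dict(frequent)

    k = 1
    while frequent:
        # Join: unions of frequent k-itemsets that form (k+1)-itemsets
        candidates = {a | b for a in frequent for b in frequent
                      if len(a | b) == k + 1}
        # Prune: keep only candidates whose k-subsets are all frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent
                             for s in combinations(c, k))}
        # Test: one scan of the DB counts every surviving candidate
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        frequent = {x: n for x, n in counts.items() if n >= min_sup}
        all_frequent.update(frequent)
        k += 1
    return all_frequent

# With the 5-transaction table and min_sup = 3 (the slide's minsup = 50%),
# this yields Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3
```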
Further Improvement of the Apriori Method
Partition: Scan Database Only Twice
Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB (if it fell below the relative minsup threshold in every partition, it would fall below it in the whole DB as well)
Scan 1: partition database and find local frequent
patterns
Scan 2: consolidate global frequent patterns
A. Savasere, E. Omiecinski and S. Navathe, VLDB’95
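A minimal sketch of the two-scan partition scheme; `mine` can be any local miner (for instance the apriori sketch above), and partition sizes and names are illustrative.

```python
def partition_mining(transactions, min_sup_ratio, mine, n_parts=2):
    """Scan 1: mine each partition locally; Scan 2: one global count."""
    size = (len(transactions) + n_parts - 1) // n_parts
    parts = [transactions[i:i + size]
             for i in range(0, len(transactions), size)]

    # Scan 1: a globally frequent itemset must be locally frequent somewhere,
    # so the union of local results is a complete candidate set
    candidates = set()
    for part in parts:
        local_min = max(1, int(min_sup_ratio * len(part)))
        candidates |= set(mine(part, local_min))

    # Scan 2: count every candidate once over the whole database
    global_min = min_sup_ratio * len(transactions)
    return {c: n for c in candidates
            if (n := sum(c <= t for t in transactions)) >= global_min}

# e.g. partition_mining(transactions, 0.5, mine=apriori)
```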
DIC: Reduce Number of Scans
Once both A and D are determined frequent, the counting of AD begins
Once all length-2 subsets of BCD are determined frequent, the counting of BCD begins
(Figure: the itemset lattice from {} through the 1-itemsets A, B, C, D, the 2-itemsets AB, AC, BC, AD, BD, CD, the 3-itemsets ABC, ABD, ACD, BCD, up to ABCD; Apriori counts one level of the lattice per scan of the transactions, while DIC starts counting an itemset as soon as all of its subsets are known to be frequent)
S. Brin, R. Motwani, J. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. SIGMOD’97
Scalable Frequent Itemset Mining Methods
Pattern-Growth Approach: Mining Frequent
Patterns Without Candidate Generation
Bottlenecks of the Apriori approach
Breadth-first (i.e., level-wise) search
Candidate generation and test
Often generates a huge number of candidates
The FPGrowth Approach (J. Han, J. Pei, and Y. Yin, SIGMOD’00)
Depth-first search
Avoid explicit candidate generation
Major philosophy: Grow long patterns from short ones using local
frequent items only
“abc” is a frequent pattern
Get all transactions having “abc”, i.e., project DB on abc: DB|abc
“d” is a local frequent item in DB|abc → abcd is a frequent pattern
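A minimal sketch of this grow-from-projections philosophy, using explicit projected (conditional) databases rather than the compressed FP-tree itself, so it illustrates the idea and not FPGrowth's actual data structure; all names are illustrative.

```python
def pattern_growth(transactions, min_sup, suffix=frozenset()):
    """Grow long patterns from short ones using local frequent items only."""
    # Count local frequent items in the current (projected) database
    counts = {}
    for t in transactions:
        for item in t:
            counts[item] = counts.get(item, 0) + 1

    patterns = {}
    # A fixed item order ensures each pattern is generated exactly once
    for item in sorted(i for i, n in counts.items() if n >= min_sup):
        pattern = suffix | {item}
        patterns[pattern] = counts[item]
        # Conditional (projected) DB: transactions containing `item`,
        # keeping only items that sort after it; recurse to grow longer
        projected = [frozenset(j for j in t if j > item)
                     for t in transactions if item in t]
        patterns.update(pattern_growth(projected, min_sup, pattern))
    return patterns

# pattern_growth(transactions, 3) reproduces the frequent patterns
# of the earlier 5-transaction example, without candidate generation
```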
Construct FP-tree from a Transaction Database
Find Patterns Having P From P-conditional Database
(Figure: the FP-tree with its header table; from the root {}, one branch runs f:4 → c:3 → a:3, where a:3 splits into m:2 → p:2 and b:1 → m:1, f:4 also has a child b:1, and a second branch runs c:1 → b:1 → p:1; the header table lists f:4, c:4, a:3, b:3, m:3, p:3 with node-links into the tree)
Conditional pattern bases:

item | cond. pattern base
c    | f:3
a    | fc:3
b    | fca:1, f:1, c:1
m    | fca:2, fcab:1
p    | fcam:2, cb:1
From Conditional Pattern-bases to Conditional FP-trees
m-conditional FP-tree: {} → f:3 → c:3 → a:3
am-conditional FP-tree: {} → f:3 → c:3
Cond. pattern base of “cm”: (f:3) → cm-conditional FP-tree: {} → f:3
A Special Case: Single Prefix Path in FP-tree
(Figure: an FP-tree whose upper part is a single prefix path of nodes a1:n1 → a2:n2 → a3:n3, below which the tree branches into a multipath part with nodes such as C2:k2 and C3:k3; the single prefix path can be reduced to one node and mined separately, and its patterns concatenated with those of the multipath part)
Benefits of the FP-tree Structure
Completeness
Preserve complete information for frequent pattern
mining
Never break a long pattern of any transaction
Compactness
Reduce irrelevant info—infrequent items are gone
Items in frequency descending order: the more
frequently occurring, the more likely to be shared
Never be larger than the original database (not counting node-links and the count field)
The Frequent Pattern Growth Mining Method
Scaling FP-growth by Database Projection
Partition-Based Projection
am-proj DB cm-proj DB
fc f …
fc f
fc f
37
FP-Growth vs. Apriori: Scalability With the Support Threshold
(Figure: run time (sec.), 0 to 70, plotted against support threshold (%), 0 to 3)
FP-Growth vs. Tree-Projection: Scalability with
the Support Threshold
(Figure: run time (sec.), 0 to 100, plotted against support threshold (%), 0 to 2)
Advantages of the Pattern Growth Approach
Divide-and-conquer:
Decompose both the mining task and DB according to the
frequent patterns obtained so far
Lead to focused search of smaller databases
Other factors
No candidate generation, no candidate test
Compressed database: FP-tree structure
No repeated scan of entire database
Basic ops: counting local freq items and building sub FP-tree, no
pattern search and matching
A good open-source implementation and refinement of FPGrowth: FPGrowth+ (Grahne and J. Zhu, FIMI'03)
Further Improvements of Mining Methods
Extension of Pattern Growth Mining Methodology
ECLAT: Mining by Exploring Vertical Data Format
Vertical format: t(AB) = {T11, T25, …}
tid-list: list of trans.-ids containing an itemset
Deriving frequent patterns based on vertical intersections
t(X) = t(Y): X and Y always happen together
t(X) ⊂ t(Y): a transaction having X always has Y
Using diffset to accelerate mining
Only keep track of differences of tids
t(X) = {T1, T2, T3}, t(XY) = {T1, T3} → Diffset(XY, X) = {T2}
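A minimal sketch of the vertical format over the earlier 5-transaction example; the tid-list construction and names are illustrative.

```python
# Build the vertical format: item -> tid-list (set of transaction ids)
tids = (10, 20, 30, 40, 50)
tidlists = {}
for tid, t in zip(tids, transactions):
    for item in t:
        tidlists.setdefault(item, set()).add(tid)

# Support of an itemset = size of the intersection of its tid-lists
t_bd = tidlists["Beer"] & tidlists["Diaper"]  # t({Beer, Diaper})
print(sorted(t_bd), len(t_bd))                # [10, 20, 30], support count 3

# Diffset: keep only the tids lost when extending {Diaper} to {Beer, Diaper}
diffset = tidlists["Diaper"] - t_bd
print(sorted(diffset))  # [50]; support(XY) = support(X) - |Diffset(XY, X)|
```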
Mining Frequent Closed Patterns: CLOSET
Visualization of Association Rules: Rule Graph
Visualization of Association Rules
(SGI/MineSet 3.0)
Computational Complexity of Frequent Itemset
Mining
How many itemsets may potentially be generated in the worst case?
The number of frequent itemsets to be generated is sensitive to the minsup threshold
When minsup is low, there exist potentially an exponential number of frequent itemsets
The worst case: M^N, where M = # of distinct items and N = max transaction length
The worst-case complexity vs. the expected probability
Ex. Suppose Walmart has 10^4 kinds of products
The chance to pick up one product: 10^-4
The chance to pick up a particular set of 10 products: ~10^-40
What is the chance this particular set of 10 products is frequent 10^3 times in 10^9 transactions? (Its expected count would be 10^9 × 10^-40 = 10^-31, i.e., effectively zero.)
Chapter 5: Mining Frequent Patterns, Association
and Correlations: Basic Concepts and Methods
Basic Concepts
Evaluation Methods
Summary
Interestingness Measure: Correlations (Lift)
lift = P(A ∪ B) / (P(A) × P(B))

            Basketball | Not basketball | Sum (row)
Cereal      2000       | 1750           | 3750
Not cereal  1000       | 250            | 1250
Sum (col.)  3000       | 2000           | 5000

lift(B, C) = (2000/5000) / ((3000/5000) × (3750/5000)) = 0.89
lift(B, ¬C) = (1000/5000) / ((3000/5000) × (1250/5000)) = 1.33
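A quick recomputation of the two lift values; the 2×2 counts come straight from the table above, and everything else is illustrative.

```python
# Contingency table counts from the slide (N = 5000 students)
n = 5000
n_b = 3000       # basketball
n_c = 3750       # cereal
n_bc = 2000      # basketball and cereal
n_b_notc = 1000  # basketball and not cereal

def lift(n_ab, n_a, n_b, n):
    """lift(A, B) = P(A and B) / (P(A) * P(B))."""
    return (n_ab / n) / ((n_a / n) * (n_b / n))

print(round(lift(n_bc, n_b, n_c, n), 2))          # 0.89: negatively correlated
print(round(lift(n_b_notc, n_b, n - n_c, n), 2))  # 1.33: positively correlated
```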
Are lift and χ² Good Measures of Correlation?
Null-Invariant Measures
Comparison of Interestingness Measures
Null-(transaction) invariance is crucial for correlation analysis
Lift and χ² are not null-invariant
5 null-invariant measures

            Milk   | No Milk | Sum (row)
Coffee      m, c   | ~m, c   | c
No Coffee   m, ~c  | ~m, ~c  | ~c
Sum (col.)  m      | ~m      | Σ
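A small sketch contrasting lift with two of the null-invariant measures, Kulczynski and cosine, using their standard definitions; the counts are illustrative and chosen so that only the number of null transactions (~m, ~c) differs between the two calls.

```python
import math

def measures(mc, m_notc, notm_c, notm_notc):
    """Lift vs. two null-invariant measures on a 2x2 table of counts."""
    n = mc + m_notc + notm_c + notm_notc
    p_m, p_c, p_mc = (mc + m_notc) / n, (mc + notm_c) / n, mc / n
    lift = p_mc / (p_m * p_c)
    # Kulczynski: average of P(c|m) and P(m|c)
    kulc = 0.5 * (mc / (mc + m_notc) + mc / (mc + notm_c))
    cosine = p_mc / math.sqrt(p_m * p_c)
    return lift, kulc, cosine

# Same (m, c) counts, wildly different numbers of null transactions:
print(measures(10000, 1000, 1000, 100))     # few null transactions
print(measures(10000, 1000, 1000, 100000))  # many null transactions
# Lift swings by an order of magnitude with the null count;
# Kulczynski and cosine stay at 0.91: that is null-invariance.
```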
Basic Concepts
Evaluation Methods
Summary
Summary