06FPBasic

The document discusses mining frequent patterns, specifically focusing on association rule mining, which aims to identify rules predicting item occurrences based on transaction data. It outlines key concepts such as frequent itemsets, support, and confidence, and introduces the Apriori algorithm for generating frequent itemsets and association rules. The document emphasizes the computational challenges and strategies to optimize the mining process.


Based on slides from Han J., et al. (2013)
Chapter 6: Mining Frequent Patterns, Association and Correlations

• Basic Concepts
• Frequent Itemset Mining Methods
• Which Patterns Are Interesting? — Pattern Evaluation Methods
• Summary

Association Rule Mining

Given a set of transactions, find rules that predict the occurrence of an item based on the occurrences of other items in the transaction.

Market-basket transactions:

TID | Items
1   | Bread, Milk
2   | Bread, Diaper, Beer, Eggs
3   | Milk, Diaper, Beer, Coke
4   | Bread, Milk, Diaper, Beer
5   | Bread, Milk, Diaper, Coke

Examples of association rules:
  {Diaper} → {Beer}
  {Milk, Bread} → {Eggs, Coke}
  {Beer, Bread} → {Milk}

Implication means co-occurrence, not causality!
Definition: Frequent Itemset

• Itemset
  • A collection of one or more items
  • Example: {Milk, Bread, Diaper}
• k-itemset
  • An itemset that contains k items
• Support count (σ)
  • Frequency of occurrence of an itemset
  • E.g. σ({Milk, Bread, Diaper}) = 2
• Support (s)
  • Fraction of transactions that contain an itemset
  • E.g. s({Milk, Bread, Diaper}) = 2/5
• Frequent Itemset
  • An itemset whose support is greater than or equal to a minsup threshold

TID | Items
1   | Bread, Milk
2   | Bread, Diaper, Beer, Eggs
3   | Milk, Diaper, Beer, Coke
4   | Bread, Milk, Diaper, Beer
5   | Bread, Milk, Diaper, Coke

Note: itemsets are assumed to be ordered lexicographically.
Definition: Association Rule

Let D be a database of transactions, e.g.:

Transaction ID | Items Bought
2000           | A, B, C
1000           | A, C
4000           | A, D
5000           | B, E, F

Let I be the set of items that appear in the database, e.g., I = {A, B, C, D, E, F}.

A rule is defined by X → Y, where X ⊆ I, Y ⊆ I, and X ∩ Y = ∅.
  e.g.: {B, C} → {E} is a rule
Definition: Association Rule

• Association Rule
  • An implication expression of the form X → Y, where X and Y are itemsets
  • Example: {Milk, Diaper} → {Beer}
• Rule Evaluation Metrics
  • Support (s)
    • Fraction of transactions that contain both X and Y
  • Confidence (c)
    • Measures how often items in Y appear in transactions that contain X

TID | Items
1   | Bread, Milk
2   | Bread, Diaper, Beer, Eggs
3   | Milk, Diaper, Beer, Coke
4   | Bread, Milk, Diaper, Beer
5   | Bread, Milk, Diaper, Coke

Example: {Milk, Diaper} → {Beer}

  s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4
  c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 = 0.67
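These two metrics are easy to verify directly on the five market-basket transactions above; a minimal Python sketch (the helper name support_count and the variable names are mine, not from the slides):

  transactions = [
      {"Bread", "Milk"},
      {"Bread", "Diaper", "Beer", "Eggs"},
      {"Milk", "Diaper", "Beer", "Coke"},
      {"Bread", "Milk", "Diaper", "Beer"},
      {"Bread", "Milk", "Diaper", "Coke"},
  ]

  def support_count(itemset):
      """sigma(itemset): number of transactions containing every item of itemset."""
      return sum(1 for t in transactions if itemset <= t)

  X, Y = {"Milk", "Diaper"}, {"Beer"}
  support = support_count(X | Y) / len(transactions)      # 2/5 = 0.4
  confidence = support_count(X | Y) / support_count(X)    # 2/3 ≈ 0.67
  print(support, confidence)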
Rule Measures: Support and Confidence

(Venn diagram: customers who buy diapers, customers who buy beer, and customers who buy both.)

Find all the rules X → Y with minimum confidence and support:
• support, s: probability that a transaction contains X ∪ Y
• confidence, c: conditional probability that a transaction having X also contains Y

Transaction ID | Items Bought
2000           | A, B, C
1000           | A, C
4000           | A, D
5000           | B, E, F

With minimum support 50% and minimum confidence 50%, we have:
• A → C (support 50%, confidence 66.6%)
• C → A (support 50%, confidence 100%)
Example

TID | date     | items_bought
100 | 10/10/99 | {F, A, D, B}
200 | 15/10/99 | {D, A, C, E, B}
300 | 19/10/99 | {C, A, B, E}
400 | 20/10/99 | {B, A, D}

Remember: conf(X → Y) = sup(X ∪ Y) / sup(X)

What are the support and confidence of the rule {B, D} → {A}?

• Support:
  • percentage of tuples that contain {A, B, D} = 3/4 = 75%
• Confidence:
  • (number of tuples that contain {A, B, D}) / (number of tuples that contain {B, D}) = 3/3 = 100%
Association Rule Mining Task

Given a set of transactions T, the goal of association rule mining is to find all rules having
• support ≥ minsup threshold
• confidence ≥ minconf threshold

Brute-force approach:
• List all possible association rules
• Compute the support and confidence for each rule
• Prune rules that fail the minsup and minconf thresholds
• Computationally prohibitive!
Mining Association Rules

TID | Items
1   | Bread, Milk
2   | Bread, Diaper, Beer, Eggs
3   | Milk, Diaper, Beer, Coke
4   | Bread, Milk, Diaper, Beer
5   | Bread, Milk, Diaper, Coke

Example rules:
  {Milk, Diaper} → {Beer}   (s=0.4, c=0.67)
  {Milk, Beer} → {Diaper}   (s=0.4, c=1.0)
  {Diaper, Beer} → {Milk}   (s=0.4, c=0.67)
  {Beer} → {Milk, Diaper}   (s=0.4, c=0.67)
  {Diaper} → {Milk, Beer}   (s=0.4, c=0.5)
  {Milk} → {Diaper, Beer}   (s=0.4, c=0.5)

Observations:
• All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but can have different confidence
• Thus, we may decouple the support and confidence requirements
Mining Association Rules

Two-step approach:
1. Frequent Itemset Generation
   • Generate all itemsets whose support ≥ minsup
2. Rule Generation
   • Generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset

Frequent itemset generation is still computationally expensive.
(Figure: the itemset lattice over items A, B, C, D, E, from the null set at the top through all 1-, 2-, 3-, and 4-itemsets down to ABCDE.)

Given d items, there are 2^d possible candidate itemsets.
Frequent Itemset Generation

Brute-force approach:
• Each itemset in the lattice is a candidate frequent itemset
• Count the support of each candidate by scanning the database: match each of the N transactions (of average width w) against each of the M candidates
• Complexity ~ O(NMw), which is expensive since M = 2^d!
Computational Complexity

Given d unique items:
• Total number of itemsets = 2^d
• Total number of possible association rules:

  R = sum_{k=1}^{d-1} [ C(d, k) * sum_{j=1}^{d-k} C(d-k, j) ] = 3^d - 2^(d+1) + 1

• If d = 6, R = 602 rules
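A quick numerical check of the rule-count formula (a small sketch; the function name is illustrative only):

  from math import comb

  def rule_count(d):
      """R = sum over k of C(d, k) * sum over j of C(d - k, j)."""
      return sum(comb(d, k) * sum(comb(d - k, j) for j in range(1, d - k + 1))
                 for k in range(1, d))

  print(rule_count(6))          # 602
  print(3**6 - 2**7 + 1)        # 602, the closed form 3^d - 2^(d+1) + 1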


Frequent Itemset Generation Strategies

• Reduce the number of candidates (M)
  • Complete search: M = 2^d
  • Use pruning techniques to reduce M
• Reduce the number of transactions (N)
  • Reduce the size of N as the size of the itemset increases
  • Used by DHP and vertical-based mining algorithms
• Reduce the number of comparisons (NM)
  • Use efficient data structures to store the candidates or transactions
  • No need to match every candidate against every transaction
Chapter 6: Mining Frequent Patterns, Association and Correlations

• Basic Concepts
• Frequent Itemset Mining Methods
• Which Patterns Are Interesting? — Pattern Evaluation Methods
• Summary

Reducing Number of Candidates

Apriori principle:
• If an itemset is frequent, then all of its subsets must also be frequent

The Apriori principle holds due to the following property of the support measure:

  ∀ X, Y : (X ⊆ Y) ⇒ s(X) ≥ s(Y)

• Support of an itemset never exceeds the support of its subsets
• This is known as the anti-monotone property of support
Example

TID | Items
1   | Bread, Milk
2   | Bread, Diaper, Beer, Eggs
3   | Milk, Diaper, Beer, Coke
4   | Bread, Milk, Diaper, Beer
5   | Bread, Milk, Diaper, Coke

  s(Bread) > s(Bread, Beer)
  s(Milk) > s(Bread, Milk)
  s(Diaper, Beer) > s(Diaper, Beer, Coke)
Illustrating Apriori Principle

(Figure: the itemset lattice over A, B, C, D, E. Once an itemset such as AB is found to be infrequent, all of its supersets are pruned from the search space.)
Items (1-itemsets):
  Bread: 4, Coke: 2, Milk: 4, Beer: 3, Diaper: 4, Eggs: 1

Minimum Support = 3, so there is no need to generate candidates involving Coke or Eggs.

Pairs (2-itemsets):
  {Bread, Milk}: 3, {Bread, Beer}: 2, {Bread, Diaper}: 3,
  {Milk, Beer}: 2, {Milk, Diaper}: 3, {Beer, Diaper}: 3

Triplets (3-itemsets):
  {Bread, Milk, Diaper}: 3

If every subset is considered: 6C1 + 6C2 + 6C3 = 41 candidates.
With support-based pruning: 6 + 6 + 1 = 13 candidates.
The Apriori Algorithm (the general idea)

1. Find frequent 1-itemsets and put them into Lk (k = 1)
2. Use Lk to generate a collection Ck+1 of candidate itemsets of size (k + 1)
3. Scan the database to find which itemsets in Ck+1 are frequent and put them into Lk+1
4. If Lk+1 is not empty:
   • k = k + 1
   • GOTO 2

R. Agrawal, R. Srikant: "Fast Algorithms for Mining Association Rules", Proc. of the 20th Int'l Conference on Very Large Databases, Santiago, Chile, Sept. 1994.
The Apriori Algorithm

Pseudo-code:
  Ck: candidate itemsets of size k
  Lk: frequent itemsets of size k

  L1 = {frequent items};
  for (k = 1; Lk != ∅; k++) do begin
      Ck+1 = candidates generated from Lk;   // join and prune steps
      for each transaction t in database do
          increment the count of all candidates in Ck+1 that are contained in t
      Lk+1 = candidates in Ck+1 with min_support (frequent)
  end
  return ∪k Lk;

Important steps in candidate generation:
• Join step: Ck+1 is generated by joining Lk with itself
• Prune step: any k-itemset that is not frequent cannot be a subset of a frequent (k+1)-itemset
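A compact Python sketch of the same loop, under the assumption that counting is done with a plain subset test rather than the hash tree discussed later; all names are mine:

  from itertools import combinations

  def apriori(transactions, min_sup):
      """Return {frozenset(itemset): support count} for all frequent itemsets."""
      transactions = [frozenset(t) for t in transactions]
      items = sorted({i for t in transactions for i in t})

      def count(cands):
          return {c: sum(1 for t in transactions if c <= t) for c in cands}

      # L1: frequent 1-itemsets
      L = {c: s for c, s in count([frozenset([i]) for i in items]).items()
           if s >= min_sup}
      frequent = dict(L)
      k = 1
      while L:
          # Join step: merge frequent k-itemsets that differ in a single item.
          cands = {a | b for a in L for b in L if len(a | b) == k + 1}
          # Prune step: drop candidates that have an infrequent k-subset.
          cands = {c for c in cands
                   if all(frozenset(s) in L for s in combinations(c, k))}
          # Scan the database to count the surviving candidates.
          L = {c: s for c, s in count(cands).items() if s >= min_sup}
          frequent.update(L)
          k += 1
      return frequent

  # Example: the four-transaction database of the next slide, min_sup = 2.
  db = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
  print(apriori(db, 2))   # {1},{2},{3},{5},{1,3},{2,3},{2,5},{3,5},{2,3,5}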
The Apriori Algorithm — Example  (min_sup = 2, i.e. 50%)

Database D:
TID | Items
100 | 1 3 4
200 | 2 3 5
300 | 1 2 3 5
400 | 2 5

Scan D -> C1: {1}:2, {2}:3, {3}:3, {4}:1, {5}:3
L1 (frequent): {1}:2, {2}:3, {3}:3, {5}:3

C2 (from L1 join L1): {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}
Scan D -> counts: {1 2}:1, {1 3}:2, {1 5}:1, {2 3}:2, {2 5}:3, {3 5}:2
L2 (frequent): {1 3}:2, {2 3}:2, {2 5}:3, {3 5}:2

C3 (from L2 join L2): {2 3 5}
Scan D -> L3: {2 3 5}:2
How to Generate Candidates?

Suppose the items in Lk are listed in an order.

Step 1: self-joining Lk (in SQL)
  insert into Ck+1
  select p.item1, p.item2, ..., p.itemk, q.itemk
  from Lk p, Lk q
  where p.item1 = q.item1 and ... and p.itemk-1 = q.itemk-1 and p.itemk < q.itemk

Step 2: pruning
  forall itemsets c in Ck+1 do
      forall k-subsets s of c do
          if (s is not in Lk) then delete c from Ck+1
Example of Candidates Generation

L3 = {abc, abd, acd, ace, bcd}

Self-joining: L3 * L3
• abcd from abc and abd
• acde from acd and ace

Pruning:
• acde is removed because its subset ade is not in L3

C4 = {abcd}
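The join and prune steps can be reproduced in a few lines (a sketch with my own names; itemsets are kept as lexicographically sorted tuples, matching the ordering assumption above):

  from itertools import combinations

  def generate_candidates(L_k):
      """Join frequent k-itemsets sharing the first k-1 items, then prune."""
      k = len(L_k[0])
      joined = {tuple(sorted(set(a) | set(b)))
                for a in L_k for b in L_k
                if a[:k - 1] == b[:k - 1] and a[k - 1] < b[k - 1]}
      # Prune: every k-subset of a surviving candidate must be in L_k.
      return [c for c in joined
              if all(s in set(L_k) for s in combinations(c, k))]

  L3 = [('a','b','c'), ('a','b','d'), ('a','c','d'), ('a','c','e'), ('b','c','d')]
  print(generate_candidates(L3))   # [('a','b','c','d')]; acde is pruned (ade not in L3)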
How to Count Supports of Candidates?

Why is counting supports of candidates a problem?
• The total number of candidates can be huge
• One transaction may contain many candidates

Method:
• Candidate itemsets are stored in a hash-tree
• A leaf node of the hash-tree contains a list of itemsets and counts
• An interior node contains a hash table
• Subset function: finds all the candidates contained in a transaction
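A full hash-tree is too long for a short example, but the goal of the subset function (look up only the candidates that actually occur in a transaction) can be sketched with an ordinary hash table keyed by candidate itemsets. This is an illustrative simplification, not the hash-tree of the next slides:

  from itertools import combinations

  def count_supports(transactions, candidates, k):
      """candidates: iterable of frozensets, each of size k."""
      counts = {c: 0 for c in candidates}
      for t in transactions:
          # Enumerate only the k-subsets of t and look each one up,
          # instead of testing every candidate against every transaction.
          for sub in combinations(sorted(t), k):
              key = frozenset(sub)
              if key in counts:
                  counts[key] += 1
      return counts

  db = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
  C2 = [frozenset(p) for p in [(1,2), (1,3), (1,5), (2,3), (2,5), (3,5)]]
  print(count_supports(db, C2, 2))   # {1,3}:2, {2,3}:2, {2,5}:3, {3,5}:2, ...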
Example of the hash-tree for C3

(Figure: a hash-tree for the candidate 3-itemsets, built with the hash function h(item) = item mod 3. Interior nodes hash on the 1st, 2nd, and 3rd item in turn, with branches for items 1,4,..., items 2,5,..., and items 3,6,...; leaves hold candidate itemsets such as 145, 124, 125, 159, 457, 458, 234, 567, 345, 356, 367, 689, 368.)
Example of the hash-tree for C3 (cont'd)

(Figure: finding the candidates contained in transaction {1, 2, 3, 4, 5}. At the root, hash on the first item: the 1,4,... branch is searched for candidates starting with 1, the 2,5,... branch for those starting with 2, and the 3,6,... branch for those starting with 3.)
Example of the hash-tree for C3 (cont'd)

(Figure: continuing the search along the 1-branch, the second level hashes on the second item: sub-branches for candidates starting with 12, 13, and 14 are followed; the 13-sub-branch is empty (null) in this tree.)
AprioriTid: Use D only for the first pass

• The database is not used after the 1st pass.
• Instead, the set Ck' is used for each step, where Ck' = <TID, {Xk}>: each Xk is a potentially frequent itemset in the transaction with id = TID.
• At each step Ck' is generated from Ck-1' during the pruning step of constructing Ck, and is used to compute Lk.
• For small values of k, Ck' could be larger than the database!
AprioriTid Example (min_sup = 2)

Database D:
TID | Items
100 | 1 3 4
200 | 2 3 5
300 | 1 2 3 5
400 | 2 5

C1':
TID | Sets of itemsets
100 | {{1},{3},{4}}
200 | {{2},{3},{5}}
300 | {{1},{2},{3},{5}}
400 | {{2},{5}}

L1: {1}:2, {2}:3, {3}:3, {5}:3

C2: {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}

C2':
TID | Sets of itemsets
100 | {{1 3}}
200 | {{2 3},{2 5},{3 5}}
300 | {{1 2},{1 3},{1 5},{2 3},{2 5},{3 5}}
400 | {{2 5}}

L2: {1 3}:2, {2 3}:2, {2 5}:3, {3 5}:2

C3: {2 3 5}

C3':
TID | Sets of itemsets
200 | {{2 3 5}}
300 | {{2 3 5}}

L3: {2 3 5}:2
Methods to Improve Apriori's Efficiency

• Hash-based itemset counting: a k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent
• Transaction reduction: a transaction that does not contain any frequent k-itemset is useless in subsequent scans
• Partitioning: any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
• Sampling: mining on a subset of the given data, with a lower support threshold plus a method to determine the completeness
• Dynamic itemset counting: add new candidate itemsets only when all of their subsets are estimated to be frequent
Maximal Frequent Itemset

An itemset is maximal frequent if none of its immediate supersets is frequent.

(Figure: the itemset lattice over A, B, C, D, E with a border separating frequent from infrequent itemsets; the maximal frequent itemsets are the frequent itemsets lying directly on that border.)
Closed Itemset

An itemset is closed if none of its immediate supersets has the same support as the itemset.

TID | Items
1   | {A,B}
2   | {B,C,D}
3   | {A,B,C,D}
4   | {A,B,D}
5   | {A,B,C,D}

Itemset supports:
  {A}: 4   {B}: 5   {C}: 3   {D}: 4
  {A,B}: 4   {A,C}: 2   {A,D}: 3   {B,C}: 3   {B,D}: 4   {C,D}: 3
  {A,B,C}: 2   {A,B,D}: 3   {A,C,D}: 2   {B,C,D}: 3
  {A,B,C,D}: 2
(Figure: the itemset lattice over A, B, C, D, E annotated with the TIDs of the supporting transactions for the database below; itemsets supported by no transaction, such as ABCDE, are marked.)

TID | Items
1   | A, B, C
2   | A, B, C, D
3   | B, C, E
4   | A, C, D, E
5   | D, E
Minimum support = 2

(Figure: the same annotated lattice, marking which frequent itemsets are closed but not maximal and which are both closed and maximal. For this database there are 9 closed frequent itemsets and 4 maximal frequent itemsets.)
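Both counts can be verified by brute force on this small example (a sketch; enumerating every itemset is only feasible because there are just five items):

  from itertools import combinations

  # The lattice example above: five transactions over items A-E, min support = 2.
  db = [{'A','B','C'}, {'A','B','C','D'}, {'B','C','E'}, {'A','C','D','E'}, {'D','E'}]
  min_sup = 2
  items = sorted(set().union(*db))

  def support(itemset):
      return sum(1 for t in db if itemset <= t)

  frequent = {}
  for k in range(1, len(items) + 1):
      for combo in combinations(items, k):
          s = frozenset(combo)
          sup = support(s)
          if sup >= min_sup:
              frequent[s] = sup

  # Closed: no proper superset has the same support.
  closed = [s for s in frequent
            if not any(s < t and frequent[t] == frequent[s] for t in frequent)]
  # Maximal: no proper superset is frequent at all.
  maximal = [s for s in frequent if not any(s < t for t in frequent)]
  print(len(closed), len(maximal))   # 9 closed, 4 maximal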
Maximal vs Closed Itemsets

(Venn diagram: maximal frequent itemsets ⊆ closed frequent itemsets ⊆ frequent itemsets.)
Factors Affecting Complexity

• Choice of minimum support threshold
  • lowering the support threshold results in more frequent itemsets
  • this may increase the number of candidates and the max length of frequent itemsets
• Dimensionality (number of items) of the data set
  • more space is needed to store the support count of each item
  • if the number of frequent items also increases, both computation and I/O costs may increase
• Size of database
  • since Apriori makes multiple passes, the run time of the algorithm may increase with the number of transactions
• Average transaction width
  • transaction width increases with denser data sets
  • this may increase the max length of frequent itemsets and traversals of the hash tree (the number of subsets in a transaction increases with its width)
Rule Generation

Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f → L − f satisfies the minimum confidence requirement.

• If {A,B,C,D} is a frequent itemset, the candidate rules are:
  ABC→D, ABD→C, ACD→B, BCD→A,
  A→BCD, B→ACD, C→ABD, D→ABC,
  AB→CD, AC→BD, AD→BC, BC→AD, BD→AC, CD→AB

• If |L| = k, then there are 2^k − 2 candidate association rules (ignoring L → ∅ and ∅ → L)
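A direct enumeration of these candidate rules, filtered by confidence, might look as follows (a sketch reusing the market-basket transactions from earlier; all names are mine):

  from itertools import combinations

  transactions = [
      {"Bread", "Milk"},
      {"Bread", "Diaper", "Beer", "Eggs"},
      {"Milk", "Diaper", "Beer", "Coke"},
      {"Bread", "Milk", "Diaper", "Beer"},
      {"Bread", "Milk", "Diaper", "Coke"},
  ]

  def support_count(itemset):
      return sum(1 for t in transactions if itemset <= t)

  def rules_from_itemset(L, minconf):
      """Enumerate all rules f -> L - f with confidence >= minconf."""
      L = frozenset(L)
      for r in range(1, len(L)):                  # all non-empty proper subsets
          for f in combinations(L, r):
              f = frozenset(f)
              conf = support_count(L) / support_count(f)
              if conf >= minconf:
                  yield set(f), set(L - f), conf

  for lhs, rhs, conf in rules_from_itemset({"Milk", "Diaper", "Beer"}, 0.6):
      print(lhs, "->", rhs, round(conf, 2))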
Rule Generation

How to efficiently generate rules from frequent itemsets?
• In general, confidence does not have an anti-monotone property:
  c(ABC → D) can be larger or smaller than c(AB → D)
• But the confidence of rules generated from the same itemset has an anti-monotone property
  • e.g., for L = {A,B,C,D}:  c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)
  • Confidence is anti-monotone w.r.t. the number of items on the RHS of the rule
Lattice of rules

(Figure: the lattice of rules generated from the frequent itemset {A,B,C,D}, from ABCD => {} at the top down to rules with a single item on the left-hand side. If a rule is found to have low confidence, all rules obtained from it by moving further items from the left-hand side to the right-hand side are pruned.)
Rule Generation for Apriori Algorithm

• A candidate rule is generated by merging two rules that share the same prefix in the rule consequent
  • e.g., join(CD => AB, BD => AC) would produce the candidate rule D => ABC
• Prune the rule D => ABC if its subset rule AD => BC does not have high confidence
Is Apriori Fast Enough? — Performance Bottlenecks

The core of the Apriori algorithm:
• Use frequent (k − 1)-itemsets to generate candidate frequent k-itemsets
• Use database scan and pattern matching to collect counts for the candidate itemsets

The bottleneck of Apriori: candidate generation
• Huge candidate sets:
  • 10^4 frequent 1-itemsets will generate 10^7 candidate 2-itemsets
  • To discover a frequent pattern of size 100, e.g., {a1, a2, ..., a100}, one needs to generate 2^100 ≈ 10^30 candidates
• Multiple scans of the database:
  • Needs (n + 1) scans, where n is the length of the longest pattern
FP-growth: Mining Frequent Patterns Without Candidate Generation

Compress a large database into a compact Frequent-Pattern tree (FP-tree) structure
• highly condensed, but complete for frequent pattern mining
• avoids costly database scans

Develop an efficient, FP-tree-based frequent pattern mining method
• A divide-and-conquer methodology: decompose mining tasks into smaller ones
• Avoid candidate generation: sub-database test only!
FP-tree Construction from a Transactional DB  (min_support = 3)

TID | Items bought              | (Ordered) frequent items
100 | {f, a, c, d, g, i, m, p}  | {f, c, a, m, p}
200 | {a, b, c, f, l, m, o}     | {f, c, a, b, m}
300 | {b, f, h, j, o, w}        | {f, b}
400 | {b, c, k, s, p}           | {c, b, p}
500 | {a, f, c, e, l, p, m, n}  | {f, c, a, m, p}

Frequent items: f:4, c:4, a:3, b:3, m:3, p:3

Steps:
1. Scan DB once, find frequent 1-itemsets (single-item patterns)
2. Order frequent items in descending order of their frequency
3. Scan DB again, construct the FP-tree
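Steps 1 and 2 are a simple counting and ordering pass; a small sketch of how each transaction ends up in the ordered form shown above (names are mine, and ties between equally frequent items are broken as on the slide):

  from collections import Counter

  db = [
      ['f', 'a', 'c', 'd', 'g', 'i', 'm', 'p'],
      ['a', 'b', 'c', 'f', 'l', 'm', 'o'],
      ['b', 'f', 'h', 'j', 'o', 'w'],
      ['b', 'c', 'k', 's', 'p'],
      ['a', 'f', 'c', 'e', 'l', 'p', 'm', 'n'],
  ]
  min_support = 3

  # Step 1: count items; the frequent ones are f:4, c:4, a:3, b:3, m:3, p:3.
  counts = Counter(item for t in db for item in t)

  # Step 2: global item order, most frequent first (ties broken as on the slide).
  order = [i for i in ['f', 'c', 'a', 'b', 'm', 'p'] if counts[i] >= min_support]

  # Step 3 input: project each transaction onto the frequent items, in that order.
  def ordered_frequent(t):
      return [i for i in order if i in t]

  for t in db:
      print(ordered_frequent(t))
  # ['f','c','a','m','p'], ['f','c','a','b','m'], ['f','b'], ['c','b','p'], ['f','c','a','m','p']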
FP-tree Construction (after the first transaction)

(Figure: after inserting {f, c, a, m, p}, the tree is a single path root -> f:1 -> c:1 -> a:1 -> m:1 -> p:1.)
FP-tree Construction (after the second transaction)

(Figure: inserting {f, c, a, b, m} shares the f-c-a prefix, so the tree becomes root -> f:2 -> c:2 -> a:2, which then branches into m:1 -> p:1 and b:1 -> m:1.)
FP-tree Construction (after the third and fourth transactions)

(Figure: inserting {f, b} increments f to 3 and adds a b:1 child under f; inserting {c, b, p} starts a new branch under the root, c:1 -> b:1 -> p:1.)
FP-tree Construction (final tree, min_support = 3)

Header table (item : frequency, each entry heading a chain of node-links):
  f:4, c:4, a:3, b:3, m:3, p:3

(Figure: the completed FP-tree. Left branch: root -> f:4 -> c:3 -> a:3, with a:3 branching into m:2 -> p:2 and b:1 -> m:1, and f:4 also having a b:1 child. Right branch: root -> c:1 -> b:1 -> p:1. Each header-table entry links to all nodes carrying that item.)
Benefits of the FP-tree Structure

Completeness:
• never breaks a long pattern of any transaction
• preserves complete information for frequent pattern mining

Compactness:
• reduces irrelevant information: infrequent items are gone
• frequency-descending ordering: more frequent items are more likely to be shared
• never larger than the original database (not counting node-links and counts)
• Example: for the Connect-4 DB, the compression ratio can be over 100
Mining Frequent Patterns Using the FP-tree

General idea (divide-and-conquer):
• Recursively grow frequent patterns using the FP-tree

Method:
• For each item, construct its conditional pattern-base, and then its conditional FP-tree
• Repeat the process on each newly created conditional FP-tree
• Until the resulting FP-tree is empty, or it contains only one path (a single path generates all the combinations of its sub-paths, each of which is a frequent pattern)
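A self-contained Python sketch of this divide-and-conquer idea. For brevity it recurses directly on conditional pattern bases (lists of weighted prefix paths) instead of building pointer-based conditional FP-trees, but on the running example it produces exactly the patterns derived on the following slides; all names are mine:

  from collections import defaultdict

  def fp_growth(weighted_db, min_sup, suffix=()):
      """weighted_db: list of (items, count) pairs. Returns {pattern: support}."""
      counts = defaultdict(int)
      for items, cnt in weighted_db:
          for i in set(items):
              counts[i] += cnt
      freq = {i: c for i, c in counts.items() if c >= min_sup}

      # Every frequent item extends the current suffix into a frequent pattern.
      results = {tuple(sorted(suffix + (i,))): c for i, c in freq.items()}

      # Conditional pattern bases: for each frequent item, collect the prefix
      # (in frequency-descending order) of every transaction containing it.
      rank = {i: r for r, i in enumerate(sorted(freq, key=lambda i: (-freq[i], i)))}
      cond_bases = defaultdict(list)
      for items, cnt in weighted_db:
          path = sorted((i for i in set(items) if i in freq), key=rank.get)
          for pos, item in enumerate(path):
              if path[:pos]:
                  cond_bases[item].append((path[:pos], cnt))

      # Divide and conquer: mine each conditional pattern base recursively.
      for item, base in cond_bases.items():
          results.update(fp_growth(base, min_sup, suffix + (item,)))
      return results

  db = [({'f','a','c','d','g','i','m','p'}, 1), ({'a','b','c','f','l','m','o'}, 1),
        ({'b','f','h','j','o','w'}, 1), ({'b','c','k','s','p'}, 1),
        ({'a','f','c','e','l','p','m','n'}, 1)]
  patterns = fp_growth(db, min_sup=3)
  print(patterns[('c', 'p')], patterns[('a', 'c', 'f', 'm')])   # 3 and 3 (cp and fcam)
  print(len(patterns))                                          # 18 frequent patterns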
Mining Frequent Patterns Using the FP-tree (cont'd)

• Start with the last item in the order (i.e., p).
• Follow its node-links and traverse only the paths containing p.
• Accumulate all transformed prefix paths of that item to form its conditional pattern base.

Conditional pattern base for p: fcam:2, cb:1

Construct a new (conditional) FP-tree from this pattern base by merging all paths and keeping only nodes that appear at least min_support times. This leaves only one branch, c:3.

Thus we derive only one additional frequent pattern containing p: the pattern cp.
Mining Frequent Patterns Using the FP-tree (cont'd)

• Move to the next least frequent item in the order, i.e., m.
• Follow its node-links and traverse only the paths containing m.
• Accumulate all transformed prefix paths of that item to form its conditional pattern base.

m-conditional pattern base: fca:2, fcab:1

m-conditional FP-tree: contains only the single path f:3 -> c:3 -> a:3 (b is dropped, since it appears only once in the base).

All frequent patterns that include m: m, fm, cm, am, fcm, fam, cam, fcam
(Figure: the global FP-tree shown together with the conditional pattern bases and conditional FP-trees derived from it for each item, labeled (1) through (6) for p, m, b, a, c, and f. The same information is summarized in the conditional pattern-base table later in this section.)
Recursive projection view of the same process

Ordered DB:
1: f, c, a, m, p
2: f, c, a, b, m
3: f, b
4: c, b, p
5: f, c, a, m, p

Project on each item (for every transaction containing the item, keep only the items that precede it in the frequency order):

p-projected DB:  1: f, c, a, m    4: c, b    5: f, c, a, m
m-projected DB:  1: f, c, a      2: f, c, a, b    5: f, c, a
b-projected DB:  2: f, c, a      3: f      4: c
a-projected DB:  1: f, c         2: f, c   5: f, c
c-projected DB:  1: f            2: f      5: f    (4 contributes nothing)
f: occurs in transactions 1, 2, 3, 5

Each projected DB is mined recursively in the same way.
Frequent patterns obtained from each projection (min_sup = 3):

From the p-projected DB:  p: 3, cp: 3
From the m-projected DB:  m: 3, fm: 3, cm: 3, am: 3, fcm: 3, fam: 3, cam: 3, fcam: 3
From the b-projected DB:  b: 3
From the a-projected DB:  a: 3, fa: 3, ca: 3, fca: 3
From the c-projected DB:  c: 4, fc: 3
From f:                   f: 4
Properties of FP-tree for Conditional Pattern Base Construction

• Node-link property
  • For any frequent item ai, all the possible frequent patterns that contain ai can be obtained by following ai's node-links, starting from ai's head in the FP-tree header
• Prefix path property
  • To calculate the frequent patterns for a node ai in a path P, only the prefix sub-path of ai in P needs to be accumulated, and its frequency count should carry the same count as node ai
Conditional Pattern-Bases for the Example

Item   Conditional pattern-base        Conditional FP-tree
p      {(fcam:2), (cb:1)}              {(c:3)}|p
m      {(fca:2), (fcab:1)}             {(f:3, c:3, a:3)}|m
b      {(fca:1), (f:1), (c:1)}         Empty
a      {(fc:3)}                        {(f:3, c:3)}|a
c      {(f:3)}                         {(f:3)}|c
f      Empty                           Empty
Why Is Frequent Pattern Growth Fast?

Performance studies show:
• FP-growth is an order of magnitude faster than Apriori, and is also faster than tree-projection

Reasoning:
• No candidate generation, no candidate test
• Uses a compact data structure
• Eliminates repeated database scans
• The basic operations are counting and FP-tree building
FP-growth vs. Apriori: Scalability With the Support Threshold

(Figure: run time in seconds (0 to 100) vs. support threshold (0% to 3%) on data set T25I20D10K. Apriori's run time grows rapidly as the support threshold decreases, while FP-growth's run time remains low.)
Chapter 6: Mining Frequent Patterns, Association and Correlations

• Basic Concepts
• Frequent Itemset Mining Methods
• Which Patterns Are Interesting? — Pattern Evaluation Methods
• Summary

Interestingness Measurements

Objective measures
• Two popular measurements: support and confidence

Subjective measures
• A rule (pattern) is interesting if
  • it is unexpected (surprising to the user); and/or
  • actionable (the user can do something with it)
Computing Interestingness Measure

Given a rule X → Y, the information needed to compute rule interestingness can be obtained from a contingency table.

Contingency table for X → Y:

            Y      ¬Y
  X        f11    f10   | f1+
  ¬X       f01    f00   | f0+
           f+1    f+0   | |T|

  f11: support of X and Y        f10: support of X and ¬Y
  f01: support of ¬X and Y       f00: support of ¬X and ¬Y

Used to define various measures: support, confidence, lift, Gini, J-measure, etc.
Example contingency table:

            Coffee   ¬Coffee
  Tea         15        5     |  20
  ¬Tea        75        5     |  80
              90       10     | 100
Statistical Independence

• Population of 1000 students
  • 600 students know how to swim (S)
  • 700 students know how to bike (B)
  • 420 students know how to swim and bike (S, B)

• P(S ∧ B) = 420/1000 = 0.42
• P(S) × P(B) = 0.6 × 0.7 = 0.42

• P(S ∧ B) = P(S) × P(B)  =>  statistical independence
• P(S ∧ B) > P(S) × P(B)  =>  positively correlated
• P(S ∧ B) < P(S) × P(B)  =>  negatively correlated
Statistical-based Measures

Measures that take into account statistical dependence:

  Lift = P(Y|X) / P(Y)

  Interest = P(X,Y) / ( P(X) P(Y) )

  PS = P(X,Y) - P(X) P(Y)

  φ-coefficient = ( P(X,Y) - P(X) P(Y) ) / sqrt( P(X)[1 - P(X)] P(Y)[1 - P(Y)] )
            Coffee   ¬Coffee
  Tea         15        5     |  20
  ¬Tea        75        5     |  80
              90       10     | 100

Association rule: Tea → Coffee

• Confidence = P(Coffee | Tea) = 15/20 = 0.75
• but P(Coffee) = 0.9
• Lift = 0.75 / 0.9 = 0.8333 (< 1, therefore Tea and Coffee are negatively associated)
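These measures are straightforward to compute from a 2x2 contingency table; a small sketch using the Tea/Coffee numbers (function and variable names are mine):

  from math import sqrt

  def measures(f11, f10, f01, f00):
      """Lift, interest, PS and phi-coefficient from a 2x2 contingency table."""
      n = f11 + f10 + f01 + f00
      p_xy = f11 / n
      p_x  = (f11 + f10) / n
      p_y  = (f11 + f01) / n
      lift     = (p_xy / p_x) / p_y                 # P(Y|X) / P(Y)
      interest = p_xy / (p_x * p_y)                 # equal to lift for a 2x2 table
      ps       = p_xy - p_x * p_y
      phi      = ps / sqrt(p_x * (1 - p_x) * p_y * (1 - p_y))
      return lift, interest, ps, phi

  # Tea -> Coffee contingency table: f11=15, f10=5, f01=75, f00=5
  print(measures(15, 5, 75, 5))   # lift ≈ 0.833 (< 1: negatively associated)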
Are Lift and χ² Good Measures of Correlation?

• "Buy walnuts ⇒ buy milk [1%, 80%]" is misleading if 85% of customers buy milk
• Support and confidence are not good indicators of correlation
• Over 20 interestingness measures have been proposed (see Tan, Kumar, Srivastava @KDD'02)
• Which are the good ones?

• There are lots of measures proposed in the literature
• Some measures are good for certain applications, but not for others
• What criteria should we use to determine whether a measure is good or bad?
• What about Apriori-style support-based pruning? How does it affect these measures?
Example: φ-Coefficient

The φ-coefficient is analogous to the correlation coefficient for continuous variables.

            Y     ¬Y                          Y     ¬Y
  X        60     10   |  70        X        20     10   |  30
  ¬X       10     20   |  30        ¬X       10     60   |  70
           70     30   | 100                 30     70   | 100

  φ = (0.6 - 0.7 × 0.7) / sqrt(0.7 × 0.3 × 0.7 × 0.3) = 0.5238

  φ = (0.2 - 0.3 × 0.3) / sqrt(0.7 × 0.3 × 0.7 × 0.3) = 0.5238

The φ-coefficient is the same for both tables.
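A quick standalone check that both tables give the same φ (same caveats as the earlier sketches):

  from math import sqrt

  def phi(f11, f10, f01, f00):
      n = f11 + f10 + f01 + f00
      p_xy, p_x, p_y = f11 / n, (f11 + f10) / n, (f11 + f01) / n
      return (p_xy - p_x * p_y) / sqrt(p_x * (1 - p_x) * p_y * (1 - p_y))

  print(round(phi(60, 10, 10, 20), 4))   # 0.5238
  print(round(phi(20, 10, 10, 60), 4))   # 0.5238, the same value for both tables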
Null-Invariant Measures

(The slide's table of null-invariant measures is not reproduced in this extract.)
Comparison of Interestingness Measures

• Null-(transaction) invariance is crucial for correlation analysis
• Lift and χ² are not null-invariant
• There are 5 null-invariant measures; they are subtle and can disagree

              Milk     No Milk    Sum (row)
  Coffee      m, c     ~m, c      c
  No Coffee   m, ~c    ~m, ~c     ~c
  Sum (col.)  m        ~m         Σ

Null-transactions w.r.t. m and c are those containing neither milk nor coffee. The Kulczynski measure (1927) is null-invariant.
Analysis of DBLP Coauthor Relationships

Recent DB conferences, after removing balanced associations, low support, etc.

• Advisor-advisee relation: Kulc is high, coherence is low, cosine is in the middle
• Tianyi Wu, Yuguo Chen and Jiawei Han, "Association Mining in Large Databases: A Re-Examination of Its Measures", Proc. 2007 Int. Conf. Principles and Practice of Knowledge Discovery in Databases (PKDD'07), Sept. 2007
Which Null-Invariant Measure Is Better?

• IR (Imbalance Ratio): measures the imbalance of the two itemsets A and B in rule implications
• Kulczynski and the Imbalance Ratio (IR) together present a clear picture for all three datasets D4 through D6
  • D4 is balanced and neutral
  • D5 is imbalanced and neutral
  • D6 is very imbalanced and neutral
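For reference, Kulczynski and IR can be computed from a contingency table as follows. The formulas are the ones given in Han et al.'s textbook (Kulczynski averages the two conditional probabilities; IR measures how lopsided the two directions of the rule are), so treat this as a sketch rather than part of the slides:

  def kulczynski(f11, f10, f01):
      """Kulc(A, B) = 0.5 * (P(A|B) + P(B|A))."""
      return 0.5 * (f11 / (f11 + f01) + f11 / (f11 + f10))

  def imbalance_ratio(f11, f10, f01):
      """IR(A, B) = |sup(A) - sup(B)| / (sup(A) + sup(B) - sup(A and B))."""
      sup_a, sup_b = f11 + f10, f11 + f01
      return abs(sup_a - sup_b) / (sup_a + sup_b - f11)

  # Tea/Coffee table from earlier: f11=15, f10=5 (tea only), f01=75 (coffee only)
  print(kulczynski(15, 5, 75), imbalance_ratio(15, 5, 75))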
Chapter 6: Mining Frequent Patterns, Association and Correlations

• Basic Concepts
• Frequent Itemset Mining Methods
• Which Patterns Are Interesting? — Pattern Evaluation Methods
• Summary

Summary

• Basic concepts: association rules, the support-confidence framework, closed and max-patterns
• Scalable frequent pattern mining methods
  • Apriori (candidate generation and test)
  • Projection-based (FP-growth, CLOSET+, ...)
• Which patterns are interesting?
  • Pattern evaluation methods

