Association Rule Mining
Problem Definition
Rule Generation
Recommended Reading
Market-Basket transactions

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example of Association Rules:
{Diaper} → {Beer},
{Milk, Bread} → {Eggs, Coke},
{Beer, Bread} → {Milk}

Implication means co-occurrence, not causality!
Brute-force approach:
– List all possible association rules
– Compute the support and confidence for each rule
– Prune rules that fail the minsup and minconf thresholds
⇒ Computationally prohibitive!
Mining Association Rules
Example of Rules (from the transactions above):

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

{Milk, Diaper} → {Beer}    (s=0.4, c=0.67)
{Milk, Beer} → {Diaper}    (s=0.4, c=1.0)
{Diaper, Beer} → {Milk}    (s=0.4, c=0.67)
{Beer} → {Milk, Diaper}    (s=0.4, c=0.67)
{Diaper} → {Milk, Beer}    (s=0.4, c=0.5)
{Milk} → {Diaper, Beer}    (s=0.4, c=0.5)
Observations:
• All the above rules are binary partitions of the same itemset:
{Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but
can have different confidence
• Thus, we may decouple the support and confidence requirements
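As a concrete check of these numbers, here is a minimal Python sketch (the dataset is hard-coded from the table above, and the function names are illustrative) that computes s and c for a rule X → Y:

    # Minimal sketch: support and confidence of a rule X -> Y over the
    # five market-basket transactions shown above.
    transactions = [
        {"Bread", "Milk"},
        {"Bread", "Diaper", "Beer", "Eggs"},
        {"Milk", "Diaper", "Beer", "Coke"},
        {"Bread", "Milk", "Diaper", "Beer"},
        {"Bread", "Milk", "Diaper", "Coke"},
    ]

    def support(itemset):
        # Fraction of transactions containing every item in `itemset`.
        return sum(itemset <= t for t in transactions) / len(transactions)

    def confidence(antecedent, consequent):
        # P(consequent | antecedent) = s(X ∪ Y) / s(X).
        return support(antecedent | consequent) / support(antecedent)

    print(support({"Milk", "Diaper", "Beer"}))        # 0.4
    print(confidence({"Milk", "Diaper"}, {"Beer"}))   # 0.666...
    print(confidence({"Milk", "Beer"}, {"Diaper"}))   # 1.0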
Mining Association Rules
Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup
2. Rule Generation
– Generate high confidence rules from each frequent itemset,
where each rule is a binary partitioning of a frequent itemset
[Figure: itemset lattice over items {A, B, C, D, E}. Given d items, there are 2^d possible candidate itemsets.]
Given d unique items, the total number of possible association rules is

$$R = \sum_{k=1}^{d-1} \left[ \binom{d}{k} \times \sum_{j=1}^{d-k} \binom{d-k}{j} \right] = 3^d - 2^{d+1} + 1$$

For d = 6, R = 602 rules.
Apriori principle:
– If an itemset is frequent, then all of its subsets must also
be frequent
The principle holds because support is anti-monotone: the support of an itemset never exceeds the support of any of its subsets. Equivalently, if an itemset is infrequent, all of its supersets must be infrequent and can be pruned.

[Figure: itemset lattice rooted at null over {A, B, C, D, E}; once {A, B} is found to be infrequent, the entire region of its supersets, up to ABCDE, is pruned.]
The Apriori Algorithm

Method:
– Let k = 1
– Generate frequent itemsets of length 1
– Repeat until no new frequent itemsets are identified:
    Generate length-(k+1) candidate itemsets from length-k frequent itemsets
    Prune candidate itemsets containing subsets of length k that are infrequent
    Count the support of each candidate by scanning the DB
    Eliminate candidates that are infrequent, keeping only the frequent ones
Pseudo-code:
    Ck: candidate itemsets of size k
    Lk: frequent itemsets of size k

    L1 = {frequent items};
    for (k = 1; Lk != ∅; k++) do begin
        Ck+1 = candidates generated from Lk;
        for each transaction t in database do
            increment the count of all candidates in Ck+1 that are contained in t
        Lk+1 = candidates in Ck+1 with min_support
    end
    return ∪k Lk;
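For concreteness, here is a runnable Python version of the same loop. This is a minimal sketch, not the slides' algorithm verbatim: candidate generation merges arbitrary pairs of frequent k-itemsets instead of the lexicographic self-join, and transactions are assumed to be Python sets.

    from itertools import combinations

    def apriori(transactions, minsup):
        # Return all frequent itemsets (as frozensets) with support >= minsup.
        n = len(transactions)
        items = {i for t in transactions for i in t}
        # L1: frequent 1-itemsets
        Lk = {frozenset([i]) for i in items
              if sum(i in t for t in transactions) / n >= minsup}
        frequent = set(Lk)
        k = 1
        while Lk:
            # Candidate generation: merge pairs of frequent k-itemsets.
            candidates = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
            # Pruning: drop candidates with an infrequent k-subset (Apriori principle).
            candidates = {c for c in candidates
                          if all(frozenset(s) in Lk for s in combinations(c, k))}
            # Support counting: one pass over the database per level.
            Lk = {c for c in candidates
                  if sum(c <= t for t in transactions) / n >= minsup}
            frequent |= Lk
            k += 1
        return frequent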
Important Details of Apriori

How are candidates Ck generated from Lk−1? (items within an itemset are kept in lexicographic order)
– Step 1: self-joining Lk−1: merge two itemsets of Lk−1 that agree on their first k−2 items
– Step 2: pruning:
    forall itemsets c in Ck do
        forall (k−1)-subsets s of c do
            if (s is not in Lk−1) then delete c from Ck
– Example: L3 = {abc, abd, acd, ace, bcd}; self-joining L3*L3 yields abcd (from abc and abd) and acde (from acd and ace); pruning removes acde because ade is not in L3; hence C4 = {abcd}
[Figure: hash tree storing 15 candidate 3-itemsets: {1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}. The hash function sends items 1, 4, 7 to the left branch, 2, 5, 8 to the middle branch, and 3, 6, 9 to the right branch; the three panels hash on the first, second, and third item of a candidate, respectively.]
[Figure: enumerating the 3-subsets of transaction {1, 2, 3, 5, 6} level by level. Level 1 fixes the first item (1, 2, or 3), Level 2 the second, Level 3 completes the subsets: 123, 125, 126, 135, 136, 156, 235, 236, 256, 356.]
[Figure: matching transaction {1, 2, 3, 5, 6} against the hash tree. The transaction is split as 1+2356, 2+356, 3+56 at the root, then 12+356, 13+56, 15+6 at the next level, and so on, so only leaves reachable by some 3-subset of the transaction are visited.]

Match transaction against 9 out of 15 candidates.
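The same support-counting step can be written without a hash tree; the tree only reduces how many candidates each subset is compared against. A minimal sketch, assuming candidates are kept in a plain Python set of frozensets rather than a real hash tree:

    from itertools import combinations
    from collections import Counter

    def count_candidates(transactions, candidates, k):
        # Enumerate each transaction's k-subsets and increment the count of
        # every subset that is also a candidate (cf. the figure above).
        counts = Counter()
        for t in transactions:
            for subset in combinations(sorted(t), k):
                fs = frozenset(subset)
                if fs in candidates:
                    counts[fs] += 1
        return counts

A transaction of width w has C(w, k) k-subsets, which is why wide transactions make this step expensive and why the hash tree's pruning matters.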
Factors Affecting Complexity
Choice of minimum support threshold
– lowering support threshold results in more frequent itemsets
– this may increase number of candidates and max length of
frequent itemsets
Dimensionality (number of items) of the data set
– more space is needed to store support count of each item
– if number of frequent items also increases, both computation and
I/O costs may also increase
Size of database
– since Apriori makes multiple passes, run time of algorithm may
increase with number of transactions
Average transaction width
– transaction width increases with denser data sets
– this may increase the max length of frequent itemsets and the number of hash tree traversals (the number of subsets in a transaction increases with its width)
Lattice of rules

[Figure: lattice of rules generated from frequent itemset ABCD, from ABCD => {} at the top down to rules with single-item antecedents. Confidence is anti-monotone with respect to the number of items in the consequent: conf(ABC => D) ≥ conf(AB => CD) ≥ conf(A => BCD). If BCD => A is found to be a low-confidence rule, all rules whose consequent contains A can be pruned.]

Candidate rules are generated by merging two rules that share the same prefix in the rule consequent:
– join(CD => AB, BD => AC) would produce the candidate rule D => ABC
– prune rule D => ABC if its subset AD => BC does not have high confidence (see the sketch below)
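A minimal sketch of this confidence-based pruning, simplified to grow consequents one item at a time rather than by the pairwise join above, and assuming a support() function over itemsets like the one sketched earlier:

    def gen_rules(itemset, support, minconf):
        # Yield (antecedent, consequent, confidence) rules for one frequent
        # itemset, level-wise over consequent size. A consequent is only
        # grown from rules that passed minconf (confidence anti-monotonicity).
        itemset = frozenset(itemset)
        consequents = {frozenset([i]) for i in itemset}
        while consequents:
            next_level = set()
            for Y in consequents:
                X = itemset - Y
                if not X:
                    continue
                conf = support(itemset) / support(X)
                if conf >= minconf:
                    yield X, Y, conf
                    for i in X:                  # grow this consequent
                        next_level.add(Y | {i})
            consequents = next_level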
Compact Representation of Frequent Itemsets
The number of frequent itemsets can explode; in the slide's example it is

$$3 \times \sum_{k=1}^{10} \binom{10}{k} = 3 \times (2^{10} - 1) = 3069$$

⇒ need a compact representation
Maximal Frequent Itemset: an itemset is maximal frequent if none of its immediate supersets is frequent.

[Figure: itemset lattice over {A, B, C, D, E} with a border separating frequent from infrequent itemsets; the maximal frequent itemsets lie just inside the border.]
Closed Itemset: an itemset is closed if none of its immediate supersets has the same support as the itemset.

TID  Items
1    {A,B}
2    {B,C,D}
3    {A,B,C,D}
4    {A,B,D}
5    {A,B,C,D}

Itemset  Support      Itemset    Support
{A}      4            {A,B,C}    2
{B}      5            {A,B,D}    3
{C}      3            {A,C,D}    2
{D}      4            {B,C,D}    3
{A,B}    4            {A,B,C,D}  2
{A,C}    2
{A,D}    3
{B,C}    3
{B,D}    4
{C,D}    3

• {B,C} is not a closed itemset: its immediate superset {B,C,D} has the same support (3).
• {B,C,D} is a closed itemset: no superset of it has support 3.
[Figure: itemset lattice annotated with the transactions supporting each itemset; itemsets not supported by any transaction are marked; closed frequent itemsets and maximal frequent itemsets are highlighted.]

Relationship: maximal frequent itemsets ⊆ closed frequent itemsets ⊆ frequent itemsets.
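The two definitions translate directly into code. A minimal sketch, assuming freq is a dict mapping each frequent itemset (as a frozenset) to its support count, as in the tables above:

    def closed_and_maximal(freq):
        closed, maximal = set(), set()
        for s, sup in freq.items():
            # immediate supersets of s that are themselves frequent
            supersets = [t for t in freq if len(t) == len(s) + 1 and s < t]
            if all(freq[t] != sup for t in supersets):
                closed.add(s)    # closed: no immediate superset has equal support
            if not supersets:
                maximal.add(s)   # maximal: no immediate superset is frequent
        return closed, maximal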
FP-Tree Construction

TID  Items
1    {A,B}
2    {B,C,D}
3    {A,C,D,E}
4    {A,D,E}
5    {A,B,C}
6    {A,B,C,D}
7    {B,C}
8    {A,B,C}
9    {A,B,D}
10   {B,C,E}

[Figure: the tree grows one transaction at a time. After reading TID=1: null → A:1 → B:1. After reading TID=2, a second branch is added: null → B:1 → C:1 → D:1.]
[Figure: final FP-tree after reading all 10 transactions. The root null has two children, A:7 and B:3; under A:7 are further item nodes (counts visible in the figure include B:5, C:3, C:1, D:1). Each node stores an item and a count; shared prefixes share nodes, and dashed pointers link nodes holding the same item.]
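A minimal FP-tree construction sketch in Python, simplified relative to the full algorithm: the frequency-descending item order is assumed to be supplied, and the header table is kept as plain lists of node references.

    class FPNode:
        def __init__(self, item, parent):
            self.item, self.parent = item, parent
            self.count = 0
            self.children = {}              # item -> FPNode

    def build_fp_tree(transactions, order):
        # order: frequent items sorted by decreasing support; others dropped.
        rank = {item: i for i, item in enumerate(order)}
        root = FPNode(None, None)
        header = {item: [] for item in order}    # node-links per item
        for t in transactions:
            node = root
            for item in sorted((i for i in t if i in rank), key=rank.get):
                if item not in node.children:
                    node.children[item] = FPNode(item, node)
                    header[item].append(node.children[item])
                node = node.children[item]
                node.count += 1             # shared prefixes share nodes
        return root, header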
Starting from each frequent item in the header table, conditional pattern bases are collected by following the node-links and reading off the prefix paths (example with items f, c, a, b, m, p):

Header Table                 Conditional pattern bases
item  frequency              item   cond. pattern base
f     4                      c      f:3
c     4                      a      fc:3
a     3                      b      fca:1, f:1, c:1
b     3                      m      fca:2, fcab:1
m     3                      p      fcam:2, cb:1
p     3

[Figure: FP-tree with paths {} → f:4 → c:3 → a:3 → m:2 → p:2, a side branch b:1 → m:1, a branch b:1 under f:4, and a second root branch {} → c:1 → b:1 → p:1; header-table entries point into the tree via node-links.]
From Conditional Pattern-bases to Conditional FP-trees

For each pattern base, accumulate the count of each item and build an FP-tree over the frequent items of the base:

m-conditional FP-tree: {} → f:3 → c:3 → a:3
Cond. pattern base of "am": (fc:3)  → am-conditional FP-tree: {} → f:3 → c:3
Cond. pattern base of "cm": (f:3)   → cm-conditional FP-tree: {} → f:3
Cond. pattern base of "cam": (f:3)  → cam-conditional FP-tree: {} → f:3
Alternative Methods for Frequent Itemset Generation

Traversal of itemset lattice:
[Figure: three traversal strategies over the lattice from the empty set to {a1, a2, ..., an}: (a) general-to-specific, (b) specific-to-general, (c) bidirectional; the frequent itemset border separates frequent from infrequent itemsets.]
Representation of Database
– horizontal vs. vertical data layout

Horizontal Data Layout        Vertical Data Layout
TID  Items                    Item  TID-list
1    A,B,E                    A     1,4,5,6,7,8,9
2    B,C,D                    B     1,2,5,7,8,10
3    C,E                      C     2,3,4,8,9
4    A,C,D                    D     2,4,5,9
5    A,B,C,D                  E     1,3,6
6    A,E
7    A,B
8    A,B,C
9    A,C,D
10   B
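The vertical layout supports counting by TID-list intersection (the idea behind ECLAT-style algorithms). A minimal sketch using the lists above; names are illustrative:

    # Vertical layout: the support count of an itemset is the size of the
    # intersection of its items' TID-lists.
    tidlists = {
        "A": {1, 4, 5, 6, 7, 8, 9},
        "B": {1, 2, 5, 7, 8, 10},
        "C": {2, 3, 4, 8, 9},
        "D": {2, 4, 5, 9},
        "E": {1, 3, 6},
    }

    def support_count(itemset):
        return len(set.intersection(*(tidlists[i] for i in itemset)))

    print(support_count({"A", "B"}))   # TIDs {1, 5, 7, 8} -> 4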
[Figure: itemset lattice in which each node lists its possible extensions, the items that may still be appended to it:]
Possible Extension: E(ABC) = {D, E}
[Figure: support distribution of a retail data set. The distribution is highly skewed: a few items are very frequent while most items are rare.]
Multiple Minimum Support

Item  MS(I)   Sup(I)
A     0.10%   0.25%
B     0.20%   0.26%
C     0.30%   0.29%
D     0.50%   0.05%
E     3%      4.20%

[Figure: itemset lattice over {A, B, C, D, E} showing which itemsets are examined when each item has its own minimum support.]
Modifications to Apriori:
– In traditional Apriori,
    a candidate (k+1)-itemset is generated by merging two frequent itemsets of size k;
    the candidate is pruned if it contains any infrequent subsets of size k
– With multiple minimum supports, the pruning step has to be modified:
    prune only if the infrequent subset contains the first item, i.e., the item with the lowest minimum support (see the sketch below)
    e.g.: Candidate = {Broccoli, Coke, Milk} (ordered according to minimum support);
    {Broccoli, Coke} and {Broccoli, Milk} are frequent but {Coke, Milk} is infrequent
    – the candidate is not pruned, because {Coke, Milk} does not contain the first item, i.e., Broccoli.
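A minimal sketch of this modified pruning test, assuming each candidate is a tuple already ordered by increasing minimum support and frequent is the set of frequent k-subsets (as frozensets):

    from itertools import combinations

    def prune_ms(candidate, frequent):
        # Prune only when an infrequent k-subset contains the first item;
        # a subset that omits the first item is allowed to be infrequent.
        k = len(candidate) - 1
        for subset in combinations(candidate, k):
            if candidate[0] in subset and frozenset(subset) not in frequent:
                return True      # prune the candidate
        return False             # keep it

    # e.g. ("Broccoli", "Coke", "Milk"): the subset ("Coke", "Milk") lacks
    # "Broccoli", so its infrequency alone does not prune the candidate.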
[Figure: the data mining pipeline in which interestingness measures are applied: Data (a products × features matrix) → Selection → Selected Data → Data Preprocessing → Preprocessed Data → Mining.]
Drawback of Confidence

        Coffee  ¬Coffee
Tea       15       5      20
¬Tea      75       5      80
          90      10     100

Association rule: Tea → Coffee
Confidence = P(Coffee | Tea) = 15/20 = 0.75, but P(Coffee) = 0.9 and P(Coffee | ¬Tea) = 75/80 = 0.94. Despite its high confidence, the rule is misleading.
        Y   ¬Y              Y   ¬Y
X      10    0    10    X  90    0    90
¬X      0   90    90   ¬X   0   10    10
       10   90   100       90   10   100

$\text{Lift} = \frac{0.1}{(0.1)(0.1)} = 10$        $\text{Lift} = \frac{0.9}{(0.9)(0.9)} \approx 1.11$

Statistical independence:
If P(X,Y) = P(X)P(Y), then Lift = 1.
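A minimal sketch computing lift from a 2×2 contingency table, with the cell order used in the tables above; names are illustrative:

    def lift(f11, f10, f01, f00):
        # f11: count(X and Y), f10: count(X and not Y), etc.
        n = f11 + f10 + f01 + f00
        p_xy = f11 / n
        p_x = (f11 + f10) / n
        p_y = (f11 + f01) / n
        return p_xy / (p_x * p_y)

    print(lift(10, 0, 0, 90))   # 10.0   (strong positive association)
    print(lift(90, 0, 0, 10))   # 1.11.. (near independence)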
Piatetsky-Shapiro:
3 properties a good measure M must satisfy:
– M(A,B) = 0 if A and B are statistically independent
– M(A,B) increases monotonically with P(A,B) when P(A) and P(B) remain unchanged
– M(A,B) decreases monotonically with P(A) (or P(B)) when P(A,B) and P(B) (or P(A)) remain unchanged

Property under variable permutation: does M(A,B) = M(B,A)? Swapping the roles of A and B transposes the contingency table:

        B   ¬B              A   ¬A
A       p    q          B    p    r
¬A      r    s         ¬B    q    s

Symmetric measures: support, lift, collective strength, cosine, Jaccard, etc.
Asymmetric measures: confidence, conviction, Laplace, J-measure, etc.
Property under Row/Column Scaling

[Figure: Mosteller's grade × gender example: the same 2×2 table shown before and after its columns are scaled by 2× and 10×.]

Mosteller: the underlying association should be independent of the relative number of male and female students in the samples.
[Figure: transactions represented as binary vectors over items A-F (rows Transaction 1 ... Transaction N), used to discuss how measures behave under the inversion operation, i.e., flipping a vector's 0s and 1s.]
Property under Null Addition: add k transactions that contain neither A nor B; only the s cell changes:

        B   ¬B              B   ¬B
A       p    q          A    p    q
¬A      r    s         ¬A    r    s + k

Invariant measures:
support, cosine, Jaccard, etc.
Non-invariant measures:
correlation, Gini, mutual information, odds ratio, etc.
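A quick numeric check of null-addition behavior. A minimal sketch using the contingency-table forms of cosine (the IS measure) and Jaccard, and the φ-coefficient for correlation; the cell values are made up for illustration:

    import math

    def measures(p, q, r, s):
        cosine = p / math.sqrt((p + q) * (p + r))    # IS measure
        jaccard = p / (p + q + r)
        phi = ((p * s - q * r) /
               math.sqrt((p + q) * (r + s) * (p + r) * (q + s)))
        return cosine, jaccard, phi

    print(measures(20, 5, 5, 20))     # baseline table
    print(measures(20, 5, 5, 2000))   # after null addition: only phi changes

Cosine and Jaccard never touch the s cell, so they are unaffected by null transactions; correlation is not.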
Effect of Support-based Pruning

[Figure: histograms of the correlation of sampled itemsets under successively tighter support constraints; support-based pruning eliminates mostly the negatively correlated itemsets.]
Investigate how support-based pruning affects the other measures. Steps:
– Generate 10000 contingency tables
– Rank each table according to the different measures
– Compute the pair-wise correlation between the measures
[Figure: without support pruning (all 10000 tables), the rankings produced by 21 measures (correlation, interest, PS, CF, Yule Y, reliability, kappa, Klosgen, Yule Q, confidence, Laplace, IS, support, Jaccard, lambda, Gini, J-measure, mutual information, ...) are compared; inset: scatter plot of correlation vs. Jaccard.]
[Figure: the same comparison of the 21 measure rankings after support-based pruning; inset: scatter plot of correlation vs. Jaccard.]
Scatter plot between Correlation & Jaccard measure: 61.45% of the pairs have correlation > 0.85.
[Figure: another support-pruned comparison of measure rankings (conviction, Yule Q, CF, Yule Y, kappa, correlation, collective strength, IS, Jaccard, Laplace, PS, Klosgen, lambda, mutual information, Gini, J-measure); inset: scatter plot of correlation vs. Jaccard.]
Objective measure:
– Rank patterns based on statistics computed from data
– e.g., 21 measures of association (support, confidence, Laplace, Gini, mutual information, Jaccard, etc.)
Subjective measure:
– Rank patterns according to user’s interpretation
A pattern is subjectively interesting if it contradicts the
expectation of a user (Silberschatz & Tuzhilin)
A pattern is subjectively interesting if it is actionable
(Silberschatz & Tuzhilin)
Interestingness via unexpectedness: patterns found in the data are compared against the user's expectations. Patterns that agree with expectations are expected patterns; patterns that contradict them (e.g., expected infrequent, −, but observed frequent, +) are unexpected patterns. The evidence from the data is the observed joint frequency of a pattern, $P(X_1, X_2, \ldots, X_k)$.