Association
Market-Basket transactions

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example of Association Rules
{Diaper} → {Beer},
{Milk, Bread} → {Eggs, Coke},
{Beer, Bread} → {Milk}

Implication means co-occurrence, not causality!
Definition: Frequent Itemset

Itemset
– A collection of one or more items
  Example: {Milk, Bread, Diaper}
– k-itemset: an itemset that contains k items

Support count (σ)
– Frequency of occurrence of an itemset
– E.g. σ({Milk, Bread, Diaper}) = 2

Support
– Fraction of transactions that contain an itemset
– E.g. s({Milk, Bread, Diaper}) = 2/5

Frequent Itemset
– An itemset whose support is greater than or equal to a minsup threshold
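A minimal Python sketch of how these two quantities could be computed for the market-basket table above; the transactions are the five from the slide, and helper names such as support_count are my own, not from the slides:

    # Transactions from the slide (TIDs 1-5), each represented as a set of items.
    transactions = [
        {"Bread", "Milk"},
        {"Bread", "Diaper", "Beer", "Eggs"},
        {"Milk", "Diaper", "Beer", "Coke"},
        {"Bread", "Milk", "Diaper", "Beer"},
        {"Bread", "Milk", "Diaper", "Coke"},
    ]

    def support_count(itemset, transactions):
        # sigma(X): number of transactions that contain every item of X.
        return sum(1 for t in transactions if itemset <= t)

    def support(itemset, transactions):
        # s(X): fraction of transactions that contain X.
        return support_count(itemset, transactions) / len(transactions)

    print(support_count({"Milk", "Bread", "Diaper"}, transactions))  # 2
    print(support({"Milk", "Bread", "Diaper"}, transactions))        # 0.4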
Definition: Association Rule

Association Rule
– An implication expression of the form X → Y, where X and Y are itemsets
– Example: {Milk, Diaper} → {Beer}

Rule Evaluation Metrics
– Support (s)
  Fraction of transactions that contain both X and Y
– Confidence (c)
  Measures how often items in Y appear in transactions that contain X

Example (using the transactions above): {Milk, Diaper} → {Beer}
s = σ({Milk, Diaper, Beer}) / |T| = 2/5 = 0.4
c = σ({Milk, Diaper, Beer}) / σ({Milk, Diaper}) = 2/3 ≈ 0.67
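The rule metrics follow the same pattern; a small Python sketch (sigma is my own helper name) that reproduces s = 0.4 and c ≈ 0.67 for {Milk, Diaper} → {Beer}:

    transactions = [
        {"Bread", "Milk"},
        {"Bread", "Diaper", "Beer", "Eggs"},
        {"Milk", "Diaper", "Beer", "Coke"},
        {"Bread", "Milk", "Diaper", "Beer"},
        {"Bread", "Milk", "Diaper", "Coke"},
    ]

    def sigma(itemset):
        # Support count: transactions containing all items of the itemset.
        return sum(1 for t in transactions if itemset <= t)

    X, Y = {"Milk", "Diaper"}, {"Beer"}
    s = sigma(X | Y) / len(transactions)   # rule support: 2/5 = 0.4
    c = sigma(X | Y) / sigma(X)            # rule confidence: 2/3 ≈ 0.67
    print(round(s, 2), round(c, 2))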
Association Rule Mining Task
Brute-force approach:
– List all possible association rules
– Compute the support and confidence for each rule
– Prune rules that fail the minsup and minconf thresholds
Computationally prohibitive!
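To make "computationally prohibitive" concrete, a standard counting argument (not spelled out on the slide, but consistent with it) gives the total number of possible rules over d items:

    R = \sum_{k=1}^{d-1} \binom{d}{k}\left(2^{d-k}-1\right) = 3^d - 2^{d+1} + 1

For the six items of the running example (d = 6) this is already 3^6 - 2^7 + 1 = 602 rules, each of which would need its support and confidence computed in the brute-force approach.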
Mining Association Rules

Example rules derived from the itemset {Milk, Diaper, Beer}:
{Milk, Diaper} → {Beer}, {Milk, Beer} → {Diaper}, {Diaper, Beer} → {Milk},
{Beer} → {Milk, Diaper}, {Diaper} → {Milk, Beer}, {Milk} → {Diaper, Beer}

Observations:
• All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but can have different confidence
• Thus, we may decouple the support and confidence requirements
Mining Association Rules
Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup
2. Rule Generation
– Generate high confidence rules from each frequent itemset,
where each rule is a binary partitioning of a frequent itemset
[Figure: the lattice of candidate itemsets over the items A-E, from the single items through the 2-itemsets and 3-itemsets up to the full itemset ABCDE.]
Apriori principle:
– If an itemset is frequent, then all of its subsets must also
be frequent
∀ X, Y: (X ⊆ Y) ⇒ s(X) ≥ s(Y)
– Support of an itemset never exceeds the support of its
subsets
– This is known as the anti-monotone property of support
Illustrating Apriori Principle

[Figure: the itemset lattice over the items A-E; once an itemset (e.g. AB) is found to be infrequent, all of its supersets, up to ABCDE, are pruned from consideration.]
Illustrating Apriori Principle

TID  Items
1    Bread, Milk
2    Beer, Bread, Diaper, Eggs
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Bread, Coke, Diaper, Milk

Items (1-itemsets)
Item    Count
Bread   4
Coke    2
Milk    4
Beer    3
Diaper  4
Eggs    1

Minimum Support = 3
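The two-step approach can be sketched on exactly this table. The following Python sketch illustrates level-wise frequent itemset generation with a minimum support count of 3; the candidate step here simply unions frequent k-itemsets and then applies Apriori pruning, which is not the precise Fk-1 x Fk-1 procedure discussed next, and all names are my own:

    from itertools import combinations

    transactions = [
        {"Bread", "Milk"},
        {"Beer", "Bread", "Diaper", "Eggs"},
        {"Beer", "Coke", "Diaper", "Milk"},
        {"Beer", "Bread", "Diaper", "Milk"},
        {"Bread", "Coke", "Diaper", "Milk"},
    ]
    MINSUP = 3  # minimum support count, as on the slide

    def support_count(itemset):
        return sum(1 for t in transactions if itemset <= t)

    # Level-wise generation: start from frequent 1-itemsets, then extend.
    items = sorted(set().union(*transactions))
    frequent = {1: [frozenset([i]) for i in items
                    if support_count(frozenset([i])) >= MINSUP]}
    k = 1
    while frequent[k]:
        # Candidate (k+1)-itemsets: unions of frequent k-itemsets.
        candidates = {a | b for a in frequent[k] for b in frequent[k]
                      if len(a | b) == k + 1}
        # Apriori pruning: every k-subset of a candidate must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent[k]
                             for s in combinations(c, k))}
        frequent[k + 1] = [c for c in candidates if support_count(c) >= MINSUP]
        k += 1

    for level, sets in frequent.items():
        if sets:
            print(level, [sorted(s) for s in sets])
    # Prints the four frequent items and the four frequent pairs;
    # no 3-itemset reaches support count 3 on this data.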
Candidate Generation: Fk-1 x Fk-1 Method
Let F3 = {ABC, ABD, ABE, ACD, BCD, BDE, CDE} be the set of frequent 3-itemsets
Merge two frequent 3-itemsets if their first two items are identical:
– Merge(ABC, ABD) = ABCD
– Merge(ABC, ABE) = ABCE
– Merge(ABD, ABE) = ABDE
Candidate pruning
– Prune ABCE because ACE and BCE are infrequent
– Prune ABDE because ADE is infrequent

Alternate Fk-1 x Fk-1 Method
Merge two frequent 3-itemsets if the last two items of the first are identical to the first two items of the second:
F3 = {ABC, ABD, ABE, ACD, BCD, BDE, CDE}
– Merge(ABC, BCD) = ABCD
– Merge(ABD, BDE) = ABDE
– Merge(ACD, CDE) = ACDE
– Merge(BCD, CDE) = BCDE
Candidate Pruning for Alternate Fk-1 x Fk-1 Method
Let F3 = {ABC, ABD, ABE, ACD, BCD, BDE, CDE} be the set of frequent 3-itemsets
– Prune ABDE because ADE is infrequent
– Prune ACDE because ACE and ADE are infrequent
– Prune BCDE because BCE is infrequent
– Only ABCD survives candidate pruning
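A compact Python sketch of the Fk-1 x Fk-1 merge and the subset-based pruning on the F3 above; generate_candidates and prune are my own names, and items are single letters for brevity:

    from itertools import combinations

    # Frequent 3-itemsets from the slide, written as sorted tuples of items.
    F3 = {tuple(s) for s in ["ABC", "ABD", "ABE", "ACD", "BCD", "BDE", "CDE"]}

    def generate_candidates(F_prev, k):
        # Fk-1 x Fk-1 method: merge itemsets whose first k-2 items agree.
        cands = set()
        for a, b in combinations(sorted(F_prev), 2):
            if a[:k - 2] == b[:k - 2]:
                cands.add(tuple(sorted(set(a) | set(b))))
        return cands

    def prune(cands, F_prev, k):
        # Apriori pruning: drop a candidate if any (k-1)-subset is infrequent.
        return {c for c in cands
                if all(sub in F_prev for sub in combinations(c, k - 1))}

    C4 = generate_candidates(F3, k=4)
    print(sorted(C4))            # ABCD, ABCE, ABDE
    print(sorted(prune(C4, F3, 4)))  # only ABCD survives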
Support Counting of Candidate Itemsets

Candidate 3-itemsets whose support must be counted against the transactions:
– {Beer, Diaper, Milk}
– {Beer, Bread, Diaper}
– {Bread, Diaper, Milk}
– {Beer, Bread, Milk}

TID  Items
1    Bread, Milk
2    Beer, Bread, Diaper, Eggs
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Bread, Coke, Diaper, Milk

[Figure: enumeration of all 3-item subsets of the transaction t = {1, 2, 3, 5, 6}, organized level by level (1+ 2356, 2+ 356, 3+ 56 at level 1, then 12+ 356, 13+ 56, and so on, down to the ten 3-subsets such as 123, 125, 135, 356).]
Support Counting Using a Hash Tree

[Figure: a hash tree storing 15 candidate 3-itemsets in its leaves: {2,3,4}, {5,6,7}, {1,4,5}, {1,3,6}, {3,4,5}, {3,5,6}, {3,6,7}, {3,5,7}, {3,6,8}, {1,2,4}, {1,5,9}, {6,8,9}, {1,2,5}, {4,5,7}, {4,5,8}. At every internal node the hash function sends items 1, 4, 7 to the left child, items 2, 5, 8 to the middle child, and items 3, 6, 9 to the right child.]

[Figure: to count supports, the transaction 1 2 3 5 6 is recursively split (1+ 2356, 2+ 356, 3+ 56, then 12+ 356, 13+ 56, 15+ 6, ...) and each piece is routed down the hash tree with the same hash function, so only the leaves reachable from these prefixes are compared against the transaction.]

Match transaction against 11 out of 15 candidates
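For comparison with the hash tree, the sketch below counts candidate supports the straightforward way: it enumerates every 3-subset of a transaction and looks it up among the candidates. The candidate list is the 15 leaves of the hash tree figure and the transaction is the one from the slides; this is only an illustration of the work the hash tree speeds up, not of the hash tree itself:

    from itertools import combinations
    from collections import defaultdict

    # The 15 candidate 3-itemsets stored in the hash tree on the slide.
    candidates = {frozenset(c) for c in [
        (2, 3, 4), (5, 6, 7), (1, 4, 5), (1, 3, 6), (3, 4, 5), (3, 5, 6),
        (3, 6, 7), (3, 5, 7), (3, 6, 8), (1, 2, 4), (1, 5, 9), (6, 8, 9),
        (1, 2, 5), (4, 5, 7), (4, 5, 8)]}

    transactions = [(1, 2, 3, 5, 6)]   # the transaction from the slide

    # Enumerate every 3-subset of each transaction and bump matching candidates.
    counts = defaultdict(int)
    for t in transactions:
        for subset in combinations(sorted(t), 3):
            if frozenset(subset) in candidates:
                counts[frozenset(subset)] += 1

    print({tuple(sorted(c)): n for c, n in counts.items()})
    # {(1, 2, 5): 1, (1, 3, 6): 1, (3, 5, 6): 1}

The hash tree reaches the same counts while comparing the transaction against only 11 of the 15 candidates, because subtrees that the hashed prefixes cannot reach are never visited.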
Rule Generation

Lattice of rules generated from the frequent itemset ABCD:

[Figure: the lattice of rules, from ABCD ⇒ {} at the top through BCD ⇒ A, ACD ⇒ B, ABD ⇒ C, ABC ⇒ D and so on. If a rule such as BCD ⇒ A has low confidence, then every rule below it in the lattice, i.e. every rule obtained by moving further items from its antecedent into its consequent, can be pruned.]
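A hedged Python sketch of the rule-generation step for one frequent itemset, using the market-basket table from earlier; rules_from_itemset and sigma are my own names, and this version simply checks every binary partition rather than exploiting the confidence-based pruning suggested by the lattice:

    from itertools import combinations

    transactions = [
        {"Bread", "Milk"},
        {"Beer", "Bread", "Diaper", "Eggs"},
        {"Beer", "Coke", "Diaper", "Milk"},
        {"Beer", "Bread", "Diaper", "Milk"},
        {"Bread", "Coke", "Diaper", "Milk"},
    ]

    def sigma(itemset):
        return sum(1 for t in transactions if itemset <= t)

    def rules_from_itemset(L, minconf):
        # Every non-empty proper subset X of L yields a candidate rule X -> L - X.
        L = frozenset(L)
        out = []
        for r in range(1, len(L)):
            for X in map(frozenset, combinations(L, r)):
                conf = sigma(L) / sigma(X)
                if conf >= minconf:
                    out.append((sorted(X), sorted(L - X), round(conf, 2)))
        return out

    for rule in rules_from_itemset({"Milk", "Diaper", "Beer"}, minconf=0.6):
        print(rule)
    # e.g. (['Beer'], ['Diaper', 'Milk'], 0.67) and (['Beer', 'Milk'], ['Diaper'], 1.0)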
Illustrating Apriori Principle

With the same six items and Minimum Support = 3:

If every subset up to size 3 is considered:
6C1 + 6C2 + 6C3 = 6 + 15 + 20 = 41 candidate itemsets
(including 4-itemsets: 6C1 + 6C2 + 6C3 + 6C4 = 6 + 15 + 20 + 15 = 56)

With support-based pruning:
6 + 6 + 4 = 16 candidate itemsets
Factors Affecting Complexity of Apriori
Choice of minimum support threshold
– lowering support threshold results in more frequent itemsets
– this may increase number of candidates and max length of
frequent itemsets
Dimensionality (number of items) of the data set
– More space is needed to store support count of itemsets
– if number of frequent itemsets also increases, both computation
and I/O costs may also increase
Size of database
– run time of algorithm increases with number of transactions
Average transaction width
– transaction width increases the max length of frequent itemsets
– number of subsets in a transaction increases with its width,
increasing computation time for support counting
Maximal Frequent Itemset

An itemset is maximal frequent if it is frequent and none of its immediate supersets is frequent.

[Figure: the itemset lattice over the items A-E with the border between frequent and infrequent itemsets drawn; the maximal frequent itemsets are the frequent itemsets adjacent to this border.]
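As a rough check of the definition on the market-basket table from earlier, with minimum support count 3; the brute-force enumeration and the name sigma are mine, not part of the Apriori algorithm itself:

    from itertools import combinations

    transactions = [
        {"Bread", "Milk"},
        {"Beer", "Bread", "Diaper", "Eggs"},
        {"Beer", "Coke", "Diaper", "Milk"},
        {"Beer", "Bread", "Diaper", "Milk"},
        {"Bread", "Coke", "Diaper", "Milk"},
    ]
    MINSUP = 3
    items = sorted(set().union(*transactions))

    def sigma(itemset):
        return sum(1 for t in transactions if itemset <= t)

    # All frequent itemsets (brute force over the small item universe).
    frequent = [frozenset(c)
                for r in range(1, len(items) + 1)
                for c in combinations(items, r)
                if sigma(frozenset(c)) >= MINSUP]

    # Maximal frequent: frequent, and no immediate superset (one extra item) is frequent.
    maximal = [f for f in frequent
               if not any(sigma(f | {i}) >= MINSUP for i in items if i not in f)]
    print([sorted(m) for m in maximal])
    # The four frequent pairs, since no 3-itemset is frequent at this threshold.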
An illustrative example

[Figure: a transaction matrix with items A-J and transactions 1-10; dark cells indicate which items appear in each transaction. The matrix itself is not reproduced here.]

Support threshold (by count): 5
Frequent itemsets: {F}
Maximal itemsets: {F}
Another illustrative example

[Figure: a transaction matrix with items A-J and transactions 1-10; dark cells indicate which items appear in each transaction. The matrix itself is not reproduced here.]

Support threshold (by count): 5
Maximal itemsets: {A}, {B}, {C}

Support threshold (by count): 4
Maximal itemsets: {A,B}, {A,C}, {B,C}
Closed Itemset

An itemset X is closed if none of its immediate supersets has the same support count as X.

[Figure: an itemset lattice over the items A-E annotated with support counts; itemsets such as ABCDE at the bottom are not supported by any transaction.]
Maximal Frequent vs Closed Frequent Itemsets

[Figure: the same support-annotated itemset lattice, with the closed frequent and maximal frequent itemsets marked.]

Number of closed frequent itemsets = 9
Number of maximal frequent itemsets = 4
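The closure condition can be checked the same brute-force way on the earlier market-basket table; this is a sketch with my own names, and a minsup filter would additionally be needed to obtain the closed frequent itemsets:

    from itertools import combinations

    transactions = [
        {"Bread", "Milk"},
        {"Beer", "Bread", "Diaper", "Eggs"},
        {"Beer", "Coke", "Diaper", "Milk"},
        {"Beer", "Bread", "Diaper", "Milk"},
        {"Bread", "Coke", "Diaper", "Milk"},
    ]
    items = sorted(set().union(*transactions))

    def sigma(itemset):
        return sum(1 for t in transactions if itemset <= t)

    def is_closed(itemset):
        # Closed: every immediate superset has strictly smaller support.
        return all(sigma(itemset | {i}) < sigma(itemset)
                   for i in items if i not in itemset)

    closed = [frozenset(c)
              for r in range(1, len(items) + 1)
              for c in combinations(items, r)
              if sigma(frozenset(c)) > 0 and is_closed(frozenset(c))]
    print([sorted(c) for c in closed])
    # e.g. {Bread, Diaper, Milk} (support 2) is closed,
    # while {Beer, Milk} is not, because {Beer, Diaper, Milk} also has support 2.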
What are the Closed Itemsets in this Data?

[Figure: a transaction data set organized into three blocks of items labeled (A1-A10), (B1-B10) and (C1-C10); the data itself is not reproduced here.]
Example 1

[Figure: a transaction matrix with items A-J and transactions 1-10; dark cells indicate which items each transaction contains. The matrix itself is not reproduced here.]

Itemsets   Support (counts)   Closed itemsets
{C}        3
{D}        2
{C,D}      2
Example 2

[Figure: a transaction matrix with items A-J and transactions 1-10; dark cells indicate which items each transaction contains. The matrix itself is not reproduced here.]

Itemsets   Support (counts)   Closed itemsets
{C}        3
{D}        2
{E}        2
{C,D}      2
{C,E}      2
{D,E}      2
{C,D,E}    2
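As a quick illustration of the closure test for Example 2, restricted to the three items shown; this assumes the counts in the table are complete for C, D and E, and items outside this set, which are not reproduced here, could change the answers:

    # Supports from the Example 2 table (restricted to items C, D, E).
    support = {
        frozenset("C"): 3, frozenset("D"): 2, frozenset("E"): 2,
        frozenset("CD"): 2, frozenset("CE"): 2, frozenset("DE"): 2,
        frozenset("CDE"): 2,
    }
    items = set("CDE")

    def is_closed(itemset):
        # Closed: no immediate superset (within C, D, E) has the same support.
        return all(support.get(itemset | {i}, 0) < support[itemset]
                   for i in items - itemset)

    for s in sorted(support, key=lambda x: (len(x), sorted(x))):
        print("".join(sorted(s)), support[s],
              "closed" if is_closed(s) else "not closed")
    # Within these items, {C} and {C,D,E} pass the test; the others have
    # an immediate superset with exactly the same count.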
Example 3

[Figure: a transaction matrix with items A-J and transactions 1-10; dark cells indicate which items each transaction contains. The matrix itself is not reproduced here.]

Closed itemsets: {C,D,E,F}, {C,F}
Example 4

[Figure: a transaction matrix with items A-J and transactions 1-10; dark cells indicate which items each transaction contains. The matrix itself is not reproduced here.]

Closed itemsets: {C,D,E,F}, {C}, {F}
Maximal vs Closed Itemsets

[Figure: diagram showing that the maximal frequent itemsets are a subset of the closed frequent itemsets, which in turn are a subset of all frequent itemsets.]
Example question
Given the following transaction data sets (dark cells indicate presence of an item in
a transaction) and a support threshold of 20%, answer the following questions