Association Rules & Frequent Itemsets: The Market-Basket Problem
• Brute-force approach:
– List all possible association rules
– Compute the support and confidence for each rule
– Prune rules that fail the minsup and minconf thresholds
⇒ Computationally prohibitive!
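To make the cost of the brute-force approach above concrete, here is a minimal Python sketch that lists every rule X -> Y, computes support and confidence, and prunes by threshold. The function names (`support`, `brute_force_rules`) and the threshold values are illustrative assumptions, not part of the original slides; the baskets are this document's five example transactions. For d items the number of possible rules is 3^d - 2^(d+1) + 1, which is 602 for the six items used here.

```python
from itertools import combinations

# The five example transactions used in this document (six distinct items).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def brute_force_rules(transactions, minsup, minconf):
    """List every rule X -> Y (X, Y non-empty, disjoint), then prune.

    Returns the number of rules enumerated and the rules that pass
    both thresholds as (X, Y, support, confidence) tuples.
    """
    items = sorted(set().union(*transactions))
    total, kept = 0, []
    for size in range(2, len(items) + 1):
        for itemset in combinations(items, size):
            whole = frozenset(itemset)
            s = support(whole, transactions)
            # Every non-empty proper subset of the itemset can be a left-hand side.
            for lhs_size in range(1, size):
                for lhs in combinations(itemset, lhs_size):
                    total += 1
                    lhs = frozenset(lhs)
                    lhs_support = support(lhs, transactions)
                    conf = s / lhs_support if lhs_support else 0.0
                    if s >= minsup and conf >= minconf:
                        kept.append((set(lhs), set(whole - lhs), s, conf))
    return total, kept


# Threshold values below are arbitrary choices for illustration.
total, kept = brute_force_rules(transactions, minsup=0.4, minconf=0.6)
print("rules enumerated:", total)
print("rules kept:", len(kept))
```

Even on six items the sketch evaluates 602 rules, each needing fresh scans of the data, which is why the enumeration does not scale.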
Data Mining: Association Rules 7 Data Mining: Association Rules 8
• Given d items, there are 2^d possible candidate itemsets
• Candidate itemsets over items A to E, sizes 2 to 5:
– 2-itemsets: AB AC AD AE BC BD BE CD CE DE
– 3-itemsets: ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE
– 4-itemsets: ABCD ABCE ABDE ACDE BCDE
– 5-itemset: ABCDE
• Transactions (N rows, maximum width w), matched against a list of M candidates:
TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke
– Match each transaction against every candidate
– Complexity ~ O(NMw) ⇒ expensive, since M = 2^d
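The matching of every transaction against every candidate can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `brute_force_frequent_itemsets` and the minimum support count of 3 are hypothetical choices, and the empty itemset is skipped, so the sketch builds 2^d - 1 candidates rather than 2^d.

```python
from itertools import combinations

# The five example transactions (N = 5, six distinct items, so d = 6).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def brute_force_frequent_itemsets(transactions, minsup_count):
    """Count support for every non-empty itemset by exhaustive matching."""
    items = sorted(set().union(*transactions))
    # M candidates: every non-empty subset of the d items (2^d - 1 of them).
    candidates = [
        frozenset(c)
        for k in range(1, len(items) + 1)
        for c in combinations(items, k)
    ]
    counts = dict.fromkeys(candidates, 0)
    comparisons = 0
    for t in transactions:        # N transactions
        for c in candidates:      # M candidates
            comparisons += 1
            if c <= t:            # subset test, cost bounded by the widths involved
                counts[c] += 1
    frequent = {c: n for c, n in counts.items() if n >= minsup_count}
    return candidates, comparisons, frequent


# minsup_count=3 is an arbitrary threshold chosen for illustration.
candidates, comparisons, frequent = brute_force_frequent_itemsets(transactions, 3)
print("M =", len(candidates), "N*M =", comparisons, "frequent:", len(frequent))
```

The count of subset tests is N times M, and M doubles with every additional item, which is the source of the O(NMw) cost noted above.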
Computational Complexity
Frequent Itemset Generation Strategies
• Apriori principle: if an itemset is frequent, then all of its subsets must also be frequent
• k passes over data where k is the size of the largest candidate itemset
• Memory chunking algorithm ⇒ 2 passes over data on disk but multiple in memory
• L3 = {abc, abd, acd, ace, bcd}
• Self-joining: L3*L3
– abcd from abc and abd
– acde from acd and ace
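The self-join on L3 can be reproduced with a short Python sketch. The function name `apriori_gen` and the representation of itemsets as sorted tuples are assumptions made for illustration. The pruning step is not part of the list above; it is the standard Apriori subset check, added here as an assumption about how the example continues.

```python
from itertools import combinations


def apriori_gen(frequent_prev):
    """Generate candidate k-itemsets from frequent (k-1)-itemsets.

    Itemsets are sorted tuples. Two itemsets are joined when they agree on
    everything except their last item. A joined candidate survives pruning
    only if each of its (k-1)-subsets is itself frequent.
    Returns (joined candidates, candidates after pruning).
    """
    prev = sorted(frequent_prev)
    prev_set = set(prev)
    joined = []
    for i, a in enumerate(prev):
        for b in prev[i + 1:]:
            if a[:-1] == b[:-1]:
                # a < b in sorted order, so the result stays sorted.
                joined.append(a + (b[-1],))
    pruned = [
        cand
        for cand in joined
        if all(sub in prev_set for sub in combinations(cand, len(cand) - 1))
    ]
    return joined, pruned


L3 = [tuple(s) for s in ("abc", "abd", "acd", "ace", "bcd")]
joined, C4 = apriori_gen(L3)
print("after self-join:", ["".join(c) for c in joined])
print("after pruning:  ", ["".join(c) for c in C4])
```

In this sketch the join yields abcd and acde; the subset check then drops acde because its subset ade is not in L3, leaving abcd as the only candidate 4-itemset.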
• Hash-based itemset counting: A k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent
• Transaction reduction: A transaction that does not contain any frequent k-itemset is useless in subsequent scans
• Partitioning: Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
• Sampling: mining on a subset of given data
– lower support threshold
• The core of the Apriori algorithm:
– Use frequent (k – 1)-itemsets to generate candidate frequent k-itemsets
– Use database scan and pattern matching to collect counts for the candidate itemsets
• The bottleneck of Apriori: candidate generation
– Huge candidate sets:
  • 10^4 frequent 1-itemsets will generate 10^7 candidate 2-itemsets
  • To discover a frequent pattern of size 100, e.g., {a1, a2, …, a100}, one needs to generate 2^100 ≈ 10^30 candidates.
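The two core steps of Apriori (generate candidates from the frequent (k – 1)-itemsets, then scan the data to count them) can be sketched as a level-wise loop in Python. This is a minimal illustration, not a tuned implementation: the function name `apriori`, the sorted-tuple representation, and the minimum support count of 3 are assumptions, and none of the efficiency improvements listed above (hashing, transaction reduction, partitioning, sampling) are applied.

```python
from itertools import combinations

# The five example transactions used in this document.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def apriori(transactions, minsup_count):
    """Level-wise search; one pass over the data per itemset size.

    Returns ({itemset as sorted tuple: support count},
             number of candidates counted at each level).
    """
    # Pass 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            counts[(item,)] = counts.get((item,), 0) + 1
    frequent = {s: n for s, n in counts.items() if n >= minsup_count}
    result = dict(frequent)
    candidates_per_level = [len(counts)]
    k = 2
    while frequent:
        prev = sorted(frequent)
        prev_set = set(prev)
        # Step 1: join frequent (k-1)-itemsets sharing a prefix, then keep
        # only candidates whose (k-1)-subsets are all frequent.
        candidates = [
            a + (b[-1],)
            for i, a in enumerate(prev)
            for b in prev[i + 1:]
            if a[:-1] == b[:-1]
            and all(sub in prev_set for sub in combinations(a + (b[-1],), k - 1))
        ]
        candidates_per_level.append(len(candidates))
        # Step 2: scan the transactions to collect candidate counts.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if t.issuperset(c):
                    counts[c] += 1
        frequent = {c: n for c, n in counts.items() if n >= minsup_count}
        result.update(frequent)
        k += 1
    return result, candidates_per_level


# minsup_count=3 is an arbitrary threshold chosen for illustration.
result, candidates_per_level = apriori(transactions, minsup_count=3)
print("candidates counted per level:", candidates_per_level)
for itemset, n in sorted(result.items(), key=lambda kv: (len(kv[0]), kv[0])):
    print(itemset, n)
```

On this toy data the loop counts 13 candidates in total, compared with the 63 non-empty itemsets a brute-force search would test. The bottleneck described above appears when the frequent 1-itemsets number in the thousands, because the join step alone then produces millions of candidate pairs.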