Association Rules & Frequent Itemsets: The Market-Basket Problem

The document discusses the market basket problem and mining association rules from transactional data. The market basket problem aims to find rules that predict the occurrence of items based on other items purchased together. It describes association rules as implications between sets of items and evaluates them based on support and confidence metrics. The task of mining association rules involves finding all rules above minimum support and confidence thresholds, which is computationally prohibitive to do via brute force. Instead, a two-step approach is used where frequent itemsets are first generated before mining association rules.


The Market-Basket Problem

• Given a database of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction

All you ever wanted to know about diapers, beers and their correlation!

Market-basket transactions:

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example of association rules:

{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}

Implication here means co-occurrence, not causality!

The Market-Basket Problem

Given a database of transactions, where each transaction is a collection of items (purchased by a customer in a visit), find all rules that correlate the presence of one set of items with that of another set of items.

Example: 30% of all transactions that contain diapers also contain beer; 5% of all transactions contain both items.
– 30%: confidence of the rule
– 5%: support of the rule

We are interested in finding all rules, rather than verifying that a particular rule holds.

Applications of Market-Basket Analysis

• Supermarkets
  – Placement
  – Advertising
  – Sales
  – Coupons
• Many applications outside market-basket data analysis
  – Prediction (telecom switch failure)
  – Web usage mining
• Many different types of association rules
  – Temporal
  – Spatial
  – Causal

Definition: Frequent Itemset

• Itemset
  – A collection of one or more items
  – Example: {Milk, Bread, Diaper}
  – k-itemset: an itemset that contains k items
• Support count (σ)
  – Frequency of occurrence of an itemset
  – E.g. σ({Milk, Bread, Diaper}) = 2
• Support (s)
  – Fraction of transactions that contain an itemset
  – E.g. s({Milk, Bread, Diaper}) = 2/5
• Frequent itemset
  – An itemset whose support is greater than or equal to a minsup threshold

Definition: Association Rule

• Association rule
  – An implication expression of the form X → Y, where X and Y are itemsets
  – Example: {Milk, Diaper} → {Beer}
• Rule evaluation metrics
  – Support (s): the fraction of transactions that contain both X and Y
  – Confidence (c): measures how often items in Y appear in transactions that contain X

Example, for the rule {Milk, Diaper} → {Beer} over the five transactions above:

    s = σ({Milk, Diaper, Beer}) / |T| = 2/5 = 0.4
    c = σ({Milk, Diaper, Beer}) / σ({Milk, Diaper}) = 2/3 ≈ 0.67
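These two metrics are straightforward to compute directly. Here is a minimal Python sketch (the helper names are mine, not from the slides) that reproduces the numbers above for {Milk, Diaper} → {Beer}:

```python
def support_count(itemset, transactions):
    """sigma(itemset): number of transactions containing every item in itemset."""
    return sum(itemset <= t for t in transactions)

def rule_metrics(X, Y, transactions):
    """Support and confidence of the rule X -> Y."""
    s = support_count(X | Y, transactions) / len(transactions)
    c = support_count(X | Y, transactions) / support_count(X, transactions)
    return s, c

T = [{"Bread", "Milk"},
     {"Bread", "Diaper", "Beer", "Eggs"},
     {"Milk", "Diaper", "Beer", "Coke"},
     {"Bread", "Milk", "Diaper", "Beer"},
     {"Bread", "Milk", "Diaper", "Coke"}]
print(rule_metrics({"Milk", "Diaper"}, {"Beer"}, T))  # (0.4, 0.666...)
```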
Aspects of Association Rule Mining

• How do we generate rules fast?
  – Performance measured in
    • number of database scans
    • number of itemsets that must be counted
• Which are the interesting rules?

Association Rule Mining Task

• Given a set of transactions T, the goal of association rule mining is to find all rules having
  – support ≥ minsup threshold
  – confidence ≥ minconf threshold
• Brute-force approach:
  – List all possible association rules
  – Compute the support and confidence for each rule
  – Prune rules that fail the minsup and minconf thresholds
  ⇒ Computationally prohibitive!

Mining Association Rules

Example rules from the transactions above:

{Milk, Diaper} → {Beer}   (s=0.4, c=0.67)
{Milk, Beer} → {Diaper}   (s=0.4, c=1.0)
{Diaper, Beer} → {Milk}   (s=0.4, c=0.67)
{Beer} → {Milk, Diaper}   (s=0.4, c=0.67)
{Diaper} → {Milk, Beer}   (s=0.4, c=0.5)
{Milk} → {Diaper, Beer}   (s=0.4, c=0.5)

Observations:
• All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but can have different confidence
• Thus, we may decouple the support and confidence requirements

Finding Association Rules

Two-step approach:
1. Frequent itemset generation: generate all itemsets whose support ≥ minsup
2. Rule generation: generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset (sketched below)

• Frequent itemset generation is still computationally expensive
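Step 2 can be shown directly: every non-empty proper subset X of a frequent itemset F yields a candidate rule X → F\X, and only confidence needs checking, since all such rules share the support of F. A minimal sketch (function names are mine):

```python
from itertools import combinations

def support_count(itemset, transactions):
    return sum(itemset <= t for t in transactions)

def rules_from_itemset(F, transactions, minconf):
    """All binary partitions X -> F\\X of frequent itemset F with conf >= minconf."""
    F = frozenset(F)
    n_F = support_count(F, transactions)
    out = []
    for r in range(1, len(F)):                      # size of the antecedent X
        for X in map(frozenset, combinations(F, r)):
            conf = n_F / support_count(X, transactions)
            if conf >= minconf:
                out.append((set(X), set(F - X), round(conf, 2)))
    return out

T = [{"Bread", "Milk"},
     {"Bread", "Diaper", "Beer", "Eggs"},
     {"Milk", "Diaper", "Beer", "Coke"},
     {"Bread", "Milk", "Diaper", "Beer"},
     {"Bread", "Milk", "Diaper", "Coke"}]
# Four of the six rules above clear a 0.6 confidence threshold:
print(rules_from_itemset({"Milk", "Diaper", "Beer"}, T, minconf=0.6))
```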

Frequent Itemset Generation

[Figure: itemset lattice over items A–E, from the null itemset at the top through all single items, pairs, triples, and quadruples down to ABCDE. Given d items, there are 2^d possible candidate itemsets.]

• Brute-force approach:
  – Each itemset in the lattice is a candidate frequent itemset
  – Count the support of each candidate by scanning the database (N transactions of width w)
  – Match each transaction against every candidate in the list (M candidates)
  – Complexity ~ O(NMw) ⇒ expensive, since M = 2^d !!!
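The brute-force scheme takes only a few lines; the point of this sketch is its cost, not its usefulness, since it counts every one of the 2^d − 1 non-empty candidates:

```python
from itertools import chain, combinations

T = [{"Bread", "Milk"},
     {"Bread", "Diaper", "Beer", "Eggs"},
     {"Milk", "Diaper", "Beer", "Coke"},
     {"Bread", "Milk", "Diaper", "Beer"},
     {"Bread", "Milk", "Diaper", "Coke"}]
items = sorted(set().union(*T))          # d = 6 distinct items here

# Enumerate the whole lattice: all 2^d - 1 non-empty candidate itemsets...
candidates = chain.from_iterable(combinations(items, k)
                                 for k in range(1, len(items) + 1))
# ...and match every transaction against every candidate: O(N * M * w).
supports = {c: sum(set(c) <= t for t in T) for c in candidates}
print(len(supports))                     # 63 candidates for d = 6
```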
Computational Complexity

• Given d unique items:
  – Total number of itemsets = 2^d
  – Total number of possible association rules:

    R = \sum_{k=1}^{d-1} \binom{d}{k} \sum_{j=1}^{d-k} \binom{d-k}{j} = 3^d - 2^{d+1} + 1

  If d = 6, R = 602 rules (checked numerically in the sketch after the next list)

Frequent Itemset Generation Strategies

• Reduce the number of candidates (M)
  – Complete search: M = 2^d
  – Use pruning techniques to reduce M
• Reduce the number of transactions (N)
  – Reduce the size of N as the size of the itemset increases
  – Used by DHP and vertical-based mining algorithms
• Reduce the number of comparisons (NM)
  – Use efficient data structures to store the candidates or transactions
  – No need to match every candidate against every transaction
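The closed form above can be sanity-checked against the double sum directly; a quick sketch:

```python
from math import comb

d = 6
# Direct double sum: choose k items for the antecedent, then j of the
# remaining d-k items for the consequent.
R = sum(comb(d, k) * sum(comb(d - k, j) for j in range(1, d - k + 1))
        for k in range(1, d))
print(R, 3**d - 2**(d + 1) + 1)   # both print 602
```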

Reducing Number of Candidates

• Apriori principle:
  – If an itemset is frequent, then all of its subsets must also be frequent
• The Apriori principle holds due to the following property of the support measure:

    ∀X, Y : (X ⊆ Y) ⇒ s(X) ≥ s(Y)

  – The support of an itemset never exceeds the support of its subsets
  – This is known as the anti-monotone property of support

Illustrating Apriori Principle

[Figure: the same itemset lattice over A–E; once {A, B} is found to be infrequent, all of its supersets (ABC, ABD, ABE, ..., ABCDE) are pruned without ever being counted.]
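As a spot check of the anti-monotone property on the toy transactions (a throwaway sketch; s is the support fraction defined earlier):

```python
T = [{"Bread", "Milk"},
     {"Bread", "Diaper", "Beer", "Eggs"},
     {"Milk", "Diaper", "Beer", "Coke"},
     {"Bread", "Milk", "Diaper", "Beer"},
     {"Bread", "Milk", "Diaper", "Coke"}]
s = lambda X: sum(X <= t for t in T) / len(T)   # support fraction of an itemset
X, Y = {"Milk", "Diaper"}, {"Milk", "Diaper", "Beer"}
print(s(X), s(Y), s(X) >= s(Y))   # 0.6 0.4 True: X ⊆ Y implies s(X) ≥ s(Y)
```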

Illustrating Apriori Principle

Minimum support count = 3.

Items (1-itemsets):

Item    Count
Bread   4
Coke    2
Milk    4
Beer    3
Diaper  4
Eggs    1

Pairs (2-itemsets); no need to generate candidates involving Coke or Eggs:

Itemset          Count
{Bread, Milk}    3
{Bread, Beer}    2
{Bread, Diaper}  3
{Milk, Beer}     2
{Milk, Diaper}   3
{Beer, Diaper}   3

Triplets (3-itemsets):

Itemset                Count
{Bread, Milk, Diaper}  2

If every subset is considered, 6C1 + 6C2 + 6C3 = 6 + 15 + 20 = 41 candidates must be counted.
With support-based pruning, only 6 + 6 + 1 = 13 candidates are counted.

The Idea of the Apriori Algorithm

• Start with all 1-itemsets
• Go through the data, count their support, and find all "large" 1-itemsets
• Combine them to form "candidate" 2-itemsets
• Go through the data, count their support, and find all "large" 2-itemsets
• Combine them to form "candidate" 3-itemsets
• …

large itemset: itemset with support > s
candidate itemset: itemset that may have support > s
The Apriori Algorithm

• Join step: Ck is generated by joining Lk-1 with itself
• Prune step: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset
• Pseudo-code:

    Ck: candidate itemsets of size k
    Lk: frequent itemsets of size k

    L1 = {frequent items};
    for (k = 1; Lk != ∅; k++) do begin
        Ck+1 = candidates generated from Lk;
        for each transaction t in database do
            increment the count of all candidates in Ck+1 that are contained in t;
        Lk+1 = candidates in Ck+1 with min_support;
    end
    return ∪k Lk;

Apriori Algorithm from Agrawal et al. (1993)

[Figure: the algorithm as presented in the original paper; not reproduced in this extraction.]
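A runnable Python rendering of this pseudo-code, as a sketch rather than the authors' implementation (helper names are mine; the join here is a generic pairwise union rather than the prefix-based self-join, which is sketched separately further below). Run on the toy transactions with a minimum support count of 3, it finds the four frequent items and four frequent pairs from the illustration above (the triple {Bread, Milk, Diaper} falls short at count 2):

```python
from itertools import combinations

def apriori(transactions, minsup_count):
    """Return every frequent itemset (as a frozenset) with its support count."""
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}
    # L1: count each single item and keep the frequent ones
    counts = {frozenset([i]): sum(i in t for t in transactions) for i in items}
    Lk = {s for s, c in counts.items() if c >= minsup_count}
    frequent = {s: counts[s] for s in Lk}
    k = 1
    while Lk:
        # Join step: merge frequent k-itemsets into (k+1)-item candidates
        Ck1 = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        # Prune step: every k-subset of a candidate must itself be frequent
        Ck1 = {c for c in Ck1
               if all(frozenset(s) in Lk for s in combinations(c, k))}
        # One pass over the database counts all surviving candidates
        counts = {c: sum(c <= t for t in transactions) for c in Ck1}
        Lk = {c for c, n in counts.items() if n >= minsup_count}
        frequent.update({c: counts[c] for c in Lk})
        k += 1
    return frequent

T = [{"Bread", "Milk"},
     {"Bread", "Diaper", "Beer", "Eggs"},
     {"Milk", "Diaper", "Beer", "Coke"},
     {"Bread", "Milk", "Diaper", "Beer"},
     {"Bread", "Milk", "Diaper", "Coke"}]
for itemset, count in sorted(apriori(T, 3).items(), key=lambda kv: len(kv[0])):
    print(sorted(itemset), count)
```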

Apriori Algorithm Example (s = 50%)

Database D (minimum support count = 2):

TID  Items
100  1 3 4
200  2 3 5
300  1 2 3 5
400  2 5

Scan D for C1: {1} 2, {2} 3, {3} 3, {4} 1, {5} 3

L1: {1} 2, {2} 3, {3} 3, {5} 3

C2 from L1; scan D: {1 2} 1, {1 3} 2, {1 5} 1, {2 3} 2, {2 5} 3, {3 5} 2

L2: {1 3} 2, {2 3} 2, {2 5} 3, {3 5} 2

C3 from L2: {2 3 5}; scan D → L3: {2 3 5} 2

Algorithm to Guess Itemsets

• Naïve way:
  – Extend all itemsets with all possible items
• More sophisticated:
  – Join Lk-1 with itself, adding only a single, final item
    e.g.: {1 2 3}, {1 2 4}, {1 3 4}, {1 3 5}, {2 3 4} produces {1 2 3 4} and {1 3 4 5}
  – Remove itemsets with an unsupported subset
    e.g.: {1 3 4 5} has an unsupported subset, {1 4 5}, if minsup = 50%
  – Use the database to further refine Ck
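Reusing the apriori function sketched after the pseudo-code above, this worked example can be replayed (s = 50% of 4 transactions, i.e. a support count of 2):

```python
D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
for itemset, count in sorted(apriori(D, 2).items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
# Ends with the single frequent 3-itemset: [2, 3, 5] 2
```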

Apriori: How to Generate Candidates?

• STEP 1 (self-join operation): join Lk-1 with itself, merging pairs of (k-1)-itemsets that agree on all but their last item
• STEP 2 (subset filtering): delete every candidate that has an infrequent (k-1)-subset

How to Count Supports of Candidates?

• Why is counting the supports of candidates a problem?
  – The total number of candidates can be very large
  – One transaction may contain many candidates
• Method:
  – Candidate itemsets are stored in a hash-tree
  – A leaf node of the hash-tree contains a list of itemsets and counts
  – An interior node contains a hash table
  – Subset function: finds all the candidates contained in a transaction
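A full hash-tree is more machinery than fits here; the sketch below keeps only its central idea in a much-simplified form (my own stand-in, not the slides' data structure): bucket candidates by their smallest item, so a transaction is matched only against buckets keyed by items it actually contains, rather than against every candidate.

```python
from collections import defaultdict

def count_supports(candidates, transactions):
    """Simplified stand-in for hash-tree counting: bucket candidates by
    their smallest item; probe only the buckets keyed by items in t."""
    buckets = defaultdict(list)
    for c in candidates:
        buckets[min(c)].append(frozenset(c))
    counts = defaultdict(int)
    for t in transactions:
        for item in t:                       # probe only relevant buckets
            for c in buckets.get(item, []):
                if c <= t:
                    counts[c] += 1
    return dict(counts)

T = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
C2 = [(1, 2), (1, 3), (1, 5), (2, 3), (2, 5), (3, 5)]
print(count_supports(C2, T))   # e.g. {2, 5} is counted 3 times
```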


Example of Generating Candidate Itemsets

• L3 = {abc, abd, acd, ace, bcd}
• Self-joining: L3 * L3
  – abcd from abc and abd
  – acde from acd and ace
• Pruning based on the Apriori principle:
  – acde is removed because ade is not in L3
• C4 = {abcd} (see the sketch after the next section)

Run Time of Apriori

• k passes over the data, where k is the size of the largest candidate itemset
• Memory chunking algorithm ⇒ 2 passes over the data on disk, but multiple passes in memory
• Toivonen 1996 gives a statistical technique which requires 1 + ε passes (but more memory)
• Brin 1997, Dynamic Itemset Counting ⇒ 1 + ε passes (less memory)
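A small sketch of the join-and-prune step from the candidate-generation example above (apriori_gen is my name for it; itemsets are represented as sorted tuples):

```python
from itertools import combinations

def apriori_gen(Lk):
    """Join-and-prune candidate generation.
    Lk: set of frequent k-itemsets, each a sorted tuple of items."""
    k = len(next(iter(Lk)))
    # Join step: merge two k-itemsets that agree on their first k-1 items
    joined = {a[:-1] + tuple(sorted((a[-1], b[-1])))
              for a in Lk for b in Lk
              if a[:-1] == b[:-1] and a[-1] < b[-1]}
    # Prune step: keep a candidate only if all of its k-subsets are frequent
    return {c for c in joined
            if all(s in Lk for s in combinations(c, k))}

L3 = {('a', 'b', 'c'), ('a', 'b', 'd'), ('a', 'c', 'd'),
      ('a', 'c', 'e'), ('b', 'c', 'd')}
print(apriori_gen(L3))   # {('a','b','c','d')}: acde is pruned since ade is not in L3
```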

Methods to Improve Apriori's Efficiency

• Hash-based itemset counting: a k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent
• Transaction reduction: a transaction that does not contain any frequent k-itemset is useless in subsequent scans
• Partitioning: any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
• Sampling: mining on a subset of the given data
  – requires a lower support threshold
  – and a method to determine the completeness of the result
• Dynamic itemset counting: add new candidate itemsets only when all of their subsets are estimated to be frequent

Is Apriori Fast Enough? — Performance Bottlenecks

• The core of the Apriori algorithm:
  – Use frequent (k-1)-itemsets to generate candidate frequent k-itemsets
  – Use database scans and pattern matching to collect counts for the candidate itemsets
• The bottleneck of Apriori: candidate generation
  – Huge candidate sets:
    • 10^4 frequent 1-itemsets will generate about 10^7 candidate 2-itemsets
    • To discover a frequent pattern of size 100, e.g. {a1, a2, …, a100}, one needs to generate 2^100 ≈ 10^30 candidates
  – Multiple scans of the database:
    • Needs (n + 1) scans, where n is the length of the longest pattern
