ICS 2408 - Lecture 5 - Association
Correlations
Basic concepts
Efficient and scalable frequent itemset mining methods
Constraint-based association mining
Take action:
Store layouts
Targeted advertising
Floor planning
Inventory control
Data set D:

TID    Itemsets
T100   1, 3, 4
T200   2, 3, 5
T300   1, 2, 3, 5
T400   2, 5

|D| = 4
Count(1 3) = 2
Support(1 3) = 2/4 = 0.5
Support(3 ⇒ 2) = 0.5
Confidence(3 ⇒ 2) = 2/3 ≈ 0.67
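As a quick check of these numbers, here is a minimal Python sketch (the data is the toy set D above; the function names count, support and confidence are chosen only for this illustration):

# Toy data set D from the example above (items encoded as integers).
D = [
    {1, 3, 4},      # T100
    {2, 3, 5},      # T200
    {1, 2, 3, 5},   # T300
    {2, 5},         # T400
]

def count(itemset, transactions):
    # Number of transactions that contain every item of `itemset`.
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    # Fraction of transactions containing `itemset`.
    return count(itemset, transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    # support(A ∪ B) / support(A) for the rule A ⇒ B.
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

print(count({1, 3}, D))                    # 2
print(support({1, 3}, D))                  # 0.5
print(support({2, 3}, D))                  # 0.5
print(round(confidence({3}, {2}, D), 2))   # 0.67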
Frequent itemsets
If we have all frequently occurring sets of items (frequent itemsets), we can compute support and confidence!
Any subset of a frequent itemset must be frequent.
If {beer, diaper, nuts} is frequent, so is {beer, diaper}.
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
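The loop above translates almost directly into code. A minimal Python sketch follows (the function name apriori and the absolute min_support threshold are choices made for this illustration, not part of the slide):

from itertools import combinations

def apriori(transactions, min_support):
    # Return {itemset: support count} for all frequent itemsets.
    # transactions: list of sets of items; min_support: absolute count.

    # L1: frequent 1-itemsets
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Lk = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(Lk)

    k = 1
    while Lk:
        # Candidate generation: join pairs of frequent k-itemsets, then
        # prune candidates with an infrequent k-subset (Apriori property).
        Ck1 = set()
        for a, b in combinations(list(Lk), 2):
            union = a | b
            if len(union) == k + 1 and all(
                    frozenset(sub) in Lk for sub in combinations(union, k)):
                Ck1.add(union)

        # Count candidates contained in each transaction.
        counts = {c: 0 for c in Ck1}
        for t in transactions:
            for c in Ck1:
                if c <= t:
                    counts[c] += 1

        Lk = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(Lk)
        k += 1

    return frequent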
The Apriori Algorithm—An Example
Supmin = 2

Database TDB:
Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

1st scan → C1:
{A}: 2, {B}: 3, {C}: 3, {D}: 1, {E}: 3

L1 (candidates with sup ≥ 2):
{A}: 2, {B}: 3, {C}: 3, {E}: 3

C2 (generated from L1):
{A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}

2nd scan → C2 with counts:
{A, B}: 1, {A, C}: 2, {A, E}: 1, {B, C}: 2, {B, E}: 3, {C, E}: 2

L2:
{A, C}: 2, {B, C}: 2, {B, E}: 3, {C, E}: 2

C3 (generated from L2):
{B, C, E}

3rd scan → L3:
{B, C, E}: 2
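Running the apriori sketch from above on this TDB with min_support = 2 reproduces the same L1, L2 and L3:

TDB = [
    {'A', 'C', 'D'},       # Tid 10
    {'B', 'C', 'E'},       # Tid 20
    {'A', 'B', 'C', 'E'},  # Tid 30
    {'B', 'E'},            # Tid 40
]

result = apriori(TDB, min_support=2)
for itemset, sup in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(set(itemset), sup)
# 1-itemsets: {A}: 2, {B}: 3, {C}: 3, {E}: 3
# 2-itemsets: {A, C}: 2, {B, C}: 2, {B, E}: 3, {C, E}: 2
# 3-itemset:  {B, C, E}: 2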
Important Details of Apriori
How to generate candidates?
Step 1: self-joining Lk
Step 2: pruning
How to count supports of candidates?
Example of candidate generation:
L3={abc, abd, acd, ace, bcd}
Self-joining: L3*L3
abcd from abc and abd
acde from acd and ace
Pruning:
acde is removed because ade is not in L3
C4={abcd}
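A small Python sketch of this join-and-prune step on the same L3 (itemsets are written as sorted strings such as "abc" purely for readability; this is an illustration, not the lecture's code):

from itertools import combinations

L3 = ["abc", "abd", "acd", "ace", "bcd"]   # each itemset as a sorted string
k = 3

# Self-join: merge two k-itemsets that agree on their first k-1 items.
joined = sorted({a + b[-1] for a in L3 for b in L3
                 if a < b and a[:k - 1] == b[:k - 1]})
# joined == ['abcd', 'acde']  (abcd from abc+abd, acde from acd+ace)

# Prune: drop candidates that contain an infrequent k-subset.
C4 = [c for c in joined
      if all(''.join(sub) in L3 for sub in combinations(c, k))]
print(C4)   # ['abcd']  (acde is removed because 'ade' is not in L3)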
Step 2: Generating rules from frequent itemsets
A B is an association rule if
Confidence(A B) ≥ minconf,
support(A B) = support(AB) = support(X)
confidence(A B) = support(A B) / support(A)
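Given all frequent itemsets and their support counts (for example the dictionary returned by the apriori sketch earlier), rule generation is a filter over the possible splits of each itemset. A minimal sketch (generate_rules and minconf are names chosen for this example):

from itertools import combinations

def generate_rules(frequent, minconf):
    # Yield (antecedent, consequent, confidence) for every rule A ⇒ B
    # where A ∪ B is frequent and confidence(A ⇒ B) >= minconf.
    # `frequent` maps frozensets to their support counts.
    for itemset, sup_ab in frequent.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                # Every subset of a frequent itemset is itself frequent,
                # so its count is guaranteed to be in `frequent`.
                conf = sup_ab / frequent[antecedent]
                if conf >= minconf:
                    yield antecedent, itemset - antecedent, conf

For instance, generate_rules(apriori(TDB, 2), 0.7) would include B ⇒ E and E ⇒ B, each with confidence 3/3 = 1.0.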
Advantages of Apriori:
Uses large itemset property.
Easily parallelized
Easy to implement.
Disadvantages of Apriori:
Assumes transaction database is memory resident.
Step 1: FP-Tree Construction
Pass 1:
– Scan data and find the support for each item.
– Discard infrequent items.
– Sort frequent items in decreasing order based on their support.
– Use this order when building the FP-tree, so common prefixes can be shared.
Pass 2:
– Nodes correspond to items and have a counter.
– FP-Growth reads one transaction at a time and maps it to a path in the tree.
– The size of the FP-tree depends on how the items are ordered.
– Ordering by decreasing support is typically used, but it does not always lead to the smallest tree (it is a heuristic).
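As a concrete illustration of the two passes, here is a minimal FP-tree construction sketch in Python (class and variable names such as FPNode, build_fp_tree and header are chosen for this example; the header table of node-links is kept very simple):

class FPNode:
    def __init__(self, item, parent):
        self.item = item        # the item this node represents (None for the root)
        self.count = 1          # counter, incremented when paths overlap
        self.parent = parent
        self.children = {}      # item -> FPNode

def build_fp_tree(transactions, min_support):
    # Pass 1: count item supports, keep only frequent items,
    # and fix a decreasing-support order.
    support = {}
    for t in transactions:
        for item in t:
            support[item] = support.get(item, 0) + 1
    frequent = {i: s for i, s in support.items() if s >= min_support}
    order = {i: rank for rank, i in
             enumerate(sorted(frequent, key=lambda i: -frequent[i]))}

    # Pass 2: insert each transaction as a path, reusing shared prefixes.
    root = FPNode(None, None)
    header = {}                 # item -> list of nodes (the node-links)
    for t in transactions:
        items = sorted((i for i in t if i in frequent), key=order.get)
        node = root
        for item in items:
            if item in node.children:
                node.children[item].count += 1
            else:
                node.children[item] = FPNode(item, node)
                header.setdefault(item, []).append(node.children[item])
            node = node.children[item]
    return root, header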
Step 2: Frequent Itemset Generation
FP-Growth extracts frequent itemsets from the FP-tree.
Bottom-up algorithm - from the leaves towards the root
Divide and conquer: first look for frequent itemsets ending in e, then de, etc., then d, then cd, etc.
First, extract prefix path sub-trees ending in an item(set). (hint: use
the linked lists)
Prefix path sub-trees (Example)
Each prefix path sub-tree is processed
recursively to extract the frequent itemsets.
Solutions are then merged.
E.g. the prefix path sub-tree for e will be used to extract the frequent itemsets ending in e.
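A sketch of this prefix-path extraction, using the node-links (header) produced by the build_fp_tree sketch above (a simplified illustration; the recursive mining of the resulting conditional pattern base is omitted):

def prefix_paths(item, header):
    # Conditional pattern base for `item`: for every node on the item's
    # node-link, walk up towards the root and record the prefix path
    # together with that node's count.
    paths = []
    for node in header.get(item, []):
        path = []
        parent = node.parent
        while parent is not None and parent.item is not None:
            path.append(parent.item)
            parent = parent.parent
        if path:
            paths.append((list(reversed(path)), node.count))
    return paths

# e.g. prefix_paths('E', header) lists the prefix paths ending in E,
# which are then mined recursively for the frequent itemsets ending in E.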
Compactness
Reduces irrelevant information: infrequent items are gone.
Items appear in frequency-descending order, so the more frequently an item occurs, the more likely its prefix is to be shared.
The FP-tree is never larger than the original database (not counting node-links and counts).
Advantages/Disadvantages of FP growth
Advantages of FP-Growth
Only 2 passes over the data set
"Compresses" the data set
No candidate generation
Much faster than Apriori
Disadvantages of FP-Growth
FP-Tree may not fit in memory!!
FP-Tree is expensive to build
Constraint-based (Query-Directed) Mining