Chapter06 (Frequent Patterns)
— Chapter 6 —
Basic Concepts
Evaluation Methods
Summary
What Is Frequent Pattern Analysis?
Frequent pattern: a pattern (a set of items, subsequences, substructures,
etc.) that occurs frequently in a data set
First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context
of frequent itemsets and association rule mining
Motivation: Finding inherent regularities in data
What products were often purchased together?— Beer and diapers?!
What are the subsequent purchases after buying a PC?
What kinds of DNA are sensitive to this new drug?
Can we automatically classify web documents?
Applications
Basket data analysis, cross-marketing, catalog design, sale campaign
analysis, Web log (click stream) analysis, and DNA sequence analysis.
Why Is Freq. Pattern Mining Important?
Freq. pattern: An intrinsic and important property of
datasets
Foundation for many essential data mining tasks
Association, correlation, and causality analysis
Broad applications
Basic Concepts: Frequent Patterns
Basic Concepts: Association Rules
Tid   Items bought
10    Butter, Nuts, Diaper
20    Butter, Coffee, Diaper
30    Butter, Diaper, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Diaper, Eggs, Milk

Find all the rules X => Y with minimum support and confidence
  support, s: probability that a transaction contains X ∪ Y
  confidence, c: conditional probability that a transaction having X also contains Y
(Figure: customers buying X, customers buying Y, and customers buying both.)

Let minsup = 50%, minconf = 50%
Freq. Pat.: Butter:3, Nuts:3, Diaper:4, Eggs:3, {Butter, Diaper}:3
Association rules: (many more!)
  Butter => Diaper (60%, 100%)
  Diaper => Butter (60%, 75%)
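As a quick check of these numbers, here is a minimal Python sketch that recomputes the support and confidence of the two rules on the five transactions above (the helper names support and confidence are illustrative, not from the text):

    transactions = [
        {"Butter", "Nuts", "Diaper"},
        {"Butter", "Coffee", "Diaper"},
        {"Butter", "Diaper", "Eggs"},
        {"Nuts", "Eggs", "Milk"},
        {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
    ]

    def support(itemset, db):
        # Fraction of transactions containing every item in `itemset`
        return sum(1 for t in db if itemset <= t) / len(db)

    def confidence(X, Y, db):
        # P(Y | X): among transactions containing X, the fraction that also contain Y
        return support(X | Y, db) / support(X, db)

    print(support({"Butter", "Diaper"}, transactions))       # 0.6  (60%)
    print(confidence({"Butter"}, {"Diaper"}, transactions))  # 1.0  (100%)
    print(confidence({"Diaper"}, {"Butter"}, transactions))  # 0.75 (75%)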
Interesting association rules
Basic Concepts
Evaluation Methods
Summary
Scalable Frequent Itemset Mining Methods
Apriori: A Candidate Generation & Test Approach
Mining by Exploring Vertical Data Format
The Downward Closure Property and Scalable
Mining Methods
The downward closure property of frequent patterns
  Any subset of a frequent itemset must be frequent
  If {beer, diaper, nuts} is frequent, so is {beer, diaper},
  i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper}
Apriori: A Candidate Generation & Test Approach
The Apriori Algorithm—An Example: generating all frequent itemsets (Supmin = 2)

Database TDB
Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

1st scan -> C1:  {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
L1 (sup >= 2):   {A}:2, {B}:3, {C}:3, {E}:3

C2 (generated from L1): {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
2nd scan -> C2 counts:  {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
L2 (sup >= 2):          {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2

C3 (generated from L2): {B,C,E}
3rd scan -> L3:         {B,C,E}:2

C4 = { }. Algorithm terminates.
What are the association rules for the frequent itemset {B, C, E} found above?
Steps
  Find all non-empty proper subsets X of the itemset
  Compute the confidence of each rule X => (itemset - X)
  Select all rules that satisfy min. confidence

Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

Example: {B, C} => {E}, conf = sup({B, C, E}) / sup({B, C}) = 2/2 = 100%
Similarly, find the confidence for the other rules and select those which satisfy minconf.
Suppose minconf = 60%. What are the strong rules that you can select?
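A small sketch of this rule-generation step, assuming the four transactions above and the frequent itemset {B, C, E}; the helper name sup_count is illustrative:

    from itertools import combinations

    # Transactions from the slide (Tid 10-40)
    db = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]

    def sup_count(itemset):
        return sum(1 for t in db if itemset <= t)

    freq = frozenset({"B", "C", "E"})  # frequent itemset with support count 2
    minconf = 0.6

    # Enumerate every non-empty proper subset X and test the rule X => (freq - X)
    for r in range(1, len(freq)):
        for X in map(frozenset, combinations(freq, r)):
            conf = sup_count(freq) / sup_count(X)
            status = "strong" if conf >= minconf else "weak"
            print(f"{set(X)} => {set(freq - X)}: conf = {conf:.0%} ({status})")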
Example 6.3
MinSupport = 2
The Apriori Algorithm (Pseudo-Code)
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
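For reference, a self-contained Python sketch of this level-wise loop (the function name apriori and the data layout are assumptions for illustration, not part of the text); run on the TDB example with Supmin = 2 it reproduces L1, L2, and L3:

    from itertools import combinations

    def apriori(transactions, min_support):
        """Level-wise Apriori; returns {frozenset(itemset): support count}."""
        transactions = [frozenset(t) for t in transactions]

        # 1st scan: count 1-itemsets and keep those meeting min_support (L1)
        counts = {}
        for t in transactions:
            for item in t:
                c = frozenset([item])
                counts[c] = counts.get(c, 0) + 1
        Lk = {c: s for c, s in counts.items() if s >= min_support}
        frequent = dict(Lk)

        k = 1
        while Lk:
            # Candidate generation: join Lk with itself, keep (k+1)-itemsets
            # whose every k-subset is frequent (downward closure pruning)
            candidates = set()
            prev = list(Lk)
            for i in range(len(prev)):
                for j in range(i + 1, len(prev)):
                    union = prev[i] | prev[j]
                    if len(union) == k + 1 and all(
                        frozenset(sub) in Lk for sub in combinations(union, k)
                    ):
                        candidates.add(union)

            # Scan the database once to count the surviving candidates
            counts = {c: 0 for c in candidates}
            for t in transactions:
                for c in candidates:
                    if c <= t:
                        counts[c] += 1

            Lk = {c: s for c, s in counts.items() if s >= min_support}
            frequent.update(Lk)
            k += 1

        return frequent

    # The TDB example from the slides (Supmin = 2)
    tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
    for itemset, sup in sorted(apriori(tdb, 2).items(),
                               key=lambda x: (len(x[0]), sorted(x[0]))):
        print(set(itemset), sup)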
Scalable Frequent Itemset Mining Methods
Exercise
Find all frequent itemsets using the Apriori algorithm
and generate all association rules
(assume minsup = 20%, minconf = 50%)
Partition: Scan Database Only Twice
Any itemset that is potentially frequent in DB must be
frequent in at least one of the partitions of DB
Scan 1: partition database and find local frequent
patterns
Scan 2: consolidate global frequent patterns
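One possible shape of this two-scan procedure, sketched in Python; partition_mine and the mine_local parameter are illustrative names, and any local miner (for example the apriori sketch above) can be plugged in:

    from math import ceil

    def partition_mine(transactions, min_support_ratio, n_partitions, mine_local):
        """Two-scan Partition method (sketch).

        mine_local(partition, local_min_count) must return the itemsets that are
        frequent within that partition, e.g. the apriori() sketch shown earlier.
        """
        transactions = [frozenset(t) for t in transactions]
        size = ceil(len(transactions) / n_partitions)

        # Scan 1: mine each partition with a proportionally scaled threshold;
        # any globally frequent itemset must be locally frequent in some partition.
        candidates = set()
        for start in range(0, len(transactions), size):
            part = transactions[start:start + size]
            local_min = ceil(min_support_ratio * len(part))
            candidates |= set(mine_local(part, local_min))

        # Scan 2: one pass over the full database to keep the true global counts.
        global_min = min_support_ratio * len(transactions)
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        return {c: s for c, s in counts.items() if s >= global_min}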
Bottleneck of Frequent-pattern Mining
Construct FP-tree from a Transaction Database
F-list = f-c-a-b-m-p
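A minimal sketch of FP-tree construction under the steps implied here (count items, order each transaction by the F-list, insert with shared prefixes and a header table of node-links); the class and function names are illustrative, not from the text:

    class FPNode:
        def __init__(self, item, parent):
            self.item = item          # item label (None for the root)
            self.count = 1
            self.parent = parent
            self.children = {}        # item -> FPNode

    def build_fp_tree(transactions, min_count):
        # Pass 1: count single items and build the F-list (frequent items,
        # ordered by descending support), e.g. f-c-a-b-m-p on this slide.
        counts = {}
        for t in transactions:
            for item in t:
                counts[item] = counts.get(item, 0) + 1
        flist = [i for i, c in sorted(counts.items(), key=lambda x: -x[1])
                 if c >= min_count]
        order = {item: rank for rank, item in enumerate(flist)}

        # Pass 2: insert each transaction with its frequent items sorted in
        # F-list order, so that common prefixes share the same branches.
        root = FPNode(None, None)
        header = {item: [] for item in flist}   # item -> node-links
        for t in transactions:
            items = sorted((i for i in t if i in order), key=order.get)
            node = root
            for item in items:
                if item in node.children:
                    node.children[item].count += 1
                else:
                    child = FPNode(item, node)
                    node.children[item] = child
                    header[item].append(child)
                node = node.children[item]
        return root, header, flist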
Example 6.3
From Conditional Pattern-bases to Conditional FP-trees
For each conditional pattern base, construct the corresponding conditional FP-tree
Benefits of the FP-tree Structure
Completeness
Preserve complete information for frequent pattern
mining
Compactness
Reduce irrelevant info—infrequent items are gone
No candidate generation, no candidate test
Compressed database: FP-tree structure
No repeated scan of entire database
Scalable Frequent Itemset Mining Methods
CHARM: Mining by Exploring Vertical Data Format
Horizontal data format: Transaction-id: itemset
Vertical data format: Item: set of transaction-ids (tid-list)
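A small illustration, reusing the TDB transactions from the Apriori example, of converting the horizontal layout to the vertical one and computing support by intersecting tid-lists:

    # Horizontal layout: tid -> itemset
    horizontal = {
        10: {"A", "C", "D"},
        20: {"B", "C", "E"},
        30: {"A", "B", "C", "E"},
        40: {"B", "E"},
    }

    # Vertical layout: each item keeps the set of transactions containing it
    vertical = {}
    for tid, items in horizontal.items():
        for item in items:
            vertical.setdefault(item, set()).add(tid)

    # Support of an itemset = size of the intersection of its items' tid-lists,
    # so larger itemsets are counted without rescanning the transactions.
    print(vertical["B"] & vertical["E"])                       # {20, 30, 40} -> support 3
    print(len(vertical["B"] & vertical["C"] & vertical["E"]))  # 2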
Basic Concepts
Evaluation Methods
Summary
Interestingness Measure: Correlations (Lift)
play basketball => eat cereal [40%, 66.7%] is misleading
  The overall % of students eating cereal is 75% > 66.7%.
play basketball => not eat cereal [20%, 33.3%] is more accurate,
although with lower support and confidence
Measure of dependent/correlated events: lift
  lift(A, B) = P(A ∪ B) / (P(A) P(B)) = conf(A => B) / P(B)
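The two lift values can be recomputed directly from the percentages on this slide (a quick sketch; the helper name lift is illustrative):

    def lift(conf_x_to_y, p_y):
        # lift(X, Y) = P(X ∪ Y) / (P(X) P(Y)) = conf(X => Y) / P(Y)
        return conf_x_to_y / p_y

    print(lift(0.667, 0.75))  # ~0.89 < 1: basketball and cereal are negatively correlated
    print(lift(0.333, 0.25))  # ~1.33 > 1: basketball and "not cereal" are positively correlated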
Summary