Chapter 4
Mining Frequent Patterns, Associations, and Correlations
What Is Frequent Pattern Analysis?
Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining
Motivation: Finding inherent regularities in data
What products are often purchased together? Beer and diapers?!
What are the subsequent purchases after buying a PC?
What kinds of DNA are sensitive to this new drug?
Can we automatically classify web documents?
Applications
Basket data analysis, cross-marketing, catalog design, sales campaign analysis, Web log (click stream) analysis, and DNA sequence analysis
Basic Concepts: Frequent Patterns and Association Rules
[Figure: overlapping sets of customers who buy beer, customers who buy diapers, and customers who buy both]
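As a concrete illustration of the standard support and confidence definitions, here is a minimal Python sketch (the toy transactions are made up for illustration):

    def support(itemset, transactions):
        # fraction of transactions containing every item in `itemset`
        return sum(itemset <= t for t in transactions) / len(transactions)

    def confidence(a, b, transactions):
        # conf(A => B) = sup(A u B) / sup(A)
        return support(a | b, transactions) / support(a, transactions)

    tdb = [{'beer', 'diaper', 'nuts'}, {'beer', 'diaper'}, {'beer'}, {'diaper', 'milk'}]
    print(confidence({'beer'}, {'diaper'}, tdb))  # 2/3: of the 3 beer transactions, 2 contain diapers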
Mining Frequent Patterns, Associations, and Correlations
Scalable Methods for Mining Frequent Patterns
The downward closure (Apriori) property of frequent patterns:
    Any subset of a frequent itemset must also be frequent
    If {beer, diaper, nuts} is frequent, so is {beer, diaper}
    i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper}
Scalable mining methods: Apriori (candidate generation-and-test) and frequent pattern growth (FP-growth)
Apriori: A Candidate Generation-and-Test Approach
The Apriori Algorithm
Pseudo-code:
    Ck: candidate itemsets of size k
    Lk: frequent itemsets of size k

    L1 = {frequent items};
    for (k = 1; Lk != ∅; k++) do begin
        Ck+1 = candidates generated from Lk;
        for each transaction t in database do
            increment the count of all candidates in Ck+1 that are contained in t
        Lk+1 = candidates in Ck+1 with min_support
    end
    return ∪k Lk;
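A runnable Python sketch of the same level-wise loop (the function name `apriori` is mine, not from the slides; min_support is an absolute count):

    from itertools import combinations

    def apriori(transactions, min_support):
        # pass 1: count single items to get L1
        counts = {}
        for t in transactions:
            for item in t:
                key = frozenset([item])
                counts[key] = counts.get(key, 0) + 1
        Lk = {s for s, c in counts.items() if c >= min_support}
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        k = 1
        while Lk:
            # C(k+1): self-join Lk, prune candidates with an infrequent k-subset
            Ck1 = set()
            for a in Lk:
                for b in Lk:
                    u = a | b
                    if len(u) == k + 1 and all(frozenset(s) in Lk for s in combinations(u, k)):
                        Ck1.add(u)
            # scan the DB once, counting candidates contained in each transaction
            counts = dict.fromkeys(Ck1, 0)
            for t in transactions:
                for c in Ck1:
                    if c <= t:
                        counts[c] += 1
            Lk = {s for s, c in counts.items() if c >= min_support}
            frequent.update((s, counts[s]) for s in Lk)
            k += 1
        return frequent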
The Apriori Algorithm: An Example

sup_min = 2

Database TDB:
    Tid   Items
    10    A, C, D
    20    B, C, E
    30    A, B, C, E
    40    B, E

1st scan → C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
L1: {A}:2, {B}:3, {C}:3, {E}:3

C2 (from L1): {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
2nd scan → counts: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2

C3 (from L2): {B,C,E}
3rd scan → L3: {B,C,E}:2
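This trace can be reproduced with the `apriori` sketch given after the pseudo-code:

    tdb = [{'A','C','D'}, {'B','C','E'}, {'A','B','C','E'}, {'B','E'}]
    print(apriori(tdb, 2))  # includes frozenset({'B','C','E'}): 2, matching L3 above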
How to Generate Candidates?
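The body of this slide did not survive extraction; the standard answer is a self-join of Lk followed by pruning via the downward closure property. A sketch with a commonly used example (the helper name is mine):

    def gen_candidates(Lk, k):
        # self-join Lk, then prune any candidate having an infrequent k-subset
        Ck1 = set()
        for a in Lk:
            for b in Lk:
                u = a | b
                if len(u) == k + 1 and all(u - {x} in Lk for x in u):
                    Ck1.add(u)
        return Ck1

    L3 = {frozenset(s) for s in ('abc', 'abd', 'acd', 'ace', 'bcd')}
    print(gen_candidates(L3, 3))  # only {a,b,c,d}; {a,c,d,e} is pruned since {a,d,e} is not in L3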
Methods to Improve Apriori's Efficiency
Hash-based itemset counting: a k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent
Transaction reduction: a transaction that does not contain any frequent k-itemset is useless in subsequent scans (see the sketch after this list)
Partitioning: any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
Sampling: mine on a subset of the given data with a lowered support threshold, plus a method to verify completeness
Dynamic itemset counting: add new candidate itemsets only when all of their subsets are estimated to be frequent
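A minimal sketch of transaction reduction, assuming the frequent k-itemsets are kept as frozensets (helper name mine):

    def prune_transactions(transactions, Lk):
        # drop transactions containing no frequent k-itemset: they cannot
        # contribute to any (k+1)-itemset count in later scans
        return [t for t in transactions if any(s <= t for s in Lk)]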
Mining Frequent Patterns Without Candidate Generation
Construct FP-tree from a Transaction Database
Completeness
    Preserve complete information for frequent pattern mining
    Never break a long pattern of any transaction
Compactness
    Reduce irrelevant info: infrequent items are gone
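A minimal Python sketch of the two-scan construction (class and function names are mine; min_support is an absolute count):

    class FPNode:
        def __init__(self, item, parent):
            self.item, self.parent = item, parent
            self.count = 0
            self.children = {}

    def build_fp_tree(transactions, min_support):
        # pass 1: count item frequencies
        freq = {}
        for t in transactions:
            for item in t:
                freq[item] = freq.get(item, 0) + 1
        freq = {i: c for i, c in freq.items() if c >= min_support}
        # f-list: frequent items in descending frequency order
        flist = sorted(freq, key=lambda i: -freq[i])
        rank = {item: r for r, item in enumerate(flist)}
        root = FPNode(None, None)
        header = {}  # item -> list of nodes (the node-link header table)
        # pass 2: insert each transaction's frequent items in f-list order
        for t in transactions:
            items = sorted((i for i in t if i in rank), key=lambda i: rank[i])
            node = root
            for item in items:
                if item not in node.children:
                    child = FPNode(item, node)
                    node.children[item] = child
                    header.setdefault(item, []).append(child)
                node = node.children[item]
                node.count += 1
        return root, header, flist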
Partition Patterns and Databases
Frequent patterns can be partitioned into subsets according to the f-list
F-list = f-c-a-b-m-p
    Patterns containing p
    …
    Pattern f
Scaling FP-growth by DB Projection
Partition-based Projection
Parallel projection needs a lot of disk space; partition projection saves it
[Figure: a transaction DB (fcamp, fcabm, fb, cbp, fcamp, …) projected into per-item projected DBs; e.g., the am-proj DB and cm-proj DB each hold prefix paths such as fc and f]
Why Is FP-Growth the Winner?
Divide-and-conquer:
    decompose both the mining task and the DB according to the frequent patterns obtained so far
    leads to focused search of smaller databases
Other factors:
    no candidate generation, no candidate test
    compressed database: FP-tree structure
    no repeated scan of entire database
    basic ops: counting local frequent items and building sub FP-trees; no pattern search and matching
Mining Multiple-Level Association Rules
Items often form hierarchies
Flexible support settings
    Items at the lower level are expected to have lower support (e.g., "milk" occurs more often than "2% milk", which occurs more often than any one brand of 2% milk)
Exploration of shared multi-level mining (sketched below)
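One simple way to share work across levels is to lift transactions to a coarser level with a concept hierarchy and reuse the same single-level miner (the hierarchy and helper below are hypothetical illustrations):

    hierarchy = {'2% milk': 'milk', 'skim milk': 'milk', 'wheat bread': 'bread'}

    def raise_level(transaction, hierarchy):
        # replace each item by its higher-level ancestor, if it has one
        return {hierarchy.get(item, item) for item in transaction}

    print(raise_level({'2% milk', 'wheat bread', 'eggs'}, hierarchy))  # {'milk', 'bread', 'eggs'}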
Multi-level Association: Redundancy Filtering
Static Discretization of Quantitative Attributes
Numeric attributes are discretized such that the confidence or compactness of the rules mined is maximized
2-D quantitative association rules: Aquan1 ∧ Aquan2 ⇒ Acat
Cluster adjacent association rules to form general rules using a 2-D grid
Example:
    age(X, "34-35") ∧ income(X, "30-50K") ⇒ buys(X, "high resolution TV")
Mining Frequent Patterns, Associations, and Correlations
Interestingness Measure: Correlations (Lift)
lift(A, B) = P(A ∪ B) / (P(A) P(B))
lift > 1: positive correlation; lift < 1: negative correlation; lift = 1: independence
Example: lift(B, C) = (2000/5000) / ((3000/5000) × (3750/5000)) = 0.89
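The same computation from raw counts, as a minimal sketch (the counts are those of the example above):

    def lift(n_ab, n_a, n_b, n):
        # lift(A, B) = P(A and B) / (P(A) * P(B)), estimated from counts
        return (n_ab / n) / ((n_a / n) * (n_b / n))

    print(lift(2000, 3000, 3750, 5000))  # ≈ 0.89: negatively correlated despite high co-occurrence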
Are lift and χ2 Good Measures of Correlation?

lift(A, B) = P(A ∪ B) / (P(A) P(B))
all_conf(X) = sup(X) / max_item_sup(X)
coh(X) = sup(X) / |universe(X)|

Contingency table:
                Milk     No Milk   Sum (row)
    Coffee      m, c     ~m, c     c
    No Coffee   m, ~c    ~m, ~c    ~c
    Sum (col.)  m        ~m        Σ

    DB    m, c    ~m, c   m, ~c    ~m, ~c     lift   all-conf   coh    χ2
    A1    1,000   100     100      10,000     9.26   0.91       0.83   9,055
    A2    100     1,000   1,000    100,000    8.44   0.09       0.05   670
    A3    1,000   100     10,000   100,000    9.18   0.09       0.09   8,172
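A sketch computing all three measures from the four cells of the table (function name mine; coherence is taken as sup(X) / |universe(X)|, i.e., the Jaccard-style ratio):

    def measures(mc, nm_c, m_nc, nm_nc):
        # lift, all-confidence, coherence for {m, c} from contingency counts
        n = mc + nm_c + m_nc + nm_nc
        sup_m, sup_c, sup_mc = (mc + m_nc) / n, (mc + nm_c) / n, mc / n
        lift = sup_mc / (sup_m * sup_c)
        all_conf = sup_mc / max(sup_m, sup_c)
        coh = sup_mc / (sup_m + sup_c - sup_mc)  # |universe| = transactions holding m or c
        return lift, all_conf, coh

    print(measures(1000, 100, 100, 10000))  # ≈ (9.26, 0.91, 0.83): row A1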
Which Measures Should Be Used?
lift and χ2 are not good measures for correlations in large transactional DBs
all-conf or coherence could be good measures
Both all-conf and coherence have the downward closure property
Efficient algorithms can be derived for mining correlated patterns
Constraint-based (Query-Directed) Mining
Finding all the patterns in a database autonomously? Unrealistic!
    The patterns could be too many but not focused!
Data mining should be an interactive process
    The user directs what is to be mined using a data mining query language (or a graphical user interface)
Constraint-based mining
    User flexibility: provides constraints on what is to be mined
    System optimization: explores such constraints for efficient mining
Constraints in Data Mining
Data constraint, using SQL-like queries
    find product pairs sold together in stores in Chicago in Dec.'02
Dimension/level constraint
    in relevance to region, price, brand, customer category
Rule (or pattern) constraint
    small sales (price < $10) triggers big sales (sum > $200)
Interestingness constraint
    strong rules: min_support ≥ 3%, min_confidence ≥ 60% (see the filtering sketch below)
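A sketch of applying the interestingness constraint when generating rules from frequent itemsets (names mine; `frequent` maps frozenset → absolute support count, as returned by the earlier `apriori` sketch, so every needed subset is present by downward closure):

    from itertools import combinations

    def constrained_rules(frequent, n, min_sup=0.03, min_conf=0.6):
        # emit rules A => B with support and confidence above the thresholds
        out = []
        for itemset, cnt in frequent.items():
            if len(itemset) < 2 or cnt / n < min_sup:
                continue
            for r in range(1, len(itemset)):
                for a in combinations(itemset, r):
                    a = frozenset(a)
                    conf = cnt / frequent[a]  # conf(A => B) = sup(A ∪ B) / sup(A)
                    if conf >= min_conf:
                        out.append((set(a), set(itemset - a), cnt / n, conf))
        return out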