
Debre Tabor University

Gafat Institute of Technology


Department of Computer Science

Introduction to Data Mining & Warehousing


For 4th-year IT and Computer Science students
Instructor: Habtu Hailu (PhD)

November, 24
Chapter IV

ASSOCIATION RULE MINING

2
Mining Frequent Patterns, Association and
Correlations

 Basic concepts and a road map


 Efficient and scalable frequent itemset
mining methods
 Mining various kinds of association rules
 From association mining to correlation
analysis
 Constraint-based association mining
 Summary

3
What Is Frequent Pattern
Analysis?
 Frequent pattern: a pattern (a set of items, subsequences,
substructures, etc.) that occurs frequently in a data set
 First proposed by Agrawal, Imielinski, and Swami [AIS93] in the
context of frequent itemsets and association rule mining
 Motivation: Finding inherent regularities in data
 What products were often purchased together?— Beer and
diapers?!
 What are the subsequent purchases after buying a PC?
 What kinds of DNA are sensitive to this new drug?
 Can we automatically classify web documents?
 Applications
 Basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, and DNA sequence analysis
4
Basic Concepts: Frequent Patterns and Association
Rules
Transaction-id   Items bought
10               A, B, D
20               A, C, D
30               A, D, E
40               B, E, F
50               B, C, D, E, F

 Itemset X = {x1, …, xk}
 Find all the rules X ⇒ Y with minimum support and confidence
 support, s: probability that a transaction contains X ∪ Y
 confidence, c: conditional probability that a transaction having X also contains Y, i.e. P(Y|X)
   (a small computational sketch follows this slide)

[Figure: Venn diagram of customers who buy beer, customers who buy diapers, and customers who buy both]
5
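
Below is a minimal Python sketch (not from the slides) of how the support and confidence of a rule could be computed over the example transactions above; the helper names support and confidence are illustrative.

# Transactions from the slide (transaction-id -> items bought)
transactions = {
    10: {"A", "B", "D"},
    20: {"A", "C", "D"},
    30: {"A", "D", "E"},
    40: {"B", "E", "F"},
    50: {"B", "C", "D", "E", "F"},
}

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)

def confidence(X, Y):
    """P(Y|X): among transactions containing X, the fraction that also contain Y."""
    return support(X | Y) / support(X)

# Rule {A} -> {D}: support = 3/5 = 0.6, confidence = 3/3 = 1.0
print(support({"A", "D"}), confidence({"A"}, {"D"}))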
Mining Frequent Patterns, Association and
Correlations

 Basic concepts and a road map


 Efficient and scalable frequent itemset
mining methods
 Mining various kinds of association rules
 From association mining to correlation
analysis
 Constraint-based association mining
 Summary

6
Scalable Methods for Mining Frequent
Patterns

 The downward closure property of frequent patterns
 Any subset of a frequent itemset must be frequent
 If {beer, diaper, nuts} is frequent, so is {beer, diaper}
 i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper}

 Scalable mining methods: three major approaches
 Apriori
 Frequent pattern growth (FP-growth)
 Vertical data format approach
7
Apriori: A Candidate Generation-and-Test
Approach

 Apriori pruning principle: If there is any itemset which is infrequent, its superset should not be generated/tested!
 Method:
 Initially, scan DB once to get frequent 1-itemset
 Generate length (k+1) candidate itemsets from
length k frequent itemsets
 Test the candidates against DB
 Terminate when no frequent or candidate set can
be generated

8
The Apriori Algorithm
 Pseudo-code:
Ck: Candidate itemset of size k
Lk : frequent itemset of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
Ck+1 = candidates generated from Lk;
for each transaction t in database do
increment the count of all candidates in Ck+1
that are contained in t
Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;
9
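
As a rough illustration (not part of the original slides), the pseudo-code above could be rendered in Python as follows; it is a sketch for small data sets, with candidate generation folded into the main loop (the join/prune details are spelled out on the later candidate-generation slides). The function name apriori and the representation of itemsets as frozensets are assumptions of this sketch.

from itertools import combinations

def apriori(transactions, min_support):
    """Return all frequent itemsets (as frozensets) with their absolute support counts."""
    def count(cands):
        return {c: sum(1 for t in transactions if c <= t) for c in cands}
    def prune(counts):
        return {c: n for c, n in counts.items() if n >= min_support}

    items = {item for t in transactions for item in t}
    Lk = prune(count({frozenset([i]) for i in items}))    # L1: frequent 1-itemsets
    frequent = dict(Lk)
    k = 1
    while Lk:
        # C(k+1): join Lk with itself, keep (k+1)-itemsets whose k-subsets are all frequent
        candidates = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        candidates = {c for c in candidates
                      if all(frozenset(s) in Lk for s in combinations(c, k))}
        Lk = prune(count(candidates))                      # scan DB, keep candidates with min_support
        frequent.update(Lk)
        k += 1
    return frequent

# Example database TDB from the next slide, with minimum support count 2
tdb = [frozenset("ACD"), frozenset("BCE"), frozenset("ABCE"), frozenset("BE")]
print(apriori(tdb, 2))   # includes frozenset({'B', 'C', 'E'}) with count 2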
The Apriori Algorithm—An
Example
Supmin = 2

Database TDB
Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

1st scan  C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
          L1: {A}:2, {B}:3, {C}:3, {E}:3

C2 (generated from L1): {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
2nd scan  C2 counts: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
          L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2

C3 (generated from L2): {B,C,E}
3rd scan  L3: {B,C,E}:2
10
How to Generate Candidates?

 Suppose the items in Lk-1 are listed in an order


 Step 1: self-joining L(k-1)
insert into Ck
select p.item1, p.item2, …, p.item(k-1), q.item(k-1)
from L(k-1) p, L(k-1) q
where p.item1 = q.item1, …, p.item(k-2) = q.item(k-2), p.item(k-1) < q.item(k-1)
 Step 2: pruning
for all itemsets c in Ck do
for all (k-1)-subsets s of c do
if (s is not in L(k-1)) then delete c from Ck
11
Example of Generating
Candidates
 Example of Candidate-generation
 L3={abc, abd, acd, ace, bcd}
 Self-joining: L3*L3

abcd from abc and abd

acde from acd and ace
 Pruning:
 acde is removed because ade is not in L3
 C4={abcd}

12
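
The self-join and prune steps could look as follows in Python (a sketch, not from the slides), assuming each itemset in L(k-1) is kept as a sorted tuple so the "agree on all but the last item" join condition applies directly; gen_candidates is an illustrative name.

from itertools import combinations

def gen_candidates(L_prev):
    """Generate Ck from L(k-1): self-join on the first k-2 items, then prune."""
    L_prev = sorted(L_prev)          # itemsets as sorted tuples, e.g. ('a', 'b', 'c')
    k_minus_1 = len(L_prev[0])
    Ck = []
    # Step 1: self-join -- p and q agree on all but the last item, and p's last item < q's
    for i, p in enumerate(L_prev):
        for q in L_prev[i + 1:]:
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                Ck.append(p + (q[-1],))
    # Step 2: prune -- drop any candidate with an infrequent (k-1)-subset
    L_set = set(L_prev)
    return [c for c in Ck if all(s in L_set for s in combinations(c, k_minus_1))]

L3 = [tuple("abc"), tuple("abd"), tuple("acd"), tuple("ace"), tuple("bcd")]
print(gen_candidates(L3))   # [('a','b','c','d')]; 'acde' is pruned because 'ade' is not in L3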
Methods to Improve Apriori’s
Efficiency
 Hash-based itemset counting: A k-itemset whose
corresponding hashing bucket count is below the threshold
cannot be frequent
 Transaction reduction: A transaction that does not contain
any frequent k-itemset is useless in subsequent scans
 Partitioning: Any itemset that is potentially frequent in DB
must be frequent in at least one of the partitions of DB
 Sampling: mining on a subset of given data, lower support
threshold + a method to determine the completeness
 Dynamic itemset counting: add new candidate itemsets only
when all of their subsets are estimated to be frequent

13
Mining Frequent Patterns Without Candidate
Generation

 Compress a large database into a compact Frequent-Pattern tree (FP-tree) structure
 highly condensed, but complete for frequent
pattern mining
 avoid costly database scans
 Develop an efficient, FP-tree-based frequent pattern
mining method
 A divide-and-conquer methodology: decompose
mining tasks into smaller ones
 Avoid candidate generation: sub-database test
only!

14
Construct FP-tree from a Transaction
Database

min_support = 3

TID   Items bought               (Ordered) frequent items
100   {f, a, c, d, g, i, m, p}   {f, c, a, m, p}
200   {a, b, c, f, l, m, o}      {f, c, a, b, m}
300   {b, f, h, j, o, w}         {f, b}
400   {b, c, k, s, p}            {c, b, p}
500   {a, f, c, e, l, p, m, n}   {f, c, a, m, p}

1. Scan DB once, find frequent 1-itemsets (single item patterns)
2. Sort frequent items in frequency-descending order to get the f-list
   Header table: f:4, c:4, a:3, b:3, m:3, p:3   (F-list = f-c-a-b-m-p)
3. Scan DB again, insert each ordered transaction to construct the FP-tree

[Figure: the resulting FP-tree, rooted at {}. One branch holds f:4 with children c:3 (then a:3, under which m:2 -> p:2 and b:1 -> m:1) and b:1; a second branch holds c:1 -> b:1 -> p:1. Header-table node-links connect the occurrences of each item.]
15
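
A compact Python sketch of this two-pass construction (illustrative only; class and function names are assumptions of the sketch, and ties in the f-list may come out in a different order than the slide's f-c-a-b-m-p):

from collections import Counter, defaultdict

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_support):
    # Pass 1: count single items and build the frequency-descending f-list
    freq = Counter(item for t in transactions for item in t)
    f_list = [i for i, c in freq.most_common() if c >= min_support]
    rank = {item: r for r, item in enumerate(f_list)}

    root = Node(None, None)
    header = defaultdict(list)                  # item -> node-links (list of tree nodes)

    # Pass 2: insert each transaction with its frequent items reordered by the f-list
    for t in transactions:
        ordered = sorted((i for i in t if i in rank), key=rank.get)
        node = root
        for item in ordered:
            if item not in node.children:
                child = Node(item, node)
                node.children[item] = child
                header[item].append(child)      # maintain the header-table node-link
            node = node.children[item]
            node.count += 1
    return root, header, f_list

tdb = [set("facdgimp"), set("abcflmo"), set("bfhjow"), set("bcksp"), set("afcelpmn")]
root, header, f_list = build_fp_tree(tdb, min_support=3)
print(f_list)   # the six frequent items f, c, a, b, m, p (tie order may vary)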
Benefits of the FP-tree Structure

 Completeness
 Preserve complete information for frequent

pattern mining
 Never break a long pattern of any transaction

 Compactness
 Reduce irrelevant info—infrequent items are gone

 Items in frequency descending order: the more

frequently occurring, the more likely to be shared


 Never be larger than the original database (not counting node-links and the count field)

16
Partition Patterns and
Databases
 Frequent patterns can be partitioned into subsets
according to f-list
 F-list=f-c-a-b-m-p

 Patterns containing p

 Patterns having m but no p

 …

 Patterns having c but none of a, b, m, p

 Pattern f

 Completeness and non-redundancy

17
Scaling FP-growth by DB
Projection

 FP-tree cannot fit in memory?—DB projection


 First partition a database into a set of projected
DBs
 Then construct and mine FP-tree for each
projected DB
 Parallel projection vs. Partition projection
techniques
 Parallel projection is space costly

18
Partition-based Projection

 Parallel projection needs a lot of disk space
 Partition projection saves it

Tran. DB: fcamp, fcabm, fb, cbp, fcamp

p-proj DB: fcam, cb, fcam      m-proj DB: fcab, fca, fca      b-proj DB: f, cb, …
a-proj DB: fc, …               c-proj DB: f, …                f-proj DB: …

am-proj DB: fc, fc, fc         cm-proj DB: f, f, f            …
19
Why Is FP-Growth the Winner?

 Divide-and-conquer:
 decompose both the mining task and DB
according to the frequent patterns obtained so
far
 leads to focused search of smaller databases
 Other factors
 no candidate generation, no candidate test
 compressed database: FP-tree structure
 no repeated scan of entire database
 basic ops—counting local freq items and building sub FP-tree, no pattern search and matching
20
Mining Multiple-Level Association
Rules
 Items often form hierarchies
 Flexible support settings
 Items at the lower level are expected to have

lower support
 Exploration of shared multi-level mining

Item hierarchy: Milk [support = 10%] at level 1; 2% Milk [support = 6%] and Skim Milk [support = 4%] at level 2

Uniform support:  min_sup = 5% at level 1 and level 2 (Skim Milk, at 4%, would not be frequent)
Reduced support:  min_sup = 5% at level 1, min_sup = 3% at level 2 (both level-2 items are frequent)
21
Multi-level Association: Redundancy
Filtering

 Some rules may be redundant due to “ancestor”


relationships between items.
 Example
 milk ⇒ wheat bread [support = 8%, confidence = 70%]
 2% milk ⇒ wheat bread [support = 2%, confidence = 72%]
 We say the first rule is an ancestor of the second rule.
 A rule is redundant if its support is close to the
“expected” value, based on the rule’s ancestor.
22
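
As a rough worked check (assuming, purely for illustration, that 2% milk accounts for about one quarter of all milk sold): the expected support of the second rule based on its ancestor would be roughly 8% × 1/4 = 2%. Since the rule's actual support (2%) and confidence (72% vs. 70%) are close to these expected values, it carries no extra information and can be filtered out as redundant.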
Mining Multi-Dimensional
Association
 Single-dimensional rules:
buys(X, "milk") ⇒ buys(X, "bread")
 Multi-dimensional rules: ≥ 2 dimensions or predicates
 Inter-dimension assoc. rules (no repeated predicates)
age(X, "19-25") ∧ occupation(X, "student") ⇒ buys(X, "coke")
 Hybrid-dimension assoc. rules (repeated predicates)
age(X, "19-25") ∧ buys(X, "popcorn") ⇒ buys(X, "coke")
 Categorical attributes: finite number of possible values, no ordering among values—data cube approach
23
Mining Quantitative Associations
 Techniques can be categorized by how numerical attributes, such as age or salary, are treated
1. Static discretization based on predefined concept
hierarchies (data cube methods)
2. Dynamic discretization based on data distribution
(quantitative rules)
3. Clustering: Distance-based association
 one dimensional clustering then association
4. Deviation
Sex = female => Wage: mean=$7/hr (overall mean = $9)

24
Static Discretization of
Quantitative Attributes

 Discretized prior to mining using concept hierarchies.
 Numeric values are replaced by ranges.
 In a relational database, finding all frequent k-predicate sets requires k or k+1 table scans.
 A data cube is well suited for mining.
 The cells of an n-dimensional cuboid correspond to the predicate sets.
 Mining from data cubes can be much faster.

[Figure: lattice of cuboids over the dimensions age, income, and buys: () at the apex; (age), (income), (buys); (age, income), (age, buys), (income, buys); (age, income, buys) at the base.]
25
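
A minimal sketch of static discretization in Python (illustrative, not from the slides), assuming a predefined concept hierarchy for age given as fixed bin edges; the bin labels and record fields are assumptions of the example:

import bisect

# Predefined concept hierarchy for age (illustrative bin edges and range labels)
AGE_EDGES  = [20, 30, 40, 50, 60]
AGE_LABELS = ["<20", "20-29", "30-39", "40-49", "50-59", "60+"]

def discretize_age(age):
    """Replace a numeric age with the range it falls into."""
    return AGE_LABELS[bisect.bisect_right(AGE_EDGES, age)]

# Numeric values are replaced by ranges before mining
records = [{"age": 23, "income": "31K-40K", "buys": "laptop"},
           {"age": 45, "income": "41K-50K", "buys": "printer"}]
for r in records:
    r["age"] = discretize_age(r["age"])
print(records)   # age becomes '20-29' and '40-49'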
Quantitative Association
Rules
 Numeric attributes are dynamically discretized
 Such that the confidence or compactness of the rules

mined is maximized
 2-D quantitative association rules: Aquan1 ∧ Aquan2 ⇒ Acat
 Cluster adjacent association rules
to form general rules using a 2-D grid
 Example
age(X, "34-35") ∧ income(X, "30-50K") ⇒ buys(X, "high resolution TV")

26
Mining Frequent Patterns, Association and
Correlations

 Basic concepts and a road map


 Efficient and scalable frequent itemset
mining methods
 Mining various kinds of association rules
 From association mining to correlation
analysis
 Constraint-based association mining
 Summary

27
Interestingness Measure: Correlations
(Lift)

 play basketball ⇒ eat cereal [40%, 66.7%] is misleading
 The overall % of students eating cereal is 75% > 66.7%.
 play basketball ⇒ not eat cereal [20%, 33.3%] is more accurate, although with lower support and confidence
 Measure of dependent/correlated events: lift

lift = P(A ∪ B) / (P(A) P(B))

              Basketball   Not basketball   Sum (row)
Cereal        2000         1750             3750
Not cereal    1000         250              1250
Sum (col.)    3000         2000             5000

lift(B, C)  = (2000/5000) / ((3000/5000) × (3750/5000)) = 0.89
lift(B, ¬C) = (1000/5000) / ((3000/5000) × (1250/5000)) = 1.33
28
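
A short Python check of the lift values above (a sketch; variable names are illustrative), using the counts from the contingency table:

N = 5000
basketball, cereal, not_cereal = 3000, 3750, 1250
basketball_and_cereal = 2000
basketball_and_not_cereal = 1000

def lift(n_ab, n_a, n_b, n=N):
    """lift(A, B) = P(A and B) / (P(A) * P(B))."""
    return (n_ab / n) / ((n_a / n) * (n_b / n))

print(round(lift(basketball_and_cereal, basketball, cereal), 2))          # 0.89: negatively correlated
print(round(lift(basketball_and_not_cereal, basketball, not_cereal), 2))  # 1.33: positively correlated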
Are lift and χ² Good Measures of Correlation?

 "Buy walnuts ⇒ buy milk [1%, 80%]" is misleading if 85% of customers buy milk
 Support and confidence are not good to represent correlations
 So many interestingness measures?

lift = P(A ∪ B) / (P(A) P(B))
all_conf(X) = sup(X) / max_item_sup(X)
coh(X) = sup(X) / |universe(X)|

              Milk       No Milk      Sum (row)
Coffee        m, c       ~m, c        c
No Coffee     m, ~c      ~m, ~c       ~c
Sum (col.)    m          ~m

DB    m, c    ~m, c    m, ~c     ~m, ~c     lift    all-conf   coh    χ²
A1    1000    100      100       10,000     9.26    0.91       0.83   9055
A2    100     1000     1000      100,000    8.44    0.09       0.05   670
A3    1000    100      10,000    100,000    9.18    0.09       0.09   8172
29
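
A small Python sketch (illustrative; the function name is assumed) computing these measures for the 2-itemset {m, c} from the four contingency counts, reproducing row A1:

def measures(mc, notm_c, m_notc, notm_notc):
    """lift, all_conf and coherence for {m, c} from a 2x2 contingency table."""
    n = mc + notm_c + m_notc + notm_notc
    sup_mc = mc / n
    sup_m  = (mc + m_notc) / n
    sup_c  = (mc + notm_c) / n
    lift     = sup_mc / (sup_m * sup_c)
    all_conf = sup_mc / max(sup_m, sup_c)       # sup(X) / max_item_sup(X)
    coh      = mc / (mc + notm_c + m_notc)      # sup(X) / |universe(X)|: transactions containing m or c
    return round(lift, 2), round(all_conf, 2), round(coh, 2)

print(measures(1000, 100, 100, 10000))   # (9.26, 0.91, 0.83), matching row A1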
Which Measures Should Be Used?
 lift and χ² are not good measures for correlations in large transactional DBs
 all-conf or
coherence could be
good measures
 Both all-conf and
coherence have the
downward closure
property
 Efficient algorithms
can be derived for
mining

30
Constraint-based (Query-Directed)
Mining
 Finding all the patterns in a database
autonomously? — unrealistic!
 The patterns could be too many but not

focused!
 Data mining should be an interactive process
 User directs what to be mined using a data

mining query language (or a graphical user


interface)
 Constraint-based mining
 User flexibility: provides constraints on what to

be mined
 System optimization: explores such constraints

for efficient mining—constraint-based mining


31
Constraints in Data Mining

 Knowledge type constraint:


 classification, association, etc.

 Data constraint — using SQL-like queries


 find product pairs sold together in stores in

Chicago in Dec.’02
 Dimension/level constraint
 in relevance to region, price, brand, customer

category
 Rule (or pattern) constraint
 small sales (price < $10) triggers big sales (sum > $200)
 Interestingness constraint
 strong rules: min_support ≥ 3%, min_confidence ≥ 60%

32
