Chapter 4
Mining Frequent Patterns, Associations, and Correlations
What Is Frequent Pattern Analysis?
Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining
Motivation: Finding inherent regularities in data
What products are often purchased together? Beer and diapers?!
What are the subsequent purchases after buying a PC?
What kinds of DNA are sensitive to this new drug?
Can we automatically classify web documents?
Applications
Basket data analysis, cross-marketing, catalog design, sales campaign analysis, Web log (click stream) analysis, and DNA sequence analysis
Basic Concepts: Frequent Patterns and Association Rules
[Figure: overlapping sets of customers who buy beer, customers who buy diapers, and customers who buy both]
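As a concrete illustration of the standard support and confidence definitions, here is a minimal Python sketch (the toy transactions are made up for illustration):

    def support(itemset, transactions):
        # fraction of transactions containing every item in `itemset`
        return sum(itemset <= t for t in transactions) / len(transactions)

    def confidence(a, b, transactions):
        # conf(A => B) = sup(A u B) / sup(A)
        return support(a | b, transactions) / support(a, transactions)

    tdb = [{'beer', 'diaper', 'nuts'}, {'beer', 'diaper'}, {'beer'}, {'diaper', 'milk'}]
    print(confidence({'beer'}, {'diaper'}, tdb))  # 2/3: of the 3 beer transactions, 2 contain diapers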
Mining Frequent Patterns, Associations, and Correlations
Scalable Methods for Mining Frequent Patterns
The downward closure (Apriori) property of frequent patterns:
    Any subset of a frequent itemset must also be frequent
    If {beer, diaper, nuts} is frequent, so is {beer, diaper}
    i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper}
Scalable mining methods: Apriori (candidate generation-and-test) and frequent pattern growth (FP-growth)
Apriori: A Candidate Generation-and-Test Approach
The Apriori Algorithm
Pseudo-code:
    Ck: candidate itemsets of size k
    Lk: frequent itemsets of size k

    L1 = {frequent items};
    for (k = 1; Lk != ∅; k++) do begin
        Ck+1 = candidates generated from Lk;
        for each transaction t in database do
            increment the count of all candidates in Ck+1 that are contained in t
        Lk+1 = candidates in Ck+1 with min_support
    end
    return ∪k Lk;
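A runnable Python sketch of the same level-wise loop (the function name `apriori` is mine, not from the slides; min_support is an absolute count):

    from itertools import combinations

    def apriori(transactions, min_support):
        # pass 1: count single items to get L1
        counts = {}
        for t in transactions:
            for item in t:
                key = frozenset([item])
                counts[key] = counts.get(key, 0) + 1
        Lk = {s for s, c in counts.items() if c >= min_support}
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        k = 1
        while Lk:
            # C(k+1): self-join Lk, prune candidates with an infrequent k-subset
            Ck1 = set()
            for a in Lk:
                for b in Lk:
                    u = a | b
                    if len(u) == k + 1 and all(frozenset(s) in Lk for s in combinations(u, k)):
                        Ck1.add(u)
            # scan the DB once, counting candidates contained in each transaction
            counts = dict.fromkeys(Ck1, 0)
            for t in transactions:
                for c in Ck1:
                    if c <= t:
                        counts[c] += 1
            Lk = {s for s, c in counts.items() if c >= min_support}
            frequent.update((s, counts[s]) for s in Lk)
            k += 1
        return frequent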
The Apriori Algorithm: An Example

sup_min = 2

Database TDB:
    Tid   Items
    10    A, C, D
    20    B, C, E
    30    A, B, C, E
    40    B, E

1st scan → C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
L1: {A}:2, {B}:3, {C}:3, {E}:3

C2 (from L1): {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
2nd scan → counts: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2

C3 (from L2): {B,C,E}
3rd scan → L3: {B,C,E}:2
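This trace can be reproduced with the `apriori` sketch given after the pseudo-code:

    tdb = [{'A','C','D'}, {'B','C','E'}, {'A','B','C','E'}, {'B','E'}]
    print(apriori(tdb, 2))  # includes frozenset({'B','C','E'}): 2, matching L3 above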
How to Generate Candidates?
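The body of this slide did not survive extraction; the standard answer is a self-join of Lk followed by pruning via the downward closure property. A sketch with a commonly used example (the helper name is mine):

    def gen_candidates(Lk, k):
        # self-join Lk, then prune any candidate having an infrequent k-subset
        Ck1 = set()
        for a in Lk:
            for b in Lk:
                u = a | b
                if len(u) == k + 1 and all(u - {x} in Lk for x in u):
                    Ck1.add(u)
        return Ck1

    L3 = {frozenset(s) for s in ('abc', 'abd', 'acd', 'ace', 'bcd')}
    print(gen_candidates(L3, 3))  # only {a,b,c,d}; {a,c,d,e} is pruned since {a,d,e} is not in L3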
Methods to Improve Apriori's Efficiency
Hash-based itemset counting: a k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent
Transaction reduction: a transaction that does not contain any frequent k-itemset is useless in subsequent scans (see the sketch after this list)
Partitioning: any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
Sampling: mine on a subset of the given data with a lowered support threshold, plus a method to verify completeness
Dynamic itemset counting: add new candidate itemsets only when all of their subsets are estimated to be frequent
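A minimal sketch of transaction reduction, assuming the frequent k-itemsets are kept as frozensets (helper name mine):

    def prune_transactions(transactions, Lk):
        # drop transactions containing no frequent k-itemset: they cannot
        # contribute to any (k+1)-itemset count in later scans
        return [t for t in transactions if any(s <= t for s in Lk)]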
Mining Frequent Patterns Without Candidate Generation
Construct FP-tree from a Transaction Database
Completeness
    Preserve complete information for frequent pattern mining
    Never break a long pattern of any transaction
Compactness
    Reduce irrelevant info: infrequent items are gone
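A minimal Python sketch of the two-scan construction (class and function names are mine; min_support is an absolute count):

    class FPNode:
        def __init__(self, item, parent):
            self.item, self.parent = item, parent
            self.count = 0
            self.children = {}

    def build_fp_tree(transactions, min_support):
        # pass 1: count item frequencies
        freq = {}
        for t in transactions:
            for item in t:
                freq[item] = freq.get(item, 0) + 1
        freq = {i: c for i, c in freq.items() if c >= min_support}
        # f-list: frequent items in descending frequency order
        flist = sorted(freq, key=lambda i: -freq[i])
        rank = {item: r for r, item in enumerate(flist)}
        root = FPNode(None, None)
        header = {}  # item -> list of nodes (the node-link header table)
        # pass 2: insert each transaction's frequent items in f-list order
        for t in transactions:
            items = sorted((i for i in t if i in rank), key=lambda i: rank[i])
            node = root
            for item in items:
                if item not in node.children:
                    child = FPNode(item, node)
                    node.children[item] = child
                    header.setdefault(item, []).append(child)
                node = node.children[item]
                node.count += 1
        return root, header, flist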
Partition Patterns and Databases
Frequent patterns can be partitioned into subsets according to the f-list
F-list = f-c-a-b-m-p
    Patterns containing p
    …
    Pattern f
Scaling FP-growth by DB Projection
Partition-based Projection
Parallel projection needs a lot of disk space; partition projection saves it
[Figure: a transaction DB (fcamp, fcabm, fb, cbp, fcamp, …) projected into per-item projected DBs; e.g., the am-proj DB and cm-proj DB each hold prefix paths such as fc and f]
Why Is FP-Growth the Winner?
Divide-and-conquer:
    decompose both the mining task and the DB according to the frequent patterns obtained so far
    leads to focused search of smaller databases
Other factors:
    no candidate generation, no candidate test
    compressed database: FP-tree structure
    no repeated scan of entire database
    basic ops: counting local frequent items and building sub FP-trees; no pattern search and matching
Mining Multiple-Level Association Rules
Items often form hierarchies
Flexible support settings
    Items at the lower level are expected to have lower support (e.g., "milk" occurs more often than "2% milk", which occurs more often than any one brand of 2% milk)
Exploration of shared multi-level mining (sketched below)
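One simple way to share work across levels is to lift transactions to a coarser level with a concept hierarchy and reuse the same single-level miner (the hierarchy and helper below are hypothetical illustrations):

    hierarchy = {'2% milk': 'milk', 'skim milk': 'milk', 'wheat bread': 'bread'}

    def raise_level(transaction, hierarchy):
        # replace each item by its higher-level ancestor, if it has one
        return {hierarchy.get(item, item) for item in transaction}

    print(raise_level({'2% milk', 'wheat bread', 'eggs'}, hierarchy))  # {'milk', 'bread', 'eggs'}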
Multi-level Association: Redundancy Filtering
Static Discretization of Quantitative Attributes
Numeric attributes are discretized such that the confidence or compactness of the rules mined is maximized
2-D quantitative association rules: Aquan1 ∧ Aquan2 ⇒ Acat
Cluster adjacent association rules to form general rules using a 2-D grid
Example:
    age(X, "34-35") ∧ income(X, "30-50K") ⇒ buys(X, "high resolution TV")
Mining Frequent Patterns, Associations, and Correlations
Interestingness Measure: Correlations (Lift)
lift(A, B) = P(A ∪ B) / (P(A) P(B))
lift > 1: positive correlation; lift < 1: negative correlation; lift = 1: independence
Example: lift(B, C) = (2000/5000) / ((3000/5000) × (3750/5000)) = 0.89
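The same computation from raw counts, as a minimal sketch (the counts are those of the example above):

    def lift(n_ab, n_a, n_b, n):
        # lift(A, B) = P(A and B) / (P(A) * P(B)), estimated from counts
        return (n_ab / n) / ((n_a / n) * (n_b / n))

    print(lift(2000, 3000, 3750, 5000))  # ≈ 0.89: negatively correlated despite high co-occurrence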
Are lift and χ2 Good Measures of Correlation?

lift(A, B) = P(A ∪ B) / (P(A) P(B))
all_conf(X) = sup(X) / max_item_sup(X)
coh(X) = sup(X) / |universe(X)|

Contingency table:
                Milk     No Milk   Sum (row)
    Coffee      m, c     ~m, c     c
    No Coffee   m, ~c    ~m, ~c    ~c
    Sum (col.)  m        ~m        Σ

    DB    m, c    ~m, c   m, ~c    ~m, ~c     lift   all-conf   coh    χ2
    A1    1,000   100     100      10,000     9.26   0.91       0.83   9,055
    A2    100     1,000   1,000    100,000    8.44   0.09       0.05   670
    A3    1,000   100     10,000   100,000    9.18   0.09       0.09   8,172
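A sketch computing all three measures from the four cells of the table (function name mine; coherence is taken as sup(X) / |universe(X)|, i.e., the Jaccard-style ratio):

    def measures(mc, nm_c, m_nc, nm_nc):
        # lift, all-confidence, coherence for {m, c} from contingency counts
        n = mc + nm_c + m_nc + nm_nc
        sup_m, sup_c, sup_mc = (mc + m_nc) / n, (mc + nm_c) / n, mc / n
        lift = sup_mc / (sup_m * sup_c)
        all_conf = sup_mc / max(sup_m, sup_c)
        coh = sup_mc / (sup_m + sup_c - sup_mc)  # |universe| = transactions holding m or c
        return lift, all_conf, coh

    print(measures(1000, 100, 100, 10000))  # ≈ (9.26, 0.91, 0.83): row A1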
Which Measures Should Be Used?
lift and χ2 are not good measures for correlations in large transactional DBs
all-conf or coherence could be good measures
Both all-conf and coherence have the downward closure property
Efficient algorithms can be derived for mining correlated patterns
Constraint-based (Query-Directed) Mining
Finding all the patterns in a database autonomously? Unrealistic!
    The patterns could be too many but not focused!
Data mining should be an interactive process
    The user directs what is to be mined using a data mining query language (or a graphical user interface)
Constraint-based mining
    User flexibility: provides constraints on what is to be mined
    System optimization: explores such constraints for efficient mining
Constraints in Data Mining
Data constraint, using SQL-like queries
    find product pairs sold together in stores in Chicago in Dec.'02
Dimension/level constraint
    in relevance to region, price, brand, customer category
Rule (or pattern) constraint
    small sales (price < $10) triggers big sales (sum > $200)
Interestingness constraint
    strong rules: min_support ≥ 3%, min_confidence ≥ 60% (see the filtering sketch below)
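A sketch of applying the interestingness constraint when generating rules from frequent itemsets (names mine; `frequent` maps frozenset → absolute support count, as returned by the earlier `apriori` sketch, so every needed subset is present by downward closure):

    from itertools import combinations

    def constrained_rules(frequent, n, min_sup=0.03, min_conf=0.6):
        # emit rules A => B with support and confidence above the thresholds
        out = []
        for itemset, cnt in frequent.items():
            if len(itemset) < 2 or cnt / n < min_sup:
                continue
            for r in range(1, len(itemset)):
                for a in combinations(itemset, r):
                    a = frozenset(a)
                    conf = cnt / frequent[a]  # conf(A => B) = sup(A ∪ B) / sup(A)
                    if conf >= min_conf:
                        out.append((set(a), set(itemset - a), cnt / n, conf))
        return out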