Association
Market-Basket transactions

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example of Association Rules
{Diaper} → {Beer},
{Milk, Bread} → {Eggs, Coke},
{Beer, Bread} → {Milk}

Implication means co-occurrence, not causality!
Definition: Frequent Itemset

Itemset
– A collection of one or more items
  Example: {Milk, Bread, Diaper}
– k-itemset: an itemset that contains k items

Support count (σ)
– Frequency of occurrence of an itemset
– E.g. σ({Milk, Bread, Diaper}) = 2

Support
– Fraction of transactions that contain an itemset
– E.g. s({Milk, Bread, Diaper}) = 2/5

Frequent Itemset
– An itemset whose support is greater than or equal to a minsup threshold
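A minimal Python sketch of how these two quantities could be computed for the market-basket table above; the transactions are the five from the slide, and helper names such as support_count are my own, not from the slides:

    # Transactions from the slide (TIDs 1-5), each represented as a set of items.
    transactions = [
        {"Bread", "Milk"},
        {"Bread", "Diaper", "Beer", "Eggs"},
        {"Milk", "Diaper", "Beer", "Coke"},
        {"Bread", "Milk", "Diaper", "Beer"},
        {"Bread", "Milk", "Diaper", "Coke"},
    ]

    def support_count(itemset, transactions):
        # sigma(X): number of transactions that contain every item of X.
        return sum(1 for t in transactions if itemset <= t)

    def support(itemset, transactions):
        # s(X): fraction of transactions that contain X.
        return support_count(itemset, transactions) / len(transactions)

    print(support_count({"Milk", "Bread", "Diaper"}, transactions))  # 2
    print(support({"Milk", "Bread", "Diaper"}, transactions))        # 0.4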
Definition: Association Rule

Association Rule
– An implication expression of the form X → Y, where X and Y are itemsets
– Example: {Milk, Diaper} → {Beer}

Rule Evaluation Metrics
– Support (s)
  Fraction of transactions that contain both X and Y
– Confidence (c)
  Measures how often items in Y appear in transactions that contain X

Example (using the transactions above): {Milk, Diaper} → {Beer}
s = σ({Milk, Diaper, Beer}) / |T| = 2/5 = 0.4
c = σ({Milk, Diaper, Beer}) / σ({Milk, Diaper}) = 2/3 ≈ 0.67
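The rule metrics follow the same pattern; a small Python sketch (sigma is my own helper name) that reproduces s = 0.4 and c ≈ 0.67 for {Milk, Diaper} → {Beer}:

    transactions = [
        {"Bread", "Milk"},
        {"Bread", "Diaper", "Beer", "Eggs"},
        {"Milk", "Diaper", "Beer", "Coke"},
        {"Bread", "Milk", "Diaper", "Beer"},
        {"Bread", "Milk", "Diaper", "Coke"},
    ]

    def sigma(itemset):
        # Support count: transactions containing all items of the itemset.
        return sum(1 for t in transactions if itemset <= t)

    X, Y = {"Milk", "Diaper"}, {"Beer"}
    s = sigma(X | Y) / len(transactions)   # rule support: 2/5 = 0.4
    c = sigma(X | Y) / sigma(X)            # rule confidence: 2/3 ≈ 0.67
    print(round(s, 2), round(c, 2))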
Association Rule Mining Task
Brute-force approach:
– List all possible association rules
– Compute the support and confidence for each rule
– Prune rules that fail the minsup and minconf thresholds
Computationally prohibitive!
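To make "computationally prohibitive" concrete, a standard counting argument (not spelled out on the slide, but consistent with it) gives the total number of possible rules over d items:

    R = \sum_{k=1}^{d-1} \binom{d}{k}\left(2^{d-k}-1\right) = 3^d - 2^{d+1} + 1

For the six items of the running example (d = 6) this is already 3^6 - 2^7 + 1 = 602 rules, each of which would need its support and confidence computed in the brute-force approach.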
Mining Association Rules

Example rules derived from the itemset {Milk, Diaper, Beer}:
{Milk, Diaper} → {Beer}, {Milk, Beer} → {Diaper}, {Diaper, Beer} → {Milk},
{Beer} → {Milk, Diaper}, {Diaper} → {Milk, Beer}, {Milk} → {Diaper, Beer}

Observations:
• All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but can have different confidence
• Thus, we may decouple the support and confidence requirements
Mining Association Rules
Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup
2. Rule Generation
– Generate high confidence rules from each frequent itemset,
where each rule is a binary partitioning of a frequent itemset
[Figure: the lattice of candidate itemsets over the items A-E, from the single items through the 2-itemsets and 3-itemsets up to the full itemset ABCDE.]
Apriori principle:
– If an itemset is frequent, then all of its subsets must also
be frequent
∀ X, Y: (X ⊆ Y) ⇒ s(X) ≥ s(Y)
– Support of an itemset never exceeds the support of its
subsets
– This is known as the anti-monotone property of support
Illustrating Apriori Principle

[Figure: the itemset lattice over the items A-E; once an itemset (e.g. AB) is found to be infrequent, all of its supersets, up to ABCDE, are pruned from consideration.]
Illustrating Apriori Principle

TID  Items
1    Bread, Milk
2    Beer, Bread, Diaper, Eggs
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Bread, Coke, Diaper, Milk

Items (1-itemsets)
Item    Count
Bread   4
Coke    2
Milk    4
Beer    3
Diaper  4
Eggs    1

Minimum Support = 3
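The two-step approach can be sketched on exactly this table. The following Python sketch illustrates level-wise frequent itemset generation with a minimum support count of 3; the candidate step here simply unions frequent k-itemsets and then applies Apriori pruning, which is not the precise Fk-1 x Fk-1 procedure discussed next, and all names are my own:

    from itertools import combinations

    transactions = [
        {"Bread", "Milk"},
        {"Beer", "Bread", "Diaper", "Eggs"},
        {"Beer", "Coke", "Diaper", "Milk"},
        {"Beer", "Bread", "Diaper", "Milk"},
        {"Bread", "Coke", "Diaper", "Milk"},
    ]
    MINSUP = 3  # minimum support count, as on the slide

    def support_count(itemset):
        return sum(1 for t in transactions if itemset <= t)

    # Level-wise generation: start from frequent 1-itemsets, then extend.
    items = sorted(set().union(*transactions))
    frequent = {1: [frozenset([i]) for i in items
                    if support_count(frozenset([i])) >= MINSUP]}
    k = 1
    while frequent[k]:
        # Candidate (k+1)-itemsets: unions of frequent k-itemsets.
        candidates = {a | b for a in frequent[k] for b in frequent[k]
                      if len(a | b) == k + 1}
        # Apriori pruning: every k-subset of a candidate must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent[k]
                             for s in combinations(c, k))}
        frequent[k + 1] = [c for c in candidates if support_count(c) >= MINSUP]
        k += 1

    for level, sets in frequent.items():
        if sets:
            print(level, [sorted(s) for s in sets])
    # Prints the four frequent items and the four frequent pairs;
    # no 3-itemset reaches support count 3 on this data.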
Candidate Generation: Fk-1 x Fk-1 Method
Let F3 = {ABC, ABD, ABE, ACD, BCD, BDE, CDE} be the set of frequent 3-itemsets
Merge two frequent 3-itemsets if their first two items are identical:
– Merge(ABC, ABD) = ABCD
– Merge(ABC, ABE) = ABCE
– Merge(ABD, ABE) = ABDE
Candidate pruning
– Prune ABCE because ACE and BCE are infrequent
– Prune ABDE because ADE is infrequent

Alternate Fk-1 x Fk-1 Method
Merge two frequent 3-itemsets if the last two items of the first are identical to the first two items of the second:
F3 = {ABC, ABD, ABE, ACD, BCD, BDE, CDE}
– Merge(ABC, BCD) = ABCD
– Merge(ABD, BDE) = ABDE
– Merge(ACD, CDE) = ACDE
– Merge(BCD, CDE) = BCDE
Candidate Pruning for Alternate Fk-1 x Fk-1 Method
Let F3 = {ABC, ABD, ABE, ACD, BCD, BDE, CDE} be the set of frequent 3-itemsets
– Prune ABDE because ADE is infrequent
– Prune ACDE because ACE and ADE are infrequent
– Prune BCDE because BCE is infrequent
– Only ABCD survives candidate pruning
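A compact Python sketch of the Fk-1 x Fk-1 merge and the subset-based pruning on the F3 above; generate_candidates and prune are my own names, and items are single letters for brevity:

    from itertools import combinations

    # Frequent 3-itemsets from the slide, written as sorted tuples of items.
    F3 = {tuple(s) for s in ["ABC", "ABD", "ABE", "ACD", "BCD", "BDE", "CDE"]}

    def generate_candidates(F_prev, k):
        # Fk-1 x Fk-1 method: merge itemsets whose first k-2 items agree.
        cands = set()
        for a, b in combinations(sorted(F_prev), 2):
            if a[:k - 2] == b[:k - 2]:
                cands.add(tuple(sorted(set(a) | set(b))))
        return cands

    def prune(cands, F_prev, k):
        # Apriori pruning: drop a candidate if any (k-1)-subset is infrequent.
        return {c for c in cands
                if all(sub in F_prev for sub in combinations(c, k - 1))}

    C4 = generate_candidates(F3, k=4)
    print(sorted(C4))            # ABCD, ABCE, ABDE
    print(sorted(prune(C4, F3, 4)))  # only ABCD survives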
Support Counting of Candidate Itemsets

Candidate 3-itemsets whose support must be counted against the transactions:
– {Beer, Diaper, Milk}
– {Beer, Bread, Diaper}
– {Bread, Diaper, Milk}
– {Beer, Bread, Milk}

TID  Items
1    Bread, Milk
2    Beer, Bread, Diaper, Eggs
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Bread, Coke, Diaper, Milk

[Figure: enumeration of all 3-item subsets of the transaction t = {1, 2, 3, 5, 6}, organized level by level (1+ 2356, 2+ 356, 3+ 56 at level 1, then 12+ 356, 13+ 56, and so on, down to the ten 3-subsets such as 123, 125, 135, 356).]
Support Counting Using a Hash Tree

[Figure: a hash tree storing 15 candidate 3-itemsets in its leaves: {2,3,4}, {5,6,7}, {1,4,5}, {1,3,6}, {3,4,5}, {3,5,6}, {3,6,7}, {3,5,7}, {3,6,8}, {1,2,4}, {1,5,9}, {6,8,9}, {1,2,5}, {4,5,7}, {4,5,8}. At every internal node the hash function sends items 1, 4, 7 to the left child, items 2, 5, 8 to the middle child, and items 3, 6, 9 to the right child.]

[Figure: to count supports, the transaction 1 2 3 5 6 is recursively split (1+ 2356, 2+ 356, 3+ 56, then 12+ 356, 13+ 56, 15+ 6, ...) and each piece is routed down the hash tree with the same hash function, so only the leaves reachable from these prefixes are compared against the transaction.]

Match transaction against 11 out of 15 candidates
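For comparison with the hash tree, the sketch below counts candidate supports the straightforward way: it enumerates every 3-subset of a transaction and looks it up among the candidates. The candidate list is the 15 leaves of the hash tree figure and the transaction is the one from the slides; this is only an illustration of the work the hash tree speeds up, not of the hash tree itself:

    from itertools import combinations
    from collections import defaultdict

    # The 15 candidate 3-itemsets stored in the hash tree on the slide.
    candidates = {frozenset(c) for c in [
        (2, 3, 4), (5, 6, 7), (1, 4, 5), (1, 3, 6), (3, 4, 5), (3, 5, 6),
        (3, 6, 7), (3, 5, 7), (3, 6, 8), (1, 2, 4), (1, 5, 9), (6, 8, 9),
        (1, 2, 5), (4, 5, 7), (4, 5, 8)]}

    transactions = [(1, 2, 3, 5, 6)]   # the transaction from the slide

    # Enumerate every 3-subset of each transaction and bump matching candidates.
    counts = defaultdict(int)
    for t in transactions:
        for subset in combinations(sorted(t), 3):
            if frozenset(subset) in candidates:
                counts[frozenset(subset)] += 1

    print({tuple(sorted(c)): n for c, n in counts.items()})
    # {(1, 2, 5): 1, (1, 3, 6): 1, (3, 5, 6): 1}

The hash tree reaches the same counts while comparing the transaction against only 11 of the 15 candidates, because subtrees that the hashed prefixes cannot reach are never visited.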
Rule Generation

Lattice of rules generated from the frequent itemset ABCD:

[Figure: the lattice of rules, from ABCD ⇒ {} at the top through BCD ⇒ A, ACD ⇒ B, ABD ⇒ C, ABC ⇒ D and so on. If a rule such as BCD ⇒ A has low confidence, then every rule below it in the lattice, i.e. every rule obtained by moving further items from its antecedent into its consequent, can be pruned.]
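A hedged Python sketch of the rule-generation step for one frequent itemset, using the market-basket table from earlier; rules_from_itemset and sigma are my own names, and this version simply checks every binary partition rather than exploiting the confidence-based pruning suggested by the lattice:

    from itertools import combinations

    transactions = [
        {"Bread", "Milk"},
        {"Beer", "Bread", "Diaper", "Eggs"},
        {"Beer", "Coke", "Diaper", "Milk"},
        {"Beer", "Bread", "Diaper", "Milk"},
        {"Bread", "Coke", "Diaper", "Milk"},
    ]

    def sigma(itemset):
        return sum(1 for t in transactions if itemset <= t)

    def rules_from_itemset(L, minconf):
        # Every non-empty proper subset X of L yields a candidate rule X -> L - X.
        L = frozenset(L)
        out = []
        for r in range(1, len(L)):
            for X in map(frozenset, combinations(L, r)):
                conf = sigma(L) / sigma(X)
                if conf >= minconf:
                    out.append((sorted(X), sorted(L - X), round(conf, 2)))
        return out

    for rule in rules_from_itemset({"Milk", "Diaper", "Beer"}, minconf=0.6):
        print(rule)
    # e.g. (['Beer'], ['Diaper', 'Milk'], 0.67) and (['Beer', 'Milk'], ['Diaper'], 1.0)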
Illustrating Apriori Principle

With the same six items and Minimum Support = 3:

If every subset up to size 3 is considered:
6C1 + 6C2 + 6C3 = 6 + 15 + 20 = 41 candidate itemsets
(including 4-itemsets: 6C1 + 6C2 + 6C3 + 6C4 = 6 + 15 + 20 + 15 = 56)

With support-based pruning:
6 + 6 + 4 = 16 candidate itemsets
Factors Affecting Complexity of Apriori
Choice of minimum support threshold
– lowering support threshold results in more frequent itemsets
– this may increase number of candidates and max length of
frequent itemsets
Dimensionality (number of items) of the data set
– More space is needed to store support count of itemsets
– if number of frequent itemsets also increases, both computation
and I/O costs may also increase
Size of database
– run time of algorithm increases with number of transactions
Average transaction width
– transaction width increases the max length of frequent itemsets
– number of subsets in a transaction increases with its width,
increasing computation time for support counting
Maximal Frequent Itemset

An itemset is maximal frequent if it is frequent and none of its immediate supersets is frequent.

[Figure: the itemset lattice over the items A-E with the border between frequent and infrequent itemsets drawn; the maximal frequent itemsets are the frequent itemsets adjacent to this border.]
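As a rough check of the definition on the market-basket table from earlier, with minimum support count 3; the brute-force enumeration and the name sigma are mine, not part of the Apriori algorithm itself:

    from itertools import combinations

    transactions = [
        {"Bread", "Milk"},
        {"Beer", "Bread", "Diaper", "Eggs"},
        {"Beer", "Coke", "Diaper", "Milk"},
        {"Beer", "Bread", "Diaper", "Milk"},
        {"Bread", "Coke", "Diaper", "Milk"},
    ]
    MINSUP = 3
    items = sorted(set().union(*transactions))

    def sigma(itemset):
        return sum(1 for t in transactions if itemset <= t)

    # All frequent itemsets (brute force over the small item universe).
    frequent = [frozenset(c)
                for r in range(1, len(items) + 1)
                for c in combinations(items, r)
                if sigma(frozenset(c)) >= MINSUP]

    # Maximal frequent: frequent, and no immediate superset (one extra item) is frequent.
    maximal = [f for f in frequent
               if not any(sigma(f | {i}) >= MINSUP for i in items if i not in f)]
    print([sorted(m) for m in maximal])
    # The four frequent pairs, since no 3-itemset is frequent at this threshold.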
An illustrative example

[Figure: a transaction matrix with items A-J and transactions 1-10; dark cells indicate which items appear in each transaction. The matrix itself is not reproduced here.]

Support threshold (by count): 5
Frequent itemsets: {F}
Maximal itemsets: {F}
Another illustrative example

[Figure: a transaction matrix with items A-J and transactions 1-10; dark cells indicate which items appear in each transaction. The matrix itself is not reproduced here.]

Support threshold (by count): 5
Maximal itemsets: {A}, {B}, {C}

Support threshold (by count): 4
Maximal itemsets: {A,B}, {A,C}, {B,C}
Closed Itemset

An itemset X is closed if none of its immediate supersets has the same support count as X.

[Figure: an itemset lattice over the items A-E annotated with support counts; itemsets such as ABCDE at the bottom are not supported by any transaction.]
Maximal Frequent vs Closed Frequent Itemsets

[Figure: the same support-annotated itemset lattice, with the closed frequent and maximal frequent itemsets marked.]

Number of closed frequent itemsets = 9
Number of maximal frequent itemsets = 4
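The closure condition can be checked the same brute-force way on the earlier market-basket table; this is a sketch with my own names, and a minsup filter would additionally be needed to obtain the closed frequent itemsets:

    from itertools import combinations

    transactions = [
        {"Bread", "Milk"},
        {"Beer", "Bread", "Diaper", "Eggs"},
        {"Beer", "Coke", "Diaper", "Milk"},
        {"Beer", "Bread", "Diaper", "Milk"},
        {"Bread", "Coke", "Diaper", "Milk"},
    ]
    items = sorted(set().union(*transactions))

    def sigma(itemset):
        return sum(1 for t in transactions if itemset <= t)

    def is_closed(itemset):
        # Closed: every immediate superset has strictly smaller support.
        return all(sigma(itemset | {i}) < sigma(itemset)
                   for i in items if i not in itemset)

    closed = [frozenset(c)
              for r in range(1, len(items) + 1)
              for c in combinations(items, r)
              if sigma(frozenset(c)) > 0 and is_closed(frozenset(c))]
    print([sorted(c) for c in closed])
    # e.g. {Bread, Diaper, Milk} (support 2) is closed,
    # while {Beer, Milk} is not, because {Beer, Diaper, Milk} also has support 2.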
What are the Closed Itemsets in this Data?

[Figure: a transaction data set organized into three blocks of items labeled (A1-A10), (B1-B10) and (C1-C10); the data itself is not reproduced here.]
Example 1

[Figure: a transaction matrix with items A-J and transactions 1-10; dark cells indicate which items each transaction contains. The matrix itself is not reproduced here.]

Itemsets   Support (counts)   Closed itemsets
{C}        3
{D}        2
{C,D}      2
Example 2

[Figure: a transaction matrix with items A-J and transactions 1-10; dark cells indicate which items each transaction contains. The matrix itself is not reproduced here.]

Itemsets   Support (counts)   Closed itemsets
{C}        3
{D}        2
{E}        2
{C,D}      2
{C,E}      2
{D,E}      2
{C,D,E}    2
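As a quick illustration of the closure test for Example 2, restricted to the three items shown; this assumes the counts in the table are complete for C, D and E, and items outside this set, which are not reproduced here, could change the answers:

    # Supports from the Example 2 table (restricted to items C, D, E).
    support = {
        frozenset("C"): 3, frozenset("D"): 2, frozenset("E"): 2,
        frozenset("CD"): 2, frozenset("CE"): 2, frozenset("DE"): 2,
        frozenset("CDE"): 2,
    }
    items = set("CDE")

    def is_closed(itemset):
        # Closed: no immediate superset (within C, D, E) has the same support.
        return all(support.get(itemset | {i}, 0) < support[itemset]
                   for i in items - itemset)

    for s in sorted(support, key=lambda x: (len(x), sorted(x))):
        print("".join(sorted(s)), support[s],
              "closed" if is_closed(s) else "not closed")
    # Within these items, {C} and {C,D,E} pass the test; the others have
    # an immediate superset with exactly the same count.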
Example 3

[Figure: a transaction matrix with items A-J and transactions 1-10; dark cells indicate which items each transaction contains. The matrix itself is not reproduced here.]

Closed itemsets: {C,D,E,F}, {C,F}
Example 4

[Figure: a transaction matrix with items A-J and transactions 1-10; dark cells indicate which items each transaction contains. The matrix itself is not reproduced here.]

Closed itemsets: {C,D,E,F}, {C}, {F}
Maximal vs Closed Itemsets

[Figure: diagram showing that the maximal frequent itemsets are a subset of the closed frequent itemsets, which in turn are a subset of all frequent itemsets.]
Example question
Given the following transaction data sets (dark cells indicate presence of an item in
a transaction) and a support threshold of 20%, answer the following questions