Unit 3: Association Rule Mining

Association Rule Mining, also known as Affinity Analysis, focuses on discovering interesting associations and correlations among large datasets. The process involves identifying frequent itemsets from transaction data and generating rules that predict item occurrences based on these itemsets. Key algorithms discussed include the Apriori algorithm and FP-Growth algorithm, which facilitate efficient mining of association rules by leveraging support and confidence metrics.


Association Analysis: Basic Concepts and Algorithms

Lecture Notes for Chapter 7, Introduction to Data Mining (© Tan, Steinbach, Kumar, 2004)

ASSOCIATION RULE MINING

Refer: Data Mining by K P Soman, Page No. 160 – 171



ASSOCIATION RULE MINING:

• It is also called Affinity Analysis
• It is the study of ‘what goes with what’
• It finds interesting associations and/or correlations among large data sets



AUTOMATIC DISCOVERY OF ASSOCIATION RULES IN TRANSACTION DATABASES:
• INTRODUCTION:
• From detailed records of customer transactions, associations between items are formed automatically

TID  ITEMS
1    BREAD, MILK
2    BREAD, DIAPER, BEER, EGGS
3    MILK, DIAPER, BEER, COKE
4    BREAD, MILK, DIAPER, BEER
5    BREAD, MILK, DIAPER, COKE



Association Rule Mining

• Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction

Market-Basket transactions:

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example of Association Rules:
{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}

Implication means co-occurrence, not causality!



Definition: Frequent Itemset

• Itemset
  – A collection of one or more items
    Example: {Milk, Bread, Diaper}
  – k-itemset: an itemset that contains k items

• Support count (σ)
  – Frequency of occurrence of an itemset
  – E.g. σ({Milk, Bread, Diaper}) = 2 (counts refer to the transaction table above)

• Support (s)
  – Fraction of transactions that contain an itemset
  – E.g. s({Milk, Bread, Diaper}) = 2/5

• Frequent Itemset
  – An itemset whose support is greater than or equal to a minsup threshold
Definition: Association Rule

• Association Rule
  – An implication expression of the form X → Y, where X and Y are itemsets
  – Example: {Milk, Diaper} → {Beer}

• Rule Evaluation Metrics
  – Support (s): the fraction of transactions that contain both X and Y
  – Confidence (c): measures how often items in Y appear in transactions that contain X

Example: {Milk, Diaper} → {Beer}

s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4
c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 ≈ 0.67
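To make the two metrics concrete, here is a minimal Python sketch (not from the lecture notes; the helper names are illustrative) that computes support and confidence for a rule over the five market-basket transactions above:

```python
# Minimal sketch: support and confidence for a rule X -> Y.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    # sigma(itemset): number of transactions containing every item of itemset
    return sum(1 for t in transactions if set(itemset) <= t)

def rule_metrics(X, Y, transactions):
    s = support_count(X | Y, transactions) / len(transactions)               # support
    c = support_count(X | Y, transactions) / support_count(X, transactions)  # confidence
    return s, c

print(rule_metrics({"Milk", "Diaper"}, {"Beer"}, transactions))  # (0.4, 0.666...)
```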
Association Rule Mining Task

• Given a set of transactions T, the goal of association rule mining is to find all rules having
  – support ≥ minsup threshold
  – confidence ≥ minconf threshold

• Brute-force approach (see the sketch below):
  – List all possible association rules
  – Compute the support and confidence for each rule
  – Prune rules that fail the minsup and minconf thresholds

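The brute-force approach is easy to state but exponential: for d items there are 3^d − 2^(d+1) + 1 possible rules (602 for the 6 items above). A minimal sketch, reusing transactions and support_count from the previous example (function names are again illustrative):

```python
from itertools import chain, combinations

def nonempty_subsets(items):
    items = list(items)
    return [frozenset(c) for c in chain.from_iterable(
        combinations(items, k) for k in range(1, len(items) + 1))]

def brute_force_rules(transactions, minsup=0.4, minconf=0.6):
    rules = []
    for itemset in nonempty_subsets(set().union(*transactions)):
        sigma_xy = support_count(itemset, transactions)
        s = sigma_xy / len(transactions)
        if s < minsup:
            continue                      # prune every rule built from this itemset
        for X in nonempty_subsets(itemset):
            Y = itemset - X
            if not Y:
                continue                  # the consequent must be non-empty
            c = sigma_xy / support_count(X, transactions)  # sigma(X) >= sigma_xy > 0
            if c >= minconf:
                rules.append((set(X), set(Y), s, c))
    return rules

for X, Y, s, c in brute_force_rules(transactions):
    print(X, "->", Y, s, round(c, 2))
```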


Mining Association Rules

Example of Rules (all from the transactions above):
{Milk, Diaper} → {Beer}   (s=0.4, c=0.67)
{Milk, Beer} → {Diaper}   (s=0.4, c=1.0)
{Diaper, Beer} → {Milk}   (s=0.4, c=0.67)
{Beer} → {Milk, Diaper}   (s=0.4, c=0.67)
{Diaper} → {Milk, Beer}   (s=0.4, c=0.5)
{Milk} → {Diaper, Beer}   (s=0.4, c=0.5)

Observations:
• All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but can have different confidence
• Thus, we may decouple the support and confidence requirements
Mining Association Rules

• Two-step approach:
  1. Frequent Itemset Generation
     – Generate all itemsets whose support ≥ minsup
  2. Rule Generation
     – Generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset

• Frequent itemset generation is still computationally expensive



Frequent Itemset Generation

[Figure: the itemset lattice over items A–E, from the null set at the top, through the 1-itemsets A … E and the 2-itemsets AB … DE, down to ABCDE at the bottom]

Given d items, there are 2^d possible candidate itemsets.
• Steps for rule generation (see the sketch below):
  – Create a list of all itemsets that have the required support
  – Examine all subsets of each itemset
  – Decide the cut-off value for confidence
  – Retain the association rules that exceed the desired cut-off value for confidence
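A minimal sketch of these steps, assuming the frequent itemsets and their support counts have already been computed (the freq_counts layout and function name are assumptions for illustration):

```python
from itertools import combinations

def generate_rules(freq_counts, minconf=0.6):
    # freq_counts: dict mapping frozenset itemset -> support count, assumed to
    # contain every frequent itemset together with all of its subsets.
    rules = []
    for itemset, count in freq_counts.items():
        if len(itemset) < 2:
            continue
        for k in range(1, len(itemset)):
            for X in map(frozenset, combinations(itemset, k)):  # binary partition
                conf = count / freq_counts[X]
                if conf >= minconf:            # retain rules above the cut-off
                    rules.append((set(X), set(itemset - X), conf))
    return rules

freq_counts = {frozenset({"Bread"}): 4, frozenset({"Milk"}): 4,
               frozenset({"Bread", "Milk"}): 3}
print(generate_rules(freq_counts))   # both rules survive with confidence 0.75
```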


APRIORI ALGORITHM
• Proposed by Agrawal and Srikant
• Step 1: create a candidate list of k-itemsets by performing a join operation on pairs of frequent (k-1)-itemsets
  [repeat the process for all frequent itemsets]



Illustrating Apriori Principle

Minimum support count = 3

Items (1-itemsets):
Item    Count
Bread   4
Coke    2
Milk    4
Beer    3
Diaper  4
Eggs    1

Pairs (2-itemsets); no need to generate candidates involving the infrequent items Coke or Eggs:
Itemset           Count
{Bread, Milk}     3
{Bread, Beer}     2
{Bread, Diaper}   3
{Milk, Beer}      2
{Milk, Diaper}    3
{Beer, Diaper}    3

Triplets (3-itemsets):
Itemset                  Count
{Bread, Milk, Diaper}    2

L3 = null (the only candidate triplet is infrequent)



Apriori Algorithm

• Method (a sketch follows):
  – Let k=1
  – Generate frequent itemsets of length 1
  – Repeat until no new frequent itemsets are identified:
    • Generate length-(k+1) candidate itemsets from length-k frequent itemsets
    • Prune candidate itemsets containing subsets of length k that are infrequent
    • Count the support of each candidate by scanning the DB
    • Eliminate candidates that are infrequent, leaving only those that are frequent

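A compact Python sketch of this loop, reusing transactions and support_count from above; the join and prune steps follow the standard merge of k-itemsets that share all but one item (names are illustrative, not the textbook's pseudocode):

```python
from itertools import combinations

def apriori(transactions, min_count=3):
    items = set().union(*transactions)
    # k = 1: frequent 1-itemsets
    Lk = {frozenset([i]) for i in items
          if support_count([i], transactions) >= min_count}
    frequent = {}
    while Lk:
        frequent.update({s: support_count(s, transactions) for s in Lk})
        # Join: merge pairs of k-itemsets that differ in exactly one item
        candidates = {a | b for a in Lk for b in Lk if len(a | b) == len(a) + 1}
        # Prune: every length-k subset of a candidate must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in Lk
                             for s in combinations(c, len(c) - 1))}
        # Count support by scanning the DB; eliminate infrequent candidates
        Lk = {c for c in candidates
              if support_count(c, transactions) >= min_count}
    return frequent

print(apriori(transactions))   # matches the tables above; no frequent 3-itemsets
```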


Rule Generation
• Frequent itemsets L = L1 ∪ L2
• Deriving strong rules
  – Consider a frequent 2-itemset: {Bread, Milk}
  – First identify all non-empty proper subsets: {Bread}, {Milk}
  – For each subset a rule is formed as follows:
    {Bread} → {Milk}
    {Milk} → {Bread}
• To determine which rules are strong, find the confidence:
  Rule 1: {Bread} → {Milk} : 3/4 or 75%
  Rule 2: {Milk} → {Bread} : 3/4 or 75%
• If the confidence is greater than the threshold, the rule is strong (assume the threshold to be 60%; then both rules are strong).



• Step 3: the process is repeated until the candidate list becomes empty
• Hash trees are used to store the candidate itemsets, which speeds up support counting

SHORTCOMINGS
• The support-confidence framework generates too many rules.
• Irrelevant items are combined.



FP-Tree Representation

• An FP-tree is a compressed representation of the input data
• Transactions are read one by one, and each transaction is mapped onto a path in the FP-tree
• Since different transactions can have several items in common, their paths may overlap; the more the paths overlap, the greater the compression
• If the FP-tree is small enough to fit into main memory, frequent itemsets can be extracted directly from the structure in memory (see the sketch below)
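A minimal sketch of FP-tree construction (illustrative class and field names, not the textbook's data structures). Items in each transaction are sorted by decreasing global support so that common prefixes share nodes, and a header table plays the role of the dashed pointers:

```python
class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}          # item -> child FPNode

def build_fp_tree(transactions, min_count=3):
    # Pass 1: global support counts; infrequent items are dropped up front
    counts = {}
    for t in transactions:
        for i in t:
            counts[i] = counts.get(i, 0) + 1
    order = {i: c for i, c in counts.items() if c >= min_count}
    # Pass 2: insert each transaction, most frequent items first
    root, header = FPNode(None, None), {}
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in order),
                           key=lambda i: (-order[i], i)):
            if item not in node.children:
                node.children[item] = FPNode(item, node)
                header.setdefault(item, []).append(node.children[item])
            node = node.children[item]
            node.count += 1         # one more transaction runs through this node
    return root, header             # header lists act as the dashed pointers
```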


• More on the FP-tree
• Each node in the tree has a label (the item) and a counter showing the number of transactions mapped onto the given path
• If all the transactions have the same set of items, the FP-tree has only a single branch of nodes (best case for compression)
• If every transaction has a unique set of items, the size of the FP-tree is almost the same as that of the original data (worst case)
• The size of an FP-tree also depends on how the items are ordered
• The pointers, represented as dashed lines in the FP-tree, facilitate rapid access to individual items in the tree
• FREQUENT ITEMSET GENERATION IN THE FP-GROWTH ALGORITHM
• The FP-growth algorithm generates frequent itemsets by exploring the tree in a bottom-up fashion
• It finds all frequent itemsets ending with a particular suffix by employing a divide-and-conquer strategy to split the problem into smaller subproblems
• For example, to find frequent itemsets ending with ‘e’:
• Step 1: gather all paths containing node ‘e’; these paths are called prefix paths
• Step 2: from the prefix paths, the support count for the node is obtained
• Step 3: convert the prefix paths to a conditional FP-tree, as seen below:
• a) the support counts along the prefix paths must be updated, since some of those counts include transactions that do not contain item ‘e’
• b) prefix paths whose support count is less than the cut-off are truncated
• c) we then obtain a conditional FP-tree containing only frequent items
• d) the conditional FP-tree constructed in the previous step is used to find the frequent itemsets of the subproblem (see the sketch below)

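A sketch of steps 1–3, reusing FPNode and build_fp_tree from above (again illustrative, not the textbook's pseudocode). For each item in the header table it gathers the prefix paths with updated counts, then recurses on the conditional FP-tree; build_fp_tree's min_count filter performs the truncation of infrequent items:

```python
def fp_growth(transactions, min_count=3, suffix=frozenset()):
    root, header = build_fp_tree(transactions, min_count)
    frequent = {}
    for item, nodes in header.items():
        count = sum(n.count for n in nodes)       # Step 2: support of suffix + item
        itemset = suffix | {item}
        frequent[itemset] = count                 # re-derivations overwrite identically
        # Step 1 / 3a: prefix paths of the item, each repeated with its node count
        cond_base = []
        for n in nodes:
            path, p = [], n.parent
            while p.item is not None:
                path.append(p.item)
                p = p.parent
            cond_base.extend([path] * n.count)
        # Steps 3b-3d: build the conditional FP-tree (infrequent items truncated)
        # and mine it recursively as the smaller subproblem
        frequent.update(fp_growth(cond_base, min_count, itemset))
    return frequent

print(fp_growth(transactions))   # same frequent itemsets as the Apriori run above
```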

