Chapter 5 - Association Rule Mining

Association rule mining is used to discover relationships between variables in large datasets. It aims to find rules that describe large portions of your data, like "customers that buy x also tend to buy y". Key concepts include support, which measures how frequently an itemset occurs, and confidence, which measures the strength of implications between itemsets. Rules must meet minimum support and confidence thresholds to be considered strong and significant. Association rule mining is commonly used for market basket analysis to discover what products customers frequently purchase together.

Data Mining and Warehousing

Association Rule Mining


Association Rule Mining
• Association rule mining derives all logical dependencies among
different attributes, given a set of entities.
Basket Items
1 bread, milk, diaper, cola
2 bread, diaper, beer, egg
3 milk, diaper, beer, cola
4 bread, milk, tea
5 bread, milk, diaper, beer
6 milk, tea, sugar, diaper

Which items are frequently bought together?

{bread} → {milk}
Example Application
• Consider a data set recorded in a medical center regarding the symptoms of
patients.

Patient | Symptom(s)
[table of five patients' symptom lists; contents not preserved in the source]

Question: Which symptoms frequently happen together?


Association Rule Mining
• In general, an association rule can be expressed as

X → Y, where X and Y are disjoint itemsets.

Example: {bread} → {milk}, {beer} → {diaper}

This means a customer who purchases bread (or beer) is also likely to purchase
milk (or diaper).
Association Rules Examples
• Basket Data
Tea ∧ Milk ⇒ Sugar [support = 0.3, confidence = 0.9]

• Relational Data
x.diagnosis = Heart ∧ x.gender = Male ⇒ x.age > 50 [support = 0.4, confidence = 0.7]

Some basic definitions and terminologies
Some notation:

• Let I = {i1, i2, …, im} be the set comprising all items.
• Let D = {t1, t2, …, tn} be a database of transactions, where each
transaction t ∈ D is a set of items with t ⊆ I.
• Any one transaction, say t, is called an itemset.

Database of transactions:
Transaction Id | Transaction (item set)
[rows for transactions 1–8; item sets not preserved in the source]
Interesting/Useful rules
• Statistically, anything that is interesting is something that happens significantly more than you would
expect by chance.

• E.g. basic statistical analysis of basket data may show that 10% of baskets contain bread, and 4% of baskets
contain washing-up powder. i.e.: if you choose a basket at random:
• There is a probability 0.1 that it contains bread.
• There is a probability 0.04 that it contains washing-up powder.
Interesting means surprising
 A prior expectation: just 4 in 1,000 baskets (0.1 × 0.04 = 0.004) should
contain both bread and washing-up powder.

 If we investigate and discover that it is really 20 in 1,000 baskets, we
will be very surprised. It tells us that:

• Something is going on in shoppers’ minds: bread and washing-up powder are
connected in some way.

• There may be ways to exploit this discovery … put the powder and bread at
opposite ends of the supermarket?
Interestingness Measure / Pattern Evaluation
• How strong is the relationship in a rule A → B?

• Answer?
• Support
• Confidence
• Lift (see the sketch below)
• AND more…
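
For instance, lift measures exactly the "surprise" from the bread and
washing-up powder example on the previous slides. A minimal Python sketch,
using the numbers from that example:

# lift(A -> B) = P(A and B) / (P(A) * P(B)); lift > 1 means A and B
# co-occur more often than independence would predict.
p_bread = 0.10                         # P(basket contains bread)
p_powder = 0.04                        # P(basket contains washing-up powder)
p_both_expected = p_bread * p_powder   # 0.004, i.e. 4 in 1,000 baskets
p_both_observed = 0.02                 # 20 in 1,000 baskets, as observed

lift = p_both_observed / p_both_expected
print(lift)  # 5.0 -> the pair occurs 5x more often than chance predicts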
Definition: Support Count and Support
• The support count of an itemset refers to the number of transactions that
contain that particular itemset.

• The support count of an itemset X is denoted σ(X) and is defined as

σ(X) = |{ t ∈ D : X ⊆ t }|

• where the symbol |·| denotes the number of elements in a set.

Exercise: With respect to the transaction database above, find the support
count of the following itemsets:

a) {a}
b) {a, b}
c) {b, c, d}
d) {c, f}
Definition: Support Count and Support
• The support of a rule is the ratio (or percentage) of the number of
transactions containing both body and head to the total number of
transactions.

• The support of a rule X → Y is denoted s(X → Y) and defined as

s(X → Y) = σ(X ∪ Y) / |D|

Exercise: With respect to the same transaction database, find the support of
the following itemsets:

a) s({a})
b) s({a, b})
c) s({b, c, d})
d) s({c, f})
Definition: Support cont…
• The value of support can be expressed either as a percentage or in
probability form.

• For example, support = 0.1 means that 10% of the transactions contain the
specified itemset.
Meaning of Support to a Data Engineer
• Support measures the strength of a rule.

• s = 0 implies "no match", whereas s = 1 implies "all transactions match".

• In other words, a rule with very low support may occur simply by chance,
and hence the association rule is insignificant, whereas a rule with a high
value of s is significant.
Definition: Confidence

• The confidence of a rule X → Y in a database D is denoted c(X → Y) and
defined as the ratio (or percentage) of the transactions in D containing X
that also contain Y, to the support count of X. More precisely,

c(X → Y) = σ(X ∪ Y) / σ(X)

• Note: The confidence of a rule can also be expressed as

c(X → Y) = s(X ∪ Y) / s(X) = P(Y | X)

So, alternatively, the confidence of a rule is the conditional probability
that Y occurs given that X occurs.
Exercise: With respect to the transaction database above, find the confidence
of the following rules:

a) A → B
b) B → C
c) {A, B} → C
d) B → A
Meaning of Confidence to a Data Engineer
• Confidence measures the reliability of the inference made by a rule.

• For a given rule X → Y, the higher the confidence, the more likely it is
for Y to be present in transactions that contain X.

Note: The support count σ is also called "absolute support", support (s) is
also called "relative support", and confidence is a measure of a rule's
"reliability".
Definition: minsup and minconf
• It is customary to reject any rule whose support is below a minimum
threshold. This minimum threshold of support is called minsup.

• Also, if the confidence of a rule is below a minimum threshold, it is
customary to reject the rule. This minimum threshold of confidence is called
minconf.

• Both thresholds are user-specified and application-dependent.
Example: minsup and minconf
1) Example: For the database of transactions shown above, find a strong rule
that satisfies a minimum support of 50% and a minimum confidence of 80%.

2) Which one of the following is a strong rule?

a) A → B
b) B → C
c) {B, C} → D
d) B → A
Rule Evaluation Metrics
• Support s(X → Y)
• Fraction of transactions that contain both X and Y
• Confidence c(X → Y)
• Measures how often items in Y appear in transactions that contain X

Quiz: Find the support and confidence of {Milk, Diaper} → Beer

TID | Items
1 | Bread, Milk
2 | Bread, Diaper, Beer, Eggs
3 | Milk, Diaper, Beer, Coke
4 | Bread, Milk, Diaper, Beer
5 | Bread, Milk, Diaper, Coke

s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4

c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 ≈ 0.67
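
The same computation can be done directly from the definitions in a few lines
of Python; a minimal sketch over the five transactions above (the helper name
sigma is illustrative, not from the slides):

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def sigma(itemset, db):
    # Support count: number of transactions that contain the itemset.
    return sum(itemset <= t for t in db)

body, head = {"Milk", "Diaper"}, {"Beer"}
s = sigma(body | head, transactions) / len(transactions)         # 2/5 = 0.4
c = sigma(body | head, transactions) / sigma(body, transactions)  # 2/3
print(s, round(c, 2))  # 0.4 0.67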
Definition: Frequent Itemset
• Let minsup be the user-specified minimum support. An itemset X in D is said
to be a frequent itemset in D with respect to minsup if and only if
s(X) ≥ minsup.

Exercise 1: Which of the following itemsets are frequent itemsets?

a) {a}
b) {b, c}
c) {a, b, c}
d) {a, b, d}

Exercise 2: Find the 2-itemsets that are frequent.

Association Rule Mining
• Now we are in a position to discuss the core problem of this chapter:
 "Given a dataset of transactions, how do we discover association rules?"

• The discovery of association rules is among the most well-studied problems
in data mining. In fact, there are many types of frequent itemsets,
association rules, and correlation relationships.
Problem specification and solution strategy

• Given a set of transactions D, we are to discover all rules X → Y such that
s(X → Y) ≥ minsup and c(X → Y) ≥ minconf.

• A solution to this problem is obtained in two steps:

1. Frequent itemset generation: given the set of items I, find all itemsets
that satisfy the minsup threshold. These itemsets are called frequent
itemsets.

2. Rule generation: from the frequent itemsets, extract all rules that
satisfy the minconf constraint.

Of these two tasks, the first is computationally very expensive, while the
second is fairly straightforward to implement. Let us first examine the naïve
approach to frequent itemset generation.
Naïve approach (Brute-force approach):

• List all possible association rules


• Compute the support and confidence for each rule
• Prune rules that fail the minsup and minconf thresholds
 Computationally prohibitive!
Frequent Itemset Generation (Naïve approach)

Itemset lattice for I = {A, B, C, D, E}:

null

A B C D E

AB AC AD AE BC BD BE CD CE DE

ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE

ABCD ABCE ABDE ACDE BCDE

ABCDE

Given d items, there are 2^d possible candidate itemsets.
Computational Complexity
• Given d unique items:
• Total number of itemsets = 2^d
• Total number of possible association rules:

R = Σ_{k=1}^{d−1} [ C(d, k) × Σ_{j=1}^{d−k} C(d−k, j) ] = 3^d − 2^{d+1} + 1

If d = 6, R = 602 rules
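
A quick Python check (illustrative only) that the closed form agrees with the
double sum:

from math import comb

d = 6
R_sum = sum(comb(d, k) * sum(comb(d - k, j) for j in range(1, d - k + 1))
            for k in range(1, d))          # the double sum above
R_closed = 3**d - 2**(d + 1) + 1           # the closed form
print(R_sum, R_closed)  # 602 602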


Apriori algorithm

• Apriori pruning principle: if an itemset is infrequent, its supersets
should not be generated/tested!

• Equivalently: if an itemset is frequent, then all of its subsets must also
be frequent.

• Method:
• Initially, scan DB once to get frequent 1-itemset
• Generate length (k+1) candidate itemsets from length k frequent itemsets
• Test the candidates against DB
• Terminate when no frequent or candidate set can be generated
The Apriori Algorithm
• Pseudo-code:
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

L1 = {frequent 1-itemsets};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with count ≥ min_support;
end
return ∪k Lk;
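
The pseudo-code maps onto the following minimal, runnable Python sketch (an
illustrative implementation, not code from the slides; min_count is the
absolute support threshold):

from itertools import combinations

def apriori(transactions, min_count):
    # Returns {frozenset: support count} for every frequent itemset.
    db = [set(t) for t in transactions]
    count = lambda s: sum(s <= t for t in db)
    items = {i for t in db for i in t}
    Lk = {frozenset([i]) for i in items if count(frozenset([i])) >= min_count}
    frequent = {s: count(s) for s in Lk}
    k = 1
    while Lk:
        # Join step: merge pairs of frequent k-itemsets into (k+1)-candidates.
        Ck = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        # Prune step (Apriori principle): drop any candidate that has an
        # infrequent k-subset.
        Ck = {c for c in Ck
              if all(frozenset(s) in Lk for s in combinations(c, k))}
        # Scan the database and keep candidates meeting the threshold.
        Lk = {c for c in Ck if count(c) >= min_count}
        frequent.update({c: count(c) for c in Lk})
        k += 1
    return frequent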
Example 2 (minsup = 2)
Generate all the frequent itemsets in the database given below.

TID | Items
100 | 1, 3, 4
200 | 2, 3, 5
300 | 1, 2, 3, 5
400 | 2, 5
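Running the apriori() sketch from the previous slide on this database with
min_count = 2 reproduces the expected result:

db = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
for s, n in sorted(apriori(db, 2).items(),
                   key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(set(s), n)
# {1}:2 {2}:3 {3}:3 {5}:3 {1,3}:2 {2,3}:2 {2,5}:3 {3,5}:2 {2,3,5}:2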
Exercise: Use the Apriori algorithm to generate frequent itemsets (minsup = 2)

Database TDB:
Tid | Items
10 | A, C, D
20 | B, C, E
30 | A, B, C, E
40 | B, E

1st scan — C1 with counts:
{A}: 2, {B}: 3, {C}: 3, {D}: 1, {E}: 3

L1 (pruning {D}):
{A}: 2, {B}: 3, {C}: 3, {E}: 3

C2 (generated from L1):
{A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}

2nd scan — C2 with counts:
{A, B}: 1, {A, C}: 2, {A, E}: 1, {B, C}: 2, {B, E}: 3, {C, E}: 2

L2 (pruning {A, B} and {A, E}):
{A, C}: 2, {B, C}: 2, {B, E}: 3, {C, E}: 2

C3 (generated from L2):
{B, C, E}

3rd scan — L3:
{B, C, E}: 2
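
The trace can also be cross-checked with the third-party mlxtend library
(assuming it is installed, e.g. via pip install mlxtend); apriori here is
mlxtend's implementation, not the sketch above:

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori as mlx_apriori

db = [["A", "C", "D"], ["B", "C", "E"], ["A", "B", "C", "E"], ["B", "E"]]
te = TransactionEncoder()
df = pd.DataFrame(te.fit(db).transform(db), columns=te.columns_)
print(mlx_apriori(df, min_support=2 / 4, use_colnames=True))
# Reports the same frequent itemsets: A, B, C, E, AC, BC, BE, CE, BCE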
What is next?
• Generate all the strong rules from the frequent itemsets, i.e. those that
satisfy the minimum confidence requirement (a sketch follows below).
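
A sketch of this rule-generation step, reusing the frequent dict
({frozenset: support count}) returned by the apriori() sketch earlier (the
names are illustrative): for each frequent itemset F and each non-empty
proper subset X, emit X → (F − X) whenever its confidence clears minconf.

from itertools import combinations

def gen_rules(frequent, min_conf):
    rules = []
    for F, n in frequent.items():
        if len(F) < 2:
            continue
        for r in range(1, len(F)):
            for body in map(frozenset, combinations(F, r)):
                conf = n / frequent[body]   # sigma(F) / sigma(X)
                if conf >= min_conf:
                    rules.append((set(body), set(F - body), conf))
    return rules

# Every subset of a frequent itemset is itself frequent (Apriori property),
# so frequent[body] is always present. On the database TDB above,
# gen_rules(apriori(db, 2), 0.8) yields rules such as ({B}, {E}, 1.0).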
Apriori algorithm for frequent itemset generation (contd.)

[Figure: illustration of the Apriori property on the itemset lattice]
Analysis of the bottleneck

• The bottleneck of Apriori: candidate generation

• Huge candidate sets:
• 10^4 frequent 1-itemsets will generate about 10^7 candidate 2-itemsets
• To discover a frequent pattern of size 100, e.g., {a1, a2, …, a100}, one
needs to generate 2^100 ≈ 10^30 candidates

• Multiple scans of the database:
• Needs (n + 1) scans, where n is the length of the longest pattern
Is it possible to Mine Frequent Patterns
without Candidate Generation?
Mining Frequent Patterns Without Candidate Generation

• Compress a large database into a compact, Frequent-Pattern tree (FP-


tree) structure
• highly condensed, but complete for frequent pattern mining
• avoid costly database scans
Construct FP-tree from a Transaction DB (min_support = 0.5)

TID | Items bought | (ordered) frequent items
100 | {f, a, c, d, g, i, m, p} | {f, c, a, m, p}
200 | {a, b, c, f, l, m, o} | {f, c, a, b, m}
300 | {b, f, h, j, o} | {f, b}
400 | {b, c, k, s, p} | {c, b, p}
500 | {a, f, c, e, l, p, m, n} | {f, c, a, m, p}

Steps:
1. Scan the DB once to find the frequent 1-itemsets (single-item patterns)
2. Order the frequent items in frequency-descending order
3. Scan the DB again and construct the FP-tree

Header table (item : frequency): f : 4, c : 4, a : 3, b : 3, m : 3, p : 3

Resulting FP-tree:
{}
├── f:4
│   ├── c:3
│   │   └── a:3
│   │       ├── m:2
│   │       │   └── p:2
│   │       └── b:1
│   │           └── m:1
│   └── b:1
└── c:1
    └── b:1
        └── p:1
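
To make steps 1–3 concrete, here is a compact Python sketch of FP-tree
construction (illustrative code, not the original authors'; Node,
build_fp_tree, and the alphabetical tie-break are my assumptions):

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children, self.link = 0, {}, None

def build_fp_tree(transactions, min_count):
    # Pass 1: count items and fix a frequency-descending order.
    freq = {}
    for t in transactions:
        for i in t:
            freq[i] = freq.get(i, 0) + 1
    order = [i for i in sorted(freq, key=lambda i: (-freq[i], i))
             if freq[i] >= min_count]
    root, header = Node(None, None), {i: None for i in order}
    # Pass 2: insert each transaction's frequent items as a path.
    for t in transactions:
        node = root
        for i in [i for i in order if i in t]:
            child = node.children.get(i)
            if child is None:
                child = node.children[i] = Node(i, node)
                child.link, header[i] = header[i], child  # thread node-links
            child.count += 1
            node = child
    return root, header

# e.g. on the five transactions above (min_support 0.5 of 5 => min_count 3):
db = [set("facdgimp"), set("abcflmo"), set("bfhjo"), set("bcksp"),
      set("afcelpmn")]
root, header = build_fp_tree(db, min_count=3)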
FP-Tree Construction: Example 2

Transaction database:
TID | Items
1 | {A, B}
2 | {B, C, D}
3 | {A, C, D, E}
4 | {A, D, E}
5 | {A, B, C}
6 | {A, B, C, D}
7 | {B, C}
8 | {A, B, C}
9 | {A, B, D}
10 | {B, C, E}

Resulting FP-tree (the header table holds one pointer per item A, B, C, D, E;
the pointers thread together all nodes for an item and are used to assist
frequent itemset generation):

null
├── A:7
│   ├── B:5
│   │   ├── C:3
│   │   │   └── D:1
│   │   └── D:1
│   ├── C:1
│   │   └── D:1
│   │       └── E:1
│   └── D:1
│       └── E:1
└── B:3
    └── C:3
        ├── D:1
        └── E:1
Benefits of the FP-tree Structure
• Completeness
• never breaks a long pattern of any transaction
• preserves complete information for frequent pattern mining

• Compactness
• reduces irrelevant information: infrequent items are gone
• frequency-descending ordering: more frequent items are more likely to be
shared
• never larger than the original database (not counting node-links and counts)
Mining Frequent Patterns Using FP-tree

• General idea (divide-and-conquer)


• Recursively grow frequent pattern path using the FP-tree

• Method
• For each item, construct its conditional pattern-base, and then its conditional FP-
tree
• Repeat the process on each newly created conditional FP-tree
• Until the resulting FP-tree is empty, or it contains only one path (single path will
generate all the combinations of its sub-paths, each of which is a frequent pattern)
Major Steps to Mine FP-tree

1) Construct conditional pattern base for each node in the FP-tree


2) Construct conditional FP-tree from each conditional pattern-base
3) Recursively mine conditional FP-trees and grow frequent patterns
obtained so far
 If the conditional FP-tree contains a single path, simply enumerate all the
patterns
Step 1: From FP-tree to Conditional Pattern Base
• Start at the frequent-item header table of the FP-tree
• Traverse the FP-tree by following the node-links of each frequent item
• Accumulate all the transformed prefix paths of that item to form its
conditional pattern base

Conditional pattern bases (from the FP-tree built earlier):

item | conditional pattern base
c | f:3
a | fc:3
b | fca:1, f:1, c:1
m | fca:2, fcab:1
p | fcam:2, cb:1
Step 2: Construct Conditional FP-trees
• For each pattern base:
• Accumulate the count for each item in the base
• Construct the FP-tree for the frequent items of the pattern base

Example: m-conditional pattern base: fca:2, fcab:1

m-conditional FP-tree (b is dropped: its count of 1 is below the minimum
support count of 3):
{}
└── f:3
    └── c:3
        └── a:3

All frequent patterns concerning m:
m, fm, cm, am, fcm, fam, cam, fcam
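
The remaining recursion (repeating steps 1–2 on each conditional tree) is
more involved; as a practical shortcut, the third-party mlxtend library ships
an FP-growth implementation (assuming it is installed) that reproduces the
m-patterns listed above:

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

db = [list("facdgimp"), list("abcflmo"), list("bfhjo"), list("bcksp"),
      list("afcelpmn")]
te = TransactionEncoder()
df = pd.DataFrame(te.fit(db).transform(db), columns=te.columns_)
print(fpgrowth(df, min_support=0.5, use_colnames=True))
# Among the results: m, fm, cm, am, fcm, fam, cam, fcam (support 0.6 each)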
Exercises
Test yourself: Understanding rules

Suppose itemset A = {beer, cheese, eggs} has 30% support in the DB;
{beer, cheese} has 40%, {beer, eggs} has 30%, {cheese, eggs} has 50%,
and each of beer, cheese, and eggs alone has 50% support.

What is the confidence of:
IF basket contains Beer and Cheese, THEN basket also contains Eggs?

Confidence = support({beer, cheese, eggs}) / support({beer, cheese})
= 30/40 = 0.75; this rule has 75% confidence.

What is the confidence of:
IF basket contains Beer, THEN basket also contains Cheese and Eggs?

Confidence = 30/50 = 0.6, so this rule has 60% confidence.

Test yourself: Understanding rules

Suppose the rule "If A then B" has confidence c, and
support(A) = 2 × support(B). What can be said about the confidence of the
rule "If B then A"?

c = support(A ∪ B) / support(A) = support(A ∪ B) / (2 × support(B))

Let d be the confidence of "If B then A":

d = support(A ∪ B) / support(B). Clearly, d = 2c.

E.g., A might be milk and B might be newspapers.

You might also like