Chapter 5 - Association Rule Mining
{bread} → {milk}
Example Application
• Consider a data set recorded in a medical center regarding the symptoms of
patients.
(Table: Patient vs. Symptom(s) for five patients; the symptom entries were lost in extraction.)
Example:
This means that a customer who purchases bread (or beer) is also likely to purchase milk (or diapers).
Association Rules Examples
• Basket Data
Tea ∧ Milk ⇒ Sugar [support = 0.3, confidence = 0.9]
• Relational Data
x.diagnosis = Heart ∧ x.gender = Male ⇒ x.age > 50 [support = 0.4, confidence = 0.7]
Some basic definitions and terminologies
Notation:
• I = {i1, i2, …, im} is the set of all items.
• A transaction database D = {t1, t2, …, tn} is a set of transactions, where each transaction t ⊆ I.
• Any one transaction, say ti, is an itemset; in general, any set of items X ⊆ I is called an itemset.
(The accompanying notation table and the eight-row transaction database were lost in extraction.)
Interesting/Useful rules
• Statistically, anything that is interesting is something that happens significantly more than you would
expect by chance.
• E.g. basic statistical analysis of basket data may show that 10% of baskets contain bread, and 4% of baskets contain washing-up powder. That is, if you choose a basket at random:
• There is a probability 0.1 that it contains bread.
• There is a probability 0.04 that it contains washing-up powder.
Interesting means surprising
• If bread and washing-up powder were bought independently, we would expect 0.1 × 0.04 = 0.004 of baskets to contain both, i.e. a prior expectation that just 4 in 1,000 baskets should contain both bread and washing-up powder. Finding them together substantially more often than that is surprising, and hence interesting.
• There may be ways to exploit this discovery … put the powder and bread at
opposite ends of the supermarket?
Interestingness Measure/Pattern Evaluation?
• How strong is the relationship between A and B in a rule A ⇒ B?
• Answer?
• Support
• Confidence
• Lift
• AND more… (the standard definitions are sketched below)
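For a rule X ⇒ Y over a transaction database D, these measures are standardly defined as: support s(X ⇒ Y) = σ(X ∪ Y) / |D|, confidence c(X ⇒ Y) = σ(X ∪ Y) / σ(X), and lift(X ⇒ Y) = c(X ⇒ Y) / s(Y). Lift is not defined elsewhere in this chapter, so the standard definition is given here; a lift greater than 1 means X and Y co-occur more often than would be expected if they were independent.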
Definition: Support Count and Support
• Support count, denoted σ(X), refers to the number of transactions that contain a particular itemset X.
(The eight-row transaction table for this exercise was lost in extraction.)
Exercise: Find the support count of the following items and itemsets:
a) σ{a}
b) σ{a, b}
c) σ{b, c, d}
d) σ{c, f}
Definition: Support Count and Support
• Support is the ratio (or percentage) of the number of transactions that contain an itemset X to the total number of transactions:
s(X) = σ(X) / |D|
• Support of a rule X ⇒ Y is denoted as s(X ⇒ Y) and mathematically defined as
s(X ⇒ Y) = σ(X ∪ Y) / |D|
(The eight-row transaction table for this exercise was lost in extraction.)
Exercise: Find the support of the following items and itemsets:
a) s{a}
b) s{a, b}
c) s{b, c, d}
d) s{c, f}
Definition: Support cont…
• The value of support can be expressed either in percentage or probability form.
• A rule that has very low support may occur simply by chance, and hence the association rule is insignificant, whereas a rule with a high value of s is significant.
Definition: Confidence
b)B C 3
4
c){A, B} C 5
d)B A 6
7
8
Meaning of Confidence to Data Engineer
• Confidence measures the reliability of the inference made by a rule.
• For a given rule X ⇒ Y, the higher the confidence, the more likely it is for Y to be present in transactions that contain X.
Note: support (s) is also called "relative support", the support count (σ) is called "absolute support", and confidence (c) is also called "reliability".
Definition: minsup (α) and minconf (β)
• It is customary to reject any rule for which the support is below a minimum threshold. This minimum threshold of support is called minsup and is denoted as α.
• Similarly, any rule whose confidence is below a minimum threshold, called minconf and denoted as β, is rejected.
(The eight-row transaction table for this exercise was lost in extraction.)
Exercise: Check which of the following rules satisfy the minsup and minconf thresholds:
b) B ⇒ C
c) {B, C} ⇒ D
d) B ⇒ A
Rule Evaluation metrics
• Support (s)
• Fraction of transactions that contain both X and Y
• Confidence (c)
• Measures how often items in Y appear in transactions that contain X

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example rule: {Milk, Diaper} ⇒ Beer
s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4
c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 ≈ 0.67
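As a sanity check, the support and confidence above (and the lift mentioned earlier) can be computed directly from the five-transaction table. The snippet below is a minimal sketch; the variable names and the small helper `sigma` are chosen here for illustration.

```python
# Minimal sketch: compute support, confidence and lift for
# {Milk, Diaper} => Beer over the five-transaction table above.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def sigma(itemset, db):
    """Support count: number of transactions containing the itemset."""
    return sum(1 for t in db if itemset <= t)

X, Y = {"Milk", "Diaper"}, {"Beer"}
n = len(transactions)
s = sigma(X | Y, transactions) / n                        # 2/5 = 0.4
c = sigma(X | Y, transactions) / sigma(X, transactions)   # 2/3 ~ 0.67
lift = c / (sigma(Y, transactions) / n)                   # 0.67 / 0.6 ~ 1.11
print(f"s = {s:.2f}, c = {c:.2f}, lift = {lift:.2f}")
```

The lift above 1 indicates that Milk and Diaper buyers purchase Beer somewhat more often than the 60% baseline rate of Beer across all baskets.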
Definition: Frequent Itemset
• Let α be the user-specified minsup. An itemset X in D is said to be a frequent itemset in D with respect to α if and only if s(X) ≥ α.
(The eight-row transaction table for this exercise was lost in extraction.)
Exercise: Which of the following itemsets are frequent?
b) {b, c}
c) {a, b, c}
d) {a, b, d}
• Given a set of transactions D, we are to discover all the rules X ⇒ Y such that s(X ⇒ Y) ≥ minsup and c(X ⇒ Y) ≥ minconf. This is accomplished in two steps:
1. Generating frequent itemsets: given the set of items I, find all the itemsets that satisfy the minsup threshold. These itemsets are called frequent itemsets.
2. Generating association rules: from the frequent itemsets, extract all the rules that satisfy the minimum-confidence constraint, minconf.
Of these two tasks, the first is computationally very expensive, while the second is fairly straightforward to implement. Let us first examine the naïve approach to frequent itemset generation.
Naïve approach (Brute-force approach):
List every candidate itemset in the lattice over I = {A, B, C, D, E}:
A B C D E
AB AC AD AE BC BD BE CD CE DE
ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE
ABCD ABCE ABDE ACDE BCDE
ABCDE
For d items, the total number of possible association rules is
R = \sum_{k=1}^{d-1} \left[ \binom{d}{k} \times \sum_{j=1}^{d-k} \binom{d-k}{j} \right] = 3^d - 2^{d+1} + 1
(for d = 6, R = 602), which is why brute-force rule enumeration is prohibitively expensive.
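The closed form can be checked by brute-force enumeration: every rule is a non-empty antecedent X plus a non-empty consequent Y disjoint from X. A small sketch (the function name is illustrative):

```python
# Brute-force check of R = 3^d - 2^(d+1) + 1: enumerate every
# non-empty antecedent X and every non-empty consequent Y drawn
# from the remaining items, and count one rule X => Y per pair.
from itertools import combinations

def count_rules(d):
    items = range(d)
    count = 0
    for k in range(1, d):                        # antecedent size
        for X in combinations(items, k):
            rest = [i for i in items if i not in X]
            for j in range(1, len(rest) + 1):    # consequent size
                count += sum(1 for _ in combinations(rest, j))
    return count

for d in range(2, 7):
    assert count_rules(d) == 3**d - 2**(d + 1) + 1
print("formula verified; d = 6 gives", count_rules(6), "rules")  # 602
```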
• Apriori method:
• Initially, scan the DB once to get the frequent 1-itemsets
• Generate length (k+1) candidate itemsets from length k frequent itemsets
• Test the candidates against DB
• Terminate when no frequent or candidate set can be generated
The Apriori Algorithm
• Pseudo-code:
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

L1 = {frequent items};
for (k = 1; Lk ≠ ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with count ≥ min_support;
end
return ∪k Lk;
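The pseudo-code translates fairly directly into Python. The sketch below is one straightforward rendering under the usual join-and-prune candidate generation (other strategies exist); `apriori` and its internals are names chosen here, not from any particular library, and `min_support` is an absolute count, matching the examples in this chapter.

```python
# A straightforward Apriori sketch following the pseudo-code above.
# Candidate generation joins two frequent k-itemsets that share their
# first k-1 items, then prunes candidates with an infrequent k-subset.
from itertools import combinations

def apriori(transactions, min_support):
    """Return {frozenset(itemset): support_count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    # L1: frequent 1-itemsets
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Lk = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(Lk)
    k = 1
    while Lk:
        # Ck+1: join step, followed by the subset-pruning step
        prev = sorted(tuple(sorted(s)) for s in Lk)
        candidates = set()
        for a, b in combinations(prev, 2):
            if a[:k - 1] == b[:k - 1]:            # share first k-1 items
                cand = frozenset(a) | frozenset(b)
                if len(cand) == k + 1 and all(
                        frozenset(sub) in Lk
                        for sub in combinations(cand, k)):
                    candidates.add(cand)
        # count the surviving candidates against the database
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        Lk = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(Lk)
        k += 1
    return frequent
```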
Example 2 (minsup = 2)
Generate all the frequent itemsets in the database given below:
TID  Items
100  {1, 3, 4}
200  {2, 3, 5}
300  {1, 2, 3, 5}
400  {2, 5}
Exercise: Use the Apriori algorithm to generate the frequent itemsets.
(The C1/L1 and C2/L2 scan tables were lost in extraction; only the final scan survives.)
3rd scan: C3 = { {B, C, E} }; L3 = { {B, C, E} } with sup = 2
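Running the `apriori` sketch from above on the Example 2 database reproduces this result; items 1 to 5 of Example 2 appear to correspond to the letters A to E used in the exercise, since the final scans match ({2, 3, 5} ≙ {B, C, E} with support count 2).

```python
# Example 2 database, minsup = 2 (absolute count).
db = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
for itemset, count in sorted(apriori(db, 2).items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
# Expected output:
# [1] 2, [2] 3, [3] 3, [5] 3               (L1; item 4 is infrequent)
# [1, 3] 2, [2, 3] 2, [2, 5] 3, [3, 5] 2   (L2)
# [2, 3, 5] 2                              (L3)
```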
What is next?
• Generate all the strong rules, that is, the rules that satisfy both the minimum support and the minimum confidence requirements (a sketch follows below).
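Rule generation is the straightforward second task: for each frequent itemset Z and each non-empty proper subset X of Z, emit X ⇒ Z − X whenever its confidence clears minconf. A minimal sketch building on the `apriori` output above (`strong_rules` is a name chosen here):

```python
# Generate strong rules X => Z-X from the frequent itemsets.
# 'frequent' maps frozenset -> support count, as returned by apriori().
from itertools import combinations

def strong_rules(frequent, min_conf):
    rules = []
    for Z, sigma_Z in frequent.items():
        if len(Z) < 2:
            continue
        for r in range(1, len(Z)):
            for X in map(frozenset, combinations(Z, r)):
                # Every subset of a frequent itemset is frequent
                # (Apriori property), so frequent[X] always exists.
                conf = sigma_Z / frequent[X]   # c(X => Z-X) = sigma(Z)/sigma(X)
                if conf >= min_conf:
                    rules.append((set(X), set(Z - X), conf))
    return rules
```

Note that support never needs re-checking here: every emitted rule is built from a frequent itemset Z, so s(X ⇒ Z − X) = s(Z) ≥ minsup by construction.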
Frequent itemsets generation using FP-trees (contd.)
Steps (see the construction sketch below):
1. Scan the DB once and find the frequent 1-itemsets (single-item patterns).
2. Order the frequent items in frequency-descending order.
3. Scan the DB again and construct the FP-tree.
Header Table (item : frequency): f:4, c:4, a:3, b:3, m:3, p:3
(FP-tree diagram: root {} with children f:4 and c:1; f:4 → c:3 → a:3, with a:3 → m:2 → p:2 and a:3 → b:1 → m:1; f:4 also has a child b:1; c:1 → b:1 → p:1. Header-table node links omitted.)
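The three steps translate into a short tree-building routine. Below is a minimal sketch, assuming a simple dict-of-children node representation; the `Node`/`build_fp_tree` names and the header-as-list-of-nodes shortcut (standing in for the node-links of the real structure) are choices made here for illustration.

```python
# Minimal FP-tree construction sketch following steps 1-3 above.
from collections import Counter

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}              # item -> Node

def build_fp_tree(transactions, min_support):
    # Step 1: scan the DB once for frequent 1-itemsets.
    freq = Counter(i for t in transactions for i in t)
    freq = {i: c for i, c in freq.items() if c >= min_support}
    root = Node(None, None)
    header = {i: [] for i in freq}      # item -> all nodes holding it
    for t in transactions:
        # Step 2: drop infrequent items and order the rest by descending
        # frequency (stable sort: ties keep their order within t).
        items = sorted((i for i in t if i in freq),
                       key=lambda i: -freq[i])
        # Step 3: insert the ordered items as a path, sharing prefixes.
        node = root
        for i in items:
            if i not in node.children:
                node.children[i] = Node(i, node)
                header[i].append(node.children[i])
            node = node.children[i]
            node.count += 1
    return root, header
```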
FP-Tree Construction Example 2
Transaction Database:
TID  Items
1    {A, B}
2    {B, C, D}
3    {A, C, D, E}
4    {A, D, E}
5    {A, B, C}
6    {A, B, C, D}
7    {B, C}
8    {A, B, C}
9    {A, B, D}
10   {B, C, E}
(FP-tree diagram partially lost in extraction; the surviving nodes read: root null with children A:7 and B:3; under A:7, a child B:5 with child C:3, plus children C:1 and D:1.)
• Compactness
• reduces irrelevant information: infrequent items are gone
• frequency-descending ordering: more frequent items are more likely to be shared
• the FP-tree is never larger than the original database (not counting node-links and counts)
Mining Frequent Patterns Using FP-tree
• Method
• For each item, construct its conditional pattern-base, and then its conditional FP-tree
• Repeat the process on each newly created conditional FP-tree
• Until the resulting FP-tree is empty, or it contains only one path (a single path will generate all the combinations of its sub-paths, each of which is a frequent pattern)
Major Steps to Mine FP-tree
Example: mining the patterns ending in m from the FP-tree constructed earlier.
• m-conditional pattern base: fca:2, fcab:1
• m-conditional FP-tree: {} → f:3 → c:3 → a:3 (b is dropped, since its count of 1 in the pattern base is below minsup)
• All frequent patterns concerning m: m, fm, cm, am, fcm, fam, cam, fcam
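Given the `build_fp_tree` sketch above, an item's conditional pattern base is obtained by walking from each of its nodes up to the root. The transactions used below are an assumption: they are the frequent-items-only, frequency-ordered projections consistent with the header table and tree shown earlier, and with minsup = 3 they reproduce the m-conditional pattern base fca:2, fcab:1.

```python
# Conditional pattern base: for each node holding `item`, collect the
# prefix path up to (but excluding) the root, with that node's count.
def conditional_pattern_base(item, header):
    base = []
    for node in header[item]:
        path, p = [], node.parent
        while p is not None and p.item is not None:
            path.append(p.item)
            p = p.parent
        if path:
            base.append((list(reversed(path)), node.count))
    return base

# Assumed transactions, consistent with the tree shown earlier
# (frequent items only, already in frequency-descending order):
db = [list("fcamp"), list("fcabm"), list("fb"), list("cbp"), list("fcamp")]
root, header = build_fp_tree(db, min_support=3)
print(conditional_pattern_base("m", header))
# -> [(['f', 'c', 'a'], 2), (['f', 'c', 'a', 'b'], 1)]
```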