Association Rule Mining Lesson PDF
Introduction
Study core
Open Universiteit Data Analytics
Learning Unit 7
INTRODUCTION
LEARNING OBJECTIVES
Study hints
This learning unit spans two weeks; it should take 19 hours to complete.
Part of the work is theoretical in nature and involves reading Provost,
pages 289–291. A more detailed discussion concerning the Apriori and
FP-growth algorithms is then provided in this chapter of the workbook.
Finally, students are required to complete the exercises in the assignment
bundle.
STUDY CORE
LHS means ‘left hand side’ and RHS means ‘right hand side’. Both the LHS
and the RHS are usually an item, a set of items or a set of attributes.
Stated otherwise, the LHS implies the RHS, or rather ‘if LHS, then RHS’.
For example, the rule {milk, bread, tomatoes} ⇒ {eggs} can be read as ‘if
the customer buys milk, bread and tomatoes, then it is highly likely that
eggs will be bought too’.
Support, confidence In order to select interesting rules from the dataset, the algorithms use
two important metrics: support and confidence. Support is a number
between 0 and 1 and indicates how frequently that particular rule (LHS
and RHS together) occurs in the dataset. Confidence is also a number
between 0 and 1 and indicates how often the rule has been found to be
true: of all transactions that contain the LHS, the fraction that also
contains the RHS.
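As a small illustration (not part of the original text), the following Python
sketch computes both metrics for the rule ‘if {milk, bread, tomatoes}, then
{eggs}’; the toy transactions are made up solely for this example.

    # Toy example: support and confidence for the rule LHS -> RHS.
    # The transactions below are illustrative, not taken from the lesson.
    transactions = [
        {"milk", "bread", "tomatoes", "eggs"},
        {"milk", "bread", "tomatoes"},
        {"bread", "eggs"},
        {"milk", "bread", "tomatoes", "eggs"},
        {"milk", "eggs"},
    ]

    def support(itemset, transactions):
        # Fraction of transactions that contain every item of the itemset.
        itemset = set(itemset)
        return sum(itemset <= t for t in transactions) / len(transactions)

    def confidence(lhs, rhs, transactions):
        # Of the transactions containing the LHS, the fraction that also
        # contains the RHS.
        return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)

    lhs, rhs = {"milk", "bread", "tomatoes"}, {"eggs"}
    print(support(lhs | rhs, transactions))   # 0.4: the rule holds in 2 of 5 transactions
    print(confidence(lhs, rhs, transactions)) # ~0.67: eggs in 2 of the 3 LHS transactions

Note that confidence is simply the support of the whole rule divided by the
support of the LHS alone.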
2 Apriori Algorithm
In order to further prune the returned itemsets (since these may contain a
large number of items if only T is defined), the user can also specify a
support threshold ε (support as defined in Section 1 of this chapter),
stating the percentage of transactions in which an itemset has to appear in
order to be considered a relevant association rule.
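As a rough sketch of this idea (an illustration, not the textbook
formulation), the following Python function generates frequent itemsets
level by level: candidate itemsets of size k are built only from frequent
itemsets of size k − 1, and any candidate whose support falls below the
user-specified ε is discarded. The function name and the exact
candidate-generation details are assumptions made for the example.

    from itertools import combinations

    def apriori_frequent_itemsets(transactions, eps):
        # Return all itemsets whose support is at least eps.
        n = len(transactions)

        def sup(itemset):
            return sum(itemset <= t for t in transactions) / n

        # Level 1: frequent single items.
        items = {i for t in transactions for i in t}
        levels = [{frozenset([i]) for i in items if sup(frozenset([i])) >= eps}]

        k = 2
        while levels[-1]:
            prev = levels[-1]
            # Join step: combine frequent (k-1)-itemsets into k-item candidates.
            candidates = {a | b for a in prev for b in prev if len(a | b) == k}
            # Prune step: every (k-1)-subset must be frequent, and the
            # candidate itself must reach the support threshold eps.
            levels.append({c for c in candidates
                           if all(frozenset(s) in prev for s in combinations(c, k - 1))
                           and sup(c) >= eps})
            k += 1

        return [s for level in levels for s in level]

Called with, say, ε = 0.3, the function returns every itemset present in at
least 30% of the transactions.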
3 FP-Growth Algorithm
FP-tree In the first step, the algorithm builds a compact data structure
called the FP-tree. In the second step, the algorithm extracts the frequent
itemsets directly from the FP-tree.
Item frequency parameter φ In addition, the algorithm has a user-defined parameter φ that
sets the threshold for which items are considered frequent.
As an illustrative example, consider the following dataset of
transactions:
Transaction ID   Items
1                {apricots, bread}
2                {bread, carrots, dumplings}
3                {apricots, carrots, dumplings, eggs}
4                {apricots, dumplings, eggs}
5                {apricots, bread, carrots}
6                {apricots, bread, carrots, dumplings}
7                {apricots}
8                {apricots, bread, carrots}
9                {apricots, bread, dumplings}
10               {bread, carrots, eggs}
The first transaction, for example, means that apricots and bread were
bought together.
This is continued in the same fashion until the entire dataset has been
mapped. Figure 1 shows a depiction of the algorithm used to create the
FP-tree.
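To make the first step concrete, here is one possible way such an FP-tree
could be built in Python for the ten example transactions above (a sketch
under assumptions: the Node class and helper names are made up for the
illustration, and φ = 3 is only a sample value for the threshold). Items
below the threshold φ are dropped, the remaining items of each transaction
are ordered by decreasing overall frequency, and each transaction is then
inserted as a path into the tree, incrementing the counts along the way.

    from collections import Counter

    class Node:
        # One node of the FP-tree: an item, its count and its children.
        def __init__(self, item):
            self.item = item
            self.count = 0
            self.children = {}

    def build_fp_tree(transactions, phi):
        # Keep only items whose absolute frequency is at least phi.
        freq = Counter(i for t in transactions for i in t)
        frequent = {i for i, c in freq.items() if c >= phi}

        root = Node(None)
        for t in transactions:
            # Order the frequent items by decreasing overall frequency so
            # that transactions with common items share a prefix in the tree.
            path = sorted((i for i in t if i in frequent),
                          key=lambda i: (-freq[i], i))
            node = root
            for item in path:
                node = node.children.setdefault(item, Node(item))
                node.count += 1
        return root, freq

    transactions = [
        {"apricots", "bread"},
        {"bread", "carrots", "dumplings"},
        {"apricots", "carrots", "dumplings", "eggs"},
        {"apricots", "dumplings", "eggs"},
        {"apricots", "bread", "carrots"},
        {"apricots", "bread", "carrots", "dumplings"},
        {"apricots"},
        {"apricots", "bread", "carrots"},
        {"apricots", "bread", "dumplings"},
        {"bread", "carrots", "eggs"},
    ]
    tree, freq = build_fp_tree(transactions, phi=3)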
Prefix path subtree After creating the FP-tree, the algorithm proceeds to identify the frequent
itemsets by using the FP-tree. In order to do this, the algorithm takes
each of the items and checks the entire prefix path subtree preceding that
item. For the eggs item, the prefix path subtree is as shown in Figure 2.
Informally speaking, the prefix path subtree is nothing more than the
part of the tree above the eggs item.
Conditional FP-tree The main issue now is to identify which of these itemsets are the most
relevant. In order to do this, FP-growth uses the prefix path subtree to
create an FP-tree around the eggs item. This is called a conditional FP-tree
(for the eggs item). To do this, the eggs item is removed from the prefix
path subtree, since it is no longer needed. The goal is to see what happens
in the prefix of the transactions, or in other words, what is frequent
simultaneously with the eggs item. Figure 4 illustrates such a tree for the
eggs item.
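As a rough illustration of this idea (reusing the transactions list and
φ = 3 from the sketch above), the snippet below collects the prefix of
frequent items that precedes the eggs item in each transaction, i.e. its
conditional pattern base, and counts how often each item occurs in those
prefixes. In the full algorithm this information is read from the FP-tree
itself via node links; reading it from the transactions directly is a
simplification made here for brevity.

    from collections import Counter

    def conditional_pattern_base(transactions, phi, item):
        # For every transaction containing `item`, collect the prefix of
        # frequent items that precede it in the frequency ordering.
        freq = Counter(i for t in transactions for i in t)
        frequent = {i for i, c in freq.items() if c >= phi}
        base = []
        for t in transactions:
            if item not in t:
                continue
            path = sorted((i for i in t if i in frequent),
                          key=lambda i: (-freq[i], i))
            base.append(path[:path.index(item)])
        return base

    base = conditional_pattern_base(transactions, phi=3, item="eggs")
    # Counting the items in the three prefixes of 'eggs': carrots (like
    # apricots and dumplings) appears in two of them.
    print(Counter(i for path in base for i in path))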
Thus, with respect to carrots, the count is ≥ 2. After applying this
procedure recursively a number of times, the following itemsets are found
to be the frequent ones for this dataset:
References