Unit-4 DM
UNIT 4
Frequent Itemset Generation: Apriori principle, Apriori algorithm and examples, FP-growth algorithm and examples.
Apriori Algorithm
The Apriori algorithm is an influential algorithm for mining frequent itemsets for Boolean association rules. It is also called the level-wise algorithm. It was proposed by Agrawal and Srikant in 1994 and is the most popular algorithm for finding all the frequent itemsets. It makes use of the downward closure property. As the name suggests, the algorithm performs a bottom-up search, moving upward level-wise in the lattice. An important feature of the method is that, before reading the database at every level, it prunes many of the sets which are unlikely to be frequent.
The first pass of the algorithm simply counts item occurrences to determine the frequent 1-itemsets. A subsequent pass, say pass k, consists of two phases. First, the frequent itemsets Lk-1 found in the (k-1)th pass are used to generate the candidate itemsets Ck, using the a priori candidate-generation procedure described below. Next, the database is scanned and the support of the candidates in Ck is counted. For fast counting, we need to efficiently determine the candidates in Ck that are contained in a given transaction t. The set of candidate itemsets is subjected to a pruning process which ensures that all the subsets of each candidate set are already known to be frequent itemsets. The candidate-generation process and the pruning process are the most important parts of this algorithm.
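As a rough illustration (not from the original text), the simplest way to count supports is to test every candidate against every transaction; in practice, structures such as hash trees are used to speed this step up. A minimal Python sketch, assuming candidate itemsets are frozensets and transactions are sets:

def count_support(Ck, transactions):
    # Naive support counting: test every candidate itemset against every transaction.
    counts = {c: 0 for c in Ck}
    for t in transactions:
        for c in counts:
            if c <= t:            # candidate c is contained in transaction t
                counts[c] += 1
    return counts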
Key Concepts:
• Frequent itemsets: the sets of items which have minimum support (the set of frequent k-itemsets is denoted by Lk).
• Apriori property: any subset of a frequent itemset must be frequent.
• Join operation: to find Lk, a set of candidate k-itemsets is generated by joining Lk-1 with itself.
Candidate Generation
Given Lk-1, the set of all frequent (k-1)-itemsets, we want to generate a superset of the set of all frequent k-itemsets. The intuition behind the a priori candidate-generation procedure is that if an itemset X has minimum support, so do all subsets of X. Let us assume that the set of frequent 3-itemsets is {1,2,3}, {1,2,5}, {1,3,5}, {2,3,5}, {2,3,4}. Then the 4-itemsets generated as candidates must be supersets of these 3-itemsets and, in addition, all the 3-itemset subsets of any candidate 4-itemset (so generated) must already be known to be in L3. The first requirement is handled by the a priori candidate-generation method; the pruning algorithm that follows removes the candidate sets which do not meet the second criterion. The candidate-generation step is sketched below:
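A minimal sketch of the standard join-based generation (in Python, with illustrative names; itemsets are assumed to be frozensets whose items can be kept in sorted order):

from itertools import combinations

def apriori_gen(Lk_1, k):
    # Join step: merge pairs of frequent (k-1)-itemsets that agree on
    # their first k-2 items (when each itemset is written in sorted order).
    prev = sorted(sorted(itemset) for itemset in Lk_1)
    Ck = set()
    for a, b in combinations(prev, 2):
        if a[:k - 2] == b[:k - 2]:
            Ck.add(frozenset(a) | frozenset(b))
    return Ck

For the frequent 3-itemsets listed above, this join produces the candidate 4-itemsets {1,2,3,5} and {2,3,4,5}; the pruning step described next removes {2,3,4,5}.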
Pruning
The pruning step eliminates, from consideration during support counting, those candidate k-itemsets that have a (k-1)-subset which is not known to be frequent. For example, from C4 the itemset {2,3,4,5} is pruned, since one of its 3-subsets, {2,4,5}, is not in L3. The pruning algorithm is described below.
prune(Ck)
    for all c ∈ Ck do
        for all (k-1)-subsets d of c do
            if d ∉ Lk-1 then
                Ck := Ck \ {c}
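An equivalent sketch in Python (again assuming itemsets are represented as frozensets):

from itertools import combinations

def prune(Ck, Lk_1, k):
    # Keep only candidates all of whose (k-1)-subsets are frequent.
    return {c for c in Ck
            if all(frozenset(d) in Lk_1 for d in combinations(c, k - 1))}

Applied to the candidates generated above, only {1,2,3,5} survives, since every 3-subset of {1,2,3,5} is in L3.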
The a priori frequent-itemset discovery algorithm uses these two functions (candidate generation and pruning) at every iteration. It moves forward in the lattice, starting from level 1, until it reaches a level k at which no candidate set remains after pruning. The answer is the union of the frequent itemsets found at every level:
Answer := ∪k Lk
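Putting the pieces together, the level-wise loop can be sketched as follows, reusing the illustrative count_support, apriori_gen, and prune functions from the earlier sketches (minsup is the minimum support count):

def apriori(transactions, minsup):
    # Level 1: frequent individual items.
    items = {i for t in transactions for i in t}
    counts = count_support({frozenset({i}) for i in items}, transactions)
    Lk = {c for c, n in counts.items() if n >= minsup}
    answer, k = set(Lk), 2
    # Move up the lattice level by level until no candidate survives pruning.
    while Lk:
        Ck = prune(apriori_gen(Lk, k), Lk, k)
        counts = count_support(Ck, transactions)   # one database scan per level
        Lk = {c for c, n in counts.items() if n >= minsup}
        answer |= Lk
        k += 1
    return answer                                  # Answer := ∪k Lk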
For illustration, the support counts of the individual items (1-itemsets) in a sample database are listed below.
X        Support count
{1} 2
{2} 6
{3} 6
{4} 4
{5} 8
{6} 5
{7} 7
{8} 4
{9} 2
Computationally Expensive. Even though the Apriori algorithm reduces the number of candidate itemsets to consider, this number can still be huge when store inventories are large or when the support threshold is low. An alternative is to reduce the number of comparisons by using advanced data structures, such as hash tables, to store and look up candidate itemsets more efficiently.
Spurious Associations. Analysis of large inventories would involve more itemset
configurations, and the support threshold might have to be lowered to detect certain
associations. However, lowering the support threshold might also increase the number of
spurious associations detected. To ensure that identified associations are generalizable, they
could first be distilled from a training dataset, before having their support and confidence
assessed in a separate test dataset.
FP-Tree
Definition: A frequent pattern tree (or FP-tree) is a tree structure consisting of an item-prefix tree and a frequent-item header table.
• Item-prefix tree:
o It consists of a root node labelled null.
o Each non-root node consists of three fields:
▪ Item name,
▪ Support count, and
▪ Node link.
• Frequent-item header table: It consists of two fields:
▪ Item name;
▪ Head of node link, which points to the first node in the FP-tree carrying the item name.
It may be noted that the FP-tree depends on the support threshold 𝜎: for different values of 𝜎, the trees are different. Another characteristic feature of the FP-tree is that it depends on the ordering of the items. The ordering followed in the original paper is the decreasing order of the support counts, though different orderings may offer different advantages. The header table is arranged in this order of the frequent items.
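As an illustrative (not prescriptive) data-structure sketch in Python, each node carries the three fields listed above, and the header table maps each frequent item to the head of its node-link chain; the class and field names here are assumptions:

class FPNode:
    def __init__(self, item, parent=None):
        self.item = item            # item name (None for the null root)
        self.count = 0              # support count
        self.parent = parent
        self.children = {}          # item name -> child FPNode
        self.node_link = None       # next node carrying the same item name

root = FPNode(None)                 # root node labelled null
header_table = {}                   # item name -> first node carrying that item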
We make one scan of the database T and compute L1, the set of frequent 1-itemsets. For convenience, let us call this the set of frequent items, and keep it sorted in decreasing order of the frequency counts. From this stage onwards, the algorithm ignores all the non-frequent items of a transaction and views any transaction as a list of frequent items in decreasing order of frequency. Without any ambiguity, we can refer to the transaction t as such a list. The first element of the list corresponding to any given transaction is the most frequent item among the items supported by t. For a list t, we denote by head_t its first element and by body_t the remaining part of the list (the portion of t after removal of head_t). Thus, t is [head_t | body_t]. The FP-tree construction grows the tree recursively using these concepts.
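A minimal sketch of this recursive insertion, built on the illustrative FPNode and header_table structures above (insert_tree is an assumed helper name, modelled directly on the [head_t | body_t] description):

def insert_tree(t, node, header_table):
    # t = [head_t | body_t]: a transaction written as a list of frequent
    # items in decreasing order of frequency.
    if not t:
        return
    head, body = t[0], t[1:]
    child = node.children.get(head)
    if child is None:
        child = FPNode(head, parent=node)
        node.children[head] = child
        # Append the new node to head's node-link chain, so the header
        # table keeps pointing at the first node carrying head.
        if head in header_table:
            last = header_table[head]
            while last.node_link is not None:
                last = last.node_link
            last.node_link = child
        else:
            header_table[head] = child
    child.count += 1                           # share the existing prefix
    insert_tree(body, child, header_table)     # recurse on body_t

On an empty tree, insert_tree([5, 6, 8], root, header_table) creates a single branch 5 → 6 → 8 with all counts equal to 1, which is the first branch constructed in the example below.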
Example
Let us consider the following database. The frequent items are 2, 3, 4, 5, 6, 7, and 8. If we sort them in decreasing order of their frequency, they appear in the order 5, 3, 4, 7, 2, 6, and 8.
A1 A2 A3 A4 A5 A6 A7 A8 A9
1 0 0 0 1 1 0 1 0
0 1 0 1 0 0 0 1 0
0 0 0 1 1 0 1 0 0
0 0 1 0 0 0 0 0 0
0 0 0 0 1 1 1 0 0
0 1 1 1 0 0 0 0 0
0 1 0 0 0 1 1 0 1
0 0 0 0 1 0 0 0 0
0 0 0 0 0 0 0 1 0
0 0 1 1 1 0 1 0 0
0 0 1 1 1 0 1 0 0
0 0 0 0 1 1 0 1 0
0 1 1 1 0 1 1 0 0
1 0 1 1 1 0 1 0 0
0 1 1 0 0 0 0 0 1
If the transactions are written in terms of only frequent items, then the transactions are -
5, 6, 8
4, 2, 8
5, 4, 7
3
5, 7, 6
3, 4, 2
7, 2, 6
5
8
5, 3, 4, 7
5, 3, 4, 7
5, 6, 8
3, 4, 7, 2, 6
5, 3, 4, 7
3, 2
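The projection of a raw transaction onto the ordered frequent items can be sketched as follows (the order list is the one derived above):

order = [5, 3, 4, 7, 2, 6, 8]              # frequent items, most frequent first

def project(transaction, order=order):
    # Keep only the frequent items, in decreasing order of frequency.
    return [i for i in order if i in transaction]

print(project({1, 5, 6, 8}))               # first transaction -> [5, 6, 8]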
The scan of the first transaction leads to the construction of the first branch of the tree, as shown in the figure. Notice that the branch is not ordered in the same way as the items appear in the transaction in the database; the items are ordered according to the decreasing order of frequency of the frequent items. The complete FP-tree, after all the transactions have been scanned, is shown in the next figure.