UNIT 4

Frequent Itemset Generation: Apriori principle, Apriori algorithm and examples, FP-growth algorithm and examples.

Apriori Algorithm

The Apriori algorithm is an influential algorithm for mining frequent itemsets for Boolean association rules. It is also called the level-wise algorithm. It was proposed by Agrawal and Srikant in 1994 and is the most popular algorithm for finding all the frequent itemsets. It makes use of the downward closure property. As the name suggests, the algorithm performs a bottom-up search, moving upward level-wise in the itemset lattice. An important feature of the method is that, before reading the database at every level, it prunes many of the candidate sets that cannot be frequent.
The first pass of the algorithm simply counts item occurrences to determine the frequent 1-itemsets. A subsequent pass, say pass k, consists of two phases. First, the frequent itemsets Lk-1 found in the (k-1)th pass are used to generate the candidate itemsets Ck, using the a priori candidate generation procedure described below. Next, the database is scanned and the support of the candidates in Ck is counted. For fast counting, we need to efficiently determine which candidates in Ck are contained in a given transaction t. Before counting, the set of candidate itemsets is subjected to a pruning process, which ensures that every (k-1)-subset of a candidate is already known to be a frequent itemset. The candidate generation and pruning processes are the most important parts of this algorithm.
Key Concepts:

• Frequent Itemsets: The sets of items that have minimum support (denoted by Lk for the set of frequent k-itemsets).
• Apriori Property: Any subset of a frequent itemset must itself be frequent.
• Join Operation: To find Lk, a set of candidate k-itemsets is generated by joining Lk-1 with itself.

The Apriori Algorithm in a Nutshell


• Find the frequent itemsets: the sets of items that have minimum support
– A subset of a frequent itemset must also be a frequent itemset
• i.e., if {A, B} is a frequent itemset, both {A} and {B} must also be frequent itemsets
– Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets)
• Use the frequent itemsets to generate association rules.

Candidate Generation
Given Lk-1, the set of all frequent (k-1)-itemsets, we want to generate a superset of the set of all frequent k-itemsets. The intuition behind the a priori candidate-generation procedure is that if an itemset X has minimum support, then so do all subsets of X. Let us assume that the set of frequent 3-itemsets is {1,2,3}, {1,2,5}, {1,3,5}, {2,3,5}, {2,3,4}. Then, the 4-itemsets generated as candidate itemsets must be supersets of these 3-itemsets and, in addition, every 3-itemset subset of any candidate 4-itemset (so generated) must already be known to be in L3. The first requirement is handled by the a priori candidate-generation method; the pruning algorithm that follows removes the candidate sets that do not meet the second criterion. The candidate generation method is described below:

gen_candidate_itemsets with the given Lk-1 as follows:

Ck := ∅
for all itemsets l1 ∈ Lk-1 do
    for all itemsets l2 ∈ Lk-1 do
        if l1[1] = l2[1] ∧ l1[2] = l2[2] ∧ ... ∧ l1[k-2] = l2[k-2] ∧ l1[k-1] < l2[k-1]
        then c := {l1[1], l1[2], ..., l1[k-1], l2[k-1]}
             Ck := Ck ∪ {c}

Using this algorithm, C4 = {{1,2,3,5}, {2,3,4,5}} is obtained from

L3 = {{1,2,3}, {1,2,5}, {1,3,5}, {2,3,5}, {2,3,4}}
{1,2,3,5} is generated from {1,2,3} and {1,2,5}.
Similarly, {2,3,4,5} is generated from {2,3,4} and {2,3,5}.
No other pair of 3-itemsets satisfies the condition
l1[1] = l2[1] ∧ l1[2] = l2[2] ∧ ... ∧ l1[k-2] = l2[k-2] ∧ l1[k-1] < l2[k-1].
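
As an illustration (not part of the original text), the join step above can be sketched in Python, assuming each frequent (k-1)-itemset is stored as a sorted tuple of items:

    def gen_candidate_itemsets(L_prev, k):
        """Join L(k-1) with itself to produce the candidate k-itemsets C(k)."""
        candidates = set()
        for l1 in L_prev:
            for l2 in L_prev:
                # Join only if the first k-2 items agree and the last item of l1
                # is smaller than the last item of l2 (avoids duplicate candidates).
                if l1[:k - 2] == l2[:k - 2] and l1[k - 2] < l2[k - 2]:
                    candidates.add(l1 + (l2[k - 2],))
        return candidates

    # The example above: joining the frequent 3-itemsets yields two candidate 4-itemsets.
    L3 = {(1, 2, 3), (1, 2, 5), (1, 3, 5), (2, 3, 5), (2, 3, 4)}
    print(gen_candidate_itemsets(L3, 4))    # {(1, 2, 3, 5), (2, 3, 4, 5)} (set order may vary)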

Pruning
The pruning step eliminates those candidate k-itemsets which have a (k-1)-subset that is not frequent, so that they need not be considered during support counting. For example, from C4 the itemset {2,3,4,5} is pruned, since not all of its 3-subsets are in L3 (clearly, {2,4,5} is not in L3). The pruning algorithm is described below.

prune(Ck)
for all c ∈ Ck do
    for all (k-1)-subsets d of c do
        if d ∉ Lk-1
        then Ck := Ck \ {c}
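
The pruning step can be sketched similarly (again illustrative only, with the same sorted-tuple representation as above):

    from itertools import combinations

    def prune(candidates, L_prev, k):
        """Keep only those candidates whose (k-1)-subsets are all in L(k-1)."""
        return {c for c in candidates
                if all(sub in L_prev for sub in combinations(c, k - 1))}

    # Continuing the example: {2,3,4,5} is dropped because {2,4,5} is not in L3.
    C4 = {(1, 2, 3, 5), (2, 3, 4, 5)}
    L3 = {(1, 2, 3), (1, 2, 5), (1, 3, 5), (2, 3, 5), (2, 3, 4)}
    print(prune(C4, L3, 4))                 # {(1, 2, 3, 5)}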

The a priori frequent itemset discovery algorithm uses these two functions (candidate generation and pruning) at every iteration. It moves up the lattice level by level, starting from level 1, until a level is reached at which no candidate set survives pruning.

The Apriori Algorithm:

Initialize: k := 1; C1 := all the 1-itemsets;
Read the database to count the support of C1 to determine L1;
L1 := {frequent 1-itemsets};
k := 2;  // k represents the pass number
while (Lk-1 ≠ ∅) do
begin
    Ck := gen_candidate_itemsets(Lk-1);
    prune(Ck);
    for all transactions t ∈ T do
        increment the count of all candidates in Ck that are contained in t;
    Lk := all candidates in Ck with minimum support;
    k := k + 1;
end

Answer := ∪k Lk;
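
Putting the two functions together, the level-wise loop above can be sketched in Python. This is only an illustrative sketch under the assumptions used so far (transactions given as sets of items, itemsets as sorted tuples, a minimum support count min_sup); it is not a reference implementation, and the join and prune helpers are repeated here so the sketch runs on its own.

    from collections import defaultdict
    from itertools import combinations

    def gen_candidate_itemsets(L_prev, k):
        # Join step: combine (k-1)-itemsets that share their first k-2 items.
        return {l1 + (l2[k - 2],)
                for l1 in L_prev for l2 in L_prev
                if l1[:k - 2] == l2[:k - 2] and l1[k - 2] < l2[k - 2]}

    def prune(candidates, L_prev, k):
        # Prune step: every (k-1)-subset of a candidate must already be frequent.
        return {c for c in candidates
                if all(sub in L_prev for sub in combinations(c, k - 1))}

    def apriori(transactions, min_sup):
        """Return all frequent itemsets (sorted tuples) with support count >= min_sup."""
        counts = defaultdict(int)
        for t in transactions:                      # pass 1: count single items
            for item in t:
                counts[(item,)] += 1
        L = {c for c, n in counts.items() if n >= min_sup}
        answer = {c: counts[c] for c in L}
        k = 2
        while L:
            Ck = prune(gen_candidate_itemsets(L, k), L, k)
            counts = defaultdict(int)
            for t in transactions:                  # one database scan per level
                t = set(t)
                for c in Ck:
                    if set(c) <= t:
                        counts[c] += 1
            L = {c for c, n in counts.items() if n >= min_sup}
            answer.update({c: counts[c] for c in L})
            k += 1
        return answer

    # Hypothetical usage on a toy database:
    # apriori([{1, 2, 3}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}], min_sup=2)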

Apriori Algorithm by Example

We illustrate the working of the algorithm with the example discussed above.
k := 1
Read the database to count the support of the 1-itemsets (Table 1). The 1-itemsets and their support counts are given below.

X Support Count
{1} 2
{2} 6
{3} 6
{4} 4
{5} 8
{6} 5
{7} 7
{8} 4
{9} 2

L1 := {(2) → 6, (3) → 6, (4) → 4, (5) → 8, (6) → 5, (7) → 7, (8) → 4}


k := 2
In the candidate generation step, we get
C2 := {(2,3), (2,4), (2,5), (2,6), (2,7), (2,8), (3,4), (3,5), (3,6), (3,7), (3,8), (4,5), (4,6), (4,7), (4,8), (5,6), (5,7), (5,8), (6,7), (6,8), (7,8)}
The pruning step does not change C2.
Read the database to count the support of the elements in C2 to get
L2 := {(2,3) → 3, (2,4) → 3, (3,5) → 3, (3,7) → 3, (5,6) → 3, (5,7) → 5, (6,7) → 3}
k := 3
In the candidate generation step,
using {2,3} and {2,4}, we get {2,3,4},
using {3,5} and {3,7}, we get {3,5,7}, and
similarly, from {5,6} and {5,7}, we get {5,6,7}.
C3 := {{2,3,4}, {3,5,7}, {5,6,7}}
The pruning step prunes {2,3,4}, since not all of its subsets of size 2, i.e. {2,3}, {2,4}, {3,4}, are present in L2 ({3,4} is missing). The other two itemsets are retained.
Thus the pruned C3 is {{3,5,7}, {5,6,7}}.
Read the database to count the support of the itemsets in C3 to get
L3 := {{3,5,7} → 3}
k := 4
Since L3 contains only one element, C4 is empty and hence the algorithm stops, returning the set of frequent itemsets along with their respective support values as
L := L1 ∪ L2 ∪ L3
Limitations

Computationally Expensive. Even though the Apriori algorithm reduces the number of candidate itemsets to consider, this number can still be huge when store inventories are large or when the support threshold is low. One way to reduce the number of comparisons is to use more advanced data structures, such as hash tables, to store and look up candidate itemsets more efficiently.
Spurious Associations. Analysis of large inventories involves many more itemset configurations, and the support threshold might have to be lowered to detect certain associations. However, lowering the support threshold also increases the number of spurious associations detected. To ensure that the identified associations are generalizable, they could first be mined from a training dataset and then have their support and confidence assessed on a separate test dataset.

FP-Tree
Definition: A frequent pattern tree (or FP-tree) is a tree structure consisting of an item-prefix tree and a frequent-item header table.
• Item-prefix tree:
o It consists of a root node labeled null.
o Each non-root node consists of three fields:
▪ Item name,
▪ Support count, and
▪ Node link.
• Frequent-item header table: It consists of two fields:
▪ Item name;
▪ Head of node link, which points to the first node in the FP-tree carrying that item name.
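
For illustration only (the names here are ours, not from the text), the node and header-table fields described above might be represented in Python as:

    class FPNode:
        """One node of the item-prefix tree."""
        def __init__(self, item, parent=None):
            self.item = item          # item name (None for the root node labeled null)
            self.count = 0            # support count of the path ending at this node
            self.parent = parent
            self.children = {}        # item name -> child FPNode
            self.node_link = None     # next node in the tree carrying the same item name

    # Frequent-item header table: item name -> head of that item's node-link chain.
    header_table = {}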
It may be noted that the FP-tree depends on the support threshold σ; for different values of σ, the trees are different. Another characteristic feature of the FP-tree is that it depends on the ordering of the items. The ordering followed in the original paper is the decreasing order of the support counts, although different orderings may offer different advantages. Thus, the header table is arranged in this order of the frequent items.
We make one scan of the database T and compute L1, the set of frequent 1-itemsets. For convenience, let us call this the set of frequent items, and refer to L1 in the decreasing order of frequency counts. From this stage onwards, the algorithm ignores all non-frequent items in a transaction and views every transaction as a list of frequent items in decreasing order of frequency. Without any ambiguity, we can refer to the transaction t as such a list. The first element of the list corresponding to any given transaction is the most frequent item among the items contained in t. For a list t, we denote by head_t its first element and by body_t the remaining part of the list (the portion of t after removal of head_t). Thus, t is [head_t|body_t]. The FP-tree construction grows the tree recursively using these concepts.
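
As a small illustrative sketch of this preprocessing (the ordering used here is the one from the example later in this section), a transaction can be turned into its ordered list of frequent items as follows:

    def as_frequent_item_list(transaction, rank):
        """rank: item -> position in the global frequency ordering (0 = most frequent)."""
        return sorted((i for i in transaction if i in rank), key=lambda i: rank[i])

    # With the ordering 5, 3, 4, 7, 2, 6, 8 used in the example below, the transaction
    # {1, 5, 6, 8} becomes the list [5, 6, 8]; here head_t = 5 and body_t = [6, 8].
    rank = {item: pos for pos, item in enumerate([5, 3, 4, 7, 2, 6, 8])}
    print(as_frequent_item_list({1, 5, 6, 8}, rank))    # [5, 6, 8]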

FP-tree Construction Algorithm

Create a root node root of the FP-tree and label it null.
do for every transaction t
    if t is not empty
        insert(t, root)
    link the new nodes to other nodes carrying the same item name via the node links originating from the header table
end do
return FP-tree

insert(t, any_node)
    if t is not empty then
        if any_node has a child node labeled head_t
        then increment the link count between any_node and that child by 1
        else create a new child node of any_node labeled head_t, with link count 1
        call insert(body_t, the child node labeled head_t)
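
A Python sketch of this construction is given below. It is illustrative only: counts are stored on the nodes rather than on the links (which is equivalent), new nodes are simply prepended to the node-link chain, and the names follow the FPNode representation assumed earlier.

    class FPNode:
        def __init__(self, item, parent=None):
            self.item = item              # item name (None for the root)
            self.count = 0                # support count along this path
            self.parent = parent
            self.children = {}            # item name -> child FPNode
            self.node_link = None         # next node carrying the same item name

    def insert(t, any_node, header):
        """Recursively insert the ordered frequent-item list t = [head_t | body_t]."""
        if not t:
            return
        head, body = t[0], t[1:]
        child = any_node.children.get(head)
        if child is None:
            child = FPNode(head, parent=any_node)
            any_node.children[head] = child
            # Hook the new node into the node-link chain of the header table.
            child.node_link = header.get(head)
            header[head] = child
        child.count += 1
        insert(body, child, header)

    def build_fp_tree(transactions, rank):
        """transactions: iterable of item sets; rank: item -> frequency order."""
        root, header = FPNode(None), {}
        for t in transactions:
            ordered = sorted((i for i in t if i in rank), key=lambda i: rank[i])
            if ordered:
                insert(ordered, root, header)
        return root, header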

Example
Let us consider the following database (one transaction per row; a 1 in column Ai means that item i is present in the transaction). The frequent items are 2, 3, 4, 5, 6, 7, and 8. Sorted in decreasing order of frequency, they appear in the order 5, 3, 4, 7, 2, 6, and 8.

A1 A2 A3 A4 A5 A6 A7 A8 A9
1 0 0 0 1 1 0 1 0
0 1 0 1 0 0 0 1 0
0 0 0 1 1 0 1 0 0
0 0 1 0 0 0 0 0 0
0 0 0 0 1 1 1 0 0
0 1 1 1 0 0 0 0 0
0 1 0 0 0 1 1 0 1
0 0 0 0 1 0 0 0 0
0 0 0 0 0 0 0 1 0
0 0 1 1 1 0 1 0 0
0 0 1 1 1 0 1 0 0
0 0 0 0 1 1 0 1 0
0 1 1 1 0 1 1 0 0
1 0 1 1 1 0 1 0 0
0 1 1 0 0 0 0 0 1

If the transactions are written in terms of the frequent items only (in decreasing order of frequency), they are:
5, 6, 8
4, 2, 8
5, 4, 7
3
5, 7, 6
3, 4, 2
7, 2, 6
5
8
5, 3, 4, 7
5, 3, 4, 7
5, 6, 8
3, 4, 7, 2, 6
5, 3, 4, 7
3, 2
The scan of the first transaction leads to the construction of the first branch of the tree, as shown in the figure below. Notice that the branch is not ordered in the same way as the transaction appears in the database; the items are ordered in decreasing order of frequency of the frequent items. The complete tree is given in the next figure.

Illustration of insert (t, root) operation.

Complete FP - tree of the above example. Labels on edges represent frequency.


It is now easy to read the frequent itemsets from the FP-tree. The algorithm starts from the leaf nodes and works bottom-up. Thus, after processing item 6, it processes item 7. Let us consider, for instance, the frequent item {6}. There are four paths to 6 from the root; these are {5, 6, 7}, {5, 6}, {4, 7, 2, 6} and {7, 2, 6}. All these paths have a label of 1, where the label of a path is the smallest of the link counts of its links. Thus, each of these combinations appears just once. The paths {5, 6, 7}, {5, 6}, {4, 7, 2, 6} and {7, 2, 6} from the root to the nodes labeled 6 are called the prefix subpaths of 6. In general, the prefix subpaths of a node a are the paths from the root to the nodes labeled a; the frequency count of every link along such a path is adjusted so that it equals the count of the link incident on a along that path. The result is called a transformed prefix path. The transformed prefix paths of a node a form a truncated database of patterns which co-occur with a, called the conditional pattern base of a. Once the conditional pattern base is derived from the FP-tree, one can compute all the frequent patterns associated with a from it, by creating a small conditional FP-tree for a. The process is carried out recursively, starting from the leaf nodes.
The conditional pattern base for {6} is the following:
For the prefix subpath {5, 6, 7}, we get {5, 6} → 1, {7, 6} → 1, {5, 6, 7} → 1.
For the prefix subpath {5, 6}, we get {5, 6} → 1.
For the path {4, 7, 2, 6}, we have
(4, 6), (7, 6), (2, 6), (4, 7, 6), (4, 2, 6), (7, 2, 6), (4, 7, 2, 6), and for the path (7, 2, 6) we have (7, 6), (2, 6), (7, 2, 6), all with label 1.
Thus, adding the counts over all the prefix subpaths, the only frequent pattern containing 6 is {7, 6} → 3.
The figure below illustrates the conditional pattern base for item 7. Note that since the processing is bottom-up, the combination {6, 7} has already been considered while processing item 6.
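
For completeness, a conditional pattern base can be read off the tree by following an item's node links and walking each occurrence up to the root. The sketch below is illustrative only and reuses the FPNode/header representation assumed in the earlier snippets.

    def conditional_pattern_base(item, header):
        """Return (prefix_path, count) pairs for the given item.

        header maps an item name to the head of its node-link chain; each node has
        the .item, .count, .parent and .node_link fields sketched earlier.
        """
        base = []
        node = header.get(item)
        while node is not None:
            # Walk up from this occurrence of the item to the root, collecting
            # the prefix path (the item itself is excluded).
            path, parent = [], node.parent
            while parent is not None and parent.item is not None:
                path.append(parent.item)
                parent = parent.parent
            if path:
                # Every link on the transformed prefix path takes the count of
                # the node for the item at the end of the path.
                base.append((list(reversed(path)), node.count))
            node = node.node_link
        return base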

Conditional pattern base for item 7
