fp-growth
The FP-Growth algorithm was proposed by Han. It is an efficient and scalable method for
mining the complete set of frequent patterns by pattern fragment growth, using an extended
prefix-tree structure, named the frequent-pattern tree (FP-tree), to store compressed and
crucial information about frequent patterns. In his study, Han showed that the method
outperforms other popular methods for mining frequent patterns, e.g. the Apriori algorithm and
TreeProjection. Some later works reported that FP-Growth also performs better than other
methods, including Eclat and Relim. The popularity and efficiency of the FP-Growth algorithm
have motivated many studies that propose variations to improve its performance.
FP-Growth follows a divide-and-conquer strategy. Using this strategy, it reduces the search
costs by recursively looking for short patterns and then concatenating them into long
frequent patterns.
In large databases, it may not be possible to hold the FP-tree in main memory. A strategy to
cope with this problem is to partition the database into a set of smaller databases (called
projected databases) and then construct an FP-tree from each of these smaller databases.
FP-Tree
The frequent-pattern tree (FP-tree) is a compact data structure that stores quantitative
information about frequent patterns in a database. Each transaction is read and then mapped
onto a path in the FP-tree. This is done until all transactions have been read. Different
transactions with common subsets allow the tree to remain compact because their paths
overlap.
A frequent-pattern tree is built from the itemsets of the transactions in the database. The
purpose of the FP-tree is to support the mining of frequent patterns. Each node of the FP-tree
represents one item of an itemset.
The root node represents null, while the lower nodes represent the items. The associations
between a node and the nodes below it, that is, between items that occur together in
transactions, are maintained while forming the tree.
An FP-tree has the following structure:
1. One root is labelled as "null", with a set of item-prefix subtrees as children and a
frequent-item-header table.
2. Each node in an item-prefix subtree consists of three fields:
   o Item-name: registers which item is represented by the node;
   o Count: the number of transactions represented by the portion of the path
     reaching the node;
   o Node-link: links to the next node in the FP-tree carrying the same item name, or
     null if there is none.
3. Each entry in the frequent-item-header table consists of two fields:
   o Item-name: the same as in the node;
   o Head of node-link: a pointer to the first node in the FP-tree carrying the item
     name.
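The structure above can be sketched in Python as follows. This is only an illustration: the
class and field names (FPNode, HeaderEntry, linked_counts) are made up for this example, and
the parent and children fields are additions that make the tree easy to traverse; they are
not part of the three-field definition.

```python
class FPNode:
    """One node of an item-prefix subtree (names are illustrative)."""

    def __init__(self, item, count=0, parent=None):
        self.item = item          # item-name; None stands for the "null" root
        self.count = count        # transactions sharing the path up to this node
        self.node_link = None     # next node carrying the same item, or None
        self.parent = parent      # assumption: extra field for upward traversal
        self.children = {}        # assumption: extra field, item -> child node


class HeaderEntry:
    """One entry of the frequent-item-header table."""

    def __init__(self, item, head=None):
        self.item = item          # item-name, the same as in the nodes
        self.head = head          # head of node-link: first node carrying the item


# Tiny hand-built tree: null -> a:2 -> b:1, plus a second "b" node under the root.
root = FPNode(None)
a = FPNode("a", 2, parent=root)
b1 = FPNode("b", 1, parent=a)
b2 = FPNode("b", 1, parent=root)
root.children = {"a": a, "b": b2}
a.children = {"b": b1}
b1.node_link = b2                 # chain the two "b" nodes together

header = {"a": HeaderEntry("a", a), "b": HeaderEntry("b", b1)}


def linked_counts(entry):
    """Follow the node-links from a header entry and collect the node counts."""
    counts, node = [], entry.head
    while node is not None:
        counts.append(node.count)
        node = node.node_link
    return counts


print(linked_counts(header["b"]))  # [1, 1]
```

Following the node-links from a header entry visits every node of one item without walking
the whole tree, which is what the mining step relies on.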
Additionally, the frequent-item-header table can hold the support count of each item. The
diagram below is an example of the best-case scenario, which occurs when all transactions have
the same itemset; the FP-tree is then only a single branch of nodes.
The worst-case scenario occurs when every transaction has a unique itemset and no two
transactions share a prefix. In that case, the space needed to store the tree is greater than
the space used to store the original data set, because the FP-tree requires additional space
to store the pointers between nodes and the counters for each item. The diagram below shows
how a worst-case scenario FP-tree might appear. As you can see, the tree's complexity grows
with each transaction's uniqueness.
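The two scenarios can be checked with a small sketch that counts the nodes a prefix tree
needs. The transactions are hypothetical, and the tree is reduced to nested dictionaries
(counts and node-links are left out), so this only illustrates how path sharing affects size.

```python
def count_nodes(transactions):
    """Insert each transaction as a path in a nested-dict prefix tree and
    return the number of non-root nodes that were created."""
    root = {}
    nodes = 0
    for transaction in transactions:
        level = root
        for item in transaction:
            if item not in level:
                level[item] = {}   # no shared prefix here: a new node is needed
                nodes += 1
            level = level[item]
    return nodes


# Hypothetical data: four identical transactions (best case) ...
best = [["a", "b", "c"]] * 4
# ... and four transactions that share no items at all (worst case).
worst = [["a", "b"], ["c", "d"], ["e", "f"], ["g", "h"]]

print(count_nodes(best))   # 3
print(count_nodes(worst))  # 8
```

In the best case the tree has as many nodes as one transaction has items; in the worst case
it has one node per item occurrence in the database.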
The steps to construct the FP-tree, followed by the steps to mine it, are given below:
1. The first step is to scan the database and count the occurrences of each item. This step
is the same as the first step of Apriori. The count of a 1-itemset in the database is called
its support count or frequency.
2. The second step is to construct the FP-tree. For this, create the root of the tree. The
root is represented by null.
3. The next step is to scan the database again and examine the transactions. Examine the
first transaction and find the items in it. The item with the highest count is placed at the
top, followed by the item with the next lower count, and so on. This means that each branch of
the tree is constructed from a transaction's items in descending order of count.
4. The next transaction in the database is examined. Its items are ordered in descending
order of count. If the leading items of this transaction are already present in another
branch, the new branch shares that common prefix, starting at the root. The remaining items
of the transaction are added as new nodes below the last shared node.
5. The count of each node is incremented as its item occurs in the transactions: the count
of every shared node is increased by 1, and every new node is created with a count of 1.
6. The next step is to mine the created FP-tree. For this, the lowest node is examined first,
along with the node-links of the lowest nodes. The lowest node represents a frequent pattern
of length 1. From this node, traverse the paths up to the root. The resulting path or paths
are called a conditional pattern base.
A conditional pattern base is a sub-database consisting of the prefix paths in the FP-tree
that occur with the lowest node (the suffix).
7. Construct a conditional FP-tree from the counts of the items in these paths. Only the
items that meet the support threshold are kept in the conditional FP-tree.
8. Frequent patterns are generated from the conditional FP-tree.
Using this algorithm, the FP-tree is constructed in two database scans. The first scan collects
and sorts the set of frequent items, and the second constructs the FP-Tree.
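The two scans can be sketched in Python as follows. This is a minimal illustration, not Han's
original listing: the names (Node, build_fp_tree, paths) and the sample transactions are
hypothetical, node-links are left out for brevity, and ties between equally frequent items are
broken by item name so that the order is fixed.

```python
from collections import Counter


class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}


def build_fp_tree(transactions, min_sup):
    # Scan 1: support count of every single item; keep the frequent ones.
    support = Counter(item for t in transactions for item in set(t))
    frequent = {i: c for i, c in support.items() if c >= min_sup}

    # Scan 2: insert each transaction, frequent items only, in descending
    # order of support (ties broken by item name).
    root = Node(None, None)
    for t in transactions:
        ordered = sorted((i for i in set(t) if i in frequent),
                         key=lambda i: (-frequent[i], i))
        node = root
        for item in ordered:
            child = node.children.get(item)
            if child is None:                       # no shared prefix: new node
                child = node.children[item] = Node(item, node)
            child.count += 1                        # shared or new, count grows by 1
            node = child
    return root, frequent


def paths(node, prefix=()):
    """Yield every node as (path of items from the root, count)."""
    for child in node.children.values():
        here = prefix + (child.item,)
        yield here, child.count
        yield from paths(child, here)


# Hypothetical database with a minimum support count of 2.
root, frequent = build_fp_tree([["a", "b"], ["b", "c"], ["a", "b", "c"], ["b"]], 2)
tree = dict(paths(root))
print(tree[("b",)], tree[("b", "a")], tree[("b", "a", "c")], tree[("b", "c")])  # 4 2 1 1
```

All four transactions start with b once they are ordered, so they share the single b node,
whose count is 4.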
Example
Table 1: Transaction database
Transaction List of items
T1 I1,I2,I3
T2 I2,I3,I4
T3 I4,I5
T4 I1,I2,I4
T5 I1,I2,I3,I5
T6 I1,I2,I3,I4
Solution: Support threshold=50% => 0.5*6= 3 => min_sup=3
Table 2: Count of each item
Item Count
I1 4
I2 5
I3 4
I4 4
I5 2
Table 3: Items sorted in descending order of count (I5 is dropped because its count is below min_sup).
Item Count
I2 5
I1 4
I3 4
I4 4
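The counts and the sorted list can be reproduced with a short Python sketch. The data is
Table 1; the variable names are chosen for this example, and ties between equally frequent
items are broken by item name.

```python
import math
from collections import Counter

# Table 1
transactions = {
    "T1": ["I1", "I2", "I3"],
    "T2": ["I2", "I3", "I4"],
    "T3": ["I4", "I5"],
    "T4": ["I1", "I2", "I4"],
    "T5": ["I1", "I2", "I3", "I5"],
    "T6": ["I1", "I2", "I3", "I4"],
}

# A 50% support threshold over 6 transactions gives a count of 3.
min_sup = math.ceil(0.5 * len(transactions))

# Count of each item.
counts = Counter(item for items in transactions.values() for item in items)

# Frequent items only, in descending order of count.
frequent = sorted(((i, c) for i, c in counts.items() if c >= min_sup),
                  key=lambda pair: (-pair[1], pair[0]))
print(frequent)  # [('I2', 5), ('I1', 4), ('I3', 4), ('I4', 4)]

# Each transaction rewritten in that order, with infrequent items removed.
rank = {item: position for position, (item, _) in enumerate(frequent)}
ordered = {tid: sorted((i for i in items if i in rank), key=rank.get)
           for tid, items in transactions.items()}
print(ordered["T5"])  # ['I2', 'I1', 'I3']
```

The reordered transactions are what the second database scan inserts into the tree.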
Build FP Tree
Each transaction is inserted with its items in the order of Table 3. I5 is not inserted, as
its count is below the min support count. After all six transactions have been inserted, the
tree is:
null
  I2:5
    I1:4
      I3:3
        I4:1
      I4:1
    I3:1
      I4:1
  I4:1
Mining of the FP-tree is summarized below:
1. The lowest node item, I5, is not considered as it does not have the min support count.
Hence it is deleted.
2. The next lowest item is I4. I4 occurs in 4 branches: {I2,I1,I3,I4:1}, {I2,I1,I4:1},
{I2,I3,I4:1} and {I4:1}. Therefore, considering I4 as the suffix, the prefix paths are
{I2,I1,I3:1}, {I2,I1:1} and {I2,I3:1}. These form the conditional pattern base.
3. The conditional pattern base is treated as a transaction database, and an FP-tree is
constructed from it. This will contain only {I2:3}; I1 and I3 are not considered, as each has
a count of 2 and does not meet the min support count.
4. This path generates the frequent pattern {I2,I4:3}.
5. For I3, the prefix paths are {I2,I1:3} and {I2:1}. This generates a 2-node FP-tree,
{I2:4, I1:3}, and the frequent patterns {I2,I3:4}, {I1,I3:3} and {I2,I1,I3:3}.
6. For I1, the prefix path is {I2:4}. This generates a single-node FP-tree, {I2:4}, and the
frequent pattern {I2,I1:4}.
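The conditional pattern bases can be recomputed directly from Table 1 with a short sketch.
It relies on the fact that the prefix path of a node equals the items that precede the suffix
in each ordered transaction, so no tree is needed for the check. The function names are
hypothetical.

```python
from collections import Counter

# Table 1 with I5 removed and the items of each transaction in Table 3 order.
ordered = [
    ("I2", "I1", "I3"),        # T1
    ("I2", "I3", "I4"),        # T2
    ("I4",),                   # T3
    ("I2", "I1", "I4"),        # T4
    ("I2", "I1", "I3"),        # T5
    ("I2", "I1", "I3", "I4"),  # T6
]
min_sup = 3


def conditional_pattern_base(suffix):
    """Map each non-empty prefix path that occurs with `suffix` to its count."""
    base = Counter()
    for transaction in ordered:
        if suffix in transaction:
            prefix = transaction[:transaction.index(suffix)]
            if prefix:
                base[prefix] += 1
    return base


def conditional_item_counts(suffix):
    """Sum the item counts over the pattern base and keep the frequent items."""
    totals = Counter()
    for prefix, count in conditional_pattern_base(suffix).items():
        for item in prefix:
            totals[item] += count
    return {item: c for item, c in totals.items() if c >= min_sup}


for suffix in ("I4", "I3", "I1"):
    print(suffix, dict(conditional_pattern_base(suffix)), conditional_item_counts(suffix))
```

Each item kept by conditional_item_counts, joined with the suffix, is a frequent pattern with
the reported count.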
FP-Growth Algorithm
After constructing the FP-Tree, it's possible to mine it to find the complete set of frequent
patterns. Han presents a group of lemmas and properties to do this job and then describes the
following FP-Growth Algorithm.
Algorithm 2: FP-Growth
Procedure FP-growth(Tree, a)
{
  if Tree contains a single prefix path then
  {
    // Mining single prefix-path FP-tree
    let P be the single prefix-path part of Tree;
    let Q be the multipath part with the top branching node replaced by a null root;
    for each combination (denoted as ß) of the nodes in the path P do
      generate pattern ß ∪ a with support = minimum support of nodes in ß;
    let freq pattern set(P) be the set of patterns so generated;
  }
  else let Q be Tree;
  for each item ai in Q do
  {
    // Mining multipath FP-tree
    generate pattern ß = ai ∪ a with support = ai.support;
    construct ß's conditional pattern-base, and then ß's conditional FP-tree Tree ß;
    if Tree ß ≠ Ø then
      call FP-growth(Tree ß, ß);
    let freq pattern set(Q) be the set of patterns so generated;
  }
  return(freq pattern set(P) ∪ freq pattern set(Q) ∪ (freq pattern set(P) × freq pattern set(Q)))
}
When the FP-tree contains a single prefix path, the complete set of frequent patterns can be
generated in three parts: the patterns of the single prefix path P, the patterns of the
multipath part Q, and the combinations of the two.
The resulting patterns for a single prefix path are the enumerations of its subpaths, each
with the minimum support of the nodes it contains. After that, the multipath part Q is
defined, and its patterns are mined recursively. Finally, the combined results are returned
as the frequent patterns found.
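A compact Python sketch of the recursive mining is given below. It is a simplification and
not the listing above: it skips the single prefix-path shortcut and treats every tree as the
multipath part Q, which yields the same patterns with some extra recursion. The header table
is reduced to a dictionary of node lists, and all names (Node, build_tree, fp_growth) are
hypothetical.

```python
from collections import Counter, defaultdict


class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}


def build_tree(weighted_transactions, min_sup):
    """Build an FP-tree from (items, count) pairs. Returns the header table
    (item -> list of its nodes) and the support of each frequent item."""
    support = Counter()
    for items, count in weighted_transactions:
        for item in set(items):
            support[item] += count
    frequent = {i: c for i, c in support.items() if c >= min_sup}

    root, header = Node(None, None), defaultdict(list)
    for items, count in weighted_transactions:
        ordered = sorted((i for i in set(items) if i in frequent),
                         key=lambda i: (-frequent[i], i))
        node = root
        for item in ordered:
            child = node.children.get(item)
            if child is None:
                child = node.children[item] = Node(item, node)
                header[item].append(child)
            child.count += count
            node = child
    return header, frequent


def fp_growth(weighted_transactions, min_sup, suffix=frozenset()):
    """Return {frozenset(pattern): support} for every frequent pattern."""
    header, frequent = build_tree(weighted_transactions, min_sup)
    patterns = {}
    for item, support in frequent.items():
        pattern = suffix | {item}
        patterns[pattern] = support
        # Conditional pattern base: the prefix path of every node of `item`.
        base = []
        for node in header[item]:
            path, parent = [], node.parent
            while parent.item is not None:
                path.append(parent.item)
                parent = parent.parent
            if path:
                base.append((path, node.count))
        # Recurse on the conditional FP-tree built from that base.
        patterns.update(fp_growth(base, min_sup, pattern))
    return patterns


# Table 1 from the example, with a minimum support count of 3.
table1 = [["I1", "I2", "I3"], ["I2", "I3", "I4"], ["I4", "I5"],
          ["I1", "I2", "I4"], ["I1", "I2", "I3", "I5"], ["I1", "I2", "I3", "I4"]]
result = fp_growth([(t, 1) for t in table1], 3)
for pattern, support in sorted(result.items(), key=lambda p: (len(p[0]), sorted(p[0]))):
    print(sorted(pattern), support)
```

Each recursive call receives a conditional pattern base as a small weighted database, so the
same build_tree function serves for the full tree and for every conditional FP-tree.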
Apriori vs. FP Growth
1. Apriori: generates frequent patterns by making itemsets using pairings such as single
   itemset, double itemset, and triple itemset.
   FP Growth: generates an FP-tree for making frequent patterns.
2. Apriori: scans the database in each step, so it becomes time-consuming for data where the
   number of items is larger.
   FP Growth: requires only two database scans in its beginning steps, so it consumes less
   time.
3. Apriori: a converted version of the database is saved in the memory.
   FP Growth: a set of conditional FP-trees for every item is saved in the memory.
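For contrast, a minimal level-wise Apriori sketch is shown below. It counts one pass over the
database per itemset size, which is the cost the comparison refers to. The function name and
the scan counter are made up for this example, and the data is Table 1 from the earlier
example with a minimum support count of 3.

```python
from itertools import combinations


def apriori(transactions, min_sup):
    """Level-wise search: single itemsets, then pairs, then triples, and so on.
    Returns the frequent itemsets with their support and the number of scans."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({i for t in transactions for i in t})
    candidates = [frozenset([i]) for i in items]
    frequent, scans, k = {}, 0, 1
    while candidates:
        scans += 1  # one pass over the whole database per level
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_sup}
        frequent.update(level)
        k += 1
        # Join step: unions of two frequent (k-1)-itemsets that have exactly k items.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = [c for c in joined
                      if all(frozenset(s) in level for s in combinations(c, k - 1))]
    return frequent, scans


table1 = [["I1", "I2", "I3"], ["I2", "I3", "I4"], ["I4", "I5"],
          ["I1", "I2", "I4"], ["I1", "I2", "I3", "I5"], ["I1", "I2", "I3", "I4"]]
frequent, scans = apriori(table1, 3)
print(len(frequent), scans)  # 9 3
```

On this data Apriori needs three passes, one each for single items, pairs, and triples,
whereas the FP-tree is built in two passes regardless of the pattern length.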
Let the minimum support be 3. A Frequent Pattern set is built which will
contain all the elements whose frequency is greater than or equal to the
minimum support. These elements are stored in descending order of their
respective frequencies. After insertion of the relevant items, the set L looks
like this:-
L = {K : 5, E : 4, M : 3, O : 3, Y : 3}
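The ordered item set of each transaction is then inserted into the tree, one set at a time.
a) Inserting the first ordered set:
Here, all the items are simply linked one after the other in the order of
occurrence in the set, and the support count of each item is initialized to 1.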
b) Inserting the set {K, E, O, Y}:
Till the insertion of the elements K and E, simply the support count is
increased by 1. On inserting O we can see that there is no direct link
between E and O, therefore a new node for the item O is initialized with the
support count as 1 and item E is linked to this new node. On inserting Y, we
first initialize a new node for the item Y with support count as 1 and link the
new node of O with the new node of Y.
c) Inserting the set {K, E, M}:
Here simply the support counts of K, E and M are increased by 1.
d) Inserting the set {K, M, Y}:
Similar to step b), first the support count of K is increased, then new nodes
for M and Y are initialized and linked accordingly.
e) Inserting the set {K, E, O}:
Here simply the support counts of the respective elements are increased.
Note that the support count of the new node of item O is increased.
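The insertion steps can be traced with a small sketch. Two of the ordered sets are
assumptions: the first set is taken to be {K, E, M, O, Y} and the fourth to be {K, M, Y}, as
these are the values that agree with the step descriptions and with the counts in L. The tree
is reduced to nested dictionaries, and the insert function is hypothetical.

```python
def insert(tree, ordered_set):
    """Insert one ordered item set into a nested-dict tree.
    Each node is stored as item -> [support count, children]."""
    level = tree
    for item in ordered_set:
        if item not in level:
            level[item] = [0, {}]   # no direct link yet: initialize a new node
        level[item][0] += 1         # a new node ends at 1, a shared node grows by 1
        level = level[item][1]


tree = {}
ordered_sets = [
    ["K", "E", "M", "O", "Y"],  # a) ASSUMED: chosen so the totals match L
    ["K", "E", "O", "Y"],       # b)
    ["K", "E", "M"],            # c)
    ["K", "M", "Y"],            # d) ASSUMED: from the step that adds new M and Y nodes
    ["K", "E", "O"],            # e)
]
for ordered_set in ordered_sets:
    insert(tree, ordered_set)

k_count, k_children = tree["K"]
e_count, e_children = k_children["E"]
print(k_count, e_count)          # 5 4
print(e_children["O"][0])        # 2, the O node created in step b)
print(e_children["M"][1]["O"][0])  # 1, the O node from the first set
```

Under these assumptions the item O ends up in two different nodes, which is why step e) says
that the count of the new O node is the one that is increased.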
Now, for each item, the Conditional Pattern Base is computed. It consists of the
path labels of all the paths that lead to any node of the given item in the
frequent-pattern tree. Note that the items in the table below are arranged in
ascending order of their frequencies.
Now, for each item, the Conditional Frequent Pattern Tree is built. It is
done by summing, for each element, the support counts of all the paths in the
Conditional Pattern Base of that item that contain the element, and keeping
the elements whose total is at least the minimum support.