
UNIT-III

Mining Frequent Patterns:


Frequent pattern mining is an essential task in data mining that aims to uncover recurring patterns or itemsets in a given dataset. It involves identifying sets of items that frequently occur together in a transactional or relational database. This process can offer valuable insight into the relationships and associations among the items or attributes in the data.
Frequent pattern mining is built on a few fundamental ideas.
Transactional and Relational Databases: The analysis operates on transaction databases, whose records (transactions) represent collections of items. Groups of items within these transactions are called itemsets.
Support and Confidence: The significance of a pattern is judged by its support and confidence. Support quantifies how frequently an itemset appears in the database, whereas confidence quantifies how likely a rule generated from the itemset is to hold.
The Apriori algorithm is one of the best-known and most widely used algorithms for frequent pattern mining. It uses a breadth-first (level-wise) search strategy to discover frequent itemsets efficiently. The algorithm works in multiple iterations. It starts by finding frequent individual items, scanning the database once and counting the occurrences of each item. It then generates candidate itemsets of size 2 by combining the frequent itemsets of size 1. The support of these candidates is calculated by scanning the database again. The process continues iteratively, generating candidate itemsets of size k and calculating their support, until no more frequent itemsets can be found.
Support-based Pruning: During the Apriori algorithm's execution, support-based pruning is used to reduce the search space and improve efficiency. If an itemset is found to be infrequent (i.e., its support is below the minimum support threshold), then all of its supersets are guaranteed to be infrequent as well. These supersets are therefore pruned from further consideration. This pruning step significantly decreases the number of candidate itemsets that must be evaluated in subsequent iterations, as the sketch below illustrates.
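The join-and-prune loop can be sketched in a few lines of Python. This is a minimal illustration of the level-wise idea, assuming transactions are given as frozensets of items; the function name apriori and the toy transactions are illustrative, not a library API.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise frequent-itemset mining with support-based pruning."""
    # Pass 1: count each individual item with one scan of the database.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s for s, c in counts.items() if c >= min_support}
    all_frequent = set(frequent)

    k = 2
    while frequent:
        # Join step: union pairs of frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        # Prune step: if any (k-1)-subset is infrequent, every superset is too.
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent
                             for s in combinations(c, k - 1))}
        # Support counting: one more scan of the database per level.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {c for c, n in counts.items() if n >= min_support}
        all_frequent |= frequent
        k += 1
    return all_frequent

txns = [frozenset(t) for t in
        [{"milk", "bread"}, {"milk", "bread", "eggs"},
         {"bread", "eggs"}, {"milk", "eggs"}]]
print(apriori(txns, min_support=2))
```

The prune step is where the Apriori property pays off: a candidate is discarded without touching the database at all if any of its (k-1)-subsets failed the support threshold in the previous level.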

Association Rule Mining: Frequent itemsets can be further examined to discover association rules, which express connections between different items. An association rule consists of an antecedent (left-hand side) and a consequent (right-hand side), both of which are itemsets. For instance, {milk, bread} => {eggs} is an association rule. Association rules are produced from frequent itemsets by considering different combinations of items and calculating measures such as support, confidence, and lift. Support measures how frequently the antecedent and the consequent appear together, while confidence measures the conditional probability of the consequent given the antecedent. Lift indicates the strength of the association between the antecedent and the consequent relative to their individual supports.
Applications: Frequent pattern mining has many practical uses across domains, including market basket analysis, customer behavior analysis, web mining, bioinformatics, and network traffic analysis.

Applications of Frequent Pattern Mining


Market Basket Analysis
Market basket analysis uses frequent pattern mining to understand consumer buying behavior. By recognizing itemsets that commonly appear together in transactions, businesses gain knowledge about product associations. This knowledge lets companies improve recommendation systems and cross-sell efforts, and helps retailers make data-driven decisions that enhance customer satisfaction and boost sales.
Web usage mining

Web usage mining examines user navigation patterns to learn how people use websites. Frequent pattern mining makes it possible to identify recurring navigation and session patterns, which can be used to personalize websites and improve their performance. By studying how users interact with a site, businesses can adjust content, layout, and navigation to improve the user experience and boost engagement.
Bioinformatics
In the field of bioinformatics, frequent pattern mining enables the identification of relevant DNA patterns. By examining large genomic databases for recurring patterns, researchers gain insight into genetic variants, disease associations, and drug development. Frequent pattern mining algorithms help uncover important DNA sequences and motifs for diagnosing diseases, practicing personalized medicine, and designing new therapeutic strategies.

Frequent Itemset Mining Methods:


Association Rule Learning
Association rule learning is an unsupervised learning technique that checks for the dependency of one data item on another and maps them so the results can be used profitably.
Association rule learning is an important concept in machine learning, and it is employed in market basket analysis, web usage mining, continuous production, and more. Market basket analysis is a technique used by large retailers to discover associations between items.
For example, if a customer buys bread, they are likely to also buy butter, eggs, or milk, so these products are stocked on the same shelf or nearby.

How does Association Rule Learning work?


Association rule learning works on the concept of if/then statements, such as "if A, then B".

Here the "if" element is called the antecedent, and the "then" statement is called the consequent. A relationship in which we find an association between two items is known as single cardinality; as the number of items increases, the cardinality increases accordingly. Association rule learning is all about creating such rules, and to measure the associations between thousands of data items, several metrics are used. These metrics are given below:
 Support
 Confidence
 Lift
Let's understand each of them:
Support
Support is the frequency with which an itemset appears in the dataset. It is defined as the fraction of the transactions T that contain the itemset X:

Support(X) = (Number of transactions containing X) / (Total number of transactions)
Confidence
Confidence indicates how often the rule has been found to be true: how often the items X and Y occur together in the dataset, given that X already occurs. It is the ratio of the number of transactions containing both X and Y to the number of transactions containing X:

Confidence(X => Y) = Support(X ∪ Y) / Support(X)
Lift
Lift measures the strength of a rule. It is the ratio of the observed support to the support that would be expected if X and Y were independent of each other:

Lift(X => Y) = Support(X ∪ Y) / (Support(X) × Support(Y))

It has three possible cases:
 If Lift = 1: the occurrence of the antecedent and the occurrence of the consequent are independent of each other.
 If Lift > 1: the two itemsets are positively dependent on each other; the rule is likely to be useful.
 If Lift < 1: one item is a substitute for the other, meaning that one item has a negative effect on the other.
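These three metrics translate directly into code. A minimal Python sketch, assuming transactions are represented as sets of items; the function names and the toy transactions are illustrative, not a library API.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """P(consequent | antecedent) = Support(X ∪ Y) / Support(X)."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

def lift(antecedent, consequent, transactions):
    """Observed co-occurrence relative to independence."""
    return (confidence(antecedent, consequent, transactions)
            / support(consequent, transactions))

transactions = [{"milk", "bread", "eggs"}, {"milk", "bread"},
                {"bread", "eggs"}, {"milk", "eggs"}]
X, Y = {"milk"}, {"bread"}
print(support(X | Y, transactions))    # 0.5
print(confidence(X, Y, transactions))  # 2/3 ≈ 0.67
print(lift(X, Y, transactions))        # (2/3) / (3/4) ≈ 0.89, i.e. Lift < 1
```

In this toy data the lift comes out below 1, so milk and bread co-occur slightly less often than independence would predict.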

Apriori Algorithm:
The Apriori algorithm is used to derive association rules between objects, i.e., how two or more objects are related to one another. In other words, the Apriori algorithm is an association rule learning method that can determine, for example, that people who bought product A also tended to buy product B.
The primary objective of the Apriori algorithm is to create association rules between different objects. The association rule describes how two or more objects are related to one another. The Apriori algorithm is a classic algorithm for frequent pattern mining.
Components of Apriori algorithm
The following three components comprise the Apriori algorithm.
 Support
 Confidence
 Lift

Support
Support refers to the default popularity of an item. It is computed by dividing the number of transactions containing that item by the total number of transactions. For example, if 400 of 4,000 transactions contain biscuits:
Support(Biscuits) = (Transactions containing biscuits) / (Total transactions)
= 400/4000 = 10 percent.

Confidence
Confidence refers to the likelihood that customers bought both biscuits and chocolates together. It is the number of transactions containing both biscuits and chocolates divided by the number of transactions containing biscuits. If 200 transactions contain both:
Confidence(Biscuits => Chocolates) = (Transactions containing both biscuits and chocolates) / (Transactions containing biscuits)
= 200/400
= 50 percent.
This means that 50 percent of the customers who bought biscuits also bought chocolates.

Lift
Considering the example above, lift describes the increase in sales of chocolates when biscuits are sold. It is calculated as:
Lift(Biscuits => Chocolates) = Confidence(Biscuits => Chocolates) / Support(Chocolates)
If chocolates also appear in 10 percent of all transactions, then Lift = 50/10 = 5.
This means that customers are five times more likely to buy chocolates together with biscuits than chance alone would suggest. A lift value below one means that people are unlikely to buy both items together; the larger the value, the stronger the combination.
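The arithmetic can be checked in a few lines of Python. The total, biscuit, and joint counts come from the text; the chocolate count of 400 is an assumption chosen so that Support(Chocolates) is 10 percent, which is what the text's division 50/10 = 5 implies.

```python
total = 4000        # total transactions (from the text)
biscuits = 400      # transactions containing biscuits (from the text)
both = 200          # transactions containing biscuits and chocolates (from the text)
chocolates = 400    # assumed: gives Support(Chocolates) = 10%, matching the lift above

support_biscuits = biscuits / total                  # 0.10
confidence_rule = both / biscuits                    # 0.50
lift_rule = confidence_rule / (chocolates / total)   # 0.50 / 0.10 = 5.0
print(support_biscuits, confidence_rule, lift_rule)
```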
How does the Apriori Algorithm work in Data Mining?
Consider the following transaction dataset; we will find its frequent itemsets and generate association rules from them.

Transaction ID Items
T100 {I1, I2, I5}
T200 {I2, I4}
T300 {I2, I3}
T400 {I1, I2, I4}
T500 {I1, I3}
T600 {I2, I3}
T700 {I1, I3}
T800 {I1, I2, I3, I5}
T900 {I1, I2, I3}

Let the minimum support count be 2 and the minimum confidence be 60%.

Step-1: K=1
(I) Create a table containing the support count of each item present in the dataset, called C1 (the candidate set).

(II) Compare each candidate item's support count with the minimum support count (here min_support = 2); if a candidate's support count is less than min_support, remove that item. This gives us the itemset L1.

Step-2: K=2
 Generate candidate set C2 using L1 (this is called the join step). The condition for joining Lk-1 with Lk-1 is that the itemsets share (K-2) elements.
 Check whether all subsets of each candidate itemset are frequent; if not, remove that itemset. (For example, the subsets of {I1, I2} are {I1} and {I2}, which are frequent. Check this for each itemset.)
 Now find the support count of these itemsets by searching the dataset.
(II) Compare each candidate's (C2) support count with the minimum support count (here min_support = 2); if a candidate's support count is less than min_support, remove that itemset. This gives us the itemset L2.

Step-3: K=3

 Generate candidate set C3 using L2 (join step). The condition for joining Lk-1 with Lk-1 is that the itemsets share (K-2) elements, so here, for L2, the first element should match.
 The itemsets generated by joining L2 are {I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I4, I5}, and {I2, I3, I5}.
 Check whether all subsets of these itemsets are frequent and, if not, remove that itemset. (Here the subsets of {I1, I2, I3} are {I1, I2}, {I2, I3}, and {I1, I3}, which are all frequent. For {I2, I3, I4}, the subset {I3, I4} is not frequent, so remove it. Similarly check every itemset.)
 Find the support count of the remaining itemsets by searching the dataset.
(II) Compare each candidate's (C3) support count with the minimum support count (here min_support = 2); if a candidate's support count is less than min_support, remove that itemset. This gives us the itemset L3.

Step-4: K=4
 Generate candidate set C4 using L3 (join step). The condition for joining Lk-1 with Lk-1 (K=4) is that the itemsets share (K-2) elements, so here, for L3, the first two items should match.
 Check whether all subsets of these itemsets are frequent. (Here the itemset formed by joining L3 is {I1, I2, I3, I5}, and its subsets include {I1, I3, I5}, which is not frequent.) So there is no itemset in C4.
 We stop here because no further frequent itemsets are found.
Thus, we have discovered all the frequent itemsets. Now the generation of strong association rules comes into the picture. For that, we need to calculate the confidence of each rule, as sketched below.
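The rule-generation step can be sketched in Python. This is a minimal illustration, assuming the frequent itemsets and support counts produced by the steps above; the counts shown are for {I1, I2, I5} and its subsets in the dataset at the start of this example, and generate_rules is an illustrative name.

```python
from itertools import combinations

def generate_rules(freq_itemsets, min_confidence=0.6):
    """freq_itemsets maps frozenset -> support count.
    Emits rules antecedent -> consequent with confidence >= min_confidence."""
    rules = []
    for itemset, count in freq_itemsets.items():
        if len(itemset) < 2:
            continue  # rules need both a non-empty antecedent and consequent
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                conf = count / freq_itemsets[antecedent]
                if conf >= min_confidence:
                    rules.append((set(antecedent),
                                  set(itemset - antecedent), conf))
    return rules

# Support counts for {I1, I2, I5} and its subsets, from the dataset above.
freq = {frozenset({"I1"}): 6, frozenset({"I2"}): 7, frozenset({"I5"}): 2,
        frozenset({"I1", "I2"}): 4, frozenset({"I1", "I5"}): 2,
        frozenset({"I2", "I5"}): 2, frozenset({"I1", "I2", "I5"}): 2}
for a, b, conf in generate_rules(freq):
    print(a, "->", b, f"({conf:.0%})")
```

For example, the rule {I1, I5} -> {I2} has confidence 2/2 = 100% and is retained, while {I1, I2} -> {I5} has confidence 2/4 = 50% and is dropped at the 60% threshold.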

Advantages of Apriori Algorithm


 It can find large (frequent) itemsets.
 It is simple to understand and apply.

Disadvantages of the Apriori Algorithm

 Finding support is expensive, because the calculation has to pass through the whole database.
 A huge number of candidate rules may be generated, which makes the algorithm computationally expensive.
 Candidate sets have to be built at each step.
 To build the candidate sets, the algorithm has to scan the database repeatedly.

Frequent Pattern Growth Algorithm


The Frequent Pattern Growth (FP-Growth) algorithm overcomes the disadvantages of the Apriori algorithm by storing all of the transactions in a trie data structure, called an FP-tree. Consider the following data:
Transaction ID Items
T1 {E,K,M,N,O,Y}
T2 {D,E,K,N,O,Y}
T3 {A,E,K,M}
T4 {C,K,M,U,Y}
T5 {C,E,I,K,O,O}

The above data is a hypothetical dataset of transactions, with each letter representing an item. The frequency of each individual item is computed:

ITEM FREQUENCY
A 1
C 2
D 1
E 4
I 1
K 5
M 3
N 2
O 4
U 1
Y 3

Let the minimum support be 3. A frequent pattern set L is built containing all the items whose frequency is greater than or equal to the minimum support. These items are stored in descending order of their respective frequencies. After insertion of the relevant items, the set L looks like this:
L = {K : 5, E : 4, M : 3, O : 4, Y : 3}

Now, for each transaction, the respective ordered-item set is built by iterating over the frequent pattern set and checking whether the current item is contained in the transaction in question. If it is, the item is inserted into the ordered-item set for the current transaction. The following table is built for all the transactions:
Transaction ID Items Ordered-Item Set
T1 {E,K,M,N,O,Y} {K,E,M,O,Y}
T2 {D,E,K,N,O,Y} {K,E,O,Y}
T3 {A,E,K,M} {K,E,M}
T4 {C,K,M,U,Y} {K,M,Y}
T5 {C,E,I,K,O,O} {K,E,O}
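The filtering and reordering can be sketched in Python. This assumes the five transactions above as sets (so the duplicate O in T5 counts once per transaction here, giving O a count of 3 rather than the per-occurrence count of 4 in the table; it stays frequent either way at minimum support 3). The item order is taken from the set L above.

```python
from collections import Counter

transactions = [
    {"E", "K", "M", "N", "O", "Y"}, {"D", "E", "K", "N", "O", "Y"},
    {"A", "E", "K", "M"}, {"C", "K", "M", "U", "Y"}, {"C", "E", "I", "K", "O"},
]
min_support = 3

# Count, for each item, how many transactions contain it.
freq = Counter(item for t in transactions for item in t)

# Item order used for the ordered-item sets, following the set L above.
order = ["K", "E", "M", "O", "Y"]

# Rewrite each transaction, keeping only frequent items in that fixed order.
ordered = [[i for i in order if i in t and freq[i] >= min_support]
           for t in transactions]
for row in ordered:
    print(row)   # matches the Ordered-Item Set column above
```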
Now, all the ordered-item sets are inserted into a trie data structure (the FP-tree).
a) Inserting the set {K, E, M, O, Y}:
Here, all the items are simply linked one after another in the order of occurrence in the set, and the support count of each node is initialized to 1.

b) Inserting the set {K, E, O, Y}:

Up to the insertion of the elements K and E, the support count is simply increased by 1. On inserting O, there is no direct link between E and O, so a new node for the item O is initialized with a support count of 1, and item E is linked to this new node. On inserting Y, we similarly initialize a new node for the item Y with a support count of 1 and link the new O node to the new Y node.

c) Inserting the set {K, E, M}:

Here the support count of each element is simply increased by 1.
d) Inserting the set {K, M, Y}: Similar to step b), first the support count of K is increased, then new nodes for M and Y are initialized and linked accordingly.

e) Inserting the set {K, E, O}: Here the support counts of the respective elements are simply increased. Note that the support count of the O node created in step b) is increased.
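Steps a) to e) amount to a small trie-insertion routine. A minimal Python sketch; FPNode and insert are illustrative names, and a full FP-Growth implementation would also maintain header-table links between nodes of the same item, which are omitted here.

```python
class FPNode:
    """One node of the FP-tree: an item, its support count, and its children."""
    def __init__(self, item):
        self.item = item
        self.count = 0
        self.children = {}

def insert(root, ordered_items):
    """Insert one ordered-item set, incrementing counts along the shared prefix."""
    node = root
    for item in ordered_items:
        if item not in node.children:
            node.children[item] = FPNode(item)  # new branch, as in steps b) and d)
        node = node.children[item]
        node.count += 1

root = FPNode(None)
for t in [["K", "E", "M", "O", "Y"], ["K", "E", "O", "Y"], ["K", "E", "M"],
          ["K", "M", "Y"], ["K", "E", "O"]]:
    insert(root, t)

def show(node, depth=0):
    """Print the tree, one node per line, indented by depth."""
    for child in node.children.values():
        print("  " * depth + f"{child.item}:{child.count}")
        show(child, depth + 1)

show(root)  # K:5 at the root; the O node under E ends at count 2, as in step e)
```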
Now, for each item, the Conditional Pattern Base is computed: the path labels of all the paths that lead to any node of the given item in the frequent pattern tree. Note that the items in the table below are arranged in ascending order of their frequencies.

Item Conditional Pattern Base
Y {{K,E,M,O} : 1}, {{K,E,O} : 1}, {{K,M} : 1}
O {{K,E,M} : 1}, {{K,E} : 2}
M {{K,E} : 2}, {{K} : 1}
E {{K} : 4}
K (empty)

Now, for each item, the Conditional Frequent Pattern Tree is built. It is done by taking the set of elements that is common to all paths in that item's Conditional Pattern Base and calculating its support count by summing the support counts of all the paths in the Conditional Pattern Base:

Item Conditional Frequent Pattern Tree
Y {K : 3}
O {{K,E} : 3}
M {K : 3}
E {K : 4}
K (empty)

From the Conditional Frequent Pattern Tree, the frequent pattern rules are generated by pairing the items of the Conditional Frequent Pattern Tree set with the corresponding item, as given in the table below:

Item Frequent Patterns Generated
Y {K,Y : 3}
O {K,O : 3}, {E,O : 3}, {K,E,O : 3}
M {K,M : 3}
E {K,E : 4}

For each row, two types of association rules can be inferred; for example, from the first row, which contains the pattern {K, Y}, the rules K -> Y and Y -> K can be inferred. To determine the valid rule, the confidence of both rules is calculated, and the one with confidence greater than or equal to the minimum confidence value is retained.
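For the pattern {K, Y : 3}, the two confidences can be checked directly from the support counts above; the 60% threshold here is an assumption carried over from the earlier Apriori example, since this section does not restate one.

```python
# Support counts from the tree above: K appears 5 times, Y 3 times, {K,Y} 3 times.
support_K, support_Y, support_KY = 5, 3, 3

conf_K_to_Y = support_KY / support_K   # 3/5 = 60%
conf_Y_to_K = support_KY / support_Y   # 3/3 = 100%
print(conf_K_to_Y, conf_Y_to_K)        # both meet an assumed 60% threshold
```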

Applications of Association Rule Learning


It has various applications in machine learning and data mining. Below are some popular applications of association rule learning:
 Market Basket Analysis: This is one of the most popular applications of association rule mining. The technique is commonly used by big retailers to determine the associations between items.
 Medical Diagnosis: Association rules help identify the probability of illness for a particular disease, supporting diagnosis and treatment of patients.
 Protein Sequencing: Association rules help in determining the synthesis of artificial proteins.
 It is also used for catalog design, loss-leader analysis, and many other applications.
