UNIT-III
Association Rule Mining: Frequent itemsets can be further examined to discover
association rules, which represent connections between different items. An
association rule consists of an antecedent (left-hand side) and a consequent
(right-hand side), both of which are itemsets. For instance, {milk, bread} => {eggs} is an
association rule. Association rules are produced from frequent itemsets by
considering different combinations of items and calculating measures such as support,
confidence, and lift. Support measures how frequently the antecedent and the
consequent appear together, while confidence measures the conditional
probability of the consequent given the antecedent. Lift indicates the strength of the
association between the antecedent and the consequent relative to their individual
supports.
Applications: Frequent pattern mining has various practical uses in different
domains. Some examples include market basket analysis, customer behavior
analysis, web mining, bioinformatics, and network traffic analysis.
Web Usage Mining
Web usage mining examines user navigation patterns to learn more about how
people use websites. Frequent pattern mining makes it possible to identify recurring
navigation patterns and session patterns, which can be used to personalize websites
and enhance their performance. By studying how consumers interact with a website,
businesses can change its content, layout, and navigation to improve the user
experience and boost engagement.
Bioinformatics
In bioinformatics, frequent pattern mining makes it possible to identify relevant
DNA patterns. By examining large genomic databases for recurrent patterns,
researchers can gain insights into genetic variants, disease associations, and drug
development. Frequent pattern mining algorithms help uncover important DNA
sequences and patterns that support disease diagnosis, personalized medicine, and
the creation of innovative therapeutic strategies.
In an if-then rule, the "if" part is called the antecedent and the "then" part is called
the consequent. A relationship in which an association is found between two items
is known as single cardinality; as the number of items in a rule increases, the
cardinality increases accordingly. To measure the associations between thousands
of data items, several metrics are used. These metrics are given below:
Support
Confidence
Lift
Let's understand each of them:
Support
Support is the frequency of A, or how frequently an itemset appears in the dataset. It
is defined as the fraction of the transactions T that contain the itemset X. For an
itemset X and transactions T, it can be written as:
Support(X) = (Number of transactions containing X) / (Total number of transactions T)
Confidence
Confidence indicates how often the rule has been found to be true, that is, how often
the items X and Y occur together in the dataset given that X has already occurred. It
is the ratio of the number of transactions that contain both X and Y to the number of
transactions that contain X:
Confidence(X => Y) = (Transactions containing both X and Y) / (Transactions containing X)
Lift
Lift is the strength of a rule. It is the ratio of the observed support to the support
expected if X and Y were independent of each other:
Lift(X => Y) = Support(X and Y together) / (Support(X) × Support(Y))
Equivalently, Lift(X => Y) = Confidence(X => Y) / Support(Y). It has three possible values:
If Lift = 1: the occurrences of the antecedent and the
consequent are independent of each other.
Lift > 1: the two itemsets are positively dependent
on each other.
Lift < 1: one item is a substitute for the other,
meaning one item has a negative effect on the occurrence of the other.
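To make these definitions concrete, the following is a minimal Python sketch. The toy transactions and function names are illustrative, not from the text:

```python
# Minimal sketch: computing support, confidence, and lift
# over a toy list of transactions (illustrative data).

transactions = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"bread", "eggs"},
    {"milk", "eggs"},
    {"milk", "bread", "eggs"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """P(consequent | antecedent) = support(A and C together) / support(A)."""
    a = set(antecedent)
    both = a | set(consequent)
    return support(both, transactions) / support(a, transactions)

def lift(antecedent, consequent, transactions):
    """Observed confidence relative to what independence would predict."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)

print(support({"milk", "bread"}, transactions))               # 0.6
print(confidence({"milk", "bread"}, {"eggs"}, transactions))  # ~0.667
print(lift({"milk", "bread"}, {"eggs"}, transactions))        # ~0.833
```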
Apriori Algorithm:
The Apriori algorithm is used to discover the association rules between objects,
that is, how two or more objects are related to one another. In other words, the
Apriori algorithm is an association rule learning method that can reveal, for
example, that people who bought product A also bought product B.
The primary objective of the Apriori algorithm is to create association rules
between different objects. The association rule describes how two or more objects
are related to one another. Apriori is a classic frequent pattern mining algorithm.
Components of Apriori algorithm
The following three measures make up the Apriori algorithm.
Support
Confidence
Lift
Support
Support refers to the default popularity of any product. It is found by dividing the
number of transactions containing that product by the total number of transactions.
Suppose 400 out of 4,000 transactions contain biscuits. Hence, we get
Support (Biscuits) = (Transactions containing biscuits) / (Total transactions)
= 400/4000 = 10 percent.
Confidence
Confidence refers to the probability that customers who bought biscuits also bought
chocolates. To get it, divide the number of transactions that contain both biscuits
and chocolates by the number of transactions that contain biscuits.
Hence,
Confidence (Biscuits => Chocolates) = (Transactions containing both biscuits and chocolates) / (Total
transactions involving biscuits)
= 200/400
= 50 percent.
It means that 50 percent of the customers who bought biscuits bought chocolates as well.
Lift
Continuing the example above, lift refers to the increase in the likelihood of selling
chocolates when biscuits are sold. The equation for lift is given below:
Lift (Biscuits => Chocolates) = Confidence (Biscuits => Chocolates) / Support (Chocolates)
Assuming chocolates, like biscuits, appear in 400 of the 4,000 transactions
(Support (Chocolates) = 10 percent), we get
Lift = 50/10 = 5
It means that people are five times more likely to buy chocolates when they buy
biscuits than they are to buy chocolates in general. If the lift value is below one,
people are unlikely to buy both items together. The larger the value, the stronger
the combination.
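As a quick check of the arithmetic above, using the hypothetical counts from the example (including the assumed chocolate support):

```python
# Quick check of the biscuits/chocolates example (hypothetical counts).
total = 4000
biscuits = 400      # transactions containing biscuits
both = 200          # transactions containing both biscuits and chocolates
chocolates = 400    # assumed: chocolates also appear in 400 transactions

support_biscuits = biscuits / total                  # 0.10
confidence_rule = both / biscuits                    # 0.50
lift_rule = confidence_rule / (chocolates / total)   # 5.0

print(support_biscuits, confidence_rule, lift_rule)
```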
How does the Apriori Algorithm work in Data Mining?
Consider the following dataset; we will find the frequent itemsets and generate
association rules from them.
Step-1: K=1
(I) Create a table containing the support count of each item present in the dataset,
called C1 (the candidate set).
(II) Compare each candidate item's support count with the minimum support count
(here min_support = 2); if the support count of a candidate item is less than
min_support, remove it. This gives us the itemset L1.
Step-2: K=2
(I) Generate candidate set C2 using L1 (this is called the join step).
The condition for joining Lk-1 with Lk-1 is that the itemsets should have (K-2)
elements in common.
Check whether all subsets of each itemset are frequent; if not,
remove that itemset. (For example, the subsets of {I1, I2} are
{I1} and {I2}, and both are frequent. Check this for each itemset.)
Now find the support count of these itemsets by searching the dataset.
(II) Compare the candidate (C2) support counts with the minimum support count
(here min_support = 2); if the support count of a candidate item is less than
min_support, remove it. This gives us the itemset L2.
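The join condition can be expressed compactly in code. Below is a small sketch (the function name and the sorted-tuple representation of itemsets are illustrative) of joining Lk-1 with itself:

```python
from itertools import combinations

def join_step(L_prev, k):
    """Join L(k-1) with itself: two (k-1)-itemsets combine only if
    they share their first k-2 items; the result is a k-itemset."""
    candidates = set()
    L_prev = [tuple(sorted(s)) for s in L_prev]
    for a, b in combinations(L_prev, 2):
        if a[:k - 2] == b[:k - 2]:
            candidates.add(tuple(sorted(set(a) | set(b))))
    return candidates

# Example: joining L1 = {(I1,), (I2,), (I3,)} for k = 2
print(join_step([("I1",), ("I2",), ("I3",)], 2))
```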
Step-3: K=3
(I) Generate candidate set C3 using L2 (join step). The condition for joining Lk-1
with Lk-1 is that the itemsets should have (K-2), here 1, elements in common, so
the first item of the two 2-itemsets should match.
Check whether all subsets of each candidate itemset are frequent; if any 2-itemset
subset of a candidate is not in L2, remove that candidate.
(II) Compare the candidate (C3) support counts with the minimum support count;
removing itemsets below min_support gives us the itemset L3.
Step-4: K=4
Generate candidate set C4 using L3 (join step). The condition for
joining Lk-1 with Lk-1 (K=4) is that the itemsets should have (K-2)
elements in common. So here, for L3, the first 2 elements (items)
should match.
Check whether all subsets of these itemsets are frequent (here the
itemset formed by joining L3 is {I1, I2, I3, I5}, and its subset
{I1, I3, I5} is not frequent), so there is no itemset in C4.
We stop here because no further frequent itemsets are found.
Thus, we have discovered all the frequent itemsets. Now the generation of strong
association rules comes into the picture. For that, we need to calculate the
confidence of each rule.
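Putting the steps together, here is a minimal end-to-end Apriori sketch in Python. The transaction list below is the classic nine-transaction example that this walkthrough appears to follow (the dataset table itself is not reproduced above), so treat the data as illustrative:

```python
from itertools import combinations

def apriori(transactions, min_support=2):
    """Return every frequent itemset (as a frozenset) with its support count."""
    transactions = [set(t) for t in transactions]

    def count(candidates):
        # Support count of each candidate: number of transactions containing it.
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    # k = 1: frequent individual items (L1).
    items = {frozenset([i]) for t in transactions for i in t}
    frequent = {c: n for c, n in count(items).items() if n >= min_support}
    all_frequent = dict(frequent)

    k = 2
    while frequent:
        # Join step: merge (k-1)-itemsets whose union has exactly k items.
        prev = list(frequent)
        candidates = {a | b for a, b in combinations(prev, 2) if len(a | b) == k}
        # Prune step: drop candidates having an infrequent (k-1)-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        frequent = {c: n for c, n in count(candidates).items() if n >= min_support}
        all_frequent.update(frequent)
        k += 1
    return all_frequent

# Classic nine-transaction example (illustrative).
data = [["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3"], ["I1", "I2", "I4"],
        ["I1", "I3"], ["I2", "I3"], ["I1", "I3"], ["I1", "I2", "I3", "I5"],
        ["I1", "I2", "I3"]]
for itemset, n in sorted(apriori(data).items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```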
Frequent Pattern (FP) Growth Algorithm:
Consider a transaction dataset (listed in the table further below) whose items occur
with the following frequencies:
ITEM FREQUENCY
A 1
C 2
D 1
E 4
I 1
K 5
M 3
N 2
O 4
U 1
Y 3
Let the minimum support be 3. A Frequent Pattern set is built which will contain all
the elements whose frequency is greater than or equal to the minimum support.
These elements are stored in descending order of their respective frequencies. After
insertion of the relevant items, the set L looks like this:
L = {K : 5, E : 4, M : 3, O : 4, Y : 3}
Now, for each transaction, the respective Ordered-Item set is built. It is done by
iterating over the Frequent Pattern set and checking whether the current item is
contained in the transaction in question. If it is, the item is inserted into the
Ordered-Item set for the current transaction. The following table is built for all the
transactions:
Transaction ID Items Ordered-Item Set
T1 {E,K,M,N,O,Y} {K,E,M,O,Y}
T2 {D,E,K,N,O,Y} {K,E,O,Y}
T3 {A,E,K,M} {K,E,M}
T4 {C,K,M,U,Y} {K,M,Y}
T5 {C,E,I,K,O,O} {K,E,O}
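The two preparatory steps, frequency filtering and Ordered-Item set construction, can be sketched as follows. The priority order K, E, M, O, Y is taken from the walkthrough above so that the output matches the table:

```python
from collections import Counter

transactions = [
    ["E", "K", "M", "N", "O", "Y"],
    ["D", "E", "K", "N", "O", "Y"],
    ["A", "E", "K", "M"],
    ["C", "K", "M", "U", "Y"],
    ["C", "E", "I", "K", "O", "O"],
]
min_support = 3

# Item frequencies (duplicates within a transaction are counted, which is
# how O reaches a frequency of 4 in the table above).
freq = Counter(item for t in transactions for item in t)
frequent = {item: n for item, n in freq.items() if n >= min_support}
print(frequent)  # {'E': 4, 'K': 5, 'M': 3, 'O': 4, 'Y': 3}

# Ordered-Item set of each transaction: frequent items only, listed in
# the priority order used by the walkthrough above.
priority = ["K", "E", "M", "O", "Y"]
ordered = [[item for item in priority if item in t] for t in transactions]
for t_id, items in enumerate(ordered, 1):
    print(f"T{t_id}: {items}")
```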
Now, all the Ordered-Item sets are inserted into a trie data structure (the
frequent-pattern tree).
a) Inserting the set {K, E, M, O, Y}:
Here, all the items are simply linked one after the other in the order of occurrence
in the set, and the support count of each item is initialized to 1.
b) Inserting the set {K, E, O, Y}:
The support counts of K and E are increased along the existing path, and a new
branch O -> Y is created under E, with each new node initialized to 1.
c) Inserting the set {K, E, M}:
This set follows an existing path, so the support counts of K, E, and M are
simply increased.
d) Inserting the set {K, M, Y}:
The support count of K is increased, and a new branch M -> Y is created
directly under K.
e) Inserting the set {K, E, O}: Here simply the support counts of the
respective elements are increased. Note that the support count of
the new node of item O (created in step b) is increased.
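A compact sketch of the tree structure and the insertion routine described above (the class and function names are illustrative):

```python
class FPNode:
    """One node of the frequent-pattern tree."""
    def __init__(self, item=None, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}  # item -> FPNode

def insert(root, ordered_items):
    """Insert one Ordered-Item set, reusing shared prefixes and
    incrementing support counts along the way."""
    node = root
    for item in ordered_items:
        if item not in node.children:
            node.children[item] = FPNode(item, parent=node)
        node = node.children[item]
        node.count += 1

root = FPNode()
for t in [["K", "E", "M", "O", "Y"], ["K", "E", "O", "Y"], ["K", "E", "M"],
          ["K", "M", "Y"], ["K", "E", "O"]]:
    insert(root, t)

def show(node, depth=0):
    """Print the tree, one 'item:count' per node, indented by depth."""
    for child in node.children.values():
        print("  " * depth + f"{child.item}:{child.count}")
        show(child, depth + 1)

show(root)  # K:5, E:4, M:2, O:1, Y:1; O:2, Y:1; M:1, Y:1
```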
Now, for each item, the Conditional Pattern Base is computed, which is the set of
path labels of all the paths that lead to any node of the given item in the
frequent-pattern tree. Note that the items in the table below are arranged in
ascending order of their frequencies:
Item Conditional Pattern Base
Y {{K,E,M,O : 1}, {K,E,O : 1}, {K,M : 1}}
M {{K,E : 2}, {K : 1}}
O {{K,E,M : 1}, {K,E : 2}}
E {{K : 4}}
K {}
Now for each item, the Conditional Frequent Pattern Tree is built. It is done by
taking the set of elements that is common to all the paths in the Conditional Pattern
Base of that item and calculating its support count by summing the support counts of
all the paths in the Conditional Pattern Base:
Item Conditional Frequent Pattern Tree
Y {K : 3}
M {K : 3}
O {K,E : 3}
E {K : 4}
From the Conditional Frequent Pattern Tree, the Frequent Pattern rules are generated
by pairing the items of the Conditional Frequent Pattern Tree set with the
corresponding item, as given in the table below:
Item Frequent Patterns Generated
Y {K,Y : 3}
M {K,M : 3}
O {K,O : 3}, {E,O : 3}, {K,E,O : 3}
E {K,E : 4}
For each row, two types of association rules can be inferred. For example, for the
first row, which contains the pattern {K, Y}, the rules K -> Y and Y -> K can be
inferred. To determine the valid rule, the confidence of both rules is calculated,
and the one with confidence greater than or equal to the minimum confidence value
is retained.
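For the first row, the confidence comparison can be sketched as follows. The supports are the transaction counts derived above; the minimum confidence threshold is an assumed value for illustration:

```python
# Confidence check for the two candidate rules from the pattern {K, Y} : 3.
# Supports are transaction counts taken from the example above.
support = {"K": 5, "Y": 3, ("K", "Y"): 3}
min_confidence = 0.6  # assumed threshold for illustration

conf_K_to_Y = support[("K", "Y")] / support["K"]  # 3/5 = 0.6
conf_Y_to_K = support[("K", "Y")] / support["Y"]  # 3/3 = 1.0

for rule, conf in [("K -> Y", conf_K_to_Y), ("Y -> K", conf_Y_to_K)]:
    status = "kept" if conf >= min_confidence else "discarded"
    print(f"{rule}: confidence = {conf:.2f} ({status})")
```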