Data Mining U3
Association rules are a commonly used technique in data mining and warehousing for
discovering interesting relationships between variables or items in a dataset.
Single-dimensional association rules involve a single predicate or dimension, typically the
items purchased, such as the association between the purchase of one product and the
purchase of another product. These rules can be represented as “if A, then B” statements,
where A is the antecedent (the itemset whose presence is observed) and B is the consequent
(the itemset that is being predicted or associated with A). For example, a single-dimensional
association rule could be “If a customer buys bread, they are likely to also buy butter.”
Multidimensional association rules involve two or more predicates or dimensions. These rules
are useful for discovering more complex relationships between attributes, such as the
association between a customer’s age, gender, and purchasing habits. Multidimensional
association rules can be represented as “if A and B, then C” statements, where A and B are
the antecedents and C is the consequent. For example, a multidimensional association rule
could be “If a customer is female, over 30 years old, and has previously purchased skincare
products, they are likely to also purchase anti-aging products.”
Both single and multidimensional association rules can be useful for identifying patterns and
trends in large datasets, and can be used to make predictions and inform business decisions.
However, multidimensional association rules are generally more complex and may require more
advanced algorithms and techniques to discover.
Algorithms
Algorithms are crucial in data mining and warehousing as they help to extract useful information
and insights from large datasets. Data mining algorithms are used to identify patterns,
relationships, and correlations in data that can be used to make predictions and inform business
decisions. Some common algorithms used in data mining include association rules, decision
trees, clustering, and neural networks.
In data warehousing, algorithms are used to improve the efficiency and accuracy of data
processing tasks such as data cleaning, data integration, and data transformation. Algorithms
commonly applied to warehoused data include the Apriori algorithm for frequent itemset
mining, the K-means algorithm for clustering, and decision tree algorithms for data
classification.
Overall, algorithms play a critical role in data mining and warehousing by helping organizations
to extract actionable insights from their data, which can be used to improve decision-making
processes, optimize business operations, and gain a competitive advantage in the marketplace.
Single dimensional Boolean association rule mining is a technique used to discover interesting
relationships or patterns in transaction databases. In this approach, the focus is on analyzing
the presence or absence of items in transactions and identifying associations between them.
Transaction Databases
Association rule mining aims to find associations or relationships between items in a transaction
database. An association rule consists of an antecedent (a set of items) and a consequent
(another set of items). The rule indicates that if the antecedent is present in a transaction, the
consequent is likely to be present as well.
In single dimensional Boolean association rule mining, each item is treated as a Boolean
attribute: what matters is only its presence or absence in a transaction. The process involves
identifying frequent itemsets and then generating association rules from those itemsets.
1. Frequent Itemsets: A frequent itemset is a set of items that appears frequently in the
transaction database. To identify frequent itemsets, the algorithm scans the transaction
database and counts the occurrences of each item or itemset. The support of an itemset is the
proportion of transactions in which the itemset appears. The algorithm selects itemsets with
support above a predefined threshold as frequent itemsets.
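As a quick illustration, support can be computed with a direct scan over the transactions. The
following is a minimal sketch in Python; the function name and the toy transactions are
illustrative, not part of the walkthrough:

from itertools import combinations

def support(itemset, transactions):
    # Fraction of transactions that contain every item in the itemset.
    hits = sum(1 for t in transactions if set(itemset) <= set(t))
    return hits / len(transactions)

transactions = [{"bread", "butter", "milk"}, {"bread", "butter"},
                {"bread", "jam"}, {"milk", "butter"}]
print(support({"bread", "butter"}, transactions))  # 0.5

# Frequent 2-itemsets at a 50% minimum support threshold:
items = sorted({i for t in transactions for i in t})
frequent2 = [set(c) for c in combinations(items, 2)
             if support(c, transactions) >= 0.5]
print(frequent2)  # two pairs qualify: {bread, butter} and {butter, milk}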
2. Association Rule Generation: Once frequent itemsets are identified, association rules can be
generated. An association rule has the form "antecedent => consequent," where both the
antecedent and consequent are itemsets. The confidence of a rule is the proportion of
transactions containing the antecedent that also contain the consequent. The algorithm selects
rules with confidence above a predefined threshold as interesting association rules.
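Rule generation can then reuse the same counting idea, filtering candidate rules by confidence.
Again a minimal sketch with illustrative names, continuing the example above:

def confidence(antecedent, consequent, transactions):
    # Of the transactions containing the antecedent, the fraction
    # that also contain the consequent.
    both = sum(1 for t in transactions if antecedent | consequent <= t)
    ante = sum(1 for t in transactions if antecedent <= t)
    return both / ante if ante else 0.0

transactions = [{"bread", "butter", "milk"}, {"bread", "butter"},
                {"bread", "jam"}, {"milk", "butter"}]
rule = ({"bread"}, {"butter"})
if confidence(*rule, transactions) >= 0.6:   # minimum confidence threshold
    print(rule, "is an interesting rule")    # confidence = 2/3, about 0.67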
Benefits and Applications
Single dimensional Boolean association rule mining provides valuable insights into the
relationships between items in transaction databases. It has several benefits and applications,
including:
Market Basket Analysis: By analyzing association rules, retailers can identify items that are
frequently purchased together. This information can be used for product placement, cross-
selling, and targeted marketing strategies.
Web Usage Mining: Association rules can be used to analyze user behavior on websites. By
identifying patterns in users' navigation paths, website owners can optimize website design,
recommend relevant content, and personalize user experiences.
Healthcare: Association rule mining can be applied to healthcare data to discover relationships
between symptoms, diseases, and treatments. This information can aid in diagnosis, treatment
planning, and disease prevention.
In conclusion, single dimensional Boolean association rule mining is a powerful technique for
discovering associations between items in transaction databases. It helps uncover valuable
insights that can be applied in various domains, such as retail, web analytics, and healthcare.
1. Uniform Support –
When a uniform minimum support threshold is used at every level of abstraction, the search
procedure is simplified. The method is also simple in that users are required to specify only
a single minimum support threshold. An optimization can be adopted, based on the
knowledge that an ancestor is a superset of its descendants: the search avoids examining
itemsets containing any item whose ancestors do not have minimum support. The uniform
support approach, however, has some difficulties. It is unlikely that items at lower levels of
abstraction will occur as frequently as those at higher levels of abstraction. If the minimum
support threshold is set too high, the search could miss several meaningful associations
occurring at low abstraction levels. This provides the motivation for the following approach.
2. Reduced Support –
For mining multilevel associations with reduced support, each level of abstraction is given its
own minimum support threshold, and there are several alternative search strategies:
Level-by-level independent –
This is a full-breadth search, where no background knowledge of frequent itemsets is used
for pruning. Every node is examined, regardless of whether its parent node was found to
be frequent.
Level-cross filtering by single item –
An item at the i-th level is examined if and only if its parent node at the (i-1)-th level is
frequent. In other words, we investigate a more specific association starting from a more
general one. If a node is frequent, its children are examined; otherwise, its descendants
are pruned from the search.
Level-cross filtering by k-itemset –
A k-itemset at the i-th level is examined if and only if its corresponding parent k-itemset
at the (i-1)-th level is frequent.
3. Group-based support –
Group-wise threshold values for support and confidence are input by the user or a domain
expert. Groups are selected based on product price or item category, because the expert
often has insight into which groups are more important than others.
Example –
Suppose experts are interested in the purchase patterns of laptops in the electronic category
or clothes in the non-electronic category. A low support threshold is then set for these
groups, to give attention to these items’ purchase patterns.
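How such group-wise thresholds could be looked up is easy to sketch in Python. The group
names and threshold values below are purely illustrative assumptions:

# Per-group minimum support thresholds chosen by a domain expert;
# groups and values here are illustrative.
group_min_support = {"laptop": 0.01, "clothing": 0.02, "default": 0.05}

def is_frequent(itemset_support, group):
    # Compare an itemset's support against its group's threshold.
    threshold = group_min_support.get(group, group_min_support["default"])
    return itemset_support >= threshold

print(is_frequent(0.015, "laptop"))    # True: 0.015 >= 0.01
print(is_frequent(0.015, "clothing"))  # False: 0.015 < 0.02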
Applications of Multilevel Association Rule Mining
Retail – helps retailers gain insights into customer buying behavior and preferences,
optimize product placement and pricing, and improve supply chain management.
Web Usage Mining – helps web-based companies gain insights into user preferences,
optimize website design and layout, and personalize content for individual users by
analyzing data at different levels of abstraction.
Social Network Analysis – helps social network providers identify influential users, detect
communities, and optimize network structure and design by analyzing social network data
at different levels of abstraction.
Other application areas include healthcare management and fraud detection.
Challenges in Association Rule Mining
High dimensionality – the problem of dealing with data sets that have a large number of
attributes.
Large volume – the problem of dealing with data sets that have a large number of records.
Scalability – the problem of dealing with data sets that are too large to fit into memory.
Apriori Algorithm
The Apriori algorithm finds frequent itemsets level by level, using the definitions of support
and confidence given earlier. Consider the following transaction dataset (the classic worked
example; the itemsets derived in the steps below follow from it) and let us find the frequent
itemsets and generate association rules for them:
TID   Items
T1    I1, I2, I5
T2    I2, I4
T3    I2, I3
T4    I1, I2, I4
T5    I1, I3
T6    I2, I3
T7    I1, I3
T8    I1, I2, I3, I5
T9    I1, I2, I3
minimum support count is 2
minimum confidence is 60%
Step-1: K=1
(I) Create a table containing the support count of each item present in the dataset, called
C1 (the candidate set):
Item   Support count
I1     6
I2     7
I3     6
I4     2
I5     2
(II) Compare the candidate set items’ support counts with the minimum support count (here
min_support = 2; if the support_count of a candidate item is less than min_support, remove
it). Every item meets the threshold, so this gives us the itemset L1 = C1.
Step-2: K=2
(I) Generate candidate set C2 using L1 (this is called the join step). The condition for joining
Lk-1 with Lk-1 is that the itemsets should have (K-2) elements in common; for K=2, every
pair of frequent 1-itemsets qualifies.
Check whether all subsets of each candidate itemset are frequent and, if not, remove that
itemset. (For example, the subsets of {I1, I2} are {I1} and {I2}, which are frequent. Check
each itemset this way.)
Now find the support count of these itemsets by searching the dataset.
(II) Compare the candidate set (C2) support counts with the minimum support count (here
min_support = 2; if the support_count of a candidate itemset is less than min_support,
remove it). This gives us the itemset L2 = {I1, I2}, {I1, I3}, {I1, I5}, {I2, I3}, {I2, I4},
{I2, I5}.
Step-3: K=3
o Generate candidate set C3 using L2 (join step). The condition for joining Lk-1 with Lk-1 is
that the itemsets should have (K-2) elements in common, so here, for L2, the first
element should match.
o The itemsets generated by joining L2 are {I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5},
{I2, I3, I4}, {I2, I4, I5}, {I2, I3, I5}.
o Check whether all subsets of these itemsets are frequent and, if not, remove that
itemset. (Here the subsets of {I1, I2, I3} are {I1, I2}, {I2, I3}, {I1, I3}, which are all
frequent. For {I2, I3, I4}, the subset {I3, I4} is not frequent, so remove it. Similarly
check every itemset.)
o Find the support count of the remaining itemsets by searching the dataset.
(II) Compare the candidate set (C3) support counts with the minimum support count (here
min_support = 2; if the support_count of a candidate itemset is less than min_support,
remove it). This gives us the itemset L3 = {I1, I2, I3}, {I1, I2, I5}.
Step-4: K=4
o Generate candidate set C4 using L3 (join step). The condition for joining Lk-1 with Lk-1
(K=4) is that the itemsets should have (K-2) elements in common, so here, for L3, the
first two elements (items) should match.
o Check whether all subsets of these itemsets are frequent. (Here the itemset formed by
joining L3 is {I1, I2, I3, I5}; its subsets include {I1, I3, I5}, which is not frequent.) So
C4 is empty.
o We stop here because no further frequent itemsets are found.
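The level-wise procedure above can be condensed into a short Python sketch. It implements
the join, prune, and support-counting steps on the walkthrough’s dataset; it is a teaching
sketch, not an optimized implementation:

from itertools import combinations

def apriori(transactions, min_support_count):
    def count(itemset):
        # Number of transactions containing the whole itemset.
        return sum(1 for t in transactions if itemset <= t)
    # L1: frequent 1-itemsets.
    items = sorted({i for t in transactions for i in t})
    L = [frozenset([i]) for i in items
         if count(frozenset([i])) >= min_support_count]
    frequent, k = list(L), 2
    while L:
        # Join step: merge two (k-1)-itemsets sharing their first k-2 items.
        candidates = {a | b for a in L for b in L
                      if len(a | b) == k
                      and sorted(a)[:k - 2] == sorted(b)[:k - 2]}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        prev = set(L)
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev
                             for s in combinations(c, k - 1))}
        # Support counting: scan the dataset once per level.
        L = [c for c in candidates if count(c) >= min_support_count]
        frequent += L
        k += 1
    return frequent

dataset = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
           {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"}, {"I1", "I3"},
           {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"}]
for itemset in apriori(dataset, min_support_count=2):
    print(sorted(itemset))   # prints L1, L2, and L3 as in the walkthrough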
Thus, we have discovered all the frequent itemsets. Now the generation of strong association
rules comes into the picture. For that we need to calculate the confidence of each rule.
Confidence –
Confidence measures how often a rule holds: for the rule {milk, bread} -> {butter}, a
confidence of 60% means that 60% of the customers who purchased milk and bread also
bought butter.
Confidence(A -> B) = Support_count(A ∪ B) / Support_count(A)
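For example, taking the frequent itemset {I1, I2, I5} from L3 (support count 2 in the dataset
above), the candidate rules and their confidences are:
{I1, I2} -> I5: confidence = 2/4 = 50% (rejected)
{I1, I5} -> I2: confidence = 2/2 = 100% (strong)
{I2, I5} -> I1: confidence = 2/2 = 100% (strong)
I1 -> {I2, I5}: confidence = 2/6 = 33% (rejected)
I2 -> {I1, I5}: confidence = 2/7 = 29% (rejected)
I5 -> {I1, I2}: confidence = 2/2 = 100% (strong)
With the minimum confidence of 60%, three strong rules are obtained from this itemset.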
Limitations of Apriori Algorithm
The Apriori algorithm can be slow. Its main limitation is the time required to hold a vast
number of candidate sets when there are many frequent itemsets, a low minimum support, or
large itemsets; that is, it is not an efficient approach for very large datasets. For example, if
there are 10^4 frequent 1-itemsets, the algorithm needs to generate more than 10^7
candidate 2-itemsets, which must then be tested and accumulated. Furthermore, to detect a
frequent pattern of size 100, i.e. {v1, v2, ..., v100}, it has to generate on the order of 2^100
candidate itemsets, which makes candidate generation costly and time-consuming. The
algorithm checks many candidate itemsets and scans the database repeatedly to count their
supports. Apriori therefore becomes very slow and inefficient when memory capacity is
limited and the number of transactions is large.
Frequent Pattern Growth Algorithm
The Apriori algorithm has two major shortcomings:
1. At each step, candidate sets have to be built.
2. To build the candidate sets, the algorithm has to repeatedly scan the database.
These two properties inevitably make the algorithm slower. To overcome these redundant
steps, a new association-rule mining algorithm was developed, named the Frequent Pattern
Growth (FP-Growth) Algorithm. It overcomes the disadvantages of the Apriori algorithm by
storing all the transactions in a compact trie-like data structure called an FP-tree. Consider
the following data:
The above-given data is a hypothetical dataset of transactions, with each letter representing
an item. The frequency of each individual item is computed.
Let the minimum support be 3. A Frequent Pattern set is built which will contain all the
elements whose frequency is greater than or equal to the minimum support. These elements
are stored in descending order of their respective frequencies. After insertion of the relevant
items, the set L looks like this:
L = {K : 5, E : 4, M : 3, O : 4, Y : 3}
Now, for each transaction, the respective Ordered-Item set is built. It is done by iterating
over the Frequent Pattern set and checking if the current item is contained in the transaction
in question. If the current item is contained, the item is inserted into the Ordered-Item set
for the current transaction. The following table is built for all the transactions:
Transaction   Ordered-Item Set
T1            {K, E, M, O, Y}
T2            {K, E, O, Y}
T3            {K, E, M}
T4            {K, M, Y}
T5            {K, E, O}
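This step is straightforward to sketch in code. In the Python sketch below, the raw
transactions are assumed for illustration (only their frequent items, K, E, M, O, and Y with
the counts in L above, are fixed by the walkthrough), and ties between equally frequent
items may order M and O differently than the table above:

from collections import Counter

def ordered_itemsets(transactions, min_support):
    # Keep only frequent items in each transaction, in descending
    # global-frequency order (ties broken by first appearance).
    counts = Counter(item for t in transactions for item in t)
    order = [i for i, c in counts.most_common() if c >= min_support]
    return [[i for i in order if i in t] for t in transactions]

transactions = [
    ["E", "K", "M", "N", "O", "Y"],
    ["D", "E", "K", "N", "O", "Y"],
    ["A", "E", "K", "M"],
    ["C", "K", "M", "U", "Y"],
    ["C", "E", "I", "K", "O"],
]
for row in ordered_itemsets(transactions, min_support=3):
    print(row)
# [['K', 'E', 'O', 'M', 'Y'], ['K', 'E', 'O', 'Y'], ['K', 'E', 'M'],
#  ['K', 'M', 'Y'], ['K', 'E', 'O']] -- the walkthrough's tie-breaking
#  uses K, E, M, O, Y instead.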
Now, all the Ordered-Item sets are inserted into a Trie Data Structure.
Here, all the items are simply linked one after the other in the order of occurrence in the
set, and the support count of each item is initialized as 1.
a) Inserting the set {K, E, M, O, Y}: since the tree is empty, a chain of new nodes
K -> E -> M -> O -> Y is created, each with support count 1.
b) Inserting the set {K, E, O, Y}: till the insertion of the elements K and E, the support
count is simply increased by 1. On inserting O we can see that there is no direct link
between E and O, therefore a new node for the item O is initialized with the support count
as 1 and item E is linked to this new node. On inserting Y, we first initialize a new node for
the item Y with support count as 1 and link the new node of O with the new node of Y.
c) Inserting the set {K, E, M}: the support counts of K, E, and M along the existing path are
simply increased by 1.
d) Inserting the set {K, M, Y}: similar to step b), first the support count of K is increased,
then new nodes for M and Y are initialized and linked accordingly.
e) Inserting the set {K, E, O}: here simply the support counts of the respective elements are
increased. Note that the support count of the new node of item O (the one created in step
b) is increased.
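The insertion logic itself is compact. Below is a minimal Python sketch of the tree
construction (class and function names are illustrative, and the header/node-link table used
by the full FP-Growth algorithm is omitted), fed with the ordered-item sets from the table
above:

class FPNode:
    def __init__(self, item, parent):
        self.item = item        # item label, e.g. "K"
        self.count = 1          # support count of this node
        self.parent = parent    # link back toward the root
        self.children = {}      # item -> child FPNode

def build_fp_tree(ordered_itemsets):
    root = FPNode(None, None)   # null root holds no item
    for itemset in ordered_itemsets:
        node = root
        for item in itemset:
            if item in node.children:
                # Path already exists: just increase the support count.
                node.children[item].count += 1
            else:
                # No direct link: create a new node with count 1.
                node.children[item] = FPNode(item, node)
            node = node.children[item]
    return root

def dump(node, depth=0):
    # Print the tree as an indented outline of item: count pairs.
    for child in node.children.values():
        print("  " * depth + f"{child.item}: {child.count}")
        dump(child, depth + 1)

ordered = [["K", "E", "M", "O", "Y"], ["K", "E", "O", "Y"],
           ["K", "E", "M"], ["K", "M", "Y"], ["K", "E", "O"]]
dump(build_fp_tree(ordered))
# K: 5
#   E: 4
#     M: 2
#       O: 1
#         Y: 1
#     O: 2
#       Y: 1
#   M: 1
#     Y: 1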
Now, for each item, the Conditional Pattern Base is computed, which consists of the path
labels of all the paths that lead to any node of the given item in the frequent-pattern tree,
together with the support count of that node. Note that the items in the table below are
arranged in ascending order of their frequencies:
Item   Conditional Pattern Base
Y      {{K, E, M, O : 1}, {K, E, O : 1}, {K, M : 1}}
O      {{K, E, M : 1}, {K, E : 2}}
M      {{K, E : 2}, {K : 1}}
E      {{K : 4}}
K      {}
Now, for each item, the Conditional Frequent Pattern Tree is built. It is done by taking the
set of elements that is common to all the paths in the Conditional Pattern Base of that item
and calculating its support count by summing the support counts of all the paths in the
Conditional Pattern Base:
Item   Conditional Frequent Pattern Tree
Y      {K : 3}
O      {K, E : 3}
M      {K : 3}
E      {K : 4}
K      {}
From the Conditional Frequent Pattern Tree, the Frequent Pattern rules are generated by
pairing the items of the Conditional Frequent Pattern Tree set with the corresponding item,
as given in the table below:
Item   Frequent Patterns Generated
Y      {K, Y : 3}
O      {K, O : 3}, {E, O : 3}, {K, E, O : 3}
M      {K, M : 3}
E      {K, E : 4}
For each row, several association rules can be inferred. For example, from the first row,
which contains the frequent pattern {K, Y}, the rules K -> Y and Y -> K can be inferred. To
determine the valid rules, the confidence of both rules is calculated, and the ones with
confidence greater than or equal to the minimum confidence are retained as strong rules.
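In practice, FP-Growth rarely has to be implemented by hand. Assuming the third-party
mlxtend library is available (pip install mlxtend), the same five ordered-item sets can be
mined as follows; this is a usage sketch, and the column names follow mlxtend’s documented
API:

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth, association_rules

# The five ordered-item sets from the walkthrough above.
transactions = [["K", "E", "M", "O", "Y"], ["K", "E", "O", "Y"],
                ["K", "E", "M"], ["K", "M", "Y"], ["K", "E", "O"]]

# One-hot encode the transactions into a boolean DataFrame.
te = TransactionEncoder()
df = pd.DataFrame(te.fit(transactions).transform(transactions),
                  columns=te.columns_)

# Minimum support of 3 out of 5 transactions = 0.6.
frequent = fpgrowth(df, min_support=0.6, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(frequent)
print(rules[["antecedents", "consequents", "support", "confidence"]])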