Data Mining U3

Single and Multidimensional Association Rules

Association rules are a commonly used technique in data mining and warehousing for
discovering interesting relationships between variables or items in a dataset.

Single-dimensional association rules involve a single dimension or predicate, typically the items bought in a transaction, such as the association between the purchase of one product and the purchase of another. These rules can be represented as "if A, then B" statements, where A is the antecedent (the itemset being analyzed) and B is the consequent (the itemset being predicted or associated with A). For example, a single-dimensional association rule could be "If a customer buys bread, they are likely to also buy butter."

Multidimensional association rules involve two or more dimensions or predicates. These rules are useful for discovering more complex relationships between attributes, such as the association between a customer's age, gender, and purchasing habits. Multidimensional association rules can be represented as "if A and B, then C" statements, where A and B are the antecedents and C is the consequent. For example, a multidimensional association rule could be "If a customer is female, over 30 years old, and has previously purchased skincare products, they are likely to also purchase anti-aging products."

Both single and multidimensional association rules can be useful for identifying patterns and
trends in large datasets, and can be used to make predictions and inform business decisions.
However, multidimensional association rules are generally more complex and may require more
advanced algorithms and techniques to discover.
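As a minimal illustration, the following Python sketch evaluates the single-dimensional rule "if bread, then butter" over a small, hypothetical list of market-basket transactions (the items and counts are made up purely for illustration):

# Hypothetical market-basket transactions; each transaction is a set of items.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
]

has_bread = [t for t in transactions if "bread" in t]
has_both = [t for t in has_bread if "butter" in t]

support = len(has_both) / len(transactions)   # fraction of all transactions with both items
confidence = len(has_both) / len(has_bread)   # fraction of bread-buyers who also buy butter
print(f"support={support:.2f}, confidence={confidence:.2f}")  # support=0.50, confidence=0.67

Here the rule "if bread, then butter" holds for two of the three bread transactions, so its confidence is about 67%.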

Algorithms

Algorithms are crucial in data mining and warehousing as they help to extract useful information
and insights from large datasets. Data mining algorithms are used to identify patterns,
relationships, and correlations in data that can be used to make predictions and inform business
decisions. Some common algorithms used in data mining include association rules, decision
trees, clustering, and neural networks.
In data warehousing, algorithms are used to improve the efficiency and accuracy of data processing tasks such as data cleaning, data integration, and data transformation. Some commonly used algorithms across data mining and warehousing include the Apriori algorithm for frequent itemset mining, the K-means algorithm for clustering, and decision tree algorithms for data classification.

Overall, algorithms play a critical role in data mining and warehousing by helping organizations
to extract actionable insights from their data, which can be used to improve decision-making
processes, optimize business operations, and gain a competitive advantage in the marketplace.

Single Dimensional Boolean Association Rule Mining for Transaction Databases

Single dimensional Boolean association rule mining is a technique used to discover interesting
relationships or patterns in transaction databases. In this approach, the focus is on analyzing
the presence or absence of items in transactions and identifying associations between them.

Transaction Databases

A transaction database is a collection of transactions, where each transaction represents a set of items purchased or used together. For example, in a retail store, each transaction may represent a customer's purchase, and the items bought by the customer form the transaction.

Association Rule Mining

Association rule mining aims to find associations or relationships between items in a transaction
database. An association rule consists of an antecedent (a set of items) and a consequent
(another set of items). The rule indicates that if the antecedent is present in a transaction, the
consequent is likely to be present as well.

Single Dimensional Boolean Association Rule Mining

In single dimensional Boolean association rule mining, the focus is on analyzing the presence or
absence of items in transactions. It involves identifying frequent itemsets and generating
association rules based on these itemsets.

1. Frequent Itemsets: A frequent itemset is a set of items that appears frequently in the
transaction database. To identify frequent itemsets, the algorithm scans the transaction
database and counts the occurrences of each item or itemset. The support of an itemset is the
proportion of transactions in which the itemset appears. The algorithm selects itemsets with
support above a predefined threshold as frequent itemsets.
2. Association Rule Generation: Once frequent itemsets are identified, association rules can be
generated. An association rule has the form "antecedent => consequent," where both the
antecedent and consequent are itemsets. The confidence of a rule is the proportion of
transactions containing the antecedent that also contain the consequent. The algorithm selects
rules with confidence above a predefined threshold as interesting association rules.
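To make the two definitions above concrete, here is a small, self-contained Python sketch (the toy transaction database, items, and thresholds are illustrative assumptions) that keeps 1- and 2-itemsets passing the support threshold and then keeps rules passing the confidence threshold:

from itertools import combinations

# Toy transaction database: each transaction is a set of items.
transactions = [
    {"bread", "butter"}, {"bread", "milk"},
    {"bread", "butter", "milk"}, {"milk", "eggs"},
]
min_support, min_confidence = 0.5, 0.6

def support(itemset):
    # Proportion of transactions that contain every item in itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

items = {i for t in transactions for i in t}
# 1. Frequent itemsets: keep 1- and 2-itemsets with support >= min_support.
candidates = [frozenset([i]) for i in items] + \
             [frozenset(p) for p in combinations(items, 2)]
frequent = [c for c in candidates if support(c) >= min_support]

# 2. Rule generation: antecedent => consequent from frequent 2-itemsets.
for itemset in (f for f in frequent if len(f) == 2):
    for antecedent in itemset:
        consequent = itemset - {antecedent}
        conf = support(itemset) / support(frozenset([antecedent]))
        if conf >= min_confidence:
            print(f"{set([antecedent])} => {set(consequent)} (conf={conf:.2f})")

For instance, with this toy data the rule {'butter'} => {'bread'} comes out with support 0.50 and confidence 1.00, since both transactions containing butter also contain bread.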
Benefits and Applications

Single dimensional Boolean association rule mining provides valuable insights into the
relationships between items in transaction databases. It has several benefits and applications,
including:

 Market Basket Analysis: By analyzing association rules, retailers can identify items that are frequently purchased together. This information can be used for product placement, cross-selling, and targeted marketing strategies.
 Web Usage Mining: Association rules can be used to analyze user behavior on websites. By
identifying patterns in users' navigation paths, website owners can optimize website design,
recommend relevant content, and personalize user experiences.
 Healthcare: Association rule mining can be applied to healthcare data to discover relationships
between symptoms, diseases, and treatments. This information can aid in diagnosis, treatment
planning, and disease prevention.
In conclusion, single dimensional Boolean association rule mining is a powerful technique for
discovering associations between items in transaction databases. It helps uncover valuable
insights that can be applied in various domains, such as retail, web analytics, and healthcare.

Multilevel Association Rules:


Association rules generated from mining data at different levels of abstraction are called multiple-level or multilevel association rules.
Multilevel association rules can be mined efficiently using concept hierarchies under a support-confidence framework.
Rules at a high concept level may add to common sense, while rules at a low concept level may not always be useful.
Using uniform minimum support for all levels:
 When a uniform minimum support threshold is used, the search procedure is simplified.
 The method is also simple, in that users are required to specify only a single minimum support threshold.
 The same minimum support threshold is used when mining at each level of abstraction (for example, mining from "computer" down to "laptop computer"): both "computer" and "laptop computer" are found to be frequent, while "desktop computer" is not.
Needs of Multilevel Association Rules:
 Sometimes at a low data level, the data does not show any significant pattern, yet there is useful information hiding behind it.
 The aim is to find the hidden information within or between levels of abstraction.
Approaches to multilevel association rule mining:
1. Uniform Support (using a uniform minimum support for all levels)
2. Reduced Support (using a reduced minimum support at lower levels)
3. Group-based Support (using item- or group-based support)
Let's discuss them one by one.

1. Uniform Support –
When a uniform minimum support threshold is used, the search procedure is simplified. The method is also simple in that users are required to specify just a single minimum support threshold. An optimization technique can be adopted, based on the knowledge that an ancestor is a superset of its descendants: the search avoids examining itemsets containing any item whose ancestors do not have minimum support. The uniform support approach, however, has some difficulties. It is unlikely that items at lower levels of abstraction will occur as frequently as those at higher levels of abstraction. If the minimum support threshold is set too high, it could miss several meaningful associations occurring at low abstraction levels. This provides the motivation for the following approach.
2. Reduced Support –
For mining multilevel associations with reduced support, there are several alternative search strategies, as follows (a minimal sketch illustrating reduced support appears after this list).
 Level-by-level independent –
This is a full-breadth search, where no background knowledge of frequent itemsets is used for pruning. Each node is examined, regardless of whether its parent node is found to be frequent.
 Level-cross filtering by single item –
An item at the i-th level is examined if and only if its parent node at the (i-1)-th level is frequent. In other words, we investigate a more specific association starting from a more general one. If a node is frequent, its children will be examined; otherwise, its descendants are pruned from the search.
 Level-cross filtering by k-itemset –
A k-itemset at the i-th level is examined if and only if its corresponding parent k-itemset at the (i-1)-th level is frequent.
3. Group-based support –
The group-wise threshold values for support and confidence are input by the user or an expert. Groups are selected based on product price or itemset, because the expert often has insight into which groups are more important than others.
Example –
Experts may be interested in the purchase patterns of laptops or clothes in the electronic and non-electronic categories. Therefore, a low support threshold is set for these groups to give attention to these items' purchase patterns.
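The following short Python sketch shows the reduced-support idea; the concept hierarchy, transactions, and thresholds are hypothetical, chosen only for illustration. A higher minimum support is applied at the general level ("computer") and a lower one at the specific level ("laptop computer", "desktop computer"):

# Hypothetical concept hierarchy: specific item -> general category.
hierarchy = {
    "laptop computer": "computer",
    "desktop computer": "computer",
}
transactions = [
    {"laptop computer"}, {"laptop computer"},
    {"desktop computer"}, {"laptop computer", "desktop computer"},
]
# Reduced support: a lower threshold at the lower (more specific) level.
min_support = {"high": 0.75, "low": 0.5}

def support(item, level):
    if level == "high":
        # A transaction counts if it contains any specialization of the item.
        hits = sum(any(hierarchy.get(i) == item for i in t) for t in transactions)
    else:
        hits = sum(item in t for t in transactions)
    return hits / len(transactions)

print("computer frequent at high level:",
      support("computer", "high") >= min_support["high"])
for item in hierarchy:
    print(item, "frequent at low level:",
          support(item, "low") >= min_support["low"])

With a uniform threshold of 0.75, "desktop computer" (support 0.5) would be missed; reducing the threshold to 0.5 at the lower level keeps it.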

Applications of Multilevel Association Rule in data mining

Some applications are as follows.

Retail Sales Analysis

Multilevel Association Rule mining helps retailers gain insights into customer buying
behavior and preferences, optimize product placement and pricing, and improve
supply chain management.
Healthcare Management

Multilevel Association Rule mining helps healthcare providers identify patterns in patient behavior, diagnose diseases, identify high-risk patients, and optimize treatment plans.

Fraud Detection

Multilevel Association Rule mining helps companies identify fraudulent patterns, detect anomalies, and prevent fraud in various industries such as finance, insurance, and telecommunications.

Web Usage Mining

Multilevel Association Rule mining helps web-based companies gain insights into
user preferences, optimize website design and layout, and personalize content for
individual users by analyzing data at different levels of abstraction.

Social Network Analysis

Multilevel Association Rule mining helps social network providers identify influential
users, detect communities, and optimize network structure and design by analyzing
social network data at different levels of abstraction.

Challenges in Multilevel Association Rule Mining

Multilevel Association Rule mining poses several challenges, including high dimensionality, large data set size, and scalability issues.

High dimensionality

It is the problem of dealing with data sets that have a large number of attributes.

Large data set size

It is the problem of dealing with data sets that have a large number of records.
Scalability

It is the problem of dealing with data sets that are too large to fit into memory.

Data Mining Multidimensional Association Rule


In this article, we discuss multidimensional association rules, along with examples of each approach. Let's discuss them one by one.

Multidimensional Association Rules:


In multidimensional association rules, attributes can be categorical or quantitative.

 Quantitative attributes are numeric and incorporate an implicit ordering.
 Numeric attributes should be discretized.
 A multidimensional association rule consists of more than one dimension (predicate).
 Example – buys(X, "IBM Laptop Computer") ⇒ buys(X, "HP Inkjet Printer")
Approaches to mining multidimensional association rules:
Three approaches to mining multidimensional association rules are as follows.
1. Using static discretization of quantitative attributes (a sketch follows this list):
 Discretization is static and occurs prior to mining.
 Discretized attributes are treated as categorical.
 Use the Apriori algorithm to find all k-frequent predicate sets (this requires k or k+1 table scans). Every subset of a frequent predicate set must be frequent.
Example –
If, in a data cube, the 3-D cuboid (age, income, buys) is frequent, this implies that (age, income), (age, buys), and (income, buys) are also frequent.
Note –
Data cubes are well suited for mining, since they make mining faster. The cells of an n-dimensional data cuboid correspond to the predicate sets.
2. Using dynamic discretization of quantitative attributes:
 Known as mining quantitative association rules.
 Numeric attributes are dynamically discretized.
Example –
age(X, "20..25") Λ income(X, "30K..41K") ⇒ buys(X, "Laptop Computer")
3. Using distance-based discretization with clustering:
This is a dynamic discretization process that considers the distance between data points. It involves a two-step mining process, as follows.
 Perform clustering to find the intervals involved.
 Obtain association rules by searching for groups of clusters that occur together.
The resulting rules may satisfy the following:
 Clusters in the rule antecedent are strongly associated with clusters in the rule consequent.
 Clusters in the antecedent occur together.
 Clusters in the consequent occur together.
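Below is a minimal Python sketch of approach 1 (static discretization). The bin boundaries and customer records are made-up assumptions for illustration; the point is that numeric attributes are binned before mining, and each (attribute, interval) pair is then treated as an ordinary categorical item:

# Sketch: static discretization of quantitative attributes prior to mining.
customers = [
    {"age": 23, "income": 32_000, "buys": "laptop computer"},
    {"age": 41, "income": 55_000, "buys": "inkjet printer"},
]

def discretize(row):
    # Illustrative, hand-chosen bins; a real system would derive these.
    age_bin = "20..25" if 20 <= row["age"] <= 25 else "26+"
    income_bin = "30K..41K" if 30_000 <= row["income"] <= 41_000 else "other"
    # Each (predicate, interval) pair becomes a categorical item.
    return {f"age:{age_bin}", f"income:{income_bin}", f"buys:{row['buys']}"}

transactions = [discretize(c) for c in customers]
print(transactions[0])
# e.g. {'age:20..25', 'income:30K..41K', 'buys:laptop computer'} (set order varies)

These discretized itemsets can then be fed to the Apriori algorithm to find frequent predicate sets such as (age, income, buys).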
Apriori Algorithm

The Apriori algorithm was proposed by R. Agrawal and R. Srikant in 1994 for finding frequent itemsets in a dataset for Boolean association rules. The algorithm is named Apriori because it uses prior knowledge of frequent itemset properties. We apply an iterative, level-wise search where frequent k-itemsets are used to find frequent (k+1)-itemsets.
To improve the efficiency of level-wise generation of frequent itemsets, an important property called the Apriori property is used, which helps by reducing the search space.
Apriori Property –
All non-empty subsets of a frequent itemset must be frequent. The key concept of the Apriori algorithm is the anti-monotonicity of the support measure. Apriori assumes that:
All subsets of a frequent itemset must be frequent (Apriori property).
If an itemset is infrequent, all its supersets will be infrequent.

Before we start understanding the algorithm, go through some definitions which are explained in my previous post.
Consider the following dataset; we will find frequent itemsets and generate association rules for it.
minimum support count is 2
minimum confidence is 60%
Step-1: K=1
(I) Create a table containing the support count of each item present in the dataset, called C1 (the candidate set).

(II) Compare each candidate itemset's support count with the minimum support count (here min_support = 2; if the support_count of a candidate itemset is less than min_support, remove it). This gives us the itemset L1.

Step-2: K=2
 Generate the candidate set C2 using L1 (this is called the join step). The condition for joining L(k-1) with L(k-1) is that the itemsets should have (k-2) elements in common.
 Check whether all subsets of each itemset are frequent or not; if not, remove that itemset. (Example: the subsets of {I1, I2} are {I1} and {I2}, which are frequent. Check this for each itemset.)
 Now find the support count of these itemsets by searching the dataset.

(II) Compare the candidate (C2) support counts with the minimum support count (here min_support = 2; if the support_count of a candidate itemset is less than min_support, remove it). This gives us the itemset L2.

Step-3:
o Generate the candidate set C3 using L2 (join step). The condition for joining L(k-1) with L(k-1) is that the itemsets should have (k-2) elements in common. So here, for L2, the first element should match.
The itemsets generated by joining L2 are {I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I4, I5}, {I2, I3, I5}.
o Check whether all subsets of these itemsets are frequent or not, and if not, remove that itemset. (Here the subsets of {I1, I2, I3} are {I1, I2}, {I2, I3}, {I1, I3}, which are frequent. For {I2, I3, I4}, the subset {I3, I4} is not frequent, so remove it. Similarly check every itemset.)
o Find the support count of the remaining itemsets by searching the dataset.

(II) Compare the candidate (C3) support counts with the minimum support count (here min_support = 2; if the support_count of a candidate itemset is less than min_support, remove it). This gives us the itemset L3.

Step-4:
o Generate the candidate set C4 using L3 (join step). The condition for joining L(k-1) with L(k-1) (here K = 4) is that the itemsets should have (k-2) elements in common. So here, for L3, the first two elements (items) should match.
o Check whether all subsets of these itemsets are frequent or not. (Here the itemset formed by joining L3 is {I1, I2, I3, I5}, and one of its subsets, {I1, I3, I5}, is not frequent.) So there is no itemset in C4.
o We stop here because no further frequent itemsets are found.
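The walkthrough above condenses into a short, generic level-wise loop. The sketch below is a minimal Python rendition; the transaction list db is a hypothetical stand-in consistent with the I1..I5 itemsets used above, not necessarily the exact table from the example:

from itertools import combinations

def apriori(transactions, min_support_count):
    # Level-wise frequent-itemset mining, mirroring Steps 1-4 above.
    def count(itemset):
        return sum(itemset <= t for t in transactions)

    # Step 1: frequent 1-itemsets (L1).
    items = {i for t in transactions for i in t}
    level = {frozenset([i]) for i in items
             if count(frozenset([i])) >= min_support_count}
    frequent = set(level)
    k = 2
    while level:
        # Join step: merge (k-1)-itemsets sharing (k-2) items.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in level
                             for s in combinations(c, k - 1))}
        level = {c for c in candidates if count(c) >= min_support_count}
        frequent |= level
        k += 1
    return frequent

# Hypothetical usage with the stand-in transaction database:
db = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
      {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
      {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"}]
print(sorted(map(sorted, apriori(db, min_support_count=2))))

With this stand-in data the loop terminates at K=4 with L3 = {I1, I2, I3} and {I1, I2, I5}, exactly as in the walkthrough.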

Thus, we have discovered all the frequent itemsets. Now the generation of strong association rules comes into the picture. For that, we need to calculate the confidence of each rule.

Confidence –
A confidence of 60% means that 60% of the customers who purchased milk and bread also bought butter.
Confidence(A -> B) = Support_count(A ∪ B) / Support_count(A)
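Applying this formula in code, a strong-rule filter can be added to the sketch above (continuing with the hypothetical db; min_confidence = 0.6 matches the 60% threshold):

from itertools import combinations

def strong_rules(itemset, transactions, min_confidence):
    # Enumerate antecedent => consequent splits of a frequent itemset
    # and keep those whose confidence clears the threshold.
    def count(s):
        return sum(s <= t for t in transactions)
    rules = []
    for r in range(1, len(itemset)):
        for antecedent in map(frozenset, combinations(itemset, r)):
            consequent = itemset - antecedent
            conf = count(itemset) / count(antecedent)
            if conf >= min_confidence:
                rules.append((set(antecedent), set(consequent), conf))
    return rules

for a, c, conf in strong_rules(frozenset({"I1", "I2", "I5"}), db, 0.6):
    print(f"{a} -> {c} (confidence {conf:.0%})")

For example, with the stand-in db, {I5} -> {I1, I2} comes out with 100% confidence, since both transactions containing I5 also contain I1 and I2.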
Limitations of the Apriori Algorithm
The Apriori algorithm can be slow. Its main limitation is the time required to hold a vast number of candidate sets when there are many frequent itemsets, low minimum support, or large itemsets; it is not an efficient approach for large datasets. For example, if there are 10^4 frequent 1-itemsets, more than 10^7 candidate 2-itemsets need to be generated, which in turn must be tested and accumulated. Furthermore, to detect a frequent pattern of size 100, e.g. v1, v2, ..., v100, it has to generate 2^100 candidate itemsets, which makes candidate generation costly and time-consuming. The algorithm checks many candidate itemsets and scans the database repeatedly while finding them. Apriori becomes very slow and inefficient when memory capacity is limited and the number of transactions is large.
Frequent Pattern Growth Algorithm
The two primary drawbacks of the Apriori algorithm are:
1. At each step, candidate sets have to be built.
2. To build the candidate sets, the algorithm has to repeatedly scan the database.

These two properties inevitably make the algorithm slower. To overcome these redundant steps, a new association-rule mining algorithm was developed, named the Frequent Pattern Growth Algorithm. It overcomes the disadvantages of the Apriori algorithm by storing all the transactions in a Trie data structure. Consider the following data:
The above data is a hypothetical dataset of transactions, with each letter representing an item. The frequency of each individual item is computed:

Let the minimum support be 3. A Frequent Pattern set is built which will contain all the elements whose frequency is greater than or equal to the minimum support. These elements are stored in descending order of their respective frequencies. After insertion of the relevant items, the set L looks like this:

L = {K : 5, E : 4, M : 3, O : 4, Y : 3}

Now, for each transaction, the respective Ordered-Item set is built. It is done by
iterating the Frequent Pattern set and checking if the current item is contained in
the transaction in question. If the current item is contained, the item is inserted in
the Ordered-Item set for the current transaction. The following table is built for all
the transactions:
Now, all the Ordered-Item sets are inserted into a Trie Data Structure.

a) Inserting the set {K, E, M, O, Y}:

Here, all the items are simply linked one after the other in the order of occurrence in the set, and the support count for each item is initialized as 1.

b) Inserting the set {K, E, O, Y}:

Up to the insertion of the elements K and E, the support count is simply increased by 1. On inserting O, we see that there is no direct link between E and O; therefore a new node for the item O is initialized with the support count as 1, and item E is linked to this new node. On inserting Y, we first initialize a new node for the item Y with support count as 1 and link the new node of O with the new node of Y.

c) Inserting the set {K, E, M}:

Here simply the support count of each element is increased by 1.


d) Inserting the set {K, M, Y}:

Similar to step b), first the support count of K is increased, then new nodes for M
and Y are initialized and linked accordingly.
e) Inserting the set {K, E, O}:

Here simply the support counts of the respective elements are increased. Note that
the support count of the new node of item O is increased.
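The five insertions above can be reproduced with a small trie sketch. The Node class below is a deliberately minimal illustration; a full FP-tree implementation would also maintain a header table with node links per item:

# Minimal trie sketch reproducing insertions a) through e) above.
class Node:
    def __init__(self, item):
        self.item, self.count, self.children = item, 0, {}

def insert(root, ordered_items):
    node = root
    for item in ordered_items:
        # Reuse an existing child link, or create a new child node.
        node = node.children.setdefault(item, Node(item))
        node.count += 1

root = Node(None)
for itemset in [["K", "E", "M", "O", "Y"], ["K", "E", "O", "Y"],
                ["K", "E", "M"], ["K", "M", "Y"], ["K", "E", "O"]]:
    insert(root, itemset)

def show(node, depth=0):
    for child in node.children.values():
        print("  " * depth + f"{child.item}:{child.count}")
        show(child, depth + 1)

show(root)  # prints K:5, E:4, M:2, O:1, Y:1, O:2, Y:1, M:1, Y:1, indented by depth

Running the sketch prints the tree rooted at the K:5 branch, matching the support counts accumulated in steps a) through e).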
Now, for each item, the Conditional Pattern Base is computed, which consists of the path labels of all the paths that lead to any node of the given item in the frequent-pattern tree. Note that the items in the table below are arranged in ascending order of their frequencies.
Now for each item, the Conditional Frequent Pattern Tree is built. It is done by
taking the set of elements that is common in all the paths in the Conditional Pattern
Base of that item and calculating its support count by summing the support counts
of all the paths in the Conditional Pattern Base.

From the Conditional Frequent Pattern tree, the Frequent Pattern rules are generated by pairing the items of the Conditional Frequent Pattern Tree set with the corresponding item, as given in the table below.
For each row, two types of association rules can be inferred; for example, for the first row, which contains the elements K and Y, the rules K -> Y and Y -> K can be inferred. To determine the valid rule, the confidence of both rules is calculated, and the one with confidence greater than or equal to the minimum confidence is retained.
