Unit-03 Dw&Dm Notes Ashish Singh PDF 11
Frequent pattern mining is an essential task in data mining that aims to uncover recurring patterns or itemsets in
a given dataset. It involves identifying sets of items that occur together frequently in a transactional
or relational database. This process can offer valuable insights into the relationships and associations among different
items or attributes within the data.
Apriori Algorithm:
The Apriori algorithm is one of the most well-known and widely used algorithms for frequent pattern mining.
It uses a breadth-first search strategy to discover frequent itemsets efficiently. The algorithm works in multiple
iterations. It starts by finding frequent individual items by scanning the database once and counting the occurrences of
each item. It then generates candidate itemsets of size 2 by combining the frequent itemsets of size 1. The support
of these candidate itemsets is calculated by scanning the database again. The process continues iteratively, generating
candidate itemsets of size k and calculating their support until no more frequent itemsets can be found.
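A minimal Python sketch of this level-wise process is given below. It is only an illustration, not part of the original notes: the function name, the toy baskets data, and the simplified candidate-generation step are assumptions, and real implementations add further optimizations.

from itertools import combinations

def apriori_frequent_itemsets(transactions, min_support_count):
    """Level-wise (breadth-first) search for frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]

    # Pass 1: count individual items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support_count}
    all_frequent = dict(frequent)

    k = 2
    while frequent:
        # Generate size-k candidates from items seen in frequent (k-1)-itemsets.
        # (Simplified here; the pruning refinement is shown in the next sketch.)
        items = sorted({item for itemset in frequent for item in itemset})
        candidates = [frozenset(c) for c in combinations(items, k)]

        # Scan the database again to count candidate support.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {c: n for c, n in counts.items() if n >= min_support_count}
        all_frequent.update(frequent)
        k += 1
    return all_frequent

baskets = [{"bread", "butter", "milk"}, {"bread", "butter"},
           {"milk", "bread"}, {"butter", "milk"}]
print(apriori_frequent_itemsets(baskets, min_support_count=2))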
Support-based Pruning:
During the Apriori algorithm’s execution, support-based pruning is used to reduce the search space and improve
efficiency. If an itemset is found to be infrequent (i.e., its support is below the minimum support threshold), then all of its
supersets are also guaranteed to be infrequent. Therefore, these supersets are pruned from further consideration. This
pruning step significantly reduces the number of candidate itemsets that need to be evaluated in
subsequent iterations.
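The pruning rule can be made concrete with a short helper that refines the candidate-generation step of the sketch above: a candidate k-itemset is kept only if every one of its (k-1)-subsets was found frequent in the previous pass. The function name and toy data are illustrative only.

from itertools import combinations

def prune_candidates(candidates, frequent_prev):
    """Keep a candidate k-itemset only if all of its (k-1)-subsets are frequent;
    if any subset is infrequent, every superset of it must be infrequent too."""
    kept = []
    for cand in candidates:
        subsets = combinations(cand, len(cand) - 1)
        if all(frozenset(s) in frequent_prev for s in subsets):
            kept.append(cand)
    return kept

# Hypothetical frequent 2-itemsets from the previous pass.
frequent_2 = {frozenset({"bread", "butter"}), frozenset({"bread", "milk"})}
candidates_3 = [frozenset({"bread", "butter", "milk"})]
# {bread, butter, milk} is pruned because its subset {butter, milk} is not frequent.
print(prune_candidates(candidates_3, frequent_2))   # -> []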
Applications:
Frequent pattern mining has various practical uses in different domains. Some examples include market
basket analysis, customer behavior analysis, web mining, bioinformatics, and network traffic analysis.
Market basket analysis involves analyzing customer purchase patterns to identify connections between
items and enhance sales strategies. In bioinformatics, frequent pattern mining can be used to identify
common patterns in DNA sequences, protein structures, or gene expressions, leading to insights in genetics
and drug design. Web mining can employ frequent pattern mining to discover navigational patterns, user
preferences, or collaborative filtering recommendations on the web.
There are several different algorithms used for frequent pattern mining, including:
1. Apriori algorithm: This is one of the most commonly used algorithms for frequent pattern mining. It uses
a “bottom-up” approach to identify frequent itemsets and then generates association rules from those itemsets (a sketch of this rule-generation step appears after this list).
2. ECLAT algorithm: This algorithm uses a “depth-first search” approach to identify frequent itemsets. It is
particularly efficient for datasets with a large number of items.
3. FP-growth algorithm: This algorithm uses a “compression” technique to find frequent patterns
efficiently. It is particularly efficient for datasets with a large number of transactions.
Frequent pattern mining has many applications, such as Market Basket Analysis, Recommender
Systems, Fraud Detection, and many more.
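The rule-generation step mentioned in item 1 can be sketched as follows. This is a hedged illustration, not a library implementation: the function name, the toy frequent-itemset dictionary, and the thresholds are assumptions. It simply splits each frequent itemset into an antecedent and a consequent and keeps the splits whose confidence meets the threshold.

from itertools import combinations

def generate_rules(freq_counts, n_transactions, min_conf):
    """Derive rules A -> B from frequent itemsets and their support counts, keeping
    those whose confidence = support_count(A ∪ B) / support_count(A) >= min_conf."""
    rules = []
    for itemset, count in freq_counts.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for antecedent in combinations(itemset, r):
                a = frozenset(antecedent)
                confidence = count / freq_counts[a]
                if confidence >= min_conf:
                    support = count / n_transactions
                    rules.append((set(a), set(itemset - a),
                                  round(support, 2), round(confidence, 2)))
    return rules

# Frequent itemsets and support counts, e.g. as produced by the Apriori sketch earlier.
freq = {frozenset({"bread"}): 3, frozenset({"butter"}): 3,
        frozenset({"bread", "butter"}): 2}
print(generate_rules(freq, n_transactions=4, min_conf=0.6))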
Advantages:
1. It can find useful information which is not visible in simple data browsing
2. It can find interesting association and correlation among data items
Disadvantages:
1. It can generate a large number of patterns
2. With high dimensionality, the number of patterns can be very large, making it difficult to interpret the
results.
Important Definitions:
Support: It is one of the measures of interestingness. It tells about the usefulness and certainty of
rules. A support of 5% means that 5% of all transactions in the database follow the rule.
Support(A -> B) = Support_count(A ∪ B) / Total number of transactions
Confidence: It measures how often the rule is found to be true. A confidence of 60% means that 60% of the customers who purchased milk and bread
also bought butter.
Confidence(A -> B) = Support_count(A ∪ B) / Support_count(A)
If a rule satisfies both minimum support and minimum confidence, it is a strong rule.
Support_count(X): Number of transactions in which X appears. If X is A ∪ B, then it is the number
of transactions in which both A and B are present.
Maximal Itemset: An itemset is maximal frequent if none of its supersets are frequent.
Closed Itemset: An itemset is closed if none of its immediate supersets has the same support count
as the itemset.
K-Itemset: An itemset which contains K items is a K-itemset. An itemset is frequent if its
support count is greater than or equal to the minimum support threshold.
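A small worked example of the Support and Confidence formulas above, on a hypothetical database of five transactions (the data is made up purely for illustration):

transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
    {"bread"},
]
n = len(transactions)                       # total transactions = 5

A = {"milk", "bread"}                       # rule antecedent
B = {"butter"}                              # rule consequent

support_count_AB = sum(1 for t in transactions if (A | B) <= t)    # 2 transactions
support_count_A = sum(1 for t in transactions if A <= t)           # 3 transactions

support = support_count_AB / n                     # 2 / 5 = 0.40 -> 40% of transactions
confidence = support_count_AB / support_count_A    # 2 / 3 ≈ 0.67 -> 67% confidence
print(support, confidence)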
Advantages of using frequent item sets and association rule mining include:
1. Efficient discovery of patterns: Association rule mining algorithms are efficient at discovering patterns
in large datasets, making them useful for tasks such as market basket analysis and recommendation
systems.
2. Easy to interpret: The results of association rule mining are easy to understand and interpret, making it
possible to explain the patterns found in the data.
3. Can be used in a wide range of applications: Association rule mining can be used in a wide range of
applications such as retail, finance, and healthcare, which can help to improve decision-making and
increase revenue.
4. Handling large datasets: These algorithms can handle large datasets with many items and transactions,
which makes them suitable for big-data scenarios.
Disadvantages of using frequent item sets and association rule mining include:
1. Large number of generated rules: Association rule mining can generate a large number of rules, many of
which may be irrelevant or uninteresting, which can make it difficult to identify the most important
patterns.
2. Limited in detecting complex relationships: Association rule mining is limited in its ability to detect
complex relationships between items, and it only considers the co-occurrence of items in the same
transaction.
3. Can be computationally expensive: As the number of items and transactions increases, the number of
candidate item sets also increases, which can make the algorithm computationally expensive.
4. Need to define the minimum support and confidence threshold: The minimum support and confidence
threshold must be set before the association rule mining process, which can be difficult and requires a
good understanding of the data.
What is Association?
Association is a technique used in data mining to identify the relationships or co-occurrences between items in a dataset. It
involves analyzing large datasets to discover patterns or associations between items, such as products purchased together in
a supermarket or web pages frequently visited together on a website. Association analysis is based on the idea of finding the
most frequent patterns or itemsets in a dataset, where an itemset is a collection of one or more items.
Association analysis can provide valuable insights into consumer behaviour and preferences. It can help retailers identify
the items that are frequently purchased together, which can be used to optimize product placement and promotions.
Similarly, it can help e-commerce websites recommend related products to customers based on their purchase history.
Types of Associations
Here are the most common types of associations used in data mining:
Itemset Associations: Itemset association is the most common type of association analysis, which is used to discover
relationships between items in a dataset. In this type of association, a collection of one or more items that frequently co-occur
together is called an itemset. For example, in a supermarket dataset, itemset association can be used to identify items that are
frequently purchased together, such as bread and butter.
Sequential Associations: Sequential association is used to identify patterns that occur in a specific sequence or order. This
type of association analysis is commonly used in applications such as analyzing customer behaviour on e-commerce websites
or studying weblogs. For example, in a weblog dataset, sequential association can be used to identify the sequence of
pages that users visit before making a purchase.
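A minimal sketch of this idea, assuming hypothetical click sessions from a web log: it counts how many sessions contain a given ordered pattern of pages (not necessarily consecutively). The session data and function name are made up for illustration.

def sequence_support(sessions, pattern):
    """Count sessions containing the pages of `pattern` in the given order
    (other pages may appear in between) - a simple sequential-association measure."""
    count = 0
    for session in sessions:
        idx = 0
        for page in session:
            if page == pattern[idx]:
                idx += 1
                if idx == len(pattern):
                    count += 1
                    break
    return count

sessions = [
    ["home", "product", "cart", "checkout"],
    ["home", "search", "product", "checkout"],
    ["home", "blog"],
]
print(sequence_support(sessions, ["product", "checkout"]))   # -> 2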
Graph-based Associations: Graph-based association is a type of association analysis that involves representing the
relationships between items in a dataset as a graph. In this type of association, each item is represented as a node in the graph,
and the edges between nodes represent the co-occurrence or relationship between items. The graph-based association is used in
various applications, such as social network analysis, recommendation systems, and fraud detection. For example, in a social
network dataset, it can be used to identify groups of users with similar interests or behaviours.
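A brief sketch of the graph representation, assuming simple market-basket data: each item becomes a node, and each pair of items that appears in the same transaction gets an edge whose weight counts the co-occurrences. The data and function name are illustrative.

from collections import defaultdict
from itertools import combinations

def cooccurrence_graph(transactions):
    """Build a weighted co-occurrence graph: nodes are items, and the weight of
    edge (a, b) is the number of transactions containing both a and b."""
    edges = defaultdict(int)
    for t in transactions:
        for a, b in combinations(sorted(t), 2):
            edges[(a, b)] += 1
    return edges

baskets = [{"bread", "butter", "milk"}, {"bread", "butter"}, {"milk", "butter"}]
for (a, b), weight in cooccurrence_graph(baskets).items():
    print(f"{a} -- {b}  (weight {weight})")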
Association Rule Mining
Here are the most commonly used algorithms to implement association rule mining in data mining:
Apriori Algorithm - Apriori is one of the most widely used algorithms for association rule mining. It generates frequent item
sets from a given dataset by pruning infrequent item sets iteratively. The Apriori algorithm is based on the concept that if an
item set is frequent, then all of its subsets must also be frequent. The algorithm first identifies the frequent items in the dataset,
then generates candidate itemsets of length two from the frequent items, and so on until no more frequent itemsets can be
generated. The Apriori algorithm is computationally expensive, especially for large datasets with many items.
FP-Growth Algorithm - FP-Growth is another popular algorithm for association rule mining that is based on the concept of
frequent pattern growth. It is faster than the Apriori algorithm, especially for large datasets. The FP-Growth algorithm builds a
compact representation of the dataset called a frequent pattern tree (FP-tree), which is used to mine frequent item sets. The
algorithm scans the dataset only twice, first to build the FP-tree and then to mine the frequent itemsets. The FP-Growth
algorithm can also handle datasets with continuous attributes, provided they are first discretized into items.
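The tree-building part of the two-scan idea can be sketched as follows; the mining phase (extracting frequent itemsets from conditional pattern bases) is omitted for brevity. The class and function names and the toy data are assumptions, not the notes' own code.

from collections import Counter

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 1, {}

def build_fp_tree(transactions, min_support_count):
    """Scan 1: count item frequencies. Scan 2: insert each transaction's frequent
    items, ordered by descending frequency, into a shared prefix tree (the FP-tree)."""
    freq = Counter(item for t in transactions for item in t)
    freq = {i: c for i, c in freq.items() if c >= min_support_count}
    root = FPNode(None, None)
    for t in transactions:
        ordered = sorted((i for i in t if i in freq), key=lambda i: (-freq[i], i))
        node = root
        for item in ordered:
            if item in node.children:
                node.children[item].count += 1
            else:
                node.children[item] = FPNode(item, node)
            node = node.children[item]
    return root, freq

baskets = [{"bread", "butter", "milk"}, {"bread", "butter"}, {"milk", "bread"}]
tree, item_counts = build_fp_tree(baskets, min_support_count=2)
print(item_counts)   # frequent single items and their counts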
Eclat Algorithm - Eclat (Equivalence Class Clustering and Bottom-up Lattice Traversal) is a frequent itemset mining
algorithm based on the vertical data format. The algorithm first converts the dataset into a vertical data format, where each item
and the transaction ID in which it appears are stored. Eclat then performs a depth-first search on a tree-like structure,
representing the dataset's frequent itemsets. The algorithm is efficient in terms of both memory usage and runtime, especially for
sparse datasets.
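A compact sketch of the vertical-format idea: each item is mapped to the set of transaction IDs (its tidset), and frequent itemsets are grown depth-first by intersecting tidsets within the same equivalence class. Names and toy data are illustrative only.

from collections import defaultdict

def eclat(transactions, min_support_count):
    """Eclat sketch: vertical data format plus depth-first tidset intersection."""
    # Vertical format: itemset -> set of transaction IDs containing it.
    tidsets = defaultdict(set)
    for tid, t in enumerate(transactions):
        for item in t:
            tidsets[frozenset([item])].add(tid)
    frequent = {s: tids for s, tids in tidsets.items() if len(tids) >= min_support_count}

    def extend(eq_class):
        # Join itemsets sharing the same prefix; intersect their tidsets.
        members = sorted(eq_class, key=sorted)
        for i, a in enumerate(members):
            extensions = {}
            for b in members[i + 1:]:
                tids = eq_class[a] & eq_class[b]
                if len(tids) >= min_support_count:
                    extensions[a | b] = tids
            frequent.update(extensions)
            extend(extensions)

    extend(dict(frequent))
    return {tuple(sorted(s)): len(tids) for s, tids in frequent.items()}

baskets = [{"bread", "butter", "milk"}, {"bread", "butter"}, {"milk", "bread"}]
print(eclat(baskets, min_support_count=2))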
Correlation Analysis is a data mining technique used to identify the degree to which two or more variables are related or
associated with each other. Correlation refers to the statistical relationship between two or more variables, where the
variation in one variable is associated with the variation in another variable. In other words, it measures how changes in one
variable are related to changes in another variable. Correlation can be positive, negative, or zero, depending on the direction
and strength of the relationship between the variables.
For example, suppose we are studying the relationship between the hours of study and the grades obtained by students. If we find
that as the number of hours of study increases, the grades obtained also increase, then there is a positive correlation between
the two variables. On the other hand, if we find that as the number of hours of study increases, the grades obtained decrease,
then there is a negative correlation between the two variables. If there is no relationship between the two variables, we
would say that there is zero correlation.
Correlation analysis is important because it allows us to measure the strength and direction of the relationship between
two or more variables. This information can help identify patterns and trends in the data, make predictions, and select
relevant variables for analysis. By understanding the relationships between different variables, we can gain valuable insights
into complex systems and make informed decisions based on data-driven analysis.
There are three main types of correlation analysis used in data mining, as mentioned below:
Pearson Correlation Coefficient - Pearson correlation measures the linear relationship between two continuous variables. It
ranges from -1 to +1, where -1 indicates a perfect negative correlation, 0 indicates no correlation, and +1 indicates a perfect
positive correlation.
Kendall Rank Correlation - Kendall correlation is a non-parametric measure of the association between two ordinal
variables. It measures the degree of correspondence between the ranking of observations on two variables. It calculates the
difference between the number of concordant pairs (pairs of observations that have the same rank order in both variables) and
discordant pairs (pairs of observations that have an opposite rank order in the two variables) and normalizes the result by
dividing by the total number of pairs.
Spearman Rank Correlation - Spearman correlation is another non-parametric measure of the relationship between two
variables. It measures the degree of association between the ranks of two variables. Spearman correlation is similar to the
Kendall correlation in that it measures the strength of the relationship between two variables measured on a ranked scale.
However, Spearman correlation uses the actual numerical ranks of the data instead of counting the number of concordant and
discordant pairs.
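All three coefficients are available in standard libraries. The snippet below is a sketch assuming NumPy and SciPy are installed; the hours-of-study versus grades data is made up to mirror the earlier example and is not taken from any real study.

import numpy as np
from scipy import stats

# Hypothetical paired observations: hours studied vs. grade obtained.
hours = np.array([1, 2, 3, 4, 5, 6, 7, 8])
grade = np.array([52, 55, 61, 60, 68, 74, 79, 85])

pearson_r, _ = stats.pearsonr(hours, grade)      # linear relationship (-1 to +1)
kendall_tau, _ = stats.kendalltau(hours, grade)  # concordant vs. discordant pairs
spearman_rho, _ = stats.spearmanr(hours, grade)  # correlation of the ranks

print(f"Pearson r = {pearson_r:.2f}, Kendall tau = {kendall_tau:.2f}, "
      f"Spearman rho = {spearman_rho:.2f}")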
Correlation analysis is a powerful tool in data mining and statistical analysis that offers several benefits.
Identifying Relationships - Correlation analysis helps identify the relationships between different variables in a dataset. By
quantifying the degree and direction of the relationship, we can gain insights into how changes in one variable are likely to
affect the other.
Prediction - Correlation analysis can help predict the values of one variable based on the values of another. Models built
on observed correlations can be used to predict future outcomes and support informed decisions.
Feature Selection - Correlation analysis can also help select the most relevant features for a particular analysis or model. By
identifying the features that are highly correlated with the outcome variable, we can focus on those features and exclude the
irrelevant ones, improving the accuracy and efficiency of the analysis or model.
Quality Control - Correlation analysis is useful in quality control applications, where it can be used to identify correlations
between different process variables and identify potential sources of quality problems.
Here are some examples of the most common use cases for association and correlation in data mining -
Market Basket Analysis - Association mining is commonly used in retail and e-commerce industries to identify patterns in
customer purchase behaviour. By analyzing transaction data, businesses can uncover product associations and make informed
decisions about product placement, pricing, and marketing strategies.
Medical Research - Correlation analysis is often used in medical research to explore relationships between different variables,
such as the correlation between smoking and lung cancer risk or the correlation between blood pressure and heart disease.
Financial Analysis - Correlation analysis is frequently used in financial analysis to measure the strength of relationships
between different financial variables, such as the correlation between stock prices and interest rates.
Fraud Detection - Association mining can be used to identify behaviour patterns associated with fraudulent activity, such as
multiple failed login attempts or unusual purchase patterns.