
UNIT-03: Patterns, Associations and Correlations

Frequent Pattern Mining in Data Mining


Frequent pattern mining in data mining is the process of identifying patterns or associations within a dataset that occur
frequently. This is typically done by analyzing large datasets to find items or sets of items that appear together frequently.

Frequent pattern mining is an essential task in data mining that aims to uncover recurring patterns or itemsets in
a given dataset. It involves identifying sets of items that occur together frequently in a transactional
or relational database. This process can offer valuable insights into the relationships and associations among different
items or attributes within the data.

 Apriori Algorithm:
The Apriori algorithm is one of the most well-known and widely used algorithms for frequent pattern mining.
It uses a breadth-first (level-wise) search strategy to discover frequent itemsets efficiently. The algorithm works in multiple
iterations. It starts by finding frequent individual items by scanning the database once and counting the occurrences of
each item. It then generates candidate itemsets of size 2 by combining the frequent itemsets of size 1. The support
of these candidate itemsets is calculated by scanning the database again. The process continues iteratively, generating
candidate itemsets of size k and calculating their support until no more frequent itemsets can be found (a sketch of this
level-wise process is given after the next point).

 Support-based Pruning:
During the Apriori algorithm’s execution, support-based pruning is used to reduce the search space and improve
efficiency. If an itemset is found to be infrequent (i.e., its support is below the minimum support threshold), then all of its
supersets are guaranteed to be infrequent as well. Therefore, these supersets are pruned from further consideration. This
pruning step significantly decreases the number of candidate itemsets that need to be evaluated in
subsequent iterations.
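Below is a minimal Python sketch of the level-wise candidate generation and support-based pruning just described. It is an illustration under simplified assumptions (transactions are plain Python sets and min_support is an absolute count), not a tuned implementation.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise frequent-itemset mining (illustrative sketch)."""
    # Pass 1: count individual items and keep the frequent ones.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    all_frequent = dict(frequent)

    k = 2
    while frequent:
        # Join step: build candidate k-itemsets from frequent (k-1)-itemsets.
        prev = list(frequent)
        candidates = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                union = prev[i] | prev[j]
                # Prune step: every (k-1)-subset must itself be frequent.
                if len(union) == k and all(
                        frozenset(sub) in frequent
                        for sub in combinations(union, k - 1)):
                    candidates.add(union)
        # Count candidate supports with one more scan of the database.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        all_frequent.update(frequent)
        k += 1
    return all_frequent

# Hypothetical transactions; itemsets appearing in at least 2 of them are frequent.
txns = [{"milk", "bread"}, {"milk", "bread", "eggs"}, {"bread", "eggs"}]
print(apriori(txns, min_support=2))
```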

 Association Rule Mining:


Frequent itemsets can be further examined to discover association rules, which represent relationships
between different items. An association rule consists of an antecedent (left-hand side) and a consequent (right-hand side),
both of which are itemsets. For instance, {milk, bread} => {eggs} is an association rule. Association rules
are produced from frequent itemsets by considering different combinations of items and calculating
measures such as support, confidence, and lift. Support measures how frequently the antecedent and the
consequent appear together, while confidence measures the conditional probability of the consequent
given the antecedent. Lift indicates the strength of the association between the antecedent and the
consequent relative to their individual supports.
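For concreteness, here is a rough sketch of how support, confidence, and lift can be computed for a single rule from a list of transactions; the transactions and the rule {milk, bread} => {eggs} are purely illustrative.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Support, confidence and lift for the rule: antecedent -> consequent."""
    n = len(transactions)
    antecedent, consequent = set(antecedent), set(consequent)
    both   = sum(1 for t in transactions if antecedent | consequent <= t)
    only_a = sum(1 for t in transactions if antecedent <= t)
    only_c = sum(1 for t in transactions if consequent <= t)
    support = both / n                 # fraction of transactions containing A and B together
    confidence = both / only_a         # P(B | A)
    lift = confidence / (only_c / n)   # how much more often B occurs given A than on its own
    return support, confidence, lift

# Hypothetical transactions for illustration only.
txns = [{"milk", "bread", "eggs"}, {"milk", "bread"},
        {"bread", "eggs"}, {"milk", "bread", "eggs"}]
print(rule_metrics(txns, {"milk", "bread"}, {"eggs"}))
```

A lift above 1 suggests the antecedent and consequent occur together more often than would be expected by chance, while a lift below 1 suggests the opposite.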

 Applications:
Frequent pattern mining has various practical uses in different domains. Some examples include market
basket analysis, customer behavior analysis, web mining, bioinformatics, and network traffic analysis.
Market basket analysis involves analyzing customer purchase patterns to identify connections between
items and enhance sales strategies. In bioinformatics, frequent pattern mining can be used to identify
common patterns in DNA sequences, protein structures, or gene expressions, leading to insights in genetics
and drug design. Web mining can employ frequent pattern mining to discover navigational patterns, user
preferences, or collaborative filtering recommendations on the web.
There are several different algorithms used for frequent pattern mining, including:
1. Apriori algorithm: This is one of the most commonly used algorithms for frequent pattern mining. It uses
a “bottom-up” approach to identify frequent itemsets and then generates association rules from those
itemsets.
2. ECLAT algorithm: This algorithm uses a “depth-first search” approach to identify frequent itemsets. It is
particularly efficient for datasets with a large number of items.
3. FP-growth algorithm: This algorithm uses a “compression” technique to find frequent patterns
efficiently. It is particularly efficient for datasets with a large number of transactions.
Frequent pattern mining has many applications, such as market basket analysis, recommender
systems, fraud detection, and many more.

Advantages:
1. It can find useful information which is not visible in simple data browsing
2. It can find interesting association and correlation among data items
Disadvantages:
1. It can generate a large number of patterns
2. With high dimensionality, the number of patterns can be very large, making it difficult to interpret the
results.

Issues of frequent pattern mining


 Flexibility and reusability: most of the algorithms used for mining frequent itemsets do not offer flexibility for reusing the patterns they create.
 Pattern volume: much research is still needed to reduce the size of the derived pattern sets so that the results remain interpretable.

Frequent pattern mining has several applications in different areas, including:
 Market Basket Analysis: This is the process of analyzing customer purchasing patterns in order to
identify items that are frequently bought together. This information can be used to optimize product
placement, create targeted marketing campaigns, and make other business decisions.
 Recommender Systems: Frequent pattern mining can be used to identify patterns in user behavior and
preferences in order to make personalized recommendations.
 Fraud Detection: Frequent pattern mining can be used to identify abnormal patterns of behavior that may
indicate fraudulent activity.
 Network Intrusion Detection: Network administrators can use frequent pattern mining to detect patterns
of network activity that may indicate a security threat.
 Medical Analysis: Frequent pattern mining can be used to identify patterns in medical data that may
indicate a particular disease or condition.
 Text Mining: Frequent pattern mining can be used to identify patterns in text data, such as keywords or
phrases that appear frequently together in a document.
 Web usage mining: Frequent pattern mining can be used to analyze patterns of user behavior on a
website, such as which pages are visited most frequently or which links are clicked on most often.
 Gene Expression: Frequent pattern mining can be used to analyze patterns of gene expression in order to
identify potential biomarkers for different diseases.

Frequent Itemsets in a Dataset (Association Rule Mining)


1. Frequent itemsets are a fundamental concept in association rule mining, which is a technique used in
data mining to discover relationships between items in a dataset. The goal of association rule mining is
to identify sets of items that occur together frequently and to derive rules that describe those relationships.
2. A frequent item set is a set of items that occur together frequently in a dataset. The frequency of an item
set is measured by the support count, which is the number of transactions or records in the dataset that
contain the item set. For example, if a dataset contains 100 transactions and the item set {milk, bread}
appears in 20 of those transactions, the support count for {milk, bread} is 20.
3. Association rule mining algorithms, such as Apriori or FP-Growth, are used to find frequent item sets
and generate association rules. These algorithms work by iteratively generating candidate item sets and
pruning those that do not meet the minimum support threshold. Once the frequent item sets are found,
association rules can be generated by using the concept of confidence, which is the ratio of the number
of transactions that contain the full itemset to the number of transactions that contain the antecedent
(left-hand side) of the rule, as illustrated in the library-based sketch after this list.
4. Frequent item sets and association rules can be used for a variety of tasks such as market basket analysis,
cross-selling and recommendation systems. However, it should be noted that association rule mining can
generate a large number of rules, many of which may be irrelevant or uninteresting. Therefore, it is
important to use appropriate measures such as lift and conviction to evaluate the interestingness of the
generated rules.
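The same workflow can be reproduced with an off-the-shelf library. The sketch below uses mlxtend's Apriori and rule-generation functions (assuming mlxtend and pandas are installed); the transactions and thresholds are illustrative.

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Hypothetical transactions for illustration only.
dataset = [["milk", "bread"], ["milk", "bread", "eggs"],
           ["bread", "eggs"], ["milk", "bread", "eggs"]]

# One-hot encode the transactions into a boolean DataFrame.
te = TransactionEncoder()
onehot = te.fit(dataset).transform(dataset)
df = pd.DataFrame(onehot, columns=te.columns_)

# Frequent itemsets above a minimum support threshold (here 50% of transactions).
frequent_itemsets = apriori(df, min_support=0.5, use_colnames=True)

# Association rules filtered by a minimum confidence of 60%.
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])
```

The resulting table lists each rule with its support, confidence, and lift, which can then be filtered using the interestingness measures discussed above.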

Important Definitions :

 Support: It is one of the measures of interestingness. It tells about the usefulness and certainty of
rules. A support of 5% means that 5% of all the transactions in the database follow the rule.
Support(A -> B) = Support_count(A ∪ B) / Total number of transactions
 Confidence: A confidence of 60% means that 60% of the customers who purchased milk and bread
also bought butter.
Confidence(A -> B) = Support_count(A ∪ B) / Support_count(A)

If a rule satisfies both minimum support and minimum confidence, it is a strong rule.
 Support_count(X): Number of transactions in which X appears. If X is A union B then it is the number
of transactions in which A and B both are present.
 Maximal Itemset: An itemset is maximal frequent if none of its supersets are frequent.
 Closed Itemset: An itemset is closed if none of its immediate supersets has the same support count
as the itemset (see the sketch after these definitions).
 K-Itemset: An itemset that contains K items is a K-itemset. An itemset is called frequent if its
support count is at least the minimum support count.
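As a small illustration of the maximal and closed definitions, the sketch below takes hypothetical support counts for a handful of frequent itemsets and flags which of them are closed and which are maximal:

```python
# Hypothetical support counts for the frequent itemsets of a small dataset.
supports = {
    frozenset({"milk"}): 4,
    frozenset({"bread"}): 5,
    frozenset({"milk", "bread"}): 4,   # same count as {milk}, so {milk} is not closed
}

def is_closed(itemset, supports):
    # Closed: no proper superset has the same support count.
    return not any(itemset < other and supports[other] == supports[itemset]
                   for other in supports)

def is_maximal(itemset, supports):
    # Maximal: no proper superset is frequent at all.
    return not any(itemset < other for other in supports)

for s in supports:
    print(set(s), "closed:", is_closed(s, supports), "maximal:", is_maximal(s, supports))
```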

Advantages of using frequent item sets and association rule mining include:

1. Efficient discovery of patterns: Association rule mining algorithms are efficient at discovering patterns
in large datasets, making them useful for tasks such as market basket analysis and recommendation
systems.
2. Easy to interpret: The results of association rule mining are easy to understand and interpret, making it
possible to explain the patterns found in the data.
3. Can be used in a wide range of applications: Association rule mining can be used in a wide range of
applications such as retail, finance, and healthcare, which can help to improve decision-making and
increase revenue.
4. Handling large datasets: These algorithms can handle large datasets with many items and transactions,
which makes them suitable for big-data scenarios.
Disadvantages of using frequent item sets and association rule mining include:

1. Large number of generated rules: Association rule mining can generate a large number of rules, many of
which may be irrelevant or uninteresting, which can make it difficult to identify the most important
patterns.
2. Limited in detecting complex relationships: Association rule mining is limited in its ability to detect
complex relationships between items, and it only considers the co-occurrence of items in the same
transaction.
3. Can be computationally expensive: As the number of items and transactions increases, the number of
candidate item sets also increases, which can make the algorithm computationally expensive.
4. Need to define the minimum support and confidence threshold: The minimum support and confidence
threshold must be set before the association rule mining process, which can be difficult and requires a
good understanding of the data.

Association and Correlation in Data Mining


In data mining, association and correlation are key techniques for extracting patterns and relationships from large datasets.
Association uncovers relationships between items, while correlation measures the strength of the link between two
variables. This exploration will delve into these techniques, their types, and methods, pivotal for informed decision-making
in various domains.

What is Association?

Association is a technique used in data mining to identify the relationships or co-occurrences between items in a dataset. It
involves analyzing large datasets to discover patterns or associations between items, such as products purchased together in
a supermarket or web pages frequently visited together on a website. Association analysis is based on the idea of finding the
most frequent patterns or itemsets in a dataset, where an itemset is a collection of one or more items.

Association analysis can provide valuable insights into consumer behaviour and preferences. It can help retailers identify
the items that are frequently purchased together, which can be used to optimize product placement and promotions.
Similarly, it can help e-commerce websites recommend related products to customers based on their purchase history.

Types of Associations

Here are the most common types of associations used in data mining:

 Itemset Associations: Itemset association is the most common type of association analysis, which is used to discover
relationships between items in a dataset. In this type of association, a collection of one or more items that frequently co-occur
together is called an itemset. For example, in a supermarket dataset, itemset association can be used to identify items that are
frequently purchased together, such as bread and butter.
 Sequential Associations: Sequential association is used to identify patterns that occur in a specific sequence or order. This
type of association analysis is commonly used in applications such as analyzing customer behaviour on e-commerce websites
or studying weblogs. For example, in the weblogs dataset, a sequential association can be used to identify the sequence of
pages that users visit before making a purchase.
 Graph-based Associations: Graph-based association is a type of association analysis that involves representing the
relationships between items in a dataset as a graph. In this type of association, each item is represented as a node in the graph,
and the edges between nodes represent the co-occurrence or relationship between items. Graph-based association is used in
various applications, such as social network analysis, recommendation systems, and fraud detection. For example, in a social
network dataset, it can be used to identify groups of users with similar interests or behaviours.
Association Rule Mining

Here are the most commonly used algorithms to implement association rule mining in data mining:

 Apriori Algorithm - Apriori is one of the most widely used algorithms for association rule mining. It generates frequent item
sets from a given dataset by pruning infrequent item sets iteratively. The Apriori algorithm is based on the concept that if an
item set is frequent, then all of its subsets must also be frequent. The algorithm first identifies the frequent items in the dataset,
then generates candidate itemsets of length two from the frequent items, and so on until no more frequent itemsets can be
generated. The Apriori algorithm is computationally expensive, especially for large datasets with many items.
 FP-Growth Algorithm - FP-Growth is another popular algorithm for association rule mining that is based on the concept of
frequent pattern growth. It is faster than the Apriori algorithm, especially for large datasets. The FP-Growth algorithm builds a
compact representation of the dataset called a frequent pattern tree (FP-tree), which is used to mine frequent item sets. The
algorithm scans the dataset only twice, first to build the FP-tree and then to mine the frequent itemsets. The FP-Growth
algorithm can handle datasets with both discrete and continuous attributes.
 Eclat Algorithm - Eclat (Equivalence Class Clustering and Bottom-up Lattice Traversal) is a frequent itemset mining
algorithm based on the vertical data format. The algorithm first converts the dataset into a vertical data format, where each item
and the transaction ID in which it appears are stored. Eclat then performs a depth-first search on a tree-like structure,
representing the dataset's frequent itemsets. The algorithm is efficient regarding both memory usage and runtime, especially for
sparse datasets.
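A minimal sketch of Eclat's vertical (tidset) idea is shown below, assuming transactions are given as sets and min_support is an absolute count; frequent itemsets are grown depth-first by intersecting transaction-ID sets. It illustrates the data format rather than a full optimized implementation.

```python
def eclat(transactions, min_support):
    """Depth-first frequent-itemset mining on a vertical (item -> tidset) layout."""
    # Convert to vertical format: each item maps to the set of transaction IDs containing it.
    tidsets = {}
    for tid, items in enumerate(transactions):
        for item in items:
            tidsets.setdefault(item, set()).add(tid)

    frequent = {}

    def grow(prefix, prefix_tids, candidates):
        for i, (item, tids) in enumerate(candidates):
            # Support of the extended itemset is the size of the intersected tidset.
            new_tids = prefix_tids & tids if prefix else tids
            if len(new_tids) >= min_support:
                itemset = prefix | {item}
                frequent[frozenset(itemset)] = len(new_tids)
                # Recurse depth-first, extending only with items later in the order.
                grow(itemset, new_tids, candidates[i + 1:])

    grow(set(), set(), sorted(tidsets.items()))
    return frequent

# Hypothetical transactions for illustration only.
txns = [{"milk", "bread"}, {"milk", "bread", "eggs"}, {"bread", "eggs"}]
print(eclat(txns, min_support=2))
```

Because support counting reduces to set intersection, repeated full scans of the database are avoided, which is one reason the vertical format works well on sparse data.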

Correlation Analysis in Data Mining

Correlation Analysis is a data mining technique used to identify the degree to which two or more variables are related or
associated with each other. Correlation refers to the statistical relationship between two or more variables, where the
variation in one variable is associated with the variation in another variable. In other words, it measures how changes in one
variable are related to changes in another variable. Correlation can be positive, negative, or zero, depending on the direction
and strength of the relationship between the variables.

For example, suppose we are studying the relationship between the hours of study and the grades obtained by students. If we find
that as the number of hours of study increases, the grades obtained also increase, then there is a positive correlation between
the two variables. On the other hand, if we find that as the number of hours of study increases, the grades obtained decrease,
then there is a negative correlation between the two variables. If there is no relationship between the two variables, we
would say that there is zero correlation.

Why is Correlation Analysis Important?

Correlation analysis is important because it allows us to measure the strength and direction of the relationship between
two or more variables. This information can help identify patterns and trends in the data, make predictions, and select
relevant variables for analysis. By understanding the relationships between different variables, we can gain valuable insights
into complex systems and make informed decisions based on data-driven analysis.

Types of Correlation Analysis in Data Mining

There are three main types of correlation analysis used in data mining, as mentioned below:

 Pearson Correlation Coefficient - Pearson correlation measures the linear relationship between two continuous variables. It
ranges from -1 to +1, where -1 indicates a perfect negative correlation, 0 indicates no correlation, and +1 indicates a perfect
positive correlation.
 Kendall Rank Correlation - Kendall correlation is a non-parametric measure of the association between two ordinal
variables. It measures the degree of correspondence between the ranking of observations on two variables. It calculates the
difference between the number of concordant pairs (pairs of observations that have the same rank order in both variables) and
discordant pairs (pairs of observations that have an opposite rank order in the two variables) and normalizes the result by
dividing by the total number of pairs.
 Spearman Rank Correlation - Spearman correlation is another non-parametric measure of the relationship between two
variables. It measures the degree of association between the ranks of two variables. Spearman correlation is similar to the
Kendall correlation in that it measures the strength of the relationship between two variables measured on a ranked scale.
However, Spearman correlation uses the actual numerical ranks of the data instead of counting the number of concordant and
discordant pairs.
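All three coefficients are available in SciPy. The sketch below computes them on a small, made-up hours-of-study versus grade dataset (assuming scipy is installed):

```python
from scipy.stats import pearsonr, kendalltau, spearmanr

# Illustrative data only: hours of study vs. exam grade for eight students.
hours = [1, 2, 3, 4, 5, 6, 7, 8]
grade = [52, 55, 61, 60, 68, 72, 75, 80]

r, _ = pearsonr(hours, grade)        # linear relationship, ranges from -1 to +1
tau, _ = kendalltau(hours, grade)    # rank-based, concordant vs. discordant pairs
rho, _ = spearmanr(hours, grade)     # rank-based, uses the numerical ranks directly

print(f"Pearson r = {r:.3f}, Kendall tau = {tau:.3f}, Spearman rho = {rho:.3f}")
```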

Benefits of Correlation Analysis

Correlation analysis is a powerful tool in data mining and statistical analysis that offers several benefits.

Some of the main benefits of correlation analysis are:

 Identifying Relationships - Correlation analysis helps identify the relationships between different variables in a dataset. By
quantifying the degree and direction of the relationship, we can gain insights into how changes in one variable are likely to
affect the other.
 Prediction - Correlation analysis can help predict the values of one variable based on the values of another. Models built
on such correlations can be used to predict future outcomes and support informed decision-making.
 Feature Selection - Correlation analysis can also help select the most relevant features for a particular analysis or model. By
identifying the features that are highly correlated with the outcome variable, we can focus on those features and exclude the
irrelevant ones, improving the accuracy and efficiency of the analysis or model.
 Quality Control - Correlation analysis is useful in quality control applications, where it can be used to identify correlations
between different process variables and identify potential sources of quality problems.

Use Cases for Correlation Analysis and Association Mining

Here are some examples of the most common use cases for association and correlation in data mining -

 Market Basket Analysis - Association mining is commonly used in retail and e-commerce industries to identify patterns in
customer purchase behaviour. By analyzing transaction data, businesses can uncover product associations and make informed
decisions about product placement, pricing, and marketing strategies.
 Medical Research - Correlation analysis is often used in medical research to explore relationships between different variables,
such as the correlation between smoking and lung cancer risk or the correlation between blood pressure and heart disease.
 Financial Analysis - Correlation analysis is frequently used in financial analysis to measure the strength of relationships
between different financial variables, such as the correlation between stock prices and interest rates.
 Fraud Detection - Association mining can be used to identify behaviour patterns associated with fraudulent activity, such as
multiple failed login attempts or unusual purchase patterns.

Pattern Evaluation Methods in Data Mining


In data mining, pattern evaluation is the process of assessing the quality of discovered patterns. This process
is important in order to determine whether the patterns are useful and whether they can be trusted. There are
a number of different measures that can be used to evaluate patterns, and the choice of measure will depend
on the application.
There are several ways to evaluate pattern mining algorithms:
1. Accuracy
The accuracy of a data mining model is a measure of how correctly the model predicts the target values. The
accuracy is measured on a test dataset, which is separate from the training dataset that was used to train the
model. There are a number of ways to measure accuracy, but the most common is to calculate the
percentage of correct predictions. This is known as the accuracy rate.
2. Classification Accuracy
This measures how accurately the patterns discovered by the algorithm can be used to classify new data.
This is typically done by taking a set of data that has been labeled with known class labels and then using
the discovered patterns to predict the class labels of the data. The accuracy can then be computed by
comparing the predicted labels to the actual labels.
Classification accuracy is one of the most popular evaluation metrics for classification models, and it is
simply the percentage of correct predictions made by the model. Although it is a straightforward and easy-
to-understand metric, classification accuracy can be misleading in certain situations. For example, if we
have a dataset with a very imbalanced class distribution, such as 100 instances of class 0 and 1,000
instances of class 1, then a model that always predicts class 1 will achieve a high classification accuracy of
about 91% (1,000 correct out of 1,100 predictions). However, this model is clearly not very useful, since it
never makes a correct prediction for class 0.
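A short sketch of this pitfall, using the hypothetical 100/1,000 class split from the example above:

```python
# Hypothetical labels: 100 instances of class 0 and 1,000 of class 1.
y_true = [0] * 100 + [1] * 1000

# A "model" that always predicts class 1, regardless of the input.
y_pred = [1] * len(y_true)

correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
accuracy = correct / len(y_true)
print(f"Accuracy: {accuracy:.1%}")   # about 91%, yet useless for class 0

# Per-class recall makes the problem visible.
recall_0 = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0) / 100
recall_1 = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1) / 1000
print(f"Recall class 0: {recall_0:.0%}, recall class 1: {recall_1:.0%}")
```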
3. Clustering Accuracy
This measures how accurately the patterns discovered by the algorithm can be used to cluster new data. This
is typically done by taking a set of data that has been labeled with known cluster labels and then using the
discovered patterns to predict the cluster labels of the data. The accuracy can then be computed by
comparing the predicted labels to the actual labels.
There are a few ways to evaluate the accuracy of a clustering algorithm:
 External indices: these indices compare the clusters produced by the algorithm to some known ground
truth. For example, the Rand Index or the Jaccard coefficient can be used if the ground truth is known.
 Internal indices: these indices assess the goodness of clustering without reference to any external
information. The most popular internal index is the Dunn index.
 Stability: this measures how robust the clustering is to small changes in the data. A clustering algorithm
is said to be stable if, when applied to different samples of the same data, it produces the same results.
 Efficiency: this measures how quickly the algorithm converges to the correct clustering.
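The external and internal indices above can be computed with scikit-learn (assuming it is installed); the labels and points below are illustrative. Note that scikit-learn provides the Rand and adjusted Rand indices and the silhouette score, but the Dunn index mentioned above has no built-in function and is not shown.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score, rand_score, silhouette_score

# External indices: compare predicted clusters against known ground-truth labels.
truth = [0, 0, 0, 1, 1, 1]
predicted = [0, 0, 1, 1, 1, 1]
print("Rand index:", rand_score(truth, predicted))
print("Adjusted Rand index:", adjusted_rand_score(truth, predicted))

# Internal index: assess cluster quality from the data alone (no ground truth).
points = np.array([[1.0, 1.1], [0.9, 1.0], [1.1, 0.9],
                   [5.0, 5.1], [5.2, 4.9], [4.8, 5.0]])
labels = [0, 0, 0, 1, 1, 1]
print("Silhouette score:", silhouette_score(points, labels))
```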
4. Coverage
This measures how many of the possible patterns in the data are discovered by the algorithm. It can be
computed by dividing the number of patterns discovered by the algorithm by the total number of possible
patterns. A coverage pattern is a type of sequential pattern that is found by looking for items that tend to
appear together in sequential order. For example, a coverage pattern might be “customers who purchase
item A also tend to purchase item B within the next month.”
5. Visual Inspection
This is perhaps the most common method, where the data miner simply looks at the patterns to see if they
make sense. In visual inspection, the data is plotted in a graphical format and the pattern is observed. This
method is used when the data is not too large and can be easily plotted. It is also used when the data is
categorical in nature. Visual inspection is a pattern evaluation method in data mining where the data is
visually inspected for patterns. This can be done by looking at a graph or plot of the data, or by looking at
the raw data itself. This method is often used to find outliers or unusual patterns.
6. Running Time
This measures how long it takes for the algorithm to find the patterns in the data. This is typically measured
in seconds or minutes. There are a few different ways to measure the performance of a machine learning
algorithm, but one of the most common is to simply measure the amount of time it takes to train the model
and make predictions. This is known as the running time pattern evaluation.
7. Support
The support of a pattern is the percentage of the total number of records that contain the pattern. Support
Pattern evaluation is a process of finding interesting and potentially useful patterns in data. The purpose of
support pattern evaluation is to identify interesting patterns that may be useful for decision-making. Support
pattern evaluation is typically used in data mining and machine learning applications.
8. Confidence
The confidence of a pattern is the percentage of times that the pattern is found to be correct. Confidence
Pattern evaluation is a method of data mining that is used to assess the quality of patterns found in data.
This evaluation is typically performed by calculating the percentage of times a pattern is found in a data set
and comparing this percentage to the percentage of times the pattern is expected to be found based on the
overall distribution of data. If the percentage of times a pattern is found is significantly higher than the
expected percentage, then the pattern is said to be a strong confidence pattern.
9. Lift
The lift of a pattern is the ratio of the number of times that the pattern is found to be correct to the number
of times that the pattern is expected to be correct. Lift Pattern evaluation is a data mining technique that can
be used to evaluate the performance of a predictive model. The lift pattern is a graphical representation of
the model’s performance and can be used to identify potential problems with the model.
10. Prediction
The prediction of a pattern is the percentage of times that the pattern is found to be correct. Prediction
Pattern evaluation is a data mining technique used to assess the accuracy of predictive models. It is used to
determine how well a model can predict future outcomes based on past data. Prediction Pattern evaluation
can be used to compare different models, or to evaluate the performance of a single model.
11. Precision
Precision Pattern Evaluation is a method for analyzing data that has been collected from a variety of
sources. This method can be used to identify patterns and trends in the data, and to evaluate the accuracy of
data. Precision Pattern Evaluation can be used to identify errors in the data, and to determine the cause of
the errors. This method can also be used to determine the impact of the errors on the overall accuracy of the
data.
