Computer Engineering
Machine Learning
Sem 7
Unit # 4
Topics
Introduction to Clustering
Types of Clustering
Partitional Clustering
Silhouette Coefficient
Dunn's Index
Association Rule Mining
Unsupervised Learning
Introduction & Importance, Types of Unsupervised Learning
Unsupervised learning is a type of machine learning in which models are trained using an unlabeled dataset and are allowed to act on that data without any supervision.
Types of Unsupervised Learning Algorithms:
Clustering Algorithm: Hierarchical Clustering Algorithm, DBSCAN Algorithm
Association Rule Learning: Apriori Algorithm, FP-Growth Algorithm
Clustering: Clustering is a method of grouping objects into clusters such that objects with the most similarities remain in one group and have little or no similarity with the objects of another group.
Cluster analysis finds the commonalities between the data objects and categorizes them as per the presence and absence of those commonalities.
Association: An association rule is an unsupervised learning method used for finding relationships between variables in a large database. It determines the set of items that occur together in the dataset.
Association rules make marketing strategies more effective. For example, people who buy item X (say, bread) also tend to purchase item Y (butter/jam). A typical example of association rule mining is Market Basket Analysis.
Difference Between Clustering and Association Rule Mining

Feature | Clustering | Association Rule Mining
Purpose | Group similar data points into clusters. | Discover interesting relationships between variables.
Output | A set of clusters or groups. | Association rules (e.g., "If A, then B").
Data Type | Often applied to unlabeled data. | Typically works with transactional or categorical data.
Approach | Looks for patterns based on distance or similarity. | Looks for co-occurrence or frequency of items.
Examples | Grouping customers by buying behavior. | Market basket analysis (e.g., buying bread and butter together).
Interpretation | Clusters can be visualized and analyzed. | Rules can be evaluated for support and confidence.
Techniques | K-means, hierarchical clustering, DBSCAN, etc. | Apriori, FP-Growth, etc.
Clustering
Introduction, Types of Clustering (Hierarchical, Agglomerative, Divisive, Partitional), K-means Clustering Algorithm, Evaluation metrics for Clustering (Silhouette Coefficient, Dunn's Index)
Clustering is a way to group similar things together.
Types of Clustering Methods:
Centroid-based Clustering (Partitioning methods)
Density-based Clustering (Model-based methods)
Connectivity-based Clustering (Hierarchical clustering): Agglomerative clustering, Divisive clustering
Distribution-based Clustering
Fuzzy Clustering
Partitioning Clustering
K-Means Clustering Algorithm
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points or centroids. (They can be points other than those from the input dataset.)
Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.
Step-4: Calculate the variance and place a new centroid for each cluster.
Step-5: Repeat the third step, i.e., reassign each data point to the new closest centroid of each cluster.
Example: 2 Clusters.
Data: {1, 5, 2, 4, 5}
Example: Apply the K-means clustering algorithm again to divide the data into 2 clusters. Here is a new set of data:
{3, 8, 6, 7, 2}
Step 1: Initialization
Table 1:
Table 2:
Step 5: Since the cluster assignments did not change, the algorithm stops.
Final Clusters:
Cluster 1: {3, 2}
Cluster 2: {8, 6, 7}
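A minimal Python sketch of this 1-D example, assuming the first two distinct values (3 and 8) are taken as the initial centroids; it reproduces Cluster 1: {3, 2} and Cluster 2: {8, 6, 7}.

# Minimal 1-D K-means sketch for the example above
data = [3, 8, 6, 7, 2]
centroids = [3.0, 8.0]  # assumed initialization: first two distinct values

while True:
    # Step-3: assign each point to its closest centroid
    clusters = [[], []]
    for x in data:
        idx = 0 if abs(x - centroids[0]) <= abs(x - centroids[1]) else 1
        clusters[idx].append(x)
    # Step-4: recompute each centroid as the mean of its cluster
    new_centroids = [sum(c) / len(c) for c in clusters]
    # Step-5: stop once the centroids (and hence the assignments) no longer change
    if new_centroids == centroids:
        break
    centroids = new_centroids

print(clusters)  # [[3, 2], [8, 6, 7]]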
Evaluation metrics for Clustering
Silhouette Coefficient: it tells you how well points fit into their clusters, where higher is better.
Dunn's Index: it tells you whether the clusters are well-separated and compact, where a higher value indicates better clustering.
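For reference, the standard definition of Dunn's Index is:
D = min over all cluster pairs (i != j) of d(C_i, C_j) / max over all clusters k of diam(C_k)
where d(C_i, C_j) is the distance between clusters C_i and C_j (separation) and diam(C_k) is the diameter of cluster C_k, i.e., the largest distance between two of its points (compactness). Well-separated, compact clusters therefore give a higher value of D.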
Silhouette Coefficient
What it measures: How well each data point fits within its cluster compared to other clusters.
Range: From -1 to 1.
1 means the point is well-clustered (fits perfectly in its cluster).
0 means the point is on the border between clusters.
-1 means the point is likely in the wrong cluster.
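For a single point i, the coefficient is s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the mean distance from i to the other points in its own cluster and b(i) is the mean distance from i to the points of the nearest other cluster. A minimal Python sketch using scikit-learn's silhouette_score (the data points are made up for illustration):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.array([[1.0], [2.0], [3.0], [6.0], [7.0], [8.0]])  # made-up 1-D points
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Mean silhouette coefficient over all points; values near 1 indicate well-separated clusters
print(silhouette_score(X, labels))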
Support | Confidence
Support is a measure of the number of times an itemset appears in a dataset. | Confidence is a measure of the likelihood that an itemset will appear if another itemset appears.
Support is used to identify itemsets that occur frequently in the dataset. | Confidence is used to evaluate the strength of a rule.
Support is often used with a threshold to identify itemsets that occur frequently enough to be of interest. | Confidence is often used with a threshold to identify rules that are strong enough to be of interest.
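Written as the standard formulas:
Support(X) = (number of transactions containing X) / (total number of transactions)
Confidence(X => Y) = Support(X ∪ Y) / Support(X)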
Step-3: Find all the rules of these subsets that have a confidence value higher than the threshold or minimum confidence.
Example: Find the frequent itemsets using the Apriori Algorithm. Assume that the minimum support threshold is s = 2.
Ans:
Example: Find the frequent itemsets using the Apriori Algorithm. Assume that the minimum support is s = 3.
There is only one itemset with minimum support 2, so only one itemset is frequent.
Association rules: there are four strong rules (minimum confidence greater than 60%).
Example: Find the frequent itemsets and generate association rules using the Apriori Algorithm. Assume that the minimum support threshold is s = 33.33% and the minimum confidence threshold is c = 60%.
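The transaction table for this example is not reproduced here; as a sketch of how such an exercise can be checked in Python, assuming the mlxtend library and a made-up list of transactions:

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Placeholder transactions; substitute the actual transaction table of the example
transactions = [["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3"],
                ["I1", "I2", "I4"], ["I1", "I3"], ["I2", "I3"]]

te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# Minimum support threshold s = 33.33%, minimum confidence threshold c = 60%
frequent = apriori(onehot, min_support=0.3333, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(frequent)
print(rules[["antecedents", "consequents", "support", "confidence"]])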
Example: Consider the frequent itemset {I1, I2, I3} and find the association rules generated using the Apriori Algorithm.
So here, by taking an example of any frequent itemset, we will show the
rule generation.
Itemset {I1, I2, I3} //from L3
So the rules can be:
[I1^I2]=>[I3] //confidence = sup(I1^I2^I3)/sup(I1^I2) = 2/4*100 = 50%
[I1^I3]=>[I2] //confidence = sup(I1^I2^I3)/sup(I1^I3) = 2/4*100 = 50%
[I2^I3]=>[I1] //confidence = sup(I1^I2^I3)/sup(I2^I3) = 2/4*100 = 50%
[I1]=>[I2^I3] //confidence = sup(I1^I2^I3)/sup(I1) = 2/6*100 = 33%
[I2]=>[I1^I3] //confidence = sup(I1^I2^I3)/sup(I2) = 2/7*100 = 28%
[I3]=>[I1^I2] //confidence = sup(I1^I2^I3)/sup(I3) = 2/6*100 = 33%
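A short Python sketch that reproduces these confidence values from the support counts used above (sup(I1^I2^I3) = 2, each pair has support 4, sup(I1) = 6, sup(I2) = 7, sup(I3) = 6), truncating to a whole percent as in the list:

from itertools import combinations

# Support counts taken from the worked example above
support = {
    frozenset(["I1", "I2", "I3"]): 2,
    frozenset(["I1", "I2"]): 4,
    frozenset(["I1", "I3"]): 4,
    frozenset(["I2", "I3"]): 4,
    frozenset(["I1"]): 6,
    frozenset(["I2"]): 7,
    frozenset(["I3"]): 6,
}

itemset = frozenset(["I1", "I2", "I3"])

# confidence(A => B) = sup(A U B) / sup(A), for every non-empty proper subset A of the itemset
for r in range(len(itemset) - 1, 0, -1):
    for antecedent in combinations(sorted(itemset), r):
        A = frozenset(antecedent)
        B = itemset - A
        confidence = support[itemset] / support[A] * 100
        print(f"{sorted(A)} => {sorted(B)}: confidence = {int(confidence)}%")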
Advantages of Apriori Algorithm:
Straightforward Approach: Uses a clear, step-by-step method to find frequent itemsets in a database.
Widely Used: Popular in market basket analysis to find associations between products.

Disadvantages of Apriori Algorithm:
High Time Complexity: Can be slow for large datasets, as it has to scan the entire database multiple times.
Memory Intensive: Requires a lot of memory, especially with big datasets.
Generates Redundant Rules: Often produces many rules, including irrelevant or redundant ones.
Prone to Scalability Issues: Not efficient with large and complex databases.
Requires Pruning: Needs careful tuning of support and confidence thresholds to filter out uninteresting rules.
The two primary drawbacks of the Apriori Algorithm are:
At each step, candidate sets have to be built.
To build the candidate sets, the algorithm has to scan the database repeatedly.
FP-Growth (FP Tree) Algorithm
Step 1: Making the Frequency Table - The frequency of each individual item is computed.
Step 2: Find the Frequent Pattern set - A Frequent Pattern set is built which will contain all the elements whose frequency is greater than or equal to the minimum support. These elements are stored in descending order of their respective frequencies.
Step 3: Ordered-Item set Creation - For each transaction, the respective Ordered-Item set is built.
Step 4: Make the FP-Tree - All the Ordered-Item sets are inserted into a Trie data structure.
Step 5: Computation of the Conditional Pattern Base - For each item, the Conditional Pattern Base is computed, which is the set of path labels of all the paths that lead to any node of the given item in the frequent-pattern tree. Note that the items in the table below are arranged in ascending order of their frequencies.
Step 6: Compute the Conditional Frequent Pattern Tree - It is done by taking the set of elements that is common to all the paths in the Conditional Pattern Base of that item and calculating its support count by summing the support counts of all the paths in the Conditional Pattern Base.
Step 7: Frequent Pattern Rules Generation - From the Conditional Frequent Pattern Tree, the frequent pattern rules are generated by pairing the items of the Conditional Frequent Pattern Tree set with the corresponding item, as given in the table below.
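As a sketch, these steps can be run end-to-end in Python with mlxtend's fpgrowth function; the transactions below are reconstructed from the frequency table and Ordered-Item sets of the example that follows (support threshold 50% of 6 transactions, i.e., min_sup = 3):

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

# Transactions reconstructed from the worked example below
transactions = [["I1", "I2", "I3"], ["I2", "I3", "I4"], ["I4", "I5"],
                ["I1", "I2", "I4"], ["I1", "I2", "I3", "I5"], ["I1", "I2", "I3", "I4"]]

te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# min_support = 0.5 corresponds to min_sup = 3 out of 6 transactions
frequent_patterns = fpgrowth(onehot, min_support=0.5, use_colnames=True)
print(frequent_patterns.sort_values("support", ascending=False))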
Example:
Item | Count
I1 | 4
I2 | 5
I3 | 4
I4 | 4
I5 | 2
Step 2: Find the Frequent Pattern set - A Frequent Pattern set is built which will contain all the elements whose frequency is greater than or equal to the minimum support. These elements are stored in descending order of their respective frequencies.
Support threshold = 50% => 0.5 * 6 = 3 => min_sup = 3

Item | Count
I2 | 5
I1 | 4
I3 | 4
I4 | 4
Ordered-Item set
I2, I1, I3
I2, I3, I4
I4
I2, I1, I4
I2, I1, I3
I2, I1, I3, I4
Step 5: Computation of the Conditional Pattern Base - For each item, the Conditional Pattern Base is computed, which is the set of path labels of all the paths that lead to any node of the given item in the frequent-pattern tree. Note that the items in the table below are arranged in ascending order of their frequencies.
Disadvantages of FP-Growth:
The FP-Tree may be expensive to build.
The algorithm may not fit in the shared memory when the database is large.

Apriori vs. FP-Growth