Module 5
• It can be defined as "a way of grouping data points into different clusters consisting of similar data points. The objects with possible similarities remain in a group that has few or no similarities with another group."
• It does this by finding similar patterns in the unlabelled dataset, such as shape, size, color, behavior, etc., and divides the data points according to the presence or absence of those patterns.
• After applying this clustering technique, each cluster or group is given a cluster-ID. The ML system can use this ID to simplify the processing of large and complex datasets.
Real-time Example:
Let's understand the clustering technique with the real-world example of a shopping mall:
When we visit a shopping mall, we can observe that items with similar usage are grouped together: t-shirts are grouped in one section and trousers in another; similarly, in the vegetable section, apples, bananas, mangoes, etc. are kept in separate groups, so that we can easily find what we are looking for.
The clustering technique works in the same way. Another example of clustering is grouping documents according to their topic.
The clustering technique is widely used in various tasks.
Some of the most common uses of this technique are:
• Market Segmentation
• Statistical data analysis
• Social network analysis
• Image segmentation
• Anomaly detection, etc.
Note:
Netflix also uses this technique to recommend movies and web series to its users based on their watch history.
Diagram explains working of the clustering algorithm
Types of Clustering Methods
• The clustering methods are broadly divided into Hard clustering (a data point belongs to only one group) and Soft clustering (a data point can also belong to more than one group). The main clustering methods used in machine learning are:
1.Partitioning Clustering
2.Density-Based Clustering
3.Distribution Model-Based Clustering
4.Hierarchical Clustering
5.Fuzzy Clustering
Fuzzy Clustering
• Each data point has a set of membership coefficients, which depend on its degree of membership in each cluster.
• The word 'fuzzy' means something that is not clear or is vague. Sometimes, in real life, we cannot decide whether a given problem or statement is true or false.
• In such cases, this concept provides many values between true and false and gives the flexibility to find the best solution to the problem.
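• To make the idea of membership coefficients concrete, below is a minimal NumPy sketch (not from the source) of the standard fuzzy c-means membership formula; the fuzzifier m, the toy points, and the centers are assumptions for illustration.

```python
import numpy as np

def fuzzy_memberships(X, centers, m=2.0):
    """Fuzzy c-means membership coefficients (illustrative sketch).

    u[i, j] = 1 / sum_k (d(x_i, c_j) / d(x_i, c_k)) ** (2 / (m - 1)),
    where m > 1 is the fuzzifier controlling how soft the assignment is.
    """
    # Euclidean distances between every point and every cluster center
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    d = np.fmax(d, 1e-12)                      # avoid division by zero
    ratio = (d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0))
    return 1.0 / ratio.sum(axis=2)             # each row sums to 1

X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.9]])
centers = np.array([[1.0, 1.0], [5.0, 5.0]])
print(fuzzy_memberships(X, centers))           # soft assignments, not hard labels
```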
Clustering Algorithms
• K-means clustering
• The k-means algorithm is one of the most popular clustering algorithms. It partitions the dataset by dividing the samples into different clusters of roughly equal variance. The number of clusters must be specified for this algorithm. It is fast, requiring relatively few computations, with linear complexity O(n).
• K-Medoids
• K-Medoids is a clustering algorithm that is an extension of the K-Means algorithm. Instead of using the mean
(average) point as the center of a cluster, K-Medoids employs the actual data point that minimizes the sum of
dissimilarities to other points within the cluster. This approach makes K-Medoids more robust to outliers
compared to K-Means, as the medoid is less sensitive to extreme values. The algorithm is particularly useful in
situations where the mean may not accurately represent the cluster center, and the selection of medoids can lead to
more accurate clustering results.
• Density-based methods
• Density-based clustering methods are a class of clustering algorithms that identify clusters based on the density of
data points in the feature space. These methods aim to discover clusters of arbitrary shapes and can effectively
handle noise and outliers. One well-known density-based clustering algorithm is Density-Based Spatial Clustering of
Applications with Noise (DBSCAN).
• DBSCAN
• It stands for Density-Based Spatial Clustering of Applications with Noise. It is an example of a density-based model, similar to mean-shift but with some remarkable advantages. In this algorithm, areas of high density are separated from areas of low density, and points lying alone in low-density regions are treated as noise.
• Hierarchical clustering:
• Hierarchical clustering is a clustering algorithm that organizes data into a tree-like structure, known
as a dendrogram. The algorithm builds clusters in a hierarchical manner by iteratively merging or
splitting data points or existing clusters based on their similarities. There are two main types of
hierarchical clustering: agglomerative and divisive.
• Agglomerative (bottom-up):
1. Starts by treating each data point as its own cluster.
2. Iteratively merges the closest clusters until a single cluster containing all data points is formed.
3. The process is visualized in a dendrogram, where the height of the branches represents the dissimilarity between merged clusters.
• Divisive (top-down):
1. Starts with all data points in a single cluster.
2. Iteratively divides clusters into smaller ones until each cluster contains only one data point.
3. The process is also visualized in a dendrogram, showing the sequence of cluster divisions.
Applications of Clustering - commonly known applications of clustering technique
• In Identification of Cancer Cells: The clustering algorithms are widely used for the identification of cancerous
cells. It divides the cancerous and non-cancerous data sets into different groups.
• In Search Engines: Search engines also work on the clustering technique. The search result appears based on the
closest object to the search query. It does it by grouping similar data objects in one group that is far from the
other dissimilar objects. The accurate result of a query depends on the quality of the clustering algorithm used.
• Customer Segmentation: It is used in market research to segment customers based on their choices and preferences.
• In Biology: It is used in the biology stream to classify different species of plants and animals using the image
recognition technique.
• In Land Use: The clustering technique is used to identify areas of similar land use in a GIS database. This can be very useful for determining the purpose for which a particular piece of land is most suitable.
• K-means Clustering:
• K-Means Clustering is an Unsupervised Learning algorithm, which groups the unlabeled dataset into
different clusters.
Here K defines the number of pre-defined clusters that need to be created in the process; e.g., if K=2, there will be two clusters, and for K=3, there will be three clusters, and so on.
It is an iterative algorithm that divides the unlabeled dataset into K different clusters in such a way that each data point belongs to only one group and shares similar properties with the other points in that group.
It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of this algorithm is to minimize the sum of distances between the data points and their corresponding cluster centroids.
• The algorithm takes the unlabeled dataset as input, divides the dataset into K clusters, and repeats the process until it finds the best clusters. The value of K should be predetermined in this algorithm.
• The k-means clustering algorithm mainly performs two tasks:
• Determines the best value for K center points or centroids by an iterative
process.
• Assigns each data point to its closest K-center. The data points that are near a particular K-center form a cluster.
• Hence each cluster contains data points with some commonalities and is distinct from the other clusters.
• The below diagram explains the working of the K-means Clustering Algorithm:
How does the K-Means Algorithm Work? - Steps
• Step-1: Select the number K to decide the number of clusters.
• Step-2: Select K random points or centroids. (They may be points other than those from the input dataset.)
• Step-3: Assign each data point to their closest centroid, which will form the
predefined K clusters.
• Step-4: Calculate the variance and place a new centroid of each cluster. For
each cluster, calculate the mean (centroid) of all the data points assigned to
that cluster. This mean becomes the new centroid for that cluster.
• Step-5: Repeat the third step, i.e., reassign each data point to the new closest centroid of its cluster.
• Step-6: If any reassignment occurs, then go to step-4 else go to FINISH.
• Step-7: The model is ready.
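• The following is a minimal from-scratch sketch of these steps (not the source's code); the toy data, the seeded random initialization, and the stopping rule are assumptions for illustration.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal K-means following the steps above: pick K centroids,
    assign points to the nearest centroid, recompute centroids, repeat.
    Assumes no cluster becomes empty (fine for this small sketch)."""
    rng = np.random.default_rng(seed)
    # Step-2: random initial centroids (here chosen from the dataset for simplicity)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step-3: assign each point to its closest centroid
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Step-4: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):   # Step-6: no reassignment, finish
            break
        centroids = new_centroids                   # Step-5: repeat with new centroids
    return labels, centroids

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]], dtype=float)
labels, centroids = kmeans(X, k=2)
print(labels)      # cluster index assigned to each point
print(centroids)   # one centroid per cluster
```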
• Let's understand the above steps by considering the visual plots:
• Suppose we have two variables M1 and M2. The x-y axis scatter plot of
these two variables is given below:
• Let's take the number of clusters as K=2, to identify the dataset and to put the points into different clusters. It means here we will try to group these data points into two different clusters.
• We need to choose some random K points or centroids to form the clusters. These points can be either points from the dataset or any other points. So, here we are selecting the below two points as K points, which are not part of our dataset. Consider the below image:
Now we will assign each data point of the scatter plot to its closest K-point or centroid. We will compute this by applying the mathematics we have studied for calculating the distance between two points. To visualize the assignment, we will draw a median line (the perpendicular bisector) between the two centroids. Consider the below image:
• From the above image, it is clear that the points on the left side of the line are near the K1 or blue centroid, and the points to the right of the line are close to the yellow centroid. Let's color them blue and yellow for clear visualization.
• From the above image, we can see that one yellow point is on the left side of the line, and two blue points are to the right of the line. So, these three points will be assigned to new centroids.
• As reassignment has taken place, we will again go to Step-4, which is finding new centroids or K-points.
• We will repeat the process by finding the center of gravity (mean) of the points in each cluster, so the new centroids will be as shown in the below image:
• As we have the new centroids, we will again draw the median line and reassign the data points. So, the image will be:
• We can see in the above image that no data points need to change sides of the line, which means our model has converged. Consider the below image:
• As our model is ready, we can now remove the assumed centroids, and the two final clusters will be as shown in the below image:
How to choose the value of "K number of clusters" in K-means
Clustering?
• The performance of the K-means clustering algorithm depends on the quality of the clusters it forms. But choosing the optimal number of clusters is a big task. There are different ways to find the optimal number of clusters, but here we discuss the most appropriate method for finding the number of clusters, or value of K. The method is given below:
• Elbow Method
• The Elbow method is one of the most popular ways to find the optimal
number of clusters. This method uses the concept of WCSS
value. WCSS stands for Within Cluster Sum of Squares, which defines the
total variations within a cluster. The formula to calculate the value of WCSS (for 3 clusters) is given below:
WCSS = ∑(Pi in Cluster1) distance(Pi, C1)² + ∑(Pi in Cluster2) distance(Pi, C2)² + ∑(Pi in Cluster3) distance(Pi, C3)²
• In the above formula of WCSS,
• ∑(Pi in Cluster1) distance(Pi, C1)² is the sum of the squared distances between each data point and its centroid within Cluster1, and the same applies for the other two terms.
• To measure the distance between data points and centroid, we can use any
method such as Euclidean distance or Manhattan distance.
• To find the optimal value of clusters, the elbow method follows the below
steps:
• It executes K-means clustering on a given dataset for different K values (e.g., ranging from 1 to 10).
• For each value of K, calculates the WCSS value.
• Plots a curve between calculated WCSS values and the number of clusters K.
• The sharp point of bend, where the plot looks like an arm (elbow), is considered the best value of K.
Since the graph shows a sharp bend that looks like an elbow, the method is known as the elbow method. The graph for the elbow method looks like the below image:
Note: We can choose the number of clusters equal to the number of data points; in that case, the value of WCSS becomes zero, and that will be the endpoint of the plot.
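• A minimal sketch of the elbow method using scikit-learn and matplotlib (assumed to be available); the synthetic data and the K range of 1 to 10 are assumptions for illustration.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 well-separated groups (illustrative assumption)
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

wcss = []
for k in range(1, 11):                          # K from 1 to 10
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss.append(km.inertia_)                    # inertia_ is the WCSS value

plt.plot(range(1, 11), wcss, marker="o")
plt.xlabel("Number of clusters K")
plt.ylabel("WCSS")
plt.title("Elbow method")
plt.show()                                      # the bend (elbow) suggests K = 3 here
```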
Python Implementation of K-means Clustering Algorithm
• 1.
https://www.javatpoint.com/k-means-clustering-algorithm-in-machine
-learning
• 2. https://www.w3schools.com/python/python_ml_k-means.asp
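• In addition to the tutorials linked above, here is a minimal scikit-learn sketch of fitting K-means and reading the resulting labels and centroids; the toy data and the choice of K=2 are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]], dtype=float)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)                       # cluster ID assigned to each point
print(kmeans.cluster_centers_)              # final centroids
print(kmeans.predict([[0, 0], [12, 3]]))    # assign new points to the learned clusters
```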
Why hierarchical clustering?
• We have seen in K-means clustering that there are some challenges with this algorithm: it needs a predetermined number of clusters, and it always tries to create clusters of the same size.
• To solve these two challenges, we can opt for the hierarchical clustering
algorithm because, in this algorithm, we don't need to have knowledge about
the predefined number of clusters.
Agglomerative Hierarchical clustering
• This algorithm follows a bottom-up approach: it starts by treating each data point as a single cluster and then merges the closest pairs of clusters, continuing until all the clusters are merged into a single cluster that contains the entire dataset.
• The working of the AHC algorithm can be explained using the below steps:
• Step-1: Create each data point as a single cluster. Let's say there are N data points, so the number of clusters will also be N.
• Step-2: Take the two closest data points or clusters and merge them to form one cluster. So, there will now be N-1 clusters.
• Step-3: Again, take the two closest clusters and merge them together to form
one cluster. There will be N-2 clusters.
• Step-4: Repeat Step-3 until only one cluster is left. So, we will get the following clusters. Consider the below images:
• Step-5: Once all the clusters are combined into one big cluster, develop the
dendrogram to divide the clusters as per the problem.
• As we have seen, the closest distance between the two clusters is crucial for
the hierarchical clustering. There are various ways to calculate the distance
between two clusters, and these ways decide the rule for clustering. These
measures are called Linkage methods. Some of the popular linkage methods
are given below:
• Single Linkage: It is the shortest distance between the closest points of the clusters. Consider the below image:
• Complete Linkage: It is the farthest distance between the two points of two
different clusters. It is one of the popular linkage methods as it forms tighter
clusters than single-linkage.
• Average Linkage: It is the linkage method in which the distance between each pair of data points (one from each cluster) is added up and then divided by the total number of pairs to calculate the average distance between two clusters. It is also one of the most popular linkage methods.
• Centroid Linkage: It is the linkage method in which the distance between the centroids of the clusters is calculated. Consider the below image:
From the above-given approaches, we can apply any of them according to the
type of problem or business requirement.
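• A minimal SciPy sketch (not from the source) showing how these linkage methods can be selected when building a dendrogram; the toy data and the choice of 'average' linkage are assumptions for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.array([[1, 1], [1.5, 1], [5, 5], [5.5, 5], [9, 1]], dtype=float)

# method can be 'single', 'complete', 'average', or 'centroid'
Z = linkage(X, method="average")
dendrogram(Z)                    # branch heights show dissimilarity at each merge
plt.xlabel("Data points")
plt.ylabel("Distance")
plt.show()
```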
• Working of Dendrogram in Hierarchical clustering
• The dendrogram is a tree-like structure that is mainly used to record each step that the HC algorithm performs. In the dendrogram plot, the Y-axis shows the Euclidean distances between the data points, and the X-axis shows all the data points of the given dataset.
• The working of the dendrogram can be explained using the below diagram:
• In the above diagram, the left part is showing how clusters are created in
agglomerative clustering, and the right part is showing the corresponding
dendrogram.
• As we have discussed above, firstly, the data points P2 and P3 combine and form a cluster; correspondingly, a dendrogram is created, which connects P2 and P3 with a rectangular shape. The height is decided according to the Euclidean distance between the data points.
• In the next step, P5 and P6 form a cluster, and the corresponding dendrogram is created. It is higher than the previous one, as the Euclidean distance between P5 and P6 is a little greater than that between P2 and P3.
• Again, two new dendrograms are created that combine P1, P2, and P3 in one
dendrogram, and P4, P5, and P6, in another dendrogram.
• At last, the final dendrogram is created that combines all the data points
together.
• We can cut the dendrogram tree structure at any level as per our requirement.
Python Implementation of Agglomerative Hierarchical Clustering
1.
https://www.javatpoint.com/hierarchical-clustering-in-machine-learning
2.
https://scikit-learn.org/stable/modules/generated/sklearn.cluster.Agglom
erativeClustering.html
3. https://www.kaggle.com/code/khotijahs1/hierarchical-agglomerative-
clustering
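• Alongside the links above, a minimal scikit-learn sketch of agglomerative clustering; the toy data, the choice of two clusters, and the 'average' linkage are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[1, 1], [1.5, 1], [5, 5], [5.5, 5], [9, 1]], dtype=float)

# linkage can be 'ward', 'complete', 'average', or 'single'
model = AgglomerativeClustering(n_clusters=2, linkage="average")
labels = model.fit_predict(X)
print(labels)    # cluster ID for each data point
```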
Dendrogram (example figures)
What is Density-based clustering?
• Density-Based Clustering refers to one of the most popular unsupervised learning methodologies used in model building and machine learning algorithms. The data points in the low-density regions that separate two clusters are considered noise. The surroundings within a radius ε of a given object are known as the ε-neighborhood of the object. If the ε-neighborhood of an object contains at least a minimum number of objects, MinPts, then the object is called a core object.
Density-Based Clustering - Background
• There are two parameters used in density-based clustering:
• Eps: It is considered the maximum radius of the neighborhood.
• MinPts: It refers to the minimum number of points in an Eps-neighborhood of a point.
• N_Eps(i) = { k ∈ D | dist(i, k) ≤ Eps }
• Directly density reachable:
• A point i is considered directly density reachable from a point k with respect to Eps and MinPts if
• i belongs to N_Eps(k), and
• the core point condition holds: |N_Eps(k)| ≥ MinPts
• Density reachable:
• A point i is density reachable from a point j with respect to Eps and MinPts if there is a chain of points p1, ..., pn with p1 = j and pn = i such that each p(t+1) is directly density reachable from p(t).
• Density connected:
• A point i is density connected to a point j with respect to Eps and MinPts if there is a point o such that both i and j are density reachable from o with respect to Eps and MinPts.
• Working of Density-Based Clustering
• Suppose a set of objects is denoted by D. We can say that an object i is directly density reachable from an object j only if it is located within the ε-neighborhood of j, and j is a core object.
• An object i is density reachable from an object j with respect to ε and MinPts in a given set of objects D only if there is a chain of objects p1, ..., pn with p1 = j and pn = i such that each p(t+1) is directly density reachable from p(t) with respect to ε and MinPts.
• An object i is density connected to an object j with respect to ε and MinPts in a given set of objects D only if there is an object o belonging to D such that both i and j are density reachable from o with respect to ε and MinPts.
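• To make the ε-neighborhood and the core-point condition concrete, here is a minimal NumPy sketch (not from the source); the toy points and the Eps and MinPts values are assumptions for illustration.

```python
import numpy as np

def eps_neighborhood(X, i, eps):
    """Indices of all points within distance eps of point i (including i itself)."""
    dist = np.linalg.norm(X - X[i], axis=1)
    return np.where(dist <= eps)[0]

def is_core_point(X, i, eps, min_pts):
    """Core-point condition: the eps-neighborhood contains at least MinPts points."""
    return len(eps_neighborhood(X, i, eps)) >= min_pts

X = np.array([[1, 1], [1.2, 1.1], [1.1, 0.9], [8, 8]], dtype=float)
print(is_core_point(X, 0, eps=0.5, min_pts=3))   # True: point sits in a dense region
print(is_core_point(X, 3, eps=0.5, min_pts=3))   # False: isolated point (noise candidate)
```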
• Major Features of Density-Based Clustering
• The primary features of density-based clustering are given below.
• It is a single-scan method: it requires only one scan of the data.
• It requires density parameters as a termination condition.
• It is used to manage noise in data clusters.
• Density-based clustering is used to identify clusters of arbitrary shape and size.
• Density-Based Clustering Methods-DBSCAN, OPTICS, DENCLUE
• DBSCAN
• DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. It relies on a density-based notion of a cluster. It also identifies clusters of arbitrary shape in a spatial database with outliers.
DBSCAN groups together closely packed points based on their density in a given space.
Here's a step-by-step explanation of how DBSCAN works:
1.Initialization:
1. Choose two parameters:
1.ϵ (epsilon): The maximum distance between two points for them to be considered as
in the same neighborhood.
2.MinPts: The minimum number of points required to form a dense region.
2.Core Points Identification:
1. For each point in the dataset, compute the number of points within its ϵ-neighborhood
(including itself).
2. If a point has at least MinPts neighboring points within ϵ, it is considered a core point.
3.Density-Reachable Points Identification:
1. For each core point, recursively find all points that are density-reachable from it.
2. A point P is density-reachable from another point Q if there exists a chain of core
points leading from Q to P, with each successive point in the chain being directly
reachable from the previous one within ϵ.
4. Cluster Formation:
1. Assign each core point and its density-reachable neighbors to the same cluster.
2. If a core point has neighbors that belong to different clusters, merge those clusters.
5. Border Points Assignment:
 1. Assign any point that is not a core point but falls within the ϵ-neighborhood of a core point to the same cluster as that core point.
 2. These points are called border points.
6. Noise Points Identification:
 1. Any point that is neither a core point nor a border point is considered a noise point and is not assigned to any cluster.
7. Output:
 1. The output of the DBSCAN algorithm is a set of clusters, each containing a group of points that are closely packed together and separated by areas of lower density or noise points.
DBSCAN is robust to noise and capable of identifying clusters of arbitrary shapes. However, it
requires careful selection of the parameters ϵ and MinPts to achieve optimal results for a given
dataset.
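A minimal scikit-learn sketch of DBSCAN; the toy data and the eps and min_samples values are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense groups plus one isolated point that should be labelled as noise
X = np.array([[1, 1], [1.2, 1.1], [0.9, 1.0],
              [8, 8], [8.1, 7.9], [8.2, 8.1],
              [4, 15]], dtype=float)

db = DBSCAN(eps=0.5, min_samples=3).fit(X)
print(db.labels_)    # cluster IDs for each point; -1 marks noise points
```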
Association Rule Learning
• Lift
• It measures the strength of a rule, and can be defined with the below formula:
Lift(X ⇒ Y) = Support(X ∪ Y) / (Support(X) × Support(Y))
• It is the ratio of the observed support measure and the expected support if X and Y were independent of each other. It has three possible values:
• If Lift= 1: The probability of occurrence of antecedent and consequent
is independent of each other.
• Lift>1: It determines the degree to which the two itemsets are dependent on each other.
• Lift<1: It tells us that one item is a substitute for other items, which
means one item has a negative effect on another.
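• A minimal sketch (not from the source) computing support, confidence, and lift for a rule over a toy transaction list; the transactions and the chosen rule are assumptions for illustration.

```python
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "butter"},
    {"bread", "milk"},
]

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

X, Y = {"bread"}, {"milk"}
supp_xy = support(X | Y)
confidence = supp_xy / support(X)
lift = supp_xy / (support(X) * support(Y))
print(supp_xy, confidence, lift)   # 0.6, 0.75, ~0.94 (lift < 1 for this toy data)
```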
• Applications of Association Rule Learning
• It has various applications in machine learning and data mining. Below are some
popular applications of association rule learning:
• Market Basket Analysis: It is one of the popular examples and applications of
association rule mining. This technique is commonly used by big retailers to
determine the association between items.
• Medical Diagnosis: Association rules help in identifying the probability of illness for a particular disease, which can support the diagnosis and treatment of patients.
• Protein Sequence: The association rules help in determining the synthesis of
artificial Proteins.
• It is also used for the Catalog Design and Loss-leader Analysis and many more
other applications.
• Apriori Algorithm in Machine Learning
• The Apriori algorithm uses frequent itemsets to generate association rules, and
it is designed to work on the databases that contain transactions. With the help
of these association rules, it determines how strongly or how weakly two
objects are connected. This algorithm uses a breadth-first search and Hash
Tree to calculate the itemset associations efficiently. It is the iterative process
for finding the frequent itemsets from the large dataset.
• This algorithm was given by R. Agrawal and R. Srikant in 1994. It
is mainly used for market basket analysis and helps to find products that
can be bought together. It can also be used in the healthcare field to find drug
reactions for patients.
• What is Frequent Itemset?
• Frequent itemsets are those itemsets whose support is greater than the threshold value, i.e., the user-specified minimum support. It means that if A and B are frequent itemsets together, then individually A and B should also be frequent itemsets.
• Suppose there are two transactions: A = {1,2,3,4,5} and B = {2,3,7}; in these two transactions, 2 and 3 are the frequent itemsets.
• Note: To better understand the apriori algorithm, and related
term such as support and confidence, it is recommended to
understand the association rule learning.
• Steps for Apriori Algorithm
• Below are the steps for the apriori algorithm:
• Step-1: Determine the support of itemsets in the transactional
database, and select the minimum support and confidence.
• Step-2: Take all the itemsets in the transactional database with a support value higher than the minimum (selected) support value.
• Step-3: Find all the rules of these subsets that have higher confidence
value than the threshold or minimum confidence.
• Step-4: Sort the rules in decreasing order of lift.
• Apriori Algorithm Working
• We will understand the apriori algorithm using an example and mathematical
calculation:
• Example: Suppose we have the following dataset that has various
transactions, and from this dataset, we need to find the frequent itemsets and
generate the association rules using the Apriori algorithm:
• Solution:
• Step-1: Calculating C1 and L1:
• In the first step, we will create a table that contains support count (The
frequency of each itemset individually in the dataset) of each itemset in the
given dataset. This table is called the Candidate set or C1.
• Now, we will take out all the itemsets that have a support count greater than or equal to the Minimum Support (2). This will give us the table for the frequent itemset L1. All the itemsets meet the minimum support except E, so the E itemset will be removed.
Step-2: Candidate Generation C2, and L2:
•In this step, we will generate C2 with the help of L1. In C2, we will create the pair of the itemsets of L1 in the
form of subsets.
•After creating the subsets, we will again find the support count from the main transaction table of datasets,
i.e., how many times these pairs have occurred together in the given dataset. So, we will get the below table for
C2:
• Again, we need to compare the C2 support counts with the minimum support count; after comparing, the itemsets with a lower support count will be eliminated from table C2. This gives us the below table for L2:
• 1. https://www.javatpoint.com/apriori-algorithm-in-machine-learning
• 2.
https://www.geeksforgeeks.org/implementing-apriori-algorithm-in-pyt
hon/
• 3. https://www.kaggle.com/code/nandinibagga/apriori-algorithm
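• Complementing the tutorials linked above, here is a minimal sketch using the third-party mlxtend library (an assumption; install with `pip install mlxtend`) to mine frequent itemsets and association rules from a toy transaction list; the data and thresholds are assumptions for illustration.

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [["A", "B", "C"], ["A", "B"], ["B", "C"], ["A", "C"], ["A", "B", "C"]]

# One-hot encode the transactions into a boolean DataFrame
te = TransactionEncoder()
df = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# Frequent itemsets with support >= 0.4, then rules filtered by confidence
frequent = apriori(df, min_support=0.4, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])
```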
Reinforcement Learning