
MACHINE LEARNING

Professional Core (CET3006B)


T. Y. B.Tech CSE, Sem-VI
2023-2024
UNIT-III
SoCSE – Dept. of Computer Engineering & Technology
INTRODUCTION - What is clustering?

• Clustering is the classification of objects into different groups, or more precisely, the partitioning of a data set into subsets (clusters), so that the data in each subset (ideally) share some common trait - often according to some defined distance measure.

2
Clustering
• Organizing data into classes such that there is
• high intra-class similarity
• low inter-class similarity
• Finding the class labels and the number of classes directly from the data (in contrast to classification).
• More informally, finding natural groupings among objects.
• Also called unsupervised learning; sometimes called classification by statisticians, sorting by psychologists, and segmentation by people in marketing.

3
Clustering

• Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups
• Intra-cluster distances are minimized; inter-cluster distances are maximized.

4
Clustering Example

5
What is a natural grouping among these objects? Clustering is subjective.
(Figure: the same objects grouped in different ways - Simpson's family, school employees, females, males.)

6
What is Similarity?
"The quality or state of being similar; likeness; resemblance; as, a similarity of features." - Webster's Dictionary

Similarity is hard to define, but "we know it when we see it." The real meaning of similarity is a philosophical question; we will take a more pragmatic approach.

7
Common Distance measures:

The distance measure will determine how the similarity of two elements is calculated, and it will influence the shape of the clusters.
They include:
1. The Euclidean distance (also called 2-norm distance), given by:
   d(p, q) = sqrt( Σi (pi - qi)^2 )
2. The Manhattan distance (also called taxicab norm or 1-norm), given by:
   d(p, q) = Σi |pi - qi|

8
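As a quick illustration, a minimal Python sketch of these two measures (the two points below are arbitrary examples):

import math

def euclidean(p, q):
    # 2-norm distance: square root of the sum of squared coordinate differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    # 1-norm (taxicab) distance: sum of absolute coordinate differences
    return sum(abs(a - b) for a, b in zip(p, q))

p, q = (2, 6), (7, 3)       # arbitrary example points
print(euclidean(p, q))       # 5.830951894845301
print(manhattan(p, q))       # 8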
Intuitions behind desirable distance measure properties

D(A,B) = D(B,A)  (Symmetry)
Otherwise you could claim "Alex looks like Bob, but Bob looks nothing like Alex."

D(A,A) = 0  (Constancy of Self-Similarity)
Otherwise you could claim "Alex looks more like Bob, than Bob does."

D(A,B) = 0 iff A = B  (Positivity / Separation)
Otherwise there are objects in your world that are different, but you cannot tell apart.

D(A,B) ≤ D(A,C) + D(C,B)  (Triangle Inequality)
Otherwise you could claim "Alex is very much like Bob, and Alex is very much like Carl, but Bob is very much unlike Carl."

Also, one can use weighted distance and many other similarity/distance measures.

9
Examples of Clustering
Applications

• Marketing: Help marketers discover distinct groups in their customer


bases, and then use this knowledge to develop targeted marketing
programs
• Land use: Identification of areas of similar land use in an earth observation
database
• Insurance: Identifying groups of motor insurance policy holders with a high
average claim cost
• City-planning: Identifying groups of houses according to their house type,
value, and geographical location
• Earth-quake studies: Observed earth quake epicenters should be clustered
along continent faults

10
Clustering Applications
• Customer segmentation: Customer segmentation is the practice of dividing a company's
customers into groups that reflect similarity among customers in each group
• Fraud detection: Using techniques such as K-means Clustering, one can easily identify the
patterns of any unusual activities. Detecting an outlier will mean a fraud event has taken
place.
• Document classification
• Image segmentation
• Anomaly detection

11
What Is Good Clustering?
• A good clustering method will produce high quality clusters with
• high intra-class similarity
• low inter-class similarity

• The quality of a clustering result depends on both the similarity measure


used by the method and its implementation.
• The quality of a clustering method is also measured by its ability to
discover some or all of the hidden patterns.

12
Types of Clustering
• Hierarchical clustering(BIRCH)
• A set of nested clusters organized as a hierarchical tree
• Partitional Clustering (k-means, k-medoids)
• A division of data objects into non-overlapping (distinct)
subsets (i.e., clusters) such that each data object is in
exactly one subset
• Density-Based (DBSCAN)
• Based on density functions
• Grid-Based (STING)
• Based on multiple-level granularity structure
• Model-Based(SOM)
• Hypothesize a model for each of the clusters and find the
best fit of the data to the given model
13
Clustering

• Partitional algorithms: Construct various partitions and then evaluate them by some criterion
• Hierarchical algorithms: Create a hierarchical decomposition of the set of objects using some criterion

(Figure: hierarchical vs. partitional clustering of the same points.)

14
Partitional Clustering

Original Points A Partitional Clustering

15
Hierarchical Clustering

Traditional Hierarchical Clustering Traditional Dendrogram

Non-traditional Hierarchical Clustering Non-traditional Dendrogram


16
Desirable Properties of Clustering

• Scalability (in terms of both time and space)


• Ability to deal with different data types
• Minimal requirements for domain knowledge to determine input
parameters
• Able to deal with noise and outliers
• Insensitive to order of input records
• Incorporation of user-specified constraints
• Interpretability and usability

17
Clustering
Clustering is defined as dividing data points or populations into several groups such that similar data points are in the same groups. The aim is to segregate groups based on similar traits. Clustering can broadly be divided into two subgroups:
1. Soft Clustering – In this type of clustering, a likelihood or probability of the data point being in a particular cluster is assigned, instead of putting each data point into exactly one cluster.
2. Hard Clustering – Each data point either entirely belongs to a cluster or not at all.
The task of clustering is subjective; i.e., there are many ways of achieving the goal. There can be many different sets of rules for defining similarity among data points. Out of more than 100 clustering algorithms, only a few are commonly used. They are as follows:
https://www.learndatasci.com/glossary/hierarchical-clustering/
18
Clustering
1. Connectivity models – These models are based on the fact that data points closer together in the data space exhibit more similarity than those lying farther away. One approach starts by classifying each data point into a separate cluster and then aggregates clusters as the distance decreases; another starts with all data points in a single cluster and partitions it as the distance increases. The choice of distance function is subjective. These models are easily interpreted but lack scalability for handling large datasets: for example, hierarchical clustering.
2.Centroid models– Iterative clustering algorithms in which similarity is
derived as the notion of the closeness of data point to the cluster’s
centroid. Example- K-Means clustering. The number of clusters is
mentioned in advance, which requires prior knowledge of the dataset.

19
Clustering

3. Distribution models– The models are based on the


likelihood of all data points in the cluster belonging to
the same distribution. Overfitting is common in these
models. Example- Expectation-maximization algorithm
4. Density models– These models search the data
space for varying density areas. It isolates various
density regions and assigns data points in the same
cluster. Example- DBSCAN.

20
K-Means Clustering

The most common clustering algorithm covered in machine learning for beginners is K-Means. The first step is to create k new points, placed randomly among our unlabelled data, called centroids; the number of centroids represents the number of output classes. In each step of the iterative process, every data point is assigned to its nearest centroid (in terms of Euclidean distance). Next, for each cluster, the average of all the points assigned to it is computed, and this average becomes the new centroid of the cluster. The two steps repeat until the centroids stop moving.
21
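A minimal sketch of this procedure using scikit-learn (assuming scikit-learn is available; the two blobs of random points are purely illustrative):

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two illustrative blobs of 2-D points
X = np.vstack([rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
               rng.normal(loc=(5, 5), scale=0.5, size=(50, 2))])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)   # the two learned centroids
print(kmeans.labels_[:5])        # cluster index assigned to the first few points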
22
What is the problem of the K-Means Method?

• The k-means algorithm is sensitive to outliers!
• An object with an extremely large value may substantially distort the distribution of the data.

• K-Medoids: Instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in a cluster.

(Figure: the same data clustered around a mean vs. around a medoid.)

23
The K-Medoids Clustering
• Find representative objects, called medoids, in clusters
• PAM (Partitioning Around Medoids, 1987)
  • Starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if it improves the total distance of the resulting clustering
  • PAM works effectively for small data sets, but does not scale well for large data sets

24
Typical K-Medoids algorithm (PAM)

(Illustration with K = 2:)
1. Arbitrarily choose k objects as initial medoids.
2. Assign each remaining object to the nearest medoid (total cost = 20 in the illustration).
3. Randomly select a non-medoid object, Oramdom.
4. Compute the total cost of swapping a medoid with Oramdom (total cost = 26 in the illustration).
5. Swap the medoid and Oramdom if the quality is improved.
6. Repeat the loop until no change.

25
PAM (Partitioning Around Medoids)

• Use real objects to represent the clusters
• Select k representative objects arbitrarily
• For each pair of non-selected object h and selected object i, calculate the total swapping cost TCih
• For each pair of i and h:
  • If TCih < 0, i is replaced by h
  • Then assign each non-selected object to the most similar representative object
• Repeat steps 2-3 until there is no change

26
K-Medoids method: The idea

• The iterative process of replacing representative objects by non-representative objects continues as long as the quality of the clustering is improved
• For each representative object O
  • For each non-representative object R, swap O and R
• Choose the configuration with the lowest cost
• The cost function is the difference in absolute error-value if a current representative object is replaced by a non-representative object

27
K-Medoids method

• Minimize the sensitivity of k-means to outliers
• Pick actual objects to represent clusters instead of mean values
• Each remaining object is clustered with the representative object (medoid) to which it is the most similar
• The algorithm minimizes the sum of the dissimilarities between each object and its corresponding reference point
  • E: the sum of absolute error for all objects in the data set
  • P: the data point in the space representing an object
  • Oi: the representative object of cluster Ci

28
K-Medoids method: The idea

• Input
  • K: the number of clusters
  • D: a data set containing n objects
• Output: a set of k clusters
• Method:
  1. Arbitrarily choose k objects from D as representative objects (seeds)
  2. Repeat
  3. Assign each remaining object to the cluster with the nearest representative object
  4. For each representative object Oj
  5. Randomly select a non-representative object Orandom
  6. Compute the total cost S of swapping representative object Oj with Orandom
  7. If S < 0 then replace Oj with Orandom
  8. Until no change
38
K-Medoids Properties(k-medoids vs.K-means)

• The complexity of each iteration is O(k(n-k)^2)
• For large values of n and k, such computation becomes very costly
• Advantages
  • The K-Medoids method is more robust than k-Means in the presence of noise and outliers
• Disadvantages
  • K-Medoids is more costly than the k-Means method
  • Like k-means, k-medoids requires the user to specify k
  • It does not scale well for large data sets

39
K-Medoid Clustering

• What is K-Medoids Clustering?


• K-Medoids clustering is an unsupervised machine learning
algorithm used to group data into different clusters. It is an iterative
algorithm that starts by selecting k data points as medoids in a
dataset. After this, the distance between each data point and the
medoids is calculated. Then, the data points are assigned to
clusters associated with the medoid at the minimum distance from
each data point. Here, the medoid is the most centrally located
point in the cluster. Once we assign all the data points to the
clusters, we calculate the sum of the distance of all the non-medoid
data points to the medoid of each cluster. We term the sum of
distances as the cost.
40
K-Medoid Clustering
• K-Medoids Clustering Algorithm
• Having an overview of K-Medoids clustering, let us discuss the algorithm for the
same.
1. First, we select K random data points from the dataset and use them as medoids.
2. Now, we will calculate the distance of each data point from the medoids. You can
use any of the Euclidean, Manhattan distance, or squared Euclidean distances
as the distance measure.
3. Once we find the distance of each data point from the medoids, we will assign
the data points to the clusters associated with each medoid. The data points are
assigned to the medoids at the closest distance.
4. After determining the clusters, we will calculate the sum of the distance of all the
non-medoid data points to the medoid of each cluster. Let the cost be Ci.
41
K-Medoid Clustering
5. Now, we will select a random data point Dj from the dataset and swap it with a medoid Mi. Here, Dj becomes a temporary medoid. After swapping, we will calculate the distance of all the non-medoid data points to the current medoid of each cluster. Let this cost be Cj.
6. If Ci > Cj, the current medoids with Dj as one of the medoids are made permanent medoids. Otherwise, we undo the swap, and Mi is reinstated as the medoid.
7. Repeat steps 4 to 6 until no change occurs in the clusters.

42
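Below is a minimal, illustrative Python sketch of steps 1-7 (a naive greedy swap search, not an optimized PAM implementation); the Manhattan distance and the random initialization are assumptions:

import random

def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def total_cost(points, medoids):
    # Each point contributes its distance to the nearest medoid
    return sum(min(manhattan(p, m) for m in medoids) for p in points)

def k_medoids(points, k, seed=0):
    random.seed(seed)
    medoids = random.sample(points, k)                 # step 1: random initial medoids
    best = total_cost(points, medoids)
    improved = True
    while improved:                                    # step 7: repeat until no change
        improved = False
        for i in range(k):                             # steps 5-6: try swapping each medoid
            for p in points:
                if p in medoids:
                    continue
                candidate = medoids[:i] + [p] + medoids[i + 1:]
                cost = total_cost(points, candidate)
                if cost < best:                        # keep the swap only if the cost drops
                    medoids, best = candidate, cost
                    improved = True
    return medoids, best

# The dataset from the worked example below
data = [(2, 6), (3, 8), (4, 7), (6, 2), (6, 4), (7, 3), (7, 4), (8, 5), (7, 6), (3, 4)]
print(k_medoids(data, 2))   # prints the chosen medoids and the final cost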
K-Medoid Clustering
K-Medoids Clustering Example
Point   Coordinates
A1      (2, 6)
A2      (3, 8)
A3      (4, 7)
A4      (6, 2)
A5      (6, 4)
A6      (7, 3)
A7      (7, 4)
A8      (8, 5)
A9      (7, 6)
A10     (3, 4)
43
K-Medoid Clustering
• Iteration 1
• Suppose that we want to group the above dataset into two clusters. So, we will
randomly choose two medoids.
• Here, the choice of medoids is important for efficient execution. Hence, we have
selected two points from the dataset that can be potential medoid for the final
clusters. Following are two points from the dataset that we have selected as
medoids.
• M1 = (3, 4)
• M2 = (7, 3)
• Now, we will calculate the distance between each data point and the medoids
using the Manhattan distance measure. The results have been tabulated as follows.
44
K-Medoid Clustering
Point   Coordinates   Distance from M1 (3,4)   Distance from M2 (7,3)   Assigned Cluster
A1      (2, 6)        3                        8                        Cluster 1
A2      (3, 8)        4                        9                        Cluster 1
A3      (4, 7)        4                        7                        Cluster 1
A4      (6, 2)        5                        2                        Cluster 2
A5      (6, 4)        3                        2                        Cluster 2
A6      (7, 3)        5                        0                        Cluster 2
A7      (7, 4)        4                        1                        Cluster 2
A8      (8, 5)        6                        3                        Cluster 2
A9      (7, 6)        6                        3                        Cluster 2
A10     (3, 4)        0                        5                        Cluster 1
(Iteration 1)
45
K-Medoid Clustering
The clusters made with medoids (3, 4) and (7, 3) are as follows.
Points in cluster1= {(2, 6), (3, 8), (4, 7), (3, 4)}
Points in cluster 2= {(7,4), (6,2), (6, 4), (7,3), (8,5), (7,6)}
After assigning clusters, we will calculate the cost for each cluster and
find their sum. The cost is nothing but the sum of distances of all the
data points from the medoid of the cluster they belong to.
Hence, the cost for the current cluster will be
3+4+4+2+2+0+1+3+3+0=22.

46
K-Medoid Clustering
Iteration 2
Now, we will select another non-medoid point (7, 4) and make it a
temporary medoid for the second cluster. Hence,
M1 = (3, 4)
M2 = (7, 4)
Now, let us calculate the distance between all the data points and the
current medoids.

47
K-Medoid Clustering
Point   Coordinates   Distance from M1 (3,4)   Distance from M2 (7,4)   Assigned Cluster
A1      (2, 6)        3                        7                        Cluster 1
A2      (3, 8)        4                        8                        Cluster 1
A3      (4, 7)        4                        6                        Cluster 1
A4      (6, 2)        5                        3                        Cluster 2
A5      (6, 4)        3                        1                        Cluster 2
A6      (7, 3)        5                        1                        Cluster 2
A7      (7, 4)        4                        0                        Cluster 2
A8      (8, 5)        6                        2                        Cluster 2
A9      (7, 6)        6                        2                        Cluster 2
A10     (3, 4)        0                        4                        Cluster 1
48
K-Medoid Clustering
The data points haven’t changed in the clusters after changing the medoids.
Hence, clusters are:
Points in cluster1:{(2, 6), (3, 8), (4, 7), (3, 4)}
Points in cluster 2:{(7,4), (6,2), (6, 4), (7,3), (8,5), (7,6)}
Now, let us again calculate the cost for each cluster and find their sum. The
total cost this time will be 3+4+4+3+1+1+0+2+2+0=20.
Here, the current cost is less than the cost calculated in the previous iteration.
Hence, we will make the swap permanent and make (7,4) the medoid for
cluster 2. If the cost this time was greater than the previous cost i.e. 22, we
would have to revert the change. New medoids after this iteration are (3, 4)
and (7, 4) with no change in the clusters.
49
K-Medoid Clustering

Iteration 3
Now, let us again change the medoid of cluster 2 to (6, 4). Hence,
the new medoids for the clusters are M1=(3, 4) and M2= (6, 4 ).
Let us calculate the distance between the data points and the above
medoids to find the new cluster. The results have been tabulated as
follows.

50
K-Medoid Clustering

Point   Coordinates   Distance from M1 (3,4)   Distance from M2 (6,4)   Assigned Cluster
A1      (2, 6)        3                        6                        Cluster 1
A2      (3, 8)        4                        7                        Cluster 1
A3      (4, 7)        4                        5                        Cluster 1
A4      (6, 2)        5                        2                        Cluster 2
A5      (6, 4)        3                        0                        Cluster 2
A6      (7, 3)        5                        2                        Cluster 2
A7      (7, 4)        4                        1                        Cluster 2
A8      (8, 5)        6                        3                        Cluster 2
A9      (7, 6)        6                        3                        Cluster 2
A10     (3, 4)        0                        3                        Cluster 1

51
Again, the clusters haven’t changed. Hence, clusters are:
Points in cluster1:{(2, 6), (3, 8), (4, 7), (3, 4)}
Points in cluster 2:{(7,4), (6,2), (6, 4), (7,3), (8,5), (7,6)}
Now, let us again calculate the cost for each cluster and find their sum.
The total cost this time will be 3+4+4+2+0+2+1+3+3+0=22.
The current cost is 22 which is greater than the cost in the previous
iteration i.e. 20. Hence, we will revert the change and the point (7, 4)
will again be made the medoid for cluster 2.
So, the clusters after this iteration will be cluster1 = {(2, 6), (3, 8), (4,
7), (3, 4)} and cluster 2= {(7,4), (6,2), (6, 4), (7,3), (8,5), (7,6)}. The
medoids are (3,4) and (7,4).
52
We keep replacing the medoids with non-medoid data points in this way. The set of medoids for which the cost is the least, along with the associated clusters, is made permanent. So, after all the iterations, you will get the final clusters and their medoids.
The K-Medoids clustering algorithm is a computation-intensive
algorithm that requires many iterations. In each iteration, we need to
calculate the distance between the medoids and the data points, assign
clusters, and compute the cost. Hence, K-Medoids clustering is not well
suited for large data sets.

53
Advantages of K-Medoids Clustering
❖ K-Medoids clustering is a simple iterative algorithm and is very easy to implement.
❖ K-Medoids clustering is guaranteed to converge. Hence, we are guaranteed to get results when we
perform k-medoids clustering on any dataset.
❖ K-Medoids clustering doesn’t apply to a specific domain. Owing to the generalization, we can use
k-medoids clustering in different machine learning applications ranging from text data to Geo-
spatial data and financial data to e-commerce data.
❖ The medoids in k-medoids clustering are selected randomly. We can choose the initial medoids in
a way such that the number of iterations can be minimized. To improve the performance of the
algorithm, you can warm start the choice of medoids by selecting specific data points as medoids
after data analysis.
❖ Compared to other partitioning clustering algorithms such as K-median and k-modes clustering,
the k-medoids clustering algorithm is faster in execution.
❖ K-Medoids clustering algorithm is very robust and it effectively deals with outliers and noise in
the dataset. Compared to k-means clustering, the k-medoids clustering algorithm is a better choice
for analyzing data with significant noise and outliers.
54
Disadvantages of K-Medoids Clustering
❖ K-Medoids clustering is not a scalable algorithm. For large datasets, the execution time of the
algorithm becomes very high. Due to this, you cannot use the K-Medoids algorithm for very large
datasets.
❖ In K-Medoids clustering, we don’t know the optimal number of clusters. Hence, we need to perform
clustering using different values of k to find the optimal number of clusters.
❖ When the dimensionality in the dataset increases, the distance between the data points starts to
become similar. Due to this, the distance between a data point and various medoids becomes almost
the same. This introduces inefficiency in the execution. To overcome this problem, you can use
advanced clustering algorithms like spectral clustering. Alternatively, you can also try to reduce the
dimensionality of the dataset while data preprocessing.
❖ K-Medoids clustering algorithm randomly chooses the initial medoids. Also, the output clusters are
primarily dependent on the initial medoids. Due to this, every time you run the k-medoids clustering
algorithm, you will get unique clusters.
❖ The K-Medoids clustering algorithm uses the distance from a central point (medoid). Due to this,
the k-medoids algorithm gives circular/spherical clusters. This algorithm is not useful for clustering
data into arbitrarily shaped clusters. 55
Clustering -Example
What Exactly is DBSCAN Clustering?

DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise

It groups ‘densely grouped’ data points into a single cluster. It can identify clusters in large
spatial datasets by looking at the local density of the data points. The most exciting
feature of DBSCAN clustering is that it is robust to outliers. It also does not require
the number of clusters to be told beforehand, unlike K-Means, where we have to specify the
number of centroids.

DBSCAN requires only two parameters: epsilon and minPoints. Epsilon is the radius of the
circle to be created around each data point to check the density and minPoints is the
minimum number of data points required inside that circle for that data point to be classified
as a Core point.

In higher dimensions the circle becomes hypersphere, epsilon becomes the radius of that
hypersphere, and minPoints is the minimum number of data points required inside that
hypersphere.
DBSCAN Clustering
DBSCAN creates a circle of epsilon radius around every data point and classifies them into Core point, Border point, and Noise. A data point is a Core point if the circle around it contains at least 'minPoints' number of points. If the number of points is less than minPoints, then it is classified as a Border point, and if there are no other data points around a data point within epsilon radius, then it is treated as Noise.

The above figure shows a cluster created by DBSCAN with minPoints = 3. Here, we draw a circle of equal radius epsilon around every data point. These two parameters help in creating spatial clusters.
DBSCAN Clustering
All the data points with at least 3 points in the circle including itself are considered
as Core points represented by red color. All the data points with less than 3 but
greater than 1 point in the circle including itself are considered as Border points. They
are represented by yellow color. Finally, data points with no point other than itself
present inside the circle are considered as Noise represented by the purple color.

For locating data points in space, DBSCAN uses Euclidean distance, although other methods can also be used (like great-circle distance for geographical data). It also needs to scan through the entire dataset only once, whereas in other algorithms we may have to do it multiple times.
Reachability and Connectivity

These are the two concepts that you need to understand before moving further. Reachability
states if a data point can be accessed from another data point directly or indirectly, whereas
Connectivity states whether two data points belong to the same cluster or not. In terms of
reachability and connectivity, two points in DBSCAN can be referred to as:
•Directly Density-Reachable
•Density-Reachable
•Density-Connected

A point X is directly density-reachable from point Y w.r.t. epsilon, minPoints if:
1. X belongs to the neighborhood of Y, i.e., dist(X, Y) <= epsilon
2. Y is a core point

Here, X is directly density-reachable from Y, but vice versa is not valid.
DBSCAN Clustering
A point X is density-reachable from point Y w.r.t epsilon,
minPoints if there is a chain of points p1, p2, p3, …, pn and
p1=X and pn=Y such that pi+1 is directly density-
reachable from pi.

Here, X is density-reachable from Y, with X being directly density-reachable from P2, P2 from P3, and P3 from Y.

But the inverse of this is not valid.


DBSCAN Clustering
A point X is density-connected from
point Y w.r.t epsilon and minPoints if there exists a
point O such that both X and Y are density-reachable
from O w.r.t to epsilon and minPoints.

Here, both X and Y are density-reachable from O; therefore, we can say that X is density-connected from Y.
Parameter Selection in DBSCAN
Clustering
DBSCAN is very sensitive to the values of epsilon and minPoints. Therefore, it is very
important to understand how to select the values of epsilon and minPoints. A slight variation
in these values can significantly change the results produced by the DBSCAN algorithm.

The value of minPoints should be at least one greater than the number of dimensions of the
dataset, i.e.,
minPoints>=Dimensions+1.

It does not make sense to take minPoints as 1 because it will result in each point being a
separate cluster. Therefore, it must be at least 3. Generally, it is twice the dimensions. But
domain knowledge also decides its value.

If the value of epsilon chosen is too small then a higher number of clusters will be created,
and more data points will be taken as noise. Whereas, if chosen too big then various small
clusters will merge into a big cluster, and we will lose details.
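A minimal scikit-learn sketch (assuming scikit-learn is installed); eps and min_samples correspond to epsilon and minPoints above, and the two-moons data is just an illustration of non-spherical clusters:

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons: a shape K-Means handles poorly but DBSCAN clusters well
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)   # eps = epsilon, min_samples = minPoints
labels = db.labels_                          # cluster index per point; -1 marks noise
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters found:", n_clusters)
print("noise points:", int(np.sum(labels == -1)))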
Hierarchical Clustering

72
Dendrogram - A Useful Tool for Summarizing Similarity Measurements
The similarity between two objects
in a dendrogram is represented as
the height of the lowest internal
node they share.

73
Hierarchical Clustering

The number of dendrograms with n leafs = (2n - 3)! / [2^(n-2) (n - 2)!]

Number of Leafs   Number of Possible Dendrograms
2                 1
3                 3
4                 15
5                 105
...               ...
10                34,459,425

Since we cannot test all possible trees, we will have to do a heuristic search of all possible trees. We could do this:

Bottom-Up (agglomerative): Starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together.

Top-Down (divisive): Starting with all the data in a single cluster, consider every possible way to divide the cluster into two. Choose the best division and recursively operate on both sides.
Hierarchical Clustering

We begin with a distance matrix which contains the distances between every pair of objects in our database.

      0  8  8  7  7
         0  2  4  4
            0  3  3
               0  1
                  0

D( , ) = 8
D( , ) = 1

75
Bottom-Up (agglomerative): Starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together. At each step, consider all possible merges and choose the best one; this is repeated until a single cluster remains.

78
Hierarchical Agglomerative Clustering-Linkage
Method

• The single linkage method is based on the minimum distance, or the nearest neighbor rule.
• The complete linkage method is based on the maximum distance, or the furthest neighbor approach.
• In the average linkage method, the distance between two clusters is defined as the average of the distances between all pairs of objects.

79
Hierarchical Agglomerative Clustering-Linkage
Method

(Figure: single linkage measures the minimum distance between Cluster 1 and Cluster 2; complete linkage the maximum distance; average linkage the average distance.)

80
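The three linkage rules map directly onto SciPy's agglomerative clustering utilities; a minimal sketch with illustrative random points (SciPy/NumPy assumed available):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, size=(10, 2)),
               rng.normal(3, 0.3, size=(10, 2))])    # two illustrative groups of points

for method in ("single", "complete", "average"):     # the three linkage rules above
    Z = linkage(X, method=method)                    # bottom-up merge history
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
    print(method, labels)

# dendrogram(Z) can be passed to matplotlib to draw the tree for the last linkage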
Hierarchical Clustering (Agglomerative )-Example

• Consider a distance matrix of size 6×6 as shown below
• Beginning with 6 clusters, as there are 6 objects
• The goal is to form a single cluster
• In each iteration we find the closest pair of clusters and merge them
• Group the closest clusters D, F into cluster (D, F)
• Using single linkage, we take the minimum distance between objects of the two clusters

81
Hierarchical Clustering (Agglomerative) - Example
(Slides 82-87: the successive merge steps of the agglomerative example and the resulting dendrogram.)
Quality: What Is Good Clustering?

• A good clustering method will produce high quality clusters


• high intra-class similarity: cohesive within clusters
• low inter-class similarity: distinctive between clusters
• The quality of a clustering method depends on
• the similarity measure used by the method
• its implementation, and
• Its ability to discover some or all of the hidden patterns

88
Measure the Quality of Clustering
• Dissimilarity/Similarity metric
• Similarity is expressed in terms of a distance function,
typically metric: d(i, j)
• The definitions of distance functions are usually rather different for interval-scaled, boolean, categorical, ordinal, ratio, and vector variables
• Weights should be associated with different variables
based on applications and data semantics
• Quality of clustering:
• There is usually a separate “quality” function that
measures the “goodness” of a cluster.
• It is hard to define “similar enough” or “good enough”
• The answer is typically highly subjective

89
Considerations for Cluster Analysis
• Partitioning criteria
• Single level vs. hierarchical partitioning (often, multi-level hierarchical partitioning is
desirable)
• Separation of clusters
• Exclusive (e.g., one customer belongs to only one region) vs. non-exclusive (e.g., one
document may belong to more than one class)
• Similarity measure
• Distance-based (e.g., Euclidian, road network, vector) vs. connectivity-based (e.g., density or
contiguity)
• Clustering space
• Full space (often when low dimensional) vs. subspaces (often in high-dimensional clustering)

90
Requirements and Challenges
• Scalability
• Clustering all the data instead of only on samples
• Ability to deal with different types of attributes
• Numerical, binary, categorical, ordinal, linked, and mixture of these
• Constraint-based clustering
• User may give inputs on constraints
• Use domain knowledge to determine input parameters
• Interpretability and usability
• Others
• Discovery of clusters with arbitrary shape
• Ability to deal with noisy data
• Incremental clustering and insensitivity to input order
• High dimensionality
91
Extensions to Hierarchical Clustering
• Major weakness of agglomerative clustering methods
• Can never undo what was done previously
• Do not scale well: time complexity of at least O(n^2), where n is the number of total objects

• CURE(Clustering Using Representatives)

• Integration of hierarchical & distance-based clustering


• BIRCH (1996): uses CF-tree and incrementally adjusts the quality of
sub-clusters
• CHAMELEON (1999): hierarchical clustering using dynamic modeling

92
CURE(Clustering Using Representatives)

• It is a hierarchical based clustering technique, that adopts a middle


ground between the centroid based and the all-point extremes.
• Hierarchical clustering is a type of clustering, that starts with a single
point cluster, and moves to merge with another cluster, until the desired
number of clusters are formed.
• It is used for identifying the spherical and non-spherical clusters.
• It is useful for discovering groups and identifying interesting distributions
in the underlying data.
• Instead of using one point centroid, as in most of data mining algorithms,
CURE uses a set of well-defined representative points, for efficiently
handling the clusters and eliminating the outliers.
93
Algorithm:

• Draw a random sample.
• Partition the random sample.
• Partially cluster the partitions.
• Outliers are identified and eliminated.
• The partial clusters obtained are merged into clusters.
• Label the result on storage.

94
BIRCH (Balanced Iterative Reducing and
Clustering Using Hierarchies)
• Zhang, Ramakrishnan & Livny, SIGMOD’96
• Incrementally construct a CF (Clustering Feature) tree, a hierarchical data
structure for multiphase clustering
• Phase 1: scan DB to build an initial in-memory CF tree (a multi-level
compression of the data that tries to preserve the inherent clustering
structure of the data)
• Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of
the CF-tree
• Scales linearly: finds a good clustering with a single scan and improves the
quality with a few additional scans
• Weakness: handles only numeric data, and sensitive to the order of the data
record
95
Clustering Feature Vector in
BIRCH
Clustering Feature (CF): CF = (N, LS, SS)
  N: number of data points
  LS: linear sum of the N points
  SS: square sum of the N points

Example with the 5 points (3,4), (2,6), (4,5), (4,7), (3,8):
  CF = (5, (16, 30), (54, 190))

96
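The CF triple above can be checked with a few lines of Python:

points = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]

N = len(points)
LS = tuple(sum(p[d] for p in points) for d in range(2))        # linear sum per dimension
SS = tuple(sum(p[d] ** 2 for p in points) for d in range(2))   # square sum per dimension

print((N, LS, SS))   # (5, (16, 30), (54, 190)), matching the CF shown above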
CF-Tree in BIRCH
• Clustering feature:
• Summary of the statistics for a given subcluster: the 0-th,
1st, and 2nd moments of the subcluster from the statistical
point of view
• Registers crucial measurements for computing cluster and
utilizes storage efficiently
• A CF tree is a height-balanced tree that stores the clustering
features for a hierarchical clustering
• A nonleaf node in a tree has descendants or “children”
• The nonleaf nodes store sums of the CFs of their children
• A CF tree has two parameters
• Branching factor: max # of children
• Threshold: max diameter of sub-clusters stored at the leaf
nodes
97
The CF Tree Structure
(Figure: a CF tree with branching factor B = 7 and leaf capacity L = 6. The root and non-leaf nodes hold CF entries (CF1, CF2, ...) with pointers to child nodes; leaf nodes hold CF entries and are chained together with prev/next pointers.)

98
The Birch Algorithm
• Cluster Diameter

• For each point in the input


• Find closest leaf entry
• Add point to leaf entry and update CF
• If entry diameter > max_diameter, then split leaf, and possibly parents
• Algorithm is O(n)
• Concerns
  • Sensitive to insertion order of data points
  • Since the size of leaf nodes is fixed, clusters may not correspond to natural groupings
  • Clusters tend to be spherical given the radius and diameter measures

99
Fuzzy Set and Fuzzy Cluster
• Clustering methods discussed so far
• Every data object is assigned to exactly one cluster
• Some applications may need for fuzzy or soft cluster assignment
• Ex. An e-game could belong to both entertainment and software
• Methods: fuzzy clusters and probabilistic model-based clusters
• Fuzzy cluster: A fuzzy set S: FS : X → [0, 1] (value between 0 and 1)
• Example: Popularity of cameras is defined as a fuzzy mapping

• Then, A(0.05), B(1), C(0.86), D(0.27)


100
Advantages and Disadvantages of Mixture Models

• Strength
• Mixture models are more general than partitioning and fuzzy clustering
• Clusters can be characterized by a small number of parameters
• The results may satisfy the statistical assumptions of the generative
models
• Weakness
• Converge to local optimal (overcome: run multi-times w. random
initialization)
• Computationally expensive if the number of distributions is large, or
the data set contains very few observed data points
• Need large data sets
• Hard to estimate the number of clusters

101
Probabilistic Model-Based Clustering
• Cluster analysis is to find hidden categories.
• A hidden category (i.e., probabilistic cluster) is a distribution over the data
space, which can be mathematically represented using a probability density
function (or distribution function).
■ Ex. 2 categories for digital cameras
sold
■ consumer line vs. professional line
■ density functions f1, f2 for C1, C2
■ obtained by probabilistic clustering
■ A mixture model assumes that a set of observed objects is a mixture
of instances from multiple probabilistic clusters, and conceptually
each observed object is generated independently
■ Our task: infer a set of k probabilistic clusters that is most likely to generate D using the above data generation process
102
Model-Based Clustering
• A set C of k probabilistic clusters C1, …,Ck with probability density functions f1, …, fk,
respectively, and their probabilities ω1, …, ωk.

• Probability of an object o generated by cluster Cj is P(o | Cj) = ωj fj(o)

• Probability of o generated by the set of clusters C is P(o | C) = Σj ωj fj(o)

■ Since objects are assumed to be generated independently, for a data set D = {o1, …, on}, we have
  P(D | C) = Πi P(oi | C) = Πi Σj ωj fj(oi)

■ Task: Find a set C of k probabilistic clusters s.t. P(D|C) is maximized


■ However, maximizing P(D|C) is often intractable since the probability
density function of a cluster can take an arbitrarily complicated form
■ To make it computationally feasible (as a compromise), assume the
probability density functions being some parameterized distributions

103
Univariate Gaussian Mixture Model
• O = {o1, …, on} (n observed objects), Θ = {θ1, …, θk} (parameters of the k
distributions), and Pj(oi| θj) is the probability that oi is generated from the j-
th distribution using parameter θj, we have
  P(oi | Θ) = Σj ωj Pj(oi | θj)  and  P(O | Θ) = Πi Σj ωj Pj(oi | θj)

■ Univariate Gaussian mixture model
  ■ Assume the probability density function of each cluster follows a 1-d Gaussian distribution. Suppose that there are k clusters.
  ■ The probability density function of each cluster is centered at μj with standard deviation σj, θj = (μj, σj), and we have
    Pj(oi | θj) = (1 / (√(2π) σj)) exp(−(oi − μj)² / (2σj²))

104
The EM (Expectation Maximization) Algorithm

• The k-means algorithm has two steps at each iteration:


• Expectation Step (E-step): Given the current cluster centers, each object
is assigned to the cluster whose center is closest to the object: An object
is expected to belong to the closest cluster
• Maximization Step (M-step): Given the cluster assignment, for each
cluster, the algorithm adjusts the center so that the sum of distance from
the objects assigned to this cluster and the new center is minimized
• The (EM) algorithm: A framework to approach maximum likelihood or
maximum a posteriori estimates of parameters in statistical models.
• E-step assigns objects to clusters according to the current fuzzy
clustering or parameters of probabilistic clusters
• M-step finds the new clustering or parameters that maximize the sum of
squared error (SSE) or the expected likelihood

105
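For illustration, a minimal sketch of these E- and M-steps for a univariate two-component Gaussian mixture (the synthetic data, the initialization, and the fixed iteration count are assumptions, not part of the slides):

import numpy as np

rng = np.random.default_rng(0)
# Synthetic 1-D data drawn from two Gaussians
x = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(5.0, 1.0, 200)])

k = 2
w = np.full(k, 1.0 / k)                 # mixing weights
mu = np.array([x.min(), x.max()])       # crude initial means
sigma = np.full(k, x.std())             # initial standard deviations

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

for _ in range(50):
    # E-step: responsibility of each component for each point
    r = np.vstack([w[j] * gauss(x, mu[j], sigma[j]) for j in range(k)])
    r /= r.sum(axis=0)
    # M-step: re-estimate weights, means, and standard deviations
    nk = r.sum(axis=1)
    w = nk / len(x)
    mu = (r * x).sum(axis=1) / nk
    sigma = np.sqrt((r * (x - mu[:, None]) ** 2).sum(axis=1) / nk)

print(np.round(mu, 2), np.round(sigma, 2), np.round(w, 2))  # should end up near means (0, 5), std (1, 1), weights (0.5, 0.5)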
Gaussian Mixture Model

Gaussian Mixture Model


A Gaussian mixture model is a probabilistic model that
assumes all the data points are generated from a mixture
of a finite number of Gaussian distributions with unknown
parameters. One can think of mixture models as
generalizing k-means clustering to incorporate information
about the covariance structure of the data as well as the
centers of the latent Gaussians.
A Gaussian mixture model (GMM) is a category of
probabilistic model that states that all generated data points
are derived from a mixture of finite Gaussian distributions
that have no known parameters
106
Cntd..

In one dimension, the probability density function of a Gaussian distribution is given by

  f(x | μ, σ²) = (1 / (σ√(2π))) exp(−(x − μ)² / (2σ²))

107
Gaussian Mixture

• Linear combination of Gaussians: p(x) = Σk πk N(x | μk, Σk), with mixing weights πk ≥ 0 and Σk πk = 1

108
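In practice, a library such as scikit-learn can fit this mixture directly; a minimal sketch with illustrative random blobs (scikit-learn assumed available):

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal((0, 0), 0.7, size=(150, 2)),
               rng.normal((4, 4), 1.2, size=(150, 2))])

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)
print(gmm.means_)                 # estimated component means
print(gmm.weights_)               # estimated mixing weights
print(gmm.predict_proba(X[:3]))   # soft (probabilistic) cluster assignments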
Human Tumor Microarray Data

• The human tumor microarray, is a cutting-edge technology


that revolutionizes cancer research and diagnosis.

• This innovative approach harnesses the potential of machine


learning to uncover invaluable insights into the complex tumor
microenvironment.

109
Overview of Tissue
Microarray Technology
Tissue microarray (TMA) is an innovative technology that allows for high-
throughput analysis of tissue samples. In a TMA, hundreds of tiny tissue
cores are precisely arrayed on a single paraffin block, enabling
simultaneous immunohistochemical or in situ hybridization
analysis of multiple patient samples on a single slide.
Role of Machine Learning
in Tumor Microarray
Analysis
Machine learning algorithms have become invaluable tools for analyzing
the vast amounts of data generated by tumor microarray technologies.
These powerful techniques can uncover complex patterns and
relationships within the data, enabling researchers to better understand
tumor biology and develop more targeted therapies.
Algorithms for Tumor Microarray
Analysis
Image Segmentation
Applying image segmentation algorithms to isolate individual tissue cores and
separate them from the background.

Feature Extraction
Extracting quantitative features from the tissue cores, such as staining
intensity, texture, and morphological properties.

Supervised Classification
Using machine learning algorithms like decision trees, random forests, and
support vector machines to classify tumor samples into different subtypes or
grades.
Image Processing and Feature
Extraction
Image processing plays a crucial role in
analyzing tissue microarray data. Key steps
include image segmentation, cell detection,
and feature extraction.
Sophisticated algorithms extract quantitative
measurements from the images, such as cell
counts, protein expression levels, and spatial
distributions.
These image-derived features serve as
inputs to machine learning models for tumor
classification and subtyping.
Supervised Learning Techniques
for Tumor Classification
1. Logistic Regression - predicts tumor type based on features
2. Support Vector Machines - identify the optimal hyperplane to separate tumor classes
3. Random Forests - an ensemble of decision trees for robust classification

Supervised learning algorithms are powerful tools for classifying tumor types based on the microarray data. Logistic regression, support vector machines, and random forests are commonly used techniques that can accurately predict tumor subtypes by learning from labeled training data. These models excel at identifying complex patterns in the high-dimensional tumor profiles.
Unsupervised Clustering
Algorithms for Tumor Subtyping
Unsupervised clustering algorithms play a crucial role in identifying distinct tumor subtypes within
a heterogeneous tumor microarray dataset. These algorithms can uncover hidden patterns and
groupings without relying on pre-defined labels or classes.

• K-Means Clustering: groups tumors into k distinct clusters based on similarity of gene expression profiles. Useful for identifying broad tumor subtypes.
• Hierarchical Clustering: builds a hierarchy of clusters, allowing visualization of relationships between tumor samples. Helps identify subtypes and their hierarchical structure.
• Gaussian Mixture Models: assume the data is generated from a mixture of Gaussian distributions. Can identify overlapping tumor subtypes with probabilistic assignments.

These algorithms uncover hidden structures in the tumor microarray data, enabling researchers to identify novel tumor subtypes that may have distinct molecular profiles and clinical outcomes. The discovered subtypes can then be further studied and validated using supervised learning.
Applications of Tissue Microarray
in Cancer Research

• Biomarker Discovery: tissue microarrays enable rapid screening of tissue samples for potential biomarkers associated with cancer prognosis and treatment response.
• Drug Development: tissue microarrays allow efficient evaluation of drug candidates and their impact on various tumor types, accelerating the drug discovery process.
• High-Throughput Analysis: the compact array format enables comprehensive, high-throughput analysis of large cohorts of patient samples, providing valuable insights.
• Pathology Validation: tissue microarrays facilitate the validation of novel immunohistochemical markers and their association with clinicopathological features of cancer.
Advantages of Tissue Microarray
1. High-Throughput Analysis: tissue microarray enables the simultaneous analysis of hundreds of tissue samples on a single slide, dramatically increasing the efficiency of cancer research.
2. Reduced Tissue Consumption: the miniaturized format of tissue microarray requires only small amounts of tissue, preserving precious patient samples for multiple analyses.
3. Standardized Conditions: tissue microarray ensures consistent staining and analysis conditions across all samples, improving the reliability and reproducibility of results.
4. Cost-Effectiveness: the streamlined workflow and reduced tissue requirements of tissue microarray make it a cost-effective tool for cancer research and biomarker discovery.
Disadvantages of Tissue Microarray
1. Limited Representativeness: tissue microarrays typically contain small tissue cores from different regions of a tumour or from multiple tumours.
2. Tissue Damage and Degradation: the process of constructing tissue microarrays involves punching small cores from paraffin-embedded tissue blocks.
3. Sampling Bias: the selection of tissue cores for inclusion in a microarray may introduce sampling bias, particularly if cores are selected based on subjective criteria or if certain tissue regions are overrepresented or underrepresented.
4. Loss of Morphological Context: tissue microarrays sacrifice the spatial context of individual tissue samples, as multiple cores are arranged in a grid-like pattern on a single slide.
Vector Quantization
Vectors are mathematical entities that represent quantities that have both
magnitude and direction. In vector quantization, these vectors typically represent
data points in a multi-dimensional space. For example, in image compression, each
pixel's color values can be represented as a vector in a high-dimensional color
space.

Vector quantization is a technique used in data compression. It involves the process


of representing a large set of data points with a smaller set of representative points
called centroids or codewords.
Vector Quantization : Basic Working
• Codebook Generation: A codebook is created initially, containing a predefined
number of codewords or centroids. These codewords can be randomly selected
or generated based on some criteria.
• Quantization: Each input vector in the dataset is then assigned to the nearest
codeword in the codebook. This process is called quantization, where each
vector is mapped to a single codeword.
• Compression: Instead of storing the entire dataset, only the indices of the
codewords to which each vector is assigned are stored. This results in significant
compression, especially when the dataset is large.
• Reconstruction: To reconstruct the original data, the indices stored along with the codebook are used to retrieve the corresponding codewords, which are then used as approximations of the original vectors.
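A minimal sketch of the codebook / quantization / compression / reconstruction steps above, using k-means centroids as the codewords (an assumption; the slides do not prescribe a particular codebook-generation method):

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, size=(1000, 2))        # vectors to be quantized (illustrative)

# Codebook generation: the k-means centroids act as the codewords
codebook_size = 8
km = KMeans(n_clusters=codebook_size, n_init=10, random_state=0).fit(data)
codebook = km.cluster_centers_

# Quantization / compression: store only the index of the nearest codeword per vector
indices = km.predict(data)

# Reconstruction: look the codewords back up from the stored indices
reconstructed = codebook[indices]
distortion = np.mean(np.sum((data - reconstructed) ** 2, axis=1))
print("codebook shape:", codebook.shape, "mean distortion:", round(float(distortion), 3))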
Vector Quantization: Basic Working
(Figure: a worked example - input values such as {0.75, 1.27} and {1.78, 2.11} are passed through the quantization function, which maps each input to the nearest entry of the codebook (the codebook shown contains -2, -1, and 1); the resulting error is the distortion.)
Source: Youtube (Rudra Singh - https://www.youtube.com/watch?v=qzhIXfxMDpM&t=191s)
Applications
• Data Compression: Vector quantization is widely used in data compression applications such as image, audio, and video compression. In image compression, for instance, vector quantization techniques like the Linde-Buzo-Gray (LBG) algorithm are employed to represent image blocks with a smaller set of representative codewords, achieving high compression ratios while preserving image quality.
• Pattern Recognition: Vector quantization is utilized in pattern recognition tasks such as classification, clustering, and feature extraction. In classification tasks, codewords obtained through vector quantization serve as informative features for training classifiers to recognize and classify different patterns or objects.
• Data Analysis and Mining: Vector quantization is applied in data analysis and mining tasks such as clustering and anomaly detection. In clustering tasks, vector quantization algorithms are used to group similar data points together based on their feature representations.
Advantages
• Data compression: Vector Quantization can achieve significant data
compression with minimal loss of information, making it suitable for
applications like image and audio compression.

• Noise reduction: Vector Quantization can help reduce noise in the data by
replacing individual data points with representative codebook vectors,
leading to smoother and more robust representations.

• Pattern recognition: Vector Quantization can be used to identify patterns or


structures in the data, which can be useful for tasks like classification,
clustering, and feature extraction.
Disadvantages
Fixed Codebook Size: Most vector quantization techniques require a fixed-size codebook,
which needs to be predefined before the quantization process. Adapting the codebook size
to accommodate variations in the data distribution or achieving optimal compression
performance can be challenging.

Limited Adaptability to Data Changes: Once the codebook is trained, it remains fixed and
may not adapt well to changes in the input data distribution over time. This lack of
adaptability may result in suboptimal quantization performance when the data distribution
shifts or evolves.

High Computational Complexity: Depending on the size of the dataset and the dimensionality of the feature space, vector quantization algorithms can be computationally intensive, especially during the codebook training phase. This high computational complexity may limit the scalability of VQ algorithms to large datasets and high-dimensional feature spaces.
Self Organizing Maps – Kohonen Maps

• Self Organizing Map (or Kohonen Map or SOM) is a type of Artificial


Neural Network which is also inspired by biological models of neural
systems from the 1970s.
• It follows an unsupervised learning approach and trains its network through a competitive learning algorithm.
• SOM is used for clustering and mapping (or dimensionality reduction)
techniques to map multidimensional data onto lower-dimensional
which allows people to reduce complex problems for easy
interpretation.
• SOM has two layers, one is the Input layer and the other one is the
Output layer.
126
• The architecture of the Self Organizing Map with two clusters and n
input features of any sample is given below:

127
How does SOM work?
• Let’s say an input data of size (m, n) where m is the number of training examples
and n is the number of features in each example.
• First, it initializes the weights of size (n, C) where C is the number of clusters.
• Then iterating over the input data, for each training example, it updates the winning
vector (weight vector with the shortest distance (e.g Euclidean distance) from
training example). Weight updation rule is given by
wij = wij(old) + alpha(t) * (xik - wij(old))

• where alpha is a learning rate at time t, j denotes the winning vector, i denotes the
ith feature of training example and k denotes the kth training example from the input
data.
• After training the SOM network, trained weights are used for clustering new
examples. A new example falls in the cluster of winning vectors.

128
Algorithm

• Training:
• Step 1: Initialize the weights wij random value may be assumed. Initialize the learning rate
α.
• Step 2: Calculate squared Euclidean distance.
• D(j) = Σ (wij – xi)^2 where i=1 to n and j=1 to m
• Step 3: Find index J, when D(j) is minimum that will be considered as winning index.
• Step 4: For each j within a specific neighborhood of j and for all i, calculate the new weight.
• wij(new)=wij(old) + α[xi – wij(old)]
• Step 5: Update the learning rate using:
• α(t+1) = 0.5 * α(t)
• Step 6: Test the Stopping Condition.

129
• Output:
Test Sample s belongs to Cluster : 0
Trained weights : [[0.6000000000000001, 0.8, 0.5, 0.9],
[0.3333984375, 0.0666015625, 0.7, 0.3]]

130
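For illustration, a minimal SOM sketch along the lines of the algorithm above (winner-only updates with no neighborhood function, and an assumed learning-rate decay; the binary input vectors are illustrative, so the numbers will differ from the sample output shown above):

import numpy as np

def train_som(X, C=2, epochs=100, alpha=0.5, seed=0):
    # Minimal SOM sketch: C output nodes, winner-only updates (no neighborhood)
    rng = np.random.default_rng(seed)
    W = rng.random((C, X.shape[1]))              # step 1: random weights, one row per cluster
    for _ in range(epochs):
        for x in X:
            d = np.sum((W - x) ** 2, axis=1)     # step 2: squared Euclidean distance to each node
            j = int(np.argmin(d))                # step 3: index of the winning node
            W[j] += alpha * (x - W[j])           # step 4: move the winner toward the sample
        alpha *= 0.5                             # step 5: decay the learning rate
    return W

X = np.array([[1, 1, 0, 0], [0, 0, 0, 1], [1, 0, 0, 0], [0, 0, 1, 1]], dtype=float)
W = train_som(X, C=2)
s = np.array([0, 0, 0, 1], dtype=float)
print("Trained weights:\n", W)
print("Test sample", s, "belongs to cluster", int(np.argmin(np.sum((W - s) ** 2, axis=1))))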
Self Organization Maps

• The Self Organizing Map is one of the most popular neural models. It belongs to the category of
the competitive learning network.
• The SOM is based on unsupervised learning, which means that no human intervention is needed during the training and that little needs to be known about the characteristics of the input data.
• We could, for example, use the SOM for clustering membership of the input data. The SOM can be used to detect features inherent to the problem and thus has also been called SOFM, the Self-Organizing Feature Map.

131
• The Self-Organizing Map was developed by Professor Kohonen and is used in many applications.
• The purpose of SOM is to provide a data visualization technique that helps to understand high-dimensional data by reducing the dimensions of the data to a map.
• SOM also represents the clustering concept by grouping
similar data.
• Therefore it can be said that the Self Organizing Map
reduces data dimension and displays similarly among data.
132
133
SOM
• A SOM does not need a target output to be specified, unlike many other types of
networks.
• Instead, where the node weights match the input vector, that area of the lattice is
selectively optimized to more closely resemble the data for the class the input
vector is a member of.
• From an initial distribution of random weights, and over many iterations, the SOM
eventually settles into a map of stable zones.
• Each zone is effectively a feature classifier, so you can think of the graphical
output as a type of feature map of the input space.

134
SOM

Training occurs in several steps and over many


iterations:
1. Each node’s weights are initialized.
2. A vector is chosen at random from the set of training data
and presented to the lattice.
3. Every node is examined to calculate which ones weights
are most like the input vector. The winning node is commonly
known as the Best Matching Unit (BMU).
.

135
SOM

4. The radius of the neighborhood of the BMU is now calculated. This value
starts large, typically set to the ‘radius’ of the lattice, but diminishes each time
step. Any nodes within this radius are deemed inside the BMU’s
neighborhood.
5. Each neighboring node’s (the nodes found in step 4) weights are adjusted
to make them more like the input vector. The closer a node is to the BMU; the
more its weights get altered.
6. Repeat step 2 for N iterations.

136
SOM

Example:

138
SOM

Randomly initialize the values of the weights (close to


0 but not 0).

139
SOM

Step 2: Calculating the Best Matching


Unit with

140
SOM

Node number 3 is
the closest
with a distance of
0.4. We will call this
node
our BMU (best-
matching unit).

141
SOM

Adjusting the Weights


Every node within the BMU’s neighborhood (including the BMU)
has its weight vector adjusted according to the following
equation:
New Weights = Old Weights + Learning Rate (Input Vector —
Old Weights)
W(t+1) = W(t) + L(t) ( V(t) — W(t) )
So according to our example, Node 3 is the Best Matching Unit (as you can see in step 2), with corresponding weights:
142
SOM

Learning rate = 0.5


So update that weight according to the above equation
For W3,1
New Weights = Old Weights + Learning Rate (Input Vector1 — Old Weights)
New Weights = 0.39 + 0.5 (0.7–0.39)
New Weights = 0.545
For W3,2
New Weights = Old Weights + Learning Rate (Input Vector2 — Old Weights)
New Weights = 0.42 + 0.5 (0.6–0.42)
143
SOM

New Weights = 0.51


For W3,3
New Weights = Old Weights + Learning Rate (Input
Vector3 — Old Weights)
New Weights = 0.45 + 0.5 (0.9–0.45)
New Weights = 0.675
Updated weights for node 3: W3 = (0.545, 0.51, 0.675)

144
Spectral clustering:
■ Combining feature extraction and clustering (i.e., use the spectrum of the similarity
matrix of the data to perform dimensionality reduction for clustering in fewer
dimensions)
■ Normalized Cuts (Shi and Malik, CVPR’97 or PAMI’2000)
■ The Ng-Jordan-Weiss algorithm (NIPS’01)

145
Spectral Clustering:
The Ng-Jordan-Weiss (NJW) Algorithm
• Given a set of objects o1, …, on, and the distance between each pair of objects, dist(oi, oj), find the desired number k of
clusters
• Calculate an affinity matrix W, e.g., Wij = exp(−dist(oi, oj)² / (2σ²)), where σ is a scaling parameter that controls how fast the affinity Wij decreases as dist(oi, oj) increases. In NJW, set Wii = 0

• Derive a matrix A = f(W). NJW defines a matrix D to be a diagonal matrix s.t. Dii is the sum of the i-th row of W, i.e., Dii = Σj Wij

  Then, A is set to A = D^(-1/2) W D^(-1/2)
• A spectral clustering method finds the k leading eigenvectors of A
• A vector v is an eigenvector of matrix A if Av = λv, where λ is the corresponding eigen-value
• Using the k leading eigenvectors, project the original data into the new space defined by the k leading eigenvectors,
and run a clustering algorithm, such as k-means, to find k clusters
• Assign the original data points to clusters according to how the transformed points are assigned in the clusters
obtained

146
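A minimal scikit-learn sketch (scikit-learn assumed available); a nearest-neighbors affinity is used here instead of the Gaussian affinity above, and the concentric-circles data illustrates clusters that K-Means cannot separate:

import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_circles

# Two concentric circles: connected clusters with non-convex boundaries
X, _ = make_circles(n_samples=300, factor=0.4, noise=0.05, random_state=0)

sc = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                        n_neighbors=10, assign_labels="kmeans", random_state=0)
labels = sc.fit_predict(X)
print(np.bincount(labels))   # roughly 150 points per circle if the two rings are recovered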
Spectral Clustering: Illustration and Comments

• Spectral clustering: Effective in tasks like image processing


• Scalability challenge: Computing eigenvectors on a large matrix is costly
• Can be combined with other clustering methods, such as bi-clustering
147
• Spectral clustering is one of the most popular forms of multivariate statistical analysis
• Spectral Clustering uses the connectivity approach to clustering, wherein communities of nodes (i.e. data points) that are connected or immediately next to each other are identified in a graph.
• The nodes are then mapped to a low-dimensional space that can be easily segregated to
form clusters.
• Spectral Clustering uses information from the eigenvalues (spectrum) of special matrices
(i.e. Affinity Matrix, Degree Matrix and Laplacian Matrix) derived from the graph or the
data set.
• Spectral clustering methods are attractive, easy to implement, reasonably fast especially
for sparse data sets up to several thousand. Spectral clustering treats the data clustering
as a graph partitioning problem without making any assumption on the form of the data
clusters.
148
Difference between Spectral Clustering and Conventional
Clustering Techniques

• Spectral clustering is flexible and allows us to cluster non-graphical data as well.


• It makes no assumptions about the form of the clusters.
• Clustering techniques, like K-Means, assume that the points assigned to a cluster are spherical
about the cluster centre. This is a strong assumption and may not always be relevant.
• In such cases, Spectral Clustering helps create more accurate clusters. It can correctly cluster
observations that actually belong to the same cluster, but are farther off than observations in
other clusters, due to dimension reduction.
• The data points in Spectral Clustering should be connected, but may not necessarily have
convex boundaries, as opposed to the conventional clustering techniques, where clustering is
based on the compactness of data points.
• However, it is computationally expensive for large datasets, since eigenvalues and eigenvectors need to be computed and clustering is performed on these vectors.
• Also, for large datasets, the complexity increases and accuracy decreases significantly.

149
Spectral Clustering Matrix Representation
Adjacency and Affinity Matrix (A)

• The graph (or set of data points) can be represented as an


Adjacency Matrix, where the row and column indices represent the
nodes, and the entries represent the absence or presence of an
edge between the nodes (i.e. if the entry in row 0 and column 1 is 1,
it would indicate that node 0 is connected to node 1).

150
• An Affinity Matrix is like an Adjacency Matrix, except the value for a
pair of points expresses how similar those points are to each other.
If pairs of points are very dissimilar then the affinity should be 0. If
the points are identical, then the affinity might be 1. In this way, the
affinity acts like the weights for the edges on our graph.

151
Degree Matrix (D) and Laplacian Matrix (L)

Degree Matrix (D)


A Degree Matrix is a diagonal matrix, where the degree of a node (i.e. values) of the diagonal is given by the
number of edges connected to it. We can also obtain the degree of the nodes by taking the sum of each row in the
adjacency matrix.

152
• Laplacian Matrix (L)
• This is another representation of the graph/data points, which
attributes to the beautiful properties leveraged by Spectral
Clustering. One such representation is obtained by subtracting the
Adjacency Matrix from the Degree Matrix (i.e. L = D – A).

153
• Spectral Gap: The first non-zero eigenvalue is called the Spectral Gap.
The Spectral Gap gives us some notion of the density of the graph.
• Fiedler Value: The second eigenvalue is called the Fiedler Value, and
the corresponding vector is the Fiedler vector. Each value in the
Fiedler vector gives us information as to which side of the decision
boundary a particular node belongs to.
• Using L, we find the first large gap between eigenvalues which
generally indicates that the number of eigenvalues before this gap is
equal to the number of clusters.
• What is Spectral Clustering and how its work? (mygreatlearning.com)
154
