Final ML Unit3 May24
Clustering
• Organizing data into classes such that there is
• high intra-class similarity
• low inter-class similarity
• Finding the class labels and the number of classes
directly from the data (in contrast to classification).
• More informally, finding natural groupings among
objects.
• Also called unsupervised learning; sometimes called “classification” by statisticians, “sorting” by psychologists, and “segmentation” by people in marketing
Clustering Example
What is a natural grouping among these objects?
Clustering is subjective. Similarity is hard to define, but “we know it when we see it.” The real meaning of similarity is a philosophical question; here we take a more pragmatic approach.
Common Distance Measures
Intuitions behind desirable distance
measure properties
D(A, B) = D(B, A) (Symmetry)
Otherwise you could claim “Alex looks like Bob, but Bob looks nothing
like Alex.”
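The common distance measures referred to above typically include Euclidean and Manhattan distance (the original slide's figure is not reproduced here). Below is a minimal sketch with two hypothetical points, which also checks the symmetry property D(A, B) = D(B, A) just described:

import numpy as np

def euclidean(a, b):
    # Straight-line distance: square root of the sum of squared differences
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):
    # City-block distance: sum of absolute differences
    return np.sum(np.abs(a - b))

A = np.array([2.0, 6.0])   # hypothetical point "Alex"
B = np.array([7.0, 3.0])   # hypothetical point "Bob"

for name, d in [("Euclidean", euclidean), ("Manhattan", manhattan)]:
    print(name, d(A, B), "symmetric:", d(A, B) == d(B, A))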
Clustering Applications
• Customer segmentation: Customer segmentation is the practice of dividing a company's
customers into groups that reflect similarity among customers in each group
• Fraud detection: Using techniques such as k-means clustering, one can identify the patterns of unusual activities; detecting an outlier may indicate that a fraud event has taken place.
• Document classification
• Image segmentation
• Anomaly detection
What Is Good Clustering?
• A good clustering method will produce high quality clusters with
• high intra-class similarity
• low inter-class similarity
Types of Clustering
• Hierarchical clustering (BIRCH)
• A set of nested clusters organized as a hierarchical tree
• Partitional clustering (k-means, k-medoids)
• A division of data objects into non-overlapping (distinct) subsets (i.e., clusters) such that each data object is in exactly one subset
• Density-based (DBSCAN)
• Based on density functions
• Grid-based (STING)
• Based on a multiple-level granularity structure
• Model-based (SOM)
• Hypothesize a model for each of the clusters and find the best fit of the data to the given model
Clustering
Clustering is defined as dividing data points or populations into groups such that similar data points fall in the same group. The aim is to segregate groups based on similar traits. Broadly, clustering can be divided into two subgroups:
1. Soft clustering: instead of assigning each data point to exactly one cluster, a likelihood or probability of the data point belonging to each cluster is assigned.
2. Hard clustering: each data point either belongs entirely to a cluster or not at all.
The task of clustering is subjective; there are many ways of achieving the goal, and many different sets of rules for defining similarity among data points. Of the more than 100 clustering algorithms, only a few are commonly used. They are as follows:
https://www.learndatasci.com/glossary/hierarchical-clustering/
Clustering
1. Connectivity models: these models are based on the notion that data points closer together in the data space exhibit more similarity than those lying farther apart. One approach starts with every data point in its own cluster and aggregates clusters as the distance decreases; the other starts with all data points in a single cluster and partitions it as the distance increases. The choice of distance function is subjective. These models are easy to interpret but lack the scalability to handle large datasets. Example: hierarchical clustering.
2. Centroid models: iterative clustering algorithms in which similarity is derived from the closeness of a data point to the cluster's centroid. Example: k-means clustering. The number of clusters must be specified in advance, which requires prior knowledge of the dataset.
K-Means Clustering
The most common clustering algorithm covered in machine learning for beginners is k-means. The first step is to create k new points, called centroids, and place them randomly among our unlabelled data; the number of centroids represents the number of output classes. In each step of the iterative process, every data point is assigned to its nearest centroid (in terms of Euclidean distance). Next, for each category, the average of all the points assigned to that class is computed; the output is the new centroid of the class. The process repeats until the centroids stop moving.
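A minimal NumPy sketch of the iterative procedure just described, using hypothetical 2-D data; in practice scikit-learn's KMeans would usually be used instead:

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Place k centroids at randomly chosen data points
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iters):
        # Assign each point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the average of the points assigned to it
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):   # stop when the centroids no longer move
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[2, 6], [3, 8], [4, 7], [6, 2], [6, 4], [7, 3]], dtype=float)   # hypothetical points
labels, centroids = kmeans(X, k=2)
print(labels, centroids)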
What is the problem with the K-Means method?
• The k-means algorithm is sensitive to outliers, since an object with an extremely large value can substantially distort the mean (and hence the centroid) of a cluster.
• K-Medoids: instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in a cluster.
The K-Medoids Clustering
• Find representative objects, called medoids, in clusters
Typical K-Medoids algorithm (PAM), illustrated with K = 2:
1. Arbitrarily choose k objects as the initial medoids.
2. Assign each remaining object to the nearest medoid (in the illustration, Total Cost = 20).
3. Randomly select a non-medoid object, Orandom.
4. Compute the total cost of swapping a medoid with Orandom (in the illustration, Total Cost = 26).
5. If the quality is improved, swap the medoid and Orandom; otherwise keep the current medoids.
6. Repeat the loop until no change.
PAM (Partitioning Around Medoids)
• Input
• K: the number of clusters
• D: a data set containing n objects
• Output: a set of k clusters
• Method:
1. Arbitrarily choose k objects from D as the initial representative objects (seeds).
2. Repeat:
3. Assign each remaining object to the cluster with the nearest representative object.
4. For each representative object Oj:
5. Randomly select a non-representative object Orandom.
6. Compute the total cost S of swapping representative object Oj with Orandom.
7. If S < 0, then replace Oj with Orandom.
8. Until no change.
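A compact sketch of the PAM loop above (a greedy variant that tries every medoid/non-medoid swap instead of a single randomly selected Orandom), using the Manhattan distance that the worked example below also uses:

import numpy as np
from itertools import product

def manhattan(a, b):
    return np.abs(a - b).sum()

def total_cost(X, medoid_idx):
    # Cost = sum of each point's distance to its nearest medoid
    return sum(min(manhattan(x, X[m]) for m in medoid_idx) for x in X)

def pam(X, k, seed=0):
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(len(X), k, replace=False))        # step 1: arbitrary seeds
    best = total_cost(X, medoids)
    improved = True
    while improved:                                              # repeat until no change
        improved = False
        for mi, oi in product(range(k), range(len(X))):          # try swapping each medoid
            if oi in medoids:
                continue
            candidate = medoids.copy()
            candidate[mi] = oi
            cost = total_cost(X, candidate)
            if cost < best:                                      # keep the swap only if cost decreases
                best, medoids, improved = cost, candidate, True
    labels = [min(range(k), key=lambda j: manhattan(x, X[medoids[j]])) for x in X]
    return medoids, labels, best

X = np.array([[2, 6], [3, 8], [4, 7], [6, 2], [6, 4],
              [7, 3], [7, 4], [8, 5], [7, 6], [3, 4]])           # the example dataset used below
print(pam(X, k=2))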
K-Medoids Properties (k-medoids vs. k-means)
K-Medoids Clustering Example
Point | Coordinates
A1 | (2, 6)
A2 | (3, 8)
A3 | (4, 7)
A4 | (6, 2)
A5 | (6, 4)
A6 | (7, 3)
A7 | (7, 4)
A8 | (8, 5)
A9 | (7, 6)
A10 | (3, 4)
K-Medoid Clustering
• Iteration 1
• Suppose that we want to group the above dataset into two clusters. So, we will
randomly choose two medoids.
• Here, the choice of medoids is important for efficient execution. Hence, we have selected two points from the dataset that are potential medoids for the final clusters. The following two points from the dataset have been selected as medoids.
• M1 = (3, 4)
• M2 = (7, 3)
• Now, we will calculate the distance between each data point and the medoids
using the Manhattan distance measure. The results have been tabulated as follows.
K-Medoid Clustering
Point | Coordinates | Distance from M1 (3, 4) | Distance from M2 (7, 3) | Assigned Cluster
A1 | (2, 6) | 3 | 8 | Cluster 1
A2 | (3, 8) | 4 | 9 | Cluster 1
A3 | (4, 7) | 4 | 7 | Cluster 1
A4 | (6, 2) | 5 | 2 | Cluster 2
A5 | (6, 4) | 3 | 2 | Cluster 2
A6 | (7, 3) | 5 | 0 | Cluster 2
A7 | (7, 4) | 4 | 1 | Cluster 2
A8 | (8, 5) | 6 | 3 | Cluster 2
A9 | (7, 6) | 6 | 3 | Cluster 2
A10 | (3, 4) | 0 | 5 | Cluster 1
(Iteration 1)
K-Medoid Clustering
The clusters made with medoids (3, 4) and (7, 3) are as follows.
Points in cluster1= {(2, 6), (3, 8), (4, 7), (3, 4)}
Points in cluster 2= {(7,4), (6,2), (6, 4), (7,3), (8,5), (7,6)}
After assigning clusters, we will calculate the cost for each cluster and
find their sum. The cost is nothing but the sum of distances of all the
data points from the medoid of the cluster they belong to.
Hence, the cost for the current cluster will be
3+4+4+2+2+0+1+3+3+0=22.
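The assignment and cost above can be checked with a few lines of Python (a sketch using the points and medoids listed above):

points = [(2, 6), (3, 8), (4, 7), (6, 2), (6, 4), (7, 3), (7, 4), (8, 5), (7, 6), (3, 4)]
medoids = [(3, 4), (7, 3)]

def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

cost = 0
for p in points:
    d = [manhattan(p, m) for m in medoids]     # distance to M1 and M2
    cost += min(d)                             # each point contributes its nearest-medoid distance
    print(p, "-> cluster", d.index(min(d)) + 1, d)
print("total cost:", cost)                     # prints 22, matching the table above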
K-Medoid Clustering
Iteration 2
Now, we will select another non-medoid point (7, 4) and make it a
temporary medoid for the second cluster. Hence,
M1 = (3, 4)
M2 = (7, 4)
Now, let us calculate the distance between all the data points and the
current medoids.
K-Medoid Clustering
Point | Coordinates | Distance from M1 (3, 4) | Distance from M2 (7, 4) | Assigned Cluster
A1 | (2, 6) | 3 | 7 | Cluster 1
A2 | (3, 8) | 4 | 8 | Cluster 1
A3 | (4, 7) | 4 | 6 | Cluster 1
A4 | (6, 2) | 5 | 3 | Cluster 2
A5 | (6, 4) | 3 | 1 | Cluster 2
A6 | (7, 3) | 5 | 1 | Cluster 2
A7 | (7, 4) | 4 | 0 | Cluster 2
A8 | (8, 5) | 6 | 2 | Cluster 2
A9 | (7, 6) | 6 | 2 | Cluster 2
A10 | (3, 4) | 0 | 4 | Cluster 1
K-Medoid Clustering
The data points haven’t changed in the clusters after changing the medoids.
Hence, clusters are:
Points in cluster1:{(2, 6), (3, 8), (4, 7), (3, 4)}
Points in cluster 2:{(7,4), (6,2), (6, 4), (7,3), (8,5), (7,6)}
Now, let us again calculate the cost for each cluster and find their sum. The
total cost this time will be 3+4+4+3+1+1+0+2+2+0=20.
Here, the current cost is less than the cost calculated in the previous iteration.
Hence, we will make the swap permanent and make (7,4) the medoid for
cluster 2. If the cost this time was greater than the previous cost i.e. 22, we
would have to revert the change. New medoids after this iteration are (3, 4)
and (7, 4) with no change in the clusters.
K-Medoid Clustering
Iteration 3
Now, let us again change the medoid of cluster 2 to (6, 4). Hence,
the new medoids for the clusters are M1 = (3, 4) and M2 = (6, 4).
Let us calculate the distance between the data points and the above
medoids to find the new cluster. The results have been tabulated as
follows.
K-Medoid Clustering
Point | Coordinates | Distance from M1 (3, 4) | Distance from M2 (6, 4) | Assigned Cluster
A1 | (2, 6) | 3 | 6 | Cluster 1
A2 | (3, 8) | 4 | 7 | Cluster 1
A3 | (4, 7) | 4 | 5 | Cluster 1
A4 | (6, 2) | 5 | 2 | Cluster 2
A5 | (6, 4) | 3 | 0 | Cluster 2
A6 | (7, 3) | 5 | 2 | Cluster 2
A7 | (7, 4) | 4 | 1 | Cluster 2
A8 | (8, 5) | 6 | 3 | Cluster 2
A9 | (7, 6) | 6 | 3 | Cluster 2
A10 | (3, 4) | 0 | 3 | Cluster 1
Again, the clusters haven’t changed. Hence, clusters are:
Points in cluster1:{(2, 6), (3, 8), (4, 7), (3, 4)}
Points in cluster 2:{(7,4), (6,2), (6, 4), (7,3), (8,5), (7,6)}
Now, let us again calculate the cost for each cluster and find their sum.
The total cost this time will be 3+4+4+2+0+2+1+3+3+0=22.
The current cost is 22 which is greater than the cost in the previous
iteration i.e. 20. Hence, we will revert the change and the point (7, 4)
will again be made the medoid for cluster 2.
So, the clusters after this iteration will be cluster1 = {(2, 6), (3, 8), (4,
7), (3, 4)} and cluster 2= {(7,4), (6,2), (6, 4), (7,3), (8,5), (7,6)}. The
medoids are (3,4) and (7,4).
This process of replacing medoids with non-medoid data points is repeated. The set of medoids for which the cost is least, together with the associated clusters, is made permanent. So, after all the iterations, you will get the final clusters and their medoids.
The K-Medoids clustering algorithm is a computation-intensive
algorithm that requires many iterations. In each iteration, we need to
calculate the distance between the medoids and the data points, assign
clusters, and compute the cost. Hence, K-Medoids clustering is not well
suited for large data sets.
Advantages of K-Medoids Clustering
❖ K-Medoids clustering is a simple iterative algorithm and is very easy to implement.
❖ K-Medoids clustering is guaranteed to converge. Hence, we are guaranteed to get results when we
perform k-medoids clustering on any dataset.
❖ K-Medoids clustering is not tied to a specific domain. Owing to this generality, we can use k-medoids clustering in machine learning applications ranging from text data to geospatial data and from financial data to e-commerce data.
❖ The medoids in k-medoids clustering are selected randomly. We can choose the initial medoids in
a way such that the number of iterations can be minimized. To improve the performance of the
algorithm, you can warm start the choice of medoids by selecting specific data points as medoids
after data analysis.
❖ Compared to other partitioning clustering algorithms such as K-median and k-modes clustering,
the k-medoids clustering algorithm is faster in execution.
❖ K-Medoids clustering algorithm is very robust and it effectively deals with outliers and noise in
the dataset. Compared to k-means clustering, the k-medoids clustering algorithm is a better choice
for analyzing data with significant noise and outliers.
Disadvantages of K-Medoids Clustering
❖ K-Medoids clustering is not a scalable algorithm. For large datasets, the execution time of the
algorithm becomes very high. Due to this, you cannot use the K-Medoids algorithm for very large
datasets.
❖ In K-Medoids clustering, we don’t know the optimal number of clusters. Hence, we need to perform
clustering using different values of k to find the optimal number of clusters.
❖ When the dimensionality in the dataset increases, the distance between the data points starts to
become similar. Due to this, the distance between a data point and various medoids becomes almost
the same. This introduces inefficiency in the execution. To overcome this problem, you can use
advanced clustering algorithms like spectral clustering. Alternatively, you can also try to reduce the
dimensionality of the dataset while data preprocessing.
❖ The K-Medoids clustering algorithm randomly chooses the initial medoids, and the output clusters depend strongly on these initial medoids. Due to this, every time you run the k-medoids clustering algorithm you may get different clusters.
❖ The K-Medoids clustering algorithm uses the distance from a central point (medoid). Due to this,
the k-medoids algorithm gives circular/spherical clusters. This algorithm is not useful for clustering
data into arbitrarily shaped clusters.
What Exactly is DBSCAN Clustering?
It groups ‘densely grouped’ data points into a single cluster. It can identify clusters in large
spatial datasets by looking at the local density of the data points. The most exciting
feature of DBSCAN clustering is that it is robust to outliers. It also does not require
the number of clusters to be told beforehand, unlike K-Means, where we have to specify the
number of centroids.
DBSCAN requires only two parameters: epsilon and minPoints. Epsilon is the radius of the
circle to be created around each data point to check the density and minPoints is the
minimum number of data points required inside that circle for that data point to be classified
as a Core point.
In higher dimensions the circle becomes a hypersphere, epsilon becomes the radius of that
hypersphere, and minPoints is the minimum number of data points required inside that
hypersphere.
DBSCAN Clustering
DBSCAN creates a circle of epsilon radius around every data point and classifies them
into Core point, Border point, and Noise. A data point is a Core point if the circle around it
contains at least ‘minPoints’ number of points. If the number of points is less than minPoints,
then it is classified as Border Point, and if there are no other data points around any data
point within the epsilon radius, then it is treated as Noise.
The accompanying figure shows a cluster created by DBSCAN with minPoints = 3. Here, we
draw a circle of equal radius epsilon around every data point. These two parameters help
in creating spatial clusters.
DBSCAN Clustering
All the data points with at least 3 points in the circle including itself are considered
as Core points represented by red color. All the data points with less than 3 but
greater than 1 point in the circle including itself are considered as Border points. They
are represented by yellow color. Finally, data points with no point other than itself
present inside the circle are considered as Noise represented by the purple color.
For locating data points in space, DBSCAN uses Euclidean distance, although other
methods can also be used (like great-circle distance for geographical data). It needs to scan
through the entire dataset only once, whereas many other algorithms require multiple passes.
Reachability and Connectivity
These are the two concepts that you need to understand before moving further. Reachability
states if a data point can be accessed from another data point directly or indirectly, whereas
Connectivity states whether two data points belong to the same cluster or not. In terms of
reachability and connectivity, two points in DBSCAN can be referred to as:
•Directly Density-Reachable
•Density-Reachable
•Density-Connected
The value of minPoints should be at least one greater than the number of dimensions of the
dataset, i.e.,
minPoints>=Dimensions+1.
It does not make sense to take minPoints as 1 because it will result in each point being a
separate cluster. Therefore, it must be at least 3. Generally, it is twice the dimensions. But
domain knowledge also decides its value.
If the value of epsilon chosen is too small then a higher number of clusters will be created,
and more data points will be taken as noise. Whereas, if chosen too big then various small
clusters will merge into a big cluster, and we will lose details.
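A minimal sketch of running DBSCAN with these two parameters using scikit-learn on hypothetical data; in scikit-learn, epsilon is the eps argument and minPoints is min_samples:

import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical data: two dense blobs plus one far-away point
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.3, size=(20, 2)),
    rng.normal(loc=[5, 5], scale=0.3, size=(20, 2)),
    [[10.0, -10.0]],                              # isolated point, expected to come out as noise
])

db = DBSCAN(eps=0.8, min_samples=4).fit(X)        # eps = epsilon, min_samples = minPoints
print(db.labels_)                                 # cluster id per point; -1 marks noise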
Hierarchical Clustering
Dendrogram: A Useful Tool for Summarizing Similarity Measurements
The similarity between two objects in a dendrogram is represented as the height of the lowest internal node they share.
Hierarchical Clustering
The number of dendrograms with n leaves = (2n - 3)! / [2^(n-2) (n - 2)!]

Number of Leaves | Number of Possible Dendrograms
2 | 1
3 | 3
4 | 15
5 | 105
... | ...
10 | 34,459,425

Since we cannot test all possible trees, we have to use a heuristic search over the possible trees. One way to do this:
Bottom-Up (agglomerative): starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together.
Bottom-Up (agglomerative): starting with each item in its own cluster, find the best pair to merge into a new cluster, and repeat until all clusters are fused together. At every step, consider all possible merges and choose the best one.
Hierarchical Agglomerative Clustering: Linkage Methods
• Single linkage: the distance between two clusters is the minimum distance between a point in one cluster and a point in the other.
• Complete linkage: the distance between two clusters is the maximum distance between a point in one cluster and a point in the other.
• Average linkage: the distance between two clusters is the average distance between points in one cluster and points in the other.
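A minimal sketch of agglomerative clustering with the three linkage methods above, using SciPy on hypothetical points:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1, 1], [1.5, 1], [5, 5], [5.5, 5.2], [9, 1]], dtype=float)   # hypothetical points

for method in ("single", "complete", "average"):       # minimum, maximum, average distance
    Z = linkage(X, method=method)                       # the merge history behind the dendrogram
    labels = fcluster(Z, t=2, criterion="maxclust")     # cut the tree into 2 flat clusters
    print(method, labels)

# scipy.cluster.hierarchy.dendrogram(Z) would draw the tree itself (requires matplotlib)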
Hierarchical Clustering (Agglomerative): Example
Quality: What Is Good Clustering?
Measure the Quality of Clustering
• Dissimilarity/Similarity metric
• Similarity is expressed in terms of a distance function,
typically metric: d(i, j)
• The definitions of distance functions are usually rather different for interval-scaled, boolean, categorical, ordinal, ratio, and vector variables
• Weights should be associated with different variables
based on applications and data semantics
• Quality of clustering:
• There is usually a separate “quality” function that
measures the “goodness” of a cluster.
• It is hard to define “similar enough” or “good enough”
• The answer is typically highly subjective
Considerations for Cluster Analysis
• Partitioning criteria
• Single level vs. hierarchical partitioning (often, multi-level hierarchical partitioning is
desirable)
• Separation of clusters
• Exclusive (e.g., one customer belongs to only one region) vs. non-exclusive (e.g., one
document may belong to more than one class)
• Similarity measure
• Distance-based (e.g., Euclidean, road network, vector) vs. connectivity-based (e.g., density or
contiguity)
• Clustering space
• Full space (often when low dimensional) vs. subspaces (often in high-dimensional clustering)
Requirements and Challenges
• Scalability
• Clustering all the data instead of only on samples
• Ability to deal with different types of attributes
• Numerical, binary, categorical, ordinal, linked, and mixture of these
• Constraint-based clustering
• User may give inputs on constraints
• Use domain knowledge to determine input parameters
• Interpretability and usability
• Others
• Discovery of clusters with arbitrary shape
• Ability to deal with noisy data
• Incremental clustering and insensitivity to input order
• High dimensionality
Extensions to Hierarchical Clustering
• Major weakness of agglomerative clustering methods
• Can never undo what was done previously
• Do not scale well: time complexity of at least O(n²), where n is the number of total objects
CURE(Clustering Using Representatives)
BIRCH (Balanced Iterative Reducing and
Clustering Using Hierarchies)
• Zhang, Ramakrishnan & Livny, SIGMOD’96
• Incrementally construct a CF (Clustering Feature) tree, a hierarchical data
structure for multiphase clustering
• Phase 1: scan DB to build an initial in-memory CF tree (a multi-level
compression of the data that tries to preserve the inherent clustering
structure of the data)
• Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of
the CF-tree
• Scales linearly: finds a good clustering with a single scan and improves the
quality with a few additional scans
• Weakness: handles only numeric data, and sensitive to the order of the data
record
Clustering Feature Vector in BIRCH
Clustering Feature (CF): CF = (N, LS, SS)
N: number of data points
LS: linear sum of the N points (LS = Σ Xi)
SS: square sum of the N points (SS = Σ Xi²)
Example: for the five points (3, 4), (2, 6), (4, 5), (4, 7), (3, 8),
CF = (5, (16, 30), (54, 190))
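The CF for the five points above can be computed directly (a small sketch):

import numpy as np

points = np.array([(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)], dtype=float)

N = len(points)                  # number of data points
LS = points.sum(axis=0)          # linear sum of the points
SS = (points ** 2).sum(axis=0)   # square sum of the points

print("CF =", (N, tuple(LS), tuple(SS)))   # (5, (16.0, 30.0), (54.0, 190.0))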
CF-Tree in BIRCH
• Clustering feature:
• Summary of the statistics for a given subcluster: the 0-th,
1st, and 2nd moments of the subcluster from the statistical
point of view
• Registers crucial measurements for computing cluster and
utilizes storage efficiently
• A CF tree is a height-balanced tree that stores the clustering
features for a hierarchical clustering
• A nonleaf node in a tree has descendants or “children”
• The nonleaf nodes store sums of the CFs of their children
• A CF tree has two parameters
• Branching factor: max # of children
• Threshold: max diameter of sub-clusters stored at the leaf
nodes
The CF Tree Structure
The root and non-leaf nodes contain entries of the form [CFi, childi] (e.g., CF1/child1, CF2/child2, CF3/child3, …, CF5/child5), where each CFi summarizes the subtree rooted at childi.
The Birch Algorithm
• Cluster Diameter
Fuzzy Set and Fuzzy Cluster
• Clustering methods discussed so far
• Every data object is assigned to exactly one cluster
• Some applications may need for fuzzy or soft cluster assignment
• Ex. An e-game could belong to both entertainment and software
• Methods: fuzzy clusters and probabilistic model-based clusters
• Fuzzy cluster: A fuzzy set S: FS : X → [0, 1] (value between 0 and 1)
• Example: Popularity of cameras is defined as a fuzzy mapping
• Strength
• Mixture models are more general than partitioning and fuzzy clustering
• Clusters can be characterized by a small number of parameters
• The results may satisfy the statistical assumptions of the generative
models
• Weakness
• Converge to local optimal (overcome: run multi-times w. random
initialization)
• Computationally expensive if the number of distributions is large, or
the data set contains very few observed data points
• Need large data sets
• Hard to estimate the number of clusters
Probabilistic Model-Based Clustering
• Cluster analysis is to find hidden categories.
• A hidden category (i.e., probabilistic cluster) is a distribution over the data
space, which can be mathematically represented using a probability density
function (or distribution function).
■ Ex. 2 categories for digital cameras sold
■ consumer line vs. professional line
■ density functions f1, f2 for C1, C2, obtained by probabilistic clustering
■ A mixture model assumes that a set of observed objects is a mixture of instances from multiple probabilistic clusters, and conceptually each observed object is generated independently
■ Our task: infer a set of k probabilistic clusters that is most likely to have generated D using the above data generation process
Model-Based Clustering
• A set C of k probabilistic clusters C1, …,Ck with probability density functions f1, …, fk,
respectively, and their probabilities ω1, …, ωk.
Univariate Gaussian Mixture Model
• O = {o1, …, on} (n observed objects), Θ = {θ1, …, θk} (parameters of the k distributions), and Pj(oi | θj) is the probability that oi is generated from the j-th distribution using parameter θj. Then the probability that oi is generated by the mixture is
P(oi | Θ) = Σ_{j=1..k} ωj Pj(oi | θj),
and the probability that the whole data set O is generated is P(O | Θ) = Π_{i=1..n} Σ_{j=1..k} ωj Pj(oi | θj).
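A minimal sketch of fitting such a mixture with the EM algorithm (discussed next), using scikit-learn's GaussianMixture on hypothetical 1-D data:

import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical 1-D observations drawn from two Gaussian components
rng = np.random.default_rng(0)
O = np.concatenate([rng.normal(0, 1, 200), rng.normal(6, 1.5, 300)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(O)   # EM runs inside fit()
print("weights:", gmm.weights_)         # estimated mixture probabilities w_j
print("means  :", gmm.means_.ravel())   # estimated component means
print("labels :", gmm.predict(O[:5]))   # most likely component for the first objects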
The EM (Expectation Maximization) Algorithm
Human Tumor Microarray Data
Overview of Tissue
Microarray Technology
Tissue microarray (TMA) is an innovative technology that allows for high-
throughput analysis of tissue samples. In a TMA, hundreds of tiny tissue
cores are precisely arrayed on a single paraffin block, enabling
simultaneous immunohistochemical or in situ hybridization
analysis of multiple patient samples on a single slide.
Role of Machine Learning
in Tumor Microarray
Analysis
Machine learning algorithms have become invaluable tools for analyzing
the vast amounts of data generated by tumor microarray technologies.
These powerful techniques can uncover complex patterns and
relationships within the data, enabling researchers to better understand
tumor biology and develop more targeted therapies.
Algorithms for Tumor Microarray
Analysis
Image Segmentation
Applying image segmentation algorithms to isolate individual tissue cores and
separate them from the background.
Feature Extraction
Extracting quantitative features from the tissue cores, such as staining
intensity, texture, and morphological properties.
Supervised Classification
Using machine learning algorithms like decision trees, random forests, and
support vector machines to classify tumor samples into different subtypes or
grades.
Image Processing and Feature
Extraction
Image processing plays a crucial role in
analyzing tissue microarray data. Key steps
include image segmentation, cell detection,
and feature extraction.
Sophisticated algorithms extract quantitative
measurements from the images, such as cell
counts, protein expression levels, and spatial
distributions.
These image-derived features serve as
inputs to machine learning models for tumor
classification and subtyping.
Supervised Learning Techniques for Tumor Classification
1. Logistic Regression: predicts tumor type based on features.
2. Support Vector Machines: separate tumor classes with decision boundaries learned from labeled data.
3. Random Forests: ensemble of decision trees for robust classification.
Supervised learning algorithms are powerful tools for classifying tumor types based on the
microarray data. Logistic regression, support vector machines, and random forests are commonly
used techniques that can accurately predict tumor subtypes by learning from labeled training data.
These models excel at identifying complex patterns in the high-dimensional tumor profiles.
Unsupervised Clustering
Algorithms for Tumor Subtyping
Unsupervised clustering algorithms play a crucial role in identifying distinct tumor subtypes within
a heterogeneous tumor microarray dataset. These algorithms can uncover hidden patterns and
groupings without relying on pre-defined labels or classes.
Quantization Function
A quantization function maps each input vector to the nearest vector in a small codebook, so that many distinct inputs are replaced by a few representative representations.
Advantages
• Data compression: Vector Quantization can achieve significant data
compression with minimal loss of information, making it suitable for
applications like image and audio compression.
• Noise reduction: Vector Quantization can help reduce noise in the data by
replacing individual data points with representative codebook vectors,
leading to smoother and more robust representations.
Limited Adaptability to Data Changes: Once the codebook is trained, it remains fixed and
may not adapt well to changes in the input data distribution over time. This lack of
adaptability may result in suboptimal quantization performance when the data distribution
shifts or evolves.
High Computational Complexity: Depending on the size of the dataset and the dimensionality of the feature space, vector quantization algorithms can be computationally intensive, especially during the codebook training phase. This high computational complexity may limit the scalability of VQ algorithms to large datasets and high-dimensional feature spaces.
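A minimal sketch of vector quantization as described above, training the codebook with k-means and replacing each vector by its nearest codebook entry (hypothetical data; the codebook size is an arbitrary choice):

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))                    # hypothetical vectors to be compressed

codebook_size = 16
km = KMeans(n_clusters=codebook_size, n_init=10, random_state=0).fit(X)   # codebook training
codebook = km.cluster_centers_

indices = km.predict(X)               # each vector is stored as a small codebook index
reconstructed = codebook[indices]     # decoding: look up the representative vector

print("mean squared distortion:", np.mean(np.sum((X - reconstructed) ** 2, axis=1)))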
Self Organizing Maps – Kohonen Maps
How does SOM work?
• Let’s say an input data of size (m, n) where m is the number of training examples
and n is the number of features in each example.
• First, it initializes the weights of size (n, C) where C is the number of clusters.
• Then, iterating over the input data, for each training example it updates the winning vector (the weight vector with the shortest distance, e.g. Euclidean distance, from the training example). The weight update rule is given by
wij = wij(old) + alpha(t) * (xik - wij(old))
• where alpha is a learning rate at time t, j denotes the winning vector, i denotes the
ith feature of training example and k denotes the kth training example from the input
data.
• After training the SOM network, trained weights are used for clustering new
examples. A new example falls in the cluster of winning vectors.
Algorithm
• Training:
• Step 1: Initialize the weights wij (small random values may be used). Initialize the learning rate α.
• Step 2: Calculate the squared Euclidean distance:
• D(j) = Σ (wij – xi)^2, where i = 1 to n and j = 1 to m
• Step 3: Find the index J for which D(j) is minimum; this is the winning unit.
• Step 4: For each unit j within a specific neighborhood of J, and for all i, calculate the new weight:
• wij(new) = wij(old) + α[xi – wij(old)]
• Step 5: Update the learning rate using:
• α(t+1) = 0.5 * α(t)
• Step 6: Test the Stopping Condition.
• Output:
Test Sample s belongs to Cluster : 0
Trained weights : [[0.6000000000000001, 0.8, 0.5, 0.9],
[0.3333984375, 0.0666015625, 0.7, 0.3]]
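The output above comes from an implementation that is not included in these notes. Below is a minimal sketch of the training loop described by the algorithm (weights of size (n, C), winner chosen by Euclidean distance, update wij = wij + α(xi - wij), learning rate halved each epoch); the training data and cluster count here are hypothetical, so the printed weights will differ from those shown above.

import numpy as np

def train_som(X, n_clusters, alpha=0.5, epochs=100, seed=0):
    n_features = X.shape[1]
    rng = np.random.default_rng(seed)
    W = rng.random((n_features, n_clusters))              # weights of size (n, C)
    for _ in range(epochs):
        for x in X:
            D = ((W - x[:, None]) ** 2).sum(axis=0)       # squared Euclidean distance D(j)
            J = D.argmin()                                # winning unit
            W[:, J] += alpha * (x - W[:, J])              # w_iJ(new) = w_iJ(old) + alpha (x_i - w_iJ)
        alpha *= 0.5                                      # decay the learning rate
    return W

X = np.array([[1, 1, 0, 0], [0, 0, 0, 1], [1, 0, 0, 0], [0, 0, 1, 1]], dtype=float)
W = train_som(X, n_clusters=2)
s = np.array([0, 0, 0, 1], dtype=float)                   # test sample
print("Test sample belongs to cluster:", ((W - s[:, None]) ** 2).sum(axis=0).argmin())
print("Trained weights:", W.T)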
Self Organization Maps
• The Self-Organizing Map is one of the most popular neural models. It belongs to the category of competitive learning networks.
• The SOM is based on unsupervised learning, which means that no human intervention is needed during training and little needs to be known about the characteristics of the input data.
• We could, for example, use the SOM for clustering of the input data. The SOM can be used to detect features inherent to the problem and has therefore also been called SOFM, the Self-Organizing Feature Map.
• The Self-Organizing Map was developed by Professor Kohonen and is used in many applications.
• The purpose of SOM is to provide a data visualization technique that helps to understand high-dimensional data by reducing the dimension of the data to a map.
• SOM also represents the clustering concept by grouping similar data together.
• Therefore it can be said that the Self-Organizing Map reduces data dimension and displays similarity among data.
SOM
• A SOM does not need a target output to be specified, unlike many other types of
networks.
• Instead, where the node weights match the input vector, that area of the lattice is
selectively optimized to more closely resemble the data for the class the input
vector is a member of.
• From an initial distribution of random weights, and over many iterations, the SOM
eventually settles into a map of stable zones.
• Each zone is effectively a feature classifier, so you can think of the graphical
output as a type of feature map of the input space.
SOM
1. Each node's weights are initialized.
2. A vector is chosen at random from the set of training data and presented to the lattice.
3. Every node is examined to find the one whose weights are most like the input vector; the winning node is known as the Best Matching Unit (BMU).
4. The radius of the neighborhood of the BMU is now calculated. This value
starts large, typically set to the ‘radius’ of the lattice, but diminishes each time
step. Any nodes within this radius are deemed inside the BMU’s
neighborhood.
5. Each neighboring node’s (the nodes found in step 4) weights are adjusted
to make them more like the input vector. The closer a node is to the BMU; the
more its weights get altered.
6. Repeat step 2 for N iterations.
SOM Example:
Node number 3 is the closest, with a distance of 0.4. We will call this node our BMU (best-matching unit).
Spectral clustering:
■ Combining feature extraction and clustering (i.e., use the spectrum of the similarity
matrix of the data to perform dimensionality reduction for clustering in fewer
dimensions)
■ Normalized Cuts (Shi and Malik, CVPR’97 or PAMI’2000)
■ The Ng-Jordan-Weiss algorithm (NIPS’01)
Spectral Clustering:
The Ng-Jordan-Weiss (NJW) Algorithm
• Given a set of objects o1, …, on, and the distance between each pair of objects, dist(oi, oj), find the desired number k of
clusters
• Calculate an affinity matrix W, where Wij = exp(-dist(oi, oj)² / (2σ²)) and σ is a scaling parameter that controls how fast the affinity Wij decreases as dist(oi, oj) increases. In NJW, the diagonal entries are set to Wii = 0
• Derive a matrix A = f(W). NJW defines a matrix D to be a diagonal matrix s.t. Dii is the sum of the i-th row of W, i.e., Dii = Σj Wij. Then, A is set to A = D^(-1/2) W D^(-1/2)
• A spectral clustering method finds the k leading eigenvectors of A
• A vector v is an eigenvector of matrix A if Av = λv, where λ is the corresponding eigen-value
• Using the k leading eigenvectors, project the original data into the new space defined by the k leading eigenvectors,
and run a clustering algorithm, such as k-means, to find k clusters
• Assign the original data points to clusters according to how the transformed points are assigned in the clusters
obtained
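A minimal NumPy sketch of the NJW steps above (Gaussian affinity with Wii = 0, A = D^(-1/2) W D^(-1/2), k leading eigenvectors, then k-means in the embedded space); the data here is hypothetical, and in practice scikit-learn's SpectralClustering wraps the same idea:

import numpy as np
from sklearn.cluster import KMeans

def njw_spectral(X, k, sigma=1.0):
    # Affinity matrix: W_ij = exp(-dist(o_i, o_j)^2 / (2 sigma^2)), diagonal set to 0
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # A = D^(-1/2) W D^(-1/2), where D_ii is the sum of the i-th row of W
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    A = D_inv_sqrt @ W @ D_inv_sqrt
    # Take the k leading eigenvectors of A and normalise each row
    vals, vecs = np.linalg.eigh(A)
    U = vecs[:, -k:]
    U = U / np.linalg.norm(U, axis=1, keepdims=True)
    # Run k-means in the new space; the labels carry back to the original points
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(U)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(30, 2)) for c in ([0, 0], [4, 4])])
print(njw_spectral(X, k=2))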
Spectral Clustering: Illustration and Comments
Spectral Clustering Matrix Representation
Adjacency and Affinity Matrix (A)
• An Affinity Matrix is like an Adjacency Matrix, except the value for a
pair of points expresses how similar those points are to each other.
If pairs of points are very dissimilar then the affinity should be 0. If
the points are identical, then the affinity might be 1. In this way, the
affinity acts like the weights for the edges on our graph.
Laplacian Matrix (L)
• Laplacian Matrix (L)
• This is another representation of the graph/data points, which
attributes to the beautiful properties leveraged by Spectral
Clustering. One such representation is obtained by subtracting the
Adjacency Matrix from the Degree Matrix (i.e. L = D – A).
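A small sketch computing L = D - A for a hypothetical 4-node graph:

import numpy as np

# Adjacency matrix of a hypothetical 4-node graph (1 = edge, 0 = no edge)
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])

D = np.diag(A.sum(axis=1))    # degree matrix: node degrees on the diagonal
L = D - A                     # graph Laplacian

print(L)
print("eigenvalues:", np.round(np.linalg.eigvalsh(L), 3))   # the smallest is 0; the second is the Fiedler value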
• Spectral Gap: The first non-zero eigenvalue is called the Spectral Gap.
The Spectral Gap gives us some notion of the density of the graph.
• Fiedler Value: The second eigenvalue is called the Fiedler Value, and
the corresponding vector is the Fiedler vector. Each value in the
Fiedler vector gives us information as to which side of the decision
boundary a particular node belongs to.
• Using L, we find the first large gap between eigenvalues which
generally indicates that the number of eigenvalues before this gap is
equal to the number of clusters.
• What is Spectral Clustering and how its work? (mygreatlearning.com)