Module 2- Advanced Machine Learning (18AI72)

CHAPTER 13: CLUSTERING


13.1 Introduction to Clustering
Q. Describe the clustering.
Clustering is the process of grouping together data objects into multiple sets or clusters, so that objects
within a cluster have high similarity as compared to objects outside of it. The similarity is assessed
based on the attributes that describe the objects. Similarity is measured by distance metrics. The
partitioning of clusters is not done by humans. It is done with the help of algorithms. These algorithms
allow us to derive some useful information from the data which was previously unknown. Clustering
is also called data segmentation because it partitions large datasets into groups according to their
similarity.
Clustering can also be used for outlier detection. Outliers are objects which do not fall into any cluster
because of too much dissimilarity with other objects. We can utilize them for special applications like
credit card fraud detection. In credit card transactions, very expensive and infrequent purchases may
be signs of fraudulent cases, and we can apply one more level of security to avoid such transactions.
Clustering is known as unsupervised learning because the class label information is not present.
Supervised learning is called learning by example, while unsupervised learning is called learning by
observation.
13.1.1 Applications of Clustering
Q. Briefly explain the applications of Clustering

Cluster analysis has been widely used in many applications such as business intelligence, Pattern
recognition, image processing, bio-informatics, web technology, search engines, and text mining.
1. Business intelligence: Cluster analysis helps in target marketing, where marketers discover
groups and categorize them based on their purchasing patterns. The information retrieved can
be used for market segmentation, product positioning (i.e., allocating products to specific areas),
new product development, grouping of shopping items, and selecting test markets.
2. Pattern recognition: Here, the clustering methods group similar patterns into clusters whose
members are more similar to each other. In other words, the similarity of members within a
cluster is much higher when compared to the similarity of members outside of it.
3. Image processing: Extracting and understanding information from images is very important
in image processing. The images are initially segmented and the different objects of interest
in them are then identified. This involves division of an image into areas of similar attributes.
It is one of the important and most challenging tasks in image processing where clustering can
be applied. Image processing has applications in many areas such as analysis of remotely
sensed images, traffic system monitoring, and fingerprint recognition.


4. Bioinformatics: In this field, clustering techniques are required to derive plant and animal
taxonomies, categorize genes with similar functionalities, and gain insight into structures
inherent to populations. Biological systematics is another field which involves study of the
diversification of living forms and the relationships among living things through time. The
scientific classification of species can be done on the basis of similar characteristics using
clustering. This field can give more information about both extinct and extant organisms.
5. Web technology: Clustering helps classifying documents on the web for information
delivery.
6. Search engines: The success of Google as a search engine is because of its intensive searching
capabilities. Whenever a query is fired by a user, the search engine provides the result for the
searched data according to the nearest similar objects, which are clustered around the data to be
searched. The speed and accuracy of the retrieved results depend on the
clustering algorithm used. The better the clustering algorithm, the better the chances of getting the
required result first. Hence the definition of a similar object plays a crucial role in getting better
search results.
7. Text mining: Text mining involves the process of extracting high quality information from
text. High quality in text mining refers to relevance, novelty, and
interestingness of the extracted information. It can be used for sentiment analysis and document summarization.

13.1.2 Requirements of Clustering


Q. Explain the Requirements of Clustering

The requirements of clustering can be enumerated and explained as follows:


1. Scalability: A clustering algorithm is considered to be highly scalable if it gives similar results
independent of the size of the database. Generally, clustering on a sample dataset may give
different results compared to a larger dataset. Poor scalability of clustering algorithms leads
to distributed clustering for partitioning large datasets. Some algorithms cluster large-scale
datasets without considering the entire dataset at a time. Data can be randomly divided into
equal-sized disjoint subsets and clustered using a standard algorithm. The centroids of the subsets
are then matched by a centroid correspondence algorithm and combined to form a global
set of centroids.
2. Dealing with different types of attributes: Algorithms are often designed to cluster numeric data.
However, applications may require clustering other data types such as nominal, binary, and
ordinal. Nominal attributes take categorical (non-numeric) values. Binary attributes are of two
types: symmetric binary and asymmetric binary. In symmetric data, both values are
equally important, for example, male and female in a gender attribute. In asymmetric data, both
values are not equally important, for example, pass and fail in a result attribute. The clustering
algorithm should also work for complex data types such as graphs, sequences, images, and
documents.
3. Discovery of clusters with arbitrary shape: Generally, clustering algorithms are designed to
determine spherical clusters. Due to the characteristics and diverse nature of the data used,
clusters may be of arbitrary shapes and can be nested within one another.
Traditional clustering algorithms, such as k-means and k-medoids, fail to detect non-
spherical shapes. Thus, it is important to have clustering algorithms that can detect clusters
of any arbitrary shape.
4. Avoiding domain knowledge to determine input parameters: Many algorithms require
domain knowledge like the desired number of clusters in the form of input. Thus, the
clustering results may become sensitive to the input parameters. Such parameters are often
hard to determine for high dimensionality data. Domain knowledge requirement affects
the quality of clustering and burdens the user.
For example, in k-means algorithm, the metric used to compare results for different values
of k is the mean distance between data points and their cluster centroid. Increasing the
number of clusters will always reduce the distance of data points, to the extreme of
reaching zero when k is the same as the number of data points. Thus, this alone cannot be used.
Instead, to roughly determine k, the mean distance to the centroid is plotted as a function
of k, and the "elbow point", where the rate of decrease sharply shifts, is taken as a rough
estimate of k. This is shown in Fig. 13.2.
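A minimal sketch of this elbow heuristic, assuming scikit-learn and matplotlib are available (the synthetic dataset and the range of k values are illustrative only):

# Plot the within-cluster sum of squared distances against k and look for the
# "elbow point" where the rate of decrease sharply shifts.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)   # illustrative data

ks = range(1, 10)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(list(ks), inertias, marker="o")
plt.xlabel("k (number of clusters)")
plt.ylabel("within-cluster sum of squared distances")
plt.show()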

5. Handling noisy data: Real-world data, which are the input of clustering algorithms, are
mostly affected by noise. This results in poor-quality clusters. Noise is an unavoidable
problem which affects the data collection and data preparation processes. Therefore, the
algorithms we use should be able to deal with noise. There are two types of noise:
i. Attribute noise: implicit errors introduced by measurement tools, for example
those induced by different types of sensors.
ii. Random errors introduced by batch processes or experts when the data is
gathered, for example during document digitization.
6. Incremental clustering: The database used for clustering needs to be updated by adding
new data (incremental updates). Some clustering algorithms cannot incorporate
incremental updates but have to recompute a new clustering from scratch. The algorithms
which can accommodate new data without reconstructing the clusters are called
incremental clustering algorithms. It is more effective to use incremental clustering
algorithms.
7. Insensitivity to input order: Some clustering algorithms are sensitive to the order in
which data objects are entered. Such algorithms are not ideal as we have little idea about
the data objects presented. Clustering algorithms should be insensitive to the input order
of data objects.
8. Handling high-dimensional data: A dataset can contain numerous dimensions or
attributes. Generally, clustering algorithms are good at handling low-dimensional data
such as datasets involving only two or three dimensions. Clustering algorithms which can
handle high-dimensional space are more effective.
9. Handling constraints: Constrained clustering can be considered to contain a set of must-
link constraints, cannot-link constraints, or both. In a must-link constraint, two instances
in the must-link relation should be included in the same cluster. On the other hand, a
cannot-link constraint specifies that the two instances cannot be in the same cluster. These
sets of constraints act as guidelines to cluster the entire dataset. Some constrained
clustering algorithms cancel the clustering process if they cannot form clusters which
satisfy the specified constraints. Others try to minimize the amount of constraint violation
if it is impossible to find a clustering which satisfies the constraints. Constraints can be
used to select a clustering model to follow among different clustering methods. A
challenging task is to find data groups with good clustering behaviour that satisfy
constraints.

10. Interpretability and usability: Users require the clustering results to be interpretable, usable,
and comprehensive. Clustering is always tied to specific semantic
interpretations and applications.
13.2 Types of Clustering
Q. Explain the classification of clustering algorithms

Clustering algorithms can be classified into two main subgroups:


1. Hard clustering: Each data point either belongs to a cluster completely or not.
2. Soft clustering: Instead of putting each data point into a separate cluster, a probability or
likelihood of that data point to be in those clusters is assigned.
Clustering algorithms can also be classified as follows:
1. Partitioning method.
2. Hierarchical method.
3. Density-based method.
4. Grid-based method.
However, the focus in this chapter is on partitioning and hierarchical methods.
13.2.1 Partitioning Method
Partitioning means division. Suppose we are given a database of n objects and we need to partition
this data into k partitions of data. Within a partition there exists some similarity among the items. So
each partition will represent a cluster and k ≤ n. It means that it will classify the data into k groups,
each group contains at least one object and each object must belong to exactly one group. Although
this is the general requirement, in soft clustering an object can belong to two clusters also. Most
partitioning methods are distance-based.
For a given number of partitions (say k), the partitioning method will create an initial partitioning.
Then it uses the iterative relocation technique to improve the partitioning by moving objects from one
group to another. The general criterion of a good partitioning is that objects in the same cluster are
close to each other, whereas objects in different clusters are far from each other. Some other criteria
can be used for judging the quality of partitions.
Partition-based clustering is often computationally expensive and hence most of the methods apply
heuristics, such as a greedy approach which iteratively improves the quality of the clusters but arrives at a
local optimum.
These heuristic clustering methods result in spherical clusters. For complex-shaped clusters and for
large datasets, some extensions are required for partition-based methods.
13.2.2 Hierarchical Method
Hierarchical clustering is an alternative approach to partitioning clustering for identifying groups in
a data-set. It does not require prespecifying the number of clusters to be generated. The result of
hierarchical clustering is a tree-based representation of the objects, which is also known as
dendrogram. Observations can be subdivided into groups by cutting the dendrogram at a desired
similarity level. We classify hierarchical methods on the basis of how the hierarchical decomposition
is formed. There are two approaches:
1. Agglomerative approach: This approach is also known as the bottom-up approach. In this
approach, we start with each object forming a separate group. It keeps on merging the objects
or groups that are close to one another. It keeps on doing so until all of the groups are merged
into one or until the termination condition holds.
2. Divisive approach: This approach is also known as the top-down approach. In this approach,
we start with all of the objects in the same cluster. In each successive iteration, a cluster is split
up into smaller clusters. This is done until each object is in its own cluster or the termination
condition holds. This method is rigid, because once a merging or splitting is done, it can never
be undone.
13.2.3 Density-Based Methods
Density-based clustering algorithms find clusters of nonlinear (arbitrary) shape based on density.
Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is the most widely used density-based
algorithm. It uses the concepts of density reachability and density connectivity.
1. Density reachability: A point "p" is said to be density reachable from a point "q" if "p" is within
distance ε of "q" and "q" has a sufficient number of points within distance ε in its
neighborhood.
2. Density connectivity: Points "p" and "q" are said to be density-connected if there exists a
point which has a sufficient number of points in its neighborhood and both points are within
distance ε of it. This is called the chaining process. So, if "q" is a neighbor of "r", "r" is a neighbor of "s", "s"
is a neighbor of "t", and "t" is a neighbor of "p", this implies that "q" is connected to "p".
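As a hedged illustration, the sketch below applies scikit-learn's DBSCAN to data with non-spherical clusters (the eps and min_samples values are assumptions chosen for this toy dataset):

# eps plays the role of the neighbourhood radius; min_samples is the "sufficient
# number of points" required in that neighbourhood. Label -1 marks noise/outliers.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters found:", n_clusters, "noise points:", int(np.sum(labels == -1)))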
13.2.4 Grid-Based Methods
The grid-based clustering approach differs from the conventional clustering algorithms in that it is
concerned not with data points but with the value space that surrounds the data points. In general, a
typical grid-based clustering algorithm consists of the following five basic steps:

1. Create the grid structure, i.e., partition the data space into a finite number of cells.
2. Calculate the cell density for each cell.
3. Sort the cells according to their densities.
4. Identify cluster centers.
5. Traverse neighbor cells.

13.3 Partitioning Methods of Clustering


The most fundamental clustering method is the partitioning method. This method assumes that we
already know the number of clusters to be formed, which organizes the objects of a set into several
exclusive groups or clusters. If k is the number of clusters to be formed given a dataset D of n objects,
the partitioning algorithm organizes the objects into k partitions (k ≤ n), where each partition represents
a cluster. The objective in this type of partitioning is that the similarity among the data items
within a cluster is higher than with the elements in a different cluster. In other words, intra-cluster
similarity is higher than inter-cluster similarity.

13.3.1 k-Means Algorithm


Q. Explain the k-Means clustering Algorithm and write the flow chart for the same
The most well-known clustering algorithm is k-means. It is easy to understand and implement.
The main concept is to define k cluster centers. The cluster centers should be placed in such a way that
they cover the data points of the entire dataset. The best way to do so is to place the centers as far away
from each other as possible. We can then associate each data point with the nearest cluster center. The
initial grouping of data is completed when there is no data point remaining. Once the grouping is
done, new centroids are computed. These again form clusters based on the new cluster centers. The
process is repeated till no more changes are done, which implies the cluster centers do not change
any more. This algorithm aims at minimizing an objective function known as the squared error function,
which is given by

J(V) = Σ (i = 1 to c) Σ (j = 1 to ci) ( ||xj – vi|| )²

where ||xj – vi|| is the Euclidean distance between data point xj and cluster center vi,
ci is the number of data points in the ith cluster, and
c is the number of cluster centers.
The objective function aims for high intra-cluster similarity and low inter-cluster similarity.
13.3.1.1 Steps in k-Means Clustering Algorithm
Let us study the steps in the k-means clustering algorithm with the following example. Let X = {x1, x2,
x3, x4, ..., xn} be the set of data points and V = {v1, v2, v3, v4, ..., vc} be the set of cluster centers. The steps
are:
1. Randomly select c cluster centers.
2. Calculate the distance between each data point and cluster centers.

3. Assign each data point to the cluster whose center has the minimum distance from it.
4. Recalculate the new cluster centers using

   vi = (1 / ci) Σ (xj in cluster i) xj

   where ci represents the number of data points in the ith cluster.


5. Recalculate the distance between each data point and the newly obtained cluster centers.
6. If no data point was reassigned then stop, otherwise repeat steps 3 to 5.

Flowchart for k-Means Clustering Algorithm
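A minimal NumPy sketch of the steps listed above (an illustrative implementation by the editor, not the one prescribed by the text):

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]          # step 1
    for _ in range(max_iter):
        # steps 2-3: distance of every point to every center, assign to the nearest
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # step 4: recompute each center as the mean of the points assigned to it
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):                        # steps 5-6
            break
        centers = new_centers
    return centers, labels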

The advantages of k-means clustering algorithm are:


1. Fast, robust, and easier to understand.
2. Relatively efficient: The computational complexity of the algorithm is O(tknd), where n is
the number of data objects, k is the number of clusters, d is the number of attributes in each
data object, and t is the number of iterations. Normally, k, t, d « n; that is, the number of
clusters, attributes, and iterations will be very small compared to the number of data objects
in the dataset.
3. Gives best results when the clusters are distinct or well separated from each other.

The disadvantages of k-means clustering algorithm are:


1. It requires prior specification of number of clusters.
2. It is not able to cluster highly overlapping data.
3. It is not invariant to nonlinear transformations, that is, with different representations of the data we get
different results (data represented in the form of Cartesian coordinates and polar coordinates will
give different results).
4. It provides only a local optimum of the squared error function.
5. A random choice of the initial cluster centers may not lead to a fruitful result.
6. It is applicable only when the mean is defined, that is, it fails for categorical data.
7. Unable to handle noisy data and outliers.

13.3.1.1.1 k-Means Solved Examples in One-Dimensional Data


Q. Apply k-means algorithm in given data for k = 3 (that is, 3 clusters). Use C1(2), C2(16), and
C3(38), as the initial cluster centers. Data: 2, 4, 6, 3, 31, 12, 15, 16, 38, 35, 14, 21, 23, 25, 30.
Solution:
The initial cluster centers are given as C1(2), C2(16), and C3(38). Calculating the distance between
each data point and cluster centers, we get the following table.
Iteration 1
Data Point | Distance from C1(2) | Distance from C2(16) | Distance from C3(38) | Minimum distance | Assigned cluster
2 | (2 - 2)² = 0 | (2 - 16)² = 196 | (2 - 38)² = 1296 | 0 | C1
4 | (4 - 2)² = 4 | (4 - 16)² = 144 | (4 - 38)² = 1156 | 4 | C1
6 | (6 - 2)² = 16 | (6 - 16)² = 100 | (6 - 38)² = 1024 | 16 | C1
3 | (3 - 2)² = 1 | (3 - 16)² = 169 | (3 - 38)² = 1225 | 1 | C1
31 | (31 - 2)² = 841 | (31 - 16)² = 225 | (31 - 38)² = 49 | 49 | C3
12 | (12 - 2)² = 100 | (12 - 16)² = 16 | (12 - 38)² = 676 | 16 | C2
15 | (15 - 2)² = 169 | (15 - 16)² = 1 | (15 - 38)² = 529 | 1 | C2
16 | (16 - 2)² = 196 | (16 - 16)² = 0 | (16 - 38)² = 484 | 0 | C2
38 | (38 - 2)² = 1296 | (38 - 16)² = 484 | (38 - 38)² = 0 | 0 | C3
35 | (35 - 2)² = 1089 | (35 - 16)² = 361 | (35 - 38)² = 9 | 9 | C3
14 | (14 - 2)² = 144 | (14 - 16)² = 4 | (14 - 38)² = 576 | 4 | C2
21 | (21 - 2)² = 361 | (21 - 16)² = 25 | (21 - 38)² = 289 | 25 | C2
23 | (23 - 2)² = 441 | (23 - 16)² = 49 | (23 - 38)² = 225 | 49 | C2
25 | (25 - 2)² = 529 | (25 - 16)² = 81 | (25 - 38)² = 169 | 81 | C2
30 | (30 - 2)² = 784 | (30 - 16)² = 196 | (30 - 38)² = 64 | 64 | C3

By assigning the data points to the cluster center whose distance from it is minimum of all the cluster
centers, we get the following table.
C1 (m1 = 2): {2, 3, 4, 6}
C2 (m2 = 16): {12, 14, 15, 16, 21, 23, 25}
C3 (m3 = 38): {30, 31, 35, 38}
New cluster centers:
m1 = (2 + 3 + 4 + 6)/4 = 3.75
m2 = (12 + 14 + 15 + 16 + 21 + 23 + 25)/7 = 18
m3 = (30 + 31 + 35 + 38)/4 = 33.50

Similarly, using the new cluster centers we can calculate the distance from each data point and allocate
clusters based on the minimum distance.
Iteration 2
Data Point | Distance from C1(3.75) | Distance from C2(18) | Distance from C3(33.50) | Minimum distance | Assigned cluster
2 | (2 - 3.75)² = 3.06 | (2 - 18)² = 256 | (2 - 33.50)² = 992.25 | 3.06 | C1
4 | (4 - 3.75)² = 0.06 | (4 - 18)² = 196 | (4 - 33.50)² = 870.25 | 0.06 | C1
6 | (6 - 3.75)² = 5.06 | (6 - 18)² = 144 | (6 - 33.50)² = 756.25 | 5.06 | C1
3 | (3 - 3.75)² = 0.56 | (3 - 18)² = 225 | (3 - 33.50)² = 930.25 | 0.56 | C1
31 | (31 - 3.75)² = 742.56 | (31 - 18)² = 169 | (31 - 33.50)² = 6.25 | 6.25 | C3
12 | (12 - 3.75)² = 68.06 | (12 - 18)² = 36 | (12 - 33.50)² = 462.25 | 36 | C2
15 | (15 - 3.75)² = 126.56 | (15 - 18)² = 9 | (15 - 33.50)² = 342.25 | 9 | C2
16 | (16 - 3.75)² = 150.06 | (16 - 18)² = 4 | (16 - 33.50)² = 306.25 | 4 | C2
38 | (38 - 3.75)² = 1173.06 | (38 - 18)² = 400 | (38 - 33.50)² = 20.25 | 20.25 | C3
35 | (35 - 3.75)² = 976.56 | (35 - 18)² = 289 | (35 - 33.50)² = 2.25 | 2.25 | C3
14 | (14 - 3.75)² = 105.06 | (14 - 18)² = 16 | (14 - 33.50)² = 380.25 | 16 | C2
21 | (21 - 3.75)² = 297.56 | (21 - 18)² = 9 | (21 - 33.50)² = 156.25 | 9 | C2
23 | (23 - 3.75)² = 370.56 | (23 - 18)² = 25 | (23 - 33.50)² = 110.25 | 25 | C2
25 | (25 - 3.75)² = 451.56 | (25 - 18)² = 49 | (25 - 33.50)² = 72.25 | 49 | C2
30 | (30 - 3.75)² = 689.06 | (30 - 18)² = 144 | (30 - 33.50)² = 12.25 | 12.25 | C3

C1 (m1 = 3.75): {2, 3, 4, 6}
C2 (m2 = 18): {12, 14, 15, 16, 21, 23, 25}
C3 (m3 = 33.50): {30, 31, 35, 38}
New cluster centers:
m1 = (2 + 3 + 4 + 6)/4 = 3.75
m2 = (12 + 14 + 15 + 16 + 21 + 23 + 25)/7 = 18
m3 = (30 + 31 + 35 + 38)/4 = 33.50

It is found that there is no difference in the clusters formed and the cluster centers from iterations 1 and 2,
and hence we stop the procedure.
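The same result can be reproduced with scikit-learn, assuming it is installed; the initial centers 2, 16, 38 are passed explicitly so that the run matches the tables above:

import numpy as np
from sklearn.cluster import KMeans

data = np.array([2, 4, 6, 3, 31, 12, 15, 16, 38, 35, 14, 21, 23, 25, 30], dtype=float).reshape(-1, 1)
init = np.array([[2.0], [16.0], [38.0]])

km = KMeans(n_clusters=3, init=init, n_init=1).fit(data)
print(km.cluster_centers_.ravel())   # approximately [3.75, 18.0, 33.5]
print(km.labels_)                    # cluster index assigned to each data point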
Q. Apply k-means algorithm in given data for k = 2 (that is, 2 clusters). Use C1(80), and C2(250),
as the initial cluster centers. Data: 234,123, 456, 23, 34, 56, 78, 90, 150, 116, 117, 118, 199.
Iteration -1
Data Point | Distance from C1(80) | Distance from C2(250) | Minimum distance | Assigned cluster

234 (234 - 80)2 = 23,716 (234 - 250 )2 = 256 256 Cluster-2


123 (123 - 80)2 = 1,849 (123 - 250)2 = 16,129 1,849 C1
456 (456 - 80)2 = 1,41,376 (456 - 250 )2 = 42,436 42,436 C2

23 (23- 80)2 = 3,249 (23 - 250) 2 = 51,529 3,249 C1

34 (34 - 80)2 = (34 - 250)2 = C1

56 (56- 80)2 = (56- 250)2 = C1

78 (78- 80)2 = (78- 250)2 = C1


90 (90- 80)2 = (90- 250 )2 = C1

150 (150- 80)2 = (150- 250)2 = C2

116 (116- 80)2 = (116- 250)2 = C1


117 (117- 80)2 = (117- 250)2 = C1

118 (118- 80)2 = (118- 250)2 = C1

199 (199- 80)2 = (199- 250)2 = C2

C1 (m1 = 80): {123, 23, 34, 56, 78, 90, 116, 117, 118}
C2 (m2 = 250): {150, 199, 234, 456}
New cluster centers: m1 = 83.9, m2 = 259.75

Iteration -2
Data Point | Distance from C1(83.9) | Distance from C2(259.75) | Minimum distance | Assigned cluster

234 (234 - 83.9)2 = 22,530.01 (234 - 259.75)2 = 663.06 663.06 C2


123 (123 - 83.9)2 = (123 - 259.75)2 = C1
456 (456 - 83.9 )2 = (456 - 259.75)2 = C2

23 (23- 83.9)2 = (23 - 259.75) 2 = C1

34 (34 - 83.9)2 = (34 - 259.75)2 = C1

56 (56- 83.9)2 = (56- 259.75)2 = C1

78 (78- 83.9)2 = (78- 259.75)2 = C1


90 (90- 83.9 )2 = (90- 259.75)2 = C1

150 (150 - 83.9)2 = (150 - 259.75)2 = C1

116 (116- 83.9)2 = (116- 259.75)2 = C1


117 (117- 83.9)2 = (117- 259.75)2 = C1

118 (118- 83.9)2 = (118- 259.75)2 = C1

199 (199- 83.9)2 = (199- 259.75)2 = C2

C1 (m1 = 83.9): {123, 23, 34, 56, 78, 90, 116, 117, 118, 150}
C2 (m2 = 259.75): {199, 234, 456}
New cluster centers: m1 = 90.5, m2 = 296.3

Iteration -3
Data Point | Distance from C1(90.5) | Distance from C2(296.3) | Minimum distance | Assigned cluster

234 (234 - 90.5)2 = 20,592.25 (234 - 296.3)2 = 3,881.29 3,881.29 Cluster-2


123 (123 - 90.5)2 = (123 - 296.3)2 = C1
456 (456 - 90.5)2 = (456 - 296.3)2 = C2

23 (23- 90.5)2 = (23 - 296.3) 2 = C1

34 (34 - 90.5)2 = (34 - 296.3)2 = C1

56 C1

78 C1
90 C1

150 C1

116 C1
117 C1

118 C1

199 C2

C1 (m1 = 90.5): {123, 23, 34, 56, 78, 90, 116, 117, 118, 150}
C2 (m2 = 296.3): {199, 234, 456}
New cluster centers: m1 = 90.5, m2 = 296.3

Since there is no change in the cluster centers from iteration 2 to iteration 3, we stop the procedure.

13.3.1.1.2 k-Means Solved Examples in Two-Dimensional Data

Q. Apply k-means clustering for the datasets given in Table 13.1 for two clusters. Tabulate all
the assignments.
Sample No. X Y
1 185 72
2 170 56
3 168 60
4 179 68
5 182 72
6 188 77

Solution:
Sample No. X Y Assignment
1 185 72 C1
2 170 56 C2

Centroid: C1 = (185, 72) and C2 = (170, 56)


First Iteration:
Distance from C1 is the Euclidean distance between (185, 72) and (168, 60) = 20.808
Distance from C2 is the Euclidean distance between (170, 56) and (168, 60) = 4.472

Since C2 is closer to (168, 60), the sample belongs to C2


Sample No. X Y Assignment
1 185 72 C1
2 170 56 C2
3 168 60 C2
4 179 68
5 182 72
6 188 77

Similarly,
1. Distance from C1(185, 72) for (179, 68) = 7.21
Distance from C2(170, 56) for (179, 68) = 15

Since C1 is closer to (179, 68), the sample belongs to C1.


2. Distance from C1(185, 72) for (182, 72) = 3
Distance from C2(170, 56) for (182, 72) = 20
Since C1 is closer to (182, 72), the sample belongs to C1.
3. Distance from C1(185, 72) for (188, 77) = 5.83
Distance from C2(170, 56) for (188, 77) = 27.66
Since C1 is closer to (188, 77), the sample belongs to C1.
Sample No. X Y Assignment
1 185 72 Cluster-1
2 170 56 C2
3 168 60 C2

4 179 68 C1
5 182 72 C1
6 188 77 C1

The new centroid for C1 is


((185 + 179 + 182 + 188)/4, (72 + 68 + 72 + 77)/4) = (183.5, 72.25)
The new centroid for C2 is
((170 + 168)/2, (56 + 60)/2) = (169, 58)
Second Iteration:
Distance from C1 is the Euclidean distance between (183.5, 72.25) and (168, 60) = 19.76
Distance from C2 is the Euclidean distance between (169, 58) and (168, 60) = 2.24
Since C2 is closer to (168, 60), the sample belongs to C2.
Similarly,
1. Distance from C1(183.5, 72.25) for (179, 68) = 6.19
Distance from C2(169, 58) for (179, 68) = 14.14
Since C1 is closer to (179, 68), the sample belongs to C1.
2. Distance from C1(183.5, 72.25) for (182, 72) = 1.52
Distance from C2(169, 58) for (182, 72) = 19.10
Since C1 is closer to (182, 72), the sample belongs to C1.
3. Distance from C1(183.5, 72.25) for (188, 77) = 6.54
Distance from C2(169, 58) for (188, 77) = 26.87
Since C1 is closer to (188, 77), the sample belongs to C1.
Sample No. X Y Assignment
1 185 72 C1
2 170 56 C2
3 168 60 C2
4 179 68 C1
5 182 72 C1
6 188 77 C1

After the second iteration, the assignments have not changed, and hence the algorithm is stopped and the
points are clustered.
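A small NumPy check of this two-dimensional example (an editor's sketch, not part of the original solution): starting from centroids (185, 72) and (170, 56), two assignment/update passes reproduce the final clusters.

import numpy as np

X = np.array([[185, 72], [170, 56], [168, 60], [179, 68], [182, 72], [188, 77]], dtype=float)
centroids = np.array([[185.0, 72.0], [170.0, 56.0]])

for _ in range(2):                     # assignments stop changing after the second pass
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    centroids = np.array([X[labels == j].mean(axis=0) for j in range(2)])

print(labels)       # [0 1 1 0 0 0] -> samples 1, 4, 5, 6 in C1 and samples 2, 3 in C2
print(centroids)    # [[183.5, 72.25], [169.0, 58.0]]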

13.3.2 The k-Medoids


Q. Explain the k- Medoids clustering Algorithm
The k-medoids algorithm is a clustering algorithm very similar to the k-means algorithm. Both k-means
and k-medoids are partitional and try to minimize the distance between points and cluster
centers. In contrast to the k-means algorithm, k-medoids chooses data points as centers and uses the
Manhattan distance to define the distance between cluster centers and data points. This technique
clusters the dataset of n objects into k clusters, where the number of clusters k is known in advance. It is
more robust to noise and outliers as compared to k-means because it minimizes a sum of pairwise
dissimilarities instead of a sum of squared Euclidean distances. A medoid is defined as an object of a
cluster whose average dissimilarity to all objects in the cluster is minimal.
The Manhattan distance between two vectors x and y in an n-dimensional real vector space is given by Eq. (13.2).
It is used in computing the distance between a data point and its cluster center.

d(x, y) = Σ (i = 1 to n) |xi – yi|                (13.2)
The most common algorithm in k-medoid clustering is Partitioning Around Medoids (PAM).
PAM uses a greedy search, which is faster than an exhaustive search but may not find the optimum
solution. It works as follows:
1. Initialize: select k of the n data points as the medoids.
2. Associate each data point to the closest medoid.
3. While the cost of the configuration decreases: For each medoid m and for each non-medoid
data point O':

 Swap m and O', recompute the cost (sum of distances of points to their medoid).

 If the total cost of the configuration increased in the previous step, undo the swap.
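A compact sketch of this swap-based search with Manhattan distance (illustrative only; the variable names and the stopping rule are the editor's assumptions, and the greedy search may stop in a local optimum depending on the starting medoids):

import numpy as np
from itertools import product

def manhattan_cost(X, medoid_idx):
    # total cost = sum over points of the Manhattan distance to the nearest medoid
    d = np.abs(X[:, None, :] - X[medoid_idx][None, :, :]).sum(axis=2)
    return d.min(axis=1).sum()

def pam(X, k, seed=0):
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(len(X), size=k, replace=False))   # step 1: initialize
    cost, improved = manhattan_cost(X, medoids), True
    while improved:                                             # step 3: try swaps
        improved = False
        for m_pos, o in product(range(k), range(len(X))):
            if o in medoids:
                continue
            candidate = medoids.copy()
            candidate[m_pos] = o                                # swap medoid m and non-medoid O'
            new_cost = manhattan_cost(X, candidate)
            if new_cost < cost:                                 # keep only improving swaps
                medoids, cost, improved = candidate, new_cost, True
    return medoids, cost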

Solved Problem 13.4


Q. Cluster the following dataset of 6 objects into two clusters, that is, k = 2.
X1 2 6
X2 3 4
X3 3 8
X4 4 2
X5 6 2
X6 6 4

Solution:
Step 1: Two observations Cl = X2 = (3, 4) and C2 = X6 = (6, 4) are randomly selected as medoids (Cluster
centers)
Step 2: Manhattan distances are calculated to each center to associate each data object to its nearest
medoid.
Manhattan distance formula to calculate the distance matrices between medoid (cluster center) and non-
medoid (which are not cluster centers) points.
Distance = |X1-X2| + |Y1-Y2|
1. Distance between point ( 2, 6 ) and C1 medoid ( 3, 4 ) = | ( 2 - 3 ) | + | ( 6 – 4 ) | = 1 + 2 = 3
Distance between point ( 2, 6 ) and C2 medoid ( 6, 4 ) = | ( 2 - 6) | + | ( 6 - 4 ) | = 4 + 2 = 6
2. Distance between point ( 3, 4 ) and C1 medoid ( 3, 4 ) = | ( 3 - 3 ) | + | ( 4 – 4 ) | = 0 + 0 = 0
Distance between point ( 3, 4 ) and C2 medoid ( 6, 4 ) = | ( 3 - 6) | + | ( 4 - 4 ) | = 3 + 0 = 3
3. Distance between point ( 3, 8 ) and C1 medoid ( 3, 4 ) = | ( 3 - 3 ) | + | ( 8 – 4 ) | = 0 + 4 = 4
Distance between point ( 3, 8 ) and C2 medoid ( 6, 4 ) = | ( 3 - 6) | + | ( 8 - 4 ) | = 3 + 4 = 7
4. Distance between point ( 4, 2 ) and C1 medoid ( 3, 4 ) = | ( 4 - 3 ) | + | ( 2 – 4 ) | = 1 + 2 = 3
Distance between point ( 4, 2 ) and C2 medoid ( 6, 4 ) = | ( 4 - 6) | + | ( 2 - 4 ) | = 2 + 2 = 4
5. Distance between point ( 6, 2 ) and C1 medoid ( 3, 4 ) = | ( 6 - 3 ) | + | ( 2 – 4 ) | = 3 + 2 = 5
Distance between point ( 6, 2 ) and C2 medoid ( 6, 4 ) = | ( 6 - 6) | + | ( 2 - 4 ) | = 0 + 2 = 2
6. Distance between point ( 6, 4 ) and C1 medoid ( 3, 4 ) = | ( 6 - 3 ) | + | ( 4 – 4 ) | = 3 + 0 = 3
Distance between point ( 6, 4 ) and C2 medoid ( 6, 4 ) = | ( 6 - 6) | + | ( 4 - 4 ) | = 0 + 0 = 0
Sample | Data Point | Distance to C1 (3, 4) | Distance to C2 (6, 4) | Distance to the closer medoid
X1 (2, 6) 3 6 3
X2 (3, 4) 0 3 0
X3 (3, 8) 4 7 4
X4 (4, 2) 3 4 3
X5 (6, 2) 5 2 2
X6 (6, 4) 3 0 0
cost 10 2
Total cost 12

Each point is assigned to the cluster of that medoid whose dissimilarity is less. Points 1, 2, 3 and 4 go to
cluster C1 and 5, 6 go to cluster C2.

The Cost of C1 = (3 + 0 + 4 + 3) =10


The Cost of C2 = (2 + 0) = 2
Step 3: Randomly select one non-medoid point and recalculate the cost. We select one of the non-medoids
O'. Let us assume O' = (6, 2). So now the medoids are C1 (3, 4) and O' = (6, 2). If C1 and O' are the new
medoids, we calculate the total cost involved.
1. Distance between point ( 2, 6 ) and C1 medoid ( 3, 4 ) = | ( 2 - 3 ) | + | ( 6 – 4 ) | = 1 + 2 = 3
Distance between point ( 2, 6 ) and C2 medoid ( 6, 2 ) = | ( 2 - 6) | + | ( 6 - 2 ) | = 4 + 4 = 8
2. Distance between point ( 3, 4 ) and C1 medoid ( 3, 4 ) = | ( 3 - 3 ) | + | ( 4 – 4 ) | = 0 + 0 = 0
Distance between point ( 3, 4 ) and C2 medoid ( 6, 2 ) = | ( 3 - 6) | + | ( 4 - 2 ) | = 3 + 2 = 5
3. Distance between point ( 3, 8 ) and C1 medoid ( 3, 4 ) = | ( 3 - 3 ) | + | ( 8 – 4 ) | = 0 + 4 = 4
Distance between point ( 3, 8 ) and C2 medoid ( 6, 2 ) = | ( 3 - 6) | + | ( 8 – 2 ) | = 3 + 6 = 9
4. Distance between point ( 4, 2 ) and C1 medoid ( 3, 4 ) = | ( 4 - 3 ) | + | ( 2 – 4 ) | = 1 + 2 = 3
Distance between point ( 4, 2 ) and C2 medoid ( 6, 2 ) = | ( 4 - 6) | + | ( 2 - 2 ) | = 2 + 0 = 2
5. Distance between point ( 6, 2 ) and C1 medoid ( 3, 4 ) = | ( 6 - 3 ) | + | ( 2 – 4 ) | = 3 + 2 = 5
Distance between point ( 6, 2 ) and C2 medoid ( 6, 2 ) = | ( 6 - 6) | + | ( 2 - 2 ) | = 0 + 0 = 0
6. Distance between point ( 6, 4 ) and C1 medoid ( 3, 4 ) = | ( 6 - 3 ) | + | ( 4 – 4 ) | = 3 + 0 = 3
Distance between point ( 6, 4 ) and C2 medoid ( 6, 2 ) = | ( 6 - 6) | + | ( 4 - 2 ) | = 0 + 2 = 2
Sample | Data Point | Distance to C1 (3, 4) | Distance to C2 (6, 2) | Distance to the closer medoid
X1 (2, 6) 3 8 3
X2 (3, 4) 0 5 0
X3 (3, 8) 4 9 4
X4 (4, 2) 3 2 2
X5 (6, 2) 5 0 0
X6 (6, 4) 3 2 2
cost 7 4
Total cost 11

So the cost of swapping the medoid from C2(6, 4) to O'(6, 2) is 11. Since this cost is less than the previous
cost (12), this is considered a better cluster assignment and the swap is accepted.
Step 4: We select another non-medoid O'. Let us assume O' = (4, 2). So now the medoids are C1 (3, 4) and
O' (4, 2). If C1 and O' are the new medoids, we calculate the total cost involved.
1. Distance between point ( 2, 6 ) and C1 medoid ( 3, 4 ) = | ( 2 - 3 ) | + | ( 6 – 4 ) | = 1 + 2 = 3

Distance between point ( 2, 6 ) and C2 medoid ( 4, 2 ) = | ( 4 - 6) | + | ( 6 - 2 ) | = 2 + 4 = 6


2. Distance between point ( 3, 4 ) and C1 medoid ( 3, 4 ) = | ( 3 - 3 ) | + | ( 4 – 4 ) | = 0 + 0 = 0
Distance between point ( 3, 4 ) and C2 medoid ( 4, 2 ) = | ( 4 - 3) | + | ( 4 - 2 ) | = 1 + 2 = 3
3. Distance between point ( 3, 8 ) and C1 medoid ( 3, 4 ) = | ( 3 - 3 ) | + | ( 8 – 4 ) | = 0 + 4 = 4
Distance between point ( 3, 8 ) and C2 medoid ( 4, 2 ) = | ( 4 - 3) | + | ( 8 – 2 ) | = 1 + 6 = 7
4. Distance between point ( 4, 2 ) and C1 medoid ( 3, 4 ) = | ( 4 - 3 ) | + | ( 2 – 4 ) | = 1 + 2 = 3
Distance between point ( 4, 2 ) and C2 medoid ( 4, 2 ) = | ( 4 - 4) | + | ( 2 - 2 ) | = 0 + 0 = 0
5. Distance between point ( 6, 2 ) and C1 medoid ( 3, 4 ) = | ( 6 - 3 ) | + | ( 2 – 4 ) | = 3 + 2 = 5
Distance between point ( 6, 2 ) and C2 medoid ( 4, 2 ) = | ( 4 - 6) | + | ( 2 - 2 ) | = 2 + 0 = 2
6. Distance between point ( 6, 4 ) and C1 medoid ( 3, 4 ) = | ( 6 - 3 ) | + | ( 4 – 4 ) | = 3 + 0 = 3
Distance between point ( 6, 4 ) and C2 medoid ( 4, 2 ) = | ( 4 - 6) | + | ( 4 - 2 ) | = 2 + 2 = 4
Sample | Data Point | Distance to C1 (3, 4) | Distance to C2 (4, 2) | Distance to the closer medoid
X1 (2, 6) 3 6 3
X2 (3, 4) 0 3 0
X3 (3, 8) 4 7 4
X4 (4, 2) 3 0 0
X5 (6, 2) 5 2 2
X6 (6, 4) 3 4 3
cost 10 2
Total cost 12

So the cost of swapping the medoid from C2(6, 2) to O'(4, 2) is 12. Since this cost is more than the current
cost (11), this cluster assignment is not considered and the swap is not done.
Thus, we try other non-medoid points to get the minimum cost; the assignment with the minimum cost is
considered the best. For some applications, k-medoids shows better results than k-means. The most time-
consuming part of the k-medoids algorithm is the calculation of the distances between objects. The distance
matrix can be computed in advance to speed up the process.
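As a quick check of the configuration costs used in this example (an editor's verification sketch, assuming NumPy is available):

import numpy as np

X = np.array([[2, 6], [3, 4], [3, 8], [4, 2], [6, 2], [6, 4]])

def total_cost(medoids):
    d = np.abs(X[:, None, :] - np.array(medoids)[None, :, :]).sum(axis=2)
    return int(d.min(axis=1).sum())

print(total_cost([(3, 4), (6, 4)]))   # 12 : initial medoids (Step 2)
print(total_cost([(3, 4), (6, 2)]))   # 11 : swap accepted (Step 3)
print(total_cost([(3, 4), (4, 2)]))   # 12 : swap rejected, no improvement (Step 4)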

13.4. Hierarchical Methods


The hierarchical agglomerative clustering methods are most commonly used. The construction of
hierarchical agglomerative classification can be achieved by the following general algorithm.
1. Find the two closest objects and merge them into a cluster.

2. Find and merge the next two closest points, where a point is either an individual object or a cluster
of objects
3. If more than one cluster remains, return to step 2.

13.4.1 Agglomerative Algorithms


Q. Explain the Agglomerative clustering Algorithm
The agglomerative algorithm follows a bottom-up strategy, treating each object as its own cluster and
iteratively merging clusters until a single cluster is formed or a terminal condition is satisfied. According
to some similarity measure, the merging is done by choosing the closest clusters first. A dendrogram, which
is a tree-like structure, is used to represent hierarchical clustering. Individual objects are represented by
leaf nodes and clusters by internal nodes, with the root representing the single all-inclusive cluster. A representation of a dendrogram is shown in
figure.

13.4.1.1 Distance Measures


One of the major factors in clustering is the metric that is used to measure the distance between two clusters,
where each cluster is generally a set of objects. The distance between two objects or points p and p' is
computed using Eqs. (13.3) to (13.6). Let Ci be a cluster and ni the number of objects in Ci. These measures are
also known as linkage measures.
Minimum distance: dmin(Ci, Cj) = min { |p – p'| : p in Ci, p' in Cj }
Maximum distance: dmax(Ci, Cj) = max { |p – p'| : p in Ci, p' in Cj }
Average distance: davg(Ci, Cj) = (1 / (ni nj)) Σ (p in Ci) Σ (p' in Cj) |p – p'|

When an algorithm uses the minimum distance, dmin(Ci, Cj), to measure the distance between clusters, it is
called nearest-neighbor clustering algorithm. If the clustering process is terminated when the distance
between the nearest clusters exceeds a user-defined threshold, it is called single-linkage algorithm.
Agglomerative hierarchical clustering algorithm (with minimum distance measure) is called minimum
spanning tree algorithm since spanning tree of a graph is a tree that connects all vertices and a minimal
spanning tree is one with the least sum of edge weights.
An algorithm that uses the maximum distance, dmax (Ci, Cj), to measure the distance between clusters is
called farthest-neighbor clustering algorithm. If clustering is terminated when the maximum distance
exceeds a user-defined threshold, it is called complete-linkage algorithm.
The minimum and maximum measures tend to be sensitive to outliers or noisy data. The third method thus
suggests taking the average distance to reduce outlier problems. Another advantage is that it can handle
categorical data as well.
Algorithm: The agglomerative algorithm is carried out in three steps and the flowchart is shown in Fig.
13.4.
1. Convert object attributes to distance matrix.
2. Set each object as a cluster (thus, if we have N objects, we will have N clusters at the beginning).
3. Repeat until number of clusters is one.
• Merge two closest clusters.
• Update distance matrix.

13.4.1.2 Agglomerative Algorithm: Single Link


Single-nearest distance or single linkage is the agglomerative method that uses the distance between the
closest members of the two clusters.
Solved Problem 13.5
Q. Find the clusters using single link technique. Use Euclidean distance and draw the dendrogram.
Sample No. X Y
P1 0.40 0.53
P2 0.22 0.38
P3 0.35 0.32
P4 0.26 0.19
P5 0.08 0.41
P6 0.45 0.30

Solution:
To compute distance matrix:

Euclidean distance: d(Pi, Pj) = sqrt( (xi – xj)² + (yi – yj)² )

The distance matrix is:

Similarly:

The distance matrix is:

Here the minimum value is 0.10, that is the distance between P3 & P6 and hence we combine P3 and P6.
To update the distance matrix,
min (d(P3, P6), P1)) = min (d(P3, P1), d(P6, P1)) = min (0.22, 0.24) = 0.22
min (d(P3, P6), P2)) = min (d(P3, P2), d(P6, P2)) = min (0.14, 0.24) = 0.14
min (d(P3, P6), P4)) = min (d(P3, P4), d(P6, P4)) = min (0.13, 0.22) = 0.13
min (d(P3, P6), P5)) = min (d(P3, P5), d(P6, P5)) = min (0.28, 0.39) = 0.28

The distance matrix after merging the two closest members P3 & P6:

Here the minimum value is 0.13 that is the distance between (P3, P6) & P4 and hence we combine (P3,
P6) & P4. To update the distance matrix,
min ((d(P3, P6), P4), P1) = min (d(P3, P6), P1), d(P4, P1)) = min (0.22, 0.37) = 0.22
min ((d(P3, P6), P4), P2) = min (d(P3, P6), P2), d(P4, P2)) = min (0.14, 0.19) = 0.14
min ((d(P3, P6), P4), P5) = min (d(P3, P6), P5), d(P4, P5)) = min (0.28, 0.23) = 0.23

The distance matrix after merging two closest members (P3, P6) & P4,

Here the minimum value is 0.14 that is the distance between P2 & P5 and hence we combine P2 & P5. To
update the distance matrix,
min (d(P2, P5), P1)) = min (d(P2, P1), d(P5, P1)) = min (0.23, 0.34) = 0.23
min (d(P2, P5), (P3, P6, P4)) = min (d(P2, (P3, P6, P4), d(P5, (P3, P6, P4))= min (0.14, 0.23) = 0.14

The distance matrix after merging two closest members P2, P5

Here the minimum value is 0.14 that is the distance between (P3, P6, P4) & (P2, P5) and hence we combine
them. To update the distance matrix,
min (d(P2, P5, P3, P6, P4), P1)) = min (d((P2, P5), P1), d((P3,P6, P4), P1)) = min (0.23, 0.22) = 0.22
The distance matrix after merging two closest members (P2, P5, P3, P6, P4) & P1,

The dendrogram can now be drawn as shown in Fig. 13.5.
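The same single-link clustering can be checked with SciPy, assuming SciPy and matplotlib are available; because the tables above work with rounded distances, the merge heights (and possibly the merge order) reported by SciPy may differ slightly. Changing method to "complete" or "average" gives the variants discussed in the next two sections.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

points = np.array([[0.40, 0.53], [0.22, 0.38], [0.35, 0.32],
                   [0.26, 0.19], [0.08, 0.41], [0.45, 0.30]])   # P1..P6

Z = linkage(points, method="single", metric="euclidean")        # pairwise Euclidean, single link
dendrogram(Z, labels=["P1", "P2", "P3", "P4", "P5", "P6"])
plt.ylabel("merge distance")
plt.show()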

13.4.1.3 Agglomerative Algorithm: Complete Link


Complete farthest distance or complete linkage is the agglomerative method that uses the distance between
the members that are farthest apart.
Solved Problem 13.6
Q. For the given set of points, identify clusters using complete link agglomerative clustering.

Solution:
To compute distance matrix:

The Euclidean distance is:

The distance matrix is:

Here the minimum value is 0.5 that is the distance between P4 & P6 and hence we combine them. To
update the distance matrix,
max (d(P4, P6), P1)) = max (d((P4, P1), d(P6,P1))) = max (3.6, 3.2) = 3.6
max (d(P4,P6), P2) = max (d(P4,P2), d(P6,P2)) = max (2.92, 2.5) = 2.92
max (d(P4,P6), P3) = max (d(P4,P3), d(P6,P3)) = max (2.24, 2.5) = 2.5
max (d(P4,P6), P5) = max (d(P4,P5), d(P6,P5)) = max (1.0, 1.12) = 1.12

Again, merging the two closest members of the two clusters and finding the minimum element in distance
matrix. We get the minimum value as 0.71 and hence we combine P1 and P2. To update the distance
matrix,
max (d(P1, P2), P3) = max (d(P1, P3), d(P2, P3)) = max (5.66, 4.95) = 5.66
max (d(P1,P2), (P4,P6)) = max (d(P1, P4, P6), d(P2, P4, P6)) = max (3.6, 2.92) = 3.6
max (d(P1,P2), P5) = max (d(P1, P5), d(P2, P5)) = max (4.24, 3.53) = 4.24

Again, merging the two closest members of the two clusters and finding the minimum element in distance
matrix. We get the minimum value as 1.12 and hence we combine P4, P6 and P5. To update the distance
matrix,
max (d(P4,P6,P5), (P1,P2)) = max (d(P4,P6,P1,P2), d(P5,P1,P2)) = max (3.6, 4.24) = 4.24
max (d(P4,P6,P5), P3) = max (d(P4,P6,P3), d(P5, P3)) = max (2.5, 1.41) = 2.5

Again, merging the two closest members of the two clusters and finding the minimum element in distance
matrix. We get the minimum value as 2.5 and hence combine P4, P6, P5 and P3. to update the distance
matrix,
max (d(P4,P6,P5,P3), (P1,P2)) = max (d(P4,P6,P5,P1,P2), d(P3,P1,P2)) = max (3.6, 5.66) = 5.66

The final cluster can be drawn as in fig. 13.6

13.4.1.4 Agglomerative Algorithm: Average Link


Average distance or average linkage is the agglomerative method that considers the distances between
all pairs of members and averages these distances. This is also called Unweighted Pair Group Mean Averaging (UPGMA).
Solved Problem 13.7
Q. For the given set of points, identify clusters using average link agglomerative clustering.

Solution:
The distance matrix is:

Merging the two closest members by finding the minimum element in distance matrix and forming the
clusters, we get
Here the minimum value is 0.5 that is the distance between P4 & P6 and hence we combine them. To
update the distance matrix,
average (d(P4,P6), P1) = average (d(P4,P1), d(P6,P1)) = average (3.6, 3.20) = 3.4
average (d(P4,P6), P2) = average (d(P4,P2), d(P6,P2)) = average (2.92, 2.5) = 2.71

average (d(P4,P6), P3) = average (d(P4,P3), d(P6,P3)) = average (2.24, 2.5) = 2.37
average (d(P4,P6), P5) = average (d(P4,P5), d(P6,P5)) = average (1.0, 1.12) = 1.06

Merging two closest members of the two clusters and finding the minimum elements in distance matrix.
We get the minimum value as 0.71 and hence we combine P1 and P2. To update the distance matrix:
average (d(P1,P2), P3) = average (d(P1,P3), d(P2,P3)) = average (5.66, 4.95) = 5.31
average (d(P1,P2), (P4,P6)) = average (d(P1,P4,P6), d(P2,P4,P6)) = average (3.2, 2.71) = 2.96
average (d(P1,P2), P5) = average (d(P1,P5), d(P2,P5)) = average (4.24, 3.53) = 3.89

Merging two closest members of the two clusters and finding the minimum elements in distance matrix.
We get the minimum value as 1.12 and hence we combine P4,P6 and P5. To update the distance matrix:
average (d(P4,P6,P5), (P1,P2)) = average (d(P4,P6,P1,P2), d(P5,P1,P2)) = average (2.96, 3.89) = 3.43
average (d(P4,P6,P5), P3) = average (d(P4,P6,P3), d(P5,P3)) = average (2.5, 1.41) = 1.96

Merging two closest members of the two clusters and finding the minimum elements in distance matrix.
We get the minimum value as 1.96 and hence we combine P4,P6,P5 and P3. To update the distance
matrix:
average (d(P4,P6,P5,P3), (P1,P2)) = average (d(P4,P6,P5,P1,P2), d(P3,P1,P2)) = average (3.43, 5.66) =
4.55

The final cluster can be drawn as in fig. 13.7



MODULE 5

INSTANCE BASED LEARNING

INTRODUCTION

 Instance-based learning methods such as nearest neighbor and locally weighted


regression are conceptually straightforward approaches to approximating real-valued or
discrete-valued target functions.
 Learning in these algorithms consists of simply storing the presented training data.
When a new query instance is encountered, a set of similar related instances is retrieved
from memory and used to classify the new query instance.
 Instance-based approaches can construct a different approximation to the target function
for each distinct query instance that must be classified.

Advantages of Instance-based learning


1. Training is very fast
2. Learn complex target function
3. Don’t lose information

Disadvantages of Instance-based learning


 The cost of classifying new instances can be high. This is due to the fact that nearly all
computation takes place at classification time rather than when the training examples
are first encountered.
 A second disadvantage, especially of nearest-neighbor approaches, is that they
typically consider all attributes of the instances when attempting to retrieve similar
training examples from memory. If the target concept depends on only a few of the
many available attributes, then the instances that are truly most "similar" may well be a
large distance apart.



k- NEAREST NEIGHBOR LEARNING

 The most basic instance-based method is the K- Nearest Neighbor Learning. This
algorithm assumes all instances correspond to points in the n-dimensional space Rn.
 The nearest neighbors of an instance are defined in terms of the standard Euclidean
distance.
 Let an arbitrary instance x be described by the feature vector
(a1(x), a2(x), ..., an(x))
Where, ar(x) denotes the value of the rth attribute of instance x.

 Then the distance between two instances xi and xj is defined to be d(xi, xj), where

d(xi, xj) = sqrt( Σ (r = 1 to n) ( ar(xi) – ar(xj) )² )
 In nearest-neighbor learning the target function may be either discrete-valued or real-


valued.

Let us first consider learning discrete-valued target functions of the form f : Rn → V,
where V is the finite set {v1, . . ., vs}.

The k-Nearest Neighbor algorithm for approximating a discrete-valued target function is
given below:

Training algorithm: for each training example (x, f(x)), add the example to the list training_examples.
Classification algorithm: given a query instance xq to be classified,
1. Let x1, ..., xk denote the k instances from training_examples that are nearest to xq.
2. Return f̂(xq) = argmax over v in V of Σ (i = 1 to k) δ(v, f(xi)), where δ(a, b) = 1 if a = b and 0 otherwise.


 The value f̂(xq) returned by this algorithm as its estimate of f(xq) is just the most
common value of f among the k training examples nearest to xq.
 If k = 1, then the 1-Nearest Neighbor algorithm assigns to f̂(xq) the value f(xi), where
xi is the training instance nearest to xq.
 For larger values of k, the algorithm assigns the most common value among the k nearest
training examples.
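A minimal NumPy sketch of this discrete-valued k-Nearest Neighbor classifier (an illustrative implementation; X_train and y_train are assumed to be NumPy arrays, and the function name is the editor's):

import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, xq, k=5):
    dists = np.linalg.norm(X_train - xq, axis=1)   # Euclidean distance to every stored example
    nearest = np.argsort(dists)[:k]                # indices of the k nearest neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]   # most common value of f among them

With k = 1 this reduces to returning the label of the single nearest stored instance.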

 Below figure illustrates the operation of the k-Nearest Neighbor algorithm for the case where
the instances are points in a two-dimensional space and where the target function is Boolean
valued.

 The positive and negative training examples are shown by “+” and “-” respectively. A
query point xq is shown as well.
 The 1-Nearest Neighbor algorithm classifies xq as a positive example in this figure,
whereas the 5-Nearest Neighbor algorithm classifies it as a negative example.

 Below figure shows the shape of this decision surface induced by 1- Nearest Neighbor over
the entire instance space. The decision surface is a combination of convex polyhedra
surrounding each of the training examples.

 For every training example, the polyhedron indicates the set of query points whose
classification will be completely determined by that training example. Query points
outside the polyhedron are closer to some other training example. This kind of diagram
is often called the Voronoi diagram of the set of training example



The k-Nearest Neighbor algorithm for approximating a real-valued target function f : Rn → R is
obtained by replacing the final step of the above algorithm with

f̂(xq) = ( Σ (i = 1 to k) f(xi) ) / k

Distance-Weighted Nearest Neighbor Algorithm

 One refinement to the k-Nearest Neighbor algorithm is to weight the
contribution of each of the k neighbors according to their distance to the query point xq,
giving greater weight to closer neighbors.
 For example, in the k-Nearest Neighbor algorithm, which approximates discrete-valued
target functions, we might weight the vote of each neighbor according to the inverse
square of its distance from xq

Distance-Weighted Nearest Neighbor Algorithm for approximating a discrete-valued target
function: replace the final step of the k-Nearest Neighbor algorithm with

f̂(xq) = argmax over v in V of Σ (i = 1 to k) wi δ(v, f(xi)),   where wi = 1 / d(xq, xi)²

(if xq exactly equals a training instance xi, then f̂(xq) is taken to be f(xi)).


Distance-Weighted Nearest Neighbor Algorithm for approximating a real-valued target
function: replace the final step with

f̂(xq) = ( Σ (i = 1 to k) wi f(xi) ) / ( Σ (i = 1 to k) wi ),   where wi = 1 / d(xq, xi)²
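A sketch of both distance-weighted variants (illustrative; the exact-match check avoids division by zero when the query coincides with a stored training point):

import numpy as np
from collections import defaultdict

def weighted_knn(X_train, y_train, xq, k=5, discrete=True):
    dists = np.linalg.norm(X_train - xq, axis=1)
    nearest = np.argsort(dists)[:k]
    if np.isclose(dists[nearest[0]], 0.0):           # query coincides with a training point
        return y_train[nearest[0]]
    w = 1.0 / dists[nearest] ** 2                    # wi = 1 / d(xq, xi)^2
    if discrete:                                     # weighted vote over the class labels
        votes = defaultdict(float)
        for wi, yi in zip(w, y_train[nearest]):
            votes[yi] += wi
        return max(votes, key=votes.get)
    return float(np.dot(w, y_train[nearest]) / w.sum())   # weighted average for real values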

Terminology

 Regression means approximating a real-valued target function.


 Residual is the error f̂(x) - f(x) in approximating the target function.
 Kernel function is the function of distance that is used to determine the weight of each
training example. In other words, the kernel function is the function K such that
wi = K(d(xi, xq))

LOCALLY WEIGHTED REGRESSION

 The phrase "locally weighted regression" is called local because the function is
approximated based only on data near the query point, weighted because the
contribution of each training example is weighted by its distance from the query point,
and regression because this is the term used widely in the statistical learning community
for the problem of approximating real-valued functions.

 Given a new query instance xq, the general approach in locally weighted regression is
to construct an approximation f̂ that fits the training examples in the neighborhood
surrounding xq. This approximation is then used to calculate the value f̂(xq), which is
output as the estimated target value for the query instance.



Locally Weighted Linear Regression

 Consider locally weighted regression in which the target function f is approximated near
xq using a linear function of the form

f̂(x) = w0 + w1 a1(x) + ... + wn an(x)

where ai(x) denotes the value of the ith attribute of the instance x.

 Gradient descent can be used to choose weights that minimize the squared error summed
over the set D of training examples,

E = (1/2) Σ (x in D) ( f(x) – f̂(x) )²

which leads to the gradient descent training rule

Δwj = η Σ (x in D) ( f(x) – f̂(x) ) aj(x)

where η is a constant learning rate.

 Need to modify this procedure to derive a local approximation rather than a global one.
The simple way is to redefine the error criterion E to emphasize fitting the local training
examples. Three possible criteria are given below.

1. Minimize the squared error over just the k nearest neighbors:

   E1(xq) = (1/2) Σ (x in the k nearest neighbors of xq) ( f(x) – f̂(x) )²

2. Minimize the squared error over the entire set D of training examples, while
weighting the error of each training example by some decreasing function K of its
distance from xq:

   E2(xq) = (1/2) Σ (x in D) ( f(x) – f̂(x) )² K(d(xq, x))

3. Combine 1 and 2:

   E3(xq) = (1/2) Σ (x in the k nearest neighbors of xq) ( f(x) – f̂(x) )² K(d(xq, x))



If we choose criterion three and re-derive the gradient descent rule, we obtain the following
training rule:

Δwj = η Σ (x in the k nearest neighbors of xq) K(d(xq, x)) ( f(x) – f̂(x) ) aj(x)
The differences between this new rule and the rule given by Equation (3) are that the
contribution of instance x to the weight update is now multiplied by the distance penalty
K(d(xq, x)), and that the error is summed over only the k nearest training examples.
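A sketch of one way to implement locally weighted linear regression for a single query point; it uses a closed-form weighted least-squares fit with a Gaussian kernel instead of the gradient-descent rule above, and the bandwidth tau is an assumed parameter:

import numpy as np

def lwr_predict(X_train, y_train, xq, tau=0.5):
    Xb = np.hstack([np.ones((len(X_train), 1)), X_train])   # prepend a constant for w0
    xqb = np.hstack([[1.0], xq])
    d2 = ((X_train - xq) ** 2).sum(axis=1)                  # squared distance to the query
    w = np.exp(-d2 / (2 * tau ** 2))                        # kernel K(d(xq, x))
    W = np.diag(w)
    # weighted normal equations: (X^T W X) theta = X^T W y
    theta = np.linalg.solve(Xb.T @ W @ Xb, Xb.T @ W @ y_train)
    return float(xqb @ theta)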

RADIAL BASIS FUNCTIONS

 One approach to function approximation that is closely related to distance-weighted


regression and also to artificial neural networks is learning with radial basis functions
 In this approach, the learned hypothesis is a function of the form

f̂(x) = w0 + Σ (u = 1 to k) wu Ku(d(xu, x))          (1)
 Where, each xu is an instance from X and where the kernel function Ku(d(xu, x)) is
defined so that it decreases as the distance d(xu, x) increases.
 Here k is a user provided constant that specifies the number of kernel functions to be
included.
 Although f̂(x) is a global approximation to f(x), the contribution from each of the Ku(d(xu, x)) terms
is localized to a region nearby the point xu.

Choose each function Ku(d(xu, x)) to be a Gaussian function centred at the point xu with some
variance σu²:

Ku(d(xu, x)) = exp( – d²(xu, x) / (2 σu²) )

 The functional form of equ(1) can approximate any function with arbitrarily small error,
provided a sufficiently large number k of such Gaussian kernels is used and provided the width
σ² of each kernel can be separately specified.
 The function given by equ(1) can be viewed as describing a two-layer network where
the first layer of units computes the values of the various Ku(d(xu, x)) and where the
second layer computes a linear combination of these first-layer unit values.



Example: Radial basis function (RBF) network

Given a set of training examples of the target function, RBF networks are typically trained in
a two-stage process.
1. First, the number k of hidden units is determined and each hidden unit u is defined by
choosing the values of xu and σu² that define its kernel function Ku(d(xu, x)).
2. Second, the weights wu are trained to maximize the fit of the network to the training
data, using the global error criterion

   E = (1/2) Σ (x in D) ( f(x) – f̂(x) )²

Because the kernel functions are held fixed during this second stage, the linear weight
values wu can be trained very efficiently.

Several alternative methods have been proposed for choosing an appropriate number of hidden
units or, equivalently, kernel functions.
 One approach is to allocate a Gaussian kernel function for each training example
(xi, f(xi)), centring this Gaussian at the point xi.
Each of these kernels may be assigned the same width σ². Given this approach, the RBF
network learns a global approximation to the target function in which each training
example (xi, f(xi)) can influence the value of f̂ only in the neighbourhood of xi.
 A second approach is to choose a set of kernel functions that is smaller than the number
of training examples. This approach can be much more efficient than the first approach,
especially when the number of training examples is large.
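A small sketch of such a two-stage RBF network (illustrative: centres are chosen as a random subset of the training points, a single shared width sigma is assumed, and the output weights are fit by linear least squares):

import numpy as np

def rbf_features(X, centres, sigma):
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    K = np.exp(-d2 / (2 * sigma ** 2))                  # Gaussian kernels Ku(d(xu, x))
    return np.hstack([np.ones((len(X), 1)), K])         # extra column for the bias weight w0

def train_rbf(X_train, y_train, n_centres=10, sigma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X_train), size=min(n_centres, len(X_train)), replace=False)
    centres = X_train[idx]                              # stage 1: fix the kernel centres
    Phi = rbf_features(X_train, centres, sigma)
    w, *_ = np.linalg.lstsq(Phi, y_train, rcond=None)   # stage 2: fit the linear weights
    return centres, w

def rbf_predict(X, centres, w, sigma=1.0):
    return rbf_features(X, centres, sigma) @ w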

Summary
 Radial basis function networks provide a global approximation to the target function,
represented by a linear combination of many local kernel functions.
 The value for any given kernel function is non-negligible only when the input x falls
into the region defined by its particular centre and width. Thus, the network can be
viewed as a smooth linear combination of many local approximations to the target
function.
 One key advantage to RBF networks is that they can be trained much more efficiently
than feedforward networks trained with BACKPROPAGATION.



CASE-BASED REASONING

 Case-based reasoning (CBR) is a learning paradigm based on lazy learning methods; it
classifies new query instances by analysing similar instances while ignoring
instances that are very different from the query.
 In CBR, instances are not represented as real-valued points; instead, they
use a rich symbolic representation.
 CBR has been applied to problems such as conceptual design of mechanical devices
based on a stored library of previous designs, reasoning about new legal cases based on
previous rulings, and solving planning and scheduling problems by reusing and
combining portions of previous solutions to similar problems

A prototypical example of a case-based reasoning

 The CADET system employs case-based reasoning to assist in the conceptual design of
simple mechanical devices such as water faucets.
 It uses a library containing approximately 75 previous designs and design fragments to
suggest conceptual designs to meet the specifications of new design problems.
 Each instance stored in memory (e.g., a water pipe) is represented by describing both its
structure and its qualitative function.
 New design problems are then presented by specifying the desired function and
requesting the corresponding structure.

The problem setting is illustrated in the figure below.
