DMW Assignment 2
Aim: Consider a suitable dataset. To cluster the data instances into different groups,
apply at least two different clustering techniques. Visualize the clusters using a
suitable tool.
Objectives:
1. Understand clustering algorithms
Theory:
Clustering: Clustering is the grouping of a set of objects based on their
characteristics, aggregating them according to their similarities. In data mining,
this methodology partitions the data by applying a specific join algorithm, chosen to
suit the desired information analysis. There are several different ways to implement
this partitioning, based on distinct models:
Centralized – each cluster is represented by a single mean vector, and an object's
value is compared to these mean values
Distributed – the cluster is built using statistical distributions
Group – algorithms have only group information
Graph – cluster organization and the relationships between members are defined by a
graph-linked structure
Density – members of the cluster are grouped by regions where observations are dense
and similar
RapidMiner:
RapidMiner is a data science software platform developed by the company of the same
name. It provides an integrated environment for data preparation, machine learning,
deep learning, text mining, and predictive analytics.
K-means Algorithm:
K-means clustering is a type of unsupervised learning, which is used when you
have unlabeled data (i.e., data without defined categories or groups). The goal of this
algorithm is to find groups in the data, with the number of groups represented by the
variable K. The algorithm works iteratively to assign each data point to one of K groups
based on the features that are provided. Data points are clustered based on feature
similarity. The results of the K-means clustering algorithm are:
1. The centroids of the K clusters, which can be used to label new data
2. Labels for the training data (each data point is assigned to a single cluster)
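As a small illustration of the first result, the sketch below labels new data points by their nearest centroid. The centroid values, the new points, and the helper name nearest_centroid are made up for this example and do not come from any dataset used in the assignment.

```python
import math

def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to `point` (Euclidean distance)."""
    distances = [math.dist(point, c) for c in centroids]
    return distances.index(min(distances))

# Hypothetical centroids, as if produced by an earlier K-means run with K = 2.
centroids = [(1.0, 1.0), (8.0, 8.0)]

# New, previously unseen points receive the label of the closest centroid.
new_points = [(0.5, 1.5), (9.0, 7.0)]
labels = [nearest_centroid(p, centroids) for p in new_points]
print(labels)  # [0, 1]
```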
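Deciding the number of clusters
The number of clusters should match the structure of the data. A poorly chosen number
of clusters gives groups that either merge distinct patterns or split natural groups
apart, which makes the result hard to interpret. An empirical way to find a good
number of clusters is to run K-means clustering with different values of K, measure
the resulting within-cluster sum of squares for each, and pick the K beyond which the
sum of squares stops dropping sharply.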
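To make the "sum of squares" concrete, the sketch below computes the within-cluster sum of squares for a given assignment of points to clusters. The toy points and the helper name wcss are assumptions for illustration only; in practice the labels would come from running K-means once for each candidate K.

```python
def wcss(points, labels):
    """Within-cluster sum of squared Euclidean distances to each cluster mean."""
    total = 0.0
    for cluster in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == cluster]
        # The centroid is the coordinate-wise mean of the cluster members.
        centroid = [sum(coord) / len(members) for coord in zip(*members)]
        total += sum((x - c) ** 2 for p in members for x, c in zip(p, centroid))
    return total

# Toy data: two obvious groups, one near x = 0 and one near x = 10.
points = [(0, 0), (0, 2), (10, 0), (10, 2)]
one_cluster = wcss(points, [0, 0, 0, 0])   # everything in a single cluster
two_clusters = wcss(points, [0, 0, 1, 1])  # split into the two groups
print(one_cluster, two_clusters)  # 104.0 4.0
```

The large drop from one cluster to two is the kind of change this empirical method looks for. Adding more clusters beyond that point would reduce the sum only slightly.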
Algorithm
1. Choose the number of clusters k in advance.
2. Select k points at random as the initial cluster centers.
3. Assign each object to its closest cluster center according to the Euclidean distance
function.
4. Recalculate each cluster center as the centroid (mean) of all objects in that cluster.
5. Repeat steps 3 and 4 until the same points are assigned to each cluster in
consecutive rounds.
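The steps above can be sketched in plain Python as follows. This is a minimal illustration rather than a production implementation: the function name kmeans, the max_iter and seed parameters, the toy data, and the rule that an empty cluster keeps its previous center are all assumptions made for the example.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Plain K-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # step 2: k random points as initial centers
    labels = None
    for _ in range(max_iter):
        # Step 3: assign every point to its closest center (Euclidean distance).
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points
        ]
        if new_labels == labels:  # step 5: assignments stopped changing
            break
        labels = new_labels
        # Step 4: move each center to the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # assumption: an empty cluster keeps its old center
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, labels

# Toy data with two well-separated groups (made up for this sketch).
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8.5, 9.5)]
centers, labels = kmeans(points, k=2)
print(centers)
print(labels)
```

Only the assignment and update steps are repeated inside the loop. The random selection of centers happens once, before the loop starts.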
k-medoids algorithm
The k-medoids algorithm, of which PAM (Partitioning Around Medoids) is the classical
realization, is a clustering algorithm reminiscent of the k-means algorithm. Both the
k-means and k-medoids algorithms are partitional (breaking the dataset up into groups)
and both attempt to minimize the distance between points labeled to be in a cluster and
a point designated as the center of that cluster. In contrast to the k-means algorithm,
k-medoids chooses data points as centers (medoids or exemplars) and can be used with
arbitrary distances, while in k-means the center of a cluster is not necessarily one of
the input data points (it is the average of the points in the cluster). k-medoids is a
classical partitioning technique of clustering, which clusters the data set of n objects
into k clusters, with the number k of clusters assumed known a priori (which implies
that the programmer must specify k before the execution of the algorithm). The
"goodness" of a given value of k can be assessed with methods such as the silhouette
method. k-medoids is more robust to noise and outliers than k-means because it
minimizes a sum of pairwise dissimilarities instead of a sum of squared Euclidean
distances. A medoid can be defined as the object of a cluster whose average
dissimilarity to all the objects in the cluster is minimal, that is, it is the most
centrally located point in the cluster.
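The medoid definition can be illustrated with a short sketch. Note that this is not the full PAM swap procedure: it uses the simpler alternating heuristic (assign points to the nearest medoid, then re-pick each cluster's medoid), chosen here only because it is short. The function names, the Manhattan distance, the toy data, and the seed and max_iter parameters are all assumptions for illustration.

```python
import random

def manhattan(a, b):
    """One possible dissimilarity; any function of two points would do."""
    return sum(abs(x - y) for x, y in zip(a, b))

def assign(points, medoids, dist):
    """Label each point with the position (0..k-1) of its nearest medoid."""
    return [
        min(range(len(medoids)), key=lambda j: dist(p, points[medoids[j]]))
        for p in points
    ]

def kmedoids(points, k, dist=manhattan, max_iter=100, seed=0):
    rng = random.Random(seed)
    medoids = rng.sample(range(len(points)), k)  # medoids are indices into points
    for _ in range(max_iter):
        labels = assign(points, medoids, dist)
        new_medoids = []
        for j in range(k):
            members = [i for i, lab in enumerate(labels) if lab == j]
            if not members:  # assumption: an empty cluster keeps its old medoid
                new_medoids.append(medoids[j])
                continue
            # The medoid is the member with the smallest total dissimilarity
            # to the other members of its cluster.
            new_medoids.append(
                min(members, key=lambda c: sum(dist(points[c], points[m]) for m in members))
            )
        if new_medoids == medoids:  # medoids stopped changing
            break
        medoids = new_medoids
    return medoids, assign(points, medoids, dist)

# Toy data with two well-separated groups (made up for this sketch).
points = [(1, 1), (2, 1), (1, 2), (8, 8), (9, 8), (8, 9)]
medoids, labels = kmedoids(points, k=2)
print([points[i] for i in medoids])
print(labels)
```

Unlike a k-means centroid, every returned center is one of the input points, and the dist argument can be swapped for any other dissimilarity function.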