Review Paper On Clustering and Validation Techniques
ISSN: 2321-9653
Abstract— Clustering is important in data analysis and data mining applications. It is the task of grouping a set of objects so
that objects in the same group are more similar to each other than to those in other groups (clusters).
The overall goal of the data mining process is to extract information from a large data set and transform it into an
understandable form for further use. Clustering can be performed by a number of algorithms, such as hierarchical,
partitioning, grid-based, and density-based algorithms. Hierarchical clustering is connectivity-based clustering. Partitioning is
centroid-based clustering, in which the number of clusters k (as in k-means) is set in advance. Clustering has been applied to
serve various purposes, such as gaining insight into the data distribution, generating hypotheses, observing characteristics,
and finding anomalies. The intention of this paper is to provide a categorization of some well-known clustering algorithms.
It also describes the clustering process and gives an overview of the different clustering methods. The validation of
clustering structures is the most difficult and frustrating part of cluster analysis. Validation compares the results of two
clusterings to find the better one.
Index Terms— Data mining, clustering process, Categorization of Clustering, validation etc.
www.ijraset.com Vol. 2 Issue V, May 2014
granularity.
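For reference, the sum-of-squares measures of cohesion and separation are usually written as follows. This is the standard formulation, given here as an assumption since the notation is not spelled out in the text: K is the number of clusters, C_i is the i-th cluster, m_i is its centroid, and m is the mean of the whole data set.

```latex
\[
\mathrm{WSS} = \sum_{i=1}^{K} \sum_{x \in C_i} \lVert x - m_i \rVert^{2},
\qquad
\mathrm{BSS} = \sum_{i=1}^{K} \lvert C_i \rvert \, \lVert m_i - m \rVert^{2},
\qquad
\mathrm{WSS} + \mathrm{BSS} = \sum_{x} \lVert x - m \rVert^{2}.
\]
```

Because the total on the right depends only on the data, a clustering with lower WSS (tighter clusters) necessarily has higher BSS (better separated clusters).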
WSS stands for the within-cluster sum of squared errors.

Cluster Separation: Measures how distinct or well separated a cluster is from other clusters.

BSS stands for the between-cluster sum of squared errors.

Validation techniques are used to find the best clustering.

Different Aspects of Cluster Validation

1. Determining the clustering tendency of a set of data, i.e., distinguishing whether non-random structure actually exists in the data.
2. Comparing the results of a cluster analysis to externally known results, e.g., to externally given class labels.
3. Evaluating how well the results of a cluster analysis fit the data without reference to external information.
   - Use only the data.
4. Comparing the results of two different sets of cluster analyses to determine which is better.
5. Determining the 'correct' number of clusters.

For 2, 3, and 4, we can further distinguish whether we want to evaluate the entire clustering or just individual clusters.

AIM AND OBJECTIVE:

And deliver to user.

To cope with the explosion of information on the World Wide Web, this thesis proposed a new method of summarization using a soft clustering algorithm. It used the Google search engine to extract relevant documents, mixed the query sentence into the document set segmented from the multi-document set, and then built an efficient hierarchical clustering to cluster all the documents. There is still a lot of room for improvement. For example, readability is an important aspect of the performance of multi-document summarization. In future work, we will consider a new soft clustering algorithm to further improve the efficiency of clustering.

Cluster analysis is a process of grouping objects into clusters, such that the objects in a given cluster are similar to each other and dissimilar to the objects in other clusters. With the application of clustering in almost every field of science and technology, a large number of clustering algorithms have been proposed to satisfy criteria such as arbitrary cluster shapes, high-dimensional databases, and the use of domain knowledge. It has also been shown that it is not possible to design a single clustering algorithm that fulfils all the requirements of clustering. Therefore, a number of methods have been proposed, such as partitioning, hierarchical, density-based, and model-based methods. Different algorithms may combine good features of one or more methods, so it is difficult to categorize them with firm boundaries. In this paper we have tried to provide a detailed categorization of the clustering algorithms from our perspective. Though we have tried to be as clear as possible, there is still scope for variation. We have covered the different clustering methods with the representative algorithms under each. The planned future work is to perform a detailed analysis of the major clustering algorithms and find the best algorithm for delivering documents to the user.
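As a concrete illustration of comparing two clusterings by cohesion and separation, the sketch below computes WSS and BSS for two labelings of the same points. It is a minimal sketch under stated assumptions, not a method from this paper: the function names, the toy points, and the two labelings are hypothetical, and distances are taken to be squared Euclidean distances to cluster centroids.

```python
from collections import defaultdict

def mean(points):
    """Component-wise mean of a non-empty list of equal-length points."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def group(points, labels):
    """Map each cluster label to the list of points assigned to it."""
    clusters = defaultdict(list)
    for p, lab in zip(points, labels):
        clusters[lab].append(p)
    return clusters

def wss(points, labels):
    """Within-cluster sum of squares (cohesion); lower means tighter clusters."""
    total = 0.0
    for members in group(points, labels).values():
        centroid = mean(members)
        total += sum(sq_dist(p, centroid) for p in members)
    return total

def bss(points, labels):
    """Between-cluster sum of squares (separation); higher means better separated."""
    overall = mean(points)
    total = 0.0
    for members in group(points, labels).values():
        total += len(members) * sq_dist(mean(members), overall)
    return total

# Hypothetical toy data: two tight groups far apart on a line.
points = [(1.0, 0.0), (2.0, 0.0), (8.0, 0.0), (9.0, 0.0)]
labels_a = [0, 0, 1, 1]  # clustering A keeps neighbours together
labels_b = [0, 1, 0, 1]  # clustering B mixes the two groups

for name, labels in (("A", labels_a), ("B", labels_b)):
    print(name, wss(points, labels), bss(points, labels))
# prints: A 1.0 49.0
#         B 49.0 1.0
```

Clustering A has the lower WSS and the higher BSS, so by these two measures it is the better of the two. The sum WSS + BSS is 50.0 in both cases, since it depends only on the data.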