
www.ijraset.com Vol. 2 Issue V, May 2014

ISSN: 2321-9653

INTERNATIONAL JOURNAL FOR RESEARCH IN APPLIED SCIENCE AND ENGINEERING TECHNOLOGY (IJRASET)

Review Paper on Clustering and Validation Techniques
Jyoti, Neha Kaushik, Rekha

Abstract— Clustering is important in data analysis and data mining applications. It is the task of grouping a set of objects so
that objects in the same group are more similar to each other than to those in other groups (clusters).

The overall goal of the data mining process is to extract information from a large data set and transform it into an understandable form for further use. Clustering can be done by a number of different algorithms, such as hierarchical, partitioning, grid-based and density-based algorithms. Hierarchical clustering is connectivity-based clustering. Partitioning is centroid-based clustering, in which the value of k is set in advance, as in k-means. Clustering has been applied to serve various purposes, such as gaining insight into the data distribution, generating hypotheses, observing characteristics and finding anomalies. The intention of this paper is to provide a categorization of some well-known clustering algorithms. It also describes the clustering process and gives an overview of the different clustering methods. The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Validation compares the results of two clusterings to find the better one.

Index Terms— Data mining, clustering process, Categorization of Clustering, validation etc.

I. INTRODUCTION

The purpose of data mining is to extract information from a large data set and transform it into an understandable form for further use. Clustering is a significant task in data analysis and data mining applications. It is the task of grouping a set of objects so that objects in the same group are more similar to each other than to those in other groups (clusters). Data mining proceeds through various phases. Mining can be done using supervised or unsupervised learning; clustering is unsupervised learning. A good clustering method produces high-quality clusters with high intra-class similarity and low inter-class similarity. Clustering algorithms can be categorized into partition-based, hierarchical, density-based and grid-based algorithms. A partitioning clustering algorithm splits the data points into k partitions, where each partition represents a cluster. Hierarchical clustering divides the dataset by constructing a hierarchy of clusters. Density-based algorithms find clusters by growing regions of high density; they are one-scan algorithms. Grid-based algorithms use a multi-resolution grid data structure and form clusters from dense grid cells; their main distinguishing feature is fast processing time. This survey paper presents an analysis of clustering and validation techniques. In data mining, data is mined by passing through the steps shown in Figure 1.

Figure 1: Phases of Data Mining

In data mining two types of learning are used: supervised learning and unsupervised

learning.

a) Supervised Learning

In supervised training, the data includes both the input and the desired results. It is a fast and accurate technique. The correct results are known and are given as input to the model during the learning procedure. Supervised models include neural networks, the Multilayer Perceptron and decision trees.

b) Unsupervised Learning

The unsupervised model is not provided with the correct results during training. It can be used to cluster the input data into classes on the basis of their statistical properties only. Unsupervised models include different types of clustering, distances and normalization, k-means, and self-organizing maps.

II. CLUSTERING PROCESS

The overall process of cluster analysis is shown in Fig. 2. It involves four basic steps, as explained below.

A. Feature Selection or Extraction

Feature selection is the process of identifying the most effective subset of the original features to use in clustering, whereas feature extraction is the process of transforming one or more input features to produce new salient features. The clustering process is highly dependent on this step. Improper selection of features increases complexity and may also result in irrelevant clusters.

B. Clustering Algorithm Design or Selection

The impossibility theorem [12] states that “no single clustering algorithm simultaneously satisfies the three basic axioms of data clustering, i.e., scale-invariance, consistency and richness”. Thus it is impossible to develop a generalized framework of clustering methods for application in the different scientific, social, medical and other fields. It is therefore very important to select the algorithm carefully by applying domain knowledge. Generally, all algorithms are based on different input parameters, such as the number of clusters, the optimization/construction criterion, the termination condition, the proximity measure, etc. These parameters and criteria are also designed or selected as a prerequisite of this step.

C. Cluster Validation

As there is no universal algorithm for clustering, different clustering algorithms applied to the same dataset produce different results. Even the same algorithm, with different parameter values, produces different clusters. Therefore it becomes necessary to validate or evaluate the results produced by the clustering method. The evaluation criteria are categorized as:

1) Internal indices: The internal indices evaluate the clusters produced by the clustering algorithm using the data only.

2) External indices: The external indices evaluate the clustering results by using prior knowledge, e.g. class labels.

3) Relative indices: As the name suggests, these criteria compare the results against other results produced by different algorithms.

D. Results Interpretation

The last step of the clustering process deals with the representation of the clusters. The ultimate goal of clustering is to provide users with meaningful insights from the original data, so that they can effectively analyze and solve their problems. This is still a largely unexplored area of research.

III. CATEGORIZATION OF CLUSTERING METHODS

There is a difference between a clustering method and a clustering algorithm. A clustering method is a general strategy applied to solve a clustering problem, whereas a clustering algorithm is simply an instance of a method. As mentioned earlier, no algorithm exists that satisfies all the requirements of clustering, and therefore a large number of clustering methods have been proposed to date, each with a particular intention, such as an application, a data type, or a specific requirement. All clustering algorithms can basically be categorized into two broad categories, partitioning and hierarchical, based on the properties of the generated clusters. The different algorithms proposed may follow good features of different methodologies, and thus it is difficult to categorize them with a solid boundary. The detailed categorization of the clustering algorithms is given in Figure 2. Though we have tried to provide as much clarity as possible, there is still scope for variation. An overview of each category is discussed below.

A. Hierarchical Methods

As the name suggests, hierarchical methods in general try to decompose the dataset of n objects into a hierarchy of groups. This hierarchical decomposition can be represented by a tree-structured diagram called a dendrogram, whose root node represents the whole dataset and each leaf node is a single object of the dataset. The clustering results can be obtained by cutting the dendrogram at different levels. There are two general approaches to the hierarchical method: agglomerative (bottom-up) and divisive (top-down).

Hierarchical clustering is classified as

A. Agglomerative
B. Divisive

Fig 2. Agglomerative and Divisive Clustering

Agglomerative clustering

It is also known as AGNES. It is a bottom-up approach. This method constructs the tree of clusters, i.e. nodes. The criteria used in this method for clustering the data are minimum distance, maximum distance, average distance and center distance. The steps of this method are:

(1) Initially all the objects are clusters, i.e. leaves.

(2) It recursively merges the nodes (clusters) that have the maximum similarity between them.

(3) At the end of the process all the nodes belong to the same cluster, known as the root of the tree structure.

Divisive clustering

It is also known as DIANA. It is a top-down approach. It was introduced in Kaufman and Rousseeuw (1990). It is the inverse of the agglomerative method. Starting from the root node (cluster), the clusters are split step by step until each node forms a cluster (leaf) on its own.

Advantages of hierarchical clustering [2]

(1) Embedded flexibility with regard to the level of granularity.

(2) Ease of handling any form of similarity or distance.

(3) Applicability to any attribute type.

Disadvantages of hierarchical clustering [2]

(1) Vagueness of termination criteria.

(2) Most hierarchical algorithms do not revisit clusters once constructed with the purpose of improvement.

Validation

The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Validation compares the results of two clusterings to find the better one. In the validation step we use BSS, which stands for Between Sum of Squared Error, and WSS, which stands for Within Sum of Squared Error.

WSS:
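A minimal sketch of how WSS and its counterpart BSS are commonly computed. It assumes squared Euclidean distance and cluster centroids; the helper names (`centroid`, `sq_dist`, `wss_bss`) and the four sample points are hypothetical and not taken from this paper. Under these assumptions WSS sums, over all clusters, the squared distance from each point to its cluster centroid, and BSS sums, over all clusters, the cluster size times the squared distance from the cluster centroid to the centroid of the whole dataset.

```python
from typing import List, Sequence, Tuple

Point = Sequence[float]


def centroid(points: List[Point]) -> List[float]:
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def wss_bss(clusters: List[List[Point]]) -> Tuple[float, float]:
    """Return (WSS, BSS) for a clustering given as a list of clusters."""
    all_points = [p for cluster in clusters for p in cluster]
    overall = centroid(all_points)  # centroid of the whole dataset
    wss = 0.0
    bss = 0.0
    for cluster in clusters:
        m = centroid(cluster)
        # Cohesion: spread of the points around their own centroid.
        wss += sum(sq_dist(p, m) for p in cluster)
        # Separation: size-weighted distance of the centroid from the overall centroid.
        bss += len(cluster) * sq_dist(m, overall)
    return wss, bss


# Hypothetical 1-D data: two candidate clusterings of the same four points.
good = [[(1.0,), (2.0,)], [(4.0,), (5.0,)]]
poor = [[(1.0,), (5.0,)], [(2.0,), (4.0,)]]

wss_good, bss_good = wss_bss(good)
wss_poor, bss_poor = wss_bss(poor)
print("good:", wss_good, bss_good)
print("poor:", wss_poor, bss_poor)
```

With these definitions WSS + BSS is constant for a fixed dataset, so when two clusterings of the same data are compared, the one with the lower WSS (equivalently the higher BSS) is the tighter and better separated; in this example that is `good`.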

Cluster Cohesion: Measures how closely related the objects in a cluster are.

WSS stands for Within Sum of Squared Error.

BSS:

Cluster Separation: Measures how distinct or well-separated a cluster is from other clusters.

BSS stands for Between Sum of Squared Error.

Validation techniques are used to find the best clustering.

Different Aspects of Cluster Validation

1) Determining the clustering tendency of a set of data, i.e. distinguishing whether non-random structure actually exists in the data.

2) Comparing the results of a cluster analysis to externally known results, e.g., to externally given class labels.

3) Evaluating how well the results of a cluster analysis fit the data without reference to external information, i.e. using only the data.

4) Comparing the results of two different sets of cluster analyses to determine which is better.

5) Determining the ‘correct’ number of clusters.

For 2, 3, and 4, we can further distinguish whether we want to evaluate the entire clustering or just individual clusters.

AIM AND OBJECTIVE:

The main objective of my thesis is the delivery of documents using a hierarchical clustering technique.

Increase document retrieval efficiency

Compare the sets of clusters and validate the clusters using the BSS/WSS technique

Solve the outlier problem

Rank the documents according to their relevancy and deliver them to the user.

Conclusions and Directions of Future Research

Given the explosion of information on the World Wide Web, this thesis proposed a new method of summarization via a soft clustering algorithm. It used the Google search engine to extract relevant documents, mixed the query sentence into the document set segmented from the multi-document set, and then created an efficient hierarchical clustering to cluster all the documents. There is still a lot of room for improvement. For example, readability is an important aspect of the performance of multi-document summarization. In future work, we will consider new soft clustering algorithms to further improve the efficiency of clustering. Cluster analysis is the process of grouping objects into clusters, each consisting of objects that are similar to each other within the cluster and dissimilar to the objects in other clusters. With the application of clustering in almost every field of science and technology, a large number of clustering algorithms have been proposed which satisfy certain criteria such as arbitrary shapes, high-dimensional databases, domain knowledge and so on. It has also been proved that it is not possible to design a single clustering algorithm which fulfils all the requirements of clustering. Therefore, a number of methods have been proposed, such as partitioning, hierarchical, density-based, model-based and so on. Different algorithms may follow good features of one or more methods, and thus it is difficult to categorize them with a solid boundary. In this paper we have tried to provide a detailed categorization of the clustering algorithms from our perspective. Though we have tried to provide as much clarity as possible, there is still scope for variation. In this paper we have covered the detailed categorization of the different clustering methods with the representative algorithms under each. The planned future work is to perform a detailed analysis of the major clustering algorithms and find the best algorithm for delivering documents to the user.

ACKNOWLEDGEMENT

I would like to thank Ms. Neha Kaushik and the Department of Computer Science & Engineering of DITM Institute, Gannur, Sonepat, India.

REFERENCE

[1] A. K. Jain, M. N. Murty, P. J. Flynn, “Data Clustering: A Review”, ACM Computing Surveys, vol. 31, pp. 264-323, Sep. 1999.

[2] O. A. Abbas, “Comparisons between Data Clustering Algorithms”, The Int. Journal of Info. Tech., vol. 5, pp. 320-325, Jul. 2008.

[3] Dr. E. Chandra, V. P. Anuradha, “A Survey on Clustering Algorithms for Data in Spatial Database Management Systems”, International Journal of Computer Application, vol. 24, pp. 19-26.

[4] Oded Maimon, Lior Rokach, “Data Mining and Knowledge Discovery Handbook”, Springer Science+Business Media, Inc., pp. 321-352, 2005.

[5] Bharati M Ramager, “Data Mining techniques and Applications”, International Journal of Computer Science and Engineering, Vol. 8, December 2009.

[6] Sonali Agarwal, Neera Singh, Dr. G. N. Pandey, “Implementation of Data Mining and Data Warehouse in E-Governance”, International Journal of Computer Applications (IJCA) (0975-8887), Vol. 9, No. 4, November 2010.

[7] Calinski, T., Harabasz, J. (1974). A dendrite method for cluster analysis. Commun Stat Theory Methods 3(1):1–27.

[8] Ng, R. and Han, J. (1994). Efficient and Effective Clustering Methods for Spatial Data Mining. In Proceedings of the 20th VLDB Conference, Santiago, Chile.

[9] D. R. Cutting, J. O. Pedersen, D. R. Karger, and J. W. Tukey. Scatter/Gather: A cluster-based approach to browsing large document collections. In Proceedings of the ACM SIGIR, 1992.

[10] Willet, Peter, “Parallel Database Processing, Text Retrieval and Cluster Analyses”, Pitman Publishing, London, 1990.
