
www.ijraset.com Vol. 2 Issue V, May 2014

ISSN: 2321-9653

INTERNATIONAL JOURNAL FOR RESEARCH IN APPLIED SCIENCE AND ENGINEERING TECHNOLOGY (IJRASET)

Review Paper on Clustering and Validation Techniques

Jyoti, Neha Kaushik, Rekha

Abstract— Clustering is important in data analysis and data mining applications. It is the task of grouping a set of objects so that objects in the same group (cluster) are more similar to each other than to those in other groups. The overall goal of the data mining process is to extract information from a large data set and transform it into an understandable form for further use. Clustering can be performed by a number of algorithms, such as hierarchical, partitioning, grid-based and density-based algorithms. Hierarchical clustering is connectivity-based clustering. Partitioning is centroid-based clustering, in which the number of clusters k (as in k-means) is set in advance. Clustering has been applied to serve various purposes, such as gaining insight into the data distribution, generating hypotheses, observing characteristics and finding anomalies. The intention of this paper is to provide a categorization of some well-known clustering algorithms. It also describes the clustering process and gives an overview of the different clustering methods. The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Validation compares the results of two clusterings to find the better one.

Index Terms— Data mining, clustering process, categorization of clustering, validation.

I. INTRODUCTION

The purpose of data mining is to extract information from a large data set and transform it into an understandable form for further use. Clustering is a significant task in data analysis and data mining applications. It is the task of arranging a set of objects so that objects in the same group (cluster) are more similar to each other than to those in other groups. Data mining proceeds through several phases. Mining can be done using supervised or unsupervised learning; clustering is unsupervised learning. A good clustering method produces high-quality clusters with high intra-class similarity and low inter-class similarity. Clustering algorithms can be categorized into partition-based, hierarchical-based, density-based and grid-based algorithms. A partitioning clustering algorithm splits the data points into k partitions, where each partition represents a cluster. Hierarchical clustering is a technique that divides the dataset by constructing a hierarchy of clusters. Density-based algorithms find clusters by growing regions of high density; they are one-scan algorithms. Grid-density-based algorithms use a multi-resolution grid data structure and use dense grid cells to form clusters; their main distinguishing feature is fast processing time. This survey paper presents an analysis of clustering and validation techniques. In data mining, data is mined by passing through the following steps.

Figure 1: Phases of Data Mining

In data mining, two types of learning are used: supervised learning and unsupervised

learning.

a) Supervised Learning

In supervised training, the data includes both the input and the desired results. It is a fast and accurate technique. The correct results are known and are given as input to the model during the learning procedure. Supervised models include neural networks, the Multilayer Perceptron and decision trees.

b) Unsupervised Learning

The unsupervised model is not provided with the correct results during training. It can be used to cluster the input data into classes on the basis of their statistical properties only. Unsupervised models include different types of clustering, distances and normalization, k-means, and self-organizing maps.

II. CLUSTERING PROCESS

The overall process of cluster analysis is shown in Fig. 2. It involves four basic steps as explained below.

A. Feature Selection or Extraction

Feature selection is the process of identifying the most effective subset of the original features to use in clustering, whereas feature extraction is the process of transforming one or more input features to produce new salient features. The clustering process is highly dependent on this step. Improper selection of features increases complexity and may also result in irrelevant clusters.

B. Clustering Algorithm Design or Selection

The impossibility theorem [12] states that “no single clustering algorithm simultaneously satisfies the three basic axioms of data clustering, i.e., scale-invariance, consistency and richness”. Thus it is impossible to develop a generalized framework of clustering methods for application in different scientific, social, medical and other fields. It is therefore very important to select the algorithm carefully by applying domain knowledge. Generally, all algorithms are based on different input parameters, such as the number of clusters, the optimization/construction criterion, the termination condition, the proximity measure, etc. These parameters and criteria are also designed or selected as a prerequisite of this step.

C. Cluster Validation

As there is no universal algorithm for clustering, different clustering algorithms applied to the same dataset produce different results. Even the same algorithm, with different parameter values, produces different clusters. Therefore it becomes necessary to validate or evaluate the result produced by the clustering method. The evaluation criteria are categorized as:

1) Internal indices: Internal indices evaluate the clusters produced by the clustering algorithm using the data only.

2) External indices: External indices evaluate the clustering results using prior knowledge, e.g. class labels.

3) Relative indices: As the name suggests, these criteria compare the results against other results produced by different algorithms.

D. Results Interpretation

The last step of the clustering process deals with the representation of the clusters. The ultimate goal of clustering is to provide users with meaningful insights from the original data, so that they can effectively analyze and solve their problems. This is still a largely unexplored area of research.

III. CATEGORIZATION OF CLUSTERING METHODS

There is a difference between a clustering method and a clustering algorithm. A clustering method is a general strategy applied to solve a clustering problem, whereas a clustering algorithm is simply an instance of a method. As mentioned earlier, no algorithm exists that satisfies all the requirements of clustering, and therefore a large number of clustering methods have been proposed to date, each with a particular intention, such as an application, a data type or a specific requirement. All clustering algorithms can basically be categorized into two broad categories, partitioning and hierarchical, based on the properties of the generated clusters. Different algorithms may combine good features of different methodologies, and thus it is difficult to categorize them with solid boundaries. The detailed categorization of the clustering algorithms is given in figure 2. Though we have tried to provide as much clarity as possible, there is still scope for variation. An overview of each category is discussed below.

A. Hierarchical Methods

As the name suggests, hierarchical methods in general try to decompose the dataset of n objects into a hierarchy of groups. This hierarchical decomposition can be represented by a tree structure diagram called a dendrogram, whose root node represents the whole dataset and each leaf node is a single object of the dataset. The clustering results can be obtained by cutting the dendrogram at different levels. There are two general approaches for the hierarchical method: agglomerative (bottom-up) and divisive (top-down).

Hierarchical clustering is classified as

A. Agglomerative
B. Divisive

Fig 2. Agglomerative and Divisive Clustering

Agglomerative clustering

It is also known as AGNES. It is a bottom-up approach. This method constructs the tree of clusters, i.e. nodes. The criteria used in this method for clustering the data are minimum distance, maximum distance, average distance and center distance. The steps of this method are:

(1) Initially all the objects are clusters, i.e. leaves.

(2) It recursively merges the nodes (clusters) that have the maximum similarity between them.

(3) At the end of the process all the nodes belong to the same cluster, known as the root of the tree structure.

Divisive clustering

It is also known as DIANA. It is a top-down approach. It was introduced in Kaufman and Rousseeuw (1990). It is the inverse of the agglomerative method. Starting from the root node (cluster), the clusters are split step by step until each node forms a cluster (leaf) on its own.

Advantages of hierarchical clustering [2]

(1) Embedded flexibility with regard to the level of granularity.

(2) Ease of handling any form of similarity or distance.

(3) Applicability to any attribute type.

Disadvantages of hierarchical clustering [2]

(1) Vagueness of termination criteria.

(2) Most hierarchical algorithms do not revisit clusters once constructed with the purpose of improvement.

Validation

The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Validation compares the results of two clusterings to find the better one. In the validation step we use BSS (Between Sum of Squared Error) and WSS (Within Sum of Squared Error).

WSS:

Cluster Cohesion: Measures how closely related the objects in a cluster are.

WSS stands for Within Sum of Squared Error.

BSS:

Cluster Separation: Measures how distinct or well-separated a cluster is from other clusters.

BSS stands for Between Sum of Squared Error.

Validation techniques are used to find the best clustering.

Different Aspects of Cluster Validation

1. Determining the clustering tendency of a set of data, i.e. distinguishing whether non-random structure actually exists in the data.

2. Comparing the results of a cluster analysis to externally known results, e.g., to externally given class labels.

3. Evaluating how well the results of a cluster analysis fit the data without reference to external information (using only the data).

4. Comparing the results of two different sets of cluster analyses to determine which is better.

5. Determining the ‘correct’ number of clusters.

For 2, 3, and 4, we can further distinguish whether we want to evaluate the entire clustering or just individual clusters.

AIM AND OBJECTIVE:

The main objective of my thesis is the delivery of documents using a hierarchical clustering technique.

Increase document retrieval efficiency

Compare the sets of clusters and validate them using the BSS/WSS technique

Solve the outlier problem

Rank the documents according to their relevance and deliver them to the user

Conclusions and Directions of Future Research

Given the explosion of information on the World Wide Web, this thesis proposed a new method of summarization via a soft clustering algorithm. It used the Google search engine to extract relevant documents, mixed the query sentence into the document set segmented from the multi-document set, and then applied efficient hierarchical clustering to cluster all the documents. There is still a lot of room for improvement. For example, readability is an important aspect of the performance of multi-document summarization. In future work, we will consider new soft clustering algorithms to further improve the efficiency of clustering. Cluster analysis is the process of grouping objects into clusters, each of which consists of objects that are similar to each other and dissimilar to the objects in other clusters. With the application of clustering in almost every field of science and technology, a large number of clustering algorithms have been proposed which satisfy certain criteria such as arbitrary shapes, high-dimensional databases, domain knowledge and so on. It has also been proved that it is not possible to design a single clustering algorithm which fulfils all the requirements of clustering. Therefore, a number of methods have been proposed, such as partitioning, hierarchical, density-based, model-based and so on. Different algorithms may follow good features of one or more methods, and thus it is difficult to categorize them with solid boundaries. In this paper we have tried to provide a detailed categorization of the clustering algorithms from our perspective. Though we have tried to provide as much clarity as possible, there is still scope for variation. In this paper we have covered the detailed categorization of the different clustering methods with the representative algorithms under each. The future work planned is to perform a detailed analysis of the major clustering algorithms and find the best algorithm for delivering documents to the user.
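As an illustration of the BSS/WSS validation used in this paper, the following minimal Python sketch compares two candidate partitions of the same points. It assumes the commonly used definitions, which are not spelled out in the text: WSS is the sum over clusters of the squared distances from each point to its cluster centroid, and BSS is the sum over clusters of the cluster size times the squared distance from the cluster centroid to the overall mean. The function names and the sample points are hypothetical and chosen only for illustration.

```python
from statistics import fmean


def centroid(points):
    """Component-wise mean of a list of equal-length numeric tuples."""
    return tuple(fmean(dim) for dim in zip(*points))


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def wss(clusters):
    """Cohesion: squared distance of every point to its own cluster centroid."""
    total = 0.0
    for pts in clusters:
        c = centroid(pts)
        total += sum(sq_dist(p, c) for p in pts)
    return total


def bss(clusters):
    """Separation: size-weighted squared distance of each centroid to the overall mean."""
    overall = centroid([p for pts in clusters for p in pts])
    return sum(len(pts) * sq_dist(centroid(pts), overall) for pts in clusters)


# Two hypothetical partitions of the same four points.
good = [[(1.0, 1.0), (1.0, 2.0)], [(8.0, 8.0), (9.0, 8.0)]]
bad = [[(1.0, 1.0), (8.0, 8.0)], [(1.0, 2.0), (9.0, 8.0)]]

for name, clusters in (("good", good), ("bad", bad)):
    print(name, "WSS =", wss(clusters), "BSS =", bss(clusters))
```

Under these definitions WSS + BSS is the same for every partition of a given dataset, so the partition with the lower WSS (tighter clusters) necessarily has the higher BSS (better separation). This is one way to read "find the best cluster" when comparing two clusterings with the same number of clusters.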

ACKNOWLEDGEMENT

I would like to thank Ms. Neha Kaushik and the Department of Computer Science & Engineering of DITM Institute, Gannur, Sonepat, India.

REFERENCES

[1] A. K. Jain, M. N. Murty, P. J. Flynn, “Data Clustering: A Review”, ACM Computing Surveys, vol. 31, pp. 264-323, Sep. 1999.

[2] O. A. Abbas, “Comparisons between Data Clustering Algorithms”, The Int. Journal of Info. Tech., vol. 5, pp. 320-325, Jul. 2008.

[3] Dr. E. Chandra, V. P. Anuradha, “A Survey on Clustering Algorithms for Data in Spatial Database Management Systems”, International Journal of Computer Application, vol. 24, pp. 19-26.

[4] Oded Maimon, Lior Rokach, “Data Mining and Knowledge Discovery Handbook”, Springer Science+Business Media, Inc., pp. 321-352, 2005.

[5] Bharati M Ramager, “Data Mining Techniques and Applications”, International Journal of Computer Science and Engineering, Vol. 8, December 2009.

[6] Sonali Agarwal, Neera Singh, Dr. G. N. Pandey, “Implementation of Data Mining and Data Warehouse in E-Governance”, International Journal of Computer Applications (IJCA) (0975-8887), Vol. 9, No. 4, November 2010.

[7] T. Calinski, J. Harabasz, “A dendrite method for cluster analysis”, Commun Stat Theory Methods 3(1):1–27, 1974.

[8] R. Ng, J. Han, “Efficient and Effective Clustering Methods for Spatial Data Mining”, in Proceedings of the 20th VLDB Conference, Santiago, Chile, 1994.

[9] D. R. Cutting, J. O. Pedersen, D. R. Karger, J. W. Tukey, “Scatter/Gather: A cluster-based approach to browsing large document collections”, in Proceedings of the ACM SIGIR, 1992.

[10] Peter Willet, “Parallel Database Processing, Text Retrieval and Cluster Analyses”, Pitman Publishing, London, 1990.

