Clustering Methods
Lior Rokach
Department of Industrial Engineering
Tel-Aviv University
liorr@eng.tau.ac.il
Oded Maimon
Department of Industrial Engineering
Tel-Aviv University
[email protected]
Abstract This chapter presents a tutorial overview of the main clustering methods used
in Data Mining. The goal is to provide a self-contained review of the concepts
and the mathematics underlying clustering techniques. The chapter begins by
providing measures and criteria that are used for determining whether two
objects are similar or dissimilar. Then the clustering methods are presented,
divided into: hierarchical, partitioning, density-based, model-based, grid-based,
and soft-computing methods. Following the methods, the challenges of
performing clustering in large data sets are discussed. Finally, the chapter presents
how to determine the number of clusters.
1. Introduction
Clustering and classification are both fundamental tasks in Data Mining.
Classification is used mostly as a supervised learning method, whereas clustering
is used for unsupervised learning (some clustering models serve both purposes).
The goal of clustering is descriptive; that of classification is predictive
(Veyssieres and Plant, 1998). Since the goal of clustering is to discover a new
set of categories, the new groups are of interest in themselves, and their
assessment is intrinsic. In classification tasks, however, an important part of
the assessment is extrinsic, since the groups must reflect some reference set of
classes. "Understanding our world requires conceptualizing the similarities and
differences between the entities that compose it" (Tyron and Bailey, 1970).
322 DATA MINING AND KNOWLEDGE DISCOVERY HANDBOOK
Clustering groups data instances into subsets in such a manner that similar
instances are grouped together, while different instances belong to different
groups. The instances are thereby organized into an efficient representation
that characterizes the population being sampled. Formally, the clustering
structure is represented as a set of subsets $C = C_1, \ldots, C_k$ of $S$, such that:
$S = \bigcup_{i=1}^{k} C_i$ and $C_i \cap C_j = \emptyset$ for $i \neq j$. Consequently, any instance in $S$
belongs to exactly one and only one subset.
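The two conditions in the formal definition (the subsets cover S, and no two subsets overlap) can be checked mechanically. The following is a minimal sketch in Python, not part of the original chapter; the function name is_hard_partition and the toy data are illustrative assumptions.

```python
from itertools import combinations


def is_hard_partition(S, clusters):
    """Check the two conditions of the formal definition (hypothetical helper).

    S        -- iterable of hashable instances
    clusters -- iterable of iterables, the candidate subsets C_1, ..., C_k
    """
    S = set(S)
    clusters = [set(c) for c in clusters]
    # Condition 1: the union of all C_i equals S.
    covers = set().union(*clusters) == S
    # Condition 2: C_i and C_j are disjoint for every pair i != j.
    disjoint = all(a.isdisjoint(b) for a, b in combinations(clusters, 2))
    return covers and disjoint


# Toy data, assumed for illustration only.
S = {1, 2, 3, 4, 5}
valid = [{1, 2}, {3}, {4, 5}]          # covers S, no overlaps
overlapping = [{1, 2}, {2, 3}, {4, 5}]  # instance 2 appears twice
incomplete = [{1, 2}, {4, 5}]          # instance 3 is unassigned

print(is_hard_partition(S, valid))
print(is_hard_partition(S, overlapping))
print(is_hard_partition(S, incomplete))
```

Only the first candidate satisfies both conditions. The sketch follows the definition literally and therefore does not reject empty subsets; a practical check might also require each subset to be non-empty.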
Clustering of objects is as ancient as the human need for describing the
salient characteristics of men and objects and identifying them with a type.
Therefore, it embraces various scientific disciplines: from mathematics and
statistics to biology and genetics, each of which uses different terms to describe
the topologies formed using this analysis. From biological "taxonomies", to
medical "syndromes" and genetic "genotypes", to manufacturing "group
technology", the problem is identical: forming categories of entities and
assigning individuals to the proper groups within them.
2. Distance Measures
Since clustering is the grouping of similar instances/objects, some sort of
measure that can determine whether two objects are similar or dissimilar is
required. There are two main types of measures used to estimate this relation:
distance measures and similarity measures.
Many clustering methods use distance measures to determine the similarity
or dissimilarity between any pair of objects. It is useful to denote the distance
between two instances $x_i$ and $x_j$ as $d(x_i, x_j)$. A valid distance measure should
be symmetric and obtain its minimum value (usually zero) in the case of identical
vectors. The distance measure is called a metric distance measure if it also
satisfies the following properties: