
Introduction to Cluster Analysis
Unit 2 : Chapter 2
Contents
• Classification v/s Clustering
• Clustering
• Types of data in cluster analysis
• Clustering Methods
• Partitioning Methods
• Hierarchical Methods
• Density-Based Methods
Classification v/s Clustering
Clustering
• What Is Cluster Analysis?
• Cluster analysis or simply clustering is the process of partitioning a set of data
objects (or observations) into subsets. Each subset is a cluster, such that objects
in a cluster are similar to one another, yet dissimilar to objects in other
clusters.
• The set of clusters resulting from a cluster analysis can be referred to as a
clustering.
• It is a process of grouping a set of data objects into multiple groups or clusters
so that objects within a cluster have high similarity, but are very dissimilar to
objects in other clusters.
Applications
• Cluster analysis is broadly used in many applications such as market
research, pattern recognition, data analysis, and image processing.
• Clustering can also help marketers discover distinct groups in their customer
base, and characterize those customer groups based on purchasing patterns.
• In the field of biology, it can be used to derive plant and animal
taxonomies, categorize genes with similar functionalities, and gain insight
into structures inherent to populations.
• Clustering is also used in outlier detection applications such as detection of credit
card fraud.
• Clustering also helps in identification of areas of similar land use in an earth
observation database. It also helps in the identification of groups of houses in a
city according to house type, value, and geographic location.
• As a data mining function, cluster analysis serves as a tool to gain insight into
the distribution of data and to observe the characteristics of each cluster.

Requirements for Cluster Analysis

Scalability:
• Many clustering algorithms work well on small data sets containing fewer
than several hundred data objects; however, a large database may contain
millions or even billions of objects, particularly in Web search scenarios.
• Clustering on only a sample of a given large data set may lead to biased results.
Therefore, highly scalable clustering algorithms are needed.

Ability to deal with different types of attributes:
• Many algorithms are designed to cluster numeric (interval-based) data. However,
applications may require clustering other data types, such as binary, nominal
(categorical), and ordinal data, or mixtures of these data types.
Requirements for domain knowledge to determine input parameters:
• Many clustering algorithms require users to provide domain knowledge in the
form of input parameters such as the desired number of clusters. Consequently, the
clustering results may be sensitive to such parameters.
• Parameters are often hard to determine, especially for high-dimensionality data
sets and where users have yet to grasp a deep understanding of their data.

Ability to deal with noisy data:


• Most real-world data sets contain outliers and/or missing, unknown, or erroneous
data. Clustering algorithms can be sensitive to such noise and may produce
poor-quality clusters. Therefore, we need clustering methods that are robust to
noise.
Discovery of clusters with arbitrary shape:
• Many clustering algorithms determine clusters based on Euclidean or Manhattan
distance measures. Algorithms based on such distance measures tend
to find spherical clusters with similar size and density. It is important to
develop algorithms that can detect clusters of arbitrary shape.
Incremental clustering and insensitivity to input order:
• In many applications, incremental updates (representing newer data)
may arrive at any time. Some clustering algorithms cannot incorporate incremental
updates into existing clustering structures and, instead, have to recompute a new
clustering from scratch.
• Clustering algorithms may also be sensitive to the input data order.
Incremental clustering algorithms and algorithms that are insensitive to the
input order are needed.
• Clustering algorithms typically operate on either of the following two data
structures:
Data matrix
Dissimilarity matrix
Data Matrix

$$
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
$$

• This represents n objects, such as persons, with p
variables (measurements or attributes), such as age,
height, weight, gender, and so on.
• The structure is in the form of a relational table, or
n-by-p matrix (n objects × p variables).
Dissimilarity Matrix

• It is often represented by an n-by-n table (matrix), where d(i, j) is the
measured difference or dissimilarity between objects i and j.
• In general, d(i, j) is a nonnegative number that is
close to 0 when objects i and j are highly similar or “near” each
other, and becomes larger the more they differ.
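To make the two structures concrete, here is a minimal sketch (assuming NumPy and SciPy; the objects and attribute values are invented for illustration) that builds an n-by-p data matrix and derives the corresponding n-by-n dissimilarity matrix using Euclidean distance.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Data matrix: n = 4 objects (e.g., persons) described by p = 2 variables
# (age in years, height in cm). Values are made up for illustration.
X = np.array([
    [25, 170.0],
    [30, 165.0],
    [52, 180.0],
    [49, 178.0],
])

# Dissimilarity matrix: d(i, j) = Euclidean distance between objects i and j.
# The diagonal is 0 (d(i, i) = 0) and the matrix is symmetric.
D = squareform(pdist(X, metric="euclidean"))
print(np.round(D, 2))
```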
Types of Data in Cluster Analysis
• Dissimilarity can be computed for
Interval-scaled (numeric) variables
Binary variables
Categorical (nominal) variables
Ordinal variables
Ratio-scaled variables
Mixed-type variables
Interval-valued variables
• Interval-scaled (numeric) variables are continuous
measurements on a roughly linear scale.
• Examples
— weight and height, latitude and longitude
coordinates (e.g., when clustering houses), and weather
temperature.
• The measurement unit used can affect the clustering
analysis.
— For example, changing measurement units from meters to
inches for height, or from kilograms to pounds for
weight, may lead to a very different clustering structure.
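A small hedged illustration of this point, with invented weight/height values: re-expressing height in inches instead of meters changes which objects are nearest under Euclidean distance, which is why such variables are often standardized before clustering.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Invented measurements for three people: weight in kg and height.
weight_kg = np.array([60.0, 62.0, 63.0])
height_m = np.array([1.50, 1.95, 1.52])

X_m = np.column_stack([weight_kg, height_m])           # height in meters
X_in = np.column_stack([weight_kg, height_m * 39.37])  # height in inches

# In meters, object 0 is nearest to object 1; in inches, the height
# differences dominate and object 0 becomes nearest to object 2.
print(squareform(pdist(X_m)))
print(squareform(pdist(X_in)))
```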
Binary Variables
• A binary variable has only two states: 0 or 1, where 0
means that the variable is absent, and 1 means that it
is present.
• Given the variable smoker describing a patient,
— 1 indicates that the patient smokes
— 0 indicates that the patient does not.
o Treating binary variables as if they are interval-scaled
can lead to misleading clustering results.
• Therefore, methods specific to binary data are
necessary for computing dissimilarities.
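As a hedged sketch of such binary-specific methods (not code from the slides), the function below computes the simple matching dissimilarity for symmetric binary variables and a Jaccard-style dissimilarity for asymmetric ones, where negative (0-0) matches are ignored; the patient vectors are invented.

```python
import numpy as np

def binary_dissimilarity(x, y, asymmetric=False):
    """Dissimilarity between two 0/1 vectors via their contingency counts."""
    x, y = np.asarray(x, dtype=bool), np.asarray(y, dtype=bool)
    q = np.sum(x & y)    # both 1
    r = np.sum(x & ~y)   # 1 in x, 0 in y
    s = np.sum(~x & y)   # 0 in x, 1 in y
    t = np.sum(~x & ~y)  # both 0
    if asymmetric:
        # asymmetric binary variables: 0-0 matches carry no information
        return (r + s) / (q + r + s) if (q + r + s) else 0.0
    return (r + s) / (q + r + s + t)

# Two patients described by five binary attributes (smoker, fever, ...).
a, b = [1, 0, 1, 0, 0], [1, 1, 0, 0, 0]
print(binary_dissimilarity(a, b))                   # symmetric: 0.4
print(binary_dissimilarity(a, b, asymmetric=True))  # asymmetric: ~0.67
```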
Categorical Variables
• A categorical (nominal) variable is a generalization
of the binary variable in that it can take on more than
two states.
— Example: map_color is a categorical variable that may
have five states: red, yellow, green, pink, and blue.
• The states can be denoted by letters, symbols, or a
set of integers.
Ordinal Variables
• A discrete ordinal variable resembles a categorical
variable, except that the M states of the ordinal
value are ordered in a meaningful sequence.
— Example: professional ranks are often enumerated
in a sequential order, such as assistant, associate, and
full for professors.
• Ordinal variables may also be obtained from the discretization
of interval-scaled quantities by splitting the value range into a
finite number of classes.
• The values of an ordinal variable can be mapped to ranks.
— Example: suppose that an ordinal variable f has M states.
These ordered states define the ranking 1, ..., M.
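A minimal sketch of this rank mapping, using the professor-rank example; the helper name and the scaling of ranks onto [0, 1] via z = (r - 1)/(M - 1) follow the usual treatment of ordinal variables and are written here only for illustration.

```python
# Ordered states of the ordinal variable "professional rank".
states = ["assistant", "associate", "full"]
rank = {s: i + 1 for i, s in enumerate(states)}  # assistant=1, associate=2, full=3
M = len(states)

def to_unit_interval(value):
    """Map an ordinal state to [0, 1] so interval-scaled distances can be used."""
    return (rank[value] - 1) / (M - 1)

print(to_unit_interval("assistant"))  # 0.0
print(to_unit_interval("associate"))  # 0.5
print(to_unit_interval("full"))       # 1.0
```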
Ratio-Scaled Variables

Variables of Mixed Types
• A database may contain different types of
variables
— interval-scaled, symmetric binary, asymmetric binary,
nominal, and ordinal
• We can combine the different variables into a
single dissimilarity matrix, bringing all of the
meaningful variables onto a common scale of the
interval [0.0, 1.0].
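The sketch below is a simplified, hedged version of that idea for a mix of numeric and nominal variables: each variable contributes a dissimilarity in [0, 1] (numeric differences are divided by the variable's range; nominal variables contribute 0 on a match and 1 otherwise) and the contributions are averaged. The column types and values are invented.

```python
import numpy as np

def mixed_dissimilarity(a, b, types, ranges):
    """types[f] is 'num' or 'nom'; ranges[f] is max - min for numeric variables."""
    contribs = []
    for f, t in enumerate(types):
        if t == "num":
            contribs.append(abs(a[f] - b[f]) / ranges[f])  # scaled to [0, 1]
        else:
            contribs.append(0.0 if a[f] == b[f] else 1.0)  # nominal mismatch
    return float(np.mean(contribs))

# e.g. two customers described by age (numeric) and map_color (nominal).
types, ranges = ["num", "nom"], [50.0, None]
print(mixed_dissimilarity((25, "red"), (45, "blue"), types, ranges))  # (0.4 + 1.0) / 2 = 0.7
```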
Clustering Methods
Partitioning methods:
• Given a set of n objects, a partitioning method constructs k partitions of the data,
where each partition represents a cluster and k ≤ n.
• That is, it divides the data into k groups such that each group must contain at least one
object.
• In other words, partitioning methods conduct one-level partitioning on data
sets.
• The basic partitioning methods typically adopt exclusive cluster separation; that
is, each object must belong to exactly one group.


Algorithm: k-means. The k-means algorithm for partitioning, where each cluster's
center is represented by the mean value of the objects in the cluster.

Input:
k: the number of clusters,
D: a data set containing n objects.

Output: A set of k clusters.

Method:
(1) arbitrarily choose k objects from D as the initial cluster centers;
(2) repeat
(3) (re)assign each object to the cluster to which the object is the most similar,
based on the mean value of the objects in the cluster;
(4) update the cluster means, that is, calculate the mean value of the objects for
each cluster;
(5) until no change;
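The following NumPy sketch mirrors the pseudocode above; it is an illustrative implementation, not the textbook's code, and the toy data set is generated at random.

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()  # step (1)
    labels = np.full(len(X), -1)
    for _ in range(max_iter):                                      # (2) repeat
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)                          # (3) (re)assign to nearest mean
        if np.array_equal(new_labels, labels):                     # (5) until no change
            break
        labels = new_labels
        for j in range(k):                                         # (4) update the cluster means
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

X = np.vstack([np.random.default_rng(1).normal(m, 0.3, (20, 2)) for m in (0, 3, 6)])
labels, centers = k_means(X, k=3)
print(np.round(centers, 2))
```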
Problem
• Refer class notes
How can we make the k-means algorithm more
scalable?
• One approach to making the k-means method more efficient on large data sets is to
use a good-sized set of samples in clustering.

• Another is to employ a filtering approach that uses a spatial hierarchical data


index to save costs when computing means.
• A third approach explores the micro-clustering idea, which first groups nearby
objects into “micro-clusters” and then performs k-means clustering on the micro-
clusters.
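Purely as an illustration of the first (sampling) idea, the sketch below fits k-means on a random sample and then assigns every object in the full data set to its nearest sample-derived center; scikit-learn's KMeans is used here only for convenience, and the sample size is an arbitrary choice.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_on_sample(X, k, sample_size=10_000, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=min(sample_size, len(X)), replace=False)
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X[idx])  # cluster the sample
    return km.predict(X), km.cluster_centers_                            # label all n objects

X = np.random.default_rng(1).normal(size=(50_000, 2))
labels, centers = kmeans_on_sample(X, k=5)
```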

What Is the Problem of the K-Means Method?
Table 2. Advantages and limitations of the k-means algorithm

Advantages | Limitations
Relatively efficient and easy to implement. | Sensitive to initialization.
Terminates at a local optimum. | Limiting case of fixed data.
Applies even to large data sets. | Difficult to compare clusterings with different numbers of clusters.
The clusters are non-hierarchical and they do not overlap. | Needs the number of clusters to be specified in advance.
With a large number of variables, k-means may be computationally faster than hierarchical clustering. | Unable to handle noisy data or outliers.
K-means may produce tighter clusters than hierarchical clustering, especially if the clusters are globular. | Not suitable for discovering clusters with non-convex shapes.
The K-Medoids Clustering Method: A Representative Object-Based Technique
1. Initialize: select k random points out of the n data points as the
medoids.
2. Associate each data point with the closest medoid by using any
common distance metric.
3. While the cost decreases: for each medoid m and for each data point o
that is not a medoid:
4. Swap m and o, associate each data point with the closest
medoid, and recompute the cost.
5. If the total cost is more than that in the previous step, undo
the swap.
Algorithm: k-medoids. PAM, a k-medoids algorithm for partitioning based on
medoid or central objects.

Input:
k: the number of clusters,
D: a data set containing n objects.

Output: A set of k clusters.

Method:
(1) arbitrarily choose k objects in D as the initial representative objects or seeds;
(2) repeat
(3) assign each remaining object to the cluster with the nearest representative object;
(4) randomly select a nonrepresentative object, o_random;
(5) compute the total cost, S, of swapping representative object o_j with o_random;
(6) if S < 0 then swap o_j with o_random to form the new set of k representative objects;
(7) until no change;
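Below is a compact, illustrative PAM-style sketch (not the textbook's code): the total cost is the sum of distances from each object to its nearest medoid, and a (medoid, non-medoid) swap is kept only if it lowers that cost; the data set is random.

```python
import numpy as np
from scipy.spatial.distance import cdist

def total_cost(D, medoids):
    """Sum of distances from every object to its nearest medoid."""
    return D[:, medoids].min(axis=1).sum()

def pam(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    D = cdist(X, X)                                        # pairwise distances
    medoids = list(rng.choice(len(X), size=k, replace=False))
    for _ in range(max_iter):
        best_cost, best_swap = total_cost(D, medoids), None
        for mi in range(k):                                # try every swap (medoid, non-medoid)
            for o in range(len(X)):
                if o in medoids:
                    continue
                cand = medoids.copy()
                cand[mi] = o
                cost = total_cost(D, cand)
                if cost < best_cost:
                    best_cost, best_swap = cost, (mi, o)
        if best_swap is None:                              # no improving swap: stop
            break
        medoids[best_swap[0]] = best_swap[1]
    labels = D[:, medoids].argmin(axis=1)                  # assign to nearest medoid
    return labels, medoids

X = np.random.default_rng(2).normal(size=(60, 2))
labels, medoids = pam(X, k=3)
```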
Which method is more robust: k-means or k-medoids?
• The k-medoids method is more robust than k-means in the presence of noise
and outliers because a medoid is less influenced by outliers or other extreme
values than a mean.

• However, the complexity of each iteration in the k-medoids algorithm is
O(k(n − k)²).

• For large values of n and k, such computation becomes very costly, and much
more costly than the k-means method.
Problems
• Refer class notes
How can we scale up the k-medoids method?
• To deal with larger data sets, a sampling-based method called CLARA
(Clustering LARge Applications) can be used.

• Instead of taking the whole data set into consideration, CLARA uses a
random sample of the data set. The PAM algorithm is then applied to compute
the best medoids from the sample.

• Ideally, the sample should closely represent the original data set. In many cases, a
large sample works well if it is created so that each object has equal probability of
being selected into the sample.
• The representative objects (medoids) chosen will likely be similar to those that
would have been chosen from the whole data set. CLARA builds
multiple random samples and returns the best clustering as the output.
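A hedged CLARA-style sketch, reusing the pam() function from the k-medoids sketch above: draw several random samples, run PAM on each sample, and keep the medoid set with the lowest total cost over the whole data set. The number of samples and the 40 + 2k sample size are commonly cited defaults, used here only as an assumption.

```python
import numpy as np
from scipy.spatial.distance import cdist

def clara(X, k, n_samples=5, sample_size=None, seed=0):
    sample_size = sample_size or (40 + 2 * k)          # commonly cited default
    rng = np.random.default_rng(seed)
    best_medoids, best_cost = None, np.inf
    for s in range(n_samples):
        idx = rng.choice(len(X), size=min(sample_size, len(X)), replace=False)
        _, sample_medoids = pam(X[idx], k, seed=s)     # PAM on the sample only
        medoids = idx[np.asarray(sample_medoids)]      # map back to the full data set
        cost = cdist(X, X[medoids]).min(axis=1).sum()  # evaluate on all n objects
        if cost < best_cost:
            best_cost, best_medoids = cost, medoids
    labels = cdist(X, X[best_medoids]).argmin(axis=1)
    return labels, best_medoids
```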
Hierarchical methods:
• A hierarchical method creates a hierarchical decomposition of the given set of
data objects.
• A hierarchical method can be classified as being either agglomerative or divisive,
based on how the hierarchical decomposition is formed.

• Group data objects into a tree of clusters

Hierarchical methods can be
• Agglomerative: bottom-up approach
• Divisive: top-down approach
• Hierarchical clustering has no backtracking:
if a particular merge or split turns out to be a poor choice, it cannot
be corrected.
• The agglomerative approach, also called the bottom-up approach, starts with
each object forming a separate group. It successively merges the objects or groups
close to one another, until all the groups are merged into one (the topmost level of
the hierarchy), or a termination condition holds.

• The divisive approach, also called the top-down approach, starts with all
the objects in the same cluster. In each successive iteration, a cluster is split
into smaller clusters, until eventually each object is in one cluster, or a
termination condition holds.
Agglomerative Hierarchical Clustering
• Bottom-up strategy
• Each cluster starts with only one object
• Clusters are merged into larger and larger clusters until:
All the objects are in a single cluster
Certain termination conditions are satisfied

Divisive Hierarchical Clustering
• Top-down strategy
• Start with all objects in one cluster
• Clusters are subdivided into smaller and smaller clusters until:
Each object forms a cluster on its own
Certain termination conditions are satisfied
• Hierarchical clustering methods can be distance-based or density- and continuity-
based.
• Hierarchical methods suffer from the fact that once a step (merge or split) is
done, it can never be undone. This rigidity is useful in that it leads to
smaller computation costs by not having to worry about a combinatorial
number of different choices.
• Such techniques cannot correct erroneous decisions; however, methods for
improving the quality of hierarchical clustering have been proposed.
Example
• Agglomerative and divisive algorithms on a data set of five
objects {a, b, c, d, e}

[Figure: agglomerative clustering (AGNES) proceeds from Step 0 to Step 4, successively
merging a, b, c, d, e into a single cluster; divisive clustering (DIANA) proceeds from
Step 4 back to Step 0, splitting the single cluster into singletons.]

• AGNES
Clusters C1 and C2 may be merged if an object in C1 and an object in C2 form
the minimum Euclidean distance between any two objects from different clusters.
• DIANA
A cluster is split according to some principle, e.g., the
maximum Euclidean distance between the closest neighboring
objects in the cluster.
Distance measures
• First measure: Minimum distance
$d_{\min}(C_i, C_j) = \min_{p \in C_i,\, p' \in C_j} |p - p'|$, where $|p - p'|$ is the
distance between two objects p and p'.
• Use cases
An algorithm that uses the minimum distance to measure the
distance between clusters is sometimes called a nearest-neighbor
clustering algorithm.
If the clustering process terminates when the minimum
distance between nearest clusters exceeds an arbitrary
threshold, it is called a single-linkage algorithm.
An agglomerative algorithm that uses the minimum distance
measure is also called a minimal spanning tree algorithm.
• Second measure: Maximum distance
$d_{\max}(C_i, C_j) = \max_{p \in C_i,\, p' \in C_j} |p - p'|$, where $|p - p'|$ is the
distance between two objects p and p'.
• Use cases
An algorithm that uses the maximum distance to measure the
distance between clusters is sometimes called a farthest-neighbor
clustering algorithm.
If the clustering process terminates when the maximum distance
between nearest clusters exceeds an arbitrary threshold,
it is called a complete-linkage algorithm.
• A tree structure called a dendrogram is commonly used to represent the process
of hierarchical clustering.
• It shows how objects are grouped together (in an agglomerative method)
or partitioned (in a divisive method).
• A dendrogram for the five objects shows the
five objects as singleton clusters at level l = 0. At l = 1, objects a and b are grouped
together to form the first cluster, and they stay together at all subsequent levels.

[Figure: Dendrogram representation for hierarchical clustering of data objects (a, b, c, d, e).]
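An illustrative SciPy sketch of this process on five invented 2-D objects a-e: linkage() performs the agglomerative merges (method="single" gives nearest-neighbor/single-linkage, "complete" gives farthest-neighbor/complete-linkage), fcluster() cuts the tree at a chosen level, and dendrogram() draws the tree.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Five invented objects a, b, c, d, e in 2-D.
X = np.array([[1.0, 1.0], [1.2, 1.1], [3.0, 3.0], [3.1, 3.2], [5.0, 1.0]])

Z = linkage(X, method="single")                # bottom-up merges by minimum distance
print(fcluster(Z, t=2, criterion="maxclust"))  # cut the dendrogram into 2 clusters

dendrogram(Z, labels=list("abcde"))            # tree view of the merge sequence
plt.show()
```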
Advantages | Disadvantages
Fast computation and there is no need to pre-define the number of clusters (k). | Hard to define levels for clusters; sensitive to noise and outliers.
No problem with choosing the level of granularity. | Rigid: cannot correct later for erroneous decisions made earlier.
Accepts any valid measure of distance. | Cannot perform well on very large data sets.
Good for data visualization, as the dendrogram shows the relation between clusters. |

Challenges and Solutions
• It is difficult to select merge or split points
• No backtracking
• Hierarchical clustering does not scale well: it examines a
good number of objects before any decision to split or merge
• One promising direction to solve these problems is to
combine hierarchical clustering with other clustering techniques:
multiple-phase clustering
Density-based methods:
• Here the general idea is to continue growing a given cluster as long as the density
(number of objects or data points) in the “neighborhood” exceeds some
threshold.
• For example, for each data point within a given cluster, the neighborhood of a
given radius has to contain at least a minimum number of points.
• Such a method can be used to filter out noise or outliers and discover clusters
of arbitrary shape.
• Density-based methods can divide a set of objects into multiple exclusive
clusters, or a hierarchy of clusters.
• Typically, density-based methods consider exclusive clusters only, and do not
consider fuzzy clusters.
• Moreover, density-based methods can be extended from full space to subspace
clustering.
• A density-based algorithm requires two parameters: the minimum number of points
needed to form a cluster and the radius threshold that defines the
neighborhood of every point.
• The commonly used density-based clustering algorithm known as DBSCAN
groups data points that are close together and can discover clusters of arbitrary shape.
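A short scikit-learn sketch of that idea (the data and parameter values are arbitrary): eps is the neighborhood radius and min_samples the minimum number of points required to keep growing a dense region; points labeled -1 are treated as noise.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0, 0.2, (50, 2)),   # dense blob 1
    rng.normal(3, 0.2, (50, 2)),   # dense blob 2
    rng.uniform(-2, 5, (10, 2)),   # scattered noise points
])

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print(set(labels))  # cluster ids plus -1 for points classified as noise
```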

Advantages
• Density-based clustering algorithms can effectively handle noise and outliers in
the dataset, making them robust in such scenarios.
• These algorithms can identify clusters of arbitrary shapes and sizes, unlike
other clustering algorithms that may assume specific cluster forms.
• They don't require prior knowledge of the number of clusters, making them more
flexible and versatile.
• They can efficiently process large datasets and handle high-dimensional data.

Disadvantages
• The performance of density-based clustering algorithms is highly
dependent on the choice of parameters, such as eps and MinPts, which can be
challenging to tune.
• These algorithms may not be suitable for datasets with low-density regions or
evenly distributed data points.
• They can be computationally expensive and time-consuming,
especially for large datasets with complex structures.
• Density-based clustering can struggle to identify clusters of varying
densities or scales.
Summary of methods

Partitioning methods:
— Find mutually exclusive clusters of spherical shape
— Distance-based
— May use mean or medoid (etc.) to represent cluster center
— Effective for small- to medium-size data sets

Hierarchical methods:
— Clustering is a hierarchical decomposition (i.e., multiple levels)
— Cannot correct erroneous merges or splits
— May incorporate other techniques like microclustering or consider object “linkages”

Density-based methods:
— Can find arbitrarily shaped clusters
— Clusters are dense regions of objects in space that are separated by low-density regions
— Cluster density: each point must have a minimum number of points within its “neighborhood”
— May filter out outliers
Evaluation of Clustering
• Assessing clustering tendency. In this task, for a given data set, we assess
whether a nonrandom structure exists in the data. Blindly applying a clustering
method on a data set will return clusters; however, the clusters mined may be
misleading.
• Clustering analysis on a data set is meaningful only when there is a nonrandom
structure in the data.
• Determining the number of clusters in a data set. A few algorithms, such as k-
means, require the number of clusters in a data set as a parameter. Moreover,
the number of clusters can be regarded as an interesting and important
summary statistic of a data set.
• Therefore, it is desirable to estimate this number even before a clustering
algorithm is used to derive detailed clusters.
• Measuring clustering quality. After applying a clustering method on a data set,
we want to assess how good the resulting clusters are. A number of measures
can be used.
• Some methods measure how well the clusters fit the data set, while others measure
how well the clusters match the ground truth, if such truth is available.
• There are also measures that score clusterings and thus can compare two sets
of clustering results on the same data set.
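As one concrete (and hedged) example of such a measure, the sketch below uses the silhouette coefficient from scikit-learn to compare k-means clusterings with different values of k on synthetic data; a higher average silhouette suggests better-separated clusters.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.4, (50, 2)) for m in (0, 4, 8)])  # three synthetic blobs

for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))  # expect the best score near k = 3
```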

