
Unit – VI

Clustering
 1. What is Clustering: Clustering is unsupervised learning, i.e.,
there are no predefined classes; it forms groups of similar objects
that differ significantly from objects in other groups.
 The process of grouping a set of physical or abstract objects into
classes of similar objects is called clustering.
 Clustering is “the process of organizing objects into groups whose
members are similar in some way”.
 The cluster property is that intra-cluster distances are minimized
and inter-cluster distances are maximized.
 A cluster is a collection of data objects that are
 similar to one another within the same cluster and
 dissimilar to the objects in other clusters.
 Group points into clusters based on how “near” they are to one
another.
 Outlier detection and cluster analysis are very useful for fraud
detection, etc., and can be performed by statistical, distance-based,
or deviation-based approaches.

 What is Good Clustering: A good clustering method will produce
high-quality clusters with
 high intra-class similarity and
 low inter-class similarity.

 Requirements of Clustering in DM
 Scalability
 Ability to deal with different types of attributes
 High dimensionality
 Ability to deal with noise and outliers
 Interpretability
 Discovery of clusters with arbitrary shape
 Why Clustering
 Scalability
 Ability to deal with different types of attributes
 Discovery of clusters with arbitrary shape
 Minimal requirements for domain knowledge to determine input
parameters
 Ability to deal with noisy data
 Incremental clustering and insensitivity to the order of input records
 High dimensionality
 Constraint-based clustering
 Interpretability and usability
Types of data in Clustering Analysis
1. Nominal variables
2. Ordinal variables
3. Categorical Data
4. Labeled Variables
5. Unlabeled Variables
6. Numerical Values
7. Interval-scaled variables
8. Binary variables
9. Ratio variables
10. Variables of mixed types
1. Nominal Variables allow for only qualitative classification. A
nominal variable is a generalization of the binary variable in that
it can take more than two states, e.g., red, yellow, blue, green.
Ex: {male, female}, {yes, no}, {true, false}
2. Ordinal Data are categorical data where there is a logical
ordering to the categories.
Ex: 1 = Strongly disagree; 2 = Disagree; 3 = Neutral; 4 = Agree
3. Categorical Data represent types of data which may be divided
into groups.
Ex: race, sex, age group, and educational level.
4. Labeled Data share the class labels or the generative
distribution of the data.
5. Unlabeled Data do not share the class labels or the
generative distribution of the labeled data.
6. Numerical Values: The data values consist of numbers and
only numbers. Ex: 1, 2, 3, 4, …
7. Interval-Scaled Variables: These are variables whose values fall
into ranges of numbers.
 Ex: 10-20, 20-30, 30-40, …
8. Binary Variables: These are variables that take combinations of
0 and 1.
 Ex: 1, 0, 001, 010, …
9. Ratio-Scaled Variables: A positive measurement on a nonlinear
scale, approximately at an exponential scale, such as Ae^(Bt) or
Ae^(-Bt).
 Ex: 1/2, 2/4, 4/8, …
10. Variables of Mixed Types: A database may contain all six
types of variables: symmetric binary, asymmetric binary,
nominal, ordinal, interval, and ratio.
 Ex: 11121A1201
Similarity Measure
 Distances are normally used to measure the similarity or
dissimilarity between two data objects.
 Euclidean distance: Euclidean distance is the straight-line
distance between two points in Euclidean space. For two
p-dimensional objects i and j it is computed as
d(i, j) = sqrt(|x_i1 - x_j1|^2 + |x_i2 - x_j2|^2 + … + |x_ip - x_jp|^2)
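As an illustration, here is a minimal Python sketch of this distance
computation (the function name euclidean_distance and the example
points are ours, not from the slides):

import math

def euclidean_distance(x, y):
    # Straight-line distance between two equal-length numeric vectors
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# Example: distance between two 2-dimensional objects
print(euclidean_distance((1, 2), (4, 6)))  # 5.0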
Major Clustering Approaches

1. Partitioning Methods
2. Hierarchical Methods
3. Density-Based Methods
4. Grid-Based Methods
5. Model-Based Clustering Methods
6. Clustering High-Dimensional Data
7. Constraint-Based Cluster Analysis
8. Outlier Analysis
Fig: Major Clustering Approaches (a diagram arranging the eight
approaches listed above: partitioning, hierarchical, density-based,
grid-based, model-based, high-dimensional, constraint-based, and
outlier analysis).
1. Partitioning approach: Construct various partitions and
then evaluate them by some criterion, e.g., minimizing the
sum of squared errors.
 A partitioning method first creates an initial set of k
partitions, where parameter k is the number of partitions to
construct.
 It then uses an iterative relocation technique that attempts
to improve the partitioning by moving objects from one
group to another.
 Typical partitioning methods include
1. k-means,
2. k-medoids,
3. CLARANS (Clustering Large Applications based upon
RANdomized Search)
2. Hierarchical approach: A hierarchical method creates a
hierarchical decomposition of the given set of data
objects.
 The method can be classified as being either
agglomerative (bottom-up) or divisive (top-down), based
on how the hierarchical decomposition is formed.
 To compensate for the rigidity of merge or split, the
quality of hierarchical agglomeration can be improved by
analyzing object linkages at each hierarchical
partitioning, or by first performing microclustering and
then operating on the microclusters with other clustering
techniques, such as iterative relocation.
 Hierarchical methods create a hierarchical decomposition of
the set of data (or objects) using some criterion and are
classified into the following (a small usage sketch follows
this list):
1. DIANA (DIvisive ANAlysis)
2. AGNES (AGglomerative NESting)
3. BIRCH (Balanced Iterative Reducing and Clustering using
Hierarchies)
4. ROCK (RObust Clustering using linKs)
5. CAMELEON (Hierarchical clustering algorithm that uses
dynamic modeling)
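As a hedged illustration of the agglomerative (bottom-up) style, a
minimal sketch using SciPy's hierarchical-clustering routines
(assuming SciPy is installed; the sample points are made up):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Six 2-D points forming two visually separate groups (made-up data)
X = np.array([[1, 1], [1.5, 1], [1, 1.5],
              [8, 8], [8.5, 8], [8, 8.5]])

# Agglomerative (bottom-up) merging with average linkage
Z = linkage(X, method="average")

# Cut the resulting dendrogram into 2 flat clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # e.g., [1 1 1 2 2 2]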
3. Density-based approach: A density-based method
clusters objects based on the notion of density.
 It either grows clusters according to the density of
neighborhood objects or according to some density
function.
 OPTICS is a density-based method that generates an
augmented ordering of the clustering structure of the
data.
 Methods based on connectivity and density functions are
classified into the following (a brief usage sketch follows
this list):
1. DBSCAN (Density-Based Spatial Clustering of Applications
with Noise)
2. OPTICS (Ordering Points To Identify the Clustering
Structure)
3. DENCLUE (DENsity-based CLUstEring)
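A brief DBSCAN usage sketch with scikit-learn (assuming scikit-learn
is available; the eps and min_samples values and the data are
illustrative, not from the slides):

import numpy as np
from sklearn.cluster import DBSCAN

# Two dense groups plus one far-away outlier (made-up data)
X = np.array([[1, 1], [1.1, 1], [1, 1.2],
              [5, 5], [5.1, 5], [5, 5.2],
              [20, 20]])

db = DBSCAN(eps=0.5, min_samples=2).fit(X)
print(db.labels_)  # noise points are labeled -1, e.g., [0 0 0 1 1 1 -1]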
4. Grid-based approach: A grid-based method first
quantizes the object space into a finite number of cells
that form a grid structure, and then performs clustering
on the grid structure.
 STING is a typical example of a grid-based method
based on statistical information stored in grid cells.
 WaveCluster and CLIQUE are two clustering algorithms
that are both grid-based and density-based.
 Methods based on a multiple-level granularity structure are
1. STING (STatistical INformation Grid)
2. WAVECLUSTER (Clustering Using Wavelet
Transformation)
5. Model-based approach: A model-based method
hypothesizes a model for each of the clusters and
finds the best fit of the data to that model.
 Examples of model-based clustering include the EM
algorithm, conceptual clustering, and neural network
approaches.
 A model is hypothesized for each of the clusters, and the
method tries to find the best fit of the data to that model.
Typical methods are
1. EM (Expectation-Maximization)
2. SOM (Self-Organizing feature Maps)
3. COBWEB (Conceptual Clustering)
6. Clustering High-Dimensional Data: Clustering high-
dimensional data is of crucial importance, because in many
advanced applications, data objects such as text documents and
microarray data are high-dimensional in nature.
 There are three typical methods to handle high-dimensional data
sets: dimension-growth subspace clustering, represented by
CLIQUE; dimension-reduction projected clustering, represented by
PROCLUS; and frequent pattern-based clustering, represented by
pCluster.
 Typical methods are
1. CLIQUE (CLustering In QUEst)
2. PROCLUS (PROjected CLUStering)
3. pCluster (frequent pattern-based clustering)
7. Constraint-Based Cluster Analysis: A constraint-based
clustering method groups objects based on application-dependent or
user-specified constraints.
 Ex: clustering with the existence of obstacle objects and clustering
under user-specified constraints are typical methods of constraint-
based clustering, as is semi-supervised clustering based on “weak”
supervision.
 Typical methods are
1. Clustering with Obstacle Objects
2. User-Constrained Cluster Analysis
3. Semi-Supervised Cluster Analysis
8. Outlier Analysis: Outlier analysis methods are very useful for
fraud detection, customized marketing, medical analysis, and many
other tasks.
 Computer-based outlier analysis methods typically follow either a
statistical distribution-based approach, a distance-based approach, a
density-based local outlier detection approach, or a deviation-based
approach.
 Typical methods are listed below (a small statistical sketch
follows the list):
1. Statistical Distribution-Based Outlier Detection
2. Distance-Based Outlier Detection
3. Density-Based Local Outlier Detection
4. Deviation-Based Outlier Detection
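As a hedged illustration of the statistical distribution-based
approach, a minimal z-score sketch in Python (the threshold value is
a common convention, not something the slides specify):

import statistics

def zscore_outliers(values, threshold=2.0):
    # Flag values more than `threshold` standard deviations from the mean
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    return [v for v in values if abs(v - mean) / stdev > threshold]

data = [10, 11, 9, 10, 12, 11, 10, 95]  # 95 is an obvious outlier
print(zscore_outliers(data))  # [95]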
Examples of Clustering Applications
1. Marketing
2. Land use
3. Insurance
4. City-planning
5. Earth-quake studies

Issues of Clustering
1. Accuracy,
2. Training time,
3. Robustness,
4. Interpretability, and
5. Scalability
6. Find top ‘n’ outlier points
Applications
 Pattern Recognition
 Spatial Data Analysis
 GIS (Geographical Information System)
 Image Processing
 WWW (World Wide Web)
 Cluster Weblog data to discover groups
 Credit approval
 Target marketing
 Medical diagnosis
 Fraud detection
 Weather forecasting
 Stock Marketing
2. Classification Vs Clustering

Classification:
1. Classification is the process of organizing objects into
predefined groups (classes).
2. It is Supervised Learning.
3. Predefined classes.
4. Has labels for some points.
5. Requires a “rule” that will accurately assign labels to new
points.
6. (Figure in the original slides.)
7. Classification approaches are of two types:
   1. Predictive Classification
   2. Descriptive Classification
8. Issues of Classification:
   1. Accuracy
   2. Training time
   3. Robustness
   4. Interpretability
   5. Scalability
9. Examples:
   1. Marketing
   2. Land use
   3. Insurance
   4. City-planning
   5. Earth-quake studies
10. Techniques:
   1. Decision Tree
   2. Bayesian classification
   3. Rule-based classification
   4. Prediction and accuracy and error measures
11. Applications:
   1. Credit approval
   2. Target marketing
   3. Medical diagnosis
   4. Fraud detection
   5. Weather forecasting
   6. Stock Marketing

Clustering:
1. Clustering is “the process of organizing objects into groups
whose members are similar in some way”.
2. It is Unsupervised Learning.
3. No predefined classes.
4. No labels in clustering.
5. Groups points into clusters based on how “near” they are to one
another.
6. (Figure in the original slides.)
7. Clustering approaches are eight:
   1. Partitioning Methods
   2. Hierarchical Methods
   3. Density-Based Methods
   4. Grid-Based Methods
   5. Model-Based Clustering Methods
   6. Clustering High-Dimensional Data
   7. Constraint-Based Cluster Analysis
   8. Outlier Analysis
8. Issues of Clustering:
   1. Accuracy
   2. Training time
   3. Robustness
   4. Interpretability
   5. Scalability
   6. Find top ‘n’ outlier points
9. Examples:
   1. Marketing
   2. Land use
   3. Insurance
   4. City-planning
   5. Earth-quake studies
10. Techniques:
   1. k-Means Clustering
   2. DIANA (DIvisive ANAlysis)
   3. AGNES (AGglomerative NESting)
   4. BIRCH (Balanced Iterative Reducing and Clustering using
      Hierarchies)
   5. DBSCAN (Density-Based Spatial Clustering of Applications
      with Noise)
11. Applications:
   1. Pattern Recognition
   2. Spatial Data Analysis
   3. WWW (World Wide Web)
   4. Weblog data to discover groups
   5. Credit approval
   6. Target marketing
   7. Medical diagnosis
   8. Fraud detection
   9. Weather forecasting
   10. Stock Marketing
3. k-Means Clustering
 It is a partitioning cluster technique.
 It is a centroid-based cluster technique.
 Clustering is unsupervised learning, i.e., there are no predefined
classes; it forms groups of similar objects that differ significantly
from other objects.
 Distances between objects are measured with the Euclidean
distance:
d(i, j) = sqrt(|x_i1 - x_j1|^2 + |x_i2 - x_j2|^2 + … + |x_ip - x_jp|^2)
 It creates the first k initial clusters (k = number of
clusters needed) from the dataset by choosing k rows of
data randomly from the dataset.
 The k-means algorithm then calculates the arithmetic mean
of each cluster formed in the dataset.
 Square-error criterion:
E = Σ_{i=1..k} Σ_{p ∈ C_i} |p - m_i|^2
 Where
– E is the sum of the square error for all objects in the data set;
– p is the point in space representing a given object; and
– m_i is the mean of cluster C_i (both p and m_i are
multidimensional).
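A minimal sketch of this criterion in Python (the function name sse
and the clusters-as-lists-of-points layout are our illustration
choices):

def sse(clusters, means):
    # Sum of squared Euclidean errors over all clusters.
    # clusters: list of clusters, each a list of points (tuples)
    # means:    list of cluster means m_i, aligned with clusters
    total = 0.0
    for cluster, m in zip(clusters, means):
        for p in cluster:
            total += sum((pj - mj) ** 2 for pj, mj in zip(p, m))
    return total

# Example: two 1-D clusters with means 2 and 8
print(sse([[(1,), (3,)], [(7,), (9,)]], [(2,), (8,)]))  # 4.0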
 Algorithm: The k-means algorithm for partitioning,
where each cluster’s center is represented by the mean
value of the objects in the cluster.
 Input:
– k: the number of clusters,
– D: a data set containing n objects.
 Output: A set of k clusters.
k-Means Clustering Method

Example
(k = 2: arbitrarily choose k objects as the initial cluster centers;
assign each object to the most similar center; update the cluster
means; reassign the objects and update the means again, repeating
until the assignments no longer change.)

Fig: Clustering of a set of objects based on the k-means
method. (The mean of each cluster is marked by a “+”.)
Steps
The k-means algorithm is implemented in four steps:
1. Partition objects into k nonempty subsets.
2. Compute seed points as the centroids of the clusters
of the current partition (the centroid is the center, i.e.,
mean point, of the cluster).
3. Assign each object to the cluster with the nearest seed
point.
4. Go back to Step 2; stop when no more new
assignments occur.
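Putting the four steps together, a minimal from-scratch k-means
sketch in Python (random initial centers as described above; the
seed and the sample data are illustrative):

import random

def kmeans(points, k, max_iters=100):
    # Plain k-means: random initial centers, relocate until stable
    centers = random.sample(points, k)  # Step 1: choose k rows at random
    for _ in range(max_iters):
        # Step 3: assign each object to the cluster with the nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[dists.index(min(dists))].append(p)
        # Step 2: recompute each centroid as the arithmetic mean
        new_centers = [tuple(sum(d) / len(c) for d in zip(*c)) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:  # Step 4: stop when no assignment changes
            break
        centers = new_centers
    return clusters, centers

random.seed(0)
data = [(1, 1), (1.5, 2), (3, 4), (8, 8), (8.5, 9), (9, 8)]
clusters, centers = kmeans(data, k=2)
print(centers)  # for a typical run, one center per natural group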
Comments on the k-Means Method
 Strength: Relatively efficient: O(tkn), where n is # objects, k is #
clusters, and t is # iterations. Normally, k, t << n.
– Comparing: PAM: O(k(n-k)^2), CLARA: O(ks^2 + k(n-k))

 Comment: Often terminates at a local optimum. The global
optimum may be found using techniques such as deterministic
annealing and genetic algorithms.

 Weakness
– Applicable only when mean is defined, then what about
categorical data?
– Need to specify k, the number of clusters, in advance
– Unable to handle noisy data and outliers
– Not suitable to discover clusters with non-convex shapes
Variations of the k-Means Method
 A few variants of the k-means method differ in
– selection of the initial k means,
– dissimilarity calculations, and
– strategies to calculate cluster means.
 Handling categorical data: k-modes
– Replacing means of clusters with modes
– Using new dissimilarity measures to deal with categorical
objects
– Using a frequency-based method to update modes of clusters
– A mixture of categorical and numerical data: the k-prototype
method
What is the Problem of the k-Means Method?
 The k-means algorithm is sensitive to outliers!
– An object with an extremely large value may substantially
distort the distribution of the data.
 k-Medoids: Instead of taking the mean value of the objects in a
cluster as a reference point, medoids can be used; a medoid is the
most centrally located object in a cluster.
(The original slide illustrates this with two scatter plots
contrasting the cluster mean with the medoid on the same data.)
The k-Medoids Clustering Method
 Find representative objects, called medoids, in clusters.
 PAM (Partitioning Around Medoids)
– starts from an initial set of medoids and iteratively
replaces one of the medoids by one of the non-medoids
if it improves the total distance of the resulting
clustering;
– PAM works effectively for small data sets, but does not
scale well for large data sets.
 CLARA (Kaufmann & Rousseeuw): a sampling-based extension
of PAM.
 CLARANS: randomized sampling, with focusing and spatial
data structures.
A Typical k-Medoids Algorithm (PAM)

(Total cost = 20 initially. k = 2: arbitrarily choose k objects as
the initial medoids; assign each remaining object to the nearest
medoid. Then loop: randomly select a nonmedoid object O_random;
compute the total cost of swapping a medoid with O_random (26 in
the example); if the quality is improved, perform the swap; repeat
until no change.)

Fig: A typical k-medoids algorithm (PAM).
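A hedged sketch of PAM's central step, the swap-cost test, in Python
(the function names and the 1-D data are ours; a full PAM repeats
this over all medoid/non-medoid pairs until no swap helps):

def total_cost(points, medoids):
    # Total distance of each point to its nearest medoid (1-D for brevity)
    return sum(min(abs(p - m) for m in medoids) for p in points)

def try_swap(points, medoids, out, candidate):
    # Swap medoid `out` for `candidate` only if it lowers the total cost
    trial = [candidate if m == out else m for m in medoids]
    return trial if total_cost(points, trial) < total_cost(points, medoids) else medoids

data = [1, 2, 3, 8, 9, 10]
medoids = [1, 10]
medoids = try_swap(data, medoids, out=1, candidate=2)
print(medoids)  # [2, 10] (2 is more central for the left group)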