DWDM - Unit - VI
Clustering
1. What is Clustering: Clustering is unsupervised learning, i.e.,
there are no predefined classes; it forms groups of similar objects
that differ significantly from objects in other groups.
The process of grouping a set of physical or abstract objects into
classes of similar objects is called clustering.
Clustering is “the process of organizing objects into groups whose
members are similar in some way”.
Requirements of Clustering in DM
Scalability
Ability to deal with different types of attributes
High dimensionality
Able to deal with noise and outliers
Interpretability
Discovery of clusters with arbitrary shape
Why Clustering
Scalability
Ability to deal with different types of attributes
Discovery of clusters with arbitrary shape
Minimal requirements for domain knowledge to determine input
parameters
Ability to deal with noisy data
Incremental clustering and insensitivity to the order of input records
High dimensionality
Constraint-based clustering
Interpretability and usability
Types of data in Clustering Analysis
1. Nominal variables
2. Ordinal variables
3. Categorical Data
4. Labeled Variables
5. Unlabeled Variables
6. Numerical Values
7. Interval-scaled variables
8. Binary variables
9. Ratio variables
10. Variables of mixed types
1. Nominal Variables allow only qualitative classification. A
nominal variable generalizes the binary variable in that it can take
more than 2 states, e.g., red, yellow, blue, green.
Ex: {male, female}, {yes, no}, {true, false}
2. Ordinal Data are categorical data where there is a logical
ordering to the categories.
Ex: 1=Strongly disagree; 2=Disagree; 3=Neutral; 4=Agree;
3. Categorical Data represent types of data which may be divided
into groups.
Ex: race, sex, age group, and educational level.
4. Labeled Data share the class labels or the generative
distribution of the data.
5. Unlabeled Data do not share the class labels or the
generative distribution of the labeled data.
6. Numerical Values: The data values are purely numerical.
Ex: 1, 2, 3, 4, ….
7. Interval-valued variables: These variables take values within
numeric ranges.
Ex: 10-20, 20-30, 30-40, ……..
10. Variables of Mixed Types: A database may contain all the six
types of variables symmetric binary, asymmetric binary,
nominal, ordinal, interval and ratio.
Ex: 11121A1201
Similarity Measure
Euclidean distance: Distances are normally used to measure the
similarity or dissimilarity between two data objects.
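As a minimal sketch (the function name is ours, not from the text), the Euclidean distance between two p-dimensional data objects can be computed as:

```python
import math

def euclidean_distance(x, y):
    """Euclidean distance between two p-dimensional data objects."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# A smaller distance means the two objects are more similar.
print(euclidean_distance((0, 0), (3, 4)))  # 5.0
```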
Major Clustering Methods
1. Partitioning Methods
2. Hierarchical Methods
3. Density-Based Methods
4. Grid-Based Methods
5. Model-Based Clustering Methods
6. Clustering High-Dimensional Data
7. Constraint-Based Cluster Analysis
8. Outlier Analysis
Fig: Major clustering approaches: partitioning methods, hierarchical methods, density-based methods, grid-based methods, model-based methods, clustering high-dimensional data, and constraint-based cluster analysis.
2. Hierarchical Methods: Create a hierarchical decomposition of the
set of data (or objects) using some criterion; these are classified
into agglomerative and divisive approaches. Quality can be improved
by first performing microclustering and then operating on the
microclusters with other clustering techniques, such as iterative
relocation.
6. Clustering High-Dimensional Data: Data objects such as text
documents and microarray data are high-dimensional in nature.
Methods include
1. CLIQUE (CLustering In QUEst)
2. PROCLUS (PROjected CLUStering)
3. pCluster (frequent pattern–based clustering)
7. Constraint-Based Cluster Analysis: A constraint-based
clustering method groups objects based on application-dependent or
user-specified constraints.
They are
1. Clustering with Obstacle Objects
2. User-Constrained Cluster Analysis
3. Semi-Supervised Cluster Analysis
8. Outlier Analysis: Outlier detection is very useful for fraud
detection, customized marketing, medical analysis, and many other tasks.
They are
1. Statistical Distribution-Based Outlier Detection
2. Distance-Based Outlier Detection
3. Density-Based Local Outlier Detection
4. Deviation-Based Outlier Detection
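As a minimal sketch of distance-based outlier detection, the common DB(pct, dmin) formulation flags an object as an outlier if at least a fraction pct of the remaining objects lie farther than dmin from it (the function name and the sample data below are illustrative assumptions):

```python
import math

def is_distance_outlier(point, data, pct=0.9, dmin=5.0):
    """DB(pct, dmin)-outlier: at least a fraction `pct` of the other
    objects lie farther than `dmin` from `point`."""
    others = [q for q in data if q is not point]
    far = sum(1 for q in others if math.dist(point, q) > dmin)
    return far / len(others) >= pct

data = [(1, 1), (1, 2), (2, 1), (2, 2), (20, 20)]
outliers = [p for p in data if is_distance_outlier(p, data)]
print(outliers)  # [(20, 20)]
```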
Examples of Clustering Applications
1. Marketing
2. Land use
3. Insurance
4. City-planning
5. Earth-quake studies
Issues of Clustering
1. Accuracy,
2. Training time,
3. Robustness,
4. Interpretability, and
5. Scalability
6. Find top ‘n’ outlier points
Applications
Pattern Recognition
Spatial Data Analysis
GIS(Geographical Information System)
Cluster Weblog data to discover groups
Credit approval
Target marketing
Medical diagnosis
Fraud detection
Weather forecasting
Stock Marketing
2. Classification Vs Clustering
1. Classification is the process of organizing objects into
predefined classes (supervised learning).
Clustering is “the process of organizing objects into groups whose
members are similar in some way” (unsupervised learning).
7. Classification approaches are of two types:
1. Predictive Classification
2. Descriptive Classification
Clustering approaches are of eight types:
1. Partitioning Methods
2. Hierarchical Methods
3. Density-Based Methods
4. Grid-Based Methods
5. Model-Based Clustering Methods
6. Clustering High-Dimensional Data
7. Constraint-Based Cluster Analysis
8. Outlier Analysis
8. Issues of Classification:
1. Accuracy
2. Training time
3. Robustness
4. Interpretability
5. Scalability
Issues of Clustering:
1. Accuracy
2. Training time
3. Robustness
4. Interpretability
5. Scalability
6. Find top ‘n’ outlier points
9. Examples (both): Marketing, Land use, Insurance, City-planning,
Earth-quake studies
10. Classification techniques:
1. Decision Tree
2. Bayesian classification
3. Rule-based classification
4. Prediction and accuracy and error measures
Clustering techniques:
1. k-Means Clustering
2. DIANA (DIvisive ANAlysis)
3. AGNES (AGglomerative NESting)
4. BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)
5. DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
11. Classification applications: Credit approval, Target marketing,
Medical diagnosis, Fraud detection, Weather forecasting, Stock Marketing
Clustering applications: Pattern Recognition, Spatial Data Analysis,
WWW (World Wide Web), Weblog data to discover groups, Credit approval,
Target marketing, Medical diagnosis, Fraud detection, Weather
forecasting, Stock Marketing
3. k-Means Clustering
It is a partitioning cluster technique.
It is a centroid-based cluster technique.
Clustering is unsupervised learning, i.e., no predefined classes:
groups of similar objects that differ significantly from other
objects.
d(i, j) = √(|x_i1 − x_j1|² + |x_i2 − x_j2|² + … + |x_ip − x_jp|²)
It then creates the first k initial clusters (k= number of
clusters needed) from the dataset by choosing k rows of
data randomly from the dataset.
The k-Means algorithm calculates the Arithmetic Mean
of each cluster formed in the dataset.
Square-error criterion:
E = Σ (i = 1..k) Σ (p ∈ Ci) |p − mi|²
Where
– E is the sum of the square error for all objects in the data set;
– p is the point in space representing a given object; and
– mi is the mean of cluster Ci (both p and mi are multidimensional).
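The criterion can be computed directly; this sketch (the function name and sample clusters are ours) sums the squared Euclidean distances from each point to the mean of its cluster:

```python
def square_error(clusters):
    """Sum of squared Euclidean distances from each point p to the
    mean mi of its cluster Ci (the criterion E defined above)."""
    total = 0.0
    for points in clusters:
        # mi: the (multidimensional) mean of cluster Ci.
        mean = [sum(dim) / len(points) for dim in zip(*points)]
        total += sum(sum((pj - mj) ** 2 for pj, mj in zip(p, mean))
                     for p in points)
    return total

clusters = [[(1.0, 1.0), (3.0, 1.0)], [(8.0, 8.0)]]
print(square_error(clusters))  # 2.0
```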
Algorithm: The k-means algorithm for partitioning,
where each cluster’s center is represented by the mean
value of the objects in the cluster.
Input:
– k: the number of clusters,
– D: a data set containing n objects.
Output: A set of k clusters.
k-Means Clustering Method
Example (k = 2): Arbitrarily choose k objects as the initial cluster
centers; assign each object to the most similar center; update the
cluster means; reassign the objects to the new nearest means; repeat
until the assignments no longer change.
Fig: Clustering of a set of objects based on the k-means
method. (The mean of each cluster is marked by a “+”.)
Steps
k - Means algorithm is implemented in four
steps:
1. Partition objects into k nonempty subsets.
2. Compute seed points as the centroids of the clusters
of the current partition (the centroid is the center, i.e.,
mean point, of the cluster).
3. Assign each object to the cluster with the nearest seed
point.
4. Go back to Step 2, stop when no more new
assignment
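The four steps above can be sketched as follows (a minimal, illustrative implementation; the function name, the seeded random initialization, and the toy data are our assumptions, not from the text):

```python
import math
import random

def k_means(data, k, max_iters=100, seed=0):
    """Minimal k-means sketch following the four steps above."""
    rng = random.Random(seed)
    # Step 1: pick k objects at random as the initial centers.
    centers = list(rng.sample(data, k))
    clusters = [[] for _ in range(k)]
    for _ in range(max_iters):
        # Step 3: assign each object to the nearest center.
        clusters = [[] for _ in range(k)]
        for p in data:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Step 2: recompute each center as the arithmetic mean of its cluster.
        new_centers = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        # Step 4: stop when no assignment (hence no center) changes.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

data = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (8.0, 8.0), (9.0, 9.0)]
centers, clusters = k_means(data, k=2)
```

On this toy data the two far points end up in one cluster and the three near points in the other, whatever the initial sample.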
Comments on the k-Means Method
Strength: Relatively efficient: O(tkn), where n is # objects, k is #
clusters, and t is # iterations. Normally, k, t << n.
– Comparing: PAM: O(k(n−k)²), CLARA: O(ks² + k(n−k))
Weakness
– Applicable only when mean is defined, then what about
categorical data?
– Need to specify k, the number of clusters, in advance
– Unable to handle noisy data and outliers
– Not suitable to discover clusters with non-convex shapes
Variations of the k-Means Method
A few variants of the k-means which differ in
– Selection of the initial k means
– Dissimilarity calculations
– Strategies to calculate cluster means
What is the Problem of the k-Means Method?
The k-means algorithm is sensitive to outliers!
– Since an object with an extremely large value may substantially
distort the distribution of the data.
Fig: An outlier substantially distorts the cluster means found by k-means.
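A tiny numeric illustration (made-up values) of how a single extreme value drags the arithmetic mean, and hence a k-means center, away from the dense region:

```python
cluster = [1.0, 2.0, 3.0, 4.0]
mean = sum(cluster) / len(cluster)  # 2.5, inside the dense region

# One extreme value shifts the mean far from the dense region,
# distorting where k-means would place the cluster center.
with_outlier = cluster + [100.0]
distorted = sum(with_outlier) / len(with_outlier)  # 22.0
print(mean, distorted)
```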
The k-Medoids Clustering Method
Find representative objects, called Medoids, in clusters
Fig: The k-medoids clustering method (PAM). Arbitrarily choose k
objects as the initial medoids and assign each remaining object to
its nearest medoid. Then loop: compute the total cost of swapping a
medoid O with a non-medoid object Orandom, and perform the swap if
it improves the quality; repeat until no change.
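The swapping loop can be sketched as a naive PAM implementation (function names and toy data are illustrative; a real implementation would cache distances rather than recompute the cost for every candidate swap):

```python
import math

def total_cost(medoids, data):
    """Sum of distances from every object to its nearest medoid."""
    return sum(min(math.dist(p, m) for m in medoids) for p in data)

def pam(data, k):
    """Naive PAM: keep swapping a medoid with a non-medoid object
    whenever the swap lowers the total cost, until no swap helps."""
    medoids = list(data[:k])  # arbitrarily choose k initial medoids
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for o in data:
                if o in medoids:
                    continue
                candidate = medoids[:i] + [o] + medoids[i + 1:]
                if total_cost(candidate, data) < total_cost(medoids, data):
                    medoids = candidate  # the swap improved quality
                    improved = True
    return medoids

data = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0)]
print(pam(data, k=2))
```

The returned medoids are actual data objects, one per cluster, unlike the k-means centers, which are computed means.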