Unit - IV
Clustering and Applications
• Cluster analysis
• Types of Data in Cluster Analysis
• Categorization of Major Clustering Methods
• Partitioning Methods
• Hierarchical Methods
• Density–Based Methods
• Grid–Based Methods
• Outlier Analysis.
Introduction
• Supervised learning: discover patterns in the data with a known target
(class) or label.
• These patterns are then used to predict the values of the target attribute
in future data instances.
• Examples?
• Unsupervised learning: the data have no target attribute.
• We want to explore the data to find intrinsic structures in them.
• Can we perform regression here?
• Examples?
Introduction…
• Centroid: computed as the mean of all data points in a cluster
  $C_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i$
1. Similarity function / distance measure
2. Stopping criterion
3. Cluster quality

1. Similarity function / Distance measure
• How to find distance between data points
• Euclidean distance
• Problems with Euclidean distance
Euclidean distance and Manhattan distance
• Euclidean distance
  $\mathrm{dist}(x_i, x_j) = \sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \cdots + (x_{ir} - x_{jr})^2}$
• Manhattan distance
  $\mathrm{dist}(x_i, x_j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ir} - x_{jr}|$
Squared distance
• Squared Euclidean distance: to place progressively greater
weight on data points that are further apart.
$\mathrm{dist}(x_i, x_j) = (x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \cdots + (x_{ir} - x_{jr})^2$
Distance functions for binary and nominal attributes
• For binary attributes, distance is computed from a confusion matrix that counts, over the r attributes, how often each value pair occurs in the two data points:

Confusion matrix

              x_j = 1   x_j = 0
  x_i = 1        a         b
  x_i = 0        c         d

• Symmetric binary attributes (simple matching distance): $\mathrm{dist}(x_i, x_j) = \frac{b + c}{a + b + c + d}$
• Nominal attributes: $\mathrm{dist}(x_i, x_j) = \frac{r - q}{r}$, where $q$ is the number of attributes on which the two data points match
Contd.
• Cosine similarity
  $\cos(x, y) = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}$
• Euclidean distance
  $d(x, y) = \sqrt{\sum_i (x_i - y_i)^2}$
• Minkowski metric
  $d(x, y) = \left( \sum_i |x_i - y_i|^p \right)^{1/p}$
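As a quick illustration (a sketch added in these notes, not part of the original slides), the measures above can be written in Python with NumPy; the sample vectors x and y are arbitrary:

```python
import numpy as np

def euclidean(x, y):
    # square root of the sum of squared coordinate differences
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):
    # sum of absolute coordinate differences
    return np.sum(np.abs(x - y))

def minkowski(x, y, p):
    # generalizes both: p = 1 gives Manhattan, p = 2 gives Euclidean
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

def cosine_similarity(x, y):
    # dot product normalized by the two vector lengths
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])
print(euclidean(x, y))          # 3.605...
print(manhattan(x, y))          # 5.0
print(minkowski(x, y, 2))       # same as Euclidean
print(cosine_similarity(x, y))  # 0.695...
```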
2. Stopping criteria
1. No (or minimal) re-assignment of data points to different clusters,
2. No (or minimal) change of centroids, or
3. Minimum decrease in the sum of squared error (SSE),
   $\mathrm{SSE} = \sum_{j=1}^{k} \sum_{x \in C_j} \mathrm{dist}(x, m_j)^2$
   where $C_j$ is the $j$-th cluster and $m_j$ is its centroid.
[Figure: two clusters in a 2-D space (Dimension 1 vs. Dimension 2), each centroid marked with “+”, with example points a and b]
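Putting the pieces above together, here is a minimal k-means sketch (an illustration of these notes, assuming NumPy and Euclidean distance, not the slides' own code) that uses stopping criterion 3, a minimum decrease in SSE:

```python
import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-4, seed=0):
    rng = np.random.default_rng(seed)
    # pick k distinct data points as initial centroids
    centers = X[rng.choice(len(X), size=k, replace=False)]
    prev_sse = np.inf
    for _ in range(max_iter):
        # assign each point to its nearest centroid (Euclidean distance)
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # SSE = sum over clusters of squared distances to the centroid
        sse = np.sum(d[np.arange(len(X)), labels] ** 2)
        # stopping criterion 3: minimum decrease in SSE
        if prev_sse - sse < tol:
            break
        prev_sse = sse
        # recompute each centroid as the mean of its cluster's points
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers, sse

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
labels, centers, sse = kmeans(X, k=2)
print(centers, sse)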
Cluster Analysis: Basic Concepts and Methods
• Partitioning Methods
• Hierarchical Methods
• Evaluation of Clustering
• Summary
Types of clustering
• Clustering: the task of grouping a set of data points such that data points in
the same group (known as a cluster) are more similar to each other than to data
points in other groups:
• it groups data instances that are similar to (near) each other into one cluster, and
• data instances that are very different from (far away from) each other into different
clusters.
Types
• Partitioning Methods
• Hierarchical Methods
• Density–Based Methods
• Grid–Based Methods
Partitioning-Based Clustering Methods
• Basic Concepts of Partitioning Algorithms
• Cons
  • k must be set in advance
  • Sensitive to the initial centers
  • Sensitive to outliers
  • Detects only spherical clusters
  • Assumes that means can be computed (numeric attributes)
[Figure: K-means iteration: assign points to clusters, then recompute the cluster centers, and repeat]
[Figure: K-medoids (PAM) with K = 2: select initial K medoids randomly (arbitrarily choose K objects as initial medoids); assign each remaining object to the nearest medoid; then repeat: randomly select a non-medoid object O_random, compute the total cost of swapping a medoid with O_random, and perform the swap if it improves the clustering quality, until no change]
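A hedged sketch of the swap step illustrated above (an illustration of these notes; the function and parameter names are assumptions): PAM keeps a swap between a medoid and a random non-medoid object only when it lowers the total cost, i.e., the sum of distances from points to their nearest medoid.

```python
import numpy as np

def total_cost(X, medoids):
    # sum of distances from every point to its nearest medoid
    d = np.linalg.norm(X[:, None, :] - X[medoids][None, :, :], axis=2)
    return d.min(axis=1).sum()

def pam(X, k=2, n_trials=200, seed=0):
    rng = np.random.default_rng(seed)
    # select initial K medoids randomly
    medoids = list(rng.choice(len(X), size=k, replace=False))
    cost = total_cost(X, medoids)
    for _ in range(n_trials):
        # randomly select a non-medoid object O_random
        o_random = int(rng.integers(len(X)))
        if o_random in medoids:
            continue
        # compute the total cost of swapping one medoid with O_random
        trial = list(medoids)
        trial[rng.integers(k)] = o_random
        trial_cost = total_cost(X, trial)
        # if swapping improves the clustering quality, keep the swap
        if trial_cost < cost:
            medoids, cost = trial, trial_cost
    return medoids, cost
```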
Hierarchical Clustering Methods
• Basic Concepts of Hierarchical Algorithms
– Agglomerative Clustering Algorithms
– Divisive Clustering Algorithms
• Extensions to Hierarchical Clustering
• BIRCH: A Micro-Clustering-Based Approach
• CURE: Exploring Well-Scattered Representative Points
• CHAMELEON: Graph Partitioning on the KNN Graph of the
Data
• Probabilistic Hierarchical Clustering
Hierarchical Clustering: Basic Concepts
• Hierarchical clustering
  – Generates a clustering hierarchy (drawn as a dendrogram)
  – Not required to specify K, the number of clusters
  – More deterministic
  – No iterative refinement
• Two categories of algorithms: agglomerative (AGNES) and divisive (DIANA)
[Figure: dendrogram over objects a, b, c, d, e. AGNES runs Step 0 to Step 4, merging a+b into ab, d+e into de, then c+de into cde, then ab+cde into abcde; DIANA runs Step 4 back to Step 0, splitting in the reverse order. Hierarchical clustering generates a dendrogram (a hierarchy of clusters).]
Agglomerative Clustering Algorithm
• AGNES (AGglomerative NESting) (Kaufmann and Rousseeuw, 1990)
– Use the single-link method and the dissimilarity matrix
– Continuously merge nodes that have the least dissimilarity
– Eventually all nodes belong to the same cluster
[Figures: scatter-plot snapshots of a 2-D dataset illustrating AGNES step by step, merging nearby points into progressively larger clusters]
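A short sketch using SciPy (assumed to be available; not part of the slides): single-link agglomerative clustering as in AGNES, with the resulting dendrogram cut into two clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# two well-separated blobs in 2-D
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(6, 0.5, (20, 2))])

# single-link: repeatedly merge the pair of clusters with the least dissimilarity
Z = linkage(X, method='single')

# cut the dendrogram so that at most 2 clusters remain
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)
```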
Density-Based and Grid-Based Clustering Methods
❑ Density-Based Clustering
❑ Basic Concepts
❑ DBSCAN: A Density-Based Clustering Algorithm
❑ OPTICS: Ordering Points To Identify Clustering Structure
❑ Grid-Based Clustering Methods
❑ Basic Concepts
❑ STING: A Statistical Information Grid Approach
❑ CLIQUE: Grid-Based Subspace Clustering
Density-Based Clustering Methods
• Clustering based on density (a local cluster criterion), such as density-connected
points
• Major features
– Discover clusters of arbitrary shape
– Handle noise
– One scan (only examine the local region to justify density)
– Need density parameters as termination condition
• Several interesting studies
– DBSCAN: Ester, et al. (KDD’96)
– OPTICS: Ankerst, et al (SIGMOD’99)
– DENCLUE: Hinneburg & D. Keim (KDD’98)
– CLIQUE: Agrawal, et al. (SIGMOD’98) (also, grid-based)
DBSCAN: A Density-Based Spatial Clustering Algorithm
• DBSCAN (M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, KDD’96)
  – Discovers clusters of arbitrary shape based on density
[Figure omitted. Ack.: figures from G. Karypis, E.-H. Han, and V. Kumar, COMPUTER, 32(8), 1999]
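For illustration (a sketch assuming scikit-learn is available; not the slides' code), DBSCAN is driven by two parameters: the neighborhood radius eps and the density threshold min_samples; points labeled -1 are treated as noise.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# two dense blobs plus a few scattered noise points
X = np.vstack([rng.normal(0, 0.3, (50, 2)),
               rng.normal(4, 0.3, (50, 2)),
               rng.uniform(-2, 6, (5, 2))])

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print(set(labels))  # e.g., {0, 1, -1}: two clusters plus noise
```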
OPTICS: Ordering Points To Identify Clustering Structure
• OPTICS (Ankerst, Breunig, Kriegel, and Sander, SIGMOD’99)
  – Motivation: DBSCAN is sensitive to its parameter setting
[Figure: reachability plot for a dataset, showing reachability-distance over the cluster ordering; “undefined” marks points with no reachability-distance]
CLIQUE: Grid-Based Subspace Clustering
• Start in 1-D space and discretize the numerical intervals on each axis into grid cells
• Find dense regions (clusters) in each subspace and generate their minimal descriptions
  – Use the dense regions to find promising candidates in 2-D space based on the Apriori
    principle
  – Repeat the above in a level-wise manner in higher-dimensional subspaces
Major Steps of the CLIQUE Algorithm
• Identify subspaces that contain clusters
– Partition the data space and find the number of points that lie inside each cell of
the partition
– Identify the subspaces that contain clusters using the Apriori principle
• Identify clusters
  – Determine dense units in all subspaces of interest
  – Determine connected dense units in all subspaces of interest
• Generate minimal descriptions for the clusters
– Determine maximal regions that cover a cluster of connected dense units for
each cluster
– Determine minimal cover for each cluster
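A minimal sketch of the level-wise idea above (a simplification of these notes, not the full CLIQUE algorithm; the cell width and density threshold are assumed parameters): keep dense 1-D cells, then form candidate 2-D cells only from dense 1-D projections, following the Apriori principle.

```python
import numpy as np
from itertools import product

def dense_cells_1d(X, dim, width, min_pts):
    # discretize one axis into grid cells and count points per cell
    cells = np.floor(X[:, dim] / width).astype(int)
    ids, counts = np.unique(cells, return_counts=True)
    return set(ids[counts >= min_pts])

def dense_cells_2d(X, width, min_pts):
    d0 = dense_cells_1d(X, 0, width, min_pts)
    d1 = dense_cells_1d(X, 1, width, min_pts)
    c0 = np.floor(X[:, 0] / width).astype(int)
    c1 = np.floor(X[:, 1] / width).astype(int)
    dense = set()
    # Apriori principle: a 2-D cell can be dense only if both of its
    # 1-D projections are dense, so only those candidates are counted
    for i, j in product(d0, d1):
        if np.sum((c0 == i) & (c1 == j)) >= min_pts:
            dense.add((i, j))
    return dense

X = np.random.default_rng(0).normal(0, 1, (200, 2))
print(dense_cells_2d(X, width=0.5, min_pts=10))
```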
Additional Comments on CLIQUE
• Strengths
– Automatically finds subspaces of the highest dimensionality as long as high
density clusters exist in those subspaces
– Insensitive to the order of records in input and does not presume some
canonical data distribution
– Scales linearly with the size of input and has good scalability as the number of
dimensions in the data increases
• Weaknesses
– As in all grid-based clustering approaches, the quality of the results crucially
depends on the appropriate choice of the number and width of the partitions
and grid cells
Outlier Analysis
• Outlier and Outlier Analysis
• Outlier Detection Methods
• Statistical Approaches
• Proximity-Based Approaches
• Clustering-Based Approaches
• Classification Approaches
• Mining Contextual and Collective Outliers
• Outlier Detection in High Dimensional Data
• Summary
What Are Outliers?
• Outlier: a data object that deviates significantly from the normal objects, as if it were generated by a
different mechanism
  – Ex.: an unusual credit card purchase; in sports: Michael Jordan, Wayne Gretzky, ...
• Outliers are different from noise data
  – Noise is random error or variance in a measured variable
  – Noise should be removed before outlier detection
• Outliers are interesting: they violate the mechanism that generates the normal data
• Outlier detection vs. novelty detection: a novelty may first appear as an outlier, but is later merged into the model of normal behavior
• Applications:
  – Credit card fraud detection
  – Telecom fraud detection
  – Customer segmentation
  – Medical analysis
Types of Outliers (I)
• Three kinds: global, contextual, and collective outliers
• Global outlier (or point anomaly)
  – An object Og is a global outlier if it significantly deviates from the rest of the data set
  – Ex.: intrusion detection in computer networks
  – Issue: finding an appropriate measurement of deviation
• Contextual outlier (or conditional outlier)
  – An object Oc is a contextual outlier if it deviates significantly with respect to a selected context
  – Ex.: is 80° F in Urbana an outlier? (it depends on whether it is summer or winter)
  – Attributes of data objects should be divided into two groups
    • Contextual attributes: define the context, e.g., time and location
    • Behavioral attributes: characteristics of the object, used in outlier evaluation, e.g.,
      temperature
  – Can be viewed as a generalization of local outliers, i.e., objects whose density significantly
    deviates from that of their local area
  – Issue: how to define or formulate a meaningful context?
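As one concrete (and deliberately simple) measurement of deviation for global outliers, a z-score test can be sketched as follows; the threshold of 2 to 3 standard deviations is a common rule of thumb of these notes, not something the slides prescribe:

```python
import numpy as np

def global_outliers(x, threshold=3.0):
    # standardize: how many standard deviations each value is from the mean
    z = (x - x.mean()) / x.std()
    return np.where(np.abs(z) > threshold)[0]  # indices of suspected outliers

x = np.array([9.8, 10.1, 9.9, 10.0, 10.2, 25.0])  # 25.0 deviates strongly
print(global_outliers(x, threshold=2.0))           # [5]
```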
Types of Outliers (II)
• Collective outliers
  – A subset of data objects that collectively deviates significantly from the
    whole data set, even if the individual data objects may not be
    outliers
  – Applications: e.g., intrusion detection, when a number of computers keep sending
    denial-of-service packets to each other
• Detection of collective outliers
  – Consider not only the behavior of individual objects, but also that of groups
    of objects
  – Requires background knowledge of the relationships among the data objects
End of Unit-4