
AI512PE: DATA MINING (PE - I)

Unit - IV
Clustering and Applications
• Cluster analysis
• Types of Data in Cluster Analysis
• Categorization of Major Clustering Methods
• Partitioning Methods
• Hierarchical Methods
• Density–Based Methods
• Grid–Based Methods
• Outlier Analysis.
Introduction
• Supervised learning: discover patterns in data with a known target
(class) or label.
• These patterns are then used to predict the value of the target attribute
in future data instances.
• Examples?
• Unsupervised learning: the data have no target attribute.
• We want to explore the data to find intrinsic structure in them.
• Can we perform regression here?
• Examples?
Introduction…
• The goal of clustering is to
  • group data points that are close (or similar) to each other
  • identify such groupings (or clusters) in an unsupervised manner
• How to define similarity?
• How many iterations for checking cluster quality?
Cluster
• A cluster is represented by a single point, known as the centroid (or cluster
center) of the cluster.
• The centroid is computed as the mean of all data points in the cluster:

  Cj = (1/|Cj|) · Σ_{xi ∈ Cj} xi

• The cluster boundary is decided by the data point farthest from the centroid.
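
As a small illustration of the centroid definition above, the following Python/NumPy sketch (not part of the original slides; the data values are illustrative) computes the centroid of a toy cluster and the radius set by its farthest point:

import numpy as np

# Toy cluster of three 2-D points (illustrative values)
cluster_points = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 0.0]])

centroid = cluster_points.mean(axis=0)  # Cj = (1/|Cj|) * sum of xi in Cj
radius = np.linalg.norm(cluster_points - centroid, axis=1).max()  # farthest point sets the boundary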
What Is Cluster Analysis?
• What is a cluster?
– A cluster is a collection of data objects which are
• Similar (or related) to one another within the same group (i.e.,
cluster)
• Dissimilar (or unrelated) to the objects in other groups (i.e.,
clusters)
• Cluster analysis (or clustering, data segmentation, …)
– Given a set of data points, partition them into a set of groups (i.e.,
clusters) which are as similar as possible
• Cluster analysis is unsupervised learning (i.e., no predefined classes)
– This contrasts with classification (i.e., supervised learning)
• Typical ways to use/apply cluster analysis
– As a stand-alone tool to get insight into data distribution, or
– As a preprocessing (or intermediate) step for other algorithms
What Is Good Clustering?
• A good clustering method will produce high quality clusters which should have
– High intra-class similarity: Cohesive within clusters
– Low inter-class similarity: Distinctive between clusters
• Quality function
– There is usually a separate “quality” function that measures the “goodness” of
a cluster
– It is hard to define “similar enough” or “good enough”
• The answer is typically highly subjective
• There exist many similarity measures and/or functions for different applications
• Similarity measure is critical for cluster analysis
Cluster Analysis: Applications
• A key intermediate step for other data mining tasks
– Generating a compact summary of data for classification, pattern discovery,
hypothesis generation and testing, etc.
– Outlier detection: Outliers—those “far away” from any cluster
• Data summarization, compression, and reduction
– Ex. Image processing: Vector quantization
• Collaborative filtering, recommendation systems, or customer segmentation
– Find like-minded users or similar products
• Dynamic trend detection
– Clustering stream data and detecting trends and patterns
• Multimedia data analysis, biological data analysis and social network analysis
– Ex. Clustering images or video/audio clips, gene/protein sequences, etc.
Applications
• Example 1: Group people of similar sizes together to make "small",
"medium", and "large" T-shirts.
  • Tailor-made for each person: too expensive.
  • One-size-fits-all: does not fit all.
• Example 2: In marketing, segment customers according to their similarities,
to do targeted marketing.
• Example 3: Given a collection of text documents, organize them according to
their content similarities, to produce a topic hierarchy.
Considerations for Cluster Analysis
• Partitioning criteria
– Single level vs. hierarchical partitioning (often, multi-level hierarchical
partitioning is desirable, e.g., grouping topical terms)
• Separation of clusters
– Exclusive (e.g., one customer belongs to only one region) vs. non-exclusive
(e.g., one document may belong to more than one class)
• Similarity measure
– Distance-based (e.g., Euclidean, road network, vector) vs.
connectivity-based (e.g., density or contiguity)
• Clustering space
– Full space (often when low dimensional) vs. subspaces (often in
high-dimensional clustering)
Requirements and Challenges
• Quality
– Ability to deal with different types of attributes: Numerical, categorical, text,
multimedia, networks, and mixture of multiple types
– Discovery of clusters with arbitrary shape
– Ability to deal with noisy data
• Scalability
– Clustering all the data instead of only on samples
– High dimensionality
– Incremental or stream clustering and insensitivity to input order
• Constraint-based clustering
– User-given preferences or constraints; domain knowledge; user queries
• Interpretability and usability
– The final generated clusters should be semantically meaningful and useful
Clustering criteria
1. Similarity function
2. Stopping criterion
3. Cluster quality
1. Similarity function / Distance measure
• How to find distance between data points
• Euclidean distance
• Problems with Euclidean distance
Euclidean distance and Manhattan distance
• Euclidean distance

  dist(xi, xj) = sqrt( (xi1 − xj1)^2 + (xi2 − xj2)^2 + ... + (xir − xjr)^2 )

• Manhattan distance

  dist(xi, xj) = |xi1 − xj1| + |xi2 − xj2| + ... + |xir − xjr|

• Weighted Euclidean distance

  dist(xi, xj) = sqrt( w1·(xi1 − xj1)^2 + w2·(xi2 − xj2)^2 + ... + wr·(xir − xjr)^2 )

Squared distance
• Squared Euclidean distance: places progressively greater weight on data
points that are farther apart.

  dist(xi, xj) = (xi1 − xj1)^2 + (xi2 − xj2)^2 + ... + (xir − xjr)^2
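
A minimal NumPy sketch of the four distance measures above (the helper names are illustrative; x, y, and the weights w are assumed to be equal-length 1-D arrays):

import numpy as np

def euclidean(x, y):
    return np.sqrt(((x - y) ** 2).sum())

def manhattan(x, y):
    return np.abs(x - y).sum()

def weighted_euclidean(x, y, w):
    # w holds one non-negative weight per attribute
    return np.sqrt((w * (x - y) ** 2).sum())

def squared_euclidean(x, y):
    # no square root: larger gaps are penalized progressively more
    return ((x - y) ** 2).sum()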
Distance functions for binary and nominal attributes
• Binary attribute: has two values or states but no ordering relationship,
e.g., Gender: male and female.
• We use a confusion matrix to introduce the distance functions/measures.
• Let the ith and jth data points be xi and xj (vectors).
Confusion matrix
[Figure: confusion (contingency) matrix of the binary attribute values of xi and xj]
Contd..
• Cosine similarity

  cos(x, y) = (x · y) / (||x|| · ||y||)

• Euclidean distance

  d(x, y) = sqrt( Σi (xi − yi)^2 )

• Minkowski metric

  d(x, y) = ( Σi |xi − yi|^p )^(1/p)
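
A short sketch of the cosine similarity and Minkowski metric above (assuming NumPy arrays; the helper names are illustrative):

import numpy as np

def cosine_similarity(x, y):
    # dot product divided by the product of the vector norms
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

def minkowski(x, y, p=3):
    # p = 1 gives Manhattan distance, p = 2 gives Euclidean distance
    return (np.abs(x - y) ** p).sum() ** (1.0 / p)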
2. Stopping criteria
1. No (or minimal) re-assignment of data points to different clusters,
2. No (or minimal) change of the centroids, or
3. Minimal decrease in the sum of squared error (SSE):

  SSE = Σ_{j=1..k} Σ_{x ∈ Cj} dist(x, mj)^2

• Cj is the jth cluster, mj is the centroid of cluster Cj (the mean vector of all
the data points in Cj), and dist(x, mj) is the distance between data point x
and centroid mj.
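
The SSE above can be computed directly once cluster assignments and centroids are known; a minimal sketch (assuming X is an n×d array, labels holds each point's cluster index, and centroids is a k×d array):

import numpy as np

def sse(X, labels, centroids):
    # SSE = sum over clusters j of sum over x in Cj of dist(x, mj)^2
    return sum(((X[labels == j] - m) ** 2).sum() for j, m in enumerate(centroids))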
An example
[Figures: worked clustering example with two centroids marked "+"; slides adapted from Bing Liu, UIC]

3. Cluster quality
• Intra-cluster cohesion (compactness):
• Cohesion measures how near the data points in a cluster are to the cluster
centroid.
• Sum of squared error (SSE) is a commonly used measure.
• Inter-cluster separation (isolation):
• Separation means that different cluster centroids should be far away from
one another.

[Figure: two points a and b plotted on Dimension 1 vs. Dimension 2]
Cluster Analysis: Basic Concepts and Methods

• Cluster Analysis: An Introduction

• Partitioning Methods

• Hierarchical Methods

• Density- and Grid-Based Methods

• Evaluation of Clustering

• Summary
Types of clustering
• Clustering: the task of grouping a set of data points such that data points in the
same group are more similar to each other than to data points in another group
(a group is known as a cluster)
• It groups data instances that are similar to (near) each other into one cluster, and
• data instances that are very different from (far away from) each other into
different clusters.
Types
• Partitioning Methods
• Hierarchical Methods
• Density–Based Methods
• Grid–Based Methods
Partitioning-Based Clustering Methods
• Basic Concepts of Partitioning Algorithms

• The K-Means Clustering Method

• Initialization of K-Means Clustering

• The K-Medoids Clustering Method

• The K-Medians and K-Modes Clustering Methods

• The Kernel K-Means Clustering Method


Partitioning Algorithms: Basic Concepts
• Partitioning method: Discovering the groupings in the data by optimizing a
specific objective function and iteratively improving the quality of partitions
• K-partitioning method: Partitioning a dataset D of n objects into a set of K
clusters so that an objective function is optimized (e.g., the sum of squared
distances is minimized, where ck is the centroid or medoid of cluster Ck)
– A typical objective function: Sum of Squared Errors (SSE)
• Problem definition: Given K, find a partition of K clusters that optimizes the
chosen partitioning criterion
– Global optimal: Needs to exhaustively enumerate all partitions
– Heuristic methods (i.e., greedy algorithms): K-Means, K-Medians, K-
Medoids, etc.
K-means Clustering
• Basic idea: randomly initialize the K cluster centers, and iterate between the two
steps we just saw.
1. Randomly initialize the cluster centers c1, ..., cK
2. Given the cluster centers, determine the points in each cluster:
   for each point p, find the closest ci and put p into cluster i
3. Given the points in each cluster, solve for ci:
   set ci to be the mean of the points in cluster i
4. If the ci have changed, repeat from Step 2
K-means contd..
Algorithm
Begin
  initialize n, c, μ1, μ2, ..., μc (randomly selected)
  do
    classify the n samples according to the nearest μi
    recompute μi
  until no change in μi
  return μ1, μ2, ..., μc
End
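
A compact, runnable version of the algorithm above in Python/NumPy (a sketch, not the exact course implementation; k and the random seed are illustrative parameters):

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]      # random initial centers
    for _ in range(max_iter):
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = d.argmin(axis=1)                                  # classify samples by nearest centroid
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j]                # keep an empty cluster's old center
                                  for j in range(k)])
        if np.allclose(new_centroids, centroids):                  # stop when centroids no longer change
            break
        centroids = new_centroids
    return centroids, labels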
K-means example
[Figures: step-by-step K-means iterations on a 2-D data set; slides adapted from Andrew Moore]
Contd..
• Pros
  • Simple, fast to compute
  • Converges to a local minimum of the within-cluster squared error
• Cons
  • Setting k?
  • Sensitive to initial centers
  • Sensitive to outliers
  • Detects only spherical clusters
  • Assumes means can be computed

Discussion on the K-Means Method
• Efficiency: O(tKn) where n: # of objects, K: # of clusters, and t: # of iterations
– Normally, K, t << n; thus, an efficient method
• K-means clustering often terminates at a local optimal
– Initialization can be important to find high-quality clusters
• Need to specify K, the number of clusters, in advance
– There are ways to automatically determine the “best” K
– In practice, one often runs the algorithm for a range of values and selects the "best" K value
• Sensitive to noisy data and outliers
– Variations: Using K-medians, K-medoids, etc.
• K-means is applicable only to objects in a continuous n-dimensional space
– Using the K-modes for categorical data
• Not suitable to discover clusters with non-convex shapes
– Using density-based clustering, kernel K-means, etc.
Variations of K-Means
• There are many variants of the K-Means method, varying in different aspects
– Choosing better initial centroid estimates
  • K-Means++, Intelligent K-Means, Genetic K-Means
– Choosing different representative prototypes for the clusters
  • K-Medoids, K-Medians, K-Modes
– Applying feature transformation techniques
  • Weighted K-Means, Kernel K-Means
Poor Initialization in K-Means May Lead to Poor Clustering
[Figure: assign points to clusters, then recompute cluster centers, for another random
selection of K centroids on the same data points]
❑ Rerun of K-Means using another K random seeds
❑ This run of K-Means generates a poor-quality clustering
Initialization of K-Means: Problem and Solution
• Different initializations may generate rather different clustering
results (some could be far from optimal)
• Original proposal (MacQueen’67): Select K seeds randomly
– Need to run the algorithm multiple times using different seeds
❑ There are many methods proposed for better initialization of k seeds
❑ K-Means++ (Arthur & Vassilvitskii’07):
❑ The first centroid is selected at random
❑ The next centroid selected is the one that is farthest from the currently selected
(selection is based on a weighted probability score)
❑ The selection continues until K centroids are obtained
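
A sketch of the K-Means++ seeding described above (selection weighted by the squared distance to the nearest already-chosen centroid; assumes X is an n×d NumPy array):

import numpy as np

def kmeans_pp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]                    # first centroid chosen uniformly at random
    while len(centers) < k:
        d2 = ((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(-1).min(axis=1)
        probs = d2 / d2.sum()                              # farther points are more likely to be picked
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)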
PAM: A Typical K-Medoids Algorithm
[Figure: PAM illustrated on a small 2-D data set with K = 2]
❑ Select the initial K medoids randomly (arbitrarily choose K objects as initial medoids)
❑ Assign each remaining object to the nearest medoid
❑ Repeat
  ❑ Object re-assignment
  ❑ Randomly select a non-medoid object Orandom
  ❑ Compute the total cost of swapping a medoid m with Orandom
  ❑ Swap medoid m with Orandom if it improves the clustering quality
❑ Until the convergence criterion is satisfied
Discussion on K-Medoids Clustering
• K-Medoids Clustering: Find representative objects (medoids) in clusters
• PAM (Partitioning Around Medoids: Kaufmann & Rousseeuw 1987)
– Starts from an initial set of medoids, and
– Iteratively replaces one of the medoids by one of the non-medoids if it improves
the total sum of the squared errors (SSE) of the resulting clustering
– PAM works effectively for small data sets but does not scale well for large data sets
(due to the computational complexity)
– Computational complexity of PAM: O(K(n − K)^2) per iteration (quite expensive!)
• Efficiency improvements on PAM
– CLARA (Kaufmann & Rousseeuw, 1990):
  • PAM on samples; O(Ks^2 + K(n − K)), where s is the sample size
– CLARANS (Ng & Han, 1994): Randomized re-sampling, ensuring efficiency + quality
Handling Outliers: From K-Means to K-Medoids
• The K-Means algorithm is sensitive to outliers!—since an object with an extremely
large value may substantially distort the distribution of the data
• K-Medoids: Instead of taking the mean value of the object in a cluster as a reference
point, medoids can be used, which is the most centrally located object in a cluster
• The K-Medoids clustering algorithm:
• Select K points as the initial representative objects (i.e., as initial K medoids)
• Repeat
– Assigning each point to the cluster with the closest medoid
– Randomly select a non-representative object oi
– Compute the total cost S of swapping the medoid m with oi
– If S < 0, then swap m with oi to form the new set of medoids
• Until convergence criterion is satisfied
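
One improvement step of the K-Medoids loop above can be sketched as follows (a simplified illustration, not the full PAM procedure; X is an n×d NumPy array and medoid_idx holds the indices of the current medoids):

import numpy as np

def total_cost(X, medoid_idx):
    # sum of distances from every point to its nearest medoid
    d = np.linalg.norm(X[:, None, :] - X[medoid_idx][None, :, :], axis=-1)
    return d.min(axis=1).sum()

def kmedoids_swap_step(X, medoid_idx, seed=0):
    rng = np.random.default_rng(seed)
    non_medoids = [i for i in range(len(X)) if i not in medoid_idx]
    oi = rng.choice(non_medoids)                      # randomly selected non-representative object oi
    base = total_cost(X, medoid_idx)
    for j in range(len(medoid_idx)):
        trial = list(medoid_idx)
        trial[j] = oi                                 # tentatively swap medoid m with oi
        S = total_cost(X, trial) - base               # total cost change S of the swap
        if S < 0:
            return trial                              # accept the swap: it improves the clustering
    return medoid_idx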
K-Medians: Handling Outliers by Computing Medians
• Medians are less sensitive to outliers than means
– Think of the median salary vs. mean salary of a large firm when adding a few top
executives!
• K-Medians: Instead of taking the mean value of the object in a cluster as a
reference point, medians are used (L1-norm as the distance measure)
• The criterion function for the K-Medians algorithm (L1-norm):
  S = Σ_{j=1..K} Σ_{x ∈ Cj} Σ_{d=1..r} |x_d − med_{j,d}|
• The K-Medians clustering algorithm:
• Select K points as the initial representative objects (i.e., as initial K medians)
• Repeat
– Assign every point to its nearest median
– Re-compute the median using the median of each individual feature
• Until convergence criterion is satisfied
Cluster Analysis: Basic Concepts and Methods

• Cluster Analysis: An Introduction

• Partitioning Methods

• Hierarchical Methods

• Density- and Grid-Based Methods

• Evaluation of Clustering

• Summary
Hierarchical Clustering Methods
• Basic Concepts of Hierarchical Algorithms
– Agglomerative Clustering Algorithms
– Divisive Clustering Algorithms
• Extensions to Hierarchical Clustering
• BIRCH: A Micro-Clustering-Based Approach
• CURE: Exploring Well-Scattered Representative Points
• CHAMELEON: Graph Partitioning on the KNN Graph of the
Data
• Probabilistic Hierarchical Clustering
Hierarchical Clustering: Basic Concepts
[Figure: objects a, b, c, d, e merged bottom-up (agglomerative, AGNES, steps 0-4) and
split top-down (divisive, DIANA, steps 4-0): a, b → ab; d, e → de; c, de → cde; ab, cde → abcde]
• Hierarchical clustering
– Generates a clustering hierarchy (drawn as a dendrogram)
– Not required to specify K, the number of clusters
– More deterministic
– No iterative refinement
• Two categories of algorithms:
❑ Agglomerative (AGNES): Start with singleton clusters, continuously merge two clusters
at a time to build a bottom-up hierarchy of clusters
❑ Divisive (DIANA): Start with a huge macro-cluster, split it continuously into two groups,
generating a top-down hierarchy of clusters
Dendrogram: Shows How Clusters are Merged
• Dendrogram: Decompose a set of data objects into a tree of clusters by multi-level
nested partitioning
• A clustering of the data objects is obtained by cutting the dendrogram at the desired
level, then each connected component forms a cluster

Hierarchical clustering
generates a dendrogram
(a hierarchy of clusters)
Agglomerative Clustering Algorithm
• AGNES (AGglomerative NESting) (Kaufmann and Rousseeuw, 1990)
– Use the single-link method and the dissimilarity matrix
– Continuously merge nodes that have the least dissimilarity
– Eventually all nodes belong to the same cluster
[Figure: AGNES merging points on a 2-D data set, shown at three successive stages]
❑ Agglomerative clustering varies on the similarity measure used among clusters
  ❑ Single link (nearest neighbor)   ❑ Average link (group average)
  ❑ Complete link (diameter)         ❑ Centroid link (centroid similarity)
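
In practice these agglomerative variants are available off the shelf; a hedged sketch using SciPy (the data set and cut level are illustrative):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(20, 2)                        # toy 2-D data set
Z = linkage(X, method="single")                  # also: "complete", "average", "centroid"
labels = fcluster(Z, t=3, criterion="maxclust")  # cut the dendrogram into 3 clusters

Calling scipy.cluster.hierarchy.dendrogram(Z) would draw the corresponding dendrogram.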
Single Link vs. Complete Link in Hierarchical Clustering
• Single link (nearest neighbor)
– The similarity between two clusters is the similarity between their most
similar (nearest-neighbor) members
– Local similarity-based: emphasizes close regions, ignoring the overall
structure of the cluster
– Capable of clustering non-elliptical shaped groups of objects
– Sensitive to noise and outliers
• Complete link (diameter)
– The similarity between two clusters is the similarity between their most
dissimilar members
– Merge two clusters to form one with the smallest diameter
– Nonlocal in behavior, obtaining compact-shaped clusters
– Sensitive to outliers
Agglomerative Clustering: Average vs. Centroid Links
• Agglomerative clustering with average link
– Average link: the average distance between an element in one cluster Ca and an
element in the other cluster Cb (i.e., averaged over all pairs across the two clusters)
  • Expensive to compute
• Agglomerative clustering with centroid link
– Centroid link: the distance between the centroids of the two clusters
• Group Averaged Agglomerative Clustering (GAAC)
  • Let two clusters Ca and Cb be merged into Ca∪b. The new centroid is
    c_{a∪b} = (Na·ca + Nb·cb) / (Na + Nb)
– Na is the cardinality of cluster Ca, and ca is the centroid of Ca
• The similarity measure for GAAC is the average of their distances
Divisive Clustering
• DIANA (Divisive Analysis) (Kaufmann and Rousseeuw,1990)
– Implemented in some statistical analysis packages, e.g., Splus
• Inverse order of AGNES: Eventually each node forms a cluster on its own
[Figure: DIANA splitting a 2-D data set, shown at three successive stages]
❑ Divisive clustering is a top-down approach
  ❑ The process starts at the root with all the points as one cluster
  ❑ It recursively splits the higher-level clusters to build the dendrogram
  ❑ Can be considered a global approach
  ❑ More efficient compared with agglomerative clustering
More on Algorithm Design for Divisive Clustering
• Choosing which cluster to split
– Check the sums of squared errors of the clusters and choose the one with the
largest value
• Splitting criterion: Determining how to split
– One may use Ward’s criterion to chase for greater reduction in the difference in
the SSE criterion as a result of a split
– For categorical data, Gini-index can be used
• Handling the noise
– Use a threshold to determine the termination criterion (do not generate
clusters that are too small because they contain mainly noises)
Extensions to Hierarchical Clustering
• Major weaknesses of hierarchical clustering methods
– Can never undo what was done previously
– Do not scale well
• Time complexity of at least O(n^2), where n is the number of total objects
• Other hierarchical clustering algorithms
– BIRCH (1996): Use CF-tree and incrementally adjust the quality of sub-clusters
– CURE (1998): Represent a cluster using a set of well-scattered representative
points
– CHAMELEON (1999): Use graph partitioning methods on the K-nearest neighbor
graph of the data
BIRCH (Balanced Iterative Reducing and Clustering
Using Hierarchies)
• A multiphase clustering algorithm (Zhang, Ramakrishnan & Livny, SIGMOD’96)
• Incrementally construct a CF (Clustering Feature) tree, a hierarchical data structure for
multiphase clustering
– Phase 1: Scan DB to build an initial in-memory CF tree (a multi-level compression of the
data that tries to preserve the inherent clustering structure of the data)
– Phase 2: Use an arbitrary clustering algorithm to cluster the leaf nodes of the CF-tree
• Key idea: Multi-level clustering
– Low-level micro-clustering: Reduce complexity and increase scalability
– High-level macro-clustering: Leave enough flexibility for high-level clustering
• Scales linearly: Find a good clustering with a single scan and improve the quality with a few
additional scans
Hierarchical clustering
• Pros
  • Dendrograms are great for visualization
  • Provides hierarchical relations between clusters
  • Shown to be able to capture concentric clusters
• Cons
  • Not easy to define levels for clusters
  • Experiments have shown that other clustering techniques can outperform
hierarchical clustering
Cluster Analysis: Basic Concepts and Methods

• Cluster Analysis: An Introduction

• Partitioning Methods

• Hierarchical Methods

• Density- and Grid-Based Methods

• Evaluation of Clustering

• Summary
Density-Based and Grid-Based Clustering Methods
❑ Density-Based Clustering
❑ Basic Concepts
❑ DBSCAN: A Density-Based Clustering Algorithm
❑ OPTICS: Ordering Points To Identify Clustering Structure
❑ Grid-Based Clustering Methods
❑ Basic Concepts
❑ STING: A Statistical Information Grid Approach
❑ CLIQUE: Grid-Based Subspace Clustering
Density-Based Clustering Methods
• Clustering based on density (a local cluster criterion), such as density-connected
points
• Major features
– Discover clusters of arbitrary shape
– Handle noise
– One scan (only examine the local region to justify density)
– Need density parameters as termination condition
• Several interesting studies
– DBSCAN: Ester, et al. (KDD’96)
– OPTICS: Ankerst, et al (SIGMOD’99)
– DENCLUE: Hinneburg & D. Keim (KDD’98)
– CLIQUE: Agrawal, et al. (SIGMOD’98) (also, grid-based)
DBSCAN: A Density-Based Spatial Clustering Algorithm
• DBSCAN (M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, KDD'96)
– Discovers clusters of arbitrary shape: Density-Based Spatial Clustering of
Applications with Noise
[Figure: points p and q with MinPts = 5 and Eps = 1 cm; core point: dense neighborhood;
border point: in a cluster but its neighborhood is not dense; outlier/noise: not in a cluster]
• A density-based notion of cluster
– A cluster is defined as a maximal set of density-connected points
– Two parameters:
  – Eps (ε): maximum radius of the neighborhood
  – MinPts: minimum number of points in the Eps-neighborhood of a point
• The Eps (ε)-neighborhood of a point q:
– NEps(q) = {p belongs to D | dist(p, q) ≤ Eps}
DBSCAN: Density-Reachable and Density-Connected
• Directly density-reachable:
– A point p is directly density-reachable from a point q w.r.t. Eps (ε) and MinPts if
  – p belongs to NEps(q), and
  – q satisfies the core point condition: |NEps(q)| ≥ MinPts
• Density-reachable:
– A point p is density-reachable from a point q w.r.t. Eps and MinPts if there is a
chain of points p1, ..., pn, with p1 = q and pn = p, such that pi+1 is directly
density-reachable from pi
• Density-connected:
– A point p is density-connected to a point q w.r.t. Eps and MinPts if there is a
point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts
[Figure: illustrations of the three notions, with MinPts = 5 and Eps = 1 cm]
DBSCAN: The Algorithm
• Algorithm
– Arbitrarily select a point p
– Retrieve all points density-reachable from p w.r.t. Eps and MinPts
  • If p is a core point, a cluster is formed
  • If p is a border point, no points are density-reachable from p, and DBSCAN
visits the next point of the database
– Continue the process until all of the points have been processed
❑ Computational complexity
  ❑ If a spatial index is used, the computational complexity of DBSCAN is
O(n log n), where n is the number of database objects
  ❑ Otherwise, the complexity is O(n^2)
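
The algorithm above is implemented, for example, in scikit-learn; a minimal hedged sketch (the eps and min_samples values are illustrative and must be tuned to the data):

import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.rand(100, 2)                   # toy data
db = DBSCAN(eps=0.1, min_samples=5).fit(X)   # eps ~ Eps, min_samples ~ MinPts
labels = db.labels_                          # cluster ids; -1 marks noise/outliers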
DBSCAN Is Sensitive to the Setting of Parameters

Ack. Figures from G. Karypis, E.-H. Han, and V. Kumar, COMPUTER, 32(8), 1999
OPTICS: Ordering Points To Identify Clustering Structure
• OPTICS (Ankerst, Breunig, Kriegel, and Sander, SIGMOD'99)
– DBSCAN is sensitive to its parameter setting
– An extension: finding the clustering structure
• Observation: Given MinPts, density-based clusters w.r.t. a higher density (ε' < ε)
are completely contained in clusters w.r.t. a lower density (ε)
• Idea: Higher-density points should be processed first—find high-density clusters first
• OPTICS stores such a clustering order using two pieces of information:
– Core distance and reachability distance
[Figure: reachability plot for a dataset—reachability distance plotted against the
cluster order of the objects]
❑ Since points belonging to a cluster have a low reachability distance to their nearest
neighbor, valleys in the plot correspond to clusters
❑ The deeper the valley, the denser the cluster
OPTICS: An Extension from DBSCAN
• Core distance of an object p: the smallest value ε' such that the ε'-neighborhood
of p has at least MinPts objects (ε is a distance value; let Nε(p) denote the
ε-neighborhood of p)

  Core-distance_{ε,MinPts}(p) = Undefined, if card(Nε(p)) < MinPts
                                MinPts-distance(p), otherwise

❑ Reachability distance of object p from core object q: the minimum radius value
that makes p density-reachable from q

  Reachability-distance_{ε,MinPts}(p, q) = Undefined, if q is not a core object
                                           max(core-distance(q), distance(q, p)), otherwise

❑ Complexity: O(N log N) (if index-based), where N is the number of points
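
A hedged sketch of producing the reachability plot with scikit-learn's OPTICS implementation (parameter values are illustrative):

import numpy as np
from sklearn.cluster import OPTICS

X = np.random.rand(200, 2)                        # toy data
opt = OPTICS(min_samples=5).fit(X)
reachability = opt.reachability_[opt.ordering_]   # reachability distances in cluster order
# plotting `reachability` as a line/bar chart gives the valleys-correspond-to-clusters view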
OPTICS: Finding Hierarchically Nested Clustering Structures
• OPTICS produces a special cluster-ordering of the data points with respect to its
density-based clustering structure
– The cluster-ordering contains information equivalent to the density-based
clusterings corresponding to a broad range of parameter settings
– Good for both automatic and interactive cluster analysis—finding intrinsic, even
hierarchically nested clustering structures

Finding nested clustering structures with different parameter settings


Grid-Based Clustering Methods
• Grid-Based Clustering: Explore multi-resolution grid data structure in clustering
– Partition the data space into a finite number of cells to form a grid structure
– Find clusters (dense regions) from the cells in the grid structure
• Features and challenges of a typical grid-based algorithm
– Efficiency and scalability: # of cells << # of data points
– Uniformity: Uniform, hard to handle highly irregular data distributions
– Locality: Limited by predefined cell sizes, borders, and the density threshold
– Curse of dimensionality: Hard to cluster high-dimensional data
• Methods to be introduced
– STING (a STatistical INformation Grid approach) (Wang, Yang and Muntz, VLDB’97)
– CLIQUE (Agrawal, Gehrke, Gunopulos, and Raghavan, SIGMOD’98)
• Both grid-based and subspace clustering
STING: A Statistical Information Grid Approach
• STING (Statistical Information Grid) (Wang, Yang and Muntz, VLDB’97)
• The spatial area is divided into rectangular cells at different levels of resolution,
and these cells form a tree structure
• A cell at a high level contains a number of smaller cells of the next lower level
❑ Statistical information of each cell is
calculated and stored beforehand and
is used to answer queries
❑ Parameters of higher-level cells can be easily calculated from those of the
lower-level cells, including
  ❑ count, mean, s (standard deviation), min, max
  ❑ type of distribution—normal, uniform, etc.
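
A small sketch of how a higher-level STING cell's statistics for one numeric attribute could be derived from its child cells (assuming the stored standard deviations are population standard deviations; the dictionary layout is illustrative, not STING's actual data structure):

import numpy as np

def merge_cells(children):
    # children: list of dicts with "count", "mean", "std", "min", "max" per lower-level cell
    n = np.array([c["count"] for c in children], dtype=float)
    m = np.array([c["mean"] for c in children])
    s = np.array([c["std"] for c in children])
    N = n.sum()
    M = (n * m).sum() / N                               # parent mean from child counts and means
    var = (n * (s ** 2 + m ** 2)).sum() / N - M ** 2    # parent variance from child moments
    return {"count": int(N), "mean": M, "std": float(np.sqrt(var)),
            "min": min(c["min"] for c in children),
            "max": max(c["max"] for c in children)}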
Query Processing in STING and Its Analysis
• To process a region query
– Start at the root and proceed to the next lower level, using the STING index
– Calculate the likelihood that a cell is relevant to the query at some confidence
level using the statistical information of the cell
– Only children of likely relevant cells are recursively explored
– Repeat this process until the bottom layer is reached
• Advantages
– Query-independent, easy to parallelize, incremental update
– Efficiency: Complexity is O(K)
• K: # of grid cells at the lowest level, and K << N (i.e., # of data points)
• Disadvantages
– Its probabilistic nature may imply a loss of accuracy in query processing
CLIQUE: Grid-Based Subspace Clustering
• CLIQUE (Clustering In QUEst) (Agrawal, Gehrke, Gunopulos, Raghavan: SIGMOD’98)
• CLIQUE is a density-based and grid-based subspace clustering algorithm
– Grid-based: It discretizes the data space through a grid and estimates the density
by counting the number of points in a grid cell
– Density-based: A cluster is a maximal set of connected dense units in a subspace
• A unit is dense if the fraction of total data points contained in the unit exceeds
the input model parameter
– Subspace clustering: A subspace cluster is a set of neighboring dense cells in an
arbitrary subspace. It also discovers some minimal descriptions of the clusters
• It automatically identifies subspaces of a high dimensional data space that allow
better clustering than original space using the Apriori principle
CLIQUE: Subspace Clustering with Apriori Pruning

• Start at 1-D space and discretize numerical intervals in each axis into grid
• Find dense regions (clusters) in each subspace and generate their minimal descriptions
– Use the dense regions to find promising candidates in 2-D space based on the Apriori
principle
– Repeat the above in level-wise manner in higher dimensional subspaces
Major Steps of the CLIQUE Algorithm
• Identify subspaces that contain clusters
– Partition the data space and find the number of points that lie inside each cell of
the partition
– Identify the subspaces that contain clusters using the Apriori principle
• Identify clusters
– Determine dense units in all subspaces of interests
– Determine connected dense units in all subspaces of interests
• Generate minimal descriptions for the clusters
– Determine maximal regions that cover a cluster of connected dense units for
each cluster
– Determine minimal cover for each cluster
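
A simplified sketch of the first CLIQUE level (finding dense 1-D units) and the Apriori-style generation of 2-D candidates; the bin count and density threshold tau are illustrative parameters, not values from the paper:

import numpy as np
from itertools import combinations

def dense_units_1d(X, n_bins=10, tau=0.1):
    # returns {(dimension, bin_index)} for 1-D grid cells holding more than tau of all points
    dense = set()
    for d in range(X.shape[1]):
        lo, hi = X[:, d].min(), X[:, d].max()
        bins = np.minimum(((X[:, d] - lo) / (hi - lo + 1e-12) * n_bins).astype(int), n_bins - 1)
        counts = np.bincount(bins, minlength=n_bins)
        dense |= {(d, b) for b in range(n_bins) if counts[b] / len(X) > tau}
    return dense

def candidate_2d_units(dense_1d):
    # Apriori principle: a 2-D unit can be dense only if both of its 1-D projections are dense
    return {(u, v) for u, v in combinations(sorted(dense_1d), 2) if u[0] != v[0]}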
Additional Comments on CLIQUE
• Strengths
– Automatically finds subspaces of the highest dimensionality as long as high
density clusters exist in those subspaces
– Insensitive to the order of records in input and does not presume some
canonical data distribution
– Scales linearly with the size of input and has good scalability as the number of
dimensions in the data increases
• Weaknesses
– As in all grid-based clustering approaches, the quality of the results crucially
depends on the appropriate choice of the number and width of the partitions
and grid cells
Outlier Analysis
• Outlier and Outlier Analysis
• Outlier Detection Methods
• Statistical Approaches
• Proximity-Based Approaches
• Clustering-Based Approaches
• Classification Approaches
• Mining Contextual and Collective Outliers
• Outlier Detection in High-Dimensional Data
• Summary
What Are Outliers?
• Outlier: A data object that deviates significantly from the normal objects, as if it were
generated by a different mechanism
– Ex.: An unusual credit card purchase; in sports: Michael Jordan, Wayne Gretzky, ...
• Outliers are different from noise data
– Noise is random error or variance in a measured variable
– Noise should be removed before outlier detection
• Outliers are interesting: they violate the mechanism that generates the normal data
• Outlier detection vs. novelty detection: at an early stage a novelty looks like an outlier,
but it is later merged into the model
• Applications:
– Credit card fraud detection
– Telecom fraud detection
– Customer segmentation
– Medical analysis
Types of Outliers (I)
• Three kinds: global, contextual, and collective outliers
• Global outlier (or point anomaly)
– An object is a global outlier Og if it significantly deviates from the rest of the data set
– Ex. Intrusion detection in computer networks
– Issue: Find an appropriate measurement of deviation (a simple z-score sketch is
shown after this list)
• Contextual outlier (or conditional outlier)
– An object is a contextual outlier Oc if it deviates significantly within a selected context
– Ex. 80 °F in Urbana: outlier? (depends on whether it is summer or winter)
– Attributes of data objects should be divided into two groups
  • Contextual attributes: define the context, e.g., time and location
  • Behavioral attributes: characteristics of the object, used in outlier evaluation,
e.g., temperature
– Can be viewed as a generalization of local outliers—objects whose density significantly
deviates from their local area
– Issue: How to define or formulate a meaningful context?
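
As referenced above, one simple statistical measurement of deviation for global outliers is the z-score; a minimal sketch (the threshold of 3 standard deviations is a common but arbitrary choice):

import numpy as np

def zscore_outliers(x, threshold=3.0):
    # flag values lying more than `threshold` standard deviations from the mean
    z = (x - x.mean()) / x.std()
    return np.where(np.abs(z) > threshold)[0]   # indices of global outliers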
Types of Outliers (II)
• Collective outliers
– A subset of data objects that collectively deviate significantly from the whole data
set, even if the individual data objects may not be outliers
– Applications: e.g., intrusion detection—when a number of computers keep sending
denial-of-service packets to each other
◼ Detection of collective outliers
  ◼ Consider not only the behavior of individual objects, but also that of groups of objects
  ◼ Need background knowledge of the relationship among data objects, such as a
distance or similarity measure on objects
◼ A data set may have multiple types of outlier
◼ One object may belong to more than one type of outlier
End of Unit-4
