
AI512PE: DATA MINING (PE - I)

Unit - IV
Clustering and Applications
• Cluster analysis
• Types of Data in Cluster Analysis
• Categorization of Major Clustering Methods
• Partitioning Methods
• Hierarchical Methods
• Density–Based Methods
• Grid–Based Methods
• Outlier Analysis.
Introduction
• Supervised learning: discover patterns in data with a known target
(class) or label.
• These patterns are then used to predict the value of the target attribute
in future data instances.
• Examples?
• Unsupervised learning: the data have no target attribute.
• We want to explore the data to find intrinsic structure in them.
• Can we perform regression here?
• Examples?
Introduction…
• The goal of clustering is to
  • group data points that are close (or similar) to each other
  • identify such groupings (or clusters) in an unsupervised manner
• How to define similarity?
• How many iterations for checking cluster quality?
Cluster
• A cluster is represented by a single point, known as the centroid (or cluster
center) of the cluster.
• The centroid is computed as the mean of all data points in the cluster:

  Cj = (1/|Cj|) · Σ_{xi ∈ Cj} xi

• The cluster boundary is decided by the data point farthest from the centroid.
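
As a small illustration of the centroid definition above, the following Python/NumPy sketch (not part of the original slides; the data values are illustrative) computes the centroid of a toy cluster and the radius set by its farthest point:

import numpy as np

# Toy cluster of three 2-D points (illustrative values)
cluster_points = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 0.0]])

centroid = cluster_points.mean(axis=0)  # Cj = (1/|Cj|) * sum of xi in Cj
radius = np.linalg.norm(cluster_points - centroid, axis=1).max()  # farthest point sets the boundary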
What Is Cluster Analysis?
• What is a cluster?
– A cluster is a collection of data objects which are
• Similar (or related) to one another within the same group (i.e.,
cluster)
• Dissimilar (or unrelated) to the objects in other groups (i.e.,
clusters)
• Cluster analysis (or clustering, data segmentation, …)
– Given a set of data points, partition them into a set of groups (i.e.,
clusters) which are as similar as possible
• Cluster analysis is unsupervised learning (i.e., no predefined classes)
– This contrasts with classification (i.e., supervised learning)
• Typical ways to use/apply cluster analysis
– As a stand-alone tool to get insight into data distribution, or
– As a preprocessing (or intermediate) step for other algorithms
What Is Good Clustering?
• A good clustering method will produce high quality clusters which should have
– High intra-class similarity: Cohesive within clusters
– Low inter-class similarity: Distinctive between clusters
• Quality function
– There is usually a separate “quality” function that measures the “goodness” of
a cluster
– It is hard to define “similar enough” or “good enough”
• The answer is typically highly subjective
• There exist many similarity measures and/or functions for different applications
• Similarity measure is critical for cluster analysis
Cluster Analysis: Applications
• A key intermediate step for other data mining tasks
– Generating a compact summary of data for classification, pattern discovery,
hypothesis generation and testing, etc.
– Outlier detection: Outliers—those “far away” from any cluster
• Data summarization, compression, and reduction
– Ex. Image processing: Vector quantization
• Collaborative filtering, recommendation systems, or customer segmentation
– Find like-minded users or similar products
• Dynamic trend detection
– Clustering stream data and detecting trends and patterns
• Multimedia data analysis, biological data analysis and social network analysis
– Ex. Clustering images or video/audio clips, gene/protein sequences, etc.
Applications
• Example 1: Group people of similar sizes together to make "small",
"medium", and "large" T-shirts.
  • Tailor-made for each person: too expensive.
  • One-size-fits-all: does not fit all.
• Example 2: In marketing, segment customers according to their similarities,
to do targeted marketing.
• Example 3: Given a collection of text documents, organize them according to
their content similarities, to produce a topic hierarchy.
Considerations for Cluster Analysis
• Partitioning criteria
– Single level vs. hierarchical partitioning (often, multi-level hierarchical
partitioning is desirable, e.g., grouping topical terms)
• Separation of clusters
– Exclusive (e.g., one customer belongs to only one region) vs. non-exclusive
(e.g., one document may belong to more than one class)
• Similarity measure
– Distance-based (e.g., Euclidean, road network, vector) vs.
connectivity-based (e.g., density or contiguity)
• Clustering space
– Full space (often when low dimensional) vs. subspaces (often in
high-dimensional clustering)
Requirements and Challenges
• Quality
– Ability to deal with different types of attributes: Numerical, categorical, text,
multimedia, networks, and mixture of multiple types
– Discovery of clusters with arbitrary shape
– Ability to deal with noisy data
• Scalability
– Clustering all the data instead of only on samples
– High dimensionality
– Incremental or stream clustering and insensitivity to input order
• Constraint-based clustering
– User-given preferences or constraints; domain knowledge; user queries
• Interpretability and usability
– The final generated clusters should be semantically meaningful and useful
Clustering criteria
1. Similarity function
2. Stopping criterion
3. Cluster quality
1. Similarity function / Distance measure
• How to find distance between data points
• Euclidean distance
• Problems with Euclidean distance
Euclidean distance and Manhattan distance
• Euclidean distance

  dist(xi, xj) = sqrt( (xi1 − xj1)^2 + (xi2 − xj2)^2 + ... + (xir − xjr)^2 )

• Manhattan distance

  dist(xi, xj) = |xi1 − xj1| + |xi2 − xj2| + ... + |xir − xjr|

• Weighted Euclidean distance

  dist(xi, xj) = sqrt( w1·(xi1 − xj1)^2 + w2·(xi2 − xj2)^2 + ... + wr·(xir − xjr)^2 )

Squared distance
• Squared Euclidean distance: places progressively greater weight on data
points that are farther apart.

  dist(xi, xj) = (xi1 − xj1)^2 + (xi2 − xj2)^2 + ... + (xir − xjr)^2
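
A minimal NumPy sketch of the four distance measures above (the helper names are illustrative; x, y, and the weights w are assumed to be equal-length 1-D arrays):

import numpy as np

def euclidean(x, y):
    return np.sqrt(((x - y) ** 2).sum())

def manhattan(x, y):
    return np.abs(x - y).sum()

def weighted_euclidean(x, y, w):
    # w holds one non-negative weight per attribute
    return np.sqrt((w * (x - y) ** 2).sum())

def squared_euclidean(x, y):
    # no square root: larger gaps are penalized progressively more
    return ((x - y) ** 2).sum()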
Distance functions for binary and nominal attributes
• Binary attribute: has two values or states but no ordering relationship,
e.g., Gender: male and female.
• We use a confusion matrix to introduce the distance functions/measures.
• Let the ith and jth data points be xi and xj (vectors).
Confusion matrix
[Figure: confusion (contingency) matrix of the binary attribute values of xi and xj]
Contd..
• Cosine similarity

  cos(x, y) = (x · y) / (||x|| · ||y||)

• Euclidean distance

  d(x, y) = sqrt( Σi (xi − yi)^2 )

• Minkowski metric

  d(x, y) = ( Σi |xi − yi|^p )^(1/p)
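
A short sketch of the cosine similarity and Minkowski metric above (assuming NumPy arrays; the helper names are illustrative):

import numpy as np

def cosine_similarity(x, y):
    # dot product divided by the product of the vector norms
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

def minkowski(x, y, p=3):
    # p = 1 gives Manhattan distance, p = 2 gives Euclidean distance
    return (np.abs(x - y) ** p).sum() ** (1.0 / p)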
2. Stopping criteria
1. No (or minimal) re-assignment of data points to different clusters,
2. No (or minimal) change of the centroids, or
3. Minimal decrease in the sum of squared error (SSE):

  SSE = Σ_{j=1..k} Σ_{x ∈ Cj} dist(x, mj)^2

• Cj is the jth cluster, mj is the centroid of cluster Cj (the mean vector of all
the data points in Cj), and dist(x, mj) is the distance between data point x
and centroid mj.
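
The SSE above can be computed directly once cluster assignments and centroids are known; a minimal sketch (assuming X is an n×d array, labels holds each point's cluster index, and centroids is a k×d array):

import numpy as np

def sse(X, labels, centroids):
    # SSE = sum over clusters j of sum over x in Cj of dist(x, mj)^2
    return sum(((X[labels == j] - m) ** 2).sum() for j, m in enumerate(centroids))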
An example
[Figures: worked clustering example with two centroids marked "+"; slides adapted from Bing Liu, UIC]

3. Cluster quality
• Intra-cluster cohesion (compactness):
• Cohesion measures how near the data points in a cluster are to the cluster
centroid.
• Sum of squared error (SSE) is a commonly used measure.
• Inter-cluster separation (isolation):
• Separation means that different cluster centroids should be far away from
one another.

[Figure: two points a and b plotted on Dimension 1 vs. Dimension 2]
Cluster Analysis: Basic Concepts and Methods

• Cluster Analysis: An Introduction

• Partitioning Methods

• Hierarchical Methods

• Density- and Grid-Based Methods

• Evaluation of Clustering

• Summary
Types of clustering
• Clustering: the task of grouping a set of data points such that data points in the
same group are more similar to each other than to data points in another group
(a group is known as a cluster)
• It groups data instances that are similar to (near) each other into one cluster, and
• data instances that are very different from (far away from) each other into
different clusters.
Types
• Partitioning Methods
• Hierarchical Methods
• Density–Based Methods
• Grid–Based Methods
Partitioning-Based Clustering Methods
• Basic Concepts of Partitioning Algorithms

• The K-Means Clustering Method

• Initialization of K-Means Clustering

• The K-Medoids Clustering Method

• The K-Medians and K-Modes Clustering Methods

• The Kernel K-Means Clustering Method


Partitioning Algorithms: Basic Concepts
• Partitioning method: Discovering the groupings in the data by optimizing a
specific objective function and iteratively improving the quality of partitions
• K-partitioning method: Partitioning a dataset D of n objects into a set of K
clusters so that an objective function is optimized (e.g., the sum of squared
distances is minimized, where ck is the centroid or medoid of cluster Ck)
– A typical objective function: Sum of Squared Errors (SSE)
• Problem definition: Given K, find a partition of K clusters that optimizes the
chosen partitioning criterion
– Global optimal: Needs to exhaustively enumerate all partitions
– Heuristic methods (i.e., greedy algorithms): K-Means, K-Medians, K-
Medoids, etc.
K-means Clustering
• Basic idea: randomly initialize the K cluster centers, and iterate between the two
steps we just saw.
1. Randomly initialize the cluster centers c1, ..., cK
2. Given the cluster centers, determine the points in each cluster:
   for each point p, find the closest ci and put p into cluster i
3. Given the points in each cluster, solve for ci:
   set ci to be the mean of the points in cluster i
4. If the ci have changed, repeat from Step 2
K-means contd..
Algorithm
Begin
  initialize n, c, μ1, μ2, ..., μc (randomly selected)
  do
    classify the n samples according to the nearest μi
    recompute μi
  until no change in μi
  return μ1, μ2, ..., μc
End
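
A compact, runnable version of the algorithm above in Python/NumPy (a sketch, not the exact course implementation; k and the random seed are illustrative parameters):

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]      # random initial centers
    for _ in range(max_iter):
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = d.argmin(axis=1)                                  # classify samples by nearest centroid
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j]                # keep an empty cluster's old center
                                  for j in range(k)])
        if np.allclose(new_centroids, centroids):                  # stop when centroids no longer change
            break
        centroids = new_centroids
    return centroids, labels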
K-means example
[Figures: step-by-step K-means iterations on a 2-D data set; slides adapted from Andrew Moore]
Contd..
• Pros
  • Simple, fast to compute
  • Converges to a local minimum of the within-cluster squared error
• Cons
  • Setting k?
  • Sensitive to initial centers
  • Sensitive to outliers
  • Detects only spherical clusters
  • Assumes means can be computed

Discussion on the K-Means Method
• Efficiency: O(tKn) where n: # of objects, K: # of clusters, and t: # of iterations
– Normally, K, t << n; thus, an efficient method
• K-means clustering often terminates at a local optimal
– Initialization can be important to find high-quality clusters
• Need to specify K, the number of clusters, in advance
– There are ways to automatically determine the “best” K
– In practice, one often runs the algorithm for a range of values and selects the "best" K value
• Sensitive to noisy data and outliers
– Variations: Using K-medians, K-medoids, etc.
• K-means is applicable only to objects in a continuous n-dimensional space
– Using the K-modes for categorical data
• Not suitable to discover clusters with non-convex shapes
– Using density-based clustering, kernel K-means, etc.
Variations of K-Means
• There are many variants of the K-Means method, varying in different aspects
– Choosing better initial centroid estimates
  • K-Means++, Intelligent K-Means, Genetic K-Means
– Choosing different representative prototypes for the clusters
  • K-Medoids, K-Medians, K-Modes
– Applying feature transformation techniques
  • Weighted K-Means, Kernel K-Means
Poor Initialization in K-Means May Lead to Poor Clustering
[Figure: assign points to clusters, then recompute cluster centers, for another random
selection of K centroids on the same data points]
❑ Rerun of K-Means using another K random seeds
❑ This run of K-Means generates a poor-quality clustering
Initialization of K-Means: Problem and Solution
• Different initializations may generate rather different clustering
results (some could be far from optimal)
• Original proposal (MacQueen’67): Select K seeds randomly
– Need to run the algorithm multiple times using different seeds
❑ There are many methods proposed for better initialization of k seeds
❑ K-Means++ (Arthur & Vassilvitskii’07):
❑ The first centroid is selected at random
❑ The next centroid selected is the one that is farthest from the currently selected
(selection is based on a weighted probability score)
❑ The selection continues until K centroids are obtained
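
A sketch of the K-Means++ seeding described above (selection weighted by the squared distance to the nearest already-chosen centroid; assumes X is an n×d NumPy array):

import numpy as np

def kmeans_pp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]                    # first centroid chosen uniformly at random
    while len(centers) < k:
        d2 = ((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(-1).min(axis=1)
        probs = d2 / d2.sum()                              # farther points are more likely to be picked
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)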
PAM: A Typical K-Medoids Algorithm
[Figure: PAM illustrated on a small 2-D data set with K = 2]
❑ Select the initial K medoids randomly (arbitrarily choose K objects as initial medoids)
❑ Assign each remaining object to the nearest medoid
❑ Repeat
  ❑ Object re-assignment
  ❑ Randomly select a non-medoid object Orandom
  ❑ Compute the total cost of swapping a medoid m with Orandom
  ❑ Swap medoid m with Orandom if it improves the clustering quality
❑ Until the convergence criterion is satisfied
Discussion on K-Medoids Clustering
• K-Medoids Clustering: Find representative objects (medoids) in clusters
• PAM (Partitioning Around Medoids: Kaufmann & Rousseeuw 1987)
– Starts from an initial set of medoids, and
– Iteratively replaces one of the medoids by one of the non-medoids if it improves
the total sum of the squared errors (SSE) of the resulting clustering
– PAM works effectively for small data sets but does not scale well for large data sets
(due to the computational complexity)
– Computational complexity of PAM: O(K(n − K)^2) per iteration (quite expensive!)
• Efficiency improvements on PAM
– CLARA (Kaufmann & Rousseeuw, 1990):
  • PAM on samples; O(Ks^2 + K(n − K)), where s is the sample size
– CLARANS (Ng & Han, 1994): Randomized re-sampling, ensuring efficiency + quality
Handling Outliers: From K-Means to K-Medoids
• The K-Means algorithm is sensitive to outliers!—since an object with an extremely
large value may substantially distort the distribution of the data
• K-Medoids: Instead of taking the mean value of the object in a cluster as a reference
point, medoids can be used, which is the most centrally located object in a cluster
• The K-Medoids clustering algorithm:
• Select K points as the initial representative objects (i.e., as initial K medoids)
• Repeat
– Assigning each point to the cluster with the closest medoid
– Randomly select a non-representative object oi
– Compute the total cost S of swapping the medoid m with oi
– If S < 0, then swap m with oi to form the new set of medoids
• Until convergence criterion is satisfied
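
One improvement step of the K-Medoids loop above can be sketched as follows (a simplified illustration, not the full PAM procedure; X is an n×d NumPy array and medoid_idx holds the indices of the current medoids):

import numpy as np

def total_cost(X, medoid_idx):
    # sum of distances from every point to its nearest medoid
    d = np.linalg.norm(X[:, None, :] - X[medoid_idx][None, :, :], axis=-1)
    return d.min(axis=1).sum()

def kmedoids_swap_step(X, medoid_idx, seed=0):
    rng = np.random.default_rng(seed)
    non_medoids = [i for i in range(len(X)) if i not in medoid_idx]
    oi = rng.choice(non_medoids)                      # randomly selected non-representative object oi
    base = total_cost(X, medoid_idx)
    for j in range(len(medoid_idx)):
        trial = list(medoid_idx)
        trial[j] = oi                                 # tentatively swap medoid m with oi
        S = total_cost(X, trial) - base               # total cost change S of the swap
        if S < 0:
            return trial                              # accept the swap: it improves the clustering
    return medoid_idx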
K-Medians: Handling Outliers by Computing Medians
• Medians are less sensitive to outliers than means
– Think of the median salary vs. mean salary of a large firm when adding a few top
executives!
• K-Medians: Instead of taking the mean value of the object in a cluster as a
reference point, medians are used (L1-norm as the distance measure)
• The criterion function for the K-Medians algorithm (L1-norm):
  S = Σ_{j=1..K} Σ_{x ∈ Cj} Σ_{d=1..r} |x_d − med_{j,d}|
• The K-Medians clustering algorithm:
• Select K points as the initial representative objects (i.e., as initial K medians)
• Repeat
– Assign every point to its nearest median
– Re-compute the median using the median of each individual feature
• Until convergence criterion is satisfied
Cluster Analysis: Basic Concepts and Methods

• Cluster Analysis: An Introduction

• Partitioning Methods

• Hierarchical Methods

• Density- and Grid-Based Methods

• Evaluation of Clustering

• Summary
Hierarchical Clustering Methods
• Basic Concepts of Hierarchical Algorithms
– Agglomerative Clustering Algorithms
– Divisive Clustering Algorithms
• Extensions to Hierarchical Clustering
• BIRCH: A Micro-Clustering-Based Approach
• CURE: Exploring Well-Scattered Representative Points
• CHAMELEON: Graph Partitioning on the KNN Graph of the
Data
• Probabilistic Hierarchical Clustering
Hierarchical Clustering: Basic Concepts
[Figure: objects a, b, c, d, e merged bottom-up (agglomerative, AGNES, steps 0-4) and
split top-down (divisive, DIANA, steps 4-0): a, b → ab; d, e → de; c, de → cde; ab, cde → abcde]
• Hierarchical clustering
– Generates a clustering hierarchy (drawn as a dendrogram)
– Not required to specify K, the number of clusters
– More deterministic
– No iterative refinement
• Two categories of algorithms:
❑ Agglomerative (AGNES): Start with singleton clusters, continuously merge two clusters
at a time to build a bottom-up hierarchy of clusters
❑ Divisive (DIANA): Start with a huge macro-cluster, split it continuously into two groups,
generating a top-down hierarchy of clusters
Dendrogram: Shows How Clusters are Merged
• Dendrogram: Decompose a set of data objects into a tree of clusters by multi-level
nested partitioning
• A clustering of the data objects is obtained by cutting the dendrogram at the desired
level, then each connected component forms a cluster

Hierarchical clustering
generates a dendrogram
(a hierarchy of clusters)
Agglomerative Clustering Algorithm
• AGNES (AGglomerative NESting) (Kaufmann and Rousseeuw, 1990)
– Use the single-link method and the dissimilarity matrix
– Continuously merge nodes that have the least dissimilarity
– Eventually all nodes belong to the same cluster
[Figure: AGNES merging points on a 2-D data set, shown at three successive stages]
❑ Agglomerative clustering varies on the similarity measure used among clusters
  ❑ Single link (nearest neighbor)   ❑ Average link (group average)
  ❑ Complete link (diameter)         ❑ Centroid link (centroid similarity)
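
In practice these agglomerative variants are available off the shelf; a hedged sketch using SciPy (the data set and cut level are illustrative):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(20, 2)                        # toy 2-D data set
Z = linkage(X, method="single")                  # also: "complete", "average", "centroid"
labels = fcluster(Z, t=3, criterion="maxclust")  # cut the dendrogram into 3 clusters

Calling scipy.cluster.hierarchy.dendrogram(Z) would draw the corresponding dendrogram.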
Single Link vs. Complete Link in Hierarchical Clustering
• Single link (nearest neighbor)
– The similarity between two clusters is the similarity between their most
similar (nearest-neighbor) members
– Local similarity-based: emphasizes close regions, ignoring the overall
structure of the cluster
– Capable of clustering non-elliptical shaped groups of objects
– Sensitive to noise and outliers
• Complete link (diameter)
– The similarity between two clusters is the similarity between their most
dissimilar members
– Merge two clusters to form one with the smallest diameter
– Nonlocal in behavior, obtaining compact-shaped clusters
– Sensitive to outliers
Agglomerative Clustering: Average vs. Centroid Links
• Agglomerative clustering with average link
– Average link: the average distance between an element in one cluster Ca and an
element in the other cluster Cb (i.e., averaged over all pairs across the two clusters)
  • Expensive to compute
• Agglomerative clustering with centroid link
– Centroid link: the distance between the centroids of the two clusters
• Group Averaged Agglomerative Clustering (GAAC)
  • Let two clusters Ca and Cb be merged into Ca∪b. The new centroid is
    c_{a∪b} = (Na·ca + Nb·cb) / (Na + Nb)
– Na is the cardinality of cluster Ca, and ca is the centroid of Ca
• The similarity measure for GAAC is the average of their distances
Divisive Clustering
• DIANA (Divisive Analysis) (Kaufmann and Rousseeuw,1990)
– Implemented in some statistical analysis packages, e.g., Splus
• Inverse order of AGNES: Eventually each node forms a cluster on its own
[Figure: DIANA splitting a 2-D data set, shown at three successive stages]
❑ Divisive clustering is a top-down approach
  ❑ The process starts at the root with all the points as one cluster
  ❑ It recursively splits the higher-level clusters to build the dendrogram
  ❑ Can be considered a global approach
  ❑ More efficient compared with agglomerative clustering
More on Algorithm Design for Divisive Clustering
• Choosing which cluster to split
– Check the sums of squared errors of the clusters and choose the one with the
largest value
• Splitting criterion: Determining how to split
– One may use Ward’s criterion to chase for greater reduction in the difference in
the SSE criterion as a result of a split
– For categorical data, Gini-index can be used
• Handling the noise
– Use a threshold to determine the termination criterion (do not generate
clusters that are too small because they contain mainly noises)
Extensions to Hierarchical Clustering
• Major weaknesses of hierarchical clustering methods
– Can never undo what was done previously
– Do not scale well
• Time complexity of at least O(n^2), where n is the number of total objects
• Other hierarchical clustering algorithms
– BIRCH (1996): Use CF-tree and incrementally adjust the quality of sub-clusters
– CURE (1998): Represent a cluster using a set of well-scattered representative
points
– CHAMELEON (1999): Use graph partitioning methods on the K-nearest neighbor
graph of the data
BIRCH (Balanced Iterative Reducing and Clustering
Using Hierarchies)
• A multiphase clustering algorithm (Zhang, Ramakrishnan & Livny, SIGMOD’96)
• Incrementally construct a CF (Clustering Feature) tree, a hierarchical data structure for
multiphase clustering
– Phase 1: Scan DB to build an initial in-memory CF tree (a multi-level compression of the
data that tries to preserve the inherent clustering structure of the data)
– Phase 2: Use an arbitrary clustering algorithm to cluster the leaf nodes of the CF-tree
• Key idea: Multi-level clustering
– Low-level micro-clustering: Reduce complexity and increase scalability
– High-level macro-clustering: Leave enough flexibility for high-level clustering
• Scales linearly: Find a good clustering with a single scan and improve the quality with a few
additional scans
Hierarchical clustering
• Pros
  • Dendrograms are great for visualization
  • Provides hierarchical relations between clusters
  • Shown to be able to capture concentric clusters
• Cons
  • Not easy to define levels for clusters
  • Experiments have shown that other clustering techniques can outperform
hierarchical clustering
Cluster Analysis: Basic Concepts and Methods

• Cluster Analysis: An Introduction

• Partitioning Methods

• Hierarchical Methods

• Density- and Grid-Based Methods

• Evaluation of Clustering

• Summary
Density-Based and Grid-Based Clustering Methods
❑ Density-Based Clustering
❑ Basic Concepts
❑ DBSCAN: A Density-Based Clustering Algorithm
❑ OPTICS: Ordering Points To Identify Clustering Structure
❑ Grid-Based Clustering Methods
❑ Basic Concepts
❑ STING: A Statistical Information Grid Approach
❑ CLIQUE: Grid-Based Subspace Clustering
Density-Based Clustering Methods
• Clustering based on density (a local cluster criterion), such as density-connected
points
• Major features
– Discover clusters of arbitrary shape
– Handle noise
– One scan (only examine the local region to justify density)
– Need density parameters as termination condition
• Several interesting studies
– DBSCAN: Ester, et al. (KDD’96)
– OPTICS: Ankerst, et al (SIGMOD’99)
– DENCLUE: Hinneburg & D. Keim (KDD’98)
– CLIQUE: Agrawal, et al. (SIGMOD’98) (also, grid-based)
DBSCAN: A Density-Based Spatial Clustering Algorithm
• DBSCAN (M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, KDD'96)
– Discovers clusters of arbitrary shape: Density-Based Spatial Clustering of
Applications with Noise
[Figure: points p and q with MinPts = 5 and Eps = 1 cm; core point: dense neighborhood;
border point: in a cluster but its neighborhood is not dense; outlier/noise: not in a cluster]
• A density-based notion of cluster
– A cluster is defined as a maximal set of density-connected points
– Two parameters:
  – Eps (ε): maximum radius of the neighborhood
  – MinPts: minimum number of points in the Eps-neighborhood of a point
• The Eps (ε)-neighborhood of a point q:
– NEps(q) = {p belongs to D | dist(p, q) ≤ Eps}
DBSCAN: Density-Reachable and Density-Connected
• Directly density-reachable:
– A point p is directly density-reachable from a point q w.r.t. Eps (ε) and MinPts if
  – p belongs to NEps(q), and
  – q satisfies the core point condition: |NEps(q)| ≥ MinPts
• Density-reachable:
– A point p is density-reachable from a point q w.r.t. Eps and MinPts if there is a
chain of points p1, ..., pn, with p1 = q and pn = p, such that pi+1 is directly
density-reachable from pi
• Density-connected:
– A point p is density-connected to a point q w.r.t. Eps and MinPts if there is a
point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts
[Figure: illustrations of the three notions, with MinPts = 5 and Eps = 1 cm]
DBSCAN: The Algorithm
• Algorithm
– Arbitrarily select a point p
– Retrieve all points density-reachable from p w.r.t. Eps and MinPts
  • If p is a core point, a cluster is formed
  • If p is a border point, no points are density-reachable from p, and DBSCAN
visits the next point of the database
– Continue the process until all of the points have been processed
❑ Computational complexity
  ❑ If a spatial index is used, the computational complexity of DBSCAN is
O(n log n), where n is the number of database objects
  ❑ Otherwise, the complexity is O(n^2)
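
The algorithm above is implemented, for example, in scikit-learn; a minimal hedged sketch (the eps and min_samples values are illustrative and must be tuned to the data):

import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.rand(100, 2)                   # toy data
db = DBSCAN(eps=0.1, min_samples=5).fit(X)   # eps ~ Eps, min_samples ~ MinPts
labels = db.labels_                          # cluster ids; -1 marks noise/outliers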
DBSCAN Is Sensitive to the Setting of Parameters

Ack. Figures from G. Karypis, E.-H. Han, and V. Kumar, COMPUTER, 32(8), 1999
OPTICS: Ordering Points To Identify Clustering Structure
• OPTICS (Ankerst, Breunig, Kriegel, and Sander, SIGMOD'99)
– DBSCAN is sensitive to its parameter setting
– An extension: finding the clustering structure
• Observation: Given MinPts, density-based clusters w.r.t. a higher density (ε' < ε)
are completely contained in clusters w.r.t. a lower density (ε)
• Idea: Higher-density points should be processed first—find high-density clusters first
• OPTICS stores such a clustering order using two pieces of information:
– Core distance and reachability distance
[Figure: reachability plot for a dataset—reachability distance plotted against the
cluster order of the objects]
❑ Since points belonging to a cluster have a low reachability distance to their nearest
neighbor, valleys in the plot correspond to clusters
❑ The deeper the valley, the denser the cluster
OPTICS: An Extension from DBSCAN
• Core distance of an object p: the smallest value ε' such that the ε'-neighborhood
of p has at least MinPts objects (ε is a distance value; let Nε(p) denote the
ε-neighborhood of p)

  Core-distance_{ε,MinPts}(p) = Undefined, if card(Nε(p)) < MinPts
                                MinPts-distance(p), otherwise

❑ Reachability distance of object p from core object q: the minimum radius value
that makes p density-reachable from q

  Reachability-distance_{ε,MinPts}(p, q) = Undefined, if q is not a core object
                                           max(core-distance(q), distance(q, p)), otherwise

❑ Complexity: O(N log N) (if index-based), where N is the number of points
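
A hedged sketch of producing the reachability plot with scikit-learn's OPTICS implementation (parameter values are illustrative):

import numpy as np
from sklearn.cluster import OPTICS

X = np.random.rand(200, 2)                        # toy data
opt = OPTICS(min_samples=5).fit(X)
reachability = opt.reachability_[opt.ordering_]   # reachability distances in cluster order
# plotting `reachability` as a line/bar chart gives the valleys-correspond-to-clusters view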
OPTICS: Finding Hierarchically Nested Clustering Structures
• OPTICS produces a special cluster-ordering of the data points with respect to its
density-based clustering structure
– The cluster-ordering contains information equivalent to the density-based
clusterings corresponding to a broad range of parameter settings
– Good for both automatic and interactive cluster analysis—finding intrinsic, even
hierarchically nested clustering structures

Finding nested clustering structures with different parameter settings


Grid-Based Clustering Methods
• Grid-Based Clustering: Explore multi-resolution grid data structure in clustering
– Partition the data space into a finite number of cells to form a grid structure
– Find clusters (dense regions) from the cells in the grid structure
• Features and challenges of a typical grid-based algorithm
– Efficiency and scalability: # of cells << # of data points
– Uniformity: Uniform, hard to handle highly irregular data distributions
– Locality: Limited by predefined cell sizes, borders, and the density threshold
– Curse of dimensionality: Hard to cluster high-dimensional data
• Methods to be introduced
– STING (a STatistical INformation Grid approach) (Wang, Yang and Muntz, VLDB’97)
– CLIQUE (Agrawal, Gehrke, Gunopulos, and Raghavan, SIGMOD’98)
• Both grid-based and subspace clustering
STING: A Statistical Information Grid Approach
• STING (Statistical Information Grid) (Wang, Yang and Muntz, VLDB’97)
• The spatial area is divided into rectangular cells at different levels of resolution,
and these cells form a tree structure
• A cell at a high level contains a number of smaller cells of the next lower level
❑ Statistical information of each cell is
calculated and stored beforehand and
is used to answer queries
❑ Parameters of higher-level cells can be easily calculated from those of the
lower-level cells, including
  ❑ count, mean, s (standard deviation), min, max
  ❑ type of distribution—normal, uniform, etc.
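
A small sketch of how a higher-level STING cell's statistics for one numeric attribute could be derived from its child cells (assuming the stored standard deviations are population standard deviations; the dictionary layout is illustrative, not STING's actual data structure):

import numpy as np

def merge_cells(children):
    # children: list of dicts with "count", "mean", "std", "min", "max" per lower-level cell
    n = np.array([c["count"] for c in children], dtype=float)
    m = np.array([c["mean"] for c in children])
    s = np.array([c["std"] for c in children])
    N = n.sum()
    M = (n * m).sum() / N                               # parent mean from child counts and means
    var = (n * (s ** 2 + m ** 2)).sum() / N - M ** 2    # parent variance from child moments
    return {"count": int(N), "mean": M, "std": float(np.sqrt(var)),
            "min": min(c["min"] for c in children),
            "max": max(c["max"] for c in children)}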
Query Processing in STING and Its Analysis
• To process a region query
– Start at the root and proceed to the next lower level, using the STING index
– Calculate the likelihood that a cell is relevant to the query at some confidence
level using the statistical information of the cell
– Only children of likely relevant cells are recursively explored
– Repeat this process until the bottom layer is reached
• Advantages
– Query-independent, easy to parallelize, incremental update
– Efficiency: Complexity is O(K)
• K: # of grid cells at the lowest level, and K << N (i.e., # of data points)
• Disadvantages
– Its probabilistic nature may imply a loss of accuracy in query processing
CLIQUE: Grid-Based Subspace Clustering
• CLIQUE (Clustering In QUEst) (Agrawal, Gehrke, Gunopulos, Raghavan: SIGMOD’98)
• CLIQUE is a density-based and grid-based subspace clustering algorithm
– Grid-based: It discretizes the data space through a grid and estimates the density
by counting the number of points in a grid cell
– Density-based: A cluster is a maximal set of connected dense units in a subspace
• A unit is dense if the fraction of total data points contained in the unit exceeds
the input model parameter
– Subspace clustering: A subspace cluster is a set of neighboring dense cells in an
arbitrary subspace. It also discovers some minimal descriptions of the clusters
• It automatically identifies subspaces of a high dimensional data space that allow
better clustering than original space using the Apriori principle
CLIQUE: Subspace Clustering with Apriori Pruning

• Start at 1-D space and discretize numerical intervals in each axis into grid
• Find dense regions (clusters) in each subspace and generate their minimal descriptions
– Use the dense regions to find promising candidates in 2-D space based on the Apriori
principle
– Repeat the above in level-wise manner in higher dimensional subspaces
Major Steps of the CLIQUE Algorithm
• Identify subspaces that contain clusters
– Partition the data space and find the number of points that lie inside each cell of
the partition
– Identify the subspaces that contain clusters using the Apriori principle
• Identify clusters
– Determine dense units in all subspaces of interests
– Determine connected dense units in all subspaces of interests
• Generate minimal descriptions for the clusters
– Determine maximal regions that cover a cluster of connected dense units for
each cluster
– Determine minimal cover for each cluster
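
A simplified sketch of the first CLIQUE level (finding dense 1-D units) and the Apriori-style generation of 2-D candidates; the bin count and density threshold tau are illustrative parameters, not values from the paper:

import numpy as np
from itertools import combinations

def dense_units_1d(X, n_bins=10, tau=0.1):
    # returns {(dimension, bin_index)} for 1-D grid cells holding more than tau of all points
    dense = set()
    for d in range(X.shape[1]):
        lo, hi = X[:, d].min(), X[:, d].max()
        bins = np.minimum(((X[:, d] - lo) / (hi - lo + 1e-12) * n_bins).astype(int), n_bins - 1)
        counts = np.bincount(bins, minlength=n_bins)
        dense |= {(d, b) for b in range(n_bins) if counts[b] / len(X) > tau}
    return dense

def candidate_2d_units(dense_1d):
    # Apriori principle: a 2-D unit can be dense only if both of its 1-D projections are dense
    return {(u, v) for u, v in combinations(sorted(dense_1d), 2) if u[0] != v[0]}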
Additional Comments on CLIQUE
• Strengths
– Automatically finds subspaces of the highest dimensionality as long as high
density clusters exist in those subspaces
– Insensitive to the order of records in input and does not presume some
canonical data distribution
– Scales linearly with the size of input and has good scalability as the number of
dimensions in the data increases
• Weaknesses
– As in all grid-based clustering approaches, the quality of the results crucially
depends on the appropriate choice of the number and width of the partitions
and grid cells
Outlier Analysis
• Outlier and Outlier Analysis
• Outlier Detection Methods
• Statistical Approaches
• Proximity-Based Approaches
• Clustering-Based Approaches
• Classification Approaches
• Mining Contextual and Collective Outliers
• Outlier Detection in High-Dimensional Data
• Summary
What Are Outliers?
• Outlier: A data object that deviates significantly from the normal objects, as if it were
generated by a different mechanism
– Ex.: An unusual credit card purchase; in sports: Michael Jordan, Wayne Gretzky, ...
• Outliers are different from noise data
– Noise is random error or variance in a measured variable
– Noise should be removed before outlier detection
• Outliers are interesting: they violate the mechanism that generates the normal data
• Outlier detection vs. novelty detection: at an early stage a novelty looks like an outlier,
but it is later merged into the model
• Applications:
– Credit card fraud detection
– Telecom fraud detection
– Customer segmentation
– Medical analysis
Types of Outliers (I)
• Three kinds: global, contextual, and collective outliers
• Global outlier (or point anomaly)
– An object is a global outlier Og if it significantly deviates from the rest of the data set
– Ex. Intrusion detection in computer networks
– Issue: Find an appropriate measurement of deviation (a simple z-score sketch is
shown after this list)
• Contextual outlier (or conditional outlier)
– An object is a contextual outlier Oc if it deviates significantly within a selected context
– Ex. 80 °F in Urbana: outlier? (depends on whether it is summer or winter)
– Attributes of data objects should be divided into two groups
  • Contextual attributes: define the context, e.g., time and location
  • Behavioral attributes: characteristics of the object, used in outlier evaluation,
e.g., temperature
– Can be viewed as a generalization of local outliers—objects whose density significantly
deviates from their local area
– Issue: How to define or formulate a meaningful context?
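
As referenced above, one simple statistical measurement of deviation for global outliers is the z-score; a minimal sketch (the threshold of 3 standard deviations is a common but arbitrary choice):

import numpy as np

def zscore_outliers(x, threshold=3.0):
    # flag values lying more than `threshold` standard deviations from the mean
    z = (x - x.mean()) / x.std()
    return np.where(np.abs(z) > threshold)[0]   # indices of global outliers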
Types of Outliers (II)
• Collective outliers
– A subset of data objects that collectively deviate significantly from the whole data
set, even if the individual data objects may not be outliers
– Applications: e.g., intrusion detection—when a number of computers keep sending
denial-of-service packets to each other
◼ Detection of collective outliers
  ◼ Consider not only the behavior of individual objects, but also that of groups of objects
  ◼ Need background knowledge of the relationship among data objects, such as a
distance or similarity measure on objects
◼ A data set may have multiple types of outlier
◼ One object may belong to more than one type of outlier
End of Unit-4
