Cluster-Analysis
Pukar Karki
Assistant Professor
[email protected]
Contents
1. Basics and Algorithms
2. K-means Clustering
3. Hierarchical Clustering
4. DBSCAN Clustering
5. Issues : Evaluation, Scalability, Comparison
What is Cluster Analysis?
Cluster: A collection of data objects
similar (or related) to one another within the same group
dissimilar (or unrelated) to the objects in other groups
Cluster analysis (or clustering, data segmentation, …)
Finding similarities between data according to the characteristics
found in the data and grouping similar data objects into clusters
Unsupervised learning: no predefined classes (i.e., learning by
observations vs. learning by examples: supervised)
Typical applications
As a stand-alone tool to get insight into data distribution
As a preprocessing step for other algorithms
Clustering for Data Understanding and Applications
Biology: taxonomy of living things: kingdom, phylum, class, order, family,
genus and species
Information retrieval: document clustering
Land use: Identification of areas of similar land use in an earth observation
database
Marketing: Help marketers discover distinct groups in their customer bases,
and then use this knowledge to develop targeted marketing programs
City-planning: Identifying groups of houses according to their house type,
value, and geographical location
Earthquake studies: Observed earthquake epicenters should be clustered
along continent faults
Climate: understanding Earth's climate, finding patterns of atmospheric and ocean data
Economic Science: market research
Clustering as a Preprocessing Tool (Utility)
Summarization:
Preprocessing for regression, PCA, classification, and association
analysis
Compression:
Image processing: vector quantization
Finding K-nearest Neighbors
Localizing search to one or a small number of clusters
Outlier detection
Outliers are often viewed as those “far away” from any cluster
Vector Quantization
[Figure: left, the original image; middle, reconstruction using 23.9% of the storage; right, reconstruction using 6.25% of the storage.]
K-means is often called “Lloyd’s algorithm” in computer science and engineering, and is
used in vector quantization for compression
Basic idea: run K-means clustering on 4 × 4 squares of pixels in an image, and keep only
the clusters and labels. Smaller K means more compression
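A minimal sketch of this compression scheme using NumPy and scikit-learn (the block size, K, and the grayscale image array img are illustrative assumptions, not values from the slide):

```python
import numpy as np
from sklearn.cluster import KMeans

def vq_compress(img, block=4, K=32):
    """Vector-quantize a grayscale image: cluster block x block patches with
    K-means and rebuild the image from the K centroid patches."""
    h, w = img.shape
    h, w = h - h % block, w - w % block           # crop so blocks tile exactly
    patches = (img[:h, :w]
               .reshape(h // block, block, w // block, block)
               .swapaxes(1, 2)
               .reshape(-1, block * block))        # one row per block
    km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(patches)
    coded = km.cluster_centers_[km.labels_]        # replace each block by its centroid
    return (coded.reshape(h // block, w // block, block, block)
                 .swapaxes(1, 2)
                 .reshape(h, w))
```

Each block is then stored as a small label (about log2 K bits) plus one shared codebook of K centroid patches, which is where the storage savings come from.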
Quality: What Is Good Clustering?
A good clustering method will produce high quality clusters
high intra-class similarity: cohesive within clusters
low inter-class similarity: distinctive between clusters
The quality of a clustering method depends on
the similarity measure used by the method
its implementation, and
its ability to discover some or all of the hidden patterns
Measure the Quality of Clustering
Dissimilarity/Similarity metric
Similarity is expressed in terms of a distance function, typically metric:
d(i, j)
The definitions of distance functions are usually rather different for
interval-scaled, Boolean, categorical, ordinal, ratio, and vector variables (standard examples are given below)
Weights should be associated with different variables based on
applications and data semantics
Quality of clustering:
There is usually a separate “quality” function that measures the
“goodness” of a cluster.
It is hard to define “similar enough” or “good enough”
The answer is typically highly subjective
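For reference, the most commonly used metric distance functions for numeric (interval-scaled) variables, for objects i = (x_i1, …, x_ip) and j = (x_j1, …, x_jp), are special cases of the Minkowski distance:

\[
d(i, j) = \left( \sum_{f=1}^{p} \lvert x_{if} - x_{jf} \rvert^{h} \right)^{1/h},
\]

with h = 1 giving the Manhattan distance and h = 2 the Euclidean distance.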
Considerations for Cluster Analysis
Partitioning criteria
Single level vs. hierarchical partitioning (often, multi-level hierarchical partitioning is desirable)
Separation of clusters
Exclusive (e.g., one customer belongs to only one region) vs. non-exclusive (e.g., one document may belong to more than one class)
Similarity measure
Distance-based (e.g., Euclidean, road network, vector) vs. connectivity-based (e.g., density or contiguity)
Clustering space
Full space (often when low dimensional) vs. subspaces (often in high-dimensional clustering)
Requirements and Challenges
Scalability
Clustering all the data instead of only on samples
Ability to deal with different types of attributes
Numerical, binary, categorical, ordinal, linked, and mixture of these
Constraint-based clustering
User may give inputs on constraints
Use domain knowledge to determine input parameters
Interpretability and usability
Others
Discovery of clusters with arbitrary shape
Ability to deal with noisy data
Incremental clustering and insensitivity to input order
High dimensionality
Major Clustering Approaches
Partitioning approach:
Construct various partitions and then evaluate them by some criterion, e.g., minimizing
the sum of squared errors
Typical methods: k-means, k-medoids, CLARANS
Hierarchical approach:
Create a hierarchical decomposition of the set of data (or objects) using some criterion
Typical methods: DIANA, AGNES, BIRCH, CHAMELEON
Density-based approach:
Based on connectivity and density functions
Typical methods: DBSCAN, OPTICS, DenClue
Grid-based approach:
based on a multiple-level granularity structure
Typical methods: STING, WaveCluster, CLIQUE
Major Clustering Approaches (II)
Contents
1. Basics and Algorithms
2. K-means Clustering
3. Hierarchical Clustering
4. DBSCAN Clustering
5. Issues : Evaluation, Scalability, Comparison
Partitioning Algorithms: Basic Concept
Partitioning method: partition a database D of n objects into a set of k
clusters such that the sum of squared distances to the cluster representatives is
minimized (where ci is the centroid or medoid of cluster Ci)
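In symbols, the standard form of this objective is

\[
E \;=\; \sum_{i=1}^{k} \; \sum_{p \in C_i} d(p, c_i)^{2},
\]

where k-means uses the cluster mean (centroid) for c_i with Euclidean distance, and k-medoids uses the most centrally located object of C_i.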
An Example of K-Means Clustering
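A minimal NumPy sketch of the k-means (Lloyd) iteration, with random data standing in for the slide's example (the dataset X and k below are illustrative assumptions):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm: alternate assignment and centroid update."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]      # arbitrary initial centers
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: recompute each centroid as the mean of its points.
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):                  # converged
            break
        centroids = new_centroids
    return labels, centroids

# Example: three well-separated 2-D blobs (illustrative data, not the slide's).
rng_data = np.random.default_rng(1)
X = np.vstack([rng_data.normal(loc=c, scale=0.3, size=(50, 2))
               for c in [(0, 0), (3, 3), (0, 3)]])
labels, centroids = kmeans(X, k=3)
```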
Comments on the K-Means Method
Strength: Efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations.
Normally, k, t << n.
Comparing: PAM: O(k(n−k)²), CLARA: O(ks² + k(n−k)), where s is the sample size
Comment: Often terminates at a local optimum.
Weakness
Applicable only to objects in a continuous n-dimensional space
Using the k-modes method for categorical data
In comparison, k-medoids can be applied to a wide range of data
Need to specify k, the number of clusters, in advance (there are ways to
automatically determine the best k; see Hastie et al., 2009)
Sensitive to noisy data and outliers
Not suitable to discover clusters with non-convex shapes
Variations of the K-Means Method
Most variants of k-means differ in
Selection of the initial k means
Dissimilarity calculations
Strategies to calculate cluster means
Handling categorical data: k-modes
Replacing means of clusters with modes
Using new dissimilarity measures to deal with categorical objects
Using a frequency-based method to update modes of clusters
A mixture of categorical and numerical data: k-prototype method
What Is the Problem of the K-Means Method?
The k-means algorithm is sensitive to outliers !
Since an object with an extremely large value may substantially distort the
distribution of the data
K-Medoids: Instead of taking the mean value of the objects in a cluster as a
reference point, a medoid can be used, which is the most centrally located
object in a cluster
PAM: A Typical K-Medoids Algorithm
[Figure: PAM with K = 2 on ten points in a 10 × 10 grid; total cost of the initial clustering = 20, total cost after the candidate swap = 26]
Arbitrarily choose k objects as the initial medoids
Assign each remaining object to the nearest medoid
Do loop, until no change:
  Randomly select a non-medoid object, O_random
  Compute the total cost of swapping a medoid with O_random
  Swap if the quality is improved
The K-Medoid Clustering Method
K-Medoids Clustering: Find representative objects (medoids) in clusters
PAM (Partitioning Around Medoids, Kaufmann & Rousseeuw 1987)
Starts from an initial set of medoids and iteratively replaces one of the medoids
by one of the non-medoids if it improves the total distance of the resulting
clustering
PAM works effectively for small data sets, but does not scale well for large data
sets (due to the computational complexity)
Efficiency improvement on PAM
CLARA (Kaufmann & Rousseeuw, 1990): PAM on samples
CLARANS (Ng & Han, 1994): Randomized re-sampling
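A compact Python sketch of the PAM swap search described above (a plain implementation for illustration; the precomputed dissimilarity matrix D and k are assumed inputs, and none of the CLARA/CLARANS sampling tricks are used):

```python
import numpy as np

def pam(D, k, max_iter=100, seed=0):
    """PAM-style k-medoids on a precomputed n x n dissimilarity matrix D."""
    n = len(D)
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(n, size=k, replace=False))     # initial medoids

    def total_cost(meds):
        # Each object contributes its distance to the closest medoid.
        return D[:, meds].min(axis=1).sum()

    cost = total_cost(medoids)
    for _ in range(max_iter):
        best = (cost, None)
        # Try every (medoid, non-medoid) swap and keep the best improvement.
        for i in range(k):
            for o in range(n):
                if o in medoids:
                    continue
                candidate = medoids[:i] + [o] + medoids[i + 1:]
                c = total_cost(candidate)
                if c < best[0]:
                    best = (c, candidate)
        if best[1] is None:            # no improving swap: local optimum reached
            break
        cost, medoids = best
    labels = D[:, medoids].argmin(axis=1)   # cluster index of each object
    return medoids, labels
```

Because the only input is a dissimilarity matrix, the same sketch works for any data type for which pairwise dissimilarities can be defined, which is why k-medoids applies more widely than k-means.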
Contents
1. Basics and Algorithms
2. K-means Clustering
3. Hierarchical Clustering
4. DBSCAN Clustering
5. Issues : Evaluation, Scalability, Comparison
Hierarchical Clustering
There are two basic approaches for generating a hierarchical
clustering:
Agglomerative: Start with the points as individual clusters and, at each
step, merge the closest pair of clusters. This requires defining a notion
of cluster proximity.
Divisive: Start with one, all-inclusive cluster and, at each step, split a
cluster until only singleton clusters of individual points remain. In this
case, we need to decide which cluster to split at each step and how to
do the splitting.
Hierarchical Clustering
Uses the distance matrix as the clustering criterion.
This method does not require the number of clusters k as an input, but
needs a termination condition
[Figure: AGNES (agglomerative) builds the hierarchy bottom-up over steps 0–4, merging a+b → ab, d+e → de, c+de → cde, and ab+cde → abcde; DIANA (divisive) produces the same hierarchy top-down, splitting in the reverse order (steps 4 down to 0).]
Basic Agglomerative Hierarchical Clustering Algorithm
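In outline, the basic agglomerative algorithm is: compute the proximity matrix; then repeat — merge the two closest clusters and update the proximity matrix to reflect the new cluster — until only one cluster remains. A minimal SciPy sketch (the random data X and the choice of 'single' linkage are illustrative assumptions):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

rng = np.random.default_rng(0)
X = rng.random((12, 2))                 # twelve 2-D points, like p1 ... p12 in the figures below

# linkage() runs the agglomerative algorithm; 'method' picks the inter-cluster
# proximity: 'single' (MIN), 'complete' (MAX), 'average', or 'ward'.
Z = linkage(X, method='single', metric='euclidean')

labels = fcluster(Z, t=3, criterion='maxclust')   # cut the dendrogram into 3 clusters
# dendrogram(Z) draws the merge tree when matplotlib is available.
```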
Steps 1 and 2
[Figure: points p1 ... p12 each start as their own cluster, and the initial proximity matrix is computed.]
Intermediate Situation
✔ After some merging steps, we have clusters C1, C2, C3, C4, C5 and their proximity matrix.
Step 4
✔ We want to merge the two closest clusters (C2 and C5) and update the proximity matrix.
Step 5
✔ The question is "How do we update the proximity matrix?" The row and column for the merged cluster C2 ∪ C5 (its proximities to C1, C3, and C4) are the entries marked "?".
How to Define Inter-Cluster Similarity
[Figure: proximity matrix over points p1 ... p5]
✔ MIN (single link)
✔ MAX (complete link)
✔ Group Average
✔ Distance Between Centroids
✔ Other methods driven by an objective function
– Ward’s Method uses squared error
Distance between Clusters
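For reference (standard definitions, stated here rather than transcribed from the slide), the usual forms of these inter-cluster distances between clusters C_i and C_j, with sizes n_i, n_j and centroids m_i, m_j, are:

\[
\begin{aligned}
d_{\min}(C_i, C_j) &= \min_{p \in C_i,\; q \in C_j} d(p, q) && \text{(MIN / single link)}\\
d_{\max}(C_i, C_j) &= \max_{p \in C_i,\; q \in C_j} d(p, q) && \text{(MAX / complete link)}\\
d_{\mathrm{avg}}(C_i, C_j) &= \frac{1}{n_i n_j} \sum_{p \in C_i} \sum_{q \in C_j} d(p, q) && \text{(group average)}\\
d_{\mathrm{cen}}(C_i, C_j) &= d(m_i, m_j) && \text{(distance between centroids)}
\end{aligned}
\]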
Example: Single Link or MIN
For the single link or MIN version of hierarchical clustering, the proximity of
two clusters is defined as the minimum of the distance between any two points
in the two different clusters.
[Figure: six points (1–6) clustered with single link, shown as the original points, a two-cluster solution, and a three-cluster solution, together with the dendrogram (merge heights roughly 0.05–0.2, leaves ordered 3, 6, 2, 5, 4, 1).]
• Sensitive to noise
Example: Complete Link or MAX
For the complete link or MAX version of hierarchical clustering,
the proximity of two clusters is defined as the maximum of the
distance (minimum of the similarity) between any two points in
the two different clusters.
Complete link is less susceptible to noise and outliers, but it can
break large clusters and it favors globular shapes.
Hierarchical Clustering: MAX
[Figure: the same six points clustered with complete link (MAX): nested clusters including {3, 6, 4} and {2, 5}, and the corresponding dendrograms (leaves ordered 3, 6, 4, 1, 2, 5).]
Because dist({3, 6, 4}, {2, 5}) is smaller than dist({3, 6, 4}, {1}) and dist({2, 5}, {1}),
clusters {3, 6, 4} and {2, 5} are merged at the fourth stage.
Hierarchical Clustering: Group Average
✔ Strengths
– Less susceptible to noise
✔ Limitations
– Biased towards globular clusters
Cluster Similarity: Ward’s Method
✔ Similarity of two clusters is based on the increase in squared
error when two clusters are merged
– Similar to group average if distance between points is distance
squared
[Figure: comparison of the clusterings produced by MIN, MAX, Group Average, and Ward’s Method on the same six points.]
Hierarchical Clustering: Time and Space requirements
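✔ O(m²) space to store the proximity matrix for m points
✔ O(m³) time for a straightforward implementation (m − 1 merge steps, each scanning and updating an m × m matrix); roughly O(m² log m) is achievable by keeping each cluster’s proximities in a sorted list or heap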
DIANA (Divisive Analysis)
Introduced in Kaufmann and Rousseeuw (1990)
Implemented in statistical analysis packages, e.g., Splus
Inverse order of AGNES
Eventually each node forms a cluster on its own
Contents
1. Basics and Algorithms
2. K-means Clustering
3. Hierarchical Clustering
4. DBSCAN Clustering
5. Issues : Evaluation, Scalability, Comparison
Density-Based Clustering Methods
Partitioning and hierarchical methods are designed to find spherical-shaped
clusters.
They have difficulty finding clusters of arbitrary shape, such as the "S"-shaped
and oval clusters in the figure.
Density-Based Clustering Methods
To find clusters of arbitrary shape, alternatively, we can model clusters as
dense regions in the data space, separated by sparse regions.
This is the main strategy behind density-based clustering methods, which can
discover clusters of nonspherical shape.
Density-Based Clustering Methods
Clustering based on density (local cluster criterion), such as density-connected
points
Major features:
Discover clusters of arbitrary shape
Handle noise
One scan
Need density parameters as termination condition
Several interesting studies:
DBSCAN: Ester, et al. (KDD’96)
OPTICS: Ankerst, et al (SIGMOD’99).
DENCLUE: Hinneburg & D. Keim (KDD’98)
CLIQUE: Agrawal, et al. (SIGMOD’98) (more grid-based)
DBSCAN
The density of an object o can be measured by the number of
objects close to o.
DBSCAN (Density-Based Spatial Clustering of Applications
with Noise) finds core objects, that is, objects that have dense
neighborhoods.
It connects core objects and their neighborhoods to form dense
regions as clusters.
DBSCAN
“How does DBSCAN quantify the neighborhood of an object?”
→ A user-specified parameter ε > 0 specifies the radius of the neighborhood
considered for every object: the ε-neighborhood of an object o is the space
within radius ε centered at o.
DBSCAN
→ To determine whether a neighborhood is dense or not, DBSCAN uses
another user-specified parameter, MinPts, which specifies the density
threshold of dense regions.
DBSCAN
Given a set, D, of objects, we can identify all core objects with respect
to the given parameters, ε and MinPts.
The clustering task is therein reduced to using core objects and their
neighborhoods to form dense regions, where the dense regions are
clusters.
Directly Density-Reachable
For a core object q and an object p, we say that p is directly
density-reachable from q (with respect to ε and MinPts) if p is within
the ε-neighborhood of q.
Density-Reachable and Density-Connected
Density-reachable:
A point p is density-reachable from a point q w.r.t. Eps, MinPts if
there is a chain of points p1, …, pn with p1 = q and pn = p such that pi+1 is
directly density-reachable from pi
Density-connected:
A point p is density-connected to a point q w.r.t. Eps, MinPts if there
is a point o such that both p and q are density-reachable from o
w.r.t. Eps and MinPts
Consider the following figure for a given ε represented by the radius of
the circles, and, say, let MinPts = 3.
DBSCAN: Determining EPS and MinPts
✔ Idea is that for points in a cluster, their kth nearest neighbors are
at close distance
✔ Noise points have the kth nearest neighbor at farther distance
✔ So, plot sorted distance of every point to its kth nearest neighbor
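A short scikit-learn/matplotlib sketch of this k-distance heuristic (k would typically be set to MinPts; the data array X is an assumption — the "knee" of the resulting curve is a reasonable choice of Eps):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

def k_distance_plot(X, k):
    """Plot every point's distance to its k-th nearest neighbor, sorted."""
    # k + 1 neighbors because each point is its own nearest neighbor (distance 0).
    dists, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    kth = np.sort(dists[:, -1])
    plt.plot(kth)
    plt.xlabel("points sorted by k-th NN distance")
    plt.ylabel(f"distance to {k}-th nearest neighbor")
    plt.show()
```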
DBSCAN: Core, Border and Noise Points
Core points: These points are in the interior of a density-based cluster. A point
is a core point if there are at least MinPts points within a distance of Eps of it,
where MinPts and Eps are user-specified parameters. In the figure, point A is a
core point for the indicated radius (Eps) if MinPts ≤ 7.
DBSCAN: Core, Border and Noise Points
Border points: A border point is not a core point, but falls within the
neighborhood of a core point. In the figure, point B is a border point. A border
point can fall within the neighborhoods of several core points.
DBSCAN: Core, Border and Noise Points
Noise points: A noise point is any point that is neither a core point nor a
border point. In the figure, point C is a noise point.
DBSCAN Algorithm
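A compact Python sketch of the standard DBSCAN procedure (brute-force neighborhood search, illustrative parameter names; it is not transcribed from the slide's own pseudocode):

```python
import numpy as np

def region_query(X, i, eps):
    """Indices of all points within distance eps of point i (its eps-neighborhood)."""
    return np.where(np.linalg.norm(X - X[i], axis=1) <= eps)[0]

def dbscan(X, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    n = len(X)
    labels = np.full(n, -1)            # -1 = noise / unassigned
    visited = np.zeros(n, dtype=bool)
    cluster_id = 0
    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        neighbors = region_query(X, i, eps)
        if len(neighbors) < min_pts:
            continue                    # not a core point; stays noise unless claimed as a border point later
        # Start a new cluster from core point i and expand it.
        labels[i] = cluster_id
        seeds = list(neighbors)
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:         # unassigned (or noise) -> joins this cluster as core or border
                labels[j] = cluster_id
            if not visited[j]:
                visited[j] = True
                j_neighbors = region_query(X, j, eps)
                if len(j_neighbors) >= min_pts:   # j is also a core point: expand through it
                    seeds.extend(j_neighbors)
        cluster_id += 1
    return labels
```

With scikit-learn, the equivalent call is DBSCAN(eps=eps, min_samples=min_pts).fit_predict(X), where the label −1 marks noise points.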
When DBSCAN Works Well
Considerations for Cluster Analysis
Partitioning criteria
Single level vs. hierarchical partitioning (often, multi-level hierarchical partitioning is desirable)
Separation of clusters
Exclusive (e.g., one customer belongs to only one region) vs. non-exclusive (e.g., one document may belong to more than one class)
Similarity measure
Distance-based (e.g., Euclidean, road network, vector) vs. connectivity-based (e.g., density or contiguity)
Clustering space
Full space (often when low dimensional) vs. subspaces (often in high-dimensional clustering)
Requirements and Challenges
Scalability
Clustering all the data instead of only on samples
Ability to deal with different types of attributes
Numerical, binary, categorical, ordinal, linked, and mixture of these
Constraint-based clustering
User may give inputs on constraints
Use domain knowledge to determine input parameters
Interpretability and usability
Others
Discovery of clusters with arbitrary shape
Ability to deal with noisy data
Incremental clustering and insensitivity to input order
High dimensionality
Cluster Validity
✔ For supervised classification we have a variety of measures to evaluate
how good our model is
– Accuracy, precision, recall
[Figure: scatter plots (x and y in [0, 1]) of the same data clustered by K-means and by complete link.]
Measures of Cluster Validity
✔ Numerical measures that are applied to judge various aspects of
cluster validity are classified into the following two types.
– Supervised: Used to measure the extent to which cluster labels match
externally supplied class labels.
Entropy
Often called external indices because they use information external to the data
– Unsupervised: Used to measure the goodness of a clustering structure
without respect to external information.
Sum of Squared Error (SSE)
Often called internal indices because they only use information in the data
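As a concrete example of an internal (unsupervised) index, SSE can be computed directly from a clustering; X, labels, and centroids are assumed to come from something like the earlier k-means sketch (scikit-learn's KMeans exposes the same quantity as inertia_):

```python
import numpy as np

def sse(X, labels, centroids):
    """Sum of squared errors: squared distance of every point to its cluster centroid."""
    return sum(np.sum((X[labels == j] - c) ** 2) for j, c in enumerate(centroids))
```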