
Introduction to Cluster Analysis
Outline
● Background
○ Intro
○ Workflow
○ Similarity metrics
● Clustering algorithms
○ Hierarchical
○ K-means
○ Density-based
● Cluster evaluation
○ External
○ Internal
Cluster Analysis
∙ Data mining tool(s) for dividing a multivariate dataset
into (meaningful, useful) groups
∙ Good clustering:
∙ Data points in one cluster are highly similar

∙ Data points in different clusters are dissimilar


[Figure: intra-cluster distances are minimized; inter-cluster distances are maximized]

Tan, Steinbach, Karpatne, Kumar. Introduction to Data Mining, 2nd Edition.


Applications

● Gain understanding
− Groups of genes/proteins with
similar function (from
nucleotide or amino acid
sequence data)
− Groups of cells with similar
expression patterns (from
RNAseq data)
● Summarize
− Reduce the size of a large
dataset
Clustering precipitation
in Australia
Tan, Steinbach, Karpatne, Kumar. Introduction to Data Mining, 2nd Edition.
Eisen, Brown, Botstein (1998) PNAS.
Cluster analysis is not...
Simple segmentation
e.g., dividing students into different registration groups alphabetically, by last
name
Although, some work in graph partitioning and more complex segmentation is
related to clustering

The results of a query


Groupings are a result of an external specification

Supervised classification
Supervised classification has class label information
Clustering can be called unsupervised classification: labels derived from data

Association Analysis
Finding connections between items in datasets
Tan, Steinbach, Karpatne, Kumar. Introduction to Data Mining, 2nd Edition.
Cluster evaluation has an element of
subjectivity

Tan, Steinbach, Karpatne, Kumar. Introduction to Data Mining, 2nd Edition.


Traditional types of clusterings
● A clustering is a set of clusters
● Clusters can be:
− Hierarchical: data are in nested clusters, organized in a
hierarchical tree
− Partitional: data divided into non-overlapping subsets; each
data object belongs to exactly one subset.

[Figure: hierarchical vs. partitional clustering]

Tan, Steinbach, Karpatne, Kumar. Introduction to Data Mining, 2nd Edition.


D’haeseleer (2005) Nature Biotech.
Other distinctions between clusters
● Exclusive vs non-exclusive
− Exclusive: points belong to one cluster
− Non-exclusive: points can belong to multiple
● Fuzzy vs non-fuzzy
− In fuzzy clustering, a point belongs to every cluster with
some weight (0 to 1)
− Weights must sum to 1
− Similar to probabilistic clustering
● Partial vs complete
− Partial: only some of the data is clustered (can exclude
outliers)
● Heterogeneous vs homogeneous
− Degree to which cluster size, shape, and density can vary

Tan, Steinbach, Karpatne, Kumar. Introduction to Data Mining, 2nd Edition.


Why is cluster analysis hard?
● Clustering in two dimensions looks easy!
● Clustering small amounts of data looks
easy
● In most cases, looks are not deceiving
● However, many applications involve
more than 2 dimensions (e.g., a human
gene expression dataset has >10,000
dimensions)
● High dimensional spaces look
different: Almost all pairs of points are at
about the same distance

Leskovec, Rajaraman, Ullman: Mining of Massive Datasets, http://www.mmds.org


Typical workflow for cluster analysis

Handl, Knowles, Kell (2005) Bioinformatics.


Similarity (aka distance) metrics

D’haeseleer (2005) Nature Biotech.
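As a concrete illustration of the metrics surveyed by D’haeseleer (2005), the sketch below computes Euclidean distance and correlation distance (1 − Pearson r) between a few made-up expression profiles; the values and the SciPy usage are illustrative additions, not part of the original figure.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical expression profiles: rows are genes, columns are conditions
profiles = np.array([
    [2.1, 3.5, 1.0, 0.2],
    [2.0, 3.6, 1.1, 0.3],   # similar to gene 1 in both shape and magnitude
    [0.1, 0.2, 2.9, 3.1],   # very different profile
])

print(squareform(pdist(profiles, metric="euclidean")).round(2))
print(squareform(pdist(profiles, metric="correlation")).round(2))  # 1 - Pearson r
```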


Outline
● Background
○ Intro
○ Workflow
○ Similarity metrics
● Clustering algorithms
○ Hierarchical
○ K-means
○ Density-based
● Cluster evaluation
○ External
○ Internal
Hierarchical clustering
Produces nested clusters
Can be visualized as a
dendrogram
Can be either:
- Agglomerative (bottom up):
Initially, each point is a cluster
Repeatedly combine the two
“nearest” clusters into one
- Divisive (top down):
Start with one cluster and recursively
split

Tan, Steinbach, Karpatne, Kumar. Introduction to Data Mining, 2nd Edition.


Leskovec, Rajaraman, Ullman: Mining of Massive Datasets, http://www.mmds.org
Advantages of Hierarchical
Clustering
● Do not have to assume any
particular number of clusters
− Any desired number of clusters
can be obtained by cutting the
dendrogram at the proper level
● No random component (clusters
will be the same from run to run)
● Clusters may correspond to
meaningful taxonomies
− Especially in biological sciences
(e.g., phylogeny reconstruction)

Tan, Steinbach, Karpatne, Kumar. Introduction to Data Mining, 2nd Edition.


Image from Encyclopedia Britannica Online. Phylogeny entry. Web. 05 Jun 2018.
Agglomerative Clustering Algorithm
● Most popular hierarchical clustering technique
● Basic algorithm:
1) Compute the proximity matrix
2) Let each data point be a cluster
3) Repeat
4) Merge the two closest clusters
5) Update the proximity matrix
6) Until only a single cluster remains
● Key operation is the computation of the
proximity between two clusters
− Different approaches to defining this distance
distinguish the different algorithms
Tan, Steinbach, Karpatne, Kumar. Introduction to Data Mining, 2nd Edition.
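A minimal sketch of the algorithm above using SciPy (an addition to the slides, with illustrative data): `linkage` computes the proximity matrix and repeatedly merges the two closest clusters until one remains; `fcluster` then cuts the resulting dendrogram into a chosen number of clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Illustrative 2-D points: two tight groups and one outlier
points = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.8], [9.0, 0.5]])

Z = linkage(points, method="average")             # full merge history (the dendrogram)
labels = fcluster(Z, t=3, criterion="maxclust")   # cut the dendrogram into 3 clusters
print(labels)
```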
Divisive Clustering Algorithm
● Minimum spanning tree (MST)
− Start with one point
− In successive steps, look for the closest pair of points
(p,q) such that p is in the tree but q is not.
− Add q to the tree (add edge between p and q)
− Once the full MST is built, the divisive hierarchy is obtained by
repeatedly breaking the longest remaining edge

Tan, Steinbach, Karpatne, Kumar. Introduction to Data Mining, 2nd Edition.


Linkages
● Linkage: measure of dissimilarity between
clusters
● Many methods:
− Single linkage
− Complete linkage
− Average linkage
− Centroids
− Ward’s method
Single linkage
(aka nearest neighbor)
● Proximity of two clusters is based on the two closest points
in the two different clusters
● Proximity is determined by one pair of points (i.e., one link)
● Can handle non-elliptical shapes
● Sensitive to noise and outliers

Tan, Steinbach, Karpatne, Kumar. Introduction to Data Mining, 2nd Edition.


Complete linkage
● Proximity of two clusters is based on the two most distant
points in the different clusters
● Less susceptible to noise and outliers
● May break large clusters
● Biased toward globular clusters

Tan, Steinbach, Karpatne, Kumar. Introduction to Data Mining, 2nd Edition.


Average linkage
● Proximity of two clusters is the average of
pairwise proximity between points in the
clusters
● Less susceptible to noise and outliers
● Biased towards globular clusters

Tan, Steinbach, Karpatne, Kumar. Introduction to Data Mining, 2nd Edition.


Ward’s method

● Similarity of two clusters is
based on the increase in
squared error when two
clusters are merged
● Similar to group average if
distance between points is
distance squared
● Less susceptible to noise
and outliers
● Biased towards globular
clusters

Tan, Steinbach, Karpatne, Kumar. Introduction to Data Mining, 2nd Edition.


Lecture notes from C Shalizi, 36-350 Data Mining, Carnegie Mellon University.
Agglomerative clustering exercise
● How do clusters change with different linkage
methods?
∙ Single
[Figure: single-linkage dendrogram and nested clusters for six example points]

Tan, Steinbach, Karpatne, Kumar. Introduction to Data Mining, 2nd Edition.


Agglomerative clustering exercise
● How do clusters change with different linkage
methods?
∙ Complete
[Figure: complete-linkage dendrogram and nested clusters for the same six points]

Tan, Steinbach, Karpatne, Kumar. Introduction to Data Mining, 2nd Edition.


Agglomerative clustering exercise
● How do clusters change with different linkage
methods?
∙ Average
[Figure: average-linkage dendrogram and nested clusters for the same six points]

Tan, Steinbach, Karpatne, Kumar. Introduction to Data Mining, 2nd Edition.


Linkage Comparison
[Figure: nested clusters and dendrograms produced by single linkage, complete linkage, average linkage, and Ward’s method on the same six points]

Tan, Steinbach, Karpatne, Kumar. Introduction to Data Mining, 2nd Edition.
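The comparison above can be reproduced in code by changing only the linkage method; the sketch below uses made-up points (not the six from the figure) and runs single, complete, average, and Ward linkage on the same data.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Two illustrative groups of points
points = np.vstack([rng.normal((0, 0), 0.3, (10, 2)),
                    rng.normal((3, 0), 0.3, (10, 2))])

for method in ["single", "complete", "average", "ward"]:
    Z = linkage(points, method=method)            # only the linkage changes
    print(method, fcluster(Z, t=2, criterion="maxclust"))
```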


K-means clustering
● Partition clustering approach
● Number of clusters (K) must be specified
● Each cluster is associated with a centroid
● Each datapoint is assigned to the cluster with
the closest centroid

Tan, Steinbach, Karpatne, Kumar. Introduction to Data Mining, 2nd Edition.


Example of K-means clustering

Tan, Steinbach, Karpatne, Kumar. Introduction to Data Mining, 2nd Edition.


More on K-means clustering
● Initial centroids often chosen randomly
− Clusters will vary from one run to the next
● Centroid is typically the mean of the points in
the cluster
● ‘Closeness’ is measured by similarity metric
(e.g., Euclidean distance)
● Convergence usually happens within first few
iterations

Tan, Steinbach, Karpatne, Kumar. Introduction to Data Mining, 2nd Edition.
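A minimal scikit-learn sketch of these points (data and parameter values are illustrative): K must be given up front, centroids are initialized randomly, and each point is assigned to the nearest centroid.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three illustrative blobs
X = np.vstack([rng.normal((0, 0), 0.5, (50, 2)),
               rng.normal((5, 0), 0.5, (50, 2)),
               rng.normal((0, 5), 0.5, (50, 2))])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)   # one centroid per cluster
print(km.labels_[:10])       # cluster assignment of the first 10 points
```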


Evaluating K-means clusters
Most common measure is Sum of Squared Error
(SSE)
● SSE is the sum of the squared distance between each
member of the cluster and the cluster’s centroid:

$$SSE = \sum_{i=1}^{K} \sum_{x \in C_i} \operatorname{dist}(m_i, x)^2$$

where m_i = centroid of cluster C_i and x = a data point in cluster C_i

● Given two sets of clusters, we prefer the one with the
smallest error
● One way to reduce SSE is to increase K
Although, a good clustering with small K can have a
lower SSE than a poor clustering with high K

Tan, Steinbach, Karpatne, Kumar. Introduction to Data Mining, 2nd Edition.
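The SSE defined above is what scikit-learn stores as `KMeans.inertia_`; the sketch below (illustrative data) computes it directly from the definition as a check.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(100, 2))
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Sum over clusters of squared distances from each point to its centroid
sse = sum(((X[km.labels_ == i] - c) ** 2).sum()
          for i, c in enumerate(km.cluster_centers_))
print(sse, km.inertia_)   # the two values agree up to floating-point error
```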


Choosing K
● Visual inspection
● “Elbow method”

Choose K at the point where the
SSE curve stops dropping
abruptly (the “elbow”)

Leskovec, Rajaraman, Ullman: Mining of Massive Datasets, http://www.mmds.org
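A hedged sketch of the elbow method on illustrative data: fit k-means for a range of K and look for the value after which SSE stops dropping sharply.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.4, (60, 2)) for c in ((0, 0), (4, 0), (2, 4))])

for k in range(1, 8):
    sse = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(k, round(sse, 1))   # SSE drops sharply up to the "true" K (here 3), then levels off
```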


Limitations of K-means:
Different sizes

[Figure: original points vs. K-means result (3 clusters)]

Tan, Steinbach, Karpatne, Kumar. Introduction to Data Mining, 2nd Edition.


Limitations of K-means:
Differing density

[Figure: original points vs. K-means result (3 clusters)]

Tan, Steinbach, Karpatne, Kumar. Introduction to Data Mining, 2nd Edition.


Limitations of K-means:
Non-globular shapes

[Figure: original points vs. K-means result (2 clusters)]

Tan, Steinbach, Karpatne, Kumar. Introduction to Data Mining, 2nd Edition.


Overcoming K-means Limitations

One solution is to use many clusters:
this finds parts of the desired clusters, which then need to be put together.

Tan, Steinbach, Karpatne, Kumar. Introduction to Data Mining, 2nd Edition.


Concerns with selecting initial
centroids
● If there are K “real” clusters, then the chance of
initially selecting one centroid from each cluster
is small

$$P = \frac{K!\, n^K}{(Kn)^K} = \frac{K!}{K^K}$$
where n = size of each cluster (assuming clusters are of relatively similar size)

● If K=10, then P = 10!/10^10 ≈ 0.00036


● Consider an example of ten clusters….

Tan, Steinbach, Karpatne, Kumar. Introduction to Data Mining, 2nd Edition.


“Real” clusters:

Clusters obtained with K=10 when some “real” clusters received no initial centroid:

Tan, Steinbach, Karpatne, Kumar. Introduction to Data Mining, 2nd Edition.


Solving initial centroids issues
● Multiple runs
● Use hierarchical clustering to determine initial
centroids
● Select more than K initial centroids, then
subselect among these (select most widely
separated)
● Post-processing
● Generate a larger number of clusters, then
perform hierarchical clustering
● Use Bisecting K-means

Tan, Steinbach, Karpatne, Kumar. Introduction to Data Mining, 2nd Edition.
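A sketch of the “multiple runs” strategy above using scikit-learn (an addition to the slides, with illustrative data): `n_init` reruns k-means from different initializations and keeps the lowest-SSE solution, and `init="k-means++"` spreads the initial centroids apart, related to the “most widely separated” idea.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(3).normal(size=(300, 2))   # illustrative data

km = KMeans(n_clusters=10, init="k-means++", n_init=25, random_state=0).fit(X)
print(km.inertia_)   # SSE of the best of the 25 runs
```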


Pre- and Post-processing
● Pre-processing
− Normalize the data
− Eliminate outliers
● Post-processing
− Eliminate small clusters (may represent outliers)
− Split ‘loose’ clusters (i.e., clusters w/ high SSE)
− Merge clusters that are close (w/ low SSE)

Tan, Steinbach, Karpatne, Kumar. Introduction to Data Mining, 2nd Edition.


Bisecting K-means
● Combines K-means
and hierarchical
clustering
● Clusters are iteratively
split via regular
K-means with K=2
● Stops when desired #
of clusters is reached

Tan, Steinbach, Karpatne, Kumar. Introduction to Data Mining, 2nd Edition.


Figure from Mo Velayati, https://mvelayati.com
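A minimal sketch using BisectingKMeans, available in scikit-learn 1.1 and later (an assumption about the environment): clusters are split recursively with K=2 k-means until the requested number is reached.

```python
import numpy as np
from sklearn.cluster import BisectingKMeans

X = np.random.default_rng(1).normal(size=(200, 2))   # illustrative data
labels = BisectingKMeans(n_clusters=4, random_state=0).fit_predict(X)
print(sorted(set(labels)))   # four clusters obtained by repeated bisection
```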
Density-based clustering
● Assumes clusters are areas of high density separated by areas
of low density
● Core points are in areas of a certain density (at least n points
in radius r from the core point)
● Border points aren’t core points, but are within r of a core point
● Noise points are all other points

[Figure: core, border, and noise points for n=7 and radius r]
Tan, Steinbach, Karpatne, Kumar. Introduction to Data Mining, 2nd Edition.


DBSCAN Algorithm
● Eliminate noise points
● Perform clustering on remaining points

Tan, Steinbach, Karpatne, Kumar. Introduction to Data Mining, 2nd Edition.
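A minimal DBSCAN sketch with scikit-learn (the two-moons data and parameter values are illustrative): `eps` plays the role of the radius r and `min_samples` the role of n from the previous slide.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
labels = DBSCAN(eps=0.2, min_samples=7).fit_predict(X)

# Noise points are labeled -1; the number of clusters is not specified up front
print(sorted(set(labels)))
```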


DBSCAN Advantages & Limitations
● Advantages:
● Resistant to noise
● Can handle clusters of different shapes and sizes
● Number of clusters is determined by the algorithm

[Figure: original data vs. DBSCAN clusters]

Limitations:
● Struggles to identify clusters with varying densities – clustering is often
incomplete, as points in low-density regions are treated as noise and ignored


● Density can be difficult/expensive to compute in high-dimensional datasets

Tan, Steinbach, Karpatne, Kumar. Introduction to Data Mining, 2nd Edition.


“Clusters are in the eye of the beholder”

But we might want to evaluate them anyway


Outline
● Background
○ Intro
○ Workflow
○ Similarity metrics
● Clustering algorithms
○ Hierarchical
○ K-means
○ Density-based
● Cluster evaluation
○ External
○ Internal
Cluster validation
1) Determining the clustering tendency of a set of data, i.e., distinguishing
whether non-random structure actually exists in the data.
2) Comparing the results of a cluster analysis to externally known results, e.g.,
to externally given class labels.
3) Comparing the results of two different sets of cluster analyses to determine
which is better.
4) Determining the ‘correct’ number of clusters.

For 2 and 3, we can further distinguish whether we want to evaluate the entire
clustering or just individual clusters.

Tan, Steinbach, Karpatne, Kumar. Introduction to Data Mining, 2nd Edition.


External measures of cluster validity
External Index: Extent to which cluster labels match
externally supplied class labels
− e.g., gene functional groups, tissue of origin
− F-measure provides assessment of cluster
purity and completeness
● Purity: fraction of a cluster taken up by
predominant class label
● Completeness: fraction of items in the class
grouped in the cluster at hand
− Rand index compares similarity between two
clusterings, or known vs predicted labels
Tan, Steinbach, Karpatne, Kumar. Introduction to Data Mining, 2nd Edition.
Handl, Knowles, Kell (2005) Bioinformatics.
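As a hedged illustration of an external index, the sketch below compares made-up cluster labels against known class labels using the adjusted Rand index from scikit-learn.

```python
from sklearn.metrics import adjusted_rand_score

class_labels   = [0, 0, 0, 1, 1, 1, 2, 2, 2]   # e.g., known functional groups
cluster_labels = [1, 1, 1, 0, 0, 0, 2, 2, 2]   # output of some clustering run

# 1.0 means the two partitions agree perfectly, even though the label IDs differ
print(adjusted_rand_score(class_labels, cluster_labels))
```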
Internal measures of cluster validity
Internal Index: Measures goodness of clustering
without respect to external info
− How compact are the clusters?
● SSE
● Average/maximum pairwise intra-cluster
distances
− How well separated are the clusters?
● Average inter-cluster distance
● Minimum separation between individual
clusters

Tan, Steinbach, Karpatne, Kumar. Introduction to Data Mining, 2nd Edition.


Handl, Knowles, Kell (2005) Bioinformatics.
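A minimal sketch of internal validation on illustrative data: SSE for compactness, minimum centroid separation for separation, and the silhouette coefficient (documented on the scikit-learn page cited at the end) as a combined measure.

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(c, 0.4, (50, 2)) for c in ((0, 0), (4, 0), (2, 4))])
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(km.inertia_)                       # compactness: within-cluster SSE
print(pdist(km.cluster_centers_).min())  # separation: minimum centroid distance
print(silhouette_score(X, km.labels_))   # combined compactness/separation in [-1, 1]
```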
“The validation of clustering structures is the most
difficult and frustrating part of cluster analysis.
Without a strong effort in this direction, cluster
analysis will remain a black art accessible only to
those true believers who have experience and
great courage.”

Algorithms for Clustering Data, Jain and Dubes


http://scikit-learn.org/stable/modules/clustering.html
