III-clustering
Clustering
Genoveva Vargas-Solar
http://www.vargas-solar.com/big-data-analytics
French Council of Scientific Research, LIG & LAFMIA Labs
[Course map: dimensionality reduction, duplicate document detection, spam detection, queries on streams, perceptron, kNN]
+ Approaches 5
n Assume that all data points are given in advance and can be scanned frequently
n In machine learning: clustering analysis
+ Approaches 7
n As a branch of statistics
n Clustering analysis
n Probability analysis
n Assumes that probability distributions on separate attributes are statistically independent of one another (not always true)
n The probability-distribution representation of clusters → expensive cluster updates and storage
n The probability-based tree built to identify clusters is not height-balanced
n Increase in time and space complexity
D. Fisher, Improving inference through conceptual clustering, In Proceedings of the AAAI Conference, July 1987
D. Fisher, Optimization and simplification of hierarchical clusterings, In Proceedings of the 1st Conference on Knowledge Discovery and Data Mining, August 1995
8
Clustering analysis
+ The Problem of Clustering 9
n Usually:
n Points are in a high-dimensional space
n Similarity is defined using a distance measure
n Euclidean, Cosine, Jaccard, edit distance, …
[Figure: scatter of dogs by weight; one cluster labeled Dachshunds, one point marked as an outlier]
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
+ Points and spaces 11
n In its most general sense, a space is just a universal set of points, from which the
points in the dataset are drawn.
n All spaces for which we can perform a clustering have a distance measure,
giving a distance between any two points in the space
12
Measuring distance
+ Distance measure 13
Axiom 4 (triangle inequality): intuitively, to travel from x to y, we cannot obtain any benefit by being forced to travel via some particular third point z. It makes all distance measures behave as if distance describes the length of the shortest path from one point to another.
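For reference (not spelled out on the slide, but standard), the four axioms a distance measure d must satisfy, of which the one above is the fourth:
d(x, y) ≥ 0 (no negative distances)
d(x, y) = 0 if and only if x = y (only a point is at distance 0 from itself)
d(x, y) = d(y, x) (symmetry)
d(x, y) ≤ d(x, z) + d(z, y) (triangle inequality)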
+ Euclidean distance 14
n Manhattan distance
n The distance one would have to travel between points if one were
constrained to travel along grid lines
n As on the streets of a city such as Manhattan
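Written out for points x = (x1, …, xn) and y = (y1, …, yn), the two distances compared here are:
Euclidean (L2): d(x, y) = sqrt( Σi (xi − yi)² )
Manhattan (L1): d(x, y) = Σi |xi − yi|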
+ L∞-norm distance 18
n L2-norm
n L1-norm
n L∞-norm
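A minimal sketch of these three Lr-norm distances in Python (the function name and example points are mine, for illustration only):

import math

def lr_distance(x, y, r):
    """Lr-norm distance between two points given as equal-length sequences."""
    if r == math.inf:
        # L-infinity norm: the maximum difference over any single dimension
        return max(abs(xi - yi) for xi, yi in zip(x, y))
    return sum(abs(xi - yi) ** r for xi, yi in zip(x, y)) ** (1 / r)

p, q = (1, 2), (4, 6)
print(lr_distance(p, q, 2))         # Euclidean (L2): 5.0
print(lr_distance(p, q, 1))         # Manhattan (L1): 7.0
print(lr_distance(p, q, math.inf))  # L-infinity: 4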
+ Jaccard distance 20
n The Jaccard similarity is a measure of how close sets are, although it is not
really a distance measure
n the closer sets are, the higher the Jaccard similarity
n 1 minus the ratio of the sizes of the intersection and union of sets x and y
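A small illustration of that definition (the example sets are mine):

def jaccard_distance(x, y):
    """Jaccard distance = 1 - |x intersection y| / |x union y| for two sets."""
    x, y = set(x), set(y)
    if not x and not y:
        return 0.0  # convention: two empty sets are identical
    return 1.0 - len(x & y) / len(x | y)

a = {1, 2, 3, 4}
b = {2, 3, 5}
print(jaccard_distance(a, b))  # intersection size 2, union size 5 -> distance 0.6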
+ Cosine distance 21
n Cosine distance between two points is the angle that the vectors to those
points make.
n The angle will be in the range 0 to 180 degrees, regardless of how many dimensions
the space has
n First compute the cosine of the angle, and then
n Apply the arc-cosine function to translate to an angle in the 0-180 degree range.
n Given two vectors x and y, the cosine of the angle between them is the
n dot product x.y
n divided by the L2-norms of x and y (i.e., their Euclidean distances from the origin)
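A sketch of this computation in Python (the function name is mine; the vectors are illustrative):

import math

def cosine_distance_degrees(x, y):
    """Angle (0-180 degrees) between the vectors to points x and y."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    cos_angle = dot / (norm_x * norm_y)
    # clamp to [-1, 1] to guard against floating-point drift before arc-cosine
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))

print(cosine_distance_degrees((1, 1, 1), (2, 2, 2)))  # same direction -> 0 degrees
print(cosine_distance_degrees((1, 0), (0, 1)))        # orthogonal -> 90 degrees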
+ Cosine distance example 22
n Compute the L2-norm of x (its Euclidean distance from the origin)
n Non-Euclidean spaces: edit distance
n These common symbols appear in the same order in both strings, so we are able to use them all in an LCS
n Since the length of x is 5, the length of y is 6, and the length of their LCS is 4, the edit distance is 5 + 6 − 2·4 = 3
n There are spaces where the notion of average makes no sense (e.g., the average of two strings)
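A sketch of the LCS-based edit distance used in this example; the function names and the two strings are my own, chosen only to match the lengths quoted above (|x| = 5, |y| = 6, |LCS| = 4):

def lcs_length(x, y):
    """Length of the longest common subsequence of strings x and y."""
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, cx in enumerate(x, 1):
        for j, cy in enumerate(y, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if cx == cy else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(x)][len(y)]

def edit_distance(x, y):
    """Insert/delete edit distance: |x| + |y| - 2 * |LCS(x, y)|."""
    return len(x) + len(y) - 2 * lcs_length(x, y)

print(edit_distance("abcde", "acfdeg"))  # LCS = "acde" (length 4) -> distance 3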
Finding topics:
Clustering methods
Overview (1) 38
n Point assignment:
n Maintain a set of clusters
n Points belong to “nearest” cluster
n Whether data is small enough to fit in main memory, or whether data must reside in secondary memory
n Algorithms for large amounts of data often must take shortcuts, since it is infeasible
to look at all pairs of points, for example
n It is also necessary to summarize clusters in main memory, since we cannot hold all
the points of all the clusters in main memory at the same time
40
Hierarchical clustering
+ Hierarchical Clustering 41
n Key operation:
Repeatedly combine
two nearest clusters
[Figure: example data points o = (0,0), (1,2), (2,1), (4,1), (5,0), (5,3) with intermediate centroids x = (1,1), (1.5,1.5), (4.5,0.5), (4.7,1.3), and the resulting dendrogram; legend: o … data point, x … centroid]
n Knowledge or belief about how many clusters there are in the data
n For example, if we are told that the data about dogs is taken from Chihuahuas,
Dachshunds, and Beagles
n Then we know to stop when there are three clusters left
n Stop combining when at some point the best combination of existing clusters
produces a cluster that is “inadequate”
n The only “locations” we can talk about are the points themselves
n i.e., there is no “average” of two points
n Distances cannot be based on “location” of points
n Approach 1:
n (1) How to represent a cluster of many points? Clustroid = (data) point “closest” to the other points
n (2) How do you determine the “nearness” of clusters? Treat the clustroid as if it were the centroid when computing inter-cluster distances
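A minimal sketch of centroid-based agglomerative clustering in Euclidean space, repeatedly combining the two clusters whose centroids are nearest until k clusters remain (function names and data are mine, not from the slides):

import math

def centroid(cluster):
    """Component-wise mean of the points in a cluster."""
    n = len(cluster)
    return tuple(sum(p[i] for p in cluster) / n for i in range(len(cluster[0])))

def agglomerative(points, k):
    """Merge the two nearest clusters (by centroid distance) until k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # combine the nearest pair
        del clusters[j]
    return clusters

data = [(0, 0), (1, 2), (2, 1), (4, 1), (5, 0), (5, 3)]
print(agglomerative(data, 2))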
K-means clustering
51
+ k–means Algorithm(s) 52
1) For each point, place it in the cluster whose current centroid is nearest
2) After all points are assigned, update the locations of the centroids of the k clusters
3) Reassign all points to their closest centroid and repeat until assignments stop changing
[Figure: data points x and centroids — clusters after round 1]
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
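A minimal sketch of that loop; initialization is deliberately naive (the first k points), since picking good starting centroids is discussed separately, and all names are my own:

import math
from statistics import mean

def kmeans(points, k, rounds=10):
    """Plain k-means: assign points to the nearest centroid, then recompute centroids."""
    centroids = list(points[:k])  # naive initialization: the first k points
    for _ in range(rounds):
        clusters = [[] for _ in range(k)]
        # 1) place each point in the cluster whose current centroid is nearest
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # 2) update the centroid locations of the k clusters
        centroids = [
            tuple(mean(p[d] for p in c) for d in range(len(points[0]))) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

data = [(0, 0), (1, 2), (2, 1), (4, 1), (5, 0), (5, 3)]
centroids, clusters = kmeans(data, k=2)
print(centroids)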
+ Example: Assigning Clusters 55
[Figure: data points x and centroids — clusters after round 2]
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
+ Example: Assigning Clusters 56
[Figure: data points x and centroids — clusters at the end]
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
Getting the k right 57
How to select k?
[Figure: the same data clustered with three different values of k —
Too few: many long distances to centroid.
Just right: distances rather short.
Too many: little improvement in average distance.]
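One common way to act on this intuition (an "elbow" search, my framing rather than the slide's): run k-means for increasing k and watch the average distance from each point to its nearest centroid; it falls rapidly until the right k, then changes very little. A rough sketch, reusing the kmeans function sketched earlier:

import math

def average_distance_to_centroid(points, centroids):
    """Mean distance from each point to its nearest centroid."""
    return sum(min(math.dist(p, c) for c in centroids) for p in points) / len(points)

def pick_k(points, k_max=10):
    """Try k = 1..k_max and report the average distance for each (look for the elbow)."""
    for k in range(1, min(k_max, len(points)) + 1):
        centroids, _ = kmeans(points, k)  # kmeans as defined in the earlier sketch
        print(k, round(average_distance_to_centroid(points, centroids), 3))

pick_k([(0, 0), (1, 2), (2, 1), (4, 1), (5, 0), (5, 3)])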
n BFR [Bradley-Fayyad-Reina] is a
variant of k-means designed to
handle very large (disk-resident) data sets
[Figure: a cluster with its centroid, all of whose points are in the DS, and nearby compressed sets whose points are in the CS]
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
+ Summarizing Points: Comments 67
1) Find those points that are “sufficiently close” to a cluster centroid and add those points to that cluster and the DS
n These points are so close to the centroid that they can be summarized and then discarded
2) Use a main-memory clustering algorithm to cluster the remaining points together with the old RS; sufficiently tight mini-clusters go to the CS, isolated points stay in the RS
3) DS set: Adjust statistics of the clusters to account for the new points
n Add Ns, SUMs, SUMSQs
4) Consider merging compressed sets in the CS
5) If this is the last round, merge all compressed sets in the CS and all RS points into their nearest cluster
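A minimal sketch of the per-cluster summary BFR keeps for the DS (the count N, per-dimension SUM and SUMSQ), from which the centroid and variance can be derived without storing the raw points; the class and names are my own:

class ClusterSummary:
    """BFR-style summary: N, per-dimension SUM and SUMSQ instead of the raw points."""
    def __init__(self, dims):
        self.n = 0
        self.sum = [0.0] * dims
        self.sumsq = [0.0] * dims

    def add(self, point):
        # points added here are summarized and can then be discarded (the DS)
        self.n += 1
        for i, v in enumerate(point):
            self.sum[i] += v
            self.sumsq[i] += v * v

    def centroid(self):
        return [s / self.n for s in self.sum]

    def variance(self):
        # per-dimension variance: SUMSQ_i / N - (SUM_i / N)^2
        return [sq / self.n - (s / self.n) ** 2 for s, sq in zip(self.sum, self.sumsq)]

ds = ClusterSummary(dims=2)
for p in [(0, 0), (1, 2), (2, 1)]:
    ds.add(p)
print(ds.centroid(), ds.variance())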
[Figure: salary vs. age scatter plot with two groups of points, h and e]
1) Initial clusters:
n Cluster these points hierarchically – group nearest points/clusters
2) Pick (say) 4 remote points for each cluster
[Figure: salary vs. age scatter with the chosen remote points highlighted]
3) Move points (say) 20% toward the centroid
[Figure: salary vs. age scatter with the representative points moved toward each cluster's centroid]
Pass 2: rescan the whole dataset and place each point in the “closest” cluster, i.e., the cluster with the closest representative point
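A rough sketch of the representative-point step described above (pick dispersed points, then shrink them toward the centroid) and of the Pass 2 assignment rule; the number of representatives and the shrink fraction are parameters, and all names are my own:

import math

def pick_representatives(cluster, num_rep=4, shrink=0.2):
    """Pick dispersed points of a cluster, then move them toward the centroid."""
    dims = range(len(cluster[0]))
    centroid = tuple(sum(p[i] for p in cluster) / len(cluster) for i in dims)
    reps = []
    for _ in range(min(num_rep, len(cluster))):
        # pick the point farthest from the representatives chosen so far
        # (or from the centroid for the first pick), so the picks are "remote"
        ref = reps if reps else [centroid]
        p = max(cluster, key=lambda q: min(math.dist(q, r) for r in ref))
        reps.append(p)
    # shrink each representative (say) 20% toward the centroid
    return [tuple(r[i] + shrink * (centroid[i] - r[i]) for i in dims) for r in reps]

def assign(point, reps_per_cluster):
    """Pass 2: place a point in the cluster with the closest representative point."""
    return min(range(len(reps_per_cluster)),
               key=lambda c: min(math.dist(point, r) for r in reps_per_cluster[c]))

clusters = [[(0, 0), (1, 2), (2, 1)], [(4, 1), (5, 0), (5, 3)]]
reps = [pick_representatives(c) for c in clusters]
print(assign((4.5, 0.5), reps))  # -> 1 (the second cluster)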
Final remarks
+ Summary 86
n Algorithms:
n Agglomerative hierarchical clustering:
n Centroid and clustroid
n k-means:
n Initialization, picking k
n BFR
n CURE