
Supervised learning

[Diagram: labeled examples (label1 … label5) feeding into a model/predictor]

Supervised learning: given labeled examples


Unsupervised learning

Unsupervised learning: given data, i.e. examples, but no labels


Unsupervised learning

Given some examples without labels, do something!


Unsupervised learning applications
Learn clusters/groups without any labels
• customer segmentation (i.e. grouping)
• image compression
• bioinformatics: learn motifs
• find important features


Unsupervised learning: clustering

[Diagram: raw data → extract features → feature vectors (f1, f2, f3, …, fn) → group into classes/clusters]

No “supervision”, we’re only given data and want to find natural groupings
Unsupervised learning: modeling
Most frequently, when people think of unsupervised learning they think clustering

Another category: learning probabilities/parameters for models without supervision
• Learn a translation dictionary
• Learn a grammar for a language
• Learn the social graph
Clustering

Clustering: the process of grouping a set of objects into classes of similar objects

Applications?
Gene expression data
Data from Garber et al., PNAS (98), 2001.
Face Clustering
Face clustering
Search result clustering
Google News
Clustering in search advertising

Find clusters of advertisers and bidded keywords
• Keyword suggestion
• Performance estimation

[Diagram: bipartite graph of advertisers and bidded keywords, ~10M nodes]
Clustering applications

Find clusters of users
• Targeted advertising
• Exploratory analysis

Clusters of the Web Graph
• Distributed pagerank computation

[Diagram: who-messages-who IM/text/twitter graph, ~100M nodes]
Data visualization

Wise et al, “Visualizing the non-visual” PNNL

ThemeScapes, Cartia
• [Mountain height = cluster size]
A data set with clear cluster structure

What are some of the issues for clustering?

What clustering algorithms have you seen/used?
Issues for clustering
Representation for clustering
• How do we represent an example?
• features, etc.
• Similarity/distance between examples

Flat clustering or hierarchical

Number of clusters
• Fixed a priori
• Data driven?
Clustering Algorithms
Flat algorithms
• Usually start with a random (partial) partitioning
• Refine it iteratively
• K means clustering
• Model based clustering
• Spectral clustering

Hierarchical algorithms
• Bottom-up, agglomerative
• Top-down, divisive
Hard vs. soft clustering
Hard clustering: Each example belongs to exactly one cluster

Soft clustering: An example can belong to more than one cluster (probabilistic)
• Makes more sense for applications like creating browsable hierarchies
• You may want to put a pair of sneakers in two clusters: (i) sports apparel and (ii) shoes
K-means

Most well-known and popular clustering algorithm:

Start with some initial cluster centers

Iterate:
• Assign/cluster each example to closest center
• Recalculate centers as the mean of the points in a cluster
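A minimal NumPy sketch of this loop (the array names, the iteration cap, and the empty-cluster handling are our illustrative choices, not part of the slides):

import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    # X: (n_points, n_features) array; returns centers and cluster assignments
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)  # initial centers: random examples
    for _ in range(max_iters):
        # assign/cluster each example to the closest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assignment = dists.argmin(axis=1)
        # recalculate centers as the mean of the points in each cluster
        new_centers = centers.copy()
        for j in range(k):
            members = X[assignment == j]
            if len(members) > 0:       # keep the old center if a cluster goes empty
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break                      # no changes: done
        centers = new_centers
    return centers, assignment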
K-means: an example
K-means: Initialize centers randomly
K-means: assign points to nearest center
K-means: readjust centers
K-means: assign points to nearest center
K-means: readjust centers
K-means: assign points to nearest center
K-means: readjust centers
K-means: assign points to nearest center

No changes: Done
K-means

Iterate:
• Assign/cluster each example to closest center
• Recalculate centers as the mean of the points in a cluster

How do we do this?
K-means

Iterate:
• Assign/cluster each example to closest center
  iterate over each point:
  - get distance to each cluster center
  - assign to closest center (hard cluster)
• Recalculate centers as the mean of the points in a cluster
K-means

Iterate:
• Assign/cluster each example to closest center
  iterate over each point:
  - get distance to each cluster center
  - assign to closest center (hard cluster)
• Recalculate centers as the mean of the points in a cluster

What distance measure should we use?


Distance measures

Euclidean:

d(x, y) = √( ∑_{i=1}^{n} (x_i − y_i)² )

good for spatial data
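As a tiny illustration (plain NumPy; the function name is ours):

import numpy as np

def euclidean(x, y):
    # square root of the sum of squared per-feature differences
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sqrt(np.sum((x - y) ** 2))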


Clustering documents (e.g. wine data)

One feature for each word. The value is the number of times that word occurs.

Documents are points or vectors in this space


When Euclidean distance doesn’t work

Which document is closest to q using Euclidean distance?

Which do you think should be closer?
Issues with Euclidean distance

the Euclidean distance between q and d2 is large

but, the distribution of terms in the query q and the distribution of terms in the document d2 are very similar

This is not what we want!


cosine similarity

sim(x, y) = (x • y) / (|x| |y|) = (x/|x|) • (y/|y|) = ∑_{i=1}^{n} x_i y_i / ( √(∑_{i=1}^{n} x_i²) √(∑_{i=1}^{n} y_i²) )

correlated with the angle between two vectors
cosine distance

cosine similarity is a similarity between 0 and 1, with similar things near 1 and dissimilar things near 0

We want a distance measure, cosine distance:

d(x, y) = 1 − sim(x, y)

- good for text data and many other “real world” data sets
- computationally friendly since we only need to consider features that have non-zero values in both examples
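A short sketch of cosine distance on dense vectors (names and the example vectors are illustrative):

import numpy as np

def cosine_distance(x, y):
    # cosine distance = 1 − cosine similarity
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    sim = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
    return 1.0 - sim

# e.g. a short query and a long document with the same term distribution:
q, d2 = [1.0, 1.0, 0.0], [10.0, 10.0, 0.0]
print(cosine_distance(q, d2))  # 0.0: same direction, so they are "close",
                               # even though their Euclidean distance (≈ 12.7) is large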
K-means

Iterate:
• Assign/cluster each example to closest center
• Recalculate centers as the mean of the points in a cluster

Where are the cluster centers?


K-means

Iterate:
• Assign/cluster each example to closest center
• Recalculate centers as the mean of the points in a cluster

How do we calculate these?


K-means

Iterate:
• Assign/cluster each example to closest center
• Recalculate centers as the mean of the points in a cluster

Mean of the points in the cluster:

µ(C) = (1/|C|) ∑_{x∈C} x

where adding points (x + y) and dividing by |C| are done component-wise over the n features
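For instance (hypothetical points in one cluster):

import numpy as np

cluster_points = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 0.0]])  # hypothetical cluster C, |C| = 3
center = cluster_points.mean(axis=0)  # component-wise: sum each feature, divide by |C| → [3.0, 2.0]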
K-means loss function

K-means tries to minimize what is called the “k-means” loss function:

loss = ∑_{i=1}^{n} d(x_i, µ_k)², where µ_k is the cluster center for x_i

that is, the sum of the squared distances from each point to the associated cluster center
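A sketch of computing this loss with squared Euclidean distance (array names are illustrative):

import numpy as np

def kmeans_loss(X, centers, assignment):
    # sum over all points of the squared distance to their assigned center
    diffs = X - centers[assignment]
    return float(np.sum(diffs ** 2))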
Minimizing k-means loss

Iterate:
1. Assign/cluster each example to closest center
2. Recalculate centers as the mean of the points in a cluster

loss = ∑_{i=1}^{n} d(x_i, µ_k)², where µ_k is the cluster center for x_i

Does each step of k-means move towards reducing this loss function (or at least not increasing)?
Minimizing k-means loss

Iterate:
1. Assign/cluster each example to closest center
2. Recalculate centers as the mean of the points in a cluster

loss = ∑_{i=1}^{n} d(x_i, µ_k)², where µ_k is the cluster center for x_i

This isn’t quite a complete proof/argument, but:

1. Any other assignment would end up in a larger loss
2. The mean of a set of values minimizes the squared error
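For point 2, a one-line check: for values x_1, …, x_m, setting the derivative of ∑_i (x_i − µ)² with respect to µ to zero gives −2 ∑_i (x_i − µ) = 0, i.e. µ = (1/m) ∑_i x_i, so the mean is exactly the value that minimizes the squared error.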
Minimizing k-means loss

Iterate:
1. Assign/cluster each example to closest center
2. Recalculate centers as the mean of the points in a cluster

loss = ∑_{i=1}^{n} d(x_i, µ_k)², where µ_k is the cluster center for x_i

Does this mean that k-means will always find the minimum loss/clustering?
Minimizing k-means loss

Iterate:
1. Assign/cluster each example to closest center
2. Recalculate centers as the mean of the points in a cluster

loss = ∑_{i=1}^{n} d(x_i, µ_k)², where µ_k is the cluster center for x_i

NO! It will find a minimum.

Unfortunately, the k-means loss function is generally not convex and for most problems has many, many minima

We’re only guaranteed to find one of them


K-means variations/parameters

Start with some initial cluster centers

Iterate:
• Assign/cluster each example to closest center
• Recalculate centers as the mean of the points in a cluster

What are some other variations/parameters we haven’t specified?
K-means variations/parameters
Initial (seed) cluster centers

Convergence
• A fixed number of iterations
• partitions unchanged
• Cluster centers don’t change

K!
K-means: Initialize centers randomly

What would happen here?

Seed selection ideas?


Seed choice

Results can vary drastically based on random seed selection

Some seeds can result in poor convergence rate, or convergence to sub-optimal clusterings

Common heuristics
• Random centers in the space
• Randomly pick examples
• Points least similar to any existing center (furthest centers
heuristic)
• Try out multiple starting points
• Initialize with the results of another clustering method
Furthest centers heuristic

μ1 = pick random point

for i = 2 to K:
μi = point that is furthest from any previous centers

µ_i = argmax_x min_{µ_j : 1 ≤ j < i} d(x, µ_j)

argmax: the point with the largest distance to any previous center
min: the smallest distance from x to any previous center
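A rough NumPy sketch of this heuristic (function and variable names are illustrative):

import numpy as np

def furthest_centers(X, K, seed=0):
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]    # mu_1 = pick random point
    for _ in range(2, K + 1):
        # for every point, the smallest distance to any previously chosen center
        d = np.linalg.norm(X[:, None, :] - np.asarray(centers)[None, :, :], axis=2).min(axis=1)
        centers.append(X[d.argmax()])      # mu_i = point furthest from all previous centers
    return np.asarray(centers)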
K-means: Initialize furthest from centers

Pick a random point for the first center


K-means: Initialize furthest from centers

What point will be chosen next?


K-means: Initialize furthest from centers

Furthest point from center

What point will be chosen next?


K-means: Initialize furthest from centers

Furthest point from center

What point will be chosen next?


K-means: Initialize furthest from centers

Furthest point from center

Any issues/concerns with this approach?


Furthest points concerns

If k = 4, which points will get chosen?


Furthest points concerns

If we do a number of trials, will we get different centers?
Furthest points concerns

Doesn’t deal well with outliers


K-means++

μ1 = pick random point

for k = 2 to K:
    for i = 1 to N:
        si = min d(xi, μ1…k-1)   // smallest distance to any center
    μk = randomly pick point proportionate to s

How does this help?


K-means++

μ1 = pick random point

for k = 2 to K:
    for i = 1 to N:
        si = min d(xi, μ1…k-1)   // smallest distance to any center
    μk = randomly pick point proportionate to s


- Makes it possible to select other points
- If #points >> #outliers, we will pick good points
- Makes it non-deterministic, which will help with random runs
- Nice theoretical guarantees!
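A minimal sketch of this initialization (following the pseudocode above: selection is weighted by s as written on the slide; all names are illustrative):

import numpy as np

def kmeans_pp_init(X, K, seed=0):
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]               # mu_1 = pick random point
    for _ in range(2, K + 1):
        # s_i = smallest distance from x_i to any center chosen so far
        s = np.linalg.norm(X[:, None, :] - np.asarray(centers)[None, :, :], axis=2).min(axis=1)
        idx = rng.choice(len(X), p=s / s.sum())       # pick a point proportionate to s
        centers.append(X[idx])
    return np.asarray(centers)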
