Machine Learning 3
Clustering is subjective.
Similarity is hard to define, but… “we know it when we see it.”
[Black boxes: given the inputs “Peter” and “Piotr”, three different similarity functions return 0.23, 3, and 342.7]
When we peek inside one of these black boxes, we see some function on two variables. These functions might be very simple or very complex. In either case it is natural to ask: what properties should these functions have?
One such function is the string edit distance, defined recursively:

d('', '') = 0
d(s, '') = d('', s) = |s|   -- i.e. the length of s
d(s1+ch1, s2+ch2) = min( d(s1, s2) + (if ch1 = ch2 then 0 else 1),
                         d(s1+ch1, s2) + 1,
                         d(s1, s2+ch2) + 1 )
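The recurrence above maps directly onto a dynamic-programming table. A minimal sketch in Python (the function name and table layout are mine, not from the slides):

```python
def edit_distance(s1: str, s2: str) -> int:
    """Levenshtein distance via the recurrence above: d('', '') = 0,
    d(s, '') = d('', s) = |s|, and the three-way min in the general case."""
    m, n = len(s1), len(s2)
    # dp[i][j] = distance between the first i chars of s1 and first j of s2
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                      # d(s, '') = |s|
    for j in range(n + 1):
        dp[0][j] = j                      # d('', s) = |s|
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1   # if ch1 = ch2 then 0 else 1
            dp[i][j] = min(dp[i - 1][j - 1] + cost,     # substitute (or match)
                           dp[i - 1][j] + 1,            # delete
                           dp[i][j - 1] + 1)            # insert
    return dp[m][n]
```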
Two types of clustering: Hierarchical and Partitional.
Desirable Properties of a Clustering Algorithm
Yahoo’s hierarchy is manually created; we will focus on the automatic creation of hierarchies in data mining.
Cristovao (Portuguese)
Christoph (German), Christophe (French), Cristobal
(Spanish), Cristoforo (Italian), Kristoffer
(Scandinavian), Krystof (Czech), Christopher (English)
Miguel (Portuguese)
Michalis (Greek), Michael (English), Mick (Irish!)
[Dendrogram of the Cristovao, Miguel, and Pedro name variants, clustered by string similarity: the variants of each name group together]
Pedro (Portuguese/Spanish)
Petros (Greek), Peter (English), Piotr (Polish),
Peadar (Irish), Pierre (French), Peder (Danish),
Peka (Hawaiian), Pietro (Italian), Piero (Italian
Alternative), Petr (Czech), Pyotr (Russian)
Hierarchical clustering can sometimes show
patterns that are meaningless or spurious
• For example, in this clustering, the tight grouping of Australia,
Anguilla, St. Helena, etc. is meaningful, since all these countries are
former UK colonies.
• The Indian flag is a horizontal tricolor in equal proportion of deep saffron on the
top, white in the middle and dark green at the bottom. In the center of the white
band, there is a wheel in navy blue to indicate the Dharma Chakra, the wheel of
law in the Sarnath Lion Capital. This center symbol or the 'CHAKRA' is a symbol
dating back to 2nd century BC. The saffron stands for courage and sacrifice; the
white, for purity and truth; the green for growth and auspiciousness.
[The dendrogram also isolates one branch far from all the others: an outlier]
(How-to) Hierarchical Clustering
The number of dendrograms with n leaves is (2n − 3)! / [2^(n−2) (n − 2)!]. Since we cannot test all possible trees, we will have to heuristically search the space of possible trees. We could do this:

Number of Leaves    Number of Possible Dendrograms
2                   1
3                   3
4                   15
5                   105
…                   …
10                  34,459,425

Bottom-Up (agglomerative): Starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together.
Distance matrix for five items (upper triangle shown):

0  8  8  7  7
   0  2  4  4
      0  3  3
         0  1
            0

[The slide illustrates two of the lookups with pictures of the items: D(·, ·) = 8 and D(·, ·) = 1]
Consider all possible merges… choose the best. (This repeats at every level until all items are fused.)
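A minimal sketch of this bottom-up loop, run on the 5 × 5 distance matrix from the slide above. Single linkage (closest pair of members) is my choice for scoring a merge; the slides do not fix how “best pair” is measured:

```python
# Distance matrix from the slide (symmetric, zero diagonal).
D = [[0, 8, 8, 7, 7],
     [8, 0, 2, 4, 4],
     [8, 2, 0, 3, 3],
     [7, 4, 3, 0, 1],
     [7, 4, 3, 1, 0]]

def agglomerate(D):
    """Bottom-up clustering: start with singletons, repeatedly merge the
    closest pair of clusters (single linkage) until one cluster remains."""
    clusters = [[i] for i in range(len(D))]
    merges = []
    while len(clusters) > 1:
        # Consider all possible merges... choose the best.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(D[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        clusters = ([c for k, c in enumerate(clusters) if k not in (a, b)]
                    + [clusters[a] + clusters[b]])
    return merges

for left, right, d in agglomerate(D):
    print(f"merge {left} + {right} at distance {d}")
```

On this matrix the merges happen at distances 1, 2, 3, and 7, which is exactly the shape of dendrogram the slide builds.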
A generic technique for measuring similarity
To measure the similarity between two objects,
transform one of the objects into the other, and
measure how much effort it took. The measure
of effort becomes the distance measure.
Peter
→ Piter    (substitution: i for e)
→ Pioter   (insertion: o)
→ Piotr    (deletion: e)

Thus the edit distance between Peter and Piotr is 3.
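Assuming the edit_distance sketch from earlier, this count can be verified directly:

```python
assert edit_distance("Peter", "Piotr") == 3   # one substitution, one insertion, one deletion
```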
Partitional Clustering
• Nonhierarchical: each instance is placed in
exactly one of K nonoverlapping clusters.
• Since only one set of clusters is output, the user
normally has to input the desired number of
clusters K.
Squared Error
[Scatter plot on 1–10 axes: each point’s squared error is its distance to the center of the cluster it belongs to]
Objective Function
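The quantity being minimized is the within-cluster sum of squared errors; a standard formulation (notation mine, with C_j the clusters and μ_j their centers):

```latex
% Within-cluster sum of squared errors
E = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2
```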
Algorithm k-means
1. Decide on a value for k.
2. Initialize the k cluster centers (randomly, if
necessary).
3. Decide the class memberships of the N objects by
assigning them to the nearest cluster center.
4. Re-estimate the k cluster centers, by assuming the
memberships found above are correct.
5. If none of the N objects changed membership in
the last iteration, exit. Otherwise goto 3.
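A compact sketch of these five steps in Python, for 2-D points (standard library only; all names are mine):

```python
import math
import random

def kmeans(points, k, seed=0):
    """Lloyd's algorithm, following steps 1-5 above, for 2-D points."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)                 # step 2: initialize randomly
    while True:
        # Step 3: assign each object to the nearest cluster center.
        member = [min(range(k), key=lambda j: math.dist(p, centers[j]))
                  for p in points]
        # Step 4: re-estimate each center as the mean of its members.
        for j in range(k):
            mine = [p for p, m in zip(points, member) if m == j]
            if mine:
                centers[j] = (sum(x for x, _ in mine) / len(mine),
                              sum(y for _, y in mine) / len(mine))
        # Step 5: stop when no object changes membership.
        new_member = [min(range(k), key=lambda j: math.dist(p, centers[j]))
                      for p in points]
        if new_member == member:
            return centers, member
```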
K-means Clustering: Steps 1–5
Algorithm: k-means. Distance Metric: Euclidean Distance.
[Five snapshots of the same scatter plot (axes: expression in condition 1 and expression in condition 2, both 0–5): the centers k1, k2, k3 start at random positions; each point is assigned to its nearest center; the centers move to the means of their members; assignment and re-estimation repeat until no point changes cluster]
Comments on the K-Means Method
• Strength
– Relatively efficient: O(tkn), where n is # objects, k is # clusters,
and t is # iterations. Normally, k, t << n.
– Often terminates at a local optimum. The global optimum may
be found using techniques such as deterministic annealing and
genetic algorithms
• Weakness
– Applicable only when a mean is defined; what about
categorical data?
– Need to specify k, the number of clusters, in advance
– Unable to handle noisy data and outliers
– Not suitable to discover clusters with non-convex shapes
The K-Medoids Clustering Method
• Find representative objects, called medoids, in clusters
• PAM (Partitioning Around Medoids, 1987)
– starts from an initial set of medoids and iteratively replaces
one of the medoids by one of the non-medoids if it improves
the total distance of the resulting clustering
– PAM works effectively for small data sets, but does not scale
well for large data sets
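A minimal sketch of PAM’s swap loop over a precomputed distance matrix (the naive initialization and all names are mine; full PAM also includes a smarter build phase):

```python
def pam(D, k, medoids=None):
    """Partitioning Around Medoids: greedily swap a medoid with a
    non-medoid whenever the swap lowers total distance to nearest medoid."""
    n = len(D)
    medoids = list(medoids) if medoids else list(range(k))  # naive initial set

    def cost(ms):
        # Total distance from every object to its nearest medoid.
        return sum(min(D[i][m] for m in ms) for i in range(n))

    improved = True
    while improved:
        improved = False
        for mi in range(k):
            for o in range(n):
                if o in medoids:
                    continue
                trial = medoids[:mi] + [o] + medoids[mi + 1:]
                if cost(trial) < cost(medoids):
                    medoids, improved = trial, True
    return medoids
```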
Clustering PD Games
• Clustering on scores
• Answers the question “do humans perform
better than computers?”
• Distance = difference in scores
• Create a distance matrix, cluster with k-means, k = 2.
Clustering PD Games
• Each game is a collection of moves for each
agent (cooperate = 1, defect = 0)
• Game = 2 bit-strings:
– 10011001
– 01111000
• Distance between games
– Edit distance or Hamming Distance
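A minimal sketch of the Hamming distance between two game strings (assuming equal lengths, as in the example above):

```python
def hamming(g1: str, g2: str) -> int:
    """Number of positions at which two equal-length game strings differ."""
    assert len(g1) == len(g2)
    return sum(a != b for a, b in zip(g1, g2))

print(hamming("10011001", "01111000"))  # the two games above differ in 4 moves
```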
Assignment – Part 2
• Must be able to run a set of games consecutively, varying the
machine strategy, but with the same human player.
• Every player (human or machine) must have a unique ID,
starting with ‘h’ for human and ‘c’ for computer
• Output the following file:
Player ID    Game String    Score
h1           10110111010    36
c1           11011101101    42
etc.
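For Part 2, a minimal sketch of reading this file back in for the clustering step (assuming whitespace-separated columns exactly as in the table above; the function name is mine):

```python
def load_games(path):
    """Parse 'Player ID  Game String  Score' rows into a list of records."""
    games = []
    with open(path) as f:
        next(f)                              # skip the header row
        for line in f:
            pid, bits, score = line.split()
            games.append((pid, bits, int(score)))
    return games
```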
What next?
• Create a distance matrix by computing pair-wise distances
between game bitstrings
• Implement a k-means or hierarchical clustering algorithm
over that matrix, your choice
• Answer the following question:
– Do computers and humans act differently?
– Can you tell?
– Who’s the most human computer, and who’s the most robotic
human?
EM Algorithm
• Initialize K cluster centers
• Iterate between two steps
– Expectation step: assign points to clusters
P(d_i ∈ c_k) = w_k Pr(d_i | c_k) / Σ_j w_j Pr(d_i | c_j)
w_k = Σ_i P(d_i ∈ c_k) / N
– Maximization step: estimate model parameters
µ_k = Σ_{i=1..m} d_i P(d_i ∈ c_k) / Σ_{i=1..m} P(d_i ∈ c_k)
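These two steps can be made concrete for a one-dimensional Gaussian mixture with unit variances (my choice of model for illustration; the slides leave Pr(d_i | c_k) abstract):

```python
import math
import random

def em(data, k, iters=25, seed=0):
    """EM for a 1-D Gaussian mixture: the E-step computes P(d_i in c_k),
    the M-step re-estimates weights w_k and means mu_k (variances fixed at 1)."""
    rng = random.Random(seed)
    mu = rng.sample(data, k)                 # cluster means randomly assigned
    w = [1.0 / k] * k
    for _ in range(iters):
        # E-step: P(d_i in c_k) proportional to w_k * Pr(d_i | c_k).
        resp = []
        for d in data:
            lik = [w[j] * math.exp(-0.5 * (d - mu[j]) ** 2) for j in range(k)]
            s = sum(lik)
            resp.append([l / s for l in lik])
        # M-step: w_k = sum_i P(d_i in c_k) / N; mu_k = weighted mean.
        for j in range(k):
            rj = [r[j] for r in resp]
            w[j] = sum(rj) / len(data)
            mu[j] = sum(d * r for d, r in zip(data, rj)) / sum(rj)
    return w, mu
```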
Iteration 1
The cluster means are randomly assigned.
[Plots at iterations 2, 5, and 25 show the mixture fit steadily improving until it converges]
What happens if the data is streaming…
It is difficult to determine t in
advance…
Partitional Clustering Algorithms
• Clustering algorithms have been designed to handle
very large datasets
• E.g. the BIRCH algorithm
• Main idea: use an in-memory R-tree to store points that are
being clustered
• Insert points one at a time into the R-tree, merging a new
point with an existing cluster if it is less than some distance δ
away
• If there are more leaf nodes than fit in memory, merge
existing clusters that are close to each other
• At the end of the first pass we get a large number of clusters at
the leaves of the R-tree
• Merge clusters to reduce the number of clusters
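A heavily simplified sketch of the insertion idea, using a flat list of 1-D centroids instead of BIRCH’s tree structure (δ, the running-mean update, and all names are mine):

```python
def incremental_cluster(points, delta):
    """One pass: merge each point into the nearest existing cluster if it is
    less than delta away, otherwise start a new cluster."""
    clusters = []                       # list of (centroid, count) pairs, 1-D
    for p in points:
        best = min(range(len(clusters)),
                   key=lambda i: abs(clusters[i][0] - p),
                   default=None)
        if best is not None and abs(clusters[best][0] - p) < delta:
            c, n = clusters[best]
            clusters[best] = ((c * n + p) / (n + 1), n + 1)  # running mean
        else:
            clusters.append((p, 1))
    return clusters
```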
Partitional Clustering Algorithms
We need to specify the number of clusters in advance; here I have chosen 2.
R1 R2 R3 R4 R5 R6 R7 R8 R9 R10 R11
After merging R1 and R2 into R12: R12 R3 R4 R5 R6 R7 R8 R9 R10 R11
How can we tell the right number of clusters?
[The same ten points on 1–10 axes, clustered with k = 1, 2, and 3]
When k = 1, the objective function is 873.0
When k = 2, the objective function is 173.1
When k = 3, the objective function is 133.6
We can plot the objective function values for k equals 1 to 6…
[Plot: objective function value (0.00E+00 to 9.00E+02) against k = 1…6; the value drops sharply up to k = 2 and then flattens, forming an “elbow”]
Note that the results are not always as clear cut as in this toy example