
University of Wisconsin - Madison

Department of Industrial and Systems Engineering


ISyE 521: Machine Learning in Action
Fall 2022

Lecture 4: Clustering and K-nearest neighbors


Instructor: Justin J. Boutilier
October 11, 2022

In this lecture, we will introduce two types of clustering and the K-nearest neighbors
algorithm. We will learn:

1. The difference between k-means and hierarchical clustering, and how to use them

2. Various distance metrics and linkage criteria

3. How to use K-nearest neighbors

Clustering
Clustering is an unsupervised learning algorithm, which means that it does not require
the target variable. There are many types of clustering algorithms, and we focus on two
popular variants:

1. k-means clustering: finds k clusters such that the total distance between
each observation and its closest cluster centroid is minimized

2. Hierarchical / Agglomerative clustering: builds a hierarchy (or tree) of clusters where
the closest pair of clusters is merged until there is only one cluster, or a stopping
criterion is met.

The general idea behind clustering is to partition the data into groups called clusters,
such that the observations within a cluster are similar to one another. To do this, we first
need to define what we mean by similar.

Distance metrics
To determine if observations are similar, we can compute the distance between them;
observations with small distances are more similar than observations with large distances.
For each observation i, we let xi = (xi1, xi2, . . . , xiF)^T represent a column vector of F
feature variables. Now, imagine that each observation xi is a point in F-dimensional
space. Suppose we have three observations x1, x2, and x3. A distance metric, d(x1, x2),
must satisfy the following criteria:

1. Non-negativity: d(x1 , x2 ) ≥ 0 and d(x1 , x2 ) = 0 if and only if x1 = x2 .


2. Symmetry: d(x1 , x2 ) = d(x2 , x1 )


3. Triangle inequality: d(x1 , x2 ) + d(x2 , x3 ) ≥ d(x1 , x3 )
There are many ways to define the distance between two points in space, and we review
some popular choices below (a short code sketch follows this list):
• Euclidean: represents “straight line distance” and is the most commonly used distance
metric. Also known as the l2 norm:

    d(x_1, x_2) = \|x_1 - x_2\|_2 = \sqrt{\sum_{f=1}^{F} (x_{1f} - x_{2f})^2}

• Manhattan: represents the sum of the absolute differences. Also known as the l1 norm:

    d(x_1, x_2) = \|x_1 - x_2\|_1 = \sum_{f=1}^{F} |x_{1f} - x_{2f}|

• Chebychev: represents the maximum absolute difference. Also known as the l∞ norm:

    d(x_1, x_2) = \|x_1 - x_2\|_\infty = \max_{f=1,\ldots,F} |x_{1f} - x_{2f}|

• Minkowski: is a generalization of Euclidean, Manhattan, and Chebychev. Also
known as the lp norm:

    d(x_1, x_2) = \|x_1 - x_2\|_p = \left( \sum_{f=1}^{F} |x_{1f} - x_{2f}|^p \right)^{1/p}

• Hamming: counts the number of features that are different:

    d(x_1, x_2) = \sum_{f=1}^{F} I(x_{1f} \neq x_{2f}),

where I is an indicator that is 1 if the condition (x_{1f} ≠ x_{2f}) is true, and 0 otherwise.
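To make these definitions concrete, the following minimal sketch (not part of the original notes) computes each metric for a pair of made-up feature vectors using NumPy and SciPy.

import numpy as np
from scipy.spatial import distance

# Two made-up observations with F = 3 features each.
x1 = np.array([1.0, 2.0, 3.0])
x2 = np.array([4.0, 0.0, 3.0])

euclidean = np.linalg.norm(x1 - x2, ord=2)       # l2 norm
manhattan = np.linalg.norm(x1 - x2, ord=1)       # l1 norm
chebychev = np.linalg.norm(x1 - x2, ord=np.inf)  # l-infinity norm
minkowski = distance.minkowski(x1, x2, p=3)      # lp norm with p = 3
hamming = int(np.sum(x1 != x2))                  # number of differing features

print(euclidean, manhattan, chebychev, minkowski, hamming)

scipy.spatial.distance also provides cdist for computing all pairwise distances between two sets of observations at once, which is convenient when clustering.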
Next, let S1, S2, . . . , Sk be the index sets for each cluster. An index set for a cluster
includes the indices (or IDs) of each observation within that cluster. For example, S6 =
{1, 3, 7, 21, 44} means that observations 1, 3, 7, 21, and 44 are in cluster 6. We define some
additional terminology and metrics:

• Centroid: the “center” of each cluster, often used as a representative point for the
cluster. Denoted sk, where k is the cluster index, and computed as the mean of all
points in the cluster:

    s_k = \frac{1}{|S_k|} \sum_{i \in S_k} x_i

• Intra-cluster distance: the distance between points within a cluster. Computed
using any of the distance metrics above.

• Inter-cluster distance: the distance between points in different clusters. Often called
a linkage criterion when used as part of hierarchical clustering. More on this below.
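As a quick illustration (with made-up data), the centroid formula above amounts to averaging the rows of the feature matrix that belong to the index set S_k:

import numpy as np

# Made-up feature matrix: one row per observation, one column per feature.
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [8.0, 9.0],
              [1.5, 1.5]])

S_k = [0, 1, 3]            # index set for one cluster (observations 0, 1, and 3)
s_k = X[S_k].mean(axis=0)  # centroid: mean of all points in the cluster
print(s_k)                 # [1.5 1.5]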


k-means clustering
k-means is one of the most commonly used unsupervised learning algorithms. As noted
above, the objective is to partition observations into k clusters such that the total distance
between each observation and its nearest centroid is minimized. The number of clusters,
k, and the distance metric, d(x1, x2), are user-chosen hyperparameters.
Determining the number of clusters can be difficult and often requires exploratory
data analysis or expert opinion. In some cases (e.g., 1, 2, or 3 dimensions), the number of
clusters can be intuited once the data is visualized. It is also possible to use hierarchical
clustering as an approach to determine the number of clusters.
k-means can be written as an integer programming problem and it is NP-hard, meaning
that finding the optimal solution is difficult. However, there are many fast heuristic
algorithms that allow us to find locally optimal solutions, which is often good enough
in practice. The most common algorithm is called Lloyd's Algorithm, which is an
iterative heuristic with the following steps:

1. Randomly initialize k centroids by choosing the locations of s1 , s2 , . . . , sk .

2. Assign each observation xi to its closest centroid using the predetermined distance
metric

3. Recompute the centroid of each cluster

4. Stop if there is no change in the centroids. Otherwise, return to step 2.

The algorithm will (often) produce different solutions for different random initializations.
In practice, initial centroids that are far apart tend to work well, and the algorithm should
be run several times with different random seeds so that the best resulting clustering can
be kept.
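A bare-bones sketch of Lloyd's algorithm with Euclidean distance is shown below (illustrative only; it assumes well-separated synthetic data and does not handle empty clusters). In practice you would typically call scikit-learn's KMeans instead, whose n_init argument reruns the algorithm from several random initializations and keeps the best result.

import numpy as np

def lloyd_kmeans(X, k, rng, max_iter=100):
    # Step 1: randomly initialize k centroids by picking k distinct observations.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each observation to its closest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute the centroid of each cluster.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop if the centroids did not change; otherwise repeat.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(6, 1, (20, 2))])  # two synthetic blobs
labels, centroids = lloyd_kmeans(X, k=2, rng=rng)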

Hierarchical/Agglomerative clustering
Hierarchical/Agglomerative clustering (HAC) is a type of unsupervised learning algorithm
that builds a hierarchy (or tree) of clusters by repeatedly merging the closest pair of
clusters until there is only one cluster, or a stopping criterion is met. We use the following
algorithm to perform HAC:

1. Initialize each observation as its own cluster

2. Merge the two closest clusters according to some distance metric / linkage criterion
combination. In other words, at each step we greedily merge the closest pair of
clusters.

3. Continue until there is only one cluster (or a stopping criterion is met).

To fully understand this algorithm, we need to define some linkage criteria.


Linkage criteria
There are five popular choices for determining the distance between clusters (i.e., linkage
criteria), and each method leverages one of the distance metrics above:

• Centroid distance: the distance between each cluster’s centroid.

d(S1 , S2 ) = d(s1 , s2 )

• Minimum distance: measures the smallest distance between two points in different
clusters. Also known as single linkage.

    d(S_1, S_2) = \min_{i \in S_1, j \in S_2} d(x_i, x_j)

• Maximum distance: measures the largest distance between two points in different
clusters. Also known as complete linkage.

    d(S_1, S_2) = \max_{i \in S_1, j \in S_2} d(x_i, x_j)

• Average distance: measures the average of all pairwise distances between points in
different clusters.

    d(S_1, S_2) = \frac{1}{|S_1||S_2|} \sum_{i \in S_1} \sum_{j \in S_2} d(x_i, x_j)

• Minimum variance: measures the increase in total intra-cluster variance that results
from merging the two clusters. Also known as Ward linkage (after the creator J.H.
Ward).

    d(S_1, S_2) = \sqrt{\frac{2|S_1||S_2|}{|S_1| + |S_2|}} \, \|s_1 - s_2\|_2

Note that this is not a complete list of distance metrics or linkage criteria. Although we
have included many popular choices, there are others out there!
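As a hedged sketch of how this looks in code (scikit-learn is an assumption; the notes do not prescribe a library), AgglomerativeClustering exposes the linkage choice directly; its supported options include 'ward', 'complete', 'average', and 'single'.

from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import load_iris

X = load_iris().data

# Average linkage with the default (Euclidean) distance metric.
hac = AgglomerativeClustering(n_clusters=3, linkage="average")
labels = hac.fit_predict(X)
print(labels[:10])

Alternatively, passing distance_threshold (with n_clusters=None) stops merging once the inter-cluster distance exceeds a chosen value, which acts as the stopping criterion discussed above.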
The user typically visualizes the output of HAC using a dendrogram (a type of tree
diagram). See Figure 1 for an example of a dendrogram for the Iris dataset. The vertical
lines correspond to clusters. The point at which two clusters are merged (or linked) is
represented by the horizontal line that links the two clusters. The height of the horizontal
line represents the inter-cluster distance (i.e., how far the two clusters are from each
other). The dendrogram can be a useful tool for determining an appropriate number of
clusters because it allows you to easily visualize the inter-cluster distance as a function
of the number of clusters. The user can also select a stopping criterion or maximum
inter-cluster distance that will determine the number of clusters. For example, in Figure 1,
a stopping criterion or maximum inter-cluster distance of 10 yields three clusters. To see
this, draw a horizontal line at a distance of 10 and notice that it intersects with three
vertical lines (or clusters)!


[Figure: hierarchical cluster dendrogram (y-axis: distance; x-axis: cluster index).]

Figure 1: Dendrogram of the Iris dataset showing 25 clusters.
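A dendrogram like Figure 1 can be produced with SciPy; the sketch below is an approximation (it assumes the Iris features and Ward linkage, which may differ from the exact settings used for the figure).

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from sklearn.datasets import load_iris

X = load_iris().data
Z = linkage(X, method="ward")               # build the full cluster hierarchy

# Plot the dendrogram, truncated to the last 25 merged clusters (as in Figure 1).
dendrogram(Z, truncate_mode="lastp", p=25)
plt.xlabel("Cluster index")
plt.ylabel("Distance")
plt.show()

# Cut the tree at a maximum inter-cluster distance of 10 to obtain a flat clustering.
labels = fcluster(Z, t=10, criterion="distance")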

K-Nearest Neighbors (KNN)


KNN is one of the simplest and most intuitive supervised learning algorithms in machine
learning. Given a new observation, the general idea is to use the target values of its K
nearest neighbors in feature space (according to some distance metric) to make a
prediction. There are two types of KNN (a code sketch follows this list):

1. Classification: a majority vote of the K-nearest neighbors is used to make predictions.
For example, if K = 5 and three observations have a target of 1, and two
observations have a target of 0, then we will predict 1 (or we could predict the
probability of 1 as 0.6).

2. Regression: the average target value across the K-nearest neighbors is used to make
predictions. For example, if K = 5 then our prediction is the average target across
all five neighbors.
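Both variants are available in scikit-learn (an assumed choice of library); the sketch below fits a classifier on the Iris dataset and, purely for illustration, a regressor that predicts one Iris feature from the other three.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Classification: majority vote of the K = 5 nearest neighbors.
clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print(clf.predict(X_test[:3]))        # predicted classes
print(clf.predict_proba(X_test[:3]))  # vote proportions, e.g., 3/5 -> 0.6

# Regression: average target across the K = 5 nearest neighbors.
reg = KNeighborsRegressor(n_neighbors=5).fit(X_train[:, :3], X_train[:, 3])
print(reg.predict(X_test[:3, :3]))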

Hyperparameters
Similar to clustering, KNN requires the user to choose the number of neighbors K and
the distance metric. KNN has an additional hyperparameter that is used to weight the
neighbors. Typically, we use one of the following weighting schemes:

• Uniform: each neighbor is weighted equally


• Distance: each neighbor is weighted proportionally to the inverse of its distance
(i.e., 1/d(·)). Inverse weighting implies that closer neighbors contribute more to the
prediction than farther neighbors.

Algorithm
KNN does not explicitly require model training because there are no parameters / regression
coefficients to fit. Instead, given some data and hyperparameter choices (distance,
weighting, K), we can directly make predictions. Similar to regularized logistic regression,
cross validation can be used to determine the best hyperparameters.
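For example, a cross-validated grid search over the hyperparameters above could look like the following sketch (the grid values are made up):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Hypothetical hyperparameter grid: K, weighting scheme, and distance metric.
param_grid = {
    "n_neighbors": [1, 3, 5, 7, 9],
    "weights": ["uniform", "distance"],
    "metric": ["euclidean", "manhattan"],
}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)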
KNN is often considered the simplest machine learning algorithm to implement.
Suppose we have n observations that include both features (x1, . . . , xn) and targets
(y1, . . . , yn). We are presented with a new observation xp that requires a prediction. The
algorithm proceeds as follows:

1. Compute the distance between xp and x1, x2, . . . , xn. In other words, compute
d(xp, xi) for i = 1, . . . , n.

2. Index the K nearest neighbors by the set Np.

3. Compute the prediction as

    \hat{y}_p = \sum_{i \in N_p} w_i y_i,

where w_i = \frac{1/d(x_p, x_i)}{\sum_{j \in N_p} 1/d(x_p, x_j)} for (inverse) distance weighting and w_i = 1/K for uniform weighting.
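For completeness, here is a minimal NumPy sketch of the three steps (with made-up regression data, Euclidean distance, and inverse-distance weighting; the small constant added to the distances is a guard against division by zero and is not part of the notes).

import numpy as np

def knn_predict(X, y, x_p, K=5, weighting="distance"):
    # Step 1: compute d(x_p, x_i) for every training observation.
    dists = np.linalg.norm(X - x_p, axis=1)
    # Step 2: index the K nearest neighbors by the set N_p.
    N_p = np.argsort(dists)[:K]
    # Step 3: compute the weighted prediction y_hat_p = sum_{i in N_p} w_i * y_i.
    if weighting == "distance":
        inv = 1.0 / (dists[N_p] + 1e-12)  # inverse-distance weights (guard against d = 0)
        w = inv / inv.sum()
    else:
        w = np.full(K, 1.0 / K)           # uniform weighting: w_i = 1/K
    return np.dot(w, y[N_p])

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X.sum(axis=1) + rng.normal(scale=0.1, size=50)
print(knn_predict(X, y, x_p=np.zeros(3)))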
