Lecture 4
In this lecture, we will introduce two types of clustering and the K-nearest neighbors
algorithm. We will learn:
1. The difference between k-means and hierarchical clustering, and how to use them
2. How the K-nearest neighbors algorithm makes predictions for classification and regression
Clustering
Clustering is an unsupervised learning algorithm, which means that it does not require
a target variable. There are many types of clustering algorithms, and we focus on two
popular variants:
1. k-means clustering: finds k clusters such that the total distance between
each observation and its closest cluster centroid is minimized
2. Hierarchical/Agglomerative clustering: builds a hierarchy (or tree) of clusters by
repeatedly merging the closest pairs of clusters
The general idea behind clustering is to partition the data into groups called clusters,
such that the observations within a cluster are similar to one another. To do this, we first
need to define what we mean by similar.
Distance metrics
To determine if observations are similar, we can compute the distance between them;
observations with small distances are more similar than observations with large distances.
For each observation i, we let x_i = (x_{i1}, x_{i2}, \ldots, x_{iF})^T represent a column vector of F
feature variables. Now, imagine that each observation x_i is a point in F-dimensional
space. Suppose we have three observations x_1, x_2, and x_3. A distance metric, d(x_1, x_2),
must satisfy the following criteria:
1. Non-negativity: d(x_1, x_2) \geq 0, with d(x_1, x_2) = 0 if and only if x_1 = x_2
2. Symmetry: d(x_1, x_2) = d(x_2, x_1)
3. Triangle inequality: d(x_1, x_3) \leq d(x_1, x_2) + d(x_2, x_3)
Common choices of distance metric include:
• Euclidean: represents the straight-line distance between two points. Also known as the l2 norm:
d(x_1, x_2) = \|x_1 - x_2\|_2 = \sqrt{\sum_{f=1}^{F} (x_{1f} - x_{2f})^2}
• Manhattan: represents the sum of the absolute differences across features. Also known as the l1 norm:
d(x_1, x_2) = \|x_1 - x_2\|_1 = \sum_{f=1}^{F} |x_{1f} - x_{2f}|
• Matching (Hamming): counts the number of features on which the two observations disagree, which is useful for categorical features:
d(x_1, x_2) = \sum_{f=1}^{F} I(x_{1f} \neq x_{2f})
where I is an indicator that is 1 if the condition (x_{1f} \neq x_{2f}) is true, and 0 otherwise.
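As a quick numerical check of these formulas, the short sketch below computes the Euclidean, Manhattan, and matching (Hamming-style) distances between two made-up feature vectors using numpy.

```python
import numpy as np

# Two made-up observations with F = 4 features each.
x1 = np.array([1.0, 0.0, 3.0, 2.0])
x2 = np.array([2.0, 0.0, 1.0, 2.0])

# Euclidean (l2) distance: square root of the sum of squared differences.
euclidean = np.sqrt(np.sum((x1 - x2) ** 2))   # about 2.236

# Manhattan (l1) distance: sum of absolute differences.
manhattan = np.sum(np.abs(x1 - x2))           # 3.0

# Matching (Hamming-style) count: number of features where the observations differ.
n_different = np.sum(x1 != x2)                # 2

print(euclidean, manhattan, n_different)
```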
Next, let S_1, S_2, \ldots, S_k be the index sets for each cluster. An index set for a cluster
includes the indices (or IDs) of each observation within that cluster. For example, S6 =
{1, 3, 7, 21, 44} means that observations 1, 3, 7, 21, 44 are in cluster 6. We define some
additional terminology and metrics:
• Centroid: the “center” of each cluster, often used as a representative point for the
cluster. Denoted s_k, where k is the cluster index, and computed as the mean
of all points in the cluster:
s_k = \frac{1}{|S_k|} \sum_{i \in S_k} x_i
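As a small illustration, the sketch below computes the centroid of one cluster from its index set by averaging the corresponding rows of a made-up data matrix.

```python
import numpy as np

# Made-up data matrix: 6 observations (rows) with 2 features (columns).
X = np.array([[1.0, 2.0],
              [1.5, 1.8],
              [5.0, 8.0],
              [8.0, 8.0],
              [1.0, 0.6],
              [9.0, 11.0]])

# Index set S_k for one cluster (0-based row indices here).
S_k = [0, 1, 4]

# Centroid s_k: the mean of all points in the cluster.
s_k = X[S_k].mean(axis=0)
print(s_k)   # [1.1667, 1.4667]
```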
k-means clustering
k-means is one of the most commonly used unsupervised learning algorithms. As noted
above, the objective is to partition observations into k clusters such that the total
distance between each observation and its nearest centroid is minimized. The number of
clusters, k, and the distance metric, d(x_1, x_2), are user-chosen hyperparameters.
Determining the number of clusters can be difficult and often requires exploratory
data analysis or expert opinion. In some cases (e.g., 1, 2, or 3 dimensions), the number of
clusters can be intuited once the data is visualized. It is also possible to use hierarchical
clustering as an approach to determine the number of clusters.
k-means can be written as an integer programming problem and it is NP-hard, meaning
that finding the optimal solution is difficult. However, there are many fast heuristic
algorithms that allow us to find locally optimal solutions - something that is often good
enough in practice. The most common algorithm is called Lloyd’s Algorithm, which is an
iterative heuristic with the following steps:
1. Randomly initialize k centroids
2. Assign each observation x_i to its closest centroid using the predetermined distance
metric
3. Recompute each centroid as the mean of the observations assigned to it
4. Repeat steps 2 and 3 until the cluster assignments no longer change
The algorithm will (often) produce different solutions for different random initializations.
In practice, randomly chosen initial centroids that are far apart tend to work well, and
the algorithm should be run several times with different random seeds so that the best
resulting clustering can be kept.
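A minimal sketch of Lloyd's Algorithm with Euclidean distance is given below. The synthetic data, the choice k = 3, the random initialization scheme, and the stopping test on unchanged centroids are all assumptions made for the example; in practice a library implementation (e.g., scikit-learn's KMeans) also handles restarts and empty clusters for you.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))   # made-up data: 300 observations, 2 features
k = 3                           # assumed number of clusters

# Step 1: initialize k centroids by picking k distinct random observations.
centroids = X[rng.choice(len(X), size=k, replace=False)]

for _ in range(100):
    # Step 2: assign each observation to its closest centroid (Euclidean distance).
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)

    # Step 3: recompute each centroid as the mean of its assigned observations
    # (a production implementation would also handle clusters that become empty).
    new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])

    # Step 4: stop when the centroids no longer move.
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

print(centroids)
```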
Hierarchical/Agglomerative clustering
Hierarchical/Agglomerative clustering (HAC) is a type of unsupervised learning algorithm
that builds a hierarchy (or tree) of clusters in which the closest pairs of clusters are merged
until there is only one cluster, or a stopping criterion is met. We use the following algorithm
to perform HAC:
1. Start with each observation in its own cluster (so there are initially n clusters).
2. Merge each cluster with its closest neighbor cluster according to some distance
metric / linkage criterion combination. In other words, we greedily match every
cluster with exactly one other cluster (the closest one!).
3. Continue until there is only one cluster (or a stopping criterion is met).
Linkage criteria
There are five popular choices for determining the distance between clusters (i.e., linkage
criteria), and each method leverages one of the distance metrics above:
• Centroid distance: measures the distance between the two cluster centroids. Also
known as centroid linkage.
d(S_1, S_2) = d(s_1, s_2)
• Minimum distance: measures the smallest distance between two points in different
clusters. Also known as single linkage.
• Maximum distance: measures the largest distance between two points in different
clusters. Also known as complete linkage.
• Average distance: measures the average of all pairwise distances between points in
different clusters.
d(S_1, S_2) = \frac{1}{|S_1||S_2|} \sum_{i \in S_1} \sum_{j \in S_2} d(x_i, x_j)
• Minimum variance: measures the increase in total intra-cluster variance that would
result from merging the two clusters. Also known as Ward linkage (after its creator,
J.H. Ward).
d(S_1, S_2) = \sqrt{\frac{2|S_1||S_2|}{|S_1| + |S_2|}} \, \|s_1 - s_2\|_2
*Please be aware that the lists above are not complete lists of distance metrics or linkage
rules. Although we have included many popular choices, there are others out there!
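To make the linkage criteria concrete, the sketch below computes the single, complete, average, centroid, and Ward linkage distances between two tiny made-up clusters, assuming Euclidean distance as the underlying metric.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two made-up clusters of 2-dimensional points.
S1 = np.array([[0.0, 0.0], [1.0, 0.0]])
S2 = np.array([[4.0, 0.0], [5.0, 1.0]])

# All pairwise Euclidean distances between points in different clusters.
D = cdist(S1, S2)            # shape (|S1|, |S2|)

single   = D.min()           # minimum distance (single linkage)
complete = D.max()           # maximum distance (complete linkage)
average  = D.mean()          # average distance (average linkage)

# Centroid and Ward linkage use the cluster centroids s1 and s2.
s1, s2 = S1.mean(axis=0), S2.mean(axis=0)
centroid = np.linalg.norm(s1 - s2)
ward = np.sqrt(2 * len(S1) * len(S2) / (len(S1) + len(S2))) * np.linalg.norm(s1 - s2)

print(single, complete, average, centroid, ward)
```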
The user typically visualizes the output of HAC using a dendrogram (a type of tree diagram).
See Figure 1 for an example of a dendrogram built on the Iris dataset. The vertical
lines correspond to clusters. The point at which two clusters are merged (or linked) is
represented by the horizontal line that links the two clusters. The height of the horizontal
line represents the inter-cluster distance (i.e., how far the two clusters are from each
other). The dendrogram can be a useful tool for determining an appropriate number of
clusters because it allows you to easily visualize the inter-cluster distance as a function
of the number of clusters. The user can also select a stopping criterion or maximum inter-
cluster distance that will determine the number of clusters. For example, in Figure 1, a
stopping criterion (maximum inter-cluster distance) of 10 yields three clusters. To see
this, draw a horizontal line at a distance of 10 and notice that it intersects with three
vertical lines (or clusters)!
[Figure 1: Dendrogram produced by HAC on the Iris dataset. The vertical axis shows the inter-cluster distance (ranging from 0 to about 30) and the horizontal axis shows the cluster index, with leaf counts in parentheses.]
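A dendrogram similar to Figure 1 can be produced with scipy. The sketch below is one plausible way to generate it; the Ward linkage, the truncation setting, and the cut-off distance of 10 are assumptions chosen to match the description above rather than the exact settings used for the original figure.

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from sklearn.datasets import load_iris

X = load_iris().data

# Build the hierarchy with Ward linkage (an assumed choice for this sketch).
Z = linkage(X, method="ward")

# Plot the dendrogram, truncated so leaf counts are shown in parentheses.
dendrogram(Z, truncate_mode="lastp", p=30)
plt.ylabel("Distance")
plt.xlabel("Cluster index")
plt.show()

# Cutting the tree at an inter-cluster distance of 10 gives the cluster labels;
# at this threshold we expect three clusters, as described for Figure 1.
labels = fcluster(Z, t=10, criterion="distance")
print(sorted(set(labels)))
```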
K-nearest neighbors (KNN)
K-nearest neighbors (KNN) is a supervised learning algorithm that predicts the target of
a new observation from the K most similar (closest) observations in the data. It can be
used for both classification and regression:
1. Classification: the most common target class among the K-nearest neighbors (i.e., a
majority vote) is used to make predictions.
2. Regression: the average target value across the K-nearest neighbors is used to make
predictions. For example, if K = 5 then our prediction is the average target across
all five neighbors.
Hyperparameters
Similar to clustering, KNN requires the user to choose the number of neighbors K and
the distance metric. KNN has an additional hyperparameter that is used to weight the
neighbors. Typically, we use one of the following weighting schemes:
• Equal (uniform) weighting: each of the K neighbors contributes equally to the prediction.
• Distance weighting: closer neighbors contribute more than farther neighbors (e.g.,
each neighbor is weighted by the inverse of its distance to the new observation).
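These hyperparameters map directly onto arguments of scikit-learn's KNN estimators. The sketch below fits a classifier on the Iris data with K = 5, Euclidean distance, and inverse-distance weighting; the dataset and hyperparameter values are chosen purely for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# K = 5 neighbors, Euclidean distance (p=2), inverse-distance weighting.
knn = KNeighborsClassifier(n_neighbors=5, p=2, weights="distance")
knn.fit(X_train, y_train)

print(knn.score(X_test, y_test))   # accuracy on the held-out set
```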
Algorithm
KNN does not explicitly require model training because there are no parameters / regression
coefficients to fit. Instead, given some data and hyperparameter choices (distance metric,
weighting scheme, K), we can directly make predictions. Similar to regularized logistic
regression, cross-validation can be used to determine the best hyperparameters.
KNN is often considered the simplest machine learning algorithm to implement.
Suppose we have n observations that include both features (x_1, x_2, \ldots, x_n) and targets
(y_1, y_2, \ldots, y_n). We are presented with a new observation x_p that requires a prediction.
The algorithm proceeds as follows:
1. Compute the distance d(x_p, x_i) between x_p and each observation x_i using the chosen
distance metric.
2. Let N_p be the index set of the K observations closest to x_p (the K-nearest neighbors).
3. Make the prediction. For regression, predict the weighted average of the neighbors'
targets,
\hat{y}_p = \sum_{i \in N_p} w_i y_i
where w_i = \frac{1/d(x_p, x_i)}{\sum_{j \in N_p} 1/d(x_p, x_j)} for distance weighting and w_i = \frac{1}{K} for equal weighting. For
classification, predict the class with the largest total weight among the K-nearest neighbors.
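The sketch below implements these steps from scratch for a regression problem with inverse-distance weighting; the synthetic data, K = 3, and the small epsilon that guards against division by zero are assumptions made for the example.

```python
import numpy as np

def knn_predict(X, y, x_p, K=3):
    """Predict the target for x_p as a weighted average of its K nearest neighbors."""
    # Step 1: compute the Euclidean distance from x_p to every training observation.
    dists = np.linalg.norm(X - x_p, axis=1)

    # Step 2: find the index set N_p of the K nearest neighbors.
    N_p = np.argsort(dists)[:K]

    # Step 3: inverse-distance weights, normalized to sum to 1
    # (the epsilon avoids division by zero when x_p exactly matches a neighbor).
    inv = 1.0 / (dists[N_p] + 1e-12)
    w = inv / inv.sum()

    # The prediction is the weighted average of the neighbors' targets.
    return np.dot(w, y[N_p])

# Made-up training data: y is the sum of the features plus a little noise.
rng = np.random.default_rng(1)
X = rng.uniform(size=(50, 2))
y = X.sum(axis=1) + rng.normal(scale=0.05, size=50)

print(knn_predict(X, y, x_p=np.array([0.4, 0.6])))   # should be close to 1.0
```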