kmeans
k-means clustering
Syntax
idx = kmeans(X,k)
idx = kmeans(X,k,Name,Value)
[idx,C] = kmeans( ___ )
[idx,C,sumd] = kmeans( ___ )
[idx,C,sumd,D] = kmeans( ___ )
Description
idx = kmeans(X,k) performs k-means clustering to partition the observations of the n-by-p data
matrix X into k clusters, and returns an n-by-1 vector (idx) containing cluster indices of each
observation. Rows of X correspond to points and columns correspond to variables.
By default, kmeans uses the squared Euclidean distance metric and the k-means++ algorithm for
cluster center initialization.
idx = kmeans(X,k,Name,Value) returns the cluster indices with additional options specified by one
or more Name,Value pair arguments.
For example, specify the cosine distance, the number of times to repeat the clustering using new initial
centroid positions, or whether to use parallel computing.
[idx,C] = kmeans( ___ ) returns the k cluster centroid locations in the k-by-p matrix C.
[idx,C,sumd] = kmeans( ___ ) returns the within-cluster sums of point-to-centroid distances in the
k-by-1 vector sumd.
[idx,C,sumd,D] = kmeans( ___ ) returns distances from each point to every centroid in the n-by-k
matrix D.
Examples
Cluster data using k-means clustering, then plot the cluster regions.
Load Fisher's iris data set. Use the petal lengths and widths as predictors.
load fisheriris
X = meas(:,3:4);
figure;
plot(X(:,1),X(:,2),'k*','MarkerSize',5);
title 'Fisher''s Iris Data';
xlabel 'Petal Lengths (cm)';
ylabel 'Petal Widths (cm)';
The larger cluster seems to be split into a lower variance region and a higher variance region. This might indicate
that the larger cluster is two overlapping clusters.
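Cluster the data. Specify k = 3 clusters. The clustering call itself is not shown on this page; a call consistent with the description below (the seed value is an illustrative choice for reproducibility) is:
rng(1); % For reproducibility (illustrative seed)
[idx,C] = kmeans(X,3);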
idx is a vector of predicted cluster indices corresponding to the observations in X. C is a 3-by-2 matrix containing
the final centroid locations.
Use kmeans to compute the distance from each centroid to points on a grid. To do this, pass the centroids (C)
and points on a grid to kmeans, and implement one iteration of the algorithm.
x1 = min(X(:,1)):0.01:max(X(:,1));
x2 = min(X(:,2)):0.01:max(X(:,2));
[x1G,x2G] = meshgrid(x1,x2);
XGrid = [x1G(:),x2G(:)]; % Defines a fine grid on the plot
idx2Region = kmeans(XGrid,3,'MaxIter',1,'Start',C);
kmeans displays a warning stating that the algorithm did not converge, which you should expect because the
software performs only one iteration.
figure;
gscatter(XGrid(:,1),XGrid(:,2),idx2Region,...
[0,0.75,0.75;0.75,0,0.75;0.75,0.75,0],'..');
hold on;
plot(X(:,1),X(:,2),'k*','MarkerSize',5);
title 'Fisher''s Iris Data';
xlabel 'Petal Lengths (cm)';
ylabel 'Petal Widths (cm)';
legend('Region 1','Region 2','Region 3','Data','Location','SouthEast');
hold off;
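The next example partitions a randomly generated data set using the city block distance. The generation code is not shown on this page; a setup consistent with the plots that follow (two overlapping Gaussian clouds) might be:
rng default % For reproducibility (assumed setup)
X = [randn(100,2)*0.75+ones(100,2);
    randn(100,2)*0.75-ones(100,2)]; % Two overlapping clusters (illustrative)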
figure;
plot(X(:,1),X(:,2),'.');
title 'Randomly Generated Data';
Partition the data into two clusters, and choose the best arrangement out of five initializations. Display the final
output.
opts = statset('Display','final');
[idx,C] = kmeans(X,2,'Distance','cityblock',...
'Replicates',5,'Options',opts);
figure;
plot(X(idx==1,1),X(idx==1,2),'r.','MarkerSize',12)
hold on
plot(X(idx==2,1),X(idx==2,2),'b.','MarkerSize',12)
plot(C(:,1),C(:,2),'kx',...
'MarkerSize',15,'LineWidth',3)
legend('Cluster 1','Cluster 2','Centroids',...
'Location','NW')
title 'Cluster Assignments and Centroids'
hold off
You can determine how well separated the clusters are by passing idx to silhouette.
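For example, a minimal sketch using the same city block metric:
figure
silhouette(X,idx,'cityblock') % Silhouette values near 1 indicate well-separated points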
Clustering large data sets might take time, particularly if you use online updates (set by default). If you have a
Parallel Computing Toolbox™ license and you set the options for parallel computing, then kmeans runs each
clustering task (or replicate) in parallel. And, if Replicates > 1, then parallel computing decreases time to
convergence.
This example uses: Parallel Computing Toolbox™, Statistics and Machine Learning Toolbox™.
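The data in this example come from a Gaussian mixture model Mdl. The model construction is not shown on this page; a representative definition (the specific means and covariance are assumptions) is:
Mu = bsxfun(@times,ones(20,30),(1:20)'); % 20 component means in 30 dimensions (assumed)
rn30 = randn(30,30);
Sigma = rn30'*rn30; % Symmetric, positive-definite shared covariance (assumed)
Mdl = gmdistribution(Mu,Sigma); % Define the Gaussian mixture model
rng(1); % For reproducibility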
X = random(Mdl,10000);
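Specify options for parallel computing. The setup code is not shown on this page; a configuration consistent with the following description would be:
stream = RandStream('mlfg6331_64'); % Random number stream
options = statset('UseParallel',1,'UseSubstreams',1,'Streams',stream);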
The input argument 'mlfg6331_64' of RandStream specifies to use the multiplicative lagged Fibonacci
generator algorithm. options is a structure array with fields that specify options for controlling estimation.
Cluster the data using k-means clustering. Specify that there are k = 20 clusters in the data and increase the
number of iterations. Typically, the objective function contains local minima. Specify 10 replicates to help find a
lower local minimum.
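A call consistent with this description (the exact MaxIter value is an assumption) is:
[idx,C,sumd] = kmeans(X,20,'Options',options,'MaxIter',10000, ...
    'Display','final','Replicates',10);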
The Command Window indicates that six workers are available. The number of workers might vary on your
system. The Command Window displays the number of iterations and the terminal objective function value for
each replicate. The output arguments contain the results of replicate 9 because it has the lowest total sum of
distances.
kmeans performs k-means clustering to partition data into k clusters. When you have a new data set to cluster,
you can create new clusters that include the existing data and the new data by using kmeans. The kmeans
function supports C/C++ code generation, so you can generate code that accepts training data and returns
clustering results, and then deploy the code to a device. In this workflow, you must pass training data, which can
be of considerable size. To save memory on the device, you can separate training and prediction by using
kmeans and pdist2, respectively.
This example uses: GPU Coder™, MATLAB® Coder™, Statistics and Machine Learning Toolbox™.
Use kmeans to create clusters in MATLAB® and use pdist2 in the generated
code to assign new data to existing clusters. For code generation, define an entry-point function that accepts the
cluster centroid positions and the new data set, and returns the index of the nearest cluster. Then, generate code
for the entry-point function.
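Generate the training data set. The generation code is not shown on this page; a setup that mirrors the test set created below (three Gaussian clouds) would be:
rng('default') % For reproducibility (assumed setup)
X = [randn(100,2)*0.75+ones(100,2);
    randn(100,2)*0.5-ones(100,2);
    randn(100,2)*0.75]; % Three clusters (illustrative)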
[idx,C] = kmeans(X,3);
figure
gscatter(X(:,1),X(:,2),idx,'bgm')
hold on
plot(C(:,1),C(:,2),'kx')
legend('Cluster 1','Cluster 2','Cluster 3','Cluster Centroid')
Generate a test data set that follows the same distribution as the training data.
Xtest = [randn(10,2)*0.75+ones(10,2);
randn(10,2)*0.5-ones(10,2);
randn(10,2)*0.75];
Classify the test data set using the existing clusters. Find the nearest centroid from each test data point by using
pdist2.
[~,idx_test] = pdist2(C,Xtest,'euclidean','Smallest',1);
Plot the test data and label the test data using idx_test by using gscatter.
gscatter(Xtest(:,1),Xtest(:,2),idx_test,'bgm','ooo')
legend('Cluster 1','Cluster 2','Cluster 3','Cluster Centroid', ...
'Data classified to Cluster 1','Data classified to Cluster 2', ...
'Data classified to Cluster 3')
Generate Code
Generate C code that assigns new data to the existing clusters. Note that generating C/C++ code requires
MATLAB® Coder™.
Define an entry-point function named findNearestCentroid that accepts centroid positions and new data, and
then finds the nearest cluster by using pdist2.
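A minimal version of this entry-point function, consistent with the pdist2 call shown earlier, looks like this (save it as findNearestCentroid.m):
function idx = findNearestCentroid(C,X) %#codegen
% Return the index of the nearest centroid (row of C) for each row of X
[~,idx] = pdist2(C,X,'euclidean','Smallest',1);
end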
Generate code by using codegen (MATLAB Coder). Because C and C++ are statically typed languages, you must
determine the properties of all variables in the entry-point function at compile time. To specify the data type and
array size of the inputs of findNearestCentroid, pass a MATLAB expression that represents the set of values
with a certain data type and array size by using the -args option. For details, see Specify Variable-Size
Arguments for Code Generation.
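For example, assuming C and Xtest are in the workspace:
codegen findNearestCentroid -args {C,Xtest}
Then verify that the generated MEX function returns the same indices as the original function: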
myIndx = findNearestCentroid(C,Xtest);
myIndex_mex = findNearestCentroid_mex(C,Xtest);
verifyMEX = isequal(idx_test,myIndx,myIndex_mex)
verifyMEX = logical
1
isequal returns logical 1 (true), which means all the inputs are equal. The comparison confirms that the
pdist2 function, the findNearestCentroid function, and the MEX function return the same index.
You can also generate optimized CUDA® code using GPU Coder™.
cfg = coder.gpuConfig('mex');
codegen -config cfg findNearestCentroid -args {C,Xtest}
Input Arguments
X — Data
numeric matrix
Data, specified as a numeric matrix. The rows of X correspond to observations, and the columns correspond to
variables.
If X is a numeric vector, then kmeans treats it as an n-by-1 data matrix, regardless of its orientation.
The software treats NaNs in X as missing data and removes any row of X that contains at least one NaN.
Removing rows of X reduces the sample size. The kmeans function returns NaN for the corresponding value in the
output argument idx.
k — Number of clusters
positive integer
Number of clusters in the data, specified as a positive integer.
Name-Value Arguments
Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and
Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the
pairs does not matter.
Before R2021a, use commas to separate each name and value, and enclose Name in quotes.
Display — Level of output to display
'off' (default) | 'final' | 'iter'
Level of output to display in the Command Window, specified as the comma-separated pair consisting of
'Display' and one of the following options.
'off' — Display nothing.
'final' — Display results of the final iteration.
'iter' — Display results of each iteration.
Example: 'Display','final'
Distance — Distance metric
'sqeuclidean' (default) | 'cityblock' | 'cosine' | 'correlation' | 'hamming'
Distance metric, in p-dimensional space, used for minimization, specified as the comma-separated pair
consisting of 'Distance' and 'sqeuclidean', 'cityblock', 'cosine', 'correlation', or 'hamming'.
kmeans computes cluster centroids differently for the supported distance metrics. This list summarizes the
available distance metrics. In the formulas, x is an observation (that is, a row of X) and c is a centroid (a row
vector).
'sqeuclidean' — Squared Euclidean distance (default). Each centroid is the mean of the points in that cluster.
'cityblock' — Sum of absolute differences, that is, the L1 distance. Each centroid is the component-wise median of the points in that cluster.
'cosine' — One minus the cosine of the included angle between points (treated as vectors). Each centroid is the mean of the points in that cluster, after normalizing those points to unit Euclidean length.
'correlation' — One minus the sample correlation between points (treated as sequences of values). Each centroid is the component-wise mean of the points in that cluster, after centering and normalizing those points to zero mean and unit standard deviation.
'hamming' — Suitable only for binary data. The distance is the proportion of bits that differ, $d(x,c) = \frac{1}{p}\sum_{j=1}^{p} I(x_j \neq c_j)$, where $I$ is the indicator function. Each centroid is the component-wise median of points in that cluster.
Example: 'Distance','cityblock'
EmptyAction — Action to take if cluster loses all member observations
'singleton' (default) | 'error' | 'drop'
Action to take if a cluster loses all its member observations, specified as the comma-separated pair consisting
of 'EmptyAction' and one of the following options.
'singleton' (default) — Create a new cluster consisting of the one point furthest from its centroid.
'error' — Treat an empty cluster as an error.
'drop' — Remove any clusters that become empty. kmeans sets the corresponding return values in C and D to NaN.
Example: 'EmptyAction','error'
MaxIter — Maximum number of iterations
100 (default) | positive integer
Maximum number of iterations, specified as the comma-separated pair consisting of 'MaxIter' and a positive
integer.
Example: 'MaxIter',1000
OnlinePhase — Online update flag
'off' (default) | 'on'
Online update flag, specified as the comma-separated pair consisting of 'OnlinePhase' and 'off' or 'on'. If
OnlinePhase is 'on', then kmeans performs an online update phase in addition to a batch update phase (see
Algorithms).
Example: 'OnlinePhase','on'
Options — Options for controlling iterative algorithm for minimizing fitting criteria
[] (default) | structure array returned by statset
Options for controlling the iterative algorithm for minimizing the fitting criteria, specified as the comma-separated
pair consisting of 'Options' and a structure array returned by statset. Supported fields of the structure array
specify options for controlling the iterative algorithm.
This table summarizes the supported fields. Note that the supported fields require Parallel Computing Toolbox™.
UseParallel — If true and Replicates > 1, then compute the replicates in parallel. The default is false.
UseSubstreams — Set to true to compute in parallel in a reproducible fashion. The default is false. To compute reproducibly, set Streams to a type allowing substreams: 'mlfg6331_64' or 'mrg32k3a'.
Streams — A RandStream object or cell array of such objects. If you do not specify Streams, then kmeans uses the default stream or streams. If you specify Streams, use a single object except when all of the following conditions exist: you have an open parallel pool, UseParallel is true, and UseSubstreams is false. In this case, use a cell array the same size as the parallel pool. If a parallel pool is not open, then Streams must supply a single random number stream.
To ensure more predictable results, use parpool (Parallel Computing Toolbox) and explicitly create a parallel
pool before invoking kmeans and setting 'Options',statset('UseParallel',1).
Example: 'Options',statset('UseParallel',1)
Replicates — Number of times to repeat clustering using new initial cluster centroid positions
1 (default) | positive integer
Number of times to repeat clustering using new initial cluster centroid positions, specified as the comma-
separated pair consisting of 'Replicates' and a positive integer. kmeans returns the solution with the lowest
sumd.
You can set 'Replicates' implicitly by supplying a 3-D array as the value for the 'Start' name-value pair
argument.
Example: 'Replicates',5
Start — Method for choosing initial cluster centroid positions
'plus' (default) | 'cluster' | 'sample' | 'uniform' | numeric matrix | numeric array
Method for choosing initial cluster centroid positions (or seeds), specified as the comma-separated pair
consisting of 'Start' and 'cluster', 'plus', 'sample', 'uniform', a numeric matrix, or a numeric array. This
list summarizes the available options for choosing seeds.
'plus' (default) — Select k seeds by implementing the k-means++ algorithm for cluster center initialization.
'cluster' — Perform a preliminary clustering phase on a random 10% subsample of X when the number of observations in the subsample is greater than k. This preliminary phase is itself initialized using 'sample'. If the number of observations in the subsample is less than k, then the software selects k observations from X at random.
'sample' — Select k observations from X at random.
'uniform' — Select k points uniformly at random from the range of X. This option is not valid with the Hamming distance.
numeric matrix — k-by-p matrix of centroid starting locations. In this case, you can pass in [] for the k input argument, and kmeans infers k from the first dimension of the matrix.
numeric array — k-by-p-by-r array of centroid starting locations. The rows of each page indicate starting locations, and r is the number of replicates.
Example: 'Start','sample'
Output Arguments
idx — Cluster indices
numeric column vector
Cluster indices, returned as a numeric column vector. idx has as many rows as X, and each row indicates the
cluster assignment of the corresponding observation.
C — Cluster centroid locations
numeric matrix
Cluster centroid locations, returned as a numeric matrix. C is a k-by-p matrix, where row j is the centroid of cluster
j. The location of a centroid depends on the distance metric specified by the Distance name-value argument.
sumd — Within-cluster sums of point-to-centroid distances
numeric column vector
Within-cluster sums of point-to-centroid distances, returned as a numeric column vector. sumd is a k-by-1 vector,
where element j is the sum of point-to-centroid distances within cluster j. By default, kmeans uses the squared
Euclidean distance (see 'Distance' metrics).
D — Distances from each point to every centroid
numeric matrix
Distances from each point to every centroid, returned as a numeric matrix. D is an n-by-k matrix, where element
(j,m) is the distance from observation j to centroid m. By default, kmeans uses the squared Euclidean distance
(see 'Distance' metrics).
More About
k-Means Clustering
k-means clustering, or Lloyd’s algorithm [2], is an iterative, data-partitioning algorithm that assigns n observations
to exactly one of k clusters defined by centroids, where k is chosen before the algorithm starts. The algorithm
proceeds as follows:
1. Choose k initial cluster centers (centroids). For example, choose k observations at random (by using
'Start','sample') or use the k-means++ algorithm for cluster center initialization (the default).
2. Compute point-to-cluster-centroid distances of all observations to each centroid.
3. There are two ways to proceed (specified by OnlinePhase):
Batch update — Assign each observation to the cluster with the closest centroid.
Online update — Individually assign observations to a different centroid if the reassignment decreases
the sum of the within-cluster, sum-of-squares point-to-cluster-centroid distances.
4. Compute the average of the observations in each cluster to obtain k new centroid locations.
5. Repeat steps 2 through 4 until cluster assignments do not change, or the maximum number of iterations is
reached.
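The following minimal sketch of the batch phase (squared Euclidean distance, random sample initialization, no empty-cluster handling) illustrates steps 1 through 5. It is an illustration only, not the kmeans implementation:
function [idx,C] = lloydSketch(X,k,maxIter)
% Minimal batch k-means (squared Euclidean distance), for illustration only
C = X(randperm(size(X,1),k),:); % Step 1: choose k observations as seeds
for iter = 1:maxIter
    D = pdist2(X,C).^2; % Step 2: squared distances to each centroid
    [~,idx] = min(D,[],2); % Step 3 (batch update): assign to closest centroid
    Cnew = C;
    for j = 1:k % Step 4: recompute each centroid as the cluster mean
        if any(idx == j)
            Cnew(j,:) = mean(X(idx == j,:),1);
        end
    end
    if isequal(Cnew,C) % Step 5: stop when centroids no longer change
        break
    end
    C = Cnew;
end
end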
k-means++ Algorithm
The k-means++ algorithm uses a heuristic to find centroid seeds for k-means clustering. According to Arthur
and Vassilvitskii [1], k-means++ improves both the running time of Lloyd’s algorithm and the quality of the final
solution.
The k-means++ algorithm chooses seeds as follows, assuming the number of clusters is k.
1. Select an observation uniformly at random from the data set, X. The chosen observation is the first centroid,
and is denoted $c_1$.
2. Compute distances from each observation to $c_1$. Denote the distance between $c_j$ and the observation
$x_m$ as $d(x_m, c_j)$.
3. Select the next centroid, $c_2$, at random from X with probability
$$\frac{d^2(x_m, c_1)}{\sum_{j=1}^{n} d^2(x_j, c_1)}.$$
4. To choose center j:
a. Compute the distances from each observation to each centroid, and assign each observation to its
closest centroid.
b. For $m = 1,\ldots,n$ and $p = 1,\ldots,j-1$, select centroid $j$ at random from X with probability
$$\frac{d^2(x_m, c_p)}{\sum_{\{h;\, x_h \in C_p\}} d^2(x_h, c_p)},$$
where $C_p$ is the set of all observations closest to centroid $c_p$, and $x_m$ belongs to $C_p$.
That is, select each subsequent center with a probability proportional to the distance from itself to the
closest center that you already chose.
5. Repeat step 4 until k centroids are chosen.
Arthur and Vassilvitskii [1] demonstrate, using a simulation study for several cluster orientations, that k-means++
achieves faster convergence to a lower sum of within-cluster, sum-of-squares point-to-cluster-centroid
distances than Lloyd’s algorithm.
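A compact sketch of this seeding procedure (squared Euclidean distance, for illustration only; not the kmeans implementation):
function C = kmeansppSeeds(X,k)
% k-means++ seeding sketch: sample each new seed with probability
% proportional to its squared distance to the nearest seed chosen so far
n = size(X,1);
C = X(randi(n),:); % Step 1: first seed, chosen uniformly at random
for j = 2:k
    D2 = min(pdist2(X,C).^2,[],2); % Squared distance to the nearest seed
    m = find(rand <= cumsum(D2/sum(D2)),1); % Steps 2-4: weighted draw
    C(j,:) = X(m,:);
end
end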
Algorithms
kmeans uses a two-phase iterative algorithm to minimize the sum of point-to-centroid distances, summed over all k
clusters.
1. The first phase uses batch updates, where each iteration consists of reassigning points to their nearest
cluster centroid, all at once, followed by recalculation of cluster centroids. This phase occasionally does not
converge to a solution that is a local minimum, that is, a partition of the data where moving any single point to a
different cluster increases the total sum of distances. This outcome is more likely for small data sets. The batch
phase is fast, but it potentially only approximates a solution as a starting point for the second phase.
2. The second phase uses online updates, where points are individually reassigned if doing so reduces the sum
of distances, and cluster centroids are recomputed after each reassignment. Each iteration during this phase
consists of one pass through all the points. This phase converges to a local minimum, although there might be
other local minima with lower total sum of distances. In general, you can find the global minimum only through an
exhaustive choice of starting points, but using several replicates with random starting points typically results in
a solution that is a global minimum.
If Replicates = r > 1 and Start is plus (the default), then the software selects r possibly different sets of seeds
according to the k-means++ algorithm.
References
[1] Arthur, David, and Sergei Vassilvitskii. “k-means++: The Advantages of Careful Seeding.” SODA ’07: Proceedings of
the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms. 2007, pp. 1027–1035.
[2] Lloyd, Stuart P. “Least Squares Quantization in PCM.” IEEE Transactions on Information Theory. Vol. 28, 1982, pp.
129–137.
[3] Seber, G. A. F. Multivariate Observations. Hoboken, NJ: John Wiley & Sons, Inc., 1984.
[4] Spath, H. Cluster Dissection and Analysis: Theory, FORTRAN Programs, Examples. Translated by J. Goldschmidt.
New York: Halsted Press, 1985.
Extended Capabilities
Tall Arrays
Calculate with arrays that have more rows than fit in memory.
GPU Arrays
Accelerate code by running on a graphics processing unit (GPU) using Parallel Computing
Toolbox™.
Version History
Introduced before R2006a
See Also
linkage | clusterdata | silhouette | parpool (Parallel Computing Toolbox) | statset | gmdistribution |
kmedoids
Topics
Compare k-Means Clustering Solutions
Introduction to k-Means Clustering