kmeans
k-means clustering
Syntax
idx = kmeans(X,k)
idx = kmeans(X,k,Name,Value)
[idx,C] = kmeans( ___ )
[idx,C,sumd] = kmeans( ___ )
[idx,C,sumd,D] = kmeans( ___ )
Description
idx = kmeans(X,k) performs k-means clustering to partition the observations of the n-by-p data
matrix X into k clusters, and returns an n-by-1 vector (idx) containing cluster indices of each
observation. Rows of X correspond to points and columns correspond to variables.
By default, kmeans uses the squared Euclidean distance metric and the k-means++ algorithm for
cluster center initialization.
idx = kmeans(X,k,Name,Value) returns the cluster indices with additional options specified by one
or more Name,Value pair arguments.
For example, specify the cosine distance, the number of times to repeat the clustering using new initial
centroid positions, or whether to use parallel computing.
[idx,C] = kmeans( ___ ) returns the k cluster centroid locations in the k-by-p matrix C.
[idx,C,sumd] = kmeans( ___ ) returns the within-cluster sums of point-to-centroid distances in the
k-by-1 vector sumd.
[idx,C,sumd,D] = kmeans( ___ ) returns distances from each point to every centroid in the n-by-k
matrix D.
Examples
Cluster data using k-means clustering, then plot the cluster regions.
Load Fisher's iris data set. Use the petal lengths and widths as predictors.
load fisheriris
X = meas(:,3:4);
figure;
plot(X(:,1),X(:,2),'k*','MarkerSize',5);
title 'Fisher''s Iris Data';
xlabel 'Petal Lengths (cm)';
ylabel 'Petal Widths (cm)';
The larger cluster seems to be split into a lower variance region and a higher variance region. This might indicate
that the larger cluster is two overlapping clusters.
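Cluster the data. Specify k = 3 clusters. The clustering call itself is not shown on this page; a call consistent with the description below (the seed value is an illustrative choice for reproducibility) is:
rng(1); % For reproducibility (illustrative seed)
[idx,C] = kmeans(X,3);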
idx is a vector of predicted cluster indices corresponding to the observations in X. C is a 3-by-2 matrix containing
the final centroid locations.
Use kmeans to compute the distance from each centroid to points on a grid. To do this, pass the centroids (C)
and points on a grid to kmeans, and implement one iteration of the algorithm.
x1 = min(X(:,1)):0.01:max(X(:,1));
x2 = min(X(:,2)):0.01:max(X(:,2));
[x1G,x2G] = meshgrid(x1,x2);
XGrid = [x1G(:),x2G(:)]; % Defines a fine grid on the plot
idx2Region = kmeans(XGrid,3,'MaxIter',1,'Start',C);
kmeans displays a warning stating that the algorithm did not converge, which you should expect because the
software performs only one iteration.
figure;
gscatter(XGrid(:,1),XGrid(:,2),idx2Region,...
[0,0.75,0.75;0.75,0,0.75;0.75,0.75,0],'..');
hold on;
plot(X(:,1),X(:,2),'k*','MarkerSize',5);
title 'Fisher''s Iris Data';
xlabel 'Petal Lengths (cm)';
ylabel 'Petal Widths (cm)';
legend('Region 1','Region 2','Region 3','Data','Location','SouthEast');
hold off;
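The next example partitions a randomly generated data set using the city block distance. The generation code is not shown on this page; a setup consistent with the plots that follow (two overlapping Gaussian clouds) might be:
rng default % For reproducibility (assumed setup)
X = [randn(100,2)*0.75+ones(100,2);
    randn(100,2)*0.75-ones(100,2)]; % Two overlapping clusters (illustrative)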
figure;
plot(X(:,1),X(:,2),'.');
title 'Randomly Generated Data';
Partition the data into two clusters, and choose the best arrangement out of five initializations. Display the final
output.
opts = statset('Display','final');
[idx,C] = kmeans(X,2,'Distance','cityblock',...
'Replicates',5,'Options',opts);
figure;
plot(X(idx==1,1),X(idx==1,2),'r.','MarkerSize',12)
hold on
plot(X(idx==2,1),X(idx==2,2),'b.','MarkerSize',12)
plot(C(:,1),C(:,2),'kx',...
'MarkerSize',15,'LineWidth',3)
legend('Cluster 1','Cluster 2','Centroids',...
'Location','NW')
title 'Cluster Assignments and Centroids'
hold off
You can determine how well separated the clusters are by passing idx to silhouette.
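For example, a minimal sketch using the same city block metric:
figure
silhouette(X,idx,'cityblock') % Silhouette values near 1 indicate well-separated points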
Clustering large data sets might take time, particularly if you use online updates (set by default). If you have a
Parallel Computing Toolbox™ license and you set the options for parallel computing, then kmeans runs each
clustering task (or replicate) in parallel. And, if Replicates > 1, then parallel computing decreases time to
convergence.
This example uses: Parallel Computing Toolbox™, Statistics and Machine Learning Toolbox™.
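The data in this example come from a Gaussian mixture model Mdl. The model construction is not shown on this page; a representative definition (the specific means and covariance are assumptions) is:
Mu = bsxfun(@times,ones(20,30),(1:20)'); % 20 component means in 30 dimensions (assumed)
rn30 = randn(30,30);
Sigma = rn30'*rn30; % Symmetric, positive-definite shared covariance (assumed)
Mdl = gmdistribution(Mu,Sigma); % Define the Gaussian mixture model
rng(1); % For reproducibility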
X = random(Mdl,10000);
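Specify options for parallel computing. The setup code is not shown on this page; a configuration consistent with the following description would be:
stream = RandStream('mlfg6331_64'); % Random number stream
options = statset('UseParallel',1,'UseSubstreams',1,'Streams',stream);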
The input argument 'mlfg6331_64' of RandStream specifies to use the multiplicative lagged Fibonacci
generator algorithm. options is a structure array with fields that specify options for controlling estimation.
Cluster the data using k-means clustering. Specify that there are k = 20 clusters in the data and increase the
number of iterations. Typically, the objective function contains local minima. Specify 10 replicates to help find a
lower local minimum.
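A call consistent with this description (the exact MaxIter value is an assumption) is:
[idx,C,sumd] = kmeans(X,20,'Options',options,'MaxIter',10000, ...
    'Display','final','Replicates',10);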
The Command Window indicates that six workers are available. The number of workers might vary on your
system. The Command Window displays the number of iterations and the terminal objective function value for
each replicate. The output arguments contain the results of replicate 9 because it has the lowest total sum of
distances.
kmeans performs k-means clustering to partition data into k clusters. When you have a new data set to cluster,
you can create new clusters that include the existing data and the new data by using kmeans. The kmeans
function supports C/C++ code generation, so you can generate code that accepts training data and returns
clustering results, and then deploy the code to a device. In this workflow, you must pass training data, which can
be of considerable size. To save memory on the device, you can separate training and prediction by using
kmeans and pdist2, respectively.
This example uses: GPU Coder™, MATLAB® Coder™, Statistics and Machine Learning Toolbox™.
Use kmeans to create clusters in MATLAB® and use pdist2 in the generated
code to assign new data to existing clusters. For code generation, define an entry-point function that accepts the
cluster centroid positions and the new data set, and returns the index of the nearest cluster. Then, generate code
for the entry-point function.
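Generate the training data set. The generation code is not shown on this page; a setup that mirrors the test set created below (three Gaussian clouds) would be:
rng('default') % For reproducibility (assumed setup)
X = [randn(100,2)*0.75+ones(100,2);
    randn(100,2)*0.5-ones(100,2);
    randn(100,2)*0.75]; % Three clusters (illustrative)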
[idx,C] = kmeans(X,3);
figure
gscatter(X(:,1),X(:,2),idx,'bgm')
hold on
plot(C(:,1),C(:,2),'kx')
legend('Cluster 1','Cluster 2','Cluster 3','Cluster Centroid')
Generate a test data set that follows the same distribution as the training data.
Xtest = [randn(10,2)*0.75+ones(10,2);
randn(10,2)*0.5-ones(10,2);
randn(10,2)*0.75];
Classify the test data set using the existing clusters. Find the nearest centroid from each test data point by using
pdist2.
[~,idx_test] = pdist2(C,Xtest,'euclidean','Smallest',1);
Plot the test data and label the test data using idx_test by using gscatter.
gscatter(Xtest(:,1),Xtest(:,2),idx_test,'bgm','ooo')
legend('Cluster 1','Cluster 2','Cluster 3','Cluster Centroid', ...
'Data classified to Cluster 1','Data classified to Cluster 2', ...
'Data classified to Cluster 3')
Generate Code
Generate C code that assigns new data to the existing clusters. Note that generating C/C++ code requires
MATLAB® Coder™.
Define an entry-point function named findNearestCentroid that accepts centroid positions and new data, and
then finds the nearest cluster by using pdist2.
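A minimal version of this entry-point function, consistent with the pdist2 call shown earlier, looks like this (save it as findNearestCentroid.m):
function idx = findNearestCentroid(C,X) %#codegen
% Return the index of the nearest centroid (row of C) for each row of X
[~,idx] = pdist2(C,X,'euclidean','Smallest',1);
end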
Generate code by using codegen (MATLAB Coder). Because C and C++ are statically typed languages, you must
determine the properties of all variables in the entry-point function at compile time. To specify the data type and
array size of the inputs of findNearestCentroid, pass a MATLAB expression that represents the set of values
with a certain data type and array size by using the -args option. For details, see Specify Variable-Size
Arguments for Code Generation.
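For example, assuming C and Xtest are in the workspace:
codegen findNearestCentroid -args {C,Xtest}
Then verify that the generated MEX function returns the same indices as the original function: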
myIndx = findNearestCentroid(C,Xtest);
myIndex_mex = findNearestCentroid_mex(C,Xtest);
verifyMEX = isequal(idx_test,myIndx,myIndex_mex)
verifyMEX = logical
1
isequal returns logical 1 (true), which means all the inputs are equal. The comparison confirms that the
pdist2 function, the findNearestCentroid function, and the MEX function return the same index.
You can also generate optimized CUDA® code using GPU Coder™.
cfg = coder.gpuConfig('mex');
codegen -config cfg findNearestCentroid -args {C,Xtest}
Input Arguments
X — Data
numeric matrix
Data, specified as a numeric matrix. The rows of X correspond to observations, and the columns correspond to
variables.
If X is a numeric vector, then kmeans treats it as an n-by-1 data matrix, regardless of its orientation.
The software treats NaNs in X as missing data and removes any row of X that contains at least one NaN.
Removing rows of X reduces the sample size. The kmeans function returns NaN for the corresponding value in the
output argument idx.
k — Number of clusters
positive integer
Number of clusters in the data, specified as a positive integer.
Name-Value Arguments
Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and
Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the
pairs does not matter.
Before R2021a, use commas to separate each name and value, and enclose Name in quotes.
Display — Level of output to display
'off' (default) | 'final' | 'iter'
Level of output to display in the Command Window, specified as the comma-separated pair consisting of
'Display' and one of the following options.
'off' — Display nothing.
'final' — Display results of the final iteration.
'iter' — Display results of each iteration.
Example: 'Display','final'
Distance — Distance metric
'sqeuclidean' (default) | 'cityblock' | 'cosine' | 'correlation' | 'hamming'
Distance metric, in p-dimensional space, used for minimization, specified as the comma-separated pair
consisting of 'Distance' and 'sqeuclidean', 'cityblock', 'cosine', 'correlation', or 'hamming'.
kmeans computes cluster centroids differently for the supported distance metrics. This list summarizes the
available distance metrics. In the formulas, x is an observation (that is, a row of X) and c is a centroid (a row
vector).
'sqeuclidean' — Squared Euclidean distance (default). Each centroid is the mean of the points in that cluster.
'cityblock' — Sum of absolute differences, that is, the L1 distance. Each centroid is the component-wise median of the points in that cluster.
'cosine' — One minus the cosine of the included angle between points (treated as vectors). Each centroid is the mean of the points in that cluster, after normalizing those points to unit Euclidean length.
'correlation' — One minus the sample correlation between points (treated as sequences of values). Each centroid is the component-wise mean of the points in that cluster, after centering and normalizing those points to zero mean and unit standard deviation.
'hamming' — Suitable only for binary data. The distance is the proportion of bits that differ, $d(x,c) = \frac{1}{p}\sum_{j=1}^{p} I(x_j \neq c_j)$, where $I$ is the indicator function. Each centroid is the component-wise median of points in that cluster.
Example: 'Distance','cityblock'
EmptyAction — Action to take if cluster loses all member observations
'singleton' (default) | 'error' | 'drop'
Action to take if a cluster loses all its member observations, specified as the comma-separated pair consisting
of 'EmptyAction' and one of the following options.
'singleton' (default) — Create a new cluster consisting of the one point furthest from its centroid.
'error' — Treat an empty cluster as an error.
'drop' — Remove any clusters that become empty. kmeans sets the corresponding return values in C and D to NaN.
Example: 'EmptyAction','error'
MaxIter — Maximum number of iterations
100 (default) | positive integer
Maximum number of iterations, specified as the comma-separated pair consisting of 'MaxIter' and a positive
integer.
Example: 'MaxIter',1000
OnlinePhase — Online update flag
'off' (default) | 'on'
Online update flag, specified as the comma-separated pair consisting of 'OnlinePhase' and 'off' or 'on'. If
OnlinePhase is 'on', then kmeans performs an online update phase in addition to a batch update phase (see
Algorithms).
Example: 'OnlinePhase','on'
Options — Options for controlling iterative algorithm for minimizing fitting criteria
[] (default) | structure array returned by statset
Options for controlling the iterative algorithm for minimizing the fitting criteria, specified as the comma-separated
pair consisting of 'Options' and a structure array returned by statset. Supported fields of the structure array
specify options for controlling the iterative algorithm.
This table summarizes the supported fields. Note that the supported fields require Parallel Computing Toolbox™.
UseParallel — If true and Replicates > 1, then compute the replicates in parallel. The default is false.
UseSubstreams — Set to true to compute in parallel in a reproducible fashion. The default is false. To compute reproducibly, set Streams to a type allowing substreams: 'mlfg6331_64' or 'mrg32k3a'.
Streams — A RandStream object or cell array of such objects. If you do not specify Streams, then kmeans uses the default stream or streams. If you specify Streams, use a single object except when all of the following conditions exist: you have an open parallel pool, UseParallel is true, and UseSubstreams is false. In this case, use a cell array the same size as the parallel pool. If a parallel pool is not open, then Streams must supply a single random number stream.
To ensure more predictable results, use parpool (Parallel Computing Toolbox) and explicitly create a parallel
pool before invoking kmeans and setting 'Options',statset('UseParallel',1).
Example: 'Options',statset('UseParallel',1)
Replicates — Number of times to repeat clustering using new initial cluster centroid positions
1 (default) | positive integer
Number of times to repeat clustering using new initial cluster centroid positions, specified as the comma-
separated pair consisting of 'Replicates' and a positive integer. kmeans returns the solution with the lowest
sumd.
You can set 'Replicates' implicitly by supplying a 3-D array as the value for the 'Start' name-value pair
argument.
Example: 'Replicates',5
Start — Method for choosing initial cluster centroid positions
'plus' (default) | 'cluster' | 'sample' | 'uniform' | numeric matrix | numeric array
Method for choosing initial cluster centroid positions (or seeds), specified as the comma-separated pair
consisting of 'Start' and 'cluster', 'plus', 'sample', 'uniform', a numeric matrix, or a numeric array. This
list summarizes the available options for choosing seeds.
'plus' (default) — Select k seeds by implementing the k-means++ algorithm for cluster center initialization.
'cluster' — Perform a preliminary clustering phase on a random 10% subsample of X when the number of observations in the subsample is greater than k. This preliminary phase is itself initialized using 'sample'. If the number of observations in the subsample is less than k, then the software selects k observations from X at random.
'sample' — Select k observations from X at random.
'uniform' — Select k points uniformly at random from the range of X. This option is not valid with the Hamming distance.
numeric matrix — k-by-p matrix of centroid starting locations. In this case, you can pass in [] for the k input argument, and kmeans infers k from the first dimension of the matrix.
numeric array — k-by-p-by-r array of centroid starting locations. The rows of each page indicate starting locations, and r is the number of replicates.
Example: 'Start','sample'
Output Arguments
idx — Cluster indices
numeric column vector
Cluster indices, returned as a numeric column vector. idx has as many rows as X, and each row indicates the
cluster assignment of the corresponding observation.
C — Cluster centroid locations
numeric matrix
Cluster centroid locations, returned as a numeric matrix. C is a k-by-p matrix, where row j is the centroid of cluster
j. The location of a centroid depends on the distance metric specified by the Distance name-value argument.
sumd — Within-cluster sums of point-to-centroid distances
numeric column vector
Within-cluster sums of point-to-centroid distances, returned as a numeric column vector. sumd is a k-by-1 vector,
where element j is the sum of point-to-centroid distances within cluster j. By default, kmeans uses the squared
Euclidean distance (see 'Distance' metrics).
D — Distances from each point to every centroid
numeric matrix
Distances from each point to every centroid, returned as a numeric matrix. D is an n-by-k matrix, where element
(j,m) is the distance from observation j to centroid m. By default, kmeans uses the squared Euclidean distance
(see 'Distance' metrics).
More About
k-Means Clustering
k-means clustering, or Lloyd’s algorithm [2], is an iterative, data-partitioning algorithm that assigns n observations
to exactly one of k clusters defined by centroids, where k is chosen before the algorithm starts. The algorithm
proceeds as follows:
1. Choose k initial cluster centers (centroids). For example, choose k observations at random (by using
'Start','sample') or use the k-means++ algorithm for cluster center initialization (the default).
2. Compute point-to-cluster-centroid distances of all observations to each centroid.
3. There are two ways to proceed (specified by OnlinePhase):
Batch update — Assign each observation to the cluster with the closest centroid.
Online update — Individually assign observations to a different centroid if the reassignment decreases
the sum of the within-cluster, sum-of-squares point-to-cluster-centroid distances.
4. Compute the average of the observations in each cluster to obtain k new centroid locations.
5. Repeat steps 2 through 4 until cluster assignments do not change, or the maximum number of iterations is
reached.
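The following minimal sketch of the batch phase (squared Euclidean distance, random sample initialization, no empty-cluster handling) illustrates steps 1 through 5. It is an illustration only, not the kmeans implementation:
function [idx,C] = lloydSketch(X,k,maxIter)
% Minimal batch k-means (squared Euclidean distance), for illustration only
C = X(randperm(size(X,1),k),:); % Step 1: choose k observations as seeds
for iter = 1:maxIter
    D = pdist2(X,C).^2; % Step 2: squared distances to each centroid
    [~,idx] = min(D,[],2); % Step 3 (batch update): assign to closest centroid
    Cnew = C;
    for j = 1:k % Step 4: recompute each centroid as the cluster mean
        if any(idx == j)
            Cnew(j,:) = mean(X(idx == j,:),1);
        end
    end
    if isequal(Cnew,C) % Step 5: stop when centroids no longer change
        break
    end
    C = Cnew;
end
end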
k-means++ Algorithm
The k-means++ algorithm uses a heuristic to find centroid seeds for k-means clustering. According to Arthur
and Vassilvitskii [1], k-means++ improves both the running time of Lloyd’s algorithm and the quality of the final
solution.
The k-means++ algorithm chooses seeds as follows, assuming the number of clusters is k.
1. Select an observation uniformly at random from the data set, X. The chosen observation is the first centroid,
and is denoted $c_1$.
2. Compute distances from each observation to $c_1$. Denote the distance between $c_j$ and the observation
$x_m$ as $d(x_m, c_j)$.
3. Select the next centroid, $c_2$, at random from X with probability
$$\frac{d^2(x_m, c_1)}{\sum_{j=1}^{n} d^2(x_j, c_1)}.$$
4. To choose center j:
a. Compute the distances from each observation to each centroid, and assign each observation to its
closest centroid.
b. For $m = 1,\ldots,n$ and $p = 1,\ldots,j-1$, select centroid $j$ at random from X with probability
$$\frac{d^2(x_m, c_p)}{\sum_{\{h;\, x_h \in C_p\}} d^2(x_h, c_p)},$$
where $C_p$ is the set of all observations closest to centroid $c_p$, and $x_m$ belongs to $C_p$.
That is, select each subsequent center with a probability proportional to the distance from itself to the
closest center that you already chose.
5. Repeat step 4 until k centroids are chosen.
Arthur and Vassilvitskii [1] demonstrate, using a simulation study for several cluster orientations, that k-means++
achieves faster convergence to a lower sum of within-cluster, sum-of-squares point-to-cluster-centroid
distances than Lloyd’s algorithm.
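A compact sketch of this seeding procedure (squared Euclidean distance, for illustration only; not the kmeans implementation):
function C = kmeansppSeeds(X,k)
% k-means++ seeding sketch: sample each new seed with probability
% proportional to its squared distance to the nearest seed chosen so far
n = size(X,1);
C = X(randi(n),:); % Step 1: first seed, chosen uniformly at random
for j = 2:k
    D2 = min(pdist2(X,C).^2,[],2); % Squared distance to the nearest seed
    m = find(rand <= cumsum(D2/sum(D2)),1); % Steps 2-4: weighted draw
    C(j,:) = X(m,:);
end
end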
Algorithms
kmeans uses a two-phase iterative algorithm to minimize the sum of point-to-centroid distances, summed over all k
clusters.
1. The first phase uses batch updates, where each iteration consists of reassigning points to their nearest
cluster centroid, all at once, followed by recalculation of cluster centroids. This phase occasionally does not
converge to a solution that is a local minimum, that is, a partition of the data where moving any single point to a
different cluster increases the total sum of distances. This outcome is more likely for small data sets. The batch
phase is fast, but it potentially only approximates a solution as a starting point for the second phase.
2. The second phase uses online updates, where points are individually reassigned if doing so reduces the sum
of distances, and cluster centroids are recomputed after each reassignment. Each iteration during this phase
consists of one pass through all the points. This phase converges to a local minimum, although there might be
other local minima with lower total sum of distances. In general, you can find the global minimum only through an
exhaustive choice of starting points, but using several replicates with random starting points typically results in
a solution that is a global minimum.
If Replicates = r > 1 and Start is plus (the default), then the software selects r possibly different sets of seeds
according to the k-means++ algorithm.
References
[1] Arthur, David, and Sergei Vassilvitskii. “k-means++: The Advantages of Careful Seeding.” SODA ’07: Proceedings of
the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms. 2007, pp. 1027–1035.
[2] Lloyd, Stuart P. “Least Squares Quantization in PCM.” IEEE Transactions on Information Theory. Vol. 28, 1982, pp.
129–137.
[3] Seber, G. A. F. Multivariate Observations. Hoboken, NJ: John Wiley & Sons, Inc., 1984.
[4] Spath, H. Cluster Dissection and Analysis: Theory, FORTRAN Programs, Examples. Translated by J. Goldschmidt.
New York: Halsted Press, 1985.
Extended Capabilities
Tall Arrays
Calculate with arrays that have more rows than fit in memory.
GPU Arrays
Accelerate code by running on a graphics processing unit (GPU) using Parallel Computing
Toolbox™.
Version History
Introduced before R2006a
See Also
linkage | clusterdata | silhouette | parpool (Parallel Computing Toolbox) | statset | gmdistribution |
kmedoids
Topics
Compare k-Means Clustering Solutions
Introduction to k-Means Clustering