A Survey on Cluster-based Federated Learning
As the industrial and commercial use of Federated Learning (FL) has expanded, so has the need for optimized algorithms.
In settings where FL clients' data is non-independently and identically distributed (non-IID) and highly heterogeneous,
the baseline FL approach falls short. To tackle this issue, recent studies have looked into personalized FL (PFL),
which relaxes the implicit single-model constraint and allows multiple hypotheses to be learned from the data or local
models. Among PFL approaches, cluster-based solutions (CFL) are particularly interesting whenever it is clear, through
domain knowledge, that the clients can be separated into groups. In this paper, we study recent works on CFL, proposing:
i) a classification of CFL solutions for personalization; ii) a structured review of the literature; iii) a review of
alternative use cases for CFL.
CCS Concepts: • General and reference → Surveys and overviews; • Computing methodologies → Machine
learning; • Information systems → Clustering; • Security and privacy → Privacy-preserving protocols.
Additional Key Words and Phrases: Federated Learning, Clustered Federated Learning, non-IID
1 INTRODUCTION
Federated Learning (FL) was designed to train a global federated model on a network of devices without
sharing raw data across different client devices (see Figure 1). The original assumptions of FL are that devices
Authors’ addresses: Omar El-Rifai, [email protected], Université Toulouse, UT3, IRIT, CNRS, Toulouse, France; Michael Ben Ali,
[email protected], Université Toulouse, UT3, IRIT, CNRS, Toulouse, France; Imen Megdiche, [email protected], INU
Champollion, ISIS Castres, IRIT, CNRS, Castres, France; André Peninou, [email protected], Université Toulouse, UT2J, IRIT, CNRS,
Toulouse, France; Olivier Teste, [email protected], Université Toulouse, UT2J, IRIT, CNRS, Toulouse, France.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that
copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the
first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is
permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
Request permissions from [email protected].
© 2025 Copyright held by the owner/author(s). Publication rights licensed to ACM.
Manuscript submitted to ACM
are massively distributed, the data is non-independently and identically distributed (non-IID), unbalanced,
and network communication is limited [29]. As such, one of the key claims of the original FL framework is
that it is efficient to train a deep learning model in a heterogeneous setting. In this context the heterogeneity
can be twofold, namely, system heterogeneity and data heterogeneity.
Fig. 1. The FL architecture: a central server coordinates clients 1 to 𝑁, each client 𝑖 holding a local dataset 𝐷𝑖 with 𝑛𝑖 samples.
All the same, subsequent studies have demonstrated several shortcomings in the baseline approach [17].
In particular, when faced with heterogeneity, FL solutions have been shown to be inadequate in many
practical applications [33, 43]. A recent survey on device-heterogeneous FL [33] argues that the sparsity of
current real-world production applications of FL is due to heterogeneity challenges. Similarly, in [43] the
authors argue that FL “heavily relies on the assumption that all the participants share the same network
structure and possess similar data distributions” and then expand on the types of possible heterogeneities
and the current literature which addresses these problems.
In this paper, we look in detail at the problem of training a single global solution when the datasets
have heterogeneous distributions among the clients in an FL configuration. The difficulty of such a scenario is
illustrated in Figure 2, whereby the optimization direction of a model trained on centralized clients' data
is compared to those of the federated scenario with both IID and non-IID data. In the non-IID scenario,
the global federated model weights 𝜔 drift from the non-federated model weights 𝜔̄, trained on the same
data but in a centralized way, resulting in a larger gap between them than in the case of homogeneous
distributions. This result was shown in [44], which noted a significant reduction in accuracy in scenarios
with highly skewed non-IID data.
Fig. 2. (a) IID scenario: all clients 𝑖, 1 ≤ 𝑖 ≤ 𝑁, train their local model parameters 𝜔𝑖 on homogeneous data. In this
scenario, the federated model parameters 𝜔 are close to those of a hypothetical non-federated model with parameters 𝜔̄
trained on the same data but in a centralized manner. (b) Non-IID scenario: all clients 𝑖, 1 ≤ 𝑖 ≤ 𝑁, train their local
model parameters 𝜔𝑖 on heterogeneous data. In this scenario, the federated model parameters 𝜔 diverge from the
non-federated model parameters 𝜔̄ trained on the same data but in a centralized manner.
As a response to this challenge, the personalized FL (PFL) literature [20, 38] emerged with different approaches
that relax the single-model constraint and instead train several specialized federated models to cater for
data heterogeneity. A recent trend in the PFL literature uses cluster-based federated learning (CFL) methods for
personalization, which, unlike other methods, explicitly prescribe the number of federated models to train.
Then, in Section 5.1, we shed light on an important ambiguity in the definition and use of non-IID
terminology which makes comparing studies difficult. In response, we propose a precise taxonomy that
summarizes our findings and facilitates reproducibility.
Finally in Section 5.2, we identify the different use cases of clustering in the context of FL. For although
personalization has been the main focus of cluster-based studies in FL, the problems of “client selection” and
“attack prevention” have also been tackled in some studies using clustering. We also provide a comprehensive
list of notations used throughout this paper in Table 1 for ease of reference.
where 𝑓𝑖 (𝜔) is the local objective function associated with each client 𝑖, typical of neural network settings.
(For a machine learning problem, we typically take 𝑓𝑖 (𝜔) = E (𝑋 ,𝑌 )∼𝐷𝑖 [𝐿(𝑋, 𝑌 , 𝜔)], that is, the expected
value of the loss function 𝐿 calculated with features and target (𝑋, 𝑌 ) following the distribution of dataset 𝐷𝑖,
with a model of parameters 𝜔 ∈ R𝑑.) Generally, using a larger number of data samples leads to more accurate
and robust models.
Since the data cannot be shared across clients, the authors in [29] propose to iteratively optimize 𝑓𝑖 (𝜔)
on local datasets and then to aggregate the models on the server using a weighted average function. We
assume that each client 𝑖 ∈ 𝐼, over which the data is partitioned, has a local dataset 𝐷𝑖 where the number of
dataset samples is |𝐷𝑖 | = 𝑛𝑖. Thus, we can rewrite the objective function (1) as:
$$\min_{\omega \in \mathbb{R}^d} f(\omega) := \sum_{i=1}^{N} \frac{n_i}{\sum_{j=1}^{N} n_j} f_i(\omega) \qquad (2)$$

where $\sum_{j=1}^{N} n_j$ corresponds to the total number of samples across all clients.
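The weighted aggregation in Equation (2) can be sketched in a few lines. This is a minimal illustration, not the reference implementation of [29]; model parameters are represented as flat NumPy vectors:

```python
import numpy as np

def fedavg_aggregate(client_weights, client_sizes):
    """Aggregate local model parameters omega_i into the global parameters
    omega, weighting each client i by n_i / sum_j n_j as in Equation (2)."""
    total = sum(client_sizes)
    return sum((n / total) * w for w, n in zip(client_weights, client_sizes))

# Two clients with d = 3 parameters; client 2 holds three times more data.
w1, w2 = np.array([0.0, 0.0, 0.0]), np.array([4.0, 4.0, 4.0])
global_w = fedavg_aggregate([w1, w2], [100, 300])
print(global_w)  # [3. 3. 3.]
```

In a full FedAvg round this aggregation step alternates with local optimization of each 𝑓𝑖 on the clients.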
𝑁: Number of clients in the FL ecosystem, 𝑁 ≥ 2
𝐼: 𝐼 = {1, . . . , 𝑁 }, set of all client indices in the FL ecosystem
𝑖: Index of a client (𝑖 ∈ 𝐼)
𝐷𝑖: Local dataset of client 𝑖, noted 𝐷𝑖 = {(𝑥𝑗(𝑖), 𝑦𝑗(𝑖)) : 1 ≤ 𝑗 ≤ 𝑛𝑖 }, where (𝑥𝑗(𝑖))1≤𝑗≤𝑛𝑖 and (𝑦𝑗(𝑖))1≤𝑗≤𝑛𝑖 represent respectively the dataset features and targets
𝑛𝑖: Number of samples in 𝐷𝑖; we note |𝐷𝑖 | = 𝑛𝑖
𝑑: Number of parameters of the considered neural networks, 𝑑 ∈ N∗
𝜔: Federated neural network model parameters, 𝜔 ∈ R𝑑
𝜔𝑖: Local neural network parameters of client 𝑖, 𝜔𝑖 ∈ R𝑑
𝜔̄: Neural network model parameters 𝜔̄ ∈ R𝑑 of a non-federated ecosystem trained on the dataset 𝐷 = 𝐷1 ∪ . . . ∪ 𝐷𝑁
𝑓: Federated model objective function
𝑓𝑖: Local objective function of client 𝑖, defined to optimize a local model on 𝐷𝑖
(Ω, F, P): Probability space, where Ω is the sample space representing all possible outcomes of the random experiment associated with sampling from the dataset 𝐷 = 𝐷1 ∪ . . . ∪ 𝐷𝑁, F the associated sigma-algebra and P the probability measure
𝑋: Random variable representing the possible outcomes of the features of all client datasets (𝐷𝑖)𝑖∈𝐼 in the FL ecosystem; each (𝑥𝑗(𝑖))1≤𝑗≤𝑛𝑖, 𝑖 ∈ 𝐼, is a realization of 𝑋
𝑌: Random variable representing the possible outcomes of the targets of all client datasets (𝐷𝑖)𝑖∈𝐼 in the FL ecosystem; each (𝑦𝑗(𝑖))1≤𝑗≤𝑛𝑖, 𝑖 ∈ 𝐼, is a realization of 𝑌
(𝑋, 𝑌): Joint random variable of the possible outcomes of features and targets of all client datasets (𝐷𝑖)𝑖∈𝐼; we have (𝑋, 𝑌) : Ω ↦→ 𝐷 := 𝐷1 ∪ . . . ∪ 𝐷𝑁
P𝑖(𝑋): Marginal distribution of the features of dataset 𝐷𝑖
P𝑖(𝑌): Marginal distribution of the target of dataset 𝐷𝑖
P𝑖(𝑋, 𝑌): Joint distribution of the features and targets of dataset 𝐷𝑖
P𝑖(𝑋 |𝑌): Conditional distribution of the features given the target for dataset 𝐷𝑖
P𝑖(𝑌 |𝑋): Conditional distribution of the targets given the features for dataset 𝐷𝑖
𝐾: Number of clusters, 𝐾 ≤ 𝑁
𝐶𝑘: 𝑘-th cluster, 𝑘 ∈ {1, . . . , 𝐾 }; a cluster 𝐶𝑘 is considered as a subset of 𝐼, meaning that 𝐶𝑘 ⊆ 𝐼 and 𝐼 = 𝐶1 ∪ . . . ∪ 𝐶𝐾
𝐹𝑘: Objective function to optimize the federated model on cluster 𝐶𝑘
dist(·, ·): Generic distance metric
1𝑖∈𝐶𝑘: Equal to 1 if client 𝑖 is in cluster 𝐶𝑘, 0 otherwise
𝜇𝑘: Representative point of cluster 𝐶𝑘, essential for cluster formation; a vector of size 𝑑 calculated using the weights 𝜔𝑖 for all 𝑖 ∈ 𝐶𝑘
𝜂𝑖: Metadata parameters of client 𝑖; may correspond to local data statistics, distribution statistics or any other metadata representation of client 𝑖 (see Section 3.2.1)
𝛿𝑘: Metadata-based representative point of cluster 𝐶𝑘, essential for cluster formation; shares the same type as the metadata parameters 𝜂𝑖 for all clients 𝑖 ∈ 𝐶𝑘
Table 1. Summary of notations
Equation (2) reflects the Federated Average (FedAvg) protocol and is used in the original framework [29].
We note that in the original paper, the data distribution of each client as well as the devices computation
and communication capacities are considered to be heterogeneous.
2.2.2 Statistical Heterogeneity: non-IID. The baseline FedAvg approach [29] was trained in both an IID and
a non-IID setting and is shown to be effective in one specific type of heterogeneous context: the
label distribution skew scenario. In this non-IID case, the authors explore partitioning the MNIST dataset
by labels (digits) prior to distributing it over clients. However, there exist other forms of non-IID data
situations (statistical heterogeneity), which we summarize below following the taxonomy of [17].
For each client 𝑖 ∈ 𝐼 = {1, . . . , 𝑁 }, we note its associated local dataset 𝐷𝑖 as a set of samples of the form

$$D_i = \{(x_j^{(i)}, y_j^{(i)}) \mid 1 \le j \le n_i\}$$

where $(x_j^{(i)})_{1 \le j \le n_i}$ represents the features and $(y_j^{(i)})_{1 \le j \le n_i}$ the targets, with $|D_i| = n_i$.
Using this notation, we define the random variables 𝑋 representing the possible outcome of features and
𝑌 representing the possible outcome of the targets of all client datasets (𝐷𝑖 )𝑖 ∈𝐼 (meaning that in this case,
for all 𝑖 ∈ 𝐼 , (𝑥 𝑗(𝑖 ) )1≤ 𝑗 ≤𝑛𝑖 are realizations of 𝑋 and (𝑦 𝑗(𝑖 ) )1≤ 𝑗 ≤𝑛𝑖 are realizations of 𝑌 ).
If we denote the probability space as $(\Omega, \mathcal{F}, \mathbb{P})$, where $\Omega$ is the sample space representing all possible
outcomes of the random experiment associated with sampling from the dataset $D = \bigcup_{i=1}^{N} D_i$, $\mathcal{F}$ its associated
sigma-algebra and $\mathbb{P}$ the probability measure, then the joint random variable $(X, Y)$ can be defined as:

$$(X, Y) : \Omega \mapsto D := \bigcup_{i=1}^{N} D_i$$
Each client's dataset 𝐷𝑖 is represented by its local joint distribution P𝑖 (𝑋, 𝑌 ), which we can factor as
follows using Bayes' rule:
P𝑖 (𝑋, 𝑌 ) = P𝑖 (𝑌 )P𝑖 (𝑋 |𝑌 ) = P𝑖 (𝑋 )P𝑖 (𝑌 |𝑋 ) (3)
where P𝑖 (𝑋 ) the marginal distribution of the features (respectively P𝑖 (𝑌 ) the target) and P𝑖 (𝑌 |𝑋 ) the
distribution of the target conditioned by the features (respectively P𝑖 (𝑋 |𝑌 ) the distribution of features
conditioned by the target) for client 𝑖 ∈ 𝐼 .
Fig. 3. Illustration of non-IID categories for two clients 𝑖 and 𝑗 with samples from the MNIST dataset: (a) Feature
Distribution Skew, (b) Label Distribution Skew, (c) Concept Shift on Labels, (d) Concept Shift on Features, (e) Quantity Skew
We thus define five types of non-IID data situations (based on the taxonomy of [17]), each with a
corresponding example from the MNIST dataset in Figure 3 (the letter items in the categories correspond
to the letter items in the Figure). For clients 𝑖 ∈ 𝐼 and 𝑗 ∈ 𝐼 with 𝑖 ≠ 𝑗, we can consider 𝐷𝑖 and 𝐷 𝑗 as
non-IID if one or several of the following conditions apply:
a. Feature distribution skew: P𝑖 (𝑋 ) ≠ P𝑗 (𝑋 ) even if P𝑖 (𝑌 |𝑋 ) = P𝑗 (𝑌 |𝑋 ). In the federated MNIST
dataset (FEMNIST), each client's dataset contains images written by a specific person. As each person's
handwriting is different, the features can be considered skewed. For example, the digit
"6" may be written differently by clients 𝑖 and 𝑗, as in Figure 3.a.
b. Label distribution skew: P𝑖 (𝑌 ) ≠ P𝑗 (𝑌 ) even if P𝑖 (𝑋 |𝑌 ) = P𝑗 (𝑋 |𝑌 ). We can create this scenario
by unevenly distributing the digits to the clients such that one client has a majority of "1"s, another a
majority of "4"s, and so on, as illustrated in Figure 3.b.
c. Concept shift on labels (same features, different label): P𝑖 (𝑌 |𝑋 ) ≠ P𝑗 (𝑌 |𝑋 ) even if P𝑖 (𝑋 ) = P𝑗 (𝑋 ).
Although more likely in datasets reflecting personal preferences, we can imagine such a scenario in the
MNIST dataset if, for instance, the digit "1" in one user's dataset resembles the digit "7" in another's
(see Figure 3.c).
d. Concept shift on features: 2 (same label, different features) P𝑖 (𝑋 |𝑌 ) ≠ P𝑗 (𝑋 |𝑌 ) for clients 𝑖 and 𝑗
even if P𝑖 (𝑌 ) = P𝑗 (𝑌 ). This can be artificially attained in MNIST if we rotate the images on certain
devices (see Figure 3.d).
2 Shortened to “Concept Drift” in [17]
e. Quantity skew: the volume of data differs significantly among clients, meaning that |𝐷𝑖 | ≪ |𝐷 𝑗 | or
|𝐷 𝑗 | ≪ |𝐷𝑖 | (see Figure 3.e).
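To make one of these categories concrete, the sketch below simulates label distribution skew (category b) on a toy label array standing in for MNIST digits: each client receives samples drawn from only a small subset of the label set, so that P𝑖 (𝑌 ) differs across clients. This is an illustrative construction, not the exact partitioning protocol of any surveyed paper:

```python
import numpy as np

def label_skew_partition(labels, n_clients, labels_per_client, seed=0):
    """Simulate label distribution skew (category b): each client is
    assigned samples from only `labels_per_client` classes, so the
    marginal label distribution P_i(Y) differs across clients."""
    rng = np.random.default_rng(seed)
    classes = np.unique(labels)
    partition = {}
    for i in range(n_clients):
        own = rng.choice(classes, size=labels_per_client, replace=False)
        partition[i] = np.flatnonzero(np.isin(labels, own))
    return partition

# Toy labels standing in for MNIST digits: 1000 samples over 10 classes.
y = np.repeat(np.arange(10), 100)
parts = label_skew_partition(y, n_clients=5, labels_per_client=2)
print({i: sorted(set(y[idx].tolist())) for i, idx in parts.items()})
```

Analogous generators can be written for the other categories, e.g. rotating images for concept shift on features, or sub-sampling clients unevenly for quantity skew.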
that is, the expected value of the local loss function 𝐿𝑖 calculated with features and target (𝑋, 𝑌 )
following the distribution of dataset 𝐷𝑖, with a model of parameters 𝜔 ∈ R𝑑.
Based on this formulation, we describe the three main cluster-based personalization solution approaches
found in the literature. We base our taxonomy below on how clusters are computed.
This approach (or minor variants) is used in many studies, with dist being a generic distance function
between model parameters or their gradients and 1𝑖∈𝐶𝑘 = 1 if client 𝑖 is in cluster 𝐶𝑘, 0 otherwise. Several
studies rely on the 𝑙2 distance between model parameters, a metric common in k-means
but one that can lead to incorrect data partitions, as shown in [36]. Since the cluster assignment is determined by the
central server, throughout the remainder of this paper this type of clustering method will be denoted as
"server-side" clustering.
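A minimal server-side sketch follows: the server gathers the flattened model parameters 𝜔𝑖 of every client and groups them with a plain k-means on the 𝑙2 distance (precisely the metric whose pitfalls [36] points out). The tiny k-means below is written from scratch for self-containment and is not the algorithm of any specific surveyed paper:

```python
import numpy as np

def server_side_clusters(client_weights, k, n_iter=20, seed=0):
    """Server-side clustering sketch: cluster the (N, d) matrix of client
    model parameters with a plain l2 k-means; returns each client's
    cluster index."""
    W = np.stack(client_weights)
    rng = np.random.default_rng(seed)
    centers = W[rng.choice(len(W), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each client to the nearest center, then recompute centers.
        d = np.linalg.norm(W[:, None, :] - centers[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = W[assign == j].mean(axis=0)
    return assign

# Two well-separated groups of client models (d = 4).
weights = [np.zeros(4), np.zeros(4) + 0.1, np.ones(4) * 5, np.ones(4) * 5.1]
labels = server_side_clusters(weights, k=2)
print(labels)
```

In practice the same skeleton applies when clustering gradients instead of weights, or when replacing the 𝑙2 distance with cosine similarity.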
3.2.2 Training Loss-based (client-side) Clustering. As an alternative to server-side clustering, some studies
delegated the computational burden of determining cluster identities to client nodes [8, 14, 18, 22, 28, 34, 39].
In these studies, clients are presented with several possible models (usually initialized by the central server)
and evaluate their cluster membership by calculating the training loss on these different models.
Formally, let 𝜇𝑘 ∈ R𝑑 be the representative point of cluster 𝐶𝑘, for all 𝑘 ∈ {1, . . . , 𝐾 }, essential for
the formation of each cluster. These representative points can be initialized by the central server. Each client
𝑖 ∈ {1, . . . , 𝑁 } evaluates its training data on the 𝐾 different models, each initialized with parameters
𝜇𝑘, and updates its model parameters 𝜔𝑖 with the cluster representative point that minimizes its training loss,
that is, the representative point 𝜇𝑘 that minimizes the expected value of the local loss
function 𝐿𝑖 calculated with features and target (𝑋, 𝑌 ) following the distribution of dataset 𝐷𝑖, with a model of
parameters 𝜇𝑘.
Once an initial cluster assignment has been chosen, the cluster representative points can be updated and
clients recalculate the assignment iteratively (usually the representative point is updated by the central server
as a weighted average of the model parameters of all cluster members [14]). Client-side clustering can be
either an advantage or a disadvantage depending on the context. The authors in [14] argue that "one of the
major advantages of [their] algorithm is that it does not require a centralized clustering algorithm, and thus
significantly reduces the computational cost at the center machine". It has also been argued in [26] that
client-based clustering "posts high communication and computation overheads because the selected nodes
will spend more resources for receiving and running multiple global models". Since the cluster assignment
in this case is determined by the client, throughout the remainder of this paper this type of clustering
method will be denoted as "client-side" clustering.
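The loss-based membership rule can be sketched as follows: the client computes its training loss under each cluster representative 𝜇𝑘 and adopts the one with the smallest loss, in the spirit of IFCA [14]. The linear-regression loss here is only a stand-in for an arbitrary local loss 𝐿𝑖:

```python
import numpy as np

def client_cluster_choice(X, y, cluster_params, loss_fn):
    """Client-side (IFCA-style) assignment sketch: evaluate the local
    training loss under each cluster representative mu_k and adopt the
    minimizer."""
    losses = [loss_fn(X, y, mu) for mu in cluster_params]
    k_star = int(np.argmin(losses))
    return k_star, cluster_params[k_star]

def mse(X, y, mu):
    """Toy local loss: mean squared error of a linear model mu."""
    return float(np.mean((X @ mu - y) ** 2))

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = X @ np.array([2.0, -1.0])             # this client's data follows (2, -1)
models = [np.array([0.0, 0.0]), np.array([2.0, -1.0]), np.array([5.0, 5.0])]
k, w = client_cluster_choice(X, y, models, mse)
print(k)  # 1: the model matching the client's data distribution
```

On the server side, each cluster model 𝜇𝑘 is then re-aggregated from the clients that chose it, and the loop repeats.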
3.2.3 Metadata-based Clustering. The third category of studies rely on exogenous information to partition
clients [6, 7, 21]. Specifically, metadata such as local data statistics or distribution statistics are transferred to
the central server in order to cluster clients with similar distributions. One problem with using exogenous
information is the risk of transmitting potentially sensitive data across the network. Workarounds often
include homomorphic encryption or differential privacy [41].
For each client 𝑖 ∈ {1, . . . , 𝑁 }, we define metadata parameters 𝜂𝑖, and for each cluster 𝐶𝑘, 𝑘 ∈ {1, . . . , 𝐾 },
a metadata-based representative point 𝛿𝑘, essential for cluster formation, of the same
type as the metadata parameters. Each client's membership is determined using a formulation similar to
that of Equation (6):
$$\min_{\delta_1, \ldots, \delta_K} \sum_{k=1}^{K} \sum_{i=1}^{N} \mathbb{1}_{i \in C_k} \operatorname{dist}(\eta_i, \delta_k) \qquad (8)$$
where dist is a generic distance between the defined metadata and 1𝑖∈𝐶𝑘 = 1 if client 𝑖 is in cluster 𝐶𝑘, 0
otherwise. Since the cluster assignment is determined by the clients' metadata, throughout the remainder of
this paper this type of clustering method will be denoted as "metadata-based" clustering.
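The assignment step of Equation (8) can be sketched with normalized label histograms as the metadata 𝜂𝑖 and the Hellinger distance as dist (the metric used by HACCS [41]); the histograms and representative points below are made-up toy values:

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two discrete distributions."""
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))

def assign_by_metadata(etas, deltas):
    """Assignment step of Equation (8): client i joins the cluster whose
    metadata representative delta_k is closest to its own metadata eta_i."""
    return [int(np.argmin([hellinger(eta, d) for d in deltas])) for eta in etas]

# eta_i: label histograms of three clients over 4 classes.
etas = [np.array([0.7, 0.1, 0.1, 0.1]),
        np.array([0.6, 0.2, 0.1, 0.1]),
        np.array([0.1, 0.1, 0.1, 0.7])]
deltas = [np.array([0.65, 0.15, 0.1, 0.1]),   # cluster 0 representative
          np.array([0.1, 0.1, 0.15, 0.65])]   # cluster 1 representative
print(assign_by_metadata(etas, deltas))  # [0, 0, 1]
```

Note that, unlike the two previous approaches, the quantities exchanged here are data statistics rather than model parameters, hence the privacy workarounds mentioned above.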
in Section 4.2, we describe follow-up studies which took the methods a step further, either by optimizing
the algorithms or by accounting for more realistic contexts. Finally, in Section 4.3 we look into papers which
propose methods at the intersection of clustering and other approaches. These latter papers highlight the
properties as well as the limitations of clustering and as such are important to discuss here. We include
in this last section the metadata-based solutions, which violate the strict FL frame of reference that only
allows sharing gradients, model weights and the number of local client samples.
In addition to our description of the papers, we give a high-level synthesis in Table 2. The
column "Clustering Category" identifies the category of the solution approach as presented in Section 3.
"Nomenclature" is the name (or acronym) of the solution as presented in the paper itself or as it is referred
to in later papers. "Clustering Algorithm" and "Clustering Evaluation Metric" are details on the clustering
method used.
server and based on their similarity, client clusters are computed. This approach is shown to optimize the
communication costs as it requires only one additional round of communication with the central server.
Similarly, in [6] the server creates clusters based on infrastructure-specific features (same access point,
coordinates, etc.). However, the authors assume that users can communicate and transfer models among
themselves, which is not a standard assumption in the classical FL set-up.
| Reference | Clustering Category | Nomenclature | Clustering Algorithm | Clustering Evaluation Metric |
|---|---|---|---|---|
| Ghosh et al. 2019 [13] | Server-side | N/A | Trimmed k-means | L2 norm |
| Sattler et al. 2020 [36] | Server-side | CFL | Recursive bi-partitioning | Cosine similarity |
| Ghosh et al. 2020 [14] | Client-side | IFCA | Client loss minimisation | Local training loss |
| Mansour et al. 2020 [28] | Client-side | HypCluster | Client loss minimisation | Local training loss |
| Briggs et al. 2020 [3] | Server-side | FL+HC | Agglomerative Hierarchical Clustering | L1/L2/Cosine similarity |
| Dennis et al. 2021 [7] | Metadata-based | k-Fed | K-means | L2 |
| Duan et al. 2021 [9] | Server-side | FedGroup | K-means / Hierarchical clustering | MADD [35] / Custom metric |
| Duan et al. 2021 [10] | Server-side | FlexCFL | K-means / Hierarchical clustering | MADD [35] / Custom metric |
| Kim et al. 2021 [18] | Client-side | Dynamic GAN-Based Clustering | 3-step clustering: 1) Cluster GAN [31]; 2) Client loss minimization; 3) Divisive clustering | Local training loss |
| Li et al. 2021 [22] | Client-side | N/A | Client loss minimisation | Local training loss |
| Xiao et al. 2021 [42] | Server-side | CFMTL | Hierarchical clustering | L2 |
| Luo et al. 2021 [27] | Server-side | CAFL | Agglomerative Hierarchical Clustering | L2 |
| Ruan et al. 2022 [34] | Client-side | FedSoft | Client loss minimisation | Local training loss |
| Wolfrath et al. 2022 [41] | Metadata-based | HACCS | Density-based algorithm (OPTICS) | Hellinger distance |
| Gong et al. 2022 [15] | Client-side | AutoCFL | Voting scheme | L2 |
| Castellon et al. 2022 [5] | Server-side | FLIC | Louvain algorithm | Cosine similarity |
| Long et al. 2023 [26] | Client-side | FeSEM | K-means like | N/A |
| Liang et al. 2023 [25] | Server-side | FedEOC | K-means | Cosine similarity |
| Augello et al. 2023 [2] | Server-side | DCFL | Affinity propagation algorithm [12] | Custom distance |
| Mehta et al. 2023 [30] | Server-side | FLACC | Agglomerative Hierarchical Clustering | Cosine similarity |

Table 2. High-level summary of reviewed foundational studies (ordered by year)
optimizing the computational and/or communication costs. Then, in Section 4.2.2, we look into studies
which examine the framework's security and propose solutions to mitigate possible attack risks. In Section 4.2.3,
we include studies which add a mechanism to integrate network dynamics; these dynamics can be new
clients joining the network or the distribution of clients evolving over time. Finally, in Section 4.2.4, we look
into approaches which expand on the problem of identifying an optimal number of clusters. All of these
results are summarized in Table 3 for ease of reference. We additionally add two columns in Table 3 (namely
"Hard/Soft" and "One-shot/Iterative"). The first defines whether the clustering algorithm uses hard
clustering (i.e., mutually exclusive clusters) or soft clustering, and the second whether the framework is
iterative or clustering is done in one shot (more communication efficient).
4.2.1 Computational and Communication Efficiency. In terms of server-side methods using gradient information,
the framework in [9] offers computational improvements over the work in [36]. The framework
first decomposes the model update matrix into 𝑚 directions using singular value decomposition (SVD),
then computes the cosine similarity along these new directions. Similarly, in [5] the authors tackle the
problem of communication efficiency. Instead of having dedicated communication rounds for clustering
as in [36] and [3], this paper suggests capitalizing on the past weight updates of non-participants in the
current round of FL. A similarity matrix between models is built iteratively on the server, where only current-round
participants update the matrix and the past similarities of non-participants are kept; clustering is then
performed using this matrix. The work of [27] follows up on [3] but seeks to improve accuracy.
As such, the authors train multiple models at each client to make the gradients robust to noise. Conversely,
the work of [30] favours computation speed and chooses to cluster clients without waiting for
full model convergence. This form of greedy clustering reduces the computational burden and is shown to
be efficient in many cases. In [15] the authors start with the observation that the computation burden is
exacerbated when the number of samples per client is highly imbalanced. To counter that, they propose
a solution which balances clients with large datasets across clusters. Finally, in [25] the authors
propose to use a single layer instead of entire models to calculate the similarity between models.
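The SVD-based compression idea can be sketched as follows: project the (𝑁 × 𝑑) matrix of client updates onto its top-𝑚 right singular directions, then compute pairwise cosine similarity in the reduced space. This is an illustrative reading of the trick in [9], not a reproduction of their full pipeline:

```python
import numpy as np

def compressed_cosine_similarity(updates, m):
    """Project client model updates onto the top-m right singular
    directions (via SVD), then compute pairwise cosine similarity in the
    reduced m-dimensional space."""
    M = np.stack(updates)                            # shape (N, d)
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    reduced = M @ Vt[:m].T                           # shape (N, m)
    norms = np.linalg.norm(reduced, axis=1, keepdims=True)
    unit = reduced / np.clip(norms, 1e-12, None)
    return unit @ unit.T                             # (N, N) cosine similarities

rng = np.random.default_rng(0)
grads = [rng.normal(size=50), rng.normal(size=50)]
grads += [g + rng.normal(scale=0.01, size=50) for g in grads]  # near-duplicates
sim = compressed_cosine_similarity(grads, m=2)
print(np.round(sim[0, 2], 3))  # client 2 is a noisy copy of client 0
```

Since 𝑚 ≪ 𝑑, the pairwise similarity computation operates on much smaller vectors, which is the source of the claimed computational savings.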
On the other hand, papers using training-loss-based solutions also improved on the initial proposals with
follow-up studies. In particular, the authors in [15] compute the distance between models using partial-weight
similarity to decrease computation costs. The weights chosen are those in the layers closest to the output, as
they were shown to be the most representative. In [26], layer-wise matching aggregation is chosen for faster
convergence (cf. [40]).
In a different manner, [22, 34] root their approaches in the client-side, training-loss-based approach
inspired by [14]. Specifically, in [34] the authors propose soft clustered federated learning, which relaxes
the assumption of a unique data distribution per client. Every local dataset is considered as a mixture of
multiple source distributions. On updating the local models, clients optimize a proximal local objective
function (inspired by FedProx [23]) which accounts for local data as well as clusters' information, making
for more realistic data distribution scenarios.
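One local update in this spirit can be sketched as a gradient step on the local loss plus a FedProx-like proximal term pulling the client's parameters toward each cluster center, weighted by the estimated importance of that cluster for the client's data mixture. The step below is a simplified illustration of this family of objectives, not the exact update rule of [34]:

```python
import numpy as np

def soft_proximal_step(w, grad_local, centers, importance, lr=0.1, prox=1.0):
    """One soft-clustered local update sketch: follow the local loss
    gradient plus a proximal term pulling w toward each cluster center,
    weighted by the importance u_k of that cluster for this client."""
    prox_grad = sum(u * (w - c) for u, c in zip(importance, centers))
    return w - lr * (grad_local + prox * prox_grad)

w = np.array([1.0, 1.0])
centers = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
importance = [0.75, 0.25]        # client's data is mostly from source 0
new_w = soft_proximal_step(w, grad_local=np.zeros(2),
                           centers=centers, importance=importance)
print(new_w)  # [0.95 0.95], pulled toward the weighted center (0.5, 0.5)
```

With a zero local gradient, the step moves the parameters only through the proximal term, which makes the mixture weighting easy to see.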
| Reference | Computation and Communication Tweaks | Security Tweaks | Newcomers | Number of Clusters | Hard/Soft | One-shot/Iterative |
|---|---|---|---|---|---|---|
| Ghosh et al. 2019 [13] | N/A | Outlier detection | N/A | User-defined parameter | Hard | One-shot |
| Sattler et al. 2020 [36] | N/A | Encryption compatible | Newcomers traverse the saved cluster-parameter tree to find the best fitting cluster | Threshold-based | Hard | Iterative |
| Ghosh et al. 2020 [14] | N/A | N/A | N/A | User-defined parameter | Hard | Iterative |
| Mansour et al. 2020 [28] | N/A | N/A | N/A | User-defined parameter | Hard | Iterative |
| Briggs et al. 2020 [3] | N/A | N/A | N/A | Stop merging when the distance between clusters is below a chosen threshold | Hard | One-shot |
| Dennis et al. 2021 [7] | One-shot communication client- | N/A | | User-defined parameter | Hard | One-shot |
Following the same line of reasoning as [34], the authors in [22] propose to combine the loss-based cluster
assignment approach of IFCA [14] with a soft clustering method. Instead of choosing one cluster identity
exclusively, each client selects a set of 𝐾 clusters to which its data belongs. At the server, the cluster
models are updated based on all the clients which have data belonging to them. But unlike in [34], where weights
indicate the importance of each cluster model for each client, here client models are a simple average over
the 𝐾 cluster models that yield the lowest loss.
4.2.2 Security Concerns. As data privacy is at the center of FL, accounting for the security of exchanges
is important. Consequently, some papers introduced security tweaks to protect the FL framework from
malicious attacks. For instance, in [2] the authors propose a clustering metric which, unlike cosine similarity,
remains "effective under noisy conditions typical of differential privacy". The clustering metric is a mixture
of Euclidean distance and a divergence measure. [18] builds on the study in [28] and enhances it by using
GAN models which purportedly help with data privacy through differential privacy.
4.2.3 Newcomers and Dynamics. Some studies include an algorithm to integrate new clients into the network
dynamically or to allow clients to change clusters dynamically. For instance, in [9] newcomer devices are
assigned to the clusters most closely related to their optimization direction (proximity is evaluated
using the normalized cosine dissimilarity between the group and the newcomer model update). Further, in [10]
the authors pursue the previous work [9] by also including a client migration strategy. That is, in dynamic
environments where there can be data distribution shifts, clients can be reassigned to a different cluster if
their data distribution at communication round 𝑡 evolves beyond a certain threshold from the distribution at
communication round 𝑡 − 1. The Wasserstein distance between the previous and current local distributions
is used to evaluate the shift. In a different way, the authors of [30] propose an algorithm that is robust in
dynamic environments where not all clients are online at all times, and ensure that the algorithm works
when only a fraction of clients participate in a training round. Finally, in [18] the authors use a cluster
division procedure to change the number of clusters dynamically.
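The migration trigger described for [10] can be sketched as follows: compare the client's local label distribution at rounds 𝑡 − 1 and 𝑡 via a 1-D Wasserstein distance, and flag the client for reassignment when the shift exceeds a threshold. The histograms and threshold value are illustrative, not taken from the paper:

```python
import numpy as np

def wasserstein_1d(p, q):
    """1-D Wasserstein distance between two distributions over the same
    ordered support with unit spacing: the l1 distance between CDFs."""
    return float(np.sum(np.abs(np.cumsum(p) - np.cumsum(q))))

def should_migrate(prev_dist, curr_dist, threshold=0.2):
    """Flag a client for cluster reassignment when its local label
    distribution shifts beyond `threshold` between rounds t-1 and t."""
    return wasserstein_1d(prev_dist, curr_dist) > threshold

prev = np.array([0.5, 0.3, 0.2])   # label histogram at round t-1
curr = np.array([0.1, 0.2, 0.7])   # distribution has drifted at round t
print(should_migrate(prev, curr))  # True
```

For continuous features, `scipy.stats.wasserstein_distance` provides the same quantity computed directly from samples.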
4.2.4 Number of Clusters. Two studies ([30] and [2]) improve on existing approaches by clustering clients
using algorithms that do not require a user-defined number of clusters (see Table 2).
As this parameter can drastically change the results and is difficult to set, avoiding it by using a parameter-free
algorithm can be essential in real-life situations. In [8] a different approach is taken. The study extends
the client-side method in [14] to adapt it to a dynamic environment. As such, they do not specify a fixed
number of clusters a priori but instead create, at each round, a new cluster whose representative point is
based on the client with the highest loss, and allow other clients to migrate to this new cluster if need be. At
the end of each round, clusters which are no longer relevant are removed.
5 REVIEW OF APPLICATIONS
The CFL-for-personalization solutions reviewed in Section 4 allowed us to organize papers into solution
categories and describe how frameworks evolved to include more operational constraints. Solution categorization
is important to understand when to use one solution over another from a purely formal perspective.
In this section, we continue the categorization of the literature, but this time from the perspective of applications.
Specifically, in Section 5.1 we look into the datasets that have been used and the types of heterogeneity
solutions have been tested on. This new categorization sheds light on some gaps in the literature and
possible future directions for research. In Section 5.2, we discuss two other applications of cluster-based
federated learning that go beyond personalization.
shift on labels and features respectively. Adding these special cases is not very relevant in our case, as we are
only interested in the type of non-IID data without identifying the reason behind it.
The ambiguity in the nomenclature is not entirely surprising: as we have seen, a joint distribution between features
and target P (𝑋, 𝑌 ) can be factored using Bayes' rule (see Equation (3)), and hence altering
the distribution of labels will have an impact on the features and vice versa. That being said, these conflicting
definitions make it harder for a reader to compare results.
To alleviate this ambiguity and provide common ground for comparisons, we summarize the different
use cases tested in the literature in Table 4 and specify to which non-IID category they most adequately
belong. To be clear, this categorization is not definitive, in the sense that some use cases can be seen
from different angles, as discussed in the section above. Nonetheless, we believe that having a clear
formalization that works across studies is helpful for comparisons.
In Table 4, “Dataset” is either a public dataset such as “MNIST”, “FEMNIST” or “Shakespeare”, or a private
one such as some of the time series data used in [18]. The column “Task Type” then defines which category
of task is tackled with that dataset (whether it is a classification task, a prediction task, etc.). Figure 4
presents a pie chart of the distribution of experimental designs from Table 4. An initial pattern already
comes into sight: a vast majority (88%) of experimental designs focus on “image classification” tasks on
“MNIST” or some variant of it.
In “Heterogeneity Information” we extract from the papers information on how the non-IID setting was
induced. Predictably, this information is expressed in various ways across papers; we therefore suggest
in “Suggested Category of non-IID situation” the non-IID category under which the use case can be
classified. This suggestion is meant to facilitate the comparison of different studies but reflects our
perception of the problem at hand. From these last two columns, we can see that, with the exception of
“quantity skew” (only evaluated in [15]), the other types of non-IID categories seem to be adequately
represented.
| Dataset | Task Type | Heterogeneity Information | Suggested Category of non-IID situation | Papers |
|---|---|---|---|---|
| MNIST | Image classification | Label swapping | Concept shift on labels | [3, 5, 10, 36] |
| MNIST | Image classification | Image rotations | Concept shift on features | [5, 7, 14, 15, 30, 41] |
| MNIST | Image classification | Subset of labels per client | Label distribution skew | [3, 9, 15, 22, 42] |
| EMNIST | Image classification | Image rotations | Concept shift on features | [2, 30, 34] |
| EMNIST | Image classification | Subset of labels per client | Label distribution skew | [2] |
| FEMNIST | Image classification | Split input by author | Feature distribution skew | [9, 14, 26, 28, 30] |
| FEMNIST | Image classification | Label swapping | Concept shift on labels | [10] |
| FEMNIST | Image classification | Subset of labels per client | Label distribution skew | [41] |
| Fashion MNIST | Image classification | Label swapping | Concept shift on labels | [10, 27] |
| Fashion MNIST | Image classification | Subset of labels per client | Label distribution skew | [15, 22] |
| Fashion MNIST | Image classification | Image rotations | Concept shift on features | [15] |
| FedCelebA | Image classification | Split by celebrity | Feature distribution skew | [26] |
| Shakespeare | Next character prediction | Split by play character | Feature distribution skew | [7] |
| Sentiment140 | Sentiment analysis | Split by user | Feature distribution skew | [9] |
| Time series data | Time series forecasting | Split by class | Feature distribution skew | [18] |
| ml-100K | Rating prediction | Split by user | Concept shift on labels | [25] |
| ml-1M | Rating prediction | Split by user | Concept shift on labels | [25] |
| CIFAR-10 | Image classification | Label swapping | Concept shift on labels | [27, 30, 36] |
| CIFAR-10 | Image classification | Image rotations | Concept shift on features | [14, 15, 34] |
| CIFAR-10 | Image classification | Subset of labels per client | Label distribution skew | [15, 30, 41, 42] |
| Ag-News | Word prediction | Split by topic | Feature distribution skew | [36] |

Table 4. Summary of datasets and tasks tested in the literature
Figure 4. (a) Distribution of use case studies across papers. (b) Distribution of non-IID category studies across papers.
In [37], communication latency, computational performance, and training time are proposed as features for
clustering. At each communication round, clients are randomly selected from one cluster only, and all
clusters are explored sequentially over the course of the FL process in a round-robin fashion. Finally, in
[11], clusters are created based on dataset size or local model similarity. Clients are then given a
participation probability based on their cluster attributes.
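The round-robin selection scheme described above can be sketched in a few lines. This is an illustrative reading of the idea, not the implementation from [37]; the function name and signature are ours.

```python
import random

def round_robin_sample(clusters, rnd, clients_per_round):
    """Pick this round's cluster cyclically, then sample clients within it.

    Hypothetical sketch of round-robin cluster exploration:
    `clusters` maps cluster id -> list of client ids, `rnd` is the
    communication-round index. Returns (active cluster id, sampled clients).
    """
    cluster_ids = sorted(clusters)
    # Round-robin: round r visits cluster r mod (number of clusters).
    active = cluster_ids[rnd % len(cluster_ids)]
    pool = clusters[active]
    # Sample without replacement, capped by the cluster's size.
    return active, random.sample(pool, min(clients_per_round, len(pool)))
```

Over consecutive rounds, every cluster is visited in turn, so no group of clients is starved even when clusters have very different sizes.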
On the other hand, attack prevention is about countering adversarial attacks (poisoning attacks or
byzantine attacks [16]). Preventing these types of attacks is important as they can pollute the global
model’s performance or prevent convergence. For instance, an adversarial client might purposefully update
its local model with the opposite of the true gradient, so as to hinder the global model’s convergence
when aggregated. In that respect, clustering can isolate outliers, which are potentially malicious users. As
an example, Ghosh et al. [13] use a robust version of the k-means algorithm combined with outlier-robust
optimization to exclude byzantine machines from the FL process. Similarly, [19] and [45] use clustering to
identify “malicious” nodes whose distributions are too different from the majority. In [1], the authors
note that using outlier detection mechanisms can introduce a fairness bias. In particular, clients whose
distinct distributions are legitimate might be wrongly excluded too hastily, and it is important to choose
the criteria for exclusion carefully so as not to lose important information. The authors then propose a
solution based on metadata sharing to counter this bias. In [24], clients’ clusters are found using k-means
on metadata such as geo-features (client IP address), time-features (sending time) or user-features (client
ID). The authors argue that these meta-features are often representative of client similarity, and they
then apply a robust aggregation mechanism per cluster to mitigate attacks. Finally, in the FLAME framework
[32], the authors use the HDBSCAN algorithm [4], a hierarchical, density-based extension of DBSCAN, to
categorize models into three categories: core points, border points and noise points. The models considered
as noise are excluded from the update for the current round.
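The intuition behind these defenses, that an adversarial update sits far from the honest cohort in parameter space, can be illustrated with a minimal sketch. This is not the FLAME/HDBSCAN pipeline or Ghosh et al.'s robust k-means; it is a simpler stand-in (distance to the coordinate-wise median) that we introduce purely for illustration.

```python
import numpy as np

def filter_outlier_updates(updates, z=2.0):
    """Keep only client updates close to the cohort consensus.

    Illustrative sketch (our own rule, not a published defense): flag as
    noise any flattened update whose distance to the coordinate-wise
    median exceeds `z` times the median of all such distances.
    `updates` is an (n_clients, n_params) array; returns kept row indices.
    """
    center = np.median(updates, axis=0)               # robust consensus point
    dists = np.linalg.norm(updates - center, axis=1)  # one distance per client
    cutoff = z * np.median(dists)
    return np.where(dists <= cutoff)[0]

# A sign-flipped (gradient-inverting) update sits far from honest ones:
honest = np.ones((5, 4)) + 0.01 * np.arange(20).reshape(5, 4)
attacker = -honest[0:1]                               # opposite of an honest update
kept = filter_outlier_updates(np.vstack([honest, attacker]))
```

Here the five honest clients survive the filter while the sign-flipped sixth row is dropped; the fairness caveat from [1] applies equally to this toy rule, since a legitimately distinct client would be dropped just as readily.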
6 CONCLUSION
This survey examines cluster-based Federated Learning papers with an emphasis on personalization as
a use case. It presents a novel classification of solution approaches in Section 3.1 which highlights the
similarities and variations over the baseline approach. Then, the review in Section 4 classifies the papers by
the topics addressed and brings out the relationships between these studies. The classification, in the form
of a table, gives a frame of reference for quickly comparing solution approaches and choosing a framework
appropriate to new problems. One thing that stands out from this comparison is the limited literature on
some topics such as dynamic changes in data distributions and soft clustering. Yet both of these problems
are particularly relevant in real-world scenarios, where datasets are not static but constantly evolving and
often involve a mix of distributions.
Section 5 then examines the papers from the perspective of applications. The datasets and task instances
that have been tested in the literature are analyzed in Section 5.1, bringing forward general tendencies
summarized in Table 4 and Figure 4. Here too we observe a limitation of the current literature, with a
predominant tendency towards image classification tasks at the expense of other types of tasks and
datasets. Additionally, Table 4 serves as an aid to reproducibility by suggesting a common nomenclature
for the problems addressed across papers. Finally, Section 5.2 reviews clustering use cases in FL beyond
personalization, giving future researchers a thorough vision of clustering in FL.
REFERENCES
[1] Ashneet Khandpur Singh, Alberto Blanco-Justicia, and Josep Domingo-Ferrer. Fair detection of poisoning attacks in federated
learning on non-i.i.d. data. Data Mining and Knowledge Discovery, 2023.
[2] Andrea Augello, Giulio Falzone, and Giuseppe Lo Re. Dcfl: Dynamic clustered federated learning under differential privacy
settings. In 2023 IEEE International Conference on Pervasive Computing and Communications Workshops and other Affiliated Events
(PerCom Workshops), pages 614–619. IEEE, 2023.
[3] Christopher Briggs, Zhong Fan, and Peter Andras. Federated learning with hierarchical clustering of local updates to improve
training on non-iid data. In 2020 International Joint Conference on Neural Networks (IJCNN), pages 1–9. IEEE, 2020.
[4] Ricardo JGB Campello, Davoud Moulavi, and Jörg Sander. Density-based clustering based on hierarchical density estimates. In
Pacific-Asia conference on knowledge discovery and data mining, pages 160–172. Springer, 2013.
[5] Fabiola Espinoza Castellon, Aurélien Mayoue, Jacques-Henri Sublemontier, and Cédric Gouy-Pailler. Federated learning with
incremental clustering for heterogeneous data. In 2022 International Joint Conference on Neural Networks (IJCNN), pages 1–8.
IEEE, 2022.
[6] Zhikun Chen, Daofeng Li, Rui Ni, Jinkang Zhu, and Sihai Zhang. Fedseq: A hybrid federated learning framework based on
sequential in-cluster training. IEEE Systems Journal, 2023.
[7] Don Kurian Dennis, Tian Li, and Virginia Smith. Heterogeneity for the win: One-shot federated clustering. In International
Conference on Machine Learning, pages 2611–2620. PMLR, 2021.
[8] Run Du, Shuo Xu, Rui Zhang, Lijuan Xu, and Hui Xia. A dynamic adaptive iterative clustered federated learning scheme.
Knowledge-Based Systems, 276:110741, 2023.
[9] Moming Duan, Duo Liu, Xinyuan Ji, Renping Liu, Liang Liang, Xianzhang Chen, and Yujuan Tan. Fedgroup: Efficient
federated learning via decomposed similarity-based clustering. In 2021 IEEE Intl Conf on Parallel & Distributed Processing
with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking
(ISPA/BDCloud/SocialCom/SustainCom), pages 228–237. IEEE, 2021.
[10] Moming Duan, Duo Liu, Xinyuan Ji, Yu Wu, Liang Liang, Xianzhang Chen, Yujuan Tan, and Ao Ren. Flexible clustered federated
learning for client-level data distribution shift. IEEE Transactions on Parallel and Distributed Systems, 33(11):2661–2674, 2021.
[11] Yann Fraboni, Richard Vidal, Laetitia Kameni, and Marco Lorenzi. Clustered sampling: Low-variance and improved representativity
for clients selection in federated learning. In International Conference on Machine Learning, pages 3407–3416. PMLR, 2021.
[12] Brendan J Frey and Delbert Dueck. Clustering by passing messages between data points. Science, 2007.
[13] Avishek Ghosh, Justin Hong, Dong Yin, and Kannan Ramchandran. Robust federated learning in a heterogeneous environment.
arXiv preprint arXiv:1906.06629, 2019.
[14] Avishek Ghosh, Jichan Chung, Dong Yin, and Kannan Ramchandran. An efficient framework for clustered federated learning.
Advances in Neural Information Processing Systems, 33:19586–19597, 2020.
[15] Biyao Gong, Tianzhang Xing, Zhidan Liu, Wei Xi, and Xiaojiang Chen. Adaptive client clustering for efficient federated learning
over non-iid and imbalanced data. IEEE Transactions on Big Data, 2022.
[16] Jie Wen, Zhixia Zhang, Yang Lan, Zhihua Cui, Jianghui Cai, and Wensheng Zhang. A survey on federated learning: challenges
and applications. International Journal of Machine Learning and Cybernetics, 2023.
[17] Peter Kairouz, H Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Kallista Bonawitz,
Zachary Charles, Graham Cormode, Rachel Cummings, et al. Advances and open problems in federated learning. Foundations
and Trends in Machine Learning, 14(1–2):1–210, 2021.
[18] Yeongwoo Kim, Ezeddin Al Hakim, Johan Haraldson, Henrik Eriksson, José Mairton B da Silva, and Carlo Fischione. Dynamic
clustering in federated learning. In ICC 2021-IEEE International Conference on Communications, pages 1–6. IEEE, 2021.
[19] Krishna Yadav and B. B. Gupta. Clustering based rewarding algorithm to detect adversaries in federated machine learning based
IoT environment. In 2021 IEEE International Conference on Consumer Electronics (ICCE). IEEE, 2021.
[20] Viraj Kulkarni, Milind Kulkarni, and Aniruddha Pant. Survey of personalization techniques for federated learning. In 2020 Fourth
World Conference on Smart Trends in Systems, Security and Sustainability (WorldS4), pages 794–797. IEEE, 2020.
[21] Hunmin Lee and Daehee Seo. Fedlc: Optimizing federated learning in non-iid data via label-wise clustering. IEEE Access, 2023.
[22] Chengxi Li, Gang Li, and Pramod K Varshney. Federated learning with soft clustering. IEEE Internet of Things Journal, 9(10):
7773–7782, 2021.
[23] Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. Federated optimization in
heterogeneous networks. Proceedings of Machine learning and systems, 2:429–450, 2020.
[24] Yanli Li, Dong Yuan, Abubakar Sadiq Sani, and Wei Bao. Enhancing federated learning robustness in adversarial environment
through clustering non-iid features. Computers & Security, page 103319, 2023.
[25] Tingting Liang, Cheng Yuan, Cheng Lu, Youhuizi Li, Junfeng Yuan, and Yuyu Yin. Efficient one-off clustering for personalized
federated learning. Knowledge-Based Systems, page 110813, 2023.
[26] Guodong Long, Ming Xie, Tao Shen, Tianyi Zhou, Xianzhi Wang, and Jing Jiang. Multi-center federated learning: clients clustering
for better personalization. World Wide Web, 26(1):481–500, 2023.
[27] Yibo Luo, Xuefeng Liu, and Jianwei Xiu. Energy-efficient clustering to address data heterogeneity in federated learning. In ICC
2021-IEEE International Conference on Communications, pages 1–6. IEEE, 2021.
[28] Yishay Mansour, Mehryar Mohri, Jae Ro, and Ananda Theertha Suresh. Three approaches for personalization with applications
to federated learning. arXiv preprint arXiv:2002.10619, 2020.
[29] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning
of deep networks from decentralized data. In Artificial intelligence and statistics, pages 1273–1282. PMLR, 2017.
[30] Manan Mehta and Chenhui Shao. A greedy agglomerative framework for clustered federated learning. IEEE Transactions on
Industrial Informatics, 2023.
[31] Sudipto Mukherjee, Himanshu Asnani, Eugene Lin, and Sreeram Kannan. Clustergan: Latent space clustering in generative
adversarial networks. In Proceedings of the AAAI conference on artificial intelligence, volume 33, pages 4610–4617, 2019.
[32] Thien Duc Nguyen, Phillip Rieger, Roberta De Viti, Huili Chen, Björn B Brandenburg, Hossein Yalame, Helen Möllering, Hossein
Fereidooni, Samuel Marchal, Markus Miettinen, et al. FLAME: Taming backdoors in federated learning. In 31st USENIX Security
Symposium (USENIX Security 22), pages 1415–1432, 2022.
[33] Kilian Pfeiffer, Martin Rapp, Ramin Khalili, and Jörg Henkel. Federated learning for computationally-constrained heterogeneous
devices: A survey. ACM Computing Surveys, 2023.
[34] Yichen Ruan and Carlee Joe-Wong. Fedsoft: Soft clustered federated learning with proximal local updating. In Proceedings of the
AAAI Conference on Artificial Intelligence, volume 36, pages 8124–8131, 2022.
[35] Soham Sarkar and Anil K Ghosh. On perfect clustering of high dimension, low sample size data. IEEE transactions on pattern
analysis and machine intelligence, 42(9):2257–2272, 2019.
[36] Felix Sattler, Klaus-Robert Müller, and Wojciech Samek. Clustered federated learning: Model-agnostic distributed multitask
optimization under privacy constraints. IEEE transactions on neural networks and learning systems, 32(8):3710–3722, 2020.
[37] Ammar Tahir, Yongzhou Chen, and Prashanti Nilayam. Fedss: federated learning with smart selection of clients. arXiv preprint
arXiv:2207.04569, 2022.
[38] Alysa Ziying Tan, Han Yu, Lizhen Cui, and Qiang Yang. Towards personalized federated learning. IEEE Transactions on Neural
Networks and Learning Systems, 2022.
[39] Ye Lin Tun, Minh NH Nguyen, Chu Myaet Thwal, Jinwoo Choi, and Choong Seon Hong. Contrastive encoder pre-training–based
clustered federated learning for heterogeneous data. Neural Networks, 2023.
[40] Hongyi Wang, Mikhail Yurochkin, Yuekai Sun, Dimitris Papailiopoulos, and Yasaman Khazaeni. Federated learning with matched
averaging. arXiv preprint arXiv:2002.06440, 2020.
[41] Joel Wolfrath, Nikhil Sreekumar, Dhruv Kumar, Yuanli Wang, and Abhishek Chandra. Haccs: heterogeneity-aware clustered
client selection for accelerated federated learning. In 2022 IEEE International Parallel and Distributed Processing Symposium