Federated Meta-Learning For Few-Shot Fault Diagnosis With Representation Encoding
Abstract—Deep learning-based fault diagnosis (FD) approaches require a large amount of training data, which are

Index Terms—... security privacy, domain discrepancy, data scarcity.
Federated learning (FL) [8]–[14] empowers multiple clients to collaboratively train a global model without compromising data privacy. An FL method for machinery FD with a self-supervised learning scheme was proposed in [15]. The FL framework has also been utilized for class-imbalanced FD classification to facilitate the implementation of privacy-preserving functionalities [16]. However, in practical industrial environments, working conditions and equipment types vary significantly across different companies and change frequently. This means that domain discrepancy extends beyond the traditional demarcation between the training and testing stages: it involves the discrepancy between training clients and new clients, as well as inherent variations among individual training clients. As a result, the trained model cannot generalize well to out-of-distribution (OOD) data on new tasks with limited samples, because most learning algorithms rely heavily on the independent and identically distributed assumption on source/target data. Collecting and labeling sufficient data to address this issue is costly and impractical. Data-based approaches such as data sharing and augmentation work well, but may increase the risk of data privacy leakage under the FL framework [17].

In addressing challenges related to deep model construction and diverse data distributions, an approach with adaptive and independent learning rate design and structure optimization was proposed to enhance both the timeliness of FD and its adaptability to dynamic conditions [18]. A novel method combining 2-D-gcForest and L2,p-PCA was proposed to improve the feature representation for different data sources [19]. A distribution-invariant deep belief network was proposed to learn distribution-invariant features directly from raw vibration data [20]. Transfer learning (TL) is another way to tackle domain discrepancy problems [21], in which the knowledge from one or more tasks in the source domain is reused for other related tasks in the target domain. Shao et al. developed a fast and accurate FD framework using TL [22]. A federated TL framework with discrepancy-based weighted federated averaging was proposed to collaboratively train a good global FD model [23]. Meta-learning, a technique that also leverages previous knowledge to improve learning on new tasks [24], focuses on learning to learn across a broad range of tasks rather than on the specific source and target tasks of TL. In industrial scenarios with frequently changing working conditions and equipment types, training a meta-learning model with strong generalization capability can greatly meet practical demands. A novel meta-learning method based on model-agnostic meta-learning (MAML) [25] was proposed in [26] for FD in rolling bearings under varying working conditions with limited data. Hu et al. proposed a task-sequencing meta-learning method that sorts tasks from easy to difficult to obtain better knowledge adaptability [27]. Moreover, meta-learning can also be combined with semi-supervised learning, utilizing unlabeled data for better fault recognition [28].

However, as mentioned earlier, it is challenging to aggregate data from different entities and train models with the above centralized algorithms in real-world production environments. In practical and common scenarios, it is necessary to exploit privacy-preserving distributed training algorithms to address the issue of poor diagnosis performance on new tasks caused by domain discrepancy and data scarcity, which has not been fully researched. Furthermore, considering the inherent domain discrepancies among the data from various participants in FL, a fundamental question arises: how can we leverage this heterogeneity to strengthen the model's robustness when faced with unobserved tasks?

In this study, we tackle this challenge and propose a novel representation encoding-based federated meta-learning (REFML) framework for few-shot FD. REFML harnesses federated meta-learning (FML) and draws inspiration from representation learning for capturing discriminative feature representations [29]–[33]. It leverages the inherent heterogeneity among training clients by extracting meta-knowledge from different local diagnosis tasks and training a domain-invariant feature extractor in a privacy-preserving manner, effectively transforming this heterogeneity into an advantage for OOD generalization. Without compromising the private data of participating clients, the trained model achieves high performance with very few training samples when encountering new tasks, such as those involving previously unseen working conditions or equipment types, making it well-suited for practical industrial FD scenarios with domain discrepancy and data scarcity problems.

The main contributions of this paper are as follows:
1) We propose REFML, an innovative FML-based privacy-preserving method for few-shot FD, a relatively underexplored area in prior research. This approach consists of a novel training strategy based on representation encoding and meta-learning and an adaptive interpolation module.
2) We develop a novel training strategy based on representation encoding and meta-learning to harness the heterogeneity among training clients and improve OOD generalization in FL with limited training samples. With this strategy, the trained model is capable of capturing domain-invariant features and adapting well to unseen tasks to achieve high performance with limited data.
3) We design an adaptive interpolation method that calculates the optimal combination of the local and global models as the initialization of local training. It is capable of mitigating the negative effects of domain discrepancy for better model performance.
4) Experiments are conducted on two bearing datasets and one gearbox dataset. Compared with state-of-the-art methods such as FedProx, the proposed REFML framework increases accuracy by 2.17%-6.50% when generalizing to unseen working conditions and by 13.44%-18.33% when generalizing to unseen equipment types.

The rest of this paper is organized as follows. Section II and Section III introduce the preliminaries and the problem formulation. Section IV presents the proposed method in detail. Numerical experiments are conducted in Section V to verify the effectiveness of the proposed REFML framework. Section VI concludes this article.
II. PRELIMINARIES

A. Federated Learning

FL enables multiple clients to obtain a globally optimized model while safeguarding sensitive data. It generally consists of a central server and multiple clients. The central server manages multiple rounds of federated communication to obtain a global model, extracting valuable information from distributed clients without accessing their private data. Throughout this process, the only elements transmitted are the model parameters. Currently, the prevailing paradigm treats supervised horizontal FL as an empirical risk minimization problem, where the goal is to minimize the aggregated empirical loss, shown as

\min_W \sum_{u=1}^{U} \frac{|D_u|}{n} L_{D_u}(W),    (1)

where W represents the parameters of the global model and U is the total number of clients. D_u and |D_u| are the local dataset of client u and its size, respectively. The total number of samples is n = \sum_{u=1}^{U} |D_u|. L_{D_u}(W) is the empirical loss of client u in the form of an expected risk on the local dataset D_u, reflecting the model performance, given by

L_{D_u}(W) = \frac{1}{|D_u|} \sum_{(x,y)\in D_u} l(W(x); y),    (2)

where l(W(x); y) is a loss function that penalizes the distance of the model output W(x) from the label y.

B. Meta-Learning

Meta-learning, also known as learning to learn, is a technique aiming at enhancing performance on new tasks by utilizing prior knowledge from known tasks. In traditional machine learning, the objective is to train a high-performing model on specific tasks with a fixed algorithm. In MAML, by contrast, tasks Γ_i are sampled from a task distribution p(Γ), and the model parameters W are first adapted on the support set D^s_{Γ_i} of each task with learning rate α, yielding the adapted parameters

W_i' = W - \alpha \nabla_W L_{D^s_{\Gamma_i}}(W),

where L_{D^s_{\Gamma_i}}(W) = \frac{1}{|D^s_{\Gamma_i}|} \sum_{(x,y)\in D^s_{\Gamma_i}} l(W(x); y), |D^s_{\Gamma_i}| is the size of the support set, and l is the cross-entropy loss function.

Then, the performance of the adapted parameters W_i' on task Γ_i is evaluated on its query set D^q_{Γ_i} in the form of an empirical loss, which reflects the generalization ability of W. Hence, the optimization objective is

\min_W \sum_{\Gamma_i \sim p(\Gamma)} L_{D^q_{\Gamma_i}}(W_i').    (5)

The aggregated loss values of the tasks are used to update the model parameters W, given as

W = W - \beta \nabla_W \sum_{\Gamma_i \sim p(\Gamma)} L_{D^q_{\Gamma_i}}\big(W - \alpha \nabla_W L_{D^s_{\Gamma_i}}(W)\big),    (6)

where β is the meta-learning rate. The purpose of training on multiple tasks is to find a high-quality initial model. As shown in Fig. 1, W denotes the parameters of the model before updating, and ∇l1, ∇l2, and ∇l3 are the corresponding update directions of three training tasks. The objective of MAML is not to attain the best possible performance on a single task, which would mean reaching one of the three optimal weights W1*, W2*, and W3* of the three training tasks, but rather to converge on parameters that can swiftly adapt to similar, and especially unseen, tasks.
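The bi-level update in (5)-(6) can be made concrete with a short PyTorch-style sketch. The paper itself provides no implementation, so the following is only an illustration under stated assumptions: a classification model, task data already split into support/query tensors, and torch.func.functional_call available (PyTorch 2.x); the function name and default learning rates are ours.

import torch
import torch.nn.functional as F
from torch.func import functional_call

def maml_meta_step(model, tasks, alpha=0.01, beta=0.001):
    # tasks: iterable of (support_x, support_y, query_x, query_y) tensors.
    params = dict(model.named_parameters())
    meta_grads = {n: torch.zeros_like(p) for n, p in params.items()}

    for sx, sy, qx, qy in tasks:
        # Inner step: fast adaptation on the support set, W_i' = W - alpha * grad.
        s_loss = F.cross_entropy(functional_call(model, params, (sx,)), sy)
        grads = torch.autograd.grad(s_loss, tuple(params.values()), create_graph=True)
        adapted = {n: p - alpha * g for (n, p), g in zip(params.items(), grads)}

        # Outer objective: loss of the adapted parameters on the query set, cf. (5).
        q_loss = F.cross_entropy(functional_call(model, adapted, (qx,)), qy)
        task_grads = torch.autograd.grad(q_loss, tuple(params.values()))
        for n, g in zip(params.keys(), task_grads):
            meta_grads[n] += g

    # Meta-update of the shared initialization, cf. (6).
    with torch.no_grad():
        for n, p in params.items():
            p -= beta * meta_grads[n]

Because create_graph=True is set in the inner step, the outer gradient differentiates through the adaptation, matching the second-order update in (6).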
III. PROBLEM FORMULATION

Under the FL framework, suppose there are U training clients and V testing clients, whose datasets, collected under different working conditions or equipment types, are regarded as training tasks and testing tasks. It is noteworthy that clients' private data are not allowed to be shared, and each client has a relatively small amount of data. Let {D_u}_{u=1}^{U} and {D_v}_{v=U+1}^{U+V} denote the datasets of the training clients and testing clients, respectively. The dataset D_m = {(x_i, y_i)}_{i=1}^{|D_m|} (m = 1, 2, ..., U+V) of each client m is divided into a support set D_m^s = {(x_i, y_i)}_{i=1}^{|D_m^s|} and a query set D_m^q = {(x_i, y_i)}_{i=|D_m^s|+1}^{|D_m^s|+|D_m^q|}, where |D_m^s| + |D_m^q| = |D_m|. The vector x_i ∈ R^d is a d-dimensional real-valued feature vector regarded as the input of the model, and the scalar y_i ∈ {1, 2, 3, ..., N} is a class label, where N is the number of categories. The expected loss of the prediction made with the model parameters W_m of client m on its dataset D_m is defined as L_{D_m}(W_m), which can be computed by

L_{D_m}(W_m) = \frac{1}{|D_m|} \sum_{(x,y)\in D_m} l(W_m(x); y) = -\frac{1}{|D_m|} \sum_{(x,y)\in D_m} \sum_{c=1}^{N} 1_{[y=c]} \log\big([W_m(x)]_c\big),    (7)

where l(W_m(x); y) is the cross-entropy loss function and 1 is the indicator function.

In the meta-training phase, the meta-goal is to find parameters W* that perform well among training clients after fast adaptation, given by

W^* = \arg\min_W \sum_{u=1}^{U} L_{D_u^q}\big(W - \alpha \nabla_W L_{D_u^s}(W)\big).    (8)

In the meta-testing phase, the learned parameters W* are used to initialize the models {W_v}_{v=U+1}^{U+V} of the testing clients. Each model then uses a small number of samples (the support set) to quickly adapt to its task and is expected to achieve good diagnostic accuracy on the query set. The optimization objective during the fast adaptation phase can be expressed as

\min_{\{W_v\}_{v=U+1}^{U+V}} \sum_{v=U+1}^{U+V} L_{D_v^s}(W_v).    (9)

The problem can be formulated as an N-way K-shot classification task. The term N refers to the number of categories that a meta task needs to classify, while K represents the number of labeled samples available in the support set for each category. Thus, each task has N × K samples in the support set, which is equivalent to |D_m^s| = N × K (m = 1, 2, ..., U + V). All training clients use both the support set and the query set to train their models, and all testing clients use the support set and the query set to fine-tune and test the model, respectively.

IV. PROPOSED METHOD

This section presents the proposed REFML method for few-shot FD. It consists of two main components: a central server and local clients, which are divided into U training clients and V testing clients. As shown in Fig. 2, the server and clients collaborate through multiple communication rounds to jointly train a model that can adapt well to testing clients with very limited data. In each round, every training client downloads the global model, conducts adaptive interpolation, representation encoding, and meta-updating of the predictor, and uploads the model, in sequence. Every testing client downloads the global model, conducts adaptive interpolation, and fine-tunes the model. In the following subsections, the training process and the overall workflow of the proposed REFML method are introduced in detail.

A. Adaptive Interpolation

In general FL, clients download the global model parameters at the beginning of each round and initialize their local models with these parameters. Facing the data heterogeneity challenge in FL, a method of mixing global and local models has been proposed to balance generalization with personalization [34]. Inspired by this, we further exploit personal information in the model communication stage; that is, we use the optimal interpolation of the global model and the local model as the initial model rather than the global model itself, where the optimal interpolation weights are calculated adaptively using a gradient-based search on local data.

In communication round t, training client u receives the global model W_t and aims to find the best combination of the global model W_t and its local model W_{t-1}^u as the initialization of its local training, which is formulated as

W_t^u = A_t^u \odot W_t + (O - A_t^u) \odot W_{t-1}^u,    (10)

where W_t^u is the interpolated model of client u in the t-th communication round and ⊙ is the Hadamard product. A_t^u are the optimal interpolation weights of the global model W_t, whose elements all lie between 0 and 1, with the same shape as W_t. O is an all-ones matrix, and O - A_t^u are the interpolation weights of the local model. This arrangement ensures that the interpolation weights of the global and local models sum to 1 at every element position, thus normalizing the weighting process.

To find the optimal interpolation weights A_t^u, the interpolation weights A_{t-1}^u of the last round are used to compute a temporary combination W_t^{u'}, given by

W_t^{u'} = A_{t-1}^u \odot W_t + (O - A_{t-1}^u) \odot W_{t-1}^u.    (11)

The temporary combination W_t^{u'} is then evaluated on local data, and the interpolation weights are updated as

A_t^u = A_{t-1}^u - \delta \nabla_{A_{t-1}^u} L_{D_u}(W_t^{u'}),    (12)

where δ is the learning rate.

Finally, the best combination of the local and global models is computed using (10) with the updated interpolation weights A_t^u. Through this process, every client can adaptively acquire a model that is better tailored to its specific local objective.
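To make (10)-(12) concrete, the following is a minimal PyTorch-style sketch of one adaptive-interpolation step. It is an illustration rather than the authors' code: parameters are handled as name-keyed dicts, the global and local parameters are treated as constants, and the clamp that keeps the weights in [0, 1] is our assumption, since the paper does not state how this constraint is enforced.

import torch
import torch.nn.functional as F
from torch.func import functional_call

def adaptive_interpolation(model, global_params, local_params, weights, batch, delta=0.001):
    # weights: dict of tensors with the same shapes as the parameters,
    # holding the interpolation weights A of the previous round.
    x, y = batch
    weights = {n: a.detach().clone().requires_grad_(True) for n, a in weights.items()}

    # Temporary combination with last round's weights, cf. (11).
    mixed = {n: weights[n] * global_params[n] + (1 - weights[n]) * local_params[n]
             for n in weights}
    loss = F.cross_entropy(functional_call(model, mixed, (x,)), y)

    # Gradient-based update of the interpolation weights, cf. (12).
    grads = torch.autograd.grad(loss, tuple(weights.values()))
    with torch.no_grad():
        new_w = {n: (a - delta * g).clamp(0.0, 1.0)
                 for (n, a), g in zip(weights.items(), grads)}

    # Initialization of local training, cf. (10).
    init = {n: new_w[n] * global_params[n] + (1 - new_w[n]) * local_params[n]
            for n in new_w}
    return init, new_w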
B. Representation Encoding

As shown in Fig. 3, the related models consist of feature extraction layers and classification layers, which we refer to as the encoder and the predictor, respectively. We consider the
Fig. 2: Workflow of the proposed REFML framework. It contains a central server and multiple training and testing clients.
The server and clients collaborate through multiple communication rounds to train a model that can effectively adapt to testing
clients using extremely limited data with guaranteed privacy protection.
Algorithm 1: The proposed REFML method.
Input: number of communication rounds T; numbers of training and testing clients U, V; learning rates of the corresponding stages δ, η, α, β, γ.
Output: learned testing models {W^v}_{v=U+1}^{U+V}.
1:  Initialize the global model W_0 and the local models and interpolation weights of the training and testing clients {W_0^u}_{u=1}^{U}, {W_0^v}_{v=U+1}^{U+V}, {A_0^u}_{u=1}^{U}, {A_0^v}_{v=U+1}^{U+V}.
2:  for each round t = 1, 2, ..., T do
3:      for each training client u = 1, ..., U do
4:          Compute A_t^u, W_t^u with D_u using (10)-(12).
5:          E_t^u = E_t^u - η ∇_{E_t^u} L_{D_u}(E_t^u).
6:          P_t^u = P_t^u - β ∇_{P_t^u} L_{D_u^q}(P_t^u - α ∇_{P_t^u} L_{D_u^s}(P_t^u)).
7:      end
8:      for each testing client v = U+1, ..., U+V do
9:          Compute A_t^v, W_t^v with D_v^s using (10)-(12).
10:         Fine-tune the model:
11:         W_t^v = W_t^v - γ ∇_{W_t^v} L_{D_v^s}(W_t^v).
12:     end
13:     W_{t+1} = \sum_{u=1}^{U} (|D_u|/n) W_t^u.
14: end
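As a complement to Algorithm 1, the snippet below sketches the server-side weighted aggregation of (16) (line 13) and the skeleton of one communication round in PyTorch-style Python. It is illustrative only: the client-side calls are placeholders standing in for the adaptive interpolation, encoder training, and predictor meta-update described in this section, and all names are ours.

import copy
import torch

def aggregate(client_models, client_sizes):
    # Weighted averaging of the uploaded client models, cf. (16) / line 13 of Algorithm 1.
    # Integer buffers (e.g., BatchNorm counters) are simply copied from the first client.
    n = float(sum(client_sizes))
    states = [m.state_dict() for m in client_models]
    avg = copy.deepcopy(states[0])
    for key in avg:
        if torch.is_floating_point(avg[key]):
            avg[key] = sum((size / n) * s[key] for s, size in zip(states, client_sizes))
    return avg

def training_round(global_model, clients):
    # One communication round over the training clients (lines 3-7 and 13 of Algorithm 1).
    updated, sizes = [], []
    for client in clients:
        local = copy.deepcopy(global_model)
        client.adaptive_interpolation(local)   # placeholder for (10)-(12)
        client.update_encoder(local)           # placeholder for line 5
        client.meta_update_predictor(local)    # placeholder for line 6
        updated.append(local)
        sizes.append(client.num_samples)
    global_model.load_state_dict(aggregate(updated, sizes))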
adaptation. Then the parameters of the predictor will be updated using

P_t^u = P_t^u - \beta \nabla_{P_t^u} L_{D_u^q}(P_t^{u'}),    (15)

where P_t^{u'} denotes the predictor parameters after fast adaptation on the support set, α is the learning rate of fast adaptation, and β is the learning rate of meta-updating. Minimizing this loss means that the meta-parameters P_t^u will achieve better performance after rapid adaptation. In other words, acquiring these adaptation abilities through a series of training tasks constitutes the process of extracting meta-knowledge, which in turn aids in the development of a robust generalization capability. When facing unobserved tasks, informative representations are extracted by the encoder from the raw data, and the predictor with high adaptability utilizes these representations to make accurate diagnoses.

D. The Proposed REFML Algorithm

1) Local Training: The workflow of the proposed REFML framework is depicted in Fig. 2. In each communication round, the training clients engage in a series of operations, including downloading the global model, performing adaptive interpolation, representation encoding, and meta-updating of the predictor, followed by uploading the updated model to the server. On the other hand, the testing clients download the global model, conduct adaptive interpolation, and fine-tune the model using the support set.

2) Global Aggregation: At the end of each communication round, the server aggregates the models uploaded by the training clients using

W_{t+1} = \sum_{u=1}^{U} \frac{|D_u|}{n} W_t^u,    (16)

where n is the total number of training samples. Subsequently, the server disseminates the aggregated model to all clients in the subsequent round, and all participants collectively iterate through this process until convergence. During the federated communication process, the testing clients have the opportunity to acquire suitable interpolation weights and consistently leverage the domain-invariant feature extraction capability provided by the encoder in the global model, as well as the high-quality initialization parameters of the predictor, to achieve robust diagnostic accuracy on their respective local tasks.

3) The Complete Diagnostic Steps: The complete process of the proposed REFML method is illustrated in Algorithm 1, and the diagnostic steps are summarized below:
1) The training clients download the global model and execute adaptive interpolation;
2) The training clients train the encoder with their local data;
3) The training clients meta-update the predictor and upload the model to the server;
4) The testing clients download the global model and execute adaptive interpolation;
5) The testing clients fine-tune the model with the support set of their local data;
6) The server aggregates the models of all training clients;
7) Repeat steps 1)-6) until the end of training;
8) The testing clients test the model with the query set of their local data to get the diagnosis results.

In practice, the computational complexity of our method is comparable to that of typical FL systems, and the training phase does require a certain amount of time. However, it is important to highlight that once the training phase is completed, the inference process becomes highly convenient. This characteristic is particularly advantageous in real-world engineering deployments, where the efficiency of the inference phase frequently surpasses that of the training phase.

V. EXPERIMENTS

To comprehensively evaluate the effectiveness of the proposed REFML method, two bearing datasets and one gearbox dataset are employed for few-shot scenarios, and the t-distributed stochastic neighbor embedding (t-SNE) technique is applied to compare the feature extraction ability of different methods in the form of visualization.

A. Datasets

1) Case Western Reserve University (CWRU) Dataset: The CWRU dataset is a well-known open-source dataset in FD. Its four health states, namely one normal bearing (NA), inner fault (IF), ball fault (BF), and outer fault (OF), are further classified into ten categories according to three different fault sizes (7, 14, and 21 mils) of each fault state. Each health state corresponds to four distinct working conditions, characterized by varying loads and their corresponding speeds (1797, 1772, 1750, and 1730 rpm).

2) JiangNan University (JNU) Dataset: The JNU dataset is a bearing dataset acquired by Jiangnan University, China. Four kinds of health states, including NA, IF, BF, and OF, were examined. Vibration signals were sampled under three rotating speeds (600, 800, and 1000 rpm) corresponding to three working conditions.
Dataset    Fold    Meta-train conditions    Meta-test condition
CWRU       1       1, 2, 3                  0
CWRU       2       2, 3, 0                  1
CWRU       3       3, 0, 1                  2
CWRU       4       0, 1, 2                  3
JNU        1       1, 2                     0
JNU        2       2, 3                     1
JNU        3       3, 0                     2
PHM2009    1       1, 2, 3                  0
PHM2009    2       2, 3, 0                  1
PHM2009    3       3, 0, 1                  2
PHM2009    4       0, 1, 2                  3
3) PHM Data Challenge on 2009 (PHM2009) Dataset: The PHM2009 dataset is a generic industrial gearbox dataset provided by the PHM data challenge competition. A total of 14 experiments (eight on spur gears and six on helical gears) were performed. It contains five rotating speeds and two loads, corresponding to ten working conditions. Here, four experiments on spur gears are used as four categories to conduct the experiments. These experiments are carried out under four different working conditions, with speeds set at 30, 35, 40, and 45 Hz, all operating under high load conditions.

The details of the three datasets, including the corresponding sample sizes, are listed in Table I. However, it is worth noting that in our data scarcity setup, not all the data from the original datasets are utilized. The specific sample size configurations for the experiments can be found in the next subsection.

Fig. 4: Visualization of extracted representations using t-SNE. (a) Raw data. (b) FedAvg-FT. (c) FedProx-FT. (d) REFML.

B. Experiment Setup

1) Network Structure and Hyperparameters: In our experiments, the related FD models are based on a CNN, which contains three convolution units and two FC layers. Each convolution unit is composed of a one-dimensional convolution layer, a batch normalization layer, a ReLU activation layer, and a max-pooling layer. Its output is flattened into a one-dimensional tensor of length 4096 and then used as the input of the following FC layers. Note that the number of output units N of the last layer varies with the number of categories of the specific task.

The learning rates are obtained by searching within the range of 0.00001 to 0.001. In the federated communication process, the maximum number of communication rounds is 1000. In the few-shot scenario, the shot number in the query set is 10, and the shot number in the support set varies in the range of 1, 3, and 5. Some crucial hyperparameters are shown in Table II.
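The backbone just described can be written down as a short PyTorch sketch. The split into an encoder (the three convolution units) and a predictor (the two FC layers) mirrors the representation-encoding design of Section IV-B. The paper only fixes the layer types and the flattened length of 4096, so the kernel size, channel widths, raw input length (1024 samples), and hidden FC width (256) below are assumptions chosen solely to make the sketch self-consistent.

import torch
import torch.nn as nn

def conv_unit(in_ch, out_ch):
    # One convolution unit: 1-D conv + batch norm + ReLU + max-pooling.
    return nn.Sequential(
        nn.Conv1d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm1d(out_ch),
        nn.ReLU(),
        nn.MaxPool1d(kernel_size=2),
    )

class FDNet(nn.Module):
    # Encoder (three conv units) + predictor (two FC layers).
    def __init__(self, num_classes, in_length=1024):
        super().__init__()
        self.encoder = nn.Sequential(
            conv_unit(1, 16),     # (B, 1, 1024) -> (B, 16, 512)
            conv_unit(16, 32),    # -> (B, 32, 256)
            conv_unit(32, 32),    # -> (B, 32, 128)
            nn.Flatten(),         # -> (B, 4096)
        )
        self.predictor = nn.Sequential(
            nn.Linear(32 * (in_length // 8), 256),
            nn.ReLU(),
            nn.Linear(256, num_classes),  # N output units
        )

    def forward(self, x):
        return self.predictor(self.encoder(x))

# Example: a 10-way task (e.g., CWRU) with a batch of raw vibration segments.
model = FDNet(num_classes=10)
logits = model(torch.randn(8, 1, 1024))   # -> shape (8, 10)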
2) Baselines: We evaluate our approach against several state-of-the-art baselines, including FedAvg [8], FedProx [35], and their fine-tuned versions, denoted by FedAvg-FT and FedProx-FT, for a fair comparison. The fine-tuned versions of these baselines use the support set of the testing clients to fine-tune the model received from the server before testing it on the query set. All four baselines use all the data, including the support and query sets, on the training clients.
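The t-SNE comparison of extracted representations (cf. Fig. 4) is not accompanied by code in the paper; the snippet below is only a sketch of how such a visualization is commonly produced with scikit-learn, assuming a trained encoder and a labeled test batch, with the perplexity and other settings chosen as typical defaults.

import matplotlib.pyplot as plt
import torch
from sklearn.manifold import TSNE

@torch.no_grad()
def plot_tsne(encoder, x, y, title):
    # Project encoder representations to 2-D and color them by fault class.
    feats = encoder(x).cpu().numpy()
    emb = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(feats)
    plt.scatter(emb[:, 0], emb[:, 1], c=y.cpu().numpy(), cmap="tab10", s=8)
    plt.title(title)
    plt.show()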
Fig. 6: Visualization of the experiment results on unseen working conditions. The proposed REFML method outperforms other
methods and provides an increase of accuracy by 2.17% - 6.50% compared to the FedProx-FT method. (a) Test on the CWRU
dataset. (b) Test on the JNU dataset. (c) Test on the PHM2009 dataset.