Federated Meta-Learning For Few-Shot Fault Diagnosis With Representation Encoding
Abstract—Deep learning-based fault diagnosis (FD) approaches require a large amount of training data, which are

Index Terms—... security privacy, domain discrepancy, data scarcity.
Federated learning (FL) [8]–[14] empowers multiple clients to collaboratively train a global model without compromising data privacy. An FL method for machinery FD with a self-supervised learning scheme was proposed in [15]. The FL framework has also been utilized for class-imbalanced FD classification to facilitate the implementation of privacy-preserving functionalities [16]. However, in practical industrial environments, working conditions and equipment types vary significantly across different companies and change frequently. This means that domain discrepancy extends beyond the traditional demarcation between the training and testing stages: it involves the discrepancy between training clients and new clients, as well as inherent variations among individual training clients. As a result, the trained model cannot generalize well to out-of-distribution (OOD) data on new tasks with limited samples, because most learning algorithms rely heavily on the independent and identically distributed assumption on source/target data. Collecting and labeling sufficient data to address this issue is costly and impractical. Data-based approaches such as data sharing and augmentation work well, but may increase the risk of data privacy leakage under the FL framework [17].

In addressing challenges related to deep model construction and diverse data distributions, an approach with adaptive and independent learning rate design and structure optimization was proposed to enhance both the timeliness of FD and its adaptability to dynamic conditions [18]. A novel method combining 2-D-gcForest and L2,p-PCA was proposed to improve the feature representation for different data sources [19]. A distribution-invariant deep belief network was proposed to learn distribution-invariant features directly from raw vibration data [20]. Transfer learning (TL) is another way to tackle domain discrepancy problems [21], in which the knowledge from one or more tasks in the source domain is reused for other related tasks in the target domain. Shao et al. developed a fast and accurate FD framework using TL [22]. A federated TL framework with discrepancy-based weighted federated averaging was proposed to collaboratively train a good global FD model [23]. Meta-learning, a technique that also leverages previous knowledge to improve learning on new tasks [24], focuses on learning to learn across a broad range of tasks rather than on the specific source and target tasks of TL. In industrial scenarios with frequently changing working conditions and equipment types, training a meta-learning model with strong generalization capability can greatly meet practical demands. A novel meta-learning method based on model-agnostic meta-learning (MAML) [25] was proposed in [26] for FD in rolling bearings under varying working conditions with limited data. Hu et al. proposed a task-sequencing meta-learning method that sorts tasks from easy to difficult to obtain better knowledge adaptability [27]. Moreover, meta-learning can also be combined with semi-supervised learning, utilizing unlabeled data for better fault recognition [28].

However, as mentioned earlier, it is challenging to aggregate data from different entities and train models with the above centralized algorithms in real-world production environments. In practical and common scenarios, it is necessary to exploit privacy-preserving distributed training algorithms to address the issue of poor diagnosis performance on new tasks caused by domain discrepancy and data scarcity, which has not been fully researched. Furthermore, considering the inherent domain discrepancies among the data from various participants in FL, a fundamental question arises: how can we leverage this heterogeneity to strengthen the model's robustness when faced with unobserved tasks?

In this study, we tackle this challenge and propose a novel representation encoding-based federated meta-learning (REFML) framework for few-shot FD. REFML harnesses federated meta-learning (FML) and draws inspiration from representation learning for capturing discriminative feature representations [29]–[33]. It leverages the inherent heterogeneity among training clients by extracting meta-knowledge from different local diagnosis tasks and training a domain-invariant feature extractor in a privacy-preserving manner, effectively transforming this heterogeneity into an advantage for OOD generalization. Without compromising the private data of participating clients, the trained model achieves high performance with very few training samples when encountering new tasks, such as those involving previously unseen working conditions or equipment types, making it well-suited for practical industrial FD scenarios with domain discrepancy and data scarcity problems.

The main contributions of this paper are as follows:
1) We propose REFML, an innovative FML-based privacy-preserving method for few-shot FD, a relatively underexplored area in prior research. This approach consists of a novel training strategy based on representation encoding and meta-learning and an adaptive interpolation module.
2) We develop a novel training strategy based on representation encoding and meta-learning to harness the heterogeneity among training clients and improve OOD generalization in FL with limited training samples. With this strategy, the trained model is capable of capturing domain-invariant features and adapting well to unseen tasks to achieve high performance with limited data.
3) We design an adaptive interpolation method that calculates the optimal combination of the local and global models as the initialization of local training. It is capable of mitigating the negative effects of domain discrepancy for better model performance.
4) Experiments are conducted on two bearing datasets and one gearbox dataset. Compared with state-of-the-art methods such as FedProx, the proposed REFML framework increases accuracy by 2.17%-6.50% when generalizing to unseen working conditions and by 13.44%-18.33% when generalizing to unseen equipment types.

The rest of this paper is organized as follows. Section II and Section III introduce the preliminaries and the problem formulation. Section IV presents the proposed method in detail. Numerical experiments are conducted in Section V to verify the effectiveness of the proposed REFML framework. Section VI concludes this article.
II. PRELIMINARIES

A. Federated Learning

FL enables multiple clients to obtain a globally optimized model while safeguarding sensitive data. It generally consists of a central server and multiple clients. The central server manages multiple rounds of federated communication to obtain a global model, extracting valuable information from distributed clients without accessing their private data. Throughout this process, the only elements transmitted are the model parameters. Currently, the prevailing paradigm treats supervised horizontal FL as an empirical risk minimization problem, where the goal is to minimize the aggregated empirical loss, shown as

\min_W \sum_{u=1}^{U} \frac{|D_u|}{n} L_{D_u}(W),    (1)

where W represents the parameters of the global model and U is the total number of clients. D_u and |D_u| are the local dataset of client u and its size, respectively. The total number of samples is n = \sum_{u=1}^{U} |D_u|. L_{D_u}(W) is the empirical loss of client u in the form of an expected risk on the local dataset D_u, reflecting the model performance, given by

L_{D_u}(W) = \frac{1}{|D_u|} \sum_{(x,y)\in D_u} l(W(x); y),    (2)

where l(W(x); y) is a loss function that penalizes the distance of the model output W(x) from the label y.

B. Meta-Learning

Meta-learning, also known as learning to learn, is a technique aiming at enhancing performance on new tasks by utilizing prior knowledge from known tasks. In traditional machine learning, the objective is to train a high-performing model on specific tasks with a fixed algorithm. In MAML, by contrast, tasks Γ_i are sampled from a task distribution p(Γ), and the model parameters W are first adapted on the support set D^s_{Γ_i} of each task with learning rate α, yielding the adapted parameters

W_i' = W - \alpha \nabla_W L_{D^s_{\Gamma_i}}(W),

where L_{D^s_{\Gamma_i}}(W) = \frac{1}{|D^s_{\Gamma_i}|} \sum_{(x,y)\in D^s_{\Gamma_i}} l(W(x); y), |D^s_{\Gamma_i}| is the size of the support set, and l is the cross-entropy loss function.

Then, the performance of the adapted parameters W_i' on task Γ_i is evaluated on its query set D^q_{Γ_i} in the form of an empirical loss, which reflects the generalization ability of W. Hence, the optimization objective is

\min_W \sum_{\Gamma_i \sim p(\Gamma)} L_{D^q_{\Gamma_i}}(W_i').    (5)

The aggregated loss values of the tasks are used to update the model parameters W, given as

W = W - \beta \nabla_W \sum_{\Gamma_i \sim p(\Gamma)} L_{D^q_{\Gamma_i}}\big(W - \alpha \nabla_W L_{D^s_{\Gamma_i}}(W)\big),    (6)

where β is the meta-learning rate. The purpose of training on multiple tasks is to find a high-quality initial model. As shown in Fig. 1, W denotes the parameters of the model before updating, and ∇l1, ∇l2, and ∇l3 are the corresponding update directions of three training tasks. The objective of MAML is not to attain the best possible performance on a single task, which would mean reaching one of the three optimal weights W1*, W2*, and W3* of the three training tasks, but rather to converge on parameters that can swiftly adapt to similar, and especially unseen, tasks.
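The bi-level update in (5)-(6) can be made concrete with a short PyTorch-style sketch. The paper itself provides no implementation, so the following is only an illustration under stated assumptions: a classification model, task data already split into support/query tensors, and torch.func.functional_call available (PyTorch 2.x); the function name and default learning rates are ours.

import torch
import torch.nn.functional as F
from torch.func import functional_call

def maml_meta_step(model, tasks, alpha=0.01, beta=0.001):
    # tasks: iterable of (support_x, support_y, query_x, query_y) tensors.
    params = dict(model.named_parameters())
    meta_grads = {n: torch.zeros_like(p) for n, p in params.items()}

    for sx, sy, qx, qy in tasks:
        # Inner step: fast adaptation on the support set, W_i' = W - alpha * grad.
        s_loss = F.cross_entropy(functional_call(model, params, (sx,)), sy)
        grads = torch.autograd.grad(s_loss, tuple(params.values()), create_graph=True)
        adapted = {n: p - alpha * g for (n, p), g in zip(params.items(), grads)}

        # Outer objective: loss of the adapted parameters on the query set, cf. (5).
        q_loss = F.cross_entropy(functional_call(model, adapted, (qx,)), qy)
        task_grads = torch.autograd.grad(q_loss, tuple(params.values()))
        for n, g in zip(params.keys(), task_grads):
            meta_grads[n] += g

    # Meta-update of the shared initialization, cf. (6).
    with torch.no_grad():
        for n, p in params.items():
            p -= beta * meta_grads[n]

Because create_graph=True is set in the inner step, the outer gradient differentiates through the adaptation, matching the second-order update in (6).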
III. PROBLEM FORMULATION

Under the FL framework, suppose there are U training clients and V testing clients, whose datasets, collected under different working conditions or equipment types, are regarded as training tasks and testing tasks. It is noteworthy that clients' private data are not allowed to be shared, and each client has a relatively small amount of data. Let {D_u}_{u=1}^{U} and {D_v}_{v=U+1}^{U+V} denote the datasets of the training clients and testing clients, respectively. The dataset D_m = {(x_i, y_i)}_{i=1}^{|D_m|} (m = 1, 2, ..., U+V) of each client m is divided into a support set D_m^s = {(x_i, y_i)}_{i=1}^{|D_m^s|} and a query set D_m^q = {(x_i, y_i)}_{i=|D_m^s|+1}^{|D_m^s|+|D_m^q|}, where |D_m^s| + |D_m^q| = |D_m|. The vector x_i ∈ R^d is a d-dimensional real-valued feature vector regarded as the input of the model, and the scalar y_i ∈ {1, 2, 3, ..., N} is a class label, where N is the number of categories. The expected loss of the prediction made with the model parameters W_m of client m on its dataset D_m is defined as L_{D_m}(W_m), which can be computed by

L_{D_m}(W_m) = \frac{1}{|D_m|} \sum_{(x,y)\in D_m} l(W_m(x); y) = -\frac{1}{|D_m|} \sum_{(x,y)\in D_m} \sum_{c=1}^{N} 1_{[y=c]} \log\big([W_m(x)]_c\big),    (7)

where l(W_m(x); y) is the cross-entropy loss function and 1 is the indicator function.

In the meta-training phase, the meta-goal is to find parameters W* that perform well among training clients after fast adaptation, given by

W^* = \arg\min_W \sum_{u=1}^{U} L_{D_u^q}\big(W - \alpha \nabla_W L_{D_u^s}(W)\big).    (8)

In the meta-testing phase, the learned parameters W* are used to initialize the models {W_v}_{v=U+1}^{U+V} of the testing clients. Each model then uses a small number of samples (the support set) to quickly adapt to its task and is expected to achieve good diagnostic accuracy on the query set. The optimization objective during the fast adaptation phase can be expressed as

\min_{\{W_v\}_{v=U+1}^{U+V}} \sum_{v=U+1}^{U+V} L_{D_v^s}(W_v).    (9)

The problem can be formulated as an N-way K-shot classification task. The term N refers to the number of categories that a meta task needs to classify, while K represents the number of labeled samples available in the support set for each category. Thus, each task has N × K samples in the support set, which is equivalent to |D_m^s| = N × K (m = 1, 2, ..., U + V). All training clients use both the support set and the query set to train their models, and all testing clients use the support set and the query set to fine-tune and test the model, respectively.

IV. PROPOSED METHOD

This section presents the proposed REFML method for few-shot FD. It consists of two main components: a central server and local clients, which are divided into U training clients and V testing clients. As shown in Fig. 2, the server and clients collaborate through multiple communication rounds to jointly train a model that can adapt well to testing clients with very limited data. In each round, every training client downloads the global model, conducts adaptive interpolation, representation encoding, and meta-updating of the predictor, and uploads the model, in sequence. Every testing client downloads the global model, conducts adaptive interpolation, and fine-tunes the model. In the following subsections, the training process and the overall workflow of the proposed REFML method are introduced in detail.

A. Adaptive Interpolation

In general FL, clients download the global model parameters at the beginning of each round and initialize their local models with these parameters. Facing the data heterogeneity challenge in FL, a method of mixing global and local models has been proposed to balance generalization with personalization [34]. Inspired by this, we further exploit personal information in the model communication stage; that is, we use the optimal interpolation of the global model and the local model as the initial model rather than the global model itself, where the optimal interpolation weights are calculated adaptively using a gradient-based search on local data.

In communication round t, training client u receives the global model W_t and aims to find the best combination of the global model W_t and its local model W_{t-1}^u as the initialization of its local training, which is formulated as

W_t^u = A_t^u \odot W_t + (O - A_t^u) \odot W_{t-1}^u,    (10)

where W_t^u is the interpolated model of client u in the t-th communication round and ⊙ is the Hadamard product. A_t^u are the optimal interpolation weights of the global model W_t, whose elements all lie between 0 and 1, with the same shape as W_t. O is an all-ones matrix, and O - A_t^u are the interpolation weights of the local model. This arrangement ensures that the interpolation weights of the global and local models sum to 1 at every element position, thus normalizing the weighting process.

To find the optimal interpolation weights A_t^u, the interpolation weights A_{t-1}^u of the last round are used to compute a temporary combination W_t^{u'}, given by

W_t^{u'} = A_{t-1}^u \odot W_t + (O - A_{t-1}^u) \odot W_{t-1}^u.    (11)

The temporary combination W_t^{u'} is then evaluated on local data, and the interpolation weights are updated as

A_t^u = A_{t-1}^u - \delta \nabla_{A_{t-1}^u} L_{D_u}(W_t^{u'}),    (12)

where δ is the learning rate.

Finally, the best combination of the local and global models is computed using (10) with the updated interpolation weights A_t^u. Through this process, every client can adaptively acquire a model that is better tailored to its specific local objective.
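To make (10)-(12) concrete, the following is a minimal PyTorch-style sketch of one adaptive-interpolation step. It is an illustration rather than the authors' code: parameters are handled as name-keyed dicts, the global and local parameters are treated as constants, and the clamp that keeps the weights in [0, 1] is our assumption, since the paper does not state how this constraint is enforced.

import torch
import torch.nn.functional as F
from torch.func import functional_call

def adaptive_interpolation(model, global_params, local_params, weights, batch, delta=0.001):
    # weights: dict of tensors with the same shapes as the parameters,
    # holding the interpolation weights A of the previous round.
    x, y = batch
    weights = {n: a.detach().clone().requires_grad_(True) for n, a in weights.items()}

    # Temporary combination with last round's weights, cf. (11).
    mixed = {n: weights[n] * global_params[n] + (1 - weights[n]) * local_params[n]
             for n in weights}
    loss = F.cross_entropy(functional_call(model, mixed, (x,)), y)

    # Gradient-based update of the interpolation weights, cf. (12).
    grads = torch.autograd.grad(loss, tuple(weights.values()))
    with torch.no_grad():
        new_w = {n: (a - delta * g).clamp(0.0, 1.0)
                 for (n, a), g in zip(weights.items(), grads)}

    # Initialization of local training, cf. (10).
    init = {n: new_w[n] * global_params[n] + (1 - new_w[n]) * local_params[n]
            for n in new_w}
    return init, new_w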
B. Representation Encoding

As shown in Fig. 3, the related models consist of feature extraction layers and classification layers, which we refer to as the encoder and the predictor, respectively. We consider the
Fig. 2: Workflow of the proposed REFML framework. It contains a central server and multiple training and testing clients.
The server and clients collaborate through multiple communication rounds to train a model that can effectively adapt to testing
clients using extremely limited data with guaranteed privacy protection.
Algorithm 1: The proposed REFML method.
Input: number of communication rounds T; numbers of training and testing clients U, V; learning rates of the corresponding stages δ, η, α, β, γ.
Output: learned testing models {W^v}_{v=U+1}^{U+V}.
1:  Initialize the global model W_0 and the local models and interpolation weights of the training and testing clients {W_0^u}_{u=1}^{U}, {W_0^v}_{v=U+1}^{U+V}, {A_0^u}_{u=1}^{U}, {A_0^v}_{v=U+1}^{U+V}.
2:  for each round t = 1, 2, ..., T do
3:      for each training client u = 1, ..., U do
4:          Compute A_t^u, W_t^u with D_u using (10)-(12).
5:          E_t^u = E_t^u - η ∇_{E_t^u} L_{D_u}(E_t^u).
6:          P_t^u = P_t^u - β ∇_{P_t^u} L_{D_u^q}(P_t^u - α ∇_{P_t^u} L_{D_u^s}(P_t^u)).
7:      end
8:      for each testing client v = U+1, ..., U+V do
9:          Compute A_t^v, W_t^v with D_v^s using (10)-(12).
10:         Fine-tune the model:
11:         W_t^v = W_t^v - γ ∇_{W_t^v} L_{D_v^s}(W_t^v).
12:     end
13:     W_{t+1} = \sum_{u=1}^{U} (|D_u|/n) W_t^u.
14: end
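As a complement to Algorithm 1, the snippet below sketches the server-side weighted aggregation of (16) (line 13) and the skeleton of one communication round in PyTorch-style Python. It is illustrative only: the client-side calls are placeholders standing in for the adaptive interpolation, encoder training, and predictor meta-update described in this section, and all names are ours.

import copy
import torch

def aggregate(client_models, client_sizes):
    # Weighted averaging of the uploaded client models, cf. (16) / line 13 of Algorithm 1.
    # Integer buffers (e.g., BatchNorm counters) are simply copied from the first client.
    n = float(sum(client_sizes))
    states = [m.state_dict() for m in client_models]
    avg = copy.deepcopy(states[0])
    for key in avg:
        if torch.is_floating_point(avg[key]):
            avg[key] = sum((size / n) * s[key] for s, size in zip(states, client_sizes))
    return avg

def training_round(global_model, clients):
    # One communication round over the training clients (lines 3-7 and 13 of Algorithm 1).
    updated, sizes = [], []
    for client in clients:
        local = copy.deepcopy(global_model)
        client.adaptive_interpolation(local)   # placeholder for (10)-(12)
        client.update_encoder(local)           # placeholder for line 5
        client.meta_update_predictor(local)    # placeholder for line 6
        updated.append(local)
        sizes.append(client.num_samples)
    global_model.load_state_dict(aggregate(updated, sizes))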
adaptation. Then the parameters of the predictor will be updated using

P_t^u = P_t^u - \beta \nabla_{P_t^u} L_{D_u^q}(P_t^{u'}),    (15)

where P_t^{u'} denotes the predictor parameters after fast adaptation on the support set, α is the learning rate of fast adaptation, and β is the learning rate of meta-updating. Minimizing this loss means that the meta-parameters P_t^u will achieve better performance after rapid adaptation. In other words, acquiring these adaptation abilities through a series of training tasks constitutes the process of extracting meta-knowledge, which in turn aids in the development of a robust generalization capability. When facing unobserved tasks, informative representations are extracted by the encoder from the raw data, and the predictor with high adaptability utilizes these representations to make accurate diagnoses.

D. The Proposed REFML Algorithm

1) Local Training: The workflow of the proposed REFML framework is depicted in Fig. 2. In each communication round, the training clients engage in a series of operations, including downloading the global model, performing adaptive interpolation, representation encoding, and meta-updating of the predictor, followed by uploading the updated model to the server. On the other hand, the testing clients download the global model, conduct adaptive interpolation, and fine-tune the model using the support set.

2) Global Aggregation: At the end of each communication round, the server aggregates the models uploaded by the training clients using

W_{t+1} = \sum_{u=1}^{U} \frac{|D_u|}{n} W_t^u,    (16)

where n is the total number of training samples. Subsequently, the server disseminates the aggregated model to all clients in the subsequent round, and all participants collectively iterate through this process until convergence. During the federated communication process, the testing clients have the opportunity to acquire suitable interpolation weights and consistently leverage the domain-invariant feature extraction capability provided by the encoder in the global model, as well as the high-quality initialization parameters of the predictor, to achieve robust diagnostic accuracy on their respective local tasks.

3) The Complete Diagnostic Steps: The complete process of the proposed REFML method is illustrated in Algorithm 1, and the diagnostic steps are summarized below:
1) The training clients download the global model and execute adaptive interpolation;
2) The training clients train the encoder with their local data;
3) The training clients meta-update the predictor and upload the model to the server;
4) The testing clients download the global model and execute adaptive interpolation;
5) The testing clients fine-tune the model with the support set of their local data;
6) The server aggregates the models of all training clients;
7) Repeat steps 1)-6) until the end of training;
8) The testing clients test the model with the query set of their local data to get the diagnosis results.

In practice, the computational complexity of our method is comparable to that of typical FL systems, and the training phase does require a certain amount of time. However, it is important to highlight that once the training phase is completed, the inference process becomes highly convenient. This characteristic is particularly advantageous in real-world engineering deployments, where the efficiency of the inference phase frequently surpasses that of the training phase.

V. EXPERIMENTS

To comprehensively evaluate the effectiveness of the proposed REFML method, two bearing datasets and one gearbox dataset are employed for few-shot scenarios, and the t-distributed stochastic neighbor embedding (t-SNE) technique is applied to compare the feature extraction ability of different methods in the form of visualization.

A. Datasets

1) Case Western Reserve University (CWRU) Dataset: The CWRU dataset is a well-known open-source dataset in FD. Its four health states, namely one normal bearing (NA), inner fault (IF), ball fault (BF), and outer fault (OF), are further classified into ten categories according to three different fault sizes (7, 14, and 21 mils) of each fault state. Each health state corresponds to four distinct working conditions, characterized by varying loads and their corresponding speeds (1797, 1772, 1750, and 1730 rpm).

2) JiangNan University (JNU) Dataset: The JNU dataset is a bearing dataset acquired by Jiangnan University, China. Four kinds of health states, including NA, IF, BF, and OF, were examined. Vibration signals were sampled under three rotating speeds (600, 800, and 1000 rpm) corresponding to three working conditions.
Dataset    Fold    Meta-train conditions    Meta-test condition
CWRU       1       1, 2, 3                  0
CWRU       2       2, 3, 0                  1
CWRU       3       3, 0, 1                  2
CWRU       4       0, 1, 2                  3
JNU        1       1, 2                     0
JNU        2       2, 3                     1
JNU        3       3, 0                     2
PHM2009    1       1, 2, 3                  0
PHM2009    2       2, 3, 0                  1
PHM2009    3       3, 0, 1                  2
PHM2009    4       0, 1, 2                  3
3) PHM Data Challenge on 2009 (PHM2009) Dataset: The PHM2009 dataset is a generic industrial gearbox dataset provided by the PHM data challenge competition. A total of 14 experiments (eight on spur gears and six on helical gears) were performed. It contains five rotating speeds and two loads, corresponding to ten working conditions. Here, four experiments on spur gears are used as four categories to conduct the experiments. These experiments are carried out under four different working conditions, with speeds set at 30, 35, 40, and 45 Hz, all operating under high load conditions.

The details of the three datasets, including the corresponding sample sizes, are listed in Table I. However, it is worth noting that in our data scarcity setup, not all the data from the original datasets are utilized. The specific sample size configurations for the experiments can be found in the next subsection.

Fig. 4: Visualization of extracted representations using t-SNE. (a) Raw data. (b) FedAvg-FT. (c) FedProx-FT. (d) REFML.

B. Experiment Setup

1) Network Structure and Hyperparameters: In our experiments, the related FD models are based on a CNN, which contains three convolution units and two FC layers. Each convolution unit is composed of a one-dimensional convolution layer, a batch normalization layer, a ReLU activation layer, and a max-pooling layer. Its output is flattened into a one-dimensional tensor of length 4096 and then used as the input of the following FC layers. Note that the number of output units N of the last layer varies with the number of categories of the specific task.

The learning rates are obtained by searching within the range of 0.00001 to 0.001. In the federated communication process, the maximum number of communication rounds is 1000. In the few-shot scenario, the shot number in the query set is 10, and the shot number in the support set varies in the range of 1, 3, and 5. Some crucial hyperparameters are shown in Table II.
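The backbone just described can be written down as a short PyTorch sketch. The split into an encoder (the three convolution units) and a predictor (the two FC layers) mirrors the representation-encoding design of Section IV-B. The paper only fixes the layer types and the flattened length of 4096, so the kernel size, channel widths, raw input length (1024 samples), and hidden FC width (256) below are assumptions chosen solely to make the sketch self-consistent.

import torch
import torch.nn as nn

def conv_unit(in_ch, out_ch):
    # One convolution unit: 1-D conv + batch norm + ReLU + max-pooling.
    return nn.Sequential(
        nn.Conv1d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm1d(out_ch),
        nn.ReLU(),
        nn.MaxPool1d(kernel_size=2),
    )

class FDNet(nn.Module):
    # Encoder (three conv units) + predictor (two FC layers).
    def __init__(self, num_classes, in_length=1024):
        super().__init__()
        self.encoder = nn.Sequential(
            conv_unit(1, 16),     # (B, 1, 1024) -> (B, 16, 512)
            conv_unit(16, 32),    # -> (B, 32, 256)
            conv_unit(32, 32),    # -> (B, 32, 128)
            nn.Flatten(),         # -> (B, 4096)
        )
        self.predictor = nn.Sequential(
            nn.Linear(32 * (in_length // 8), 256),
            nn.ReLU(),
            nn.Linear(256, num_classes),  # N output units
        )

    def forward(self, x):
        return self.predictor(self.encoder(x))

# Example: a 10-way task (e.g., CWRU) with a batch of raw vibration segments.
model = FDNet(num_classes=10)
logits = model(torch.randn(8, 1, 1024))   # -> shape (8, 10)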
2) Baselines: We evaluate our approach against several state-of-the-art baselines, including FedAvg [8], FedProx [35], and their fine-tuned versions, denoted by FedAvg-FT and FedProx-FT, for a fair comparison. The fine-tuned versions of these baselines use the support set of the testing clients to fine-tune the model received from the server before testing it on the query set. All four baselines use all the data, including the support and query sets, on the training clients.
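The t-SNE comparison of extracted representations (cf. Fig. 4) is not accompanied by code in the paper; the snippet below is only a sketch of how such a visualization is commonly produced with scikit-learn, assuming a trained encoder and a labeled test batch, with the perplexity and other settings chosen as typical defaults.

import matplotlib.pyplot as plt
import torch
from sklearn.manifold import TSNE

@torch.no_grad()
def plot_tsne(encoder, x, y, title):
    # Project encoder representations to 2-D and color them by fault class.
    feats = encoder(x).cpu().numpy()
    emb = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(feats)
    plt.scatter(emb[:, 0], emb[:, 1], c=y.cpu().numpy(), cmap="tab10", s=8)
    plt.title(title)
    plt.show()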
Fig. 6: Visualization of the experiment results on unseen working conditions. The proposed REFML method outperforms other
methods and provides an increase of accuracy by 2.17% - 6.50% compared to the FedProx-FT method. (a) Test on the CWRU
dataset. (b) Test on the JNU dataset. (c) Test on the PHM2009 dataset.