Simple Is Good: Investigation of History-State Ensemble Deep Neural Networks
Neurocomputing
journal homepage: www.elsevier.com/locate/neucom
Article info

Article history: Received 23 September 2022; Revised 11 April 2023; Accepted 14 May 2023; Available online 24 May 2023.

Keywords: History-state ensemble (HSE); Ensemble learning; Deep neural networks; Average voting (AV); Fault diagnosis.

Abstract

The present work is motivated by the desire to find an efficient approach that can improve the performance of deep neural networks in a general sense. To this end, an easy-to-implement ensemble approach leveraging the 'local sub-optima' of deep networks is proposed in this paper, which is referred to as the history-state ensemble (HSE) method. We demonstrated that neural networks can naturally generate multiple diverse 'local sub-optima' during the training process, and that their combination can effectively improve the accuracy and stability of the single network. The merits of HSE are twofold: (1) it does not require additional training cost in order to acquire multiple base models, which is one of the main drawbacks limiting the generalization of ensemble techniques in deep learning; (2) it can be easily applied to any type of deep network without tuning of the network architecture. We proposed the simplest way to perform HSE and investigated more than 20 ensemble strategies for HSE as a comparison. Experiments are conducted on six datasets and eight popular network architectures for the case of rotating machinery fault diagnosis. It is demonstrated that the stability and accuracy of neural networks can be generally improved through the simplest ensemble strategy proposed in this paper.

© 2023 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
Nomenclature

AV	Average voting
BMA	Base model accuracy
CCALR	Cyclic cosine annealing learning rate
CLR	Constant learning rate
CNN	Convolutional Neural Network
CWRU	Case Western Reserve University
DBN	Deep Belief Network
DL	Deep Learning
FCNN	Fully connected neural network
GBDT	Gradient-Boosted Decision Trees
GRU	Gate Recurrent Unit Network
HSE	History-state ensemble
IFD	Intelligent fault diagnostics
KAt	Konstruktions- und Antriebstechnik
KNN	k-Nearest Neighbor
LR	Learning rate
LRD	Learning rate with decay
MBGD	Mini-batch gradient descent
ML	Machine Learning
PSO	Particle swarm optimization
SAE	Stacked Autoencoder
SVM	Support Vector Machine

⁎ Corresponding author. E-mail address: [email protected] (Y. Wang).

1. Introduction

With the recent rapid advent of artificial intelligence, intelligent fault diagnostics (IFD) techniques are burgeoning in new dimensions. IFD generally refers to the application of Machine Learning (ML) algorithms in fault diagnostics to reduce human labour demand and cost [1]. Among all branches of ML, Deep Learning (DL) technology has attracted the most attention for its capacity to extract implicit features automatically from training data through multi-layered hidden neurons. Moreover, the procedures of feature extraction and fault recognition in DL are integrated, which makes it suitable for dealing with the raw signal directly without any pre-processing. Since 2015, the area of DL applications has expanded rapidly; thus, the DL-based machine fault diagnostics framework has become the de facto mainstream of IFD [2]. Up to now, hundreds of deep networks have been designed and applied to the IFD of bearings to take advantage of the DL philosophy. Just to name some of them: back propagation neural networks [3,4], convolutional neural networks (CNN) [5–9], deep Boltzmann machines [10], deep belief networks (DBN) [11], stacked autoencoders [12–15], long short-term memory networks [16–18] and their modifications are among the most popular examples.
https://doi.org/10.1016/j.neucom.2023.126353
0925-2312/© 2023 The Author(s). Published by Elsevier B.V.
This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
Deep network architectures mainly comprise four elements: (i) the number of layers, (ii) the number of neurons in each layer, (iii) the activation function of each neuron, and (iv) the training algorithms [19]. Due to the nonlinear nature of the network, a slight change in one element may lead to a significantly different result. Therefore, substantial research efforts have been devoted to the design of network architectures. Increasingly, researchers seek to improve network performance by increasing network complexity or adopting more complicated approaches. Indeed, this has achieved remarkable success in many fields, including rotating machinery fault diagnostics. However, the performance of a carefully designed neural network may decline when it is applied to tasks different from those it was originally intended for. For example, for the multi-scale cascade convolutional neural network proposed in [10], the classification accuracy varied between 99.7% and 96.9% when the authors tested the same network with different scales of the convolutional kernel. Besides, neural networks are sensitive to training parameters such as the learning rate (LR) and the number of training epochs.

To this end, we endeavor to develop an efficient and robust approach that can improve the performance of deep networks in a general sense. The ensemble strategy is a promising technique for improving the performance of a single model [20]. Generally, it comprises a great number of weak classifiers, such as decision trees, and combines them to form a stronger model. Classical ensemble models include Random Forest [21], AdaBoost [22], XGBoost [23] and so on. Thomas et al. demonstrated in their investigation that an ensemble of different types of classifiers leads to an increase in accuracy [24]. In recent years, the ensemble strategy has gained increasing attention in DL. In [25], an ensemble strategy that combines a convolutional residual network, a deep belief network and a deep autoencoder was proved to be more effective than a single model. Cruz et al. proposed an evolutionary way to ensemble a fixed number of CNNs [5]. A multiobjective deep belief networks ensemble method was proposed in [26] for the remaining useful life estimation of engineering systems. Zhang et al. proposed an ensemble deep network architecture based on a sparse deep autoencoder, a denoising deep autoencoder and a contractive deep autoencoder in [27] for rotating machinery fault diagnostics. At the same time, Yang et al. [28] proposed another ensembled fault diagnostics scheme based on the Sparse Autoencoder [29] and the Denoising Autoencoder; bootstrap sampling and plurality voting were employed for the ensemble in that paper. In ref. [30], a new ensemble deep network was developed to combine the results generated by fifteen different activation functions. To make use of the advantages offered by different neural networks, Ma et al. applied a multiobjective optimization algorithm to integrate CNN, DBN and a deep autoencoder [25]. Zhang et al. proposed an ensemble learning model based on a convolutional neural network [31]; their method adds multiple classification layers to generate a 'poll matrix' before majority voting is used to produce the ensembled classification result. Li et al. proposed an optimal ensemble deep transfer network [32], where parameter transfer learning was used to initialize the starting points of several base models with different kernels of maximum mean discrepancy. However, these ensemble strategies involve intricate network architecture designs, training approaches, and additional hyperparameters to be tuned.

In this paper, we are not concerned with improving the structure or training of deep networks; instead, our goal is to exploit a potential of deep networks that has been overlooked. We show that, by virtue of adequately organized and jointly tuned commonly accessible tools, the performance of deep networks can be greatly improved in a general sense. In this connection, a simple yet efficient ensemble strategy referred to as the history-state ensemble (HSE) method is proposed in the present paper. The 'history-state' in this method is represented by the network weights after the network update in each training cycle. We hypothesize that network performance can be further improved in terms of stability and accuracy by incorporating these history-states, which are naturally produced after each model update. Details will be elaborated in the following section. In comparison with the ensembled deep networks introduced above, the advantages of the proposed method are twofold: (1) it improves the efficiency of the network by acquiring multiple base models without increasing training costs, and (2) the method is versatile and can be easily applied to different neural networks without a re-design or tuning of network architectures.

Some related works on HSE should be mentioned. Back in 2013, Xie et al. proposed a series of ensemble methods for DL, including vertical voting, horizontal voting, and the horizontal stacked ensemble, of which horizontal voting is similar in concept to the proposed HSE [33]. However, as there was not enough experimental verification in that work, the approach has not received considerable attention. To encourage deep networks to produce diverse base models, Huang et al. proposed that a cyclic cosine annealing learning rate (CCALR) helps deep networks attain multiple local minima, and the Snapshot ensemble method was introduced in their work [34]. There have been two applications of the Snapshot ensemble in the field of machinery fault diagnostics. Wen et al. [35] improved the original CCALR of the Snapshot ensemble and applied it to a CNN-based model for the fault diagnostics of bearings. Another improved Snapshot ensemble CNN was proposed in [36], using diversity regularization to encourage the diversity of the training history-states. Zhang et al. [37] proposed Snapshot boosting to improve the Snapshot ensemble.

These works are similar to our proposed method in that they all assume that deep networks can generate multiple base models during the training process for ensemble learning; in this sense, they all belong to HSE methods. However, the difference lies in how the ensemble strategy is designed. There are several significant gaps in the current research landscape that need to be filled. (1) While ensembled deep networks have been compared with a single deep network, the generalization of the ensemble method to various network architectures has yet to be demonstrated. (2) Different ensemble strategies have to be systematically compared. (3) HSE methods remain only scarcely studied, and their effectiveness needs to be examined and documented in a broader range of applications. This motivated us to address these specific issues on case studies relevant to practically significant problems of condition monitoring and fault diagnostics in rotating machinery. To this end, we investigated the efficiency and robustness of different HSE methods with various deep network architectures and propose the ensemble strategy with the best generalization ability. Compared to previous works, our contributions are summarised as follows.

(1) We propose the most straightforward yet efficient way to perform ensemble learning on deep networks. It has been experimentally confirmed that deep networks can generate multiple local sub-optima during the training process, and their combination improves the network performance on average without increasing training costs.

(2) We conduct an extensive investigation of various ensemble strategies on deep networks in order to address the aforementioned issues. Through the comparative experiments, our proposed method demonstrates the best overall performance. The conclusions derived from the study provide practitioners with guidelines for the rational selection of the ensemble strategy.
(3) We present a novel approach to investigate the tradeoff between the diversity and accuracy of base models, which provides new insights into the underlying mechanisms of ensemble learning.

The rest of the paper is organized as follows. The basic theory of the HSE method is described in Section 2. The experimental setup and contrastive methods are introduced in Section 3. Experimental results of the probed methods are discussed in Section 4. Conclusions are drawn in Section 5.

2. Methodology

The training of a neural network consists of two stages: (i) the forward propagation of input data from the first layer to the last layer of the network structure for feature extraction; and (ii) the backward propagation of the prediction error in the opposite direction for network optimization. A closed loop is formed through forward and backward propagation, and the updated weights after each cycle are referred to as one history-state of the network in this paper. In general, a single network model is trained for a fixed number of training cycles or iterations. However, it is hard to choose a 'magic' training budget to build a reliable model [34], as the local optimum acquired on the training dataset cannot represent the true local optimum of the whole dataset, especially in practical applications. A local optimum denotes a solution that is optimal within a neighboring set of candidate solutions. The loss value usually serves as a measure of how well or poorly the network behaves after each optimization cycle. If the network is properly trained, the evolutionary trend of the loss value has an 'elbow-like' shape, as shown in Fig. 1: it drops sharply during the first iterations and then decays slowly before finally converging at very low values. At the same time, the network accuracy shows the opposite trend.

We assume that (1) a neural network can produce multiple 'local sub-optima' with diversity during training, and (2) the combination of these local sub-optima improves the network performance in terms of stability and accuracy. We use the term 'local sub-optima' to indicate that the solution is merely approximate to the true 'local optima'. In this paper, they refer to all the history-states of the neural network after the training process enters a stable phase, i.e., when the loss value has converged to a small value. Therefore, to obtain base models for ensemble learning with a neural network, one only needs to preserve the history-states, or local sub-optima, that are produced after each training cycle, and the time cost of this is negligible. Since there is no requirement on the network structure, the method can be applied to all neural networks.

2.1. Feasibility analysis

The feasibility of the proposed ensemble method is analyzed below.

2.1.1. Improve network stability

The network stability is evaluated by the variability of performance under randomly initialized weights, with the variance or standard deviation used as the metric. With the previous hypothesis on local sub-optima, we assume the acquired base models are (i) unbiased, (ii) of the same variance, and (iii) uncorrelated with each other. The accuracy of the i-th base model is modeled as a_i = A + x, where A stands for the accuracy of the true local optimum to be estimated and x is random noise caused by the diversity of the base models. Based on the minimum-variance unbiased estimator, a reasonable estimator of a can be expressed as \bar{a} = \sum_{i=1}^{n} a_i / n = E(A), assuming x is Gaussian noise, with n the number of base models. Then, the variance of the estimator \bar{a} is expressed as:

\mathrm{var}(\bar{a}) = \frac{1}{n^2}\sum_{i=1}^{n} \mathrm{var}(a_i) = \frac{\mathrm{var}(a)}{n}    (1)

where var(a) denotes the variance of a single base model. Since var(a)/n < var(a), the variance of the ensemble network is smaller than that of the single network. Hence, it can be inferred that the ensemble method reduces the performance variability.

2.1.2. Improve network accuracy

For better illustration, we enumerate possible scenarios when comparing the single network with the ensemble network by introducing a simple example of binary classification, as displayed in Fig. 2. In cases 1-1 and 1-2, most base models produce the true labels, thus underlying the efficacy of the ensemble, whereas the single network might fail. Cases 2-1 and 2-2 are the opposite: the ensemble method fails if most base models produce false labels, whereas the single network might succeed. If the two scenarios occur with the same frequency, then the ensemble and the single network should have the same performance on average. Since deep networks are generally considered strong classifiers, we hypothesize that the first scenario is more frequent than the other one and that the ensemble method has a higher chance of outputting a true label, which remains to be proven in the following experiments.

2.2. The proposed ensemble strategy

Similarly to conventional ensemble techniques, the implementation of the HSE method faces two challenges: (1) a training strategy that encourages the generation of accurate base models with diversity [5,24]; and (2) a learning strategy that combines the acquired base models.
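To make the procedure concrete, the following is a minimal PyTorch-style sketch of how history-states could be collected once training has stabilized and then combined by average voting (AV). It is an illustration under stated assumptions, not the authors' reference implementation: the helper names, the per-epoch snapshot frequency, and the choice of a stable-phase threshold of 20 epochs are assumptions made for brevity.

```python
import copy
import torch
import torch.nn.functional as F

def train_with_history_states(model, loader, optimizer, epochs=50, stable_epoch=20):
    """Mini-batch gradient descent with a constant learning rate; a copy of the
    weights (one 'history-state') is kept after every epoch once training is
    assumed to have entered its stable phase."""
    history_states = []
    for epoch in range(epochs):
        model.train()
        for x, y in loader:
            optimizer.zero_grad()
            loss = F.cross_entropy(model(x), y)
            loss.backward()
            optimizer.step()
        if epoch >= stable_epoch:  # loss assumed to have converged to a small value
            history_states.append(copy.deepcopy(model.state_dict()))
    return history_states

@torch.no_grad()
def average_voting(model, history_states, x):
    """Average voting (AV): average the softmax outputs of all history-states."""
    probs = []
    for state in history_states:
        model.load_state_dict(state)
        model.eval()
        probs.append(F.softmax(model(x), dim=1))
    return torch.stack(probs).mean(dim=0).argmax(dim=1)
```

Since the snapshots are taken from a single training run, the only overhead relative to training a single network is the storage of the saved weights.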
Table 1
Summary of the use of CWRU dataset in recent publications.
Refs Load (HP) Frequency (kHz) Class Training/test samples Accuracy (%)
[4] * 12 12 4200/600 98.47
[8] 3 48 10 1680/720 98.46
[9] 0→0 / 0→1 / 0→2 12 12 2400/1200 98.5/97.1667/95.8333
[11] 0/1/2/3 12 10 1000/1000 99.57/99.32/99.54/99.43
[14] 0–3 12 8 * 99.94
[15] 0–3 * 12 1800/900 96.44
[18] 0–3 48 10 * 98.95
Table 2
Description of the gathered vibration signals from CWRU and KAt data centers.
Data base Index Fault location Fault level Data base Index Fault location Bearing code Fault type Fault level
CWRU 1 – 0 KAt 11 – K001 – 0
2 IR 0.007 12 IR KI16 Fatigue pitting 3
3 IR 0.014 13 IR KI17 Fatigue pitting 1
4 IR 0.021 14 IR KI18 Fatigue pitting 2
5 Ball 0.007 15 OR KA16 Fatigue pitting 2
6 Ball 0.014 16 OR KA22 Fatigue pitting 1
7 Ball 0.021 17 OR KA15 Indentations 1
8 OR 0.007 18 OR + IR KB23 Fatigue pitting 2
9 OR 0.014 19 OR + IR KB24 Fatigue pitting 3
10 OR 0.021 20 OR + IR KB27 Indentations 2
Table 3
Description of the extracted six datasets under various operating conditions.
(1) Fully connected neural network (FCNN).
(2) Stacked Autoencoder with supervised fine-tuning (SAE).
(3) Gaussian-Bernoulli Deep Belief Network with supervised fine-tuning (DBN).
(4) Gate Recurrent Unit Network for fault diagnostics (GRU) [40].
(5) One-dimensional LeNet5 (1D-LeNet5).
(6) One-dimensional AlexNet (1D-AlexNet).
(7) One-dimensional Deep Residual Network (1D-ResNet).
(8) One-dimensional Deep Densely Connected Network (1D-DenseNet).

Among the above eight architectures, (1)-(3) are fully connected networks with the shared architecture configuration of [1024, 512, 256, 128]. SAE and DBN are employed to learn unsupervised features from the data, followed by a supervised fine-tuning process for fault diagnostics; the unsupervised learning is trained for 30 epochs. (4) is a recurrent neural network architecture for fault diagnostics proposed in [40], which consists of a linear layer, a GRU layer and a classification module with a multilayer perceptron. To maintain the consistency of the input across all probed networks, the data length is set as 1024, which is converted to a [16 × 64] image, i.e., the sequential length of the GRU cell is 16. The linear layer maps the dimension of the raw image to [16 × 1024], and the output is fed into the GRU cell with a hidden size of 1024. The classification module consists of two hidden layers with 1024 neurons and an output layer. (5)-(8) are convolutional neural networks with increasing depth of hidden layers; the configuration of their network architectures is detailed in Tables 4 and 5.

To probe their performance under relatively fair circumstances, all networks are trained with the same hyperparameters. The Adam stochastic optimization algorithm is applied to update the network weights [55]. The epoch number and the batch size are set as 50. The initial LR is set as 0.001 and decreases with a decay rate of 0.001 for each iteration after 20 training epochs; the LR with decay is denoted as LRD in this work. A deep network without HSE is denoted as a single network. Single networks serve as references for the ensembled neural networks.

3.3.2. Comparison with different ensemble strategies

Several ensemble approaches designed for neural networks are investigated. They are roughly divided into two categories in this study: training strategies and learning strategies. A training strategy aims at encouraging the diversity of the acquired base models, while a learning strategy provides optimal solutions to combine these base models. The implementation details are presented below.

3.3.2.1. Training strategies

(1) MBGD + CLR (the proposed).
(2) CCALR in the Snapshot ensemble [34].
(3) Boosted framework.
(4) Snapshot boosting ensemble [37].

In (2), the models are warmed up for 20 epochs with an initial LR of 0.001; then the LR is scheduled with CCALR, and a snapshot, or history-state, of the model is taken when the LR reaches its minimum in each cycle. Snapshot numbers of 5 and 30 are investigated.
Table 4
The proposed 1D-CNN architectures with increasing network depths.
Note: C denotes the convolutional kernel, A denotes the average pooling kernel, S and P are the stride and padding number of each kernel, respectively. Flatten stands for the
concatenated layer. The output size is denoted by the notation ‘a@b’, where a represents the length of the output vector, and b is the number of output channels.
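To illustrate the notation in the note above, the parameter strings used in Tables 4 and 5 (e.g. 'C15; S1; P7' or 'A5; S2; P0') map onto standard one-dimensional layers as in the short sketch below; the channel counts are placeholders, since the table bodies with the exact output sizes are not reproduced here.

```python
import torch.nn as nn

# C15; S1; P7 -> 1-D convolution with kernel size 15, stride 1, padding 7
conv = nn.Conv1d(in_channels=16, out_channels=16, kernel_size=15, stride=1, padding=7)

# A5; S2; P0 -> 1-D average pooling with kernel size 5, stride 2, no padding
pool = nn.AvgPool1d(kernel_size=5, stride=2, padding=0)
```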
Table 5
Description of RL, DL, and TL.

Layers  Description               Parameters
RL      Residual block            C15; S1; P7 / C15; S1; P7
DL      Densely connected block   C1; S1; P0 / C15; S1; P7
TL      Transition layer          C1; S1; P0 / A5; S2; P0

3.3.2.2. Learning strategies

(5) AV.
(6) Ranking voting.
(7) Weighted voting + PSO.
(8) Selective voting + PSO.

In (6), a validation set is used to evaluate the performance of the recorded base models, and only the base models with the top-n classification accuracy on the validation set participate in the voting. In this work, 10% of the training samples are randomly extracted as a validation set, and n is set as 3. PSO algorithms serve as the meta-learner to implement (7) and (8). The swarm size and the number of iterations are set as 100. The acceleration coefficients c1 and c2 are set as 0.5, and the inertia weight is 0.9. In (7), particle swarm optimization (PSO) is utilized to adaptively learn the weights of all base models based on their performance on the validation set, and the prediction is given by the weighted base models. In (8), only the base models with positive weights participate in the voting, and the prediction is given by the averaged output of the selected base models.

Common time- and frequency-domain features are extracted from the raw signal as the input of each shallow model: root mean square, skewness, kurtosis, shape factor, crest factor, impulse factor, margin factor, power, mean frequency, RMSF, and RVF. In model (1), the number of neighbors is set as 1. In (2), the Gaussian kernel is employed, and the kernel coefficient is defined by 1/(L·var(X)), where L denotes the number of classes and var(X) represents the variance of the input X; the regularization parameter is 1. DT serves as the base model for (3)-(6), and the number of base models is set as 50. In (4), the LR value is 0.5, and the maximum depth of the individual model is 3. In (5), the LR is 0.5, and the maximum depth of each individual tree is 5. In (6), the LR value is 0.9, the L1 regularization term is 0.1, and the maximum depth of each individual tree is 5.
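As a rough illustration of the feature set listed above, a few of the time- and frequency-domain features could be computed as in the following NumPy sketch. The exact definitions used in the paper (in particular for RMSF and RVF) are not spelled out, so the formulas below are common textbook versions and should be read as assumptions.

```python
import numpy as np

def time_frequency_features(x, fs):
    """A subset of the time- and frequency-domain features used as input to the
    shallow learning models (common textbook definitions)."""
    rms = np.sqrt(np.mean(x ** 2))
    skewness = np.mean((x - x.mean()) ** 3) / x.std() ** 3
    kurtosis = np.mean((x - x.mean()) ** 4) / x.std() ** 4
    shape_factor = rms / np.mean(np.abs(x))
    crest_factor = np.max(np.abs(x)) / rms
    impulse_factor = np.max(np.abs(x)) / np.mean(np.abs(x))
    margin_factor = np.max(np.abs(x)) / np.mean(np.sqrt(np.abs(x))) ** 2
    power = np.mean(x ** 2)

    spectrum = np.abs(np.fft.rfft(x)) ** 2               # power spectrum
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    mean_freq = np.sum(freqs * spectrum) / np.sum(spectrum)
    rmsf = np.sqrt(np.sum(freqs ** 2 * spectrum) / np.sum(spectrum))
    rvf = np.sqrt(np.sum((freqs - mean_freq) ** 2 * spectrum) / np.sum(spectrum))

    return np.array([rms, skewness, kurtosis, shape_factor, crest_factor,
                     impulse_factor, margin_factor, power, mean_freq, rmsf, rvf])
```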
Table 6
Fault diagnostics accuracy of shallow learning algorithms based on time and frequency features.
Models Datasets
A B C D E F
KNN 0.8658 0.8085 0.8085 0.7395 0.7059 0.6609
SVM 0.9106 0.8362 0.8642 0.8053 0.7592 0.7035
RF 0.9251 0.8303 0.8360 0.7891 0.8084 0.7047
Adaboost 0.8802 0.7945 0.8080 0.7332 0.7374 0.6500
GBDT 0.9195 0.8271 0.8336 0.7716 0.7916 0.6874
XGboost 0.8716 0.7728 0.7901 0.7492 0.6910 0.6193
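Before turning to the results, the ranking-voting learning strategy of Section 3.3.2.2 (strategy (6)) can also be sketched in a few lines: each recorded history-state is scored on a held-out validation split, and only the top-n states vote by averaging their softmax outputs. The function and variable names are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ranking_voting(model, history_states, x_val, y_val, x_test, top_n=3):
    """Ranking voting: keep only the top-n history-states, ranked by validation
    accuracy, and average their softmax outputs on the test data."""
    scored = []
    for state in history_states:
        model.load_state_dict(state)
        model.eval()
        acc = (model(x_val).argmax(dim=1) == y_val).float().mean().item()
        scored.append((acc, state))
    scored.sort(key=lambda pair: pair[0], reverse=True)   # best validation accuracy first

    probs = []
    for _, state in scored[:top_n]:
        model.load_state_dict(state)
        probs.append(F.softmax(model(x_test), dim=1))
    return torch.stack(probs).mean(dim=0).argmax(dim=1)
```

Setting top_n equal to the number of history-states recovers plain average voting, so AV can be seen as the unranked special case of this strategy.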
4. Results and discussion

The performance of a carefully designed network architecture might decrease on other tasks. However, the model performance can be generally improved by HSE regardless of the network architectures and datasets, demonstrating the general capability of the proposed method to be self-adapted to different network structures.
Fig. 3. Loss value of different training methods, exemplified by 1D-LeNet5 of trial 1 on dataset A. (a) LRD: initially set as 0.001 and then decreasing with a decay rate of 0.001 for each iteration after 20 epochs. (b) CLR + MBGD with a constant value of 0.001 (the proposed method). (c) CCALR with a Snapshot number of 5. (d) Boosting + CLR with a constant value of 0.001; 10% of the training samples are randomly extracted as the validation set.
Fig. 4. The average accuracy of the selected four network architectures on datasets A-F. Plots are sorted by the descending order of the average accuracy of the four models on
each dataset.
Fig. 5. The averaged classification accuracy and standard deviation of DHSEs with various ensemble strategies and single model on datasets A-E and different network
architectures. (a) Average classification accuracy in descending order. (b) Average standard deviation in descending order.
Fig. 6. Measure of diversity and mean accuracy of base models. The picture is
sorted by the descending order of the average diversity of the probed network
architectures and datasets.
Training epochs ranging from 10 to 100 are examined, as displayed in Fig. 8; the example is taken from 1D-LeNet5. The red line represents the accuracy of the proposed method under different training epochs. The black line represents the accuracy of a single network at the corresponding training epochs, while the blue line is the moving-average accuracy of the single network with a window length of 10. One can observe that the ensemble network shows less fluctuation in accuracy across different training epochs. The blue and red lines show similar performance variability; however, the accuracy of the proposed method is maintained on the upper envelope of the single network.

4.4. Computational cost

As has been noted, the DHSE does not increase the training time of the model. However, the test time is inevitably increased to t × N, where t denotes the test time of a single network and N represents the number of base models. Table 7 presents the training and test time of 1D-LeNet5 using the 'MBGD + CLR' training method on dataset A. The training cost of 'Weighted Voting' and 'Selective Voting' is increased because of the use of PSO algorithms. Although the test time is increased, it remains small for every single sample.

5. Conclusion

In this paper, ensemble techniques combining the 'history-states' generated during network training are denoted as HSE methods. The proposed methodology aims at enhancing the reliability of bearing fault diagnostics by introducing an effective and user-friendly HSE strategy, which is evaluated across various network architectures. The experimental results reveal that deep networks can produce multiple base models for ensemble learning using a combination of the MBGD, CLR, and AV methods once the training process reaches a stable phase. Compared with peer methods, the proposed ensemble strategy benefits from simplicity, which is evident from the following aspects:

(1) Ensemble strategy: this work integrates existing concepts in an enhanced accuracy-improving workflow. The use of de facto accepted tools enhances the method's accessibility and applicability, making it an easy-to-implement option for practitioners seeking to improve model performance.

(2) Selection of base models: the proposed method differs from many peer methods, like the Snapshot ensemble, that struggle to encourage neural networks to reach different local optima. Instead, the proposed method focuses on utilizing 'local sub-optima' to achieve its goals. By doing so, it avoids the intricate task of defining qualified base models, resulting in reduced complexity.

(3) Hyperparameters: the proposed method requires fewer hyperparameters to be tuned and exhibits robustness to the selection of hyperparameters, as shown by the parameter analysis.

Comparing the experimental findings with peer methods demonstrates that the proposed 'MBGD + CLR + AV' strategy exhibits better classification accuracy while having a lower standard deviation,
thus showing the relative effectiveness and robustness of the proposed methodology.

Nevertheless, the efficiency of HSE methods still needs to be investigated in a broader range of applications, which is the focus of our further study. Up to now, we have successfully tested the proposed methodology in several other applications, including unsupervised early fault detection and the problem of streaming data with emerging new classes; the reader can find more details in [41,42].

CRediT authorship contribution statement

Yu Wang: Conceptualization, Methodology, Software, Investigation, Writing – original draft. Alexey Vinogradov: Writing – review & editing, Supervision, Funding acquisition.

Data availability

Data will be made available on request.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

The financial support from the Norwegian Research Council through the RCN Project No 296236 is gratefully appreciated.

Appendix

1. Cyclic cosine annealing learning rate

The Snapshot ensemble adopted a cyclic cosine annealing learning rate (CCALR) schedule to encourage the model to reach multiple local minima during training [34]. The LR is lowered at a very fast pace at first, encouraging the neural network to converge towards its local minimum. Then the optimization continues at the initial LR, and the procedure is repeated several times. A shifted cosine function is used to obtain the LR at each iteration, as described mathematically below:

\alpha(t) = \frac{\alpha_0}{2}\left[\cos\left(\frac{\pi \,\mathrm{mod}(t-1,\, \lceil T/M \rceil)}{\lceil T/M \rceil}\right) + 1\right]    (8)

where \alpha(t) denotes the LR at iteration number t, \alpha_0 is the initial LR, T represents the total number of training iterations, and M is the number of cycles the procedure is repeated. A 'snapshot' of the model is taken when the LR reaches its minimum in each cycle; thus, a total of M models are acquired. The snapshot of the model is also referred to as the history-state of the model in this work. Therefore, the Snapshot ensemble can be regarded as an implementation of DHSE with a scheduled LR, and CCALR is a type of training strategy.

2. Boosted training strategy

The boosting technique trains a number of weak learners sequentially through an iterative arrangement of the training samples to form a stronger model. The technique gives larger weights to those samples that were misclassified by the previous weak learners; in this way, the classifiers are supposed to have less overlap in the set of samples they misclassify. A boosted framework for the neural network was proposed in [37]; the data distribution of each classifier is arranged as follows:

W_t(i) = \frac{1/n}{Z_{t-1}}\, e^{-\beta_{t-1} a}, \quad i = 1, 2, \ldots, n    (9)

\beta_t = \frac{1}{2}\log\frac{1-\epsilon_t}{\epsilon_t} + \frac{1}{10}\log(k-1)    (10)

\epsilon_t = \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}\left(f_t(x_i) \neq y_i\right)    (11)

a = \begin{cases} 1, & \text{if } f_t(x_i) = y_i \\ -1, & \text{if } f_t(x_i) \neq y_i \end{cases}    (12)

where W_t(i) denotes the weight of the i-th sample for the t-th classifier and \epsilon_t counts the fraction of misclassified samples. A higher value of \epsilon_t yields a smaller value of the coefficient \beta_t, which is further reflected in a higher value of the weight. Z_t is a normalization factor, defined as \sum_{i=1}^{n} e^{-\beta_{t-1} y_i f(x_i)}. The differences between the boosted framework for a deep network and AdaBoost are: (1) a validation set is used to compute \epsilon_t, while conventionally it is calculated on all training samples; (2) the weights W_t are updated from the uniform distribution 1/n for each classifier, whereas in the conventional procedure they are updated from W_{t-1} sequentially; (3) the boosted framework uses a meta-learner to combine the weak classifiers with a validation set.

3. Weighted learning strategy

The weighted ensemble output is expressed as:

E = \sum_{i=1}^{N} w_i \,\frac{\exp\left(b^i + W_j^i H\right)}{\sum_{j=1}^{c} \exp\left(b^i + W_j^i H\right)}    (13)

Compared with AV, weighted voting applies a meta-learner to adaptively learn the weight w_i of each base model.

4. The detailed experimental results on datasets A-F are presented in Tables 8-13.
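As a complement to Eq. (8), the cyclic cosine annealing schedule can be implemented as the short function below. This is a sketch only: the function name and the 1-based iteration convention are assumptions, and the snapshot bookkeeping follows the description above rather than the reference code of [34].

```python
import math

def ccalr(t, total_iters, cycles, lr0=0.001):
    """Cyclic cosine annealing learning rate of Eq. (8).

    t           -- current iteration (1-based)
    total_iters -- total number of training iterations T
    cycles      -- number of cycles M; one snapshot is taken per cycle
    lr0         -- initial learning rate alpha_0
    """
    period = math.ceil(total_iters / cycles)
    return lr0 / 2.0 * (math.cos(math.pi * ((t - 1) % period) / period) + 1.0)

# A snapshot (history-state) would be stored whenever the schedule reaches its
# minimum, i.e. at the last iteration of each cycle.
```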
Table 8
Average accuracy of the probed methods on dataset A.
Table 9
Average accuracy of the probed methods on dataset B.
Table 10
Average accuracy of the probed methods on dataset C.
Table 11
Average accuracy of the probed methods on dataset D.
Table 12
Average accuracy of the probed methods on dataset E.
Table 13
Average accuracy of the probed methods on dataset F.
Yu Wang received the M.M. degree in Management Science and Engineering from the National Research Base of Intelligent Manufacturing Service, Chongqing Technology and Business University, Chongqing, China, in 2020. She is currently working towards the Ph.D. degree in the Department of Mechanical and Industrial Engineering, Norwegian University of Science and Technology, Trondheim, Norway. She has worked on the intelligent fault diagnosis of rotating machinery with machine learning since 2017.

Alexey Vinogradov majored in acoustic emission signal processing, data mining, and failure analysis and prediction. He received his Ph.D. degree in physics and mathematics from the A.F. Ioffe Physical-Technical Institute of St. Petersburg, USSR, in 1988. Since 1992 he has held several academic positions in Japan and Norway. Since 2023, he has served as a distinguished professor of the Magnesium Research Center of Kumamoto University, Japan.