Simple Is Good: Investigation of History-State Ensemble Deep Neural Networks
Neurocomputing
journal homepage: www.elsevier.com/locate/neucom
Article info

Article history: Received 23 September 2022; Revised 11 April 2023; Accepted 14 May 2023; Available online 24 May 2023.

Keywords: History-state ensemble (HSE); Ensemble learning; Deep neural networks; Average voting (AV); Fault diagnosis.

Abstract

The present work is motivated by the desire to find an efficient approach that can improve the performance of deep neural networks in a general sense. To this end, an easy-to-implement ensemble approach leveraging the 'local sub-optima' of deep networks is proposed in this paper, which is referred to as the history-state ensemble (HSE) method. We demonstrated that neural networks can naturally generate multiple diverse 'local sub-optima' during the training process, and that their combination can effectively improve the accuracy and stability of the single network. The merits of HSE are twofold: (1) it does not require additional training cost in order to acquire multiple base models, which is one of the main drawbacks limiting the generalization of ensemble techniques in deep learning; (2) it can be easily applied to any type of deep network without tuning of the network architecture. We proposed the simplest way to perform HSE and investigated more than 20 ensemble strategies for HSE as a comparison. Experiments are conducted on six datasets and eight popular network architectures for the case of rotating machinery fault diagnosis. It is demonstrated that the stability and accuracy of neural networks can be generally improved through the simplest ensemble strategy proposed in this paper.

© 2023 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
Nomenclature

AV	Average voting
BMA	Base model accuracy
CCALR	Cyclic cosine annealing learning rate
CLR	Constant learning rate
CNN	Convolutional Neural Network
CWRU	Case Western Reserve University
DBN	Deep Belief Network
DL	Deep Learning
FCNN	Fully connected neural network
GBDT	Gradient-Boosted Decision Trees
GRU	Gate Recurrent Unit Network
HSE	History-state ensemble
IFD	Intelligent fault diagnostics
KAt	Konstruktions- und Antriebstechnik
KNN	k-Nearest Neighbor
LR	Learning rate
LRD	Learning rate with decay
MBGD	Mini-batch gradient descent
ML	Machine Learning
PSO	Particle swarm optimization
SAE	Stacked Autoencoder
SVM	Support Vector Machine

⁎ Corresponding author. E-mail address: [email protected] (Y. Wang).

1. Introduction

With the recent rapid advent of artificial intelligence, intelligent fault diagnostics (IFD) techniques are burgeoning in new dimensions. IFD generally refers to the application of Machine Learning (ML) algorithms in fault diagnostics to reduce human labour demand and cost [1]. Among all branches of ML, Deep Learning (DL) technology has attracted the most attention for its capacity to extract implicit features automatically from training data through multi-layered hidden neurons. Moreover, the procedures of feature extraction and fault recognition in DL are integrated, which makes it suitable for dealing with the raw signal directly without any pre-processing. Since 2015, the area of DL applications has expanded rapidly; thus, the DL-based machine fault diagnostics framework has become the de facto mainstream of IFD [2]. Up to now, hundreds of deep networks have been designed and applied to the IFD of bearings to take advantage of the DL philosophy. Just to name some of them: back propagation neural networks [3,4], convolutional neural networks (CNN) [5–9], deep Boltzmann machines [10], deep belief networks (DBN) [11], stacked autoencoders [12–15], long short-term memory networks [16–18] and their modifications are among the most popular examples.
https://doi.org/10.1016/j.neucom.2023.126353
0925-2312/© 2023 The Author(s). Published by Elsevier B.V.
This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
Deep network architectures mainly comprise four elements: (i) the number of layers, (ii) the number of neurons in each layer, (iii) the activation function of each neuron, and (iv) the training algorithms [19]. Due to the nonlinear nature of the network, a slight change in one element may lead to a significantly different result. Therefore, substantial research efforts have been devoted to the design of network architectures. Increasingly, researchers seek to improve network performance by increasing network complexity or adopting more complicated approaches. Indeed, this has achieved remarkable success in many fields, including rotating machinery fault diagnostics. However, the performance of a carefully designed neural network may decline when it is applied to tasks different from those it was originally intended for. For example, for the multi-scale cascade convolutional neural network proposed in [10], the classification accuracy varied between 99.7% and 96.9% when the authors tested the same network with different scales of the convolutional kernel. Besides, neural networks are sensitive to training parameters such as the learning rate (LR) and the number of training epochs.

To this end, we endeavor to develop an efficient and robust approach that can improve the performance of deep networks in a general sense. The ensemble strategy is a promising technique for improving the performance of a single model [20]. Generally, it comprises a great number of weak classifiers, such as decision trees, and combines them to form a stronger model. Classical ensemble models include Random Forest [21], AdaBoost [22], XGBoost [23] and so on. Thomas et al. demonstrated in their investigation that an ensemble of different types of classifiers leads to an increase in accuracy [24]. In recent years, the ensemble strategy has gained increasing attention in DL. In [25], an ensemble strategy that combines a convolutional residual network, a deep belief network and a deep autoencoder was proved to be more effective than a single model. Cruz et al. proposed an evolutionary way to ensemble a fixed number of CNNs [5]. A multiobjective deep belief networks ensemble method was proposed in [26] for the remaining useful life estimation of engineering systems. Zhang et al. proposed an ensemble deep network architecture based on a sparse deep autoencoder, a denoising deep autoencoder and a contractive deep autoencoder in [27] for rotating machinery fault diagnostics. At the same time, Yang et al. [28] proposed another ensembled fault diagnostics scheme based on the Sparse Autoencoder [29] and the Denoising Autoencoder; bootstrap sampling and plurality voting were employed for the ensemble in that paper. In ref. [30], a new ensemble deep network was developed to combine the results generated by fifteen different activation functions. To make use of the advantages offered by different neural networks, Ma et al. applied a multiobjective optimization algorithm to integrate CNN, DBN and a deep autoencoder [25]. Zhang et al. proposed an ensemble learning model based on a convolutional neural network [31]; their method adds multiple classification layers to generate a 'poll matrix' before majority voting is used to produce the ensembled classification result. Li et al. proposed an optimal ensemble deep transfer network [32], where parameter transfer learning was used to initialize the starting points of several base models with different kernels of maximum mean discrepancy. However, these ensemble strategies involve intricate network architecture designs, training approaches, and additional hyperparameters to be tuned.

In this paper, we are not concerned with improving the structure or training of deep networks; instead, our goal is to exploit a potential of deep networks that has been overlooked. We show that, by virtue of adequately organized and jointly tuned commonly accessible tools, the performance of deep networks can be greatly improved in a general sense. In this connection, a simple yet efficient ensemble strategy referred to as the history-state ensemble (HSE) method is proposed in the present paper. The 'history-state' in this method is represented by the network weights after the network update in each training cycle. We hypothesize that network performance can be further improved in terms of stability and accuracy by incorporating these history-states, which are naturally produced after each model update. Details will be elaborated in the following section. In comparison with the ensembled deep networks introduced above, the advantages of the proposed method are twofold: (1) it improves the efficiency of the network by acquiring multiple base models without increasing training costs, and (2) the method is versatile and can be easily applied to different neural networks without a re-design or tuning of network architectures.

Some related works on HSE should be mentioned. Back in 2013, Xie et al. proposed a series of ensemble methods for DL, including vertical voting, horizontal voting, and the horizontal stacked ensemble, of which horizontal voting is similar in concept to the proposed HSE [33]. However, as there was not enough experimental verification in that work, the approach has not received considerable attention. To encourage deep networks to produce diverse base models, Huang et al. proposed that a cyclic cosine annealing learning rate (CCALR) helps deep networks attain multiple local minima, and the Snapshot ensemble method was introduced in their work [34]. There have been two applications of the Snapshot ensemble in the field of machinery fault diagnostics. Wen et al. [35] improved the original CCALR of the Snapshot ensemble and applied it to a CNN-based model for the fault diagnostics of bearings. Another improved Snapshot ensemble CNN was proposed in [36], using diversity regularization to encourage the diversity of the training history-states. Zhang et al. [37] proposed Snapshot boosting to improve the Snapshot ensemble.

These works are similar to our proposed method in that they all assume that deep networks can generate multiple base models during the training process for ensemble learning; in this sense, they all belong to HSE methods. However, the difference lies in how the ensemble strategy is designed. There are several significant gaps in the current research landscape that need to be filled. (1) While ensembled deep networks have been compared with a single deep network, the generalization of the ensemble method to various network architectures has yet to be demonstrated. (2) Different ensemble strategies have to be systematically compared. (3) HSE methods remain only scarcely studied, and their effectiveness needs to be examined and documented in a broader range of applications. This motivated us to address these specific issues on case studies relevant to practically significant problems of condition monitoring and fault diagnostics in rotating machinery. To this end, we investigated the efficiency and robustness of different HSE methods with various deep network architectures and propose the ensemble strategy with the best generalization ability. Compared to previous works, our contributions are summarised as follows.

(1) We propose the most straightforward yet efficient way to perform ensemble learning on deep networks. It has been experimentally confirmed that deep networks can generate multiple local sub-optima during the training process, and their combination improves the network performance on average without increasing training costs.

(2) We conduct an extensive investigation of various ensemble strategies on deep networks in order to address the aforementioned issues. Through the comparative experiments, our proposed method demonstrates the best overall performance. The conclusions derived from the study provide practitioners with guidelines for the rational selection of the ensemble strategy.
(3) We present a novel approach to investigate the tradeoff between the diversity and accuracy of base models, which provides new insights into the underlying mechanisms of ensemble learning.

The rest of the paper is organized as follows. The basic theory of the HSE method is described in Section 2. The experimental setup and contrastive methods are introduced in Section 3. Experimental results of the probed methods are discussed in Section 4. Conclusions are drawn in Section 5.

2. Methodology

The training of a neural network consists of two stages: (i) the forward propagation of input data from the first layer to the last layer of the network structure for feature extraction; and (ii) the backward propagation of the prediction error in the opposite direction for network optimization. A closed loop is formed through forward and backward propagation, and the updated weights after each cycle are referred to as one history-state of the network in this paper. In general, a single network model is trained for a fixed number of training cycles or iterations. However, it is hard to choose a 'magic' training budget to build a reliable model [34], as the local optimum acquired on the training dataset cannot represent the true local optimum of the whole dataset, especially in practical applications. A local optimum denotes a solution that is optimal within a neighboring set of candidate solutions. The loss value usually serves as a measure of how well or poorly the network behaves after each optimization cycle. If the network is properly trained, the evolutionary trend of the loss value has an 'elbow-like' shape, as shown in Fig. 1: it drops sharply during the first iterations and then decays slowly before finally converging at very low values. At the same time, the network accuracy shows the opposite trend.

We assume that (1) a neural network can produce multiple 'local sub-optima' with diversity during training, and (2) the combination of these local sub-optima improves the network performance in terms of stability and accuracy. We use the term 'local sub-optima' to indicate that the solution is merely approximate to the true 'local optima'. In this paper, they refer to all the history-states of the neural network after the training process enters a stable phase, i.e., when the loss value has converged to a small value. Therefore, to obtain base models for ensemble learning with a neural network, one only needs to preserve the history-states, or local sub-optima, that are produced after each training cycle, and the time cost of this is negligible. Since there is no requirement on the network structure, the method can be applied to all neural networks.

2.1. Feasibility analysis

The feasibility of the proposed ensemble method is analyzed below.

2.1.1. Improve network stability

The network stability is evaluated by the variability of performance under randomly initialized weights, with the variance or standard deviation used as the metric. With the previous hypothesis on local sub-optima, we assume the acquired base models are (i) unbiased, (ii) of the same variance, and (iii) uncorrelated with each other. The accuracy of the i-th base model is modeled as a_i = A + x, where A stands for the accuracy of the true local optimum to be estimated and x is random noise caused by the diversity of the base models. Based on the minimum-variance unbiased estimator, a reasonable estimator of a can be expressed as \bar{a} = \sum_{i=1}^{n} a_i / n = E(A), assuming x is Gaussian noise, with n the number of base models. Then, the variance of the estimator \bar{a} is expressed as:

\mathrm{var}(\bar{a}) = \frac{1}{n^2}\sum_{i=1}^{n} \mathrm{var}(a_i) = \frac{\mathrm{var}(a)}{n}    (1)

where var(a) denotes the variance of a single base model. Since var(a)/n < var(a), the variance of the ensemble network is smaller than that of the single network. Hence, it can be inferred that the ensemble method reduces the performance variability.

2.1.2. Improve network accuracy

For better illustration, we enumerate possible scenarios when comparing the single network with the ensemble network by introducing a simple example of binary classification, as displayed in Fig. 2. In cases 1-1 and 1-2, most base models produce the true labels, thus underlying the efficacy of the ensemble, whereas the single network might fail. Cases 2-1 and 2-2 are the opposite: the ensemble method fails if most base models produce false labels, whereas the single network might succeed. If the two scenarios occur with the same frequency, then the ensemble and the single network should have the same performance on average. Since deep networks are generally considered strong classifiers, we hypothesize that the first scenario is more frequent than the other one and that the ensemble method has a higher chance of outputting a true label, which remains to be proven in the following experiments.

2.2. The proposed ensemble strategy

Similarly to conventional ensemble techniques, the implementation of the HSE method faces two challenges: (1) a training strategy that encourages the generation of accurate base models with diversity [5,24]; and (2) a learning strategy that combines the acquired base models.
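To make the procedure concrete, the following is a minimal PyTorch-style sketch of how history-states could be collected once training has stabilized and then combined by average voting (AV). It is an illustration under stated assumptions, not the authors' reference implementation: the helper names, the per-epoch snapshot frequency, and the choice of a stable-phase threshold of 20 epochs are assumptions made for brevity.

```python
import copy
import torch
import torch.nn.functional as F

def train_with_history_states(model, loader, optimizer, epochs=50, stable_epoch=20):
    """Mini-batch gradient descent with a constant learning rate; a copy of the
    weights (one 'history-state') is kept after every epoch once training is
    assumed to have entered its stable phase."""
    history_states = []
    for epoch in range(epochs):
        model.train()
        for x, y in loader:
            optimizer.zero_grad()
            loss = F.cross_entropy(model(x), y)
            loss.backward()
            optimizer.step()
        if epoch >= stable_epoch:  # loss assumed to have converged to a small value
            history_states.append(copy.deepcopy(model.state_dict()))
    return history_states

@torch.no_grad()
def average_voting(model, history_states, x):
    """Average voting (AV): average the softmax outputs of all history-states."""
    probs = []
    for state in history_states:
        model.load_state_dict(state)
        model.eval()
        probs.append(F.softmax(model(x), dim=1))
    return torch.stack(probs).mean(dim=0).argmax(dim=1)
```

Since the snapshots are taken from a single training run, the only overhead relative to training a single network is the storage of the saved weights.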
Table 1
Summary of the use of CWRU dataset in recent publications.
Refs Load (HP) Frequency (kHz) Class Training/test samples Accuracy (%)
[4] * 12 12 4200/600 98.47
[8] 3 48 10 1680/720 98.46
[9] 0→0 / 0→1 / 0→2 12 12 2400/1200 98.5/97.1667/95.8333
[11] 0/1/2/3 12 10 1000/1000 99.57/99.32/99.54/99.43
[14] 0–3 12 8 * 99.94
[15] 0–3 * 12 1800/900 96.44
[18] 0–3 48 10 * 98.95
Table 2
Description of the gathered vibration signals from CWRU and KAt data centers.
Data base Index Fault location Fault level Data base Index Fault location Bearing code Fault type Fault level
CWRU 1 – 0 KAt 11 – K001 – 0
2 IR 0.007 12 IR KI16 Fatigue pitting 3
3 IR 0.014 13 IR KI17 Fatigue pitting 1
4 IR 0.021 14 IR KI18 Fatigue pitting 2
5 Ball 0.007 15 OR KA16 Fatigue pitting 2
6 Ball 0.014 16 OR KA22 Fatigue pitting 1
7 Ball 0.021 17 OR KA15 Indentations 1
8 OR 0.007 18 OR + IR KB23 Fatigue pitting 2
9 OR 0.014 19 OR + IR KB24 Fatigue pitting 3
10 OR 0.021 20 OR + IR KB27 Indentations 2
Table 3
Description of the extracted six datasets under various operating conditions.
(1) Fully connected neural network (FCNN).
(2) Stacked Autoencoder with supervised fine-tuning (SAE).
(3) Gaussian-Bernoulli Deep Belief Network with supervised fine-tuning (DBN).
(4) Gate Recurrent Unit Network for fault diagnostics (GRU) [40].
(5) One-dimensional LeNet5 (1D-LeNet5).
(6) One-dimensional AlexNet (1D-AlexNet).
(7) One-dimensional Deep Residual Network (1D-ResNet).
(8) One-dimensional Deep Densely Connected Network (1D-DenseNet).

Among the above eight architectures, (1)-(3) are fully connected networks with the shared architecture configuration of [1024, 512, 256, 128]. SAE and DBN are employed to learn unsupervised features from the data, followed by a supervised fine-tuning process for fault diagnostics; the unsupervised learning is trained for 30 epochs. (4) is a recurrent neural network architecture for fault diagnostics proposed in [40], which consists of a linear layer, a GRU layer and a classification module with a multilayer perceptron. To maintain the consistency of the input across all probed networks, the data length is set as 1024, which is converted to a [16 × 64] image, i.e., the sequential length of the GRU cell is 16. The linear layer maps the dimension of the raw image to [16 × 1024], and the output is fed into the GRU cell with a hidden size of 1024. The classification module consists of two hidden layers with 1024 neurons and an output layer. (5)-(8) are convolutional neural networks with increasing depth of hidden layers; the configuration of their network architectures is detailed in Tables 4 and 5.

To probe their performance under relatively fair circumstances, all networks are trained with the same hyperparameters. The Adam stochastic optimization algorithm is applied to update the network weights [55]. The epoch number and the batch size are set as 50. The initial LR is set as 0.001 and decreases with a decay rate of 0.001 for each iteration after 20 training epochs; the LR with decay is denoted as LRD in this work. A deep network without HSE is denoted as a single network. Single networks serve as references for the ensembled neural networks.

3.3.2. Comparison with different ensemble strategies

Several ensemble approaches designed for neural networks are investigated. They are roughly divided into two categories in this study: training strategies and learning strategies. A training strategy aims at encouraging the diversity of the acquired base models, while a learning strategy provides optimal solutions to combine these base models. The implementation details are presented below.

3.3.2.1. Training strategies

(1) MBGD + CLR (the proposed).
(2) CCALR in the Snapshot ensemble [34].
(3) Boosted framework.
(4) Snapshot boosting ensemble [37].

In (2), the models are warmed up for 20 epochs with an initial LR of 0.001; then the LR is scheduled with CCALR, and a snapshot, or history-state, of the model is taken when the LR reaches its minimum in each cycle. Snapshot numbers of 5 and 30 are investigated.
Table 4
The proposed 1D-CNN architectures with increasing network depths.
Note: C denotes the convolutional kernel, A denotes the average pooling kernel, S and P are the stride and padding number of each kernel, respectively. Flatten stands for the
concatenated layer. The output size is denoted by the notation ‘a@b’, where a represents the length of the output vector, and b is the number of output channels.
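To illustrate the notation in the note above, the parameter strings used in Tables 4 and 5 (e.g. 'C15; S1; P7' or 'A5; S2; P0') map onto standard one-dimensional layers as in the short sketch below; the channel counts are placeholders, since the table bodies with the exact output sizes are not reproduced here.

```python
import torch.nn as nn

# C15; S1; P7 -> 1-D convolution with kernel size 15, stride 1, padding 7
conv = nn.Conv1d(in_channels=16, out_channels=16, kernel_size=15, stride=1, padding=7)

# A5; S2; P0 -> 1-D average pooling with kernel size 5, stride 2, no padding
pool = nn.AvgPool1d(kernel_size=5, stride=2, padding=0)
```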
Table 5
Description of RL, DL, and TL.

Layers  Description               Parameters
RL      Residual block            C15; S1; P7 / C15; S1; P7
DL      Densely connected block   C1; S1; P0 / C15; S1; P7
TL      Transition layer          C1; S1; P0 / A5; S2; P0

3.3.2.2. Learning strategies

(5) AV.
(6) Ranking voting.
(7) Weighted voting + PSO.
(8) Selective voting + PSO.

In (6), a validation set is used to evaluate the performance of the recorded base models, and only the base models with the top-n classification accuracy on the validation set participate in the voting. In this work, 10% of the training samples are randomly extracted as a validation set, and n is set as 3. PSO algorithms serve as the meta-learner to implement (7) and (8). The swarm size and the number of iterations are set as 100. The acceleration coefficients c1 and c2 are set as 0.5, and the inertia weight is 0.9. In (7), particle swarm optimization (PSO) is utilized to adaptively learn the weights of all base models based on their performance on the validation set, and the prediction is given by the weighted base models. In (8), only the base models with positive weights participate in the voting, and the prediction is given by the averaged output of the selected base models.

Common time- and frequency-domain features are extracted from the raw signal as the input of each shallow model: root mean square, skewness, kurtosis, shape factor, crest factor, impulse factor, margin factor, power, mean frequency, RMSF, and RVF. In model (1), the number of neighbors is set as 1. In (2), the Gaussian kernel is employed, and the kernel coefficient is defined by 1/(L·var(X)), where L denotes the number of classes and var(X) represents the variance of the input X; the regularization parameter is 1. DT serves as the base model for (3)-(6), and the number of base models is set as 50. In (4), the LR value is 0.5, and the maximum depth of the individual model is 3. In (5), the LR is 0.5, and the maximum depth of each individual tree is 5. In (6), the LR value is 0.9, the L1 regularization term is 0.1, and the maximum depth of each individual tree is 5.
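As a rough illustration of the feature set listed above, a few of the time- and frequency-domain features could be computed as in the following NumPy sketch. The exact definitions used in the paper (in particular for RMSF and RVF) are not spelled out, so the formulas below are common textbook versions and should be read as assumptions.

```python
import numpy as np

def time_frequency_features(x, fs):
    """A subset of the time- and frequency-domain features used as input to the
    shallow learning models (common textbook definitions)."""
    rms = np.sqrt(np.mean(x ** 2))
    skewness = np.mean((x - x.mean()) ** 3) / x.std() ** 3
    kurtosis = np.mean((x - x.mean()) ** 4) / x.std() ** 4
    shape_factor = rms / np.mean(np.abs(x))
    crest_factor = np.max(np.abs(x)) / rms
    impulse_factor = np.max(np.abs(x)) / np.mean(np.abs(x))
    margin_factor = np.max(np.abs(x)) / np.mean(np.sqrt(np.abs(x))) ** 2
    power = np.mean(x ** 2)

    spectrum = np.abs(np.fft.rfft(x)) ** 2               # power spectrum
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    mean_freq = np.sum(freqs * spectrum) / np.sum(spectrum)
    rmsf = np.sqrt(np.sum(freqs ** 2 * spectrum) / np.sum(spectrum))
    rvf = np.sqrt(np.sum((freqs - mean_freq) ** 2 * spectrum) / np.sum(spectrum))

    return np.array([rms, skewness, kurtosis, shape_factor, crest_factor,
                     impulse_factor, margin_factor, power, mean_freq, rmsf, rvf])
```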
Table 6
Fault diagnostics accuracy of shallow learning algorithms based on time and frequency features.
Models Datasets
A B C D E F
KNN 0.8658 0.8085 0.8085 0.7395 0.7059 0.6609
SVM 0.9106 0.8362 0.8642 0.8053 0.7592 0.7035
RF 0.9251 0.8303 0.8360 0.7891 0.8084 0.7047
Adaboost 0.8802 0.7945 0.8080 0.7332 0.7374 0.6500
GBDT 0.9195 0.8271 0.8336 0.7716 0.7916 0.6874
XGboost 0.8716 0.7728 0.7901 0.7492 0.6910 0.6193
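Before turning to the results, the ranking-voting learning strategy of Section 3.3.2.2 (strategy (6)) can also be sketched in a few lines: each recorded history-state is scored on a held-out validation split, and only the top-n states vote by averaging their softmax outputs. The function and variable names are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ranking_voting(model, history_states, x_val, y_val, x_test, top_n=3):
    """Ranking voting: keep only the top-n history-states, ranked by validation
    accuracy, and average their softmax outputs on the test data."""
    scored = []
    for state in history_states:
        model.load_state_dict(state)
        model.eval()
        acc = (model(x_val).argmax(dim=1) == y_val).float().mean().item()
        scored.append((acc, state))
    scored.sort(key=lambda pair: pair[0], reverse=True)   # best validation accuracy first

    probs = []
    for _, state in scored[:top_n]:
        model.load_state_dict(state)
        probs.append(F.softmax(model(x_test), dim=1))
    return torch.stack(probs).mean(dim=0).argmax(dim=1)
```

Setting top_n equal to the number of history-states recovers plain average voting, so AV can be seen as the unranked special case of this strategy.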
4. Results and discussion

The performance of a carefully designed network architecture might decrease on other tasks. However, the model performance can be generally improved by HSE regardless of the network architectures and datasets, demonstrating the general capability of the proposed method to be self-adapted to different network structures.
Fig. 3. Loss value of different training methods, exemplified by 1D-LeNet5 of trial 1 on dataset A. (a) LRD: initially set as 0.001 and then decreasing with a decay rate of 0.001 for each iteration after 20 epochs. (b) CLR + MBGD with a constant value of 0.001 (the proposed method). (c) CCALR with a Snapshot number of 5. (d) Boosting + CLR with a constant value of 0.001; 10% of the training samples are randomly extracted as the validation set.
Fig. 4. The average accuracy of the selected four network architectures on datasets A-F. Plots are sorted by the descending order of the average accuracy of the four models on
each dataset.
Fig. 5. The averaged classification accuracy and standard deviation of DHSEs with various ensemble strategies and single model on datasets A-E and different network
architectures. (a) Average classification accuracy in descending order. (b) Average standard deviation in descending order.
Fig. 6. Measure of diversity and mean accuracy of base models. The picture is
sorted by the descending order of the average diversity of the probed network
architectures and datasets.
Training epochs ranging from 10 to 100 are examined, as displayed in Fig. 8; the example is taken from 1D-LeNet5. The red line represents the accuracy of the proposed method under different training epochs. The black line represents the accuracy of a single network at the corresponding training epochs, while the blue line is the moving-average accuracy of the single network with a window length of 10. One can observe that the ensemble network shows less fluctuation in accuracy across different training epochs. The blue and red lines show similar performance variability; however, the accuracy of the proposed method is maintained on the upper envelope of the single network.

4.4. Computational cost

As has been noted, the DHSE does not increase the training time of the model. However, the test time is inevitably increased to t × N, where t denotes the test time of a single network and N represents the number of base models. Table 7 presents the training and test time of 1D-LeNet5 using the 'MBGD + CLR' training method on dataset A. The training cost of 'Weighted Voting' and 'Selective Voting' is increased because of the use of PSO algorithms. Although the test time is increased, it remains small for every single sample.

5. Conclusion

In this paper, ensemble techniques combining the 'history-states' generated during network training are denoted as HSE methods. The proposed methodology aims at enhancing the reliability of bearing fault diagnostics by introducing an effective and user-friendly HSE strategy, which is evaluated across various network architectures. The experimental results reveal that deep networks can produce multiple base models for ensemble learning using a combination of the MBGD, CLR, and AV methods once the training process reaches a stable phase. Compared with peer methods, the proposed ensemble strategy benefits from simplicity, which is evident from the following aspects:

(1) Ensemble strategy: this work integrates existing concepts in an enhanced accuracy-improving workflow. The use of de facto accepted tools enhances the method's accessibility and applicability, making it an easy-to-implement option for practitioners seeking to improve model performance.

(2) Selection of base models: the proposed method differs from many peer methods, like the Snapshot ensemble, that struggle to encourage neural networks to reach different local optima. Instead, the proposed method focuses on utilizing 'local sub-optima' to achieve its goals. By doing so, it avoids the intricate task of defining qualified base models, resulting in reduced complexity.

(3) Hyperparameters: the proposed method requires fewer hyperparameters to be tuned and exhibits robustness to the selection of hyperparameters, as shown by the parameter analysis.

Comparing the experimental findings with peer methods demonstrates that the proposed 'MBGD + CLR + AV' strategy exhibits better classification accuracy while having a lower standard deviation,
thus showing the relative effectiveness and robustness of the proposed methodology.

Nevertheless, the efficiency of HSE methods still needs to be investigated in a broader range of applications, which is the focus of our further study. Up to now, we have successfully tested the proposed methodology in several other applications, including unsupervised early fault detection and the problem of streaming data with emerging new classes; the reader can find more details in [41,42].

CRediT authorship contribution statement

Yu Wang: Conceptualization, Methodology, Software, Investigation, Writing – original draft. Alexey Vinogradov: Writing – review & editing, Supervision, Funding acquisition.

Data availability

Data will be made available on request.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

The financial support from the Norwegian Research Council through the RCN Project No 296236 is gratefully appreciated.

Appendix

1. Cyclic cosine annealing learning rate

The Snapshot ensemble adopted a cyclic cosine annealing learning rate (CCALR) schedule to encourage the model to reach multiple local minima during training [34]. The LR is lowered at a very fast pace at first, encouraging the neural network to converge towards its local minimum. Then the optimization continues at the initial LR, and the procedure is repeated several times. A shifted cosine function is used to obtain the LR at each iteration, as described mathematically below:

\alpha(t) = \frac{\alpha_0}{2}\left[\cos\left(\frac{\pi \,\mathrm{mod}(t-1,\, \lceil T/M \rceil)}{\lceil T/M \rceil}\right) + 1\right]    (8)

where \alpha(t) denotes the LR at iteration number t, \alpha_0 is the initial LR, T represents the total number of training iterations, and M is the number of cycles the procedure is repeated. A 'snapshot' of the model is taken when the LR reaches its minimum in each cycle; thus, a total of M models are acquired. The snapshot of the model is also referred to as the history-state of the model in this work. Therefore, the Snapshot ensemble can be regarded as an implementation of DHSE with a scheduled LR, and CCALR is a type of training strategy.

2. Boosted training strategy

The boosting technique trains a number of weak learners sequentially through an iterative arrangement of the training samples to form a stronger model. The technique gives larger weights to those samples that were misclassified by the previous weak learners; in this way, the classifiers are supposed to have less overlap in the set of samples they misclassify. A boosted framework for the neural network was proposed in [37]; the data distribution of each classifier is arranged as follows:

W_t(i) = \frac{1/n}{Z_{t-1}}\, e^{-\beta_{t-1} a}, \quad i = 1, 2, \ldots, n    (9)

\beta_t = \frac{1}{2}\log\frac{1-\epsilon_t}{\epsilon_t} + \frac{1}{10}\log(k-1)    (10)

\epsilon_t = \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}\left(f_t(x_i) \neq y_i\right)    (11)

a = \begin{cases} 1, & \text{if } f_t(x_i) = y_i \\ -1, & \text{if } f_t(x_i) \neq y_i \end{cases}    (12)

where W_t(i) denotes the weight of the i-th sample for the t-th classifier and \epsilon_t counts the fraction of misclassified samples. A higher value of \epsilon_t yields a smaller value of the coefficient \beta_t, which is further reflected in a higher value of the weight. Z_t is a normalization factor, defined as \sum_{i=1}^{n} e^{-\beta_{t-1} y_i f(x_i)}. The differences between the boosted framework for a deep network and AdaBoost are: (1) a validation set is used to compute \epsilon_t, while conventionally it is calculated on all training samples; (2) the weights W_t are updated from the uniform distribution 1/n for each classifier, whereas in the conventional procedure they are updated from W_{t-1} sequentially; (3) the boosted framework uses a meta-learner to combine the weak classifiers with a validation set.

3. Weighted learning strategy

The weighted ensemble output is expressed as:

E = \sum_{i=1}^{N} w_i \,\frac{\exp\left(b^i + W_j^i H\right)}{\sum_{j=1}^{c} \exp\left(b^i + W_j^i H\right)}    (13)

Compared with AV, weighted voting applies a meta-learner to adaptively learn the weight w_i of each base model.

4. The detailed experimental results on datasets A-F are presented in Tables 8-13.
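As a complement to Eq. (8), the cyclic cosine annealing schedule can be implemented as the short function below. This is a sketch only: the function name and the 1-based iteration convention are assumptions, and the snapshot bookkeeping follows the description above rather than the reference code of [34].

```python
import math

def ccalr(t, total_iters, cycles, lr0=0.001):
    """Cyclic cosine annealing learning rate of Eq. (8).

    t           -- current iteration (1-based)
    total_iters -- total number of training iterations T
    cycles      -- number of cycles M; one snapshot is taken per cycle
    lr0         -- initial learning rate alpha_0
    """
    period = math.ceil(total_iters / cycles)
    return lr0 / 2.0 * (math.cos(math.pi * ((t - 1) % period) / period) + 1.0)

# A snapshot (history-state) would be stored whenever the schedule reaches its
# minimum, i.e. at the last iteration of each cycle.
```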
Table 8
Average accuracy of the probed methods on dataset A.
Table 9
Average accuracy of the probed methods on dataset B.
Table 10
Average accuracy of the probed methods on dataset C.
Table 11
Average accuracy of the probed methods on dataset D.
Table 12
Average accuracy of the probed methods on dataset E.
Table 13
Average accuracy of the probed methods on dataset F.
Yu Wang received the M.M. degree in Management Science and Engineering from the National Research Base of Intelligent Manufacturing Service, Chongqing Technology and Business University, Chongqing, China, in 2020. She is currently working towards the Ph.D. degree in the Department of Mechanical and Industrial Engineering, Norwegian University of Science and Technology, Trondheim, Norway. She has worked on the intelligent fault diagnosis of rotating machinery with machine learning since 2017.

Alexey Vinogradov majored in acoustic emission signal processing, data mining, and failure analysis and prediction. He received his Ph.D. degree in physics and mathematics from the A.F. Ioffe Physical-Technical Institute of St. Petersburg, USSR, in 1988. Since 1992 he has held several academic positions in Japan and Norway. Since 2023, he has served as a distinguished professor of the Magnesium Research Center of Kumamoto University, Japan.