Article
Towards Interpretable Deep Learning: A Feature Selection
Framework for Prognostics and Health Management Using
Deep Neural Networks
Joaquín Figueroa Barraza 1,*, Enrique López Droguett 2 and Marcelo Ramos Martins 1
1 LabRisco—Analysis, Evaluation and Risk Management Laboratory, Department of Naval Architecture and Ocean Engineering, University of São Paulo, São Paulo 05508-030, Brazil; [email protected]
2 Department of Civil and Environmental Engineering & The Garrick Institute for the Risk Sciences, University of California, Los Angeles, CA 90095, USA; [email protected]
* Correspondence: [email protected]
Abstract: In the last five years, the inclusion of Deep Learning algorithms in prognostics and health management (PHM) has led to a performance increase in diagnostics, prognostics, and anomaly detection. However, the lack of interpretability of these models results in resistance towards their deployment. Deep Learning-based models fall within the accuracy/interpretability tradeoff, which means that their complexity leads to high performance levels but lacks interpretability. This work aims at addressing this tradeoff by proposing a technique for feature selection embedded in deep neural networks that uses a feature selection (FS) layer trained with the rest of the network to evaluate the input features' importance. The importance values are used to determine which features will be considered for deployment of a PHM model. For comparison with other techniques, this paper introduces a new metric called the ranking quality score (RQS), which measures how performance evolves while following the corresponding ranking. The proposed framework is exemplified with three case studies involving health state diagnostics and prognostics and remaining useful life prediction. Results show that the proposed technique achieves a higher RQS than the compared techniques, while maintaining the same performance level when compared to the same model but without an FS layer.
Keywords: feature selection; deep learning; deep neural networks; prognostics and health management; interpretable AI
Forest for ranking features. In [58], the authors propose an approach based on attention
mechanisms. It consists of a feature weights generation module (also called “attention
module") made of parallel hidden layers incorporated into the neural network next to
the input layer. Each part of the module outputs a value between 0 and 1, which is
then multiplied by its corresponding feature to enter the rest of the network, which they
call the “learning module”. They test their approach using the MNIST dataset [60], the
noisy-MNIST dataset with its three variants [61] and two other small datasets used for
feature selection problems [62]. Results show that their framework outperforms other filter and embedded methods.
In this paper, we propose a framework for feature selection embedded in deep neural
networks (DNN) for PHM in which a feature selection (FS) layer is added next to the input
layer to be trained jointly with the rest of the fault diagnosis and prognosis network. The
main contributions of this work are the following:
• An in-model technique, referred to as the feature selection layer (FS layer) technique, for DNN feature selection, which helps interpret the model without a performance decrease. It uses the DNN's inner dynamics to determine each feature's importance without the need for an external technique. It is an ad hoc technique that addresses the accuracy/interpretability tradeoff.
• A framework for ranking quality evaluation based on a new metric, the ranking quality score (RQS), which quantitatively evaluates the feature ranking obtained through the FS layer technique. It is also used to allow a fair comparison of the proposed framework with other techniques.
• Application of the FS layer technique for fault diagnosis and prognosis (classification)
and RUL (regression) prediction tasks for PHM in four different case studies.
• Comparison of the proposed technique with other filter and embedded methods.
The remainder of the paper is organized as follows. In Section 2, a brief theoretical
description of deep neural networks is presented. In Section 3, the case studies used in
this work are presented and described. The proposed framework for feature selection to
improve the DNN model interpretability and ranking quality evaluation are presented in
Section 4. Section 5 presents and discusses the obtained results. Concluding remarks are
presented in Section 6.
$$a_i^j = f\left( \sum_k w_{k,i}^j \, a_k^{j-1} + b^{j-1} \right) \qquad (1)$$
where $a_i^j$ is the value (also called activation) of the i-th neuron in the j-th layer, $w_{k,i}^j$ is the trainable weight connecting the k-th neuron in the previous layer with the i-th neuron in the current layer, $a_k^{j-1}$ is the activation of the k-th neuron in the previous layer, $b^{j-1}$ is the bias term of the previous layer, and $f$ corresponds to the activation function of the
current layer. ANNs can be divided by the type of task: classification or regression. In
classification tasks, the output layer has one neuron per class. The one with the highest
activation value indicates the class to which the model associates the corresponding input.
In regression tasks, the output layer has one neuron, which corresponds to the continuous
value the model calculates. An ANN is optimized through backpropagation [63], in
which the weights and biases in the network are modified to minimize a loss function.
Hyperparameters such as the number of hidden layers, number of neurons per layer, activation functions of each layer, loss function, and optimization algorithm, among others, must be fine-tuned to achieve higher performance.
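As an illustration, the forward pass of Equation (1) for a single dense layer can be sketched in a few lines of NumPy; this is a minimal illustration of the computation, not the implementation used in this work:

```python
# Minimal NumPy sketch of Equation (1): the activations of one dense layer.
import numpy as np

def dense_forward(a_prev, W, b, f):
    # a_prev: activations of the previous layer, shape (K,)
    # W: trainable weights, shape (K, I); b: bias term of the previous layer
    # f: activation function of the current layer
    return f(W.T @ a_prev + b)

relu = lambda z: np.maximum(z, 0.0)
a = dense_forward(np.random.rand(5), np.random.rand(5, 3), 0.1, relu)  # shape (3,)
```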
Though basic ANNs include only one hidden layer, the addition of more hidden layers
results in models able to find more complex dependencies in the data and thus achieve
higher performance. These kinds of networks with multiple hidden layers are referred to
as Deep Neural Networks (DNN) and define the subset of Machine Learning algorithms
called Deep Learning.
Due to the stacking of several hidden layers, DNNs are able to model complex and
rare dependencies in the training data. However, sometimes this ability can lead to
overfitting, as the model finds relations within the data that cannot be extrapolated to new
observations. There are several ways to address this issue, including parameter penalty
terms, dataset augmentation, multitask learning, early stopping and dropout [64], among
others. Regarding techniques based on penalty terms, two of the most used ones are the
L1 [65] and L2 [66,67] regularization. In DL, the L1 regularization refers to the addition of a
penalizing term to the loss function as
$$L_1(k) = \lambda \cdot \sum_i \left| w_i^k \right| \qquad (2)$$
where $\lambda$ determines the strength of the regularization and $w_i^k$ is the i-th weight in the k-th
layer. Thus, when an L1 regularization is applied to a certain layer of the network, the
model, along with the primary task, tries to minimize the L1 norm of the layer’s weights.
On the other hand, the L2 regularization applies the same principle but with the L2 norm:
$$L_2(k) = \lambda \cdot \sqrt{\sum_i \left( w_i^k \right)^2} \qquad (3)$$
Though these two techniques are similar, their effects on the model and the learning process are different. According to [68], due to their derivatives, both encourage small weight values, but weights regularized by L2 seldom reach the value zero, whereas L1 induces solutions in which only a few weights have nonzero values. As explained in [65], this is due to the shape of the constraint region of each regularizer, shown in Figure 2. This sparsity property of the L1 regularization has been used for feature selection tasks [69]. For example, in [70], Ng claims that in the context of logistic regression, a model is less affected by irrelevant features when it is regularized using L1 than when using L2. With L1 regularization, the number of samples needed for an acceptable level of performance grows only logarithmically with the number of irrelevant features, whereas with L2 regularization this growth is linear.
Figure 2. Least squares solution contour for a linear model with L1 (red curve) and L2 (blue curve)
regularizations. Based on [71].
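As a concrete illustration, both penalties can be attached to a layer's weights in one line each; this is a minimal sketch assuming TensorFlow/Keras, with an arbitrary λ of 0.01 (note that Keras' built-in l2 penalizes the squared norm, without the square root of Equation (3)):

```python
# Minimal sketch (assuming TensorFlow/Keras) of L1 and L2 weight penalties.
import tensorflow as tf

l1_dense = tf.keras.layers.Dense(
    64, activation="relu",
    kernel_regularizer=tf.keras.regularizers.l1(0.01))  # sparsity-inducing
l2_dense = tf.keras.layers.Dense(
    64, activation="relu",
    kernel_regularizer=tf.keras.regularizers.l2(0.01))  # small, nonzero weights
```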
3. Case Studies
This section presents the case studies used to evaluate the performance of the proposed
framework for fault diagnosis and prognosis and RUL prediction.
Setting           Description
Bearing Model     SKF 6205-2RS JEM
Sensors Location  Drive end bearing housing, 12 o'clock position
Faults Location   Balls, outer ring, or inner ring
Motor Loads       0, 1, 2, 3 hp
Motor Speeds      1720–1797 rpm
Sampling Rates    48 kHz (baseline); 12 kHz (faults)
Table 2. Features extracted from the vibration signals.

Type of Signal      Feature                                                            Feature Number
Original signal     Maximum amplitude                                                  1
                    Root Mean Square (RMS)                                             2
                    Peak-to-peak amplitude                                             3
                    Crest factor                                                       4
                    Arithmetic mean                                                    5
                    Variance                                                           6
                    Skewness                                                           7
                    Kurtosis                                                           8
                    Centered moments (k = 5–11)                                        9–15
                    Arithmetic mean of the Fourier amplitude, divided in 25
                    frequency bands                                                    16–40
                    RMS of the first five IMFs * (empirical mode decomposition)        41–45
                    Percent energy of the first five IMFs (empirical mode
                    decomposition)                                                     46–50
                    Shannon entropy of the first five IMFs (empirical mode
                    decomposition)                                                     51–55
                    RMS of the first five PFs ** (local mean decomposition)            56–60
                    Percent energy of the first five PFs (local mean decomposition)    61–65
                    Shannon entropy of the first five PFs (local mean decomposition)   66–70
Derivative of the   Maximum amplitude                                                  71
original signal     Root Mean Square (RMS)                                             72
                    Peak-to-peak amplitude                                             73
                    Crest factor                                                       74
                    Arithmetic mean                                                    75
                    Variance                                                           76
                    Skewness                                                           77
                    Kurtosis                                                           78
                    Centered moments (k = 5–11)                                        79–85
Integral of the     Maximum amplitude                                                  86
original signal     Root Mean Square (RMS)                                             87
                    Peak-to-peak amplitude                                             88
                    Crest factor                                                       89
                    Arithmetic mean                                                    90
                    Variance                                                           91
                    Skewness                                                           92
                    Kurtosis                                                           93
                    Centered moments (k = 5–11)                                        94–100
* IMF: intrinsic mode functions; ** PF: product functions.
To diagnose the bearing's health state, the 100 extracted features are used. However, some features give more information about the health state than others. Furthermore, a situation may arise in which the information provided by some features is already contained in other features, making them irrelevant to the network. Thus, it is necessary to determine the importance of each input feature fed to the network.
Regarding other works, some authors have performed feature selection on this dataset [73–76]. In particular, the authors in [73] propose the use of a modified CNN in which the first layer is built using a second-generation wavelet convolution over the raw signal. The authors in [76] create a framework based on correlation analysis, principal component analysis (PCA), and weighted feature fusion in order to train a model on features extracted from the raw signal. They successfully identify useful features to train an SVM. However, their approach is not ad hoc; therefore, it does not access the inner dynamics of the model.
emulate a real-case scenario. To generate the labels, the same procedure shown in [29] is
used. Thus, degradation is represented as a piecewise linear function of the time cycles,
with a maximum RUL number of 125. A diagram of the engine used for creating the
datasets and the sensor points is shown in [82]. A description of the features used for RUL
estimation is given in Table 5.
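As an illustration of the piecewise linear labeling described above, the RUL targets for one engine can be generated as follows (a NumPy sketch under our own naming, not the reference implementation of [29]):

```python
# Sketch of piecewise linear RUL labeling: RUL is capped at 125 cycles and
# then decreases linearly to zero at the end of the run.
import numpy as np

def piecewise_rul(total_cycles, max_rul=125):
    t = np.arange(1, total_cycles + 1)            # observed time cycles
    return np.minimum(max_rul, total_cycles - t)  # capped linear decay
```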
Out of the four datasets, this work shows results on datasets FD001 and FD004, since
this work’s objective is to show the framework’s performance in different scenarios, and
these two datasets do not share any of the characteristics shown in Table 4. The sizes of the
two datasets are shown in Table 6.
As mentioned before, the C-MAPSS datasets are used for RUL prediction. Typically, the best results are obtained when using CNNs and/or LSTMs [31,79–81]. Here, the data are organized into time windows for the model to learn to associate a certain evolution of feature values with a RUL value. In the case of DNNs, inputs are two-dimensional (samples × features); thus, no time windows are constructed. This naturally hinders the results obtained when using DNNs. However, the objective of this work is not to improve the state-of-the-art results on RUL prediction but to present a framework for interpretability of DNNs in the context of PHM that does not fall into the accuracy/interpretability tradeoff. Hence, the use of DNNs is not an issue within this work.
Among the input features, some are more informative about the degradation of the turbofan than others. This means that some features (or combinations of features) are more sensitive to a change in the health state than others. More specifically, there may be features that are related to a certain failure mode but not to another. Therefore, it is relevant to determine the importance of each input feature on both datasets.
Regarding other works, the authors in [83] perform an extensive analysis of feature selection approaches applied to these datasets, including filter and wrapper methods. They argue that these methods reduce model complexity but do not guarantee an improvement in performance. This relates to the fact that filter methods are detached from the model itself and may generate results that do not apply to a specific model. For wrapper methods, several different models are trained according to different sets of features. This can generate misleading interpretations of the results, as different models may use features in different ways.
Table 7. Monitored data for natural gas treatment plant. Source: [32].
In this case study, input features are related to certain equipment acting on the process. Identifying an irrelevant variable could lead to stopping the monitoring of equipment that does not need it. On the other hand, identifying a feature with high relevance could point to the equipment that has the most impact on the amount of CO2 in the treated gas. This could help improve process reliability.
Regarding other works, authors in [32] use an LSTM-based autoencoder to predict
whether the amount of CO2 will be above or below 2500 ppm after 20 min. They also
evaluate how the different performance metrics vary with the prediction time. The feature
selection process is done separately from the training process, according to a wrapper
method. They use the following features: nontreated gas flow rate (A), nontreated gas tem-
perature (B), amine pressure at the reboiler (H), amine temperature entering the contactor
(J) and past measurements of the concentration of CO2 in the treated gas (K).
4. Proposed Framework
In this section, the proposed framework for feature selection and evaluation is pre-
sented. The feature selection (FS) layer and a methodology for evaluating the quality of
feature rankings are introduced and discussed. Also, other approaches in the literature are
compared with the proposed framework.
addition, a regularization term is added to ensure the interaction between features, which
is calculated as:
$$r\left(W^{FID}\right) = \lambda \cdot \left| \sum_{i=1}^{n} w_i^{FID} - 1 \right| \qquad (4)$$
where $W^{FID}$ represents the weights vector in the FS layer and $\lambda$ is a value between 0 and 1 that determines the strength of the regularization. The addition of this term to the loss function enforces the weights to sum to 1. Thus, it is ensured that the importance of each feature is influenced by the interaction with other features, which does not occur in methods like random forest or mutual information. This also avoids situations in which all weights take the same values. Furthermore, the use of the l1 norm, instead of the l2 norm, is based on the effects of the L1 and L2 regularizations related to feature selection, as detailed in Section 2. This configuration enforces the model to reach sparse solutions in which irrelevant features are tied to a weight value of zero; the l2 norm would instead result in irrelevant features having values close to zero.
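A minimal, illustrative sketch of such an FS layer is given below, assuming TensorFlow/Keras; the class name, weight initialization, and default λ are our own choices rather than the authors' implementation, and the FS layer's activation function (one of the added hyperparameters) is omitted for brevity:

```python
# Illustrative sketch of an FS layer: one trainable weight per input feature
# scales that feature, and the penalty of Equation (4) is added to the loss.
import tensorflow as tf

class FSLayer(tf.keras.layers.Layer):
    def __init__(self, lam=0.01, **kwargs):
        super().__init__(**kwargs)
        self.lam = lam  # regularization strength (lambda in Equation (4))

    def build(self, input_shape):
        # One importance weight per input feature (initialization is our choice).
        n = int(input_shape[-1])
        self.w = self.add_weight(
            name="feature_importance", shape=(n,),
            initializer=tf.keras.initializers.Constant(1.0 / n),
            trainable=True)

    def call(self, inputs):
        # l1-style penalty enforcing the importance weights to sum to 1.
        self.add_loss(self.lam * tf.abs(tf.reduce_sum(self.w) - 1.0))
        return inputs * self.w  # element-wise feature scaling
```

After training, the layer's weight vector holds the importance values used to rank the input features.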
1. The raw dataset is preprocessed to make it suitable for usage. Since feature selection is done during training, no prior analysis is needed to remove variables, besides discarding categorical variables.
2. The dataset is divided into train and test sets, which are then normalized. The scaler chosen to normalize the data is fitted to the train set and applied to both the training and test sets. For example, if the data are standardized, the mean and variance used for normalizing both sets are obtained only from the train set. This prevents the test set from interfering with the training process.
3. The training set is used for training the network.
4. When the training process is finished, the FS layer's weights are used to rank the features in decreasing order. The model's performance is then evaluated on the test set using a subset of the top n features, with n = {1, 2, . . . , N}. At first, the subset contains only the most relevant input feature. After evaluation, the second most relevant feature is added to the subset and performance is evaluated again. This process is repeated until all the available input features have been added to the subset (a sketch of this loop is given after this list).
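The following sketch illustrates the loop of step 4, assuming NumPy arrays and a trained Keras model; masking the discarded features by zeroing them (after normalization) is our own assumption about how the masked evaluation is carried out:

```python
# Sketch of step 4: evaluate the model on the test set with only the top-n
# features visible, for n = 1 .. N, following the FS-layer ranking.
import numpy as np

def ranking_curve(model, fs_weights, x_test, y_test):
    order = np.argsort(-np.abs(fs_weights))            # most relevant first
    curve = []
    for n in range(1, len(order) + 1):
        x_masked = np.zeros_like(x_test)
        x_masked[:, order[:n]] = x_test[:, order[:n]]  # keep top-n features
        curve.append(model.evaluate(x_masked, y_test, verbose=0))
    return curve
```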
The performance results obtained from this iterative process are presented as a curve in order to compare with other methods, as in [51,57,58]. However, a visual comparison is not as accurate and reliable as a quantitative comparison. To address this issue, the ranking quality score (RQS) is proposed, which is defined as:
$$RQS = \frac{\sum_{n=1}^{N} PM_n \cdot n}{\sum_{n=1}^{N} \max(PM) \cdot n} \qquad (5)$$
where $PM_n$ is a performance metric of choice evaluated using the n most relevant features and $\max(PM)$ is the maximum value reached by the model. The range of values of the RQS metric is [0,1] (the proof is presented in Appendix A), and it compares the
two scenarios exemplified in Figure 7. In the figure, the scenario shown in the blue line
represents an example case where the performance metric value increases alongside the
number of features. Then, the numerator of Equation (5) is calculated by multiplying
each performance metric (calculated through step 4 mentioned above) by the number
of features used to reach that performance value and summing up all the results. This
value is an indicator of the quality of the ranking obtained through step 4. However, it is
highly dependent on the model. Within one model, this value can be used successfully to
compare different rankings. Nonetheless, this cannot be done when comparing models. To
address this issue, the RQS metric has a normalization term shown in the denominator of
Equation (5), represented by a red dotted line in Figure 7. It is the same calculation as the numerator, but assuming an ideal case where the maximum performance value is reached with the most relevant feature and is unaltered by the addition of less relevant features. By using this denominator, the RQS metric presents values between 0 and 1.
To illustrate its behavior, Figure 8 shows six different curves with their corresponding
RQS value.
The RQS metric measures the quality of the ranking obtained through the model using an FS layer. It is a normalized weighted sum of the chosen performance metric, which gives more weight to values obtained with more features. When using one feature, performance typically shows a low value. As features are added, performance increases at a fast rate. After a certain number of features, performance reaches a plateau. The RQS metric gives a high value to a model in which the plateau is reached with few variables and is not affected by the addition of more features. On the other hand, it gives a low value to a model in which more features are needed to reach the plateau, or in which the addition of features leads to an important decrease in performance.
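Computing the RQS from such a performance curve is straightforward; the following is a small sketch of Equation (5), where `curve[n-1]` holds PM_n, the metric obtained with the n most relevant features:

```python
# Sketch of the RQS computation of Equation (5).
import numpy as np

def rqs(curve):
    pm = np.asarray(curve, dtype=float)
    n = np.arange(1, len(pm) + 1)
    return float(np.sum(pm * n) / (pm.max() * n.sum()))
```

For instance, a curve that jumps to its maximum at n = 1 and stays flat yields an RQS of 1, whereas curves that climb slowly or degrade as features are added score lower.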
Besides achieving a high RQS, the proposed FS layer technique must, at least, maintain the same level of performance compared to the same network without an FS layer. Otherwise, there would be a tradeoff between performance and interpretability, which is not desirable. To evaluate the influence of the FS layer on the model's performance, and based on the work presented in [57], relative performance metrics are used, as defined in:
$$PM_n = \begin{cases} \rho_n / \rho_0, & \text{for classification tasks, or } PM_n = R^2 \\ \rho_0 / \rho_n, & \text{for regression tasks with } PM_n \neq R^2 \end{cases} \qquad (6)$$
where $\rho_n$ is the performance metric of the model trained with the FS layer and tested with the n most relevant features, and $\rho_0$ is the performance metric of the same model but without the FS layer and tested with the full set of features. Since typical regression performance metrics (except the R2 coefficient) behave like the prediction error, the difference in definition is necessary to allow the same kind of analysis regardless of the task. Performance metrics are used as in Equation (6) to easily illustrate whether the inclusion of an FS layer is beneficial or detrimental to the model's performance. A value >1 indicates that the use of an FS layer leads to an increase in the model's performance, whereas a value ≤1 indicates the contrary.
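A small sketch of Equation (6), under our own naming, makes the convention explicit:

```python
# Sketch of the relative performance metric of Equation (6). rho_n is the
# metric of the FS-layer model evaluated with the top-n features; rho_0 is
# that of the baseline model (no FS layer) with all features.
def relative_pm(rho_n, rho_0, error_like=False):
    # error_like=True for regression metrics such as MSE/MAE (lower is
    # better); False for classification metrics and R^2 (higher is better).
    return rho_0 / rho_n if error_like else rho_n / rho_0
```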
The proposed methodology involves the evaluation of models in terms of task perfor-
mance (through a chosen performance metric) and their feature ranking quality (through
the proposed RQS metric) simultaneously. Since the RQS metric quantifies how the model
reaches its maximum performance and not the performance itself, this simultaneous anal-
ysis is not only possible but necessary. This is more clearly explained by observing that
a model may have a good ranking quality (reflected in a high RQS value) but a low task
performance. Thus, the selection of an appropriate model (including the subset of input features to be used) is determined by both its task performance and its RQS. Models with a high RQS might benefit from other selection criteria besides maximum performance, since they achieve high levels of performance with fewer features.
In this work, two criteria for model selection are analyzed. The first one is maximum performance; the second one is the number of features needed to reach 95% of the maximum performance value. This shows the performance and applicability of the techniques in a scenario where maximum performance is required, and in another where a 5% decrease in performance is allowed, emulating a real-case scenario. However, the application of this framework is not restricted to these two criteria. Other criteria can be used for model selection depending on particular objectives and limitations. With this, model selection is done by evaluating not only a performance value but also the RQS value simultaneously.
$$MI = \sum_{x \in X} \sum_{y \in Y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)} \qquad (7)$$
where $p(x, y)$ is the joint probability distribution of features X and Y, and $p(x)$ and $p(y)$ are their marginal distributions. Thus, features on which the output is more dependent are more relevant. In ReliefF, features are analyzed in
terms of how well they can distinguish classes in instances close to each other. The feature
weights are calculated by measuring how the feature value varies with respect to a nearest
hit (the nearest instance with the same class) and with respect to nearest misses (nearest
instances with different classes). This is repeated for a set of instances. For regression tasks,
the ReliefF algorithm is extended to RReliefF [85]. In the case of random forest, feature
importance is calculated during training based on how much each feature contributes to the
decrease in impurity when splitting data in each tree. For classification tasks, this impurity
measure is typically the Gini impurity. For regression tasks, impurity is measured through
variance. The splitting process is explained in more detail in [86]. The AFS technique, as
described in the Introduction section, presents a detachable module for feature selection
based on attention mechanism. An attention net for each feature is trained jointly with the
rest of the network to determine whether the feature is relevant or not. The output of each
attention net after training is used to rank features.
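For the filter and embedded baselines, rankings can be obtained with standard tooling; a brief sketch assuming scikit-learn (for a classification task) is:

```python
# Sketch of two baseline rankings (assuming scikit-learn): mutual
# information (filter) and random forest impurity importance (embedded).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import mutual_info_classif

def mi_ranking(X, y):
    return np.argsort(-mutual_info_classif(X, y))   # most relevant first

def rf_ranking(X, y):
    rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
    return np.argsort(-rf.feature_importances_)     # Gini-based importance
```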
These four techniques cover filter and embedded methods. Wrapper methods are
not used since they become unfeasible for some datasets. Mutual information and ReliefF
techniques are filter methods; thus, they are model-agnostic. On the other hand, Random
Forest is a popular ML technique, which has an embedded method for feature selection.
The AFS method is used for comparison because it is a technique embedded in neural
networks, achieving promising results when compared against other filter and embedded
techniques, including one used for neural networks referred to in Roy et al. [59]. To the
best of the authors’ knowledge, the AFS technique is the most promising one amongst
those embedded in neural networks. In this paper, we use two different configurations
of this model to compare them with the proposed FS layer technique, namely AFS and
AFS 2. In the first (AFS), the first hidden layer after the input layer (referred to in the
paper as E) has an N/2 neurons, N being the number of input features. A graphical
representation of an example (with N = 4) is shown in Figure 9. In AFS 2, the same layer
has N neurons. Regarding the attention nets, it was decided not to include any hidden
Sensors 2021, 21, 5888 17 of 30
n o
layers (referred to in their work as h1k . . . hkL ) for three reasons: first, the authors in [58] do
not specify the number of hidden layers and neurons used in their experiments; second, the
implementation available in [87] does not include hidden layers in their attention nets; and
last, depending on the number of features, the inclusion of hidden layers would increase
the model size and, thus, its computational cost. Therefore, the E layer is connected to N
softmax-activated layers with two neurons, which is used to determine the importance of
each feature before entering the learning module.
Figure 9. Example representation of the AFS network. Note that the E layer has N/2 neurons. In the AFS 2 configuration, the E layer has N neurons. As described in the original work, the E layer has a tanh activation function. It serves as input to N attention nets, each of them used to determine the importance value of an input feature. Based on [58].
Regarding LIME and SHAP, the two widely used techniques discussed in the Introduction section, they are not used for comparison because their mechanism for interpretability relies on the model's predictions. They are post hoc techniques in which explanations depend entirely on the model's outputs and, therefore, on its inputs. LIME is a method for local interpretability, while in the case of the SHAP algorithm, feature importance is obtained through the mean Shapley values of each feature across the dataset. This means that the resulting importance values may vary when evaluated on other datasets (such as the test set). In contrast, all of the approaches above estimate feature importance values that do not change when varying the test set.
To evaluate the ranking quality of the aforementioned techniques, the first step is to obtain the features' importance values using each of them. Then, a neural network with the same configuration as the main model but without the FS layer is trained. After that, similar to the process depicted in Figure 6, the n most relevant features are used to evaluate the network's performance on the masked test set, with n varying from 1 to the total number of features N. The network's performance evolution across the number of features is represented as a curve, and the RQS is calculated, as discussed in the following section.
Figure 10. Comparison of relative F1-score based on six different rankings. The red and green dotted lines indicate the number of features needed, when using the FS layer technique, to reach maximum performance and 95% of this value, respectively.
Table 9 shows the performance of the different techniques in terms of ranking quality and task performance. It also shows the number of variables needed to reach the maximum performance value and the 95% threshold. The RQS values presented confirm what is seen in Figure 10 regarding ranking quality. The FS layer technique achieves the highest RQS value, followed by ReliefF, Random Forest, AFS 2, Mutual Information and AFS. At some point, all techniques reach a very similar level of performance (shown in the table by max(PM)). However, the number of input features required for reaching the neighborhood of that value is different for each technique. The proposed technique requires 29 features to reach the 95% performance threshold. In turn, the ReliefF, Random Forest, AFS 2, Mutual Information and AFS techniques need 63, 80, 87, 85 and 92 features, respectively.
This example shows how the RQS metric favors techniques that require fewer variables to achieve high levels of performance. As can be seen in the figure and the table above, the techniques with the highest RQS need fewer input features to achieve a high level of performance. In particular, when comparing the AFS 2 technique with the Mutual Information technique, it can be seen that both reach high levels of performance needing approximately the same number of features (87 and 85, respectively). However, because the AFS 2 technique shows higher performance than the Mutual Information technique in most cases, it has a higher RQS.
Table 9. Task performance and ranking quality for the CWR case study.

Approach        RQS      max(PM)  Features for max(PM)  PM for 95% Threshold  Features for 95% Threshold
FS layer        0.9754   1.0014   44                    0.9599                29
Mutual Info.    0.7215   1.0000   100                   0.9626                85
ReliefF         0.8630   1.0000   95                    0.9587                63
Random Forest   0.8346   1.0000   98                    0.9646                80
AFS             0.7127   0.9997   100                   0.9629                92
AFS 2           0.7466   0.9997   100                   0.9545                87
Overall, it can be seen that the techniques with higher RQS values need fewer features to reach the 95% performance threshold. In this case, a high RQS indicates that some variables can be discarded without an important performance loss with respect to the maximum value.
Regarding performance, Table 9 shows that feature selection improves performance when using the FS layer, with a 0.14% F1-score improvement reached when using the 44 most relevant features. This suggests that the inclusion of an FS layer not only maintains the performance level but also improves it. However, the difference in performance for this case study is not large enough to be conclusive.
Regarding model selection, the FS layer technique is the most suitable under both criteria. Under the first one, the highest performance value is reached with this technique, and with the fewest features. Under the second criterion, the desired threshold is reached with only 29 features. For this criterion, the performance value is not as important as the number of features needed, as long as the value is above the determined threshold.
Table 10 shows the duration of the feature importance value generation process for each technique. It can be seen that the mutual information and random forest techniques achieve the best results, followed by the proposed FS layer technique. Furthermore, when the latter is compared to other embedded techniques (such as AFS and AFS 2), it is noted that the proposed technique is much less time-consuming. This is because the FS layer technique only adds one parameter per feature. This is not the case for the AFS and AFS 2 techniques, as each feature requires an attention module. As the number of features grows, the time difference between the proposed technique and the AFS techniques should also grow.
Table 10. Time duration for feature importance values generation (CWR case). For the embedded
techniques (FS layer, AFS and AFS 2), this value is calculated as the difference between the total
training time and the training time for a vanilla model without any embedded techniques. For the
random forest technique, this value corresponds to the total training time, as there is no way to
identify how much time it takes to obtain the desired values.
Figure 11. Comparison of relative MSE according to six different rankings. The red and green dotted lines indicate the number of features needed, when using the FS layer technique, to reach maximum performance and 95% of this value, respectively.
To further analyze the aforementioned issue regarding the mutual information-based technique, Table 11 shows the two most relevant features and the two least relevant features for each technique. The feature located at the middle of the ranking is also shown. Indeed, note that the mutual information-based technique gives the least relevance to variables 6 (physical core speed) and 10 (corrected core speed), which are among the two most relevant features in most of the other techniques. This shows that the core speed features alone are not enough for calculating the RUL of the turbine engine; however, alongside the rest of the variables, they are highly relevant. Because of this, the mutual information-based technique is not able to recognize their importance. It is also noteworthy that the FS layer technique is the only one that gives a high importance value to feature 8 (ratio of fuel flow to HP compressor outlet), while feature 10 drops to seventh place. This helps explain why the FS layer technique presents a performance evolution curve with better behavior than the others.
Table 11. Features ranked 1st, 2nd, 7th, 13th and 14th in the ranking, for the six compared techniques.
Table 12. Task performance and ranking quality for C-MAPSS FD001 dataset.
Approach        RQS      max(PM)  Features for max(PM)  PM for 95% Threshold  Features for 95% Threshold
FS layer        0.9216   1.0104   12                    0.9913                9
Mutual Info.    0.6786   1.0000   14                    1.0000                14
RReliefF        0.8730   1.0000   14                    0.9589                10
Random Forest   0.8465   1.0000   14                    0.9634                12
AFS             0.7821   1.0515   14                    1.0351                13
AFS 2           0.7560   1.0392   14                    1.0007                13
Table 13 shows the time consumption of each technique when calculating the feature importance values. As in the previous case, the mutual information and random forest techniques achieve the best results, followed by the proposed technique. When compared with the AFS techniques, the difference is not as large as in the previous case; however, it is still considerable: the AFS techniques take an order of magnitude more time than the proposed FS layer technique. This shows that, despite this case having far fewer variables than the previous one (14 versus 100), the FS layer is still a faster technique.
Table 13. Time duration for feature importance values generation (C-MAPSS FD001 case).
Figure 12. Comparison of relative MSE based on six different rankings for the C-MAPSS FD004 dataset. The green dotted line indicates the number of features needed when using the FS layer technique to reach 95% of the maximum performance value.
Table 14 shows the performance associated with each technique, along with its ranking quality. Results show that the proposed technique achieves the highest RQS value. However, its maximum relative MSE is lower than 1; in this case, the use of an FS layer does not imply an improvement in performance. Since the difference is only 0.05%, however, it can be concluded that the same level of performance is maintained. These results are valid for the two criteria for model selection, since discarding a feature results in an important decrease in performance independently of the employed technique.
Table 14. Task performance and ranking quality for C-MAPSS FD004 dataset.
Approach        RQS      max(PM)  Features for max(PM)  PM for 95% Threshold  Features for 95% Threshold
FS layer        0.1750   0.9995   14                    0.9995                14
Mutual Info.    0.1524   1.0000   14                    1.0000                14
RReliefF        0.1545   1.0000   14                    1.0000                14
Random Forest   0.1715   1.0000   14                    1.0000                14
AFS             0.1679   0.9761   14                    0.9761                14
AFS 2           0.1726   0.9652   14                    0.9652                14
Table 15 shows how much time each technique takes to calculate the features' importance values. As in the two previous cases, the mutual information and random forest techniques achieve the best results, followed by the proposed technique. RReliefF, AFS and AFS 2 take longer than the rest to calculate these values. When comparing the proposed FS technique with the AFS techniques, it can be noted that the AFS techniques take more than six times longer. The difference with respect to the C-MAPSS FD001 case comes from the fact that the FD004 dataset has three times more records and is trained for 10,000 epochs instead of 800.
Table 15. Time duration for feature importance values generation (C-MAPSS FD004 case).
Figure 13. Comparison of relative MSE based on six different rankings for the NGTP dataset. The red and green dotted lines indicate the number of features needed, when using the FS layer technique, to reach maximum performance and 95% of this value, respectively.
In order to understand how the FS layer technique ranks features, Figure 14 shows the FS layer weights calculated by the model after training. It can be seen that the most relevant features are related to the amine contactor and the reboiler. Important features related to the contactor are the nontreated gas temperature, the amine temperature before entering the contactor, and the amine pressure at the contactor. This is coherent with the fact that the CO2 removal process occurs in the contactor; therefore, these variables directly affect its performance. The amine temperature at the reboiler appears as the second most relevant feature, ranked higher than the stripping tower-related features. Since the heating process that occurs in the reboiler comes immediately after the stripping process, there is a correlation between them. Results in Figure 14 show that the technique is able to capture this correlation, and in the correct order. On the other hand, the least important feature is the temperature difference between the nontreated gas and the amine in the contactor. This means that the proposed approach can derive that information internally from other features (i.e., the nontreated gas temperature and the temperature of the amine entering the contactor) and, thus, can dispose of that feature. Although the model needs all 10 features to reach its maximum relative MSE when using the FS layer technique (as seen in Figure 13), performance increments are marginal when adding the two least relevant features. A more robust neural network (with more layers or neurons per layer) could solve this issue and reach maximum performance without the need for these two last features.
Table 16 shows the quantitative comparison of the different techniques. Results show that, along with the FS layer technique, the AFS techniques (AFS and AFS 2) and the Mutual Information technique lead to a relative MSE higher than 1. Although the latter reaches its maximum performance with nine variables, its value is considerably smaller than the one reached by the proposed approach. When using an FS layer, there is a performance increase of approximately 20%. Regarding ranking quality, the proposed approach has an RQS value similar to the Mutual Information-based approach. This verifies what is shown in Figure 13: although reaching different maximum values, the two curves show similar behavior, with a performance improvement rate that stabilizes when using more than eight input features. On the contrary, the AFS and AFS 2 techniques reach a high maximum performance value but rely heavily on the use of all available input features. For example, when using nine out of the ten features, both techniques present the lowest performance values. Thus, despite achieving high performance, their RQS values are the lowest of all six techniques presented. This emphasizes the fact that the RQS metric does not consider the maximum performance reached but how fast that performance is reached in terms of input features required. This is also emphasized when looking at the results for the 95% performance threshold criterion. The AFS and AFS 2 techniques, which show the lowest RQS values, cannot discard any features from the feature set without a major decrease in performance. This is not the case with the other techniques, in which one or two features can be discarded. In the particular case of the FS layer technique, eight features can be used for deployment with a performance decrease below 5% with respect to the maximum value, while still performing better than the rest of the techniques. This shows how the RQS metric helps in the model selection process even when different criteria are used. The RQS indicates how likely it is that features can be discarded without an important performance decrease. The decision must be made by jointly analyzing performance values and RQS. After this, the corresponding criterion must be used to select the appropriate number of features.
Table 16. Task performance and ranking quality for the NGTP case study.
Approach        RQS      max(PM)  Features for max(PM)  PM for 95% Threshold  Features for 95% Threshold
FS layer        0.7755   1.1979   10                    1.1873                8
Mutual Info.    0.7582   1.0075   9                     0.9767                8
RReliefF        0.5676   1.0000   10                    0.9693                9
Random Forest   0.6596   1.0000   10                    0.9767                8
AFS             0.4946   1.1664   10                    1.1664                10
AFS 2           0.4126   1.1503   10                    1.1503                10
Table 17 shows how much time each technique takes to calculate feature importance values. The same trend as in the previous cases appears, with mutual information and random forest achieving the best results, followed by the proposed FS layer. As in the other cases, the two AFS configurations take a similar amount of time to each other; in this case, they take more than five times longer than the proposed technique to calculate feature importance values. From the comparison of the three embedded ad hoc techniques in the four case studies, it can be seen that the AFS techniques add a larger amount of complexity to the model in order to obtain feature importance values, which is reflected in the time they take.
Table 17. Time duration for feature importance values generation (NGTP case).
6. Conclusions
In this work, a novel technique was presented for feature selection in deep neural networks for PHM models, with the objective of increasing interpretability without performance loss. It consists of a locally connected layer next to the input layer, whose weights represent the importance of the associated features. The layer's configuration is advantageous in that it does not imply a considerable modification of the original network: in terms of trainable parameters, only a number of weights equal to the number of input features is added, and only two hyperparameters are added, namely the FS layer's activation function and the λ term for regularization. To evaluate the quality of the ranking obtained using the FS layer's weights, a new metric is introduced that analyzes feature rankings in terms of performance evolution regardless of the model's task (it can be used for discrete health state classification as well as RUL prediction) and performance metric (it can be used for metrics such as accuracy, F1-score, MSE, and MAE). Thus, through the proposed framework, task performance and ranking quality are analyzed independently and simultaneously. This helps to achieve a more informed model selection.
Results across three case studies within the presented framework show that the proposed technique achieves higher RQS values than the rest of the compared techniques. It identifies irrelevant features, allowing the model to reach maximum performance with a subset of the input features. Indeed, in the CWR case, maximum performance was reached with 44 out of the 100 input features. Regarding performance, it can be concluded that the inclusion of an FS layer into a deep neural network at least maintains the same level of performance and thus successfully tackles the accuracy/interpretability tradeoff. Indeed, in the NGTP case, performance shows a 19.79% increase. On the other hand, results in the C-MAPSS FD004 case indicate a 0.05% decrease in RUL prediction MSE, which is inconclusive. Thus, the use of the proposed FS layer technique opens the possibility of reducing the input feature space without performance loss when using DNNs. Overall, the RQS metric proves to serve as an indicator of how likely it is that features can be discarded without a relevant loss of performance, which is useful for real-life cases where the number of features may be a limitation and performance requirements allow a certain amount of decrease.
This work attempted to improve the interpretability of DL-based PHM models and address the accuracy/interpretability tradeoff by presenting a technique for global interpretation. Even though the proposed technique allows a reduction of the feature space without performance loss and informs the importance of each feature within the model, it does not explain how features interact inside the algorithm to generate a single prediction. This is the main limitation of the proposed technique; it can be addressed by enhancing the technique with other kinds of explanations, such as counterfactuals.
Future work includes evaluating this technique in other DL algorithms, such as
autoencoders, convolutional neural networks, and long short-term memory networks to
determine the effectiveness of the proposed framework in a wider range of deep learning
approaches. In addition, an integration with other kinds of ML algorithms (for example,
SVM) is a possible continuation of this research. To further reduce the interpretability issue
mentioned in the Introduction section, it is necessary to analyze models locally. In this
sense, research regarding local interpretation of DL models is also included as future work.
Appendix A
To prove that
$$\sum_{n=1}^{N} PM_n \cdot n \;\le\; \sum_{n=1}^{N} \max(PM) \cdot n$$
(and hence that the RQS metric of Equation (5) lies in [0,1]), mathematical induction was used. We generalize $PM_n$ to any function $\alpha(n)$ with values within the range [0,1].
Base case (n = 1):
$$\sum_{n=1}^{1} \alpha(n) \cdot n = \alpha(1) \cdot 1 \;\le\; \max_{n \in [1,1]}\{\alpha(n)\} \cdot 1 = \sum_{n=1}^{1} \max_{n \in [1,1]}\{\alpha(n)\} \cdot n$$
Induction step:
$$\sum_{n=1}^{K} \alpha(n) \cdot n \le \sum_{n=1}^{K} \max_{n \in [1,K]}\{\alpha(n)\} \cdot n \;\Longrightarrow\; \sum_{n=1}^{K+1} \alpha(n) \cdot n \le \sum_{n=1}^{K+1} \max_{n \in [1,K+1]}\{\alpha(n)\} \cdot n$$
Proof.
$$\begin{aligned}
\sum_{n=1}^{K+1} \alpha(n) \cdot n &= \sum_{n=1}^{K} \alpha(n) \cdot n + \alpha(K+1)(K+1) \\
&\le \sum_{n=1}^{K} \max_{n \in [1,K]}\{\alpha(n)\} \cdot n + \alpha(K+1)(K+1) \\
&= \max_{n \in [1,K]}\{\alpha(n)\} \cdot \frac{K(K+1)}{2} + \alpha(K+1)(K+1) \\
&\le \max_{n \in [1,K+1]}\{\alpha(n)\} \cdot \frac{K(K+1)}{2} + \max_{n \in [1,K+1]}\{\alpha(n)\} \cdot (K+1) \\
&= \max_{n \in [1,K+1]}\{\alpha(n)\} \cdot \left( \frac{K(K+1)}{2} + (K+1) \right) \\
&= \max_{n \in [1,K+1]}\{\alpha(n)\} \cdot \frac{(K+1)(K+2)}{2} \\
&= \sum_{n=1}^{K+1} \max_{n \in [1,K+1]}\{\alpha(n)\} \cdot n
\end{aligned}$$
References
1. Minsky, M.; Papert, S. Perceptrons—An Introduction to Computational Geometry; MIT Press: Cambridge, MA, USA, 1969.
2. Rosenblatt, F. The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain. Psychol. Rev. 1958, 65,
386–408. [CrossRef]
3. Werbos, P.J. Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences; Harvard University: Cambridge, MA, USA, 1974.
4. McCulloch, W.S.; Pitts, W. A Logical Calculus of the Ideas Immanent in Nervous Activity. Bull. Math. Biophys. 1943, 5, 115–133.
[CrossRef]
5. Ivakhnenko, A.; Lapa, V.G. Cybernetics and Forecasting Techniques; American Elsevier Pub. Co.: New York, NY, USA, 1967.
6. Ivakhnenko, A.G. Polynomial Theory of Complex Systems. IEEE Trans. Syst. Man Cybern. 1971, 1, 364–378. [CrossRef]
7. Goldberg, Y. Neural Network Methods for Natural Language Processing; Morgan and Claypool Publishers: San Rafael, CA, USA,
2017; Volume 10.
8. Esteva, A.; Kuprel, B.; Novoa, R.A.; Ko, J.; Swetter, S.M.; Blau, H.M.; Thrun, S. Dermatologist-Level Classification of Skin Cancer
with Deep Neural Networks. Nature 2017, 542, 115–118. [CrossRef] [PubMed]
9. Ali, F.; El-Sappagh, S.; Islam, S.M.R.; Kwak, D.; Ali, A.; Imran, M.; Kwak, K.S. A Smart Healthcare Monitoring System for Heart
Disease Prediction Based on Ensemble Deep Learning and Feature Fusion. Inf. Fusion 2020, 63, 208–222. [CrossRef]
10. Ijaz, M.F.; Attique, M.; Son, Y. Data-Driven Cervical Cancer Prediction Model with Outlier Detection and Over-Sampling Methods.
Sensors 2020, 20, 2809. [CrossRef]
11. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. Adv. Neural Inf.
Process. Syst. 2012, 25, 1097–1105. [CrossRef]
12. Dreossi, T.; Ghosh, S.; Sangiovanni-Vincentelli, A.; Seshia, S.A. Systematic Testing of Convolutional Neural Networks for
Autonomous Driving. arXiv 2017, arXiv:1708.03309. Available online: http://arxiv.org/abs/1708.03309 (accessed on 3 December
2020).
13. Corradini, D.; Brizi, L.; Gaudiano, C.; Bianchi, L.; Marcelli, E.; Golfieri, R.; Schiavina, R.; Testa, C.; Remondini, D. Challenges
in the Use of Artificial Intelligence for Prostate Cancer Diagnosis from Multiparametric Imaging Data. Cancers 2021, 13, 3944.
[CrossRef]
14. Rhyou, S.-Y.; Yoo, J.-C. Cascaded Deep Learning Neural Network for Automated Liver Steatosis Diagnosis Using Ultrasound
Images. Sensors 2021, 21, 5304. [CrossRef]
15. Ahmed, S.; Shaikh, A.; Alshahrani, H.; Alghamdi, A.; Alrizq, M.; Baber, J.; Bakhtyar, M. Transfer Learning Approach for
Classification of Histopathology Whole Slide Images. Sensors 2021, 21, 5361. [CrossRef] [PubMed]
16. Ryu, S.; Joe, I. A Hybrid DenseNet-LSTM Model for Epileptic Seizure Prediction. Appl. Sci. 2021, 11, 7661. [CrossRef]
17. Biswas, S.; Chatterjee, S.; Majee, A.; Sen, S.; Schwenker, F.; Sarkar, R. Prediction of COVID-19 from Chest CT Images Using an
Ensemble of Deep Learning Models. Appl. Sci. 2021, 11, 7004. [CrossRef]
18. Yu, Y.; Wang, C.; Gu, X.; Li, J. A Novel Deep Learning-Based Method for Damage Identification of Smart Building Structures.
Struct. Health Monit. 2019, 18, 143–163. [CrossRef]
19. San Martin, G.; López Droguett, E.; Meruane, V.; das Chagas Moura, M. Deep Variational Auto-Encoders: A Promising Tool for
Dimensionality Reduction and Ball Bearing Elements Fault Diagnosis. Struct. Health Monit. 2019, 18, 1092–1128. [CrossRef]
20. Chen, Z.; Deng, S.; Chen, X.; Li, C.; Sanchez, R.V.; Qin, H. Deep Neural Networks-Based Rolling Bearing Fault Diagnosis.
Microelectron. Reliab. 2017, 75, 327–333. [CrossRef]
21. Verstraete, D.; Ferrada, A.; Droguett, E.L.; Meruane, V.; Modarres, M. Deep Learning Enabled Fault Diagnosis Using Time-
Frequency Image Analysis of Rolling Element Bearings. Shock Vib. 2017, 2017, 5067651. [CrossRef]
22. Gan, M.; Wang, C.; Zhu, C. Construction of Hierarchical Diagnosis Network Based on Deep Learning and Its Application in the
Fault Pattern Recognition of Rolling Element Bearings. Mech. Syst. Signal. Process. 2016, 72–73, 92–104. [CrossRef]
23. Cofre-Martel, S.; Kobrich, P.; Lopez Droguett, E.; Meruane, V. Deep Convolutional Neural Network-Based Structural Damage
Localization and Quantification Using Transmissibility Data. Shock Vib. 2019, 2019, 1–27. [CrossRef]
24. Barraza, J.F.; Droguett, E.L.; Naranjo, V.M.; Martins, M.R. Capsule Neural Networks for Structural Damage Localization and
Quantification Using Transmissibility Data. Appl. Soft Comput. 2020, 97, 106732. [CrossRef]
25. Glowacz, A. Ventilation Diagnosis of Angle Grinder Using Thermal Imaging. Sensors 2021, 21, 2853. [CrossRef]
26. Yuan, M.; Wu, Y.; Lin, L. Fault Diagnosis and Remaining Useful Life Estimation of Aero Engine Using LSTM Neural Network.
In Proceedings of the AUS 2016 IEEE International Conference on Aircraft Utility Systems, Beijing, China, 10–12 October 2016;
pp. 135–140. [CrossRef]
27. Ben Ali, J.; Chebel-Morello, B.; Saidi, L.; Malinowski, S.; Fnaiech, F. Accurate Bearing Remaining Useful Life Prediction Based on
Weibull Distribution and Artificial Neural Network. Mech. Syst. Signal. Process. 2015, 56, 150–172. [CrossRef]
28. Aria, A.; Lopez Droguett, E.; Azarm, S.; Modarres, M. Estimating Damage Size and Remaining Useful Life in Degraded Structures
Using Deep Learning-Based Multi-Source Data Fusion. Struct. Health Monit. 2020, 19, 1542–1559. [CrossRef]
29. Ruiz-Tagle Palazuelos, A.; Droguett, E.L.; Pascual, R. A Novel Deep Capsule Neural Network for Remaining Useful Life
Estimation. Proc. Inst. Mech. Eng. Part. O J. Risk Reliab. 2020, 234, 151–167. [CrossRef]
30. Verstraete, D.; Droguett, E.; Modarres, M. A Deep Adversarial Approach Based on Multisensor Fusion for Remaining Useful
Life Prognostics. In Proceedings of the 29th European Safety and Reliability Conference (ESREL), Hannover, Germany, 22–26
September 2019; pp. 1072–1077. [CrossRef]
31. Zhang, J.; Wang, P.; Yan, R.; Gao, R.X. Long Short-Term Memory for Machine Remaining Life Prediction. J. Manuf. Syst. 2018, 48,
78–86. [CrossRef]
32. Figueroa Barraza, J.; Guarda Bräuning, L.; Benites Perez, R.; Morais, C.B.; Martins, M.R.; Droguett, E.L. Deep Learning Health
State Prognostics of Physical Assets in the Oil and Gas Industry. Proc. Inst. Mech. Eng. Part. O J. Risk Reliab. 2020. [CrossRef]
33. Park, P.; Di Marco, P.; Shin, H.; Bang, J. Fault Detection and Diagnosis Using Combined Autoencoder and Long Short-Term
Memory Network. Sensors 2019, 19, 4612. [CrossRef]
34. Reddy, K.K.; Sarkar, S.; Venugopalan, V.; Giering, M. Anomaly Detection and Fault Disambiguation in Large Flight Data: A
Multi-Modal Deep Auto-Encoder Approach. In Proceedings of the Annual Conference of the Prognostics and Health Management
Society, Denver, CO, USA, 3–6 October 2016; Volume 8, pp. 192–199.
35. Li, H.; Yu, L.; He, W. The Impact of GDPR on Global Technology Development. J. Glob. Inf. Technol. Manag. 2019, 22, 1–6.
[CrossRef]
36. Doshi-Velez, F.; Kim, B. Towards A Rigorous Science of Interpretable Machine Learning. 2017. Available online: http://arxiv.org/
abs/1702.08608 (accessed on 9 March 2021).
37. Fan, F.; Xiong, J.; Wang, G. On Interpretability of Artificial Neural Networks. arXiv 2020. Available online: https://arxiv.org/abs/
2001.02522 (accessed on 1 March 2021).
38. Carvalho, D.V.; Pereira, E.M.; Cardoso, J.S. Machine Learning Interpretability: A Survey on Methods and Metrics. Electronics
2019, 8, 832. [CrossRef]
39. Barredo Arrieta, A.; Díaz-Rodríguez, N.; Del Ser, J.; Bennetot, A.; Tabik, S.; Barbado, A.; Garcia, S.; Gil-Lopez, S.; Molina, D.;
Benjamins, R.; et al. Explainable Artificial Intelligence (XAI): Concepts, Taxonomies, Opportunities and Challenges
toward Responsible AI. Inf. Fusion 2020, 58, 82–115. [CrossRef]
40. Alvarez-Melis, D.; Jaakkola, T.S. Towards Robust Interpretability with Self-Explaining Neural Networks. In Proceedings of the
Advances in Neural Information Processing Systems, Montreal, QC, Canada, 3–8 December 2018; pp. 7775–7784.
41. Lundberg, S.M.; Lee, S.I. A Unified Approach to Interpreting Model Predictions. In Proceedings of the Advances in Neural
Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017.
42. Rezaeianjouybari, B.; Shang, Y. Deep Learning for Prognostics and Health Management: State of the Art, Challenges, and
Opportunities. Meas. J. Int. Meas. Confed. 2020, 163, 107929. [CrossRef]
43. Ribeiro, M.T.; Guestrin, C. “Why Should I Trust You?” Explaining the Predictions of Any Classifier. In Proceedings of the 22nd
ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016;
pp. 1135–1144. [CrossRef]
44. Slack, D.; Hilgard, S.; Jia, E. Fooling LIME and SHAP: Adversarial Attacks on Post Hoc Explanation Methods. In Proceedings of
the AAAI/ACM Conference on AI, Ethics, and Society, New York, NY, USA, 7–8 February 2020; pp. 180–186. [CrossRef]
45. Jović, A.; Brkić, K.; Bogunović, N. A Review of Feature Selection Methods with Applications. In Proceedings of the 2015 38th
International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), Opatija,
Croatia, 25–29 May 2015; pp. 1200–1205. [CrossRef]
46. Blum, A.L.; Langley, P. Selection of Relevant Features and Examples in Machine Learning. Artif. Intell. 1997, 97, 245–271. [CrossRef]
47. Ferreira, A.J.; Figueiredo, M.A.T. Efficient Feature Selection Filters for High-Dimensional Data. Pattern Recognit. Lett. 2012, 33,
1794–1804. [CrossRef]
48. Saha, S.K.; Sarkar, S.; Mitra, P. Feature Selection Techniques for Maximum Entropy Based Biomedical Named Entity Recognition.
J. Biomed. Inform. 2009, 42, 905–911. [CrossRef] [PubMed]
49. Hameed, S.S.; Petinrin, O.O.; Hashi, A.O.; Saeed, F. Filter-Wrapper Combination and Embedded Feature Selection for Gene
Expression Data. Int. J. Adv. Soft Comput. Appl. 2018, 10, 90–105.
50. Maldonado, S.; López, J. Dealing with High-Dimensional Class-Imbalanced Datasets: Embedded Feature Selection for SVM
Classification. Appl. Soft Comput. J. 2018, 67, 94–105. [CrossRef]
51. Chang, C.-H.; Rampasek, L.; Goldenberg, A. Dropout Feature Ranking for Deep Learning Models. arXiv 2017, arXiv:1712.08645.
52. Zou, Q.; Ni, L.; Zhang, T.; Wang, Q. Deep Learning Based Feature Selection for Remote Sensing Scene Classification. IEEE Geosci. Remote Sens. Lett. 2015, 12, 2321–2325. [CrossRef]
53. Helleputte, T.; Dupont, P. Partially Supervised Feature Selection with Regularized Linear Models. In Proceedings of the 26th
Annual International Conference on Machine Learning, ICML 2009, Montreal, QC, Canada, 14–18 June 2009; pp. 409–416.
[CrossRef]
54. Nezhad, M.Z.; Zhu, D.; Li, X.; Yang, K.; Levy, P. SAFS: A Deep Feature Selection Approach for Precision Medicine. In Proceedings
of the 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Shenzhen, China, 15–18 December 2016;
pp. 501–506. [CrossRef]
55. Feng, C.; Cui, M.; Hodge, B.M.; Zhang, J. A Data-Driven Multi-Model Methodology with Deep Feature Selection for Short-Term
Wind Forecasting. Appl. Energy 2017, 190, 1245–1257. [CrossRef]
56. Mbuvha, R.; Boulkaibet, I.; Marwala, T. Automatic Relevance Determination Bayesian Neural Networks for Credit Card Default
Modelling. arXiv 2019, arXiv:1906.06382.
57. Škrlj, B.; Džeroski, S.; Lavrač, N.; Petković, M. Feature Importance Estimation with Self-Attention Networks. Front. Artif. Intell. Appl. 2020, 325, 1491–1498. [CrossRef]
58. Gui, N.; Ge, D.; Hu, Z. AFS: An Attention-Based Mechanism for Supervised Feature Selection. In Proceedings of the AAAI
Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; pp. 3705–3713. [CrossRef]
59. Roy, D.; Murty, K.S.R.; Mohan, C.K. Feature Selection Using Deep Neural Networks. In Proceedings of the 2015 International
Joint Conference on Neural Networks, Killarney, Ireland, 12–17 July 2015. [CrossRef]
60. LeCun, Y.; Cortes, C.; Burges, C.J.C. The MNIST Database of Handwritten Digits. 1998. Available online: http://yann.lecun.com/exdb/mnist/ (accessed on 14 December 2020).
61. Basu, S.; Karki, M.; Ganguly, S.; DiBiano, R.; Mukhopadhyay, S.; Nemani, R. Learning Sparse Feature Representations Using Probabilistic Quadtrees and Deep Belief Nets. Neural Process. Lett. 2017, 45, 855–867. [CrossRef]
62. Li, J.; Cheng, K.; Wang, S.; Morstatter, F.; Trevino, R.P.; Tang, J.; Liu, H. Feature Selection: A Data Perspective. ACM Comput. Surv. 2017, 50, 1–45. [CrossRef]
63. Rumelhart, D.E.; Hinton, G.E.; Williams, R.J. Learning Representations by Back-Propagating Errors. Nature 1986, 323, 533–536.
[CrossRef]
64. Srivastava, N.; Hinton, G.; Krizhevsky, A.; Salakhutdinov, R. Dropout: A Simple Way to Prevent Neural Networks from
Overfitting. J. Mach. Learn. Res. 2014, 15, 1929–1958.
65. Tibshirani, R. Regression Shrinkage and Selection Via the Lasso. J. R. Stat. Soc. Ser. B 1996, 58, 267–288. [CrossRef]
66. Hoerl, A.E.; Kennard, R.W. Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics 1970, 12, 55–67.
[CrossRef]
67. Hoerl, A.E.; Kennard, R.W. Ridge Regression: Applications to Nonorthogonal Problems. Technometrics 1970, 12, 69–82. [CrossRef]
68. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016; Volume 1. [CrossRef]
69. Venkatesh, B.; Anuradha, J. A Review of Feature Selection and Its Methods. Cybern. Inf. Technol. 2019, 19, 3–26. [CrossRef]
70. Ng, A.Y. Feature Selection, L1 vs. L2 Regularization, and Rotational Invariance. In Proceedings of the Twenty-First International Conference on Machine Learning—ICML ’04, Banff, AB, Canada, 4–8 July 2004; p. 78.
71. Hastie, T.; Tibshirani, R.; Friedman, J. Linear Methods for Regression. In The Elements of Statistical Learning; Springer: New York,
NY, USA, 2009; pp. 43–99.
72. Loparo, K.A. Bearing Data Center, Case Western Reserve University. Available online: https://csegroups.case.edu/
bearingdatacenter/pages/welcome-case-western-reserve-university-bearing-data-center-website (accessed on 8 February 2021).
73. Yuan, J.; Cao, S.; Ren, G.; Jiang, H.; Zhao, Q. SGWnet: An Interpretable Convolutional Neural Network for Mechanical Fault Intelligent Diagnosis. In Proceedings of the Neural Computing for Advanced Applications, Second International Conference (NCAA 2021), Guangzhou, China, 27–30 August 2021; pp. 360–374.
74. Rauber, T.W.; De Assis Boldt, F.; Varejão, F.M. Heterogeneous Feature Models and Feature Selection Applied to Bearing Fault
Diagnosis. IEEE Trans. Ind. Electron. 2015, 62, 637–646. [CrossRef]
75. Hui, K.H.; Ooi, C.S.; Lim, M.H.; Leong, M.S.; Al-Obaidi, S.M. An Improved Wrapper-Based Feature Selection Method for
Machinery Fault Diagnosis. PLoS ONE 2017, 12, e0189143. [CrossRef]
76. Li, Y.; Dai, W.; Zhang, W. Bearing Fault Feature Selection Method Based on Weighted Multidimensional Feature Fusion. IEEE
Access 2020, 8, 19008–19025. [CrossRef]
77. Saxena, A.; Goebel, K.; Simon, D.; Eklund, N. Damage Propagation Modeling for Aircraft Engine Run-to-Failure Simulation. In
Proceedings of the 2008 International Conference on Prognostics and Health Management (PHM 2008), Denver, CO, USA, 6–9
October 2008.
78. Cofre-Martel, S.; Droguett, E.L.; Modarres, M. Uncovering the Underlying Physics of Degrading System Behavior through a Deep Neural Network Framework: The Case of RUL Prognosis. Available online: https://arxiv.org/abs/2006.09288 (accessed on 20 May 2021).
79. Kong, Z.; Cui, Y.; Xia, Z.; Lv, H. Convolution and Long Short-Term Memory Hybrid Deep Neural Networks for Remaining Useful
Life Prognostics. Appl. Sci. 2019, 9, 4156. [CrossRef]
80. Zhao, C.; Huang, X.; Li, Y.; Iqbal, M.Y. A Double-Channel Hybrid Deep Neural Network Based on CNN and BiLSTM for
Remaining Useful Life Prediction. Sensors 2020, 20, 7109. [CrossRef]
81. Peng, C.; Chen, Y.; Chen, Q.; Tang, Z.; Li, L.; Gui, W. A Remaining Useful Life Prognosis of Turbofan Engine Using Temporal and
Spatial Feature Fusion. Sensors 2021, 21, 418. [CrossRef] [PubMed]
82. Frederick, D.K.; Decastro, J.A.; Litt, J.S. User’s Guide for the Commercial Modular Aero-Propulsion System Simulation (C-MAPSS);
National Aeronautics and Space Administration: Washington, DC, USA, 2007.
83. Khumprom, P.; Grewell, D.; Yodo, N. Deep Neural Network Feature Selection Approaches for Data-Driven Prognostic Model of
Aircraft Engines. Aerospace 2020, 7, 132. [CrossRef]
84. Kononenko, I. Estimating Attributes: Analysis and Extensions of RELIEF. Lect. Notes Comput. Sci. 1994, 784, 171–182. [CrossRef]
85. Robnik-Šikonja, M.; Kononenko, I. Theoretical and Empirical Analysis of ReliefF and RReliefF. Mach. Learn. 2003, 53, 23–69. [CrossRef]
86. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32. [CrossRef]
87. Gui, N.; Ge, D.; Hu, Z. The Code of the AAAI-19 Paper “AFS: An Attention-Based Mechanism for Supervised Feature Selection”.
Available online: https://github.com/upup123/AAAI-2019-AFS (accessed on 15 February 2020).
88. Prognostics Center of Excellence Datasets. Available online: https://ti.arc.nasa.gov/tech/dash/groups/pcoe/prognostic-data-
repository/ (accessed on 3 January 2021).