Bayesian Variational Recurrent Neural Networks For Prognostics and Health Management of Complex Systems
THESIS ADVISOR:
ENRIQUE LÓPEZ DROGUETT
COMMITTEE MEMBERS:
VIVIANA MERUANE NARANJO
RODRIGO PASCUAL JIMÉNEZ
This work was partially funded by a CONICYT National Master's Scholarship (Beca Magíster Nacional CONICYT)
SANTIAGO DE CHILE
2020
ABSTRACT OF THE THESIS SUBMITTED FOR THE DEGREE OF
MASTER OF SCIENCE IN ENGINEERING, MECHANICAL ENGINEERING
BY: DANILO FABIÁN GONZÁLEZ TOLEDO
YEAR: 2020
ADVISOR: ENRIQUE LÓPEZ DROGUETT
BAYESIAN VARIATIONAL RECURRENT NEURAL NETWORKS FOR
PROGNOSTICS AND HEALTH MANAGEMENT OF COMPLEX SYSTEMS.
In the last couple of years, many automated data analytics models have been implemented, providing solutions to problems ranging from the detection and identification of faces to translation between languages. These tasks are humanly manageable, but at a much lower processing rate than that of the complex models. However, the fact that they are humanly solvable allows the user to accept or discard the provided solution. As an example, a translation task could easily output misleading solutions if the context is not correctly provided, in which case the user can improve the input to the model or simply discard the translation.

Unfortunately, when these models are applied to large databases this 'double validation' is not possible, and the user might blindly trust the model output. This can lead to biased decision making, threatening not only productivity but also the safety of the workers.

Because of this, it is necessary to create models that quantify the uncertainty in their output. In this thesis work, the possibility of using distributions instead of single-point matrices as weights is fused with recurrent neural networks (RNNs), a type of neural network specialized in dealing with sequential data. The proposed model can be trained as a discriminative probabilistic model thanks to Bayes' theorem and variational inference.

The proposed model, called 'Bayesian Variational Recurrent Neural Networks', is validated with the benchmark C-MAPSS dataset for remaining useful life (RUL) prognosis. The model is also compared against the same architecture under a frequentist approach (single-point matrices as weights), against different state-of-the-art models and, finally, against MC Dropout, another method to quantify uncertainty in neural networks. The proposed model outperforms every comparison. Furthermore, it is tested on two bearing classification tasks, with data from the University of Ottawa and the Politecnico di Torino, and on two health indicator regression tasks, one from a commercial wind turbine from Green Power Monitor and the other from fatigue crack testing at the University of Maryland, showing low error and good performance in all tasks.

The above proves that the model can be used not only in regression tasks but also in classification. Finally, it is important to note that even though the validations are in a mechanical engineering context, the layers are not limited to it, and can be used in any other context with sequential data.
Today it is given to you,
today it is taken away.
Contents
1 Introduction 1
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Aim of the work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3.1 General objective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3.2 Specific objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Resources available for this Thesis . . . . . . . . . . . . . . . . . . . . . . . . 4
1.5 Structure of the work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2 Methodology 5
3 Theoretical Background 6
3.1 Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
3.1.1 Supervised learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
3.1.2 Regression and classification tasks . . . . . . . . . . . . . . . . . . . . 7
3.2 Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3.2.1 Artificial Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . 7
3.2.2 Convolutional neural networks . . . . . . . . . . . . . . . . . . . . . . 8
3.2.3 Recurrent neural networks . . . . . . . . . . . . . . . . . . . . . . . . 9
3.2.4 Training of Neural Networks . . . . . . . . . . . . . . . . . . . . . . . 12
3.3 Bayesian inference and Bayes By Backprop . . . . . . . . . . . . . . . . . . . 13
3.3.1 Weight perturbations . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.3.1.1 Flipout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
5 Study cases 22
5.1 Datasets for regression tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
5.1.1 NASA: C-MAPSS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
5.1.1.1 Data description . . . . . . . . . . . . . . . . . . . . . . . . 22
5.1.1.2 Pre processing . . . . . . . . . . . . . . . . . . . . . . . . . 24
5.1.1.3 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
5.1.2 Maryland: Crack Lines . . . . . . . . . . . . . . . . . . . . . . . . . . 26
5.1.2.1 Data description . . . . . . . . . . . . . . . . . . . . . . . . 26
5.1.2.2 Pre processing . . . . . . . . . . . . . . . . . . . . . . . . . 27
5.1.2.3 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
5.1.3 GPM Systems: Wind Turbine Data . . . . . . . . . . . . . . . . . . . 29
5.1.3.1 Data description . . . . . . . . . . . . . . . . . . . . . . . . 29
5.1.3.2 Pre processing . . . . . . . . . . . . . . . . . . . . . . . . . 30
5.1.3.3 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
5.2 Datasets for classification tasks . . . . . . . . . . . . . . . . . . . . . . . . . 33
5.2.1 Ottawa: Bearing vibration Data . . . . . . . . . . . . . . . . . . . . . 33
5.2.1.1 Data description . . . . . . . . . . . . . . . . . . . . . . . . 33
5.2.1.2 Pre processing . . . . . . . . . . . . . . . . . . . . . . . . . 34
5.2.1.3 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
5.2.2 Politecnico di Torino: Rolling bearing Data . . . . . . . . . . . . . . 36
5.2.2.1 Data description . . . . . . . . . . . . . . . . . . . . . . . . 36
5.2.2.2 Pre processing . . . . . . . . . . . . . . . . . . . . . . . . . 38
5.2.2.3 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
Bibliography 60
Appendix
A Statistical parameters extracted from signals as features . . . . . . . . . . . . i
B Results of Bayesian Recurrent Neural Networks, C-MAPSS dataset. . . . . . ii
C Plots of the results of Bayesian RNN, C-MAPSS. . . . . . . . . . . . . . . . iii
D Comparison tables of the number of trainable parameters in Frequentist and
Bayesian models for C-MAPSS Dataset. . . . . . . . . . . . . . . . . . . . . vii
E Results of Frequentist Recurrent Neural networks, C-MAPSS dataset. . . . . ix
F Results of MC Dropout Recurrent Neural networks, C-MAPSS dataset. . . . ix
G Results of Bayesian Recurrent Neural networks, Maryland cracklines dataset. xi
H Results of Bayesian Recurrent Neural networks, GPM Wind turbine dataset. xviii
I Results of Bayesian Recurrent Neural networks, Ottawa Bearing dataset. . . xviii
J Results of Bayesian Recurrent Neural networks, Politecnico di Torino Bearing
dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xxi
List of Tables
5.1 Definition of the 4 sub-datasets by their operating conditions and fault modes. 23
5.2 C-MAPSS output sensor data. . . . . . . . . . . . . . . . . . . . . . . . . . . 23
5.3 Dataset size for C-MAPSS . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
5.4 Table of Hyperparameters to test. . . . . . . . . . . . . . . . . . . . . . . . . 25
5.5 Hyperparameters for the C-MAPSS dataset. . . . . . . . . . . . . . . . . . . 25
5.6 Mechanical properties of SS304L. . . . . . . . . . . . . . . . . . . . . . . . . 26
5.7 Tested datasets for Maryland crack lines. . . . . . . . . . . . . . . . . . . . . 27
5.8 Dataset size for Maryland cracklines. . . . . . . . . . . . . . . . . . . . . . . 27
5.9 Hyperparameters tested in the grid search . . . . . . . . . . . . . . . . . . . 28
5.10 Hyperparameters for the Maryland dataset. . . . . . . . . . . . . . . . . . . 28
5.11 GPM turbine dataset shape post processing. . . . . . . . . . . . . . . . . . . 30
5.12 Grid search for the hyperparameter selection. . . . . . . . . . . . . . . . . . 31
5.13 Hyperparameters for the GPM Wind turbine dataset. . . . . . . . . . . . . . 32
5.14 Conformation of the Ottawa bearing dataset. . . . . . . . . . . . . . . . . . . 33
5.15 Ottawa bearing dataset shape post processing. . . . . . . . . . . . . . . . . . 34
5.16 Hyperparameters for grid search. . . . . . . . . . . . . . . . . . . . . . . . . 35
5.17 Hyperparameters for Ottawa bearing dataset. . . . . . . . . . . . . . . . . . 35
5.18 Different health states at the B1 bearing. . . . . . . . . . . . . . . . . . . . . 37
5.19 Summary of nominal loads and speeds of the shaft. . . . . . . . . . . . . . . 37
5.20 Torino bearing dataset shape post processing. . . . . . . . . . . . . . . . . . 38
5.21 Grid search hyperparameters. . . . . . . . . . . . . . . . . . . . . . . . . . . 39
5.22 Hyperparameters for Torino bearing dataset. . . . . . . . . . . . . . . . . . 39
6.1 Bayesian Recurrent Neural Networks results for C-MAPSS dataset. Mean of
3 runs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
6.2 Number of parameters in Bayesian and Frequentist approach. . . . . . . . . 45
6.3 Comparison between average RMSE of Frequentist approach and Bayesian
Recurrent Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
6.4 Comparison between best average performance for each dataset between Bayesian
and MC Dropout. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
6.5 Proposed approach results compared with state of the art performance. . . . 46
6.6 Bayesian Recurrent Neural Networks results for Maryland cracklines. Mean
of 3 runs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
6.7 Best result of Maryland dataset with BayesLSTM layer, per test set run. . . 47
6.8 Best result of Maryland dataset with BayesGRU layer, per test set run. . . . 47
6.9 Bayesian Recurrent Neural Networks results for Wind turbine. Mean of 3 runs. 50
6.10 Bayesian Recurrent Neural Networks results for Ottawa bearing dataset. Mean
of 3 runs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
6.11 Mean of 3 runs, Torino bearing dataset. . . . . . . . . . . . . . . . . . . . . . 55
B.3 Bayesian Recurrent Neural Network, FD003. . . . . . . . . . . . . . . . . . . ii
B.4 Bayesian Recurrent Neural Network, FD004. . . . . . . . . . . . . . . . . . . ii
D.1 Parameter table, FD001. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
D.2 Parameter table, FD002. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
D.3 Parameter table, FD003. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
D.4 Parameter table, FD004. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
E.1 All runs of Frequentist Recurrent Neural Network approach. . . . . . . . . . ix
F.1 Summary of MC Dropout Models, C-MAPSS dataset. . . . . . . . . . . . . . ix
F.2 MC Dropout Recurrent Neural Network, FD001. . . . . . . . . . . . . . . . . x
F.3 MC Dropout Recurrent Neural Network, FD002. . . . . . . . . . . . . . . . . x
F.4 MC Dropout Recurrent Neural Network, FD003. . . . . . . . . . . . . . . . . x
F.5 MC Dropout Recurrent Neural Network, FD004. . . . . . . . . . . . . . . . . x
G.1 Results of Bayesian Recurrent Neural networks, Maryland cracklines dataset. xi
H.1 Results Wind turbine: Total results. . . . . . . . . . . . . . . . . . . . . . . xviii
I.1 All results for each Bayesian layer. Dataset 1. . . . . . . . . . . . . . . . . . xviii
I.2 All results for each Bayesian layer. Dataset 2. . . . . . . . . . . . . . . . . . xviii
I.3 All results for each Bayesian layer. Dataset 3. . . . . . . . . . . . . . . . . . xix
J.1 All results for each Bayesian layer. . . . . . . . . . . . . . . . . . . . . . . . xxi
List of Figures
3.1 Neuron. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.2 Artificial neural network. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.3 Graphical comparison between Dense and Convolutional layers. Source: http:
//cs231n.stanford.edu/ . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.4 Left: Folded RNN. Right: Unfolded RNN. . . . . . . . . . . . . . . . . . . . 10
3.5 LSTM cell. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.6 GRU cell. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
6.14 Best result for Ottawa bearing dataset. Dataset 3. . . . . . . . . . . . . . . . 54
6.15 Worst result Torino bearing dataset with Bayes LSTM layer. . . . . . . . . . 56
6.16 Worst result Torino bearing dataset with Bayes GRU layer. . . . . . . . . . . 56
6.17 Example of a loss plot of Torino bearing dataset. . . . . . . . . . . . . . . . 57
C.1 Best result for C-MAPSS FD001 with Bayesian GRU cell. . . . . . . . . . . iii
C.2 Best result for C-MAPSS FD002 with Bayesian GRU cell. . . . . . . . . . . iv
C.3 Best result for C-MAPSS FD003 with Bayesian LSTM cell. . . . . . . . . . . v
C.4 Best result for C-MAPSS FD004 with Bayesian GRU cell. . . . . . . . . . . vi
G.1 First complete life cycle. Maryland cracklines, Bayesian LSTM. . . . . . . . xi
G.2 First complete life cycle. Maryland cracklines, Bayesian GRU. . . . . . . . . xi
G.3 Second complete life cycle. Maryland cracklines, Bayesian LSTM. . . . . . . xii
G.4 Second complete life cycle. Maryland cracklines, Bayesian GRU. . . . . . . . xii
G.5 Third complete life cycle. Maryland cracklines, Bayesian LSTM. . . . . . . . xiii
G.6 Third complete life cycle. Maryland cracklines, Bayesian GRU. . . . . . . . . xiii
G.7 Fourth complete life cycle. Maryland cracklines, Bayesian LSTM. . . . . . . xiv
G.8 Fourth complete life cycle. Maryland cracklines, Bayesian GRU. . . . . . . . xiv
G.9 Fourth complete life cycle. Maryland cracklines, Bayesian LSTM. . . . . . . xv
G.10 Fourth complete life cycle. Maryland cracklines, Bayesian GRU. . . . . . . . xv
G.11 Fifth complete life cycle. Maryland cracklines, Bayesian LSTM. . . . . . . . xvi
G.12 Fifth complete life cycle. Maryland cracklines, Bayesian GRU. . . . . . . . . xvi
G.13 Sixth complete life cycle. Maryland cracklines, Bayesian LSTM. . . . . . . . xvii
G.14 Sixth complete life cycle. Maryland cracklines, Bayesian GRU. . . . . . . . . xvii
I.1 Best result for Ottawa bearing dataset. Dataset 1 with Bayesian LSTM layers. xix
I.2 Best result for Ottawa bearing dataset. Dataset 2 with Bayesian LSTM layers. xx
I.3 Best result for Ottawa bearing dataset. Dataset 3 with Bayesian LSTM layers. xx
Chapter 1
Introduction
1.1 Introduction
In general, as new technologies are developed, the complexity of the components often grows. This complexity comprises the specialization required to perform a task, the interaction with other components and their importance in the organization, among other factors. In an industrial context this can be worrying, since a failure in any of these components directly affects the expected results and, commonly, due to their specialization, repairs are expensive (in time and cost).

Therefore, it is necessary to know their failure modes, the physics that rules their functioning states, and so on. However, this is a difficult task that requires considerable knowledge and investigation. Moreover, different operating conditions can lead to completely different failures.
Prognostics and health management (PHM) is the specialized area of engineering focused on modelling the life cycle of components and systems to predict when they will no longer perform as intended. This can be achieved through two main approaches: model-based prognostics and data-driven prognostics. The former uses equations that incorporate physical understanding of the system, which requires many experimental studies and works only under specified conditions; an example of such models is the Paris law, which models crack growth under a fatigue stress regime.
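As an illustration of the model-based approach, the Paris law can be integrated numerically cycle by cycle. The sketch below is a minimal assumption-laden example: the material constants `C`, `m` and the geometry factor `Y` are illustrative placeholders, not values taken from this thesis.

```python
import numpy as np

def paris_law_growth(a0, delta_sigma, C, m, Y=1.0, n_cycles=100_000):
    """Integrate the Paris law da/dN = C * (dK)^m cycle by cycle.

    dK = Y * d_sigma * sqrt(pi * a) is the stress intensity factor range.
    C, m and Y are illustrative placeholders, not values from the thesis.
    """
    a = a0
    history = [a0]
    for _ in range(n_cycles):
        delta_K = Y * delta_sigma * np.sqrt(np.pi * a)
        a += C * delta_K ** m          # crack growth contributed by one cycle
        history.append(a)
    return np.array(history)

# Example: 1 mm initial crack, 100 MPa stress range (units assumed consistent)
growth = paris_law_growth(a0=1e-3, delta_sigma=100.0, C=1e-12, m=3.0)
```

Because the growth rate depends on the current crack length, the curve accelerates as the crack grows, which is the behavior such physics-based models capture under their specified conditions.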
On the other hand, data-driven models take advantage of advances in sensor technology and data analytics to detect the degradation of engineered systems, diagnose the type of fault, predict the remaining useful life (RUL) and proactively manage failures. Useful tools such as the Internet of Things (IoT), which helps to monitor and control the operation of components through multiple network-connected sensors, can be the source of the data, providing a convenient connection between the data and the application. However, this creates databases so large that human-based monitoring is non-viable. One approach to this problem is to preprocess the databases to decrease their dimensionality and make it possible to identify the key features in the data. Data-driven models such as machine learning techniques can be used to extract useful information from the data and combine it to gain insights into the system performance [3].
Most recent PHM studies focus on critical machinery components including bearings [4, 5], gears [6, 7] and batteries [8, 9]. However, these studies are mainly developed with conventional machine learning of shallow configuration and thus require large amounts of time and expertise to manually design and extract features.
To complete the end-to-end process, it is necessary to use an algorithm that can handle large amounts of big machinery data by learning different features and the interactions between them. Here is where deep learning has emerged as a promising solution, since its layer-to-layer architecture enables the extraction of more abstract and complex information (even features that lack a direct physical meaning) that is useful to perform a certain task. Typically, there are several types of deep learning models, including Auto-encoders, Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), Deep Belief Networks and Deep Boltzmann Machines. Since these models can be used as components (layers) of a more complex model, there are new architectures that mix many of the above, such as Variational Auto-encoders, Generative Adversarial Networks and Long Short-Term Memory (LSTM). In the context of machinery health prognosis, large amounts of research effort have been devoted to exploring their advantages over conventional machine learning techniques [10]. Verstraete et al. developed an unsupervised and a semi-supervised deep learning enabled generative adversarial network-based methodology for fault diagnostics of rolling element bearings [11]. Zhang et al. presented an LSTM RNN model to predict the remaining useful life of lithium-ion batteries [12]. Correa-Jullian et al. [13] demonstrated that deep learning models for solar thermal systems can outperform naïve predictors and other regression models such as Bayesian Ridge, Gaussian Process and Linear Regression. Two excellent review articles on the applications of deep learning to handle big machinery data were developed by Zhao et al. and Jia et al. [14, 15].
It is important to highlight that none of the previous studies addressed the uncertainty in the data and/or in the model. This is because their weights are single-point estimates and, since the operations between layers correspond to matrix multiplications, for a given input and trained model the output will always be the same. This may produce overly confident predictions and results, giving poor guidance to the decision-making team, which can lead to safety hazards, lower availability and higher costs. To overcome this problem, Peng et al. [16] proposed a Bayesian deep learning approach to address uncertainty through health prognosis. However, their Bayesian approximation is implemented through Monte Carlo Dropout (MC Dropout), which is not truly Bayesian [17]. Unlike regular dropout, MC Dropout is applied at both training and test time, so parts of the network are randomly ignored at testing, producing different outputs for the same input. This ultimately generates random predictions as samples from a probability distribution, which is treated as variational inference [18]. However, the performance of MC Dropout has been criticized by Osband et al. [19, 20], and counterexamples were presented to demonstrate its limitations in producing satisfactory results.
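The test-time behavior of MC Dropout can be sketched with a toy network: dropout stays active during prediction, so repeated forward passes give a spread of outputs. The weights, dropout rate and input below are arbitrary placeholders, not part of any model in this thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny fixed two-layer regression network (weights are arbitrary here).
W1, b1 = rng.normal(size=(3, 16)), np.zeros(16)
W2, b2 = rng.normal(size=(16, 1)), np.zeros(1)

def forward(x, drop_rate=0.5, training=True):
    """One stochastic forward pass; dropout stays active when training=True."""
    h = np.maximum(x @ W1 + b1, 0.0)            # ReLU hidden layer
    if training:                                # MC Dropout: dropout kept on at test time
        mask = rng.random(h.shape) > drop_rate
        h = h * mask / (1.0 - drop_rate)        # inverted dropout scaling
    return h @ W2 + b2

x = np.array([[0.2, -1.0, 0.5]])
samples = np.stack([forward(x) for _ in range(200)])  # 200 stochastic passes
pred_mean = samples.mean(axis=0)                      # predictive mean
pred_std = samples.std(axis=0)                        # spread used as uncertainty
```

The spread of the samples is what MC Dropout interprets as predictive uncertainty; the criticism cited above concerns whether this spread behaves as a faithful posterior approximation.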
Recently, Bayesian recurrent neural networks have been studied and developed in different ways: [21] uses a Markov Chain Monte Carlo variation (PX-MCMC), while [22] uses Bayes by Backprop with the reparameterization trick to estimate the uncertainty. Finally, Wen et al. [23] propose the Flipout method, which outperforms previous methods using Bayesian networks.
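A minimal sketch of the sampling step underlying Bayes by Backprop, the reparameterization trick, is shown below; the variational parameters `mu` and `rho` are toy values introduced only for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_weight(mu, rho):
    """Reparameterization trick: w = mu + softplus(rho) * eps, with eps ~ N(0, I).

    Gradients can flow through mu and rho because the randomness is
    isolated in eps; this is the sampling step used by Bayes by Backprop.
    """
    sigma = np.log1p(np.exp(rho))        # softplus keeps sigma positive
    eps = rng.normal(size=mu.shape)
    return mu + sigma * eps

mu = np.zeros((4, 4))                    # variational mean (toy values)
rho = -3.0 * np.ones((4, 4))             # softplus(-3) gives a small initial sigma
w1 = sample_weight(mu, rho)
w2 = sample_weight(mu, rho)              # a fresh draw gives different weights
```

Each forward pass draws fresh weights, so the network output becomes a sample from a distribution; Flipout refines this idea by decorrelating the perturbations across examples in a mini-batch.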
In this Master's thesis, Bayesian neural networks are developed with the Flipout estimator to address remaining useful life prediction, classification and health indicator regression for different applications with time-dependent features (i.e., sensory data). Because of this, the implementation is focused on RNN cells (LSTM and GRU) due to their known performance with sequential data.
The proposed approach is validated using an open-access turbofan engine dataset [24] that is the most common benchmark for RUL prognosis (C-MAPSS), and is also tested with two health indicator prognosis datasets (GPM wind turbine and Maryland cracklines) and two classification applications (Politecnico di Torino rolling bearing data and Ottawa bearing vibration data).
1.2 Motivation
As summarized above, Bayesian Recurrent Neural Networks (B-RNN) are recurrent neural networks whose weights are defined by samples from a trained distribution rather than by single-point estimates.

With this, it is possible to draw a distribution over the output of the network, which can be sampled to obtain the final value (class probabilities for classification; remaining useful life or a health indicator for regression).

These distributions help to generate confidence intervals that carry an uncertainty assessment, which is relevant in the context of prognostics and health management (PHM) because such intervals help to identify the risk involved in maintenance decisions. Single-point estimates, on the other hand, lack this quantification, so their prognostics carry no uncertainty. This can be a problem in maintenance tasks: if a mechanical component changes its behavior due to different operational conditions, a point estimate does not represent it, whereas a wider distribution range signals it.

With the above, it is possible to make robust predictions that carry considerably more useful information than a single value.
6. Test other case studies in the PHM context, such as remaining useful life prognosis and health indicator prognosis (GPM wind turbine dataset and Maryland crack lines dataset) and health state identification (Ottawa bearing dataset and Politecnico di Torino bearing dataset).
Chapter 2
Methodology
This thesis investigation explores Bayesian neural networks with Flipout as the weight perturbation method. To address this research, literature review, development and validation steps are needed, which are explained below:

1. Literature review: This considers the exploration of the state of the art in Bayesian neural networks, different approaches to Bayesian recurrent neural networks and their applications in prognostics and health management (PHM).
It also requires elaborated knowledge of neural networks, variational inference, backpropagation, backpropagation through time, Bayes by Backprop and weight perturbations.
2. Development of Bayesian Recurrent Neural Networks: This methodological step consists of assembling Bayesian dense layers into recurrent neural network cells, joining the knowledge acquired in the prior methodological step.
3. Dataset selection: Since deep learning models are a data-driven technique, it is necessary to look for datasets that present regression or classification tasks. Also, since this thesis works with recurrent neural networks, the datasets have to be time series. The datasets are divided into simulated data and real data.
• Simulated dataset: First, since this is a new approach, it is necessary to benchmark the results on a common dataset used in PHM. This is achieved with the C-MAPSS dataset, a simulated turbofan engine dataset from NASA that is widely used for RUL prognosis.
• Experimental datasets: Since the goal of PHM algorithms is to be deployed in real applications, several datasets are included: three university datasets, Maryland (crack lines), Ottawa (bearing vibration data) and Politecnico di Torino (rolling bearing data), and finally a real scenario, GPM Systems (wind turbine data).
4. Hyperparameter tuning with the datasets: In this methodological step a proper B-RNN is designed and optimized for each application. This considers a grid search over the hyperparameters of the network in pursuit of the best results.
5. Analysis of the results: After all the above, what remains is the comparison of the proposed model with current research. Since C-MAPSS is the benchmark dataset, the results obtained on it are compared with other kinds of networks (state of the art) and with another type of uncertainty estimator for neural networks (MC Dropout). Then, the results for the other datasets are presented with their respective metrics and plots, which leads to the concluding remarks of this thesis.
Chapter 3
Theoretical Background
The following chapter summarizes the concepts necessary to understand the developed framework. First, machine learning is defined and reviewed, covering the most common problems solved with it. Then, deep learning expands on those definitions, including neural networks as the vanilla layer and some more specialized ones: a brief summary of convolutional neural networks (since this type of layer is used but is not the focus of this thesis) and recurrent neural networks, with emphasis on two types of cell, Long Short-Term Memory and Gated Recurrent Unit. Finally, a section on Bayesian inference and Bayes by Backprop provides the required Bayesian background.
3.1.2 Regression and classification tasks
Classification is a task where the computer must decide to which of n categories some input belongs. This is commonly solved by a function f : R^k → {1, ..., n} learned by the algorithm [10]. In most cases, the output is interpreted as the probability of belonging to each class, and the function that represents such a categorical probability distribution is the SoftMax:
\sigma(z)_j = \frac{e^{z_j}}{\sum_{i=1}^{n} e^{z_i}} \quad \text{for } j \in \{1, ..., n\} \qquad (3.1)
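A numerically stable sketch of the SoftMax of Eq. (3.1): subtracting the maximum logit before exponentiating leaves the ratio unchanged but avoids overflow. The logit values are arbitrary examples.

```python
import numpy as np

def softmax(z):
    """Numerically stable SoftMax; shifting by max(z) leaves Eq. (3.1) unchanged."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

probs = softmax(np.array([2.0, 1.0, 0.1]))
# The outputs sum to one and preserve the ordering of the logits.
```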
In the regression task, the above function changes to f : R^k → R, meaning that the function maps all the inputs to a single value; its best-known application is to predict or forecast some behavior driven by sequential data [10]. Here, the most used loss function is the root mean squared error (RMSE):
RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2} \qquad (3.2)
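Eq. (3.2) translates directly into code; the target and prediction values below are arbitrary examples.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error as in Eq. (3.2)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

error = rmse([3.0, 5.0, 2.5], [2.5, 5.0, 3.5])
```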
[Figure 3.1: Neuron — the inputs are linearly combined (weighted sum Σ) and passed through an activation function to produce the output ŷ.]
A stack of neurons conforms a layer, known as a dense layer because every neuron interacts with all input features, and a stack of layers generates a multilayer perceptron (MLP), or artificial neural network (ANN). With this, the number of parameters increases with the multiple connections between layers.
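The growth in the number of trainable parameters can be illustrated with a small helper; the layer sizes used in the example are arbitrary, not an architecture from this thesis.

```python
def dense_param_count(layer_sizes):
    """Trainable parameters of an MLP: weights (in*out) plus biases (out) per layer.

    layer_sizes lists the input dimension followed by each layer's width.
    """
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

# e.g. 14 input features, two hidden layers of 64 units, one output
n_params = dense_param_count([14, 64, 64, 1])
```

Even this small network already has thousands of parameters, because every neuron connects to every feature of the previous layer.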
s(t) = (x * w)(t) = \sum_{a=-\infty}^{\infty} x(a)\, w(t - a) \qquad (3.3)
where s(t) is the output of the kernel w(t) convolved with the input x(t). The above is for one-dimensional data, but it is easily extended to two-dimensional data:
S(i, j) = (K * I)(i, j) = \sum_{m} \sum_{n} I(i + m, j + n)\, K(m, n) \qquad (3.4)
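Eq. (3.4) can be sketched as a naive loop (as written, it is the cross-correlation form that deep learning libraries typically implement); the input image and kernel values are arbitrary examples.

```python
import numpy as np

def conv2d(I, K):
    """Valid-mode operation of Eq. (3.4): S(i,j) = sum_m sum_n I(i+m, j+n) K(m,n)."""
    ih, iw = I.shape
    kh, kw = K.shape
    S = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(S.shape[0]):
        for j in range(S.shape[1]):
            S[i, j] = np.sum(I[i:i + kh, j:j + kw] * K)   # slide the kernel
    return S

I = np.arange(16, dtype=float).reshape(4, 4)
K = np.ones((2, 2))          # a 2x2 summing kernel (unnormalized)
S = conv2d(I, K)             # the output is smaller than the input: 3x3
```

Note that the 2x2 kernel has only 4 weights, reused at every position, whereas a dense layer over the same 4x4 input would need 16 weights per output neuron.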
With all the above, it follows that convolutional neural networks have three main capabilities [10]:
• Sparse interactions: Since the kernels are noticeably smaller than the input, the number of trainable parameters is low in comparison to dense layers. This also implies fewer operations, improving the efficiency of the algorithm.
• Shared weights: These kernels are slid across the input, so the weights are used multiple times in the model. In a dense layer, each weight is used only once, since it is associated with a single feature of the input.
• Equivariance to translation: In the convolution operation, applying a translation and then the convolution gives the same result as applying the convolution and then the translation. This means that the output is a map of the represented features, no matter where they appear in the input.
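Equation 3.3 can be sketched directly in numpy (with a hypothetical signal and kernel); the kernel's few weights are reused at every output position, which is exactly the sparse-interaction and shared-weight behavior described above:

```python
import numpy as np

def conv1d(x, w):
    """Discrete convolution s(t) = sum_a x(t - a) w(a), valid positions only."""
    k = len(w)
    return np.array([sum(x[t - a] * w[a] for a in range(k))
                     for t in range(k - 1, len(x))])

x = np.arange(6.0)               # hypothetical input signal
w = np.array([0.5, 0.25, 0.25])  # kernel: only 3 shared weights,
                                 # reused at every output position
s = conv1d(x, w)                 # sparse interactions: each output touches
                                 # only k = 3 inputs, not all of them
```

The result matches `np.convolve(x, w, mode="valid")`, NumPy's built-in discrete convolution.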
Figure 3.3: Graphical comparison between dense and convolutional layers. Source: http://cs231n.stanford.edu/
Figure 3.3 shows the differences between dense and convolutional layers. The most important one is that convolutions output smaller feature maps with more channels, thanks to the small kernels, whereas dense layers output one value per neuron, each with its own full set of weights.
Where s(t) is the state of the system and θ the parameters (weights).
From the above definition it is possible to observe that the state at time t is defined by the state at the previous time t − 1, so it is possible to "unfold" this recurrent graph, for example with t = 4:
s^{(4)} = f(s^{(3)}; θ) = f(f(s^{(2)}; θ); θ)    (3.6)
If we continue unfolding this equation, a non-recurrent expression is generated. This state s(t) is also known as the hidden state h(t), as it is an internal process of the recurrent network.
Then, it is noticeable that the above expression does not depend on the input, because the dynamic system modeled so far depends only on the prior state. Now, it is possible to incorporate the input x(t); doing so, the hidden state depends on the prior hidden state and the input at time t:

h^{(t)} = f(h^{(t−1)}, x^{(t)}; θ)
With this, the state now contains information about the current input and the last hidden state, making it a summary of the features of the whole sequence.
This new expression can be represented as a folded and an unfolded directed graph:
[Figure: the recurrent computational graph, folded (left) and unfolded through time steps t − 1, t, t + 1 (right), with hidden states h and inputs x.]
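The unfolded recurrence can be sketched in numpy, assuming a simple transition h_t = tanh(W_h h_{t−1} + W_x x_t + b); note that the same parameters serve sequences of any length:

```python
import numpy as np

def rnn_forward(x_seq, W_h, W_x, b):
    """Unfolds h_t = tanh(W_h h_{t-1} + W_x x_t + b) over a sequence."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x_t in x_seq:                 # the same parameters are reused at every step
        h = np.tanh(W_h @ h + W_x @ x_t + b)
        states.append(h)
    return np.array(states)           # the full sequence of hidden states

rng = np.random.default_rng(1)
W_h = rng.normal(size=(3, 3)) * 0.1   # hidden-to-hidden weights (3 hidden units)
W_x = rng.normal(size=(3, 2))         # input-to-hidden weights (2 input features)
b = np.zeros(3)

# Works for any sequence length: the output can be the whole sequence
# (states) or just the last hidden state (states[-1]).
short = rnn_forward(rng.normal(size=(5, 2)), W_h, W_x, b)
long_ = rnn_forward(rng.normal(size=(12, 2)), W_h, W_x, b)
```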
Also, RNNs are flexible in terms of input-output size: since they work timestep by timestep, RNNs can be applied to inputs of different lengths. Further, the output can be a sequence (each of the outputs of a layer) or a vector (the last value of the sequence), and depending on the task it is possible to feed the whole sequence or part of it.
Because each hidden state is a function of the prior state, the memory present in plain RNNs reaches back only one step at a time. In addition, RNNs suffer from vanishing gradients: when trained with long sequences, the gradients tend to become smaller and smaller, leading to long training runs.
To overcome these problems, the Long Short-Term Memory (LSTM) arrives as a solution. These models introduce a long-term memory (called the cell c) alongside the known short-term memory (called the state h), and control which information has to be deleted and which is worth keeping; since the cell involves three hidden transformations, the gradients are less likely to vanish.
To manage the memory in the layer, the three principal transformations are:
• Forget gate f_t: Extracts the long-term information that has expired and deletes it.
• Input gate i_t: Chooses the new information to be added to the long-term memory from the previous hidden state and the current input data.
• Output gate o_t: Finally, the output gate fuses the long-term memory with the selected new data, generating the new hidden state.
Having regard to the above, a diagram of the cell is provided:
[Figure: LSTM cell diagram — the gates combine c_{t−1} and h_{t−1} with x_t through elementwise products (⊙) and sums (+) to produce c_t and h_t.]
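A minimal numpy sketch of one LSTM step, with hypothetical sizes, makes the role of the three gates explicit:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, P):
    """One LSTM step: forget (f), input (i) and output (o) gates manage
    the long memory c and the short memory h."""
    z = np.concatenate([h, x])
    f = sigmoid(P["Wf"] @ z + P["bf"])       # what to erase from c
    i = sigmoid(P["Wi"] @ z + P["bi"])       # what new information to store
    o = sigmoid(P["Wo"] @ z + P["bo"])       # what to expose as the new h
    c_tilde = np.tanh(P["Wc"] @ z + P["bc"]) # candidate new content
    c = f * c + i * c_tilde                  # update the long memory
    h = o * np.tanh(c)                       # update the short memory
    return h, c

rng = np.random.default_rng(2)
n_h, n_x = 4, 3                              # hypothetical hidden/input sizes
P = {k: rng.normal(size=(n_h, n_h + n_x)) * 0.1 for k in ("Wf", "Wi", "Wo", "Wc")}
P.update({k: np.zeros(n_h) for k in ("bf", "bi", "bo", "bc")})

h, c = np.zeros(n_h), np.zeros(n_h)
for x_t in rng.normal(size=(6, n_x)):        # run over a short sequence
    h, c = lstm_step(x_t, h, c, P)
```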
However, this number of operations tends to increase the training time. This is why Gated Recurrent Units (GRUs) study which gates are less important, in order to eliminate operations and optimize the learning process.
GRUs are constructed with:
• Reset gate: Drops information that is found to be irrelevant for the future.
• Update gate: Controls the information that passes from the previous hidden state to the current hidden state.
With this, Figure 3.6 shows a GRU cell:
[Figure: GRU cell diagram — the reset and update gates combine h_{t−1} with x_t through elementwise products (⊙) and sums (+) to produce the candidate state h̃_t and the new state h_t.]
As can be seen, the GRU does not have memory units; it just uses gates to modulate the flow of information [29]. The procedure is similar to that of LSTM cells, but the whole hidden state is exposed at each step. Another difference is that the GRU controls the information flow at the same time that it computes the new candidate, whereas the LSTM controls it with the output gate.
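For comparison, a numpy sketch of one GRU step with hypothetical sizes; note that there is no separate cell c, only the gated hidden state:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h, P):
    """One GRU step: no separate memory cell, just reset (r) and update (u)
    gates modulating the flow of information."""
    z = np.concatenate([h, x])
    r = sigmoid(P["Wr"] @ z)            # reset: drop irrelevant history
    u = sigmoid(P["Wu"] @ z)            # update: how much old state to keep
    h_tilde = np.tanh(P["Wc"] @ np.concatenate([r * h, x]))  # candidate state
    return u * h + (1.0 - u) * h_tilde  # interpolate old state and candidate

rng = np.random.default_rng(3)
n_h, n_x = 4, 3                         # hypothetical hidden/input sizes
P = {k: rng.normal(size=(n_h, n_h + n_x)) * 0.1 for k in ("Wr", "Wu", "Wc")}

h = np.zeros(n_h)
for x_t in rng.normal(size=(6, n_x)):   # run over a short sequence
    h = gru_step(x_t, h, P)
```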
Finally, for the classification task the commonly used loss function is the cross-entropy, which compares the ground truth with the predicted distribution output (softmax), as shown in Equation 3.1:
L_cross = − Σ_{c=1}^{C} y_c log(ŷ_c)    (3.17)
On the other hand, regression uses the RMSE as shown in Equation 3.2.
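Both loss functions (together with the softmax of Equation 3.1) can be sketched directly in numpy:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())                  # stabilized version of Eq. 3.1
    return e / e.sum()

def cross_entropy(y_true, y_prob):
    """Eq. 3.17: compares a one-hot ground truth with a predicted distribution."""
    return -np.sum(y_true * np.log(y_prob))

def rmse(y, y_hat):
    """Eq. 3.2: root mean squared error for regression."""
    return np.sqrt(np.mean((y - y_hat) ** 2))

probs = softmax(np.array([2.0, 1.0, 0.1]))               # sums to 1
ce = cross_entropy(np.array([1, 0, 0]), probs)           # low when the right class dominates
err = rmse(np.array([3.0, 4.0]), np.array([2.0, 6.0]))   # sqrt((1 + 4) / 2)
```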
p(X|Y = y) = p(X, Y = y) / p(Y = y)    (3.19)
• Chain rule of probability: Is the conditional distribution applied to the joint distribution.

p(X, Y) = [p(X, Y) / p(Y)] p(Y) = p(X|Y) p(Y)    (3.20)
• Bayes' theorem: All the above definitions are needed to formulate this. The conditional probability of X given data Y is the posterior probability, and the conditional probability of Y given X is the likelihood, or model evidence of the data fit. Then, p(X) is the belief, also called the prior probability. Finally, p(Y) is the data distribution.
p(X|Y) = p(Y|X) p(X) / p(Y)    (3.21)
• Likelihood: Is a measure of how well the model parameters explain the observed data. The negative log likelihood is commonly used as a cost function in Bayesian neural networks.
• Inference: Is the process of computing probability distributions over certain random
variables, usually after knowing values for other variables.
Suppose that a model (p(w|D), with D the data and w the weights) is successfully trained, and then it is desired to make an inference (y∗) for a new datapoint (x∗). It is then necessary to look at the joint distribution:
p(y∗|x∗, D) = ∫ p(y∗|x∗, w) p(w|D) dw    (3.22)
It is noticeable that the integral over the weights is intractable, since it takes an expectation under the posterior distribution of the weights, which is equivalent to using an ensemble of an infinite number of neural networks. Then, it is possible to define an approximate distribution q(w|θ) that has to be as similar as possible to p(w|D). The computation of this approximate distribution is known as variational inference, since θ are the variational parameters that make the inference process possible.
To evaluate whether those two distributions are similar, the Kullback-Leibler (KL) divergence is computed, so the optimization problem is to find the θ that minimizes the divergence between both distributions:

θ_opt = arg min_θ KL[q(w|θ) || p(w|D)]    (3.23)

where the KL divergence is defined as:

KL[q(w|θ) || p(w|D)] = ∫ q(w|θ) log [q(w|θ) / p(w|D)] dw    (3.24)

Again, another integral over the weights comes out, so this is not directly solvable. To deal with the above, we define the expectation with respect to a distribution as:
E_{q(w|θ)}[f(w)] = ∫ q(w|θ) f(w) dw    (3.25)
Also, the conditional distribution p(w|D) can be replaced using Bayes' theorem, leading to:

KL[q(w|θ) || p(w|D)] = E_{q(w|θ)} [ log ( q(w|θ) p(D) / (p(D|w) p(w)) ) ]    (3.26)
Finally, splitting the logarithms and doing some algebra:

KL[q(w|θ) || p(w|D)] = E_{q(w|θ)} [ log (q(w|θ) / p(w)) − log p(D|w) ] + log p(D)    (3.27)
It is possible to extract log p(D), since it is constant for the integral, and the first logarithm term is the KL divergence between q(w|θ) and p(w), leading to:

KL[q(w|θ) || p(w|D)] = KL[q(w|θ) || p(w)] − E_{q(w|θ)}[log p(D|w)] + log p(D)    (3.28)
Back to Equation 3.23, since this is a minimization problem, it is possible to ignore log p(D), as it is a constant:

θ_opt = arg min_θ KL[q(w|θ) || p(w|D)]
      ∼ arg min_θ KL[q(w|θ) || p(w)] − E_{q(w|θ)}[log p(D|w)]
      ∼ arg min_θ −ELBO(D, θ)    (3.29)
The quantity on the right is known as the negative of the Evidence Lower Bound (ELBO); thus, maximizing the ELBO is the same as minimizing the KL divergence between the distributions of interest.
The ELBO contains a data-dependent part, which is the likelihood cost, and a prior-dependent part, which is the complexity cost. As the loss is a subtraction of these two terms, it is a trade-off between fitting the data and staying close to the simple prior. This is finally computable by unbiased Monte Carlo gradients and backpropagation [30].
First, it is necessary to get the gradient of an expectation. It is possible to define:

(∂/∂θ) E_{q(w|θ)}[f(w, θ)] = E_{q(ε)} [ (∂f(w, θ)/∂w)(∂w/∂θ) + ∂f(w, θ)/∂θ ]    (3.30)

Where ε is a random variable with probability density q(ε) and w = t(θ, ε), with t(θ, ε) a deterministic function. As a further supposition, q(ε)dε = q(w|θ)dw.
The above deterministic function transforms a sample of parameter-free noise ε and the variational posterior parameters θ into a sample from the variational posterior. With this, it is possible to draw samples with Monte Carlo to evaluate the expectations, enabling a backpropagation-like algorithm.
This backpropagation scheme (Bayes by Backprop) approximates the exact cost as:

−ELBO ≈ Σ_{i=1}^{n} [ log q(w^{(i)}|θ) − log p(w^{(i)}) − log p(D|w^{(i)}) ]    (3.31)
Where the superscript denotes the i-th Monte Carlo sample drawn from the variational posterior q(w|θ). Each term depends upon the particular weights drawn from the variational posterior (common random numbers) [31].
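A toy numpy sketch of this Monte Carlo estimate, for a single Gaussian weight in a hypothetical linear model (in practice, the gradients of this loss with respect to μ and ρ would then be backpropagated):

```python
import numpy as np

rng = np.random.default_rng(4)

def log_normal(x, mean, std):
    """Log density of a Gaussian, elementwise."""
    return -0.5 * np.log(2 * np.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)

# Toy data generated by y = 2x + noise; we infer a posterior over the slope w.
x = np.linspace(-1, 1, 20)
y = 2.0 * x + rng.normal(scale=0.1, size=20)

mu, rho = 0.5, -2.0                  # variational parameters theta
sigma = np.log1p(np.exp(rho))        # softplus keeps sigma positive

def neg_elbo(n_samples=100):
    """Monte Carlo estimate of Eq. 3.31 using w = mu + sigma * eps."""
    total = 0.0
    for _ in range(n_samples):
        eps = rng.normal()                         # parameter-free noise
        w = mu + sigma * eps                       # deterministic transform t(theta, eps)
        log_q = log_normal(w, mu, sigma)           # log q(w | theta)
        log_prior = log_normal(w, 0.0, 1.0)        # log p(w)
        log_lik = log_normal(y, w * x, 0.1).sum()  # log p(D | w)
        total += log_q - log_prior - log_lik
    return total / n_samples

loss = neg_elbo()
```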
However, Bayes by Backprop is a generalization of the Gaussian reparameterization trick used for latent-variable models in the Bayesian context (also called variational Bayesian neural nets), and it has the problem that all examples in a batch share the same weight perturbation, limiting the variance-reduction effect of large batches. Flipout is an efficient method to avoid this problem.
3.3.1.1 Flipout
In Section 3.3, Flipout was introduced as an efficient method to deal with the problem of sharing the same perturbation across a batch, which leads to higher gradient variance.
It works under two main assumptions about the distribution q_θ: the perturbations of different weights are independent, and the perturbation distribution is symmetric around zero. This allows an elementwise multiplication by random signs without modifying the perturbation distribution.
Then, if a sample ΔŴ ∼ q_θ is multiplied elementwise by a random sign matrix S independent from ΔŴ, then ΔW = ΔŴ ∘ S is identically distributed to ΔŴ, and their gradients are also identically distributed.
With this, it is possible to draw a base perturbation ΔŴ shared across the batch and multiply it by a different rank-one sign matrix for each example, generating an unbiased estimator for the loss gradients:

ΔW_n = ΔŴ ∘ (r_n s_n^⊤)    (3.32)
Where rn and sn are random vectors sampled uniformly from ±1.
Then, the activations of a dense layer are:
y_n = φ(W^⊤ x_n)
    = φ((W + ΔŴ ∘ r_n s_n^⊤)^⊤ x_n)    (3.33)
    = φ(W^⊤ x_n + (ΔŴ^⊤ (x_n ∘ s_n)) ∘ r_n)

which is vectorizable and defines the forward pass. Since R and S are sampled independently, it is possible to obtain derivatives with respect to the weights and the inputs, allowing backpropagation.
Also, Flipout takes advantage of evolution strategies, allowing parallelism on a GPU by updating the weights with the following scheme:

W_{t+1} = W_t + (α / (M N σ²)) Σ_{m=1}^{M} Σ_{n=1}^{N} F_{mn} ΔŴ ∘ (r_{mn} s_{mn}^⊤)    (3.34)

Where N is the number of Flipout perturbations at each worker, M the number of workers, and F_{mn} the reward of the n-th perturbation at worker m.
Chapter 4
Proposed approach of Bayesian Recurrent
Neural Networks
This chapter includes the proposed Bayesian Recurrent Neural Networks for classification and for RUL and HI prognosis, with a graph of the framework used to construct the B-RNN and the new cell operations.
[Figure: framework to construct the Bayesian RNN — 1. Preprocessed data (raw run-to-failure training data: cleaning, normalization, window rolling); 2. Bayesian neural network for regression (architectures: Bayes VRNN, Bayes JaNet) built with Bayesian (Flipout) layers; 3. Trained Bayesian neural network, obtained by optimizing the ELBO (posterior approximation via variational inference).]
The first step to construct a Bayesian RNN is to have proper data to work with; since the approach uses recurrent neural networks, it is necessary to use sequential data. Depending on the type of dataset, the task is selected: if it contains run-to-failure data, RUL prognosis is feasible; if there are different failure modes in different files, classification. It is also possible to generate a health indicator, supposing that none is provided, based on the time in operation or on a statistical feature that grows (or shrinks) monotonically [33].
To estimate the performance of the model, it is possible to split the dataset randomly into train and test sets, but this can lead to an over-estimation of performance, since the development works with sequential data, and this is essential for implementing this kind of model in a real environment. The above implies that the training data are sensory data accumulated over time until the present, and the new incoming sensor data are the ones used to test the performance.
Then, the dataset must be preprocessed: a cleaning stage where out-of-scale values and NaNs are dropped from the data; normalization, to force the model to learn without bias toward the scale of each particular sensor; and window rolling, to shape the dataset correctly. In this windowing process, the windows can be rolled directly over the sensor data or over features extracted from previously created windows; both methods are used in different applications in this thesis.
Later, the model is created using LSTM or GRU cells as Keras layers, which enables an easy way to build the model. However, since these are custom-made layers that contain dense Flipout layers, a custom-made training function is applied. With this, and the optimization of the ELBO as loss function, an output distribution is trained, from which it is possible to draw any number of samples for a chosen input, which constitutes the final prediction.
Here, the multiplications between the weights and the inputs can be replaced by dense layers for each input (hidden state and new data). Then, each dense layer is transformed into a Bayesian layer by the application of Flipout, resulting in:
z_t = σ[ U_z h_{t−1} + (ΔÛ_z (h_{t−1} ∘ s^h_{z,t})) ∘ r^h_{z,t} + W_z x_t + (ΔŴ_z (x_t ∘ s^x_{z,t})) ∘ r^x_{z,t} ]    (4.3)
Where the sub-index z, t indicates gate z at time t, and the super-index indicates whether the term is for a hidden state h or an input x. Also, since an activation function for each inner dense layer is not necessary, they are left with an identity activation function.
It is important to notice that this notation includes sampling the sign vectors at each time step.
With the above, it is possible to implement the proposed approach for all the remaining gates, ending with:

h_t = o_t ⊙ tanh(c_t)    (4.8)
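The following is a pure-numpy sketch of the idea behind the Bayes-LSTM cell, not the thesis implementation (which uses TensorFlow Probability layers); each matrix product is replaced by a Flipout-perturbed product with fresh sign vectors at every gate and time step:

```python
import numpy as np

rng = np.random.default_rng(6)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def flipout_linear(x, W_mean, dW_hat):
    """A single Flipout-perturbed product, with fresh sign vectors on each
    call (i.e. at every gate and every time step, as in Eq. 4.3)."""
    s = rng.choice([-1.0, 1.0], size=W_mean.shape[0])
    r = rng.choice([-1.0, 1.0], size=W_mean.shape[1])
    return W_mean.T @ x + (dW_hat.T @ (x * s)) * r

def bayes_lstm_step(x, h, c, P):
    """LSTM step where every dense product is Flipout-perturbed."""
    z = np.concatenate([h, x])
    f = sigmoid(flipout_linear(z, P["Wf"], P["dWf"]))
    i = sigmoid(flipout_linear(z, P["Wi"], P["dWi"]))
    o = sigmoid(flipout_linear(z, P["Wo"], P["dWo"]))
    c_tilde = np.tanh(flipout_linear(z, P["Wc"], P["dWc"]))
    c = f * c + i * c_tilde
    return o * np.tanh(c), c                  # h_t = o_t ⊙ tanh(c_t), Eq. 4.8

n_h, n_x = 4, 3                               # hypothetical sizes
P = {}
for g in ("f", "i", "o", "c"):
    # In the real model dW would be sampled from the variational posterior
    # on each forward pass; here it is fixed for the sketch.
    P["W" + g] = rng.normal(size=(n_h + n_x, n_h)) * 0.1
    P["dW" + g] = rng.normal(size=(n_h + n_x, n_h)) * 0.05

h, c = np.zeros(n_h), np.zeros(n_h)
for x_t in rng.normal(size=(5, n_x)):
    h, c = bayes_lstm_step(x_t, h, c, P)
```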
4.3 Training and implementation of Bayesian Neural Networks
The training of Bayesian Neural Networks consists of applying Bayes by Backprop to backpropagate the gradients and minimize the ELBO loss.
This loss is divided into two main parts:
• Kullback-Leibler: A regularization term for the overall loss. It is also many orders of magnitude larger than the log likelihood; because of this, it is necessary to multiply it by a regularization factor in search of the best trade-off between both losses.
• Log likelihood: Empirically, this is the loss that governs the performance of the network. So, the balance between KL and log likelihood is doing well if the log likelihood is stable and the KL is not growing.
With that, the implementation consists of the application of TensorFlow Probability DenseFlipout layers inside new layers called Bayes-LSTM and Bayes-GRU, which wrap the inner cell calculations. With this, the Bayes by Backprop implementation is straightforward, and the output is a chosen distribution depending on the application (categorical for classification, normal for regression).
Chapter 5
Study cases
This chapter describes the datasets on which this thesis is validated, including their setup, origin and most relevant characteristics. These datasets may need some preprocessing before being fed to the network, so this step is also explained. Finally, after a grid search for the best hyperparameters, the architecture of each network is fixed and presented.
The dataset consists of 4 different sub-datasets, which are divided by their number of operating conditions and fault modes:

Table 5.1: Definition of the 4 sub-datasets by operating conditions and fault modes.

                                          Operation conditions
                                               1        6
  Fault     HPC Degradation               FD001    FD002
  modes     HPC and Fan Degradation       FD003    FD004
Where HPC stands for high-pressure compressor. Each of these datasets has 26 columns, which are:
• Unit number
• Time [cycles]
• 3 operational settings
• 21 sensor measurements
The outputs of the measured system response are described in terms of the engine components, where LPC stands for low-pressure compressor, HPT for high-pressure turbine and LPT for low-pressure turbine.
5.1.1.2 Pre processing
First, the dataset is constructed as follows:
• Each row corresponds to one cycle.
• Each column represents a sensor or an operational setting.
• A single unit (the unit number column) is simulated and operated for N cycles until failure. Then, another unit is put into operation and its sensor data are registered.
• There are 7 sensors that do not change with the cycles; this means that they do not deliver useful information, so they are ignored.
Since this dataset is used for a challenge, it is already divided into train and test samples, and the RUL values are given in separate files for train and test. Therefore, it is not necessary to split the dataset, and it is possible to create windows with a slide step of 1 cycle without contaminating the training set with test data.
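The window-rolling step can be sketched in numpy for a hypothetical unit with 120 cycles and 19 useful sensors:

```python
import numpy as np

def roll_windows(series, window, step=1):
    """Slides a fixed-length window over a (time, sensors) run-to-failure
    series with the given step, producing the RNN training samples."""
    return np.stack([series[i:i + window]
                     for i in range(0, len(series) - window + 1, step)])

# Hypothetical unit: 120 cycles, 19 useful sensors, 30-cycle windows, slide 1.
unit = np.random.default_rng(7).normal(size=(120, 19))
X = roll_windows(unit, window=30, step=1)   # shape (91, 30, 19)

# The RUL label of each window is the remaining cycles at its last row:
rul = np.arange(len(unit))[::-1][29:]       # 91 labels, from 90 down to 0
```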
Due to the data conformation, not all sub-datasets have the same window size. With this, the created datasets have the following shapes:
5.1.1.3 Architecture
A grid search is used to fix the parameters of the model; some of the selected values to be tried are summarized in the following table:
[Figure: layer-by-layer architecture — including a DenseFlipout layer with input (256, 960) and output (256, 32), ending in a DistributionLambda layer with output [(256, 1), (256, 1)].]
Figure 5.2: Bayesian recurrent layers architecture for the C-MAPSS dataset. In particular,
with BayesLSTM layer.
Notice that the output of the Bayesian RNN is the complete sequence, and the flatten layer prepares the features to be processed by the dense Flipout layers.
Also, a polynomially decaying learning rate is applied to smooth the learning in the last epochs: over 60 epochs, the learning rate decays to 70% of its value at epoch 1. A summary of the hyperparameters is presented in Table 5.5:
5.1.2 Maryland: Crack Lines
5.1.2.1 Data description
This dataset comes from a doctoral thesis developed at the University of Maryland [34] and consists of the acquisition of three dissipative energies (strain, heat dissipation and acoustic emission) during a cyclic uniaxial tensile test.
With the above energies it is possible to quantify the damage in specific failure modes through the Degradation-Entropy Generation (DEG) theorem, which is introduced in [35, 36, 37] and reviewed in applications including fatigue, corrosion and wear [38]. Also, the crack length is monitored with images taken by an Edmond 2.5-10x optical microscope with an OptixCam Pinnacle Series CCD digital camera. This, together with the corresponding life in cycles, defines the life to failure at a certain crack length. Crack lengths of 250 [µm], 500 [µm] and 1000 [µm], the transition from region II to III (within linear elastic fracture mechanics [39]) and fracture are the selected failure criteria for each entropic approach's assessment.
Two different materials are selected for testing: AA7075-T6 and SS304L. The first is a light, high-strength aluminum alloy; the other is a structural material commonly used in highly acidic environments. Given the conformation of the dataset, the only measurements with enough datapoints to work with the B-RNN scheme are those of the stainless steel.
The specimen is selected and designed for fatigue testing under the ASTM 466 guideline [40]. An Instron 8800 system on an MTS 311.11 frame is used for the uniaxial test. The loads are in the range of 16 to 24 [kN], with a 0.1 loading ratio and 5 [Hz] frequency. To apply a short-term loading process, every 1000 cycles the load is paused for 2 minutes and 500 cycles of excitation are applied at 25 [Hz], with maximum loads of 6 [kN] for the aluminum and 10 [kN] for the stainless steel. The time without load allows the specimen to cool and is the perfect time to capture cleaner images. The acoustic emission sensors are two Physical Acoustics Micro-30s resonant sensors.
5.1.2.2 Pre processing
With this, the dataset consists of 3 signal measurements (position, extension and temperature) and 4 cumulative measurements (2 acoustic emission counts and 2 acoustic emission energies). The proposed features to extract are the crest factor and the mean of the Fourier transform split into frequency bands; these two features are not feasible for the last 4 measurements due to their cumulative behavior, and the peak-to-peak value carries no useful information for them either.
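The two proposed features can be sketched in numpy for a hypothetical raw signal, using 10-sample feature windows:

```python
import numpy as np

def crest_factor(x):
    """Peak amplitude over the RMS of the window."""
    return np.max(np.abs(x)) / np.sqrt(np.mean(x ** 2))

def fft_band_means(x, n_bands):
    """Mean FFT magnitude split into equal frequency bands."""
    mag = np.abs(np.fft.rfft(x))
    return np.array([band.mean() for band in np.array_split(mag, n_bands)])

rng = np.random.default_rng(8)
signal = rng.normal(size=2000)               # hypothetical raw sensor signal

# One feature row per 10-sample window: crest factor + 3 FFT band means.
features = np.array([
    np.concatenate([[crest_factor(w)], fft_band_means(w, 3)])
    for w in signal.reshape(-1, 10)          # 200 non-overlapping windows
])
```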
Due to the above, there are many ways to work with this dataset; the tested approaches are listed below:
It is important to notice that the raw windows have a length of 200, due to the composition of the extracted features, which use a 10-length window to perform the extraction and then a 20-length window to train the RNN, creating a dataset with the same number of examples.
After testing all these datasets, the best results are obtained with only the sensor data, and the samples are constructed as follows:
The test set consists of 7 of the 47 total runs, which is almost a 15% split.
5.1.2.3 Architecture
The grid search considers the following hyperparameters:
Table 5.9: Hyperparameters tested in the grid search.

  Hyperparameter             Grid search           Hyperparameter       Grid search
  Conv layers                [1, 2, 3]             Conv N of kernels    [16, 32, 64]
  Conv kernels (only time)   [1, 2, 3]             LR decay             [10, 20, 30]
  BayesRNN neurons           [16, 32, 64]          End LR               [10%, 50%]
  DenseFlipout layers        [1, 2, 3, 4]          KL regularizer       [1e-5, 5e-6, 1e-6]
  DenseFlipout neurons       [64, 32, 16]          Epochs               [1000, 2000, 3000]
  Learning rate              [1e-2, 1e-3, 5e-4]
[Figure: layer-by-layer architecture — including a BayesGRU layer with input (500, 20, 1536) and output (500, 20, 64), ending in a DistributionLambda layer with output [(500, 1), (500, 1)].]
Figure 5.3: Bayesian recurrent layers architecture for the Maryland cracklines dataset.
5.1.3 GPM Systems: Wind Turbine Data
5.1.3.1 Data description
This data was collected from a commercial 2.2 [MW] wind turbine over 50 days, with 6-second recordings at a sampling frequency of 100 [kHz], for a high-speed shaft bearing of a wind turbine generator, provided by Green Power Monitoring Systems in the USA [2, 41].
The bearing is an SKF 32222 J2 tapered roller bearing with an inner race fault (detected by inspection at the end of the 50 days). The recording sensor is a MEMS accelerometer (Analog Devices ADXL001, with 32 [kHz] bandwidth).
Figure 5.4: Bearing test set with the inner race fault detected at the end of the 50 days.
Source: [2]
It is possible to see the degradation of the wind turbine in the growing amplitude of the acceleration. Figure 5.5 visualizes all 50 days one next to the other, even though there are many hours between measurements:
Figure 5.5: Bearing test set with the inner race fault detected at the end of the 50 days. The
colors shows different days of measurements.
5.1.3.2 Pre processing
Since the dataset considers only one sensor, our approach is to extract features from the accelerometer signal; the features are presented in Appendix A.
The features are computed for the signal, its numeric derivative and its numeric integral, generating 42 features. To compute the features, it is necessary to generate windows: in our approach, 200 timesteps generate 1 feature example, and 50 feature examples form a window for training the network.
Also, this dataset is not previously split into train and test. Therefore, it is not possible to generate the windows with overlap, due to train/test contamination.
With this, for each day of 585936 datapoints, 58 training windows are generated with a length of 50 and a feature length of 200; the remainder of each file is dropped. Finally, a 30% test - 70% train random split is used to divide the samples.
Figure 5.6 illustrates this dimensional reduction by the application of the preprocessing:
On the other hand, the labels are not provided, and the sensor data do not span from the beginning of the lifespan until the failure. Therefore, it is not possible to consider the timeline as a RUL, and it is necessary to create a health indicator (HI). The most used approach is to find some statistical feature that changes monotonically with time and select it as the HI. Since our approach is to reduce human-based decisions, our HI is a decreasing line from 100 to 0, where 100 means a fully operative wind turbine and 0 the last day of operation; this is the most direct approach to an HI and is comparable to a RUL.
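The linear health indicator described above is straightforward to construct:

```python
import numpy as np

# A linear health indicator: 100 at the first recorded day, 0 at the last.
n_days = 50
hi = np.linspace(100.0, 0.0, n_days)   # one HI value per measurement day
```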
5.1.3.3 Architecture
First, a hyperparameter selection is made through a grid search; the most important tested parameters are listed below. After the grid search, the selected architecture is shown in Figure 5.7:
[Figure: layer-by-layer architecture — including a BayesGRU layer with input (500, 50, 704) and output (500, 50, 64), ending in a DistributionLambda layer with output [(500, 1), (500, 1)].]
Figure 5.7: Bayesian recurrent layers architecture for the GPM Wind turbine dataset. In
particular, with BayesGRU layer.
It is possible to see that the 2D convolutions have kernel and stride of size 2 in the feature axis and size 1 in the time axis, since the Bayesian recurrent layers are the ones that deal with the temporal structure of the data. The activation function between every layer is a ReLU.
Also, the convolutional layers are used only as feature extractors; this allows the use of a frequentist approach in these layers, mixing them with Bayesian weights. Following this scheme, the time-distributed layer stacks the extracted features from the convolutions while maintaining the time axis.
Other important hyperparameters are:
5.2 Datasets for classification tasks
The classification tasks include the identification of different fault modes in ball bearings under time-varying speed conditions (Ottawa bearing dataset) and the health state classification of a rolling bearing under different speeds and torques (Politecnico di Torino bearing dataset).
Two columns are present in the dataset: the accelerometer data and the rotational speed measured by the encoder. Also, there are 3 different runs per experiment; therefore, there are 60 files to preprocess.
Figure 5.9: Graphical view of the dimensional reduction by preprocessing per file.
5.2.1.3 Architecture
A grid search based on the following table is implemented to select the hyperparameters:
After the hyperparameter selection, the final architecture to address this classification task is:
Figure 5.10: Bayesian recurrent layers architecture for the Ottawa bearing dataset. In particular, with BayesGRU layer.
Notice that the regression models output a distribution directly, but the classification models output a vector (a.k.a. logits) with the number of classes as its length (as in the frequentist approach). However, the softmax function is not used, since the output of the model parameterizes a (TensorFlow Probability) categorical distribution, which gives the actual class probabilities.
It is important to notice that the KL regularizer is not in the grid search table because it is fine-tuned after most of the other hyperparameters.
5.2.2 Politecnico di Torino: Rolling bearing Data
5.2.2.1 Data description
Like the Ottawa bearing dataset, the Politecnico di Torino bearing dataset is a laboratory experimental dataset of different fault modes, but in rolling bearings, with an orthogonal load that applies torque to the shaft.
As can be seen in Figure 5.11, the same plate holds a center bearing (B2) and two more bearings (B1 and B3) that are used as supports.
The shaft was part of a complete gearbox and carried a spur gear. This gear applies a contact force with radial and tangential components because of the applied torque. The interaction between the spur gear and the shaft is replaced by the load applied through a third, larger roller bearing (B2), which is linked to a precision sledge that moves orthogonally to the shaft and produces the required load by the compression of two parallel springs.
Figure 5.11 shows the experimental setup:
The accelerometers are two triaxial IEPE-type sensors with a 12 [kHz] frequency range, positioned on the support of the damaged bearing (A1) and on the support of the loaded bearing (A2). Also, the radial force is measured with a static load cell. The loaded bearing is B2, but the defects are mounted on bearing B1.
The different health states are:
Due to the limited power of the inverter, at higher load conditions it is impossible to reach the higher rotation speeds.
5.2.2.2 Pre processing
This dataset has different speeds and torques at the shaft under 7 different failure modes. Since the goal is to classify the fault, all the files from the same health state are joined into the same class.
However, unlike the Ottawa bearing dataset, the Torino dataset has only one file per torque and speed; therefore, it is not possible to split the dataset into train/test by entirely independent files. Consequently, to maintain the sequential behavior of the data, each file is divided into 70% train (the first values) and the remaining 30% as test.
Then, as there are two triaxial accelerometers, the parameters (Appendix A) of all measurements are calculated, that is, 42 features per column for a total of 252 features.
As seen before, to shape the dataset correctly two window sizes are necessary: the first to compute the features and the second to feed the model.
Figure 5.12: Graphical view of the dimensional reduction by preprocessing per file.
5.2.2.3 Architecture
As in the other models, the hyperparameters are selected by grid search (and then manually fine-tuned).
Figure 5.13: Bayesian recurrent layers architecture for the Politecnico di Torino bearing
dataset. In particular, with BayesLSTM layer.
To smooth the loss curve and prevent overfitting, a polynomial learning rate decay is implemented. The most important hyperparameters, including those of the decay, are listed in Table 5.22:
Chapter 6
Results and discussion
The following sections summarize the results obtained for each case study. Also, the evaluation of the performance and the comparison between the state of the art, the frequentist approach and MC Dropout for C-MAPSS are addressed.
The structure follows the prior chapter: first the regression tasks and then the classification tasks. To validate our approach, C-MAPSS is the first dataset to be analyzed, followed by Maryland Cracklines and the GPM wind turbine. Finally, the Ottawa and Torino bearing datasets show further applications of our development.
Table 6.1: Bayesian Recurrent Neural Networks results for the C-MAPSS dataset. Mean of 3 runs.

  Dataset        FD001              FD002              FD003              FD004
  Model     BayesLSTM BayesGRU  BayesLSTM BayesGRU  BayesLSTM BayesGRU  BayesLSTM BayesGRU
  Mean RMSE   17.73    17.82      23.97    24.07      17.36    16.33      27.28    27.74
  STD          0.92     1.05       1.01     1.03       1.00     0.93       1.07     1.13
  RMSE*       14.00    14.08      17.43    17.58      13.70    12.31      21.00    21.42
Also, the best result for each C-MAPSS sub-dataset is presented. In these figures, the comparison between each example point and its RUL is shown in the top plot, and then the predictive distributions of 3 different (equispaced) examples are shown. The table with all the results is in Appendix B, and plots for the other best runs are summarized in Appendix C.
[Figure: best run for C-MAPSS FD001 — (a) RUL prediction with mean predicted value, 2 probability intervals and the ground truth; (b)-(d) predicted RUL histograms for the 1st, 2nd and 3rd starred examples.]
[Figure: BayesLSTM - C-MAPSS FD002, RMSE of the RUL prediction: 16.78 — (a) RUL prediction with mean predicted value, 2 probability intervals (standard deviation and 90%) and the ground truth; (b)-(d) predicted RUL histograms for the 1st, 2nd and 3rd starred examples.]
[Figure: BayesGRU - C-MAPSS FD003, RMSE of the RUL prediction: 11.56 — (a) RUL prediction with mean predicted value, 2 probability intervals (standard deviation and 90%) and the ground truth; (b)-(d) predicted RUL histograms for the 1st, 2nd and 3rd starred examples.]
[Figure: BayesLSTM - C-MAPSS FD004, RMSE of the RUL prediction: 20.30 — (a) RUL prediction with mean predicted value, 2 probability intervals (standard deviation and 90%) and the ground truth; (b)-(d) predicted RUL histograms for the 1st, 2nd and 3rd starred examples.]
Loss
10 56000
54000
5 52000
20 40 60 80 100 20 40 60 80 100
Epoch Epoch
The results show good predictions. Although the plots seem noisy, the behavior is consistent
with the training samples and the overall RMSE is satisfactory. Also, the loss plot shows
good performance, with decreasing values for both the log likelihood and the KL divergence
and without appreciable overfitting. It is important to notice the difference in magnitude
between these two values, but as seen in Section 3.3, the KL acts only as a regularization
term and the log likelihood is the main loss.
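For reference, the two loss terms can be sketched in NumPy for the mean-field Gaussian case. This is a simplified illustration that assumes a standard normal prior over each weight, not the thesis implementation:

```python
import numpy as np

def gaussian_kl(mu, sigma):
    """Closed-form KL(q || p) per weight, summed, where q = N(mu, sigma^2)
    is the variational posterior and p = N(0, 1) is the prior."""
    return np.sum(np.log(1.0 / sigma) + (sigma**2 + mu**2) / 2.0 - 0.5)

def gaussian_nll(y, y_pred, noise_std=1.0):
    """Negative log likelihood of the targets under a Gaussian observation model."""
    return np.sum(0.5 * np.log(2 * np.pi * noise_std**2)
                  + (y - y_pred)**2 / (2 * noise_std**2))

rng = np.random.default_rng(0)
mu = rng.normal(0.0, 0.1, size=1000)   # hypothetical posterior means
sigma = np.full(1000, 0.05)            # hypothetical posterior standard deviations
y, y_pred = rng.normal(size=32), rng.normal(size=32)

# Negative ELBO minimized during training: data term + complexity term.
loss = gaussian_nll(y, y_pred) + gaussian_kl(mu, sigma)
print(f"NLL = {gaussian_nll(y, y_pred):.1f}, KL = {gaussian_kl(mu, sigma):.1f}")
```

The KL term scales with the number of weights, which explains the difference in magnitude observed in the loss plots, while the log likelihood scales with the batch of data.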
Then, the same architecture (with the same hyperparameters) is trained with the frequentist
approach. Training the frequentist model for the same number of epochs produces overfitted
results, which would lead to an unfair comparison between the two models, so the number of
epochs is reduced to 20 to obtain unbiased results.
As expected, the frequentist models have nearly half the parameters of the Bayesian
approach; this is due to the definition of the Bayesian layers, which are formed by the mean
weights and their perturbations. A breakdown of the number of weights per layer is in
Appendix D.
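The near-doubling can be verified with a simple count for a single dense layer. This is a sketch; the exact totals depend on how each implementation treats biases, which are assumed deterministic here:

```python
# Parameter count for a dense layer with n_in inputs and n_out outputs.
# Frequentist: each kernel weight is a single number.
# Bayesian (mean-field Gaussian): each kernel weight is a (mean, scale) pair,
# so the kernel parameters double; the bias is assumed deterministic.
def frequentist_params(n_in: int, n_out: int) -> int:
    return n_in * n_out + n_out          # kernel + bias

def bayesian_params(n_in: int, n_out: int) -> int:
    return 2 * n_in * n_out + n_out      # kernel means + kernel scales + bias

print(frequentist_params(24, 100))  # 2500
print(bayesian_params(24, 100))     # 4900, just under 2x
```

The same counting applies gate by gate inside the Bayesian LSTM and GRU cells, which is why the totals in Appendix D are close to, but not exactly, twice the frequentist ones.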
Table 6.3 contains the mean results of 5 runs of the frequentist approach and the
comparable result for the Bayesian approach:

Table 6.3: Comparison between average RMSE of the frequentist approach and Bayesian Recurrent Neural Networks.

                   LSTM                               GRU
Dataset    Frequentist RMSE   Bayesian RMSE*   Frequentist RMSE   Bayesian RMSE*
FD001      15.26              14.00            14.12              14.08
FD002      18.08              17.43            17.97              17.58
FD003      14.35              13.70            13.60              12.31
FD004      21.47              21.00            21.67              21.42
Since the Bayesian method has nearly double the parameters of the frequentist one, the
training phase of the Bayesian model takes longer. However, this longer training allows us
to obtain a probability distribution as output, capturing the uncertainty, and to obtain
better results. Hence, the first stages of the construction of a model could be tested with
the frequentist approach, and the model could then be migrated to a Bayesian one to obtain
the benefits listed above.
Then, the proposed approach is compared to MC Dropout as a Bayesian approximation. This
method consists in the application of dropout layers between dense layers and, unlike the
normal application of dropout, the random disconnection of neurons is kept active in the
evaluation process too, resulting in different outputs for a fixed input [18, 32, 42]. This
methodology for addressing uncertainty has been questioned due to its results on simple
tasks [43], its overconfident results in classification [44], and the claim that it does not
quantify uncertainty at all [19]. However, it is still a widely used method to deal with
uncertainty.
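The MC Dropout idea can be sketched in a few lines of NumPy, with dropout kept active at evaluation time. The weights below are random stand-ins, not the thesis models:

```python
import numpy as np

rng = np.random.default_rng(1)

def mc_dropout_forward(x, W1, W2, p=0.5):
    """One stochastic forward pass of a 2-layer network with dropout kept
    ACTIVE at evaluation time (the key difference from standard dropout,
    which is disabled at test time)."""
    h = np.maximum(0.0, x @ W1)           # ReLU hidden layer
    mask = rng.random(h.shape) >= p       # random neuron disconnection
    h = h * mask / (1.0 - p)              # inverted dropout scaling
    return h @ W2

x = rng.normal(size=(1, 16))              # one fixed input
W1 = rng.normal(size=(16, 64)) * 0.1
W2 = rng.normal(size=(64, 1)) * 0.1

# Repeating the pass yields different outputs for the same input;
# their spread is read as predictive uncertainty.
preds = np.array([mc_dropout_forward(x, W1, W2)[0, 0] for _ in range(200)])
mean, std = preds.mean(), preds.std()
```

Note that, unlike the proposed Bayesian layers, the spread here comes from the dropout masks rather than from an explicit posterior over the weights, which is precisely the point questioned in [19, 43, 44].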
As it is another Bayesian approximation, it is possible to compute the same metrics as with
the proposed approach. Table 6.4 compares the best model for each dataset as a mean of 3 runs:

Table 6.4: Comparison of the best average performance for each dataset between the Bayesian and MC Dropout approaches.

            FD001                FD002                FD003                FD004
            Bayes   MC Dropout   Bayes   MC Dropout   Bayes   MC Dropout   Bayes   MC Dropout
Model       LSTM    GRU          LSTM    LSTM         GRU     GRU          LSTM    LSTM
Mean RMSE   17.73   15.67        23.97   18.84        16.33   15.63        27.28   21.88
STD         0.92    0.68         1.01    0.33         0.93    0.65         1.07    0.27
RMSE*       14.00   14.25        17.43   18.24        12.31   14.52        21.00   21.53
It is noticeable that MC Dropout has a better mean RMSE than the proposed approach, but in
the global performance metric RMSE*, the implemented Bayesian method is better, leading to
better overall results.
Also, it is possible to compare the proposed approach's best average performance on each
dataset with some of the state-of-the-art results:

Table 6.5: Proposed approach results compared with state of the art performance.

Model    MODBNE [12]   DCNN [45]   DLSTM [46]   AdaBN-CNN [47]   Proposed approach
Data     RMSE          RMSE        RMSE         RMSE             RMSE*
FD001    15.40         12.61       16.14        13.17            14.00
FD002    25.05         22.36       24.49        20.87            17.43
FD003    12.51         12.64       16.18        14.97            12.31
FD004    28.66         23.31       28.17        24.57            21.00
The models compared against our results are different types of architectures, including
convolutional, recurrent, deep belief, and adaptive models. DCNN reports a standard
deviation, but it is over 10 independent runs, so it does not represent uncertainty. It is
important to notice that none of the state-of-the-art models addresses uncertainty, which
gives the proposed approach a major advantage. Also, the Bayesian RNN outperforms the
state-of-the-art models considerably on the hardest datasets (FD002 and FD004), and on the
other two datasets it obtains results similar to the other approaches.
Table 6.6: Bayesian Recurrent Neural Networks results for Maryland cracklines. Mean of 3 runs.

              Bayes LSTM   Bayes GRU
Mean RMSE     24.75        25.07
STD           0.59         0.57
Global RMSE   19.54        20.26

Table 6.7: Best result of Maryland dataset with BayesLSTM layer, per test set run.

            Test    Test    Test    Test    Test    Test    Test
            run 1   run 2   run 3   run 4   run 5   run 6   run 7
Mean RMSE   17.97   18.29   16.53   16.96   29.49   23.95   28.45
STD         1.67    1.96    1.50    1.20    1.70    1.34    1.13
RMSE*       14.16   13.05   10.80   11.76   26.15   17.98   22.18

Table 6.8: Best result of Maryland dataset with BayesGRU layer, per test set run.

            Test    Test    Test    Test    Test    Test    Test
            run 1   run 2   run 3   run 4   run 5   run 6   run 7
Mean RMSE   19.57   19.42   16.99   17.31   29.90   23.76   27.01
STD         1.86    1.87    1.89    1.19    1.30    1.63    1.06
RMSE*       15.53   14.38   10.65   11.55   27.01   16.97   20.68
[Figure: BayesLSTM - Maryland Cracks. RMSE RUL prediction: 19.33. (a) HI prediction with mean predicted value, 2 probability intervals and the ground truth. (b)-(d) 1st, 2nd and 3rd star histograms.]
[Figure: BayesGRU - Maryland Cracks. RMSE RUL prediction: 18.79. (a) HI prediction with mean predicted value, 2 probability intervals and the ground truth. (b)-(d) 1st, 2nd and 3rd star histograms.]

[Figure: Loss curves (log likelihood, left, and KL divergence, right) over 3000 training epochs.]
It is noticeable that the runs with shorter lives are the ones with better results. This is
probably because the proposed health indicator decays from the beginning of the test, which
is not always true, affecting the longest runs. However, the results are good enough to
capture the behavior of the health indicator, addressing the task successfully.
The training with this dataset is stopped once the log-likelihood stabilizes and the KL is
no longer growing (i.e., it stabilizes or decreases).
Future work with this dataset could consider identifying the starting point of damage
(which is critical for the last two experiments), but within the scope of this Master's
thesis, the results successfully prove the functionality of the Bayesian RNN architectures.
Table 6.9: Bayesian Recurrent Neural Networks results for Wind turbine. Mean of 3 runs.

            Bayes LSTM   Bayes GRU
Mean RMSE   13.40        13.71
STD         0.31         0.30
RMSE*       8.15         8.66
[Figure 6.9: Best result for Wind Turbine dataset, BayesLSTM architecture. RMSE RUL prediction: 7.98. (a) HI prediction with mean predicted value, 2 probability intervals and the ground truth. (b)-(d) 1st, 2nd and 3rd star histograms.]
[Figure 6.10: Best result for Wind Turbine dataset, BayesGRU architecture. RMSE RUL prediction: 8.51. (a) HI prediction with mean predicted value, 2 probability intervals and the ground truth. (b)-(d) 1st, 2nd and 3rd star histograms.]

[Figure: Loss curves (log likelihood, left, and KL divergence, right) over 3000 training epochs.]
6.2 Classification tasks
6.2.1 Ottawa: Bearing vibration Data
The results for this dataset are presented in Table 6.10 as the summary of 3 runs per
dataset split. Non-normalized and normalized confusion matrices are presented for the best
result on this dataset, for both Bayesian layers, with a 90% probability interval.
Table 6.10: Bayesian Recurrent Neural Networks results for Ottawa bearing dataset. Mean of 3 runs.

          Dataset 1               Dataset 2               Dataset 3
          BayesLSTM   BayesGRU    BayesLSTM   BayesGRU    BayesLSTM   BayesGRU
Top 5%    93.68%      95.40%      98.97%      99.23%      97.51%      97.55%
50%       93.48%      95.21%      98.82%      99.09%      97.33%      97.37%
95%       93.26%      95.00%      98.67%      98.93%      97.13%      97.20%
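The Top 5% / 50% / 95% rows can be read as percentiles of the accuracy obtained over repeated stochastic forward passes (this percentile reading is our interpretation of the table). A sketch of the computation, with simulated accuracy samples standing in for the network's runs:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for T weight samples of the Bayesian classifier: each sample
# yields one accuracy over the whole test set (simulated here; in the thesis
# each value comes from a full stochastic pass of the BayesLSTM/BayesGRU).
T = 500
accuracies = rng.normal(loc=0.9348, scale=0.0015, size=T).clip(0.0, 1.0)

top5, median, bottom95 = np.percentile(accuracies, [95, 50, 5])
print(f"Top 5%: {top5:.2%}   50%: {median:.2%}   95%: {bottom95:.2%}")
```

The same per-sample counting produces the three-number cells of the confusion matrices below, where each cell reports a 90% interval around the median count.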
[Figure: (a) Confusion matrix and (b) normalized confusion matrix for the best result on one of the Ottawa dataset splits. Classes: Healthy, Inner fault, Outer fault, Mixed fault, Ball fault; each cell reports a 90% probability interval around the median over stochastic predictions.]
[Figure: (a) Confusion matrix and (b) normalized confusion matrix for the best result on one of the Ottawa dataset splits. Classes: Healthy, Inner fault, Outer fault, Mixed fault, Ball fault; each cell reports a 90% probability interval around the median over stochastic predictions.]
[Figure: (a) Confusion matrix and (b) normalized confusion matrix for the best result on one of the Ottawa dataset splits. Classes: Healthy, Inner fault, Outer fault, Mixed fault, Ball fault; each cell reports a 90% probability interval around the median over stochastic predictions.]
All the best confusion matrices correspond to Bayesian GRU cells. The tables with the full
runs per dataset can be seen in Appendix I.
The classification task on this dataset shows very good results without a biased class (a
class with considerably lower results than the others). This is important for this dataset
because the mixed fault class consists of several different damages (inner, outer, and ball
faults) and could be a problematic class for the classifier, yet the results are robust.
Even so, the best result for Dataset 1 with the Bayesian LSTM cell presents the above
mistake between the Inner and Mixed fault classes.
As Table 6.10 shows, the worst results are obtained with the first split of the dataset
(5% less accuracy for the Bayesian LSTM and 4% for the Bayesian GRU). However, the accuracy
is higher than 90%, which corresponds to a very low probability of failure. Also, the splits
of the dataset are file-driven, so contamination across the train-test split is negligible.
This confirms that the model is extensible to newly acquired data, which is the main issue
with recurrent neural networks.
Also, the confusion matrices for the Bayesian LSTM models are presented at the end of
Appendix I.
[Figure 6.15: Worst result, Torino bearing dataset with Bayes LSTM layer. (a) Confusion matrix and (b) normalized confusion matrix over classes C0A-C6A; each cell reports a 90% probability interval around the median over stochastic predictions. Per-class accuracy is near 100%.]
[Figure 6.16: Worst result, Torino bearing dataset with Bayes GRU layer. (a) Confusion matrix and (b) normalized confusion matrix over classes C0A-C6A; each cell reports a 90% probability interval around the median over stochastic predictions. Per-class accuracy is near 100%.]
[Figure: Log likelihood (left) and Kullback-Leibler divergence (right) loss curves, train and test, over 350 training epochs for the Torino dataset.]
As can be seen, the results obtained on this dataset are extremely good. One could suspect
overfitting, but given the train-test split and the behavior of the loss plot, this is
discarded. Hence, the health state is identified regardless of the amount of torque applied
to the shaft.
Chapter 7
Concluding remarks and Future work
7.1 Concluding remarks
This thesis successfully deals with the construction of Bayesian Recurrent Neural Networks
and proves their performance on regression tasks (RUL and HI prognosis) and classification
tasks (health state classification) in complex mechanical systems, based on their sensory
data.
The proposed Bayesian RNN method is implemented for LSTM and GRU layers, converting their
inner gates to Bayesian dense Flipout layers. This produces easy-to-use Keras-like layers
that can be attached to any other Keras layer with no effort. However, the use of these
layers changes the training from a loss between the expected and predicted results (i.e.,
RMSE or cross entropy in regression and classification tasks) to a loss focused on the
likelihood (data dependent) and complexity (prior dependent), since every layer uses
distributions as weights. This allows the network to output a distribution that embodies
the uncertainty, which is a major advantage over frequentist approaches.
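The Flipout perturbation [23] at the core of these layers can be sketched in NumPy. This is a simplified dense-layer illustration, not the thesis implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def flipout_dense(x, w_mean, w_scale):
    """Flipout-style stochastic dense layer [23], simplified sketch.
    A single weight perturbation dW is shared by the whole batch, but
    per-example random sign matrices r, s decorrelate it across examples,
    giving pseudo-independent weight samples at the cost of one extra matmul."""
    batch, n_in = x.shape
    n_out = w_mean.shape[1]
    dW = rng.standard_normal(w_mean.shape) * w_scale   # shared perturbation sample
    r = rng.choice([-1.0, 1.0], size=(batch, n_out))   # output-side signs
    s = rng.choice([-1.0, 1.0], size=(batch, n_in))    # input-side signs
    return x @ w_mean + ((x * s) @ dW) * r

x = rng.normal(size=(4, 8))
w_mean = rng.normal(size=(8, 3)) * 0.1
w_scale = np.full((8, 3), 0.05)
y = flipout_dense(x, w_mean, w_scale)       # stochastic output, shape (4, 3)
y2 = flipout_dense(x, w_mean, w_scale)      # a second pass gives different values
```

The same trick, applied inside each gate of the LSTM and GRU cells, is what lets the layers behave like ordinary Keras layers while still sampling weights on every forward pass.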
This is validated with the C-MAPSS dataset, where the method outperforms models such as
MODBNE, DCNN, DLSTM, and AdaBN-CNN, which are complex models with state-of-the-art
performance. Also, the result is compared to the frequentist approach with the same
architecture, showing that the Bayesian approach is not only capable of addressing the
uncertainty but also obtains better results, at the cost of nearly double the trainable
parameters and a longer training step. On the same dataset, MC Dropout is implemented as
another approach to address the uncertainty; again, the proposed approach shows better
performance.
Two other regression tasks and two classification tasks are implemented with good results,
showing that the Bayesian recurrent neural networks can deal with sequential data to solve
some of the common PHM problems. Also, the implementation on different datasets demonstrates
the extensibility of the constructed Bayesian layers, allowing them to be used in different
areas, even outside the PHM context, wherever the quantification of uncertainty is an
important problem.
7.2 Future work
The proposed Bayesian Recurrent Neural Networks are presented as a functional alternative
to frequentist RNNs for sequential data, with the improvement that they also quantify the
uncertainty. However, the validations only use a Gaussian distribution as a prior; this may
not be suitable for every application, although it has shown good performance in the
presented case studies. Future work could address other kinds of problems, such as counting
processes (which can model renewal), or simply try other priors with sensory data to
evaluate the difference in performance.
Also, since the construction of the layers does not necessarily involve the PHM context and
only requires sequential data, another line of future work is to apply the proposed
framework in other contexts, such as general sequence modeling.
In recent years, attention models [48, 49, 50] have appeared as a new type of sequential
model with interesting applications in natural language processing, which deals with a
sequential type of data. However, these models use a large number of parameters (e.g., the
Facebook Blender chatbot [50] has 90 million parameters in its smallest version and 9.4
billion parameters in its largest model). Since the application of the proposed approach
nearly doubles the parameters of a network, the development of Bayesian attention models
could be an interesting challenge.
Bibliography
[1] A. Saxena, K. Goebel, D. Simon, and N. Eklund, “Damage propagation modeling for air-
craft engine run-to-failure simulation,” in 2008 International Conference on Prognostics
and Health Management, 2008, pp. 1–9.
[2] L. Saidi, J. B. Ali, E. Bechhoefer, and M. Benbouzid, “Wind turbine high-
speed shaft bearings health prognosis through a spectral kurtosis-derived indices
and SVR,” Applied Acoustics, vol. 120, pp. 1–8, May 2017. [Online]. Available:
https://doi.org/10.1016/j.apacoust.2017.01.005
[3] Y. Lei, N. Li, L. Guo, N. Li, T. Yan, and J. Lin, “Machinery health prognostics: A
systematic review from data acquisition to RUL prediction,” pp. 799–834, may 2018.
[4] D. Siegel, H. Al-Atat, V. Shauche, L. Liao, J. Snyder, and J. Lee, “Novel method for
rolling element bearing health assessment - A tachometer-less synchronously averaged
envelope feature extraction technique,” in Mechanical Systems and Signal Processing,
vol. 29. Academic Press, may 2012, pp. 362–376.
[5] D. Wang and K. L. Tsui, “Statistical modeling of bearing degradation signals,” IEEE
Transactions on Reliability, vol. 66, no. 4, pp. 1331–1344, dec 2017.
[6] R. Liu, B. Yang, E. Zio, and X. Chen, “Artificial intelligence for fault diagnosis of
rotating machinery: A review,” pp. 33–47, aug 2018.
[7] T. Touret, C. Changenet, F. Ville, M. Lalmi, and S. Becquerelle, “On the use of temper-
ature for online condition monitoring of geared systems – A review,” Mechanical Systems
and Signal Processing, vol. 101, pp. 197–210, feb 2018.
[8] K. Goebel, B. Saha, A. Saxena, J. R. Celaya, and J. P. Christophersen, “Prognostics
in battery health management,” IEEE Instrumentation and Measurement Magazine,
vol. 11, no. 4, pp. 33–40, 2008.
[9] H. Meng and Y. F. Li, “A review on prognostics and health management (PHM) methods
of lithium-ion batteries,” p. 109405, dec 2019.
[10] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016, http://www.deeplearningbook.org.
[11] D. B. Verstraete, E. L. Droguett, V. Meruane, M. Modarres, and A. Ferrada, “Deep
semi-supervised generative adversarial fault diagnostics of rolling element bearings,”
Structural Health Monitoring, vol. 19, no. 2, pp. 390–411, mar 2020. [Online]. Available:
http://journals.sagepub.com/doi/10.1177/1475921719850576
[12] Y. Zhang, R. Xiong, H. He, and M. G. Pecht, “Long short-term memory recurrent neural
network for remaining useful life prediction of lithium-ion batteries,” IEEE Transactions
on Vehicular Technology, vol. 67, no. 7, pp. 5695–5705, jul 2018.
[13] C. Correa-Jullian, J. M. Cardemil, E. López Droguett, and M. Behzad, “Assessment of
Deep Learning techniques for Prognosis of solar thermal systems,” Renewable Energy,
vol. 145, pp. 2178–2191, jan 2020.
[14] R. Zhao, R. Yan, Z. Chen, K. Mao, P. Wang, and R. X. Gao, “Deep learning and its
applications to machine health monitoring,” pp. 213–237, jan 2019.
[15] F. Jia, Y. Lei, J. Lin, X. Zhou, and N. Lu, “Deep neural networks: A promising tool for
fault characteristic mining and intelligent diagnosis of rotating machinery with massive
data,” Mechanical Systems and Signal Processing, vol. 72-73, pp. 303–315, may 2016.
[16] W. Peng, Z. S. Ye, and N. Chen, “Bayesian Deep-Learning-Based Health Prognostics
Toward Prognostics Uncertainty,” IEEE Transactions on Industrial Electronics, vol. 67,
no. 3, pp. 2283–2293, mar 2020.
[17] J. Hron, A. G. d. G. Matthews, and Z. Ghahramani, “Variational Gaussian Dropout is
not Bayesian,” nov 2017. [Online]. Available: http://arxiv.org/abs/1711.02989
[18] Y. Gal and Z. Ghahramani, “Dropout as a Bayesian Approximation: Representing
Model Uncertainty in Deep Learning,” in Proceedings of The 33rd International
Conference on Machine Learning, ser. Proceedings of Machine Learning Research,
M. F. Balcan and K. Q. Weinberger, Eds., vol. 48. New York, New York, USA: PMLR,
2016, pp. 1050–1059. [Online]. Available: http://proceedings.mlr.press/v48/gal16.html
[19] I. Osband and G. Deepmind, “Risk versus Uncertainty in Deep Learning: Bayes, Boot-
strap and the Dangers of Dropout,” Tech. Rep., 2016.
[20] T. Pearce, N. Anastassacos, M. Zaki, and A. Neely, “Bayesian Inference with Anchored
Ensembles of Neural Networks, and Application to Exploration in Reinforcement
Learning,” may 2018. [Online]. Available: http://arxiv.org/abs/1805.11324
[21] P. L. McDermott and C. K. Wikle, “Bayesian Recurrent Neural Network Models for
Forecasting and Quantifying Uncertainty in Spatial-Temporal Data,” Entropy, vol. 21,
no. 2, nov 2017. [Online]. Available: http://arxiv.org/abs/1711.00636
[22] M. Fortunato, C. Blundell, and O. Vinyals, “Bayesian Recurrent Neural Networks,” apr
2017. [Online]. Available: http://arxiv.org/abs/1704.02798
[23] Y. Wen, P. Vicol, J. Ba, D. Tran, and R. Grosse, “Flipout: Efficient pseudo-independent
weight perturbations on mini-batches,” in 6th International Conference on Learning
Representations, ICLR 2018 - Conference Track Proceedings. International Conference
on Learning Representations, ICLR, mar 2018.
[24] A. Saxena and K. Goebel, “Turbofan engine degradation simulation data set,” NASA
Ames Prognostics Data Repository, 2008.
[25] A. Y. Ng, “Machine Learning - Coursera,” 2017. [Online]. Available: https:
//es.coursera.org/learn/machine-learning
[26] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, “Deepface: Closing the gap to human-
level performance in face verification,” in 2014 IEEE Conference on Computer Vision
and Pattern Recognition, 2014, pp. 1701–1708.
[27] G. Bohouta and V. Këpuska, “Comparing speech recognition systems (microsoft api,
google api and cmu sphinx),” Int. Journal of Engineering Research and Application, vol.
2248-9622, pp. 20–24, 03 2017.
[28] L. Fridman, D. E. Brown, M. Glazer, W. Angell, S. Dodd, B. Jenik, J. Terwilliger,
J. Kindelsberger, L. Ding, S. Seaman, H. Abraham, A. Mehler, A. Sipperley,
A. Pettinato, B. Seppelt, L. Angell, B. Mehler, and B. Reimer, “MIT autonomous
vehicle technology study: Large-scale deep learning based analysis of driver behavior
and interaction with automation,” CoRR, vol. abs/1711.06976, 2017. [Online]. Available:
http://arxiv.org/abs/1711.06976
[29] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evaluation of gated recurrent
neural networks on sequence modeling,” 2014.
[30] C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra, “Weight uncertainty in
neural networks,” 2015.
[31] A. B. Owen, Monte Carlo theory, methods and examples., 2013.
[32] D. P. Kingma, T. Salimans, and M. Welling, “Variational dropout and the local repa-
rameterization trick,” 2015.
[33] J. Baalis Coble, “Merging Data Sources to Predict Remaining Useful Life – An
Automated Method to Identify Prognostic Parameters,” University of Tennessee, Tech.
Rep., 2010. [Online]. Available: https://trace.tennessee.edu/utk{_}graddiss/683
[34] H. Yun and M. Modarres, “Measures of entropy to characterize fatigue damage in metallic
materials,” Entropy, vol. 21, p. 804, 08 2019.
[35] M. Bryant, M. Khonsari, and F. Ling, “On the thermodynamics of degradation,” Pro-
ceedings of The Royal Society A: Mathematical, Physical and Engineering Sciences, vol.
464, pp. 2001–2014, 08 2008.
[36] C. Basaran and S. Nie, “An irreversible thermodynamics theory for damage mechanics of
solids,” International Journal of Damage Mechanics - INT J DAMAGE MECH, vol. 13,
pp. 205–223, 07 2004.
[37] H. Yun and M. Modarres, “Measures of entropy to characterize fatigue damage in
metallic materials,” Entropy, vol. 21, no. 8, p. 804, Aug. 2019. [Online]. Available:
https://doi.org/10.3390/e21080804
[38] M. Modarres and M. Amiri, “An entropy-based damage characterization,” Entropy,
vol. 16, pp. 6434–6463, 12 2014.
[39] M. Amiri and M. M. Khonsari, “Nondestructive estimation of remaining fatigue life:
Thermography technique,” Journal of Failure Analysis and Prevention, vol. 12, no. 6,
pp. 683–688, Aug. 2012. [Online]. Available: https://doi.org/10.1007/s11668-012-9607-8
[40] “Practice for conducting force controlled constant amplitude axial fatigue tests of
metallic materials.” [Online]. Available: https://doi.org/10.1520/e0466-15
[41] J. B. Ali, L. Saidi, S. Harrath, E. Bechhoefer, and M. Benbouzid, “Online automatic
diagnosis of wind turbine bearings progressive degradations under real experimental
conditions based on unsupervised machine learning,” Applied Acoustics, vol. 132, pp.
167–181, Mar. 2018. [Online]. Available: https://doi.org/10.1016/j.apacoust.2017.11.021
[42] S. Maeda, “A Bayesian encourages dropout,” 2014.
[43] A. Y. K. Foong, D. R. Burt, Y. Li, and R. E. Turner, “On the expressiveness of approx-
imate inference in bayesian neural networks,” 2019.
[44] B. Lakshminarayanan, A. Pritzel, and C. Blundell, “Simple and scalable predictive un-
certainty estimation using deep ensembles,” 2016.
[45] X. Li, Q. Ding, and J.-Q. Sun, “Remaining useful life estimation in prognostics using
deep convolution neural networks,” Reliability Engineering & System Safety, vol. 172,
pp. 1–11, Apr. 2018. [Online]. Available: https://doi.org/10.1016/j.ress.2017.11.021
[46] J. Wu, K. Hu, Y. Cheng, H. Zhu, X. Shao, and Y. Wang, “Data-driven remaining
useful life prediction via multiple sensor signals and deep long short-term memory
neural network,” ISA Transactions, vol. 97, pp. 241–250, Feb. 2020. [Online]. Available:
https://doi.org/10.1016/j.isatra.2019.07.004
[47] J. Li, X. Li, and D. He, “Domain adaptation remaining useful life prediction
method based on AdaBN-DCNN,” in 2019 Prognostics and System Health
Management Conference (PHM-Qingdao). IEEE, Oct. 2019. [Online]. Available:
https://doi.org/10.1109/phm-qingdao46334.2019.8942857
[48] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser,
and I. Polosukhin, “Attention is all you need,” CoRR, vol. abs/1706.03762, 2017.
[Online]. Available: http://arxiv.org/abs/1706.03762
[49] A. Galassi, M. Lippi, and P. Torroni, “Attention, please! A critical review of neural
attention models in natural language processing,” CoRR, vol. abs/1902.02181, 2019.
[Online]. Available: http://arxiv.org/abs/1902.02181
[50] S. Roller, E. Dinan, N. Goyal, D. Ju, M. Williamson, Y. Liu, J. Xu, M. Ott, K. Shuster,
E. M. Smith, Y.-L. Boureau, and J. Weston, “Recipes for building an open-domain
chatbot,” 2020.
Appendix
A Statistical parameters extracted from signals as features
• Max Value
• Root Mean Square (RMS)
• Peak to peak
• Crest Factor
• Mean
• Variance
• Skewness
• Kurtosis
• 5th moment
• The mean of each of the first five frequency bands of the Fourier transform.
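The features above can be computed directly from a raw 1-D signal. The sketch below is illustrative, not the thesis implementation: the band edges are an assumption (the magnitude spectrum is simply split into five equal contiguous bands), and the higher moments are taken as standardized central moments.

```python
import numpy as np

def extract_features(signal, n_bands=5):
    """Compute the statistical features listed above for a 1-D signal.

    Assumption: the FFT magnitude spectrum is split into `n_bands`
    equal contiguous bands; the thesis does not specify band edges here.
    """
    x = np.asarray(signal, dtype=float)
    mean = x.mean()
    std = x.std()
    rms = np.sqrt(np.mean(x ** 2))
    features = {
        "max": x.max(),
        "rms": rms,
        "peak_to_peak": x.max() - x.min(),
        "crest_factor": np.abs(x).max() / rms,
        "mean": mean,
        "variance": x.var(),
        # Standardized central moments: 3rd (skewness), 4th (kurtosis), 5th.
        "skewness": np.mean((x - mean) ** 3) / std ** 3,
        "kurtosis": np.mean((x - mean) ** 4) / std ** 4,
        "moment_5": np.mean((x - mean) ** 5) / std ** 5,
    }
    spectrum = np.abs(np.fft.rfft(x))
    for i, band in enumerate(np.array_split(spectrum, n_bands), start=1):
        features[f"band_{i}_mean"] = band.mean()
    return features
```

For a pure sinusoid this yields the expected values (RMS near 1/√2, peak-to-peak near 2, crest factor near √2), which is a quick sanity check for the implementation.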
B Results of Bayesian Recurrent Neural Networks, C-MAPSS dataset.
C Plots of the results of Bayesian RNN, C-MAPSS.
Figure C.1: Best result for C-MAPSS FD001 with Bayesian GRU cell. (a) RUL prediction with mean predicted value, two probability intervals and the ground truth; (b)-(d) histograms for the 1st, 2nd and 3rd starred examples. [Plots omitted: RUL [Cycles] vs. Example.]
Figure C.2: Best result for C-MAPSS FD002 with Bayesian GRU cell (BayesGRU, RMSE of RUL prediction: 17.32). (a) RUL prediction with mean predicted value, two probability intervals (standard deviation and 90%) and the ground truth; (b)-(d) histograms for the 1st, 2nd and 3rd starred examples. [Plots omitted: RUL [Cycles] vs. Example.]
Figure C.3: Best result for C-MAPSS FD003 with Bayesian LSTM cell (BayesLSTM, RMSE of RUL prediction: 13.35). (a) RUL prediction with mean predicted value, two probability intervals (standard deviation and 90%) and the ground truth; (b)-(d) histograms for the 1st, 2nd and 3rd starred examples. [Plots omitted: RUL [Cycles] vs. Example.]
Figure C.4: Best result for C-MAPSS FD004 with Bayesian GRU cell (BayesGRU, RMSE of RUL prediction: 20.80). (a) RUL prediction with mean predicted value, two probability intervals (standard deviation and 90%) and the ground truth; (b)-(d) histograms for the 1st, 2nd and 3rd starred examples. [Plots omitted: RUL [Cycles] vs. Example.]
D Comparison tables of the number of trainable parameters in Frequentist and Bayesian models for C-MAPSS Dataset.
Table D.3: Parameter table, FD003.

Layer (output shape)          | LSTM Bayesian | LSTM Frequentist | GRU Bayesian | GRU Frequentist
Input (None, 30, 14)          | 0             | 0                | 0            | 0
Recurrent (None, 30, 32)      | 12032         | 6016             | 9024         | 4512
Flatten (None, 960)           | 0             | 0                | 0            | 0
Dense 1 (None, 32)            | 61472         | 30752            | 61472        | 30752
Dense 2 (None, 16)            | 1040          | 528              | 1040         | 528
Output Bayesian (None, 2)     | 66            | -                | 66           | -
Output Frequentist (None, 1)  | -             | 17               | -            | 17
Total Parameters              | 74610         | 37313            | 71602        | 35809
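The frequentist counts in Table D.3 follow the standard gate formulas (4 gates for an LSTM, 3 for a GRU, each with input weights, recurrent weights and a bias), while the Bayesian mean-field layers store two parameters per weight (a mean and a scale), roughly doubling the count. The helper names below are illustrative only:

```python
def lstm_params(n_in, n_units):
    # 4 gates, each with input weights, recurrent weights and a bias.
    return 4 * ((n_in + n_units) * n_units + n_units)

def gru_params(n_in, n_units):
    # Same structure with 3 gates.
    return 3 * ((n_in + n_units) * n_units + n_units)

def dense_params(n_in, n_units):
    # Fully connected layer: weight matrix plus one bias per unit.
    return n_in * n_units + n_units
```

For the FD003 architecture above: `lstm_params(14, 32) == 6016`, `gru_params(14, 32) == 4512`, `dense_params(960, 32) == 30752`, `dense_params(32, 16) == 528` and `dense_params(16, 1) == 17`, matching the frequentist column; the Bayesian recurrent layer is exactly twice the frequentist one (12032 vs. 6016).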
E Results of Frequentist Recurrent Neural Networks, C-MAPSS dataset.
Table F.2: MC Dropout Recurrent Neural Network, FD001.
MC Dropout LSTM MC Dropout GRU
Run 1 Run 2 Run 3 Run 1 Run 2 Run 3
Mean RMSE 17.23 16.17 15.17 15.32 16.19 15.48
STD 0.76 0.73 0.52 0.64 0.77 0.64
RMSE* 16.07 14.40 14.41 14.05 14.50 14.19
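The Mean RMSE, STD and RMSE* rows can be read as summaries over repeated stochastic forward passes: Mean RMSE and STD over the per-pass RMSE values, and RMSE* from the mean prediction. A minimal numpy sketch under that interpretation (the function name and the reading of RMSE* are assumptions of this sketch, not stated in the tables):

```python
import numpy as np

def summarize_sampled_predictions(samples, y_true):
    """Summarize T stochastic forward passes over N test points.

    samples: array of shape (T, N); y_true: array of shape (N,).
    """
    # RMSE of each individual stochastic pass against the ground truth.
    rmse_per_pass = np.sqrt(np.mean((samples - y_true) ** 2, axis=1))
    # Point prediction: the mean over the T sampled passes.
    mean_pred = samples.mean(axis=0)
    return {
        "mean_rmse": rmse_per_pass.mean(),                          # "Mean RMSE"
        "std": rmse_per_pass.std(),                                 # "STD"
        "rmse_star": np.sqrt(np.mean((mean_pred - y_true) ** 2)),   # "RMSE*"
    }
```

Under this reading, RMSE* is at most the mean per-pass RMSE (averaging the samples cancels part of the sampling noise), which is consistent with RMSE* being the smallest value in each column of the tables.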
G Results of Bayesian Recurrent Neural Networks, Maryland cracklines dataset.
Table G.1: Results of Bayesian Recurrent Neural networks, Maryland cracklines dataset.
Bayes LSTM Bayes GRU
Run 1 Run 2 Run 3 Run 1 Run 2 Run 3
Mean RMSE 24.69 25.53 25.57 24.82 24.30 24.50
STD 0.57 0.64 0.57 0.62 0.56 0.58
RMSE* 19.33 20.31 20.14 18.85 18.79 19.17
Figure G.1: First complete life cycle. Maryland cracklines, Bayesian LSTM. [Plot omitted: HI vs. Example.]
Figure G.2: First complete life cycle. Maryland cracklines, Bayesian GRU. [Plot omitted: HI vs. Example.]
Figure G.3: Second complete life cycle. Maryland cracklines, Bayesian LSTM (BayesLSTM, RMSE of RUL prediction: 13.05). [Plot omitted: HI vs. Example; mean predicted value, ground truth, standard-deviation and 90% probability intervals.]
Figure G.4: Second complete life cycle. Maryland cracklines, Bayesian GRU. [Plot omitted: HI vs. Example.]
Figure G.5: Third complete life cycle. Maryland cracklines, Bayesian LSTM (BayesLSTM, RMSE of RUL prediction: 10.80). [Plot omitted: HI vs. Example; mean predicted value, ground truth, standard-deviation and 90% probability intervals.]
Figure G.6: Third complete life cycle. Maryland cracklines, Bayesian GRU. [Plot omitted: HI vs. Example.]
Figure G.7: Fourth complete life cycle. Maryland cracklines, Bayesian LSTM (BayesLSTM, RMSE of RUL prediction: 11.76). [Plot omitted: HI vs. Example; mean predicted value, ground truth, standard-deviation and 90% probability intervals.]
Figure G.8: Fourth complete life cycle. Maryland cracklines, Bayesian GRU. [Plot omitted: HI vs. Example.]
Figure G.9: Fourth complete life cycle. Maryland cracklines, Bayesian LSTM (BayesLSTM, RMSE of RUL prediction: 26.15). [Plot omitted: HI vs. Example; mean predicted value, ground truth, standard-deviation and 90% probability intervals.]
Figure G.10: Fourth complete life cycle. Maryland cracklines, Bayesian GRU. [Plot omitted: HI vs. Example.]
Figure G.11: Fifth complete life cycle. Maryland cracklines, Bayesian LSTM (BayesLSTM, RMSE of RUL prediction: 17.98). [Plot omitted: HI vs. Example; mean predicted value, ground truth, standard-deviation and 90% probability intervals.]
Figure G.12: Fifth complete life cycle. Maryland cracklines, Bayesian GRU. [Plot omitted: HI vs. Example.]
Figure G.13: Sixth complete life cycle. Maryland cracklines, Bayesian LSTM (BayesLSTM, RMSE of RUL prediction: 22.18). [Plot omitted: HI vs. Example; mean predicted value, ground truth, standard-deviation and 90% probability intervals.]
Figure G.14: Sixth complete life cycle. Maryland cracklines, Bayesian GRU. [Plot omitted: HI vs. Example.]
H Results of Bayesian Recurrent Neural Networks, GPM Wind turbine dataset.
Table I.3: All results for each Bayesian layer. Dataset 3.
BayesLSTM BayesGRU
Run 1 Run 2 Run 3 Run 1 Run 2 Run 3
Top 5% 97.36% 97.78% 97.39% 97.43% 97.39% 97.82%
50% 97.13% 97.61% 97.25% 97.27% 97.20% 97.65%
95% 96.91% 97.44% 97.05% 97.11% 97.03% 97.45%
Figure I.1: Best result for Ottawa bearing dataset. Dataset 1 with Bayesian LSTM layers. (a) Confusion Matrix; (b) Normalized Confusion Matrix. [Matrices omitted: classes Healthy, Inner fault, Outer fault, Mixed fault and Ball fault; each cell reports three values (upper, median, lower). Median per-class accuracy: Healthy 98.4%, Inner fault 76.1%, Outer fault 98.2%, Mixed fault 98.8%, Ball fault 97.0%.]
Figure I.2: Best result for Ottawa bearing dataset. Dataset 2 with Bayesian LSTM layers. (a) Confusion Matrix; (b) Normalized Confusion Matrix. [Matrices omitted: classes Healthy, Inner fault, Outer fault, Mixed fault and Ball fault; each cell reports three values (upper, median, lower). Median per-class accuracy: Healthy 98.5%, Inner fault 99.6%, Outer fault 98.3%, Mixed fault 98.3%, Ball fault 99.7%.]
Figure I.3: Best result for Ottawa bearing dataset. Dataset 3 with Bayesian LSTM layers. (a) Confusion Matrix; (b) Normalized Confusion Matrix. [Matrices omitted: classes Healthy, Inner fault, Outer fault, Mixed fault and Ball fault; each cell reports three values (upper, median, lower). Median per-class accuracy: Healthy 94.8%, Inner fault 99.5%, Outer fault 99.4%, Mixed fault 99.2%, Ball fault 95.2%.]
J Results of Bayesian Recurrent Neural Networks, Politecnico di Torino Bearing dataset.