
UNIVERSIDAD DE CHILE

FACULTAD DE CIENCIAS FÍSICAS Y MATEMÁTICAS


DEPARTAMENTO DE INGENIERÍA MECÁNICA

BAYESIAN VARIATIONAL RECURRENT NEURAL NETWORKS FOR


PROGNOSTICS AND HEALTH MANAGEMENT OF COMPLEX SYSTEMS.

THESIS SUBMITTED IN FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

MASTER OF ENGINEERING SCIENCES, MECHANICAL ENGINEERING
(MAGÍSTER EN CIENCIAS DE LA INGENIERÍA, MENCIÓN MECÁNICA)

DANILO FABIÁN GONZÁLEZ TOLEDO

THESIS ADVISOR:
ENRIQUE LÓPEZ DROGUETT

COMMITTEE MEMBERS:
VIVIANA MERUANE NARANJO
RODRIGO PASCUAL JIMÉNEZ

This work has been partially funded by a CONICYT National Master's Scholarship (Beca Magíster Nacional).

SANTIAGO DE CHILE
2020
THESIS ABSTRACT FOR THE DEGREE OF
MASTER OF ENGINEERING SCIENCES, MECHANICAL ENGINEERING
BY: DANILO FABIÁN GONZÁLEZ TOLEDO
YEAR: 2020
ADVISOR: ENRIQUE LÓPEZ DROGUETT
BAYESIAN VARIATIONAL RECURRENT NEURAL NETWORKS FOR
PROGNOSTICS AND HEALTH MANAGEMENT OF COMPLEX SYSTEMS.
In the last few years, many automated data analytics models have been deployed, providing solutions to problems ranging from face detection and identification to translation between languages. These tasks are humanly manageable, but at a much lower processing rate than that of the automated models. However, the fact that they are humanly solvable allows the user to accept or discard the provided solution. For example, a translation task can easily output misleading results if the context is not correctly provided, in which case the user can improve the input to the model or simply discard the translation.

Unfortunately, when these models are applied to large databases this 'double validation' is no longer possible, and the user may trust the model output blindly. This can lead to biased decision making, threatening not only productivity but also the safety of the workers.

Because of this, it is necessary to create models that quantify the uncertainty in their output. In this thesis, the possibility of using distributions instead of single-point matrices as weights is fused with recurrent neural networks (RNNs), a type of neural network specialized in dealing with sequential data. The proposed model can be trained as a discriminative probabilistic model thanks to Bayes' theorem and variational inference.

The proposed model, called 'Bayesian Variational Recurrent Neural Networks', is validated on the benchmark C-MAPSS dataset for remaining useful life (RUL) prognosis. The model is also compared with the same architecture under a frequentist approach (single-point matrices as weights), with different state-of-the-art models and, finally, with MC Dropout, another method to quantify the uncertainty in neural networks. The proposed model outperforms every comparison. Furthermore, it is tested on two classification tasks on bearing data from the University of Ottawa and the Politecnico di Torino, and on two health indicator regression tasks, one on a commercial wind turbine from Green Power Monitor and the other on fatigue crack testing data from the University of Maryland, showing low error and good performance in all tasks.

The above proves that the model can be used not only for regression tasks but also for classification. Finally, it is important to note that even though the validations are in a mechanical engineering context, the layers are not limited to it, and can be used in any other context with sequential data.


Today it is given to you,
today it is taken from you.

Contents

1 Introduction
  1.1 Introduction
  1.2 Motivation
  1.3 Aim of the work
    1.3.1 General objective
    1.3.2 Specific objectives
  1.4 Resources available for this Thesis
  1.5 Structure of the work

2 Methodology

3 Theoretical Background
  3.1 Machine Learning
    3.1.1 Supervised learning
    3.1.2 Regression and classification tasks
  3.2 Deep Learning
    3.2.1 Artificial Neural Networks
    3.2.2 Convolutional neural networks
    3.2.3 Recurrent neural networks
    3.2.4 Training of Neural Networks
  3.3 Bayesian inference and Bayes By Backprop
    3.3.1 Weight perturbations
      3.3.1.1 Flipout

4 Proposed approach of Bayesian Recurrent Neural Networks
  4.1 Proposed framework
  4.2 Cell operations
  4.3 Training and implementation of Bayesian Neural Networks
    4.3.1 Regression metrics
    4.3.2 Classification metrics

5 Study cases
  5.1 Datasets for regression tasks
    5.1.1 NASA: C-MAPSS
      5.1.1.1 Data description
      5.1.1.2 Pre-processing
      5.1.1.3 Architecture
    5.1.2 Maryland: Crack Lines
      5.1.2.1 Data description
      5.1.2.2 Pre-processing
      5.1.2.3 Architecture
    5.1.3 GPM Systems: Wind Turbine Data
      5.1.3.1 Data description
      5.1.3.2 Pre-processing
      5.1.3.3 Architecture
  5.2 Datasets for classification tasks
    5.2.1 Ottawa: Bearing vibration Data
      5.2.1.1 Data description
      5.2.1.2 Pre-processing
      5.2.1.3 Architecture
    5.2.2 Politecnico di Torino: Rolling bearing Data
      5.2.2.1 Data description
      5.2.2.2 Pre-processing
      5.2.2.3 Architecture

6 Results and discussion
  6.1 Regression tasks
    6.1.1 NASA: C-MAPSS
    6.1.2 Maryland: Crack Lines
    6.1.3 GPM Systems: Wind Turbine Data
  6.2 Classification tasks
    6.2.1 Ottawa: Bearing vibration Data
    6.2.2 Politecnico di Torino: Rolling bearing Data

7 Concluding remarks and Future work
  7.1 Concluding remarks
  7.2 Future work

Bibliography

Appendix
  A Statistical parameters extracted from signals as features
  B Results of Bayesian Recurrent Neural Networks, C-MAPSS dataset
  C Plots of the results of Bayesian RNN, C-MAPSS
  D Comparison tables of the number of trainable parameters in Frequentist and Bayesian models for C-MAPSS Dataset
  E Results of Frequentist Recurrent Neural networks, C-MAPSS dataset
  F Results of MC Dropout Recurrent Neural networks, C-MAPSS dataset
  G Results of Bayesian Recurrent Neural networks, Maryland cracklines dataset
  H Results of Bayesian Recurrent Neural networks, GPM Wind turbine dataset
  I Results of Bayesian Recurrent Neural networks, Ottawa Bearing dataset
  J Results of Bayesian Recurrent Neural networks, Politecnico di Torino Bearing dataset

List of Tables

5.1 Definition of the 4 sub-datasets by their operating conditions and fault modes.
5.2 C-MAPSS output sensor data.
5.3 Dataset size for C-MAPSS.
5.4 Hyperparameters to test.
5.5 Hyperparameters for the C-MAPSS dataset.
5.6 Mechanical properties of SS304L.
5.7 Tested datasets for Maryland crack lines.
5.8 Dataset size for Maryland cracklines.
5.9 Hyperparameters tested in the grid search.
5.10 Hyperparameters for the Maryland dataset.
5.11 GPM turbine dataset shape post processing.
5.12 Grid search for the hyperparameter selection.
5.13 Hyperparameters for the GPM Wind turbine dataset.
5.14 Conformation of the Ottawa bearing dataset.
5.15 Ottawa bearing dataset shape post processing.
5.16 Hyperparameters for grid search.
5.17 Hyperparameters for the Ottawa bearing dataset.
5.18 Different health states of the B1 bearing.
5.19 Summary of nominal loads and speeds of the shaft.
5.20 Torino bearing dataset shape post processing.
5.21 Grid search hyperparameters.
5.22 Hyperparameters for the Torino bearing dataset.

6.1 Bayesian Recurrent Neural Networks results for C-MAPSS dataset. Mean of 3 runs.
6.2 Number of parameters in Bayesian and Frequentist approach.
6.3 Comparison between average RMSE of Frequentist approach and Bayesian Recurrent Neural Networks.
6.4 Comparison between best average performance for each dataset between Bayesian and MC Dropout.
6.5 Proposed approach results compared with state of the art performance.
6.6 Bayesian Recurrent Neural Networks results for Maryland cracklines. Mean of 3 runs.
6.7 Best result of Maryland dataset with BayesLSTM layer, per test set run.
6.8 Best result of Maryland dataset with BayesGRU layer, per test set run.
6.9 Bayesian Recurrent Neural Networks results for Wind turbine. Mean of 3 runs.
6.10 Bayesian Recurrent Neural Networks results for Ottawa bearing dataset. Mean of 3 runs.
6.11 Mean of 3 runs, Torino bearing dataset.

B.1 Bayesian Recurrent Neural Network, FD001.
B.2 Bayesian Recurrent Neural Network, FD002.
B.3 Bayesian Recurrent Neural Network, FD003.
B.4 Bayesian Recurrent Neural Network, FD004.
D.1 Parameter table, FD001.
D.2 Parameter table, FD002.
D.3 Parameter table, FD003.
D.4 Parameter table, FD004.
E.1 All runs of Frequentist Recurrent Neural Network approach.
F.1 Summary of MC Dropout Models, C-MAPSS dataset.
F.2 MC Dropout Recurrent Neural Network, FD001.
F.3 MC Dropout Recurrent Neural Network, FD002.
F.4 MC Dropout Recurrent Neural Network, FD003.
F.5 MC Dropout Recurrent Neural Network, FD004.
G.1 Results of Bayesian Recurrent Neural networks, Maryland cracklines dataset.
H.1 Results Wind turbine: Total results.
I.1 All results for each Bayesian layer. Dataset 1.
I.2 All results for each Bayesian layer. Dataset 2.
I.3 All results for each Bayesian layer. Dataset 3.
J.1 All results for each Bayesian layer.

List of Figures

3.1 Neuron.
3.2 Artificial neural network.
3.3 Graphical comparison between Dense and Convolutional layers. Source: http://cs231n.stanford.edu/
3.4 Left: Folded RNN. Right: Unfolded RNN.
3.5 LSTM cell.
3.6 GRU cell.

4.1 B-RNN construction flowchart.

5.1 Simplified diagram of engine simulated in C-MAPSS. Source: [1].
5.2 Bayesian recurrent layers architecture for the C-MAPSS dataset. In particular, with BayesLSTM layer.
5.3 Bayesian recurrent layers architecture for the Maryland cracklines dataset.
5.4 Bearing test set with the inner race fault detected at the end of the 50 days. Source: [2].
5.5 Bearing test set with the inner race fault detected at the end of the 50 days. The colors show different days of measurements.
5.6 Graphical view of the dimensional reduction by preprocessing.
5.7 Bayesian recurrent layers architecture for the GPM Wind turbine dataset. In particular, with BayesGRU layer.
5.8 Experimental setup for Ottawa bearing dataset.
5.9 Graphical view of the dimensional reduction by preprocessing per file.
5.10 Bayesian recurrent layers architecture for the Ottawa bearing dataset. In particular, with BayesGRU layer.
5.11 Experimental setup for Politecnico di Torino bearing dataset.
5.12 Graphical view of the dimensional reduction by preprocessing per file.
5.13 Bayesian recurrent layers architecture for the Politecnico di Torino bearing dataset. In particular, with BayesLSTM layer.

6.1 Best result for C-MAPSS FD001.
6.2 Best result for C-MAPSS FD002.
6.3 Best result for C-MAPSS FD003.
6.4 Best result for C-MAPSS FD004.
6.5 Example of a loss plot of C-MAPSS training.
6.6 Best result for Maryland cracklines, Bayes LSTM.
6.7 Best result for Maryland cracklines, Bayes GRU.
6.8 Example of a loss plot of Maryland dataset training.
6.9 Best result for Wind Turbine Dataset. BayesLSTM architecture.
6.10 Best result for Wind Turbine Dataset. BayesGRU architecture.
6.11 Example of a loss plot of Wind turbine training.
6.12 Best result for Ottawa bearing dataset. Dataset 1.
6.13 Best result for Ottawa bearing dataset. Dataset 2.
6.14 Best result for Ottawa bearing dataset. Dataset 3.
6.15 Worst result Torino bearing dataset with Bayes LSTM layer.
6.16 Worst result Torino bearing dataset with Bayes GRU layer.
6.17 Example of a loss plot of Torino bearing dataset.

C.1 Best result for C-MAPSS FD001 with Bayesian GRU cell.
C.2 Best result for C-MAPSS FD002 with Bayesian GRU cell.
C.3 Best result for C-MAPSS FD003 with Bayesian LSTM cell.
C.4 Best result for C-MAPSS FD004 with Bayesian GRU cell.
G.1 First complete life cycle. Maryland cracklines, Bayesian LSTM.
G.2 First complete life cycle. Maryland cracklines, Bayesian GRU.
G.3 Second complete life cycle. Maryland cracklines, Bayesian LSTM.
G.4 Second complete life cycle. Maryland cracklines, Bayesian GRU.
G.5 Third complete life cycle. Maryland cracklines, Bayesian LSTM.
G.6 Third complete life cycle. Maryland cracklines, Bayesian GRU.
G.7 Fourth complete life cycle. Maryland cracklines, Bayesian LSTM.
G.8 Fourth complete life cycle. Maryland cracklines, Bayesian GRU.
G.9 Fourth complete life cycle. Maryland cracklines, Bayesian LSTM.
G.10 Fourth complete life cycle. Maryland cracklines, Bayesian GRU.
G.11 Fifth complete life cycle. Maryland cracklines, Bayesian LSTM.
G.12 Fifth complete life cycle. Maryland cracklines, Bayesian GRU.
G.13 Sixth complete life cycle. Maryland cracklines, Bayesian LSTM.
G.14 Sixth complete life cycle. Maryland cracklines, Bayesian GRU.
I.1 Best result for Ottawa bearing dataset. Dataset 1 with Bayesian LSTM layers.
I.2 Best result for Ottawa bearing dataset. Dataset 2 with Bayesian LSTM layers.
I.3 Best result for Ottawa bearing dataset. Dataset 3 with Bayesian LSTM layers.

Chapter 1
Introduction
1.1 Introduction

In general, as new technologies are developed, the complexity of their components grows. This complexity considers the specialization required to perform a task, the interaction with other components and their importance in the organization, among other aspects. In an industrial context this can be worrying, since a failure in one of these components directly affects the expected results and, commonly, due to their specialization, repairs are expensive (in time and cost).

Therefore, it is necessary to know their failure modes, the physics that rules their functioning states, and so on. However, this is a difficult task that requires extensive knowledge and investigation. It is also noticeable that different operating conditions can lead to completely different failures.
Prognostics and health management (PHM) is the specialized area of engineering focused on modelling the lifecycle of components and systems to predict when they will no longer perform as intended. This can be achieved by two main approaches: model-based prognostics and data-driven prognostics. The first uses equations that incorporate physical understanding of the system, which requires many experimental studies and works only under specified conditions; an example of such models is the Paris law, which models crack growth under a fatigue stress regime.

On the other hand, data-driven models take advantage of the advances in sensor technology and data analytics to detect the degradation of engineered systems, diagnose the type of faults, predict the remaining useful lifetime (RUL) and proactively manage failures. Useful tools such as the Internet of Things (IoT), which helps to monitor and control the operation of components through multiple network-connected sensors, can be the source of the data, providing a convenient connection between the data and the application. However, this creates databases so large that human-based monitoring is non-viable. One approach to this problem is to preprocess the databases to decrease their dimensionality and identify the key features in the data. Data-driven models such as machine learning techniques can be used to extract useful information from the data and combine it to gain insights into the system performance [3].
Most recent PHM studies focus on critical machinery components, including bearings [4, 5], gears [6, 7] and batteries [8, 9]. However, these studies are mainly developed with conventional machine learning in shallow configurations, and thus require large amounts of time and expertise to manually design and extract features.
To complete the end-to-end process, it is necessary to use an algorithm that can handle large amounts of big machinery data by learning different features and the interactions between them. Here is where deep learning has emerged as a promising solution, since the layer-by-layer architecture enables the extraction of more abstract and complex information (even features that lack a direct physical meaning) that is useful to perform a certain task.
Typically, there are several types of deep learning models, including Auto-encoders, Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), Deep Belief Networks and Deep Boltzmann Machines. Since these models can be used as components (layers) of a complex model, there are new architectures that mix many of the above, as in Variational Auto-encoders, Generative Adversarial Networks and Long Short-Term Memory (LSTM). In the context of machinery health prognosis, large research efforts have been devoted to exploring their advantages over conventional machine learning techniques [10]. Verstraete et al. developed an unsupervised and a semi-supervised deep learning enabled generative adversarial network-based methodology for fault diagnostics of rolling element bearings [11]. Zhang et al. presented an LSTM RNN model to predict the remaining useful life of lithium-ion batteries [12]. Correa-Jullian et al. [13] demonstrated that deep learning models for solar thermal systems can outperform naïve predictors and other regression models such as Bayesian Ridge, Gaussian Process and Linear Regression. Two excellent review articles on the applications of deep learning to handle big machinery data were developed by Zhao et al. and Jia et al. [14, 15].
It is important to highlight that none of the previous studies addressed the uncertainty in the data and/or in the model. This is because their weights are single point estimates and, since the operations between layers correspond to matrix multiplications, for a given input and trained model the output will always be the same. This may produce overly confident predictions and results, giving poor guidance to the decision-making team, which could lead to safety hazards, lower availability and higher costs. To overcome this problem, Peng et al. [16] proposed a Bayesian deep learning approach to address uncertainty through health prognosis. However, their Bayesian approximation is implemented through Monte Carlo Dropout (MC Dropout), which is not truly Bayesian [17]. Unlike regular dropout, MC Dropout is applied at both training and test time, resulting in randomly ignored parts of the network at testing and, with this, different outputs. This ultimately generates random predictions as samples from a probability distribution, which is treated as variational inference [18]. However, the performance of MC Dropout has been criticized by Osband et al. [19, 20], and counterexamples were presented to demonstrate its limitations in producing satisfactory results.

Nowadays, Bayesian recurrent neural networks have been studied and developed in different ways: [21] using a Markov Chain Monte Carlo variation (PX-MCMC), or [22] using Bayes By Backprop with the reparameterization trick to estimate the uncertainty. Finally, Wen et al. [23] propose the Flipout method, which outperforms the previous methods using Bayesian networks.
In this Master's Thesis, Bayesian Neural Networks are developed with the Flipout estimator to address remaining useful life prediction, classification and health indicator prognosis for different applications with time-dependent features (i.e. sensory data). Because of this, the implementation is focused on RNN cells (LSTM and GRU), due to their known performance with sequential data.
The proposed approach is validated using an open-access turbofan engine dataset [24] that is the most common benchmark for RUL prognosis (C-MAPSS), and is also tested with two health indicator prognosis datasets (the GPM wind turbine and Maryland cracklines datasets) and two classification applications (the Politecnico di Torino rolling bearing data and the Ottawa bearing vibration data).

1.2 Motivation
As summarized above, Bayesian Recurrent Neural Networks (B-RNN) are recurrent neural networks whose weights are defined by samples from a trained distribution rather than by single point estimates.

With this, it is possible to draw a distribution over the output of the network, which can be sampled to obtain the final value (class probabilities for classification; remaining useful life or a health indicator for regression).

These distributions help to generate confidence intervals, which provide an uncertainty assessment that is relevant in the context of Prognostics and Health Management (PHM), because such confidence intervals help to identify the risk involved in maintenance decisions. Single point estimates, on the other hand, lack this quantification, so their prognostics carry no uncertainty in their values. This can be a problem in maintenance tasks: if a mechanical component changes its behavior due to different operational conditions, this is not represented by a point estimate, whereas a wider distribution range signals it.

With the above, it is possible to make robust predictions that carry considerably more useful information than a single value.

1.3 Aim of the work


1.3.1 General objective
To propose Bayesian Recurrent Neural Network models that perform Bayesian variational inference on sequential data within their own framework, including the development of the models and the validation of this scheme with classification and regression tasks for Prognostics and Health Management (PHM) of complex systems.

1.3.2 Specific objectives


The specific objectives are to:
1. Transform the frequentist approach of recurrent neural networks into a Bayesian scheme.
2. Implement the above cells with TensorFlow Probability dense layers as the gates of the "Long Short-Term Memory" and "Gated Recurrent Unit" cells in a Keras-like implementation.
3. Compare the performance on the benchmark dataset (C-MAPSS) with the frequentist (non-Bayesian) approach.
4. Compare the performance on the benchmark dataset with Monte Carlo Dropout as a Bayesian interpretation of a neural network.
5. Compare the proposed approach's performance with the state-of-the-art results for the benchmark dataset.

6. Test other case studies in the PHM context, such as remaining useful life prognosis and health indicator prognosis (GPM wind turbine dataset and Maryland crack lines dataset) and health state identification (Ottawa bearing dataset and Politecnico di Torino bearing dataset).

1.4 Resources available for this Thesis


To develop and test this thesis, a desktop computer was provided by the Smart Reliability and Maintenance Integration (SRMI) Laboratory of the University of Chile, with the following specifications:
• Windows 10 Pro.
• Intel Core i7-8700 CPU @ 3.20 GHz.
• 32 GB RAM.
• NVIDIA GeForce RTX 2080 Ti.
Also, the software and libraries needed, includes:
• Python 3.6 as the programming language.
• TensorFlow 1.13 as deep learning framework.
• TensorFlow Probability 0.6 to combine probabilistic models and deep learning library.
• Numpy, Pandas and SkLearn to sort and preprocess the databases.
• Matplotlib to visualize the database and their results.
• Any dependencies of the above libraries.
Finally, it is important to remark that financial support was provided by CONICYT during the full duration of the Master's studies and thesis through a scholarship.

1.5 Structure of the work


First, in Chapter 2 the methodology of the work is shown. Then, in Chapter 3 the Theoretical Background summarizes all the prior knowledge needed to understand the proposed approach, including a brief explanation of machine learning, the tasks that are commonly solved with this technique, the most used layers in deep learning (i.e. dense, convolutional and recurrent), and Bayesian inference and weight perturbations, which underpin the approach explained in Chapter 4. After that, in Chapter 5 the different study cases are presented, in separate sections for regression and classification; each dataset has a data description, pre-processing and final architecture. Later, in Chapter 6 the results are presented, with a graphical presentation and a discussion of the highlights. Finally, the conclusions of all the experiments summarize the performance and projection of the Bayesian Recurrent Neural Network.

Chapter 2
Methodology
This thesis investigation explores Bayesian Neural Networks with Flipout as the weight perturbation method. To address this research, literature review, development and validation steps are needed, which are explained below:
1. Literature review: This considers the exploration of the state of the art in Bayesian Neural Networks, the different approaches to Bayesian Recurrent Neural Networks and their applications in Prognostics and Health Management (PHM). It also requires elaborated knowledge of neural networks, variational inference, backpropagation, backpropagation through time, Bayes by Backprop and weight perturbations.
2. Development of Bayesian Recurrent Neural Networks: This methodological step consists in the assembly of Bayesian dense layers into recurrent neural network cells, joining the knowledge acquired in the prior methodological step.
3. Dataset selection: Since deep learning models are a data-driven technique, it is necessary to look for datasets that present regression or classification tasks. Also, since this thesis works with recurrent neural networks, the datasets have to be time series. The datasets are divided into simulated data and real data.
• Simulated dataset: First, since this is a new approach, it is necessary to benchmark the results on a common dataset used in PHM. This is achieved with the C-MAPSS dataset, a simulated turbofan engine dataset from NASA that is widely used for RUL estimation.
• Experimental datasets: Since the point of PHM algorithms is to implement them in real applications, several datasets are included: three university datasets, Maryland (crack lines), Ottawa (bearing vibration data) and Politecnico di Torino (rolling bearing data), and finally a real scenario, GPM Systems (wind turbine data).
4. Hyperparameter tuning with the datasets: In this methodological step a proper B-RNN is designed and optimized for each application. This considers a grid search over the hyperparameters of the network in pursuit of the best results.
5. Analysis of the results: After all the above, the proposed model is compared with the current research literature. Since C-MAPSS is the benchmark dataset, the results obtained on it are compared with other kinds of networks (state of the art) and with another type of uncertainty estimator for neural networks (MC Dropout). Then, the results for the other datasets are presented with their respective metrics and plots, which leads to the concluding remarks of this thesis.

Chapter 3
Theoretical Background
The following chapter summarizes the concepts necessary to understand the developed framework. Firstly, machine learning is defined and reviewed, introducing the most common problems that are solved with it. Then, deep learning expands on these definitions, covering neural networks as the vanilla layer and some more specialized ones: a brief summary of convolutional neural networks (since this type of layer is used but is not the focus of this thesis) and recurrent neural networks, with emphasis on two types of cells, the Long Short-Term Memory and the Gated Recurrent Unit. Finally, a section on Bayesian inference and Bayes by Backprop provides the needed Bayesian background.

3.1 Machine Learning


Machine Learning is a subfield of Artificial Intelligence (AI) whose motivation is to generate intelligent programs; this means that the tasks they solve are not directly programmed in the code [25].
Some of the best known applications of Machine Learning are: face recognition (e.g. Facebook [26]), speech recognition (e.g. Alexa, Cortana or Google Home [27]) and self-driving cars (e.g. Tesla [28]).
In simple words, the pipeline of a machine learning algorithm consists of a training stage, where the data is provided to the network, generating "knowledge". Then, after multiple iterations of this learning process, a model is ready to be used for prediction. The prediction stage is when the model is fed with new (unseen) data and retrieves an output based on the prior knowledge.
It is possible to classify the tasks that a machine learning algorithm solves into four families: clustering, anomaly detection, classification and regression. Also, the training of the algorithm can be done in three ways: supervised, unsupervised and reinforcement learning. This thesis considers classification and regression tasks trained with supervised data.

3.1.1 Supervised learning


At training time, a machine learning algorithm is trained using the data and the expected results (labels) to quantify the difference between the prediction made by the computer and the ground truth. Since this metric is representative of how well the algorithm is performing, minimizing this function is the goal of any supervised machine learning algorithm.
Due to the above, every supervised learning algorithm generates a map from the database to a label, and the loss function changes depending on the data and the task to solve.

6
3.1.2 Regression and classification tasks
Classification is a task where the computer must decide to which of n categories some input belongs. This is commonly solved by a function f : \mathbb{R}^k \to \{1, ..., n\} learned by the algorithm [10]. In most cases, the output is interpreted as the probability of belonging to each class, and the function that represents a categorical probability distribution is the softmax:

\sigma(z)_j = \frac{e^{z_j}}{\sum_{i=1}^{n} e^{z_i}} \quad \text{for } j \in \{1, \dots, n\} \tag{3.1}

In the regression task, the above function changes to f : \mathbb{R}^k \to \mathbb{R}, meaning that the function maps all the inputs to a single value; its best known application is to predict or forecast some behavior driven by sequential data [10]. Here, the most used loss function is the root mean squared error (RMSE):

RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2} \tag{3.2}
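As an illustrative sketch (not part of the original thesis pipeline), the two expressions above can be written directly in NumPy; subtracting the maximum inside the softmax is a standard numerical-stability detail:

import numpy as np

def softmax(z):
    # Eq. (3.1); subtracting the max does not change the result but avoids overflow
    e = np.exp(z - np.max(z))
    return e / e.sum()

def rmse(y, y_hat):
    # Eq. (3.2): root mean squared error between targets and predictions
    return np.sqrt(np.mean((y - y_hat) ** 2))

print(softmax(np.array([1.0, 2.0, 3.0])))                    # class probabilities, sum to 1
print(rmse(np.array([10.0, 20.0]), np.array([11.0, 18.0])))  # scalar error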

3.2 Deep Learning


Deep Learning is a specialized type of machine learning that creates internal abstractions of the fed data through non-linear transformations made by "layers". These layers can be artificial neural networks (dense layers), convolutional neural networks or recurrent neural networks. There are also layers for dimensionality reduction (pooling layers) and for regularization (dropout).
The word "deep" is used because the layers can be stacked; with this, it is possible to extract more complex features. Also, the extracted features are the most useful to achieve the goal, since they are optimized by minimizing a loss function created for the particular task. The above eliminates the prior feature engineering common to all statistical methods, allowing complex systems to be modelled and problems to be solved without further physical knowledge [?].

3.2.1 Artificial Neural Networks


An artificial neural network (ANN) is a connection between nodes known as artificial neurons. They are inspired by biological neural networks, since they emulate the chemical connection between biological neurons with the computation of a non-linear function of their inputs. The non-linear function is created by a linear combination and a non-linear activation function. The parameters of the linear combination are called weights and are trained by the network.

7
[Diagram: inputs x_0, ..., x_n are linearly combined, \sum_i w_i x_i, and passed through an activation function to produce the output \hat{y}.]

Figure 3.1: Neuron.

A stack of neurons forms a layer known as a dense layer, because every neuron interacts with all input features; a stack of layers generates a multilayer perceptron (MLP) or artificial neural network (ANN). With this, the number of parameters increases with the multiple connections between layers.

[Diagram: fully connected input layer, hidden layers and output layer.]

Figure 3.2: Artificial neural network.
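As a minimal sketch of the operations described above (the shapes are illustrative assumptions, not taken from the thesis), a dense layer is a matrix-vector product plus a bias, followed by an activation:

import numpy as np

def dense_forward(x, W, b):
    # linear combination followed by a non-linear activation (tanh here)
    return np.tanh(W @ x + b)

x = np.random.randn(4)      # 4 input features
W = np.random.randn(3, 4)   # trainable weights: 3 neurons, each connected to all 4 inputs
b = np.random.randn(3)      # trainable biases, one per neuron
h = dense_forward(x, W, b)  # layer output: one value per neuron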

3.2.2 Convolutional neural networks


It is noticeable that dense neural networks make sense for one-dimensional (or flattened) data. For two-dimensional (or grid) data, such as images, the local relationship between nearby pixels is an important feature, and dense layers cannot learn this kind of link. To deal with this, convolutional neural networks use small kernels of learnable weights that are slid across the data with a mathematical operation called convolution:

s(t) = (x * w)(t) = \sum_{a=-\infty}^{\infty} x(a)\, w(t - a) \tag{3.3}

where s(t) is the output of the kernel w(t) convolved with the input x(t). The above is for one-dimensional data, but it is easily extended to two-dimensional data:

S(i, j) = (K * I)(i, j) = \sum_{m} \sum_{n} I(i + m, j + n)\, K(m, n) \tag{3.4}
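A short sketch of Equation 3.3 with NumPy (the signal and kernel values are arbitrary examples); the same 3-weight kernel is reused at every position, which is the parameter sharing discussed in the list below:

import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])  # input signal x(t)
w = np.array([1.0, 0.0, -1.0])           # learnable kernel w(t)
s = np.convolve(x, w, mode='valid')      # s(t) = (x * w)(t), Eq. (3.3)
# Only 3 trainable parameters cover the whole input, however long it is.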

With all the above, it follows that convolutional neural networks have three main capabilities [10]:
• Sparse interactions: Since the kernels are noticeably smaller than the input, the number of trainable parameters is low in comparison to dense layers. This also implies fewer operations, improving the efficiency of the algorithm.
• Shared weights: These kernels are slid across the input, so the weights are used multiple times in the model. In a dense layer, each weight is used only once, since it is associated with only one feature of the input.
• Equivariance to translation: In the convolution operation, applying a translation and then the convolution gives the same result as applying the convolution and then the translation. This means that the output is a map of where the extracted features appear, no matter where they are.

(a) Dense neural network. (b) Convolutional neural network.

Figure 3.3: Graphical comparison between Dense and Convolutional layers. Source: http://cs231n.stanford.edu/

Figure 3.3 shows the differences between dense layers and convolutional layers. The most important one is that convolutions generate smaller feature maps with more channels as outputs, thanks to the small kernels, whereas dense layers produce one output per neuron.

3.2.3 Recurrent neural networks


The main topic of this thesis is Recurrent Neural Networks (RNN); therefore, it is important to know the most relevant features of this type of neural network.
First of all, just as convolutional neural networks take advantage of their small kernels to extract useful features from grid data, recurrent neural networks are specialized for processing sequential data.
To achieve this, the possibility of sharing weights plays an important role, since it causes a generalization across all the inputs that are operated on with the same weights. This is why RNNs share weights across time, which defines that the function created by the network maps the state at time t to time t + 1 (with the same parameters). Mathematically, RNNs can be described as a dynamical system, defined as follows:

s^{(t)} = f(s^{(t-1)}; \theta) \tag{3.5}

where s^{(t)} is the state of the system and \theta the parameters (weights).
From the above definition, it is possible to observe that time t is defined by the previous time t - 1, so it is possible to "unfold" this recurrent graph, for example with t = 4:

s^{(4)} = f(s^{(3)}; \theta) = f(f(s^{(2)}; \theta); \theta) \tag{3.6}

If we continue unfolding this equation, a non-recurrent expression is generated. This state s^{(t)} is also known as the hidden state h^{(t)}, as it is an internal process of the recurrent network.
Notice that the above expression does not depend on the input, because the dynamical system being modelled depends only on the prior state. Now it is possible to incorporate the input x^{(t)}; doing so, the hidden state depends on the prior hidden state and the input at time t:

h^{(t)} = f(h^{(t-1)}, x^{(t)}; \theta) \tag{3.7}

With this, the state now contains information about the current input and the last hidden state, making it the extracted features of the whole sequence.
This new expression can be represented as a folded and an unfolded directed graph, as follows:

[Diagram: the folded recurrence on the left and its unfolded version across timesteps t - 1, t, t + 1 on the right, with the same parameters applied at every step.]

Figure 3.4: Left: Folded RNN. Right: Unfolded RNN.
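A minimal sketch of the unfolded recurrence in Equation 3.7 (the dimensions are illustrative assumptions): the same parameters U, W and b are applied at every timestep.

import numpy as np

def rnn_forward(xs, h0, U, W, b):
    # Eq. (3.7) applied step by step; U, W, b are shared across time
    h = h0
    states = []
    for x_t in xs:
        h = np.tanh(U @ h + W @ x_t + b)
        states.append(h)
    return states  # the full sequence of hidden states

xs = [np.random.randn(3) for _ in range(10)]  # 10 timesteps, 3 features each
states = rnn_forward(xs, np.zeros(5), np.random.randn(5, 5),
                     np.random.randn(5, 3), np.zeros(5))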

Also, RNNs are flexible in terms of input-output size; since they work timestep by timestep, RNNs can be applied to different input lengths. Further, the output can be a sequence (each of the outputs of a layer) or a vector (the last value of the sequence), and depending on the task, it is possible to feed the whole sequence or part of it.
Because each hidden state is a function of the prior state, the memory present in plain RNNs is just one step in time. Moreover, RNNs suffer from vanishing gradients: when trained with long sequences, the gradients tend to become smaller and smaller, leading to long training runs.
To overcome these problems, the Long Short-Term Memory (LSTM) arrives as a solution. In these models, a long-term memory (called cell c) is introduced alongside the known short-term memory (called state h), and the cell controls the information that has to be deleted and that which is favourable to keep; since it has three hidden transformations, the gradients are less likely to vanish.
To manage the memory in the layer, the three principal transformations are:
• Forget gate f_t: Extracts the long-term information that has expired and deletes it.
• Input gate i_t: Chooses the new information to be added to the long-term memory from the previous hidden state and the current input data.
• Output gate o_t: Finally, the output gate fuses the long-term memory with the selected new data, generating the new hidden state.
Having regard to the above, a diagram of the cell is provided:

[Diagram: LSTM cell combining the previous cell state c_{t-1} and hidden state h_{t-1} with the input x_t through the forget, input and output gates.]

Figure 3.5: LSTM cell.

The equations below describe the operations inside the cell, where \odot denotes the element-wise product:

f_t = \sigma(U_f \cdot h_{t-1} + W_f \cdot x_t + b_f) \tag{3.8}
i_t = \sigma(U_i \cdot h_{t-1} + W_i \cdot x_t + b_i) \tag{3.9}
o_t = \sigma(U_o \cdot h_{t-1} + W_o \cdot x_t + b_o) \tag{3.10}
c_t = f_t \odot c_{t-1} + i_t \odot \tanh(U_c \cdot h_{t-1} + W_c \cdot x_t + b_c) \tag{3.11}
h_t = o_t \odot \tanh(c_t) \tag{3.12}
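A direct transcription of Equations 3.8-3.12 into NumPy, as a sketch (the parameter dictionary p is a hypothetical container for the U, W and b arrays of each gate):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    f = sigmoid(p['Uf'] @ h_prev + p['Wf'] @ x_t + p['bf'])  # forget gate, Eq. (3.8)
    i = sigmoid(p['Ui'] @ h_prev + p['Wi'] @ x_t + p['bi'])  # input gate, Eq. (3.9)
    o = sigmoid(p['Uo'] @ h_prev + p['Wo'] @ x_t + p['bo'])  # output gate, Eq. (3.10)
    c = f * c_prev + i * np.tanh(p['Uc'] @ h_prev + p['Wc'] @ x_t + p['bc'])  # Eq. (3.11)
    h = o * np.tanh(c)                                       # new short-term state, Eq. (3.12)
    return h, c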

However, this number of operations tends to affect the training time. This is why the Gated Recurrent Unit (GRU) studies which gates are the least important, in order to eliminate operations and optimize the learning process.
GRUs are constructed with:
• Reset gate: Drops information that is found to be irrelevant for the future.
• Update gate: Controls the information that passes from the previous hidden state to the current hidden state.
With this, Figure 3.6 shows a GRU cell:

[Diagram: GRU cell combining the previous hidden state h_{t-1} with the input x_t through the update and reset gates to produce the candidate \tilde{h}_t and the new state h_t.]

Figure 3.6: GRU cell.

z_t = \sigma(U_z \cdot h_{t-1} + W_z \cdot x_t + b_z) \tag{3.13}
r_t = \sigma(U_r \cdot h_{t-1} + W_r \cdot x_t + b_r) \tag{3.14}
\tilde{h}_t = \tanh(U_{\tilde{h}} \cdot (r_t \odot h_{t-1}) + W_{\tilde{h}} \cdot x_t + b_{\tilde{h}}) \tag{3.15}
h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \tag{3.16}

As can be seen, the GRU does not have memory units; it just uses gates to modulate the flow of information [29]. The procedure is similar to that of LSTM cells, but the whole hidden state is exposed each time. Another difference is that the GRU controls the information flow at the same time as it computes the new candidate, whereas the LSTM controls it with the output gate.
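For comparison, a sketch of one GRU step following Equations 3.13-3.16 (reusing the sigmoid and the hypothetical parameter dictionary p from the LSTM sketch above); note that there is no separate cell state:

def gru_step(x_t, h_prev, p):
    z = sigmoid(p['Uz'] @ h_prev + p['Wz'] @ x_t + p['bz'])              # update gate, Eq. (3.13)
    r = sigmoid(p['Ur'] @ h_prev + p['Wr'] @ x_t + p['br'])              # reset gate, Eq. (3.14)
    h_tilde = np.tanh(p['Uh'] @ (r * h_prev) + p['Wh'] @ x_t + p['bh'])  # candidate, Eq. (3.15)
    return (1 - z) * h_prev + z * h_tilde                                # new state, Eq. (3.16)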

3.2.4 Training of Neural Networks


The training of neural networks consists in optimizing the parameters \theta (which include weights and biases) to generate a "good" map function \hat{y} = f(x), where x is the input and \hat{y} the trainable map function.
As explained in Section 3.1, machine learning consists of a training phase and a testing phase. In the training phase, the network searches in an optimized way for the best parameters to generate the map function; then, in the testing phase, the weights are frozen to make predictions.
The optimization is done by computing the gradients of each parameter \theta with respect to the loss function L with the backpropagation algorithm, and then using some gradient-descent-based optimization to update each weight.
The backpropagation algorithm simply consists in the calculation of the gradients at the last layer and the propagation of their values across the network down to the first layer, weighting the contribution of each neuron, which forces the network to organize the neurons with different focuses so that their features complement each other.
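As a toy sketch of one gradient-descent update (a single linear neuron under a squared-error loss, with the chain rule written out by hand; the values are arbitrary):

import numpy as np

w, b, lr = np.random.randn(3), 0.0, 0.01  # parameters and learning rate
x, y = np.random.randn(3), 1.5            # one training example

y_hat = w @ x + b              # forward pass
grad_w = 2 * (y_hat - y) * x   # dL/dw via the chain rule
grad_b = 2 * (y_hat - y)       # dL/db
w = w - lr * grad_w            # gradient descent update
b = b - lr * grad_b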

Finally, for the classification task, the commonly used loss function is the cross-entropy, which compares the ground truth with the predicted distribution output (softmax), shown in Equation 3.1:

L_{cross} = -\sum_{c=1}^{C} y_c \log(\hat{y}_c) \tag{3.17}

On the other hand, regression uses the RMSE as shown in Equation 3.2.

3.3 Bayesian inference and Bayes By Backprop


Now that the frequentist machine learning approach has been summarized, it is time to explain the Bayesian concepts.
Firstly, some simple definitions have to be made in the context of this thesis:
• Probability: A measure of how likely an event is to occur.
• Random variable: A named situation that could result in many different events; thus, it has no determined value but a probability distribution over its possibilities.
• Probability distribution: The summary of the outputs of a random variable and their probabilities, which is expressed by a function.
• Probability density (mass) function: A function whose integral (sum) gives the probability that the value of the variable lies within a given interval. Density is used for continuous random variables and mass for discrete ones.
• Expected value E[x]: The integral of the random variable with respect to its probability. In simple words, it is the mean of a large number of samples from the random variable.
• Sampling: The action of drawing a value from a probability distribution.
• Categorical distribution: A probability distribution for discrete random variables. This distribution is used to determine the probabilities of belonging to a certain class; therefore, it is used as the output in classification tasks.
• Normal distribution: A probability distribution for continuous (real-valued) random variables. There are many more probability distributions for continuous random variables, but this is the most common and widely used one. The main characteristics of this distribution are the bell shape of its probability density function and its symmetry.

f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} \tag{3.18}

• Joint distribution: A probability distribution over multiple variables.
• Conditional probability: The probability of a random variable given a particular value of another random variable.

p(X|Y=y) = \frac{p(X, Y=y)}{p(Y=y)} \tag{3.19}

• Chain rule of probability: The conditional distribution applied to the joint distribution.

p(X, Y) = \frac{p(X, Y)}{p(Y)}\, p(Y) = p(X|Y)\, p(Y) \tag{3.20}

• Bayes' theorem: All the above definitions are needed to formulate this. The conditional probability of X given data Y is the posterior probability, and the conditional probability of Y given X is the likelihood, or model evidence of the data fit. Then, p(X) is the belief, also called the prior probability. Finally, p(Y) is the data distribution.

p(X|Y) = \frac{p(Y|X)\, p(X)}{p(Y)} \tag{3.21}

• Likelihood: A measure of how well the parameters in the model summarize the data. The negative log-likelihood is commonly used as a cost function in Bayesian neural networks.
• Inference: The process of computing probability distributions over certain random variables, usually after knowing values for other variables.

Suppose that a model (p(w|D), with D as data and w as weights) is successfully trained and it is then desired to make an inference (y*) for a new datapoint (x*). It is then necessary to look at the joint distribution:

p(y^*|x^*, D) = \int p(y^*|x^*, w)\, p(w|D)\, dw \tag{3.22}

It is noticeable that the integral over the weights is intractable, since it takes an expectation under the posterior distribution over weights, which is equivalent to using an ensemble of an infinite number of neural networks. Then, it is possible to define an approximate distribution q(w|\theta) that has to be as similar as possible to p(w|D). The computation of this approximate distribution is known as variational inference, since \theta are the variational parameters that make the inference process possible.
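In practice, Equation 3.22 is approximated by Monte Carlo: each forward pass samples a fresh set of weights from the approximate posterior, so repeated predictions form an output distribution. A sketch, where model is a hypothetical Bayesian network whose call re-samples its weights on every evaluation:

import numpy as np

samples = np.stack([model(x_star) for _ in range(100)])  # 100 stochastic forward passes
mean = samples.mean(axis=0)  # point prediction
std = samples.std(axis=0)    # spread, usable for confidence intervals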
To evaluate whether those two distributions are similar, the Kullback-Leibler (KL) divergence is computed; the optimization problem is then to find the \theta that minimizes the divergence between both distributions:

\theta^{opt} = \arg\min_{\theta} KL[q(w|\theta) \| p(w|D)] \tag{3.23}

Then, the definition of the KL divergence is:

KL[q(w|\theta) \| p(w|D)] = \int q(w|\theta) \log \frac{q(w|\theta)}{p(w|D)}\, dw \tag{3.24}

Again, another integral over the weights appears, so this is not directly solvable. To deal with the above, we define the expectation with respect to a distribution as:

E_{q(w|\theta)}[f(w)] = \int q(w|\theta)\, f(w)\, dw \tag{3.25}

Also, the conditional distribution p(w|D) can be replaced using Bayes' theorem, leading to:

KL[q(w|\theta) \| p(w|D)] = E_{q(w|\theta)}\left[\log \frac{q(w|\theta)\, p(D)}{p(D|w)\, p(w)}\right] \tag{3.26}

Finally, splitting the logarithms and doing some algebra:

KL[q(w|\theta) \| p(w|D)] = E_{q(w|\theta)}\left[\log \frac{q(w|\theta)}{p(w)} - \log p(D|w)\right] + \log p(D) \tag{3.27}

It is possible to extract \log p(D), since it is constant for the integral, and the first logarithm term is the KL divergence between q(w|\theta) and p(w), leading to:

KL[q(w|\theta) \| p(w|D)] = KL[q(w|\theta) \| p(w)] - E_{q(w|\theta)}[\log p(D|w)] + \log p(D) \tag{3.28}

Back to Equation 3.23: since this is a minimization problem, it is possible to ignore \log p(D), as it is a constant.

\theta^{opt} = \arg\min_{\theta} KL[q(w|\theta) \| p(w|D)] \sim \arg\min_{\theta} KL[q(w|\theta) \| p(w)] - E_{q(w|\theta)}[\log p(D|w)] \sim \arg\min_{\theta} -ELBO(D, \theta) \tag{3.29}

The last expression is known as the negative of the Evidence Lower Bound (ELBO); thus, maximizing the ELBO is the same as minimizing the KL divergence between the distributions of interest.
The ELBO contains a data-dependent part, the likelihood cost, and a prior-dependent part, the complexity cost. As the loss is the difference between these two terms, it represents a trade-off between fitting the data and staying close to the simple prior. This loss is finally computable by unbiased Monte Carlo gradients and backpropagation [30].
First, it is necessary to get the gradient of an expectation. It is possible to define:
 
\frac{\partial}{\partial\theta} \mathbb{E}_{q(w|\theta)}[f(w, \theta)] = \mathbb{E}_{q(\varepsilon)}\left[\frac{\partial f(w, \theta)}{\partial w}\frac{\partial w}{\partial\theta} + \frac{\partial f(w, \theta)}{\partial\theta}\right]    (3.30)
Where ε is a random variable with probability density q(ε) and w = t(θ, ε), where t(θ, ε) is a deterministic function. As a further supposition, q(ε)dε = q(w|θ)dw.
The above deterministic function transforms a sample of parameter-free noise ε and the variational posterior parameters θ into a sample from the variational posterior. With this, it is possible to draw samples with Monte Carlo to evaluate the expectations, enabling a backpropagation-like algorithm.
This backpropagation scheme (Bayes by Backprop) approximates the exact cost as:

-ELBO \approx \sum_{i=1}^{n} \left[\log q(w^{(i)}|\theta) - \log p(w^{(i)}) - \log p(D|w^{(i)})\right]    (3.31)

Where the superindex denotes the i-th Monte Carlo sample drawn from the variational posterior q(w|θ). This estimator depends upon the particular weights drawn from the variational posterior (common random numbers) [31].
However, Bayes by Backprop is a generalization of the Gaussian reparameterization trick used for latent variable models in the Bayesian context (also called variational Bayesian neural nets), which has the problem that all examples in a batch share the same weight perturbation, limiting the variance reduction effect of large batches. Flipout is an efficient method to avoid this problem.

3.3.1 Weight perturbations


The reparameterization trick is a weight perturbation method, i.e., one of the methods that sample the weights of a neural network stochastically at training time.
These methods minimize an expected loss \mathbb{E}_{(x,y)\sim D,\, W\sim q_\theta}[L(f(x, W), y)], where q_\theta can be described as perturbations to the weights, W = \bar{W} + \Delta W, with \bar{W} the mean weights and \Delta W a stochastic perturbation.
In addition to the already addressed variational Bayesian neural nets, the most known methods are the following (a minimal sketch of the Gaussian case follows this list):
• Gaussian perturbations: Given a sample W_{ij} \sim N(\bar{W}_{ij}, \sigma_{ij}^2), using the reparameterization trick [32] this sample can be rewritten as W_{ij} = \bar{W}_{ij} + \sigma_{ij}\varepsilon_{ij}, where \varepsilon \sim N(0, 1), allowing the algorithm to compute the gradients by backpropagation. Another variation is to sample from W_{ij} \sim N(\bar{W}_{ij}, \bar{W}_{ij}^2\sigma_{ij}^2) (this means W_{ij} = \bar{W}_{ij}(1 + \sigma_{ij}\varepsilon_{ij})) to make the information content of the weights invariant to the scale.
• DropConnect: This is a regularization method inspired by dropout which randomly sets a subset of the weights to zero.
• Evolution strategies: These methods work as black-box optimization algorithms and are nowadays proposed as reinforcement learning algorithms. In each iteration the method generates a collection of weight perturbations and evaluates their performance.
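As an illustration, the following is a minimal NumPy sketch of the Gaussian perturbations above; the shapes, standard deviations and variable names are illustrative assumptions, not values used in this thesis.

    import numpy as np

    rng = np.random.default_rng(0)

    # Illustrative shapes: a layer with 3 inputs and 2 outputs.
    w_mean = rng.normal(size=(3, 2))     # mean weights, W_bar
    w_sigma = 0.1 * np.ones((3, 2))      # per-weight standard deviations

    # Reparameterization trick: W = W_bar + sigma * eps, with eps ~ N(0, 1).
    # The noise eps is parameter-free, so gradients can flow through
    # w_mean and w_sigma during backpropagation.
    eps = rng.standard_normal(size=(3, 2))
    w_sample = w_mean + w_sigma * eps

    # Scale-invariant variant: W = W_bar * (1 + sigma * eps).
    w_sample_scaled = w_mean * (1.0 + w_sigma * eps)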
Finally, the weight perturbation method applied in this thesis is the Flipout.

3.3.1.1 Flipout
In Section 3.3, Flipout is introduced as an efficient method to deal with the shared-perturbation-across-the-batch problem, which leads to higher gradient variances.
Flipout relies on two main assumptions about the distribution q_θ: the perturbations of different weights are independent, and each perturbation distribution is symmetric around zero. This allows element-wise multiplication by random signs without modifying the perturbation distribution.
Then, if a sample \Delta\widehat{W} \sim q_\theta is multiplied element-wise by a random sign matrix S independent from \Delta\widehat{W}, then \Delta W = \Delta\widehat{W} \circ S is identically distributed to \Delta\widehat{W}, and their gradients are also identically distributed.
With this, it is possible to draw a base perturbation \Delta\widehat{W} shared across the batch and multiply it by a different sign matrix for each example, generating an unbiased estimator of the loss gradients:

\Delta W_n = \Delta\widehat{W} \circ v_n s_n^{\top}    (3.32)

Where v_n and s_n are random sign vectors whose entries are sampled uniformly from ±1.
Then, the activations of a dense layer are:

y_n = \phi(W^{\top} x_n) = \phi\left((\bar{W} + \Delta\widehat{W} \circ v_n s_n^{\top})^{\top} x_n\right) = \phi\left(\bar{W}^{\top} x_n + (\Delta\widehat{W}^{\top}(x_n \circ s_n)) \circ v_n\right)    (3.33)

This expression is vectorizable and defines the forward pass. Since V and S are sampled independently, it is possible to obtain derivatives with respect to the weights and the input, allowing backpropagation.
Also, Flipout takes advantage of evolution strategies, allowing parallelism on a GPU by updating the weights with the following scheme:

W_{t+1} = W_t + \frac{\alpha}{MN\sigma^2} \sum_{m=1}^{M} \sum_{n=1}^{N} \left\{ F_{mn}\, \Delta\widehat{W}_m \circ v_{mn} s_{mn}^{\top} \right\}    (3.34)

Where N is the number of Flipout perturbations at each worker, M the number of workers, and F_{mn} the reward of the n-th perturbation at worker m. A minimal sketch of the Flipout forward pass is shown below.
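To make Equations 3.32 and 3.33 concrete, the following is a minimal NumPy sketch of the Flipout forward pass for a dense layer; the Gaussian perturbation, shapes and function name are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)

    def flipout_dense(x, w_mean, w_sigma, rng):
        # x: batch of inputs, shape (batch, d_in).
        # w_mean, w_sigma: variational posterior parameters, shape (d_in, d_out).
        batch, d_in = x.shape
        d_out = w_mean.shape[1]
        # One base perturbation shared by the whole batch.
        delta_w = w_sigma * rng.standard_normal(size=(d_in, d_out))
        # Independent random sign vectors per example.
        s = rng.choice([-1.0, 1.0], size=(batch, d_in))
        v = rng.choice([-1.0, 1.0], size=(batch, d_out))
        # y_n = W_bar^T x_n + (Delta_W^T (x_n ∘ s_n)) ∘ v_n  (Equation 3.33)
        return x @ w_mean + ((x * s) @ delta_w) * v

    x = rng.normal(size=(4, 3))
    y = flipout_dense(x, rng.normal(size=(3, 2)), 0.1 * np.ones((3, 2)), rng)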

Chapter 4
Proposed approach of Bayesian Recurrent
Neural Networks
This Chapter presents the proposed Bayesian Recurrent Neural Networks for classification, RUL and HI prognosis, with a diagram of the framework used to construct the B-RNN and the new cell operations.

4.1 Proposed framework


Figure 4.1 shows the construction of a B-RNN model to address RUL prediction, with all its important flows, from the dataset to the mathematical background.

[Flowchart: 1. Raw training data → preprocessed data (cleaning, normalization, window rolling); 2. Bayesian Neural Network for regression (Bayes VRNN / Bayes JaNet architectures, Bayesian Flipout layers, Bayes' theorem, variational inference, ELBO optimization); 3. Trained Bayesian Neural Network with posterior approximation and uncertainty; 4. Test data evaluation.]

Figure 4.1: B-RNN construction flowchart.

The first step to construct a Bayesian RNN is to have proper data to work with; since the approach uses recurrent neural networks, it is necessary to use sequential data. The task is selected depending on the type of dataset: if it contains run-to-failure data, RUL prognosis is feasible; if there are different failures in different files, classification is possible; and it is also possible to generate a health indicator, supposing that none is provided, based on the time in operation or on a statistical feature that grows (or shrinks) monotonically [33].
To enhance the performance of the model, it would be possible to split the dataset randomly into train and test sets, but this could lead to over-optimistic performance, since the development works with sequential data and respecting the time order is essential to implement this kind of model in a real environment. This implies that the training data are the sensory data accumulated over time until the present, and the newly incoming sensor data are the ones used to test the performance.
Then, the dataset must be preprocessed: a cleaning stage, where the out-of-scale values and NaNs are dropped from the data; normalization, to force the model to learn without bias towards the scale of each particular sensor; and window rolling, to shape the dataset correctly. The windowing can be done directly over the sensor data or over features extracted from previously created windows; both methods are used in different applications in this thesis. A sketch of the windowing step is shown below.
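A minimal sketch of the window rolling step, assuming a NumPy array with time on the first axis; the shapes and stride are illustrative.

    import numpy as np

    def rolling_windows(signal, window_length, stride=1):
        # Slice a (time, sensors) array into overlapping windows and
        # return an array of shape (n_windows, window_length, sensors).
        n = (signal.shape[0] - window_length) // stride + 1
        return np.stack([signal[i * stride: i * stride + window_length]
                         for i in range(n)])

    # Hypothetical run-to-failure record: 100 cycles, 14 sensors.
    run = np.random.rand(100, 14)
    windows = rolling_windows(run, window_length=30, stride=1)  # (71, 30, 14)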
Later, the model is created using LSTM or GRU cells as Keras layers, which enables an easy way to build the model. However, since these are custom-made layers which contain dense Flipout layers, a custom-made training function is applied. With this, and the optimization of the ELBO as loss function, an output distribution is trained, from which it is possible to draw any number of samples for a chosen input; these samples constitute the final prediction.

4.2 Cell operations


The proposed approach of Bayesian Recurrent Neural Networks consists in transforming the inner gate calculations of the frequentist RNN cells into dense layers and then converting them into Bayesian layers by applying the Flipout scheme.
The first step is to transform the operations inside the RNN cells into dense layers; for example, the update gate of the GRU layer:

z_t = \sigma(U_z h_{t-1} + W_z x_t + b_z)    (4.1)

Where the multiplications between the weights and the inputs can be replaced by a dense layer for each input (hidden state and new data):

z_t = \sigma(\mathrm{Dense}(h_{t-1}) + \mathrm{Dense}(x_t))    (4.2)

Then, each dense layer is transformed to a Bayesian layer by the application of Flipout,
resulting:

z_t = \sigma\left[\bar{U}_z h_{t-1} + (\Delta\widehat{U}_z (h_{t-1} \circ s_{z,t}^{h})) \circ v_{z,t}^{h} + \bar{W}_z x_t + (\Delta\widehat{W}_z (x_t \circ s_{z,t}^{x})) \circ v_{z,t}^{x}\right]    (4.3)

Where a sub-index z, t indicates that the quantity is for gate z at time t, and a super-index indicates whether it corresponds to the hidden state h or the input x. Also, since an activation function for each inner dense layer is not necessary, they are left with an identity activation function.
It is important to notice that this notation implies the sampling of new sign vectors at each time step.
With the above, it is possible to implement the proposed approach for all the remaining
gates as follows:

Bayes LSTM Operations


Using the same steps as above, LSTM gates are defined by:

f_t = \sigma(\bar{U}_f h_{t-1} + (\Delta\widehat{U}_f (h_{t-1} \circ s_{f,t}^{h})) \circ v_{f,t}^{h} + \bar{W}_f x_t + (\Delta\widehat{W}_f (x_t \circ s_{f,t}^{x})) \circ v_{f,t}^{x})    (4.4)

i_t = \sigma(\bar{U}_i h_{t-1} + (\Delta\widehat{U}_i (h_{t-1} \circ s_{i,t}^{h})) \circ v_{i,t}^{h} + \bar{W}_i x_t + (\Delta\widehat{W}_i (x_t \circ s_{i,t}^{x})) \circ v_{i,t}^{x})    (4.5)

o_t = \sigma(\bar{U}_o h_{t-1} + (\Delta\widehat{U}_o (h_{t-1} \circ s_{o,t}^{h})) \circ v_{o,t}^{h} + \bar{W}_o x_t + (\Delta\widehat{W}_o (x_t \circ s_{o,t}^{x})) \circ v_{o,t}^{x})    (4.6)

c_t = f_t \circ c_{t-1} + i_t \circ \tanh(\bar{U}_c h_{t-1} + (\Delta\widehat{U}_c (h_{t-1} \circ s_{c,t}^{h})) \circ v_{c,t}^{h} + \bar{W}_c x_t + (\Delta\widehat{W}_c (x_t \circ s_{c,t}^{x})) \circ v_{c,t}^{x})    (4.7)

h_t = o_t \circ \tanh(c_t)    (4.8)

Bayes GRU Operations


Using the same steps as above, GRU gates are defined by:

z_t = \sigma(\bar{U}_z h_{t-1} + (\Delta\widehat{U}_z (h_{t-1} \circ s_{z,t}^{h})) \circ v_{z,t}^{h} + \bar{W}_z x_t + (\Delta\widehat{W}_z (x_t \circ s_{z,t}^{x})) \circ v_{z,t}^{x})    (4.9)

r_t = \sigma(\bar{U}_r h_{t-1} + (\Delta\widehat{U}_r (h_{t-1} \circ s_{r,t}^{h})) \circ v_{r,t}^{h} + \bar{W}_r x_t + (\Delta\widehat{W}_r (x_t \circ s_{r,t}^{x})) \circ v_{r,t}^{x})    (4.10)

\tilde{h}_t = \tanh(\bar{U}_{\tilde{h}} (h_{t-1} \circ r_t) + (\Delta\widehat{U}_{\tilde{h}} ((h_{t-1} \circ r_t) \circ s_{\tilde{h},t}^{h})) \circ v_{\tilde{h},t}^{h} + \bar{W}_{\tilde{h}} x_t + (\Delta\widehat{W}_{\tilde{h}} (x_t \circ s_{\tilde{h},t}^{x})) \circ v_{\tilde{h},t}^{x})    (4.11)

h_t = (1 - z_t) \circ h_{t-1} + z_t \circ \tilde{h}_t    (4.12)

In these equations, the notation is:

• \bar{W}, \bar{U}: the mean weight matrices of the distribution q_\theta.
• \Delta\widehat{W}, \Delta\widehat{U}: the stochastic perturbations of the mean weights, sampled from the distribution q_\theta.
• v, s: column vectors from V and S, whose entries are sampled uniformly from ±1.

A minimal sketch of one of these gates is shown below.
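As an illustration of the scheme above, the following sketches only the update gate of a Bayes-GRU using TensorFlow Probability DenseFlipout layers (Equation 4.2 with Flipout applied); the layer sizes are illustrative, and this is a simplified sketch rather than the full custom cell implemented in this thesis.

    import tensorflow as tf
    import tensorflow_probability as tfp

    units, n_features = 64, 14

    # One DenseFlipout per gate input (hidden state and new data),
    # with identity activation, as in Equation 4.2.
    dense_h = tfp.layers.DenseFlipout(units, activation=None)
    dense_x = tfp.layers.DenseFlipout(units, activation=None)

    def update_gate(h_prev, x_t):
        # z_t = sigma(DenseFlipout(h_{t-1}) + DenseFlipout(x_t)); the sign
        # vectors are resampled internally on every call, i.e. at every
        # time step, matching the notation of Equation 4.3.
        return tf.sigmoid(dense_h(h_prev) + dense_x(x_t))

    z_t = update_gate(tf.zeros((8, units)), tf.random.normal((8, n_features)))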

4.3 Training and implementation of Bayesian Neural Net-
works
The training of Bayesian Neural Networks consists in the application of Bayes by Backprop to backpropagate the gradients that minimize the ELBO loss.
This loss is divided into two main parts:
• Kullback-Leibler: Is a regularization term for the overall loss. It is also many orders of magnitude higher than the log likelihood; because of this, it is necessary to multiply it by a regularizer in search of the best trade-off between both losses.
• Log likelihood: Empirically, this loss is the one that governs the performance of the network. So, the balance between the KL and the log likelihood is adequate if the log likelihood is stable and the KL is not growing.
With that, the implementation consists in the application of TensorFlow Probability DenseFlipout layers inside new layers called Bayes-LSTM and Bayes-GRU, which wrap the inner cell calculations. With this, the Bayes by Backprop implementation is straightforward, and the output is a chosen distribution depending on the application (categorical for classification and normal for regression). A minimal sketch of one training step is shown below.
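The following is a minimal sketch of one Bayes by Backprop training step; it assumes a hypothetical `model` whose Bayesian layers register their KL terms in `model.losses` and whose final layer is a DistributionLambda, so that model(x) is a distribution.

    import tensorflow as tf

    optimizer = tf.keras.optimizers.Adam(1e-3)
    kl_regularizer = 1e-6  # trade-off weight for the complexity cost

    @tf.function
    def train_step(model, x, y):
        with tf.GradientTape() as tape:
            y_dist = model(x, training=True)
            nll = -tf.reduce_mean(y_dist.log_prob(y))  # likelihood cost
            kl = tf.add_n(model.losses)                # complexity cost
            loss = nll + kl_regularizer * kl           # negative ELBO
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        return nll, kl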

4.3.1 Regression metrics


In the regression tasks the results are N samples from the final distribution. With these it is possible to compute two different RMSE values: the first is the Mean RMSE (RMSE), which is the mean of the RMSE of each of the N samples; the other is the Global RMSE (RMSE*), which is the RMSE of the mean prediction over the N samples at each point. Since the output is the distribution and not a single sample, the RMSE* is the metric most comparable with the RMSE of the frequentist approaches, with the difference that it is also possible to compute the deviation around it. Both metrics are sketched below.
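Both metrics can be sketched as follows, assuming `preds` is an (N, n_points) NumPy array of samples drawn from the output distribution; the function names are illustrative.

    import numpy as np

    def rmse(pred, truth):
        return np.sqrt(np.mean((pred - truth) ** 2))

    def mean_rmse(preds, truth):
        # RMSE of each of the N samples: Mean RMSE and its STD.
        per_sample = np.array([rmse(p, truth) for p in preds])
        return per_sample.mean(), per_sample.std()

    def global_rmse(preds, truth):
        # RMSE* : RMSE of the mean prediction over the N samples.
        return rmse(preds.mean(axis=0), truth)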

4.3.2 Classification metrics


As in the regression tasks, the output of the model is a sampleable distribution, so the result is N samples from it. These samples are the probabilities of belonging to each class, so their length equals the number of classes in the dataset.
As expected, we can obtain many different confusion matrices from the above samples (due to the incorporation of the uncertainty) and then calculate a probability interval for each cell of the matrices, resulting in a confusion matrix that summarizes the performance of the classifier, as sketched below.
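A minimal sketch of this interval construction, assuming `sampled_probs` is an (N, n_examples, n_classes) array of class probabilities drawn from the output distribution, and using scikit-learn for the per-sample confusion matrices:

    import numpy as np
    from sklearn.metrics import confusion_matrix

    def confusion_interval(sampled_probs, y_true, lower=5, upper=95):
        # One confusion matrix per sampled probability array, then the
        # per-cell percentiles across the N matrices.
        mats = np.array([confusion_matrix(y_true, p.argmax(axis=1))
                         for p in sampled_probs])
        return (np.percentile(mats, lower, axis=0),
                np.median(mats, axis=0),
                np.percentile(mats, upper, axis=0))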

Chapter 5
Study cases
This Chapter describes the datasets on which this thesis is validated, including their setup, origin and most relevant characteristics. Also, these datasets may need some preprocessing prior to being fed into the network, so this step is also explained. Finally, after a grid search for the best hyperparameters, the architecture of each network is fixed and presented.

5.1 Datasets for regression tasks


The regression tasks include the assessment of the remaining useful life (C-MAPSS) and of health indicators (Maryland crack lines and wind turbine data).

5.1.1 NASA: C-MAPSS


5.1.1.1 Data description
C-MAPSS stands for Commercial Modular Aero-Propulsion System Simulator [1]. It is a software tool created by NASA to generate a thermo-dynamical simulation model of a turbofan engine as a function of variations of flow and efficiency. This data is the most widely used in RUL prognosis and was also used in the PHM08 challenge. Since then, many different approaches have been proposed to obtain a prediction of the RUL, and the one that has achieved the most successful results is Deep Learning.

Figure 5.1: Simplified diagram of engine simulated in C-MAPSS. Source: [1].

The dataset consists of 4 different sub-datasets, which are divided by their number of operation conditions and fault modes:

Table 5.1: Definition of the 4 sub-datasets by their operation conditions and fault modes.

                                          Operation conditions
                                          1        6
Fault modes   HPC Degradation             FD001    FD002
              HPC and Fan Degradation     FD003    FD004

Where HPC stands for high-pressure compressor. Each of these datasets has 26 columns, which are:
• Unit number
• Time [cycles]
• 3 operational settings
• 21 sensor measurements
The outputs of the measured system response are:

Table 5.2: C-MAPSS output sensor data.

Sensor tag   Description                       Units
1            Total temperature at fan inlet    °R
2            Total temperature at LPC outlet   °R
3            Total temperature at HPC outlet   °R
4            Total temperature at LPT outlet   °R
5            Pressure at fan inlet             psia
6            Total pressure in bypass-duct     psia
7            Total pressure at HPC outlet      psia
8            Physical fan speed                rpm
9            Physical core speed               rpm
10           Engine pressure ratio             dimensionless
11           Static pressure at HPC outlet     psia
12           Ratio of fuel flow to Ps30        pps/psi
13           Corrected fan speed               rpm
14           Corrected core speed              rpm
15           Bypass ratio                      dimensionless
16           Burner fuel-air ratio             dimensionless
17           Bleed enthalpy                    dimensionless
18           Demanded fan speed                rpm
19           Demanded corrected fan speed      rpm
20           HPT coolant bleed                 lbm/s
21           LPT coolant bleed                 lbm/s

Where LPC stands for low pressure compressor, HPT for high pressure turbine and LPT
for low pressure turbine.

5.1.1.2 Pre processing
First, the dataset is structured as follows:
• Each row corresponds to one cycle.
• Each column represents a sensor or an operational setting.
• A single unit (Unit number column) is simulated and operates for N cycles until failure; then another unit is put into operation and its sensor data are registered.
• There are 7 sensors that do not change with the cycles, which means they do not deliver useful information, so they are ignored.
Since this dataset was used for a challenge, it is already divided into train and test samples, and the RUL values are provided in separate files for train and test.
Therefore, it is not necessary to split the dataset and it is possible to create windows with
1 cycle as slide step without contaminating the train with test data.
Due to data conformation, not all datasets have the same window size. With this, the
created datasets have the following shape:

Table 5.3: Dataset size for C-MAPSS.

Dataset   Window length   Sensors   Train/Test   Number of windows
FD001     30              14        Train        14347
                                    Test         100
FD002     21              14        Train        29526
                                    Test         259
FD003     30              14        Train        17864
                                    Test         100
FD004     18              14        Train        46326
                                    Test         248

5.1.1.3 Architecture
A grid search is used to fix the parameters of the model; the values selected to be tried are summarized in the following table:

Table 5.4: Table of Hyperparameters to test.


Hyperparameter Grid search Hyperparameter Grid search
BayesRNN Neurons [16, 32, 64] Scale [1e0, 5e0, 1e1]
DenseFlipout Layers [1, 2, 3] Softplus [1e-3, 5e-3, 1e-2]
DenseFlipout Neurons [64, 32, 16] LR decay [40, 60, 100]
Epochs [50, 100, 300] End LR [50%, 70%]
Learning Rate [1e-2, 1e-3, 1e-4] KL Regularizer [1e-5, 5e-6, 1e-6]

The final architecture to address the C-MAPSS dataset is:


[Architecture: InputLayer (256, 30, 14) → BayesLSTM (256, 30, 32) → Flatten (256, 960) → DenseFlipout (256, 32) → DenseFlipout (256, 16) → DenseFlipout (256, 2) → DistributionLambda [(256, 1), (256, 1)]]

Figure 5.2: Bayesian recurrent layers architecture for the C-MAPSS dataset. In particular, with BayesLSTM layer.

Notice that the output of the Bayesian RNN is the complete sequence, and the flatten layer prepares the features to be processed by the dense Flipout layers.
Also, a decaying polynomial learning rate is applied to smooth the learning in the last epochs: the learning rate decays over 60 epochs to 70% of the learning rate at epoch 1, as sketched below.
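Such a schedule can be sketched with Keras' PolynomialDecay; the number of steps per epoch is a hypothetical value that depends on the dataset and batch size.

    import tensorflow as tf

    # Decays from 1e-3 to 70% of that value over 60 epochs, expressed in
    # optimizer steps (steps_per_epoch is an assumed, dataset-dependent value).
    steps_per_epoch = 56
    schedule = tf.keras.optimizers.schedules.PolynomialDecay(
        initial_learning_rate=1e-3,
        decay_steps=60 * steps_per_epoch,
        end_learning_rate=0.7e-3)
    optimizer = tf.keras.optimizers.Adam(learning_rate=schedule)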
A summary of the hyperparameters is presented in the Table 5.5:

Table 5.5: Hyperparameters for the C-MAPSS dataset.


Hyperparameter Value Hyperparameter Value
Batch size 256 KL regularizer 1e-6
Epochs 100 KL annealing [epochs] 20
Scale 1e1 Learning rate 0.001
Softplus 1e-3 End learning rate 70%
Learning rate decay [epochs] 60

5.1.2 Maryland: Crack Lines
5.1.2.1 Data description
This dataset comes from a doctoral thesis developed at the University of Maryland [34] and consists in the acquisition of three dissipative energies (strain, heat dissipation and acoustic emission) during a cyclic uniaxial tensile test.
With the above energies it is possible to quantify the damage in specific failure modes through the Degradation-Entropy Generation (DEG) theorem, which is introduced in [35, 36, 37] and reviewed in applications including fatigue, corrosion and wear [38]. Also, the crack length is monitored with images taken by an Edmond 2.5-10x optical microscope with an OptixCam Pinnacle Series CCD digital camera. This, together with the corresponding life in cycles, defines the life to failure at a certain crack length. Crack lengths of 250[µm], 500[µm] and 1000[µm], the transition from region II to III (within linear elastic fracture mechanics [39]) and fracture are the selected failure criteria for each entropic approach's assessment.
Two different materials are selected as testing materials, AA7075-T6 and SS304L. The first one is a high-strength, lightweight aluminum alloy; the other is a structural material commonly used in highly acidic environments. Given the conformation of the dataset, the only material with enough datapoints to work with the B-RNN scheme is the stainless steel.

Table 5.6: Mechanical properties of SS304L.

Mechanical properties
σUTS [MPa]   σy [MPa]   Elongation [%]   Hardness [RB]
613.8        325.7      54.06            85.00

Chemical composition [w%]
C: 0.0243    Cr: 18.06    Cu: 0.3655    Mn: 1.772
Mo: 0.2940   N: 0.0713    Ni: 8.081     P: 0.0300
S: 0.0010    Si: 0.1930

The specimen is selected and designed for fatigue testing under the ASTM 466 guideline [40]. An Instron 8800 system on an MTS 311.11 frame is used for the uniaxial test. The loads are in the range of 16 to 24 [kN], with a 0.1 loading ratio and a 5 [Hz] frequency. To apply a short-term load process, every 1000 cycles the main load is paused for 2 minutes and 500 cycles of excitation are applied at 25 [Hz], with maximum loads of 6 [kN] for the aluminum and 10 [kN] for the stainless steel. The time without load allows the specimen to cool and is the ideal time to capture cleaner images. The acoustic emission sensors are two Physical Acoustics Micro-30s resonant sensors.

5.1.2.2 Pre processing
The dataset consists of 3 signal measurements (position, extension and temperature) and 4 cumulative measurements (2 acoustic emission counts and 2 acoustic emission energies). The proposed features to extract include the crest factor and the mean of the Fourier transform split into frequency bands; these two features are not feasible for the last 4 measurements due to their cumulative behavior, and likewise the peak-to-peak value carries no useful information for them.
Because of the above, there are many ways to work with this dataset; the tested approaches are listed below:

Table 5.7: Tested datasets for Maryland crack lines.

Columns                Raw/Features   Features excluded                Total features per column   Time dimension   Feature dimension
Every measurement      Features       Crest factor, all fft            24                          20               168
Only sensor data       Features       fft of derivative and integral   32                          20               96
Only sensor data       Raw            -                                -                           200              3
Only cumulative data   Features       Crest factor, all fft            24                          20               96
Only cumulative data   Raw            -                                -                           200              4

It is important to notice that the raw windows have a length of 200 to match the composition of the extracted features, which use a length-10 window to perform the extraction and then a length-20 window to train the RNN, creating datasets with the same number of examples.
After testing all these datasets, the best results are obtained with only the sensor data, and the samples are constructed as follows:

Table 5.8: Dataset size for Maryland crack lines.

Feature window length   Features   Window length   Train/Test   Number of windows
10                      96         20              Train        3844
                                                   Test         646

The test set consists of 7 of the 47 total runs, which is almost a 15% split.

5.1.2.3 Architecture
The grid search considers the following hyperparameters:
Table 5.9: Hyperparameters tested in the grid search
Hyperparameter Grid search Hyperparameter Grid search
Conv Layers [1, 2, 3] Conv N of kernels [16, 32, 64]
Conv kernels (only time) [1, 2, 3] LR decay [10, 20, 30]
BayesRNN Neurons [16, 32, 64] End LR [10%, 50%]
DenseFlipout Layers [1, 2, 3, 4] KL Regularizer [1e-5, 5e-6, 1e-6]
DenseFlipout Neurons [64, 32, 16] Epochs [1000, 2000, 3000]
Learning Rate [1e-2, 1e-3, 5e-4]

The final architecture selected is an ensemble of two frequentist 2-dimensional convolutional layers, with kernels that operate only along the sensor dimension, working as a feature extractor for the raw sensor data. Then, a time-distributed layer reshapes the data to be ready for the Bayesian RNN layers, followed by flatten and dense Flipout layers to deal with the regression.

[Architecture: InputLayer (500, 20, 96, 1) → Conv2D (500, 20, 48, 32) → Conv2D (500, 20, 24, 64) → TimeDistributed (500, 20, 1536) → BayesGRU (500, 20, 64) → Flatten (500, 1280) → DenseFlipout (500, 16) → DenseFlipout (500, 2) → DistributionLambda [(500, 1), (500, 1)]]

Figure 5.3: Bayesian recurrent layers architecture for the Maryland cracklines dataset.

Table 5.10: Hyperparameters for the Maryland dataset.


Hyperparameter Value Hyperparameter Value
Batch size 500 KL Regularizer 1e-5
Epochs 3000 KL annealing [epochs] 20
Learning rate 0.0005 End learning rate 10%
Learning rate decay [epochs] 20

5.1.3 GPM Systems: Wind Turbine Data
5.1.3.1 Data description
This data was collected from a commercial 2.2 [MW] wind turbine over 50 days, with one 6-second measurement per day at a sampling frequency of 100 [kHz], for a high-speed shaft bearing of the wind turbine generator, provided by Green Power Monitoring Systems in the USA [2, 41].
The bearing is an SKF 32222 J2 tapered roller bearing with an inner race fault (detected by inspection at the end of the 50 days). The recording sensor is a MEMS accelerometer (Analog Devices ADXL001 with 32 [kHz] bandwidth).

Figure 5.4: Bearing test set with the inner race fault detected at the end of the 50 days.
Source: [2]

It is possible to see the degradation of the wind turbine in the growing amplitude of the acceleration. Figure 5.5 visualizes all 50 days one next to the other, even though there are many hours between each measurement:


Figure 5.5: Bearing test set with the inner race fault detected at the end of the 50 days. The
colors shows different days of measurements.

5.1.3.2 Pre processing
Since the dataset only considers one sensor, our approach is to extract features from the accelerometer signal; the features are presented in Appendix A.
The features are computed for the signal, its numerical derivative and its numerical integral, generating 42 features. To compute the features it is necessary to generate windows: in our approach, 200 timesteps generate 1 feature example, and 50 feature examples form a window for training the network.
Also, this dataset is not previously split into train and test. Therefore, it is not possible to generate the windows with overlap, due to train/test contamination.
With this, for each day of 585936 datapoints, 58 training windows are generated with a length of 50 and a feature length of 200; the remainder of each file is dropped. Finally, a 30% test - 70% train random split is used to divide the samples.
Figure 5.6 illustrates the dimensional reduction produced by this preprocessing:

[Accelerometer raw data (585936 × 1) → Feature data (2929 × 42) → Window data (58 × 50 × 42)]

Figure 5.6: Graphical view of the dimensional reduction by preprocessing.

A summary of the dataset is presented in the Table 5.11:

Table 5.11: GPM turbine dataset shape post processing.

Feature window length   Features   Window length   Train/Test   Number of windows
200                     42         50              Train        2030
                                                   Test         870

On the other hand, the labels are not provided, and the sensor data do not cover the lifespan from the beginning until the failure. Therefore, it is not possible to treat the timeline as a RUL, and it is necessary to create a health indicator (HI). The most common approach is to find some statistical feature that changes monotonically with time and select it as the HI. Since our approach is to reduce human-based decisions, our HI is a decreasing line from 100 to 0, where 100 means a fully operative wind turbine and 0 the last day of operation; this is the most direct approach to an HI and is comparable to a RUL. A minimal sketch of this labeling is shown below.
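A minimal sketch of this labeling, using the 50 days and 58 windows per day described above:

    import numpy as np

    # Linear health indicator: 100 on the first day, 0 on the last (day 50).
    hi = np.linspace(100.0, 0.0, num=50)

    # Each of the 58 windows of a given day inherits that day's HI label,
    # giving 2900 labels (2030 train + 870 test after the random split).
    labels = np.repeat(hi, 58)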

5.1.3.3 Architecture
First, a hyperparameter selection is made through a grid search; the most important tested parameters are listed below:

Table 5.12: Grid search for the hyperparameter selection.


Hyperparameter Grid search Hyperparameter Grid search
Conv Layers [1, 2, 3] Conv N of kernels [16, 32, 64]
Conv kernels (only time) [1, 2, 3] KL Regularizer [1e-5, 1e-6, 1e-7]
BayesRNN Neurons [16, 32, 64] Epochs [1000, 2000, 3000]
DenseFlipout Layers [1, 2, 3, 4] Learning Rate [1e-2, 1e-3, 5e-4]
DenseFlipout Neurons [64, 32, 16]

After a grid search for hyperparameter selection, the selected architecture is as follows in
Figure 5.7:

[Architecture: InputLayer (500, 50, 42, 1) → Conv2D (500, 50, 21, 32) → Conv2D (500, 50, 11, 64) → TimeDistributed (500, 50, 704) → BayesGRU (500, 50, 64) → Flatten (500, 3200) → DenseFlipout (500, 16) → DenseFlipout (500, 2) → DistributionLambda [(500, 1), (500, 1)]]

Figure 5.7: Bayesian recurrent layers architecture for the GPM Wind turbine dataset. In particular, with BayesGRU layer.

It is possible to see that the 2D convolutions have kernel and stride of size 2 along the feature axis and size 1 along the time axis, since the Bayesian recurrent layers are the ones that deal with the temporal structure of the data. The activation function between every layer is a ReLU.
Also, the convolutional layers are only used as feature extractors; this allows the use of a frequentist approach in these layers, mixing them with Bayesian weights. Following this scheme, the time-distributed layer stacks the extracted features from the convolutions while maintaining the time axis.
Other important hyperparameters are:

Table 5.13: Hyperparameters for the GPM Wind turbine dataset.


Hyperparameter Value Hyperparameter Value
Batch size 500 KL regularizer 1e-7
Epochs 3000 KL annealing [epochs] 20
Scale 1 Learning rate 0.001
Softplus 1e-5

5.2 Datasets for classification tasks
The classification tasks include the identification of different fault modes in ball bearings under time-varying speed conditions (Ottawa bearing dataset) and the health state classification of a rolling bearing under different speeds and torques (Politecnico di Torino bearing dataset).

5.2.1 Ottawa: Bearing vibration Data


5.2.1.1 Data description
This is a laboratory experimental dataset collected from bearings with different health conditions under different time-varying speed conditions.
The monitored bearing is an ER16K ball bearing with a pitch diameter of 38.52[mm], a ball diameter of 7.94[mm] and 9 balls. It is sampled at 200[kHz] for 10[s] by an ICP accelerometer, model 623C01, and an incremental encoder (EPC, model 775) is installed to measure the rotational speed.
There are 5 different health conditions and 4 operational settings, which are:

Table 5.14: Conformation of the Ottawa bearing dataset.


Health conditions Operational settings
Healthy Increasing speed
Inner race fault Decreasing speed
Outer race fault Increasing then decreasing speed
Ball defects Decreasing then increasing speed
Mixed faults

The experimental setup is in Figure 5.8:

Figure 5.8: Experimental setup for Ottawa bearing dataset.

Two columns are present in the dataset: the accelerometer data and the rotational speed data measured by the encoder. Also, there are 3 different runs per experiment; therefore, there are 60 files to preprocess.

5.2.1.2 Pre processing


To preprocess this dataset, our approach is to use features extracted from the data, as was done with the GPM dataset. In particular, this dataset has two columns, but only the accelerometer is used.
Also, the classes used are the 5 different health conditions, each gathering all the different rotational speeds, e.g. the inner race defect class includes examples for all rotational speeds.
As each experiment (one class, one condition) is repeated 3 times, it is possible to hold out one file per experiment and train with the other two. With that, we create 3 datasets with this approach, each consisting of a test set with 4 files per class and, therefore, a train set with the remaining 8 files, generating a 67% train and 33% test split.
Following the same kind of preprocessing as in the GPM dataset, it is necessary to select a feature window size and a RNN window size. Testing different configurations, the best behavior is obtained with 50 timesteps to generate features and 50 feature examples to generate a RNN example.

[Accelerometer raw data (2000000 × 1) → Feature data (40000 × 42) → Window data (800 × 50 × 42)]

Figure 5.9: Graphical view of the dimensional reduction by preprocessing per file.

A summary of the dataset is presented in the Table 5.15:

Table 5.15: Ottawa bearing dataset shape post processing.

Feature window length   Features   Window length   Train/Test   Number of windows
50                      42         50              Train        32000
                                                   Test         16000

5.2.1.3 Architecture
A grid search based on the following table is implemented to select the hyperparameters:

Table 5.16: Hyperparameters for grid search.


Hyperparameter Grid search Hyperparameter Grid search
BayesRNN Neurons [16, 32, 64] KL Regularizer [1e-7, 1e-6, 1e-5]
DenseFlipout Layers [1, 2] Epochs [500, 800, 1000]
DenseFlipout Neurons [64, 32, 16] Learning Rate [1e-2, 1e-3, 5e-4]

After a hyperparameter selection the final architecture to address this classification task
is:

[Architecture: InputLayer (800, 50, 42) → BayesGRU (800, 64) → DenseFlipout (800, 64) → DenseFlipout (800, 5)]

Figure 5.10: Bayesian recurrent layers architecture for the Ottawa bearing dataset. In par-
ticular, with BayesGRU layer.

Notice that the regression models directly output a distribution, while the classification models output a vector (a.k.a. logits) whose length is the number of classes (as in the frequentist approach). However, the softmax function is not used, since the output of the model parameterizes a (TensorFlow Probability) categorical distribution, which provides the actual class probabilities, as sketched below.
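A minimal sketch of such a classification head, with illustrative sizes:

    import tensorflow as tf
    import tensorflow_probability as tfp

    n_classes = 5

    # The last DenseFlipout outputs logits that parameterize a Categorical
    # distribution instead of going through a softmax activation.
    head = tf.keras.Sequential([
        tfp.layers.DenseFlipout(n_classes),
        tfp.layers.DistributionLambda(
            lambda logits: tfp.distributions.Categorical(logits=logits)),
    ])

    y_dist = head(tf.random.normal((8, 64)))   # a Categorical distribution
    class_probs = y_dist.probs_parameter()     # per-class probabilities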

Table 5.17: Hyperparameters for Ottawa bearing dataset.


Hyperparameter Value Hyperparameter Value
Batch size 800 KL Regularizer 8e-7
Epochs 800 KL annealing [epochs] 50
Learning rate 0.001

It is important to notice that the final KL regularizer value does not appear among the grid search values because it is fine-tuned after most of the other hyperparameters are fixed.

5.2.2 Politecnico di Torino: Rolling bearing Data
5.2.2.1 Data description
Like the Ottawa bearing dataset, the Politecnico di Torino bearing dataset is a laboratory experimental dataset of different fault modes, but in rolling bearings with an orthogonal load that produces torques on the shaft.
As can be seen in Figure 5.11, the same plate has a center bearing (B2) and two more bearings (B1 and B3) that are used as supports.
The shaft was part of a complete gearbox and carried a spur gear. This gear applies a contact force with radial and tangential components due to the applied torque. This interaction between the spur gear and the shaft is replaced by the load applied by a third, larger roller bearing (B2), which is linked to a precision sledge that moves orthogonally to the shaft and produces the required load by the compression of two parallel springs.
The Figure 5.11 shows the experimental setup:

Figure 5.11: Experimental setup for Politecnico di Torino bearing dataset.

The accelerometers are two triaxial IEPE-type sensors with a 12[kHz] frequency range, positioned on the support of the damaged bearing (A1) and on the support of the loaded bearing (A2). Also, the radial force is measured by a static load cell. The loaded bearing is B2, but the defects are mounted on bearing B1.

The different health states are:

Table 5.18: Different health states at the B1 bearing.


Name Defect Diameter dimension (µm)
0A No defect -
1A Inner ring 450
2A Inner ring 250
3A Inner ring 150
4A Roller 450
5A Roller 250
6A Roller 150

The test consider the following steps:


• Brief run at minimum speed (100[Hz]) without load to check the correct mounting.
• Application of the static load: 1000[N], 1400[N] and finally 1800[N].
• Increment of the speed from 0[Hz] to 500[Hz] with steps of 100[Hz].
As soon as the shaft reaches a steady speed, the acceleration measurements are registered. The acquisition has a sampling frequency of 51.2[kHz] for a duration of 10[s].
The possible configurations of speed and loads are summarized in Table 5.19:

Table 5.19: Summary of nominal loads and speeds of the shaft.


Nominal load [N] Nominal speed [Hz]
0 100 200 300 400 500
1000 100 200 300 400 500
1400 100 200 300 400 500
1800 100 200 300

Due to the limited power of the inverter, it is impossible to reach the higher rotation speeds at the higher load conditions.

5.2.2.2 Pre processing
This dataset has different speeds and torques at the shaft under 7 different failure modes.
Since the goal is to classify the fault, all the files from the same health state are joined as
the same class.
However, unlike the Ottawa bearing dataset, Torino has only one file per torque and speed; therefore, it is not possible to split the dataset into train/test by entirely independent files. Consequently, to maintain the sequential behavior of the data, each file is divided into 70% train (the first values) and the remaining 30% as test.
Then, as there are two tri-axial accelerometers, the parameters (Appendix A) of all measurements are calculated, i.e. 42 features per column and a total of 252 features.
As seen before, to shape the dataset correctly two window sizes are necessary: the first to compute the features and the second to feed the model.

[Accelerometer raw data (512000 × 6) → Feature data (1706 × 252) → Window data (34 × 50 × 252)]

Figure 5.12: Graphical view of the dimensional reduction by preprocessing per file.

A summary of the dataset is presented in the Table 5.20:

Table 5.20: Torino bearing dataset shape post processing.

Feature window length   Features   Window length   Train/Test   Number of windows
300                     252        50              Train        2856
                                                   Test         1190

5.2.2.3 Architecture
As in the other models, the hyperparameters are selected by grid search (and then, man-
ually fine tuned).

Table 5.21: Grid search hyperparameters.


Hyperparameter Grid search Hyperparameter Grid search
BayesRNN Neurons [16, 32, 64] KL Regularizer [1e-7, 1e-6, 1e-5]
DenseFlipout Layers [1, 2] Epochs [300, 500, 800]
DenseFlipout Neurons [64, 32, 16] Learning Rate [1e-3, 5e-3, 8e-3]
End LR [10%, 50%] LR decay [40, 100]

Then, the final architecture is as follows:

[Architecture: InputLayer (800, 50, 252) → BayesLSTM (800, 64) → DenseFlipout (800, 64) → DenseFlipout (800, 7)]

Figure 5.13: Bayesian recurrent layers architecture for the Politecnico di Torino bearing
dataset. In particular, with BayesLSTM layer.

To smooth the loss curve and prevent overfitting, a polynomial learning rate decay is implemented; its parameters are included in the table below.
The most important hyperparameters are listed in the following Table 5.22:

Table 5.22: Hyperparameters for the Politecnico di Torino bearing dataset.


Hyperparameter Value Hyperparameter Value
Batch size 800 KL Regularizer 5e-7
Epochs 350 KL annealing [epochs] 50
Learning rate 0.008 End learning rate 10%
Learning rate decay [epochs] 40

Chapter 6
Results and discussion
The following sections summarize the results obtained for each case study.
Also, the evaluation of the performance and the comparison with the state of the art, the frequentist approach and MC Dropout are addressed for C-MAPSS.
The structure follows the prior chapter: first the regression tasks and then the classification tasks. To validate our approach, C-MAPSS is the first dataset to be analyzed, followed by the Maryland crack lines and the GPM wind turbine. Finally, the Ottawa and Torino bearing datasets show another application of our development.

6.1 Regression tasks


6.1.1 NASA: C-MAPSS
First, a summary of the results for our approach is presented as the mean performance over 3 runs. As clarified in Section 4.3, the Mean RMSE is computed from a large number of single draws of the output distribution, as the mean of the RMSE of each draw, with its respective standard deviation; the Global RMSE (RMSE*) is computed from the mean of those draws at each datapoint, i.e. the most likely prediction for each example. The latter is the comparable metric, since it shows the overall performance of the regressor.

Table 6.1: Bayesian Recurrent Neural Networks results for the C-MAPSS dataset. Mean of 3 runs.

Dataset      FD001                FD002                FD003                FD004
Model        BayesLSTM  BayesGRU  BayesLSTM  BayesGRU  BayesLSTM  BayesGRU  BayesLSTM  BayesGRU
Mean RMSE    17.73      17.82     23.97      24.07     17.36      16.33     27.28      27.74
STD          0.92       1.05      1.01       1.03      1.00       0.93      1.07       1.13
RMSE*        14.00      14.08     17.43      17.58     13.70      12.31     21.00      21.42

Also, the best result for each of the C-MAPSS sub-datasets is presented. In these figures, the comparison between each example point and its RUL is shown in the top plot, and then the output distributions of 3 different (equispaced) examples are shown as histograms. The table with all the results is in Appendix B, and plots for the other best runs are summarized in Appendix C.

[BayesLSTM - C-MAPSS FD001. RMSE of the RUL prediction: 13.27. (a) RUL prediction with mean predicted value, 2 probability intervals and the ground truth. (b)-(d) Histograms of three equispaced examples, ground truth / distribution mean: 114.00 / 95.46, 90.00 / 111.25, 38.00 / 32.20.]

Figure 6.1: Best result for C-MAPSS FD001.

[BayesLSTM - C-MAPSS FD002. RMSE of the RUL prediction: 16.78. (a) RUL prediction with mean predicted value, 2 probability intervals and the ground truth. (b)-(d) Histograms of three equispaced examples, ground truth / distribution mean: 122.00 / 123.26, 82.00 / 86.03, 37.00 / 38.13.]

Figure 6.2: Best result for C-MAPSS FD002.

[BayesGRU - C-MAPSS FD003. RMSE of the RUL prediction: 11.56. (a) RUL prediction with mean predicted value, 2 probability intervals and the ground truth. (b)-(d) Histograms of three equispaced examples, ground truth / distribution mean: 115.00 / 114.61, 79.00 / 91.83, 45.00 / 48.36.]

Figure 6.3: Best result for C-MAPSS FD003.

[BayesLSTM - C-MAPSS FD004. RMSE of the RUL prediction: 20.30. (a) RUL prediction with mean predicted value, 2 probability intervals and the ground truth. (b)-(d) Histograms of three equispaced examples, ground truth / distribution mean: 125.00 / 115.20, 88.00 / 105.00, 37.00 / 47.08.]

Figure 6.4: Best result for C-MAPSS FD004.

Finally, a sample of the loss plot from a training run of these architectures is shown.

[Loss plots: log likelihood and Kullback-Leibler over 100 epochs, train and test curves.]

Figure 6.5: Example of a loss plot of C-MAPSS training.

The results show good predictions; although the plots seem noisy, the behavior is congruent with the training samples and the overall RMSE is satisfactory. Also, the loss plot shows good performance, with decreasing values for both the log likelihood and the KL, and without appreciable overfit. It is important to notice the difference in magnitude between these two values, but as seen in Section 3.3, the KL acts only as a regularization term and the log likelihood is the main loss.
Then, the same architecture (and its hyperparameters) is trained with the frequentist approach. Training the frequentist model for the same number of epochs produces overfitted results, which leads to a poor comparison between these two models, so the number of epochs is reduced to 20 to obtain unbiased results.
As expected, the frequentist models have nearly half the parameters of the Bayesian approach; this is due to the definition of the Bayesian layers, which comprise the mean weights and their perturbations. A breakdown of the number of weights per layer is in Appendix D.

Table 6.2: Number of parameters in the Bayesian and frequentist approaches.

          LSTM                      GRU
          Bayesian   Frequentist    Bayesian   Frequentist
FD001     74610      37313          71602      35809
FD002     56178      28097          53170      26593
FD003     74610      37313          71602      35809
FD004     50034      25025          47026      23521

The following Table 6.3 contains the mean results for 5 runs of the frequentist approach
and the comparable result for the Bayesian approach:

Table 6.3: Comparison between the average RMSE of the frequentist approach and the Bayesian Recurrent Neural Networks.

          LSTM                                   GRU
Dataset   Frequentist RMSE   Bayesian RMSE*      Frequentist RMSE   Bayesian RMSE*
FD001     15.26              14.00               14.12              14.08
FD002     18.08              17.43               17.97              17.58
FD003     14.35              13.70               13.60              12.31
FD004     21.47              21.00               21.67              21.42

Since the Bayesian method has nearly double the parameters of the frequentist one, the training phase of the Bayesian model takes longer. However, this longer training yields a probability distribution as output, capturing the uncertainty and obtaining better results. It is therefore possible to say that the first stages of the construction of a model could be tested with the frequentist approach, and the model could then be migrated to a Bayesian one to obtain the benefits listed above.
Then, the proposed approach is compared to MC Dropout as a Bayesian approximation. This method consists in the application of dropout layers between dense layers and, unlike the normal application of dropout, keeps the random disconnection of neurons active in the evaluation process too, resulting in different outputs for a fixed input [18, 32, 42]. This methodology to address uncertainty has been questioned due to its results in simple tasks [43], its overconfident results in classification [44], and the claim that it directly does not quantify uncertainty [19]. However, it is still a widely used method to deal with uncertainty; a minimal sketch of its inference scheme is shown below.
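A minimal sketch of MC Dropout inference, assuming a hypothetical Keras `model` that contains Dropout layers:

    import numpy as np

    # Calling the model with training=True keeps the random neuron
    # disconnection active at evaluation time, so repeated calls on a fixed
    # input give different outputs whose spread is read as uncertainty.
    def mc_dropout_predict(model, x, n_samples=100):
        samples = np.stack([model(x, training=True).numpy()
                            for _ in range(n_samples)])
        return samples.mean(axis=0), samples.std(axis=0)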
As it is another Bayesian approximation, it is possible to compute the same metrics as in the proposed approach. The following Table 6.4 compares the best model for each dataset as a mean of 3 runs:

Table 6.4: Comparison of the best average performance for each dataset between the Bayesian approach and MC Dropout.

Dataset     FD001                       FD002                        FD003                       FD004
Model       BayesLSTM   MC Drop. GRU    BayesLSTM   MC Drop. LSTM    BayesGRU   MC Drop. GRU     BayesLSTM   MC Drop. LSTM
Mean RMSE   17.73       15.67           23.97       18.84            16.33      15.63            27.28       21.88
STD         0.92        0.68            1.01        0.33             0.93       0.65             1.07        0.27
RMSE*       14.00       14.25           17.43       18.24            12.31      14.52            21.00       21.53

It is noticeable that MC Dropout has a better Mean RMSE than the proposed approach, but in the global performance (RMSE*) the Bayesian method implemented here is better, leading to better overall results.
It is also possible to compare the best average performance of the proposed approach for each dataset with some of the state-of-the-art results:

Table 6.5: Proposed approach results compared with state-of-the-art performance.

Model    MODBNE [12]   DCNN [45]   DLSTM [46]   AdaBN-CNN [47]   Proposed approach
Data     RMSE          RMSE        RMSE         RMSE             RMSE*
FD001    15.40         12.61       16.14        13.17            14.00
FD002    25.05         22.36       24.49        20.87            17.43
FD003    12.51         12.64       16.18        14.97            12.31
FD004    28.66         23.31       28.17        24.57            21.00

The models compared to our results are different types of architectures, including convolutional, recurrent, deep belief and adaptive models. DCNN reports a standard deviation, but it is over 10 independent runs, so it does not represent uncertainty. It is important to notice that none of the state-of-the-art models addresses the uncertainty, which gives a substantial advantage to the proposed approach. Also, the Bayesian RNN outperforms the state-of-the-art models considerably on the hardest datasets (FD002 and FD004), while the other two datasets show results similar to the other approaches.

6.1.2 Maryland: Crack Lines


As explained in Section 5.1.2, this dataset has several runs that correspond to different fatigue test samples. To visualize their behavior correctly, the regression plot stacks all the runs; the individual plots are presented in Appendix G.

Table 6.6: Bayesian Recurrent Neural Networks results for Maryland cracklines. Mean of 3
runs.
Bayes LSTM Bayes GRU
Mean RMSE 24.75 25.07
STD 0.59 0.57
Global RMSE 19.54 20.26

Table 6.7: Best result of Maryland dataset with BayesLSTM layer, per test set run.
Test Test Test Test Test Test Test
run 1 run 2 run 3 run 4 run 5 run 6 run 7
Mean RMSE 17.97 18.29 16.53 16.96 29.49 23.95 28.45
STD 1.67 1.96 1.50 1.20 1.70 1.34 1.13
RMSE* 14.16 13.05 10.80 11.76 26.15 17.98 22.18

Table 6.8: Best result of Maryland dataset with BayesGRU layer, per test set run.
Test Test Test Test Test Test Test
run 1 run 2 run 3 run 4 run 5 run 6 run 7
Mean RMSE 19.57 19.42 16.99 17.31 29.90 23.76 27.01
STD 1.86 1.87 1.89 1.19 1.30 1.63 1.06
RMSE* 15.53 14.38 10.65 11.55 27.01 16.97 20.68

[BayesLSTM - Maryland Cracks. RMSE: 19.33. (a) HI prediction with mean predicted value, 2 probability intervals and the ground truth. (b)-(d) Histograms of three examples, ground truth / distribution mean: 50.49 / 44.84, 71.98 / 68.86, 69.55 / 59.80.]

Figure 6.6: Best result for Maryland cracklines, Bayes LSTM.

[BayesGRU - Maryland Cracks. RMSE: 18.79. (a) HI prediction with mean predicted value, 2 probability intervals and the ground truth. (b)-(d) Histograms of three examples, ground truth / distribution mean: 50.49 / 46.17, 71.98 / 66.43, 69.55 / 59.24.]

Figure 6.7: Best result for Maryland cracklines, Bayes GRU.

[Loss plots: log likelihood and Kullback-Leibler over 3000 epochs, train and test curves.]

Figure 6.8: Example of a loss plot of Maryland dataset training.

It is noticeable that the runs with shorter lives are the ones with better results. This is probably due to the fact that the proposed health indicator decays from the beginning of the test, which is not always true and affects the longest runs.
However, the results are good enough to capture the behavior of the health indicator, addressing the task successfully.
The training with this dataset is stopped when the log likelihood stabilizes and the KL is not growing (this considers stabilizing or decreasing behavior).
Future work with this dataset could consider the identification of the starting point of damage (which is critical for the last two experiments), but within the scope of this Master's thesis, the results prove the functionality of the Bayesian RNN architectures successfully.

6.1.3 GPM Systems: Wind Turbine Data


As the GPM system is a HI regression task, the results are visualized in the same scheme as the C-MAPSS results.
The mean results are presented in Table 6.9, and all the runs are reported in Appendix H.
This dataset differs from the C-MAPSS and Maryland crack datasets in the conformation of the acquired data: the prior two datasets have multiple training records from the start to the end of the lifespan, but this dataset has only one lifecycle.
Also, the short measurements (6 seconds per day over 50 days) make it difficult to construct a health indicator that represents the continuous wear, and force a random train/test split of the dataset.
With this, even if the results present a good behavior, this model is not tested on other runs, so it is impossible to predict its results in a real application (deployability).
However, as with the Maryland crack lines dataset, this dataset confirms the suitability of the Bayesian RNNs as good representations of time-driven wear.

Table 6.9: Bayesian Recurrent Neural Networks results for Wind turbine. Mean of 3 runs.
Bayes LSTM Bayes GRU
Mean RMSE 13.40 13.71
STD 0.31 0.30
RMSE* 8.15 8.66

[BayesLSTM - GPM Wind Turbine. RMSE: 7.98. (a) HI prediction with mean predicted value, 2 probability intervals and the ground truth. (b)-(d) Histograms of three examples, ground truth / distribution mean: 76.61 / 77.73, 50.50 / 55.42, 26.56 / 26.19.]

Figure 6.9: Best result for Wind Turbine Dataset. BayesLSTM architecture.

[BayesGRU - GPM Wind Turbine. RMSE: 8.51. (a) HI prediction with mean predicted value, 2 probability intervals and the ground truth. (b)-(d) Histograms of three examples, ground truth / distribution mean: 76.61 / 83.36, 50.50 / 32.96, 26.56 / 27.46.]

Figure 6.10: Best result for Wind Turbine Dataset. BayesGRU architecture.

Finally, a sample of the loss plot from a training run of these architectures is shown.

[Loss plots: log likelihood and Kullback-Leibler over 3000 epochs, train and test curves.]

Figure 6.11: Example of a loss plot of Wind turbine training.

6.2 Classification tasks
6.2.1 Ottawa: Bearing vibration Data
The results for this dataset are presented in Table 6.10 as the summary of 3 runs per dataset split.
Non-normalized and normalized confusion matrices are presented for the best result on this dataset for both Bayesian layers, with a 90% probability interval.

Table 6.10: Bayesian Recurrent Neural Networks results for the Ottawa bearing dataset. Mean of 3 runs.

          Dataset 1                 Dataset 2                 Dataset 3
          BayesLSTM   BayesGRU      BayesLSTM   BayesGRU      BayesLSTM   BayesGRU
Top 5%    93.68%      95.40%        98.97%      99.23%        97.51%      97.55%
50%       93.48%      95.21%        98.82%      99.09%        97.33%      97.37%
95%       93.26%      95.00%        98.67%      98.93%        97.13%      97.20%

[(a) Confusion Matrix, (b) Normalized Confusion Matrix. 90% probability-interval confusion matrices over the 5 classes; median per-class accuracy: Healthy 97.4%, Inner fault 96.5%, Outer fault 98.9%, Mixed fault 99.2%, Ball fault 98.7%.]

Figure 6.12: Best result for Ottawa bearing dataset. Dataset 1.

[(a) Confusion Matrix, (b) Normalized Confusion Matrix. 90% probability-interval confusion matrices over the 5 classes; median per-class accuracy: Healthy 98.7%, Inner fault 99.7%, Outer fault 99.1%, Mixed fault 98.7%, Ball fault 99.5%.]

Figure 6.13: Best result for Ottawa bearing dataset. Dataset 2.

[Confusion matrices omitted: (a) counts and (b) normalized percentages; each cell reports a 90% probability interval. Mean diagonal accuracies: Healthy 93.6%, Inner fault 99.5%, Outer fault 99.5%, Mixed fault 99.5%, Ball fault 96.1%.]

Figure 6.14: Best result for Ottawa bearing dataset. Dataset 3.

All of the best confusion matrices correspond to Bayesian GRU cells. The tables with the full runs per dataset can be found in Appendix I.
The classification task on this dataset shows very good results without any biased class (i.e., a class with considerably lower accuracy than the others). This is important here because the mixed fault class combines several damage types (inner, outer, and ball faults) and could therefore be a problematic class for the classifier; nevertheless, the outcomes are robust. Even so, the best result for Dataset 1 with the Bayesian LSTM cell does exhibit this kind of confusion between the inner and mixed fault classes (Figure I.1).
As Table 6.10 shows, the worst results are obtained on the first split of the dataset (about 5% lower accuracy for the Bayesian LSTM and 4% for the Bayesian GRU). Even so, the accuracy remains above 90%, which corresponds to a low misclassification rate. Moreover, the splits of the dataset are file-driven, so train-test contamination is negligible. This confirms that the model extends to newly acquired data, which is a key concern with recurrent neural networks. A sketch of how the percentile accuracies reported here can be obtained is given below.
The confusion matrices for the Bayesian LSTM models are also presented at the end of Appendix I.
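Since each forward pass through the Bayesian network samples new weights, the reported Top 5% / 50% / 95% accuracies can be read as percentiles of a sampled accuracy distribution. The following is a minimal sketch of this computation (an assumption about the exact procedure, not the thesis code); sample_fn is a hypothetical callable that performs one stochastic forward pass and returns predicted class labels.

import numpy as np

def accuracy_percentiles(sample_fn, x_test, y_test, t=100):
    # One accuracy value per stochastic forward pass; the weights are
    # resampled inside sample_fn on every call.
    accs = np.array([np.mean(sample_fn(x_test) == y_test) for _ in range(t)])
    # Optimistic bound, median, and conservative bound of the sampled
    # accuracy distribution (a 90% interval plus its median).
    top5, median, bottom5 = np.percentile(accs, [95, 50, 5])
    return top5, median, bottom5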

6.2.2 Politecnico di Torino: Rolling bearing Data


Unlike the previous dataset, this one is not file-driven by construction; even so, it can be split by time, as explained in Section 5.2.2.2. With this, train-test contamination is negligible.
This is particularly important here given the outstanding results achieved by the proposed approach. Table 6.11 presents the mean results of 3 runs, and the worst confusion matrix result is shown for each Bayesian layer. The table with all the results can be found in Appendix J.

Table 6.11: Mean of 3 runs, Torino bearing dataset.


BayesLSTM BayesGRU
Top 5% 100.00% 100.00%
50% 100.00% 99.97%
95% 99.68% 99.79%

[Confusion matrices omitted: (a) counts and (b) normalized percentages; each cell reports a 90% probability interval. All seven classes (C0A-C6A) reach a 100.0% mean diagonal accuracy, with off-diagonal counts of at most one sample.]

Figure 6.15: Worst result Torino bearing dataset with Bayes LSTM layer.

[Confusion matrices omitted: (a) counts and (b) normalized percentages; each cell reports a 90% probability interval. Mean diagonal accuracies: C0A 100.0%, C1A 99.4%, and C2A-C6A 100.0%.]

Figure 6.16: Worst result Torino bearing dataset with Bayes GRU layer.

[Loss plots omitted: log-likelihood (left) and Kullback-Leibler (right) loss versus epoch, with train and test curves.]

Figure 6.17: Example of a loss plot of Torino bearing dataset.

As can be seen, the results obtained on this dataset are extremely good. One might suspect overfitting, but given the train-test split and the behavior of the loss plot, this possibility is discarded.
With this, the health state is correctly identified regardless of the amount of torque applied to the shaft.

Chapter 7
Concluding remarks and Future work
7.1 Concluding remarks
This thesis successfully addresses the construction of Bayesian Recurrent Neural Networks and demonstrates their performance on regression tasks (RUL and HI prognosis) and classification tasks (health state classification) for complex mechanical systems based on their sensory data.
The proposed Bayesian RNN method is implemented for LSTM and GRU layers, converting their inner gates into Bayesian dense flipout layers. This produces easy-to-use, Keras-like layers that can be attached to any other Keras layer with no effort. However, the use of these layers changes the training objective from a loss between the expected and predicted results (i.e., RMSE or cross-entropy in regression and classification tasks, respectively) to a loss composed of a likelihood term (data-dependent) and a complexity term (prior-dependent), since every layer uses distributions as weights. This allows the network to output a distribution that embodies the uncertainty, which is a major advantage over frequentist approaches.
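As an illustration of this training setup, the following is a minimal sketch assuming TensorFlow Probability; it is not the thesis implementation (which replaces the recurrent gates themselves with flipout layers), and the layer sizes, n_train, and the Normal output head are illustrative choices.

import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions

def build_model(timesteps=30, features=14, n_train=10000):
    # Scale each layer's KL term by the training-set size so that, summed
    # over mini-batches, the total loss approximates the negative ELBO.
    kl_fn = lambda q, p, _: tfd.kl_divergence(q, p) / n_train

    inputs = tf.keras.Input(shape=(timesteps, features))
    # Plain recurrent encoder, kept frequentist here for brevity.
    x = tf.keras.layers.LSTM(32, return_sequences=True)(inputs)
    x = tf.keras.layers.Flatten()(x)
    x = tfp.layers.DenseFlipout(32, activation="relu",
                                kernel_divergence_fn=kl_fn)(x)
    x = tfp.layers.DenseFlipout(16, activation="relu",
                                kernel_divergence_fn=kl_fn)(x)
    # Two outputs parameterize a Normal predictive distribution.
    params = tfp.layers.DenseFlipout(2, kernel_divergence_fn=kl_fn)(x)
    outputs = tfp.layers.DistributionLambda(
        lambda t: tfd.Normal(loc=t[..., :1],
                             scale=1e-3 + tf.math.softplus(t[..., 1:])))(params)
    model = tf.keras.Model(inputs, outputs)
    # Negative log-likelihood; the KL (complexity) terms are contributed
    # automatically by the flipout layers through Keras' layer-loss mechanism.
    model.compile(optimizer="adam", loss=lambda y, rv_y: -rv_y.log_prob(y))
    return model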
This is validated on the C-MAPSS dataset, where the proposed approach outperforms models such as MODBNE, DCNN, DLSTM, and AdaBN-DCNN, which are among the most complex models with state-of-the-art performance. The result is also compared against the frequentist version of the same architecture, showing that the Bayesian approach not only is capable of addressing the uncertainty but also obtains better results, at the cost of nearly double the trainable parameters and a longer training step. On the same dataset, MC Dropout is implemented as another approach to address the uncertainty; again, the proposed approach shows better performance.
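For comparison, MC Dropout keeps dropout active at test time and treats repeated forward passes as posterior samples; the following is a minimal sketch, assuming a Keras model that contains Dropout layers (not the thesis code).

import numpy as np

def mc_dropout_predict(model, x, t=100):
    # training=True keeps the dropout masks stochastic at inference, so
    # every call is one sample from the approximate predictive distribution.
    samples = np.stack([model(x, training=True).numpy() for _ in range(t)])
    return samples.mean(axis=0), samples.std(axis=0)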
Two other regression tasks and two classification tasks are implemented with good results, showing that Bayesian recurrent neural networks can handle sequential data to solve some of the common PHM problems. The implementations on different datasets also demonstrate the extendability of the constructed Bayesian layers, allowing them to be used in different areas, even outside the PHM context, wherever the quantification of uncertainty is an important problem.

7.2 Future work
The proposed Bayesian Recurrent Neural Networks are presented as a functional alternative to frequentist RNNs for sequential data, with the added benefit of quantifying uncertainty. However, the validations only use a Gaussian distribution as the prior; this may not be suitable for every application, although it has shown good performance in the presented case studies. Future work could address other kinds of problems, such as counting processes (which can model renewal), or simply try other priors on sensory data to evaluate the difference in performance; a sketch of how a prior can be swapped is given below.
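As an illustration, and assuming TensorFlow Probability's flipout layers, the prior can be swapped through the kernel_prior_fn argument; the Laplace prior and the one-sample Monte Carlo KL estimate below are illustrative assumptions rather than the thesis implementation.

import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions

def laplace_prior(dtype, shape, name, trainable, add_variable_fn):
    # Non-trainable, heavier-tailed prior over every weight in the kernel.
    dist = tfd.Laplace(loc=tf.zeros(shape, dtype), scale=tf.ones(shape, dtype))
    return tfd.Independent(dist, reinterpreted_batch_ndims=len(shape))

def mc_kl(q, p, _):
    # KL(Normal posterior || Laplace prior) has no closed form, so use a
    # one-sample Monte Carlo estimate of the divergence.
    w = q.sample()
    return tf.reduce_sum(q.log_prob(w) - p.log_prob(w))

layer = tfp.layers.DenseFlipout(32, kernel_prior_fn=laplace_prior,
                                kernel_divergence_fn=mc_kl)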
Also, since the construction of the layers is not necessarily tied to the PHM context and only requires sequential data, another line of future work is to apply the proposed framework in other contexts, such as general sequence modeling.
In recent years, attention models [48, 49, 50] have appeared as a new type of sequential model, with interesting applications in natural language processing, which deals with a sequential type of data. However, these models use a large number of parameters (e.g., Facebook's Blender [50] has 90 million parameters in its smallest version and 9.4 billion in its largest). Since the proposed approach nearly doubles the parameters of a network, the development of Bayesian attention models could be an interesting challenge.

Bibliography
[1] A. Saxena, K. Goebel, D. Simon, and N. Eklund, “Damage propagation modeling for air-
craft engine run-to-failure simulation,” in 2008 International Conference on Prognostics
and Health Management, 2008, pp. 1–9.
[2] L. Saidi, J. B. Ali, E. Bechhoefer, and M. Benbouzid, “Wind turbine high-
speed shaft bearings health prognosis through a spectral kurtosis-derived indices
and SVR,” Applied Acoustics, vol. 120, pp. 1–8, May 2017. [Online]. Available:
https://doi.org/10.1016/j.apacoust.2017.01.005
[3] Y. Lei, N. Li, L. Guo, N. Li, T. Yan, and J. Lin, “Machinery health prognostics: A
systematic review from data acquisition to RUL prediction,” pp. 799–834, May 2018.
[4] D. Siegel, H. Al-Atat, V. Shauche, L. Liao, J. Snyder, and J. Lee, “Novel method for
rolling element bearing health assessment - A tachometer-less synchronously averaged
envelope feature extraction technique,” in Mechanical Systems and Signal Processing,
vol. 29. Academic Press, May 2012, pp. 362–376.
[5] D. Wang and K. L. Tsui, “Statistical modeling of bearing degradation signals,” IEEE
Transactions on Reliability, vol. 66, no. 4, pp. 1331–1344, Dec. 2017.
[6] R. Liu, B. Yang, E. Zio, and X. Chen, “Artificial intelligence for fault diagnosis of
rotating machinery: A review,” pp. 33–47, Aug. 2018.
[7] T. Touret, C. Changenet, F. Ville, M. Lalmi, and S. Becquerelle, “On the use of temper-
ature for online condition monitoring of geared systems – A review,” Mechanical Systems
and Signal Processing, vol. 101, pp. 197–210, Feb. 2018.
[8] K. Goebel, B. Saha, A. Saxena, J. R. Celaya, and J. P. Christophersen, “Prognostics
in battery health management,” IEEE Instrumentation and Measurement Magazine,
vol. 11, no. 4, pp. 33–40, 2008.
[9] H. Meng and Y. F. Li, “A review on prognostics and health management (PHM) methods
of lithium-ion batteries,” p. 109405, Dec. 2019.
[10] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016, http://www.deeplearningbook.org.
[11] D. B. Verstraete, E. L. Droguett, V. Meruane, M. Modarres, and A. Ferrada, “Deep
semi-supervised generative adversarial fault diagnostics of rolling element bearings,”
Structural Health Monitoring, vol. 19, no. 2, pp. 390–411, Mar. 2020. [Online]. Available:
http://journals.sagepub.com/doi/10.1177/1475921719850576
[12] Y. Zhang, R. Xiong, H. He, and M. G. Pecht, “Long short-term memory recurrent neural
network for remaining useful life prediction of lithium-ion batteries,” IEEE Transactions
on Vehicular Technology, vol. 67, no. 7, pp. 5695–5705, Jul. 2018.
[13] C. Correa-Jullian, J. M. Cardemil, E. López Droguett, and M. Behzad, “Assessment of
Deep Learning techniques for Prognosis of solar thermal systems,” Renewable Energy,
vol. 145, pp. 2178–2191, Jan. 2020.

60
[14] R. Zhao, R. Yan, Z. Chen, K. Mao, P. Wang, and R. X. Gao, “Deep learning and its
applications to machine health monitoring,” pp. 213–237, Jan. 2019.
[15] F. Jia, Y. Lei, J. Lin, X. Zhou, and N. Lu, “Deep neural networks: A promising tool for
fault characteristic mining and intelligent diagnosis of rotating machinery with massive
data,” Mechanical Systems and Signal Processing, vol. 72-73, pp. 303–315, May 2016.
[16] W. Peng, Z. S. Ye, and N. Chen, “Bayesian Deep-Learning-Based Health Prognostics
Toward Prognostics Uncertainty,” IEEE Transactions on Industrial Electronics, vol. 67,
no. 3, pp. 2283–2293, Mar. 2020.
[17] J. Hron, A. G. d. G. Matthews, and Z. Ghahramani, “Variational Gaussian Dropout is
not Bayesian,” Nov. 2017. [Online]. Available: http://arxiv.org/abs/1711.02989
[18] Y. Gal and Z. Ghahramani, “Dropout as a Bayesian Approximation: Representing
Model Uncertainty in Deep Learning,” in Proceedings of The 33rd International
Conference on Machine Learning, ser. Proceedings of Machine Learning Research,
M. F. Balcan and K. Q. Weinberger, Eds., vol. 48. New York, New York, USA: PMLR,
2016, pp. 1050–1059. [Online]. Available: http://proceedings.mlr.press/v48/gal16.html
[19] I. Osband, “Risk versus Uncertainty in Deep Learning: Bayes, Bootstrap and the Dangers of Dropout,” Tech. Rep., 2016.
[20] T. Pearce, N. Anastassacos, M. Zaki, and A. Neely, “Bayesian Inference with Anchored
Ensembles of Neural Networks, and Application to Exploration in Reinforcement
Learning,” May 2018. [Online]. Available: http://arxiv.org/abs/1805.11324
[21] P. L. McDermott and C. K. Wikle, “Bayesian Recurrent Neural Network Models for
Forecasting and Quantifying Uncertainty in Spatial-Temporal Data,” Entropy, vol. 21,
no. 2, Nov. 2017. [Online]. Available: http://arxiv.org/abs/1711.00636
[22] M. Fortunato, C. Blundell, and O. Vinyals, “Bayesian Recurrent Neural Networks,” Apr.
2017. [Online]. Available: http://arxiv.org/abs/1704.02798
[23] Y. Wen, P. Vicol, J. Ba, D. Tran, and R. Grosse, “Flipout: Efficient pseudo-independent
weight perturbations on mini-batches,” in 6th International Conference on Learning
Representations, ICLR 2018 - Conference Track Proceedings. International Conference
on Learning Representations, ICLR, Mar. 2018.
[24] A. Saxena and K. Goebel, “Turbofan engine degradation simulation data set,” NASA
Ames Prognostics Data Repository, 2008.
[25] A. Y. Ng, “Machine Learning - Coursera,” 2017. [Online]. Available: https://es.coursera.org/learn/machine-learning
[26] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, “Deepface: Closing the gap to human-
level performance in face verification,” in 2014 IEEE Conference on Computer Vision
and Pattern Recognition, 2014, pp. 1701–1708.
[27] G. Bohouta and V. Këpuska, “Comparing speech recognition systems (microsoft api,
google api and cmu sphinx),” Int. Journal of Engineering Research and Application, vol.
2248-9622, pp. 20–24, Mar. 2017.
[28] L. Fridman, D. E. Brown, M. Glazer, W. Angell, S. Dodd, B. Jenik, J. Terwilliger,
J. Kindelsberger, L. Ding, S. Seaman, H. Abraham, A. Mehler, A. Sipperley,
A. Pettinato, B. Seppelt, L. Angell, B. Mehler, and B. Reimer, “MIT autonomous

61
vehicle technology study: Large-scale deep learning based analysis of driver behavior
and interaction with automation,” CoRR, vol. abs/1711.06976, 2017. [Online]. Available:
http://arxiv.org/abs/1711.06976
[29] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evaluation of gated recurrent
neural networks on sequence modeling,” 2014.
[30] C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra, “Weight uncertainty in
neural networks,” 2015.
[31] A. B. Owen, Monte Carlo theory, methods and examples., 2013.
[32] D. P. Kingma, T. Salimans, and M. Welling, “Variational dropout and the local repa-
rameterization trick,” 2015.
[33] J. Baalis Coble, “Merging Data Sources to Predict Remaining Useful Life – An Automated Method to Identify Prognostic Parameters,” University of Tennessee, Tech. Rep., 2010. [Online]. Available: https://trace.tennessee.edu/utk_graddiss/683
[34] H. Yun and M. Modarres, “Measures of entropy to characterize fatigue damage in metallic
materials,” Entropy, vol. 21, p. 804, Aug. 2019.
[35] M. Bryant, M. Khonsari, and F. Ling, “On the thermodynamics of degradation,” Pro-
ceedings of The Royal Society A: Mathematical, Physical and Engineering Sciences, vol.
464, pp. 2001–2014, Aug. 2008.
[36] C. Basaran and S. Nie, “An irreversible thermodynamics theory for damage mechanics of
solids,” International Journal of Damage Mechanics - INT J DAMAGE MECH, vol. 13,
pp. 205–223, Jul. 2004.
[37] H. Yun and M. Modarres, “Measures of entropy to characterize fatigue damage in
metallic materials,” Entropy, vol. 21, no. 8, p. 804, Aug. 2019. [Online]. Available:
https://doi.org/10.3390/e21080804
[38] M. Modarres and M. Amiri, “An entropy-based damage characterization,” Entropy,
vol. 16, pp. 6434–6463, Dec. 2014.
[39] M. Amiri and M. M. Khonsari, “Nondestructive estimation of remaining fatigue life:
Thermography technique,” Journal of Failure Analysis and Prevention, vol. 12, no. 6,
pp. 683–688, Aug. 2012. [Online]. Available: https://doi.org/10.1007/s11668-012-9607-8
[40] ASTM E466-15, “Standard practice for conducting force controlled constant amplitude axial fatigue tests of metallic materials.” [Online]. Available: https://doi.org/10.1520/e0466-15
[41] J. B. Ali, L. Saidi, S. Harrath, E. Bechhoefer, and M. Benbouzid, “Online automatic
diagnosis of wind turbine bearings progressive degradations under real experimental
conditions based on unsupervised machine learning,” Applied Acoustics, vol. 132, pp.
167–181, Mar. 2018. [Online]. Available: https://doi.org/10.1016/j.apacoust.2017.11.021
[42] S.-i. Maeda, “A Bayesian encourages dropout,” 2014.
[43] A. Y. K. Foong, D. R. Burt, Y. Li, and R. E. Turner, “On the expressiveness of approx-
imate inference in bayesian neural networks,” 2019.
[44] B. Lakshminarayanan, A. Pritzel, and C. Blundell, “Simple and scalable predictive un-
certainty estimation using deep ensembles,” 2016.
[45] X. Li, Q. Ding, and J.-Q. Sun, “Remaining useful life estimation in prognostics using

62
deep convolution neural networks,” Reliability Engineering & System Safety, vol. 172,
pp. 1–11, Apr. 2018. [Online]. Available: https://doi.org/10.1016/j.ress.2017.11.021
[46] J. Wu, K. Hu, Y. Cheng, H. Zhu, X. Shao, and Y. Wang, “Data-driven remaining
useful life prediction via multiple sensor signals and deep long short-term memory
neural network,” ISA Transactions, vol. 97, pp. 241–250, Feb. 2020. [Online]. Available:
https://doi.org/10.1016/j.isatra.2019.07.004
[47] J. Li, X. Li, and D. He, “Domain adaptation remaining useful life prediction
method based on AdaBN-DCNN,” in 2019 Prognostics and System Health
Management Conference (PHM-Qingdao). IEEE, Oct. 2019. [Online]. Available:
https://doi.org/10.1109/phm-qingdao46334.2019.8942857
[48] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser,
and I. Polosukhin, “Attention is all you need,” CoRR, vol. abs/1706.03762, 2017.
[Online]. Available: http://arxiv.org/abs/1706.03762
[49] A. Galassi, M. Lippi, and P. Torroni, “Attention, please! A critical review of neural
attention models in natural language processing,” CoRR, vol. abs/1902.02181, 2019.
[Online]. Available: http://arxiv.org/abs/1902.02181
[50] S. Roller, E. Dinan, N. Goyal, D. Ju, M. Williamson, Y. Liu, J. Xu, M. Ott, K. Shuster,
E. M. Smith, Y.-L. Boureau, and J. Weston, “Recipes for building an open-domain
chatbot,” 2020.

Appendix
A Statistical parameters extracted from signals as features
• Max Value
• Root Mean Square (RMS)
• Peak to peak
• Crest Factor
• Mean
• Variance
• Skewness
• Kurtosis
• 5th moment
• The mean of each of the first five frequency bands of the Fourier transform (a computational sketch of these features follows below).
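The following is a minimal sketch of these features, assuming NumPy/SciPy and a 1-D signal; the equal-width split of the FFT magnitude into five bands is an assumption, since the band definition is not detailed here.

import numpy as np
from scipy import stats

def extract_features(x, n_bands=5):
    x = np.asarray(x, dtype=float)
    rms = np.sqrt(np.mean(x ** 2))
    feats = {
        "max": np.max(x),
        "rms": rms,
        "peak_to_peak": np.ptp(x),
        "crest_factor": np.max(np.abs(x)) / rms,
        "mean": np.mean(x),
        "variance": np.var(x),
        "skewness": stats.skew(x),
        "kurtosis": stats.kurtosis(x),
        "moment_5": stats.moment(x, moment=5),
    }
    # Mean of the first n_bands bands of the FFT magnitude spectrum
    # (equal-width bands are an illustrative assumption).
    mag = np.abs(np.fft.rfft(x))
    for i, band in enumerate(np.array_split(mag, n_bands), start=1):
        feats[f"fft_band_{i}_mean"] = band.mean()
    return feats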

B Results of Bayesian Recurrent Neural Networks, C-MAPSS dataset.

Table B.1: Bayesian Recurrent Neural Network, FD001.


Bayes LSTM Bayes GRU
Run 1 Run 2 Run 3 Run 1 Run 2 Run 3
Mean RMSE 18.12 17.17 17.90 17.95 17.69 17.83
STD 0.84 0.98 0.93 1.07 1.08 1.00
RMSE* 14.49 13.27 14.23 14.34 13.80 14.10

Table B.2: Bayesian Recurrent Neural Network, FD002.


Bayes LSTM Bayes GRU
Run 1 Run 2 Run 3 Run 1 Run 2 Run 3
Mean RMSE 23.97 24.21 23.74 23.42 24.22 24.56
STD 1.02 1.00 1.01 0.92 1.05 1.10
RMSE* 17.47 18.03 16.78 17.37 17.32 18.05

Table B.3: Bayesian Recurrent Neural Network, FD003.


Bayes LSTM Bayes GRU
Run 1 Run 2 Run 3 Run 1 Run 2 Run 3
Mean RMSE 17.33 17.07 17.69 15.69 16.62 16.67
STD 1.05 0.93 1.03 0.88 0.99 0.91
RMSE* 13.56 13.35 14.20 11.56 12.66 12.71

Table B.4: Bayesian Recurrent Neural Network, FD004.


Bayes LSTM Bayes GRU
Run 1 Run 2 Run 3 Run 1 Run 2 Run 3
Mean RMSE 27.94 26.74 27.15 27.19 27.63 28.39
STD 0.96 1.02 1.22 1.12 1.10 1.17
RMSE* 22.05 20.30 20.66 20.80 21.38 22.07

C Plots of the results of Bayesian RNN, C-MAPSS.

[Figure panels omitted: BayesGRU - C-MAPSS FD001, RMSE RUL prediction: 13.80. (a) RUL prediction (RUL [Cycles] vs. example) with mean predicted value, ground truth, and standard-deviation and 90% probability intervals. (b)-(d) Star histograms: Ground Truth 112.00 / Dist mean 117.67; Ground Truth 85.00 / Dist mean 60.67; Ground Truth 29.00 / Dist mean 32.21.]

Figure C.1: Best result for C-MAPSS FD001 with Bayesian GRU cell.

[Figure panels omitted: BayesGRU - C-MAPSS FD002, RMSE RUL prediction: 17.32. (a) RUL prediction with mean predicted value, ground truth, and standard-deviation and 90% probability intervals. (b)-(d) Star histograms: Ground Truth 121.00 / Dist mean 112.29; Ground Truth 81.00 / Dist mean 108.04; Ground Truth 36.00 / Dist mean 28.94.]

Figure C.2: Best result for C-MAPSS FD002 with Bayesian GRU cell.

[Figure panels omitted: BayesLSTM - C-MAPSS FD003, RMSE RUL prediction: 13.35. (a) RUL prediction with mean predicted value, ground truth, and standard-deviation and 90% probability intervals. (b)-(d) Star histograms: Ground Truth 115.00 / Dist mean 85.43; Ground Truth 77.00 / Dist mean 79.04; Ground Truth 41.00 / Dist mean 44.67.]

Figure C.3: Best result for C-MAPSS FD003 with Bayesian LSTM cell.

[Figure panels omitted: BayesGRU - C-MAPSS FD004, RMSE RUL prediction: 20.80. (a) RUL prediction with mean predicted value, ground truth, and standard-deviation and 90% probability intervals. (b)-(d) Star histograms: Ground Truth 125.00 / Dist mean 120.25; Ground Truth 88.00 / Dist mean 85.38; Ground Truth 36.00 / Dist mean 36.89.]

Figure C.4: Best result for C-MAPSS FD004 with Bayesian GRU cell.

D Comparison tables of the number of trainable parameters in Frequentist and Bayesian models for C-MAPSS Dataset.

Table D.1: Parameter table, FD001.


Dataset FD001 LSTM GRU
Output shape Bayesian Frequentist Bayesian Frequentist
Input (None, 30, 14) 0 0 0 0
LSTM (None, 30, 32) 12032 6016 9024 4512
Flatten (None, 960) 0 0 0 0
Dense 1 (None, 32) 61472 30752 61472 30752
Dense 2 (None, 16) 1040 528 1040 528
Output Bayesian (None, 2) 66 - 66 -
Output Frequentist (None, 1) - 17 - 17
Total Parameters 74610 37313 71602 35809

Table D.2: Parameter table, FD002.


Dataset FD002 LSTM GRU
Output shape Bayesian Frequentist Bayesian Frequentist
Input (None, 21, 14) 0 0 0 0
LSTM (None, 21, 32) 12032 6016 9024 4512
Flatten (None, 672) 0 0 0 0
Dense 1 (None, 32) 43040 21536 43040 21536
Dense 2 (None, 16) 1040 528 1040 528
Output Bayesian (None, 2) 66 - 66 -
Output Frequentist (None, 1) - 17 - 17
Total Parameters 56178 28097 53170 26593

Table D.3: Parameter table, FD003.
Dataset FD003 LSTM GRU
Output shape Bayesian Frequentist Bayesian Frequentist
Input (None, 30, 14) 0 0 0 0
LSTM (None, 30, 32) 12032 6016 9024 4512
Flatten (None, 960) 0 0 0 0
Dense 1 (None, 32) 61472 30752 61472 30752
Dense 2 (None, 16) 1040 528 1040 528
Output Bayesian (None, 2) 66 - 66 -
Output Frequentist (None, 1) - 17 - 17
Total Parameters 74610 37313 71602 35809

Table D.4: Parameter table, FD004.


Dataset FD004 LSTM GRU
Output shape Bayesian Frequentist Bayesian Frequentist
Input (None, 18, 14) 0 0 0 0
LSTM (None, 18, 32) 12032 6016 9024 4512
Flatten (None, 576) 0 0 0 0
Dense 1 (None, 32) 36896 18464 36896 18464
Dense 2 (None, 16) 1040 528 1040 528
Output Bayesian (None, 2) 66 - 66 -
Output Frequentist (None, 1) - 17 - 17
Total Parameters 50034 25025 47026 23521

E Results of Frequentist Recurrent Neural Networks, C-MAPSS dataset.

Table E.1: All runs of Frequentist Recurrent Neural Network approach.


Dataset FD001 FD002 FD003 FD004
Model LSTM GRU LSTM GRU LSTM GRU LSTM GRU
1 14.00 14.09 17.82 17.61 13.70 13.19 21.64 21.70
2 16.51 13.71 18.09 18.88 15.27 13.58 21.27 21.48
3 15.56 14.30 17.99 17.73 14.18 14.21 21.66 21.82
4 14.39 13.76 18.93 17.90 14.61 13.40 21.20 21.79
5 15.82 14.75 17.58 17.75 13.96 13.61 21.57 21.56

Mean 15.26 14.12 18.08 17.97 14.35 13.60 21.47 21.67

F Results of MC Dropout Recurrent Neural Networks, C-MAPSS dataset.

Table F.1: Summary of MC Dropout Models, C-MAPSS dataset.


Dataset FD001 FD002 FD003 FD004
Model (MC Dropout) LSTM GRU LSTM GRU LSTM GRU LSTM GRU
Mean RMSE 16.19 15.67 18.84 19.09 16.02 15.63 21.88 22.19
STD 0.68 0.68 0.33 0.31 0.57 0.65 0.27 0.28
RMSE* 14.96 14.25 18.24 18.51 15.11 14.52 21.53 21.75

Table F.2: MC Dropout Recurrent Neural Network, FD001.
MC Dropout LSTM MC Dropout GRU
Run 1 Run 2 Run 3 Run 1 Run 2 Run 3
Mean RMSE 17.23 16.17 15.17 15.32 16.19 15.48
STD 0.76 0.73 0.52 0.64 0.77 0.64
RMSE* 16.07 14.40 14.41 14.05 14.50 14.19

Table F.3: MC Dropout Recurrent Neural Network, FD002.


MC Dropout LSTM MC Dropout GRU
Run 1 Run 2 Run 3 Run 1 Run 2 Run 3
Mean RMSE 19.44 18.79 18.28 18.83 19.19 19.25
STD 0.40 0.30 0.26 0.28 0.32 0.33
RMSE* 18.58 18.36 17.77 18.30 18.60 18.63

Table F.4: MC Dropout Recurrent Neural Network, FD003.


MC Dropout LSTM MC Dropout GRU
Run 1 Run 2 Run 3 Run 1 Run 2 Run 3
Mean RMSE 15.35 16.12 16.59 14.81 14.96 17.13
STD 0.66 0.54 0.48 0.60 0.69 0.67
RMSE* 14.17 15.28 15.87 13.82 13.77 15.96

Table F.5: MC Dropout Recurrent Neural Network, FD004.


MC Dropout LSTM MC Dropout GRU
Run 1 Run 2 Run 3 Run 1 Run 2 Run 3
Mean RMSE 22.11 21.91 21.62 22.18 22.03 22.37
STD 0.26 0.25 0.30 0.27 0.31 0.26
RMSE* 21.76 21.62 21.22 21.81 21.49 21.95

G Results of Bayesian Recurrent Neural Networks, Maryland cracklines dataset.

Table G.1: Results of Bayesian Recurrent Neural networks, Maryland cracklines dataset.
Bayes LSTM Bayes GRU
Run 1 Run 2 Run 3 Run 1 Run 2 Run 3
Mean RMSE 24.69 25.53 25.57 24.82 24.30 24.50
STD 0.57 0.64 0.57 0.62 0.56 0.58
RMSE* 19.33 20.31 20.14 18.85 18.79 19.17

[Plot omitted: BayesLSTM - Maryland Cracks; RMSE RUL prediction: 14.16; HI vs. example with mean predicted value, ground truth, and standard-deviation and 90% probability intervals.]

Figure G.1: First complete life cycle. Maryland cracklines, Bayesian LSTM.

[Plot omitted: BayesGRU - Maryland Cracks; RMSE RUL prediction: 15.53.]

Figure G.2: First complete life cycle. Maryland cracklines, Bayesian GRU.

[Plot omitted: BayesLSTM - Maryland Cracks; RMSE RUL prediction: 13.05.]

Figure G.3: Second complete life cycle. Maryland cracklines, Bayesian LSTM.

[Plot omitted: BayesGRU - Maryland Cracks; RMSE RUL prediction: 14.38.]

Figure G.4: Second complete life cycle. Maryland cracklines, Bayesian GRU.

[Plot omitted: BayesLSTM - Maryland Cracks; RMSE RUL prediction: 10.80.]

Figure G.5: Third complete life cycle. Maryland cracklines, Bayesian LSTM.

[Plot omitted: BayesGRU - Maryland Cracks; RMSE RUL prediction: 10.65.]

Figure G.6: Third complete life cycle. Maryland cracklines, Bayesian GRU.

[Plot omitted: BayesLSTM - Maryland Cracks; RMSE RUL prediction: 11.76.]

Figure G.7: Fourth complete life cycle. Maryland cracklines, Bayesian LSTM.

[Plot omitted: BayesGRU - Maryland Cracks; RMSE RUL prediction: 11.55.]

Figure G.8: Fourth complete life cycle. Maryland cracklines, Bayesian GRU.

[Plot omitted: BayesLSTM - Maryland Cracks; RMSE RUL prediction: 26.15.]

Figure G.9: Fourth complete life cycle. Maryland cracklines, Bayesian LSTM.

[Plot omitted: BayesGRU - Maryland Cracks; RMSE RUL prediction: 27.01.]

Figure G.10: Fourth complete life cycle. Maryland cracklines, Bayesian GRU.

[Plot omitted: BayesLSTM - Maryland Cracks; RMSE RUL prediction: 17.98.]

Figure G.11: Fifth complete life cycle. Maryland cracklines, Bayesian LSTM.

[Plot omitted: BayesGRU - Maryland Cracks; RMSE RUL prediction: 16.97.]

Figure G.12: Fifth complete life cycle. Maryland cracklines, Bayesian GRU.

[Plot omitted: BayesLSTM - Maryland Cracks; RMSE RUL prediction: 22.18.]

Figure G.13: Sixth complete life cycle. Maryland cracklines, Bayesian LSTM.

[Plot omitted: BayesGRU - Maryland Cracks; RMSE RUL prediction: 20.68.]

Figure G.14: Sixth complete life cycle. Maryland cracklines, Bayesian GRU.

H Results of Bayesian Recurrent Neural Networks, GPM Wind turbine dataset.

Table H.1: Results Wind turbine: Total results.


Bayes LSTM Bayes GRU
Run 1 Run 2 Run 3 Run 1 Run 2 Run 3
Mean RMSE 13.60 13.30 13.31 13.62 13.88 13.62
STD 0.32 0.32 0.28 0.32 0.30 0.28
RMSE* 8.47 7.98 7.99 8.55 8.92 8.51

I Results of Bayesian Recurrent Neural Networks, Ottawa Bearing dataset.

Table I.1: All results for each Bayesian layer. Dataset 1.


BayesLSTM BayesGRU
Run 1 Run 2 Run 3 Run 1 Run 2 Run 3
Top 5% 93.73% 93.91% 93.40% 93.85% 94.03% 98.31%
50% 93.55% 93.69% 93.21% 93.67% 93.81% 98.15%
95% 93.32% 93.48% 92.98% 93.44% 93.61% 97.95%

Table I.2: All results for each Bayesian layer. Dataset 2.


BayesLSTM BayesGRU
Run 1 Run 2 Run 3 Run 1 Run 2 Run 3
Top 5% 98.94% 99.05% 98.94% 99.27% 99.21% 99.19%
50% 98.79% 98.89% 98.80% 99.15% 99.06% 99.05%
95% 98.63% 98.72% 98.66% 99.05% 98.85% 98.89%

Table I.3: All results for each Bayesian layer. Dataset 3.
BayesLSTM BayesGRU
Run 1 Run 2 Run 3 Run 1 Run 2 Run 3
Top 5% 97.36% 97.78% 97.39% 97.43% 97.39% 97.82%
50% 97.13% 97.61% 97.25% 97.27% 97.20% 97.65%
95% 96.91% 97.44% 97.05% 97.11% 97.03% 97.45%

[Confusion matrices omitted: (a) counts and (b) normalized percentages; each cell reports a 90% probability interval. Mean diagonal accuracies: Healthy 98.4%, Inner fault 76.1%, Outer fault 98.2%, Mixed fault 98.8%, Ball fault 97.0%; the inner fault class is confused with the mixed fault class in roughly 20.7% of its samples.]

Figure I.1: Best result for Ottawa bearing dataset. Dataset 1 with Bayesian LSTM layers.

[Confusion matrices omitted: (a) counts and (b) normalized percentages; each cell reports a 90% probability interval. Mean diagonal accuracies: Healthy 98.5%, Inner fault 99.6%, Outer fault 98.3%, Mixed fault 98.3%, Ball fault 99.7%.]

Figure I.2: Best result for Ottawa bearing dataset. Dataset 2 with Bayesian LSTM layers.

[Confusion matrices omitted: (a) counts and (b) normalized percentages; each cell reports a 90% probability interval. Mean diagonal accuracies: Healthy 94.8%, Inner fault 99.5%, Outer fault 99.4%, Mixed fault 99.2%, Ball fault 95.2%.]

Figure I.3: Best result for Ottawa bearing dataset. Dataset 3 with Bayesian LSTM layers.

J Results of Bayesian Recurrent Neural Networks, Politecnico di Torino Bearing dataset.

Table J.1: All results for each Bayesian layer.


BayesLSTM BayesGRU
Run 1 Run 2 Run 3 Run 1 Run 2 Run 3
Top 5% 100.00% 100.00% 100.00% 100.00% 100.00% 100.00%
50% 100.00% 100.00% 100.00% 99.92% 100.00% 100.00%
95% 99.50% 99.70% 99.83% 99.53% 99.95% 99.87%

