Uncertainty Quantification in Machine Learning for Engineering Design and Health Prognostics: A Tutorial
Venkat Nemani^a, Luca Biggio^b, Xun Huan^c, Zhen Hu^d, Olga Fink^e, Anh Tran^f, Yan Wang^g, Xiaoge Zhang^{h,i,∗}, Chao Hu^{j,∗}
^a Department of Mechanical Engineering, Iowa State University, Ames, IA 50011, USA
^b Data Analytics Lab, ETH, Zürich, Switzerland
^c Department of Mechanical Engineering, University of Michigan, Ann Arbor, MI 48109, USA
^d Department of Industrial and Manufacturing Systems Engineering, University of Michigan-Dearborn, Dearborn, MI 48128, USA
Abstract
On top of machine learning (ML) models, uncertainty quantification (UQ) functions as an essential
layer of safety assurance that could lead to more principled decision making by enabling sound risk
assessment and management. The safety and reliability improvement of ML models empowered by
UQ has the potential to significantly facilitate the broad adoption of ML solutions in high-stakes
decision settings, such as healthcare, manufacturing, and aviation, to name a few. In this tutorial,
we aim to provide a holistic lens on emerging UQ methods for ML models with a particular focus
on neural networks and the applications of these UQ methods in tackling engineering design as well
as prognostics and health management problems. Toward this goal, we start with a comprehensive
classification of uncertainty types, sources, and causes pertaining to UQ of ML models. Next, we
provide a tutorial-style description of several state-of-the-art UQ methods: Gaussian process regres-
sion, Bayesian neural network, neural network ensemble, and deterministic UQ methods focusing
on spectral-normalized neural Gaussian process. Built upon the mathematical formulations, we subsequently examine the soundness of these UQ methods quantitatively and qualitatively (via a toy regression example) to assess their strengths and shortcomings from different dimensions.
Then, we review quantitative metrics commonly used to assess the quality of predictive uncertainty
in classification and regression problems. Afterward, we discuss the increasingly important role of
UQ of ML models in solving challenging problems in engineering design and health prognostics.
Two case studies with source codes available on GitHub are used to demonstrate these UQ methods
and compare their performance in the life prediction of lithium-ion batteries at the early stage (case
study 1) and the remaining useful life prediction of turbofan engines (case study 2).
Keywords: Machine learning, Uncertainty quantification, Engineering design, Prognostics and health management

∗ Corresponding authors.
Email addresses: [email protected] (Xiaoge Zhang), [email protected] (Chao Hu)
Nomenclature

y_t    Vector representation of target outputs in the training data, that is, \mathbf{y}_t = [y_1, \ldots, y_N]^T \in \mathbb{R}^N
1. Introduction
In recent years, data-driven machine learning (ML) models have become increasingly prevalent
across a wide range of engineering fields. Two application domains of interest to this tutorial are
engineering design and post-design health prognostics. The ML community has devoted significant
efforts toward creating deep learning (DL) models that yield improved prediction accuracy over
earlier DL models on publicly available, large, standardized datasets, such as MNIST [1], ImageNet
[2], Places [3], and Microsoft COCO [4]. Among these DL models are deep neural networks (DNNs),
known for their ability to automatically extract high-level abstracted features from large volumes of data through multiple layers of neurons and activation functions in an end-to-end fashion.
Despite record-breaking prediction accuracy on some fixed sets of test samples (i.e., images in
the case of computer vision), these neural networks typically have difficulties in generalizing to data
not observed during model training. Test samples drawn from a distribution substantially different from the training distribution, where most of the training samples are located, are called out-of-distribution (OOD) samples. Trained neural network models tend to produce large prediction errors on these OOD samples. Despite considerable efforts, such as domain adaptation [5–7], aimed at improving the generalization performance of neural network models, the
issue of poor generalizability still persists. Another limitation that adds to the challenge is that
complex ML models, such as DNNs, are mostly black-box in nature. It is generally preferred to use
simpler models (e.g., linear regression and decision tree) that are easier to interpret unless more
complex models can be justified with non-incremental benefits (e.g., substantially improved accu-
racy). In recent years, the growing availability of large volumes of data has made complex models,
which are often significantly more accurate than simple models, the obvious better choice in many
ML applications where prediction accuracy is the priority. Consequently, black-box ML models that
are hard to understand are increasingly deployed, particularly in big data applications. Some efforts
have been made to address the lack of interpretability, with notable explanation algorithms such as
SHAP [8] and Grad-CAM [9] and a good review of interpretable ML [10]. Despite these recent ef-
forts, many complex ML models are still implemented as black-box models and cannot explain their
predictions to the end user for various reasons. This limitation makes it extremely difficult for the end user to understand the decision mechanism behind a neural network’s prediction. Given these
two limitations (difficulties in extrapolating to OOD samples and lack of interpretability), it is vital
to quantify the predictive uncertainty of a trained ML model and communicate this uncertainty to
end users in an easy-to-understand way. To enhance algorithmic transparency and trustworthiness,
uncertainty quantification (UQ) and interpretation should ideally be performed together, with UQ
providing information on the confidence of complex machine learning models in making predictions.
This integration allows for a better understanding of often difficult-to-interpret models and their
predictions.
Let us first look at typical ways to express and communicate predictive uncertainty. A simple
case is with classification problems, where the probability of the model-predicted class can depict
model confidence at a prediction. For example, a fault classification model may predict a bearing
to have an inner race fault with a 90% probability/confidence. In regression problems, predictive
uncertainty is often communicated as confidence intervals, shown as error bars on graphs visualizing
predictions. For instance, we could train a probabilistic ML model to predict the number of weeks a
rolling element bearing can be used before failure, i.e., the remaining useful life (RUL). An example
prediction may be 120 ± 15, in weeks, which represents a two-sided 95% confidence interval (i.e.,
∼1.96 standard deviations subtracted from or added to the mean estimate assuming the model-
predicted RUL follows a Gaussian distribution). A narrower confidence interval comes from lower
predictive uncertainty, which suggests higher model confidence.
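To make the arithmetic concrete, the following minimal sketch (illustrative, not from the paper's GitHub code) converts a Gaussian RUL prediction into a two-sided 95% confidence interval; the mean and standard deviation are hypothetical, with the standard deviation chosen so that 1.96σ ≈ 15 weeks:

```python
# Minimal sketch: a two-sided 95% confidence interval from a Gaussian
# RUL prediction. All numbers are hypothetical.
from scipy import stats

mean_rul = 120.0   # predicted mean RUL, in weeks
std_rul = 7.65     # predicted standard deviation, so that 1.96*std ~ 15 weeks

z = stats.norm.ppf(0.975)  # ~1.96 for a two-sided 95% interval
lower, upper = mean_rul - z * std_rul, mean_rul + z * std_rul
print(f"RUL: {mean_rul:.0f} weeks, 95% CI: [{lower:.1f}, {upper:.1f}] weeks")
```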
One clear advantage of UQ is that it helps end users determine when they can trust predictions
made by the model and when they need extra caution while making decisions based on these pre-
dictions. This is especially important when incorrect decisions can lead to severe financial losses
or even life-threatening outcomes. Towards this end, the integration of UQ in ML models, as well
as the sound quantification and calibration of uncertainty in ML model prediction, has a viable
potential to tackle a central research question the ML community confronts – safety assurance of
ML models [11–14]. In fact, the absence of essential performance characteristics (e.g., model robust-
ness and safety assurance) has emerged as the fundamental roadblock limiting ML’s application scope to risk-insensitive areas, while its adoption in high-stakes, high-reward decision environments
(e.g., healthcare, aviation, and power grid) are still in the infancy stage primarily because of the
reluctance of end users to delegate critical decision making to machine intelligence in cases where
the safety of patients or critical engineering systems might be put at stake [15–18]. Towards the
translation of ML solutions in high-risk domains, UQ offers an additional dimension by extend-
ing the traditional discipline of statistical error analysis to capture various uncertainties arising
from limited or noisy data, missing variables, incomplete knowledge, etc. This development has
wide-ranging implications for supporting quantitative and precise risk management in high-stakes
decision-making settings, particularly concerning potential model failures and decision limitations of
ML algorithms. However, the evaluation of ML model performance on most benchmarking datasets
focuses exclusively on some form of prediction accuracy on a fixed test dataset; it rarely considers
the quality of predictive uncertainty. As a result, UQ of ML models is typically pushed to the
sidelines, yielding center stage to prediction accuracy. In reality, underestimating uncertainty
(overconfidence) can create trust issues, while overestimating uncertainty (underconfidence) may
result in overly conservative predictions, ultimately diminishing the value of ML.
More recently (approximately since 2015), there has been growing interest in approaches to
estimating the predictive uncertainty of deep learning models, for example, in the form of class
probability for classification and predicted variance for regression, as discussed earlier. The growing
interest can be attributed to failure cases where trained ML models produced unexpectedly incorrect
predictions on test samples while communicating high confidence in the predictions [19] and those
where models changed their predictions substantially in response to minor, unimportant changes
to samples (or so-called adversarial samples) [20]. Two pioneering studies that stimulated many
subsequent efforts created two widely used approaches to UQ of neural networks: (1) Monte Carlo
(MC) dropout as a computationally efficient alternative to traditional Bayesian neural network
[21] and (2) neural network ensemble consisting of multiple independently trained neural networks,
each predicting a mean and standard deviation of a Gaussian target [22]. Another notable early
study highlighted differences between aleatory and epistemic uncertainty and discussed situations
where quantifying aleatory uncertainty is important and where quantifying epistemic uncertainty
is important [23]. A common understanding in the ML community towards these two types of
uncertainty has been the following: aleatory uncertainty can be considered data uncertainty and
represents inherent randomness (e.g., measurement noise) in observations of the target that an ML
model is tasked with predicting; epistemic uncertainty can be treated as model uncertainty and
results from having access to only limited training data, which makes it impossible to learn a
precise model. As discussed in Sec. 2.1, aleatory and epistemic uncertainty could encompass more
sources and causes than the well-known data and model uncertainty.
The engineering design community has a long history of applying Gaussian process regression
(GPR) or kriging, an ML method with UQ capability, to build cheap-to-evaluate surrogates of
expensive simulation models for simulation-based design, dating back to the early 2000s [24–26].
GPR has an elegant way of quantifying aleatory and epistemic uncertainty and can produce high
uncertainty on OOD samples. However, the UQ capability of GPR is typically not used to detect
OOD samples or quantify the epistemic uncertainty of a final built surrogate. Rather, it is leveraged
in an adaptive sampling scheme to encourage sampling in highly uncertain and critical regions of
the input space (exploration) to minimize the number of training samples for either (1) building
an accurate surrogate within some lower and upper bounds of input variables (local or global sur-
rogate modeling) [27–29] or (2) finding a globally optimal design for some expensive-to-evaluate
black-box objective function [30, 31]. Additionally, little effort is made to evaluate the quality of
UQ for a trained GPR model, likely because the model makes predictions on samples within pre-
defined design bounds and does not need to extrapolate much (low epistemic uncertainty). Other
classical surrogate modeling methods, such as standard artificial neural networks and support vec-
tor machines, are generally less capable of quantifying predictive uncertainty, especially epistemic
uncertainty. These methods and GPR are typically used to build surrogates that act as “deter-
ministic” transfer functions and allow propagating aleatory uncertainty in input variables to derive
the uncertainty in the model output, known as uncertainty propagation [32]. The past two years have seen efforts to apply DNNs to surrogate modeling for reliability analysis [33–35]. Similarly,
these DNNs do not have built-in UQ capability and are typically used as deterministic functions
primarily for uncertainty propagation.
For over two decades, the prognostics and health management (PHM) community has used ML
methods with built-in UQ capability as part of the health forecasting/RUL prediction process. Early
applications include the Bayesian linear regression for aircraft turbofan engine prognostics [36], the
relevance vector machine, a probabilistic kernel regression model of an identical function form to the
support vector machine [37], for battery prognostics [38–40] and general purpose prognostics [41, 42],
and GPR for battery prognostics [43–45]. UQ of ML models for PHM is perceived to have more
significance than that for engineering design, mainly due to (1) the more likely lack of sufficient
training data, given an expensive and time-consuming process to collect run-to-failure data for
training ML models for health prognostics, (2) the higher need to extrapolate to unseen operating
conditions in PHM applications, and (3) the higher criticality of consequences from incorrectly made
maintenance decisions. Two representative reviews of UQ work in the field of PHM can be found
in [46, 47]. Both reviews seem to focus on identifying uncertainty sources in health prognostics and
discussing ways to propagate these sources of uncertainty to derive the probability distribution of
RUL.
[Figure 1: Outline of this tutorial. Sec. 7 covers other topics related to UQ of ML models: 7.1 physics-informed ML, 7.2 probabilistic learning on manifolds, 7.3 interpretability of ML models for dynamic systems, and 7.4 polynomial chaos expansion.]
Within this paper, we seek to provide a comprehensive overview of emerging approaches for UQ
of ML models and a brief review of applications of these approaches to solve engineering design and
health prognostics problems. As for the ML models, our tutorial focuses on neural networks due to
their increasing popularity amongst academic researchers and industrial practitioners. In essence,
we look at methods to quantify the predictive uncertainty of neural networks, i.e., methods for UQ
of neural networks. This focus differs from the notion of “ML for UQ” where UQ of engineered
systems or processes becomes the primary task, and ML models are built only to serve the primary
purpose of UQ. Figure 1 shows an outline of this tutorial paper. Our tutorial possesses four unique
properties that distinguish it from recent reviews on UQ of ML models in the ML community
[48–50], computational physics community [51], and PHM community [46, 47].
• First, we give a detailed classification of uncertainty types, sources, and causes (Sec. 2.1)
and discuss ways to reduce epistemic uncertainty (Sec. 2.3). Our classification and discussion
complement the theoretical and data science-oriented discussions in the ML community and
provide more context for researchers and practitioners in the engineering design and PHM
communities. Additionally, we provide an easy-to-understand explanation of the process of
decomposing the total predictive uncertainty of an ML model into aleatory and epistemic
uncertainty, leveraging simple mathematical examples (Sec. 2.2).
• Third, although our tutorial focuses primarily on UQ methods for ML models, it additionally
briefly covers a collection of recent studies that apply some of the emerging UQ approaches to
solve challenging problems in engineering design (Appendix B) and health prognostics (Sec. 5).
This review is meaningful because as the adoption of ML techniques in design and prognostics
rapidly increases, we also expect to see an increasing need for UQ of ML models. Note that
deep neural network architectures, originally created for computer vision tasks based on large
image datasets, can be readily adopted in engineering design tasks, such as surrogate modeling
for reliability analysis [28, 29] and generative designs [52–54], and PHM tasks, such as fault
diagnostics [55–59] and RUL prediction [60–62]. We hope to provide observations and insights
that can help guide researchers in the engineering design and PHM communities in choosing
and implementing the UQ methods suitable for specific applications. This unique and distinct
application area distinguishes our tutorial paper from a recent review paper on UQ of ML
models [51], which explored the use of ML with UQ for solving partial differential equations
and learning neural operators.
• Fourth, we share, on GitHub, our code for implementing several UQ methods on one toy
regression example (Sec. 3.5) and two real-world case studies on health prognostics (Sec. 6).
Our implementations have been thoroughly verified to have quality on par with high-quality implementations from the ML community. Some of our implementations are directly built on
top of code shared by the ML community. We anticipate our code will allow researchers and
practitioners in the engineering design and PHM communities to replicate results, customize
existing UQ methods to specific applications, and test new methods. Moving forward, we plan
to make continuous improvements to the codebase, e.g., by polishing lines of code and adding
new methods as they become available.
Our tutorial paper is concluded in Sec. 8, where we also discuss directions for future research.
2. Types, sources, and reduction of uncertainty in ML models

This section first provides the definitions of different types of uncertainty and a summary of
their sources and causes, and then discusses the methods to decompose and reduce the predictive
uncertainty of ML models.
i. Aleatory uncertainty: It stems from natural variability and is irreducible by nature [63].
This type of uncertainty captures the noise inherent in physical systems [64]. A typical example
of aleatory uncertainty is the noise in sensor measurements, which would persist even if more
data were collected. In ML, aleatory uncertainty represents the inherently stochastic nature
of an input, an output, or the dependency between these two [19]. Example causes of aleatory
uncertainty include variability of material properties from one specimen to another, variability
of response from different runs of the same experiment, variability in classes for classification
problems, and variability of the output for regression problems. This type of uncertainty
is usually modeled as a part of the likelihood function in a probabilistic ML model, and the predictions of the ML model are then also probabilistically distributed [64]. This way of capturing the observation uncertainty (sometimes termed data uncertainty) is leveraged by several UQ methods, such as homoscedastic (Eq. (13)) and heteroscedastic (Eq. (30)) GPRs discussed in Sec. 3.1 and neural network ensemble (Eq. (30)) discussed in Sec. 3.3; a minimal code sketch of such a Gaussian likelihood appears after this list.
ii. Epistemic uncertainty: This type of uncertainty is attributed to things one could know
in principle but remain unknown in practice due to a lack of knowledge. It is reducible by
nature [63]. Common causes of epistemic uncertainty in the engineering domain include model
simplification, model-form selection, computational assumptions, lack of information about
certain model parameters, and numerical discretization. ML models generally have similar
epistemic uncertainty sources as engineering models. In particular, the epistemic uncertainty
in ML models can be further classified into the following two categories:
(a) Model-form uncertainty is due to the simplification and approximation procedures involved
in ML model construction. It is usually associated with the choices of model types, such
as the architectures and activation functions of neural networks and the model forms of
kernel functions in GPR models.
(b) Parameter uncertainty is associated with model parameters and arises from the model
calibration and training processes. Major causes of parameter uncertainty include a lack
of enough training data, inherent bias in the training data due to low data fidelity, and
difficulties in converging to optimal solutions faced by training algorithms.
Table 1 summarizes the common sources and associated causes of the above two types of un-
certainty in ML. When the test dataset falls outside the training data distribution, the ML model
predictions likely have high epistemic uncertainty since the performance of ML models is typically
poorer in extrapolation than in interpolation. When the test data in some regions of the input space
are associated with higher measurement noise, they can lead to higher aleatory uncertainty. Addi-
tionally, data of output used to train an ML model could deviate from the true values of the output.
When the error is caused by random noise of measurement, it will lead to aleatory uncertainty in
the output. However, when there is also bias in the data, the error causes additional epistemic
uncertainty. For instance, when the bias is caused by low data fidelity representing the data’s low
accuracy, this bias will result in epistemic uncertainty, which is reducible by adding high-fidelity
data for training.
Note that aleatory uncertainty could exist in the input, output, or both of an ML model. A
common practice of dealing with aleatory uncertainty in the inputs is propagating the uncertainty
to the output after constructing the ML model. The aleatory uncertainty in the output, however,
is more challenging to tackle, since it needs to be accounted for during the training of an ML model
(see more detailed discussion in Secs. 3.1 and 3.3). Uncertainty propagation of input aleatory
uncertainty to the output is not the focus of this paper. We mainly focus on accounting for aleatory
uncertainty in the output during the training of an ML model. Moreover, it is worth mentioning
that aleatory uncertainty and epistemic uncertainty often coexist, making it difficult to separate
them. Even though some efforts have been made in recent years to separate these two types of
uncertainty, for example, by using the variance decomposition method (see Sec. 2.2) that has been
extensively studied in the global sensitivity analysis field [66–68], a clean and complete separation
of these two types of uncertainty may only be possible for some cases when there are no complicated
interactions between aleatory and epistemic uncertainty sources. We are often interested in separating these two types of uncertainty because we are usually concerned about when the “prediction accuracy” of ML models becomes so low that model predictions cannot be trusted. These “break-
down” cases are typically associated with high epistemic uncertainty, the quantification of which
would help identify low-confidence predictions by the ML models and avoid making sub-optimal
or even incorrect decisions whose consequences could be very costly and even life-threatening in
safety-critical applications.
Suppose we cannot separate these two types of uncertainty and only look at their combination.
In that case, we only have access to the total predictive uncertainty of an ML model, which can be
used to measure the model’s confidence in predicting at a test point, given both noise sources in the
environment and the reducible uncertainty arising from a lack of training data. The total predictive
uncertainty is often what commercially available ML solutions produce as ML outputs (e.g., the
probability mass function of the predicted health class for health diagnostics and the variance of
the remaining useful life estimate for health prognostics).
Note that applying an activation function to the linear term \boldsymbol{\theta}^T \mathbf{x} introduces nonlinearity to the regression model, making it a building block in a neural network.
If we make a Bayesian treatment of Eq. (1), we will start with a prior distribution p(θ) over
model parameters θ and then infer a posterior from a training dataset D, p(θ|D). Essentially, we
build a Bayesian linear regression model, from which we can derive the predictive distribution of y
at a given training/validation/test point x via marginalization:
p(y|\mathbf{x}, \mathcal{D}) = \int p(y|\mathbf{x}, \boldsymbol{\theta})\, p(\boldsymbol{\theta}|\mathcal{D})\, d\boldsymbol{\theta}. \qquad (2)
To make the discussion more concrete and easier to understand, we further assume that Eq. (1) is a two-dimensional model (i.e., D = 2) and the posterior of θ is jointly Gaussian: p(\boldsymbol{\theta}|\mathcal{D}) = \mathcal{N}(\boldsymbol{\mu}_\theta, \boldsymbol{\Sigma}_\theta) with \boldsymbol{\mu}_\theta = [\mu_{\theta_1}, \mu_{\theta_2}]^T and a covariance matrix

\boldsymbol{\Sigma}_\theta = \begin{bmatrix} \sigma_{\theta_1}^2 & \rho\sigma_{\theta_1}\sigma_{\theta_2} \\ \rho\sigma_{\theta_1}\sigma_{\theta_2} & \sigma_{\theta_2}^2 \end{bmatrix}.

The predicted y then follows a Gaussian distribution given by:

p(y|\mathbf{x}, \mathcal{D}) = \mathcal{N}\left(\mu_{\theta_1} x_1 + \mu_{\theta_2} x_2,\; \sigma_{\theta_1}^2 x_1^2 + \sigma_{\theta_2}^2 x_2^2 + 2\rho\sigma_{\theta_1}\sigma_{\theta_2} x_1 x_2 + \sigma^2\right). \qquad (3)
For classification problems, we typically use differential entropy as a measure of uncertainty [72]; for regression problems, a typical choice is the variance of a Gaussian output [73]. Since we deal with a regression problem, we use variance to measure uncertainty in this example. The total predictive uncertainty is measured as the predicted variance

U_{\mathrm{total}} = \sigma_{\theta_1}^2 x_1^2 + \sigma_{\theta_2}^2 x_2^2 + 2\rho\sigma_{\theta_1}\sigma_{\theta_2} x_1 x_2 + \sigma^2. \qquad (4)

The aleatory uncertainty can be measured as the variance of the Gaussian noise (intrinsic in the data)

U_{\mathrm{aleatory}} = \sigma^2. \qquad (5)

Then, the epistemic uncertainty can be estimated by subtracting the aleatory uncertainty from the total predictive uncertainty

U_{\mathrm{epistemic}} = U_{\mathrm{total}} - U_{\mathrm{aleatory}} = \sigma_{\theta_1}^2 x_1^2 + \sigma_{\theta_2}^2 x_2^2 + 2\rho\sigma_{\theta_1}\sigma_{\theta_2} x_1 x_2. \qquad (6)
It can be seen from the above equation that the epistemic uncertainty depends on (1) the posterior variances (\sigma_{\theta_1}^2 and \sigma_{\theta_2}^2) and covariance (\rho\sigma_{\theta_1}\sigma_{\theta_2}) of the model parameters θ and (2) the values of the input variables (x_1 and x_2). The noise variance, which measures the intrinsic uncertainty in the data, does not affect the epistemic uncertainty.
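The decomposition in Eqs. (4) through (6) is easy to verify numerically. Below is a minimal sketch in which the posterior covariance Σθ, noise variance σ², and query input x are all hypothetical values chosen for illustration:

```python
# Minimal sketch: analytic uncertainty decomposition for the 2-D Bayesian
# linear regression example in Eqs. (4)-(6). All parameter values are
# hypothetical.
import numpy as np

s1, s2, rho = 0.3, 0.2, 0.4                   # posterior std devs and correlation
Sigma_theta = np.array([[s1**2, rho*s1*s2],
                        [rho*s1*s2, s2**2]])  # posterior covariance of theta
sigma2_noise = 0.1**2                         # Gaussian noise variance (sigma^2)
x = np.array([2.0, 1.0])                      # query input [x1, x2]

U_epistemic = x @ Sigma_theta @ x             # Eq. (6), equal to x^T Sigma_theta x
U_aleatory = sigma2_noise                     # Eq. (5)
U_total = U_epistemic + U_aleatory            # Eq. (4)
print(U_total, U_aleatory, U_epistemic)
```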
Using the law of total variance or variance-based sensitivity analysis [74], we can generalize Eqs. (4) through (6) for uncertainty decomposition:

\mathrm{Var}(y|\mathbf{x}, \mathcal{D}) = \mathbb{E}_{\boldsymbol{\theta}\sim p(\boldsymbol{\theta}|\mathcal{D})}\left[\mathrm{Var}(y|\mathbf{x}, \boldsymbol{\theta})\right] + \mathrm{Var}_{\boldsymbol{\theta}\sim p(\boldsymbol{\theta}|\mathcal{D})}\left[\mathbb{E}(y|\mathbf{x}, \boldsymbol{\theta})\right], \qquad (7)

where E(y|x, θ) and Var(y|x, θ) are the mean and variance of y at x for a given realization of θ. The first term on the right-hand side of Eq. (7), E_{θ∼p(θ|D)}[Var(y|x, θ)], computes the average of the variance of y, Var(y|x, θ), over p(θ|D). This term does not consider any contribution of parameter (θ) uncertainty to the variance of y, as the expectation operation, E_{θ∼p(θ|D)}[·], takes out the contribution of the variation in θ. It only captures the intrinsic data noise (ε) and therefore represents the aleatory uncertainty. The second term, Var_{θ∼p(θ|D)}[E(y|x, θ)], computes the variance of E(y|x, θ) for θ ∼ p(θ|D). The expectation operation, E(y|x, θ), essentially takes out the
classification problems, similar expressions can be derived for the uncertainty metric of differential
entropy, as demonstrated in some earlier work (see, for example, [69–71]).
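When the two terms of Eq. (7) are not available in closed form, they can be estimated by Monte Carlo sampling from the parameter posterior, as sketched below with the same hypothetical posterior as in the previous snippet:

```python
# Minimal sketch: Monte Carlo estimate of the decomposition in Eq. (7).
# Sample theta from p(theta|D); the mean of the conditional variances is the
# aleatory term, and the variance of the conditional means is the epistemic term.
import numpy as np

rng = np.random.default_rng(0)
mu_theta = np.array([1.0, -0.5])                        # hypothetical posterior mean
Sigma_theta = np.array([[0.09, 0.024], [0.024, 0.04]])  # hypothetical posterior covariance
sigma2_noise = 0.01
x = np.array([2.0, 1.0])

theta_samples = rng.multivariate_normal(mu_theta, Sigma_theta, size=100_000)
cond_means = theta_samples @ x        # E(y|x, theta) per posterior sample
U_aleatory = sigma2_noise             # E_theta[Var(y|x, theta)] is constant here
U_epistemic = cond_means.var()        # Var_theta[E(y|x, theta)]
print(U_aleatory, U_epistemic)        # epistemic term matches the analytic value
```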
Figure 2 shows an example of uncertainty decomposition using the above variance decomposition method for a mathematical problem. The true model is a two-dimensional function as depicted in the top-right graph of Fig. 2, and this function has the following closed form: y(\mathbf{x}) = \frac{1}{20}\left((1.5 + x_1)^2 + 4\right)\times(1.5 + x_2) - \frac{\sin\left(5(1.5 + x_1)\right)}{2}. In this example, the true model is assumed to be unknown and needs to be learned from training data using an ML model. Due to inherent sensor noise, observational uncertainty is present in the output of the training data. It is modeled as a random variable following a Gaussian distribution as \varepsilon(\mathbf{x}) \sim \mathcal{N}(0, 0.5|\sin(y(\mathbf{x}))|^2). Based on 50 training samples, a
GPR model is constructed. The total predicted variance of the resulting ML model is shown in the
upper left graph of Fig. 2. This graph shows that the predicted variance is high for some regions and
low for others. Since both aleatory and epistemic uncertainty exists and only the total predictive
uncertainty is visualized, it is difficult to tell if the uncertainty (the total predicted variance) in a
certain region could be further reduced.
Decomposing the total predicted variance into variances due to aleatory uncertainty and epis-
temic uncertainty, respectively, as shown in the lower half of this figure, allows us to identify regions
with high aleatory uncertainty and those with high epistemic uncertainty. If a region with high
epistemic uncertainty is the prediction region of interest, we can reduce the uncertainty to improve
the prediction confidence of the ML model (see the uncertainty reduction methods in Sec. 2.3).
However, if a region with high aleatory uncertainty and low epistemic uncertainty is the prediction
region of interest, it would be difficult to further reduce the total predictive uncertainty. In that
case, risk-based decision making needs to be employed to account for the irreducible aleatory uncer-
tainty when deriving optimal decisions (see, for example, decision-making scenarios in engineering
design, as discussed in Appendix B, and in PHM, as discussed in Sec. 5).
i Adding more training data: Having access to limited training data usually leads
to uncertainty in ML model parameters. The model-parameter uncertainty is part of
epistemic uncertainty. It can be reduced by increasing the training data size, e.g., via data
augmentation using physics-based models [75] or simply by collecting and adding more
experimental data to the training set. Let us assume the added training data is as clean
as the existing data. In that case, the epistemic uncertainty component of the predictive
uncertainty becomes smaller, while the aleatory uncertainty is expected to remain at a
similar level. Suppose that, in a different case, the added training data contains more
noise than the existing data. In that case, we still expect lower epistemic uncertainty in
regions of the input space where the added data lie but higher aleatory uncertainty in
these regions.
ii Adding physics-informed loss or physical constraints for ML model training:
Incorporating physical laws as new loss terms or imposing physical constraints, such as
boundedness, monotonicity, and convexity for interpretable latent variables for ML model
training, may allow us to obtain a more accurate estimate of ML model parameters.
Although this physics-informed/constrained ML approach may not directly reduce epis-
temic uncertainty in ML predictions, it helps to reduce the training data size required to
build a robust ML model that produces accurate predictions across a wide range of input
settings. Specifically, enforcing principled physical laws into an ML model considerably
prunes the search space of model parameters as parameters violating these constraints are
discarded immediately. As a result, physical constraints contribute to reducing parameter
uncertainty to some extent by complementing the insufficient training data and narrow-
ing down the feasible region of these parameters. This benefit becomes especially relevant
when training data is lacking and has been reported in recent review papers in various
engineering fields, such as computational physics [76], digital twin [77], and reliability
engineering [78], and in research papers published in recent special issue collections on
health diagnostics/prognostics [79] and the broader topic of reliability and safety [80]. For
over-parameterized ML models such as neural networks, it is possible to simultaneously
reduce bias and variance in the model parameters [81]. For simpler models such as GPR,
utilizing additional information such as gradient information [82], orthogonality [83], and
monotonicity [84] as constraints in kernel construction can also improve the prediction
accuracy.
iii Adopting better strategies for ML model training: If a better starting point can
be used when training an ML model, the optimization process may yield a more accurate
estimate of the model parameters. Similar to adding physics-informed loss terms, this
strategy can also indirectly reduce epistemic uncertainty. A popular example of this
strategy is transfer learning, where the model trained in one domain is used as a starting
point for training a model in another domain (e.g., transfer of weights and biases in selected
neural network layers) [85]. Another strategy is to use better optimization algorithms
when the number of parameters to be optimized is large. Global optimization in high-
dimensional search spaces is always challenging. Algorithms such as stochastic gradient
descent can have better convergence than traditional quasi-Newton methods in training
deep neural networks [86]. Reformulating model training with multiple loss terms as
minimax problems to adjust the focus of different loss terms can also improve convergence
[87].
Next, we use the two-dimensional example given in Fig. 2 to illustrate the process of reducing
epistemic uncertainty. As shown in Fig. 3, a group of training points is first generated from a known
mathematical function. Then, an ML model with only x1 as the input feature is constructed based
on this group of training data.

[Figure 3: Types of uncertainty sources in ML models and the process of reducing epistemic uncertainty (i.e., methods (b).i and (a).i described in Sec. 2.3). The figure compares the true model with three ML models: ML model 1 (input feature x1, 50 training points; epistemic uncertainty from model form and data), ML model 2 (input features x1 and x2, 50 training points; reduced epistemic uncertainty), and ML model 3 (input features x1 and x2, 100 training points; further reduced epistemic uncertainty).]

As shown in this figure, the resulting ML model (i.e., ML model 1)
has considerable epistemic uncertainty due to the combined effect of model-form uncertainty and
model-parameter uncertainty. In particular, the model-form uncertainty is caused by the fact that
the underlying model used to generate this dataset has two input variables (x1 and x2 ) while ML
model 1 only uses x1 as its input feature. Model-parameter uncertainty stems from the limited
number of training samples (i.e., 50 in this example). In order to reduce the epistemic uncertainty
(model-form uncertainty), we then include both x1 and x2 as the input features, and another ML
model labeled ML model 2 is constructed using the same group of training data. As illustrated in
Fig. 3, adding input feature x2 (i.e., strategy (b).i as described above) substantially reduces the
epistemic uncertainty in regions within the training sample distribution. If we increase the size of
the training data to 100 (i.e., strategy (a).i), a third ML model (ML model 3) can be built based
on this larger training dataset. As expected, the epistemic component of the predictive uncertainty
is shown to decrease further due to the reduction of model-parameter uncertainty.
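The experiment in Fig. 3 can be approximated in a few lines. The sketch below uses scikit-learn's GaussianProcessRegressor as the ML model and a simplified homoscedastic noise term; the input ranges, noise level, and kernel are illustrative assumptions rather than the paper's exact setup:

```python
# Minimal sketch of the Fig. 3 experiment: dropping x2 inflates epistemic
# uncertainty (ML model 1), restoring it reduces the uncertainty (ML model 2),
# and doubling the training set reduces it further (ML model 3).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def true_fn(x1, x2):
    # Closed-form true model from Sec. 2.2
    return ((1.5 + x1)**2 + 4) * (1.5 + x2) / 20 - np.sin(5 * (1.5 + x1)) / 2

rng = np.random.default_rng(1)

def fit_gpr(n, features):
    X = rng.uniform(-1.5, 1.5, size=(n, 2))                # assumed input domain
    y = true_fn(X[:, 0], X[:, 1]) + rng.normal(0, 0.1, n)  # simplified constant noise
    gpr = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
    gpr.fit(X[:, features], y)
    return gpr

x_query = np.array([[0.5, 0.5]])
for n, feats in [(50, [0]), (50, [0, 1]), (100, [0, 1])]:  # ML models 1-3
    _, std = fit_gpr(n, feats).predict(x_query[:, feats], return_std=True)
    print(f"n={n}, features={feats}: predictive std = {std[0]:.3f}")
```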
3. State-of-the-art UQ methods for ML models

Data-driven ML models, most notably neural networks, have demonstrated unprecedented per-
formance in establishing associations and correlations from large volumes of data in high-dimensional
space via multiple layers of neurons and activation functions stacked together [96]. While ML has
progressed on a fast track, it is still far away from fulfilling the stringent conditions of mission-
critical applications [15, 97], such as medical diagnostics, self-driving, and health prognostics of
critical infrastructures, where safety and correctness concerns are salient. In addition to safety and
reliability concerns, we are only able to collect a limited amount of data to train an ML model in a
broad range of applications due to practical constraints on physical experiments and computational
resources. To address some of these challenges, it is of paramount importance to establish princi-
pled and formal UQ approaches so that we can quantitatively analyze the uncertainty in ML model
predictions arising from scarce and noisy training data as well as model parameters and structures
in a sound manner. Accurate quantification of uncertainty in ML model predictions substantially
facilitates the risk management of ML models in high-stakes decision-making environments [98–101].
In particular, when dealing with input samples in the region of input space with low signal-to-
noise ratios or when handling the so-called OOD samples (input points sampled from a distribution
very different from the training distribution), most ML models are prone to produce erroneous
predictions [102]. If the uncertainty of an ML model can be quantified appropriately, it could lead to
more principled decision making by enabling ML models to automatically detect samples for which
there is high uncertainty. In fact, principled ML models are expected to yield high uncertainty (low
confidence) in their predictions when the ML model predictions are likely to be wrong [103, 104].
Having uncertainty estimates that appropriately reflect the correctness of predictions is essential to
identifying these “difficult-to-predict” samples that need to be examined cautiously, possibly with
the eyes of a domain expert. This section provides a detailed, tutorial-style introduction to state-of-
the-art methods for estimating the predictive uncertainty of data-driven ML models. As graphically
summarized in Fig. 4, these UQ methods are GPR (Sec. 3.1), Bayesian neural network (BNN) (Sec.
3.2), neural network ensemble (Sec. 3.3), and deterministic methods focusing on SNGP (Sec. 3.4).
Figure 4: Graphical comparison of six state-of-the-art UQ methods introduced in Sec. 3. These methods are GPR
(method 1), BNN via MCMC or VI (method 2), BNN via MC dropout (method 3), neural network ensemble (method
4), DNN with GPR – DNN-GPR (method 5), and SNGP (method 6). In method 1, MVN stands for the multivariate
normal distribution, or equivalently, the multivariate Gaussian distribution used in the main text. In methods (5)
and (6), SN stands for spectral normalization.
GPR starts from a Gaussian process prior for the unknown function: f (x) ∼ GP(m(x), k(x, x′ ))
[105]. This Gaussian process prior is fully characterized by a (prior) mean function m(x) : RD 7→ R
and a (prior) covariance function k(x, x′ ) : RD × RD 7→ R. The mean function m(x) defines the
prior mean of f at any given input point x, i.e.,

m(\mathbf{x}) = \mathbb{E}[f(\mathbf{x})]. \qquad (8)

The prior mean of the Gaussian process is often set as zero everywhere, m(x) = 0, for the ease of
computing the posterior. If the prior mean is a non-zero function, a trick is subtracting the prior
means from the observations and function means (which we want to predict), thereby maintaining
the “zero-mean” condition. The covariance function k(x, x′ ), also called the kernel in GPR, captures
how the function values at two input points, x and x′ , linearly depend on each other. It takes the
following form

k(\mathbf{x}, \mathbf{x}') = \mathbb{E}\left[(f(\mathbf{x}) - m(\mathbf{x}))(f(\mathbf{x}') - m(\mathbf{x}'))\right]. \qquad (9)
When the prior mean is zero, the kernel fully defines the shape (e.g., smoothness and patterns)
of functions sampled from the prior and posterior.
k(\mathbf{x}, \mathbf{x}') = \sigma_f^2 \exp\left(-\frac{\|\mathbf{x} - \mathbf{x}'\|^2}{2l^2}\right), \qquad (10)
where the two kernel parameters, or two hyperparameters of the GPR model, are the signal am-
plitude σf (σf2 is called signal variance) and length scale l. σf2 sets the upper limit of the prior
variance and covariance and should take a large value if f (x) spans a large range vertically (along
the y-axis). It can be observed that the covariance between f (x) and f (x′ ) decreases as x and x′
get farther apart. When x is extremely far from x′ , they have a very large Euclidean distance, and
thus, k(x, x′ ) ≈ 0, i.e., the covariance between their function values approaches 0. Therefore, when
predicting f at a new input point, observations far away in the input space will have minimal
influence. When a new input is OOD, it has a very low covariance with any training point, meaning
that the training observations contribute minimally to reducing the prior variance of the function
value at the OOD point, leading to high epistemic uncertainty. This kernel-enabled characteristic
has important implications for the distance awareness property of GPR. On the other extreme, if
two input points are extremely close, i.e., x ≈ x′ , then k(x, x′ ) becomes very close to its maximum,
meaning f (x) and f (x′ ) have an almost perfect correlation. Function values of neighbors being
highly correlated ensures smoothness in the GPR model, which is desirable because we often want
to fit smooth functions to data.
The squared exponential kernel in Eq. (10) uses the same length scale l across all D dimensions.
An alternative approach is to assign a different length scale ld for each input dimension xd , known
as automatic relevance determination (ARD) [107]. The resulting ARD squared exponential kernel
takes the following form
k(\mathbf{x}, \mathbf{x}') = \sigma_f^2 \exp\left(-\frac{1}{2}\sum_{d=1}^{D} \frac{(x_d - x_d')^2}{l_d^2}\right), \qquad (11)

where the (D + 1) kernel parameters are the D length scales, l1 , . . . , lD , and the signal amplitude,
σf . The ARD squared exponential kernel is also known as the anisotropic variant of the (isotropic)
squared exponential kernel. Each length scale determines how relevant an input variable is to the
GPR model. If ld is learned to take a very large value, the corresponding input dimension xd is
deemed irrelevant and contributes minimally to the regression. It is worth noting that the squared
exponential kernel is a special case of a more general class of kernels called Matérn kernels. See
Appendix A.1 for an extended discussion of kernels.
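For reference, minimal NumPy implementations of the squared exponential kernel in Eq. (10) and its ARD variant in Eq. (11) are sketched below; the vectorized pairwise-difference computation is an implementation choice, not something the text prescribes:

```python
# Minimal sketch: squared exponential kernel (Eq. (10)) and ARD variant (Eq. (11)).
import numpy as np

def sq_exp_kernel(X1, X2, sigma_f=1.0, length=1.0):
    """k(x, x') = sigma_f^2 exp(-||x - x'||^2 / (2 l^2)), for all input pairs."""
    sq_dists = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return sigma_f**2 * np.exp(-0.5 * sq_dists / length**2)

def ard_sq_exp_kernel(X1, X2, sigma_f=1.0, lengths=(1.0,)):
    """ARD variant: one length scale l_d per input dimension."""
    diff = (X1[:, None, :] - X2[None, :, :]) / np.asarray(lengths)
    return sigma_f**2 * np.exp(-0.5 * np.sum(diff**2, axis=-1))

X = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
print(sq_exp_kernel(X, X))                           # covariance decays with distance
print(ard_sq_exp_kernel(X, X, lengths=[1.0, 10.0]))  # large l_2 makes x2 nearly irrelevant
```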
Now we can draw random samples of the function values at the N∗ input points X∗ from GP(0, k(x, x′ )) by sampling from the following multivariate Gaussian distribution: f∗ ∼ N (0, KX∗ ,X∗ ).
Each sample (f∗ ) consists of N∗ function values, i.e., f∗ = f (X∗ ) = [f (x∗1 ), . . . , f (x∗N∗ )]T . The most
Figure 5: Sample functions drawn from a Gaussian process prior (a) and posterior (b). The GPR model uses the squared exponential kernel with a length scale (l) of 1 and a signal amplitude (σf ) of 1, and a Gaussian observation model with a noise standard deviation (σε ) of 0.1. The means are shown collectively as a solid blue curve, and ∼95% confidence intervals (means plus and minus two standard deviations) are shown collectively as a light blue shaded area. 20 training observations are generated by corrupting a sine function with a white Gaussian noise term, y = sin(0.9x) + ε with ε ∼ N (0, 0.1²); these observations are shown as red dots.
commonly used numerical procedure to sample from a multivariate Gaussian distribution consists of
two steps: (1) generate random samples (vectors) from the multivariate (D-dimensional) standard
normal distribution, N (0, I), and (2) transform these random samples linearly based on the mean
vector
of the target multivariate Gaussian and the Cholesky decomposition of its covariance matrix
(see further details in Sec. A.2 (Gaussian Identities) of Ref. [105]). Figure 5(a) shows three sample
functions randomly drawn from a Gaussian process prior.
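A minimal sketch of this two-step sampling procedure is given below; the hyperparameter values match those in Fig. 5, while the small jitter added to the diagonal is a standard numerical-stability assumption (related to the "nugget" regularization discussed later in this section):

```python
# Minimal sketch: draw sample functions from a Gaussian process prior via
# (1) standard normal draws and (2) a linear transform by the Cholesky factor.
import numpy as np

def sq_exp_kernel(X1, X2, sigma_f=1.0, length=1.0):
    sq_dists = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return sigma_f**2 * np.exp(-0.5 * sq_dists / length**2)

rng = np.random.default_rng(0)
X_star = np.linspace(-5, 5, 200).reshape(-1, 1)  # N* query points

K = sq_exp_kernel(X_star, X_star) + 1e-8 * np.eye(len(X_star))  # jitter for stability
L = np.linalg.cholesky(K)                        # K = L L^T
z = rng.standard_normal((len(X_star), 3))        # step 1: N(0, I) draws
f_star = L @ z                                   # step 2: zero prior mean + L z
# Each column of f_star is one sample function, as in Fig. 5(a).
```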
y = f(\mathbf{x}) + \varepsilon, \qquad (13)

where ε is a zero-mean Gaussian noise, i.e., ε ∼ N (0, σε2 ). The above additive Gaussian form will
also be commonly used for other UQ methods in the upcoming sections. The N noisy observations
can be conveniently written in a vector form: yt = [y1 , . . . , yN ]T ∈ RN . Note that these observations
are sometimes called targets in a regression setting. In GPR, we want to infer the input (x) - target
(y) relationship from the noisy observations; we may also be interested in learning the input (x) -
output (f ) relationship in some cases.
The Gaussian observation model in Eq. (13) portrays an observation as two components: a
signal term and a noise term. The signal term f (x) carries the epistemic uncertainty (see Sec.
2.1) about f (x), which can be reduced with additional observations of f at a finite set of training
points (e.g., x1 , . . . , xN ). The noise term ε represents the inherent mismatch between signal and
observation (e.g., due to measurement noise; see Table 1), which is a type of aleatory uncertainty
(see Sec. 2.1) and cannot be reduced from additional observations. In some cases, observations may
be noise-free, corresponding to a special case where σε = 0. In other words, we have access to the
true function (f ) output in these cases.
Now it is time to look at how to make predictions of function values f∗ for N∗ new, unseen
input points X∗ , given a collection of training observations, D = {(x1 , y1 ) , (x2 , y2 ) , . . . , (xN , yN )},
equivalently expressed as D = {Xt , yt }. These predictions can be made by drawing samples from
the Gaussian process posterior, p(f |D). We denote the function values at the training inputs as
ft = f (Xt ) = [f (x1 ), . . . , f (xN )]T . Again, according to the definition of a Gaussian process, the
function values at the training inputs and those at the new inputs are jointly Gaussian (prior without
using observations), written as
" # " #!
ft KXt ,Xt KXt ,X∗
∼N 0, , (14)
f∗ KX∗ ,Xt KX∗ ,X∗
where KXt ,Xt is the covariance matrix between the f values at the training points, expressed by
simply replacing X in Eq. (12) with Xt , KXt ,X∗ is the covariance matrix between the training
points and new points (also called the cross-covariance matrix), KX∗ ,Xt = KT Xt ,X∗ , and KX∗ ,X∗ is
the covariance matrix between the new points.
As shown in the Gaussian observation model in Eq. (13), we assume all observations contain an
additive independent and identically distributed (i.i.d.) Gaussian noise with zero mean and variance
σε2 . Under this assumption, the covariance matrix for the training observations needs the addition
of the noise variance to each diagonal element, i.e., yt ∼ N (0, KXt ,Xt + σε2 I), where I denotes the
identity matrix of size N whose diagonal elements are ones and off-diagonal elements are zeros. It
then follows that the training observations (known) and the function values at the new input points
(unknown) follow a slightly revised version of the multivariate Gaussian prior shown in Eq. (14),
expressed as

\begin{bmatrix} \mathbf{y}_t \\ \mathbf{f}_* \end{bmatrix} \sim \mathcal{N}\left(\mathbf{0}, \begin{bmatrix} K_{X_t,X_t} + \sigma_\varepsilon^2 I & K_{X_t,X_*} \\ K_{X_*,X_t} & K_{X_*,X_*} \end{bmatrix}\right). \qquad (15)
Now we want to ask the following question: “given the training dataset D and new test points
X∗ , what is the posterior distribution of the new, unobserved function values f∗ ?”. It has been
shown that conditionals of a multivariate Gaussian are also multivariate Gaussian (see, for example,
Sec. 3.2.3 of the probabilistic ML book [73]). Therefore, the posterior distribution p(f∗ |D, X∗ ) is
multivariate Gaussian. The posterior mean f ∗ and covariance cov(f∗ ) can be derived based on the
well-known formulae for conditional distributions of multivariate Gaussian, leading to the following:
\bar{\mathbf{f}}_* = K_{X_t,X_*}^T \left(K_{X_t,X_t} + \sigma_\varepsilon^2 I\right)^{-1} \mathbf{y}_t, \qquad (16)

and

\mathrm{cov}(\mathbf{f}_*) = K_{X_*,X_*} - K_{X_t,X_*}^T \left(K_{X_t,X_t} + \sigma_\varepsilon^2 I\right)^{-1} K_{X_t,X_*}. \qquad (17)
It is worth noting that this posterior distribution is also a Gaussian process, called a Gaussian
process posterior. So we have f (x)|D ∼ GP(mpost (x), kpost (x, x′ )), where the mean and kernel
functions of this Gaussian process posterior take the following forms:
m_{\mathrm{post}}(\mathbf{x}) = K_{X_t,\mathbf{x}}^T \left(K_{X_t,X_t} + \sigma_\varepsilon^2 I\right)^{-1} \mathbf{y}_t, \qquad (18)

and

k_{\mathrm{post}}(\mathbf{x}, \mathbf{x}') = k(\mathbf{x}, \mathbf{x}') - K_{X_t,\mathbf{x}}^T \left(K_{X_t,X_t} + \sigma_\varepsilon^2 I\right)^{-1} K_{X_t,\mathbf{x}'}. \qquad (19)
It can be observed from Eqs. (16) and (17) that the key to making predictions with a Gaus-
sian process posterior is calculating the three covariance matrices, KXt ,Xt , KXt ,X∗ , and KX∗ ,X∗ .
Difficulties in computation usually arise when performing a matrix inversion on a large covariance
matrix KXt ,Xt with many training observations. Much effort has been devoted to solving this
matrix inversion problem, resulting in many approximation methods, such as covariance tapering
[108, 109] and low-rank approximations [110, 111], mostly applied to handle large spatial datasets.
Another important issue associated with the matrix inversion is that the covariance matrix could
become ill-conditioned, most likely due to some training points being too close and providing re-
dundant information. Two common strategies to invert an ill-conditioned covariance matrix are (1)
performing the Moore–Penrose inverse or pseudoinverse using the singular value decomposition [30]
and (2) applying “nugget” regularization, i.e., adding a small positive constant (e.g., 10−6 ) to each
diagonal element of the covariance matrix to make it better conditioned while having a negligible
effect on the calculation [112, 113]. Oftentimes, adding the variance of the Gaussian noise σε2 , as
shown in Eqs. (16) and (17), serves the purpose of “nugget” regularization.
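Putting Eqs. (16) and (17) together, the following minimal sketch computes the GPR posterior with a Cholesky factorization instead of an explicit matrix inverse; the noise variance added to the diagonal doubles as the "nugget" regularization just discussed, and the kernel and hyperparameter values are illustrative assumptions:

```python
# Minimal sketch: GPR posterior mean (Eq. (16)) and covariance (Eq. (17)).
import numpy as np

def sq_exp_kernel(X1, X2, sigma_f=1.0, length=1.0):
    sq_dists = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return sigma_f**2 * np.exp(-0.5 * sq_dists / length**2)

def gp_posterior(X_t, y_t, X_s, sigma_f=1.0, length=1.0, sigma_eps=0.1):
    K_tt = sq_exp_kernel(X_t, X_t, sigma_f, length) + sigma_eps**2 * np.eye(len(X_t))
    K_ts = sq_exp_kernel(X_t, X_s, sigma_f, length)
    K_ss = sq_exp_kernel(X_s, X_s, sigma_f, length)
    L = np.linalg.cholesky(K_tt)                           # avoids explicit inversion
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_t))  # (K_tt + noise)^-1 y_t
    mean = K_ts.T @ alpha                                  # Eq. (16)
    v = np.linalg.solve(L, K_ts)
    cov = K_ss - v.T @ v                                   # Eq. (17)
    return mean, cov

rng = np.random.default_rng(0)
X_t = rng.uniform(-5, 5, size=(20, 1))                     # same setup as Fig. 5
y_t = np.sin(0.9 * X_t[:, 0]) + rng.normal(0, 0.1, 20)
X_s = np.array([[0.0], [4.0], [20.0]])                     # the last point is OOD
mean, cov = gp_posterior(X_t, y_t, X_s)
print(np.sqrt(np.diag(cov)))  # posterior std saturates toward sigma_f far from data
```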
Following the numerical procedure described in Sec. 3.1.1.c, we can generate random samples
of f from the Gaussian process posterior. For example, we can sample function values at the N∗
input points, x∗1 , . . . , x∗N∗ , by sampling from a multivariate Gaussian with mean f ∗ and covariance
cov(f∗ ). It is possible that the Cholesky decomposition needs to be performed on an ill-conditioned
posterior covariance matrix cov(f∗ ). This issue can be tackled by applying “nugget” regularization
or adopting an alternative sampling procedure that centers around defining and sampling from a
zero-mean, unconditional Gaussian process, as described in Refs. [114–116]. Figure 5(b) shows three
sample functions drawn from a Gaussian process posterior after collecting 20 noisy observations of
a 1D sine function.
We have been looking at the posterior of noise-free function values. To derive the posterior over
the noisy observations, p(y∗ |D, X∗ ), we add a vector of i.i.d. zero-mean Gaussian noise variables to
the f∗ posterior, producing a multivariate Gaussian with the same means (Eq. (16)) and a different
covariance matrix whose diagonal elements increase by σε2 compared to the covariance matrix in Eq.
(17). It is also straightforward to make predictions on a noise-free Gaussian process using Eqs. (16)
and (17). We can simply take out the noise variance term σε2 I and use y∗ = f∗ . As is discussed in
Appendix B, GPR with noise-free observations is widely used to build cheap-to-evaluate surrogates
of computationally expensive computer simulation models in engineering design applications such
as model calibration, reliability analysis, sensitivity analysis, and optimization. The observations in
these applications are free of noise because we have direct access to the true underlying function (i.e.,
the computer simulation model) that we want to approximate. In contrast, as will be discussed
in Sec. 5, many applications of GPR in health prognostics require the consideration of noisy
observations, as we often do not have access to the true targets (e.g., health indicator) but can only
obtain noisy measurements or estimates of these targets.
Now let us look back at the distance awareness property of GPR. Suppose a new input point
x∗ keeps moving away from the training distribution D. In that case, the Euclidean distance
between x∗ and any input point xi in D, i.e., dist(x∗ , xi ), ∀i = 1, . . . , N , constantly increases. All
elements in the cross-covariance matrix and, more strictly, the cross-covariance vector kXt ,x∗ =
[k(x1 , x∗ ), . . . , k(xN , x∗ )]T quickly approach zero. Given that neither the training-data covariance
matrix KXt ,Xt nor the new-data covariance (variance in this case) k(x∗ , x∗ ) experiences any changes,
the posterior mean f ∗ will approach zero (i.e., the prior mean), and more importantly, the posterior
variance var(f∗ ) will approach its maximum allowed value σf2 . This observation of the GPR model
behavior is significant for UQ because it means that a GPR model naturally yields high-uncertainty
predictions for OOD samples falling outside of the training distribution.
e. Optimizing hyperparameters.
Suppose we choose the squared exponential kernel as the covariance function. In that case,
we will have three unknown hyperparameters that need to be estimated based on training data.
These parameters are the characteristic length scale (l), signal amplitude (σf ), and noise standard
deviation (σε ), i.e., θ = [l, σf , σε ]T . Estimating these hyperparameters can be regarded as training
a GPR model. As it is often difficult and of little added value to obtain the full Bayesian posterior
of θ, we typically choose to obtain a maximum a posteriori probability (MAP) estimate of θ, a point
estimate at which the log marginal likelihood log p(yt |Xt , θ) reaches the largest value. Assuming
the prior is uniform, the log marginal likelihood function of the posterior takes the following form
[105]:
\log p(\mathbf{y}_t|X_t, \boldsymbol{\theta}) = \underbrace{-\frac{1}{2}\mathbf{y}_t^T \left(K_{X_t,X_t} + \sigma_\varepsilon^2 I\right)^{-1}\mathbf{y}_t}_{\text{Model-data fit}} \; \underbrace{-\frac{1}{2}\log\left|K_{X_t,X_t} + \sigma_\varepsilon^2 I\right|}_{\text{Complexity penalty}} \; \underbrace{-\frac{N}{2}\log(2\pi)}_{\text{Constant}}, \qquad (20)
The first term on the right-hand side, the so-called “model-data fit” term, quantifies how well
the model fits the training observations. The second term, called the “complexity penalty” term,
quantifies the model complexity where a smoother covariance matrix with a smaller determinant
is preferred [105]. The third and last term is a normalization constant and indicates that the
likelihood of data tends to decrease as the training data size increases [31]. It should be noted
that the computational complexity of evaluating Eq. (20) is O(N³), dominated by the inversion of the covariance matrix KXt ,Xt , and the space complexity is O(N²) to store this matrix. Hyperparameter optimization
significantly influences the accuracy of GPR. See Appendix A.2 for an illustrated example on the
effect of hyperparameter optimization.
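The log marginal likelihood in Eq. (20) can likewise be evaluated with a Cholesky factorization so that neither the inverse nor the determinant is formed explicitly; in a full implementation this function would be maximized over θ = [l, σf, σε] with a gradient-based optimizer, and the data and hyperparameter values below are only illustrative:

```python
# Minimal sketch: evaluating the log marginal likelihood in Eq. (20).
import numpy as np

def log_marginal_likelihood(X_t, y_t, sigma_f, length, sigma_eps):
    sq_dists = np.sum((X_t[:, None, :] - X_t[None, :, :]) ** 2, axis=-1)
    Ky = sigma_f**2 * np.exp(-0.5 * sq_dists / length**2) + sigma_eps**2 * np.eye(len(X_t))
    L = np.linalg.cholesky(Ky)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_t))
    fit = -0.5 * y_t @ alpha                  # model-data fit term
    complexity = -np.sum(np.log(np.diag(L)))  # equals -0.5 log|Ky|
    constant = -0.5 * len(y_t) * np.log(2 * np.pi)
    return fit + complexity + constant

rng = np.random.default_rng(0)
X_t = rng.uniform(-5, 5, size=(20, 1))
y_t = np.sin(0.9 * X_t[:, 0]) + rng.normal(0, 0.1, 20)
print(log_marginal_likelihood(X_t, y_t, sigma_f=1.0, length=1.0, sigma_eps=0.1))
```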
3.1.2. UQ capability and some limitations of Gaussian process regression
GPR is capable of capturing both aleatory and epistemic uncertainty. For regression problems,
the posterior variance for a query or test point, shown in Eq. (17), is an elegant expression of the
total predictive uncertainty. The variance of the additive white noise, σε2 , is a measure of the aleatory
uncertainty. If this noise variance is assumed to be a constant (learned from the observations), the
GPR model is called a “homoscedastic” model. In contrast, a heteroscedastic GPR model represents
the noise variance as a function of the input variables x [117]. Assuming a squared exponential kernel
is used, the epistemic uncertainty is determined mainly by kXt ,x∗ , the covariance vector between the
training points and query point, as discussed at the end of Sec. 3.1.1.d. The farther away the query
point is from the training points, the smaller the elements of kXt ,x∗ and the larger the epistemic
uncertainty. Therefore, using a distance-based covariance (or kernel) function and according to
conditionals of a multivariate Gaussian, a GPR model produces low epistemic uncertainty at query
or test points close to observations used for training and high epistemic uncertainty at query points
far away from any training observation. This distance awareness property makes GPR an ideal
choice for highly reliable OOD detection for problems of low dimensions and small training sizes.
The aleatory and epistemic uncertainty components of the posterior variance determine how wide
the confidence interval of a model prediction at the query point should be, reflecting the total
predictive uncertainty.
Despite the highly desirable distance awareness property and OOD detection capability, GPR
does not always produce posterior variances that reliably measure the predictive uncertainty. The
reliability of UQ by GPR depends on many factors, such as the test point where a prediction is
made, the behavior of the underlying function to be fitted, and the choices of the kernel and hy-
perparameters. For example, a necessary condition for reliable UQ by a GPR model is properly
choosing its kernel and optimizing the resulting hyperparameters (e.g., the variance of the additive
white noise, σε2 , measures the aleatory uncertainty and should be optimized for accurate UQ). As
discussed earlier, GPR can detect OOD test points, especially those far from the training distribu-
tion. However, the high posterior variances at these “extreme” test points may still not accurately
measure the prediction accuracy. Specifically, as a test point moves away from the training distri-
bution, the posterior variance will start to “saturate” at its peak value, as discussed in detail in Sec.
3.1.1.d; in contrast, the prediction error at this test point may continue to rise due to an increasing
degree of extrapolation, and so should an “ideal” estimate of the predictive uncertainty. Although
GPR may not produce reliable UQ in such an extreme extrapolation scenario, it is important to take
a step back and keep in mind that extrapolating to an extensive degree goes against the purpose
for which GPR was originally introduced, i.e., interpolation [105, 118].
Standard GPR generally does not scale well to large training datasets (large N ) because its
training complexity is O(N 3 ). This scalability issue originates from the computation of the inverse
and determinant of the N × N covariance matrix KXt ,Xt during model training (i.e., hyperparam-
eter optimization), as shown in Eq. (20). This scalability issue motivated considerable effort in
examining local and global approximation methods to scale GPR to large training datasets while
maintaining prediction accuracy and UQ quality. Interested readers may refer to a recent review on
scalable GPR in [119]. Another limitation of GPR is its lack of scalability to high input dimensions
(high D). This limitation stems from two issues. First, training a GPR model in a high-dimensional
input space typically requires optimizing a large number of hyperparameters. This is because an
ARD kernel form often needs to be chosen to deal with high-dimensional problems. As a result,
the number of hyperparameters increases linearly with the number of input variables (e.g., a GPR
model with the ARD squared exponential kernel shown in Eq. (11) has (D + 2) hyperparameters).
A direct consequence is that a large quantity of training samples (high N ) is needed to optimize the
many hyperparameters, leading to a large covariance matrix. As discussed earlier, inverting this
large covariance matrix and calculating its determinant have high computational complexity. Sec-
ond, maximizing the log marginal likelihood (see Eq. (20)) with a large number of hyperparameters
becomes a high-dimensional optimization problem. Solving this high-dimensional problem requires
many function evaluations, each involving one-time covariance matrix inversion and determinant
calculation. Attempts to improve GPR’s scalability to high-dimensional problems include (1) pro-
jecting the original, high-dimensional input onto a much lower-dimensional subspace and building a
GPR model in the subspace [120, 121], (2) defining a new kernel with a substantially smaller num-
ber of parameters identified with partial least squares [122], and (3) adopting an additive kernel in
place of a tensor product kernel in Eq. (11) [123]. More detailed discussions on scaling GPR to
high-dimensional problems can be found in a recent review [124].
As a final note, since this tutorial focuses on UQ of neural networks, it is relevant and interesting
to discuss connections between GPR and neural networks. Considerable efforts have been made to
establish such connections. Some of these efforts are briefly discussed in Appendix A.3.
For example, a commonly used loss function for regression problems is the mean squared error
(MSE) defined below:
$$\boldsymbol{\theta}^{\star}_{\text{MSE}} = \underset{\boldsymbol{\theta}}{\operatorname{argmin}}\; \frac{1}{N} \sum_{i=1}^{N} \left\| y_i - f(\mathbf{x}_i; \boldsymbol{\theta}) \right\|_2^2. \tag{22}$$
With the gradient of f accessible through back-propagation [125], the loss minimization is typi-
cally solved numerically using stochastic gradient descent [126, 127]. Once θ⋆ is found, prediction
at a new point x∗ can be made via ŷ∗ = f (x∗ ; θ⋆ ). These predictions, however, are single-valued
and do not have quantified uncertainty.
A Bayesian training [128–130] of DNNs, also known as Bayesian deep learning [69, 107, 131–
133], produces a Bayesian neural network or BNN. The Bayesian approach views θ as a random
variable with the goal to find the entire distribution of plausible θ values that could have generated
the observed data D. Following Bayes’ rule, the prior probability density function (PDF) p(θ)
(“before”-uncertainty in θ) is updated to the posterior PDF p(θ|D) (“after”-uncertainty in θ)
conditioned on the training data $\mathcal{D}$. Mathematically, we have:
$$p(\boldsymbol{\theta} \mid \mathcal{D}) = \frac{p(\mathbf{y} \mid \boldsymbol{\theta}, \mathbf{X})\, p(\boldsymbol{\theta})}{p(\mathbf{y} \mid \mathbf{X})}, \tag{23}$$
where we separate the training dataset D = {X, y} into their inputs X = {x1 , x2 , · · · , xN } and
corresponding outputs y = {y1 , y2 , · · · , yN }. Note that in the GPR section (Sec. 3.1), Xt and X∗
denote matrices comprising input points and yt and y∗ denote vectors consisting of observations. In
this BNN section, X and y denote sets of input points and observations, respectively, to be consistent
with the literature on Bayesian inference and BNN. In the above, p(y|θ, X) is the likelihood and
p(y|X) is the marginal likelihood (model evidence). The Bayesian problem and the BNN entail
solving for the posterior p(θ|D). We further discuss each term in the Bayes’ rule in Eq. (23) below.
The prior p(θ) can be formed in an informative or non-informative manner. The former allows
one to inject domain knowledge and expert opinions on the probable values of θ, formally through
the methods of prior elicitation [134]. However, these methods are difficult to use on DNN param-
eters θ due to their abstract and high-dimensional nature. The latter generates a prior following
guiding principles for desirable properties (e.g., Jeffreys’ prior [135], maximum entropy prior [136]).
In practice, isotropic Gaussians are often adopted for convenience, but caution must be taken to
consider their pitfalls and appropriateness as BNN priors [137].
The likelihood p(y|θ, X) commonly follows a data (observation) model with an additive indepen-
dent Gaussian noise (similar to Eq. (13) in the GPR case): yi = f (xi ; θ) + ε, where ε ∼ N (0, σε2 ).
In the implementation, we often work with the log-likelihood, which is computed as:
$$\log p(\mathbf{y} \mid \boldsymbol{\theta}, \mathbf{X}) = \sum_{i=1}^{N} \log p(y_i \mid \boldsymbol{\theta}, \mathbf{x}_i) = -N \log\!\left(\sqrt{2\pi}\, \sigma_{\varepsilon}\right) - \frac{1}{2\sigma_{\varepsilon}^2} \sum_{i=1}^{N} \left\| y_i - f(\mathbf{x}_i; \boldsymbol{\theta}) \right\|_2^2. \tag{24}$$
We can see that finding the mode of the Gaussian (log)-likelihood above (i.e. the θ that maxi-
mizes Eq. (24)) is equivalent to the MSE minimization in Eq. (22); hence, θ⋆MSE is also known as
the maximum likelihood estimator. Furthermore, adding a regularization term to Eq. (22) serves the role of a prior, and in a similar fashion, the minimizer of a regularized loss is known as the maximum a posteriori (MAP) estimator (e.g., L2 regularization yields the MAP estimator under a Gaussian prior, and L1 regularization under a Laplace prior).
The marginal likelihood $p(\mathbf{y} \mid \mathbf{X})$ in the denominator of Eq. (23) is a (normalization) constant for the posterior that integrates the numerator: $p(\mathbf{y} \mid \mathbf{X}) = \int p(\mathbf{y} \mid \boldsymbol{\theta}, \mathbf{X})\, p(\boldsymbol{\theta})\, \mathrm{d}\boldsymbol{\theta}$. As it requires a
non-trivial integration, this term is highly difficult to estimate. Fortunately, Bayesian computation
algorithms are often designed to avoid the marginal likelihood altogether; we will describe examples
of these algorithms in the upcoming sections.
Lastly, once the Bayesian posterior p(θ|D) is obtained, the posterior uncertainty can be propa-
gated through the BNN at a new point x∗ via, for example, MC sampling. Importantly, we draw the
distinction between the posterior-pushforward and posterior-predictive distributions. The posterior-
pushforward is p(ŷ∗ |x∗ , D) = p(f (x∗ ; θ)|x∗ , D). It describes the uncertainty on ŷ∗ (i.e. the “clean”
prediction from the DNN) as a result of the uncertainty in $\boldsymbol{\theta}$. In contrast, the posterior-predictive is $p(y_* \mid \mathbf{x}_*, \mathcal{D}) = p\left(\left[f(\mathbf{x}_*; \boldsymbol{\theta}) + \varepsilon\right] \mid \mathbf{x}_*, \mathcal{D}\right)$; it describes the uncertainty on $y_*$ (i.e., the noisy observed quantity). Hence, the former incorporates epistemic parametric uncertainty, while the latter further augments aleatory data uncertainty to the new prediction. The two distributions can be easily
confused with each other, with the danger of improper UQ assessments where one might incorrectly
expect the posterior-pushforward uncertainty to “capture” the noisy observation data.
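The distinction can be made concrete with a few lines of Python. In the sketch below, every name is hypothetical: `theta_samples` stands in for draws from the posterior p(θ|D) obtained by any of the algorithms introduced next, and the toy linear model replaces a DNN forward pass.

```python
# Posterior-pushforward vs. posterior-predictive via MC sampling (schematic).
import numpy as np

rng = np.random.default_rng(1)

def f(x_star, theta):
    # Stand-in for a DNN forward pass f(x; theta); here a tiny linear model.
    return theta[0] * x_star + theta[1]

# Pretend these were drawn from p(theta | D), e.g., by MCMC or VI.
theta_samples = rng.normal(loc=[1.0, 0.5], scale=0.1, size=(5000, 2))
sigma_eps = 0.2           # noise std of the Gaussian observation model
x_star = 2.0

# Pushforward: uncertainty in the "clean" prediction y_hat_* only.
pushforward = np.array([f(x_star, th) for th in theta_samples])

# Predictive: additionally injects the aleatory noise epsilon.
predictive = pushforward + sigma_eps * rng.standard_normal(len(pushforward))

print(pushforward.std(), predictive.std())  # the predictive spread is wider
```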
In the following sections, we introduce several major types of Bayesian computational methods
for solving the Bayesian posterior: Markov chain Monte Carlo or MCMC (posterior sampling),
variational inference (posterior approximating), and MC dropout.
3.2.2. Variational inference
In this section, we will start by defining the optimization problem that describes the best posterior
approximation, then introduce some examples of numerical algorithms to solve the VI problem.
Denoting a variational distribution (for approximating the posterior) using q(θ; λ) parameterized
by λ, VI seeks the best posterior-approximation q(θ; λ⋆ ) that minimizes the Kullback-Leibler (KL)
divergence between $q(\boldsymbol{\theta}; \boldsymbol{\lambda})$ and $p(\boldsymbol{\theta} \mid \mathcal{D})$, that is:
$$\boldsymbol{\lambda}^{\star} = \underset{\boldsymbol{\lambda}}{\operatorname{argmin}}\; D_{\text{KL}}\left[\, q(\boldsymbol{\theta}; \boldsymbol{\lambda}) \,\|\, p(\boldsymbol{\theta} \mid \mathcal{D}) \,\right]. \tag{25}$$
A popular choice for the variational distribution is the independent (mean-field) Gaussian: $q(\boldsymbol{\theta}; \boldsymbol{\lambda}) = \prod_{k=1}^{K} q(\theta_k; \boldsymbol{\lambda}_k) = \prod_{k=1}^{K} \mathcal{N}(\theta_k; \mu_k, \sigma_k^2)$, where $K$ is the total number of parameters in the DNN. The independence structure allows the joint PDF to be factored into a product of univariate Gaussian marginals, and so the variational parameters are $\boldsymbol{\lambda} = \{\mu_k, \sigma_k\}$, $k = 1, \ldots, K$, encompassing the mean and standard deviation of each component of $\boldsymbol{\theta}$, for a total of $2K$ variational
parameters. As a result, mean-field simplifies to a diagonal global covariance matrix (instead of
dense covariance) in the approximate posterior, and it is unable to capture any correlation among
the θk ’s. More expressive representations of q(θ; λ) are also possible, for example via normalizing
flows [148] and transport maps [149] that parameterize the mapping from the posterior random
variable θ to a standard normal reference random variable.
Given the variational distribution $q(\boldsymbol{\theta}; \boldsymbol{\lambda})$, Eq. (25) can be further simplified as follows:
$$\begin{aligned} \boldsymbol{\lambda}^{\star} &= \underset{\boldsymbol{\lambda}}{\operatorname{argmin}}\; \mathbb{E}_{q(\boldsymbol{\theta}; \boldsymbol{\lambda})}\left[ \ln q(\boldsymbol{\theta}; \boldsymbol{\lambda}) - \ln p(\boldsymbol{\theta} \mid \mathcal{D}) \right] \\ &= \underset{\boldsymbol{\lambda}}{\operatorname{argmin}}\; \mathbb{E}_{q(\boldsymbol{\theta}; \boldsymbol{\lambda})}\left[ \ln q(\boldsymbol{\theta}; \boldsymbol{\lambda}) - \ln p(\boldsymbol{\theta}) - \ln p(\mathbf{y} \mid \boldsymbol{\theta}, \mathbf{X}) + \ln p(\mathbf{y} \mid \mathbf{X}) \right] \\ &= \underset{\boldsymbol{\lambda}}{\operatorname{argmin}}\; \left\{ D_{\text{KL}}\left[\, q(\boldsymbol{\theta}; \boldsymbol{\lambda}) \,\|\, p(\boldsymbol{\theta}) \,\right] - \mathbb{E}_{q(\boldsymbol{\theta}; \boldsymbol{\lambda})}\left[ \ln p(\mathbf{y} \mid \boldsymbol{\theta}, \mathbf{X}) \right] \right\}, \end{aligned} \tag{26}$$
where going from the second to the third equation, the log-denominator’s contribution Eq(θ;λ) [ln p(y|X)] =
ln p(y|X) is omitted since it is constant with respect to both λ and θ and its exclusion does not
change the minimizer. The resulting expression in Eq. (26) is the negative of the well-known Evi-
dence Lower Bound (ELBO). The first term of ELBO acts as a regularization to keep q(θ; λ) close
to the prior. The second term of ELBO involves the log-likelihood of generating the observed data
under DNN parameters θ ∼ q(θ; λ); hence it measures the expected model-data fit.
In general, it is impossible to evaluate the ELBO analytically, and Eq. (26) must be solved
numerically. The simplest approach is to use MC sampling to estimate the ELBO, which only
entails sampling $\boldsymbol{\theta} \sim q(\boldsymbol{\theta}; \boldsymbol{\lambda})$. Often, further simplifications can be made by analytically computing
the first term, which involves only the prior and variational distribution. Furthermore, the gradient
of ELBO with respect to λ may be derived (e.g., see [133] for Gaussian q) or obtained through
automatic differentiation, allowing one to take advantage of gradient-based optimization algorithms
(e.g., stochastic gradient descent) to solve Eq. (26).
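As a deliberately tiny illustration, the PyTorch sketch below optimizes a mean-field Gaussian over a single weight by minimizing a one-sample MC estimate of the negative ELBO in Eq. (26); the model, data, and known noise level are our own toy assumptions, not a prescription for practical BNNs.

```python
# One-sample MC estimate of the negative ELBO with the reparameterization
# trick, for a one-parameter linear model y = theta * x + eps.
import torch

torch.manual_seed(0)
x = torch.linspace(-2, 2, 50).unsqueeze(1)
y = 1.5 * x + 0.3 * torch.randn_like(x)       # toy data, true slope 1.5
sigma_eps = 0.3                                # assumed known noise std

mu = torch.zeros(1, requires_grad=True)        # variational mean
rho = torch.zeros(1, requires_grad=True)       # std = softplus(rho) > 0
prior = torch.distributions.Normal(0.0, 1.0)   # standard normal prior
opt = torch.optim.Adam([mu, rho], lr=0.05)

for step in range(2000):
    q = torch.distributions.Normal(mu, torch.nn.functional.softplus(rho))
    theta = q.rsample()                        # reparameterized sample
    loglik = torch.distributions.Normal(theta * x, sigma_eps).log_prob(y).sum()
    kl = torch.distributions.kl_divergence(q, prior).sum()
    loss = kl - loglik                         # negative ELBO, cf. Eq. (26)
    opt.zero_grad(); loss.backward(); opt.step()

print(mu.item())                               # close to the true slope 1.5
```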
The Stein variational gradient descent (SVGD) [150] is another VI variant offering a flexible
particle approximation to the posterior distribution. SVGD leverages the relationship between the gradient of the KL divergence in Eq. (25) and the Stein discrepancy, the latter of which can be approximated using a set of particles. An update procedure can then be formed to iteratively move the particles along a perturbation direction, $\boldsymbol{\theta}_i^{\ell+1} \leftarrow \boldsymbol{\theta}_i^{\ell} + \epsilon_{\ell}\, \hat{\boldsymbol{\varphi}}^{\ast}(\boldsymbol{\theta}_i^{\ell})$, where $\boldsymbol{\theta}_i^{\ell}$, $i = 1, \ldots, N_p$, denotes the $i$-th particle at the $\ell$-th iteration, $\epsilon_{\ell}$ is the learning rate, and the perturbation direction is defined as:
$$\hat{\boldsymbol{\varphi}}^{\ast}(\boldsymbol{\theta}) = \frac{1}{N_p} \sum_{j=1}^{N_p} \left[ k(\boldsymbol{\theta}_j^{\ell}, \boldsymbol{\theta})\, \nabla_{\boldsymbol{\theta}_j^{\ell}} \ln p(\boldsymbol{\theta}_j^{\ell} \mid \mathcal{D}) + \nabla_{\boldsymbol{\theta}_j^{\ell}} k(\boldsymbol{\theta}_j^{\ell}, \boldsymbol{\theta}) \right], \tag{27}$$
with $k(\cdot, \cdot)$ being a positive definite kernel (e.g., the radial basis function kernel in Eq. (10)). Notably,
the gradient of the log-posterior in the above equation can be evaluated via the sum of gradients
of log-likelihood and log-prior, since the gradient of the log-marginal-likelihood with respect to θ
is zero. The overall effect is an iterative transport of a set of particles to best match the target
posterior distribution $p(\boldsymbol{\theta} \mid \mathcal{D})$. Building upon SVGD, advanced methods such as Stein variational Newton [151, 152], which makes use of second-order (Hessian) information, and projected SVGD [153], which finds low-dimensional data-informed subspaces, have also been proposed.
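To illustrate the mechanics of Eq. (27), the bare-bones NumPy sketch below (our own toy construction) transports particles toward a 1D Gaussian target posterior whose log-density gradient is assumed known; the bandwidth, step size, and particle count are arbitrary choices.

```python
# SVGD on a toy 1D target posterior p(theta | D) = N(2, 1).
import numpy as np

rng = np.random.default_rng(2)

def grad_log_p(theta):
    return -(theta - 2.0)                  # gradient of log N(theta; 2, 1)

theta = rng.normal(-5.0, 1.0, size=50)     # Np = 50 particles, poor init
eps, h = 0.1, 0.5                          # learning rate, kernel bandwidth
for _ in range(500):
    diff = theta[:, None] - theta[None, :]            # theta_j - theta_i
    K = np.exp(-diff ** 2 / (2 * h ** 2))             # k(theta_j, theta_i)
    gradK = -diff / h ** 2 * K                        # grad_{theta_j} k
    # Eq. (27): kernel-weighted driving term plus repulsive term.
    phi = (K * grad_log_p(theta)[:, None] + gradK).mean(axis=0)
    theta = theta + eps * phi

print(theta.mean(), theta.std())           # approximately 2 and 1
```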
Figure 6: Illustration of Bayesian posterior obtained from (left) MCMC, (middle) SVGD, and (right) mean-field
Gaussian VI for a simple low-dimensional Bayesian inference test problem.
Figure 6 compares the different Bayesian posteriors obtained from a simple low-dimensional
Bayesian inference test problem using MCMC, SVGD, and mean-field Gaussian VI. MCMC and
SVGD provide sample/particle representations of the posterior distribution, while VI produces an
analytical Gaussian approximation of the PDF. Both MCMC and SVGD are able to capture non-
Gaussian and correlated structure, although SVGD is more restrictive in the number of particles it
can use due to higher memory requirement. However, SVGD and VI are more scalable to higher θ
dimensions than MCMC.
We note that another variant of VI can arise from the reverse KL divergence DKL [ p(θ|D) || q(θ; λ) ]
(in contrast to the DKL [ q(θ; λ) || p(θ|D) ] from Eq. (25)). Notable algorithms from this formulation
include expectation propagation [154], assumed density filtering [155], and moment matching [156];
in particular, expectation propagation has been shown to be quite effective in logistic-type models
in general.
3.2.3. MC dropout
Although the Bayesian approach offers an elegant and principled way to model and quantify
the uncertainty in neural networks, it typically comes with a prohibitive computational cost. As
introduced earlier, MCMC and VI are two commonly used methods to perform Bayesian inference
over the parameters of a neural network. However, Bayesian inference with MCMC or VI in DNNs suffers from a heavy computational burden and poor scalability. Specifically, in the case of MCMC, estimating the uncertainty of a neural network prediction for a given input requires drawing a large number of samples from the posterior distributions of thousands or even millions of neural network parameters and propagating these samples through the neural network [157]. Compared with MCMC, VI is much faster and has better scalability, as it recasts the inference of the posterior distributions of neural network parameters as an optimization problem. However, VI doubles the number of parameters to be estimated for the same neural network (e.g., a mean and a standard deviation for each weight under a mean-field Gaussian). In addition, deriving and formulating the optimization problem is intricate, and solving it in high dimensions can consume a large amount of time before convergence [21].
Beyond MCMC and VI, further scalability can be achieved through the MC dropout method.
Initially proposed as a regularization technique to prevent the overfitting of DNNs [158], MC dropout
has been shown to approximate the posterior predictive distribution under a particular Bayesian
setup [21]. Procedurally, MC dropout follows the same deterministic DNN training in Eq. (21),
except that it forms new sparsely connected DNNs from the original DNN (see method 3 in Fig. 4)
by multiplying every weight with an independent Bernoulli random variable. Hence, each weight
has some probability of becoming zero (i.e., the weight being dropped). These Bernoulli random
variables are re-sampled (i.e., a new, randomized sparse DNN is formed) for every training sample
and for every forward pass of the model. At test time, the prediction at a new point x∗ can also be
repeated with multiple forward passes each with a new, randomized sparse DNN resulting from the
dropout operation. An ensemble of predictions can thus be obtained to estimate the uncertainty.
Practical implementation of MC dropout in probabilistic programming languages is often realized
by adding a dropout layer after each fully-connected layer.
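As an illustration of this test-time procedure, the PyTorch sketch below keeps the dropout masks active at inference and aggregates multiple stochastic forward passes; the architecture and dropout rate are arbitrary choices, and the network is assumed to contain no layers (e.g., batch normalization) whose behavior would also change in training mode.

```python
# MC dropout at test time: repeated stochastic forward passes (schematic).
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(
    nn.Linear(1, 64), nn.ReLU(), nn.Dropout(p=0.2),   # dropout after each
    nn.Linear(64, 64), nn.ReLU(), nn.Dropout(p=0.2),  # fully-connected layer
    nn.Linear(64, 1),
)
# ... train `model` here exactly as an ordinary deterministic network ...

def mc_dropout_predict(model, x_star, n_passes=100):
    model.train()     # keep the Bernoulli dropout masks active at test time
    with torch.no_grad():
        preds = torch.stack([model(x_star) for _ in range(n_passes)])
    return preds.mean(dim=0), preds.std(dim=0)

mean, spread = mc_dropout_predict(model, torch.tensor([[0.5]]))
```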
The connection from MC dropout to a Bayesian setup is detailed in [21, 159]. Those works
show that the loss function following the dropout procedure corresponds to a single-sample MC
approximation to the VI objective (i.e., the ELBO in Eq. (26)), where the variational posterior of the
DNN weights is a Bernoulli mixture of two independent Gaussians of fixed covariance. Furthermore,
the prior of each DNN weight is assumed to follow a standard normal distribution, and the likelihood
is based on the additive Gaussian noise model in Eq. (24). Established upon such a setup, in MC
dropout, the variational distribution q (θ; λ) for approximating the posterior distribution p(θ|D)
becomes a factorization over the weight matrices Wi of the layers 1 to L. Mathematically, the
variational distribution q (θ; λ) takes the following multiplicative form:
$$q(\boldsymbol{\theta}; \boldsymbol{\lambda}) = \prod_{i=1}^{L} q_{\mathbf{M}_i}(\mathbf{W}_i), \tag{28}$$
where qMi (Wi ) denotes the density associated with the weight matrices Wi of layer i, and under
MC dropout, it emerges as a Gaussian mixture model consisting of two independent Gaussian
components with a fixed and identical variance, as shown below [21, 159]:
$$q_{\mathbf{M}_i}(\mathbf{W}_i) = \underbrace{p_i\, \mathcal{N}\!\left(\mathbf{M}_i,\, \sigma^2 \mathbf{I}_i\right)}_{\text{First Gaussian}} + \underbrace{(1 - p_i)\, \mathcal{N}\!\left(\mathbf{0},\, \sigma^2 \mathbf{I}_i\right)}_{\text{Second Gaussian}}. \tag{29}$$
In the above, Mi is the mean of the first Gaussian, which is a vectorization of ni−1 × ni values
pertaining to the weight matrix Wi of size ni−1 × ni (ni denotes the number of units in the i-th
layer; when i = 0, it denotes the number of inputs), σ is the standard deviation parameter specified
by the end user, Ii is the identity matrix, N denotes the normal distribution, and pi (pi ∈ [0, 1])
is the dropout rate associated with the set of links connecting two consecutive layers of the neural
network. Under this VI perspective, MC dropout corresponds to optimizing $\boldsymbol{\lambda} = \{\mathbf{M}_i\}_{i=1}^{L}$, while both $\sigma$ and $p_i$ have fixed user-chosen values and are not part of the variational parameter set.
In the MC dropout implementation, for each element of $\mathbf{W}_i$, we sample a binary variable $\upsilon$ according to a Bernoulli distribution with a prescribed dropout rate $p_i$, that is, $\upsilon \sim \text{Bernoulli}(p_i)$. If $\upsilon = 0$, the corresponding link connecting the $i$-th and $(i+1)$-th layers is dropped out. This
operation corresponds to choosing one of the two Gaussians from the mixture model in Eq. (29),
and hence MC dropout can serve as an approximation to the Bayesian posterior in BNNs.
A major advantage of MC dropout is that it is very straightforward to implement, requiring only a few lines of modification to insert the dropout operations into an existing DNN setup, often conveniently available as a dropout layer in many programming environments. Furthermore, its implementation is agnostic of the neural network architecture and can be readily adopted for many popular types of neural networks such as the convolutional neural network (CNN) and recurrent neural
network (RNN) [159, 160]. Another major advantage of MC dropout is its low computational cost
and high scalability since its training procedure is effectively identical to an ordinary, non-Bayesian
training of DNNs but with randomized sparse networks. These appealing properties collectively
contribute to the growing popularity of MC dropout in practice.
MC dropout also has some limitations. One disadvantage is that the quality of the uncertainty
generated by MC dropout is highly dependent on the choice of several hyperparameters [161–163],
such as the dropout rate and number of dropout layers. Thus, these hyperparameters need to be
fine-tuned. Along this front, we also have similar findings in Section 3.5 that MC dropout exhibits poor stability with respect to the dropout rate, the number of training epochs, and the number of trainable network parameters (see Appendix D for more details). Beyond this instability, the uncertainty produced by MC dropout exhibits a consistent difficulty in detecting OOD instances. Note that other approximate inference methods, such as MFVI, have a pathology that is slightly different from that of MC dropout with respect to the soundness of the quantified uncertainty; see Section 3.5 for more details. As highlighted by Foong et al. [164], the pathology of UQ in these approximate methods is solely attributed to the restrictiveness of the approximating family, while exact inference methods, such as MCMC, do
not have such a problem. Another disadvantage of MC dropout is that users do not have the
option to inject their prior knowledge by specifying the prior or likelihood function because there
is no mechanism for MC dropout to integrate such information—as a result, MC dropout can
only represent a narrow spectrum of Bayesian problems. A further side effect of this limitation
is that users may be hindered from critically thinking about the prior and likelihood altogether,
which may lead to claims of a Bayesian solution without actually having a Bayesian problem setup.
Finally, some researchers [165] have argued that MC dropout is not Bayesian because the variational
distribution fails to converge to ground-truth posterior distribution on closed-form benchmarks.
Let us take a closer look at the aleatory uncertainty, more specifically, the observational noise
pertaining to each target observation. The simplest case is that we assume the same amount of
noise or aleatory uncertainty for every input xi , also known as homoscedasticity or homogeneity of
variance in statistics (similar to the homoscedastic case for GPR discussed in Sec. 3.1). To represent
the relationship between input xi and observation yi , we can use the Gaussian observation model
given in Eq. (13), substituting x with xi and y with yi . In this model, a random noise term ε, often
modeled as a zero-mean Gaussian noise, shifts the target away from the true value f (xi ) to the
observed value yi . In this simplest case, the variance of random noise ε takes the same value σε2 for
every input and is thus a constant. Although we could learn σε together with the neural network
parameters θ, this simplest case may not be realistic as some regions of the input space may have
larger measurement noise than other regions.
A more realistic case is one where the noise variance depends on xi . The basic idea is to tailor
aleatory uncertainty to each input, making the uncertainty input-dependent. This heteroscedastic
case is also briefly discussed in Sec. 3.1 where heteroscedastic GPR is the focus of the discussion.
The observation model now becomes the following:
$$y_i = f(\mathbf{x}_i) + \varepsilon(\mathbf{x}_i), \quad \varepsilon(\mathbf{x}_i) \sim \mathcal{N}\!\left(0,\, \sigma_{\varepsilon}^2(\mathbf{x}_i)\right), \tag{30}$$
where the variance of the noise term ε (xi ), σε2 (xi ), is now a function of xi . It turns out that a
neural network can be trained to learn the mapping from x to σε2 [22, 23]. It then follows that
we can train a neural network with parameters θ that learns to predict both the mean µ (xi ) and
variance $\sigma^2(\mathbf{x}_i)$ of the target for each input $\mathbf{x}_i$. This neural network has two outputs, the predicted mean $\hat{\mu}(\mathbf{x}_i; \boldsymbol{\theta})$ and variance $\hat{\sigma}^2(\mathbf{x}_i; \boldsymbol{\theta})$, which fully characterize a Gaussian predictive distribution, i.e., $\hat{y}_i \sim \mathcal{N}\!\left(\hat{\mu}(\mathbf{x}_i; \boldsymbol{\theta}),\, \hat{\sigma}^2(\mathbf{x}_i; \boldsymbol{\theta})\right)$.
Before optimizing the network parameters θ, we need to define a proper scoring rule that mea-
sures the quality of predictive (aleatory) uncertainty. For regression problems, a typical choice of
a proper scoring rule is the likelihood function p ( yi | xi ; θ) whose logarithmic transformation takes
the following form [22, 172]:
$$\log p(y_i \mid \mathbf{x}_i; \boldsymbol{\theta}) = -\frac{\log \hat{\sigma}^2(\mathbf{x}_i; \boldsymbol{\theta})}{2} - \frac{\left(y_i - \hat{\mu}(\mathbf{x}_i; \boldsymbol{\theta})\right)^2}{2 \hat{\sigma}^2(\mathbf{x}_i; \boldsymbol{\theta})} - \text{constant}. \tag{31}$$
Given a training dataset consisting of N input-output pairs, D = {(x1 , y1 ) , (x2 , y2 ) , · · · , (xN , yN )},
$\boldsymbol{\theta}$ can be optimized by minimizing the following negative log-likelihood (NLL) loss on the entire training data, which is equivalent to maximizing the log-likelihood in Eq. (31) summed over all $N$ training samples:
$$\mathcal{L}(\boldsymbol{\theta}) = \sum_{i=1}^{N} \left[ \frac{\log \hat{\sigma}^2(\mathbf{x}_i; \boldsymbol{\theta})}{2} + \frac{\left(y_i - \hat{\mu}(\mathbf{x}_i; \boldsymbol{\theta})\right)^2}{2 \hat{\sigma}^2(\mathbf{x}_i; \boldsymbol{\theta})} \right], \tag{32}$$
where the constant term in Eq. (31) is omitted for brevity because it has nothing to do with the
optimization of θ.
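To ground Eq. (32), the following PyTorch sketch trains a mean-variance network with this NLL loss; the two-head architecture, the softplus link that keeps the predicted variance positive, and the toy heteroscedastic data are all illustrative choices rather than requirements.

```python
# A mean-variance network trained by minimizing the NLL loss in Eq. (32).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MeanVarianceNet(nn.Module):
    def __init__(self, d_in=1, d_hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU())
        self.mu_head = nn.Linear(d_hidden, 1)        # predicted mean
        self.var_head = nn.Linear(d_hidden, 1)       # predicted variance

    def forward(self, x):
        h = self.body(x)
        var = F.softplus(self.var_head(h)) + 1e-6    # keep variance positive
        return self.mu_head(h), var

def nll_loss(mu, var, y):
    # Eq. (32): sum of 0.5*log(variance) + 0.5*squared error / variance.
    return (0.5 * torch.log(var) + 0.5 * (y - mu) ** 2 / var).sum()

torch.manual_seed(0)
x = torch.linspace(-2, 2, 200).unsqueeze(1)
y = torch.sin(2 * x) + 0.1 * (1 + x.abs()) * torch.randn_like(x)  # input-dependent noise
net = MeanVarianceNet()
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for _ in range(1000):
    mu, var = net(x)
    loss = nll_loss(mu, var, y)
    opt.zero_grad(); loss.backward(); opt.step()
```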
3.3.2. Epistemic uncertainty: using an ensemble of independently trained networks
As discussed in Sec. 3.3.1, the neural network ensemble approach captures aleatory uncertainty
by training a neural network that produces a Gaussian output (or another type of probability dis-
tribution) for each input. This modeling process improves over traditional deterministic approaches
that only produce a point estimate. Plus, the network-predicted variance varies according to the
input, making it possible to capture input-dependent observational noise. One limitation is that
minimizing the loss function in Eq. (32) yields a single vector of network parameters. Therefore, the
resulting neural network cannot capture the uncertainty related to the network parameters because
all parameters are deterministic. This treatment becomes an issue when only limited training data are available; such cases are more common in practice than having abundant training data, and when training data are limited, epistemic uncertainty is high and cannot be ignored. One widely
used way to capture epistemic uncertainty is to assume and estimate uncertainty in the parame-
ters of a neural network model, also known as model parameter uncertainty or network parameter
uncertainty.
After tuning the neural network parameters $\boldsymbol{\theta}$, at the time of prediction, each individual neural network generates a pair of outputs $(\hat{\mu}(\mathbf{x}_*), \hat{\sigma}(\mathbf{x}_*))$ with respect to an unseen instance $\mathbf{x}_*$, where $\hat{\sigma}(\mathbf{x}_*)$ explicitly quantifies the aleatory uncertainty in the model prediction arising from the random noise $\varepsilon(\cdot)$ associated with the target value. Next, to quantify the epistemic uncertainty associated with the neural network parameters $\boldsymbol{\theta}$, we can build an ensemble of neural networks, for example, by adopting the randomization strategy (random parameter initialization and mini-batch sampling) that attains a diverse set of neural networks. Suppose the neural network ensemble is composed of $M$ individual neural networks; then the ensemble model produces $M$ pairs $(\hat{\mu}_m(\mathbf{x}_*), \hat{\sigma}_m(\mathbf{x}_*))$, $m = 1, 2, \cdots, M$, for the given input $\mathbf{x}_*$. The $M$ pairs of predictions can be viewed as a mixture of Gaussian distributions. Thus, we can use a single Gaussian distribution to approximate the mixture of Gaussian distributions as long as the mean and variance of the single Gaussian distribution are the same as those of the mixture. Assuming that each individual neural network in the ensemble carries an equal weight, we have the mean and variance of the ensemble-predicted single Gaussian distribution as:
$$\begin{aligned} \mu(\mathbf{x}_*) &= \frac{1}{M} \sum_{m=1}^{M} \hat{\mu}_m(\mathbf{x}_*), \\ \sigma^2(\mathbf{x}_*) &= \frac{1}{M} \sum_{m=1}^{M} \left( \hat{\sigma}_m^2(\mathbf{x}_*) + \hat{\mu}_m^2(\mathbf{x}_*) \right) - \mu^2(\mathbf{x}_*). \end{aligned} \tag{33}$$
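The combination rule in Eq. (33) amounts to a few lines of NumPy. The sketch below (with made-up member predictions) also splits the total variance into an averaged noise term and the spread of the member means, anticipating the aleatory/epistemic decomposition discussed next.

```python
# Moment matching of an equally weighted Gaussian mixture, per Eq. (33).
import numpy as np

def combine_ensemble(mus, sigmas):
    """mus, sigmas: shape-(M,) per-network mean/std predictions at x_*."""
    mus, sigmas = np.asarray(mus), np.asarray(sigmas)
    mu = mus.mean()                                      # ensemble mean
    var = (sigmas ** 2 + mus ** 2).mean() - mu ** 2      # total variance
    aleatory = (sigmas ** 2).mean()        # average predicted noise variance
    epistemic = var - aleatory             # variance of the member means
    return mu, var, aleatory, epistemic

# e.g., M = 5 member predictions at a single test input
print(combine_ensemble([0.9, 1.1, 1.0, 1.3, 0.7],
                       [0.2, 0.25, 0.2, 0.3, 0.22]))
```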
In the ensemble of neural networks, both the aleatory and epistemic uncertainty can be measured
in a straightforward way. Specifically, the aleatory uncertainty arising from the noise associated with
the observation $y$ is reflected in the predicted variance $\hat{\sigma}_m^2(\mathbf{x}_*)$ of each individual neural network. In contrast, the epistemic uncertainty associated with the network structure and parameters is manifested mainly as the spread of the mean predictions $\hat{\mu}_m(\mathbf{x}_*)$ across the $M$ neural networks, because each individual neural network is initialized with a random set of weights and biases and trained with random mini-batches of data in the gradient descent algorithm. Such randomness introduces a sufficient amount of diversity among the individual models. Thus, the spread of the individual mean predictions $\hat{\mu}_m(\mathbf{x}_*)$, which dominates the epistemic uncertainty, characterizes the structural and parametric uncertainty pertaining to the neural network.
An interesting question about neural network ensembles is why training multiple neural networks
of an identical architecture independently with just random initializations can capture epistemic
uncertainty. The answer lies in that training a neural network with a large number of parameters
(e.g., weights and biases) is an extremely intricate large-scale optimization problem in a high-
dimensional space, and stochastic gradient descent-based algorithms oftentimes converge to different
sets of parameter values $\boldsymbol{\theta}$ that are locally optimal [173]. As mentioned earlier, network training involves two sources of randomness: (1) random parameter initialization at the beginning of model training and (2) random perturbations of the training data to produce mini-batches of data in stochastic gradient descent. As a result, the locally optimal parameters $\boldsymbol{\theta}$ vary from one trained
neural network to another. Suppose M independent training runs give rise to M different local
minima for the network parameters, which then lead to the creation of M individual members of
an ensemble, as shown in Eq. (33). From the optimization perspective, the randomness in the
initialization of neural network parameters and the sampling of mini-batch data encourages the
optimization algorithm to explore different modes of the function space of a neural network. As a
result, the predicted means of these M networks may differ substantially in some regions of the input
space, while the predicted variances may still be similar, resulting in high epistemic uncertainty.
These regions are typically located outside the training data distribution. Test samples falling into
these regions are called OOD samples (as previously defined in Sec. 1), where ensemble predictions
must be taken cautiously and are often untrustworthy.
Deterministic UQ methods such as SNGP seek distance-aware latent representations of a neural network such that distances between points in the input space are
preserved in the hidden space. The need for distance-aware latent representations comes from a
recently reported phenomenon called feature collapse [176], where some OOD points in the input
space are mapped through feature extraction to in-distribution points in the hidden space, leading
to overconfident predictions at these OOD points. Feature collapse must be combatted for feature
representations in the hidden space to be useful for epistemic uncertainty estimation and OOD
detection. One option is imposing a bi-Lipschitz constraint on the feature extractor (i.e., a neural
network excluding its output layer). The term “bi-Lipschitz” means a two-sided constraint on the
Lipschitz constant of a feature extractor that determines how much distances in the input space
contract (small Lipschitz, feature collapse) and expand (large Lipschitz, small changes in input
resulting in drastic changes in latent features).
We now briefly describe the math pertaining to a bi-Lipschitz constraint. Suppose we take any
two input points x and x′ from a training dataset and let hnn (·) denote a function mapping an
input into latent features (i.e., right after the activation function in the last hidden layer of a neural
network). A bi-Lipschitz constraint on the mapping function h for any training input pairs looks
like:
$$\text{Lip}_{\text{lb}}\, \|\mathbf{x} - \mathbf{x}'\|_{\text{input}} \;\leq\; \|h_{nn}(\mathbf{x}) - h_{nn}(\mathbf{x}')\|_{\text{hidden}} \;\leq\; \text{Lip}_{\text{ub}}\, \|\mathbf{x} - \mathbf{x}'\|_{\text{input}}, \tag{34}$$
where Liplb and Lipub are, respectively, the lower and upper bounds imposed on the Lipschitz
constant of the feature extractor hnn (·), and || · ||input and || · ||hidden are, respectively, the distance
metrics chosen for the input and hidden spaces. Setting the lower bound Liplb ensures that latent
representations are distance sensitive, i.e., if x and x′ are relatively far apart in the input space,
they also have a relatively large distance in the hidden space. This sensitivity regularization allows
the feature extractors to preserve input distances and directly counteracts the feature collapse issue
by preventing OOD points from overlapping with in-distribution feature representations. Setting
the upper bound Lipub ensures that hidden representations are smooth, i.e., small distance changes
in the input space do not result in drastically large distance changes in the hidden space. This
smoothness enforcement leads to feature extractors that generalize well and are robust to adversarial
attacks. As for the distance metric, the Euclidean distance dist(·, ·) is often a good choice for
measuring distances between input points and even those between hidden representations, except
for image-like data. The Euclidean distance has recently been adopted as the distance metric in
several deterministic UQ methods [176, 179, 180].
The feature-space regularization via a bi-Lipschitz constraint shown in Eq. (34) can be imple-
mented during model training by applying either of the following two methods: (1) gradient penalty,
originally introduced for training generative adversarial networks (GANs) [181] and then adopted
for deterministic uncertainty estimation [176], and (2) spectral normalization, originally proposed
again for training GANs [182] and then adopted for deterministic uncertainty estimation [177–180].
In the rest of this subsection, we will briefly go over the application of spectral normalization in
SNGP. We will also discuss the use of GPR as the output layer by SNGP to produce an uncertainty
estimate based on distances in the “regularized” hidden space.
3.4.2. Spectral normalization for distance preservation in hidden space
The algorithm of SNGP enforces the lower bound of the Lipschitz constant in Eq. (34) simply
by using network architectures with residual connections (e.g., residual networks) while imposing
the upper bound using spectral normalization. Briefly, for each hidden layer, spectral normalization
first calculates the spectral norm of the weight matrix W (i.e., the largest singular value of W),
denoted as ||W||2 , and then normalizes W using its spectral norm as:
$$\widehat{\mathbf{W}}_{\text{sn}} = \gamma \cdot \frac{\mathbf{W}}{\|\mathbf{W}\|_2}, \tag{35}$$
where γ is the upper bound of the spectral norm (i.e., ∥W∥2 ≤ γ), also called the spectral norm
upper bound, which effectively enforces an upper bound on the Lipschitz constant of the mapping
function in the hidden layer. The weight matrix needs to be spectral-normalized only when its
spectral norm exceeds the upper bound, i.e., when ||W||2 > γ [179]. Introducing the spectral norm
upper bound gives rise to the flexibility to balance the expressiveness and distance awareness of the
resulting spectral-normalized feature extractor. Specifically, when γ takes a small value (γ < 1),
the feature extractor tends to contract toward identity mapping, thereby limiting the ability of
the feature extractor to learn complex nonlinear mapping, critically important for achieving high
prediction accuracy on the training distribution; when γ is large (γ ≫ 1), the feature extractor
is allowed to expand and be more expressive but may not preserve input distances. However, in
reality, this flexibility may become a limitation against adoption, as γ needs to be carefully tuned
to balance accuracy/generalizability and distance awareness.
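The PyTorch sketch below (our own illustration; actual SNGP implementations fold this into each layer's training step) applies the soft normalization of Eq. (35) to a single weight matrix, using power iteration to estimate the spectral norm and rescaling only when the bound γ is exceeded.

```python
# Soft spectral normalization per Eq. (35), with power iteration.
import torch

def soft_spectral_normalize(W, gamma=2.0, n_power_iter=5):
    u = torch.randn(W.shape[0])
    for _ in range(n_power_iter):                  # power iteration
        v = torch.nn.functional.normalize(W.t() @ u, dim=0)
        u = torch.nn.functional.normalize(W @ v, dim=0)
    sigma = torch.dot(u, W @ v)                    # estimate of ||W||_2
    if sigma > gamma:
        return gamma * W / sigma                   # rescale, Eq. (35)
    return W                                       # already within the bound

W = torch.randn(64, 64)
W_sn = soft_spectral_normalize(W)
print(torch.linalg.matrix_norm(W_sn, ord=2))       # approximately <= gamma
```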
The GPR output layer of SNGP operates on the learned hidden representations through a kernel of the form $k_{nn}(\mathbf{x}, \mathbf{x}') = k(h_{nn}(\mathbf{x}), h_{nn}(\mathbf{x}'))$, where $k(\cdot, \cdot)$ is a standard kernel applied in the hidden space. When the neural network is a DNN (e.g., with > 5 hidden layers), this kernel can sometimes be
called a deep kernel. The prior and posterior derivations follow the standard procedures described in
Secs. 3.1.1.c and 3.1.1.d. Essentially, we perform a GPR in the learned, distance-preserving feature
space instead of the input space. The resulting GPR model yields the posterior variance of a test
input x∗ based on its Euclidean distances from all training points in the hidden space, leveraging
the distance awareness property of GPR, extensively discussed in Secs. 3.1.1.b and 3.1.1.d, to make
the output layer distance aware. Intuitively speaking, let us suppose x∗ keeps moving away from
the training distribution. The value of the hidden-space kernel between any training input xi and
x∗ , knn (xi , x∗ ), will become smaller and smaller given the distance preservation property of hnn (·).
At some point, this kernel value will quickly approach zero. As a result, the posterior variance at
x∗ will keep increasing and eventually approach its maximum value σf2 . This scenario suggests the
distance awareness property of SNGP makes it an ideal tool for OOD detection.
To make inference computationally tractable, SNGP applies two approximations to the GPR
output layer: (1) expanding the GPR model into simpler Bayesian linear models in the space of
random Fourier features and (2) approximating the resulting posterior via Laplace approximation
[179]. It is noted that another deterministic UQ method named DUE also uses spectral normal-
ization plus residual connections to encourage a bi-Lipschitz mapping to the hidden space and
GPR in the output layer. The only major difference is that DUE uses variational inducing point
approximation for GPR in place of the random Fourier feature expansion [178].
Figure 7: The uncertainty maps produced by six different UQ methods on the toy 2D regression problem: (a) Gaussian process regression (GPR), (b) mean-field variational inference (MFVI), (c) Monte Carlo (MC) dropout, (d) neural network ensemble, (e) deep neural network with a Gaussian process regression output layer (DNN-GPR), and (f) spectral-normalized neural Gaussian process (SNGP). The two clusters colored in purple represent the training data, while the cluster colored in red indicates a cluster of OOD instances. The background in each 2D plot is color-coded according to the predictive uncertainty of the corresponding UQ method, with yellow (blue) indicating high (low) uncertainty.
These training samples form two separate clusters with no overlap in between, as shown in Fig.
7. As can be observed in both Eq. (37) and Fig. 7, the two clusters have an identical variance-
covariance matrix and differ significantly only in the mean vector. We now apply the previously
introduced UQ methods on the 800 training samples. For those methods requiring neural networks,
the UQ methods are built on a backbone of similar residual neural network architectures with four
64-neuron residual layers. For example, in the case of neural network ensemble, a Gaussian layer is
inserted at the end of a residual neural network; while in the case of MC dropout, dropout with a
rate of 0.2 is applied at the end of each residual layer.
To test the UQ performance of the different ML models, we generate a uniform meshgrid consisting of 40,000 (= 200 × 200) samples with $x_1$ and $x_2$ spanning the range $[-15, 15]$. Next, an uncertainty heat map is constructed to visualize the predictive uncertainty of each trained ML model within the domain. Figure 7 shows the uncertainty heat maps obtained by the six different UQ methods
on this toy problem. At a quick glance, both GPR and SNGP exhibit a desirable behavior in
producing high quality predictive uncertainty: the predictive uncertainty is quite low for samples
in the proximity of the in-distribution/training data (dots in pink color). At the same time, both
GPR and SNGP generate high predictive uncertainty when test sample point [x1 , x2 ]T moves far
away from the training data clusters. As a result, both GPR and SNGP successfully assign high uncertainty to the 200 OOD samples (dots in red color at the bottom left of Fig. 7), which are randomly generated to test the OOD detection capability of the different UQ techniques.
Unlike GPR and SNGP, the other four UQ methods have a relatively poor performance in quan-
tifying predictive uncertainty. As can be observed in Fig. 7 (c-e), MC dropout, deep ensemble, and
DNN-GPR assign low uncertainty for samples that are quite far away from the training data. As a
consequence, these three UQ techniques are likely to fail to detect the 200 OOD samples whose pre-
dictions are associated with relatively low uncertainty, as shown in the bottom-left corners of Fig. 7
(c), (d), and (e). Besides the lack of OOD detection ability, these three UQ techniques share another feature in common: their uncertainty output is more sensitive to the (hypothetical) boundary that separates the two clusters of training data, while they exhibit substantially faulty behavior when establishing the boundary between trustworthy and untrustworthy regions around each cluster of training data itself. More specifically, the predictive uncertainty generated by these three UQ techniques has low sensitivity to how distant a test sample is from the training data. Regarding mean-field variational inference (MFVI), its predictive uncertainty increases with the distance from the two training clusters; however, MFVI assigns nearly identical uncertainty to test points lying between the two training clusters and to points near the training data, which contradicts our expectation. This suggests that MFVI suffers from a lack of in-between uncertainty due to its approximation of Bayesian inference, a finding also confirmed by Foong et al. [164]. Consequently, the predictive uncertainty produced by these UQ techniques is unprincipled because it does not match our expectation that uncertainty should clearly distinguish in-domain and out-of-domain data.
The significant difference in the uncertainty heat map across different UQ methods is primarily
attributed to their distance awareness capability. MC dropout, deep ensemble, and DNN-GPR do
not have the ability to properly quantify the distance of an input sample away from the training
data manifold. Instead, the predictive uncertainty at an input sample quantified by MC dropout,
deep ensemble, and DNN-GPR seems to be established upon the distance of the input sample from
a decision boundary separating the two clusters of training data. Therefore, it is not surprising
to see all these three UQ methods assign low uncertainty to the 200 OOD samples even though
they are quite far from the training data. Distinct from MC dropout, deep ensemble, and DNN-GPR, GPR and SNGP are equipped with a good sense of awareness with respect to the distance between an input sample and the training data manifold. As a result, they are comparatively more principled in the sense that the uncertainty is much higher for input samples that lie far from the training data. Finally, even though both DNN-GPR and SNGP have GPR as the output layer, the unconstrained feature extractor of DNN-GPR is free to discard distance information in the hidden space (recall the feature collapse phenomenon), while SNGP imposes spectral normalization on the latent representation of the input sample, thus making the output layer distance sensitive in the hidden space. In a broad context, the sound UQ by GPR and
SNGP substantially facilitates the identification of OOD samples, establishing a trustworthy region
in the input space where ML predictions are reliable.
3.6. Summary
The numerical example in Sec. 3.5 demonstrates the performance difference among different UQ
methods with an emphasis on OOD detection. Comprehensive comparison of these UQ methods may
help better guide users to select appropriate UQ methods for specific ML applications. To this end,
we construct a table (Table 2) to qualitatively compare these methods along multiple dimensions,
such as the quality of UQ, computational costs in training and testing, etc. First, regarding the calibration accuracy of these UQ methods, GPR and SNGP generally outperform the alternative UQ methods, which is also confirmed in the preceding numerical example. For the computational
cost associated with training an ML model, implementing a Bayesian neural network via MCMC or
variational inference incurs a relatively higher computational cost than MC dropout, as MC dropout
consumes nearly the same amount of computational time as training a regular neural network. In
terms of scalability, it is well known that GPR suffers from the curse of dimensionality, so
training and testing GPR models may be computationally very expensive for high-dimensional
problems. The other three UQ methods (neural network ensemble, DNN-GPR, and SNGP) are
computationally cheaper than GPR, MCMC, and variational inference. We have similar findings
regarding the computational burden of these UQ methods at test time.
An important function of UQ built atop the original deterministic ML model is to serve as a
safeguard to detect OOD samples for the purpose of increasing the reliability of ML models. In
this regard, SNGP achieves similar performance as the gold standard GPR, while the remaining
UQ methods may perform poorly in detecting OOD samples. Besides strong OOD detection capa-
bility, SNGP also exhibits desirable scalability, a feature that is missing in GPR. However, compared to GPR, SNGP requires additional effort to turn a deterministic ML model into a probabilistic counterpart for UQ, whereas GPR is born with the capability of UQ. As for the
uncertainty decomposition, GPR, Bayesian neural network, and neural network ensemble all have
some capability to quantify aleatory and epistemic uncertainty separately, while such a capability
may be lacking in the MC dropout version of Bayesian neural network as well as in DNN-GPR
and SNGP. Next, both GPR and SNGP estimate the predictive uncertainty of ML models in an
analytical form. In contrast, the other UQ methods draw Monte Carlo samples to approximate the
uncertainty, which is a major performance barrier if critical applications require real-time inferences.
Let us now shift our focus to the performance evaluation of probabilistic ML models. A unique
property of these models is that they do not simply produce a point estimate of y and instead
output a probability distribution of y, p(y), that fully characterizes the predictive uncertainty. This
unique property requires that the performance evaluation examines both the prediction accuracy,
e.g., the RMSE or mean absolute error calculated based on the mean predictions for regression,
and the quality of predictive uncertainty, e.g., how accurately the predictive uncertainty reflects the
deviation of a model prediction from the actual observation. In what follows, we will discuss ways
to assess the quality of predictive uncertainty.
Suppose a trained probabilistic ML model produces a probabilistic prediction $\hat{y}_i$ for each validation/test sample $\mathbf{x}_i$, $i = 1, \cdots, N$. Without loss of generality, let us further assume that the probabilistic output $\hat{y}_i$ follows a Gaussian distribution, characterized by the probability density function $p(\hat{y}_i; \mu_{\theta}(\mathbf{x}_i), \sigma_{\theta}(\mathbf{x}_i)) = \frac{1}{\sigma_{\theta}(\mathbf{x}_i)}\, \phi\!\left(\frac{\hat{y}_i - \mu_{\theta}(\mathbf{x}_i)}{\sigma_{\theta}(\mathbf{x}_i)}\right)$, with the predicted mean $\mu_{\theta}(\mathbf{x}_i)$ and standard deviation $\sigma_{\theta}(\mathbf{x}_i)$. For a given confidence level $c \in [0, 1]$, we can easily derive a two-sided $100c\%$ confidence interval for the Gaussian random variable $\hat{y}_i$ as:
$$\mathrm{CI}_i^c = \left[ \mu_{\theta}(\mathbf{x}_i) - z_{\frac{1+c}{2}}\, \sigma_{\theta}(\mathbf{x}_i),\; \mu_{\theta}(\mathbf{x}_i) + z_{\frac{1+c}{2}}\, \sigma_{\theta}(\mathbf{x}_i) \right], \tag{38}$$
where $z_{\frac{1+c}{2}}$ denotes the $\frac{1+c}{2}$-th quantile of the standard normal distribution, i.e., $z_{\frac{1+c}{2}} = \Phi^{-1}\!\left(\frac{1+c}{2}\right)$, with $\Phi(\cdot)$ denoting the cumulative distribution function (CDF) of the standard normal distribution. The probability of a random realization of $\hat{y}_i$ falling into $\mathrm{CI}_i^c$ equals $c$, expressed as
$$\int_{\mu_{\theta}(\mathbf{x}_i) - z_{\frac{1+c}{2}} \sigma_{\theta}(\mathbf{x}_i)}^{\mu_{\theta}(\mathbf{x}_i) + z_{\frac{1+c}{2}} \sigma_{\theta}(\mathbf{x}_i)} p(\hat{y}_i; \mu_{\theta}(\mathbf{x}_i), \sigma_{\theta}(\mathbf{x}_i))\, \mathrm{d}\hat{y}_i = \int_{-z_{\frac{1+c}{2}}}^{z_{\frac{1+c}{2}}} \phi(\tau)\, \mathrm{d}\tau = c, \tag{39}$$
where the change of variable $\tau \equiv \left(\hat{y}_i - \mu_{\theta}(\mathbf{x}_i)\right)/\sigma_{\theta}(\mathbf{x}_i)$ has been applied.
If we choose to use a CDF $P_i$ to characterize the probability distribution of $\hat{y}_i$ that may not follow a Gaussian distribution, we can write out the $100c\%$ confidence interval for an arbitrary distribution type as
$$\mathrm{CI}_i^c = \left[ P_i^{-1}\!\left(\frac{1-c}{2}\right),\; P_i^{-1}\!\left(\frac{1+c}{2}\right) \right], \tag{40}$$
where $P_i^{-1}(c) = \inf\left\{ \hat{y}_i : P_i(\hat{y}_i) \geq c \right\}$. Here, $P_i^{-1}$ is the inverse of the CDF $P_i$, also called a quantile function, and becomes $\Phi^{-1}$ for the standard normal distribution. Alternatively, we can derive a one-sided confidence interval $\mathrm{CI}_i^c = \left(-\infty,\, P_i^{-1}(c)\right]$.
Ideally, the UQ of this ML model should yield a 100c% confidence interval that contains the
observed y for approximately 100c% of the time. For example, if c = 0.95, then yi should fall into
a 95% confidence interval CIi0.95 , one- or two-sided, for nearly 95% of the time. In other words,
we expect that approximately 95% of the N validation/test samples have their observed y values
fall into the respective 95% confidence intervals. The fraction of validation/test samples for which the confidence intervals contain the observations can be called the observed confidence ($\hat{c}$) or sometimes accuracy, expressed as $\hat{c} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{I}\left(y_i \in \mathrm{CI}_i^c\right)$, where $\mathbb{I}(\text{prop})$ is an indicator function that takes the
value of 1 if the proposition prop is true and 0 otherwise. If we plot observed confidence against
expected confidence (c) over [0, 1], we will create a calibration curve, sometimes called a reliability
diagram (see an example in the right-most plot of Fig. 9). This calibration curve shows how well
predictive uncertainty is quantified, and a perfect UQ should yield a calibration curve that overlaps
with the diagonal line (y = x). If the observed confidence is higher than expected at some c values,
the model is said to be underconfident at these confidence levels; otherwise, the model is deemed
overconfident. In predictive maintenance practices, reliability/maintenance engineers often prefer
underconfident predictions over overconfident predictions, as overconfident predictions are more
likely to trigger maintenance actions that are either unnecessarily early or too late. If 90% or 95%
is chosen as the confidence level, it is preferred that the observed confidence (or accuracy) is very
close to or slightly higher than 90% or 95%.
Figure 8: An example dataset with eight training samples (solid red circles) and 100 test samples (hollow red circles),
plotted with the underlying one-dimensional function and fitted GPR model. Shown for the fitted GPR model is the
posterior mean function (solid blue curve) and a collection of 95% confidence intervals (light blue shade) for the noisy
observations (y∗ ) at new/test points. These test points are equally spaced between -5 and 5 along the x-axis.
Let us now do a step-by-step walkthrough of how a calibration curve is created using a toy
example. This example uses training and test data generated from the same 1D function and
Gaussian observation model used to generate Figs. 5 and A.25 in Sec. 3.1.1. The observation
model consists of a sine function corrupted with a white Gaussian noise term, y = sin(0.9x) + ε
with $\varepsilon \sim \mathcal{N}(0, 0.1^2)$. As shown in Fig. 8, we fit a GPR model to the eight training data points
and test this model on 100 test points. It can be seen from the figure that the regressor reports
high uncertainty at test points that fall outside of the x ranges where training samples exist. If we
compare the in-distribution test samples (i.e., whose x values fall into [−3, −1) or [2, 4)) with the
OOD samples (whose x values lie within [−5, −3), [−1, 2), or [4, 5)), we observe higher predictive
uncertainty on the OOD samples, where the model’s predictions are more likely to be incorrect.
Creating a calibration curve in this toy example consists of three steps.
Step 1: We start by choosing K confidence levels between 0 and 1, 0 ≤ c1 < c2 < · · · < cK ≤ 1.
In this example, we choose 11 (K = 11) confidence levels equally spaced between 0 and
1, i.e., 0, 0.1, · · · , 0.9, 1 (see Step 1 in Fig. 9).
Step 2: We then compute for each expected confidence level cj the observed confidence as:
$$\hat{c}_j = \frac{1}{N} \sum_{i=1}^{N} \mathbb{I}\left(y_i \in \mathrm{CI}_i^{c_j}\right). \tag{41}$$
Step 3: We finally plot the $K$ pairs of expected vs. observed confidence, $\{(c_1, \hat{c}_1), \cdots, (c_K, \hat{c}_K)\}$, which gives rise to a calibration curve. In the toy example, we have 11 pairs of $(c_j, \hat{c}_j)$ plotted to form a discrete calibration curve in Step 3 of Fig. 9.
Figure 9: Illustration of three-step procedure to create a calibration curve for toy regression problem shown in Fig. 8.
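The three steps translate directly into a short routine. The Python sketch below (function and variable names are our own) assumes Gaussian predictive distributions, so the confidence intervals follow Eq. (38).

```python
# Calibration curve for a probabilistic regressor with Gaussian outputs.
import numpy as np
from scipy.stats import norm

def calibration_curve(y, mu, sigma, levels=np.linspace(0, 1, 11)):
    observed = []
    for c in levels:                      # Step 1: expected confidence levels
        z = norm.ppf((1 + c) / 2)         # two-sided interval half-width
        inside = np.abs(y - mu) <= z * sigma
        observed.append(inside.mean())    # Step 2: observed confidence (41)
    return levels, np.array(observed)     # Step 3: plot levels vs. observed

# Hypothetical validation data: predicted stds are wider than the true noise,
# so the observed confidence exceeds the expected one (underconfidence).
rng = np.random.default_rng(0)
mu = rng.normal(size=500)
y = mu + 0.8 * rng.standard_normal(500)
c, c_hat = calibration_curve(y, mu, sigma=np.ones(500))
```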
Suppose we are interested in assessing the regression model’s UQ quality at the confidence level
of 90%. In that case, we can observe from the calibration curve drawn in Step 3 that the Gaussian
process regressor tends to be underconfident, i.e., the confidence we expect the regressor to have
($c_{10} = 90\%$) is lower than the observed (empirically estimated) confidence ($\hat{c}_{10} = 95\%$), or simply $c_{10} < \hat{c}_{10}$. More specifically, the actual proportion of times that the model's 90% confidence interval
contains the ground truth (i.e., the model is correct) is higher than the expected value (i.e., 90%).
Being underconfident also means that the model tends to produce higher-than-true uncertainty in its
predictions, which is often more desirable in safety-critical applications than having an overconfident
model.
To further understand how a calibration curve behaves as a test window varies, we expand the
range of test data from [−5, 5], as shown in Fig. 8, to [−15, 15], as shown in Fig. 10, while keeping
the same number of test samples (i.e., 100). As shown in Fig. 10, the new test dataset includes much
more OOD samples that fall outside the range of [−5, 5]. The calibration curve on this new dataset
is plotted alongside the one on the original dataset in Fig. 11. Let us compare the new (red) and
original (blue) calibration curves. We can observe that having more OOD test samples degrades the
quality of UQ by moving the calibration curve further away from the ideal line. This observation
is not surprising because high quality UQ (i.e., producing predictive uncertainty that accurately
Figure 10: Toy example identical to the one in Fig. 8 but with an expanded range of x on test data.
Figure 11: Comparison of calibration curves for two different ranges of test data for the toy 1D mathematical problem.
Test samples are equally spaced between -5 and 5 (the same as Figs. 8 and 9) and between -15 and 15, respectively,
for the two test ranges.
reflects prediction errors) is expected to be more challenging on OOD samples than in-distribution
samples. Another interesting observation is that the GPR model appears more overconfident in
making predictions on the new test dataset with more OOD samples. Our explanation for this
observation is that as a test sample xi moves farther away from the training data, the prediction
error may increase drastically (i.e., the model-predicted mean may deviate substantially more from
the true observation), but the predictive uncertainty by a UQ method may start to saturate at a
certain distance away from the training distribution (see, for example, the flat confidence bounds
in Fig. 10 when xi ∈ [−15, −6] ∪ [7, 15]), making it more difficult for a probabilistic prediction to
be accurate (i.e., the predictive confidence interval at xi contains the ground truth yi ). Essentially,
in some cases, the predictive uncertainty cannot catch up with the prediction error as a test sample
moves further away from a training distribution. In that case, it is critically important to establish
boundaries in the input space within which predictive uncertainty cannot be trusted. Very little
effort has been devoted to trustworthy UQ, and more effort is urgently needed on this front.
Step 1: The first step is to discretize the predicted confidence into some number ($K$) of bins of width $1/K$. For example, if $K = 10$, we then have ten intervals of predicted confidence, $[0, 0.1], (0.1, 0.2], \cdots, (0.9, 1.0]$.
Step 2: We then compute for each bin $B_j = \left(c_j - \frac{1}{2K},\, c_j + \frac{1}{2K}\right]$ the observed confidence as
$$\hat{c}_j = \frac{\sum_{i=1}^{N} y_i\, \mathbb{I}\left(f_{\theta}(\mathbf{x}_i) \in B_j\right)}{\sum_{i=1}^{N} \mathbb{I}\left(f_{\theta}(\mathbf{x}_i) \in B_j\right)}, \tag{42}$$
Step 3: The final step is to plot the predicted vs. the observed confidence for class 1 for each bin
Bj .
Note though that the extension in [190] focused on deriving calibration curves and did not propose an ECE definition under regression settings. The ECE can be defined as the weighted average difference between a calibration curve and the ideal linear line, $\text{ECE} = \sum_{j=1}^{K} w_j\, |\hat{c}_j - c_j|$, where the weight $w_j$ can be set as either a constant (i.e., $1/K$) or proportional to the number of samples falling into each bin, i.e., $w_j \propto \sum_{i=1}^{N} \mathbb{I}\left(y_i \in \mathrm{CI}_i^{c_j}\right)$ for regression and $w_j \propto \sum_{i=1}^{N} \mathbb{I}\left(f_{\theta}(\mathbf{x}_i) \in B_j\right)$ for binary classification [190]. Figure 12 illustrates the calibration-ideal differences as error bars on the calibration curve obtained for the toy 1D mathematical problem shown in Fig. 8. Assuming equal weights ($w_1 = w_2 = \cdots = w_{11} = 1/11$), the ECE for this calibration curve is calculated to be 0.043, which means the observed confidence deviates from the expected confidence by 0.043 on average.
which means the observed confidence deviates from the expected confidence by 0.043 on average.
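To make Steps 1–3 and the ECE computation concrete, the following minimal sketch implements the binned calibration curve of Eq. (42) and the count-weighted ECE for a binary classifier; the function name and interface are our own illustration, not code from the accompanying GitHub repository.

import numpy as np

# Minimal sketch (our own illustration): binned calibration curve per Eq. (42)
# and count-weighted ECE for a binary classifier.
def calibration_curve_ece(y_true, p_pred, K=10):
    # y_true: 0/1 labels; p_pred: predicted probabilities for class 1
    centers = (np.arange(K) + 0.5) / K                   # bin centers c_j
    obs_conf, counts = np.zeros(K), np.zeros(K)
    bins = np.minimum((p_pred * K).astype(int), K - 1)   # bin index per sample
    for j in range(K):
        in_bin = bins == j
        counts[j] = in_bin.sum()
        if counts[j] > 0:
            obs_conf[j] = y_true[in_bin].mean()          # Eq. (42)
    w = counts / counts.sum()                            # w_j proportional to bin counts
    ece = np.sum(w * np.abs(obs_conf - centers))         # weighted average deviation
    return centers, obs_conf, ece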
Figure 12: Calibration curve for the toy 1D mathematical problem shown in Fig. 8. This figure builds on the calibration curve shown in Step 3 of Fig. 9 and also includes the differences between the calibration and ideal curves (red error bars) used to calculate the ECE for this example.
figures/calibration_with_error_bars.pdf
4.1.4. Recalibration
If the calibration curve deviates significantly from the identity function (perfect calibration), a
recalibration may be needed to bring the calibration curve closer to the linear line. For example,
this recalibration can be done by a parametric approach called Platt scaling, which modifies the
non-probabilistic prediction of an ML binary classifier (e.g., a neural network or support vector
classifier) using a two-parameter, simple linear regression model and optimizes the two model pa-
rameters by minimizing the NLL on a validation dataset [187, 191]. It is straightforward to extend
Platt scaling to multi-class settings, for example, by expanding the simple linear regression model
to a multivariate linear regression model [192]. Another simple extension is temperature scaling,
a single-parameter version of Platt scaling [192], which was shown to be effective in re-calibrating
deterministic neural networks capable of UQ [177]. Another approach to recalibrating classification
models is training an auxiliary regression model on top of the trained machine learning predictor,
again using a validation dataset [190]. A popular choice of the auxiliary regression model is an iso-
tonic regression model, where a non-parametric isotonic (monotonically increasing) function maps
probabilistic predictions to empirically observed values on a validation set. Recalibration using iso-
tonic regression was originally proposed for classification [186, 187] and then extended to regression
[190]. It found recent applications in the PHM field, such as battery state-of-health estimation
[193].
Both Platt scaling and isotonic regression require a separate validation dataset of a decent size
(typically 20-50% of the training dataset) to either optimize scaling parameters (Platt scaling) or
build a non-parametric regression model (isotonic regression), while in reality, such a decent-sized validation dataset may not be available. A comparative study of recalibration approaches was performed in [192], where temperature scaling was found to be the simplest and most effective.
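As an illustration of the simplest of these approaches, the sketch below shows temperature scaling for a classifier: a single scalar T is fitted by minimizing the NLL on a validation set. This is our own minimal example, with assumed inputs val_logits and val_labels, not code from the papers cited above.

import tensorflow as tf

# Minimal temperature-scaling sketch (our own illustration): optimize a single
# temperature T by minimizing the NLL of the rescaled logits on validation data.
def fit_temperature(val_logits, val_labels, steps=200, lr=0.01):
    T = tf.Variable(1.0)
    nll = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    opt = tf.keras.optimizers.Adam(lr)
    for _ in range(steps):
        with tf.GradientTape() as tape:
            loss = nll(val_labels, val_logits / T)
        opt.apply_gradients([(tape.gradient(loss, T), T)])
    return T.numpy()  # recalibrated probabilities: softmax(logits / T)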
Before concluding on the connection between UQ calibration (ML community) and the u-pooling method (model validation community), we want to note that the u-pooling method could also be applied to assess the quality of the UQ of an ML model, with a different objective: measuring the degree to which each observation comes from the probability distribution predicted by the ML model, as opposed to the objective of UQ calibration, which is to test how underconfident or overconfident the ML model is. Similarly, the area metric or “u-pooling” metric can be used to measure the mismatch between predictive distributions and observations in a global sense [188].
Step 1: Given an uncertainty metric (e.g., variance for regression, entropy for classification),
all samples in the validation/test dataset are sorted in descending order, starting with
those with the highest predictive uncertainty. In the toy example, the 100 test samples
are ranked according to the GPR model-predicted variance, with the first few samples
having the largest predicted variances.
Step 2: A subset of samples (e.g., 2% of the validation/test dataset) with the highest uncertainty
is gradually removed, leaving an increasingly smaller dataset whose samples have lower
predictive uncertainty than those removed. In the toy example, the sample removal
process involves 50 iterations, each of which takes out 2% of the remaining test samples
with the highest predictive uncertainty.
Step 3: Given an error metric (e.g., RMSE, mean absolute error), the prediction error is computed
on the remaining samples each time a subset of high uncertainty samples is removed in
Step 2. The toy example uses the RMSE as the error metric, computed by comparing
the GPR model-predicted means with the actual (noisy) observations.
Step 4: The final step is to plot the error metric vs. fraction of removed samples for the combi-
nations obtained in Steps 2 and 3. Figure 13 shows the sparsification plot (dashed blue
curve) for the toy example.
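The four steps can be condensed into a short sketch. For simplicity, the version below removes a fixed fraction of the full test set at each step rather than 2% of the remaining samples, a minor variation on the iterative scheme described above; function and variable names are our own.

import numpy as np

# Sketch of Steps 1-4: rank samples by predictive variance, progressively drop
# the most uncertain ones, and record the RMSE of the remaining samples.
def sparsification_curve(y_true, mu_pred, var_pred, n_steps=50):
    order = np.argsort(-var_pred)                  # Step 1: descending uncertainty
    y, mu = y_true[order], mu_pred[order]
    n = len(y)
    fracs, rmses = [], []
    for k in range(n_steps):                       # Step 2: remove top fraction
        keep = slice(int(n * k / n_steps), n)
        rmses.append(np.sqrt(np.mean((y[keep] - mu[keep]) ** 2)))  # Step 3: RMSE
        fracs.append(k / n_steps)
    return np.array(fracs), np.array(rmses)        # Step 4: plot rmses vs. fracs

# The oracle is obtained by ranking with the actual squared errors instead:
# sparsification_curve(y_true, mu_pred, (y_true - mu_pred) ** 2)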
The resulting sparsification plot (see, for example, Fig. 13) visualizes how the prediction error
changes as a function of the fraction of removed samples. If predictive uncertainty is a good proxy
for prediction error, the error metric on a sparsification plot should decrease monotonically with the
fraction of removed high-uncertainty samples, as is the case in Fig. 13. If ground truth is available,
an ideal error curve (oracle) can be derived by ranking all samples in the validation/test dataset
in descending order according to the actual prediction error. The oracle for the 1D toy regression
Figure 13: Sparsification curve and oracles for the toy 1D mathematical problem shown in Fig. 8.
sparsification_plot.pdf
problem is shown as a solid gray curve in Fig. 13, where we can observe a small difference between
the calculated and ideal error curves. If predictive uncertainty is a perfect representation of model
prediction error, the calculated error curve and oracle will overlap on the sparsification plot. On
the other extreme, random uncertainty estimates that do not reflect prediction error meaningfully
would result in an almost constant error on the remaining samples, i.e., a (close to) flat error curve.
An example of the sparsification curve under random uncertainty estimates is shown in Fig. 13
for the 1D toy regression problem (see the dash-dotted red curve). In this extreme case, a flat
curve suggests that UQ provides little information for identifying problematic samples (e.g.,
OOD samples and those in regions of the input space with high measurement noise) whose model
predictions may contain large errors.
Prior UQ studies in the ML community used plots similar to the sparsification plot to examine
model accuracy as a function of model confidence [22, 196]. The only difference may be the label used
for the x-axis, sometimes explicitly called confidence threshold for classification [22] and regression
[196], instead of fraction of removed samples. Per-sample model confidence was derived as the
probability of the predicted label for classification [22] and the percentage of validation/test samples
whose variances are higher than the validation/test sample of interest [196]. However, estimating
the per-sample model confidence from the per-sample predictive uncertainty without access to the
ground truth is difficult and remains an open research question.
Since the model prediction error of one UQ approach on a validation/test sample most likely
differs from that of a different approach, the ideal error curve (oracle) is likely to differ among
UQ approaches. To compare these approaches, we can first calculate the difference between the
sparsification curve and the oracle for each fraction of removed samples, termed the sparsification error. Then, we
can compute two sparsification metrics: (1) the Area Under the Sparsification Error curve (AUSE),
i.e., the area between the actual error curve and its oracle [197], and (2) the Area Under the Random
Gain curve (AURG), i.e., the area between the (close-to) flat random curve and the actual error
curve. The lower the AUSE, the better the predictive uncertainty (derived from UQ) represents
the actual prediction error (unknown). The higher the AURG (assuming the error curve shows a
monotonically decreasing trend), the better UQ is compared to no UQ.
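Given the three curves of Fig. 13 evaluated on a common grid of removal fractions, both metrics reduce to simple area computations, as in the following sketch (our own illustration):

import numpy as np

# AUSE: area between the actual error curve and its oracle (lower is better).
# AURG: area between the random curve and the actual error curve (higher is better).
def ause_aurg(fracs, err_uq, err_oracle, err_random):
    ause = np.trapz(err_uq - err_oracle, fracs)
    aurg = np.trapz(err_random - err_uq, fracs)
    return ause, aurg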
5. UQ of ML models in prognostics
As stated in Sec. 1, our tutorial has an additional, secondary role, i.e., reviewing recent studies
on engineering design and health prognostics applications of emerging UQ approaches. To make
this tutorial focused, we place our review of engineering design applications in Appendix B and
only present the review of health prognostics applications in the main text of this tutorial (i.e., the
present section). We believe such an arrangement will provide the additional benefit of creating
a methodological transition into the two case studies in Sec. 6 that are both related to health
prognostics.
data, learn important features that characterize the system’s health status, and track its changes
over time until reaching the end of life. Industrial asset prognostics using DL can be implemented
in two ways: directly predicting the RUL from sensor data or forecasting the future evolution of
the system’s health status until a pre-defined threshold is reached. The first approach, referred to
as direct mapping [14], requires a dataset that links sensor readings to corresponding RUL target
labels and is treated as a regression task. The second approach, called time series forecasting
[14], involves identifying condition indicators that change in a predictable manner as the system
deteriorates under different operational modes. These indicators may either be predetermined as strongly correlated with the machine's health (and hence interpretable), such as the internal resistance and capacity of a lithium-ion battery [207], or may be derived implicitly. A health indicator
integrates several condition indicators into a single value, providing the user with information about
the component’s health status. The threshold for the health indicator, which may be subject to
noise, also needs to be derived or learned. The importance of UQ in both approaches lies in the
need to avoid unexpected safety-critical failures due to too-late replacements and to minimize costs
by avoiding too-early replacements. UQ is, therefore, crucial for providing meaningful uncertainty estimates and ensuring reliable predictions in DL-based industrial asset prognostics. While quantifying the total
predictive uncertainty (e.g., as a single variance value) already provides essential information for
decision making, distinguishing between aleatory and epistemic uncertainty is equally important
for prognostic applications. Particularly, considering that faults/failures are rare in safety-critical
applications, epistemic uncertainty substantially impacts model performance due to the challenges
in collecting representative run-to-failure datasets for training.
[46, 201, 206, 214–216].
Figure 14: (Left) Prognostic Horizon (PH): here $[\pi(r(k))]_{\alpha^-}^{\alpha^+}$ indicates the probability that the distribution of the prediction $r$ at time $k$ falls within the confidence region $[r^*(k) - \alpha^-, r^*(k) + \alpha^+]$ (grey area), and $\beta$ is a pre-determined threshold; (Middle) $\alpha$-$\lambda$ metric calculated at $k_{\lambda_1}$ and $k_{\lambda_2}$: same notation as before; note that the confidence bounds around the ground truth shrink as the end of life is approached; (Right) Relative Accuracy calculated at $k_{\lambda_1}$: $\Delta_{\lambda_1}$ indicates the difference between the median of the predictive distribution and the ground-truth value.
The α-λ metric is very similar to the Prognostic Horizon, but it differs in two aspects: first, it is binary: if the criterion is met at a certain time step, its value is one; otherwise, it is zero. Second, the confidence bounds around the ground-truth RUL are now a function of the predicted RUL and, as a result, will tend to shrink as the machine approaches the end of life.
The relative accuracy is simply calculated as one minus the relative error of the model with respect to the ground truth at a certain time step. In particular, the relative error is computed as the ratio of the absolute difference between the ground truth and a properly chosen central tendency point estimate of the predicted RUL distribution to the ground-truth RUL value. The choice of central tendency point estimate is arbitrary and depends on the statistical properties of the predictive distribution (Gaussian, mixture-of-Gaussians, multi-modal, etc.). Finally, convergence acts as a meta-metric that measures how quickly each of the above metrics improves over time.
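In equation form, the relative accuracy described above can be sketched, using the notation of Fig. 14, as

$$\mathrm{RA}_{\lambda} = 1 - \frac{\left| r^{*}(k_{\lambda}) - \tilde{r}(k_{\lambda}) \right|}{r^{*}(k_{\lambda})},$$

where $r^{*}(k_{\lambda})$ is the ground-truth RUL at time $k_{\lambda}$ and $\tilde{r}(k_{\lambda})$ is the chosen central tendency point estimate (e.g., the median) of the predicted RUL distribution.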
5.3. Discussion
Meaningful uncertainty estimates are crucial for ensuring the safe and reliable deployment of DL models in real-world applications, especially for safety-critical assets, and for building trust in the models and ensuring their effectiveness. In practice, decision making in the context of industrial applications involves a complicated trade-off between risky decisions and large potential economic benefits. DL has undoubtedly advanced the field by offering a valuable set of tools to efficiently learn from data and automate the entire prognostics process. Nevertheless, this is only one part, albeit a very significant one, of the challenges arising in prognostics. ML and DL
techniques need to be as trustworthy and reliable as possible, and for this reason, effective UQ and
its integration into existing techniques remain an essential desideratum.
In previous research studies, MC dropout has been by far the most widely employed strategy for tackling UQ of neural networks, especially DNNs. There are likely two reasons for this: first, the interpretation of MC dropout is very intuitive; and second, it requires only a minimal modification to existing architectures, namely keeping dropout layers active at prediction time. Nevertheless, as shown in multiple studies [22, 218, 219], the UQ performance of MC dropout is not always satisfactory,
and more advanced solutions should be explored. Fortunately, the fields of UQ and Bayesian DL are
constantly progressing, and applications of the resulting techniques to prognostics are an important
research area to be further explored [48, 133, 208–212].
In addition, uncertainty-aware ML methods have been mainly used in the context of prognostics
for RUL prediction. While this is arguably the most important end goal in this field, several other
avenues could be investigated in the future. An example is, for instance, anomaly detection. In
this setting, uncertainty can be used to detect abnormal health states in the machine operation by
evaluating the level of confidence of the model corresponding to that time step. The assumption
is that a high level of epistemic uncertainty associated with a certain input will be indicative of
test data points that are less representative of the training data distribution. Hence, such data
will probably correspond to unusual health states, assuming the training data are collected from a
machine operating in a nominal regime.
To conclude, a crucial criterion for any UQ technique used in prognostics is the ability to accu-
rately disentangle aleatory and epistemic uncertainty. These two measures contain distinct types
of information and, therefore, must be interpreted separately to ensure appropriate analysis.
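For ensemble-style UQ methods whose members each output a mean $\mu_m(x)$ and a variance $\sigma_m^2(x)$, one common (though not the only) way to perform this disentanglement is through the law of total variance:

$$\sigma^{2}(x) \approx \underbrace{\frac{1}{M}\sum_{m=1}^{M}\sigma_{m}^{2}(x)}_{\text{aleatory}} + \underbrace{\frac{1}{M}\sum_{m=1}^{M}\left(\mu_{m}(x) - \bar{\mu}(x)\right)^{2}}_{\text{epistemic}},$$

where $\bar{\mu}(x)$ is the ensemble-mean prediction over the $M$ members.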
6. Case studies
In this section, we benchmark the performance of several UQ methods in two engineering applications: (1) early life prediction of lithium-ion batteries and (2) RUL prediction of turbofan engines.
In both case studies, we built UQ models with publicly available datasets and compared the models’
performance. To ensure a fair comparison, these UQ models are built with nearly identical back-
bone architectures wherever applicable. These two case studies are widely used in the literature
due to their broad significance in safety-critical applications and, therefore, a comprehensive under-
standing of the performance of different UQ methods helps to identify the right model to deploy in
a particular application. A code walk-through is provided for the first case study to demonstrate
the practical implementation of UQ methods. We acknowledge that there could be several other
ways of implementing the same UQ models using different sets of libraries. In this discussion, we
try to limit ourselves to using only TensorFlow and Keras libraries for building the neural network
models.
6.1. Case study 1: Battery early life prediction
In this case study, we consider four UQ methods: (1) neural network ensemble, (2) MC dropout, (3) GPR, and (4) SNGP. The goal of this study is to
compare several UQ methods with comparable prediction accuracy based on the current literature.
The neural network-based models, namely neural network ensemble, MC dropout, and SNGP, are
built on a ResNet with a similar backbone architecture as shown in Fig. 15.
Figure 15: UQ model architectures with ResNet backbone used in case study 1. The ResNet block for each model is
defined by the blue box.
Table 3: Summary of LFP battery dataset
Type No. of cells
Training 41
Primary test 43
Secondary test 40
Tertiary test 45
Figure 16: Normalized capacity curves for the four datasets mentioned in Table 3.
UQ models. Similar to Severson et al. [220], we find that the cycle life is significantly correlated
with $\mathrm{Var}(\Delta Q_{100-10}(V))$ as shown in Fig. 17.
Figure 17: Cycle life vs. $\mathrm{Var}(\Delta Q_{100-10}(V))$ on log-log axes for the training, primary test, secondary test, and tertiary test datasets.
In the code below, the Gaussian layer uses two kernels and biases to characterize µ and σ by
splitting the output of the previous layer (traditionally a fully connected layer with one dimension).
Note that the kernel shape should be compatible with the number of hidden units in the previous
dense layer.
from tensorflow.keras.layers import Layer
from tensorflow.keras import backend as K

class GaussianLayer(Layer):
    def build(self, input_shape):
        # Two kernels + biases split the previous layer's output into mu and sigma^2;
        # the kernel shape must match the 10 hidden units of the preceding dense layer
        self.kernel_1 = self.add_weight(shape=(10, self.output_dim), ...)
        self.kernel_2 = self.add_weight(shape=(10, self.output_dim), ...)
        ...  # define bias_1 and bias_2 analogously, with shape (self.output_dim,)

    def call(self, x):
        output_mu = K.dot(x, self.kernel_1) + self.bias_1
        output_var = K.dot(x, self.kernel_2) + self.bias_2
        # softplus transform makes the variance strictly positive
        output_var_pos = K.log(1 + K.exp(output_var)) + 1e-06
        return [output_mu, output_var_pos]  # output mean and variance
Finally, a neural network model is constructed by appending the Gaussian layer to a simple
ResNet model. The architecture for each individual model of the neural network ensemble is shown
in Table 4.
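While the full training code is on the GitHub page, a network with this mean-variance output head is typically trained by minimizing the Gaussian NLL; below is a minimal sketch of such a loss, which we assume here and which is not necessarily the repository's exact implementation.

import tensorflow as tf

# Gaussian negative log-likelihood for a mean-variance output head
# (constant terms dropped); mu and var come from the GaussianLayer above.
def gaussian_nll(y_true, mu, var):
    return tf.reduce_mean(0.5 * tf.math.log(var)
                          + 0.5 * tf.square(y_true - mu) / var)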
6.1.3. MC Dropout
In this section, a simple MC dropout model is developed following the method described in
Section 3.2.3. The only differences between the implementation of the MC dropout and the neural
network ensemble are (1) the inclusion of dropout layers with dropout being active during the pre-
diction phase and (2) having a single deterministic output as the final output. Note that the dropout
layer can also be introduced in other UQ methods, for example, in neural network ensembles, to
mitigate overfitting. However, dropout is typically not activated during the prediction phase in such
models. In the case of MC dropout, the output varies from one prediction run to another because a certain percentage of the trained network's units are randomly dropped out at the prediction phase. The code snippet below showcases our implementation of the dropout layers
within the ResNet block as shown in Fig. 15.
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.models import Model

for _ in range(num_res_layers):  # for each residual block
    x = Dense(50, activation=actfn)(x)
    x1 = Dense(50, activation=actfn)(x)
    x = x1 + x  # skip connection
    x = Dropout(rate=0.10)(x)  # dropout within each ResNet block

mu = Dense(1, activation=actfn)(x)  # single deterministic output (RUL)
model = Model(feature_input, mu)
The MC dropout model architecture and trainable parameters are similar to Table 4 except for
the presence of dropout layers with a 10% dropout rate. During the prediction phase, the trained
MC dropout model is run 15 times with dropout enabled (the ensemble size was determined based
on the elbow method; see the description of Fig. 18). An ensemble of all the individual deterministic
RUL predictions produces the RUL prediction with uncertainty quantified.
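In code, this prediction-time ensembling can be sketched as follows; model and x_test are assumed from the surrounding context, and calling the model with training=True keeps dropout active.

import numpy as np

# Run the trained model T times with dropout active and ensemble the
# individual deterministic RUL predictions into a mean and standard deviation.
T = 15
preds = np.stack([model(x_test, training=True).numpy().ravel()
                  for _ in range(T)])
mu_rul, sigma_rul = preds.mean(axis=0), preds.std(axis=0)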
6.1.4. Spectral Normalization Gaussian Process (SNGP)
Next, we implement the SNGP model discussed in Section 3.4 with the core idea of preserving
distance awareness between training and test/OOD distributions when producing the uncertainty
for each prediction. This is achieved by: (1) applying spectral normalization to the hidden layers
of the neural network and (2) replacing the final layer with a Gaussian process layer. This is a
single-model method with high performance in OOD detection.
Following Liu et al. [179] and the corresponding TensorFlow tutorial, as shown below, we first define a model class RN_SNGP that inherits from tf.keras.Model. In this model class, we wrap some dense layers with the spectral normalization layer, where the normalization threshold has a constant value of spec_norm_bound. The RandomFeatureGaussianProcess layer with an RBF kernel serves as the Gaussian process layer.
import official.nlp.modeling.layers as nlp_layers

class RN_SNGP(tf.keras.Model):
    ...
    # spectral normalization wrapper applied to a dense layer
    self.dense_layers1 = nlp_layers.SpectralNormalization(
        self.make_dense_layer(100), norm_multiplier=self.spec_norm_bound)
    ...

    def make_output_layer(self, no_outputs):
        """Uses a Gaussian process as the output layer."""
        return nlp_layers.RandomFeatureGaussianProcess(
            no_outputs, gp_cov_momentum=-1, **self.kwargs)
The value of gp_cov_momentum in the above code snippet decides whether the calculated covariance is exact or approximated. A positive value of gp_cov_momentum updates the covariance across batches using a momentum-based moving average, whereas a value of -1 calculates the exact covariance. Since the calculation of covariance could be affected by the batch size, it is recommended that the covariance matrix estimator be reset during each epoch. This can be done by using the Keras API to define a callback class and then appending it to the RN_SNGP model. Finally, we train an SNGP model with the ReLU activation function and spec_norm_bound = 0.9.
class ResetCovarianceCallback(tf.keras.callbacks.Callback):
    def on_epoch_begin(self, epoch, logs=None):
        """Resets covariance matrix at the beginning of the epoch."""
        if epoch > 0:
            self.model.regressor.reset_covariance_matrix()
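A hypothetical usage sketch then attaches the callback during training; the constructor arguments and data names below are assumptions for illustration, not the repository's exact code.

# Attach the callback when fitting so the covariance estimator is reset
# at the beginning of every epoch.
sngp_model = RN_SNGP(spec_norm_bound=0.9)
sngp_model.compile(optimizer='adam', loss='mse')
sngp_model.fit(x_train, y_train, epochs=100,
               callbacks=[ResetCovarianceCallback()])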
6.1.6. Evaluation/Results
In this section, we exploit the following metrics to quantitatively examine the uncertainty quan-
tification performance of all the models: (1) root mean square error (RMSE), (2) average NLL
defined in Eq. (31), (3) expected calibration error (ECE) as defined in Section 4.1.3, and (4) cal-
ibration curve introduced in Section 4.1.1. Since both neural network ensemble and MC dropout
require an ensemble of individual models, it is essential to determine the ensemble size. Ideally, an ensemble would have as many individual models as possible so that all potential variations manifest during the prediction stage. In other words, an ensemble benefits from models that undergo diverse learning paths, which effectively captures the variations in predictions. However, beyond a certain ensemble size, additional models become increasingly less diverse and contribute only marginally to the ensemble at the expense of increased computational cost. Therefore,
inspired by the elbow method, we systematically vary the ensemble size for constructing the neural
network ensemble and MC dropout models while capturing the training RMSE and ECE as shown
in Fig. 18. RMSE and ECE are chosen to strike a trade-off between accuracy and uncertainty
quantification capabilities. Based on this study, we choose an ensemble size of 15 for both neural
network ensemble and MC dropout.
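The elbow study itself amounts to a simple loop; in the sketch below, train_ensemble and evaluate are assumed helper names for illustration, not functions from the repository.

# Train ensembles of increasing size and record training RMSE and ECE;
# the chosen size is where both curves begin to flatten.
for m in range(2, 26):
    ensemble = train_ensemble(size=m)                  # assumed helper
    rmse, ece = evaluate(ensemble, x_train, y_train)   # assumed helper
    print(f"size={m}  RMSE={rmse:.1f}  ECE={ece:.1f}")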
Figure 18: Determining the ensemble size for neural network ensemble and MC dropout. The selected ensemble size
for this case study is determined by the green vertical line.
Table 5 reports the RMSE, NLL, and ECE across different UQ methods for the dataset described
in Table 3. The variation in Table 5 results from 10 end-to-end independent runs. Note that the
results may not be the best that each method could offer as all these methods are built on a backbone
of a simple ResNet architecture except for GPR. It is likely that different UQ methods would require
different architectures to obtain the best results. From Table 5, we observe that the GPR model
perfectly fits the 41 training data points with an RMSE of zero and an extremely low NLL. However,
GPR exhibits poor generalization, as can be seen in the large RUL prediction errors as well as the high uncertainty at testing. In particular, for the secondary and tertiary test datasets
that are known to be significantly different from the training dataset, the performance of GPR gets
even worse. Secondly, the non-ensemble SNGP model performs much better in generalization when
Table 5: Performance comparison across UQ methods for the 169 LFP cell dataset in terms of mean ± standard deviation
NNE MC SNGP GPR
Dataset RMSE (cycles) ↓
Train 68.1±22.1 69.4±16.8 34.8±14.7 0.0±0.0
Primary test 137.3±20.9 149.9±18.4 148.1±16.2 141.1±0.0
Secondary test 205.1±27.4 194.1±15.1 249.3±33.6 319.0±0.0
Tertiary test 183.9±46.9 195.0±29.1 258.9±60.3 406.5±0.0
NLL ↓
Train 4.7±0.3 8.6±2.6 5.6±0.02 -3.8±0.0
Primary test 5.4±0.2 14.3±6.5 5.7±0.03 5.7±0.0
Secondary test 5.7±0.2 6.9±1.3 6.1±0.2 6.0±0.0
Tertiary test 5.7±0.1 9.2±1.7 5.9±0.1 6.4±0.0
ECE (%) ↓
Train 29.8±3.7 15.2±6.8 42.5±3.0 49.9±0.0
Primary test 10.5±5.0 24.4±5.3 21.5±2.3 6.9±0.0
Secondary test 13.5±5.7 9.5±4.6 12.7±4.6 10.4±0.0
Tertiary test 9.8±4.5 22.6±3.4 9.3±4.4 8.0±0.0
compared to GPR. The presence of neural network layers helps condense crucial information in the
hidden space which is further enhanced by the spectral normalization wrapper. But we generally
found in this case study that SNGP tends to generate unnecessarily large uncertainty for each
prediction, thus resulting in a large NLL and ECE. Third, among the two ensemble-like models, the
neural network ensemble performs slightly better than MC dropout in terms of accuracy but exhibits
a substantial advantage in UQ over MC dropout. We observe that the MC dropout predictions are
generally overconfident with a low uncertainty estimate σ̂RUL for each prediction. This low σ̂RUL
leads to large NLLs along with increased run-to-run variation. In the case that there is a larger
σ̂RUL , small changes in µ̂RUL do not significantly affect the run-to-run variation. On the other
hand, when σ̂RUL is small, run-to-run variation of NLL becomes more sensitive to the changes in
µ̂RUL around the true RUL. Note that the dropout rate hyperparameter of the MC dropout model
significantly affects the model performance. A low dropout rate would lead to almost identical
models within the ensemble, leading to very low predictive uncertainty and, thus, an overconfident
model. On the contrary, a larger dropout rate could cause significant differences between different
runs, thereby increasing uncertainty while compromising accuracy. Lastly, the better UQ ability
of the neural network ensemble can be primarily attributed to the ability of each individual model
within the ensemble to provide aleatory uncertainty, which during the ensemble process provides a
more holistic picture of uncertainty.
Figure 19: RUL prediction error curves with cells sorted based on true RUL values.
Next, we visualize the prediction error with respect to a single end-to-end run for neural network
ensemble and SNGP in Fig. 19. To better depict prediction accuracy and the uncertainty estimate
pertaining to each prediction, we plot the error curve associated with each cell in the dataset, with cells sorted by their true RUL in ascending order. As can be observed, on the training data, the mean RUL predictions of both the SNGP and neural network ensemble models align closely with the true RUL values. In the case of the primary and secondary test datasets, a few instances of discrepancy
between the mean RUL prediction and ground truth arise. However, these models fail to capture
the true RULs of the tertiary test dataset, which is well known to be significantly different from
the other three datasets. Another interesting observation across the first three considered datasets
is that SNGP tends to yield a large uncertainty estimate for almost all predictions. As a result,
SNGP is underconfident in most cases. In contrast, the neural network ensemble model produces
significantly lower prediction uncertainty than SNGP. Only for the tertiary test dataset do both the neural network ensemble and SNGP associate large σ̂RUL with most of the batteries.
In what follows, we construct the calibration curve based on each model’s performance on the
four datasets. As illustrated in Fig. 20, the shaded area of each curve characterizes the run-to-run
Figure 20: Calibration curves for the four models on all the datasets of the 169 LFP cell dataset. The shaded area
captures the run-to-run variation of all the models.
variation over 10 independent trials. First, since the GPR model fits the training data perfectly (zero
RMSE), the observed confidence is 100% and does not change with the expected confidence level.
For the other datasets, GPR seems to be the closest to the expected line leading to the least ECE
(see Table 5). Next, we observe that both GPR and SNGP are relatively stable irrespective of model
initialization leading to low run-to-run variation. On the other hand, models like neural network
ensemble and MC dropout exhibit higher run-to-run variation (with MC dropout having the highest
run-to-run variation), especially when considering OOD datasets like the tertiary dataset. These
observations regarding model stability are in line with our qualitative comparison of UQ models
summarized in Table 2. Lastly, MC dropout is generally overconfident across all the datasets, as
reflected in the relatively low uncertainty associated with each RUL prediction. Different from MC dropout, the neural network ensemble and SNGP are consistently underconfident. Considering the
safety-critical nature of early life prediction of batteries, underconfident models are desirable as
they allow end users to stay on the safe side.
6.2. Case study 2: Turbofan engine prognostics
In this section, similar to Case Study 1, we evaluate the performance of multiple UQ methods
in predicting the RUL of nine turbofan engines that operate under varying conditions. To carry
out our analysis, we utilize the New Commercial Modular Aero-Propulsion System Simulation (N-
CMAPSS) prognostics dataset [222], which has been recently open-sourced. Specifically, we use
the sub-dataset DS02, which has been used in several previous works, see Refs. [223–225]. Our
objective is to predict the target RUL by employing a set of multivariate time series as inputs. In
addition to providing a point estimate of the RUL, our aim is to quantify the uncertainty associated
with the RUL prediction with the UQ methods surveyed in this paper. The code for this case study
is available on our GitHub page. The primary goal of this study is to pedagogically compare various
UQ methods that exhibit similar prediction accuracy based on the current literature. We do not
make any claims that the discussed methods outperform the existing literature’s models.
Table 6: Overview of the input variables. These condition monitoring signals include both scenario descriptors (first 6
rows) and measured physical properties (last 14 rows). The symbol used for each variable corresponds to its internal
name in the CMAPSS dataset.
This case study comprises a collection of run-to-failure trajectories for a fleet of nine aircraft
engines that operate under authentic flight conditions [222]. We use the open-source code presented
in Ref. [226] to download and preprocess the data. For every RUL prediction time step, the input
to the UQ model is a 20-dimensional vector that represents the measured physical properties of
the engine as well as the scenario descriptors characterizing the engine’s operating mode during the
flight. At each time step, the UQ model produces RUL and its associated uncertainty as outputs.
Table 6 provides an overview of the input variables used in the model. As we adopted a purely
data-driven approach, we did not utilize the virtual sensors or the calibration parameters that are
available in the N-CMAPSS dataset [222, 227].
Consistent with Ref. [227], we split the entire dataset into a training dataset, which comprises
the time-to-failure trajectories of six units (i.e., units 2, 5, 10, 16, 18 and 20), and a testing dataset,
which includes the trajectories of three units (i.e., units 11, 14 and 15). Figure 21 illustrates
the distributions of the flight conditions across all units and provides an example of a flight cycle
obtained by traces of the scenario-descriptor variables for unit 10. Finally, to address the memory
consumption concerns associated with the size of the dataset, we downsampled the data by a factor
of 500 by using the code from Ref. [226], thus resulting in a sampling frequency of 0.002 Hz.
Figure 21: (Left) The flight envelopes simulated for climb, cruise, and descend conditions were estimated using
kernel density estimation based on measurements of altitude, flight Mach number, throttle-resolver angle, and total
temperature at the fan inlet. The densities of these measurements are shown for three representative training units
(u = 2, 10, and 18) and two test units (u = 14 and 15). (Right) A typical flight cycle for unit 10 with traces of
the scenario-descriptor variables depicting the climb, cruise, and descend phases of the flight, covering different flight
routes operated by the aircraft, where altitude was above 10,000 ft.
6.2.2. Evaluation/Results
For the sake of clarity and consistency, in this case study, we have used the same code structure and functions from the previous case study. However, we have excluded GPR from our evaluation due
to the large size of the dataset and the well-known scaling issues associated with this UQ method.
For further implementation details, we refer the reader to the detailed descriptions in the previous
case study or to the code implementation on GitHub.
The performance of NNE, MC, and SNGP on the three test units is compared in Table 7 using
RMSE, NLL, and ECE metrics. Overall, NNE seems to outperform MC and SNGP in terms of
all the metrics considered, with SNGP providing slightly better performance than MC. Figure 22
shows that all three models are able to capture the decreasing trend of the RUL over time,
but they encounter difficulties at the beginning of the trajectory, i.e., at the onset of degradation.
Interestingly, NNE appears to address this issue by assigning higher uncertainty corresponding to
such points.
Table 7: Comparison of the error metrics across different UQ methods on the N-CMAPSS dataset
NNE MC SNGP
Dataset RMSE (cycles) ↓
Train 7.1±0.1 10.2±0.1 8.7±0.7
Unit 11 8.5±0.5 10.0±0.3 8.9±1.8
Unit 14 7.4±0.2 11.5±0.1 9.3±1.4
Unit 15 4.8±0.3 8.2±0.2 6.8±1.2
NLL ↓
Train 2.0±0.0 3.7±0.1 4.4±0.7
Unit 11 2.3±0.1 3.0±0.1 4.8±1.8
Unit 14 2.2±0.0 4.2±0.2 4.4±1.3
Unit 15 1.8±0.0 2.8±0.1 3.1±0.6
ECE (%) ↓
Train 6.2±0.8 12.8±1.2 9.6±2.7
Unit 11 15.1±2.5 19.6±1.5 15.9±7.3
Unit 14 5.8±1.0 25.1±1.2 13.0±3.5
Unit 15 14.9±2.7 11.5±1.6 8.5±3.0
Figure 22: RUL prediction error curves for the N-CMAPSS dataset.
The calibration curves presented in Fig. 23 suggest that the methods used in this study tend to
produce over-confident predictions, particularly for unit 11. This overconfidence can have serious
implications for safety in prognostics. While MC exhibits overconfidence across all test units, NNE
performs best on unit 14 and SNGP on unit 15, displaying a calibration curve that is closer to
the ideal. Overall, NNE generally outperforms other UQ models as demonstrated by its accurate
predictions (i.e., low RMSE and NLL scores). Furthermore, NNE’s calibration curve is more closely
aligned with the ideal, leading to low ECE values.
Figure 23: Calibration curves for the three models on all the datasets. The shaded area captures the run-to-run
variation of all the models.
As a final remark, we would like to acknowledge that the present results could be improved by
optimizing the hyperparameters of each model individually, e.g., the number of layers and nodes, the
dropout rate, the number of ensemble components, and the type of activation functions. However,
the present study serves as a solid foundation for investigating the UQ capabilities of the analyzed
methods in challenging and realistic case studies.
such as architectures dedicated to specific physics and engineering problems [246, 247] and utilizing
a large amount of simulation data to emulate the dynamics of physical systems, such as deep
operator networks [248] and Fourier neural operator [249]. A more detailed summary of these seven
physics-informed ML categories can be found in Part 1 of our recent review on digital twins [14].
As mentioned in this review, the above list of seven categories is not exhaustive by any means, and
many other approaches for combining data and physics have been developed over the past decade.
Comprehensive reviews dedicated to physics-informed ML are also available in Refs. [76, 250].
Regardless of the specific means of incorporating physical knowledge into ML modeling, param-
eter and model-form uncertainty inevitably persist due to the imperfect knowledge of physics, and
assumptions and approximations made to simplify the problem setup during the modeling process.
In the case of uncertainty of physical parameters (e.g., uncertain parameters in a PDE), the cor-
responding probability distribution of solution variables can be generated with those parameters
as inputs to neural network representations of the solution field [251] or utilizing generative ad-
versarial networks [252]. However, these approaches do not consider the uncertainty induced by
the use of physics-informed ML model itself (e.g., uncertainty due to the use of a neural network).
For neural networks, the commonly used MC dropout helps increase training robustness through randomization of the network architecture, while BNNs more directly seek to quantify
the parameter uncertainty of the neural network (e.g., for its weight and bias terms). Moreover,
physics-constrained BNNs [81, 253] have been developed to address the uncertainty in PINNs. We
direct interested readers to two recent review papers for a more comprehensive, in-depth discussion
on UQ for physics-informed ML [51, 254], with emphasis on PINNs [51, 254] and deep operator
networks [51].
tasks. To be effective, PLoM generally requires a sufficiently large quantity of data samples that can
reasonably reveal the underlying distribution geometry. This also differs from GPR and BNNs, which,
by design, engender a larger degree of uncertainty in the model (e.g., by falling back towards their
prior uncertainty) when less data is available. Nonetheless, PLoM has been demonstrated to work
well even in settings with relatively small datasets, especially if additional constraining from relevant
governing PDEs is available [256]. Lastly, the generative model resulting from PLoM can be highly
versatile and used for a range of applications beyond sample generation and surrogate modeling,
such as density estimates of statistics of interest [257], optimization under uncertainty [258], and
design using digital twins [259], as some examples.
models with good generalization to unseen data. Besides, the sparsity in the resulting function basis
offers valuable insights into the management of model selection uncertainty in the context of hybrid
dynamical systems [266]. For instance, hybrid SINDy employed the Akaike information criterion
score on out-of-sample validation data to match the SINDy model with a specific regime in a hybrid
dynamical system, from which the switching point of the hybrid system can be found [266].
The elegance and clarity inherent in the models derived through SINDy are of particular impor-
tance when considering ML model interpretability. Building upon the foundational work of Brunton
and Kutz, a multitude of SINDy variants have emerged, finding applications even in UQ contexts. A
remarkable instance worth highlighting is the approach introduced by Hirsh et al. [265], wherein the
SINDy approach is extended into a Bayesian probabilistic framework. This novel approach, termed
Uncertainty Quantification SINDy (UQ-SINDy), accounts for uncertainties in SINDy coefficients
arising from observation errors and limited data. The central innovation lies in the integration of
sparsifying priors, specifically the spike and slab prior and the regularized horseshoe prior, into
the Bayesian inference of SINDy coefficients. By unifying UQ with SINDy variants, this approach
not only heightens the interpretability of ML models but also facilitates the quantification of the
prediction’s confidence level.
7.4. PCE and its relationship with GPR and connection with ML
A key role for both GPR (see Sec. 1 and Appendix B) and polynomial chaos expansion (PCE) is
building surrogate models for solving engineering design problems. The need for surrogate modeling
stems from the multi-query nature of uncertainty propagation and design optimization, which often
require many repeated simulation runs (e.g., $10^3$ to $10^6$) to assess the behavior of output responses
under different realizations of input design variables and simulation model parameters. This process
may become prohibitively expensive for high-fidelity models where each simulation may require
hours to days. One strategy to accelerate these computations, as explained in Appendix B.1, is
to build a cheap-to-evaluate surrogate of the computationally expensive simulation model, i.e., to
trade model fidelity for speed. The surrogate model, sometimes called metamodel or response
surface, is often an explicit mathematical function (e.g., as in GPR and PCE), allowing for rapid
predictions at different input realizations.
Having presented GPR in detail in Sec. 1 and Appendix B, we briefly introduce PCE here. PCE
was originally proposed in the 1930s to model stochastic processes using a spectral expansion of
multivariate Hermite polynomials of Gaussian random variables [267]. These Hermite polynomial
basis functions are orthogonal with respect to the joint probability distribution of the respective
Gaussian variables. PCE was later applied to solve physics and engineering problems [268] and
extended to non-Gaussian probability distributions, giving rise to the generalized PCE [269]. Since
the input variables of a PCE are naturally formulated to follow certain probability distributions,
PCE has been a convenient and popular tool for conducting UQ. However, PCE has not been
employed much for UQ of ML models, since most ML models are already relatively inexpensive
to evaluate; rather, PCE brings more value for enabling UQ of expensive computer simulation
models. In that case, a PCE surrogate model is built to approximate the original simulation model,
where the PCE’s expansion coefficients can be computed, for example, non-intrusively by projection
(numerical integration via quadrature or simulation) [270] or regression (least squares minimization
of the fitting error) [271].
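In its generic truncated form (a standard formulation sketched here for concreteness, not tied to a specific reference above), a PCE surrogate of a model $f$ with random inputs $\boldsymbol{\xi}$ reads

$$f(\mathbf{x}(\boldsymbol{\xi})) \approx \sum_{|\boldsymbol{\alpha}| \le p} c_{\boldsymbol{\alpha}} \, \Psi_{\boldsymbol{\alpha}}(\boldsymbol{\xi}),$$

where the $\Psi_{\boldsymbol{\alpha}}$ are multivariate polynomials orthogonal with respect to the joint distribution of $\boldsymbol{\xi}$, $p$ is the total degree of the truncation, and the coefficients $c_{\boldsymbol{\alpha}}$ are obtained, e.g., by least squares minimization of the fitting error over the training samples.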
One major challenge faced by PCE is the curse of dimensionality, where the number of model
parameters (and in turn training samples of the simulation model) increases exponentially with the
input dimension (i.e., the number of input random variables). Several algorithmic techniques have
been developed to alleviate this issue through truncation schemes that can identify a sparse set
of important polynomials to be included. Two notable methods for introducing sparsity are the
Smolyak sparse constructions (and their adaptive versions) [272–274], and variants of compressive
sensing (such as least angle regression and LASSO) [275–277]. Such effort has been made in the
context of surrogate modeling [275–278] and reliability analysis [279–282]. A comprehensive review
of sparse PCE is provided in Ref. [283].
Historically, PCE and GPR (or kriging) have been studied separately and mostly in isolation,
although both methods have produced many success stories in surrogate modeling. Recently, at-
tempts have been made to combine PCE and kriging, resulting in PCE-kriging hybrids [284]. The
basic idea is to use PCE to represent the mean function m(x) of the Gaussian process prior (see
Eq. (8)) that captures the global trend of the computer simulation model (i.e., f (x)). The GPR
formulation with a non-zero, non-constant mean function is called universal kriging, which differs
from ordinary kriging where the mean function is set as a constant (e.g., zero). When combined
with kriging in this manner, PCE serves the purpose of a deterministic (non-probabilistic) mean
(trend) function. Such PCE-kriging hybrids have found applications to uncertainty propagation
in computational dosimetry [285] and damage quantification in structural health monitoring [286].
More broadly, while PCE is typically not used for UQ of ML models, it may be combined with other
ML techniques (e.g., kriging [284] and radial basis functions [287]) to produce hybrid PCE-ML mod-
els with improved prediction accuracy over standalone PCE surrogates. On a final note, although
PCE is typically not categorized as an ML technique, it was reported to offer surrogate modeling
accuracy on par with state-of-the-art ML techniques such as regression tree, neural network, and
support vector machine [288].
This tutorial aims to cover the fundamental role of UQ in ML, particularly focusing on a detailed
introduction of state-of-the-art UQ methods for neural networks and a brief review of applications in
engineering design and PHM. It possesses four salient characteristics: (1) classification of uncertainty
types (aleatory vs. epistemic), sources, and causes pertaining to ML models; (2) tutorial-style
descriptions of emerging UQ techniques; (3) quantitative metrics for evaluation and calibration of
predictive uncertainty; and (4) easily accessible source codes for implementing and comparing several
state-of-the-art UQ techniques in engineering design and PHM applications. Two case studies are
developed to demonstrate the implementation of UQ methods and benchmark their performance in
predicting battery life using early-life data (case study 1) and turbofan engine RUL using online-
accessible measurements (case study 2). Our rigorous examination of the state-of-the-art techniques
for UQ, calibration, and evaluation, together with the two case studies, offers a holistic lens on pressing issues that need to be tackled in the future development of UQ techniques in terms of scalability, principleness, and decomposition, given the increasing importance of UQ in safeguarding the usage
of ML models in high-stakes applications.
It is important to note that the case studies presented in this paper are not optimized in terms of
their hyperparameters, and it is reasonable to expect that optimizing them would yield even better
performance results. The primary objective of this paper is to offer a user-friendly platform for
individuals seeking to comprehend the analyzed methods and to encourage them to enhance and
suggest new ones.
Essentially, UQ acts as a layer of safety assurance on top of ML models, enabling rigorous
and quantitative risk assessment and management of ML solutions in high-stakes applications. As
UQ methods for ML models continue to mature, they are anticipated to play a crucial role in
creating safe, reliable, and trustworthy ML solutions by safeguarding against various risks such as
OOD, adversarial attacks, and spurious correlations. From this perspective, the development of UQ
methods is of paramount significance in expanding the adoption of ML models in breadth and depth.
The accurate, sound, and principled quantification of uncertainty in ML model prediction has great
potential to fundamentally tackle the safety assurance problem that haunts ML’s development.
Towards this end, several long-standing challenges encompassing the UQ development need to be
addressed by the research community:
1. The need for a unified and well-acknowledged testbed to comprehensively examine the per-
formance of the diverse and expanding set of UQ methods in uncertainty quantification, cal-
ibration (and recalibration), decomposition, attribution, and interpretation. Although some
recent efforts were devoted to developing standardized benchmarks for UQ [289], most of these
efforts primarily emphasized conventional performance metrics, such as prediction accuracy
metrics and UQ calibration errors. However, other key performance aspects (e.g., uncertainty
decomposition and uncertainty attribution) essential to ensuring high quality UQ have rarely
been investigated. The lack of these key elements emerges as a significant challenge to the
sound development of the UQ ecosystem. Hence, there is an imperative demand for
establishing UQ testbeds with community-acknowledged standards to facilitate comprehen-
sive testing and verification of the behavior of uncertainty generated by different UQ methods,
especially on edge cases. Establishing such testbeds with the support of synthetic data gener-
ation is expected to tremendously benefit the long-term and sustainable development of UQ
methods for ML models.
2. The need for principled, scalable, and computationally efficient UQ methods to enable high
quality and large-scale UQ. As summarized in Table 2, each method covered in this tutorial
has its own strengths and shortcomings. Although numerous efforts have been made to elevate
the soundness and principleness of UQ methods of ML models, the existing methods still suffer
from a common but critical deficiency: a lack of, or at best limited, theoretical guarantees for detecting OOD instances. It is thus imperative to investigate further along this direction to fill this gap. Emerging deterministic methods such as SNGP exhibit a strong OOD detection
capability due to distance awareness. In addition, the computational efficiency of UQ methods
needs to be further improved to satisfy the need for real-time or near real-time decision making
in a broad range of safety-critical applications (e.g., autonomous driving and aviation). Thus,
more research efforts need to be invested in enabling three key essential features of high quality
UQ: principleness, scalability, and efficiency.
4. The PHM community has long recognized the importance of estimating the predictive uncer-
tainty of prognostic models. These prognostic models can be built based on supervised ML
or more traditional state-space models (see, for example, the Bayes filter in one of the earliest
studies on battery prognostics [38]). As discussed in Sec. 5.3, in the PHM field, UQ of ML
models has been predominantly applied to the task of predicting the RUL of a system or com-
ponent. The focus of UQ in this context is to provide a probability distribution of the RUL
rather than a single point estimate. While UQ in the PHM field has primarily been focused
on RUL prediction, there is a growing interest in applying UQ to other tasks, such as anomaly
detection, fault detection and classification, and health estimation. Many of the UQ methods
discussed in detail in Sec. 3 can also be readily applied to these classification and regression
tasks in the PHM field. Looking ahead, we identify three research directions along which pos-
itive and significant impacts could be made on the PHM field surrounding UQ of ML models.
First, decomposing the total predictive uncertainty into its aleatory and epistemic components
is highly desirable and sometimes essential, as noted in Sec. 5.3. Such a decomposition has
several benefits, for example, highlighting the need for improved sensing solutions with lower
measurement noise to reduce aleatory uncertainty and identifying areas where further data
collection or model refinement efforts may be necessary to reduce epistemic uncertainty. More
work is needed to develop UQ methods with built-in uncertainty decomposition capability and
create procedures to assess the accuracy of uncertainty decomposition. Second, prognostic
studies involving UQ mostly evaluate UQ quality subjectively and qualitatively by looking at
whether a two-sided 95% confidence interval of the RUL estimate narrows with time and
contains the true RUL, especially toward the end of life. As discussed in a general context in
Sec. 4.4, we call for consistent effort among PHM researchers and practitioners to quantita-
tively evaluate their ML models’ UQ quality using some of the metrics introduced in Sec. 4,
such as calibration metrics (Sec. 4.1), sparsification metrics (Sec. 4.2), and NLL (Sec. 4.3).
Ideally, UQ quality assessment should also become standard practice when building and de-
ploying ML models in PHM applications, just as prediction accuracy assessment is currently
standard practice. Third, both UQ and interpretation serve the purpose of improving model
transparency and trustworthiness, as noted in Sec. 1. An under-explored question is whether
UQ capability can help improve interpretability and vice versa. For example, interpretability
can provide insights into the most important input features for making predictions. Such an
understanding could allow distance-aware UQ models to define their distance measures based
only on highly important features, potentially improving the UQ quality.
5. Model uncertainty quantification for label-free learning is another future research direction.
Obtaining labels by solving implicit engineering physics models is usually costly. Label-free
machine learning embeds physics models in a cost function or as constraints in the model
training process without solving them. As a result, labels are not required. Physics-informed
neural network (PINN) is one such label-free method [76, 230]. This method has gained much
attention because it makes the regression task feasible without requiring true labels. In ad-
dition, the physical constraints prevent the regression from severe overfitting in conventional
neural networks, especially when data are limited. Since labels are not available, the quantifi-
cation of prediction uncertainty of the machine learning model is extremely difficult. Even the
prediction errors at the training points are unknown. Due to this reason, the GPR method
has not been used for label-free learning since the prediction of a GPR model requires labels
at the training points. A proof-of-concept study has been conducted for quantifying epistemic
uncertainty for physics-based label-free regression [290]. This method integrates neural net-
works and GPR models and can produce both systematic error (represented by a mean) and
random error (represented by a standard deviation) for a model prediction. The method, how-
ever, has not been extended to time- and space-dependent problems where partially different
equations are involved. There is a need to develop generic uncertainty quantification methods
for label-free learning.
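To make the label-free setting concrete, the sketch below trains a small network on the toy ODE du/dt = -u with u(0) = 1 using only the physics residual, so no solution labels are ever computed. The architecture, optimizer, and collocation sampling are our own illustrative choices and do not reproduce the specific methods of [76, 230] or [290].

```python
import torch

# Minimal label-free PINN for du/dt = -u with u(0) = 1: the loss uses only
# the physics residual and the initial condition, never solution labels.
net = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.Tanh(),
                          torch.nn.Linear(32, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(2000):
    t = torch.rand(128, 1, requires_grad=True)   # collocation points in [0, 1]
    u = net(t)
    du_dt, = torch.autograd.grad(u.sum(), t, create_graph=True)
    residual = du_dt + u                          # enforces du/dt = -u
    ic = net(torch.zeros(1, 1)) - 1.0             # enforces u(0) = 1
    loss = residual.pow(2).mean() + ic.pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

After training, net(t) approximates exp(-t) even though no value of the true solution was ever supplied, which is precisely why quantifying the prediction uncertainty of such a model is difficult.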
Authors’ contributions
All the authors read and approved the final manuscript. Hu, C. and Zhang, X. devised the original
concept of the tutorial paper. Hu, Z., Hu, C., Du, X., Wang, Y., and Huan, X. were responsible
for the classification of types and sources of uncertainty pertaining to ML models. Hu, C. and
Tran, A. were responsible for GPR. Huan, X. was responsible for implementing BNN via MCMC
and variational inference. Zhang, X. and Huan, X. were responsible for MC dropout. Zhang, X.
and Hu, C. were responsible for neural network ensemble. Hu, C. was responsible for deterministic
methods for UQ of neural networks. Zhang, X. and Nemani, V. were responsible for the toy
example comparing the predictive uncertainty produced by different UQ methods. Zhang, X. and
Hu, C. were responsible for the summary of the qualitative comparison of different UQ methods.
Hu, C. and Nemani, V. were responsible for the evaluation of predictive uncertainty. Hu, Z.,
Zhang, X., Hu, C., and Tran, A. were responsible for the review on UQ of ML models in engineering
design. Biggio, L. and Fink, O. were responsible for the review of UQ of ML models in prognostics.
Nemani, V. and Hu, C. were responsible for case study 1 – battery early life prediction. Biggio, L.
and Fink, O. were responsible for case study 2 – turbofan engine prognostics. Zhang, X., Hu, C.,
and Hu, Z. were responsible for the conclusion and outlook. All authors participated in manuscript
writing, review, and editing. All correspondence should be addressed to Xiaoge Zhang (e-mail:
[email protected]) and Chao Hu (e-mails: [email protected]; [email protected]).
Acknowledgements
ECCS-2015710. The opinions, findings, and conclusions presented in this article are solely those of
the authors and do not necessarily reflect the views of the sponsors that provided funding support
for this research.
Sandia National Laboratories is a multimission laboratory managed and operated by National
Technology and Engineering Solutions of Sandia, LLC., a wholly owned subsidiary of Honeywell
International, Inc., for the U.S. Department of Energy’s National Nuclear Security Administration
under contract DE-NA-0003525.
References
[2] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, ImageNet: A large-scale hierarchical
image database, in: 2009 IEEE Conference on Computer Vision and Pattern Recognition,
IEEE, 248–255, doi: https://doi.org/10.1109/CVPR.2009.5206848, 2009.
[3] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, A. Oliva, Learning deep features for scene
recognition using places database, Advances in Neural Information Processing Systems 27.
[4] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C. L. Zitnick,
Microsoft COCO: Common objects in context, in: European Conference on Computer Vision,
Springer, 740–755, 2014.
[5] J. Blitzer, M. Dredze, F. Pereira, Biographies, bollywood, boom-boxes and blenders: Domain
adaptation for sentiment classification, in: Proceedings of the 45th Annual Meeting of the
Association of Computational Linguistics, 440–447, 2007.
[6] X. Glorot, A. Bordes, Y. Bengio, Domain adaptation for large-scale sentiment classification:
A deep learning approach, in: Proceedings of the 28th International Conference on Machine
Learning (ICML-11), 513–520, 2011.
[7] Q. Li, C. Shen, L. Chen, Z. Zhu, Knowledge mapping-based adversarial domain adaptation:
A novel fault diagnosis method with high generalizability under variable working conditions,
Mechanical Systems and Signal Processing 147 (2021) 107095, doi: https://doi.org/10.
1016/j.ymssp.2020.107095.
[8] S. M. Lundberg, S.-I. Lee, A unified approach to interpreting model predictions, Advances in
Neural Information Processing Systems 30.
[10] C. Molnar, Interpretable machine learning, Lulu.com, 2020.
[11] J. Jiménez-Luna, F. Grisoni, G. Schneider, Drug discovery with explainable artificial intelli-
gence, Nature Machine Intelligence 2 (10) (2020) 573–584, doi: https://doi.org/10.1038/
s42256-020-00236-4.
[12] S. Guo, H. Ding, Y. Li, H. Feng, X. Xiong, Z. Su, W. Feng, A hierarchical deep convolu-
tional regression framework with sensor network fail-safe adaptation for acoustic-emission-
based structural health monitoring, Mechanical Systems and Signal Processing 181 (2022)
109508, doi: https://doi.org/10.1016/j.ymssp.2022.109508.
[13] S. Khan, T. Yairi, A review on the application of deep learning in system health management,
Mechanical Systems and Signal Processing 107 (2018) 241–265, doi: https://doi.org/10.
1016/j.ymssp.2017.11.024.
[15] E. Begoli, T. Bhattacharya, D. Kusnezov, The need for uncertainty quantification in machine-
assisted medical decision making, Nature Machine Intelligence 1 (1) (2019) 20–23, doi: https:
//doi.org/10.1038/s42256-018-0004-1.
[16] C. Rudin, Stop explaining black box machine learning models for high stakes decisions and use
interpretable models instead, Nature Machine Intelligence 1 (5) (2019) 206–215, doi: https:
//doi.org/10.1038/s42256-019-0048-x.
[17] M. Sensoy, L. Kaplan, M. Kandemir, Evidential deep learning to quantify classification uncer-
tainty, Advances in Neural Information Processing Systems 31, doi: https://doi.org/10.
48550/arXiv.1806.01768.
[18] X. Zhang, S. Zhong, S. Mahadevan, Airport surface movement prediction and safety as-
sessment with spatial–temporal graph convolutional neural network, Transportation Re-
search Part C: Emerging Technologies 144 (2022) 103873, doi: https://doi.org/10.1016/j.trc.2022.103873.
[22] B. Lakshminarayanan, A. Pritzel, C. Blundell, Simple and scalable predictive uncertainty
estimation using deep ensembles, Advances in Neural Information Processing Systems 30.
[23] A. Kendall, Y. Gal, What uncertainties do we need in Bayesian deep learning for computer
vision?, Advances in Neural Information Processing Systems 30, doi: https://doi.org/10.
48550/arXiv.1703.04977.
[27] R. Jin, W. Chen, A. Sudjianto, On sequential sampling for global metamodeling in engineering
design, in: International Design Engineering Technical Conferences and Computers and In-
formation in Engineering Conference, vol. 36223, 539–548, doi: https://doi.org/10.1115/
DETC2002/DAC-34092, 2002.
[29] B. Echard, N. Gayton, M. Lemaire, AK-MCS: an active learning reliability method combining
Kriging and Monte Carlo simulation, Structural Safety 33 (2) (2011) 145–154, doi: https:
//doi.org/10.1016/j.strusafe.2011.01.002.
[31] B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, N. de Freitas, Taking the human out of the
loop: A review of Bayesian optimization, Proceedings of the IEEE 104 (1) (2016) 148–175,
doi: https://doi.org/10.1109/JPROC.2015.2494218.
[32] S. H. Lee, W. Chen, A comparative study of uncertainty propagation methods for black-box-
type problems, Structural and Multidisciplinary Optimization 37 (3) (2009) 239–253, doi:
https://doi.org/10.1007/s00158-008-0234-7.
[33] S. Chakraborty, Simulation free reliability analysis: A physics-informed deep learning based
approach doi: https://doi.org/10.48550/arXiv.2005.01302.
[34] M. Li, Z. Wang, Deep learning for high-dimensional reliability analysis, Mechanical Systems
and Signal Processing 139 (2020) 106399, doi: https://doi.org/10.1016/j.ymssp.2019.
106399.
[35] C. Zhang, A. Shafieezadeh, Simulation-free reliability analysis with active learning and
Physics-Informed Neural Network, Reliability Engineering & System Safety 226 (2022) 108716,
doi: https://doi.org/10.1016/j.ress.2022.108716.
[36] J. B. Coble, J. W. Hines, Prognostic algorithm categorization with PHM challenge application,
in: 2008 International Conference on Prognostics and Health Management, IEEE, 1–11, doi:
https://doi.org/10.1109/PHM.2008.4711456, 2008.
[37] M. E. Tipping, Sparse Bayesian learning and the relevance vector machine, Journal of
Machine Learning Research 1 (Jun) (2001) 211–244, doi: https://doi.org/10.1162/
15324430152748236.
[38] B. Saha, K. Goebel, S. Poll, J. Christophersen, Prognostics methods for battery health moni-
toring using a Bayesian framework, IEEE Transactions on Instrumentation and Measurement
58 (2) (2008) 291–296, doi: https://doi.org/10.1109/TIM.2008.2005965.
[39] D. Wang, Q. Miao, M. Pecht, Prognostics of lithium-ion batteries based on relevance vectors
and a conditional three-parameter capacity degradation model, Journal of Power Sources 239
(2013) 253–264, doi: https://doi.org/10.1016/j.jpowsour.2013.03.129.
[40] Y. Chang, J. Zou, S. Fan, C. Peng, H. Fang, Remaining useful life prediction of degraded
system with the capability of uncertainty management, Mechanical Systems and Signal Pro-
cessing 177 (2022) 109166, doi: https://doi.org/10.1016/j.ymssp.2022.109166.
[41] P. Wang, B. D. Youn, C. Hu, A generic probabilistic framework for structural health prog-
nostics and uncertainty management, Mechanical Systems and Signal Processing 28 (2012)
622–637, doi: https://doi.org/10.1016/j.ymssp.2011.10.019.
[42] C. Hu, B. D. Youn, P. Wang, J. T. Yoon, Ensemble of data-driven prognostic algorithms for
robust prediction of remaining useful life, Reliability Engineering & System Safety 103 (2012)
120–135, doi: https://doi.org/10.1016/j.ress.2012.03.008.
[43] D. Liu, J. Pang, J. Zhou, Y. Peng, M. Pecht, Prognostics for state of health estimation of
lithium-ion batteries based on combination Gaussian process functional regression, Micro-
electronics Reliability 53 (6) (2013) 832–839, doi: https://doi.org/10.1016/j.microrel.
2013.03.010.
[45] A. Thelen, M. Li, C. Hu, E. Bekyarova, S. Kalinin, M. Sanghadasa, Augmented model-based
framework for battery remaining useful life prediction, Applied Energy 324 (2022) 119624,
doi: https://doi.org/10.1016/j.apenergy.2022.119624.
[53] W. Chen, M. Fuge, BézierGAN: Automatic Generation of Smooth Curves from Interpretable
Low-Dimensional Parameters doi: https://doi.org/10.48550/arXiv.1808.08871.
[54] W. Chen, M. Fuge, Synthesizing designs with interpart dependencies using hierarchical gen-
erative adversarial networks, Journal of Mechanical Design 141 (11) (2019) 111403, doi:
https://doi.org/10.1115/1.4044076.
[55] M. He, D. He, Deep learning based approach for bearing fault diagnosis, IEEE Transactions on
Industry Applications 53 (3) (2017) 3057–3065, doi: https://doi.org/10.1109/TIA.2017.
2661250.
[56] D.-T. Hoang, H.-J. Kang, A survey on deep learning based bearing fault diagnosis, Neuro-
computing 335 (2019) 327–335, doi: https://doi.org/10.1016/j.neucom.2018.06.078.
[58] B. Hou, D. Wang, Y. Chen, H. Wang, Z. Peng, K.-L. Tsui, Interpretable online updated
weights: Optimized square envelope spectrum for machine condition monitoring and fault
diagnosis, Mechanical Systems and Signal Processing 169 (2022) 108779, doi: https://doi.
org/10.1016/j.ymssp.2021.108779.
[60] J. Deutsch, D. He, Using deep learning-based approach to predict remaining useful life of
rotating components, IEEE Transactions on Systems, Man, and Cybernetics: Systems 48 (1)
(2017) 11–20, doi: https://doi.org/10.1109/TSMC.2017.2697842.
[61] W. Yu, I. Y. Kim, C. Mechefske, Remaining useful life estimation using a bidirectional recur-
rent neural network based autoencoder scheme, Mechanical Systems and Signal Processing
129 (2019) 764–780, doi: https://doi.org/10.1016/j.ymssp.2019.05.005.
[62] X. Li, W. Zhang, Q. Ding, Deep learning-based remaining useful life estimation of bearings
using multi-scale feature extraction, Reliability Engineering & System Safety 182 (2019) 208–
218, doi: https://doi.org/10.1016/j.ress.2018.11.011.
[63] A. Der Kiureghian, O. Ditlevsen, Aleatory or epistemic? Does it matter?, Structural Safety
31 (2) (2009) 105–112, doi: https://doi.org/10.1016/j.strusafe.2008.06.020.
[64] Y. Gal, J. Hron, A. Kendall, Concrete dropout, Advances in Neural Information Processing
Systems 30.
[65] R. Sanjay, R. Sriram, Data Fidelity and Latency: All things Clin-
ical, Innovaccer 1 (2022) https://innovaccer.com/resources/blogs/
data--fidelity--and--latency--all--things--clinical.
[67] I. M. Sobol’, On sensitivity estimation for nonlinear mathematical models, Matematicheskoe
Modelirovanie 2 (1) (1990) 112–118.
[68] I. M. Sobol, Global sensitivity indices for nonlinear mathematical models and their Monte
Carlo estimates, Mathematics and Computers in Simulation 55 (1-3) (2001) 271–280, doi:
https://doi.org/10.1016/S0378-4754(00)00270-6.
[69] Y. Gal, Uncertainty in deep learning, Ph.D. thesis, University of Cambridge, 2016.
[71] L. Smith, Y. Gal, Understanding measures of uncertainty for adversarial example detection
doi: https://doi.org/10.48550/arXiv.1803.08533.
[72] A. Malinin, M. Gales, Predictive uncertainty estimation via prior networks, Advances in
Neural Information Processing Systems 31, doi: https://doi.org/10.48550/arXiv.1802.
10501.
[75] C. Shorten, T. M. Khoshgoftaar, A survey on image data augmentation for deep learning,
Journal of Big Data 6 (1) (2019) 1–48, doi: https://doi.org/10.1186/s40537-019-0197-0.
[77] A. Thelen, X. Zhang, O. Fink, Y. Lu, S. Ghosh, B. D. Youn, M. D. Todd, S. Mahadevan, C. Hu,
Z. Hu, A Comprehensive Review of Digital Twin–Part 2: Roles of Uncertainty Quantification
and Optimization, a Battery Digital Twin, and Perspectives, Structural and Multidisciplinary
Optimization 66 (1) (2023) 1–43, doi: https://doi.org/10.1007/s00158-022-03410-x.
[78] Y. Xu, S. Kohtz, J. Boakye, P. Gardoni, P. Wang, Physics-informed machine learning for relia-
bility and systems safety applications: State of the art and challenges, Reliability Engineering
& System Safety 230 (2022) 108900, doi: https://doi.org/10.1016/j.ress.2022.108900.
[79] C. Hu, K. Goebel, D. Howey, Z. Peng, D. Wang, P. Wang, B. D. Youn, Special issue on Physics-
informed machine learning enabling fault feature extraction and robust failure prognosis,
Mechanical Systems and Signal Processing 192 (2023) 110219, doi: https://doi.org/10.
1016/j.ymssp.2023.110219.
[80] P. Wang, D. Coit, Physics-Informed Machine Learning for Reliability and Safety, URL https:
//www.sciencedirect.com/journal/reliability-engineering-and-system-safety/
special-issue/1084PD0CV5B, 2023 (Accessed on 2023-04-18).
[81] L. Malashkhia, D. Liu, Y. Lu, Y. Wang, Physics-Constrained Bayesian Neural Network for
Bias and Variance Reduction, Journal of Computing and Information Science in Engineering
23 (1) (2023) 011012, doi: https://doi.org/10.1115/1.4055924.
[82] Y. Deng, Multifidelity Data Fusion via Gradient-Enhanced Gaussian Process Regression,
Communications in Computational Physics 28 (5) (2020) 1812–1837, doi: https://doi.org/
10.4208/cicp.OA-2020-0151.
[83] M. Plumlee, V. R. Joseph, Orthogonal Gaussian process models, Statistica Sinica (2018)
601–619, doi: https://doi.org/10.5705/ss.202015.0404.
[84] A. Tran, K. Maupin, T. Rodgers, Monotonic Gaussian process for physics-constrained machine
learning with materials science applications, Journal of Computing and Information Science
in Engineering 23 (1) (2023) 011011, doi: https://doi.org/10.1115/1.4055852.
[86] L. Bottou, Stochastic gradient descent tricks, in: Neural networks: Tricks of the trade,
Springer, 421–436, doi: https://doi.org/10.1007/978-3-642-35289-8, 2012.
[87] D. Liu, Y. Wang, A Dual-Dimer method for training physics-constrained neural networks
with minimax architecture, Neural Networks 136 (2021) 112–125, doi: https://doi.org/10.
1016/j.neunet.2020.12.028.
[88] J. Cai, J. Luo, S. Wang, S. Yang, Feature selection in machine learning: A new perspec-
tive, Neurocomputing 300 (2018) 70–79, doi: https://doi.org/10.1016/j.neucom.2017.
11.077.
[89] G. Chandrashekar, F. Sahin, A survey on feature selection methods, Computers & Electrical
Engineering 40 (1) (2014) 16–28, doi: https://doi.org/10.1016/j.compeleceng.2013.11.
024.
[90] G. Box, All models are wrong, but some are useful, Robustness in Statistics 202 (1979)
549, doi: https://doi.org/10.1007/s10815-020-01895-3.
[92] G. Pilania, J. E. Gubernatis, T. Lookman, Multi-fidelity machine learning models for accu-
rate bandgap predictions of solids, Computational Materials Science 129 (2017) 156–163, doi:
https://doi.org/10.1016/j.commatsci.2016.12.004.
[93] D. Liu, Y. Wang, Multi-fidelity physics-constrained neural network and its application in
materials modeling, Journal of Mechanical Design 141 (12) (2019) 121403, doi: https://
doi.org/10.1115/1.4044400.
[94] D. Liu, P. Pusarla, Y. Wang, Multi-Fidelity Physics-Constrained Neural Networks with Mini-
max Architecture, Journal of Computing and Information Science in Engineering 23 (3) (2023)
031008, doi: https://doi.org/10.1115/1.4055316.
[95] X. Huang, T. Xie, Z. Wang, L. Chen, Q. Zhou, Z. Hu, A transfer learning-based multi-
fidelity point-cloud neural network approach for melt pool modeling in additive manufacturing,
ASCE-ASME Journal of Risk and Uncertainty in Engineering Systems, Part B: Mechanical
Engineering 8 (1), doi: https://doi.org/10.1115/1.4051749.
[96] Y. LeCun, Y. Bengio, G. Hinton, Deep learning, Nature 521 (7553) (2015) 436–444, doi:
https://doi.org/10.1038/nature14539.
[97] X. Zhang, F. T. Chan, C. Yan, I. Bose, Towards risk-aware artificial intelligence and machine
learning systems: An overview, Decision Support Systems 159 (2022) 113800, doi: https:
//doi.org/10.1016/j.dss.2022.113800.
[98] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778, doi: https:
//doi.org/10.1109/CVPR.2016.90, 2016.
[99] X. Zhang, S. Mahadevan, Bayesian neural networks for flight trajectory prediction and safety
assessment, Decision Support Systems 131 (2020) 113246, doi: https://doi.org/10.1016/
j.dss.2020.113246.
[103] N. Tagasovska, D. Lopez-Paz, Single-model uncertainties for deep learning, Advances in Neural
Information Processing Systems 32.
[105] C. E. Rasmussen, Gaussian processes in machine learning, MIT Press, doi: https://doi.
org/10.1007/978-3-540-28650-9_4, 2006.
[107] R. M. Neal, Bayesian learning for neural networks, vol. 118, Springer Science & Business
Media, 2012.
[108] R. Furrer, M. G. Genton, D. Nychka, Covariance tapering for interpolation of large spa-
tial datasets, Journal of Computational and Graphical Statistics 15 (3) (2006) 502–523, doi:
https://doi.org/10.1198/106186006X132178.
[110] N. Cressie, G. Johannesson, Fixed rank kriging for very large spatial data sets, Journal of
the Royal Statistical Society: Series B (Statistical Methodology) 70 (1) (2008) 209–226, doi:
https://doi.org/10.1111/j.1467-9868.2007.00633.x.
[111] S. Banerjee, A. E. Gelfand, A. O. Finley, H. Sang, Gaussian predictive process models for large
spatial data sets, Journal of the Royal Statistical Society: Series B (Statistical Methodology)
70 (4) (2008) 825–848, doi: https://doi.org/10.1111/j.1467-9868.2008.00663.x.
[112] R. M. Neal, Monte Carlo implementation of Gaussian process models for Bayesian regression
and classification, arXiv preprint physics/9701026.
[113] I. Andrianakis, P. G. Challenor, The effect of the nugget on Gaussian process emulators of
computer models, Computational Statistics & Data Analysis 56 (12) (2012) 4215–4228, doi:
https://doi.org/10.1016/j.csda.2012.04.020.
[114] L. Le Gratiet, C. Cannamela, B. Iooss, A Bayesian approach for global sensitivity analysis
of (multifidelity) computer codes, SIAM/ASA Journal on Uncertainty Quantification 2 (1)
(2014) 336–363, doi: https://doi.org/10.1137/130926869.
[115] M. Menz, S. Dubreuil, J. Morio, C. Gogu, N. Bartoli, M. Chiron, Variance based sensitiv-
ity analysis for Monte Carlo and importance sampling reliability assessment with Gaussian
processes, Structural Safety 93 (2021) 102116, doi: https://doi.org/10.1016/j.strusafe.
2021.102116.
[116] P. Wei, Y. Zheng, J. Fu, Y. Xu, W. Gao, An expected integrated error reduction function for
accelerating Bayesian active learning of failure probability, Reliability Engineering & System
Safety 231 (2023) 108971, doi: https://doi.org/10.1016/j.ress.2022.108971.
[117] Q. V. Le, A. J. Smola, S. Canu, Heteroscedastic Gaussian process regression, in: Proceedings
of the 22nd International Conference on Machine learning, 489–496, 2005.
[118] M. L. Stein, Interpolation of spatial data: some theory for kriging, Springer Science & Business
Media, 1999.
[119] H. Liu, Y.-S. Ong, X. Shen, J. Cai, When Gaussian process meets big data: A review of
scalable GPs, IEEE Transactions on Neural Networks and Learning Systems 31 (11) (2020)
4405–4423, doi: https://doi.org/10.1109/TNNLS.2019.2957109.
[121] R. Tripathy, I. Bilionis, M. Gonzalez, Gaussian processes with built-in dimensionality reduc-
tion: Applications to high-dimensional uncertainty propagation, Journal of Computational
Physics 321 (2016) 191–223, doi: https://doi.org/10.1016/j.jcp.2016.05.039.
[124] M. Binois, N. Wycoff, A survey on high-dimensional Gaussian process modeling with applica-
tion to Bayesian optimization, ACM Transactions on Evolutionary Learning and Optimization
2 (2) (2022) 1–26, doi: https://doi.org/10.1145/3545611.
[127] Y. A. LeCun, L. Bottou, G. B. Orr, K.-R. Müller, Efficient BackProp, in: G. Montavon,
G. B. Orr, K.-R. Müller (Eds.), Neural Networks: Tricks of the Trade, Springer-Verlag Berlin
Heidelberg, 9–48, doi: https://doi.org/10.1007/978-3-642-35289-8_3, 2012.
[128] J. O. Berger, Statistical Decision Theory and Bayesian Analysis, Springer Series in Statistics,
Springer New York, New York, NY, ISBN 978-1-4419-3074-3, doi: https://doi.org/10.
1007/978-1-4757-4286-2, 1985.
[129] J. M. Bernardo, A. F. M. Smith, Bayesian Theory, John Wiley & Sons, New York, NY, 2000.
[130] D. S. Sivia, J. Skilling, Data Analysis: A Bayesian Tutorial, Oxford University Press, New
York, NY, 2nd edn., 2006.
[132] A. Graves, Practical Variational Inference for Neural Networks, in: Advances in Neural Infor-
mation Processing Systems 24 (NIPS 2011), Granada, Spain, 2348–2356, 2011.
[135] H. Jeffreys, An invariant form for the prior probability in estimation problems, Proceedings of
the Royal Society of London. Series A. Mathematical and Physical Sciences 186 (1007) (1946)
453–461, doi: https://doi.org/10.1098/rspa.1946.0056.
[136] E. T. Jaynes, Prior Probabilities, IEEE Transactions on Systems Science and Cybernetics
4 (3) (1968) 227–241, doi: https://doi.org/10.1109/TSSC.1968.300117.
[137] V. Fortuin, Priors in Bayesian Deep Learning: A Review, International Statistical Review
90 (3) (2022) 563–591, doi: https://doi.org/10.1111/insr.12502.
[139] S. Brooks, A. Gelman, G. Jones, X.-L. Meng (Eds.), Handbook of Markov Chain Monte Carlo,
Chapman & Hall/CRC, doi: https://doi.org/10.1201/b10905, 2011.
[141] W. K. Hastings, Monte Carlo sampling methods using Markov chains and their applications,
Biometrika 57 (1) (1970) 97–109, doi: https://doi.org/10.1093/biomet/57.1.97.
[142] R. M. Neal, MCMC Using Hamiltonian Dynamics, in: Handbook of Markov Chain Monte
Carlo, 113–162, doi: https://doi.org/10.1201/b10905-6, 2011.
[144] T. Chen, E. B. Fox, C. Guestrin, Stochastic Gradient Hamiltonian Monte Carlo, in: Proceed-
ings of the 31st International Conference on Machine Learning, vol. 32, Beijing, 1683–1691,
2014.
[145] C. Zhang, B. Shahbaba, H. Zhao, Variational Hamiltonian Monte Carlo via Score Matching,
Bayesian Analysis 13 (2) (2018) 485–506, doi: https://doi.org/10.1214/17-BA1060.
[148] D. J. Rezende, S. Mohamed, Variational inference with normalizing flows, in: 32nd Interna-
tional Conference on Machine Learning, ICML 2015, vol. 2, 1530–1538, 2015.
[149] Y. Marzouk, T. Moselhy, M. Parno, A. Spantini, Sampling via Measure Transport: An In-
troduction, in: Handbook of Uncertainty Quantification, Springer International Publishing,
Cham, 1–41, doi: https://doi.org/10.1007/978-3-319-11259-6_23-1, 2016.
[150] Q. Liu, D. Wang, Stein Variational Gradient Descent: A General Purpose Bayesian Infer-
ence Algorithm, in: Advances in Neural Information Processing Systems 29 (NIPS 2016),
Barcelona, Spain, 2378–2386, 2016.
[153] P. Chen, O. Ghattas, Projected stein variational gradient descent, in: Advances in Neural In-
formation Processing Systems, doi: https://doi.org/10.48550/arXiv.2002.03469, 2020.
[154] T. P. Minka, Expectation propagation for approximate Bayesian inference, in: Proceedings
of the Seventeenth Conference on Uncertainty in Artificial Intelligence, UAI’01, AUAI Press,
Seattle, Washington, USA, 362–369, doi: https://doi.org/10.48550/arXiv.1301.2294,
2001.
[155] S. L. Lauritzen, Propagation of probabilities, means, and variances in mixed graphical asso-
ciation models, Journal of the American Statistical Association 87 (420) (1992) 1098–1108,
doi: https://doi.org/10.2307/2290647.
[157] G. Shen, X. Chen, Z. Deng, Variational learning of Bayesian neural networks via Bayesian dark
knowledge, in: Proceedings of the Twenty-Ninth International Conference on International
Joint Conferences on Artificial Intelligence, 2037–2043, doi: https://doi.org/10.24963/
ijcai.2020/282, 2021.
[160] Y. Gal, Z. Ghahramani, Bayesian convolutional neural networks with Bernoulli approximate
variational inference, arXiv preprint arXiv:1506.02158, doi: https://doi.org/10.48550/arXiv.1506.02158.
[161] I. Osband, Risk versus uncertainty in deep learning: Bayes, bootstrap and the dangers of
dropout, in: NIPS Workshop on Bayesian Deep Learning, vol. 192, 2016.
[166] D. Opitz, R. Maclin, Popular ensemble methods: An empirical study, Journal of Artificial
Intelligence Research 11 (1999) 169–198, doi: https://doi.org/10.1613/jair.614.
[167] T. G. Dietterich, Ensemble methods in machine learning, in: International Workshop on Mul-
tiple Classifier Systems, Springer, 1–15, doi: https://doi.org/10.1007/3-540-45014-9_1,
2000.
[168] L. Breiman, Bagging predictors, Machine Learning 24 (2) (1996) 123–140, doi: https://doi.
org/10.1007/BF00058655.
[169] R. E. Schapire, Y. Freund, Boosting: Foundations and algorithms, MIT Press, doi: https://doi.org/10.7551/mitpress/8291.001.0001, 2012.
[170] X. Zhang, S. Mahadevan, Ensemble machine learning models for aviation incident risk predic-
tion, Decision Support Systems 116 (2019) 48–63, doi: https://doi.org/10.1016/j.dss.
2018.10.009.
[172] D. A. Nix, A. S. Weigend, Estimating the mean and variance of the target probability distribu-
tion, in: Proceedings of 1994 IEEE International Conference on Neural Networks (ICNN’94),
vol. 1, IEEE, 55–60, doi: https://doi.org/10.1109/ICNN.1994.374138, 1994.
[173] S. Fort, H. Hu, B. Lakshminarayanan, Deep ensembles: A loss landscape perspective doi:
https://doi.org/10.48550/arXiv.1912.02757.
[176] J. Van Amersfoort, L. Smith, Y. W. Teh, Y. Gal, Uncertainty estimation using a single deep
deterministic neural network, in: International Conference on Machine Learning, PMLR,
9690–9700, doi: https://doi.org/10.48550/arXiv.2003.02037, 2020.
[177] J. Mukhoti, A. Kirsch, J. van Amersfoort, P. H. Torr, Y. Gal, Deterministic neural networks
with appropriate inductive biases capture epistemic and aleatoric uncertainty, arXiv preprint
arXiv:2102.11582.
[178] J. van Amersfoort, L. Smith, A. Jesson, O. Key, Y. Gal, On feature collapse and deep kernel
learning for single forward pass uncertainty doi: https://doi.org/10.48550/arXiv.2102.
11409.
[179] J. Liu, Z. Lin, S. Padhy, D. Tran, T. Bedrax Weiss, B. Lakshminarayanan, Simple and
principled uncertainty estimation with deterministic deep learning via distance awareness,
Advances in Neural Information Processing Systems 33 (2020) 7498–7512, doi: https:
//doi.org/10.48550/arXiv.2006.10108.
[182] T. Miyato, T. Kataoka, M. Koyama, Y. Yoshida, Spectral normalization for generative adver-
sarial networks doi: https://doi.org/10.48550/arXiv.1802.05957.
[183] J. Postels, M. Segu, T. Sun, L. Van Gool, F. Yu, F. Tombari, On the practicality of deter-
ministic epistemic uncertainty doi: https://doi.org/10.48550/arXiv.2107.00649.
[184] J. Van Landeghem, M. Blaschko, B. Anckaert, M.-F. Moens, Benchmarking scalable predictive
uncertainty in text classification, IEEE Access 10 (2022) 43703–43737, doi: https://doi.
org/10.1109/ACCESS.2022.3168734.
[186] B. Zadrozny, C. Elkan, Transforming classifier scores into accurate multiclass probability esti-
mates, in: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge
discovery and Data Mining, 694–699, doi: https://doi.org/10.1145/775047.775151, 2002.
[187] A. Niculescu-Mizil, R. Caruana, Predicting good probabilities with supervised learning, in:
Proceedings of the 22nd International Conference on Machine Learning, 625–632, doi: https:
//doi.org/10.1145/1102351.1102430, 2005.
[188] Y. Liu, W. Chen, P. Arendt, H.-Z. Huang, Toward a better understanding of model validation
metrics, Journal of Mechanical Design 133 (7), doi: https://doi.org/10.1115/1.4004223.
[190] V. Kuleshov, N. Fenner, S. Ermon, Accurate uncertainties for deep learning using calibrated
regression, in: International Conference on Machine Learning, PMLR, 2796–2804, doi: https:
//doi.org/10.48550/arXiv.1807.00263, 2018.
[191] J. Platt, et al., Probabilistic outputs for support vector machines and comparisons to regu-
larized likelihood methods, Advances in Large Margin Classifiers 10 (3) (1999) 61–74.
[192] C. Guo, G. Pleiss, Y. Sun, K. Q. Weinberger, On calibration of modern neural networks, in:
International Conference on Machine Learning, PMLR, 1321–1330, doi: https://doi.org/
10.48550/arXiv.1706.04599, 2017.
[193] D. Roman, S. Saxena, V. Robu, M. Pecht, D. Flynn, Machine learning pipeline for battery
state-of-health estimation, Nature Machine Intelligence 3 (5) (2021) 447–456, doi: https:
//doi.org/10.1038/s42256-021-00312-3.
[194] S. Ferson, W. L. Oberkampf, L. Ginzburg, Model validation and predictive capability for
the thermal challenge problem, Computer Methods in Applied Mechanics and Engineering
197 (29-32) (2008) 2408–2430, doi: https://doi.org/10.1016/j.cma.2007.07.030.
[195] C. Kondermann, R. Mester, C. Garbe, A statistical confidence measure for optical flows, in:
European Conference on Computer Vision, Springer, 290–301, doi: https://doi.org/10.
1007/978-3-540-88690-7_22, 2008.
[197] E. Ilg, O. Cicek, S. Galesso, A. Klein, O. Makansi, F. Hutter, T. Brox, Uncertainty estimates
and multi-hypotheses networks for optical flow, in: Proceedings of the European Conference on
Computer Vision (ECCV), 652–667, doi: https://doi.org/10.1007/978-3-030-01234-2_
40, 2018.
[199] F. D’Angelo, V. Fortuin, Repulsive deep ensembles are Bayesian, Advances in Neural Infor-
mation Processing Systems 34 (2021) 3451–3465, doi: https://doi.org/10.48550/arXiv.
2106.11642.
[200] C. Zhang, S. Bengio, M. Hardt, B. Recht, O. Vinyals, Understanding deep learning (still)
requires rethinking generalization, Communications of the ACM 64 (3) (2021) 107–115, doi:
https://doi.org/10.1145/3446776.
[201] O. Fink, Q. Wang, M. Svensén, P. Dersin, W.-J. Lee, M. Ducoffe, Potential, Challenges and
Future Directions for Deep Learning in Prognostics and Health Management Applications,
Engineering Applications of Artificial Intelligence 92 (2020) 103678, doi: https://doi.org/10.1016/j.engappai.2020.103678.
[202] L. Biggio, I. Kastanis, Prognostics and health management of industrial assets: Current
progress and road ahead, Frontiers in Artificial Intelligence 3 (2020) 578613, doi: https:
//doi.org/10.3389/frai.2020.578613.
[203] B. Wang, Y. Lei, N. Li, T. Yan, Deep separable convolutional network for remaining useful
life prediction of machinery, Mechanical Systems and Signal Processing 134 (2019) 106330, doi:
https://doi.org/10.1016/j.ymssp.2019.106330.
[204] J. Lee, E. Lapira, B. Bagheri, H.-a. Kao, Recent advances and trends in predictive manu-
facturing systems in big data environment, Manufacturing Letters 1 (1) (2013) 38–41, doi:
https://doi.org/10.1016/j.mfglet.2013.09.005.
[206] A. Saxena, J. Celaya, E. Balaban, K. Goebel, B. Saha, S. Saha, M. Schwabacher, Metrics for
evaluating performance of prognostic techniques, in: 2008 International Conference on Prog-
nostics and Health Management, IEEE, 1–17, doi: https://doi.org/10.1109/PHM.2008.
4711436, 2008.
[207] L. Biggio, T. Bendinelli, C. Kulkarni, O. Fink, Dynaformer: A Deep Learning Model for
Ageing-aware Battery Discharge Prediction doi: https://doi.org/10.48550/arXiv.2206.
02555.
[209] A. G. Wilson, The Case for Bayesian Deep Learning, doi: https://doi.org/10.48550/
arXiv.2001.10995, 2020.
[211] M. Teye, H. Azizpour, K. Smith, Bayesian Uncertainty Estimation for Batch Normalized Deep
Networks, doi: https://doi.org/10.48550/arXiv.1802.06455, 2018.
[212] H. Ritter, A. Botev, D. Barber, A Scalable Laplace Approximation for Neural Networks, in:
International Conference on Learning Representations, 2018.
[213] Y. Wang, Y. Zhao, S. Addepalli, Remaining useful life prediction using deep learning ap-
proaches: A review, Procedia Manufacturing 49 (2020) 81–88, doi: https://doi.org/10.
1016/j.promfg.2020.06.015.
[215] P. Rokhforoz, M. Montazeri, O. Fink, Safe multi-agent deep reinforcement learning for joint
bidding and maintenance scheduling of generation units, Reliability Engineering & System
Safety 232 (2023) 109081, doi: https://doi.org/10.48550/arXiv.2112.10459.
[216] E. Zio, Prognostics and Health Management (PHM): Where are we and where do we (need
to) go in theory and practice, Reliability Engineering & System Safety 218 (2022) 108119,
doi: https://doi.org/10.1016/j.ress.2021.108119.
[217] A. Saxena, J. Celaya, B. Saha, S. Saha, K. Goebel, Metrics for Offline Evaluation of Prognostic
Performance, International Journal of Prognostics and Health Management 1 (1), doi: https:
//doi.org/10.36001/ijphm.2010.v1i1.1336.
[218] C. Louizos, M. Welling, Multiplicative Normalizing Flows for Variational Bayesian Neural
Networks, URL https://arxiv.org/abs/1703.01961, 2017.
[222] M. Arias Chao, C. Kulkarni, K. Goebel, O. Fink, Aircraft engine run-to-failure dataset under
real flight conditions for prognostics and diagnostics, Data 6 (1) (2021) 5, doi: https://doi.
org/10.3390/data6010005.
[223] M. A. Chao, C. Kulkarni, K. Goebel, O. Fink, Fusing physics-based and deep learning models
for prognostics, Reliability Engineering & System Safety 217 (2022) 107961, doi: https:
//doi.org/10.1016/j.ress.2021.107961.
[224] Y. Tian, M. A. Chao, C. Kulkarni, K. Goebel, O. Fink, Real-time model calibration with
deep reinforcement learning, Mechanical Systems and Signal Processing 165 (2022) 108284,
doi: https://doi.org/10.1016/j.ymssp.2021.108284.
[225] T. Song, C. Liu, R. Wu, Y. Jin, D. Jiang, A hierarchical scheme for remaining useful life
prediction with long short-term memory networks, Neurocomputing 487 (2022) 22–33, doi:
https://doi.org/10.1016/j.neucom.2022.02.032.
[226] H. Mo, G. Iacca, Multi-Objective Optimization of Extreme Learning Machine for Remain-
ing Useful Life Prediction, in: International Conference on the Applications of Evolution-
ary Computation (Part of EvoStar), Springer, 191–206, doi: https://doi.org/10.1007/
978-3-031-02462-7_13, 2022.
[227] M. A. Chao, C. Kulkarni, K. Goebel, O. Fink, Fusing physics-based and deep learning models
for prognostics, Reliability Engineering & System Safety 217 (2022) 107961, doi: https:
//doi.org/10.1016/j.ress.2021.107961.
[228] I. E. Lagaris, A. Likas, D. I. Fotiadis, Artificial neural networks for solving ordinary and
partial differential equations, IEEE Transactions on Neural Networks 9 (5) (1998) 987–1000.
[229] J. Cursi, A. Koscianski, Physically constrained neural network models for simulation, in: Ad-
vances and Innovations in Systems, Computing Sciences and Software Engineering, Springer,
567–572, 2007.
[231] T. Ritto, F. Rochinha, Digital twin, physics-based model, and machine learning applied to
damage detection in structures, Mechanical Systems and Signal Processing 155 (2021) 107614,
doi: https://doi.org/10.1016/j.ymssp.2021.107614.
[234] Y. A. Yucesan, F. A. Viana, A physics-informed neural network for wind turbine main bearing
fatigue, International Journal of Prognostics and Health Management 11 (1), doi: https:
//doi.org/10.36001/ijphm.2020.v11i1.2594.
[235] C. Jiang, M. A. Vega, M. D. Todd, Z. Hu, Model correction and updating of a stochastic
degradation model for failure prognostics of miter gates, Reliability Engineering & System
Safety 218 (2022) 108203, doi: https://doi.org/10.1016/j.ress.2021.108203.
[236] M. L. Thompson, M. A. Kramer, Modeling chemical processes using prior knowledge and
neural networks, AIChE Journal 40 (8) (1994) 1328–1340, doi: https://doi.org/10.1002/
aic.690400806.
[237] J.-X. Wang, J.-L. Wu, H. Xiao, Physics-informed machine learning approach for reconstructing
Reynolds stress modeling discrepancies based on DNS data, Physical Review Fluids 2 (3)
(2017) 034603, doi: https://doi.org/10.1103/PhysRevFluids.2.034603.
[238] A. Thelen, Y. H. Lui, S. Shen, S. Laflamme, S. Hu, H. Ye, C. Hu, Integrating physics-based
modeling and machine learning for degradation diagnostics of lithium-ion batteries, Energy
Storage Materials 50 (2022) 668–695, doi: https://doi.org/10.1016/j.ensm.2022.05.047.
[239] M.-J. Azzi, C. Ghnatios, P. Avery, C. Farhat, Acceleration of a Physics-Based Machine Learn-
ing Approach for Modeling and Quantifying Model-Form Uncertainties and Performing Model
Updating, Journal of Computing and Information Science in Engineering 23 (1) (2023) 011009,
doi: https://doi.org/10.1115/1.4055546.
[240] W. Chen, Q. Wang, J. S. Hesthaven, C. Zhang, Physics-informed machine learning for reduced-
order modeling of nonlinear problems, Journal of Computational Physics 446 (2021) 110666,
doi: https://doi.org/10.1016/j.jcp.2021.110666.
[241] H. Gong, S. Cheng, Z. Chen, Q. Li, Data-enabled physics-informed machine learning for
reduced-order modeling digital twin: application to nuclear reactor physics, Nuclear Science
and Engineering 196 (6) (2022) 668–693, doi: https://doi.org/10.1080/00295639.2021.
2014752.
[242] Y. A. Yucesan, F. A. Viana, A hybrid physics-informed neural network for main bearing
fatigue prognosis under grease quality variation, Mechanical Systems and Signal Processing
171 (2022) 108875, doi: https://doi.org/10.1016/j.ymssp.2022.108875.
[244] A. Downey, Y.-H. Lui, C. Hu, S. Laflamme, S. Hu, Physics-based prognostics of lithium-ion
battery using non-linear least squares with dynamic bounds, Reliability Engineering & System
Safety 182 (2019) 1–12, doi: https://doi.org/10.1016/j.ress.2018.09.018.
[245] Y. H. Lui, M. Li, A. Downey, S. Shen, V. P. Nemani, H. Ye, C. VanElzen, G. Jain, S. Hu,
S. Laflamme, et al., Physics-based prognostics of implantable-grade lithium-ion battery for
remaining useful life prediction, Journal of Power Sources 485 (2021) 229327, doi: https:
//doi.org/10.1016/j.jpowsour.2020.229327.
[246] P. Ramuhalli, L. Udpa, S. S. Udpa, Finite-element neural networks for solving differential
equations, IEEE Transactions on Neural Networks 16 (6) (2005) 1381–1392, doi: https:
//doi.org/10.1109/TNN.2005.857945.
[247] J. Darbon, T. Meng, On some neural network architectures that can represent viscosity so-
lutions of certain high dimensional Hamilton–Jacobi partial differential equations, Journal
of Computational Physics 425 (2021) 109907, doi: https://doi.org/10.1016/j.jcp.2020.
109907.
[248] L. Lu, P. Jin, G. Pang, Z. Zhang, G. E. Karniadakis, Learning nonlinear operators via Deep-
ONet based on the universal approximation theorem of operators, Nature Machine Intelligence
3 (3) (2021) 218–229, doi: https://doi.org/10.1038/s42256-021-00302-5.
[253] L. Sun, J.-X. Wang, Physics-constrained Bayesian neural network for fluid flow reconstruction
with sparse and noisy data, Theoretical and Applied Mechanics Letters 10 (3) (2020) 161–169,
doi: https://doi.org/10.1016/j.taml.2020.01.031.
[258] R. G. Ghanem, C. Soize, C. Safta, X. Huan, G. Lacaze, J. C. Oefelein, H. N. Najm, Design
optimization of a scramjet under uncertainty using probabilistic learning on manifolds, Journal
of Computational Physics 399 (2019) 108930, doi: https://doi.org/10.1016/j.jcp.2019.
108930.
[260] A. Thelen, X. Zhang, O. Fink, Y. Lu, S. Ghosh, B. D. Youn, M. D. Todd, S. Mahadevan, C. Hu,
Z. Hu, A comprehensive review of digital twin—part 2: roles of uncertainty quantification
and optimization, a battery digital twin, and perspectives, Structural and Multidisciplinary
Optimization 66 (1) (2023) 1, doi: https://doi.org/10.1007/s00158-022-03476-7.
[264] K. Kaheman, J. N. Kutz, S. L. Brunton, SINDy-PI: a robust algorithm for parallel implicit
sparse identification of nonlinear dynamics, Proceedings of the Royal Society A 476 (2242)
(2020) 20200279, doi: https://doi.org/10.1098/rspa.2020.0279.
[266] N. M. Mangan, T. Askham, S. L. Brunton, J. N. Kutz, J. L. Proctor, Model selection for hybrid
dynamical systems via sparse regression, Proceedings of the Royal Society A 475 (2223) (2019)
20180534, doi: https://doi.org/10.1098/rspa.2018.0534.
[267] N. Wiener, The homogeneous chaos, American Journal of Mathematics 60 (4) (1938) 897–936,
doi: https://doi.org/10.2307/2371268.
[268] R. G. Ghanem, P. D. Spanos, Stochastic finite elements: a spectral approach, Courier Corpo-
ration, 2003.
[269] D. Xiu, G. E. Karniadakis, The Wiener–Askey polynomial chaos for stochastic differential
equations, SIAM Journal on Scientific Computing 24 (2) (2002) 619–644, doi: https://doi.
org/10.1137/S1064827501387826.
[270] O. P. Le Maître, M. T. Reagan, H. N. Najm, R. G. Ghanem, O. M. Knio, A stochastic
projection method for fluid flow: II. Random process, Journal of Computational Physics
181 (1) (2002) 9–44, doi: https://doi.org/10.1006/jcph.2002.7104.
[271] M. Berveiller, B. Sudret, M. Lemaire, Stochastic finite element: a non intrusive approach by
regression, European Journal of Computational Mechanics/Revue Européenne de Mécanique
Numérique 15 (1-3) (2006) 81–92, doi: https://doi.org/10.3166/remn.15.81-92.
[272] S. Smolyak, Quadrature and interpolation formulas for tensor products of certain classes of
functions, Dokl. Akad. Nauk SSSR 148 (5) (1963) 1042–1045.
[275] G. Blatman, B. Sudret, Adaptive sparse polynomial chaos expansion based on least angle
regression, Journal of Computational Physics 230 (6) (2011) 2345–2367, doi: https://doi.
org/10.1016/j.jcp.2010.12.021.
[278] G. Blatman, B. Sudret, An adaptive algorithm to build up sparse polynomial chaos expansions
for stochastic finite element analysis, Probabilistic Engineering Mechanics 25 (2) (2010) 183–
197, doi: https://doi.org/10.1016/j.probengmech.2009.10.003.
[279] C. Hu, B. D. Youn, Adaptive-sparse polynomial chaos expansion for reliability analysis and de-
sign of complex engineering systems, Structural and Multidisciplinary Optimization 43 (2011)
419–442, doi: https://doi.org/10.1007/s00158-010-0568-9.
[280] Q. Pan, D. Dias, Sliced inverse regression-based sparse polynomial chaos expansions for re-
liability analysis in high dimensions, Reliability Engineering & System Safety 167 (2017)
484–493, doi: https://doi.org/10.1016/j.ress.2017.06.026.
[281] J. Xu, F. Kong, A cubature collocation based sparse polynomial chaos expansion for efficient
structural reliability analysis, Structural Safety 74 (2018) 24–31, doi: https://doi.org/10.
1016/j.strusafe.2018.04.001.
[282] B. Bhattacharyya, Structural reliability analysis by a Bayesian sparse polynomial chaos ex-
pansion, Structural Safety 90 (2021) 102074, doi: https://doi.org/10.1016/j.strusafe.
2020.102074.
[283] N. Lüthen, S. Marelli, B. Sudret, Sparse polynomial chaos expansions: Literature survey and
benchmark, SIAM/ASA Journal on Uncertainty Quantification 9 (2) (2021) 593–649, doi:
https://doi.org/10.1137/20M1315774.
[285] P. Kersaudy, B. Sudret, N. Varsier, O. Picon, J. Wiart, A new surrogate modeling technique
combining Kriging and polynomial chaos expansions–Application to uncertainty analysis in
computational dosimetry, Journal of Computational Physics 286 (2015) 103–117, doi: https:
//doi.org/10.1016/j.jcp.2015.01.034.
[286] B. Pavlack, J. Paixão, S. Da Silva, A. Cunha Jr, D. Garcia Cava, Polynomial Chaos-
Kriging metamodel for quantification of the debonding area in large wind turbine blades,
Structural Health Monitoring 21 (2) (2022) 666–682, doi: https://doi.org/10.1177/
14759217211007956.
[287] X. Shang, P. Ma, M. Yang, T. Chao, An efficient polynomial chaos-enhanced radial basis
function approach for reliability-based design optimization, Structural and Multidisciplinary
Optimization 63 (2021) 789–805, doi: https://doi.org/10.1007/s00158-020-02730-0.
[288] E. Torre, S. Marelli, P. Embrechts, B. Sudret, Data-driven polynomial chaos expansion for
machine learning regression, Journal of Computational Physics 388 (2019) 601–623, doi:
https://doi.org/10.1016/j.jcp.2019.03.039.
[290] H. Li, J. Yin, X. Du, Uncertainty Quantification of Physics-Based Label-Free Deep Learning
and Probabilistic Prediction of Extreme Events, in: International Design Engineering Tech-
nical Conferences and Computers and Information in Engineering Conference, vol. 86236,
American Society of Mechanical Engineers, V03BT03A001, doi: https://doi.org/10.1115/
DETC2022-88277, 2022.
[291] R. M. Neal, Bayesian Learning for Neural Networks, Springer-Verlag New York, New York,
NY, doi: https://doi.org/10.1007/978-1-4612-0745-0, 1996.
[292] C. Williams, Computing with infinite networks, Advances in Neural Information Processing
Systems 9.
[293] J. Lee, Y. Bahri, R. Novak, S. S. Schoenholz, J. Pennington, J. Sohl-Dickstein, Deep Neural
Networks as Gaussian Processes, in: ICLR, 2018.
[294] R. Novak, L. Xiao, J. Lee, Y. Bahri, G. Yang, J. Hron, D. A. Abolafia, J. Pennington, J. Sohl-
Dickstein, Bayesian deep convolutional networks with many channels are Gaussian processes,
in: NIPS Workshop on Bayesian Deep Learning, 2018.
[296] Y. Cho, L. Saul, Kernel methods for deep learning, Advances in Neural Information Processing
Systems 22.
[297] A. G. Wilson, Z. Hu, R. Salakhutdinov, E. P. Xing, Deep kernel learning, in: Artificial
intelligence and statistics, PMLR, 370–378, 2016.
[298] A. Damianou, N. D. Lawrence, Deep Gaussian processes, in: Artificial intelligence and statis-
tics, PMLR, 207–215, doi: https://doi.org/10.48550/arXiv.1211.0358, 2013.
[300] H. Salimbeni, M. Deisenroth, Doubly stochastic variational inference for deep Gaussian pro-
cesses, Advances in Neural Information Processing Systems 30.
[302] M. Fuge, B. Peters, A. Agogino, Machine learning algorithms for recommending design meth-
ods, Journal of Mechanical Design 136 (10) (2014) 101103, doi: https://doi.org/10.1115/
1.4028102.
[303] J. H. Panchal, M. Fuge, Y. Liu, S. Missoum, C. Tucker, Machine learning for engineering
design, Journal of Mechanical Design 141 (11), doi: https://doi.org/10.1115/1.4044690.
[305] C. Fan, L. Zeng, Y. Sun, Y.-Y. Liu, Finding key players in complex networks through deep
reinforcement learning, Nature Machine Intelligence 2 (6) (2020) 317–324, doi: https://doi.
org/10.1038/s42256-020-0177-2.
[306] J. Jiang, Y. Xiong, Z. Zhang, D. W. Rosen, Machine learning integrated design for additive
manufacturing, Journal of Intelligent Manufacturing (2020) 1–14, doi: https://doi.org/10.1007/s10845-020-01715-6.
[307] S. M. Moosavi, K. M. Jablonka, B. Smit, The role of machine learning in the understanding
and design of materials, Journal of the American Chemical Society 142 (48) (2020) 20273–
20287, doi: https://doi.org/10.1021/jacs.0c09105.
[308] Q. Tao, P. Xu, M. Li, W. Lu, Machine learning for perovskite materials design and dis-
covery, NPJ Computational Materials 7 (1) (2021) 1–18, doi: https://doi.org/10.1038/
s41524-021-00495-8.
[311] X. Lei, C. Liu, Z. Du, W. Zhang, X. Guo, Machine learning-driven real-time topology opti-
mization under moving morphable component-based framework, Journal of Applied Mechanics
86 (1) (2019) 011004, doi: https://doi.org/10.1115/1.4041319.
[312] G. E. Hinton, R. R. Salakhutdinov, Reducing the dimensionality of data with neural networks,
Science 313 (5786) (2006) 504–507, doi: https://doi.org/10.1126/science.1127647.
[313] C. Qian, R. K. Tan, W. Ye, An adaptive artificial neural network-based generative design
method for layout designs, International Journal of Heat and Mass Transfer 184 (2022) 122313,
doi: https://doi.org/10.1016/j.ijheatmasstransfer.2021.122313.
[318] M. C. Kennedy, A. O’Hagan, Bayesian calibration of computer models, Journal of the Royal
Statistical Society: Series B (Statistical Methodology) 63 (3) (2001) 425–464, doi: https:
//doi.org/10.1111/1467-9868.00294.
[319] K. Cheng, Z. Lu, C. Ling, S. Zhou, Surrogate-assisted global sensitivity analysis: an overview,
Structural and Multidisciplinary Optimization 61 (3) (2020) 1187–1213, doi: https://doi.
org/10.1007/s00158-019-02413-5.
[321] F. A. Viana, R. T. Haftka, V. Steffen, Multiple surrogates: how cross-validation errors can
help us to obtain the best predictor, Structural and Multidisciplinary Optimization 39 (4)
(2009) 439–457, doi: https://doi.org/10.1007/s00158-008-0338-0.
[322] R. Jin, X. Du, W. Chen, The use of metamodeling techniques for optimization under un-
certainty, Structural and Multidisciplinary Optimization 25 (2) (2003) 99–116, doi: https:
//doi.org/10.1007/s00158-002-0277-0.
[323] Z. Hu, S. Mahadevan, A single-loop kriging surrogate modeling for time-dependent reliability
analysis, Journal of Mechanical Design 138 (6), doi: https://doi.org/10.1115/1.4033428.
[325] X. Zhang, L. Wang, J. D. Sørensen, REIF: a novel active-learning function toward adaptive
Kriging surrogate models for structural reliability analysis, Reliability Engineering & System
Safety 185 (2019) 440–454, doi: https://doi.org/10.1016/j.ress.2019.01.014.
[326] L. Yan, T. Zhou, Adaptive multi-fidelity polynomial chaos approach to Bayesian inference
in inverse problems, Journal of Computational Physics 381 (2019) 110–128, doi: https:
//doi.org/10.1016/j.jcp.2018.12.025.
[327] Y. Zhang, D. W. Apley, W. Chen, Bayesian optimization for materials design with mixed
quantitative and qualitative variables, Scientific Reports 10 (1) (2020) 1–13, doi: https:
//doi.org/10.1038/s41598-020-60652-9.
[328] US NSTC, Materials Genome Initiative for global competitiveness, Executive Office of the
President, National Science and Technology Council, 2011.
[330] D. McDowell, J. Scott, et al., Creating the Next-Generation Materials Genome Initiative
Workforce, Tech. Rep., The Minerals Metals and Materials Society, 2019.
[331] J. J. de Pablo, et al., New frontiers for the materials genome initiative, NPJ Computational
Materials 5 (1) (2019) 1–23, doi: https://doi.org/10.1038/s41524-019-0173-4.
[333] H. Sasaki, H. Igarashi, Topology optimization accelerated by deep learning, IEEE Transactions
on Magnetics 55 (6) (2019) 1–5, doi: https://doi.org/10.1109/TMAG.2019.2901906.
[336] Z. Hu, S. Mahadevan, Global sensitivity analysis-enhanced surrogate (GSAS) modeling for
reliability analysis, Structural and Multidisciplinary Optimization 53 (3) (2016) 501–521, doi:
https://doi.org/10.1007/s00158-015-1347-4.
[337] J. Li, B. Wang, Z. Li, Y. Wang, An improved active learning method combing with the weight
information entropy and Monte Carlo simulation of efficient structural reliability analysis, Pro-
ceedings of the Institution of Mechanical Engineers, Part C: Journal of Mechanical Engineering
Science 235 (19) (2021) 4296–4313, doi: https://doi.org/10.1177/0954406220973233.
[341] P. I. Frazier, Bayesian optimization, in: Recent advances in optimization and modeling of
contemporary problems, INFORMS, 255–278, doi: https://doi.org/10.1287/educ.2018.
0188, 2018.
[342] W. Shen, X. Huan, Bayesian sequential optimal experimental design for nonlinear models
using policy gradient reinforcement learning, arXiv preprint arXiv:2110.15335, doi: https://doi.org/10.48550/arXiv.2110.15335.
[343] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville,
Y. Bengio, Generative adversarial networks, Communications of the ACM 63 (11) (2020)
139–144, doi: https://doi.org/10.1145/3422622.
[345] J. Chen, C. Chen, Z. Xing, X. Xia, L. Zhu, J. Grundy, J. Wang, Wireframe-based UI design
search through image autoencoder, ACM Transactions on Software Engineering and Method-
ology (TOSEM) 29 (3) (2020) 1–31, doi: https://doi.org/10.1145/3391613.
[346] X. Li, C. Xie, Z. Sha, A Predictive and Generative Design Approach for Three-Dimensional
Mesh Shapes Using Target-Embedding Variational Autoencoder, Journal of Mechanical De-
sign 144 (11) (2022) 114501, doi: https://doi.org/10.1115/1.4054906.
[347] S. Oh, Y. Jung, S. Kim, I. Lee, N. Kang, Deep generative design: Integration of topology optimization and generative models, Journal of Mechanical Design 141 (11), doi: https://doi.org/10.1115/1.4044229.
[348] L. Regenwetter, F. Ahmed, Towards Goal, Feasibility, and Diversity-Oriented Deep Generative Models in Design, arXiv preprint arXiv:2206.07170, doi: https://doi.org/10.48550/arXiv.2206.07170.
[349] H. Song, K. K. Choi, I. Lee, L. Zhao, D. Lamb, Adaptive virtual support vector machine for re-
liability analysis of high-dimensional problems, Structural and Multidisciplinary Optimization
47 (4) (2013) 479–491, doi: https://doi.org/10.1007/s00158-012-0857-6.
[350] A. Basudhar, S. Missoum, Adaptive explicit decision functions for probabilistic design and
optimization using support vector machines, Computers & Structures 86 (19-20) (2008) 1904–
1917, doi: https://doi.org/10.1016/j.compstruc.2008.02.008.
[351] O. Sener, S. Savarese, Active learning for convolutional neural networks: A core-set approach, arXiv preprint arXiv:1708.00489, doi: https://doi.org/10.48550/arXiv.1708.00489.
[352] J. M. Haut, M. E. Paoletti, J. Plaza, J. Li, A. Plaza, Active learning with convolutional neural networks for hyperspectral image classification using a new Bayesian approach, IEEE Transactions on Geoscience and Remote Sensing 56 (11) (2018) 6440–6461, doi: https://doi.org/10.1109/TGRS.2018.2838665.
[353] Z. Xiang, J. Chen, Y. Bao, H. Li, An active learning method combining deep neural net-
work and weighted sampling for structural reliability analysis, Mechanical Systems and Signal
Processing 140 (2020) 106684, doi: https://doi.org/10.1016/j.ymssp.2020.106684.
[354] Y. Bao, Z. Xiang, H. Li, Adaptive subset searching-based deep neural network method for
structural reliability analysis, Reliability Engineering & System Safety 213 (2021) 107778, doi:
https://doi.org/10.1016/j.ress.2021.107778.
[355] L. C. Nguyen, H. Nguyen-Xuan, Deep learning for computational structural optimization, ISA
Transactions 103 (2020) 177–191, doi: https://doi.org/10.1016/j.isatra.2020.03.033.
[356] T. Asano, S. Noda, Optimization of photonic crystal nanocavities based on deep learning,
Optics Express 26 (25) (2018) 32704–32717, doi: https://doi.org/10.1364/OE.26.032704.
[357] J. J. Beland, P. B. Nair, Bayesian optimization under uncertainty, in: NIPS BayesOpt 2017
workshop, 2017.
[359] P. I. Frazier, J. Wang, Bayesian optimization for materials design, in: Information Science for Materials Discovery and Design, Springer, 2016, 45–75, doi: https://doi.org/10.1007/978-3-319-23871-5_3.
[362] D. Liu, Y. Wang, Metal Additive Manufacturing Process Design based on Physics Con-
strained Neural Networks and Multi-Objective Bayesian Optimization, Manufacturing Letters
33 (2022) 817–827, doi: https://doi.org/10.1016/j.mfglet.2022.07.101.
[363] L. Le Gratiet, J. Garnier, Recursive co-kriging model for design of computer experiments
with multiple levels of fidelity, International Journal for Uncertainty Quantification 4 (5), doi:
https://doi.org/10.1615/Int.J.UncertaintyQuantification.2014006914.
[365] R. P. Dwight, Z.-H. Han, Efficient uncertainty quantification using gradient-enhanced kriging, AIAA Paper 2009-2276, 2009, doi: https://doi.org/10.2514/6.2009-2276.
[366] A. Tran, M. Tran, Y. Wang, Constrained mixed-integer Gaussian mixture Bayesian optimization and its applications in designing fractal and auxetic metamaterials, Structural and Multidisciplinary Optimization 59 (2019) 2131–2154, doi: https://doi.org/10.1007/s00158-018-2182-1.
[367] C. Paciorek, M. Schervish, Nonstationary covariance functions for Gaussian process regression,
Advances in Neural Information Processing Systems 16.
[369] S. Remes, M. Heinonen, S. Kaski, Non-stationary spectral kernels, Advances in Neural Infor-
mation Processing Systems 30.
[370] M. Schwabacher, K. Goebel, A Survey of Artificial Intelligence for Prognostics, in: AAAI Fall Symposium: Artificial Intelligence for Prognostics, Arlington, VA, 2007, 108–115.
[371] M. Kefalas, B. van Stein, M. Baratchi, A. Apostolidis, T. Bäck, An End-to-End Pipeline for Uncertainty Quantification and Remaining Useful Life Estimation: An Application on Aircraft Engines, in: PHM Society European Conference, vol. 7, 2022, 245–260, doi: https://doi.org/10.36001/phme.2022.v7i1.3317.
[372] J. Lee, M. Mitici, Deep reinforcement learning for predictive aircraft maintenance using Probabilistic Remaining-Useful-Life prognostics, Reliability Engineering & System Safety (2022) 108908, doi: https://doi.org/10.1016/j.ress.2022.108908.
[373] G. Mazaev, G. Crevecoeur, S. Van Hoecke, Bayesian convolutional neural networks for remaining useful life prognostics of solenoid valves with uncertainty estimations, IEEE Transactions on Industrial Informatics 17 (12) (2021) 8418–8428, doi: https://doi.org/10.1109/TII.2021.3078193.
[374] R. Zhu, Y. Chen, W. Peng, Z.-S. Ye, Bayesian deep-learning for RUL prediction: An active learning perspective, Reliability Engineering & System Safety 228 (2022) 108758, doi: https://doi.org/10.1016/j.ress.2022.108758.
[375] J. Yang, Y. Peng, J. Xie, P. Wang, Remaining Useful Life Prediction Method for Bearings Based on LSTM with Uncertainty Quantification, Sensors 22 (12) (2022) 4549, doi: https://doi.org/10.3390/s22124549.
[376] G. Li, L. Yang, C.-G. Lee, X. Wang, M. Rong, A Bayesian deep learning RUL framework
integrating epistemic and aleatoric uncertainties, IEEE Transactions on Industrial Electronics
68 (9) (2020) 8829–8841, doi: https://doi.org/10.1109/TIE.2020.3009593.
[377] Y.-H. Lin, G.-H. Li, A Bayesian Deep Learning Framework for RUL Prediction Incorporating Uncertainty Quantification and Calibration, IEEE Transactions on Industrial Informatics, doi: https://doi.org/10.1109/TII.2022.3156965.
[378] M. Wei, H. Gu, M. Ye, Q. Wang, X. Xu, C. Wu, Remaining useful life prediction of lithium-ion
batteries based on Monte Carlo Dropout and gated recurrent unit, Energy Reports 7 (2021)
2862–2871, doi: https://doi.org/10.1016/j.egyr.2021.05.019.
[379] Y. Kong, X. Zhang, S. Mahadevan, Bayesian Deep Learning for Aircraft Hard Landing Safety
Assessment, IEEE Transactions on Intelligent Transportation Systems 23 (10) (2022) 17062–
17076, doi: https://doi.org/10.1109/TITS.2022.3162566.
[380] W. Peng, Z.-S. Ye, N. Chen, Bayesian deep-learning-based health prognostics toward prognos-
tics uncertainty, IEEE Transactions on Industrial Electronics 67 (3) (2019) 2283–2293, doi:
https://doi.org/10.1109/TIE.2019.2907440.
[381] S. Xiang, Y. Qin, J. Luo, F. Wu, K. Gryllias, A concise self-adapting deep learning network
for machine remaining useful life prediction, Mechanical Systems and Signal Processing 191
(2023) 110187, ISSN 0888-3270, doi: https://doi.org/10.1016/j.ymssp.2023.110187.
[382] M. Xu, P. Baraldi, S. Al-Dahidi, E. Zio, Fault Prognostics by an Ensemble of Echo State
Networks in Presence of Event Based Measurements, Engineering Applications of Artificial
Intelligence 87 (2019) 103346, doi: https://doi.org/10.1016/j.engappai.2019.103346.
[383] J. Zgraggen, G. Pizza, L. G. Huber, Uncertainty Informed Anomaly Scores with Deep Learn-
ing: Robust Fault Detection with Limited Data, in: PHM Society European Conference,
vol. 7, 530–540, doi: https://doi.org/10.36001/phme.2022.v7i1.3342, 2022.
[384] Y. Liao, L. Zhang, C. Liu, Uncertainty prediction of remaining useful life using long short-term memory network based on bootstrap method, in: 2018 IEEE International Conference on Prognostics and Health Management (ICPHM), IEEE, 2018, 1–8, doi: https://doi.org/10.1109/ICPHM.2018.8448804.
[387] B. Ellis, P. S. Heyns, S. Schmidt, A hybrid framework for remaining useful life estimation
of turbomachine rotor blades, Mechanical Systems and Signal Processing 170 (2022) 108805,
doi: https://doi.org/10.1016/j.ymssp.2022.108805.
[388] M. Jankowiak, G. Pleiss, J. R. Gardner, Deep Sigma Point Processes, URL https://arxiv.org/abs/2002.09112.
Appendix A. Some further discussions on Gaussian process regression
The Matérn class of kernels takes the following general form:

$$k(\mathbf{x}, \mathbf{x}') = \sigma_f^2\, \frac{2^{1-\nu}}{\Gamma(\nu)} \left(\frac{\sqrt{2\nu}}{l}\, \mathrm{dist}(\mathbf{x}, \mathbf{x}')\right)^{\nu} K_\nu\!\left(\frac{\sqrt{2\nu}}{l}\, \mathrm{dist}(\mathbf{x}, \mathbf{x}')\right), \qquad \text{(A.1)}$$

where $\Gamma(\cdot)$ is the Gamma function, $\mathrm{dist}(\mathbf{x}, \mathbf{x}')$ is the Euclidean distance between points $\mathbf{x}$ and $\mathbf{x}'$, i.e., $\mathrm{dist}(\mathbf{x}, \mathbf{x}') = |\mathbf{x} - \mathbf{x}'| = \sqrt{\sum_{d=1}^{D} (x_d - x_d')^2}$, and $K_\nu$ is the modified Bessel function of the second kind and order $\nu$. A larger value of $\nu$ results in a smoother approximated function. When $\nu \to \infty$, the Matérn kernel becomes the squared exponential kernel. Another special case worth mentioning is $\nu = 1/2$, where the Matérn kernel is equivalent to the absolute exponential kernel (sometimes also called the Ornstein-Uhlenbeck process kernel), which can be expressed as

$$k(\mathbf{x}, \mathbf{x}') = \sigma_f^2 \exp\!\left(-\frac{\mathrm{dist}(\mathbf{x}, \mathbf{x}')}{l}\right). \qquad \text{(A.2)}$$
GPR using this Matérn 1/2 kernel yields rather unsmooth (rough) functions sampled from the
Gaussian process prior and posterior. Additionally, observations do not inform predictions on input
points far away from the points of observations, leading to poor generalization performance of the
resulting GPR model. Two other special cases of the Matérn kernels are ν = 3/2 and ν = 5/2.
The resulting Matérn 3/2 and Matérn 5/2 kernels are not infinitely differentiable, unlike the
squared exponential kernel, but are at least once (ν = 3/2) or twice (ν = 5/2) differentiable. These
two kernels may be useful in cases where intermediate solutions between the unsmooth Matérn 1/2
kernel and the perfectly smooth squared exponential kernel are needed to approximate functions
that are expected to be somewhat smooth yet not perfectly smooth.
The Matérn kernel in Eq. (A.1) has a single length scale $l$ and is of an isotropic form. Like the ARD squared exponential kernel shown in Eq. (11), an anisotropic variant of the Matérn kernel can be defined by introducing $D$ length scales, each depicting the relevance of an input dimension. The resulting ARD Matérn kernel has a slightly modified term, $\sqrt{\sum_{d=1}^{D} (x_d - x_d')^2 / l_d^2}$, in place of the original term, $\sqrt{\sum_{d=1}^{D} (x_d - x_d')^2}/l$ (i.e., $\mathrm{dist}(\mathbf{x}, \mathbf{x}')/l$ in Eq. (A.1)). For a $D$-dimensional input $\mathbf{x} \in \mathcal{X} \subseteq \mathbb{R}^D$, an anisotropic kernel is composed of $(D+1)$ hyperparameters, $\sigma_f, l_1, \ldots, l_D$.
To illustrate the concept of kernels, Fig. A.24 compares GPR models built using multiple commonly used kernels in a 1D example. As demonstrated in this figure, the squared exponential kernel produces the smoothest GPR, whereas Matérn 1/2 produces the roughest GPR. The intuition is that the larger the ν value, the smoother the underlying function. Specifically, when ν = 1/2, the Gaussian process sampled from the posterior with this kernel (Matérn 1/2) corresponds to an Ornstein-Uhlenbeck process, whose sample paths are as rough as those of a Brownian motion (Wiener process), whereas ν → ∞ smoothens the sampled Gaussian process because the posterior mean is infinitely differentiable (i.e., C∞) [105]. The noiseless ground truth, f(x) = sin(0.9x), is plotted as dot-dashed magenta lines. Each noisy observation used for training is obtained based on the following observation model: y = f(x) + ε, where the Gaussian noise ε ∼ N(0, 0.1²). Eight training observations are plotted as black dots, and five samples randomly drawn from the GPR posterior are plotted as dotted purple lines.

Figure A.24: Comparison of GPR models built using multiple kernels: squared exponential (ν → ∞), Matérn 1/2 (ν = 1/2), Matérn 3/2 (ν = 3/2), and Matérn 5/2 (ν = 5/2), with the same eight training data points, along with five samples randomly drawn from the posterior.
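For readers who wish to reproduce a comparison of this kind, the following sketch builds GPR models with the four kernels using scikit-learn; the training data, noise level, and initial length scales are illustrative assumptions rather than the exact settings behind Fig. A.24.

```python
# A minimal sketch of a Fig. A.24-style comparison: GPR models with Matern
# kernels of increasing smoothness fit to y = sin(0.9x) + noise.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern, RBF

rng = np.random.default_rng(0)
X_train = rng.uniform(-5, 5, size=(8, 1))                # eight training inputs
y_train = np.sin(0.9 * X_train).ravel() + 0.1 * rng.standard_normal(8)
X_test = np.linspace(-5, 5, 200).reshape(-1, 1)

kernels = {
    "Matern 1/2": Matern(length_scale=1.0, nu=0.5),
    "Matern 3/2": Matern(length_scale=1.0, nu=1.5),
    "Matern 5/2": Matern(length_scale=1.0, nu=2.5),
    "Squared exponential": RBF(length_scale=1.0),        # the nu -> infinity limit
}
for name, kernel in kernels.items():
    # alpha is the assumed observation-noise variance added to the diagonal
    gpr = GaussianProcessRegressor(kernel=kernel, alpha=0.1**2)
    gpr.fit(X_train, y_train)
    samples = gpr.sample_y(X_test, n_samples=5, random_state=0)  # posterior draws
    mean, std = gpr.predict(X_test, return_std=True)
    print(f"{name}: mean predictive std = {std.mean():.3f}, "
          f"posterior draws shape = {samples.shape}")
```

Plotting the five posterior draws for each kernel reproduces the qualitative picture in Fig. A.24: the smaller the ν, the rougher the sampled functions.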
[Figure A.25 consists of four panels showing the Gaussian process posterior on the 1D toy example under different hyperparameter settings: θ = [l, σf, σε]T = [1, 1, 0.1]T with log p(yt|Xt, θ) = −1.6; [0.1, 1, 0.1]T with −21.4; [1, 3, 0.1]T with −9.7; and [1, 1, 0.05]T with −23.0.]
Figure A.25: Effect of hyperparameters on the Gaussian process posterior for the 1D toy example used in Fig. 5.
Note that the confidence intervals shown collectively as light blue shade are derived from the posterior of (noisy)
observations (function output plus noise); they are slightly wider than the confidence intervals for the underlying
function shown in Fig. 5 due to the added Gaussian noise (see the discussion below Eqs. (18) and (19) in Sec.
3.1.1.d).
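The log marginal likelihood values annotated in Fig. A.25 can be computed for any fixed hyperparameter vector θ = [l, σf, σε]T. A minimal sketch with scikit-learn is given below; the training set is illustrative, so the printed values will not match the figure exactly.

```python
# A short sketch of the effect illustrated in Fig. A.25: evaluating the log
# marginal likelihood log p(y_t | X_t, theta) at fixed hyperparameter choices.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(1)
X_t = rng.uniform(-5, 5, size=(8, 1))                    # illustrative training data
y_t = np.sin(0.9 * X_t).ravel() + 0.1 * rng.standard_normal(8)

for l, sigma_f, sigma_eps in [(1, 1, 0.1), (0.1, 1, 0.1), (1, 3, 0.1), (1, 1, 0.05)]:
    kernel = sigma_f**2 * RBF(length_scale=l) + WhiteKernel(noise_level=sigma_eps**2)
    # optimizer=None keeps theta fixed instead of maximizing the marginal likelihood
    gpr = GaussianProcessRegressor(kernel=kernel, optimizer=None).fit(X_t, y_t)
    print(f"theta = [{l}, {sigma_f}, {sigma_eps}]: "
          f"log p(y_t|X_t, theta) = {gpr.log_marginal_likelihood_value_:.1f}")
```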
It was shown in the early work of Neal [291] that a Gaussian process is equivalent to a fully-connected neural network with a single, infinite-width hidden layer and an i.i.d. prior over the network parameters (weights and biases). This equivalence is significant because using a
Gaussian process prior over functions allows one to perform Bayesian inference in its exact form
on neural networks using simple matrix operations (see the familiar formulae for Gaussian process
posterior in Eqs. (16) and (17)) [292]. One obvious benefit is that one does not need to resort to iter-
ative, more computationally expensive training algorithms, such as gradient descent and stochastic
gradient descent, or approximate Bayesian inference methods for Bayesian neural networks (see Sec.
3.2). As deep learning has been gaining popularity in recent years, significant extensions were made
to draw such connections for standard DNNs [293] and DNNs with convolutional filters, or so-called
deep convolutional neural networks [294, 295].
Figure A.26: A single-hidden-layer neural network (inputs x1, . . . , xD, hidden units h1, . . . , hNH, and output y) where the number of hidden units NH could approach infinity, i.e., NH → ∞. W0 and b0 conveniently denote the NH × D matrix of input-to-hidden weights and the vector of NH input-to-hidden biases. Similarly, w1 denotes the vector of NH hidden-to-output weights, again for notational convenience purposes.
Let us now briefly review the early work in [291]. We consider a fully-connected neural network
with one hidden layer, illustrated in Fig. A.26. To get to each hidden node hj , 1 ≤ j ≤ NH , where
NH is the number of hidden units, we first apply a linear transformation of input point x and then
a nonlinear operation using an activation function ψ(·) : RD 7→ R. The resulting j-th hidden unit
takes the following form:

$$h_j(\mathbf{x}) = \psi\!\left(b_j^0 + \sum_{d=1}^{D} w_{dj}^0 x_d\right), \qquad \text{(A.3)}$$

where $w_{dj}^0$ denotes the input-to-hidden weight from $x_d$ to $h_j$ and $b_j^0$ is the input-to-hidden bias for $h_j$. To get to the output node $y$ (assuming zero observation noise for simplicity, i.e., $y(\mathbf{x}) = f(\mathbf{x})$),
we apply another linear transformation of the hidden units with hidden-to-output weights and a
bias
$$y(\mathbf{x}) = b^1 + \sum_{j=1}^{N_H} w_j^1 h_j(\mathbf{x}), \qquad \text{(A.4)}$$

where $w_j^1$ denotes the hidden-to-output weight from $h_j$ to $y$, and $b^1$ is the hidden-to-output bias.
We assume (1) the priors of the hidden-to-output weights $w_j^1$ and bias $b^1$ follow independent zero-mean (often Gaussian) distributions with variances $\sigma_{w^1}^2$ and $\sigma_{b^1}^2$, respectively, and (2) the input-to-hidden weights $w_{dj}^0$ and biases $b_j^0$ are i.i.d. It follows that the network output $y(\mathbf{x})$ in Eq. (A.4) is a summation over $(N_H + 1)$ i.i.d. random variables [291]. Based on the Central Limit Theorem, when $N_H \to \infty$, i.e., when the width of the hidden layer approaches infinity, $\hat{y}(\mathbf{x})$ will follow a Gaussian distribution. This Gaussian prior holds regardless of the distribution types of the $(N_H + 1)$ random variables in the sum. Let us move on to look at any finite set of input points, $\mathbf{x}_1, \ldots, \mathbf{x}_{N_*}$. As $N_H \to \infty$, their network outputs, $\hat{y}_1, \ldots, \hat{y}_{N_*}$, will be jointly Gaussian, according to the multidimensional Central Limit Theorem. It means that the joint distribution of the network outputs at any finite collection of input points is multivariate Gaussian, which exactly matches the definition of a Gaussian process discussed in Sec. 3.1.1.a. Thus, $\hat{y}(\mathbf{x}) \sim \mathcal{GP}(m_{nn}(\mathbf{x}), k_{nn}(\mathbf{x}, \mathbf{x}'))$, a Gaussian process with the mean function $m_{nn}(\cdot)$ and covariance function $k_{nn}(\cdot, \cdot)$. Since the hidden-to-output weights $w_j^1$ and bias $b^1$ have zero means, $m_{nn} \equiv \mathbb{E}[\hat{y}(\mathbf{x})] = 0$. The covariance function can be derived based on the i.i.d. conditions and takes the following form:

$$k_{nn}(\mathbf{x}, \mathbf{x}') \equiv \mathbb{E}[\hat{y}(\mathbf{x})\hat{y}(\mathbf{x}')] = \sigma_{b^1}^2 + \sigma_{w^1}^2 \sum_{j=1}^{N_H} \mathbb{E}[h_j(\mathbf{x}) h_j(\mathbf{x}')] = \sigma_{b^1}^2 + \underbrace{N_H \sigma_{w^1}^2}_{\omega^2}\, \underbrace{\mathbb{E}[h_j(\mathbf{x}) h_j(\mathbf{x}')]}_{C(\mathbf{x}, \mathbf{x}')}, \qquad \text{(A.5)}$$

where the prior variance $\sigma_{w^1}^2$ of each hidden-to-output weight is set to scale carefully as $\omega^2/N_H$ for some fixed "unscaled" variance $\omega^2$, and $C(\mathbf{x}, \mathbf{x}')$ needs to be evaluated for all $\mathbf{x}$ in the training set
and all x′ in the training and test sets. C(x, x′ ) has an analytic form for certain types of activation
functions such as the error function (or Gaussian nonlinearities) [105, 292], one-sided polynomial
functions [296], and ReLU (rectified linear unit) [293]. As a result, infinitely wide Bayesian neural
networks give rise to a new family of GPR kernels. An interesting and attractive property of
these neural networks is that all network parameters are often initialized as independent zero-mean
Gaussians, some with properly scaled variances, and the kernel parameters (e.g., “unscaled” prior
variances of weights and prior variances of biases) may be the only parameters that need to be
optimized.
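The convergence argument above can also be checked numerically. The sketch below draws many random single-hidden-layer networks under the $\sigma_{w^1}^2 = \omega^2/N_H$ scaling of Eq. (A.5) and shows that, as the width grows, the outputs at two fixed inputs approach a joint Gaussian (excess kurtosis near zero) with a stable covariance; the tanh activation and prior scales are arbitrary choices made here for illustration.

```python
# A numerical sketch of the construction in Eqs. (A.3)-(A.5): outputs of random
# single-hidden-layer networks at a fixed pair of inputs become jointly Gaussian
# with a stable covariance as the hidden width N_H grows.
import numpy as np

rng = np.random.default_rng(2)
D, omega, sigma_b = 3, 1.0, 0.5            # input dim and prior scales (assumed)
x1, x2 = rng.standard_normal(D), rng.standard_normal(D)

def sample_outputs(n_h, n_networks=5000):
    """Draw network outputs y(x1), y(x2) over random parameter priors."""
    W0 = rng.standard_normal((n_networks, n_h, D))       # input-to-hidden weights
    b0 = rng.standard_normal((n_networks, n_h))          # input-to-hidden biases
    w1 = (omega / np.sqrt(n_h)) * rng.standard_normal((n_networks, n_h))
    b1 = sigma_b * rng.standard_normal(n_networks)       # hidden-to-output bias
    h1 = np.tanh(b0 + W0 @ x1)                           # hidden activations at x1
    h2 = np.tanh(b0 + W0 @ x2)
    y1 = b1 + np.einsum("nj,nj->n", w1, h1)              # Eq. (A.4) at x1
    y2 = b1 + np.einsum("nj,nj->n", w1, h2)
    return y1, y2

for n_h in [1, 10, 100, 500]:
    y1, y2 = sample_outputs(n_h)
    kurt = np.mean(((y1 - y1.mean()) / y1.std()) ** 4) - 3   # 0 for a Gaussian
    print(f"N_H = {n_h:4d}: cov(y1, y2) = {np.cov(y1, y2)[0, 1]:+.3f}, "
          f"excess kurtosis of y1 = {kurt:+.3f}")
```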
What has been discussed in this subsection represents a category of approaches for combining
the strengths of GPR (exact Bayesian inference, distance awareness, etc.) with those of neural
networks (feature extraction from high-dimensional inputs (large D), ability to model nonlinearities,
etc.). These approaches explore the direct theoretical relationship between infinitely wide neural
networks and GPR. Another category of approaches uses GPR with standard kernels (such as the
squared exponential kernel in Eq. (10)) whose inputs are feature representations in the hidden
space learned by a neural network [178–180, 297]. These approaches are often called deep kernel
learning. The network weights, biases, and GPR kernel parameters can be jointly optimized end-to-
end, which is straightforward to implement using gradient descent or stochastic gradient descent.
These approaches excel in OOD detection thanks to the distance awareness property of GPR and
offer a solution to improving the scalability of GPR to high-dimensional inputs. A drawback is
that overparameterization associated with a DNN (e.g., a deep convolutional neural network) may
make the network prone to overfitting. Another issue is feature collapse [176], which needs to be
carefully addressed to preserve input distances in the hidden space. This issue will be discussed along
with a representative approach in this category called spectral-normalized neural Gaussian process
(SNGP) in Sec. 3.4. A third category of approaches aims to mimic the many-layer architecture of
a DNN by stacking Gaussian processes on top of one another in a hierarchical form [298–301]. The
resulting deep Gaussian processes are probabilistic ML models with the UQ capability brought in
by GPR and the added flexibility to learn complex mappings from datasets that can be small or
large. However, the performance gains over standard GPR come at a cost: exact Bayesian inference
by deep Gaussian processes can be prohibitively expensive due to the computationally demanding
need to compute the inverse and determinant of the covariance matrix. Therefore, almost all deep
Gaussian process approaches adopt appropriate inference techniques for efficient model training
that use only a small set of the so-called inducing points to build covariance matrices [299–301].
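To make the deep kernel learning idea above more concrete, the following sketch implements a deliberately simplified, two-stage stand-in: a small PyTorch network is trained first, and a scikit-learn GPR is then fit on its learned hidden features. The approaches in [178–180, 297] instead optimize the network and kernel parameters jointly end-to-end, so this should be read as an illustration of the idea, not a faithful reproduction, and all data and architecture choices below are assumptions.

```python
# A simplified, two-stage sketch of deep kernel learning: stage 1 trains a small
# MLP; stage 2 fits a GPR on its learned 2D feature representation.
import numpy as np
import torch
import torch.nn as nn
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

torch.manual_seed(0)
rng = np.random.default_rng(0)
X = rng.uniform(-5, 5, size=(200, 10)).astype(np.float32)   # high-dimensional inputs
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200).astype(np.float32)

feature_net = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
head = nn.Linear(2, 1)
opt = torch.optim.Adam(list(feature_net.parameters()) + list(head.parameters()), lr=1e-2)
Xt, yt = torch.from_numpy(X), torch.from_numpy(y).unsqueeze(1)
for _ in range(500):                                        # stage 1: fit the MLP
    opt.zero_grad()
    loss = nn.functional.mse_loss(head(feature_net(Xt)), yt)
    loss.backward()
    opt.step()

with torch.no_grad():                                       # stage 2: GPR on features
    Z = feature_net(Xt).numpy()
gpr = GaussianProcessRegressor(RBF(length_scale=1.0) + WhiteKernel(0.01)).fit(Z, y)
mean, std = gpr.predict(Z, return_std=True)
print(f"in-sample RMSE = {np.sqrt(np.mean((mean - y) ** 2)):.3f}")
```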
Appendix B.1. Major activities of ML in engineering design

i. Feature extraction: Extracting informative features from massive volumes of raw data is
a representative use case of ML in engineering design. In this regard, ML, particularly deep
learning, has become more and more prevalent in engineering design due to its salient charac-
teristic of automatically extracting feature representations from high-dimensional data in its
raw form. Specifically, in the context of engineering design, the powerful representation learn-
ing ability has been frequently utilized in two types of design activities, namely (1) dimension
reduction, which is to reduce the dimensionality of design problems, and (2) generative design,
which is to generate candidate designs subject to certain design constraints [312–314].
(a) For dimension reduction, the autoencoder, as an unsupervised learning technique, has been commonly adopted to learn efficient codings and compressed knowledge representations from unlabeled data [312]. More specifically, an autoencoder consists of an encoder and a decoder: the encoder compresses the high-dimensional input into a low-dimensional latent representation, and the decoder reconstructs the input from this latent representation.
[Figure B.27: An overview of ML in engineering design, spanning surrogate modeling, feature extraction, optimization, material design, energy system design, topology design, design for reliability, design for additive manufacturing (AM), and design for autonomy.]
ii. Surrogate modeling: ML models have been widely used as surrogates of computationally expensive computer simulation models in engineering design [24]. With the development of
computational mechanics and advanced numerical solvers, computer simulations are getting
increasingly sophisticated. The high-fidelity computer simulations allow us to accurately pre-
dict complicated physical phenomena without performing large numbers of expensive physical
experiments, thereby accelerating the design of engineering systems to meet mission-specific re-
quirements. Although high-fidelity simulations significantly enhance our predictive capability,
they present notable challenges to engineering design due to the high computational demand
and burden often associated with them. ML models play a vital role in addressing this chal-
lenge by maintaining the same predictive capability level as high-fidelity simulations while
significantly reducing the computational effort required to make high-fidelity predictions [317].
The basic idea of ML-enabled surrogate modeling is to replace an expensive-to-evaluate high-
fidelity simulation model with a much “cheaper” mathematical surrogate, essentially an ML
model. Over the past few decades, various surrogate modeling methods have been proposed
for different purposes within engineering design, including model calibration [318], reliabil-
ity analysis [28], sensitivity analysis [319], and optimization [320]. These existing surrogate
modeling methods can be broadly classified into two groups:
(a) Global surrogate modeling for general purposes: This class of surrogate models is con-
structed for the general purpose of design optimization and tries to achieve a good pre-
diction accuracy in the whole design region of interest [27, 321, 322]. More specifically,
let us use ŷ = Ĝ(x) to represent the surrogate model of a computer simulation model
y = G(x), x ∈ Ωx , where Ωx is the prediction domain of the inputs. In global surrogate
modeling, we are concerned about the prediction accuracy of ŷ = Ĝ(x) for all x ∈ Ωx .
Because of this, the training data for ML model construction needs to spread through-
out the whole prediction domain Ωx , with those in nonlinear regions being denser and
the others in relatively smoother regions being more sparse. Various sampling techniques
have been developed to efficiently construct globally accurate surrogate models using ML.
Some examples of these techniques include MSE-based methods, the A-optimality criterion,
and maximin scaled distance approaches [24]; a minimal example of this workflow is
sketched after this list. The goal of global surrogate modeling is
to construct a surrogate that is fully representative of the original computer simulation
model. Since the surrogate model is not constructed for any specific purpose and its
prediction accuracy has been verified for all x ∈ Ωx, it can be used for various purposes,
such as design optimization, uncertainty analysis, and sensitivity analysis, after its
construction. In addition, the UQ calibration metrics presented in Sec. 4.1.3 and Sec. 4.3 can be
used to quantify the prediction accuracy of a global surrogate model, if the test data is
representative of the design domain Ωx .
(b) Local surrogate modeling for specific purposes: Instead of achieving good prediction accu-
racy in the whole design region, this group of surrogate models only focuses on prediction
in very localized design regions, such as the limit state regions in design for reliability
problems [28, 323–325] and important regions for model calibration purposes [326]. In
local surrogate modeling, we are concerned about the prediction accuracy of ŷ = Ĝ(x)
for x ∈ Ω̃x , where Ω̃x ⊂ Ωx is a subset of the prediction domain of the inputs. This
sub-domain Ω̃x varies with the specific purpose of the surrogate modeling. For example,
when the surrogate model is constructed for the purpose of reliability analysis, which is a
classification problem, Ω̃x will be the regions along the limit state or classification bound-
ary. When the surrogate model is constructed for optimization, Ω̃x will be the regions
where the optima are located. As a result, the training data for surrogate modeling will be
concentrated in those localized regions instead of spreading evenly throughout the whole
prediction domain of the inputs. Because we only concentrate on a sub-domain Ω̃x of the
input space, Ωx , the local surrogate model ŷ = Ĝ(x), x ∈ Ω̃x only partially represents the
original simulation model (i.e., the surrogate is an accurate representation of the simula-
tion model only in the sub-domain of the design space). Moreover, since the sub-domain
Ω̃x is usually unknown during the construction of the surrogate model, learning functions
(also called acquisition functions in some methods) are needed to identify these localized
sub-domains adaptively based on the currently available information about the underly-
ing simulation model (ground truth). Because the surrogate model is constructed for a
specific purpose (e.g., model calibration, reliability analysis, or optimization), its accuracy
also needs to be quantified using metrics tailored for that specific purpose. For example, a
metric used to check the prediction accuracy of the surrogate model for reliability analysis
may not be appropriate for constructing a surrogate model for design optimization.
iii. Optimization: Engineering design problems are essentially optimization problems. Conven-
tional gradient-based optimizers often have difficulties in finding global optima. Even though
evolutionary optimization methods can overcome some of the limitations of gradient-based op-
timizers, the former methods are likely to require much larger numbers of function evaluations,
which could become prohibitively costly for high-fidelity simulation models in many engineer-
ing design problems. ML-based or ML-assisted optimization methods have been proposed to
tackle this challenge, resulting in a new family of optimization methods collectively named
gradient-free ML-based optimization. One representative example of this family is Bayesian
optimization [30]. ML-based optimization transforms the way that engineering systems are
designed in many fields, such as new materials [327]. It is worth noting that the Materials
Genome Initiative [328–332], which debuted in 2011, was conceived in the context
of designing new materials using ML and optimization to significantly reduce the research and
development time. Moreover, the development of deep learning methods in recent years even
allows designers to bypass complicated design optimization by directly generating candidate
designs for a particular application. Some examples include the ML-based topology optimiza-
tion [333, 334] and deep learning-enabled design of large-scale complex networks [335].
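As referenced in item ii(a) above, the sketch below illustrates global surrogate modeling in its simplest form: a space-filling Latin hypercube design, a GPR surrogate, and an accuracy check over the whole domain Ωx. The two-dimensional "simulator" G is a hypothetical, cheap stand-in for an expensive high-fidelity model.

```python
# A minimal sketch of global surrogate modeling: Latin hypercube training data
# plus a GPR surrogate, validated throughout the whole design domain.
import numpy as np
from scipy.stats import qmc
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def G(X):
    """Hypothetical stand-in for an expensive simulation y = G(x)."""
    return np.sin(3 * X[:, 0]) * np.cos(2 * X[:, 1])

sampler = qmc.LatinHypercube(d=2, seed=0)                   # space-filling design
X_train = qmc.scale(sampler.random(n=40), l_bounds=[0, 0], u_bounds=[1, 1])
gpr = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
gpr.fit(X_train, G(X_train))

# Validate prediction accuracy everywhere in Omega_x, not just in one region.
X_test = qmc.scale(qmc.LatinHypercube(d=2, seed=1).random(n=500), [0, 0], [1, 1])
rmse = np.sqrt(np.mean((gpr.predict(X_test) - G(X_test)) ** 2))
print(f"global RMSE over the design domain: {rmse:.4f}")
```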
For ML-enabled feature extraction in engineering design, quantifying the predictive uncertainty of
ML models plays an important role in (1) ensuring the extracted features are representative of the
original data sources, (2) eliminating the ill-posedness of inverse problems in generative design, and
(3) accounting for variability across input features.
For surrogate modeling in engineering design, an essential step in building an accurate surrogate
model (global or local surrogate) is the collection of training data. However, an initial set of training
data is usually insufficient to build a surrogate model with satisfactory prediction accuracy. A
subsequent refinement step is sometimes needed to improve the prediction accuracy of the surrogate
model. Due to the high computational effort required to collect training data from high-fidelity
simulations in engineering design, it is desirable to reduce the number of training data points
or refinement iterations for surrogate modeling as much as possible. Over the past few decades,
numerous refinement strategies have been developed in engineering design to minimize the number
of iterations in collecting training data for the purpose of improving the performance of surrogate
models. Even though these refinement strategies may differ from each other, they share one notable
starting point: quantifying the predictive uncertainty of the surrogate model for any given input.
For instance, the most commonly used refinement method for global surrogate modeling is to
identify new training data by maximizing the variance of the prediction of the surrogate model [27].
This is the MSE-based method mentioned above in Appendix B.1. In a GPR model, the
variance of the prediction can be directly obtained from the surrogate model. For other types of
surrogate models, however, the predictive uncertainty needs to be quantified using a separate UQ
method. Moreover, UQ of ML models becomes particularly important, if local surrogate models
need to be constructed for engineering design. In the context of local surrogate modeling, learning
functions (also called acquisition functions), such as the expected improvement (EI) function in
GPR-based surrogate modeling, are required to identify new training data in critical local regions
(i.e., Ω̃x mentioned in Appendix B.1) of the input space. The new training data will then be
used to refine the surrogate. Many (20+) learning functions have been proposed in recent years
for local surrogate modeling of various purposes (e.g., surrogate construction, reliability analysis,
and optimization). These learning functions look into multiple quantitative metrics to examine
different aspects crucial to the iterative improvement of surrogate models, such as classification
error [336], information entropy [337, 338], and exploitation and exploration [339], among others.
A detailed review of various learning functions for local surrogate modeling for reliability analysis
is available in Ref. [340]. To the best of our knowledge, nearly all the learning functions for
local surrogate modeling heavily rely on UQ of ML models. Let us take a look at two well-known
learning functions for local surrogate modeling in reliability-based design optimization: the expected
feasibility function (EFF) [28] and the U function [29]. They are mathematically described as
follows:

$$EFF(\mathbf{x}) = \int_{e-\tau}^{e+\tau} \left[\tau - |e - y|\right] p_{\hat{y}(\mathbf{x})}(y)\, \mathrm{d}y, \qquad \text{(B.1a)}$$

$$U(\mathbf{x}) = \frac{|\mu_{\hat{y}}(\mathbf{x}) - e|}{\sigma_{\hat{y}}(\mathbf{x})}, \qquad \text{(B.1b)}$$

where $e$ is the failure threshold used to define the limit state, $y = e$, that separates the failure region ($y > e$) from the safe region ($y \le e$); $\tau$ is half the width of a two-sided critical interval in the vicinity of the limit state ($y = e$), often set as two times the standard deviation of the ML model prediction, i.e., $\tau = 2\sigma_{\hat{y}}(\mathbf{x})$; $\mu_{\hat{y}}(\mathbf{x})$ and $\sigma_{\hat{y}}(\mathbf{x})$ are, respectively, the mean and standard deviation of the ML prediction at the input $\mathbf{x}$; and $p_{\hat{y}(\mathbf{x})}(y)$ is the probability density function of $y$ at the given input $\mathbf{x}$ predicted by the ML model.
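Both learning functions are straightforward to evaluate once the ML model supplies a Gaussian predictive mean and standard deviation, as a GPR does. The sketch below implements Eq. (B.1b) directly and evaluates Eq. (B.1a) by numerical quadrature; the failure threshold and candidate statistics are illustrative assumptions.

```python
# A sketch of the learning functions in Eqs. (B.1a) and (B.1b), assuming a
# Gaussian predictive distribution with mean mu and standard deviation sigma.
import numpy as np
from scipy.integrate import trapezoid
from scipy.stats import norm

def u_function(mu, sigma, e):
    """Eq. (B.1b): small U means close to the limit state and/or uncertain."""
    return np.abs(mu - e) / sigma

def eff(mu, sigma, e):
    """Eq. (B.1a) with tau = 2*sigma, integrated numerically over [e-tau, e+tau]."""
    tau = 2.0 * sigma
    y = np.linspace(e - tau, e + tau, 2001)
    integrand = (tau - np.abs(e - y)) * norm.pdf(y, loc=mu, scale=sigma)
    return trapezoid(integrand, y)

# Example: candidate points near vs. far from a failure threshold e = 0.
for mu, sigma in [(0.1, 0.5), (2.0, 0.5), (0.1, 0.05)]:
    print(f"mu={mu}, sigma={sigma}: U={u_function(mu, sigma, 0.0):.2f}, "
          f"EFF={eff(mu, sigma, 0.0):.3f}")
# In adaptive refinement, the next training point minimizes U or maximizes EFF.
```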
As shown in the above two equations, UQ of ML models plays an essential role in the construc-
tion of such learning functions. This observation also applies to the other learning functions in local
surrogate modeling, a practice commonly referred to as adaptive surrogate modeling in the literature. In
general, the identification of the sub-domain Ω̃x (see Appendix B.1) relies on the learning func-
tions in local surrogate modeling, where UQ of ML models plays a foundational role towards the
establishment of these learning functions.
Similar to local surrogate modeling, ML-enabled optimization in engineering design also depends
heavily on the ability to quantify the predictive uncertainty of ML models, which is essential for ML
models to exploit and explore the design domain to efficiently identify optimal designs. Examples
of such ML-based optimizers include Bayesian optimization [341] and deep reinforcement learning-
based optimization [342]. Specifically for Bayesian optimization, a trade-off between exploitation
and exploration is balanced through a learning/acquisition function, which is very similar to that in
local surrogate modeling discussed above. Some popular learning functions include the probability
of improvement, EI, upper confidence bound, and knowledge gradient (a generalization of EI).
Taking the EI function for a minimization problem as an example, this function is mathematically defined as [30]

$$EI(\mathbf{x}) = \left(f^{*} - \mu_{\hat{y}}(\mathbf{x})\right)\Phi(z) + \sigma_{\hat{y}}(\mathbf{x})\,\phi(z), \quad z = \frac{f^{*} - \mu_{\hat{y}}(\mathbf{x})}{\sigma_{\hat{y}}(\mathbf{x})}, \qquad \text{(B.2)}$$

where $f^{*}$ is the smallest objective value observed so far, and $\Phi(\cdot)$ and $\phi(\cdot)$ denote the cumulative distribution function and probability density function of the standard normal distribution, respectively.
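A compact sketch of how this acquisition function drives a Bayesian optimization loop is given below. The objective function, initial design, and candidate grid are illustrative assumptions; production implementations would typically maximize EI with a continuous optimizer rather than over a fixed grid.

```python
# A compact sketch of Bayesian optimization with the EI acquisition for a
# minimization problem, using a GPR surrogate of a hypothetical objective f.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

f = lambda x: np.sin(3 * x) + 0.3 * x**2        # hypothetical expensive objective
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(4, 1))             # small initial design
y = f(X).ravel()
X_cand = np.linspace(-3, 3, 500).reshape(-1, 1) # candidate grid

for it in range(15):
    gpr = GaussianProcessRegressor(Matern(nu=2.5), alpha=1e-6,
                                   normalize_y=True).fit(X, y)
    mu, sigma = gpr.predict(X_cand, return_std=True)
    f_best = y.min()
    z = (f_best - mu) / np.maximum(sigma, 1e-12)
    ei = (f_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)   # Eq. (B.2)
    x_next = X_cand[np.argmax(ei)].reshape(1, -1)            # exploit/explore trade-off
    X, y = np.vstack([X, x_next]), np.append(y, f(x_next).ravel())

print(f"best design found: x = {X[np.argmin(y)].item():+.3f}, f = {y.min():+.3f}")
```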
Moreover, a global surrogate model and a local surrogate model are interchangeable during the
process of ML model construction. For example, we usually start with a global surrogate model in
order to construct a local surrogate model because the critical local regions are unknown and need
to be identified using a learning function based on the UQ of an ML model. After constructing a
local surrogate model for a specific purpose (e.g., reliability analysis, optimization), we can always
convert this local surrogate into a global one if we want to expand the prediction domain to the
whole design domain. Regardless of whether design optimization leverages local or global surrogate
modeling, UQ of ML models is almost always the foundation of the three categories of ML-enabled
capabilities in engineering design described in Appendix B.1.
Global surrogate models can be constructed using many types of ML models, such as GPR, support vector regression, random forest, etc. For local surrogate modeling, however, most current
approaches are developed based on GPR models. This is largely attributed to the capability of
GPR to analytically quantify the predictive uncertainty in the form of a Gaussian distribution that
is convenient to use. In fact, most of the learning functions for local surrogate model-based relia-
bility analysis are derived or developed based on GPR models. For example, learning functions in
closed forms as given in Eqs. (B.1a) and (B.1b) have been derived for GPR models. Quantifying
the predictive uncertainty of GPR models in the Gaussian form facilitates an efficient evaluation
of various learning functions for the refinement of local surrogates. In addition to GPR-based local
surrogate modeling methods, a few approaches have also been proposed for local surrogate modeling
based on UQ of support vector regression models [349, 350]. In recent years, with the rapid devel-
opment of deep learning techniques and the capability of quantifying the prediction uncertainty of
deep learning models, local surrogate modeling methods have been studied for deep neural networks
to achieve “active learning” [351–353]. For instance, Xiang et al. [353] proposed an active learning
method for DNN-based structural reliability analysis by extending a weighted sampling method
from GPR models to DNNs. This extension allows for selecting new training data for refining DNN
models for reliability analysis. Similarly, Bao et al. [354] extended the subset sampling method
to DNNs, resulting in an adaptive DNN method for structural reliability analysis. Even though
active learning for local surrogate modeling has great potential in reducing the size of training data
required to build accurate surrogate models, it is still in the early development stage for other ML
models beyond GPR models. In particular, many existing UQ methods for deep learning models
are still far from GPR’s scientific rigor and theoretical soundness because few can stand strict UQ
tests pertaining to uncertainty calibration, decomposition, and attribution. Additionally, even fewer
methods offer principled ways to reduce the predictive uncertainty of deep neural networks. With
UQ methods for ML models (as reviewed in Sec. 3) getting more and more mature, we foresee that
active learning for local surrogate modeling will also become a very active research topic for ML
models other than GPR models.
Similar to local surrogate modeling, even though some deep learning-based optimization methods
have been developed recently [355, 356], ML-enabled optimization has mostly been studied using
GPR models, resulting in a group of Bayesian optimization-based engineering design methods [327,
357, 358], whose applications include material design [359, 360], design for reliability [361], and
design for additive manufacturing [362]. Because GPR is a flexible and versatile framework that
is fairly easy to extend to other problems and applications, numerous extensions have been
developed to adapt GPR models to different settings under the broad umbrella of "Bayesian
optimization". These extensions include, but are not limited to, using a multi-fidelity strategy to
reduce the required number of high-fidelity samples in GPR-based Bayesian optimization [363],
Bayesian optimization for multi-output response [364], enhancing Bayesian optimization through
gradient information during the construction of a GPR model [365], Bayesian optimization for
problems with mixed-integer design variables (also known as mixed-variables) [366], and Bayesian
optimization based on heteroscedastic or non-stationary GPR models [117, 367–369].
Based on the above reviews, we can conclude that the UQ methods for ML models reviewed
in Sec. 3 provide valuable tools to fill the gaps in the following three major activities of ML-based
engineering design: ML-enabled feature extraction, surrogate modeling, and optimization.
In general, ML-based approaches to RUL prediction fall into one of two categories:
1. Identifying a health indicator and predicting its trend until a defined threshold is reached.
2. Directly mapping the extracted features (or, in the case of DL, the raw measurements) to the RUL.
For the first approach, the focus is on identifying a specific parameter or health indicator that
is indicative of the health state of the system or component being monitored. This degradation
indicator could be a physical measurement, a derived relevant feature or a combination of several
degradation indicators that change over time as the system undergoes degradation. Once the health
indicator is identified, the next step is to predict its trend over time. This involves using various
predictive modeling techniques, such as regression or time-series analysis, to estimate how the health
indicator evolves as the system degrades over time. The goal is to predict when the health indicator
will reach a defined threshold, indicating that the system or component is reaching the end of its
useful life.
For the second approach, instead of focusing on predicting the trend of a specific health indicator,
the predictive model directly maps either the extracted features or, in the case of deep learning,
directly from the raw measurements of the system or component to the RUL.
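A minimal sketch of the first approach is given below: a parametric degradation trend is fit to a noisy health indicator (HI) and extrapolated to a failure threshold, with the first crossing time defining the RUL. The exponential trend, threshold, and synthetic data are illustrative assumptions; in a UQ setting, the fit could be bootstrapped to obtain an RUL distribution, in the spirit of [384].

```python
# A minimal sketch of HI trend extrapolation for RUL prediction: fit a
# degradation model to noisy HI observations and find the threshold crossing.
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(0)
t = np.arange(0, 120.0)                              # inspection times (cycles)
hi_true = 1.0 - 0.002 * np.exp(0.04 * t)             # hypothetical degrading HI
hi_obs = hi_true + 0.01 * rng.standard_normal(t.size)
threshold = 0.6                                      # assumed end-of-life HI level

def trend(t, a, b):
    """Assumed exponential degradation model HI(t) = 1 - a*exp(b*t)."""
    return 1.0 - a * np.exp(b * t)

(a, b), _ = curve_fit(trend, t, hi_obs, p0=(0.001, 0.05))
t_future = np.arange(t[-1], t[-1] + 200.0, 0.1)      # extrapolate the fitted trend
crossed = t_future[trend(t_future, a, b) <= threshold]
rul = crossed[0] - t[-1] if crossed.size else np.inf
print(f"predicted RUL at t={t[-1]:.0f}: {rul:.1f} cycles")
```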
In practice, condition monitoring data are frequently distorted by multiple sources of noise, and training data is often limited in scope
and fails to represent the full range of conditions that may arise in real-world scenarios. Conse-
quently, there is a significant risk of encountering high levels of epistemic uncertainties, which must
be quantified and communicated to the decision makers.
[Figure D.28 shows two panels plotting RMSE loss versus training epoch (log scale) for the training and validation sets.]
Figure D.28: Training and validation losses for MC dropout models with dropout rates of (a) 0.05 and (b) 0.2, respectively.
In Section 3.2.3, we mention the instability of the MC dropout model arising from even slight variations in hyperparameters, such as model size, number of training epochs, and dropout rate. In this appendix, we first show in Fig. D.28 the training and validation losses for two MC dropout models trained on the same data as the toy example from Section 3.5. The two MC dropout models have the same architecture (3 ResNet blocks, as shown in Fig. 15) but different dropout rates. In this case, the MC dropout models converged at around 500 epochs, and no overfitting is observed up to 10000 epochs. Next, we plot the uncertainty maps for various configurations of the MC dropout model in Table D.8. The uncertainty maps are highly inconsistent, which leads to our conclusion about the instability of MC dropout.
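For reference, the sketch below shows the mechanics behind such uncertainty maps: dropout is kept active at prediction time, and the spread over T stochastic forward passes is taken as the uncertainty. The MLP architecture, dropout rate, and evaluation grid are illustrative stand-ins for the ResNet configurations of Table D.8, and training is omitted.

```python
# A brief sketch of MC dropout prediction: average T stochastic forward passes
# with dropout left on; the per-point standard deviation gives the uncertainty map.
import torch
import torch.nn as nn

class DropoutMLP(nn.Module):
    def __init__(self, p=0.05):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, 64), nn.ReLU(), nn.Dropout(p),
            nn.Linear(64, 64), nn.ReLU(), nn.Dropout(p),
            nn.Linear(64, 1),
        )
    def forward(self, x):
        return self.net(x)

@torch.no_grad()
def mc_dropout_predict(model, x, T=100):
    model.train()                    # keeps nn.Dropout stochastic at test time
    preds = torch.stack([model(x) for _ in range(T)])
    return preds.mean(dim=0), preds.std(dim=0)   # predictive mean and uncertainty

model = DropoutMLP(p=0.05)           # training on the toy data omitted here
xg, yg = torch.meshgrid(torch.linspace(-15, 15, 50), torch.linspace(-15, 15, 50),
                        indexing="ij")
grid = torch.stack([xg.reshape(-1), yg.reshape(-1)], dim=1)
mean, std = mc_dropout_predict(model, grid)
print(std.reshape(50, 50).shape)     # reshaped std gives one uncertainty map
```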
[Table D.8 arranges the MC dropout uncertainty maps over the 2D input (x1, x2) in a grid: columns correspond to the number of training epochs (200, 500, and 1000), row groups correspond to different dropout rates (only 0.05 is labeled in the recovered text), and rows within each group correspond to the number of ResNet blocks (1 and 3); a shared color scale runs from low to high uncertainty.]
Table D.8: Demonstration of the instability associated with uncertainty maps of MC dropout with respect to dropout
rate, number of training epochs, and ResNet architecture.