Green Et Al 2015 Bayesian and Markov Chain Monte Carlo Methods For Identifying Nonlinear Systems in The Presence of
Downloaded from https://royalsocietypublishing.org/ on 20 June 2024

White-box models are taken here to be those whose equations of motion have been derived
completely from the underlying physics of the problem of interest and in which the
model parameters have direct physical meanings. Finite-element models constitute one
sub-class of such models.
Black-box models are, by contrast, usually formed by adopting a parametrized class of
models with some universal approximation property and learning the parameters from
measured data; in such a model, like a neural network, the parameters will not generally
carry any physical meaning.
Grey-box models, as the name suggests, are usually a hybrid of the first two types above.
They are commonly formed by taking a basic core motivated by known physics and then
adding a black-box component with approximation properties suited to the problem of
interest. A good example of a grey-box model is the Bouc–Wen model of hysteresis. In the
Bouc–Wen model, a mass-spring-damper core is supplemented by an extra state-space
equation which allows versatile approximation of a class of hysteresis loops [5,6].
In all of these cases, measured data from the system or structure of interest can be used in order
to determine any unknown aspects of the model, e.g. any necessary undetermined parameters
can be estimated. The use of measured data often means that uncertainty is introduced into
the problem. There are two main sources of uncertainty caused by consideration of measured
data. The first source is measurement noise; in general, other sources (noise) will contribute to
measurements of the variable of interest and the direct distinction between signal and noise will
be impossible. The second problem is encountered when a measured variable is itself a random
process. In this case, only specific finite realizations of the process of interest can be measured;
variability between realizations leads to variability between parameter estimates and thus gives
rise to uncertainty.
In the past, the SI practitioner would generally implement the classical algorithms (e.g. least-squares
minimization) as an exercise in linear algebra and would usually treat the resulting
set of crisp parameter estimates as determining ‘the model’. Even if a covariance matrix were
extracted, the user would usually use this only to provide confidence intervals or ‘error bars’
on the parameters; predictions would still be made using the crisp parameters produced by the
algorithm. Such approaches do not fully accommodate the fact that a given set of measured
data, subject to the sources of uncertainty discussed above, may be consistent with a number
of different parametric models. It is now becoming clear—largely as a result of the pioneering
work of James Beck and colleagues and more recently from guidance from the machine learning
community—that a more robust approach to parameter estimation, and also model selection,
can be formulated on the basis of Bayesian principles for probability and statistics. Among the
potential advantages offered by a Bayesian formulation are: the estimation procedure will return
parameter distributions rather than parameters; predictions can be made by integrating over all
parameters consistent with the data, weighted by their probabilities; evidence for a given model
structure can be computed, leading to a principled means of model selection.
Adoption of Bayesian methods first became widespread in the context of the identification
of black-box models; the methods have recently begun to occupy a central position within the
machine learning community [7,8]. Bayesian methods for training multi-layer perceptron neural
networks are a good example of this trend [9]; the Gaussian process model is also achieving
wide popularity [10]. Most machine learning algorithms, like the neural networks and Gaussian
processes already mentioned, are used to learn static relationships between variables; however,
and dating from the same year is perhaps the first paper on Bayesian methods for structural
dynamic SI [16]. To date, the most systematic and extensive development of Bayesian SI is the
result of the work of James Beck and his various collaborators. Beck’s early work on statistical
system identification is summarized in [17] and his transition to a Bayesian framework is given
in [18]. This paper uses a Laplace approximation to remove the need to evaluate intractable
high-dimensional integrals. Later, Beck & Au [19] introduce a Markov chain Monte Carlo (MCMC)
method as a more general means of computing response quantities of interest represented by
high-dimensional integrals. Bayesian methods of model selection are discussed in [20], and
the paper also discusses the possibility of marginalizing over different model classes. A recent
contribution [21] discusses identification and model selection for a type of hysteretic system
model—the Masing model. Staying with hysteresis models, the paper [22] considers how MCMC
can be used for Bayesian estimation of Bouc–Wen models and discusses a simple model selection
statistic. Two recent developments which are of interest are the introduction of probability logic
for Bayesian SI [23] and a method for potentially reducing computational expense for MCMC by
selecting the most informative training data [24]. Bayesian methods for the system identification
of differential equations have also been the subject of recent interest in the context of systems
biology [25,26] and show considerable promise in the context of structural dynamics.
At this point, it is appropriate to define some notation. Here M is used to represent a model
structure. θ ∈ R^Nθ is then used to represent the vector of parameters within that model which
requires estimation. Finally, D is used to denote a set of observations which one has made about
the system of interest, i.e. the measured data. As an example, one may consider the case study
which is shown in §5, where one is attempting to create a white-box model of a dynamical system
whose response is thought to be greatly influenced by friction effects. In this case, M represents
the hypothesized equation of motion of the system. Here θ represents the parameters within the
equation of motion which require estimation—in the current example, this includes terms which
modulate the level of viscous damping and friction in the system. The data, D, consist of a time
history of acceleration measurements which have been taken during a dynamic test. The basic
idea of the Bayesian approach to identification is that, by repeatedly applying Bayes’ theorem,
one can assess the probability of a set of parameters θ as well as a model structure M conditional
on the data D using
p(D | θ, M)p(θ | M)
p(θ | D, M) = (1.1)
p(D | M)
and
p(D | M)P(M)
P(M | D) = , (1.2)
p(D)
respectively, where
p(D | M) = p(D | θ , M)p(θ | M) dθ (1.3)
is a normalizing constant which ensures that p(θ | D, M) integrates to unity. This is referred to
here as the ‘marginal likelihood’ but can also be described as the ‘model evidence’ (because, as
is shown in §2, it can provide evidence for candidate model structures). With equation (1.1), one
converts an a priori probability density for the parameters θ into a posterior density having seen
the data D. If one desires a point estimate of the parameters, the usual course of action is to choose
θ = id(D), (2.1)
where the function id represents the application of the identification algorithm to the data D.
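The quantities in equations (1.1) and (1.3) can be made concrete with a short numerical sketch. The example below uses a hypothetical one-parameter Gaussian model (not any system from this paper); the prior, noise level and data are illustrative assumptions, and the posterior and marginal likelihood are obtained by brute-force quadrature on a grid.

```python
import numpy as np

# Hypothetical setting: noisy observations of a single parameter theta,
# prior p(theta|M) = N(0, 1), Gaussian likelihood with known sigma.
rng = np.random.default_rng(0)
theta_true, sigma = 0.5, 0.2
D = theta_true + sigma * rng.standard_normal(20)   # simulated measurements

theta = np.linspace(-3, 3, 2001)                   # grid over the parameter
dtheta = theta[1] - theta[0]
prior = np.exp(-0.5 * theta**2) / np.sqrt(2 * np.pi)

# log p(D|theta,M): product of Gaussian densities over the observations
loglik = (-0.5 * np.sum((D[:, None] - theta[None, :])**2, axis=0) / sigma**2
          - len(D) * np.log(sigma * np.sqrt(2 * np.pi)))

evidence = np.sum(np.exp(loglik) * prior) * dtheta      # equation (1.3) by quadrature
posterior = np.exp(loglik) * prior / evidence           # equation (1.1)
print(evidence)
```

For one or two parameters this direct quadrature is perfectly feasible; its cost grows exponentially with the dimension of θ, which is precisely what motivates the MCMC methods reviewed later.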
Now, if noise ε(t) is present on the input or output data (or both), θ will become a random vector
conditioned on the data. In this context, one no longer wishes to find an estimate of θ, but rather
to specify one’s belief in its value. If it is assumed that the noise is Gaussian with (unknown)
standard deviation σ , then the parameter σ can be subsumed into θ, and inferred along with the
model parameters. In probabilistic terms, instead of equation (2.1) one now has
¹For a more detailed description of various MCMC algorithms, the technical report by Neal is recommended [27] while, for
the interested reader, it is worth noting that [28] is an impressive Python resource for coding MCMC schemes.
Suppose a new input sequence x∗ were applied to the system; one would wish to determine the
density for the predicted outputs

y∗ ∼ p(y∗ | x∗, θ, D, M).
Furthermore, if one appeals to Bayes' theorem in the form of equation (1.2) and assumes equal
priors on the models, one arrives at a comparison ratio or Bayes factor

Bij = P(Mi | D) / P(Mj | D) = p(D | Mi) / p(D | Mj),    (2.6)
which weights the evidence for two models in terms of marginal likelihoods of the data given
the models.
The Bayesian approach to model selection is particularly attractive as the marginal likelihood
rewards models for being high fidelity while also penalizing them for being overly complex. By
automatically embodying Ockham's Razor with regard to model selection, the adoption of a
Bayesian approach can help to prevent overfitting. An intuitive explanation of this
property is provided by MacKay [7], where it is suggested that a complex model will be capable of
replicating a larger range of predictions than a simple model with relatively few parameters. As
the probability density function p(D | M) must always be normalized it follows that, in a region
where both models are able to replicate the same data, the marginal likelihood will be larger for
the simpler model (figure 1). It is in this way that p(D | M) can be used to provide evidence for
candidate model structures.
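This Ockham effect, together with the Bayes factor of equation (2.6), can be illustrated with a small sketch. The two nested regression models, the priors and the noise level below are hypothetical choices (not taken from this paper); the data are deliberately noise-free and truly explained by the simpler model, and each evidence is computed by brute-force quadrature as in equation (1.3).

```python
import numpy as np

# Two hypothetical models: M1: y = a, and M2: y = a + b*x, compared on data
# that the simpler model explains exactly, so the result is deterministic.
x = np.linspace(-3, 3, 10)
y = np.full_like(x, 0.3)
sigma = 0.1                          # assumed known noise standard deviation
n = len(x)

def log_lik(resid):
    return (-0.5 * np.sum(resid**2, axis=-1) / sigma**2
            - n * np.log(sigma * np.sqrt(2 * np.pi)))

a = np.linspace(-2, 2, 401)          # N(0, 1) priors on both parameters
b = np.linspace(-0.5, 0.5, 1001)     # integrand is negligible outside this range
da, db = a[1] - a[0], b[1] - b[0]
prior_a = np.exp(-0.5 * a**2) / np.sqrt(2 * np.pi)
prior_b = np.exp(-0.5 * b**2) / np.sqrt(2 * np.pi)

# evidence for M1: one-dimensional quadrature over a
Z1 = np.sum(np.exp(log_lik(y - a[:, None])) * prior_a) * da

# evidence for M2: two-dimensional quadrature over (a, b)
resid2 = y - a[:, None, None] - b[None, :, None] * x
Z2 = np.sum(np.exp(log_lik(resid2)) * prior_a[:, None] * prior_b[None, :]) * da * db

B12 = Z1 / Z2
print(B12)   # greater than 1: the marginal likelihood prefers the simpler model
```

Both models achieve the same best fit here, so the Bayes factor is driven entirely by the Ockham penalty paid by the redundant parameter b.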
²Without loss of generality, the reader can regard x∗ and y∗ as a set of samples, a set of variables or both.
Figure 1. The embodiment of Ockham's Razor in the marginal likelihood (original explanation described in [7]). (Legend: simple model; complex model; data.)
For a more detailed explanation, an information theoretic analysis of the marginal likelihood
was originally discussed by Beck & Yuen [29] before being generalized in [21]. Noting that
∫ p(θ | D, Mi) dθ = 1, it follows that the logarithm of the marginal likelihood can be written as

ln[p(D | Mi)] = ln[p(D | Mi)] ∫ p(θ | D, Mi) dθ    (2.7)

= ∫ ln[p(D | Mi)] p(θ | D, Mi) dθ    (2.8)

= ∫ ln[ p(D | θ, Mi) p(θ | Mi) / p(θ | D, Mi) ] p(θ | D, Mi) dθ    (2.9)

and therefore

ln[p(D | Mi)] = ∫ ln[p(D | θ, Mi)] p(θ | D, Mi) dθ − ∫ ln[ p(θ | D, Mi) / p(θ | Mi) ] p(θ | D, Mi) dθ.    (2.10)
The first term in the above equation is the posterior mean of the log-likelihood which is a measure
of the average data fit of model Mi . It follows that achieving a good fit to the training data will
provide evidence for a candidate model structure. The second term in equation (2.10) represents
the relative entropy between the prior and posterior. The marginal likelihood therefore penalizes
models which are 'complex', where a complex model is defined as one which is able to extract
large amounts of information about the parameters θ from the data D. It is important to note that
the marginal likelihood is a function of the prior p(θ | Mi )—it is possible to alter the evidence for
Mi by altering the prior distribution while maintaining the same model structure.
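The decomposition in equation (2.10) can be checked numerically. The sketch below uses a hypothetical one-parameter Gaussian problem (all values are illustrative assumptions) and verifies on a grid that the log marginal likelihood equals the posterior mean of the log-likelihood minus the relative entropy between posterior and prior.

```python
import numpy as np

# Hypothetical conjugate setting: prior N(0, 1), Gaussian likelihood.
sigma = 0.2
D = np.full(5, 0.4)                                   # fixed toy data

theta = np.linspace(-6, 6, 40001)
dt = theta[1] - theta[0]
prior = np.exp(-0.5 * theta**2) / np.sqrt(2 * np.pi)
loglik = (-0.5 * np.sum((D[:, None] - theta[None, :])**2, axis=0) / sigma**2
          - len(D) * np.log(sigma * np.sqrt(2 * np.pi)))

Z = np.sum(np.exp(loglik) * prior) * dt               # marginal likelihood
post = np.exp(loglik) * prior / Z

term_fit = np.sum(loglik * post) * dt                 # posterior mean log-likelihood
m = post > 1e-12                                      # avoid log(0) in the tails
term_kl = np.sum(post[m] * np.log(post[m] / prior[m])) * dt   # relative entropy

print(np.log(Z), term_fit - term_kl)                  # the two should agree
```

The relative-entropy term is always non-negative, which is the 'complexity penalty' described above: the more the data sharpen the prior into the posterior, the larger the deduction from the data-fit term.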
Unfortunately, the marginal likelihoods themselves (equation (1.3)) are often analytically
intractable and numerically challenging because of their high-dimensional nature [26]. While
one can resort to less informative model selection indicators which are simpler to compute
(for example, as used in [22], a Bayesian generalization of the Akaike Information Criterion
(AIC) [30] known as the Deviance Information Criterion (DIC)), it will be shown in §4 that there
now exist MCMC methods which can be used to estimate the marginal likelihoods of different
models, or to generate samples directly from P(M | D).
As a final comment on the issue of model selection, it should be noted that there are
already examples of the use of Bayesian strategies for model selection in the structural dynamics
literature, the ‘Ockham factor’ defined in [19] being one of these. In [31], the authors use a
Bayesian model screening approach in order to determine the appropriate nonlinear terms to
include in a system model. The book [32] discusses Bayesian model selection in some detail.
3. Markov chain Monte Carlo for the posterior parameter distribution
The first set of algorithms reviewed here are those which are designed to generate samples from
the posterior parameter distribution. In what follows, π∗(θ)
is used to denote the unnormalized target distribution and Z is the corresponding normalizing constant.
If accepted, the new state of the Markov chain is θ(i+1) = θ′; otherwise θ(i+1) = θ(i). The probability
of making the transition from some state θ to the region θ′ + dθ′ can be written as

T(θ′ | θ) dθ′ = q(θ′ | θ) dθ′ min{1, π∗(θ′)/π∗(θ)}.    (3.2)
The first point to note is that, by using such a transition, one satisfies the condition known as
detailed balance:
π(θ) T(θ′ | θ) = π(θ′) T(θ | θ′)  ⇒  ∫ π(θ) T(θ′ | θ) dθ = π(θ′),    (3.3)

(noting that ∫ T(θ | θ′) dθ = 1), which shows that the stationary distribution of the Markov chain
is equal to the target. The second point to note is that, to evaluate the acceptance probability
(equation (3.1)), one does not need to know the normalizing constant Z. This makes the Metropolis
algorithm particularly well suited to Bayesian inference problems as it allows one to sample from
p(θ | D, M) without having to evaluate the marginal likelihood.
Figure 2a shows an example of the Metropolis algorithm generating samples from a two-
dimensional target distribution. It is clear that the Markov chain must go through a transitional
period (known as the 'burn-in') before it converges to its stationary distribution; the samples
generated during this time will need to be discarded. Figure 2b,c shows what can happen if one
selects a proposal density which has too small or too large a variance. With a small proposal
density, the Markov chain will take a very long time to converge to its stationary distribution
while, with a large proposal density, the majority of the proposed states are rejected and the
resulting samples from the Markov chain are highly correlated. The efficiency of the Metropolis
algorithm is therefore highly dependent on the tuning of the proposal density q. A final point
worth noting is that, when using the Metropolis algorithm to generate samples from p(θ | D, M),
the use of large data sets may cause numerical issues when one is evaluating the acceptance
probability. This can easily be overcome by simply using the logarithm of equation (3.1), such
that one then only needs to evaluate the logarithm of the posterior parameter distribution.
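The steps above can be sketched as follows. This is a minimal random-walk Metropolis sampler working throughout with the logarithm of the unnormalized target, as recommended; the two-dimensional Gaussian target, the starting point and the step size are illustrative choices, not anything prescribed by this paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def log_target(theta):
    # unnormalized log density of an illustrative correlated 2D Gaussian:
    # x ~ N(0, 1) and y | x ~ N(0.5 x, 0.5)
    x, y = theta
    return -0.5 * (x**2 + (y - 0.5 * x)**2 / 0.5)

def metropolis(log_target, theta0, n_samples, step=0.5):
    theta = np.asarray(theta0, dtype=float)
    logp = log_target(theta)
    chain = np.empty((n_samples, theta.size))
    accepted = 0
    for i in range(n_samples):
        prop = theta + step * rng.standard_normal(theta.size)  # symmetric proposal q
        logp_prop = log_target(prop)
        # accept with probability min(1, pi*(prop)/pi*(theta)), computed in logs
        if np.log(rng.random()) < logp_prop - logp:
            theta, logp = prop, logp_prop
            accepted += 1
        chain[i] = theta
    return chain, accepted / n_samples

chain, rate = metropolis(log_target, [3.0, 3.0], 20000)
burned = chain[2000:]          # discard the burn-in period
print(burned.mean(axis=0), rate)
```

Note that only differences of log densities enter the accept/reject test, so neither the normalizing constant Z nor any exponentiation of large log-likelihoods is ever required.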
Figure 2. (a–c) Sampling from a two-dimensional distribution using the Metropolis algorithm.
K = p²/2  and  V = −ln(π∗(θ))    (3.5)

such that

H = p²/2 − ln(π∗)  and  exp(−H) = exp(−p²/2) π∗.    (3.6)
As a result, if one targets the distribution exp(−H) and then simply omits the samples of p, one
will be left with samples of θ from the target π .
To generate a candidate state {p′, θ′} from the current state θ(i), one must first generate an
initial momentum p ∼ N(0, 1) (noting that this is actually a direct sample from exp(−H)). The
Hamiltonian of this current state is written as H(i). The system is then allowed to evolve according
to equation (3.4) for a certain amount of 'time', until it reaches some state {p′, θ′} which has
Hamiltonian H′. As with the Metropolis algorithm, this state is then accepted with probability

min{1, exp(−H′)/exp(−H(i))}.    (3.7)
From the above equation, it is clear that, if the dynamics of the system are modelled perfectly,
then the new state will always be accepted (as the Hamiltonian must remain constant). In practice,
however, the evolution of the system according to equation (3.4) must usually be conducted
numerically (usually using finite difference estimates of ∇V), and so the Hamiltonian will alter
as a result of numerical error. In [36], it is shown that one can still obey detailed balance (and
therefore generate samples from exp(−H)) so long as the dynamics of the system are reversible.
This can be guaranteed by using the ‘leapfrog’ numerical integration technique (see [27] for
more details).
The ability of Hybrid Monte Carlo to ‘generate momentum’ during the proposal process
can allow it to conduct efficient explorations of the parameter space relative to the Metropolis
algorithm (reference [27] gives a clear explanation of this physical analogy). However, its
successful implementation sometimes requires careful tuning of parameters in the leapfrog
algorithm, as well as the parameters which dictate how long the system must evolve before a
proposal is generated. Furthermore, the need to repeatedly evaluate the posterior distribution to
obtain estimates of ∇V can make the algorithm computationally expensive. It has, however, been
successfully applied to various structural dynamics problems in [37–39].
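A minimal sketch of one Hybrid (Hamiltonian) Monte Carlo step with leapfrog integration is given below. For simplicity it uses an analytic gradient of V for a standard Gaussian target, rather than the finite-difference estimates of ∇V mentioned above; the target, step size and trajectory length are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

def V(theta):                 # potential energy, V = -ln pi*(theta)
    return 0.5 * np.sum(theta**2)

def grad_V(theta):            # analytic gradient for this illustrative target
    return theta

def hmc_step(theta, eps=0.1, n_leapfrog=20):
    p = rng.standard_normal(theta.size)        # fresh momentum ~ N(0, I)
    H0 = V(theta) + 0.5 * np.sum(p**2)
    th, pn = theta.copy(), p.copy()
    # leapfrog: half momentum step, alternating full steps, half momentum step
    pn -= 0.5 * eps * grad_V(th)
    for _ in range(n_leapfrog - 1):
        th += eps * pn
        pn -= eps * grad_V(th)
    th += eps * pn
    pn -= 0.5 * eps * grad_V(th)
    H1 = V(th) + 0.5 * np.sum(pn**2)
    # accept with probability min(1, exp(H0 - H1)), cf. equation (3.7)
    if np.log(rng.random()) < H0 - H1:
        return th
    return theta

theta = np.array([5.0, -5.0])
draws = []
for _ in range(5000):
    theta = hmc_step(theta)
    draws.append(theta.copy())
samples = np.array(draws[500:])                # discard burn-in
print(samples.mean(axis=0), samples.var(axis=0))
```

Because the leapfrog scheme is reversible and nearly conserves H, most proposals are accepted even though each traverses a large distance in parameter space.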
The parameter β is referred to as the inverse ‘temperature’ (thus drawing an analogy between
πj and a Boltzmann distribution). One begins by targeting a distribution with a low value of
β (high temperature) before steadily ‘cooling’ the system until β = 1, simulating the process of
annealing. This essentially means that the ‘fine details’ of the target distribution are introduced
gradually and that, at high temperatures, the Markov chain is able to traverse the parameter space
relatively freely compared to when β = 1. It is this which allows the Markov chain to easily escape
local traps and converge to the globally optimum region of the parameter space.
When attempting to generate samples from the posterior parameter distribution specifically, a
common choice is the sequence of intermediate distributions πj∗(θ) = p(D | θ, M)^βj p(θ | M),
which facilitates a smooth transition from the prior to the posterior parameter
distributions. The strictly increasing sequence of β values (the 'annealing schedule') is crucial to
the success of the algorithm. Annealing too fast can result in the Markov chain becoming stuck
in local traps, while annealing too slowly will incur unnecessarily large computational costs. While
there are algorithms which feature adaptive annealing schedules (e.g. [27,41–43]), they are not
wj(i) = πj+1∗(θj(i)) / πj∗(θj(i)) = p(D | θj(i), M)^(βj+1 − βj)   and   ŵj(i) = wj(i) / Σi wj(i),    (4.1)
respectively. With standard importance sampling, one would then 'resample' by assigning
θj+1(i) = θj(i) with probability ŵj(i). If left to continue in this manner, the algorithm would suffer from the
well-known degeneracy problem (a phenomenon often associated with the particle filter), and the
set of samples would become dominated by relatively few, highly weighted samples.
To overcome this issue, TMCMC considers each resampled value θj+1(i) as the starting point
of a Markov chain. The Markov chains evolve according to the Metropolis algorithm, each
targeting πj+1 . The probability that a Markov chain will ‘grow’ is determined by the normalized
importance weight of its initial sample. The advantage of this approach is that, by simultaneously
growing Markov chains in high probability regions of πj+1 , one is able to generate samples from
distributions with multiple modes. Once the Markov chains have generated a sufficient number of
samples from πj+1 , the process is simply repeated until one is left with samples from p(θ | D, M).
With regard to estimating the marginal likelihood, if one denotes wj as a vector of importance
weights, then from the property that

Zj+1 / Zj = ∫ [πj+1∗(θ) / πj∗(θ)] [πj∗(θ) / Zj] dθ = E[wj],

it follows that p(D | M) = Z0 E[w0] E[w1] · · · E[wM−1]. As a result, by estimating the expected
value of the importance weights at each stage of the algorithm, one can approximate the marginal
likelihood.
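The product-of-expected-weights estimate can be demonstrated on a conjugate Gaussian toy problem, for which the marginal likelihood is available analytically. The sketch below is not TMCMC proper: for clarity, exact samples from each tempered distribution πj stand in for the Metropolis moves, so that only the weighting and averaging mechanics of equation (4.1) are illustrated; all numerical settings are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
sigma, n = 0.5, 10
D = np.array([0.3] * n) + 0.1 * np.array([1, -1] * (n // 2))   # fixed toy data

def loglik(theta):
    return (-0.5 * np.sum((D[:, None] - theta[None, :])**2, axis=0) / sigma**2
            - n * np.log(sigma * np.sqrt(2 * np.pi)))

betas = np.linspace(0, 1, 21)                 # the annealing schedule
N = 5000                                      # samples per stage
log_Z = 0.0                                   # accumulates log p(D|M); Z0 = 1
for bj, bj1 in zip(betas[:-1], betas[1:]):
    # pi_j is Gaussian here by conjugacy at temperature bj, so sample it exactly
    prec = 1.0 + bj * n / sigma**2
    mean = bj * np.sum(D) / sigma**2 / prec
    theta = mean + rng.standard_normal(N) / np.sqrt(prec)
    logw = (bj1 - bj) * loglik(theta)         # importance weights, eq (4.1)
    log_Z += np.log(np.mean(np.exp(logw - logw.max()))) + logw.max()

# analytic log-evidence for comparison (conjugate Gaussian result)
A = n / sigma**2 + 1.0
b = np.sum(D) / sigma**2
log_Z_exact = (-0.5 * n * np.log(2 * np.pi * sigma**2) - 0.5 * np.log(A)
               - 0.5 * np.sum(D**2) / sigma**2 + 0.5 * b**2 / A)
print(log_Z, log_Z_exact)
```

Each stage estimates one factor E[wj], and subtracting the maximum log weight before exponentiating keeps the averaging numerically stable.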
As well as being able to sample from distributions with multiple modes and to estimate the
marginal likelihood, TMCMC relies on the simultaneous growth of multiple Markov chains, which
makes it well suited to parallel processing [48]. Furthermore, it is also shown in [42] that, by
selecting values of β which ensure that the coefficient of variation of the importance weights
remains within predefined limits, the algorithm is able to generate an adaptive annealing
schedule which prevents large changes in the geometry of the target distribution from occurring.
As a result of these benefits, TMCMC has become a popular algorithm which has been applied
to many engineering problems (e.g. [49–53]) and has helped to inspire other algorithms such as
AIMS [43] (which is not discussed here).
One then defines X as being the prior mass enclosed within the contour where the likelihood
is larger than λ:

X = ∫_{p(D | θ, M) > λ} p(θ | M) dθ.    (4.8)
It is then assumed that there exists a function λ = L(X) which, if one is given X, will reveal the
corresponding value of λ. When X = 0, it is clear that there will be no prior mass within the contour
defined by p(D | θ, M) = λ (implying that λ must be larger than max_θ {p(D | θ, M)}). L(X) is then
a decreasing³ function of X until, when X = 1, λ must be equal to zero, as the entire prior mass is
now contained in the contour defined by p(D | θ, M) = λ.
From equation (4.8), it follows that dX represents the prior probability mass p(θ | M) dθ
associated with the contour where the likelihood is equal to λ; ergo

p(D | M) = ∫ p(D | θ, M) p(θ | M) dθ = ∫₀¹ L(X) dX    (4.9)
and a difficult multi-dimensional integral has been reduced to a simple one-dimensional integral.
The nested sampling algorithm begins with N samples {θ(1), . . . , θ(N)} from the prior. One then
locates the sample θ(k) which resulted in the lowest value of the likelihood (denoted λmin(1)). The
corresponding value of X (written as X(1)) is estimated according to

X(1) = (N − 1)/N.    (4.10)
One then replaces θ(k) with a sample which has been generated from the prior and is subject to
the constraint that the resulting likelihood is larger than λmin(1) (one may try to achieve this using
the Metropolis algorithm or other MCMC methods). This procedure is repeated until a series of
³Technically, it is assumed to be strictly decreasing such that there is a one-to-one relationship between X and λ; see [60] for
more details.
Figure 4. (a) Test rig and (b) schematic of the rotational energy harvester at the University of Southampton Institute of Sound
and Vibration. (Schematic labels: generator, oscillating mass m, ball-screw lead l, stiffnesses k/2, and displacements x(t), y(t), z(t).)
X values and the corresponding L(X) = λ have been obtained; standard numerical methods can
then be used to estimate ∫₀¹ L(X) dX.
While nested sampling is an elegant algorithm, the need to generate samples from the
prior subject to constraints on the likelihood can be difficult and, as such, it has not been
widely adopted within the context of structural dynamics (although it was used in [61]). It is
included here as it provides an interesting method of estimating the marginal likelihood which is
fundamentally different from TMCMC or RJMCMC and, with further development, may become
more ubiquitous within structural dynamics.
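A toy implementation of the scheme described above is sketched below. The uniform prior on [0, 1] and Gaussian likelihood are hypothetical choices with a known evidence (close to 1) for checking; the constrained prior draws use plain rejection sampling, which is viable only for this one-dimensional illustration and would normally be replaced by the constrained MCMC moves mentioned above.

```python
import numpy as np

rng = np.random.default_rng(4)

def likelihood(theta):
    # illustrative Gaussian likelihood; with a U(0,1) prior the evidence is ~1
    return np.exp(-0.5 * ((theta - 0.5) / 0.1)**2) / (0.1 * np.sqrt(2 * np.pi))

N = 100
live = rng.random(N)                 # N samples from the uniform prior
L = likelihood(live)
Z, X_prev = 0.0, 1.0
for k in range(1, 800):
    i = np.argmin(L)                 # worst live point defines the contour
    lam = L[i]
    X = ((N - 1) / N)**k             # estimated prior mass, cf. equation (4.10)
    Z += lam * (X_prev - X)          # accumulate the 1D integral of L(X)
    X_prev = X
    while True:                      # draw from the prior subject to L > lam
        cand = rng.random()
        if likelihood(cand) > lam:
            break
    live[i], L[i] = cand, likelihood(cand)
Z += X_prev * np.mean(L)             # remaining mass carried by the live points
print(Z)
```

The accumulated Z is a statistical estimate whose scatter shrinks as the number of live points N grows; the expensive part is exactly the constrained sampling step highlighted in the text.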
5. Case study
The case study shown here was originally conducted as part of a collaborative project with the
University of Southampton (full findings are published in [62]); it is included here as it clearly
demonstrates how using MCMC methods within a Bayesian framework can be used to quantify
and propagate the uncertainties involved in modelling nonlinear dynamical systems.
Figure 4a shows a rotational energy harvester—a device which, via a ball–screw mechanism,
is designed to convert low-frequency translational motion into high-frequency rotational motion
(which can then be transformed into electrical energy). The device is mounted on an electro-hydraulic
shaker while accelerometers are attached to the shaker and the oscillating mass (for a
more detailed description of the experiment, see [62,63]). With the measured inputs and outputs
(x and y) being the acceleration of the base and mass, respectively, the aim was to use a set of
experimentally obtained data to infer a robust model of the device.
Referring to the schematic in figure 4b, the equation of motion of the energy harvester is

M z̈ + bm ż + k z + f(ż) = −m ẍ,   M = m + J (2π/l)²,   bm = cm (2π/l)²,    (5.1)

where z = y − x is the relative displacement between the mass and the base, l is the ball-screw lead, cm
is the mechanical damping, k is the spring stiffness, m is the oscillating mass and J represents the system's
moment of inertia. The function f(ż) is a friction model which is to be identified. In this case, two
Figure 5. MCMC samples generated for model 1 (a) and model 2 (b): histograms of the posterior samples of cm and σ for model 1, and of cm, Fc, β and σ for model 2.
models, M1 and M2 , were considered. With the first, it was assumed that all of the parasitic
losses in the device could be modelled using a linear viscous damper (which is equivalent to
setting f (ż) = 0) while, with the second, a hyperbolic tangent friction model f (ż) = Fc tanh(β ż) was
hypothesized. Assuming that M, k and l were already estimated with sufficient accuracy and
employing a Gaussian likelihood with standard deviation σ , the identification of models 1 and
2 involved estimating the parameter vectors {cm , σ } and {cm , Fc , β, σ }, respectively. Aside from
obtaining probabilistic estimates for the parameters in each model, the aim here was to establish
whether it is worth including the additional complexity of the hyperbolic tangent friction model.
Using TMCMC to generate 1000 samples from the posterior parameter distribution of each
model, figure 5 shows the histograms of the resulting samples (where row 1 is model 1 and row
2 is model 2). Using these samples as part of Monte Carlo simulations, figure 6a,b shows the
ability of model 1 and model 2 to replicate the training data (the filled grey regions in figure 6
represent 3σ confidence bounds). These results seem to indicate that the additional complexity
of model 2 has allowed it to form a more accurate representation of the training data. Using
TMCMC to estimate the marginal likelihoods, the finding that P(M2 | D) ≈ 1 confirms that model
2 is preferable. Finally, figure 6c shows the ability of model 2 to replicate an 'unseen'
acceleration time history (data which were not used to train the model).
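The mechanics of turning posterior samples into the 3σ predictive bounds of figure 6 can be sketched as follows. The decaying-response model, the synthetic 'posterior' samples and the noise level below are purely illustrative stand-ins (they are not the harvester model or the actual TMCMC output); the point is the propagation of parameter uncertainty plus measurement noise into bounds on the prediction.

```python
import numpy as np

rng = np.random.default_rng(5)

t = np.linspace(0, 1, 200)
def model(theta, t):
    c, A = theta
    return A * np.exp(-c * t)        # a hypothetical stand-in for a system response

# stand-in for samples generated by an MCMC run on a real problem
post_samples = np.column_stack([rng.normal(2.0, 0.1, 1000),
                                rng.normal(1.0, 0.05, 1000)])
sigma_noise = 0.02                   # assumed posterior estimate of the noise level

preds = np.array([model(th, t) for th in post_samples])
mean = preds.mean(axis=0)
# total predictive spread: parameter uncertainty plus measurement noise
std = np.sqrt(preds.var(axis=0) + sigma_noise**2)
lower, upper = mean - 3 * std, mean + 3 * std
print(mean[0], lower[0], upper[0])
```

Each posterior sample is pushed through the model, so the resulting bounds marginalize over all parameter values consistent with the data rather than relying on a single crisp estimate.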
It is interesting to note that, although model 2 provides a better fit to the data, the confidence
bounds on the predictions made by each model are of a similar magnitude. This is essentially
due to a poor ‘initial phrasing’ of the problem. Specifically, when the likelihood was defined,
it was assumed that the probability of witnessing each data point was a Gaussian distribution,
centred on the prediction made by the model and with variance σ². The choice of a Gaussian
distribution is justified somewhat by the Principle of Maximum Entropy [64], from which one
finds that, having assumed the first two moments of the likelihood, the Gaussian distribution is that
which minimizes the amount of additional information that must be assumed. However, having
completed the analysis using such a likelihood, it can be observed that model 2 actually appears
better able to replicate the experiment at low amplitudes. From this, one may conclude
Figure 6. The ability of (a) model 1 and (b) model 2 to replicate the training data; (c) shows predictions of previously
'unseen' data using model 2.
that the probability of witnessing a data point, conditional on the model, actually varies with
amplitude. As a continuation to this study one could propose a more complex likelihood before
repeating the analysis. Potentially, one could then adopt a Bayesian approach to the selection of
different likelihoods, thus preventing the selection of overly complex ‘error-prediction’ models
(e.g. [65]).
6. Future work
Ultimately, with each sample generated using MCMC requiring a model run, the applicability of
MCMC to Bayesian system identification problems is limited by computational cost. This places
several restrictions on the types of problems which can be addressed. For situations where one’s
model is expensive, a current stream of research is aimed towards the development of MCMC
algorithms which are suitable for large-scale parallelization [66], and those which are able to
reduce computational cost via the exploitation of interpolation methods (see [67] for example,
where kriging is integrated into TMCMC). Further interest has been directed towards the scenario
where one is confronted with large datasets from which to infer models. The work [24] proposes
a method which allows the selection of small, highly informative subsets of data while, in [47,68],
MCMC methods are proposed which allow the tracking of one’s parameter estimates as more
data are analysed (helping to establish when a sufficient amount of data has been used).
7. Conclusion
In this paper, the authors have presented arguments for the adoption of a Bayesian framework
for the system identification of nonlinear dynamical systems in the presence of uncertainty.
Specifically, it has been highlighted how a Bayesian approach allows one to realize probabilistic
parameter estimates in the presence of measurement noise, select high fidelity models which are
not overfitted and make predictions which are marginalized over one’s parameter estimates and,
in some cases, over a set of candidate model structures. It is then shown how many of the potential
Bayesian system identification and aided the article’s revision. Both authors contributed to the drafting of
the manuscript and gave final approval for publication.
Competing interests. We declare we have no competing interests.
Funding. The authors would like to acknowledge the EPSRC Programme Grant ‘Engineering Nonlinearity’
EP/K003836/1 which funded the work in this paper as well as the collaborative project described in §5.
References
1. Soderstrom T, Stoica P. 1994 System identification, New Edition. Englewood Cliffs, NJ: Prentice
Hall.
2. Ljung L. 1999 System identification: theory for the user, 2nd edn. Englewood Cliffs, NJ: Prentice
Hall.
3. Worden K, Tomlinson GR. 2001 Nonlinearity in structural dynamics: detection, modelling and
identification. Bristol, UK: Institute of Physics.
4. Kerschen G, Worden K, Golinval J-C, Vakakis AK. 2006 Past, present and future of
nonlinear system identification in structural dynamics. Mech. Syst. Signal Process. 20, 505–592.
(doi:10.1016/j.ymssp.2005.04.008)
5. Bouc R. 1967 Forced vibration of mechanical system with hysteresis. In Proc. of 4th Conf. on
Nonlinear Oscillation, Prague, Czechoslovakia.
6. Wen Y. 1976 Method for random vibration of hysteretic systems. ASCE J. Eng. Mech. Div. 102,
249–263.
7. MacKay DJC. 2003 Information theory, inference and learning algorithms. Cambridge, UK:
Cambridge University Press.
8. Bishop CM. 2007 Pattern recognition and machine learning. Berlin, Germany: Springer.
9. Bishop CM. 1998 Neural networks for pattern recognition. Oxford, UK: Oxford University Press.
10. Rasmussen CE, Williams CKI. 2006 Gaussian processes for machine learning. Cambridge, MA:
MIT Press.
11. Leontaritis IJ, Billings SA. 1985 Input–output parametric models for nonlinear systems, Part
I: deterministic nonlinear systems. Int. J. Control 41, 303–328. (doi:10.1080/0020718508961129)
12. Leontaritis IJ, Billings SA. 1985 Input-output parametric models for nonlinear systems, Part
II: stochastic nonlinear systems. Int. J. Control 41, 329–344. (doi:10.1080/0020718508961130)
13. Worden K, Manson G, Cross EJ. 2012 Higher-order frequency response functions from
Gaussian process NARX models. In Proc. of 25th International Conference on Noise and Vibration
Engineering, Leuven, Belgium.
14. Baldacchino T, Anderson SR, Kadirkamanathan V. 2013 Computational system identification
for Bayesian NARMAX modelling. Automatica 49, 2641–2651. (doi:10.1016/j.automatica.2013.05.023)
15. Bard Y. 1974 Nonlinear parameter estimation. New York, NY: Academic Press.
16. Collins JD, Hart GC, Hasselman TK, Kennedy B. 1974 Statistical identification of structures.
AIAA J. 12, 185–190. (doi:10.2514/3.49190)
17. Beck JL. 1989 Statistical system identification of structures. In Proc. of 5th Int. Conf. on Structural
Safety and Reliability, pp. 1395–1402. New York, NY: ASCE.
18. Beck JL, Katafygiotis LS. 1998 Updating models and their uncertainties. I. Bayesian statistical
framework. ASCE J. Eng. Mech. 124, 455–461. (doi:10.1061/(ASCE)0733-9399(1998)124:4(455))
19. Beck JL, Au SK. 2002 Bayesian updating of structural models and reliability using
Markov chain Monte Carlo simulation. J. Eng. Mech. 128, 380–391. (doi:10.1061/(ASCE)0733-9399(2002)128:4(380))
systems using highly informative training data. Mech. Syst. Signal Process. 56–57, 109–122.
(doi:10.1016/j.ymssp.2014.10.003)
25. Girolami M. 2008 Bayesian inference for differential equations. Theoret. Comp. Sci. 408, 4–16.
(doi:10.1016/j.tcs.2008.07.005)
26. Calderhead B, Girolami M, Higham DJ. 2010 Is it safe to go out yet? Statistical inference in a
zombie outbreak model. University of Strathclyde, Department of Mathematics and Statistics.
27. Neal RM. 1993 Probabilistic inference using Markov chain Monte Carlo methods. Technical report.
Toronto, ON: Dept of Computer Science, University of Toronto.
28. Patil A, Huard D, Fonnesbeck CJ. 2010 PyMC: Bayesian stochastic modelling in Python. J. Stat.
Software 35, 1–81.
29. Beck JL, Yuen KV. 2004 Model selection using response measurements: Bayesian probabilistic
approach. J. Eng. Mech. 130, 192–203. (doi:10.1061/(ASCE)0733-9399(2004)130:2(192))
30. Gelman A, Carlin JB, Stern HS, Rubin DB. 2004 Bayesian data analysis, 2nd edn. London, UK:
Chapman and Hall.
31. Kerschen G, Golinval J-C, Hemez FM. 2003 Bayesian model screening for the identification of
nonlinear mechanical structures. ASME J. Vib. Acoust. 125, 389–397. (doi:10.1115/1.1569947)
32. Yuen K-V. 2010 Bayesian methods for structural dynamics and civil engineering. New York, NY:
John Wiley and Sons.
33. Doob JL. 1953 Stochastic processes. Wiley publications in statistics. New York, NY: Wiley.
34. Metropolis N, Rosenbluth AW, Rosenbluth MN, Teller AH, Teller E. 1953 Equation of state
calculations by fast computing machines. J. Chem. Phys. 21, 1087–1092. (doi:10.1063/1.1699114)
35. Hastings WK. 1970 Monte Carlo sampling methods using Markov chains and their
applications. Biometrika 57, 97–109. (doi:10.1093/biomet/57.1.97)
36. Duane S, Kennedy AD, Pendleton BJ, Roweth D. 1987 Hybrid Monte Carlo. Phys. Lett. B 195,
216–222. (doi:10.1016/0370-2693(87)91197-X)
37. Cheung SH, Beck JL. 2009 Bayesian model updating using hybrid Monte Carlo simulation
with application to structural dynamic models with many uncertain parameters. J. Eng. Mech.
135, 243–255. (doi:10.1061/(ASCE)0733-9399(2009)135:4(243))
38. Marwala T. 2001 Probabilistic fault identification using vibration data and neural networks.
Mech. Syst. Signal Process. 15, 1109–1128. (doi:10.1006/mssp.2001.1386)
39. Nakada Y, Matsumoto T, Kurihara T, Yosui K. 2005 Bayesian reconstructions and predictions
of nonlinear dynamical systems via the hybrid Monte Carlo scheme. Signal Process. 85,
129–145. (doi:10.1016/j.sigpro.2004.09.007)
40. Kirkpatrick S, Gelatt CD, Vecchi MP. 1983 Optimization by simulated annealing. Science 220,
671–680. (doi:10.1126/science.220.4598.671)
41. Green PL. 2014 Bayesian system identification of nonlinear dynamical systems using a fast
MCMC algorithm. In Proc. of ENOC 2014, European Nonlinear Dynamics Conference, Vienna,
Austria, 6–11 July.
42. Ching J, Chen YC. 2007 Transitional Markov chain Monte Carlo method for Bayesian
model updating, model class selection, and model averaging. J. Eng. Mech. 133, 816–832.
(doi:10.1061/(ASCE)0733-9399(2007)133:7(816))
43. Beck JL, Zuev KM. 2013 Asymptotically independent Markov sampling: a new Markov
chain Monte Carlo scheme for Bayesian inference. Int. J. Uncertainty Quant. 3, 445–474.
(doi:10.1615/Int.J.UncertaintyQuantification.2012004713)
44. Marinari E, Parisi G. 1992 Simulated tempering: a new Monte Carlo scheme. Europhys. Lett.
19, 451–458. (doi:10.1209/0295-5075/19/6/002)
45. Geyer CJ, Thompson EA. 1995 Annealing Markov chain Monte Carlo with applications to
ancestral inference. J. Amer. Stat. Assoc. 90, 909–920. (doi:10.1080/01621459.1995.10476590)
46. Hukushima K, Nemoto K. 1996 Exchange Monte Carlo method and application to spin glass
simulations. J. Phys. Soc. Japan 65, 1604–1608. (doi:10.1143/JPSJ.65.1604)
50. Goller B, Schueller GI. 2011 Investigation of model uncertainties in Bayesian structural model
updating. J. Sound Vibr. 330, 6122–6136. (doi:10.1016/j.jsv.2011.07.036)
51. Zheng W, Yu Y. 2013 Bayesian probabilistic framework for damage identification of steel truss
bridges under joint uncertainties. Adv. Civil Eng. 2013, 1–13. (doi:10.1155/2013/307171)
52. Zheng W, Chen YT. 2014 Novel probabilistic approach to assessing barge–bridge collision
damage based on vibration measurements through transitional Markov chain Monte Carlo
sampling. J. Civil Struct. Health Monit. 4, 119–131. (doi:10.1007/s13349-013-0063-2)
53. Wang J, Katafygiotis LS. 2014 Reliability-based optimal design of linear structures subjected
to stochastic excitations. Struct. Safety 47, 29–38. (doi:10.1016/j.strusafe.2013.11.002)
54. Green PJ. 1995 Reversible jump Markov chain Monte Carlo computation and Bayesian model
determination. Biometrika 82, 711–732. (doi:10.1093/biomet/82.4.711)
55. Green PJ, Hastie DI. 2009 Reversible jump MCMC. Genetics 155, 1391–1403.
56. Zio E, Zoia A. 2009 Parameter identification in degradation modeling by reversible-jump
Markov Chain Monte Carlo. IEEE Trans. Reliab. 58, 123–131. (doi:10.1109/TR.2008.2011674)
57. Guan X, Jha R, Liu Y. 2011 Model selection, updating, and averaging for probabilistic fatigue
damage prognosis. Struct. Safety 33, 242–249. (doi:10.1016/j.strusafe.2011.03.006)
58. Tiboaca D, Green PL, Barthorpe RJ, Worden K. 2014 Bayesian system identification of
dynamical systems using reversible jump Markov Chain Monte Carlo. In Topics in modal
analysis II, Orlando, FL, vol. 8, pp. 277–284. Berlin, Germany: Springer.
59. Skilling J. 2004 Nested sampling. In Bayesian inference and maximum entropy methods
in science and engineering, Garching, Germany, 25–30 July. AIP Conf. Proc. 735, 395–405.
(doi:10.1063/1.1835238)
60. Skilling J. 2006 Nested sampling for general Bayesian computation. Bayesian Anal. 1, 833–859.
(doi:10.1214/06-BA127)
61. Mthembu L, Marwala T, Friswell MI, Adhikari S. 2011 Model selection in finite element
model updating using the Bayesian evidence statistic. Mech. Syst. Signal Process. 25, 2399–2412.
(doi:10.1016/j.ymssp.2011.04.001)
62. Green PL, Hendijanizadeh M, Simeone L, Elliott SJ. In press. Probabilistic modelling of a
rotational energy harvester. J. Intelligent Mater. Syst. Struct. (doi:10.1177/1045389X15573343)
63. Hendijanizadeh M. 2014 Design and optimisation of constrained electromagnetic energy
harvesting devices. PhD thesis, University of Southampton, UK.
64. Jaynes ET. 2003 Probability theory: the logic of science. Cambridge, UK: Cambridge University
Press.
65. Simoen E, Papadimitriou C, Lombaert G. 2013 On prediction error correlation in Bayesian
model updating. J. Sound Vib. 332, 4136–4152. (doi:10.1016/j.jsv.2013.03.019)
66. Hadjidoukas PE, Angelikopoulos P, Papadimitriou C, Koumoutsakos P. 2015 Π4U: A high
performance computing framework for Bayesian uncertainty quantification of complex
models. J. Comput. Phys. 284, 1–21. (doi:10.1016/j.jcp.2014.12.006)
67. Angelikopoulos P, Papadimitriou C, Koumoutsakos P. 2015 X-TMCMC: Adaptive kriging
for Bayesian inverse modeling. Comp. Methods Appl. Mech. Eng. 289, 409–428. (doi:10.1016/j.cma.2015.01.015)
68. Green PL. 2015 A MCMC method for Bayesian system identification from large data
sets. In Proc. IMAC XXXIII, Conf. and Exposition on Structural Dynamics. Model Validation
and Uncertainty Quantification, vol. 3, pp. 275–281. Berlin, Germany: Springer International
Publishing.