
Bayesian Inference and Learning in Gaussian Process State-Space Models with Particle MCMC

Roger Frigola¹, Fredrik Lindsten², Thomas B. Schön²,³ and Carl E. Rasmussen¹

1. Dept. of Engineering, University of Cambridge, UK, {rf342,cer54}@cam.ac.uk
2. Div. of Automatic Control, Linköping University, Sweden, [email protected]
3. Dept. of Information Technology, Uppsala University, Sweden, [email protected]

Abstract
State-space models are successfully used in many areas of science, engineering
and economics to model time series and dynamical systems. We present a fully
Bayesian approach to inference and learning (i.e. state estimation and system
identification) in nonlinear nonparametric state-space models. We place a Gaus-
sian process prior over the state transition dynamics, resulting in a flexible model
able to capture complex dynamical phenomena. To enable efficient inference, we
marginalize over the transition dynamics function and, instead, infer directly the
joint smoothing distribution using specially tailored Particle Markov Chain Monte
Carlo samplers. Once a sample from the smoothing distribution is computed,
the state transition predictive distribution can be formulated analytically. Our ap-
proach preserves the full nonparametric expressivity of the model and can make
use of sparse Gaussian processes to greatly reduce computational complexity.

1 Introduction
State-space models (SSMs) constitute a popular and general class of models in the context of time
series and dynamical systems. Their main feature is the presence of a latent variable, the state
x_t ∈ X ≜ R^{n_x}, which condenses all aspects of the system that can have an impact on its future.
A discrete-time SSM with nonlinear dynamics can be represented as

    x_{t+1} = f(x_t, u_t) + v_t,    (1a)
    y_t = g(x_t, u_t) + e_t,        (1b)

where u_t denotes a known external input, y_t denotes the measurements, and v_t and e_t denote i.i.d. noises acting on the dynamics and the measurements, respectively. The function f encodes the dynamics and g describes the relationship between the observations and the unobserved states.
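To make the notation concrete, the following minimal sketch simulates a scalar SSM of the form (1); the particular f, g, noise variances and input are illustrative placeholders, not taken from the paper.

```python
import numpy as np

def simulate_ssm(f, g, x0, u, q, r, rng):
    """Simulate x_{t+1} = f(x_t, u_t) + v_t and y_t = g(x_t, u_t) + e_t,
    with v_t ~ N(0, q) and e_t ~ N(0, r); scalar state for clarity."""
    T = len(u)
    x, y = np.zeros(T), np.zeros(T)
    x[0] = x0
    for t in range(T):
        y[t] = g(x[t], u[t]) + rng.normal(0.0, np.sqrt(r))
        if t + 1 < T:
            x[t + 1] = f(x[t], u[t]) + rng.normal(0.0, np.sqrt(q))
    return x, y

# Placeholder dynamics and observation model (not the paper's):
rng = np.random.default_rng(0)
u = np.cos(1.2 * np.arange(100))
x, y = simulate_ssm(lambda x, u: 0.9 * x + u, lambda x, u: x,
                    x0=0.0, u=u, q=0.1, r=0.5, rng=rng)
```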
We are primarily concerned with the problem of learning general nonlinear SSMs. The aim is to
find a model that can adaptively increase its complexity when more data is available. To this effect,
we employ a Bayesian nonparametric model for the dynamics (1a). This provides a flexible model
that is not constrained by any limiting assumptions caused by postulating a particular functional
form. More specifically, we place a Gaussian process (GP) prior [1] over the unknown function f .
The resulting model is a generalization of the standard parametric SSM. The functional form of
the observation model g is assumed to be known, possibly parameterized by a finite dimensional
parameter. This is often a natural assumption, for instance in engineering applications where g
corresponds to a sensor model – we typically know what the sensors are measuring, at least up to
some unknown parameters. Furthermore, using too flexible models for both f and g can result in
problems of non-identifiability.
We adopt a fully Bayesian approach whereby we find a posterior distribution over all the latent
entities of interest, namely the state transition function f, the hidden state trajectory x_{0:T} ≜ {x_t}_{t=0}^{T}

and any hyper-parameter θ of the model. This is in contrast with existing approaches for using GPs
to model SSMs, which tend to model the GP using a finite set of target points, in effect making
the model parametric [2]. Inferring the distribution over the state trajectory p(x0:T | y0:T , u0:T )
is an important problem in itself known as smoothing. We use a tailored particle Markov Chain
Monte Carlo (PMCMC) algorithm [3] to efficiently sample from the smoothing distribution whilst
marginalizing over the state transition function. This contrasts with conventional approaches to
smoothing which require a fixed model of the transition dynamics. Once we have obtained an
approximation of the smoothing distribution, with the dynamics of the model marginalized out,
learning the function f is straightforward since its posterior is available in closed form given the state
trajectory. Our only approximation is that of the sampling algorithm. We report very good mixing
enabled by the use of recently developed PMCMC samplers [4] and the exact marginalization of the
transition dynamics.
There is by now a rich literature on GP-based SSMs. For instance, Deisenroth et al. [5, 6] presented
refined approximation methods for filtering and smoothing for already learned GP dynamics and
measurement functions. In fact, the method proposed in the present paper provides a vital component
needed for these inference methods, namely that of learning the GP model in the first place. Turner
et al. [2] applied the EM algorithm to obtain a maximum likelihood estimate of parametric models
which had the form of GPs where both inputs and outputs were parameters to be optimized. This
type of approach can be traced back to [7] where Ghahramani and Roweis applied EM to learn
models based on radial basis functions. Wang et al. [8] learn an SSM with GPs by finding a MAP estimate of the latent variables and hyper-parameters. They apply the learning in cases where the dimension of the observation vector is much higher than that of the latent state, in what becomes a form of dynamic dimensionality reduction. This procedure risks overfitting in the common situation where the state space is high-dimensional and there is significant uncertainty in the smoothing distribution.

2 Gaussian Process State-Space Model

We describe the generative probabilistic model of the Gaussian process SSM (GP-SSM) represented in Figure 1b by

    f(x_t) ∼ GP(m_{θx}(x_t), k_{θx}(x_t, x_t')),    (2a)
    x_{t+1} | f_t ∼ N(x_{t+1} | f_t, Q),            (2b)
    y_t | x_t ∼ p(y_t | x_t, θ_y),                  (2c)

and x_0 ∼ p(x_0), where we avoid notational clutter by omitting the conditioning on the known inputs u_t. In addition, we put a prior p(θ) over the various hyper-parameters θ = {θ_x, θ_y, Q}. Also, note that the measurement model (2c) and the prior on x_0 can take any form, since we do not rely on their properties for efficient inference.
The GP is fully described by its mean function and its covariance function. An interesting property
of the GP-SSM is that any a priori insight into the dynamics of the system can be readily encoded
in the mean function. This is useful, since it is often possible to capture the main properties of
the dynamics, e.g. by using a simple parametric model or a model based on first principles. Such

(a) Standard GP regression (b) GP-SSM

Figure 1: Graphical models for standard GP regression and the GP-SSM model. The thick horizontal
bars represent sets of fully connected nodes.

simple models may be insufficient on their own, but useful together with the GP-SSM, as the GP
is flexible enough to model complex departures from the mean function. If no specific prior model
is available, the linear mean function m(xt ) = xt is a good generic choice. Interestingly, the prior
information encoded in this model will normally be more vague than the prior information encoded
in parametric models. The measurement model (2c) implicitly contains the observation function g
and the distribution of the i.i.d. measurement noise et .
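As a small illustration (the paper does not fix a specific covariance function, so the squared-exponential choice here is an assumption), the following defines a covariance together with the linear mean function m(x_t) = x_t suggested above; later sketches reuse these helpers.

```python
import numpy as np

def k_se(xa, xb, ell=1.0, sf2=1.0):
    """Squared-exponential covariance between input sets xa (n, d) and xb (m, d)."""
    d2 = np.sum((xa[:, None, :] - xb[None, :, :]) ** 2, axis=-1)
    return sf2 * np.exp(-0.5 * d2 / ell ** 2)

def m_linear(x):
    """Linear mean function m(x) = x, the generic choice discussed above."""
    return x
```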

3 Inference over States and Hyper-parameters


Direct learning of the function f in (2a) from input/output data {u_{0:T-1}, y_{0:T}} is challenging since
the states x0:T are not observed. Most (if not all) previous approaches attack this problem by re-
verting to a parametric representation of f which is learned alongside the states. We address this
problem in a fundamentally different way by marginalizing out f , allowing us to respect the non-
parametric nature of the model. A challenge with this approach is that marginalization of f will
introduce dependencies across time for the state variables that lead to the loss of the Markovian
structure of the state-process. However, recently developed inference methods, combining sequen-
tial Monte Carlo (SMC) and Markov chain Monte Carlo (MCMC) allow us to tackle this problem.
We discuss marginalization of f in Section 3.1 and present the inference algorithms in Sections 3.2
and 3.3.

3.1 Marginalizing out the State Transition Function

Targeting the joint posterior distribution of the hyper-parameters, the latent states and the latent func-
tion f is problematic due to the strong dependencies between x0:T and f . We therefore marginalize
the dynamical function from the model, and instead target the distribution p(θ, x0:T | y1:T ) (recall
that conditioning on u0:T −1 is implicit). In the MCMC literature, this is referred to as collapsing [9].
Hence, we first need to find an expression for the marginal prior p(θ, x_{0:T}) = p(x_{0:T} | θ) p(θ). Focusing on p(x_{0:T} | θ), we note that, although this distribution is not Gaussian, it can be represented as a product of Gaussians. Omitting the dependence on θ in the notation, we obtain

    p(x_{1:T} | θ, x_0) = ∏_{t=1}^{T} p(x_t | θ, x_{0:t-1}) = ∏_{t=1}^{T} N(x_t | μ_t(x_{0:t-1}), Σ_t(x_{0:t-1})),    (3a)

with

    μ_t(x_{0:t-1}) = m_{t-1} + K_{t-1,0:t-2} K̃_{0:t-2}^{-1} (x_{1:t-1} − m_{0:t-2}),    (3b)
    Σ_t(x_{0:t-1}) = K̃_{t-1} − K_{t-1,0:t-2} K̃_{0:t-2}^{-1} K_{t-1,0:t-2}^⊤,           (3c)

for t ≥ 2, and μ_1(x_0) = m_0, Σ_1(x_0) = K̃_0. Equation (3) follows from the fact that, once conditioned on x_{0:t-1}, a one-step prediction for the state variable is a standard GP prediction. Here, we have defined the mean vector m_{0:t-1} ≜ [m(x_0)^⊤ ... m(x_{t-1})^⊤]^⊤ and the (n_x t) × (n_x t) positive definite matrix K_{0:t-1} with block entries [K_{0:t-1}]_{i,j} = k(x_{i-1}, x_{j-1}). We use two sets of indices, as in K_{t-1,0:t-2}, to refer to the off-diagonal blocks of K_{0:t-1}. We also define K̃_{0:t-1} = K_{0:t-1} + I_t ⊗ Q. We can also express (3a) more succinctly as

    p(x_{1:t} | θ, x_0) = |(2π)^{n_x t} K̃_{0:t-1}|^{−1/2} exp(−½ (x_{1:t} − m_{0:t-1})^⊤ K̃_{0:t-1}^{−1} (x_{1:t} − m_{0:t-1})).    (4)

This expression looks very much like a multivariate Gaussian density function. However, we emphasize that this is not the case, since both m_{0:t-1} and K̃_{0:t-1} depend (nonlinearly) on the argument x_{1:t}. In fact, (4) will typically be very far from Gaussian.
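The sequential structure of (3) is easy to exercise in code. The sketch below draws a trajectory from the marginal prior by iterating the one-step predictive (3b)-(3c); it assumes a scalar state and reuses the k_se and m_linear helpers from the sketch in Section 2.

```python
import numpy as np

def sample_gp_ssm_prior(m, k, x0, Q, T, rng):
    """Draw x_{1:T} from the marginal prior (3) by iterating the one-step
    GP predictive (3b)-(3c); scalar state (n_x = 1) for clarity."""
    xs = [np.array([[x0]])]                              # x_0
    for t in range(1, T + 1):
        X = np.vstack(xs)                                # x_{0:t-1}
        test = X[-1:]                                    # conditioning point x_{t-1}
        if t == 1:
            mu, var = m(test), k(test, test) + Q         # mu_1 = m_0, Sigma_1 = K~_0
        else:
            Kin = k(X[:-1], X[:-1]) + Q * np.eye(t - 1)  # K~_{0:t-2}
            kv = k(test, X[:-1])                         # K_{t-1,0:t-2}
            resid = np.vstack(xs[1:]) - m(X[:-1])        # x_{1:t-1} - m_{0:t-2}
            mu = m(test) + kv @ np.linalg.solve(Kin, resid)
            var = k(test, test) + Q - kv @ np.linalg.solve(Kin, kv.T)
        xs.append(mu + np.sqrt(var) * rng.standard_normal((1, 1)))
    return np.vstack(xs).ravel()                         # x_{0:T}

rng = np.random.default_rng(1)
trajectory = sample_gp_ssm_prior(m_linear, k_se, x0=0.0, Q=0.1, T=50, rng=rng)
```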

3.2 Sequential Monte Carlo

With the prior (4) in place, we now turn to posterior inference and we start by considering the joint
smoothing distribution p(x0:T | θ, y0:T ). The sequential nature of the proposed model suggests
the use of SMC. Though most well known for filtering in Markovian SSMs – see [10, 11] for an
introduction – SMC is applicable also for non-Markovian latent variable models. We seek to ap-
proximate the sequence of distributions p(x_{0:t} | θ, y_{0:t}), for t = 0, ..., T. Let {x_{0:t-1}^i, w_{t-1}^i}_{i=1}^{N} be a collection of weighted particles approximating p(x_{0:t-1} | θ, y_{0:t-1}) by the empirical distribution p̂(x_{0:t-1} | θ, y_{0:t-1}) ≜ ∑_{i=1}^{N} w_{t-1}^i δ_{x_{0:t-1}^i}(x_{0:t-1}). Here, δ_z(x) is a point-mass located at z. To propagate this sample to time t, we introduce the auxiliary variables {a_t^i}_{i=1}^{N}, referred to as ancestor indices. The variable a_t^i is the index of the ancestor particle at time t − 1 of particle x_t^i. Hence, x_t^i is generated by first sampling a_t^i with P(a_t^i = j) = w_{t-1}^j. Then, x_t^i is generated as

    x_t^i ∼ p(x_t | θ, x_{0:t-1}^{a_t^i}, y_{0:t}),    (5)

for i = 1, ..., N. The particle trajectories are then augmented according to x_{0:t}^i = {x_{0:t-1}^{a_t^i}, x_t^i}. Sampling from the one-step predictive density is a simple (and sensible) choice, but we may also consider other proposal distributions. In the above formulation the resampling step is implicit and corresponds to sampling the ancestor indices (cf. the auxiliary particle filter, [12]). Finally, the particles are weighted according to the measurement model, w_t^i ∝ p(y_t | θ, x_t^i) for i = 1, ..., N, where the weights are normalized to sum to 1.
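A generic SMC skeleton for this non-Markovian setting might look as follows; `propose` and `log_weight` are hypothetical callables standing in for the predictive (5) and the measurement model, and the proposal-equals-prior choice makes the weights the likelihood terms, as in the text.

```python
import numpy as np

def smc(propose, log_weight, x0_particles, T, rng):
    """Minimal SMC sketch for a non-Markovian model (Section 3.2).
    propose(hist, t, rng) draws x_t given the whole history x_{0:t-1};
    log_weight(x, t) evaluates log p(y_t | theta, x_t).
    Returns weighted trajectories {x_{0:T}^i, w_T^i}."""
    N = len(x0_particles)
    traj = np.asarray(x0_particles, dtype=float)[:, None]        # (N, 1)
    logw = np.array([log_weight(x, 0) for x in traj[:, -1]])
    for t in range(1, T + 1):
        w = np.exp(logw - logw.max()); w /= w.sum()
        a = rng.choice(N, size=N, p=w)                           # ancestor indices a_t^i
        traj = traj[a]                                           # resample full histories
        new = np.array([propose(traj[i], t, rng) for i in range(N)])
        traj = np.hstack([traj, new[:, None]])                   # augment to x_{0:t}^i
        logw = np.array([log_weight(x, t) for x in traj[:, -1]])
    w = np.exp(logw - logw.max()); w /= w.sum()
    return traj, w
```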

3.3 Particle Markov Chain Monte Carlo

There are two shortcomings of SMC: (i) it does not handle inference over hyper-parameters; (ii) despite the fact that the sampler targets the joint smoothing distribution, it does not, in general, provide an accurate approximation of the full joint distribution, due to path degeneracy. That is, the successive resampling steps cause the particle diversity to be very low for time points t far from the final time instant T.
To address these issues, we propose to use a particle Markov chain Monte Carlo (PMCMC, [3, 13])
sampler. PMCMC relies on SMC to generate samples of the highly correlated state trajectory within
an MCMC sampler. We employ a specific PMCMC sampler referred to as particle Gibbs with
ancestor sampling (PGAS, [4]), given in Algorithm 1. PGAS uses Gibbs-like steps for the state
trajectory x0:T and the hyper-parameters θ, respectively. That is, we sample first x0:T given θ,
then θ given x0:T , etc. However, the full conditionals are not explicitly available. Instead, we draw
samples from specially tailored Markov kernels, leaving these conditionals invariant. We address
these steps in the subsequent sections.

Algorithm 1 Particle Gibbs with ancestor sampling (PGAS)

1. Set θ[0] and x_{1:T}[0] arbitrarily.
2. For ℓ ≥ 1 do
   (a) Draw θ[ℓ] conditionally on x_{0:T}[ℓ − 1] and y_{0:T} as discussed in Section 3.3.2.
   (b) Run CPF-AS (see [4]) targeting p(x_{0:T} | θ[ℓ], y_{0:T}), conditionally on x_{0:T}[ℓ − 1].
   (c) Sample k with P(k = i) = w_T^i and set x_{1:T}[ℓ] = x_{1:T}^k.
3. end
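In code, the outer loop of Algorithm 1 is a plain Gibbs-style iteration. The sketch below uses hypothetical callables sample_theta and cpf_as standing in for the steps of Sections 3.3.2 and 3.3.1; it illustrates the control flow rather than the authors' implementation.

```python
def pgas(n_iters, theta0, x_ref0, sample_theta, cpf_as, rng):
    """Sketch of Algorithm 1 (PGAS).  cpf_as(theta, x_ref, rng) returns the
    weighted particle trajectories of a CPF-AS sweep conditioned on x_ref."""
    theta, x_ref = theta0, x_ref0
    samples = []
    for _ in range(n_iters):
        theta = sample_theta(x_ref, rng)          # step 2(a): theta | x_{0:T}
        trajs, w = cpf_as(theta, x_ref, rng)      # step 2(b): CPF-AS sweep
        k = rng.choice(len(w), p=w)               # step 2(c): P(k = i) = w_T^i
        x_ref = trajs[k]
        samples.append((theta, x_ref))
    return samples
```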

3.3.1 Sampling the State Trajectories


To sample the state trajectory, PGAS makes use of an SMC-like procedure referred to as a conditional particle filter with ancestor sampling (CPF-AS). This approach is particularly suitable for non-Markovian latent variable models, as it relies only on a forward recursion (see [4]). The difference between a standard particle filter (PF) and the CPF-AS is that, for the latter, one particle at each time step is specified a priori. Let these particles be denoted x̃_{0:T} = {x̃_0, ..., x̃_T}. We then sample according to (5) only for i = 1, ..., N − 1. The Nth particle is set deterministically: x_t^N = x̃_t. To be able to construct the Nth particle trajectory, x_t^N has to be associated with an ancestor particle at time t − 1. This is done by sampling a value for the corresponding ancestor index a_t^N. Following
[4], the ancestor sampling probabilities are computed as

    w̃_{t-1|T}^i ∝ w_{t-1}^i · p({x_{0:t-1}^i, x̃_{t:T}}, y_{0:T}) / p(x_{0:t-1}^i, y_{0:t-1}) ∝ w_{t-1}^i · p({x_{0:t-1}^i, x̃_{t:T}}) / p(x_{0:t-1}^i) = w_{t-1}^i · p(x̃_{t:T} | x_{0:t-1}^i),    (6)
where the ratio is between the unnormalized target densities up to time T and up to time t − 1,
respectively. The second proportionality follows from the mutual conditional independence of the
observations, given the states. Here, {x_{0:t-1}^i, x̃_{t:T}} refers to a path in X^{T+1} formed by concatenating the two partial trajectories. The above expression can be computed by using the prior over state trajectories given by (4). The ancestor sampling weights {w̃_{t-1|T}^i}_{i=1}^{N} are then normalized to sum to 1, and the ancestor index a_t^N is sampled with P(a_t^N = j) = w̃_{t-1|T}^j.
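In implementation terms, the ancestor index for the reference particle can be drawn in log space for numerical stability; log_pred[i] below stands for log p(x̃_{t:T} | x_{0:t-1}^i), evaluated via the prior (4), and is a hypothetical input rather than a paper-specified API.

```python
import numpy as np

def sample_ancestor_index(log_w, log_pred, rng):
    """Draw a_t^N with probabilities proportional to (6):
    w_{t-1}^i * p(x~_{t:T} | x_{0:t-1}^i), computed in log space."""
    logits = np.asarray(log_w) + np.asarray(log_pred)
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return rng.choice(len(w), p=w)
```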

The conditioning on a prespecified collection of particles implies an invariance property of CPF-AS, which is key to our development. More precisely, given x̃_{0:T}, let x̃′_{0:T} be generated as follows:
1. Run CPF-AS from time t = 0 to time t = T, conditionally on x̃_{0:T}.
2. Set x̃′_{0:T} to one of the resulting particle trajectories according to P(x̃′_{0:T} = x_{0:T}^i) = w_T^i.
For any N ≥ 2, this procedure defines an ergodic Markov kernel M_θ^N(x̃′_{0:T} | x̃_{0:T}) on X^{T+1}, leaving the exact smoothing distribution p(x_{0:T} | θ, y_{0:T}) invariant [4]. Note that this invariance holds for any N ≥ 2; that is, the number of particles only affects the mixing rate of the kernel M_θ^N. However, practical experience indicates that the autocorrelation drops sharply as N increases [4, 14], and for many models a moderate N is enough to obtain a rapidly mixing kernel.

3.3.2 Sampling the Hyper-parameters


Next, we consider sampling the hyper-parameters given a state trajectory and sequence of observations, i.e. from p(θ | x_{0:T}, y_{0:T}). In the following, we consider the common situation where there are distinct hyper-parameters for the likelihood p(y_{0:T} | x_{0:T}, θ_y) and for the prior over trajectories p(x_{0:T} | θ_x). If the prior over the hyper-parameters factorizes between those two groups, we obtain p(θ | x_{0:T}, y_{0:T}) ∝ p(θ_y | x_{0:T}, y_{0:T}) p(θ_x | x_{0:T}). We can thus proceed to sample the two groups of hyper-parameters independently. Sampling θ_y will be straightforward in most cases, in particular if conjugate priors for the likelihood are used. Sampling θ_x will, nevertheless, be harder, since the covariance function hyper-parameters enter the expression in a non-trivial way. However, we note that once the state trajectory is fixed, we are left with a problem analogous to Gaussian process regression, where x_{0:T-1} are the inputs, x_{1:T} are the outputs and Q is the likelihood covariance matrix. Given that the latent dynamics can be marginalized out analytically, sampling the hyper-parameters with slice sampling is straightforward [15].
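For concreteness, a univariate slice sampler with stepping-out (Neal's construction, one standard way to realize the step cited from [15]) could look as follows; log_post is the unnormalized log posterior of a single hyper-parameter given the fixed trajectory.

```python
import numpy as np

def slice_sample(log_post, x0, w=1.0, max_steps=50, rng=None):
    """One univariate slice-sampling move with stepping-out and shrinkage."""
    rng = rng or np.random.default_rng()
    log_y = log_post(x0) + np.log(rng.uniform())   # auxiliary slice height
    left = x0 - w * rng.uniform()                  # randomly positioned bracket
    right = left + w
    for _ in range(max_steps):                     # step out the left edge
        if log_post(left) <= log_y:
            break
        left -= w
    for _ in range(max_steps):                     # step out the right edge
        if log_post(right) <= log_y:
            break
        right += w
    while True:                                    # shrink until acceptance
        x1 = rng.uniform(left, right)
        if log_post(x1) > log_y:
            return x1
        if x1 < x0:
            left = x1
        else:
            right = x1
```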

4 A Sparse GP-SSM Construction and Implementation Details


A naive implementation of the CPF-AS algorithm will give rise to O(T⁴) computational complexity, since at each time step t = 1, ..., T, a matrix of size T × T needs to be factorized. However, it is possible to update and reuse the factors from the previous time step, bringing the total computational complexity down to the familiar O(T³). Furthermore, by introducing a sparse GP model, we can reduce the complexity to O(M²T), where M ≪ T. In Section 4.1 we introduce the sparse GP model, and in Section 4.2 we provide insight into the efficient implementation of both the vanilla GP and the sparse GP.

4.1 FIC Prior over the State Trajectory

An important alternative to the GP-SSM is given by exchanging the vanilla GP prior over f for a sparse counterpart. We do not consider the resulting model to be an approximation to the GP-SSM: it is still a GP-SSM, but with a different prior over functions. As a result, we expect it to sometimes outperform its non-sparse version, just as happens with their regression siblings [16].
Most sparse GP methods can be formulated in terms of a set of so called inducing variables [17].
These variables live in the space of the latent function and have a set I of corresponding inducing
inputs. The assumption is that, conditionally on the inducing variables, the latent function values are
mutually independent. Although the inducing variables are marginalized analytically – this is key for
the model to remain nonparametric – the inducing inputs have to be chosen in such a way that they,
informally speaking, cover the same region of the input space covered by the data. Crucially, in order
to achieve computational gains, the number M of inducing variables is selected to be smaller than
the original number of data points. In the following, we will use the fully independent conditional
(FIC) sparse GP prior as defined in [17] due to its very good empirical performance [16].
As shown in [17], the FIC prior can be obtained by replacing the covariance function k(·, ·) by

    k_FIC(x_i, x_j) = s(x_i, x_j) + δ_ij (k(x_i, x_j) − s(x_i, x_j)),    (7)

where s(x_i, x_j) ≜ k(x_i, I) k(I, I)^{-1} k(I, x_j), δ_ij is Kronecker's delta, and we use the convention whereby, when k takes a set as one of its arguments, it generates a matrix of covariances. Using the Woodbury matrix identity, we can express the one-step predictive density as in (3), with

    μ_t^FIC(x_{0:t-1}) = m_{t-1} + K_{t-1,I} P K_{I,0:t-2} Λ_{0:t-2}^{-1} (x_{1:t-1} − m_{0:t-2}),    (8a)
    Σ_t^FIC(x_{0:t-1}) = K̃_{t-1} − S_{t-1} + K_{t-1,I} P K_{I,t-1},                                  (8b)

where P ≜ (K_{I,I} + K_{I,0:t-2} Λ_{0:t-2}^{-1} K_{0:t-2,I})^{-1}, Λ_{0:t-2} ≜ diag[K̃_{0:t-2} − S_{0:t-2}] and S_{A,B} ≜ K_{A,I} K_{I,I}^{-1} K_{I,B}. Despite its apparent cumbersomeness, the computational complexity involved in computing the above mean and covariance is O(M²t), as opposed to O(t³) for (3). The same idea can be used to express (4) in a form which allows for efficient computation. Here, diag refers to a block diagonalization if Q is not diagonal.
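A direct transcription of (7) for dense covariance matrices is a handful of lines; Z plays the role of the inducing inputs I, and k is any base covariance such as the k_se sketch above (the `same` flag, marking identical input sets, is an implementation convenience of this sketch).

```python
import numpy as np

def k_fic(k, Xa, Xb, Z, same=False):
    """FIC covariance (7): the low-rank term s(x_i, x_j) plus, when the two
    input sets coincide, the diagonal correction delta_ij (k - s)."""
    Kzz = k(Z, Z)
    S = k(Xa, Z) @ np.linalg.solve(Kzz, k(Z, Xb))   # s(x_i, x_j)
    if same:                                        # delta_ij only fires for Xa == Xb
        S = S + np.diag(np.diag(k(Xa, Xb) - S))
    return S
```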
We do not address the problem of choosing the inducing inputs, but note that one option is to use
greedy methods (e.g. [18]). The fast forward selection algorithm is appealing due to its very low
computational complexity [18]. Moreover, its potential drawback of interference between hyper-
parameter learning and active set selection is not an issue in our case since hyper-parameters will be
fixed for a given run of the particle filter.

4.2 Implementation Details

As pointed out above, it is crucial to reuse computations across time to attain the O(T³) or O(M²T) computational complexity for the vanilla GP and the FIC prior, respectively. We start by discussing the vanilla GP and then briefly comment on the implementation aspects of FIC.
There are two costly operations of the CPF-AS algorithm: (i) sampling from the prior (5), requiring the computation of (3b) and (3c), and (ii) evaluating the ancestor sampling probabilities according to (6). Both of these operations can be carried out efficiently by keeping track of a Cholesky factorization of the matrix K̃({x_{0:t-1}^i, x̃_{t:T-1}}) = L_t^i L_t^{i⊤}, for each particle i = 1, ..., N. Here, K̃({x_{0:t-1}^i, x̃_{t:T-1}}) is a matrix defined analogously to K̃_{0:T-1}, but where the covariance function is evaluated for the concatenated state trajectory {x_{0:t-1}^i, x̃_{t:T-1}}. From L_t^i, it is possible to identify sub-matrices corresponding to the Cholesky factors for the covariance matrix Σ_t(x_{0:t-1}^i), as well as for the matrices needed to efficiently evaluate the ancestor sampling probabilities (6).
It remains to find an efficient update of the Cholesky factor to obtain L_{t+1}^i. As we move from time t to t + 1 in the algorithm, x̃_t will be replaced by x_t^i in the concatenated trajectory. Hence, the matrix K̃({x_{0:t}^i, x̃_{t+1:T-1}}) can be obtained from K̃({x_{0:t-1}^i, x̃_{t:T-1}}) by replacing n_x rows and columns, corresponding to a rank 2n_x update. It follows that we can compute L_{t+1}^i by making n_x successive rank one updates and downdates on L_t^i. In summary, all the operations at a specific time step can be done in O(T²) computations, leading to a total computational complexity of O(T³).
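Neither NumPy nor SciPy ships a rank-one Cholesky update, so an implementation would typically roll its own; below is a standard O(n²) routine of the kind these updates and downdates rely on (a generic sketch, not the authors' code).

```python
import numpy as np

def chol_rank1(L, v, sign=+1.0):
    """Return the lower-triangular factor of L L^T + sign * v v^T.
    sign=+1 is an update, sign=-1 a downdate (assumed to remain positive
    definite).  Cost is O(n^2), which is what makes the reuse pay off."""
    L, v = L.copy(), v.astype(float).copy()
    n = len(v)
    for i in range(n):
        r = np.sqrt(L[i, i] ** 2 + sign * v[i] ** 2)
        c, s = r / L[i, i], v[i] / L[i, i]
        L[i, i] = r
        if i + 1 < n:
            L[i + 1:, i] = (L[i + 1:, i] + sign * s * v[i + 1:]) / c
            v[i + 1:] = c * v[i + 1:] - s * L[i + 1:, i]
    return L
```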
For the FIC prior, a naive implementation will give rise to O(M²T²) computational complexity. This can be reduced to O(M²T) by keeping track of a factorization of the matrix P. However, to reach the O(M²T) cost, all intermediate operations scaling with T have to be avoided, requiring us to reuse not only the matrix factorizations, but also intermediate matrix-vector multiplications.

5 Learning the Dynamics


Algorithm 1 gives us a tool to compute p(x_{0:T}, θ | y_{1:T}). We now discuss how this can be used to find an explicit model for f. The goal of learning the state transition dynamics is equivalent to that of obtaining a predictive distribution over f* = f(x*), evaluated at an arbitrary test point x*:

    p(f* | x*, y_{1:T}) = ∫ p(f* | x*, x_{0:T}, θ) p(x_{0:T}, θ | y_{1:T}) dx_{0:T} dθ.    (9)

Using a sample-based approximation of p(x_{0:T}, θ | y_{1:T}), this integral can be approximated by

    p(f* | x*, y_{1:T}) ≈ (1/L) ∑_{ℓ=1}^{L} p(f* | x*, x_{0:T}[ℓ], θ[ℓ]) = (1/L) ∑_{ℓ=1}^{L} N(f* | μ_ℓ(x*), Σ_ℓ(x*)),    (10)

where L is the number of samples, and μ_ℓ(x*) and Σ_ℓ(x*) follow the expressions for the predictive distribution in standard GP regression if x_{0:T-1}[ℓ] are treated as inputs, x_{1:T}[ℓ] are treated as outputs and Q is the likelihood covariance matrix. This mixture of Gaussians is an expressive representation of the predictive density which can, for instance, correctly take into account multimodality arising from ambiguity in the measurements. Although factorized covariance matrices can be pre-computed, the overall computational cost will increase linearly with L. The computational cost can be reduced by thinning the Markov chain using, e.g., random sub-sampling or kernel herding [19].
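If only the first two moments of the mixture (10) are needed (as when plotting predictive bands, cf. Figure 3), they follow from the law of total variance; the per-sample predictors below are hypothetical callables returning (μ_ℓ(x*), Σ_ℓ(x*)) for a scalar output.

```python
import numpy as np

def mixture_moments(x_star, gp_predictors):
    """First two moments of the mixture-of-Gaussians predictive (10).
    gp_predictors is a list of L callables, one per MCMC sample."""
    mus, variances = zip(*(predict(x_star) for predict in gp_predictors))
    mus, variances = np.array(mus), np.array(variances)
    mean = mus.mean()                      # (1/L) sum_l mu_l
    var = variances.mean() + mus.var()     # law of total variance
    return mean, var
```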
In some situations it could be useful to obtain, from the mixture of Gaussians, an approximation consisting of a single GP representation. This is the case in applications such as control or real-time filtering, where the cost of evaluating the mixture of Gaussians can be prohibitive. In those cases one could opt for a pragmatic approach and learn the mapping x* ↦ f* from a cloud of points {x_{0:T}[ℓ], f_{0:T}[ℓ]}_{ℓ=1}^{L} using sparse GP regression. The latent function values f_{0:T}[ℓ] can easily be sampled from the normally distributed p(f_{0:T}[ℓ] | x_{0:T}[ℓ], θ[ℓ]).

6 Experiments

6.1 Learning a Nonlinear System Benchmark

Consider a system with dynamics given by x_{t+1} = a x_t + b x_t/(1 + x_t²) + c u_t + v_t, v_t ∼ N(0, q), and observations given by y_t = d x_t² + e_t, e_t ∼ N(0, r), with parameters (a, b, c, d, q, r) = (0.5, 25, 8, 0.05, 10, 1) and a known input u_t = cos(1.2(t + 1)). One of the difficulties of this system is that the smoothing density p(x_{0:T} | y_{0:T}) is multimodal, since no information about the sign of x_t is available in the observations. The system is simulated for T = 200 time steps, using log-normal priors for the hyper-parameters, and the PGAS sampler is then run for 50 iterations using N = 20 particles. To illustrate the capability of the GP-SSM to make use of a parametric model as a baseline, we use a mean function with the same parametric form as the true system, but parameters (a, b, c) = (0.3, 7.5, 0). This function, denoted model B, is manifestly different from the actual state transition (green vs. black surfaces in Figure 2), which also demonstrates the flexibility of the GP-SSM.
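For reference, the benchmark is easy to reproduce from the description above; this sketch simulates it with the stated parameters (the seed is arbitrary).

```python
import numpy as np

def simulate_benchmark(T=200, seed=0):
    """Simulate x_{t+1} = a x_t + b x_t/(1 + x_t^2) + c u_t + v_t and
    y_t = d x_t^2 + e_t with (a, b, c, d, q, r) = (0.5, 25, 8, 0.05, 10, 1)
    and u_t = cos(1.2 (t + 1)), as specified above."""
    rng = np.random.default_rng(seed)
    a, b, c, d, q, r = 0.5, 25.0, 8.0, 0.05, 10.0, 1.0
    x, y = np.zeros(T + 1), np.zeros(T + 1)
    for t in range(T + 1):
        u = np.cos(1.2 * (t + 1))
        y[t] = d * x[t] ** 2 + rng.normal(0.0, np.sqrt(r))
        if t < T:
            x[t + 1] = (a * x[t] + b * x[t] / (1 + x[t] ** 2)
                        + c * u + rng.normal(0.0, np.sqrt(q)))
    return x, y
```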
Figure 2 (left) shows the samples of x_{0:T} (red). It is apparent that the distribution covers two alternative state trajectories at particular times (e.g. t = 10). In fact, it is always the case that this bi-modal distribution covers the two states of opposite signs that could have led to the same observation (cyan). In Figure 2 (right) we plot samples from the smoothing distribution, where each circle corresponds to (x_t, u_t, E[f_t]). Although the parametric model used in the mean function of the GP (green) is clearly not representative of the true dynamics (black), the samples from the smoothing distribution accurately portray the underlying system. The smoothness prior embodied by the GP allows for accurate sampling from the smoothing distribution even when the parametric model of the dynamics fails to capture important features.
To measure the predictive capability of the learned transition dynamics, we generate a new dataset
consisting of 10 000 time steps and present the RMSE between the predicted value of f (xt , ut )
and the actual one. We compare the results from GP-SSM with the predictions obtained from two
parametric models (one with the true model structure and one linear model) and two known models
(the ground truth model and model B). We also report results for the sparse GP-SSM using an
FIC prior with 40 inducing points. Table 1 summarizes the results, averaged over 10 independent
training and test datasets. We also report the RMSE from the joint smoothing sample to the ground
truth trajectory.

Table 1: RMSE to ground truth values over 10 independent runs.

                                                      RMSE, prediction of        RMSE, smoothing
                                                      f* | x_t*, u_t*, data      x_{0:T} | data
    Ground truth model (known parameters)                      –                    2.7 ± 0.5
    GP-SSM (proposed, model B mean function)               1.7 ± 0.2                3.2 ± 0.5
    Sparse GP-SSM (proposed, model B mean function)        1.8 ± 0.2                2.7 ± 0.4
    Model B (fixed parameters)                             7.1 ± 0.0               13.6 ± 1.1
    Ground truth model, learned parameters                 0.5 ± 0.2                3.0 ± 0.4
    Linear model, learned parameters                       5.5 ± 0.1                6.0 ± 0.5

Figure 2: Left: Smoothing distribution (state vs. time; legend: samples, ground truth, ±(max(y_t, 0)/d)^{1/2}). Right: State transition function f over (u(t), x(t)) (black: actual transition function, green: mean function (model B), red: smoothing samples).

Figure 3: One step ahead predictive distribution for each of the states (x, ẋ, θ̇, θ) of the cart and pole system. Black: ground truth. Colored band: one standard deviation from the mixture of Gaussians predictive.

6.2 Learning a Cart and Pole System

We apply our approach to learn a model of a cart and pole system used in reinforcement learning. The system consists of a cart, with a free-spinning pendulum, rolling on a horizontal track. An external force is applied to the cart. The system's dynamics can be described by four states and a set of nonlinear ordinary differential equations [20]. We learn a GP-SSM based on 100 observations of the state corrupted with Gaussian noise. Although the training set only explores a small region of the 4-dimensional state space, we can learn a model of the dynamics which can produce one step ahead predictions such as the ones in Figure 3. We obtain a predictive distribution in the form of a mixture of Gaussians, from which we display the first and second moments. Crucially, the learned model reports different amounts of uncertainty in different regions of the state space. For instance, note the narrower error-bars on some states between t = 320 and t = 350. This is due to the model being more confident in its predictions in areas that are closer to the training data.

7 Conclusions
We have shown an efficient way to perform fully Bayesian inference and learning in the GP-SSM.
A key contribution is that our approach retains the full nonparametric expressivity of the model.
This is made possible by marginalizing out the state transition function, which results in a non-
trivial inference problem that we solve using a tailored PGAS sampler.
A particular characteristic of our approach is that the latent states can be sampled from the smoothing
distribution even when the state transition function is unknown. Assumptions about smoothness
and parsimony of this function embodied by the GP prior suffice to obtain high-quality smoothing
distributions. Once samples from the smoothing distribution are available, they can be used to
describe a posterior over the state transition function. This contrasts with the conventional approach
to inference in dynamical systems where smoothing is performed conditioned on a model of the
state transition dynamics.

References
[1] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning. MIT Press, 2006.
[2] R. Turner, M. P. Deisenroth, and C. E. Rasmussen, “State-space inference and learning with Gaussian
processes,” in 13th International Conference on Artificial Intelligence and Statistics, ser. W&CP, Y. W.
Teh and M. Titterington, Eds., vol. 9, Chia Laguna, Sardinia, Italy, May 13–15 2010, pp. 868–875.
[3] C. Andrieu, A. Doucet, and R. Holenstein, “Particle Markov chain Monte Carlo methods,” Journal of the
Royal Statistical Society: Series B (Statistical Methodology), vol. 72, no. 3, pp. 269–342, 2010.
[4] F. Lindsten, M. Jordan, and T. B. Schön, “Ancestor sampling for particle Gibbs,” in Advances in Neural
Information Processing Systems 25, P. Bartlett, F. Pereira, C. Burges, L. Bottou, and K. Weinberger, Eds.,
2012, pp. 2600–2608.
[5] M. Deisenroth, R. Turner, M. Huber, U. Hanebeck, and C. Rasmussen, "Robust filtering and smoothing with Gaussian processes," IEEE Transactions on Automatic Control, vol. 57, no. 7, pp. 1865–1871, July 2012.
[6] M. Deisenroth and S. Mohamed, “Expectation Propagation in Gaussian process dynamical systems,” in
Advances in Neural Information Processing Systems 25, P. Bartlett, F. Pereira, C. Burges, L. Bottou, and
K. Weinberger, Eds., 2012, pp. 2618–2626.
[7] Z. Ghahramani and S. Roweis, “Learning nonlinear dynamical systems using an EM algorithm,” in Ad-
vances in Neural Information Processing Systems 11, M. J. Kearns, S. A. Solla, and D. A. Cohn, Eds.
MIT Press, 1999.
[8] J. Wang, D. Fleet, and A. Hertzmann, “Gaussian process dynamical models,” in Advances in Neural
Information Processing Systems 18, Y. Weiss, B. Schölkopf, and J. Platt, Eds. Cambridge, MA: MIT
Press, 2006, pp. 1441–1448.
[9] J. S. Liu, Monte Carlo Strategies in Scientific Computing. Springer, 2001.
[10] A. Doucet and A. Johansen, “A tutorial on particle filtering and smoothing: Fifteen years later,” in The
Oxford Handbook of Nonlinear Filtering, D. Crisan and B. Rozovsky, Eds. Oxford University Press,
2011.
[11] F. Gustafsson, “Particle filter theory and practice with positioning applications,” IEEE Aerospace and
Electronic Systems Magazine, vol. 25, no. 7, pp. 53–82, 2010.
[12] M. K. Pitt and N. Shephard, “Filtering via simulation: Auxiliary particle filters,” Journal of the American
Statistical Association, vol. 94, no. 446, pp. 590–599, 1999.
[13] F. Lindsten and T. B. Schön, “Backward simulation methods for Monte Carlo statistical inference,” Foun-
dations and Trends in Machine Learning, vol. 6, no. 1, pp. 1–143, 2013.
[14] F. Lindsten and T. B. Schön, “On the use of backward simulation in the particle Gibbs sampler,” in
Proceedings of the 2012 IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP), Kyoto, Japan, Mar. 2012.
[15] D. K. Agarwal and A. E. Gelfand, “Slice sampling for simulation based fitting of spatial data models,”
Statistics and Computing, vol. 15, no. 1, pp. 61–69, 2005.
[16] E. Snelson and Z. Ghahramani, “Sparse Gaussian processes using pseudo-inputs,” in Advances in Neural
Information Processing Systems (NIPS), Y. Weiss, B. Schölkopf, and J. Platt, Eds., Cambridge, MA, 2006,
pp. 1257–1264.
[17] J. Quiñonero-Candela and C. E. Rasmussen, “A unifying view of sparse approximate Gaussian process
regression,” Journal of Machine Learning Research, vol. 6, pp. 1939–1959, 2005.
[18] M. Seeger, C. Williams, and N. Lawrence, “Fast Forward Selection to Speed Up Sparse Gaussian Process
Regression,” in Artificial Intelligence and Statistics 9, 2003.
[19] Y. Chen, M. Welling, and A. Smola, “Super-samples from kernel herding,” in Proceedings of the 26th
Conference on Uncertainty in Artificial Intelligence (UAI 2010), P. Grünwald and P. Spirtes, Eds. AUAI
Press, 2010.
[20] M. Deisenroth, “Efficient reinforcement learning using Gaussian processes,” Ph.D. dissertation, Karl-
sruher Institut für Technologie, 2010.
