Bayesian Tensor Factorisations for Time Series of Counts
Zhongzhen Wang1 , Petros Dellaportas∗1,2, and Ioannis Kosmidis3
1 University College London, London, UK
2 Athens University of Economics and Business, Athens, Greece
Abstract
We propose a flexible nonparametric Bayesian modelling framework for multi-
variate time series of count data based on tensor factorisations. Our models can be
viewed as infinite state space Markov chains of known maximal order with non-linear
serial dependence through the introduction of appropriate latent variables. Alterna-
tively, our models can be viewed as Bayesian hierarchical models with conditionally
independent Poisson distributed observations. Inference about the important lags
and their complex interactions is achieved via MCMC. When the observed counts
are large, we deal with the resulting computational complexity of Bayesian infer-
ence via a two-step inferential strategy based on an initial analysis of a training
set of the data. Our methodology is illustrated using simulation experiments and
analysis of real-world data.
Keywords: Dirichlet process, MCMC, Poisson distribution, Tensor factorisation
1 Introduction
We consider a time-indexed sequence of multivariate random variables of length T, {y_t}_{t=1}^{T}, taking values in {0, 1, . . .}. We build a non-parametric model by (i) assuming that the transition probability law of the sequence {y_t}, conditional on the filtration up to time t − 1, F_{t−1}, is that of a Markov chain of maximal order q, (ii) allowing non-linear dependence on the values at the previous q time points and (iii) incorporating complex interactions between lags.
We propose a Bayesian model for multivariate time series of counts based on tensor factori-
sations. Our development is inspired by Yang and Dunson (2016) and Sarkar and Dunson
(2016). Yang and Dunson (2016) introduced conditional tensor factorisation models that
lead to parsimonious representations of transition probability vectors together with a sim-
ple, powerful Bayesian hierarchical formulation based on latent allocation variables. This
framework has been exploited in Sarkar and Dunson (2016) to build a nonparametric
∗Corresponding author. Email: [email protected]
Bayesian model for categorical data together with an efficient MCMC inferential frame-
work. We adopt the ideas and methods of these papers to build flexible models for time
series of counts. The major difference that distinguishes our work from Sarkar and Dunson (2016) is that, unlike categorical data, we deal with time series that are infinite, rather than finite, state space Markov chains. The computational complexity of our proposed model grows as the observed counts become larger, so we propose a two-step inferential strategy in which an initial, training part of the time series is used to facilitate inference and prediction for the rest of the data.
A common way to analyse univariate time series of counts is by assuming that the condi-
tional probability distribution of yt | yt−1 , . . . , yt−q can be expressed as a Poisson density
with rate λt that depends either on previous counts yt−1 , . . . , yt−q or previous intensities
λt−1 , . . . , λt−q . For example, one such popular model is the Poisson autoregressive model
(without covariates) of order q, PAR(q):
\[
y_t \sim \text{Poisson}(\lambda_t), \qquad \log(\lambda_t) = \beta_0 + \sum_{i=1}^{q} \beta_i \log(y_{t-i} + 1), \tag{1}
\]
where β0, β1, . . . , βq are unknown parameters; see Cameron and Trivedi (2001). Grunwald et al.
(1995), Grunwald et al. (1997) and Fokianos (2011) discuss the modelling and properties
of a PAR(1) process. Brandt and Williams (2001) generalise PAR(1) to a PAR(q) pro-
cess and apply it to the modelling of presidential vetoes in the United States. Kuhn et al.
(1994) adopt such processes to model the counts of child injury in Washington Heights.
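As a concrete illustration of (1), the following sketch simulates a PAR(q) path; the coefficient values reproduce Scenario (A) of Table 1, while the burn-in length is an arbitrary choice for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_par(T, beta0, betas, burn_in=100):
    """Simulate a PAR(q) path following (1): log(lambda_t) is linear in
    log(y_{t-i} + 1) for lags i = 1, ..., q."""
    q = len(betas)
    y = np.zeros(T + burn_in, dtype=np.int64)
    for t in range(q, T + burn_in):
        log_lam = beta0 + sum(b * np.log(y[t - i] + 1)
                              for i, b in enumerate(betas, start=1))
        y[t] = rng.poisson(np.exp(log_lam))
    return y[burn_in:]

# Scenario (A) of Table 1: beta0 = 1, beta1 = 0.5
y = simulate_par(T=5000, beta0=1.0, betas=[0.5])
```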
When we deal with M distinct time series of counts, the PAR(q) model is written, for m = 1, . . . , M, as
\[
y_{m,t} \sim \text{Poisson}(\lambda_{m,t}), \qquad \log(\lambda_{m,t}) = \beta_0 + \sum_{m=1}^{M} \sum_{i=1}^{q} \beta_{i,m} \log(y_{m,t-i} + 1); \tag{2}
\]
see, for example, Liboschik et al. (2015). In the above equation, q is fixed for each
m = 1, . . . , M. We will use this model formulation as a benchmark for comparison against
our proposed methodology. Other approaches to modelling time series of counts include the integer-valued generalised autoregressive conditional heteroscedastic models (Heinen, 2003; Weiß, 2014) and the integer-valued autoregressive processes (Al-Osh and Alzaid, 1987). We have not dealt with these models here because a proper Bayesian evaluation of their predictive performance requires a challenging Bayesian inference task that is beyond the scope of our work.
The rest of the paper is organised as follows. We specify our model in Section 2, followed
by estimation and inference details in Section 3. Simulation experiments and applications
are provided in Section 4 and 5, respectively.
2 Model Specification
2.1 The Bayesian tensor factorisation model
2.1.1 Univariate time series
We build a probabilistic model by assuming that the transition probability law of yt conditional on Ft−1 is that of a Markov chain of maximal order q:
\[
p(y_t \mid \mathcal{F}_{t-1}) = p(y_t \mid y_{t-1}, \ldots, y_{t-q}) \tag{3}
\]
for t ∈ [q + 1, T ], where the set containing all integers from i to j is denoted as [i, j]. This
formulation includes the possibility that only a subset of the previous q values affects
yt . We follow Sarkar and Dunson (2016) and introduce a series of latent variables as
follows. First, let kj denote the maximal number of clusters that the values of yt−j can
be separated into for predicting yt . To demonstrate the use of kj we present a simple
example. Assume that yt depends only on yt−1 and the relationship in which the observed
values of yt−1 affect the density of yt is based on the following stochastic rule: if yt−1 > 1
then yt ∼ Poisson(1) and if yt−1 ≤ 1 then yt ∼ Poisson(2). Then k1 = 2 since the values
of yt−1 are separated into two clusters that determine the distribution of yt . Note that
if kj = 1 the value of yt−j does not affect the density of yt . The collection of all these
latent variables K := {kj }j∈[1,q] determines how past values of the time series affect the
distribution of yt .
We also define a collection of time-dependent latent allocation random variables Zt :=
{zj,t }j∈[1,q] where zj,t specifies which of the kj clusters of yt−j affects yt . We will write
Zt = H, meaning that all latent variables in Zt are equal to the elements of another collection of latent variables H := {hj}j∈[1,q] that does not depend on t. Finally, denote by H := {hj ∈ [1, kj], j ∈ [1, q]} the set of all such collections, which depends on K. The connection among Zt, H and H is that, for any t ∈ [q + 1, T], Zt is sampled with a value H ∈ H.
We are now in a position to define our model. Let λZt be the Poisson rate for generating
yt given the random variable Zt . The conditional transition probability law (3) can be
written as a Bayesian hierarchical model, for j ∈ [1, q], H ∈ H and t ∈ [q + 1, T ], as
\[
y_t \mid Z_t = H \sim \text{Poisson}(\lambda_H), \tag{4}
\]
\[
z_{j,t} \mid y_{t-j} \sim \text{Multinomial}\bigl([1, k_j],\, \pi_1^{(j)}(y_{t-j}), \ldots, \pi_{k_j}^{(j)}(y_{t-j})\bigr), \tag{5}
\]
with constraints λH ≥ 0 for any H ∈ H and \(\sum_{h_j=1}^{k_j} \pi_{h_j}^{(j)}(y_{t-j}) = 1\) for each combination of (j, yt−j). Multinomial([1, k], π) is a multinomial distribution selecting a value from [1, k] with probability vector π. Marginalising over Zt yields
\[
p(y_t \mid y_{t-1}, \ldots, y_{t-q}) = \sum_{H \in \mathcal{H}} \text{PD}(y_t; \lambda_H) \prod_{j \in [1,q]} \pi_{h_j}^{(j)}(y_{t-j}). \tag{6}
\]
The formulation (6) is referred to as a conditional tensor factorisation with the Poisson density PD(yt; λH) being the core tensor; see Harshman (1970), Harshman and Lundy (1994), Tucker (1966) and De Lathauwer et al. (2000) for a description of tensor factorisations. It can also be interpreted as a Poisson mixture model with \(\prod_{j \in [1,q]} \pi_{h_j}^{(j)}(y_{t-j})\) being the mixture weights that depend on previous values of yt.
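To make the mixture representation (6) concrete, the sketch below evaluates the transition probability for a toy configuration with q = 2; the cluster numbers k_j, the probabilities π^(j) and the rates λ_H are all made-up values for illustration.

```python
import itertools
import numpy as np
from scipy.stats import poisson

q, k = 2, [2, 2]  # maximal order and number of clusters per lag
# One Poisson rate lambda_H per combination H = (h_1, h_2)
lam = dict(zip(itertools.product(range(k[0]), range(k[1])),
               [1.0, 2.0, 3.0, 4.0]))

def pi_j(j, y_lag):
    """Illustrative cluster probabilities pi^(j)(y_{t-j}): lag values
    above 1 favour the second cluster."""
    p = 0.9 if y_lag > 1 else 0.1
    return np.array([1.0 - p, p])

def transition_prob(y_t, y_lags):
    """p(y_t | y_{t-1}, ..., y_{t-q}) as the mixture over H in (6)."""
    prob = 0.0
    for H in itertools.product(*(range(kj) for kj in k)):
        weight = np.prod([pi_j(j, y_lags[j])[H[j]] for j in range(q)])
        prob += weight * poisson.pmf(y_t, lam[H])
    return prob

print(transition_prob(3, y_lags=[0, 5]))  # y_{t-1} = 0, y_{t-2} = 5
```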
A more parsimonious representation for our tensor factorisation model is obtained by adopting a Dirichlet process for the Poisson rates λH. Independently, for each H ∈ H, we use the stick-breaking construction introduced by Sethuraman (1994), in which
\[
\lambda_H \sim \sum_{l=1}^{\infty} \pi_l^* \, \delta(\lambda_l^*), \tag{7}
\]
where δ(·) is a Dirac delta function and, independently for l ∈ [1, ∞),
\[
\pi_l^* = V_l \prod_{s=1}^{l-1} (1 - V_s), \qquad V_l \sim \text{Beta}(1, \alpha_0), \qquad \lambda_l^* \sim \text{Gamma}(a, b),
\]
where λ*l represents a label-clustered Poisson rate. By letting Z*_{Z_t} denote the label of the cluster that Zt belongs to at time t ∈ [q + 1, T ], we complete the model formulation as
\[
\begin{aligned}
p(Z_H^* = l) &= \pi_l^*, \quad \text{independently for each } H \in \mathcal{H}, \\
(\lambda_H \mid Z_H^* = l) &= \lambda_l^*, \\
(Z^*_{Z_t} \mid Z_t = H) &= Z_H^*, \\
(y_t \mid Z^*_{Z_t} = l) &\sim \text{Poisson}(\lambda_l^*).
\end{aligned} \tag{8}
\]
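A draw from a truncated version of the stick-breaking construction in (7)-(8) can be sketched as follows; the truncation level L and the hyperparameter values are placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)

def stick_breaking(L, alpha0, a, b):
    """Draw L stick-breaking weights pi*_l and Gamma atoms lambda*_l."""
    V = rng.beta(1.0, alpha0, size=L)
    pi = V * np.concatenate(([1.0], np.cumprod(1.0 - V)[:-1]))
    lam = rng.gamma(a, 1.0 / b, size=L)  # numpy parametrises Gamma by scale
    return pi, lam

pi_star, lam_star = stick_breaking(L=20, alpha0=1.0, a=5.0, b=1.0)
# Each cell H is then assigned an atom label with probability pi*_l,
# so that lambda_H = lam_star[Z_H], as in (8):
Z_H = rng.choice(20, p=pi_star / pi_star.sum())  # renormalise after truncation
lam_H = lam_star[Z_H]
```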
2.1.2 Multivariate time series

The idea is that each univariate time series may depend on all or some of the q previous values of all, or some, of the univariate time series. Model (9) assumes that, conditional on the past q values of all time series before time t, the M univariate random variables at time t are independent. The formulation requires M different latent variables for each dimension but, other than that, its specific details have no essential difference from those in the univariate case.
2.1.3 A two-step strategy based on a pre-training dataset

We first define a collection of latent variables {w1:c−1, µ1:c, c} that models the pre-training data {yt}t∈[1,T1] as
\[
p(y_t \mid w_{1:c-1}, \mu_{1:c}, c) = \sum_{i=1}^{c} w_i \, \text{PD}(y_t; \mu_i) \tag{10}
\]
for any t ∈ [1, T1], with 0 < wi < 1, \(\sum_{i=1}^{c} w_i = 1\) and µi ≥ 0. Thus, (10) assumes that any yt in
the pre-training dataset is distributed as a finite mixture of Poisson distributions with c
components, weights wi and intensities µi . The usual latent structure for such mixture
models assumes indicator variables dt representing the estimated label of the mixture
component that yt belongs to, so p(dt = i) = wi for all i ∈ [1, c].
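The pre-training step only needs the fitted mixture and the implied labels d_t. The paper fits (10) by MCMC; purely for illustration, a minimal EM alternative that produces the same kind of labels is sketched below.

```python
import numpy as np
from scipy.stats import poisson

def fit_poisson_mixture(y, c, n_iter=200, seed=0):
    """EM for a c-component Poisson mixture; returns weights, rates and
    hard labels d_t = argmax_i p(d_t = i | y_t)."""
    rng = np.random.default_rng(seed)
    w = np.full(c, 1.0 / c)
    mu = rng.uniform(y.min() + 0.1, y.max() + 1.0, size=c)
    for _ in range(n_iter):
        # E-step: responsibilities r[t, i] proportional to w_i * PD(y_t; mu_i)
        r = w * poisson.pmf(y[:, None], mu[None, :])
        r /= r.sum(axis=1, keepdims=True)
        # M-step: update weights and rates
        w = r.mean(axis=0)
        mu = (r * y[:, None]).sum(axis=0) / r.sum(axis=0)
    return w, mu, r.argmax(axis=1)

y_pre = np.random.default_rng(2).poisson(lam=[2.0, 10.0], size=(500, 2)).ravel()
w, mu, d = fit_poisson_mixture(y_pre, c=2)
```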
We exploit this finite mixture clustering of the pre-training dataset to build our model for the training dataset. We define another collection of latent variables Dt = {dj,t}j∈[1,q] by setting dj,t = dt−j for all j ∈ [1, q] and t ∈ [T1 + 1 + q, T1 + T2]. We then build a probabilistic model for the training dataset by assuming that the transition probability law of the sequence {yt}t∈[T1+1+q,T1+T2] conditional on Ft−1 is that of a probabilistic model of this target sequence conditional on Dt; that is,
\[
p(y_t \mid \mathcal{F}_{t-1}) = p(y_t \mid D_t). \tag{11}
\]
The conditional transition probability law (11) can then be written as a Bayesian hierarchical model, for j ∈ [1, q] and t ∈ [T1 + 1 + q, T1 + T2], as
\[
y_t \mid Z_t = H \sim \text{Poisson}(\lambda_H), \tag{12}
\]
\[
z_{j,t} \mid d_{j,t} \sim \text{Multinomial}\bigl([1, k_j],\, \pi_1^{(j)}(d_{j,t}), \ldots, \pi_{k_j}^{(j)}(d_{j,t})\bigr), \tag{13}
\]
with constraints λH ≥ 0 for any H ∈ H and \(\sum_{h_j=1}^{k_j} \pi_{h_j}^{(j)}(d_{j,t}) = 1\) for each combination of (j, dj,t). Marginalising over Zt gives
\[
p(y_t \mid D_t) = \sum_{H \in \mathcal{H}} \text{PD}(y_t; \lambda_H) \prod_{j \in [1,q]} \pi_{h_j}^{(j)}(d_{j,t}), \tag{14}
\]
and it is clear that (14) is equivalent to (12) and (13). From (14), the expectation of yt conditional on Dt is
\[
\mathrm{E}(y_t \mid D_t) = \sum_{H \in \mathcal{H}} \lambda_H \prod_{j \in [1,q]} \pi_{h_j}^{(j)}(d_{j,t}). \tag{15}
\]
The rest of the model which utilises the stick-breaking process for λH is similar to the
one used in Section 2.1.1.
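Given the pre-training labels d_t, the lag-label vectors D_t = (d_{1,t}, . . . , d_{q,t}) that enter (12)-(13) are just shifted copies of d, since d_{j,t} = d_{t−j}; a small sketch:

```python
import numpy as np

def build_lag_labels(d, q):
    """Return, for each usable t, the vector D_t = (d_{t-1}, ..., d_{t-q}),
    i.e. d_{j,t} = d_{t-j} as in the text."""
    t_idx = np.arange(q, len(d))
    D = np.column_stack([d[t_idx - j] for j in range(1, q + 1)])
    return t_idx, D

d = np.array([0, 1, 1, 0, 2, 2, 1])   # toy label sequence
t_idx, D = build_lag_labels(d, q=2)
# First row corresponds to t = 2 and equals (d[1], d[0])
```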
2.1.4 Priors
We assign independent priors on π(j)(dj,t) as
\[
\pi^{(j)}(d_{j,t}) = \{\pi_1^{(j)}(d_{j,t}), \ldots, \pi_{k_j}^{(j)}(d_{j,t})\} \sim \text{Dirichlet}(\gamma_j, \ldots, \gamma_j),
\]
with γj = 0.1. Also, we follow Sarkar and Dunson (2016) and set priors
\[
p(k_j = \kappa) \propto \exp(-\varphi j \kappa),
\]
where j ∈ [1, q] and κ ∈ [1, c]. Notice that ϕ controls p(kj = κ) and hence the number of important lags in the proposed conditional tensor factorisation; for all our experiments throughout this paper, we set ϕ = 0.5. Following Viallefont et al. (2002), we set the shape parameter a of the Gamma prior on λ*l to the mid-range of yt in the training dataset,
\[
a = \tfrac{1}{2}\bigl[\max(\{y_t\}_{t \in [T_1+1+q,\, T_1+T_2]}) - \min(\{y_t\}_{t \in [T_1+1+q,\, T_1+T_2]})\bigr],
\]
and b = 1. We set α0 = 1 for the Beta prior on Vl. Finally, we truncate the series (7) by assuming
\[
\lambda_H \sim \sum_{l=1}^{L} \pi_l^* \, \delta(\lambda_l^*).
\]
3 Estimation and Inference

Each cluster Cj,r is assumed to correspond to its own latent class hj = r. With the Gamma(a, b) priors on the mixture kernels λH marginalised out, the likelihood of our targeted response {yt}t∈[T1+1+q,T1+T2] conditional on the cluster configuration C = {Cj,r : j ∈ [1, q], r ∈ [1, kj]} is given by
\[
\begin{aligned}
p(\{y_t\} \mid C) &= \prod_{H \in \mathcal{H}} \int_0^\infty f(\{y_t\} \mid \lambda_H)\, p(\lambda_H \mid C)\, \mathrm{d}\lambda_H \\
&= \prod_{H \in \mathcal{H}} \int_0^\infty \Bigl[ \prod_{t} (y_t \xi)! \Bigr]^{-1} \exp\Bigl( -\Bigl(\sum_{t} \xi\Bigr) \lambda_H \Bigr)\, \lambda_H^{\sum_{t} y_t \xi} \cdot \frac{1}{(1/b)^a \Gamma(a)}\, \lambda_H^{a-1} \exp(-\lambda_H b)\, \mathrm{d}\lambda_H \\
&= \prod_{H \in \mathcal{H}} \frac{1}{(1/b)^a \Gamma(a)} \Bigl[ \prod_{t} (y_t \xi)! \Bigr]^{-1} \Gamma\Bigl( a + \sum_{t} y_t \xi \Bigr) \Bigl( \sum_{t} \xi + b \Bigr)^{-\left(a + \sum_{t} y_t \xi\right)},
\end{aligned}
\]
where all sums and products over t range over [T1 + 1 + q, T1 + T2] and ξ = 1{d1,t ∈ C1,h1, . . . , dq,t ∈ Cq,hq}. The MCMC steps for j ∈ [1, q] are then: (i) if 1 ≤ kj ≤ c, propose to either increase kj to kj + 1 or decrease kj to kj − 1; (ii) if an increase move is proposed, randomly split a cluster of the dj,t into two clusters and accept the move with an acceptance probability based on the approximated marginal likelihood; (iii) if a decrease move is proposed, randomly merge two clusters of the dj,t into a single cluster and accept the move, again with an acceptance probability based on the approximated marginal likelihood. If K* and C* are the proposed model index and cluster configuration, α(·; ·) is the Metropolis–Hastings acceptance probability, L(·) is the likelihood function and q(· → ·) is the proposal function, we obtain
\[
\alpha(K, C; K^*, C^*) = \frac{L(\{y_t\}_{t \in [T_1+1+q,\, T_1+T_2]}, K^*, C^*)\; q(K^*, C^* \to K, C)}{L(\{y_t\}_{t \in [T_1+1+q,\, T_1+T_2]}, K, C)\; q(K, C \to K^*, C^*)}.
\]
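The collapsed likelihood above is a product of Poisson-Gamma marginals, one per cell H. In log form the per-cell factor can be evaluated stably as below, where `counts` stands for the y_t currently assigned to a given cell; this is a sketch of the quantity entering the split/merge acceptance ratio, not the authors' implementation.

```python
import numpy as np
from scipy.special import gammaln

def log_marginal_poisson_gamma(counts, a, b):
    """log of the integral over lambda of
    prod_t Poisson(y_t; lambda) x Gamma(lambda; shape a, rate b)."""
    counts = np.asarray(counts)
    n, s = len(counts), counts.sum()
    return (a * np.log(b) - gammaln(a)       # prior normalising constant
            + gammaln(a + s)                 # numerator of the Gamma integral
            - (a + s) * np.log(n + b)
            - gammaln(counts + 1).sum())     # sum of log(y_t!)

# Split/merge moves compare sums of such terms under the current and
# proposed clusterings:
print(log_marginal_poisson_gamma([3, 5, 2], a=4.0, b=1.0))
```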
• Sample each λ*l, l ∈ [1, L], from
\[
\lambda_l^* \mid \zeta \sim \text{Gamma}\bigl(a + N^*(l),\; b + N(l)\bigr),
\]
where \(N^*(l) = \sum_{H \in \mathcal{H}} 1\{Z_H^* = l\}\, n_H^*\) and \(N(l) = \sum_{H \in \mathcal{H}} 1\{Z_H^* = l\}\, n_H\).

• For j ∈ [1, q] and ω ∈ [1, c], sample
\[
\bigl\{\pi_1^{(j)}(\omega), \ldots, \pi_{k_j}^{(j)}(\omega)\bigr\} \mid \zeta \sim \text{Dirichlet}\bigl(\gamma_j + n_{j,\omega}(1), \ldots, \gamma_j + n_{j,\omega}(k_j)\bigr),
\]
where \(n_{j,\omega}(h_j) = \sum_{t \in [T_1+1+q,\, T_1+T_2]} 1\{z_{j,t} = h_j,\; d_{j,t} = \omega\}\).
where H…/j=h denotes the collection that is equal to H at all positions except the j-th, which takes the value h.
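The conjugate update for the atoms λ*l above is a vector of independent Gamma draws; a sketch, where `n_y[l]` and `n_obs[l]` play the roles of N*(l) and N(l):

```python
import numpy as np

rng = np.random.default_rng(3)

def gibbs_update_atoms(n_y, n_obs, a, b):
    """Sample lambda*_l | rest ~ Gamma(a + N*(l), b + N(l)) for all atoms."""
    n_y, n_obs = np.asarray(n_y), np.asarray(n_obs)
    return rng.gamma(a + n_y, 1.0 / (b + n_obs))  # numpy uses a scale parameter

lam_star = gibbs_update_atoms(n_y=[120, 7, 0], n_obs=[30, 5, 0], a=4.0, b=1.0)
```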
4 Simulation Experiments
We tested our methodology with simulated data from designed experiments against the Poisson autoregressive model (1), through the log predictive score calculated on an out-of-sample (test) dataset T of size T̃. For each model the log predictive score is estimated by
\[
-\frac{1}{\tilde{T} N} \sum_{t \in \mathcal{T}} \sum_{i=1}^{N} \log \hat{p}^{(i)}(y_t),
\]
where p̂(i)(yt) denotes the one-step-ahead estimated transition probability of observing yt, t ∈ T, calculated using the parameter values at the i-th iteration of the MCMC, with N iterations in total. The log predictive score measures the predictive accuracy of the model by assessing the quality of its uncertainty quantification; a model predicts better when the log predictive score is smaller; see, for example, Czado et al. (2009).
For each designed scenario, we generated 10 datasets with 5,000 data points, and out-of-sample predictive performance for all models was tested by using either the first 4,000 or 4,500 data points as training datasets and calculating the log predictive scores, approximated via the MCMC output, at the remaining 1,000 or 500 test data points respectively. The mean log predictive score reported in Tables 1-3 is the average log predictive score across the 10 generated datasets. The pre-training dataset for the BTF model was chosen to be the first 3,000 points.
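Given a matrix of estimated one-step-ahead transition probabilities p̂(i)(yt) from the MCMC output, the log predictive score defined at the start of this section is a simple average; a sketch with placeholder inputs:

```python
import numpy as np

def log_predictive_score(p_hat):
    """p_hat[i, t] = estimated transition probability of the observed y_t at
    MCMC iteration i; returns -(1 / (T_tilde * N)) * sum of log p_hat."""
    return -np.mean(np.log(p_hat))

# Placeholder predictive probabilities: N = 5000 iterations, 1000 test points
p_hat = np.random.default_rng(4).uniform(0.05, 0.5, size=(5000, 1000))
print(log_predictive_score(p_hat))
```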
All MCMC runs were based on the following numbers of burn-in and posterior samples respectively: 2,000 and 5,000 for fitting the Poisson mixtures on the pre-training dataset; 1,000 and 2,000 for selecting the important lags and their corresponding numbers of inclusions; and 2,000 and 5,000 for sampling the rest of the parameters. Bayesian inference for the Poisson autoregressive model was obtained with the 'rjags' package (Plummer et al., 2016), based on 5,000 burn-in and 10,000 posterior samples respectively. The order q of the Poisson autoregressive model was chosen among all models with maximum order up to q + 2, using the AIC and BIC criteria. We set the priors β0 ∼ N(0, 10^{-6}) and βi ∼ N(0, 10^{-4}) for any i ∈ [1, q].
Table 1 presents the results of out-of-sample comparative predictive ability for six data-generating Poisson autoregressive models based on (1). Notice that when the order q is high and there are only a few non-zero coefficients, as in cases C, E and F, the maximal-order Markov structure of the BTF model achieves comparable, satisfactory predictive performance. Given that the data-generating process is based on Poisson autoregressive models, these results are very promising.
Next, we generated data in which past values affect current random variables in a non-linear fashion, as follows. There are K important lags {y_{t−i_1}, . . . , y_{t−i_K}} and, for given ν+ and ν−, if \(\sum_{j=1}^{K} y_{t-i_j} \ge K \nu_+\) then yt ∼ Poisson(ν+); otherwise yt ∼ Poisson(ν−). We designed 6 scenarios and the results are shown in Table 2.
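A sketch of this threshold mechanism (lag positions, rates and initialisation are placeholder choices):

```python
import numpy as np

rng = np.random.default_rng(5)

def simulate_threshold(T, lags, nu_plus, nu_minus, burn_in=100):
    """If the selected lagged counts sum to at least K * nu_plus, draw
    y_t from Poisson(nu_plus); otherwise from Poisson(nu_minus)."""
    K, q = len(lags), max(lags)
    y = rng.poisson(nu_minus, size=T + burn_in)  # placeholder initialisation
    for t in range(q, T + burn_in):
        high = sum(y[t - i] for i in lags) >= K * nu_plus
        y[t] = rng.poisson(nu_plus if high else nu_minus)
    return y[burn_in:]

y = simulate_threshold(T=5000, lags=[1, 7], nu_plus=10, nu_minus=5)
```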
Scenario                                       | Data sizes | AIC            | BIC            | BTF
(A): β0 = 1, β1 = 0.5                          | 4000:1000  | 2.436 (0.024)* | 2.436 (0.024)* | 2.443 (0.022)
                                               | 4500:500   | 2.441 (0.031)* | 2.441 (0.031)* | 2.450 (0.030)
(B): β0 = 1, β7 = 0.5                          | 4000:1000  | 2.450 (0.019)  | 2.449 (0.019)* | 2.458 (0.022)
                                               | 4500:500   | 2.454 (0.028)  | 2.452 (0.031)* | 2.463 (0.030)
(C): β0 = 1, β29 = 0.7                         | 4000:1000  | 3.126 (0.018)  | 3.126 (0.018)  | 3.108 (0.014)*
                                               | 4500:500   | 3.123 (0.024)  | 3.123 (0.024)  | 3.106 (0.021)*
(D): β0 = 1, β1 = −0.5, β7 = 0.5               | 4000:1000  | 1.870 (0.016)* | 1.870 (0.016)* | 1.882 (0.024)
                                               | 4500:500   | 1.876 (0.020)* | 1.876 (0.020)* | 1.885 (0.017)
(E): β0 = 1, β19 = −0.5, β29 = 0.5             | 4000:1000  | 1.873 (0.015)  | 1.873 (0.015)  | 1.857 (0.017)*
                                               | 4500:500   | 1.869 (0.018)  | 1.869 (0.018)  | 1.852 (0.020)*
(F): β0 = 1, β1 = −0.5, β7 = −0.5, β19 = 0.5   | 4000:1000  | 1.683 (0.013)  | 1.683 (0.013)  | 1.631 (0.009)*
                                               | 4500:500   | 1.689 (0.017)  | 1.689 (0.017)  | 1.635 (0.012)*

Table 1: Mean log predictive scores (with standard deviations in brackets) for Bayesian Poisson autoregressive models and our Bayesian tensor factorisation model (BTF), based on 10 Poisson autoregression generated data sets for each of 6 scenarios. The AIC and BIC columns report the Bayesian Poisson autoregressive model chosen with the corresponding criterion. The model with the best performance in each row is marked with an asterisk.
Our proposed modelling formulation outperforms the Bayesian Poisson autoregressive model in all but one scenario.
Finally, we replicated the last exercise by testing the models on a more challenging data-generating mechanism in which the response is multivariate. We designed 6 different scenarios by generating an M-dimensional time series {ym,t}m∈[1,M] and assuming that we are interested in predicting y1,t. For t ≤ 10, we generated ym,t from Poisson(ν−) for each m; for t > 10, if \(\sum_{i=1}^{K} y_{m_i, t-j_i} \ge \nu_-\) we generated y1,t ∼ Poisson(ν+), and otherwise y1,t ∼ Poisson(ν−). We fitted an M-dimensional multivariate Poisson autoregressive model of order q that predicts yℓ,t with covariates {ym,t−1}m∈[1,M], m≠ℓ as
\[
y_{\ell,t} \sim \text{Poisson}(\lambda_{\ell,t}), \qquad \log(\lambda_{\ell,t}) = \beta_{\ell,0} + \sum_{i=1}^{q} \beta_{\ell,i} \log(y_{\ell,t-i} + 1) + \sum_{m \ne \ell} \zeta_{\ell,m} y_{m,t-1}, \tag{16}
\]
where βℓ,0, βℓ,i and ζℓ,m are unknown parameters. Table 3 shows that, for all 6 scenarios, the Bayesian tensor factorisation model achieves substantially better predictive performance than the Bayesian Poisson autoregressive model.
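A sketch of the multivariate data-generating mechanism described above; the dimension, lag pairs and rates below match Scenario (A) of Table 3, while everything else is a placeholder choice.

```python
import numpy as np

rng = np.random.default_rng(6)

def simulate_mv_threshold(T, M, pairs, nu_plus, nu_minus):
    """pairs = [(m_i, j_i), ...]; if sum_i y[m_i, t - j_i] >= nu_minus then
    y[0, t] ~ Poisson(nu_plus), else Poisson(nu_minus). All other series
    (and all series for small t) stay at Poisson(nu_minus)."""
    q = max(j for _, j in pairs)
    y = rng.poisson(nu_minus, size=(M, T))
    for t in range(max(q, 10), T):
        s = sum(y[m, t - j] for m, j in pairs)
        y[0, t] = rng.poisson(nu_plus if s >= nu_minus else nu_minus)
    return y

# Scenario (A) of Table 3: M = 2, nu_minus = 20, nu_plus = 10, lags (1, 1)
y = simulate_mv_threshold(T=5000, M=2, pairs=[(0, 1), (1, 1)],
                          nu_plus=10, nu_minus=20)
```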
Scenario and non-zero coefficients                 | Data sizes | AIC            | BIC            | BTF
(A): M = 2; ν− = 20, ν+ = 10;                      | 4000:1000  | 3.225 (0.029)  | 3.225 (0.029)  | 3.013 (0.037)*
non-zero coefficients for y1,t: y1,t−1, y2,t−1;    |            |                |                |
no non-zero coefficient for y2,t                   | 4500:500   | 3.243 (0.057)  | 3.243 (0.057)  | 3.110 (0.033)*

Table 3: Mean log predictive scores (with standard deviations in brackets) for Bayesian Poisson autoregressive models and our Bayesian tensor factorisation model (BTF), based on 10 nonlinear generated data sets for each of 6 scenarios. The AIC and BIC columns report the Bayesian Poisson autoregressive model chosen with the corresponding criterion. The model with the best performance in each row is marked with an asterisk.
The times needed to run the MCMC algorithms for the Bayesian Poisson autoregressive and BTF models are comparable: for 1,000 iterations we needed, on average, 20 seconds for the BTF model implemented in our MATLAB code and 25 seconds for the Bayesian Poisson autoregressive models implemented in rjags.
5 Applications
5.1 Univariate flu data
We compared our Bayesian tensor factorisation model with the Bayesian Poisson autoregressive model on two datasets from Google Flu Trends, each containing 514 weekly flu counts: one for Norway and one for Castilla–La Mancha in Spain; see Figure 1. We chose the maximum lag q to be 10 for all models applied to the data. We examined sensitivity to the size of the pre-training data by considering three scenarios, with pre-training sizes of 103 (20%), 154 (30%) and 206 (40%) points, and compared the resulting predictive ability against the best Bayesian Poisson autoregression formulations chosen with the AIC and BIC criteria. The last 103 and 52 data points of each dataset were used for out-of-sample test comparison. To demonstrate how our methodology works, we present MCMC results for the Norway dataset based on 154 pre-training points; results for both datasets and for all pre-training sizes are given at the end of the section.
Figure 1: Time series of 514 weekly data points counting flu cases in Norway, in Aargau and Bern in Switzerland, and in five Spanish regions (Andalusia, Castilla-La Mancha, Illes Balears, Region de Murcia and the Valencian Community), from 09-Oct-2005 to 09-Aug-2015.
The pre-training results are illustrated in Figure 2. There are barely significant differences among 6 of the 10 clusters in the left panel, so we fix the number of clusters to 5; see Figure 2. Figure 3 shows further MCMC results for the rest of our Bayesian tensor factorisation model. With 411 training data points, panels (a), (b) and (c) provide strong evidence that there are two important predictors, 6 possible combinations of (h1, . . . , hq) and 6 unique λh1,...,hq. Similarly, when the length of the training dataset is 462, panels (d), (e) and (f) indicate that there is evidence for only one important predictor, that the total number of possible combinations of (h1, . . . , hq) is either 3 or 4, and that there are 3 unique λh1,...,hq.
Model selection results for the Poisson autoregression models are illustrated in Figure 4. MCMC was based on 5,000 burn-in and 10,000 posterior samples using 'rjags' (Plummer et al., 2016). The resulting parameter estimates are given in Table 4.
Figure 2: Fitting of a mixture of Poisson distributions to the pre-training data of weekly flu cases in Norway from 09-Oct-2005 to 09-Aug-2015. Panels (a) and (b) correspond to totals of c = 10 and c = 5 clusters respectively. The top panels illustrate the Poisson rates of the corresponding cluster labels, whilst the bottom panels show the corresponding log-weights.
Table 4: Means of coefficients (with standard deviations in brackets) based on 5,000 burn-in and 10,000 MCMC samples. Two scenarios with different sizes of training against testing data are shown in the columns, with their corresponding selected models indicated.
Figure 3: MCMC relative-frequency results. In all panels the x-axis represents the number and the y-axis the relative frequency. Top three panels: 411 training data points; bottom three panels: 462 training data points. (a, d): relative frequency distributions of the number of important predictors. (b, e): relative frequency distributions of \(\prod_{j=1}^{q} k_j\), the total number of possible combinations of (h1, . . . , hq). (c, f): relative frequency distributions of the number of unique λh1,...,hq.
Figure 4: AIC and BIC scores of PAR(q) models, with q on the x-axis, for weekly flu cases in Norway from 09-Oct-2005 to 09-Aug-2015. (a): AIC scores for the scenario with 411 training and 103 testing data points; (b): BIC scores for the same scenario; (c): AIC scores for the scenario with 462 training and 52 testing data points; (d): BIC scores for the same scenario.
Table 5 indicates that in all pre-training size scenarios BTF outperforms, in terms of predictive ability expressed through log predictive scores, the Bayesian Poisson autoregression models. There is clearly a trade-off between good mixture estimation and adequate training size, expressed in small and large pre-training sizes respectively. In our small empirical study there is evidence of some robustness of the inference procedure when the pre-training size is small, since 103 points outperform 206 points, with 154 points being the best-performing pre-training size. The predictive means and 95% credible intervals of BTF and of the PAR(5) model, which had one of the best predictive performances based on the 103 test data points, are depicted in Figure 5.
Table 5: Log predictive scores for Bayesian Poisson autoregression models and our Bayesian tensor factorisation model (BTF) for the flu counts datasets from Norway and Castilla-La Mancha, Spain. The BTF model was fitted with 103, 154 and 206 pre-training data points (PTDPs). The AIC and BIC columns indicate that the best model has been chosen (shown in brackets) with the corresponding criterion. Training and testing data sizes appear in the second column. Models with the best performance are highlighted in bold.
The average run times of the MCMC algorithms for the BTF and the Bayesian Poisson autoregression models are comparable. For the former, 1,000 iterations take approximately 20 seconds with our code written in MATLAB, whereas the latter takes approximately 25 seconds for 1,000 iterations in the R package 'rjags'.
Figure 5: Out-of-sample predictive means and 95% highest credible regions (HCRs) of the Bayesian tensor factorisation (BTF) and Poisson autoregressive (PAR) models, compared against the Castilla–La Mancha data. The sizes of the training and testing data are 411 and 103 respectively.
Table 7 presents the five-dimensional example of the Spanish regions, in which the counts of each region are predicted from past counts of all five regions. Here, in eight out of ten cases BTF outperforms the Poisson autoregression model; in particular, the log predictive scores are dramatically lower in all cases with smaller training (462) and higher test (103) sizes. This is not surprising, since our model is capable of capturing the complicated five-dimensional dependencies among these Spanish regions.
Figure 6: Lag selection for the Norway (left pair) and Castilla-La Mancha (right pair) flu datasets. Each pair of panels shows the inclusion proportions (y-axis) of the different lags (x-axis), for the scenario with 411 training and 103 testing data points and for the scenario with 462 training and 52 testing data points.
Figure 7: Important lag selection for the Swiss (Aargau and Bern) flu dataset. The y-axis represents the inclusion proportions of the lags on the x-axis. (a): 411 training and 103 testing points; (b): 462 training and 52 testing points.
6 Discussion
We have introduced a new flexible modelling framework that extends Bayesian tensor factorisations to multivariate time series of count data. Extensive simulation studies and analyses of real data provide evidence that the flexibility of these models offers an important alternative to other multivariate time series models for counts.
An important aspect of our proposed models is that direct MCMC inference cannot avoid increased computational complexity as the observed counts grow. We have addressed this issue with a two-stage inferential procedure that successfully handles large observed counts.
Acknowledgements
We would like to thank Abhra Sarkar for kindly providing us with the code for Sarkar and Dunson (2016).
Figure 8: Important lag selection for the south-eastern Spain flu dataset. The y-axis represents the inclusion proportions of the lags on the x-axis, for the scenario with 411 training and 103 testing data points. A: Andalusia; CLM: Castilla-La Mancha; IB: Illes Balears; RM: Region de Murcia; VC: Valencian Community.
7 Declarations
• Funding: Not applicable
• Availability of data and material: The Google Flu Trends data are publicly available.
Figure 9: Important lag selection for the south-eastern Spain flu dataset. The y-axis represents the inclusion proportions of the lags on the x-axis, for the scenario with 462 training and 52 testing data points. A: Andalusia; CLM: Castilla-La Mancha; IB: Illes Balears; RM: Region de Murcia; VC: Valencian Community.
• Code availability: The code will be freely available from Petros Dellaportas' web site.
• Author contributions: The code was written by Zhongzhen Wang. Zhongzhen Wang, Petros Dellaportas and Ioannis Kosmidis contributed equally to the development of the theory.
• Licence: For the purpose of open access, the authors have applied a Creative Com-
mons Attribution (CC BY) licence to any Author Accepted Manuscript version
arising from this submission.
References
Al-Osh, M. and A. A. Alzaid (1987). First-order integer-valued autoregressive (INAR(1)) process. Journal of Time Series Analysis 8(3), 261–275.
Czado, C., T. Gneiting, and L. Held (2009). Predictive model assessment for count data.
Biometrics 65 (4), 1254–1261.
Fokianos, K. (2011). Some recent progress in count time series. Statistics 45 (1), 49–58.
Grunwald, G., R. Hyndman, and L. Tedesco (1995). A unified view of linear AR(1) models.
Grunwald, G. K., K. Hamza, and R. J. Hyndman (1997). Some properties and generalizations of non-negative Bayesian time series models. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 59(3), 615–626.
Harshman, R. (1970). Foundations of the PARAFAC procedure: Model and conditions for an explanatory factor analysis. Technical Report, UCLA Working Papers in Phonetics 16, University of California, Los Angeles, Los Angeles, CA.
Heinen, A. (2003). Modelling time series count data: An autoregressive conditional Poisson model. Available at SSRN 1117187.
Kuhn, L., L. L. Davidson, and M. S. Durkin (1994). Use of Poisson regression and time series analysis for detecting changes over time in rates of child injury following a prevention program. American Journal of Epidemiology 140(10), 943–955.
Liboschik, T., K. Fokianos, and R. Fried (2015). tscount: An R package for analysis of count time series following generalized linear models. Universitätsbibliothek Dortmund, Dortmund, Germany.
Marin, J.-M., K. Mengersen, and C. P. Robert (2005). Bayesian modelling and inference on mixtures of distributions. Handbook of Statistics 25, 459–507.
Plummer, M. et al. (2016). rjags: Bayesian graphical models using MCMC. R package version 4(6).
Sarkar, A. and D. B. Dunson (2016). Bayesian nonparametric modeling of higher order Markov chains. Journal of the American Statistical Association 111(516), 1791–1803.
Weiß, C. H. (2014). INGARCH and regression models for count time series. Wiley StatsRef: Statistics Reference Online, 1–6.
Yang, Y. and D. B. Dunson (2016). Bayesian conditional tensor factorizations for high-
dimensional classification. Journal of the American Statistical Association 111 (514),
656–669.