
Autoregressive Denoising Diffusion Models for Multivariate Probabilistic Time Series Forecasting

Kashif Rasul 1   Calvin Seward 1   Ingmar Schuster 1   Roland Vollgraf 1

arXiv:2101.12072v2 [cs.LG] 2 Feb 2021

Abstract

In this work, we propose TimeGrad, an autoregressive model for multivariate probabilistic time series forecasting which samples from the data distribution at each time step by estimating its gradient. To this end, we use diffusion probabilistic models, a class of latent variable models closely connected to score matching and energy-based methods. Our model learns gradients by optimizing a variational bound on the data likelihood and at inference time converts white noise into a sample of the distribution of interest through a Markov chain using Langevin sampling. We demonstrate experimentally that the proposed autoregressive denoising diffusion model is the new state-of-the-art multivariate probabilistic forecasting method on real-world data sets with thousands of correlated dimensions. We hope that this method is a useful tool for practitioners and lays the foundation for future research in this area.

1. Introduction

Classical time series forecasting methods such as those in (Hyndman & Athanasopoulos, 2018) typically provide univariate point forecasts, require hand-tuned features to model seasonality, and are trained individually on each time series. Deep learning based time series models (Benidis et al., 2020) are popular alternatives due to their end-to-end training of a global model, ease of incorporating exogenous covariates, and automatic feature extraction abilities. The task of modeling uncertainties is of vital importance for downstream problems that use these forecasts for (business) decision making. Moreover, the individual time series of a problem data set are often statistically dependent on each other. Ideally, deep learning models should incorporate this inductive bias in the form of multivariate (Tsay, 2014) probabilistic methods to provide accurate forecasts.

To model the full predictive distribution, methods typically resort to tractable distribution classes or some type of low-rank approximation, regardless of the true data distribution. To model the distribution in a general fashion, one needs probabilistic methods with tractable likelihoods. Several deep learning methods have been proposed for this purpose, such as autoregressive models (van den Oord et al., 2016c) or generative models based on normalizing flows (Papamakarios et al., 2019), which can learn flexible models of high-dimensional multivariate time series. Even if the full likelihood is not tractable, one can often optimize a tractable lower bound on it. Still, these methods require a certain structure in the functional approximators, for example constraints on the determinant of the Jacobian (Dinh et al., 2017) for normalizing flows. Energy-based models (EBMs) (Hinton, 2002; LeCun et al., 2006), on the other hand, have a much less restrictive functional form. They approximate the unnormalized log-probability, so that density estimation reduces to a non-linear regression problem. EBMs have been shown to perform well in learning high-dimensional data distributions, at the cost of being difficult to train (Song & Kingma, 2021).

In this work, we propose autoregressive EBMs to solve the multivariate probabilistic time series forecasting problem via a model we call TimeGrad, and show that not only can we train such a model with all the inductive biases of probabilistic time series forecasting, but that it performs exceptionally well compared to other modern methods. This autoregressive-EBM combination retains the power of autoregressive models, such as good performance when extrapolating into the future, together with the flexibility of EBMs as general-purpose high-dimensional distribution models, while remaining computationally tractable.

The paper is organized as follows. In Section 2 we first set up the notation and detail the EBM of (Ho et al., 2020) which forms the basis of our per-time-step distribution model. Section 3 introduces the multivariate probabilistic time series problem and details the TimeGrad model. Experiments with extensive results are presented in Section 4. We cover related work in Section 5 and conclude with a discussion in Section 6.

1 Zalando Research, Mühlenstraße 25, 10243 Berlin, Germany. Correspondence to: Kashif Rasul <[email protected]>.

2. Diffusion Probabilistic Model

Let x^0 ∼ q_X(x^0) denote the multivariate training vector from some input space X = R^D and let p_θ(x^0) denote the probability density function (PDF) which aims to approximate q_X(x^0) and allows for easy sampling. Diffusion models (Sohl-Dickstein et al., 2015) are latent variable models of the form

p_\theta(x^0) := \int p_\theta(x^{0:N}) \, dx^{1:N},

where x^1, ..., x^N are latents of dimension R^D. Unlike in variational autoencoders (Kingma & Welling, 2019), the approximate posterior q(x^{1:N} | x^0),

q(x^{1:N} \mid x^0) = \prod_{n=1}^{N} q(x^n \mid x^{n-1}),

is not trainable but fixed to a Markov chain (called the forward process) that gradually adds Gaussian noise to the signal:

q(x^n \mid x^{n-1}) := \mathcal{N}(x^n; \sqrt{1-\beta_n}\, x^{n-1}, \beta_n \mathbf{I}).

The forward process uses an increasing variance schedule β_1, ..., β_N with β_n ∈ (0, 1). The joint distribution p_θ(x^{0:N}) is called the reverse process, and is defined as a Markov chain with learned Gaussian transitions starting with p(x^N) = N(x^N; 0, I), where each subsequent transition of

p_\theta(x^{0:N}) := p(x^N) \prod_{n=N}^{1} p_\theta(x^{n-1} \mid x^n)

is given by a parametrization of our choosing denoted by

p_\theta(x^{n-1} \mid x^n) := \mathcal{N}(x^{n-1}; \mu_\theta(x^n, n), \Sigma_\theta(x^n, n)\mathbf{I}),   (1)

with shared parameters θ. Both μ_θ : R^D × N → R^D and Σ_θ : R^D × N → R^+ take two inputs, namely the variable x^n ∈ R^D as well as the noise index n ∈ N. The goal of p_θ(x^{n-1} | x^n) is to eliminate the Gaussian noise added in the diffusion process. The parameters θ are learned to fit the data distribution q_X(x^0) by minimizing the negative log-likelihood via a variational bound using Jensen's inequality:

\min_\theta \mathbb{E}_{q(x^0)}[-\log p_\theta(x^0)] \le \min_\theta \mathbb{E}_{q(x^{0:N})}[-\log p_\theta(x^{0:N}) + \log q(x^{1:N} \mid x^0)].

This upper bound can be shown to be equal to

\min_\theta \mathbb{E}_{q(x^{0:N})}\Big[ -\log p(x^N) - \sum_{n=1}^{N} \log \frac{p_\theta(x^{n-1} \mid x^n)}{q(x^n \mid x^{n-1})} \Big].   (2)

As shown by (Ho et al., 2020), a property of the forward process is that it admits sampling x^n at any arbitrary noise level n in closed form: if α_n := 1 - β_n and ᾱ_n := Π_{i=1}^{n} α_i is its cumulative product, we have

q(x^n \mid x^0) = \mathcal{N}(x^n; \sqrt{\bar\alpha_n}\, x^0, (1-\bar\alpha_n)\mathbf{I}).   (3)

By using the fact that these processes are Markov chains, the objective in (2) can be written as the KL-divergence between Gaussian distributions:

-\log p_\theta(x^0 \mid x^1) + D_{\mathrm{KL}}(q(x^N \mid x^0) \,\|\, p(x^N)) + \sum_{n=2}^{N} D_{\mathrm{KL}}(q(x^{n-1} \mid x^n, x^0) \,\|\, p_\theta(x^{n-1} \mid x^n)),   (4)

and (Ho et al., 2020) shows that by property (3) the forward process posteriors in these KL divergences, when conditioned on x^0, i.e. q(x^{n-1} | x^n, x^0), are tractable and given by

q(x^{n-1} \mid x^n, x^0) = \mathcal{N}(x^{n-1}; \tilde\mu_n(x^n, x^0), \tilde\beta_n \mathbf{I}),

where

\tilde\mu_n(x^n, x^0) := \frac{\sqrt{\bar\alpha_{n-1}}\,\beta_n}{1-\bar\alpha_n} x^0 + \frac{\sqrt{\alpha_n}(1-\bar\alpha_{n-1})}{1-\bar\alpha_n} x^n

and

\tilde\beta_n := \frac{1-\bar\alpha_{n-1}}{1-\bar\alpha_n}\beta_n.   (5)

Further, (Ho et al., 2020) shows that the KL-divergence between these Gaussians can be written as

D_{\mathrm{KL}}(q(x^{n-1} \mid x^n, x^0) \,\|\, p_\theta(x^{n-1} \mid x^n)) = \mathbb{E}_q\Big[ \frac{1}{2\Sigma_\theta} \| \tilde\mu_n(x^n, x^0) - \mu_\theta(x^n, n) \|^2 \Big] + C,   (6)

where C is a constant which does not depend on θ. So instead of a parametrization (1) of p_θ that predicts μ̃, one can use property (3) to write x^n(x^0, ε) = sqrt(ᾱ_n) x^0 + sqrt(1 - ᾱ_n) ε for ε ∼ N(0, I), together with the formula for μ̃, to obtain that μ_θ must predict (x^n - β_n ε / sqrt(1 - ᾱ_n)) / sqrt(α_n); but since x^n is available to the network, we can instead choose

\mu_\theta(x^n, n) = \frac{1}{\sqrt{\alpha_n}}\Big( x^n - \frac{\beta_n}{\sqrt{1-\bar\alpha_n}} \epsilon_\theta(x^n, n) \Big),

where ε_θ is a network which predicts ε ∼ N(0, I) from x^n, so that the objective simplifies to

\mathbb{E}_{x^0, \epsilon}\Big[ \frac{\beta_n^2}{2\Sigma_\theta \alpha_n (1-\bar\alpha_n)} \| \epsilon - \epsilon_\theta(\sqrt{\bar\alpha_n}\, x^0 + \sqrt{1-\bar\alpha_n}\,\epsilon, n) \|^2 \Big],   (7)

resembling the loss in Noise Conditional Score Networks (Song & Ermon, 2019; 2020) using score matching. Once trained, to sample from the reverse process x^{n-1} ∼ p_θ(x^{n-1} | x^n) (1) we compute

x^{n-1} = \frac{1}{\sqrt{\alpha_n}}\Big( x^n - \frac{\beta_n}{\sqrt{1-\bar\alpha_n}} \epsilon_\theta(x^n, n) \Big) + \sqrt{\Sigma_\theta}\, z,

where z ∼ N(0, I) for n = N, ..., 2 and z = 0 when n = 1. The full sampling procedure for x^0, starting from a white noise sample x^N, resembles Langevin dynamics: we sample from the most noise-perturbed distribution and reduce the magnitude of the noise scale until we reach the smallest one.
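To make the mechanics above concrete, the following is a minimal illustrative sketch (not the authors' code) of the closed-form noising step (3), the simplified objective (7) without its constant weight, and the reverse sampling loop, written in PyTorch with a generic `eps_theta` callable standing in for ε_θ; the linear β schedule and the choice of Σ_θ are assumptions of this example.

```python
import torch

# Assumed setup: N diffusion steps and a linear variance schedule
# (matching the values reported later in Section 4.2: beta_1=1e-4 ... beta_N=0.1).
N = 100
betas = torch.linspace(1e-4, 0.1, N)         # beta_1 ... beta_N
alphas = 1.0 - betas                          # alpha_n = 1 - beta_n
alphas_bar = torch.cumprod(alphas, dim=0)     # cumulative products: abar_n

def q_sample(x0, n, eps):
    """Closed-form forward noising, eq. (3): x^n = sqrt(abar_n) x^0 + sqrt(1-abar_n) eps."""
    a_bar = alphas_bar[n - 1]                 # n is 1-indexed as in the paper
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps

def training_loss(eps_theta, x0):
    """Simplified objective (7), dropping the constant factor in front of the norm."""
    n = torch.randint(1, N + 1, (1,)).item()
    eps = torch.randn_like(x0)
    xn = q_sample(x0, n, eps)
    return ((eps - eps_theta(xn, n)) ** 2).sum()

@torch.no_grad()
def p_sample_loop(eps_theta, shape):
    """Reverse process: start from white noise x^N and denoise step by step down to x^0."""
    x = torch.randn(shape)                    # x^N ~ N(0, I)
    for n in range(N, 0, -1):
        a, a_bar, b = alphas[n - 1], alphas_bar[n - 1], betas[n - 1]
        z = torch.randn(shape) if n > 1 else torch.zeros(shape)
        sigma = b.sqrt()                      # simplification; the paper sets Sigma_theta to beta~_n from (5)
        x = (x - b / (1.0 - a_bar).sqrt() * eps_theta(x, n)) / a.sqrt() + sigma * z
    return x                                  # approximate sample from p_theta(x^0)
```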

3. TimeGrad Method

We denote the entities of a multivariate time series by x^0_{i,t} ∈ R for i ∈ {1, ..., D}, where t is the time index. Thus the multivariate vector at time t is given by x^0_t ∈ R^D. We are tasked with predicting the multivariate distribution some given number of prediction time steps into the future, and so in what follows we consider time series with t ∈ [1, T], sampled from the complete time series history of the training data, where we split this contiguous sequence into a context window of size [1, t_0) and a prediction interval [t_0, T], reminiscent of seq-to-seq models (Sutskever et al., 2014) in language modeling.

In the univariate probabilistic DeepAR model (Salinas et al., 2019b), the log-likelihood of each entity x^0_{i,t} at a time step t ∈ [t_0, T] is maximized over an individual time series' prediction window. This is done with respect to the parameters of some chosen distributional model via the state of an RNN derived from its previous time step x^0_{i,t-1} and its corresponding covariates c_{i,t-1}. The emission distribution model, which is typically Gaussian for real-valued data or negative binomial for count data, is selected to best match the statistics of the time series, and the network incorporates activation functions that satisfy the constraints of the distribution's parameters, e.g. a softplus() for the scale parameter of the Gaussian.

A straightforward time series model for multivariate real-valued data could use a factorizing output distribution instead. Shared parameters can then learn patterns across the individual time series entities through the temporal component, but the model falls short of capturing dependencies in the emissions of the model. For this, a full joint distribution at each time step has to be modeled, for example by using a multivariate Gaussian. However, modeling the full covariance matrix not only increases the number of parameters of the neural network by O(D^2), making learning difficult, but computing the loss is O(D^3), making it impractical. Furthermore, statistical dependencies for such distributions would be limited to second-order effects. Approximating Gaussians with low-rank covariance matrices does work, however, and these models are referred to as Vec-LSTM in (Salinas et al., 2019a).

Instead, in this work we propose TimeGrad, which aims to learn a model of the conditional distribution of the future time steps of a multivariate time series given its past and covariates:

q_X(x^0_{t_0:T} \mid x^0_{1:t_0-1}, c_{1:T}) = \prod_{t=t_0}^{T} q_X(x^0_t \mid x^0_{1:t-1}, c_{1:T}),   (8)

where we assume that the covariates are known for all time points and each factor is learned via the conditional denoising diffusion model introduced above. To model the temporal dynamics we employ the autoregressive recurrent neural network (RNN) architecture from (Graves, 2013; Sutskever et al., 2014), which utilizes the LSTM (Hochreiter & Schmidhuber, 1997) or GRU (Chung et al., 2014) to encode the time series sequence up to time point t, given the covariates c_t, via the updated hidden state h_t:

h_t = \mathrm{RNN}_\theta(\mathrm{concat}(x^0_t, c_t), h_{t-1}),   (9)

where RNN_θ is a multi-layer LSTM or GRU parameterized by shared weights θ and h_0 = 0. Thus we can approximate (8) by the model

\prod_{t=t_0}^{T} p_\theta(x^0_t \mid h_{t-1}),   (10)

where now θ comprises the weights of the RNN as well as the denoising diffusion model. This model is autoregressive, as it consumes the observations at time step t-1 as input to learn the distribution of, or sample, the next time step, as shown in Figure 1.

3.1. Training

Training is performed by randomly sampling context and adjoining prediction-sized windows from the training time series data and optimizing the parameters θ that minimize the negative log-likelihood of the model (10):

\sum_{t=t_0}^{T} -\log p_\theta(x^0_t \mid h_{t-1}),

starting with the hidden state h_{t_0-1} obtained by running the RNN on the chosen context window. Via a similar derivation as in the previous section, the conditional variant of the objective (4) for time step t and noise index n is given by the following simplification of (7) (Ho et al., 2020):

\mathbb{E}_{x^0_t, \epsilon, n}\big[ \| \epsilon - \epsilon_\theta(\sqrt{\bar\alpha_n}\, x^0_t + \sqrt{1-\bar\alpha_n}\,\epsilon, h_{t-1}, n) \|^2 \big],

when we choose the variance in (1) to be Σ_θ = β̃_n (5), where now the ε_θ network is also conditioned on the hidden state. Algorithm 1 is the training procedure for each time step in the prediction window using this objective.

Algorithm 1 Training for each time series step t ∈ [t_0, T]
  Input: data x^0_t ∼ q_X(x^0_t) and state h_{t-1}
  repeat
    Initialize n ∼ Uniform(1, ..., N) and ε ∼ N(0, I)
    Take a gradient step on
      ∇_θ || ε - ε_θ( sqrt(ᾱ_n) x^0_t + sqrt(1 - ᾱ_n) ε, h_{t-1}, n) ||^2
  until converged
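As a rough illustration of Algorithm 1 together with the conditioning in (9) and (10), the sketch below pairs an LSTM state update with the noise-prediction loss; the `EpsTheta` module is only a placeholder for the conditional ε_θ network (the paper uses dilated ConvNets, described in Section 4.2), and the tensor shapes, covariate dimension and optimizer settings are assumptions of this example.

```python
import torch
import torch.nn as nn

D, COV_DIM, HIDDEN = 370, 4, 40          # assumed sizes; HIDDEN=40 and lr=1e-3 as in Section 4.2
N = 100
betas = torch.linspace(1e-4, 0.1, N)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

rnn = nn.LSTM(input_size=D + COV_DIM, hidden_size=HIDDEN, num_layers=2, batch_first=True)

class EpsTheta(nn.Module):
    """Placeholder for the conditional eps_theta network conditioned on (x^n_t, h_{t-1}, n)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(D + HIDDEN + 1, 128), nn.ReLU(), nn.Linear(128, D))
    def forward(self, x_n, h, n):
        n_feat = torch.full((x_n.shape[0], 1), float(n))          # crude noise-index feature
        return self.net(torch.cat([x_n, h, n_feat], dim=-1))

eps_theta = EpsTheta()
opt = torch.optim.Adam(list(rnn.parameters()) + list(eps_theta.parameters()), lr=1e-3)

def train_window(x, c):
    """x: (B, T, D) observations, c: (B, T, COV_DIM) known covariates of one sampled window."""
    B, T, _ = x.shape
    out, _ = rnn(torch.cat([x, c], dim=-1))   # out[:, t] is h_t, encoding x^0_{1:t} and c_{1:t}, eq. (9)
    loss = 0.0
    for t in range(1, T):                     # learn p_theta(x^0_t | h_{t-1}), eq. (10)
        n = torch.randint(1, N + 1, (1,)).item()
        a_bar = alphas_bar[n - 1]
        eps = torch.randn(B, D)
        x_n = a_bar.sqrt() * x[:, t] + (1 - a_bar).sqrt() * eps   # eq. (3) applied to x^0_t
        loss = loss + ((eps - eps_theta(x_n, out[:, t - 1], n)) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return float(loss)
```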

Figure 1. TimeGrad schematic: an RNN-conditioned diffusion probabilistic model at some time t-1, depicting the fixed forward process q(x^n_t | x^{n-1}_t) that adds Gaussian noise and the learned reverse process p_θ(x^{n-1}_t | x^n_t, h_{t-1}).

Algorithm 2 Sampling x^0_t via annealed Langevin dynamics
  Input: noise x^N_t ∼ N(0, I) and state h_{t-1}
  for n = N to 1 do
    if n > 1 then
      z ∼ N(0, I)
    else
      z = 0
    end if
    x^{n-1}_t = (1 / sqrt(α_n)) ( x^n_t - (β_n / sqrt(1 - ᾱ_n)) ε_θ(x^n_t, h_{t-1}, n) ) + sqrt(Σ_θ) z
  end for
  Return: x^0_t

3.2. Inference

After training, we wish to predict, for each time series in our data set, some prediction steps into the future and compare with the corresponding test set time series. As in training, we run the RNN over the last context-sized window of the training set to obtain the hidden state h_T via (9). Then we follow the sampling procedure in Algorithm 2 to obtain a sample x^0_{T+1} of the next time step, which we can pass autoregressively to the RNN together with the covariates c_{T+1} to obtain the next hidden state h_{T+1}, and repeat until the desired forecast horizon has been reached. This process of sampling trajectories from the "warm-up" state h_T can be repeated many times (e.g. S = 100) to obtain empirical quantiles of the uncertainty of our predictions.

3.3. Scaling

In real-world data, the magnitudes of different time series entities can vary drastically. To normalize scales, we divide each time series entity by its context window mean (or by 1 if this mean is zero) before feeding it into the model. At inference, the samples are then multiplied by the same mean values to match the original scale. This rescaling technique simplifies the problem for the model, which is reflected in significantly improved empirical performance, as shown in (Salinas et al., 2019b). The other method, a short-cut connection from the input to the output of the function approximator, as done in the multivariate point forecasting method LSTNet (Lai et al., 2018), is not applicable here.

3.4. Covariates

We employ embeddings for categorical features (Charrington, 2018), which allow relationships within a category, or its context, to be captured when training time series models. Combining these embeddings as features for forecasting yields powerful models, like the first-place winner of the Kaggle Taxi Trajectory Prediction 1 challenge (De Brébisson et al., 2015). The covariates c_t we use are composed of time-dependent (e.g. day of week, hour of day) and time-independent embeddings, if applicable, as well as lag features depending on the time frequency of the data set we are training on. All covariates are thus known for the periods we wish to forecast.

1 https://www.kaggle.com/c/pkdd-15-predict-taxi-service-trajectory-i

4. Experiments

We benchmark TimeGrad on six real-world data sets and evaluate against several competitive baselines. The source code of the model will be made available after the review process.

4.1. Evaluation Metric and Data Set

For evaluation, we compute the Continuous Ranked Probability Score (CRPS) (Matheson & Winkler, 1976) on each time series dimension, as well as on the sum of all time series dimensions (the latter denoted CRPS_sum). CRPS measures the compatibility of a cumulative distribution function F with an observation x as

\mathrm{CRPS}(F, x) = \int_{\mathbb{R}} (F(z) - \mathbb{I}\{x \le z\})^2 \, dz,

where I{x ≤ z} is the indicator function, which is one if x ≤ z and zero otherwise. CRPS is a proper scoring function, hence it attains its minimum when the predictive distribution F and the data distribution are equal. Employing the empirical CDF of F, i.e. F̂(z) = (1/S) Σ_{s=1}^S I{X_s ≤ z} with S samples X_s ∼ F, as a natural approximation of the predictive CDF, CRPS can be directly computed from simulated samples of the conditional distribution (8) at each time point (Jordan et al., 2019). Finally, CRPS_sum is obtained by first summing across the D time series, both for the ground-truth data and for the sampled data (yielding F̂_sum(t) for each time point).

The results are then averaged over the prediction horizon, i.e. formally

\mathrm{CRPS}_{\mathrm{sum}} = \mathbb{E}_t\Big[ \mathrm{CRPS}\Big( \hat{F}_{\mathrm{sum}}(t), \sum_i x^0_{i,t} \Big) \Big].

As proved in (de Bézenac et al., 2020), CRPS_sum is also a proper scoring function, and we use it instead of likelihood-based metrics since not all methods we compare against yield analytical forecast distributions, or their likelihoods are not meaningfully defined.

For our experiments we use the Exchange (Lai et al., 2018), Solar (Lai et al., 2018), Electricity 2, Traffic 3, Taxi 4 and Wikipedia 5 open data sets, preprocessed exactly as in (Salinas et al., 2019a), with their properties listed in Table 1. As can be noted in the table, we do not need to normalize scales for Traffic.

Table 1. Dimension, domain, frequency, total training time steps and prediction length properties of the training data sets used in the experiments.

Data set    Dim. D   Dom.     Freq.    Time steps   Pred. steps
Exchange    8        R+       day      6,071        30
Solar       137      R+       hour     7,009        24
Elec.       370      R+       hour     5,833        24
Traffic     963      (0, 1)   hour     4,001        24
Taxi        1,214    N        30-min   1,488        24
Wiki.       2,000    N        day      792          30

4.2. Model Architecture

We train TimeGrad via SGD using Adam (Kingma & Ba, 2015) with a learning rate of 1e-3 on the training split of each data set, with N = 100 diffusion steps and a linear variance schedule starting from β_1 = 1e-4 up to β_N = 0.1. We construct batches of size 64 by taking random windows (with possible overlaps), with the context size set to the number of prediction steps, from the total time steps of each data set (see Table 1). For testing, we use rolling-window prediction starting from the last context window history before the start of the prediction and compare it to the ground truth in the test set by sampling S = 100 trajectories.

The RNN consists of 2 layers of an LSTM with hidden state h_t ∈ R^40, and we encode the noise index n ∈ {1, ..., N} using the Transformer's (Vaswani et al., 2017) Fourier positional embeddings, with N_max = 500, into R^32 vectors. The network ε_θ consists of conditional 1-dimensional dilated ConvNets with residual connections adapted from the WaveNet (van den Oord et al., 2016a) and DiffWave (Kong et al., 2021) models. Figure 2 shows the schematics of a single residual block i = {0, ..., 7} together with the final output from the sum of all 8 skip-connections. All but the last convolutional network layers have an output channel size of 8, and we use a bidirectional dilated convolution in each block i by setting its dilation to 2^{i%2}. We use a validation set from the training data of the same size as the test set to tune the number of epochs for early stopping. All experiments run on a single Nvidia V100 GPU with 16 GB of memory.

2 https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014
3 https://archive.ics.uci.edu/ml/datasets/PEMS-SF
4 https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page
5 https://github.com/mbohlkeschneider/gluon-ts/tree/mv_release/datasets

Figure 2. The network architecture of ε_θ, consisting of residual_layers = 8 conditional residual blocks with the Gated Activation Unit σ(·)tanh(·) from (van den Oord et al., 2016b), whose skip-connection outputs are summed up to compute the final output. Conv1x1 and Conv1d are 1D convolutional layers with filter sizes of 1 and 3, respectively, with circular padding so that the spatial size remains D; all but the last convolutional layer have output channels residual_channels = 8. FC are linear layers used to up/down-sample the input to the appropriate size for broadcasting.
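To make the Figure 2 description above concrete, here is a rough PyTorch sketch of one such conditional residual block; the exact layer sizes, the form of the noise-index embedding and how the hidden state is upsampled to width D are assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn as nn

class CondResidualBlock(nn.Module):
    """One WaveNet/DiffWave-style block: dilated Conv1d with gated activation,
    conditioned on an upsampled RNN state and a noise-index embedding."""
    def __init__(self, channels=8, emb_dim=32, i=0):
        super().__init__()
        d = 2 ** (i % 2)                                         # dilation 2^{i%2} as in Section 4.2
        self.dilated = nn.Conv1d(channels, 2 * channels, kernel_size=3,
                                 dilation=d, padding=d, padding_mode='circular')
        self.cond = nn.Conv1d(1, 2 * channels, kernel_size=1)    # from the upsampled h_{t-1} (assumed shape)
        self.emb = nn.Linear(emb_dim, channels)                  # from the Fourier noise-index embedding
        self.res = nn.Conv1d(channels, channels, kernel_size=1)
        self.skip = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, x, h_up, n_emb):
        # x: (B, channels, D), h_up: (B, 1, D), n_emb: (B, emb_dim)
        y = x + self.emb(n_emb).unsqueeze(-1)                    # broadcast the noise embedding over D
        y = self.dilated(y) + self.cond(h_up)                    # (B, 2*channels, D)
        gate, filt = y.chunk(2, dim=1)
        y = torch.sigmoid(gate) * torch.tanh(filt)               # gated activation unit
        return (x + self.res(y)) / 2 ** 0.5, self.skip(y)        # residual output, skip output
```

Following Figure 2, the skip outputs of the 8 such blocks would be summed, passed through further Conv1d layers, and projected down to a single channel of width D to form the output of ε_θ.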

4.3. Results

Using CRPS_sum as the evaluation metric, we compare test time predictions of TimeGrad to a wide range of existing methods, including classical multivariate methods:

• VAR (Lütkepohl, 2007), a multivariate linear vector autoregressive model with lags corresponding to the periodicity of the data,
• VAR-Lasso, a Lasso-regularized VAR,
• GARCH (van der Weide, 2002), a multivariate conditional heteroskedastic model, and
• VES, an innovation state space model (Hyndman et al., 2008);

as well as deep learning based methods, namely:

• KVAE (Fraccaro et al., 2017), a variational autoencoder to represent the data on top of a linear state space model which describes the dynamics,
• Vec-LSTM-ind-scaling (Salinas et al., 2019a), which models the dynamics via an RNN and outputs the parameters of an independent Gaussian distribution with mean-scaling,
• Vec-LSTM-lowrank-Copula (Salinas et al., 2019a), which instead parametrizes a low-rank plus diagonal covariance via a Copula process,
• GP-scaling (Salinas et al., 2019a), which unrolls an LSTM with scaling on each individual time series before reconstructing the joint distribution via a low-rank Gaussian,
• GP-Copula (Salinas et al., 2019a), which unrolls an LSTM on each individual time series and then models the joint emission distribution by a low-rank plus diagonal covariance Gaussian copula, and
• Transformer-MAF (Rasul et al., 2021), which uses the Transformer (Vaswani et al., 2017) to model the temporal conditioning and Masked Autoregressive Flow (Papamakarios et al., 2017) for the distribution emission model.

Table 2. Test set CRPS_sum comparison (lower is better) of models on six real-world data sets. Mean and standard error metrics for TimeGrad obtained by re-training and evaluating 10 times.

Method                    Exchange       Solar          Electricity     Traffic        Taxi           Wikipedia
VES                       0.005±0.000    0.9±0.003      0.88±0.0035     0.35±0.0023    -              -
VAR                       0.005±0.000    0.83±0.006     0.039±0.0005    0.29±0.005     -              -
VAR-Lasso                 0.012±0.0002   0.51±0.006     0.025±0.0002    0.15±0.002     -              3.1±0.004
GARCH                     0.023±0.000    0.88±0.002     0.19±0.001      0.37±0.0016    -              -
KVAE                      0.014±0.002    0.34±0.025     0.051±0.019     0.1±0.005      -              0.095±0.012
Vec-LSTM-ind-scaling      0.008±0.001    0.391±0.017    0.025±0.001     0.087±0.041    0.506±0.005    0.133±0.002
Vec-LSTM-lowrank-Copula   0.007±0.000    0.319±0.011    0.064±0.008     0.103±0.006    0.326±0.007    0.241±0.033
GP-scaling                0.009±0.000    0.368±0.012    0.022±0.000     0.079±0.000    0.183±0.395    1.483±1.034
GP-Copula                 0.007±0.000    0.337±0.024    0.0245±0.002    0.078±0.002    0.208±0.183    0.086±0.004
Transformer-MAF           0.005±0.003    0.301±0.014    0.0207±0.000    0.056±0.001    0.179±0.002    0.063±0.003
TimeGrad                  0.006±0.001    0.287±0.02     0.0206±0.001    0.044±0.006    0.114±0.02     0.0485±0.002

Table 2 lists the corresponding CRPS_sum values averaged over 10 independent runs together with their empirical standard deviations, and shows that the TimeGrad model sets the new state-of-the-art on all but the smallest of the benchmark data sets. Note that flow-based models must apply continuous transformations onto a continuously connected distribution, making it difficult to model disconnected modes. Flow models assign spurious density to connections between these modes, leading to potential inaccuracies. Similarly, the generator network in variational autoencoders must learn to map from some continuous space to a possibly disconnected space, which might not be possible to learn. In contrast, EBMs do not suffer from these issues (Du & Mordatch, 2019).

4.4. Ablation

The length N of the forward process is a crucial hyperparameter, as a bigger N allows the reverse process to be approximately Gaussian (Sohl-Dickstein et al., 2015), which helps the Gaussian parametrization (1) to approximate it better. We evaluate to what extent, if any, a larger N affects prediction performance with an ablation study where we record the test set CRPS_sum of the Electricity data set for different total diffusion process lengths N = 2, 4, 8, ..., 256 while keeping all other hyperparameters unchanged. The results are plotted in Figure 3, where we note that N can be reduced down to ≈ 10 without significant performance loss. An optimal value is achieved at N ≈ 100, and larger levels are not beneficial if all else is kept fixed.

Figure 3. TimeGrad test set CRPS_sum for Electricity data by varying the total diffusion length N. Good performance is established already at N ≈ 10, with an optimal value at N ≈ 100. The mean and standard errors were obtained over 5 independent runs. We see similar behaviour with other data sets.

To highlight the predictions of TimeGrad, we show in Figure 4 the predicted median, 50% and 90% distribution intervals of the first 6 dimensions of the full 963-dimensional multivariate forecast of the Traffic benchmark.

Figure 4. TimeGrad prediction intervals (median prediction with 50.0% and 90.0% prediction intervals) and test set ground-truth observations for Traffic data of the first 6 of 963 dimensions from the first rolling window. Note that neighboring entities have an order of magnitude difference in scales.

5. Related Work

5.1. Energy-Based Methods

The EBM of (Ho et al., 2020) that we adapt is based on methods that learn the gradient of the log-density with respect to the inputs, called the Stein score function (Hyvärinen, 2005; Vincent, 2011), and at inference time use this gradient estimate via Langevin dynamics to sample from the model of this complicated data distribution (Song & Ermon, 2019). These models achieve impressive results for image generation (Ho et al., 2020; Song & Ermon, 2020) when trained in an unsupervised fashion without requiring adversarial optimization. By perturbing the data using multiple noise scales, the learnt score network captures both coarse and fine-grained data features.

The closest related work to TimeGrad is in the recent non-autoregressive conditional methods for high-fidelity waveform generation (Chen et al., 2021; Kong et al., 2021). Although these methods learn the distribution of vector-valued data via denoising diffusion methods, as done here, they do not consider its temporal development. Also, neighboring dimensions of waveform data are highly correlated and have a uniform scale, which is not necessarily true for multivariate time series problems, where neighboring entities occur arbitrarily (but in a fixed order) and can have different scales. (Du & Mordatch, 2019) also use EBMs to model one and multiple steps for a trajectory modeling task in a non-autoregressive fashion.

5.2. Time Series Forecasting

Neural time series methods have recently become popular ways of solving the prediction problem via univariate point forecasting methods (Oreshkin et al., 2020; Smyl, 2020) or univariate probabilistic methods (Salinas et al., 2019b). In the multivariate setting we also have point forecasting methods (Lai et al., 2018; Li et al., 2019) as well as probabilistic methods which, like this method, explicitly model the data distribution using Gaussian copulas (Salinas et al., 2019a), GANs (Yoon et al., 2019), or normalizing flows (de Bézenac et al., 2020; Rasul et al., 2021). Bayesian neural networks can also be used to provide epistemic uncertainty in forecasts as well as to detect distributional shifts (Zhu & Laptev, 2018), although these methods often do not perform as well empirically (Wenzel et al., 2020).

6. Conclusion and Future Work

We have presented TimeGrad, a versatile multivariate probabilistic time series forecasting method that leverages the exceptional performance of EBMs to learn and sample from the distribution of the next time step, autoregressively. Analysis of TimeGrad on six commonly used time series benchmarks establishes the new state-of-the-art against competitive methods.

We note that while training TimeGrad we do not need to loop over the EBM function approximator ε_θ, unlike in the normalizing flow setting, where we have multiple stacks of bijections. However, while sampling we do loop N times over ε_θ. A possible strategy to improve sampling times, introduced in (Chen et al., 2021), uses a combination of an improved variance schedule and an L1 loss to allow sampling with fewer steps, at the cost of a small reduction in quality if such a trade-off is required. A recent paper (Song et al., 2021) generalizes the diffusion processes via a class of non-Markovian processes which also allows for faster sampling.

The use of normalizing flows for discrete-valued data dictates that one dequantizes it (Theis et al., 2016), by adding uniform noise to the data, before using the flows to learn. Dequantization is not needed in the EBM setting, and future work could explore methods of explicitly modeling discrete distributions.

As noted in (Du & Mordatch, 2019), EBMs exhibit better out-of-distribution (OOD) detection than other likelihood models. Such a task requires models to have a high likelihood on the data manifold and a low one at all other locations. Surprisingly, (Nalisnick et al., 2019) showed that likelihood models, including flows, were assigning higher likelihoods to OOD data, whereas EBMs do not suffer from this issue since they explicitly penalize high probability under the model but low probability under the data distribution. Future work could evaluate the usage of TimeGrad for anomaly detection tasks.

For long time sequences, one could replace the RNN with a Transformer architecture (Rasul et al., 2021) to provide better conditioning for the EBM emission head. Concurrently, since EBMs are not constrained by the form of their functional approximators, one natural way to improve the model would be to incorporate architectural choices that best encode the inductive bias of the problem being tackled, for example with graph neural networks (Niu et al., 2020) when the relationships between entities are known.

References

Benidis, K., Rangapuram, S. S., Flunkert, V., Wang, B., Maddix, D., Turkmen, C., Gasthaus, J., Bohlke-Schneider, M., Salinas, D., Stella, L., Callot, L., and Januschowski, T. Neural forecasting: Introduction and literature overview, 2020.

Charrington, S. TWiML & AI Podcast: Systems and Software for Machine Learning at Scale with Jeff Dean, 2018. URL https://bit.ly/2G0LmGg.

Chen, N., Zhang, Y., Zen, H., Weiss, R. J., Norouzi, M., and Chan, W. WaveGrad: Estimating gradients for waveform generation. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=NsMLjcFaO8O.

Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. In NIPS 2014 Workshop on Deep Learning, 2014.

de Bézenac, E., Rangapuram, S. S., Benidis, K., Bohlke-Schneider, M., Kurle, R., Stella, L., Hasson, H., Gallinari, P., and Januschowski, T. Normalizing Kalman Filters for Multivariate Time Series Analysis. In Advances in Neural Information Processing Systems, volume 33, 2020.

De Brébisson, A., Simon, E., Auvolat, A., Vincent, P., and Bengio, Y. Artificial Neural Networks Applied to Taxi Destination Prediction. In Proceedings of the 2015 International Conference on ECML PKDD Discovery Challenge, pp. 40–51, 2015. URL http://dl.acm.org/citation.cfm?id=3056172.3056178.

Dinh, L., Sohl-Dickstein, J., and Bengio, S. Density estimation using Real NVP. In 5th International Conference on Learning Representations, 2017. URL https://openreview.net/forum?id=HkpbnH9lx.

Du, Y. and Mordatch, I. Implicit Generation and Modeling with Energy Based Models. In Advances in Neural Information Processing Systems, volume 32, pp. 3608–3618, 2019. URL https://proceedings.neurips.cc/paper/2019/file/378a063b8fdb1db941e34f4bde584c7d-Paper.pdf.

Fraccaro, M., Kamronn, S., Paquet, U., and Winther, O. A Disentangled Recognition and Nonlinear Dynamics Model for Unsupervised Learning. In Advances in Neural Information Processing Systems, volume 30, pp. 3601–3610, 2017. URL https://proceedings.neurips.cc/paper/2017/file/7b7a53e239400a13bd6be6c91c4f6c4e-Paper.pdf.

Graves, A. Generating Sequences With Recurrent Neural Networks. arXiv preprint arXiv:1308.0850, 2013.

Hinton, G. E. Training Products of Experts by Minimizing Contrastive Divergence. Neural Computation, 14(8):1771–1800, 2002. doi: 10.1162/089976602760128018.

Ho, J., Jain, A., and Abbeel, P. Denoising Diffusion Probabilistic Models. In Advances in Neural Information Processing Systems, volume 33, 2020. URL https://papers.nips.cc/paper/2020/file/4c5bcfec8584af0d967f1ab10179ca4b-Paper.pdf.

Hochreiter, S. and Schmidhuber, J. Long Short-Term Memory. Neural Computation, 9(8):1735–1780, 1997. doi: 10.1162/neco.1997.9.8.1735.

Hyndman, R. and Athanasopoulos, G. Forecasting: Principles and Practice. OTexts, 2018.

Hyndman, R., Koehler, A., Ord, K., and Snyder, R. Forecasting with Exponential Smoothing: The State Space Approach, chapter 17, pp. 287–300. Springer-Verlag, 2008. doi: 10.1007/978-3-540-71918-2.

Hyvärinen, A. Estimation of Non-Normalized Statistical Models by Score Matching. Journal of Machine Learning Research, 6(24):695–709, 2005. URL http://jmlr.org/papers/v6/hyvarinen05a.html.

Jordan, A., Krüger, F., and Lerch, S. Evaluating Probabilistic Forecasts with scoringRules. Journal of Statistical Software, 90(12):1–37, 2019. doi: 10.18637/jss.v090.i12. URL https://www.jstatsoft.org/v090/i12.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.

Kingma, D. P. and Welling, M. An Introduction to Variational Autoencoders. Foundations and Trends in Machine Learning, 12(4):307–392, 2019. doi: 10.1561/2200000056.

Kong, Z., Ping, W., Huang, J., Zhao, K., and Catanzaro, B. DiffWave: A Versatile Diffusion Model for Audio Synthesis. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=a-xFK8Ymz5J.

Lai, G., Chang, W.-C., Yang, Y., and Liu, H. Modeling Long- and Short-Term Temporal Patterns with Deep Neural Networks. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pp. 95–104, 2018. doi: 10.1145/3209978.3210006.

LeCun, Y., Chopra, S., Hadsell, R., Ranzato, M., and Huang, F. A Tutorial on Energy-Based Learning. In Predicting Structured Data. MIT Press, 2006.

Li, S., Jin, X., Xuan, Y., Zhou, X., Chen, W., Wang, Y.-X., and Yan, X. Enhancing the locality and breaking the memory bottleneck of transformer on time series forecasting. In Advances in Neural Information Processing Systems 32, pp. 5244–5254, 2019.

Lütkepohl, H. New Introduction to Multiple Time Series Analysis. Springer Berlin Heidelberg, 2007.

Matheson, J. E. and Winkler, R. L. Scoring Rules for Continuous Probability Distributions. Management Science, 22(10):1087–1096, 1976.

Nalisnick, E., Matsukawa, A., Teh, Y. W., Gorur, D., and Lakshminarayanan, B. Do Deep Generative Models Know What They Don't Know? In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=H1xwNhCcYm.

Niu, C., Song, Y., Song, J., Zhao, S., Grover, A., and Ermon, S. Permutation Invariant Graph Generation via Score-Based Generative Modeling. In The 23rd International Conference on Artificial Intelligence and Statistics (AISTATS), volume 108 of Proceedings of Machine Learning Research, pp. 4474–4484, 2020.

Oreshkin, B. N., Carpov, D., Chapados, N., and Bengio, Y. N-BEATS: Neural basis expansion analysis for interpretable time series forecasting. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=r1ecqn4YwB.

Papamakarios, G., Pavlakou, T., and Murray, I. Masked Autoregressive Flow for Density Estimation. In Advances in Neural Information Processing Systems 30, 2017.

Papamakarios, G., Nalisnick, E., Rezende, D. J., Mohamed, S., and Lakshminarayanan, B. Normalizing Flows for Probabilistic Modeling and Inference, 2019.

Rasul, K., Sheikh, A.-S., Schuster, I., Bergmann, U., and Vollgraf, R. Multivariate Probabilistic Time Series Forecasting via Conditioned Normalizing Flows. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=WiGQBFuVRv.

Salinas, D., Bohlke-Schneider, M., Callot, L., Medico, R., and Gasthaus, J. High-dimensional multivariate forecasting with low-rank Gaussian Copula Processes. In Advances in Neural Information Processing Systems 32, pp. 6824–6834, 2019a.

Salinas, D., Flunkert, V., Gasthaus, J., and Januschowski, T. DeepAR: Probabilistic forecasting with autoregressive recurrent networks. International Journal of Forecasting, 2019b. URL http://www.sciencedirect.com/science/article/pii/S0169207019301888.

Smyl, S. A hybrid method of exponential smoothing and recurrent neural networks for time series forecasting. International Journal of Forecasting, 36(1):75–85, 2020. doi: 10.1016/j.ijforecast.2019.03.017. URL http://www.sciencedirect.com/science/article/pii/S0169207019301153. M4 Competition.

Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., and Ganguli, S. Deep Unsupervised Learning using Nonequilibrium Thermodynamics. In Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pp. 2256–2265, 2015. URL http://proceedings.mlr.press/v37/sohl-dickstein15.html.

Song, J., Meng, C., and Ermon, S. Denoising Diffusion Implicit Models. In International Conference on Learning Representations, 2021. URL https://openreview.net/pdf?id=St1giarCHLP.

Song, Y. and Ermon, S. Generative Modeling by Estimating Gradients of the Data Distribution. In Advances in Neural Information Processing Systems, volume 32, pp. 11918–11930, 2019. URL https://proceedings.neurips.cc/paper/2019/file/3001ef257407d5a371a96dcd947c7d93-Paper.pdf.

Song, Y. and Ermon, S. Improved Techniques for Training Score-Based Generative Models. In Advances in Neural Information Processing Systems, volume 33, 2020. URL https://proceedings.neurips.cc/paper/2020/file/92c3b916311a5517d9290576e3ea37ad-Paper.pdf.

Song, Y. and Kingma, D. P. How to Train Your Energy-Based Models. 2021. URL https://arxiv.org/abs/2101.03288.

Sutskever, I., Vinyals, O., and Le, Q. V. Sequence to Sequence Learning with Neural Networks. In Advances in Neural Information Processing Systems 27, pp. 3104–3112, 2014.

Theis, L., van den Oord, A., and Bethge, M. A note on the evaluation of generative models. In International Conference on Learning Representations, 2016. URL http://arxiv.org/abs/1511.01844.

Tsay, R. S. Multivariate Time Series Analysis: With R and Financial Applications. Wiley Series in Probability and Statistics. Wiley, 2014.

van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. WaveNet: A Generative Model for Raw Audio. In The 9th ISCA Speech Synthesis Workshop, pp. 125, 2016a. URL http://www.isca-speech.org/archive/SSW_2016/abstracts/ssw9_DS-4_van_den_Oord.html.

van den Oord, A., Kalchbrenner, N., Espeholt, L., Kavukcuoglu, K., Vinyals, O., and Graves, A. Conditional Image Generation with PixelCNN Decoders. In Advances in Neural Information Processing Systems, volume 29, pp. 4790–4798, 2016b. URL https://proceedings.neurips.cc/paper/2016/file/b1301141feffabac455e1f90a7de2054-Paper.pdf.

van den Oord, A., Kalchbrenner, N., and Kavukcuoglu, K. Pixel Recurrent Neural Networks. In Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pp. 1747–1756, 2016c. URL http://proceedings.mlr.press/v48/oord16.html.

van der Weide, R. GO-GARCH: a multivariate generalized orthogonal GARCH model. Journal of Applied Econometrics, 17(5):549–564, 2002. doi: 10.1002/jae.688.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is All you Need. In Advances in Neural Information Processing Systems 30, pp. 5998–6008, 2017. URL http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf.

Vincent, P. A Connection Between Score Matching and Denoising Autoencoders. Neural Computation, 23(7):1661–1674, 2011. URL https://doi.org/10.1162/NECO_a_00142.

Wenzel, F., Roth, K., Veeling, B., Swiatkowski, J., Tran, L., Mandt, S., Snoek, J., Salimans, T., Jenatton, R., and Nowozin, S. How good is the Bayes posterior in deep neural networks really? In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 10248–10259, 2020. URL http://proceedings.mlr.press/v119/wenzel20a.html.

Yoon, J., Jarrett, D., and van der Schaar, M. Time-series Generative Adversarial Networks. In Advances in Neural Information Processing Systems, volume 32, pp. 5508–5518, 2019. URL https://proceedings.neurips.cc/paper/2019/file/c9efe5f26cd17ba6216bbe2a7d26d490-Paper.pdf.

Zhu, L. and Laptev, N. Deep and Confident Prediction for Time Series at Uber. In 2017 IEEE International Conference on Data Mining Workshops (ICDMW), pp. 103–110, 2018. doi: 10.1109/ICDMW.2017.19.
