Multivariate Probabilistic Time Series Forecasting via Conditioned Normalizing Flows
Kashif Rasul, Abdul-Saboor Sheikh, Ingmar Schuster, Urs Bergmann & Roland Vollgraf
Zalando Research
Mühlenstraße 25
10243 Berlin
Germany
[email protected]
ABSTRACT
Time series forecasting is often fundamental to scientific and engineering problems and enables decision making. With ever increasing data set sizes, a trivial solution to scale up predictions is to assume independence between interacting time series. However, modeling statistical dependencies can improve accuracy and enable analysis of interaction effects. Deep learning methods are well suited for this problem, but multivariate models often assume a simple parametric distribution and do not scale to high dimensions. In this work we model the multivariate temporal dynamics of time series via an autoregressive deep learning model, where the data distribution is represented by a conditioned normalizing flow. This combination retains the power of autoregressive models, such as good performance in extrapolation into the future, with the flexibility of flows as a general purpose high-dimensional distribution model, while remaining computationally tractable. We show that it improves over the state-of-the-art for standard metrics on many real-world data sets with several thousand interacting time-series.
1 INTRODUCTION
Classical time series forecasting methods such as those in Hyndman & Athanasopoulos (2018)
typically provide univariate forecasts and require hand-tuned features to model seasonality and other
parameters. Time series models based on recurrent neural networks (RNN), like LSTM (Hochreiter
& Schmidhuber, 1997), have become popular methods due to their end-to-end training, the ease of
incorporating exogenous covariates, and their automatic feature extraction abilities, which are the
hallmarks of deep learning. Forecasting outputs can either be points or probability distributions, in
which case the forecasts typically come with uncertainty bounds.
The problem of modeling uncertainties in time series forecasting is of vital importance for assessing
how much to trust the predictions for downstream tasks, such as anomaly detection or (business)
decision making. Without probabilistic modeling, the importance of the forecast in regions of low
noise (small variance around a mean value) versus a scenario with high noise cannot be distinguished.
Hence, point estimation models ignore risk stemming from this noise, which would be of particular
importance in some contexts such as making (business) decisions.
Finally, individual time series, in many cases, are statistically dependent on each other, and models
need the capacity to adapt to this in order to improve forecast accuracy (Tsay, 2014). For example,
to model the demand for a retail article, it is important to not only model its sales dependent on
its own past sales, but also to take into account the effect of interacting articles, which can lead to
cannibalization effects in the case of article competition. As another example, consider traffic flow in
a network of streets as measured by occupancy sensors. A disruption on one particular street will
also ripple to occupancy sensors of nearby streets — a univariate model would arguably not be able
to account for these effects.
In this work, we propose end-to-end trainable autoregressive deep learning architectures for probabilistic forecasting that explicitly model multivariate time series and their temporal dynamics by employing a normalizing flow, such as the Masked Autoregressive Flow (Papamakarios et al., 2017) or
Real NVP (Dinh et al., 2017). These models are able to scale to thousands of interacting time series; we show that they learn the ground-truth dependency structure on toy data, and we establish new state-of-the-art results on diverse real-world data sets by comparing to competitive baselines. Additionally, these methods adapt to a broad class of underlying data distributions on account of using a normalizing flow, and our Transformer-based model is highly efficient due to the parallel nature of attention layers during training.
The paper first provides some background context in Section 2. We cover related work in Section 3.
Section 4 introduces our model and the experiments are detailed in Section 5. We conclude with
some discussion in Section 6. The Appendix contains details of the datasets, additional metrics and
exploratory plots of forecast intervals as well as details of our model.
2 BACKGROUND
2.1 DENSITY ESTIMATION VIA NORMALIZING FLOWS
Normalizing flows (Tabak & Turner, 2013; Papamakarios et al., 2019) are mappings from R^D to R^D such that densities p_X on the input space X = R^D are transformed into some simple distribution p_Z (e.g. an isotropic Gaussian) on the space Z = R^D. These mappings, f : X → Z, are composed of a sequence of bijections or invertible functions. Due to the change of variables formula we can express p_X(x) by
$$ p_X(x) = p_Z(z) \left| \det\left( \frac{\partial f(x)}{\partial x} \right) \right|, $$
where ∂f(x)/∂x is the Jacobian of f at x. Normalizing flows have the property that the inverse x = f^{-1}(z) is easy to evaluate and computing the Jacobian determinant takes O(D) time.
The bijection introduced by Real NVP (Dinh et al., 2017), called the coupling layer, satisfies the above two properties. It leaves part of its inputs unchanged and transforms the other part via functions of the un-transformed variables (with superscripts denoting the coordinate indices):
$$ y^{1:d} = x^{1:d}, \qquad y^{d+1:D} = x^{d+1:D} \odot \exp(s(x^{1:d})) + t(x^{1:d}), $$
where ⊙ is an element-wise product, s() is a scaling and t() a translation function from R^d to R^{D-d},
given by neural networks. To model a nonlinear density map f(x), a number of coupling layers which map X → Y_1 → · · · → Y_{K-1} → Z are composed together, all the while alternating the dimensions which are unchanged and transformed. Via the change of variables formula, the probability density function (PDF) of the flow given a data point can be written as
$$ \log p_X(x) = \log p_Z(z) + \log|\det(\partial z/\partial x)| = \log p_Z(z) + \sum_{i=1}^{K} \log|\det(\partial y_i/\partial y_{i-1})|. \quad (1) $$
Note that the Jacobian for the Real NVP is a block-triangular matrix and thus the log-determinant of each map simply becomes
$$ \log|\det(\partial y_i/\partial y_{i-1})| = \log\left|\exp\left(\mathrm{sum}\left(s_i(y_{i-1}^{1:d})\right)\right)\right| = \mathrm{sum}\left(s_i(y_{i-1}^{1:d})\right), \quad (2) $$
where sum() is the sum over all the vector elements. This model, parameterized by the weights of the scaling and translation neural networks θ, is then trained via stochastic gradient descent (SGD) on training data points, where for each batch D we maximize the average log-likelihood (1) given by
$$ \mathcal{L} = \frac{1}{|\mathcal{D}|} \sum_{x \in \mathcal{D}} \log p_X(x; \theta). $$
In practice, Batch Normalization (Ioffe & Szegedy, 2015) is applied as a bijection to outputs of
successive coupling layers to stabilize the training of normalizing flows. This bijection implements
the normalization procedure using a weighted moving average of the layer’s mean and standard
deviation values, which has to be adapted to either training or inference regimes.
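As an illustration, the following is a minimal PyTorch sketch of an affine coupling layer and a stack of such layers implementing (1) and (2). It is not the released implementation; the hidden size, the tanh on the scale output, and the flip-based alternation of dimensions are illustrative choices.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """One Real NVP coupling layer:
    y^{1:d} = x^{1:d},  y^{d+1:D} = x^{d+1:D} * exp(s(x^{1:d})) + t(x^{1:d})."""
    def __init__(self, dim, hidden=100):
        super().__init__()
        self.d = dim // 2
        out = dim - self.d
        self.s = nn.Sequential(nn.Linear(self.d, hidden), nn.ELU(),
                               nn.Linear(hidden, out), nn.Tanh())  # tanh keeps scales bounded (our choice)
        self.t = nn.Sequential(nn.Linear(self.d, hidden), nn.ELU(),
                               nn.Linear(hidden, out))

    def forward(self, x):
        x1, x2 = x[..., :self.d], x[..., self.d:]
        s, t = self.s(x1), self.t(x1)
        y = torch.cat([x1, x2 * torch.exp(s) + t], dim=-1)
        return y, s.sum(dim=-1)                      # log|det| = sum(s), cf. Eq. (2)

    def inverse(self, y):
        y1, y2 = y[..., :self.d], y[..., self.d:]
        s, t = self.s(y1), self.t(y1)
        return torch.cat([y1, (y2 - t) * torch.exp(-s)], dim=-1)


class RealNVP(nn.Module):
    """A stack of K coupling layers; dimensions are flipped between layers so
    every coordinate gets transformed at some point."""
    def __init__(self, dim, K=5):
        super().__init__()
        self.layers = nn.ModuleList(AffineCoupling(dim) for _ in range(K))
        self.base = torch.distributions.Normal(0.0, 1.0)

    def log_prob(self, x):                           # Eq. (1)
        log_det = torch.zeros(x.shape[:-1])
        for layer in self.layers:
            x, ld = layer(x)
            log_det = log_det + ld
            x = x.flip(-1)                           # alternate which half is unchanged
        return self.base.log_prob(x).sum(-1) + log_det
```

Training then amounts to minimizing the negative of the average log-likelihood above, e.g. `(-flow.log_prob(x)).mean()`, with SGD over batches.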
The Real NVP approach can be generalized, resulting in Masked Autoregressive Flows (Papamakarios et al., 2017) (MAF), where the transformation layer is built as an autoregressive neural network in the sense that it takes in some input x ∈ R^D and outputs y = (y^1, . . . , y^D) with the requirement that this transformation is invertible and any output y^i cannot depend on inputs with dimension indices ≥ i, i.e. x^{≥i}. The Jacobian of this transformation is triangular and thus the Jacobian determinant is tractable. Instead of using an RNN to share parameters across the D dimensions of x, one avoids this sequential computation by using masking, giving the method its name. The inverse, however, which is needed for generating samples, is sequential.
By realizing that the scaling and translation function approximators do not need to be invertible, it is straightforward to implement conditioning of the PDF p_X(x|h) on some additional information h ∈ R^H: we concatenate h to the inputs of the scaling and translation function approximators of the coupling layers, i.e. s(concat(x^{1:d}, h)) and t(concat(x^{1:d}, h)), which are modified to map R^{d+H} to R^{D-d}. Another approach is to add a bias computed from h to every layer inside the s and t networks, as proposed by Korshunova et al. (2018). This does not change the log-determinant of the coupling layers given by (2). More importantly for us, for sequential data indexed by t, we can share parameters across the different conditioners h_t by using RNNs or attention in an autoregressive fashion.
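A sketch of this conditioning, where h is simply concatenated to the inputs of the s and t networks (again an illustrative implementation, not the reference one):

```python
import torch
import torch.nn as nn

class ConditionalAffineCoupling(nn.Module):
    """Coupling layer whose s and t networks additionally see a conditioner h,
    concatenated to x^{1:d}; the log-determinant (2) is unchanged."""
    def __init__(self, dim, cond_dim, hidden=100):
        super().__init__()
        self.d = dim // 2
        out = dim - self.d
        self.s = nn.Sequential(nn.Linear(self.d + cond_dim, hidden), nn.ELU(),
                               nn.Linear(hidden, out), nn.Tanh())
        self.t = nn.Sequential(nn.Linear(self.d + cond_dim, hidden), nn.ELU(),
                               nn.Linear(hidden, out))

    def forward(self, x, h):
        x1, x2 = x[..., :self.d], x[..., self.d:]
        s = self.s(torch.cat([x1, h], dim=-1))
        t = self.t(torch.cat([x1, h], dim=-1))
        return torch.cat([x1, x2 * torch.exp(s) + t], dim=-1), s.sum(dim=-1)

    def inverse(self, y, h):
        y1, y2 = y[..., :self.d], y[..., self.d:]
        s = self.s(torch.cat([y1, h], dim=-1))
        t = self.t(torch.cat([y1, h], dim=-1))
        return torch.cat([y1, (y2 - t) * torch.exp(-s)], dim=-1)
```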
For discrete data the distribution has differential entropy of negative infinity, which leads to arbitrarily high likelihood when training normalizing flow models, even on test data. To avoid this one can dequantize the data, often by adding Uniform[0, 1) noise to integer-valued data. The log-likelihood of the resulting continuous model is then lower-bounded by the log-likelihood of the discrete one, as shown in Theis et al. (2016).
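For instance, a count-valued series can be dequantized with a one-line sketch like the following:

```python
import torch

def dequantize(x_int):
    # add Uniform[0, 1) noise so the continuous likelihood lower-bounds the discrete one
    return x_int.float() + torch.rand_like(x_int.float())
```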
The self-attention based Transformer (Vaswani et al., 2017) model has been used for sequence
modeling with great success. The multi-head self-attention mechanism enables it to capture both long-
and short-term dependencies in time series data. Essentially, the Transformer takes in a sequence
X = [x_1, . . . , x_T]^⊤ ∈ R^{T×D}, and the multi-head self-attention transforms this into H distinct query Q_h = X W_h^Q, key K_h = X W_h^K and value V_h = X W_h^V matrices, where the W_h^Q, W_h^K, and W_h^V are learnable parameters. After these linear projections the scaled dot-product attention computes a sequence of vector outputs via
$$ O_h = \mathrm{Attention}(Q_h, K_h, V_h) = \mathrm{softmax}\left( \frac{Q_h K_h^\top}{\sqrt{d_K}} \cdot M \right) V_h, $$
where a mask M can be applied to filter out right-ward attention (or future information leakage) by setting its upper-triangular elements to −∞, and we normalize by d_K, the dimension of the W_h^K matrices. Afterwards, all H outputs O_h are concatenated and linearly projected again.
One typically uses the Transformer in an encoder-decoder setup, where some warm-up time series
is passed through the encoder and the decoder can be used to learn and autoregressively generate
outputs.
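The following sketch shows one head of the masked scaled dot-product attention above; here the mask is realized by setting future positions to −∞ before the softmax, the usual practical equivalent of the formula, and the dimensions are illustrative.

```python
import math
import torch

def masked_attention(Q, K, V):
    """One attention head. Q, K: (T, d_K); V: (T, d_V)."""
    T, d_K = Q.shape
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_K)          # (T, T)
    future = torch.triu(torch.ones(T, T), diagonal=1).bool()   # strictly upper triangle
    scores = scores.masked_fill(future, float("-inf"))         # no future information leakage
    return torch.softmax(scores, dim=-1) @ V                   # (T, d_V)
```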
3 RELATED WORK
Related to this work are models that combine normalizing flows with sequential modeling in some way. Transformation Autoregressive Networks (Oliva et al., 2018) model the density of a multivariate variable x ∈ R^D as a product of D conditional distributions ∏_{i=1}^{D} p_X(x^i | x^{i-1}, . . . , x^1), where the conditioning is given by a mixture model coming from the state of an RNN, and is then transformed via a bijection. The PixelSNAIL (Chen et al., 2018) method also models the joint as a product of conditional distributions, optionally with some global conditioning, via causal convolutions and self-attention (Vaswani et al., 2017) to capture long-term temporal dependencies. These methods are well suited to modeling high-dimensional data like images; however, their use in modeling the temporal development of data has only recently been explored, for example in VideoFlow (Kumar et al., 2019), which models the distribution of the next video frame via a flow whose base-distribution parameters are output by a ConvNet, whereas our approach will be based on conditioning of the PDF as described above.
Using RNNs for modeling either multivariate or temporal dynamics introduces sequential com-
putational dependencies that are not amenable to parallelization. Despite this, RNNs have been
shown to be very effective in modeling sequential dynamics. A recent work in this direction (Hwang
et al., 2019) employs bipartite flows with RNNs for temporal conditioning to develop a conditional
generative model of multivariate sequential data. The authors use a bidirectional training procedure
to learn a generative model of observations that together with the temporal conditioning through a
RNN, can also be conditioned on (observed) covariates that are modeled as additional conditioning
variables in the latent space, which adds extra padding dimensions to the normalizing flow.
The other aspect of related works deals with multivariate probabilistic time series methods which
are able to model high dimensional data. The Gaussian Copula Process method (Salinas et al.,
2019a) is a RNN-based time series method with a Gaussian copula process output modeled using
a low-rank covariance structure to reduce computational complexity and handle non-Gaussian
marginal distributions. By using a low-rank approximation of the covariance matrix they obtain a
computationally tractable method and are able to scale to multivariate dimensions in the thousands
with state-of-the-art results. We will compare our model to this method in what follows.
For modeling the time evolution, we also investigate an encoder-decoder Transformer (Vaswani et al., 2017) architecture where the encoder embeds x_{1:t_0-1} and the decoder outputs the conditioning for the flow over x_{t_0:T} via a masked attention module. See Figure 2 for a schematic.

[Figure 2 (schematic): at each time step, x_t is mapped through K Batch Norm and Coupling Layer blocks to z_t, conditioned on the RNN state h_t computed from the past targets x and covariates c.]

4.1 TRAINING

The model parameters are learned via SGD using Adam (Kingma & Ba, 2015), maximizing the average log-likelihood of the observations under the conditioned flow, as in (1).
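As a rough sketch of one such optimization step (assuming, beyond what is stated above, a batch-first RNN over concatenated lagged targets and covariates, and a conditional flow exposing a hypothetical log_prob(x, h) method in the spirit of the conditional coupling sketch of Section 2.1):

```python
import torch

def training_step(rnn, flow, optimizer, x, c):
    """One Adam/SGD step. x: (B, T, D) targets, c: (B, T, C) known covariates.
    The RNN state h_t, computed from x_{t-1} and c_t, conditions the flow over x_t;
    flow.log_prob(x, h) is an assumed interface."""
    inputs = torch.cat([x[:, :-1], c[:, 1:]], dim=-1)  # (B, T-1, D + C)
    h, _ = rnn(inputs)                                 # (B, T-1, H), batch_first RNN assumed
    loss = -flow.log_prob(x[:, 1:], h).mean()          # maximize the average log-likelihood
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```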
4.2 COVARIATES
We employ embeddings for categorical features (Charrington, 2018), which allows for relationships
within a category, or its context, to be captured while training models. Combining these embeddings
as features for time series forecasting yields powerful models like the first place winner of the Kaggle
Taxi Trajectory Prediction1 challenge (De Brébisson et al., 2015). The covariates ct we use are
composed of time-dependent (e.g. day of week, hour of day) and time-independent embeddings, if
applicable, as well as lag features depending on the time frequency of the data set we are training on.
All covariates are thus known for the time periods we wish to forecast.
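As an example of such an embedding (sizes are illustrative, not the paper's exact configuration), a per-series categorical identifier can be embedded and concatenated with the time features and lags:

```python
import torch
import torch.nn as nn

# e.g. one embedding vector per time series (963 series in Traffic); the
# embedding dimension of 16 is an illustrative choice
series_embedding = nn.Embedding(num_embeddings=963, embedding_dim=16)
series_id = torch.arange(963)
static_feat = series_embedding(series_id)   # (963, 16); repeated over time and
                                            # concatenated with time features and lags
```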
4.3 INFERENCE
For inference we either obtain the hidden state ĥ_{t_1} by passing a “warm up” time series x_{1:t_1-1} through the RNN or use the cold-start hidden state, i.e. we set ĥ_{t_1} = h_1 = 0, and then, by sampling a noise vector z_{t_1} ∈ R^D from an isotropic Gaussian, go backward through the flow to obtain a sample of our time series for the next time step, x̂_{t_1} = f^{-1}(z_{t_1}|ĥ_{t_1}), conditioned on this starting state. We then use this sample and its covariates to obtain the next conditioning state ĥ_{t_1+1} via the RNN and repeat until the end of the inference horizon. This process of sampling trajectories from some initial state can be repeated many times to obtain empirical quantiles of the uncertainty of our prediction for arbitrarily long forecast horizons.
The attention model similarly uses a warm-up time series x1:t1 −1 and covariates and passes them
through the encoder and then uses the decoder to output the conditioning for sampling from the flow.
This sample is then used again in the decoder to iteratively sample the next conditioning state, similar
to the inference procedure in seq-to-seq models.
Note that we do not sample from a reduced-temperature model, e.g. by scaling the variance of the
isotropic Gaussian, unlike what is done in likelihood-based generative models (Parmar et al., 2018)
to obtain higher quality samples.
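A schematic version of this RNN-based sampling loop (again with an assumed flow.inverse(z, h) interface and a batch-first RNN; covariate handling is simplified):

```python
import torch

@torch.no_grad()
def sample_trajectory(rnn, flow, x_past, c_past, c_future, horizon):
    """x_past: (B, t1-1, D) warm-up targets; c_past / c_future: known covariates."""
    out, state = rnn(torch.cat([x_past, c_past], dim=-1))   # warm-up pass
    h = out[:, -1]                                           # conditioning state for time t1
    forecast = []
    for t in range(horizon):
        z = torch.randn_like(x_past[:, -1])                  # isotropic Gaussian noise
        x_hat = flow.inverse(z, h)                           # go backward through the flow
        forecast.append(x_hat)
        step = torch.cat([x_hat, c_future[:, t]], dim=-1).unsqueeze(1)
        out, state = rnn(step, state)                        # next conditioning state
        h = out[:, -1]
    return torch.stack(forecast, dim=1)                      # one (B, horizon, D) trajectory
```

Repeating this (e.g. 100 times) yields the empirical prediction intervals shown in the appendix figures.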
5 EXPERIMENTS
Here we discuss a toy experiment for sanity-checking our model and evaluate probabilistic forecasting
results on six real-world data sets with competitive baselines. The source code of the model, as
well as other time series models, is available at https://github.com/zalandoresearch/
pytorch-ts.
In this toy experiment, we check whether the inductive bias of incorporating relations between time series is learned by our model, by simulating the flow of a liquid in a system of pipes with valves. See Figure 3 for a depiction of the system.
[Figure 3: System of pipes with liquid flowing from left to right with sensors (S_i) and valves (V_i).]

Liquid flows from left to right, where pressure at the first sensor in the system is given by S_0 = X + ε_0, X ∼ Gamma(1, 0.2) in the shape/scale parameterization of the Gamma distribution. The valves are given by V_1, V_2 ∼ Beta(0.5, 0.5) (i.i.d.), and we have
$$ S_i = S_0 \frac{V_i}{V_1 + V_2} + \varepsilon_i $$
for i ∈ {1, 2}, and finally S_3 = S_1 + S_2 + ε_3 with ε_∗ ∼ N(0, 0.1). With this simulation we check
whether our model captures correlations in space and time. The correlation between S1 and S2 results
from both having the same source, measured by S0 . This is reflected by Cov(S1 , S2) > 0, which is
captured by our model as shown in Figure 4 left.
The cross-covariance structure between consecutive time points in the ground truth and as captured
by our trained model is depicted in Figure 4 right. It reflects the true flow of liquid in the system from
S0 at time t to S1 and S2 at time t + 1, on to S3 at time t + 2.
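A small simulation of this pipe system could look as follows; the one-step delays from S_0 to S_1/S_2 and on to S_3, and the per-step draws of the valve settings, follow our reading of the description above:

```python
import numpy as np

def simulate_pipes(T, seed=0):
    rng = np.random.default_rng(seed)
    S = np.zeros((T, 4))                          # columns: S0, S1, S2, S3
    for t in range(T):
        X = rng.gamma(shape=1.0, scale=0.2)
        V1, V2 = rng.beta(0.5, 0.5, size=2)       # valve settings
        eps = rng.normal(0.0, 0.1, size=4)        # sensor noise
        S[t, 0] = X + eps[0]
        if t >= 1:                                # liquid reaches S1 and S2 one step later
            S[t, 1] = S[t - 1, 0] * V1 / (V1 + V2) + eps[1]
            S[t, 2] = S[t - 1, 0] * V2 / (V1 + V2) + eps[2]
        if t >= 2:                                # and S3 one step after that
            S[t, 3] = S[t - 1, 1] + S[t - 1, 2] + eps[3]
    return S
```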
Figure 4: Estimated (cross-)covariance matrices. Darker means higher positive values. left: Co-
variance matrix for a fixed time point capturing the correlation between S1 and S2 . right: Cross-
covariance matrix between consecutive time points capturing true flow of liquid in the pipe system.
Table 1: Test set CRPSsum comparison (lower is better) of models from Salinas et al. (2019a) and
our models LSTM-Real-NVP, LSTM-MAF and Transformer-MAF. The two best methods are
in bold and the mean and standard errors of our methods are obtained by rerunning them 20 times.
For evaluation we compute the Continuous Ranked Probability Score (CRPS) (Matheson & Winkler, 1976) on each individual time series, as well as on the sum of all time series (the latter denoted by CRPSsum). CRPS measures the compatibility of a cumulative distribution function F with an observation x as
$$ \mathrm{CRPS}(F, x) = \int_{\mathbb{R}} \left( F(z) - \mathbb{I}\{x \le z\} \right)^2 \, dz, \quad (5) $$
where I{x ≤ z} is the indicator function, which is one if x ≤ z and zero otherwise. CRPS is a proper scoring function, hence CRPS attains its minimum when the predictive distribution F and the data distribution are equal. Employing the empirical CDF of F, i.e. F̂(z) = (1/n) Σ_{i=1}^{n} I{X_i ≤ z} with n samples X_i ∼ F, as a natural approximation of the predictive CDF, CRPS can be directly computed from simulated samples of the conditional distribution (4) at each time point (Jordan et al., 2019). We take 100 samples to estimate the empirical CDF in practice. Finally, CRPSsum is obtained by first summing across the D time series, both for the ground-truth data and the sampled data (yielding F̂_sum(t) for each time point). The results are then averaged over the prediction horizon, i.e. formally
$$ \mathrm{CRPS}_{\mathrm{sum}} = \mathbb{E}_t\left[ \mathrm{CRPS}\left( \hat{F}_{\mathrm{sum}}(t), \sum_i x_t^i \right) \right]. $$
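In practice this can be computed directly from forecast samples, e.g. via the well-known energy form CRPS(F, x) = E|X − x| − ½ E|X − X′|, which the sketch below uses (array shapes are illustrative):

```python
import numpy as np

def crps(samples, x):
    """samples: (n,) draws from the predictive distribution F; x: scalar observation."""
    pairwise = np.abs(samples[:, None] - samples[None, :]).mean()
    return np.abs(samples - x).mean() - 0.5 * pairwise

def crps_sum(samples, target):
    """samples: (n, T, D) forecast trajectories; target: (T, D) ground truth.
    Sum over the D series first, then average the CRPS over the horizon."""
    s, y = samples.sum(axis=-1), target.sum(axis=-1)
    return np.mean([crps(s[:, t], y[t]) for t in range(y.shape[0])])
```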
Our model is trained on the training split of each data set, and for testing we use a rolling-window prediction starting from the last point seen in the training data set and compare it to the test set.
We train on Exchange (Lai et al., 2018), Solar (Lai et al., 2018), Electricity2 , Traffic3 ,
Taxi4 and Wikipedia5 open data sets, preprocessed exactly as in Salinas et al. (2019a), with their
properties listed in Table 2 of the appendix. Both Taxi and Wikipedia consist of count data and
are thus dequantized before being fed to the flow (and mean-scaled).
1 https://www.kaggle.com/c/pkdd-15-predict-taxi-service-trajectory-i
2 https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014
3 https://archive.ics.uci.edu/ml/datasets/PEMS-SF
4 https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page
5 https://github.com/mbohlkeschneider/gluon-ts/tree/mv_release/datasets
We compare our method using LSTM and two different normalizing flows (LSTM-Real-NVP
and LSTM-MAF based on Real NVP and MAF, respectively) as well as a Transformer
model with MAF (Transformer-MAF), with the most competitive baseline probabilistic
models from Salinas et al. (2019a) on the six data sets and report the results in Table 1.
Vec-LSTM-ind-scaling outputs the parameters of an independent Gaussian distribution with mean-scaling, Vec-LSTM-lowrank-Copula parametrizes a low-rank plus diagonal covariance via a Copula process. GP-scaling unrolls an LSTM with scaling on each individual time series before reconstructing the joint distribution via a low-rank Gaussian. Similarly, GP-Copula unrolls an LSTM on each individual time series and then the joint emission distribution is given by a low-rank plus diagonal covariance Gaussian copula.
Figure 5: Visual analysis of the dependency structure extrapolation of the model. Left: Cross-
covariance matrix computed from the test split of Traffic benchmark. Middle: Cross-covariance
matrix computed from the mean of 100 sample trajectories drawn from the Transformer-MAF
model’s extrapolation into the future (test split). Right: The absolute difference of the two matrices
mostly shows small deviations between ground-truth and extrapolation.
In Table 1 we observe that MAF with either an RNN or a self-attention mechanism for temporal conditioning achieves state-of-the-art (to the best of our knowledge) CRPSsum on all benchmarks. Moreover, bipartite flows with an RNN also either outperform or are competitive with the previous state-of-the-art results listed in the first four columns of Table 1. Further analyses with other metrics (e.g. CRPS and MSE) are reported in Section B of the appendix.
To showcase how well our model captures dependencies when extrapolating the time series into the future, we plot in Figure 5 the cross-covariance matrix of the observations (left) as well as that of the mean of 100 sample trajectories (middle) drawn from the Transformer-MAF model for the test split of the Traffic data set. As can be seen, most of the covariance structure, especially in the top-left region of highly correlated sensors, is very well reflected in the samples drawn from the model.
6 CONCLUSION
We have presented a general method to model high-dimensional probabilistic multivariate time series
by combining conditional normalizing flows with an autoregressive model, such as a recurrent neural
network or an attention module. Autoregressive models have a long-standing reputation for working
very well for time series forecasting, as they show good performance in extrapolation into the future.
The flow model, on the other hand, does not assume any simple fixed distribution class, but instead
can adapt to a broad range of high-dimensional data distributions. The combination hence unites the extrapolation power of the autoregressive model class with the density estimation flexibility of flows. Furthermore, it is computationally efficient, without needing to resort to approximations (e.g. low-rank approximations of a covariance structure as in Gaussian copula methods), and is robust compared to Deep Kernel learning methods, especially for large D. Analysis on six commonly used time series benchmarks establishes new state-of-the-art performance against competitive methods.
A natural way to improve our method is to incorporate a better underlying flow model. For example,
Table 1 shows that swapping the Real NVP flow with a MAF improved the performance, which is a
consequence of Real NVP lacking in density modeling performance compared to MAF. Likewise,
we would expect other design choices of the flow model to improve performance, e.g. changes to
the dequantization method, the specific affine coupling layer or more expressive conditioning, say
via another Transformer. Recent improvements to flows, e.g. those proposed in Flow++ (Ho et al., 2019) to obtain expressive bipartite flow models, or models that handle discrete categorical data (Tran et al., 2019), are left as future work to assess their usefulness. To our knowledge, it is however still an open problem how to model discrete ordinal data via flows, which would best capture the nature of some data sets (e.g. sales data).
ACKNOWLEDGMENTS
K.R. would like to thank Rob Hyndman for the helpful discussions and suggestions.
We wish to acknowledge and thank the authors and contributors of the following open source libraries
that were used in this work: GluonTS (Alexandrov et al., 2020), NumPy (Harris et al., 2020),
Pandas (Pandas development team, 2020), Matplotlib (Hunter, 2007) and PyTorch (Paszke et al.,
2019). We would also like to thank and acknowledge the hard work of the reviewers whose comments
and suggestions have without a doubt helped improve this paper.
REFERENCES
Alexander Alexandrov, Konstantinos Benidis, Michael Bohlke-Schneider, Valentin Flunkert, Jan
Gasthaus, Tim Januschowski, Danielle C. Maddix, Syama Rangapuram, David Salinas, Jasper
Schulz, Lorenzo Stella, Ali Caner Türkmen, and Yuyang Wang. GluonTS: Probabilistic and Neural
Time Series Modeling in Python. Journal of Machine Learning Research, 21(116):1–6, 2020. URL
http://jmlr.org/papers/v21/19-820.html.
Sam Charrington. TWiML & AI Podcast: Systems and software for machine learning at scale with
Jeff Dean, 2018. URL https://bit.ly/2G0LmGg.
Xi Chen, Nikhil Mishra, Mostafa Rohaninejad, and Pieter Abbeel. PixelSNAIL: An improved
autoregressive generative model. In Jennifer Dy and Andreas Krause (eds.), Proceedings of
the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine
Learning Research, pp. 864–872, Stockholmsmässan, Stockholm Sweden, 2018. PMLR. URL
http://proceedings.mlr.press/v80/chen18h.html.
Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evaluation of
gated recurrent neural networks on sequence modeling. In NIPS 2014 Workshop on Deep Learning,
December 2014, 2014.
Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network
learning by exponential linear units (elus). In Yoshua Bengio and Yann LeCun (eds.), 4th Inter-
national Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4,
2016, Conference Track Proceedings, 2016. URL http://arxiv.org/abs/1511.07289.
Emmanuel de Bézenac, Syama Sundar Rangapuram, Konstantinos Benidis, Michael Bohlke-
Schneider, Richard Kurle, Lorenzo Stella, Hilaf Hasson, Patrick Gallinari, and Tim Januschowski.
Normalizing Kalman Filters for Multivariate Time series Analysis. In Advances in Neural Infor-
mation Processing Systems, volume 33. Curran Associates, Inc., 2020.
Alexandre De Brébisson, Étienne Simon, Alex Auvolat, Pascal Vincent, and Yoshua Bengio. Artificial
neural networks applied to taxi destination prediction. In Proceedings of the 2015th International
Conference on ECML PKDD Discovery Challenge - Volume 1526, ECMLPKDDDC’15, pp. 40–51,
Aachen, Germany, Germany, 2015. CEUR-WS.org. URL http://dl.acm.org/citation.
cfm?id=3056172.3056178.
Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using Real NVP. In
International Conference on Learning Representations 2017 (Conference Track), 2017. URL
https://openreview.net/forum?id=HkpbnH9lx.
Charles R. Harris, K. Jarrod Millman, Stéfan J. van der Walt, Ralf Gommers, Pauli Virtanen, David
Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith, Robert Kern, Matti
Picus, Stephan Hoyer, Marten H. van Kerkwijk, Matthew Brett, Allan Haldane, Jaime Fernández
del Río, Mark Wiebe, Pearu Peterson, Pierre Gérard-Marchant, Kevin Sheppard, Tyler Reddy,
Warren Weckesser, Hameer Abbasi, Christoph Gohlke, and Travis E. Oliphant. Array programming
with NumPy. Nature, 585(7825):357–362, September 2020. doi: 10.1038/s41586-020-2649-2.
URL https://doi.org/10.1038/s41586-020-2649-2.
Jonathan Ho, Xi Chen, Aravind Srinivas, Yan Duan, and Pieter Abbeel. Flow++: Improving flow-
based generative models with variational dequantization and architecture design. In Kamalika
Chaudhuri and Ruslan Salakhutdinov (eds.), Proceedings of the 36th International Conference
on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 2722–2730,
Long Beach, California, USA, 2019. PMLR. URL http://proceedings.mlr.press/
v97/ho19a.html.
S. Hochreiter and J. Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735–1780,
November 1997. ISSN 0899-7667. doi: 10.1162/neco.1997.9.8.1735.
J. D. Hunter. Matplotlib: A 2d graphics environment. Computing in Science & Engineering, 9(3):
90–95, 2007. doi: 10.1109/MCSE.2007.55.
Seong Jae Hwang, Zirui Tao, Won Hwa Kim, and Vikas Singh. Conditional recurrent flow: Con-
ditional generation of longitudinal samples with applications to neuroimaging. In The IEEE
International Conference on Computer Vision (ICCV), October 2019.
R.J. Hyndman and G. Athanasopoulos. Forecasting: Principles and practice. OTexts, 2018. ISBN
9780987507112.
Rob Hyndman, Anne Koehler, Keith Ord, and Ralph Snyder. Forecasting with exponential smoothing.
The state space approach, chapter 17, pp. 287–300. Springer-Verlag, 2008. doi: 10.1007/
978-3-540-71918-2.
Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training
by reducing internal covariate shift. In Proceedings of the 32Nd International Conference on
International Conference on Machine Learning - Volume 37, ICML’15, pp. 448–456. JMLR.org,
2015. URL http://dl.acm.org/citation.cfm?id=3045118.3045167.
Alexander Jordan, Fabian Krüger, and Sebastian Lerch. Evaluating probabilistic forecasts with
scoringRules. Journal of Statistical Software, Articles, 90(12):1–37, 2019. ISSN 1548-7660. doi:
10.18637/jss.v090.i12. URL https://www.jstatsoft.org/v090/i12.
Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International
Conference on Learning Representations (ICLR), 2015.
Iryna Korshunova, Yarin Gal, Arthur Gretton, and Joni Dambre. Conditional BRUNO: A Deep
Recurrent Process for Exchangeable Labelled Data. In Bayesian Deep Learning workshop, NIPS,
2018.
Rahul G Krishnan, Uri Shalit, and David Sontag. Structured inference networks for nonlinear state
space models. In AAAI, 2017.
Manoj Kumar, Mohammad Babaeizadeh, Dumitru Erhan, Chelsea Finn, Sergey Levine, Laurent
Dinh, and Durk Kingma. VideoFlow: A Flow-Based Generative Model for Video. In Workshop on
Invertible Neural Nets and Normalizing Flows , ICML, 2019.
Guokun Lai, Wei-Cheng Chang, Yiming Yang, and Hanxiao Liu. Modeling long- and short-term
temporal patterns with deep neural networks. In The 41st International ACM SIGIR Conference
on Research & Development in Information Retrieval, SIGIR ’18, pp. 95–104, New York, NY,
USA, 2018. ACM. ISBN 978-1-4503-5657-2. doi: 10.1145/3209978.3210006. URL http:
//doi.acm.org/10.1145/3209978.3210006.
Shiyang Li, Xiaoyong Jin, Yao Xuan, Xiyou Zhou, Wenhu Chen, Yu-Xiang Wang, and Xifeng
Yan. Enhancing the locality and breaking the memory bottleneck of transformer on time series
forecasting. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché Buc, E. Fox, and R. Garnett
(eds.), Advances in Neural Information Processing Systems 32, pp. 5244–5254. Curran Associates,
Inc., 2019.
H. Lütkepohl. New Introduction to Multiple Time Series Analysis. Springer Berlin Heidelberg, 2007.
ISBN 9783540262398. URL https://books.google.de/books?id=muorJ6FHIiEC.
James E. Matheson and Robert L. Winkler. Scoring rules for continuous probability distributions.
Management Science, 22(10):1087–1096, 1976.
Junier Oliva, Avinava Dubey, Manzil Zaheer, Barnabas Poczos, Ruslan Salakhutdinov, Eric Xing, and
Jeff Schneider. Transformation autoregressive networks. In Jennifer Dy and Andreas Krause (eds.),
Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings
of Machine Learning Research, pp. 3898–3907, Stockholmsmässan, Stockholm Sweden, 2018.
PMLR. URL http://proceedings.mlr.press/v80/oliva18a.html.
Boris N. Oreshkin, Dmitri Carpov, Nicolas Chapados, and Yoshua Bengio. N-BEATS: Neural basis ex-
pansion analysis for interpretable time series forecasting. In International Conference on Learning
Representations, 2020. URL https://openreview.net/forum?id=r1ecqn4YwB.
The Pandas development team. pandas-dev/pandas: Pandas, February 2020. URL https://doi.
org/10.5281/zenodo.3509134.
George Papamakarios, Theo Pavlakou, and Iain Murray. Masked autoregressive flow for density
estimation. Advances in Neural Information Processing Systems 30, 2017.
George Papamakarios, Eric Nalisnick, Danilo Jimenez Rezende, Shakir Mohamed, and Balaji
Lakshminarayanan. Normalizing flows for probabilistic modeling and inference, 2019.
Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Noam Shazeer, Alexander Ku, and
Dustin Tran. Image transformer. In Jennifer Dy and Andreas Krause (eds.), Proceedings of the 35th
International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning
Research, pp. 4055–4064, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR. URL
http://proceedings.mlr.press/v80/parmar18a.html.
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor
Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward
Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner,
Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance
deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché Buc, E. Fox, and
R. Garnett (eds.), Advances in Neural Information Processing Systems 32, pp. 8026–8037. Curran
Associates, Inc., 2019.
David Salinas, Michael Bohlke-Schneider, Laurent Callot, Roberto Medico, and Jan Gasthaus. High-
dimensional multivariate forecasting with low-rank Gaussian copula processes. In H. Wallach,
H. Larochelle, A. Beygelzimer, F. d’Alché Buc, E. Fox, and R. Garnett (eds.), Advances in Neural
Information Processing Systems 32, pp. 6824–6834. Curran Associates, Inc., 2019a.
David Salinas, Valentin Flunkert, Jan Gasthaus, and Tim Januschowski. DeepAR: Probabilistic
forecasting with autoregressive recurrent networks. International Journal of Forecasting, 2019b.
ISSN 0169-2070. URL http://www.sciencedirect.com/science/article/pii/
S0169207019301888.
E.G. Tabak and C.V. Turner. A family of nonparametric density estimation algorithms. Communica-
tions on Pure and Applied Mathematics, 66(2):145–164, 2013.
L. Theis, A. van den Oord, and M. Bethge. A note on the evaluation of generative models. In
International Conference on Learning Representations, 2016. URL http://arxiv.org/
abs/1511.01844. arXiv:1511.01844.
Dustin Tran, Keyon Vafa, Kumar Agrawal, Laurent Dinh, and Ben Poole. Discrete flows: Invertible
generative models of discrete data. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché
Buc, E. Fox, and R. Garnett (eds.), Advances in Neural Information Processing Systems 32, pp.
14692–14701. Curran Associates, Inc., 2019.
Ruey S. Tsay. Multivariate Time Series Analysis: With R and Financial Applications. Wiley Series in
Probability and Statistics. Wiley, 2014. ISBN 9781118617908.
Roy van der Weide. GO-GARCH: a multivariate generalized orthogonal GARCH model. Journal of
Applied Econometrics, 17(5):549–564, 2002. doi: 10.1002/jae.688.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U.V. Luxburg,
S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Advances in Neural
Information Processing Systems 30, pp. 5998–6008. Curran Associates, Inc., 2017. URL http:
//papers.nips.cc/paper/7181-attention-is-all-you-need.pdf.
A DATA SET DETAILS
B ADDITIONAL METRICS
We used exactly the same open source code to evaluate our metrics as provided by the authors of Salinas et al. (2019a).
We report test set CRPSsum results in Table 3 on VAR (Lütkepohl, 2007), a multivariate linear vector auto-regressive model with lags corresponding to the periodicity of the data, VAR-Lasso, a Lasso-regularized VAR, GARCH (van der Weide, 2002), a multivariate conditional heteroskedastic model, GP, a Gaussian process model, KVAE (Krishnan et al., 2017), a variational autoencoder on top of a linear state space model, and VES, an innovation state space model (Hyndman et al., 2008). Note that the VAR-Lasso, KVAE and VES metrics are from de Bézenac et al. (2020).
Table 3: Test set CRPSsum (lower is better) of classical methods and our Transformer-MAF model, where the mean and standard errors of our model are obtained by rerunning it 20 times.
Data set     VAR          VAR-Lasso    GP           GARCH        VES          KVAE         Transformer-MAF
Exchange     0.010±0.00   0.012±0.000  0.011±0.001  0.020±0.000  0.005±0.00   0.014±0.002  0.005±0.003
Solar        0.524±0.001  0.51±0.006   0.828±0.01   0.869±0.00   0.9±0.003    0.34±0.025   0.301±0.014
Electricity  0.031±0.00   0.025±0.00   0.947±0.016  0.278±0.00   0.88±0.003   0.051±0.019  0.0207±0.000
Traffic      0.144±0.00   0.15±0.002   2.198±0.774  0.368±0.00   0.35±0.002   0.1±0.005    0.056±0.001
Taxi         0.292±0.00   -            0.425±0.199  -            -            -            0.179±0.002
Wikipedia    3.4±0.003    3.1±0.004    0.933±0.003  -            -            0.095±0.012  0.063±0.003
The average marginal CRPS over dimensions D and over the predicted time steps compared to the
test interval is given in Table 4.
Table 4: Test set CRPS comparison (lower is better) of models from Salinas et al. (2019a) and our
models LSTM-Real-NVP, LSTM-MAF and Transformer-MAF. The mean and standard errors
are obtained by re-running each method three times.
The MSE is defined as the mean squared error over all the time series dimensions D and over the whole prediction range with respect to the test data. Table 5 shows the marginal MSE results.
Table 5: Test set MSE comparison (lower is better) of models from Salinas et al. (2019a) and our
models LSTM-Real-NVP, LSTM-MAF and Transformer-MAF.
Univariate methods typically give better forecasts than multivariate ones, which is counter-intuitive; the reason is the difficulty of estimating cross-series correlations. The additional variance that multivariate methods add often ends up harming the forecast, even when one knows that the individual time series are related. Thus, as an additional sanity check that our method improves the forecast rather than making it worse, we report the metrics of a modern univariate point forecasting method as well as a multivariate point forecasting method on the Traffic data set. Figure 6 reports the metrics for LSTNet (Lai et al., 2018), a multivariate point forecasting method, and Figure 7 reports the metrics for N-BEATS (Oreshkin et al., 2020), a univariate model. As can be seen, our methods improve on the metrics for the Traffic data set, and this pattern holds for the other data sets in our experiments. As a visual comparison, we have also plotted the prediction intervals using our models in Figures 8, 9, 10 and 11.
D EXPERIMENT DETAILS
D.1 FEATURES
For hourly data sets we used hour of day, day of week, day of month features which are normalized.
For daily data sets we use the day of week features. For data sets with minute granularity we use
minute of hour, hour of day and day of week features. The normalized features are concatenated to
the RNN or Transformer input at each time step. We also concatenate lag values as inputs according
to the data set’s time frequency: [1, 24, 168] for hourly data, [1, 7, 14] for daily and [1, 2, 4, 12, 24, 48]
for the half-hourly data.
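A sketch of this feature construction for an hourly data set (the normalization used here is an illustrative choice):

```python
import numpy as np
import pandas as pd

def hourly_features(index: pd.DatetimeIndex, target: np.ndarray, lags=(1, 24, 168)):
    time_feat = np.stack([
        np.asarray(index.hour) / 23.0 - 0.5,        # hour of day
        np.asarray(index.dayofweek) / 6.0 - 0.5,    # day of week
        (np.asarray(index.day) - 1) / 30.0 - 0.5,   # day of month
    ], axis=-1)
    # lagged target values; the first max(lags) rows wrap around and should be masked
    lag_feat = np.stack([np.roll(target, lag, axis=0) for lag in lags], axis=-1)
    return time_feat, lag_feat
```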
Figure 6: Point forecast and test set ground-truth from LSTNet multivariate model for Traffic
data of the first 16 of 963 time series. CRPSsum : 0.125, CRPS: 0.202 and MSE: 7.4 × 10−4 .
D.2 HYPERPARAMETERS
We use batch sizes of 64, with 100 batches per epoch and train for a maximum of 40 epochs with a
learning rate of 1e−3. The LSTM hyperparameters were the ones from Salinas et al. (2019a) and we
used K = 5 stacks of normalizing flow bijections layers. The components of the normalizing flows
(f and g) are linear feed forward layers (with fixed input and final output sizes because we model
bijections) with hidden dimensions of 100 and ELU (Clevert et al., 2016) activation functions. We
sample 100 times to report the metrics on the test set. The Transformer uses H = 8 heads and n = 3
encoding and m = 3 decoding layers and a dropout rate of 0.1. All experiments run on a single Nvidia V-100 GPU, and the code to reproduce the results is available at the repository linked in Section 5.
Figure 7: Point forecast and test set ground-truth from N-BEATS univariate model for Traffic
data of the first 16 of 963 time series. CRPSsum : 0.174, CRPS: 0.228 and MSE: 8.4 × 10−4 .
Figure 8: Prediction intervals and test set ground-truth from LSTM-REAL-NVP model for Traffic
data of the first 16 of 963 time series.
Figure 9: Prediction intervals and test set ground-truth from Transformer-MAF model for
Traffic data of the first 16 of 963 time series.
Figure 10: Prediction intervals and test set ground-truth from LSTM-REAL-NVP model for
Electricity data of the first 16 of 370 time series.
Figure 11: Prediction intervals and test set ground-truth from Transformer-MAF model for
Electricity data of the first 16 of 370 time series.