Temporal Graph Neural Networks For Irregular Data
generation (Zang and Wang, 2020) and video classification (Kosman and Di Castro, 2022). Temporal GNNs also model time-varying signals on the graph. This extension to graph-structured time series is achieved by combining GNN layers with recurrent (Li et al., 2018), convolutional (Wu et al., 2019; Yu et al., 2018) or attention (Guo et al., 2019) architectures. While the graph is commonly assumed to be known a priori, some approaches also explore learning the graph structure jointly with the temporal GNN model (Zhang et al., 2022; Wu et al., 2020b).

Time series forecasting is a well-studied problem and a vast number of methods exist in the literature. Traditional methods in the area include ARIMA models, vector auto-regression and Gaussian Processes (Box et al., 2015; Roberts et al., 2013). Many deep learning approaches have also been applied to time series forecasting, including Recurrent Neural Networks (RNNs) (Lazzeri, 2020), temporal convolutional neural networks (Chen et al., 2020) and Transformers (Giuliari et al., 2021).

The latent state of an RNN can be extended to continuous time by letting the state decay exponentially in between observations (Che et al., 2018). Such decay mechanisms have been used for modeling data with missing observations (Che et al., 2018), performing imputation (Cao et al., 2018) and parametrizing point processes (Mei and Eisner, 2017). Another way to define time-continuous states is by learning a more general ODE. In neural ODE models (Chen et al., 2018; Kidger, 2021) the latent state is the solution to an ODE defined by a neural network. Neural ODEs have been successfully applied to irregular time series (Rubanova et al., 2019) and can be used to define the dynamics of temporal GNNs (Fang et al., 2021; Poli et al., 2021). Poli et al. (2021) use such a model for graph-structured time series with irregular time steps, but consider the full graph to be observed at each observation time. The GraphCON framework of Rusch et al. (2022) also combines GNNs with a second-order system of ODEs. In GraphCON, however, the time-axis of the ODE is aligned with the layers of the GNN. This makes the framework suitable for node- and graph-level predictive tasks, rather than time-series modeling.

Another related body of work is concerned with using GNNs for data imputation in time series (Cini et al., 2022; Gordon et al., 2021; Omidshafiei et al., 2022). These methods generally do not assume that the time series come with a known graph structure. Instead, the GNN is defined on a graph constructed specifically for the purpose of performing imputation.

The closest work to ours found in the literature is the LG-ODE model of Huang et al. (2020). They consider the same types of irregularities, but are motivated more by a multi-agent systems perspective. LG-ODE is based on an encoder-decoder architecture and trained by maximizing the Evidence Lower Bound (ELBO). The encoder builds a spatio-temporal graph of observations and aggregates information using an attention mechanism (Vaswani et al., 2017). The decoder then extrapolates to future times by solving a neural ODE. This encoder-decoder setup differs from our TGNN4I model, which sequentially incorporates observations and auto-regressively makes predictions at every time point. Because of the multi-agent motivation, Huang et al. (2020) also focus more on smaller graphs with few interacting entities, but longer forecasting horizons. An extended version of LG-ODE, called CG-ODE (Huang et al., 2021), additionally aims to learn the graph structure in the form of a weighted adjacency matrix.

3 A TEMPORAL GNN FOR IRREGULAR OBSERVATIONS

3.1 Setting

Consider a directed or undirected graph G = (V, E) with node set V and edge set E. Let {t_i}_{i=1}^{N_t} be a set of (possibly irregular) time points s.t. 0 < t_1 < · · · < t_{N_t}. We will here present our model for a single time series, but in general we have a dataset containing multiple time series. Let O_i ⊆ V be the set of nodes observed at time t_i. If n ∈ O_i, we denote the observed value as y_i^n ∈ R^{d_y} and any accompanying input features as x_i^n ∈ R^{d_x}. We let y_i^n = x_i^n = 0 if n ∉ O_i. Note that this general setting encompasses a spectrum of irregularity, from single node observations (|O_i| = 1 ∀i) to fully observed graphs (O_i = V ∀i). A table of notation is given in appendix A.

The problem we consider is that of forecasting. At future time points we want to predict the value at each node, given all earlier observations. Since observations are irregular and we want to make predictions at arbitrary times, we need to consider models that can make predictions for any time in the future.

We consider a model where at each node n a latent state h^n(t) ∈ R^{d_h} evolves over continuous time. We define the dynamics of h^n(t) by: (1) how h^n(t) evolves in between observations, and (2) how h^n(t) is updated when node n is observed. If node n is observed at time t_i we incorporate this observation into the latent state using a GRU cell (Cho et al., 2014). This information can then be used for making predictions at future time points. An overview of our model is given in Figure 1.
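To make the setting concrete, the sketch below shows one possible way to store a single irregularly observed graph time series as arrays. The paper only specifies the mathematical objects (t_i, O_i, y_i^n, x_i^n), not a concrete data layout; all field names here are illustrative, and the use of a boolean mask for O_i only mirrors the binary masks mentioned in the implementation discussion in Section 5.

```python
import numpy as np

# A minimal, illustrative container for one irregularly observed graph time series.
rng = np.random.default_rng(0)

num_nodes, N_t, d_y, d_x = 4, 5, 1, 2
edge_index = np.array([[0, 1, 1, 2, 3],
                       [1, 0, 2, 3, 0]])                    # graph edges (source row, target row)

t = np.sort(rng.uniform(0.0, 1.0, size=N_t))                # irregular time points t_1 < ... < t_{N_t}
obs_mask = rng.random((N_t, num_nodes)) < 0.5               # obs_mask[i, n] is True iff node n is in O_i
y = rng.normal(size=(N_t, num_nodes, d_y)) * obs_mask[..., None]  # y_i^n, zeroed when unobserved
x = rng.normal(size=(N_t, num_nodes, d_x)) * obs_mask[..., None]  # x_i^n, zeroed when unobserved

# The two extremes of the irregularity spectrum mentioned in the text:
single_node = obs_mask.sum(axis=1) == 1                     # |O_i| = 1 for all i
fully_observed = obs_mask.all(axis=1)                       # O_i = V for all i
```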
Figure 1: Left: Example graph with four nodes and their latent states, here one-dimensional for illustration purposes. Node observations are indicated with ⃝. Right: Schematic diagram of the GRU update and predictive model g. Everything inside the shaded area happens instantaneously at time t_i. The GRU cell outputs the new initial state h̄_i^n + ĥ_i^n, the static component h̄_i^n and the ODE parameters ω_i^n. The parameters ω_i^n define the dynamics until the next observation. The prediction ŷ_i^n here is based on information from all earlier time points.
3.2 Time-continuous Latent Dynamics

Consider a time interval ]t_i, t_j] where node n is not observed. During this interval we define the latent state of node n by the sum h^n(t) = h̄_i^n + h̃^n(t). The first part h̄_i^n is constant over the time interval, constituting a base level around which the state evolves. The dynamics of h̃^n(t) are dictated by a linear ODE of the form

dh̃^n(t) = A h̃^n(t) dt    (1)

with A ∈ R^{d_h×d_h} and initial condition h̃^n(t_i) = ĥ_i^n. Over this interval the ODE has a closed form solution (Arrowsmith and Place, 1992) given by

h̃^n(t) = exp(δ_t A) ĥ_i^n    (2)

where exp is the matrix exponential function and we define δ_t = t − t_i. Assuming that all eigenvalues of A are unique, we can use its eigen-decomposition¹ A = QΛQ⁻¹ to write

h̃^n(t) = Q exp(δ_t Λ) Q⁻¹ ĥ_i^n.    (3)

¹Recall that the eigen-decomposition of a diagonalizable matrix A is given by A = QΛQ⁻¹, where Λ is a diagonal matrix containing the eigenvalues of A and the columns of Q are the corresponding eigenvectors (Searle and Khuri, 1982).

Since Q contains an eigen-basis of R^{d_h} it can be viewed as a transition matrix, changing the basis of the latent space. While we could in principle learn Q, we note that the basis of the latent space has no physical interpretation and we can without loss of generality choose it such that Q = I. Next, if we make the assumption that A has real eigenvalues, the resulting dynamics are given by

h̃^n(t) = exp(−δ_t diag(ω_i^n)) ĥ_i^n    (4)

where ω_i^n > 0 is a parameter vector representing the negation of the eigenvalues. We arrive at an exponential decay in between observations, a type of dynamics used with GRU updates in existing works (Che et al., 2018; Cao et al., 2018). The positive restriction on ω_i^n ensures the stability of the dynamical system in the limit, which for the complete state means that h^n(t_j) → h̄_i^n as t_j → ∞.

On the other hand, if we instead allow A to have complex eigenvalues, we can decompose A using the real Jordan form (Horn and Johnson, 2012). This makes Λ block-diagonal with 2 × 2 blocks

C_j = [[a_j, −b_j], [b_j, a_j]]    (5)

corresponding to complex conjugate eigenvalues a_j ± b_j i. If we compute the matrix exponential for this Λ we end up with a combination of exponential decay and periodic dynamics (Arrowsmith and Place, 1992). The solution gives dynamics that couple each pair of dimensions as

[h̃^n(t)]_{j:j+1} = exp(δ_t a_j) [[cos(b_j δ_t), −sin(b_j δ_t)], [sin(b_j δ_t), cos(b_j δ_t)]] [ĥ_i^n]_{j:j+1}.    (6)

We parametrize also these dynamics with ω_i^n = [−a_1, −a_3, . . . , −a_{d_h−1}, b_1, b_3, . . . , b_{d_h−1}]^⊺ > 0. Note that when b_j → 0 these periodic dynamics reduce to the exponential decay in Eq. 4, but with the parameter for −a_j shared across pairs of dimensions. We consider both the exponential decay and the more general periodic dynamics as options for our model. The periodic dynamics can naturally be seen as advantageous for modeling periodic data, as we will explore empirically in Section 4.4.
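As a concrete illustration, the sketch below evaluates the closed-form solutions in Eqs. 4 and 6 for a single node. It is a minimal reconstruction from the formulas above, not the paper's implementation, and all variable names are ours.

```python
import numpy as np

def exponential_dynamics(h_hat, omega, delta_t):
    """Eq. 4: elementwise exponential decay of the deviation from the base level."""
    return np.exp(-delta_t * omega) * h_hat          # omega > 0 gives decay towards 0

def periodic_dynamics(h_hat, a, b, delta_t):
    """Eq. 6: damped rotation applied to each pair of latent dimensions.

    a and b hold one value per 2x2 block (len d_h // 2), with a < 0 (decay) and b > 0 (frequency)."""
    h_t = np.empty_like(h_hat)
    for j, (a_j, b_j) in enumerate(zip(a, b)):
        rot = np.array([[np.cos(b_j * delta_t), -np.sin(b_j * delta_t)],
                        [np.sin(b_j * delta_t),  np.cos(b_j * delta_t)]])
        h_t[2 * j:2 * j + 2] = np.exp(delta_t * a_j) * rot @ h_hat[2 * j:2 * j + 2]
    return h_t

# Latent state between observations: h^n(t) = h_bar + h_tilde(t) (Section 3.2).
d_h = 4
h_bar = np.zeros(d_h)                                # base level h_bar_i^n
h_hat = np.array([1.0, -0.5, 0.3, 0.8])              # initial deviation at t_i
omega = np.array([2.0, 1.0, 0.5, 3.0])               # decay rates for Eq. 4
a, b = np.array([-2.0, -0.5]), np.array([6.0, 12.0]) # per-block decay and frequency for Eq. 6

for delta_t in (0.0, 0.1, 0.5):
    print(delta_t,
          h_bar + exponential_dynamics(h_hat, omega, delta_t),
          h_bar + periodic_dynamics(h_hat, a, b, delta_t))
```

As b_j approaches 0 the rotation matrices reduce to the identity and the two functions coincide, matching the reduction noted in the text.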
3.3 Incorporating Observations from the Graph

When node n is observed at time t_i the observation is incorporated into the latent state by a GRU cell, incurring an instantaneous update of h^n(t) at t_i (cf. the shaded area in Figure 1).
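Only the beginning of this section is reproduced in this excerpt. As a rough illustration of the update step, the sketch below applies a standard GRU cell (Cho et al., 2014) with plain affine gates, which corresponds to the GRU-D (node) baseline described in Appendix C; in TGNN4I itself the gates GNN^U and GNN^W are parametrized by GNN layers that also aggregate neighbor information, which is not shown here. All names and shapes are illustrative, and biases are omitted for brevity.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gru_update(h, inp, params):
    """One GRU-cell update of the latent state h given an input built from the observation."""
    Wz, Uz, Wr, Ur, Wh, Uh = (params[k] for k in ("Wz", "Uz", "Wr", "Ur", "Wh", "Uh"))
    z = sigmoid(Wz @ inp + Uz @ h)              # update gate
    r = sigmoid(Wr @ inp + Ur @ h)              # reset gate
    h_cand = np.tanh(Wh @ inp + Uh @ (r * h))   # candidate state
    return (1.0 - z) * h + z * h_cand           # new latent state right after the observation

rng = np.random.default_rng(0)
d_h, d_in = 4, 3
params = {k: rng.normal(scale=0.1, size=(d_h, d_in if k.startswith("W") else d_h))
          for k in ("Wz", "Uz", "Wr", "Ur", "Wh", "Uh")}

h_before = rng.normal(size=d_h)                 # latent state just before the observation at t_i
obs_input = rng.normal(size=d_in)               # e.g. observation y_i^n and features x_i^n, concatenated
h_after = gru_update(h_before, obs_input, params)
```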
Figure 2: Test LMSE on METR-LA traffic data for the best performing models from Table 1. Shaded areas correspond to 95% confidence intervals based on re-training models with 5 random seeds.

We were however unable to get useful predictions without making major changes to the model and it is therefore not included here.

4.2 Traffic Data

We experiment on the PEMS-BAY and METR-LA datasets, containing traffic speed sensor data from the California highway system (Li et al., 2018; Chen et al., 2001). To be able to control the degree of irregularity, we start from regularly sampled data and choose subsets of observations. We use the versions of the datasets pre-processed by Li et al. (2018). Each dataset is split up into time series of 288 observations (1 day). PEMS-BAY contains 180 such time series with 325 nodes and METR-LA 119 time series with 207 nodes. We include the time of day and the time since the node was last observed as input features x_i^n. In order to introduce irregularity in the time steps we next subsample each time series, keeping only 72 of the 288 observations. These N_t = 72 observation times are the same for all nodes. However, from these subsampled time series we furthermore sample subsets containing 25%–100% of all N_t × |V| individual node observations. This results in irregular observation time points and a fraction of nodes observed at each time. Our additional pre-processing prevents a direct comparison with Li et al. (2018), as their method does not handle irregular observations.

We report results for both datasets in Table 1 and highlight the best performing models on METR-LA in Figure 2. GRU-D (joint) has a hard time modeling all nodes jointly, often not performing better than the simple Predict Previous baseline. The Transformer models achieve somewhat better results, but are still not competitive with TGNN4I. We additionally note that the Transformers can be highly sensitive to the random seed used for initialization, something that we have not observed for other models. Comparing TGNN4I and GRU-D (node) in Figure 2, it can be noted that TGNN4I consistently achieves lower errors across all observed fractions.

We also observed that the periodic models output only low frequencies, resulting in dynamics and results similar to the model with exponential decay. While the periodic dynamics in Eq. 6 have fewer degrees of freedom when reduced to pure exponential decay, this does not seem to hurt the performance in this example.

We found that the training time of LG-ODE scales poorly to large graphs, limiting us to only training a single model for each dataset. The predictions are however quite poor, especially on the PEMS-BAY data. While the model seems to learn something more than just predicting the mean, it is not competitive with our TGNN4I model. We believe that the poor performance of LG-ODE can be explained by a combination of multiple things: (1) The LG-ODE model is primarily designed for data with clear continuous underlying dynamics, which might not match this type of traffic data. (2) When the model is trained as proposed by Huang et al. (2020), it can require large amounts of data. For some of the experiments in the original paper 20 000 sequences are used for training, while we use less than 150. (3) The slow training has limited the possibilities for exhaustive hyperparameter tuning on our datasets. Training one LG-ODE model on the PEMS-BAY data takes us over 50 hours.

4.3 USHCN Climate Data

Irregular and missing observations are common problems in climate data (Schneider, 2001). The United States Historical Climatology Network (USHCN) daily dataset contains over 50 years of measurements of multiple climate variables from sensor stations in the United States (Menne et al., 2015). We use the pre-processing of De Brouwer et al. (2019) to clean and subsample the data. The target variables chosen are minimum and maximum daily temperature (Tmin and Tmax), which we model as separate datasets. While existing works (De Brouwer et al., 2019; Schirmer et al., 2022) have treated time series from different sensor stations as independent, we also model the spatial correlation by constructing a 10-nearest-neighbor graph using the sensor positions. Each full dataset contains 186 time series of length N_t = 100 on a graph of 1123 nodes. The pre-processed USHCN data is sparsely observed, with only around 5% of potential node observations present.

We report results on both datasets in Table 2. Due to the large size of the graph it was not feasible to apply the LG-ODE model here. We note that for these datasets the (joint) baselines clearly outperform the (node) versions. For GRU-D this is the opposite of what we saw in the traffic data.
Table 1: Test LMSE (multiplied by 10²) for the traffic datasets with different fractions of node observations. Where applicable we report mean ± one standard deviation across 5 runs with different random seeds. The lowest mean LMSE for each dataset and observation percentage is marked in bold.
PEMS-BAY
Model 25% 50% 75% 100%
Predict Previous 26.32 18.60 15.25 13.50
GRU-D (joint) 18.79±0.07 18.27±0.10 17.93±0.08 17.75±0.12
GRU-D (node) 8.79±0.06 6.62±0.02 5.82±0.06 5.49±0.06
Transformer (joint) 12.05±1.19 13.13±2.59 12.21±1.95 11.09±1.38
Transformer (node) 16.49±0.17 14.44±0.48 13.20±0.56 13.16±1.23
LG-ODE 27.00 24.93 24.71 23.52
TGNN4I (static) 7.41±0.09 5.98±0.07 5.29±0.08 4.89±0.05
TGNN4I (exponential) 7.10±0.07 5.78±0.05 5.23±0.03 4.87±0.09
TGNN4I (periodic) 7.10±0.09 5.80±0.08 5.22±0.09 4.87±0.02
METR-LA
Model 25% 50% 75% 100%
Predict Previous 9.86 7.54 6.52 6.04
GRU-D (joint) 8.38±0.05 8.03±0.04 7.89±0.03 7.80±0.02
GRU-D (node) 4.36±0.08 3.62±0.07 3.28±0.08 3.16±0.04
Transformer (joint) 5.70±1.41 7.17±1.66 5.95±1.90 6.11±1.80
Transformer (node) 7.01±0.31 6.34±0.24 5.84±0.23 5.96±0.50
LG-ODE 8.51 7.35 6.71 6.24
TGNN4I (static) 3.86±0.02 3.31±0.02 3.03±0.02 2.88±0.02
TGNN4I (exponential) 3.68±0.05 3.18±0.03 2.97±0.03 2.86±0.04
TGNN4I (periodic) 3.69±0.02 3.19±0.04 3.01±0.05 2.88±0.03
Table 2: Test LMSE (multiplied by 10²) for the two USHCN climate datasets.

                      Tmin        Tmax
Predict Previous      16.88       17.18
GRU-D (joint)         8.03±0.23   7.97±0.19
GRU-D (node)          13.12±0.03  13.67±0.04
Transformer (joint)   7.36±0.41   7.37±0.28
Transformer (node)    15.68±0.32  15.74±0.34
TGNN4I (static)       6.97±0.05   6.86±0.04
TGNN4I (exponential)  6.72±0.04   6.60±0.04
TGNN4I (periodic)     6.72±0.05   6.63±0.03

This can be explained by the fact that climate data has strong spatial dependencies. The (joint) models can to some extent learn to pick up on these, while for (node) no information can flow between nodes. The best results are however achieved by TGNN4I, showing the added benefit of utilizing the spatial graph.

4.4 Synthetic Periodic Data

In the previous experiments, using periodic dynamics with TGNN4I has not added any value. Instead, the learned dynamics have been largely similar to just using exponential decay. This should to some extent be expected, as none of the previous datasets show any clear periodic patterns at the considered time scales. To investigate the possible benefits of the periodic dynamics we instead create a synthetic dataset with periodic signals propagating over a graph.

The synthetic dataset is based on a randomly sampled directed acyclic graph with 20 nodes. We define a periodic base signal

ρ^n(t) = sin(ϕ^n t + η^n)    (16)

with random parameters ϕ^n and η^n for each node n. The target signal y^n(t) in each node is then defined through

κ^n(t) = ρ^n(t) + (0.5 / |N(n)|) Σ_{m∈N(n)} κ^m(t − 0.05)    (17a)
y^n(t) = κ^n(t) + ϵ^n(t)    (17b)

where ϵ^n(t) is Gaussian white noise with standard deviation 0.01. The target signal in each node depends on the base signal in the node itself and the signals in neighboring nodes at a time lag of 0.05. To construct one time series we sample y^n(t) at 70 irregular time points on [0, 1]. In total we sample 200 such time series and keep 50% of the node observations in each.

We train versions of the GRU-D (node) and TGNN4I models with different latent dynamics on the synthetic data.
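To make Eqs. 16 and 17 concrete, the sketch below generates target signals for a small hand-written DAG. The graph construction and parameter distributions actually used are described in Appendix D; everything here, including names and the toy graph, is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Small hand-written DAG: parents[n] lists the neighbors N(n) whose lagged signals feed node n.
parents = {0: [], 1: [0], 2: [0, 1], 3: [1]}
nodes = [0, 1, 2, 3]                                   # already in topological order

phi = rng.uniform(20.0, 100.0, size=len(nodes))        # frequencies phi^n (ranges from Appendix D)
eta = rng.uniform(0.0, 2 * np.pi, size=len(nodes))     # phases eta^n

def rho(n, t):
    """Eq. 16: periodic base signal of node n."""
    return np.sin(phi[n] * t + eta[n])

def kappa(n, t):
    """Eq. 17a: clean target signal, recursing over the DAG with a time lag of 0.05."""
    if not parents[n]:
        return rho(n, t)
    lagged = np.mean([kappa(m, t - 0.05) for m in parents[n]])   # (1/|N(n)|) sum of lagged signals
    return rho(n, t) + 0.5 * lagged

# Eq. 17b: noisy observations y^n(t) at irregular time points on [0, 1].
obs_times = np.sort(rng.choice(np.linspace(0.0, 1.0, 1000), size=70, replace=False))
y = np.array([[kappa(n, t) + rng.normal(scale=0.01) for n in nodes] for t in obs_times])
obs_mask = rng.random(y.shape) < 0.5                   # keep roughly 50% of the node observations
```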
Table 3: Test LMSE (multiplied by 10²) for synthetic data.

                      Static       Exponential  Periodic
GRU-D (node)          8.88±0.36    3.13±0.06    2.81±0.04
TGNN4I                15.12±0.05   2.91±0.17    1.95±0.11
Predict Prev.         27.52
Transformer (joint)   23.19±0.38
Transformer (node)    15.39±0.05
LG-ODE                16.61±0.23

In the loss weighting experiment, one TGNN4I model was trained for each weighting function on the PEMS-BAY dataset with 25% observations. We used exponential dynamics and considered the weighting functions

w_1(∆_t) = 1    (18a)
w_2(∆_t) = exp(−∆_t / 0.04)    (18b)
w_3(∆_t) = exp(−((∆_t − 0.1) / 0.02)²)    (18c)
w_4(∆_t) = I{∆_t ∈ [0.18, 0.22]}.    (18d)

Figure 3 shows the test MSE for the trained models at different ∆_t in the future, as well as plots of the weighting functions. As an example, the prediction ŷ^m_{i→j} has ∆_t = t_j − t_i.

We note that the choice of w can have a substantial impact on the error of the model at different time horizons. The exponential weighting in w_2 makes the model focus heavily on short-term predictions. This results in better predictions for low ∆_t. At the shortest time horizon the exponential weighting yields an 11% improvement over the model trained with constant w_1, but this comes at the cost of far higher errors for long-term predictions.
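A minimal sketch of the four weighting functions in Eq. 18, together with one way such weights could scale per-horizon squared errors. The exact loss definition (LMSE) is not reproduced in this excerpt, so the weighted average at the end is only illustrative.

```python
import numpy as np

def w1(dt):
    return np.ones_like(dt)                              # Eq. 18a: constant weight

def w2(dt):
    return np.exp(-dt / 0.04)                            # Eq. 18b: focus on short horizons

def w3(dt):
    return np.exp(-(((dt - 0.1) / 0.02) ** 2))           # Eq. 18c: focus around delta_t = 0.1

def w4(dt):
    return ((dt >= 0.18) & (dt <= 0.22)).astype(float)   # Eq. 18d: indicator of [0.18, 0.22]

# Illustrative use: weight squared forecast errors by the prediction horizon delta_t = t_j - t_i.
delta_t = np.array([0.01, 0.05, 0.10, 0.20, 0.30])
sq_err = np.array([0.2, 0.3, 0.4, 0.5, 0.6])
for w in (w1, w2, w3, w4):
    weights = w(delta_t)
    print(w.__name__, np.sum(weights * sq_err) / (np.sum(weights) + 1e-8))
```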
Which of these is to be preferred depends on the sparsity of observations in the data. Our implementation follows the massively parallel approach, using binary masks for keeping track of O_i at each time point. In order to scale TGNN4I to massive graphs in real-world scenarios it could be interesting to consider a version of the model distributed over multiple machines, perhaps directly connected to the sensors producing the input data. A sparse implementation would be strongly preferred for such an extension.

5.2 Linear and Neural ODEs

The dynamics of TGNN4I are defined by a linear ODE and additionally restricted by the assumption of unique eigenvalues in A. This has the benefit of a closed form solution that is efficient to compute, but also limits the types of dynamics that can be learned. Our approach can be contrasted with Neural ODEs (Chen et al., 2018), which allow for learning more expressive dynamics. Neural ODEs do however lack closed form solutions and require using numerical solvers (Kidger, 2021). This incurs a trade-off between speed and numerical accuracy. In experiments we have compared TGNN4I with the LG-ODE model (Huang et al., 2020), which uses a Neural ODE decoder. The slow training of the LG-ODE model can to a large extent be attributed to the numerical ODE solver. While more complex latent dynamics can be useful for some datasets, it can also be argued that simpler dynamics can be compensated for with a high enough latent dimension d_h and a flexible enough predictive model g (Schirmer et al., 2022).

5.3 Societal and Sustainability Impact

While our contributions are purely methodological, many applications of graph-based and spatio-temporal data analysis have a clear societal impact. Our example applications of traffic and climate modeling both have the potential to aid efforts of transforming society in more sustainable directions, such as those described in the United Nations sustainable development goals 11 and 13 (Rolnick et al., 2022; United Nations, 2015). Traffic modeling allows us to both understand travel behavior and predict future demand. This can enable optimizations of transport systems, both improving the experience of travelers and reducing the environmental impact. Integrating machine learning methods with climate modeling has the potential to speed up simulations and increase our understanding of the climate around us. However, it is surely also possible to find applications of our method with a damaging impact on society, for example through undesired mass-surveillance.

5.4 Future Work

We consider a setting where the graph structure is both known and constant over time. In some practical applications it is not obvious how to construct the graph describing the system. To tackle this problem our model could be combined with approaches for also learning the graph structure (Stanković et al., 2020; Zhang et al., 2022). Extending our method to dynamic graphs, that evolve over time, would not require any major changes and could be an interesting direction for future experiments.

Our focus has been on forecasting, but the model could also be trained for other tasks. The time-continuous latent state in each node could be used for imputing missing observations or performing sequence segmentation. Also classification tasks are possible, either classifying each node separately or the entire graph-structured time series.

While our model can produce predictions at arbitrary time points, an extension would be to also predict the time until the next observation occurs. One way to achieve this would be to let the latent state parametrize the intensity of a point process (Mei and Eisner, 2017; Jia and Benson, 2019). Building such point process models on graphs could be an interesting future application of our model.

Acknowledgments

This research is financially supported by the Swedish Research Council via the project Handling Uncertainty in Machine Learning Systems (contract number: 2020-04122), the Excellence Center at Linköping–Lund in Information Technology (ELLIIT), and the Wallenberg AI, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation. The computations were enabled by the Berzelius resource at the National Supercomputer Centre, provided by the Knut and Alice Wallenberg Foundation. All datasets available online were accessed from the Linköping University network.

References

Arrowsmith, D. and Place, C. M. (1992). Dynamical systems: differential equations, maps, and chaotic behaviour. CRC Press.

Box, G. E., Jenkins, G. M., Reinsel, G. C., and Ljung, G. M. (2015). Time series analysis: forecasting and control. John Wiley & Sons.

Cao, W., Wang, D., Li, J., Zhou, H., Li, L., and Li, Y. (2018). BRITS: Bidirectional recurrent imputation for time series. In Advances in Neural Information Processing Systems, volume 31.

Che, Z., Purushotham, S., Cho, K., Sontag, D., and Liu, Y. (2018). Recurrent neural networks for multivariate time series with missing values. Scientific Reports, 8(1):6085.

Chen, C., Petty, K., Skabardonis, A., Varaiya, P., and Jia, Z. (2001). Freeway performance measurement system: Mining loop detector data. Transportation Research Record, 1748(1):96–102.
Chen, R. T., Rubanova, Y., Bettencourt, J., and Duvenaud, D. K. (2018). Neural ordinary differential equations. In Advances in Neural Information Processing Systems, volume 31.

Chen, Y., Kang, Y., Chen, Y., and Wang, Z. (2020). Probabilistic forecasting with temporal convolutional neural network. Neurocomputing, 399:491–501.

Cho, K., van Merriënboer, B., Bahdanau, D., and Bengio, Y. (2014). On the properties of neural machine translation: Encoder–decoder approaches. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 103–111.

Cini, A., Marisca, I., and Alippi, C. (2022). Filling the g_ap_s: Multivariate time series imputation by graph neural networks. In International Conference on Learning Representations (ICLR).

De Brouwer, E., Simm, J., Arany, A., and Moreau, Y. (2019). GRU-ODE-Bayes: Continuous modeling of sporadically-observed time series. In Advances in Neural Information Processing Systems, volume 32.

Fang, Z., Long, Q., Song, G., and Xie, K. (2021). Spatial-temporal graph ODE networks for traffic flow forecasting. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pages 364–373.

Fey, M. and Lenssen, J. E. (2019). Fast graph representation learning with PyTorch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds.

Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O., and Dahl, G. E. (2017). Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning, pages 1263–1272.

Giuliari, F., Hasan, I., Cristani, M., and Galasso, F. (2021). Transformer networks for trajectory forecasting. In 25th International Conference on Pattern Recognition (ICPR).

Gordon, D., Petousis, P., Zheng, H., Zamanzadeh, D., and Bui, A. A. (2021). TSI-GNN: Extending graph neural networks to handle missing data in temporal settings. Frontiers in Big Data, 4.

Guo, S., Lin, Y., Feng, N., Song, C., and Wan, H. (2019). Attention based spatial-temporal graph convolutional networks for traffic flow forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 922–929.

Horn, R. A. and Johnson, C. R. (2012). Matrix analysis. Cambridge University Press.

Huang, Z., Sun, Y., and Wang, W. (2020). Learning continuous system dynamics from irregularly-sampled partial observations. In Advances in Neural Information Processing Systems, volume 33, pages 16177–16187.

Huang, Z., Sun, Y., and Wang, W. (2021). Coupled graph ODE for learning interacting system dynamics. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pages 705–715.

Jia, J. and Benson, A. R. (2019). Neural jump stochastic differential equations. In Advances in Neural Information Processing Systems, volume 32.

Kidger, P. (2021). On neural differential equations. PhD thesis, University of Oxford.

Kosman, E. and Di Castro, D. (2022). GraphVid: It only takes a few nodes to understand a video. In Computer Vision – ECCV 2022, pages 195–212.

Lam, R., Sanchez-Gonzalez, A., Willson, M., Wirnsberger, P., Fortunato, M., Pritzel, A., Ravuri, S., Ewalds, T., Alet, F., Eaton-Rosen, Z., et al. (2022). GraphCast: Learning skillful medium-range global weather forecasting. arXiv preprint arXiv:2212.12794.

Lazzeri, F. (2020). Machine learning for time series forecasting with Python. John Wiley & Sons.

Li, Y., Yu, R., Shahabi, C., and Liu, Y. (2018). Diffusion convolutional recurrent neural network: Data-driven traffic forecasting. In International Conference on Learning Representations (ICLR).

Mei, H. and Eisner, J. M. (2017). The neural Hawkes process: A neurally self-modulating multivariate point process. In Advances in Neural Information Processing Systems, volume 30.

Menne, M., Williams Jr, C., and Vose, R. (2015). Long-term daily climate records from stations across the contiguous United States.

Omidshafiei, S., Hennes, D., Garnelo, M., Wang, Z., Recasens, A., Tarassov, E., Yang, Y., Elie, R., Connor, J. T., Muller, P., Mackraz, N., Cao, K., Moreno, P., Sprechmann, P., Hassabis, D., Graham, I., Spearman, W., Heess, N., and Tuyls, K. (2022). Multiagent off-screen behavior prediction in football. Scientific Reports, 12(1):1–13.

Poli, M., Massaroli, S., Rabideau, C. M., Park, J., Yamashita, A., Asama, H., and Park, J. (2021). Continuous-depth neural models for dynamic graph prediction. arXiv preprint arXiv:2106.11581.

Roberts, S., Osborne, M., Ebden, M., Reece, S., Gibson, N., and Aigrain, S. (2013). Gaussian processes for time-series modelling. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences.

Rolnick, D., Donti, P. L., Kaack, L. H., Kochanski, K., Lacoste, A., Sankaran, K., Ross, A. S., Milojevic-Dupont, N., Jaques, N., Waldman-Brown, A., Luccioni, A. S., Maharaj, T., Sherwin, E. D., Mukkavilli, S. K., Kording, K. P., Gomes, C. P., Ng, A. Y., Hassabis, D., Platt, J. C., Creutzig, F., Chayes, J., and Bengio, Y. (2022). Tackling climate change with machine learning. ACM Computing Surveys, 55(2).
Rozemberczki, B., Scherer, P., Kiss, O., Sarkar, R., and Ferenci, T. (2021). Chickenpox cases in Hungary: A benchmark dataset for spatiotemporal signal processing with graph neural networks. In Workshop on Graph Learning Benchmarks @ TheWebConf 2021.

Rubanova, Y., Chen, R. T. Q., and Duvenaud, D. K. (2019). Latent ordinary differential equations for irregularly-sampled time series. In Advances in Neural Information Processing Systems, volume 32.

Rusch, T. K., Chamberlain, B., Rowbottom, J., Mishra, S., and Bronstein, M. (2022). Graph-coupled oscillator networks. In Proceedings of the 39th International Conference on Machine Learning, pages 18888–18909.

Schirmer, M., Eltayeb, M., Lessmann, S., and Rudolph, M. (2022). Modeling irregular time series with continuous recurrent units. In Proceedings of the 39th International Conference on Machine Learning, pages 19388–19405.

Schneider, T. (2001). Analysis of incomplete climate data: Estimation of mean values and covariance matrices and imputation of missing values. Journal of Climate, 14(5):853–871.

Searle, S. R. and Khuri, A. I. (1982). Matrix algebra useful for statistics. John Wiley & Sons.

Stanković, L., Mandic, D., Daković, M., Brajović, M., Scalzo, B., Li, S., and Constantinides, A. G. (2020). Data analytics on graphs part III: Machine learning on graphs, from graph topology to applications. Foundations and Trends in Machine Learning, 13(4):332–530.

United Nations (2015). Transforming our world: The 2030 agenda for sustainable development.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems, volume 30.

Wu, Z., Pan, S., Chen, F., Long, G., Zhang, C., and Yu, P. S. (2020a). A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems, pages 1–21.

Wu, Z., Pan, S., Long, G., Jiang, J., Chang, X., and Zhang, C. (2020b). Connecting the dots: Multivariate time series forecasting with graph neural networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 753–763.

Wu, Z., Pan, S., Long, G., Jiang, J., and Zhang, C. (2019). Graph WaveNet for deep spatial-temporal graph modeling. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, pages 1907–1913.

Yu, B., Yin, H., and Zhu, Z. (2018). Spatio-temporal graph convolutional networks: A deep learning framework for traffic forecasting. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, pages 3634–3640.

Zang, C. and Wang, F. (2020). MoFlow: An invertible flow model for generating molecular graphs. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 617–626.

Zhang, X., Zeman, M., Tsiligkaridis, T., and Zitnik, M. (2022). Graph-guided network for irregularly sampled multivariate time series. In International Conference on Learning Representations (ICLR).

Zhao, X., Chen, F., and Cho, J.-H. (2018). Deep learning for predicting dynamic uncertain opinions in network data. In 2018 IEEE International Conference on Big Data (Big Data), pages 1150–1155.
A TABLE OF NOTATION

The presentation of TGNN4I in Section 3 describes how the model processes a single graph-structured time series. In practice we use batches of multiple such time series during training and inference. When working on batches there is an additional sum in the definition of L_ℓ, computing the mean over all samples in the batch. While we assume that all time series in a batch share the same N_t, the exact time points {t_i}_{i=1}^{N_t} can differ. In all experiments we use N_init = 5 and minimize the loss using the Adam optimizer (Kingma and Ba, 2015).
Both GNN^U and GNN^W contain only GNN layers, while the predictive model g contains a sequence of GNN layers followed by a sequence of fully connected layers. We use ReLU activation functions in between all layers.
The input x̃_i^m to GNN^W is 0 for neighbors that are not observed at time t_i. However, there could be an observation of neighbor m where the observed value and input features are exactly 0. To help the model differentiate between these two cases we additionally include an indicator variable I{m ∈ O_i} in x̃_i^m.
C BASELINE MODELS
The Predict Previous baseline does not require any training. Predictions are computed as ŷ^m_{i→j} = y^m_{prev(i)} ∀j, where prev(i) = max{k : k ≤ i ∧ m ∈ O_k}. If there is no earlier observation of node m the prediction is just 0.
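A small sketch of this baseline for a single node, assuming arrays like those in the Section 3.1 example; the function and variable names are ours.

```python
import numpy as np

def predict_previous(y_node, mask_node, i):
    """Predict-Previous baseline for one node m: return y^m_{prev(i)}, or 0 if no earlier observation.

    y_node: array of shape (N_t, d_y) with the node's (zero-filled) observations.
    mask_node: boolean array of shape (N_t,), True where the node is in O_k.
    i: index of the latest time point that may be used (the prediction is the same for all j)."""
    observed_up_to_i = np.flatnonzero(mask_node[: i + 1])   # indices k <= i with m in O_k
    if observed_up_to_i.size == 0:
        return np.zeros_like(y_node[0])
    return y_node[observed_up_to_i[-1]]                     # prev(i) = max such k

# Example usage with a toy sequence of N_t = 4 scalar observations.
y_node = np.array([[0.0], [1.5], [0.0], [2.0]])
mask_node = np.array([False, True, False, True])
print(predict_previous(y_node, mask_node, i=2))             # -> [1.5]
```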
GRU-D (node) is essentially a version of TGNN4I without the GNN components. The GNNs GNN^U and GNN^W in the GRU update are replaced by matrix multiplications and the predictive model g includes only fully connected layers. Because of this no information can flow between nodes, and time series in different nodes are treated as independent. The GRU-D (node) model uses the exponential dynamics for the latent state. To stay consistent with the other models we still process all nodes in the graph concurrently. This means that with a batch size of B for GRU-D (node) we process B × N independent node time series.
The GRU-D (joint) model is defined similarly to GRU-D (node), but models all nodes jointly. All node observations are concatenated into one long vector y_i = [y_i^1, . . . , y_i^{|V|}]^⊺ and similarly the features are concatenated as x_i = [x_i^1, . . . , x_i^{|V|}]^⊺. This time series is then modeled using a single latent state, also based on the exponential dynamics. In GRU-D (joint) there is however a GRU update at each t_i, as at least one of the nodes is observed at each time point. We can think of this setup as a graph with a single node, for which the observations are high-dimensional. The high-dimensional predictions are split up and re-assigned to the original nodes in the graph for computing the loss. Note also that many entries in each y_i and x_i will be zero, corresponding to the nodes that are not observed. We again include indicators I{m ∈ O_i} as input to the GRU, but here for all nodes.
C.4 LG-ODE
For the LG-ODE model we use the code provided by the authors (Huang et al., 2020)³, making only small modifications to the data loading in order to correctly handle our datasets. The original code does not use a validation set, instead evaluating the model on the test set after each training epoch. We change this step to use validation data and save the model from the epoch with the lowest validation error. That model is then loaded and evaluated on the test data. Due to the high training time we have not been able to perform exhaustive hyperparameter tuning for the LG-ODE model. We have mainly used the default parameters of Huang et al. (2020) or a slightly smaller version of the model. The smaller version has halved dimensions for hidden layers in the GNN (from 128 to 64) and augmentation (from 64 to 32) used in the ODE decoder.

³https://github.com/ZijieH/LG-ODE

For computing LMSE using the LG-ODE model we need to compute each necessary ŷ^m_{i→j}. This is done by encoding all observations up to time t_i and then decoding N_max time steps into the future. It should be noted that the model is still trained as proposed by Huang et al. (2020), by encoding the first half of each time series and predicting the second half. During training each encoded time series is N_t/2 long, but when computing LMSE on the test set the length of the sequence being encoded varies. We have however not noticed any higher errors for time steps with a shorter or longer encoded sequence than that used during training.
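An outline of this rolling evaluation protocol is sketched below. The `encode` and `decode` callables stand in for the LG-ODE encoder and ODE decoder and are assumptions for illustration, not the actual interface of the released code.

```python
def rolling_predictions(observations, times, N_max, encode, decode):
    """Collect predictions y_hat^m_{i -> j} by repeatedly encoding a prefix and decoding N_max steps.

    `encode` and `decode` are placeholders for the model's actual interface."""
    predictions = {}
    for i in range(len(times)):
        latent = encode(observations[: i + 1], times[: i + 1])   # use all observations up to t_i
        future_times = times[i + 1 : i + 1 + N_max]              # the next N_max time points
        decoded = decode(latent, future_times)
        for j, y_hat in zip(range(i + 1, i + 1 + N_max), decoded):
            predictions[(i, j)] = y_hat
    return predictions
```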
C.5 Transformers
The Transformer (Vaswani et al., 2017) baselines use an encoder-decoder approach similar to LG-ODE. During training a prediction time t_i is however randomly sampled as i ∼ U({N_init, . . . , N_t − N_max}). Each sequence is then encoded up to time t_i and decoded over the next N_max time steps to produce predictions. The LMSE loss is used also for the Transformer models, but based only on this one prediction per time series.
The irregular time steps are handled by sinusoidal encodings concatenated to the input of both the encoder and decoder. Instead of basing these on the sequence index i, the exact timestamp t_i is used in

θ_i = t_i / 0.1^{2i/d_h}    (19)

to then compute the full encoding vector [sin(θ_1), . . . , sin(θ_{⌊d_h/2⌋}), cos(θ_1), . . . , cos(θ_{⌊d_h/2⌋})]^⊺.
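A small sketch of this continuous-time sinusoidal encoding as reconstructed from Eq. 19; one θ is computed per pair of encoding dimensions, and the function name is ours.

```python
import numpy as np

def time_encoding(t_i, d_h):
    """Sinusoidal encoding of a continuous timestamp t_i (Eq. 19)."""
    j = np.arange(d_h // 2)                       # one frequency per pair of dimensions
    theta = t_i / (0.1 ** (2 * j / d_h))          # theta = t_i / 0.1^(2j/d_h)
    return np.concatenate([np.sin(theta), np.cos(theta)])

print(time_encoding(t_i=0.37, d_h=8))             # vector of length d_h (here 8)
```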
Unobserved nodes are handled in Transformer (node) by masking the attention mechanism. For each node n, this prevents the model from attending to encoded time steps j s.t. n ∉ O_j. In Transformer (joint) the same approach as in GRU-D (joint) is instead used, where indicator variables are included as input.
The hyperparameters defining the Transformer architectures differ somewhat from the other models. We still let d_h represent the dimensionality of hidden representations, but here also tune the number of Transformer layers stacked together.
D DATASETS
For the PEMS-BAY and METR-LA datasets we use the versions pre-processed by Li et al. (2018), where weighted graphs are created based on thresholded road-network distances. We additionally remove edges from any node to itself and drop nodes not connected to the rest of the graph. The original time series is then split into sequences of length 288, corresponding to one day of observations at 5 minute intervals. In each such sequence we uniformly sample only N_t = 72 time points to keep, introducing irregularity between the time steps. Next we create the set {(n, i)}_{n∈V, i=1,...,N_t} with indices of all single node observations. We uniformly sample a fraction of this set as the observations to keep, independently for each sequence. The percentages in Table 1 refer to the percentage of observations kept from this set. This step introduces further irregularity as all nodes will generally not be observed at each time step. The METR-LA dataset has some observations missing initially, meaning that we can never get to exactly 100% observations for that dataset. From this pre-processing we end up with 180 time series in the PEMS-BAY dataset and 119 time series in the METR-LA dataset. We randomly assign 70% of these to the training set, 10% to the validation set and 20% to the test set.
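A sketch of this two-stage subsampling applied to one regularly sampled day of data; array names and the placeholder values are illustrative only.

```python
import numpy as np

def subsample_day(day_values, obs_fraction, rng):
    """Two-stage subsampling of one day of traffic data (shape (288, num_nodes)).

    1. Keep N_t = 72 of the 288 regularly sampled time points (shared by all nodes).
    2. Keep the given fraction of the set {(n, i)} of individual node observations."""
    num_steps, num_nodes = day_values.shape
    keep_times = np.sort(rng.choice(num_steps, size=72, replace=False))
    values = day_values[keep_times]                               # shape (72, num_nodes)

    num_obs = 72 * num_nodes
    keep_obs = rng.choice(num_obs, size=int(obs_fraction * num_obs), replace=False)
    obs_mask = np.zeros(num_obs, dtype=bool)
    obs_mask[keep_obs] = True
    obs_mask = obs_mask.reshape(72, num_nodes)
    return keep_times, values * obs_mask, obs_mask

rng = np.random.default_rng(0)
day = rng.normal(size=(288, 325))                                 # one PEMS-BAY-sized day, placeholder values
times, values, mask = subsample_day(day, obs_fraction=0.25, rng=rng)
```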
The USHCN daily data is openly available⁴ together with the positions of all sensor stations. We use the pre-processing of De Brouwer et al. (2019) to clean and subsample the data⁵. We choose the longer version of the time series, with observations between the years 1950 and 2000, but split this into multiple sequences of 100 days.

⁴https://cdiac.ess-dive.lbl.gov/ftp/ushcn_daily/
⁵A script for pre-processing the USHCN data is available together with their code at https://github.com/edebrouwer/gru_ode_bayes.

In order to build the spatial graph we perform an equirectangular map projection of the sensor station coordinates and then connect each station to its 10 nearest neighbors based on Euclidean distance. While there are many options for how to create this type of spatial graph, we have found the 10-nearest-neighbor approach to work sufficiently well in practice. Further investigating different methods for building spatial graphs is outside the scope of this paper. We additionally add edge weights to this graph following a similar method as Li et al. (2018) did for the traffic data. For sensor stations m and n at a distance d_{m,n} we assign the edge (m, n) the weight

e_{m,n} = exp(−(d_{m,n} / (4σ_e))²)    (20)

where σ_e is the standard deviation of all distances associated with edges. The constant 4 is chosen such that the furthest distance gets a weight close to 0 and the nearest distance a weight close to 1. For the USHCN data the only input feature is the elapsed time since the node was last observed. We use the same 70%/10%/20% training/validation/test split as for the traffic data.
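A sketch of the nearest-neighbor graph construction with Gaussian edge weights. It assumes already-projected 2D coordinates (the map projection itself is omitted), uses Eq. 20 as reconstructed above, and all names are illustrative.

```python
import numpy as np

def knn_graph_with_weights(coords, k=10):
    """Directed k-nearest-neighbor graph with Gaussian edge weights (Eq. 20).

    coords: (num_stations, 2) array of already-projected station positions.
    Returns (edges, weights) where edges[e] = (m, n) connects station m to neighbor n."""
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)                     # no self edges
    neighbors = np.argsort(dist, axis=1)[:, :k]        # k nearest neighbors of each station

    edges = np.array([(m, n) for m in range(len(coords)) for n in neighbors[m]])
    edge_dist = dist[edges[:, 0], edges[:, 1]]
    sigma_e = edge_dist.std()                          # std of all distances associated with edges
    weights = np.exp(-((edge_dist / (4 * sigma_e)) ** 2))
    return edges, weights

coords = np.random.default_rng(0).uniform(size=(30, 2))     # placeholder projected coordinates
edges, weights = knn_graph_with_weights(coords, k=10)
```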
We build two datasets from the USHCN data, one with the daily minimum temperature Tmin as target variable and one with the daily maximum temperature Tmax. The reason for separating these target variables, instead of using d_y = 2, is that the pre-processed data contains time points where only one of them is observed. Our model is not designed for such a setting, where we do not observe all dimensions of y_i^n. Extending TGNN4I to handle this is left as future work. The USHCN data also contains other climate variables, related to precipitation and snow coverage. These time series are less interesting to directly evaluate our model on, as many entries are just 0. Properly modeling these would require designing a suitable likelihood function and taking into account that some sensor stations never get any snow. As this would shift the focus from the core problem we choose to restrict our attention to Tmin and Tmax.
To create the graph for the synthetic data we first sample 20 node positions uniformly over [0, 1]². We then create an undirected graph using a Delaunay triangulation (De Loera et al., 2010) based on these positions. This undirected graph is turned into a directed acyclic graph by choosing a random ordering of the nodes and removing edges going from nodes later in the ordering to those earlier.
Based on this graph, 200 irregular time series are sampled according to Eqs. 16 and 17. An example is shown in Figure 4. The irregular time steps are created by first discretizing [0, 1] into 1000 time steps and then sampling 70 of these independently for each time series. Out of all node observations we keep 50%, sampled in the same way as for the traffic data. The parameters of the periodic signals are sampled according to ϕ^n ∼ U([20, 100]) and η^n ∼ U([0, 2π]). We resample all η^n for each sequence, but sample the node frequencies ϕ^n only once, treating these as underlying properties of the nodes. Out of the 200 sampled sequences we use 100 for training, 50 for validation and 50 for testing.
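A sketch of this graph construction using SciPy's Delaunay triangulation; the orientation step follows the description above and the function name is ours.

```python
import numpy as np
from scipy.spatial import Delaunay

def sample_synthetic_dag(num_nodes=20, seed=0):
    """Sample node positions, triangulate, and orient edges by a random node ordering."""
    rng = np.random.default_rng(seed)
    positions = rng.uniform(size=(num_nodes, 2))            # node positions in [0, 1]^2

    # Undirected edges from the Delaunay triangulation (each triangle contributes 3 edges).
    tri = Delaunay(positions)
    undirected = set()
    for a, b, c in tri.simplices:
        undirected.update({tuple(sorted((a, b))), tuple(sorted((b, c))), tuple(sorted((a, c)))})

    # Keep only edges going from earlier to later in a random ordering -> directed acyclic graph.
    order = rng.permutation(num_nodes)
    rank = np.argsort(order)                                 # rank[n] = position of node n in the ordering
    directed = [(m, n) if rank[m] < rank[n] else (n, m) for m, n in undirected]
    return positions, directed

positions, edges = sample_synthetic_dag()
```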
We perform hyperparameter tuning by exhaustive grid search over combinations of parameter values. The configuration that achieves the lowest validation LMSE is then used for the final experiment. An overview of the hyperparameter values considered and the final configurations for all experiments is given in Table 5. All hyperparameter tuning for GRU-D and TGNN4I is done on the versions of the models with exponential dynamics. For these models we generally observe that larger architectures perform better, but the memory usage limits how much we can increase the model size. While the exact model architecture of TGNN4I does impact the results, the model does not seem particularly sensitive to the learning rate or batch size.
We evaluate LMSE on the validation set after each training epoch, stopping the training early if the validation error does not
improve for 20 epochs. Except for LG-ODE we train all models until this early stopping occurs, which typically takes less
than 150 epochs.
For the traffic data we perform hyperparameter tuning on the versions of the datasets with 25% observations. The same
hyperparameter configuration, shown in Table 5, performed the best for both PEMS-BAY and METR-LA.
We used the slightly downscaled version of the LG-ODE model for the traffic data. We tried also the default hyperparam-
eters on the METR-LA dataset, but this did not improve the results. LG-ODE was trained for 50 epochs with a batch size
of 8 and we observed no further improvements when trying to train the model for longer.
Figure 4: Examples of signals in the synthetic periodic data for 5 of the 20 nodes. The lines show clean signals κ^n(t) and each dot a (noisy) observation y^n(t_i). Note that the time steps are irregular and that not all signals are observed at each observation time.
For the USHCN data we perform the hyperparameter tuning on the Tmin dataset. Since the USHCN graph contains many
nodes we are somewhat more restricted in how large we can scale up the models.
On the periodic data we tried the same hyperparameters for GRU-D (joint) as for the other models, but no options gave any
useful results. Because of this we exclude the model from this experiment. The number of iterations until the validation
LMSE stops decreasing is higher for the periodic data, with models training up to 500 epochs.
We used default hyperparameters for the LG-ODE model on the periodic data, but with a batch size of 16. As this graph
only contains 20 nodes we were here able to train 5 LG-ODE models with different random seeds.
For the loss weighting experiment we do not perform any new hyperparameter tuning, but use the best configuration from the traffic data experiment. One TGNN4I model was trained for each weighting function. We here use a batch size of 4 and N_init = 25, which allows us to study predictions at higher ∆_t into the future.
Appendix References
De Loera, J. A., Rambau, J., and Santos, F. (2010). Triangulations, volume 25 of Algorithms and Computation in Mathe-
matics. Springer.
Kingma, D. P. and Ba, J. (2015). Adam: A method for stochastic optimization. In International Conference on Learning
Representations (ICLR).
Table 5: Values considered in hyperparameter tuning for the different datasets. Bold numbers represent the best performing configuration, which was then used in the final experiment.