Temporal Graph Neural Networks for Irregular Data

Joel Oskarsson (Linköping University), Per Sidén (Linköping University, Arriver Software AB), Fredrik Lindsten (Linköping University)

Proceedings of the 26th International Conference on Artificial Intelligence and Statistics (AISTATS) 2023, Valencia, Spain. PMLR: Volume 206. Copyright 2023 by the author(s).

Abstract

This paper proposes a temporal graph neural network model for forecasting of graph-structured irregularly observed time series. Our TGNN4I model is designed to handle both irregular time steps and partial observations of the graph. This is achieved by introducing a time-continuous latent state in each node, following a linear Ordinary Differential Equation (ODE) defined by the output of a Gated Recurrent Unit (GRU). The ODE has an explicit solution as a combination of exponential decay and periodic dynamics. Observations in the graph neighborhood are taken into account by integrating graph neural network layers in both the GRU state update and the predictive model. The time-continuous dynamics additionally enable the model to make predictions at arbitrary time steps. We propose a loss function that leverages this and allows for training the model for forecasting over different time horizons. Experiments on simulated data and real-world data from traffic and climate modeling validate the usefulness of both the graph structure and the time-continuous dynamics in settings with irregular observations.

1 INTRODUCTION

Many real-world systems can be modeled as graphs. When data about such systems is collected over time, the resulting time series has additional structure induced by the graph topology. Examples of such temporal graph data are the traffic speed in a road network (Li et al., 2018) and the spread of disease in different regions (Rozemberczki et al., 2021). Building accurate machine learning models in this setting requires taking both the temporal and graph structure into account.

While many works have studied the problem of modeling temporal graph data (Wu et al., 2020a), these approaches generally assume a constant sampling rate and no missing observations. In real data it is not uncommon to have irregular or missing observations due to non-synchronous measurements or errors in the data collection process. Dealing with such irregularities is especially challenging in the graph setting, as node observations are heavily interdependent. While observations in one node could be modeled using existing approaches for irregular time series (Rubanova et al., 2019; Schirmer et al., 2022), the situation becomes complicated when irregular observations in different nodes occur at different times.

In this paper we tackle two kinds of irregular observations in graph-structured time series: (1) irregularly spaced observation times, and (2) only a subset of nodes being observed at each time point. We propose the TGNN4I model for time series forecasting. The model uses a time-continuous latent state in each node, which allows for predictions to be made at any time point. The latent dynamics are motivated by a linear Ordinary Differential Equation (ODE) formulation, which has a closed form solution. This ODE solution corresponds to an exponential decay (Che et al., 2018) together with an optional periodic component. New observations are incorporated into the state by a Gated Recurrent Unit (GRU) (Cho et al., 2014). Interactions between the nodes are captured by integrating Graph Neural Network (GNN) layers both in the latent state updates and in the predictive model. To train our model we introduce a loss function that takes into account the time-continuous model formulation and the irregularity in the data. We evaluate the model on forecasting problems using traffic and climate data of varying degrees of irregularity and a simulated dataset of periodic signals.

2 RELATED WORK

GNNs are deep learning models for graph-structured data (Gilmer et al., 2017; Wu et al., 2020a). By learning representations of nodes, edges or entire graphs, GNNs can be used for many different machine learning tasks.
These models have been successfully applied to diverse areas such as weather forecasting (Lam et al., 2022), molecule generation (Zang and Wang, 2020) and video classification (Kosman and Di Castro, 2022). Temporal GNNs also model time-varying signals in the graph. This extension to graph-structured time series is achieved by combining GNN layers with recurrent (Li et al., 2018), convolutional (Wu et al., 2019; Yu et al., 2018) or attention (Guo et al., 2019) architectures. While the graph is commonly assumed to be known a priori, some approaches also explore learning the graph structure jointly with the temporal GNN model (Zhang et al., 2022; Wu et al., 2020b).

Time series forecasting is a well-studied problem and a vast number of methods exist in the literature. Traditional methods in the area include ARIMA models, vector auto-regression and Gaussian Processes (Box et al., 2015; Roberts et al., 2013). Many deep learning approaches have also been applied to time series forecasting. This includes Recurrent Neural Networks (RNNs) (Lazzeri, 2020), temporal convolutional neural networks (Chen et al., 2020) and Transformers (Giuliari et al., 2021).

The latent state of an RNN can be extended to continuous time by letting the state decay exponentially in between observations (Che et al., 2018). Such decay mechanisms have been used for modeling data with missing observations (Che et al., 2018), doing imputation (Cao et al., 2018) and parametrizing point processes (Mei and Eisner, 2017). Another way to define time-continuous states is by learning a more general ODE. In neural ODE models (Chen et al., 2018; Kidger, 2021) the latent state is the solution to an ODE defined by a neural network. Neural ODEs have been successfully applied to irregular time series (Rubanova et al., 2019) and can be used to define the dynamics of temporal GNNs (Fang et al., 2021; Poli et al., 2021). Poli et al. (2021) use such a model for graph-structured time series with irregular time steps, but consider the full graph to be observed at each observation time. The GraphCON framework of Rusch et al. (2022) also combines GNNs with a second-order system of ODEs. In GraphCON the time axis of the ODE is however aligned with the layers of the GNN. This makes the framework suitable for node- and graph-level predictive tasks, rather than time-series modeling.

Another related body of work is concerned with using GNNs for data imputation in time series (Cini et al., 2022; Gordon et al., 2021; Omidshafiei et al., 2022). These methods generally do not assume that the time series comes with some known graph structure. Instead, the GNN is defined on a graph specifically constructed for the purpose of performing imputation.

The closest work to ours found in the literature is the LG-ODE model of Huang et al. (2020). They consider the same types of irregularities, but are motivated more by a multi-agent systems perspective. LG-ODE is based on an encoder-decoder architecture and trained by maximizing the Evidence Lower Bound (ELBO). The encoder builds a spatio-temporal graph of observations and aggregates information using an attention mechanism (Vaswani et al., 2017). The decoder then extrapolates to future times by solving a neural ODE. The encoder-decoder setup differs from our TGNN4I model, which sequentially incorporates observations and auto-regressively makes predictions at every time point. Because of the multi-agent motivation, Huang et al. (2020) also focus more on smaller graphs with few interacting entities, but longer forecasting horizons. An extended version of LG-ODE, called CG-ODE (Huang et al., 2021), also aims to learn the graph structure in the form of a weighted adjacency matrix.

3 A TEMPORAL GNN FOR IRREGULAR OBSERVATIONS

3.1 Setting

Consider a directed or undirected graph G = (V, E) with node set V and edge set E. Let {t_i}_{i=1}^{N_t} be a set of (possibly irregular) time points s.t. 0 < t_1 < ... < t_{N_t}. We will here present our model for a single time series, but in general we have a dataset containing multiple time series. Let O_i ⊆ V be the set of nodes observed at time t_i. If n ∈ O_i, we denote the observed value as y_i^n ∈ R^{d_y} and any accompanying input features as x_i^n ∈ R^{d_x}. We let y_i^n = x_i^n = 0 if n ∉ O_i. Note that this general setting encompasses a spectrum of irregularity, from single node observations (|O_i| = 1 ∀i) to fully observed graphs (O_i = V ∀i). A table of notation is given in appendix A.

The problem we consider is that of forecasting. At future time points we want to predict the value at each node, given all earlier observations. Since observations are irregular and we want to make predictions at arbitrary times, we need to consider models that can make predictions for any time in the future.

We consider a model where at each node n a latent state h^n(t) ∈ R^{d_h} evolves over continuous time. We define the dynamics of h^n(t) by: (1) how h^n(t) evolves in between observations, and (2) how h^n(t) is updated when node n is observed. If node n is observed at time t_i we incorporate this observation into the latent state using a GRU cell (Cho et al., 2014). This information can then be used for making predictions at future time points. An overview of our model is given in Figure 1.

Figure 1: Left: Example graph with four nodes and their latent states, here one-dimensional for illustration purposes. Node observations are indicated with circles. Right: Schematic diagram of the GRU update and predictive model g. Everything inside the shaded area happens instantaneously at time t_i. The GRU cell outputs the new initial state h̄_i^n + ĥ_i^n, the static component h̄_i^n and the ODE parameters ω_i^n. The parameters ω_i^n define the dynamics until the next observation. The prediction ŷ_i^n here is based on information from all earlier time points.

3.2 Time-continuous Latent Dynamics

Consider a time interval ]t_i, t_j] where node n is not observed. During this interval we define the latent state of node n by the sum h^n(t) = h̄_i^n + h̃^n(t). The first part h̄_i^n is constant over the time interval, constituting a base level around which the state evolves. The dynamics of h̃^n(t) are dictated by a linear ODE of the form

$$d\tilde{h}^n(t) = A\tilde{h}^n(t)\,dt \tag{1}$$

with A ∈ R^{d_h × d_h} and initial condition h̃^n(t_i) = ĥ_i^n. Over this interval the ODE has a closed form solution (Arrowsmith and Place, 1992) given by

$$\tilde{h}^n(t) = \exp(\delta_t A)\,\hat{h}_i^n \tag{2}$$

where exp is the matrix exponential function and we define δ_t = t − t_i. Assuming that all eigenvalues of A are unique, we can use its eigen-decomposition¹ A = QΛQ⁻¹ to write

$$\tilde{h}^n(t) = Q\exp(\delta_t \Lambda)Q^{-1}\hat{h}_i^n. \tag{3}$$

Since Q contains an eigen-basis of R^{d_h} it can be viewed as a transition matrix, changing the basis of the latent space. While we could in principle learn Q, we note that the basis of the latent space has no physical interpretation and we can without loss of generality choose it such that Q = I.

¹ Recall that the eigen-decomposition of a diagonalizable matrix A is given by A = QΛQ⁻¹, where Λ is a diagonal matrix containing the eigenvalues of A and the columns of Q are the corresponding eigenvectors (Searle and Khuri, 1982).

Next, if we make the assumption that A has real eigenvalues, the resulting dynamics are given by

$$\tilde{h}^n(t) = \exp\!\left(-\delta_t \operatorname{diag}(\omega_i^n)\right)\hat{h}_i^n \tag{4}$$

where ω_i^n > 0 is a parameter vector representing the negation of the eigenvalues. We arrive at an exponential decay in between observations, a type of dynamics used with GRU updates in existing works (Che et al., 2018; Cao et al., 2018). The positive restriction on ω_i^n ensures the stability of the dynamical system in the limit, which for the complete state means that h^n(t_j) → h̄_i^n as t_j → ∞.

On the other hand, if we instead allow A to have complex eigenvalues, we can decompose A using the real Jordan form (Horn and Johnson, 2012). This makes Λ block-diagonal with 2 × 2 blocks

$$C_j = \begin{bmatrix} a_j & -b_j \\ b_j & a_j \end{bmatrix} \tag{5}$$

corresponding to complex conjugate eigenvalues a_j ± b_j i. If we compute the matrix exponential for this Λ we end up with a combination of exponential decay and periodic dynamics (Arrowsmith and Place, 1992). The solution gives dynamics that couple each pair of dimensions as

$$\left[\tilde{h}^n(t)\right]_{j:j+1} = \exp(\delta_t a_j)\begin{bmatrix} \cos(b_j \delta_t) & -\sin(b_j \delta_t) \\ \sin(b_j \delta_t) & \cos(b_j \delta_t) \end{bmatrix}\left[\hat{h}_i^n\right]_{j:j+1}. \tag{6}$$

We parametrize also these dynamics with ω_i^n = [−a_1, −a_3, ..., −a_{d_h−1}, b_1, b_3, ..., b_{d_h−1}]^⊤ > 0. Note that when b_j → 0 these periodic dynamics reduce to the exponential decay in Eq. 4, but with the parameter for −a_j shared across pairs of dimensions. We consider both the exponential decay and the more general periodic dynamics as options for our model. The periodic dynamics can naturally be seen as advantageous for modeling periodic data, as we will explore empirically in section 4.4.
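To make the closed-form dynamics above concrete, the following is a minimal sketch of how the decayed latent state h^n(t_i + δ_t) from Eq. 4 and Eq. 6 could be evaluated in PyTorch. This is only an illustration under our own naming and shape conventions, not the released TGNN4I implementation.

```python
import torch

def decayed_state(h_bar, h_hat, omega, delta_t, periodic=False):
    """Evaluate h^n(t_i + delta_t) = h_bar + h_tilde(delta_t) in closed form.

    h_bar, h_hat, omega: (num_nodes, d_h) tensors; omega > 0 as output by Eq. 12
    delta_t: scalar tensor holding the time since the last update
    """
    if not periodic:
        # Eq. 4: element-wise exponential decay towards the base level h_bar
        h_tilde = torch.exp(-delta_t * omega) * h_hat
    else:
        # Eq. 6: omega = [-a_1, -a_3, ..., b_1, b_3, ...]; each (a_j, b_j) pair
        # drives a decaying rotation of two latent dimensions
        d_h = h_hat.shape[-1]
        neg_a, b = omega[..., : d_h // 2], omega[..., d_h // 2:]
        decay = torch.exp(-delta_t * neg_a)            # = exp(delta_t * a_j), a_j < 0
        cos, sin = torch.cos(b * delta_t), torch.sin(b * delta_t)
        h1, h2 = h_hat[..., 0::2], h_hat[..., 1::2]    # paired dimensions (j, j+1)
        h_tilde = torch.stack(
            (decay * (cos * h1 - sin * h2), decay * (sin * h1 + cos * h2)), dim=-1
        ).reshape(h_hat.shape)
    return h_bar + h_tilde
```

Setting `periodic=False` recovers the pure exponential decay, matching how the periodic dynamics reduce to Eq. 4 when b_j → 0.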

3.3 Incorporating Observations from the Graph

When node n is observed at time t_i the observation is incorporated into the latent state by a GRU cell, incurring an instantaneous jump in the latent dynamics to a new value h̄_i^n + ĥ_i^n. Inspired by the continuous-time LSTM of Mei and Eisner (2017), we extend the GRU cell to output also the parameters ω_i^n, which define the dynamics of h̃^n(t) for the next time interval.

So far we have considered each node of the graph as a separate entity, but in a graph-based system future observations of a node can depend on the history of the entire graph. To capture this we let the state of each node depend on observations and states in its graph neighborhood. This is achieved by introducing GNNs (Gilmer et al., 2017; Wu et al., 2020a) in the GRU update (Zhao et al., 2018). We replace matrix multiplications with GNNs, taking inputs both from the node n itself and from its neighborhood N(n) = {m | (m, n) ∈ E}. The type of GNN we use is a simple version of a message passing neural network (Gilmer et al., 2017), defined as

$$\text{GNN}\!\left(h^n, \{h^m\}_{m \in N(n)}\right) = W_1 h^n + \frac{1}{|N(n)|}\sum_{m \in N(n)} e_{m,n} W_2 h^m \tag{7}$$

where the matrices W_1, W_2 are learnable parameters shared among all nodes and e_{m,n} is an edge weight associated with the edge (m, n). The use of edge weights allows for incorporating prior information about the strength of connections in the graph. We can additionally stack multiple such GNN layers, append fully-connected layers and include non-linear activation functions in between. The inclusion of multiple GNN layers makes the GRU update dependent on a larger graph neighborhood.

In a standard GRU cell the input and previous state are first mapped to three new representations using matrices U and W (Cho et al., 2014). These representations are then used to compute the state update. In our GNN-based GRU update the matrices are replaced by GNNs and we require seven such intermediate representations to update ĥ_i^n, h̄_i^n and ω_i^n, as shown below. The node states are combined as

$$[u_{i,1}^n, \dots, u_{i,7}^n]^\top = \text{GNN}^U\!\left(h^n(t_i), \{h^m(t_i)\}_{m \in N(n)}\right) \tag{8}$$

where the resulting vector is split into seven equally sized chunks. Another GNN is then used for the combined observations and input features x̃_i^n = [y_i^n, x_i^n]^⊤,

$$[v_{i,1}^n, \dots, v_{i,7}^n]^\top = \text{GNN}^W\!\left(\tilde{x}_i^n, \{\tilde{x}_i^m\}_{m \in N(n)}\right). \tag{9}$$

Note that while all nodes might not be observed at time t_i, x̃_i^m = 0 for any unobserved m, which thus does not contribute to the sum over neighbors in the GNN. With the combined information from the graph neighborhood the full GRU update is computed as

$$r_i^n = \sigma(v_{i,1}^n + u_{i,1}^n + b_1) \tag{10a}$$
$$z_i^n = \sigma(v_{i,2}^n + u_{i,2}^n + b_2) \tag{10b}$$
$$q_i^n = \tanh(v_{i,3}^n + (r_i^n \odot u_{i,3}^n) + b_3) \tag{10c}$$
$$\bar{h}_i^n + \hat{h}_i^n = (1 - z_i^n) \odot h^n(t_i) + z_i^n \odot q_i^n \tag{10d}$$

$$\bar{r}_i^n = \sigma(v_{i,4}^n + u_{i,4}^n + b_4) \tag{11a}$$
$$\bar{z}_i^n = \sigma(v_{i,5}^n + u_{i,5}^n + b_5) \tag{11b}$$
$$\bar{q}_i^n = \tanh(v_{i,6}^n + (\bar{r}_i^n \odot u_{i,6}^n) + b_6) \tag{11c}$$
$$\bar{h}_i^n = (1 - \bar{z}_i^n) \odot \bar{h}_{i-1}^n + \bar{z}_i^n \odot \bar{q}_i^n \tag{11d}$$

$$\omega_i^n = \log\!\left(1 + \exp(v_{i,7}^n + u_{i,7}^n + b_7)\right) \tag{12}$$

where σ is the sigmoid function and b_1–b_7 are learnable bias parameters. In Eq. 11d we let h̄_k^n = h̄_{k−1}^n if n ∉ O_k. Eq. 10a–10d correspond to one GRU update, using the decayed state h^n(t_i). Eq. 11a–11d define a separate GRU update, but for the decay target h̄_i^n. Note that we get ĥ_i^n, the initial value of h̃^n(t), implicitly from the difference between Eq. 10d and Eq. 11d. Finally Eq. 12 computes the parameters ω_i^n defining the dynamics of h̃^n(t) up until the next observation of node n. The parameters of the GRU cell are shared for all nodes in the graph. To capture any node-specific properties we parametrize initial states h^n(0) separately for all nodes and learn these jointly with the rest of the model.

3.4 Predictions

The time-continuous dynamics ensure that there is a well-defined latent state h^n(t) in each node at each time point t. The value of the time series can then be predicted at any time by applying a mapping g: R^{d_h} → R^{d_y} from this latent state to the prediction ŷ_j^n.

The addition of GNNs into the GRU update makes the latent state dynamics of each node dependent on historical observations in its neighborhood. However, since the GRU updates happen only when a node is observed, information from observed neighbors might not be incorporated immediately in the latent state. Consider three consecutive time points t_i < t_{i+1} < t_{i+2} s.t. n ∈ O_i, n ∉ O_{i+1}. Then any observation y_{i+1}^m for m ∈ O_{i+1} ∩ N(n) will not be taken into account by the model for the prediction ŷ_{i+2}^n, as that prediction is based on a latent state with only information from time t_i. To remedy this we choose also the predictive model g to contain one or more GNN layers,

$$\hat{y}_j^n = \text{GNN}^g\!\left([h^n(t_j), x_j^n]^\top, \left\{[h^m(t_j), x_j^m]^\top\right\}_{m \in N(n)}\right). \tag{13}$$

This way g takes the latent states and input features of the whole neighborhood into account for prediction. We name the full proposed model Temporal Graph Neural Network for Irregular data (TGNN4I).
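As an illustration of how Eqs. 7–12 above fit together, below is a minimal dense-adjacency sketch of the GNN-based GRU update in PyTorch. It is a sketch under our own naming conventions (a single GNN layer for each of GNN^U and GNN^W, a dense edge-weight matrix), not the released PyG-based implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMessagePassing(nn.Module):
    """One GNN layer as in Eq. 7: W1 h^n + (1/|N(n)|) sum_m e_{m,n} W2 h^m."""

    def __init__(self, d_in, d_out):
        super().__init__()
        self.w1 = nn.Linear(d_in, d_out, bias=False)
        self.w2 = nn.Linear(d_in, d_out, bias=False)

    def forward(self, h, edge_weights):
        # h: (num_nodes, d_in), edge_weights[m, n] = e_{m,n} (0 where there is no edge)
        num_neighbors = (edge_weights > 0).sum(dim=0).clamp(min=1)   # |N(n)| per node
        neighbor_sum = edge_weights.T @ self.w2(h)                   # sum_m e_{m,n} W2 h^m
        return self.w1(h) + neighbor_sum / num_neighbors.unsqueeze(-1)

class GraphGRUUpdate(nn.Module):
    """GNN-based GRU update of Eqs. 8-12, producing h_bar, h_hat and omega."""

    def __init__(self, d_h, d_in):
        super().__init__()
        self.gnn_u = SimpleMessagePassing(d_h, 7 * d_h)    # Eq. 8
        self.gnn_w = SimpleMessagePassing(d_in, 7 * d_h)   # Eq. 9
        self.bias = nn.Parameter(torch.zeros(7, d_h))      # b_1, ..., b_7

    def forward(self, h_decayed, x_tilde, h_bar_prev, edge_weights):
        u = self.gnn_u(h_decayed, edge_weights).chunk(7, dim=-1)
        v = self.gnn_w(x_tilde, edge_weights).chunk(7, dim=-1)
        b = self.bias

        # Eq. 10a-10d: GRU update for the full state h_bar + h_hat
        r = torch.sigmoid(v[0] + u[0] + b[0])
        z = torch.sigmoid(v[1] + u[1] + b[1])
        q = torch.tanh(v[2] + r * u[2] + b[2])
        h_new = (1 - z) * h_decayed + z * q

        # Eq. 11a-11d: separate GRU update for the decay target h_bar
        r_bar = torch.sigmoid(v[3] + u[3] + b[3])
        z_bar = torch.sigmoid(v[4] + u[4] + b[4])
        q_bar = torch.tanh(v[5] + r_bar * u[5] + b[5])
        h_bar = (1 - z_bar) * h_bar_prev + z_bar * q_bar

        # Eq. 12: positive dynamics parameters via softplus
        omega = F.softplus(v[6] + u[6] + b[6])

        h_hat = h_new - h_bar    # implicit, from the difference of Eq. 10d and Eq. 11d
        return h_bar, h_hat, omega
```

In the full model this update is only applied to the nodes observed at t_i, and GNN^U and GNN^W may stack several such layers with non-linearities in between.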

3.5 Loss Function

In order to make predictions for arbitrary future time points we introduce a suitable loss function based on the time-continuous nature of the model. Let ŷ_{i→j}^m be the prediction for node m at time t_j, based on observations of all nodes at times t ≤ t_i. Define also a time-continuous weighting function w: R_+ → R_+ and the set τ_{m,i} = {j : m ∈ O_j ∧ i < j} containing the indices of all times after t_i where node m is observed. We do not include predictions from the first N_init time steps in the loss, treating this as a short warm-up phase. The loss function for one graph-structured time series is then

$$\mathcal{L}_\ell = \frac{1}{N_{\text{obs}}}\sum_{m \in V}\;\sum_{i=N_{\text{init}}+1}^{N_t}\;\sum_{j \in \tau_{m,i}} \frac{\ell\!\left(\hat{y}_{i\to j}^m, y_j^m\right) w(t_j - t_i)}{j - N_{\text{init}} - 1} \tag{14}$$

where ℓ is any loss function for a single observation and N_obs = Σ_{i=N_init+2}^{N_t} |O_i| is the total number of node observations. We use L_MSE with Mean Squared Error (MSE) as ℓ, but the framework is fully compatible with other loss functions as well. This includes general probabilistic predictions with a negative log-likelihood loss. Dividing by j − N_init − 1, the number of times observation j has been predicted, guarantees that later observations are not given a higher total weight.

The weighting function w allows for specifying which time horizons should be prioritized by the model. This choice is highly application-dependent and should capture which predictions are of interest when later deploying the model in some real-world setting. If we care about all time horizons, but want to prioritize predictions close in time, a suitable choice could be w(∆_t) = exp(−∆_t/Ω). It might also be desirable to focus predictive capabilities on a specific ∆_t. If we want predictions around ∆_t = µ to be prioritized we can for instance use a Gaussian kernel as

$$w(\Delta_t) = \exp\!\left(-\left(\frac{\Delta_t - \mu}{\Omega}\right)^{2}\right). \tag{15}$$

A limitation of the proposed loss function is the quadratic scaling in the number of time steps, as predictions are made from all times to all future observations. This especially requires large amounts of memory for nodes that are observed at many time steps. However, for many sensible choices of w, predictions far into the future have a close to 0 impact on the loss. In practice we can utilize this to approximate L_ℓ by only making predictions N_max time steps into the future. This approximation explicitly corresponds to setting τ_{m,i} = {j : m ∈ O_j ∧ i < j ≤ i + N_max} and changing the denominator in Eq. 14 to min(N_max, j − N_init − 1). Alternatively, we can select a weight function with finite support, which implies that many terms in Eq. 14 will be exactly zero. This does however require more bookkeeping than the aforementioned truncation method.
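To make the training objective concrete, here is a minimal sketch of the truncated loss approximation of Eq. 14 for one graph-structured time series. The tensor layout, 0-based indexing and names are our own illustrative choices.

```python
import torch

def weighted_forecast_loss(pred, target, obs_mask, times, w, n_init=5, n_max=10):
    """Sketch of the truncated L_MSE from Eq. 14 for one graph-structured time series.

    pred: (N_t, N_t, num_nodes, d_y), pred[i, j] = prediction for time j from observations up to time i
    target: (N_t, num_nodes, d_y) observed values (0 where unobserved)
    obs_mask: (N_t, num_nodes) boolean, True where a node is observed
    times: (N_t,) observation time points
    w: callable mapping a time difference tensor to a weight
    """
    n_t = times.shape[0]
    total = torch.zeros(())
    n_obs = obs_mask[n_init + 1:].sum()          # total number of node observations in the loss
    for i in range(n_init, n_t - 1):
        for j in range(i + 1, min(i + n_max, n_t - 1) + 1):
            mask = obs_mask[j]                                   # nodes in O_j
            if not mask.any():
                continue
            mse = ((pred[i, j] - target[j]) ** 2).mean(dim=-1)   # squared error per node
            denom = min(n_max, j - n_init)                       # truncated denominator
            total = total + w(times[j] - times[i]) * mse[mask].sum() / denom
    return total / n_obs
```

In practice the double loop would be replaced by batched tensor operations, but the sketch spells out which prediction-observation pairs enter the loss and how they are weighted.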

4 EXPERIMENTS

The TGNN4I model was implemented² using PyTorch and PyG (Fey and Lenssen, 2019). We evaluate the model on a number of different datasets. See appendix D and E for details on the pre-processing and experimental setups used. As the loss function L_MSE captures errors throughout an entire time series we adopt this also as our evaluation metric. Given that the loss weighting w used for training accurately represents how we value predictions at different time horizons, it is natural to use the same choice for evaluation. In our experiments we rescale each time series so that t ∈ [0, 1] and use w(∆_t) = exp(−∆_t/0.04). By inspecting w and the time steps in the data we also choose a suitable N_max = 10.

² Our code and datasets are available at https://github.com/joeloskarsson/tgnn4i.

We consider three versions of our TGNN4I model: (static) with a constant latent state in between observations (h̃^n(t) = ĥ_i^n ∀ t ∈ ]t_i, t_j]), (exponential) with the exponential decay dynamics from Eq. 4 and (periodic) with the combined decay and periodic dynamics from Eq. 6. In all our experiments the training time of a single model on an NVIDIA A100 GPU is less than an hour, and for the smallest dataset (METR-LA) not more than 20 minutes.

4.1 Baselines

We compare TGNN4I to multiple baseline models. As a simple starting point we consider a model that always predicts the last observed value in each node for all future time points (Predict Previous). Che et al. (2018) propose the GRU-D model for irregular time series, which we extend with our parametrization of the exponential decay and include as a baseline. GRU-D does not use the graph structure explicitly, so there are two ways to adapt this model to our setting. We can view the entire graph-structured time series as one series with (|V|d_y)-dimensional vectors at each time step (GRU-D (joint)). Alternatively, we can view the time series in each node as independent (GRU-D (node)), which is essentially the same as TGNN4I with all edges in the graph removed. Two Transformer baselines are also included, used in the same (joint) and (node) configurations. In these models the irregular observations are handled through attention masks and the use of timestamps in the sinusoidal positional encodings. We also compare against the LG-ODE model of Huang et al. (2020) using the code provided by the authors. We follow their proposed training procedure, where the model encodes the first half of each time series and has to predict the second half. When computing L_MSE using LG-ODE we encode all observations up to t_i and decode from that time point in order to get each ŷ_{i→j}^m. More details on the baseline models are given in appendix C. An attempt was also made to adapt the RAINDROP model of Zhang et al. (2022) to our forecasting setting. We were however unable to get useful predictions without making major changes to the model and it is therefore not included here.

4.2 Traffic Data

We experiment on the PEMS-BAY and METR-LA datasets, containing traffic speed sensor data from the California highway system (Li et al., 2018; Chen et al., 2001). To be able to control the degree of irregularity, we start from regularly sampled data and choose subsets of observations. We use the versions of the datasets pre-processed by Li et al. (2018). Each dataset is split up into time series of 288 observations (1 day). PEMS-BAY contains 180 such time series with 325 nodes and METR-LA 119 time series with 207 nodes. We include the time of day and the time since the node was last observed as input features x_i^n. In order to introduce irregularity in the time steps we next subsample each time series by keeping only 72 of the 288 observations. These N_t = 72 observation times are the same for all nodes. However, from these subsampled time series we furthermore sample subsets containing 25%–100% of all N_t × |V| individual node observations. This results in irregular observation time points and a fraction of nodes observed at each time. Our additional pre-processing prevents us from a direct comparison with Li et al. (2018), as their method does not handle irregular observations.
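A sketch of this two-stage subsampling (first irregular time points shared by all nodes, then a random subset of node observations) is given below, with our own illustrative names; the exact pre-processing scripts accompany the released code and may differ in detail.

```python
import numpy as np

rng = np.random.default_rng(42)
num_nodes = 207                                     # e.g. METR-LA
day = rng.normal(size=(288, num_nodes))             # placeholder for one day of 5 min sensor data

# Step 1: keep 72 of the 288 time steps, shared by all nodes (irregular time spacing)
kept_times = np.sort(rng.choice(288, size=72, replace=False))
sub = day[kept_times]                               # (72, num_nodes)

# Step 2: keep an exact fraction (here 25%) of the N_t x |V| individual node observations
obs_fraction = 0.25
n_keep = int(obs_fraction * sub.size)
obs_mask = np.zeros(sub.shape, dtype=bool)
obs_mask.flat[rng.choice(sub.size, size=n_keep, replace=False)] = True
```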
using the sensor positions. Each full dataset contains 186
We report results for both datasets in Table 1 and high-
time series of length Nt = 100 on a graph of 1123 nodes.
light the best performing models on METR-LA in Figure 2.
The pre-processed USHCN data is sparsely observed with
GRU-D (joint) has a hard time modeling all nodes jointly,
only around 5% of potential node observations present.
often not performing better than the simple Predict Previ-
ous baseline. The Transformer models achieve somewhat We report results on both datasets in Table 2. Due to the
better results, but still not competitive with TGNN4I. We large size of the graph it was not feasible to apply the
additionally note that the Transformers can be highly sen- LG-ODE model here. We note that for these datasets the
sitive to the random seed used for initialization, something (joint) baselines clearly outperform the (node) versions.
that we have not observed for other models. Comparing For GRU-D this is the opposite of what we saw in the traf-
TGNN4I and GRU-D (node) in Figure 2 it can be noted fic data. This can be explained by the fact that climate data
Joel Oskarsson, Per Sidén, Fredrik Lindsten

Table 1: Test LMSE (multiplied by 102 ) for the traffic datasets with different fractions of node observations. Where appli-
cable we report mean ± one standard deviation across 5 runs with different random seeds. The lowest mean LMSE for each
dataset and observation percentage is marked in bold.

PEMS-BAY
Model 25% 50% 75% 100%
Predict Previous 26.32 18.60 15.25 13.50
GRU-D (joint) 18.79±0.07 18.27±0.10 17.93±0.08 17.75±0.12
GRU-D (node) 8.79±0.06 6.62±0.02 5.82±0.06 5.49±0.06
Transformer (joint) 12.05±1.19 13.13±2.59 12.21±1.95 11.09±1.38
Transformer (node) 16.49±0.17 14.44±0.48 13.20±0.56 13.16±1.23
LG-ODE 27.00 24.93 24.71 23.52
TGNN4I (static) 7.41±0.09 5.98±0.07 5.29±0.08 4.89±0.05
TGNN4I (exponential) 7.10±0.07 5.78±0.05 5.23±0.03 4.87±0.09
TGNN4I (periodic) 7.10±0.09 5.80±0.08 5.22±0.09 4.87±0.02
METR-LA
Model 25% 50% 75% 100%
Predict Previous 9.86 7.54 6.52 6.04
GRU-D (joint) 8.38±0.05 8.03±0.04 7.89±0.03 7.80±0.02
GRU-D (node) 4.36±0.08 3.62±0.07 3.28±0.08 3.16±0.04
Transformer (joint) 5.70±1.41 7.17±1.66 5.95±1.90 6.11±1.80
Transformer (node) 7.01±0.31 6.34±0.24 5.84±0.23 5.96±0.50
LG-ODE 8.51 7.35 6.71 6.24
TGNN4I (static) 3.86±0.02 3.31±0.02 3.03±0.02 2.88±0.02
TGNN4I (exponential) 3.68±0.05 3.18±0.03 2.97±0.03 2.86±0.04
TGNN4I (periodic) 3.69±0.02 3.19±0.04 3.01±0.05 2.88±0.03

Table 2: Test LMSE (multiplied by 102 ) for the two USHCN decay. This should to some extent be expected, as none
climate datasets. of the previous datasets show any clear periodic patterns at
the considered time scales. To investigate the possible ben-
Tmin Tmax efits of the periodic dynamics we instead create a synthetic
dataset with periodic signals propagating over a graph.
Predict Previous 16.88 17.18
GRU-D (joint) 8.03±0.23 7.97±0.19 The synthetic dataset is based on a randomly sampled di-
GRU-D (node) 13.12±0.03 13.67±0.04 rected acyclic graph with 20 nodes. We define a periodic
Transformer (joint) 7.36±0.41 7.37±0.28 base signal
Transformer (node) 15.68±0.32 15.74±0.34 ρn (t) = sin(ϕn t + η n ) (16)
TGNN4I (static) 6.97±0.05 6.86±0.04 with random parameters ϕn and η n for each node n. The
TGNN4I (exponential) 6.72±0.04 6.60±0.04 target signal y n (t) in each node is then defined through
TGNN4I (periodic) 6.72±0.05 6.63±0.03
0.5 X
κn (t) = ρn (t) + κm (t − 0.05) (17a)
|N (n)|
m∈N (n)
has strong spatial dependencies. The (joint) models can to y n (t) = κn (t) + ϵn (t) (17b)
some extent learn to pick up on these, while for (node) no
information can flow between nodes. The best results are where ϵn (t) is Gaussian white noise with standard devia-
however achieved by TGNN4I, showing the added benefit tion 0.01. The target signal in each node depends on the
of utilizing the spatial graph. base signal in the node itself and the signals in neighboring
nodes at a time lag of 0.05. To construct one time series we
sample y n (t) at 70 irregular time points on [0, 1]. In total
4.4 Synthetic Periodic Data we sample 200 such time series and keep 50% of the node
observations in each.
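A small sketch of how such a synthetic dataset could be generated is given below. The DAG construction and the ranges for φ^n and η^n are our own illustrative assumptions; only the structure of Eqs. 16–17 is taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
num_nodes, num_times = 20, 70

# Random DAG: only edges (m, n) with m < n, so that the recursion below terminates
adj = np.triu(rng.random((num_nodes, num_nodes)) < 0.2, k=1)
phi = rng.uniform(5.0, 20.0, num_nodes)        # illustrative range for phi^n
eta = rng.uniform(0.0, 2 * np.pi, num_nodes)   # illustrative range for eta^n

def kappa(n, t):
    """Eq. 17a, evaluated recursively over the DAG (N(n) = parents of n)."""
    rho = np.sin(phi[n] * t + eta[n])          # Eq. 16
    parents = np.flatnonzero(adj[:, n])
    if len(parents) == 0:
        return rho
    return rho + 0.5 / len(parents) * sum(kappa(m, t - 0.05) for m in parents)

# One time series: 70 irregular time points on [0, 1] and noisy targets as in Eq. 17b
t_obs = np.sort(rng.uniform(0.0, 1.0, num_times))
y = np.array([[kappa(n, t) + rng.normal(0.0, 0.01) for t in t_obs]
              for n in range(num_nodes)])
# Keep 50% of the individual node observations
obs_mask = rng.random((num_nodes, num_times)) < 0.5
```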
We train versions of the GRU-D (node) and TGNN4I models with different latent dynamics on the synthetic data. Also our other baselines are included for comparison. Results are reported in Table 3.

Table 3: Test L_MSE (multiplied by 10²) for synthetic data.

Model                   Static       Exponential   Periodic
GRU-D (node)            8.88±0.36    3.13±0.06     2.81±0.04
TGNN4I                  15.12±0.05   2.91±0.17     1.95±0.11
Predict Prev.           27.52
Transformer (joint)     23.19±0.38
Transformer (node)      15.39±0.05
LG-ODE                  16.61±0.23

For both GRU-D (node) and TGNN4I we see a large difference between the different types of latent dynamics. The periodic dynamics seem to help the model to keep track of the base signal in the node and its neighborhood in order to achieve accurate future predictions. While this is a synthetic example, periodic behavior is prevalent in much time series data and being able to explicitly model this in the latent state can be highly advantageous. Attempts were made to also train the GRU-D (joint) model on this dataset, but it failed to pick up on any patterns and ended up only predicting a constant value for all nodes and times.

4.5 Loss Weighting

To investigate the impact of the loss weighting function w we trained four TGNN4I models on the subsampled PEMS-BAY dataset with 25% observations. We used exponential dynamics and considered the weighting functions

$$w_1(\Delta_t) = 1 \tag{18a}$$
$$w_2(\Delta_t) = \exp\!\left(-\frac{\Delta_t}{0.04}\right) \tag{18b}$$
$$w_3(\Delta_t) = \exp\!\left(-\left(\frac{\Delta_t - 0.1}{0.02}\right)^{2}\right) \tag{18c}$$
$$w_4(\Delta_t) = \mathbb{I}_{\{\Delta_t \in [0.18,\,0.22]\}}. \tag{18d}$$
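Written out in code, these four weighting functions are simply (a sketch; the input is assumed to be a tensor of time differences and the indicator in w4 is implemented with a float cast):

```python
import torch

w1 = lambda dt: torch.ones_like(dt)                      # Eq. 18a
w2 = lambda dt: torch.exp(-dt / 0.04)                    # Eq. 18b
w3 = lambda dt: torch.exp(-((dt - 0.1) / 0.02) ** 2)     # Eq. 18c
w4 = lambda dt: ((dt >= 0.18) & (dt <= 0.22)).float()    # Eq. 18d
```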
Figure 3 shows the test MSE for the trained models at different ∆_t in the future, as well as plots of the weighting functions. As an example, the prediction ŷ_{i→j}^m has ∆_t = t_j − t_i.

Figure 3: MSE (top) for predictions at time ∆_t in the future for models trained with different loss weighting w (bottom). The weighting functions are described in Eq. 18a–18d. To compute the MSE all predictions in the test set were binned based on their ∆_t, with bin width 0.02. The bottom subplot also shows a histogram of the number of predictions in each bin.

We note that the choice of w can have substantial impact on the error of the model at different time horizons. The exponential weighting in w2 makes the model focus heavily on short-term predictions. This results in better predictions for low ∆_t. At the shortest time horizon the exponential weighting yields an 11% improvement over the model trained with constant w1, but this comes at the cost of far higher errors for long-term predictions. Interestingly, the w1 model gives better predictions at all time horizons than the models with w3 and w4, which focus on predictions at some specific time ahead. We believe that there can be a feedback effect benefiting the constant weighting, where learning to make good short-term predictions also aids the learning of long-term predictions, for example by finding useful intermediate representations. A drawback of weighting with w1 is however that, since the loss never approaches 0, there is no properly motivated choice of N_max for our loss approximation. As the model trained with w4 still gives good predictions in the interval [0.18, 0.22], there can still be practical reasons to choose such a weighting. With this choice we could reduce the innermost sum in Eq. 14 to only those j:s that lie in the interval of interest.

5 DISCUSSION

We have proposed a temporal GNN model that can handle both irregular time steps and partially observed graphs. By defining latent states in continuous time our model can make predictions for arbitrary time points in the future. In this section we discuss some details and limitations of the approach, and also give some pointers to interesting directions for future work.

5.1 Efficient Implementation

In order to efficiently implement the training and inference of TGNN4I there is a key design choice between (1) storing everything in dense matrices and utilizing massively parallel GPU computations, and (2) utilizing sparse representations in order to avoid computing values that will never be
used. Which of these is to be preferred depends on the sparsity of observations in the data. Our implementation follows the massively parallel approach, using binary masks for keeping track of O_i at each time point. In order to scale TGNN4I to massive graphs in real-world scenarios it could be interesting to consider a version of the model distributed over multiple machines, perhaps directly connected to the sensors producing the input data. A sparse implementation would be strongly preferred for such an extension.

5.2 Linear and Neural ODEs

The dynamics of TGNN4I are defined by a linear ODE and additionally restricted by the assumption of unique eigenvalues in A. This has the benefit of a closed form solution that is efficient to compute, but also limits the types of dynamics that can be learned. Our approach can be contrasted with Neural ODEs (Chen et al., 2018), which allow for learning more expressive dynamics. Neural ODEs do however lack closed form solutions and require using numerical solvers (Kidger, 2021). This incurs a trade-off between speed and numerical accuracy. In experiments we have compared TGNN4I with the LG-ODE model (Huang et al., 2020), which uses a Neural ODE decoder. The slow training of the LG-ODE model can to a large extent be attributed to the numerical ODE solver. While more complex latent dynamics can be useful for some datasets, it can also be argued that simpler dynamics can be compensated for with a high enough latent dimension d_h and a flexible enough predictive model g (Schirmer et al., 2022).

5.3 Societal and Sustainability Impact

While our contributions are purely methodological, many applications of graph-based and spatio-temporal data analysis have a clear societal impact. Our example applications of traffic and climate modeling both have potential to aid efforts of transforming society in more sustainable directions, such as those described in the United Nations sustainable development goals 11 and 13 (Rolnick et al., 2022; United Nations, 2015). Traffic modeling allows us to both understand travel behavior and predict future demand. This can enable optimizations of transport systems, both improving the experience of travelers and reducing the environmental impact. Integrating machine learning methods with climate modeling has potential to speed up simulations and increase our understanding of the climate around us. However, it is surely also possible to find applications of our method with a damaging impact on society, for example through undesired mass-surveillance.

5.4 Future Work

We consider a setting where the graph structure is both known and constant over time. In some practical applications it is not obvious how to construct the graph describing the system. To tackle this problem our model could be combined with approaches for also learning the graph structure (Stanković et al., 2020; Zhang et al., 2022). Extending our method to dynamic graphs, which evolve over time, would not require any major changes and could be an interesting direction for future experiments.

Our focus has been on forecasting, but the model could also be trained for other tasks. The time-continuous latent state in each node could be used for imputing missing observations or performing sequence segmentation. Also classification tasks are possible, either classifying each node separately or the entire graph-structured time series.

While our model can produce predictions at arbitrary time points, an extension would be to also predict the time until the next observation occurs. One way to achieve this would be to let the latent state parametrize the intensity of a point process (Mei and Eisner, 2017; Jia and Benson, 2019). Building such point process models on graphs could be an interesting future application of our model.

Acknowledgments

This research is financially supported by the Swedish Research Council via the project Handling Uncertainty in Machine Learning Systems (contract number: 2020-04122), the Excellence Center at Linköping–Lund in Information Technology (ELLIIT), and the Wallenberg AI, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation. The computations were enabled by the Berzelius resource at the National Supercomputer Centre, provided by the Knut and Alice Wallenberg Foundation. All datasets available online were accessed from the Linköping University network.

References

Arrowsmith, D. and Place, C. M. (1992). Dynamical systems: differential equations, maps, and chaotic behaviour. CRC Press.

Box, G. E., Jenkins, G. M., Reinsel, G. C., and Ljung, G. M. (2015). Time series analysis: forecasting and control. John Wiley & Sons.

Cao, W., Wang, D., Li, J., Zhou, H., Li, L., and Li, Y. (2018). BRITS: Bidirectional recurrent imputation for time series. In Advances in Neural Information Processing Systems, volume 31.

Che, Z., Purushotham, S., Cho, K., Sontag, D., and Liu, Y. (2018). Recurrent neural networks for multivariate time series with missing values. Scientific Reports, 8(1):6085.

Chen, C., Petty, K., Skabardonis, A., Varaiya, P., and Jia, Z. (2001). Freeway performance measurement system: Mining loop detector data. Transportation Research Record, 1748(1):96–102.
Chen, R. T., Rubanova, Y., Bettencourt, J., and Duvenaud, D. K. (2018). Neural ordinary differential equations. In Advances in Neural Information Processing Systems, volume 31.

Chen, Y., Kang, Y., Chen, Y., and Wang, Z. (2020). Probabilistic forecasting with temporal convolutional neural network. Neurocomputing, 399:491–501.

Cho, K., van Merriënboer, B., Bahdanau, D., and Bengio, Y. (2014). On the properties of neural machine translation: Encoder–decoder approaches. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 103–111.

Cini, A., Marisca, I., and Alippi, C. (2022). Filling the gaps: Multivariate time series imputation by graph neural networks. In International Conference on Learning Representations (ICLR).

De Brouwer, E., Simm, J., Arany, A., and Moreau, Y. (2019). GRU-ODE-Bayes: Continuous modeling of sporadically-observed time series. In Advances in Neural Information Processing Systems, volume 32.

Fang, Z., Long, Q., Song, G., and Xie, K. (2021). Spatial-temporal graph ODE networks for traffic flow forecasting. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pages 364–373.

Fey, M. and Lenssen, J. E. (2019). Fast graph representation learning with PyTorch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds.

Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O., and Dahl, G. E. (2017). Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning, pages 1263–1272.

Giuliari, F., Hasan, I., Cristani, M., and Galasso, F. (2021). Transformer networks for trajectory forecasting. In 25th International Conference on Pattern Recognition (ICPR).

Gordon, D., Petousis, P., Zheng, H., Zamanzadeh, D., and Bui, A. A. (2021). TSI-GNN: Extending graph neural networks to handle missing data in temporal settings. Frontiers in Big Data, 4.

Guo, S., Lin, Y., Feng, N., Song, C., and Wan, H. (2019). Attention based spatial-temporal graph convolutional networks for traffic flow forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 922–929.

Horn, R. A. and Johnson, C. R. (2012). Matrix analysis. Cambridge University Press.

Huang, Z., Sun, Y., and Wang, W. (2020). Learning continuous system dynamics from irregularly-sampled partial observations. In Advances in Neural Information Processing Systems, volume 33, pages 16177–16187.

Huang, Z., Sun, Y., and Wang, W. (2021). Coupled graph ODE for learning interacting system dynamics. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pages 705–715.

Jia, J. and Benson, A. R. (2019). Neural jump stochastic differential equations. In Advances in Neural Information Processing Systems, volume 32.

Kidger, P. (2021). On neural differential equations. PhD thesis, University of Oxford.

Kosman, E. and Di Castro, D. (2022). Graphvid: It only takes a few nodes to understand a video. In Computer Vision – ECCV 2022, pages 195–212.

Lam, R., Sanchez-Gonzalez, A., Willson, M., Wirnsberger, P., Fortunato, M., Pritzel, A., Ravuri, S., Ewalds, T., Alet, F., Eaton-Rosen, Z., et al. (2022). Graphcast: Learning skillful medium-range global weather forecasting. arXiv preprint arXiv:2212.12794.

Lazzeri, F. (2020). Machine learning for time series forecasting with Python. John Wiley & Sons.

Li, Y., Yu, R., Shahabi, C., and Liu, Y. (2018). Diffusion convolutional recurrent neural network: Data-driven traffic forecasting. In International Conference on Learning Representations (ICLR).

Mei, H. and Eisner, J. M. (2017). The neural Hawkes process: A neurally self-modulating multivariate point process. In Advances in Neural Information Processing Systems, volume 30.

Menne, M., Williams Jr, C., and Vose, R. (2015). Long-term daily climate records from stations across the contiguous United States.

Omidshafiei, S., Hennes, D., Garnelo, M., Wang, Z., Recasens, A., Tarassov, E., Yang, Y., Elie, R., Connor, J. T., Muller, P., Mackraz, N., Cao, K., Moreno, P., Sprechmann, P., Hassabis, D., Graham, I., Spearman, W., Heess, N., and Tuyls, K. (2022). Multiagent off-screen behavior prediction in football. Scientific Reports, 12(1):1–13.

Poli, M., Massaroli, S., Rabideau, C. M., Park, J., Yamashita, A., Asama, H., and Park, J. (2021). Continuous-depth neural models for dynamic graph prediction. arXiv preprint arXiv:2106.11581.

Roberts, S., Osborne, M., Ebden, M., Reece, S., Gibson, N., and Aigrain, S. (2013). Gaussian processes for time-series modelling. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences.

Rolnick, D., Donti, P. L., Kaack, L. H., Kochanski, K., Lacoste, A., Sankaran, K., Ross, A. S., Milojevic-Dupont, N., Jaques, N., Waldman-Brown, A., Luccioni, A. S., Maharaj, T., Sherwin, E. D., Mukkavilli, S. K., Kording, K. P., Gomes, C. P., Ng, A. Y., Hassabis, D., Platt, J. C., Creutzig, F., Chayes, J., and Bengio, Y. (2022). Tackling climate change with machine learning. ACM Computing Surveys, 55(2).
Rozemberczki, B., Scherer, P., Kiss, O., Sarkar, R., and Ferenci, T. (2021). Chickenpox cases in Hungary: A benchmark dataset for spatiotemporal signal processing with graph neural networks. In Workshop on Graph Learning Benchmarks @ TheWebConf 2021.

Rubanova, Y., Chen, R. T. Q., and Duvenaud, D. K. (2019). Latent ordinary differential equations for irregularly-sampled time series. In Advances in Neural Information Processing Systems, volume 32.

Rusch, T. K., Chamberlain, B., Rowbottom, J., Mishra, S., and Bronstein, M. (2022). Graph-coupled oscillator networks. In Proceedings of the 39th International Conference on Machine Learning, pages 18888–18909.

Schirmer, M., Eltayeb, M., Lessmann, S., and Rudolph, M. (2022). Modeling irregular time series with continuous recurrent units. In Proceedings of the 39th International Conference on Machine Learning, pages 19388–19405.

Schneider, T. (2001). Analysis of incomplete climate data: Estimation of mean values and covariance matrices and imputation of missing values. Journal of Climate, 14(5):853–871.

Searle, S. R. and Khuri, A. I. (1982). Matrix algebra useful for statistics. John Wiley & Sons.

Stanković, L., Mandic, D., Daković, M., Brajović, M., Scalzo, B., Li, S., and Constantinides, A. G. (2020). Data analytics on graphs part III: Machine learning on graphs, from graph topology to applications. Foundations and Trends in Machine Learning, 13(4):332–530.

United Nations (2015). Transforming our world: The 2030 agenda for sustainable development.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems, volume 30.

Wu, Z., Pan, S., Chen, F., Long, G., Zhang, C., and Yu, P. S. (2020a). A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems, pages 1–21.

Wu, Z., Pan, S., Long, G., Jiang, J., Chang, X., and Zhang, C. (2020b). Connecting the dots: Multivariate time series forecasting with graph neural networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 753–763.

Wu, Z., Pan, S., Long, G., Jiang, J., and Zhang, C. (2019). Graph wavenet for deep spatial-temporal graph modeling. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, pages 1907–1913.

Yu, B., Yin, H., and Zhu, Z. (2018). Spatio-temporal graph convolutional networks: A deep learning framework for traffic forecasting. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, pages 3634–3640.

Zang, C. and Wang, F. (2020). Moflow: An invertible flow model for generating molecular graphs. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 617–626.

Zhang, X., Zeman, M., Tsiligkaridis, T., and Zitnik, M. (2022). Graph-guided network for irregularly sampled multivariate time series. In International Conference on Learning Representations (ICLR).

Zhao, X., Chen, F., and Cho, J.-H. (2018). Deep learning for predicting dynamic uncertain opinions in network data. In 2018 IEEE International Conference on Big Data (Big Data), pages 1150–1155.

A TABLE OF NOTATION

The notation used is listed in Table 4.

B IMPLEMENTATION AND TRAINING DETAILS FOR TGNN4I

The presentation of TGNN4I in Section 3 describes how the model processes a single graph-structured time series. In practice we use batches of multiple such time series during training and inference. When working on batches there is an additional sum in the definition of L_ℓ, computing the mean over all samples in the batch. While we assume that all time series in a batch share the same N_t, the exact time points {t_i}_{i=1}^{N_t} can differ. In all experiments we use N_init = 5 and minimize the loss using the Adam optimizer (Kingma and Ba, 2015).
Both GNN^U and GNN^W contain only GNN layers, while the predictive model g contains a sequence of GNN layers followed by a sequence of fully connected layers. We use ReLU activation functions in between all layers.

The input x̃_i^m to GNN^W is 0 for neighbors that are not observed at time t_i. However, there could be an observation of neighbor m where the observed value and input features are exactly 0. To help the model differentiate between these two cases we additionally include an indicator variable I_{m∈O_i} in x̃_i^m.

C BASELINE MODELS

C.1 Predict Previous

The Predict Previous baseline does not require any training. Predictions are computed as ŷ_{i→j}^m = y_{prev(i)}^m ∀j, where prev(i) = max{k : k ≤ i ∧ m ∈ O_k}. If there is no earlier observation of node m the prediction is just 0.
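A minimal sketch of this baseline as a per-node forward fill (illustrative names and tensor layout):

```python
import torch

def predict_previous(target, obs_mask):
    """target: (N_t, num_nodes, d_y); obs_mask: (N_t, num_nodes) boolean.
    pred[i] holds the last value observed at or before time step i for each node,
    which is then used as the prediction for every later time point
    (0 if the node has not been observed yet)."""
    pred = torch.zeros_like(target)
    last = torch.zeros_like(target[0])
    for i in range(target.shape[0]):
        observed = obs_mask[i].unsqueeze(-1)
        last = torch.where(observed, target[i], last)
        pred[i] = last
    return pred
```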

C.2 GRU-D (node)

GRU-D (node) is essentially a version of TGNN4I without the GNN components. The GNNs GNN^U and GNN^W in
the GRU update are replaced by matrix multiplications and the predictive model g includes only fully connected layers.
Because of this no information can flow between nodes, and time-series in different nodes are treated as independent. The
GRU-D (node) model uses the exponential dynamics for the latent state. To stay consistent with the other models we still
process all nodes in the graph concurrently. This means that a batch size of B for GRU-D (node) means that we process
B × N independent node time-series.

C.3 GRU-D (joint)

The GRU-D (joint) model is defined similarly to GRU-D (node), but modeling all nodes jointly. All node observations are concatenated into one long vector y_i = [y_i^1, ..., y_i^{|V|}]^⊤ and similarly the features are concatenated as x_i = [x_i^1, ..., x_i^{|V|}]^⊤. This time series is then modeled using a single latent state, also based on the exponential dynamics. In GRU-D (joint) there is however a GRU update at each t_i, as at least one of the nodes is observed at each time point. We can think of this setup as a graph with a single node, for which the observations are high-dimensional. The high-dimensional predictions are split up and re-assigned to the original nodes in the graph for computing the loss. Note also that many entries in each y_i and x_i will be zero, corresponding to the nodes that are not observed. We again include indicators I_{m∈O_i} as input to the GRU, but here for all nodes.

C.4 LG-ODE

For the LG-ODE model we use the code provided by the authors (Huang et al., 2020)³, making only small modifications to
the data loading in order to correctly handle our datasets. The original code does not use a validation set, instead evaluating
the model on the test set after each training epoch. We change this step to use validation data and save the model from the
epoch with the lowest validation error. That model is then loaded and evaluated on the test data. Due to the high training
time we have not been able to perform exhaustive hyperparameter tuning for the LG-ODE model. We have mainly used
³ https://github.com/ZijieH/LG-ODE

Table 4: List of notation

General
  diag(v): Diagonal matrix with the entries of vector v on the diagonal
  [v]_{j:j+1}: Entries j and j+1 of vector v
  I{·}: Indicator function

Defined in Section 3.1
  G: The underlying graph of the time series
  V, E: Node and edge set of G
  t_i: Time point where at least one node is observed
  N_t: Number of time points in one graph-structured time series
  O_i: Set of nodes observed at t_i
  y_i^n: Observed value in node n at time point t_i. Equal to 0 if node n is not observed at time t_i.
  x_i^n: Features of node n at time point t_i. Equal to 0 if node n is not observed at time t_i.
  d_x, d_y, d_h: Dimensionality of x_i^n, y_i^n and h^n(t)

Defined in Section 3.2
  h^n(t): Time-continuous latent state of node n
  h̄_i^n: Static part of h^n(t) after time t_i
  h̃^n(t): Dynamic part of h^n(t), defined by a linear ODE
  A: Coefficient matrix in the ODE defining h̃^n(t)
  ĥ_i^n: Initial value of h̃^n(t_i) in the ODE from time t_i onward
  δ_t: Elapsed time since t_i
  Q: Matrix containing the eigenvectors of A
  Λ: Diagonal matrix containing the eigenvalues of A
  ω_i^n: Parameters defining the dynamics of h̃^n(t) after t_i
  C_j: 2 × 2 block matrix in the real Jordan form of A
  a_j ± b_j i: Complex conjugate eigenvalue pair of A

Defined in Section 3.3
  N(n): Neighbors (parents) of node n
  W_1, W_2: Matrices used in GNN
  e_{m,n}: Edge weight associated with the edge (m, n)
  GNN^U: GNN combining latent states in the neighborhood of node n
  u_{i,1}^n – u_{i,7}^n: Output of GNN^U for node n at time t_i. Intermediate representations used in the GRU update.
  x̃_i^n: Concatenation of y_i^n and x_i^n
  GNN^W: GNN combining features and observations in the neighborhood of node n
  v_{i,1}^n – v_{i,7}^n: Output of GNN^W for node n at time t_i. Intermediate representations used in the GRU update.
  r_i^n, z_i^n, q_i^n: Intermediate representations in the GRU update for h^n(t)
  σ: Sigmoid function
  b_1–b_7: Bias parameters in the GRU cell
  r̄_i^n, z̄_i^n, q̄_i^n: Intermediate representations in the GRU update for h̄_i^n

Defined in Section 3.4
  g: Predictive model
  ŷ_j^n: Predicted value of node n at time t_j
  GNN^g: GNN used as predictive model g

Defined in Section 3.5
  ŷ_{i→j}^m: Predicted value of node m at time t_j based on observations before or at time t_i
  w: Loss weighting function
  τ_{m,i}: Set of indices of all times after t_i where node m is observed
  N_init: Length of warm-up phase where predictions are not included in the loss
  L_ℓ: Loss function for one whole graph-structured time series
  ℓ: General loss function for a single observation
  L_MSE: L_ℓ with Mean Squared Error as loss for each observation
  N_obs: Number of node observations in the time series
  ∆_t: Time difference between last observation and prediction time
  N_max: Maximum number of time steps to predict ahead in the approximation of L_ℓ

the default parameters of Huang et al. (2020) or a slightly smaller version of the model. The smaller version has halved
dimensions for hidden layers in the GNN (from 128 to 64) and augmentation (from 64 to 32) used in the ODE decoder.
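The validation-based model selection described above can be sketched as follows (hypothetical names for the training and evaluation routines; the actual LG-ODE training code otherwise follows Huang et al. (2020)):

import copy

def train_with_model_selection(model, train_epoch, eval_error, num_epochs):
    """Keep the weights from the epoch with the lowest validation error, then evaluate on test data."""
    best_val, best_state = float("inf"), None
    for epoch in range(num_epochs):
        train_epoch(model)
        val_err = eval_error(model, split="val")
        if val_err < best_val:                               # new best validation error
            best_val = val_err
            best_state = copy.deepcopy(model.state_dict())   # save checkpoint from this epoch
    model.load_state_dict(best_state)                        # restore best checkpoint
    return eval_error(model, split="test")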
For computing LMSE using the LG-ODE model we need to compute each necessary ŷmi→j . This is done by encoding all
observations up to time ti and then decoding Nmax time steps into the future. It should be noted that the model is still
trained as proposed by Huang et al. (2020), by encoding the first half of each time series and predicting the second half.
During training each encoded time series has length Nt /2, but when computing LMSE on the test set the length of the sequence being encoded varies. However, we have not noticed any higher errors for time steps where the encoded sequence is shorter or longer than that used during training.
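A minimal sketch of this evaluation procedure, with encode and decode as hypothetical stand-ins for the LG-ODE encoder and ODE decoder:

def rolling_forecasts(encode, decode, times, N_init, N_max):
    """Compute predictions for every conditioning time t_i used in the test loss.

    encode(i): latent state from all observations up to and including time t_i
    decode(z, horizon): predictions at the given future time points
    times: sorted list of observation times t_1, ..., t_Nt
    """
    predictions = {}
    for i in range(N_init, len(times) - 1):
        z = encode(i)                               # condition on observations up to t_i
        horizon = times[i + 1 : i + 1 + N_max]      # at most N_max future time points
        predictions[i] = decode(z, horizon)         # y_hat_{i -> j} for all j in the horizon
    return predictions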

C.5 Transformers

The Transformer (Vaswani et al., 2017) baselines use an encoder-decoder approach similar to LG-ODE. During training, however, a prediction time ti is randomly sampled as i ∼ U({Ninit , . . . , Nt − Nmax }). Each sequence is then encoded up to time ti and decoded over the next Nmax time steps to produce predictions. The LMSE loss is also used for the Transformer models, but based only on this one prediction per time series.
The irregular time steps are handled by sinusoidal encodings concatenated to the input of both the encoder and decoder.
Instead of basing these on the sequence index i, the exact timestamp ti is used in
θi = ti / 0.1^(2i/dh)                                                (19)

to then compute the full encoding vector [sin(θ1 ), . . . , sin(θ⌊dh /2⌋ ), cos(θ1 ), . . . , cos(θ⌊dh /2⌋ )]⊺ .
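A small sketch of how such an encoding can be computed, assuming the exponent in Eq. 19 is indexed by the encoding dimension (written k below) while ti is the continuous timestamp:

import numpy as np

def time_encoding(t_i, d_h, base=0.1):
    """Sinusoidal encoding of a continuous timestamp t_i, following Eq. 19."""
    k = np.arange(1, d_h // 2 + 1)                   # encoding dimensions 1, ..., floor(d_h / 2)
    theta = t_i / base ** (2 * k / d_h)
    return np.concatenate([np.sin(theta), np.cos(theta)])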


In Transformer (node), unobserved nodes are handled by masking the attention mechanism. For each node n, this prevents the model from attending to encoded time steps j such that n ∉ Oj . In Transformer (joint) the same approach as in GRU-D (joint) is instead used, where indicator variables are included as input.
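For illustration, the mask for a single node could be constructed as below (a sketch; how the mask is passed to the attention computation, e.g. as a key padding mask, depends on the Transformer implementation used):

import torch

def attention_mask_for_node(observed_sets, node, num_steps):
    """Boolean mask that is True at encoded time steps the given node must not attend to."""
    return torch.tensor([node not in observed_sets[j] for j in range(num_steps)])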
The hyperparameters defining the Transformer architectures differ somewhat from those of the other models. We still let dh represent the dimensionality of hidden representations, but here also tune the number of Transformer layers stacked together.

D DATASETS

D.1 Traffic Data

For the PEMS-BAY and METR-LA datasets we use the versions pre-processed by Li et al. (2018), where weighted
graphs are created based on thresholded road-network distances. We additionally remove edges from any node to itself
and drop nodes not connected to the rest of the graph. The original time series is then split into sequences of length
288, corresponding to one day of observations at 5 minute intervals. In each such sequence we uniformly sample only
Nt = 72 time points to keep, introducing irregularity between the time steps. Next we create the set {(n, i)}n∈V,i=1,...,Nt
with indices of all single node observations. We uniformly sample a fraction of this set as the observations to keep,
independently for each sequence. The percentages in Table 1 refer to the percentage of observations kept from this set.
This step introduces further irregularity, as all nodes will generally not be observed at each time step. The METR-LA dataset already has some observations missing in the original data, meaning that we can never reach exactly 100% observations for that dataset. From this pre-processing we end up with 180 time series in the PEMS-BAY dataset and 119 time series in the
METR-LA dataset. We randomly assign 70% of these to the training set, 10% to the validation set and 20% to the test set.
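The subsampling procedure for a single day-long sequence can be sketched as follows (hypothetical function name; numpy assumed):

import numpy as np

def subsample_sequence(num_nodes, seq_len=288, N_t=72, keep_frac=0.25, rng=None):
    """Introduce irregular time steps and partial node observations in one traffic sequence."""
    rng = np.random.default_rng() if rng is None else rng
    time_idx = np.sort(rng.choice(seq_len, size=N_t, replace=False))     # irregularly spaced time points
    num_obs = num_nodes * N_t                                            # all single-node observations (n, i)
    keep = rng.choice(num_obs, size=int(keep_frac * num_obs), replace=False)
    observed = np.zeros(num_obs, dtype=bool)
    observed[keep] = True                                                # keep only a fraction of observations
    return time_idx, observed.reshape(N_t, num_nodes)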

D.2 USHCN Climate Data

The USHCN daily data is openly available4 together with the positions of all sensor stations. We use the pre-processing
of De Brouwer et al. (2019) to clean and subsample the data5 . We choose the longer version of the time series, with
observations between the years 1950 and 2000, but split this into multiple sequences of 100 days.
In order to build the spatial graph we perform an equirectangular map projection of the sensor station coordinates and
then connect each station to its 10 nearest neighbors based on Euclidean distance. While there are many options for how
4 https://cdiac.ess-dive.lbl.gov/ftp/ushcn_daily/
5 A script for pre-processing the USHCN data is available together with their code at https://github.com/edebrouwer/gru_ode_bayes.
to create this type of spatial graph, we have found the 10-nearest-neighbor approach to work sufficiently well in practice.
Further investigating different methods for building spatial graphs is outside the scope of this paper. We additionally add
edge weights to this graph following a similar method as Li et al. (2018) did for the traffic data. For sensor stations m and
n at a distance dm,n we assign the edge (m, n) the weight
em,n = exp(−(dm,n / (4σe ))²)                                          (20)

where σe is the standard deviation of all distances associated with edges. The constant 4 is chosen such that the furthest
distance gets a weight close to 0 and the nearest distance a weight close to 1. For the USHCN data the only input feature
is elapsed time since the node was last observed. We use the same 70%/10%/20% training/validation/test split as for the
traffic data.
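A sketch of this graph construction (the exact projection and neighbor-search details below are assumptions for illustration; scikit-learn is used for the nearest-neighbor search):

import numpy as np
from sklearn.neighbors import NearestNeighbors

def build_spatial_graph(lat, lon, k=10):
    """k-nearest-neighbor graph over sensor stations with Gaussian edge weights as in Eq. 20."""
    lat_r, lon_r = np.radians(lat), np.radians(lon)
    coords = np.stack([lon_r * np.cos(lat_r.mean()), lat_r], axis=1)     # equirectangular projection
    dist, idx = NearestNeighbors(n_neighbors=k + 1).fit(coords).kneighbors(coords)

    edges, dists = [], []
    for m in range(len(coords)):
        for d, n in zip(dist[m, 1:], idx[m, 1:]):                        # skip the self-neighbor at index 0
            edges.append((m, int(n)))
            dists.append(d)
    dists = np.array(dists)
    sigma_e = dists.std()                                                # std of all edge distances
    weights = np.exp(-(dists / (4 * sigma_e)) ** 2)                      # Eq. 20
    return edges, weights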
We build two datasets from the USHCN data, one with the daily minimum temperature Tmin as target variable and one
with the daily maximum temperature Tmax . The reason for separating these target variables, instead of using dy = 2, is
that the pre-processed data contains time points where only one of these is observed. Our model is not designed for such
a setting where we do not observe all dimensions of yin . Extending TGNN4I to handle this is left as future work. The
USHCN data also contains other climate variables, related to precipitation and snow coverage. These time series are less
interesting to directly evaluate our model on, as many entries are just 0. Properly modeling these would require designing
a suitable likelihood function and taking into account that some sensor stations never get any snow. As this would shift the
focus from the core problem we choose to restrict our attention to Tmin and Tmax .

D.3 Synthetic Periodic Data

To create the graph for the synthetic data we first sample 20 node positions uniformly over [0, 1]2 . We then create an
undirected graph using a Delaunay triangulation (De Loera et al., 2010) based on these positions. This undirected graph
is turned into a directed acyclic graph by choosing a random ordering of the nodes and removing edges going from nodes
later in the ordering to those earlier.
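A sketch of this graph construction, using scipy for the Delaunay triangulation (hypothetical function name):

import numpy as np
from scipy.spatial import Delaunay

def sample_dag(num_nodes=20, rng=None):
    """Sample node positions in [0, 1]^2 and build a directed acyclic Delaunay graph."""
    rng = np.random.default_rng() if rng is None else rng
    pos = rng.uniform(0.0, 1.0, size=(num_nodes, 2))
    tri = Delaunay(pos)

    undirected = set()
    for simplex in tri.simplices:                                        # each triangle contributes 3 edges
        for a in range(3):
            for b in range(a + 1, 3):
                i, j = int(simplex[a]), int(simplex[b])
                undirected.add((min(i, j), max(i, j)))

    rank = {int(node): r for r, node in enumerate(rng.permutation(num_nodes))}   # random node ordering
    directed = {(m, n) if rank[m] < rank[n] else (n, m) for m, n in undirected}  # orient edges by the ordering
    return pos, directed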
Based on this graph 200 irregular time series are sampled according to Eq. 16 and 17. An example is shown in Figure 4. The
irregular time steps are created by first discretizing [0, 1] into 1000 time steps and then sampling 70 of these independently
for each time series. Out of all node observations we keep 50%, sampled in the same way as for the traffic data. The
parameters of the periodic signals are sampled according to ϕn ∼ U([20, 100]) and η n ∼ U([0, 2π]). We resample all η n
for each sequence, but sample the node frequencies ϕn only once, treating these as underlying properties of the nodes. Out
of the 200 sampled sequences we use 100 for training, 50 for validation and 50 for testing.
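The sampling of irregular observation times and signal parameters can be sketched as follows (the periodic signals themselves follow Eq. 16 and 17, which are not repeated here; helper name is hypothetical):

import numpy as np

def sample_observation_pattern(num_nodes=20, num_series=200, rng=None):
    """Sample irregular time points, kept observations and periodic-signal parameters."""
    rng = np.random.default_rng() if rng is None else rng
    phi = rng.uniform(20.0, 100.0, size=num_nodes)             # node frequencies, sampled once
    series = []
    for _ in range(num_series):
        grid = np.linspace(0.0, 1.0, 1000)                     # discretization of [0, 1]
        t = np.sort(rng.choice(grid, size=70, replace=False))  # 70 irregular time points
        eta = rng.uniform(0.0, 2 * np.pi, size=num_nodes)      # phases, resampled per sequence
        keep = rng.choice(70 * num_nodes, size=(70 * num_nodes) // 2, replace=False)
        observed = np.zeros(70 * num_nodes, dtype=bool)
        observed[keep] = True                                  # keep 50% of node observations
        series.append((t, eta, observed.reshape(70, num_nodes)))
    return phi, series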

E DETAILS ON EXPERIMENT SETUPS

We perform hyperparameter tuning by exhaustive grid search over combinations of parameter values. The configuration
that achieves the lowest validation LMSE is then used for the final experiment. An overview of hyperparameter values
considered and the final configurations for all experiments is given in Table 5. All hyperparameter tuning for GRU-D and
TGNN4I is done on the versions of the models with exponential dynamics. For these models we generally observe that
larger architectures perform better, but the memory usage limits how much we can increase the model size. While the exact
model architecture of TGNN4I does impact the results, the model does not seem particularly sensitive to learning rate or
batch size.
We evaluate LMSE on the validation set after each training epoch, stopping the training early if the validation error does not
improve for 20 epochs. Except for LG-ODE we train all models until this early stopping occurs, which typically takes less
than 150 epochs.
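The early stopping criterion amounts to a simple patience counter, sketched below:

class EarlyStopping:
    """Signal that training should stop when validation LMSE has not improved for `patience` epochs."""

    def __init__(self, patience=20):
        self.patience = patience
        self.best = float("inf")
        self.epochs_since_best = 0

    def step(self, val_error):
        if val_error < self.best:
            self.best = val_error
            self.epochs_since_best = 0
        else:
            self.epochs_since_best += 1
        return self.epochs_since_best >= self.patience        # True: stop training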

E.1 Traffic Data

For the traffic data we perform hyperparameter tuning on the versions of the datasets with 25% observations. The same
hyperparameter configuration, shown in Table 5, performed the best for both PEMS-BAY and METR-LA.
We used the slightly downscaled version of the LG-ODE model for the traffic data. We also tried the default hyperparameters on the METR-LA dataset, but this did not improve the results. LG-ODE was trained for 50 epochs with a batch size of 8, and we observed no further improvements when trying to train the model for longer.

Figure 4: Examples of signals in the synthetic periodic data for 5 of the 20 nodes. The lines show clean signals κn (t) and
each dot a (noisy) observation yn (ti ). Note that the time steps are irregular and that not all signals are observed at each
observation time.

E.2 USHCN Climate Data

For the USHCN data we perform the hyperparameter tuning on the Tmin dataset. Since the USHCN graph contains many
nodes we are somewhat more restricted in how large we can scale up the models.

E.3 Synthetic Periodic Data

On the periodic data we tried the same hyperparameters for GRU-D (joint) as for the other models, but no configuration gave any useful results. Because of this we exclude the model from this experiment. The number of epochs until the validation LMSE stops decreasing is higher for the periodic data, with models training for up to 500 epochs.
We used default hyperparameters for the LG-ODE model on the periodic data, but with a batch size of 16. As this graph
only contains 20 nodes we were here able to train 5 LG-ODE models with different random seeds.

E.4 Loss Weighting

For the loss weighting experiment we do not perform any new hyperparameter tuning, but use the best configuration from
the traffic data experiment. One TGNN4I model was trained for each weighting function. We here use a batch size of 4
and Ninit = 25, which allows us to study predictions further into the future, i.e. at higher ∆t.

Appendix References
De Loera, J. A., Rambau, J., and Santos, F. (2010). Triangulations, volume 25 of Algorithms and Computation in Mathematics. Springer.
Kingma, D. P. and Ba, J. (2015). Adam: A method for stochastic optimization. In International Conference on Learning
Representations (ICLR).

Table 5: Values considered in hyperparameter tuning for the different datasets. Bold numbers represent the best performing
configuration, that was then used in the final experiment.

Model | Learning rate | dh | Fully connected layers in g | GNN layers in g | GNN layers in GRU | Batch size

Traffic Data
GRU-D (joint) | 0.001, 0.0005 | 64, 128, 256, 512 | 1, 2 | - | - | 16
GRU-D (node) | 0.001, 0.0005 | 64, 128, 256 | 1, 2 | - | - | 8
TGNN4I | 0.001, 0.0005 | 64, 128, 256 | 1, 2 | 1, 2 | 1, 2 | 8
Transformer (joint) | 0.001 | 64, 256, 512, 2048 | 2, 4 (Transformer layers) | 8
Transformer (node) | 0.001 | 64, 128, 256 | 2, 4 (Transformer layers) | 8

USHCN Data
GRU-D (joint) | 0.001 | 64, 128, 256, 512 | 2 | - | - | 16
GRU-D (node) | 0.001 | 32, 64, 128 | 2 | - | - | 4
TGNN4I | 0.001 | 32, 64, 128 | 2 | 1, 2 | 1, 2 | 4
Transformer (joint) | 0.001 | 64, 256, 512, 2048 | 2, 4 (Transformer layers) | 8
Transformer (node) | 0.001 | 64, 128 | 2, 4 (Transformer layers) | 8

Periodic Data
GRU-D (node) | 0.001 | 64, 128, 256 | 2 | - | - | 16
TGNN4I | 0.001 | 64, 128, 256 | 2 | 2 | 2 | 16
Transformer (joint) | 0.001 | 64, 256, 512, 2048 | 2, 4 (Transformer layers) | 8
Transformer (node) | 0.001 | 64, 128, 256 | 2, 4 (Transformer layers) | 8
