Ting Ta Jiun 202111 MAS Thesis
by
Ta Jiun Ting
Master of Applied Science
Graduate Department of Mechanical and Industrial Engineering
University of Toronto
2021
Abstract
Many methods of traffic prediction have been proposed over the years, from time
series models more than 40 years ago to the latest deep learning models today. This
progression prompts the need for an in-depth comparison and raises the critical
question of whether deep learning offers significant improvements over the traditional
methods. This thesis addresses this question by systematically evaluating the different methods.
We first procure a diverse set of traffic data from simulation software and real-world
sensors. We then compare the different methods and perform further analysis on the
latest deep learning models. Finally, we also consider the task of predicting long-term
traffic up to a week in advance. Overall, we demonstrate that deep learning models
are effective in short-term prediction. However, the classical random forest regression
provides the best performance in both short-term and long-term prediction, which
suggests that there is still room for improvement for deep learning methods.
Acknowledgments
Contents

1 Introduction
  1.1 Motivation
  1.2 Contributions
  1.3 Outline

2 Background
  2.1 Overview
  2.2 Problem Definition
  2.3 Evaluation Metrics
  2.4 Prediction with Dynamic Traffic Simulation Models
    2.4.1 Macroscopic Approach
    2.4.2 Microscopic Approach
  2.5 Time Series Models
    2.5.1 ARIMA Models
    2.5.2 Exponential Smoothing
    2.5.3 Facebook Prophet
  2.6 Classical Machine Learning
    2.6.1 Linear Regression
    2.6.2 Ensemble Regression Tree
  2.7 Deep Neural Networks
    2.7.1 Multilayer Perceptron
    2.7.2 Recurrent Neural Networks
    2.7.3 Graph Neural Networks

3 Evaluation of Classical and Current Machine Learning Methods
  3.1 Introduction
  3.2 Methodology
    3.2.2 Evaluation
    3.2.3 Model Selection
  3.3 Results
  3.4 Discussion
  3.5 Conclusion

4 Comparative Evaluation of Graph Neural Network Variants
  4.1 Introduction
  4.2 Methodology
  4.3 Results

6 Conclusion
  6.1 Summary
  6.2 Future Directions
  6.3 Concluding Remarks
Chapter 1
Introduction
1.1 Motivation
Over the decades, researchers have approached this problem with a progression of
techniques, including time series analysis, regression analysis, artificial neural
networks, and deep neural networks. Notably, in the past 5 years, the prevailing model
for traffic prediction has shifted to graph neural networks (GNNs) [5]–[8], a type of
artificial neural network that generates predictions by exploiting the road topology
to detect correlations between different road segments. These methods are discussed
in further detail in Chapter 2 of this thesis.
In our review of the recent literature, we noticed that while there is a continual
development of new graph neural network variants, there is a distinct lack of com-
prehensive evaluation against other types of traffic prediction methods. In addition,
while the prediction errors are seemingly lowered with each new improvement, the
literature lacks a broad comparison between graph neural networks that identifies
the components and techniques crucial to performance. Therefore, our work begins
with a thorough evaluation of the different methods, both to validate the recent
advancements in the field of traffic prediction and to provide insights into the success
or failure of the different models in various settings.
1.2 Contributions
The goal of this thesis is to examine the plethora of prediction models that exist in
the literature in order to identify candidate models for field deployment and poten-
tial directions for future traffic research. The main contributions of this thesis are
threefold.
1. We conduct a comprehensive survey of the different classes of short-term traffic
prediction methods that have appeared over the past 50 years. Afterwards, we
select representative models from each class and conduct a thorough evaluation
of their capabilities. In this analysis, we employ traffic simulation software to
generate data that includes various traffic scenarios potentially seen in the real
world. We demonstrate the effectiveness of regression analysis techniques and
identify useful input features to a traffic prediction model. Furthermore, we
highlight the detrimental effects of sharing model parameters in GNNs, which
motivates the further investigation that culminates in the second contribution
of this thesis. Lastly, to the best of our knowledge, this is the first comprehen-
sive study that evaluates different classical and contemporary traffic prediction
models in a variety of settings with the aid of traffic simulation software.
2. We perform an in-depth evaluation of GNNs that have appeared in the recent
traffic prediction literature by analyzing the effects of individual GNN com-
ponents on prediction performance. We demonstrate that the state-of-the-art
GNNs can potentially learn a model that indicates direct traffic influence among
faraway links, which is inconsistent with traffic behaviour. Lastly, we showcase
that the performance of GNNs may be overstated in the literature since their
prediction accuracy is largely comparable to the traditional machine learning
model of random forest that predates GNNs by over 15 years; however, it is
important to note that GNNs contain significantly fewer model parameters.
3. We create a traffic prediction toolkit that includes all methods in our evaluations,
including both short-term and long-term predictions. This toolkit is beneficial
for future research in this area as it allows us to easily compare and visualize
different methods. Initially, we integrate this toolkit with a traffic model for the
Greater Toronto and Hamilton Area created in the Aimsun [9] traffic simulation
software.
1.3 Outline
The remainder of this thesis proceeds as follows. Chapter 2 introduces the rele-
vant background material, including a more precise definition of the traffic prediction
problem, the common evaluation metrics and an overview of the evolution of traf-
fic prediction models over the past 50 years. Afterwards, Chapter 3 discusses the
methodology and findings of our evaluation of the different classes of traffic pre-
diction methods using simulation data, which corresponds to our first contribution.
Meanwhile, Chapter 4 focuses on our analysis of graph neural networks and their
components, which corresponds to our second contribution. Subsequently, Chapter
5 examines the methods concerning long-term traffic prediction with prediction hori-
zons that are multiple days in advance, where we assess several recent time series
forecasting methods in the literature by comparing them against classical methods.
Finally, Chapter 6 is the conclusion of this thesis, where we summarize our findings
and provide recommendations for future traffic prediction research for both short-term
and long-term predictions.
Chapter 2
Background
2.1 Overview
This chapter introduces the background material relevant to the contribution of this
thesis. We begin by defining the traffic prediction problem in Section 2.2 and the
evaluation metrics in Section 2.3. Afterwards, we broadly classify the different traffic
prediction methods that are present in the literature into four categories and discuss
them in the four remaining sections of this chapter. This discussion is not intended to
be an exhaustive survey; in particular, the primary focus is placed on methods that
are relevant to our contributions.
2.2 Problem Definition

Road traffic is a highly dynamic process evolving over space and time. For a stretch
of highway, traffic conditions at a given location influence its downstream sections as
traffic moves forward. Meanwhile, traffic conditions at a given location also influence
its upstream sections as traffic congestion propagates backwards via shock waves, a
phenomenon known since the 1950s [10]. These two basic processes are constantly
present on any road network, evolving continuously in response to the changes in
traffic conditions. In the urban setting, the presence of intersections and traffic signals
further influences the dynamics of traffic patterns and complicates the prediction
problem.
Traffic prediction can be performed at different levels, from the behaviours of
individual vehicles to the traffic states of entire districts. The prediction methods
vary between the different levels. At the individual vehicle level, the predictor models
the behaviour of each vehicle to predict future trajectories. Meanwhile, at the link
level, the predictor focuses on macroscopic properties such as traffic speeds, flows,
and densities on each road and predicts their evolution over time. For this thesis,
we define the prediction problem as link-level prediction and the task of predicting
individual vehicle properties is considered out of scope. Furthermore, when discussing
applications of graph neural networks in traffic prediction, we may also refer to road
links as nodes, and connections between road links (such as road intersections) as
edges.
We can represent the traffic properties on each link with a variety of variables,
including:
• Flow (q): number of vehicles passing a point on the link per unit time
• Density (ρ): number of vehicles per unit distance at a given instant in time
• Speed (v): distance travelled per unit time
• Occupancy (occ): fraction of time a point on the link is occupied by a vehicle
The above quantities are related to one another in the equations below:

q = ρ · v  (2.1)
spacing = ρ^(−1)  (2.2)
headway = q^(−1)  (2.3)
q = c · occ · v  (2.4)
The relationships among these variables are commonly summarized in the macroscopic
fundamental diagram, shown in Figure 2.1. The diagram is specific to each road
because values such as the free-flow speed and maximum density are highly dependent
on road characteristics.
Figure 2.1: The macroscopic fundamental diagram of traffic. (Panels relate the
macroscopic variables pairwise; the surviving panel label is (c) speed and flow, with
the free-flow speed v_freeflow and the maximum flow q_max marked on the axes.)
Typically, only a small subset of these properties are available to the transportation
agency depending on the sensors installed. Consequently, traffic prediction typically
only predicts flow or speed depending on the source of data. Loop detectors installed
in the road can produce accurate vehicle count data that allows us to calculate flow
by counting the number of vehicles passing through in a unit time; however, it is
impossible to obtain information such as speed or density with a single loop detector
since they require observation over a distance rather than a single point. Meanwhile,
GPS or Bluetooth data contains locations and timestamps that allow us to calculate
average vehicle speed by dividing distance traversed over time elapsed; yet flow and
density information from this data can be inaccurate since some vehicles may lack
the technology and would be unaccounted for in the data. In practice, other factors
such as weather and road design also influence traffic patterns; however, similar to
other works on this topic, this thesis only uses the past observations and the graph
structure of the road network as inputs to the prediction models. We adopt the
following notation throughout this thesis:
• G = (V, E): The directed graph which describes the road network. V is the set
of nodes, which represents the links, with |V| = N. E is the set of edges, which
represents the intersections of the road network.
• N(i): The set of nodes in the neighbourhood of node i. This is not restricted to
the immediate neighbours of node i, and also includes node i itself.
• x_i^(t): A vector with length d that represents the observation of node i at time t.
• X^(t): A matrix with size (N × d) that represents the observation of the entire
road network at time t.
• x̂_i^(t): A vector with length d that represents the prediction of node i at time t.
• X̂^(t): A matrix with size (N × d) that represents the prediction of the entire
road network at time t.
Using the above notation, we can define the prediction problem as learning a
function f that maps the past observations to predictions using the graph G as follows:

[X̂^(t+1), ..., X̂^(t+H)] = f(G; X^(t−p+1), ..., X^(t))  (2.5)

where p is the number of past observations provided to the model and H is the
prediction horizon. The output of the model is the predicted state values of each
location in the road network over each time slice in the prediction horizon.
The goal of a predictive model is to minimize the difference between the predicted
state values and the actual state values. To quantify this numerically, researchers
typically use a combination of the following time series regression metrics to assess
the performance of predictive models: mean absolute error (MAE), mean absolute
percentage error (MAPE), mean-square error (MSE), and root-mean-square error
(RMSE). MAE is the average of the absolute error across all predictions, while MAPE
is the average of the absolute relative error, which gives more weight to errors at low
observed values. Meanwhile,
MSE is the averaged squared error, which also represents the variance of prediction
errors. Lastly, RMSE is the square root of the average squared errors, which is also the
standard deviation of prediction errors. For a prediction horizon H, given predictions
x̂_i^(t+1), x̂_i^(t+2), ..., x̂_i^(t+H) and the actual observed values x_i^(t+1),
x_i^(t+2), ..., x_i^(t+H) for N different samples, we can calculate the four metrics
using the equations below:
MAE = (1 / (N · |V| · H)) Σ_{i=1}^{N} Σ_{j=1}^{|V|} Σ_{k=1}^{H} ‖ x_{i,j}^(t+k) − x̂_{i,j}^(t+k) ‖_1  (2.6)

MAPE = (1 / (N · |V| · H)) Σ_{i=1}^{N} Σ_{j=1}^{|V|} Σ_{k=1}^{H} ‖ (x_{i,j}^(t+k) − x̂_{i,j}^(t+k)) / x_{i,j}^(t+k) ‖_1  (2.7)

MSE = (1 / (N · |V| · H)) Σ_{i=1}^{N} Σ_{j=1}^{|V|} Σ_{k=1}^{H} ‖ x_{i,j}^(t+k) − x̂_{i,j}^(t+k) ‖_2^2  (2.8)

RMSE = sqrt( (1 / (N · |V| · H)) Σ_{i=1}^{N} Σ_{j=1}^{|V|} Σ_{k=1}^{H} ‖ x_{i,j}^(t+k) − x̂_{i,j}^(t+k) ‖_2^2 )  (2.9)
2.4 Prediction with Dynamic Traffic Simulation Models

Traffic prediction and modelling have been studied since the 1950s. This section covers
the traditional methods which model traffic dynamics explicitly using the observed
behaviours of traffic. In contrast, the newer methods adopt data-driven techniques
such as time series analysis, regression analysis, or deep learning. This thesis focuses
on data-driven techniques and does not experiment with the methods described in this
section; however, we include their discussion in this chapter due to their significant
and historical contribution to this field. These methods can be roughly divided into
two categories: the macroscopic approach and the microscopic approach.
2.4.1 Macroscopic Approach

[Figure 2.2: A road corridor discretized into cells, with cell densities ρ_1, ..., ρ_4,
speeds v_1, ..., v_4, flows q_0, ..., q_4 between cells, and inflows/outflows along the
corridor.]
The macroscopic approach to traffic modelling treats the stream of traffic on a road
link as a collective, homogeneous entity evolving over time due to influences from up-
stream and downstream links. In 1956, Lighthill and Whitham [4] pioneered this ap-
proach by creating the kinematic wave model, which treats traffic waves propagating
slowly along a road as analogous to kinematic waves in fluid dynamics. Consequently,
this model applies the kinematic wave equations to
describe the change in flow, density, and speed along a road. The notion of traffic
shock waves augments this model by explaining the cause of congestion accumulation
and dissipation on a road [10].
The simplicity of the kinematic wave model allows a numerical solution to be
described, known as the cell transmission model (CTM) [12]. The CTM works by
discretizing a road into cells, as shown by Figure 2.2, where the length of each cell
is equal to the distance travelled in one time step at the free-flow speed. Given the
speed v or density ρ of each cell as well as any inflows and outflows from the corridor,
we can compute the flow q between consecutive cells using the fundamental diagram
in Figure 2.1. Subsequently, based on conservation of inflows and outflows at each
cell, we can calculate the density of each cell at the next time step. Therefore, if
the initial conditions of the road are known, we can iteratively update the cells and
predict future traffic conditions along a road corridor. A more detailed description
of the cell transmission model can be found in [12]. Extensions to the CTM include
applying CTM to the analysis of complex road networks [13], and optimizing the
CTM by reducing the number of cells [14]. Moreover, the works of [15] and [16] show
that CTM theory can also be applied to traffic signal control.
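To make the update rule concrete, the sketch below implements one density update of
a single-corridor CTM under an assumed triangular fundamental diagram. The parameter
values and function names are illustrative assumptions, not the calibrated values used
elsewhere in this thesis.

    import numpy as np

    # Illustrative triangular fundamental diagram parameters (assumed values)
    V_FREE = 100.0    # free-flow speed (km/h)
    W_CONG = 20.0     # congestion wave speed (km/h)
    RHO_MAX = 150.0   # jam density (veh/km)
    Q_MAX = 2000.0    # capacity (veh/h)

    def ctm_step(rho, q_in, dt, dx):
        """Advance cell densities rho (veh/km) by one time step of dt hours,
        with cells of length dx km and upstream inflow q_in (veh/h)."""
        demand = np.minimum(V_FREE * rho, Q_MAX)               # sending flow of each cell
        supply = np.minimum(W_CONG * (RHO_MAX - rho), Q_MAX)   # receiving flow of each cell
        # Flow across each interface = min(upstream demand, downstream supply)
        q = np.minimum(demand[:-1], supply[1:])
        q = np.concatenate(([min(q_in, supply[0])], q, [demand[-1]]))
        # Conservation: density change = (inflow - outflow) / cell length
        return rho + (dt / dx) * (q[:-1] - q[1:])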
One limitation of the first-order kinematic wave model is that it assumes instanta-
neous change in speed with corresponding changes in density, which implies infinite
acceleration. Therefore, in the 1970s, there were attempts to improve the first-order
model with second-order extensions that describe the dynamics of speed and
acceleration explicitly.
2.5 Time Series Models

As data collection and processing technology improved over the years, a new class
of traffic prediction models emerged. Researchers began to view the evolving macro-
scopic traffic properties on a road as time series and applied time series analysis to
predict traffic in the 1970s. In contrast to the methods described in Section 2.4, these
methods do not model traffic dynamics explicitly. Instead, these methods use the
available data to estimate the parameters of the model.
2.5.1 ARIMA Models

The autoregressive integrated moving average (ARIMA) model combines an autoregressive
(AR) component, which regresses the current value of a time series on its own past
values, with a moving average (MA) component, which models the current value using
past forecast errors:

x^(t) = c + Σ_{k=1}^{p} φ_k x^(t−k) + Σ_{k=1}^{q} θ_k ε^(t−k) + ε^(t)  (2.10)

where c is a constant, φ and θ are model coefficients, and ε denotes the error terms.
Finally, the full ARIMA model adds a differencing component, which transforms
the initial time series into a series of differences between consecutive observations in
the time series. The differencing is performed before fitting the model in Equation
(2.10) in order to transform a non-stationary time series to become stationary, which
means that the properties of the time series are constant over time. Based on this
description, an ARIMA model can be described using a set of 3 numbers (p, d, q),
where p is the order (number of time lags) of the autoregressive component, d is the
number of times first-order differencing is applied, and q is the order of the moving
average component. The procedure for determining the orders and estimating the
model parameters is specified in [28]. It should be noted that the ARIMA model
is a univariate model, thus road segments would be decoupled from one another and
analyzed individually through time in traffic prediction applications. In this thesis, we
implement the ARIMA models using the pmdarima [29] and statsmodels [30] Python
packages.
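As a brief illustration of this workflow, the following sketch fits an ARIMA model to
a single link's speed series with pmdarima's auto_arima, which searches over the
orders (p, d, q) automatically; the synthetic series stands in for real observations.

    import numpy as np
    import pmdarima as pm

    # Synthetic speed series standing in for one link's past observations
    speeds = 60 + 10 * np.sin(np.linspace(0, 20, 240)) + np.random.randn(240)

    # Search over (p, d, q) automatically and fit the best non-seasonal model
    model = pm.auto_arima(speeds, seasonal=False, max_p=5, max_q=5,
                          stepwise=True, suppress_warnings=True)
    forecast = model.predict(n_periods=5)   # predict the next 5 intervals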
The ARIMA model can be augmented by incorporating additional terms that
describe the seasonality of a time series, known as the seasonal ARIMA model. The
seasonal part of the seasonal ARIMA model also contains autoregression, differencing,
and moving average components. However, instead of past observations, seasonal
differencing computes the differences between an observation and the observation
from previous seasons. Similarly, the observations from previous seasons are used to
compute the AR and the MA components in the seasonal part of a seasonal ARIMA
model. The full seasonal ARIMA model can be described as (p, d, q)(P, D, Q)m , where
the lowercase letters denote the non-seasonal part while the uppercase letters denote
the seasonal part of the model, and m denotes the number of observations in a season.
In traffic, there are clear daily and weekly seasonal patterns that can be exploited
with this type of model; we can cite [31], [32] as applications of seasonal ARIMA in
traffic prediction.
Other extensions to the ARIMA model also exist in the literature. The ARIMAX
model (ARIMA model with eXogenous variables) allows ARIMA to model time series
using exogenous variables, such as incorporating data from neighbouring links to
generate predictions [33]. Similarly, the ARIMA model can also be extended to
model the evolution of multiple endogenous time series simultaneously using vector
autoregression and space-time ARIMA, which can capture the correlations among
multiple roads [34], [35]. The fitting procedure of these models is very tedious and
time-consuming, which makes them unsuitable for large-scale prediction. Therefore,
in this thesis, we only experiment with the ARIMA and seasonal ARIMA models
described above.
2.5.2 Exponential Smoothing

Exponential smoothing is another classical family of time series models. In its
simplest form, it consists of a forecasting equation and a smoothing equation that
models the level of the time series. This is shown below in Equations (2.11) and
(2.12), where γ denotes the smoothing factor while l and ε respectively represent the
level and the error terms.
• Forecasting equation:

x̂_i^(t+1) = l_i^(t)  (2.11)

• Smoothing equation:

l_i^(t) = γ x_i^(t) + (1 − γ) l_i^(t−1)
        = l_i^(t−1) + γ (x_i^(t) − l_i^(t−1))  (2.12)
        = l_i^(t−1) + γ ε_i^(t)
We can also extend this method to capture changes in trend and seasonality in
the time series by adding additional component equations, which were first described
by Holt [36] and Winters [37]. The smoothing equation for the level also describes
the observation error (2.12); therefore, this type of model is also known as the ETS
(error, trend, seasonal) model in the literature.
The convention is to use three letters to describe the error, trend, and seasonal
components of an ETS model. The error component can be either additive (A) or
multiplicative (M). Meanwhile, the trend component can be none (N) or additive (A).
Lastly, the seasonal component can be none (N), additive (A), or multiplicative (M).
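As an illustration, a minimal sketch of fitting an ETS(A, N, A) model with
statsmodels follows; the daily seasonal period of 288 five-minute intervals and the
synthetic data are assumptions for the example.

    import numpy as np
    from statsmodels.tsa.holtwinters import ExponentialSmoothing

    # Synthetic speed series with daily seasonality: 288 five-minute intervals/day
    t = np.arange(288 * 7)
    speeds = 60 + 15 * np.sin(2 * np.pi * t / 288) + np.random.randn(t.size)

    # ETS(A, N, A): additive error, no trend, additive daily seasonality
    model = ExponentialSmoothing(speeds, trend=None, seasonal="add",
                                 seasonal_periods=288).fit()
    forecast = model.forecast(12)   # predict the next hour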
2.6 Classical Machine Learning

2.6.1 Linear Regression

In the simplest case of linear regression, we can compute future state values as a
linear combination of multiple explanatory features, such as past observations of
neighbouring links. Variations of linear regression have been listed in the traffic
prediction literature as baseline models to compare performance with other models
[40]. Using the notations defined in Section 2.2, a linear regression model for traffic
prediction can be defined as follows:
x = ∥_{j∈N(i)} ∥_{k=1}^{p} x_j^(t−k)  (2.13)

x̂_i^(t) = w^⊤ x + b  (2.14)
where x denotes the input features to the regression model, ∥ denotes the vector
concatenation operation, p denotes the number of past time steps to consider, while
w and b are parameters of the model. In other words, the prediction for link i at
time t is a linear combination of the features in the past p observations of the links
in the neighbourhood of i plus a bias term. The objective is to find the best set of
model parameters w and b that minimizes the sum of the squares of prediction error
over T training observations. This objective is defined as:

ŵ = argmin_w Σ_{t=1}^{T} ‖ x_i^(t) − (w^⊤ x + b) ‖_2^2 + λR(w)  (2.15)
The final term, λR(w), is a regularizer over the model parameters that prevents
overfitting. The regularizer contains a regularization strength parameter λ > 0 and
regularization function R(w). There are two common regularization functions, L1
and L2, which are defined as follows:

L1: R(w) = ‖w‖_1 = Σ_k |w_k|
L2: R(w) = ‖w‖_2^2 = Σ_k w_k^2
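In practice, this objective can be minimized with off-the-shelf solvers. The sketch
below uses scikit-learn's Ridge and Lasso, which correspond to the L2 and L1
regularizers respectively; the feature layout and data are illustrative assumptions.

    import numpy as np
    from sklearn.linear_model import Ridge, Lasso

    # X: rows are the concatenated past observations of neighbouring links
    # (Equation (2.13)); y: the observed speed of link i at the target time.
    X = np.random.rand(1000, 3 * 5)   # e.g., 3 neighbouring links, p = 5 past steps
    y = X @ np.random.rand(15) + 0.1 * np.random.randn(1000)

    l2_model = Ridge(alpha=1.0).fit(X, y)   # L2 penalty (alpha plays the role of lambda)
    l1_model = Lasso(alpha=0.01).fit(X, y)  # L1 penalty encourages sparse weights
    print(l2_model.predict(X[:1]), l1_model.predict(X[:1]))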
2.6.2 Ensemble Regression Tree

[Figure 2.5: An example of a simple regression tree. The root node splits on X > 5;
one branch splits further on Y > 2; the leaf nodes contain the simple models
Z = 2Y + 3, Z = 3X + 1, and Z = X + Y.]
Linear regression assumes a linear relationship between the input features and the
target variable, which may not be true for traffic prediction. Therefore, we also explore
the use of decision tree learning, a simple yet very powerful non-linear regression
method.
Decision trees with regression targets are also known as regression trees. A regres-
sion tree splits the input samples recursively into a tree-like decision diagram until it
reaches the desired number of depth or leaf nodes. Each internal node of the tree con-
tains a rule that splits the samples according to the value of some features and passes
each split to the corresponding children node. Each leaf node of the tree contains a
simple model that describes only the samples within its split. During prediction, we
traverse the tree based on the features until we reach a leaf node, then the model
within the node can generate the predicted target. With a large number of nodes,
this approach can approximate complex functions with relatively simple models; how-
ever, a large number of nodes can also lead to overfitting. Therefore, it is common to
construct multiple regression trees via ensemble learning to mitigate overfitting. An
example of a simple regression tree is shown in Figure 2.5. Although regres-
sion trees can split the samples according to more general inequalities that contain
multiple features, the implementation in this thesis selects a single feature to perform
each split.
There are two common techniques of ensemble regression trees: bagging (boot-
strap aggregating) and boosting. In bagging, each regression tree in the ensemble is
constructed only using a subsample of training data (with replacement) and predic-
tions are generated by averaging the predicted values from all trees. Random forest
[41] is a typical example of bagged regression trees. In contrast, boosting [42] builds
the regression trees sequentially and each subsequent tree emphasizes samples that
previous trees fail to predict accurately. Similar to linear regression, the input to
an ensemble regression tree model for traffic prediction is the recent observations of
nearby links shown in Equation (2.13). The training methodologies of these models
are specified in the respective papers [41], [42] and we omit the detailed mathemati-
cal formulation in this discussion due to their complexity. Overall, ensemble learning
creates a robust regression model and its prowess in traffic prediction has been cited
in [7], [40].
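As an illustration, the sketch below trains a bagged ensemble with scikit-learn's
RandomForestRegressor on synthetic features laid out as in Equation (2.13); the
hyperparameter values and data are illustrative assumptions.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    X = np.random.rand(1000, 15)   # same feature layout as the linear model above
    y = np.sin(X.sum(axis=1)) + 0.1 * np.random.randn(1000)

    # Bagged ensemble of 100 regression trees; each tree fits a bootstrap sample
    forest = RandomForestRegressor(n_estimators=100, max_depth=10).fit(X, y)
    y_hat = forest.predict(X[:5])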
Other classical machine learning methods also exist in the traffic prediction litera-
ture, such as k-nearest neighbour [43]–[45] and support vector regression [46], [47].
However, their presence in the recent literature is minimal since they are largely
superseded by artificial neural networks, which we introduce in the next section.
Therefore, while it is important to mention the existence of these works, they do not
relate significantly to our contributions and are omitted in this thesis.
2.7 Deep Neural Networks

Since the beginning of the 21st century, the rise of artificial neural networks and deep
learning gave researchers a new tool for traffic prediction. This section discusses the
various deep learning methods and their application in traffic prediction. Similar to
the time series and classical machine learning methods discussed previously, these
methods represent traffic dynamics implicitly. Deep neural networks are universal
function approximators that can approximate any continuous function in Euclidean
space [48]. Consequently, we can use deep neural networks to capture the complex
relationship of traffic properties evolving over space and time. In this thesis, we
implement all deep neural networks using PyTorch [49] with the MSE loss function
defined in Equation (2.8).
2.7.1 Multilayer Perceptron

Multilayer perceptron (MLP) is the first and simplest type of artificial neural net-
work, consisting of one input layer, a series of one or more hidden layers, and one
output layer. The input layer accepts the input vector x and the information travels
through the layers sequentially, ending at the output layer. After the input layer,
each subsequent layer receives information from the previous layer and applies a
linear transformation along with a non-linear activation function. The size of the
transformed vector after each layer is also known as its dimension. Formally, in an
MLP with L − 1 hidden layers, we can define the outputs h of each layer recursively
as follows:
h^(0) = x
h^(l) = σ( W^(l)⊤ h^(l−1) + b^(l) ),  1 ≤ l ≤ L  (2.16)
where σ denotes the activation function while W(l) and b(l) respectively denote the
model weights and bias of layer l. The final vector at the output layer, h(L) , is the
model prediction. During training, we provide the model with labelled data and
iteratively optimize the model parameters via gradient descent and backpropagation
to minimize the specified loss function. Figure 2.6 is an example of a multilayer
perceptron.
In the context of traffic prediction, the input and output features can be the
same as the regression analysis methods introduced in Section 2.6, which corresponds
to a model capable of predicting for a single link. Alternatively, the model can
also be configured to predict for the entire road network by using an input feature
that consists of recent observation across the entire network. We examined both
configurations in our experiments, which are discussed in Chapter 3. In our work, we
experimented with three common activation functions: rectified linear unit (ReLU),
sigmoid, and hyperbolic tangent (tanh), which are defined below.
• ReLU:

ReLU(z) = max(0, z)  (2.17)

• Sigmoid:

S(z) = 1 / (1 + e^(−z))  (2.18)

• Hyperbolic tangent:

tanh(z) = (e^z − e^(−z)) / (e^z + e^(−z))  (2.19)
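A minimal PyTorch sketch of an MLP of this form, together with one gradient descent
update, is shown below; the layer sizes and data are assumptions for illustration.

    import torch
    import torch.nn as nn

    # A two-hidden-layer MLP mapping p past observations of the neighbourhood
    # to a single predicted speed
    model = nn.Sequential(
        nn.Linear(15, 64), nn.ReLU(),   # input layer -> hidden layer 1
        nn.Linear(64, 64), nn.ReLU(),   # hidden layer 1 -> hidden layer 2
        nn.Linear(64, 1),               # hidden layer 2 -> output layer
    )

    x = torch.rand(32, 15)                  # a mini-batch of 32 samples
    y = torch.rand(32, 1)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss = nn.MSELoss()(model(x), y)        # MSE loss as in Equation (2.8)
    loss.backward()                         # backpropagation
    optimizer.step()                        # one gradient descent update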
In the traffic prediction literature, this type of artificial neural network appeared in
the late 1990s and early adopters of this model include [51]–[54]. Over time, training
neural networks became more efficient due to advances in training techniques [55],
[56] as well as increased computation power. This allowed for neural networks with
increasingly many layers and prompted the development of more complex neural
network architectures discussed in the later sections.
2.7.2 Recurrent Neural Networks

Recurrent neural networks (RNNs) are the predominant deep learning model for
analyzing sequential data, consisting of repeated cells that form a temporal sequence.
In contrast to an MLP where each layer contains distinct model weights, the model
parameters in an RNN are shared across time steps. Moreover, an RNN accepts
input data sequentially at the corresponding cells of each time step. Consequently,
by varying the number of repeated cells, this architecture can process sequential data
of different lengths. There are two common RNN cell architectures: long short-term
memory (LSTM) [57] and gated recurrent unit (GRU) [58], which are defined below
in Equations (2.20) and (2.21).
• LSTM:

k_1^(t) = S( W_1 x^(t) + U_1 h^(t−1) + b_1 )
k_2^(t) = S( W_2 x^(t) + U_2 h^(t−1) + b_2 )
k_3^(t) = tanh( W_3 x^(t) + U_3 h^(t−1) + b_3 )
k_4^(t) = S( W_4 x^(t) + U_4 h^(t−1) + b_4 )  (2.20)
c^(t) = k_1^(t) ∗ c^(t−1) + k_2^(t) ∗ k_3^(t)
h^(t) = k_4^(t) ∗ tanh( c^(t) )
o^(t) = σ( W_o h^(t) + b_o )

• GRU:

k_1^(t) = S( W_1 x^(t) + U_1 h^(t−1) + b_1 )
k_2^(t) = S( W_2 x^(t) + U_2 h^(t−1) + b_2 )
k_3^(t) = tanh( W_3 x^(t) + U_3 ( k_1^(t) ∗ h^(t−1) ) + b_3 )  (2.21)
h^(t) = ( 1 − k_2^(t) ) ∗ h^(t−1) + k_2^(t) ∗ k_3^(t)
o^(t) = σ( W_o h^(t) + b_o )
In this formulation, W, U, and b are the model parameters, k are the intermediate
outputs of the model, h^(t) and o^(t) are respectively the hidden state and the output
generated by the model at time t, while ∗ denotes the Hadamard product. Fur-
thermore, the LSTM model additionally contains a cell state denoted by c. Similar
to MLPs, we can train RNNs using the backpropagation through time algorithm.
It should be noted that the above equations are the standard configurations of the
LSTM and the GRU presented in their respective original papers, but variations also
exist in the literature. Figure 2.7 illustrates how RNN cells can be configured in
different applications.
With a recurrent neural network, we can predict future traffic states by feeding the
past observations sequentially into the model. Similar to the MLP model, an RNN
can be configured to predict a single link or the entire road network by modifying
the input features. Furthermore, an RNN can generate a sequence of predictions by
feeding the predicted values back into the network as the input of the next time step.
Recurrent neural networks have been actively used in traffic prediction research to
capture the temporal dynamics of evolving traffic since the work of [59] in 2015. Later
improvements to this model include stacking multiple RNNs [60] and incorporating
the spatial structure of the road network in the RNN [61].
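A minimal sketch of this usage in PyTorch is shown below: a GRU consumes the past
observations of a single link sequentially, and a linear layer maps the final hidden
state to a one-step-ahead prediction; the sizes are assumptions for illustration.

    import torch
    import torch.nn as nn

    class GRUPredictor(nn.Module):
        """Feed p past observations through a GRU, then map the final
        hidden state to a prediction (a sketch with assumed sizes)."""
        def __init__(self, n_features=3, hidden=64):
            super().__init__()
            self.gru = nn.GRU(n_features, hidden, batch_first=True)
            self.out = nn.Linear(hidden, n_features)

        def forward(self, x):              # x: (batch, p past steps, features)
            _, h = self.gru(x)             # h: final hidden state (1, batch, hidden)
            return self.out(h.squeeze(0))  # one-step-ahead prediction

    model = GRUPredictor()
    pred = model(torch.rand(32, 12, 3))    # 12 past observations per sample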
2.7.3 Graph Neural Networks

Graph neural networks (GNNs) are artificial neural networks designed to process
graph-structured data, primarily through applying the graph convolution operation.
Graph convolution extends the notion of the convolution operation, which is com-
monly applied to analyzing visual imagery with a grid-like structure, to an operation
that can be applied to graphs with arbitrary structures. Therefore, a GNN is capable
of extracting information using the spatial correlations between nodes in a graph and
lends itself well to capturing the complex patterns needed for short-term traffic pre-
diction. We can cite [5] as the first application of graph neural network in short-term
traffic prediction, and there have been many subsequent works that expand upon this
idea.
The first GNN framework is founded on the convolution theorem and the field of
graph signal processing [62]. The convolution theorem states that the Fourier transform
of the convolution of two signals is the point-wise product of their Fourier transforms,
while graph signal processing defines the Fourier transform of a graph based on the
normalized Laplacian matrix L. The definition of the normalized Laplacian matrix is
shown in Equation (2.22) below:
(2.22) below:
L = I − D^(−1/2) A D^(−1/2)  (2.22)
where I is the identity matrix, A is the adjacency matrix of the graph, and D is
the diagonal degree matrix defined by Dii = Σj Aij . Subsequently, we can perform
eigendecomposition on the Laplacian matrix to obtain a matrix of eigenvectors U and
a diagonal matrix of the corresponding eigenvalues Λ. The Fourier transform and the
inverse Fourier transform of a graph signal x ∈ R^(|V|×1) are then respectively defined
in Equations (2.23) and (2.24).

x̃ = U^⊤ x  (2.23)
x = U x̃  (2.24)
Finally, we can perform convolution on a graph signal by applying the Fourier trans-
form, multiplying by the convolutional kernel, then applying the inverse Fourier trans-
form [63]. This is shown in Equation (2.25), where y ∈ R|V|×1 is the output of the
convolution and diagonal matrix Θ is the convolutional kernel.
y = U Θ U^⊤ x  (2.25)
The main problem with this approach is that the operation is not localized in space
because the kernel Θ is applied in the spectral domain after the Fourier transform. In
other words, the output of a node contains information from the entire graph, which
is undesirable for applications such as traffic where a node only exerts influence on
a part of the graph. The work of [64] demonstrates that we can achieve a localized
filter by restricting Θ to be a polynomial of Λ and that the Chebyshev expansion is
a suitable polynomial kernel approximation. The Chebyshev polynomial of order k,
T_k(x), can be written as the recurrence relation in Equation (2.26):

T_k(x) = 2x T_{k−1}(x) − T_{k−2}(x), with T_0(x) = 1 and T_1(x) = x  (2.26)
Lastly, for the case where K = 1, [66] proposes that λmax can be approximated as
2 to achieve a linear model with respect to L. Furthermore, the two parameters, θ0
and θ1 , can be combined into a single parameter θ by setting θ = θ0 = −θ1 . This is
shown in Equation (2.29) below:

y = θ_0 x + θ_1 (L − I) x
  = θ ( I + D^(−1/2) A D^(−1/2) ) x  (2.29)
So far, this section only discussed the case of single dimensional inputs and outputs at
each node. Using the linear model, we can generalize Equation (2.29) to multidimen-
sional inputs and outputs as well as introduce a bias term and activation function.
This is shown in Equation (2.30) below, where the model parameters are defined by
a matrix W with dimensions equal to the number of input features and the number
of output features.
Y = σ( ( I + D^(−1/2) A D^(−1/2) ) X W + b )  (2.30)
Equation (2.30) can equivalently be written node-by-node as an aggregation over each
node's neighbourhood:

h_i = σ( Σ_{j∈N(i)} a_{ij} W x_j + b )  (2.31)

where h_i is the output representation for node i, x_j is the graph input for node j,
a_{ij} is the influence from node j to node i that is defined by the aggregation matrix,
W and b are respectively the weight and bias terms of the model that transform the
input to the hidden dimension, and σ(·) denotes the activation function. In the case of
the linear model, Equations (2.30) and (2.31) are equivalent, as N(i) corresponds to
node i and its one-hop neighbours while the coefficients a_{ij} correspond to entries
in the matrix I + D^(−1/2) A D^(−1/2) in Equation (2.30).
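A minimal PyTorch sketch of the linear graph convolution in Equation (2.30) follows;
the toy adjacency matrix and dimensions are assumptions for illustration, and ReLU
stands in for the generic activation σ.

    import torch
    import torch.nn as nn

    class GraphConvLayer(nn.Module):
        """The linear graph convolution of Equation (2.30):
        Y = sigma((I + D^(-1/2) A D^(-1/2)) X W + b)."""
        def __init__(self, adj, in_dim, out_dim):
            super().__init__()
            deg = adj.sum(dim=1)                       # node degrees
            d_inv_sqrt = torch.diag(deg.pow(-0.5))
            self.register_buffer(
                "prop", torch.eye(adj.size(0)) + d_inv_sqrt @ adj @ d_inv_sqrt)
            self.linear = nn.Linear(in_dim, out_dim)   # holds W and b

        def forward(self, x):                          # x: (N nodes, in_dim)
            return torch.relu(self.prop @ self.linear(x))

    adj = torch.tensor([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])  # toy 3-node chain
    layer = GraphConvLayer(adj, in_dim=2, out_dim=4)
    y = layer(torch.rand(3, 2))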
In the more recent graph attention networks [68], the coefficients a_{ij} are instead produced by
an additional module that learns the relationship between every pair of nodes to
assign weights for the aggregation. This can be achieved with a variety of attention
mechanisms that exist in the literature such as the works of [69], [70]; the mechanism
cited in [68] is as follows:

a_{ij} = softmax_{j∈N(i)}( LeakyReLU( α( W x_i ∥ W x_j ) ) )  (2.32)

where

LeakyReLU(z) = z if z > 0; 0.2z if z ≤ 0  (2.33)
In the above formulation, α defines the linear layer that calculates the attention
value between two nodes, and ∥ denotes the concatenation operator. It is important
to note that there is only one set of model weights W and bias b applied to all
nodes in both formulations.
A short-term traffic prediction model can use GNNs to capture the spatial corre-
lations between different nodes of a road network; however, the model also needs to
account for the changing dynamic of traffic through time. This is commonly achieved
in the literature through the use of RNNs with GRU cells as exemplified by [5], [7],
[8], where the matrix multiplications in the GRU are replaced with a GNN operation.
For example, (2.21) can be modified as follows:
k_{1,i}^(t) = S( Σ_{j∈N(i)} a_{ij} ( W_1 x_j^(t) + U_1 h_j^(t−1) ) + b_1 )
k_{2,i}^(t) = S( Σ_{j∈N(i)} a_{ij} ( W_2 x_j^(t) + U_2 h_j^(t−1) ) + b_2 )
k_{3,i}^(t) = tanh( Σ_{j∈N(i)} a_{ij} ( W_3 x_j^(t) + U_3 ( k_{1,i}^(t) ∗ h_j^(t−1) ) ) + b_3 )  (2.34)
h_i^(t) = ( 1 − k_{2,i}^(t) ) ∗ h_i^(t−1) + k_{2,i}^(t) ∗ k_{3,i}^(t)
o_i^(t) = σ( W_o h_i^(t) + b_o )
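The sketch below shows one way to realize this substitution in PyTorch: a GRU cell
whose gate computations aggregate inputs and hidden states over each node's
neighbourhood with fixed coefficients a_ij, in the spirit of Equation (2.34). It is a
simplified sketch under assumed dimensions, not the exact formulation of any model
evaluated in this thesis; in particular, the reset gate is applied per neighbour.

    import torch
    import torch.nn as nn

    class GraphConvGRUCell(nn.Module):
        """A GRU cell whose gate matmuls aggregate over neighbours."""
        def __init__(self, a, in_dim, hidden):
            super().__init__()
            self.register_buffer("a", a)                          # (N, N) aggregation matrix
            self.gates = nn.Linear(in_dim + hidden, 2 * hidden)   # update & reset gates
            self.cand = nn.Linear(in_dim + hidden, hidden)        # candidate state

        def forward(self, x, h):                  # x: (N, in_dim), h: (N, hidden)
            xh = self.a @ torch.cat([x, h], dim=1)                # neighbourhood aggregation
            z, r = torch.sigmoid(self.gates(xh)).chunk(2, dim=1)
            cand = torch.tanh(self.cand(self.a @ torch.cat([x, r * h], dim=1)))
            return (1 - z) * h + z * cand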
There are also works in the literature that apply other types of deep neural networks
in traffic prediction, the most notable being convolutional neural networks [40], [71],
[72]. However, the attempts at employing convolutional neural networks require ad-
justments that dismantle the topological structure of the road network [72]. As a
result, this type of framework cannot generate predictions on individual links and we
omit them in this thesis.
Chapter 3

Evaluation of Classical and Current Machine Learning Methods
3.1 Introduction
In the previous chapter, we introduced the traffic prediction problem and a wide se-
lection of solutions that currently exist in the literature. We found that recent works
mainly focus on innovating new models, especially in the graph neural network area.
Although the older methods are sometimes listed as baseline models for comparison,
the literature lacks a comprehensive evaluation of the different methods under different
settings. This type of evaluation is important because it allows us to
draw insight into the constitution of an accurate predictive method and incorporate
proven techniques when developing new models.
In this chapter, we address this deficiency by assessing the performance of a large
selection of models in both urban and highway scenarios. We evaluate the models
using data from traffic simulations which avoids the problem of missing data due to
sensor issues common in real-world data sets. In addition, the simulation environ-
ment allows us to easily adjust demands and road conditions to create a variety of
benchmarks for the analysis.
3.2 Methodology
We procured two sets of data to represent the highway and urban traffic conditions;
both sets are created using the Aimsun Next [9] simulation software. For the highway
data, we simulated Queen Elizabeth Way, a highway in Ontario with 56 links on the
Figure 3.1: Map of the highway region chosen for this study.
eastbound direction of travel. For the urban data, we simulated a 167-link region in
downtown Toronto. The maps of the highway and urban regions are respectively
shown in Figure 3.1 and Figure 3.2.
The travel demand was collected from survey data in 2016 [73], then the simulation
model was calibrated using measurements from the loop detectors installed along the
roads [74]. We built the simulation model using morning peak-hour travel demands,
and each simulation is for the four-hour period between 6:00 and 10:00 AM. The
speed (distance traveled per unit time) and flow (number of vehicles per unit time)
for every link are extracted from the simulations in 1-minute intervals. Speed and
flow are selected because they are the most common form of data in the real-world
measured using loop detectors and GPS. To augment the data, the simulation is run
50 times and each simulation uses the original travel demands multiplied by a random
scalar factor between 0.5 and 1.5.
3.2.2 Evaluation
In this study, we assessed the models based on the predicted speed values in 5 different
settings. For both the urban and highway dataset, we used a prediction horizon of 5
minutes as a base case. Missing data due to faulty sensors on the road is very common,
thus we additionally included a scenario to evaluate the model performance under the
presence of missing data in the highway setting. We chose 5% as the probability of
missing data, evaluated using the same 5-minute prediction horizon. Meanwhile, we
Figure 3.2: Map of the urban region chosen for this study.
included a scenario with a 1-minute prediction horizon for the urban dataset, which
is useful for adaptive traffic signal control.
The last scenario evaluates the generalizability of each model. We generated an
artificial dataset from the highway simulation model by blocking a lane in an arbi-
trarily chosen section while using the same travel demands, which simulates traffic
conditions during a traffic accident. We train the models on the regular highway
dataset, then apply them to the unseen artificial dataset to evaluate the performance
metrics. Across all scenarios, we report the MAE, MAPE, and RMSE of the prediction
models; their definitions are outlined in Section 2.3.
3.2.3 Model Selection

For this study, we selected 8 candidate models from the methods discussed in the
previous chapter.
• ARIMA: as described in Section 2.5.1
• Local MLP: as described in Section 2.7.1 with same input and output features
as the regression models
• Global MLP: as described in Section 2.7.1 with network-wide input and pre-
dictions
• Local RNN: as described in Section 2.7.2 with same input and output features
as the regression models
• Global RNN: as described in Section 2.7.2 with network-wide input and pre-
dictions
3.3 Results
The results of the models under the five scenarios are shown in the tables below. The
mean and the confidence interval for each metric are calculated using 5-fold cross-
validation. In 5-fold cross-validation, the full dataset is split into 5 partitions with
equal sizes. In each iteration, one partition is reserved for testing while the remaining
partitions are used for training; the process is repeated until each of the 5 partitions
has been used as the test set. We did not perform 5-fold cross-validation on the
scenario with unseen artificial data since the test dataset is completely separated
from training and validation datasets.
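A sketch of this procedure with scikit-learn's KFold is shown below; the model, data,
and metric are placeholders, and the mean and spread of the fold scores give the mean
and confidence interval reported for each metric.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import KFold

    X, y = np.random.rand(500, 15), np.random.rand(500)   # placeholder data

    maes = []
    for train_idx, test_idx in KFold(n_splits=5, shuffle=True).split(X):
        # One partition is held out for testing; the rest is used for training
        model = RandomForestRegressor(n_estimators=50).fit(X[train_idx], y[train_idx])
        maes.append(np.abs(model.predict(X[test_idx]) - y[test_idx]).mean())

    print(np.mean(maes), np.std(maes))   # summarize performance across the 5 folds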
Except for the unseen test set scenario, the baseline models of constant and pre-
vious interval performed worse than every other model. However, the simple linear
regression baseline proves itself to be comparable in performance with the deep learn-
ing architectures. The only scenario where the linear regression model had poor
performance is in the presence of missing data. In general, the ARIMA model per-
formed worse than the linear regression and random forest regression models. Overall,
the random forest regression model performed the best in all scenarios and metrics.
Across all experiments, the local versions of MLP and RNN performed better
than their global counterparts. In addition, the performance of the global RNN
model is unstable as evidenced by the large spread of 95% confidence intervals. The
local RNN performed better than its MLP counterpart for highway data, while their
performances are relatively similar on the urban data. However, the MLP models can
generalize better to unseen artificial test data. Finally, the GRNN model performed
competitively across all experiment scenarios, but it is consistently outperformed by
the random forest regression model.
Table 3.2: Results for 5-minute prediction horizon on the highway dataset (mean and 95% CI)

Model               MAE (km/h)            MAPE (%)                RMSE (km/h)
Constant            33.16 (32.34–33.98)   109.55 (88.06–131.04)   35.46 (34.72–36.19)
Previous Interval   4.56 (4.24–4.87)      12.32 (10.32–14.32)     9.34 (8.72–9.95)
Linear              3.29 (3.02–3.55)      9.72 (8.17–11.28)       5.12 (4.62–5.61)
ARIMA               4.41 (4.12–4.70)      11.96 (10.05–13.87)     9.13 (8.53–9.73)
Random Forest       2.76 (2.52–3.01)      8.02 (6.74–9.31)        4.74 (4.15–5.32)
Local MLP           3.41 (3.27–3.55)      9.95 (8.40–11.50)       5.37 (5.01–5.72)
Global MLP          3.80 (3.41–4.19)      11.54 (9.95–13.13)      5.76 (5.05–6.48)
Local RNN           2.86 (2.66–3.06)      8.36 (7.05–9.67)        4.83 (4.33–5.33)
Global RNN          7.11 (4.28–9.98)      21.25 (12.21–30.29)     11.81 (7.56–16.06)
GRNN                3.30 (2.77–3.82)      9.89 (7.46–12.33)       5.37 (4.44–6.30)
Table 3.3: Results for 5-minute prediction horizon on the highway dataset with 5% missing data (mean and 95% CI)

Model               MAE (km/h)            MAPE (%)                RMSE (km/h)
Constant            33.16 (32.34–33.98)   109.55 (88.06–131.04)   35.46 (34.72–36.19)
Previous Interval   4.57 (4.26–4.88)      12.36 (10.37–14.35)     9.37 (8.77–9.97)
Linear              4.76 (4.42–5.10)      12.55 (10.89–14.21)     6.92 (6.28–7.56)
ARIMA               4.43 (4.14–4.72)      12.00 (10.09–13.92)     9.17 (8.58–9.77)
Random Forest       2.85 (2.59–3.11)      8.26 (6.96–9.56)        4.84 (4.31–5.38)
Local MLP           4.49 (4.19–4.78)      12.29 (10.45–14.13)     6.52 (5.99–7.05)
Global MLP          4.64 (4.06–5.21)      13.39 (11.61–15.16)     7.04 (6.08–8.01)
Local RNN           3.04 (2.78–3.30)      8.88 (7.35–10.41)       5.08 (4.59–5.58)
Global RNN          5.82 (3.79–7.85)      16.78 (11.73–21.82)     10.08 (6.84–13.32)
GRNN                3.61 (3.37–3.84)      10.91 (8.77–13.06)      5.83 (5.37–6.29)
Table 3.4: Results for 5-minute prediction horizon on the downtown dataset (mean and 95% CI)

Model               MAE (km/h)          MAPE (%)              RMSE (km/h)
Constant            6.68 (6.64–6.72)    46.71 (44.32–49.11)   10.04 (9.91–10.16)
Previous Interval   6.53 (6.52–6.55)    36.65 (36.21–37.09)   10.88 (10.85–10.92)
Linear              4.30 (4.28–4.32)    24.21 (23.72–24.69)   6.61 (6.56–6.65)
ARIMA               4.87 (4.83–4.90)    30.98 (30.46–31.50)   7.73 (7.69–7.76)
Random Forest       3.94 (3.91–3.97)    22.37 (21.93–22.80)   6.30 (6.26–6.34)
Local MLP           4.42 (4.38–4.46)    24.00 (23.31–24.68)   6.92 (6.81–7.02)
Global MLP          4.43 (4.20–4.67)    24.40 (23.41–25.40)   6.87 (6.47–7.28)
Local RNN           4.18 (4.11–4.25)    22.98 (22.49–23.46)   6.77 (6.56–6.98)
Global RNN          4.27 (4.15–4.40)    24.41 (22.31–26.52)   6.82 (6.63–7.02)
GRNN                4.35 (4.32–4.39)    25.32 (24.22–26.42)   7.01 (6.94–7.09)
Table 3.5: Results for 1-minute prediction horizon on the downtown dataset (mean and 95% CI)

Model               MAE (km/h)          MAPE (%)              RMSE (km/h)
Constant            6.68 (6.64–6.72)    46.71 (44.32–49.11)   10.04 (9.91–10.16)
Previous Interval   6.78 (6.74–6.81)    38.05 (37.74–38.36)   11.31 (11.24–11.38)
Linear              3.79 (3.77–3.80)    21.48 (21.12–21.83)   5.80 (5.77–5.83)
ARIMA               4.53 (4.49–4.57)    28.19 (27.70–28.68)   7.13 (7.07–7.18)
Random Forest       3.40 (3.38–3.42)    19.14 (18.90–19.38)   5.47 (5.43–5.50)
Local MLP           3.86 (3.78–3.93)    20.10 (19.26–20.93)   6.02 (5.84–6.20)
Global MLP          4.13 (4.01–4.24)    22.38 (22.02–22.73)   6.33 (6.14–6.52)
Local RNN           3.89 (3.83–3.94)    21.18 (20.82–21.53)   6.24 (6.03–6.45)
Global RNN          4.37 (4.14–4.59)    26.60 (24.33–28.87)   6.78 (6.39–7.16)
GRNN                4.40 (4.32–4.47)    26.58 (25.27–27.89)   6.94 (6.86–7.02)
Table 3.6: Results for 5-minute prediction horizon on the unseen artificial test set
MAE (km/h) MAPE (%) RMSE (km/h)
Constant 31.01 96.08 33.75
Previous Interval 6.15 16.54 11.00
Linear 5.64 16.71 8.31
ARIMA 6.10 16.92 10.87
Random Forest 5.28 15.58 8.36
Local MLP 6.46 18.78 9.51
Global MLP 12.38 39.34 15.99
Local RNN 6.69 27.89 12.10
Global RNN 22.43 72.74 32.21
GRNN 5.70 15.61 8.99
3.4 Discussion
The ARIMA model is commonly used as a baseline model for comparison in the traffic
prediction literature. However, our experiment results show that ARIMA performs
only marginally better than simply predicting the last observation. In addition, since
there exists known influences among nearby roads, the univariate nature of ARIMA
hampers its performance. This is especially evident as the prediction horizon in-
creases, where evolving traffic patterns cause nearby information to become more
important than local history. Overall, a good traffic prediction model should take a
regional perspective rather than decoupling the roads from one another.
In the case of feedforward and recurrent neural networks, the local models perform
better than the network-wide models. We speculate that this is because traffic dy-
namics are local, and using separate models for each location allows the parameters
to be learned more easily. The global model has access to the recent history of the
entire road network; therefore, the model must learn to extract the relevant features
to create accurate predictions for every location. Meanwhile, each local model only has access
to the recent history of a smaller nearby region. This restriction forces the local
models to use features with known influences on the output, which guides parameter
learning and creates more accurate predictions. Nevertheless, it is uncertain whether
global models can achieve the same performance given access to more data.
Graph neural networks are compact due to the shared parameters across different
links, and our experiment results show that this creates a competitive model. However,
as noted in our contributions, the shared parameters can also be detrimental to
accuracy, a point we investigate further in Chapter 4.
3.5 Conclusion
Figure 3.3: The RMSE of each link and the line of best fit for each model on the highway dataset.
Figure 3.4: The RMSE of each model categorized by the number of neighbouring links
on the urban dataset.
Figure 3.5: The error of the random forest model for 1 simulation of the highway
data. (Panel (a): the true speed of the entire corridor; panel (b): the prediction
error of the random forest model. Both panels plot time of day, 6:00 to 10:00,
against link location from west to east; in panel (b), positive values indicate
over-prediction and negative values indicate under-prediction.)
Chapter 4

Comparative Evaluation of Graph Neural Network Variants
4.1 Introduction
The previous chapter evaluated GNNs and other deep neural networks against the
classical machine learning methods. However, the literature on GNNs has exploded
over the past 4 years and the GRNN model may not be an accurate representation
of all GNNs. Therefore, we felt the need to expand the selection of GNNs to create
a more objective analysis.
In addition, each work in the literature claims to improve upon its predecessors
in the evaluation metrics; however, the improvements are very marginal. In a recent
example from 2020 [8], the authors compare their proposed model against the works
of [5], [6], both published in 2017. The proposed model demonstrates a
0.9% reduction in MAPE for flow prediction, which corresponds to 0.8 vehicles per
5 minutes in MAE [8]. Similar results can also be found in [75]–[77]. This serves as
additional motivation for us to investigate these innovations and determine if there
is still room for advancement.
This chapter first describes the graph convolution perspective and then develops
a taxonomy of GNN short-term traffic prediction models based on their components.
Afterwards, we explore different variations on these components and eventually arrive
at a variant that is similar to a traditional recurrent neural network. Finally, we
also revisit the regression view of short-term traffic prediction using random forest
regression, which we found to be a powerful prediction method in the previous chapter.
4.2 Methodology
As mentioned in Section 2.7.3, a short-term traffic prediction model can use a GNN to
capture the spatial correlations between different nodes of a road network while using
an RNN to capture the changing dynamic of traffic through time. Typically, the two
components are combined by replacing the matrix multiplications in the GRU with
a GNN operation shown in Equation (2.34). Equation (2.34) is written again in this
section as (4.1) to improve readability.
k_{1,i}^(t) = S( Σ_{j∈N(i)} a_{ij} ( W_1 x_j^(t) + U_1 h_j^(t−1) ) + b_1 )
k_{2,i}^(t) = S( Σ_{j∈N(i)} a_{ij} ( W_2 x_j^(t) + U_2 h_j^(t−1) ) + b_2 )
k_{3,i}^(t) = tanh( Σ_{j∈N(i)} a_{ij} ( W_3 x_j^(t) + U_3 ( k_{1,i}^(t) ∗ h_j^(t−1) ) ) + b_3 )  (4.1)
h_i^(t) = ( 1 − k_{2,i}^(t) ) ∗ h_i^(t−1) + k_{2,i}^(t) ∗ k_{3,i}^(t)
o_i^(t) = σ( W_o h_i^(t) + b_o )
With this formulation, we can categorize GNN short-term traffic prediction models
by their 3 components. The first 2 components are the operations concerning the input
x_i^(t) and the last hidden state h_i^(t−1), each of which can be a standard matrix
multiplication or a GNN variant. The third component is the model weights W, U, and b, which
can be either shared among nodes or independent. We explore variations on these
components in the next section. It is important to note that some works use other
mechanisms to capture the temporal dynamics of traffic; however, this investigation
is focused on RNN-based models due to their prevalence.
We begin with the GRNN model [7] introduced in the previous chapter, which uses a
standard matrix multiplication for the input, convolution for the last hidden state, and
model weights shared among nodes. This formulation transforms the first equation
of (4.1) into:

k_{1,i}^(t) = S( W_1 x_i^(t) + Σ_{j∈N(i)} a_{ij} U_1 h_j^(t−1) + b_1 )  (4.2)
As in (4.1), the remaining equations in (2.21) can be transformed likewise and are
omitted for brevity in this section.
We then experiment with changing the graph convolution operation to graph atten-
tion shown in (2.32) and examine the different combinations of applying the attention
operation to input and hidden states. For this investigation, we call this type of model
the graph attention gated recurrent unit (GA-GRU) as it combines the concepts of
graph attention networks and gated recurrent units. Equation 4.3 is one configuration
where attention is only applied to the last hidden state, i.e., GA-GRU (hidden):

k_{1,i}^(t) = S( W_1 x_i^(t) + Σ_{j∈N(i)} a_{ij} U_1 h_j^(t−1) + b_1 )  (4.3)

where the coefficients a_{ij} are produced by the attention mechanism in Equation
(2.32) instead of being taken from a fixed adjacency matrix.
In the AGRNN formulation, in contrast with (4.3), the subscript i in all weights and
biases signifies that each node contains its own model parameters, and the coefficients
a in this framework are learnable weights. We also highlight the input-only attention
variant of the AGRNN, i.e., AGRNN (input).
4.2.3 Datasets
For this study, we employ the same datasets from our evaluation in the previous
chapter. In addition, to facilitate comparison with the recent literature on GNNs, we
additionally include the California datasets described below.
• Toronto datasets: the same data that was used in the previous chapter, described in Section 3.2.
• California datasets: The PeMS04 and PeMS08 datasets are collected from
the Caltrans Performance Measurement System (PeMS) of districts in California.
The interval is 5 minutes per time step, which corresponds to 288 data points per
day. The adjacency matrix is defined according to road distance and connectivity.
We follow the evaluation procedure established by other papers [8], [79], which
uses the last 12 observations to predict the next 12 time steps, i.e., use the past
hour of traffic data to predict that of the next hour.
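As an illustration of this windowing convention, the sketch below slices a multivariate
series into pairs of 12-step inputs and 12-step targets; the array shapes and data are
assumptions for the example.

    import numpy as np

    def make_windows(series, n_in=12, n_out=12):
        """Slice a (T, N) multivariate series into input/target windows:
        the past n_in steps predict the next n_out steps."""
        X, Y = [], []
        for t in range(len(series) - n_in - n_out + 1):
            X.append(series[t:t + n_in])
            Y.append(series[t + n_in:t + n_in + n_out])
        return np.stack(X), np.stack(Y)

    flows = np.random.rand(288 * 7, 170)   # a week of 5-minute data, 170 sensors
    X, Y = make_windows(flows)             # X: (samples, 12, 170), Y: (samples, 12, 170)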
We evaluated our models against a selection of different methods that are listed
below, including other graph neural networks as well as time series analysis methods.
Additionally, similar to Chapter 3, we tuned the hyperparameters summarized in
Table 4.2 using the same coordinate descent procedure outlined in Chapter 3.
• Historical Average: A time series model that predicts the average of obser-
vations from the same time of day in previous weeks. It is not applicable to
simulated datasets since the simulation model simulates only one day and has
no long-term time series.
• GCN [66]: This model is a Graph Convolutional Network with 1 hidden layer.
The output layer is connected with a fully-connected layer to predict traffic
states.
Table 4.3: Performance comparison of traffic speed prediction models for the simulated highway dataset

                         5-minute horizon           15-minute horizon
Model                  MAE     MAPE    RMSE       MAE     MAPE    RMSE
                      (km/h)   (%)    (km/h)     (km/h)   (%)    (km/h)
Historical Average         Not applicable             Not applicable
ARIMA                  4.41   11.96    9.13       7.52   20.73   15.47
GCN                    3.72    9.05    5.95       5.24   13.08    8.99
GRNN                   3.24    9.66    5.31       5.18   14.90    9.00
GA-GRU (input)         4.99   15.30    7.72      13.09   38.09   16.83
GA-GRU (hidden)        3.38   10.30    5.35       5.00   15.31    8.55
GA-GRU (both)          4.08   12.55    6.32      13.48   47.70   16.95
AGRNN (input)          3.28    9.71    5.46       4.90   14.85    8.63
AGRNN (hidden)         3.92    9.14    6.00       4.99   12.13    8.12
AGRNN (both)           3.53    8.69    6.17       3.84    9.52    6.86
AGCRN                  3.41    7.70    7.38       4.28   10.00    8.84
Random forest          2.77    8.02    4.73       3.60   10.31    6.51
In this study, we report the MAE, MAPE, and RMSE of the prediction models; their definitions are outlined in Section 2.3. For the Toronto datasets, we used both the 5th minute and the 15th minute as the prediction horizon and computed the error using only the prediction at the horizon, $\hat{x}_i^{(t+H)}$. Meanwhile, on the California datasets, we followed the convention of [8], [79] and computed the error using all predictions up to the prediction horizon $H$; i.e., $\hat{x}_i^{(t+1)}, \hat{x}_i^{(t+2)}, \ldots, \hat{x}_i^{(t+H)}$.
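As a concrete illustration of the two conventions, the following sketch (our code, assuming predictions and ground truth are stored as arrays of shape (samples, H, nodes)) computes the MAE both ways:

```python
import numpy as np

def mae_at_horizon(pred: np.ndarray, truth: np.ndarray, H: int) -> float:
    # Toronto convention: error of the prediction at step H only.
    return float(np.abs(pred[:, H - 1, :] - truth[:, H - 1, :]).mean())

def mae_up_to_horizon(pred: np.ndarray, truth: np.ndarray) -> float:
    # California convention [8], [79]: error averaged over steps 1..H.
    return float(np.abs(pred - truth).mean())
```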
Although the above metrics conveniently produce numerical values for easy comparison across models, they cannot capture every aspect of a model. Therefore, we also measured the complexity of each model for a more well-rounded comparison, as shown in Table 4.6.
4.3 Results
We performed the evaluation using a data split of 60% training, 20% validation, and 20% testing for each dataset. We then recorded each metric to produce the results shown in Tables 4.3, 4.4, and 4.5, where the bolded number is the lowest error and the underlined number is the second lowest. In addition, we report the model complexity for the 5-minute prediction horizon on the highway dataset in Table 4.6.
In most experiments, performance improves progressively from the GRNN to the GA-GRUs to the AGRNNs. First, this indicates that using an attention mechanism to learn spatial correlations is better than using a fixed adjacency matrix. In addition, the node-specific convolutional weights can capture distinct traffic patterns at each node and improve accuracy. Moreover, the AGRNN (input) model is competitive with the other GNNs
Table 4.4: Performance comparison of traffic speed prediction models for the simulated urban dataset

                         5-minute horizon           15-minute horizon
Model                  MAE     MAPE    RMSE       MAE     MAPE    RMSE
                      (km/h)   (%)    (km/h)     (km/h)   (%)    (km/h)
Historical Average         Not applicable             Not applicable
ARIMA                  4.87   30.98    7.73       5.43   35.12    8.58
GCN                    4.30   24.33    6.53       4.62   25.21    6.94
GRNN                   4.37   24.88    7.06       4.94   29.32    7.78
GA-GRU (input)         4.85   31.06    7.65       5.31   33.45    8.39
GA-GRU (hidden)        4.24   24.04    6.91       4.83   28.48    7.98
GA-GRU (both)          4.62   29.27    7.40       4.96   31.18    7.90
AGRNN (input)          4.11   22.49    6.88       4.31   24.44    7.21
AGRNN (hidden)         4.23   24.55    6.71       4.43   25.77    7.02
AGRNN (both)           4.08   22.35    6.62       4.27   23.88    7.02
AGCRN                  4.02   23.59    7.02       4.49   28.40    7.78
Random forest          3.89   22.18    6.21       4.17   24.37    6.66
Table 4.5: Performance comparison of traffic flow prediction models on the PeMS04 and PeMS08 datasets

                             PeMS04                     PeMS08
Model                  MAE     MAPE    RMSE       MAE     MAPE    RMSE
                      (veh)    (%)    (veh)      (veh)    (%)    (veh)
Historical Average    24.99   16.07   41.84      21.21   13.72   36.73
ARIMA                 27.53   20.55   42.44      22.67   14.92   35.08
GCN                   23.72   17.92   37.47      21.09   14.42   31.45
GRNN                  30.66   25.02   46.06      26.09   21.92   39.07
GA-GRU (input)        32.78   30.24   46.65      27.73   42.20   38.78
GA-GRU (hidden)       24.73   17.17   38.18      19.89   14.01   30.73
GA-GRU (both)         29.35   25.23   42.74      23.53   19.96   34.46
AGRNN (input)         24.01   17.37   37.82      21.04   14.57   31.19
AGRNN (hidden)        22.97   16.32   37.25      20.31   13.48   30.80
AGRNN (both)          23.77   16.54   38.53      22.04   14.33   34.16
AGCRN                 19.86   13.06   32.57      16.08   10.40   25.55
Random forest         20.74   13.76   35.03      16.64   10.95   26.95
according to all error metrics, which signifies that the propagation of hidden states among nodes between consecutive RNN time steps is not essential for accurate prediction. It should be noted that although our experiments keep the other components of the model constant among the GRNN, GA-GRUs, and AGRNNs, the findings of this work may not generalize to other architectures, such as those with multiple graph convolutional layers or different RNN configurations.
4.4 Discussion
In our experiments, random forest regression achieves the lowest prediction error on the simulated datasets while being narrowly outperformed by the AGCRN model on the California datasets. This supports the hypothesis that short-term traffic prediction can be framed as a regression problem and further highlights that sharing model weights and latent states is inconsequential to model accuracy. Although we do not report the exact training time of each model, all models examined in this chapter except the historical average displayed similar training times on the order of several hours. However, the random forest model contains by far the largest number of parameters across all experiments, which may be impracticable for deployment in real-world environments due to storage requirements. Overall, the results suggest that while GNNs can be more compact, random forest regression remains competitive and should not be overlooked in short-term traffic prediction.
The errors we report for the GNN models are largely in line with recent publications, and it is unlikely that further hyperparameter tuning would yield any significant reduction in prediction error [8]. Furthermore, as in other studies that use the California PeMS datasets [8], [75], the MAE and RMSE in Table 4.5 are measured in number of vehicles per time step (5 minutes). We see that the best GNNs improve upon the baseline historical average model by around 10 vehicles per time step, which corresponds to 120 vehicles per hour. To put this into perspective, the capacity of a 3-lane highway is in the range of 5,000 to 6,000 vehicles per hour [80]; the best GNNs therefore predict better than the historical average by about 2 percent of capacity. It is unclear whether this error reduction translates into tangible benefits for downstream applications of traffic prediction.
Although the location-independent AGRNN and random forest models contain more parameters than their graph convolutional counterparts, they also present several major advantages. First, this framework is flexible and modular in the number of prediction locations: new modules can be added as new sensors are installed. Second, it allows us to easily identify and remedy sources of error, because each location has an independent model that can be updated as new data become available. Lastly, since each prediction location has its own model that can be trained separately, the overall framework scales linearly with the number of prediction locations, which allows it to be deployed in large cities with relatively low memory consumption. To test this, we generated synthetic data with a variable number of locations and trained the models on a system with a GeForce RTX 2070, which has 8 GB of memory. Using the same default model hyperparameters, the GRNN model runs out of memory when the number of locations exceeds 850, while the location-independent models do not face this constraint.
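The following minimal sketch (our code, not from the thesis) illustrates the location-independent design: one regressor per sensor, so locations can be added, retrained, or debugged without touching the rest of the system.

```python
from sklearn.ensemble import RandomForestRegressor

class PerLocationPredictor:
    """One independent model per prediction location (hypothetical schema)."""

    def __init__(self):
        self.models = {}  # location id -> fitted regressor

    def fit_location(self, loc, X, y):
        # Registering a new sensor or refitting with fresh data affects
        # only this location's model; training scales linearly in locations.
        self.models[loc] = RandomForestRegressor().fit(X, y)

    def predict(self, loc, X):
        return self.models[loc].predict(X)
```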
Figure 4.1: Flow comparison between real-world and simulated data. For each dataset, we select 4 links and plot the flow values over 5 days. (a) PeMS08 dataset: one working week of data. (b) Simulated highway dataset: first 5 simulation runs. [Plots removed: flow (veh/h) over Days 1 to 5.]
In our introduction of GNNs in Section 2.7.3, we mentioned that the attention matrix represents the learned influence among all nodes. During the experiment, we extracted the trained attention matrices from the models to perform a sanity check on the learned influence. In the GNN variants defined in Section 4.2.2, we restrict the influence of node i to its neighbourhood N(i), based on our understanding that traffic propagation occurs according to the network topology. However, the AGCRN and other GNNs in the literature do not enforce this restriction and allow for network-wide
influence for every node. This difference is highlighted in Figure 4.2, where we notice that the attention values learned by the AGCRN model differ significantly from the road topology. This result further indicates that the superior performance of the AGCRN may be attributed to detecting statistical correlations among faraway nodes rather than to its ability to model traffic dynamics. Consequently, this suggests that models such as the AGCRN may fail to predict accurately in the event of traffic incidents, since local traffic patterns would deviate from normal and cannot be adequately explained by long-range statistical correlation.
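A sanity check of this kind can be as simple as measuring how much attention mass falls outside the road topology. The sketch below (our code, with illustrative names) assumes the trained attention matrix and the adjacency matrix are available as dense arrays:

```python
import numpy as np

def off_topology_mass(attn: np.ndarray, adj: np.ndarray) -> float:
    """Fraction of total attention assigned to non-adjacent node pairs."""
    assert attn.shape == adj.shape  # both (N, N)
    return float(attn[adj == 0].sum() / attn.sum())
```

A value near zero indicates topology-consistent influence, as in our GNN variants, while a large value flags the kind of network-wide statistical correlation observed in the AGCRN.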
4.5 Conclusion
In this chapter, we compared several variants of GNNs against random forest regression using simulated data of two regions in Toronto as well as real-world sensor data from selected California highways. We found that incorporating matrix factorization, attention, and location-specific model weights, either individually or collectively, into GNNs can result in better overall performance. Moreover, although random forest regression is a less compact model, it matches or exceeds the performance of all GNN variants in our experiments. Finally, we highlighted two issues with current state-of-the-art GNNs: parameter sharing across different locations limits their flexibility and scalability, and the GNNs learn incorrect traffic propagation patterns. Overall, this suggests that current graph convolutional methods may be an incorrect approach to traffic prediction.
[Figure 4.2 (plots removed): comparison of the learned attention values with the road topology; (a) simulated highway dataset: node connectivity.]
Chapter 5
Long-Term Traffic Prediction
5.1 Introduction
The previous chapters discussed our contributions regarding short-term traffic pre-
diction, with horizons up to 1 hour. However, transportation agencies may require
prediction further into the future to enable long-term planning. Historically, research in this area has mostly focused on time series models such as seasonal ARIMA [31].
In this chapter, we first revisit the seasonal ARIMA model in long-term traffic
prediction by following the analysis of Williams and Hoel [31]. Afterwards, we explore
the exponential smoothing and the Facebook Prophet models described in Section
2.5.2 and 2.5.3, respectively. Subsequently, we also examine the potential of regression
methods such as random forest and XGBoost in this task. Lastly, we attempt to
define the boundary between short-term and long-term prediction by determining the
horizon where the knowledge of recent traffic history no longer influences prediction
accuracy.
5.2 Methodology
Since this study is focused on long-term prediction, the 4-hour long simulated datasets
from the previous chapters are insufficient for this analysis. Therefore, for this study,
we obtained the traffic speed data of 216 links on the Gardiner Expressway in Toronto
over the month of October 2019. The location of the Gardiner Expressway is shown
on the map in Figure 5.1 highlighted in red. The data is collected using Global
Positioning System (GPS) and aggregated over 15-minute intervals. For this analysis,
the data has been divided to form a training set, which consists of data from October 1st to 23rd, and a test set, which consists of the remaining data from October 24th to 31st.
[Figure 5.1: Map showing the location of the Gardiner Expressway, highlighted in red; axes are longitude and latitude.]
Historical Average
A historical average model is a simple model that partitions the available training data into bins based on the timestamp and averages the observed values within each bin. During prediction, the model identifies which bin the prediction timestamp belongs to and returns the stored value of that bin. For example, if the model is asked to predict for Thursday, October 31, 2019, at 9:00 AM, it finds all instances of training data associated with Thursday at 9:00 AM and returns the average of those data points. Unless additional data are supplied, the model will always predict the same value for a time bin, such as every Thursday at 9:00 AM.
In this study, we create two historical average models. The first is the weekly model, which contains separate time bins for every weekday; since the data has a 15-minute interval, there are 96 timestamps in each day for a total of 672 time bins. The second is the daily model, which only distinguishes between the workweek and the weekend, for a total of 192 time bins. No model selection is required for this method since there is only one possible model.
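A minimal sketch of the weekly model using pandas follows (our code; the thesis does not specify a data schema, so the column name and index are assumptions):

```python
import pandas as pd

def fit_weekly_average(train: pd.DataFrame) -> pd.Series:
    """train: DatetimeIndex at 15-minute resolution with a 'speed' column."""
    keys = [train.index.dayofweek, train.index.hour, train.index.minute]
    return train["speed"].groupby(keys).mean()  # 7 x 24 x 4 = 672 bins

def predict(model: pd.Series, when: pd.Timestamp) -> float:
    # Look up the bin for the requested timestamp and return its average.
    return float(model.loc[(when.dayofweek, when.hour, when.minute)])
```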
Seasonal ARIMA
The seasonal ARIMA model is outlined in Section 2.5.1. Due to computational constraints, it is not feasible to exhaustively search through all possible orders of a seasonal ARIMA model. Therefore, we rely on various statistical tests in the model selection process; this process is well established, and we follow the procedure outlined in [81]. In addition, previous work has identified (1, 0, 1)(0, 1, 1)_m as a set of orders that consistently performs well in the traffic prediction context [31].
First, we need to select the seasonal differencing term (D). The autocorrelation function (ACF) is a useful test to determine whether the data is stationary. Figure 5.2 plots the ACF for 6 selected links. As expected, the ACF plots show strong correlations at multiples of 96 lags (1 day), and some links show the strongest correlation at 672 lags (1 week). Visually, this shows that there is clear seasonality in our data and suggests seasonal differencing. Figure 5.3 shows the ACF for the same 6 links after applying seasonal differencing once. The seasonally differenced time series exhibit a more stationary pattern, which is consistent with the findings of [31].
[Figure 5.2 (plot removed): ACF of the original time series for the 6 selected links.]
Figure 5.3: ACF of the first-order seasonally differenced time series (period = 1 day) [plot removed; axes are ACF vs. lag, 0 to 800]
[Figure 5.4 (plots removed): OCSB test result; histograms of the number of links by the estimated order of seasonal differencing (0 vs. >0).]
In addition to the visual analysis, we applied the OCSB test [82] to estimate the required order of seasonal differencing. For the majority of links, the test suggests no differencing across seasons. This contradicts the findings of [31] as well as the visual analysis above. The reason for this discrepancy could be that one month of data is not enough, or that the OCSB test, designed for financial time series, is not suitable for the traffic setting. Ultimately, due to the clear seasonality shown in
Figure 5.2, we decide to set the order of seasonal differencing (D) to 1 and the period
(m) to 96 (1 day) for this analysis.
Once the seasonal differencing term (D) is known, we can apply a unit root test to objectively determine whether the resulting time series is stationary and decide whether further differencing is required. For this analysis, we use the Augmented Dickey-Fuller (ADF) test [83] and the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test [84], both commonly used in time series analysis. In the ADF test, the null hypothesis is that the time series has a unit root; if we fail to reject the null hypothesis, there is evidence that the series is non-stationary. On the other hand, the null and alternate hypotheses of the KPSS test are the opposite of the ADF test, where
the null hypothesis is that the data are stationary. We use a significance level of 0.05 for both tests and display the results in Figure 5.5.
[Figure 5.5 (plots removed): Unit root test result; (a) p-values of the ADF test, (b) p-values of the KPSS test; histograms count the number of links with p < 0.05 vs. p >= 0.05.]
The results of the tests suggest that
the seasonally differenced time series are stationary, and both tests are in agreement
with one another. Therefore, no further differencing is required and the non-seasonal
differencing order (d) is set to 0 for this analysis.
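In practice, both tests are available in statsmodels [30]. A sketch of the per-link check follows (our code, with illustrative variable names):

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller, kpss

def is_stationary(series: np.ndarray, m: int = 96, alpha: float = 0.05) -> bool:
    """Seasonally difference one link's series, then apply both unit root tests."""
    diffed = series[m:] - series[:-m]         # first-order seasonal differencing
    adf_p = adfuller(diffed)[1]               # ADF H0: series has a unit root
    kpss_p = kpss(diffed, regression="c")[1]  # KPSS H0: series is stationary
    # Both tests agree on stationarity: reject the ADF H0, fail to reject KPSS H0.
    return adf_p < alpha and kpss_p >= alpha
```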
In the literature, the remaining terms of the seasonal ARIMA model are selected by fitting the model with different sets of orders and then selecting the set that minimizes an information criterion [81]. An information criterion is a model selection tool that measures the quality of a fitted model relative to other models. The most common information criterion in the literature is the Akaike information criterion (AIC) [85]; however, it is prone to overfitting when the sample size is small. To address this, a corrected formula known as the corrected AIC (AICc) has been developed, which we elected to use in this analysis. One important note is that model selection by minimizing the AIC can only be performed with a fixed order of differencing (d, D), because differencing alters the underlying model fitting process; therefore, d and D must be selected prior to the other orders.
There are different methods to fit a seasonal ARIMA model; however, the parameter estimation process typically takes a long time. This is especially true when the number of periods in a season (m) is large, as is the case with our data. Therefore, we perform this search using 20% of the links, beginning with the baseline of (1, 0, 1)(0, 1, 1)_96 suggested by [31]. Throughout this process, we found no meaningful change in the AICc compared to the baseline. Therefore, we use the model fitted with the baseline orders as the final model for comparison.
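Using pmdarima [29], fitting the baseline orders for a single link might look like the following sketch (the variable train_speeds is an assumed name for one link's 15-minute training series):

```python
import pmdarima as pm

# Baseline orders (1,0,1)(0,1,1)_96 suggested by [31].
model = pm.ARIMA(order=(1, 0, 1), seasonal_order=(0, 1, 1, 96))
model.fit(train_speeds)
forecast = model.predict(n_periods=96 * 8)  # the 8-day test period
```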
Exponential Smoothing
Based on the description in Section 2.5.2, there are a total of 12 models to consider. According to [81], additive seasonal methods should be used when the seasonal variation is roughly constant throughout the series, while multiplicative methods should be used when the seasonal variation is proportional to the level of the series. In traffic data, the seasonal variation does not depend on the level; therefore, we only consider the none or additive seasonal components. To select the final model, we once again perform parameter estimation and then find the model that minimizes the information criterion.
Similar to the previous section, we selected 20% of the links to fit the ETS mod-
els. We compared the corrected AIC of various models and found that ETS(A,N,A)
produced the best result across different links. Therefore, this is the model that is
selected to represent the class of exponential smoothing models.
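The selected model can be fitted with statsmodels [30]; a sketch for one link follows (assuming train_speeds is a pandas Series with a DatetimeIndex):

```python
from statsmodels.tsa.exponential_smoothing.ets import ETSModel

# ETS(A,N,A): additive errors, no trend, additive seasonality with a
# daily period of 96 fifteen-minute intervals.
model = ETSModel(train_speeds, error="add", trend=None,
                 seasonal="add", seasonal_periods=96)
result = model.fit()
print(result.aicc)                  # the corrected AIC used for selection
forecast = result.forecast(96 * 8)  # the 8-day test period
```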
Facebook Prophet
The Facebook Prophet algorithm is briefly described in Section 2.5.3. Since we only
have 3 weeks of training data and there are no obvious seasonalities other than daily
and weekly effects, we can let the Prophet algorithm automatically fit the data;
therefore, no model selection procedure is needed for this method. Additionally, the
Prophet model fitting process is significantly faster than both the ARIMA and the
ETS models, completing in less than a second per link as opposed to seasonal ARIMA
which can take over a minute.
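A sketch of the per-link Prophet fit follows (our code; Prophet expects a dataframe with columns ds and y [38], and the variable names are assumptions):

```python
import pandas as pd
from prophet import Prophet

# train_idx: DatetimeIndex of the training period; y_train: speeds per step.
df = pd.DataFrame({"ds": train_idx, "y": y_train})
m = Prophet().fit(df)  # daily and weekly seasonalities are fitted automatically
future = m.make_future_dataframe(periods=96 * 8, freq="15min")
forecast = m.predict(future)[["ds", "yhat"]]
```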
Linear Regression
The linear regression model is outlined in Section 2.6.1. However, for long-term prediction, recent traffic history is unlikely to influence traffic multiple days into the future. To account for this, we change the input features of the linear regression model to the following: the day of the week, the hour of the day, and whether it is a working day or a holiday. In addition, we find that one-hot encoding the 3 types of features produces more accurate results; therefore, this is the linear regression model used in the experiments.
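A sketch of the calendar features and the fitted model follows (our code; the schema and the source of the holiday flag are assumptions):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

def calendar_features(index: pd.DatetimeIndex, workday) -> pd.DataFrame:
    """One-hot encode day of week, hour of day, and working day vs. holiday."""
    raw = pd.DataFrame({"dow": index.dayofweek,
                        "hour": index.hour,
                        "workday": list(workday)}, index=index)
    return pd.get_dummies(raw, columns=["dow", "hour", "workday"])

# train_idx, train_workday, y_train are assumed variables for one link.
model = LinearRegression().fit(calendar_features(train_idx, train_workday), y_train)
```

The same one-hot feature matrix also feeds the XGBoost model described next.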
XGBoost
XGBoost [86] is an ensemble regression tree method that uses the boosting technique
described in Section 2.6.2. We use the same input features as the linear regression
model described above.
5.3 Results
In this study, we fit the models using the 23-day training data, then generate predictions for the 8-day test period without supplying additional data. We then report the MAE, MAPE, and RMSE of the prediction models; their definitions are outlined in Section 2.3.
The results of our analysis are summarized in Table 5.1. In addition, the violin plot in Figure 5.6 shows the distribution of prediction error across links. We can see that the ETS model has the highest error, which corroborates the aggregated error metrics reported in Table 5.1. In addition, the distribution of the remaining models suggests that a few links can be predicted with significantly higher accuracy than the others.
The two plots in Figure 5.7 show the predictions for two different links over the last 2 days of October, which are separated from the training period by 6 days. Even at such a long prediction horizon, every model is able to capture the overall trend of the time series. In addition, the Prophet algorithm produces predictions that are much smoother than those of the other methods. Interestingly, the true observations seem to vary significantly between consecutive time steps, which may contribute to the prediction error. The second plot reveals why the weekly average is able to predict with almost no error on some links: for the most part, the weekly average prediction matches the ground truth exactly. This suggests that the data itself is gap-filled using some weekly average method, which likely contributed significantly to its high accuracy in our analysis.
As noted during model selection, the Prophet model fitting process is significantly faster than both the ARIMA and the ETS models.
[Figure 5.6 (plot removed): violin plot of the MAE on each link (km/h) for the compared models.]
In addition, the Prophet algorithm is easy to use and applies well established time
series analysis techniques. Overall, our analysis demonstrates that it can supplant
traditional time series models such as seasonal ARIMA in long-term traffic prediction.
Figure 5.7: Line plots of the predictions generated by the various models and the ground truth on 2 selected links: (a) Link 1 and (b) Link 2. [Plots removed: speed (km/h) from October 29 to October 31; lines show the ground truth and the daily average, weekly average, seasonal ARIMA, Facebook Prophet, linear regression, and XGBoost predictions.]
5.4 Conclusion
In this analysis, we compared three different time series prediction methods, namely seasonal ARIMA, exponential smoothing, and Facebook's Prophet algorithm, against historical average and regression baselines. The results show that the historical average model has the lowest error while the exponential smoothing model has the highest error on our dataset. However, further analysis showed that the accurate result of the historical average model should be discounted, because the data itself appears to be gap-filled using some historical average method. Nonetheless, the model performances are very close to one another, and any of them could suitably be used for the long-term traffic forecasting problem. Lastly, we found that long-term methods start to overtake short-term methods at prediction horizons beyond 1 hour.
Figure 5.8: Prediction error of short-term random forest regression compared to long-term methods on (a) highway links and (b) arterial links. [Plots removed: MAE and RMSE (km/h) vs. prediction horizon (20 to 90 min); long-term methods shown are the weekly average, daily average, seasonal ARIMA, Facebook Prophet, and XGBoost.]
The long-term methods predict using long-term history rather than recent observations; therefore,
the notion of prediction horizon is not applicable and the error is drawn as horizontal lines.
Chapter 6
Conclusion
6.1 Summary
Traffic prediction can provide tremendous aid to transportation agencies for managing
traffic demand and it is an essential component in building intelligent transportation
systems. Over the years, a myriad of solutions has been proposed for this challenging
task, with deep learning and graph neural networks at the forefront of the current
literature. This thesis is an extensive survey and a thorough evaluation of the different
methods in this rapidly growing field of study. Throughout our analysis, we attempt to
highlight potential issues and important considerations for deploying these solutions
in the field.
In Chapter 3, we comparatively evaluated the different classes of short-term traffic
prediction methods under a variety of traffic scenarios with the aid of traffic simula-
tion software. We examined time series analysis, classical machine learning, and deep
learning models. The experimental results demonstrated that random forest regression is potentially more powerful than deep learning models, at the expense of greater memory and storage requirements. Furthermore, we highlighted the importance of
including regional traffic history as input features to traffic prediction models. Fi-
nally, we also illustrated that individual parameters for each prediction location can
improve the predictive performance of graph neural networks.
In Chapter 4, we expanded our evaluation of graph neural networks (GNNs) due to
their prevalence in the recent literature. The results showcased the effectiveness of the
attention mechanism in GNNs and further confirmed that location-specific parame-
ters are beneficial to predictive performance. However, upon closer inspection, we discovered that state-of-the-art GNN methods can produce models that signify traffic influence among faraway locations. This is inconsistent with traffic behaviour and indicates that the GNNs are detecting statistical correlations rather than modelling traffic. Lastly, we established that while GNNs can be compact models, they are no more capable than the traditional machine learning model of random forest, which predates GNNs by over 15 years.
In Chapter 5, we examined long-term prediction methods to complete our exploration of traffic prediction. We showed that traditional time series analysis approaches are cumbersome and impractical for large-scale prediction. We then demonstrated that Facebook's Prophet, a recent forecasting algorithm, and other machine learning approaches can replace traditional time series analysis methods in the traffic prediction context. Finally, we determined that long-term prediction methods begin to eclipse short-term methods when the prediction horizon is longer than 1 hour.
6.2 Future Directions
This thesis compared different existing methods of traffic prediction using established evaluation methods and metrics. Although we procured data from a variety of sources to create a range of traffic patterns, there are real-world problems that still need to be resolved before a traffic prediction system can be safely deployed. We briefly discuss four of these problems below.
• Multiple data sources: The current models rely on a single source of data
such as loop detectors or GPS. We need to explore alternative data sources that
complement existing sensors to achieve a more comprehensive traffic view and
safeguard against sensor failure. This would also raise the need for a data fusion
method within the prediction system.
• Consistency with domain knowledge: Our analysis in Chapter 4 illustrates that models can produce results that are inconsistent with domain knowledge. A potential solution is to infuse a traffic dynamics model into a prediction model, which may require more comprehensive observations than current sensors allow and a departure from the macroscopic view of traffic prediction. Meanwhile, another direction is to incorporate sensor data into a traffic dynamics model, such as traffic simulation software.
6.3 Concluding Remarks
Recently, increasingly sophisticated deep learning solutions have been proposed for traffic prediction. This thesis explored the major prediction methods in the literature and raised some concerns regarding the state-of-the-art deep learning prediction models. For future research, it is important to consider whether these solutions can be safely deployed in the real world and whether they provide tangible improvements over existing solutions.
Bibliography
[1] HDR Corporation, “Costs of road congestion in the greater toronto and hamilton area: Impact
and cost benefit analysis of the metrolinx draft regional transportation plan,” Greater Toronto
Transportation Authority, Tech. Rep., 2008.
[2] P. Mirchandani and L. Head, “A real-time traffic signal control system: Architecture, algo-
rithms, and analysis,” Transportation Research Part C: Emerging Technologies, vol. 9, no. 6,
pp. 415–432, 2001.
[3] P.-W. Lin, K.-P. Kang, and G.-L. Chang, “Exploring the effectiveness of variable speed limit
controls on highway work-zone operations,” in Intelligent transportation systems, Taylor &
Francis, vol. 8, 2004, pp. 155–168.
[4] M. J. Lighthill and G. B. Whitham, “On kinematic waves ii. a theory of traffic flow on long
crowded roads,” Proceedings of the Royal Society of London. Series A. Mathematical and
Physical Sciences, vol. 229, no. 1178, pp. 317–345, 1955.
[5] Y. Li, R. Yu, C. Shahabi, and Y. Liu, “Diffusion convolutional recurrent neural network:
Data-driven traffic forecasting,” arXiv preprint arXiv:1707.01926, 2017.
[6] B. Yu, H. Yin, and Z. Zhu, “Spatio-temporal graph convolutional networks: A deep learning
framework for traffic forecasting,” arXiv preprint arXiv:1709.04875, 2017.
[7] X. Wang, C. Chen, Y. Min, J. He, B. Yang, and Y. Zhang, “Efficient metropolitan traffic
prediction based on graph recurrent neural network,” arXiv preprint arXiv:1811.00740, 2018.
[8] L. Bai, L. Yao, C. Li, X. Wang, and C. Wang, “Adaptive graph convolutional recurrent network
for traffic forecasting,” arXiv preprint arXiv:2007.02842, 2020.
[9] Aimsun Next: Your personal mobility modeling lab. [Online]. Available: https://www.aimsun.com/aimsun-next/.
[10] P. I. Richards, “Shock waves on the highway,” Operations research, vol. 4, no. 1, pp. 42–51,
1956.
[11] F. L. Hall, “Traffic stream characteristics,” Traffic Flow Theory. US Federal Highway Admin-
istration, vol. 36, 1996.
[12] C. F. Daganzo, “The cell transmission model. part i: A simple dynamic representation of
highway traffic,” 1992.
[13] ——, “The cell transmission model, part ii: Network traffic,” Transportation Research Part B:
Methodological, vol. 29, no. 2, pp. 79–93, 1995.
[14] I. Yperman, S. Logghe, and B. Immers, “The link transmission model: An efficient implemen-
tation of the kinematic wave theory in traffic networks,” in Proceedings of the 10th EWGT
Meeting, Poznan Poland, 2005, pp. 122–127.
[15] I. Guilliard, S. Sanner, F. W. Trevizan, and B. C. Williams, “Nonhomogeneous time mixed
integer linear programming formulation for traffic signal control,” Transportation Research
Record, vol. 2595, no. 1, pp. 128–138, 2016.
[16] I. Guilliard, F. Trevizan, and S. Sanner, “Mitigating the impact of light rail on urban traffic
networks using mixed-integer linear programming,” IET Intelligent Transport Systems, vol. 14,
no. 6, pp. 523–533, 2020.
[17] H. J. Payne, “Model of freeway traffic and control,” Mathematical Model of Public System,
pp. 51–61, 1971.
[18] G. B. Whitham, Linear and nonlinear waves. John Wiley & Sons, 2011, vol. 42.
[19] C. F. Daganzo, “Requiem for second-order fluid approximations of traffic flow,” Transportation
Research Part B: Methodological, vol. 29, no. 4, pp. 277–286, 1995.
[20] A. Aw and M. Rascle, "Resurrection of 'second order' models of traffic flow," SIAM Journal on Applied Mathematics, vol. 60, no. 3, pp. 916–938, 2000.
[21] E. Kometani and T. Sasaki, “On the stability of traffic flow (report-i),” J. Oper. Res. Soc.
Japan, vol. 2, no. 1, pp. 11–26, 1958.
[22] L. A. Pipes, “Car following models and the fundamental diagram of road traffic,” Transporta-
tion Research/UK/, 1966.
[23] G. F. Newell, “A simplified car-following theory: A lower order model,” Transportation Re-
search Part B: Methodological, vol. 36, no. 3, pp. 195–205, 2002.
[24] ——, “Nonlinear effects in the dynamics of car following,” Operations research, vol. 9, no. 2,
pp. 209–229, 1961.
[25] P. A. Lopez, M. Behrisch, L. Bieker-Walz, J. Erdmann, Y.-P. Flötteröd, R. Hilbrich, L. Lücken,
J. Rummel, P. Wagner, and E. Wießner, “Microscopic traffic simulation using sumo,” in The
21st IEEE International Conference on Intelligent Transportation Systems, IEEE, 2018. [On-
line]. Available: https://elib.dlr.de/124092/.
[26] M. S. Ahmed and A. R. Cook, Analysis of freeway traffic time-series data by using Box-Jenkins
techniques, 722. 1979.
[27] C. Moorthy and B. Ratcliffe, “Short term traffic forecasting using time series methods,” Trans-
portation planning and technology, vol. 12, no. 1, pp. 45–56, 1988.
[28] G. E. Box, G. M. Jenkins, G. C. Reinsel, and G. M. Ljung, Time series analysis: forecasting
and control. John Wiley & Sons, 2015.
[29] T. G. Smith et al., pmdarima: ARIMA estimators for Python, [Online]. Available: http://www.alkaline-ml.com/pmdarima, accessed 28 June 2020, 2017.
[30] S. Seabold and J. Perktold, “Statsmodels: Econometric and statistical modeling with python,”
in 9th Python in Science Conference, 2010.
[31] B. M. Williams and L. A. Hoel, “Modeling and forecasting vehicular traffic flow as a seasonal
arima process: Theoretical basis and empirical results,” Journal of transportation engineering,
vol. 129, no. 6, pp. 664–672, 2003.
[32] M. Lippi, M. Bertini, and P. Frasconi, “Short-term traffic flow forecasting: An experimental
comparison of time-series analysis and supervised learning,” IEEE Transactions on Intelligent
Transportation Systems, vol. 14, no. 2, pp. 871–882, 2013.
[33] B. M. Williams, “Multivariate vehicular traffic flow prediction: Evaluation of arimax model-
ing,” Transportation Research Record, vol. 1776, no. 1, pp. 194–200, 2001.
[34] Y. Kamarianakis and P. Prastacos, “Forecasting traffic flow conditions in an urban net-
work: Comparison of multivariate and univariate approaches,” Transportation Research Record,
vol. 1857, no. 1, pp. 74–84, 2003.
[35] T. Ma, Z. Zhou, and B. Abdulhai, “Nonlinear multivariate time–space threshold vector error
correction model for short term traffic state prediction,” Transportation Research Part B:
Methodological, vol. 76, pp. 27–47, 2015.
[36] C. C. Holt, “Forecasting seasonals and trends by exponentially weighted moving averages,”
International journal of forecasting, vol. 20, no. 1, pp. 5–10, 2004.
[37] P. R. Winters, “Forecasting sales by exponentially weighted moving averages,” Management
science, vol. 6, no. 3, pp. 324–342, 1960.
[38] Forecasting at scale. [Online]. Available: https://facebook.github.io/prophet/.
[39] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P.
Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher,
M. Perrot, and E. Duchesnay, “Scikit-learn: Machine learning in Python,” Journal of Machine
Learning Research, vol. 12, pp. 2825–2830, 2011.
[40] H. Yao, F. Wu, J. Ke, X. Tang, Y. Jia, S. Lu, P. Gong, J. Ye, and Z. Li, “Deep multi-view
spatial-temporal network for taxi demand prediction,” in Thirty-Second AAAI Conference on
Artificial Intelligence, 2018.
[41] L. Breiman, “Random forests,” Machine learning, vol. 45, no. 1, pp. 5–32, 2001.
[42] J. H. Friedman, “Greedy function approximation: A gradient boosting machine,” Annals of
statistics, pp. 1189–1232, 2001.
[43] G. A. Davis and N. L. Nihan, “Nonparametric regression and short-term freeway traffic fore-
casting,” Journal of Transportation Engineering, vol. 117, no. 2, pp. 178–188, 1991.
[44] B. L. Smith and M. J. Demetsky, “Short-term traffic flow prediction models-a comparison of
neural network and nonparametric regression approaches,” in Proceedings of IEEE Interna-
tional Conference on Systems, Man and Cybernetics, IEEE, vol. 2, 1994, pp. 1706–1709.
[45] L. Zhang, Q. Liu, W. Yang, N. Wei, and D. Dong, “An improved k-nearest neighbor model for
short-term traffic flow prediction,” Procedia-Social and Behavioral Sciences, vol. 96, pp. 653–
662, 2013.
[46] A. Ding, X. Zhao, and L. Jiao, “Traffic flow time series prediction based on statistics learning
theory,” in Proceedings. The IEEE 5th International Conference on Intelligent Transportation
Systems, IEEE, 2002, pp. 727–730.
[47] C.-H. Wu, J.-M. Ho, and D.-T. Lee, “Travel-time prediction with support vector regression,”
IEEE transactions on intelligent transportation systems, vol. 5, no. 4, pp. 276–281, 2004.
[48] G. Cybenko, “Approximation by superpositions of a sigmoidal function,” Mathematics of con-
trol, signals and systems, vol. 2, no. 4, pp. 303–314, 1989.
[49] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N.
Gimelshein, L. Antiga, et al., “Pytorch: An imperative style, high-performance deep learning
library,” in Advances in neural information processing systems, 2019, pp. 8026–8037.
[50] F.-F. Li, A. Karpathy, and J. Johnson, “Cs231n: Convolutional neural networks for visual
recognition,” 2016. [Online]. Available: http://cs231n.stanford.edu/.
[51] M. S. Dougherty and M. R. Cobbett, “Short-term inter-urban traffic forecasts using neural
networks,” International journal of forecasting, vol. 13, no. 1, pp. 21–31, 1997.
[52] B. Park, C. J. Messer, and T. Urbanik, “Short-term freeway traffic volume forecasting using
radial basis function neural network,” Transportation Research Record, vol. 1651, no. 1, pp. 39–
47, 1998.
[53] H. Dia, “An object-oriented neural network approach to short-term traffic forecasting,” Euro-
pean Journal of Operational Research, vol. 131, no. 2, pp. 253–261, 2001.
[54] B. Abdulhai, H. Porwal, and W. Recker, “Short-term traffic flow prediction using neuro-genetic
algorithms,” ITS Journal-Intelligent Transportation Systems Journal, vol. 7, no. 1, pp. 3–41,
2002.
[55] G. E. Hinton, S. Osindero, and Y.-W. Teh, “A fast learning algorithm for deep belief nets,”
Neural computation, vol. 18, no. 7, pp. 1527–1554, 2006.
[56] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, “Greedy layer-wise training of deep
networks,” in Advances in neural information processing systems, 2007, pp. 153–160.
[57] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9,
no. 8, pp. 1735–1780, 1997.
[58] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y.
Bengio, “Learning phrase representations using rnn encoder-decoder for statistical machine
translation,” arXiv preprint arXiv:1406.1078, 2014.
[59] Y. Tian and L. Pan, “Predicting short-term traffic flow by long short-term memory recurrent
neural network,” in 2015 IEEE international conference on smart city/SocialCom/SustainCom
(SmartCity), IEEE, 2015, pp. 153–158.
[60] Z. Cui, R. Ke, Z. Pu, and Y. Wang, “Deep bidirectional and unidirectional lstm recurrent
neural network for network-wide traffic speed prediction,” arXiv preprint arXiv:1801.02143,
2018.
[61] Z. Zhao, W. Chen, X. Wu, P. C. Chen, and J. Liu, “Lstm network: A deep learning approach
for short-term traffic forecast,” IET Intelligent Transport Systems, vol. 11, no. 2, pp. 68–75,
2017.
[62] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun, “Spectral networks and locally connected
networks on graphs,” arXiv preprint arXiv:1312.6203, 2013.
[79] M. Li and Z. Zhu, “Spatial-temporal fusion graph neural networks for traffic flow forecasting,”
arXiv preprint arXiv:2012.09641, 2020.
[80] Transportation Research Board, Highway Capacity Manual. Washington, DC, 2000.
[81] R. J. Hyndman and G. Athanasopoulos, Forecasting: principles and practice. OTexts, 2018.
[82] D. Osborn, A. Chui, J. Smith, and C. Birchenhall, "Seasonality and the order of integration for consumption," Oxford Bulletin of Economics and Statistics, vol. 50, pp. 361–377, 1988. doi: 10.1111/j.1468-0084.1988.mp50004002.x.
[83] D. A. Dickey and W. A. Fuller, “Distribution of the estimators for autoregressive time series
with a unit root,” Journal of the American statistical association, vol. 74, no. 366a, pp. 427–
431, 1979.
[84] D. Kwiatkowski, P. C. Phillips, P. Schmidt, and Y. Shin, “Testing the null hypothesis of
stationarity against the alternative of a unit root: How sure are we that economic time series
have a unit root?” Journal of econometrics, vol. 54, no. 1-3, pp. 159–178, 1992.
[85] H. Akaike, “A new look at the statistical model identification,” IEEE transactions on automatic
control, vol. 19, no. 6, pp. 716–723, 1974.
[86] T. Chen and C. Guestrin, “Xgboost: A scalable tree boosting system,” in Proceedings of
the 22nd acm sigkdd international conference on knowledge discovery and data mining, 2016,
pp. 785–794.
[87] R. Liaw, E. Liang, R. Nishihara, P. Moritz, J. E. Gonzalez, and I. Stoica, “Tune: A research
platform for distributed model selection and training,” arXiv preprint arXiv:1807.05118, 2018.