Pedestrian and Vehicle Behaviour Prediction in Autonomous Vehicle System — a Review
Keywords: Deep learning; Autonomous vehicle; Pedestrian; Vehicles; Behaviour prediction

Abstract: Autonomous vehicles (AVs) have become a trending topic nowadays, since they have the potential to solve traffic problems, such as accidents and congestion. Although AV systems have greatly evolved, they still have limitations. For example, Google reported that their AVs have been involved in several collisions and near misses. While most of these collisions and near misses were caused by third parties, the AVs should be able to predict and avoid them. Events like this show that there is still room for improvement in the AV system. This paper aims to present a review of the state-of-the-art algorithms proposed to enable AV behaviour prediction systems to predict trajectories and intentions for pedestrians and vehicles. This will be achieved by using information from previous literature review papers, recent works, and results obtained using well-known datasets.
1. Introduction

Road traffic accidents and congestion have posed significant challenges for many countries today. Road traffic accidents claim the lives of 1.35 million people annually, and they are ranked the 8th leading cause of death worldwide (WHO, 2018). In addition, it has been reported that road traffic accidents are responsible for 20 to 50 million non-fatal casualties, and 95% of these accidents are caused by human error and imprudence. It was reported in the UK that, in 2020 and 2021, there were 92,055 and 119,850 road traffic casualties, respectively, and 1676 of these casualties led to death (GOVUK, 2020, 2021). Congestion has a significant negative impact on society, affecting the economy, environment, public health and safety (Afrin & Yodo, 2020; Levy et al., 2010). Enforced legislation, advanced driving assistance systems (ADAS), other methods of transportation and road improvements have been used to address these road traffic issues. However, it is predicted that the number of road users will double by 2050 and that these current measures will not be sufficient (COLONNA, 2018). AVs are a trending topic nowadays, and companies such as Waymo and Uber have already deployed several AVs on the roads to solve the aforementioned road traffic problems. Although AV systems have considerably evolved, they still have limitations, such as efficiently and safely navigating in complex scenarios. This could be achieved by avoiding congestion and by predicting, preventing, or mitigating any road traffic collisions. These are challenging tasks, since the AVs have to share the roads with human road users and, as reported by the World Health Organisation (WHO), most road traffic collisions are linked to human error and imprudence.

Another major AV limitation is gaining public confidence that they are safe to ride. Petrović et al. (2020) investigated 300 traffic collisions in California (US) between 2015 and 2017 that involved AVs. They found that most of the collisions were caused by conventional drivers, who were following the AVs too closely and violated the right-of-way, traffic signals, and traffic signs. Google published a paper reporting the performance of their Waymo driver between 2019 and 2020, to show transparency and make the public more comfortable and confident with AVs. In the report, the Waymo driver drove 6.1 million miles and was involved in 47 road traffic collisions and near-miss events; these include both actual and counterfactual simulated events (Schwall et al., 2020). Most of the reported collisions were induced by humans breaking one or more road traffic rules, such as violating the speed limit, driving on the wrong side of the road, not obeying the stop sign or the red traffic light signal, performing an inappropriate lane change or junction merging, not yielding the right of way to the Waymo driver, and not yielding to the slowing-down behaviour of the Waymo driver. Although some of these cited events could not be avoided by the AVs, for example, the conventional drivers hitting the rear of the AV while it was stationary or slowing down, there were instances where they could have been. For instance, accidents caused by a lane change manoeuvre, merging from a junction, or making a turning manoeuvre could have been avoided if the AV had been able to predict the trajectories and intentions of the conventional drivers accurately and over a longer horizon. This shows that there is still room for improvement in the AV system, mainly in the behaviour prediction
of other road users, since it enables the AVs to make a risk assessment of the situation in order to take appropriate action. The goal of this paper is to review the most relevant works that aimed to predict the trajectories and intentions of vehicles and pedestrians.

There are several literature reviews covering both traditional and Deep Learning (DL) techniques to predict the behaviour of vehicles, for example, Lefèvre et al. (2014), Leon and Gavrilescu (2019), Shirazi and Morris (2016), Sivaraman and Trivedi (2013) and Mozaffari et al. (2020). Sivaraman and Trivedi (2013) briefly reviewed the behaviour prediction of vehicles, but at that time this topic was fairly new and only traditional techniques were reviewed. Lefèvre et al. (2014) presented a survey and classified vehicle behaviour prediction algorithms into physics-based, manoeuvre-based, and interaction-aware algorithms. They concluded that a behaviour prediction algorithm needs to consider the interaction between vehicles as well as the scene context to have a longer prediction horizon. In addition, they reviewed the existing risk assessment methods for autonomous vehicles and concluded that a risk assessment module was highly dependent on the behaviour prediction algorithm. In this review, the authors only covered traditional techniques, since DL techniques for vehicle behaviour prediction were still emerging at the time. Shirazi and Morris (2016) reviewed techniques used to analyse vehicles', drivers', and pedestrians' behaviour at road intersections. Only traditional techniques were analysed; however, the focus was not on the prediction of vehicle behaviour. Leon and Gavrilescu (2019) reviewed methods used for vehicle tracking, behaviour prediction, and decision-making. Both traditional and DL techniques were covered. The authors concluded that DL techniques had better results since they are more robust, flexible and have better generalisation ability. Mozaffari et al. (2020) performed a systematic and comparative review of the different DL methods used to predict vehicle trajectories and intentions. They presented a more detailed taxonomy of the behaviour prediction algorithms compared to Lefèvre et al. (2014). They categorised the algorithms based on the type of input, the type of output, and the method of prediction. Although the review was extensive and very informative, the authors did not cover in detail what intention behaviour the works were trying to predict, for example, lane change, overtaking, or making a turn, and did not provide specific information on what datasets were used.

The following works have performed pedestrian behaviour prediction reviews: Chen, Ding, et al. (2020), Kong and Fu (2018), Ridel et al. (2018), Rudenko et al. (2020), Sharma et al. (2022) and Ahmed et al. (2019a). Kong and Fu (2018) presented traditional and DL techniques that were used to recognise and predict human action. Ahmed et al. (2019a) presented a survey on the detection and intention prediction of pedestrians and cyclists. A review on pedestrian behaviour was presented by Ridel et al. (2018), where they briefly described the traditional and DL techniques that were used. Chen, Li, et al. (2020) discussed the required architecture and the traditional and DL techniques to detect and predict pedestrian actions. Although these works reviewed DL techniques, only a limited number of works were considered. A detailed human trajectory prediction survey was done by Rudenko et al. (2020), where they reviewed a substantial amount of published works to propose a taxonomy, identify the available datasets and evaluation metrics, and the limitations of the current methods. However, the authors did not review methods used to predict pedestrian intentions. A comprehensive survey was done by Sharma et al. (2022) on pedestrian intention prediction for AV systems.

To the authors' knowledge, the work presented by Gulzar et al. (2021) is the only one that reviewed the behaviour prediction of both pedestrians and vehicles. The authors presented a novel taxonomy that unifies both pedestrian and vehicle behaviour prediction problems. However, the authors did not explore the evaluation metrics, datasets, features, and the results of the reviewed works.

Unlike the previously cited review works on both pedestrian and vehicle behaviour prediction, this paper:

• Presents a general behaviour prediction problem formulation.
• Presents the most used terminologies in the pedestrian and vehicle behaviour prediction domain.
• Reviews not only pedestrian or vehicle behaviour prediction algorithms, but both of them.
• Briefly presents the most important traditional techniques and focuses more on the DL techniques for pedestrian and vehicle prediction algorithms.
• Summarises the key information extracted from the reviewed studies on predicting pedestrian and vehicle behaviour in tables. These tables report the methods employed, the problem that the algorithms are trying to solve, the datasets used, and the results acquired.
• Reviews works that have performed behaviour prediction for heterogeneous traffic agents.
• Introduces a general framework for a behaviour prediction system, highlighting the system's dependence on the AV's hardware and the perception module, and its typical outputs. In addition, presents a risk assessment for a general behaviour prediction system.
• Identifies the requirements and challenges to design a pedestrian and vehicle behaviour prediction system for AVs.
• Discusses whether the current techniques have met the previously mentioned requirements, and suggests future works.

Some of the commonly used terminologies in the pedestrian and vehicle behaviour prediction literature are listed below (Mozaffari et al., 2020); a short windowing sketch follows the list.

• Object behaviour: means the object trajectories or intentions.
• Object trajectory: vectors with a sequence of data, typically comprised of tracking information, that describe the path an object has followed.
• Object intention: a course of actions that an object intends to perform to achieve its goal. In the vehicle domain, these courses of action are known as manoeuvres, such as turning, changing lanes, stopping, cut-in/cut-out, etc. In the pedestrian domain, these intentions are crossing/non-crossing, stopping, etc.
• Observation time horizon (OTH): the time over which an algorithm observes the past behaviours of an object to predict its future behaviour.
• Prediction time horizon (PTH): most of the reviewed works use prediction horizon to refer to the time over which an algorithm can predict an object's behaviour before it happens. However, in some works, the term 'prediction' is replaced with 'anticipation', and it is defined as the time that an algorithm can predict an object's behaviour before it begins. This paper adopts the term prediction and its first meaning.
• Ego Vehicle (EV): the vehicle that observes the other traffic agents using on-board sensors.
• Target object: the object that the EV is observing to predict its behaviour.
• Surrounding objects: the objects that may interact with and affect the behaviour of the target object.
• Multi-modal behaviour: means that an observed history of behaviours could lead to multiple potential future behaviours.
• Trajectory prediction: means to predict the future motion of an object given a time frame of its and/or surrounding objects' trajectories, contextual information, and interactions between the objects in the scene.
• Intention prediction: usually uses the same history information that trajectory prediction uses; however, the system aims to predict the future discrete action of the target object.
• Interaction: influences that one or more objects have on each other.
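To make the OTH/PTH terminology concrete, the following is a minimal sketch, assuming a track stored as a NumPy array of (x, y) positions sampled at 10 Hz; the array, horizon values, and frame rate are hypothetical:

```python
import numpy as np

# Hypothetical track: 60 frames of (x, y) positions sampled at 10 Hz.
track = np.random.rand(60, 2)

OTH = 30  # observation time horizon: 30 frames = 3.0 s at 10 Hz
PTH = 20  # prediction time horizon: 20 frames = 2.0 s at 10 Hz
t = 35    # index of the last observed frame

observation = track[t - OTH + 1 : t + 1]   # what the predictor sees
ground_truth = track[t + 1 : t + 1 + PTH]  # the future it must match

print(observation.shape, ground_truth.shape)  # (30, 2) (20, 2)
```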
Fig. 1. Object behaviour prediction full pipeline process. The detection and classification stage outputs the object position, size, type, bounding box, segmentation, and global and local context information. The object tracking stage outputs the ID for each detected object and its dynamics (e.g., speed). The output of the object behaviour prediction module can be the object's intention and its future trajectory.

et al. (2022), Piccoli et al. (2020), Rasouli et al. (2019, 2020), Vitas et al. (2020), Yang, Zhang, et al. (2022), Yao et al. (2021b), Zeng (2022), Zhang, Angeloudis, and Demiris (2022), and Xue et al. (2018), a general intention prediction problem formulation is as follows: a sequence of feature vectors $\{F_{t-OTH}, \ldots, F_t\}$, extracted from a given sequence of video frames $\{t-OTH, \ldots, t\}$ acquired from an image sensor, is used by a model to determine the probability of the target agent intention $I^{a}_{t+n} \in \{0, 1\}$, where $t$ is the time of the last observed frame and $n$ is the number of frames from the last observed frame to the final frame of the event, also known as the time-to-event (TTE). The intention prediction estimate can be described by the equation

$P\big(I^{a}_{t+n} \mid F_{t-OTH}, \ldots, F_{t}\big),$

i.e., the probability of the intention at frame $t+n$ given the features observed up to frame $t$.
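As a minimal illustration of this formulation, and not the architecture of any particular reviewed work, the sketch below assumes the per-frame feature vectors F have already been extracted and uses a GRU to map the observed window to an intention probability; all class names and dimensions are hypothetical:

```python
import torch
import torch.nn as nn

class IntentionClassifier(nn.Module):
    """Maps a feature sequence {F_(t-OTH), ..., F_t} to P(I_(t+n) = 1)."""
    def __init__(self, feature_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.encoder = nn.GRU(feature_dim, hidden_dim, batch_first=True)
        self.head = nn.Sequential(nn.Linear(hidden_dim, 1), nn.Sigmoid())

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, OTH, feature_dim), one vector per observed frame
        _, h = self.encoder(features)
        return self.head(h[-1])  # (batch, 1): e.g. probability of "crossing"

model = IntentionClassifier(feature_dim=32)
probs = model(torch.randn(4, 16, 32))  # 4 tracks, an OTH of 16 frames
```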
Fig. 3. General interactions among traffic agents and their environments. Object 1 is the target object (blue circled), the blue arrow shows the direct interaction between the
target object and object 2; the orange arrows show the interaction between object 2 and objects 3, 9, 11, 12, and 13; the yellow arrow shows the interaction between object 17
and object 3.
et al., 2021; Mozaffari et al., 2020). These works will be discussed in the upcoming sections.

Before discussing the behaviour prediction of pedestrians and vehicles, it is important to understand their potential interactions. As depicted in Fig. 3, interactions among different traffic agents can cascade and get very challenging. For example, in order to predict the actions of object 1, it might be required to consider the actions of the following agents (see the sketch after Table 1):

• Object 2, since it can change direction and velocity.
• Object 3, since its action will affect the action of object 2.
• Object 17, since it will affect the action of object 3.
• Object 9, since it will affect the action of object 2.
• Object 13, since it can make a right turn, which will affect the action of object 2.
• Object 11 or 13, since they may break the law by not obeying the red traffic light.

Table 1
Motion, context and intention features that can be used to predict vehicle behaviour.

MOTION
Target Vehicle (TV): lateral/longitudinal position, velocity, acceleration, yaw, yaw rate, and relative speed.
TV-to-lane: lateral offset, and lateral speed.
TV-to-Surrounding Vehicle (SV): distance from surrounding vehicles.

CONTEXT
Road: lane marking, number of lanes, lane width, lane curvature, type of lines, entries, exits, left/right/forward arrows, crosswalks, traffic lights, traffic signs, type of road (urban, country, highway-motorway), bumps, road holes, road works, left/right-hand side traffic, and junctions.
Vehicle: indicators, brake lights, warning lights, type of the vehicle, and sirens' light status.
Other road agents: pedestrians, animals, cyclists, and trams.
Environment: sunny, snowing, rainy, foggy, and dark.

INTENTION
Braking, turning left/right, lane keeping, left/right lane change, speeding, normal driving, aggressive driving, abnormal driving, merging, exiting, cutting in/out, and yielding.
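To make the cascade concrete, the sketch below encodes the Fig. 3 example as a hypothetical influence graph and walks it backwards from the target object to collect every agent an interaction-aware predictor might need to consider:

```python
from collections import deque

# Hypothetical influence graph for the Fig. 3 example: an edge a -> b means
# "the action of agent a affects the action of agent b".
influences = {
    2: [1],                 # object 2 directly affects the target, object 1
    3: [2], 9: [2], 13: [2],
    17: [3],
    11: [2],                # e.g., running the red light in front of object 2
}

def relevant_agents(target: int) -> set:
    """Collect every agent whose action can cascade down to the target."""
    reverse: dict = {}
    for src, dsts in influences.items():
        for dst in dsts:
            reverse.setdefault(dst, []).append(src)
    seen, queue = set(), deque([target])
    while queue:  # breadth-first walk against the edge direction
        node = queue.popleft()
        for src in reverse.get(node, []):
            if src not in seen:
                seen.add(src)
                queue.append(src)
    return seen

print(relevant_agents(1))  # {2, 3, 9, 11, 13, 17}
```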
In the vehicle behaviour prediction domain, the literature often uses the terms prediction of driver/vehicle behaviour or prediction of target/surrounding vehicle behaviour. The former usually means to predict the behaviour of the ego vehicle using its internal data, such as the steering angle, brake pedal position, velocity, speed, indicator status, etc. (Berndt & Dietmayer, 2009; Girma et al., 2020; Raimundo & Favio, 2021; Xing et al., 2017). This approach is suitable for AV systems when considering vehicle-to-vehicle communication. The latter approach involves the ego vehicle using on-board sensors to gather information from the surrounding vehicles to predict their behaviour. In this review, only the latter approach is covered, as vehicle-to-vehicle communication is not yet available and AVs would still share roads with conventional human drivers.

Vehicle behaviour prediction is a crucial component of the AV behaviour prediction system, as it would enable the AV to perform risk assessment, plan future movements, and make appropriate decisions to avoid/mitigate the impact of collisions. Ideally, a vehicle behaviour prediction algorithm should be fast, cost-effective, accurate, generalise well in different traffic scenes, consider the interdependence between agents, and have a long prediction horizon. A long prediction horizon provides the AV with more time to make decisions and take appropriate actions. A typical vehicle behaviour prediction pipeline consists of multiple steps, starting with the detection of the target vehicle and the surrounding vehicles. This detection information is used to obtain tracking information. Subsequently, this tracking information is used as an observation feature to predict future trajectories. In order to enhance the quality and duration of predictions, context information of the traffic scene and the intention manoeuvres of other vehicles can be considered. Table 1 provides the type of motion, context, and intention information that has been and could be used by researchers to predict vehicle behaviour.

Although vehicles have some characteristics that simplify their behaviour prediction, such as constrained movement due to their inertial properties, having to obey traffic road rules, and navigating inside the
road boundaries, it is still a challenging task, since their behaviour is dependent on other vehicles' actions, traffic regulations, road geometry, and different driving environments (Lefèvre et al., 2014; Mozaffari et al., 2020). Moreover, vehicles have multi-modal behaviour, different types of vehicles might provide different motion information, and prediction can be affected if surrounding vehicles are occluded.

The two main sources of data used to predict the behaviour of vehicles are top-view and on-board sensors. Top-view data are captured from static sensors usually installed on tall buildings, while on-board data are captured from sensors installed on the EV. Top-view data have the advantage of providing more precise information, since the acquired data have better quality, the vehicles surrounding the TV are captured, and vehicles are not easily occluded. However, they only cover a specific and fixed portion of the traffic scene, limiting the ability of the algorithm to generalise to other traffic scenarios. Top-view sensors are typically used in two types of traffic environments: highways-motorways and complex traffic scenes, such as busy urban areas and junctions. Highway-motorway datasets can suffer from imbalanced samples, where there are more instances of constant velocity behaviour than of the specific manoeuvres of interest (Altché & de La Fortelle, 2017). On-board sensor data can capture different traffic scenarios; however, their quality can be affected by noise, surrounding vehicles can be occluded, and in order to detect all the vehicles surrounding the EV and the TV, more than one sensor might be required (e.g., front, rear, and side cameras) (Izquierdo et al., 2021). On-board sensor data are particularly advantageous for AV applications because the algorithms that use them could be directly integrated into AVs, which are already equipped with on-board sensors. Several sensors, such as cameras, radar, and LIDAR, could be used to acquire both top-view and on-board data (Izquierdo et al., 2019; SIMulation, 2007; Zhou et al., 2020; Zyner et al., 2019). However, this research mainly focuses on works that have used camera sensors. For more information about the available datasets for vehicle behaviour prediction, please refer to Izquierdo et al. (2021). Table 2 summarises the most relevant vehicle trajectory and intention prediction works from 2009 to 2022. From the table, the following is observed:

• Shift to Deep Learning and the NGSIM dataset: up to 2016, the majority of the works used traditional techniques and their OWN datasets; after 2016, most of the works adopted DL techniques and used the NGSIM dataset.
• Expanding information sources: vehicle behaviour prediction algorithms have evolved from using only motion information to incorporating additional sources, including manoeuvre, interaction, and driver-style information.
• Limited use of other datasets: while the NGSIM dataset gained popularity, other datasets such as Apollo, KITTI, LISA, INTERACTION, HighD, and PREVENTION were rarely used.
• Trajectory prediction dominance: the majority of the research efforts were to predict trajectories. It was not until 2020 that more research began to address the prediction and recognition of vehicle intentions.
• Focus on lane changing and turning manoeuvres: most research works focused on predicting the trajectories and intentions related to lane changing and turning manoeuvres. Other types of manoeuvres, such as reversing, braking, and U-turns, were seldom considered.
• Evaluation metrics: the most common evaluation metric for trajectory prediction was the Root Mean Square Error (RMSE), while for intention prediction, accuracy was the predominant evaluation metric.

The following two subsections discuss the algorithms used to predict vehicle behaviour. The first covers the algorithms used to predict trajectories, and the latter the algorithms used to detect and predict vehicle intention.

3.1. Trajectory prediction

As reported in Table 2, vehicle trajectory prediction has been achieved using one or more of the following approaches: physics-based, manoeuvre-based, or interaction-aware motion models (Lefèvre et al., 2014). Physics-based motion models were among the first approaches to be proposed, and they use the principles of physics to predict vehicle motions. This approach is computationally efficient, meets real-time requirements, and does not require the dataset to be human-labelled. However, it is less suitable for complex scenarios like busy urban scenes and junctions. This is because it does not take into account the TV intentions, the contextual information of the scene, or the interaction between the TV and the SVs. This lack of information limits the prediction horizon for the EV to less than 1 s (Lefèvre et al., 2014). In order to overcome the limitation of a short prediction horizon associated with the physics-based approach, manoeuvre-based approaches were introduced. In the manoeuvre-based approach, the EV uses the predicted intention of the TV to predict future trajectories. This increases both the trajectory prediction horizon and accuracy, as the predicted trajectory would match the predicted intention. However, if the predicted manoeuvre is incorrect, the whole predicted trajectory may also be inaccurate. The interaction-aware approach uses the trajectories and the intentions of both the TV and the SVs to predict the TV trajectory. This approach further extends the prediction horizon and improves the accuracy of the predicted trajectories. On the other hand, it comes with complexities in implementation, demands greater computational power, and raises questions about how to determine which vehicles should be considered as SVs; furthermore, not all SVs might be reliably detected by the EV.

The previously cited approaches have been implemented using either traditional or DL techniques. Traditional techniques encompass linear methods, like the KF and Switching Linear Dynamic Models, as well as non-linear methods, such as the EKF, UKF, Switching Non-Linear Dynamic Models, particle filters, Bayesian filtering, Monte Carlo simulation, Naive Bayes classifiers, Dynamic Bayesian Networks, HMMs, SVMs, case-based reasoning, random decision forests, Artificial Neural Networks (ANN), and Gaussian Process NNs (Biparva et al., 2021). Traditional techniques have the advantage of being fast at inference and not requiring an extensive dataset. However, they struggle to generalise well and have limited prediction horizons. Additionally, most traditional techniques do not inherently account for vehicle interactions and may require additional features. The DL techniques used in the literature were based on ANNs, Convolutional Neural Networks (CNN), Fully Connected Networks (FCN), Recurrent Neural Networks (RNN), Graph Convolutional Neural Networks (GCNN), Gated Recurrent Units (GRU), or Long Short-Term Memory (LSTM) (Biparva et al., 2021). The main advantage of DL techniques is their ability to implicitly extract the features required to predict vehicle behaviour. Some DL techniques even consider the interaction between vehicles by themselves, for instance, RNNs and GCNNs. Yet, DL techniques may not address the multi-modal behaviour of vehicles, as they tend to average the multiple possible modalities to minimise the regression error. They also require an extensive dataset to generalise well, take longer to train, may suffer from gradient vanishing, and may not provide accurate trajectory prediction for longer time horizons.

The following paragraphs (presented after Table 2) will discuss the most relevant DL algorithms used to predict vehicle trajectories.
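Before turning to those DL algorithms, a minimal sketch of the physics-based baseline they are typically compared against: constant-velocity extrapolation of the last observed state, with no manoeuvre, context, or interaction awareness. This is an illustration under simplified assumptions rather than the implementation of any reviewed work; in practice such a model is usually wrapped in a Kalman filter:

```python
import numpy as np

def constant_velocity_predict(track: np.ndarray, dt: float, horizon: int) -> np.ndarray:
    """Extrapolate future (x, y) positions from the last observed velocity.

    track: (T, 2) observed positions; dt: sampling period in seconds;
    horizon: number of future steps to predict.
    """
    velocity = (track[-1] - track[-2]) / dt          # last finite-difference velocity
    steps = np.arange(1, horizon + 1)[:, None] * dt  # (horizon, 1) future time offsets
    return track[-1] + steps * velocity              # (horizon, 2) predicted positions

# Example: a vehicle moving 1 m per 0.1 s step along x.
obs = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
print(constant_velocity_predict(obs, dt=0.1, horizon=3))
# [[3. 0.] [4. 0.] [5. 0.]]
```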
Table 2
Relevant works for vehicle trajectory and intention prediction.
Work Methods Algorithm objectives Dataset-results
PF+RBF Hermes et al. (2009) Trajectory prototype. Particle Filter (PF) to track and Predict future trajectories of the OWN
generate motion hypothesis. RBF to classify trajectories. ego and surrounding vehicles. See Table 3.
QRLCS to measure similarity between trajectories.
Evaluation: RMSE.
Lim et al. (2010) Extended Kalman Filter Estimate Position and Velocity. OWN
Evaluation: Mean Distance Error. Graphs.
Kasper et al. (2012) Bayesian Networks. Detection of lane change OWN
Occupancy Grid Map (OGM). manoeuvre. Accuracy: 83.8%.
Evaluation: FP, FN, and Accuracy.
Kumar et al. (2013) SVM. Predict lane change manoeuvres OWN
Bayesian Filter. of the EV. Recall: 1
Evaluation: Recall, Precision, and F1-score. Precision: 0.8
F1-score: 0.9
APT: 0.97 s
Yoon and Kum (2016) Target lane model to predict in which lane the target vehicle Predict lane change of NGSIM
will go. surrounding vehicles. Absolute error: 0.7 m.
3rd Order Linear System to model trajectory.
Auto encoder to cluster the available trajectories into 3
prototype trajectories.
Multi-layer Perceptron (MLP) network to predict the target
lane and the probability for each one of the prototype
trajectories.
OTH/PTH: (1 s, 2 s, 3 s, 4 s,5 s)/5 s.
Evaluation: Prediction time and absolute error of lateral
position.
Khosroshahi et al. (2016) Features: linear changes, angular changes, and angular Classify manoeuvre intention at KITTI
changes histogram. intersections. 2 classes: 85%.
Multi-layer LSTM. 3 classes: 75%.
Evaluation: Accuracy. 8 classes: 65%.
12 classes: 40%.
Dueholm et al. (2016) Detection: DMP + Feature Pyramid + HOG Predict future trajectories of the OWN
Tracking: MDP + TLD. surrounding vehicles. Recall: 92%
Trajectory: KF.
Evaluation: Recall.
Kim et al. (2017) LSMT-RNN. Predict the future position of the OWN
OGM. surrounding vehicle using OGM. MAE:1.51 for 2 s; 0.88 for 1 s;
Data-driven approach. and 0.59 for 0.5 s.
PTH: 0.5 s, 1 s, and 2 s.
Information: Position, the velocity of surrounding vehicles,
and velocity and yaw rate of ego vehicle.
Evaluation: Mean Absolute Error(MAE).
Lee, Kwon, et al. (2017) CNN. Predict lane change manoeuvre. OWN
Evaluation: Accuracy. Accuracy: 89.87%
DESIRE Observation, sample generation, and rank refinement. Predict the future position of the SDD KITTI
Lee, Choi, et al. (2017) CVAE + RNN (GRU) to predict multi-modal trajectories surrounding vehicles considering See Table 3.
considering latent variables. static and dynamic scene context
IOC (based on Reinforcement Learning) to rank and refine and interaction between agents.
the predicted trajectories.
Spatial Grid-Based Pooling Layer to extract interaction
feature.
SCF to combine agents’ interactions and scene context.
OTH/PTH: 2 s/4 s.
Evaluation: L2 distance error and miss rate.
Altché and de La Fortelle (2017) LSTM encoder–decoder. Predict the target vehicle’s future NGSIM
Evaluation: average RMSE. position by considering See Table 3.
surrounding vehicles.
Xing et al. (2017) Two LSTM networks, one to encode past trajectories and Predict vehicle trajectory using NGSIM
predict intention manoeuvre, the other to encode past past trajectories and predicted See Table 3.
trajectories, and the predicted manoeuvre to decode future manoeuvre intention.
trajectories.
Evaluation: lateral and longitudinal RMSE.
Park et al. (2018) LSTM encoder–decoder. Predict the future position of the OWN
OGM. target and the surrounding MAE (Grid): 1.27 for 2 s; 1.14
Beam search algorithm. vehicles. for 1.6 s; 0.99 for 1.2 s; 0.84
OTH/PTH: 3 s/2 s. for 0.8 s; and 0.64 for 0.4 s.
Evaluation: MAE.
Table 2 (continued).
Work Methods Algorithm objectives Dataset-results
M-LSTM Tracking history and Manoeuvres classification (Lane change, Trajectory prediction of NGSIM
Deo and Trivedi (2018b) brake, and normal driving) to allow multi-modal prediction. surrounding vehicles considering See Table 3.
LSTM encoder–decoder to encode tracked history motions the interaction between traffic
and to decode multi-modal future motions. agents.
OTH/PTH: 3 s/5 s.
Evaluation: RMSE.
C-VGMM+VIM HMM for manoeuvre recognition. Manoeuvre Intention (lane LISA-A
Deo et al. (2018) IMM + VGMM to predict trajectories. change, overtaking, cutting-in, MAE overtakes and cut-ins: 2.49
Markov Random Field for vehicle interaction. drift into ego lane) and for 5 s; 1.94 for 4 s; 1.39 for 3
PTH: 5 s Trajectory Prediction. s; 0.82 for 2 s; and 0.29 for 1 s.
Evaluation: Manoeuvre classification accuracy, mean and MAE stop-and-go: 2.17 for 5 s;
median error for the trajectory prediction. 1.65 for 4 s; 1.14 for 3 s; 0.64
for 2 s; 0.20 for 1 s.
Accuracy for overtakes and
cut-ins: 55.89%
Accuracy stop-and-go: 87.19%
Time: 6FPS.
CS-LSTM LSTM encoder–decoder to encode previous motion Predict future motions of NGSIM
Deo and Trivedi (2018a) information and to decode future motion. surrounding vehicles taking into See Table 3.
Convolutional Social Pooling to learn agent’s consideration motion, spatial Computation time: 0.29 s
interdependence motions. configuration, and (reported by Li et al. (2019b)).
Multi-modal prediction (6 classes: RLC, LLC, NLC, brake, and interdependence between agents.
normal).
OTH/PTH: 3 s / 5 s.
Evaluation: RMSE and Negative log-likelihood (NLL).
SA-LSTM Surrounding-Aware LSTM. Predict lane change manoeuvre NGSIM
Su et al. (2018) OTH: 6, 9, and 12 frames. and future trajectories. Avg. Accuracy: 86.19%.
Evaluation: Accuracy.
MATF Hybrid Model (LSTM + CNN) Trajectory prediction by NGSIM
Zhao et al. (2019) LSTM to encode past trajectories for multiple agents. considering social interaction and See Table 3.
CNN to encode context information. scene context.
MATF to fuse interaction, spatial structure, and context
information.
Conditional generative adversarial training to detect
uncertainty in predicting manoeuvres.
Environment: Highway-Motorway and pedestrian crowd
scenes.
OTH/PTH: 3 s / 5 s.
Evaluation: RMSE.
Benterki et al. (2019) Features: local position, velocity, acceleration, distance to Predict lane change manoeuvres NGSIM
lane markings, yaw angle and rate, lateral velocity, and of the surrounding vehicles. ANN Accuracy: 98.8%.
acceleration. Prediction: 2.4 s.
ANN and SVM. SVM Accuracy: 97.1%.
Evaluation: Recall, Accuracy, Precision, and F1-score. Prediction: 1.9 s.
ST-LSTM Spatio-temporal LSTM. Trajectory prediction by NGSIM I-80
Dai et al. (2019) Short-cut connections to avoid gradient vanishing. considering spatial and temporal See Table 3.
Weighted sum to integrate the outputs. information.
Consider the 6 vehicles around the target vehicle.
OTH/PTH: 3 s/6 s.
Evaluation: RMSE.
GRIP Fixed Graph Convolutional (10 blocks) Model to represent Predict surrounding vehicle NGSIM
Li et al. (2019b) interactions between agents. trajectories considering the See Table 3.
Single LSTM encoder–decoder to make trajectory predictions. interaction between them. Computation time: 0.05 s.
OTH/PTH: 3 s/5 s.
Hardware: 4.0 GHz i7, 32GB memory, and NVIDIA Titan XP.
Evaluation: RMSE.
GRIP++ Dynamic Graph Convolutional (3 blocks) Model to represent Predict surrounding vehicle ApolloScape
Li et al. (2019a) interactions between agents. trajectories considering the WSADE: 1.2588.
Three GRU-RNN encoder–decoder to make trajectory interaction between them. WSFDE: 2.3631.
predictions. NGSIM
OTH/PTH: 3 s/5 s. See Table 3.
Hardware: 4.0 GHz i7, 32GB memory, and NVIDIA Titan XP. Computation time: 0.02 s.
Evaluation: RMSE, WSADE, and WSFDE.
NLS-LSTM Local and non-local social pooling. Predict vehicle trajectory using HighD
Messaoud et al. (2019) LSTM encoder–decoder. local and non-local social pooling. See Table 3
Evaluation: RMSE. NGSIM
See Table 3
Table 2 (continued).
Work Methods Algorithm objectives Dataset-results
Benterki et al. (2020) Hybrid Model Manoeuvre classification and NGSIM
ANN to classify manoeuvres. trajectory prediction. See Table 3
LSTM to predict trajectories.
OTH: 3 s, 5 s, and 6 s.
PTH: 1 s, 3 s, and 5 s.
Evaluation: RMSE and classification accuracy.
Fernández-Llorca et al. (2020) Two stream CNN (Disjoint). Recognition and prediction of PREVENTION
Spatio-temporal Multiplier Networks (ST) (cross-stream lane change/keep manoeuvre Disjoint
connections). using stacked visual cues from Classification Accuracy:89.46%.
ResNet-50 to extract both temporal and contextual videos. Prediction Accuracy:91.02%.
information. ST
OTH/PTH: 2 s/(1–2 s). Classification Accuracy: 90.30%.
4 Sizes of RoI are used x1, x2, x3 and x4. Prediction Accuracy: 91.94%.
Dense optical flow to extract movement context.
Evaluation: Classification accuracy and Prediction Accuracy.
ARIMA-Bi-LSTM Off-line Bi-LSTM. Predict trajectories and turning NGSIM-LP
Zhang and Fu (2020) Online ARIMA + Bi-LSTM. manoeuvres at intersections. GS: lateral 0.032; long. 0.1093.
PTH: 5 s. TL: lateral 0.2719; long. 0.1592.
Evaluation: RMSE and Accuracy. TR: lateral 0.1168 long. 0.3954
Accuracy: 94.2% at 1 s, 93.5%
at 2 s, and 74.5% at 3 s.
Izquierdo et al. (2021) TSM to differ between target and surrounding vehicles. Detection and prediction of lane PREVENTION
TIM to extract motion pattern. change performed by surrounding Manoeuvre Detection:
Greyscale image to extract context information. vehicles. Present a baseline to Accuracy: 82.7%.
Compared various CNN models to detect and predict compare human performance Anticipation:2.28 s.
manoeuvres. against automated systems. Briefly Manoeuvre Prediction:
OTH: 1 s. compared the available datasets. Accuracy: 83.4%.
Evaluation: Accuracy, precision, recall, anticipation (s), and Prediction: 0.72 s.
AUC.
Biparva et al. (2021) 4 action recognition models were evaluated: Two-stream Recognition and prediction of PREVENTION
CNN, Two-stream Inflated 3D CNN, STM network, and lane change/keep event using Accuracy for STM: 91.91% for 2
SlowFast Network. stacked visual cues from videos. s; 86.51% for 1 s.
4 Sizes of RoI.
Dense optical flow to extract movement context.
OTH:PTH: 2 s/(1–2 s).
Evaluation: Accuracy (%).
ST-Conv-LSTM Spatial–temporal Convolutional LSTM. Predict lateral (lane change) and BDD100K
Huang et al. (2021) OTH/PTH: 2.4 s/1 s. longitudinal (holding, sharp Accuracy: 57.9%.
Evaluation: Accuracy. acceleration, deceleration, and
stopping) intention.
IPTM-LSTM Intention encoder–decoder LSTM. Use intention to predict trajectory NGSIM-LP
Zhang, Song, et al. (2021) Trajectory encoder–decoder LSTM. of travelling straight, turning Avg. Intention Accuracy:
IPTM. left/right and braking. 90.94%
Evaluation: Accuracy and RMSE. RMSE: See Table 3
INTERACTION
Avg. Intention Accuracy:
86.92%.
LSTM-GAN LSTM + Generative Confrontation Network. Predict vehicle turning intention. OWN
He et al. (2021) Evaluation: Accuracy. Accuracy: 90.9%.
Luan et al. (2022) Game theory model to predict the intention of the driver. Predict the trajectory of lane NGSIM
Recognise the vehicle behaviour using past vehicle state. change manoeuvres using driver Graphs.
Nash-optimisation function. style (aggressive or conservative)
Evaluation: Lateral position error, yaw rate error, and behaviour recognition.
probability error.
AI-TP Approach: Data-driven. Trajectory prediction. NGSIM
Zhang, Zhao, et al. (2022) Features: Past trajectories. See Table 3
Model(s): graph attention mechanism (AI-TP), ConvGRU,
Evaluation: MSE.
Altché and de La Fortelle (2017) and Kim et al. (2017) were, to the authors' knowledge, among the first to use LSTM-RNNs to predict the future trajectories of the surrounding vehicles by using their past trajectories as the input feature. Park et al. (2018) predicted future trajectories using an encoder–decoder LSTM. The encoder encodes the past trajectories of the surrounding vehicles, while the decoder decodes future trajectories in an Occupancy Grid Map (OGM). The authors also applied a beam search algorithm to reduce the error propagation caused by the greedy strategy that the decoder LSTM uses to maximise the output probabilities.
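A minimal sketch of the encoder–decoder pattern shared by these works is given below; it is illustrative rather than the exact architecture of Kim et al. (2017) or Park et al. (2018), the dimensions are hypothetical, and the OGM output and beam search are omitted:

```python
import torch
import torch.nn as nn

class Seq2SeqTrajectory(nn.Module):
    """Encode OTH past (x, y) steps, decode PTH future steps autoregressively."""
    def __init__(self, hidden: int = 64):
        super().__init__()
        self.encoder = nn.LSTM(2, hidden, batch_first=True)
        self.decoder = nn.LSTM(2, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 2)

    def forward(self, past: torch.Tensor, pth: int) -> torch.Tensor:
        _, state = self.encoder(past)        # summarise the past trajectory
        step = past[:, -1:, :]               # seed with the last observed position
        future = []
        for _ in range(pth):                 # greedy decoding; Park et al. apply
            dec, state = self.decoder(step, state)  # beam search here instead
            step = self.out(dec)
            future.append(step)
        return torch.cat(future, dim=1)      # (batch, PTH, 2)

model = Seq2SeqTrajectory()
pred = model(torch.randn(8, 30, 2), pth=25)  # 3 s observed -> 2.5 s predicted at 10 Hz
```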
Deo and Trivedi (2018b) presented a Manoeuvre-LSTM model that encodes the motion and interaction of the surrounding vehicles to assign probabilities to each manoeuvre. The assigned probabilities enable multi-modal trajectory predictions. At the time, the algorithm achieved better RMSE results than the state-of-the-art algorithms, but the RMSE values for long PTHs were still high. Although the algorithm considered the interaction between vehicles, it did not consider their inter-dependencies. In order to overcome this limitation, Deo and Trivedi (2018a) combined convolutional social pooling and an encoder–decoder LSTM to predict manoeuvres and future trajectories. The convolutional social pooling can learn the interaction and interdependence of the surrounding vehicles. The downside of the algorithm is that the social tensor of the convolutional social network was fixed to the defined spatial grid around the target vehicle, and it did not consider visual context information. The disadvantage of the last two algorithms is that the predicted trajectories are dependent on the manoeuvre classification performance. For example, Deo and Trivedi (2018a) compared their algorithm with and without considering manoeuvre intention
Table 3
Results for the most relevant vehicle trajectory prediction works.
Work Dataset Metrics Axis Obs. Hor. 1 s 2 s 3 s 4 s 5 s 6 s
CV NGSIM RMSE Both 3 s 0.73 1.78 3.13 4.78 6.68 -
Deo and Trivedi (2018a)
S-LSTM NGSIM RMSE Both 3 s 0.65 1.31 2.16 3.25 4.55 -
Alahi et al. (2016)
GAIL-GRU NGSIM RMSE Both 3 s 0.69 1.51 2.55 3.65 4.71 -
Kuefler et al. (2017)
C-VGMM+VIM NGSIM RMSE Both 3 s 0.66 1.56 2.75 4.24 5.99 -
Deo et al. (2018)
M-LSTM NGSIM RMSE Both 3 s 0.58 1.26 2.12 3.24 4.66 -
Deo and Trivedi (2018b)
CS-LSTM(M) NGSIM RMSE Both 3 s 0.62 1.29 2.13 3.20 4.52 -
Deo and Trivedi (2018a)
CS-LSTM NGSIM RMSE Both 3 s 0.61 1.27 2.09 3.10 4.37 -
Deo and Trivedi (2018a)
MATF GAN NGSIM RMSE Both 3 s 0.66 1.34 2.08 2.97 4.13 -
Zhao et al. (2019)
ST-LSTM-1350 NGSIM RMSE Both 3 s 0.56 1.19 1.93 2.78 3.76 4.84
Dai et al. (2019) avg.
GRIP NGSIM RMSE Both 3 s 0.37 0.86 1.45 2.21 3.16 -
Li et al. (2019b)
GRIP++ NGSIM RMSE Both 3 s 0.38 0.89 1.45 2.14 2.94 -
Li et al. (2019a)
AI-TP NGSIM RMSE Both 3 s 0.47 1.05 1.53 1.93 2.31 -
Zhang, Zhao, et al. (2022)
NLS-LSTM NGSIM/HighD RMSE Both 3 s 0.56/0.20 1.22/0.57 2.02/1.14 3.03/1.90 4.30/2.91 -
Messaoud et al. (2019)
OGM-LSTM NGSIM RMSE Lateral/Longi. – 0.56/3.05 1.24/6.70 - - - -
Kim et al. (2017)
Dual LSTM NGSIM RMSE Lateral/Longi. 5 s 0.15/0.47 0.26/1.39 0.38/2.57 0.45/4.04 0.49/5.77 -
Xing et al. (2017)
Altché and de La Fortelle (2017) NGSIM RMSE Lateral/Longi. – 0.11/0.71 0.25/1.98 0.33/3.75 0.40/5.96 0.47/9.00 -
ANN-LSTM NGSIM RMSE Lateral/Longi. 3 s 0.043/0.122 - 0.125/0.235 - 0.235/0.264 -
Benterki et al. (2020)
IPTM-LSTM NGSIM-LP RMSE Both 3 s 0.77 1.34 2.19 – – -
Zhang, Song, et al. (2021)
MATF GAN Massachusetts RMSE Both 3 s 0.75 1.4 2.0 2.7 – -
Zhao et al. (2019)
PF+RBF OWN RMSE Both – 0.7 1.4 5.0 – – -
Hermes et al. (2009)
CS-LSTM(M) NGSIM NLL Both 3 s 0.58 2.14 3.03 3.68 4.22 -
Deo and Trivedi (2018a)
C-VGMM+VIM LISA-A MAE Both 3 s 0.24 0.69 1.18 1.66 2.18 -
Deo et al. (2018)
DESIRE KITTI/SDD DE PE Both 2 s 0.28/1.29 0.67/2.35 1.22/3.47 2.06/5.33 – –
Lee, Choi, et al. (2017)
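The RMSE values in Table 3 are reported separately at each prediction horizon. A sketch of the usual computation, assuming hypothetical arrays of predicted and ground-truth positions sampled at a known frame rate:

```python
import numpy as np

def rmse_at_horizons(pred, gt, fps=5, horizons_s=(1, 2, 3, 4, 5)):
    """pred, gt: (N, T, 2) predicted / true positions for N samples over T steps.

    Returns the RMSE of the Euclidean position error at each horizon (seconds).
    """
    err2 = np.sum((pred - gt) ** 2, axis=-1)  # (N, T) squared distances
    return {h: float(np.sqrt(err2[:, h * fps - 1].mean())) for h in horizons_s}

pred = np.zeros((10, 25, 2))
gt = np.ones((10, 25, 2))
print(rmse_at_horizons(pred, gt))  # sqrt(2) ~ 1.414 at every horizon here
```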
it requires a substantial number of sample trajectories to determine the numerous possible motion patterns.

In contrast, the manoeuvre intention estimation methods use vehicle motion and road context features to classify the different types of manoeuvres, for instance, stopping/non-stopping, turning left/right, etc. Although this method is less complex than calculating the numerous trajectory probabilities, a large training dataset is required to make the system robust to the different road scenarios. Another limitation is that the manoeuvre classes may not be sufficient to cover the real complexity of vehicle intentions. For instance, the system may predict a braking manoeuvre, but the braking can be normal or harsh. A proposed solution is to sub-categorise the manoeuvres, for example, normal/harsh stopping and normal/sharp right/left turns; however, this adds complexity to the dataset labelling (Mozaffari et al., 2020).

Intention prediction algorithms can also use predicted trajectories and the interaction between vehicles to achieve better accuracy. Traditional methods used to predict the intention of vehicles are heuristics, Bayesian Networks, HMMs, and SVMs. Commonly used DL methods are RNNs, LSTMs, and action recognition models.

The following paragraphs will discuss the most relevant DL algorithms used to predict vehicle intention manoeuvres.

Khosroshahi et al. (2016) implemented a multi-layer LSTM network to classify manoeuvre intentions at complex intersections. They extracted samples representing manoeuvre intentions from the KITTI dataset to train and test the algorithm. The input features included linear and angular changes, as well as a histogram of angular changes of the vehicle trajectories. The authors performed experiments with different numbers of manoeuvre classes: 2 (straight or turning), 3 (straight, turning left/right), 8 and 12 classes. The algorithm performed well with 2 and 3 classes, but the accuracy significantly decreases with 8 and 12 classes.

Lee, Kwon, et al. (2017) transformed real-world images into a simplified version of a Bird's Eye View (BEV) and fed them into a CNN to predict lane change behaviour. Zhang and Fu (2020) used an offline Bi-LSTM and an online ARIMA model to fit observed
trajectories and predict future ones. The outputs of the offline Bi-LSTM and ARIMA were then fed into another Bi-LSTM to recognise turning behaviour as left-turn, right-turn, or going straight. The algorithm went through evaluation using the NGSIM Lankershim and Peachtree Street dataset, and was able to meet real-time requirements while achieving good recognition accuracy for PTHs of 1 s and 2 s. However, accuracy dropped when considering a PTH of 3 s, and it only considered turning left/right and going straight manoeuvres, whereas vehicles at intersections can perform more complex manoeuvres, as reported by Khosroshahi et al. (2016). In addition, the dataset used was acquired from top-view sensors, while AVs are equipped with on-board camera sensors. Benterki et al. (2019) compared two conventional methods to predict lane-change manoeuvres, ANN and SVM. They concluded that ANN and SVM have almost the same performance; however, ANN showed the best results.

Izquierdo et al. (2021) used CNN, action recognition, and prediction methods to recognise and predict lane-keeping/changing manoeuvres. Instead of using a sequence of images, they encoded context, interaction, and dynamic state information in a unique enriched image. The enriched image was created by extracting the red channel from a greyscale version of the original image, using a target selection method (TSM), and a temporal integration method (TIM). The authors also investigated human performance in recognising and predicting lane changes. Their findings indicated that humans can detect 83.9% of the lane change events with an average anticipation of 1.66 s before the manoeuvre is completed. Only 3 out of 72 users were able to predict the lane change events before they started, with an average prediction horizon of 1.08 s. On the other hand, their best algorithm, which considers the trade-off between accuracy and anticipation, achieved 86.4% accuracy with an average anticipation of 2.09 s when considering a TTE equal to 0. When the TTE was set to 1 s, their algorithm achieved an anticipation of 2.69 s, a prediction of 0.72 s, and an average accuracy of 83.4%.

Fernández-Llorca et al. (2020) and Biparva et al. (2021) recognised and predicted lane-keeping/changing manoeuvres using video action recognition approaches. Biparva et al. (2021) used four types of video action recognition approaches: Two-stream CNN, Two-stream Inflated 3D CNN, Spatio-temporal Multiplier Networks, and SlowFast Networks. All of the aforementioned networks used spatial and temporal information from a single image, a sequence of images, or a sequence of optical flow images for the recognition and prediction tasks. Moreover, four sizes of RoI were used, denoted as x1, x2, x3 and x4, to consider the interaction between agents and to extract contextual information around the target vehicle. The network with the best recognition performance was the SlowFast CNN, achieving an accuracy of 90.98% with an OTH of 2 s before the TTE. Meanwhile, the network with the best prediction performance was the spatio-temporal multiplier, achieving an accuracy of 91.94% with an OTH of 2 s. The limitations of the previously cited works are as follows: the distribution of the manoeuvre classes was imbalanced, with more lane-keeping samples than lane-changing ones; the time required to recognise and predict a single instance was not provided; and some of the algorithms, such as the SlowFast network, were not able to complete their training due to GPU memory limitations.

Furthermore, it was observed from the previous vehicle intention prediction works that the authors selected a fixed PTH to predict the vehicles' intentions. The drawback of using a fixed PTH is that manoeuvre samples may vary in length. For instance, a lane-change manoeuvre performed by an aggressive driver will be shorter than a lane-change manoeuvre performed by a normal driver.

4. Pedestrian behaviour prediction

At present, AV systems can effectively detect and track pedestrians; however, this alone is not enough to prevent potential collisions. In order to avoid a collision, AV systems must predict pedestrian behaviours. This section aims to provide a literature review of the challenges and techniques used over the years for addressing pedestrian behaviour prediction.

Pedestrian behaviour prediction has been applied to three main types of datasets: datasets recorded using drones, for example, ETH and UCY; datasets recorded from static cameras; and datasets recorded from car dash cameras, for example, Daimler, JAAD, PIE or KITTI. Datasets from car cameras are more appropriate to train models for AVs because they provide a more realistic representation. However, when the car is in motion, it may affect the position of the pedestrian bounding box, and pedestrians can be easily occluded. Car camera datasets can be categorised as either naturalistic or non-naturalistic, as discussed by Fang and López (2018). In non-naturalistic datasets, the pedestrian behaviours and intentions are performed by actors, whereas, in naturalistic datasets, behaviours and intentions are recorded from actual road traffic scenarios. Some of the features that have been used for predicting pedestrian behaviour are listed in Table 4.

Table 4
Features that have been used to predict pedestrian behaviour.

Bbox coordinates: position, speed, height and width.
Bbox cropped image: pedestrian appearance, local and surrounding context.
Full image: global context and some interaction between different traffic objects.
Body pose: displacement, action, skeleton, and landmarks.
Ego vehicle position/speed: interaction between pedestrian and ego vehicle. Pedestrian behaviour is affected by ego vehicle speed.

Pedestrian behaviour prediction has been heavily investigated in the past years, and it has many challenges. For instance, pedestrians are highly dynamic, they can move in many directions and change them very quickly, and they can be easily occluded by other objects. They can also become distracted by their own objects or external environments, their movements may be influenced by other traffic agents, and they can be difficult to detect in poor visibility conditions. As reported in Tables 5 and 7, researchers have proposed various methods and features to address these challenges over the years. From these tables, the following observations can be made:

• Until 2018, most of the works used traditional methods and their OWN datasets. Thereafter, most authors adopted DL techniques and used the ETH and UCY datasets for trajectory prediction, as well as the JAAD and PIE datasets for intention prediction.
• Pedestrian behaviour prediction algorithms have evolved from solely using motion information to using pedestrian appearance, body pose landmarks, local/global context, interactions between agents, and ego vehicle dynamics.
• Prior to 2018, the focus was predominantly on trajectory prediction; thereafter, substantial research efforts have been dedicated to predicting pedestrian intentions.
• Most of the intention prediction works aimed to predict the crossing intention.
• The most used evaluation metrics for intention prediction were accuracy, F1-score, precision, recall, Area Under the Curve (AUC), and Receiver Operating Characteristic Curve (ROC-AUC).
• The most used evaluation metrics for trajectory prediction were Average Displacement Error (ADE), Final Displacement Error (FDE), and MSE; a computation sketch is given below. Other metrics are Average Non-linear Displacement Error (ANDE), Mean Average Displacement (MAD), and Final Average Displacement (FAD).

The following subsections discuss some of the algorithms reported in Tables 5 and 7. The first subsection provides an in-depth exploration of trajectory prediction algorithms, while the subsequent subsection explores intention prediction algorithms.
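ADE and FDE, the dominant trajectory metrics in Tables 5 and 7, are straightforward to compute. A sketch under their usual definitions (average, and final-step, Euclidean displacement over the prediction horizon); array names are hypothetical:

```python
import numpy as np

def ade_fde(pred: np.ndarray, gt: np.ndarray) -> tuple:
    """pred, gt: (N, T, 2) predicted and ground-truth positions.

    ADE: displacement error averaged over all timesteps and samples.
    FDE: displacement error at the final predicted timestep.
    """
    dist = np.linalg.norm(pred - gt, axis=-1)  # (N, T) Euclidean errors
    return float(dist.mean()), float(dist[:, -1].mean())

ade, fde = ade_fde(np.zeros((4, 12, 2)), np.ones((4, 12, 2)))
print(ade, fde)  # both sqrt(2) ~ 1.414 in this toy case
```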
Table 5
Relevant works for pedestrian trajectory prediction.
Work Methods Dataset/Results
Schneider and Gavrila (2013) Approach: Dynamic. Daimler
Features: constant velocity, acceleration, turn. IMM has not shown
Models: Recursive Bayesian filters – Compared EKF and IMM significant performance over
filters. simpler models.
PTH: < 2 s.
Evaluation: MLPE.
Keller and Gavrila (2013) Approach: Dynamic. OWN (on-board)
Features: optical flow. GDPM and PHTM showed
Compared the performance between GDPMs, PHTM, KF and better accuracy, however, they
IMMKF. are more computationally
Provided human performance on classifying pedestrian expensive.
behaviour prediction. 10–50 cm Time Horizon 0.77
Evaluation: Mean Combined Longitudinal and Lateral RMSE. s.
Kooij et al. (2014) Approach: Dynamic + Context. OWN (on-board)
Features: Head orientation, distance between vehicle and Outperforms state-of-art
pedestrian, distance between pedestrian and curb. algorithm PHTM. Best result
Models: Dynamic Bayesian Filters (SLDS). of −0.33 was achieved in the
Evaluation: Predictive log likelihood. critical, vehicle-seen and
stopping scenario using the
full context information.
Social-LSTM Alahi et al. (2016) Approach: Data driven. ETH and UCY
Features: Past trajectories. ADE/FDE/AND:
Models: Social pooling layer, and LSTM. 0.27/0.61/0.15.
OTH/PTH: 8 (3.2 s)/12 (4.8 s) frames.
Evaluation: ADE, FDE, and AND.
Karasev et al. (2016) Approach: Dynamic + Context. OWN (on-board) for training
Features: pedestrian state (position, orientation, and speed), and KITTI for evaluation.
predicted goals, environment context (building, sidewalk, Displayed in a graph.
crosswalk, road and grass), dynamic environments such as
traffic lights, and assumed rational behaviour for the agent.
Models: Jump-Markov Process, and Rao-Blackwellized filter.
Evaluation: L2 error, and Average prediction error.
Rehder et al. (2018) Approach: Data driven + Goal-directed. OWN (on-board)
Features: visual cues, predicted pedestrian destinations, and Outperformed IMM. Results
trajectories. were not clear, but from graph
Models: RMDN, LSTM, topology network, and Markov Prediction accuracy 10(−1) for
Decision Process. 1.5 s. Destination plays an
Evaluation: Predicted probability distribution, Average important role when trying to
accuracy of predicted destination, and prediction accuracy predict pedestrian intention.
over time.
SR-LSTM Zhang et al. (2018) Approach: Data driven and social behaviour. ETH and UCY
Features: trajectories and current state of the neighbours. MAD: 0.45; FAD: 0.94.
Model(s): SR-LSTM and attention mechanism.
Evaluation: MAD, and FAD.
Social-GAN Approach: Data driven. ETH, UCY
Gupta et al. (2018) Features: Past trajectories. ADE: 0.39/0.58.
Model(s): GAN, Pooling Module, and LSTM. FDE: 0.78/1.18.
PTH: 8 and 12 time steps.
Evaluation: ADE and FDE.
Social attention Approach: Data driven. ETH and UCY
Vemula et al. (2018) Features: Past trajectories. ADE: 0.30 m.
Model(s): ST-Graph, LSTM, and Attention. FDE: 2.59 m.
OTH/PTH: 8 (3.2 s)/12 (4.8 s) time steps.
Evaluation: ADE and FDE.
SS-LSTM Approach: Data driven. ETH and UCY
Xue et al. (2018) Features: Past trajectories, neighbour feature (occupancy ADE: 0.070 pixels.
maps: grip, circle and log), and individual information. FDE: 0.133 pixels.
Model(s): CNN, and Hierarchical-LSTM.
OTH/PTH: 8/12 frames.
Evaluation: ADE and FDE.
CIDNN Approach: Data driven. GC/ETH/UCY/CUHK/Subway
Xu et al. (2018) Features: Past trajectories, and interactions. ADE:
Model(s): stacked-LSTM, and MLP. 0.012/0.09/0.12/0.008/0.016.
OTH/PTH: 5/5 frames. Inference: 0.43 ms
Hardware: Intel Xeon CPU E52643 4.40 and TITAN GPU.
Evaluation: ADE.
Table 5 (continued).
Work Methods Dataset/Results
LSTM-Bayesian Approach: Data driven. CityScapes(on-board)
Bhattacharyya et al. (2018) Features: Bbox coordinates past trajectories and ego MSE/NLL: 505/3.92.
vehicle odometry.
Model(s): Two stream architecture, Bayesian RNN
(LSTM), and CNN.
OTH/PTH: 0.5/1 s.
Evaluation: MSE in pixels and NLL.
DBN-SLDS Approach: Data driven. OWN (on-board,
Flohr et al. (2018) Features: context cues (VRU actions, and its static and non-naturalistic
dynamic environment). Graphs.
Model(s): DBN and SLDS.
TTE = [−15, 0]
PTH:1 s.
Evaluation: Prediction error.
MX-LSTM Approach: Data driven. UCY
Hasan et al. (2018) Features: Past trajectories, and head pose estimation. MAD/FAD: 0.49/1.12 m.
Model(s): tracklets, vislets, VFO social pooling, and Towncentre
LSTM. MAD/FAD: 1.15/2.30 m.
OTH/PTH: 8/12 frames.
Evaluation: MAD and FAD in metres.
Scene-LSTM Approach: Data driven. UCY and ETH
Manh and Alaghband (2018) Features: Past trajectories and scene divided into grid ADE/FDE/NDE: 0.7/0.7/0.9.
cells.
Model(s): Scene Data Filter, and Coupled-LSTM.
OTH/PTH: 3.2/4.8 s.
Evaluation: ADE, FDE and NDE.
SoPhie Approach: Data driven. ETH, UCY
Sadeghian et al. (2019) Features: Past trajectories, social interactions, and images ADE: 0.54 m.
of the scene. FDE: 1.15 m.
Model(s): CNN, LSTM, GAN, Social and physical SDD
attention mechanism. ADE: 16.24 pixels.
PTH: 12 future timesteps. FDE: 29.38 pixels.
Evaluation: ADE and FDE.
StarNet-DNN Approach: Data driven. ETH and UCY
Zhu et al. (2019) Features: Past trajectories. ADE/FDE: 0.30/0.57.
Model(s): StarNet DNN (Host and hub networks), and Inference: 0.073 s.
LSTM.
PTH: 8 frames.
Hardware: Tesla V100 GPU.
Evaluation: ADE and FDE.
PECNet Approach: Data driven and goal directed. ETH and UCY
Mangalam et al. (2020) Features: Past trajectories and estimated end point ADE/FDE: 0.29/0.48 m.
destination. SDD
Model(s): CVAE, attention mechanism, and social ADE/FDE: 9.96/15.88 p.
pooling.
OTH/PTH: 3.2/4.8 s.
Evaluation: ADE and FDE.
ST-GCNN Approach: Data driven. ETH and UCY
Mohamed et al. (2020) Features: Past trajectories and sequence of images. ADE/FDE: 0.44/0.75 m.
Model(s): GCN, and TXP-CNN.
OTH/PTH: 3.2/4.8 s.
Evaluation: ADE and FDE.
RSBG Approach: Data driven. ETH and UCY
Sun et al. (2020) Features: Past trajectories and local context. ADE/FDE: 0.48/0.99 m.
Model(s): GCN, CNN, and LSTM.
OTH/PTH: 3.2/4.8 s.
Evaluation: ADE and FDE.
LVTA Approach: Data driven. ETH and UCY
Xue et al. (2020) Features: Past trajectories and velocities. ADE/FDE: 0.46/0.92 m.
Model(s): attention mechanism, and LSTM.
OTH/PTH: 3.2/4.8 s.
Evaluation: ADE and FDE.
Holistic-LSTM Approach: Data driven. JAAD
Quan et al. (2021) Features: bbox past trajectories, crossing intention, MSE: 389.
pedestrian scale, depth estimation, and global scene PIE
dynamics (depth and optical flow). MSE: 167.
Model(s): ConvLSTM, modified LSTM with more inputs, S-KITTI
and attention mechanism. MSE: 525/1.5 s.
OTH/PTH: 0.5/1 s.
Evaluation: MSE, CMSE, and CFMSE of the bbox
coordinates.
Bi-TraP Approach: Data driven and Multi-modal goal estimation. JAAD
Yao et al. (2021a) Features: bbox past trajectories. ADE: 1206.
Model(s): CVAE, Gaussian distribution, GMM, and PIE
Bi-directional GRU. ADE: 511.
OTH/PTH (JAAD/PIE): 0.5/1.5 s. ETH-UCY
OTH/PTH (ETH/UCY): 3.2/4.8 s. ADE/FDE: 0.18/0.35.
Evaluation: ADE and FDE.
BA-PTP Approach: Data driven. PIE
Czech et al. (2022) Features: vehicle odometry, bbox, body, head MSE/CMSE/CFMSE:
orientation, and pose. 420/383/1513.
Model(s): attention mechanism and Bi-GRU. ECP-Intention
OTH/PTH (PIE): 0.5/1.5 s. MSE/CMSE/CFMSE:
OTH/PTH (ECP): 0.6/1.6 s. 768/680/1966
Evaluation: MSE, CMSE, and CFMSE.
SGNet Approach: Data driven, and goal directed. JAAD
Wang et al. (2022) Features: Past trajectories. MSE/CMSE/CFMSE:
Model(s): Stepwise goal estimator, attention mechanism, 1049/996/4076 p (1.5 s).
GRU, and CVAE. PIE
OTH/PTH (JAAD, PIE, HEV-I): 1.6/0.5,1.0,1.5 s. MSE/CMSE/CFMSE:
OTH/PTH (ETH & UCY): 3.2/4.8 s. 442/413/1761 p (1.5 s).
OTH/PTH (NuScenes): 2/6 s. ETH and UCY
Evaluation: MSE, CMSE, CFMSE, ADE and FDE. ADE/FDE: 0.35/0.83
Euclidean space.
NuScenes
ADE/FDE: 1.32/2.50.
PTPGC Approach: Data driven. ETH and UCY
Yang, Sun, et al. (2022) Features: Past trajectories, length of attributes, and ADE/FDE: 0.67/1.29.
number of pedestrians.
Model(s): Graph attention, convLSTM, and Temporal
CNN.
OTH/PTH: 3.2/4.8 s.
Evaluation: ADE and FDE.
4.1. Trajectory prediction

Both traditional and DL techniques have been used in order to predict pedestrian trajectories. Traditional techniques rely on hand-crafted functions, such as EKF, IMM, and social forces, to predict pedestrians' future trajectories. However, these functions have limitations in handling complex scenarios. To address this, several researchers adopted DL techniques such as: CNN, Generative Adversarial Network (GAN), GCNN, LSTM, GRU, CVAE, attention mechanism, and/or Multi-Layer Perceptron (MLP).

Although LSTM networks have many advantages, they struggle to learn dependencies between multiple correlated sequences. For this reason, Alahi et al. (2016) proposed a Social LSTM network to predict pedestrian trajectories. Social pooling layers were introduced to enable LSTM networks to share their hidden state. This enables the algorithm to learn interactions among pedestrians. Social-LSTM only considers motion features to model human interactions; however, Xu et al. (2018) argue that spatial position should also be considered. For this reason, they presented a model where MLP layers were used to encode location, and LSTM was used to encode motion for each neighbour. Both sets of encoded information were then used as input to a crowd interaction module to predict pedestrian displacement. In a different approach, Xue et al. (2020) used two LSTM layers to encode the pedestrian's location and velocity, along with a temporal attention mechanism to extract the most relevant features from the velocity and location inputs.

Humans are highly dynamic, which makes the task of predicting their trajectories more challenging. In response to this, Rehder et al. (2018) implemented a DNN that would first predict the future destinations of the pedestrians, and then predict their future trajectories. They used a CNN, an LSTM and a Mixture Density Network to predict potential destinations, and another CNN to plan and predict future trajectories based on these potential destinations. A CVAE was used by Mangalam et al. (2020) to predict future endpoints, which were subsequently used to predict multi-modal longer-term trajectories. They also presented novel self-attention-based social pooling layers that extract relevant features from the neighbours using non-local attention. Yao et al. (2021a) also proposed a goal-directed method, where they combine a CVAE and a bi-directional GRU to encode past trajectories and decode multi-modal future trajectories. Goal-directed models have the disadvantage that only one goal is estimated over a long-term prediction. For this reason, if a pedestrian changes direction, the estimated goal may be incorrect, consequently affecting the predicted trajectories. Wang et al. (2022) proposed a method where they model and estimate goals continuously by using RNNs.

While many studies relied on historical trajectories for predicting future ones, they often overlooked the current state of the pedestrian. In order to overcome this issue, Zhang et al. (2019) introduced a state refinement LSTM that considers both the current and previous state of the target pedestrian and the surrounding pedestrians. This state refinement module enables the network to incorporate interactions through a message-passing mechanism. It also uses a motion gate as an attention mechanism to focus on the most relevant features of the neighbours.

Previous research, when considering human-to-human interactions, would often take into account only nearby neighbours, even though more distant neighbours might also influence the behaviour of the target pedestrian. A GAN was presented by Gupta et al. (2018) that considers not only local neighbours but all neighbours in the scene. The GAN network comprises an LSTM generator to generate multiple potential trajectories, a pooling module to learn human-to-human interactions, and an LSTM discriminator to select acceptable trajectories from the generated ones. Similarly, Vemula et al. (2018) considered all the pedestrians in the scene using a spatio-temporal graph and LSTM. Additionally, they adopted an attention mechanism to learn the relevance of each agent, regardless of how far they are from each other. A star-like network was introduced by Zhu et al. (2019) to account for all agents in the scene. The network has a centralised hub network, which gathers motion information from all pedestrians in the scene, and a host network for each pedestrian. The host networks query the hub network for social information to predict trajectories. Graph attention
and convolutional LSTM were also proposed by Yang, Sun, et al. (2022)
to consider the surrounding neighbours.
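The grid-based social pooling step can be made concrete with a short sketch. The snippet below is a minimal illustration of pooling neighbours' hidden states onto a spatial grid in the spirit of Alahi et al. (2016); the grid size, cell width, and tensor shapes are illustrative assumptions rather than the authors' actual configuration.

```python
import torch

def social_pooling(positions, hidden, grid_size=4, cell=0.5):
    """Pool neighbours' LSTM hidden states onto a grid around each pedestrian.

    positions: (N, 2) tensor of current x/y coordinates in metres.
    hidden:    (N, H) tensor of current LSTM hidden states.
    Returns a (N, grid_size * grid_size * H) social tensor per pedestrian.
    """
    N, H = hidden.shape
    out = torch.zeros(N, grid_size, grid_size, H)
    half = grid_size * cell / 2
    for i in range(N):
        for j in range(N):
            if i == j:
                continue
            dx, dy = (positions[j] - positions[i]).tolist()
            if abs(dx) >= half or abs(dy) >= half:
                continue  # neighbour falls outside the pooling window
            gx = int((dx + half) // cell)
            gy = int((dy + half) // cell)
            out[i, gx, gy] += hidden[j]  # hidden states sharing a cell are summed
    return out.flatten(1)

pos = torch.rand(5, 2) * 4       # five pedestrians in a 4 m x 4 m area
h = torch.randn(5, 32)           # their current hidden states
social = social_pooling(pos, h)  # (5, 256) social tensors
```

In Social-LSTM this pooled tensor is embedded and concatenated with the pedestrian's own position embedding before the next recurrent step, which is how the hidden states are shared across agents.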
Xue et al. (2018) emphasised the importance of considering scene
layout when predicting pedestrian trajectories. As a result, they used
three different LSTMs to learn information about individuals, social
interactions, and scene layout. One LSTM used the trajectory of the
target pedestrian as its input, another used an occupancy map as
its input, and the final one used feature vectors extracted from the
original image by a CNN as its input. Likewise, Manh and Alaghband
(2018) took scene layout into account, where they used a two-level
grid structure of the original image and trajectory information as
inputs to a two-stream LSTM for predicting future trajectories. CNN,
LSTM, attention mechanism, and GAN were used by Sadeghian et al.
(2019) to predict trajectories using both past trajectories and scene
context as inputs. The CNN extracted scene-related features, the LSTM
extracted motion-related features, the attention mechanism extracted
both the physical and position relevant features, and the GAN generated
multiple trajectories and then selected the most suitable ones.
Mohamed et al. (2020) classified methods such as social pooling or the combination of hidden state features, used to model human interactions, as "aggregation methods". They claimed that these types of methods have limitations in accurately modelling human interactions because the aggregation occurs within the feature space and does not directly model physical interactions. Furthermore, some of these aggregation methods, such as pooling layers, may fail to capture important information. Given these considerations, the authors proposed a social spatio-temporal GCN (ST-GCN) to model interactions among pedestrians. The ST-GCN model's output is subsequently used as input for a time-extrapolator CNN to predict future trajectories.

The above works have not considered group-based interactions, which involve two or more individuals exhibiting similar movements, behaviours, or goals. A recursive social behaviour graph and GCN were implemented by Sun et al. (2020) to explore and learn group-based interactions. The authors also used a CNN and an LSTM to obtain an individual representation of each pedestrian in the scene. The individual representations, along with the learned group-based features, were combined and used by a decoder LSTM to predict future trajectories.

Bhattacharyya et al. (2018) claimed that they were the pioneers in using an on-board dataset to predict pedestrian behaviour. The authors used a two-stream LSTM architecture to encode bounding box coordinates, ego-vehicle odometry information, and feature vectors extracted from the original image by a CNN. Another work that used an on-board dataset is that of Czech et al. (2022), in which the authors used a multi-stream RNN to individually encode bounding box coordinates, head orientation, body orientation, pose skeleton, and past trajectories. The encoded information from each stream is fused through an attention mechanism and subsequently input to an RNN decoder to predict future bounding boxes. The drawback of the latter two algorithms is that they did not consider social interaction among the agents.

Hasan et al. (2018) argue that head orientation and movement are correlated. Consequently, they proposed a two-stream LSTM to encode both trajectory and head orientation information. The two encoded representations were then merged using a View Frustum social pooling layer. The disadvantage of this method is that it is only suitable for top-view and BEV datasets.

Usually, when a system adopts LSTM networks and requires the use of multiple types of inputs, these inputs are first combined before being fed to LSTM cells. This practice is required because LSTM cells are designed to accept only a single input sequence, which can constrain their ability to capture relevant information from various input sources. Quan et al. (2021) adapted the conventional LSTM cell to accept four additional input sequences: vehicle speed, pedestrian intention, correlation among frames, and bounding box location. The vehicle speed was estimated by using optical flow and depth information; the pedestrian intention was estimated using convLSTM; and the correlation among frames was derived from optical flow images.

Fig. 5. Pedestrian Trajectory Prediction Performance using the ETH and UCY datasets, with an OTH of 3.2 s, a PTH of 4.8 s, and Average Displacement Error (ADE) in metres (See Table 6).

Table 6 and Fig. 5 report the results for the most relevant studies in pedestrian trajectory prediction. It is not possible to directly compare all of them, since some of them have used different datasets, metrics, OTH, and PTH. However, when examining the results of the algorithms that used the same dataset, metrics, OTH, and PTH, the Bi-TraP (Yao et al., 2021a) algorithm outperformed the others, achieving ADE and FDE values of 0.18 m and 0.35 m, respectively.

4.2. Intention recognition and prediction

The difference between pedestrian intention recognition and prediction aligns with what was explained in Section 3. Recognition does not require anticipation, while prediction does. The main methods used to predict pedestrian intentions include CNN, GCNN, GRU, LSTM, attention mechanism, multi-tasking, and transformer networks.

CNN: Fang et al. (2017) and Fang and López (2018) used CNNs to extract human skeleton features and an SVM/RF classifier to predict if the pedestrian is crossing the road. Abdulrahim and Salam (2016) also used CNNs, along with depth information, to learn 3D human body landmarks, including additional information such as the pedestrian's shoulders, neck, and face. While CNNs can extract spatial features, their capability to capture temporal dependencies is limited. To overcome this limitation, Yang et al. (2021) implemented a 3D-CNN to extract spatio-temporal information. Additionally, Piccoli et al. (2020) proposed an alternative model called FuSSI-Net, designed to extract both spatial and temporal information. FuSSI-Net is a spatio-temporal DenseNet that takes a sequence of bounding boxes and skeleton features as inputs to predict crossing intention. Although these last two models can extract spatial and temporal information, they are limited to short-time-horizon prediction and become computationally expensive as the input sequence length increases.

LSTM: Rasouli et al. (2019) used an LSTM to encode local context, trajectories, and ego vehicle information. Subsequently, the encoded information was decoded to estimate the probability of a pedestrian crossing the road. Bouhsain et al. (2020) used bounding box coordinates and velocity features as inputs for a sequence-to-sequence LSTM, which was used to predict both the pedestrian intentions and the future position of the pedestrians' bounding boxes. In a different approach, Lian et al. (2022) introduced a stacked-LSTM model, where appearance, context, and dynamic features of the pedestrian were used to predict crossing intentions.
Table 6
Results for the most relevant pedestrian trajectory prediction works.
Work Dataset OTH PTH ADE FDE AND MAD FAD MSE
Social-LSTM Alahi et al. (2016) ETH & UCY 3.2 s 4.8 s 0.27 m 0.61 m 0.15 m – – –
Scene-LSTM Manh and Alaghband (2018) ETH & UCY 3.2 s 4.8 s 0.7 m 0.7 m 0.9 m – – –
Social-GAN Gupta et al. (2018) ETH & UCY 3.2 s 4.8 s 0.48 m 0.98 m – – – –
Social-attention Vemula et al. (2018) ETH & UCY 3.2 s 4.8 s 0.30 m 2.59 m – – – –
SoPhie Sadeghian et al. (2019) ETH & UCY 3.2 s 4.8 s 0.54 m 1.15 m – – – –
  SDD 3.2 s 4.8 s 16.24 pi 29.38 pi – – – –
StarNet-DNN Zhu et al. (2019) ETH & UCY 3.2 s 4.8 s 0.30 m 0.57 m – – – –
PECNet Mangalam et al. (2020) ETH & UCY 3.2 s 4.8 s 0.29 m 0.48 m – – – –
  SDD 3.2 s 4.8 s 9.96 pi 15.88 pi – – – –
ST-GCNN Mohamed et al. (2020) ETH & UCY 3.2 s 4.8 s 0.44 m 0.75 m – – – –
RSBG Sun et al. (2020) ETH & UCY 3.2 s 4.8 s 0.48 m 0.99 m – – – –
LVTA Xue et al. (2020) ETH & UCY 3.2 s 4.8 s 0.46 m 0.92 m – – – –
Bi-TraP Yao et al. (2021a) ETH & UCY 3.2 s 4.8 s 0.18 m 0.35 m – – – –
  JAAD 0.5 s 1.5 s 1206 – – – – –
  PIE 0.5 s 1.5 s 511 – – – – –
SGNet Wang et al. (2022) ETH & UCY 3.2 s 4.8 s 0.35 m 0.83 m – – – –
  JAAD 1.6 s 1.5 s – – – – – 1049
  PIE 1.6 s 1.5 s – – – – – 442
  NuScenes 2 s 6 s 1.32 2.50 – – – –
PTPGC Yang, Sun, et al. (2022) ETH & UCY 3.2 s 4.8 s 0.67 m 1.29 m – – – –
SS-LSTM Xue et al. (2018) ETH & UCY 3.2 s 4.8 s 0.070 npu 0.133 npu – – – –
SR-LSTM Zhang et al. (2018) ETH & UCY 3.2 s 4.8 s – – – 0.45 0.94 –
CIDNN Xu et al. (2018) ETH & UCY 4 s 4 s 0.11 – – – – –
MX-LSTM Hasan et al. (2018) UCY 3.2 s 4.8 s – – – 0.49 m 1.12 m –
  Towncentre 3.2 s 4.8 s – – – 1.15 m 2.30 m –
Holistic-LSTM Quan et al. (2021) JAAD 0.5 s 1 s – – – – – 389
  PIE 0.5 s 1 s – – – – – 167
  S-KITTI 0.5 s 1.5 s – – – – – 525
BA-PTP Czech et al. (2022) PIE 0.5 s 1.5 s – – – – – 420
  ECP 0.6 s 1.6 s – – – – – 768
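Most rows in Table 6 report ADE and FDE. As a reference, the snippet below is a minimal sketch of how these two displacement metrics are commonly computed; the synthetic trajectory and the 12-step/2.5 Hz setting (the usual ETH/UCY protocol) are illustrative assumptions.

```python
import numpy as np

def ade_fde(pred, gt):
    """pred, gt: (T, 2) predicted and ground-truth positions in metres.

    ADE averages the Euclidean error over all predicted time steps;
    FDE keeps only the error at the final predicted time step.
    """
    errors = np.linalg.norm(pred - gt, axis=1)  # per-step Euclidean distance
    return errors.mean(), errors[-1]

# 12 predicted steps at 2.5 Hz = a 4.8 s prediction horizon
gt = np.cumsum(np.full((12, 2), 0.4), axis=0)   # pedestrian walking diagonally
pred = gt + np.random.default_rng(0).normal(0.0, 0.2, gt.shape)
ade, fde = ade_fde(pred, gt)
print(f"ADE {ade:.2f} m, FDE {fde:.2f} m")
```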
LSTM networks have the ability to learn and memorise features over the long term, as they capture long-distance dependencies (Chung et al., 2014). Nevertheless, they have limitations in extracting spatial features, managing dependencies among the extracted features, exhibiting longer training times, and assigning uniform attention to all inputs, even though some inputs can be more relevant than others (Sharma et al., 2022). Ahmed et al. (2023) used a 2D pose estimator in conjunction with an LSTM to predict the crossing behaviour of the pedestrian.

GRU: GRUs serve as an alternative to LSTMs, as they also learn temporal information. Kotseruba et al. (2020) used pedestrian appearance features, which were extracted using a VGG network, and ego vehicle velocity information as inputs for a GRU network to predict pedestrian intentions. Rasouli et al. (2020) used pedestrian appearance, global context, body pose, bounding boxes, and ego-vehicle speed features as inputs to a stacked GRU network to predict pedestrian crossing behaviour. These features were gradually integrated into the GRU network, starting with pedestrian appearance, followed by global context, body pose, bounding boxes, and concluding with the ego vehicle speed. GRUs offer the advantage of requiring less memory and being faster than LSTMs. However, they tend to be less accurate when handling long input sequences (Chung et al., 2014).

GCN: A spatio-temporal GCN was presented by Zhang, Angeloudis, and Demiris (2022), where they used a sequence of skeleton features to predict crossing intentions. The skeleton joints were connected by nodes and edges to learn both spatial and temporal features. Cadena et al. (2022) used two GCNs, which took human body key points, local context, and ego speed information as inputs to predict crossing intentions. GCNs have the advantage of extracting interactions among the target pedestrian and its neighbours, considering both spatial and temporal dependencies (Sharma et al., 2022). In addition, GCNs can handle non-Euclidean data formats, such as scenarios where pedestrians are dispersed across a scene, which cannot be represented using a grid-like structure. However, they can only handle short-term sequences and do not perform well when applied to regression tasks.

Attention Mechanism: Lian et al. (2022) also used a self-attention mechanism to extract the most relevant information from the pedestrian appearance, the pedestrian's surroundings, and dynamic features. Rasouli et al. (2019) combined different attention mechanism layers at different locations of the network to investigate their impact on the model performance. Attention mechanism approaches enable networks like LSTM to focus more on the most relevant features, and less on redundant ones.
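To illustrate how such an attention layer typically sits on top of a recurrent encoder, the sketch below implements a generic additive temporal attention over GRU outputs. The feature and hidden sizes are assumptions for illustration and are not taken from any specific surveyed model.

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Scores each encoder time step and returns their weighted sum."""

    def __init__(self, hidden=64):
        super().__init__()
        self.score = nn.Linear(hidden, 1)

    def forward(self, h):                          # h: (batch, T, hidden)
        w = torch.softmax(self.score(h), dim=1)    # (batch, T, 1) step weights
        return (w * h).sum(dim=1)                  # (batch, hidden) context

encoder = nn.GRU(input_size=4, hidden_size=64, batch_first=True)
attend = TemporalAttention(64)
bboxes = torch.randn(8, 15, 4)    # 15 observed bounding boxes per pedestrian
h, _ = encoder(bboxes)
context = attend(h)               # focused summary fed to a crossing classifier
```

The learned weights let the classifier lean on the most informative observation frames instead of treating all time steps uniformly, which addresses the uniform-attention limitation of plain LSTMs noted above.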
Table 7
Relevant works for pedestrian intention prediction.
Work Methods Problem Dataset/Results
Schneider and Gavrila (2013) Approach: Dynamic. Trajectory and intention Daimler
Features: prediction. IMM has not shown
Recursive Bayesian filters – Compared EKF and IMM filters significant performance gains over
(constant velocity/acceleration/turn). simpler models.
PTH: < 2 s.
Evaluation: MLPE.
Keller and Gavrila (2013) Approach: Dynamic. Trajectory and intention OWN (on-board)
Features: prediction. GDPM and PHTM showed
Compared the performance between GDPMs using optical better accuracy, however, they
flow information, PHTM, KF and IMMKF. are more computationally
Provided human performance on classifying pedestrian expensive.
behaviour prediction. 10–50 cm Time Horizon 0.77
Evaluation: Mean Combined Longitudinal and Lateral RMSE. s.
Bonnin et al. (2014) Approach: Dynamic + Context. Intention prediction (crossing). OWN (on-board) Inner-city
Features: distance and time to curb, distance and time to dataset, zebra dataset and
ego lane, distance and time to zebra crossing, distance and combination of both ICZ.
time to collision point, difference of time to collision point, Inner-city model: 31% TPR,
face, global and relative orientation. 0.0 FPR, PTH 0.72 s for the
Single Neural Network as classifier to learn the different zebra dataset. TPR 29%, PTH
features. 0.67 s for the inner-city
Inner-city and zebra model. dataset. TPR 31%, PTH 0.72 s
PTH: 1 s. for the ICZ dataset.
Evaluation: TPR and FPR. Zebra crossing model: 100%
TPR, 3.23 s PTH for the zebra
dataset. 86% TPR, 28% FPR
and 1.73 s PTH for the
inner-city dataset.
CMT model: 62% TPR, 2.59 s
PTH for the ICZ dataset.
Neogi et al. (2017) Approach: Dynamic + Context. Intention prediction. NTUC (OWN, on-board and
FLDCRF. actors)
Features: pedestrian position (distance to curb, and left or Average probability > 0.7
right side of the road), pedestrian–vehicle interaction, optical predicting 1.2 s before the
flow. action.
Evaluation: average probability, time to stop and time to
cross.
Minguez et al. (2018) Approach: Dynamic. Predict pedestrian actions. CMU-UAH
Balanced-GDPMs to reduce 3-D time relevant information Achieved MED of 41.24 mm
into low dimensional information and to assume future for TTE of 1 s, for starting
latent positions. activity; and MED of
Features: Skeleton motion analysis. 238.01 mm for TTE of 1 s for
Four models to predict start, stop, walk and stand actions. stopping activity.
HMM is used to select which model to use to predict future
pedestrian path and poses.
Evaluation: MED against TTE.
Fang et al. (2017) Approach: Data Driven. Intention prediction Daimler
Features: Skeleton. (crossing/not crossing). 0.8 predictability with
CNN for pose estimation. TTE=12 (750 ms).
Deep association for tracking.
Evaluation: Intention probability vs TTE.
CV Approach: Data Driven. Intention prediction See Table 8
Fang and López (2018) Features: Skeleton. (crossing/not crossing).
CNN for pose estimation.
Deep association for tracking.
Evaluation: Accuracy.
PIE (int) Approach: Data driven. Intention prediction (crossing). See Table 8
Rasouli et al. (2019) Features: bbox coord, image context, and image bbox.
RNN (LSTM).
Evaluation: Accuracy, and F1-score.
Bouhsain et al. (2020) Approach: Data Driven. Pedestrian intention and See Table 8.
Features: bboxes coordinates and velocities. pedestrian bbox predictions
PV-LSTM (crossing).
Multi-task sequence to sequence learning
Evaluation: ADE, FDE, Accuracy.
Liu et al. (2020) Approach: Context, Temporal, and Data driven. Intention prediction (crossing). Stanford-TIR
Features: A: 79.10%.
Graph Convolution and GRU to learn spatio-temporal JAAD
relationship. A: 79.28%.
Evaluation: Accuracy.
Abughalieh and Alawneh (2020) Approach: Data driven. Intention prediction (walking OWN (on-board)
Features: pedestrian body landmarks considering depth and crossing). A: 89%.
information.
CNN.
Evaluation: Accuracy.
FUSSI-net Approach: Data driven, target-agent context. Intention prediction (crossing). See Table 8
Piccoli et al. (2020) Features: Skeleton and bbox.
DenseNet.
Evaluation: Accuracy.
SFR-GRU Approach: Data driven. Intention prediction (crossing). See Table 8
Rasouli et al. (2020) Features: pose, 2D bbox, appearance, global context, and
ego speed.
Stacked-RNN (GRU).
Evaluation: Accuracy, Precision, recall, F1-score, and AUC.
C+B+S+Int Approach: Data driven. Intention prediction (crossing). See Table 8
Kotseruba et al. (2020) Features: surrounding, appearance, context, bbox, and ego Studied human performance.
vehicle speed.
single GRU.
PTH: 2 s.
Evaluation: Accuracy, AUC, F1, Precision, and recall.
Razali et al. (2021) Approach: Data driven and key body landmarks. Recognition and Intention JAAD
Features: PAF and PIF. prediction (crossing) in Recognition: −0 s: 81.7%; −1
Uses only one RGB image. real-time. s: 83.6%; −2 s: 83.5%; −3 s:
Multitask learning. 83%; −4 s: 82.7%.
CNN (ResNet). Prediction: −1 s: 42.6%; −2 s:
Evaluation: Precision for different prediction horizon. 46.1%; −3 s: 46.3%; −4 s:
46.0%.
FPS: 5.
Zhang, Abdel-Aty, et al. (2021) Approach: Data Driven. Intention prediction (crossing CCTV
Features: pose key-points. at red light). A: 92%: 1 s; 92%: 2 s; 88.9%:
Compared SVM, RF, GBM, and XGBoost models. 3 s; 92.5%: 4 s.
Evaluation: Accuracy.
PCIR Approach: Data driven, context, and behavioural. Intention detection (crossing). See Table 8
Yang et al. (2021) Features: pedestrians, ego vehicle, and environment.
3D-CNN.
Evaluation: AP.
Chen et al. (2021) Approach: Data driven. Intention prediction (crossing). See Table 8
Features: bbox, body pose, road objects.
Graph encoder, CNN, and LSTM.
PTH: 1.5 s.
Evaluation: Balanced Accuracy and F1 score.
I+A+F+R Yao et al. (2021b) Approach: Data driven, and multi-task. Intention and action See Table 8
ARN Attentive Relation Network. prediction (crossing). Inference: < 6 ms.
CNN, MLP, and GRU.
PTH: 1–2 s.
Features: bbox context and coordinates, relation, and visual.
Evaluation: Accuracy, F1-score, ROC-AUC, precision.
PCPA Approach: Data driven. Intention prediction (crossing). See Table 8
Kotseruba et al. (2021) Features: bbox, pose, local context, and ego vehicle speed.
3D CNN + single-RNN (GRU) + attention mechanism.
Evaluation: Accuracy, AUC, and F1.
Yang, Zhang, et al. (2022) Approach: Data driven. Intention prediction (crossing). See Table 8
Features: local and global context, bbox, pose-key-points.
Attention mechanism, 2D CNN, and RNN.
Evaluation: Accuracy, F1, and recall.
Graph+ Approach: Data driven. Intention Prediction (crossing). See Table 8
Cadena et al. (2022) Features: context, ego vehicle velocity, and key body Inference: 6 ms.
landmarks.
Graph Convolutional Network.
Evaluation: Accuracy.
ST-CrossingPose Approach: Data driven. Intention prediction (crossing). JAAD
Zhang, Angeloudis, and Demiris Features: skeleton-based. Recognition: 63%.
(2022) Spatio-Temporal GCN. See Table 8
Evaluation: Accuracy, AUC, F1-score, Precision, and Recall.
Achaji et al. (2022) Approach: Data Driven. Intention recognition and PIE A:91%.
Features: bbox. prediction (crossing). F1:0.83.
Transformer Networks. CP2A A:91%.
PTH: 1 s and 2 s. F1:0.91.
Test human ability for pedestrian action prediction.
Evaluation: Accuracy and F1-Score.
Scene-STGCN Approach: Data Driven. Intention recognition See Table 8
Naik et al. (2022) Features: (crossing).
Scene Spatio-Temporal GCN.
Evaluation: Accuracy, F1-score, AP, and ROC-AUC.
Zeng (2022) Approach: Data driven. Intention prediction (crossing). See Table 8
Features: body land-marks. Light-weight and inference
SqueezeNet and GRU. speed.
Hardware: AMD Ryzen 5 3600, G Force RTX 3070.
Evaluation: Accuracy and ROC-AUC.
CA-LSTM Approach: Data driven. context and dynamic. Intention Prediction (crossing). See Table 8
Lian et al. (2022) Features: appearance, velocity, and walking angle.
Attention LSTM.
Evaluation: Accuracy, F1-score, recall metrics.
Gazzeh and Douik (2022) Approach: Data driven. Intention recognition in See Table 8
Features: pedestrian localisation and environment contest real-time.
(lane lines).
ML and DL.
Evaluation: Accuracy.
Ma and Rong (2022) Approach: Data driven. Intention prediction (crossing). See Table 8
Features: pedestrian pose (skeleton), pedestrian to vehicle
distance, and ego vehicle information.
Multi-feature fusion.
Random forest classifier.
PTH: 0.6 s.
Evaluation: Accuracy and AUC.
Ahmed et al. (2023) Approach: Data driven. Intention prediction (crossing). JAAD and PIE
Features: Past trajectories, velocity, and 3D joint estimation. Accuracy: 89%/91%.
Model(s): Position and Velocity LSTM.
PTH: 0.4 s.
Evaluation: Accuracy.
Transformers: Even though attention mechanisms have the ability to focus on the most relevant features, it was reported by Achaji et al. (2022) that their effectiveness might be reduced when coupled with LSTM networks. For this reason, Achaji et al. (2022) proposed a framework based on three types of transformer networks: encoder-only, encoder-pooling, and encoder–decoder architectures. The proposed framework used only the pedestrian bounding box information as its input. The authors argued that their model outperformed other methods that used multiple input features. Transformer networks offer the advantage of parallel input processing, which accelerates the training stage. On the other hand, processing the input data in parallel restricts the model from taking advantage of the sequential nature of the input.

Multiple Methods: Many studies have used more than one method to predict pedestrian intention. Liu et al. (2020) used a GCN to generate a pedestrian-centring graph for each observation frame. These graphs connect the target pedestrian to its surroundings, allowing the algorithm to learn relations between the pedestrian and the scene. In addition, edges were introduced between the pedestrian nodes in each pedestrian-centring graph to allow the algorithm to learn temporal information. The resulting interconnected graphs were then fed into a GRU network to predict crossing intention. Chen et al. (2021) used a combination of methods, including a CNN to extract features from traffic objects and pedestrian appearance, a GCN to auto-encode the extracted features, another framework to extract the human skeleton, and an LSTM network to predict crossing intentions. CNN, ARN, MLP and GRU were used by Yao et al. (2021b) to predict crossing intentions. The CNN was used to extract global features, the ARN was used to extract relational features from detected traffic objects, the MLP was used for intention classification, and the GRU was used for intention prediction. One major difference of this work is that the network also takes the predicted intention output as input. Kotseruba et al. (2021) used a 3D-CNN, an RNN, and an attention mechanism. The 3D-CNN was used to encode local features from a sequence of cropped bounding boxes, and the RNN was used to encode the bounding-box coordinates, pose landmarks and the ego-vehicle speed. Finally, an attention mechanism was used to combine the most relevant features. Yang, Zhang, et al. (2022) used a 2D-CNN, a stacked-RNN, and an attention mechanism. A spatio-temporal GCN was used by Naik et al. (2022) to encode the input image, image class and location information tensors. The output of the spatio-temporal GCN was then fed into an LSTM network to generate long-term predictions. Zeng (2022) used SqueezeNet to extract visual features and a GRU to extract temporal dependencies. They also used a multi-tasking approach to predict both pedestrians' intentions and poses. One primary advantage of using multiple models is that each model can compensate for the limitations of the others. For example, CNN, GCN, and attention mechanisms can address the limitations of an LSTM network in extracting spatial information, handling non-Euclidean data, and prioritising relevant features, respectively.

Full-Pipeline: Gazzeh and Douik (2022) presented a full-pipeline model which includes detection, tracking, and crossing intention prediction. They used YOLOv4 for object detection, DeepSort for tracking, Canny edge detection for lane line detection, and a linear SVM for intention prediction. Another full-pipeline system was implemented by Piccoli et al. (2020), where they used YOLOv3 for detection, DeepSort for tracking, and a spatio-temporal DenseNet for intention prediction. YOLOv5, DeepSort, and an LSTM network with an attention mechanism were used by Lian et al. (2022) to detect, track, and predict pedestrian intention, respectively. A multi-task network was implemented by Razali et al. (2021) to recognise pose state and predict pedestrian intentions. ResNet was used to extract features, Part-Intensity-Fields (PIF) and Part-Association-Fields (PAF) were used to produce channels and pose joints, and a head network was used to predict pedestrian intentions.

Table 8 presents the results achieved by the most relevant pedestrian intention prediction works in the literature. Unfortunately, direct comparisons between these studies are not possible due to variations in problem formulations, OTH, TTE, datasets, and metrics. For example, the work that achieved the best accuracy was Zhang, Angeloudis, and Demiris (2022); however, the authors used their own dataset. The second best was Bouhsain et al. (2020), but they used an observation horizon and TTE of 0.6 s.
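As an illustration of the encoder-only variant, the sketch below encodes a bounding-box sequence with a transformer encoder and classifies crossing intention. It mirrors the idea of Achaji et al. (2022) in spirit only; the layer sizes, mean pooling, and sigmoid head are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class BBoxCrossingTransformer(nn.Module):
    """Encoder-only transformer over a pedestrian bounding-box sequence."""

    def __init__(self, d_model=64, heads=4, layers=3):
        super().__init__()
        self.embed = nn.Linear(4, d_model)        # (x, y, w, h) per frame
        block = nn.TransformerEncoderLayer(d_model, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(block, layers)
        self.head = nn.Linear(d_model, 1)         # crossing / not crossing

    def forward(self, boxes):                     # boxes: (batch, T, 4)
        z = self.encoder(self.embed(boxes))       # every frame attends to all frames
        return torch.sigmoid(self.head(z.mean(dim=1)))

model = BBoxCrossingTransformer()
p_cross = model(torch.randn(2, 16, 4))           # 16 observed frames per track
```

Because all observed frames are processed in parallel rather than step by step, training is faster than with recurrent encoders, at the cost of the sequential inductive bias discussed above.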
5. Heterogeneous road agents

All the previously mentioned works primarily focused on predicting the behaviour of either pedestrians or vehicles. However, in a real-world traffic scenario, complex interactions occur among various types of agents, each with different dimensions and dynamics. Consequently, it is crucial to consider the interaction between heterogeneous agents. Several works have addressed the detection and behaviour prediction of heterogeneous agents.

For example, authors (Ma et al., 2019) introduced the TrafficPredict algorithm, which was developed to learn motion patterns and predict the trajectories of different types of traffic agents, including pedestrians, bicycles and cars. They adopted the 4D Graph network in conjunction with an RCNN LSTM to learn the movements and interactions of traffic agents. The authors used an OTH of 2 s to predict a horizon of 3 s. They achieved a state-of-the-art average displacement error of 0.085 and a final displacement error of 0.141. DeepTAgent is another heterogeneous system, presented by Chandra, Randhavane, et al. (2019), in which they used Mask R-CNN to detect objects, a CNN to extract tracking features, and a Heterogeneous Interaction Model (HTMI) that considered collision avoidance behaviour to predict the agents' position and velocity, and subsequently their trajectory and interactions. The authors (Chandra, Bhattacharya, et al., 2019) presented a hybrid network for predicting the trajectory of road agents and modelling their interactions. They used a CNN to capture local information, such as the agent's shape and position, and an LSTM network for trajectory prediction. In dense, diverse traffic situations, the algorithm demonstrated a notable performance improvement of 30% over state-of-the-art methods. However, it did not outperform the state-of-the-art algorithms in sparse and homogeneous traffic scenes. Li, Yang, et al. (2020) presented a framework called EvolveGraph. In this framework, they encoded an observation graph to infer an interaction graph, and subsequently decoded both the observation and interaction graphs to predict future trajectories. Zhang, Zhao, et al. (2022) implemented the Attention-based Interaction-aware Trajectory Prediction (AI-TP) model. This model used a Graph Attention Network (GAT) to represent interaction among heterogeneous traffic agents and used a Convolutional GRU (ConvGRU) to make predictions. A multi-agent trajectory prediction system was presented by Mo et al. (2022), where a three-channel framework was used to account for dynamics, interactions and road structure. Moreover, a novel Heterogeneous Edge-enhanced graph ATtention network (HEAT) was proposed to extract interaction features. Dynamic features were extracted from the agents' previous trajectories, and interaction patterns were represented through a directed edge-feature heterogeneous graph and extracted with the HEAT network. The road structure information was shared among all agents using a gate mechanism. Finally, all the information acquired from the previous process was combined to predict trajectories.

All the previously cited works have predicted the trajectories and interactions among the agents. However, they have not taken into consideration their intentions, such as crossing/not-crossing or braking/not-braking. Also, they have not incorporated the information provided by static road objects like traffic lights and road signs. Static road traffic objects play a crucial role in directing, informing, and controlling road users' behaviour. Furthermore, there is limited research on how to use detection and prediction information to identify potential and developing hazards.

The authors (Chen et al., 2018) proposed a multi-task learning model that combines both object detection and distance prediction to identify dangerous traffic road objects. They used an SSD CNN to detect cars, vans, and pedestrians. The input image was divided into a grid map with four vertical and three horizontal distances. Depending on the category of the target vehicle and its location, the network assigned a danger level using blue, green, yellow, and red bounding boxes, where blue and red represented the least and the most dangerous levels, respectively. However, predicting the target vehicle's velocity using a grid map limits the velocity resolution and might not give realistic measurements. Also, relying solely on the distance between the ego and the target vehicle is not enough. For example, an ego vehicle might maintain a safe distance from the target vehicle, but the target vehicle can suddenly brake and change its velocity. Therefore, it would be beneficial for the ego vehicle to predict and recognise instances when the target vehicle is braking or experiencing a sudden change in velocity.

Authors (Li, Wang, et al., 2020) considered themselves pioneers in combining object detection and intention recognition to assess the risks in complex traffic scenarios. Their objective was to detect both non-static objects, such as vehicles and pedestrians, and static objects, such as traffic lights, and then use the gained information to evaluate potential hazards ahead. In order to detect the objects, they used YOLOv4 and the BDD100K dataset and achieved an mAP of 52.7%. For recognising the pedestrian intention (crossing or not-crossing), they used a VGG-19 CNN and Part Affinity Fields, achieving an accuracy of 97.5%. To predict vehicle intentions, including braking and turning, they employed the EfficientNet CNN, achieving a recognition accuracy of 94%. Lastly, for recognising the traffic light state (red, green, or amber), they used the MobileNet CNN, achieving an accuracy of 97.75%. Nevertheless, using only the brake and turn signal lights to predict vehicle behaviour and assess danger is not sufficient, since braking behaviour can exhibit varying intensities. For example, normal braking, characterised by a gradual decrease in the vehicle's velocity, is typically regarded as a potential hazard. In contrast, harsh braking, involving a sudden and significant change in the vehicle's velocity, is seen as a developing hazard. Furthermore, there are situations where target vehicles abruptly change their direction without using their turn signal, which also poses a developing hazard. Therefore, the ego vehicle must be capable of detecting sudden changes in the vehicle's direction and velocity. Similarly, depending only on pedestrian crossing/not-crossing intentions limits the system's ability to make long-horizon predictions, as pedestrians can cross at different velocities and may suddenly change their goal destination.

6. Discussion

This paper has surveyed several works that investigate the behaviour prediction of pedestrians and vehicles. Based on the findings, this section presents a general framework diagram, outlines risk assessment, discusses challenges, examines techniques, outlines requirements, and suggests potential future directions for pedestrian and vehicle behaviour prediction systems.

6.1. General framework for a behaviour prediction system

A proposed general framework for a behaviour prediction system is depicted in Fig. 6. The camera sensor outputs RGB images which are used by the detection and image processing algorithms.

The detection algorithm is responsible for detecting both static and non-static road objects, including road lanes, vehicles, vulnerable road users, traffic lights, and road signs. The position information of the detected objects, represented by bounding boxes, is then used by a tracking algorithm to assign a unique ID to each object. This ID assignment enables the system to track the past trajectory of each detected object, which serves as input for subsequent processing.

The image processing algorithm uses the RGB images from the camera sensor as well as the past trajectories of the detected objects to generate optical flow, depth, appearance, and global and local context images. An example of how image processing uses past trajectories is the use of the bounding box information to crop the RGB image at the specific location of the detected object. This cropping operation provides local context information for further analysis and decision-making within the system.
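The cropping operation described above can be sketched in a few lines; the bounding-box format and the enlargement factor are illustrative assumptions.

```python
import numpy as np

def local_context(frame, bbox, scale=1.5):
    """Crop an enlarged region around a detection to obtain local context.

    frame: (H, W, 3) RGB image; bbox: (x1, y1, x2, y2) in pixels.
    The box is grown by `scale` so the crop includes nearby scene content.
    """
    h, w = frame.shape[:2]
    x1, y1, x2, y2 = bbox
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    bw, bh = (x2 - x1) * scale, (y2 - y1) * scale
    xa, xb = max(int(cx - bw / 2), 0), min(int(cx + bw / 2), w)
    ya, yb = max(int(cy - bh / 2), 0), min(int(cy + bh / 2), h)
    return frame[ya:yb, xa:xb]

frame = np.zeros((720, 1280, 3), dtype=np.uint8)   # placeholder camera image
crop = local_context(frame, (600, 300, 650, 420))  # context around a pedestrian
```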
Fig. 6. General behaviour prediction framework. The behaviour prediction module consists of an automated feature extractor (CNN, 3D-CNN, GCN, FCN, CVAE, GAN, etc.), an
embedding layer (FCN and ANN), and a time series algorithm (RNN, GRU, and LSTM). It is dependent on the perception module (Detection, tracking, image processing, interaction
representation, and feature engineering) which is dependent on the ego vehicle sensors (camera, GPS, and wheel encoder). Additionally, the outputs of the behaviour prediction
modules are sent to the planning module.
The interaction representation algorithm uses the past trajectories of the objects to calculate distances between the traffic agents, construct graph networks with vertices and edges, and generate grid maps that account for interactions between traffic agents.

The feature engineering algorithm uses the past trajectories of objects and internal sensor data from the AV (e.g., steering wheel angle, yaw rate, wheel encoder, etc.) to derive additional features, for example, using the differences between the objects' positions in consecutive frames to calculate their velocities.

The outputs of the perception module are then fed into the automated feature extractor and the embedding algorithms within the behaviour prediction module. Automated feature extractors are deep learning algorithms designed to generate feature vectors representing spatial properties of the inputs. Embedding uses a linear transformation to transform the inputs into a desired output feature size. The time series algorithm uses the combined feature vectors generated by the automated feature extractor and the embedding layer to learn temporal information, enabling it to predict various aspects of object behaviour, including future trajectories, future intentions, goals, and current intentions. Note that the predicted goals and recognised intentions can be used by the embedding layer and the time series algorithm as extra information for predicting future trajectories.

Finally, the outputs of the behaviour prediction module are used by the AV's planning module, which in turn uses this information to plan the actions of the AV to achieve its final goal.

6.2. Risk assessment for behaviour prediction system

Authors (Bhavsar et al., 2017) proposed a risk assessment for an AV. They mentioned that AV failures can arise from various aspects, including vehicular components such as hardware, software, mechanical systems, communication infrastructure, and interactions between the passenger and the AV Human Machine Interface system. Based on their findings, this paper presents a risk assessment specifically for an AV behaviour prediction system. This assessment identifies, analyses, and provides recommendations for mitigating and controlling the identified risks.

6.2.1. Risk identification

Based on the general framework for a behaviour prediction system depicted in Fig. 6, the following risks have been identified:

• Camera sensor failure: this includes hardware malfunctions, blocked field of view, and noise (electricity, heat, and illumination).
• Computing components failure: computer or GPU failure.
• Sensor failure: failure in the steering wheel, wheel encoder, GPS, and IMU sensors.
• Detection algorithm failure: missed detections, poor intersection over union, false-positive and false-negative classification.
• Tracking algorithm failure: missed tracking and incorrect association of objects between frames. For instance, an object might not be tracked in the next frame or objects might swap their IDs due to overlap.
• Image processing failure: incorrect optical flow and depth estimation.
• Interaction representation failure: noisy and incorrect distance calculation, as well as incorrect graph or grid representation of the object interactions.
• Feature engineering failure: redundant features, and noisy speed and acceleration estimates due to poor detection and tracking performance.
• Cybersecurity failure: remote hacking, vehicle spoofing, insider threat, and tampering with sensor data.

6.2.2. Risk analysis

The authors (Bhavsar et al., 2017) discussed several methods for analysing risks in automotive contexts, including situation-based analysis, ontology-based analysis, failure modes and effects analysis (FMEA), and fault tree analysis (FTA). From their investigation, they concluded that FTA is the most suitable method for conducting a risk assessment on AV features. For this reason, this paper also adopts FTA to perform a risk analysis on the behaviour prediction system. FTA methods have the following advantages: being event-orientated, enabling the diagnosis of the root cause of failures, facilitating an understanding of how subsystems can impact each other, having a straightforward and graphical nature for ease of comprehension, and aiding in decision-making regarding the control of identified risks. The proposed FTA is depicted in Fig. 7. A qualitative analysis of the proposed FTA reveals that the system is highly vulnerable, because the occurrence of any of the basic events (EVX) can lead to the failure of the behaviour prediction system. For instance, if the detection algorithm fails, it can cascade failures throughout the tracking algorithm, image processing, interaction representation, and feature engineering, ultimately resulting in the failure of the behaviour prediction system.

In order to quantitatively analyse the behaviour prediction system, it is required to know the probability of failure for each event (EVX), which depends on the hardware, software, and cybersecurity in use.
Fig. 7. Fault tree analysis for a Behaviour Prediction System. The circle shapes beneath the square shapes are the basic events that may lead to failures of the top events. The square shape after the TOP GATE is the top event, which represents the failure of the behaviour prediction system. The "OR" gates mean that if any one of their input events occurs, the output event is true.
A general mathematical model to calculate the overall system failure from an FTA diagram such as the one depicted in Fig. 7 is given by the following equation (Ruijters & Stoelinga, 2015; Xing & Amari, 2008):

$$Q_0(t) \le 1 - \prod_{j=1}^{k} \left[ 1 - \check{Q}_j(t) \right] \tag{3}$$

where $Q_0(t)$ is the top event (failure of the behaviour prediction system) and $\check{Q}_j(t)$ is the failure probability of a minimal cut-set. For instance, the probability that the TOP GATE in the proposed FTA diagram happens is given by

$$Q_0(t) \le 1 - [1 - P(GT1)][1 - P(GT2)][1 - P(GT3)] \tag{4}$$

where

$$P(GT1) = 1 - [1 - P(EV1)][1 - P(EV2)][1 - P(EV3)] \tag{5}$$

• … sensor. The disadvantage of this approach is that it is expensive and requires more space in the vehicle.
• For the general behaviour prediction system in question, it is observed that it relies on three types of information (RGB image, engineered features, and interactions) for predictions. Therefore, it is recommended to enable the system to function in a degraded mode by using one or two pieces of information if one of them fails.
• The detection and tracking algorithms are important for the system, as their outputs are used by the other algorithms. Thus, it is recommended to make use of sensor fusion, so that if one of the hardware components or the algorithms responsible for detecting and tracking the objects fails, the system can work in a degraded mode.
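A worked numeric example can clarify how the OR-gate bound of Eqs. (3)–(5) propagates up the tree. In the sketch below, both the basic-event probabilities and the grouping of events under each gate are purely illustrative assumptions; real values would come from the hardware, software, and cybersecurity analysis mentioned above.

```python
def gate_or(probs):
    """P(OR gate) = 1 - product(1 - P(input)), as in Eqs. (3)-(5)."""
    q = 1.0
    for p in probs:
        q *= 1.0 - p
    return 1.0 - q

# Illustrative basic-event failure probabilities only (not measured values).
gt1 = gate_or([1e-4, 2e-4, 5e-5])   # e.g. camera and computing events
gt2 = gate_or([3e-4, 1e-4])         # e.g. detection and tracking events
gt3 = gate_or([2e-4, 2e-4, 1e-4])   # e.g. processing and cybersecurity events
q0 = gate_or([gt1, gt2, gt3])       # top event: behaviour prediction failure
print(f"Upper bound on system failure probability: {q0:.6f}")
```

Even with individually small event probabilities, the OR structure means the top-event bound grows with every added basic event, which is the qualitative vulnerability noted for the proposed FTA.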
Table 8
Results for the most relevant pedestrian intention prediction works.
Work Dataset Obs. Hor. TTE Acc(%) AUC(%) F1(%) Rec.(%) Prec(%) ROC-AUC(%)
JAAD – Recog. 92.88 – – – – –
Gazzeh and Douik (2022)
Fang and López (2018) JAAD 0.5 s Next-Frame 88 – – – – –
STRR-Graph JAAD 0.5 s Next-Frame 76.98 – – – – –
Liu et al. (2020)
FUSSI-net JAAD 0.5 s Next-Frame 76.6 – – – – –
Piccoli et al. (2020)
PIEint PIE 0.5 s Next-Frame 79 – 87 – 90 73
Rasouli et al. (2019)
CA-LSTM JAAD 0.5 s Next-Frame 89.68 – 75.38 85.96 – –
Lian et al. (2022)
PV-LSTM JAAD 0.6 s 0.6 s 91.48 – – – – –
Bouhsain et al. (2020)
Ma and Rong (2022) BPI – 0.6 s 89.5 99.2 – – – –
SFR-GRU PIE 0.5 s 2 s 84.4 82.9 72.1 80 65.7 –
Rasouli et al. (2020)
C+B+S+Int PIE 0.5 s 2 s 83 85 81 85 79 –
Kotseruba et al. (2020)
PCIR JAAD – – 89.6 – – – – –
Yang et al. (2021)
Chen et al. (2021) PIE 0.5 s 1.5 s 79 – 78 – – –
I+A+F+R JAAD 0.5 s 1-2 s 87 92 70 – 66 –
Yao et al. (2021b) PIE – 84 88 90 – 96 –
Yang, Zhang, et al. (2022) JAAD 0.5 s 1-2 s 83 82 63 81 51 –
PIE 89 86 80 81 79 –
GRAPH+ JAAD 0.5 s 1–2 s 86 88 65 75 58 –
Cadena et al. (2022) PIE 89 90 81 79 83 –
Achaji et al. (2022) PIE 0.5 s 1–2 s 91 91 83 – – –
Scene-STGCN PIE 0.5 s 1–2 s 83 – 89 – 96 85
Naik et al. (2022)
PCPA JAAD 0.5 s 0.5-1 s 85 86 68 – – –
Kotseruba et al. (2021) PIE – 87 86 77 – – –
ST-CrossingPose OWN 0.5 s 1 s 92 84.9 83.7 81.8 85.9 –
Zhang, Angeloudis, and Demiris 2 s 92 84.1 79.7 79.7 81.3 –
(2022)
Zeng (2022) JAAD -s -s 84 – – – – 85
• Good Evaluation Metric Performance: an AV behaviour prediction system is a safety-critical system, therefore it must perform well in terms of evaluation metric performance to prevent traffic collisions. For example, if the system fails to predict that a pedestrian will cross the road, it could lead to a serious collision.
• Long Prediction Horizon (PTH): a system with a long PTH can plan and react well in advance, reducing the chances of collisions and improving overall safety.
• Fast Inference Time: given that an AV behaviour prediction system must operate in real time, it must have a low inference time and require low hardware resources.
• Low Cost: to make AVs accessible to a wide range of people, the behaviour prediction system should be cost-effective, ensuring that AVs are affordable for all social classes.
• Low Hardware Resource Requirement: efficient utilisation of hardware resources is important, as it allows the system to run on hardware with limited capacity.
• Robustness: the system should be robust and able to handle various scenarios and conditions on the road, ensuring reliable performance in different situations.
• Prediction of Various Non-Static Objects: the system should be capable of predicting the behaviour of different types of non-static objects on the road, including pedestrians, vehicles, animals, and cyclists, to ensure comprehensive safety.

Evaluation metrics, long prediction horizons, and robustness are interrelated. For instance, as the prediction horizon increases, the evaluation metric performance tends to decrease. In addition, as a system becomes more robust, its evaluation metric performance is expected to increase. The major challenges that limit behaviour prediction algorithms from meeting the previously mentioned requirements stem from the fact that an agent's behaviour depends on the other agents in the scene, the local and global context, and their final goal. Various approaches have been proposed to address these challenges:

• Social pooling layers (Alahi et al., 2016; Deo & Trivedi, 2018a), graph representation, GCN, self-attention-based social pooling (Mangalam et al., 2020), message-passing mechanisms (Zhang et al., 2019), occupancy maps (Kasper et al., 2012; Park et al., 2018; Xue et al., 2018), view frustum social pooling (Hasan et al., 2018), and star-like networks to model interactions between agents (Zhu et al., 2019).
• CNNs to extract agents' appearance, body pose, local context, and global context, and to classify intentions (Biparva et al., 2021; Chen et al., 2021; Fang et al., 2017; Fernández-Llorca et al., 2020; Izquierdo et al., 2021; Yang, Zhang, et al., 2022; Yao et al., 2021b; Zhao et al., 2019).
• Attention mechanisms and transformer networks to focus on the most relevant information (Achaji et al., 2022; Lian et al., 2022; Rasouli et al., 2019).
Table 9
Behaviour prediction research challenges.
Type of challenge Class Challenges
Target Agents* Pedestrian Highly dynamic, can move in many directions and change them very quickly, be easily occluded, be distracted by their own objects or external environments; their motion can be affected by other traffic agents, they might be under the influence of drugs or alcoholic drinks, and they are hard to see in poor visibility conditions.
 Vehicle Dependent on other vehicles' actions, traffic rules, road geometry, and different driving environments; vehicles have multi-modal behaviour, different types of vehicles have different motion properties, drivers might be under the influence of drugs or alcoholic drinks, and target vehicles might be occluded.
System Design* – To achieve a good evaluation metric performance, long PTH, real-time inference, low hardware resources, and robustness.
 Evaluation Works have used different types of datasets, evaluation metrics, observation and prediction horizons, and hardware setups. Therefore, works cannot be directly compared and the actual progress of pedestrian and vehicle behaviour prediction research cannot be measured.
Resources* Hardware Smaller-size GPUs that can process deep learning algorithms in real-time, sensors that enable the AV to perceive a 360-degree road view, and affordable hardware to enable all social classes to afford AVs.
 Data Several existing datasets are not publicly available and they are not standardised to enable cross-dataset evaluation and progressive training pipeline techniques.
Uncertainties* Hardware Failure Camera, GPS, IMU, steering wheel, and wheel encoder sensor failure.
 Cyber Attack Remote hacking, vehicle spoofing, insider threat, and tampering with sensor data.
 Software Failure Perception module (detection, tracking, image processing, interaction representation, and feature engineering) failure.
… et al., 2018), and star-like networks to model interactions between agents (Zhu et al., 2019).
• CNNs to extract agents' appearance, body pose, local context, and global context, and to classify intentions (Biparva et al., 2021; Chen et al., 2021; Fang et al., 2017; Fernández-Llorca et al., 2020; Izquierdo et al., 2021; Yang, Zhang, et al., 2022; Yao et al., 2021b; Zhao et al., 2019).
• Attention mechanisms and transformer networks to focus on the most relevant information (Achaji et al., 2022; Lian et al., 2022; Rasouli et al., 2019).
• 3D-CNNs and temporal-DenseNet to learn short-term temporal information (Biparva et al., 2021; Kotseruba et al., 2021; Piccoli et al., 2020; Yang et al., 2021).
• LSTMs and GRUs to learn long-term temporal information (Bouhsain et al., 2020; Chung et al., 2014; Kotseruba et al., 2020; Rasouli et al., 2019, 2020); a minimal encoder–decoder sketch is given after this list.
• A modified version of the LSTM cell that accepts more than one input sequence set (Quan et al., 2021).
• CVAEs to estimate the final goals of the agents and thereby extend the prediction time horizon (Lee, Choi, et al., 2017; Mangalam et al., 2020; Wang et al., 2022; Yao et al., 2021a).
• Heterogeneous agent behaviour prediction works, which enable the system to predict the behaviour of different types of non-static objects (Chandra, Bhattacharya, et al., 2019; Chandra, Randhavane, et al., 2019; Chen et al., 2018; Li, Wang, et al., 2020; Li, Yang, et al., 2020; Ma et al., 2019; Mo et al., 2022). However, these works have primarily focused on pedestrians, cyclists, and vehicles, leaving out other objects such as animals, disabled individuals, scooters, toys (balls), and skate riders.
• Combinations of two or more methods to compensate for their individual limitations (Chen et al., 2021; Kotseruba et al., 2021; Liu et al., 2020; Naik et al., 2022; Yang, Zhang, et al., 2022; Yao et al., 2021b; Zeng, 2022).
• Systems that can predict the behaviour of heterogeneous agents (Chandra, Bhattacharya, et al., 2019; Chandra, Randhavane, et al., 2019; Li, Wang, et al., 2020; Li, Yang, et al., 2020; Ma et al., 2019).
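To make the recurrent bullets above concrete, the following is a minimal, illustrative encoder–decoder sketch in the spirit of the LSTM-based trajectory predictors cited in this list; the architecture, tensor shapes, and hyperparameters are assumptions made for this sketch, not details of any reviewed model.

```python
# Minimal LSTM encoder-decoder for trajectory prediction (illustrative sketch).
# Shapes and hyperparameters are assumptions, not taken from any cited work.
import torch
import torch.nn as nn

class Seq2SeqTrajectory(nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        self.encoder = nn.LSTM(input_size=2, hidden_size=hidden, batch_first=True)
        self.decoder = nn.LSTMCell(input_size=2, hidden_size=hidden)
        self.head = nn.Linear(hidden, 2)      # predicts an (x, y) offset per step

    def forward(self, obs, pred_len=12):
        # obs: (batch, obs_len, 2) positions observed over the OTH
        _, (h, c) = self.encoder(obs)         # summarise the observed track
        h, c = h.squeeze(0), c.squeeze(0)
        last = obs[:, -1, :]                  # decode from the last observed point
        outputs = []
        for _ in range(pred_len):             # roll the decoder out over the PTH
            h, c = self.decoder(last, (h, c))
            last = last + self.head(h)        # residual step keeps the track continuous
            outputs.append(last)
        return torch.stack(outputs, dim=1)    # (batch, pred_len, 2)

model = Seq2SeqTrajectory()
pred = model(torch.randn(4, 8, 2))            # e.g., 8 observed frames -> 12 predicted
print(pred.shape)                             # torch.Size([4, 12, 2])
```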
Inference time, low cost, and low hardware resource requirements are also interrelated. For example, if a system consumes less memory and computational power, it needs cheaper hardware, making the overall system more cost-effective. Typically, when a system requires less memory, such as for processing image inputs, the system's overall inference time is expected to be shorter. However, there may be a trade-off between accuracy and inference time: using multiple-feature information can increase the system's accuracy but may lead to longer inference times compared to a system using a single type of feature. The following methods have been proposed to achieve low inference time, low cost, and low hardware resource requirements:

• GCNs, which represent interactions between agents effectively without relying on additional information such as original images, cropped images, or contextual information (Li et al., 2019a, 2019b); a sketch of this idea follows the list.
• Dual-LSTM, which allows the system to learn more information from past trajectories without requiring extra input features (Xin et al., 2018).
• Fusion of multiple input features (context, interaction, trajectories, and appearance) into an enriched image representation, rather than processing a sequence of images (Izquierdo et al., 2021).
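As a rough illustration of the first bullet, the sketch below builds a distance-thresholded interaction graph over agents and applies one graph-convolution step; the 10 m threshold, feature sizes, and random inputs are assumptions for illustration only, not parameters of GRIP or GRIP++.

```python
# Sketch of a distance-thresholded interaction graph and one graph-convolution
# step, in the spirit of GRIP-style models; threshold and sizes are assumptions.
import numpy as np

rng = np.random.default_rng(0)
positions = rng.uniform(0, 30, size=(5, 2))   # (x, y) of 5 agents in metres
features = rng.normal(size=(5, 4))            # per-agent motion features

# Adjacency: agents closer than 10 m interact; the zero self-distance keeps
# a self-loop so each agent retains its own history.
dist = np.linalg.norm(positions[:, None] - positions[None, :], axis=-1)
A = (dist < 10.0).astype(float)

# Symmetric normalisation A_hat = D^{-1/2} A D^{-1/2}, as in standard GCNs.
d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
A_hat = A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

W = rng.normal(size=(4, 8))                   # learnable weights in a real model
H = np.maximum(A_hat @ features @ W, 0.0)     # one propagation step with ReLU
print(H.shape)                                # (5, 8) -- interaction-aware features
```

Because the graph is built from agent positions alone, no image cropping or context encoding is needed at inference, which is what keeps this family of methods cheap.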
6.5. Behaviour prediction system further work

Despite the techniques presented to meet the specified requirements, there is still work to be done from the authors' perspective. For example:

• Most of the works, both for pedestrians and vehicles, were implemented using either a top-view or BEV dataset, which may not be ideal for an AV system. Only in the past five years have researchers started implementing algorithms using on-board datasets such as PREVENTION, Apollo, JAAD, and PIE. Moreover, most of the works that used on-board datasets focused on implementing intention prediction algorithms, and most of the proposed algorithms cannot be directly compared.
• While some works have used the same datasets, evaluation metrics, observation time horizon, and prediction time horizon, these works were implemented on top-view and BEV datasets. For example, many vehicle trajectory prediction works have used the NGSIM dataset with an OTH of 3 s, a PTH of 5 s, and the MSE evaluation metric, while several pedestrian trajectory prediction algorithms adopted the ETH and UCY datasets with an OTH of 3.2 s, a PTH of 4.8 s, and the ADE and FDE evaluation metrics (the sketch after this list shows how these metrics are computed). If these datasets were ideal for AV systems, then the best vehicle trajectory prediction algorithms would be GRIP (Li et al., 2019b), GRIP++ (Li et al., 2019a), and AI-TP (Zhang, Zhao, et al., 2022), and the best pedestrian trajectory prediction algorithm would be BiTraP (Yao et al., 2021a).
• There is a lack of research on unusual behaviour exhibited by pedestrians and vehicles. For example, pedestrians might exhibit unusual behaviour when under the influence of toxic substances, involved in fights, or disoriented. Similarly, vehicles may display unusual behaviour when the driver is under the influence of toxic substances or distracted by personal belongings, or if the vehicle is an emergency vehicle, garbage truck, or road sweeper, is carrying an abnormal load, or is experiencing a mechanical malfunction.
• There is limited research on decreasing inference time, and more emphasis should be placed on addressing this demand.
• Standardising datasets would enable cross-dataset evaluation and the development of progressive training pipeline techniques.
• Introducing universal metrics would allow for direct comparisons of algorithm performance.
• When considering a full pipeline system (detection, tracking, and behaviour prediction), it is necessary to account for perception uncertainties due to sensor noise, fuzzy features, or unknown inputs (Liu et al., 2022). Since only a limited number of works have implemented a full pipeline system, more works considering the entire pipeline process are recommended to investigate the effect of possible noise; the sketch after this list also illustrates how such noise shifts the evaluation metrics.
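The sketch below, referenced in the bullets above, shows how ADE and FDE are typically computed and how simulated perception noise on the observed track propagates into both metrics. The constant-velocity baseline and the 0.2 m noise level are illustrative assumptions, not results from any reviewed work; the 8-frame/12-frame split follows the common ETH/UCY convention of 2.5 fps tracks.

```python
# ADE/FDE computation and a simple perception-noise probe (illustrative).
import numpy as np

def ade(pred, gt):
    """Average Displacement Error: mean L2 distance over all future steps."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def fde(pred, gt):
    """Final Displacement Error: mean L2 distance at the last future step."""
    return np.linalg.norm(pred[:, -1] - gt[:, -1], axis=-1).mean()

def constant_velocity(obs, pred_len=12):
    """Baseline predictor: extrapolate the last observed velocity."""
    v = obs[:, -1] - obs[:, -2]
    steps = np.arange(1, pred_len + 1)[None, :, None]
    return obs[:, -1:, :] + steps * v[:, None, :]

rng = np.random.default_rng(0)
# ETH/UCY convention: 2.5 fps, so an OTH of 3.2 s is 8 frames and a PTH of
# 4.8 s is 12 frames.
tracks = np.cumsum(rng.normal(0.5, 0.05, size=(16, 20, 2)), axis=1)
obs, gt = tracks[:, :8], tracks[:, 8:]

pred = constant_velocity(obs)
print(f"clean  ADE {ade(pred, gt):.2f} m, FDE {fde(pred, gt):.2f} m")

# Simulate detection/tracking noise entering the pipeline ahead of the predictor.
noisy = obs + rng.normal(scale=0.2, size=obs.shape)
pred_noisy = constant_velocity(noisy)
print(f"noisy  ADE {ade(pred_noisy, gt):.2f} m, FDE {fde(pred_noisy, gt):.2f} m")
```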
Based on the literature review, the following suggestions are given to further improve and accelerate the development of the Autonomous Vehicle Behaviour Prediction System:

• Encourage more research works to adopt on-board view datasets for predicting both pedestrian and vehicle behaviour, including intentions and trajectories.
• Standardise existing datasets to enable cross-dataset evaluation and progressive training pipeline techniques.
• Choose or create a standard evaluation metric to enable direct comparison among algorithms.
• Develop datasets that contain instances of abnormal pedestrian and vehicle behaviours to enable research on the recognition and prediction of abnormal pedestrian and vehicle behaviour.
• Implement behaviour prediction algorithms on resource-constrained hardware, such as the Jetson Orin and Jetson Xavier GPUs, which are low-cost, small, lightweight, and consume little power; a simple latency harness is sketched after this list.
• Investigate more methods to select the target object and the objects that directly interact with the target object.
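As referenced in the hardware suggestion above, a latency harness of the following kind could be run on a Jetson-class board to check real-time feasibility; the model here is a hypothetical stand-in, not any reviewed predictor, and the warm-up/iteration counts are arbitrary choices.

```python
# Simple inference-latency harness (illustrative); the model is a stand-in.
import time
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 24)).eval()
x = torch.randn(1, 16)

with torch.no_grad():
    for _ in range(10):                 # warm-up iterations before timing
        model(x)
    t0 = time.perf_counter()
    n = 100
    for _ in range(n):
        model(x)
    latency_ms = (time.perf_counter() - t0) / n * 1e3
print(f"Mean inference latency: {latency_ms:.2f} ms")
```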
The general object detection problem serves as an example of the importance of having a large dataset and standard evaluation metrics. The field has achieved an acceptable level of maturity because researchers have access to publicly available large image benchmark datasets, such as ImageNet (Russakovsky et al., 2015) and COCO (Lin et al., 2014). These datasets enabled authors to directly compare their detection algorithm performance and to measure the advancement of object detection research.

7. Conclusion

AV systems must not only detect pedestrians and vehicles but also predict their behaviour to avoid or mitigate collisions. Therefore, the purpose of this literature review was to survey the most relevant pedestrian and vehicle behaviour prediction algorithms to identify the requirements for a behaviour prediction algorithm, the challenges associated with predicting pedestrian and vehicle behaviour, whether current techniques have met these requirements, and what steps are needed to enable AVs to predict pedestrian and vehicle behaviours. In conclusion, the review shows that:

• An AV behaviour prediction system must have good evaluation metric performance and a long prediction horizon, and it must be fast at inference, cost-effective, robust, require minimal hardware resources, and predict various types of non-static objects on the road.
• The main challenges in predicting the behaviour of traffic agents involve modelling their interactions, establishing relationships between the agents and the scene, and achieving a balance between good evaluation metric performance and low inference times.
• Current techniques do not fully meet these requirements for several reasons:
  – when predicting over long-term horizons, evaluation metric performance decreases significantly;
  – while top-view and BEV datasets are commonly used in the literature, few works have adopted on-board datasets, which are more suitable for AVs;
  – on-board datasets usually use only a single forward-facing camera, limiting the behaviour prediction system to agents ahead, whereas considering the agents around the ego vehicle using multiple cameras is essential (Zhang, 2021);
  – more investigation is required to develop models that can predict intention and trajectory simultaneously; although some authors (Li et al., 2019a, 2019b) claimed that their systems achieved real-time inference times, they used top-view cameras, whereas systems that use on-board sensors may require more processing time;
  – there are no works that consider abnormal behaviour exhibited by traffic agents.
• Most of the reviewed works have not considered the full pipeline behaviour prediction process, which consists of detection, classification, and tracking. More research should focus on the full pipeline process to assess the performance of each stage and its impact on the final prediction results.

Abbreviations

AV Autonomous Vehicle.
ADAS Advanced Driver Assistance System.
WHO World Health Organisation.
DL Deep Learning.
OTH Observation Time Horizon.
PTH Prediction Time Horizon.
EV Ego Vehicle.
TTE Time-To-Event.
KF Kalman Filter.
EKF Extended Kalman Filter.
HMM Hidden Markov Model.
SVM Support Vector Machine.
ANN Artificial Neural Network.
OGM Occupancy Grid Map.
CNN Convolutional Neural Network.
References
Chen, L., Ma, N., Wang, P., Li, J., Wang, P., Pang, G., & Shi, X. (2020). Survey of pedestrian action recognition techniques for autonomous driving. Tsinghua Science and Technology, 25(4), 458–470, Publisher: TUP.
Chen, T., Tian, R., & Ding, Z. (2021). Visual reasoning using graph convolutional networks for predicting pedestrian crossing intention. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 3103–3109).
Chen, Y., Zhao, D., Lv, L., & Zhang, Q. (2018). Multi-task learning for dangerous object detection in autonomous driving. Information Sciences, 432, 559–571, Publisher: Elsevier.
Chung, J., Gulcehre, C., Cho, K., & Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.
COLONNA, M. (2018). Urbanisation worldwide. Knowledge for policy - European Commission, URL: https://ec.europa.eu/knowledge4policy/foresight/topic/continuing-urbanisation/urbanisation-worldwide_en.
Czech, P., Braun, M., Kreßel, U., & Yang, B. (2022). On-board pedestrian trajectory prediction using behavioral features. arXiv preprint arXiv:2210.11999.
Dai, S., Li, L., & Li, Z. (2019). Modeling vehicle interactions via modified LSTM models for trajectory prediction. IEEE Access, 7, 38287–38296, Publisher: IEEE.
Dendorfer, P., Osep, A., Milan, A., Schindler, K., Cremers, D., Reid, I., Roth, S., & Leal-Taixé, L. (2021). MOTChallenge: A benchmark for single-camera multiple target tracking. International Journal of Computer Vision, 129, 845–881, Publisher: Springer.
Deo, N., Rangesh, A., & Trivedi, M. M. (2018). How would surround vehicles move? A unified framework for maneuver classification and motion prediction. IEEE Transactions on Intelligent Vehicles, 3(2), 129–140, Publisher: IEEE.
Deo, N., & Trivedi, M. M. (2018a). Convolutional social pooling for vehicle trajectory prediction. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops (pp. 1468–1476).
Deo, N., & Trivedi, M. M. (2018b). Multi-modal trajectory prediction of surrounding vehicles with maneuver based LSTMs. In 2018 IEEE intelligent vehicles symposium (pp. 1179–1184). IEEE.
Dueholm, J. V., Kristoffersen, M. S., Satzoda, R. K., Moeslund, T. B., & Trivedi, M. M. (2016). Trajectories and maneuvers of surrounding vehicles with panoramic camera arrays. IEEE Transactions on Intelligent Vehicles, 1(2), 203–214, Publisher: IEEE.
Durrant-Whyte, H. (2001). A critical review of the state-of-the-art in autonomous land vehicle systems and technology. Albuquerque (NM) and Livermore (CA), USA: Sandia National Laboratories, 41, 242.
Fang, Z., & López, A. M. (2018). Is the pedestrian going to cross? Answering by 2D pose estimation. In 2018 IEEE intelligent vehicles symposium (pp. 1271–1276). IEEE.
Fang, Z., Vázquez, D., & López, A. M. (2017). On-board detection of pedestrian intentions. Sensors, 17(10), 2193, Publisher: MDPI.
Fernández-Llorca, D., Biparva, M., Izquierdo-Gonzalo, R., & Tsotsos, J. K. (2020). Two-stream networks for lane-change prediction of surrounding vehicles. In 2020 IEEE 23rd international conference on intelligent transportation systems (pp. 1–6). IEEE.
Flohr, F. F., Kooij, J. F. K., Pool, E. A. P., & Gavrila, D. M. G. (2018). Context-based path prediction for targets with switching dynamics.
Galvao, L. G., Abbod, M., Kalganova, T., Palade, V., & Huda, M. N. (2021). Pedestrian and vehicle detection in autonomous vehicle perception systems—A review. Sensors, 21(21), 7267, Publisher: MDPI.
Gazzeh, S., & Douik, A. (2022). Deep learning for pedestrian behavior understanding. In 2022 6th international conference on advanced technologies for signal and image processing (pp. 1–5). IEEE.
Girma, A., Amsalu, S., Workineh, A., Khan, M., & Homaifar, A. (2020). Deep learning with attention mechanism for predicting driver intention at intersection. In 2020 IEEE intelligent vehicles symposium (pp. 1183–1188). IEEE.
GOVUK, G. (2020). Reported road casualties Great Britain, annual report: 2020. GOV.UK, URL: https://www.gov.uk/government/statistics/reported-road-casualties-great-britain-annual-report-2020/reported-road-casualties-great-britain-annual-report-2020.
GOVUK, G. (2021). Reported road casualties in Great Britain, provisional estimates: year ending June 2021. GOV.UK, URL: https://www.gov.uk/government/statistics/reported-road-casualties-in-great-britain-provisional-estimates-year-ending-june-2021/reported-road-casualties-in-great-britain-provisional-estimates-year-ending-june-2021.
Gulzar, M., Muhammad, Y., & Muhammad, N. (2021). A survey on motion prediction of pedestrians and vehicles for autonomous driving. IEEE Access, 9, 137957–137969, Publisher: IEEE.
Gupta, A., Johnson, J., Fei-Fei, L., Savarese, S., & Alahi, A. (2018). Social GAN: Socially acceptable trajectories with generative adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2255–2264).
Hasan, I., Setti, F., Tsesmelis, T., Del Bue, A., Galasso, F., & Cristani, M. (2018). MX-LSTM: Mixing tracklets and vislets to jointly forecast trajectories and head poses. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 6067–6076).
He, J.-H., Chen, Y.-L., Chen, X.-Z., & Chiang, H.-H. (2021). Vehicle turning intention prediction based on data-driven method with roadside radar and vision sensor. In 2021 IEEE international conference on consumer electronics-Taiwan (pp. 1–2). IEEE.
Hermes, C., Wohler, C., Schenk, K., & Kummert, F. (2009). Long-term vehicle motion prediction. In 2009 IEEE intelligent vehicles symposium (pp. 652–657). IEEE.
Huang, H., Zeng, Z., Yao, D., Pei, X., & Zhang, Y. (2021). Spatial–temporal ConvLSTM for vehicle driving intention prediction. Tsinghua Science and Technology, 27, 599–609.
Izquierdo, R., Quintanar, A., Lorenzo, J., García-Daza, I., Parra, I., Fernández-Llorca, D., & Sotelo, M. A. (2021). Vehicle lane change prediction on highways using efficient environment representation and deep learning. IEEE Access, 9, 119454–119465, Publisher: IEEE.
Izquierdo, R., Quintanar, A., Parra, I., Fernández-Llorca, D., & Sotelo, M. A. (2019). The PREVENTION dataset: A novel benchmark for prediction of vehicles intentions. In 2019 IEEE intelligent transportation systems conference (pp. 3114–3121). IEEE.
Karasev, V., Ayvaci, A., Heisele, B., & Soatto, S. (2016). Intent-aware long-term prediction of pedestrian motion. In 2016 IEEE international conference on robotics and automation (pp. 2543–2549). IEEE.
Kasper, D., Weidl, G., Dang, T., Breuel, G., Tamke, A., Wedel, A., & Rosenstiel, W. (2012). Object-oriented Bayesian networks for detection of lane change maneuvers. IEEE Intelligent Transportation Systems Magazine, 4(3), 19–31, Publisher: IEEE.
Keller, C. G., & Gavrila, D. M. (2013). Will the pedestrian cross? A study on pedestrian path prediction. IEEE Transactions on Intelligent Transportation Systems, 15(2), 494–506, Publisher: IEEE.
Khosroshahi, A., Ohn-Bar, E., & Trivedi, M. M. (2016). Surround vehicles trajectory analysis with recurrent neural networks. In 2016 IEEE 19th international conference on intelligent transportation systems (pp. 2267–2272). IEEE.
Kim, B., Kang, C. M., Kim, J., Lee, S. H., Chung, C. C., & Choi, J. W. (2017). Probabilistic vehicle trajectory prediction over occupancy grid map via recurrent neural network. In 2017 IEEE 20th international conference on intelligent transportation systems (pp. 399–404). IEEE.
Kong, Y., & Fu, Y. (2018). Human action recognition and prediction: A survey. arXiv preprint arXiv:1806.11230.
Kooij, J. F. P., Schneider, N., Flohr, F., & Gavrila, D. M. (2014). Context-based pedestrian path prediction. In European conference on computer vision (pp. 618–633). Springer.
Kotseruba, I., Rasouli, A., & Tsotsos, J. K. (2020). Do they want to cross? Understanding pedestrian intention for behavior prediction. In 2020 IEEE intelligent vehicles symposium (pp. 1688–1693). IEEE.
Kotseruba, I., Rasouli, A., & Tsotsos, J. K. (2021). Benchmark for evaluating pedestrian action prediction. In Proceedings of the IEEE/CVF winter conference on applications of computer vision (pp. 1258–1268).
Kuefler, A., Morton, J., Wheeler, T., & Kochenderfer, M. (2017). Imitating driver behavior with generative adversarial networks. In 2017 IEEE intelligent vehicles symposium (pp. 204–211). IEEE.
Kumar, P., Perrollaz, M., Lefevre, S., & Laugier, C. (2013). Learning-based approach for online lane change intention prediction. In 2013 IEEE intelligent vehicles symposium (pp. 797–802). IEEE.
Lee, N., Choi, W., Vernaza, P., Choy, C. B., Torr, P. H., & Chandraker, M. (2017). DESIRE: Distant future prediction in dynamic scenes with interacting agents. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 336–345).
Lee, D., Kwon, Y. P., McMains, S., & Hedrick, J. K. (2017). Convolution neural network-based lane change intention prediction of surrounding vehicles for ACC. In 2017 IEEE 20th international conference on intelligent transportation systems (pp. 1–6). IEEE.
Lefèvre, S., Vasquez, D., & Laugier, C. (2014). A survey on motion prediction and risk assessment for intelligent vehicles. ROBOMECH Journal, 1(1), 1–14, Publisher: SpringerOpen.
Leon, F., & Gavrilescu, M. (2019). A review of tracking, prediction and decision making methods for autonomous driving. arXiv preprint arXiv:1909.07707.
Levy, J. I., Buonocore, J. J., & Von Stackelberg, K. (2010). Evaluation of the public health impacts of traffic congestion: A health risk assessment. Environmental Health, 9(1), 1–12, Publisher: Springer.
Li, Y., Wang, H., Dang, L. M., Nguyen, T. N., Han, D., Lee, A., Jang, I., & Moon, H. (2020). A deep learning-based hybrid framework for object detection and recognition in autonomous driving. IEEE Access, 8, 194228–194239, Publisher: IEEE.
Li, J., Yang, F., Tomizuka, M., & Choi, C. (2020). EvolveGraph: Multi-agent trajectory prediction with dynamic relational reasoning. In Proceedings of the neural information processing systems.
Li, X., Ying, X., & Chuah, M. C. (2019a). GRIP++: Enhanced graph-based interaction-aware trajectory prediction for autonomous driving. arXiv preprint arXiv:1907.07792.
Li, X., Ying, X., & Chuah, M. C. (2019b). GRIP: Graph-based interaction-aware trajectory prediction. In 2019 IEEE intelligent transportation systems conference (pp. 3960–3966). IEEE.
Lian, J., Yu, F., Li, L., & Zhou, Y. (2022). Early intention prediction of pedestrians using contextual attention-based LSTM. Multimedia Tools and Applications, 1–17, Publisher: Springer.
Lim, Y.-C., Lee, M., Lee, C.-H., Kwon, S., & Lee, J.-h. (2010). Improvement of stereo vision-based position and velocity estimation and tracking using a stripe-based disparity estimation and inverse perspective map-based extended Kalman filter. Optics and Lasers in Engineering, 48(9), 859–868, Publisher: Elsevier.
Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C. L. (2014). Microsoft COCO: Common objects in context. In Computer vision–ECCV 2014: 13th European conference, Zurich, Switzerland, September 6-12, 2014, proceedings, Part V 13 (pp. 740–755). Springer.
Liu, B., Adeli, E., Cao, Z., Lee, K.-H., Shenoi, A., Gaidon, A., & Niebles, J. C. (2020). Spatiotemporal relationship reasoning for pedestrian intent prediction. IEEE Robotics and Automation Letters, 5(2), 3485–3492, Publisher: IEEE.
Liu, J., Wang, H., Peng, L., Cao, Z., Yang, D., & Li, J. (2022). PNNUAD: Perception neural networks uncertainty aware decision-making for autonomous vehicle. IEEE Transactions on Intelligent Transportation Systems, 23(12), 24355–24368, Publisher: IEEE.
Luan, Z., Huang, Y., Zhao, W., Zou, S., & Xu, C. (2022). A comprehensive lateral motion prediction method of surrounding vehicles integrating driver intention prediction and vehicle behavior recognition. Proceedings of the Institution of Mechanical Engineers, Part D (Journal of Automobile Engineering), Article 09544070221078636, Publisher: SAGE Publications Sage UK: London, England.
Ma, J., & Rong, W. (2022). Pedestrian crossing intention prediction method based on multi-feature fusion. World Electric Vehicle Journal, 13(8), 158, Publisher: MDPI.
Ma, Y., Zhu, X., Zhang, S., Yang, R., Wang, W., & Manocha, D. (2019). TrafficPredict: Trajectory prediction for heterogeneous traffic-agents. In Proceedings of the AAAI conference on artificial intelligence, vol. 33 (pp. 6120–6127). Issue: 01.
Mangalam, K., Girase, H., Agarwal, S., Lee, K.-H., Adeli, E., Malik, J., & Gaidon, A. (2020). It is not the journey but the destination: Endpoint conditioned trajectory prediction. In European conference on computer vision (pp. 759–776). Springer.
Manh, H., & Alaghband, G. (2018). Scene-LSTM: A model for human trajectory prediction. arXiv preprint arXiv:1808.04018.
Messaoud, K., Yahiaoui, I., Verroust-Blondet, A., & Nashashibi, F. (2019). Non-local social pooling for vehicle trajectory prediction. In 2019 IEEE intelligent vehicles symposium (pp. 975–980). IEEE.
Minguez, R. Q., Alonso, I. P., Fernandez-Llorca, D., & Sotelo, M. A. (2018). Pedestrian path, pose, and intention prediction through Gaussian process dynamical models and pedestrian activity recognition. IEEE Transactions on Intelligent Transportation Systems, 20(5), 1803–1814, Publisher: IEEE.
Mo, X., Huang, Z., Xing, Y., & Lv, C. (2022). Multi-agent trajectory prediction with heterogeneous edge-enhanced graph attention network. IEEE Transactions on Intelligent Transportation Systems, Publisher: IEEE.
Mohamed, A., Qian, K., Elhoseiny, M., & Claudel, C. (2020). Social-STGCNN: A social spatio-temporal graph convolutional neural network for human trajectory prediction. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 14424–14432).
Mozaffari, S., Al-Jarrah, O. Y., Dianati, M., Jennings, P., & Mouzakitis, A. (2020). Deep learning-based vehicle behavior prediction for autonomous driving applications: A review. IEEE Transactions on Intelligent Transportation Systems, Publisher: IEEE.
Naik, A. Y., Bighashdel, A., Jancura, P., & Dubbelman, G. (2022). Scene spatio-temporal graph convolutional network for pedestrian intention estimation. In 2022 IEEE intelligent vehicles symposium (pp. 874–881). IEEE.
Neogi, S., Hoy, M., Chaoqun, W., & Dauwels, J. (2017). Context based pedestrian intention prediction using factored latent dynamic conditional random fields. In 2017 IEEE symposium series on computational intelligence (pp. 1–8). IEEE.
Park, S. H., Kim, B., Kang, C. M., Chung, C. C., & Choi, J. W. (2018). Sequence-to-sequence prediction of vehicle trajectory via LSTM encoder-decoder architecture. In 2018 IEEE intelligent vehicles symposium (pp. 1672–1678). IEEE.
Pendleton, S. D., Andersen, H., Du, X., Shen, X., Meghjani, M., Eng, Y. H., Rus, D., & Ang, M. H. (2017). Perception, planning, control, and coordination for autonomous vehicles. Machines, 5(1), 6, Publisher: Multidisciplinary Digital Publishing Institute.
Petrović, D., Mijailović, R., & Pešić, D. (2020). Traffic accidents with autonomous vehicles: Type of collisions, manoeuvres and errors of conventional vehicles' drivers. Transportation Research Procedia, 45, 161–168, Publisher: Elsevier.
Piccoli, F., Balakrishnan, R., Perez, M. J., Sachdeo, M., Nunez, C., Tang, M., Andreasson, K., Bjurek, K., Raj, R. D., & Davidsson, E. (2020). FUSSI-Net: Fusion of spatio-temporal skeletons for intention prediction network. In 2020 54th asilomar conference on signals, systems, and computers (pp. 68–72). IEEE.
Quan, R., Zhu, L., Wu, Y., & Yang, Y. (2021). Holistic LSTM for pedestrian trajectory prediction. IEEE Transactions on Image Processing, 30, 3229–3239, Publisher: IEEE.
Ragesh, N. K., & Rajesh, R. (2019). Pedestrian detection in automotive safety: Understanding state-of-the-art. IEEE Access, 7, 47864–47890, Publisher: IEEE.
Raimundo, V., & Favio, M. (2021). Driver intention prediction at roundabouts. In 2021 XIX workshop on information processing and control (pp. 1–5). IEEE.
Rasouli, A., Kotseruba, I., Kunic, T., & Tsotsos, J. K. (2019). PIE: A large-scale dataset and models for pedestrian intention estimation and trajectory prediction. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 6262–6271).
Rasouli, A., Kotseruba, I., & Tsotsos, J. K. (2020). Pedestrian action anticipation using contextual feature fusion in stacked RNNs. arXiv preprint arXiv:2005.06582.
Razali, H., Mordan, T., & Alahi, A. (2021). Pedestrian intention prediction: A convolutional bottom-up multi-task approach. Transportation Research Part C: Emerging Technologies, 130, Article 103259, Publisher: Elsevier.
Rehder, E., Wirth, F., Lauer, M., & Stiller, C. (2018). Pedestrian prediction by planning using deep neural networks. In 2018 IEEE international conference on robotics and automation (pp. 1–5). IEEE.
Ridel, D., Rehder, E., Lauer, M., Stiller, C., & Wolf, D. (2018). A literature review on the prediction of pedestrian behavior in urban scenarios. In 2018 21st international conference on intelligent transportation systems (pp. 3105–3112). IEEE.
Rudenko, A., Palmieri, L., Herman, M., Kitani, K. M., Gavrila, D. M., & Arras, K. O. (2020). Human motion trajectory prediction: A survey. International Journal of Robotics Research, 39(8), 895–935, Publisher: Sage Publications Sage UK: London, England.
Ruijters, E., & Stoelinga, M. (2015). Fault tree analysis: A survey of the state-of-the-art in modeling, analysis and tools. Computer Science Review, 15, 29–62, Publisher: Elsevier.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., & Bernstein, M. (2015). ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115, 211–252, Publisher: Springer.
Sadeghian, A., Kosaraju, V., Sadeghian, A., Hirose, N., Rezatofighi, H., & Savarese, S. (2019). SoPhie: An attentive GAN for predicting paths compliant to social and physical constraints. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 1349–1358).
Schneider, N., & Gavrila, D. M. (2013). Pedestrian path prediction with recursive Bayesian filters: A comparative study. In German conference on pattern recognition (pp. 174–183). Springer.
Schwall, M., Daniel, T., Victor, T., Favaro, F., & Hohnhold, H. (2020). Waymo public road safety performance data. arXiv preprint arXiv:2011.00038.
Sharma, N., Dhiman, C., & Indu, S. (2022). Pedestrian intention prediction for autonomous vehicles: A comprehensive survey. Neurocomputing, Publisher: Elsevier.
Shirazi, M. S., & Morris, B. T. (2016). Looking at intersections: A survey of intersection monitoring, behavior and safety analysis of recent studies. IEEE Transactions on Intelligent Transportation Systems, 18(1), 4–24, Publisher: IEEE.
Shobha, B. S., & Deepu, R. (2018). A review on video based vehicle detection, recognition and tracking. In 2018 3rd international conference on computational systems and information technology for sustainable solutions (pp. 183–186). IEEE.
Siegwart, R., Nourbakhsh, I. R., & Scaramuzza, D. (2011). Introduction to autonomous mobile robots. MIT Press.
SIMulation, G. (2007). US highway 101 dataset.
Sivaraman, S., & Trivedi, M. M. (2013). Looking at vehicles on the road: A survey of vision-based vehicle detection, tracking, and behavior analysis. IEEE Transactions on Intelligent Transportation Systems, 14(4), 1773–1795, Publisher: IEEE.
Su, S., Muelling, K., Dolan, J., Palanisamy, P., & Mudalige, P. (2018). Learning vehicle surrounding-aware lane-changing behavior from observed trajectories. In 2018 IEEE intelligent vehicles symposium (pp. 1412–1417). IEEE.
Sun, J., Jiang, Q., & Lu, C. (2020). Recursive social behavior graph for trajectory prediction. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 660–669).
Vemula, A., Muelling, K., & Oh, J. (2018). Social attention: Modeling attention in human crowds. In 2018 IEEE international conference on robotics and automation (pp. 4601–4607). IEEE.
Vitas, D., Tomic, M., & Burul, M. (2020). Traffic light detection in autonomous driving systems. IEEE Consumer Electronics Magazine, 9(4), 90–96, Publisher: IEEE.
Wang, C., Wang, Y., Xu, M., & Crandall, D. J. (2022). Stepwise goal-driven networks for trajectory prediction. IEEE Robotics and Automation Letters, 7(2), 2716–2723, Publisher: IEEE.
Waymo, W. (2020). Waymo safety report. Waymo, URL: https://waymo.com/safety/.
WHO, W. H. O. (2018). Global status report on road safety 2018: Summary: Technical report, World Health Organization.
Xin, L., Wang, P., Chan, C.-Y., Chen, J., Li, S. E., & Cheng, B. (2018). Intention-aware long horizon trajectory prediction of surrounding vehicles using dual LSTM networks. In 2018 21st international conference on intelligent transportation systems (pp. 1441–1446). IEEE.
Xing, L., & Amari, S. V. (2008). Fault tree analysis. In Handbook of performability engineering (pp. 595–620). Publisher: Springer.
Xing, Y., Lv, C., Huaji, W., Wang, H., & Cao, D. (2017). Recognizing driver braking intention with vehicle data using unsupervised learning methods: Technical report, SAE Technical Paper.
Xu, Y., Piao, Z., & Gao, S. (2018). Encoding crowd interaction with deep neural network for pedestrian trajectory prediction. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 5275–5284).
Xue, H., Huynh, D. Q., & Reynolds, M. (2018). SS-LSTM: A hierarchical LSTM model for pedestrian trajectory prediction. In 2018 IEEE winter conference on applications of computer vision (pp. 1186–1194). IEEE.
Xue, H., Huynh, D. Q., & Reynolds, M. (2020). A location-velocity-temporal attention LSTM model for pedestrian trajectory prediction. IEEE Access, 8, 44576–44589, Publisher: IEEE.
Yang, J., Sun, X., Wang, R. G., & Xue, L. X. (2022). PTPGC: Pedestrian trajectory prediction by graph attention network with ConvLSTM. Robotics and Autonomous Systems, 148, Article 103931, Publisher: Elsevier.
Yang, B., Zhan, W., Wang, P., Chan, C., Cai, Y., & Wang, N. (2021). Crossing or not? Context-based recognition of pedestrian crossing intention in the urban environment. IEEE Transactions on Intelligent Transportation Systems, 23(6), 5338–5349, Publisher: IEEE.
Yang, D., Zhang, H., Yurtsever, E., Redmill, K. A., & Ozguner, U. (2022). Predicting pedestrian crossing intention with feature fusion and spatio-temporal attention. IEEE Transactions on Intelligent Vehicles, 7(2), 221–230, Publisher: IEEE.
Yao, Y., Atkins, E., Johnson-Roberson, M., Vasudevan, R., & Du, X. (2021a). BiTraP: Bi-directional pedestrian trajectory prediction with multi-modal goal estimation. IEEE Robotics and Automation Letters, 6(2), 1463–1470, Publisher: IEEE.
Yao, Y., Atkins, E., Roberson, M. J., Vasudevan, R., & Du, X. (2021b). Coupling intent and action for pedestrian crossing behavior prediction. arXiv preprint arXiv:2105.04133.
Yoon, S., & Kum, D. (2016). The multilayer perceptron approach to lateral motion prediction of surrounding vehicles for autonomous vehicles. In 2016 IEEE intelligent vehicles symposium (pp. 1307–1312). IEEE.
Zeng, Z. (2022). High efficiency pedestrian crossing prediction. arXiv preprint arXiv:2204.01862.
Zhang, J. (2021). Deep understanding Tesla FSD Part 1: HydraNet. Medium, URL: https://saneryee-studio.medium.com/deep-understanding-tesla-fsd-part-1-hydranet-1b46106d57.
Zhang, S., Abdel-Aty, M., Wu, Y., & Zheng, O. (2021). Pedestrian crossing intention prediction at red-light using pose estimation. IEEE Transactions on Intelligent Transportation Systems, 23(3), 2331–2339, Publisher: IEEE.
Zhang, X., Angeloudis, P., & Demiris, Y. (2022). ST CrossingPose: A spatial-temporal graph convolutional network for skeleton-based pedestrian crossing intention prediction. IEEE Transactions on Intelligent Transportation Systems, Publisher: IEEE.
Zhang, X., Cheng, L., Li, B., & Hu, H.-M. (2018). Too far to see? Not really!—Pedestrian detection with scale-aware localization policy. IEEE Transactions on Image Processing, 27(8), 3703–3715, Publisher: IEEE.
Zhang, H., & Fu, R. (2020). A hybrid approach for turning intention prediction based on time series forecasting and deep learning. Sensors, 20(17), 4887, Publisher: Multidisciplinary Digital Publishing Institute.
Zhang, P., Ouyang, W., Zhang, P., Xue, J., & Zheng, N. (2019). SR-LSTM: State refinement for LSTM towards pedestrian trajectory prediction. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 12085–12094).
Zhang, T., Song, W., Fu, M., Yang, Y., & Wang, M. (2021). Vehicle motion prediction at intersections based on the turning intention and prior trajectories model. IEEE/CAA Journal of Automatica Sinica, 8(10), 1657–1666, Publisher: IEEE.
Zhang, K., Zhao, L., Dong, C., Wu, L., & Zheng, L. (2022). AI-TP: Attention-based interaction-aware trajectory prediction for autonomous driving. IEEE Transactions on Intelligent Vehicles, Publisher: IEEE.
Zhao, T., Xu, Y., Monfort, M., Choi, W., Baker, C., Zhao, Y., Wang, Y., & Wu, Y. N. (2019). Multi-agent tensor fusion for contextual trajectory prediction. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 12126–12134).
Zhou, W., Berrio, J. S., De Alvis, C., Shan, M., Worrall, S., Ward, J., & Nebot, E. (2020). Developing and testing robust autonomy: The University of Sydney campus data set. IEEE Intelligent Transportation Systems Magazine, 12(4), 23–40, Publisher: IEEE.
Zhu, Y., Qian, D., Ren, D., & Xia, H. (2019). StarNet: Pedestrian trajectory prediction using deep neural network in star topology. In 2019 IEEE/RSJ international conference on intelligent robots and systems (pp. 8075–8080). IEEE.
Zyner, A., Worrall, S., & Nebot, E. M. (2019). ACFR five roundabouts dataset: Naturalistic driving at unsignalized intersections. IEEE Intelligent Transportation Systems Magazine, 11(4), 8–18, Publisher: IEEE.