
Multi-Modal Hybrid Architecture for Pedestrian Action Prediction

Amir Rasouli*, Tiffany Yau, Mohsen Rohani and Jun Luo

arXiv:2012.00514v1 [cs.CV] 16 Nov 2020

Abstract— Pedestrian behavior prediction is one of the major challenges for intelligent driving systems in urban environments. Pedestrians often exhibit a wide range of behaviors, and adequate interpretation of those behaviors depends on various sources of information such as pedestrian appearance, states of other road users, the environment layout, etc. To address this problem, we propose a novel multi-modal prediction algorithm that incorporates different sources of information captured from the environment to predict future crossing actions of pedestrians. The proposed model benefits from a hybrid learning architecture consisting of feedforward and recurrent networks for analyzing visual features of the environment and dynamics of the scene. Using the existing 2D pedestrian behavior benchmarks and a newly annotated 3D driving dataset, we show that our proposed model achieves state-of-the-art performance in pedestrian crossing prediction.

Fig. 1: Predicting pedestrian crossing in front of the ego-vehicle based on various contextual information: local scene around the pedestrian, semantic map of the environment, and pedestrian and ego-vehicle dynamics.

I. INTRODUCTION

Road user behavior prediction is one of the fundamental challenges for autonomous driving systems in urban environments. Predicting pedestrian behavior, in particular, is of great importance as pedestrians are among the most vulnerable road users and often exhibit a wide range of behaviors [1] that are impacted by numerous environmental and social factors [2].

Pedestrian prediction in the context of driving can be done implicitly via forecasting future trajectories or explicitly through predicting high-level actions of pedestrians, e.g. crossing the road (see Figure 1). In either case, behavior prediction requires a deep understanding of various contextual information in order to achieve high precision [3], [4], [5], [6], [7], [8]. In recent years, many deep learning architectures have been proposed that exploit various data modalities, such as visual features, pedestrian dynamics, pose, ego-motion, etc., in order to predict future actions. These algorithms rely either on feedforward methods for spatiotemporal reasoning over scene images or on recurrent-based architectures that combine dynamics and visual features in a single framework (see [9] for an extensive review).

In this work, we propose a novel hybrid architecture that benefits from both feedforward and recurrent architectures to generate a joint representation of both visual features and scene dynamics. More specifically, our model uses four modalities of data: semantic maps of the environment, local scenes of pedestrians and their surroundings, and pedestrian and ego-vehicle dynamics. The visual and dynamics features are encoded using convolutional layers and recurrent networks respectively. Their outputs are weighted using two attention modules and combined to create a joint representation that is used for action prediction.

We evaluate the performance of the proposed algorithm on common 2D pedestrian behavior benchmark datasets as well as a new 3D dataset created by adding 3D bounding boxes and behavioral annotations to the existing autonomous driving dataset, nuScenes [10]. We show that the proposed method achieves state-of-the-art performance on multiple prediction metrics on these datasets.

Authors are with Noah's Ark Laboratory, Huawei, Markham, Canada. *Corresponding author: [email protected]

II. RELATED WORKS

A. Behavior prediction

The topic of human behavior prediction is of great interest to the robotics and computer vision communities. Behavior prediction has been used in many applications such as human-object [11], [12] and human-human interaction [13], [14], risk assessment [15], [16], anomaly detection [17], surveillance [18], [7], sports forecasting [19], [20] and intelligent driving systems [21], [5].

B. Behavior Prediction in Driving

In the context of intelligent driving, particularly in urban environments, predicting the behavior of road users is vital for safe motion planning. The dominant approach is to forecast future trajectories of other road users [21], [22], [23], [24], [25]. Recent developments in the field, however, suggest that predicting higher-level actions of road users can be beneficial by providing an early risk assessment [3], [5], [26] or by improving the accuracy of trajectory forecasting [4], [6], [7], [27]. These actions, for example, can refer to various maneuvers performed by other vehicles, e.g. changing lane or making a turn [27], or, in the case of pedestrians, interaction with other objects, e.g. opening a car door [4], or, more importantly, crossing the road [6], [5], [26].
Pedestrian Crossing Prediction: In recent years, the topic of pedestrian crossing prediction has gained momentum and many state-of-the-art algorithms have been proposed that take advantage of various data modalities and learning architectures [9].

Crossing prediction algorithms use different architectures. Those based on feedforward architectures [3], [28], [26], [29] often rely on unimodal data. For instance, the method in [28] uses a series of 3D DenseNet blocks to reason over image sequences of pedestrians. The authors of [26], [3] propose generative encoder-decoder architectures using 3D convolutional layers. Here, the algorithms first generate future scenes and then predict crossing actions based on those images. The model in [29] relies on intermediate features, such as pedestrian head orientation, motion state, and the state of other traffic elements. Each of these components is detected using separate CNN-based object classifiers, the outputs of which are combined for predicting the crossing action.

Alternatively, some algorithms rely on recurrent architectures and take advantage of a combination of different modalities for prediction [30], [5], [8]. For instance, the authors of [8] use a multi-stream LSTM architecture. They first encode visual features, optical flow images, and vehicle dynamics using individual LSTMs, the outputs of which are concatenated to generate a shared representation. This representation is then fed into an embedding layer followed by another LSTM. The output of the second LSTM is combined with the shared representation for the final inference. The model proposed in [5] uses a multi-layer recurrent architecture with five spatially stacked GRUs that encode pedestrian appearance, pedestrian surrounding context, pedestrians' poses and trajectories, and the ego-vehicle speed. The inputs are fed into the network at different levels according to their complexity, e.g. visual features at the bottom and speed at the top layers.

In this work, we use a hybrid approach that takes advantage of both feedforward networks and recurrent architectures for joint visuospatial and dynamics reasoning over traffic scenes.

Pedestrian Behavior Datasets: Although there are many publicly available datasets for pedestrian trajectory prediction [31], [32], [33], [34], [35], the choices for pedestrian action prediction, particularly in the context of driving, are more limited. There are a number of datasets that provide rich behavioral tags along with temporally coherent spatial annotations that can be used for pedestrian action prediction, including Joint Attention in Autonomous Driving (JAAD) [29], VIrtual ENvironment for Action Analysis (VIENA²) [8], Pedestrian Intention Estimation (PIE) [6], Trajectory Inference using Targeted Action priors Network (TITAN) [4] and Stanford-TRI Intent Prediction (STIP) [30]. These datasets provide image sequences of traffic scenes recorded using on-board cameras along with 2D bounding boxes for pedestrians and other traffic objects, as well as annotations for pedestrian behavior and ego-motion information. The drawback of these datasets is that they do not contain 3D information such as bird's eye view maps of the environment, 3D bounding boxes, etc., all of which are important for real-world autonomous driving applications. In addition to 2D benchmark datasets, we use a new dataset of 3D bounding boxes and behavioral annotations that extends the nuScenes dataset.

III. METHOD

A. Problem statement

In this work, pedestrian action prediction is formulated as an optimization problem p(A_i^{t+m} | LS, M, PM, VM), where the goal is to estimate the probability of crossing action A_i^{t+m} ∈ {0, 1} for some pedestrian i, 1 < i < n, at some time t + m in the future. Here, the prediction is based on the observed local scene around the pedestrian LS = {ls_1, ls_2, ..., ls_t}, the semantic map of the environment M = {m_1, m_2, ..., m_t}, the pedestrian's motion PM = {pm_1, pm_2, ..., pm_t} and the ego-vehicle motion VM = {vm_1, vm_2, ..., vm_t}.

B. Architecture

As highlighted in a survey of past studies [2], a diverse set of social and environmental factors impacts the way pedestrians make crossing decisions. To capture such contextual complexity, we propose a multi-modal method that relies on a hybrid architecture for encoding visual and dynamics information. Figure 2 illustrates the proposed method, which can be divided into three main parts:

• Visual Encoding. This part of the model relies on 2D convolutional layers to generate visual representations for the semantic map of the environment and the local scene context of pedestrians.
• Dynamics Encoding. This recurrent part of the model generates a joint representation of scene dynamics, namely pedestrian and ego-vehicle motion.
• Prediction. Using a joint representation of the visual and dynamics features, the model estimates the probability of the crossing action in the future.

We discuss each of these components in more detail in the subsections below.

C. Visual Encoding

Visual information is crucial for robust prediction of pedestrian behavior as it reflects the state of the pedestrians, e.g. pose, head orientation, motion, gestures, as well as their surrounding environment, e.g. the state of other road users, road structure, signals, etc. For this purpose, we rely on two different sources of visuospatial inputs: semantic maps and local scene images around pedestrians.

Semantic map. Depending on the type of data, we either use semantic maps generated from scene images or bird's eye view maps of the environment and traffic elements. The maps for each time-step are concatenated channel-wise.

When processing the maps, it is important to capture the interdependency between different traffic components such as pedestrians, vehicles, road structure, etc. To address this issue, we consider three strategies, as shown in Figure 3.
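As a concrete illustration of the channel-wise stacking of the observed sequence described above, the following is a minimal sketch assuming PyTorch tensors and hypothetical map dimensions; it is not the authors' implementation.

import torch

# Hypothetical observation: t = 5 time steps of 3-channel semantic maps of size 224x224.
t, c, h, w = 5, 3, 224, 224
maps = [torch.randn(c, h, w) for _ in range(t)]   # one map per observed frame

# Channel-wise concatenation: the observation sequence becomes a single
# (t*c, h, w) tensor that a 2D convolutional encoder can consume in one pass.
map_stack = torch.cat(maps, dim=0)                # shape: (15, 224, 224)
batch = map_stack.unsqueeze(0)                    # add a batch dimension: (1, 15, 224, 224)
print(batch.shape)

The same stacking applies to the local scene crops described later in this section.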
[Figure 2 diagram: labels include Map / Map Encoding, Late Fusion, Scene Encoding, Conv 512x3x1, Visual Representation, Local Scene, Dense layers, Conv1 64x3x1, Conv2 64x3x4, Conv3 128x3x2, Conv4 256x3x2, VAM, Pedestrian Motion and Ego-vehicle Motion LSTMs with hidden states h_1:t, Dynamics Representation, DAM, Joint Representation, and Pedestrian Action.]
Fig. 2: The architecture of the proposed model. The model processes four different input modalities: semantic maps of the
environment (2D or bird’s eye view), local scene images of pedestrians and their motion (2D spatial or 3D global coordinates
+ velocity), and ego-vehicle motion. Semantic maps and images are encoded with two sets of Conv2D layers. The outputs
of these layers are combined and fed into another Conv2D layer followed by a dense embedding layer to form visual
representations which are weighted using Visual Attention Module (VAM). The motion of pedestrians and the ego-vehicle
is encoded using two LSTMs, the outputs of which are concatenated and fed into Dynamic Attention Module (DAM). The
final joint representation is formed by concatenating the outputs of the attention modules. The specifications of conv layers
for scene encoding are given as [number of filters, kernel size, stride].
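To make the data flow summarized in Figure 2 concrete, the following is a minimal sketch of such a hybrid model, assuming PyTorch (the paper does not name a framework). The layer sizes, the dilation rate of the map branch, the 6-D and 2-D motion inputs, and the reduction of VAM/DAM to simple learned softmax weightings are illustrative assumptions, not the authors' exact specifications (those are given in Figures 2 and 3 and Section IV-B).

import torch
import torch.nn as nn

class ConvBranch(nn.Module):
    """2D conv encoder over a channel-stacked observation sequence (sketch)."""
    def __init__(self, in_ch, dilation=1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, stride=2, padding=dilation, dilation=dilation), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=dilation, dilation=dilation), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=dilation, dilation=dilation), nn.ReLU(),
            nn.AdaptiveAvgPool2d(8),  # fixed spatial size so the map and scene branches can be concatenated
        )

    def forward(self, x):
        return self.net(x)

class HybridCrossingPredictor(nn.Module):
    """Sketch: conv visual encoders + LSTM dynamics encoders + attention + fusion + prediction head."""
    def __init__(self, map_ch, scene_ch, ped_dim, ego_dim, hidden=256, embed=512):
        super().__init__()
        self.map_enc = ConvBranch(map_ch, dilation=2)    # atrous-style map branch (assumed rate)
        self.scene_enc = ConvBranch(scene_ch)
        self.fusion = nn.Conv2d(256, 512, 3, padding=1)  # late fusion over the concatenated branches
        self.embed = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(512, embed))
        self.vam = nn.Linear(embed, embed)               # simplified stand-in for the Visual Attention Module
        self.ped_rnn = nn.LSTM(ped_dim, hidden, batch_first=True)
        self.ego_rnn = nn.LSTM(ego_dim, hidden, batch_first=True)
        self.dam = nn.Linear(2 * hidden, 2 * hidden)     # simplified stand-in for the Dynamics Attention Module
        self.head = nn.Sequential(nn.Linear(embed + 2 * hidden, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, maps, scenes, ped_motion, ego_motion):
        # maps, scenes: (B, t*C, H, W); ped_motion, ego_motion: (B, t, D)
        vis = torch.cat([self.map_enc(maps), self.scene_enc(scenes)], dim=1)
        vis = self.embed(self.fusion(vis))                        # flattened, embedded visual representation
        vis = torch.softmax(self.vam(vis), dim=-1) * vis          # attention-weighted visual features
        ped_h, _ = self.ped_rnn(ped_motion)
        ego_h, _ = self.ego_rnn(ego_motion)
        dyn = torch.cat([ped_h[:, -1], ego_h[:, -1]], dim=-1)     # concatenated dynamics representation
        dyn = torch.softmax(self.dam(dyn), dim=-1) * dyn          # attention-weighted dynamics features
        joint = torch.cat([vis, dyn], dim=-1)                     # joint representation
        return torch.sigmoid(self.head(joint)).squeeze(-1)        # probability of future crossing

# Example with 5 observed frames: 3-channel maps, 3-channel scene crops, 6-D and 2-D motion vectors
model = HybridCrossingPredictor(map_ch=15, scene_ch=15, ped_dim=6, ego_dim=2)
p = model(torch.randn(2, 15, 64, 64), torch.randn(2, 15, 96, 96),
          torch.randn(2, 5, 6), torch.randn(2, 5, 2))
print(p.shape)  # torch.Size([2])

The structural point the sketch mirrors is that the feedforward (convolutional) and recurrent branches are encoded separately and only fused, via attention-weighted representations, right before the prediction head.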
The first approach is a common sequential processing with successive convolutional layers. This method captures local information in the image but fails to represent broader context at higher resolutions. Furthermore, consecutive downsampling can result in the loss of some of the finer visual details. To address these drawbacks, we consider two alternatives: atrous (or dilated) convolution and multi-scale skip connections.

Atrous convolution is a special form of convolution in which the effective field of view of a filter is increased by inserting gaps, parameterized as the rate of dilation, between the elements of a given kernel, as proposed in [36]. This allows the model to capture a broader spatial context without the need for downsampling.

Multi-scale skip connections are often used to combine fine and coarse features generated at different convolutional layers. Following this approach, the output of the first conv layer and the upsampled (via a deconvolution operation) output of the last layer are fed into another conv layer for joint processing. The specifications are shown in Figure 3.

[Figure 3 diagram: panels (a) Sequential, (b) Atrous and (c) Multi-scale; layer labels in the diagram include Conv layers 32x3x4, 64x3x2 and 128x3x2, a 16x16 pooling, dilation rates 2, 4 and 8 for the atrous variant, and a deconvolution plus 128x3x4 and 256x3x4 layers for the multi-scale variant.]
Fig. 3: Different strategies for processing map images. The specifications of 2D conv layers are given as [number of filters, kernel size, stride]. In (b), rate refers to the rate of kernel dilation.

Local scene. We use scene images to capture the changes in the appearance of pedestrians and their local surroundings. At a given time step, we extract a region around each pedestrian by scaling up the corresponding 2D spatial bounding box and then adjusting the dimensions of the scaled box so its width matches its height. Similar to the maps, we concatenate the images channel-wise for a given observation sequence prior to processing by multiple conv layers.

In order to combine the visual features from the different modalities, we employ a late fusion technique in which we spatially concatenate the outputs of the map and local scene encodings and jointly process them with an additional convolutional layer. The output of the fusion conv layer is flattened and fed into a dense layer to reduce the dimension of the feature vector.

Visual Attention Module (VAM). Although visual features are important for reasoning about future pedestrian action, it is important to emphasize those features that are more relevant to the task. To achieve this goal, we weight the visual representation features using an attention mechanism. We calculate the feature weight α_i for a given feature point z_i as follows:

α_i = e^{f_i} / Σ_j e^{f_j},    f_i = Σ_k ω_{i,k} z_k + b_i

where f_i is a single neuron in a fully-connected layer.
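A minimal sketch of this visual attention weighting, assuming PyTorch and a flat visual representation vector (an illustrative reading of the equations above, not the authors' code):

import torch
import torch.nn as nn

class VisualAttentionModule(nn.Module):
    """Softmax attention over a flat visual feature vector: alpha_i = exp(f_i) / sum_j exp(f_j)."""
    def __init__(self, dim):
        super().__init__()
        self.fc = nn.Linear(dim, dim)      # f_i = sum_k w_{i,k} z_k + b_i, one score per feature point

    def forward(self, z):
        # z: (batch, dim) visual representation from the fused conv/dense encoder
        f = self.fc(z)
        alpha = torch.softmax(f, dim=-1)   # attention weights, summing to 1 per sample
        return alpha * z                   # feature-wise weighted visual representation (assumed usage)

# Usage with a hypothetical 512-dimensional visual representation
vam = VisualAttentionModule(512)
weighted = vam(torch.randn(4, 512))
print(weighted.shape)                      # torch.Size([4, 512])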
D. Dynamics Encoding

When anticipating pedestrian actions, dynamics features, such as pedestrian and ego-motion, are of vital importance. For example, when a pedestrian is moving towards the road, that can be an indicator of crossing intention. Likewise, the ego-vehicle motion can directly impact whether a pedestrian makes a crossing decision. If, for instance, the ego-vehicle is driving at a high speed, the pedestrian will likely cross after the vehicle has passed. Given the importance of dynamics information, we use pedestrian and ego-vehicle motion as described below.

Pedestrian motion. We use both pedestrian location and velocity for encoding pedestrian motion. In the absence of 3D data, 2D bounding boxes, i.e. the coordinates [x1, y1, x2, y2] corresponding to the top-left and bottom-right corners of the boxes around pedestrians in image frames, are used. For the 3D dataset, we use global coordinates of pedestrians in 3D space with respect to the map coordinates. Velocities are obtained by calculating the displacement of the pedestrian coordinates (either in 2D or 3D) with respect to the first coordinate at the beginning of the observation sequence. The combination of pedestrian locations and velocities forms the pedestrian motion information.

Ego-vehicle motion. We follow the same routine as for pedestrians and use the ego-vehicle global coordinates and velocities. For 2D datasets, since ego-location is not available, we only use the vehicle's velocity.

Pedestrian and ego-motion features are encoded using two recurrent networks. The hidden states h_{1:t} of these networks are concatenated temporally to form dynamics representations.

Dynamics Attention Module (DAM). Similar to the visual representations, we employ an attention module to weight dynamics features. DAM is a 3D attention mechanism inspired by [37] which receives a temporal data sequence as input and generates a unified weighted representation computed as follows:

DAM_t = tanh(W_c [c_t ⊕ h_t])
c_t = Σ_{i∈[1:t]} σ(h_t W_a h_i′) h_i

where σ is a softmax function and h_i is a hidden state of the dynamics representation for i = 1, 2, ..., t. Here, ⊕ denotes the concatenation operation.

E. Prediction

The joint representation used for the final prediction is formed by concatenating the outputs of VAM and DAM. This representation is then fed into two consecutive dense layers to make crossing predictions. For learning, we use the binary cross-entropy loss given by

L = − Σ_n [ y_n log ŷ_n + (1 − y_n) log(1 − ŷ_n) ].

IV. EMPIRICAL EVALUATION

A. Datasets

For evaluations on 2D data, we choose two naturalistic pedestrian behavior benchmarks, PIE [6] and JAAD [29], which are publicly available. The alternatives, the TITAN [4] and STIP [30] datasets, are only available under restrictive terms of use, and VIENA² [8] only contains simulated samples.

Pedestrian Intention Estimation (PIE). PIE contains 6 hours of driving footage recorded with an on-board camera. There are 1842 pedestrian samples with dense 2D bounding box annotations at 30Hz with behavioral tags, along with motion sensor recordings both from the vehicle and the camera. For the ego-vehicle velocity, we use gyroscope readings from the camera. The data is split into training and testing sets as recommended in [6].

Joint Attention in Autonomous Driving (JAAD). Similar to PIE, JAAD is a dataset consisting of on-board video recordings with 346 clips and 2580 pedestrians annotated with 2D bounding boxes at 30Hz. Compared to PIE, JAAD samples are less diversified, i.e. the majority of samples are pedestrians walking on sidewalks, and sequences are generally shorter. JAAD also lacks ego-motion information; however, it provides the driver's actions, such as speeding up, stopping, etc., which are used in place of ego-motion information. The data is split into train/test sets following [38].

Pedestrian Prediction on nuScenes (PePScenes). Due to the lack of 3D driving data with pedestrian behavior annotations, we created our own dataset by adding annotations to the existing dataset nuScenes [10]. For behavioral annotations, we added sample-wise and per-frame crossing labels to 719 pedestrian samples, out of which 570 are not crossing. For these annotations, we selected the pedestrians that 1) appeared long enough in front of the vehicle and 2) appeared to have the intention of crossing, e.g. moving close to the curbside, looking at the traffic, etc. In addition, we extended the 2D/3D bounding box annotations on nuScenes from 2Hz to 10Hz to make the samples more suitable for pedestrian prediction. This is done by interpolating boxes between two consecutive frames using pedestrian global coordinates. The ratio of train/test data is 70/30.

Data preparation. We follow the same procedure as in [5] and clip sequences up to the time of the crossing events, i.e. the moment the pedestrians start crossing the road. In cases where no crossing occurs, the last frame is selected instead. All models use a 0.5s observation length (or 5 frames at 10Hz for all datasets) and sample sequences from each track between 1 and 2s to the event of crossing, with 50% overlap between samples. Overall, we get the following numbers of train/test samples: 3980/3185 for PIE, 3955/3110 for JAAD and 2544/1146 for PePScenes.

B. Implementation

For 2D datasets, we generate semantic maps using the algorithm in [36] pretrained on Cityscapes [39] and use 14 main object classes appearing in traffic scenes. For the PePScenes dataset, rasterized maps encoded as 3-channel images, similar to [40], are used. The map is of size 30 × 30 meters centered around the ego-vehicle.

To extract local scene images, we use 1.5x scaled versions of pedestrian 2D bounding boxes in image frames.
Unlike the 2D datasets, PePScenes contains recordings from three front-looking cameras that cover a wide-angle view in front of the vehicle. To select a pedestrian of interest, we choose a camera frame in which the pedestrian appears. If the pedestrian appears in two adjacent cameras, we select the camera frame in which a larger portion of the pedestrian is visible.

The details of the convolutional layers for each module can be seen in Figures 2 and 3. For all recurrent networks, we use LSTM cells with 256 hidden units. We set the dimension of the embedding dense layer for visual features to 512 and the second-to-last layer to 256.

C. Training

We trained the proposed model on all datasets end-to-end using the RMSProp [41] optimizer with a batch size of 16 and a learning rate of 0.00005, for 50 epochs on the 2D datasets and 40 on PePScenes. We applied L2 regularization of 0.0001 to the LSTMs and the last dense layer. To compensate for data imbalance, we set class weights based on the ratio of positive and negative samples.

D. Metrics

To evaluate the proposed model, common binary classification metrics as in [5] are used, including accuracy, Area Under the Curve (AUC), F1, and precision. Using all these metrics, as opposed to only reporting accuracy [30], is important given the fact that the numbers of negative and positive samples are imbalanced in all the datasets mentioned above. This means that the models can favor one class over the others and still achieve very high accuracy. AUC, F1, and precision show the balanced performance of the methods by highlighting how on average each method identifies different classes and distinguishes between them.

E. Models

We compare the performance of the proposed model to a series of baseline and state-of-the-art crossing prediction models:

Trajectory-Based Forecaster (TF). As a baseline, we make crossing predictions using only pedestrian trajectory information, i.e. pedestrian coordinates in 2D or 3D (as discussed in Sec. III-D), using an LSTM followed by a dense layer.

Inception 3D (I3D) [42]. Given the similarities between action prediction and recognition tasks, we use a state-of-the-art activity recognition model, I3D. This is a feedforward model based on a series of 3D convolutional layers. As input to this model, we use local scene images as described earlier.

Stacked with Fusion GRU (SF-GRU) [5]. This is a state-of-the-art pedestrian crossing prediction algorithm based on a multi-level recurrent architecture in which different modalities of data are infused and encoded gradually. This model relies on five modalities of data, namely pedestrians' appearances and surrounding context, pedestrians' poses and coordinates, and the ego-vehicle speed. We also report on a variation of SF-GRU in which we combine all data modalities and feed them to the bottom layer of spatially Stacked GRUs (S-GRU), similar to [43].

An alternative approach to stacked processing is to encode different modalities of data individually and combine them prior to classification. We follow the approach in [8] and process each input using a separate GRU without the second GRU for joint processing. We refer to this model as Multi-Stream GRU (MS-GRU).

All the poses are generated using the model in [44] pretrained on the MSCOCO dataset [45].

TABLE I: Performance of the proposed algorithm on the PIE and JAAD datasets.

                            PIE                       JAAD
Models              Acc  AUC  F1   Prec       Acc  AUC  F1   Prec
TF                  0.75 0.73 0.61 0.55       0.76 0.72 0.54 0.40
I3D [42]            0.63 0.58 0.42 0.37       0.79 0.71 0.49 0.42
MS-GRU [8]          0.86 0.85 0.77 0.71       0.81 0.79 0.59 0.48
S-GRU [43]          0.82 0.78 0.68 0.68       0.83 0.76 0.58 0.53
SF-GRU [5]          0.87 0.85 0.78 0.74       0.83 0.79 0.59 0.50
Ours (Sequential)   0.87 0.86 0.78 0.73       0.79 0.79 0.57 0.45
Ours (Atrous)       0.89 0.88 0.81 0.77       0.84 0.80 0.62 0.54
Ours (Multi-scale)  0.88 0.86 0.79 0.75       0.77 0.79 0.56 0.42

TABLE II: Performance of the proposed algorithm on PePScenes.

Models              Acc  AUC  F1   Prec
TF                  0.80 0.52 0.14 0.26
I3D [42]            0.84 0.60 0.33 0.53
MS-GRU [8]          0.85 0.66 0.46 0.59
S-GRU [43]          0.86 0.69 0.51 0.64
SF-GRU [5]          0.87 0.68 0.51 0.67
Ours (Sequential)   0.85 0.63 0.39 0.65
Ours (Atrous)       0.88 0.71 0.56 0.77
Ours (Multi-scale)  0.86 0.68 0.50 0.64

F. Prediction on 2D Datasets

We begin by evaluating the performance of the proposed models and the state-of-the-art algorithms discussed earlier on the 2D datasets. As shown in Table I, the proposed algorithm using atrous map processing achieves the best performance on both datasets. On PIE, the multi-scale map encoding technique achieves only small improvements, while the atrous method improves more on all metrics, especially on precision by 4%. On JAAD, only the atrous method offers improvements on all metrics, and the improvement gap is not as high as reported on PIE. This difference is to some extent expected as JAAD is not as diverse as PIE. For example, the majority of the samples in JAAD are pedestrians that are walking alongside the vehicle on sidewalks. This is also reflected in the changes in the performance of I3D. This method, despite using only visual features, achieves much better results on JAAD compared to PIE, which means there is a high degree of visual similarity between samples. Another possible explanation for the smaller improvement on JAAD is the lack of proper ego-motion information. JAAD only offers high-level actions of the driver, which are not as informative as the actual dynamics of the vehicle. The importance of the ego-vehicle dynamics is highlighted in Section IV-I.

G. Prediction on 3D Dataset

We repeat the same experiments as above using our newly annotated 3D dataset. The results of this experiment are summarized in Table II.
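For reference, the four reported metrics can be computed from predicted crossing probabilities as follows; this is a sketch using scikit-learn on hypothetical values, not the authors' evaluation code.

import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score, f1_score, precision_score

# Hypothetical ground-truth crossing labels and predicted crossing probabilities
y_true = np.array([1, 0, 0, 1, 0, 1, 0, 0])
y_prob = np.array([0.81, 0.22, 0.45, 0.66, 0.09, 0.58, 0.71, 0.30])
y_pred = (y_prob >= 0.5).astype(int)            # threshold probabilities at 0.5

print("Acc ", accuracy_score(y_true, y_pred))
print("AUC ", roc_auc_score(y_true, y_prob))    # AUC uses the raw probabilities
print("F1  ", f1_score(y_true, y_pred))
print("Prec", precision_score(y_true, y_pred))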
[Figure 4: example frames; rows labeled PIE, JAAD and PePScenes.]
Fig. 4: Performance of the proposed model on 2D and 3D datasets. Red and green boxes indicate future crossing or non-crossing actions respectively. The results are divided into correct predictions on the left and incorrect ones on the right.

As in the 2D experiment, the proposed model achieves state-of-the-art performance using atrous encoding. The improvements, however, are more noticeable; in particular, F1 and precision are increased by 5% and 10% respectively. This is because bird's eye view maps, first, contain finer details compared to 2D semantic maps, e.g. the location of objects, road boundaries, etc., and, second, encode a larger area around the vehicle. As a result, avoiding downsampling and using successively larger kernel sizes can capture information more effectively.

H. Qualitative Analysis

Figure 4 shows the performance of the proposed model on the 2D and 3D datasets. In this figure, we can see that two cases are challenging to predict: people who stand on the curbside or road and have no crossing intention, e.g. the person engaged in conversation or a construction worker operating next to the road (top row). The pedestrian's direction of motion can also be distracting. For example, a pedestrian might be moving towards the road or step onto the road but not cross, or might be standing at the intersection prior to crossing.

I. Ablation

Earlier we discussed the importance of multi-modal processing and how different sources of information help us capture different contextual elements in traffic scenes. Here, we conduct an ablation study on the final proposed model (with atrous encoding) using different input modalities. For this experiment, we use both a 2D dataset (PIE) and a 3D dataset (PePScenes). We selected PIE over JAAD because it has more diverse samples and also contains actual ego-vehicle dynamics information.

TABLE III: Ablation study on different feature modalities.

                             PIE                       PePScenes
Input Modality       Acc  AUC  F1   Prec       Acc  AUC  F1   Prec
Local Scene          0.67 0.54 0.29 0.37       0.83 0.57 0.27 0.46
Map + Local Scene    0.75 0.65 0.49 0.59       0.84 0.73 0.55 0.53
Ped. Motion          0.84 0.85 0.75 0.66       0.75 0.54 0.22 0.23
Ped. + Veh. Motion   0.85 0.86 0.77 0.69       0.82 0.72 0.51 0.47
Visual + Dynamics    0.89 0.88 0.81 0.77       0.88 0.71 0.56 0.77

Table III shows the results of this experiment. On the 2D dataset, the method achieves relatively high performance using only visual features. However, on all metrics, using dynamics information is superior. This is due to the fact that visual features in 2D space do not properly capture the dynamics of the scene, which are fundamental for action prediction. It should be noted that despite such a discrepancy, visual features in conjunction with dynamics information can help improve the results on all metrics, particularly on F1 and precision by 4% and 8% respectively.

The results on the 3D dataset show that, similar to the 2D data, using only images of pedestrians and their surroundings, the performance is relatively poor. However, by adding the map information a significant improvement can be achieved, especially on AUC, F1, and precision. This is primarily due to the fact that the bird's eye view map captures the changes in the dynamics of the ego-vehicle and its surroundings much more effectively than 2D semantic maps. In fact, when all metrics are considered, the performance of the model is similar when using visual or dynamics features. Despite this similarity of performance, there are still cases that are not learned properly using only one type of feature. This is highlighted in the performance of the model combining all features. Besides the AUC metric, the combined model performs better on F1 and significantly better on accuracy by 4% and precision by 24%. Considering all metrics, the combined model is clearly superior to the others.

V. CONCLUSION

In this work, we proposed a hybrid model for predicting pedestrian road-crossing actions. Our model benefits from both feedforward and recurrent architectures for encoding different input modalities that capture both the changes in the visual appearance and the dynamics of traffic scenes. Using common 2D pedestrian behavior benchmark datasets and our newly annotated 3D dataset, we showed that our proposed model achieves state-of-the-art performance across all metrics.

Furthermore, by conducting an ablation study on the proposed model, we showed how different sources of information can impact prediction accuracy. Our findings suggest that, even though dynamics information is dominant in predicting pedestrian behavior, visual features play a complementary role for prediction and can result in improved performance when combined with dynamics information.
REFERENCES

[1] A. Rasouli, I. Kotseruba, and J. K. Tsotsos, "Agreeing to cross: How drivers and pedestrians communicate," in Intelligent Vehicles Symposium (IV), 2017.
[2] A. Rasouli and J. K. Tsotsos, "Autonomous vehicles that interact with pedestrians: A survey of theory and practice," IEEE Transactions on Intelligent Transportation Systems, vol. 21, no. 3, pp. 900–918, 2019.
[3] M. Chaabane, A. Trabelsi, N. Blanchard, and R. Beveridge, "Looking ahead: Anticipating pedestrians crossing with future frames prediction," in WACV, 2020.
[4] S. Malla, B. Dariush, and C. Choi, "Titan: Future forecast using action priors," in CVPR, 2020.
[5] A. Rasouli, I. Kotseruba, and J. K. Tsotsos, "Pedestrian action anticipation using contextual feature fusion in stacked rnns," in BMVC, 2019.
[6] A. Rasouli, I. Kotseruba, T. Kunic, and J. K. Tsotsos, "Pie: A large-scale dataset and models for pedestrian intention estimation and trajectory prediction," in ICCV, 2019.
[7] J. Liang, L. Jiang, J. C. Niebles, A. G. Hauptmann, and L. Fei-Fei, "Peeking into the future: Predicting future person activities and locations in videos," in CVPR, 2019.
[8] M. S. Aliakbarian, F. S. Saleh, M. Salzmann, B. Fernando, L. Petersson, and L. Andersson, "VIENA: A driving anticipation dataset," in ACCV, C. V. Jawahar, H. Li, G. Mori, and K. Schindler, Eds., 2019.
[9] A. Rasouli, "Deep learning for vision-based prediction: A survey," arXiv:2007.00095, 2020.
[10] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom, "nuScenes: A multimodal dataset for autonomous driving," in CVPR, 2020.
[11] M. Liu, S. Tang, Y. Li, and J. Rehg, "Forecasting human object interaction: Joint prediction of motor attention and actions in first person video," in ECCV, 2020.
[12] A. Piergiovanni, A. Angelova, A. Toshev, and M. S. Ryoo, "Adversarial generative grammars for human activity prediction," in ECCV, 2020.
[13] H. Joo, T. Simon, M. Cikara, and Y. Sheikh, "Towards social artificial intelligence: Nonverbal social signal prediction in a triadic interaction," in CVPR, 2019.
[14] T. Yao, M. Wang, B. Ni, H. Wei, and X. Yang, "Multiple granularity group interaction prediction," in CVPR, 2018.
[15] M. Strickland, G. Fainekos, and H. B. Amor, "Deep predictive models for collision risk assessment in autonomous driving," in ICRA, 2018.
[16] K.-H. Zeng, S.-H. Chou, F.-H. Chan, J. Carlos Niebles, and M. Sun, "Agent-centric risk assessment: Accident anticipation and risky region localization," in CVPR, 2017.
[17] D. Epstein, B. Chen, and C. Vondrick, "Oops! predicting unintentional action in video," in CVPR, 2020.
[18] Y. Ma, X. Zhu, X. Cheng, R. Yang, J. Liu, and D. Manocha, "Autotrajectory: Label-free trajectory extraction and prediction from videos using dynamic points," in ECCV, 2020.
[19] M. Qi, J. Qin, Y. Wu, and Y. Yang, "Imitative non-autoregressive modeling for trajectory forecasting and imputation," in CVPR, 2020.
[20] P. Felsen, P. Agrawal, and J. Malik, "What will happen next? forecasting player moves in sports videos," in ICCV, 2017.
[21] L. Fang, Q. Jiang, J. Shi, and B. Zhou, "Tpnet: Trajectory proposal network for motion prediction," in CVPR, 2020.
[22] M. Liang, B. Yang, W. Zeng, Y. Chen, R. Hu, S. Casas, and R. Urtasun, "Pnpnet: End-to-end perception and prediction with tracking in the loop," in CVPR, 2020.
[23] T. Phan-Minh, E. C. Grigore, F. A. Boulton, O. Beijbom, and E. M. Wolff, "Covernet: Multimodal behavior prediction using trajectory sets," in CVPR, 2020.
[24] R. Chandra, U. Bhattacharya, A. Bera, and D. Manocha, "Traphic: Trajectory prediction in dense and heterogeneous traffic using weighted interactions," in CVPR, 2019.
[25] N. Rhinehart, K. M. Kitani, and P. Vernaza, "R2p2: A reparameterized pushforward policy for diverse, precise generative path forecasting," in ECCV, 2018.
[26] P. Gujjar and R. Vaughan, "Classifying pedestrian actions in advance using predicted video of urban driving scenes," in ICRA, 2019.
[27] S. Casas, W. Luo, and R. Urtasun, "Intentnet: Learning to predict intention from raw sensor data," in CoRL, 2018.
[28] K. Saleh, M. Hossny, and S. Nahavandi, "Real-time intent prediction of pedestrians for autonomous ground vehicles via spatio-temporal densenet," in ICRA, 2019.
[29] A. Rasouli, I. Kotseruba, and J. K. Tsotsos, "Are they going to cross? a benchmark dataset and baseline for pedestrian crosswalk behavior," in ICCVW, 2017.
[30] B. Liu, E. Adeli, Z. Cao, K.-H. Lee, A. Shenoi, A. Gaidon, and J. C. Niebles, "Spatiotemporal relationship reasoning for pedestrian intent prediction," IEEE Robotics and Automation Letters, vol. 5, no. 2, pp. 3485–3492, 2020.
[31] J. Liang, L. Jiang, K. Murphy, T. Yu, and A. Hauptmann, "The garden of forking paths: Towards multi-future trajectory prediction," in CVPR, 2020.
[32] P. Sun, H. Kretzschmar, X. Dotiwalla, A. Chouard, V. Patnaik, P. Tsui, J. Guo, Y. Zhou, Y. Chai, B. Caine, V. Vasudevan, W. Han, J. Ngiam, H. Zhao, A. Timofeev, S. Ettinger, M. Krivokon, A. Gao, A. Joshi, Y. Zhang, J. Shlens, Z. Chen, and D. Anguelov, "Scalability in perception for autonomous driving: Waymo open dataset," in CVPR, 2020.
[33] W. Liu, W. Luo, D. Lian, and S. Gao, "Future frame prediction for anomaly detection – a new baseline," in CVPR, 2018.
[34] A. Robicquet, A. Sadeghian, A. Alahi, and S. Savarese, "Learning social etiquette: Human trajectory understanding in crowded scenes," in ECCV, 2016.
[35] A. Lerner, Y. Chrysanthou, and D. Lischinski, "Crowds by example," Computer Graphics Forum, vol. 26, no. 3, pp. 655–664, 2007.
[36] L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam, "Rethinking atrous convolution for semantic image segmentation," arXiv:1706.05587, 2017.
[37] M.-T. Luong, H. Pham, and C. D. Manning, "Effective approaches to attention-based neural machine translation," arXiv:1508.04025, 2015.
[38] A. Rasouli, I. Kotseruba, and J. K. Tsotsos, "It's not all about size: On the role of data properties in pedestrian detection," in ECCVW, 2018.
[39] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, "The cityscapes dataset for semantic urban scene understanding," in CVPR, 2016.
[40] H. Cui, V. Radosavljevic, F.-C. Chou, T.-H. Lin, T. Nguyen, T.-K. Huang, J. Schneider, and N. Djuric, "Multimodal trajectory predictions for autonomous driving using deep convolutional networks," in ICRA, 2019.
[41] T. Tieleman and G. Hinton, "Lecture 6.5-rmsprop, coursera: Neural networks for machine learning," Tech. Rep., 2012.
[42] J. Carreira and A. Zisserman, "Quo vadis, action recognition? a new model and the kinetics dataset," in CVPR, 2017.
[43] J. Yue-Hei Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici, "Beyond short snippets: Deep networks for video classification," in CVPR, 2015.
[44] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh, "Realtime multi-person 2d pose estimation using part affinity fields," in CVPR, 2017.
[45] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common objects in context," in ECCV, 2014.
