Multi-Modal Hybrid Architecture For Pedestrian Action Prediction
We propose a novel multi-modal prediction algorithm that incorporates different sources of information captured from the environment to predict future crossing actions of pedestrians. The proposed model benefits from a hybrid learning architecture consisting of feedforward and recurrent networks for analyzing visual features of the environment and the dynamics of the scene. Using the existing 2D pedestrian behavior benchmarks and a newly annotated 3D driving dataset, we show that our proposed model achieves state-of-the-art performance in pedestrian crossing prediction.

Fig. 1: Predicting pedestrian crossing in front of the ego-vehicle based on various contextual information: the local scene around the pedestrian, a semantic map of the environment, and the pedestrian and ego-vehicle dynamics.
[Fig. 2 diagram: the local scene and semantic map inputs pass through Conv2D encoders (Conv1 64x3x1, Conv2 64x3x4, Conv3 128x3x2, Conv4 256x3x2) and dense layers into the Visual Attention Module (VAM); pedestrian and ego-vehicle motion pass through LSTMs into the Dynamic Attention Module (DAM); the attention outputs are fused (late fusion) into a joint representation for pedestrian action prediction.]
Fig. 2: The architecture of the proposed model. The model processes four different input modalities: semantic maps of the
environment (2D or bird’s eye view), local scene images of pedestrians and their motion (2D spatial or 3D global coordinates
+ velocity), and ego-vehicle motion. Semantic maps and images are encoded with two sets of Conv2D layers. The outputs
of these layers are combined and fed into another Conv2D layer followed by a dense embedding layer to form visual
representations which are weighted using the Visual Attention Module (VAM). The motion of pedestrians and the ego-vehicle is encoded using two LSTMs, the outputs of which are concatenated and fed into the Dynamic Attention Module (DAM). The
final joint representation is formed by concatenating the outputs of the attention modules. The specifications of conv layers
for scene encoding are given as [number of filters, kernel size, stride].
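To make the data flow in the caption above more concrete, the following is a minimal sketch of how the four modalities could be wired together in tf.keras. It is an illustration rather than the authors' implementation: the input shapes, sequence length, LSTM widths, and the simplified sigmoid gating used in place of the VAM and DAM modules are assumptions, and only the rough layout and the scene-encoder [number of filters, kernel size, stride] values follow Fig. 2.

```python
# A minimal tf.keras sketch of the hybrid architecture in Fig. 2 (not the authors'
# implementation). Input shapes, sequence length, LSTM widths, and the simplified
# element-wise attention standing in for VAM/DAM are assumptions.
import tensorflow as tf
from tensorflow.keras import layers, Model

T = 15                  # assumed number of observed time steps
IMG = (128, 128, 3)     # assumed resolution of local-scene and map inputs

def conv_branch(name):
    # One set of Conv2D layers for a visual input (local scene or semantic map).
    inp = layers.Input(shape=IMG, name=name)
    x = layers.Conv2D(64, 3, strides=1, activation="relu")(inp)
    x = layers.Conv2D(64, 3, strides=4, activation="relu")(x)
    x = layers.Conv2D(128, 3, strides=2, activation="relu")(x)
    return inp, x

def attention_weighting(x, name):
    # Simplified stand-in for the VAM/DAM modules: learn per-feature weights
    # with a sigmoid gate and rescale the representation.
    w = layers.Dense(x.shape[-1], activation="sigmoid", name=name)(x)
    return layers.Multiply()([x, w])

# Visual branch: encode map and local scene, combine, embed, apply VAM.
map_in, map_feat = conv_branch("semantic_map")
scene_in, scene_feat = conv_branch("local_scene")
v = layers.Concatenate()([map_feat, scene_feat])
v = layers.Conv2D(256, 3, strides=2, activation="relu")(v)   # joint Conv2D layer
v = layers.GlobalAveragePooling2D()(v)
v = layers.Dense(256, activation="relu")(v)                  # dense embedding
v = attention_weighting(v, "VAM")

# Dynamics branch: pedestrian and ego-vehicle motion through two LSTMs, then DAM.
ped_in = layers.Input(shape=(T, 4), name="ped_motion")   # e.g. 2D bbox or 3D coords + velocity
ego_in = layers.Input(shape=(T, 2), name="ego_motion")   # e.g. speed and yaw rate (assumed)
d = layers.Concatenate()([layers.LSTM(128)(ped_in), layers.LSTM(128)(ego_in)])
d = attention_weighting(d, "DAM")

# Late fusion of the two representations and binary crossing prediction.
joint = layers.Concatenate()([v, d])
out = layers.Dense(1, activation="sigmoid", name="will_cross")(joint)

model = Model([map_in, scene_in, ped_in, ego_in], out)
model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"])
```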
processing. The specifications are shown in Figure 3.

Local scene. We use scene images to capture the changes in the appearance of pedestrians and their local surroundings. At a given time step, we extract a region around each pedestrian by scaling up the corresponding 2D spatial bounding

[Fig. 3 diagram: scene/map encoder variants with conv layers 32x3x4, 64x3x2, 128x3x2 [filters, kernel size, stride], atrous rates 2, 4, 8, a 16x16 pooling layer, and a deconvolution followed by conv layers 128x3x4 and 256x3x4.]
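As a rough illustration of the local-scene extraction described above, the sketch below crops a region around a pedestrian by enlarging its 2D bounding box; the scale factor and output resolution are illustrative assumptions rather than values from the paper.

```python
# Crop a local-scene region around a pedestrian by scaling up its 2D bounding box.
import numpy as np
import cv2

def extract_local_scene(frame: np.ndarray, bbox, scale: float = 1.5, out_size=(128, 128)):
    """frame: HxWx3 image; bbox: (x1, y1, x2, y2) pedestrian box in pixels."""
    x1, y1, x2, y2 = bbox
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    w, h = (x2 - x1) * scale, (y2 - y1) * scale          # enlarge the box around its center
    H, W = frame.shape[:2]
    # Clip the enlarged box to the image boundaries before cropping.
    nx1, ny1 = max(int(cx - w / 2), 0), max(int(cy - h / 2), 0)
    nx2, ny2 = min(int(cx + w / 2), W), min(int(cy + h / 2), H)
    crop = frame[ny1:ny2, nx1:nx2]
    return cv2.resize(crop, out_size)                     # resize for the Conv2D encoder
```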
[Fig. 4 image panels: PIE, JAAD, PePScenes]

Fig. 4: Performance of the proposed model on 2D and 3D datasets. Red and green boxes indicate future crossing or non-crossing actions respectively. The results are divided into correct predictions on the left and incorrect ones on the right.
the proposed model achieves state-of-the-art performance using atrous encoding. The improvements, however, are more noticeable; in particular, F1 and precision are increased by 5% and 10% respectively. This is because bird's eye view maps, first, contain finer details compared to 2D semantic maps, e.g. the location of objects, road boundaries, etc., and second, encode a larger area around the vehicle. As a result, avoiding downsampling and using successively larger kernel sizes can capture information more effectively.
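A minimal sketch of what such an atrous scene encoding could look like is given below; the dilation rates 2, 4, and 8 follow the labels in Fig. 3, while the input resolution, filter counts, and purely sequential arrangement are assumptions made for illustration rather than the exact configuration used in the paper.

```python
# Illustrative atrous (dilated) scene encoder: stride-1 convolutions with growing
# dilation rates enlarge the receptive field without downsampling the feature map.
from tensorflow.keras import Sequential, layers

atrous_encoder = Sequential([
    layers.Conv2D(32, 3, dilation_rate=2, padding="same", activation="relu",
                  input_shape=(128, 128, 3)),   # assumed input resolution
    layers.Conv2D(64, 3, dilation_rate=4, padding="same", activation="relu"),
    layers.Conv2D(128, 3, dilation_rate=8, padding="same", activation="relu"),
])
```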
H. Qualitative Analysis

Figure 4 shows the performance of the proposed model on 2D and 3D datasets. In this figure, we can see that two cases are challenging to predict: people who are standing on the curbside or road and have no crossing intention, e.g. a person engaged in conversation or a construction worker operating next to the road (top row). The pedestrian's direction of motion can also be distracting. For example, a pedestrian might be moving towards the road or step onto the road but not cross, or might be standing at the intersection prior to crossing.

I. Ablation

Earlier we discussed the importance of multi-modal processing and how different sources of information help us capture different contextual elements in traffic scenes. Here, we conduct an ablation study on the final proposed model (with atrous encoding) using different input modalities. For this experiment, we use both a 2D dataset (PIE) and a 3D dataset (PePScenes). We selected PIE over JAAD because it has more diverse samples and it also contains actual ego-vehicle dynamics information.

Table III shows the results of this experiment. On the 2D dataset, the method achieves relatively high performance using only visual features. However, on all metrics, using dynamics information is superior. This is due to the fact that visual features in 2D space do not properly capture the dynamics of the scene, which are fundamental for action prediction. It should be noted that despite such a discrepancy, visual features in conjunction with dynamics information can help improve the results on all metrics, particularly on F1 and precision by 4% and 8% respectively.

TABLE III: Ablation study on different feature modalities.

                             PIE                      PePScenes
Input Modality         Acc   AUC   F1    Prec    Acc   AUC   F1    Prec
Local Scene            0.67  0.54  0.29  0.37    0.83  0.57  0.27  0.46
Map + Local Scene      0.75  0.65  0.49  0.59    0.84  0.73  0.55  0.53
Ped. Motion            0.84  0.85  0.75  0.66    0.75  0.54  0.22  0.23
Ped. + Veh. Motion     0.85  0.86  0.77  0.69    0.82  0.72  0.51  0.47
Visual + Dynamics      0.89  0.88  0.81  0.77    0.88  0.71  0.56  0.77
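As a side note on how the numbers in Table III are typically obtained, the sketch below computes accuracy, AUC, F1, and precision from ground-truth crossing labels and predicted crossing probabilities using scikit-learn; the 0.5 decision threshold is an assumption, not a value stated in the paper.

```python
# Computing the Acc/AUC/F1/Prec columns of Table III from binary crossing labels
# (1 = will cross) and predicted crossing probabilities.
from sklearn.metrics import accuracy_score, roc_auc_score, f1_score, precision_score

def crossing_metrics(y_true, y_prob, threshold=0.5):
    y_pred = [int(p >= threshold) for p in y_prob]   # binarize the probabilities
    return {
        "Acc": accuracy_score(y_true, y_pred),
        "AUC": roc_auc_score(y_true, y_prob),        # AUC uses the raw scores
        "F1": f1_score(y_true, y_pred),
        "Prec": precision_score(y_true, y_pred),
    }
```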
The results on the 3D dataset show that, similar to the 2D data, the performance is relatively poor when using only images of pedestrians and their surroundings. However, by adding the map information, a significant improvement can be achieved, especially on AUC, F1, and precision. This is primarily due to the fact that the bird's eye view map captures the changes in the dynamics of the ego-vehicle and its surroundings much more effectively compared to 2D semantic maps. In fact, when all metrics are considered, the performance of the model is similar when using visual or dynamics features. Despite such similarity of performance, there are still cases that are not learned properly using only one type of feature. This is highlighted in the performance of the model combining all features. Besides the AUC metric, the combined model performs better on F1 and significantly better on accuracy by 4% and precision by 24%. Considering all metrics, the combined model is clearly superior to others.

V. CONCLUSION

In this work, we proposed a hybrid model for predicting pedestrian road-crossing action. Our model benefits from both feedforward and recurrent architectures for encoding different input modalities that capture both the changes in the visual appearance and dynamics of the traffic scenes. Using common 2D pedestrian behavior benchmark datasets and our newly annotated 3D dataset, we showed that our proposed model achieves state-of-the-art performance across all metrics.

Furthermore, by conducting an ablation study on the proposed model, we showed how different sources of information can impact prediction accuracy. Our findings suggest that, even though dynamics information is dominant in predicting pedestrian behavior, visual features play a complementary role for prediction and can result in improved performance when combined with dynamics information.