Gao VectorNet Encoding HD Maps and Agent Dynamics From Vectorized Representation CVPR 2020 Paper
[Figure 2 diagram. Panels, left to right: Input vectors; Polyline subgraphs; Global interaction graph; Supervision & Prediction. Node labels: Crosswalk, Lane, Agent. Output labels: Map Completion, Agent Feature, Trajectory Prediction.]
Figure 2. An overview of our proposed VectorNet. Observed agent trajectories and map features are represented as sequences of vectors,
and passed to a local graph network to obtain polyline-level features. Such features are then passed to a fully-connected graph to model
the higher-order interactions. We compute two types of losses: predicting future trajectories from the node features corresponding to the
moving agents, and predicting the node features when their features are masked out.
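The two losses named in the caption are combined in Eq. (9) of Section 3.4 as L = Ltraj + αLnode, with a negative Gaussian log-likelihood for the trajectory term, a Huber loss for the node completion term, and α = 1.0. The sketch below is a minimal pure-Python illustration of that sum, not the authors' implementation. The fixed unit variance `sigma`, the Huber threshold `delta`, the mean reduction, and all function names are assumptions made for illustration.

```python
import math

def gaussian_nll(pred, truth, sigma=1.0):
    """Negative log-likelihood of `truth` under N(pred, sigma^2), summed over coordinates.

    A fixed, shared sigma is an assumption of this sketch.
    """
    total = 0.0
    for p, t in zip(pred, truth):
        total += 0.5 * ((t - p) / sigma) ** 2 + math.log(sigma) + 0.5 * math.log(2.0 * math.pi)
    return total

def huber(pred, truth, delta=1.0):
    """Mean Huber loss: quadratic for small residuals, linear for large ones."""
    total = 0.0
    for p, t in zip(pred, truth):
        r = abs(p - t)
        total += 0.5 * r * r if r <= delta else delta * (r - 0.5 * delta)
    return total / len(pred)

def total_loss(traj_pred, traj_true, node_pred, node_true, alpha=1.0):
    """L = L_traj + alpha * L_node, following Eq. (9)."""
    return gaussian_nll(traj_pred, traj_true) + alpha * huber(node_pred, node_true)
```

Both arguments of each function are flat lists of floats, for example the concatenated per-step coordinate offsets of a trajectory or the features of one masked polyline node.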
agent dynamics and structured scene context directly from their vectorized form (Figure 1, right). The geographic extent of the road features can be a point, a polygon, or a curve in geographic coordinates. For example, a lane boundary contains multiple control points that build a spline; a crosswalk is a polygon defined by several points; a stop sign is represented by a single point. All these geographic entities can be closely approximated as polylines defined by multiple control points, along with their attributes. Similarly, the dynamics of moving agents can also be approximated by polylines based on their motion trajectories. All these polylines can then be represented as sets of vectors.

We use graph neural networks (GNNs) to incorporate these sets of vectors. We treat each vector as a node in the graph, and set the node features to be the start location and end location of each vector, along with other attributes such as polyline group id and semantic labels. The context information from HD maps, along with the trajectories of other moving agents, is propagated to the target agent node through the GNN. We can then take the output node feature corresponding to the target agent to decode its future trajectories.

Specifically, to learn competitive representations with GNNs, we observe that it is important to constrain the connectivities of the graph based on the spatial and semantic proximity of the nodes. We therefore propose a hierarchical graph architecture, where the vectors belonging to the same polylines with the same semantic labels are connected and embedded into polyline features, and all polylines are then fully connected with each other to exchange information. We implement the local graphs with multi-layer perceptrons, and the global graphs with self-attention [30]. An overview of our approach is shown in Figure 2.

Finally, motivated by the recent success of self-supervised learning from sequential linguistic [11] and visual data [27], we propose an auxiliary graph completion objective in addition to the behavior prediction objective. More specifically, we randomly mask out the input node features belonging to either scene context or agent trajectories, and ask the model to reconstruct the masked features. The intuition is to encourage the graph networks to better capture the interactions between agent dynamics and scene context. In summary, our contributions are:

• We are the first to demonstrate how to directly incorporate vectorized scene context and agent dynamics information for behavior prediction.
• We propose the hierarchical graph network VectorNet and the node completion auxiliary task.
• We evaluate the proposed method on our in-house behavior prediction dataset and the Argoverse dataset, and show that our method achieves on par or better performance over a competitive rendering baseline with 70% model size saving and an order of magnitude reduction in FLOPs. Our method also achieves state-of-the-art performance on Argoverse.

2. Related work

Behavior prediction for autonomous driving. Behavior prediction for moving agents has become increasingly important for autonomous driving applications [7, 9, 19], and high-fidelity maps have been widely used to provide context information. For example, IntentNet [5] proposes to jointly detect vehicles and predict their trajectories from LiDAR points and rendered HD maps. Hong et al. [15] assume that vehicle detections are provided and focus on behavior prediction by encoding entity interactions with ConvNets. Similarly, MultiPath [6] also uses ConvNets as encoder,
but adopts pre-defined trajectory anchors to regress multiple possible future trajectories. PRECOG [23] attempts to capture the future stochasticity by flow-based generative models. Similar to [6, 15, 23], we also assume the agent detections to be provided by an existing perception algorithm. However, unlike these methods which all use ConvNets to encode rendered road maps, we propose to directly encode vectorized scene context and agent dynamics.

Forecasting multi-agent interactions. Beyond the autonomous driving domain, there is more general interest in predicting the intents of interacting agents, such as pedestrians [2, 13, 24], human activities [28] or sports players [12, 26, 32, 33]. In particular, Social LSTM [2] models the trajectories of individual agents as separate LSTM networks, and aggregates the LSTM hidden states based on spatial proximity of the agents to model their interactions. Social GAN [13] simplifies the interaction module and proposes an adversarial discriminator to predict diverse futures. Sun et al. [26] combine graph networks [4] with variational RNNs [8] to model diverse interactions. The social interactions can also be inferred from data: Kipf et al. [18] treat such interactions as latent variables; and graph attention networks [16, 31] apply a self-attention mechanism to weight the edges in a pre-defined graph. Our method goes one step further by proposing a unified hierarchical graph network to jointly model the interactions of multiple agents, and their interactions with the entities from road maps.

Representation learning for sets of entities. Traditionally, machine perception algorithms have focused on high-dimensional continuous signals, such as images, videos or audio. One exception is 3D perception, where the inputs are usually in the form of unordered point sets, given by depth sensors. For example, Qi et al. propose the PointNet model [20] and PointNet++ [21] to apply permutation invariant operations (e.g. max pooling) on learned point embeddings. Unlike point sets, entities on HD maps and agent trajectories form closed shapes or are directed, and they may also be associated with attribute information. We therefore propose to keep such information by vectorizing the inputs, and encode the attributes as node features in a graph.

Self-supervised context modeling. Recently, many works in the NLP domain have proposed modeling language context in a self-supervised fashion [11, 22]. Their learned representations achieve significant performance improvement when transferred to downstream tasks. Inspired by these methods, we propose an auxiliary loss for graph representations, which learns to predict the missing node features from its neighbors. The goal is to incentivize the model to better capture interactions among nodes.

3. VectorNet approach

This section introduces our VectorNet approach. We first describe how to vectorize agent trajectories and HD maps. Next we present the hierarchical graph network which aggregates local information from individual polylines and then globally over all trajectories and map features. This graph can then be used for behavior prediction.

3.1. Representing trajectories and maps

Most of the annotations from an HD map are in the form of splines (e.g. lanes), closed shapes (e.g. regions of intersections) and points (e.g. traffic lights), with additional attribute information such as the semantic labels of the annotations and their current states (e.g. color of the traffic light, speed limit of the road). For agents, their trajectories are in the form of directed splines with respect to time. All of these elements can be approximated as sequences of vectors: for map features, we pick a starting point and direction, uniformly sample key points from the splines at the same spatial distance, and sequentially connect the neighboring key points into vectors; for trajectories, we can just sample key points with a fixed temporal interval (0.1 second), starting from t = 0, and connect them into vectors. Given small enough spatial or temporal intervals, the resulting polylines serve as close approximations of the original map and trajectories.

Our vectorization process is a one-to-one mapping between continuous trajectories, map annotations and the vector set, although the latter is unordered. This allows us to form a graph representation on top of the vector sets, which can be encoded by graph neural networks. More specifically, we treat each vector $\mathbf{v}_i$ belonging to a polyline $\mathcal{P}_j$ as a node in the graph with node features given by

$$\mathbf{v}_i = [\mathbf{d}_i^s, \mathbf{d}_i^e, \mathbf{a}_i, j], \tag{1}$$

where $\mathbf{d}_i^s$ and $\mathbf{d}_i^e$ are coordinates of the start and end points of the vector, and $\mathbf{d}$ itself can be represented as (x, y) for 2D coordinates or (x, y, z) for 3D coordinates; $\mathbf{a}_i$ corresponds to attribute features, such as object type, timestamps for trajectories, or road feature type or speed limit for lanes; $j$ is the integer id of $\mathcal{P}_j$, indicating $\mathbf{v}_i \in \mathcal{P}_j$.

To make the input node features invariant to the locations of target agents, we normalize the coordinates of all vectors to be centered around the location of the target agent at its last observed time step. A direction for future work is to share the coordinate centers for all interacting agents, such that their trajectories can be predicted in parallel.

3.2. Constructing the polyline subgraphs

To exploit the spatial and semantic locality of the nodes, we take a hierarchical approach by first constructing subgraphs at the vector level, where all vector nodes belonging to the same polyline are connected with each other. Considering a polyline $\mathcal{P}$ with its nodes $\{\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_P\}$, we define a single layer of subgraph propagation operation as

$$\mathbf{v}_i^{(l+1)} = \varphi_{\text{rel}}\left( g_{\text{enc}}(\mathbf{v}_i^{(l)}),\ \varphi_{\text{agg}}\left( \{ g_{\text{enc}}(\mathbf{v}_j^{(l)}) \} \right) \right) \tag{2}$$
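To make Eqs. (1) and (2) concrete, here is a minimal pure-Python sketch, not the authors' implementation. It assumes 2D key points, a single linear layer with ReLU standing in for the encoder g_enc, element-wise max pooling as the permutation-invariant aggregator ϕ_agg, and concatenation as ϕ_rel. The function names and the toy weights in the usage note are hypothetical.

```python
def polyline_to_vectors(points, attrs, polyline_id, origin=(0.0, 0.0)):
    """Build Eq. (1) node features from an ordered list of 2D key points.

    Each node is [x_start, y_start, x_end, y_end, *attrs, polyline_id], with
    coordinates shifted so that `origin` (for example the target agent's last
    observed position) maps to (0, 0).
    """
    ox, oy = origin
    vectors = []
    for (xs, ys), (xe, ye) in zip(points[:-1], points[1:]):
        vectors.append([xs - ox, ys - oy, xe - ox, ye - oy, *attrs, float(polyline_id)])
    return vectors

def linear_relu(x, weight, bias):
    """Stand-in for g_enc: y = ReLU(W x + b); `weight` is a list of rows."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weight, bias)]

def subgraph_layer(nodes, weight, bias):
    """One propagation step in the spirit of Eq. (2) over the nodes of one polyline."""
    encoded = [linear_relu(v, weight, bias) for v in nodes]   # g_enc per node
    pooled = [max(column) for column in zip(*encoded)]        # assumed phi_agg: max pool
    return [e + pooled for e in encoded]                      # assumed phi_rel: concat
```

For the polyline (10, 5) → (11, 5) → (12, 6) with origin (12, 6), `polyline_to_vectors` returns two nodes expressed relative to the origin. `subgraph_layer` doubles the feature width, because each node's encoding is concatenated with the pooled summary of its polyline.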
[Figure labels: Permutation Invariant Aggregator; Concat; Output Node Features]

where $\{\mathbf{p}_i^{(l)}\}$ is the set of polyline node features, GNN(·) corresponds to a single layer of a graph neural network, and A corresponds to the adjacency matrix for the set of polyline nodes.

The adjacency matrix A can be provided as a heuristic, such as using the spatial distances [2] between the nodes. For simplicity, we assume A to be a fully-connected graph. Our graph network is implemented as a self-attention operation [30]:
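As a rough illustration of such a layer, the sketch below applies generic single-head self-attention in the style of [30] to a list of polyline features, with every polyline attending to every other one (a fully-connected A). It is an assumption-laden sketch rather than the paper's exact formulation: the projection matrices `w_q`, `w_k`, `w_v` and all function names are hypothetical, and the scaling factor used in [30] is omitted for brevity. The L2 normalization of the inputs follows the remark in Section 3.4 that polyline node features are L2 normalized before the global graph.

```python
import math

def l2_normalize(vec, eps=1e-12):
    """Scale a feature vector to unit L2 norm (eps guards against division by zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, eps) for x in vec]

def project(weight, vec):
    """Linear projection; `weight` is a list of rows."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def global_attention(polyline_feats, w_q, w_k, w_v):
    """One attention layer over all polylines, treated as a fully connected graph."""
    feats = [l2_normalize(p) for p in polyline_feats]
    queries = [project(w_q, p) for p in feats]
    keys = [project(w_k, p) for p in feats]
    values = [project(w_v, p) for p in feats]
    outputs = []
    for q in queries:
        # Attention weights of this polyline over every polyline, itself included.
        weights = softmax([sum(a * b for a, b in zip(q, k)) for k in keys])
        outputs.append([sum(w * v[d] for w, v in zip(weights, values))
                        for d in range(len(values[0]))])
    return outputs
```

The output has one row per polyline; the row belonging to the target agent is the one a trajectory decoder would consume.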
processing, which predicts missing tokens based on bidirectional context from discrete and sequential text data. We generalize this training objective to work with unordered graphs. Unlike several recent methods (e.g. [25]) that generalize the BERT objective to unordered image patches with pre-computed visual features, our node features are jointly optimized in an end-to-end framework.

3.4. Overall framework

Once the hierarchical graph network is constructed, we optimize for the multi-task training objective

$$\mathcal{L} = \mathcal{L}_{\text{traj}} + \alpha \mathcal{L}_{\text{node}} \tag{9}$$

where $\mathcal{L}_{\text{traj}}$ is the negative Gaussian log-likelihood for the groundtruth future trajectories, $\mathcal{L}_{\text{node}}$ is the Huber loss between predicted node features and groundtruth masked node features, and α = 1.0 is a scalar that balances the two loss terms. To avoid trivial solutions for $\mathcal{L}_{\text{node}}$ by lowering the magnitude of node features, we L2 normalize the polyline node features before feeding them to the global graph network.

Our predicted trajectories are parameterized as per-step coordinate offsets, starting from the last observed location. We rotate the coordinate system based on the heading of the target vehicle at the last observed location.

4. Experiments

In this section, we first describe the experimental settings, including the datasets, metrics and rasterized + ConvNets baseline. Secondly, comprehensive ablation studies are done for both the rasterized baseline and VectorNet. Thirdly, we compare and discuss the computation cost, including FLOPs and number of parameters. Finally, we compare the performance with state-of-the-art methods.

4.1. Experimental setup

4.1.1 Datasets

We report results on two vehicle behavior prediction benchmarks, the recently released Argoverse dataset [7] and our in-house behavior prediction dataset.

Argoverse motion forecasting [7] is a dataset designed for vehicle behavior prediction with trajectory histories. There are 333K 5-second long sequences split into 211K training, 41K validation and 80K testing sequences. The creators curated this dataset by mining interesting and diverse scenarios, such as yielding for a merging vehicle, crossing an intersection, etc. The trajectories are sampled at 10Hz, with (0, 2] seconds used as observation and (2, 5] seconds for trajectory prediction. Each sequence has one “interesting” agent whose trajectory is the prediction target. In addition to vehicle trajectories, each sequence is also associated with map information. The future trajectories of the test set are held out. Unless otherwise mentioned, our ablation study reports performance on the validation set.

In-house dataset is a large-scale dataset collected for behavior prediction. It contains HD map data, bounding box and tracking annotations from an automatic in-house perception system, and manually labeled vehicle trajectories. The total numbers of vehicle trajectories are 2.2M and 0.55M for the train and test sets, respectively. Each trajectory has a length of 4 seconds, where the (0, 1] second is the history trajectory used as observation, and (1, 4] seconds are the target future trajectories to be evaluated. The trajectories are sampled from real world vehicles’ behaviors, including stationary, going straight, turning, lane change and reversing, and roughly preserve the natural distribution of driving scenarios. For the HD map features, we include lane boundaries, stop/yield signs, crosswalks and speed bumps.

For both datasets, the input history trajectories are derived from automatic perception systems and are thus noisy. Argoverse’s future trajectories are also machine generated, while In-house has manually labeled future trajectories.

4.1.2 Metrics

For evaluation we adopt the widely used Average Displacement Error (ADE) computed over the entire trajectories and the Displacement Error at t (DE@ts) metric, where t ∈ {1.0, 2.0, 3.0} seconds. The displacements are measured in meters.

4.1.3 Baseline with rasterized images

We render N consecutive past frames, where N is 10 for the in-house dataset and 20 for the Argoverse dataset. Each frame is a 400×400×3 image, which has road map information and the detected object bounding boxes. 400 pixels correspond to 100 meters in the in-house dataset, and 130 meters in the Argoverse dataset. Rendering is based on the position of the self-driving vehicle in the last observed frame; the self-driving vehicle is placed at the coordinate location (200, 320) in the in-house dataset, and (200, 200) in the Argoverse dataset. All N frames are stacked together to form a 400×400×3N image as model input.

Our baseline uses a ConvNet to encode the rasterized images, whose architecture is comparable to IntentNet [5]: we use a ResNet-18 [14] as the ConvNet backbone. Unlike IntentNet, we do not use the LiDAR inputs. To obtain vehicle-centric features, we crop the feature patch around the target vehicle from the convolutional feature map, and average pool over all the spatial locations of the cropped feature map to get a single vehicle feature vector. We empirically observe that using a deeper ResNet model or rotating the cropped features based on target vehicle headings does not lead to better performance. The vehicle features are
then fed into a fully connected layer (as used by IntentNet) to predict the future coordinates in parallel. The model is optimized on 8 GPUs with synchronous training. We use the Adam optimizer [17] and decay the learning rate every 5 epochs by a factor of 0.3. We train the model for a total of 25 epochs with an initial learning rate of 0.001.

To test how convolutional receptive fields and feature cropping strategies influence the performance, we conduct an ablation study on the network receptive field, feature cropping strategy and input image resolutions.

4.1.4 VectorNet with vectorized representations

To ensure a fair comparison, the vectorized representation takes as input the same information as the rasterized representation. Specifically, we extract exactly the same set of map features as when rendering. We also make sure that the visible road feature vectors for a target agent are the same as in the rasterized representation. However, the vectorized representation does enjoy the benefit of incorporating more complex road features which are non-trivial to render.

Unless otherwise mentioned, we use three graph layers for the polyline subgraphs, and one graph layer for the global interaction graph. The number of hidden units in all MLPs is fixed to 64. The MLPs are followed by layer normalization and ReLU nonlinearity. We normalize the vector coordinates to be centered around the location of the target vehicle at the last observed time step. Similar to the rasterized model, VectorNet is trained on 8 GPUs synchronously with the Adam optimizer. The learning rate is decayed every 5 epochs by a factor of 0.3; we train the model for a total of 25 epochs with an initial learning rate of 0.001.

To understand the impact of the components on the performance of VectorNet, we conduct ablation studies on the type of context information, i.e. whether to use only the map or also the trajectories of other agents, as well as the impact of the number of graph layers for the polyline subgraphs and global interaction graphs.

4.2. Ablation study for the ConvNet baseline

We conduct ablation studies on the impact of ConvNet receptive fields, feature cropping strategies, and the resolution of the rasterized images.

Impact of receptive fields. As behavior prediction often requires capturing long range road context, the convolutional receptive field could be critical to the prediction quality. We evaluate different variants to see how two key factors of receptive fields, convolutional kernel sizes and feature cropping strategies, affect the prediction performance. The results are shown in Table 1. By comparing kernel size 3, 5 and 7 at 400×400 resolution, we can see that a larger kernel size leads to a slight performance improvement. However, it also leads to a quadratic increase of the computation cost. We also compare different cropping methods, by increasing the crop size or cropping along the vehicle trajectory at all observed time steps. From the 3rd to 6th rows of Table 1 we can see that a larger crop size (3 vs. 1) can significantly improve the performance, and cropping along the observed trajectory also leads to better performance. This observation confirms the importance of receptive fields when rasterized images are used as inputs. It also highlights its limitation, where a carefully designed cropping strategy is needed, often at the cost of increased computation.

Impact of rendering resolution. We further vary the resolutions of rasterized images to see how they affect the prediction quality and computation cost, as shown in the first three rows of Table 1. We test three different resolutions, including 400 × 400 (0.25 meter per pixel), 200 × 200 (0.5 meter per pixel) and 100 × 100 (1 meter per pixel). It can be seen that the performance increases generally as the resolution goes up. However, for the Argoverse dataset we can see that increasing the resolution from 200×200 to 400×400 leads to a slight drop in performance, which can be explained by the decrease of effective receptive field size with the fixed 3×3 kernel. We discuss the impact on computation cost of these design choices in Section 4.4.

4.3. Ablation study for VectorNet

Impact of input node types. We study whether it is helpful to incorporate both map features and agent trajectories for VectorNet. The first three rows in Table 2 correspond to using only the past trajectory of the target vehicle (“none” context), adding only map polylines (“map”), and finally adding trajectory polylines (“map + agents”). We can clearly observe that adding map information significantly improves the trajectory prediction performance. Incorporating trajectory information further improves the performance.

Impact of node completion loss. The last four rows of Table 2 compare the impact of adding the node completion auxiliary objective. We can see that adding this objective consistently helps with performance, especially at longer time horizons.

Impact on the graph architectures. In Table 3 we study the impact of depths and widths of the graph layers on trajectory prediction performance. We observe that for the polyline subgraph three layers give the best performance, and for the global graph just one layer is needed. Making the MLPs wider does not lead to better performance, and hurts for Argoverse, presumably because it has a smaller training dataset. Some example visualizations on predicted trajectory and lane attention are shown in Figure 4.

Comparison with ConvNets. Finally, we compare our VectorNet with the best ConvNet model in Table 4. For the in-house dataset, our model achieves on par performance with the best ResNet model, while being much more eco-
                            In-house dataset                 Argoverse dataset
Resolution  Kernel  Crop    DE@1s  DE@2s  DE@3s  ADE         DE@1s  DE@2s  DE@3s  ADE
100×100     3×3     1×1     0.63   0.94   1.32   0.82        1.14   2.80   5.19   2.21
200×200     3×3     1×1     0.57   0.86   1.21   0.75        1.11   2.72   4.96   2.15
400×400     3×3     1×1     0.55   0.82   1.16   0.72        1.12   2.72   4.94   2.16
400×400     3×3     3×3     0.50   0.77   1.09   0.68        1.09   2.62   4.81   2.08
400×400     3×3     5×5     0.50   0.76   1.08   0.67        1.09   2.60   4.70   2.08
400×400     3×3     traj    0.47   0.71   1.00   0.63        1.05   2.48   4.49   1.96
400×400     5×5     1×1     0.54   0.81   1.16   0.72        1.10   2.63   4.75   2.13
400×400     7×7     1×1     0.53   0.81   1.16   0.72        1.10   2.63   4.74   2.13

Table 1. Impact of receptive field (as controlled by convolutional kernel size and crop strategy) and rendering resolution for the ConvNet
baseline. We report DE and ADE (in meters) on both the in-house dataset and the Argoverse dataset.
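The ADE and DE@t columns reported in the tables follow the definitions in Section 4.1.2. A minimal pure-Python sketch of both metrics is given below, under stated assumptions: trajectories are lists of (x, y) positions in meters for the future horizon only, sampled at 10 Hz (the 0.1 second interval mentioned in Section 3.1), so that list index k corresponds to time (k + 1) / 10 seconds. The function names and the `hz` parameter are hypothetical.

```python
import math

def displacement_errors(pred, truth):
    """Per-step Euclidean distance (meters) between predicted and true positions."""
    return [math.hypot(px - tx, py - ty) for (px, py), (tx, ty) in zip(pred, truth)]

def ade(pred, truth):
    """Average Displacement Error over the entire future trajectory."""
    errors = displacement_errors(pred, truth)
    return sum(errors) / len(errors)

def de_at(pred, truth, t_seconds, hz=10):
    """Displacement Error at t seconds into the future (DE@ts).

    Assumes step k (0-based) corresponds to time (k + 1) / hz.
    """
    index = round(t_seconds * hz) - 1
    return displacement_errors(pred, truth)[index]
```

With a 3-second horizon at 10 Hz, `de_at(pred, truth, 3.0)` reads the last of the 30 predicted steps, while `ade` averages over all 30.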
Model                         DE@3s  ADE
Constant Velocity [7]         7.89   3.53
Nearest Neighbor [7]          7.88   3.45
LSTM ED [7]                   4.95   2.15
Challenge Winner: uulm-mrm    4.19   1.90
Challenge Winner: Jean        4.17   1.86
VectorNet                     4.01   1.81

Table 5. Trajectory prediction performance on the Argoverse Forecasting test set when number of sampled trajectories K=1. Results
were retrieved from the Argoverse leaderboard [1] on 03/18/2020.
References

[1] Argoverse Motion Forecasting Competition, 2019. https://evalai.cloudcv.org/web/challenges/challenge-page/454/leaderboard/1279.
[2] Alexandre Alahi, Kratarth Goel, Vignesh Ramanathan, Alexandre Robicquet, Li Fei-Fei, and Silvio Savarese. Social LSTM: Human Trajectory Prediction in Crowded Spaces. In CVPR, 2016.
[3] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
[4] Peter W Battaglia, Jessica B Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, Caglar Gulcehre, Francis Song, Andrew Ballard, Justin Gilmer, George Dahl, Ashish Vaswani, Kelsey Allen, Charles Nash, Victoria Langston, Chris Dyer, Nicolas Heess, Daan Wierstra, Pushmeet Kohli, Matt Botvinick, Oriol Vinyals, Yujia Li, and Razvan Pascanu. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.
[5] Sergio Casas, Wenjie Luo, and Raquel Urtasun. Intentnet: Learning to predict intention from raw sensor data. In CoRL, 2018.
[6] Yuning Chai, Benjamin Sapp, Mayank Bansal, and Dragomir Anguelov. Multipath: Multiple probabilistic anchor trajectory hypotheses for behavior prediction. In CoRL, 2019.
[7] Ming-Fang Chang, John Lambert, Patsorn Sangkloy, Jagjeet Singh, Slawomir Bak, Andrew Hartnett, De Wang, Peter Carr, Simon Lucey, Deva Ramanan, et al. Argoverse: 3d tracking and forecasting with rich maps. In CVPR, 2019.
[8] Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron C Courville, and Yoshua Bengio. A recurrent latent variable model for sequential data. In NeurIPS, 2015.
[9] James Colyar and Halkias John. Us highway 101 dataset. FHWA-HRT-07-030, 2007.
[10] Henggang Cui, Vladan Radosavljevic, Fang-Chieh Chou, Tsung-Han Lin, Thi Nguyen, Tzu-Kuo Huang, Jeff Schneider, and Nemanja Djuric. Multimodal trajectory predictions for autonomous driving using deep convolutional networks. In ICRA, 2019.
[11] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[12] Panna Felsen, Pulkit Agrawal, and Jitendra Malik. What will happen next? forecasting player moves in sports videos. In ICCV, 2017.
[13] Agrim Gupta, Justin Johnson, Li Fei-Fei, Silvio Savarese, and Alexandre Alahi. Social GAN: Socially acceptable trajectories with generative adversarial networks. In CVPR, 2018.
[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[15] Joey Hong, Benjamin Sapp, and James Philbin. Rules of the road: Predicting driving behavior with a convolutional model of semantic interactions. In CVPR, 2019.
[16] Yedid Hoshen. VAIN: Attentional multi-agent predictive modeling. arXiv preprint arXiv:1706.06122, 2017.
[17] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[18] Thomas Kipf, Ethan Fetaya, Kuan-Chieh Wang, Max Welling, and Richard Zemel. Neural relational inference for interacting systems. In ICML, 2018.
[19] Robert Krajewski, Julian Bock, Laurent Kloeker, and Lutz Eckstein. The highd dataset: A drone dataset of naturalistic vehicle trajectories on german highways for validation of highly automated driving systems. In ITSC, 2018.
[20] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In CVPR, 2017.
[21] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In NIPS, 2017.
[22] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019.
[23] Nicholas Rhinehart, Rowan McAllister, Kris Kitani, and Sergey Levine. PRECOG: Prediction conditioned on goals in visual multi-agent settings. In ICCV, 2019.
[24] Alexandre Robicquet, Amir Sadeghian, Alexandre Alahi, and Silvio Savarese. Learning social etiquette: Human trajectory understanding in crowded scenes. In ECCV, 2016.
[25] Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. Vl-bert: Pre-training of generic visual-linguistic representations. arXiv preprint arXiv:1908.08530, 2019.
[26] Chen Sun, Per Karlsson, Jiajun Wu, Joshua B Tenenbaum, and Kevin Murphy. Stochastic prediction of multi-agent interactions from partial observations. In ICLR, 2019.
[27] Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. VideoBERT: A joint model for video and language representation learning. In ICCV, 2019.
[28] Chen Sun, Abhinav Shrivastava, Carl Vondrick, Rahul Sukthankar, Kevin Murphy, and Cordelia Schmid. Relational action forecasting. In CVPR, 2019.
[29] Charlie Tang and Russ R Salakhutdinov. Multiple futures prediction. In NeurIPS, 2019.
[30] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, 2017.
[31] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. In ICLR, 2018.
[32] Raymond A. Yeh, Alexander G. Schwing, Jonathan Huang, and Kevin Murphy. Diverse generation for multi-agent sports games. In CVPR, 2019.
[33] Eric Zhan, Stephan Zheng, Yisong Yue, Long Sha, and Patrick Lucey. Generative multi-agent behavioral cloning. arXiv:1803.07612, 2018.