0% found this document useful (0 votes)

19 views21 pages

2003 - Probabilistic Future Prediction For Video Scene Understanding

Uploaded by

Milton

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

19 views21 pages

2003 - Probabilistic Future Prediction For Video Scene Understanding

Uploaded by

Milton

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 21

Probabilistic Future Prediction for

Video Scene Understanding

Anthony Hu1,2 , Fergal Cotter1 , Nikhil Mohan1 , Corina Gurau1 , and Alex Kendall1,2
1
Wayve, London, UK. [email protected]
2
University of Cambridge.
arXiv:2003.06409v1 [cs.CV] 13 Mar 2020

Abstract. We present a novel deep learning architecture for probabilis-

tic future prediction from video. We predict the future semantics, geome-
try and motion of complex real-world urban scenes and use this represen-
tation to control an autonomous vehicle. This work is the first to jointly
predict ego-motion, static scene, and the motion of dynamic agents in a
probabilistic manner, which allows sampling consistent, highly probable
futures from a compact latent space. Our model learns a representa-
tion from RGB video with a spatio-temporal convolutional module. The
learned representation can be explicitly decoded to future semantic seg-
mentation, depth, and optical flow, in addition to being an input to a
learnt driving policy. To model the stochasticity of the future, we intro-
duce a conditional variational approach which minimises the divergence
between the present distribution (what could happen given what we have
seen) and the future distribution (what we observe actually happens).
During inference, diverse futures are generated by sampling from the
present distribution.

1 Introduction
Building predictive cognitive models of the world is often regarded as the essence
of intelligence. It is one of the first skills that we develop as infants. We use
these models to enhance our capability at learning more complex tasks, such as
navigation or manipulating objects [48].
Unlike in humans, developing prediction models for autonomous vehicles to
anticipate the future remains hugely challenging. Road agents have to make
reliable decisions based on forward simulation to understand how relevant parts
of the scene will evolve. There are various reasons why modelling the future
is incredibly difficult: natural-scene data is rich in details, most of which are
irrelevant for the driving task, dynamic agents have complex temporal dynamics,
often controlled by unobservable variables, and the future is inherently uncertain,
as multiple futures might arise from a unique and deterministic past.
Current approaches to autonomous driving individually model each dynamic
agent by producing hand-crafted behaviours, such as trajectory forecasting, to
feed into a decision making module [8]. This largely assumes independence be-
tween agents and fails to model multi-agent interaction. Most works that holis-
tically reason about the temporal scene are limited to simple, often simulated
environments or use low dimensional input images that do not have the visual
2

complexity of real world driving scenes [47]. Some approaches tackle this prob-
lem by making simplifying assumptions to the motion model or the stochasticity
of the world [41,8]. Others avoid explicitly predicting the future scene but rather
rely on an implicit representation or Q-function (in the case of model-free rein-
forcement learning) in order to choose an action [33,27,36].
Real world future scenarios are difficult to model because of the stochasticity
and the partial observability of the world. Our work addresses this by encoding
the future state into a low-dimensional future distribution. We then allow the
model to have a privileged view of the future through the future distribution at
training time. As we cannot use the future at test time, we train a present distri-
bution (using only the current state) to match the future distribution through a
Kullback-Leibler (KL) divergence loss. We can then sample from the present dis-
tribution during inference, when we do not have access to the future. We observe
that this paradigm allows the model to learn accurate and diverse probabilistic
future prediction outputs.
In order to predict the future we need to first encode video into a motion rep-
resentation. Unlike advances in 2D convolutional architectures [60,26], learning
spatio-temporal features is more challenging due to the higher dimensionality
of video data and the complexity of modelling dynamics. State-of-the-art archi-
tectures [64,61] decompose 3D filters into spatial and temporal convolutions in
order to learn more efficiently. The model we propose further breaks down con-
volutions into many space-time combinations and context aggregation modules,
stacking them together in a more complex hierarchical representation. We show
that the learnt representation is able to jointly predict ego-motion and motion
of other dynamic agents. By explicitly modelling these dynamics we can capture
the essential features for representing causal effects for driving. Ultimately we use
this motion-aware and future-aware representation to improve an autonomous
vehicle control policy.
Our main contributions are threefold. Firstly, we present a novel deep learn-
ing framework for future video prediction. Secondly, we demonstrate that our
probabilistic model is able to generate visually diverse and plausible futures.
Thirdly, we show our future prediction representation substantially improves a
learned autonomous driving policy.

2 Related Work
This work falls in the intersection of learning scene representation from video,
probabilistic modelling of the ambiguity inherent in real-world driving data, and
using the learnt representation for control.

Temporal representations. Current state-of-the-art temporal representa-

tions from video use recurrent neural networks [53,54], separable 3D convolutions
[29,59,63,25,61], or 3D Inception modules [7,64]. In particular, the separable 3D
Inception (S3D) architecture [64], which improves on the Inception 3D module
(I3D) introduced by Carreira et al . [7], shows the best trade-off between model
complexity and speed, both at training and inference time. Adding optical flow
3

as a complementary input modality has been consistently shown to improve per-

formance [18,56,55,5], in particular using flow for representation warping to align
features over time [21,66]. We propose a new spatio-temporal architecture that
can learn hierarchically more complex features with a novel 3D convolutional
structure incorporating both local and global space and time context.

Visual prediction. Most works for learning dynamics from video fall under
the framework of model-based reinforcement learning [42,20,16,32] or unsuper-
vised feature learning [57,15], both regressing directly in pixel space [45,49,31]
or in a learned feature space [30,19]. For the purpose of creating good represen-
tations for driving scenes, directly predicting in the high-dimensional space of
image pixels is unnecessary, as some details about the appearance of the world
are irrelevant for planning and control. Our approach is similar to that of Luc
et al . [44] which trains a model to predict future semantic segmentation us-
ing pseudo-ground truth labels generated from a teacher model. However, our
model predicts a more complete scene representation with segmentation, depth,
and flow and is probabilistic in order to model the uncertainty of the future.

Multi-modality of future prediction. Modelling uncertainty is important

given the stochastic nature of real-world data [34]. Lee et al . [40], Bhattacharyya
et al . [4] and Rhinehart et al . [50] forecast the behaviour of other dynamic agents
in the scene in a probabilistic multi-modal way. We distinguish ourselves from
this line of work as their approach does not consider the task of video forecast-
ing, but rather trajectory forecasting, and they do not study how useful the
representations learnt are for robot control. Kurutach et al . [38] propose gen-
erating multi-modal futures with adversarial training, however spatio-temporal
discriminator networks are known to suffer from mode collapse [22].
Our variational approach is similar to Kohl et al . [37], although their appli-
cation domain does not involve modelling dynamics. Furthermore, while Kohl et
al . [37] use multi-modal training data, i.e. multiple output labels are provided
for a given input, we learn directly from real-world driving data, where we can
only observe one future reality, and show that we generate diverse and plau-
sible futures. Most importantly, previous variational video generation methods
[39,14] were restricted to single-frame image generation, low resolution (64 × 64)
datasets that are either simulated (Moving MNIST [57]) or with static scenes
and limited dynamics (KTH actions [52], Robot Pushing dataset [17]). Our new
framework for future prediction generates entire video sequences on complex
real-world urban driving data with ego-motion and complex interactions.

Learning a control policy. The representation learned from dynamics mod-

els could be used to generate imagined experience to train a policy in a model-
based reinforcement learning setting [23,24] or to run shooting methods for plan-
ning [10]. Instead we follow the approaches of Bojarski et al . [6], Codevilla et
al . [12] and Amini et al . [1] and learn a policy which predicts longitudinal and
lateral control of an autonomous vehicle using Conditional Imitation Learning,
as this approach has been shown to be immediately transferable to the real
world.
4

Future Prediction
zt+Nf G t+Nf
ôt
Present & Future
... Distributions ... ...

zt+2 F µt,f , σt,f G ôt+2

train
zt+1 ηt G ôt+1
t
Future test
Past Perception Dynamics - Y P µt,p, σt,p Control - C
xt T T T zt ĉt

... v̂t, v̇ˆt θ̂t, θ̇ˆt

xt−1 T T

xt−2 T T

xt−3 T
...

Fig. 1: Our architecture has five modules: Perception, Dynamics, Present/Future

Distributions, Future Prediction and Control. The Perception module learns
scene representation features, xt , from input images. The Dynamics model builds
on these scene features to produce a spatio-temporal representation, zt , with our
proposed Temporal Block module, T . Together with a noise vector, ηt , sampled
from a future distribution, F, at training time, or the present distribution, P,
at inference time, this representation predicts future video scene representation
(segmentation, depth and optical flow) with a convolutional recurrent model, G,
and decoders, D. Lastly, we learn a Control policy, C, from the spatio-temporal
representation, zt .

3 Model Architecture

Our model learns a spatio-temporal feature to jointly predict future scene rep-
resentation (semantic segmentation, depth, optical flow) and train a driving
policy. The architecture contains five components: Perception, an image scene
understanding model, Dynamics, which learns a spatio-temporal representation,
Present/Future Distributions, our probabilistic framework, Future Prediction,
which predicts future video scene representation, and Control, which trains a
driving policy using expert driving demonstrations. Figure 1 gives an overview
of the model and further details are described in this section and Appendix A.

3.1 Perception

The perception component of our system contains two modules: the encoder
of a scene understanding model that was trained on single image frames to
reconstruct semantic segmentation and depth [35], and the encoder of a flow
network [58], trained to predict optical flow. The combined perception features
xt ∈ RC×H×W form the input to the dynamics model. These models can also be
used as a teacher to distill the information from the future, giving pseudo-ground
truth labels for segmentation, depth and flow {st , dt , ft }. See subsection 4.1 for
more details on the teacher model.
5

3.2 Dynamics

Learning a temporal representation from video is extremely challenging because

of the high dimensionality of the data, the stochasticity and complexity of nat-
ural scenes, and the partial observability of the environment. To train 3D con-
volutional filters from a sequence of raw RGB images, a large amount of data,
memory and compute is required. We instead learn spatio-temporal features
with a temporal model that operates on perception encodings, which constitute
a more powerful and compact representation compared to RGB images.
The dynamics model Y takes a history of perception features (xt−T +1 : xt )
with temporal context T and encodes it into a dynamics feature zt :

zt = Y(xt−T +1 : xt ) (1)

Temporal Block We propose a spatio-temporal module, named Temporal

Block , to learn hierarchically more complex temporal features as follows:
– Decomposing the filters: instead of systematically using full 3D filters
(kt , ks , ks ), with kt the time kernel dimension and ks the spatial kernel di-
mension, we apply four parallel 3D convolutions with kernel sizes: (1, ks , ks )
(spatial features), (kt , 1, ks ) (horizontal motion), (kt , ks , 1) (vertical motion),
and (kt , ks , ks ) (complete motion). All convolutions are preceded by a (1, 1, 1)
convolution to compress the channel dimension.
– Global spatio-temporal context: in order to learn contextual features,
we additionally use three spatio-temporal average pooling layers at: full spa-
tial size (kt , H, W ) (H and W are respectively the height and width of the
perception features xt ), half size (kt , H2 , W H W
2 ) and quarter size (kt , 4 , 4 ), fol-
lowed by bilinear upsampling to the original spatial dimension (H, W ) and
a (1, 1, 1) convolution.
Figure 2 illustrates the architecture of the Temporal Block. By stacking mul-
tiple temporal blocks, the network learns a representation that incorporates in-
creasingly more temporal, spatial and global context. We also increase the num-
ber of channels by a constant α after each temporal block, as after each block,
the network has to represent the content of the kt previous features.

3.3 Future Prediction

We train a future prediction model that unrolls the dynamics feature, which is
a compact scene representation of the past context, into predictions about the
state of the world in the future. The future prediction model is a convolutional
recurrent network G which creates future features gtt+i that become the inputs of
individual decoders Ds , Dd , Df to decode these features to predicted segmenta-
tion ŝt+i ˆt+i ˆt+i values in the pixel space. We have introduced
t , depth dt , and flow ft
a second time superscript notation, i.e. gtt+i , represents the prediction about the
world at time t + i given the dynamics features at time t. Also note that gtt , zt .
The structure of the convolutional recurrent network G is the following: a
convolutional GRU [2] followed by three spatial residual layers, repeated D times,
6

zout ∈ RCout×T×H×W
+

1×1×1, Cout

Concat, 3C

Global Context

Local Context Upsample Upsample Upsample

H×W H×W H×W
C C C C
kt×ks×ks, 2 kt×1×ks, 2 kt×ks×1, 2 1×ks×ks, 2
C C C
1×1×1, 3 1×1×1, 3 1×1×1, 3

Compression Compression Compression Compression Avg Pool Avg Pool Avg Pool
1×1×1, C2 1×1×1, C2 1×1×1, C2 1×1×1, C2 kt×H×W kt×H2 ×W2 kt×H4 ×W4

zin ∈ RC×T×H×W

Fig. 2: A Temporal Block, our proposed spatio-temporal module. From a four-

dimensional input zin ∈ RC×T ×H×W , our module learns both local and global
spatio-temporal features. The local head learns all possible configurations of
3D convolutions with filters: (1, ks , ks ) (spatial features), (kt , 1, ks ) (horizontal
motion), (kt , ks , 1) (vertical motion), and (kt , ks , ks ) (complete motion). The
global head learns global spatio-temporal features with a 3D average pooling at
full, half and quarter size, followed by a (1, 1, 1) convolution and upsampling to
the original spatial dimension H × W . The local and global features are then
concatenated and combined in a final (1, 1, 1) 3D convolution.

similarly to Clark et al . [11]. For deterministic inference, its input is ut+i

t = 0,
and its initial hidden state is zt , the dynamics feature. The future prediction
component of our network computes the following, for i ∈ {1, .., Nf }, with Nf
the number of predicted future frames:

gtt+i = G(ut+i t+i−1

t , gt ) (2)
ŝt+i
t = Ds (gtt+i ) (3)
dˆt+i
t = Dd (gtt+i ) (4)
ˆ
ftt+i = Df (gtt+i ) (5)

3.4 Present & Future Distributions

From a unique past in the real-world, many futures are possible, but in reality
we only observe one future. Consequently, modelling multi-modal futures from
deterministic video training data is extremely challenging. We adopt a condi-
tional variational approach and model two probability distributions: a present
distribution P , that represents what could happen given the past context, and a
future distribution F , that represents what actually happened in that particular
observation. This allows us to learn a multi-modal distribution from the input
data while conditioning the model to learn from the specific observed future
from within this distribution.
The present and the future distributions are Gaussian with a diagonal co-
variance matrix, and can therefore be fully characterised by their mean and
7

standard deviation. We parameterise both distributions with a neural network,

respectively P and F.

Present distribution The input of the network P is zt ∈ RCd ×H×W , which

represents the past context of the last T frames (T is the time receptive field
of our dynamics module). The present network contains two downsampling con-
volutional layers, an average pooling layer and a fully connected layer to map
the features to the desired latent dimension L. The output of the network is the
parametrisation of the present distribution: (µt,present , σt,present ) ∈ RL × RL .

Future distribution F is not only conditioned by the past zt , but also by

the future corresponding to the training sequence. Since we are predicting Nf
steps in the future, the input of F has to contain information about future
frames (t + 1, ..., t + Nf ). This is achieved using the learned dynamics features
{zt+j }j∈J , with J the set of indices such that {zt+j }j∈J covers all future frames
(t + 1, ..., t + Nf ), as well as zt . Formally, if we want to cover Nf frames with
features that have a receptive field of T , then:
J = {nT | 0 ≤ n ≤ bNf /T c} ∪ {Nf }. The architecture of the future net-
work is similar to the present network: for each input dynamics feature zt+j ∈
RCd ×H×W , with j ∈ F , we apply two downsampling convolutional layers and
an average pooling layer. The resulting features are concatenated, and a fully-
connected layer outputs the parametrisation of the future distribution:
(µt,future , σt,future ) ∈ RL × RL .

Probabilistic Future Prediction During training, we sample from the future

2
distribution a vector ηt ∼ N (µt,future , σt,future ) that conditions the predicted
future perception outputs (semantic segmentation, depth, optical flow) on the
observed future. As we want our prediction to be consistent in both space and
time, we broadcast spatially ηt ∈ RL to RL×H×W , and use the same sample
throughout the future generation as an input to the GRU to condition the future:
for i ∈ {1, .., Nf }, input ut+i
t = ηt .
We encourage the present distribution P to match the future distribution F
with the KL loss:
Lprobabilistic = DKL (F (·|Zt , ..., Zt+Nf ) || P (·|Zt )) (6)
As the future is multimodal, different futures might arise from a unique past
context zt . Each of these futures will be captured by the future distribution F
that will pull the present distribution P towards it. Since our training data is
extremely diverse, it naturally contains multimodalities. Even if the past con-
text (sequence of images (i1 , ..., it )) from two different training sequences will
never be the same, the dynamics network will have learned a more abstract
spatio-temporal representation that ignores irrelevant details of the scene (such
as vehicle colour, weather, road material etc.) to match similar past context to
a similar zt . In this process, the present distribution will learn to cover all the
possible modes contained in the future.
During inference, we sample a vector ηt from the present distribution ηt ∼
2
N (µt,present , σt,present ), where each sample corresponds to a different future.
8

3.5 Control

From this rich spatio-temporal representation zt explicitly trained to predict the

future, we train a control model C to output a four dimensional vector consisting
ˆ
ˆ steering angle θ̂ and angular velocity θ̇:
of estimated speed v̂, acceleration v̇,

ˆ
ĉt = {v̂t , v̇ˆt , θ̂t , θ̇t } = C(zt ) (7)

C compresses zt ∈ RCd ×H×W with strided convolutional layers, then stacks sev-
eral fully connected layers, compressing at each stage, to regress the four dimen-
sional output.

3.6 Losses

Future Prediction The future prediction loss at timestep t is the weighted

sum of future segmentation, depth and optical flow losses. Let the segmentation
loss at the future timestep t + i be Lt+i
s . We use a top-k cross-entropy loss [62]
between the network output ŝt+i
t and the pseudo-ground truth label st+i . Ls is
computed by summing these individual terms over the future horizon Nf with
a weighted discount term 0 < γf < 1:

Nf −1
X
Ls = γfi Lt+i
s (8)
i=0

For depth, Lt+i

d is the scale-invariant depth loss [43] between dˆt+i
t and dt+i , and
similarly Ld is the discounted sum. For flow, we use a Huber loss betwen fˆtt+i
and ft+i . We weight the summed losses by factors λs , λd , λf to get the future
prediction loss Lfuture-pred .

Lfuture-pred = λs Ls + λd Ld + λf Lf (9)

Control We use imitation learning, regressing to the expert’s true control

actions {v, θ} to generate a control loss Lc . For both speed and steering, we
have access to the expert actions.
We compare to the linear extrapolation of the generated policy’s speed/steering
for future time-steps up to Nc frames in the future:

c −1
NX 2
Lc = γci vt+i − v̂t + iv̇ˆt +
i=0

ˆ 2

θt+i − θ̂t + iθ̇t (10)

where 0 < γc < 1 is the control discount factor penalizing less speed and steering
errors further into the future.
9

Total Loss The final loss L can be decomposed into the future prediction loss
(Lfuture-pred ), the probabilistic loss (Lprobabilistic ), and the control loss (Lc ) .

L = λf p Lfuture-pred + λc Lc + λp Lprobabilistic (11)

In all experiments we use γf = 0.6, λs = 1.0, λd = 1.0, λf = 0.5, λf p = 1,

λp = 0.005, γc = 0.7, λc = 1.0.

4 Experiments

We have collected driving data in a densely populated, urban environment, rep-

resentative of most European cities using multiple drivers over the span of six
months. For the purpose of this work, only the front-facing camera images it
and the measurements of the speed and steering ct have been used to train our
model, all sampled at 5Hz.

4.1 Training Data

Perception We first pretrain the scene understanding encoder on a number of

heterogeneous datasets to predict semantic segmentation and depth: CityScapes
[13], Mapillary Vistas [46], ApolloScape [28] and Berkeley Deep Drive [65]. The
optical flow network is a pretrained PWC-Net from [58]. The decoders of these
networks are used for generating pseudo-ground truth segmentation and depth
labels to train our dynamics and future prediction modules.

Dynamics and Control The dynamics and control modules are trained using
30 hours of driving data from the urban driving dataset we collected and de-
scribed above. We address the inherent dataset bias by sampling data uniformly
across lateral and longitudinal dimensions. First, the data is split into a his-
togram of bins by steering, and subsequently by speed. We found that weighting
each data point proportionally to the width of the bin it belongs to avoids the
need for alternative approaches such as data augmentation.

4.2 Metrics

We report standard metrics for measuring the quality of segmentation, depth

and flow: respectively intersection-over-union, scale-invariant logarithmic error,
and average end-point error. For ease of comparison, additionally to individual
metrics, we report a unified perception metric Mperception defined as improve-
ment of segmentation, depth and flow metrics with respect to the Repeat Frame
baseline (repeats the perception outputs of the current frame):

1
Mperception = (seg% increase + depth% decrease + flow% decrease ) (12)
3
10

Inspired by the energy functions used in [3,51], we additionally report a diversity

distance metric (DDM) between the ground truth future Y and samples from
the predicted present distribution P :

DDM(Y, P ) = min d(Y, S) − E d(S, S 0 )

(13)
S

where d is an error metric and S, S 0 , are independent samples from the present
distribution P . This metric measures performance both in terms of accuracy,
by looking at the minimum error of the samples, as well as the diversity of
the predictions by taking the expectation of the distance between N samples.
The distance d is the scale-invariant logarithmic error for depth, the average
end-point error for flow, and for segmentation d(x, y) = 1 − IoU(x, y).
To measure control performance, we report mean absolute error of speed and
steering outputs, balanced by steering histogram bins.

5 Results
We first compare our proposed spatio-temporal module to previous state-of-
the-art architectures and show that our module achieves the best performance
on future prediction metrics. Then we demonstrate that modelling the future
in a probabilistic manner further improves performance. And finally, we show
that our probabilistic future prediction representation substantially improves a
learned driving policy. All the reported results are evaluated on test routes with
no overlap with the training data.

5.1 Spatio-Temporal Representation

We analyse the quality of the spatio-temporal representation our temporal model
learns by evaluating future prediction of semantic segmentation, depth, and op-
tical flow, two seconds in the future. Several architectures have been created
to learn features from video, with the most successful modules being: the Con-
volutional GRU [2], the 3D Residual Convolution [25] and the Separable 3D
Inception block [64].
We also compare our model to two baselines: Repeat frame (repeating the
perception outputs of the current frame at time t for each future frame t + i with
i = 1, ..., Nf ), and having no temporal model (None). As shown in Table 1, de-
terministic section, every temporal model architecture improves over the Repeat
frame baseline, as opposed to the model without any temporal context, (None),
that performs notably worse. This is because it is too difficult to forecast how
the future is going to evolve with a single image, and as a result, the learning
process collapses.
Further, we observe that our proposed temporal block module outperforms
all preexisting spatio-temporal architectures, on all three future perception met-
rics: semantic segmentation, depth and flow. There are two reasons for this: the
first one is that learning 3D filters is hard, and as demonstrated by the Sep-
arable 3D convolution [64] (i.e. the succession of a (1, ks , ks ) spatial filter and
11

Temporal Model Mperception (↑) Depth (↓) Flow (↓) Seg. (↑)
Repeat frame 0.0% 1.467 5.707 0.356
None -40.3% 1.980 8.573 0.229
Res. 3D Conv. [25] 6.9% 1.162 5.437 0.339
Conv. GRU [2] 7.4% 1.097 5.714 0.346
Deterministic
Sep. Inception [64] 9.6% 1.101 5.300 0.344
Ours 13.6% 1.090 5.029 0.367
Res. 3D Conv. [25] 8.1% 1.107 5.720 0.356
Conv. GRU [2] 9.0% 1.101 5.645 0.359
Probabilistic
Sep. Inception [64] 13.8% 1.040 5.242 0.371
Ours 20.0% 0.970 4.857 0.396

Table 1: Perception performance metrics for two seconds future prediction on

the collected urban driving data. We measure semantic segmentation with mean
IoU, depth with scale-invariant logarithmic error, and depth with average end-
point error. Mperception shows overall performance — we observe our model
outperforms all baselines.

a (kt , 1, 1) time filter), decomposing into two subtasks helps the network learn
more efficiently. In the same spirit, we decompose the spatio-temporal convo-
lutions into all combinations of space-time convolutions: (1, ks , ks ), (kt , 1, ks ),
(kt , ks , 1), (kt , ks , ks ), and by stacking these temporal blocks together, the net-
work can learn a hierarchically more complex representation of the scene. The
second reason is that we incorporate global context in our features. By pooling
the features spatially and temporally at different scales, each individual feature
map also has information about the global scene context, which helps in am-
biguous situations. Appendix A.2 contains an ablation study of the different
component of the Temporal Block.

5.2 Probabilistic Future

Since the future is inherently uncertain, the deterministic model is training in a
chaotic learning space because the predictions of the model are penalised with the
ground truth future, which only represents a subset of all the possible outcomes.
Therefore, if the network predicts a plausible future, but one that did not match
the given training sequence, it will be heavily penalised. On the other hand, the
probabilistic model has a very clean learning signal as the future distribution
conditions the network to generate the correct future. The present distribution
is encouraged to match the distribution of the future distribution during training,
and therefore has to capture all the modes of the future.
2
During inference, samples ηt ∼ N (µt,present , σt,present ) from the present distri-
2
bution should give a different outcome, with p(ηt |µt,present , σt,present ) indicating
the relative likelihood of a given scenario. Our probabilistic model should be
accurate, that is to say at least one of the generated future should match the
ground truth future. It should also be diverse: the generated samples should
capture the diversity of the possible futures with the correct probability. Next,
we analyse quantitatively and qualitatively that our model generates diverse and
accurate futures.
12

Temporal Model Depth (↓) Flow (↓) Seg. (↓)

Res. 3D Conv. [25] 0.823 2.695 0.474
Conv. GRU [2] 0.841 2.683 0.493
Sep. Inception [64] 0.799 2.914 0.469
Ours 0.724 2.676 0.424

Table 2: Diversity Distance Metric for various temporal models evaluated on the
urban driving data, demonstrating that our model produces the most accurate
and diverse distribution.

Table 1 shows that every temporal architecture have superior performance

when trained in a probabilistic way, with our model benefiting the most (from
13.6% to 20.0%) in future prediction metrics. Table 2 shows that our model
outperforms other temporal representations also using the diversity distance
metric (DDM) described in subsection 4.2. The DDM measures both accuracy
and diversity of the distribution.
Perhaps the most striking result of the model is observing that our model
can predict diverse and plausible futures from a single sequence of past frames
at 5Hz, corresponding to one second of past context and two seconds of future
prediction. In Figure 3 and Figure 4 we show qualitative examples of our video
scene understanding future prediction in real-world urban driving scenes. We
2
sample from the present distribution, ηt,j ∼ N (µt,present , σt,present ), to demon-
strate multi-modality.

Fig. 3: Predicted futures from our model while driving through an urban in-
tersection. From left, we show the actual past and future video sequence and
labelled semantic segmentation. Using four different noise vectors, η, we observe
the model imagining different driving manoeuvres at an intersection: being sta-
tionary, driving straight, taking a left or a right turn. We show both predicted
semantic segmentation and entropy (uncertainty) for each future. This example
demonstrates that our model is able to learn a probabilistic embedding, capable
of predicting multi-modal and plausible futures.
13

Fig. 4: Predicted futures from our model while driving through a busy urban
scene. From left, we show actual past and future video sequence and labelled
semantic segmentation, depth and optical flow. Using two different noise vectors,
η, we observe the model imagining either stopping in traffic or continuing in
motion. This illustrates our model’s efficacy at jointly predicting holistic future
behaviour of our own vehicle and other dynamic agents in the scene across all
modalities.

Further, our framework can automatically infer which scenes are unusual or
unexpected and where the model is uncertain of the future, by computing the dif-
ferential entropy of the present distribution. Simple scenes (e.g. one-way streets)
will tend to have a low entropy, corresponding to an almost deterministic future.
Any latent code sampled from the present distribution will correspond to the
same future. Conversely, complex scenes (e.g. intersections, roundabouts) will be
associated with a high-entropy. Different samples from the present distribution
will correspond to different futures, effectively modelling the stochasticity of the
future.3
Finally, to allow reproducibility, we evaluate our future prediction framework
on Cityscapes [13] and report future semantic segmentation performance in Ta-
ble 3. We compare our predictions, at resolution 256 × 512, to the ground truth
segmentation at 5 and 10 frames in the future, corresponding to respectively
0.29s and 0.59s in the future as the videos are sampled at 17Hz. Qualitative
examples on Cityscapes can be found in Appendix C.
Temporal Model IoUi=5 (↑) IoUi=10 (↑)
Repeat frame 0.393 0.331
Res. 3D Conv. [25] 0.445 0.399
Conv. GRU [2] 0.449 0.397
Probabilistic
Sep. Inception [64] 0.455 0.402
Ours 0.464 0.416

Table 3: Future semantic segmentation performance on Cityscapes at i = 5 and

i = 10 frames in the future.
3
In the accompanying blog post, we illustrate how diverse the predicted future be-
comes with varying levels of entropy in an intersection scenario and an urban traffic
scenario.
14

5.3 Driving Policy

We study the influence of the learned temporal representation on driving per-
formance. Our baseline is the control policy learned from a single frame.
First we compare to this baseline a model that was trained to directly op-
timise control, without being supervised with future scene prediction. It shows
only a slight improvement over the static baseline, hinting that it is difficult
to learn an effective temporal representation by only using control error as a
learning signal.

Temporal Model Mperception (↑) Steering (↓) Speed (↓)

None - 0.049 0.048
Ours w/o future pred. - 0.043 0.039
Res. 3D Conv. [25] 6.9% 0.039 0.031
Conv. GRU [2] 7.4% 0.041 0.032
Deterministic
Sep. Inception [64] 9.6% 0.040 0.031
Ours 13.6% 0.036 0.030
Res. 3D Conv. [25] 8.1% 0.040 0.028
Conv. GRU [2] 9.0% 0.038 0.029
Probabilistic
Sep. Inception [64] 13.8% 0.036 0.029
Ours 20.0% 0.033 0.026

Table 4: Evaluation of the driving policy. The policy is learned from temporal
features explicitly trained to predict the future. We observe a significant perfor-
mance improvement over non-temporal and non-future-aware baselines.

All deterministic models trained with the future prediction loss outperform
the baseline, and more interestingly the temporal representation’s ability to bet-
ter predict the future (shown by Mperception ) directly translate in a control
performance gain, with our best deterministic model having, respectively, a 27%
and 38% improvement over the baseline for steering and speed.
Finally, all probabilistic models perform better than their deterministic coun-
terpart, further demonstrating that modelling the uncertainty of the future pro-
duces a more effective spatio-temporal representation. Our probabilistic model
achieves the best performance with a 33% steering and 46% speed improvement
over the baseline.

6 Conclusions
This work is the first to propose a deep learning model capable of probabilistic
future prediction of ego-motion, static scene and other dynamic agents, jointly.
We observe large performance improvements due to our proposed temporal video
encoding architecture and probabilistic modelling of present and future distri-
butions. This initial work leaves a lot of future directions to explore: leveraging
known priors and structure in the latent representation, conditioning the control
policy on future prediction and applying our future prediction architecture to
model-based reinforcement learning.
15

References

1. Amini, A., Rosman, G., Karaman, S., Rus, D.: Variational end-to-end navigation
and localization. In: Proc. International Conference on Robotics and Automation
(ICRA). IEEE (2019)
2. Ballas, N., Yao, L., Pas, C., Courville, A.: Delving deeper into convolutional net-
works for learning video representations. In: Proc. International Conference on
Learning Representations (ICLR) (2016)
3. Bellemare, M.G., Danihelka, I., Dabney, W., Mohamed, S., Lakshminarayanan,
B., Hoyer, S., Munos, R.: The cramer distance as a solution to biased wasserstein
gradients. arXiv preprint (2017)
4. Bhattacharyya, A., Schiele, B., Fritz, M.: Accurate and diverse sampling of se-
quences based on a ”best of many” sample objective. Proc. Computer Vision and
Pattern Recognition (CVPR) (2018)
5. Bilen, H., Fernando, B., Gavves, E., Vedaldi, A., Gould, S.: Dynamic image net-
works for action recognition. Proc. Computer Vision and Pattern Recognition
(CVPR) (2016)
6. Bojarski, M., Testa, D.D., Dworakowski, D., Firner, B., Flepp, B., Goyal, P., Jackel,
L.D., Monfort, M., Muller, U., Zhang, J., Zhang, X., Zhao, J., Zieba, K.: End to
End Learning for Self-Driving Cars. arXiv preprint (2016)
7. Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the
kinetics dataset. In: Proc. Computer Vision and Pattern Recognition (CVPR)
(2017)
8. Casas, S., Luo, W., Urtasun, R.: Intentnet: Learning to predict intention from raw
sensor data. In: Proc. Conference on Robot Learning (CoRL) (2018)
9. Chen, L.C., Papandreou, G., Schroff, F., Adam, H.: Rethinking atrous convolution
for semantic image segmentation. arXiv preprint (2017)
10. Chua, K., Calandra, R., McAllister, R., Levine, S.: Deep reinforcement learning
in a handful of trials using probabilistic dynamics models. Advances in Neural
Information Processing Systems (NeurIPS) (2018)
11. Clark, A., Donahue, J., Simonyan, K.: Adversarial video generation on complex
datasets. In: arXiv preprint (2019)
12. Codevilla, F., Miiller, M., López, A., Koltun, V., Dosovitskiy, A.: End-to-end
driving via conditional imitation learning. In: Proc. International Conference on
Robotics and Automation (ICRA) (2018)
13. Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R.,
Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban
scene understanding. In: Proc. Computer Vision and Pattern Recognition (CVPR)
(2016)
14. Denton, E., Fergus, R.: Stochastic video generation with a learned prior. In: Proc.
International Conference on Machine Learning (ICML). Proceedings of Machine
Learning Research (2018)
15. Denton, E.L., Birodkar, v.: Unsupervised learning of disentangled representations
from video. In: Advances in Neural Information Processing Systems (NeurIPS)
(2017)
16. Ebert, F., Finn, C., Dasari, S., Xie, A., Lee, A.X., Levine, S.: Visual foresight:
Model-based deep reinforcement learning for vision-based robotic control. arXiv
preprint (2018)
17. Ebert, F., Finn, C., Lee, A.X., Levine, S.: Self-supervised visual planning with
temporal skip connections. In: Proc. Conference on Robot Learning (CoRL) (2017)
16

18. Feichtenhofer, C., Pinz, A., Wildes, R.P.: Spatiotemporal residual networks for
video action recognition. In: Advances in Neural Information Processing Systems
(NeurIPS) (2016)
19. Finn, C., Goodfellow, I., Levine, S.: Unsupervised learning for physical interaction
through video prediction. In: Advances in Neural Information Processing Systems
(NeurIPS) (2016)
20. Finn, C., Levine, S.: Deep visual foresight for planning robot motion. Proc. Inter-
national Conference on Robotics and Automation (ICRA) (2017)
21. Gadde, R., Jampani, V., Gehler, P.V.: Semantic video CNNs through representa-
tion warping. In: Proc. Computer Vision and Pattern Recognition (CVPR) (2017)
22. Goodfellow, I.: Nips 2016 tutorial: Generative adversarial networks (2016)
23. Ha, D., Schmidhuber, J.: World models. In: Advances in Neural Information Pro-
cessing Systems (NeurIPS) (2018)
24. Hafner, D., Lillicrap, T., Fischer, I., Villegas, R., Ha, D., Lee, H., Davidson, J.:
Learning latent dynamics for planning from pixels. In: Proc. International Confer-
ence on Machine Learning (ICML) (2019)
25. Hara, K., Kataoka, H., Satoh, Y.: Learning spatio-temporal features with 3d resid-
ual networks for action recognition. In: Proc. International Conference on Com-
puter Vision, workshop (ICCVw) (2017)
26. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition.
In: Proc. Computer Vision and Pattern Recognition (CVPR) (2016)
27. Hessel, M., Modayil, J., van Hasselt, H., Schaul, T., Ostrovski, G., Dabney, W.,
Horgan, D., Piot, B., Azar, M.G., Silver, D.: Rainbow: Combining improvements
in deep reinforcement learning. AAAI Conference on Artificial Intelligence (2018)
28. Huang, X., Cheng, X., Geng, Q., Cao, B., Zhou, D., Wang, P., Lin, Y., Yang, R.:
The apolloscape dataset for autonomous driving. In: Proc. Computer Vision and
Pattern Recognition, workshop (CVPRw) (2018)
29. Ioannou, Y., Robertson, D., Shotton, J., Cipolla, R., Criminisi, A.: Training cnns
with low-rank filters for efficient image classification. In: Proc. International Con-
ference on Learning Representations (ICLR) (2016)
30. Jaderberg, M., Mnih, V., Czarnecki, W.M., Schaul, T., Leibo, J.Z., Silver, D.,
Kavukcuoglu, K.: Reinforcement learning with unsupervised auxiliary tasks. Proc.
International Conference on Learning Representations (ICLR) (2017)
31. Jayaraman, D., Ebert, F., Efros, A., Levine, S.: Time-agnostic prediction: Predict-
ing predictable video frames (2018)
32. Kaiser, L., Babaeizadeh, M., Milos, P., Osinski, B., Campbell, R.H., Czechowski,
K., Erhan, D., Finn, C., Kozakowski, P., Levine, S., Mohiuddin, A., Sepassi, R.,
Tucker, G., Michalewski, H.: Model-based reinforcement learning for atari. In: Proc.
International Conference on Learning Representations (ICLR) (2020)
33. Kalashnikov, D., Irpan, A., Pastor, P., Ibarz, J., Herzog, A., Jang, E., Quillen,
D., Holly, E., Kalakrishnan, M., Vanhoucke, V., Levine, S.: Qt-opt: Scalable deep
reinforcement learning for vision-based robotic manipulation. Proc. International
Conference on Machine Learning (ICML) (2018)
34. Kendall, A., Gal, Y.: What uncertainties do we need in bayesian deep learn-
ing for computer vision? In: Advances in Neural Information Processing Systems
(NeurIPS) (2017)
35. Kendall, A., Gal, Y., Cipolla, R.: Multi-task learning using uncertainty to weigh
losses for scene geometry and semantics. In: Proc. Computer Vision and Pattern
Recognition (CVPR) (2018)
36. Kendall, A., Hawke, J., Janz, D., Mazur, P., Reda, D., Allen, J.M., Lam, V.D.,
Bewley, A., Shah, A.: Learning to drive in a day. In: Proc. International Conference
on Robotics and Automation (ICRA) (2019)
17

37. Kohl, S., Romera-Paredes, B., Meyer, C., Fauw, J.D., Ledsam, J.R., Maier-Hein,
K.H., Eslami, S.M.A., Rezende, D.J., Ronneberger, O.: A probabilistic u-net for
segmentation of ambiguous images. In: Advances in Neural Information Processing
Systems (NeurIPS) (2018)
38. Kurutach, T., Tamar, A., Yang, G., Russell, S.J., Abbeel, P.: Learning plannable
representations with causal infogan. In: Advances in Neural Information Processing
Systems (NeurIPS) (2018)
39. Lee, A.X., Zhang, R., Ebert, F., Abbeel, P., Finn, C., Levine, S.: Stochastic adver-
sarial video prediction. arXiv preprint (2018)
40. Lee, N., Choi, W., Vernaza, P., Choy, C.B., Torr, P.H.S., Chandraker, M.K.: DE-
SIRE: distant future prediction in dynamic scenes with interacting agents. Proc.
Computer Vision and Pattern Recognition (CVPR) (2017)
41. Levine, S., Abbeel, P.: Learning neural network policies with guided policy search
under unknown dynamics. In: Advances in Neural Information Processing Systems
27 (2014)
42. Levine, S., Finn, C., Darrell, T., Abbeel, P.: End-to-end training of deep visuomotor
policies. Journal of Machine Learning Research (2016)
43. Li, Z., Snavely, N.: MegaDepth: Learning Single-View Depth Prediction from In-
ternet Photos. Proc. Computer Vision and Pattern Recognition (CVPR) (2018)
44. Luc, P., Neverova, N., Couprie, C., Verbeek, J., LeCun, Y.: Predicting deeper
into the future of semantic segmentation. In: Proc. International Conference on
Computer Vision (ICCV) (2017)
45. Mathieu, M., Couprie, C., LeCun, Y.: Deep multi-scale video prediction beyond
mean square error. In: Proc. International Conference on Learning Representations
(ICLR) (2016)
46. Neuhold, G., Ollmann, T., Bulo, S.R., Kontschieder, P.: The Mapillary Vistas
Dataset for Semantic Understanding of Street Scenes. In: Proc. International Con-
ference on Computer Vision (ICCV)
47. Oh, J., Guo, X., Lee, H., Lewis, R., Singh, S.: Action-conditional video prediction
using deep networks in atari games. Advances in Neural Information Processing
Systems (NeurIPS) (2015)
48. Piaget, J.: Origins of intelligence in the child. London: Routledge and Kegan Paul
(1936)
49. Ranzato, M., Szlam, A., Bruna, J., Mathieu, M., Collobert, R., Chopra, S.: Video
(language) modeling: a baseline for generative models of natural videos. arXiv
preprint (2014)
50. Rhinehart, N., McAllister, R., Kitani, K.M., Levine, S.: PRECOG: prediction con-
ditioned on goals in visual multi-agent settings. Proc. International Conference on
Computer Vision (ICCV) (2019)
51. Salimans, T., Zhang, H., Radford, A., Metaxas, D.N.: Improving gans using optimal
transport. Proc. International Conference on Learning Representations (ICLR)
(2018)
52. Schuldt, Christian, L.I., Caputo, B.: Recognizing human actions: A local svm ap-
proach. In: Proc. International Conference on Pattern Recognition (2004)
53. Shi, X., Chen, Z., Wang, H., Yeung, D.Y., Wong, W.k., Woo, W.c.: Convolutional
lstm network: A machine learning approach for precipitation nowcasting. In: Ad-
vances in Neural Information Processing Systems (NeurIPS) (2015)
54. Siam, M., Valipour, S., Jagersand, M., Ray, N.: Convolutional gated recurrent net-
works for video segmentation. In: Proc. International Conference on Image Pro-
cessing (2017)
18

55. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale
image recognition. In: Proc. International Conference on Learning Representations
(ICLR) (2015)
56. Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recog-
nition in videos. In: NIPS (2014)
57. Srivastava, N., Mansimov, E., Salakhudinov, R.: Unsupervised learning of video
representations using lstms. In: ICML (2015)
58. Sun, D., Yang, X., Liu, M.Y., Kautz, J.: PWC-Net: CNNs for Optical Flow Using
Pyramid, Warping, and Cost Volume. Proc. Computer Vision and Pattern Recog-
nition (CVPR) (2018)
59. Sun, L., Jia, K., Yeung, D., Shi, B.E.: Human action recognition using factorized
spatio-temporal convolutional networks. Proc. International Conference on Com-
puter Vision (ICCV) (2015)
60. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S.E., Anguelov, D., Erhan, D.,
Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. Proc. Computer
Vision and Pattern Recognition (CVPR) (2015)
61. Tran, D., Wang, H., Torresani, L., Ray, J., LeCun, Y., Paluri, M.: A closer look
at spatiotemporal convolutions for action recognition. In: Proc. Computer Vision
and Pattern Recognition (CVPR) (2018)
62. Wu, Z., Shen, C., van den Hengel, A.: Bridging category-level and instance-level
semantic image segmentation. arXiv preprint (2016)
63. Xie, S., Girshick, R.B., Dollár, P., Tu, Z., He, K.: Aggregated residual transforma-
tions for deep neural networks. Proc. Computer Vision and Pattern Recognition
(CVPR) (2017)
64. Xie, S., Sun, C., Huang, J., Tu, Z., Murphy, K.: Rethinking spatiotemporal feature
learning for video understanding. Proc. European Conference on Computer Vision
(ECCV) (2018)
65. Yu, F., Xian, W., Chen, Y., Liu, F., Liao, M., Madhavan, Vashisht, D.T.: Bdd100k:
A diverse driving video database with scalable annotation tooling (2018)
66. Zhu, X., Xiong, Y., Dai, J., Yuan, L., Wei, Y.: Deep feature flow for video recog-
nition. In: Proc. Computer Vision and Pattern Recognition (CVPR) (2017)
19

A Architecture
In total, our architecture has 30.4M parameters, comprising of modules:
– Perception encoder, Eperception , 25.3M parameters ;
– Dynamics module, Y, 0.8M parameters ;
– Future decoders, G, 3.5M parameters ;
– Control policy model, C, 0.7M parameters.

A.1 Perception
Semantics and Geometry. Our model is an encoder-decoder model with five
encoder blocks and three decoder blocks, followed by an atrous spatial pyramid pooling
(ASPP) module [9]. The encoders contain 2, 4, 8, 8, 8 layers respectively, downsampling
by a factor of two each time with a strided convolution. The decoders contain 3 layers
each, upsampling each time by a factor of two with a sub-strided convolution. All layers
have residual connections and many are low rank, with varying kernel and dilation sizes.
Furthermore, we employ skip connections from the encoder to decoder at each spatial
scale.
We pretrain the scene understanding encoder on a number of heterogeneous datasets
to predict semantic segmentation and depth: CityScapes [13], Mapillary Vistas [46],
ApolloScape [28] and Berkeley Deep Drive [65]. We collapse the classes to 14 seman-
tic segmentation classes shared across these datasets and sample each dataset equally
during training. We train for 200,000 gradient steps with a batch size of 32 using SGD
with an initial learning rate of 0.1 with momentum 0.9. We use cross entropy for seg-
mentation and the scale-invariant loss [43] to learn depth with a weight of 1.0 and 0.1,
respectively.

Motion. In addition to this semantics and geometry encoder, we also use a pre-
trained optical flow network, PWCNet [58]. We use the pretrained authors’ implemen-
tation.

Perception. To form our perception encoder we concatenate these two feature rep-
resentations (from the perception encoder and optical flow net) concatenated together.
We use the features two layers before the output optical flow regression as the feature
representation. The decoders of these networks are used for generating pseudo-ground
truth segmentation and depth labels to train our dynamics and future prediction mod-
ules.

A.2 Temporal Block

We ablate the architecture of our proposed Temporal Block module on Cityscapes
by evaluating performance of future semantic segmentation prediction, at resolution
256 × 512 and for future frames 5 and 10. Let kt denote the temporal kernel size and
ks the spatial kernel size of the 3D convolutions. We compare the following modules:
(i) (kt , ks , ks ) and (1, ks , ks ) convolutions. No global context.
(ii) (kt , ks , ks ), (kt , 1, ks ) and (1, ks , ks ) convolutions. No global context.
(iii) (kt , ks , ks ), (kt , ks , 1) and (1, ks , ks ) convolutions. No global context.
(iv) (kt , ks , ks ), (kt , 1, ks ), (kt , ks , 1) and (1, ks , ks ) convolutions. No global context.
(v) (kt , ks , ks ), (kt , 1, ks ), (kt , ks , 1) and (1, ks , ks ) convolutions. With global context
(i.e. our proposed Temporal Block).
20

Temporal Model IoUi=5 (↑) IoUi=10 (↑)

Repeat frame 0.393 0.331
(i) (kt , ks , ks ) 0.454 0.411
(ii) (kt , ks , ks ), (kt , 1, ks ) 0.461 0.411
Probabilistic (iii) (kt , ks , ks ), (kt , ks , 1) 0.449 0.413
(iv) (kt , ks , ks ), (kt , 1, ks ), (kt , ks , 1) 0.453 0.413
(v) Temporal Block (Ours) 0.464 0.416

Table 5: Ablation study of the Temporal Block on Cityscapes, evaluated on

future semantic segmentation performance at i = 5 and i = 10 frames in the
future. Our proposed Temporal Block module outperforms all the other variants.

B Nomenclature

We detail the symbols used to describe our model in this paper.

Networks
Perception encoder Eperception
Temporal Block T
Dynamics module Y
Present network P
Future network F
Future prediction module G
Future decoders Ds , Dd , Df
Control module C
Tensors
Temporal context T
Future prediction horizon Nf
Future control horizon Nc
Input image it
Perception features xt = Eperception (it )
Dynamics features zt = Y(xt−T +1 : xt )
Present distribution µt,present , σt,present = P(zt )
Future distribution µt,future , σt,future = F(zt )
2
Noise vector (train) ηt ∼ N (µt,future , σt,future )
2
Noise vector (test) ηt ∼ N (µt,present , σt,present )
Future prediction inputs ut+i
t = η t
Future prediction initial hidden state gtt = zt
Future prediction output features gtt+i = G(ut+it , gt
t+i−1
)
t+i t+i ˆt+i ˆt+i
ôt = {ŝt , dt , ft }
Future perception outputs
= {Ds (gtt+i ), Dd (gtt+i ), Df (gtt+i )}
ˆ
ĉt = {v̂t , v̇ˆt , θ̂t , θ̇t }
Control outputs
= C(zt )
21

C Cityscapes Qualitative Examples

(a) Our model can correctly predict future segmentation of small classes such as poles
or traffic lights.

(b) Dynamic agents, i.e. cars and cyclists, are also accurately predicted.

(c) In this example, the bus is correctly segmented, without any class bleeding contrary
to the pseudo-ground truth segmentation, showing that our model can reason in a
holistic way.

Fig. 5: Future prediction on the CityScapes dataset, for 10 frames in the future
at 17Hz and 256 × 512 resolution.

Self Directed Learning
No ratings yet
Self Directed Learning
10 pages
Fast and Furious
No ratings yet
Fast and Furious
9 pages
3128/submission 3128
No ratings yet
3128/submission 3128
10 pages
Lec 03 Direct Perception
No ratings yet
Lec 03 Direct Perception
69 pages
IMP-1A Review on Deep Learning Techniques for Video Prediction
No ratings yet
IMP-1A Review on Deep Learning Techniques for Video Prediction
26 pages
Savp
No ratings yet
Savp
26 pages
Generalized Predictive Model for Autonomous Driving
No ratings yet
Generalized Predictive Model for Autonomous Driving
33 pages
Deep+Learning+Approaches+to+Predict+Future+Frames+in+Videos
No ratings yet
Deep+Learning+Approaches+to+Predict+Future+Frames+in+Videos
17 pages
SONG-SENIORTHESIS-2019
No ratings yet
SONG-SENIORTHESIS-2019
50 pages
Towards Generalist Robot Learning From Internet Video- A Survey
No ratings yet
Towards Generalist Robot Learning From Internet Video- A Survey
69 pages
sensors-23-09225
No ratings yet
sensors-23-09225
27 pages
2506.09985v1
No ratings yet
2506.09985v1
48 pages
Video Flow A Conditional Flow Based On Nanoparticles
No ratings yet
Video Flow A Conditional Flow Based On Nanoparticles
18 pages
Meta Ai提出了一种名为v Jepa的自监督学习方法，利用视频特征预测开发出高效自监督视觉表示，实现多任务性能提升
No ratings yet
Meta Ai提出了一种名为v Jepa的自监督学习方法，利用视频特征预测开发出高效自监督视觉表示，实现多任务性能提升
23 pages
SAVi++ Towards End-to-End Object-CentricSAVi++ Towards End-to-End Object-Centric
No ratings yet
SAVi++ Towards End-to-End Object-CentricSAVi++ Towards End-to-End Object-Centric
21 pages
2412.18607v1
No ratings yet
2412.18607v1
15 pages
L U V U C - P N 3DM: Earning From Nlabelled Ideos Sing ON Trastive Redictive Eural Apping
No ratings yet
L U V U C - P N 3DM: Earning From Nlabelled Ideos Sing ON Trastive Redictive Eural Apping
19 pages
paper1
No ratings yet
paper1
15 pages
2407.05679v2
No ratings yet
2407.05679v2
17 pages
2305.14644v1
No ratings yet
2305.14644v1
12 pages
Learning For Autonomous Vehicles: A Focus On Expert Demonstration
No ratings yet
Learning For Autonomous Vehicles: A Focus On Expert Demonstration
26 pages
paper 4
No ratings yet
paper 4
10 pages
Deep Learning in Next-Frame Prediction A Benchmark Review
No ratings yet
Deep Learning in Next-Frame Prediction A Benchmark Review
11 pages
paper4
No ratings yet
paper4
12 pages
Mange Ment
No ratings yet
Mange Ment
478 pages
Gao SimVP Simpler Yet Better Video Prediction CVPR 2022 Paper
No ratings yet
Gao SimVP Simpler Yet Better Video Prediction CVPR 2022 Paper
11 pages
A Survey of Deep Learning Techniques For Autonomous Driving
No ratings yet
A Survey of Deep Learning Techniques For Autonomous Driving
28 pages
paper5
No ratings yet
paper5
12 pages
TNT: Target-Driven Trajectory Prediction
No ratings yet
TNT: Target-Driven Trajectory Prediction
12 pages
Video Representation Learning by Dense Predictive Coding 1909.04656
No ratings yet
Video Representation Learning by Dense Predictive Coding 1909.04656
13 pages
Convolutional Social Pooling For Vehicle Trajectory Prediction
No ratings yet
Convolutional Social Pooling For Vehicle Trajectory Prediction
9 pages
1611.09842v3
No ratings yet
1611.09842v3
11 pages
7 QCnet
No ratings yet
7 QCnet
11 pages
Advanced Topics in Autonomous Driving Using Deep Learning: Presenter: Nasim Souly
No ratings yet
Advanced Topics in Autonomous Driving Using Deep Learning: Presenter: Nasim Souly
41 pages
VectorNet
No ratings yet
VectorNet
9 pages
Cultrera 2020
No ratings yet
Cultrera 2020
10 pages
6 CoverNet
No ratings yet
6 CoverNet
10 pages
(2019) Future Event Prediction - If and When CVPR 3DCNN
No ratings yet
(2019) Future Event Prediction - If and When CVPR 3DCNN
10 pages
2D or Not 2D Adaptive 3D Convolution Selection For Efficient Video Recogition
No ratings yet
2D or Not 2D Adaptive 3D Convolution Selection For Efficient Video Recogition
10 pages
s42979-022-01498-y
No ratings yet
s42979-022-01498-y
16 pages
paper3
No ratings yet
paper3
9 pages
Agro Implicit Occupancy Flow Fields For Perception and Prediction in Self-Driving CVPR 2023 Paper
No ratings yet
Agro Implicit Occupancy Flow Fields For Perception and Prediction in Self-Driving CVPR 2023 Paper
10 pages
Unsupervised Learning of Video Representations Using Lstms
No ratings yet
Unsupervised Learning of Video Representations Using Lstms
12 pages
RiskMap a Unified Driving Context Representation for Autonomous Motion Planning in Urban Driving Environment
No ratings yet
RiskMap a Unified Driving Context Representation for Autonomous Motion Planning in Urban Driving Environment
6 pages
Paper 1
No ratings yet
Paper 1
7 pages
Wu AdaFrame Adaptive Frame Selection For Fast Video Recognition CVPR 2019 Paper
No ratings yet
Wu AdaFrame Adaptive Frame Selection For Fast Video Recognition CVPR 2019 Paper
10 pages
De - Biasing 2 MIT
No ratings yet
De - Biasing 2 MIT
9 pages
Learning For Autonomous Vehicles: A Focus On Expert Demonstration
No ratings yet
Learning For Autonomous Vehicles: A Focus On Expert Demonstration
24 pages
0532 Exploring Dynamic Context For Multi-Path Trajectory Prediction
No ratings yet
0532 Exploring Dynamic Context For Multi-Path Trajectory Prediction
7 pages
Predictive - Huawei-Noah & CUHK-SZ-1
No ratings yet
Predictive - Huawei-Noah & CUHK-SZ-1
5 pages
Paper 11
No ratings yet
Paper 11
15 pages
Video Prediction PDF
No ratings yet
Video Prediction PDF
21 pages
Gao VectorNet Encoding HD Maps and Agent Dynamics From Vectorized Representation CVPR 2020 Paper
No ratings yet
Gao VectorNet Encoding HD Maps and Agent Dynamics From Vectorized Representation CVPR 2020 Paper
9 pages
V Jepa
No ratings yet
V Jepa
23 pages
Ref 4
No ratings yet
Ref 4
9 pages
Multi-Modal Hybrid Architecture For Pedestrian Action Prediction
No ratings yet
Multi-Modal Hybrid Architecture For Pedestrian Action Prediction
7 pages
End-To-End Learning of Driving Models From Large-Scale Video Datasets
No ratings yet
End-To-End Learning of Driving Models From Large-Scale Video Datasets
9 pages
A Survey on Deep Learning Approaches for Vehicle Trajectory
No ratings yet
A Survey on Deep Learning Approaches for Vehicle Trajectory
8 pages
End-To-End Contextual Perception and Prediction With Interaction Transformer
No ratings yet
End-To-End Contextual Perception and Prediction With Interaction Transformer
8 pages
2024-05-13-Kolmogorov-Arnold Networks the latest advance in Neural Networks, simply explained by Theo Wolf May
No ratings yet
2024-05-13-Kolmogorov-Arnold Networks the latest advance in Neural Networks, simply explained by Theo Wolf May
22 pages
MHT Cet Triumph Physics Mcqs Based On STD Xi Xii Syllabus MH Board Sol
33% (3)
MHT Cet Triumph Physics Mcqs Based On STD Xi Xii Syllabus MH Board Sol
489 pages
Details Philippine Qualifications Framework
No ratings yet
Details Philippine Qualifications Framework
6 pages
Commalds
No ratings yet
Commalds
8 pages
An End-to-End Curriculum Learning Approach For Aut
No ratings yet
An End-to-End Curriculum Learning Approach For Aut
10 pages
ENG110_Unit_2.2_Assignment Kenneth McKinney
No ratings yet
ENG110_Unit_2.2_Assignment Kenneth McKinney
11 pages
2770_variational_autoencoding_of_de
No ratings yet
2770_variational_autoencoding_of_de
18 pages
SSD Research Report Education in AJK
No ratings yet
SSD Research Report Education in AJK
35 pages
Modals of Obligation Grammar Guides Reading Comprehension Exercises 19693
No ratings yet
Modals of Obligation Grammar Guides Reading Comprehension Exercises 19693
4 pages
2410_3D-Adapter
No ratings yet
2410_3D-Adapter
22 pages
Proquest Umi Dissertation Publishing
100% (2)
Proquest Umi Dissertation Publishing
9 pages
College Research Paper
100% (1)
College Research Paper
29 pages
The Development of Students Critical Thinking Abilities and
No ratings yet
The Development of Students Critical Thinking Abilities and
18 pages
44_variational_point_encoding_def
No ratings yet
44_variational_point_encoding_def
10 pages
2103 - Generative Adversarial Transformers
No ratings yet
2103 - Generative Adversarial Transformers
22 pages
Zitelli Picture Review - Genetics
No ratings yet
Zitelli Picture Review - Genetics
101 pages
2008 - Graph Signal Processing For Geometric Data and Beyond
No ratings yet
2008 - Graph Signal Processing For Geometric Data and Beyond
16 pages
Assignment On SRG Bangladesh Limited (SRGB)
No ratings yet
Assignment On SRG Bangladesh Limited (SRGB)
10 pages
20ME5602
No ratings yet
20ME5602
2 pages
Digital Light Wave ASA 312
No ratings yet
Digital Light Wave ASA 312
11 pages
K120443760224 Cor
No ratings yet
K120443760224 Cor
1 page
What Are The Expectedproducts of Hydrolysis of Lactose
No ratings yet
What Are The Expectedproducts of Hydrolysis of Lactose
1 page
Singapore School, Cilegon: Pupil'S Progress Profile
No ratings yet
Singapore School, Cilegon: Pupil'S Progress Profile
6 pages
QP_V20UDM305_BC (1)
No ratings yet
QP_V20UDM305_BC (1)
2 pages
Edtech 2 Group 1
No ratings yet
Edtech 2 Group 1
40 pages
Evaluator Choreographer Ideologue Catalyst The Dis
No ratings yet
Evaluator Choreographer Ideologue Catalyst The Dis
3 pages
Knowledge of Taxation, Taxpayer Awareness and Compliance of Individual Taxpayers
No ratings yet
Knowledge of Taxation, Taxpayer Awareness and Compliance of Individual Taxpayers
15 pages
Towards Geometric Deep Learning I - On The Shoulders of Giants
No ratings yet
Towards Geometric Deep Learning I - On The Shoulders of Giants
13 pages
Art Museums by The Numbers 2014
No ratings yet
Art Museums by The Numbers 2014
6 pages
Gcu Safety Calendar 2
No ratings yet
Gcu Safety Calendar 2
7 pages
Abdest Ve Mesh
No ratings yet
Abdest Ve Mesh
22 pages
Ravindra Bharathi Schools: Office Copy Fee Receipt
No ratings yet
Ravindra Bharathi Schools: Office Copy Fee Receipt
2 pages
1st Prelim - Teaching Profession-Zm
100% (8)
1st Prelim - Teaching Profession-Zm
3 pages
Teacher's Philosophy of Education
No ratings yet
Teacher's Philosophy of Education
9 pages
Foundation Course - Indian Literature - English
No ratings yet
Foundation Course - Indian Literature - English
5 pages
Resume of Loriwood73
No ratings yet
Resume of Loriwood73
1 page
Some Important Transaction Codes For SAP BW
No ratings yet
Some Important Transaction Codes For SAP BW
3 pages
Underwater Computer Vision: Exploring the Depths of Computer Vision Beneath the Waves
From Everand
Underwater Computer Vision: Exploring the Depths of Computer Vision Beneath the Waves
Fouad Sabry
No ratings yet

2003 - Probabilistic Future Prediction For Video Scene Understanding

Uploaded by

2003 - Probabilistic Future Prediction For Video Scene Understanding

Uploaded by

Probabilistic Future Prediction for

Video Scene Understanding

Abstract. We present a novel deep learning architecture for probabilis-

Temporal representations. Current state-of-the-art temporal representa-

as a complementary input modality has been consistently shown to improve per-

Multi-modality of future prediction. Modelling uncertainty is important

Learning a control policy. The representation learned from dynamics mod-

zt+2 F µt,f , σt,f G ôt+2

... v̂t, v̇ˆt θ̂t, θ̇ˆt

Fig. 1: Our architecture has five modules: Perception, Dynamics, Present/Future

Learning a temporal representation from video is extremely challenging because

Temporal Block We propose a spatio-temporal module, named Temporal

3.3 Future Prediction

Local Context Upsample Upsample Upsample

Fig. 2: A Temporal Block, our proposed spatio-temporal module. From a four-

similarly to Clark et al . [11]. For deterministic inference, its input is ut+i

gtt+i = G(ut+i t+i−1

3.4 Present & Future Distributions

standard deviation. We parameterise both distributions with a neural network,

Present distribution The input of the network P is zt ∈ RCd ×H×W , which

Future distribution F is not only conditioned by the past zt , but also by

Probabilistic Future Prediction During training, we sample from the future

From this rich spatio-temporal representation zt explicitly trained to predict the

Future Prediction The future prediction loss at timestep t is the weighted

For depth, Lt+i

Control We use imitation learning, regressing to the expert’s true control

L = λf p Lfuture-pred + λc Lc + λp Lprobabilistic (11)

In all experiments we use γf = 0.6, λs = 1.0, λd = 1.0, λf = 0.5, λf p = 1,

We have collected driving data in a densely populated, urban environment, rep-

4.1 Training Data

Perception We first pretrain the scene understanding encoder on a number of

We report standard metrics for measuring the quality of segmentation, depth

Inspired by the energy functions used in [3,51], we additionally report a diversity

DDM(Y, P ) = min d(Y, S) − E d(S, S 0 )

5.1 Spatio-Temporal Representation

Table 1: Perception performance metrics for two seconds future prediction on

5.2 Probabilistic Future

Temporal Model Depth (↓) Flow (↓) Seg. (↓)

Table 1 shows that every temporal architecture have superior performance

Table 3: Future semantic segmentation performance on Cityscapes at i = 5 and

5.3 Driving Policy

Temporal Model Mperception (↑) Steering (↓) Speed (↓)

A.2 Temporal Block

Temporal Model IoUi=5 (↑) IoUi=10 (↑)

Table 5: Ablation study of the Temporal Block on Cityscapes, evaluated on

We detail the symbols used to describe our model in this paper.

C Cityscapes Qualitative Examples

You might also like