End-To-End Learning of Driving Models From Large-Scale Video Datasets
Abstract
Robust perception-action models should be learned from training data with diverse visual appearances and realistic behaviors, yet current approaches to deep visuomotor policy learning have been generally limited to in-situ models learned from a single vehicle or simulation environment. We advocate learning a generic vehicle motion model from large scale crowd-sourced video data, and develop an end-to-end trainable architecture for learning to predict a distribution over future vehicle egomotion from instantaneous monocular camera observations and previous vehicle state. Our model incorporates a novel FCN-LSTM architecture, which can be learned from large-scale crowd-sourced vehicle action data, and leverages available scene segmentation side tasks to improve performance under a privileged learning paradigm. We provide a novel large-scale dataset of crowd-sourced driving behavior suitable for training our model, and report results predicting the driver action on held out sequences across diverse conditions.

Figure 1: Autonomous driving is formulated as a future egomotion prediction problem. Given a large-scale driving video dataset, an end-to-end FCN-LSTM network is trained to predict multi-modal discrete and continuous driving behaviors. Using semantic segmentation as a side task further improves the model.
1. Introduction

Learning perception-based policies to support complex autonomous behaviors, including driving, is an ongoing challenge for computer vision and machine learning. While recent advances that use rule-based methods have achieved some success, we believe that learning-based approaches will ultimately be needed to handle complex or rare scenarios, and scenarios that involve multi-agent interplay with other human agents.

The recent success of deep learning methods for visual perception tasks has increased interest in their efficacy for learning action policies. Recent demonstration systems [1, 2, 12] have shown that simple tasks, such as a vehicle lane-following policy or obstacle avoidance, can be solved by a neural net. This echoes the seminal work by Dean Pomerleau with the CMU NavLab, whose ALVINN network was among the earliest successful neural network models [17].

These prior efforts generally formulate the problem as learning a mapping from pixels to actuation. This end-to-end optimization is appealing as it directly mimics the demonstrated performance, but is limiting in that it can only be performed on data collected with the specifically calibrated actuation setup, or in corresponding simulations (e.g., as was done in [17], and more recently in [23, 20, 3]). The success of supervised robot learning-based methods is governed by the availability of training data, and typical publicly available datasets only contain on the order of dozens to hundreds of hours of collected experience.

We explore an alternative paradigm, which follows the successful practice in most computer vision settings, of exploiting large scale online and/or crowdsourced datasets. We advocate learning a driving model or policy from large scale uncalibrated sources, and specifically optimize models based on crowdsourced dashcam video sources. We release with our paper a curated dataset from which suitable models or policies can be learned.

To learn a model from this data, we propose a novel deep learning architecture for learning-to-drive from uncalibrated large-scale video data. We formulate the problem as learning a generic driving model/policy; our learned model is generic in that it learns a predictive future motion path given the present agent state.
Presently we learn our model from a corpus of demonstrated behavior and evaluate on held out data from the same corpus. Our driving model is akin to a language model, which scores the likelihood of character or word sequences given certain corpora; our model similarly is trained and evaluated in terms of its ability to score as highly likely the observed behavior of the held out driving sequence. It is also a policy in that it defines a probability distribution over actions conditioned on a state, with the limitation that the policy is never actually executed in the real world or simulation.

Our paper offers four novel contributions. First, we introduce a generic motion approach to learning a deep visuomotor action policy where actuator-independent motion plans are learned based on current visual observations and previous vehicle state. Second, we develop a novel FCN-LSTM which can learn jointly from demonstration loss and segmentation loss, and can output multimodal predictions. Third, we curate and make publicly available a large-scale dataset to learn a generic motion model from vehicles with heterogeneous actuators. Finally, we report experimental results confirming that "privileged" training with side task (semantic segmentation) loss learns egomotion prediction tasks faster than from motion prediction task loss alone.2

2 The codebase and dataset can be found at https://github.com/gy20073/BDD_Driving_Model/

We evaluate our model and compare to various baselines in terms of the ability of the model to predict held-out video examples; our task can be thought of as that of predicting future egomotion given present observation and previous agent state history.

While future work includes extending our model to drive a real car, and addressing issues therein involving policy coverage across undemonstrated regions of the policy space (c.f. [18]), we nonetheless believe that effective driving models learned from large scale datasets using the class of methods we propose will be a key element in learning a robust policy for a future driving agent.

2. Related Work

ALVINN [17] was among the very first attempts to use a neural network for autonomous vehicle navigation. The approach was simple, comprising a shallow network that predicted actions from pixel inputs applied to simple driving scenarios with few obstacles; nevertheless, its success suggested the potential of neural networks for autonomous navigation.

Recently, NVIDIA demonstrated a similar idea that benefited from the power of modern convolutional networks to extract features from the driving frames [1]. This framework was successful in relatively simple real-world scenarios, such as highway lane-following and driving in flat, obstacle-free courses.

Instead of directly learning to map from pixels to actuation, [2] proposed mapping pixels to pre-defined affordance measures, such as the distance to surrounding cars. This approach provides human-interpretable intermediate outputs, but a complete set of such measures may be intractable to define in complex, real-world scenarios. Moreover, the learned affordances need to be manually associated with car actions, which is expensive, as was the case with older rule-based systems. Concurrent approaches in industry have used neural network predictions from tasks such as object detection and lane segmentation as inputs to a rule-based control system [9].

Another line of work has treated autonomous navigation as a visual prediction task in which future video frames are predicted on the basis of previous frames. [21] propose to learn a driving simulator with an approach that combines a Variational Auto-encoder (VAE) [10] and a Generative Adversarial Network (GAN) [7]. This method is a special case of the more general task of video prediction; there are examples of video prediction models being applied to driving scenarios [4, 14]. However, in many scenarios, video prediction is ill-constrained as preceding actions are not given as input to the model. [16, 6] address this by conditioning the prediction on the model's previous actions. In our work, we incorporate information about previous actions in the form of an accumulated hidden state.

Our model also includes a side- or privileged-information learning aspect. This occurs when a learning algorithm has additional knowledge at training time, i.e., additional labels or meta-data. This extra information helps training of a better model than would be possible using only the view available at test time. A theoretical framework for learning under privileged information (LUPI) was explored in [24]; a max-margin framework for learning with side-information in the form of bounding boxes, image tags, and attributes was examined in [22] within the DPM framework. Recently [8] exploited deep learning with side tasks when mapping from depth to intensity data. Below we exploit a privileged/side-training paradigm for learning to drive, using semantic segmentation side labels.

Recent advances in recurrent neural network modeling for sequential image data are also related to our work. The Long-term Recurrent Convolutional Network (LRCN) [5] model investigates the use of deep visual features for sequence modeling tasks by applying a long short-term memory (LSTM) recurrent neural network to the output of a convolutional neural network. We take this approach, but use the novel combination of a fully convolutional network (FCN) [13] and an LSTM. A different approach is taken by [25], who introduce a convolutional LSTM network that directly incorporates convolution operations into the cell updates.
3. Deep Generic Driving Networks

3.1. Generic Driving Models

A generic driving model determines which future motion is plausible given the current observed world configuration. Note that the world configuration incorporates previous observation and vehicle state. Formally, a driving model F is a function defined as:

F(s, a) : S × A → R    (1)

where s denotes states, a represents a potential motion action, and F(s, a) measures the feasibility score of operating motion action a under the state s.

Our approach is generic in that it predicts egomotion, rather than actuation of a specific vehicle.3 Our generic models take as input raw pixels and current and prior vehicle state signals, and predict the likelihood of future motion. This can be defined over a range of action or motion granularity, and we consider both discrete and continuous settings in this paper.4 For example, the motion action set A could be a set of coarse actions:

A = {straight, stop, left-turn, right-turn}    (2)

One can also define finer actions based on the car's egomotion heading in the future. In that case, the possible motion action set is:

A = {v | v ∈ R²}    (3)

where v denotes the future egomotion on the ground plane.

We refer to F(s, a) as a driving model inspired by its similarity to the classical N-gram language model in Natural Language Processing. Both take in a sequence of prior events, such as what the driver has seen in the driving model, or the previously observed tokens in the language model, and predict plausible future events, such as the viable physical actions or the coherent words. Our driving model can equivalently be thought of as a policy from a robotics perspective, but we presently only train and test our model from fixed existing datasets, as explained below, and consequently we feel the language model analogy is the more suitable one.

3 Future work will address how to take such a prediction and cause the desired motion to occur on a specific actuation platform. The latter problem has long been studied in the robotics and control literature, and both conventional and deep-learning based solutions are feasible (as is their combination).

4 We leave the most general setting, of directly predicting arbitrary 6DOF motion, to future work.

Figure 2: Comparison among novel architectures that can fuse time-series information with visual inputs: (a) FCN-LSTM, (b) Conv-LSTM, (c) LRCN.

3.2. FCN-LSTM Architecture

Our goal is to predict the distribution over feasible future actions, conditioned on the past and current states, including visual cues and egomotions. To accomplish our goal, an image encoder is necessary to learn the relevant visual representation in each input frame, together with a temporal network to take advantage of the motion history information. We propose a novel architecture for time-series prediction which fuses an LSTM temporal encoder with a fully convolutional visual encoder. Our model is able to jointly train motion prediction and pixel-level supervised tasks. We can use semantic segmentation as a side task following the "privileged" information learning paradigm. This leads to better performance in our experiments. Figure 2 compares our architecture (FCN-LSTM) with two related architectures [5, 25].

3.2.1 Visual Encoder

Given a video frame input, a visual encoder can encode the visual information in a discriminative manner while maintaining the relevant spatial information. In our architecture, a dilated fully convolutional neural network [26, 5] is used to extract the visual representations. We take the ImageNet [19] pre-trained AlexNet [11] model, remove the POOL2 and POOL5 layers, and use dilated convolutions for conv3 through fc7. To get a more discriminative encoder, we fine-tune it jointly with the temporal network described below. The dilated FCN representation has the advantage that it enables the network to be jointly trained with a side task in an end-to-end manner. This approach is advantageous when the training data is scarce.
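To make the pieces above concrete, the sketch below shows one way a dilated visual encoder, an LSTM over per-frame features and previous egomotion, a discrete action head over the set of Eq. (2), and a segmentation side head could be assembled. It is a minimal illustration under stated assumptions, not the released implementation: the small custom conv stack stands in for the modified AlexNet, and the feature sizes, pooling, and two-dimensional previous-egomotion input are choices made only for this sketch.

```python
import torch
import torch.nn as nn

class DilatedVisualEncoder(nn.Module):
    """Stand-in for the paper's dilated-AlexNet FCN encoder.

    The actual model starts from ImageNet-pretrained AlexNet, drops POOL2 and
    POOL5, and dilates conv3 through fc7; a small dilated conv stack is used
    here so the sketch stays self-contained.
    """
    def __init__(self, feat_dim=256):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=4, padding=3), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(inplace=True),
            # Dilated convolutions enlarge the receptive field without further
            # downsampling, preserving spatial detail for the segmentation side task.
            nn.Conv2d(128, feat_dim, 3, padding=2, dilation=2), nn.ReLU(inplace=True),
            nn.Conv2d(feat_dim, feat_dim, 3, padding=4, dilation=4), nn.ReLU(inplace=True),
        )

    def forward(self, x):            # x: (B, 3, H, W)
        return self.features(x)      # (B, feat_dim, H/4, W/4) feature map

class FCNLSTMDriver(nn.Module):
    """Fuses the FCN encoder with an LSTM and predicts discrete actions,
    with a semantic-segmentation side head for privileged training."""
    def __init__(self, feat_dim=256, hidden=64, num_actions=4, num_classes=19):
        super().__init__()
        self.encoder = DilatedVisualEncoder(feat_dim)
        self.pool = nn.AdaptiveAvgPool2d(1)                 # collapse spatial map per frame
        self.lstm = nn.LSTM(feat_dim + 2, hidden, batch_first=True)   # +2 for previous egomotion
        self.action_head = nn.Linear(hidden, num_actions)   # A = {straight, stop, left-turn, right-turn}
        self.seg_head = nn.Conv2d(feat_dim, num_classes, 1) # pixel-wise side task

    def forward(self, frames, prev_motion):
        # frames: (B, T, 3, H, W); prev_motion: (B, T, 2)
        B, T = frames.shape[:2]
        fmaps = self.encoder(frames.flatten(0, 1))          # (B*T, C, h, w)
        seg_logits = self.seg_head(fmaps)                   # side-task prediction per frame
        feats = self.pool(fmaps).flatten(1).view(B, T, -1)  # per-frame feature vectors
        out, _ = self.lstm(torch.cat([feats, prev_motion], dim=-1))
        action_logits = self.action_head(out)               # (B, T, num_actions)
        return action_logits, seg_logits
```

In such a setup, a cross-entropy loss on action_logits against the observed driver actions would be combined with a per-pixel cross-entropy on seg_logits whenever segmentation labels are available; for the continuous setting of Eq. (3), the action head would instead parameterize a distribution over future headings.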
Figure 4: Example of the density of the BDDV data distribution in a major city. Each dot represents the starting location of a short video clip of approximately 40 seconds.
Dataset         | Settings                  | Type      | Approx. scale       | Diversity                                                        | Specific car make
KITTI           | city, rural area, highway | real      | less than 1 hour    | one city, one weather condition, daytime                         | Yes
Cityscapes      | city                      | real      | less than 100 hours | German cities, multiple weather conditions, daytime              | Yes
Comma.ai        | mostly highway            | real      | 7.3 hours           | highway, N.A., daytime and night                                 | Yes
Oxford          | city                      | real      | 214 hours           | one city (Oxford), multiple weather conditions, daytime          | Yes
Princeton Torcs | highway                   | synthetic | 13.5 hours          | N.A.                                                             | N.A.
GTA             | city, highway             | synthetic | N.A.                | N.A.                                                             | N.A.
BDDV (ours)     | city, rural area, highway | real      | 10k hours           | multiple cities, multiple weather conditions, daytime and night  | No
Method                       | Perplexity | Accuracy
Motion Reflex Approach       | 0.718      | 71.31%
Mediated Perception Approach | 0.8887     | 61.66%
Privileged Training Approach | 0.697      | 72.4%
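For reference, the two metrics above can be computed from the model's per-frame action distributions roughly as follows. This is a minimal sketch of the standard definitions and not the paper's evaluation code; the reported "perplexity" values are below 1, which is consistent with a mean negative log-likelihood, so the sketch returns that quantity and notes where an exponentiated perplexity would differ.

```python
import numpy as np

def evaluate_discrete_actions(pred_probs, true_actions, eps=1e-12):
    """Score predicted per-frame action distributions against driver actions.

    pred_probs   : (N, K) array, rows are distributions over the K discrete actions.
    true_actions : (N,) integer array of the driver's observed actions.

    Returns (nll, accuracy). np.exp(nll) would give the conventional
    exponentiated perplexity instead of the mean negative log-likelihood.
    """
    n = len(true_actions)
    nll = -np.mean(np.log(pred_probs[np.arange(n), true_actions] + eps))
    accuracy = np.mean(pred_probs.argmax(axis=1) == true_actions)
    return nll, accuracy
```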
Figure 7: Continuous actions predicted by our model. The green sector, with varying darkness, shows the probability map of going in a particular direction. The blue line shows the driver's action. Panels: (a) lane following left; (b) lane following right; (c) multiple possible actions: turn left or go straight; (d) collapsed to a single action after the turn; (e) single-sided prediction due to the sidewalk; (f) right turn becomes available at the intersection.

For the Mediated Perception Approach, we first compute the segmentation output of every frame in the videos using the Multi-Scale Context Aggregation approach described in [26]. We then feed the segmentation results into an LSTM and train the LSTM independently from the segmentation part, mimicking stage-by-stage training. In theory, one would not need a side task to improve the performance of a neural network given unlimited data. To simulate a scenario in which only a limited amount of training data is available, we run experiments on a common subset of 1000 video clips.

As shown in Table 4, the Privileged Training approach achieves the best performance in both perplexity and accuracy. These observations align well with our intuition that training on side tasks in an end-to-end fashion improves performance. Figure 8 shows an example in which Privileged Training provides a benefit. In the first column, there is a red light far ahead at the intersection. The Privileged Training approach has successfully identified it and predicted stop in (c), while the other two methods fail. In the second column, the car is waiting behind another car. In the frame immediately preceding these frames, the vehicle in front had an illuminated brake light. The second column of images shows the predictions of the three methods when the brake light of the car goes out but the vehicle has not yet started to move. The Privileged Training approach in (c) predicts stop with high probability. The other two methods behave more aggressively and predict going straight with high probability.
6. Conclusion

We introduce an approach to learning a generic driving model from a large-scale crowd-sourced video dataset with an end-to-end trainable architecture. The model learns from monocular camera observations and previous egomotion states to predict a distribution over future egomotion. It uses a novel FCN-LSTM architecture to learn from driving behaviors, and can take advantage of semantic segmentation as a side task to improve performance, following the privileged learning paradigm. To facilitate our study, we provide a novel large-scale dataset of crowd-sourced driving behaviors that is suitable for learning driving models. We investigate the effectiveness of our driving model and of "privileged" learning by evaluating future egomotion prediction on held-out sequences across diverse conditions.
References

[1] M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, et al. End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316, 2016.
[2] C. Chen, A. Seff, A. Kornhauser, and J. Xiao. DeepDriving: Learning affordance for direct perception in autonomous driving. In Proceedings of the IEEE International Conference on Computer Vision, pages 2722–2730, 2015.
[3] S. Daftry, J. A. Bagnell, and M. Hebert. Learning transferable policies for monocular reactive MAV control. In International Symposium on Experimental Robotics, 2016.
[4] B. De Brabandere, X. Jia, T. Tuytelaars, and L. Van Gool. Dynamic filter networks. arXiv preprint arXiv:1605.09673, 2016.
[5] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2625–2634, 2015.
[6] C. Finn, I. Goodfellow, and S. Levine. Unsupervised learning for physical interaction through video prediction. arXiv preprint arXiv:1605.07157, 2016.
[7] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[8] J. Hoffman, S. Gupta, and T. Darrell. Learning with side information through modality hallucination. In Proc. Computer Vision and Pattern Recognition (CVPR), 2016.
[9] B. Huval, T. Wang, S. Tandon, J. Kiske, W. Song, J. Pazhayampallil, M. Andriluka, P. Rajpurkar, T. Migimatsu, R. Cheng-Yue, et al. An empirical evaluation of deep learning on highway driving. arXiv preprint arXiv:1504.01716, 2015.
[10] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. stat, 1050:10, 2014.
[11] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[12] Y. LeCun, U. Muller, J. Ben, E. Cosatto, and B. Flepp. Off-road obstacle avoidance through end-to-end learning. In NIPS, pages 739–746, 2005.
[13] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.
[14] W. Lotter, G. Kreiman, and D. Cox. Deep predictive coding networks for video prediction and unsupervised learning. arXiv preprint arXiv:1605.08104, 2016.
[15] W. Maddern, G. Pascoe, C. Linegar, and P. Newman. 1 Year, 1000 km: The Oxford RobotCar Dataset. The International Journal of Robotics Research (IJRR), to appear.
[16] J. Oh, X. Guo, H. Lee, R. L. Lewis, and S. Singh. Action-conditional video prediction using deep networks in Atari games. In Advances in Neural Information Processing Systems, pages 2863–2871, 2015.
[17] D. Pomerleau. ALVINN: An autonomous land vehicle in a neural network. In D. Touretzky, editor, Advances in Neural Information Processing Systems 1. Morgan Kaufmann, 1989.
[18] S. Ross, G. J. Gordon, and D. Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In AISTATS, volume 1, page 6, 2011.
[19] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
[20] A. A. Rusu, M. Vecerik, T. Rothörl, N. Heess, R. Pascanu, and R. Hadsell. Sim-to-real robot learning from pixels with progressive nets. arXiv preprint arXiv:1610.04286, 2016.
[21] E. Santana and G. Hotz. Learning a driving simulator. arXiv preprint arXiv:1608.01230, 2016.
[22] V. Sharmanska, N. Quadrianto, and C. H. Lampert. Learning to rank using privileged information. In International Conference on Computer Vision (ICCV), pages 825–832. IEEE, 2013.
[23] E. Tzeng, C. Devin, J. Hoffman, C. Finn, P. Abbeel, S. Levine, K. Saenko, and T. Darrell. Adapting deep visuomotor representations with weak pairwise constraints. In Workshop on the Algorithmic Foundations of Robotics, 2016.
[24] V. Vapnik and A. Vashist. A new learning paradigm: Learning using privileged information. Neural Networks, 22(5):544–557, 2009.
[25] S. Xingjian, Z. Chen, H. Wang, D.-Y. Yeung, W.-k. Wong, and W.-c. Woo. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In Advances in Neural Information Processing Systems, pages 802–810, 2015.
[26] F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015.
[27] R. Zhang, P. Isola, and A. A. Efros. Colorful image colorization. arXiv preprint arXiv:1603.08511, 2016.