
End-to-end Learning of Driving Models from Large-scale Video Datasets

Huazhe Xu∗ Yang Gao∗ Fisher Yu Trevor Darrell


University of California, Berkeley

∗ indicates equal contribution

Abstract

Robust perception-action models should be learned from training data with diverse visual appearances and realistic behaviors, yet current approaches to deep visuomotor policy learning have been generally limited to in-situ models learned from a single vehicle or simulation environment. We advocate learning a generic vehicle motion model from large scale crowd-sourced video data, and develop an end-to-end trainable architecture for learning to predict a distribution over future vehicle egomotion from instantaneous monocular camera observations and previous vehicle state. Our model incorporates a novel FCN-LSTM architecture, which can be learned from large-scale crowd-sourced vehicle action data, and leverages available scene segmentation side tasks to improve performance under a privileged learning paradigm. We provide a novel large-scale dataset of crowd-sourced driving behavior suitable for training our model, and report results predicting the driver action on held out sequences across diverse conditions.

Figure 1: Autonomous driving is formulated as a future egomotion prediction problem. Given a large-scale driving video dataset, an end-to-end FCN-LSTM network is trained to predict multi-modal discrete and continuous driving behaviors. Using semantic segmentation as a side task further improves the model.

1. Introduction

Learning perception-based policies to support complex autonomous behaviors, including driving, is an ongoing challenge for computer vision and machine learning. While recent advances that use rule-based methods have achieved some success, we believe that learning-based approaches will be ultimately needed to handle complex or rare scenarios, and scenarios that involve multi-agent interplay with other human agents.

The recent success of deep learning methods for visual perception tasks has increased interest in their efficacy for learning action policies. Recent demonstration systems [1, 2, 12] have shown that simple tasks, such as a vehicle lane-following policy or obstacle avoidance, can be solved by a neural net. This echoes the seminal work by Dean Pomerleau with the CMU NavLab, whose ALVINN network was among the earliest successful neural network models [17].

These prior efforts generally formulate the problem as learning a mapping from pixels to actuation. This end-to-end optimization is appealing as it directly mimics the demonstrated performance, but is limiting in that it can only be performed on data collected with the specifically calibrated actuation setup, or in corresponding simulations (e.g., as was done in [17], and more recently in [23, 20, 3]). The success of supervised robot learning-based methods is governed by the availability of training data, and typical publicly available datasets only contain on the order of dozens to hundreds of hours of collected experience.

We explore an alternative paradigm, which follows the successful practice in most computer vision settings, of exploiting large scale online and/or crowdsourced datasets. We advocate learning a driving model or policy from large scale uncalibrated sources, and specifically optimize models based on crowdsourced dashcam video sources. We release with our paper a curated dataset from which suitable models or policies can be learned.

To learn a model from this data, we propose a novel deep learning architecture for learning-to-drive from uncalibrated large-scale video data. We formulate the problem as learning a generic driving model/policy; our learned model is generic in that it learns a predictive future motion path given the present agent state.

Presently we learn our model from a corpus of demonstrated behavior and evaluate on held out data from the same corpus. Our driving model is akin to a language model, which scores the likelihood of character or word sequences given certain corpora; our model similarly is trained and evaluated in terms of its ability to score as highly likely the observed behavior of the held out driving sequence. It is also a policy in that it defines a probability distribution over actions conditioned on a state, with the limitation that the policy is never actually executed in the real world or simulation.

Our paper offers four novel contributions. First, we introduce a generic motion approach to learning a deep visuomotor action policy where actuator independent motion plans are learned based on current visual observations and previous vehicle state. Second, we develop a novel FCN-LSTM which can learn jointly from demonstration loss and segmentation loss, and can output multimodal predictions. Third, we curate and make publicly available a large-scale dataset to learn a generic motion model from vehicles with heterogeneous actuators. Finally, we report experimental results confirming that "privileged" training with a side task (semantic segmentation) loss learns egomotion prediction tasks faster than from motion prediction task loss alone.²

² The codebase and dataset can be found at https://github.com/gy20073/BDD_Driving_Model/

We evaluate our model and compare to various baselines in terms of the ability of the model to predict held-out video examples; our task can be thought of as that of predicting future egomotion given present observation and previous agent state history.

While future work includes extending our model to drive a real car, and addressing issues therein involving policy coverage across undemonstrated regions of the policy space (c.f. [18]), we nonetheless believe that effective driving models learned from large scale datasets using the class of methods we propose will be a key element in learning a robust policy for a future driving agent.

2. Related Work

ALVINN [17] was among the very first attempts to use a neural network for autonomous vehicle navigation. The approach was simple, comprised of a shallow network that predicted actions from pixel inputs applied to simple driving scenarios with few obstacles; nevertheless, its success suggested the potential of neural networks for autonomous navigation.

Recently, NVIDIA demonstrated a similar idea that benefited from the power of modern convolution networks to extract features from the driving frames [1]. This framework was successful in relatively simple real-world scenarios, such as highway lane-following and driving in flat, obstacle-free courses.

Instead of directly learning to map from pixels to actuation, [2] proposed mapping pixels to pre-defined affordance measures, such as the distance to surrounding cars. This approach provides human-interpretable intermediate outputs, but a complete set of such measures may be intractable to define in complex, real-world scenarios. Moreover, the learned affordances need to be manually associated with car actions, which is expensive, as was the case with older rule-based systems. Concurrent approaches in industry have used neural network predictions from tasks such as object detection and lane segmentation as inputs to a rule-based control system [9].

Another line of work has treated autonomous navigation as a visual prediction task in which future video frames are predicted on the basis of previous frames. [21] propose to learn a driving simulator with an approach that combines a Variational Auto-encoder (VAE) [10] and a Generative Adversarial Network (GAN) [7]. This method is a special case of the more general task of video prediction; there are examples of video prediction models being applied to driving scenarios [4, 14]. However, in many scenarios, video prediction is ill-constrained as preceding actions are not given as input to the model. [16, 6] address this by conditioning the prediction on the model's previous actions. In our work, we incorporate information about previous actions in the form of an accumulated hidden state.

Our model also includes a side- or privileged-information learning aspect. This occurs when a learning algorithm has additional knowledge at training time, i.e., additional labels or meta-data. This extra information helps training of a better model than is possible using only the view available at test time. A theoretical framework for learning under privileged information (LUPI) was explored in [24]; a max-margin framework for learning with side-information in the form of bounding boxes, image tags, and attributes was examined in [22] within the DPM framework. Recently [8] exploited deep learning with side tasks when mapping from depth to intensity data. Below we exploit a privileged/side-training paradigm for learning to drive, using semantic segmentation side labels.

Recent advances in recurrent neural network modeling for sequential image data are also related to our work. The Long-term Recurrent Convolutional Network (LRCN) [5] model investigates the use of deep visual features for sequence modeling tasks by applying a long short-term memory (LSTM) recurrent neural network to the output of a convolutional neural network. We take this approach, but use the novel combination of a fully-convolutional network (FCN) [13] and an LSTM. A different approach is taken by [25], as they introduce a convolutional long short-term memory (LSTM) network that directly incorporates convolution operations into the cell updates.

3. Deep Generic Driving Networks

We first describe our overall approach for learning a generic driving model from large-scale driving behavior datasets, and then propose a specific novel architecture for learning a deep driving network.

3.1. Generic Driving Models

We propose to learn a generic approach to learning a driving policy from demonstrated behaviors, and formulate the problem as predicting future feasible actions. Our driving model is defined as the admissibility of which next motion is plausible given the current observed world configuration. Note that the world configuration incorporates previous observation and vehicle state. Formally, a driving model F is a function defined as:

F(s, a) : S × A → R    (1)

where s denotes states, a represents a potential motion action and F(s, a) measures the feasibility score of operating motion action a under the state s.

Our approach is generic in that it predicts egomotion, rather than actuation of a specific vehicle.³ Our generic models take as input raw pixels and current and prior vehicle state signals, and predict the likelihood of future motion. This can be defined over a range of action or motion granularity, and we consider both discrete and continuous settings in this paper.⁴ For example, the motion action set A could be a set of coarse actions:

A = {straight, stop, left-turn, right-turn}    (2)

One can also define finer actions based on the car egomotion heading in the future. In that case, the possible motion action set is:

A = {v | v ∈ R²}    (3)

where v denotes the future egomotion on the ground plane.

We refer to F(s, a) as a driving model inspired by its similarity to the classical N-gram language model in Natural Language Processing. Both of them take in the sequence of prior events, such as what the driver has seen in the driving model, or the previously observed tokens in the language model, and predict plausible future events, such as the viable physical actions or the coherent words. Our driving model can equivalently be thought of as a policy from a robotics perspective, but we presently only train and test our model on fixed existing datasets, as explained below, and consequently we feel the language model analogy is the more suitable one.

³ Future work will comprise how to take such a prediction and cause the desired motion to occur on a specific actuation platform. The latter problem has been long studied in the robotics and control literature and both conventional and deep-learning based solutions are feasible (as is their combination).
⁴ We leave the most general setting, of predicting directly arbitrary 6DOF motion, also to future work.

3.2. FCN-LSTM Architecture

Our goal is to predict the distribution over feasible future actions, conditioned on the past and current states, including visual cues and egomotions. To accomplish our goal, an image encoder is necessary to learn the relevant visual representation in each input frame, together with a temporal network to take advantage of the motion history information. We propose a novel architecture for time-series prediction which fuses an LSTM temporal encoder with a fully convolutional visual encoder. Our model is able to jointly train motion prediction and pixel-level supervised tasks. We can use semantic segmentation as a side task following the "privileged" information learning paradigm. This leads to better performance in our experiments. Figure 2 compares our architecture (FCN-LSTM) with two related architectures [5, 25].

Figure 2: Comparison among novel architectures that can fuse time-series information with visual inputs: (a) FCN-LSTM, (b) Conv-LSTM, (c) LRCN.

3.2.1 Visual Encoder

Given a video frame input, a visual encoder can encode the visual information in a discriminative manner while maintaining the relevant spatial information. In our architecture, a dilated fully convolutional neural network [26, 5] is used to extract the visual representations. We take the ImageNet [19] pre-trained AlexNet [11] model, remove the POOL2 and POOL5 layers and use dilated convolutions for conv3 through fc7. To get a more discriminative encoder, we fine-tune it jointly with the temporal network described below. The dilated FCN representation has the advantage that it enables the network to be jointly trained with a side task in an end-to-end manner. This approach is advantageous when the training data is scarce.

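As a rough illustration of the encoder surgery described in Section 3.2.1, a minimal PyTorch sketch is given below. It is not the released implementation (see the linked codebase); the specific dilation rates, padding, and the kernel sizes used to convolutionalize fc6/fc7 are assumptions made only to keep the example self-contained.

```python
# Minimal sketch of a dilated AlexNet-style fully convolutional encoder:
# pool2 and pool5 are removed and the later layers use dilated convolutions,
# so the output is a spatial feature map that a segmentation side loss can
# later be attached to. Dilation rates and the fc6/fc7 kernel sizes are
# assumptions, not the released configuration.
import torch
import torch.nn as nn

class DilatedAlexNetFCN(nn.Module):
    def __init__(self, feat_dim=4096):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),                  # pool1 is kept
            nn.Conv2d(64, 192, kernel_size=5, padding=2), nn.ReLU(inplace=True),
            # pool2 removed; conv3-conv5 use dilation instead of further downsampling
            nn.Conv2d(192, 384, kernel_size=3, padding=2, dilation=2), nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=2, dilation=2), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, kernel_size=3, padding=2, dilation=2), nn.ReLU(inplace=True),
            # pool5 removed; fc6 becomes a dilated 6x6 convolution, fc7 a 1x1 convolution
            nn.Conv2d(256, feat_dim, kernel_size=6, padding=10, dilation=4), nn.ReLU(inplace=True),
            nn.Conv2d(feat_dim, feat_dim, kernel_size=1), nn.ReLU(inplace=True),
        )

    def forward(self, x):              # x: (batch, 3, H, W) video frame
        return self.features(x)        # (batch, feat_dim, H', W') spatial "fc7" feature map

# Example: a 640x360 frame yields a coarse spatial grid of fc7 features.
# fmap = DilatedAlexNetFCN()(torch.randn(1, 3, 360, 640))
```

In practice the convolutional weights would be initialized from the ImageNet pre-trained AlexNet before joint fine-tuning, as the text describes.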
3.2.2 Temporal Fusion

We optionally concatenate the past ground truth sensor information, such as speed and angular velocity, with the extracted visual representation. With the visual and sensor states at each time step, we use an LSTM to fuse all past and current states into a single state, corresponding to the state s in our driving model F(s, a). This state is complete, in the sense that it contains all historical information about all sensors. We can then predict the physical viability from the state s using a fully connected layer.

We also investigate below another temporal fusion approach, temporal convolution, instead of an LSTM to fuse the temporal information. A temporal convolution layer takes in multiple visual representations and convolves on the time dimension with an n × 1 kernel, where n is the number of input representations.

3.3. Driving Perplexity

Our goal is to learn a future motion action feasibility distribution, also known as the driving model. However, in past work [17, 2, 1], there are few explicit quantitative evaluation metrics. In this section, we define an evaluation metric suitable for large-scale uncalibrated training, based on sequence perplexity.

Inspired by language modeling metrics, we propose to use perplexity as the evaluation metric to drive training. For example, a bigram model assigns the probability

p(w_1, ..., w_m) = p(w_1) p(w_2 | w_1) ... p(w_m | w_{m-1})

to a held-out document. Our model assigns the probability

p(a_1 | s_1) ... p(a_t | s_t) = F(s_1, a_1) ... F(s_t, a_t)    (4)

to the held-out driving sequence with actions a_1, ..., a_t, conditioned on world states s_1, ..., s_t. We define the action predictive perplexity of our model on one held-out sample as:

perplexity = exp( -(1/t) Σ_{i=1}^{t} log F(s_i, a_i) )    (5)

To evaluate a model, one can take the most probable predicted action, a_pred = argmax_a F(s, a), and compare it with the action a_real that is carried out by the driver. This is the accuracy of the predictions from a model. Note that models generally do not achieve 100% accuracy, since a driving model does not know the intention of the driver ahead of time.

3.4. Discrete and Continuous Action Prediction

The output of our driving model is a probability distribution over all possible actions. A driving model should have correct motion action predictions despite encountering complicated scenes such as an intersection, traffic light, and/or pedestrians. We first consider the case of discrete motion actions, and then investigate continuous prediction tasks, in both cases taking into account the prediction of multiple modes in a distribution when there are multiple possible actions.

Discrete Actions. In the discrete case, we train our network by minimizing perplexity on the training set. In practice, this effectively becomes minimizing the cross entropy loss between our prediction and the action that is carried out. In real-world driving it is far more prevalent to go straight than to turn left or right, so the samples in the training set are highly biased toward going straight. Inspired by [27], we investigated weighting the loss of different actions according to the inverse of their prevalence.

Continuous Actions. To output a distribution in the continuous domain, one could either use a parametric approach, by defining a family of parametric distributions and regressing to the parameters of the distribution, or employ a non-parametric approach, e.g., discretizing the action space into many small bins. Here we employ the second approach, since it can be difficult to find a parametric distribution family that could fit all scenarios.

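To make the temporal fusion and the discrete prediction head concrete, here is a minimal PyTorch sketch in the spirit of Sections 3.2.2 and 3.4. The 64 LSTM hidden units and the four-way discrete action output follow the text; the feature pooling and the tensor shapes are illustrative assumptions rather than the released architecture.

```python
# Minimal sketch of the temporal fusion head: per-frame visual features are
# concatenated with the previous sensor readings (speed, angular velocity),
# fused over time by an LSTM, and mapped by a fully connected layer to logits
# over the discrete action set {straight, stop, left turn, right turn}.
import torch
import torch.nn as nn

class TemporalFusionHead(nn.Module):
    def __init__(self, feat_dim=4096, sensor_dim=2, hidden=64, num_actions=4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                # collapse the spatial feature map
        self.lstm = nn.LSTM(feat_dim + sensor_dim, hidden, batch_first=True)
        self.action_head = nn.Linear(hidden, num_actions)  # scores F(s, a) for each action a

    def forward(self, fmaps, sensors):
        # fmaps:   (batch, time, feat_dim, H', W')  per-frame encoder outputs
        # sensors: (batch, time, sensor_dim)        previous speed / angular velocity
        b, t = fmaps.shape[:2]
        feats = self.pool(fmaps.flatten(0, 1)).flatten(1).view(b, t, -1)
        states, _ = self.lstm(torch.cat([feats, sensors], dim=-1))
        return self.action_head(states)                    # (batch, time, num_actions) logits

# Training then amounts to minimizing perplexity, i.e. the cross entropy between the
# predicted distribution and the action the driver actually took, for example:
# loss = nn.CrossEntropyLoss()(logits.flatten(0, 1), driver_actions.flatten())
```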
3.5. Driving with Privileged Information

Despite the large-scale nature of our training set, small phenomena and objects may be hard to learn in a purely end-to-end fashion. We propose to exploit privileged learning [24, 22, 8] to learn a driving policy that exploits both the task loss and available side losses. In our model, we use semantic segmentation as the extra supervision. Figure 3 summarizes our approach and the alternatives: motion prediction could be learned fully end to end (Motion Reflex Approach), or could rely fully on predicted intermediate semantic segmentation labels (Mediated Perception Approach); in contrast, our proposed approach (Privileged Training Approach) adopts the best of both worlds, using semantic segmentation as a side task to improve the representation that ultimately performs motion prediction. Specifically, we add a segmentation loss after fc7, which enforces fc7 to learn a meaningful feature representation. Our results below confirm that even when semantic segmentation is not the ultimate goal, learning with semantic segmentation side tasks can improve performance, especially when coercing a model to attend to small relevant scene phenomena.

Figure 3: Comparison of learning approaches: (a) Mediated Perception, (b) Privileged Training, (c) Motion Reflex. Mediated Perception relies on semantic-class labels at the pixel level alone to drive motion prediction. The Motion Reflex method learns a representation based on raw pixels. Privileged Training learns from raw pixels but allows side-training on semantic segmentation tasks.

4. Dataset

The Berkeley DeepDrive Video dataset (BDDV) is a dataset comprised of real driving videos and GPS/IMU data. The BDDV dataset contains diverse driving scenarios including cities, highways, towns, and rural areas in several major cities in the US. We analyze different properties of this dataset in the following sections and show its suitability for learning a generic driving model in comparison with a set of benchmark datasets including KITTI, Cityscapes, the Comma.ai dataset, the Oxford dataset, Princeton Torcs, and GTA, each of which varies in size, target, and types of data. A comparison of datasets is provided in Table 1.

4.1. Scale

BDDV provides a collection of sufficiently large and diverse driving data, from which it is possible to learn generic driving models. The BDDV contains over 10,000 hours of driving dash-cam video streams from different locations in the world. The largest prior dataset is the RobotCar dataset [15], which corresponds to 214 hours of driving experience. KITTI, which has diverse calibrated data, provides 22 sequences (less than an hour) for SLAM purposes. In Cityscapes, there are no more than 100 hours of driving video data provided upon request. To the best of our knowledge, BDDV is at least two orders of magnitude larger than any public benchmark dataset for vision-based autonomous driving.

4.2. Modalities

Besides the images, our BDDV dataset also comes with sensor readings from a smart phone. The sensors are GPS, IMU, gyroscope and magnetometer. The data also comes with sensor-fused measurements, such as course and speed. Those modalities could be used to recover the trajectory and dynamics of the vehicle.

4.3. Diversity

The BDDV dataset is collected to learn a driving model that is generic in terms of driving scenes, car makes and models, and driving behaviors. The coverage of BDDV includes various driving, scene, and lighting conditions. In Figure 5 we show some samples of our dataset in nighttime, daytime, city areas, highway and rural areas. As shown in Table 1, existing benchmark datasets are limited in the variety of scene types they comprise. In Figure 4 we illustrate the spatial distribution of our data across a major city.

Figure 4: Example density of data distribution of BDDV in a major city. Each dot represents the starting location of a short video clip of approximately 40 seconds.

Figure 5: Sample frames from the BDDV dataset.

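Section 4.2 notes that the sensor-fused course and speed could be used to recover the vehicle trajectory. A minimal dead-reckoning sketch is shown below; the field names, units (speed in m/s, course in degrees clockwise from north) and the fixed sampling interval are assumptions for illustration only, not the BDDV file format.

```python
# Minimal dead-reckoning sketch: integrate per-sample speed and compass course
# into a rough 2D trajectory on the ground plane. Units and sampling rate are
# assumed for illustration; they are not the BDDV specification.
import math

def integrate_trajectory(speeds_mps, courses_deg, dt=1.0 / 15.0):
    """Accumulate planar (x, y) positions from speed and heading samples."""
    x, y = 0.0, 0.0
    path = [(x, y)]
    for v, course in zip(speeds_mps, courses_deg):
        heading = math.radians(90.0 - course)   # convert compass course to a math angle
        x += v * math.cos(heading) * dt
        y += v * math.sin(heading) * dt
        path.append((x, y))
    return path

# path = integrate_trajectory([5.0, 5.2, 5.1], [10.0, 12.0, 15.0])
```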
Table 1: Comparison of our dataset with other driving datasets.

Datasets        | Settings                  | Type      | Approx. scale       | Diversity                                                        | Specific car make
KITTI           | city, rural area, highway | real      | less than 1 hour    | one city, one weather condition, daytime                         | Yes
Cityscapes      | city                      | real      | less than 100 hours | German cities, multiple weather conditions, daytime              | Yes
Comma.ai        | mostly highway            | real      | 7.3 hours           | highway, N.A., daytime and night                                 | Yes
Oxford          | city                      | real      | 214 hours           | one city (Oxford), multiple weather conditions, daytime          | Yes
Princeton Torcs | highway                   | synthetic | 13.5 hours          | N.A.                                                             | N.A.
GTA             | city, highway             | synthetic | N.A.                | N.A.                                                             | N.A.
BDDV (ours)     | city, rural area, highway | real      | 10k hours           | multiple cities, multiple weather conditions, daytime and night  | No

5. Experiments

For our initial experiments, we used a subset of the BDDV comprising 21,808 dashboard camera videos as training data, 1,470 as validation data and 3,561 as test data. Each video is approximately 40 seconds in length. Since a small portion of the videos has duration just under 40 seconds, we truncate all videos to 36 seconds. We downsample frames to 640 × 360 and temporally downsample the video to 3 Hz to avoid feeding near-duplicate frames into our model. After all such preprocessing, we have a total of 2.9 million frames, which is approximately 2.5 times the size of the ILSVRC2012 dataset. To train our model, we used stochastic gradient descent (SGD) with an initial learning rate of 10⁻⁴, momentum of 0.99 and a batch size of 2. The learning rate was decayed by 0.5 whenever the training loss plateaus. Gradient clipping at 10 was applied to avoid gradient explosion in the LSTM. The LSTM is run sequentially on the video with the previous visual observations; the number of hidden units in the LSTM is 64. Models are evaluated using predictive perplexity and accuracy, where the maximum likelihood action is taken as the prediction.

5.1. Discrete Action Driving Model

We first consider the discrete action case, in which we define four actions: straight, stop, left turn, right turn. The task is defined as predicting the feasible actions in the next 1/3 of a second.

Following Section 3.2, we minimize perplexity on the training set and evaluate perplexity and accuracy of the maximum likelihood prediction on a set of held-out videos. In Table 2, we perform an ablation study to investigate the importance of different components of our model.

Table 2: Results on the discrete feasible action prediction task. We investigated the influence of various image encoders, temporal networks and the effect of speed. Log perplexity (lower is better) and accuracy (higher is better) of our prediction are reported. See Section 5.1 for details.

Configuration  | Image | Temporal | Speed | Perplexity | Accuracy
Random-Guess   | N.A.  | N.A.     | No    | 0.989      | 42.1%
Speed-Only     | N.A.  | LSTM     | Yes   | 0.555      | 80.1%
CNN-1-Frame    | CNN   | N.A.     | No    | 0.491      | 82.0%
TCNN3          | CNN   | CNN      | No    | 0.445      | 83.2%
TCNN9          | CNN   | CNN      | No    | 0.411      | 84.6%
CNN-LSTM       | CNN   | LSTM     | No    | 0.419      | 84.5%
CNN-LSTM+Speed | CNN   | LSTM     | Yes   | 0.449      | 84.2%
FCN-LSTM       | FCN   | LSTM     | No    | 0.430      | 84.1%

Table 2 shows the comparison among a few variants of our method. The Random-Guess baseline predicts randomly based on the input distribution. In the Speed-Only condition, we only use the speed of the previous frame as input, ignoring the image input completely. It achieves decent performance, since the driving behavior is largely predictable from the speed in the previous moment. In the 1-Frame configuration, we only feed in a single image at each timestep and use a CNN as the visual encoder. It achieves better performance than the two baseline models (random and speed-only). This is intuitive, since human drivers can get a good, but not perfect, sense of feasible motions from a single frame. In the TCNN configurations we study using temporal convolution as the temporal fusion mechanism, with a fixed-length window of 3 (TCNN3) or 9 (TCNN9) frames, i.e., 1 and 3 seconds respectively. The TCNN models further improve performance, and the longer the time horizon, the better the performance. However, this approach needs a fixed-size history window and is more memory demanding than the LSTM-based approach. We also explore the CNN-LSTM approach, and it achieves performance comparable to TCNN9. When changing the visual encoder from CNN to FCN, the performance is comparable. However, as we will show later, following the privileged training paradigm of Section 3.5, an FCN-based visual encoder is vital for learning from privileged segmentation information. We also found that the inverse frequency weighting of the loss function [27] encourages the prediction of rare actions, but it does not improve the prediction perplexity; thus we do not use it in the methods above.

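The inverse-prevalence weighting mentioned above could be implemented roughly as follows; this is only a sketch, and the class counts in the usage line are hypothetical numbers, not statistics of the BDDV training set.

```python
# Sketch of inverse-frequency loss weighting (in the spirit of [27]): each discrete
# action is weighted by the inverse of its empirical prevalence, so rare actions
# (turns) contribute more to the loss than the dominant "straight" class.
import torch
import torch.nn as nn

def inverse_frequency_weights(action_counts, eps=1e-6):
    counts = torch.as_tensor(action_counts, dtype=torch.float32)
    weights = 1.0 / (counts / counts.sum() + eps)   # inverse of empirical prevalence
    return weights / weights.mean()                 # normalize to an average weight of 1

# Hypothetical counts over {straight, stop, left turn, right turn}, for illustration only.
weights = inverse_frequency_weights([700_000, 250_000, 25_000, 25_000])
weighted_ce = nn.CrossEntropyLoss(weight=weights)
# loss = weighted_ce(action_logits, driver_actions)
```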
In Figure 6, we show some predictions made by our model. In the first pair of images (subfigures a and b), the car is going through an intersection when the traffic light starts to change from yellow to red. Our model predicts going straight while the light is yellow, and the prediction changes to stop when the traffic light turns red. This indicates that our model has learned how human drivers often react to traffic light colors. In the second pair (c and d), the car is approaching a stopped car in front. In (c), there is still empty space ahead, and our model predicts go or stop roughly equally. However, when the driver moves closer to the front car, our model predicts stop instead. This shows that our model has learned the concept of distance and automatically maps it to the feasible driving action.

Figure 6: Discrete actions predicted by our FCN-LSTM model: (a) go at yellow light, (b) stop at red light, (c) stop and go weighted roughly equally at medium distance ahead, (d) stop when too close to the vehicle ahead. Each row of two images shows how the prediction changes over time. The green bars show the probability of taking each action at that time; the red bars are the driver's action. The four actions from top to bottom are going straight, slow or stop, turn left and turn right.

5.2. Continuous Action Driving Model

In this section, we investigate the continuous action prediction problem, in particular, lane following. We define the lane following problem as predicting the angular speed of the vehicle in the future 1/3 second. As proposed above, we discretize the prediction domain into bins and turn the problem into a multinomial prediction task.

We evaluated three different kinds of binning methods (Table 3). First we tried a linear binning method, where we discretize [−90°, 90°] into 180 bins of width 1°. The linear binning method is reasonable under the assumption that constant control accuracy is needed to drive well. Another reasonable assumption might be that constant relative accuracy is required to control the turns; this corresponds to the log bins method, where we use a total of 180 bins evenly distributed in logspace(−90°, −1°) and logspace(1°, 90°). We also tried a data-driven approach: we first compute the distribution of the drivers' behavior (the vehicle's angular velocity) in the continuous space, and then discretize the distribution into 180 bins by requiring each bin to have the same probability density. Such a data-driven binning method adaptively captures the details of the drivers' actions. During training we use Gaussian smoothing with a standard deviation of 0.5 to smooth the training labels into nearby bins. Results are shown in Table 3; the data-driven binning method performed the best, while linear binning performed the worst.

Table 3: Continuous lane following experiment. See Section 5.2 for details.

Configuration    | Angle Perplexity
Random Guess     | 1.86
Linear Bins      | -2.82
Log Bins         | -3.66
Data-Driven Bins | -4.83

Figure 7 shows examples of our predictions on video frames. Sub-figures (a) and (b) show that our model can follow a curving lane accurately. The prediction has a longer tail towards the direction of turning, which is expected since it is fine to take the turn at different degrees. Sub-figure (c) shows the prediction when a car is starting to turn left at an intersection. It assigns a higher probability to continuing the left turn, while still assigning a small probability to going straight. The probability in the middle is close to zero, since the car should not hit the wall. Close to the completion of the turn (sub-figure (d)), the car can only finish the turn and thus the other direction disappears. This shows that we can predict a variable number of modes appropriately. In sub-figure (e), when the car is driving close to the sidewalk on its right, our model assigns zero probability to turning right. When it reaches the intersection (sub-figure (f)), the model correctly assigns non-zero probability to turning right, since the turn is clear by that time.

Figure 7: Continuous actions predicted by our model: (a) lane following left, (b) lane following right, (c) multiple possible actions: turn left or go straight, (d) collapsed to a single action after the turn, (e) single-sided prediction due to the sidewalk, (f) right turn becomes available at the intersection. The green sector with different darkness shows the probability map of going in a particular direction. The blue line shows the driver's action.

5.3. Learning with Privileged Information (LUPI)

In this section, we demonstrate our LUPI approach on the discrete action prediction task. Following Section 3.5, we designed three approaches. The Motion Reflex Approach refers to the FCN-LSTM approach above. The Privileged Training Approach takes the FCN-LSTM architecture and adds an extra segmentation loss after the fc7 layer; we used BDD Segmentation masks as the extra supervision. Since the BDDV dataset only contains the car egomotion and the BDD Segmentation dataset only contains the segmentation of individual images, we pair each video clip with 10 BDD Segmentation images during training.

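As a sketch of how the segmentation side loss could be attached after fc7 alongside the driving loss, under the equal weighting described in the text: the number of segmentation classes, the bilinear upsampling and the ignore label below are assumptions, not the released training code.

```python
# Minimal sketch of the privileged-training objective: the shared fc7 feature map
# feeds a 1x1-convolution segmentation head in addition to the driving head, and
# the driving loss and segmentation loss are summed with equal weight.
import torch.nn as nn
import torch.nn.functional as F

class PrivilegedDrivingLoss(nn.Module):
    def __init__(self, feat_dim=4096, num_seg_classes=19):
        super().__init__()
        self.seg_head = nn.Conv2d(feat_dim, num_seg_classes, kernel_size=1)  # after fc7
        self.driving_loss = nn.CrossEntropyLoss()
        self.seg_loss = nn.CrossEntropyLoss(ignore_index=255)

    def forward(self, action_logits, actions, fc7_map=None, seg_labels=None):
        loss = self.driving_loss(action_logits, actions)
        # The side loss is only applied on frames that are paired with a segmentation mask.
        if fc7_map is not None and seg_labels is not None:
            seg_logits = F.interpolate(self.seg_head(fc7_map),
                                       size=seg_labels.shape[-2:],
                                       mode="bilinear", align_corners=False)
            loss = loss + self.seg_loss(seg_logits, seg_labels)   # equal weighting
        return loss
```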
The motion prediction loss (or driving loss) and the semantic segmentation loss are weighted equally. For the Mediated Perception Approach, we first compute the segmentation output of every frame in the videos using the Multi-Scale Context Aggregation approach described in [26]. We then feed the segmentation results into an LSTM and train the LSTM independently from the segmentation part, mimicking stage-by-stage training. In theory, one would not need a side task to improve the performance of a neural network given unlimited data. To simulate a scenario in which we only have a limited amount of training data, we run these experiments on a common subset of 1,000 video clips.

As shown in Table 4, the Privileged Training approach achieves the best performance in both perplexity and accuracy. These observations align well with our intuition that training on side tasks in an end-to-end fashion improves performance. Figure 8 shows an example in which Privileged Training provides a benefit. In the first column, there is a red light far ahead at the intersection. The Privileged Training approach has successfully identified it and predicted stop in (c), while the other two methods fail. In the second column, the car is waiting behind another car. In the frame immediately preceding these frames, the vehicle in front had an illuminated brake light. The second column of images shows the prediction of the three methods when the brake light of the car goes out but the vehicle has not yet started to move. The Privileged Training approach in (c) predicts stop with high probability. The other two methods behave more aggressively and predict going straight with high probability.

Table 4: Comparison of privileged training with the other methods.

Method                       | Perplexity | Accuracy
Motion Reflex Approach       | 0.718      | 71.31%
Mediated Perception Approach | 0.8887     | 61.66%
Privileged Training Approach | 0.697      | 72.4%

Figure 8: One example result in each column from each of the three models: (a) the Motion Reflex Approach, (b) the Mediated Perception Approach and (c) the Privileged Training Approach.

6. Conclusion

We introduce an approach to learning a generic driving model from a large-scale crowd-sourced video dataset with an end-to-end trainable architecture. It can learn from monocular camera observations and previous egomotion states to predict a distribution over future egomotion. The model uses a novel FCN-LSTM architecture to learn from driving behaviors. It can take advantage of semantic segmentation as a side task to improve performance, following the privileged learning paradigm. To facilitate our study, we provide a novel large-scale dataset of crowd-sourced driving behaviors that is suitable for learning driving models. We investigate the effectiveness of our driving model and the "privileged" learning by evaluating future egomotion prediction on held-out sequences across diverse conditions.

Acknowledgement

Prof. Darrell was supported in part by DARPA; NSF awards IIS-1212798, IIS-1427425, and IIS-1536003; Berkeley DeepDrive; and the Berkeley Artificial Intelligence Research Center. We appreciate BDD sponsor Nexar for providing the Berkeley DeepDrive Video Dataset.

References

[1] M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, et al. End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316, 2016.
[2] C. Chen, A. Seff, A. Kornhauser, and J. Xiao. Deepdriving: Learning affordance for direct perception in autonomous driving. In Proceedings of the IEEE International Conference on Computer Vision, pages 2722–2730, 2015.
[3] S. Daftry, J. A. Bagnell, and M. Hebert. Learning transferable policies for monocular reactive MAV control. In International Symposium on Experimental Robotics, 2016.
[4] B. De Brabandere, X. Jia, T. Tuytelaars, and L. Van Gool. Dynamic filter networks. arXiv preprint arXiv:1605.09673, 2016.
[5] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2625–2634, 2015.
[6] C. Finn, I. Goodfellow, and S. Levine. Unsupervised learning for physical interaction through video prediction. arXiv preprint arXiv:1605.07157, 2016.
[7] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[8] J. Hoffman, S. Gupta, and T. Darrell. Learning with side information through modality hallucination. In Proc. Computer Vision and Pattern Recognition (CVPR), 2016.
[9] B. Huval, T. Wang, S. Tandon, J. Kiske, W. Song, J. Pazhayampallil, M. Andriluka, P. Rajpurkar, T. Migimatsu, R. Cheng-Yue, et al. An empirical evaluation of deep learning on highway driving. arXiv preprint arXiv:1504.01716, 2015.
[10] D. P. Kingma and M. Welling. Auto-encoding variational bayes. stat, 1050:10, 2014.
[11] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[12] Y. LeCun, U. Muller, J. Ben, E. Cosatto, and B. Flepp. Off-road obstacle avoidance through end-to-end learning. In NIPS, pages 739–746, 2005.
[13] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.
[14] W. Lotter, G. Kreiman, and D. Cox. Deep predictive coding networks for video prediction and unsupervised learning. arXiv preprint arXiv:1605.08104, 2016.
[15] W. Maddern, G. Pascoe, C. Linegar, and P. Newman. 1 Year, 1000km: The Oxford RobotCar dataset. The International Journal of Robotics Research (IJRR), to appear.
[16] J. Oh, X. Guo, H. Lee, R. L. Lewis, and S. Singh. Action-conditional video prediction using deep networks in Atari games. In Advances in Neural Information Processing Systems, pages 2863–2871, 2015.
[17] D. Pomerleau. ALVINN: An autonomous land vehicle in a neural network. In D. Touretzky, editor, Advances in Neural Information Processing Systems 1. Morgan Kaufmann, 1989.
[18] S. Ross, G. J. Gordon, and D. Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In AISTATS, volume 1, page 6, 2011.
[19] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
[20] A. A. Rusu, M. Vecerik, T. Rothörl, N. Heess, R. Pascanu, and R. Hadsell. Sim-to-real robot learning from pixels with progressive nets. arXiv preprint arXiv:1610.04286, 2016.
[21] E. Santana and G. Hotz. Learning a driving simulator. arXiv preprint arXiv:1608.01230, 2016.
[22] V. Sharmanska, N. Quadrianto, and C. H. Lampert. Learning to rank using privileged information. In International Conference on Computer Vision (ICCV), pages 825–832. IEEE, 2013.
[23] E. Tzeng, C. Devin, J. Hoffman, C. Finn, P. Abbeel, S. Levine, K. Saenko, and T. Darrell. Adapting deep visuomotor representations with weak pairwise constraints. In Workshop on the Algorithmic Foundations of Robotics, 2016.
[24] V. Vapnik and A. Vashist. A new learning paradigm: Learning using privileged information. Neural Networks, 22(5):544–557, 2009.
[25] S. Xingjian, Z. Chen, H. Wang, D.-Y. Yeung, W.-k. Wong, and W.-c. Woo. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In Advances in Neural Information Processing Systems, pages 802–810, 2015.
[26] F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015.
[27] R. Zhang, P. Isola, and A. A. Efros. Colorful image colorization. arXiv preprint arXiv:1603.08511, 2016.
