
2021 IEEE International Conference on Robotics and Automation (ICRA 2021)

May 31 - June 4, 2021, Xi'an, China

Learning robust driving policies without online exploration


Daniel Graves¹, Nhat M. Nguyen¹, Kimia Hassanzadeh¹, Jun Jin¹, Jun Luo¹

Abstract— We propose a multi-time-scale predictive representation learning method to efficiently learn robust driving policies in an offline manner that generalize well to novel road geometries, and damaged and distracting lane conditions which are not covered in the offline training data. We show that our proposed representation learning method can be applied easily in an offline (batch) reinforcement learning setting, demonstrating the ability to generalize well and efficiently under novel conditions compared to standard batch RL methods. Our proposed method utilizes training data collected entirely offline in the real-world, which removes the need for intensive online exploration that impedes applying deep reinforcement learning to real-world robot training. Various experiments were conducted in both simulator and real-world scenarios for the purpose of evaluation and analysis of our proposed claims.

Fig. 1: Lane keeping of a Jackal Robot using vision-based counterfactual predictions of the future lane centeredness over multiple time scales to represent the state of the agent in RL.
I. INTRODUCTION

Learning to drive is a challenging problem that is a long-standing goal in robotics and autonomous driving. In the early days of autonomous driving, a popular approach to staying within a lane was based on lane marking detection [1]. However, a significant challenge with this approach is the lack of robustness to missing, occluded or damaged lane markings [2]; most roads in the US are not marked with reliable lane markings on either side of the road [3]. Modern approaches mitigate some of these issues by constructing high definition maps and developing accurate localization techniques [4], [5], [6], [7]. However, scaling both the map and localization approaches globally in a constantly changing world is still very challenging for autonomous driving and robotic navigation [4].

Recently, there have been a growing number of successes in AI applied to robotics and autonomous driving [8], [9], [10], [11], [12], [3]. These data-driven approaches can be divided into two categories: (1) behavior cloning, and (2) reinforcement learning (RL). Behavior cloning suffers from generalization challenges since valuable negative experiences are rarely collected; in addition, such methods cannot offer performance better than the behavior being cloned [8], [13], [12]. RL, on the other hand, is a promising direction for vision-based control [14]; however, RL is usually not practical because it requires extensive online exploration in the environment to find the best policy that maximizes the cumulative reward [11], [15], [16]. Moreover, success in game environments like Go [17] doesn't always transfer well to success in the real-world, where an agent is expected to learn policies that generalize well [18], [16]. A key challenge is that RL overfits to the training environment, so learned policies tend to perform poorly on novel situations not seen during training [19], [20], [21], [22]. We aim to address the issues of learning both practical and general driving policies in the real-world by combining a novel representation learning approach with offline RL without any online exploration.

Offline RL, or learning RL policies from data without exploration [23], could potentially address many of the practicality issues with applying standard online RL in the real-world. Unfortunately, deep offline RL struggles to generalize to data not in the training set [24], [25]. Our approach applies novel representation learning based on counterfactual predictions [26], [27], [25], shown in Figure 1, to address the generalization issue. We learn predictions of future lane centeredness and road angle from offline data safely collected in the environment using noisy localization sources during training, eliminating the need for expensive, on-vehicle, high accuracy localization sensors during deployment. State-of-the-art offline RL [23] is then applied using these counterfactual predictions as a low dimensional representation of the state to learn a policy to drive the vehicle. These counterfactual predictions are motivated in psychology [28], [29], where predictions are found to aid an agent's understanding of the world, particularly in driving [30]. Similar works in classical control have shown how anticipation of the future is important for driving at the limits of stability through feed-forward [31] and model predictive control [32]. Our work is motivated by the predictive state hypothesis [33], [34] that claims counterfactual predictions help an agent generalize and adapt quickly to new problems [35].

The significance of our approach is that it demonstrates practical value in autonomous driving and real-world RL without requiring extensive maps, robust localization techniques or robust lane marking and curb detection. We demonstrate that our approach generalizes to never-before-seen roads, including those with damaged and distracting lane markings.

¹Noah's Ark Lab, Huawei Technologies Canada {daniel.graves, minh.nhat.nguyen, kima.hassanzadeh, jun.jin1, jun.luo1}@huawei.com



The novel contributions of this work are summarized as follows: (1) an algorithm for learning counterfactual predictions from real-world driving data with behavior distribution estimation, (2) an investigation into the importance of predictive representations for learning good driving policies that generalize well to new roads and damaged lane markings, and (3) the first demonstration of deep RL applied to autonomous driving with real-world data without any online exploration.

Fig. 2: Overall architecture of the RL system, which learns a predictive representation ψ to represent the state of the agent. The camera is the only environment sensor at test time.

II. RELATED WORKS

Deep learning approaches to driving: There have been many attempts to apply deep learning to driving, including deep RL and imitation learning [11]; however, generalization is a key challenge. ChauffeurNet [36] used a combination of imitation learning and predictive models to synthesize the worst case scenarios, but more work is needed to improve the policy to achieve performance competitive with modern motion planners. Another approach trained the agent entirely in the simulator, where transfer to the real-world can be challenging to achieve [11]. DeepDriving [37] learned affordance predictions of road angle from an image for multi-lane driving in simulation using offline data collected by human drivers. However, in contrast with our proposed method, DeepDriving used heuristics and rules to control the vehicle instead of learning a policy with RL. Moreover, DeepDriving learned predictions of the current lane centeredness and current road angle rather than long-term counterfactual predictions of the future.

Offline RL in real-world robot training: There is much prior art in offline (batch) RL [38], [24]. However, most prior work in offline RL has challenges learning good policies in the deep setting [24]. The current state of the art in offline RL is batch constrained Q-learning (BCQ) [23], [24], where success is demonstrated in simulation environments such as Atari but the results still perform badly in comparison to online learning. The greatest challenge with offline RL is the difficulty in covering the state-action space of the environment, resulting in holes in the training data where extrapolation is necessary. [39] applied a novel offline RL approach to playing soccer with a real-world robot by exploiting the episodic nature of the problem. Our work overcomes these challenges and is, to the best of our knowledge, the first successful real-world robotic application of batch RL with deep learning.

Counterfactual prediction learning: Learning counterfactual predictions as a representation of the state of the agent has been proposed before in the real-world [40], [41]. Other approaches demonstrate counterfactual predictions but don't provide a way to use them [26], [42], [27]. While experiments with counterfactual predictions show a lot of promise for improving learning and generalization, most experiments are in simple tabular domains [33], [34], [35]. Auxiliary tasks and similar prediction problems have been applied to deep RL tasks in simulation, but they assume the policy is the same as the policy being learned and thus are not counterfactual predictions [43], [44], [29], [45].

III. PREDICTIVE CONTROL FOR AUTONOMOUS DRIVING

Let us consider the usual setting of an MDP described by a set of states S, a set of actions A, transition dynamics with probability P(s'|s, a) of transitioning to next state s' after taking action a from state s, and a reward r. The objective of an MDP is to learn a policy π that maximizes the future discounted sum of rewards in a given state. Obtaining the state of the agent in an MDP environment is not trivial, especially with deep RL where the policy is changing because the target is moving [46]. Our approach is to learn an intermediate representation mapping sensor readings s to a limited number of counterfactual predictions φ as a representation of the state for deep RL. This has the advantage of pushing the heavy burden of deep feature representation learning in RL to the easier problem of prediction learning [47], [48], [49], [43].

The overall architecture of the system is depicted in Figure 2. The proposal is to represent the state of the agent as a vector ψ which is the concatenation of a limited number of the predictions φ, the current speed of the vehicle v_t and the previous action taken a_{t−1}. The predictions φ are counterfactual predictions, also called general value functions [26]. The previous action taken is needed due to the nature of the predictions, which are relative to the last action.

Learning a policy π(ψ) could provide substantial benefits over learning π from image observations: (1) improving learning performance and speed, (2) enabling batch RL from offline data, and (3) improving generalization of the driving policy. Our approach is to learn a value function Q(s, a) and a deterministic policy π(ψ) that maximizes that value function using batch constrained Q-learning (BCQ) [23]. While the networks can be modelled as one computational graph, the gradients from the policy and value function networks are not back-propagated through the prediction network, in order to decouple the representation learning when learning from the offline data. Thus, training happens in two phases: (1) learning the prediction network, and (2) learning the policy and value function.
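To make the decoupling concrete, the following is a minimal PyTorch-style sketch (ours, not the authors' code) of the phase-2 input construction: the prediction network's output φ is detached so that no policy or value gradients flow back into it, and ψ is the concatenation of φ, the current speed v_t and the previous action a_{t−1}. The network shapes and module names are illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of the phase-2 input in Fig. 2:
# psi = concat(phi, v_t, a_{t-1}) with phi detached from the policy/value graph.
# Network shapes (two stacked RGB frames -> 6 channels) are illustrative assumptions.
import torch
import torch.nn as nn

class PredictionNet(nn.Module):
    """Maps stacked camera images to the counterfactual predictions phi:
    future lane centeredness and road angle at several time scales."""
    def __init__(self, num_horizons: int = 5):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(6, 16, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Flatten())
        self.head = nn.LazyLinear(2 * num_horizons)   # (alpha, beta) per horizon

    def forward(self, images):
        return self.head(self.encoder(images))

def predictive_state(pred_net, images, speed, last_action):
    """psi used by the RL agent. detach() stops policy/value gradients from
    back-propagating into the prediction network (phase 2 of training)."""
    phi = pred_net(images).detach()
    return torch.cat([phi, speed, last_action], dim=-1)
```

In phase 1 the prediction network is trained with the GVF losses of Section IV; in phase 2, ψ is treated as a fixed, low dimensional state for BCQ's policy and value networks.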

Fig. 3: An illustration of (a) lane centeredness position α, and (b) the road angle β, which is the angle between the direction of the vehicle and the direction of the road.

Fig. 4: An illustration of the multiple temporal horizons of the predictions φ.

During the first phase of training, a low-accuracy localization algorithm, based on 2D lidar scan matching, produces the lane centeredness α and relative road angle β of the vehicle, depicted in Figure 3, that are used to train the prediction network. The prediction network is a single network that predicts the lane centeredness and relative road angle over multiple temporal horizons, depicted in Figure 4: these are predictions of the future lane centeredness and relative road angle rather than the current estimates returned by the localization algorithm. They are chosen because they represent both present and future lane centeredness information needed to steer [30]. These predictions are discounted sums of the future lane centeredness and relative road angle respectively, and they are learned with GVFs [26]:

    φ(s) = E_τ[ Σ_{i=0}^{∞} γ^i c_{t+i+1} | s_t = s ]                                  (1)

where c_{t+i+1} is the cumulant vector consisting of the current lane centeredness α and current relative road angle β. It is important to understand that φ(s) predicts the sum of all future lane centeredness and road angle values collected under some policy τ. The policy τ is counterfactual in the sense that it is different from both the behavior policy µ used to collect the data and the learned policy π. Formally, the policy is τ(a_t|s_t, a_{t−1}) = N(a_{t−1}, Σ) where Σ = 0.0025 I is a diagonal covariance matrix. The meaning of this policy is to "keep doing what you are doing", similar to the one used in [32] for making counterfactual predictions. Therefore, φ(s) predicts the discounted sum of future lane centeredness and road angle if the vehicle keeps taking actions similar to its last action. Moreover, φ(s) can be interpreted as a prediction of the deviation from the desired lane centeredness and road angle. Counterfactual predictions can be thought of as anticipated future "errors" that allow controllers to take corrective actions before the errors occur. The discount factor γ controls the temporal horizon of the prediction. It is critical to learn φ(s) for different values of γ in order to control both steering and speed. The details for learning φ(s) are provided in the next section.
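For intuition about Equation (1) and the choice of multiple discounts, the sketch below computes the discounted sums of future (α, β) cumulants along a logged trajectory for the time scales used later in the TORCS ablation (Section V-B), together with their effective horizons 1/(1 − γ). This naive on-trajectory computation is only illustrative; in the method itself φ(s) is learned off-policy with the TD procedure of Section IV, because the logged data was not collected under τ.

```python
# Illustration of Eq. (1): discounted sums of future cumulants c = (alpha, beta)
# for several time scales. In the paper these are *learned* off-policy (Sec. IV);
# this naive on-trajectory computation is only to build intuition.
import numpy as np

gammas = [0.0, 0.5, 0.9, 0.95, 0.97]          # time scales from the TORCS ablation
print([1.0 / (1.0 - g) for g in gammas])      # effective horizons ~ [1, 2, 10, 20, 33] steps

def discounted_sums(cumulants: np.ndarray, gamma: float) -> np.ndarray:
    """cumulants: (T, 2) array of (lane centeredness, road angle) per step.
    Returns phi_t = sum_i gamma^i * c_{t+i+1} for every t (backwards recursion)."""
    T = len(cumulants)
    phi = np.zeros_like(cumulants)
    running = np.zeros(cumulants.shape[1])
    for t in range(T - 2, -1, -1):            # phi_t depends on c_{t+1}, c_{t+2}, ...
        running = cumulants[t + 1] + gamma * running
        phi[t] = running
    return phi

traj = np.random.uniform(-1, 1, size=(100, 2))    # fake logged (alpha, beta) values
targets = {g: discounted_sums(traj, g) for g in gammas}
```
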
During the second stage of training, the localization algorithm is no longer needed; it was used only to provide the labels for training the predictive representation in the first stage. Instead, the counterfactual predictions φ are concatenated with the vehicle speed v_t and last action a_{t−1} to form a predictive representation ψ. The RL agent receives ψ as the state of the agent in the environment, which is used to predict the value and produce the next action a_t as depicted in Figure 2. In our offline learning approach, we used the state-of-the-art batch RL algorithm BCQ [23], [24] to train the policy. Note that the same architecture can also be applied online, where the counterfactual prediction, policy and value networks are all learned online simultaneously with deep deterministic policy gradient (DDPG) [50], but the details are left in the appendix.

IV. PREDICTIVE LEARNING

The counterfactual predictions given in Equation (1) are general value functions (GVFs) [26] that are learned with a novel combination of different approaches including (1) off-policy, or counterfactual, prediction learning with importance resampling [47], and (2) behavior estimation with the density ratio trick [51].

A. Counterfactual Predictions

To ask a counterfactual predictive question, we use the GVF framework, where one must define a cumulant c_t = c(s_t, a_t, s_{t+1}), a.k.a. pseudo-reward, a policy distribution τ(a|s), and a continuation function γ_t = γ(s_t, a_t, s_{t+1}). The answer to the predictive question is the expectation of the return φ_t when following policy τ, defined by

    φ^τ(s) = E_τ[ Σ_{k=0}^{∞} ( Π_{j=0}^{k−1} γ_{t+j+1} ) c_{t+k+1} | s_t = s, a_t = a ]      (2)

where the cumulant is c_t and 0 ≤ γ_t ≤ 1 [26]. This is a more general form for learning a prediction than the one given in Equation (1); the only difference is that γ is replaced by a continuation function, which allows for predictions of the sum of cumulants until an episodic event occurs, such as going out of lane. The agent usually collects experience under a different behavior policy µ(a|s). When τ is different from both the behavior policy µ and the policy being learned π, the predictive question is a counterfactual prediction¹. Cumulants are often scaled by a factor of 1 − γ when γ is a constant in non-episodic predictions. The counterfactual prediction φ^τ(s) is a general value function (GVF) approximated by a deep neural network parameterized by θ to learn (2). The parameters θ are optimized with gradient descent, minimizing the loss function

    L(θ) = E_µ[ ρ δ² ]                                                                  (3)

where δ = φ^τ(s; θ) − y is the TD error and ρ = τ(a|s)/µ(a|s) is the importance sampling ratio that corrects for the difference between the policy distribution τ and the behavior distribution µ. Note that only the behavior policy distribution is corrected; the expectation is still over the state visitation distribution under the policy µ. In practice, this is usually not an issue [47].

¹Some literature calls this an off-policy prediction.
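As a concrete reading of the GVF question just defined, the following small sketch (ours) writes out the three ingredients used in this paper: the cumulant vector c_t = (α_t, β_t), the "keep doing what you are doing" policy τ(a_t | a_{t−1}) = N(a_{t−1}, 0.0025 I) from Section III, and a continuation function that cuts the return at an episodic event such as leaving the lane. The helper names are illustrative assumptions.

```python
# Sketch of a GVF "question" for one prediction: cumulant, policy tau, continuation.
# The Gaussian "keep doing what you are doing" policy and Sigma = 0.0025*I follow
# the text; the helper names are ours.
import numpy as np

SIGMA = 0.0025  # diagonal covariance of tau, from Section III

def cumulant(alpha: float, beta: float) -> np.ndarray:
    """Cumulant vector c_t: current lane centeredness and relative road angle."""
    return np.array([alpha, beta])

def tau_log_prob(action: np.ndarray, last_action: np.ndarray) -> float:
    """log tau(a_t | a_{t-1}) for tau = N(a_{t-1}, 0.0025 * I)."""
    d = action - last_action
    k = len(action)
    return -0.5 * (d @ d / SIGMA + k * np.log(2 * np.pi * SIGMA))

def continuation(gamma: float, out_of_lane: bool) -> float:
    """gamma_t: constant discount that terminates on an episodic event
    such as leaving the lane (Section IV-A)."""
    return 0.0 if out_of_lane else gamma
```
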
The target y is produced by bootstrapping a prediction of the value of the next state [52] under policy τ:

    y = E_{s_{t+1}∼P}[ c_{t+1} + γ φ^τ(s_{t+1}; θ̂) | s_t = s, a_t = a ]                 (4)

where y is a bootstrapped prediction using the most recent parameters θ̂, which are assumed constant in the gradient computation. Learning a counterfactual prediction with a fixed policy τ tends to be very stable when minimizing L(θ) using gradient descent approaches and therefore doesn't require the target networks originally used in [46] to stabilize DQN.

The gradient of the loss function (3) is given by

    ∇_θ L(θ) = E_µ[ ρ δ ∇_θ φ^τ(s; θ) ]                                                 (5)

However, updates with importance sampling ratios are known to have high variance, which may negatively impact learning; instead, we use the importance resampling technique to reduce the variance of the updates [47]. With importance resampling, a replay buffer D of size N is required and the gradient is estimated from a mini-batch and multiplied by the average importance sampling ratio of the samples in the buffer, ρ̄ = (1/N) Σ_{i=1}^{N} ρ_i. The gradient with importance resampling is given by

    ∇_θ L(θ) = E_{s,a∼D_ρ}[ ρ̄ δ ∇_θ φ^τ(s; θ) ]                                         (6)

where D_ρ is a distribution over the transitions in the replay buffer proportional to the importance sampling ratio. The probability of transition i = 1, ..., N is given by D_i = ρ_i / Σ_{j=1}^{N} ρ_j, where the importance sampling ratio is ρ_i = τ(a_i|s_i)/µ(a_i|s_i). An efficient data structure for the replay buffer is the SumTree used in prioritized experience replay [53].
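Putting Equations (3)–(6) together, the following sketch (ours, with a linear prediction model standing in for the deep network and a stand-in behavior density) shows one importance-resampling update: ratios ρ_i are computed for the buffered transitions, a mini-batch is drawn proportionally to ρ (a SumTree makes this efficient), the bootstrapped target of Equation (4) is formed with the current parameters held fixed, and the semi-gradient step is scaled by the buffer average ρ̄.

```python
# Sketch (ours, not the authors' code) of GVF learning with importance
# resampling, Eqs. (3)-(6), using a linear prediction phi(s) = W @ s for clarity.
import numpy as np

rng = np.random.default_rng(0)

# Fake replay buffer of transitions (s, a, a_prev, c, s_next).
S_DIM, A_DIM, C_DIM, N = 8, 2, 2, 500
buffer = [dict(s=rng.normal(size=S_DIM), a=rng.normal(size=A_DIM),
               a_prev=rng.normal(size=A_DIM), c=rng.normal(size=C_DIM),
               s_next=rng.normal(size=S_DIM)) for _ in range(N)]

SIGMA = 0.0025
def tau_pdf(a, a_prev):                      # tau = N(a_prev, 0.0025 * I)
    d = a - a_prev
    return np.exp(-0.5 * d @ d / SIGMA) / (2 * np.pi * SIGMA) ** (len(a) / 2)

def mu_hat(a, s):                            # stand-in for the estimated behavior density
    return 0.25                              # e.g. uniform over a 2x2 action box

W = np.zeros((C_DIM, S_DIM))                 # linear GVF: phi(s) = W @ s
gamma, lr, batch = 0.95, 1e-2, 32

rho = np.array([tau_pdf(t["a"], t["a_prev"]) / mu_hat(t["a"], t["s"]) for t in buffer])
probs = rho / rho.sum()                      # D_rho (a SumTree makes sampling O(log N))
rho_bar = rho.mean()                         # average ratio over the buffer

for i in rng.choice(N, size=batch, p=probs):
    t = buffer[i]
    y = t["c"] + gamma * W @ t["s_next"]     # bootstrapped target, Eq. (4); W held fixed
    delta = W @ t["s"] - y                   # TD error
    W -= lr * rho_bar * np.outer(delta, t["s"])   # semi-gradient step, Eq. (6)
```
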
B. Behavior Estimation

When learning predictions from real-world driving data, one needs to know the behavior policy distribution µ(a|s); however, in practice this is rarely known. Instead, we estimate it using the density ratio trick [51], where the ratio of two probability densities can be expressed as a ratio of discriminator class probabilities that distinguish samples from the two distributions. Let us define an intermediate probability density function η(a|s), such as the uniform distribution; this will be compared to the behavior distribution µ(a|s) which we desire to estimate. The class labels y = +1 and y = −1 are given to samples from µ(a|s) and η(a|s), respectively. A discriminator g(a, s) is learned that distinguishes state-action pairs from the two distributions using the cross-entropy loss. The ratio of the densities can be computed using only the discriminator g(a, s):

    µ(a|s)/η(a|s) = p(a|s, y = +1) / p(a|s, y = −1)
                  = [p(y = +1|a, s)/p(y = +1)] / [p(y = −1|a, s)/p(y = −1)]
                  = p(y = +1|a, s) / p(y = −1|a, s)
                  = g(a, s) / (1 − g(a, s))                                             (7)

Here we assume that p(y = +1) = p(y = −1). From this result, we can estimate µ(a|s) with µ̂(a|s) as follows:

    µ̂(a|s) = η(a|s) g(a, s) / (1 − g(a, s))                                             (8)

where η(a|s) is a known distribution over actions conditioned on state. Choosing η(a|s) to be the uniform distribution ensures that the discriminator is well trained against all possible actions in a given state; thus, good performance is achieved with sufficient coverage of the state space rather than the state-action space. Alternatively, one can estimate the importance sampling ratio without defining an additional distribution η by replacing η with τ; however, defining η to be a uniform distribution ensures the discriminator is learned effectively across the entire action space. The combined algorithms for training counterfactual predictions with an unknown behavior distribution are given in the Appendix for both the online and offline RL settings.
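A minimal sketch (ours) of the behavior estimation step in Equations (7)–(8): actions from the logged data (label y = +1) and actions drawn uniformly from η(a|s) (label y = −1) are fed in equal numbers, so that p(y = +1) = p(y = −1), to a discriminator trained with the cross-entropy loss; the estimated behavior density is then µ̂(a|s) = η(a|s) g(a, s)/(1 − g(a, s)). A tiny logistic-regression discriminator and synthetic data stand in for the paper's network and driving logs.

```python
# Sketch (ours) of behavior estimation via the density ratio trick, Eqs. (7)-(8).
import numpy as np

rng = np.random.default_rng(0)
S_DIM, A_DIM, N = 8, 2, 2000
A_LOW, A_HIGH = -1.0, 1.0
eta_density = 1.0 / (A_HIGH - A_LOW) ** A_DIM            # uniform eta(a|s)

states = rng.normal(size=(N, S_DIM))
logged_actions = np.tanh(states[:, :A_DIM]) + 0.1 * rng.normal(size=(N, A_DIM))
uniform_actions = rng.uniform(A_LOW, A_HIGH, size=(N, A_DIM))

X = np.concatenate([np.hstack([states, logged_actions]),
                    np.hstack([states, uniform_actions])])
y = np.concatenate([np.ones(N), np.zeros(N)])            # 1: from mu, 0: from eta (balanced)

w, b = np.zeros(X.shape[1]), 0.0
for _ in range(2000):                                    # logistic regression, cross-entropy
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    grad = p - y
    w -= 0.1 * X.T @ grad / len(y)
    b -= 0.1 * grad.mean()

def mu_hat(a, s):
    """Estimated behavior density via Eq. (8)."""
    g = 1.0 / (1.0 + np.exp(-(np.concatenate([s, a]) @ w + b)))
    return eta_density * g / (1.0 - g + 1e-8)
```
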

V. EXPERIMENTS

Our approach to learning counterfactual predictions for representing the state used in RL to learn a driving policy is applied to two different domains. The first set of experiments is conducted on a Jackal robot in the real-world, where we demonstrate the practicality of our approach and its robustness to damaged and distracting lane markings. The second set of experiments is conducted in the TORCS simulator, where we conduct an ablation study to understand the effect different counterfactual predictive representations have on performance and comfort. Refer to the Appendix² for more details on the experimental setup and training.

²Appendix is at https://arxiv.org/abs/2103.08070

A. Jackal Robot

The proposed solution for learning to drive the Jackal robot in the real-world is called GVF-BCQ, since it combines our novel method of learning GVF predictions with BCQ [23]. Two baselines are compared with our method: (1) a classical controller using model predictive control (MPC), and (2) batch-constrained Q-learning that trains end-to-end (E2E-BCQ). The MPC uses a map and a 2D laser scanner for localization from pre-existing ROS packages. E2E-BCQ is the current state of the art in offline deep RL [24]. Comparing to online RL was impractical due to safety concerns and the need to recharge the robot's battery every 4 hours.

The training data consisted of 6 training roads in both counter-clockwise (CCW) and clockwise (CW) directions and 3 test roads, where each of the 3 test roads had damaged variants. All training data was flipped to simulate travelling in the reverse direction and balance the data set in terms of direction. The training data was collected using a diverse set of drivers, including human drivers by remote control and a pure pursuit controller with safe exploration; thus, the training data was not suitable for imitation learning. The test roads were different from the training data: (1) a rectangle-shaped road with rounded outer corners, (2) an oval-shaped road, and (3) a complex road loop with many turns significantly different from anything observed by the agent during training. In addition, the test roads included variants with damaged lane markings. The reward is given by r_t = v_t(cos β_t + |α_t|), where v_t is the speed of the vehicle in km/h, β_t is the angle between the road direction and the vehicle direction, and α_t is the lane centeredness.

A comparison of the learned approaches is given in Table I, where the GVF-BCQ approach exceeds the performance of E2E-BCQ in all respects, demonstrating better performance at nearly double the speed. Both GVF-BCQ and E2E-BCQ were trained with the same data sets and given 10M updates each for a fair comparison. For GVF-BCQ, the first 5M updates were used for learning the counterfactual predictions and the second 5M updates were used for learning the policy from the predictive representation with BCQ. Both received the same observations, consisting of two stacked images, current vehicle speed, and last action, and produced desired steering angle and speed.

TABLE I: Comparison of GVF-BCQ (our method) and E2E-BCQ (baseline) on the Rectangle test road with 0.4 m/s target speed in both the CW and CCW directions. GVF-BCQ exceeds the performance of E2E-BCQ in all respects, with higher overall speed and far fewer out-of-lane events. E2E-BCQ was deemed unsafe for further experiments.

Method   | Dir. | r/s ↑ | Speed ↑ | Off-center ↓ | Off-angle ↓ | Out of Lane ↓
GVF-BCQ  | CCW  | 2.68  | 0.32    | 0.14         | 0.13        | 0.0%
E2E-BCQ  | CCW  | 1.26  | 0.18    | 0.26         | 0.24        | 3.8%
GVF-BCQ  | CW   | 2.29  | 0.31    | 0.22         | 0.16        | 0.0%
E2E-BCQ³ | CW   | -0.13 | 0.17    | 0.99         | 0.30        | 54.2%

³E2E-BCQ failed to recover after undershooting the first turn in the clockwise (CW) direction; it was not safe for testing on the other roads.

GVF-BCQ was tested on roads with damaged and distracting lane markings, as shown in Table II. The damaged and distracting lane markings for the complex test road loop are shown in Figure 1. These results demonstrate robustness because the training data did not include roads with damaged or distracting lane markings.

TABLE II: Effect of damaged lanes on GVF-BCQ performance in the CCW direction with 0.4 m/s target speed, where R, O, and C are the Rectangle, Oval and Complex road shapes respectively. GVF-BCQ demonstrates robustness to damaged and distracting lanes.

Road | Damage | r/s ↑ | Off-center ↓ | Off-angle ↓ | Out of Lane ↓ | Speed Jerk ↓ | Steer Jerk ↓
R    | No     | 2.68  | 0.13         | 0.13        | 0.0%          | 0.036        | 0.23
R    | Yes    | 2.74  | 0.14         | 0.14        | 0.0%          | -0.038       | 0.23
O    | No     | 2.40  | 0.28         | 0.21        | 1.5%          | 0.035        | 0.22
O    | Yes    | 2.07  | 0.33         | 0.21        | 7.19%         | 0.033        | 0.21
C    | No     | 2.35  | 0.22         | 0.18        | 0.0%          | 0.034        | 0.23
C    | Yes    | 2.11  | 0.31         | 0.24        | 9.42%         | 0.044        | 0.29

GVF-BCQ was also compared to MPC in Table III, where GVF-BCQ was found to produce superior performance in nearly all metrics at a high target speed of 0.4 m/s. The MPC performed poorly since it was difficult to tune for 0.4 m/s; performance was more similar at 0.25 m/s, where results are in the Appendix. A clear advantage of GVF-BCQ is the stability and smoothness of control achieved at the higher speeds.

TABLE III: Comparison of GVF-BCQ (our method) and MPC (baseline) in the CCW direction with 0.4 m/s target speed, where R, O, and C are the Rectangle, Oval and Complex road shapes respectively.

Road | Method  | r/s ↑ | Off-center ↓ | Off-angle ↓ | Out of Lane ↓ | Speed Jerk ↓ | Steer Jerk ↓
R    | GVF-BCQ | 2.68  | 0.13         | 0.13        | 0.0%          | 0.036        | 0.23
R    | MPC     | 0.97  | 0.53         | 0.19        | 20.4%         | 0.083        | 1.25
O    | GVF-BCQ | 2.40  | 0.28         | 0.21        | 1.45%         | 0.035        | 0.22
O    | MPC     | 0.89  | 0.53         | 0.20        | 22.7%         | 0.103        | 1.41
C    | GVF-BCQ | 2.35  | 0.22         | 0.18        | 0.0%          | 0.034        | 0.23
C    | MPC     | 0.72  | 0.64         | 0.21        | 38.9%         | -0.063       | -1.21

B. Ablation Study in TORCS

In order to understand the role of counterfactual predictions in representing the state of the agent, we conduct an ablation study in the TORCS simulator. We compare representations consisting of future predictions at multiple time scales, future predictions at a single time scale, and predictions with supervised regression of the current (non-future) lane centeredness and relative road angle. These experiments were conducted with online RL using deep deterministic policy gradient (DDPG) [50] in order to more easily understand the impact of the different state representations on the learning process. Our method is called GVF-DDPG and uses multiple time scales specified by the values γ = [0.0, 0.5, 0.9, 0.95, 0.97]. Two variants of our method, called GVF-0.95-DDPG and GVF-0.0-DDPG, were defined to investigate the impact of different temporal horizons on performance, with γ = 0.95 and γ = 0.0 respectively. It is worth pointing out that when γ = 0, the prediction is myopic, meaning that it reduces to a standard supervised regression problem equivalent to the predictions learned in [37]. These methods receive a history of two images, velocity and last action, and produce desired steering angle and vehicle speed action commands.

Some additional baselines include a kinematic-based steering approach based on [54] and two variants of DDPG with slightly different state representations. The kinematic-based steering approach is treated as a "ground truth" controller since it has access to perfect localization information to steer the vehicle; unlike our approach, the speed is controlled independently. The variants of DDPG are called (1) DDPG-Image and (2) DDPG-LowDim. DDPG-Image is given a history of two images, velocity and last action, while DDPG-LowDim is given a history of two images, velocity, last action, current lane centeredness α and relative road angle β in the observation. Both DDPG-Image and DDPG-LowDim output steering angle and vehicle speed action commands. The performance of DDPG-LowDim serves as an ideal learned controller since it learns from both images and the perfect localization information.

Fig. 5: Ablation study of GVF-DDPG (our method) of test scores (accumulated reward) over different time scale selections (left) and raw image-based state representations (right). Test scores were evaluated every 1000 steps during training for dirt-dirt-4, evo-evo-2 and road-spring, which were not part of the training set. Results show our proposed predictive representation with multiple time scales achieves the best performance.

Fig. 6: Ablation study of GVF-DDPG (our method) of jerkiness (lower is better) over different time scale selections. We use angular and longitudinal jerkiness to evaluate the smoothness of the learned policy. The jerkiness is evaluated every 1000 steps during training for dirt-dirt-4, evo-evo-2 and road-spring, which were not part of the training set. Results show our proposed multi-time-scale predictions achieve the best performance.

The learned agents were trained on 85% of the 40 tracks available in TORCS. The rest of the tracks were used for testing (6 in total) to measure the generalization performance of the policies. Results are repeated over 5 runs for each method. Only three of the tracks were successfully completed by at least one learned agent, and those are reported here. The reward in the TORCS environment is given by r_t = 0.0002 v_t(cos β_t + |α_t|), where v_t is the speed of the vehicle in km/h, β_t is the angle between the road direction and the vehicle direction, and α_t is the current lane centeredness. The policies were evaluated on test roads at regular intervals during training, as shown in Figures 5 and 6.

The GVF-0.0-DDPG and GVF-0.95-DDPG variations initially learned very good solutions but then diverged, indicating that one prediction may not be enough to control both steering angle and vehicle speed. Despite the unfair advantage provided to DDPG-LowDim by the inclusion of lane centeredness and road angle in the observation vector, GVF-DDPG still outperforms both variants of DDPG on many of the test roads. DDPG-Image was challenging to tune and train due to instability in learning; however, the counterfactual predictions in GVF-DDPG stabilized training for more consistent learning even though they were being learned simultaneously. Only GVF-DDPG with multiple time scale predictions is able to achieve extraordinarily smooth control.

VI. CONCLUSIONS

We present a new approach to learning to drive through a two-step process: (1) learn a limited number of counterfactual predictions about future lane centeredness and road angle under a known policy, and (2) learn an RL policy using the counterfactual predictions as a representation of state. Our novel approach is safe and practical because it learns from real-world driving data without online exploration, where the behavior distribution of the driving data is unknown. An experimental investigation into the impact of predictive representations on learning good driving policies shows that they generalize well to new roads, damaged lane markings and even distracting lane markings. We find that our approach improves the performance, smoothness and robustness of the driving decisions made from images. We conclude that counterfactual predictions at different time scales are crucial to achieve a good driving policy. To the best of our knowledge, this is the first practical demonstration of deep RL applied to autonomous driving on a real vehicle using only real-world data without any online exploration.

Our approach has the potential to be scaled with large volumes of data captured by human drivers of all skill levels; however, more work is needed to understand how well this approach will scale. In addition, a general framework for learning the right counterfactual predictions for real-world problems is needed where online interaction is prohibitively expensive.

REFERENCES

[1] N. Möhler, D. John, and M. Voigtländer, "Lane detection for a situation adaptive lane keeping support system, the safelane system," in Advanced Microsystems for Automotive Applications 2006, J. Valldorf and W. Gessner, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2006, pp. 485–500.
[2] Q. Zou, H. Jiang, Q. Dai, Y. Yue, L. Chen, and Q. Wang, "Robust lane detection from continuous driving scenes using deep neural networks," IEEE Transactions on Vehicular Technology, vol. 69, no. 1, pp. 41–54, 2020.
[3] T. Ort, L. Paull, and D. Rus, "Autonomous vehicle navigation in rural environments without detailed prior maps," in IEEE International Conference on Robotics and Automation (ICRA), 2018, pp. 2040–2047.
[4] Bing-Fei Wu, Tsu-Tian Lee, Hsin-Han Chang, Jhong-Jie Jiang, Cheng-Nan Lien, Tien-Yu Liao, and Jau-Woei Perng, "GPS navigation based autonomous driving system design for intelligent vehicles," in IEEE International Conference on Systems, Man and Cybernetics, 2007, pp. 3294–3299.
[5] G. Garimella, J. Funke, C. Wang, and M. Kobilarov, "Neural network modeling for steering control of an autonomous vehicle," in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2017, pp. 2609–2615.
[6] R. Liu, J. Wang, and B. Zhang, "High definition map for automated driving: Overview and analysis," Journal of Navigation, vol. 73, no. 2, pp. 324–341, 2020.
[7] L. Wang, Y. Zhang, and J. Wang, "Map-based localization method for autonomous vehicles using 3D-lidar," IFAC, vol. 50, no. 1, pp. 276–281, 2017.
[8] J. Chen, B. Yuan, and M. Tomizuka, "Deep imitation learning for autonomous driving in generic urban scenarios with enhanced safety," IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 2884–2890, 2019.
[9] M. Bojarski, D. D. Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, X. Zhang, J. Zhao, and K. Zieba, "End to end learning for self-driving cars," CoRR, vol. abs/1604.07316, 2016. [Online]. Available: http://arxiv.org/abs/1604.07316
[10] Z. Chen and X. Huang, "End-to-end learning for lane keeping of self-driving cars," in IEEE Intelligent Vehicles Symposium (IV), 2017, pp. 1856–1860.
[11] A. Sallab, M. Abdou, E. Perot, and S. Yogamani, "Deep reinforcement learning framework for autonomous driving," Electronic Imaging, vol. 2017, pp. 70–76, 2017.
[12] L. Chi and Y. Mu, "Deep steering: Learning end-to-end driving model from spatial and temporal visual cues," CoRR, vol. abs/1708.03798, 2018. [Online]. Available: http://arxiv.org/abs/1708.03798
[13] Y. Pan, C.-A. Cheng, K. Saigol, K. Lee, X. Yan, E. A. Theodorou, and B. Boots, "Imitation learning for agile autonomous driving," The International Journal of Robotics Research, vol. 39, no. 2-3, pp. 286–302, 2020.
[14] S. Gu, E. Holly, T. Lillicrap, and S. Levine, "Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates," in IEEE International Conference on Robotics and Automation (ICRA), 2017, pp. 3389–3396.
[15] B. R. Kiran, I. Sobh, V. Talpaert, P. Mannion, A. A. Sallab, S. Yogamani, and P. Pérez, "Deep reinforcement learning for autonomous driving: A survey," CoRR, vol. abs/2002.00444, 2020. [Online]. Available: http://arxiv.org/abs/2002.00444
[16] G. Dulac-Arnold, D. J. Mankowitz, and T. Hester, "Challenges of real-world reinforcement learning," International Conference on Machine Learning, vol. abs/1904.12901, 2019. [Online]. Available: http://arxiv.org/abs/1904.12901
[17] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis, "Mastering the game of go with deep neural networks and tree search," Nature, vol. 529, pp. 484–503, 2016.
[18] I. Akkaya, M. Andrychowicz, M. Chociej, M. Litwin, B. McGrew, A. Petron, A. Paino, M. Plappert, G. Powell, R. Ribas et al., "Solving Rubik's cube with a robot hand," CoRR, vol. abs/1910.07113, 2019. [Online]. Available: http://arxiv.org/abs/1910.07113
[19] S. Whiteson, B. Tanner, M. E. Taylor, and P. Stone, "Protecting against evaluation overfitting in empirical reinforcement learning," IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), pp. 120–127, 2011.
[20] C. Zhao, O. Sigaud, F. Stulp, and T. M. Hospedales, "Investigating generalisation in continuous deep reinforcement learning," CoRR, vol. abs/1902.07015, 2019. [Online]. Available: https://arxiv.org/abs/1902.07015
[21] P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger, "Deep reinforcement learning that matters," CoRR, vol. abs/1709.06560, 2017. [Online]. Available: http://arxiv.org/abs/1709.06560
[22] J. Farebrother, M. C. Machado, and M. Bowling, "Generalization and regularization in DQN," CoRR, vol. abs/1810.00123, 2018. [Online]. Available: http://arxiv.org/abs/1810.00123
[23] S. Fujimoto, D. Meger, and D. Precup, "Off-policy deep reinforcement learning without exploration," CoRR, vol. abs/1812.02900, 2018. [Online]. Available: http://arxiv.org/abs/1812.02900
[24] S. Fujimoto, E. Conti, M. Ghavamzadeh, and J. Pineau, "Benchmarking batch deep reinforcement learning algorithms," CoRR, vol. abs/1910.01708, 2019. [Online]. Available: http://arxiv.org/abs/1910.01708
[25] S. Levine, A. Kumar, G. Tucker, and J. Fu, "Offline reinforcement learning: tutorial, review and perspectives on open problems," CoRR, vol. abs/2005.01643, 2020. [Online]. Available: http://arxiv.org/abs/2005.01643
[26] R. Sutton, J. Modayil, M. Delp, T. Degris, P. Pilarski, A. White, and D. Precup, "Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction," in International Conference on Autonomous Agents and Multiagent Systems, ser. AAMAS '11, vol. 2, 2011, pp. 761–768.
[27] J. Modayil, A. White, and R. S. Sutton, "Multi-timescale nexting in a reinforcement learning robot," in From Animals to Animats 12, T. Ziemke, C. Balkenius, and J. Hallam, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2012, pp. 299–309.
[28] A. Clark, "Whatever next? predictive brains, situated agents, and the future of cognitive science," Behavioral and Brain Science, vol. 36, no. 3, pp. 181–204, 2013.
[29] E. M. Russek, I. Momennejad, M. M. Botvinick, S. J. Gershman, and N. D. Daw, "Predictive representations can link model-based reinforcement learning to model-free mechanisms," PLOS Computational Biology, vol. 13, no. 9, pp. 1–35, 2017.
[30] D. D. Salvucci and R. Gray, "A two-point visual control model of steering," Perception, vol. 33, no. 10, pp. 1233–1248, 2004.
[31] N. Kapania and J. Gerdes, "Design of a feedback-feedforward steering controller for accurate path tracking and stability at the limits of handling," Vehicle System Dynamics, vol. 53, pp. 1–18, 2015.
[32] C. Beal and J. Gerdes, "Model predictive control for vehicle stabilization at the limits of handling," IEEE Transactions on Control Systems Technology, vol. 21, pp. 1258–1269, 2013.
[33] M. L. Littman and R. S. Sutton, "Predictive representations of state," in Advances in Neural Information Processing Systems 14, T. G. Dietterich, S. Becker, and Z. Ghahramani, Eds., 2002, pp. 1555–1561.
[34] E. J. Rafols, M. B. Ring, R. S. Sutton, and B. Tanner, "Using predictive representations to improve generalization in reinforcement learning," in International Joint Conference on Artificial Intelligence, ser. IJCAI'05, 2005, pp. 835–840.
[35] T. Schaul and M. Ring, "Better generalization with forecasts," in International Joint Conference on Artificial Intelligence, ser. IJCAI'13, 2013, pp. 1656–1662.
[36] M. Bansal, A. Krizhevsky, and A. S. Ogale, "ChauffeurNet: Learning to drive by imitating the best and synthesizing the worst," in Robotics: Science and Systems XV, University of Freiburg, Freiburg im Breisgau, Germany, June 22-26, 2019, A. Bicchi, H. Kress-Gazit, and S. Hutchinson, Eds., 2019. [Online]. Available: https://doi.org/10.15607/RSS.2019.XV.031
[37] C. Chen, A. Seff, A. Kornhauser, and J. Xiao, "DeepDriving: Learning affordance for direct perception in autonomous driving," in IEEE International Conference on Computer Vision (ICCV), 2015, pp. 2722–2730.
[38] P. S. Thomas and E. Brunskill, "Data-efficient off-policy policy evaluation for reinforcement learning," in International Conference on Machine Learning, ser. ICML'16, vol. 48, 2016, pp. 2139–2148.
[39] J. Cunha, R. Serra, N. Lau, L. Lopes, and A. Neves, "Batch reinforcement learning for robotic soccer using the q-batch update-rule," Journal of Intelligent & Robotic Systems, vol. 80, pp. 385–399, 2015.

[40] J. Günther, P. M. Pilarski, G. Helfrich, H. Shen, and K. Diepold,
“Intelligent laser welding through representation, prediction, and control
learning: An architecture with deep neural networks and reinforcement
learning,” Mechatronics, vol. 34, pp. 1 – 11, 2016.
[41] A. L. Edwards, M. R. Dawson, J. S. Hebert, C. Sherstan, R. S. Sutton,
K. M. Chan, and P. M. Pilarski, “Application of real-time machine
learning to myoelectric prosthesis control: A case series in adaptive
switching,” Prosthetics and orthotics international, vol. 40, no. 5, pp.
573–581, 2016.
[42] A. White, “Developing a predictive approach to knowledge,” Ph.D.
dissertation, University of Alberta, 2015.
[43] M. Jaderberg, V. Mnih, W. M. Czarnecki, T. Schaul, J. Z. Leibo, D. Sil-
ver, and K. Kavukcuoglu, “Reinforcement learning with unsupervised
auxiliary tasks.” International Conference on Learning Representations,
2017.
[44] A. Barreto, W. Dabney, R. Munos, J. J. Hunt, T. Schaul, H. P. van
Hasselt, and D. Silver, “Successor features for transfer in reinforcement
learning,” in Advances in Neural Information Processing Systems
30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus,
S. Vishwanathan, and R. Garnett, Eds., 2017, pp. 4055–4065.
[45] H. Van Seijen, M. Fatemi, J. Romoff, R. Laroche, T. Barnes, and
J. Tsang, “Hybrid reward architecture for reinforcement learning,” in
Advances in Neural Information Processing Systems 30, I. Guyon,
U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan,
and R. Garnett, Eds., 2017, pp. 5392–5402.
[46] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou,
D. Wierstra, and M. Riedmiller, “Playing atari with deep reinforcement
learning,” CoRR, vol. abs/1312.5602, 2013. [Online]. Available:
http://arxiv.org/abs/1312.5602
[47] M. Schlegel, W. Chung, D. Graves, J. Qian, and M. White, “Importance
resampling off-policy prediction,” in Neural Information Processing
Systems, ser. NeurIPS’19, 2019.
[48] S. Ghiassian, A. Patterson, M. White, R. S. Sutton, and A. White,
“Online off-policy prediction,” CoRR, vol. abs/1811.02597, 2018.
[Online]. Available: http://arxiv.org/abs/1811.02597
[49] D. Graves, K. Rezaee, and S. Scheideman, “Perception as prediction
using general value functions in autonomous driving applications,” in
IEEE/RSJ International Conference on Intelligent Robots and Systems,
ser. IROS 2019, 2019.
[50] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller,
“Deterministic policy gradient algorithms,” in International Conference
on International Conference on Machine Learning, ser. ICML’14,
vol. 32, 2014, pp. I–387–I–395.
[51] M. Sugiyama, T. Suzuki, and T. Kanamori, “Density ratio estimation:
A comprehensive review,” RIMS Kokyuroku, pp. 10–31, 2010.
[52] R. S. Sutton, “Learning to predict by the methods of temporal
differences,” Machine Learning, vol. 3, no. 1, pp. 9–44, 1988.
[53] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, “Prioritized experience
replay,” in International Conference on Learning Representations,
Puerto Rico, 2016.
[54] B. Paden, M. Cáp, S. Z. Yong, D. S. Yershov, and E. Frazzoli, “A
survey of motion planning and control techniques for self-driving
urban vehicles,” CoRR, vol. abs/1604.07446, 2016. [Online]. Available:
http://arxiv.org/abs/1604.07446

