II. RELATED WORKS

Deep learning approaches to driving: There have been many attempts to apply deep learning to driving, including deep RL and imitation learning [11]; however, generalization is a key challenge. ChauffeurNet [36] used a combination of imitation learning and predictive models to synthesize worst-case scenarios, but more work is needed for the policy to achieve performance competitive with modern motion planners. Another approach trained the agent entirely in a simulator, where transfer to the real world can be challenging to achieve [11]. DeepDriving [37] learned affordance predictions of road angle from an image for multi-lane driving in simulation using offline data collected by human drivers. However, in contrast with our proposed method, DeepDriving used heuristics and rules to control the vehicle instead of learning a policy with RL. Moreover, DeepDriving learned predictions of the current lane centeredness and current road angle rather than long-term counterfactual predictions of the future.

Offline RL in real-world robot training: There is much prior art in offline (batch) RL [38], [24]. However, most prior work in offline RL has difficulty learning good policies in the deep setting [24]. The current state of the art in offline RL is batch-constrained Q-learning (BCQ) [23], [24], which has been demonstrated in simulation environments such as Atari but still performs poorly in comparison to online learning. The greatest challenge in offline RL is the difficulty of covering the state-action space of the environment, which leaves holes in the training data where extrapolation is necessary. [39] applied a novel offline RL approach to playing soccer with a real-world robot by exploiting the episodic nature of the problem. Our work overcomes these challenges and is, to the best of our knowledge, the first successful real-world robotic application of batch RL with deep learning.

Counterfactual prediction learning: Learning counterfactual predictions as a representation of the state of the agent has been proposed before in the real world [40], [41]. Other approaches demonstrate counterfactual predictions but do not provide a way to use them [26], [42], [27]. While experiments with counterfactual predictions show a lot of promise for improving learning and generalization, most experiments are in simple tabular domains [33], [34], [35]. Auxiliary tasks and similar prediction problems have been applied to deep RL tasks in simulation, but they assume the policy is the same as the policy being learned and thus are not counterfactual predictions [43], [44], [29], [45].

III. PREDICTIVE CONTROL FOR AUTONOMOUS DRIVING

Let us consider the usual setting of an MDP described by a set of states S, a set of actions A, transition dynamics with probability P(s′|s, a) of transitioning to the next state s′ after taking action a from state s, and a reward r. The objective of an MDP is to learn a policy π that maximizes the future discounted sum of rewards in a given state. Obtaining the state of the agent in an MDP environment is not trivial, especially with deep RL, where the policy is changing and therefore the target is moving [46]. Our approach is to learn an intermediate representation that maps sensor readings s to a limited number of counterfactual predictions φ as a representation of the state for deep RL. This has the advantage of pushing the heavy burden of deep feature representation learning in RL onto the easier problem of prediction learning [47], [48], [49], [43].

The overall architecture of the system is depicted in Figure 2. The proposal is to represent the state of the agent as a vector ψ, the concatenation of a limited number of predictions φ, the current speed of the vehicle v_t and the previous action taken a_{t−1}. The predictions φ are counterfactual predictions, also called general value functions [26]. The previous action is needed because the predictions are defined relative to the last action taken. Learning a policy π(ψ) could provide substantial benefits over learning π from image observations: (1) improving learning performance and speed, (2) enabling batch RL from offline data, and (3) improving generalization of the driving policy. Our approach is to learn a value function Q(s, a) and a deterministic policy π(ψ) that maximizes that value function using batch-constrained Q-learning (BCQ) [23]. While the networks can be modelled as one computational graph, the gradients from the policy and value function networks are not back-propagated through the prediction network, in order to decouple the representation learning when learning from offline data. Thus, training happens in two phases: (1) learning the prediction network, and (2) learning the policy and value function.
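To make the two-phase design concrete, here is a minimal sketch (PyTorch-style; the encoder, layer sizes and the number of predictions are illustrative assumptions, not the paper's architecture) of how the predictive state ψ can be assembled, with φ detached so that no policy or value gradients flow back into the prediction network.

```python
import torch
import torch.nn as nn

class PredictionNet(nn.Module):
    """Phase 1: maps encoded sensor observations to counterfactual predictions phi.
    Sizes and layers here are placeholders, not the paper's architecture."""
    def __init__(self, obs_dim=512, n_predictions=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(),
            nn.Linear(256, n_predictions),
        )

    def forward(self, obs):
        return self.net(obs)

def predictive_state(pred_net, obs, speed, prev_action):
    """Build psi = [phi, v_t, a_{t-1}].  phi is computed without gradients so that
    the phase-2 policy/value updates never reach the prediction network."""
    with torch.no_grad():
        phi = pred_net(obs)
    return torch.cat([phi, speed, prev_action], dim=-1)

# Example shapes: a batch of 32 encoded observations, scalar speed, 2-D action.
obs = torch.randn(32, 512)
speed = torch.randn(32, 1)
prev_action = torch.randn(32, 2)
psi = predictive_state(PredictionNet(), obs, speed, prev_action)  # shape (32, 11)
```

In the second phase, ψ rather than the raw image is what the BCQ policy and value networks consume.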
During the first phase of training, a low-accuracy localization algorithm based on 2D lidar scan matching produces the lane centeredness α and relative road angle β of the vehicle, depicted in Figure 3, which are used to train the prediction network.

Fig. 3: An illustration of (a) the lane centeredness position α and (b) the road angle β, which is the angle between the direction of the vehicle and the direction of the road.

The prediction network is a single network that predicts the lane centeredness and relative road angle over multiple temporal horizons, depicted in Figure 4: these are predictions of the future lane centeredness and relative road angle rather than the current estimates returned by the localization algorithm. They are chosen because they represent both the present and future lane centeredness information needed to steer [30]. These predictions are discounted sums of future lane centeredness and relative road angle, respectively, and are learned with GVFs [26]:
φ(s) = E_τ[ Σ_{i=0}^{∞} γ^i c_{t+i+1} | s_t = s ]    (1)

where c_{t+i+1} is the cumulant vector consisting of the current lane centeredness α and the current relative road angle β. It is important to understand that φ(s) predicts the sum of all future lane centeredness and road angle values collected under some policy τ. The policy τ is counterfactual in the sense that it is different from both the behavior policy µ used to collect the data and the learned policy π. Formally, the policy is τ(a_t|s_t, a_{t−1}) = N(a_{t−1}, Σ), where Σ = 0.0025 I is a diagonal covariance matrix. The meaning of this policy is to “keep doing what you are doing”, similar to the one used in [32] for making counterfactual predictions. Therefore, φ(s) predicts the discounted sum of future lane centeredness and road angle if the vehicle keeps taking actions similar to its last action. Moreover, φ(s) can be interpreted as a prediction of the deviation from the desired lane centeredness and road angle. Counterfactual predictions can be thought of as anticipated future “errors” that allow controllers to take corrective action before the errors occur. The discount factor γ controls the temporal horizon of the prediction. It is critical to learn φ(s) for different values of γ in order to control both steering and speed. The details for learning φ(s) are provided in the next section.
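As a concrete illustration of what φ(s) represents, the sketch below (plain NumPy; the logged α/β sequence and the set of discount factors are made up for the example) computes the discounted sums of Equation (1) for several values of γ by a backward recursion, and samples actions from the “keep doing what you are doing” policy τ = N(a_{t−1}, 0.0025 I). In practice these sums are not computed directly from trajectories; they are the quantities the GVF network is trained to predict with the TD methods of the next section.

```python
import numpy as np

def discounted_sums(cumulants, gamma):
    """Return G_t = sum_i gamma^i * c_{t+i+1} for every t, via backward recursion.
    cumulants: array of shape (T, 2) holding [alpha, beta] at each step."""
    T = cumulants.shape[0]
    G = np.zeros_like(cumulants)
    running = np.zeros(cumulants.shape[1])
    for t in range(T - 2, -1, -1):          # G_t uses c_{t+1}, c_{t+2}, ...
        running = cumulants[t + 1] + gamma * running
        G[t] = running
    return G

def tau_sample(prev_action, rng, sigma2=0.0025):
    """'Keep doing what you are doing': a_t ~ N(a_{t-1}, 0.0025 * I)."""
    return rng.normal(loc=prev_action, scale=np.sqrt(sigma2))

rng = np.random.default_rng(0)
alpha_beta = rng.uniform(-1, 1, size=(1000, 2))   # fake [alpha, beta] log for the example
# One prediction per temporal horizon; the gamma values below are illustrative only.
targets = {g: discounted_sums(alpha_beta, g) for g in (0.0, 0.5, 0.9, 0.95)}
action = tau_sample(prev_action=np.array([0.1, 0.4]), rng=rng)
```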
During the second stage of training, the localization algorithm is no longer needed; it was used only to provide the labels for training the predictive representation in the first stage. Instead, the counterfactual predictions φ are concatenated with the vehicle speed v_t and last action a_{t−1} to form a predictive representation ψ. The RL agent receives ψ as the state of the agent in the environment, which is used to predict the value and produce the next action a_t, as depicted in Figure 2. In our offline learning approach, we used the state-of-the-art batch RL algorithm BCQ [23], [24] to train the policy. Note that the same architecture can also be applied online, where the counterfactual prediction, policy and value networks are all learned simultaneously with deep deterministic policy gradient (DDPG) [50], but the details are left to the appendix.

A. Counterfactual Predictions

To ask a counterfactual predictive question, we use the GVF framework, in which one must define a cumulant c_t = c(s_t, a_t, s_{t+1}) (a.k.a. pseudo-reward), a policy distribution τ(a|s) and a continuation function γ_t = γ(s_t, a_t, s_{t+1}). The answer to the predictive question is the expectation of the return φ_t when following policy τ, defined by

φ^τ(s) = E_τ[ Σ_{k=0}^{∞} ( Π_{j=0}^{k−1} γ_{t+j+1} ) c_{t+k+1} | s_t = s, a_t = a ]    (2)

where the cumulant is c_t and 0 ≤ γ_t ≤ 1 [26]. This is a more general form of the prediction than the one given in Equation (1); the only difference is that γ is replaced by a continuation function, which allows for predictions of the sum of cumulants until an episodic event occurs, such as going out of lane. The agent usually collects experience under a different behavior policy µ(a|s). When τ is different from both the behavior policy µ and the policy being learned π, the predictive question is a counterfactual prediction (some literature calls this an off-policy prediction). Cumulants are often scaled by a factor of 1 − γ when γ is a constant in non-episodic predictions. The counterfactual prediction φ^τ(s) is a general value function (GVF) approximated by a deep neural network parameterized by θ to learn (2). The parameters θ are optimized with gradient descent, minimizing the loss function

L(θ) = E_µ[ ρ δ² ]    (3)

where δ = φ^τ(s; θ) − y is the TD error and ρ = τ(a|s)/µ(a|s) is the importance sampling ratio that corrects for the difference between the policy distribution τ and the behavior distribution µ. Note that only the behavior policy distribution is corrected; the expectation is still over the state visitation distribution under the policy µ. In practice, this is usually not an issue
[47]. The target y is produced by bootstrapping a prediction of the value of the next state [52] under policy τ:

y = E_{s_{t+1}∼P}[ c_{t+1} + γ φ^τ(s_{t+1}; θ̂) | s_t = s, a_t = a ]    (4)

where y is a bootstrapped prediction using the most recent parameters θ̂, which are held constant in the gradient computation. Learning a counterfactual prediction with a fixed policy τ tends to be very stable when minimizing L(θ) with gradient descent and therefore does not require the target networks originally used in [46] to stabilize DQN. The gradient of the loss function (3) is given by

∇_θ L(θ) = E_µ[ ρ δ ∇_θ φ^τ(s; θ) ]    (5)
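A minimal sketch of one GVF update corresponding to Equations (3)–(5) is shown below (PyTorch-style; the batch layout and the use of log-probabilities for τ and µ are assumptions). The bootstrapped target is computed with gradients blocked, which plays the role of the frozen parameters θ̂, and each squared TD error is weighted by the importance sampling ratio ρ.

```python
import torch

def gvf_td_loss(pred_net, batch, gamma, tau_logprob, mu_logprob):
    """One step of minimizing L(theta) = E_mu[rho * delta^2] (Eqs. 3-5).

    batch: dict of tensors 'obs', 'next_obs', 'cumulant', where 'cumulant'
           holds the [alpha, beta] vector observed at t+1.
    tau_logprob / mu_logprob: log tau(a|s) and log mu(a|s) for the logged action;
           mu_logprob may come from the estimated behavior policy of Section B.
    """
    phi = pred_net(batch["obs"])                       # phi^tau(s; theta)
    with torch.no_grad():                              # theta_hat held fixed
        # For episodic GVFs, gamma can be a per-sample continuation gamma_{t+1} (Eq. 2).
        y = batch["cumulant"] + gamma * pred_net(batch["next_obs"])
    delta = phi - y                                    # TD error
    rho = torch.exp(tau_logprob - mu_logprob).unsqueeze(-1)  # importance sampling ratio
    return (rho * delta.pow(2)).mean()

# Typical usage (pred_net and optimizer assumed to exist):
#   loss = gvf_td_loss(pred_net, batch, gamma=0.95, tau_logprob=lp_tau, mu_logprob=lp_mu)
#   optimizer.zero_grad(); loss.backward(); optimizer.step()
```

With predictions at multiple horizons, the network outputs one block of predictions per γ, each trained with its own discount.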
However, updates with importance sampling ratios are known to have high variance, which may negatively impact learning; instead, we use the importance resampling technique to reduce the variance of the updates [47]. With importance resampling, a replay buffer D of size N is required, and the gradient is estimated from a mini-batch and multiplied by the average importance sampling ratio of the samples in the buffer, ρ̄ = (1/N) Σ_{i=1}^{N} ρ_i. The gradient with importance resampling is given by

∇_θ L(θ) = E_{s,a∼D_ρ}[ ρ̄ δ ∇_θ φ^τ(s; θ) ]    (6)

where D_ρ is a distribution over the transitions in the replay buffer proportional to the importance sampling ratio. The probability of transition i = 1...N is given by D_i = ρ_i / Σ_{j=1}^{N} ρ_j, where the importance sampling ratio is ρ_i = τ(a_i|s_i)/µ(a_i|s_i). An efficient data structure for the replay buffer is the SumTree used in prioritized experience replay [53].
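The resampling step might look like the sketch below (a simplified illustration: a plain np.random.choice draw stands in for the SumTree, which would make sampling O(log N) as in prioritized experience replay). Transitions are drawn with probability D_i ∝ ρ_i, and the mini-batch update is scaled by the buffer-average ratio ρ̄, as in Equation (6).

```python
import numpy as np

class ImportanceResamplingBuffer:
    """Replay buffer that resamples transitions proportionally to rho_i.
    A SumTree [53] is the efficient choice; np.random.choice keeps this
    sketch short at O(N) per draw."""
    def __init__(self):
        self.transitions, self.rhos = [], []

    def add(self, transition, rho):
        self.transitions.append(transition)
        self.rhos.append(rho)

    def sample(self, batch_size, rng):
        rho = np.asarray(self.rhos)
        probs = rho / rho.sum()                  # D_i = rho_i / sum_j rho_j
        idx = rng.choice(len(self.transitions), size=batch_size, p=probs)
        rho_bar = rho.mean()                     # average ratio over the buffer
        return [self.transitions[i] for i in idx], rho_bar

# The mini-batch gradient is then scaled by rho_bar instead of weighting
# each sample by its own rho (Eq. 6).
```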
B. Behavior Estimation

When learning predictions from real-world driving data, one needs to know the behavior policy distribution µ(a|s); however, in practice this is rarely known. Instead, we estimate it using the density ratio trick [51], where the ratio of two probability densities can be expressed as a ratio of discriminator class probabilities that distinguish samples from the two distributions. Let us define an intermediate probability density function η(a|s), such as the uniform distribution; this will be compared with the behavior distribution µ(a|s), which we wish to estimate. The class labels y = +1 and y = −1 are given to samples from µ(a|s) and η(a|s), respectively. A discriminator g(a, s) is learned that distinguishes state-action pairs from the two distributions using the cross-entropy loss. The ratio of the densities can be computed using only the discriminator g(a, s):
µ(a|s)/η(a|s) = p(a|s, y=+1)/p(a|s, y=−1) = [p(y=+1|a, s)/p(y=+1)] / [p(y=−1|a, s)/p(y=−1)] = p(y=+1|a, s)/p(y=−1|a, s) = g(a, s)/(1 − g(a, s))    (7)

Here we assume that p(y = +1) = p(y = −1). From this result, we can estimate µ(a|s) with µ̂(a|s) as follows:

µ̂(a|s) = η(a|s) g(a, s)/(1 − g(a, s))    (8)

where η(a|s) is a known distribution over actions conditioned on the state. Choosing η(a|s) to be the uniform distribution ensures that the discriminator is well trained against all possible actions in a given state; thus, good performance is achieved with sufficient coverage of the state space rather than the state-action space. Alternatively, one can estimate the importance sampling ratio without defining an additional distribution η by replacing η with τ; however, defining η to be a uniform distribution ensures the discriminator is learned effectively across the entire action space. The combined algorithms for training counterfactual predictions with an unknown behavior distribution are given in the Appendix for both the online and offline RL settings.
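A sketch of this estimator is given below (PyTorch-style; the discriminator architecture, optimizer and the assumption that actions are normalized to [−1, 1] are mine, not the paper's). The discriminator g(a, s) is trained with the cross-entropy loss to separate logged actions (samples from µ, label 1 for y = +1) from uniformly drawn actions (samples from η, label 0 for y = −1), and µ̂ then follows from Equation (8).

```python
import torch
import torch.nn as nn

state_dim, action_dim = 16, 2                       # illustrative sizes
disc = nn.Sequential(nn.Linear(state_dim + action_dim, 128), nn.ReLU(),
                     nn.Linear(128, 1))             # logits of g(a, s)
opt = torch.optim.Adam(disc.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

def discriminator_step(states, logged_actions):
    """One cross-entropy step: logged actions vs. uniform actions in [-1, 1]."""
    uniform_actions = 2 * torch.rand_like(logged_actions) - 1
    x = torch.cat([torch.cat([states, logged_actions], dim=-1),
                   torch.cat([states, uniform_actions], dim=-1)], dim=0)
    y = torch.cat([torch.ones(len(states), 1), torch.zeros(len(states), 1)], dim=0)
    loss = bce(disc(x), y)
    opt.zero_grad(); loss.backward(); opt.step()

def mu_hat(states, actions, eta_density):
    """Eq. (8): mu_hat(a|s) = eta(a|s) * g / (1 - g); eta is the uniform density."""
    with torch.no_grad():
        g = torch.sigmoid(disc(torch.cat([states, actions], dim=-1)))
    return eta_density * g / (1 - g)

# rho_hat = tau(a|s) / mu_hat(a|s) then serves as the importance sampling ratio.
```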
V. EXPERIMENTS

Our approach of learning counterfactual predictions to represent the state used in RL to learn a driving policy is applied to two different domains. The first set of experiments is conducted on a Jackal robot in the real world, where we demonstrate the practicality of our approach and its robustness to damaged and distracting lane markings. The second set of experiments is conducted in the TORCS simulator, where we conduct an ablation study to understand the effect that different counterfactual predictive representations have on performance and comfort. Refer to the Appendix (https://arxiv.org/abs/2103.08070) for more details on the experimental setup and training.

A. Jackal Robot

The proposed solution for learning to drive the Jackal robot in the real world is called GVF-BCQ, since it combines our novel method of learning GVF predictions with BCQ [23]. Two baselines are compared with our method: (1) a classical controller using model predictive control (MPC), and (2) batch-constrained Q-learning that trains end-to-end (E2E-BCQ). The MPC uses a map and 2D laser scanner for localization from pre-existing ROS packages. E2E-BCQ is the current state of the art in offline deep RL [24]. Comparing against online RL was impractical due to safety concerns and the need to recharge the robot's battery every 4 hours.

The training data consisted of 6 training roads in both counter-clockwise (CCW) and clockwise (CW) directions and 3 test roads, where each of the 3 test roads had damaged variants. All training data was flipped to simulate travelling in the reverse direction and balance the data set in terms of direction. The training data was collected using a diverse set of drivers, including human drivers by remote control and a pure pursuit controller with safe exploration; thus, the training data was not suitable for imitation learning. The test roads were different from the training data: (1) a rectangle-shaped road with rounded outer corners, (2) an oval-shaped road, and (3) a complex-shaped road.
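The direction-balancing flip mentioned above might look like the sketch below; which signals are negated is an assumption on my part, since the paper does not spell out the exact transformation. Mirroring the camera image left-right while negating the steering command, the lane centeredness α and the road angle β describes the same scene driven in the opposite direction.

```python
import numpy as np

def flip_sample(image, steering, alpha, beta):
    """Mirror a logged sample left-right to simulate the opposite driving direction.
    Assumes steering, lane centeredness and road angle are signed quantities
    that change sign under a left-right mirror (an assumption, not from the paper)."""
    return np.fliplr(image).copy(), -steering, -alpha, -beta

# Applied to every logged transition, this doubles the dataset and balances
# the CW/CCW directions, as described above.
```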
TABLE I: Comparison of GVF-BCQ (our method) and E2E-BCQ (baseline) on the Rectangle test road with a 0.4 m/s target speed in both the CW and CCW directions. GVF-BCQ exceeds the performance of E2E-BCQ in all respects, with higher overall speed and far fewer out-of-lane events. E2E-BCQ was deemed unsafe for further experiments.

| Method | Dir. | r/s ↑ | Speed ↑ | Off-center ↓ | Off-angle ↓ | Out of Lane ↓ |
|---|---|---|---|---|---|---|
| GVF-BCQ | CCW | 2.68 | 0.32 | 0.14 | 0.13 | 0.0% |
| E2E-BCQ | CCW | 1.26 | 0.18 | 0.26 | 0.24 | 3.8% |
| GVF-BCQ | CW | 2.29 | 0.31 | 0.22 | 0.16 | 0.0% |
| E2E-BCQ | CW | -0.13 | 0.17 | 0.99 | 0.30 | 54.2% |

TABLE III: Comparison of GVF-BCQ (our method) and MPC (baseline) in the CCW direction with a 0.4 m/s target speed, where R, O, and C are the Rectangle, Oval and Complex road shapes respectively.

| Road | Method | r/s ↑ | Off-center ↓ | Off-angle ↓ | Out of Lane ↓ | Speed Jerk ↓ | Steer Jerk ↓ |
|---|---|---|---|---|---|---|---|
| R | GVF-BCQ | 2.68 | 0.13 | 0.13 | 0.0% | 0.036 | 0.23 |
| R | MPC | 0.97 | 0.53 | 0.19 | 20.4% | 0.083 | 1.25 |
| O | GVF-BCQ | 2.40 | 0.28 | 0.21 | 1.45% | 0.035 | 0.22 |
| O | MPC | 0.89 | 0.53 | 0.20 | 22.7% | 0.103 | 1.41 |
| C | GVF-BCQ | 2.35 | 0.22 | 0.18 | 0.0% | 0.034 | 0.23 |
| C | MPC | 0.72 | 0.64 | 0.21 | 38.9% | -0.063 | -1.21 |
Fig. 5: Ablation study of GVF-DDPG (our method) of test scores (accumulated reward) over different time scale selections (left) and raw image-based state representations (right). Test scores were evaluated every 1000 steps during training for dirt-dirt-4, evo-evo-2 and road-spring, which were not part of the training set. Results show our proposed predictive representation with multiple time scales achieves the best performance.

Fig. 6: Ablation study of GVF-DDPG (our method) of jerkiness (lower is better) over different time scale selections. We use angular and longitudinal jerkiness to evaluate the smoothness of the learned policy. The jerkiness is evaluated every 1000 steps during training for dirt-dirt-4, evo-evo-2 and road-spring, which were not part of the training set. Results show our proposed multi-time-scale predictions achieve the best performance.

controller since it learns from both images and the perfect localization information.

The learned agents were trained on 85% of the 40 tracks available in TORCS. The rest of the tracks were used for testing (6 in total) to measure the generalization performance of the policies. Results are repeated over 5 runs for each method. Only three of the tracks were successfully completed by at least one learned agent, and those are reported here. The reward in the TORCS environment is given by r_t = 0.0002 v_t (cos β_t + |α_t|), where v_t is the speed of the vehicle in km/h, β_t is the angle between the road direction and the vehicle direction, and α_t is the current lane centeredness. The policies were evaluated on the test roads at regular intervals during training, as shown in Figures 5 and 6.
The GVF-0.0-DDPG and GVF-0.95-DDPG variations initially learned very good solutions but then diverged, indicating that one prediction may not be enough to control both steering angle and vehicle speed. Despite an unfair advantage provided to DDPG-LowDim by the inclusion of lane centeredness and road angle in the observation vector, GVF-DDPG still outperforms both variants of DDPG on many of the test roads. DDPG-Image was challenging to tune and train due to instability in learning; however, the counterfactual predictions in GVF-DDPG stabilized training for more consistent learning, even though they were being learned simultaneously. Only GVF-DDPG with multiple time-scale predictions is able to achieve extraordinarily smooth control.

VI. CONCLUSIONS

We present a new approach to learning to drive through a two-step process: (1) learn a limited number of counterfactual predictions about future lane centeredness and road angle under a known policy, and (2) learn an RL policy using the counterfactual predictions as a representation of state. Our novel approach is safe and practical because it learns from real-world driving data without online exploration, where the behavior distribution of the driving data is unknown. An experimental investigation into the impact of predictive representations on learning good driving policies shows that they generalize well to new roads, damaged lane markings and even distracting lane markings. We find that our approach improves the performance, smoothness and robustness of driving decisions learned from images. We conclude that counterfactual predictions at different time scales are crucial to achieving a good driving policy. To the best of our knowledge, this is the first practical demonstration of deep RL applied to autonomous driving on a real vehicle using only real-world data without any online exploration.

Our approach has the potential to be scaled with large volumes of data captured by human drivers of all skill levels; however, more work is needed to understand how well this approach will scale. In addition, a general framework for learning the right counterfactual predictions for real-world problems is needed where online interaction is prohibitively expensive.
REFERENCES

[1] N. Möhler, D. John, and M. Voigtländer, “Lane detection for a situation adaptive lane keeping support system, the safelane system,” in Advanced Microsystems for Automotive Applications 2006, J. Valldorf and W. Gessner, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2006, pp. 485–500.
[2] Q. Zou, H. Jiang, Q. Dai, Y. Yue, L. Chen, and Q. Wang, “Robust lane detection from continuous driving scenes using deep neural networks,” IEEE Transactions on Vehicular Technology, vol. 69, no. 1, pp. 41–54, 2020.
[3] T. Ort, L. Paull, and D. Rus, “Autonomous vehicle navigation in rural environments without detailed prior maps,” in IEEE International Conference on Robotics and Automation (ICRA), 2018, pp. 2040–2047.
[4] Bing-Fei Wu, Tsu-Tian Lee, Hsin-Han Chang, Jhong-Jie Jiang, Cheng-Nan Lien, Tien-Yu Liao, and Jau-Woei Perng, “Gps navigation based autonomous driving system design for intelligent vehicles,” in IEEE International Conference on Systems, Man and Cybernetics, 2007, pp. 3294–3299.
[5] G. Garimella, J. Funke, C. Wang, and M. Kobilarov, “Neural network modeling for steering control of an autonomous vehicle,” in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2017, pp. 2609–2615.
[6] R. Liu, J. Wang, and B. Zhang, “High definition map for automated driving: Overview and analysis,” Journal of Navigation, vol. 73, no. 2, p. 324–341, 2020.
[7] L. Wang, Y. Zhang, and J. Wang, “Map-based localization method for autonomous vehicles using 3d-lidar,” IFAC, vol. 50, no. 1, pp. 276–281, 2017.
[8] J. Chen, B. Yuan, and M. Tomizuka, “Deep imitation learning for autonomous driving in generic urban scenarios with enhanced safety,” IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 2884–2890, 2019.
[9] M. Bojarski, D. D. Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, X. Zhang, J. Zhao, and K. Zieba, “End to end learning for self-driving cars,” CoRR, vol. abs/1604.07316, 2016. [Online]. Available: http://arxiv.org/abs/1604.07316
[10] Z. Chen and X. Huang, “End-to-end learning for lane keeping of self-driving cars,” in IEEE Intelligent Vehicles Symposium (IV), 2017, pp. 1856–1860.
[11] A. Sallab, M. Abdou, E. Perot, and S. Yogamani, “Deep reinforcement learning framework for autonomous driving,” Electronic Imaging, vol. 2017, pp. 70–76, 2017.
[12] L. Chi and Y. Mu, “Deep steering: Learning end-to-end driving model from spatial and temporal visual cues,” CoRR, vol. abs/1708.03798, 2018. [Online]. Available: http://arxiv.org/abs/1810.00123
[13] Y. Pan, C.-A. Cheng, K. Saigol, K. Lee, X. Yan, E. A. Theodorou, and B. Boots, “Imitation learning for agile autonomous driving,” The International Journal of Robotics Research, vol. 39, no. 2-3, pp. 286–302, 2020.
[14] S. Gu, E. Holly, T. Lillicrap, and S. Levine, “Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates,” in IEEE International Conference on Robotics and Automation (ICRA), 2017, pp. 3389–3396.
[15] B. R. Kiran, I. Sobh, V. Talpaert, P. Mannion, A. A. Sallab, S. Yogamani, and P. Pérez, “Deep reinforcement learning for autonomous driving: A survey,” CoRR, vol. abs/2002.00444, 2020. [Online]. Available: http://arxiv.org/abs/2002.00444
[16] G. Dulac-Arnold, D. J. Mankowitz, and T. Hester, “Challenges of real-world reinforcement learning,” International Conference on International Conference on Machine Learning, vol. abs/1904.12901, 2019. [Online]. Available: http://arxiv.org/abs/1904.12901
[17] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis, “Mastering the game of go with deep neural networks and tree search,” Nature, vol. 529, pp. 484–503, 2016.
[18] I. Akkaya, M. Andrychowicz, M. Chociej, M. Litwin, B. McGrew, A. Petron, A. Paino, M. Plappert, G. Powell, R. Ribas et al., “Solving rubik’s cube with a robot hands,” CoRR, vol. abs/1910.07113, 2019. [Online]. Available: http://arxiv.org/abs/1910.07113
[19] S. Whiteson, B. Tanner, M. E. Taylor, and P. Stone, “Protecting against evaluation overfitting in empirical reinforcement learning,” Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), pp. 120–127, 2011.
[20] C. Zhao, O. Sigaud, F. Stulp, and T. M. Hospedales, “Investigating generalisation in continuous deep reinforcement learning,” CoRR, vol. abs/1902.07015, 2019. [Online]. Available: https://arxiv.org/abs/1902.07015
[21] P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger, “Deep reinforcement learning that matters,” CoRR, vol. abs/1709.06560, 2017. [Online]. Available: http://arxiv.org/abs/1709.06560
[22] J. Farebrother, M. C. Machado, and M. Bowling, “Generalization and regularization in DQN,” CoRR, vol. abs/1810.00123, 2018. [Online]. Available: http://arxiv.org/abs/1810.00123
[23] S. Fujimoto, D. Meger, and D. Precup, “Off-policy deep reinforcement learning without exploration,” CoRR, vol. abs/1812.02900, 2018. [Online]. Available: http://arxiv.org/abs/1812.02900
[24] S. Fujimoto, E. Conti, M. Ghavamzadeh, and J. Pineau, “Benchmarking batch deep reinforcement learning algorithms,” CoRR, vol. abs/1910.01708, 2019. [Online]. Available: http://arxiv.org/abs/1910.01708
[25] S. Levine, A. Kumar, G. Tucker, and J. Fu, “Offline reinforcement learning: tutorial, review and perspectives on open problems,” CoRR, vol. abs/2005.01643, 2020. [Online]. Available: http://arxiv.org/abs/2005.01643
[26] R. Sutton, J. Modayil, M. Delp, T. Degris, P. Pilarski, A. White, and D. Precup, “Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction,” in International Conference on Autonomous Agents and Multiagent Systems, ser. AAMAS ’11, vol. 2, 2011, pp. 761–768.
[27] J. Modayil, A. White, and R. S. Sutton, “Multi-timescale nexting in a reinforcement learning robot,” in From Animals to Animats 12, T. Ziemke, C. Balkenius, and J. Hallam, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2012, pp. 299–309.
[28] A. Clark, “Whatever next? predictive brains, situated agents, and the future of cognitive science,” Behavioral and Brain Science, vol. 36, no. 3, pp. 181–204, 2013.
[29] E. M. Russek, I. Momennejad, M. M. Botvinick, S. J. Gershman, and N. D. Daw, “Predictive representations can link model-based reinforcement learning to model-free mechanisms,” PLOS Computational Biology, vol. 13, no. 9, pp. 1–35, 2017.
[30] D. D. Salvucci and R. Gray, “A two-point visual control model of steering,” Perception, vol. 33, no. 10, pp. 1233–1248, 2004.
[31] N. Kapania and J. Gerdes, “Design of a feedback-feedforward steering controller for accurate path tracking and stability at the limits of handling,” Vehicle System Dynamics, vol. 53, pp. 1–18, 2015.
[32] C. Beal and J. Gerdes, “Model predictive control for vehicle stabilization at the limits of handling,” Control Systems Technology, IEEE Transactions on, vol. 21, pp. 1258–1269, 2013.
[33] M. L. Littman and R. S. Sutton, “Predictive representations of state,” in Advances in Neural Information Processing Systems 14, T. G. Dietterich, S. Becker, and Z. Ghahramani, Eds., 2002, pp. 1555–1561.
[34] E. J. Rafols, M. B. Ring, R. S. Sutton, and B. Tanner, “Using predictive representations to improve generalization in reinforcement learning,” in International Joint Conference on Artificial Intelligence, ser. IJCAI’05, 2005, pp. 835–840.
[35] T. Schaul and M. Ring, “Better generalization with forecasts,” in International Joint Conference on Artificial Intelligence, ser. IJCAI ’13, 2013, pp. 1656–1662.
[36] M. Bansal, A. Krizhevsky, and A. S. Ogale, “Chauffeurnet: Learning to drive by imitating the best and synthesizing the worst,” in Robotics: Science and Systems XV, University of Freiburg, Freiburg im Breisgau, Germany, June 22-26, 2019, A. Bicchi, H. Kress-Gazit, and S. Hutchinson, Eds., 2019. [Online]. Available: https://doi.org/10.15607/RSS.2019.XV.031
[37] C. Chen, A. Seff, A. Kornhauser, and J. Xiao, “Deepdriving: Learning affordance for direct perception in autonomous driving,” in IEEE International Conference on Computer Vision (ICCV), 2015, pp. 2722–2730.
[38] P. S. Thomas and E. Brunskill, “Data-efficient off-policy policy evaluation for reinforcement learning,” in International Conference on International Conference on Machine Learning, ser. ICML’16, vol. 48, 2016, p. 2139–2148.
[39] J. Cunha, R. Serra, N. Lau, L. Lopes, and A. Neves, “Batch reinforcement learning for robotic soccer using the q-batch update-rule,” Journal of Intelligent & Robotic Systems, vol. 80, pp. 385–399, 2015.
[40] J. Günther, P. M. Pilarski, G. Helfrich, H. Shen, and K. Diepold, “Intelligent laser welding through representation, prediction, and control learning: An architecture with deep neural networks and reinforcement learning,” Mechatronics, vol. 34, pp. 1–11, 2016.
[41] A. L. Edwards, M. R. Dawson, J. S. Hebert, C. Sherstan, R. S. Sutton, K. M. Chan, and P. M. Pilarski, “Application of real-time machine learning to myoelectric prosthesis control: A case series in adaptive switching,” Prosthetics and orthotics international, vol. 40, no. 5, pp. 573–581, 2016.
[42] A. White, “Developing a predictive approach to knowledge,” Ph.D. dissertation, University of Alberta, 2015.
[43] M. Jaderberg, V. Mnih, W. M. Czarnecki, T. Schaul, J. Z. Leibo, D. Silver, and K. Kavukcuoglu, “Reinforcement learning with unsupervised auxiliary tasks,” International Conference on Learning Representations, 2017.
[44] A. Barreto, W. Dabney, R. Munos, J. J. Hunt, T. Schaul, H. P. van Hasselt, and D. Silver, “Successor features for transfer in reinforcement learning,” in Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds., 2017, pp. 4055–4065.
[45] H. Van Seijen, M. Fatemi, J. Romoff, R. Laroche, T. Barnes, and J. Tsang, “Hybrid reward architecture for reinforcement learning,” in Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds., 2017, pp. 5392–5402.
[46] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, “Playing atari with deep reinforcement learning,” CoRR, vol. abs/1312.5602, 2013. [Online]. Available: http://arxiv.org/abs/1312.5602
[47] M. Schlegel, W. Chung, D. Graves, J. Qian, and M. White, “Importance resampling off-policy prediction,” in Neural Information Processing Systems, ser. NeurIPS’19, 2019.
[48] S. Ghiassian, A. Patterson, M. White, R. S. Sutton, and A. White, “Online off-policy prediction,” CoRR, vol. abs/1811.02597, 2018. [Online]. Available: http://arxiv.org/abs/1811.02597
[49] D. Graves, K. Rezaee, and S. Scheideman, “Perception as prediction using general value functions in autonomous driving applications,” in IEEE/RSJ International Conference on Intelligent Robots and Systems, ser. IROS 2019, 2019.
[50] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller, “Deterministic policy gradient algorithms,” in International Conference on International Conference on Machine Learning, ser. ICML’14, vol. 32, 2014, pp. I–387–I–395.
[51] M. Sugiyama, T. Suzuki, and T. Kanamori, “Density ratio estimation: A comprehensive review,” RIMS Kokyuroku, pp. 10–31, 2010.
[52] R. S. Sutton, “Learning to predict by the methods of temporal differences,” Machine Learning, vol. 3, no. 1, pp. 9–44, 1988.
[53] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, “Prioritized experience replay,” in International Conference on Learning Representations, Puerto Rico, 2016.
[54] B. Paden, M. Cáp, S. Z. Yong, D. S. Yershov, and E. Frazzoli, “A survey of motion planning and control techniques for self-driving urban vehicles,” CoRR, vol. abs/1604.07446, 2016. [Online]. Available: http://arxiv.org/abs/1604.07446