Autonomous Car Racing in Simulation Environment Using Deep Reinforcement Learning
I. INTRODUCTION
Self-driving cars are one of the most important and promising assets of our time. They have been a challenge and an inspiration for researchers and engineers throughout the world for decades. With the introduction of Deep Learning practices and computer vision techniques, autonomous vehicles in the near future are not a dream [1]. The development cycle of self-driving cars must start from simulations, since creating driving data for training would be inefficient, risky and time-consuming. TORCS, The Open Racing Car Simulator, is open, flexible and has a portable interface for AI development [2]. An example race is shown in Figure 1.

Figure 1: Example race

Achieving autonomous agents is a very complicated task. There are multiple practices in machine learning to train agents to learn and act. The first is supervised learning, in which the dataset contains the data and the ground-truth labels, and the agent tries to predict the true labels. The second is unsupervised learning; instead of having labels, only data is provided and the agent needs to correlate and group the data together. The last one, which is our method, is reinforcement learning, which creates its own data by interacting with the environment [3]. This is the best-suited approach, since most of the time there will not be any ground-truth actions for the agent to learn from.

The latest improvements in Deep Learning have affected every area of computation, and one of the affected areas is Reinforcement Learning. With the huge performance increase of GPUs, Reinforcement Learning practices with Deep Learning techniques became accessible. In this study, we try to find a near-optimal driver for the TORCS environment using Deep Reinforcement Learning techniques. The rest of the paper is organized as follows. In Section II, information regarding our approaches is given. Our methodology, algorithms and the strategies we have used are explained in Section III. In Section IV, the results of our approaches are given, and the last section contains concluding remarks.

II. PRELIMINARIES

Reinforcement Learning (RL) is a goal-oriented machine learning practice which tries to maximize the agent's cumulative reward. The reward is given when an agent behaves towards the goal or achieves the goal. A negative reward as punishment is also plausible in some cases.

RL agents also receive observations/states from the environment and act upon them. This cycle goes on until an optimal agent is found. Figure 2 depicts this behaviour. This behaviour is formulated as a Markov Decision Process, and MDPs are defined with five components (a minimal interaction-loop sketch is given after the list), which are:

• S: State space
• A: Action space
• R: Reward function
• P: State transition probability function
• γ: Discount factor
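To make the agent–environment cycle of Figure 2 concrete, the sketch below runs one episode against a Gym-style TORCS wrapper. The environment id "Torcs-v0" and the random-action policy are placeholders for illustration only; they are not the interface used in this work.

import gym  # assumes a Gym-style TORCS wrapper is installed and registered

env = gym.make("Torcs-v0")           # hypothetical environment id
state = env.reset()                  # S: initial state from the environment
done, total_reward = False, 0.0

while not done:
    action = env.action_space.sample()                   # A: random here; an RL agent would query its policy
    next_state, reward, done, info = env.step(action)    # R and P: reward and next state returned by the environment
    total_reward += reward                               # cumulative reward (gamma is applied inside the learning algorithm)
    state = next_state

print("episode return:", total_reward)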
Bootstrapping from a constantly changing value network can cause value functions to become unstable. To prevent this, SAC utilizes a target value network and updates it via Polyak averaging [7].

The exponentiated Q values act as the target density for the policy. To make the policy update differentiable, the re-parameterization trick is used. Overall, Soft Actor-Critic with Maximum Entropy learning is a data-efficient and stable algorithm.
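As an illustration of the re-parameterization trick mentioned above, the sketch below samples a tanh-squashed Gaussian action as a deterministic function of the policy outputs and external noise, so gradients can flow back into the policy network. Function and variable names are illustrative assumptions, not the exact implementation used in this work.

import torch
from torch.distributions import Normal

def reparameterized_action(mean, log_std):
    """Sample a tanh-squashed Gaussian action differentiably.

    mean, log_std: outputs of the policy network for a batch of states.
    """
    std = log_std.exp()
    normal = Normal(mean, std)
    pre_tanh = normal.rsample()          # mean + std * eps with eps ~ N(0, 1): the re-parameterization trick
    action = torch.tanh(pre_tanh)        # squash into [-1, 1] for steering/throttle-style actions
    # log-probability with the tanh change-of-variables correction (used in the entropy term)
    log_prob = normal.log_prob(pre_tanh) - torch.log(1 - action.pow(2) + 1e-6)
    return action, log_prob.sum(dim=-1, keepdim=True)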
Figure 2: Reinforcement Learning Setting [3]
B. Rainbow DQN
We have used automatic entropy tuning using log probabilities, as sketched below. Before LSTM, we also deployed an NSTACK mechanism: the 4 most recent states were serialized and used as input. When we noticed that LSTM outperforms the NSTACK approach, we abandoned it and continued with LSTM.
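A minimal sketch of the automatic entropy tuning referred to above, assuming a PyTorch-style SAC implementation: the temperature alpha is learned by gradient steps on its log, driven by the policy's log probabilities against a target entropy. Variable names, the target-entropy value and the optimizer choice are illustrative assumptions.

import torch

# target entropy is commonly set to -|A| (negative action dimensionality), e.g. 3 continuous actions in TORCS
target_entropy = -3.0
log_alpha = torch.zeros(1, requires_grad=True)            # optimize log(alpha) so that alpha stays positive
alpha_optimizer = torch.optim.Adam([log_alpha], lr=3e-4)

def update_alpha(log_prob):
    """One temperature update; log_prob comes from the re-parameterized policy sample."""
    alpha_loss = -(log_alpha * (log_prob + target_entropy).detach()).mean()
    alpha_optimizer.zero_grad()
    alpha_loss.backward()
    alpha_optimizer.step()
    return log_alpha.exp().item()                          # current alpha used to weight the entropy bonus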
2) Rainbow DQN: The DQN network consists of three fully connected layers of 128 weights each, as seen in Figure 4. The activation functions of the hidden layers are ReLU, like the SAC architecture, but the outputs are a 51-atom distribution over Q values, namely C51. As stated before, we are using Noisy-Net for exploration as opposed to the epsilon-greedy mechanism. The hyperparameters for Rainbow DQN are below; they were selected via trial and error using a simple grid search. A sketch of the distributional output head follows the list.

• N-Step: 3
• Gamma: 0.99
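The following sketch has the shape described above: three fully connected layers of 128 units with ReLU activations and a C51 head producing a 51-atom distribution over Q values for each discrete action. Plain linear layers are used for brevity where the actual agent uses Noisy-Net layers; the 27-action count is taken from the discretization mentioned in the conclusion, and the value-support range is a placeholder.

import torch
import torch.nn as nn

class C51Network(nn.Module):
    def __init__(self, state_dim, num_actions=27, num_atoms=51, v_min=-10.0, v_max=10.0):
        super().__init__()
        self.num_actions, self.num_atoms = num_actions, num_atoms
        # support of the value distribution (the 51 "atoms")
        self.register_buffer("support", torch.linspace(v_min, v_max, num_atoms))
        self.net = nn.Sequential(                 # three hidden layers of 128 units with ReLU activations
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, num_actions * num_atoms),
        )

    def forward(self, state):
        logits = self.net(state).view(-1, self.num_actions, self.num_atoms)
        probs = torch.softmax(logits, dim=-1)        # per-action probability over the 51 atoms
        q_values = (probs * self.support).sum(-1)    # expected Q value, used for greedy action selection
        return probs, q_values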
Figure 3: SAC Architecture: (a) SAC V Network (state → 512 → 256 → 128 → LSTM → value), (b) SAC Q Network (state and action → 512 → 256 → 128 → LSTM → value), (c) SAC Policy Network (state → 512 → 256 → 128 → LSTM → action)

Figure 4: DQN Architecture (state → 128 → 128 → 128 → C51 value distributions)

The reward functions that we have tried are formulated below; a code sketch showing how they can be computed from the TORCS observation is given after the list.

• No Trackpos: Track position is ignored.
  Vx cos θ − |Vx sin θ|

• Trackpos: Track position is taken into consideration.
  Vx cos θ − |Vx sin θ| − |Vx · trackpos|

• EndToEnd [14]: The car's angle is not used as a penalty.
  Vx (cos θ − |trackpos|)

• DeepRLTorcs [15]: The track-position penalty is discounted with the car's angle, and the car's angle is used as a penalty. Additionally, lateral velocity is used as a penalty, discounted towards the car's angle.
  Vx cos θ − |Vx sin θ| − |2 Vx sin θ · trackpos| − Vy cos θ
• Sigmoid: Same as the above reward function; the only difference is that the reward is flattened around the track centre to overcome slaloming.
  Vx sigmoid(3 cos θ) − Vx sin θ − Vy sigmoid(3 cos θ)

Figure 5: Try-Brake Distribution (brake percentage versus timestep)
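As referenced above, the sketch below computes the listed reward variants from a TORCS-style observation, assuming the usual sensor names (speedX, speedY, angle, trackPos); the exact observation wrapper used in this work may differ.

import numpy as np

def reward_no_trackpos(obs):
    vx, theta = obs["speedX"], obs["angle"]
    return vx * np.cos(theta) - abs(vx * np.sin(theta))

def reward_trackpos(obs):
    vx, theta, pos = obs["speedX"], obs["angle"], obs["trackPos"]
    return vx * np.cos(theta) - abs(vx * np.sin(theta)) - abs(vx * pos)

def reward_end_to_end(obs):
    vx, theta, pos = obs["speedX"], obs["angle"], obs["trackPos"]
    return vx * (np.cos(theta) - abs(pos))

def reward_deeprl_torcs(obs):
    vx, vy = obs["speedX"], obs["speedY"]
    theta, pos = obs["angle"], obs["trackPos"]
    return (vx * np.cos(theta) - abs(vx * np.sin(theta))
            - abs(2 * vx * np.sin(theta) * pos) - vy * np.cos(theta))

def reward_sigmoid(obs):
    vx, vy, theta = obs["speedX"], obs["speedY"], obs["angle"]
    sig = 1.0 / (1.0 + np.exp(-3.0 * np.cos(theta)))   # sigmoid(3 cos(theta)) flattens the reward near straight driving
    return vx * sig - vx * np.sin(theta) - vy * sig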
Figure 6: Episode vs Score. Maximum and standard deviation can be seen in Table II.

Algorithm      Standard Deviation    Maximum
DQN            306.25                1255.94
SAC-LSTM       433.22                1395.0

Table II: Standard deviation and maximum values of scores

Further analysis will be done on the SAC-LSTM agent. Since this study is designed for a car race, speed is the crucial factor. The performance of the agent in terms of speed versus episode can be seen in Figure 7. Values are aggregated over 250 episodes.

As discussed before, there are multiple tracks in the environment with different difficulties. Table IV shows the results of the SAC-LSTM algorithm for each track.
Track         Max Score   Avg Score   Max Speed   Avg Speed
forza         1395.0      742.95      216.0       143.27
g-track-1     954.0       611.36      200.36      131.56
g-track-2     1182.0      912.06      205.65      149.32
g-track-3     1154.0      574.16      182.0       109.11
ole-road-1    1103.0      209.70      216.42      121.65
ruudskogen    1294.0      575.58      199.78      117.08
street-1      1208.0      597.32      198.0       117.44
wheel-1       1248.0      758.45      206.05      139.16
wheel-2       1152.0      636.49      216.19      128.87
aalborg       922.0       183.84      189.16      80.54
alpine-1      1264.0      822.09      203.18      118.94
alpine-2      1119.0      677.13      194.43      103.02
e-track-1     1080.0      325.40      207.54      119.02
e-track-2     1212.0      907.00      179.65      104.55
e-track-4     1293.0      881.47      214.73      152.13
e-track-6     1201.0      583.34      213.59      130.12
eroad         1264.0      883.87      200.59      131.87
e-track-3     1383.0      895.41      211.87      137.35

Table IV: Scores on different tracks. The maximum value achieved in each column is shown in bold.

V. CONCLUSION

We have implemented two different algorithms for TORCS with great success. The agents complete tracks most of the time at around 140 km/h average speed and around 190 km/h maximum speed. It is observed that the agents generalize well on unseen tracks. Before the Try-Brake implementation, the agent did not use the brake action at all, in order to avoid obtaining low rewards. This appears to be fixed by the Try-Brake mechanism.

Another problem we faced was fast left-right maneuvers, also referred to as slaloming, which was partly solved via reward shaping and LSTM. We noticed that this problem is an issue of the reward function, so better reward shaping might overcome it in the near future. Since these maneuvers happen frequently at high speeds, cases other than racing might not face this problem. A race of the SAC-LSTM agent against scripted bots, as well as races between the Rainbow DQN and SAC-LSTM agents from our trained agent's perspective, can be viewed at https://youtu.be/f82EBvPKyDI.

We argue that the reasons SAC and SAC-LSTM performed better than the Rainbow DQN algorithm are the exploration methods and the continuous action space. SAC tries to maximize entropy, and this allows the agent to explore uncertain regions of the action space. Additionally, since SAC's policy network outputs continuous actions, braking, accelerating and steering can be controlled continuously, unlike Rainbow DQN's 27 discretized actions. This difference might have been helpful in obtaining stability on the road.
This study is a step towards using Deep Reinforcement Learning practices for self-driving cars. It can be seen that these methods are capable of learning to drive without supervision. Furthermore, these agents are easily transferable to real-world robotics platforms. Together with other machine learning practices, Deep Reinforcement Learning methods are expected to be used actively in the autonomous car industry.
ACKNOWLEDGEMENTS
This work was part of the term project for the BLG604E Deep Reinforcement Learning course at ITU. The project consisted of an autonomous car race with other participants from the class. Our agent was the fastest and won first place.
We thank Can Erhan for his contributions to the code base and implementations. We also want to thank Onur Karadeli for his remarks regarding the reward functions.
REFERENCES
[1] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, p. 436, 2015.
[2] B. Wymann, E. Espié, C. Guionneau, C. Dimitrakakis, R. Coulom, and A. Sumner, "TORCS, The Open Racing Car Simulator," software available at http://torcs.sourceforge.net, vol. 4, no. 6, 2000.
[3] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 2018.
[4] R. Bellman, "Dynamic programming," Science, vol. 153, no. 3731, pp. 34–37, 1966.
[5] T. Haarnoja, A. Zhou, K. Hartikainen, G. Tucker, S. Ha, J. Tan, V. Kumar, H. Zhu, A. Gupta, P. Abbeel et al., "Soft actor-critic algorithms and applications," arXiv preprint arXiv:1812.05905, 2018.
[6] H. Van Hasselt, A. Guez, and D. Silver, "Deep reinforcement learning with double Q-learning," in Thirtieth AAAI Conference on Artificial Intelligence, 2016.
[7] B. T. Polyak and A. B. Juditsky, "Acceleration of stochastic approximation by averaging," SIAM Journal on Control and Optimization, vol. 30, no. 4, pp. 838–855, 1992.
[8] M. Hessel, J. Modayil, H. Van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan, B. Piot, M. Azar, and D. Silver, "Rainbow: Combining improvements in deep reinforcement learning," in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[9] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, p. 529, 2015.
[10] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, "Prioritized experience replay," arXiv preprint arXiv:1511.05952, 2015.
[11] Z. Wang, T. Schaul, M. Hessel, H. Van Hasselt, M. Lanctot, and N. De Freitas, "Dueling network architectures for deep reinforcement learning," arXiv preprint arXiv:1511.06581, 2015.
[12] M. Fortunato, M. G. Azar, B. Piot, J. Menick, I. Osband, A. Graves, V. Mnih, R. Munos, D. Hassabis, O. Pietquin et al., "Noisy networks for exploration," arXiv preprint arXiv:1706.10295, 2017.
[13] M. G. Bellemare, W. Dabney, and R. Munos, "A distributional perspective on reinforcement learning," in Proceedings of the 34th International Conference on Machine Learning, Volume 70. JMLR.org, 2017, pp. 449–458.
[14] M. Jaritz, R. De Charette, M. Toromanoff, E. Perot, and F. Nashashibi, "End-to-end race driving with deep reinforcement learning," in 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018, pp. 2070–2075.
[15] B. Renukuntla, S. Sharma, S. Gadiyaram, V. Elango, and V. Sakaray, "The road to be taken, a deep reinforcement learning approach towards autonomous navigation," https://github.com/charlespwd/project-title, 2017.