
Autonomous Car Racing in Simulation Environment Using Deep Reinforcement Learning

Kıvanç Güçkıran¹, Bülent Bolat²

¹ Electronic and Communication Engineering Department, Yildiz Technical University, Istanbul, Turkey
[email protected]
² Electronic and Communication Engineering Department, Yildiz Technical University, Istanbul, Turkey
[email protected]

Abstract—Self-driving cars are currently a hot topic around the globe thanks to advancements in Deep Learning techniques for computer vision problems. Since driving simulations are fairly important before real-life autonomous implementations, there are multiple driving and racing simulations for testing purposes. The Open Racing Car Simulator (TORCS) is a highly portable, open-source car racing and self-driving simulation. While it can be used as a game in which human players compete with scripted agents, TORCS also provides an observation and action API for developing artificial intelligence agents. This study explores near-optimal Deep Reinforcement Learning agents for the TORCS environment using the Soft Actor-Critic and Rainbow DQN algorithms together with exploration and generalization techniques.

Keywords—Deep Reinforcement Learning, TORCS, Self-Driving Car

I. INTRODUCTION

Self-driving cars are one of the most important and promising assets of our time. They have been a challenge and an inspiration for researchers and engineers throughout the world for decades. With the introduction of Deep Learning practices and computer vision techniques, autonomous vehicles in the near future are no longer a dream [1]. The development cycle of self-driving cars must start from simulations, since creating driving data for training would otherwise be inefficient, risky and time-consuming. TORCS, The Open Racing Car Simulator, is open, flexible and has a portable interface for AI development [2]. An example race is shown in Figure 1.

Figure 1: Example race

Achieving autonomous agents is a very complicated task. There are multiple practices in machine learning to train agents to learn and act. The first is supervised learning, in which the dataset contains the data and the ground-truth labels, and agents try to predict the true labels. The second is unsupervised learning, where only data is provided, without labels, and agents need to correlate and group the data. The last one, which is our method, is reinforcement learning, which creates its own data by interacting with the environment [3]. This is the best-suited approach, since most of the time there will not be any ground-truth actions for agents to learn from.

The latest improvements in Deep Learning have affected every area of computation, and one of the affected areas is Reinforcement Learning. With the huge performance increase of GPUs, Reinforcement Learning practices combined with Deep Learning techniques became accessible. In this study, we try to find a near-optimal driver for the TORCS environment using Deep Reinforcement Learning techniques. The rest of the paper is organized as follows. In Section II, information regarding our approaches is given. Our methodology, algorithms and the strategies we have used are explained in Section III. In Section IV, the results of our approaches are given, and the last section has concluding remarks.

II. PRELIMINARIES

Reinforcement Learning (RL) is a goal-oriented machine learning practice which tries to maximize the agent's cumulative reward. The reward is given when an agent behaves towards the goal or achieves the goal. A negative reward as punishment is also plausible in some cases.

RL agents also receive observations/states from the environment and act upon them. This cycle goes on until an optimal agent is found. Figure 2 depicts this behaviour. This behaviour is formulated as a Markov Decision Process (MDP), and MDPs are defined with five components:

• S: State space
• A: Action space
• R: Reward function
• P: Transition function
• γ: Discount factor

Figure 2: Reinforcement Learning Setting [3]

The state space depends on the environment: all possible perceptions of the environment by the agent form the state space. Similarly, the action space is defined as all possible actions that the agent can use. The reward function depends on state and action, and defines when and how much reward is received for a state-action pair. The transition function addresses which state is transitioned to after an action is taken in a state. The last component is the discount factor, which determines how much an agent takes future rewards into consideration.

When the MDP is known, there are common practices such as Dynamic Programming to solve the MDP by visiting all state-action pairs recursively [4]. On the other hand, when the MDP is not known to the agent, Reinforcement Learning practices are used. In RL, trajectories are sampled from the environment and agents learn from them. There are mainly two approaches to Reinforcement Learning problems: value-based methods and policy-based methods. There are also hybrid methods like Actor-Critic, in which value networks are present within the algorithm in addition to policies.

Value-based methods define their policy by acting greedily with respect to the value function; this way, the agent exploits its current knowledge of the environment. This can lead to sub-optimal policies, since agents need to explore new trajectories to reach optimality. There are multiple approaches to this dilemma, such as the epsilon-greedy strategy, where the agent sometimes acts randomly according to an epsilon value.

Policy-based methods try to maximize the cumulative reward by directly mapping states to actions. In this case, exploration is generally done by adding noise to the actions. There are also hybrid methods which utilize a value function within policy-based methods; these are called Actor-Critic methods.
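To make the sampled-trajectory setting concrete, a minimal Python sketch of the agent-environment loop is given below. The `env` and `agent` objects are generic placeholders following a Gym-style reset/step interface; this illustrates the cycle depicted in Figure 2 and is not code from our repository.

```python
# Minimal agent-environment loop (sketch). `env` follows a Gym-style
# reset()/step() interface and `agent` exposes act()/learn(); both are
# hypothetical placeholders, not this paper's implementation.
def run_episode(env, agent, max_steps=10_000):
    state = env.reset()
    episode_return = 0.0
    for _ in range(max_steps):
        action = agent.act(state)                       # policy picks an action
        next_state, reward, done, _ = env.step(action)  # MDP transition + reward
        agent.learn(state, action, reward, next_state, done)
        episode_return += reward
        state = next_state
        if done:
            break
    return episode_return
```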
action pairs recursively [4]. On the other hand, when MDP with Q learning mostly depends on the action-
is not known to the agent, Reinforcement Learning practices exploration with epsilon-greedy methods. Noisy Net-
are used. In RL practices trajectories are sampled from envi- works introduces the capability of parameter space
ronment and agents learn from them. There are mainly two exploration. Parameter change drives state and action
approaches to Reinforcement Learning problems, value-based exploration indirectly.
methods, and policy-based methods. There are also, hybrid
methods like Actor-Critic, which in addition to policies, value These algorithms together form the Rainbow DQN ap-
networks are also present within the algorithms. proach. We have also used C51 output to obtain further
Value-based methods define their policy via acting greedily improvements on Q value distribution [13].
to the value function and this way, it uses its current knowledge
on the environment. This leads to sub-optimal policies since
agents need to explore new trajectories to obtain optimality. C. TORCS
There are multiple approaches to this dilemma like epsilon-
greedy strategy, where sometimes agent acts randomly using TORCS provides an API for AI agents to act and learn
epsilon value. from. This API has several observations like angle, speed,
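The Polyak-averaged target update mentioned above can be written in a few lines. The sketch below assumes PyTorch modules and uses the tau value of 10^-3 listed in Section III; it is illustrative, not the exact implementation in our repository.

```python
# Polyak (exponential moving average) update of a target value network.
# `value_net` and `target_value_net` are assumed to be PyTorch modules;
# tau = 1e-3 matches the hyperparameter listed in Section III.
import torch

@torch.no_grad()
def polyak_update(value_net, target_value_net, tau=1e-3):
    for param, target_param in zip(value_net.parameters(),
                                   target_value_net.parameters()):
        # target <- tau * online + (1 - tau) * target
        target_param.mul_(1.0 - tau).add_(tau * param)
```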
B. Rainbow DQN

Rainbow DQN [8] combines multiple improvements to DQN [9]. These improvements are:

• Double Q Learning [6] - This method is used to overcome the overestimation problem of Q networks.

• Prioritized Experience Replay [10] - Experiences for updates are picked with a priority. The most commonly used priority measure is the TD error.

• Dueling Networks [11] - Sometimes choosing the exact action does not matter much, but the value function estimate is still important. This method guarantees value estimation in all cases.

• Noisy Networks [12] - Exploration with Q learning mostly depends on action exploration with epsilon-greedy methods. Noisy Networks add the capability of parameter-space exploration; the parameter perturbations drive state and action exploration indirectly.

These components together form the Rainbow DQN approach. We have also used a C51 output to obtain further improvements through a distributional view of the Q values [13].
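As an illustration of how prioritized replay selects experiences, the sketch below turns absolute TD errors into sampling probabilities and importance-sampling weights. The alpha, beta and epsilon values are the PER hyperparameters listed in Section III; the function itself is a simplified stand-in for a real sum-tree buffer, not our implementation.

```python
# Prioritized Experience Replay sampling probabilities (illustrative sketch).
# alpha/beta/eps match the PER hyperparameters listed in Section III.
import numpy as np

def per_sample(td_errors, batch_size, alpha=0.6, beta=0.4, eps=1e-6):
    priorities = (np.abs(td_errors) + eps) ** alpha   # p_i = (|delta_i| + eps)^alpha
    probs = priorities / priorities.sum()             # P(i) = p_i / sum_k p_k
    idx = np.random.choice(len(td_errors), batch_size, p=probs)
    # Importance-sampling weights correct the bias introduced by prioritization.
    weights = (len(td_errors) * probs[idx]) ** (-beta)
    weights /= weights.max()                          # normalize for stability
    return idx, weights
```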
C. TORCS

TORCS provides an API for AI agents to act and learn from. This API exposes several observations such as angle, speed, damage and gear. We are using 6 observation groups with a total of 29 dimensions from the API, which are:

• Angle - 1: Angle between the tangent of the track and the car

• Track - 19: Lidar sensor on the front of the car scanning 180 degrees

• TrackPos - 1: Distance from the middle of the track, greater than 0.5 if off the track

• Speed - 3: Cartesian speeds, where the x-axis always points to the front of the car

• Wheel speeds - 4: Angular speeds (rad/s) for each wheel

• RPM - 1: Engine speed

For actions, we are using acceleration, brake and steer.
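A sketch of how these readings can be flattened into the 29-dimensional state vector is given below. The dictionary keys are illustrative placeholders for the TORCS client fields; the exact wrapper used in our experiments lives in the repository linked in Section III.

```python
# Assemble the 29-dimensional state vector from TORCS sensor readings.
# The dictionary keys are illustrative placeholders for the client API fields.
import numpy as np

def make_state(obs):
    state = np.concatenate([
        [obs["angle"]],        # 1: angle to the track tangent
        obs["track"],          # 19: range-finder readings over 180 degrees
        [obs["trackPos"]],     # 1: lateral offset from the track center
        [obs["speedX"], obs["speedY"], obs["speedZ"]],  # 3: Cartesian speeds
        obs["wheelSpinVel"],   # 4: wheel angular speeds (rad/s)
        [obs["rpm"]],          # 1: engine speed
    ]).astype(np.float32)
    assert state.shape == (29,)
    return state
```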
III. METHODOLOGY

This section describes our implementation of the approaches explained above. First, the architectures of the algorithms used in this study are explained; then reward shaping and termination are detailed; and lastly, exploration, generalization and environmental changes are explained thoroughly. Our codebase and implementation can be found at https://github.com/kivancguckiran/torcs-rl-agent.

A. Architecture

1) SAC: Each neural network used by SAC consists of fully connected layers with 512, 256 and 128 units respectively, as seen in Figure 3. We used ReLU activations on the hidden layers and a Gaussian distribution over actions with a TanH activation on the output layer. We observed improvements on SAC when we added a single LSTM layer before the output layer; a sketch of this network is given after the hyperparameter list below.

Figure 3: SAC Architecture - (a) SAC V Network, (b) SAC Q Network, (c) SAC Policy Network; each is a 512-256-128 fully connected stack followed by an LSTM layer.

The hyperparameters for SAC-LSTM are listed below. These parameters were selected via trial and error using a simple grid search.

• Gamma: 0.99
• Tau: 10^-3
• Batch Size: 32
• Step Size: 16
• Episode Buffer: 10^3
• Actor Learning Rate: 3·10^-4
• Value Learning Rate: 3·10^-4
• Q Learning Rate: 3·10^-4
• Entropy Learning Rate: 3·10^-4
• Policy Update Interval: 2
• Initial Random Actions: 10^4
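For concreteness, a PyTorch sketch of a policy network with the shape described above (512-256-128 fully connected layers, a single LSTM layer before the output, and a tanh-squashed Gaussian head using the re-parameterization trick) is given below. The class and attribute names are ours; this is an illustration, not the repository code.

```python
# Sketch of the SAC-LSTM policy network shape described above (PyTorch).
# Module and attribute names are illustrative, not the repository's code.
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    def __init__(self, state_dim=29, action_dim=2, hidden=(512, 256, 128)):
        # action_dim=2 matches the continuous action space in Section III-E.
        super().__init__()
        layers, last = [], state_dim
        for h in hidden:                      # 512 -> 256 -> 128 with ReLU
            layers += [nn.Linear(last, h), nn.ReLU()]
            last = h
        self.body = nn.Sequential(*layers)
        self.lstm = nn.LSTM(last, last, batch_first=True)  # single LSTM layer
        self.mu = nn.Linear(last, action_dim)       # Gaussian mean
        self.log_std = nn.Linear(last, action_dim)  # Gaussian log-std

    def forward(self, states, hidden_state=None):
        # states: (batch, seq_len, state_dim)
        feats = self.body(states)
        feats, hidden_state = self.lstm(feats, hidden_state)
        mean = self.mu(feats)
        std = self.log_std(feats).clamp(-20, 2).exp()
        dist = torch.distributions.Normal(mean, std)
        action = torch.tanh(dist.rsample())  # re-parameterization trick + TanH squash
        return action, hidden_state
```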
We have implemented a custom replay buffer for the LSTM. Since the LSTM needs sequential samples, a standard experience replay is inappropriate. Thus, we buffer whole episodes sequentially and, at training time, select episodes randomly according to the batch size. We then randomly pick an index inside each episode and train on the following 16 samples.

We used automatic entropy tuning based on log probabilities. Before the LSTM, we also deployed an NSTACK mechanism, in which the 4 most recent states are stacked and used as the input. When we noticed that the LSTM outperforms the NSTACK approach, we abandoned it and continued with the LSTM.
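The episode-wise sampling described above can be sketched as follows, assuming each stored episode is a list of transition tuples; the data layout is an illustrative assumption, not the repository's buffer class.

```python
# Sketch of episode-wise sampling for the LSTM buffer described above.
# Each stored episode is assumed to be a list of
# (state, action, reward, next_state, done) tuples.
import random

def sample_sequences(episode_buffer, batch_size=32, step_size=16):
    batch = []
    episodes = random.choices(episode_buffer, k=batch_size)  # pick episodes at random
    for episode in episodes:
        if len(episode) < step_size:
            continue                                         # skip episodes that are too short
        start = random.randint(0, len(episode) - step_size)  # random window start
        batch.append(episode[start:start + step_size])       # 16 consecutive transitions
    return batch
```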
2) Rainbow DQN: The DQN network consists of three fully connected layers of 128 units each, as seen in Figure 4. The activation functions of the hidden layers are ReLU, as in the SAC architecture, but the output is a 51-atom distribution over the Q values, namely C51. As stated before, we use NoisyNet for exploration instead of the epsilon-greedy mechanism.

Figure 4: DQN Architecture - three 128-unit fully connected layers followed by C51 value distributions.

The hyperparameters for Rainbow DQN are listed below. These parameters were selected via trial and error using a simple grid search.

• N-Step: 3
• Gamma: 0.99
• Tau: 10^-3
• N-Step Weight Parameter: 1
• N-Step Q Regularization Parameter: 10^-7
• Buffer Size: 10^5
• Batch Size: 32
• Learning Rate: 10^-4
• Adam Epsilon: 10^-8
• Adam Weight Decay: 10^-7
• PER Alpha: 0.6
• PER Beta: 0.4
• PER Epsilon: 10^-6
• Gradient Clip: 10
• Prefill Buffer Size: 10^4
• C51 - V Minimum: -300
• C51 - V Maximum: 300
• C51 - Atom Size: 1530
• NoisyNet Initial Variance: 0.5

B. Reward Shaping

Like many TORCS AI developers, we noticed fast left-right maneuvers (slaloming) on straight track segments. We tried multiple reward functions to stabilize the car. In addition, when the agent steers off the track and turns backward, the environment resets; in this situation the agent does not try to recover from that state. We also tried to address this.

1) Reward Functions: The parameters used in the reward functions are defined as:

• Vx: longitudinal velocity
• Vy: lateral velocity
• θ: angle between the car and the track axis
• trackpos: distance between the center of the road and the car

The reward functions we tried are formulated below; a code sketch of the one we finally used follows the list.

• No Trackpos: The track position is ignored.
  Vx·cos θ - |Vx·sin θ|

• Trackpos: The track position is taken into consideration.
  Vx·cos θ - |Vx·sin θ| - |Vx·trackpos|

• EndToEnd [14]: The car's angle is not used as a penalty.
  Vx·(cos θ - |trackpos|)

• DeepRLTorcs [15]: The track position penalty is discounted with the car's angle, and the car's angle is used as a penalty. Additionally, the lateral velocity is used as a penalty, discounted towards the car's angle.
  Vx·cos θ - |Vx·sin θ| - |2·Vx·sin θ·trackpos| - Vy·cos θ

• Sigmoid: Same as the previous reward function; the only difference is that rewards are flattened near the track center to overcome slaloming.
  Vx·sigmoid(3·cos θ) - Vx·sin θ - Vy·sigmoid(3·cos θ)
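The sketch below implements the DeepRLTorcs-style reward exactly as written in the bullet above; the variable names are ours and the function is illustrative rather than the repository's exact code.

```python
# DeepRLTorcs-style reward, transcribed from the formula above (illustrative).
import math

def deeprl_torcs_reward(vx, vy, theta, trackpos):
    # vx, vy: longitudinal / lateral speed; theta: angle to the track axis;
    # trackpos: normalized offset from the track center.
    return (vx * math.cos(theta)
            - abs(vx * math.sin(theta))                   # penalize heading error
            - abs(2.0 * vx * math.sin(theta) * trackpos)  # off-center penalty, angle-discounted
            - vy * math.cos(theta))                       # penalize lateral (sliding) velocity
```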
Our agent uses the "DeepRLTorcs" reward function formulated above. We did not see any improvement with the "Sigmoid" function, but it looks promising for overcoming slaloming on straight track segments, since it soft-clips the cosine term applied to the longitudinal velocity. Reward shaping with plateaus in the center of the track might overcome slaloming in the future.

2) Termination: The active episode is terminated if no progress is achieved within 100 timesteps, where progress is defined as reaching 5 km/h. Similarly, the agent is given an additional 100 timesteps to recover from turning backward. This way we want to see the agent try to get back on track after spins.

C. Exploration

Exploration in this environment is done by maximizing entropy in SAC and by NoisyNets in the DQN algorithm. However, learning to utilize the brake is a challenge, since using the brake action decreases the reward, so the agent avoids using the brake action altogether. We employed the Try-Brake mechanism to overcome this problem.

1) Try-Brake: The Try-Brake mechanism is similar to Stochastic Braking [15]. After a certain number of timesteps, the agent is forced to use the brake 10% of the time, again for a certain amount of time. This way we hope the agent will learn to speed up on straight segments and brake before and during turns. These forced trials are scheduled according to a Gaussian-shaped curve, as seen in Figure 5.

Figure 5: Try-Brake Distribution - forced brake percentage (0 to 0.10) over training timesteps (0 to 200000).
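One possible implementation of such a schedule is sketched below: a Gaussian-shaped curve over the training timesteps gives the probability of forcing the brake, peaking at 10%. The center and width of the curve are illustrative guesses roughly consistent with Figure 5, not values taken from our configuration.

```python
# Gaussian-shaped Try-Brake schedule (sketch). Only the 10% peak comes from
# the text; the center and width below are illustrative guesses.
import math
import random

def forced_brake(timestep, peak=0.10, center=100_000, width=35_000):
    # Probability of overriding the agent's brake action at this timestep.
    p = peak * math.exp(-0.5 * ((timestep - center) / width) ** 2)
    return random.random() < p

# Usage sketch: if forced_brake(t): override the policy's brake output with 1.0
```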
D. Generalization

We added nearly every road track to the training and test set in order to generalize the agent's behavior to unseen tracks. This way, we try to prevent the agent from overfitting and memorizing the tracks. We avoided the Spring track since it is very long.

The tracks used for training and testing are listed in the first column of Table IV. Agents are trained on these tracks in a circular fashion: each track is trained for 5 episodes, then training skips to the next track.

E. Action Spaces

We prepared tailored action spaces to make learning easier and faster for the agent. Below are two of the environment variants we tried. State and action values are normalized between -1 and +1.

1) Continuous Action Space: In this environment, we reduced the action size to 2. The first action value is used for both accelerating and braking; since the agent should not use them together, a single value is sufficient for both. Values smaller than zero are used for braking and greater values are used for accelerating. The second action value is used for steering. We use this environment for the SAC algorithm.
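The mapping from this 2-dimensional action to TORCS controls can be sketched as follows; the output field names are illustrative placeholders.

```python
# Map the 2-dimensional continuous action in [-1, 1] to TORCS controls (sketch).
import numpy as np

def to_torcs_controls(action):
    accel_brake, steer = np.clip(action, -1.0, 1.0)
    return {
        "accel": max(accel_brake, 0.0),   # positive part -> throttle
        "brake": max(-accel_brake, 0.0),  # negative part -> brake
        "steer": float(steer),            # steering stays in [-1, 1]
    }
```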
2) Discretized Action Space: Since our DQN algorithm requires discrete actions, we discretized the action space into 21 actions: 7 steering points in each of 3 intervals. The first interval is accelerating and steering, the second is steering only, and the last is braking and steering. The actions are listed in Table I; a short construction sketch follows the table.

Acceleration  Brake  Steer
+1            -1     {-1, -0.66, -0.33, 0, 0.33, 0.66, 1}
-1            +1     {-1, -0.66, -0.33, 0, 0.33, 0.66, 1}
-1            -1     {-1, -0.66, -0.33, 0, 0.33, 0.66, 1}

Table I: Discretized Actions
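Building the 21-action set is a small cross product, as sketched below; the tuple layout is ours.

```python
# Build the 21 discretized (acceleration, brake, steer) actions from Table I.
STEER_POINTS = [-1, -0.66, -0.33, 0, 0.33, 0.66, 1]
INTERVALS = [(+1, -1), (-1, +1), (-1, -1)]  # accelerate / brake / coast

DISCRETE_ACTIONS = [(accel, brake, steer)
                    for accel, brake in INTERVALS
                    for steer in STEER_POINTS]
assert len(DISCRETE_ACTIONS) == 21
```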

IV. RESULTS

We have achieved significant success with both approaches. The SAC-LSTM agent was more data-efficient and more performant in terms of track consistency and speed. Episode versus score can be seen in Figure 6. Scores, i.e. the accumulated rewards, are aggregated over 250 episodes for brevity.

Figure 6: Episode vs Score. Maximum and standard deviation can be seen in Table II.

Algorithm   Standard Deviation  Maximum
DQN         306.25              1255.94
SAC-LSTM    433.22              1395.0

Table II: Standard deviation and maximum values of scores

Further analysis is given for the SAC-LSTM agent. Since this study is designed around a car race, speed is the crucial factor. The performance of the agent in terms of speed versus episode can be seen in Figure 7. Values are aggregated over 250 episodes.

Figure 7: Episode vs Speeds. Maximum and standard deviation can be seen in Table III.

Type       Standard Deviation  Maximum
Max Speed  24.94               216.42
Avg Speed  28.52               164.59

Table III: Standard deviation and maximum values of speeds

As discussed before, there are multiple tracks in the environment with different difficulties. Table IV shows the results of the SAC-LSTM agent on each track.

Track        Max Score  Avg Score  Max Speed  Avg Speed
forza        1395.0     742.95     216.0      143.27
g-track-1    954.0      611.36     200.36     131.56
g-track-2    1182.0     912.06     205.65     149.32
g-track-3    1154.0     574.16     182.0      109.11
ole-road-1   1103.0     209.70     216.42     121.65
ruudskogen   1294.0     575.58     199.78     117.08
street-1     1208.0     597.32     198.0      117.44
wheel-1      1248.0     758.45     206.05     139.16
wheel-2      1152.0     636.49     216.19     128.87
aalborg      922.0      183.84     189.16     80.54
alpine-1     1264.0     822.09     203.18     118.94
alpine-2     1119.0     677.13     194.43     103.02
e-track-1    1080.0     325.40     207.54     119.02
e-track-2    1212.0     907.00     179.65     104.55
e-track-4    1293.0     881.47     214.73     152.13
e-track-6    1201.0     583.34     213.59     130.12
eroad        1264.0     883.87     200.59     131.87
e-track-3    1383.0     895.41     211.87     137.35

Table IV: Scores of the SAC-LSTM agent on different tracks

The final agents were trained for around 6·10^6 timesteps and 5·10^3 episodes on hardware with an Intel i9-9900K CPU and a GeForce RTX 2060 GPU.

V. CONCLUSION

We have implemented two different algorithms for TORCS with great success. The agents complete tracks most of the time at around 140 km/h average speed and around 190 km/h maximum speed. It is observed that the agents generalize well to unseen tracks. Before the Try-Brake implementation, the agent did not use the brake action at all, in order to avoid low rewards; this was fixed by the Try-Brake mechanism.

Another problem we faced was the fast left-right maneuvering, also referred to as slaloming, which was partly solved via reward shaping and the LSTM. We noticed that this problem is an issue of the reward function, so better reward shaping might overcome it in the near future. Since these maneuvers happen frequently at high speeds, use cases other than racing might not face this problem. A race of the SAC-LSTM agent against scripted bots, and races between the Rainbow DQN and SAC-LSTM agents recorded from our trained agent's perspective, can be viewed at https://youtu.be/f82EBvPKyDI.

We argue that the reasons SAC and SAC-LSTM performed better than the Rainbow DQN algorithm are the exploration method and the continuous action space. SAC tries to maximize entropy, which allows the agent to explore uncertain regions of the action space. Additionally, since SAC's policy network outputs continuous actions, braking, accelerating and steering can be controlled continuously, unlike Rainbow DQN's 21 discretized actions. This difference might have helped obtain stability on the road.
This study is a step towards using Deep Reinforcement Learning practices for self-driving cars. It can be seen that these methods are capable of learning to drive without supervision. Furthermore, these agents are easily transferable to real-world robotics platforms. Together with other machine learning practices, Deep Reinforcement Learning methods are expected to be used actively in the autonomous car industry.

ACKNOWLEDGEMENTS

This work was part of the term project for the BLG604E Deep Reinforcement Learning course at ITU. The project consisted of an autonomous car race with the other participants from the class; our agent was the fastest and won first place.

We thank Can Erhan for his contributions to the code base and the implementations. We also want to thank Onur Karadeli for his points regarding the reward functions.

REFERENCES

[1] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, p. 436, 2015.
[2] B. Wymann, E. Espié, C. Guionneau, C. Dimitrakakis, R. Coulom, and A. Sumner, "TORCS, The Open Racing Car Simulator," software available at http://torcs.sourceforge.net, vol. 4, no. 6, 2000.
[3] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 2018.
[4] R. Bellman, "Dynamic programming," Science, vol. 153, no. 3731, pp. 34–37, 1966.
[5] T. Haarnoja, A. Zhou, K. Hartikainen, G. Tucker, S. Ha, J. Tan, V. Kumar, H. Zhu, A. Gupta, P. Abbeel et al., "Soft actor-critic algorithms and applications," arXiv preprint arXiv:1812.05905, 2018.
[6] H. Van Hasselt, A. Guez, and D. Silver, "Deep reinforcement learning with double Q-learning," in Thirtieth AAAI Conference on Artificial Intelligence, 2016.
[7] B. T. Polyak and A. B. Juditsky, "Acceleration of stochastic approximation by averaging," SIAM Journal on Control and Optimization, vol. 30, no. 4, pp. 838–855, 1992.
[8] M. Hessel, J. Modayil, H. Van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan, B. Piot, M. Azar, and D. Silver, "Rainbow: Combining improvements in deep reinforcement learning," in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[9] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, p. 529, 2015.
[10] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, "Prioritized experience replay," arXiv preprint arXiv:1511.05952, 2015.
[11] Z. Wang, T. Schaul, M. Hessel, H. Van Hasselt, M. Lanctot, and N. De Freitas, "Dueling network architectures for deep reinforcement learning," arXiv preprint arXiv:1511.06581, 2015.
[12] M. Fortunato, M. G. Azar, B. Piot, J. Menick, I. Osband, A. Graves, V. Mnih, R. Munos, D. Hassabis, O. Pietquin et al., "Noisy networks for exploration," arXiv preprint arXiv:1706.10295, 2017.
[13] M. G. Bellemare, W. Dabney, and R. Munos, "A distributional perspective on reinforcement learning," in Proceedings of the 34th International Conference on Machine Learning - Volume 70. JMLR.org, 2017, pp. 449–458.
[14] M. Jaritz, R. De Charette, M. Toromanoff, E. Perot, and F. Nashashibi, "End-to-end race driving with deep reinforcement learning," in 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018, pp. 2070–2075.
[15] B. Renukuntla, S. Sharma, S. Gadiyaram, V. Elango, and V. Sakaray, "The road to be taken, a deep reinforcement learning approach towards autonomous navigation," https://github.com/charlespwd/project-title, 2017.
