Comparing DRL Architectures
Keywords: Deep reinforcement learning; End-to-end driving; Autonomous racing; Trajectory planning

Abstract: In classical autonomous racing, a perception, planning, and control pipeline is employed to navigate vehicles around a track as quickly as possible. In contrast, neural network controllers have been used to replace either part of or the entire pipeline. This paper compares three deep learning architectures for F1Tenth autonomous racing: full planning, which replaces the global and local planner; trajectory tracking, which replaces the local planner; and end-to-end, which replaces the entire pipeline. The evaluation contrasts two reward signals,
compares the DDPG, TD3 and SAC algorithms and investigates the generality of the learned policies to different
test maps. Training the agents in simulation shows that the full planning agent has the most robust training
and testing performance. The trajectory tracking agents achieve fast lap times on the training map but low
completion rates on different test maps. Transferring the trained agents to a physical F1Tenth car reveals
that the trajectory tracking and full planning agents transfer poorly, displaying rapid side-to-side swerving
(slaloming). In contrast, the end-to-end agent, the worst performer in simulation, transfers the best to the
physical vehicle and can complete the test track with a maximum speed of 5 m/s. These results show that
planning methods outperform end-to-end approaches in simulation performance, but end-to-end approaches
transfer better to physical robots.
1. Introduction
Liu, 2020). While planning techniques have demonstrated good performance (Wurman et al., 2022), they are still limited by requiring localisation and thus a mapped environment.
End-to-end methods map raw sensor data (LiDAR scans or camera images) directly to control actions (Cai, Wang, Huang, Liu, & Liu, 2021; Hamilton, Musau, Lopez, & Johnson, 2022). Unlike classical approaches, end-to-end methods can plan control actions directly from raw sensor data. They have the advantage of not requiring an explicit vehicle model or real-time processing, and the solutions are flexible to unmapped tracks not seen during training (Bosello, Tse, & Pau, 2022). The limitations of end-to-end approaches are performance, with many solutions limited to low, constant speeds (Evans et al., 2022; Hamilton et al., 2022), and safety, with agents achieving low lap completion rates (Brunnbauer et al., 2022).
This paper contributes an extensive examination of full planning, trajectory tracking and end-to-end DRL architectures for autonomous racing through:
1. Comparing the reward, lap progress, crash rate and lap time during training DRL agents for F1Tenth racing.
2. Analysing the trained agents' lap times, success rates, speed profiles, and racing line deviations.
3. Evaluating the simulation-to-reality transfer by contrasting behaviour in simulation and on a physical vehicle.

2. Literature study

We provide an overview of classical approaches to autonomous racing that use vehicle models and optimisation to plan and follow a trajectory. Learning techniques for autonomous racing are studied in the categories of planning methods that require the vehicle's location and end-to-end methods that replace the entire pipeline with a neural network. Table 1 provides a summary of the classical, learned planning and end-to-end learning methods referenced.

2.1. Classical racing

The classical racing approach, shown in Fig. 2, calculates an optimal trajectory and then uses a trajectory following algorithm to track it (Betz et al., 2022). During the race, the localisation module uses the LiDAR scan to calculate the vehicle's pose. The path-follower uses the vehicle's pose and optimal trajectory waypoints to calculate the speed and steering angles that control the vehicle.
Fig. 2. The classical racing pipeline using a trajectory generator offline and a localisation and trajectory following module online.
Trajectory optimisation techniques calculate a set of waypoints (positions with a speed reference) on a track that, when followed, lead the vehicle to complete a lap in the shortest time possible (Christ, Wischnewski, Heilmeier, & Lohmann, 2021). A common approach generates a minimum curvature path and then a minimum time speed profile (Heilmeier et al., 2020).
Localisation approaches for autonomous racing depend on the sensors and computation available. Full-sized racing cars often fuse GNSS and other sensors (Wischnewski et al., 2022), and scaled cars use a LiDAR as input to a particle filter (O'Kelly et al., 2020; Walsh & Karaman, 2018; Wang, Han, & Vaidya, 2021). Localisation methods enable classical planning since the vehicle can determine its location on a map, but are limited by requiring a map of the race track and, thus, are inflexible to unmapped tracks.
Model-predictive controllers (MPC), which calculate receding horizon optimal control commands (Tătulea-Codrean, Mariani, & Engell, 2020; Wang et al., 2021), and pure pursuit path-followers, which geometrically track a path (Becker et al., 2023; O'Kelly, Zheng, Jain, et al., 2020), have been used for trajectory tracking. These methods transfer well to physical vehicles, with a novel pure pursuit algorithm controlling a vehicle at over 8 m/s at the limits of the non-linear tyre dynamics. These classical methods produce high-performance results, accurately tracking the optimal trajectory, but are limited by requiring the vehicle's location on the map.

2.2. Planning architectures

Planning architectures can be split into full planning approaches that replace the global (offline) and local (online) planner with a neural network, and trajectory tracking methods that replace only the local planner.

2.2.1. Full planning

Full planning methods that use sensor data and track information have demonstrated high performance while maintaining flexibility to different tracks (Fuchs et al., 2021). In Gran Turismo Sport, a hybrid state vector with the vehicle's current speed and acceleration, the curvature of upcoming waypoints and 66 range finder measurements was used (Fuchs et al., 2021). Similarly, Wurman et al. (2022) showed that using a complex state vector leads to outracing world champions. Comparable approaches combining state variables and LiDAR scans or other upcoming track information for autonomous racing have been evaluated in simulated F1Tenth racing (Tătulea-Codrean et al., 2020) and full-size vehicle simulators (Remonda, Krebs, Veas, Luzhnica, & Kern, 2021). While these approaches have demonstrated exceptional performance in racing games and simulators, their performance for physical robots is unknown.

2.2.2. Trajectory tracking

Deep reinforcement learning agents have been combined with an offline trajectory optimiser for high-performance control in autonomous racing (Dwivedi, Betz, Sauerbeck, Manivannan, & Lienkamp, 2022; Ghignone, Baumann, & Magno, 2023). Ghignone et al. (2023) use a state vector with the vehicle's velocity vector, heading angle and a list of upcoming trajectory points. Their results in simulated F1Tenth racing demonstrated improved computational time, better track generalisation and more robustness to modelling mismatch.
DRL agents have demonstrated exceptional performance in autonomous drifting, where agents can track a precomputed trajectory at the non-linear tyre limits (Cai et al., 2020; Orgován, Bécsi, & Aradi, 2021). Cai et al. (2020) used a DRL agent with a state consisting of the upcoming waypoints (including the planned slip angle) and the current vehicle state to control a vehicle in the Speed Dreams racing simulator. Their results showed that the DRL agent could control the vehicle while drifting through corners in many complex environments and generalise to different vehicles.
While full planning and trajectory tracking approaches have demonstrated promising results in simulation, their performance on physical vehicles is unknown due to a lack of investigation into the simulation-to-reality transfer (see Table 1).
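As context for the pure pursuit path-followers discussed in Section 2.1, the sketch below shows the standard geometric steering law (Coulter, 1992). It is an illustration only, not the controller used in this paper; the wheelbase value and the assumption that a lookahead point on the reference path has already been selected are ours.

```python
import numpy as np

def pure_pursuit_steering(position, heading, lookahead_point, wheelbase=0.33):
    """Geometric pure pursuit: steer along the arc that passes through a lookahead
    point on the reference path.

    position, lookahead_point: (x, y) in map coordinates; heading in radians.
    wheelbase: front-to-rear axle distance (0.33 m is an assumed F1Tenth-scale value).
    """
    dx = lookahead_point[0] - position[0]
    dy = lookahead_point[1] - position[1]
    # Transform the lookahead point into the vehicle frame.
    local_x = np.cos(heading) * dx + np.sin(heading) * dy
    local_y = -np.sin(heading) * dx + np.cos(heading) * dy
    lookahead_dist = np.hypot(local_x, local_y)
    # Curvature of the arc through the lookahead point, then the bicycle-model steering angle.
    curvature = 2.0 * local_y / lookahead_dist**2
    return np.arctan(wheelbase * curvature)
```

The speed reference attached to each waypoint of the optimal trajectory would be tracked separately by a longitudinal controller.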
Table 1
Classical, planning and end-to-end approaches to autonomous racing with results in simulation and physical vehicles.

Category | Method | Platform | Simulation result | Physical result
Classical | Learning MPC (Wang et al., 2021) | F1Tenth | Optimal racing up to 7 m/s in simulation and reality
Classical | Model- and acceleration-based pure pursuit (Becker et al., 2023) | F1Tenth | Racing at the non-linear tyre limits with speeds over 8 m/s in simulation and reality
Planning | Full planning for racing games (Fuchs et al., 2021; Wurman et al., 2022) | Gran Turismo Sport | Outperforms world champion gamers | No physical tests
Planning | Planning for racing (Tătulea-Codrean et al., 2020; Vianna, Goubault, & Putot, 2021) | F1Tenth | 5 m/s (Tătulea-Codrean et al., 2020), 3 m/s (Vianna et al., 2021) | No physical tests
Planning | Trajectory tracking for racing (Dwivedi et al., 2022; Ghignone et al., 2023) | F1Tenth | Outperforms path tracking | No physical tests
Planning | Trajectory tracking for drifting (Cai et al., 2020; Orgován et al., 2021) | Speed Dreams & Carla | Able to control the vehicle in non-linear environments | No physical tests
End-to-end | Driving with LiDAR (Evans et al., 2022; Hamilton et al., 2022; Ivanov et al., 2020) | F1Tenth | Constant speeds of 1 m/s (Hamilton et al., 2022), 2 m/s (Evans et al., 2022) and 2.4 m/s (Ivanov et al., 2020) in simulation and reality
End-to-end | Racing with LiDAR (Bosello et al., 2022; Brunnbauer et al., 2022) | F1Tenth | Speeds ranging up to 5 m/s | Completes test lap
3. Racing architectures

Fig. 4. The deep learning formulation uses a neural network to select actions.
Fig. 5. The full planning architecture uses a state vector consisting of beams from the LiDAR scan, a set of upcoming centreline points, and the linear speed and steering angle.
Fig. 7. The end-to-end architecture uses a state of the previous and current LiDAR scan and the vehicle's speed.
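The captions of Figs. 5 and 7 summarise the observation used by each architecture. Purely as an illustration of how the three state vectors differ, the sketch below assembles them with NumPy; the array sizes and helper names are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def end_to_end_state(prev_scan, scan, speed):
    """End-to-end (Fig. 7): previous and current LiDAR beams plus the vehicle speed."""
    return np.concatenate([prev_scan, scan, [speed]])

def full_planning_state(scan, centreline_points, speed, steering_angle):
    """Full planning (Fig. 5): LiDAR beams, upcoming centreline points (x, y pairs),
    and the linear speed and steering angle."""
    return np.concatenate([scan, np.ravel(centreline_points), [speed, steering_angle]])

def trajectory_tracking_state(trajectory_points, speed, steering_angle):
    """Trajectory tracking: upcoming optimal-trajectory points and the vehicle's state variables."""
    return np.concatenate([np.ravel(trajectory_points), [speed, steering_angle]])

# Example with assumed sizes: 20 beams and 10 upcoming points.
state = full_planning_state(np.ones(20), np.zeros((10, 2)), speed=3.0, steering_angle=0.1)
```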
Fig. 10. The cross-track and heading error reward uses the vehicle speed v_t, heading error ψ and cross-track distance d_c.
Fig. 11. The trajectory-aided learning reward penalises the difference between the agent's actions and those selected by a classical planner following the optimal trajectory.
Fig. 12. The training and testing losses from tuning the number of neurons used per layer in the neural network.

4.2.2. Reward signals

After each timestep, a reward is calculated and given to the agent.
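The two reward signals are described by the captions of Figs. 10 and 11. As an illustration only, the sketch below shows one plausible form of a cross-track and heading (CTH) reward and a trajectory-aided learning (TAL) reward; the weighting constants and normalisation are assumptions rather than the paper's exact formulation (the TAL reward is detailed in Evans, Engelbrecht, & Jordaan, 2023).

```python
import numpy as np

def cth_reward(speed, heading_error, cross_track_dist, v_max=8.0, w_ct=0.5):
    """Cross-track and heading (CTH) style reward (Fig. 10): reward speed along the
    track direction and penalise lateral offset. v_max and w_ct are assumed values."""
    return (speed / v_max) * np.cos(heading_error) - w_ct * abs(cross_track_dist)

def tal_reward(agent_action, planner_action, base=1.0):
    """Trajectory-aided learning (TAL) style reward (Fig. 11): penalise the difference
    between the agent's action and the action of a classical planner following the
    optimal trajectory. Actions are assumed to be normalised [steering, speed] vectors."""
    diff = np.abs(np.asarray(agent_action) - np.asarray(planner_action))
    return base - float(np.sum(diff))
```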
Fig. 13. The losses from tuning the number of beams and the number of waypoints
used in the state vectors.
average track progress made by the agent. The average track progress is a moving average of the progress made by the vehicle for each episode before the car crashes or completes a lap. In the reward signal study, the episode reward is also used to explain the learned behaviour.
Fig. 14 shows the average progress and episode reward during training from each of the reward signals. The cross-track and heading (CTH) average progress demonstrates a significant difference between the architectures, with the full planning architecture achieving around 90%, the trajectory tracking agent 60% and the end-to-end agent 50%. The TAL reward trains all the agents to achieve higher average progress, with the full planning and trajectory tracking agents achieving near 100% and the end-to-end agent around 90%.
The rewards earned by the CTH agents follow a similar pattern to the average progress, with the full planning agent earning the most reward, followed by the trajectory tracking agent and, finally, the end-to-end agent. The TAL agents show an interesting pattern of the trajectory tracking agent achieving a higher reward than the full planning agent from around 15k steps. This suggests that the trajectory tracking agent selects actions that are more similar to the classical planner (following the optimal trajectory) than the other agents.
Fig. 15 shows the average progress, completion rate and lap times achieved by agents trained with the cross-track and heading (CTH) and trajectory-aided learning (TAL) reward signals. The agents were trained and tested on the GBR map. The average progress values correspond well to the training graph, with the CTH agents achieving around 85%, 60% and 50%, respectively. The trajectory-tracking and end-to-end agents trained with the CTH reward have low completion rates of around 8% and 25%, indicating that very few laps are completed. The CTH full planning agent achieves a significantly higher completion rate, indicating that the full planning agent is more robust to different rewards than the other architectures. While the CTH agents have critically low lap completion rates, they have fast lap times, outperforming the TAL agents. This is further investigated by plotting the distribution of slip angles for the agents trained with the two rewards. The slip angle is the difference between the vehicle's velocity (direction of motion) and orientation (which way the front of the vehicle faces).
Fig. 16 shows the slip angles from test laps of agents trained with the cross-track and heading (CTH) and trajectory-aided learning (TAL)
reward signals on the GBR map. The agents trained with the cross-track and heading error reward have a higher number of large slip angles than the TAL agents. The CTH agents have many slip angles above 0.2 radians, while most of the TAL agents do not have any slip angles above 0.2. The high slip angles resulting from the CTH reward signal indicate that the agent learns to exploit the simulator by causing the vehicle to drift around parts of the track. This drifting behaviour is unstable and not repeatable.
We consider the lateral and speed deviation from the optimal trajectory for the agents trained with each reward signal. The lateral deviation is the perpendicular distance between the optimal trajectory and the vehicle, and the speed deviation is the absolute value of the difference between the vehicle's speed and the speed at the optimal trajectory point. Fig. 17 shows the lateral and speed deviation for agents trained with the cross-track and heading (CTH) and trajectory-aided learning (TAL) reward signals tested on the GBR map. The general pattern is that the TAL agents have lower lateral and speed deviations than the CTH agents. Across both reward signals, the trajectory-tracking agent has the lowest lateral deviation of the three DRL agents. This shows that giving trajectory waypoints to the planner leads to the agent selecting a path more similar to the optimal trajectory. In contrast, the TAL end-to-end agent has the largest deviations, indicating that giving only LiDAR scans leads to behaviour that is dissimilar to the optimal trajectory. While the TAL reward improves on the cross-track and heading reward, none of the agents can outperform the classical trajectory following method.
Fig. 17. The lateral and speed deviation for agents trained with the cross-track and heading (CTH) and trajectory-aided learning (TAL) reward signals.
The study on reward signals shows that the trajectory-aided learning reward can train DRL agents for autonomous racing to achieve higher average progress during training and higher lap completion rates after training. While the full planning architecture achieved around a 75% completion rate with the cross-track and heading reward, the other two architectures achieved below 50% lap completion. This indicates that the full planning architecture is the most robust to different rewards, followed by the end-to-end architecture, and the trajectory tracking architecture is very sensitive to reward. Studying the lateral and speed deviations showed that the trajectory tracking architecture, trained with the TAL reward, has the most similar behaviour to the classical planner following the optimal trajectory. This indicates that while sensitive, the trajectory tracking architecture shows the best ability to replicate a specific behaviour.

5.2. Algorithm comparison

We use the DDPG, SAC and TD3 algorithms to train three agents of each architecture for autonomous racing. These tests use the TAL reward and the GBR map. We consider the differences during training and the differences in the behaviour of the trained agents.
Fig. 18 shows the average progress during training from using each algorithm on the GBR map. All the agents trained with the DDPG algorithm perform poorly, with averages remaining around 50%. In contrast, the SAC and TD3 algorithms both train the agents well, with the average progresses reaching over 90%. The TD3 algorithm appears to train the agents slightly faster and to higher average progress. With all three algorithms, the full planning agent's progress rises the fastest, quickly reaching high values. The end-to-end agent trained with the SAC algorithm takes around 50k steps to reach over 90% average progress. The trajectory tracking agent generally remains between the full planning and end-to-end agents. This indicates that the full planning agent is the most robust to different training algorithms, always training quickly, and the end-to-end agent requires the most training data to converge, training more slowly. Since only the SAC and TD3 algorithms can reliably train the agents to complete laps, we study their training behaviour in more detail.
Fig. 19 shows the lap times during training of the agents trained with the SAC and TD3 algorithms. When training with SAC, all the architectures show a pattern of starting with slower lap times and, as the training progresses, achieving faster lap times. In contrast, the TD3 algorithm shows little change in lap times throughout training. It is proposed that this is due to the maximum entropy formulation of the SAC algorithm in contrast to the deterministic nature of the TD3 algorithm. The trajectory-tracking architecture has the least variance between lap times, and the end-to-end architecture has the most. This is possibly due to large differences between LiDAR scans compared to upcoming trajectory points.
Fig. 20 shows the crash frequency per algorithm for each architecture. The agents trained with the SAC algorithm crash for a longer period of time during training, indicating that the algorithm is trying all possible actions before converging. The TD3 agents crash less than the SAC agents. This pattern is similar to the lap times presented in Fig. 19, indicating that the SAC algorithm fails more before learning, while the TD3 algorithm learns faster by exploiting its current knowledge. In comparing the architectures, the full planning architecture crashes the least during training, completing almost all laps after 40k steps. In contrast, the end-to-end agent has a high number of crashes during training, still occasionally crashing at the end of training.
Fig. 21 shows the average progress and lap times from training agents with the DDPG, SAC and TD3 algorithms. The DDPG agents perform poorly, with all the agents having some repetitions below 70% completion. Of the agents trained with the DDPG algorithm, the full planning architecture performs the best, with two of the repetitions reaching near 100% completion rate. The trajectory tracking average is similar, with a large spread between the three repetitions. The end-to-end architecture performs the worst, with all three repetitions below 75%. The inconsistency in the results highlights the shortcoming of the DDPG algorithm of being brittle to random seeding.
The poor results achieved by the DDPG algorithm show that while it has the potential to train agents, the performance is often poor and lacks reliability. In contrast, the TD3 algorithm achieves a near 100% completion rate for all of the architectures. The SAC algorithm is slightly behind with around 90%, and the results are more spread out. The agents trained with the TD3 algorithm all have near 100% completion rates and similar lap times of around 40 s. The agents trained with the SAC algorithm vary more between repetitions, with the trajectory tracking agent's completion rate ranging from 80%–100% and the end-to-end agent's lap times ranging from 40–60 s. This indicates that the TD3 algorithm produces more repeatable behaviour with less dependence on the random seeds.
The differences in lap times, specifically for the end-to-end planner, are investigated by plotting the frequency of the vehicle speeds for agents trained with each algorithm. Fig. 22 shows the frequency of the vehicle speeds. While the full planning and trajectory tracking agents have similar performance, there is a notable difference in the end-to-end agent. The end-to-end agent trained with the SAC algorithm never
Fig. 18. Average progress during training agents on the GBR map with the DDPG, SAC and TD3 algorithms.
Table 2
Lap times in seconds from agents trained on MCO and tested on four test maps, compared to the classical planner.

Planner | AUT | ESP | GBR | MCO
Classic planner | 22.11 ± 0.00 | 47.85 ± 0.00 | 38.95 ± 0.00 | 35.51 ± 0.00
Full planning | 21.52 ± 0.15 | 48.19 ± 2.38 | 42.46 ± 1.18 | 37.85 ± 0.12
Trajectory tracking | 22.28 ± 0.81 | 49.81 ± 0.42 | 39.64 ± 1.58 | 36.08 ± 1.24
End-to-end | 21.90 ± 1.42 | 49.25 ± 2.90 | 43.34 ± 3.21 | 38.14 ± 1.83
selects high speeds, while the one trained with the TD3 algorithm does.
The result of this is slower speed selection. While it is not clear why the
end-to-end agent trained with the SAC algorithm selects lower speeds,
it could be related to the difficulty in racing at high speeds using only
LiDAR scans. This result indicates that the end-to-end architecture is
less robust to changes in algorithm compared to the full planning and
trajectory tracking architectures.
The investigation into algorithms showed that while the DDPG
algorithm produces inconsistent, poor completion rates, the TD3 and
SAC algorithms can train agents to achieve good racing performance.
The TD3 algorithm learns a racing policy without first learning slow
laps, while the SAC algorithm starts out learning slowly and steadily
improves. The TD3 algorithm produces repeatable results, with the three repetitions being closely grouped. The SAC algorithm shows a larger variance between results, indicating a greater dependence on the seed used. The end-to-end architecture is the most sensitive to the training algorithm, completing very few laps with the DDPG and achieving slow lap times with the SAC algorithm.
Fig. 19. Lap times during training agents with the SAC and TD3 algorithms on the GBR map.
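To make the contrast between TD3's deterministic update and SAC's maximum-entropy objective concrete, the PyTorch-style sketch below shows how the two algorithms form their critic targets (Fujimoto, Hoof, & Meger, 2018; Haarnoja et al., 2018). The `policy.sample` interface, the network arguments and the hyperparameter values are assumptions for illustration, not the paper's training code.

```python
import torch

def td3_critic_target(critic1_t, critic2_t, actor_t, reward, next_obs, done,
                      gamma=0.99, noise_std=0.2, noise_clip=0.5, act_limit=1.0):
    """TD3 (deterministic): target action from the target actor plus clipped noise,
    bootstrapped with the minimum of the twin target critics."""
    with torch.no_grad():
        next_action = actor_t(next_obs)
        noise = (torch.randn_like(next_action) * noise_std).clamp(-noise_clip, noise_clip)
        next_action = (next_action + noise).clamp(-act_limit, act_limit)
        q_next = torch.min(critic1_t(next_obs, next_action),
                           critic2_t(next_obs, next_action))
        return reward + gamma * (1.0 - done) * q_next

def sac_critic_target(critic1_t, critic2_t, policy, reward, next_obs, done,
                      gamma=0.99, alpha=0.2):
    """SAC (maximum entropy): stochastic next action sampled from the policy,
    with an entropy bonus weighted by the temperature alpha."""
    with torch.no_grad():
        next_action, log_prob = policy.sample(next_obs)  # assumed interface
        q_next = torch.min(critic1_t(next_obs, next_action),
                           critic2_t(next_obs, next_action))
        return reward + gamma * (1.0 - done) * (q_next - alpha * log_prob)
```

The entropy term encourages SAC to keep exploring, which is consistent with the slower, steadier improvement in lap times observed during training.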
The agents are investigated for their ability to generalise and trans-
fer to maps other than the training maps. Agents of each architecture
are trained using the TAL reward and the TD3 algorithm on the MCO
and GBR maps before being tested on four maps (see Fig. 9). A quanti-
tative comparison of the completion rate and lap time performance is
presented, followed by a qualitative assessment of the performance.
Fig. 20. Crash frequency during training agents with the SAC and TD3 algorithms on the GBR map.

5.3.1. Quantitative comparison

For each agent, 20 test laps are run on the AUT, ESP, GBR and MCO maps. Fig. 23 records the average success rates from each train-test combination.
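A minimal sketch of this evaluation protocol is shown below. The `env` and `agent` interfaces are hypothetical stand-ins, not taken from the paper or the F1Tenth software stack; the sketch only illustrates how the completion rate and lap time statistics over 20 test laps could be gathered.

```python
import numpy as np

def evaluate(agent, env, n_laps=20):
    """Run n_laps test laps and return the completion rate and lap-time statistics."""
    lap_times, completed = [], 0
    for _ in range(n_laps):
        obs, done = env.reset(), False
        while not done:
            action = agent.plan(obs)                # trained policy, no exploration
            obs, _, done, info = env.step(action)   # gym-style step, assumed interface
        if info.get("lap_complete", False):         # assumed flag: finished without crashing
            completed += 1
            lap_times.append(info["lap_time"])      # assumed: simulator reports the lap time
    mean_t = float(np.mean(lap_times)) if lap_times else float("nan")
    std_t = float(np.std(lap_times)) if lap_times else float("nan")
    return completed / n_laps, mean_t, std_t
```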
The results in Fig. 23 show that the full planning agents learn the
most general behaviour, achieving near 100% success on all the maps
considered. The end-to-end agent achieves good performance on the
AUT map and over 80% on the ESP map. The trajectory-tracking agent
trained on the GBR map achieves below 60% and 40% success on
the AUT and ESP maps, respectively. This behaviour indicates that
the trajectory tracking architecture generalises less well to other maps
and is more dependent on the training map used. The graph also
indicates that the agents trained on the MCO map generalise to other
tracks better than agents trained on the GBR track. We further study
the performance by considering the lap times achieved by the agents trained on the MCO map on the different tracks.
Fig. 21. The average progress and lap times from training agents with the DDPG, SAC and TD3 algorithms.
Table 2 shows the lap times for the agents trained on the MCO
map and tested on each of the four test maps. The classic planner
Fig. 22. The frequency of vehicle speeds experienced by agents trained with the SAC and TD3 algorithms.
Fig. 23. Completion rates from testing vehicles trained on GBR and MCO maps on
four tracks.
times are shown for reference and have a standard deviation of 0 since
they do not depend on a random seed. On the training map, MCO, the
trajectory tracking agent achieves the fastest average lap time of 36.08 s, followed by the full planning agent with 37.85 s, and the end-to-end agent with 38.14 s. This indicates that the trajectory tracking agent achieves the best racing performance on the training map, probably due to tracking the optimal trajectory well (see Section 5.1). On the AUT and ESP maps, the full planning agent achieves the fastest times of 21.52 s and 48.19 s, respectively. Despite being the slowest on MCO (the training map), the end-to-end agent outperforms the trajectory tracking agent on both the AUT map by 0.38 s and the ESP map by 0.56 s. This indicates that the full planning and end-to-end agents learn more general, transferable behaviour than the trajectory tracking agent. This suggests that having LiDAR beams in the state vector leads to policies that are robust to different maps.
Fig. 24. Trajectories from full planning, trajectory tracking, end-to-end and classical planners. The colours represent the speeds in m/s.
Fig. 25. Comparison of the speed profile on a section of the MCO track.
Fig. 27. The paths taken by the different agents on the physical vehicle with a speed cap of 2 m/s.
Fig. 28. Distances and mean curvatures from simulated and real experiments.
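Fig. 28 compares the distance travelled and the mean curvature of the simulated and real paths. As an illustration of how such metrics can be computed from a recorded sequence of (x, y) positions, a minimal sketch is given below; the discretisation is an assumption, not the paper's exact procedure.

```python
import numpy as np

def path_metrics(xs, ys):
    """Total distance travelled and mean absolute curvature of a recorded path.

    Curvature is approximated as the change in heading per unit arc length
    between consecutive path segments.
    """
    dx, dy = np.diff(xs), np.diff(ys)
    seg_lengths = np.hypot(dx, dy)
    total_distance = float(seg_lengths.sum())

    headings = np.unwrap(np.arctan2(dy, dx))
    dtheta = np.diff(headings)
    ds = 0.5 * (seg_lengths[:-1] + seg_lengths[1:])   # arc length over which each heading change occurs
    mean_curvature = float(np.mean(np.abs(dtheta) / ds))
    return total_distance, mean_curvature
```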
Fig. 31. The effect of localisation delay on the total distance and mean curvature of
the vehicles.
Fig. 29. Steering angles selected by agents in simulation (top) compared to the real
vehicle (bottom).
Fig. 32. Lap times using increasing speed caps. Missing bars mean that the agent was
unable to complete the lap.
Fig. 33. Speed profiles for the full planning, end-to-end and classic planners in simulation and reality.
any laps due to crashing into the walls. The full planning agent can complete laps up to 4 m/s. The end-to-end and classic agents can complete laps using a speed cap of 5 m/s, indicating that they perform the best at high-speed racing.
Fig. 33 shows the speeds selected by the agents using a speed cap of 4 m/s. On the physical vehicle, the full planning agent selects slower speeds more regularly, possibly due to the increase in extreme steering actions. The end-to-end agent displays a similar speed profile selection in both simulation and reality, with occasional spikes of low speeds.
Fig. 34 shows the fastest trajectories by the trajectory tracking, full planning and end-to-end agents. The trajectory tracking agent's extreme swerving is a key problem limiting the agent's performance. The full planning agent swerves less, enabling success at 4 m/s. The end-to-end agent can smoothly control the vehicle at high speeds, resulting in a fast trajectory with a speed cap of 5 m/s. The end-to-end trajectory demonstrates that the agent can select a feasible speed profile that transfers to high-speed racing on a physical vehicle.
Fig. 34. Fastest trajectories from the trajectory tracking, full planning and end-to-end agents on the physical vehicle.
The sim-to-real investigation studied how the DRL agents trained in simulation transferred to a physical vehicle. In the study on steering, the end-to-end agent had the smallest difference between simulation and reality in terms of the distance travelled and the mean curvature. In contrast, the full planning and, specifically, the trajectory tracking architectures transferred poorly, having high curvature and selecting large steering angles. The study on delay indicated that a key reason for this was that the end-to-end agent did not require localisation while the other architectures did. The study on speed showed that the impact of the high-curvature trajectories was that the trajectory tracking architecture could only complete the track with a maximum speed of 2 m/s, and the full planning agent up to 4 m/s.

7. Conclusion

This paper compared the full planning, trajectory tracking and end-to-end DRL architectures for autonomous racing. End-to-end agents use the raw LiDAR scan and vehicle speed. Full planning agents use the LiDAR scan, upcoming centre line points and the vehicle's state variables. Trajectory tracking agents use upcoming trajectory points and the vehicle's state variables. The simulation results showed that the full planning agent is the most robust to reward signal, training algorithm and test maps. The end-to-end agent generally achieves slower lap times, sometimes caused by not selecting high speeds. The trajectory tracking agent can achieve the fastest lap times on the training map, and tracks the optimal trajectory closely, but is brittle to changes in reward signal and test map.
The policies were tested on a physical vehicle to evaluate the sim-to-real transfer. The study on the steering actions and resulting paths showed that the full planning and trajectory tracking agents swerved excessively. The end-to-end agent had the most similar performance between the simulation and physical vehicle, swerving less, travelling a shorter distance and having a lower curvature. Further investigation showed that the end-to-end agent selected less extreme steering actions compared to the full planning and trajectory tracking agents. The study on speed showed that the trajectory tracking agent could complete laps with a maximum speed of 2 m/s, the full planning agent 4 m/s and the end-to-end agent 5 m/s. The good performance of end-to-end agents demonstrates their advantage of not relying on other systems, limiting the difference between simulation and reality. However, even with end-to-end architectures, simulation-to-reality transfer is still difficult and requires further study.
Future work should further explore sim-to-real transfer for DRL agents to physical vehicles and how all three architectures can be transferred to physical vehicles. Domain randomisation and adaptation (Carr, Chli, & Vogiatzis, 2018) could help train more robust policies, or a more complex model could be used in the simulator to minimise the differences. The key limitation to be addressed is the extreme swerving displayed by the trajectory tracking agent and seen in other studies (Brunnbauer et al., 2022). Investigating the sensor noise, latency, and compute requirements will provide explanations, and domain randomisation and adaptation can possibly improve performance.

Acknowledgment

The vehicle that was used within the practical experiments was constructed in conjunction between Stellenbosch University and FH Aachen.
CRediT authorship contribution statement

Benjamin David Evans: Conceptualization, Methodology, Software, Formal analysis, Investigation, Data curation, Writing – original draft, Visualization. Hendrik Willem Jordaan: Supervision, Resources, Writing – review & editing, Project administration. Herman Arnold Engelbrecht: Supervision, Writing – review & editing, Funding acquisition.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

Data will be made available on request.

References

Becker, J., Imholz, N., Schwarzenbach, L., Ghignone, E., Baumann, N., & Magno, M. (2023). Model- and acceleration-based pursuit controller for high-performance autonomous racing. In 2023 IEEE international conference on robotics and automation (ICRA) (pp. 5276–5283). IEEE.
Betz, J., Zheng, H., Liniger, A., Rosolia, U., Karle, P., Behl, M., et al. (2022). Autonomous vehicles on the edge: A survey on autonomous vehicle racing. IEEE Open Journal of Intelligent Transportation Systems.
Bosello, M., Tse, R., & Pau, G. (2022). Train in Austria, race in Montecarlo: Generalized RL for cross-track F1 tenth lidar-based races. In 2022 IEEE 19th annual consumer communications & networking conference (pp. 290–298). IEEE.
Brunnbauer, A., Berducci, L., Brandstatter, A., Lechner, M., Hasani, R., Rus, D., et al. (2022). Latent imagination facilitates zero-shot transfer in autonomous racing. In 2022 International conference on robotics and automation (pp. 7513–7520).
Cai, P., Mei, X., Tai, L., Sun, Y., & Liu, M. (2020). High-speed autonomous drifting with deep reinforcement learning. IEEE Robotics and Automation Letters, 5(2), 1247–1254.
Cai, P., Wang, H., Huang, H., Liu, Y., & Liu, M. (2021). Vision-based autonomous car racing using deep imitative reinforcement learning. IEEE Robotics and Automation Letters, 6(4), 7262–7269.
Carr, T., Chli, M., & Vogiatzis, G. (2018). Domain adaptation for reinforcement learning on the Atari. arXiv preprint arXiv:1812.07452.
Chisari, E., Liniger, A., Rupenyan, A., Van Gool, L., & Lygeros, J. (2021). Learning from simulation, racing in reality. In 2021 IEEE international conference on robotics and automation (ICRA) (pp. 8046–8052). IEEE.
Christ, F., Wischnewski, A., Heilmeier, A., & Lohmann, B. (2021). Time-optimal trajectory planning for a race car considering variable tyre-road friction coefficients. Vehicle System Dynamics, 59(4), 588–612.
Coulter, R. C. (1992). Implementation of the pure pursuit path tracking algorithm: Technical report, Carnegie Mellon University, Pittsburgh, PA, Robotics Institute.
Dwivedi, T., Betz, T., Sauerbeck, F., Manivannan, P., & Lienkamp, M. (2022). Continuous control of autonomous vehicles using plan-assisted deep reinforcement learning. In 2022 22nd international conference on control, automation and systems (pp. 244–250). IEEE.
Evans, B., Betz, J., Zheng, H., Engelbrecht, H. A., Mangharam, R., & Jordaan, H. W. (2022). Accelerating online reinforcement learning via supervisory safety systems. arXiv preprint arXiv:2209.11082.
Evans, B. D., Engelbrecht, H. A., & Jordaan, H. W. (2023). High-speed autonomous racing using trajectory-aided deep reinforcement learning. IEEE Robotics and Automation Letters, 8(9), 5353–5359.
Fuchs, F., Song, Y., Kaufmann, E., Scaramuzza, D., & Durr, P. (2021). Super-human performance in Gran Turismo Sport using deep reinforcement learning. IEEE Robotics and Automation Letters, 6(3), 4257–4264.
Fujimoto, S., Hoof, H., & Meger, D. (2018). Addressing function approximation error in actor-critic methods. In International conference on machine learning (pp. 1587–1596). PMLR.
Ghignone, E., Baumann, N., & Magno, M. (2023). TC-Driver: A trajectory conditioned reinforcement learning approach to zero-shot autonomous racing. Field Robotics, 3(1), 637–651.
Haarnoja, T., Zhou, A., Hartikainen, K., Tucker, G., Ha, S., Tan, J., et al. (2018). Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905.
Hamilton, N., Musau, P., Lopez, D. M., & Johnson, T. T. (2022). Zero-shot policy transfer in autonomous racing: Reinforcement learning vs imitation learning. In 2022 IEEE international conference on assured autonomy (ICAA) (pp. 11–20). IEEE.
Heilmeier, A., Wischnewski, A., Hermansdorfer, L., Betz, J., Lienkamp, M., & Lohmann, B. (2020). Minimum curvature trajectory planning and control for an autonomous race car. Vehicle System Dynamics, 58(10), 1497–1527.
Ivanov, R., Carpenter, T. J., Weimer, J., Alur, R., Pappas, G. J., & Lee, I. (2020). Case study: verifying the safety of an autonomous racing car with a neural network controller. In Proceedings of the 23rd international conference on hybrid systems: computation and control (pp. 1–7).
Jaritz, M., De Charette, R., Toromanoff, M., Perot, E., & Nashashibi, F. (2018). End-to-end race driving with deep reinforcement learning. In 2018 IEEE international conference on robotics and automation (ICRA) (pp. 2070–2075). IEEE.
Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., et al. (2015). Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.
O'Kelly, M., Zheng, H., Jain, A., Auckley, J., Luong, K., & Mangharam, R. (2020). Tunercar: A superoptimization toolchain for autonomous racing. In 2020 IEEE international conference on robotics and automation (pp. 5356–5362). IEEE.
O'Kelly, M., Zheng, H., Karthik, D., & Mangharam, R. (2020). F1tenth: An open-source evaluation environment for continuous control and reinforcement learning. Proceedings of Machine Learning Research, 123.
Orgován, L., Bécsi, T., & Aradi, S. (2021). Autonomous drifting using reinforcement learning. Periodica Polytechnica Transportation Engineering, 49(3), 292–300.
Remonda, A., Krebs, S., Veas, E., Luzhnica, G., & Kern, R. (2021). Formula RL: Deep reinforcement learning for autonomous racing using telemetry data. arXiv preprint arXiv:2104.11106.
Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction. MIT Press.
Tătulea-Codrean, A., Mariani, T., & Engell, S. (2020). Design and simulation of a machine-learning and model predictive control approach to autonomous race driving for the F1/10 platform. IFAC-PapersOnLine, 53(2), 6031–6036.
Vianna, L. C. M., Goubault, E., & Putot, S. (2021). Neural network based model predictive control for an autonomous vehicle. arXiv e-prints, arXiv:2107.14573.
Walsh, C. H., & Karaman, S. (2018). CDDT: Fast approximate 2D ray casting for accelerated localization. In 2018 IEEE international conference on robotics and automation (ICRA) (pp. 3677–3684). IEEE.
Wang, R., Han, Y., & Vaidya, U. (2021). Deep Koopman data-driven control framework for autonomous racing. In Proc. int. conf. robot. autom. (ICRA) workshop opportunities challenges auton. racing (pp. 1–6).
Wischnewski, A., Geisslinger, M., Betz, J., Betz, T., Fent, F., Heilmeier, A., et al. (2022). Indy autonomous challenge - autonomous race cars at the handling limits. In 12th International Munich Chassis Symposium 2021 (pp. 163–182). Springer.
Wurman, P. R., Barrett, S., Kawamoto, K., MacGlashan, J., Subramanian, K., Walsh, T. J., et al. (2022). Outracing champion Gran Turismo drivers with deep reinforcement learning. Nature, 602(7896), 223–228.

Benjamin David Evans studied for a bachelor's in Mechatronic Engineering at the University of Stellenbosch. He went on to complete his Ph.D. in the field of deep reinforcement learning for autonomous racing. His interests include the intersection of classical control and machine learning for autonomous systems. He is a postdoctoral researcher at Stellenbosch University, focusing on machine learning applications in high-performance control.

Hendrik Willem Jordaan received his bachelor's in Electrical and Electronic Engineering with Computer Science and continued to receive his Ph.D. degree in satellite control at Stellenbosch University. He currently acts as a senior lecturer at Stellenbosch University and is involved in several research projects regarding advanced control systems as applied to different autonomous vehicles. His interests include robust and adaptive control systems applied to practical vehicles.

Prof. Herman Engelbrecht received his Ph.D. degree in Electronic Engineering from Stellenbosch University (South Africa) in 2007. He is currently the Chair of the Department of Electrical and Electronic Engineering. His research interests include distributed systems (specifically infrastructure to support massive multi-user virtual environments) and machine learning (specifically deep reinforcement learning). Prof Engelbrecht is a Senior Member of the IEEE and a Member of the ACM.