Article

Multi-Agent Reinforcement Learning for Highway Platooning
Máté Kolat and Tamás Bécsi *
Department of Control for Transportation and Vehicle Systems, Budapest University of Technology and
Economics, H-1111 Budapest, Hungary; [email protected]
* Correspondence: [email protected]
Abstract: The advent of autonomous vehicles has opened new horizons for transportation efficiency
and safety. Platooning, a strategy where vehicles travel closely together in a synchronized manner,
holds promise for reducing traffic congestion, lowering fuel consumption, and enhancing overall
road safety. This article explores the application of Multi-Agent Reinforcement Learning (MARL)
combined with Proximal Policy Optimization (PPO) to optimize autonomous vehicle platooning. We
delve into the world of MARL, which empowers vehicles to communicate and collaborate, enabling
real-time decision making in complex traffic scenarios. PPO, a cutting-edge reinforcement learning
algorithm, ensures stable and efficient training for platooning agents. The synergy between MARL
and PPO enables the development of intelligent platooning strategies that adapt dynamically to
changing traffic conditions, minimize inter-vehicle gaps, and maximize road capacity. Building on these insights, this article introduces a cooperative MARL approach that leverages PPO to further optimize autonomous vehicle platooning. This cooperative framework enhances the adaptability and efficiency of platooning strategies, marking a significant step toward intelligent and responsive autonomous vehicle systems.
Keywords: deep learning; reinforcement learning; platooning; road traffic control; multi-agent systems
1. Introduction
The platooning of vehicles represents a significant advancement within the automotive sector, with the primary objective of enhancing safety, fuel efficiency, travel time, and overall performance. The utilization of autonomous vehicles operating closely in computer-guided platoons promises advantages such as fuel savings, expanded highway capacity, and enhanced passenger comfort. The integration of automation into road traffic holds the potential to address crucial challenges, including accidents, traffic congestion, environmental pollution, and energy consumption, offering valuable solutions to these pressing issues [1]. The concept of automated highways has a longstanding history, dating back to the 1939 World’s Fair when General Motors showcased a functional model. Significant research efforts have been dedicated to the development of automated highway systems (AHS), culminating in the National Automated Highway System Consortium (NAHSC) successfully demonstrating the technical feasibility of AHS in San Diego in August 1997. Early steps in AHS research date back to the 1960s [2], with operational testing of prototype equipment as early as 1970. Dr. Shladover has extensively reviewed the progress made in AHS research [3]. Since those initial efforts, the field has witnessed considerable advancement. In 1995, a research report provided a comprehensive account of the aerodynamic performance of platoons, highlighting a notable reduction of approximately 55% in the drag coefficient, which accounts for both vehicle size and velocity, in two-, three-, and four-vehicle platoons. This reduction in drag coefficient translates to lower fuel consumption, as reported to the California Partners for Advanced Transit and Highways (PATH) [4]. Platooning involves the creation of groups or “linked” vehicles that operate collectively on the Automated Highway System, effectively acting as a single
unit [5–8]. These vehicles maintain an extremely close headway or spacing between them,
often just a few meters. They are interconnected through headway control mechanisms
like radar-based systems. Within the platoon, the lead vehicle, also known as the leader,
continually shares information about the AHS conditions and any planned maneuvers
with the other vehicles, referred to as followers. In vehicle platooning, a group or platoon
of vehicles operates with precise automated control over both longitudinal and lateral
movement. These vehicles maintain a consistent and fixed gap between them, even at
high highway speeds. This reduced spacing contributes to an increased capacity on the
highway. Safety is substantially improved through automation and the coordinated actions
of the vehicles. The close proximity of vehicles within the platoon ensures that even rapid
accelerations and decelerations have minimal impact on passenger comfort [4], which
was conclusively demonstrated during the platooning scenario presented by the PATH
program [9].
In summary, one of the primary advantages of platooning is improved fuel efficiency.
Vehicles in a platoon experience reduced air resistance, resulting in fuel savings for each
vehicle, particularly in long-haul transportation, where fuel costs constitute a substantial
portion of operational expenses. This helps optimize traffic flow by reducing the space
between vehicles, leading to smoother traffic patterns, less congestion, and improved
overall traffic management, particularly on highways and major roadways. Platooning can
increase the capacity of highways by allowing vehicles to travel more closely together. This
space optimization can lead to better infrastructure utilization, potentially reducing the
need for costly expansions. Lastly, platooning’s automated systems and communication
technologies contribute to increased road safety. Vehicles within a platoon can respond
quickly to the lead vehicle’s speed or direction changes, reducing the likelihood of accidents.
Deep reinforcement learning has also been applied in the case of research on platooning. In [26], the authors utilize a particular version of DDPG called the platoon-sharing deep deterministic policy gradient algorithm (PSDDPG), while [27] uses DDPG for mixed-autonomy vehicle platoons. Platooning studies can also be distinguished by the topics they examine. Aki et al. discuss the need for improved braking systems in platooning based
on the results of a driving simulator study [28]; Jones et al. analyze the human factor
challenges associated with CACC and suggest research questions that need to be further
investigated [29]; Ogitsu et al. analyze equipment failures during platooning and describe
strategies to deal with failures based on their severity [30]; and in [31], the authors discuss
the value of various information for safely controlling platooning.
1.2. Contribution
Few research papers address the platooning problem within the reinforcement learn-
ing (RL) framework using Multi-Agent Reinforcement Learning, and most existing studies
involve only a limited number of participants. This paper presents an innovative approach
to highway platooning, focusing on enabling adaptive responses to varying velocity condi-
tions and encompassing different platooning speeds. This adaptation is achieved through
the introduction of a novel reward mechanism. The research employs a reward strategy in which the following distance is dynamically adjusted according to the platoon's velocity. This adjustment is determined by using the following time as a metric rather than
relying solely on a fixed distance. The research paper demonstrates the suitability of this
innovative reward method within a multi-agent environment through the utilization of
Proximal Policy Optimization. It effectively organizes a platooning scenario characterized
by minimal velocity variance among participating vehicles, showcasing its ability to man-
age fluctuations in velocity conditions adequately. It is important to highlight that we have
focused on building a strong foundation rather than directly comparing our method with
other solutions. The goal has been to set the stage for future work, providing a starting
point that others can use as a reference for more detailed studies. The paper is structured
as follows: Section 2 presents the literature background of the utilized methods. Section 3
introduces the implemented traffic environment and the RL properties used. In Section 4, the results are presented and evaluated. Ultimately, Section 5 summarizes
the research and makes further development suggestions.
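For concreteness, the following-time metric relates the inter-vehicle gap to the follower's speed via the standard time-headway definition; the formula below is our restatement, as the paper does not write it out explicitly:

$$t_{\mathrm{follow}} = \frac{d_{\mathrm{gap}}}{v_{\mathrm{ego}}}, \qquad \text{e.g.}\ d_{\mathrm{gap}} = 90\ \mathrm{m},\ v_{\mathrm{ego}} = 30\ \mathrm{m/s} \;\Rightarrow\; t_{\mathrm{follow}} = 3\ \mathrm{s}.$$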
2. Methodology
This study presents a highway platooning method applying Multi-Agent Reinforcement Learning with the Proximal Policy Optimization algorithm. Section 2.1 introduces the literature background of reinforcement learning, Section 2.2 presents the basis of MARL, and Section 2.4 shows the motivation for PPO.
Figure 1 illustrates the core concept of RL, where an agent, acting as a decision-maker,
observes its environment and takes actions based on the current state. Each action is
assessed for its quality, and the agent is subject to penalties or rewards based on the
desirability of its actions. RL is formally modeled as a Markov decision process (MDP),
which can be defined in terms of the tuple <S, A, R, P>:
- S represents the set of observable states.
- A corresponds to the set of possible actions available to the agent.
- R signifies the set of rewards obtained based on the quality of the chosen actions.
- P denotes the policy responsible for determining which action to take in a given state.
An effectively designed reward strategy in RL plays a pivotal role in influencing the
agent’s neural network (NN) performance as it strives to achieve the desired behavior.
This form of learning aims to fine-tune a network’s responses, enabling it to generalize its
actions effectively in unfamiliar environments that were not part of its training dataset.
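As a minimal illustration of this observe–act–reward loop (the environment and random policy below are placeholders, not the platooning setup of this paper), a Gymnasium-style interaction looks as follows:

```python
import gymnasium as gym

# Placeholder environment; the paper's custom platooning environment is introduced in Section 3.
env = gym.make("MountainCarContinuous-v0")

state, info = env.reset(seed=0)          # initial state s_0 from S
episode_return = 0.0
for _ in range(200):
    action = env.action_space.sample()   # a trained policy would map state -> action from A
    state, reward, terminated, truncated, info = env.step(action)  # reward from R for the chosen action
    episode_return += reward
    if terminated or truncated:
        state, info = env.reset()
env.close()
```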
2.4. PPO
Reinforcement learning has made significant progress in recent years, and one notable
contribution to this progress is the Proximal Policy Optimization algorithm. PPO is widely
popular due to its good empirical performance and ease of implementation. PPO belongs to the family of model-free RL methods, more specifically to policy gradient methods. Policy gradient
methods operate by calculating an estimate of the policy gradient and incorporating it into
a stochastic gradient ascent algorithm. The commonly employed gradient estimator takes
the following form:

$$\hat{g} = \hat{\mathbb{E}}_t\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\hat{A}_t\right] \tag{1}$$
where πθ represents a stochastic policy and Ât is an estimator for the advantage function at
timestep t. The expectation Êt denotes the empirical average over a finite batch of samples
in an algorithm that alternates between sampling and optimization. Implementations
utilizing automatic differentiation software construct an objective function whose gradient
corresponds to the policy gradient estimator. The estimator ĝ is derived by differentiating
the objective:

$$L^{PG}(\theta) = \hat{\mathbb{E}}_t\left[\log \pi_\theta(a_t \mid s_t)\,\hat{A}_t\right] \tag{2}$$
While the idea of performing multiple optimization steps on this loss L PG using the same
trajectory is appealing, it lacks a solid justification. Empirically, such an approach often
results in excessively large policy updates. PPO is characterized by its focus on optimizing
policies while ensuring stability and efficient learning. This is achieved by enforcing “prox-
imal” constraints on policy updates and preventing drastic changes that might destabilize
learning. The core idea of PPO is to iteratively update the policy by optimizing a surrogate
objective function that includes a clipping mechanism.
$$L^{CLIP}_{PPO}(\theta) = \hat{\mathbb{E}}_t\left[\min\!\left(\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}\,\hat{A}_t,\ \mathrm{clip}\!\left(1-\epsilon,\,1+\epsilon,\,\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}\right)\hat{A}_t\right)\right] \tag{3}$$
where:
• πθ ( at |st ) is the probability of taking action at given state st under policy πθ ;
• πθold ( at |st ) is the probability under the old policy;
• Ât is an estimator of the advantage function at timestep t;
• clip( a, b, x ) clips x to the interval [ a, b];
• ε is a hyperparameter controlling the size of policy updates.
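As a minimal sketch of Equation (3) (tensor names and the PyTorch framework are our assumptions; the paper does not publish its implementation), the clipped surrogate loss can be computed from stored old log-probabilities and advantage estimates as follows:

```python
import torch

def ppo_clip_loss(new_logp, old_logp, advantages, eps=0.2):
    """Clipped surrogate objective of Eq. (3), negated so it can be minimized by gradient descent."""
    ratio = torch.exp(new_logp - old_logp)                  # pi_theta / pi_theta_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.mean(torch.min(unclipped, clipped))

# Example with dummy rollout data
new_logp = torch.tensor([-0.9, -1.2, -0.5], requires_grad=True)
old_logp = torch.tensor([-1.0, -1.0, -0.7])
advantages = torch.tensor([0.5, -0.3, 1.2])
loss = ppo_clip_loss(new_logp, old_logp, advantages)
loss.backward()
```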
This mechanism constrains policy updates to a range that does not differ significantly
from the previous policy. This way, PPO prevents policy updates that could lead to catas-
trophic performance drops and ensures smoother convergence. PPO has several advantages
and is recommended for various RL applications. It has impressive sampling efficiency
and can learn proficient policies with fewer interactions with the environment. Additionally, PPO supports both discrete and continuous action spaces, making it versatile for a wide range of problems. Furthermore, PPO typically includes an entropy regularization term, encouraging exploration and preventing premature convergence to suboptimal policies. This feature is especially useful in scenarios where exploration is essential to discover optimal strategies. In recent
years, PPO has been applied to many real-world problems, including robotics, autonomous
agents, and gaming agents, proving its adaptability and robustness in various fields.
3. Environment
Creating the right setting is crucial in reinforcement learning because it shapes the
training data based on the agent’s choices. In this study, we have set up a customized
environment. Nine vehicles drive one after the other in a single highway lane. Every 5 s, the first vehicle either randomly changes its speed or maintains its current speed. In response to these speed changes, the other vehicles adjust their acceleration. This
adjustment, determined by the training agent, is carefully tuned to maintain a safe and
steady following distance. The custom environment is managed through PettingZoo, an
extension of Gymnasium specifically designed for multi-agent scenarios.
To make the training process more varied and help the model adapt effectively, we
begin each training episode with the vehicles randomly positioned. This randomness intro-
duces a range of traffic situations during training, leading to a robust and flexible outcome.
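A minimal sketch of how such a single-lane platooning environment could be organized with PettingZoo's ParallelEnv interface is given below; the class name, observation layout, and acceleration bounds are illustrative assumptions rather than the authors' implementation:

```python
import numpy as np
from gymnasium import spaces
from pettingzoo import ParallelEnv

class PlatoonEnv(ParallelEnv):
    """Illustrative single-lane platoon: one lead vehicle plus eight learning followers."""
    metadata = {"name": "platoon_v0"}

    def __init__(self, n_followers=8, dt=0.1):
        self.possible_agents = [f"follower_{i}" for i in range(n_followers)]
        self.dt = dt

    def observation_space(self, agent):
        # ego velocity/acceleration plus relative speed and gap to the predecessor (simplified)
        return spaces.Box(-np.inf, np.inf, shape=(4,), dtype=np.float32)

    def action_space(self, agent):
        # longitudinal acceleration command in m/s^2 (assumed bounds)
        return spaces.Box(-3.0, 2.0, shape=(1,), dtype=np.float32)

    def reset(self, seed=None, options=None):
        self.agents = self.possible_agents[:]
        rng = np.random.default_rng(seed)
        n = len(self.agents) + 1                                        # +1 for the lead vehicle
        self.pos = np.cumsum(rng.uniform(60.0, 100.0, n))[::-1].copy()  # randomized initial spacing
        self.vel = np.full(n, 25.0)
        self.acc = np.zeros(n)
        return {a: self._obs(i + 1) for i, a in enumerate(self.agents)}, {a: {} for a in self.agents}

    def step(self, actions):
        for i, name in enumerate(self.agents):
            cmd = float(np.asarray(actions[name]).reshape(-1)[0])
            self.acc[i + 1] = np.clip(cmd, -3.0, 2.0)
        # the lead vehicle (index 0) changes or holds its speed every 5 s; omitted here for brevity
        self.vel += self.acc * self.dt
        self.pos += self.vel * self.dt
        obs = {a: self._obs(i + 1) for i, a in enumerate(self.agents)}
        rewards = {a: 0.0 for a in self.agents}                         # shaped as described in Section 3.3
        terms = {a: False for a in self.agents}                         # collision check omitted for brevity
        truncs = {a: False for a in self.agents}
        return obs, rewards, terms, truncs, {a: {} for a in self.agents}

    def _obs(self, i):
        gap = self.pos[i - 1] - self.pos[i]
        return np.array([self.vel[i], self.acc[i], self.vel[i - 1] - self.vel[i], gap], dtype=np.float32)
```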
This comprehensive state representation grants the RL agent a holistic view of the
platooning environment, allowing it to discern not only the behavior of its immediate
followers but also anticipate the actions of the vehicles ahead. Considering these neighbor-
ing vehicles’ relative dynamics and positions, the agent can make informed decisions to
optimize platoon stability and ensure safe and efficient driving.
Furthermore, within this state representation framework, the inclusion of ego data, which comprises the ego vehicle's velocity and acceleration, provides the agent with a self-awareness crucial for cooperative platooning. This self-awareness enables the agent to
factor its actions into the decision-making process, ensuring that it harmoniously integrates
with the platoon while adhering to its operational constraints. The above-mentioned char-
acteristics align with the principles of Cooperative Multi-Agent Reinforcement Learning
(Cooperative MARL). In Cooperative MARL, agents collaborate to achieve shared objec-
tives, and state representation plays a central role in facilitating effective cooperation. The
state representation supports a shared goal of optimizing platoon stability and ensuring
safe and efficient driving. Cooperative MARL is designed for scenarios where agents work
collectively towards a common objective. The state representation captures the interdepen-
dence among vehicles within the platoon. Actions of one vehicle impact others, reinforcing
the need for cooperation—a core principle of Cooperative MARL. The inclusion of ego
data, representing velocity and acceleration, fosters self-awareness. In Cooperative MARL,
self-aware agents contribute meaningfully to the collective decision-making process.
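As a hedged illustration of such a state vector (the exact feature set is not enumerated in this excerpt, so the fields below are assumptions consistent with the description), one follower's observation could be assembled from ego data and the neighbouring vehicles' relative dynamics:

```python
import numpy as np

def build_observation(ego, predecessor, follower=None):
    """Assemble one agent's state: ego dynamics plus relative dynamics of its neighbours.
    Each argument is a dict with 'pos', 'vel', and 'acc' entries (illustrative field names)."""
    obs = [
        ego["vel"], ego["acc"],                               # ego data: self-awareness
        predecessor["pos"] - ego["pos"],                      # gap to the vehicle ahead
        predecessor["vel"] - ego["vel"],                      # relative speed to the vehicle ahead
        (follower["pos"] - ego["pos"]) if follower else 0.0,  # gap to the follower behind, if any
        (follower["vel"] - ego["vel"]) if follower else 0.0,
    ]
    return np.array(obs, dtype=np.float32)
```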
3.3. Reward
Rewards in reinforcement learning constitute pivotal feedback signals, serving as
evaluative metrics for agent performance. These numerical indicators encapsulate the
desirability of actions taken within an environment, thereby facilitating the optimization of
decision-making processes. The acquisition of rewards engenders adaptive learning, where agents strive to maximize cumulative rewards, emblematic of the quintessential RL
paradigm. The rewarding strategy takes center stage in our specific context of optimizing
the following time within a dynamic highway platooning system. Our primary objective is
to maintain a target following time of 3 s between vehicles. The essence of our rewarding
strategy lies in a well-defined threshold of 0.5 s. This threshold serves as a critical boundary,
indicating the permissible deviation from the target following time of 3 s. When the actual
following time falls within this 0.5 s threshold, signifying that the vehicles are maintaining
close proximity to the desired 3 s gap, a maximum reward of 1 is granted. This reward
serves as positive reinforcement, encouraging the platoon to maintain an optimal following
distance, which is vital for minimizing collision risks and promoting efficient traffic flow.
However, if the deviation from the target following time exceeds the 0.5 s threshold, indicating a substantial departure from the desired spacing, a penalty determined as a function of the deviation is applied to the reward.
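The penalty equation itself does not survive in this excerpt, so the sketch below encodes only what the text states (a reward of 1 whenever the deviation from the 3 s target stays within 0.5 s) and substitutes an assumed linear penalty beyond the threshold as a placeholder:

```python
def following_time_reward(following_time, target=3.0, threshold=0.5):
    """Reward of 1 inside the tolerance band; the out-of-band penalty shape is an assumption,
    not the paper's equation."""
    deviation = abs(following_time - target)
    if deviation <= threshold:
        return 1.0
    # Placeholder penalty: decreases linearly with the excess deviation, bounded below at -1.
    return max(1.0 - (deviation - threshold), -1.0)

# Example: a 3.2 s following time deviates by 0.2 s, so the full reward is granted.
assert following_time_reward(3.2) == 1.0
```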
3.4. Training
As previously discussed, the carefully selected diverse dataset is crucial for enhancing
the network’s ability to generalize and yield superior performance. In this study, we ensure
the diversity of our training data through the random generation of traffic in each episode.
We employ a competitive Multi-Agent Reinforcement Learning technique characterized
by immediate rewards. Under this approach, each agent prioritizes its own acceleration
decisions and receives rewards based solely on its actions following each step. A training
episode concludes either when it reaches the predefined time step limit or when collisions
occur. During the training process, the neural network receives the state representation of
each vehicle and, in response, provides action recommendations for every agent.
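A hedged sketch of the episode loop described above is shown next; the shared policy object and its act/store/update methods are placeholders standing in for the PPO machinery, not the authors' code:

```python
import numpy as np

def run_episode(env, agent, max_steps=1600):
    """One training episode: a shared policy acts for every follower; the episode ends on the
    time-step limit or when any termination (e.g., a collision) is signalled."""
    observations, _ = env.reset(seed=int(np.random.randint(1_000_000)))  # random traffic each episode
    for _ in range(max_steps):
        actions = {name: agent.act(obs) for name, obs in observations.items()}  # one network, all agents
        next_observations, rewards, terminations, truncations, _ = env.step(actions)
        agent.store(observations, actions, rewards)     # immediate, per-agent rewards
        observations = next_observations
        if any(terminations.values()) or any(truncations.values()):
            break
    agent.update()                                      # PPO update from the collected rollout
```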
4. Results
Considering road transportation, safety cannot be emphasized enough. This study
introduced a highway-based platooning method, where vehicles travel at high speed, high-
lighting the importance of the adequately selected following distance and time. However,
the following distance highly depends on the speed chosen by the vehicles participating in road traffic, so it is necessary to evaluate the results using the following time metric. Therefore, the results are evaluated based on the deviation of each vehicle's following time from the target following time. The proposed approach underwent testing through
1000 individual testing episodes, each featuring unique randomly generated traffic flows
to simulate real-world traffic scenarios. This testing protocol was specifically designed
to assess the performance and effectiveness of the control strategy developed during the
training phase.
The training hyperparameters were set as follows:

Parameter                                                     Value
Learning rate (α)                                             0.00005
Discount factor (γ)                                           0.97
Number of episodes after which parameters are updated (ξ)     20
Number of hidden layers                                       4
Number of neurons per hidden layer                            128, 256, 256, 128
Hidden layer activation function                              ReLU
Layer type                                                    Dense
Optimizer                                                     Adam
Kernel initializer                                            Xavier normal
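To illustrate the listed hyperparameters, a policy network with four dense hidden layers (128/256/256/128 neurons, ReLU activations, Xavier-normal initialization) and an Adam optimizer at the stated learning rate could be built as below; the Keras framework and the tanh output head are our assumptions:

```python
import tensorflow as tf

def build_policy_network(obs_dim=6, action_dim=1):
    """Dense policy network matching the table: 128/256/256/128 hidden units, ReLU,
    Xavier-normal kernel initializer, Adam optimizer with learning rate 0.00005."""
    init = tf.keras.initializers.GlorotNormal()           # "Xavier normal"
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(obs_dim,)),
        tf.keras.layers.Dense(128, activation="relu", kernel_initializer=init),
        tf.keras.layers.Dense(256, activation="relu", kernel_initializer=init),
        tf.keras.layers.Dense(256, activation="relu", kernel_initializer=init),
        tf.keras.layers.Dense(128, activation="relu", kernel_initializer=init),
        tf.keras.layers.Dense(action_dim, activation="tanh"),   # assumed bounded-acceleration head
    ])
    optimizer = tf.keras.optimizers.Adam(learning_rate=0.00005)
    return model, optimizer
```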
Figure 4. The average deviation from the target following time of the participants.
In Figure 4, the absolute deviation of the following time is depicted in seconds (s) for each episode. The figure illustrates that the participants' average absolute deviation from the target following time per episode consistently falls well below the specified threshold of 0.5 s, ranging between 0.15 s and 0.26 s. Therefore, the figure validates the performance of the agents in the utilized system, since the deviation from the standard and suggested 3 s following time is relatively small. As the leading vehicle executes various
dynamic maneuvers, such as acceleration, speed maintenance, and deceleration, there will
always be some inherent reaction time, as the state space does not include the leading
vehicle’s exact next-time step maneuver. Consequently, in this scenario, the objective was
not to reach a deviation time of zero but rather to minimize it.
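Under the assumption that per-step following times are logged during testing, the per-episode metric plotted in Figure 4 reduces to a mean absolute deviation from the 3 s target:

```python
import numpy as np

def mean_abs_deviation(following_times, target=3.0):
    """Average absolute deviation (in seconds) of the logged following times from the target."""
    return float(np.mean(np.abs(np.asarray(following_times) - target)))

# Example: deviations of 0.2, 0.1, 0.2, and 0.1 s average to 0.15 s, well below the 0.5 s threshold.
print(mean_abs_deviation([2.8, 3.1, 3.2, 2.9]))   # 0.15
```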
Figure 5 shows the test result for one specific episode.
Here, the focus is on a specific test episode having 1600 time steps and plotting the
velocity of the leading vehicle and the average velocity of the following vehicles during
the whole episode. It can be observed that the leading vehicle (yellow line) performs
various acceleration and deceleration maneuvers during the episode. The green dotted line
denotes the average velocity of the following vehicles. The data show that the following vehicles respond effectively to the acceleration and deceleration actions of the vehicle in front, keeping their speed within an appropriate range of the lead vehicle's throughout the episode. These variations in speed reflect the dynamics of real-world traffic
situations, where the lead vehicle often responds to external factors such as road conditions
or the actions of other vehicles. Finally, once the speed of the vehicle in front stabilizes after
some fluctuations, the vehicle behind will follow suit.
The trend in Figure 6 indicates that the learning algorithm effectively optimizes the
policy to make decisions that, on average, result in positive rewards. In RL, achieving
convergence to a desired reward level is a key objective, and in this case, a reward above
0.9 suggests that the agent is successfully navigating its environment and making advanta-
geous decisions.
5. Conclusions
The advent of self-driving cars offers an opportunity to significantly improve the efficiency and safety of transportation. Platooning, characterized by the synchronized movement of vehicles at short inter-vehicle distances, can alleviate traffic congestion,
reduce fuel consumption, and improve overall road safety. The paper presents a promising
potential Multi-Agent Reinforcement Learning-based platooning solution utilizing Proxi-
mal Policy Optimization. The research results show that the proposed method can establish
strong stability within the platooning system. Furthermore, using the introduced reward
mechanism, this approach can be effectively adapted to different speed conditions in the
system. This study uses a reward strategy in which the following distance is dynamically
adjusted based on speed, as the distance is flexibly determined based on following time
metrics rather than relying on a fixed distance. This approach results in slight variations in
following time within a platoon. However, it is essential to note that various events can
occur in road traffic, such as merging into a platoon, which were not investigated in this
study. Therefore, the future focus of this research will be on addressing merging scenarios.
Additionally, improved system performance can be achieved by including the leading
vehicle decision in the state space. In presenting the outcomes of the study, it is essential to
highlight that we have focused on building a solid foundation rather than directly compar-
ing it with other solutions. The goal has been to set the stage for future work. In our next
project, we aim to tackle the complexities of merging dynamics in the context of platooning.
While our current paper keeps things simple and focuses on the basics of platooning, our
upcoming work will dive into the more complex challenges of merging.
Author Contributions: Conceptualization, T.B.; Methodology, M.K. and T.B.; Software, M.K.;
Writing—original draft, M.K.; Writing—review & editing, T.B.; Visualization, M.K.; Supervision, T.B.;
Funding acquisition, T.B. All authors have read and agreed to the published version of the manuscript.
Funding: This research was supported by the European Union within the framework of the National
Laboratory for Autonomous Systems (RRF-2.3.1-21-2022-00002). Project no. TKP2021-NVA-02 has
been implemented with the support provided by the Ministry of Culture and Innovation of Hungary
from the National Research, Development and Innovation Fund, financed under the TKP2021-NVA
funding scheme. T.B. was supported by BO/00233/21/6, János Bolyai Research Scholarship of the
Hungarian Academy of Sciences.
Data Availability Statement: Data are contained within the article.
Conflicts of Interest: The authors declare no conflict of interest.
Abbreviations
The following abbreviations are used in this manuscript:

AHS      Automated highway system
CACC     Cooperative adaptive cruise control
DDPG     Deep deterministic policy gradient
MARL     Multi-Agent Reinforcement Learning
MDP      Markov decision process
NAHSC    National Automated Highway System Consortium
NN       Neural network
PATH     California Partners for Advanced Transit and Highways
PPO      Proximal Policy Optimization
PSDDPG   Platoon-sharing deep deterministic policy gradient
RL       Reinforcement learning
References
1. Krizsik, N.; Sipos, T. Social Perception of Autonomous Vehicles. Period. Polytech. Transp. Eng. 2023, 51, 133–139. [CrossRef]
2. Hanson, M.E. Project METRAN: An Integrated, Evolutionary Transportation System for Urban Areas; Number 8; MIT Press: Cambridge,
MA, USA, 1966.
3. Shladover, S.E. Review of the state of development of advanced vehicle control systems (AVCS). Vehicle Syst. Dyn. 1995,
24, 551–595. [CrossRef]
4. Levedahl, A.; Morales, F.; Mouzakitis, G. Platooning Dynamics and Control on an Intelligent Vehicular Transport System; CSOIS, Utah
State University: Logan, UT, USA, 2010; pp. 1–7.
5. Bergenhem, C.; Huang, Q.; Benmimoun, A.; Robinson, T. Challenges of platooning on public motorways. In Proceedings of the
17th World Congress on Intelligent Transport Systems, Busan, Republic of Korea, 25–29 October 2010; pp. 1–12.
6. Dávila, A.; Nombela, M. Sartre: Safe road trains for the environment. In Proceedings of the Conference on Personal Rapid Transit
PRT@ LHR, London, UK, 21–23 September 2010; Volume 3, pp. 2–3.
7. Robinson, T.; Chan, E.; Coelingh, E. Operating platoons on public motorways: An introduction to the sartre platooning programme.
In Proceedings of the 17th World Congress on Intelligent Transport Systems, Busan, Republic of Korea, 25–29 October 2010;
Volume 1, p. 12.
8. Little, J.D.; Kelson, M.D.; Gartner, N.H. MAXBAND: A versatile program for setting signals on arteries and triangular networks.
In Proceedings of the 60th Annual Meeting of the Transportation Research Board, Washington, DC, USA, 12–16 January 1981.
9. Shladover, S.E.; Desoer, C.A.; Hedrick, J.K.; Tomizuka, M.; Walrand, J.; Zhang, W.B.; McMahon, D.H.; Peng, H.; Sheikholeslam,
S.; McKeown, N. Automated vehicle control developments in the PATH program. IEEE Trans. Veh. Technol. 1991, 40, 114–130.
[CrossRef]
10. Wu, T.; Zhou, P.; Liu, K.; Yuan, Y.; Wang, X.; Huang, H.; Wu, D.O. Multi-Agent Deep Reinforcement Learning for Urban Traffic
Light Control in Vehicular Networks. IEEE Trans. Veh. Technol. 2020, 69, 8243–8256. [CrossRef]
11. Zhang, H.; Peng, J.; Dong, H.; Ding, F.; Tan, H. Integrated velocity optimization and energy management strategy for hybrid
electric vehicle platoon: A multi-agent reinforcement learning approach. IEEE Trans. Transp. Electrif. 2023, 1. [CrossRef]
12. Zhang, X.; Chen, P.; Yu, G.; Wang, S. Deep Reinforcement Learning Heterogeneous Channels for Poisson Multiple Access.
Mathematics 2023, 11, 992. [CrossRef]
13. Chen, P.; Shi, L.; Fang, Y.; Lau, F.C.M.; Cheng, J. Rate-Diverse Multiple Access Over Gaussian Channels. IEEE Trans. Wirel.
Commun. 2023, 22, 5399–5413. [CrossRef]
14. Egea, A.C.; Howell, S.; Knutins, M.; Connaughton, C. Assessment of reward functions for reinforcement learning traffic signal
control under real-world limitations. In Proceedings of the 2020 IEEE International Conference on Systems, Man, and Cybernetics
(SMC), Toronto, ON, Canada, 11–14 October 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 965–972.
15. Cools, S.B.; Gershenson, C.; D’Hooghe, B. Self-organizing traffic lights: A realistic simulation. In Advances in Applied Self-
Organizing Systems; Springer: Berlin/Heidelberg, Germany, 2013; pp. 45–55.
16. Gershenson, C. Self-organizing traffic lights. arXiv 2004, arXiv:nlin/0411066.
17. Shabestary, S.M.A.; Abdulhai, B. Deep learning vs. discrete reinforcement learning for adaptive traffic signal control. In Proceed-
ings of the 2018 21st International Conference on Intelligent Transportation Systems (ITSC), Maui, HI, USA, 4–7 November 2018;
IEEE: Piscataway, NJ, USA, 2018; pp. 286–293.
18. Feng, Y.; Yu, C.; Liu, H.X. Spatiotemporal intersection control in a connected and automated vehicle environment. Transp. Res.
Part Emerg. Technol. 2018, 89, 364–383. [CrossRef]
19. Wang, S.; de Almeida Correia, G.H.; Lin, H.X. Effects of coordinated formation of vehicle platooning in a fleet of shared automated
vehicles: An agent-based model. Transp. Res. Procedia 2020, 47, 377–384. [CrossRef]
20. Maiti, S.; Winter, S.; Kulik, L.; Sarkar, S. The Impact of Flexible Platoon Formation Operations. IEEE Trans. Intell. Veh. 2020,
5, 229–239. [CrossRef]
21. He, D.; Qiu, T.; Luo, R. Fuel efficiency-oriented platooning control of connected nonlinear vehicles: A distributed economic MPC
approach. Asian J. Control 2020, 22, 1628–1638. [CrossRef]
22. Lopes, D.R.; Evangelou, S.A. Energy savings from an Eco-Cooperative Adaptive Cruise Control: A BEV platoon investigation. In
Proceedings of the 2019 18th European Control Conference (ECC), Naples, Italy, 25–28 June 2019; IEEE: Piscataway, NJ, USA,
2019; pp. 4160–4167.
23. Li, Z.; Hu, B.; Li, M.; Luo, G. String Stability Analysis for Vehicle Platooning Under Unreliable Communication Links With
Event-Triggered Strategy. IEEE Trans. Veh. Technol. 2019, 68, 2152–2164. [CrossRef]
24. Peng, B.; Yu, D.; Zhou, H.; Xiao, X.; Fang, Y. A Platoon Control Strategy for Autonomous Vehicles Based on Sliding-Mode Control
Theory. IEEE Access 2020, 8, 81776–81788. [CrossRef]
25. Basiri, M.H.; Pirani, M.; Azad, N.L.; Fischmeister, S. Security of Vehicle Platooning: A Game-Theoretic Approach. IEEE Access
2019, 7, 185565–185579. [CrossRef]
26. Lu, S.; Cai, Y.; Chen, L.; Wang, H.; Sun, X.; Jia, Y. A sharing deep reinforcement learning method for efficient vehicle platooning
control. IET Intell. Transp. Syst. 2022, 16, 1697–1709. [CrossRef]
27. Chu, T.; Kalabić, U. Model-based deep reinforcement learning for CACC in mixed-autonomy vehicle platoon. In Proceedings of
the 2019 IEEE 58th Conference on Decision and Control (CDC), Nice, France, 11–13 December 2019; IEEE: Piscataway, NJ, USA,
2019; pp. 4079–4084.
28. Aki, M.; Zheng, R.; Nakano, K.; Yamabe, S.; Lee, S.Y.; Suda, Y.; Suzuki, Y.; Ishizaka, H. Evaluation of safety of automatic
platoon-driving with improved brake system. In Proceedings of the 19th Intelligent Transport Systems World Congress, ITS 2012,
Vienna, Austria, 22–26 October 2012.
29. Jones, S. Cooperative Adaptive Cruise Control: Human Factors Analysis; Technical Report; Office of Safety Research and Development,
Federal Highway Administration: Washington, DC, USA, 2013.
30. Ogitsu, T.; Fukuda, R.; Chiang, W.P.; Omae, M.; Kato, S.; Hashimoto, N.; Aoki, K.; Tsugawa, S. Decision process for handling
operations against device failures in heavy duty trucks in a platoon. In Proceedings of the 9th FORMS/FORMAT 2012
Symposium on Formal Methods for Automation and Safety in Railway and Automotive Systems, Braunschweig, Germany,
12–13 December 2012.
31. Xu, L.; Wang, L.Y.; Yin, G.; Zhang, H. Communication information structures and contents for enhanced safety of highway
vehicle platoons. IEEE Trans. Veh. Technol. 2014, 63, 4206–4220. [CrossRef]
32. Kohl, N.; Stone, P. Policy gradient reinforcement learning for fast quadrupedal locomotion. In Proceedings of the IEEE
International Conference on Robotics and Automation, 2004, Proceedings, ICRA’04, New Orleans, LA, USA, 26 April–1 May 2004;
IEEE: Piscataway, NJ, USA, 2004; Volume 3, pp. 2619–2624.
33. Ng, A.Y.; Coates, A.; Diel, M.; Ganapathi, V.; Schulte, J.; Tse, B.; Berger, E.; Liang, E. Autonomous inverted helicopter flight via
reinforcement learning. In Experimental Robotics IX; Springer: Berlin/Heidelberg, Germany, 2006; pp. 363–372.
34. Singh, S.; Litman, D.; Kearns, M.; Walker, M. Optimizing dialogue management with reinforcement learning: Experiments with
the NJFun system. J. Artif. Intell. Res. 2002, 16, 105–133. [CrossRef]
35. Tesauro, G. Temporal difference learning and TD-Gammon. Commun. ACM 1995, 38, 58–68. [CrossRef]
36. Strehl, A.L.; Li, L.; Wiewiora, E.; Langford, J.; Littman, M.L. PAC model-free reinforcement learning. In Proceedings of the 23rd
International Conference on Machine Learning, Pittsburgh, PA, USA, 25–29 June 2006; pp. 881–888.
37. Silver, D.; Huang, A.; Maddison, C.J.; Guez, A.; Sifre, L.; Van Den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam,
V.; Lanctot, M.; et al. Mastering the game of Go with deep neural networks and tree search. Nature 2016, 529, 484–489. [CrossRef]
38. Levine, S.; Finn, C.; Darrell, T.; Abbeel, P. End-to-end training of deep visuomotor policies. J. Mach. Learn. Res. 2016, 17, 1334–1373.
39. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.;
Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [CrossRef] [PubMed]
40. Zhu, Y.; Mottaghi, R.; Kolve, E.; Lim, J.J.; Gupta, A.; Fei-Fei, L.; Farhadi, A. Target-driven visual navigation in indoor scenes using
deep reinforcement learning. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA),
Singapore, 29 May–3 June 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 3357–3364.
41. Claus, C.; Boutilier, C. The dynamics of reinforcement learning in cooperative multiagent systems. In Proceedings of the AAAI
’98/IAAI ’98: Proceedings of the Fifteenth National/Tenth Conference on Artificial Intelligence/Innovative Applications of
Artificial Intelligence, Madison, WI, USA, 26–30 July 1998; pp. 746–752.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.