Graph-Based Cooperation Multi-Agent Reinforcement Learning for Intelligent Traffic Signal Control
Abstract—In the trend of continuously advancing urban intelligent transport construction, traditional traffic signal control (TSC) struggles to make effective decisions with complex traffic conditions. Although multi-agent deep reinforcement learning (MARL) shows promise in optimizing traffic flow, most existing studies ignore the complex relationships between signal lights and fail to communicate with neighbors effectively. Moreover, the deterministic strategies generated by Q-learning-based methods struggle to be extended to large-scale urban road networks. Therefore, this paper proposes a multi-agent graph-based soft actor-critic (MAGSAC) approach for TSC, which combines graph neural networks with the Soft Actor-Critic (SAC) algorithm and extends it to multi-agent environments to address the TSC problem. Specifically, we employ graph-based networks and an attention mechanism to expand the receptive domain of agents, enable environmental information to be shared among agents, and utilize the attention mechanism to filter out unimportant information. The algorithm adheres to the Centralized Training Decentralized Execution (CTDE) paradigm to minimize the non-stationarity of MARL. Finally, a rigorous experimental evaluation was conducted using the CityFlow simulator on both synthetic traffic grids and real-world urban road networks. Experimental results show that MAGSAC outperforms other TSC methods in performance metrics, including average queue length and waiting time, and achieves excellent performance under complex urban traffic conditions.

Index Terms—Adaptive traffic signal control, Multi-agent deep reinforcement learning, Internet of Things, Intelligent transportation systems.

I. INTRODUCTION

Considering the rapid progress of modern economic society [1], excessive population agglomeration may bring negative externalities such as traffic congestion, environmental problems, and efficiency decline [2]. An effective Traffic Signal Control (TSC) strategy, which is a key component in smart city and intelligent transportation systems (ITS), is urgently required to manage and regulate the movement of vehicles [3]. Real-world traffic conditions are influenced by a multitude of factors, including driver preferences, interactions with various road users (such as pedestrians and cyclists), as well as environmental factors like weather and road infrastructure. This complexity makes it challenging to tune traffic signals dynamically based on real-time traffic conditions [4].

Recently, the development of Internet of Things (IoT) technology has enhanced the connection between Connected Autonomous Vehicles (CAVs) and road infrastructure, especially traffic lights, solving the problem of insufficient information exchange in existing intelligent transportation systems (ITS) under dynamic traffic conditions [5]. ITS relies on advanced Artificial Intelligence (AI) algorithms, information communication [6], Internet of Things (IoT), edge computing [7], [8], and various automated control systems. It is expected to become a promising tool for improving efficiency and sustainability in the field of intelligent transportation [9]. As shown in Figure 1, cooperative driving between CAVs and AIM can improve road efficiency and traffic safety. In the future, multiple CAVs will negotiate their movements in the conflict region of the intersection without any signal head [10], especially in the key areas where traffic flows from different directions or types meet on the crossing area and may lead to conflicts due to factors like visibility, speed, and trajectory. In the driving environment, more real-time traffic information will be obtained, achieving more efficient and safe driving than traffic lights and manually driven vehicles [11]. During this period, traditional TSC systems need to address more challenges to facilitate a smooth transition and effectively adapt to technological advancements.

Many researchers have illustrated the potential of reinforcement learning (RL) as a data-driven approach for learning optimal policies by leveraging real-time traffic conditions [12]. Within the domain of TSC using RL (RL-TSC), a variety of MARL-based methods have been suggested, including fully centralized, fully decentralized, and centralized training with decentralized execution (CTDE) paradigms. As shown in Figure 2, centralized methods send all collected data to a central agent, which uses the obtained global state to optimize the global policy. This leads to a dimensionality disaster, imposing significant computational and communication burdens on the system. In contrast, in decentralized methods, each agent makes decisions independently, offering better scalability [13].

Jing Shang, Shunmei Meng and Qianmu Li are with the School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210049, China (e-mail: [email protected]; [email protected]; [email protected]).
Jun Hou is with the School of Social Science, Nanjing Vocational University of Industry Technology, Nanjing 210046, China (e-mail: [email protected]).
Xiaoran Zhao is with the School of Computer Science, Qufu Normal University, Qufu 273165, China (e-mail: [email protected]).
Xiaokang Zhou is with the Faculty of Business and Data Science, Kansai University, Osaka Prefecture 564-8680, Japan (e-mail: [email protected]).
Rong Jiang is with the Yunnan Key Laboratory of Service Computing, Yunnan University of Finance and Economics, Kunming 650399, China (e-mail: [email protected]).
Lianyong Qi is with the College of Computer Science and Technology, China University of Petroleum (East China), Qingdao 266580, China (e-mail: [email protected]).
Qianmu Li and Xiaokang Zhou are the corresponding authors.
However, this approach struggles to capture the complex interactions and collaborations between agents. Collaborative methods can leverage mutual communication and cooperation among agents to optimize global network traffic objectives, learning strategies directly through trial and error. For example, [14] enhances coordination between agents by reducing pressure at intersections, and obtains more comprehensive information by expanding the observation range of target agents. Previous work [15] has also used transfer learning to guide collaboration between agents, but the well-trained model must undergo additional training for different scenarios to obtain better performance. CTDE has recently become a popular framework for cooperative multi-agent reinforcement learning (MARL), where agents can use additional global state information to guide training in a centralized manner and make their own decisions based solely on decentralized local policies. Additionally, most existing research uses simple Q-learning algorithms to train MARL models, such as [16], [14], [17], [18]. However, the optimal solutions obtained from these studies are deterministic strategies that cannot adapt to dynamic changes in traffic flow.

To tackle the previous constraints, we introduce a novel approach known as Multi-Agent Graph-based Soft Actor-Critic Traffic Signal Control (MAGSAC). Operating within the CTDE paradigm, MAGSAC employs centralized critics to guide policy development. To extend the communication domain of individual traffic lights, we have introduced a graph-based learning communication mechanism to capture the information interaction between signal lights. Based on the local observations of all agents, we use a multi-head attention mechanism to filter out unimportant data, thereby facilitating the local integration of global information. Moreover, we utilize the Soft Actor-Critic (SAC) algorithm to explore a broader solution space. The principal contributions of our work are outlined as follows:

• We propose a Multi-Agent Graph-based Soft Actor-Critic (MAGSAC) approach for urban traffic scenarios. The TSC problem is formulated as a partially observable Markov decision process (POMDP). We describe the observation space, action space, and reward function for each agent based on MARL, and employ the CTDE paradigm to enhance the efficiency of the TSC system. This eliminates the need for deterministic traffic model assumptions and allows for the learning of TSC strategies within dynamic traffic environments.

• Technically, the proposed MAGSAC approach utilizes graph-based state encoding to capture the mutual influences within the graph structure of traffic signals, enabling collaboration among MARL agents while avoiding the curse of dimensionality. We innovatively employ a multi-head attention mechanism to identify cooperative interactions between traffic signals and effectively discern the dependencies among agents from the constantly changing traffic states, thereby filtering out irrelevant information.

• We conducted extensive experiments on the CityFlow simulator using both synthetic and real-world traffic datasets. The experimental results demonstrate that our proposal outperforms the baseline algorithms in terms of average travel time and converges more quickly. Moreover, in complex large-scale road scenarios, MAGSAC demonstrates notably superior performance compared to other methods, highlighting its robust generalization capabilities for complex urban TSC problems.

The structure of the remainder of this paper unfolds as follows. Section II delves into a comprehensive survey of existing traffic signal control research. The setup for our approach is given in Section III and the system overview is elaborated. The implementation of the proposed method is detailed in Section IV. Section V showcases a comparative analysis of diverse RL algorithms, alongside a discussion of experimental findings. Finally, Section VI encapsulates the findings and concludes the paper.

II. RELATED WORKS

A. Traffic signal control

Optimized traffic signals can improve intersection safety, and using edge computing can more efficiently control traffic conflicts between vehicle and pedestrian movements within an area [9]. Intelligent TSC algorithms are the key to improving
agents through broadcast communications. However, there are limitations to this approach. First, communication between agents can exacerbate dimensionality challenges in large-scale urban TSC. Second, the limited communication bandwidth prevents the agent from obtaining sufficient information from other agents, resulting in low communication efficiency. Third, redundant information exchange can lead to compromised learning. Previous works on cooperative TSC, such as MADDPG [37] and MAPPO [38], have successfully addressed the challenges in TSC based on MARL. [34] studied the A2C-based MARL algorithm under limited neighborhood communication and proposed a spatial discount factor to attenuate the impact of state and reward signals originating from other agents. Although this work enables agents to communicate and exchange messages during execution, it overlooks the intricate relationships between agents, leading to subpar performance and ineffective communication.

Graph Neural Networks (GNNs) are a deep embedding framework for MARL to process data with a graph structure and generalize relations among nodes. Obviously, urban roads can be regarded as a graph, and the traffic lights on the roads can be regarded as nodes in the graph. Multi-agent systems can use graph neural networks to process input data with graph structures to describe information interactions between agents. Studies like Colight [29] and GPlight [39] have demonstrated the high effectiveness of GNNs in facilitating collaboration in TSC. GraphLight [40] employs a Graph Convolutional Network (GCN) to capture the features of dynamic traffic networks and employs a spatial discount factor to adjust the states of neighboring agents, but this practice is not good enough. [41] proposed to leverage GCN to capture interactions between nodes and simplify the complex node update process by identifying key targets, thereby improving the quality and performance of decision-making in reinforcement learning. The Graph Attention Network (GAT) is an effective variant of GCN, which introduces an attention mechanism in the neighbor aggregation operation between nodes [42]. Recently, GAT has found extensive applications in computer vision [43], 3D point clouds [44], natural language processing [45] and so on. GAT learns node embedding representations in the road network and weights and aggregates neighbor nodes through multiple attention heads to reduce the neural network's attention to non-critical node information. Therefore, in the process of handling multi-agent interactions, GAT can assist agents in comprehending the system's status and mitigating the influence of internal interference information on the agents' decision-making processes.

III. PROBLEM FORMULATION

In this section, we start by presenting the standard formulation of TSC to better understand our approach. Then we give a brief introduction to the basic building blocks for TSC.

A. Partially Observable Markov Processes

We model the TSC problem as a partially observable Markov decision process (POMDP), which can be defined as a tuple ⟨N, S, O, Ω, A, R, P, γ⟩. The components of the POMDP are presented in Table I.

TABLE I: POMDP elements.
N    Number of traffic light controllers
S    State space
O    Observation set
Ω    Local observation set
A    Action set
R    Reward function
P    State transition probability
γ    Discount factor

During time step t, every agent i ∈ N acquires a local observation o_i^t ∈ O_i and then executes an action a_i^t ∈ A_i, taking into account both the local observation and the information passed by other agents. Then the environment state is transferred from s to a new state s′ and a shared reward r_t is received. After trial and error, the agent can perform the best action to maximize long-term rewards in dynamic traffic states. The collective goal of the N agents is to iteratively refine their individual policies π_i to maximize the expected cumulative discounted reward G_t = \sum_{t=0}^{T} \gamma^t r_t, where T represents the time horizon and γ ∈ [0, 1]. The agents are tasked with learning the optimal policy π* : S → A that effectively maps states to actions.

B. Graph neural networks

In the standard formulation of TSC, as shown in Figure 3, a graph G = (V, E) is utilized. Here, v_i ∈ V represents a traffic signal controller and e(i, j) ∈ E denotes the connection lane between v_i and v_j. Each node v_i is associated with its unique feature vector h_i, and when combined, these vectors constitute the feature matrix H ∈ R^{N×P}, where P is the total number of features for each node. The adjacency matrix E ∈ R^{N×N} captures the interconnectivity among nodes.

Fig. 3: A real urban road network exemplar in New York (each node corresponds to an agent, and the connections between nodes represent agent-agent interactions).

Graph attention networks represent an advanced class of GNNs that are particularly adept at handling data with graph-like structures. The graph attention network is designed to learn and produce an embedded representation of a node, h'_i = \sigma\left(\sum_{j \in N_i} \alpha_{ij} W h_j\right). In this expression, \alpha_{ij} = \mathrm{softmax}_j(e_{ij}) is the attention coefficient, calculated using a softmax function, which normalizes the unscaled attention scores e_{ij} across all neighboring nodes N_i. The attention score e_{ij} = a(W h_i, W h_j) is computed based on a function a that takes the embeddings of nodes i and j, weighted by a matrix W, to determine the relevance of node j to node i.
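To make the node update concrete, the following PyTorch snippet is a minimal sketch of a single-head graph attention layer computing h'_i = σ(Σ_{j∈N_i} α_ij W h_j). The layer sizes, the ELU nonlinearity, and the assumption that the adjacency matrix contains self-loops are our own illustrative choices, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GATLayer(nn.Module):
    """Single-head graph attention: h'_i = sigma( sum_{j in N_i} alpha_ij * W h_j )."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)  # shared projection W
        self.a = nn.Linear(2 * out_dim, 1, bias=False)   # scoring function a(W h_i, W h_j)

    def forward(self, H: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # H: (N, in_dim) node feature matrix; adj: (N, N) adjacency with self-loops
        Wh = self.W(H)                                   # (N, out_dim)
        N = Wh.size(0)
        # pairwise unscaled attention scores e_ij = a(W h_i, W h_j)
        pairs = torch.cat([Wh.unsqueeze(1).expand(N, N, -1),
                           Wh.unsqueeze(0).expand(N, N, -1)], dim=-1)
        e = self.a(pairs).squeeze(-1)                    # (N, N)
        e = e.masked_fill(adj == 0, float("-inf"))       # keep only neighbours N_i
        alpha = F.softmax(e, dim=-1)                     # attention coefficients alpha_ij
        return F.elu(alpha @ Wh)                         # aggregated embeddings h'_i

# Example on a 7-node network like Fig. 3 (random features, chain links plus self-loops).
H = torch.randn(7, 8)
adj = torch.eye(7) + torch.diag(torch.ones(6), 1) + torch.diag(torch.ones(6), -1)
out = GATLayer(8, 16)(H, (adj > 0).float())              # (7, 16) node embeddings
```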
Fig. 5: Architecture of MAGSAC. The global critic of MAGSAC consists of two critic networks, where the target critic network is used to stabilize the training process.
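The following sketch illustrates the critic arrangement suggested by the caption: an online critic Q_ϕ(s, a) paired with a slowly updated target copy that evaluates the bootstrapped target Q_i(s', a'). The MLP shape, hidden size, and soft-update coefficient are assumptions made for illustration; they are not the authors' exact networks.

```python
import copy
import torch
import torch.nn as nn

class GlobalCritic(nn.Module):
    """Online critic plus a target copy, following the structure sketched in Fig. 5."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 128):
        super().__init__()
        self.q = nn.Sequential(nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
                               nn.Linear(hidden, 1))
        self.q_target = copy.deepcopy(self.q)   # evaluates Q_i(s', a') for stable targets

    def forward(self, s: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
        return self.q(torch.cat([s, a], dim=-1))

    @torch.no_grad()
    def soft_update(self, rho: float = 0.995):
        # target <- rho * target + (1 - rho) * online, cf. the update step in Algorithm 1
        for p_t, p in zip(self.q_target.parameters(), self.q.parameters()):
            p_t.mul_(rho).add_((1.0 - rho) * p)
```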
B. Graph-based Learning

As depicted in Figure 3, the road network is interlaced with traffic lights that are not only structured but also interdependent, forming a graph-based agent network. In a multi-agent cooperative framework, receiving information from all other agents can be bandwidth-intensive. An effective method is for each agent to exchange information with its closest neighbors to broaden its observation domain. To further refine communication and enhance cooperation among agents, as shown in Figure 6, we leverage a graph learning module to integrate the status of neighboring traffic signals and extract high-order hidden features. This module employs attention mechanisms and LSTM to learn cooperation between intersections and filter out unimportant node information. To extract advanced interaction representations, we leverage multi-head attention to capture information from different subspaces, which aids in learning the global optimal strategy at each intersection.

Considering the encoded signal light observations as sequential data, we input the features into a Long Short-Term Memory (LSTM) network to capture the temporal dependencies among traffic lights.

h_i, c_i = \mathrm{LSTM}(e_i, h'_i, c'_i)   (6)

Here, h_i and c_i represent the hidden state and cell state of the current time step, and h'_i and c'_i represent the hidden state and the cell state of the previous time step, respectively. Within a dynamic environment, the interaction relationships between agents are constantly changing. Consequently, the historical traffic conditions encapsulated by the LSTM are crucial for refining each agent's policy. For distinct traffic light controllers labeled i and j, the LSTM-generated state vectors h_i and h_j are amalgamated into a composite vector f(h_i, h_j). This vector is then fed into an MLP to yield a binary outcome. Such an outcome serves to ascertain the connectivity between various traffic lights, enabling the dynamic selection of informative neighbors for each traffic light controller, while simultaneously discarding less relevant ones.

x_{i,j} = \begin{cases} 1, & \text{if } \mathrm{MLP}(f(h_i, h_j)) \ge \zeta \\ 0, & \text{otherwise} \end{cases}   (7)

Here, ζ is the threshold used to determine whether the output of the MLP should be considered as 1 or 0. In our experimental setup, ζ is set to 0.5. If x_{i,j} = 0, it means that there is no interaction between traffic lights i and j; if x_{i,j} = 1, it indicates that there is an interaction between them. For each traffic light in the network, we can construct an adjacency matrix with A_{ij} = x_{i,j}.

Building upon the currently obtained node features that incorporate historical information and adjacency matrices, we employ MHA to capture the temporal connections between nodes from the dynamically changing road environment. This involves projecting the input features of each node into independent attention heads,

q_i^m = W_q^m h_i, \quad k_i^m = W_k^m h_i, \quad v_i^m = W_v^m h_i   (8)

where W_q^m, W_k^m, and W_v^m convert h_i into the corresponding query, key, and value vectors, respectively. Within attention head m, the attention weights between agent i and agent j can be calculated as:

e_{ij}^m = \begin{cases} \tau \cdot (W_q^m h_i) \cdot (W_k^m h_j)^T, & A_{ij} = 1 \\ 0, & A_{ij} = 0 \end{cases}   (9)

Here, τ signifies the scaling factor utilized in the process, and A_{ij} is the adjacency matrix entry that denotes the connection relationship between node i and node j in the road network. Then the attention score between these two nodes is [46]:

\alpha_{ij}^m = \mathrm{softmax}(e_{ij}^m) = \frac{\exp(e_{ij}^m)}{\sum_{j \in N_i} \exp(e_{ij}^m)}   (10)

Furthermore, after capturing the dependencies between traffic lights and the attention scores of each agent, a weighted sum of the values of other agents is computed for each attention head.
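A compact sketch of Eqs. (6)-(10) is given below: an LSTM cell encodes each controller's observation history, an MLP thresholded at ζ produces the learned adjacency, and masked per-head attention scores are computed over that adjacency. The sigmoid output of the edge MLP, the added self-connections, and all layer sizes are illustrative assumptions rather than the authors' exact design; the value projections and the concatenation of Eq. (11) would consume the returned weights.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphLearningModule(nn.Module):
    """Sketch of Eqs. (6)-(10): LSTM encoding, learned adjacency, masked multi-head attention."""
    def __init__(self, obs_dim: int, hid: int = 64, heads: int = 4, zeta: float = 0.5):
        super().__init__()
        self.lstm = nn.LSTMCell(obs_dim, hid)                 # Eq. (6): temporal encoding per agent
        self.edge_mlp = nn.Sequential(nn.Linear(2 * hid, hid), nn.ReLU(),
                                      nn.Linear(hid, 1), nn.Sigmoid())
        self.Wq = nn.Linear(hid, hid * heads, bias=False)     # W_q^m for all heads at once
        self.Wk = nn.Linear(hid, hid * heads, bias=False)     # W_k^m
        self.heads, self.hid, self.zeta = heads, hid, zeta

    def forward(self, e, h_prev, c_prev):
        # e: (N, obs_dim) encoded observations; h_prev, c_prev: (N, hid) previous LSTM states
        h, c = self.lstm(e, (h_prev, c_prev))                              # Eq. (6)
        N = h.size(0)
        # Eq. (7): pairwise connectivity from the composite vectors f(h_i, h_j)
        pair = torch.cat([h.unsqueeze(1).expand(N, N, -1),
                          h.unsqueeze(0).expand(N, N, -1)], dim=-1)
        A = (self.edge_mlp(pair).squeeze(-1) >= self.zeta).float()         # adjacency A_ij = x_ij
        A = A.fill_diagonal_(1.0)   # keep a self-connection so the softmax below is well defined
        # Eqs. (8)-(10): masked, scaled attention weights per head
        q = self.Wq(h).view(N, self.heads, self.hid)
        k = self.Wk(h).view(N, self.heads, self.hid)
        scores = torch.einsum("imd,jmd->mij", q, k) / self.hid ** 0.5      # tau-scaled e_ij^m
        scores = scores.masked_fill(A.unsqueeze(0) == 0, float("-inf"))    # restrict to N_i
        alpha = F.softmax(scores, dim=-1)                                   # alpha_ij^m, Eq. (10)
        return h, c, A, alpha
```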
The outputs from the M attention heads pertaining to agent i are combined through concatenation. This concatenated vector then undergoes processing by an activation function σ(·), which is realized through a fully-connected layer with a ReLU as its nonlinearity. The result is the final output of the convolutional layer.

h'_i = \sigma\left(\mathrm{concat}\left(\sum_{j \in N_i} \alpha_{ij}^m h_j, \forall m \in M\right)\right)   (11)

Here, M represents the number of independent attention mechanisms. The outputs emanating from these attention mechanisms are concatenated to culminate in the ultimate feature vector h'_i.

where ρ^π represents the state distribution under the policy π; α is a smoothing factor deployed to strike a balance between the rewards and the entropy within the policy. The larger the parameter α, the more the strategy converges to the optimal strategy. The entropy of policy π is denoted as H(π(· | s_t)), quantifying the uncertainty of the policy given a specific state s_t.

As per the principles of the Bellman equation, the soft action-value function Q_{soft} and the soft state-value function V_{soft} are respectively defined as:

V_{soft}(s_t) = \mathbb{E}_{a_t \sim \pi}\left[Q_{soft}(s_t, a_t) - \alpha \log \pi(a_t \mid s_t)\right]   (14)

Q_{soft}(s_t, a_t) = r(s_t, a_t) + \gamma \mathbb{E}_{s_{t+1} \sim \rho^\pi}\left[V_{soft}(s_{t+1})\right]   (15)
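Under these definitions, the bootstrapped target used to regress the critic can be sketched as follows. The q_target and policy.sample interfaces are hypothetical placeholders (a target critic returning Q values, and a policy returning an action together with its log-probability); α and γ take the values later listed in Table II.

```python
import torch

def soft_q_target(q_target, policy, r, s_next, alpha: float = 0.001, gamma: float = 0.9):
    """Sketch of Eqs. (14)-(15): entropy-regularized state value and one-step soft Q target."""
    with torch.no_grad():
        a_next, log_prob = policy.sample(s_next)              # a' ~ pi(.|s') and log pi(a'|s')
        v_next = q_target(s_next, a_next) - alpha * log_prob  # Eq. (14): V_soft(s_{t+1})
        return r + gamma * v_next                             # Eq. (15): target for Q_soft(s_t, a_t)
```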
We define Q_ϕ(s, a) as the critic network parameterized by ϕ, and π_θ(a | s) as the actor network parameterized by θ. To facilitate experience replay, it is typical to store the agent's experience e = (s_t, a_t, r_t, s_{t+1}) at each time step into the replay buffer D = {(s_t, a_t, r_t, s_{t+1})}. During
collective rewards of all agents' actions to guide the actor's training. In the testing process, each agent's actor network relies solely on local observations to execute actions. The algorithm mainly includes three stages: experience collection, graph-based state encoding, and network training. Before the MAGSAC algorithm starts running, the network parameters θ, θ̄, ϕ are first initialized, and the replay buffer D is cleared at the same time. During each iteration, each node aggregates the information of critical neighbors based on graph-based learning and generates new node features h'_i. Each agent i interacts with the traffic environment, performs the generated action, and then stores tuples into the replay buffer D. Training starts when the data in the buffer reaches the threshold; a mini-batch of experience B is sampled from the replay buffer at each step, and stochastic gradient descent is used to update the neural network parameters.

Algorithm 1: MAGSAC
1  Initialize all neural network parameters and the replay buffer D;
2  for each episode do
3      Reset the traffic environment;
4  end
5  for each agent i ∈ {1, · · · , N} do
6      Observe the global state s and its individual observation o_i;
7      for t = 1 to max-episode-length do
8          Select action a_i^t ∼ π_ϕ(a_i^t | τ_i^t);
9          Encode the observation as representation e_i;
10         Calculate the adjacency matrix by Eq. (7);
11         Learn spatial dependencies and node features h'_i by Eq. (11);
12         Execute actions a = (a_1, a_2, . . . , a_N) and interact with the environment, obtain the global reward r and transition to the next global state s';
13         D ← D ∪ {(s_t, o_i^t, a_t, r_t, s_{t+1}, o_i^{t+1})};
14     end
15     for each RL training step do
16         Sample a random minibatch uniformly from D;
17         Update the critic network by minimizing the loss function by Eq. (16);
18         Update the actor network via gradient descent by Eq. (18), θ̄ ← ρθ̄ + (1 − ρ)θ, where ρ is the update ratio;
19         Update hyper-parameters;
20         Update all target network parameters for each agent by Eq. (17), ϕ ← ϕ − λ∇̂_ϕ J(ϕ), where λ is the learning rate;
21     end
22 end
23 return Q_θ, π_ϕ
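The listing below restates Algorithm 1 as a Python skeleton to make the control flow explicit: collect experience, encode it through the graph module, and update the networks from replayed minibatches. All interfaces (env, agent.encode/act/update, critic.update/soft_update) are hypothetical placeholders, and the loss computations of Eqs. (16)-(18) are assumed to live inside the update methods.

```python
import random
from collections import deque

def train_magsac(env, agents, critic, episodes: int = 100, batch_size: int = 20,
                 buffer_size: int = 10000, warmup: int = 1000):
    """Skeleton of Algorithm 1: experience collection, graph-based encoding, network training."""
    replay = deque(maxlen=buffer_size)                     # replay buffer D
    for _ in range(episodes):
        state, obs = env.reset()                           # global state s and local observations o_i
        done = False
        while not done:
            # graph-based encoding: learned adjacency (Eq. 7) and node features (Eq. 11)
            feats = [agent.encode(o) for agent, o in zip(agents, obs)]
            actions = [agent.act(f) for agent, f in zip(agents, feats)]
            next_state, next_obs, reward, done = env.step(actions)
            replay.append((state, obs, actions, reward, next_state, next_obs))
            state, obs = next_state, next_obs
            if len(replay) >= warmup:                      # training starts once the buffer is filled
                batch = random.sample(replay, batch_size)
                critic.update(batch)                       # minimize the critic loss (Eq. 16)
                for agent in agents:
                    agent.update(batch, critic)            # actor gradient step (Eq. 18)
                critic.soft_update()                       # target update with ratio rho (Eq. 17)
    return agents, critic
```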
V. EXPERIMENT RESULTS AND COMPARISON

Within this section, we detail our experimental approach and assess the performance of our proposed MAGSAC approach using both synthetic and real-world datasets. The model's performance outcomes verify the robustness and applicability of our model.

A. Experiment Settings

The experiments were performed on an Intel Core i5 2.40 GHz CPU with 8 GB RAM and an Intel Iris Plus Graphics 1536 MB graphics card. We employed CityFlow, an open-source traffic simulation platform, for our experimental framework. It provides a Python interface to design large-scale urban traffic scenarios and manage traffic signals. Figure 8 illustrates that the simulator offers access to real-time traffic status, such as vehicle counts, velocities, and the current signal phase.

The experimental settings include road network settings and traffic flow settings in JSON format. In the simulated road network, each intersection features four three-lane roads, with each road spanning 500 meters and a speed limit of 15 meters/second assigned to all lanes. Vehicles can turn right without conflict, and a combined yellow and full red time of 5 seconds is followed to clear the intersection at each signal light phase transition. We define that when the speed of a vehicle is lower than 0.3 meters/second, it is considered to be waiting in the queue. This is a typical setting that appears in many related studies.

TABLE II: The parameters of each agent.
Parameter                             Value
Smooth factor α                       0.001
Discount rate γ                       0.9
Batch size B                          20
Replay buffer size                    10000
Learning rate for the actor, lr_a     0.0003
Learning rate for the critic, lr_c    0.001
Time steps                            3600

In the MAGSAC approach proposed in this article, each agent controls an intersection, and each agent uses the same parameters, as shown in Table II. To encourage the agent to consider long-term cumulative rewards, the discount factor γ is set to 0.9. A smooth factor α is set to 0.001 for updating the network's parameters; a smaller smooth factor makes the learning process more robust. The learning rates lr_a and lr_c for the actor and critic networks are set to 0.0003 and 0.001, respectively, ensuring the stability of policy updates and value estimation. Additionally, the replay buffer size is set to 10,000 for storing experience data, and the batch size is set to 20 to balance the use of memory and computational resources.

1) Datasets: This paper used two kinds of datasets: a synthetic traffic grid dataset and a real-world traffic dataset. We count their characteristics and present the details in Table III. The source datasets can be accessed through https://traffic-signal-control.github.io/#open-datasets.

Synthetic Dataset: The synthetic datasets are generated artificially and include three scenarios: Grid1×3 (1 row, 3 columns), Grid3×3 and Grid6×6 traffic environments. As shown in Figure 7, all vehicles are directed to enter and exit through the rim edges of the network. For the Grid1×3
(a) Grid1×3, 3 traffic lights. (b) Grid3×3, 9 traffic lights. (c) Grid6×6, 36 traffic lights.
Fig. 7: The traffic grids for the synthetic datasets: Grid1×3, Grid3×3, Grid6×6.
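As a reference for how such scenarios are driven, the snippet below shows one way to step a CityFlow simulation and count queued vehicles with the 0.3 m/s threshold described above. The config file name and the phase-setting call site are placeholders; only documented CityFlow API calls are used.

```python
import cityflow

# The config references the road network and traffic flow JSON files described above.
eng = cityflow.Engine("config.json", thread_num=1)

WAIT_SPEED = 0.3      # m/s: below this speed a vehicle is counted as waiting in the queue
EPISODE_STEPS = 3600  # simulated seconds per episode (Table II)

for t in range(EPISODE_STEPS):
    # an agent's chosen phase would be applied here, e.g. eng.set_tl_phase(intersection_id, phase)
    eng.next_step()
    speeds = eng.get_vehicle_speed()                # {vehicle_id: speed in m/s}
    queued = sum(1 for v in speeds.values() if v < WAIT_SPEED)
    lane_counts = eng.get_lane_vehicle_count()      # {lane_id: number of vehicles on the lane}
```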
(a) Jinan (China), 12 traffic lights. (b) Hangzhou (China), 16 traffic lights. (c) New York (USA), 48 traffic lights.
Fig. 9: The city maps for the real-world datasets: Hangzhou, Jinan, and New York.
TABLE IV: Performance results for different methods on various datasets. For ATT and AQL, lower is better.

Average travel time (ATT)
Model         RL-based   Grid1×3   Grid3×3   Grid6×6   Jinan3×4   Hangzhou4×4   NewYork16×3
FT            ×          73.51     105.48    210.93    814.11     718.29        1842.20
SOTL          ×          78.14     132.50    267.34    1459.45    1215.28       1887.85
MaxPressure   ×          76.35     99.69     186.56    361.33     407.17        392.31
PressLight    √          54.85     85.59     153.75    343.90     549.84        380.69
CoLight       √          46.05     89.27     128.76    318.75     336.71        229.66
MAGSAC        √          47.95     84.63     116.47    286.81     250.28        168.29
dataset. The performance gap is smaller in the simulated datasets because the artificial datasets are simpler and allow for optimal solutions.

2) Training performance: To substantiate the training performance of our proposed approach, Figure 10 and Figure 11 illustrate a comparison of the training performance between MAGSAC and other RL-based methods on the Jinan3×4 dataset. It is obvious that MAGSAC obtains the best convergence speed with good sampling efficiency and training efficiency. In simple traffic scenarios, MAGSAC achieves similar performance to other RL-based methods, but in complex real traffic scenarios, MAGSAC is significantly better than other RL-based methods. This shows that MAGSAC is more stable and adaptable in complex urban TSC problems.
more intricate road networks, exemplified by NewYork16×3, MAGSAC can still train its own traffic light strategy based on the key neighborhood information around it, which shows that our method is scalable.

Fig. 12: The training times of different models for 100 episodes. MAGSAC is efficient in all scenarios.

Finally, we evaluated the impact of the number of attention heads in the MHA mechanism on MAGSAC. We tested the performance of the MAGSAC and Colight models with different numbers of attention heads on the Jinan3×4 dataset. As shown in Figure 13, when the number of heads increased from 1 to 4, the travel time of the model decreased significantly. However, when the number of heads continued to increase further, the improvement in travel time was no longer apparent, and travel time even increased. This is due to an excess of attention heads leading to overfitting on the training data, making it difficult for the model to learn key features, while also increasing the computational burden on the model, ultimately affecting its decision-making ability. This shows that an appropriate number of attention heads can achieve a better traffic light control strategy.

various intersections. Then, the model incorporates a multi-layer attention mechanism that extracts crucial information from other traffic signals. This not only expands the observation domain of each traffic light but also greatly reduces the communication volume. In this way, the model can adapt to complex road networks and varying traffic conditions. During model training, we employ a CTDE architecture, which is conducive to more efficient and stable learning. Extensive experiments utilizing both synthetic and real-world data within the CityFlow simulation platform have been conducted. The results demonstrate that our proposed method has superior performance, with noticeable improvements in travel time and training efficiency.

However, there are also some limitations. For instance, the current research is still conducted in idealized simulated environments and specific scenarios, which requires further validation of the impact of more complex traffic conditions and external factors on MAGSAC. With the development of emerging technologies such as the IoT and edge computing, real-time traffic data can be easily obtained. In the future, we plan to extend our research methodology to different scenarios including urban, suburban, or rural areas, and to design various signal light control strategies for different levels of traffic flow sparsity to enhance the model's generalization capabilities. Additionally, to analyze how cooperation strategies between traffic signals change with the increase in network scale, we will conduct further performance, ablation, and parametric experiments on different datasets to determine the impact of the number of attention heads and the GAT network structure on model performance.

ACKNOWLEDGMENTS

This work is supported in part by the Frontier Technologies R&D Program of Jiangsu (No. BF2024071), the Youth Talent Support Program of the Jiangsu Association for Science and Technology (No. JSTJ-2024-430), the Open Research Project of the State Key Laboratory of Novel Software Technology (No. KFKT2022B28), the Special Funds for Technology Transfer in Jiangsu Province (No. BA2022011) and the Special Funds for Industrial and Information Industry Transformation and Upgrading in Jiangsu Province (Project Name: Key Technologies and Industrialization of Industrial Internet Terminal Threat Detection and Response System).
[5] S. Dhelim, T. Kechadi, L. Chen, N. Aung, H. Ning, and L. Atzori, "Edge-enabled metaverse: The convergence of metaverse and mobile edge computing," arXiv preprint arXiv:2205.02764, 2022.
[6] X. Xu, H. Li, Z. Li, and X. Zhou, "Safe: Synergic data filtering for federated learning in cloud-edge computing," IEEE Transactions on Industrial Informatics, vol. 19, no. 2, pp. 1655–1665, 2022.
[7] H. Li, M. Sun, F. Xia, X. Xu, and M. Bilal, "A survey of edge caching: Key issues and challenges," Tsinghua Science and Technology, vol. 29, no. 3, pp. 818–842, 2023.
[8] Y. Qu, L. Ma, W. Ye, X. Zhai, S. Yu, Y. Li, and D. Smith, "Towards privacy-aware and trustworthy data sharing using blockchain for edge intelligence," Big Data Mining and Analytics, vol. 6, no. 4, pp. 443–464, 2023.
[9] Z. Liu, X. Xu, F. Han, Q. Zhao, L. Qi, W. Dou, and X. Zhou, "Secure edge server placement with non-cooperative game for internet of vehicles in web 3.0," IEEE Transactions on Network Science and Engineering, 2023.
[10] H. A. Aziz, H. Wang, S. Young, J. Sperling, and J. M. Beck, "Synthesis study on transitions in signal infrastructure and control algorithms for connected and automated transportation," Washington, DC: DOE, 2017.
[11] Y. Huang, Y. J. Li, and Z. Cai, "Security and privacy in metaverse: A comprehensive survey," Big Data Mining and Analytics, vol. 6, no. 2, pp. 234–247, 2023.
[12] Z. Li, X. Xu, T. Hang, H. Xiang, Y. Cui, L. Qi, and X. Zhou, "A knowledge-driven anomaly detection framework for social production system," IEEE Transactions on Computational Social Systems, 2022.
[13] R. Yan, Y. Zheng, N. Yu, and C. Liang, "Multi-smart meter data encryption scheme based on distributed differential privacy," Big Data Mining and Analytics, vol. 7, no. 1, pp. 131–141, 2023.
[14] H. Wei, C. Chen, G. Zheng, K. Wu, and Z. Li, "Presslight: Learning max pressure control to coordinate traffic signals in arterial network," ACM, 2019.
[15] H. Ge, D. Gao, L. Sun, Y. Hou, C. Yu, Y. Wang, and G. Tan, "Multi-agent transfer reinforcement learning with multi-view encoder for adaptive traffic signal control," IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 8, pp. 12572–12587, 2021.
[16] Q. Zheng, H. Xu, J. Chen, D. Zhang, K. Zhang, and G. Tang, "Double deep q-network with dynamic bootstrapping for real-time isolated signal control: A traffic engineering perspective," Applied Sciences, vol. 12, no. 17, p. 8641, 2022.
[17] X. Zang, H. Yao, G. Zheng, N. Xu, K. Xu, and Z. Li, "Metalight: Value-based meta-reinforcement learning for traffic signal control," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 01, 2020, pp. 1153–1160.
[18] W. Genders and S. Razavi, "Asynchronous n-step q-learning adaptive traffic signal control," Journal of Intelligent Transportation Systems, vol. 23, no. 4, pp. 319–331, 2019.
[19] P. Koonce and L. Rodegerdts, "Traffic signal timing manual," United States Federal Highway Administration, Tech. Rep., 2008.
[20] R. E. Allsop, "Delay at a fixed time traffic signal—i: Theoretical analysis," Transportation Science, vol. 6, no. 3, pp. 260–285, 1972.
[21] F. V. Webster, "Traffic signal settings," Road Research Technical Paper, vol. 39, 1958.
[22] S.-B. Cools, C. Gershenson, and B. D'Hooghe, "Self-organizing traffic lights: A realistic simulation," Advances in Applied Self-Organizing Systems, pp. 45–55, 2013.
[23] S. A. Celtek, A. Durdu, and M. E. M. Alı, "Real-time traffic signal control with swarm optimization methods," Measurement, vol. 166, p. 108206, 2020.
[24] H. Lin, Y. Han, W. Cai, and B. Jin, "Traffic signal optimization based on fuzzy control and differential evolution algorithm," IEEE Transactions on Intelligent Transportation Systems, 2022.
[25] S. A. Fayazi and A. Vahidi, "Mixed-integer linear programming for optimal scheduling of autonomous vehicle intersection crossing," IEEE Transactions on Intelligent Vehicles, vol. 3, no. 3, pp. 287–299, 2018.
[26] J. Ruan, J. Tang, G. Gao, T. Shi, and A. Khamis, "Deep reinforcement learning-based traffic signal control," in 2023 IEEE International Conference on Smart Mobility (SM). IEEE, 2023, pp. 21–26.
[27] H. Wei, G. Zheng, H. Yao, and Z. Li, "Intellilight: A reinforcement learning approach for intelligent traffic light control," in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018, pp. 2496–2505.
[28] H. Mei, X. L. Lei, L. D. Da, B. Shi, and H. Wei, "Libsignal: An open library for traffic signal control," 2022.
[29] H. Wei, N. Xu, H. Zhang, G. Zheng, X. Zang, C. Chen, W. Zhang, Y. Zhu, K. Xu, and Z. Li, "Colight: Learning network-level cooperation for traffic signal control," in Proceedings of the 28th ACM International Conference on Information and Knowledge Management, 2019, pp. 1913–1922.
[30] X. Xu, S. Tang, L. Qi, X. Zhou, F. Dai, and W. Dou, "Cnn partitioning and offloading for vehicular edge networks in web3," IEEE Communications Magazine, 2023.
[31] P. Zhou, X. Chen, Z. Liu, T. Braud, P. Hui, and J. Kangasharju, "Drle: Decentralized reinforcement learning at the edge for traffic light control in the iov," IEEE Transactions on Intelligent Transportation Systems, vol. 22, no. 4, pp. 2262–2273, 2020.
[32] J. Gao, X. Xu, L. Qi, W. Dou, X. Xia, and X. Zhou, "Distributed computation offloading and power control for uav-enabled internet of medical things," ACM Transactions on Internet Technology, 2024.
[33] C. Chen, H. Wei, N. Xu, G. Zheng, M. Yang, Y. Xiong, K. Xu, and Z. Li, "Toward a thousand lights: Decentralized deep reinforcement learning for large-scale traffic signal control," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 04, 2020, pp. 3414–3421.
[34] T. Chu, J. Wang, L. Codecà, and Z. Li, "Multi-agent deep reinforcement learning for large-scale traffic signal control," IEEE Transactions on Intelligent Transportation Systems, vol. 21, no. 3, pp. 1086–1095, 2019.
[35] M. Yang, L. Huang, and C. Tang, "K-means clustering with local distance privacy," Big Data Mining and Analytics, vol. 6, no. 4, pp. 433–442, 2023.
[36] W. Lin, M. Zhu, X. Zhou, R. Zhang, X. Zhao, S. Shen, and L. Sun, "A deep neural collaborative filtering based service recommendation method with multi-source data for smart cloud-edge collaboration applications," Tsinghua Science and Technology, vol. 29, no. 3, pp. 897–910, 2023.
[37] T. Wang, J. Cao, and A. Hussain, "Adaptive traffic signal control for large-scale scenario with cooperative group-based multi-agent reinforcement learning," Transportation Research Part C: Emerging Technologies, vol. 125, p. 103046, 2021.
[38] C. Yu, A. Velu, E. Vinitsky, J. Gao, Y. Wang, A. Bayen, and Y. Wu, "The surprising effectiveness of ppo in cooperative multi-agent games," Advances in Neural Information Processing Systems, vol. 35, pp. 24611–24624, 2022.
[39] X. Hu, C. Zhao, and G. Wang, "A traffic light dynamic control algorithm with deep reinforcement learning based on gnn prediction," arXiv preprint arXiv:2009.14627, 2020.
[40] Z. Zeng, "Graphlight: Graph-based reinforcement learning for traffic signal control," in 2021 IEEE 6th International Conference on Computer and Communication Systems (ICCCS). IEEE, 2021, pp. 645–650.
[41] Y. Liu, X. Zhou, H. Kou, Y. Zhao, X. Xu, X. Zhang, and L. Qi, "Privacy-preserving point-of-interest recommendation based on simplified graph convolutional network for geological traveling," ACM Transactions on Intelligent Systems and Technology, 2023.
[42] Y. Liu, Y. Zhang, X. Mao, X. Zhou, J. Chang, W. Wang, P. Wang, and L. Qi, "Lithological facies classification using attention-based gated recurrent unit," Tsinghua Science and Technology, vol. 29, no. 4, pp. 1206–1218, 2024.
[43] M. Munir, W. Avery, and R. Marculescu, "Mobilevig: Graph-based sparse attention for mobile vision applications," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 2210–2218.
[44] C. Chen, L. Z. Fragonara, and A. Tsourdos, "Gapointnet: Graph attention based point neural network for exploiting local feature of point cloud," Neurocomputing, vol. 438, pp. 122–132, 2021.
[45] M. Song, W. Zhao, and E. Haihong, "Kganet: A knowledge graph attention network for enhancing natural language inference," Neural Computing and Applications, vol. 32, pp. 14963–14973, 2020.
[46] A. Vaswani, "Attention is all you need," Advances in Neural Information Processing Systems, 2017.

Jing Shang received the B.S. degree in Communication Engineering from Nanjing Tech University, Nanjing, China, in 2018, and the M.S. degree in Computer Application Technology from Tianjin University of Science and Technology, Tianjin, China, in 2021. She is currently pursuing the Ph.D. degree in Software Engineering with the Nanjing University of Science and Technology, Nanjing, China. Her research interests include cloud/edge computing and the Internet of Vehicles.
Shunmei Meng received her PhD degree from the Department of Computer Science and Technology, Nanjing University, China, in 2016. Now, she is an assistant professor in the Department of Computer Science and Technology, Nanjing University of Science and Technology, Nanjing, China. She has published papers in international journals and international conferences such as TPDS, TII, TKDD, FGCS, AAAI, CIKM, ICDM, ICWS, and ICSOC. Her research interests include recommender systems, cloud computing, and security & privacy.

Lianyong Qi received his Ph.D. degree from the Department of Computer Science and Technology, Nanjing University, China. In 2010, he visited the Department of Information and Communication Technology, Swinburne University of Technology, Australia. Now, he is a professor at the College of Computer Science and Technology, China University of Petroleum (East China), China. His research interests include AI and recommender systems.