
This article has been accepted for publication in IEEE Internet of Things Journal. This is the author's version, which has not been fully edited; content may change prior to final publication. Citation information: DOI 10.1109/JIOT.2025.3525640

Graph-based Cooperation Multi-agent Reinforcement Learning for Intelligent Traffic Signal Control

Jing Shang, Shunmei Meng, Jun Hou, Xiaoran Zhao, Xiaokang Zhou*, Rong Jiang, Lianyong Qi, and Qianmu Li*

Abstract—In the trend of continuously advancing urban intelligent transport construction, traditional traffic signal control (TSC) struggles to make effective decisions under complex traffic conditions. Although multi-agent deep reinforcement learning (MARL) shows promise in optimizing traffic flow, most existing studies ignore the complex relationships between signal lights and fail to communicate with neighbors effectively. Moreover, the deterministic strategies generated by Q-learning-based methods are difficult to extend to large-scale urban road networks. Therefore, this paper proposes a multi-agent graph-based soft actor-critic (MAGSAC) approach for TSC, which combines graph neural networks with the Soft Actor-Critic (SAC) algorithm and extends it to multi-agent environments. Specifically, we employ graph-based networks and an attention mechanism to expand the receptive domain of agents, enable environmental information to be shared among agents, and filter out unimportant information. The algorithm adheres to the Centralized Training Decentralized Execution (CTDE) paradigm to minimize the non-stationarity of MARL. Finally, a rigorous experimental evaluation was conducted using the CityFlow simulator on both synthetic traffic grids and real-world urban road networks. Experimental results show that MAGSAC outperforms other TSC methods on performance metrics including average queue length and waiting time, and achieves excellent performance under complex urban traffic conditions.

Index Terms—Adaptive traffic signal control, Multi-agent deep reinforcement learning, Internet of Things, Intelligent transportation systems.

Jing Shang, Shunmei Meng, and Qianmu Li are with the School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210049, China (e-mail: [email protected]; [email protected]; [email protected]). Jun Hou is with the School of Social Science, Nanjing Vocational University of Industry Technology, Nanjing 210046, China (e-mail: [email protected]). Xiaoran Zhao is with the School of Computer Science, Qufu Normal University, Qufu 273165, China (e-mail: [email protected]). Xiaokang Zhou is with the Faculty of Business and Data Science, Kansai University, Osaka 564-8680, Japan (e-mail: [email protected]). Rong Jiang is with the Yunnan Key Laboratory of Service Computing, Yunnan University of Finance and Economics, Kunming 650399, China (e-mail: [email protected]). Lianyong Qi is with the College of Computer Science and Technology, China University of Petroleum (East China), Qingdao 266580, China (e-mail: [email protected]). Qianmu Li and Xiaokang Zhou are the corresponding authors.

I. INTRODUCTION

Considering the rapid progress of modern economic society [1], excessive population agglomeration may bring negative externalities such as traffic congestion, environmental problems, and efficiency decline [2]. An effective Traffic Signal Control (TSC) strategy, a key component of smart cities and intelligent transportation systems (ITS), is urgently required to manage and regulate the movement of vehicles [3]. Real-world traffic conditions are influenced by a multitude of factors, including driver preferences, interactions with various road users (such as pedestrians and cyclists), and environmental factors like weather and road infrastructure. This complexity makes it challenging to tune traffic signals dynamically based on real-time traffic conditions [4].

Recently, the development of Internet of Things (IoT) technology has enhanced the connection between Connected Autonomous Vehicles (CAVs) and road infrastructure, especially traffic lights, addressing the insufficient information exchange of existing ITS under dynamic traffic conditions [5]. ITS relies on advanced Artificial Intelligence (AI) algorithms, information communication [6], IoT, edge computing [7], [8], and various automated control systems, and is expected to become a promising tool for improving efficiency and sustainability in intelligent transportation [9]. As shown in Figure 1, cooperative driving between CAVs and autonomous intersection management (AIM) can improve road efficiency and traffic safety. In the future, multiple CAVs will negotiate their movements through the conflict region of an intersection without any signal head [10], especially in key areas where traffic flows from different directions or of different types meet and may conflict due to factors like visibility, speed, and trajectory. In such a driving environment, more real-time traffic information will be available, enabling more efficient and safer driving than traffic lights and manually driven vehicles allow [11]. During this transition, traditional TSC systems must address additional challenges to adapt smoothly and effectively to these technological advancements.

Many researchers have illustrated the potential of reinforcement learning (RL) as a data-driven approach for learning optimal policies from real-time traffic conditions [12]. Within the domain of TSC using RL (RL-TSC), a variety of MARL-based methods have been suggested, following fully centralized, fully decentralized, and centralized training with decentralized execution (CTDE) paradigms. As shown in Figure 2, centralized methods send all collected data to a central agent, which uses the resulting global state to optimize a global policy. This leads to a dimensionality disaster, imposing significant computational and communication burdens on the system. In contrast, in decentralized methods, each agent makes decisions independently, offering better scalability [13].


[Figure: the transition in automated technology, from fixed/actuated/adaptive signal control collecting data from detectors and cameras, to CV-based signal control with RSUs, detector-free V2I and advanced V2X communication, to CAV-based signal control with joint signal and trajectory control and temporal-spatial control in the CAV environment, and finally automated vehicle control with autonomous intersection management, conflict-free trajectories, and V2V communication.]

Fig. 1: Transitions in signal infrastructure and control algorithms.

However, this approach struggles to capture the complex interactions and collaborations between agents. Collaborative methods can leverage mutual communication and cooperation among agents to optimize global network traffic objectives, learning strategies directly through trial and error. For example, [14] enhances coordination between agents by reducing pressure at intersections and obtains more comprehensive information by expanding the observation range of target agents. Previous work [15] has also used transfer learning to guide collaboration between agents, but the well-trained model must undergo additional training to perform well in different scenarios. CTDE has recently become a popular framework for cooperative multi-agent reinforcement learning (MARL), where agents can use additional global state information to guide training in a centralized manner and make their own decisions based solely on decentralized local policies. Additionally, most existing research uses simple Q-learning algorithms to train MARL models, such as [16], [14], [17], [18]. However, the optimal solutions obtained by these studies are deterministic strategies that cannot adapt to dynamic changes in traffic flow.

[Figure: three panels contrasting the fully centralized, fully decentralized, and centralized-training-with-decentralized-execution architectures, showing how observations, rewards, and actions flow between the agents, actors, critics, central controller, and environment.]

Fig. 2: Three paradigms of multi-agent reinforcement learning.

To tackle the previous constraints, we introduce a novel approach known as Multi-Agent Graph-based Soft Actor-Critic Traffic Signal Control (MAGSAC). Operating within the CTDE paradigm, MAGSAC employs centralized critics to guide policy development. To extend the communication domain of individual traffic lights, we introduce a graph-based learning communication mechanism to capture the information interaction between signal lights. Based on the local observations of all agents, we use a multi-head attention mechanism to filter out unimportant data, thereby facilitating the local integration of global information. Moreover, we utilize the Soft Actor-Critic (SAC) algorithm to explore a broader solution space. The principal contributions of our work are outlined as follows:
• We propose a Multi-Agent Graph-based Soft Actor-Critic (MAGSAC) approach for urban traffic scenarios. The TSC problem is formulated as a partially observable Markov decision process (POMDP). We describe the observation space, action space, and reward function for each agent based on MARL, and employ the CTDE paradigm to enhance the efficiency of the TSC system. This eliminates the need for deterministic traffic model assumptions and allows TSC strategies to be learned within dynamic traffic environments.
• Technically, the proposed MAGSAC approach utilizes graph-based state encoding to capture the mutual influences within the graph structure of traffic signals, enabling collaboration among MARL agents while avoiding the curse of dimensionality. We innovatively employ a multi-head attention mechanism to identify cooperative interactions between traffic signals and effectively discern the dependencies among agents from the constantly changing traffic states, thereby filtering out irrelevant information.
• We conducted extensive experiments on the CityFlow simulator using both synthetic and real-world traffic datasets. The experimental results demonstrate that our proposal outperforms the baseline algorithms in terms of average travel time and converges more quickly. Moreover, in complex large-scale road scenarios, MAGSAC demonstrates notably superior performance compared to other methods, highlighting its robust generalization capabilities for complex urban TSC problems.

The remainder of this paper is structured as follows. Section II delves into a comprehensive survey of existing traffic signal control research. The problem formulation and system overview are given in Section III. The implementation of the proposed method is detailed in Section IV. Section V showcases a comparative analysis of diverse RL algorithms, alongside a discussion of experimental findings. Finally, Section VI encapsulates the findings and concludes the paper.


II. RELATED WORKS

A. Traffic signal control

Optimized traffic signals can improve intersection safety, and edge computing can more efficiently control traffic conflicts between vehicle and pedestrian movements within an area [9]. Intelligent TSC algorithms are the key to improving urban traffic efficiency. Conventional TSC approaches can be broadly divided into three primary strategies: fixed-time operations, actuated control, and adaptive control. The fixed-time strategy [19], [20], [21] implements the phases in a cycle with predetermined durations based on traffic demand estimation, with the cycle sequence being green, red, and yellow. However, this approach can easily lead to delays and congestion during periods of high traffic demand. The actuated strategy allocates signal timing in real time and relies on vehicle detection, such as sensors or pressure plates at the intersection; it requires the installation of external devices and can be unpredictable in high-volume traffic [22]. Both of the preceding techniques rely heavily on human expertise, which is difficult to apply in highly dynamic environments. More advanced, adaptive strategies dynamically adjust traffic signal timings by precalculating the arrival of vehicles in response to changing traffic conditions.

Previous studies have tried many methods for adaptive traffic signal control (ATSC), including heuristics [23], fuzzy logic [24], and linear programming [25]. Nonetheless, these methods are contingent upon accurate traffic flow parameters and the physical relationships between road segments, necessitating extensive data collection and complex calculations to develop a traffic flow model that accurately describes the behavior of the traffic system. In recent years, dozens of studies [26], [27], [28], [29] have leveraged RL-based methods, including Q-learning and Deep Q-network (DQN) algorithms, to optimize ATSC. Additionally, the integration of deep neural networks (DNNs) has significantly boosted the learning capacity of RL in tackling complex tasks [30]. The most common approach is to treat each traffic signal controller as an RL agent, enabling it to learn optimal actions through direct interaction with the environment. This avoids relying on certain assumptions in traffic modeling, such as a uniform arrival rate and unlimited lane capacity, which are often unrealistic in real-world scenarios. Further research and empirical studies are still needed to develop effective algorithms and solutions that address complex traffic environments and challenges in the real world.

B. Multi-agent reinforcement learning for Traffic signal control

When dealing with real-world traffic situations involving multiple intersections, isolated TSC methods cannot manage the high-dimensional state space and lack the ability to coordinate learning and junction-to-junction communication across intersections. Given the limitations of the isolated TSC model, it is desirable to propose a MARL approach that uses multiple agents to learn in parallel across the road network for TSC. The MARL model has obvious advantages in terms of performance and robustness due to decentralized execution and experience sharing among agents. Specifically, the focus of this paper is on multi-intersection scenarios.

Nevertheless, we notice several challenges in designing a MARL model that is both adaptive and coordinated in large-scale urban networks. The first challenge is selecting appropriate model settings to describe the TSC problem. Currently, time-based metrics, including vehicular travel time, delay, and waiting time, are widely employed in RL-TSC applications. Additionally, congestion-related metrics like queue length or intersection throughput can be utilized to capture the level of congestion in the network. The work in [14] uses the pressure of an intersection to define the reward function. Moreover, vehicular speed metrics such as absolute speed, acceleration, or harmonic speed can also be considered as objective functions. Many previous works [17], [31], [27] have explored compound reward definitions or multiple objective functions to maximize several traffic-related metrics simultaneously.

Furthermore, designing a suitable model structure is the second challenge [32]. In general, existing MARL methods for TSC operate under three main modes: independent control, centralized learning, or coordinated control. The first, such as [33], only optimizes each agent separately without considering the states of other traffic lights, making it difficult to obtain the optimal strategy. In addition, due to the high dimensionality of the joint action space, it becomes impractical to apply centralized RL such as [34] in large-scale TSC, and this inadvertently leads to the loss of crucial topological information about the traffic network [35]. A collaborative learning method is needed to jointly control multiple intersections so that the entire network reaches the optimal strategy. The third crucial concern is the selection of an effective DRL training algorithm to explore the solution space of TSC. Early multi-agent research focused on model-based RL algorithms, which used transition probability distributions to represent environment models. However, this way of creating state representations is unrealistic in practical applications. In addition, most of the optimal strategies found by iteratively updating the Q value are deterministic and struggle to adapt to the dynamic variations of the traffic system.

C. Cooperative for Traffic signal control

Cooperation is essential in the MARL environment for TSC, as the assessment and formulation of each agent's policy depends not just on local observations but also on the strategies of fellow agents. Various strategies and mechanisms have been explored to enhance the agents' ability to collaborate and optimize the global network traffic objectives [36]. A typical approach focuses on learning communication protocols between multiple agents to coordinate behavioral actions. This approach defines a messaging network for each agent, encoding local observations into messages and sharing them with other


agents through broadcast communications. However, there are limitations to this approach. First, communication between agents can exacerbate dimensionality challenges in large-scale urban TSC. Second, limited communication bandwidth prevents an agent from obtaining sufficient information from other agents, resulting in low communication efficiency. Third, redundant information exchange can compromise learning. Previous works on cooperation for TSC, such as MADDPG [37] and MAPPO [38], have successfully addressed challenges in MARL-based TSC. [34] studied an A2C-based MARL algorithm under limited neighborhood communication and proposed a spatial discount factor to attenuate the impact of state and reward signals originating from other agents. Although this work enables agents to communicate and exchange messages during execution, it overlooks the intricate relationships between agents, leading to subpar performance and ineffective communication.

Graph Neural Networks (GNNs) provide a deep embedding framework for MARL to process data with a graph structure and generalize relations among nodes. Urban roads can naturally be regarded as a graph, with the traffic lights on the roads as its nodes. Multi-agent systems can therefore use graph neural networks to process graph-structured input data and describe the information interactions between agents. Studies like CoLight [29] and GPLight [39] have demonstrated the high effectiveness of GNNs in facilitating collaboration in TSC. GraphLight [40] employs a Graph Convolutional Network (GCN) to capture the features of dynamic traffic networks and a spatial discount factor to adjust the states of neighboring agents, but this practice remains insufficient. [41] proposed leveraging GCN to capture interactions between nodes and to simplify the complex node update process by identifying key targets, thereby improving the quality and performance of decision-making in reinforcement learning. The Graph Attention Network (GAT) is an effective variant of GCN that introduces an attention mechanism into the neighbor aggregation operation between nodes [42]. Recently, GAT has found extensive applications in computer vision [43], 3D point clouds [44], natural language processing [45], and beyond. GAT learns node embedding representations in the road network and weights and aggregates neighbor nodes through multiple attention heads, reducing the neural network's attention to non-critical node information. Therefore, when processing multi-agent interactions, GAT can assist agents in comprehending the system's status and mitigate the influence of internal interference information on the agents' decision-making.

III. PROBLEM FORMULATION

In this section, we start by presenting the standard formulation of TSC to better understand our approach. Then we give a brief introduction to the basic building blocks for TSC.

A. Partially Observable Markov Processes

We model the TSC problem as a partially observable Markov decision process (POMDP), which can be defined as a tuple ⟨N, S, O, Ω, A, R, P, γ⟩. The components of the POMDP are presented in Table I.

TABLE I: POMDP elements.
N: Number of traffic light controllers
S: State space
O: Observation set
Ω: Local observation set
A: Action set
R: Reward function
P: State transition probability
γ: Discount factor

During time step t, every agent i ∈ N acquires a local observation $o_i^t \in O_i$ and then executes an action $a_i^t \in A_i$, taking into account both the local observation and the information passed by other agents. The environment state then transitions from s to a new state s' and a shared reward $r_t$ is received. Through trial and error, each agent learns to perform the best action to maximize long-term rewards under dynamic traffic states. The collective goal of the N agents is to iteratively refine their individual policies $\pi_i$ to maximize the expected cumulative discounted reward $G_t = \sum_{t=0}^{T} \gamma^t r_t$, where T represents the time horizon and $\gamma \in [0, 1]$. The agents are tasked with learning the optimal policy $\pi^*: S \to A$ that effectively maps states to actions.

B. Graph neural networks

In the standard formulation of TSC, as shown in Figure 3, a graph G = (V, E) is utilized. Here, $v_i \in V$ represents a traffic signal controller and $e(i, j) \in E$ denotes the connecting lane between $v_i$ and $v_j$. Each node $v_i$ is associated with a unique feature vector $h_i$; combined, these vectors constitute the feature matrix $H \in \mathbb{R}^{N \times P}$, where P is the total number of features per node. The adjacency matrix $E \in \mathbb{R}^{N \times N}$ captures the interconnectivity among nodes.

Fig. 3: A real urban road network exemplar in New York (each node corresponds to an agent, and the connections between nodes represent agent-agent interactions).

Graph attention networks represent an advanced class of GNNs that are particularly adept at handling data with graph-like structures. A graph attention network learns an embedded representation of each node, $h_i' = \sigma\left(\sum_{j \in N_i} \alpha_{ij} W h_j\right)$. In this expression, $\alpha_{ij} = \mathrm{softmax}_j(e_{ij})$ is the attention coefficient, calculated with a softmax function that normalizes the unscaled attention scores $e_{ij}$ across all neighboring nodes $N_i$. The attention score $e_{ij} = a(W h_i, W h_j)$ is computed by a function a that takes the embeddings of nodes i and j, weighted by a matrix W, to determine the relevance of node j to node i.


C. RL-based traffic signal control

We focus on an urban traffic scenario with a road network comprising N intersections. Each intersection within the road network is designated as an agent, creating a multi-agent environment. We assume that each traffic light control unit realizes a cycle with dynamic phase durations. The collection of intersections within the road network is represented as I = {1, 2, ..., N}. Moreover, we define the set of road segments entering intersection i as $L_i^{in}$, and the set of road segments leaving the intersection as $L_i^{out}$. Consider a 4-way intersection, where there are 12 possible movements and 4 candidate phases. Figure 4 illustrates the four phases: North-South Straight (NS), North-South Left-turn (NSL), East-West Straight (EW), and East-West Left-turn (EWL). For any given phase, the right turn is always permitted, as it does not interfere with traffic in other directions.

[Figure: the four candidate phases NS, NSL, EW, and EWL.]

Fig. 4: An example signal phase sequence on a four-way intersection.
Observation: To make informed decisions, each agent needs to observe the current traffic condition at its intersection i. We denote the current traffic condition as follows:

$o_{i,t} = \left\{ q_{i,t}^l, v_{i,t}^l, p_{i,t} \right\}, \quad l \in L_i^{in}$  (1)

Here, $q_{i,t}^l$, $v_{i,t}^l$, and $p_{i,t}$ represent the quantity of queued vehicles, the velocity of vehicles, and the current phase of the traffic signal at intersection i. Both $q_{i,t}^l$ and $v_{i,t}^l$ are normalized to the range [0, 1]. Additionally, $p_i^t$ indicates the active phase of the traffic signal at intersection i at time t. It is encoded as a one-hot vector $B^{p+1}$, where the additional one-hot entry represents the all-red clearance phase.
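As an illustration, the sketch below assembles the observation of Eq. (1) for one intersection. The lane capacity used to normalize queue counts is an assumed value; the speed limit matches the simulation settings described later in Section V.

```python
import numpy as np

N_PHASES = 4          # NS, NSL, EW, EWL
LANE_CAPACITY = 40.0  # vehicles per incoming lane (assumed for normalization)
SPEED_LIMIT = 15.0    # meters/second, matching the simulation settings

def build_observation(queues, speeds, phase_id):
    """Eq. (1): per-lane queues and speeds over L_i^in plus a one-hot phase.

    phase_id in {0..3} selects a phase; N_PHASES marks all-red clearance."""
    q = np.clip(np.asarray(queues, dtype=float) / LANE_CAPACITY, 0.0, 1.0)
    v = np.clip(np.asarray(speeds, dtype=float) / SPEED_LIMIT, 0.0, 1.0)
    p = np.zeros(N_PHASES + 1)        # one-hot vector B^{p+1}
    p[phase_id] = 1.0
    return np.concatenate([q, v, p])

# 12 incoming lanes at a 4-way intersection, currently in phase EW (index 2).
obs = build_observation(queues=[3] * 12, speeds=[8.0] * 12, phase_id=2)
print(obs.shape)  # (29,) = 12 queues + 12 speeds + 5 phase entries
```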
Action: To enable decision-making, each agent performs an action after observing the current traffic condition $o_i^t$. There are two commonly used action settings for traffic signal control: "Switch" and "Choose Phase". In the "Switch" setting, the agent determines whether to keep the current phase or switch to the next phase in a fixed sequence of phases; this action is denoted as $a_i^t \in \{0, 1\}$. In the "Choose Phase" setting, the agent selects the most appropriate phase from a predefined set of phases for the upcoming period. In this paper, we choose the latter action setting, denoted as $a_i^t \in \{NS, NSL, EW, EWL\}$.
Reward: In RL-TSC, the reward function is pivotal in steering the agent's learning trajectory. The metrics most frequently used in the literature to construct reward functions include vehicular speed, delay measurements, and congestion, as follows, respectively:

$r_i^t = -\frac{1}{|L_i^{in}|} \sum_{l \in L_i^{in}} v_{i,t+1}^l$  (2)

$r_i^t = -\frac{1}{|L_i^{in}|} \sum_{l \in L_i^{in}} d_{i,t+1}^l$  (3)

$r_i^t = -\frac{1}{|L_i^{in}|} \sum_{l \in L_i^{in}} q_{i,t+1}^l$  (4)

where $v_{i,t+1}^l$, $d_{i,t+1}^l$, and $q_{i,t+1}^l$ correspond to the average vehicular speed, vehicle delay, and queue length on lane l at intersection i during the next time step. It is worth noting that delay measurements can only be obtained after the vehicle reaches its destination, which can introduce a time lag in calculating the reward. To address this issue, some research works employ a weighted linear combination of these metrics as the reward. However, fine-tuning the weights then becomes a critical aspect, as even slight adjustments can yield significantly different results. For this study, we select the queue length $q^l$ as our reward metric due to its simplicity and its ability to provide instantaneous feedback on traffic conditions.
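A minimal sketch of the adopted queue-length reward of Eq. (4) is given below; the lane readings are placeholders for values queried from the simulator.

```python
import numpy as np

def queue_length_reward(next_queue_lengths):
    """Eq. (4): negative average queue length over the incoming lanes
    L_i^in at the next time step, so shorter queues mean higher reward."""
    return -float(np.mean(next_queue_lengths))

print(queue_length_reward([4, 0, 7, 1]))  # -3.0
```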
IV. SYSTEM ARCHITECTURE

In this section, we detail the architecture of the proposed MAGSAC approach. As depicted in Figure 5, we interpret an authentic urban road network and its traffic lights as a multi-agent graph model. Each signal light within the road network is modeled as a node, while the edges and weights between nodes capture the dependencies and importance among agents. To enhance the learning of localized policies, we employ a multi-head attention mechanism (MHA) to learn the cooperative attributes of the agents in dynamic traffic environments and to selectively generate the overall environment characteristics of neighboring agents, facilitating the integration of global information at the local level. In this process, each agent is modeled with one actor network and two critic networks, following the CTDE pattern. Specifically, decentralized local behavioral policies are generated for each agent, allowing them to act independently. However, these policies are globally optimized using centralized soft Q-functions. This approach ensures that the decentralized policies collectively contribute to an overall optimal solution.

A. Observation Embedding

As shown in Figure 6, we preprocess the features in the encoder module. In a partially observable environment controlled by multiple traffic lights, agent i can observe the local state $o_i^t$ at each time step t. We input the original data $o_i^t$ into a multilayer perceptron (MLP) and obtain the feature vector $e_i^t$, which represents the traffic status of the i-th signal light.

$e_i^t = \mathrm{MLP}(o_i^t) = \sigma(o_i^t W_e + b_e)$  (5)

where $\sigma(\cdot)$ is the ReLU function. The MLP consists of two hidden layers, each with 32 neurons. The parameters to be learned include the weight matrix $W_e$ and the bias vector $b_e$. The MLP transforms low-level feature vectors into a unified feature space, encompassing the observations of each agent.
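The encoder of Eq. (5) can be sketched in PyTorch as follows, using the two 32-neuron hidden layers stated above; the input dimension is an assumption that depends on the observation design.

```python
import torch
import torch.nn as nn

class ObservationEncoder(nn.Module):
    """Eq. (5): a two-hidden-layer MLP (32 neurons each, ReLU) mapping a
    raw local observation o_i^t to the feature vector e_i^t."""
    def __init__(self, obs_dim: int, hidden: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),   # sigma(o W_e + b_e)
            nn.Linear(hidden, hidden), nn.ReLU(),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)

encoder = ObservationEncoder(obs_dim=29)   # obs_dim is an assumption
e_i = encoder(torch.randn(1, 29))          # a batched observation o_i^t
print(e_i.shape)                           # torch.Size([1, 32])
```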


[Figure: the MAGSAC pipeline. Per-agent observation encoders feed a graph-learning module that produces the adjacency matrix and attention weights; multi-head graph attention layers aggregate neighbor features; and each agent's actor is trained with a global critic, a target critic, and a shared replay buffer under centralized training and decentralized execution.]

Fig. 5: Architecture of MAGSAC. The global critic of MAGSAC consists of two critic networks, where the target critic network is used to stabilize the training process.

B. Graph-based Learning

As depicted in Figure 3, the road network is interlaced with traffic lights that are not only structured but also interdependent, forming a graph-based agent network. In a multi-agent cooperative framework, receiving information from all other agents can be bandwidth-intensive. An effective method is for each agent to exchange information with its closest neighbors to broaden its observation domain. To further refine communication and enhance cooperation among agents, as shown in Figure 6, we leverage a graph learning module to integrate the status of neighboring traffic signals and extract high-order hidden features. This module employs attention mechanisms and an LSTM to learn cooperation between intersections and filter out unimportant node information. To extract advanced interaction representations, we leverage multi-head attention to capture information from different subspaces, which aids in learning the globally optimal strategy at each intersection.

Treating the encoded signal light observations as sequential data, we input the features into a Long Short-Term Memory (LSTM) network to capture the temporal dependencies among traffic lights:

$h_i, c_i = \mathrm{LSTM}(e_i, h_i', c_i')$  (6)

Here, $h_i$ and $c_i$ represent the hidden state and cell state of the current time step, and $h_i'$ and $c_i'$ represent the hidden state and cell state of the previous time step, respectively. Within a dynamic environment, the interaction relationships between agents are constantly changing. Consequently, the historical traffic conditions encapsulated by the LSTM are crucial for refining each agent's policy. For distinct traffic light controllers labeled i and j, the LSTM-generated state vectors $h_i$ and $h_j$ are amalgamated into a composite vector $f(h_i, h_j)$. This vector is then fed into an MLP to yield a binary outcome. Such an outcome serves to ascertain the connectivity between various traffic lights, enabling the dynamic selection of informative neighbors for each traffic light controller while simultaneously discarding less relevant ones.

$x_{i,j} = \begin{cases} 1 & \text{if } \mathrm{MLP}(f(h_i, h_j)) \geq \zeta \\ 0 & \text{otherwise} \end{cases}$  (7)

Here, ζ is the threshold used to determine whether the output of the MLP should be considered as 1 or 0; in our experimental setup, ζ is set to 0.5. If $x_{i,j} = 0$, there is no interaction between traffic lights i and j; if $x_{i,j} = 1$, there is an interaction between them. From the traffic lights in the network, we can thus construct an adjacency matrix $A_{ij} = x_{i,j}$.

Building upon the obtained node features, which incorporate historical information, and the adjacency matrix, we employ MHA to capture the temporal connections between nodes in the dynamically changing road environment. This involves projecting the input features of each node into independent attention heads:

$q_i^m = W_q^m h_i, \quad k_i^m = W_k^m h_i, \quad v_i^m = W_v^m h_i$  (8)

where $W_q^m$, $W_k^m$, and $W_v^m$ convert $h_i$ into the corresponding query, key, and value vectors, respectively. Within attention head m, the attention weights between agent i and agent j can be calculated as:

$e_{ij}^m = \begin{cases} \tau \cdot (W_q^m h_i) \cdot (W_k^m h_j)^T, & A_{ij} = 1 \\ 0, & A_{ij} = 0 \end{cases}$  (9)

Here, τ signifies the scaling factor, and $A_{ij}$ is the adjacency matrix entry that denotes the connection relationship between node i and node j in the road network. Then the attention score between these two nodes is [46]:

$\alpha_{ij}^m = \mathrm{softmax}\left(e_{ij}^m\right) = \frac{\exp\left(e_{ij}^m\right)}{\sum_{j \in N_i} \exp\left(e_{ij}^m\right)}$  (10)
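The following PyTorch sketch strings together Eqs. (6)-(10) for a single attention head: an LSTM summarizes each agent's encoded observations, an MLP thresholded at ζ = 0.5 yields the adjacency matrix, and scaled dot-product scores are masked to neighbors before the softmax, which is equivalent to restricting Eq. (10) to $N_i$. The dimensions, the concatenation used for $f(h_i, h_j)$, and the added self-loops are assumptions of this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

N, D, HEAD_DIM, ZETA = 4, 32, 16, 0.5  # agents, feature size, head size, threshold

lstm = nn.LSTM(input_size=D, hidden_size=D, batch_first=True)
edge_mlp = nn.Sequential(nn.Linear(2 * D, 32), nn.ReLU(),
                         nn.Linear(32, 1), nn.Sigmoid())
W_q = nn.Linear(D, HEAD_DIM, bias=False)
W_k = nn.Linear(D, HEAD_DIM, bias=False)

e_seq = torch.randn(N, 5, D)             # encoded observations e_i over 5 steps
h_seq, _ = lstm(e_seq)
h = h_seq[:, -1, :]                      # Eq. (6): current hidden state h_i

# Eq. (7): f(h_i, h_j) as concatenation, thresholded at zeta to get A_ij.
pair = torch.cat([h.unsqueeze(1).expand(N, N, D),
                  h.unsqueeze(0).expand(N, N, D)], dim=-1)
A = (edge_mlp(pair).squeeze(-1) >= ZETA).float()
A = torch.clamp(A + torch.eye(N), max=1.0)  # self-loops keep every row nonempty

# Eqs. (9)-(10): scaled dot-product scores, masked to neighbors, softmaxed.
tau = HEAD_DIM ** -0.5                   # scaling factor tau
scores = tau * (W_q(h) @ W_k(h).T)
scores = scores.masked_fill(A == 0, float("-inf"))
alpha = F.softmax(scores, dim=-1)        # attention coefficients alpha_ij
print(alpha.shape)                       # torch.Size([4, 4])
```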


Furthermore, after capturing the dependencies between traffic lights and the attention scores of each agent, a weighted sum of the values of the other agents is computed for each attention head. The outputs from the M attention heads pertaining to agent i are combined through concatenation. This concatenated vector then undergoes processing by an activation function $\sigma(\cdot)$, realized through a fully-connected layer with a ReLU nonlinearity. The result is the final output of the layer:

$h_i' = \sigma\left(\mathrm{concat}\left(\sum_{j \in N_i} \alpha_{ij}^m h_j, \; \forall m \in M\right)\right)$  (11)

Here, M represents the number of independent attention mechanisms. The outputs emanating from these attention mechanisms are concatenated to culminate in the ultimate feature vector $h_i'$.

[Figure: the per-agent pipeline. MLP encoders with parameter sharing feed LSTMs producing hidden states; these, with the adjacency matrix, enter the multi-head attention mechanism with linear query/key/value projections and attention scores.]

Fig. 6: Details and components of the agent network structure.
C. Multi-agent Deep Reinforcement Learning

The Multi-Agent Soft Actor-Critic (MA-SAC) algorithm extends the SAC algorithm to the multi-agent paradigm, making it more suitable for urban traffic signal control. We integrate the aforementioned graph-based learning network with the MA-SAC algorithm to learn the optimal policy for traffic light control. Following the CTDE paradigm, the MA-SAC algorithm operates on the actor-critic framework and encompasses two distinct neural network components: actor and critic. The actor network serves as a policy mechanism that ingests the current state and generates the likelihood of choosing each traffic signal phase. Meanwhile, the critic network is used to calculate the Q value of each phase. SAC is a maximum entropy reinforcement learning algorithm that maximizes both the expected Q value and the entropy of the actor's policy. By increasing the entropy of the policy distribution, the SAC algorithm encourages greater exploration, preventing the agent from converging too quickly to local optima, thereby alleviating the overestimation of Q-values and enhancing the stability of the algorithm. The objective can be represented as follows:

$J(\pi) = \sum_{t=0}^{T} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\left[ r_t + \alpha H(\pi(\cdot \mid s_t)) \right]$  (12)

$H(\pi(\cdot \mid s_t)) = -\sum_{a_t} \pi(a_t \mid s_t) \log \pi(a_t \mid s_t)$  (13)

where $\rho_\pi$ represents the state distribution under the policy π, and α is a smoothing factor deployed to strike a balance between the rewards and the entropy within the policy; the larger the parameter α, the more stochastic and exploratory the learned strategy becomes. The entropy of policy π is denoted as $H(\pi(\cdot \mid s_t))$, quantifying the uncertainty of the policy given a specific state $s_t$.

As per the principles of the Bellman equation, the soft state-action value function $Q_{soft}$ and the soft value function $V_{soft}$ are respectively defined as:

$V_{soft}(s_t) = \mathbb{E}_{a_t \sim \pi}\left[ Q_{soft}(s_t, a_t) - \alpha \log \pi(a_t \mid s_t) \right]$  (14)

$Q_{soft}(s_t, a_t) = r(s_t, a_t) + \gamma \mathbb{E}_{s_{t+1} \sim \rho_\pi}\left[ V_{soft}(s_{t+1}) \right]$  (15)

We define $Q_\phi(s, a)$ as the critic network parameterized by ϕ, and $\pi_\theta(a \mid s)$ as the actor network parameterized by θ. To facilitate experience replay, the agent's experience $e = (s_t, a_t, r_t, s_{t+1})$ at each time step is stored in the replay buffer $D = \{(s_t, a_t, r_t, s_{t+1})\}$. During training, mini-batches of experiences are uniformly sampled from the buffer to update the model at every environment step. The parameters are optimized by minimizing a loss function, typically with stochastic optimization techniques. The loss function for the critic network is formulated as:

$L(\phi) = \mathbb{E}_{(s_t, a_t) \sim D}\left[ \frac{1}{2}\left( Q_\phi(s_t, a_t) - \hat{Q}(s_t, a_t) \right)^2 \right]$  (16)

In particular, MAGSAC employs two critic networks to mitigate the risk of overestimating the Q value. The primary critic network, denoted as $Q_\phi(s, a)$, calculates the Q value based on the current state and action. Meanwhile, the target critic network, represented as $\hat{Q}(s_t, a_t)$, predicts the target Q value based on the next state and the policy π generated by the actor network. The difference between the current Q-value and the target Q-value is harnessed to update the parameters of the critic networks, enabling a more stable learning process and a superior strategy. $\hat{Q}(s_t, a_t)$ is expressed as:

$\hat{Q}(s_t, a_t) = r(s_t, a_t) + \gamma Q_{\bar{\phi}}(s_{t+1}, a_{t+1}) - \gamma \alpha \log \pi_\theta(a_{t+1} \mid s_{t+1})$  (17)

The parameters of the policy, denoted as θ, can be determined by minimizing

$L_Q(\theta) = \mathbb{E}_{s_t \sim D}\left[ \mathbb{E}_{a_t \sim \pi_\theta}\left[ \alpha \log \pi_\theta(a_t \mid s_t) - Q_\phi(s_t, a_t) \right] \right]$  (18)
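For a discrete phase set, the losses of Eqs. (16)-(18) can be written compactly as below, with the expectation over actions computed as an exact sum. The tensors standing in for the critic, target-critic, and actor outputs, and their shapes, are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def critic_loss(q, q_targ_next, logits_next, a, r, gamma, alpha):
    p_next = F.softmax(logits_next, dim=-1)
    logp_next = F.log_softmax(logits_next, dim=-1)
    # Soft value of the next state (Eq. (14)) inside the target of Eq. (17).
    v_next = (p_next * (q_targ_next - alpha * logp_next)).sum(-1)
    q_hat = (r + gamma * v_next).detach()              # target Q-value
    q_taken = q.gather(1, a.unsqueeze(1)).squeeze(1)   # Q_phi(s_t, a_t)
    return 0.5 * ((q_taken - q_hat) ** 2).mean()       # Eq. (16)

def actor_loss(q, logits, alpha):
    p = F.softmax(logits, dim=-1)
    logp = F.log_softmax(logits, dim=-1)
    # Eq. (18): E_{a ~ pi}[alpha * log pi(a|s) - Q_phi(s, a)].
    return (p * (alpha * logp - q.detach())).sum(-1).mean()

# Toy usage with a batch of 20 and four phases.
B, NA = 20, 4
q = torch.randn(B, NA); q_targ = torch.randn(B, NA)
logits = torch.randn(B, NA); a = torch.randint(0, NA, (B,))
r = torch.randn(B)
print(critic_loss(q, q_targ, logits, a, r, 0.9, 0.001).item(),
      actor_loss(q, logits, 0.001).item())
```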
the local environment oi , while the global critic evaluates the


Based on the above analysis, the complete MAGSAC approach we propose is shown in Algorithm 1, which combines the GAT method to encode graph-based state information and updates the policy and value functions through the MA-SAC algorithm. Each agent is modeled with a policy network as the actor and two Q-value networks as the critics. To improve the learning rate during model training and stabilize the multi-agent training regimen, our method follows the idea of CTDE. Specifically, the local critic assesses the rewards of the agent's own actions by observing the local environment $o_i$, while the global critic evaluates the collective rewards of all agents' actions to guide the actor's training. During testing, each agent's actor network relies solely on local observations to execute actions. The algorithm mainly includes three stages: experience collection, graph-based state encoding, and network training. Before the MAGSAC algorithm starts running, the network parameters θ, θ̄, ϕ are initialized and the replay buffer D is cleared. During each iteration, each node aggregates the information of its critical neighbors based on graph-based learning and generates new node features $h_i'$. Each agent i interacts with the traffic environment, performs the generated action, and then stores the resulting tuples into the replay buffer D. Training starts once the data in the buffer reaches a threshold: a mini-batch of experiences B is sampled from the replay buffer at each step, and stochastic gradient descent is used to update the neural network parameters.

Algorithm 1: MAGSAC
1: Initialize all neural network parameters θ, θ̄, ϕ and the replay buffer D;
2: for each episode do
3:   Reset the traffic environment;
4:   Observe the global state s and each agent's individual observation $o_i$;
5:   for t = 1 to max-episode-length do
6:     for each agent i ∈ {1, ..., N} do
7:       Encode the observation as the representation $e_i$;
8:       Calculate the adjacency matrix by Eq. (7);
9:       Learn the spatial dependencies and node features $h_i'$ by Eq. (11);
10:      Select action $a_i^t \sim \pi_\theta(a_i^t \mid \tau_i^t)$;
11:    end for
12:    Execute the joint action $a = (a_1, a_2, \ldots, a_N)$, interact with the environment, obtain the global reward r, and transition to the next global state s';
13:    $D \leftarrow D \cup \{(s_t, o_i^t, a_t, r_t, s_{t+1}, o_i^{t+1})\}$;
14:  end for
15:  for each RL training step do
16:    Sample a random mini-batch uniformly from D;
17:    Update the critic networks by minimizing the loss of Eq. (16) with the target of Eq. (17): $\phi \leftarrow \phi - \lambda \hat{\nabla}_\phi L(\phi)$, where λ is the learning rate;
18:    Update the actor network via gradient descent on Eq. (18);
19:    Update the hyper-parameters;
20:    Softly update all target network parameters for each agent: $\bar{\theta} \leftarrow \rho\bar{\theta} + (1 - \rho)\theta$, where ρ is the update ratio;
21:  end for
22: end for
23: return $Q_\phi$, $\pi_\theta$
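For readers who prefer code to pseudocode, the compact Python skeleton below mirrors the three stages of Algorithm 1. The env and ctrl objects and their methods are hypothetical stand-ins, kept abstract on purpose; CityFlow itself exposes a different low-level API.

```python
import random
from collections import deque

def train_magsac(env, ctrl, episodes=100, max_steps=3600,
                 batch_size=20, warmup=1000):
    buffer = deque(maxlen=10_000)                    # replay buffer D
    for _ in range(episodes):
        state, obs = env.reset()                     # global state, local obs
        for _ in range(max_steps):
            # Experience collection with graph-based state encoding.
            adj = ctrl.build_adjacency(obs)          # Eq. (7)
            feats = ctrl.aggregate(obs, adj)         # Eqs. (8)-(11)
            actions = ctrl.act(feats)                # a_i^t ~ pi_theta
            next_state, next_obs, reward = env.step(actions)
            buffer.append((state, obs, actions, reward, next_state, next_obs))
            state, obs = next_state, next_obs
            # Network training once the buffer passes the threshold.
            if len(buffer) >= warmup:
                batch = random.sample(buffer, batch_size)
                ctrl.update_critics(batch)           # minimize Eq. (16)
                ctrl.update_actors(batch)            # minimize Eq. (18)
                ctrl.update_targets()                # soft target update
    return ctrl
```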
V. EXPERIMENT RESULTS AND COMPARISON

In this section, we detail our experimental approach and assess the performance of our proposed MAGSAC approach using both synthetic and real-world datasets. The performance outcomes verify the robustness and applicability of our model.

A. Experiment Settings

The experiments were performed on an Intel Core i5 2.40 GHz CPU with 8 GB RAM and an Intel Iris Plus Graphics 1536 MB graphics card. We employed CityFlow, an open-source traffic simulation platform, as our experimental framework. It provides a Python interface to design large-scale urban traffic scenarios and manage traffic signals. As Figure 8 illustrates, the simulator offers access to real-time traffic status metrics such as vehicle counts, velocities, and the current signal phase.

The experimental settings include road network settings and traffic flow settings in JSON format. In the simulated road network, each intersection features four three-lane roads, with each road spanning 500 meters and a speed limit of 15 meters/second assigned to all lanes. Vehicles can turn right without conflict, and a combined yellow and all-red time of 5 seconds is observed to clear the intersection at each signal phase transition. We define a vehicle whose speed is lower than 0.3 meters/second as waiting in the queue. This is a typical setting that appears in many related studies.

TABLE II: The parameters of each agent.
Parameter | Value
Smooth factor α | 0.001
Discount rate γ | 0.9
Batch size B | 20
Replay buffer size | 10000
Learning rate for the actor, lr_a | 0.0003
Learning rate for the critic, lr_c | 0.001
Time steps | 3600

In the MAGSAC approach proposed in this article, each agent controls one intersection, and all agents use the same parameters, as shown in Table II. To encourage the agents to consider long-term cumulative rewards, the discount factor γ is set to 0.9. A smoothing factor α of 0.001 is used for updating the network parameters; a smaller smoothing factor makes the learning process more robust. The learning rates lr_a and lr_c for the actor and critic networks are set to 0.0003 and 0.001, respectively, ensuring the stability of policy updates and value estimation. Additionally, the replay buffer size is set to 10,000 for storing experience data, and the batch size is set to 20 to balance the use of memory and computational resources.

1) Datasets: This paper uses two kinds of datasets: a synthetic traffic grid dataset and a real-world traffic dataset. We summarize their characteristics in Table III. The source datasets can be accessed through https://traffic-signal-control.github.io/#open-datasets.

Synthetic Dataset: The synthetic datasets are generated artificially and include three scenarios: Grid1×3 (1 row, 3 columns), Grid3×3, and Grid6×6 traffic environments. As shown in Figure 7, all vehicles enter and exit through the rim edges of the network. For the Grid1×3


scenario, the start and end points of the traffic flow lie on the edge of the grid, from X1 to X8; for the Grid3×3 scenario, from X1 to X12; and for the Grid6×6 scenario, from X1 to X24. The traffic flow is modeled with a Gaussian distribution, assuming an average of 500 vehicles per hour per lane.

Fig. 7: The traffic grids for the synthetic datasets: (a) Grid1×3, 3 traffic lights; (b) Grid3×3, 9 traffic lights; (c) Grid6×6, 36 traffic lights.

[Figure: the simulator loop. The traffic signal controller (agent) receives the state and reward from the CityFlow traffic environment, configured by roadnet.json and flow.json, and returns an action.]

Fig. 8: Architecture diagram of the simulator for TSC.

TABLE III: Arrival rates of the traffic datasets (vehicles/300 s).
Dataset | Intersections | Mean | Std | Max | Min
Grid1×3 | 3 | 300 | - | - | -
Grid3×3 | 9 | 300 | - | - | -
Grid6×6 | 36 | 300 | - | - | -
Jinan3×4 | 12 | 250.70 | 38.21 | 335 | 208
Hangzhou4×4 | 16 | 526.63 | 86.70 | 676 | 256
NewYork16×3 | 48 | 236.42 | 8.13 | 257 | 216

Real-world Dataset: The real-world datasets are collected from intersection surveillance cameras in areas of Jinan and Hangzhou in China, as well as New York in the USA, as depicted in Figure 9. The road networks for these data are extracted from OpenStreetMap and converted to CityFlow format. For each road network, corresponding traffic flow data were collected based on taxi GPS data. Since the traffic flow of the real-world datasets is more irregular and complex than that of the synthetic datasets, necessary simplifications have been made. Statistical analysis determined the following turning ratios at intersections: 10% for left turns, 60% for proceeding straight, and 30% for right turns.

2) Compared Methods: To assess and demonstrate the effectiveness of the proposed MAGSAC approach, we conduct a comparative analysis involving conventional TSC methods and several RL control techniques. The conventional TSC methods include Fixedtime (FT) [19] and Self-Organizing Traffic Light Control (SOTL) [22]. The RL control methods include PressLight [14] and CoLight [29]. Each model is learned without any pre-trained parameters to ensure fairness.
• FT: This approach employs a pre-determined fixed cycle length and signal phase timing rules, ideally suited to relatively stable traffic flow.
• SOTL: This method changes the signal phase using a manually adjusted threshold on the number of waiting cars, in response to the real-time traffic status on the street.
• PressLight: A classic DRL-based TSC method relying on Q-learning, without considering neighbor information, aimed at improving intersection throughput.
• CoLight: A synchronized decentralized algorithm for multi-agent traffic signal control that uses GAT to facilitate communication.

3) Evaluation Metrics: We assess the performance of our proposed approach using a set of metrics that are commonly accepted for judging the effectiveness of traffic signal control systems.
• Average travel time (ATT): This metric represents the average time (in seconds) it takes for all vehicles to travel through the area and is related to the driving experience. We record the times when a vehicle v enters and leaves the road network as $t_v^{start}$ and $t_v^{end}$. ATT is calculated as $ATT = \frac{1}{N} \sum_{v=1}^{N} \left( t_v^{end} - t_v^{start} \right)$.
• Average queue length (AQL): The queue length is the number of vehicles waiting on incoming lanes in a time step. We use the average over all recorded queue lengths as the final result.
lengths as the final result.
USA, as depicted in Figure 9. The road network of these data
are extracted from OpenStreetMap and converted to CityFlow
format. For each road network, corresponding traffic flow data B. Experiment Results
were collected based on taxi GPS data. Since the traffic flow 1) Execution performance comparison: We present the
of the real-world data set is more irregular and complex than ATT results on different datasets in Table IV. Overall, the
the synthetic data set, necessary simplifications have been proposed MAGSAC significantly outperforms other baseline
made. Statistical analysis has determined the following turning methods on ATT. In addition, the RL-based TSC method
ratios at intersections: 10% for left turns, 60% for proceeding significantly outperforms the traditional TSC method due to
straight, and 30% for right turns. the dynamic adjustment of traffic signals through real-time
2) Compared Methods: To assess and demonstrate the ef- traffic information, further proving the effectiveness of the RL-
fectiveness of the proposed MAGSAC approach in this paper, based TSC method. The MAGSAC approach shows the best
we conduct a comparative analysis involving conventional performance in all datasets and performs best in the New York



Fig. 9: The city maps for the real-world datasets: (a) Jinan (China), 12 traffic lights; (b) Hangzhou (China), 16 traffic lights; (c) New York (USA), 48 traffic lights.

TABLE IV: Performance results for different methods on various datasets. For ATT and AQL, lower is better. (Values shown: average travel time, ATT.)
Model | RL-based | Grid1×3 | Grid3×3 | Grid6×6 | Jinan3×4 | Hangzhou4×4 | NewYork16×3
FT | × | 73.51 | 105.48 | 210.93 | 814.11 | 718.29 | 1842.20
SOTL | × | 78.14 | 132.50 | 267.34 | 1459.45 | 1215.28 | 1887.85
MaxPressure | × | 76.35 | 99.69 | 186.56 | 361.33 | 407.17 | 392.31
PressLight | ✓ | 54.85 | 85.59 | 153.75 | 343.90 | 549.84 | 380.69
CoLight | ✓ | 46.05 | 89.27 | 128.76 | 318.75 | 336.71 | 229.66
MAGSAC | ✓ | 47.95 | 84.63 | 116.47 | 286.81 | 250.28 | 168.29

In detail, compared with CoLight, MAGSAC achieved reductions of 5.2% and 9.5% in ATT on the synthetic datasets Grid3×3 and Grid6×6, respectively. On the simple traffic scene Grid1×3, MAGSAC's ATT increased by 4.12% compared to CoLight, indicating poorer performance in this case. This is because there is less interaction and interdependence between the agents, so the scenario does not benefit much from graph-based methods. When we verify the model on the real-world datasets, MAGSAC achieves better results on Jinan3×4, Hangzhou4×4, and NewYork16×3: its ATT decreased by 10.02%, 25.7%, and 26.7%, respectively, compared to CoLight, and by 16.6%, 54.5%, and 55.8%, respectively, compared to PressLight. This shows that our proposed algorithm can effectively extract road information in complex road networks to optimize signal light decisions, and that it is well suited to complex urban networks.

2) Training performance: To substantiate the training performance of our proposed approach, Figure 10 and Figure 11 illustrate a comparison of the training performance of MAGSAC and the other RL-based methods on the Jinan3×4 dataset. It is clear that MAGSAC obtains the best convergence speed, with good sampling and training efficiency. In simple traffic scenarios, MAGSAC achieves performance similar to the other RL-based methods, but in complex real traffic scenarios, MAGSAC is significantly better. This shows that MAGSAC is more stable and adaptable in complex urban TSC problems.

Fig. 10: Convergence speed of ATT (dataset: Jinan3×4).

Fig. 11: Convergence speed of AQL (dataset: Jinan3×4).

As depicted in Figure 12, we conducted a comparison of the training time of our MAGSAC method against the other RL-based methods on different datasets. To ensure fairness, all models are evaluated individually on the same server. The comparative analysis reveals that MAGSAC exhibits a markedly shorter training time than the PressLight and CoLight approaches. Additionally, the training time of all models escalates as the number of traffic signals increases. Notably, even in scenarios featuring



As depicted in Figure 12, we compared the training time of our MAGSAC method against the other RL-based methods on different datasets. To ensure fairness, all models were evaluated individually on the same server. The comparison reveals that MAGSAC requires markedly less training time than the PressLight and CoLight approaches. Additionally, the training time of all models grows with the number of traffic signals. Notably, even on more intricate road networks, exemplified by NewYork16×3, MAGSAC can still train its own traffic-light strategy from the key neighborhood information around each intersection, which shows that our method is scalable.

Fig. 12: The training times of different models for 100 episodes. MAGSAC is efficient in all scenarios.
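The wall-clock numbers behind Figure 12 presuppose a fair measurement protocol: one model at a time, on the same machine, with the same episode budget. A minimal timing wrapper of the kind one could use (train_one_episode is a placeholder for a model's per-episode training routine) is:

```python
import time

def time_training(train_one_episode, episodes: int = 100) -> float:
    """Wall-clock seconds for a fixed training budget. Run one model at a
    time on the same server so the comparison across methods stays fair."""
    start = time.perf_counter()
    for _ in range(episodes):
        train_one_episode()
    return time.perf_counter() - start
```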
Finally, we evaluated the impact of the number of attention heads in the MHA mechanism on MAGSAC. We tested the performance of the MAGSAC and CoLight models with different numbers of attention heads on the Jinan3×4 dataset. As shown in Figure 13, when the number of heads increased from 1 to 4, the travel time of the model decreased significantly. However, as the number of heads grew further, the improvement in travel time became negligible and travel time even rose again. An excess of attention heads leads to overfitting on the training data, making it harder for the model to learn the key features, while also increasing its computational burden and ultimately degrading its decision-making. This shows that an appropriate number of attention heads yields a better traffic-light control strategy.

Fig. 13: Performance with different numbers of attention heads (Jinan3×4).
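To make the head-count sweep concrete, the sketch below varies the number of heads in a standard multi-head attention layer standing in for our neighborhood aggregation; PyTorch's nn.MultiheadAttention is used here for brevity rather than the full GAT stack, and the 128-dimensional embedding is an illustrative assumption.

```python
import torch
import torch.nn as nn

EMBED_DIM = 128  # assumed hidden size; must be divisible by num_heads

class NeighborAggregator(nn.Module):
    """Fuse neighbor-intersection embeddings via multi-head attention."""
    def __init__(self, num_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(EMBED_DIM, num_heads, batch_first=True)

    def forward(self, own: torch.Tensor, neighbors: torch.Tensor) -> torch.Tensor:
        # own: (B, 1, D) query; neighbors: (B, N, D) keys and values.
        fused, _ = self.attn(own, neighbors, neighbors)
        return fused.squeeze(1)  # (B, D) neighborhood-aware representation

# Head-count sweep as in Figure 13: each head attends over a D/num_heads
# subspace, so adding heads refines relations only up to a point.
for heads in (1, 2, 4, 8, 16):
    agg = NeighborAggregator(heads)
    out = agg(torch.randn(32, 1, EMBED_DIM), torch.randn(32, 4, EMBED_DIM))
    print(heads, tuple(out.shape))  # (32, 128) for every head count
```

Since the embedding dimension is split across heads, each additional head attends over a narrower subspace, which is consistent with the diminishing and eventually negative returns observed in Figure 13.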
VI. CONCLUSION

In this paper, we have presented an innovative approach named Multi-Agent Graph-based Soft Actor-Critic (MAGSAC), specifically designed for the management of real-world traffic signals to enhance urban traffic efficiency. First, our approach leverages graph neural network learning to capture the spatial characteristics of the environment, thereby strengthening the integration of temporal and spatial information and improving decision-making capabilities at various intersections. Then, the model incorporates a multi-layer attention mechanism that extracts crucial information from other traffic signals, which not only expands the observation domain of each traffic light but also greatly reduces the communication volume. In this way, the model can adapt to complex road networks and varying traffic conditions. During model training, we employ a CTDE architecture, which is conducive to more efficient and stable learning. Extensive experiments using both synthetic and real-world data on the CityFlow simulation platform demonstrate that our proposed method achieves superior performance, with noticeable improvements in travel time and training efficiency.

However, there are also some limitations. For instance, the current research is still conducted in idealized simulated environments and specific scenarios, so the impact of more complex traffic conditions and external factors on MAGSAC requires further validation. With the development of emerging technologies such as the IoT and edge computing, real-time traffic data can be obtained easily. In the future, we plan to extend our methodology to different scenarios, including urban, suburban, and rural areas, and to design signal control strategies tailored to the sparsity of traffic flow, so as to enhance the model's generalization capabilities. Additionally, to analyze how cooperation strategies between traffic signals change as the network scale grows, we will conduct further performance, ablation, and parameter experiments on different datasets to determine the impact of the number of attention heads and the GAT network structure on model performance.

ACKNOWLEDGMENTS

This work is supported in part by the Frontier Technologies R&D Program of Jiangsu (No. BF2024071), the Youth Talent Support Program of the Jiangsu Association for Science and Technology (No. JSTJ-2024-430), the Open Research Project of the State Key Laboratory of Novel Software Technology (No. KFKT2022B28), the Special Funds for Technology Transfer in Jiangsu Province (No. BA2022011), and the Special Funds for Industrial and Information Industry Transformation and Upgrading in Jiangsu Province (Project Name: Key Technologies and Industrialization of Industrial Internet Terminal Threat Detection and Response System).


Jing Shang received the B.S. degree in Communication Engineering from Nanjing Tech University, Nanjing, China, in 2018, and the M.S. degree in Computer Application Technology from Tianjin University of Science and Technology, Tianjin, China, in 2021. She is currently pursuing the Ph.D. degree in Software Engineering with the Nanjing University of Science and Technology, Nanjing, China. Her research interests include cloud/edge computing and the Internet of Vehicles.


Shunmei Meng received her Ph.D. degree from the Department of Computer Science and Technology, Nanjing University, China, in 2016. She is now an assistant professor in the Department of Computer Science and Technology, Nanjing University of Science and Technology, Nanjing, China. She has published papers in international journals and conferences such as TPDS, TII, TKDD, FGCS, AAAI, CIKM, ICDM, ICWS, and ICSOC. Her research interests include recommender systems, cloud computing, and security & privacy.

Jun Hou received the Ph.D. degree from Nanjing University of Science and Technology, Nanjing, China, in 2019. She is currently an Associate Professor with Nanjing Institute of Industry Technology, Nanjing. She has authored dozens of articles in prestigious journals and top-tier conferences. Her research interests include ideological education and data mining. Dr. Hou serves as a PC member for several international conferences.

Xiaoran Zhao received her Bachelor's degree from the School of Computer Science, Qufu Normal University, China, in 2022. She is now pursuing her Master of Engineering degree at the School of Computer Science, Qufu Normal University. She is committed to research in the fields of artificial intelligence and computer science. Her research interests include recommender systems, machine learning, and data mining.

Xiaokang Zhou (Member, IEEE) is currently an associate professor with the Faculty of Business Data Science, Kansai University, Japan. He received the Ph.D. degree in human sciences from Waseda University, Japan, in 2014. From 2012 to 2015, he was a research associate with the Faculty of Human Sciences, Waseda University, Japan. He was a lecturer/associate professor with the Faculty of Data Science, Shiga University, Japan, from 2016 to 2024. He has also worked as a visiting researcher with the RIKEN Center for Advanced Intelligence Project (AIP), RIKEN, Japan, since 2017. Dr. Zhou has been engaged in interdisciplinary research in the fields of computer science and engineering, information systems, and social and human informatics. His recent research interests include ubiquitous computing, big data, machine learning, behavior and cognitive informatics, cyber-physical-social systems, and cyber intelligence and security. He is a member of the IEEE CS and ACM, USA; IPSJ and JSAI, Japan; and CCF, China.

Rong Jiang is currently a Chief Professor, Deputy Dean of the Intelligent Application Research Institute, and Doctoral Supervisor at Yunnan University of Finance and Economics, Kunming, China. He received his Ph.D. degree in system analysis and integration from the School of Software, Yunnan University, China. He is the Director of the Yunnan Key Laboratory of Service Computing, the leader of the Yunnan Innovation Team of Service Computing and Digital Economy, a High-level Talent of Yunnan Province (First Level, "Yunling Scholar"), a Leading Talent of the Yunnan Ten Thousand Talents Program, an Expert with Outstanding Contributions in Yunnan Province, an Expert Enjoying Special Government Allowances in Yunnan Province, a Young and Middle-aged Academic and Technical Leader in Yunnan Province, an Excellent Teacher in Yunnan Province, and an Outstanding Member of the China Computer Federation (CCF). His current research interests include cloud computing, big data, blockchain, AI applications and information management, the digital economy, and software engineering. He has published more than 100 papers and received more than 100 prizes in recent years.

Lianyong Qi received his Ph.D. degree from the Department of Computer Science and Technology, Nanjing University, China. In 2010, he visited the Department of Information and Communication Technology, Swinburne University of Technology, Australia. He is now a professor in the College of Computer Science and Technology, China University of Petroleum (East China), China. His research interests include AI and recommender systems.

Qianmu Li (Member, IEEE) received the B.Sc. and Ph.D. degrees in computer application technology from Nanjing University of Science and Technology, Nanjing, China, in 2001 and 2005, respectively. He is currently a Full Professor with the School of Computer Science and Engineering, Nanjing University of Science and Technology. His research interests include information security, data mining, and deep learning. Prof. Li received the 2022 TRB AED50 award, the 2019 AAAI Winner award, the 2022 Jiangsu Province Top 10 Scientific and Technological Advances, the China Network and Information Security Outstanding Talent Award in 2016, and the Education Ministry Science and Technology Award in 2012.
