Traffic Signal Control System Using Deep Reinforcement Learning With Emphasis On Reinforcing Successful Experiences
ABSTRACT In recent years, several studies have been conducted on the dynamic control of traffic signal
durations using deep reinforcement learning with the aim of reducing traffic congestion. The unique
advantages of independent control of traffic signals include reduction in the cost of information transmission
and stable control without being affected by failures of other traffic signals. However, conventional
deep reinforcement learning methods such as Deep Q-Network (DQN) may suffer degraded learning
performance in a multi-agent environment containing multiple traffic signals. Therefore, we propose
a traffic light control system based on the dual targeting algorithm, which reinforces successful
experiences in multi-agent environments, with the aim of realizing a better traffic light control
system. The experimental results in a multi-agent environment using a traffic flow simulator based on
Simulation of Urban Mobility (SUMO) show that the proposed traffic light control system reduces the
waiting time at traffic lights by 33% compared to a conventional traffic light control system using deep
reinforcement learning. In future work, we aim to apply this research to traffic light control systems in
real environments.
INDEX TERMS Deep reinforcement learning, deep Q-network, Q-learning, multi-agent, traffic signal
control.
In this study, we propose a TSCS that can function without direct communication or transfer cost, and we aim to improve the learning performance of our TSCS. The main results are as follows:
• The learned model reduced the waiting time (latency) by 33%.
• The proposed method showed the least variation in learning results among the compared methods.
C. REWARD DESIGN
The objective of the TSCS in this study is to alleviate traffic
congestion in the entire road network. However, if each traffic signal agent had to obtain traffic information for the entire road network, the transfer cost would increase; therefore, the objective of each agent in this study is to minimize traffic congestion at its own intersection. Although achieving the goal of each agent does not necessarily achieve the goal of the system as a whole, we anticipate that the system-wide goal can be reached through cooperative control.
Cooperative control can be obtained by sharing information
with neighboring traffic signals through state perception.
In this study, we used the accumulated waiting time of vehicles as an indicator of traffic congestion at an intersection [16]. First, we define w_{i_t,t} (1 ≤ i_t ≤ N_veh) as the waiting time (in seconds) until step t of the i_t-th vehicle in the lanes heading to the agent's own intersection at step t. Here, N_veh is the number of vehicles in all lanes heading to its own intersection. The cumulative waiting time W_t at step t is calculated as follows:

W_t = \sum_{i_t=1}^{N_{veh}} w_{i_t,t}    (4)

W_t does not decrease with the step, and the larger its increase with respect to the step, the larger the number of vehicles waiting at the intersection. In other words, to reduce traffic congestion at an intersection, the smaller the increase in W_t, the larger the reward should be. Therefore, the reward at step t can be designed as follows:

r_t = W_{t-1} - W_t    (5)

The reward always takes a non-positive value, and its maximum value is zero when no vehicle is waiting at the intersection. Therefore, by using deep reinforcement learning to learn a strategy that maximizes the discounted return, congestion relief at the intersection of each agent can be achieved.

FIGURE 3. Conceptual diagram of proposed traffic light control system.

III. SIGNAL CONTROL SYSTEM USING DUAL TARGETING ALGORITHM
A. APPROACH OF THIS STUDY
This study assumes an environment that can be realized with a lower transfer cost than that of a conventional traffic signal system by eliminating the communication between traffic signals. However, because multiple traffic signals in the environment mutually affect each other, unstable learning due to the simultaneous learning problem becomes an issue when using conventional deep reinforcement learning, such as DQN, to acquire control policies. Simultaneous learning can lead to a situation where a strategy change by one mutually influencing agent affects the environment of the other agents, making learning difficult. It has been shown experimentally that each agent should converge its strategy quickly in order to address the aforementioned problem, for instance, by using exploitation-oriented learning [14].

In this study, we aim to reduce the impact of this simultaneous learning problem by using the DTA, which strongly reinforces successful experiences, for learning a TSCS. A conceptual diagram of the proposed traffic light control system is shown in Fig. 3, and the neural network architecture handled in this study is shown in Fig. 4. To verify the feasibility of the DTA, this study also deals with signal control systems in which the deep reinforcement learning method is replaced by DQN and by multistep DQN; these are further used as comparison targets, following the literature [12].

B. COLLECTION OF EXPERIENCE
In the traffic signal system proposed in this research, experience collection and training are performed according to the DTA.
Experience collection is performed according to the following steps. First, in each step, the state s_t is perceived by the TSCS, the action a_t is performed, the reward r_{t+1} is obtained from the environment, and the transition-destination state s_{t+1} is stored in the replay buffer. Next, the time when the signal is switched is defined as the end of the episode. At the end of the episode, the multistep returns R_i^{(τ_i - i + 1)} from steps i to τ_i, the end-of-episode step, are computed for the transitions collected in that episode.
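To make the reward design of (4) and (5) concrete, the following Python sketch computes the cumulative waiting time W_t and the step reward r_t from per-vehicle waiting times. It is a minimal illustration, not the authors' implementation; the list waiting_times is assumed to hold the accumulated waiting time (in seconds) of each vehicle in the lanes heading to the agent's intersection.

```python
# Minimal sketch of the reward in (4) and (5); `waiting_times` is assumed to
# hold w_{i_t,t}, the accumulated waiting time (s) of each vehicle currently
# in the lanes heading to this agent's intersection.

def cumulative_waiting_time(waiting_times):
    """Eq. (4): W_t = sum of w_{i_t,t} over all N_veh vehicles."""
    return float(sum(waiting_times))

class RewardTracker:
    """Keeps W_{t-1} so the reward of Eq. (5) can be formed at each step."""

    def __init__(self):
        self.prev_w = 0.0  # W_{t-1}; zero before the first observation

    def step_reward(self, waiting_times):
        w_t = cumulative_waiting_time(waiting_times)
        r_t = self.prev_w - w_t     # Eq. (5): r_t = W_{t-1} - W_t (<= 0)
        self.prev_w = w_t
        return r_t
```

In a SUMO-based setup, these per-vehicle values could be obtained through TraCI calls such as traci.vehicle.getAccumulatedWaitingTime; this is one possible choice, not necessarily the one used by the authors.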
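Similarly, the episode-based experience collection described above can be sketched as follows. The environment interface (perceive_state, apply_action, signal_switched), the policy callable, and the discount factor are placeholders introduced for illustration; computing the episode's multistep returns by a backward pass is one straightforward realization, not necessarily the exact procedure of the DTA in [12].

```python
from collections import deque

GAMMA = 0.99                      # discount factor (assumed value)
replay_buffer = deque(maxlen=100_000)

def collect_episode(env, policy):
    """Collect one episode; the episode ends when the signal is switched."""
    transitions = []                        # (s_t, a_t, r_{t+1}, s_{t+1})
    s = env.perceive_state()                # placeholder environment API
    done = False
    while not done:
        a = policy(s)
        r, s_next = env.apply_action(a)     # reward r_{t+1}, next state s_{t+1}
        transitions.append((s, a, r, s_next))
        replay_buffer.append((s, a, r, s_next))
        done = env.signal_switched()        # signal switch = end of episode
        s = s_next

    # At episode end, accumulate the discounted multistep return R_i from
    # each step i to the end-of-episode step tau_i by a backward pass.
    returns = [0.0] * len(transitions)
    g = 0.0
    for i in reversed(range(len(transitions))):
        g = transitions[i][2] + GAMMA * g
        returns[i] = g
    return transitions, returns
```

In the DTA, such episode returns are what allows successful experiences to be emphasized during learning; the precise update rule follows [12] and is not reproduced here.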
C. SIMULATION RESULTS
A comparison of the learning curves is shown in Fig. 5, and Fig. 6 shows a magnified view after 500 trials. Furthermore, Fig. 7, Fig. 8, and Fig. 9 show, as error bars, the variation over the 10 experiments after 500 trials for the DQN, the multistep DQN, and the proposed method, respectively. The unbiased sample variance at the 1,000th trial was 16.5 for the DQN, 0.06 for the multistep DQN, and 0.02 for the proposed method.

FIGURE 5. Waiting time comparison.
FIGURE 6. Waiting time comparison (enlarged figure after 500 trials).
FIGURE 9. Error bars of proposed method in 10 experiments.

To check whether the difference between the mean waiting times of the multistep DQN and the proposed method at the 1,000th trial is significant, the null hypothesis H0 is "the mean waiting time of the proposed method is equal to that of the multistep DQN," and the alternative hypothesis H1 is "the mean waiting time of the proposed method is less than that of the multistep DQN." A Welch's t-test was conducted with the significance level set at 0.01. The results showed P = 0.0003 < 0.01, and the null hypothesis was rejected.

D. CONSIDERATION
From Fig. 5, it is evident that when deep reinforcement learning was introduced, the performance of the proposed method was initially poor compared to that of the random agent; however, the latency was significantly reduced thereafter. We can also confirm that the proposed method using DTA improves the learning speed and converges faster than the methods that use DQN and multistep DQN.
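For reference, the variance and significance figures above can be computed with standard tools; the sketch below uses NumPy for the unbiased sample variance and SciPy's Welch's t-test with a one-sided alternative. The two arrays are synthetic placeholders standing in for the ten final waiting-time values per method, not the actual experimental data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic placeholders for the waiting time at the 1,000th trial over
# 10 runs per method; substitute the measured values here.
multistep_dqn = rng.normal(loc=100.0, scale=0.25, size=10)
proposed      = rng.normal(loc=99.0,  scale=0.15, size=10)

# Unbiased sample variance (ddof=1), as reported for each method.
print(np.var(multistep_dqn, ddof=1), np.var(proposed, ddof=1))

# Welch's t-test (unequal variances), one-sided alternative H1:
# the mean waiting time of the proposed method is less than that of
# the multistep DQN; reject H0 at the 0.01 level if p < 0.01.
t_stat, p_value = stats.ttest_ind(proposed, multistep_dqn,
                                  equal_var=False, alternative="less")
print(t_stat, p_value)
```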
The proposed method reduces the waiting time by more than 33% compared with similar TSCSs using DQN and multistep DQN. The learning time to convergence is also shorter than that of the existing methods, indicating that the method is effective even when real-time learning is considered. However, on a small computer such as those expected to be installed at each traffic signal in a real environment, we are concerned about increased computation time and insufficient memory compared to the existing methods.

In the future, it is necessary to verify the execution time of the proposed method and the existing methods on a small computer. In addition, the traffic flow on a real road network changes depending on the time of the day. In this experiment, the simulation time was 800 s under constant traffic flow; therefore, the learning performance under varying traffic flow has not yet been confirmed, and it is necessary to verify the proposed system in a more realistic environment in the future.

REFERENCES
[1] Ministry of Land, Infrastructure, Transport and Tourism, Ministry of Land, Infrastructure, Transport and Tourism Productivity Revolution Project. [Online]. Available: https://www.mlit.go.jp/common/001132350.pdf
[2] (2012). Ministry of Health, Labour and Welfare, Basic Survey on Wage Structure. Accessed: Feb. 24, 2022. [Online]. Available: https://www.mhlw.go.jp/toukei/itiran/roudou/chingin/kouzou/z2012/
[3] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, "Human-level control through deep reinforcement learning," Nature, vol. 518, pp. 529–533, Feb. 2015.
[4] H. Wei, G. Zheng, H. Yao, and Z. Li, "IntelliLight: A reinforcement learning approach for intelligent traffic light control," in Proc. 24th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2018, pp. 2496–2505.
[5] L. A. Prashanth and S. Bhatnagar, "Reinforcement learning with function approximation for traffic signal control," IEEE Trans. Intell. Transp. Syst., vol. 12, no. 2, pp. 412–421, Jun. 2011.
[6] W. Genders and S. Razavi, "Using a deep reinforcement learning agent for traffic signal control," 2016, arXiv:1611.01142.
[7] T. Wu, P. Zhou, K. Liu, Y. Yuan, X. Wang, H. Huang, and D. O. Wu, "Multi-agent deep reinforcement learning for urban traffic signal control in vehicular networks," IEEE Trans. Veh. Technol., vol. 69, no. 8, pp. 8243–8256, Aug. 2020.
[8] Z. Zhang, J. Yang, and H. Zha, "Integrating independent and centralized multi-agent reinforcement learning for traffic signal network optimization," 2019, arXiv:1909.10651.
[9] M. A. Wiering, "Multi-agent reinforcement learning for traffic light control," in Proc. 17th Int. Conf. Mach. Learn., 2000, pp. 1151–1158.
[10] K. J. Prabuchandran, H. K. An, and S. Bhatnagar, "Multi-agent reinforcement learning for traffic signal control," in Proc. 17th Int. IEEE Conf. Intell. Transp. Syst. (ITSC), Oct. 2014, pp. 2529–2534.
[11] S. Sen and M. Sekaran, "Multiagent coordination with learning classifier systems," in Adaption and Learning in Multi-Agent Systems. Berlin, Germany: Springer, 1995, pp. 218–233.
[12] N. Kodama, T. Harada, and K. Miyazaki, "Deep reinforcement learning with dual targeting algorithm," in Proc. Int. Joint Conf. Neural Netw. (IJCNN), Jul. 2019, pp. 1–6.
[13] N. Kodama, T. Harada, and K. Miyazaki, "Home energy management algorithm based on deep reinforcement learning using multi-step prediction," IEEE Access, vol. 9, pp. 153108–153115, 2021, doi: 10.1109/ACCESS.2021.3126365.
[14] N. Kodama, K. Miyazaki, and H. Kobayashi, "Proposal and evaluation of reward sharing method based on safety level," SICE J. Control, Meas., Syst. Integr., vol. 11, no. 3, pp. 207–213, May 2018, doi: 10.9746/JCMSI.11.207.
[15] P. A. Lopez, E. Wiessner, M. Behrisch, L. Bieker-Walz, J. Erdmann, Y.-P. Flotterod, R. Hilbrich, L. Lucken, J. Rummel, and P. Wagner, "Microscopic traffic simulation using SUMO," in Proc. 21st Int. Conf. Intell. Transp. Syst. (ITSC), Nov. 2018, pp. 2575–2582.
[16] J. Gao, Y. Shen, J. Liu, M. Ito, and N. Shiratori, "Adaptive traffic signal control: Deep reinforcement learning algorithm with experience replay and target network," 2017, arXiv:1705.02755.
[17] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," 2014, arXiv:1412.6980.
[18] E. Liang, R. Liaw, P. Moritz, R. Nishihara, R. Fox, K. Goldberg, J. E. Gonzalez, M. I. Jordan, and I. Stoica, "RLlib: Abstractions for distributed reinforcement learning," 2017, arXiv:1712.09381.

NAOKI KODAMA received the B.E. degree in mechanical engineering informatics and the M.E. degree in mechanical engineering from Meiji University, Japan, in 2016 and 2018, respectively, and the Ph.D. degree in industrial administration from the Tokyo University of Science, Japan, in 2021. He is currently an Assistant Professor at Meiji University. He is involved with research on machine learning, particularly reinforcement learning and deep reinforcement learning.

TAKU HARADA received the Ph.D. degree from the Tokyo University of Science, in 1993. He is currently pursuing the Doctor of Engineering degree. He was an Assistant Professor with the Tokyo University of Science, an Associate Professor with Mie University, and a Junior Associate Professor with the Tokyo University of Science, where he is currently an Associate Professor. He is involved with research on machine learning and metaheuristic optimization.

KAZUTERU MIYAZAKI (Member, IEEE) received the Graduate degree in precise engineering from the Faculty of Engineering, Meiji University, Japan, in 1981, and the Ph.D. degree from the Department of Computational Intelligence, Interdisciplinary Graduate School of Science and Engineering, Tokyo Institute of Technology, Japan, in 1996. In 1996, he joined the Institute as a Research Associate with the Interdisciplinary Graduate School of Science and Engineering, and in 1999, he became an Associate Professor at the National Institution for Academic Degrees and Quality Enhancement of Higher Education, Japan, where he is currently a Professor at the Research Department. He is involved with research on machine learning, particularly reinforcement learning, deep reinforcement learning, and text mining.