Traffic Signal Control System Using Deep Reinforcement Learning With Emphasis On Reinforcing Successful Experiences
ABSTRACT In recent years, several studies have been conducted on the dynamic control of traffic signal
durations using deep reinforcement learning with the aim of reducing traffic congestion. The unique
advantages of independent control of traffic signals include reduction in the cost of information transmission
and stable control without being affected by failures of other traffic signals. However, conventional
deep reinforcement learning methods such as Deep Q-Network (DQN) may suffer degraded learning
performance in a multi-agent environment containing multiple traffic signals. Therefore, we propose
a traffic light control system based on the dual targeting algorithm, which reinforces successful
experiences in multi-agent environments, with the aim of realizing a better traffic light control
system. The experimental results in a multi-agent environment using a traffic flow simulator based on
Simulation of Urban Mobility (SUMO) show that the proposed traffic light control system reduces the
waiting time at traffic lights by 33% compared to a conventional traffic light control system using deep
reinforcement learning. In future work, we aim to apply this research to traffic light control systems in
real environments.
INDEX TERMS Deep reinforcement learning, deep Q-network, Q-learning, multi-agent, traffic signal
control.
In this study, we propose a TSCS that can function without direct communication or transfer cost, and we aim to improve the learning performance of our TSCS. The main results are as follows:
• The learned model reduced the waiting time (latency) by 33%.
• The proposed method showed the least variation in learning results among the compared methods.
C. REWARD DESIGN
The objective of the TSCS in this study is to alleviate traffic
congestion in the entire road network. However, if each traffic signal agent had to obtain traffic information for the entire road network, the transfer cost would increase; therefore, the objective of each agent in this study is to minimize traffic congestion at its own intersection. Although achieving the goal of each agent does not necessarily achieve the goal of the system as a whole, we anticipate that the system-wide goal can be reached through cooperative control.
Cooperative control can be obtained by sharing information
with neighboring traffic signals through state perception.
In this study, we used the accumulated waiting time of vehicles as an indicator of traffic congestion at an intersection [16]. First, we define w_{i_t,t} (1 ≤ i_t ≤ N_veh) as the waiting time (in seconds) until step t of the i_t-th vehicle in the lanes heading to the agent's own intersection at step t. Here, N_veh is the number of vehicles in all lanes heading to its own intersection. The cumulative waiting time W_t at step t is calculated as follows:

W_t = \sum_{i_t=1}^{N_{veh}} w_{i_t,t}    (4)

W_t does not decrease with the step, and the larger its increase with respect to the step, the larger the number of vehicles waiting at the intersection. In other words, to reduce traffic congestion at an intersection, the smaller the increase in W_t, the larger the reward should be. Therefore, the reward at step t can be designed as follows:

r_t = W_{t-1} - W_t    (5)

The reward always takes a non-positive value, and its maximum value is zero when no vehicle is waiting at the intersection. Therefore, by using deep reinforcement learning to learn a strategy that maximizes the discounted return, congestion relief at the intersection of each agent can be achieved.

FIGURE 3. Conceptual diagram of proposed traffic light control system.

III. SIGNAL CONTROL SYSTEM USING DUAL TARGETING ALGORITHM
A. APPROACH OF THIS STUDY
This study assumes an environment that can be realized with a lower transfer cost than that of a conventional traffic signal system by eliminating the communication between traffic signals. However, because multiple traffic signals in the environment mutually affect each other, unstable learning due to the simultaneous learning problem becomes an issue when using conventional deep reinforcement learning, such as DQN, to acquire control policies. Simultaneous learning can lead to a situation where a strategy change by one mutually influencing agent affects the environment of the other agents, making learning difficult. It has been shown experimentally that each agent should converge its strategy quickly in order to address the aforementioned problem, for instance, by using exploitation-oriented learning [14].

In this study, we aim to reduce the impact of this simultaneous learning problem by using the DTA, which strongly reinforces successful experiences, for learning a TSCS. A conceptual diagram of the proposed traffic light control system is shown in Fig. 3, and the neural network architecture handled in this study is shown in Fig. 4. To verify the feasibility of the DTA, this study also deals with signal control systems in which the deep reinforcement learning method is replaced by DQN and by multistep DQN; these are further used as comparison targets, following the literature [12].

B. COLLECTION OF EXPERIENCE
In the traffic signal system proposed in this research, experience collection and training are performed according to the DTA.
Experience collection is performed according to the following steps. First, in each step, the state s_t is perceived by the TSCS, the action a_t is performed, the reward r_{t+1} is obtained from the environment, and the transition-destination state s_{t+1} is stored in the replay buffer. Next, the time when the signal is switched is defined as the end of the episode. At the end of the episode, the multistep returns R_i^{(τ_i - i + 1)} from steps i to τ_i, the end-of-episode step, are computed for the transitions collected in that episode.
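To make the reward design of (4) and (5) concrete, the following Python sketch computes the cumulative waiting time W_t and the step reward r_t from per-vehicle waiting times. It is a minimal illustration, not the authors' implementation; the list waiting_times is assumed to hold the accumulated waiting time (in seconds) of each vehicle in the lanes heading to the agent's intersection.

```python
# Minimal sketch of the reward in (4) and (5); `waiting_times` is assumed to
# hold w_{i_t,t}, the accumulated waiting time (s) of each vehicle currently
# in the lanes heading to this agent's intersection.

def cumulative_waiting_time(waiting_times):
    """Eq. (4): W_t = sum of w_{i_t,t} over all N_veh vehicles."""
    return float(sum(waiting_times))

class RewardTracker:
    """Keeps W_{t-1} so the reward of Eq. (5) can be formed at each step."""

    def __init__(self):
        self.prev_w = 0.0  # W_{t-1}; zero before the first observation

    def step_reward(self, waiting_times):
        w_t = cumulative_waiting_time(waiting_times)
        r_t = self.prev_w - w_t     # Eq. (5): r_t = W_{t-1} - W_t (<= 0)
        self.prev_w = w_t
        return r_t
```

In a SUMO-based setup, these per-vehicle values could be obtained through TraCI calls such as traci.vehicle.getAccumulatedWaitingTime; this is one possible choice, not necessarily the one used by the authors.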
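Similarly, the episode-based experience collection described above can be sketched as follows. The environment interface (perceive_state, apply_action, signal_switched), the policy callable, and the discount factor are placeholders introduced for illustration; computing the episode's multistep returns by a backward pass is one straightforward realization, not necessarily the exact procedure of the DTA in [12].

```python
from collections import deque

GAMMA = 0.99                      # discount factor (assumed value)
replay_buffer = deque(maxlen=100_000)

def collect_episode(env, policy):
    """Collect one episode; the episode ends when the signal is switched."""
    transitions = []                        # (s_t, a_t, r_{t+1}, s_{t+1})
    s = env.perceive_state()                # placeholder environment API
    done = False
    while not done:
        a = policy(s)
        r, s_next = env.apply_action(a)     # reward r_{t+1}, next state s_{t+1}
        transitions.append((s, a, r, s_next))
        replay_buffer.append((s, a, r, s_next))
        done = env.signal_switched()        # signal switch = end of episode
        s = s_next

    # At episode end, accumulate the discounted multistep return R_i from
    # each step i to the end-of-episode step tau_i by a backward pass.
    returns = [0.0] * len(transitions)
    g = 0.0
    for i in reversed(range(len(transitions))):
        g = transitions[i][2] + GAMMA * g
        returns[i] = g
    return transitions, returns
```

In the DTA, such episode returns are what allows successful experiences to be emphasized during learning; the precise update rule follows [12] and is not reproduced here.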
C. SIMULATION RESULTS
A comparison of the learning curves is shown in Fig. 5, and Fig. 6 shows a magnified view after 500 trials. Furthermore, Fig. 7, Fig. 8, and Fig. 9 show, as error bars, the variation over the 10 experiments after 500 trials for the DQN, the multistep DQN, and the proposed method, respectively. The unbiased sample variance at the 1,000th trial was 16.5 for the DQN, 0.06 for the multistep DQN, and 0.02 for the proposed method.

FIGURE 5. Waiting time comparison.
FIGURE 6. Waiting time comparison (enlarged figure after 500 trials).
FIGURE 9. Error bars of proposed method in 10 experiments.

To check whether the difference between the mean waiting times of the multistep DQN and the proposed method at the 1,000th trial is significant, the null hypothesis H0 is "the mean waiting time of the proposed method is equal to that of the multistep DQN," and the alternative hypothesis H1 is "the mean waiting time of the proposed method is less than that of the multistep DQN." A Welch's t-test was conducted with the significance level set at 0.01. The results showed P = 0.0003 < 0.01, and the null hypothesis was rejected.

D. CONSIDERATION
From Fig. 5, it is evident that when deep reinforcement learning was introduced, the performance of the proposed method was initially poor compared to that of the random agent; however, the latency was significantly reduced thereafter. We can also confirm that the proposed method using DTA improves the learning speed and converges faster than the methods that use DQN and multistep DQN.
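For reference, the variance and significance figures above can be computed with standard tools; the sketch below uses NumPy for the unbiased sample variance and SciPy's Welch's t-test with a one-sided alternative. The two arrays are synthetic placeholders standing in for the ten final waiting-time values per method, not the actual experimental data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic placeholders for the waiting time at the 1,000th trial over
# 10 runs per method; substitute the measured values here.
multistep_dqn = rng.normal(loc=100.0, scale=0.25, size=10)
proposed      = rng.normal(loc=99.0,  scale=0.15, size=10)

# Unbiased sample variance (ddof=1), as reported for each method.
print(np.var(multistep_dqn, ddof=1), np.var(proposed, ddof=1))

# Welch's t-test (unequal variances), one-sided alternative H1:
# the mean waiting time of the proposed method is less than that of
# the multistep DQN; reject H0 at the 0.01 level if p < 0.01.
t_stat, p_value = stats.ttest_ind(proposed, multistep_dqn,
                                  equal_var=False, alternative="less")
print(t_stat, p_value)
```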
The proposed method reduces the waiting time by more than 33% compared with similar TSCSs using DQN and multistep DQN. The learning time to convergence is also shorter than that of the existing methods, indicating that the method is effective even when real-time learning is considered. However, on a small computer such as those expected to be installed at each traffic signal in a real environment, we are concerned about increased computation time and insufficient memory compared to the existing methods.

In the future, it is necessary to verify the execution time of the proposed method and the existing methods on a small computer. In addition, the traffic flow on a real road network changes depending on the time of the day. In this experiment, the simulation time was 800 s under constant traffic flow; therefore, the learning performance under varying traffic flow has not yet been confirmed, and it is necessary to verify the proposed system in a more realistic environment in the future.

REFERENCES
[1] Ministry of Land, Infrastructure, Transport and Tourism, Ministry of Land, Infrastructure, Transport and Tourism Productivity Revolution Project. [Online]. Available: https://www.mlit.go.jp/common/001132350.pdf
[2] (2012). Ministry of Health, Labour and Welfare, Basic Survey on Wage Structure. Accessed: Feb. 24, 2022. [Online]. Available: https://www.mhlw.go.jp/toukei/itiran/roudou/chingin/kouzou/z2012/
[3] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, "Human-level control through deep reinforcement learning," Nature, vol. 518, pp. 529–533, Feb. 2015.
[4] H. Wei, G. Zheng, H. Yao, and Z. Li, "IntelliLight: A reinforcement learning approach for intelligent traffic light control," in Proc. 24th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2018, pp. 2496–2505.
[5] L. A. Prashanth and S. Bhatnagar, "Reinforcement learning with function approximation for traffic signal control," IEEE Trans. Intell. Transp. Syst., vol. 12, no. 2, pp. 412–421, Jun. 2011.
[6] W. Genders and S. Razavi, "Using a deep reinforcement learning agent for traffic signal control," 2016, arXiv:1611.01142.
[7] T. Wu, P. Zhou, K. Liu, Y. Yuan, X. Wang, H. Huang, and D. O. Wu, "Multi-agent deep reinforcement learning for urban traffic signal control in vehicular networks," IEEE Trans. Veh. Technol., vol. 69, no. 8, pp. 8243–8256, Aug. 2020.
[8] Z. Zhang, J. Yang, and H. Zha, "Integrating independent and centralized multi-agent reinforcement learning for traffic signal network optimization," 2019, arXiv:1909.10651.
[9] M. A. Wiering, "Multi-agent reinforcement learning for traffic light control," in Proc. 17th Int. Conf. Mach. Learn., 2000, pp. 1151–1158.
[10] K. J. Prabuchandran, H. K. An, and S. Bhatnagar, "Multi-agent reinforcement learning for traffic signal control," in Proc. 17th Int. IEEE Conf. Intell. Transp. Syst. (ITSC), Oct. 2014, pp. 2529–2534.
[11] S. Sen and M. Sekaran, "Multiagent coordination with learning classifier systems," in Adaption and Learning in Multi-Agent Systems. Berlin, Germany: Springer, 1995, pp. 218–233.
[12] N. Kodama, T. Harada, and K. Miyazaki, "Deep reinforcement learning with dual targeting algorithm," in Proc. Int. Joint Conf. Neural Netw. (IJCNN), Jul. 2019, pp. 1–6.
[13] N. Kodama, T. Harada, and K. Miyazaki, "Home energy management algorithm based on deep reinforcement learning using multi-step prediction," IEEE Access, vol. 9, pp. 153108–153115, 2021, doi: 10.1109/ACCESS.2021.3126365.
[14] N. Kodama, K. Miyazaki, and H. Kobayashi, "Proposal and evaluation of reward sharing method based on safety level," SICE J. Control, Meas., Syst. Integr., vol. 11, no. 3, pp. 207–213, May 2018, doi: 10.9746/JCMSI.11.207.
[15] P. A. Lopez, E. Wiessner, M. Behrisch, L. Bieker-Walz, J. Erdmann, Y.-P. Flotterod, R. Hilbrich, L. Lucken, J. Rummel, and P. Wagner, "Microscopic traffic simulation using SUMO," in Proc. 21st Int. Conf. Intell. Transp. Syst. (ITSC), Nov. 2018, pp. 2575–2582.
[16] J. Gao, Y. Shen, J. Liu, M. Ito, and N. Shiratori, "Adaptive traffic signal control: Deep reinforcement learning algorithm with experience replay and target network," 2017, arXiv:1705.02755.
[17] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," 2014, arXiv:1412.6980.
[18] E. Liang, R. Liaw, P. Moritz, R. Nishihara, R. Fox, K. Goldberg, J. E. Gonzalez, M. I. Jordan, and I. Stoica, "RLlib: Abstractions for distributed reinforcement learning," 2017, arXiv:1712.09381.

NAOKI KODAMA received the B.E. degree in mechanical engineering informatics and the M.E. degree in mechanical engineering from Meiji University, Japan, in 2016 and 2018, respectively, and the Ph.D. degree in industrial administration from the Tokyo University of Science, Japan, in 2021. He is currently an Assistant Professor at Meiji University. He is involved with research on machine learning, particularly reinforcement learning and deep reinforcement learning.

TAKU HARADA received the Ph.D. degree from the Tokyo University of Science, in 1993. He is currently pursuing the Doctor of Engineering degree. He was an Assistant Professor with the Tokyo University of Science, an Associate Professor with Mie University, and a Junior Associate Professor with the Tokyo University of Science, where he is currently an Associate Professor. He is involved with research on machine learning and metaheuristic optimization.

KAZUTERU MIYAZAKI (Member, IEEE) received the Graduate degree in precise engineering from the Faculty of Engineering, Meiji University, Japan, in 1981, and the Ph.D. degree from the Department of Computational Intelligence, Interdisciplinary Graduate School of Science and Engineering, Tokyo Institute of Technology, Japan, in 1996. In 1996, he joined the Institute as a Research Associate with the Interdisciplinary Graduate School of Science and Engineering, and in 1999, he became an Associate Professor at the National Institution for Academic Degrees and Quality Enhancement of Higher Education, Japan, where he is currently a Professor at the Research Department. He is involved with research on machine learning, particularly reinforcement learning, deep reinforcement learning, and text mining.