


Published in IET Intelligent Transport Systems


Received on 28th October 2009
Revised on 26th February 2010
doi: 10.1049/iet-its.2009.0096

ISSN 1751-956X

Urban traffic signal control using reinforcement learning agents

P.G. Balaji, X. German, D. Srinivasan
Department of Electrical and Computer Engineering, 4 Engineering Drive 3, National University of Singapore,
Singapore 117576, Singapore
E-mail: [email protected]

Abstract: This study presents a distributed multi-agent-based traffic signal control for optimising green timing in an urban arterial road network to reduce the total travel time and delay experienced by vehicles. The proposed multi-agent architecture uses traffic data collected by sensors at each intersection, stored historical traffic patterns and data communicated from agents at adjacent intersections to compute the green time for a phase. Parameters such as the weights and threshold values used in computing the green time are fine-tuned by online reinforcement learning with the objective of reducing overall delay. PARAMICS software was used as a platform to simulate 29 signalised intersections in the Central Business District of Singapore and to test the performance of the proposed multi-agent traffic signal control for different traffic scenarios. The proposed multi-agent reinforcement learning (RLA) signal control showed significant improvement in mean time delay and speed in comparison with other traffic control systems such as the hierarchical multi-agent system (HMS), cooperative ensemble (CE) and actuated control.

1 Introduction

Traffic control in urban areas is becoming increasingly complex with the exponential growth in vehicle count. Expansion of the road network to accommodate the increased vehicle count is not a socially feasible option, and it is essential to increase the utilisation of the existing infrastructure through proper regulation of traffic flow. Traffic signals were introduced to control the traffic flow, thereby improving the safety of road users. However, traffic signals create bottlenecks for traffic flow in lanes that do not have the right of way during a specific phase, and optimisation of signal timings is required to reduce the overall delay experienced by all vehicles at the intersection. Optimisation can be performed in an offline (pre-timed) or online (adaptive) manner.

In pre-timed or fixed-time signal control, Webster's formula is used to calculate the green and cycle times offline using traffic data collected from the road network. Pre-timed signal control cannot handle any variation in traffic from the training patterns, resulting in increased travel time delay. Adaptive signal controls overcome this limitation by adjusting the timings dynamically with changing traffic patterns. Actuated signal controls adaptively increment or decrement the green time of a phase on detecting the presence or absence of vehicles in a lane. Actuated controls lack the ability to foresee increased traffic flow and base their decisions on instantaneous flow values. Further, they result in higher delay, as green time is not held for upstream platoons, causing a higher percentage of vehicles to be stopped [1].

Various computational intelligence techniques such as hybrid fuzzy genetic algorithms [2], ant colony-based optimisation [3], emotional algorithms [4] and neuro-fuzzy networks [5] calculate the green time required by forecasting the future traffic inflow. The first limitation is that a large training data set encompassing all the dynamics of the traffic is required for fine-tuning the parameters of the controller and is difficult to obtain. Second, most of the above controllers were designed for isolated intersections, thereby simplifying the model and reducing their suitability for coordinated interconnected intersections.


SCOOT [6, 7], SCATS [8, 9] and Green Link Determining (GLIDE) [10] are examples of centralised traffic signal controllers that have been implemented successfully on large-scale networks. However, centralised controllers increase the requirement for extensive communication of information and the computational requirement for efficient mining of the data needed to compute optimal green time. This limitation can be addressed by implementing a distributed multi-agent architecture, where the larger problem is divided into smaller sub-problems. A multi-agent system is a group of autonomous agents, each capable of perceiving the environment and deciding its own course of action for achieving a common goal. The agents can achieve this either by cooperation or by competition. Communication between agents widens the global view of each agent and increases coordination. In [11], a distributed agent system utilising evolutionary game theory to assign rewards or penalties was proposed; its limitation is the necessity to compute a pay-off matrix for each state-action pair. In [12], the advantages and disadvantages of multi-agent systems were highlighted and a theoretical agent model based on estimated traffic state was proposed. In [13-15], semi-distributed agent architectures based on distributed constraint optimisation, swarm intelligence methods and hybrid intelligent techniques combining fuzzy logic, neural networks and evolutionary computation have been attempted. Their limitation is the amount of data to be communicated and the conflict of decisions among agents. An agent system with reinforcement learning capability has been shown to improve performance significantly [16]; however, tests were conducted on a simple road network with a small number of intersections. In this paper, a reinforcement learning distributed multi-agent architecture is proposed and tested on a large urban arterial road network.

The paper is organised into seven sections. Section 2 details the proposed multi-agent architecture. Section 3 describes learning of the parameters using reinforcement learning. Section 4 details the performance measures used, followed by a brief note in Section 5 on the benchmarks used. Section 6 discusses the simulation platform used and the comparative analysis against the benchmark signal controls. Section 7 summarises the work done in this paper.

2 Proposed agent architecture

The proposed multi-agent system has a distributed architecture, with each agent capable of making its own decisions without any central supervising agent. The traffic signal at each intersection is controlled by an agent. The agent collects local traffic data from induction loop detectors placed near the stop line of the incoming and outgoing links connecting the neighbouring intersections. Agents communicate outgoing traffic information to the neighbouring agents. The structure of the individual agent architecture is shown in Fig. 1.

Figure 1 Proposed agent architecture

Based on the locally collected inputs and the communicated information, intersection agents determine the green time required for each phase in the next cycle period. Each agent possesses local memory to store traffic demand and creates a data repository to assess future traffic demand and the effectiveness of the agent's actions. Agents fine-tune and learn the decision model of each intersection by observing the expected utility for each state-action pair and updating it using online Q-learning.

The traffic demands in the road network are quite uniformly spread and can be characterised by different types of distribution based on the traffic flow information collected from the network. However, vehicles have a large number of route choices and the route selected depends on driver behaviour, therefore no specific green wave policy can be selected based only on historic traffic flow patterns. Explicit offset settings were therefore not used in this work, as synchronisation is achieved through communication of information between agents and learning by visiting each state-action pair a sufficient number of times.
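As a rough illustration of this architecture, the state each intersection agent needs to keep (local detector data, a historic repository, advice received from neighbours and a Q-matrix) could be organised as in the following sketch; the class and field names are assumptions made here for clarity, not identifiers from the published system.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class PhaseData:
    """Detector measurements aggregated over one phase (illustrative fields)."""
    t_occupied: float = 0.0   # seconds the stop-line detector was occupied
    q_length: float = 0.0     # queue length at the end of the phase
    v_count: int = 0          # vehicles crossing the intersection during the phase

@dataclass
class IntersectionAgent:
    """One distributed signal-control agent (a sketch, not the authors' code)."""
    agent_id: str
    neighbour_ids: List[str] = field(default_factory=list)
    local_data: Dict[str, PhaseData] = field(default_factory=dict)   # keyed by approach/lane
    advice: Dict[str, float] = field(default_factory=dict)           # occupancy/count received from neighbours
    history: Dict[int, float] = field(default_factory=dict)          # Hscore/Hflow per time period
    q_matrix: List[List[float]] = field(default_factory=list)        # 9 traffic states x 12 actions

    def broadcast(self) -> Dict[str, float]:
        """Outgoing-link occupancy to be broadcast (with an identification tag) to neighbours."""
        return {link: data.t_occupied for link, data in self.local_data.items()}
```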


2.1 Traffic input

Traffic data such as the vehicle occupancy Toccupied (the amount of time a vehicle is present on the detector), Qlength (the length of the queue of vehicles at the end of each phase) and Vcount (the number of vehicles crossing a specific intersection during a phase) are collected from the incoming and outgoing links of the intersection. Since vehicles in most lanes are free to choose the exit lanes, data from outgoing detectors need to be transmitted to neighbouring intersection agents to enable prediction of incoming traffic. For proper estimation of the current traffic state, the queue value has to be used along with the vehicle occupancy and count, which tend to stagnate during high traffic flow.

Traffic state estimation is performed based on the occupancy, queue and vehicle count of the maximum congested lane (1), as averaging across the lanes causes improper classification of traffic

Toccupied = max_i (max_j (Toccupied(a_i l_j)))    (1)

where a_i l_j is the jth lane of approach a_i.

2.2 Rule base

Agents compute the new phase length in a cycle based on the locally observed traffic input and the information communicated by neighbouring intersections using a set of rules. The rules specify the required amount of change in green time d_i for a specific traffic condition at the intersection, and the new phase length is calculated as the weighted sum of the rule outputs as shown in

T_phase^new = T_phase^old + Σ_i r_i d_i,    T_min ≤ T_phase^new ≤ T_max    (2)

where r_i is the weight assigned to each rule, d_i is the output of each rule and i is the number of rules. Upper and lower limits [T_max, T_min] are imposed to avoid indiscriminate increase of green time. For all signals, T_min is fixed at 10 s; however, T_max varies in accordance with the number of phases and a total cycle length limited to 120 s. The different rules used by an agent for estimation of green time are explained in the following sections. Calculation of r_i is explained in detail in Section 3.3.

2.2.1 Occupancy ratio: The agent uses the occupancy ratio (the ratio of vehicle occupancy time to green time of the phase) to estimate the green time required by vehicles present at the stop line of the intersection. Occupancy is directly related to vehicle density and indicates the current state of the intersection. Based on the speed-flow-density characteristics, the ratio of vehicle occupancy (in seconds) to green time of a phase gives an accurate indication of the degree of saturation of the network. However, there is no universal best value for the occupancy ratio; it varies with the level of congestion. An underutilised phase (large green time for a low vehicle count) has a low occupancy ratio and increases the delay experienced by vehicles. Based on the occupancy ratio, each agent computes the extension or reduction in green time of the phase in progress

d_1 = Toccupied / OccRatio − T_phase    (3)

2.2.2 Local traffic variations: The second most important factor influencing the adjustment of green time of a phase is the local variation in traffic conditions over consecutive time periods. As the cycle length is dynamically adjusted by the agents, the ratio of Toccupied to T_phase at consecutive time periods is used rather than the raw vehicle occupancy data

load = Toccupied / T_phase    (4)

Agents decide the required change in green time by comparing the load with load_target, a threshold value. The threshold value is computed as the average of the load values experienced during the previous t cycles, where t represents the number of previous cycles to be considered. The change in green time is computed using

d_2 = Δ · T_phase    (5)

where

Δ = max(load_new − load_target, 0),        if load_new > load_old
Δ = min(load_new − load_target, 0) / 2,    if load_new < load_old

The old value of the load is updated with the current load value after every time period. To ensure that extension of green time is relatively slower than reduction of green time, a correction term of 1/2 is included in the computation of Δ. A large correction value can cause instability because of shorter phase splits.

2.2.3 Neighbourhood advice: An agent's environment is usually affected by the actions of neighbouring agents. This necessitates modifying the behaviour of an agent based on the information communicated by neighbouring agents. The neighbouring agents communicate the vehicle occupancy and count on the outgoing links of their intersections. Data are communicated as a simple broadcast with an identification tag to all the neighbouring intersections. Based on the information in the directory facilitator, agents decide whether to receive the broadcast information. The received data are stored as Advice in the data repository. The communicated data permit forecasting of the traffic inflow, and the green time is adjusted accordingly as in

d_3 = Advice_new / OccRatio − T_phase    (6)

Advice_new = (Advice_mem + Advice_old) / 2    (7)

After calculation of d_3, the average of the Advice_mem received in the current time period and Advice_old is used to update the repository.


If an approach is congested such that at least one turning movement is blocked during a phase, then the vehicle count for that movement is set to zero. This situation arises when the queue from the downstream intersection reaches the current intersection because of queue spillback and leads to deadlock formation. It is not possible for an agent to differentiate between an empty and a saturated intersection from the count alone. However, it is possible to differentiate the two scenarios by using a combination of the occupancy and count data communicated by the neighbouring agents. If the vehicle count on the outgoing lane is null and the occupancy is not null, the agent can identify the lane as congested and blocked because of queue spillback. Under such circumstances, the green time for the phase is kept fixed at the minimum limit of 10 s to allow clearance of vehicles.

3 Reinforcement learning

All agents in the network must be capable of learning the model of the intersection they control based on an assessment of the present average traffic condition and the previous day's traffic condition at the same period. The learning period has to be long enough to allow aggregation of sufficient traffic information to estimate the current traffic state and capture the traffic dynamics. The learning period can be found by experimentation and was calculated as 500 s for the specific road network considered for testing. Conventional supervised learning is difficult as the exact desired value is not available, and unsupervised learning or a combination of both (reinforcement learning) needs to be employed. Reinforcement learning utilises a scalar time-delayed reward received from the environment on selecting an action in a specific state to modify the parameters of the intersection model. In this paper, Q-learning was used to modify the parameters.

3.1 Q-learning

Q-learning [17] is a reinforcement learning technique that learns the action value function, which provides the expected utility of taking an action in a given state and then following a fixed policy thereafter. The utility or reward is received after a time delay from the environment and is a scalar quantity that does not exactly specify the action to be taken. Each agent maintains a Q-matrix that stores the Q-values for each state-action pair and is updated iteratively as shown in (8). The Q-values reach their optimum when all states are visited a sufficiently large number of times

Q(s, a)* = (1 − α)Q(s, a) + α(r + γ max_i Q(s′, a_i))    (8)

where Q(s, a)* is the optimal value, α is the learning rate in the range [0, 1] and γ is the future discount reward. The learning rate and future discount reward were determined through experimentation to be 0.33 and 0.05, as a trade-off needs to be made between the rate of convergence and precision.

3.2 Traffic state estimation

Traffic is discretised into different states using queue and flow data. The average queue computed at the end of each phase by (9) can be used for traffic classification. However, for proper classification of traffic, the flow value needs to be used in conjunction with the queue. The values of queue and flow at time period t are taken as Qscore and flow(t). At the start of the learning process, there is no history data available. However, at time period t + 1, the data of Qscore and flow(t) get stored as the history values for the previous day, namely Hscore and Hflow, respectively. The change in current traffic is computed as the difference between the current traffic queue score and the queue experienced on the previous day at the next time period, Qscore(t) − Hscore(t + 1), and assigned a membership grade that classifies the current traffic change into low, medium and high traffic as shown in Fig. 2

Qscore = Σ Qlength(phase_i) / N_phase    (9)

Figure 2 Structure of fuzzy membership

The rate of change of traffic dx is computed as (Hflow(t + 1) − flow(t))/flow(t) and (Hscore(t + 1) − Qscore)/Qscore and is used to determine whether the traffic is decreasing, stable or increasing, and assigned a membership grade similar to Fig. 2. Combining the rate of change of traffic and the current change in traffic, the traffic at the intersection is classified into nine possible states as shown in Table 1. The current traffic state is determined as the output with the highest firing level using fuzzy logic.

Table 1 Possible traffic states

                           Changes in traffic
Current traffic    Decreasing    Stable    Increasing
low                0             1         2
medium             3             4         5
high               6             7         8
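A crisp approximation of this nine-state classification is sketched below; the published controller assigns fuzzy membership grades (Fig. 2) and takes the state with the highest firing level, whereas the thresholds and sign conventions used here are illustrative assumptions only.

```python
def classify_traffic_state(q_score, h_score_next, flow, h_flow_next,
                           change_threshold=0.2, trend_threshold=0.05):
    """Map the current traffic to one of the nine states of Table 1.

    q_score, flow             -- current queue score (eq. 9) and flow at period t
    h_score_next, h_flow_next -- stored previous-day values for period t + 1
    Returns an index 0..8 (rows: low/medium/high current traffic,
    columns: decreasing/stable/increasing change in traffic).
    """
    # current change in traffic: queue score relative to the historic queue score
    change = (q_score - h_score_next) / max(abs(h_score_next), 1e-6)
    if change < -change_threshold:
        level = 0          # low
    elif change > change_threshold:
        level = 2          # high
    else:
        level = 1          # medium

    # rate of change of traffic dx from the flow comparison
    dx = (h_flow_next - flow) / max(abs(flow), 1e-6)
    if dx > trend_threshold:
        trend = 0          # decreasing (one plausible reading of the sign)
    elif dx < -trend_threshold:
        trend = 2          # increasing
    else:
        trend = 1          # stable

    return level * 3 + trend
```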


3.3 Parameter update

The traffic states have been completely defined in the previous section. However, the action space needs to be defined to complete the Q-matrix. The total green time extension is defined by r_i, whose value lies in the range [0, 1] and is divided into 12 equal values. Each agent maintains a Q-matrix that matches the nine traffic states defined in the previous section to the 12 action values. At the end of each learning period, the agent computes the reward r received from the environment after choosing action a_i when in state s_i. The reward value is computed using

r_new = (Qscore(t) − Hscore(t))/Qscore(t) + η (Qscore(t) − Qscore(t − 1))/Qscore(t)    (10)

where η is in the range [0, 1]. The reward value is positive if the queue is smaller than the historic queue as well as the queue for the previous period. Since the traffic demands vary with time, the queue value also varies. Comparison with the historic value tracks the traffic pattern over a long period, and the queue value of the previous time period tracks short-time variations. Therefore η needs to be kept small, so that the reward comes mainly from the comparison with historic values. Once the current state is detected, the appropriate action is chosen as the one with the highest Q-value. In the case of multiple actions having the same Q-value, one of the actions is selected randomly. A greedy action selection strategy with random tie-breaking was used to increase exploration and thus the number of visited state-action pairs.

3.4 Memory update

Once the state-action pair has been found, the memory of the agent is updated. The historic queue score is updated iteratively in

Hscore_new = (1 − β)Hscore_old + βr    (11)

The coefficient β decreases from 0.5 to 0.1 over the first few iterations, then stays at 0.1. The same equation as in (11) is used for calculating Hflow. The value of β needs to be varied with large steps at the start, so that higher preference is given to the time-delayed rewards than to the historical value, and then reduced to a lower value so that the learnt values are not forgotten; it is determined through experimentation with different values.

Cooperation between the agents is achieved by averaging the Q-matrix values between immediately adjacent neighbouring agents in each time period. This ensures an improvement in performance, as agents learn from the experience of the neighbouring agents.
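The learning loop of Sections 3.1-3.4 can be pieced together roughly as follows. ALPHA and GAMMA follow the reported values of 0.33 and 0.05; ETA, the beta schedule length and all function and variable names are assumptions of this sketch rather than details taken from the paper.

```python
import random

ALPHA, GAMMA = 0.33, 0.05       # learning rate and discount, as reported in Section 3.1
ETA = 0.1                       # eta in eq. (10); kept small, exact value assumed here
N_STATES, N_ACTIONS = 9, 12     # Table 1 states x 12 discrete action values

Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]   # one Q-matrix per agent

def reward(q_score_t, h_score_t, q_score_prev):
    """Eq. (10): comparison with the historic and previous-period queue scores."""
    return ((q_score_t - h_score_t) / q_score_t
            + ETA * (q_score_t - q_score_prev) / q_score_t)

def q_update(q, s, a, r, s_next):
    """Eq. (8): one tabular Q-learning update for the state-action pair (s, a)."""
    q[s][a] = (1 - ALPHA) * q[s][a] + ALPHA * (r + GAMMA * max(q[s_next]))

def select_action(q, s):
    """Greedy selection with random tie-breaking, as described in Section 3.3."""
    best = max(q[s])
    return random.choice([a for a, v in enumerate(q[s]) if v == best])

def update_history(h_old, r, iteration, beta_hi=0.5, beta_lo=0.1, warmup=10):
    """Eq. (11): beta decays from 0.5 to 0.1 over the first few iterations (warmup length assumed)."""
    beta = max(beta_lo, beta_hi - (beta_hi - beta_lo) * iteration / warmup)
    return (1 - beta) * h_old + beta * r

def cooperate(q_self, q_neighbours):
    """Section 3.4: average the Q-matrices of immediately adjacent agents."""
    tables = [q_self, *q_neighbours]
    return [[sum(t[s][a] for t in tables) / len(tables) for a in range(N_ACTIONS)]
            for s in range(N_STATES)]
```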


4 Performance measures

The performance of the proposed reinforcement learning (RLA) algorithm in a simulated road traffic environment is evaluated based on three parameters, namely the vehicle count, the total mean delay and the current mean speed of vehicles inside the road network (Fig. 3).

Figure 3 Simulated road network with indication of prominent hotspots caused by pre-timed signals

4.1 Vehicle count

The vehicle count is the total number of vehicles present inside the road network at a given time and is calculated as the difference between the number of vehicles entering and leaving the network during the estimation period. The vehicle count gives an accurate indication of the congestion level inside the network at a specified period of time.

4.2 Total vehicle mean delay

The total mean delay is the average value of the delay experienced by vehicles in reaching their destination from their starting point in the network and is expressed in seconds. The mean delay is the sum of the total stopping time, which corresponds to time lost waiting at intersections, and the travel time, which depends on the speed of vehicles inside the network

TAD = Σ_{i=1}^{n} TD / TN    (12)

where TAD is the total average delay, n is the number of intersections, TD is the delay experienced by vehicles at an intersection and TN is the total number of vehicles released into the network. Little [18] and the Highway Capacity Manual HCM2000 [19] show the wide acceptance of the delay parameter for validating signal controllers.

4.3 Current vehicle mean speed

For a better understanding of the results, the current mean speed of vehicles inside the network is used along with the time delay value. The importance of using current mean speed in validating a signal controller has been highlighted in [20].

5 Benchmarks

It is difficult to find a good benchmark for the large-scale traffic signal control problem for the following reasons:

1. Some of the existing algorithms are developed for simplified traffic scenarios and hence are not suitable for benchmarking.

2. Commercial traffic signal control programs, which are known to work well, are not easily available because of proprietary reasons.

Hence, in all the experiments, GLIDE [10] (the modified version of SCATS used in Singapore), the hierarchical multi-agent system (HMS) [21] and the cooperative ensemble (CE) [22] are used as benchmarks. HMS and CE have already been compared with GLIDE, and hence simulation plot results are not included to avoid redundancy. HMS is a semi-distributed multi-agent traffic signal control with a hierarchical architecture. It consists of three layers of agents with increasing hierarchy and control. The agent at the intersection decides the green time required based on local traffic information, cannot communicate with agents in the same layer of the hierarchy, and uses Webster's method to compute the green time requirement. The zonal agent oversees the functioning of five intersection agents by monitoring the action plan of each individual agent and providing directives received from the supervising agent. The supervising agent is in the top layer of the hierarchy and oversees the functioning of the entire system. The zonal agents utilise an evolutionary fuzzy algorithm to generate the rule base for control and compute the cooperation levels required between agents using a neuro-fuzzy system. For a detailed description of the HMS, refer to [21].

CE [22] is a distributed multi-agent architecture in which the agents self-organise and form clusters of cooperating agents. The clusters are formed dynamically using a graph theoretical method. The teams or clusters cooperate to reduce the overall time delay experienced by the group rather than by an individual. Overlap in the cooperative clusters is possible; however, it is limited to avoid excessive computation.

6 Simulation results and discussions

The proposed RLA signal controller was tested on a simulated network of 29 intersections. The simulated network is a highly congested section of the busy Central Business District area in Singapore [23]. The network is simulated using PARAMICS, a microscopic simulation software capable of efficiently simulating driver behaviour, dynamic re-routing of vehicles and incidents. The network serves as an ideal test bed because of its geometry and the heterogeneity in the classification of links (major and minor roads with varying speed limits).

Four types of simulation were used to evaluate the performance of the proposed RLA signal control, namely a typical scenario with a morning peak (3 h), a typical scenario with morning and evening peaks (24 h), an extreme scenario with dual peaks (6 h) and an extreme scenario with multiple peaks (24 h). It must be noted that the extreme traffic scenarios are hypothetical traffic peaks created to test the reliability of traffic control by the proposed RLA signal control under cyclic repetitive stress conditions. They also serve to showcase the response and settling time of the signal control.

The origin-destination data collected from the Land Transport Authority of Singapore are used to recreate the peak traffic conditions. Even though the peak traffic data are pre-fixed, the number of vehicles actually released into the network varies according to the random seeds set before the simulation. Since PARAMICS dynamically adjusts traffic model characteristics like gap acceptance, lane change, merging and so on, the traffic dynamics are different for each simulation run with different random seeds. The PARAMICS model has been validated for the specific data and has been previously used for simulation testing in [5, 20, 22, 23].
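Before the individual scenarios are discussed, the three measures of Section 4 can be obtained from per-run simulation output roughly as in this sketch; the function and argument names are assumptions of this sketch and do not correspond to the PARAMICS API.

```python
def vehicle_count(entered, left):
    """Section 4.1: vehicles remaining inside the network during the estimation period."""
    return entered - left

def total_average_delay(intersection_delays, vehicles_released):
    """Eq. (12): TAD, total delay summed over intersections per vehicle released."""
    return sum(intersection_delays) / vehicles_released

def current_mean_speed(speeds_in_network):
    """Section 4.3: mean speed of the vehicles currently inside the network."""
    return sum(speeds_in_network) / len(speeds_in_network)
```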


6.1 Typical scenario with morning peak (3 h)

The typical scenario with a morning peak is used to validate the performance improvement in traffic conditions achieved by RLA signal control for short-time traffic variations. Twenty simulation runs using different random seeds were carried out for each signal control technique compared. Since the variance of the outcomes of the simulation runs was small, the average value was taken as representative of the outcome.

Fig. 4a shows a comparison of the time delay experienced by vehicles in the road network using different types of traffic signal control. The proposed RLA signal control shows a 15% improvement in delay in comparison with the other benchmarks. The improvement in performance can be attributed to the ability of RLA signal control to foresee traffic increases based on the communicated information and adjust the green timing before the traffic arrives at the intersection. During low traffic periods, HMS and CE experience higher delays as their actions are based only on locally collected data, thereby causing more vehicles to be retained inside the network. Though under high traffic conditions their decisions are more coordinated, the number of vehicles to be cleared is much larger because of the vehicles retained during the low traffic period, thereby increasing the delay compared with RLA signal control.

Fig. 4b shows the comparison of RLA signal control with and without communication of data between agents. In the case of no communication, higher delay is experienced during the peak traffic conditions and is almost equivalent to that of the HMS and CE signal controls. CE shows higher delay as it is difficult to form clusters in a dynamic environment with continuously changing traffic flow input.

Figure 4 Three hour single peak traffic scenario
a Comparison of time delay for 3-h single peak traffic simulation scenario for different agent architectures
b Comparison of proposed RLA architecture with and without communication between agents

6.2 Typical scenario with morning and evening peaks (24 h)

For the typical scenario with morning and evening peaks (24 h), 20 different simulation runs using different random seeds were carried out for each signal control technique. The average value of the simulation runs was taken into consideration when evaluating the performance of each control technique.

Fig. 5a shows a comparison of the mean vehicle delay using different signal control techniques for the 24-h typical two-peak traffic scenario.


Although HMS signal control shows a higher mean delay during traffic peaks, RLA signal control has a stable, lower delay throughout the simulation period. Fig. 5b further indicates a smoother speed transition when using RLA signal control than with the other controllers.

6.3 Extreme traffic scenarios

Two hypothetical simulation scenarios were designed to test the settling and response times of the signal control algorithms when subjected to repetitive high and low traffic demand. The input demand and the number of vehicles remaining inside the network for the 24-h eight-peak simulation are shown in Fig. 6a. The stress experienced by the signal controllers can be seen from the growing number of vehicles retained inside the network. The main reason for the stress can be attributed to the vehicle count at the beginning of each peak. When the settling time (the time required to bring the vehicle count back to the non-peak condition) is larger, there is an overlap in the peak traffic build-up regions of consecutive peaks, causing an increased vehicle count at the start of the next peak traffic region, as can be seen from Fig. 6b.

The other extreme traffic condition scenario simulated was a 6-h two-peak traffic condition with higher demand values than the 24-h simulation. Fig. 7 shows the mean vehicle delay for the short extreme scenario. These simulation scenarios test the limits of the algorithms, as they attempt to stabilise traffic when subjected to repeated peaks. As the HMS algorithm performs better than the CE algorithm under the eight-peak extreme scenario [22], results of CE are not included in Fig. 6b. The RLA algorithm performs better than HMS signal control. HMS signal control produces higher time delay because of the delay in propagation of the control signal from the supervising agent and the absence of local communication between intersection agents.

Figure 5 Twenty-four hour two peak traffic simulation scenario


a Average travel time delay comparison
b Comparison of current vehicle mean speed


Figure 6 Twenty-four hour eight peak traffic simulation scenario


a Traffic demand and count of vehicles present inside the network
b Average travel time delay of vehicles

Figure 7 Mean travel time delay comparison for 6-h two peak traffic scenario


The lower mean delay value of the RLA algorithm clearly indicates a faster settling time and better adaptability to variations in the traffic demand than the other traffic signal controls.

6.4 Response time and cycle length variation

The response time of the RLA signal control is best illustrated by the frequency of change in the phase lengths. Fig. 8a displays the length of each phase of a cycle for a four-phase intersection in the middle of the network controlled by RLA signal control. The links having the right of way during the third phase have the lowest traffic demand, and this is reflected in the phase timing. The cycle time is lower in the non-peak period and varies dynamically with the changing traffic pattern. It is not possible to compare the signal timings across intersections because of the stochastic nature of the traffic input and the random seed initialisation. However, this can serve as an indication of the adaptability of the proposed RLA signal control.

Figure 8 Green timing and influence of reinforcement learning
a Change in signal green time settings of an intersection
b Improvement in the average delay experienced due to reinforcement learning

6.5 Improvement because of learning

Reinforcement learning vastly improves the performance of the RLA signal control.


Fig. 8b shows the variation of the average value of the mean vehicle delay experienced in each simulation run. After 90 continuous simulation runs, the mean delay value reduced to around 50% of what it was at the start of the simulation runs. HMS [21] utilises a selective back-propagation method for learning the parameters of its neuro-fuzzy system. The back-propagation method has the limitation of getting stuck in local optima and therefore increases the time delay.

Table 2 shows the comparison of traffic data obtained using RLA signal control with all the other benchmarks, including GLIDE, which is currently used in Singapore. The simulation model of GLIDE was obtained from [5, 21]. Average values obtained from 20 simulation runs are compared in Table 2. The standard deviation of the delay was around 4%, with 5-6% variation for vehicle count and speed. The proposed RLA signal control showed a 9-15% improvement in performance when compared with the other benchmarks in all 20 simulation runs.

Table 2 Worst and best time delay comparison of signal controls

                         Pre-timed    HMS    CE     RLA    Act.    GLIDE
one peak        Nv       120          95     89     90     100     -
                V        37           46     44     49     45      -
                d        297          200    191    163    196     -
typical day     Nv       317          286    301    266    295     -
                V        35           43     38     48     44      40
                d        500          182    200    160    184     200
short extreme   Nv       Sat.         216    258    170    215     -
                V        0            35     36     48     42      10
                d        Sat.         315    340    232    309     650
long extreme    Nv       Sat.         250    -      206    205     -
                V        0            33     -      42     37      0
                d        Sat.         242    -      216    238     Sat.

Nv: number of vehicles at the end of the simulation, V: speed, d: mean delay, Sat.: saturated

7 Conclusion

The proposed RLA signal control has a fully distributed architecture with agents capable of interacting with each other to effectively compute the optimal value of green time that reduces the overall travel time delay and increases the vehicle mean speed. The update of the traffic pattern in the repository and the shared communication between agents increased the forecasting capability of each agent. This property of the agents effectively reduced the formation of congestion and improved the clearance of vehicles at the intersections. Online Q-learning has been adapted to the multi-agent scenario, and the Q-matrix was shared between agents to improve the local observations and create a global view. Simulation tests conducted on a virtual traffic network of the Central Business District in Singapore for four different traffic scenarios showed almost 15% improvement over the benchmark signal controls. Further improvements because of online reinforcement learning of the parameters have been demonstrated effectively.

8 Acknowledgment

This research work was supported by the National University of Singapore under research grant WBS: R-263-000-425-112.

9 References

[1] KOONCE P.: 'Traffic signal timing manual', US Department of Transportation FHWA-HOP-08-024, Federal Highway Administration, 2008

[2] SANCHEZ J.J., GALAN M., RUBIO E.: 'Genetic algorithms and cellular automata: a new architecture for traffic light cycles optimization'. Proc. Congress on Evolutionary Computation, 19-23 June 2004, Piscataway, NJ, USA, 2004, pp. 1668-1674

[3] HOAR R., PENNER J., JACOB C.: 'Evolutionary swarm traffic: if ant roads had traffic lights'. Proc. 2002 World Congress on Computational Intelligence - WCCI'02, 12-17 May 2002, Piscataway, NJ, USA, 2002, pp. 1910-1915

[4] ISHIHARA H., FUKUDA T.: 'Traffic signal networks simulator using emotional algorithm with individuality'. Proc. IEEE Intelligent Transportation Systems, 25-29 August 2001, Oakland, CA, USA, pp. 1034-1039

[5] SRINIVASAN D., CHOY M.C., CHEU R.L.: 'Neural networks for real-time traffic signal control', IEEE Trans. Intell. Transp. Syst., 2006, 7, pp. 261-272

[6] HUNT P.B., ROBERTSON D.I., BRETHERTON R.D., WINTON R.I.: 'SCOOT - a traffic responsive method of coordinating signals' (United Kingdom, 1981)

[7] PECK C., GORTON P.T.W., LIREN D.: 'Application of SCOOT in developing countries'. Third Int. Conf. on Road Traffic Control, 1-3 May 1990, London, England, pp. 104-109

[8] SIMS A.G., DOBINSON K.W.: 'The Sydney Coordinated Adaptive Traffic (SCAT) system philosophy and benefits', IEEE Trans. Veh. Technol., 1980, t-29, pp. 130-137

[9] LOWRIE P.R.: 'The Sydney Coordinated Adaptive Traffic System - principles, methodology, algorithms'. Int. Conf. on


Road Traffic Signalling, 30 March - 1 April 1982, London, UK, pp. 67-70

[10] KEONG C.K.: 'The GLIDE system - Singapore's urban traffic control system', Transp. Rev., Transnatl. Transdiscipl. J., 1993, 13, pp. 295-305

[11] BAZZAN A.L.C.: 'A distributed approach for coordination of traffic signal agents', Auton. Agents Multi-Agent Syst., 2005, 10, pp. 131-164

[12] ROOZEMOND D.A.: 'Using intelligent agents for pro-active, real-time urban intersection control', Eur. J. Oper. Res., 2001, 131, pp. 293-301

[13] MIZUNO K., NISHIHARA S.: 'Distributed constraint satisfaction for urban traffic signal control'. Second Int. Conf. on Knowledge Science, Engineering and Management, KSEM 2007, 28-30 November 2007, Berlin, Germany, 2007, pp. 73-84

[14] DE OLIVEIRA D., BAZZAN A.L.C.: 'Traffic lights control with adaptive group formation based on swarm intelligence'. Ant Colony Optimization and Swarm Intelligence, Proc. Fifth Int. Workshop, ANTS 2006, 4-7 September 2006, Berlin, Germany, pp. 520-521

[15] CHOY M.C., CHEU R.L., SRINIVASAN D., LOGI F.: 'Real-time coordinated signal control through use of agents with online reinforcement learning'. Transportation Research Board Meeting (82nd), Washington, DC, 2003, pp. 64-75

[16] CAMPONOGARA E., KRAUS JR. W.: 'Distributed learning agents in urban traffic control', Prog. Artif. Intell., 2003, 2902, pp. 324-335

[17] WATKINS C., DAYAN P.: 'Technical note: Q-learning', Mach. Learn., 1992, 8, pp. 279-292

[18] LITTLE J.D.C.: 'A proof for the queuing formula: L = λW', Oper. Res., 1961, 9, pp. 383-387

[19] 'Highway capacity manual - HCM2000' (Transportation Research Board, National Research Council, 2000)

[20] BALAJI P.G., SRINIVASAN D., CHEN-KHONG T.: 'Coordination in distributed multi-agent system using type-2 fuzzy decision systems'. IEEE 16th Int. Conf. on Fuzzy Systems (FUZZ-IEEE), 1-6 June 2008, Piscataway, NJ, USA, pp. 2291-2298

[21] CHOY M.C., SRINIVASAN D., CHEU R.L.: 'Neural networks for continuous online learning and control', IEEE Trans. Neural Netw., 2006, 17, pp. 1511-1531

[22] SRINIVASAN D., CHOY M.: 'Distributed problem solving using evolutionary learning in multi-agent systems', Adv. Evol. Comput. Syst. Des., 2007, 66, pp. 211-227

[23] CHOY M.C., SRINIVASAN D., CHEU R.L.: 'Cooperative, hybrid agent architecture for real-time traffic signal control', IEEE Trans. Syst. Man Cybern. A (Syst. Hum.), 2003, 33, pp. 597-607

