


Published in IET Intelligent Transport Systems


Received on 28th October 2009
Revised on 26th February 2010
doi: 10.1049/iet-its.2009.0096

ISSN 1751-956X

Urban traffic signal control using reinforcement learning agents

P.G. Balaji, X. German, D. Srinivasan
Department of Electrical and Computer Engineering, 4 Engineering Drive 3, National University of Singapore,
Singapore 117576, Singapore
E-mail: [email protected]

Abstract: This study presents a distributed multi-agent-based traffic signal control for optimising green timing in an urban arterial road network to reduce the total travel time and delay experienced by vehicles. The proposed multi-agent architecture uses traffic data collected by sensors at each intersection, stored historical traffic patterns and data communicated from agents at adjacent intersections to compute the green time for a phase. Parameters such as the weights and threshold values used in computing the green time are fine-tuned by online reinforcement learning with the objective of reducing overall delay. PARAMICS software was used as a platform to simulate 29 signalised intersections in the Central Business District of Singapore and to test the performance of the proposed multi-agent traffic signal control for different traffic scenarios. The proposed multi-agent reinforcement learning (RLA) signal control showed significant improvement in mean time delay and speed in comparison with other traffic control systems such as the hierarchical multi-agent system (HMS), cooperative ensemble (CE) and actuated control.

1 Introduction

Traffic control in urban areas is becoming increasingly complex with the exponential growth in vehicle count. Expansion of the road network to accommodate the increased vehicle count is not a socially feasible option, and it is essential to increase the utilisation of the existing infrastructure through proper regulation of traffic flow. Traffic signals were introduced to control the traffic flow, thereby improving the safety of road users. However, traffic signals create bottlenecks for traffic flow in lanes that do not have the right of way during a specific phase, and optimisation of signal timings is required to reduce the overall delay experienced by all vehicles at the intersection. Optimisation can be performed in an offline (pre-timed) or online (adaptive) manner.

In pre-timed or fixed-time signal control, Webster's formula is used to calculate the green and cycle times offline using traffic data collected from the road network. Pre-timed signal control cannot handle any variation in traffic from the training patterns, resulting in increased travel time delay. Adaptive signal controls overcome this limitation by adjusting the timings dynamically with changing traffic patterns. Actuated signal controls adaptively increment or decrement the green time of a phase on detecting the presence or absence of vehicles in a lane. Actuated controls lack the ability to foresee increased traffic flow and base their decisions on instantaneous flow values. Further, they result in higher delay, as green time is not held for upstream platoons, causing a higher percentage of vehicles to be stopped [1].

Various computational intelligence techniques such as hybrid fuzzy genetic algorithms [2], ant colony-based optimisation [3], emotional algorithms [4] and neuro-fuzzy networks [5] calculate the green time required by forecasting the future traffic inflow. The first limitation is that a large training data set encompassing all the dynamics of the traffic is required for fine-tuning the parameters of the controller and is difficult to obtain. Second, most of the above controllers were designed for isolated intersections, thereby simplifying the model and reducing their suitability for coordinated interconnected intersections.


SCOOT [6, 7], SCATS [8, 9] and Green Link Determining (GLIDE) [10] are examples of centralised traffic signal controllers that have been implemented successfully on large-scale networks. However, centralised controllers increase the requirement for extensive communication of information and the computational requirement for efficient mining of the data needed to compute optimal green time. This limitation can be addressed by implementing a distributed multi-agent architecture, where the larger problem is divided into smaller sub-problems. A multi-agent system is a group of autonomous agents, each capable of perceiving the environment and deciding its own course of action for achieving a common goal. The agents can achieve this either by cooperation or by competition. Communication between agents widens the global view of each agent and increases coordination. In [11], a distributed agent system utilising evolutionary game theory to assign rewards or penalties was proposed; its limitation is the necessity to compute a pay-off matrix for each state-action pair. In [12], the advantages and disadvantages of multi-agent systems were highlighted and a theoretical agent model based on estimated traffic state was proposed. In [13-15], semi-distributed agent architectures based on distributed constraint optimisation, swarm intelligence methods and hybrid intelligent techniques combining fuzzy logic, neural networks and evolutionary computation have been attempted. Their limitation is the amount of data to be communicated and the conflict of decisions among agents. An agent system with reinforcement learning capability has been shown to improve performance significantly [16]; however, tests were conducted on a simple road network with a small number of intersections. In this paper, a reinforcement learning distributed multi-agent architecture is proposed and tested on a large urban arterial road network.

The paper is organised into seven sections. Section 2 details the proposed multi-agent architecture. Section 3 describes learning of the parameters using reinforcement learning. Section 4 details the performance measures used, followed by a brief note in Section 5 on the benchmarks used. Section 6 discusses the simulation platform used and the comparative analysis against the benchmark signal controls. Section 7 summarises the work done in this paper.

2 Proposed agent architecture

The proposed multi-agent system has a distributed architecture, with each agent capable of making its own decisions without any central supervising agent. The traffic signal at each intersection is controlled by an agent. The agent collects local traffic data from induction loop detectors placed near the stop line of the incoming and outgoing links connecting the neighbouring intersections. Agents communicate outgoing traffic information to the neighbouring agents. The structure of the individual agent architecture is shown in Fig. 1.

Figure 1 Proposed agent architecture

Based on the locally collected inputs and the communicated information, intersection agents determine the green time required for each phase in the next cycle period. Each agent possesses local memory to store traffic demand and creates a data repository to assess future traffic demand and the effectiveness of the agent's actions. Agents fine-tune and learn the decision model of each intersection by observing the expected utility for each state-action pair and updating it using online Q-learning.

The traffic demands in the road network are quite uniformly spread and can be characterised by different types of distribution based on the traffic flow information collected from the network. However, vehicles have a large number of route choices and the route selected depends on driver behaviour, therefore no specific green wave policy can be selected based only on historic traffic flow patterns. Explicit offset settings were therefore not used in this work, as synchronisation is achieved through communication of information between agents and learning by visiting each state-action pair a sufficient number of times.
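As a rough illustration of this architecture, the state each intersection agent needs to keep (local detector data, a historic repository, advice received from neighbours and a Q-matrix) could be organised as in the following sketch; the class and field names are assumptions made here for clarity, not identifiers from the published system.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class PhaseData:
    """Detector measurements aggregated over one phase (illustrative fields)."""
    t_occupied: float = 0.0   # seconds the stop-line detector was occupied
    q_length: float = 0.0     # queue length at the end of the phase
    v_count: int = 0          # vehicles crossing the intersection during the phase

@dataclass
class IntersectionAgent:
    """One distributed signal-control agent (a sketch, not the authors' code)."""
    agent_id: str
    neighbour_ids: List[str] = field(default_factory=list)
    local_data: Dict[str, PhaseData] = field(default_factory=dict)   # keyed by approach/lane
    advice: Dict[str, float] = field(default_factory=dict)           # occupancy/count received from neighbours
    history: Dict[int, float] = field(default_factory=dict)          # Hscore/Hflow per time period
    q_matrix: List[List[float]] = field(default_factory=list)        # 9 traffic states x 12 actions

    def broadcast(self) -> Dict[str, float]:
        """Outgoing-link occupancy to be broadcast (with an identification tag) to neighbours."""
        return {link: data.t_occupied for link, data in self.local_data.items()}
```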


2.1 Traffic input

Traffic data such as the vehicle occupancy Toccupied (the amount of time a vehicle is present on the detector), Qlength (the length of the queue of vehicles at the end of each phase) and Vcount (the number of vehicles crossing a specific intersection during a phase) are collected from the incoming and outgoing links of the intersection. Since vehicles in most lanes are free to choose the exit lanes, data from outgoing detectors need to be transmitted to neighbouring intersection agents to enable prediction of incoming traffic. For proper estimation of the current traffic state, the queue value has to be used along with the vehicle occupancy and count, which tend to stagnate during high traffic flow.

Traffic state estimation is performed based on the occupancy, queue and vehicle count of the maximum congested lane (1), as averaging across the lanes causes improper classification of traffic

Toccupied = max_i (max_j (Toccupied(a_i l_j)))    (1)

where a_i l_j is the jth lane of approach a_i.

2.2 Rule base

Agents compute the new phase length in a cycle based on the locally observed traffic input and the information communicated by neighbouring intersections using a set of rules. The rules specify the required amount of change in green time d_i for a specific traffic condition at the intersection, and the new phase length is calculated as the weighted sum of the rule outputs as shown in

T_phase^new = T_phase^old + Σ_i r_i d_i,    T_min ≤ T_phase^new ≤ T_max    (2)

where r_i is the weight assigned to each rule, d_i is the output of each rule and i is the number of rules. Upper and lower limits [T_max, T_min] are imposed to avoid indiscriminate increase of green time. For all signals, T_min is fixed at 10 s; however, T_max varies in accordance with the number of phases and a total cycle length limited to 120 s. The different rules used by an agent for estimation of green time are explained in the following sections. Calculation of r_i is explained in detail in Section 3.3.

2.2.1 Occupancy ratio: The agent uses the occupancy ratio (the ratio of vehicle occupancy time to green time of the phase) to estimate the green time required by vehicles present at the stop line of the intersection. Occupancy is directly related to vehicle density and indicates the current state of the intersection. Based on the speed-flow-density characteristics, the ratio of vehicle occupancy (in seconds) to green time of a phase gives an accurate indication of the degree of saturation of the network. However, there is no universal best value for the occupancy ratio; it varies with the level of congestion. An underutilised phase (large green time for a low vehicle count) has a low occupancy ratio and increases the delay experienced by vehicles. Based on the occupancy ratio, each agent computes the extension or reduction in green time of the phase in progress

d_1 = Toccupied / OccRatio − T_phase    (3)

2.2.2 Local traffic variations: The second most important factor influencing the adjustment of green time of a phase is the local variation in traffic conditions over consecutive time periods. As the cycle length is dynamically adjusted by the agents, the ratio of Toccupied to T_phase at consecutive time periods is used rather than the raw vehicle occupancy data

load = Toccupied / T_phase    (4)

Agents decide the required change in green time by comparing the load with load_target, a threshold value. The threshold value is computed as the average of the load values experienced during the previous t cycles, where t represents the number of previous cycles to be considered. The change in green time is computed using

d_2 = Δ · T_phase    (5)

where

Δ = max(load_new − load_target, 0),        if load_new > load_old
Δ = min(load_new − load_target, 0) / 2,    if load_new < load_old

The old value of the load is updated with the current load value after every time period. To ensure that extension of green time is relatively slower than reduction of green time, a correction term of 1/2 is included in the computation of Δ. A large correction value can cause instability because of shorter phase splits.

2.2.3 Neighbourhood advice: An agent's environment is usually affected by the actions of neighbouring agents. This necessitates modifying the behaviour of an agent based on the information communicated by neighbouring agents. The neighbouring agents communicate the vehicle occupancy and count on the outgoing links of their intersections. Data are communicated as a simple broadcast with an identification tag to all the neighbouring intersections. Based on the information in the directory facilitator, agents decide whether to receive the broadcast information. The received data are stored as Advice in the data repository. The communicated data permit forecasting of the traffic inflow, and the green time is adjusted accordingly as in

d_3 = Advice_new / OccRatio − T_phase    (6)

Advice_new = (Advice_mem + Advice_old) / 2    (7)

After calculation of d_3, the average of the Advice_mem received in the current time period and Advice_old is used to update the repository.


If an approach is congested such that at least one turning movement is blocked during a phase, then the vehicle count for that movement is set to zero. This situation arises when the queue from the downstream intersection reaches the current intersection because of queue spillback and leads to deadlock formation. It is not possible for an agent to differentiate between an empty and a saturated intersection from the count alone. However, it is possible to differentiate the two scenarios by using a combination of the occupancy and count data communicated by the neighbouring agents. If the vehicle count on the outgoing lane is null and the occupancy is not null, the agent can identify the lane as congested and blocked because of queue spillback. Under such circumstances, the green time for the phase is kept fixed at the minimum limit of 10 s to allow clearance of vehicles.

3 Reinforcement learning

All agents in the network must be capable of learning the model of the intersection they control based on an assessment of the present average traffic condition and the previous day's traffic condition at the same period. The learning period has to be long enough to allow aggregation of sufficient traffic information to estimate the current traffic state and capture the traffic dynamics. The learning period can be found by experimentation and was calculated as 500 s for the specific road network considered for testing. Conventional supervised learning is difficult as the exact desired value is not available, and unsupervised learning or a combination of both (reinforcement learning) needs to be employed. Reinforcement learning utilises a scalar time-delayed reward received from the environment on selecting an action in a specific state to modify the parameters of the intersection model. In this paper, Q-learning was used to modify the parameters.

3.1 Q-learning

Q-learning [17] is a reinforcement learning technique that learns the action value function, which provides the expected utility of taking an action in a given state and then following a fixed policy thereafter. The utility or reward is received after a time delay from the environment and is a scalar quantity that does not exactly specify the action to be taken. Each agent maintains a Q-matrix that stores the Q-values for each state-action pair and is updated iteratively as shown in (8). The Q-values reach their optimum when all states are visited a sufficiently large number of times

Q(s, a)* = (1 − α)Q(s, a) + α(r + γ max_i Q(s′, a_i))    (8)

where Q(s, a)* is the optimal value, α is the learning rate in the range [0, 1] and γ is the future discount reward. The learning rate and future discount reward were determined through experimentation to be 0.33 and 0.05, as a trade-off needs to be made between the rate of convergence and precision.

3.2 Traffic state estimation

Traffic is discretised into different states using queue and flow data. The average queue computed at the end of each phase by (9) can be used for traffic classification. However, for proper classification of traffic, the flow value needs to be used in conjunction with the queue. The values of queue and flow at time period t are taken as Qscore and flow(t). At the start of the learning process, there is no history data available. However, at time period t + 1, the data of Qscore and flow(t) get stored as the history values for the previous day, namely Hscore and Hflow, respectively. The change in current traffic is computed as the difference between the current traffic queue score and the queue experienced on the previous day at the next time period, Qscore(t) − Hscore(t + 1), and assigned a membership grade that classifies the current traffic change into low, medium and high traffic as shown in Fig. 2

Qscore = Σ Qlength(phase_i) / N_phase    (9)

Figure 2 Structure of fuzzy membership

The rate of change of traffic dx is computed as (Hflow(t + 1) − flow(t))/flow(t) and (Hscore(t + 1) − Qscore)/Qscore and is used to determine whether the traffic is decreasing, stable or increasing, and assigned a membership grade similar to Fig. 2. Combining the rate of change of traffic and the current change in traffic, the traffic at the intersection is classified into nine possible states as shown in Table 1. The current traffic state is determined as the output with the highest firing level using fuzzy logic.

Table 1 Possible traffic states

                           Changes in traffic
Current traffic    Decreasing    Stable    Increasing
low                0             1         2
medium             3             4         5
high               6             7         8
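A crisp approximation of this nine-state classification is sketched below; the published controller assigns fuzzy membership grades (Fig. 2) and takes the state with the highest firing level, whereas the thresholds and sign conventions used here are illustrative assumptions only.

```python
def classify_traffic_state(q_score, h_score_next, flow, h_flow_next,
                           change_threshold=0.2, trend_threshold=0.05):
    """Map the current traffic to one of the nine states of Table 1.

    q_score, flow             -- current queue score (eq. 9) and flow at period t
    h_score_next, h_flow_next -- stored previous-day values for period t + 1
    Returns an index 0..8 (rows: low/medium/high current traffic,
    columns: decreasing/stable/increasing change in traffic).
    """
    # current change in traffic: queue score relative to the historic queue score
    change = (q_score - h_score_next) / max(abs(h_score_next), 1e-6)
    if change < -change_threshold:
        level = 0          # low
    elif change > change_threshold:
        level = 2          # high
    else:
        level = 1          # medium

    # rate of change of traffic dx from the flow comparison
    dx = (h_flow_next - flow) / max(abs(flow), 1e-6)
    if dx > trend_threshold:
        trend = 0          # decreasing (one plausible reading of the sign)
    elif dx < -trend_threshold:
        trend = 2          # increasing
    else:
        trend = 1          # stable

    return level * 3 + trend
```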


3.3 Parameter update

The traffic states have been completely defined in the previous section. However, the action space needs to be defined to complete the Q-matrix. The total green time extension is defined by r_i, whose value lies in the range [0, 1] and is divided into 12 equal values. Each agent maintains a Q-matrix that matches the nine traffic states defined in the previous section to the 12 action values. At the end of each learning period, the agent computes the reward r received from the environment after choosing action a_i when in state s_i. The reward value is computed using

r_new = (Qscore(t) − Hscore(t))/Qscore(t) + η (Qscore(t) − Qscore(t − 1))/Qscore(t)    (10)

where η is in the range [0, 1]. The reward value is positive if the queue is smaller than the historic queue as well as the queue for the previous period. Since the traffic demands vary with time, the queue value also varies. Comparison with the historic value tracks the traffic pattern over a long period, and the queue value of the previous time period tracks short-time variations. Therefore η needs to be kept small, so that the reward comes mainly from the comparison with historic values. Once the current state is detected, the appropriate action is chosen as the one with the highest Q-value. In the case of multiple actions having the same Q-value, one of the actions is selected randomly. A greedy action selection strategy with random tie-breaking was used to increase exploration and thus the number of visited state-action pairs.

3.4 Memory update

Once the state-action pair has been found, the memory of the agent is updated. The historic queue score is updated iteratively in

Hscore_new = (1 − β)Hscore_old + βr    (11)

The coefficient β decreases from 0.5 to 0.1 over the first few iterations, then stays at 0.1. The same equation as in (11) is used for calculating Hflow. The value of β needs to be varied with large steps at the start, so that higher preference is given to the time-delayed rewards than to the historical value, and then reduced to a lower value so that the learnt values are not forgotten; it is determined through experimentation with different values.

Cooperation between the agents is achieved by averaging the Q-matrix values between immediately adjacent neighbouring agents in each time period. This ensures an improvement in performance, as agents learn from the experience of the neighbouring agents.
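The learning loop of Sections 3.1-3.4 can be pieced together roughly as follows. ALPHA and GAMMA follow the reported values of 0.33 and 0.05; ETA, the beta schedule length and all function and variable names are assumptions of this sketch rather than details taken from the paper.

```python
import random

ALPHA, GAMMA = 0.33, 0.05       # learning rate and discount, as reported in Section 3.1
ETA = 0.1                       # eta in eq. (10); kept small, exact value assumed here
N_STATES, N_ACTIONS = 9, 12     # Table 1 states x 12 discrete action values

Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]   # one Q-matrix per agent

def reward(q_score_t, h_score_t, q_score_prev):
    """Eq. (10): comparison with the historic and previous-period queue scores."""
    return ((q_score_t - h_score_t) / q_score_t
            + ETA * (q_score_t - q_score_prev) / q_score_t)

def q_update(q, s, a, r, s_next):
    """Eq. (8): one tabular Q-learning update for the state-action pair (s, a)."""
    q[s][a] = (1 - ALPHA) * q[s][a] + ALPHA * (r + GAMMA * max(q[s_next]))

def select_action(q, s):
    """Greedy selection with random tie-breaking, as described in Section 3.3."""
    best = max(q[s])
    return random.choice([a for a, v in enumerate(q[s]) if v == best])

def update_history(h_old, r, iteration, beta_hi=0.5, beta_lo=0.1, warmup=10):
    """Eq. (11): beta decays from 0.5 to 0.1 over the first few iterations (warmup length assumed)."""
    beta = max(beta_lo, beta_hi - (beta_hi - beta_lo) * iteration / warmup)
    return (1 - beta) * h_old + beta * r

def cooperate(q_self, q_neighbours):
    """Section 3.4: average the Q-matrices of immediately adjacent agents."""
    tables = [q_self, *q_neighbours]
    return [[sum(t[s][a] for t in tables) / len(tables) for a in range(N_ACTIONS)]
            for s in range(N_STATES)]
```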


4 Performance measures

The performance of the proposed reinforcement learning (RLA) algorithm in a simulated road traffic environment is evaluated based on three parameters, namely the vehicle count, the total mean delay and the current mean speed of vehicles inside the road network (Fig. 3).

Figure 3 Simulated road network with indication of prominent hotspots caused by pre-timed signals

4.1 Vehicle count

The vehicle count is the total number of vehicles present inside the road network at a given time and is calculated as the difference between the number of vehicles entering and leaving the network during the estimation period. The vehicle count gives an accurate indication of the congestion level inside the network at a specified period of time.

4.2 Total vehicle mean delay

The total mean delay is the average value of the delay experienced by vehicles in reaching their destination from their starting point in the network and is expressed in seconds. The mean delay is the sum of the total stopping time, which corresponds to time lost waiting at intersections, and the travel time, which depends on the speed of vehicles inside the network

TAD = Σ_{i=1}^{n} TD / TN    (12)

where TAD is the total average delay, n is the number of intersections, TD is the delay experienced by vehicles at an intersection and TN is the total number of vehicles released into the network. Little [18] and the Highway Capacity Manual HCM2000 [19] show the wide acceptance of the delay parameter for validating signal controllers.

4.3 Current vehicle mean speed

For a better understanding of the results, the current mean speed of vehicles inside the network is used along with the time delay value. The importance of using current mean speed in validating a signal controller has been highlighted in [20].

5 Benchmarks

It is difficult to find a good benchmark for the large-scale traffic signal control problem for the following reasons:

1. Some of the existing algorithms are developed for simplified traffic scenarios and hence are not suitable for benchmarking.

2. Commercial traffic signal control programs, which are known to work well, are not easily available because of proprietary reasons.

Hence, in all the experiments, GLIDE [10] (the modified version of SCATS used in Singapore), the hierarchical multi-agent system (HMS) [21] and the cooperative ensemble (CE) [22] are used as benchmarks. HMS and CE have already been compared with GLIDE, and hence simulation plot results are not included to avoid redundancy. HMS is a semi-distributed multi-agent traffic signal control with a hierarchical architecture. It consists of three layers of agents with increasing hierarchy and control. The agent at the intersection decides the green time required based on local traffic information, cannot communicate with agents in the same layer of the hierarchy, and uses Webster's method to compute the green time requirement. The zonal agent oversees the functioning of five intersection agents by monitoring the action plan of each individual agent and providing directives received from the supervising agent. The supervising agent is in the top layer of the hierarchy and oversees the functioning of the entire system. The zonal agents utilise an evolutionary fuzzy algorithm to generate the rule base for control and compute the cooperation levels required between agents using a neuro-fuzzy system. For a detailed description of the HMS, refer to [21].

CE [22] is a distributed multi-agent architecture in which the agents self-organise and form clusters of cooperating agents. The clusters are formed dynamically using a graph theoretical method. The teams or clusters cooperate to reduce the overall time delay experienced by the group rather than by an individual. Overlap in the cooperative clusters is possible; however, it is limited to avoid excessive computation.

6 Simulation results and discussions

The proposed RLA signal controller was tested on a simulated network of 29 intersections. The simulated network is a highly congested section of the busy Central Business District area in Singapore [23]. The network is simulated using PARAMICS, a microscopic simulation software capable of efficiently simulating driver behaviour, dynamic re-routing of vehicles and incidents. The network serves as an ideal test bed because of its geometry and the heterogeneity in the classification of links (major and minor roads with varying speed limits).

Four types of simulation were used to evaluate the performance of the proposed RLA signal control, namely a typical scenario with a morning peak (3 h), a typical scenario with morning and evening peaks (24 h), an extreme scenario with dual peaks (6 h) and an extreme scenario with multiple peaks (24 h). It must be noted that the extreme traffic scenarios are hypothetical traffic peaks created to test the reliability of traffic control by the proposed RLA signal control under cyclic repetitive stress conditions. They also serve to showcase the response and settling time of the signal control.

The origin-destination data collected from the Land Transport Authority of Singapore are used to recreate the peak traffic conditions. Even though the peak traffic data are pre-fixed, the number of vehicles actually released into the network varies according to the random seeds set before the simulation. Since PARAMICS dynamically adjusts traffic model characteristics like gap acceptance, lane change, merging and so on, the traffic dynamics are different for each simulation run with different random seeds. The PARAMICS model has been validated for the specific data and has been previously used for simulation testing in [5, 20, 22, 23].
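Before the individual scenarios are discussed, the three measures of Section 4 can be obtained from per-run simulation output roughly as in this sketch; the function and argument names are assumptions of this sketch and do not correspond to the PARAMICS API.

```python
def vehicle_count(entered, left):
    """Section 4.1: vehicles remaining inside the network during the estimation period."""
    return entered - left

def total_average_delay(intersection_delays, vehicles_released):
    """Eq. (12): TAD, total delay summed over intersections per vehicle released."""
    return sum(intersection_delays) / vehicles_released

def current_mean_speed(speeds_in_network):
    """Section 4.3: mean speed of the vehicles currently inside the network."""
    return sum(speeds_in_network) / len(speeds_in_network)
```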


6.1 Typical scenario with morning peak (3 h)

The typical scenario with a morning peak is used to validate the performance improvement in traffic conditions achieved by RLA signal control for short-time traffic variations. Twenty simulation runs using different random seeds were carried out for each signal control technique compared. Since the variance of the outcomes of the simulation runs was small, the average value was taken as representative of the outcome.

Fig. 4a shows a comparison of the time delay experienced by vehicles in the road network using different types of traffic signal control. The proposed RLA signal control shows a 15% improvement in delay in comparison with the other benchmarks. The improvement in performance can be attributed to the ability of RLA signal control to foresee traffic increases based on the communicated information and adjust the green timing before the traffic arrives at the intersection. During low traffic periods, HMS and CE experience higher delays as their actions are based only on locally collected data, thereby causing more vehicles to be retained inside the network. Though under high traffic conditions their decisions are more coordinated, the number of vehicles to be cleared is much larger because of the vehicles retained during the low traffic period, thereby increasing the delay compared with RLA signal control.

Fig. 4b shows the comparison of RLA signal control with and without communication of data between agents. In the case of no communication, higher delay is experienced during the peak traffic conditions and is almost equivalent to that of the HMS and CE signal controls. CE shows higher delay as it is difficult to form clusters in a dynamic environment with continuously changing traffic flow input.

Figure 4 Three hour single peak traffic scenario
a Comparison of time delay for 3-h single peak traffic simulation scenario for different agent architectures
b Comparison of proposed RLA architecture with and without communication between agents

6.2 Typical scenario with morning and evening peaks (24 h)

For the typical scenario with morning and evening peaks (24 h), 20 different simulation runs using different random seeds were carried out for each signal control technique. The average value of the simulation runs was taken into consideration when evaluating the performance of each control technique.

Fig. 5a shows a comparison of the mean vehicle delay using different signal control techniques for the 24-h typical two-peak traffic scenario.


Although HMS signal control shows a higher mean delay during traffic peaks, RLA signal control has a stable, lower delay throughout the simulation period. Fig. 5b further indicates a smoother speed transition when using RLA signal control than with the other controllers.

6.3 Extreme traffic scenarios

Two hypothetical simulation scenarios were designed to test the settling and response times of the signal control algorithms when subjected to repetitive high and low traffic demand. The input demand and the number of vehicles remaining inside the network for the 24-h eight-peak simulation are shown in Fig. 6a. The stress experienced by the signal controllers can be seen from the growing number of vehicles retained inside the network. The main reason for the stress can be attributed to the vehicle count at the beginning of each peak. When the settling time (the time required to bring the vehicle count back to the non-peak condition) is larger, there is an overlap in the peak traffic build-up regions of consecutive peaks, causing an increased vehicle count at the start of the next peak traffic region, as can be seen from Fig. 6b.

The other extreme traffic condition scenario simulated was a 6-h two-peak traffic condition with higher demand values than the 24-h simulation. Fig. 7 shows the mean vehicle delay for the short extreme scenario. These simulation scenarios test the limits of the algorithms, as they attempt to stabilise traffic when subjected to repeated peaks. As the HMS algorithm performs better than the CE algorithm under the eight-peak extreme scenario [22], results of CE are not included in Fig. 6b. The RLA algorithm performs better than HMS signal control. HMS signal control produces higher time delay because of the delay in propagation of the control signal from the supervising agent and the absence of local communication between intersection agents.

Figure 5 Twenty-four hour two peak traffic simulation scenario


a Average travel time delay comparison
b Comparison of current vehicle mean speed


Figure 6 Twenty-four hour eight peak traffic simulation scenario


a Traffic demand and count of vehicles present inside the network
b Average travel time delay of vehicles

Figure 7 Mean travel time delay comparison for 6-h two peak traffic scenario


The lower mean delay value of the RLA algorithm clearly indicates a faster settling time and better adaptability to variations in the traffic demand than the other traffic signal controls.

6.4 Response time and cycle length variation

The response time of the RLA signal control is best illustrated by the frequency of change in the phase lengths. Fig. 8a displays the length of each phase of a cycle for a four-phase intersection in the middle of the network controlled by RLA signal control. The links having the right of way during the third phase have the lowest traffic demand, and this is reflected in the phase timing. The cycle time is lower in the non-peak period and varies dynamically with the changing traffic pattern. It is not possible to compare the signal timings across intersections because of the stochastic nature of the traffic input and the random seed initialisation. However, this can serve as an indication of the adaptability of the proposed RLA signal control.

Figure 8 Green timing and influence of reinforcement learning
a Change in signal green time settings of an intersection
b Improvement in the average delay experienced due to reinforcement learning

6.5 Improvement because of learning

Reinforcement learning vastly improves the performance of the RLA signal control.


Fig. 8b shows the variation of the average value of the mean vehicle delay experienced in each simulation run. After 90 continuous simulation runs, the mean delay value reduced to around 50% of what it was at the start of the simulation runs. HMS [21] utilises a selective back-propagation method for learning the parameters of its neuro-fuzzy system. The back-propagation method has the limitation of getting stuck in local optima and therefore increases the time delay.

Table 2 shows the comparison of traffic data obtained using RLA signal control with all the other benchmarks, including GLIDE, which is currently used in Singapore. The simulation model of GLIDE was obtained from [5, 21]. Average values obtained from 20 simulation runs are compared in Table 2. The standard deviation of the delay was around 4%, with 5-6% variation for vehicle count and speed. The proposed RLA signal control showed a 9-15% improvement in performance when compared with the other benchmarks in all 20 simulation runs.

Table 2 Worst and best time delay comparison of signal controls

                         Pre-timed    HMS    CE     RLA    Act.    GLIDE
one peak        Nv       120          95     89     90     100     -
                V        37           46     44     49     45      -
                d        297          200    191    163    196     -
typical day     Nv       317          286    301    266    295     -
                V        35           43     38     48     44      40
                d        500          182    200    160    184     200
short extreme   Nv       Sat.         216    258    170    215     -
                V        0            35     36     48     42      10
                d        Sat.         315    340    232    309     650
long extreme    Nv       Sat.         250    -      206    205     -
                V        0            33     -      42     37      0
                d        Sat.         242    -      216    238     Sat.

Nv: number of vehicles at the end of the simulation, V: speed, d: mean delay, Sat.: saturated

7 Conclusion

The proposed RLA signal control has a fully distributed architecture with agents capable of interacting with each other to effectively compute the optimal value of green time that reduces the overall travel time delay and increases the vehicle mean speed. The update of the traffic pattern in the repository and the shared communication between agents increased the forecasting capability of each agent. This property of the agents effectively reduced the formation of congestion and improved the clearance of vehicles at the intersections. Online Q-learning has been adapted to the multi-agent scenario, and the Q-matrix was shared between agents to improve the local observations and create a global view. Simulation tests conducted on a virtual traffic network of the Central Business District in Singapore for four different traffic scenarios showed almost 15% improvement over the benchmark signal controls. Further improvements because of online reinforcement learning of the parameters have been demonstrated effectively.

8 Acknowledgment

This research work was supported by the National University of Singapore under research grant WBS: R-263-000-425-112.

9 References

[1] KOONCE P.: 'Traffic signal timing manual', US Department of Transportation FHWA-HOP-08-024, Federal Highway Administration, 2008

[2] SANCHEZ J.J., GALAN M., RUBIO E.: 'Genetic algorithms and cellular automata: a new architecture for traffic light cycles optimization'. Proc. Congress on Evolutionary Computation, 19-23 June 2004, Piscataway, NJ, USA, 2004, pp. 1668-1674

[3] HOAR R., PENNER J., JACOB C.: 'Evolutionary swarm traffic: if ant roads had traffic lights'. Proc. 2002 World Congress on Computational Intelligence - WCCI'02, 12-17 May 2002, Piscataway, NJ, USA, 2002, pp. 1910-1915

[4] ISHIHARA H., FUKUDA T.: 'Traffic signal networks simulator using emotional algorithm with individuality'. Proc. IEEE Intelligent Transportation Systems, 25-29 August 2001, Oakland, CA, USA, pp. 1034-1039

[5] SRINIVASAN D., CHOY M.C., CHEU R.L.: 'Neural networks for real-time traffic signal control', IEEE Trans. Intell. Transp. Syst., 2006, 7, pp. 261-272

[6] HUNT P.B., ROBERTSON D.I., BRETHERTON R.D., WINTON R.I.: 'SCOOT - a traffic responsive method of coordinating signals' (United Kingdom, 1981)

[7] PECK C., GORTON P.T.W., LIREN D.: 'Application of SCOOT in developing countries'. Third Int. Conf. on Road Traffic Control, 1-3 May 1990, London, England, pp. 104-109

[8] SIMS A.G., DOBINSON K.W.: 'The Sydney Coordinated Adaptive Traffic (SCAT) system philosophy and benefits', IEEE Trans. Veh. Technol., 1980, t-29, pp. 130-137

[9] LOWRIE P.R.: 'The Sydney Coordinated Adaptive Traffic System - principles, methodology, algorithms'. Int. Conf. on


Road Traffic Signalling, 30 March - 1 April 1982, London, UK, pp. 67-70

[10] KEONG C.K.: 'The GLIDE system - Singapore's urban traffic control system', Transp. Rev., Transnatl. Transdiscipl. J., 1993, 13, pp. 295-305

[11] BAZZAN A.L.C.: 'A distributed approach for coordination of traffic signal agents', Auton. Agents Multi-Agent Syst., 2005, 10, pp. 131-164

[12] ROOZEMOND D.A.: 'Using intelligent agents for pro-active, real-time urban intersection control', Eur. J. Oper. Res., 2001, 131, pp. 293-301

[13] MIZUNO K., NISHIHARA S.: 'Distributed constraint satisfaction for urban traffic signal control'. Second Int. Conf. on Knowledge Science, Engineering and Management, KSEM 2007, 28-30 November 2007, Berlin, Germany, 2007, pp. 73-84

[14] DE OLIVEIRA D., BAZZAN A.L.C.: 'Traffic lights control with adaptive group formation based on swarm intelligence'. Ant Colony Optimization and Swarm Intelligence, Proc. Fifth Int. Workshop, ANTS 2006, 4-7 September 2006, Berlin, Germany, pp. 520-521

[15] CHOY M.C., CHEU R.L., SRINIVASAN D., LOGI F.: 'Real-time coordinated signal control through use of agents with online reinforcement learning'. Transportation Research Board Meeting (82nd), Washington, DC, 2003, pp. 64-75

[16] CAMPONOGARA E., KRAUS JR. W.: 'Distributed learning agents in urban traffic control', Prog. Artif. Intell., 2003, 2902, pp. 324-335

[17] WATKINS C., DAYAN P.: 'Technical note: Q-learning', Mach. Learn., 1992, 8, pp. 279-292

[18] LITTLE J.D.C.: 'A proof for the queuing formula: L = λW', Oper. Res., 1961, 9, pp. 383-387

[19] 'Highway capacity manual - HCM2000' (Transportation Research Board, National Research Council, 2000)

[20] BALAJI P.G., SRINIVASAN D., CHEN-KHONG T.: 'Coordination in distributed multi-agent system using type-2 fuzzy decision systems'. IEEE 16th Int. Conf. on Fuzzy Systems (FUZZ-IEEE), 1-6 June 2008, Piscataway, NJ, USA, pp. 2291-2298

[21] CHOY M.C., SRINIVASAN D., CHEU R.L.: 'Neural networks for continuous online learning and control', IEEE Trans. Neural Netw., 2006, 17, pp. 1511-1531

[22] SRINIVASAN D., CHOY M.: 'Distributed problem solving using evolutionary learning in multi-agent systems', Adv. Evol. Comput. Syst. Des., 2007, 66, pp. 211-227

[23] CHOY M.C., SRINIVASAN D., CHEU R.L.: 'Cooperative, hybrid agent architecture for real-time traffic signal control', IEEE Trans. Syst. Man Cybern. A (Syst. Hum.), 2003, 33, pp. 597-607

