Urban Traffic Signal Control Using Reinforcement Learning Agents
Abstract: This study presents a distributed multi-agent-based traffic signal control for optimising green timing in an urban arterial road network to reduce the total travel time and delay experienced by vehicles. The proposed multi-agent architecture uses traffic data collected by sensors at each intersection, stored historical traffic patterns and data communicated from agents at adjacent intersections to compute the green time for a phase. Parameters such as the weights and threshold values used in computing the green time are fine-tuned by online reinforcement learning with the objective of reducing overall delay. PARAMICS software was used as a platform to simulate 29 signalised intersections in the Central Business District of Singapore and to test the performance of the proposed multi-agent traffic signal control under different traffic scenarios. The proposed multi-agent reinforcement learning (RLA) signal control showed significant improvements in mean time delay and speed in comparison with other traffic control systems such as the hierarchical multi-agent system (HMS), the cooperative ensemble (CE) and actuated control.
controllers that have been implemented successfully on large-scale networks. However, centralised controllers increase both the communication burden and the computational requirement for mining the information needed to compute optimal green times. This limitation can be addressed by a distributed multi-agent architecture, in which the larger problem is divided into smaller sub-problems. A multi-agent system is a group of autonomous agents, each capable of perceiving the environment and deciding its own course of action towards a common goal. The agents can achieve this either by cooperation or by competition, and communication between agents widens each agent's view of the network and improves coordination. In [11], a distributed agent system using evolutionary game theory to assign rewards or penalties was proposed; its limitation is the need to compute a pay-off matrix for each state-action pair. In [12], the advantages and disadvantages of multi-agent systems were highlighted and a theoretical agent model based on estimated traffic state was proposed. In [13-15], semi-distributed agent architectures based on distributed constraint optimisation, swarm intelligence and hybrid intelligent techniques combining fuzzy logic, neural networks and evolutionary computation were attempted; their limitations are the amount of data to be communicated and conflicts of decision among agents. Agent systems with reinforcement learning capability have been shown to improve performance significantly [16]; however, the tests were conducted on simple road networks with few intersections. In this paper, a reinforcement learning distributed multi-agent architecture is proposed and tested on a large urban arterial road network.

The paper is organised into seven sections. Section 2 details the proposed multi-agent architecture. Section 3 describes the learning of parameters using reinforcement learning. Section 4 details the performance measures used, followed by a brief note in Section 5 on the benchmarks used. Section 6 discusses the simulation platform and the comparative analysis against the benchmark signal controls. Section 7 summarises the work done in this paper.

2 Proposed agent architecture

The proposed multi-agent system has a distributed architecture, with each agent capable of making its own decisions without any central supervising agent. The traffic signal at each intersection is controlled by an agent. The agent collects local traffic data from induction loop detectors placed near the stop line of the incoming and outgoing links connecting the neighbouring intersections. Agents communicate outgoing traffic information to the neighbouring agents. The structure of the individual agent architecture is shown in Fig. 1.

Based on locally collected inputs and communicated information, intersection agents determine the green time required for each phase in the next cycle period. Each agent possesses local memory to store traffic demand, creating a data repository for assessing future traffic demand and the effectiveness of the agent's actions. Agents fine-tune and learn the decision model of each intersection by observing the expected utility of each state-action pair, updated using online Q-learning.

The traffic demands in the road network are spread fairly uniformly and can be characterised by different types of distribution based on the traffic flow information collected from the network. However, vehicles have a large number of route choices, and the route selected depends on driver behaviour; therefore no specific green-wave policy can be selected based only on historic traffic flow patterns. For this reason, explicit offset settings were not used in this work; synchronisation is instead achieved through communication of information between agents and learning by visiting each state-action pair a sufficient number of times.
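The input-output flow just described, local detector data in, neighbour messages in, per-phase green times out, can be summarised in a short sketch. This is an illustrative skeleton only; all names (PhaseReading, decide_green_times, the placeholder timing rule) are ours, not the paper's.

```python
from dataclasses import dataclass, field

@dataclass
class PhaseReading:
    """Detector data for one phase (field names are our shorthand)."""
    t_occupied: float  # time (s) the stop-line detector was occupied
    q_length: int      # queue length at the end of the phase
    v_count: int       # vehicles that crossed during the phase

@dataclass
class IntersectionAgent:
    """One distributed agent: no central supervisor; decisions come from
    local loop-detector readings plus messages from adjacent agents."""
    agent_id: str
    repository: list = field(default_factory=list)      # historical demand store
    neighbour_msgs: dict = field(default_factory=dict)  # data from adjacent agents

    def receive(self, sender_id, outgoing_info):
        """Store outgoing-link data communicated by an adjacent agent."""
        self.neighbour_msgs[sender_id] = outgoing_info

    def decide_green_times(self, readings):
        """Return a green time (s) per phase for the next cycle. The rule here
        is a placeholder; the paper's agents combine load comparisons
        (Section 2.2) with Q-learned parameters (Section 3)."""
        self.repository.append(readings)
        return {phase: max(10.0, 2.0 * r.q_length)
                for phase, r in readings.items()}

agent = IntersectionAgent("J4")
agent.receive("J3", {"v_count": 12, "t_occupied": 6.0})
print(agent.decide_green_times({1: PhaseReading(8.2, 11, 15)}))  # {1: 22.0}
```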
2.1 Traffic input

Traffic data such as the vehicle occupancy $T_{\text{occupied}}$ (the amount of time a vehicle is present on the detector), $Q_{\text{length}}$ (the length of the queue of vehicles at the end of each phase) and $V_{\text{count}}$ (the number of vehicles crossing a specific intersection during a phase) are collected from the incoming and outgoing links of the intersection. Since vehicles in most lanes are free to choose their exit lanes, data from the outgoing detectors need to be transmitted to the neighbouring intersection agents to enable prediction of incoming traffic. For proper estimation of the current traffic state, the queue value has to be used along with the vehicle occupancy and count, which tend to stagnate during high traffic flow.

Traffic state estimation is performed based on the occupancy, queue and vehicle count of the most congested lanes (1), as averaging across lanes causes improper classification of traffic

$T_{\text{occupied}} = \max_{i}\,\max_{j}\, T_{\text{occupied}}(a_i l_j)$  (1)

where $a_i l_j$ is the $j$th lane of approach $a_i$.

2.2.2 Local traffic variations: The second most important factor influencing the adjustment of the green time of a phase is the local variation in traffic conditions over consecutive time periods. As the cycle length is dynamically adjusted by the agents, the ratio of $T_{\text{occupied}}$ to $T_{\text{phase}}$ at consecutive time periods is used rather than the raw vehicle occupancy data

$\text{load} = T_{\text{occupied}}/T_{\text{phase}}$  (4)

Agents decide the required change in green time by comparing the load with a threshold value $\text{load}_{\text{target}}$, computed as the average of the load values experienced during the previous $t$ cycles, where $t$ is the number of previous cycles to be considered. The change in green time is computed using

$d_2 = \Delta \cdot T_{\text{phase}}$  (5)
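Equations (1), (4) and (5) chain together: take the most congested lane, form its load over the phase, compare with a target derived from the last t cycles, and scale the phase time by the deviation. A minimal sketch follows; it assumes that Δ is the gap between the load and its target, which the text implies but does not state verbatim, and the variable names are ours.

```python
def max_lane_occupancy(occupancy):
    """Eq. (1): take the most congested lane over all approaches i and
    lanes j, rather than an average, which would blur one blocked lane."""
    return max(max(lanes) for lanes in occupancy)

def green_time_change(t_occupied, t_phase, load_history, n_cycles=5):
    """Eqs. (4)-(5): compare the current load with a target equal to the
    mean load of the previous n_cycles cycles, then scale the phase length
    by the gap. Treating Delta as (load - load_target) is our reading."""
    load = t_occupied / t_phase                 # eq. (4)
    recent = load_history[-n_cycles:]
    load_target = sum(recent) / len(recent)     # threshold from past t cycles
    delta = load - load_target                  # assumed definition of Delta
    return delta * t_phase                      # eq. (5): d2 = Delta * T_phase

# usage: occupancy[i][j] = T_occupied for lane j of approach i
occ = [[3.2, 5.1], [7.4, 2.0, 4.4]]
print(max_lane_occupancy(occ))                                    # 7.4
print(green_time_change(7.4, 30.0, [0.21, 0.25, 0.22, 0.24, 0.23]))
```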
A further complication arises when the queue from the downstream intersection reaches the current intersection because of queue spillback, leading to deadlock formation. From local data alone it is not possible for an agent to differentiate between an empty and a saturated intersection. However, the two scenarios can be distinguished by combining the occupancy and count data communicated by the neighbouring agents: if the vehicle count on an outgoing lane is zero while its occupancy is not, the agent can classify the lane as congested and blocked because of queue spillback. Under such circumstances, the green time for the phase is kept fixed at a minimum limit of 10 s to allow clearance of vehicles.
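The spillback test just described reduces to a two-field check on the data communicated by the downstream agent. A minimal sketch, assuming illustrative message field names and the stated 10 s minimum green:

```python
MIN_GREEN = 10.0  # fixed clearance green (s) when spillback is detected

def outgoing_lane_blocked(v_count, t_occupied):
    """A lane with zero throughput but non-zero occupancy is saturated and
    blocked by downstream spillback, not empty: an empty lane would show
    zero occupancy as well."""
    return v_count == 0 and t_occupied > 0

def phase_green(requested_green, neighbour_msg):
    """Clamp the phase to the minimum green while any outgoing lane reported
    by the downstream agent is spillback-blocked (field names assumed)."""
    if any(outgoing_lane_blocked(m["v_count"], m["t_occupied"])
           for m in neighbour_msg["outgoing_lanes"]):
        return MIN_GREEN
    return requested_green

msg = {"outgoing_lanes": [{"v_count": 0, "t_occupied": 8.5}]}
print(phase_green(34.0, msg))  # -> 10.0, hold at minimum until the queue clears
```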
3 Reinforcement learning

All agents in the network must be capable of learning the model of the intersection they control, based on an assessment of the present average traffic condition and the previous day's traffic condition at the same period. The learning period has to be long enough to allow aggregation of sufficient traffic information to estimate the current traffic state and capture the traffic dynamics; it can be found by experimentation and was calculated as 500 s for the specific road network considered for testing. Conventional supervised learning is difficult here, as the exact desired value is not available, so unsupervised learning or a combination of the two (reinforcement learning) needs to be employed. Reinforcement learning uses the scalar, time-delayed reward received from the environment on selecting an action in a specific state to modify the parameters of the intersection model. In this paper, Q-learning was used to modify the parameters.

3.1 Q-learning
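Q-learning [17] estimates the expected utility Q(s, a) of taking action a in state s from the scalar delayed reward alone, without a supervised target. A minimal sketch of the standard tabular update; the learning rate α and discount γ below are illustrative values, not the paper's.

```python
def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Tabular Q-learning update [17]:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)).
    Q maps state -> list of action values; alpha and gamma are assumed."""
    td_target = r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (td_target - Q[s][a])

# usage: 9 traffic states (Section 3.2), e.g. 3 green-time actions per state
Q = {s: [0.0, 0.0, 0.0] for s in range(9)}
q_update(Q, s=4, a=1, r=2.5, s_next=5)
```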
3.2 Traffic state estimation

Traffic is discretised into different states using queue and flow data. The average queue computed at the end of each phase by (9) can be used for traffic classification; however, for proper classification, the flow value needs to be used in conjunction with the queue

$Q_{\text{score}} = Q_{\text{length}}(\text{phase}_i)/N_{\text{phase}}$  (9)

The values of queue and flow at time period $t$ are taken as Qscore(t) and flow(t). At the start of the learning process, no history data are available; at time period $t+1$, however, Qscore(t) and flow(t) are stored as history values for the previous day, namely Hscore and Hflow respectively. The change in current traffic is computed as the difference between the current queue score and the queue experienced on the previous day at the next time period, Qscore(t) − Hscore(t + 1), and is assigned a membership grade that classifies the current traffic change as low, medium or high, as shown in Fig. 2. The rate of change of traffic dx is computed as (Hflow(t + 1) − flow(t))/flow(t) and (Hscore(t + 1) − Qscore)/Qscore, and is used to determine whether the traffic is decreasing, stable or increasing, with a membership grade assigned similarly to Fig. 2. Combining the rate of change of traffic with the current change in traffic, the traffic at the intersection is classified into one of nine possible states, shown in Table 1. The current traffic state is determined by fuzzy logic as the output with the highest firing level.
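Crossing the fuzzified current change (low, medium, high) with the fuzzified rate of change (decreasing, stable, increasing) gives the 3 × 3 = 9 states of Table 1. The crisp sketch below stands in for the fuzzy memberships of Fig. 2, whose exact shapes are not reproduced here; the ±0.1 cut-offs and the normalisation of the queue change are our assumptions.

```python
def grade(x, lo=-0.1, hi=0.1):
    """Crisp stand-in for a membership function of Fig. 2 (cut-offs assumed):
    0 = low/decreasing, 1 = medium/stable, 2 = high/increasing."""
    return 0 if x < lo else (2 if x > hi else 1)

def traffic_state(q_score, h_score_next, flow, h_flow_next):
    """One of nine states (Table 1): current queue change against the
    previous-day history, crossed with the rate of change of traffic dx."""
    change = grade((q_score - h_score_next) / max(h_score_next, 1e-9))
    dx = grade((h_flow_next - flow) / max(flow, 1e-9))
    return 3 * change + dx  # index 0..8 into the 3x3 grid of Table 1

# e.g. queue 40% above history, flow roughly stable -> a "high, stable" state
print(traffic_state(q_score=14.0, h_score_next=10.0,
                    flow=300.0, h_flow_next=295.0))  # -> 7
```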
3.3 Parameter update

Traffic states have been completely defined in the previous section; the action space and the reward used to evaluate each state-action pair must also be defined. The reward weights the comparison of the current queue against its previous-period value and against its historic value through a parameter h, where h is in the range [0, 1]. The reward value is positive if the queue is smaller than the historic queue as well as the queue of the previous period. Since the traffic demands vary with time, the queue value also varies: comparison with the historic value tracks the traffic pattern over a long period, while comparison with the previous time period tracks short-term variations. Therefore h needs to be kept small so that most of the reward comes from the comparison with historic values.

Once the current state is detected, the appropriate action is chosen as the one with the highest Q-value; if multiple actions have the same Q-value, one of them is selected randomly. An ε-greedy action selection strategy was used to increase exploration and hence the number of visited state-action pairs.

Cooperation between the agents is achieved by averaging the Q matrix values between immediately adjacent neighbouring agents at each time period. This ensures improved performance, as agents learn from the experience of their neighbours.
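The pieces of this update loop read almost directly from the text: a reward that weights the previous-period comparison by h and the historic comparison by 1 − h (the exact functional form is our assumption, consistent with keeping h small), ε-greedy selection with random tie-breaking, and the per-period averaging of Q matrices with immediate neighbours.

```python
import random

def reward(q_now, q_hist, q_prev, h=0.1):
    """Positive when the current queue beats both the historic queue and the
    previous period's queue; h in [0, 1] is kept small so the historic
    comparison dominates. Exact form assumed, not quoted from the paper."""
    return (1.0 - h) * (q_hist - q_now) + h * (q_prev - q_now)

def select_action(Q, s, eps=0.1):
    """Epsilon-greedy selection (eps assumed): usually pick the highest-valued
    action, breaking ties randomly; with probability eps explore at random."""
    if random.random() < eps:
        return random.randrange(len(Q[s]))
    best = max(Q[s])
    return random.choice([a for a, v in enumerate(Q[s]) if v == best])

def average_with_neighbours(Q, neighbour_Qs):
    """Cooperation step: each time period, replace every Q(s,a) with the mean
    over this agent and its immediately adjacent neighbours."""
    for s in Q:
        for a in range(len(Q[s])):
            vals = [Q[s][a]] + [N[s][a] for N in neighbour_Qs]
            Q[s][a] = sum(vals) / len(vals)
```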
4 Performance measures

The performance of the proposed reinforcement learning (RLA) algorithm in a simulated road traffic environment is evaluated on three measures, namely the vehicle count, the total mean delay and the current mean speed of vehicles inside the road network (Fig. 3).

Figure 3 Simulated road network with indication of prominent hotspots caused by pre-timed signals
4.1 Vehicle count

The vehicle count is taken as the difference between the number of vehicles entering and leaving the network during the estimation period. It gives an accurate indication of the congestion level inside the network at a specified period of time.

4.2 Total vehicle mean delay

Total mean delay is the average delay experienced by vehicles in travelling from their starting point in the network to their destination, expressed in seconds. The delay is the sum of the total stopping time, which corresponds to the time lost waiting at intersections, and the travel time, which depends on the speed of vehicles inside the network

$T_{AD} = \sum_{i=1}^{n} T_{D}/T_{N}$  (12)

where $T_{AD}$ is the total average delay, $n$ is the number of intersections, $T_D$ is the delay experienced by vehicles at an intersection and $T_N$ is the total number of vehicles released into the network. Little [18] and the Highway Capacity Manual HCM2000 [19] show the wide acceptance of the delay parameter for validating signal controllers.
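In code, (12) is a one-liner; the sketch below only makes the units and inputs explicit. The example numbers are invented for illustration.

```python
def total_average_delay(intersection_delays, total_vehicles):
    """Eq. (12): T_AD = (sum of per-intersection delays T_D) / T_N, in
    seconds per vehicle, where T_N is the total number of vehicles
    released into the network."""
    return sum(intersection_delays) / total_vehicles

# e.g. three intersections and 1200 released vehicles (illustrative numbers)
print(total_average_delay([52000.0, 61500.0, 48250.0], 1200))  # ~134.8 s
```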
4.3 Current vehicle mean speed

For a better understanding of the results, the current mean speed of vehicles inside the network is used along with the time delay value. The importance of using the current mean speed in validating a signal controller has been highlighted in [20].
5 Benchmarks

It is difficult to find a good benchmark for the large-scale traffic signal control problem for the following reasons:

1. Some of the existing algorithms were developed for simplified traffic scenarios and hence are not suitable for benchmarking.

2. Commercial traffic signal control programs, which are known to work well, are not easily available for proprietary reasons.

Hence, in all the experiments, GLIDE [10], the modified version of SCATS used in Singapore, the hierarchical multi-agent system (HMS) [21] and the cooperative ensemble (CE) [22] are used as benchmarks. HMS and CE have already been compared with GLIDE, so those simulation plots are not repeated here.

HMS is a semi-distributed multi-agent traffic signal control with a hierarchical architecture consisting of three layers of agents with increasing hierarchy and control. The agent at an intersection decides the green time required based on local traffic information, cannot communicate with agents in the same layer, and uses Webster's method to compute the green time requirement. A zonal agent oversees the functioning of five intersection agents by monitoring the action plans of the individual agents and providing directives received from the supervising agent. The supervising agent sits in the top layer of the hierarchy and oversees the functioning of the entire system. The zonal agents use an evolutionary fuzzy algorithm to generate the rule base for control, and compute the cooperation levels required between agents using a neuro-fuzzy system. For a detailed description of the HMS, refer to [21].

CE [22] is a distributed multi-agent architecture in which the agents self-organise and form clusters of cooperating agents. The clusters are formed dynamically using a graph-theoretic method, and they cooperate to reduce the overall time delay experienced by the group rather than by an individual. Overlap between cooperative clusters is possible; however, it is limited to avoid excessive computation.

6 Simulation results and discussion

The proposed RLA signal controller was tested on a simulated network of 29 intersections. The simulated network is a highly congested section of the busy Central Business District area of Singapore [23]. The network is simulated using PARAMICS, a microscopic simulation software package capable of efficiently simulating driver behaviour, dynamic re-routing of vehicles and incidents. The network serves as an ideal test bed because of its geometry and the heterogeneity of its links (major and minor roads with varying speed limits).

Four types of simulation were used to evaluate the performance of the proposed RLA signal control: a typical scenario with a morning peak (3 h), a typical scenario with morning and evening peaks (24 h), an extreme scenario with dual peaks (6 h) and an extreme scenario with multiple peaks (24 h). It must be noted that the extreme traffic scenarios are hypothetical traffic peaks created to test the reliability of the proposed RLA signal control under cyclic, repetitive stress conditions; they also serve to show the response and settling time of the signal control.

Origin-destination data collected from the Land Transport Authority of Singapore are used to recreate the peak traffic conditions. Even though the peak traffic data are pre-fixed, the number of vehicles actually released into the network varies according to the random seeds set before the simulation. Since PARAMICS dynamically adjusts traffic model characteristics such as gap acceptance, lane changing and merging, the traffic dynamics differ for each simulation run with a different random seed. The PARAMICS model has been validated for the specific data and has previously been used for simulation testing in [5, 20, 22, 23].
6.1 Typical scenario with morning peak (3 h)

The typical scenario with a morning peak is used to validate the performance improvement achieved by RLA signal control for short-time traffic variations. Twenty simulation runs using different random seeds were carried out for each signal control technique compared. Since the variance of the outcomes of the simulation runs was small, the average value was taken as representative of the outcome.

Fig. 4a compares the time delay experienced by vehicles in the road network under the different types of traffic signal control. The proposed RLA signal control shows a 15% improvement in delay in comparison with the other benchmarks. The improvement can be attributed to the ability of RLA signal control to foresee traffic increases from the communicated information and to adjust the green timing before the traffic arrives at the intersection. During low-traffic periods, HMS and CE experience higher delays, as their actions are based only on locally collected data, causing more vehicles to be retained inside the network. Although their decisions are more coordinated under high traffic conditions, the number of vehicles to be cleared is then much larger because of the vehicles retained during the low-traffic period, increasing the delay relative to RLA signal control.

Fig. 4b compares RLA signal control with and without communication of data between agents. Without communication, higher delay is experienced during peak traffic conditions, almost equivalent to the HMS and CE signal controls. CE shows higher delay because it is difficult to form clusters in a dynamic environment with continuously changing traffic flow input.

6.2 Typical scenario with morning and evening peaks (24 h)

For the typical scenario with morning and evening peaks (24 h), 20 simulation runs with different random seeds were carried out for each signal control technique, and the average value of the runs was used to evaluate the performance of each technique.

Fig. 5a compares the mean vehicle delay of the different signal control techniques for the 24-h typical two-peak traffic scenario. Although HMS signal control shows higher mean delay during the traffic peaks, RLA signal control
has a stable, lower delay throughout the simulation period. Fig. 5b further indicates a smoother speed transition when using RLA signal control than with the other controllers.

In the extreme scenarios, consecutive peaks cause an increased vehicle count at the start of the next peak traffic region, as can be seen from Fig. 6b.
Figure 7 Mean travel time delay comparison for 6-h two peak traffic scenario
settling time and better adaptability to variations in traffic demand than the other traffic signal controls.

6.4 Response time and cycle length variation

The response time of the RLA signal control is best illustrated by the frequency of change in the phase lengths. Fig. 8a displays the length of each phase of a cycle for a four-phase intersection in the middle of the network controlled by RLA signal control. The links having the right of way during the third phase have the lowest traffic demand, and this is reflected in the phase timing. The cycle time is lower in the non-peak period and varies dynamically with the changing traffic pattern. It is not possible to compare the signal timings across intersections because of the stochastic nature of the traffic input and the random seed initialisation; however, this serves as an indication of the adaptability of the proposed RLA signal control.

6.5 Improvement because of learning

Reinforcement learning vastly improves the performance of the RLA signal control. Fig. 8b shows the variation of the
average value of the mean vehicle delay experienced in each simulation run. After 90 continuous simulation runs, the mean delay had fallen to around 50% of its value at the start of the first run. HMS [21] uses a selective back-propagation method for learning the parameters of its neuro-fuzzy system; back-propagation can become trapped in local optima and therefore increases the time delay.

Table 2 compares the traffic data obtained using RLA signal control with all the other benchmarks, including GLIDE, which is currently used in Singapore. The simulation model of GLIDE was obtained from [5, 21]. Average values obtained from 20 simulation runs are compared in Table 2. The standard deviation of the delay was around 4%, with a 5-6% variation for vehicle count and speed. The proposed RLA signal control showed a 9-15% improvement in performance over the other benchmarks in all 20 simulation runs.

Table 2 Worst and best time delay comparison of signal controls

scenario         metric   pre-timed   HMS    CE     RLA    Act.   GLIDE
one peak         Nv       120         95     89     90     100    -
                 V        37          46     44     49     45     -
                 d        297         200    191    163    196    -
typical day      Nv       317         286    301    266    295    -
                 V        35          43     38     48     44     40
                 d        500         182    200    160    184    200
short extreme    Nv       Sat.        216    258    170    215    -
                 V        0           35     36     48     42     10
                 d        Sat.        315    340    232    309    650
long extreme     Nv       Sat.        250    -      206    205    -
                 V        0           33     -      42     37     0
                 d        Sat.        242    -      216    238    Sat.

Nv: number of vehicles at the end of the simulation; V: speed; d: mean delay (s); Act.: actuated control; Sat.: saturated

7 Conclusion

The proposed RLA signal control has a fully distributed architecture, with agents capable of interacting with each other to compute effectively the optimal green time that reduces the overall travel time delay and increases the mean vehicle speed. The updating of traffic patterns in the repository and the shared communication between agents increased the forecasting capability of each agent. This property effectively reduced the formation of congestion and improved the clearance of vehicles at the intersections. Online Q-learning was adopted in a multi-agent scenario, and the Q matrix was shared between agents to improve the local observations and create a global view. Simulation tests conducted on a virtual traffic network of the Central Business District of Singapore for four different traffic scenarios showed almost 15% improvement over the benchmark signal controls. Further improvement because of online reinforcement learning of the parameters has also been demonstrated.

8 Acknowledgment

This research work was supported by the National University of Singapore under research grant WBS: R-263-000-425-112.

9 References

[1] KOONCE P.: 'Traffic signal timing manual', US Department of Transportation FHWA-HOP-08-024, Federal Highway Administration, 2008

[2] SANCHEZ J.J., GALAN M., RUBIO E.: 'Genetic algorithms and cellular automata: a new architecture for traffic light cycles optimization'. Proc. Congress on Evolutionary Computation, 19-23 June 2004, Piscataway, NJ, USA, pp. 1668-1674

[3] HOAR R., PENNER J., JACOB C.: 'Evolutionary swarm traffic: if ant roads had traffic lights'. Proc. 2002 World Congress on Computational Intelligence (WCCI'02), 12-17 May 2002, Piscataway, NJ, USA, pp. 1910-1915

[4] ISHIHARA H., FUKUDA T.: 'Traffic signal networks simulator using emotional algorithm with individuality'. Proc. IEEE Intelligent Transportation Systems, 25-29 August 2001, Oakland, CA, USA, pp. 1034-1039

[5] SRINIVASAN D., CHOY M.C., CHEU R.L.: 'Neural networks for real-time traffic signal control', IEEE Trans. Intell. Transp. Syst., 2006, 7, pp. 261-272

[6] HUNT P.B., ROBERTSON D.I., BRETHERTON R.D., WINTON R.I.: 'SCOOT - a traffic responsive method of coordinating signals' (United Kingdom, 1981)

[7] PECK C., GORTON P.T.W., LIREN D.: 'Application of SCOOT in developing countries'. Third Int. Conf. on Road Traffic Control, 1-3 May 1990, London, England, pp. 104-109

[8] SIMS A.G., DOBINSON K.W.: 'The Sydney Coordinated Adaptive Traffic (SCAT) system philosophy and benefits', IEEE Trans. Veh. Technol., 1980, VT-29, pp. 130-137

[9] LOWRIE P.R.: 'The Sydney Coordinated Adaptive Traffic System: principles, methodology, algorithms'. Int. Conf. on Road Traffic Signalling, 30 March - 1 April 1982, London, UK, pp. 67-70
[10] KEONG C.K.: 'The GLIDE system - Singapore's urban traffic control system', Transp. Rev., Transnatl. Transdiscipl. J., 1993, 13, pp. 295-305

[11] BAZZAN A.L.C.: 'A distributed approach for coordination of traffic signal agents', Auton. Agents Multi-Agent Syst., 2005, 10, pp. 131-164

[12] ROOZEMOND D.A.: 'Using intelligent agents for pro-active, real-time urban intersection control', Eur. J. Oper. Res., 2001, 131, pp. 293-301

[13] MIZUNO K., NISHIHARA S.: 'Distributed constraint satisfaction for urban traffic signal control'. Second Int. Conf. on Knowledge Science, Engineering and Management (KSEM 2007), 28-30 November 2007, Berlin, Germany, pp. 73-84

[14] DE OLIVEIRA D., BAZZAN A.L.C.: 'Traffic lights control with adaptive group formation based on swarm intelligence'. Ant Colony Optimization and Swarm Intelligence: Proc. Fifth Int. Workshop (ANTS 2006), 4-7 September 2006, Berlin, Germany, pp. 520-521

[15] CHOY M.C., CHEU R.L., SRINIVASAN D., LOGI F.: 'Real-time coordinated signal control through use of agents with online reinforcement learning'. 82nd Transportation Research Board Meeting, Washington, DC, 2003, pp. 64-75

[16] CAMPONOGARA E., KRAUS JR. W.: 'Distributed learning agents in urban traffic control', Prog. Artif. Intell., 2003, 2902, pp. 324-335

[17] WATKINS C., DAYAN P.: 'Technical note: Q-learning', Mach. Learn., 1992, 8, pp. 279-292

[18] LITTLE J.D.C.: 'A proof for the queuing formula: L = λW', Oper. Res., 1961, 9, pp. 383-387

[19] 'Highway capacity manual - HCM2000' (Transportation Research Board, National Research Council, 2000)

[20] BALAJI P.G., SRINIVASAN D., CHEN-KHONG T.: 'Coordination in distributed multi-agent system using type-2 fuzzy decision systems'. IEEE 16th Int. Conf. on Fuzzy Systems (FUZZ-IEEE), 1-6 June 2008, Piscataway, NJ, USA, pp. 2291-2298

[21] CHOY M.C., SRINIVASAN D., CHEU R.L.: 'Neural networks for continuous online learning and control', IEEE Trans. Neural Netw., 2006, 17, pp. 1511-1531

[22] SRINIVASAN D., CHOY M.: 'Distributed problem solving using evolutionary learning in multi-agent systems', Adv. Evol. Comput. Syst. Des., 2007, 66, pp. 211-227

[23] CHOY M.C., SRINIVASAN D., CHEU R.L.: 'Cooperative, hybrid agent architecture for real-time traffic signal control', IEEE Trans. Syst. Man Cybern. A (Syst. Hum.), 2003, 33, pp. 597-607