
Delft University of Technology

Delft Center for Systems and Control

Technical report 04-021

Learning-based model predictive control for Markov decision processes∗

R.R. Negenborn, B. De Schutter, M.A. Wiering, and H. Hellendoorn

If you want to cite this report, please use the following reference instead:
R.R. Negenborn, B. De Schutter, M.A. Wiering, and H. Hellendoorn, "Learning-based model predictive control for Markov decision processes," Proceedings of the 16th IFAC World Congress, Prague, Czech Republic, 6 pp., July 2005. Paper 2106 / We-M16-TO/2.

Delft Center for Systems and Control


Delft University of Technology
Mekelweg 2, 2628 CD Delft
The Netherlands
phone: +31-15-278.51.19 (secretary)
fax: +31-15-278.66.79
URL: http://www.dcsc.tudelft.nl

This report can also be downloaded via http://pub.deschutter.info/abs/04_021.html
LEARNING-BASED MODEL PREDICTIVE CONTROL FOR MARKOV DECISION PROCESSES

Rudy R. Negenborn ∗,1  Bart De Schutter ∗  Marco A. Wiering ∗∗  Hans Hellendoorn ∗

∗ Delft Center for Systems and Control, Delft University of Technology, Delft, The Netherlands
∗∗ Institute of Information and Computing Sciences, Utrecht University, Utrecht, The Netherlands

Abstract: We propose the use of Model Predictive Control (MPC) for controlling systems
described by Markov decision processes. First, we consider a straightforward MPC
algorithm for Markov decision processes. Then, we propose value functions as a means
to deal with issues arising in conventional MPC, e.g., computational requirements and
sub-optimality of actions. We use reinforcement learning to let an MPC agent learn a
value function incrementally. The agent incorporates experience from the interaction with
the system in its decision making. Our approach initially relies on pure MPC. Over time,
as experience increases, the learned value function is taken more and more into account.
This speeds up the decision making, allows decisions to be made over an infinite instead
of a finite horizon, and provides adequate control actions, even if the system and desired
performance slowly vary over time.

Keywords: Markov decision processes, predictive control, learning.

1. INTRODUCTION

Over the last decades Model Predictive Control (MPC) has become an important technology for finding control policies for complex, dynamic systems, as found in, e.g., the process industry (Camacho and Bordons, 1995; Morari and Lee, 1999). As the name suggests, MPC is based on models that describe the behavior of a system. Typically, these models are systems of difference or differential equations. In this paper we consider the application of MPC to systems that can be modeled by Markov decision processes, a subclass of discrete-event models. Moreover, we propose a learning-based extension for reducing the on-line computational cost of the MPC algorithm, using reinforcement learning to learn expectations of performance on-line. The approach allows for system models to change gradually over time, results in fewer computations than conventional MPC, and improves decision quality by making decisions over an infinite horizon.

We consider an agent controlling a dynamic system at discrete decision steps. At each decision step, the agent observes the state of the system and determines the next action to take based on the observation and a policy. A policy maps states to actions and it is the agent's task to determine a policy that makes the system behave in an optimal way.

This paper is organized as follows. We introduce conventional MPC in Section 2. Then we propose MPC for systems that can be modeled by Markov decision processes in Section 3. We consider the use of value functions in MPC in Section 4. To improve computational and decision making performance we improve the method with reinforcement learning in Section 5.

1 Corresponding author, e-mail: [email protected]


2. MODEL PREDICTIVE CONTROL

MPC (Camacho and Bordons, 1995; Morari and Lee, 1999; Maciejowski, 2002) is a model-based control approach that has found successful application, e.g., in the process industry. In MPC, a control agent uses a system model to predict the behavior of a system under various actions. The control agent finds a sequence of actions that brings the system into a desired state, while minimizing negative effects of the actions and taking constraints into account. In order to find the sequence of appropriate actions, the control agent uses a performance function. This performance function evaluates the preferability of being in a certain state and performing a certain action by giving rewards.

Let us denote by r_k the reward given by the performance function at decision step k, by a_0, . . . , a_∞ the actions to be determined by the agent, and by E the expectancy operator taking the system uncertainty into account. We may then write the task of the agent as solving the optimization problem:

    \max_{a_0,\ldots,a_\infty} E\Big\{ \sum_{k=0}^{\infty} r_k \Big\},    (1)

subject to the system model, the performance function, and the constraints.

Basing actions on the model predictions introduces issues with robustness, due to the fact that models are inherently inaccurate and thus predictions further into the future are more and more uncertain. To deal with this, MPC uses a rolling or receding horizon, which involves reformulating the optimization problem at each decision step using the latest observation of the system state. However, the rolling horizon increases computational costs, since at each decision step a sequence of actions has to be determined to make sure no constraints are violated. In practice this is intractable for many applications. To reduce computational costs, MPC uses a control horizon, a prediction horizon, and a performance-to-go. The control horizon determines the number of actions to find. The prediction horizon determines how far the behavior of the system is predicted. The performance-to-go gives the sum of the reward obtained from the state at the end of the prediction horizon until infinity under a certain policy. With these principles (1) can be rewritten as:

    \max_{a_{k_0},\ldots,a_{k_0+N_c}} \Big[ E\Big\{ \sum_{k=k_0}^{k_0+N_c} r_k \Big\} + E\Big\{ \sum_{k=k_0+N_c+1}^{k_0+N_p} r_k \Big\} + V\big(x_{k_0+N_p+1}\big) \Big],    (2)

where V is the performance-to-go function, indicating the expected sum of future rewards when in a certain state. In general the performance-to-go function is not known in advance; it may be assumed zero, approximated with a Lyapunov function (Jadbabaie et al., 1999), or be learned from experience, as we shall discuss in Section 5.

Fig. 1. Example of conventional MPC. The control problem is to find actions u_k to u_{k+N_c}, such that after N_p steps the system behavior y approaches the desired behavior y*. In this example, y indeed reaches the desired set point y*. (The figure shows the past and future evolution of the set point y*, the predicted outputs y, and the computed control inputs u, over the control horizon from k to k+N_c and the prediction horizon up to k+N_p.)

Implementation details of (2) depend on the structure of the system model and performance function. In general, MPC methods have the following scheme (see Figure 1):

(1) The horizon is moved to the current decision step k_0 by observing the state of the true system and reformulating the optimization problem of (2) using the observed state as initial state x_{k_0}.
(2) The formulated optimization problem is solved, often using general solution techniques (e.g., quadratic programming, sequential quadratic programming, ...). The optimization problem is solved taking into account constraints on actions and states.
(3) Actions found in the optimization procedure are executed until the next decision step. Typically only one action is performed.
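As an illustration of this scheme, the sketch below shows a generic receding-horizon loop in Python. It is a minimal sketch, not an implementation from this report: the three callables stand in for the problem-specific observation, optimization, and actuation code, and their names are chosen here for illustration only.

```python
from typing import Any, Callable, Sequence

def receding_horizon_loop(observe_state: Callable[[], Any],
                          solve_finite_horizon_problem: Callable[[Any], Sequence[Any]],
                          apply_action: Callable[[Any], None],
                          num_steps: int) -> None:
    """Generic MPC scheme, mirroring steps (1)-(3) above."""
    for _ in range(num_steps):
        # (1) Roll the horizon: observe the current state of the true system.
        x_k0 = observe_state()
        # (2) Solve the finite-horizon problem (2) from x_k0, subject to the
        #     constraints on actions and states.
        planned_actions = solve_finite_horizon_problem(x_k0)
        # (3) Implement only the first action; the problem is reformulated
        #     and solved again at the next decision step.
        apply_action(planned_actions[0])
```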
Advantages of MPC lie in the explicit integration of input and state constraints. Due to the rolling horizon MPC adapts easily to new contexts and can be used without intervention for long periods. Moreover, only few parameters need to be tuned, i.e., the prediction and control horizon. However, the optimization problem may still require too many computations, e.g., when the control horizon becomes large. Resources required for computation and memory may be high, increasing more when the prediction horizon or system complexity increases. Besides that, solutions to the finite-horizon problems do not guarantee solutions to the problem over the infinite horizon.

Research in the past has addressed these issues for conventional MPC, typically using models that are systems of difference or differential equations. In the following sections we propose MPC for systems modeled by Markov decision processes and consider improving speed and decision quality using the performance-to-go function and experience.
3. MPC FOR MARKOV DECISION PROCESSES

3.1 Markov Decision Processes

Markov decision processes (Puterman, 1994) are applicable in fields characterized by uncertain state transitions and a necessity for sequential decision making, e.g., robot control, manufacturing, and traffic signal control (Wiering, 2000). Markov decision processes satisfy the Markov property, stating that state transitions are conditionally independent from actions and states encountered before the current decision step. An agent can therefore rely on a policy that directly maps states to actions to determine the next action. After execution of an action, the system is assumed to stay in the new state until the next action, i.e., the system has no autonomous behavior. Figure 2 shows the graph representation of some Markov decision process.

Fig. 2. Example of a Markov decision process. A node represents a state. An arc represents a transition from one state to another under a certain action. An arc is labeled with a transition probability and a reward obtainable under the transition. (The figure shows states x^1, . . . , x^6 connected by arcs labeled with transition probabilities such as P(x^2|x^1, a^1) and rewards such as r(x^1, a^1, x^2).)

We use k as counter that indicates the decision step. At each step the system is in one out of a finite set of states X = {x^1, x^2, . . . , x^N}. In each state x ∈ X there is a finite set of actions A_x that the agent can perform (A_x = {a^1, a^2, . . . , a^{M_x}}). The system evolves according to system model Σ : P(x′|x, a), where P(x′|x, a) is the probability of transitioning from state x to state x′ after action a is performed. The performance function is given by r, where r(x, a, x′) is the reward obtained with the transition from state x to state x′ under action a.

Constraints can be included explicitly by restricting actions and reachable states, or implicitly by imposing a highly negative reward for certain transitions; as we will see, the agent will try to avoid these transitions.

As an example, in local traffic signal control at an intersection, a state can consist of the number of cars in front of the traffic signals. Actions in each state consist of traffic signal configurations. Transition probabilities may depend on the number of cars leaving the crossroad during a green signal. Rewards may depend on the average waiting time, with lower waiting time indicating higher reward. Constraints on actions consist of admissible, safe, traffic signal configurations.
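To fix ideas, the following sketch shows one way such a tabular Markov decision process can be represented in Python as plain dictionaries. The three-state example is hypothetical and serves only to illustrate the data layout of X, A_x, P(x′|x, a), and r(x, a, x′); it is neither the process of Figure 2 nor the traffic-signal example.

```python
# Tabular MDP as nested dictionaries (a minimal sketch):
#   actions[x]         -- the finite action set A_x available in state x
#   P[(x, a)][x_next]  -- transition probability P(x'|x, a)
#   r[(x, a, x_next)]  -- reward r(x, a, x') for that transition
states = ["x1", "x2", "x3"]

actions = {
    "x1": ["a1", "a2"],
    "x2": ["a1"],
    "x3": ["a1", "a2"],
}

P = {
    ("x1", "a1"): {"x2": 0.8, "x1": 0.2},
    ("x1", "a2"): {"x3": 1.0},
    ("x2", "a1"): {"x3": 0.6, "x1": 0.4},
    ("x3", "a1"): {"x3": 1.0},
    ("x3", "a2"): {"x1": 1.0},
}

r = {
    ("x1", "a1", "x2"): 1.0,
    ("x1", "a1", "x1"): 0.0,
    ("x1", "a2", "x3"): -10.0,  # implicit constraint: a strongly penalized transition
    ("x2", "a1", "x3"): 2.0,
    ("x2", "a1", "x1"): 0.0,
    ("x3", "a1", "x3"): 0.5,
    ("x3", "a2", "x1"): 0.0,
}
```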
3.2 Straightforward MPC Approach

Let us consider the straightforward application of MPC to Markov decision processes. Similar to alternative approaches, the rolling horizon principle is easily included by letting the agent synchronize, at each decision step, its current estimate of the system state with a new observation of the system state. The control horizon should equal the prediction horizon, since the systems we consider have no autonomous behavior and the set of possible actions can change per state. Therefore, assuming constant actions between the end of the control horizon and the prediction horizon, as is usually done in conventional MPC, is not reasonable in our case.

The agent uses the Markov decision process to find a sequence of N_c actions that gives the best performance over the control horizon. From the graphical viewpoint of Markov decision processes this comes down to finding the path of N_c steps that has the highest expected accumulated reward. This yields the following straightforward MPC algorithm for Markov decision processes:

(1) Roll the horizon to the current step by observing the state of the system. Define the optimization problem of finding the actions over the control horizon that maximize the sum of the rewards starting from the observed state.
(2) Find all paths of length N_c and accumulate the rewards. Determine the sequence of actions that leads to the path with the highest accumulated reward.
(3) Implement the first action of this sequence and move on to the next decision step.
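One possible realization of step (2), assuming the dictionary-based MDP sketched in Section 3.1, is a recursive lookahead that expands every action and every successor state up to depth N_c and maximizes the expected accumulated reward; for a deterministic model this reduces to enumerating all paths of length N_c. The sketch below is illustrative only.

```python
def lookahead(x, depth, actions, P, r):
    """Return (best expected accumulated reward, best first action) for a
    lookahead of `depth` steps from state x, using the tabular structures
    actions, P, and r from the previous sketch."""
    if depth == 0:
        return 0.0, None
    best_value, best_a = float("-inf"), None
    for a in actions[x]:
        expected = 0.0
        for x_next, prob in P[(x, a)].items():
            future, _ = lookahead(x_next, depth - 1, actions, P, r)
            expected += prob * (r[(x, a, x_next)] + future)
        if expected > best_value:
            best_value, best_a = expected, a
    return best_value, best_a

# One decision step of the straightforward algorithm: observe the state,
# look N_c steps ahead, and implement only the first action found.
Nc = 2
_, a_k = lookahead("x1", Nc, actions, P, r)
```

The cost of this enumeration grows exponentially with N_c, which is the computational burden discussed next.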
The proposed MPC algorithm can suffer from the disadvantages discussed earlier for general MPC techniques. The amount of computational resources required to consider all paths over the length of the control horizon depends on N_c and the number of actions possible from each encountered state. In particular when there is a very large number of actions from each state, it may be intractable to consider all paths. Also, whether the system model and the performance function are deterministic or stochastic influences the speed at which the paths can be evaluated. Furthermore, because of the limited horizon over which actions are considered, the resulting policy may be suboptimal. This is in particular the case since we ignored the performance-to-go V, as is commonly done in conventional MPC.

As a solution we can take a small control horizon. However, this may result in increased sub-optimal decision making, in particular when we keep ignoring the performance-to-go. In the following we will not ignore this performance indicator. We will from now on refer to the performance-to-go as the value function, and use the information from this value function to improve the computations required at each step.
4. MPC WITH VALUE FUNCTIONS

4.1 Value Functions

A value function V gives the expected accumulated future reward for each state x and a policy π. The optimal value function V* gives the highest possible expected accumulated future reward for each state. This highest possible future reward is obtained by following the actions that an optimal policy π* prescribes.² Whereas in previous sections we considered a deterministic policy, from now on we consider a probabilistic policy. The optimal value function V* is then obtained by solving for each x_{k_0}:

    V^{*}(x_{k_0}) = \max_{\pi} E\Big\{ \sum_{k=k_0}^{\infty} r\big(x_k, \pi(x_k), x_{k+1}\big) \Big\}.

² For the sake of simplicity we assume a unique optimal policy. Extension to the non-unique case is straightforward by choosing one of the optimums.

Assume the optimal value function is known. From the graphical viewpoint of Markov decision processes, we can label each node with a value, or expected accumulated future reward. In that case, the agent has to consider only the actions a ∈ A_x possible in the current state x and find the action that gives the highest sum of directly obtainable reward plus expected accumulated future reward of the resulting state after the action would have been executed. This sum, called the Q value for the (x, a)-pair, is used by the agent to find the action that gives the highest Q value as follows:

    a_k = \arg\max_{a \in A_{x_k}} \sum_{x'} P(x'|x_k, a) \big[ r(x_k, a, x') + V^{*}(x') \big].

Thus, when the optimal value function is known, instead of considering N_c steps, the agent has to consider only a one-step optimization procedure at each decision step, i.e., the control horizon becomes N_c = 1. Moreover, since the value function is optimal over the infinite horizon, also the chosen actions are optimal over the infinite horizon.
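As a small illustration, assuming the dictionary-based MDP of Section 3.1 and a value function stored as a dict V mapping states to values, this one-step rule can be written as follows; the optional discount factor is included only so that the same routine can also be used with the approximated value function introduced below.

```python
def greedy_action(x, V, actions, P, r, gamma=1.0):
    """One-step optimization: return the action with the highest Q value
    sum_x' P(x'|x,a) [ r(x,a,x') + gamma * V(x') ]."""
    best_a, best_q = None, float("-inf")
    for a in actions[x]:
        q = sum(prob * (r[(x, a, x_next)] + gamma * V[x_next])
                for x_next, prob in P[(x, a)].items())
        if q > best_q:
            best_a, best_q = a, q
    return best_a
```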
In general neither optimal policies nor optimal value functions are known in advance. In our case, value functions cannot be computed easily in a straightforward way, since the reward over an infinite horizon cannot be summed explicitly. Instead, the value function can be approximated. Dynamic-programming methods (Bellman, 1957) offer one way of approximating the value function: they introduce a discount factor, which lets the infinite sum of rewards converge. Using a discount factor, the value function is approximated as:

    V^{\pi}(x_{k_0}) = E\Big\{ \sum_{k=k_0}^{\infty} \gamma^{k-k_0} r\big(x_k, \pi(x_k), x_{k+1}\big) \Big\},    (3)

where γ ∈ (0, 1) is the discount factor. The closer γ is chosen to 1, the more long-term performance expectations are taken into account. The value function (3) can be written as:

    V^{\pi}(x_{k_0}) = \sum_{a \in A_{x_{k_0}}} P_{\pi}(a|x_{k_0}) \sum_{x'} P(x'|x_{k_0}, a) \big[ r(x_{k_0}, a, x') + \gamma V^{\pi}(x') \big],

where P_π(a|x) is the probability that the policy π will select action a in state x. This kind of equation is called a Bellman equation. Dynamic-programming methods treat the optimal values of the states as unknowns; the system of Bellman equations for all states then forms a system of equations whose unique solution is the optimal value function (Sutton and Barto, 1998).
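A standard dynamic-programming method for solving this system is value iteration, which repeatedly applies the Bellman backup until the values stop changing. The sketch below, again using the dictionary-based MDP of Section 3.1, is one possible solver; the report does not prescribe a particular method, and the tolerance and discount factor shown are illustrative.

```python
def value_iteration(states, actions, P, r, gamma=0.95, tol=1e-8):
    """Approximate the optimal value function by iterating the backup
    V(x) <- max_a sum_x' P(x'|x,a) [ r(x,a,x') + gamma * V(x') ]."""
    V = {x: 0.0 for x in states}
    while True:
        delta = 0.0
        for x in states:
            new_v = max(
                sum(prob * (r[(x, a, x_next)] + gamma * V[x_next])
                    for x_next, prob in P[(x, a)].items())
                for a in actions[x]
            )
            delta = max(delta, abs(new_v - V[x]))
            V[x] = new_v
        if delta < tol:
            return V
```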
4.2 Value-Function MPC Approach

Using the value function we can formulate a new MPC algorithm for Markov decision processes as follows:

(1) Apply the rolling horizon principle, updating the state estimate with a measurement of the state.
(2) Compute the value function given the latest system model.
(3) Formulate the optimization problem over a control horizon of N_c = 1 of finding the action that brings the state of the system into the state with the highest value. Solve the problem.
(4) Implement the found action and move on to the next decision step.

The advantage of this approach is that the control horizon is only of length one. Moreover, by using the most up-to-date system model to compute the value function at each decision step, actions are adequate, even in the event of (slowly) changing system and performance desires.
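Put together, one decision step of this algorithm might look as follows, reusing the value_iteration and greedy_action sketches given earlier; this is only an illustration of the flow of steps (1)-(4), not the implementation of the report.

```python
def value_function_mpc_step(x_observed, states, actions, P, r, gamma=0.95):
    """One decision step of the value-function MPC approach."""
    # Step (2): recompute the value function with the latest system model.
    V = value_iteration(states, actions, P, r, gamma)
    # Step (3): one-step (N_c = 1) optimization from the observed state.
    return greedy_action(x_observed, V, actions, P, r, gamma)

# Step (4): the returned action is implemented, and the procedure is
# repeated at the next decision step with a new observation (step (1)).
```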
However, computing the optimal value function at each decision step can computationally be very expensive. Computing the optimal value function off-line, before the agent starts controlling the system (e.g., as done in (Bemporad et al., 2002) for linear systems), reduces on-line computations, but does not allow for the system to vary over time. Although the rolling horizon provides some robustness, structural changes in parameters of the system model are not anticipated.

Instead of recomputing the value function at each decision step, we could update the value function on-line using experience from the interaction between the agent and the true system. We propose to combine MPC for Markov decision processes with learning the value function on-line using reinforcement learning. This way, system changes are anticipated on-line while not computing the value function at every decision step.
5. MPC WITH REINFORCEMENT LEARNING

5.1 Reinforcement Learning

In reinforcement learning (Sutton and Barto, 1998; Kaelbling et al., 1996; Wiering, 1999) both the model of the stochastic system and the desired behavior are unknown a priori. To determine a policy, the agent incrementally computes the value function based on performance indications and interaction with the system, which implicitly contains the system model. At each decision step the value function of the last decision step is updated with the newly gained experience, consisting of a state-action-state transition and a reward. By obtaining sufficiently many experiences the agent can accurately estimate the value function.

In Temporal-Difference (λ) learning (TD(λ)) (Sutton, 1988) the difference between value estimates of successive decision steps is minimized, explicitly using value estimates of successive states. The parameter λ ∈ [0, 1] weighs reward and value estimates further away in the future exponentially less. With probability 1, value estimates can be guaranteed to converge to the true values for all λ (Sutton, 1988).

TD(λ) learning uses eligibility traces to incrementally learn the value function, which we assume initially contains arbitrary (finite) values. The value of a state depends on the values of successor states. Therefore, the value update of a state also depends on the value updates of successive states. In fact, to compute the update for a state, all future updates need to be known, which is impossible for the infinite-horizon case. Instead, values can be updated incrementally as new updates become available using eligibility traces (Barto et al., 1983). These traces indicate the amount a state is eligible to learn from new experience. This depends on λ, the recency of the state appearance, and the frequency of the state appearance. The update ΔV^l(x) of the learned value of a state using a reward received in the future can be shown to be:

    \Delta V^{l}(x) = \alpha(x)\, e_k\, l_k(x),

where α(x) is a suitable learning rate, which can guarantee convergence; the error e_k = r_k + γ V^l(x_{k+1}) − V^l(x_k) indicates for a state the difference between the previously learned value V^l(x_k) and the sampled value based on the obtained reward r_k and the previously learned value V^l(x_{k+1}) for the successor state; l_k(x) represents the accumulating eligibility trace for x, which is initially zero and can recursively be updated as:

    l_{k+1}(x) \leftarrow \lambda\gamma\, l_k(x)           \text{if } x_k \neq x,
    l_{k+1}(x) \leftarrow \lambda\gamma\, l_k(x) + 1       \text{if } x_k = x.

The uncertainty in the update can be computed using the error e_k. For the case λ = 0 the uncertainty (or variance) in the update is σ_k^2 = e_k^2. More general results on error bounds for TD learning are reported in (Kearns and Singh, 2000).
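As an illustration, a single TD(λ) update with accumulating traces can be sketched as below; V and trace are dicts over the states (traces initialized to zero), and the constant step size as well as the values of α, γ, and λ are illustrative choices, not taken from this report.

```python
def td_lambda_update(V, trace, x_k, r_k, x_next, alpha=0.1, gamma=0.95, lam=0.8):
    """Apply one TD(lambda) update after observing the transition
    (x_k, r_k, x_next). Returns the temporal-difference error e_k."""
    # Temporal-difference error: e_k = r_k + gamma * V(x_{k+1}) - V(x_k).
    e_k = r_k + gamma * V[x_next] - V[x_k]
    # Accumulating eligibility traces: decay all traces, bump the visited state.
    for x in trace:
        trace[x] *= lam * gamma
    trace[x_k] += 1.0
    # Every state learns in proportion to its eligibility.
    for x in V:
        V[x] += alpha * e_k * trace[x]
    # For lambda = 0, e_k**2 can serve as the uncertainty of the update.
    return e_k
```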
5.2 TD-MPC Approach

We consider a collaborative approach in which MPC provides basic robustness and decision making over the relatively short term, while learning provides robustness, adaptation, and decision making over the long term. The agent gradually incorporates the learned value function in its decision making as experience increases. Initially the uncertainty in the value estimates is high, so the agent just uses MPC. Samples generated by the MPC part are predictions about the behavior of the system and predictions about what is optimal to do over the control horizon. Learning uses the samples as idealized experience, incorporating them in its value function. Over time the uncertainty in the value estimates decreases. When the uncertainty is below a threshold, the agent uses the value estimates, thereby decreasing the control horizon over which MPC computes paths. Since the agent uses a learned value only when the uncertainty in it is below a threshold, values can be initialized to any finite value. We propose the following algorithm:

(1) Roll the horizon to the current step k.
(2) For each path of N_c (x, a, r, x′) 4-tuples starting from the current state, consider each state. If the uncertainty in the value estimate of an encountered state is below a threshold, use the value plus the reward summed over earlier steps in that path as an indication for the expected accumulated future reward, and stop considering the path. Else, add the given reward to the summed reward over earlier steps in the path and move to the next state.
(3) Incorporate the (x, a, r, x′)-samples created by MPC in the value function as experience using TD learning and reduce the uncertainty in the value estimates.
(4) Implement the first action in the sequence determined and move to the next decision step.
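The sketch below illustrates one possible reading of step (2): a lookahead that expands paths as in Section 3.2 but truncates a path as soon as it reaches a state whose learned value is trusted, i.e., whose uncertainty estimate is below the threshold. It reuses the tabular MDP structures of Section 3.1 and assumes a dict sigma2 of per-state uncertainty estimates; the TD update of step (3) would reuse td_lambda_update above, and applying the discount factor to learned values is an assumption of this sketch.

```python
def td_mpc_lookahead(x, depth, V, sigma2, threshold, actions, P, r, gamma=0.95):
    """Return (expected accumulated reward, best first action), stopping a
    path early whenever a successor state has a trusted value estimate."""
    best_value, best_a = float("-inf"), None
    for a in actions[x]:
        value = 0.0
        for x_next, prob in P[(x, a)].items():
            if sigma2[x_next] < threshold:
                # Trusted learned value: stop expanding this path and use the
                # value estimate as the expected accumulated future reward.
                future = V[x_next]
            elif depth > 1:
                future, _ = td_mpc_lookahead(x_next, depth - 1, V, sigma2,
                                             threshold, actions, P, r, gamma)
            else:
                future = 0.0  # end of the control horizon, no trusted value
            value += prob * (r[(x, a, x_next)] + gamma * future)
        if value > best_value:
            best_value, best_a = value, a
    return best_value, best_a
```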
The described algorithm has some attractive features. Once the value function is computed with high enough accuracy, the computationally intensive MPC optimizations over the full control horizon using the system model and the performance function are reduced to a one-step optimization using the system model and the value function. Moreover, using the experience, the decisions are based on an infinite horizon, since values of states represent expected accumulated reward over the full future. Constraint violations are thus anticipated better.

The agent will propose adequate actions, even if the system and desired performance slowly vary over time. In particular for systems with a long lifetime this is an advantage. The system model and performance function can be updated at each decision step. The agent will then generate samples using these updated models, and the learning part will incorporate these samples and adjust to the new situation.
6. CONCLUSIONS & FUTURE RESEARCH

In this paper we have considered Model Predictive Control (MPC) for Markov decision processes. We have first considered a straightforward algorithm for this kind of model. To deal with high computational requirements and sub-optimality issues, we have proposed the use of the performance-to-go or value function. With optimal value functions the MPC control horizon becomes of length one. Speed is increased, while decisions are based on infinite-horizon predictions.

In general, however, optimal value functions are not known a priori. In this paper we have considered using experience to incrementally learn value functions over time. With reinforcement-learning methods like temporal-difference learning the agent incorporates experience built up through interaction with the system. It can over time get a good estimate of the value function. Once sufficient experience has been obtained, the agent uses this to its fullest, requiring fewer computations than the non-learning approach.

An additional advantage of the proposed approach is that the agent adapts to changing system and performance characteristics. The performance function or the system under control may slowly change over time. Since the agent incorporates newly gained experience at each decision step, it will adapt to these changes and still produce adequate actions.

We note that in this paper we have considered TD(λ) learning for finite Markov decision processes. To deal with high-dimensional continuous action and state spaces we can use actor-critic methods (Sutton and Barto, 1998). Moreover, in this paper we have silently assumed an explicit tabular value-function representation. If an explicit representation is not available, we may use an implicit representation, e.g., a function approximator (Sutton and Barto, 1998). MPC may then still be combined fruitfully with learning.

Future research directions consist of considering alternative ways to include the uncertainty in the gained experience in the decision making. Also, accuracy bounds and comparisons with alternative adaptive and learning control approaches can be made. Furthermore, experiments need to be carried out to further investigate and show the potential of the proposed learning-based MPC for Markov decision processes.

ACKNOWLEDGMENTS

This research was supported by project "Multi-agent control of large-scale hybrid systems" (DWV.6188) of the Dutch Technology Foundation STW, Applied Science division of NWO, the Technology Programme of the Dutch Ministry of Economic Affairs, the TU Delft spearhead program "Transport Research Centre Delft: Towards Reliable Mobility", and the European 6th Framework Network of Excellence "HYbrid CONtrol: Taming Heterogeneity and Complexity of Networked Embedded Systems (HYCON)".

REFERENCES

Barto, A. G., R. S. Sutton and C. W. Anderson (1983). Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics 13, 834–846.
Bellman, R. (1957). Dynamic Programming. Princeton University Press, Princeton, New Jersey.
Bemporad, A., M. Morari, V. Dua and E.N. Pistikopoulos (2002). The explicit linear quadratic regulator for constrained systems. Automatica 38(1), 3–20.
Camacho, E.F. and C. Bordons (1995). Model Predictive Control in the Process Industry. Springer-Verlag, Berlin, Germany.
Jadbabaie, A., J. Yu and J. Hauser (1999). Stabilizing receding horizon control of nonlinear systems: a control Lyapunov function approach. In: Proceedings of the 1999 American Control Conference, San Diego, California, pp. 1535–1539.
Kaelbling, L. P., M. L. Littman and A. W. Moore (1996). Reinforcement learning: A survey. Journal of Artificial Intelligence Research 4, 237–285.
Kearns, M. and S. Singh (2000). Bias-variance error bounds for temporal difference updates. In: Proceedings of the Thirteenth Annual Conference on Computational Learning Theory, Stanford, California, pp. 142–147.
Maciejowski, J. M. (2002). Predictive Control with Constraints. Prentice Hall, Harlow, England.
Morari, M. and J. H. Lee (1999). Model predictive control: past, present and future. Computers and Chemical Engineering 23, 667–682.
Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York.
Sutton, R. and A. Barto (1998). An Introduction to Reinforcement Learning. MIT Press, Cambridge, Massachusetts.
Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning 3, 9–44.
Wiering, M. (2000). Multi-agent reinforcement learning for traffic light control. In: Proceedings of the Seventeenth International Conference on Machine Learning, Stanford, California, pp. 1151–1158.
Wiering, M. A. (1999). Explorations in Efficient Reinforcement Learning. PhD thesis, University of Amsterdam, The Netherlands.
