MPC for MDPs
If you want to cite this report, please use the following reference instead:
R.R. Negenborn, B. De Schutter, M.A. Wiering, and H. Hellendoorn, “Learning-based model predictive control for Markov decision processes,” Proceedings of the 16th IFAC World Congress, Prague, Czech Republic, 6 pp., July 2005. Paper 2106 / We-M16-TO/2.
∗ Delft Center for Systems and Control, Delft University of Technology, Delft, The Netherlands
∗∗ Institute of Information and Computing Sciences, Utrecht University, Utrecht, The Netherlands
Abstract: We propose the use of Model Predictive Control (MPC) for controlling systems
described by Markov decision processes. First, we consider a straightforward MPC
algorithm for Markov decision processes. Then, we propose the use of value functions as a means
to deal with issues arising in conventional MPC, e.g., computational requirements and
sub-optimality of actions. We use reinforcement learning to let an MPC agent learn a
value function incrementally. The agent incorporates experience from the interaction with
the system in its decision making. Our approach initially relies on pure MPC. Over time,
as experience increases, the learned value function is taken more and more into account.
This speeds up the decision making, allows decisions to be made over an infinite instead
of a finite horizon, and provides adequate control actions, even if the system and desired
performance slowly vary over time.
In reinforcement learning (Sutton and Barto, 1998; Kaelbling et al., 1996; Wiering, 1999) both the model of the stochastic system and the desired behavior are unknown a priori. To determine a policy, the agent incrementally computes the value function based on performance indications and on interaction with the system, which implicitly contains the system model. At each decision step the value function of the last decision step is updated with the newly gained experience, consisting of a state-action-state transition and a reward. By obtaining sufficiently many experiences the agent can accurately estimate the value function.

In Temporal-Difference (λ) learning (TD(λ)) (Sutton, 1988) the difference between value estimates at successive decision steps is minimized, explicitly using the value estimates of successive states. The parameter λ ∈ [0, 1] weighs rewards and value estimates that lie further in the future exponentially less. With probability 1, value estimates can be guaranteed to converge to the true values for all λ (Sutton, 1988).

TD(λ) learning uses eligibility traces to incrementally learn the value function, which we assume initially contains arbitrary (finite) values. The value of a state depends on the values of its successor states. Therefore, the value update of a state also depends on the value updates of successive states. In fact, to compute the update for a state, all future updates would need to be known, which is impossible in the infinite-horizon case. Instead, values can be updated incrementally as new updates become available, using eligibility traces (Barto et al., 1983). These traces indicate to what extent a state is eligible to learn from new experience, which depends on λ, the recency of the state's appearance, and the frequency of the state's appearance. The update ∆V^l(x) of the learned value of a state using a reward received in the future can be shown to be

∆V^l(x) = α(x) e_k l_k(x),

where α(x) is a suitable learning rate, which can guarantee convergence; the error e_k = r_k + γ V^l(x_{k+1}) − V^l(x_k) indicates for a state the difference between the previously learned value V^l(x_k) and the sampled value based on the obtained reward r_k and the previously learned value V^l(x_{k+1}) of the successor state; and l_k(x) represents the accumulating eligibility trace for x, which is initially zero and is updated recursively at each decision step.
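To make the update concrete, the following minimal Python sketch (not taken from the paper) implements tabular TD(λ) with accumulating eligibility traces for the update ∆V^l(x) = α(x) e_k l_k(x); the class and parameter names, and the constant learning rate, are illustrative assumptions.

from collections import defaultdict

class TDLambdaLearner:
    def __init__(self, alpha=0.1, gamma=0.95, lam=0.8):
        self.alpha = alpha               # learning rate alpha(x); constant here for simplicity
        self.gamma = gamma               # discount factor gamma
        self.lam = lam                   # trace-decay parameter lambda in [0, 1]
        self.value = defaultdict(float)  # learned value function V^l, arbitrary (zero) initialization
        self.trace = defaultdict(float)  # accumulating eligibility traces l_k(x), initially zero

    def update(self, x, r, x_next):
        """Incorporate one (x, r, x') experience sample."""
        # Temporal-difference error e_k = r_k + gamma*V^l(x_{k+1}) - V^l(x_k)
        e = r + self.gamma * self.value[x_next] - self.value[x]
        # Accumulating trace: decay every trace, then bump the state just visited
        for s in list(self.trace):
            self.trace[s] *= self.gamma * self.lam
        self.trace[x] += 1.0
        # Every eligible state learns from the error: Delta V^l(s) = alpha * e_k * l_k(s)
        for s, l in list(self.trace.items()):
            self.value[s] += self.alpha * e * l

Calling update on each (x, r, x′) sample drives V^l toward the values implied by the experience, with λ controlling how far credit is propagated back along recently visited states.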
We consider a collaborative approach in which MPC provides basic robustness and decision making over the relatively short term, while learning provides robustness, adaptation, and decision making over the long term. The agent gradually incorporates the learned value function in its decision making as experience increases. Initially, the uncertainty in the value estimates is high, so the agent just uses MPC. The samples generated by the MPC part are predictions about the behavior of the system and about what is optimal to do over the control horizon. Learning uses these samples as idealized experience, incorporating them in its value function. Over time the uncertainty in the value estimates decreases. When the uncertainty drops below a threshold, the agent uses the value estimates, thereby decreasing the control horizon over which MPC computes paths. Since the agent uses a learned value only when the uncertainty in it is below a threshold, values can be initialized to any finite value.

We propose the following algorithm:

(1) Roll the horizon to the current step k.
(2) For each path of Nc (x, a, r, x′) 4-tuples starting from the current state, consider each state. If the uncertainty in the value estimate of an encountered state is below a threshold, use that value plus the reward summed over the earlier steps in the path as an indication of the expected accumulated future reward, and stop considering the path. Otherwise, add the given reward to the reward summed over the earlier steps in the path and move to the next state.
(3) Incorporate the (x, a, r, x′) samples created by MPC in the value function as experience using TD learning, and reduce the uncertainty in the value estimates.
(4) Implement the first action of the sequence determined and move to the next decision step.

The described algorithm, sketched in code below, has some attractive features. Once the value function has been computed with sufficient accuracy, the computationally intensive MPC optimizations over the full control horizon using the system model and the performance function reduce to a one-step optimization using the system model and the value function. Moreover, by using the experience, decisions are based on an infinite horizon, since the values of states represent the expected accumulated reward over the full future. Constraint violations are thus anticipated better.
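As a rough illustration of steps (1)–(4), the Python sketch below performs one decision step. It reuses the TDLambdaLearner sketched earlier; mpc_paths (returning candidate paths of Nc 4-tuples from the MPC optimization), uncertainty (a per-state uncertainty measure), and threshold are hypothetical helpers, so this is a sketch under those assumptions rather than the authors' implementation.

def evaluate_path(path, learner, uncertainty, threshold):
    # Step (2): walk along one path of (x, a, r, x') 4-tuples, summing rewards
    # until a state with a sufficiently certain learned value is encountered.
    total = 0.0
    for (x, a, r, x_next) in path:
        if uncertainty(x) < threshold:
            # Trusted learned value: use it plus the reward summed so far as the
            # indication of the expected accumulated future reward for this path.
            return total + learner.value[x]
        total += r
    return total

def decision_step(x_k, learner, mpc_paths, uncertainty, threshold, Nc):
    # Step (1): roll the horizon to the current step by planning from x_k.
    paths = mpc_paths(x_k, Nc)
    # Step (2): score every candidate path, truncating at trusted values.
    best = max(paths, key=lambda p: evaluate_path(p, learner, uncertainty, threshold))
    # Step (3): incorporate the MPC-generated samples as idealized experience;
    # uncertainty bookkeeping (e.g., visit counts) would be updated here as well.
    for (x, a, r, x_next) in best:
        learner.update(x, r, x_next)
    # Step (4): implement the first action of the determined sequence.
    return best[0][1]

Here the maximization over a finite set of candidate paths stands in for the MPC optimization over action sequences; how those candidates are generated and how the uncertainty is tracked depend on the chosen MPC and exploration scheme.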
CONCLUSIONS

In this paper we have considered Model Predictive Control (MPC) for Markov decision processes. We have first considered a straightforward MPC algorithm for these kinds of models. To deal with high computational requirements and sub-optimality issues, we have proposed the use of the performance-to-go, or value function. With optimal value functions the MPC control horizon becomes of length one. Speed is increased, while decisions are based on infinite-horizon predictions.

In general, however, optimal value functions are not known a priori. In this paper we have considered using experience to incrementally learn value functions over time. With reinforcement-learning methods like temporal-difference learning the agent incorporates experience built up through interaction with the system. It can over time obtain a good estimate of the value function. Once sufficient experience has been obtained, the agent uses it to the fullest, requiring fewer computations than the non-learning approach.

An additional advantage of the proposed approach is that the agent adapts to changing system and performance characteristics. The performance function or the system under control may slowly change over time. Since the agent incorporates newly gained experience at each decision step, it will adapt to these changes and still produce adequate actions.

We note that in this paper we have considered TD(λ) learning for finite Markov decision processes. To deal with high-dimensional continuous action and state spaces we can use actor-critic methods (Sutton and Barto, 1998). Moreover, in this paper we have silently assumed an explicit, tabular value-function representation. If an explicit representation is not available, we may use an implicit representation, e.g., a function approximator (Sutton and Barto, 1998). MPC may then still be combined fruitfully with learning.

Future research directions include considering alternative ways to take the uncertainty in the gained experience into account in the decision making. Also, accuracy bounds can be derived and comparisons with alternative adaptive and learning control approaches can be made. Furthermore, experiments need to be carried out to further investigate and show the potential of the proposed learning-based MPC for Markov decision processes.

ACKNOWLEDGMENTS

This research was supported by the project “Multi-agent control of large-scale hybrid systems” (DWV.6188) of the Dutch Technology Foundation STW, Applied Science division of NWO, and the Technology Programme of the Dutch Ministry of Economic Affairs, by the TU Delft spearhead program “Transport Research Centre Delft: Towards Reliable Mobility”, and by the European 6th Framework Network of Excellence “HYbrid CONtrol: Taming Heterogeneity and Complexity of Networked Embedded Systems (HYCON)”.

REFERENCES

Barto, A. G., R. S. Sutton and C. W. Anderson (1983). Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics 13, 834–846.
Bellman, R. (1957). Dynamic Programming. Princeton University Press. Princeton, New Jersey.
Bemporad, A., M. Morari, V. Dua and E. N. Pistikopoulos (2002). The explicit linear quadratic regulator for constrained systems. Automatica 38(1), 3–20.
Camacho, E. F. and C. Bordons (1995). Model Predictive Control in the Process Industry. Springer-Verlag. Berlin, Germany.
Jadbabaie, A., J. Yu and J. Hauser (1999). Stabilizing receding horizon control of nonlinear systems: a control Lyapunov function approach. In: Proceedings of the 1999 American Control Conference. San Diego, California. pp. 1535–1539.
Kaelbling, L. P., M. L. Littman and A. W. Moore (1996). Reinforcement learning: A survey. Journal of Artificial Intelligence Research 4, 237–285.
Kearns, M. and S. Singh (2000). Bias-variance error bounds for temporal difference updates. In: Proceedings of the Thirteenth Annual Conference on Computational Learning Theory. Stanford, California. pp. 142–147.
Maciejowski, J. M. (2002). Predictive Control with Constraints. Prentice Hall. Harlow, England.
Morari, M. and J. H. Lee (1999). Model predictive control: past, present and future. Computers and Chemical Engineering 23, 667–682.
Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc. New York.
Sutton, R. and A. Barto (1998). An Introduction to Reinforcement Learning. MIT Press. Cambridge, Massachusetts.
Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning 3, 9–44.
Wiering, M. (2000). Multi-agent reinforcement learning for traffic light control. In: Proceedings of the Seventeenth International Conference on Machine Learning. Stanford, California. pp. 1151–1158.
Wiering, M. A. (1999). Explorations in Efficient Reinforcement Learning. PhD thesis. University of Amsterdam. The Netherlands.