OCDM2223 Tutorial7solved
Tutorial 7
Problem 1: LQ Control
⇒ K(k) = − P(k+1) / (r + P(k+1)),    P(k) = ( r + (r+1) P(k+1) ) / ( r + P(k+1) ),    P(N) = S
With r = 1 and terminal condition P(N) = P(4) = S, the backward recursion gives:

k    |  4  |       3      |        2       |        1       |         0
P(k) |  S  | (1+2S)/(1+S) | (3+5S)/(2+3S)  | (8+13S)/(5+8S) | (21+34S)/(13+21S)
K(k) |     | −S/(1+S)     | −(1+2S)/(2+3S) | −(3+5S)/(5+8S) | −(8+13S)/(13+21S)
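As a numerical cross-check (my own addition, not part of the original solution), the backward recursion can be run directly in Python; the function name riccati_backward and the defaults r = 1, N = 4 are assumptions matching the table above.

```python
# Backward Riccati recursion for the scalar system x(k+1) = x(k) + u(k)
# with stage cost 0.5*(x(k)^2 + r*u(k)^2) and terminal cost 0.5*S*x(N)^2.
def riccati_backward(S, r=1.0, N=4):
    P = [0.0] * (N + 1)
    K = [0.0] * N
    P[N] = S                                              # terminal condition
    for k in range(N - 1, -1, -1):
        K[k] = -P[k + 1] / (r + P[k + 1])                 # feedback gain K(k)
        P[k] = (r + (r + 1) * P[k + 1]) / (r + P[k + 1])  # cost-to-go kernel P(k)
    return P, K

P, K = riccati_backward(S=0.0)
print(P)   # [21/13, 8/5, 3/2, 1, 0] -> [1.6154, 1.6, 1.5, 1.0, 0.0]
print(K)   # [-8/13, -3/5, -1/2, 0]  -> [-0.6154, -0.6, -0.5, 0.0]
```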
c) Compare and discuss the resulting trajectories of x(k) for S = 0 and S → ∞ for an initial value of
x(0) = 2.
For S = 0:
u(0) = K(0)x(0) = −(8/13)x(0),    u(1) = K(1)x(1) = −(3/5)x(1),
u(2) = K(2)x(2) = −(1/2)x(2),    u(3) = K(3)x(3) = 0
⇒ x(0) = 2,    x(1) = x(0) + u(0) = (5/13)x(0) = 0.769,
x(2) = x(1) + u(1) = 0.308,    x(3) = x(2) + u(2) = 0.154,    x(4) = x(3) + u(3) = 0.154
For S → ∞:
u(0) = K(0)x(0) = −(13/21)x(0),    u(1) = K(1)x(1) = −(5/8)x(1),
u(2) = K(2)x(2) = −(2/3)x(2),    u(3) = K(3)x(3) = −x(3)
⇒ x(0) = 2,    x(1) = x(0) + u(0) = (8/21)x(0) = 0.762,
x(2) = x(1) + u(1) = 0.286,    x(3) = x(2) + u(2) = 0.095,    x(4) = x(3) + u(3) = 0
The corresponding optimal costs are
S → ∞:  J* = (1/2) x(0)² P(0) = 2 × 34/21 = 3.238
S = 0:   J* = (1/2) x(0)² P(0) = 2 × 21/13 = 3.231
With S → ∞ the terminal state is driven exactly to zero, x(4) = 0, at the price of a slightly larger cost; with S = 0 the terminal state is not penalized, so u(3) = 0 and the residual state x(4) = 0.154 remains.
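The trajectories and costs above can be double-checked with a short, self-contained simulation (again my own sketch; S = 1e9 is used only as a numerical stand-in for S → ∞):

```python
# Closed-loop simulation for the scalar LQ problem (A = B = 1, state weight 1,
# input weight r = 1, N = 4, x(0) = 2); S is the terminal weight.
def simulate(S, r=1.0, N=4, x0=2.0):
    # backward Riccati recursion (same as in the table above)
    P = [0.0] * (N + 1)
    K = [0.0] * N
    P[N] = S
    for k in range(N - 1, -1, -1):
        K[k] = -P[k + 1] / (r + P[k + 1])
        P[k] = (r + (r + 1) * P[k + 1]) / (r + P[k + 1])

    x, J, traj = x0, 0.0, [x0]
    for k in range(N):
        u = K[k] * x
        J += 0.5 * (x**2 + r * u**2)        # stage cost
        x = x + u                            # dynamics x(k+1) = x(k) + u(k)
        traj.append(x)
    J += 0.5 * S * x**2                      # terminal cost
    return traj, J, 0.5 * x0**2 * P[0]       # trajectory, simulated cost, predicted cost

print(simulate(S=0.0))    # x(4) = 0.154, J = J* = 3.231
print(simulate(S=1e9))    # x(4) ~ 0,     J = J* ~ 3.238  (stand-in for S -> infinity)
```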
For the stationary (infinite-horizon) solution we set P(k) = P(k+1) = P∞ in the Riccati recursion:
P∞ = P∞ + 1 − P∞²/(r + P∞)   ⇒   P∞² − P∞ − r = 0   ⇒   P∞ = 1/2 + √(1/4 + r)
K∞ = −P∞/(r + P∞) = −(1 + √(1+4r)) / (2r + 1 + √(1+4r))
x(k+1) = x(k) + u(k) = (1 + K∞) x(k) = [ 2r / (2r + 1 + √(1+4r)) ] x(k)
e) Discuss the closed loop behavior for r = 0 and r → ∞ and interpret the cost function.
r = 0: dead-beat control: K∞ = −1, so the equilibrium x_eq = 0 is reached in one time step; the control effort is not penalized at all in the cost function.
r → ∞: K∞ = 0, so the closed-loop behavior equals the open-loop behavior x(k+1) = x(k), which is only critically (marginally) stable; the control effort is penalized infinitely heavily, so no control action is applied.
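A few lines of Python (my own sketch; the sample values of r are chosen only to illustrate the two limits) confirm this behaviour numerically:

```python
import math

# Stationary solution for the scalar problem (A = B = 1, state weight 1, input weight r):
# P_inf = 1/2 + sqrt(1/4 + r), K_inf = -P_inf/(r + P_inf),
# closed-loop pole 1 + K_inf = 2r/(2r + 1 + sqrt(1 + 4r)).
def stationary_lq(r):
    P = 0.5 + math.sqrt(0.25 + r)
    K = -P / (r + P)
    return P, K, 1.0 + K

for r in (1e-6, 1.0, 1e6):
    P, K, pole = stationary_lq(r)
    print(f"r = {r:g}: P_inf = {P:.4f}, K_inf = {K:.4f}, closed-loop pole = {pole:.6f}")
# r -> 0:   K_inf -> -1, pole -> 0  (dead-beat)
# r -> inf: K_inf -> 0,  pole -> 1  (open loop, marginally stable)
```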
[Adapted from F. L. Lewis, D. Vrabie, V. L. Syrmos, Optimal Control, 2012, Ex. 11.2-1 and 11.3-4]
Problem 2: LQR, Policy Iteration and Value Iteration
In this exercise we consider the discrete-time linear quadratic regulator problem in light of the Markov
Decision Process (MDP) theory. Consider a deterministic MDP with infinite and continuous state space
X = Rn and action space U = Rm and state transition equation
x(k + 1) = Ax(k) + Bu(k), (1)
where k is the discrete time index. For a fixed stabilizing stationary policy π that is defined by the control
law u(i) = µ(x(i)), i = k, . . . , ∞, and for positive definite matrices Q, R > 0, the associated value
function is
V^π(x(k)) = (1/2) Σ_{i=k}^{∞} [ x(i)^⊤ Q x(i) + u(i)^⊤ R u(i) ],   (2)
only dependent on the initial state x(k). The infinite sum (2) can be written as a difference equation,
yielding
V^π(x(k)) = (1/2) [ x(k)^⊤ Q x(k) + u(k)^⊤ R u(k) ] + V^π(x(k+1)).   (3)
We assume that the value function is quadratic in the state, V^π(x(k)) = (1/2) x(k)^⊤ P x(k), for some kernel
matrix P. Then, (3) boils down to:
(1/2) x(k)^⊤ P x(k) = (1/2) [ x(k)^⊤ Q x(k) + u(k)^⊤ R u(k) ] + (1/2) x(k+1)^⊤ P x(k+1).   (4)
Substituting the system dynamics (1), equation (4) is further simplified:
(1/2) x^⊤ P x = (1/2) [ x^⊤ Q x + u^⊤ R u + x^⊤ A^⊤ P A x + 2 x^⊤ A^⊤ P B u + u^⊤ B^⊤ P B u ],   (5)
where, for readability, we used the notation x = x(k) and u = u(k).
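As a quick numerical sanity check of (4)-(5) (my own addition, with an arbitrarily chosen two-dimensional example), one can compute the optimal policy and its quadratic value function via scipy.linalg.solve_discrete_are and verify that both sides of (5) coincide:

```python
import numpy as np
from scipy.linalg import solve_discrete_are

# Arbitrary 2-D example (the matrices are illustrative only)
A = np.array([[1.0, 0.1],
              [0.0, 1.0]])
B = np.array([[0.0],
              [0.1]])
Q = np.eye(2)
R = np.array([[1.0]])

P = solve_discrete_are(A, B, Q, R)                  # kernel of the optimal value function
K = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)  # optimal gain, u = K x

x = np.array([[2.0], [-1.0]])                       # some state x(k)
u = K @ x
x_next = A @ x + B @ u
lhs = 0.5 * x.T @ P @ x
rhs = 0.5 * (x.T @ Q @ x + u.T @ R @ u) + 0.5 * x_next.T @ P @ x_next
print(lhs.item(), rhs.item())                       # the two sides agree
```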
Consider the Policy Iteration algorithm. We assume an initial policy u = K^(0) x and then we perform the
Policy Evaluation step, that is, we calculate the value function for the initial policy. Precisely, substituting
the initial policy in (5), the goal is to compute the matrix P^(0) that solves the Lyapunov equation:
P^(0) = Q + (K^(0))^⊤ R K^(0) + A^⊤ P^(0) A + 2 A^⊤ P^(0) B K^(0) + (K^(0))^⊤ B^⊤ P^(0) B K^(0).   (6)
Having found the solution P^(0), we perform the Policy Improvement step based on the value function just
obtained, that is, we determine the next policy K^(1) as
K^(1) x = argmin_{u = Kx} (1/2) [ x^⊤ Q x + u^⊤ R u + x^⊤ A^⊤ P^(0) A x + 2 x^⊤ A^⊤ P^(0) B u + u^⊤ B^⊤ P^(0) B u ],
which, as shown during the lectures, yields the closed-form solution for the policy improvement step:
K^(1) = −( R + B^⊤ P^(0) B )^{−1} B^⊤ P^(0) A.
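For matrix-valued systems, one Policy Iteration step can be coded directly. In the sketch below (my own illustration, not part of the tutorial), (6), read as an equation for the symmetric kernel P^(0), is solved as the discrete Lyapunov equation of the closed loop A + BK^(0) via scipy.linalg.solve_discrete_lyapunov:

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def policy_iteration_step(A, B, Q, R, K):
    # Policy Evaluation: (A + BK)^T P (A + BK) - P + Q + K^T R K = 0
    A_cl = A + B @ K
    P = solve_discrete_lyapunov(A_cl.T, Q + K.T @ R @ K)
    # Policy Improvement: K_next = -(R + B^T P B)^{-1} B^T P A
    K_next = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    return P, K_next

# Scalar example from this tutorial: A = B = Q = R = 1, K^(0) = -0.1
A = B = Q = R = np.array([[1.0]])
K = np.array([[-0.1]])
for _ in range(5):
    P, K = policy_iteration_step(A, B, Q, R, K)
print(K, P)   # approaches K = -0.6180, P = 1.618
```

In this linear-quadratic setting the evaluation step is an n × n Lyapunov equation, so the cost grows polynomially with the state dimension; with nonlinear dynamics the quadratic value-function ansatz, and hence this closed form, no longer applies.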
a) What is the advantage of assuming that the value function is in the form V^π(x(k)) = (1/2) x(k)^⊤ P x(k)?
Fixing a “structure” for the value function V^π(x(k)) reduces the problem from an infinite-dimensional search
(a value for every state of the continuous state space) to finding the finite number of entries of the matrix P.
b) Considering the system dynamics and cost function for the previous problem, with r = 1, determine the
solution of the Lyapunov equation (6). Would this approach scale easily with a high-dimensional state?
And with nonlinear dynamics? Perform a few iterations of the Policy Iteration algorithm starting with the
stabilizing initial policy K^(0) = −0.1 and with the non-stabilizing initial policy K^(0) = 1. How many steps
are required for convergence?
With A = 1, B = 1, Q = 1, R = 1, we obtain P^(0) ∈ R as
P^(0) = 1 + (K^(0))² + P^(0) + 2 K^(0) P^(0) + (K^(0))² P^(0)   ⇒   P^(0) = − (1 + (K^(0))²) / (2 K^(0) + (K^(0))²).
With K^(0) = −0.1, the policy converges at the fifth iteration to K = −0.6180 (P = 1.618); with K^(0) = 1
the algorithm converges to the wrong value K = 1.6180 (P = −0.6180), since Policy Iteration requires a
stabilizing initial policy (the Lyapunov equation then returns a P that is not a valid value function, as its
negative sign indicates).
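These iterations are easy to reproduce (a sketch for the scalar case A = B = Q = R = 1; the function name is my own):

```python
# Policy Iteration for the scalar problem A = B = Q = R = 1.
def policy_iteration(K0, n_iter=8):
    K = K0
    for j in range(1, n_iter + 1):
        P = -(1 + K**2) / (2 * K + K**2)   # Policy Evaluation (scalar Lyapunov eq.)
        K = -P / (1 + P)                   # Policy Improvement
        print(f"j = {j}: P = {P:+.4f} -> K = {K:+.4f}")
    return K, P

policy_iteration(-0.1)   # converges to K = -0.6180 (P = +1.618)
policy_iteration(1.0)    # converges to the wrong fixed point K = +1.6180 (P = -0.618)
```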
Consider the Value Iteration algorithm. Assuming an initial policy u = K^(0) x and initializing the value
function to zero, P^(0) = 0, we perform the Value Update by simply evaluating the right-hand side of the
Lyapunov equation (6), obtaining
P^(1) = Q + (K^(0))^⊤ R K^(0) + A^⊤ P^(0) A + 2 A^⊤ P^(0) B K^(0) + (K^(0))^⊤ B^⊤ P^(0) B K^(0).   (7)
Then, also in this case, the policy improvement is based on the obtained value function and eventually we have:
K^(1) = −( R + B^⊤ P^(1) B )^{−1} B^⊤ P^(1) A.
c) Considering the system dynamics and cost function for the previous problem, with r = 1, perform a few
iterations of the Value Iteration algorithm starting with the stabilizing initial policy K^(0) = −0.1 and with
the non-stabilizing initial policy K^(0) = 1. How many steps are required for convergence? What happens
if the initial guess of the policy is K^(0) = 100? And which algorithm yields faster convergence, Policy
Iteration or Value Iteration?
With K^(0) = −0.1, the policy converges at the sixth iteration to K = −0.6180 (P = 1.618); with K^(0) = 1
the algorithm converges in nine iterations, and with K^(0) = 100 also in nine iterations: unlike Policy
Iteration, Value Iteration does not require a stabilizing initial policy. As Figures 1 and 2 show, Policy
Iteration needs fewer iterations than Value Iteration, but each of its iterations requires solving a Lyapunov
equation exactly, whereas Value Iteration only performs a one-step update.
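The Value Iteration experiment can be sketched analogously (again for the scalar case A = B = Q = R = 1, with P^(0) = 0):

```python
# Value Iteration for the scalar problem A = B = Q = R = 1, starting from P = 0.
def value_iteration(K0, n_iter=12):
    K, P = K0, 0.0
    for j in range(1, n_iter + 1):
        P = 1 + K**2 + P * (1 + K)**2      # Value Update: right-hand side of (6)/(7)
        K = -P / (1 + P)                   # Policy Improvement
        print(f"j = {j}: P = {P:+.4f} -> K = {K:+.4f}")
    return K, P

value_iteration(-0.1)    # K -> -0.6180 in about six iterations
value_iteration(1.0)     # converges as well: no stabilizing initial policy is needed
value_iteration(100.0)   # also converges in about nine iterations
```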
Figure 1: Comparison between Policy Iteration and Value Iteration, for K^(0) = −0.1. Controller gain K^(j)
(left) and matrix P^(j) of the value function (right) for the first 10 iterations j = 1, . . . , 10.
Figure 2: Comparison between Policy Iteration and Value Iteration, for K^(0) = 1. Controller gain K^(j)
(left) and matrix P^(j) of the value function (right) for the first 10 iterations j = 1, . . . , 10.
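For reference, the two comparison figures can be regenerated with a short script (a sketch; figure layout and styling are my own choices):

```python
import matplotlib.pyplot as plt

# Policy Iteration vs. Value Iteration for the scalar problem A = B = Q = R = 1.
def run(method, K0, n_iter=10):
    K, P = K0, 0.0
    Ks, Ps = [], []
    for _ in range(n_iter):
        if method == "PI":
            P = -(1 + K**2) / (2 * K + K**2)   # exact Policy Evaluation
        else:
            P = 1 + K**2 + P * (1 + K)**2      # one-step Value Update
        K = -P / (1 + P)                       # Policy Improvement
        Ks.append(K)
        Ps.append(P)
    return Ks, Ps

for K0 in (-0.1, 1.0):                         # reproduces Figure 1 and Figure 2
    fig, (ax_K, ax_P) = plt.subplots(1, 2, figsize=(9, 3))
    for method in ("PI", "VI"):
        Ks, Ps = run(method, K0)
        ax_K.plot(range(1, 11), Ks, marker="o", label=method)
        ax_P.plot(range(1, 11), Ps, marker="o", label=method)
    ax_K.set_xlabel("iteration j")
    ax_K.set_ylabel("controller gain K")
    ax_P.set_xlabel("iteration j")
    ax_P.set_ylabel("value-function kernel P")
    ax_K.legend()
    fig.suptitle(f"Initial policy K(0) = {K0}")
plt.show()
```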