
Lehrstuhl für Steuerungs- und Regelungstechnik, TUM
Optimal Control and Decision Making
Univ.-Prof. Dr.-Ing./Univ. Tokio habil. Martin Buss
Dr.-Ing. Marion Leibold
Tutorial 7, WS 22/23

Tutorial 7

Problem 1: LQ Control

[M. Papageorgiou, M. Leibold, M. Buss: Optimierung, Springer 2015, Example 13.2]


For a linear discrete-time plant

x(k + 1) = x(k) + u(k), x(0) = x0

an optimal feedback control u(k) = K(k)x(k) that minimizes the cost


J = (1/2) S x(N)² + (1/2) Σ_{k=0}^{N−1} ( x(k)² + r u(k)² ),   S ≥ 0, r ≥ 0,

has to be designed in the framework of LQ control.


a) Find the recursion for the Riccati matrix P(k) and the controller K(k).
With A = 1, B = 1, Q = 1, R = r, Pf = S:

K(k) = −(B(k)^T P(k+1) B(k) + R(k))^{−1} B(k)^T P(k+1) A(k)

P(k) = A(k)^T P(k+1) A(k) + Q(k) + K(k)^T B(k)^T P(k+1) A(k),   P(N) = Pf

⇒ K(k) = −P(k+1) / (r + P(k+1)),   P(k) = (r + (r+1) P(k+1)) / (r + P(k+1)),   P(N) = S
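As a quick numerical check, this scalar recursion can be run backwards from P(N) = S. A minimal Python sketch (not part of the original tutorial; function and variable names are chosen for illustration):

```python
def riccati_recursion(N, r, S):
    """Backward recursion for the scalar plant x(k+1) = x(k) + u(k)
    with A = B = Q = 1, R = r and terminal weight P(N) = S."""
    P = [0.0] * (N + 1)
    K = [0.0] * N
    P[N] = S
    for k in range(N - 1, -1, -1):
        K[k] = -P[k + 1] / (r + P[k + 1])                 # feedback gain K(k)
        P[k] = (r + (r + 1) * P[k + 1]) / (r + P[k + 1])  # Riccati update P(k)
    return P, K

print(riccati_recursion(4, 1.0, 0.0))  # reproduces the part b) table evaluated at S = 0
```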

b) Assume N = 4, r = 1. Give P(0), P(1), ..., P(4) and K(0), K(1), ..., K(3).

k    |  4  |       3        |        2         |        1         |         0
P(k) |  S  |  (1+2S)/(1+S)  |  (3+5S)/(2+3S)   |  (8+13S)/(5+8S)  |  (21+34S)/(13+21S)
K(k) |  -  |  −S/(1+S)      |  −(1+2S)/(2+3S)  |  −(3+5S)/(5+8S)  |  −(8+13S)/(13+21S)
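Assuming SymPy is available, the symbolic entries of this table can be reproduced directly from the recursion of part a); a short illustrative sketch:

```python
import sympy as sp

S = sp.symbols('S', nonnegative=True)
P = S                                     # P(4) = S
for k in range(3, -1, -1):                # k = 3, 2, 1, 0
    K = sp.cancel(-P / (1 + P))           # K(k) = -P(k+1)/(1 + P(k+1)), r = 1
    P = sp.cancel((1 + 2 * P) / (1 + P))  # P(k) = (1 + 2 P(k+1))/(1 + P(k+1)), r = 1
    print(k, K, P)  # k = 0 gives P = (34*S + 21)/(21*S + 13), matching the table
```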

c) Compare and discuss the resulting trajectories x(k) for S = 0 and S → ∞ for the initial value x(0) = 2.
For S = 0:

u(0) = K(0)x(0) = −(8/13) x(0),   u(1) = K(1)x(1) = −(3/5) x(1),
u(2) = K(2)x(2) = −(1/2) x(2),   u(3) = K(3)x(3) = 0

⇒ x(0) = 2,   x(1) = x(0) + u(0) = (5/13) x(0) = 0.769,
x(2) = x(1) + u(1) = 0.308,   x(3) = x(2) + u(2) = 0.154,   x(4) = x(3) + u(3) = 0.154

For S → ∞:

u(0) = K(0)x(0) = −(13/21) x(0),   u(1) = K(1)x(1) = −(5/8) x(1),
u(2) = K(2)x(2) = −(2/3) x(2),   u(3) = K(3)x(3) = −x(3)

⇒ x(0) = 2,   x(1) = x(0) + u(0) = (8/21) x(0) = 0.762,
x(2) = x(1) + u(1) = 0.286,   x(3) = x(2) + u(2) = 0.095,   x(4) = x(3) + u(3) = 0

Discussion: for S = 0 the terminal state is not penalized, so K(3) = 0, no control is applied at the last step and x(4) = x(3) ≠ 0. For S → ∞ the terminal state is penalized infinitely, the last gain becomes K(3) = −1 and the state is driven exactly to x(4) = 0, at the price of slightly larger control effort.
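The two trajectories can be checked with a small forward simulation of x(k+1) = (1 + K(k)) x(k), using the gains from part b); a sketch, not from the original tutorial:

```python
def simulate(gains, x0):
    """Roll out x(k+1) = x(k) + u(k) = (1 + K(k)) x(k) for the scalar plant."""
    x = [x0]
    for K in gains:
        x.append((1 + K) * x[-1])
    return x

K_S0   = [-8/13, -3/5, -1/2, 0.0]    # gains for S = 0
K_Sinf = [-13/21, -5/8, -2/3, -1.0]  # gains for S -> infinity
print(simulate(K_S0, 2.0))    # ≈ [2, 0.769, 0.308, 0.154, 0.154]
print(simulate(K_Sinf, 2.0))  # ≈ [2, 0.762, 0.286, 0.095, 0.0]
```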

d) What is the cost J ∗ for these realizations?

J* = (1/2) x(0)² P(0). With x(0) = 2:

S → ∞:  J* = 2 · 34/21 = 3.238
S = 0:  J* = 2 · 21/13 = 3.231

Problem 2: LQ Control with Infinite Horizon

[M. Papageorgiou, M. Leibold, M. Buss: Optimierung, Springer 2015, Example 13.3]


Consider the plant from problem 1 with cost function

J = (1/2) Σ_{k=0}^{∞} ( x(k)² + r u(k)² ),   r ≥ 0.

a) What is the stationary Riccati matrix P∞?


With A = 1, B = 1, Q = 1, R = r:
K = −(B^T P∞ B + R)^{−1} B^T P∞ A
P∞ = A^T P∞ A + Q + K^T B^T P∞ A

⇒ P∞ = P∞ + 1 − P∞² / (r + P∞)  ⇒  P∞² − P∞ − r = 0  ⇒  P∞ = 1/2 + √(1/4 + r)

b) Give an equation for K∞ of the optimal feedback control u(k) = K∞ x(k).


K∞ = −P∞ / (r + P∞) = −(1 + √(1+4r)) / (2r + 1 + √(1+4r))
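For a quick check, P∞ and K∞ can be computed from the closed form above and cross-checked by iterating the finite-horizon recursion of Problem 1 to its fixed point; a sketch with illustrative function names:

```python
import math

def lq_infinite_horizon(r):
    """Stationary solution for A = B = Q = 1, R = r."""
    P = 0.5 + math.sqrt(0.25 + r)  # positive root of P^2 - P - r = 0
    K = -P / (r + P)
    return P, K

def riccati_fixed_point(r, iters=200):
    """Cross-check: iterate P <- (r + (r+1) P) / (r + P) until it settles."""
    P = 1.0
    for _ in range(iters):
        P = (r + (r + 1) * P) / (r + P)
    return P

print(lq_infinite_horizon(1.0))   # (1.618..., -0.618...)
print(riccati_fixed_point(1.0))   # 1.618...
```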

c) Give the dynamics of the closed loop system.

x(k+1) = x(k) + u(k) = (1 + K∞) x(k) = ( 2r / (2r + 1 + √(1+4r)) ) x(k)

d) Discuss stability of the closed loop.


Since 0 ≤ 2r / (2r + 1 + √(1+4r)) < 1, i.e. the closed-loop pole has magnitude less than one, the equilibrium x_eq = 0 is asymptotically stable.

e) Discuss the closed loop behavior for r = 0 and r → ∞ and interpret the cost function.
r = 0: dead-beat control: the closed-loop pole is 0, so the equilibrium x_eq = 0 is reached in one time step; control effort is not penalized at all.
r → ∞: K∞ → 0, so the closed-loop behavior equals the open-loop behavior x(k+1) = x(k), which is only marginally (critically) stable; control effort is penalized so heavily that essentially no control is applied.
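This behavior can be made concrete by evaluating the closed-loop pole 1 + K∞ for a few values of r; an illustrative sketch:

```python
import math

def closed_loop_pole(r):
    """Closed-loop pole 1 + K_inf = 2r / (2r + 1 + sqrt(1 + 4r))."""
    return 2 * r / (2 * r + 1 + math.sqrt(1 + 4 * r))

for r in (0.0, 0.1, 1.0, 10.0, 1e4):
    print(r, closed_loop_pole(r))
# r = 0   -> pole 0          (dead-beat: the state reaches 0 in one step)
# r large -> pole close to 1 (hardly any control, near open-loop behavior)
```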

Problem 3: LQ as RL with model

[Adapted from F. L. Lewis, D. Vrabie, V. L. Syrmos, Optimal Control, 2012, ex 11.2-1 and 11.3-4]
In this exercise we consider the discrete-time linear quadratic regulator problem in light of the Markov
Decision Process (MDP) theory. Consider a deterministic MDP with infinite and continuous state space
X = Rn and action space U = Rm and state transition equation
x(k + 1) = Ax(k) + Bu(k), (1)
where k is the discrete time index. For a fixed stabilizing stationary policy π that is defined by the control
law u(i) = µ(x(i)), i = k, . . . , ∞, and for positive definite matrices Q, R > 0, the associated value
function is

V^π(x(k)) = (1/2) Σ_{i=k}^{∞} ( x(i)^T Q x(i) + u(i)^T R u(i) ),   (2)
only dependent on the initial state x(k). The infinite sum (2) can be written as a difference equation,
yielding
V^π(x(k)) = (1/2) ( x(k)^T Q x(k) + u(k)^T R u(k) ) + V^π(x(k+1)).   (3)
We assume that the value function is quadratic in the state, V^π(x(k)) = (1/2) x(k)^T P x(k), for some kernel matrix P. Then, (3) boils down to:

(1/2) x(k)^T P x(k) = (1/2) ( x(k)^T Q x(k) + u(k)^T R u(k) ) + (1/2) x(k+1)^T P x(k+1).   (4)
Substituting the system dynamics (1), equation (4) is further simplified:
(1/2) x^T P x = (1/2) ( x^T Q x + u^T R u + x^T A^T P A x + 2 x^T A^T P B u + u^T B^T P B u ),   (5)
where, for readability, we used the notation x = x(k) and u = u(k).
Consider the Policy Iteration algorithm. We assume an initial policy u = K (0) x and then we perform the
Policy Evaluation step, that is, we calculate the value function for the initial policy. Precisely, substituting
the initial policy in (5), the goal is to compute matrix P (0) solving the Lyapunov equation:
P^(0) = Q + (K^(0))^T R K^(0) + A^T P^(0) A + 2 A^T P^(0) B K^(0) + (K^(0))^T B^T P^(0) B K^(0).   (6)

Having found the solution P^(0), we perform the policy improvement based on the value function just obtained, that is, we determine the next policy K^(1) as

K^(1) x = argmin_{u=Kx} (1/2) ( x^T Q x + u^T R u + x^T A^T P^(0) A x + 2 x^T A^T P^(0) B u + u^T B^T P^(0) B u ),
which, as shown during the lectures, yields the closed-form solution for the policy improvement step:
K^(1) = −(R + B^T P^(0) B)^{−1} B^T P^(0) A.

a) What is the advantage of assuming that the value function is in the form V^π(x(k)) = (1/2) x(k)^T P x(k)?
By fixing a "structure" for the value function V^π(x(k)), the problem reduces from a search over infinitely many possible value functions (and policies u(x)) on the continuous state space to finding a finite number of entries of the matrix P.
b) Considering the system dynamics and cost function for the previous problem, with r = 1, determine the
solution of the Lyapunov equation (6). Would this approach scale easily with a high-dimensional state?
And with nonlinear dynamics? Perform a few iterations of the Policy Iteration algorithm starting with the
stabilizing initial policy K (0) = −0.1 and with the non-stabilizing initial policy K (0) = 1. How many steps
are required for convergence?
With A = 1, B = 1, Q = 1, R = 1, we obtain P^(0) ∈ R as

P^(0) = 1 + (K^(0))² + P^(0) + 2 K^(0) P^(0) + (K^(0))² P^(0)  ⇒  P^(0) = −(1 + (K^(0))²) / (2 K^(0) + (K^(0))²).

The Lyapunov equation (6) is linear in the entries of P^(0), so the approach scales reasonably to higher state dimensions; with nonlinear dynamics, however, the value function is in general no longer quadratic and this closed form is lost. With K^(0) = −0.1, the policy converges at the fifth iteration to K = −0.6180 (P = 1.618); with K^(0) = 1, the algorithm converges to the wrong, non-stabilizing value K = 1.6180 (P = −0.6180).
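A scalar Policy Iteration along these lines takes only a few lines of code; a sketch (not from the original tutorial) that reproduces the two cases above:

```python
def policy_iteration(K0, iters=10):
    """Scalar Policy Iteration for A = B = Q = R = 1."""
    K = K0
    for j in range(1, iters + 1):
        # policy evaluation: scalar version of the Lyapunov equation (6)
        P = -(1 + K**2) / (2 * K + K**2)
        # policy improvement
        K = -P / (1 + P)
        print(j, P, K)
    return K, P

policy_iteration(-0.1)  # settles at K = -0.6180, P = 1.618
policy_iteration(1.0)   # settles at the non-stabilizing pair K = 1.6180, P = -0.6180
```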
Consider the Value Iteration algorithm. Assuming initial policy u = K (0) x and assigning zero to the value
function, P (0) = 0, we perform the Value Update by just evaluating the right-hand side of the Lyapunov
equation (6), obtaining
P^(1) = Q + (K^(0))^T R K^(0) + A^T P^(0) A + 2 A^T P^(0) B K^(0) + (K^(0))^T B^T P^(0) B K^(0).   (7)

Then, also in this case, the policy improvement is based on the obtained value function and eventually we
have:

K^(1) = −(R + B^T P^(1) B)^{−1} B^T P^(1) A.

c) Considering the system dynamics and cost function for the previous problem, with r = 1, perform a few
iterations of the Value Iteration algorithm starting with the stabilizing initial policy K (0) = −0.1 and with
the non-stabilizing initial policy K^(0) = 1. How many steps are required for convergence? What happens
if the initial guess of the policy is K (0) = 100? And which algorithm yields faster convergence, Policy
Iteration or Value Iteration?
With K^(0) = −0.1, the policy converges at the sixth iteration to K = −0.6180 (P = 1.618); with K^(0) = 1 the algorithm converges in nine iterations, and with K^(0) = 100 also in nine iterations. Since the value update starts from P^(0) = 0, Value Iteration does not require a stabilizing initial policy, whereas Policy Iteration converges in slightly fewer iterations but, as seen in part b), needs a stabilizing initial policy to reach the correct solution (compare Figures 1 and 2).
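A matching sketch for the scalar Value Iteration (again illustrative, assuming the same scalar plant):

```python
def value_iteration(K0, iters=15):
    """Scalar Value Iteration for A = B = Q = R = 1, starting from P^(0) = 0."""
    P, K = 0.0, K0
    for j in range(1, iters + 1):
        # value update: one evaluation of the right-hand side of (7)
        P = 1 + K**2 + P * (1 + K)**2
        # policy improvement
        K = -P / (1 + P)
        print(j, P, K)
    return K, P

value_iteration(-0.1)   # K reaches -0.6180 after about six iterations
value_iteration(1.0)    # also converges; no stabilizing initial policy needed
value_iteration(100.0)  # very large first value update, still converges
```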

Figure 1: Comparison between Policy Iteration and Value Iteration, for K (0) = −0.1. Controller gain K (j)
(left) and matrix P (j) of the value function (right) for the first 10 iterations j = 1, . . . , 10.


Figure 2: Comparison between Policy Iteration and Value Iteration, for K (0) = 1. Controller gain K (j)
(left) and matrix P (j) of the value function (right) for the first 10 iterations j = 1, . . . , 10.
