Optimal Policies for Passive Learning Controllers
Printed in Great Britain. Pergamon Press plc
~) 1989 International Federation of Automatic Control
Brief Paper
Key Words--Dual control; dynamic programming; stochastic control; discrete time systems; adaptive control.
the conditional density of x at time k assuming model i is given by

p(x(k) | I^k, r = i) = G(x(k); x̂_i(k), P_i(k))

where G(·; ·, ·) denotes a Gaussian density function and x̂_i(k) and P_i(k) are the conditional mean and covariance matrix assuming model i, given by

x̂_i(k) = m_i(k) + K_i(k)[y(k) - C(i)m_i(k)]    (3)

P_i(k) = (I - K_i(k)C(i))M_i(k)    (4)

and the Kalman gain

K_i(k) = M_i(k)C(i)^T [C(i)M_i(k)C(i)^T + Q^w]^{-1}.    (5)

Here m_i(k) and M_i(k) are the one step ahead mean and covariance matrix assuming model i, and

p(x(k) | I^{k-1}, r = i) = G(x(k); m_i(k), M_i(k))

with m_i(k) and M_i(k) computed recursively by

m_i(k) = A(i)x̂_i(k-1) + B(i)u(k-1)    (6)

M_i(k) = A(i)P_i(k-1)A(i)^T + Q^v    (7)

and the conditional covariance matrix is given by

Cov(x(k) | I^k) = Σ_{i∈F} q_i(k)P_i(k) + Σ_{(i,j)∈S_F} q_i(k)q_j(k)(x̂_i(k) - x̂_j(k))(x̂_i(k) - x̂_j(k))^T.    (8')

The next theorem relates the minimization of an appropriate quadratic cost functional to the "DUL" passive controller defined by Despande et al. (1973). It is shown that this control law is the optimal closed loop strategy for the problem posed.

4. Results for model uncertainty only in the observation equation

Theorem 1. For the system

x(k+1) = Ax(k) + Bu(k) + v(k),   k = 0, ..., N-1

y(k) = C(r)x(k) + w(k),   k = 1, ..., N

where the vectors and matrices were defined before, the passive closed loop control

u(k) = -F(k)x̂(k),   k = 0, ..., N-1    (9a)
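The model-conditioned filter bank in (3)-(7) can be sketched numerically as follows. This is a minimal illustration, not the authors' code: the function name `model_filter_step` and all system matrices, dimensions, and numbers are invented for the example.

```python
import numpy as np

def model_filter_step(A, B, C, Qv, Qw, x_prev, P_prev, u_prev, y):
    """One step of the model-conditioned Kalman filter, eqs (3)-(7).

    A, B, C        : system matrices for one candidate model i
    Qv, Qw         : process / measurement noise covariances
    x_prev, P_prev : conditional mean / covariance at time k-1
    u_prev         : control applied at time k-1
    y              : measurement at time k
    """
    # One step ahead prediction, eqs (6)-(7)
    m = A @ x_prev + B @ u_prev
    M = A @ P_prev @ A.T + Qv
    # Kalman gain, eq (5)
    K = M @ C.T @ np.linalg.inv(C @ M @ C.T + Qw)
    # Measurement update, eqs (3)-(4)
    x = m + K @ (y - C @ m)
    P = (np.eye(len(x_prev)) - K @ C) @ M
    return x, P

# Hypothetical two-model example: same (A, B), different C(i)
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
C1 = np.array([[1.0, 0.0]])   # observation matrix C(1)
C2 = np.array([[0.0, 1.0]])   # observation matrix C(2)
Qv = 0.01 * np.eye(2)
Qw = np.array([[0.04]])

x0 = np.zeros(2); P0 = np.eye(2)
u = np.array([0.5]); y = np.array([0.3])
x1, P1 = model_filter_step(A, B, C1, Qv, Qw, x0, P0, u, y)
x2, P2 = model_filter_step(A, B, C2, Qv, Qw, x0, P0, u, y)
```

Running one such step per model i ∈ F, together with the posterior weights q_i(k), yields the mixture mean and the covariance decomposition (8').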
and the optimal cost is

J° = min_{{u(k)}} J = E{x_0^T S(0)x_0 + Σ_{i∈F} q_i(0) tr[Σ_{n=1}^{N} S(n)Q^v + S(0)P_i(0)]}.    (12)

Proof. The solution is obtained using dynamic programming. Let

V(k+1, I^k) = min_{u(k)} E{x(k)^T Q_k x(k) + u(k)^T R_k u(k) - (x(k) - x̂(k))^T P_k (x(k) - x̂(k)) + V(k+2, I^{k+1}) | I^k}

where I^k = {y(0), y(1), ..., y(k), u(0), ..., u(k-1)} is the information state of the system and V(·, ·) satisfies the boundary conditions

V(N, I^{N-1}) = min_{u(N-1)} E{x(N)^T S_N x(N) + x(N-1)^T Q_{N-1} x(N-1) + u(N-1)^T R_{N-1} u(N-1) - (x(N-1) - x̂(N-1))^T P_{N-1} (x(N-1) - x̂(N-1)) | I^{N-1}}.

Expanding the expectation at the last stage yields the terms

+ Σ_{i∈F} q_i(N-1)(tr Q_{N-1}P_i(N-1) + tr S_N M_i(N))
- Σ_{(i,j)∈S_F} q_i(N-1)q_j(N-1)(x̂_i(N-1) - x̂_j(N-1))^T P_{N-1}(x̂_i(N-1) - x̂_j(N-1))
- Σ_{i∈F} q_i(N-1) tr P_{N-1}P_i(N-1).

Now select P_{N-1} = (B^T S_N A)^T D_{N-1}(B^T S_N A); then using (7) and (11b)

M_i(N) = A(i)P_i(N-1)A(i)^T + Q^v

and

A(i)^T S_N A(i) = S(N-1) - Q_{N-1} + P_{N-1}

then

tr S_N M_i(N) = tr S_N (A(i)P_i(N-1)A(i)^T) + tr S_N Q^v
             = tr A(i)^T S_N A(i) P_i(N-1) + tr S_N Q^v
             = tr S(N-1)P_i(N-1) - tr Q_{N-1}P_i(N-1) + tr P_{N-1}P_i(N-1) + tr S_N Q^v

and this implies that

Σ_{i∈F} q_i(N-1) tr (Q_{N-1}P_i(N-1) + S_N M_i(N) - P_{N-1}P_i(N-1)) = Σ_{i∈F} q_i(N-1)(tr S(N-1)P_i(N-1) + tr S_N Q^v)
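The chain of trace identities above is easy to confirm numerically. The sketch below, with randomly generated matrices standing in for S_N, Q_{N-1}, R_{N-1}, Q^v, and P_i(N-1) (all invented for illustration), checks that tr S_N M_i(N) = tr S(N-1)P_i(N-1) - tr Q_{N-1}P_i(N-1) + tr P_{N-1}P_i(N-1) + tr S_N Q^v.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 3, 2

def rand_psd(k):
    """Random positive semidefinite matrix (illustrative data)."""
    X = rng.standard_normal((k, k))
    return X @ X.T

# Illustrative matrices with the structure assumed in the proof
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, m))
SN = rand_psd(n)                 # S_N >= 0
Q = rand_psd(n)                  # Q_{N-1} >= 0
R = rand_psd(m) + np.eye(m)      # R_{N-1} > 0
Qv = rand_psd(n)                 # process noise covariance Q^v
Pi = rand_psd(n)                 # P_i(N-1), a filter covariance

# D_{N-1} = (B^T S_N B + R)^{-1},  P_{N-1} = (B^T S_N A)^T D (B^T S_N A)
D = np.linalg.inv(B.T @ SN @ B + R)
PN1 = (B.T @ SN @ A).T @ D @ (B.T @ SN @ A)
S_prev = A.T @ SN @ A + Q - PN1          # Riccati step giving S(N-1)
Mi = A @ Pi @ A.T + Qv                   # M_i(N), eq (7)

lhs = np.trace(SN @ Mi)
rhs = (np.trace(S_prev @ Pi) - np.trace(Q @ Pi)
       + np.trace(PN1 @ Pi) + np.trace(SN @ Qv))
print(abs(lhs - rhs) < 1e-8)  # True: the trace identity holds
```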
Theorem 2 (verification theorem). The cost (10) is a convex functional of {u(k)} and, as such, there is a unique minimizing policy {u(k)}*.

Proof. It is enough to show that S(k) is a positive semidefinite matrix for k = 0, ..., N-1:

S(k) = A^T S(k+1)A + Q_k - (B^T S(k+1)A)^T (B^T S(k+1)B + R_k)^{-1} (B^T S(k+1)A)

with

S(N-1) = A^T S_N A + Q_{N-1} - (B^T S_N A)^T (B^T S_N B + R_{N-1})^{-1} (B^T S_N A)

with S_N a positive semidefinite symmetric matrix, {Q_k}_{k=0,...,N-1} a sequence of positive semidefinite symmetric matrices and {R_k}_{k=0,...,N-1} a sequence of positive definite symmetric matrices. Write

S(k) = [A - BF(k)]^T S(k+1)[A - BF(k)] + F(k)^T R_k F(k) + Q_k

with

F(k) = (B^T S(k+1)B + R_k)^{-1} B^T S(k+1)A.

Given S(k+1) positive semidefinite it follows that S(k) is positive semidefinite, k = N-2, ..., 0. But

S(N-1) = [A - BF(N-1)]^T S_N [A - BF(N-1)] + F(N-1)^T R_{N-1} F(N-1) + Q_{N-1} ≥ 0

as required.

Now, after a tedious calculation the cost-to-go at stage N-k-1 can be written as

V(k+1, I^k) = min_{u(k)} { Σ_{i∈F} q_i(k) ||x̂_i(k)||²_{S(k)} + ||u(k) + Σ_{i∈F} q_i(k)F(k)x̂_i(k)||²_{(R_k + B^T S(k+1)B)} + Σ_{i∈F} q_i(k) tr [Σ_{n=k+1}^{N} S(n)Q^v + S(k)P_i(k)] }.

Thus S(k+1) ≥ 0 implies R_k + B^T S(k+1)B > 0, so that at every stage the cost-to-go is a strictly convex function of the control, and has a unique minimum. This proves the desired result.

Comments.
(1) The passive learning control introduced in Despande et al. (1973) is optimal with respect to an appropriate quadratic cost functional that includes a weighted version of the trace of the conditional covariance matrix.
(2) The dual property of the control can be defined in terms of the dependence of the conditional covariance on the control (Bar-Shalom and Tse, 1974). Including the conditional covariance weighted by the sequence {P_k} balances out the active learning which is present in the standard LQG problem with parameter uncertainty, so that the passive learning controller is optimal in this context.
(3) It follows that the LQG problem with parameter uncertainty poses an interesting dual control problem which inherently includes active learning, without any modification of the cost functional to include an identification cost formally in the problem.
(4) The feedback gain matrices (9) can be computed off-line. In fact, the optimal policy can be written as the weighted average of the individual optimal closed loop control policies corresponding to each parameter r ∈ F, where the a posteriori probabilities are treated as the weights. This is the DUL law of Despande et al. (1973). It should be noted that this result has been obtained for the case of model uncertainty in the observation equation and cannot be straightforwardly extended to the case of general model uncertainty in the system equation. For this last situation we next obtain a control cost for which an open loop feedback policy is optimal. It should also be noted that the computation of the open loop feedback control (which is needed to solve the problem when the system matrices A and B depend on the unknown parameter r) is much more involved than the computation of the closed loop control (which can be solved in closed form when the system matrices A and B do not depend on the unknown parameter r). This is an important difference in terms of the computational effort and the conceptual interpretation.

5. Results for model uncertainty in the plant and observation equations

5.1. Open loop feedback controller. Let φ(·) and f(·, ·, ·) be non-negative functions of their arguments. Let

J_N = φ(x(N)) + Σ_{k=0}^{N-1} f(x(k), u(k), k).

The open loop feedback control policy is obtained as follows. Let I^k = {y(0), ..., y(k), u(0), ..., u(k-1)} be the information state at time k. Note that q_i(0), q_i(1), ..., q_i(k), i ∈ F, are not included because they can be derived from I^k and the system equations.

Assume that only a priori information is available. The design of the open loop controller can be formulated as min_{{u(k)}_{k=0,...,N-1}} E{J_N | I^0} to obtain the sequence {u^0(0), u^0(1), ..., u^0(N-1)}. Select u^0(0) as the input to be applied at time zero and advance one step, collecting information to obtain I^1. Assume that no new information will be collected in the future and select the control law at time equal to one such that min_{{u(k)}_{k=1,...,N-1}} E{J_N | I^1} to obtain the sequence u^1(1), u^1(2), ..., u^1(N-1). Select u^1(1) as the input and advance one step, collecting information to obtain I^2. Repeating this procedure defines the open loop feedback policy {u(k)} = {u^k(k)}. The procedure can be formalized using dynamic programming as developed in Appendix B.

The next result shows that the open loop feedback control policy as previously defined is an optimal open loop feedback policy for an appropriately defined quadratic cost functional.

Theorem 3. For the system

x(k+1) = A(r)x(k) + B(r)u(k) + v(k),   k = 0, ..., N-1

y(k) = C(r)x(k) + w(k),   k = 1, ..., N

where the vectors and matrices are defined in Section 2, the open loop feedback control

u(k) = -F̂(k)(Σ_{r∈F} q_r(k)B(r)^T S_r(k+1)A(r)x̂_r(k)),   k = 0, ..., N-1    (13a)

F̂(k) = (Σ_{r∈F} q_r(k)(R_k + B(r)^T S_r(k+1)B(r)))^{-1},   k = 0, ..., N-1    (13b)

is the optimal open loop feedback control with respect to the following quadratic cost functional:

J = E{x(N)^T S_N x(N) + Σ_{k=0}^{N-1} (x(k)^T Q_k x(k) + u(k)^T R_k u(k))
    - Σ_{k=0}^{N-1} [(B(r)^T S_r(k+1)A(r)x(k))^T D(k)(B(r)^T S_r(k+1)A(r)x(k))
    - E(B(r)^T S_r(k+1)A(r)x(k) | I^k)^T D(k) E(B(r)^T S_r(k+1)A(r)x(k) | I^k)]}    (14)
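The one-step computation of the open loop feedback law (13a)-(13b) can be sketched directly. The helper name `olf_control` and the two-model numbers below are hypothetical, for illustration only; per-model cost-to-go matrices S_r(k+1) are taken as given.

```python
import numpy as np

def olf_control(q, A_list, B_list, S_next_list, xhat_list, R):
    """Open loop feedback control of Theorem 3, eqs (13a)-(13b).

    q           : posterior model probabilities q_r(k), r in F
    A_list, B_list : per-model system matrices A(r), B(r)
    S_next_list : per-model cost-to-go matrices S_r(k+1)
    xhat_list   : per-model conditional means xhat_r(k)
    R           : control weighting R_k
    """
    # F(k) = ( sum_r q_r(k) (R_k + B(r)^T S_r(k+1) B(r)) )^{-1}   (13b)
    G = sum(qr * (R + B.T @ S @ B)
            for qr, B, S in zip(q, B_list, S_next_list))
    F = np.linalg.inv(G)
    # u(k) = -F(k) sum_r q_r(k) B(r)^T S_r(k+1) A(r) xhat_r(k)    (13a)
    s = sum(qr * (B.T @ S @ A @ xh)
            for qr, A, B, S, xh in zip(q, A_list, B_list, S_next_list, xhat_list))
    return -F @ s

# Hypothetical two-model instance (all numbers illustrative)
A1 = np.array([[1.0, 0.1], [0.0, 1.0]]); A2 = np.array([[1.0, 0.2], [0.0, 0.9]])
B1 = np.array([[0.0], [0.1]]);           B2 = np.array([[0.0], [0.2]])
S1 = np.eye(2); S2 = 2.0 * np.eye(2)
x1 = np.array([1.0, 0.0]); x2 = np.array([0.8, 0.1])
u = olf_control([0.6, 0.4], [A1, A2], [B1, B2], [S1, S2], [x1, x2],
                np.array([[1.0]]))
```

Unlike the DUL law of Section 4, this control is not an average of per-model closed loop policies: the gain mixes all models inside a single inverse.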
where for r ∈ F, {S_r(k+1)} and {D(k)} are sequences of matrices satisfying:

(i)

S_r^l(k) = A(r)^T S_r^l(k+1)A(r) + Q_k - (B(r)^T S_r^l(k+1)A(r))^T D^l(k)(B(r)^T S_r^l(k+1)A(r))    (15a)

with

S_r^l(N) = S_N,   r ∈ F,   l = 0, ..., N-1.    (15c)

Remarks.
(1) The upper index l indicates that the information available to the controller is I^l. For all times k with l < k ≤ N-1 the matrices D^l(k) and S_r^l(k) are computed using q_i(l). Thus (15) are the equations associated with the solution of the Bellman equation corresponding to the design of u(l), equation (B6) in Appendix B.
(2) This is neither the certainty equivalent controller nor the DUL controller, because (13) is not the average of the control policies associated with a particular system model (A(i), B(i), C(i)) weighted by the conditional probabilities q_i(k). This controller has been defined in Casiello and Loparo (1985).
(3) The matrices {S_i(k)} cannot be computed off-line, since they depend on {q_i(k)}_{k=0,...,N-1}. They must be recomputed as new information is obtained (Casiello and Loparo, 1985).

Proof. At the last stage the cost contains the terms

- (B(r)^T S_r(N)A(r)x(N-1))^T D^0(N-1)(B(r)^T S_r(N)A(r)x(N-1))
+ E(B(r)^T S_r(N)A(r)x(N-1) | I^0)^T D^0(N-1) E(B(r)^T S_r(N)A(r)x(N-1) | I^0)

where E(B(r)^T S_r(N)A(r)x(N-1) | I^0) = Σ_{i∈F} q_i(0)B(i)^T S_N A(i)x̂_i^0(N-1), with x̂_i^0(N-1) and P_i^0(N-1) given by (6') and (7'). Adding these terms and using the algebraic fact given in Appendix A, they become

- Σ_{(i,j)∈S_F} q_i(0)q_j(0) ||B(i)^T S_N A(i)x̂_i^0(N-1) - B(j)^T S_N A(j)x̂_j^0(N-1)||²_{D^0(N-1)}
+ Σ_{i∈F} q_i(0) tr (B(i)^T S_N A(i))^T D^0(N-1)(B(i)^T S_N A(i)) P_i^0(N-1)    (16)

with ||x||²_P = x^T P x, for x ∈ R^n.

Performing the minimization using (7') and (11b) and the results above, we obtain

V(N, I^0) = Σ_{i∈F} q_i(0) x̂_i^0(N-1)^T S_i(N-1) x̂_i^0(N-1)

and the boundary condition S_i^0(N) = S_N, i ∈ F. The minimizing control obtained is

u^0(N-1) = -F̂^0(N-1)(Σ_{i∈F} q_i(0)B(i)^T S_N A(i)x̂_i^0(N-1))

F̂^0(N-1) = D^0(N-1).

Note the cancellation of the terms involved in (16) with those in the cost-to-go.

To achieve the desired result we apply mathematical induction in two steps.

Step 1. Here we assume that the information state is fixed at I^0 and use induction on the stages to determine u^0(0). This is accomplished as follows. Assume that at stage N-k-1

V(k+1, I^0) = Σ_{i∈F} q_i(0) x̂_i^0(k)^T S_i^0(k) x̂_i^0(k) + Σ_{i∈F} q_i(0) tr [Σ_{n=k}^{N} S_i^0(n)Q^v + S_i^0(k-1)P_i^0(k-1)]

with u^0(k) given by the stage-k analogue of (13a), and with the cost at the previous stage containing the terms

+ u(k-1)^T R_{k-1} u(k-1)
- (B(r)^T S_r^0(k)A(r)x(k-1))^T D^0(k-1)(B(r)^T S_r^0(k)A(r)x(k-1))
+ E(B(r)^T S_r^0(k)A(r)x(k-1) | I^0)^T D^0(k-1) E(B(r)^T S_r^0(k)A(r)x(k-1) | I^0)
+ V(k+1, I^0) | I^0}.

Computing Σ_{i∈F} q_i(0)B(i)^T S_i^0(k)A(i)x̂_i^0(k) using (6'), (7') and (15a), and following the same steps of Stage 1 with k replaced by N, we obtain the result.
1 ≤ k ≤ N-1.

Then we want to show that at stage N-k we have

u^l(k) = -F̂^l(k)(Σ_{i∈F} q_i(l)B(i)^T S_i^l(k+1)A(i)x̂_i^l(k))

with the cost-to-go containing the term

||u^l(k) + Σ_{i∈F} q_i(l)F̂(k)x̂_i^l(k)||²_{(R_k + Σ_{i∈F} q_i(l)B(i)^T S_i^l(k+1)B(i))}

and S_i^l(k) given by

S_i^l(k) = A(i)^T S_i^l(k+1)A(i) + Q_k - (B(i)^T S_i^l(k+1)A(i))^T D^l(k)(B(i)^T S_i^l(k+1)A(i))

with the boundary condition

S_i^l(N) = S_N,   i ∈ F.

This follows directly by replacing I^0 by I^l and using the results of the previous calculation. Now select u(l) = u^l(l).

The next theorem shows that for the cost functional (14) introduced in Theorem 3, the minimizing control in the class of open loop feedback policies is unique.

Theorem 4 (verification theorem). There is a unique open loop feedback policy {u(k)}* that minimizes the cost functional (14).

Proof. It is enough to show that S_i^l(k) is positive semidefinite, l = 0, ..., N-1, k ≥ l:

S_i^l(k) = A(i)^T S_i^l(k+1)A(i) + Q_k - (B(i)^T S_i^l(k+1)A(i))^T D^l(k)(B(i)^T S_i^l(k+1)A(i))

D^l(k) = (Σ_{i∈F} q_i(l)B(i)^T S_i^l(k+1)B(i) + R_k)^{-1}

with

S_i^l(N-1) = A(i)^T S_N A(i) + Q_{N-1} - (B(i)^T S_N A(i))^T D^l(N-1)(B(i)^T S_N A(i))

D^l(N-1) = (Σ_{i∈F} q_i(l)B(i)^T S_N B(i) + R_{N-1})^{-1}

with S_N a positive semidefinite symmetric matrix, so that at every stage and for any information state I^l the cost-to-go is a strictly convex function of the control and has a unique minimum.

6. Conclusions

In this paper we have defined quadratic optimal control problems for linear systems with parameter uncertainty for which passive control laws are optimal.

For a system with parameter uncertainty in the observation equation only, a convex cost functional was derived such that the DUL control law is optimal.

For a system with uncertainty in the plant and observation equations, a cost functional for which the open loop feedback control is optimal is presented.

The synthesis of these passive laws involves modifying the standard LQG cost functional by subtracting a quadratic term, which can be referred to as a "dual cost", since it directly involves the identification aspects of the problem. It would be interesting to study the asymptotic properties of the dual cost when ||u(k)|| → ∞; this is studied in Casiello and Loparo (1989b).

References

Bar-Shalom, Y. and E. Tse (1974). Dual effect, certainty equivalence and separation in stochastic control. IEEE Trans. Aut. Control, AC-19, 494-500.

Casiello, F. and K. Loparo (1985). A dual controller for linear systems with random jump parameters. Proc. 24th Conf. on Decision and Control, Fort Lauderdale, pp. 911-915.

Casiello, F. and K. Loparo (1989a). Optimal control of unknown parameter systems. IEEE Trans. Aut. Control, to appear.

Casiello, F. and K. Loparo (1989b). Optimal learning control for a class of unknown parameter systems. Submitted for publication.

Despande, J., T. Upadhyay and D. Lainiotis (1973). Adaptive control of linear stochastic systems. Automatica, 9, 107-115.

Fragoso, M. (1988). On a partially observable LQG problem for systems with Markovian jumping parameters. Syst. Control Lett., 10, 349-356.
Appendix A. Derivation of an algebraic identity

Algebraic fact. Let α_i > 0, i ∈ F = {0, 1, ..., f}, satisfy

Σ_{i∈F} α_i = 1.

Let x_i ∈ R^n, i ∈ F, and P an n × n matrix; then

Σ_{i∈F} α_i x_i^T P x_i - (Σ_{i∈F} α_i x_i)^T P (Σ_{i∈F} α_i x_i) = Σ_{(i,j)∈S_F} α_i α_j (x_i - x_j)^T P (x_i - x_j)    (A1)

where S_F is the set of all pairwise distinct combinations of the elements of F.

Proof. The left-hand side of the above expression equals

Σ_{i∈F} α_i x_i^T P x_i - Σ_{i∈F} α_i² x_i^T P x_i - 2 Σ_{(i,j)∈S_F} α_i α_j x_i^T P x_j
= Σ_{i∈F} α_i (1 - α_i) x_i^T P x_i - 2 Σ_{(i,j)∈S_F} α_i α_j x_i^T P x_j
= Σ_{(i,j)∈S_F} α_i α_j (x_i^T P x_i + x_j^T P x_j) - 2 Σ_{(i,j)∈S_F} α_i α_j x_i^T P x_j
= Σ_{(i,j)∈S_F} α_i α_j (x_i - x_j)^T P (x_i - x_j)

where the middle step uses α_i(1 - α_i) = α_i Σ_{j≠i} α_j. Moreover,

E{(x(k) - x̂(k))(x(k) - x̂(k))^T | I^k} = Σ_{i∈F} q_i(k)x̂_i(k)x̂_i(k)^T + Σ_{i∈F} q_i(k)P_i(k) - x̂(k)x̂(k)^T.

Then (8') follows from (A1) with P = I, x_i = x̂_i(k) and α_i = q_i(k).

Appendix B. Synthesis of the open loop feedback controller

Designing u^0(0). At time k = 0, let I^0 represent the knowledge about the initial state distribution. Let

V(k+1, I^0) = min_{u(k)} E{f(x(k), u(k)) | I^0} + V(k+2, I^0)    (B2)

where V(·, I^0) satisfies the boundary conditions

V(N, I^0) = min_{u(N-1)} E{φ(x(N)) + f(x(N-1), u(N-1)) | I^0}

and

V(l+1, I^0) = J_{N_0}.

Then equation (B2) is the Bellman equation corresponding to the design of u(0). Select u(0) = u^0(0) and advance one step, collecting

I^1 = {y(0), y(1), u^0(0)}.

Designing u^1(1). After obtaining I^1, let

J_{N_1} = min_{{u(k)}} E{φ(x(N)) + Σ_{k=1}^{N-1} f(x(k), u(k)) | I^1}
        = min_{{u(k)}} E{φ(x(N)) | I^1} + Σ_{k=1}^{N-1} E{f(x(k), u(k)) | I^1}.    (B3)

Let

V(k+1, I^1) = min_{u(k)} E{f(x(k), u(k)) | I^1} + V(k+2, I^1).    (B4)

In general, after obtaining I^l, let

J_{N_l} = min_{{u(k)}} E{φ(x(N)) | I^l} + Σ_{k=l}^{N-1} E{f(x(k), u(k)) | I^l}    (B5)

and let

V(k+1, I^l) = min_{u(k)} E{f(x(k), u(k)) | I^l} + V(k+2, I^l)    (B6)

where V(·, I^l) satisfies the boundary conditions

V(N, I^l) = min_{u(N-1)} E{φ(x(N)) + f(x(N-1), u(N-1)) | I^l}

and

V(l+1, I^l) = J_{N_l}.
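The algebraic fact (A1) of Appendix A is a weighted variance decomposition and can be checked numerically; the weights, vectors, and dimensions below are randomly generated for illustration only.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
f = 3                                             # F = {0, 1, ..., f}
alpha = rng.random(f + 1); alpha /= alpha.sum()   # weights alpha_i summing to 1
xs = [rng.standard_normal(4) for _ in range(f + 1)]
P = np.eye(4)                                     # (A1) is applied with P = I

# Left-hand side of (A1): weighted quadratic forms minus the form at the mean
xbar = sum(a * x for a, x in zip(alpha, xs))
lhs = sum(a * x @ P @ x for a, x in zip(alpha, xs)) - xbar @ P @ xbar

# Right-hand side of (A1): sum over pairwise distinct combinations S_F
rhs = sum(alpha[i] * alpha[j] * (xs[i] - xs[j]) @ P @ (xs[i] - xs[j])
          for i, j in combinations(range(f + 1), 2))
print(np.isclose(lhs, rhs))  # True: identity (A1) holds
```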