Park, Choi, and Lee (1996) use generalized back-propagation through time to solve a finite horizon tracking problem that involves offline training of NNs. An ADP-based approach is presented in Dierks and Jagannathan (2009) to solve an infinite horizon optimal tracking problem where the desired trajectory is assumed to depend on the system states. Greedy heuristic dynamic programming based algorithms are presented in results such as (Luo & Liang, 2011; Wang, Liu, & Wei, 2012; Zhang, Wei, & Luo, 2008) which transform the nonautonomous system into an autonomous system, and approximate convergence of the sequence of value functions to the optimal value function is established. However, these results lack an accompanying stability analysis.

In this result, the tracking error and the desired trajectory both serve as inputs to the NN. This makes the developed controller fundamentally different from previous results, in the sense that a different HJB equation must be solved and its solution, i.e., the feedback component of the controller, is a time-varying function of the tracking error. In particular, this paper addresses the technical obstacles that result from the time-varying nature of the optimal control problem by including the partial derivative of the value function with respect to the desired trajectory in the HJB equation, and by using a system transformation to convert the problem into a time-invariant optimal control problem in such a way that the resulting value function is a time-invariant function of the transformed states, and hence, lends itself to approximation using a NN. A Lyapunov-based analysis is used to prove ultimately bounded tracking and that the enacted controller approximates the optimal controller. Simulation results are presented to demonstrate the applicability of the presented technique. To gauge the performance of the proposed method, a comparison with a numerical optimal solution is presented.

For notational brevity, unless otherwise specified, the domain of all the functions is assumed to be R≥0. Furthermore, time-dependence is suppressed while denoting trajectories of dynamical systems. For example, the trajectory x : R≥0 → Rn is defined by abuse of notation as x ∈ Rn and referred to as x instead of x(t), and unless otherwise specified, an equation of the form f + h(y, t) = g(x) is interpreted as f(t) + h(y(t), t) = g(x(t)) for all t ∈ R≥0.

2. Formulation of time-invariant optimal control problem

Consider a class of nonlinear control affine systems

ẋ = f(x) + g(x)u,

where x ∈ Rn is the state, and u ∈ Rm is the control input. The functions f : Rn → Rn and g : Rn → Rn×m are locally Lipschitz and f(0) = 0. The control objective is to track a bounded continuously differentiable signal xd ∈ Rn. To quantify this objective, a tracking error is defined as e ≜ x − xd. The open-loop tracking error dynamics can then be expressed as

ė = f(x) + g(x)u − ẋd.    (1)

The steady-state control policy ud : Rn → Rm corresponding to the desired trajectory xd is

ud(xd) = gd⁺(hd(xd) − fd),    (2)

where gd⁺ ≜ g⁺(xd) and fd ≜ f(xd). To transform the time-varying optimal control problem into a time-invariant optimal control problem, a new concatenated state ζ ∈ R2n is defined as (Zhang et al., 2008)

ζ ≜ [eᵀ, xdᵀ]ᵀ.    (3)

Based on (1) and Assumption 2, the time derivative of (3) can be expressed as

ζ̇ = F(ζ) + G(ζ)µ,    (4)

where the functions F : R2n → R2n, G : R2n → R2n×m, and the control µ ∈ Rm are defined as

F(ζ) ≜ [f(e + xd) − hd(xd) + g(e + xd)ud(xd); hd(xd)],
G(ζ) ≜ [g(e + xd); 0n×m],    µ ≜ u − ud.    (5)

Local Lipschitz continuity of f and g, the fact that f(0) = 0, and Assumption 2 imply that F(0) = 0 and F is locally Lipschitz.
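A minimal sketch of the transformation (2)–(5) in Python/NumPy, assuming hd generates the desired trajectory (ẋd = hd(xd)) and using hypothetical placeholder dynamics for f, g, and hd; it illustrates the construction and is not code from the paper.

import numpy as np

# Hypothetical placeholder dynamics: n = 2 states, m = 1 input.
def f(x):
    return np.array([x[1], -x[0] - 0.5 * x[1]])

def g(x):
    return np.array([[0.0], [1.0]])

def h_d(x_d):
    # Assumed desired-trajectory generator, i.e. xd_dot = h_d(x_d).
    return np.array([x_d[1], -x_d[0]])

def u_d(x_d):
    # Steady-state control (2): u_d = g_d^+ (h_d(x_d) - f_d).
    return np.linalg.pinv(g(x_d)) @ (h_d(x_d) - f(x_d))

def F(zeta):
    # Drift of the concatenated state (5), with zeta = [e; x_d].
    n = zeta.size // 2
    e, x_d = zeta[:n], zeta[n:]
    x = e + x_d
    return np.concatenate((f(x) - h_d(x_d) + g(x) @ u_d(x_d), h_d(x_d)))

def G(zeta):
    # Input matrix of the concatenated state (5).
    n = zeta.size // 2
    gx = g(zeta[:n] + zeta[n:])
    return np.vstack((gx, np.zeros_like(gx)))

# zeta_dot = F(zeta) + G(zeta) @ mu reproduces (4) with mu = u - u_d(x_d).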
The objective of the optimal control problem is to design a policy µ* : R2n → Rm ∈ Ψ such that the control law µ = µ*(ζ) minimizes the cost functional

J(ζ, µ) ≜ ∫_0^∞ r(ζ(ρ), µ(ρ)) dρ,

subject to the dynamic constraints in (4), where Ψ is the set of admissible policies (Beard et al., 1997), and r : R2n × Rm → R≥0 is the local cost defined as

r(ζ, µ) ≜ ζᵀQ̄ζ + µᵀRµ.    (6)

In (6), R ∈ Rm×m is a positive definite symmetric matrix of constants, and Q̄ ∈ R2n×2n is defined as

Q̄ ≜ [Q, 0n×n; 0n×n, 0n×n],    (7)

where Q ∈ Rn×n is a positive definite symmetric matrix of constants with the minimum eigenvalue q ∈ R>0, and 0n×n ∈ Rn×n is a matrix of zeros. For brevity of notation, let (·)′ denote ∂(·)/∂ζ.
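A minimal sketch of the local cost (6) with the block-structured Q̄ from (7); the matrices and dimensions passed to the function are placeholders chosen for illustration.

import numpy as np

def local_cost(zeta, mu, Q, R):
    # r(zeta, mu) = zeta^T Qbar zeta + mu^T R mu, with Qbar = blkdiag(Q, 0) as in (7).
    n = Q.shape[0]
    Q_bar = np.block([[Q, np.zeros((n, n))],
                      [np.zeros((n, n)), np.zeros((n, n))]])
    return float(zeta @ Q_bar @ zeta + mu @ R @ mu)

# Because Qbar is only positive semi-definite, the local cost does not penalize the
# x_d half of zeta; this is the reason V* fails to be positive definite in zeta,
# the obstacle addressed in Section 4.1.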
3. Approximate optimal solution

Assuming that a minimizing policy exists, the optimal value function V* : R2n → R≥0 is defined as

V*(ζ) ≜ min_{µ(τ) | τ∈R≥t} ∫_t^∞ r(φ^µ(τ; t, ζ), µ(τ)) dτ.    (8)
The value function V* can be represented using a NN with N neurons as

V*(ζ) = Wᵀσ(ζ) + ϵ(ζ),    (11)

where W ∈ RN is the constant ideal weight vector bounded above by a known positive constant W̄ ∈ R in the sense that ∥W∥ ≤ W̄, σ : R2n → RN is a bounded continuously differentiable nonlinear activation function, and ϵ : R2n → R is the function reconstruction error (Hornik, Stinchcombe, & White, 1990; Lewis, Selmic, & Campos, 2002).

Using (10) and (11), the optimal policy can be represented as

µ*(ζ) = −(1/2)R⁻¹Gᵀ(ζ)(σ′ᵀ(ζ)W + ϵ′ᵀ(ζ)).    (12)

Based on (11) and (12), the NN approximations to the optimal value function and the optimal policy are given by

V̂(ζ, Ŵc) = Ŵcᵀσ(ζ),
µ̂(ζ, Ŵa) = −(1/2)R⁻¹Gᵀ(ζ)σ′ᵀ(ζ)Ŵa,    (13)

where Ŵc ∈ RN and Ŵa ∈ RN are estimates of the ideal neural network weights W. The use of two separate sets of weight estimates Ŵa and Ŵc for W is motivated by the fact that the Bellman error (BE) is linear with respect to the value function weight estimates and nonlinear with respect to the policy weight estimates. Use of a separate set of weight estimates for the value function facilitates least squares-based adaptive updates.
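A minimal sketch of the approximations in (13), assuming the basis σ, its Jacobian σ′, and the functions G and R from Section 2 are supplied by the user; the example basis in the comments is a hypothetical choice, not the one used in the paper.

import numpy as np

def V_hat(zeta, W_c, sigma):
    # Critic approximation (13): V_hat(zeta, W_c) = W_c^T sigma(zeta).
    return float(W_c @ sigma(zeta))

def mu_hat(zeta, W_a, sigma_jac, G, R):
    # Actor approximation (13): mu_hat = -(1/2) R^{-1} G^T(zeta) sigma'(zeta)^T W_a.
    return -0.5 * np.linalg.solve(R, G(zeta).T @ sigma_jac(zeta).T @ W_a)

# Example of a (hypothetical) quadratic basis with N = 3 features for 2n = 2:
# sigma     = lambda z: np.array([z[0] ** 2, z[0] * z[1], z[1] ** 2])
# sigma_jac = lambda z: np.array([[2 * z[0], 0.0], [z[1], z[0]], [0.0, 2 * z[1]]])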
The controller is obtained from (2), (5), and (13) as

u = −(1/2)R⁻¹Gᵀ(ζ)σ′ᵀ(ζ)Ŵa + gd⁺(hd(xd) − fd).    (14)

Using the approximations µ̂ and V̂ for µ* and V* in (9), respectively, the error between the approximate and the optimal Hamiltonian, called the BE δ ∈ R, is given in a measurable form by

δ ≜ V̂′(ζ, Ŵc)ζ̇ + r(ζ, µ̂(ζ, Ŵa)).    (15)

The value function weights are updated to minimize ∫_0^t δ²(ρ) dρ using a normalized least squares update law¹ with an exponential forgetting factor as (Ioannou & Sun, 1996)

Ŵ̇c = −ηcΓ (ω/(1 + νωᵀΓω)) δ,    (16)
Γ̇ = −ηc(−λΓ + Γ (ωωᵀ/(1 + νωᵀΓω)) Γ),    (17)

where ν, ηc ∈ R are constant positive adaptation gains, ω : R≥0 → RN is defined as ω ≜ σ′(ζ)ζ̇, and λ ∈ (0, 1) is the constant forgetting factor for the estimation gain matrix Γ ∈ RN×N. The policy weights are updated to follow the critic weights² as

Ŵ̇a = −ηa1(Ŵa − Ŵc) − ηa2Ŵa,    (18)

where ηa1, ηa2 ∈ R are constant positive adaptation gains.

¹ The least-squares approach is motivated by faster convergence. With minor modifications to the stability analysis, the result can also be established for a gradient descent update law.
² The least-squares approach cannot be used to update the policy weights because the BE is a nonlinear function of the policy weights.
The following assumption facilitates the stability analysis using PE.

Assumption 3. The regressor ψ : R≥0 → RN defined as ψ ≜ ω/√(1 + νωᵀΓω) is persistently exciting (PE). Thus, there exist T, ψ̲ > 0 such that ψ̲ I ≤ ∫_t^{t+T} ψ(τ)ψ(τ)ᵀ dτ.³

³ The regressor is defined here as a trajectory indexed by time. It should be noted that different initial conditions result in different regressor trajectories; hence, the constants T and ψ̲ depend on the initial values of ζ and Ŵa. Hence, the final result is not uniform in the initial conditions.

Using Assumption 3 and Corollary 4.3.2 in Ioannou and Sun (1996), it can be concluded that

ϕ̲ IN×N ≤ Γ ≤ ϕ̄ IN×N,  ∀t ∈ R≥0,    (19)

where ϕ̲, ϕ̄ ∈ R are constants such that 0 < ϕ̲ < ϕ̄.⁴ Based on (19), the regressor vector can be bounded as

∥ψ∥ ≤ 1/√(νϕ̲),  ∀t ∈ R≥0.    (20)

⁴ Since the evolution of ψ is dependent on the initial values of ζ and Ŵa, the constants ϕ̲ and ϕ̄ depend on the initial values of ζ and Ŵa.

For notational brevity, state-dependence of the functions hd, F, G, V*, µ*, σ, and ϵ is suppressed hereafter. Using (9), (15), and (16), an unmeasurable form of the BE can be written as

δ = −W̃cᵀω + (1/4)W̃aᵀ𝒢σW̃a + (1/4)ϵ′𝒢ϵ′ᵀ + (1/2)Wᵀσ′𝒢ϵ′ᵀ − ϵ′F,    (21)

where 𝒢 ≜ GR⁻¹Gᵀ and 𝒢σ ≜ σ′GR⁻¹Gᵀσ′ᵀ. The weight estimation errors for the value function and the policy are defined as W̃c ≜ W − Ŵc and W̃a ≜ W − Ŵa, respectively.
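A minimal sketch of the Bellman error (15) and the update laws (16)–(18), discretized with an explicit Euler step purely for illustration (the paper states the laws in continuous time); sigma_jac, G, R, and Q_bar are the user-supplied objects assumed in the earlier sketches.

import numpy as np

def critic_actor_step(zeta, zeta_dot, W_c, W_a, Gamma, sigma_jac, G, R, Q_bar,
                      eta_c, eta_a1, eta_a2, lam, nu, dt):
    # Regressor omega = sigma'(zeta) zeta_dot and the approximate policy from (13).
    omega = sigma_jac(zeta) @ zeta_dot
    mu = -0.5 * np.linalg.solve(R, G(zeta).T @ sigma_jac(zeta).T @ W_a)
    # Bellman error (15): delta = W_c^T omega + r(zeta, mu_hat).
    delta = W_c @ omega + zeta @ Q_bar @ zeta + mu @ R @ mu
    # Normalized least-squares critic update with forgetting factor, (16)-(17).
    norm = 1.0 + nu * omega @ Gamma @ omega
    W_c_new = W_c - dt * eta_c * (Gamma @ omega / norm) * delta
    Gamma_new = Gamma - dt * eta_c * (-lam * Gamma
                                      + Gamma @ np.outer(omega, omega) @ Gamma / norm)
    # Actor update (18): track the critic weights, with an additional damping term.
    W_a_new = W_a - dt * (eta_a1 * (W_a - W_c) + eta_a2 * W_a)
    return W_c_new, W_a_new, Gamma_new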
4. Stability analysis

Before stating the main result of the paper, three supplementary technical lemmas are stated. To facilitate the discussion, let Y ⊂ R2n+2N be a compact set, and let 𝒵 ≜ Y ∩ Rn+2N. Using the universal approximation property of NNs, on the compact set Y ∩ R2n, the NN approximation errors can be bounded such that sup|ϵ(ζ)| ≤ ϵ̄ and sup∥ϵ′(ζ)∥ ≤ ϵ̄′, where ϵ̄ ∈ R and ϵ̄′ ∈ R are positive constants, and there exists a positive constant LF ∈ R such that⁵ sup∥F(ζ)∥ ≤ LF∥ζ∥. Using Assumptions 1 and 2, the following bounds are developed on the compact set Y ∩ R2n to aid the subsequent stability analysis:

∥(1/4)ϵ′𝒢ϵ′ᵀ + (1/2)Wᵀσ′𝒢ϵ′ᵀ∥ + ϵ̄′LF∥xd∥ ≤ ι1,    ∥𝒢σ∥ ≤ ι2,
∥(1/2)ϵ′𝒢ϵ′ᵀ∥ ≤ ι3,    ∥(1/2)Wᵀ𝒢σ + (1/2)ϵ′𝒢σ′ᵀ∥ ≤ ι4,
∥(1/4)ϵ′𝒢ϵ′ᵀ + (1/2)Wᵀσ′𝒢ϵ′ᵀ∥ ≤ ι5,    (22)

where ι1, ι2, ι3, ι4, ι5 ∈ R are positive constants.

⁵ Instead of using the fact that locally Lipschitz functions on compact sets are Lipschitz, it is possible to bound the function F as ∥F(ζ)∥ ≤ ρ(∥ζ∥)∥ζ∥, where ρ : R≥0 → R≥0 is non-decreasing. This approach is feasible and results in additional gain conditions.

4.1. Supporting lemmas

The contribution in the previous section was the development of a transformation that enables the optimal policy and the optimal value function to be expressed as a time-invariant function of ζ. The use of this transformation presents a challenge in the sense that the optimal value function, which is used as the Lyapunov function for the stability analysis, is not a positive definite function of ζ, because the matrix Q̄ is positive semi-definite. In this section, this technical obstacle is addressed by exploiting the fact that the time-invariant optimal value function V* : R2n → R can be interpreted as a time-varying map Vt* : Rn × R≥0 → R, such that

Vt*(e, t) = V*([eᵀ, xdᵀ(t)]ᵀ)    (23)

for all e ∈ Rn and for all t ∈ R≥0. Specifically, the time-invariant form facilitates the development of the approximate optimal policy, whereas the equivalent time-varying form can be shown to be a positive definite and decrescent function of the tracking error. In the following, Lemma 1 is used to prove that Vt* : Rn × R≥0 → R is positive definite and decrescent, and hence, a candidate Lyapunov function.

Lemma 1. Let Ba denote a closed ball around the origin with the radius a ∈ R>0. The optimal value function Vt* : Rn × R≥0 → R satisfies the following properties

Vt*(e, t) ≥ v̲(∥e∥),    (24a)
Vt*(0, t) = 0,    (24b)
Vt*(e, t) ≤ v̄(∥e∥),    (24c)

∀t ∈ R≥0 and ∀e ∈ Ba, where v̲ : [0, a] → R≥0 and v̄ : [0, a] → R≥0 are class K functions.

Proof. See Appendix.
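The identity (23) is also convenient numerically: with the critic approximation from (13) standing in for V*, the time-varying map can be evaluated by stacking the tracking error with the current desired state. A minimal sketch under the same assumptions as the earlier snippets:

import numpy as np

def V_t(e, t, x_d_traj, W_c, sigma):
    # Time-varying interpretation (23): V_t(e, t) = V*([e; x_d(t)]), evaluated here
    # with the critic approximation W_c^T sigma in place of the exact V*.
    zeta = np.concatenate((e, x_d_traj(t)))
    return float(W_c @ sigma(zeta))

# V_t(0, t) = 0 and positive definiteness in e (Lemma 1) are properties of the exact
# V*; the NN approximation satisfies them only up to the reconstruction error.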
Lemma 2. Let Z ≜ [eᵀ W̃cᵀ W̃aᵀ]ᵀ, and suppose that Z(τ) ∈ 𝒵 for all τ ∈ [t, t + T]. Then, the NN weights and the tracking errors satisfy

− inf_{τ∈[t,t+T]} ∥e(τ)∥² ≤ −ϖ0 sup_{τ∈[t,t+T]} ∥e(τ)∥² + ϖ1T² sup_{τ∈[t,t+T]} ∥W̃a(τ)∥² + ϖ2,

− inf_{τ∈[t,t+T]} ∥W̃a(τ)∥² ≤ −ϖ3 sup_{τ∈[t,t+T]} ∥W̃a(τ)∥² + ϖ4 inf_{τ∈[t,t+T]} ∥W̃c(τ)∥² + ϖ5 sup_{τ∈[t,t+T]} ∥e(τ)∥² + ϖ6,

where

ϖ0 = (1 − 6nT²L²F)/2,    ϖ1 = (3n/4) sup_t ∥gR⁻¹Gᵀσ′ᵀ∥²,
ϖ2 = 3n²T²(dLF + sup_t ∥g gd⁺(hd − fd) − (1/2)gR⁻¹Gᵀσ′ᵀW − hd∥)²,
ϖ3 = (1 − 6N(ηa1 + ηa2)²T²)/(ηa1 + ηa2),
ϖ4 = 6Nη²a1T²/(1 − 6N(ηcϕ̄T)²/(νϕ̲)) + 6N(ηa1 + ηa2)²T²/3 + 8qϖ1,
ϖ5 = 18Nη²a1(ηcϕ̄ϵ̄′LFT)²/(νϕ̲(1 − 6N(ηcϕ̄T)²/(νϕ̲))),
ϖ6 = 18Nη²a1(ηcϕ̄(ϵ̄′LFd + ι5)T)²/(νϕ̲(1 − 6N(ηcϕ̄T)²/(νϕ̲))) + 3Nη²a2W̄²T.
Proof. The proof is omitted due to space constraints, and is available in Kamalapurkar, Dinh, Bhasin, and Dixon (2013).

Lemma 3. Let Z ≜ [eᵀ W̃cᵀ W̃aᵀ]ᵀ, and suppose that Z(τ) ∈ 𝒵 for all τ ∈ [t, t + T]. Then, the critic weights satisfy

− ∫_t^{t+T} (W̃cᵀψ)² dτ ≤ −ψ̲ϖ7∥W̃c∥² + ϖ8 ∫_t^{t+T} ∥e∥² dτ + 3ι2² ∫_t^{t+T} ∥W̃a(σ)∥⁴ dσ + ϖ9T,

where ϖ7 = ν²ϕ̲²/(2(ν²ϕ̄² + η²cϕ̄²T²)), ϖ8 = 3ϵ̄′²L²F, and ϖ9 = 2(ι²5 + ϵ̄′²L²Fd²).

Proof. The proof is omitted due to space constraints, and is available in Kamalapurkar et al. (2013).

4.2. Gain conditions and gain selection

The following section details sufficient gain conditions derived based on a stability analysis performed using the candidate Lyapunov function VL : Rn+2N × R≥0 → R defined as VL(Z, t) ≜ Vt*(e, t) + (1/2)W̃cᵀΓ⁻¹W̃c + (1/2)W̃aᵀW̃a. Using Lemma 1 and (19),

v̲l(∥Z∥) ≤ VL(Z, t) ≤ v̄l(∥Z∥),    (25)

∀Z ∈ Bb, ∀t ∈ R≥0, where v̲l : [0, b] → R≥0 and v̄l : [0, b] → R≥0 are class K functions, and Bb ⊂ Rn+2N denotes a ball of radius b ∈ R>0 around the origin, containing 𝒵.
To facilitate the discussion, define ηa12 ≜ ηa1 + ηa2, Z ≜ [eᵀ W̃cᵀ W̃aᵀ]ᵀ, ι ≜ (ηa2W̄ + ι4)²/ηa12 + 2ηcι1² + (1/2)ι3, ϖ10 ≜ ((6ϖ4ηa12ηc + 2ϖ2q + ηcϖ9)/8)T + ι, and ϖ11 ≜ (1/16) min(ηcψ̲ϖ7, 2ϖ0qT, ϖ3ηa12T). Let Z0 ∈ R≥0 denote a known constant bound on the initial condition such that ∥Z(t0)∥ ≤ Z0, and let

Z̄ ≜ v̲l⁻¹(v̄l(max(Z0, √(2ϖ10T/ϖ11))) + ιT).    (26)

The sufficient gain conditions for the subsequent Theorem 1 are given by⁶

ηa12 > max(ηa1ξ2 + ηcι2²Z̄/4, 3ηcι2Z̄/√(νϕ̲)),
ξ1 > 2ϵ̄′LF,    ηc > ηa1/(λγξ2),    ψ̲ > (2ϖ4ηa12/(ηcϖ7))T,
q > max(ϖ5ηa12/ϖ0, ηcϖ8, (1/2)ηcLFϵ̄′ξ1),
T < min(1/√(6Nηa12), √(νϕ̲)/(√(6N)ηcϕ̄), 1/(√(2n)LF)),    (27)

where ξ1, ξ2 ∈ R are known adjustable positive constants. Furthermore, the compact set 𝒵 satisfies the sufficient condition

Z̄ ≤ r.    (28)

⁶ Similar conditions on ψ̲ and T can be found in PE-based adaptive control in the presence of bounded or Lipschitz uncertainties (cf. Misovec, 1999 and Narendra & Annaswamy, 1986).
Fig. 3. Hamiltonian and costate of the numerical solution computed using GPOPS.
The NN weights converge to the following values

Ŵc = Ŵa = [83.36 2.37 27.0 2.78 −2.83 0.20 14.13 29.81 18.87 4.11 3.47 6.69 9.71 15.58 4.97 12.42 11.31 3.29 1.19 −1.99 4.55 −0.47 0.56]ᵀ.    (34)

Note that the last sixteen weights, which correspond to the terms containing the desired trajectories ζ5, …, ζ8, are non-zero. Thus, the resulting value function V̂ and the resulting policy µ̂ depend on the desired trajectory, and hence, are time-varying functions of the tracking error. Since the true weights are unknown, a direct comparison of the weights in (34) with the true weights is not possible. Instead, to gauge the performance of the presented technique, the state and the control trajectories obtained using the estimated policy are compared with those obtained using Radau-pseudospectral numerical optimal control computed using the GPOPS software (Rao et al., 2010). Since an accurate numerical solution is difficult to obtain for an infinite horizon optimal control problem, the numerical optimal control problem is solved over a finite horizon ranging over approximately 5 times the settling time associated with the slowest state variable. Based on the solution obtained using the proposed technique, the slowest settling time is estimated to be approximately 20 s. Thus, to approximate the infinite horizon solution, the numerical solution is computed over a 100 s time horizon using 300 collocation points.

As seen in Fig. 3, the Hamiltonian of the numerical solution is approximately zero. This supports the assertion that the optimal control problem is time-invariant. Furthermore, since the Hamiltonian is close to zero, the numerical solution obtained using GPOPS is sufficiently accurate as a benchmark to compare against the ADP-based solution obtained using the proposed technique. Note that in Fig. 3, the costate variables corresponding to the desired trajectories are nonzero. Since these costate variables represent the sensitivity of the cost with respect to the desired trajectories, this further supports the observation that the optimal controller depends on the desired trajectory.
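The Hamiltonian check described above can be reproduced from the exported numerical solution. A minimal sketch, assuming the state, control, and costate samples from the solver are available and using the standard Hamiltonian of problem (4), (6):

import numpy as np

def hamiltonian(zeta, mu, costate, F, G, Q_bar, R):
    # H = r(zeta, mu) + costate^T (F(zeta) + G(zeta) mu) for the dynamics (4)
    # and the local cost (6).
    r = zeta @ Q_bar @ zeta + mu @ R @ mu
    return float(r + costate @ (F(zeta) + G(zeta) @ mu))

# Along an infinite-horizon solution of the time-invariant problem the Hamiltonian
# should be approximately zero, which is the check reported for Fig. 3.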
Appendix
The proofs for the technical lemmas and the gain selection algorithm are detailed in this section.

Proof. Since t ↦ Ξ(x, t) is uniformly bounded for all x ∈ D, sup_{t∈R≥0}{Ξ(x, t)} exists and is unique for all x ∈ D. Let the function α : D → R≥0 be defined as

α(x) ≜ sup_{t∈R≥0}{Ξ(x, t)}.    (35)

Since x ↦ Ξ(x, t) is continuous, uniformly in t, ∀ε > 0, ∃ς(x) > 0 such that ∀y ∈ D,

d_{D×R≥0}((x, t), (y, t)) < ς(x) ⟹ d_{R≥0}(Ξ(x, t), Ξ(y, t)) < ε,    (36)

where d_M(·, ·) denotes the standard Euclidean metric on the metric space M. By the definition of d_M(·, ·), d_{D×R≥0}((x, t), (y, t)) = d_D(x, y). Using (36),

d_D(x, y) < ς(x) ⟹ |Ξ(x, t) − Ξ(y, t)| < ε.    (37)

Given the fact that Ξ is positive, (37) implies Ξ(x, t) < Ξ(y, t) + ε and Ξ(y, t) < Ξ(x, t) + ε, which from (35) implies α(x) < α(y) + ε and α(y) < α(x) + ε, and hence, from (37), d_D(x, y) < ς(x) ⟹ |α(x) − α(y)| < ε. Since Ξ is positive definite, (35) can be used to conclude α(0) = 0. Thus, Ξ is bounded above by a continuous positive definite function; hence, Ξ is decrescent in D.

Based on the definitions in (6)–(8) and (23), Vt*(e, t) > 0, ∀t ∈ R≥0 and ∀e ∈ Ba \ {0}. The optimal value function V*([0ᵀ, xdᵀ]ᵀ) is the cost incurred when starting with e = 0 and following the optimal policy thereafter for an arbitrary desired trajectory xd. Substituting x(t0) = xd(t0), µ(t0) = 0, and (2) in (4) indicates that ė(t0) = 0. Thus, when starting from e = 0, a policy that is identically zero satisfies the dynamic constraints in (4). Furthermore, the optimal cost is V*([0ᵀ, xdᵀ(t0)]ᵀ) = 0, ∀xd(t0), which, from (23), implies (24b). Since the optimal value function Vt* is strictly positive everywhere but at e = 0 and is zero at e = 0, Vt* is a positive definite function. Hence, Lemma 4.3 in Khalil (2002) can be invoked to conclude that there exists a class K function v̲ : [0, a] → R≥0 such that v̲(∥e∥) ≤ Vt*(e, t), ∀t ∈ R≥0 and ∀e ∈ Ba.

Admissibility of the optimal policy implies that V*(ζ) is bounded over all compact subsets K ⊂ R2n. Since the desired trajectory is bounded, t ↦ Vt*(e, t) is uniformly bounded for all e ∈ Ba. To establish that e ↦ Vt*(e, t) is continuous, uniformly in t, let χeo ⊂ Rn be a compact set containing eo. Since xd is bounded, xd ∈ χxd, where χxd ⊂ Rn is compact. Since V* : R2n → R≥0 is continuous, and χeo × χxd ⊂ R2n is compact, V* is uniformly continuous on χeo × χxd. Thus, ∀ε > 0, ∃ς > 0, such that ∀([eoᵀ, xdᵀ]ᵀ, [e1ᵀ, xdᵀ]ᵀ) ∈ χeo × χxd, d_{χeo×χxd}([eoᵀ, xdᵀ]ᵀ, [e1ᵀ, xdᵀ]ᵀ) < ς ⟹ d_R(V*([eoᵀ, xdᵀ]ᵀ), V*([e1ᵀ, xdᵀ]ᵀ)) < ε. Thus, for each eo ∈ Rn, there exists a ς > 0, independent of xd, that establishes the continuity of e ↦ V*([eᵀ, xdᵀ]ᵀ) at eo. Thus, e ↦ V*([eᵀ, xdᵀ]ᵀ) is continuous, uniformly in xd, and hence, using (23), e ↦ Vt*(e, t) is continuous, uniformly in t. Using Lemma 4 and (24a) and (24b), there exists a positive definite function α : Rn → R≥0 such that Vt*(e, t) < α(e), ∀(e, t) ∈ Rn × R≥0. Lemma 4.3 in Khalil (2002) indicates that there exists a class K function v̄ : [0, a] → R≥0 such that α(e) ≤ v̄(∥e∥), which implies (24c).

References

Abu-Khalaf, M., & Lewis, F. (2002). Nearly optimal HJB solution for constrained input systems using a neural network least-squares approach. In Proc. IEEE conf. decis. control (pp. 943–948). Las Vegas, NV.
Beard, R., Saridis, G., & Wen, J. (1997). Galerkin approximations of the generalized Hamilton–Jacobi–Bellman equation. Automatica, 33, 2159–2178.
Bhasin, S., Kamalapurkar, R., Johnson, M., Vamvoudakis, K., Lewis, F. L., & Dixon, W. (2013). A novel actor-critic-identifier architecture for approximate optimal control of uncertain nonlinear systems. Automatica, 49(1), 89–92.
Dierks, T., & Jagannathan, S. (2009). Optimal tracking control of affine nonlinear discrete-time systems with unknown internal dynamics. In Proc. IEEE conf. decis. control (pp. 6750–6755).
Dierks, T., & Jagannathan, S. (2010). Optimal control of affine nonlinear continuous-time systems. In Proc. Am. control conf. (pp. 1568–1573).
Doya, K. (2000). Reinforcement learning in continuous time and space. Neural Computation, 12(1), 219–245.
Hornik, K., Stinchcombe, M., & White, H. (1990). Universal approximation of an unknown mapping and its derivatives using multilayer feedforward networks. Neural Networks, 3(5), 551–560.
Ioannou, P., & Sun, J. (1996). Robust adaptive control. Prentice Hall.
Jiang, Y., & Jiang, Z.-P. (2012). Computational adaptive optimal control for continuous-time linear systems with completely unknown dynamics. Automatica, 48(10), 2699–2704.
Kamalapurkar, R., Dinh, H., Bhasin, S., & Dixon, W. (2013). Approximately optimal trajectory tracking for continuous time nonlinear systems. arXiv:1301.7664.
Khalil, H. K. (2002). Nonlinear systems (3rd ed.). Prentice Hall.
Kirk, D. (2004). Optimal control theory: an introduction. Dover.
Lewis, F. L., Selmic, R., & Campos, J. (2002). Neuro-fuzzy control of industrial systems with actuator nonlinearities. Philadelphia, PA, USA: Society for Industrial and Applied Mathematics.
Loría, A., & Panteley, E. (2002). Uniform exponential stability of linear time-varying systems: revisited. Systems & Control Letters, 47(1), 13–24.
Luo, Y., & Liang, M. (2011). Approximate optimal tracking control for a class of discrete-time non-affine systems based on GDHP algorithm. In IWACI int. workshop adv. comput. intell. (pp. 143–149).
Misovec, K. M. (1999). Friction compensation using adaptive non-linear control with persistent excitation. International Journal of Control, 72(5), 457–479.
Narendra, K., & Annaswamy, A. (1986). Robust adaptive control in the presence of bounded disturbances. IEEE Transactions on Automatic Control, 31(4), 306–315.
Panteley, E., Loria, A., & Teel, A. (2001). Relaxed persistency of excitation for uniform asymptotic stability. IEEE Transactions on Automatic Control, 46(12), 1874–1886.
Park, Y. M., Choi, M. S., & Lee, K. Y. (1996). An optimal tracking neuro-controller for nonlinear dynamic systems. IEEE Transactions on Neural Networks, 7(5), 1099–1110.
Rao, A. V., Benson, D. A., Darby, C. L., Patterson, M. A., Francolin, C., & Huntington, G. T. (2010). Algorithm 902: GPOPS, a MATLAB software for solving multiple-phase optimal control problems using the Gauss pseudospectral method. ACM Transactions on Mathematical Software, 37(2), 1–39.
Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: an introduction. Cambridge, MA, USA: MIT Press.
Vamvoudakis, K., & Lewis, F. (2010). Online actor-critic algorithm to solve the continuous-time infinite horizon optimal control problem. Automatica, 46(5), 878–888.
Vrabie, D., & Lewis, F. (2009). Neural network approach to continuous-time direct adaptive optimal control for partially unknown nonlinear systems. Neural Networks, 22(3), 237–246.
Wang, D., Liu, D., & Wei, Q. (2012). Finite-horizon neuro-optimal tracking control for a class of discrete-time nonlinear systems using adaptive dynamic programming approach. Neurocomputing, 78(1), 14–22.
Zhang, H., Cui, L., Zhang, X., & Luo, Y. (2011). Data-driven robust approximate optimal tracking control for unknown general nonlinear systems using adaptive dynamic programming method. IEEE Transactions on Neural Networks, 22(12), 2226–2236.
Zhang, H., Luo, Y., & Liu, D. (2009). Neural-network-based near-optimal control for a class of discrete-time affine nonlinear systems with control constraints. IEEE Transactions on Neural Networks, 20(9), 1490–1503.
Zhang, H., Wei, Q., & Luo, Y. (2008). A novel infinite-time optimal tracking control scheme for a class of discrete-time nonlinear systems via the greedy HDP iteration algorithm. IEEE Transactions on Systems, Man and Cybernetics, Part B (Cybernetics), 38(4), 937–942.

Rushikesh Kamalapurkar received his Bachelor of Technology degree in Mechanical Engineering from Visvesvaraya National Institute of Technology, Nagpur, India. He worked for two years as a Design Engineer at Larsen and Toubro Ltd., Mumbai, India. He received his Master of Science degree and his Doctor of Philosophy degree from the Department of Mechanical and Aerospace Engineering at the University of Florida under the supervision of Dr. Warren E. Dixon. He is currently a postdoctoral researcher with the Nonlinear Controls and Robotics lab at the University of Florida. His research interests include dynamic programming, optimal control, reinforcement learning, and data-driven adaptive control for uncertain nonlinear dynamical systems.

Huyen Dinh received a B.S. degree in Mechatronics from Hanoi University of Science and Technology, Hanoi, Vietnam in 2006, and M.Eng. and Ph.D. degrees in Mechanical Engineering from the University of Florida in 2010 and 2012, respectively. She currently works as an Assistant Professor in the Department of Mechanical Engineering at the University of Transport and Communications, Hanoi, Vietnam. Her primary research interest is the development of Lyapunov-based control and its applications for uncertain nonlinear systems. Her current research interests include learning-based control and adaptive control for uncertain nonlinear systems.

Shubhendu Bhasin received his Ph.D. in 2011 from the Department of Mechanical and Aerospace Engineering at the University of Florida. He is currently Assistant Professor in the Department of Electrical Engineering at the Indian Institute of Technology, Delhi. His research interests include reinforcement learning-based feedback control, approximate dynamic programming, neural network-based control, nonlinear system identification and parameter estimation, and robust and adaptive control of uncertain nonlinear systems.

Warren E. Dixon received his Ph.D. in 2000 from the Department of Electrical and Computer Engineering at Clemson University. After completing his doctoral studies he was selected as an Eugene P. Wigner Fellow at Oak Ridge National Laboratory (ORNL). In 2004, he joined the University of Florida in the Mechanical and Aerospace Engineering Department. His main research interest has been the development and application of Lyapunov-based control techniques for uncertain nonlinear systems. He has published over 300 refereed papers and several books in this area. His work has been recognized by the 2013 Fred Ellersick Award for Best Overall MILCOM Paper, 2012–2013 University of Florida College of Engineering Doctoral Dissertation Mentoring Award, 2011 American Society of Mechanical Engineers (ASME) Dynamics Systems and Control Division Outstanding Young Investigator Award, 2009 American Automatic Control Council (AACC) O. Hugo Schuck (Best Paper) Award, 2006 IEEE Robotics and Automation Society (RAS) Early Academic Career Award, an NSF CAREER Award (2006–2011), 2004 DOE Outstanding Mentor Award, and the 2001 ORNL Early Career Award for Engineering Achievement. He is an IEEE Control Systems Society (CSS) Distinguished Lecturer. He currently serves as a member of the US Air Force Science Advisory Board and as the Director of Operations for the Executive Committee of the IEEE CSS Board of Governors. He has formerly served as an associate editor for several journals, and is currently an associate editor for Automatica and the International Journal of Robust and Nonlinear Control.