0% found this document useful (0 votes)

3 views9 pages

Kamala Pur Kar 2015

This paper presents a method for approximate optimal trajectory tracking in continuous-time nonlinear systems using adaptive dynamic programming (ADP). It addresses the challenges of time-varying optimal control problems by transforming them into time-invariant problems, ensuring bounded tracking of desired trajectories while approximating the optimal controller. The results include a Lyapunov-based analysis and simulation comparisons to demonstrate the effectiveness of the proposed technique.

Uploaded by

Vũ Nguyễn Trọng

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

3 views9 pages

Kamala Pur Kar 2015

Uploaded by

Vũ Nguyễn Trọng

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 9

Automatica 51 (2015) 40–48

Contents lists available at ScienceDirect

Automatica
journal homepage: www.elsevier.com/locate/automatica

Brief paper

Approximate optimal trajectory tracking for continuous-time

nonlinear systems✩
Rushikesh Kamalapurkar a , Huyen Dinh b , Shubhendu Bhasin c , Warren E. Dixon a
a
Department of Mechanical and Aerospace Engineering, University of Florida, Gainesville, USA
b
Department of Mechanical Engineering, University of Transport and Communications, Hanoi, Viet Nam
c
Department of Electrical Engineering, Indian Institute of Technology, Delhi, India

article info abstract

Article history: Adaptive dynamic programming has been investigated and used as a method to approximately solve
Received 21 September 2012 optimal regulation problems. However, the extension of this technique to optimal tracking problems for
Received in revised form continuous-time nonlinear systems has remained a non-trivial open problem. The control development
23 May 2014
in this paper guarantees ultimately bounded tracking of a desired trajectory, while also ensuring that the
Accepted 29 September 2014
enacted controller approximates the optimal controller.
© 2014 Elsevier Ltd. All rights reserved.
Keywords:
Time-varying systems
Optimal control
Adaptive control
Tracking applications
Actor–critic

1. Introduction When applied to continuous-time systems the principle of

optimality leads to the Hamilton–Jacobi–Bellman (HJB) equation
Reinforcement learning (RL) is a concept that can be used to en- which is the continuous-time counterpart of the Bellman equa-
able an agent to learn optimal policies from interaction with the tion (Doya, 2000). Similar to discrete-time adaptive dynamic pro-
environment. The objective of the agent is to learn the policy that gramming (ADP), continuous-time ADP approaches aim at finding
maximizes or minimizes a cumulative long term reward. Almost approximate solutions to the HJB equation. Various methods to
all RL algorithms use some form of generalized policy iteration solve this problem are proposed in Abu-Khalaf and Lewis (2002),
(GPI). GPI is a set of two simultaneous interacting processes, pol- Beard, Saridis, and Wen (1997), Bhasin et al. (2013), Jiang and Jiang
icy evaluation and policy improvement. Starting with an estimate (2012), Vamvoudakis and Lewis (2010), Vrabie and Lewis (2009)
of the state value function and an admissible policy, policy eval- and Zhang, Luo, and Liu (2009) and the references therein. An in-
uation makes the estimate consistent with the policy and policy finite horizon regulation problem with a quadratic cost function is
improvement makes the policy greedy with respect to the value the most common problem considered in ADP literature. For these
function. These algorithms exploit the fact that the optimal value problems, function approximation techniques can be used to ap-
function satisfies Bellman’s principle of optimality (Kirk, 2004; Sut- proximate the value function because it is time-invariant.
ton & Barto, 1998). Approximation techniques like neural networks (NNs) are com-
monly used in ADP literature for value function approximation.
ADP-based approaches are presented in results such as (Dierks &
✩ This research is supported in part by NSF award numbers 1161260, 1217908, Jagannathan, 2010; Zhang, Cui, Zhang, & Luo, 2011) to address the
ONR grant number N00014-13-1-0151, and a contract with the AFRL Mathematical tracking problem for continuous-time systems, where the value
Modeling and Optimization Institute. Any opinions, findings and conclusions or function, and the controller presented are time-varying functions
recommendations expressed in this material are those of the authors and do not of the tracking error. However, for the infinite horizon optimal con-
necessarily reflect the views of the sponsoring agency. The material in this paper trol problem, time does not lie on a compact set, and NNs can only
was not presented at any conference. This paper was recommended for publication approximate functions on a compact domain. Thus, it is unclear
in revised form by Associate Editor Antonio Loria under the direction of Editor
Andrew R. Teel.
how a NN with the tracking error as an input can approximate the
E-mail addresses: [email protected] (R. Kamalapurkar), time-varying value function and controller.
[email protected] (H. Dinh), [email protected] (S. Bhasin), For discrete-time systems, several approaches have been de-
[email protected] (W.E. Dixon). veloped to address the tracking problem. Park, Choi, and Lee
http://dx.doi.org/10.1016/j.automatica.2014.10.103
0005-1098/© 2014 Elsevier Ltd. All rights reserved.
R. Kamalapurkar et al. / Automatica 51 (2015) 40–48 41

(1996) use generalized back-propagation through time to solve a The steady-state control policy ud : Rn → Rm corresponding to
finite horizon tracking problem that involves offline training of the desired trajectory xd is
NNs. An ADP-based approach is presented in Dierks and Jagan-
nathan (2009) to solve an infinite horizon optimal tracking prob- ud (xd ) = gd+ (hd (xd ) − fd ) , (2)
lem where the desired trajectory is assumed to depend on the where gd , g (xd ) and fd , f (xd ). To transform the time-vary-
+ +
system states. Greedy heuristic dynamic programming based algo- ing optimal control problem into a time-invariant optimal control
rithms are presented in results such as (Luo & Liang, 2011; Wang, problem, a new concatenated state ζ ∈ R2n is defined as (Zhang
Liu, & Wei, 2012; Zhang, Wei, & Luo, 2008) which transform the et al., 2008)
nonautonomous system into an autonomous system, and approx- T
ζ , eT , xTd .

imate convergence of the sequence of value functions to the opti- (3)
mal value function is established. However, these results lack an
Based on (1) and Assumption 2, the time derivative of (3) can be
accompanying stability analysis.
expressed as
In this result, the tracking error and the desired trajectory both
serve as inputs to the NN. This makes the developed controller ζ̇ = F (ζ ) + G (ζ ) µ, (4)
fundamentally different from previous results, in the sense that
where the functions F : R2n → R2n , G : R2n → R2n×m , and the
a different HJB equation must be solved and its solution, i.e. the
control µ ∈ Rm are defined as
feedback component of the controller, is a time-varying function
f (e + xd ) − hd (xd ) + g (e + xd ) ud (xd )
 
of the tracking error. In particular, this paper addresses the techni-
F (ζ ) , ,
cal obstacles that result from the time-varying nature of the opti- hd ( x d )
mal control problem by including the partial derivative of the value
g (e + xd )
 
function with respect to the desired trajectory in the HJB equation, G (ζ ) , , µ , u − ud . (5)
and by using a system transformation to convert the problem into 0
a time-invariant optimal control problem in such a way that the Local Lipschitz continuity of f and g, the fact that f (0) = 0, and
resulting value function is a time-invariant function of the trans- Assumption 2 imply that F (0) = 0 and F is locally Lipschitz.
formed states, and hence, lends itself to approximation using a NN. The objective of the optimal control problem is to design a pol-
A Lyapunov-based analysis is used to prove ultimately bounded icy µ∗ : R2n → Rm ∈ Ψ such that the control law µ = µ∗ (ζ )
tracking and that the enacted controller approximates the optimal minimizes the cost functional
controller. Simulation results are presented to demonstrate the ap-  ∞
plicability of the presented technique. To gauge the performance of J (ζ , µ) , r (ζ (ρ) , µ (ρ)) dρ,
the proposed method, a comparison with a numerical optimal so- 0

lution is presented. subject to the dynamic constraints in (4), where Ψ is the set of ad-
For notational brevity, unless otherwise specified, the domain missible policies (Beard et al., 1997), and r : R2n × Rm → R≥0 is
of all the functions is assumed to be R≥0 . Furthermore, time- the local cost defined as
dependence is suppressed while denoting trajectories of dynami- r (ζ , µ) , ζ T Q ζ + µT Rµ. (6)
cal systems. For example, the trajectory x : R≥0 → Rn is defined by
m×m
abuse of notation as x ∈ Rn , and referred to as x instead of x (t ), and In (6), R ∈ R is a positive definite symmetric matrix of con-
unless otherwise specified, an equation of the form f + h (y, t ) = stants, and Q ∈ R2n×2n is defined as
g (x) is interpreted as f (t ) + h (y (t ) , t ) = g (x (t )) for all t ∈ R≥0 .  
Q 0n×n
Q , , (7)
0n×n 0n×n
2. Formulation of time-invariant optimal control problem
where Q ∈ Rn×n is a positive definite symmetric matrix of con-
stants with the minimum eigenvalue q ∈ R>0 , and 0n×n ∈ Rn×n is
Consider a class of nonlinear control affine systems
a matrix of zeros. For brevity of notation, let (·)′ denote ∂ (·) /∂ζ .
ẋ = f (x) + g (x) u,
3. Approximate optimal solution
where x ∈ Rn is the state, and u ∈ Rm is the control input. The func-
tions f : Rn → Rn and g : Rn → Rn×m are locally Lipschitz and
Assuming that a minimizing policy exists and that the optimal
f (0) = 0. The control objective is to track a bounded continuously
value function V ∗ : R2n → R≥0 defined as
differentiable signal xd ∈ Rn . To quantify this objective, a tracking
error is defined as e , x − xd . The open-loop tracking error dynam- ∞
ics can then be expressed as V (ζ ) ,
∗
min r (φ µ (τ ; t , ζ ) , µ (τ )) dτ (8)
µ(τ )|τ ∈R≥t
ė = f (x) + g (x) u − ẋd . (1) t

is continuously differentiable, the HJB equation for the optimal

The following assumptions are made to facilitate the formulation
control problem can be written as
of an approximate optimal tracking controller.
H ∗ = V ∗′ (ζ ) F (ζ ) + G (ζ ) µ∗ (ζ ) + r ζ , µ∗ (ζ ) = 0,
   
(9)
Assumption 1. The function g is bounded, the matrix g (x) has full
for all ζ , with the boundary condition V (0) = 0, where H de- ∗ ∗
column rank for all x ∈ Rn , and the function g + : Rn → Rm×n
 −1 notes the Hamiltonian, and µ∗ : R2n → Rm denotes the optimal
+ T T
defined as g , g g g is bounded and locally Lipschitz. policy. In (8) φ µ (τ ; t , ζ ) denotes the trajectory of (4) under the
controller µ starting at initial time t and initial state ζ . For the lo-
Assumption 2. The desired trajectory is bounded such that ∥xd ∥ ≤ cal cost in (6) and the dynamics in (4), the optimal policy can be
d ∈ R, and there exists a locally Lipschitz function hd : Rn → obtained in closed-form as (Kirk, 2004)
Rn such that ẋd = hd (xd ) and g (xd ) g + (xd ) (hd (xd ) − f (xd )) =
1
hd (xd ) − f (xd ) , ∀t ∈ R≥t0 .
T
µ∗ (ζ ) = − R−1 GT (ζ ) V ∗′ (ζ ) .

(10)
2
42 R. Kamalapurkar et al. / Automatica 51 (2015) 40–48

The value function V ∗ can be represented using a NN with N neu- Assumption 3. The regressor ψ : R≥0 → RN defined as ψ ,
rons as √ ω is persistently exciting (PE). Thus, there exist T , ψ > 0
1+νωT Γ ω
V ∗ (ζ ) = W T σ (ζ ) + ϵ (ζ ) , (11)
 t +T
such that ψ I ≤ t
ψ (τ ) ψ (τ )T dτ .3
N
where W ∈ R is the constant ideal weight matrix bounded above Using Assumption 3 and Corollary 4.3.2 in Ioannou and Sun (1996)
by a known positive constant W̄ ∈ R in the sense that ∥W ∥ ≤ W̄ , it can be concluded that
σ : R2n → RN is a bounded continuously differentiable nonlinear
activation function, and ϵ : R2n → R is the function reconstruc- ϕ IN ×N ≤ Γ ≤ ϕ IN ×N , ∀t ∈ R≥0 (19)
tion error (Hornik, Stinchcombe, & White, 1990; Lewis, Selmic, &
where ϕ, ϕ ∈ R are constants such that 0 < ϕ < ϕ .4 Based on
Campos, 2002).
(19), the regressor vector can be bounded as
Using (10) and (11) the optimal policy can be represented as
1
1 ∥ψ∥ ≤ √ , ∀t ∈ R≥0 . (20)
µ (ζ ) = − R−1 GT (ζ ) σ ′T (ζ ) W + ϵ ′T (ζ ) .
∗ νϕ
 
(12)
2
For notational brevity, state-dependence of the functions hd , F , G,
Based on (11) and (12), the NN approximations to the optimal value
V ∗ , µ∗ , σ , and ϵ is suppressed hereafter.
function and the optimal policy are given by
  Using (9), (15), and (16), an unmeasurable form of the BE can be
written as
V̂ ζ , Ŵc = ŴcT σ (ζ ) ,
1 1
  1 δ = −W̃cT ω + W̃aT Gσ W̃a + ϵ ′ Gϵ ′T
µ ζ , Ŵa = − R−1 GT (ζ ) σ ′T (ζ ) Ŵa , (13) 4 4
2 1 T ′ ′T
+ W σ Gϵ − ϵ ′ F , (21)
where Ŵc ∈ RN and Ŵa ∈ RN are estimates of the ideal neural net- 2
work weights W . The use of two separate sets of weight estimates where G , GR−1 GT and Gσ , σ ′ GR−1 GT σ ′T . The weight estimation
Ŵa and Ŵc for W is motivated by the fact that the Bellman error errors for the value function and the policy are defined as W̃c ,
(BE) is linear with respect to the value function weight estimates W − Ŵc and W̃a , W − Ŵa , respectively.
and nonlinear with respect to the policy weight estimates. Use of
a separate set of weight estimates for the value function facilitates
4. Stability analysis
least squares-based adaptive updates.
The controller is obtained from (2), (5), and (13) as
Before stating the main result of the paper, three supplemen-
1 tary technical lemmas are stated. To facilitate the discussion, let
u = − R−1 GT (ζ ) σ ′T (ζ ) Ŵa + gd+ (hd (xd ) − fd ) . (14) Y ∈ R2n+2N be a compact set, and let Z , Y ∩ Rn+2N . Using
2
the universal approximation property of NNs, on the compact set
Using the approximations µ and V̂ for µ∗ and V ∗ in (9), respec- Y ∩ R2n , the NN approximation  errors can be bounded such that
sup |ϵ (ζ )| ≤ ϵ̄ and sup ϵ ′ (ζ ) ≤ ϵ̄ ′ , where ϵ̄ ∈ R and ϵ̄ ′ ∈ R

tively, the error between the approximate and the optimal Hamil-
tonian, called the BE δ ∈ R, is given in a measurable form by are positive constants, and there exists a positive constant LF ∈ R
     such that5 sup ∥F (ζ )∥ ≤ LF ∥ζ ∥. Using Assumptions 1 and 2 the
δ , V̂ ′ ζ , Ŵc ζ̇ + r ζ , µ ζ , Ŵa . (15) following bounds are developed on the compact set Y ∩ R2n to aid
the subsequent stability analysis:
t
The value function weights are updated to minimize 0 δ 2 (ρ) dρ  ϵ WTσ′
 ′  
Gϵ  + ϵ̄ ′ LF ∥xd ∥ ≤ ι1 , ∥Gσ ∥ ≤ ι2 ,

′T 
using a normalized least squares update law1 with an exponential  4 + 2

forgetting factor as (Ioannou & Sun, 1996) 
1 T

ϵ Gϵ  ≤ ι3 ,  W Gσ + 1 ϵ ′ Gσ ′T  ≤ ι4 ,
 ′ ′T  
˙ ω 2 2
Ŵ c = −ηc Γ δ,
 
(16)
1 + νωT Γ ω  
 1 ′ ′T 1 T ′ ′T 
 ϵ Gϵ + W σ Gϵ  ≤ ι5 ,
ωωT (22)
 
,
4 2
Γ̇ = −ηc −λΓ + Γ Γ (17)

1 + νωT Γ ω
where ι1 , ι2 , ι3 , ι4 , ι5 ∈ R are positive constants.
where ν, ηc ∈ R are constant positive adaptation gains, ω : R≥0 →
RN is defined as ω , σ ′ (ζ ) ζ̇ , and λ ∈ (0, 1) is the constant for- 4.1. Supporting lemmas
getting factor for the estimation gain matrix Γ ∈ RN ×N . The policy
weights are updated to follow the critic weights2 as The contribution in the previous section was the development
of a transformation that enables the optimal policy and the optimal
·  
Ŵ a = −ηa1 Ŵa − Ŵc − ηa2 Ŵa , (18)

where ηa1 , ηa2 ∈ R are constant positive adaptation gains. The fol- 3 The regressor is defined here as a trajectory indexed by time. It should be noted
lowing assumption facilitates the stability analysis using PE. that different initial conditions result in different regressor trajectories; hence, the
constants T and ψ depend on the initial values of ζ and Ŵa . Hence, the final result
is not uniform in the initial conditions.
4 Since the evolution of ψ is dependent on the initial values of ζ and Ŵ , the
a
1 The least-squares approach is motivated by faster convergence. With minor constants ϕ and ϕ depend on the initial values of ζ and Ŵa .
modifications to the stability analysis, the result can also be established for a 5 Instead of using the fact that locally Lipschitz functions on compact sets are
gradient descent update law. Lipschitz, it is possible to bound the function F as ∥F (ζ )∥ ≤ ρ (∥ζ ∥) ∥ζ ∥, where
2 The least-squares approach cannot be used to update the policy weights ρ : R≥0 → R≥0 is non-decreasing. This approach is feasible and results in
because the BE is a nonlinear function of the policy weights. additional gain conditions.
R. Kamalapurkar et al. / Automatica 51 (2015) 40–48 43

value function to be expressed as a time-invariant function of ζ . Proof. The proof is omitted due to space constraints, and is avail-
The use of this transformation presents a challenge in the sense able in Kamalapurkar, Dinh, Bhasin, and Dixon (2013).
that the optimal value function, which is used as the Lyapunov T
Lemma 3. Let Z , eT W̃cT W̃aT , and suppose that Z (τ ) ∈ Z,

function for the stability analysis, is not a positive definite function
of ζ , because the matrix Q is positive semi-definite. In this section, for all τ ∈ [t , t + T ]. Then, the critic weights satisfy
this technical obstacle is addressed by exploiting the fact that the  t +T  t +T
 T 2
   2
time-invariant optimal value function V ∗ : R2n → R can be − W̃c ψ  dτ ≤ −ψϖ7 W̃c  + ϖ8
 
∥e ∥2 d τ
interpreted as a time-varying map Vt∗ : Rn × R≥0 → R, such that t t
 t +T  4
+ 3ι2
2
W̃a (σ ) dσ + ϖ9 T ,
 
e
 
Vt∗ (e, t ) = V ∗ (23)
xd (t ) t

ν2 ϕ2
for all e ∈ Rn and for all t ∈ R≥0 . Specifically, the time-invariant where ϖ7 =  , ϖ8 = 3ϵ̄ ′2 L2F , and ϖ9 = 2(ι25 +
2 ν 2 ϕ 2 +ηc2 ϕ 2 T 2
form facilitates the development of the approximate optimal pol-
icy, whereas the equivalent time-varying form can be shown to be ϵ̄ ′2 L2F d2 ).
a positive definite and decrescent function of the tracking error. In
Proof. The proof is omitted due to space constraints, and is
the following, Lemma 1 is used to prove that Vt∗ : Rn × R≥0 → R is
available in Kamalapurkar et al. (2013).
positive definite and decrescent, and hence, a candidate Lyapunov
function.
4.2. Gain conditions and gain selection
Lemma 1. Let Ba denote a closed ball around the origin with the
radius a ∈ R>0 . The optimal value function Vt∗ : Rn × R≥0 → R The following section details sufficient gain conditions derived
satisfies the following properties based on a stability analysis performed using the candidate
Lyapunov function VL : Rn+2N × R≥0 → R defined as VL (Z , t ) ,
Vt∗ (e, t ) ≥ v (∥e∥) , (24a) Vt∗ (e, t ) + 12 W̃cT Γ −1 W̃c + 12 W̃aT W̃a . Using Lemma 1 and (19),
Vt (0, t ) = 0,
∗
(24b)
vl (∥Z ∥) ≤ VL (Z , t ) ≤ vl (∥Z ∥) , (25)
Vt (e, t ) ≤ v (∥e∥) ,
∗
(24c)
∀Z ∈ Bb , ∀t ∈ R≥0 , where vl : [0, b] → R≥0 and vl : [0, b] → R≥0
∀t ∈ R≥0 and ∀e ∈ Ba where v : [0, a] → R≥0 and v : [0, a] → R≥0 are class K functions, and Bb ⊂ Rn+2N denotes a ball of radius
are class K functions. b ∈ R>0 around the origin, containing Z.
Proof. See Appendix. To facilitate the discussion, define ηa12 , ηa1 +ηa2 , Z , [eT W̃cT W̃aT ]T,

ι , ( a2 4 ) + 2η (ι )2 + 1 ι , ϖ , 6 a12
2
η W +ι ϖ η +2ϖ2 q+ηc ϖ9
T + ι,
Lemma 2. Let Z , eT W̃cT W̃aT , and suppose that Z (τ ) ∈ Z, ηa12 c 1 4 3 10

8

for all τ ∈ [t , t + T ]. Then, the NN weights and the tracking errors and ϖ11 , 16 1
min(ηc ψϖ7 , 2ϖ0 qT , ϖ3 ηa12 T ). Let Z0 ∈ R≥0
satisfy denote a known constant bound on the initial condition such that
∥Z (t0 )∥ ≤ Z0 , and let
− inf ∥e (τ )∥2
τ ∈[t ,t +T ]      
ϖ10 T
Z , vl vl max Z0 , + ιT .
 2 −1
≤ −ϖ0 sup ∥e (τ )∥2 + ϖ1 T 2 sup W̃a (τ ) + ϖ2 (26)
ϖ11
 
τ ∈[t ,t +T ] τ ∈[t ,t +T ]
The sufficient gain conditions for the subsequent Theorem 1 are
 2  2
− inf W̃a (τ ) ≤ −ϖ3 sup W̃a (τ )
   
τ ∈[t ,t +T ] τ ∈[t ,t +T ]
given by6
  
ηc ι2
 2
+ ϖ4 Z
inf W̃c (τ ) + ϖ5 sup ∥e (τ )∥2 + ϖ6 , ηa12 > max ηa1 ξ2 + , 3η ι 2
,
 
τ ∈[t ,t +T ] c 2Z
τ ∈[t ,t +T ] 4 νϕ
where ηa1 2ϖ4 ηa12
ξ1 > 2ϵ̄ ′ LF , ηc > , ψ> T,
λγ ξ2 ηc ϖ7
 
1 − 6nT 2 L2F 3n 2
ϖ0 = , ϖ1 = sup gR−1 GT σ ′T  ,

2 4 ϖ5 ηa12 1
t
 
 2 q > max , ηc ϖ8 , ηc LF ϵ̄ ′ ξ1 ,
ϖ0
 
2
dLF + sup ggd+ (hd − fd ) − 12 gR−1 GT σ ′T W − hd 

3n2 T 2
t
ϖ2 = , νϕ

1 1
n T < min √ ,√ , √ ,
1 − 6N (ηa1 + ηa2 ) T 2
6N ηa12 6N η ϕ 2 nLF
 2

ϖ3 = ,  c
ηa12

2
6N ηa1
2
T2 , (27)
ϖ4 =  6N ηa12
3
+ 8qϖ1
 2  ,
1 − 6N (ηc ϕ T )2 / νϕ
where ξ1 , ξ2 ∈ R are known adjustable positive constants. Fur-
2 2
thermore, the compact set Z satisfies the sufficient condition
18 ηa1 N ηc ϕ ϵ̄ ′ LF T
 
ϖ5 =   2  , Z ≤ r, (28)
νϕ 1 − 6N (ηc ϕ T )2 / νϕ
2
18 N ηa1 ηc ϕ ϵ̄ ′ LF d + ι5 T 2
  
2
ϖ6 =  2  + 3N ηa2 W T .

 6 Similar conditions on ψ and T can be found in PE-based adaptive control in the

νϕ 1 − 6N (ηc ϕ T )2 / νϕ presence of bounded or Lipschitz uncertainties (cf. Misovec, 1999 and Narendra &
Annaswamy, 1986).
44 R. Kamalapurkar et al. / Automatica 51 (2015) 40–48

where r , 21 supz ,y∈Z ∥z − y∥ denotes the radius of Z. Since the 5. Simulation

Lipschitz constant and the bounds on NN approximation error de-
pend on the size of the compact set Z, the constant Z depends on Simulations are performed on a two-link manipulator to
r; hence, feasibility of the sufficient condition in (28) is not appar- demonstrate the ability of the presented technique to approxi-
ent. Algorithm 1 in the Appendix details an iterative gain selection mately optimally track a desired trajectory. The two link robot ma-
process in order to ensure satisfaction of the sufficient condition nipulator is modeled using Euler–Lagrange dynamics as
in (28).
M q̈ + Vm q̇ + Fd q̇ + Fs = u, (30)
 T  T
4.3. Main result where q = q1 q2 and q̇ = q̇1 q̇2 are the angular positions in
radians and the angular velocities in radian/s respectively. In (30),
M ∈ R2×2 denotes the inertia matrix, and Vm ∈ R2×2 denotes the
Theorem 1. Provided that the sufficient conditions in (27) and 
p1 + 2p3 c2 p2 + p3 c2
(28) are satisfied and Assumptions 1–3 hold, the controller in (14) and centripetal–Coriolis matrix given by M , p2 + p3 c2 p2
,
−p3 s2 q̇2 −p3 s2 (q̇1 + q̇2 )
 
the update laws in (16)–(18) guarantee that the tracking error is
Vm , , where c2 = cos (q2 ) , s2 = sin (q2 ) ,
ultimately bounded, and the error ∥µ (t ) − µ∗ (ζ (t ))∥ is ultimately p3 s2 q̇1 0

bounded as t → ∞. p1 = 3.473 kg m , p2 = 0.196 kg m2 , and p3 = 0.242 kg m2 ,

and Fd = diag 5.3, 1.1 N m s and Fs (q̇) = [8.45 tanh (q̇1 ) ,

Proof. The time derivative of VL is 2.35 tanh (q̇2 )]T N m are the models for the static and the dynamic
˙ ˙ friction, respectively.
V̇L = V ∗′ F + V ∗′ Gµ + W̃cT Γ −1 W̃ c − W̃aT Ŵ a The objective is to find a policy µ that ensures that the state
1 x , [q1 , q2 , q̇1 , q̇2 ]T tracks the desired trajectory xd (t ) =
− W̃cT Γ −1 Γ̇ Γ −1 W̃c .
2 [0.5 cos (2t ) , 0.33  (3t ) , − sin(2t ) , − sin (3t )] , while mini-
 ∞cos
T

mizing the cost 0 eT Qe + µT µ dt, where Q = diag[10, 10,

Provided the sufficient conditions in (27) are satisfied, (16), (21),
2, 2]. Using (2)–(5) and the definitions
the bounds in (20)–(22), and the facts that V ∗′ F = −V ∗′ Gµ∗ −
r (ζ , µ∗ ) and V ∗′ G = −2µ∗T R yield     T T
x
2 η  2 f , x3 , x4 , M −1
(−Vm − Fd ) 3 − Fs ,
q  1 a12 
x4
V̇L ≤ − ∥e∥ − ηc W̃cT ψ  −
2
W̃a  + ι. (29)
  
2 8 4
  T T
,
T T
0, , 0, , M −1
 
g , 0 0
The inequality in (29) is valid provided Z (t ) ∈ Z.  
0, 0 , M (xd ) ,
 t +T T T
∥e(τ )∥2 dτ ≤ gd+ , 0, 0 ,

Integrating (29), using the facts that − t
 t +T
−T infτ ∈[t ,t +T ] ∥e(τ )∥ and − t
2
∥W̃a (τ )∥ dτ ≤ −T infτ ∈[t ,t +T ]
2
T
hd , xd3 , xd4 , −4xd1 , −9xd2 ,

(31)
∥W̃a (τ )∥2 , Lemmas 2 and 3, and the gain conditions in (27) yields
the optimal tracking problem can be transformed into the time-
VL (Z (t + T ) , t + T ) − VL (Z (t ) , t )
invariant form in (5).
ηc ψϖ7  2 ϖ0 qT
In this effort, the basis chosen for the value function approxi-
≤− W̃c (t ) − ∥e (t )∥2
 
16 8 mation is a polynomial basis with 23 elements given by
ϖ3 ηa12 T  2
− W̃a (t ) + ϖ10 T , 1
 
16 σ (ζ ) = ζ12 ζ22 ζ1 ζ3 ζ1 ζ4 ζ2 ζ3 ζ2 ζ4 ζ12 ζ22 ζ12 ζ52
2
provided Z (τ ) ∈ Z, ∀τ ∈ [t , t + T ]. Thus,
 VL (Z (t + T ) , t + T ) − ζ12 ζ62 ζ12 ζ72 ζ12 ζ82 ζ22 ζ52 ζ22 ζ62 ζ22 ζ72 ζ22 ζ82 ζ32 ζ52
ϖ10 T
VL (Z (t ) , t ) < 0 provided ∥Z (t )∥ > ϖ11
and Z (τ ) ∈ Z, ∀τ ∈ T
ζ32 ζ62 ζ32 ζ72 ζ32 ζ82 ζ42 ζ52 ζ42 ζ62 ζ42 ζ72 ζ42 ζ82 . (32)
[t , t + T ]. The bounds on the Lyapunov function in (25) yield
(Z (t + T) , t + T ) − VL (Z (t ) , t ) < 0 provided VL (Z (t ) , t ) >
VL  The control gains are selected as ηa1 = 5, ηa2 = 0.001, ηc = 1.25,
vl ϖ10 T
and Z (τ ) ∈ Z, ∀τ ∈ [t , t + T ].
λ = 0.001, and ν = 0.005, and the initial conditions are x (0) =
ϖ11 T
1.8 1.6 0 0 , Ŵc (0) = 10 × 123×1 , Ŵa (0) = 6 × 123×1 , and

Since Z (t0 ) ∈ Z, (29) can be used to conclude that V̇L (Z (t0 ), t0 ) Γ (0) = 2000 × I23×23 , where 123×1 is vector of ones. To ensure PE,
≤ ι. The sufficient condition in (28) ensures that vl −1 (VL (Z (t0 ), t0 )+ a probing signal
ιT ) ≤ r; hence,  Z (t ) ∈ Z for all t ∈ [t0 , t0 + T ]. If VL (Z (t0 ) , t0 ) √ √
ϖ10 T
   
> vl ϖ11
, then Z (t ) ∈ Z for all t ∈ [t0 , t0 + T ] 2.55 tanh(2t ) 20 sin 232π t cos 20π t
implies VL (Z (t0 + T ) , t0 + T ) − VL (Z (t0 ) , t0 ) < 0; hence,
 
+ 6 sin 18e2 t + 20 cos (40t ) cos (21t ) 
   
vl −1 (VL (Z (t0 + T ) , t0 + T ) + ιT ) ≤ r. Thus, Z (t ) ∈ Z for all p (t ) = 

  √   √   (33)
t ∈ [t0 + T , t0 + 2T ]. Inductively, the system state is bounded such 0.01 tanh(2t ) 20 sin 132π t cos 10π t 

that supt ∈[0,∞) ∥Z (t )∥ ≤ r and ultimately bounded7 such that   
    + 6 sin (8et ) + 20 cos (10t ) cos (11t )
ϖ 10 T
lim sup ∥Z (t )∥ ≤ vl −1 vl + ιT . is added to the control signal for the first 30 s of the simulation
t →∞ ϖ11 (Vamvoudakis & Lewis, 2010).
It is clear from Fig. 1 that the system states are bounded during
the learning phase and the algorithm converges to a stabilizing
controller in the sense that the tracking errors go to zero when
7 If the regressor ψ satisfies a stronger u-PE assumption (cf. Loría & Panteley, the probing signal is eliminated. Furthermore, Fig. 2 shows that the
2002 and Panteley, Loria, & Teel, 2001), the tracking error and the weight estimation weight estimates for the value function and the policy are bounded
errors can be shown to be uniformly ultimately bounded. and they converge.
R. Kamalapurkar et al. / Automatica 51 (2015) 40–48 45

Fig. 1. State and error trajectories with probing signal.

Fig. 2. Evolution of value function and policy weights.

Fig. 3. Hamiltonian and costate of the numerical solution computed using GPOPS.

The NN weights converge to the following values solution is difficult to obtain for an infinite horizon optimal control
problem, the numerical optimal control problem is solved over a
Ŵc = Ŵa = 83.36 2.37 27.0 2.78 −2.83 0.20 14.13

finite horizon ranging over approximately 5 times the settling time
29.81 18.87 4.11 3.47 6.69 9.71 15.58 4.97 12.42 associated with the slowest state variable. Based on the solution
T obtained using the proposed technique, the slowest settling time
11.31 3.29 1.19 −1.99 4.55 −0.47 0.56 . (34) is estimated to be approximately 20 s. Thus, to approximate the
Note that the last sixteen weights that correspond to the terms infinite horizon solution, the numerical solution is computed over
containing the desired trajectories ζ5 , . . . , ζ8 are non-zero. Thus, a 100 s time horizon using 300 collocation points.
the resulting value function V and the resulting policy µ depend As seen in Fig. 3, the Hamiltonian of the numerical solution is
on the desired trajectory, and hence, are time-varying functions approximately zero. This supports the assertion that the optimal
of the tracking error. Since the true weights are unknown, a control problem is time-invariant. Furthermore, since the Hamilto-
direct comparison of the weights in (34) with the true weights is nian is close to zero, the numerical solution obtained using GPOPS
not possible. Instead, to gauge the performance of the presented is sufficiently accurate as a benchmark to compare against the ADP-
technique, the state and the control trajectories obtained using based solution obtained using the proposed technique. Note that
the estimated policy are compared with those obtained using in Fig. 3, the costate variables corresponding to the desired tra-
Radau-pseudospectral numerical optimal control computed using jectories are nonzero. Since these costate variables represent the
the GPOPS software (Rao et al., 2010). Since an accurate numerical sensitivity of the cost with respect to the desired trajectories, this
46 R. Kamalapurkar et al. / Automatica 51 (2015) 40–48

problem. The ultimately bounded tracking and estimation result

was established using Lyapunov analysis for nonautonomous sys-
tems. Simulations are performed to demonstrate the applicability
and the effectiveness of the developed method. The accuracy of the
approximation depends on the choice of basis functions and the re-
sult hinges on the system states being PE. Furthermore, computa-
tion of the desired control in (2) requires exact model knowledge.
A solution to the tracking problem without using the desired
control while employing a multi-layer neural network that can
approximate the basis functions remains a future challenge. In
adaptive control, it is generally possible to formulate the control
problem such that PE along the desired trajectory is sufficient to
achieve parameter convergence. In the ADP-based tracking prob-
lem, PE along the desired trajectory would be sufficient to achieve
Fig. 4. Control trajectories µ (t ) obtained from GPOPS and the developed parameter convergence if the BE can be formulated in terms of the
technique. desired trajectories. Achieving such a formulation is not trivial, and
is a subject for future research.

Appendix

The proofs for the technical lemmas and the gain selection al-
gorithm are detailed in this section.

Algorithm for selection of NN architecture and learning gains

Since the gains depend on the initial conditions and on the

compact sets used for function approximation and the Lipschitz
bounds, an iterative algorithm is developed to select the gains. In
Algorithm 1, the notation {ϖ }i for any parameter ϖ denotes the
value of ϖ computed in the ith iteration. Algorithm 1 ensures sat-
isfaction of the sufficient condition in (28).
Fig. 5. Tracking error trajectories e (t ) obtained from GPOPS and the developed
technique.
Algorithm 1 Gain Selection
First iteration:
further supports the assertion that the optimal value function de-
Given Z0 ∈ R≥0 such that ∥Z (t0 )∥ < Z0 , let Z1 = {ϱ ∈ Rn+2{N }1 |
pends on the desired trajectory, and hence, is a time-varying func-
∥ϱ∥ ≤ β1 vl −1 (vl (Z0 ))} for some β1 > 1. Using Z1 , compute the
tion of the tracking error.
bounds in (22) and (26), and select the gains according to (27). If
Figs. 4 and 5 show the control and the tracking error trajecto- Z 1 ≤ β1 vl −1 (vl (∥Z0 ∥)) , set Z = Z1 and terminate.
 
ries obtained from the developed technique (dashed lines) plotted Second iteration:
alongside the numerical solution obtained using GPOPS (solid
If {Z }1 > β1 vl −1 (vl (∥Z0 ∥)), let Z2 , {ϱ ∈ Rn+2{N }1 | ∥ϱ∥ ≤
lines). The trajectories obtained using the developed technique are
close to the numerical solution. The inaccuracies are a result of the β2 {Z }1 }. Using Z2 , compute the bounds
 in(22)
 and (26) and select
facts that the set of basis functions in (32) is not exact, and the pro- the gains according to (27). If Z 2 ≤ Z 1 , set Z = Z2 and
posed method attempts to find the weights that generate the least terminate.
total cost for the given set of basis functions. The accuracy of the Third
  iteration:
If Z 2 > Z 1 , increase the number of NN neurons to {N }3 to yield
 
approximation can be improved by choosing a more appropriate
a lower function approximation error ϵ̄ ′ such that {LF }2 ϵ̄ ′ 3 ≤
   
set of basis functions, or at an increased computational cost, by
3
adding more basis functions to the existing set in (32). The total {LF }1 ϵ̄ 1 . The increase in the number of NN neurons ensures that
 ′
 100 
e (t )T Qe (t ) + µ (t )T Rµ (t ) dt obtained using the nu-

cost 0 {ι}3 ≤ {ι}1 . Furthermore, the assumption that the PE interval {T }3
∞
merical solution is found to be 75.42 and the total cost 0
(e (t )T is small enough such that ≤ {T }1 {LF }1 and {N }3 {T }3 ≤
 {LF }2 {T }3 
Qe (t ) + µ (t ) Rµ (t ))dt obtained using the developed method is {T }1 {N }1 ensures that ϖ ϖ10
T 10
 
ϖ11 3
≤ ϖ11 1
, and hence, Z 3 ≤
found to be 84.31. Note that from Figs. 4 and 5, it is clear that both
β2 Z 1 . Set Z = ϱ ∈ Rn+2{N }3 | ∥ϱ∥ ≤ β2 Z 1 and terminate.
     
the tracking error and the control converge to zero after approxi-
mately 20 s, and hence, the total cost obtained from the numerical
solution is a good approximation of the infinite horizon cost.
Proof of Lemma 1
6. Conclusion
The following supporting technical lemma is used to prove
An ADP-based approach using the policy evaluation and policy Lemma 1.
improvement architecture is presented to approximately solve the
infinite horizon optimal tracking problem for control affine non- Lemma 4. Let D ⊆ Rn contain the origin and let Ξ : D × R≥0 →
linear systems with quadratic cost. The problem is solved by trans- R≥0 be positive definite. If t −→ Ξ (x, t ) is uniformly bounded for
forming the system to convert the tracking problem that has a all x ∈ D and if x −→ Ξ (x, t ) is continuous, uniformly in t, then Ξ
time-varying value function, into a time-invariant optimal control is decrescent in D.
R. Kamalapurkar et al. / Automatica 51 (2015) 40–48 47

Proof. Since t −→ Ξ (x, t ) is uniformly bounded, for all x ∈ D, Dierks, T., & Jagannathan, S. (2009). Optimal tracking control of affine nonlinear
supt ∈R≥0 {Ξ (x, t )} exists and is unique for all x ∈ D. Let the discrete-time systems with unknown internal dynamics. In Proc. IEEE conf. decis.
control (pp. 6750–6755).
function α : D → R≥0 be defined as
Dierks, T., & Jagannathan, S. (2010). Optimal control of affine nonlinear continuous-
time systems. In Proc. Am. control conf. (pp. 1568–1573).
α (x) , sup {Ξ (x, t )} . (35)
t ∈R≥0 Doya, K. (2000). Reinforcement learning in continuous time and space. Neural
Computation, 12(1), 219–245.
Since x → Ξ (x, t ) is continuous, uniformly in t , ∀ε > 0, ∃ς (x) > Hornik, K., Stinchcombe, M., & White, H. (1990). Universal approximation of an
0 such that ∀y ∈ D, unknown mapping and its derivatives using multilayer feedforward networks.
Neural Networks, 3(5), 551–560.
dD×R≥0 ((x, t ) , (y, t )) < ς (x) Ioannou, P., & Sun, J. (1996). Robust adaptive control. Prentice Hall.

H⇒ dR≥0 (Ξ (x, t ) , Ξ (y, t )) < ε, (36) Jiang, Y., & Jiang, Z.-P. (2012). Computational adaptive optimal control for
continuous-time linear systems with completely unknown dynamics. Automat-
where dM (·, ·) denotes the standard Euclidean metric on the met- ica, 48(10), 2699–2704.

ric space M. By the definition of dM (·, ·), dD×R≥0 ((x, t ) , (y, t )) = Kamalapurkar, R., Dinh, H., Bhasin, S., & Dixon, W. (2013). Approximately optimal
trajectory tracking for continuous time nonlinear systems. arXiv:1301.7664.
dD (x, y). Using (36),
Khalil, H. K. (2002). Nonlinear systems (3rd ed.). Prentice Hall.
dD (x, y) < ς (x) H⇒ |Ξ (x, t ) − Ξ (y, t )| < ε. (37) Kirk, D. (2004). Optimal control theory: an introduction. Dover.
Lewis, F. L., Selmic, R., & Campos, J. (2002). Neuro-fuzzy control of industrial systems
Given the fact that Ξ is positive, (37) implies Ξ (x, t ) < Ξ (y, t )+ε with actuator nonlinearities. Philadelphia, PA, USA: Society for Industrial and
and Ξ (y, t ) < Ξ (x, t )+ε which from (35) implies α (x) < α (y)+ Applied Mathematics.
ε and α (y) < α (x)+ε , and hence, from (37), dD (x, y) < ς (x) H⇒ Loría, A., & Panteley, E. (2002). Uniform exponential stability of linear time-varying
|α (x) − α (y)| < ε. Since Ξ is positive definite, (35) can be used systems: revisited. Systems & Control Letters, 47(1), 13–24.
to conclude α (0) = 0. Thus, Ξ is bounded above by a continuous Luo, Y., & Liang, M. (2011). Approximate optimal tracking control for a class
positive definite function; hence, Ξ is decrescent in D. of discrete-time non-affine systems based on GDHP algorithm. In IWACI int.
workshop adv. comput. intell. (pp. 143–149).
Based on the definitions in (8)–(7) and (23), Vt∗ (e, t )> 0, ∀t ∈ Misovec, K. M. (1999). Friction compensation using adaptive non-linear control
T with persistent excitation. International Journal of Control, 72(5), 457–479.
R≥0 and ∀e ∈ Ba \ {0}. The optimal value function V ∗ 0, xTd

is Narendra, K., & Annaswamy, A. (1986). Robust adaptive control in the presence of
the cost incurred when starting with e = 0 and following the op- bounded disturbances. IEEE Transactions on Automatic Control, 31(4), 306–315.
timal policy thereafter for an arbitrary desired trajectory xd . Sub- Panteley, E., Loria, A., & Teel, A. (2001). Relaxed persistency of excitation for uniform
stituting x (t0 ) = xd (t0 ), µ (t0 ) = 0 and (2) in (4) indicates that asymptotic stability. IEEE Transactions on Automatic Control, 46(12), 1874–1886.

ė (t0 ) = 0. Thus, when starting from e = 0, a policy that is identi- Park, Y. M., Choi, M. S., & Lee, K. Y. (1996). An optimal tracking neuro-controller
for nonlinear dynamic systems. IEEE Transactions on Neural Networks, 7(5),
cally zero satisfies the
 dynamic constraints
 in (4). Furthermore, the 1099–1110.
T
0, xTd (t0 ) = 0, ∀xd (t0 ) which, from (23),

optimal cost is V ∗ Rao, A. V., Benson, D. A., Darby, C. L., Patterson, M. A., Francolin, C., & Huntington,
G. T. (2010). Algorithm 902: GPOPS, a MATLAB software for solving multiple-
implies (24b). Since the optimal value function Vt∗ is strictly posi- phase optimal control problems using the Gauss pseudospectral method. ACM
tive everywhere but at e = 0 and is zero at e = 0, Vt∗ is a positive Transactions on Mathematical Software, 37(2), 1–39.
definite function. Hence, Lemma 4.3 in Khalil (2002) can be invoked Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: an introduction.
to conclude that there exists a class K function v : [0, a] → R≥0 Cambridge, MA, USA: MIT Press.
such that v (∥e∥) ≤ Vt∗ (e, t ) , ∀t ∈ R≥0 and ∀e ∈ Ba . Vamvoudakis, K., & Lewis, F. (2010). Online actor-critic algorithm to solve the
continuous-time infinite horizon optimal control problem. Automatica, 46(5),
Admissibility of the optimal policy implies that V ∗ (ζ ) is boun-
878–888.
ded over all compact subsets K ⊂ R2n . Since the desired trajectory Vrabie, D., & Lewis, F. (2009). Neural network approach to continuous-time direct
is bounded, t −→ Vt∗ (e, t ) is uniformly bounded for all e ∈ Ba . adaptive optimal control for partially unknown nonlinear systems. Neural
To establish that e −→ Vt∗ (e, t ) is continuous, uniformly in t, let Networks, 22(3), 237–246.
χeo ⊂ Rn be a compact set containing eo . Since xd is bounded, Wang, D., Liu, D., & Wei, Q. (2012). Finite-horizon neuro-optimal tracking
xd ∈ χxd , where χxd ⊂ Rn is compact. Since V ∗ : R2n → R≥0 control for a class of discrete-time nonlinear systems using adaptive dynamic
programming approach. Neurocomputing, 78(1), 14–22.
is continuous, and χeo × χxd ⊂ R2n is compact, V ∗ is uniformly
Zhang, H., Cui, L., Zhang, X., & Luo, Y. (2011). Data-driven robust approximate
continuous on χeo × χxd . Thus, ∀ε > 0, ∃ς > 0, such that ∀([eTo , optimal tracking control for unknown general nonlinear systems using
xTd ]T , [eT1 , xTd ]T ) ∈ χeo × χxd , dχeo ×χxd ([eTo , xTd ]T , [eT1 , xTd ]T ) < ς H⇒ adaptive dynamic programming method. IEEE Transactions on Neural Networks,
22(12), 2226–2236.
dR (V ∗ ([eTo , xTd ]T ), V ∗ ([eT1 , xTd ]T )) < ε . Thus, for each eo ∈ Rn , there Zhang, H., Luo, Y., & Liu, D. (2009). Neural-network-based near-optimal control for
exists a ς > 0 independent of xd , that establishes the continuity a class of discrete-time affine nonlinear systems with control constraints. IEEE
of e −→ V ∗ ([eT , xTd ]T ) at eo . Thus, e −→ V ∗ ([eT , xTd ]T ) is con- Transactions on Neural Networks, 20(9), 1490–1503.
tinuous, uniformly in xd , and hence, using (23) e −→ Vt∗ (e, t ) is Zhang, H., Wei, Q., & Luo, Y. (2008). A novel infinite-time optimal tracking control
continuous, uniformly in t. Using Lemma 4 and (24a) and (24b), scheme for a class of discrete-time nonlinear systems via the greedy hdp
iteration algorithm. IEEE Transactions on Systems, Man and Cybernetics, Part B
there exists a positive definite function α : Rn → R≥0 such that (Cybernetics), 38(4), 937–942.
Vt∗ (e, t ) < α (e) , ∀ (e, t ) ∈ Rn × R≥0 . Lemma 4.3 in Khalil (2002)
indicates that there exists a class K function v : [0, a] → R≥0 such
that α (e) ≤ v (∥e∥), which implies (24c).
Rushikesh Kamalapurkar received his Bachelor of Tech-
nology degree in Mechanical Engineering from Visves-
References varaya National Institute of Technology, Nagpur, India. He
worked for two years as a Design Engineer at Larsen and
Toubro Ltd., Mumbai, India. He received his Master of Sci-
Abu-Khalaf, M., & Lewis, F. (2002). Nearly optimal HJB solution for constrained input
ence degree and his Doctor of Philosophy degree from the
systems using a neural network least-squares approach. In Proc. IEEE conf. decis. Department of Mechanical and Aerospace Engineering at
control (pp. 943–948). Las Vegas, NV. the University of Florida under the supervision of Dr. War-
Beard, R., Saridis, G., & Wen, J. (1997). Galerkin approximations of the generalized ren E. Dixon. He is currently a postdoctoral researcher with
Hamilton–Jacobi–Bellman equation. Automatica, 33, 2159–2178. the Nonlinear Controls and Robotics lab at the Univer-
Bhasin, S., Kamalapurkar, R., Johnson, M., Vamvoudakis, K., Lewis, F. L., & Dixon, sity of Florida. His research interests include dynamic pro-
W. (2013). A novel actor-critic-identifier architecture for approximate optimal gramming, optimal control, reinforcement learning, and data-driven adaptive con-
control of uncertain nonlinear systems. Automatica, 49(1), 89–92. trol for uncertain nonlinear dynamical systems.
48 R. Kamalapurkar et al. / Automatica 51 (2015) 40–48

Huyen Dinh received a B.S. Degree in Mechatronics from Warren E. Dixon received his Ph.D. in 2000 from the
Hanoi University of Science and Technology, Hanoi, Viet- Department of Electrical and Computer Engineering from
nam in 2006, and M.Eng. and Ph.D. Degrees in Mechanical Clemson University. After completing his doctoral studies
Engineering from University of Florida in 2010 and 2012, he was selected as an Eugene P. Wigner Fellow at Oak
respectively. She currently works as an Assistant Professor Ridge National Laboratory (ORNL). In 2004, he joined the
in the Department of Mechanical Engineering at University University of Florida in the Mechanical and Aerospace
of Transport and Communications, Hanoi, Vietnam. Her Engineering Department. His main research interest has
primary research interest is the development of Lyapunov- been the development and application of Lyapunov-based
based control and applications for uncertain nonlinear sys- control techniques for uncertain nonlinear systems. He
tems. Current research interests include Learning-based has published over 300 refereed papers and several books
Control, Adaptive Control for uncertain nonlinear systems. in this area. His work has been recognized by the 2013
Fred Ellersick Award for Best Overall MILCOM Paper, 2012–2013 University
of Florida College of Engineering Doctoral Dissertation Mentoring Award, 2011
Shubhendu Bhasin received his Ph.D. in 2011 from the American Society of Mechanical Engineers (ASME) Dynamics Systems and Control
Department of Mechanical and Aerospace Engineering Division Outstanding Young Investigator Award, 2009 American Automatic Control
at the University of Florida. He is currently Assistant Council (AACC) O. Hugo Schuck (Best Paper) Award, 2006 IEEE Robotics and
Professor in the Department of Electrical Engineering at Automation Society (RAS) Early Academic Career Award, an NSF CAREER Award
the Indian Institute of Technology, Delhi. His research (2006–2011), 2004 DOE Outstanding Mentor Award, and the 2001 ORNL Early
interests include reinforcement learning-based feedback Career Award for Engineering Achievement. He is an IEEE Control Systems Society
control, approximate dynamic programming, neural (CSS) Distinguished Lecturer. He currently serves as a member of the US Air
network-based control, nonlinear system identification Force Science Advisory Board and as the Director of Operations for the Executive
and parameter estimation, and robust and adaptive con- Committee of the IEEE CSS Board of Governors. He has formerly served as an
trol of uncertain nonlinear systems. associate editor for several journals, and is currently an associate editor for
Automatica and the International Journal of Robust and Nonlinear Control.

Concept Paper in The Different Fields
No ratings yet
Concept Paper in The Different Fields
18 pages
123 Easy Essay
100% (2)
123 Easy Essay
4 pages
OA Attendance Schedule ID 83325
No ratings yet
OA Attendance Schedule ID 83325
6 pages
Chapter 17-EMERGING MANAGEMENT PRACTICES
100% (1)
Chapter 17-EMERGING MANAGEMENT PRACTICES
7 pages
Why We Get Bored and How To Overcome It - Psychology Today
No ratings yet
Why We Get Bored and How To Overcome It - Psychology Today
13 pages
Reinforcement Learning and Optimal Control - Draft Version by Dmitri Bertsekas
No ratings yet
Reinforcement Learning and Optimal Control - Draft Version by Dmitri Bertsekas
268 pages
My Brothers Famous Bottom Takes Off Strong Jeremy Instant Download
No ratings yet
My Brothers Famous Bottom Takes Off Strong Jeremy Instant Download
26 pages
Continuous-Time Stochastic Policy Iteration of Adaptive Dynamic Programming
No ratings yet
Continuous-Time Stochastic Policy Iteration of Adaptive Dynamic Programming
13 pages
Eene Sakha Ea2025
No ratings yet
Eene Sakha Ea2025
14 pages
Optimal Control Exercises
100% (2)
Optimal Control Exercises
79 pages
Adprl Chapter Icis
No ratings yet
Adprl Chapter Icis
43 pages
1 s2.0 S000510981500343X Main
No ratings yet
1 s2.0 S000510981500343X Main
8 pages
2017 - On The Sample Complexity of The Linear Quadratic Regulator
No ratings yet
2017 - On The Sample Complexity of The Linear Quadratic Regulator
43 pages
Tracking Control of Completely Unknown Continuous-Time Systems Via Off-Policy Reinforcement Learning
No ratings yet
Tracking Control of Completely Unknown Continuous-Time Systems Via Off-Policy Reinforcement Learning
13 pages
TIOXIDE TR88 Datasheet
100% (1)
TIOXIDE TR88 Datasheet
2 pages
1 Number Systems
No ratings yet
1 Number Systems
18 pages
Control and Reinforcement Learning
No ratings yet
Control and Reinforcement Learning
6 pages
Approximate Dynamic Programming and Reinforcement Learning - Algorithms, Analysis and An Application
No ratings yet
Approximate Dynamic Programming and Reinforcement Learning - Algorithms, Analysis and An Application
139 pages
2024 Ouput Feedback Linear System Based On ADP 1
No ratings yet
2024 Ouput Feedback Linear System Based On ADP 1
10 pages
Kamala Pur Kar 2016
No ratings yet
Kamala Pur Kar 2016
11 pages
GDD Nonlinear NIPS 2009 Convergent Temporal Difference Learning With Arbitrary Smooth Function Approximation
No ratings yet
GDD Nonlinear NIPS 2009 Convergent Temporal Difference Learning With Arbitrary Smooth Function Approximation
9 pages
2020 05 26 IDN NV UN 001 English
100% (2)
2020 05 26 IDN NV UN 001 English
2 pages
Timber PDF
No ratings yet
Timber PDF
12 pages
Kamalapurkar 2016
No ratings yet
Kamalapurkar 2016
12 pages
Vam Vou Dakis 2009
No ratings yet
Vam Vou Dakis 2009
8 pages
Adaptive Optimal Output Regulation of Unknown Linear Continuous-Time Systems by Dynamic Output Feedback and Value Iteration
No ratings yet
Adaptive Optimal Output Regulation of Unknown Linear Continuous-Time Systems by Dynamic Output Feedback and Value Iteration
11 pages
Linear Quadratic Control Using Model-Free Reinforcement Learning
No ratings yet
Linear Quadratic Control Using Model-Free Reinforcement Learning
16 pages
He, S Et Al (2019) Reinforcement Learning
No ratings yet
He, S Et Al (2019) Reinforcement Learning
10 pages
Optimal Tracking Control of Nonlinear Partially-Unknown Constrained-Input Systems Using Integral Reinforcement Learning
No ratings yet
Optimal Tracking Control of Nonlinear Partially-Unknown Constrained-Input Systems Using Integral Reinforcement Learning
13 pages
Week 1, Lec 1
No ratings yet
Week 1, Lec 1
9 pages
Derongliu 2014
No ratings yet
Derongliu 2014
14 pages
Neural Network Based Online Simultaneous Policy Update Algorithm For Solving The HJI Equation in Nonlinear H Control
No ratings yet
Neural Network Based Online Simultaneous Policy Update Algorithm For Solving The HJI Equation in Nonlinear H Control
12 pages
A Neural RDE Approach For Continuous-Time Non-Markovian Stochastic Control Problems
No ratings yet
A Neural RDE Approach For Continuous-Time Non-Markovian Stochastic Control Problems
11 pages
1 s2.0 S0925231224014486 Main 1 8
No ratings yet
1 s2.0 S0925231224014486 Main 1 8
13 pages
1 s2.0 S1474667016440140 Main
No ratings yet
1 s2.0 S1474667016440140 Main
6 pages
Monte Carlo Beam Search For Actor-Critic Reinforcement Learning in Continuous Control
No ratings yet
Monte Carlo Beam Search For Actor-Critic Reinforcement Learning in Continuous Control
8 pages
Robust Control Design For Zero-Sum Differential Games Problem Based On Off-Policy Reinforcement Learning Technique
No ratings yet
Robust Control Design For Zero-Sum Differential Games Problem Based On Off-Policy Reinforcement Learning Technique
9 pages
Adaptative For Lineare
No ratings yet
Adaptative For Lineare
23 pages
Adaptive Dynamic Programming For Stochastic Systems With State and Control Dependent Noise
No ratings yet
Adaptive Dynamic Programming For Stochastic Systems With State and Control Dependent Noise
12 pages
Writing For Gifted Students: Exercise 1 (Week 1)
No ratings yet
Writing For Gifted Students: Exercise 1 (Week 1)
20 pages
DAA - All Five Units (HandWrittern Notes)
No ratings yet
DAA - All Five Units (HandWrittern Notes)
154 pages
20-IEEE-NNLS-Event-Triggered Optimal Control With Performance Guarantees Using Adaptive Dynamic Programming
No ratings yet
20-IEEE-NNLS-Event-Triggered Optimal Control With Performance Guarantees Using Adaptive Dynamic Programming
13 pages
Curriculum Vita1
No ratings yet
Curriculum Vita1
3 pages
Near-Optimal Control of Dynamical Systems With Neural Ordinary Differential Equations
No ratings yet
Near-Optimal Control of Dynamical Systems With Neural Ordinary Differential Equations
23 pages
Robust Control Scheme For A Class of Uncertain Nonlinear Systems With Completely Unknown Dynamics Using Data-Driven Reinforcement Learning Method
No ratings yet
Robust Control Scheme For A Class of Uncertain Nonlinear Systems With Completely Unknown Dynamics Using Data-Driven Reinforcement Learning Method
34 pages
CHANG 1980 - Ballistic Trajectory Estimation With Angle-Only Measurement
No ratings yet
CHANG 1980 - Ballistic Trajectory Estimation With Angle-Only Measurement
8 pages
Tac 232
No ratings yet
Tac 232
7 pages
Dynamic Programming and Optimal Control
No ratings yet
Dynamic Programming and Optimal Control
62 pages
ACC23 Tutorial Paulson
No ratings yet
ACC23 Tutorial Paulson
12 pages
Data-Driven-Based Sliding-Mode Dynamic Event-Triggered Control
No ratings yet
Data-Driven-Based Sliding-Mode Dynamic Event-Triggered Control
11 pages
Global Adaptive Dynamic Programming For Continuous-Time Nonlinear Systems
No ratings yet
Global Adaptive Dynamic Programming For Continuous-Time Nonlinear Systems
13 pages
2020 ADP Nonlinear System Mismatched Disterbances 2
No ratings yet
2020 ADP Nonlinear System Mismatched Disterbances 2
8 pages
Subject Index - 2017 - Introduction To Optimum Design Fourth Edition
No ratings yet
Subject Index - 2017 - Introduction To Optimum Design Fourth Edition
17 pages
The Effect of Growth Mindset On Mathematical Performance in Algeb
No ratings yet
The Effect of Growth Mindset On Mathematical Performance in Algeb
31 pages
Learning-Based Control of Continuous-Time Systems Using Output Feedback
No ratings yet
Learning-Based Control of Continuous-Time Systems Using Output Feedback
8 pages
SDR 2022 Denmark
No ratings yet
SDR 2022 Denmark
2 pages
PHY103A: Lecture # 2: Semester II, 2017-18 Department of Physics, IIT Kanpur
No ratings yet
PHY103A: Lecture # 2: Semester II, 2017-18 Department of Physics, IIT Kanpur
21 pages
Amchelltdprofile V2
No ratings yet
Amchelltdprofile V2
146 pages
Naidoo Education Studies
No ratings yet
Naidoo Education Studies
9 pages
Dynamic Programming and Optimal Control
No ratings yet
Dynamic Programming and Optimal Control
199 pages
2014 22 Khamis Naidu NonlinearOptimalTracking Proceedingsofthe2014ACCPortlandOregonpp.2420 2425june2014
No ratings yet
2014 22 Khamis Naidu NonlinearOptimalTracking Proceedingsofthe2014ACCPortlandOregonpp.2420 2425june2014
7 pages
Shell Gabon Solution
No ratings yet
Shell Gabon Solution
7 pages
Nafis Sir Math Routine 21 22
No ratings yet
Nafis Sir Math Routine 21 22
4 pages
Good Luck Ebook
No ratings yet
Good Luck Ebook
32 pages
Automatica: Kyriakos G. Vamvoudakis Frank L. Lewis
No ratings yet
Automatica: Kyriakos G. Vamvoudakis Frank L. Lewis
11 pages
Automatica: D. Vrabie O. Pastravanu M. Abu-Khalaf F.L. Lewis
No ratings yet
Automatica: D. Vrabie O. Pastravanu M. Abu-Khalaf F.L. Lewis
8 pages
Using Reinforcement Learning Techniques To Solve Continuous-Time Non-Linear Optimal Tracking Problem Without System Dynamics
No ratings yet
Using Reinforcement Learning Techniques To Solve Continuous-Time Non-Linear Optimal Tracking Problem Without System Dynamics
9 pages
Adaptive Dynamic Programming Algorithm For Uncertain Nonlinear Switched Systems
No ratings yet
Adaptive Dynamic Programming Algorithm For Uncertain Nonlinear Switched Systems
7 pages
Neurocomputing: Xiaofeng Li, Lei Xue, Changyin Sun
No ratings yet
Neurocomputing: Xiaofeng Li, Lei Xue, Changyin Sun
8 pages
Shi 2021
No ratings yet
Shi 2021
11 pages
Lewis LFC
No ratings yet
Lewis LFC
15 pages
Empirical AND Molecul AR Formulas: Insert Picture From First Page of Chapter
No ratings yet
Empirical AND Molecul AR Formulas: Insert Picture From First Page of Chapter
58 pages
Present Perfect Resource Master
No ratings yet
Present Perfect Resource Master
13 pages
Al Tamimi2008
No ratings yet
Al Tamimi2008
7 pages
Question Bank Physical Science Class IX Nov 2023
No ratings yet
Question Bank Physical Science Class IX Nov 2023
7 pages
Optimal Control Theory
No ratings yet
Optimal Control Theory
28 pages
2.9 What Happens in The Sequencer
No ratings yet
2.9 What Happens in The Sequencer
2 pages
Trajectory Generation
No ratings yet
Trajectory Generation
6 pages
Control of Toys
No ratings yet
Control of Toys
6 pages
cs229 Notes13
No ratings yet
cs229 Notes13
15 pages
ACODS 2014 GAndrade
No ratings yet
ACODS 2014 GAndrade
7 pages
Approximate Dynamic Programming - II: Algorithms: Warren B. Powell
No ratings yet
Approximate Dynamic Programming - II: Algorithms: Warren B. Powell
22 pages
Dynamic Programming and Optimal Control 3rd Edition, Volume II
No ratings yet
Dynamic Programming and Optimal Control 3rd Edition, Volume II
233 pages
Programacion Lineal 9
No ratings yet
Programacion Lineal 9
5 pages
1211 5761 PDF
No ratings yet
1211 5761 PDF
8 pages
Adaptive DP For Discrete Time LQR Optimal Tracking Control Problems With Unknown Dynamics
No ratings yet
Adaptive DP For Discrete Time LQR Optimal Tracking Control Problems With Unknown Dynamics
6 pages
RL Frontmatter
No ratings yet
RL Frontmatter
11 pages
Mrigshira Nakshatra
No ratings yet
Mrigshira Nakshatra
3 pages
Elementary Surveying Field Manual: Field Work No.1 Pacing On Level Ground
No ratings yet
Elementary Surveying Field Manual: Field Work No.1 Pacing On Level Ground
7 pages
Random Optimization: Fundamentals and Applications
From Everand
Random Optimization: Fundamentals and Applications
Fouad Sabry
No ratings yet

Kamala Pur Kar 2015

Uploaded by

Kamala Pur Kar 2015

Uploaded by

Automatica 51 (2015) 40–48

Contents lists available at ScienceDirect

Approximate optimal trajectory tracking for continuous-time

article info abstract

1. Introduction When applied to continuous-time systems the principle of

is continuously differentiable, the HJB equation for the optimal

where r , 21 supz ,y∈Z ∥z − y∥ denotes the radius of Z. Since the 5. Simulation

bounded as t → ∞. p1 = 3.473 kg m , p2 = 0.196 kg m2 , and p3 = 0.242 kg m2 ,

and Fd = diag 5.3, 1.1 N m s and Fs (q̇) = [8.45 tanh (q̇1 ) ,

mizing the cost 0 eT Qe + µT µ dt, where Q = diag[10, 10,

Fig. 1. State and error trajectories with probing signal.

Fig. 2. Evolution of value function and policy weights.

problem. The ultimately bounded tracking and estimation result

Algorithm for selection of NN architecture and learning gains

Since the gains depend on the initial conditions and on the

You might also like