
Chinese Journal of Aeronautics, (2023), 36(11): 271–280
Chinese Society of Aeronautics and Astronautics & Beihang University
Chinese Journal of Aeronautics
[email protected]
www.sciencedirect.com

Disturbance observer based actor-critic learning control for uncertain nonlinear systems

Xianglong LIANG a, Zhikai YAO b, Yaowen GE a, Jianyong YAO a,*

a School of Mechanical Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
b College of Artificial Intelligence, Nanjing University of Post and Telecommunication, Nanjing 210023, China

Received 8 October 2022; revised 3 December 2022; accepted 17 January 2023; available online 27 June 2023

KEYWORDS: Actor-critic structure; Composite adaptation; Disturbance observer; Robot manipulator; Uncertain nonlinear system

Abstract: This paper investigates disturbance observer based actor-critic learning control for a class of uncertain nonlinear systems in the presence of unmodeled dynamics and time-varying disturbances. The proposed control algorithm integrates a filter-based design method with an actor-critic learning architecture and a disturbance observer to circumvent the unmodeled dynamics and the time-varying disturbance. To be specific, the actor network is employed to estimate the unknown system dynamics, the critic network is developed to evaluate the control performance, and the disturbance observer is leveraged to provide an efficient estimate of the compounded disturbance, which includes the time-varying disturbance and the actor-critic network approximation error. Consequently, high-gain feedback is avoided and improved tracking performance can be expected. Moreover, a composite weight adaptation law for the actor network is constructed by utilizing two types of signals: the cost function and the modeling error. Theoretical analysis demonstrates that the developed controller guarantees bounded stability. Extensive simulations and experiments on a robot manipulator validate the performance of the resulting control strategy.

© 2023 Production and hosting by Elsevier Ltd. on behalf of Chinese Society of Aeronautics and Astronautics. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

* Corresponding author. E-mail address: [email protected] (J. YAO).
https://doi.org/10.1016/j.cja.2023.06.028

1. Introduction

Owing to its significance, both from a practical and a theoretical perspective, the control design of uncertain nonlinear systems has been a major research topic over the past decades.1–4 Some remarkable control approaches can be found in Refs. 5–7, including backstepping control, adaptive control, and observer-based nonlinear control, to name a few. Among them, adaptive control is an effective approach to address unknown parameters, and its combination with the backstepping technique plays an important role in the control of nonlinear systems. However, all these aforementioned methods cannot be directly applied to nonlinear systems containing a completely unknown dynamic structure, which hinders their widespread application.

In recent years, scholars have observed that the Neural Network (NN) displays an excellent ability in dealing with unknown nonlinearity due to its universal function approximation properties, and substantial control problems have been addressed by utilizing NNs.8–10 For instance, in Ref. 11, an adaptive neural tracking control problem was investigated for strict-feedback nonlinear systems with unmodeled dynamics. By introducing robust control and disturbance observer techniques, the authors in Refs. 12–13 presented robust adaptive neural control and disturbance observer based adaptive neural control, which can handle both unmodeled dynamics and time-varying disturbances.
Different from the traditional NN-based control, an advanced neural learning control strategy has further been developed according to the actor-critic learning architecture.14,15 To be specific, the actor-critic learning architecture consists of two networks: an actor network and a critic network. The actor network is leveraged to approximate an unknown function and generate the action or control signal, while the critic network is leveraged to evaluate the control performance. With its generalized learning structure, the actor-critic architecture can be easily applied to the control of other nonlinear systems.

Enlightened by the philosophy in Ref. 15, extensive control approaches have been studied that integrate the actor-critic learning architecture with traditional control approaches for unknown nonlinear systems. In Refs. 16–18, the actor-critic architecture was successfully applied to estimate the unknown nonlinearity online and achieved satisfactory results, yet these works ignore the negative influence of time-varying unknown disturbances on control performance. For practical systems (e.g., vehicular systems, robot manipulators, and unmanned aerial vehicles19–21), however, time-varying perturbations always exist. A time-varying disturbance can produce unexpected results, such as degraded control performance or even system divergence, and it is difficult to circumvent its influence with actor-critic learning control alone. Moreover, the approximation inaccuracy caused by actor-critic learning also influences the tracking performance. In this regard, the above factors motivate the combination of actor-critic control with robust control or disturbance observer based control. In Ref. 22, the actor-critic structure is used to estimate the modeling uncertainties of a small unmanned helicopter, and a discontinuous sliding mode based robust component is introduced to eliminate the influence of the actor network approximation error and the unknown disturbance. To overcome the discontinuities of the control input, a prescribed performance fault-tolerant control approach is developed in Ref. 23 by integrating the actor-critic learning scheme with a Robust Integral of the Sign of the Error (RISE) feedback term, which requires less system information and achieves asymptotic stability. However, large feedback gains are required to resist unknown disturbances in robust control, which reduce the stability margin, may stimulate high-frequency dynamics, and can then result in system instability. Inspired by feedforward design, disturbance observer based control24–27 can be used to estimate the impact of a disturbance and then compensate for it. In Refs. 28,29, a reinforcement learning based controller integrated with a disturbance observer is established to reject real-time external disturbances; it can guarantee robust stability and nominal performance even for an uncertain plant, and obtains satisfactory results in numerical simulation. However, both of these works merely focus on the regulation control problem.

Inspired by the aforementioned challenges, a disturbance observer based actor-critic learning control is developed for a class of uncertain nonlinear systems in the presence of unknown dynamics and time-varying disturbances. To cope with the unknown dynamics, the actor-critic learning architecture provides feedforward compensation by approximating the unknown nonlinearity. Considering the effect of the time-varying disturbance and the actor-critic network approximation error on tracking performance, the actor-critic learning algorithm is combined with the disturbance observer to circumvent these effects. In addition, a composite weight adaptation law for the actor network is constructed by utilizing two types of signals: the cost function and the modeling error. Consequently, high-gain feedback is avoided and improved tracking performance can be achieved. Eventually, extensive simulations and experiments on a robot manipulator are implemented to validate the performance of the resulting control strategy.

The key contributions of this paper are listed as follows:

(1) An actor-critic learning architecture is developed to estimate the unknown system dynamics online, which requires less model information and effectively improves robustness to unmodeled dynamics.

(2) A disturbance observer is effectively combined to compensate for the time-varying disturbance and the actor-critic network approximation error, which avoids high-gain feedback and achieves improved tracking performance.

To the best of our knowledge, few studies have integrated the actor-critic learning architecture with a disturbance observer for tracking control of uncertain nonlinear systems with unmodeled dynamics and time-varying unknown disturbances.

The remainder of this paper is organized as follows. The problem description is provided in Section 2. Section 3 states the disturbance observer based actor-critic learning control scheme and the system stability analysis. Simulation and experimental studies on a single-link robot are provided in Section 4, and conclusions are drawn in Section 5.

2. Problem description

Consider a class of nth-order Multiple Input Multiple Output (MIMO) nonlinear systems of the following form:

$$
\begin{cases}
\dot{x}_i = x_{i+1}, & i = 1, 2, \ldots, n-1\\
\dot{x}_n = g(x)u + f(x) + d(t)\\
y(t) = x_1
\end{cases}
\tag{1}
$$

where $x = [x_1^T, x_2^T, \ldots, x_n^T]^T \in \mathbb{R}^{mn}$ denotes the system state vector with $x_i \in \mathbb{R}^m$, which is assumed to be available for measurement; $u \in \mathbb{R}^m$ is the control input; $y \in \mathbb{R}^m$ is the system output; $f \in \mathbb{R}^m$ is the unknown smooth nonlinear function; $g(x)$ is the known nonzero gain function; and $d(t) \in \mathbb{R}^m$ is the time-varying disturbance.

The main control objective of this study is to propose a disturbance observer based actor-critic learning control strategy that achieves high tracking accuracy in the presence of unmodeled dynamics and time-varying unknown disturbances. To facilitate the presentation, some related assumptions and lemmas are necessary.

Assumption 1. The reference trajectory $y_r(t) \in \mathbb{R}^m$ and its derivatives up to the $n$th order, $y_r^{(n)}$, are available, smooth, and bounded.

Assumption 2. The time-varying disturbance $d(t)$ and its first derivative are bounded, i.e., $\|d\| \le d_m$ and $\|\dot{d}\| \le \bar{d}_m$.
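To make the problem setting concrete, the following minimal simulation sketch instantiates the system class of Eq. (1) for $n = 2$ and $m = 2$. The particular choices of `f`, `g`, and `d` are illustrative placeholders (not the plant of Section 4), and the forward-Euler integration is an assumption made only for this sketch.

```python
import numpy as np

# Illustrative instance of the system class in Eq. (1) with n = 2, m = 2.
def f(x1, x2):
    """Unknown smooth nonlinearity f(x) (placeholder example)."""
    return -0.5 * x2 - 0.2 * np.sin(x1)

def g(x1):
    """Known nonzero input-gain function g(x) (placeholder example)."""
    return np.eye(2)

def d(t):
    """Time-varying disturbance d(t); bounded with bounded derivative
    as required by Assumption 2 (placeholder example)."""
    return 0.1 * np.array([np.sin(t), np.cos(2.0 * t)])

def plant_step(x1, x2, u, t, dt=1e-3):
    """One forward-Euler step of Eq. (1):
    x1_dot = x2, x2_dot = g(x)u + f(x) + d(t), y = x1."""
    x1_next = x1 + dt * x2
    x2_next = x2 + dt * (g(x1) @ u + f(x1, x2) + d(t))
    return x1_next, x2_next
```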
Lemma 1. The NN universal approximation property indicates that a continuous function $\Phi: S \to \mathbb{R}^{N_1}$ ($S$ a compact set) can be approximated as

$$ \Phi(x) = W^T \varphi(x) + \varepsilon(x) \tag{2} $$

where $x \in \mathbb{R}^{N_2}$ is the input vector, $W \in \mathbb{R}^{N_3 \times N_1}$ is the ideal weight matrix, and $N_1$, $N_2$ and $N_3$ are the numbers of neurons in the output, input, and hidden layers, respectively; $\varphi(x) \in \mathbb{R}^{N_3}$ is the nonlinear activation function. According to Ref. 22, the ideal NN weights, the nonlinear activation function, and the approximation error are assumed to be bounded by $\|W\| \le W_m$, $\|\varphi\| \le \varphi_m$, $\|\dot{\varphi}\| \le \bar{\varphi}_m$, $\|\varepsilon\| \le \varepsilon_m$, $\|\dot{\varepsilon}\| \le \bar{\varepsilon}_m$.

3. Main results

In this section, the disturbance observer based actor-critic learning control strategy for the uncertain nonlinear system Eq. (1) is presented. First, we provide the backstepping controller design based on a filter-based design approach. Then, we design the actor-critic network to deal with the unknown nonlinear dynamics, where the critic network is leveraged to evaluate the control performance while the actor network is leveraged to approximate the unknown function. The architecture of the developed control scheme is depicted in Fig. 1.

Fig. 1 Structure of disturbance observer based actor-critic learning control strategy.

3.1. Controller design

To quantify the aforementioned control objective, the tracking error $z_1 \in \mathbb{R}^m$ is defined as $z_1 = y - y_r$, and the following filtered tracking errors are introduced to facilitate the controller design:

$$
\begin{cases}
z_2 = \dot{z}_1 + k_1 z_1\\
z_i = \dot{z}_{i-1} + k_{i-1} z_{i-1}, & i = 3, 4, \ldots, n
\end{cases}
\tag{3}
$$

where $k_1, k_2, \ldots, k_{n-1} \in \mathbb{R}$ denote positive control gains. By substituting Eq. (1) into Eq. (3), the dynamics of the filtered tracking error $z_n$ can be written as

$$
\dot{z}_n = g(x)u + f(x) + d - y_r^{(n)} + k_1 z_1^{(n-1)} + k_2 z_2^{(n-2)} + \cdots + k_{n-1} z_{n-1}^{(1)} = g(x)u + f(x) + d - y_r^{(n)} + \bar{k}\bar{z}
\tag{4}
$$

where $\bar{k} = [k_1, k_2, \ldots, k_{n-1}]$ and $\bar{z} = [z_1^{(n-1)}, z_2^{(n-2)}, \ldots, z_{n-1}^{(1)}]^T$, and the unknown nonlinear function $f(x)$ can be approximated by an NN-based actor network

$$ f(x_a) = W_a^T \varphi(x_a) + \varepsilon_a \tag{5} $$

where $x_a = [x_1^T, x_2^T, \ldots, x_n^T]^T$ denotes the input vector, $W_a$ denotes the weight vector of the actor network, and $\varepsilon_a$ denotes the actor network function reconstruction inaccuracy, which satisfies $\|\varepsilon_a\| \le \varepsilon_{am}$ and $\|\dot{\varepsilon}_a\| \le \bar{\varepsilon}_{am}$. Defining $\bar{f} = \varepsilon_a + d$, the expression in Eq. (4) can be further written as

$$ \dot{z}_n = g(x)u + \hat{W}_a^T \varphi_a - \tilde{W}_a^T \varphi_a + \bar{f} - y_r^{(n)} + \bar{k}\bar{z} \tag{6} $$

where $\tilde{W}_a = \hat{W}_a - W_a$ and $\hat{W}_a$ is the estimate of $W_a$, which will be introduced later.

Generally, time-varying external disturbances can be estimated by model-based disturbance observers.30 Herein, the time-varying external disturbance and the residual function reconstruction inaccuracy of the actor network are lumped together as $\bar{f}$. The adaptive neural disturbance observer for estimating the lumped disturbance $\bar{f}$ is designed by using the neural network approximation:

$$
\begin{cases}
\dot{h} = -a(h + a x_n) - a\left(\hat{W}_a^T \varphi_a + g(x)u\right)\\
\hat{f} = h + a x_n
\end{cases}
\tag{7}
$$

where $h$ is an internal observer state and $a \in \mathbb{R}$ is a positive constant. Therefore, the control input $u$ can be given by

$$ u = g^{-1}(x)\left(y_r^{(n)} - \bar{k}\bar{z} - k_n z_n - \hat{W}_a^T \varphi_a - \hat{f}\right) \tag{8} $$

where $k_n \in \mathbb{R}$ denotes a positive control gain. Substituting Eq. (8) into Eq. (6), the dynamics of the filtered error $z_n$ can be rewritten as

$$ \dot{z}_n = -k_n z_n - \tilde{W}_a^T \varphi_a + \tilde{f} \tag{9} $$

with $\tilde{f} = \bar{f} - \hat{f}$, whose dynamics satisfy

$$ \dot{\tilde{f}} = -a\tilde{f} + a\tilde{W}_a^T \varphi_a + \dot{\bar{f}} \tag{10} $$
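A minimal discrete-time sketch of the controller in Eqs. (3), (7), and (8) for the second-order case ($n = 2$) follows. The Gaussian RBF feature map, the Euler discretization, and the gain values are illustrative assumptions; the paper does not specify its basis functions.

```python
import numpy as np

def rbf(x, centers, width=1.0):
    """Gaussian RBF feature vector phi(x); an assumed instance of the
    nonlinear activation in Eq. (2)."""
    return np.exp(-np.sum((centers - x) ** 2, axis=1) / (2.0 * width ** 2))

def control_step(x1, x2, yr, yr_dot, yr_ddot, h, W_a, centers, g_x,
                 k1=30.0, k2=10.0, a=20.0, dt=1e-3):
    """One step of the filtered-error controller with the adaptive neural
    disturbance observer, Eqs. (3), (7), (8), for n = 2."""
    # Eq. (3): z1 = y - y_r, z2 = z1_dot + k1*z1
    z1, z1_dot = x1 - yr, x2 - yr_dot
    z2 = z1_dot + k1 * z1

    phi_a = rbf(np.concatenate([x1, x2]), centers)   # actor features phi(x_a)
    f_hat = h + a * x2                               # Eq. (7): lumped-disturbance estimate

    # Eq. (8): u = g^{-1}(x)(yr^(n) - k_bar*z_bar - k_n*z_n - W_a^T phi_a - f_hat);
    # for n = 2 the term k_bar*z_bar reduces to k1*z1_dot.
    u = np.linalg.solve(g_x, yr_ddot - k1 * z1_dot - k2 * z2
                        - W_a.T @ phi_a - f_hat)

    # Eq. (7): h_dot = -a(h + a*x_n) - a(W_a^T phi_a + g(x)u), Euler-integrated
    h = h + dt * (-a * (h + a * x2) - a * (W_a.T @ phi_a + g_x @ u))
    return u, h, z1, z2, phi_a
```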


Remark 1. Apart from the external disturbance, the residual function reconstruction inaccuracy of the actor network is also estimated by the designed adaptive neural disturbance observer, which is different from conventional disturbance observers.30 The performance and robustness of the controlled plant can be greatly improved by utilizing the adaptive neural disturbance observer for disturbance compensation.

3.2. Critic network design

The critic network is utilized to provide the evaluation function for the current strategy, which can test the performance of the current policy and generate rewards/punishments as feedback for adaptive learning. Therefore, we introduce the infinite-horizon performance index function as follows:

$$ V(t) = \int_t^{\infty} \exp[-c(s-t)]\, r(s)\, \mathrm{d}s \tag{11} $$

where $c > 0$ represents a discount factor, which can guarantee the boundedness of the cost function even if the reference trajectory does not converge to zero, and $r(t)$ represents an instantaneous cost function

$$ r(t) = z_1^T Q z_1 + u^T R u \tag{12} $$

where $Q \in \mathbb{R}^{m \times m}$ and $R \in \mathbb{R}^{m \times m}$ are the weighting matrices for the lumped tracking error $z_1$ and the control input $u$, respectively.

To achieve optimal control, the cost-to-go function is supposed to be minimized. Given that it is difficult to obtain the cost-to-go function, an NN-based critic network is introduced:

$$ V(x_c) = W_c^T \varphi(x_c) + \varepsilon_c \tag{13} $$

where $x_c = z_1$ denotes the input vector, $W_c$ denotes the weight vector of the critic network, and $\varepsilon_c$ denotes the critic network function reconstruction inaccuracy, which satisfies $\|\varepsilon_c\| \le \varepsilon_{cm}$ and $\|\dot{\varepsilon}_c\| \le \bar{\varepsilon}_{cm}$. The cost-to-go function can be approximated by

$$ \hat{V}(x_c) = \hat{W}_c^T \varphi(x_c) \tag{14} $$

where $\hat{W}_c$ is the estimate of $W_c$.

The weight vector $\hat{W}_c$ is selected to minimize the objective function $E_c = 0.5 e_c^T e_c$, and according to Eq. (11) and Eq. (12), the prediction error $e_c(\cdot)$ can be expressed as

$$ e_c = r(t) + \dot{\hat{V}}(t) - c\hat{V}(t) \tag{15} $$

and the critic network weight parameters are updated by the following update law:

$$ \dot{\hat{W}}_c = -k_{c1} e_c \left(-c\varphi_c + \nabla\varphi_c \dot{x}_c\right) - k_{c2}\hat{W}_c = -k_{c1}\left(r + \hat{W}_c^T \Lambda\right)\Lambda - k_{c2}\hat{W}_c \tag{16} $$

where $k_{c1}, k_{c2}$ are positive parameters and $\Lambda = -c\varphi_c + \nabla\varphi_c \dot{x}_c$.

3.3. Actor network design

The actor network is leveraged to estimate the unknown function $f(x)$ in Eq. (4), and it can generate the appropriate control policy by gradually accumulating system experience. The approximation of $f(x)$ is designed as follows:

$$ \hat{f}(x_a) = \hat{W}_a^T \varphi(x_a) \tag{17} $$

Define a prediction error $e_a(\cdot)$ as

$$ e_a = \Lambda_V(\hat{V} - V_d) + \hat{W}_a^T \varphi_a \tag{18} $$

where $\Lambda_V \in \mathbb{R}^m$ is a positive design parameter and $V_d = 0$ is the ideal value of the cost-to-go. The weight vector $\hat{W}_a$ is selected to minimize the objective function $E_a = 0.5 e_a^T e_a$, and the actor network weight parameters are updated by the following update law:

$$ \dot{\hat{W}}_a = -k_{a1}\varphi_a\left(\Lambda_V \hat{V} + \hat{W}_a^T \varphi_a\right) - k_{a2}\hat{W}_a \tag{19} $$

with positive parameters $k_{a1}$ and $k_{a2}$. To further improve the convergence of the estimated weights $\hat{W}_a$ and the function approximation precision of the actor network, another prediction error $\tilde{x}_n$, named the modeling error,24 is defined as $\tilde{x}_n = \hat{x}_n - x_n$, in which $\hat{x}_n$ can be obtained by constructing the following serial-parallel estimation model31:

$$
\begin{cases}
\dot{\hat{x}}_i = \hat{x}_{i+1}, & i = 1, 2, \ldots, n-1\\
\dot{\hat{x}}_n = g(x)u + \hat{W}_a^T \varphi_a + \hat{f} - b\tilde{x}_n
\end{cases}
\tag{20}
$$

and the dynamic equation of $\tilde{x}_n$ is written as

$$ \dot{\tilde{x}}_n = -b\tilde{x}_n - \tilde{W}_a^T \varphi_a - \tilde{f} \tag{21} $$

in which $b \in \mathbb{R}$ is a positive constant.

Therefore, the actor network weight parameters are adjusted by the following composite update law:

$$ \dot{\hat{W}}_a = -k_{a1}\varphi_a\left(\Lambda_V \hat{V} + \hat{W}_a^T \varphi_a\right) - k_{a2}\hat{W}_a - k_{a3}\varphi_a \tilde{x}_n^T \tag{22} $$

where $k_{a3}$ is a positive parameter.

Remark 2. Different from traditional actor network weight updating,18,22,23 a composite weight adaptation law for the actor network is constructed by using both the prediction error $e_a$ and the modeling error $\tilde{x}_n$, which ensures that the estimated weights $\hat{W}_a$ converge better to the unknown weights $W_a$ and that a more precise approximation of the nonlinear function is achieved.32

3.4. Stability analysis

Theorem 1. Consider the nonlinear system Eq. (1) in the presence of unmodeled dynamics and time-varying disturbances. If the control input Eq. (8), the critic network weight adaptive law Eq. (16), the actor network weight adaptive law Eq. (22), and the adaptive neural disturbance observer Eq. (7) are designed as above, then all system signals are bounded. Proof details are given in Appendix A.
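The learning laws above can be summarized in a short discrete-time sketch. The Euler steps, the rate values (matching Section 4), and the per-output-channel (vector-valued) cost estimate — chosen to match the $\hat{W}_c = \mathrm{zeros}(10, 2)$ initialization used later — are implementation assumptions, not the authors' code.

```python
import numpy as np

def critic_update(W_c, phi_c, dphi_dxc, xc_dot, r,
                  c=0.1, kc1=2.0, kc2=0.1, dt=1e-3):
    """One Euler step of the critic law, Eq. (16), with
    Lambda = -c*phi_c + grad(phi_c) @ xc_dot as defined below Eq. (16).
    dphi_dxc: Jacobian of the feature vector w.r.t. x_c (N3 x m)."""
    Lam = -c * phi_c + dphi_dxc @ xc_dot
    e_c = r + W_c.T @ Lam                  # prediction error, Eq. (15)
    W_c = W_c + dt * (-kc1 * np.outer(Lam, e_c) - kc2 * W_c)
    return W_c, e_c

def actor_update(W_a, phi_a, V_hat, Lam_V, x_tilde,
                 ka1=20.0, ka2=1.0, ka3=5.0, dt=1e-3):
    """One Euler step of the composite actor law, Eq. (22): gradient term
    from e_a (Eq. (18), V_d = 0), leakage term, and modeling-error term."""
    e_a = Lam_V * V_hat + W_a.T @ phi_a
    W_a = W_a + dt * (-ka1 * np.outer(phi_a, e_a)
                      - ka2 * W_a
                      - ka3 * np.outer(phi_a, x_tilde))
    return W_a

def serial_parallel_step(x1_hat, x2_hat, x2, u, g_x, W_a, phi_a, f_hat,
                         b=100.0, dt=1e-3):
    """Serial-parallel estimation model, Eq. (20), for n = 2; returns the
    modeling error x_tilde_n = x_hat_n - x_n used in Eq. (22)."""
    x_tilde = x2_hat - x2
    x1_hat = x1_hat + dt * x2_hat
    x2_hat = x2_hat + dt * (g_x @ u + W_a.T @ phi_a + f_hat - b * x_tilde)
    return x1_hat, x2_hat, x_tilde
```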
4. Simulation and experiments

4.1. Simulation

To substantiate the feasibility and effectiveness of the developed control strategy, we consider a two-degree-of-freedom robot manipulator (see Ref. 32) with the following dynamic equation:

$$ M(q)\ddot{q} + C(q, \dot{q})\dot{q} + F(\dot{q}) + \tau_d = \tau \tag{23} $$

where $q, \dot{q}, \ddot{q}$ denote the position, velocity, and acceleration, respectively; $M(q)$ is the inertia matrix; $C(q, \dot{q})$ is the centripetal-Coriolis matrix; $F(\dot{q})$ is the friction; $\tau_d$ is the external disturbance; and $\tau$ is the control input.

The matrices $M(q)$, $C(q, \dot{q})$, $F(\dot{q})$ and $\tau_d$ are given as follows:

$$ M(q) = \begin{bmatrix} p_1 + 2p_3\cos(q_2) & p_2 + p_3\cos(q_2)\\ p_2 + p_3\cos(q_2) & p_2 \end{bmatrix} $$

$$ C(q, \dot{q}) = \begin{bmatrix} -p_3\sin(q_2)\dot{q}_2 & -p_3\sin(q_2)(\dot{q}_1 + \dot{q}_2)\\ p_3\sin(q_2)\dot{q}_1 & 0 \end{bmatrix} $$

$$ F(\dot{q}) = \begin{bmatrix} f_{d1} & 0\\ 0 & f_{d2} \end{bmatrix}\begin{bmatrix} \dot{q}_1\\ \dot{q}_2 \end{bmatrix}, \quad \tau_d = \begin{bmatrix} \tau_{d1}\\ \tau_{d2} \end{bmatrix} $$

where $p_1 = 3.473\ \mathrm{kg\cdot m^2}$, $p_2 = 0.196\ \mathrm{kg\cdot m^2}$, $p_3 = 0.242\ \mathrm{kg\cdot m^2}$, $f_{d1} = 5.3\ \mathrm{N\cdot m\cdot s}$, $f_{d2} = 1.1\ \mathrm{N\cdot m\cdot s}$, $\tau_{d1} = 3\sin(t)$ and $\tau_{d2} = 0.2\sin(t)$.

Then the dynamics in Eq. (23) can be transformed into the state-space form considered in this paper, i.e.,

$$
\begin{cases}
\dot{x}_1 = x_2\\
\dot{x}_2 = g(x)u + f(x) + d(t)
\end{cases}
\tag{24}
$$

with $x_1 = [q_1, q_2]^T$, $x_2 = [\dot{q}_1, \dot{q}_2]^T$, $g(x) = M^{-1}(x_1)$, $u = \tau$, $f(x) = -M^{-1}[C(x_1, x_2)x_2 + F(x_2)]$ and $d(t) = -M^{-1}\tau_d$.

The following two control strategies are compared to validate the effectiveness of the proposed approach:

Controller 1. This is the proposed controller, or more specifically, actor-critic learning control integrated with the disturbance observer. The control parameters are chosen as $k_1 = 30$, $k_2 = 10$, $a = 20$, $b = 100$, $k_{c1} = 2$, $k_{c2} = 0.1$, $k_{a1} = 20$, $k_{a2} = 1$ and $k_{a3} = 5$. The initial weights of the actor-critic networks are chosen as $\hat{W}_a = \mathrm{zeros}(10, 4)$ and $\hat{W}_c = \mathrm{zeros}(10, 2)$. The discount factor is chosen as $c = 0.1$, and the positive matrices $Q$ and $R$ in the cost function are selected as $Q = \mathrm{diag}([50, 200])$ and $R = \mathrm{diag}([0.1, 0.1])$, respectively.

Controller 2. This is the actor-critic learning control approach without disturbance feedforward compensation. To ensure a fair comparison, the selected control parameters are consistent with Controller 1.

The reference trajectories of the two joints are chosen as $y_{r1} = 0.6\sin(3.14t)[1 - \exp(-t)]$ and $y_{r2} = 0.8\sin(3.14t)[1 - \exp(-t)]$. The simulation results are depicted in Figs. 2–6. As depicted in Fig. 2 and Fig. 3, Controller 1 can follow the reference signal well and achieves the best tracking performance in terms of convergence speed and steady tracking error, since the disturbance observer is introduced. Over the last 20 s, the maximum amplitude of the steady tracking error is $M_{z1} = [0.0009, 0.0058]\ \mathrm{rad}$ under Controller 1, while it is $M_{z1} = [0.0021, 0.0103]\ \mathrm{rad}$ under Controller 2.

The results in Figs. 4 and 5 depict the compound estimation of $f + d$. It can be found that the composite estimation architecture established by actor-critic learning and the disturbance observer approximates the unmodeled dynamics and the time-varying disturbance well in comparison with the estimation architecture using actor-critic learning only. This phenomenon explains why accurate feedforward compensation results in higher tracking accuracy. Eventually, the control inputs are shown in Fig. 6; they are regular and bounded.

Fig. 2 Tracking performance with the proposed Controller 1.

Fig. 3 Tracking errors for Joints 1 and 2 under Controller 1 and Controller 2.

Fig. 4 Compound estimation $(f+d)[1]$ for Joint 1 under Controller 1 and Controller 2.

Fig. 5 Compound estimation $(f+d)[2]$ for Joint 2 under Controller 1 and Controller 2.

Fig. 6 Control inputs of two joints under Controller 1 and Controller 2.
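For reference, the simulated plant of Eqs. (23)–(24) can be transcribed directly as below; this is a sketch using the parameter values listed above, with NumPy as an assumed implementation choice.

```python
import numpy as np

P1, P2, P3 = 3.473, 0.196, 0.242        # inertia parameters, kg*m^2
FD = np.diag([5.3, 1.1])                # viscous-friction coefficients, N*m*s

def manipulator_terms(q, qd, t):
    """M, C, F, tau_d of the 2-DOF manipulator in Eq. (23)."""
    c2, s2 = np.cos(q[1]), np.sin(q[1])
    M = np.array([[P1 + 2*P3*c2, P2 + P3*c2],
                  [P2 + P3*c2,   P2        ]])
    C = np.array([[-P3*s2*qd[1], -P3*s2*(qd[0] + qd[1])],
                  [ P3*s2*qd[0],  0.0                  ]])
    F = FD @ qd
    tau_d = np.array([3.0*np.sin(t), 0.2*np.sin(t)])
    return M, C, F, tau_d

def state_space_terms(x1, x2, t):
    """g(x), f(x), d(t) of Eq. (24) obtained from Eq. (23):
    g = M^{-1}, f = -M^{-1}(C x2 + F), d = -M^{-1} tau_d."""
    M, C, F, tau_d = manipulator_terms(x1, x2, t)
    Minv = np.linalg.inv(M)
    return Minv, -Minv @ (C @ x2 + F), -Minv @ tau_d
```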
4.2. Experiments

To further substantiate the superiority of the developed control strategy, an experiment was conducted on the single-degree-of-freedom robot manipulator platform shown in Fig. 7. The test rig includes a bench case, a motor actuator (consisting of a Kollmorgen DH063A DC motor, a Kollmorgen ServoStar 620 electrical driver, a Heidenhain ERN180 rotary encoder, and a revolute joint), a link, a payload, and a control module. The control module consists of real-time control software built on an Advantech PCI-1723 card and a Heidenhain IK-220 counter card, together with monitoring software. The sampling time is 0.5 ms.

Fig. 7 Single-degree-of-freedom robot manipulator platform.

The dynamics of the single-degree-of-freedom robot manipulator can also be written in the state-space form of Eq. (24) with $x_1 = q$, $x_2 = \dot{q}$, $x = [x_1, x_2]^T$, $g(x) = J^{-1}$, $u = \tau$, $f(x) = -J^{-1}[F(\dot{q}) + G(q)]$ and $d(t) = -J^{-1}\tau_d$, where $q, \dot{q}$ denote the position and velocity, respectively; $F(\dot{q})$ is the unknown friction; $G(q)$ is the unknown gravity; $\tau_d$ is the external disturbance; $\tau$ is the control input; and $J = J_r + J_l + J_p$ is the total moment of inertia, with joint moment of inertia $J_r$, link moment of inertia $J_l = m_l L^2/3$, and payload moment of inertia $J_p = m_p L^2$. The system parameters are $J_r = 0.3\ \mathrm{kg\cdot m^2}$, $L = 0.5\ \mathrm{m}$, $m_l = 0.5\ \mathrm{kg}$ and $m_p = 0$–$1\ \mathrm{kg}$.

Likewise, the two aforementioned controllers are tested in the experiments. The reference signal is chosen as $y_r = 10[1 - \cos(3.14t)][1 - \exp(-t)]$, and the control parameters are chosen as $k_1 = 150$, $k_2 = 50$, $a = 10$, $b = 50$, $k_{c1} = 2$, $k_{c2} = 0.5$, $k_{a1} = 20$, $k_{a2} = 2$ and $k_{a3} = 5$. The initial weights of the actor-critic networks are chosen as $\hat{W}_a = \mathrm{zeros}(10, 2)$ and $\hat{W}_c = \mathrm{zeros}(10, 1)$. The discount factor is chosen as $c = 0.1$, and the positive matrices $Q$ and $R$ in the cost function are selected as $Q = [50]$ and $R = [1]$, respectively. In this scenario, the control performance is tested with different payloads: $m_p = 0\ \mathrm{kg}$, $m_p = 0.5\ \mathrm{kg}$ and $m_p = 1\ \mathrm{kg}$. Furthermore, to quantitatively evaluate the tracking performance of the controllers, three performance indices (the maximum $M_z$, average $\mu$, and standard deviation $\sigma$ of the tracking error) from Ref. 33 are introduced.

Case 1. The two controllers are compared under the no-load condition with $m_p = 0\ \mathrm{kg}$. The tracking errors and the performance indices (over the last 20 s) of the two controllers are presented in Fig. 8. From these results, it can be observed that the tracking performance of Controller 1 is improved compared with that of Controller 2, since the disturbance observer is integrated with the actor-critic learning control; the compound estimate $\hat{f} + \hat{d}$ produced by the actor network and the disturbance observer is depicted in Fig. 9.

Fig. 8 System tracking errors of Controller 1 and Controller 2 under no-load condition with $m_p = 0$ kg.

Fig. 9 Compound estimation $\hat{f} + \hat{d}$ of Controller 1 under no-load condition with $m_p = 0$ kg.
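The three indices can be computed from sampled tracking-error data as sketched below. The specific formulas (maximum, mean, and standard deviation of the absolute error over the evaluation window) are stated here as an assumption following the usual usage of Ref. 33, since the paper does not reproduce them.

```python
import numpy as np

def performance_indices(z1, window=None):
    """Maximum M_z, average mu, and standard deviation sigma of the
    absolute tracking error; assumed definitions following Ref. 33."""
    e = np.abs(np.asarray(z1))
    if window is not None:              # e.g., samples from the last 20 s
        e = e[-window:]
    M_z = e.max()
    mu = e.mean()
    sigma = np.sqrt(np.mean((e - mu) ** 2))
    return M_z, mu, sigma
```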
Case 2. The two controllers are compared under the light-load condition with $m_p = 0.5\ \mathrm{kg}$. The tracking errors and the performance indices are shown in Fig. 10. It can be seen that the performance of Controller 1 still outperforms that of Controller 2. In comparison with the no-load condition, $M_z$ of Controller 1 changed very little, increasing by only 1.6%, while that of Controller 2 increased by about two times; this phenomenon illustrates that actor-critic learning control alone is not enough to compensate for the impact of load change on the system. In addition, the result of the compound estimation is depicted in Fig. 11. Compared with Fig. 9, it can be observed that as the payload becomes larger, the proposed composite estimation scheme adapts well to the change of system uncertainty, which further illustrates that Controller 1 can effectively compensate for the influence of load change.

Fig. 10 System tracking errors of Controller 1 and Controller 2 under light load condition with $m_p = 0.5$ kg.

Fig. 11 Compound estimation $\hat{f} + \hat{d}$ of Controller 1 under light load condition with $m_p = 0.5$ kg.

Case 3. The two controllers are compared under the heavy-load condition with $m_p = 1\ \mathrm{kg}$. The tracking errors and the performance indices are shown in Fig. 12. Likewise, the performance of Controller 1 still outperforms that of Controller 2; compared with the no-load condition, $M_z$ of Controller 1 increased by 5.5%, while that of Controller 2 increased by about 2.3 times, which further illustrates that Controller 1 can effectively compensate for the impact of load change on the system. The result of the compound estimation is depicted in Fig. 13. Compared with Fig. 9, it can be observed that as the payload becomes larger, Controller 1 still adapts well to the change of system uncertainty.

Fig. 12 System tracking errors of Controller 1 and Controller 2 under heavy load condition with $m_p = 1$ kg.

Fig. 13 Compound estimation $\hat{f} + \hat{d}$ of Controller 1 under heavy load condition with $m_p = 1$ kg.
In order to more intuitively observe the impact of load change on the tracking performance of Controller 1 and Controller 2, the $M_z$ values of the two controllers under different payloads are collected together, as shown in Fig. 14. It is evident that Controller 1 can effectively compensate for the impact of load change on the system and maintain its tracking performance, while the tracking performance of Controller 2 gradually deteriorates as the payload increases. Therefore, the proposed Controller 1 is robust to unknown uncertainties and achieves improved tracking performance.

Fig. 14 $M_z$ of two controllers under different payloads.

5. Conclusions

In this paper, disturbance observer based actor-critic learning control has been investigated for a class of nonlinear systems in the presence of unknown dynamics and time-varying disturbances. A composite weight adaptation law for the actor network is constructed from both the cost function and the modeling error, and a disturbance observer component is combined to compensate for the residual function reconstruction inaccuracy caused by the actor network and for the time-varying disturbance. Extensive simulations and experiments on a robot manipulator show that the developed disturbance observer based actor-critic learning control strategy can effectively circumvent the influence of unknown system dynamics and time-varying disturbances, and that higher tracking accuracy can be achieved. Considering that full state information is required to implement the developed control strategy, we will explore an output feedback control approach for uncertain nonlinear systems in the future.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This work was supported by the National Key R&D Program of China (No. 2021YFB2011300) and the National Natural Science Foundation of China (No. 52075262).

Appendix A. Proof of Theorem 1

Consider the following Lyapunov candidate function:

$$ L = L_1 + L_2 + L_3 + L_4 + L_5 \tag{A1} $$

where

$$ L_1 = \sum_{i=1}^{n}\frac{1}{2}z_i^T z_i,\quad L_2 = \frac{1}{2}\tilde{f}^T\tilde{f},\quad L_3 = \frac{1}{2}\mathrm{tr}\left(\tilde{W}_c^T\tilde{W}_c\right),\quad L_4 = \frac{1}{2}\mathrm{tr}\left(\tilde{W}_a^T\tilde{W}_a\right),\quad L_5 = \frac{1}{2}\tilde{x}_n^T\tilde{x}_n \tag{A2} $$

Using Eq. (3) and Eq. (9), the derivative of $L_1$ can be expressed as

$$
\begin{aligned}
\dot{L}_1 ={}& -k_1 z_1^T z_1 + z_1^T z_2 - k_2 z_2^T z_2 + z_2^T z_3 + \cdots - k_{n-1} z_{n-1}^T z_{n-1} + z_{n-1}^T z_n - k_n z_n^T z_n + z_n^T\!\left(-\tilde{W}_a^T\varphi_a + \tilde{f}\right)\\
\le{}& -\left(k_1 - \tfrac{1}{2}\right)\|z_1\|^2 - (k_2 - 1)\|z_2\|^2 - \cdots - \left(k_n - \tfrac{3}{2}\right)\|z_n\|^2 + \tfrac{1}{2}\|\tilde{f}\|^2 + \tfrac{1}{2}\|\tilde{W}_a\|^2\|\varphi_a\|^2
\end{aligned}
\tag{A3}
$$

Using Eq. (10), the derivative of $L_2$ can be expressed as

$$
\begin{aligned}
\dot{L}_2 &= -a\tilde{f}^T\tilde{f} + a\tilde{f}^T\tilde{W}_a^T\varphi_a + \tilde{f}^T\dot{\bar{f}}\\
&\le -a\|\tilde{f}\|^2 + \tfrac{a}{2}\|\tilde{f}\|^2 + \tfrac{a}{2}\|\tilde{W}_a\|^2\|\varphi_a\|^2 + \tfrac{1}{2}\|\tilde{f}\|^2 + \tfrac{1}{2}\|\dot{\bar{f}}\|^2\\
&\le -\tfrac{1}{2}(a-1)\|\tilde{f}\|^2 + \tfrac{a}{2}\|\tilde{W}_a\|^2\|\varphi_a\|^2 + \tfrac{1}{2}\|\dot{\bar{f}}\|^2
\end{aligned}
\tag{A4}
$$

Using Eq. (13) and Eq. (16), the derivative of $L_3$ can be expressed as

$$
\begin{aligned}
\dot{L}_3 &= -k_{c1}\tilde{W}_c^T\left(r(t) + \hat{W}_c^T\Lambda\right)\Lambda - k_{c2}\tilde{W}_c^T\hat{W}_c\\
&= -k_{c1}\tilde{W}_c^T\left(\tilde{W}_c^T\Lambda + \bar{\varepsilon}_c\right)\Lambda - k_{c2}\tilde{W}_c^T\left(\tilde{W}_c + W_c\right)\\
&\le -\tfrac{1}{2}\left(k_{c1}\lambda_{\min}(\Lambda\Lambda^T) + k_{c2}\right)\|\tilde{W}_c\|^2 + \tfrac{k_{c1}}{2}\|\bar{\varepsilon}_c\|^2 + \tfrac{k_{c2}}{2}\|W_c\|^2
\end{aligned}
\tag{A5}
$$

where $\bar{\varepsilon}_c = c\varepsilon_c - \dot{\varepsilon}_c$, which is bounded, i.e., $\|\bar{\varepsilon}_c\| \le \bar{\varepsilon}_{cm}$.

Using Eq. (22), the derivative of $L_4$ can be expressed as

$$
\begin{aligned}
\dot{L}_4 &= -k_{a1}\tilde{W}_a^T\varphi_a\left(\Lambda_V\hat{V} + \hat{W}_a^T\varphi_a\right) - k_{a2}\tilde{W}_a^T\hat{W}_a - k_{a3}\tilde{W}_a^T\varphi_a\tilde{x}_n^T\\
&= -k_{a1}\tilde{W}_a^T\varphi_a\tilde{W}_a^T\varphi_a - k_{a1}\tilde{W}_a^T\varphi_a\left(W_a^T\varphi_a + \Lambda_V\hat{W}_c^T\varphi_c\right) - k_{a2}\tilde{W}_a^T\left(\tilde{W}_a + W_a\right) - k_{a3}\tilde{W}_a^T\varphi_a\tilde{x}_n^T\\
&\le -\tfrac{1}{2}\left(k_{a1}\lambda_{\min}(\varphi_a\varphi_a^T) + k_{a2} - k_{a3}\|\varphi_a\|^2\right)\|\tilde{W}_a\|^2 + k_{a1}\|W_a\|^2\|\varphi_a\|^2 + 2k_{a1}\Lambda_V^T\Lambda_V\|\varphi_c\|^2\|\tilde{W}_c\|^2\\
&\quad + 2k_{a1}\Lambda_V^T\Lambda_V\|W_c\|^2\|\varphi_c\|^2 + \tfrac{k_{a2}}{2}\|W_a\|^2 + \tfrac{k_{a3}}{2}\|\tilde{x}_n\|^2
\end{aligned}
\tag{A6}
$$

Using Eq. (21), the derivative of $L_5$ can be expressed as

$$ \dot{L}_5 = \tilde{x}_n^T\left(-b\tilde{x}_n - \tilde{W}_a^T\varphi_a - \tilde{f}\right) \le -(b-1)\|\tilde{x}_n\|^2 + \tfrac{1}{2}\|\tilde{f}\|^2 + \tfrac{1}{2}\|\tilde{W}_a\|^2\|\varphi_a\|^2 \tag{A7} $$

Then, combining Eqs. (A3)–(A7), the derivative of $L$ can be written as

$$
\begin{aligned}
\dot{L} \le{}& -(k_1 - 0.5)\|z_1\|^2 - \cdots - (k_i - 1)\|z_i\|^2 - \cdots - (k_n - 1.5)\|z_n\|^2\\
& - \tfrac{1}{2}\left(k_{c1}\lambda_{\min}(\Lambda\Lambda^T) + k_{c2} - 4k_{a1}\Lambda_V^T\Lambda_V\|\varphi_c\|^2\right)\|\tilde{W}_c\|^2\\
& - \tfrac{1}{2}\left(k_{a1}\lambda_{\min}(\varphi_a\varphi_a^T) - (k_{a3} + a + 2)\|\varphi_a\|^2 + k_{a2}\right)\|\tilde{W}_a\|^2\\
& - \tfrac{1}{2}(a-3)\|\tilde{f}\|^2 - \tfrac{1}{2}(2b - k_{a3} - 2)\|\tilde{x}_n\|^2 + \varrho_1\\
\le{}& -\varrho_0 L + \varrho_1
\end{aligned}
\tag{A8}
$$
where

$$
\begin{aligned}
\varrho_0 = \min\{&2(k_1 - 0.5),\, 2(k_2 - 1),\, \ldots,\, 2(k_n - 1.5),\, a - 3,\, 2b - k_{a3} - 2,\\
&k_{c1}\lambda_{\min}(\Lambda\Lambda^T) + k_{c2} - 4k_{a1}\Lambda_V^T\Lambda_V\|\varphi_c\|^2,\\
&k_{a1}\lambda_{\min}(\varphi_a\varphi_a^T) - (k_{a3} + a + 2)\|\varphi_a\|^2 + k_{a2}\}
\end{aligned}
$$

$$ \varrho_1 = 0.5\left(k_{c1}\bar{\varepsilon}_{cm}^2 + \bar{\varepsilon}_{am}^2 + \bar{d}_m^2\right) + \varepsilon_{am}^2 + d_m^2 + \left(2k_{a1}\Lambda_V^T\Lambda_V\varphi_{cm}^2 + 0.5k_{c2}\right)W_{cm}^2 + \left(0.5k_{a2} + k_{a1}\varphi_{am}^2\right)W_{am}^2 $$

To ensure $\varrho_0 > 0$, the following conditions must be fulfilled:

$$
\begin{cases}
k_1 > 0.5,\ k_2 > 1,\ \ldots,\ k_n > 1.5\\
a > 3,\quad 2b - k_{a3} - 2 > 0\\
k_{c1}\lambda_{\min}(\Lambda\Lambda^T) + k_{c2} - 4k_{a1}\Lambda_V^T\Lambda_V\|\varphi_c\|^2 > 0\\
k_{a1}\lambda_{\min}(\varphi_a\varphi_a^T) + k_{a2} - (k_{a3} + a + 2)\|\varphi_a\|^2 > 0
\end{cases}
\tag{A9}
$$

Solving the differential inequality Eq. (A8) yields

$$ L(t) \le \left(L(0) - \frac{\varrho_1}{\varrho_0}\right)\exp(-\varrho_0 t) + \frac{\varrho_1}{\varrho_0} \le L(0) + \frac{\varrho_1}{\varrho_0} \tag{A10} $$

Consequently, all system signals are bounded according to the definition of $L$ in Eq. (A1).

References

1. Han SS, Jiao ZX, Wang CW, et al. Fuzzy robust nonlinear control approach for electro-hydraulic flight motion simulator. Chin J Aeronaut 2015;28(1):294–304.
2. Yao JY, Deng WX. Active disturbance rejection adaptive control of uncertain nonlinear systems: Theory and application. Nonlinear Dyn 2017;89(3):1611–24.
3. Deng WX, Yao JY, Ma DW. Time-varying input delay compensation for nonlinear systems with additive disturbance: An output feedback approach. Int J Robust Nonlinear Control 2018;28(1):31–52.
4. Lu Y. Disturbance observer-based backstepping control for hypersonic flight vehicles without use of measured flight path angle. Chin J Aeronaut 2021;34(2):396–406.
5. Yao JY, Jiao ZX, Ma DW. Extended-state-observer-based output feedback nonlinear robust control of hydraulic systems with backstepping. IEEE Trans Ind Electron 2014;61(11):6285–93.
6. Chen M, Ge SS, Ren BB. Adaptive tracking control of uncertain MIMO nonlinear systems with input constraints. Automatica 2011;47(3):452–65.
7. Wu XQ, Xu KX, He XX. Disturbance-observer-based nonlinear control for overhead cranes subject to uncertain disturbances. Mech Syst Signal Process 2020;139:106631.
8. Bu XW, Wu XY, Ma Z, et al. Novel adaptive neural control of flexible air-breathing hypersonic vehicles based on sliding mode differentiator. Chin J Aeronaut 2015;28(4):1209–16.
9. Ouyang YC, Dong L, Xue L, et al. Adaptive control based on neural networks for an uncertain 2-DOF helicopter system with input deadzone and output constraints. IEEE/CAA J Autom Sin 2019;6(3):807–15.
10. Ma L, Xu N, Zhao XD, et al. Small-gain technique-based adaptive neural output-feedback fault-tolerant control of switched nonlinear systems with unmodeled dynamics. IEEE Trans Syst Man Cybern 2021;51(11):7051–62.
11. Zhang T, Ge SS, Hang CC. Adaptive neural network control for strict-feedback nonlinear systems using backstepping design. Automatica 2000;36(12):1835–46.
12. Yao ZK, Yao JY, Sun WC. Adaptive RISE control of hydraulic systems with multilayer neural-networks. IEEE Trans Ind Electron 2019;66(11):8638–47.
13. Wang XJ, Yin XH, Wu QH, et al. Disturbance observer based adaptive neural control of uncertain MIMO nonlinear systems with unmodeled dynamics. Neurocomputing 2018;313:247–58.
14. Sutton RS, Barto AG. Reinforcement learning: An introduction. 2nd ed. Cambridge: MIT Press; 2018. p. 331–2.
15. Widrow B, Gupta NK, Maitra S. Punish/reward: Learning with a critic in adaptive threshold systems. IEEE Trans Syst Man Cybern 1973;3(5):455–65.
16. Cui RX, Yang CG, Li Y, et al. Adaptive neural network control of AUVs with control input nonlinearities using reinforcement learning. IEEE Trans Syst Man Cybern 2017;47(6):1019–29.
17. Guo XX, Yan WS, Cui RX. Event-triggered reinforcement learning-based adaptive tracking control for completely unknown continuous-time nonlinear systems. IEEE Trans Cybern 2020;50(7):3231–42.
18. He W, Gao HJ, Zhou C, et al. Reinforcement learning control of a flexible two-link manipulator: An experimental investigation. IEEE Trans Syst Man Cybern 2021;51(12):7326–36.
19. Yang J, Su JY, Li SH, et al. High-order mismatched disturbance compensation for motion control systems via a continuous dynamic sliding-mode approach. IEEE Trans Ind Inform 2014;10(1):604–14.
20. Razmjooei H, Shafiei MH, Palli G, et al. Non-linear finite-time tracking control of uncertain robotic manipulators using time-varying disturbance observer-based sliding mode method. J Intell Rob Syst 2022;104(2):1–13.
21. Liang YQ, Dong Q, Zhao YJ. Adaptive leader-follower formation control for swarms of unmanned aerial vehicles with motion constraints and unknown disturbances. Chin J Aeronaut 2020;33(11):2972–88.
22. Xian B, Zhang X, Zhang HN, et al. Robust adaptive control for a small unmanned helicopter using reinforcement learning. IEEE Trans Neural Netw Learn Syst 2022;33(12):7589–97.
23. Wang XR, Wang QL, Sun CY. Prescribed performance fault-tolerant control for uncertain nonlinear MIMO system using actor-critic learning structure. IEEE Trans Neural Netw Learn Syst 2022;33(9):4479–90.
24. Xu B, Sun FC, Pan YP, et al. Disturbance observer based composite learning fuzzy control of nonlinear systems with unknown dead zone. IEEE Trans Syst Man Cybern 2017;47(8):1854–62.
25. Jing YH, Yang GH. Fuzzy adaptive quantized fault-tolerant control of strict-feedback nonlinear systems with mismatched external disturbances. IEEE Trans Syst Man Cybern 2020;50(9):3424–34.
26. Zhang R, Xu B, Shi P. Output feedback control of micromechanical gyroscopes using neural networks and disturbance observer. IEEE Trans Neural Netw Learn Syst 2022;33(3):962–72.
27. Min HF, Xu SY, Fei SM, et al. Observer-based NN control for nonlinear systems with full-state constraints and external disturbances. IEEE Trans Neural Netw Learn Syst 2022;33(9):4322–31.
28. Ran MP, Li JC, Xie LH. Reinforcement-learning-based disturbance rejection control for uncertain nonlinear systems. IEEE Trans Cybern 2022;52(9):9621–33.
29. Kim JW, Shim H, Yang I. On improving the robustness of reinforcement learning-based controllers using disturbance observer. 2019 IEEE 58th Conference on Decision and Control (CDC). Piscataway: IEEE Press; 2020. p. 847–52.
30. Chen WH, Yang J, Guo L, et al. Disturbance-observer-based control and related methods—An overview. IEEE Trans Ind Electron 2015;63(2):1083–95.
31. Xu B, Shi ZK, Yang CG, et al. Composite neural dynamic surface control of a class of uncertain nonlinear systems in strict-feedback form. IEEE Trans Cybern 2014;44(12):2626–34.
32. Hojati M, Gazor S. Hybrid adaptive fuzzy identification and control of nonlinear systems. IEEE Trans Fuzzy Syst 2002;10(2):198–210.
33. Yao ZK, Liang XL, Zhao QT, et al. Adaptive disturbance observer-based control of hydraulic systems with asymptotic stability. Appl Math Model 2022;105:226–42.
