between the RNN and the output layer. The structure of the network is illustrated in Fig. 2.

Fig. 2: Overview of the PC-ESN++ deep neural network. The neurons are represented as nodes, the weights as arcs, and the biases of nodes as rectangles. The dashed lines signify weights that are updated during the learning phase.

Even though the method we propose here is inspired by our previous work [13], several adaptations and extensions were needed to make it applicable to the forward dynamics learning problem. The changes concern both the learning rule and the structure of the network. The new learning rule avoids the assumption that the noise of the system is known by placing a prior distribution over the noise of the target values. This results in different update rules for the weights connecting the RNN with the output layer, as well as an additional posterior distribution over the noise. Furthermore, the structure of the network has changed: the feed-back connections from the output layer to the RNN are omitted. This change has the advantage of reducing the propagation of possible errors back into the network. Finally, we introduced a leaky integrator which acts as a forgetting module and reduces the dynamic memory of the reservoir.

A. Generalized Hebbian Learning

The values h of the self-organized layer nodes derive from a linear combination of the inputs u and the weights connecting the input layer to the self-organized layer. This can be written in matrix form as in (1):

    h_{t+1} = W^{in}_t u_t    (1)

where the node values of the input and self-organized layers are represented by the column vectors u and h respectively. The input weights are represented by the triangular matrix W^{in} with elements w^{in}_{jk}; its entry at (j, k) is the weight from input node k to self-organized layer node j. The input weight matrix is updated according to the GHL rule. GHL belongs to the class of unsupervised learning algorithms [22] and constitutes an extension of Oja's rule to multiple outputs. The update rule is given in (2), where LT[·] denotes the operator that keeps the lower-triangular part of a matrix and the learning rate η_t decreases as t → ∞:

    ΔW^{in}_{t+1} = η_t ( h_{t+1} u_t^T − LT[ h_{t+1} h_{t+1}^T ] W^{in}_t )    (2)

Thus, the values of the self-organized nodes are an approximation of the inputs' principal components. The decorrelation of the inputs is important since the algorithm is fed with high-frequency samples of sensor values which are highly correlated. Further insights on the impact of the self-organized layer on the prediction accuracy of such an RNN, alongside additional details on its derivation, can be found in [13].
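To make the GHL update concrete, the following minimal numpy sketch implements (1) and (2) for a single sample, in the canonical form of Sanger's rule. The layer sizes, the random triangular initialization and the learning-rate schedule are illustrative assumptions, not the settings used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in = n_self = 21   # e.g. position, velocity, torque of 7 joints (assumed)
W_in = np.tril(rng.normal(scale=0.1, size=(n_self, n_in)))  # triangular input weights

def ghl_step(W_in, u, eta):
    """One Generalized Hebbian Learning step, cf. (1) and (2)."""
    h = W_in @ u                                        # (1): self-organized layer values
    # LT[.] keeps only the lower-triangular part; this is what extends
    # Oja's rule to multiple outputs and decorrelates the values of h.
    dW = eta * (np.outer(h, u) - np.tril(np.outer(h, h)) @ W_in)
    return W_in + dW, h

u = rng.normal(size=n_in)                # placeholder input sample
for t in range(1, 101):
    W_in, h = ghl_step(W_in, u, eta=0.1 / t)            # eta_t decreases as t grows
```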
B. Dynamic Reservoir

The values of the self-organized nodes are propagated to an Echo State Network (ESN). This ESN includes, in addition to the version of [13], a leaky integrator which at each time step reduces the memory of the reservoir by a certain amount. This element is expected to reduce the accumulated errors in the case of multi-step prediction and make the model more capable of handling long horizons. Also, ESNs have been found to perform well in non-linear system identification applications [23].

The ESN belongs to the class of RNNs which, unlike simple feed-forward networks, are able to approximate dynamical systems [24]. Another useful characteristic of RNNs is that their state is a unique representation of the inputs' history, due to the recurrent connections, and they therefore have a dynamic memory. Despite those advantages, adaptation of recurrent weights is a computationally expensive task. This issue is solved in ESNs by setting constant recurrent weights following a process that ensures the Echo State property. This property introduces a fading memory to the system, which should make the ESN capable of coping with accumulated errors in long-term predictions. The derivation of the recurrent weights is described in more detail in [13].

The reservoir depends on four parameters: the number of nodes, its sparsity, its spectral radius and the leak rate. The reservoir's size is related to the number of training instances; small reservoirs are preferable for large numbers of training samples [25]. The sparsity affects the computational time, while the spectral radius and the leak rate of the reservoir affect its memory. The state of the reservoir is updated according to (3):

    r_{t+1} = −ζ r_t + g( W^{res} r_t + W^{self} h_{t+1} )    (3)

where the values of the reservoir at time step t+1 are denoted as r_{t+1}, W^{self} is the weight matrix connecting the nodes of the self-organized layer with the reservoir, W^{res} holds the connections between the nodes of the reservoir, and ζ is the leaking rate that causes a reservoir memory leak at each time step. Both W^{self} and W^{res} have fixed weights.

The state of the output layer is derived from a linear combination of the nodes connected to it and is calculated as:

    o_{t+1} = W^{train}_t c_{t+1}    (4)

where all the adaptable weights and the nodes that are connected to the output layer o_{t+1} have been concatenated in the matrix W^{train} and the vector c_{t+1} respectively. Thus, inference of W^{train} constitutes a linear regression problem, and the weights can be updated by iteratively applying Bayesian linear regression, as explained in the following subsection.
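As a concrete illustration, the sketch below builds a fixed sparse reservoir scaled to a given spectral radius and applies the leaky state update and linear readout of (3) and (4). The sizes, the tanh non-linearity for g, the common (1 − ζ)r_t form of the leak term, and the composition of c are assumptions of this sketch; they are not prescribed by the text above.

```python
import numpy as np

rng = np.random.default_rng(1)
n_self, n_res, n_out = 21, 200, 14        # sizes assumed for illustration
sparsity, rho, zeta = 0.1, 0.9, 0.2       # sparsity, spectral radius, leak rate (assumed)

# Fixed weights: a sparse random reservoir rescaled to spectral radius rho,
# the usual recipe for obtaining the Echo State property [25].
W_res = rng.normal(size=(n_res, n_res)) * (rng.random((n_res, n_res)) < sparsity)
W_res *= rho / np.max(np.abs(np.linalg.eigvals(W_res)))
W_self = rng.normal(scale=0.5, size=(n_res, n_self))

def reservoir_step(r, h):
    """Leaky reservoir update in the spirit of (3); g = tanh and the
    (1 - zeta) * r leak form are assumptions of this sketch."""
    return (1.0 - zeta) * r + np.tanh(W_res @ r + W_self @ h)

r = reservoir_step(np.zeros(n_res), rng.normal(size=n_self))

# Readout, cf. (4): concatenate the nodes feeding the output layer.
# Including h and a bias term in c is an assumption of this sketch.
c = np.concatenate([r, rng.normal(size=n_self), [1.0]])
W_train = np.zeros((n_out, c.size))       # adapted by Bayesian linear regression (Sec. C)
o = W_train @ c                           # predicted positions and velocities
```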
C. Iterative Bayesian Linear Regression

In this work we avoid making any assumptions about the sensor noise; instead, the noise is inferred solely by the algorithm. This reduces the number of parameters requiring tuning and results in a different learning rule that makes the algorithm more adaptable to different robotic manipulators.

The goal of our learning rule is to infer the appropriate weights such that the output o of the PC-ESN++ approximates the true state of the robot s. For notational simplicity we limit the derivation of the learning rule to a single output, but it can easily be generalized to multiple outputs. Thus, the weights towards the node o are represented by the row vector w^{train}, and the forward dynamics model for a single joint at time step t can be rewritten as a regression problem such that w^{train}_t c_t + ε = s_t, where ε is noise.

The posterior mean corresponds to the most probable value of the weights and is derived from:

    w^{train}_t = V_t ( V^{-1}_{t-1} w^{train}_{t-1} + c_t τ_t )    (9)

where V_t is updated at each time step according to the relationship V^{-1}_t = V^{-1}_{t-1} + c_t c_t^T, the initial weight vector w_0 is a zero vector, and V_0 = I. The asymmetry and scale parameters of the posterior distribution are updated according to (10) and (11) respectively:

    α_t = n / 2    (10)

    β_t = (1/2) ( w^{train}_{t-1} V_{t-1} (w^{train}_{t-1})^T + τ^2 − w^{train}_t V_t (w^{train}_t)^T )    (11)

where n is the number of samples at time step t.

The posterior predictive distribution for a new input c_* is a Student's t-distribution as illustrated in (12):

    p( o_* | c_*, D ) = T( W^{train}_t c_*, (β_t / α_t)( I + c_*^T V_t c_* ), 2α_t )    (12)
The evaluation is performed in terms of prediction accuracy and convergence in two cases: a step-by-step (Sec. IV-A) and a full trajectory prediction scenario (Sec. IV-B). Furthermore, trying to find a middle ground between these two extreme cases, the algorithms are compared in terms of operational space error for different prediction horizons (Sec. IV-C). Moreover, we evaluate the computational time of the algorithms in Sec. IV-D. The inputs of the algorithms are the position, velocity and applied torques at a time step t, while the prediction output is the position and velocity at the next time step t+1.

TABLE I: Description of the evaluation datasets

Dataset   Trajectories   Total # of Samples   Sampling Frequency   Motion Type    DoF
Baxter    10             19295                100 Hz               Pick & Place   7
KUKA      10             20068                120 Hz               Pick & Place   7

TABLE II: One-step prediction error (nMSE) of the evaluated algorithms averaged over all joints and all cross-validation sets for the KUKA LWR dataset.

KUKA LWR
            Position Error               Velocity Error
Algorithm   Mean          St.Dev         Mean    Variance
PC-ESN++    1.67 · 10^-6  1.60 · 10^-6   0.04    0.01
GP          4.96 · 10^-6  6.95 · 10^-6   0.27    0.38
LWBR        1.03 · 10^-4  1.57 · 10^-4   1.61    1.62

TABLE III: One-step prediction error (nMSE) of the evaluated algorithms averaged over all joints and all cross-validation sets for the Baxter dataset.

Rethink Robotics Baxter
            Position Error               Velocity Error
Algorithm   Mean          Variance       Mean    Variance
PC-ESN++    2.40 · 10^-6  1.80 · 10^-6   0.02    0.01
GP          4.68 · 10^-5  8.93 · 10^-5   0.33    0.26
LWBR        2.67 · 10^-4  5.03 · 10^-4   1.96    1.96

A. Step-by-step prediction

In this evaluation scenario we assume that the actual position and velocity are available after each time step and
they are used as inputs for predicting the state of the joints at the next time step. The step-by-step evaluation can be useful for model-based RL algorithms that update their policy after every single step and not at the end of the trajectory. Tables II and III illustrate the nMSE of the three evaluated algorithms in the predicted joints' position and velocity, averaged over all joints and all cross-validation sets. The cross-validation sets are derived using a leave-one-out approach.

PC-ESN++ performs better than LWBR and GP both in position and velocity prediction. In particular, our proposed algorithm gives an average error of 2.04 · 10^-6 on predicting joint positions and 0.03 on joint velocities when considering both datasets. On the other hand, the state-of-the-art GP algorithm has an average of 2.59 · 10^-5 and 0.3 respectively. Thus, PC-ESN++ decreased the prediction error on joint positions by 21.21% and on joint velocities by 90% compared to GP, and even more compared to LWBR. Furthermore, its performance does not vary significantly between joints.

Another desirable characteristic of machine learning approaches is fast convergence, i.e., their ability to learn the dynamics model with as little training data as possible. Fig. 4 illustrates the convergence of PC-ESN++, LWBR and GP on a single joint of the Baxter robot. The algorithms are trained with an increasing number of trajectories and tested always on the same trajectory. Both GP and PC-ESN++ converge fast and as a result they are able to learn the forward dynamics from the first trajectories. On the contrary, LWBR needs more training samples in order to achieve its optimal performance.

Fig. 4: Step-by-step prediction convergence (nMSE versus number of trained trajectories) of the evaluated algorithms. Only the base joint of Baxter is illustrated for clarity. The convergence is similar for all other joints.
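A minimal sketch of this teacher-forced, step-by-step evaluation follows; the nMSE normalization by the target variance and the `model.predict` interface are assumptions of the sketch, not definitions from the paper.

```python
import numpy as np

def nmse(y_true, y_pred):
    """Normalized mean squared error; normalizing by the target variance
    is an assumption of this sketch."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean((y_true - y_pred) ** 2) / np.var(y_true)

def step_by_step_eval(model, states, torques):
    """Teacher forcing: the measured state (position, velocity) and applied
    torque at time t form the input; the measured state at t+1 is the target.
    `model.predict` is a hypothetical one-step predictor."""
    preds = [model.predict(np.concatenate([states[t], torques[t]]))
             for t in range(len(states) - 1)]
    return nmse(states[1:], preds)
```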
B. Full trajectory prediction

In this scenario we evaluate the ability of the algorithms to predict a full trajectory. This kind of problem is challenging, and a number of approaches have been proposed for dealing with long-term time-series predictions [26]. In this paper we restrict ourselves to testing all algorithms employing simply a recursive strategy; trying more sophisticated approaches falls outside the scope of this work. In the training phase the algorithms learn to make one-step predictions using the ground truth as inputs. In the testing phase the inputs are the predicted values of the previous time step. Thus, the algorithms rely only on their past predicted values, and an obvious problem in this case is that errors accumulate over the trajectory.

As shown in Tables IV and V, the full trajectory prediction performance of all the algorithms is significantly worse compared to the step-by-step scenario. This comes as no surprise, since the algorithms are making future predictions based on their own, imperfect, previous predictions. However, even in this scenario PC-ESN++ performs better than GP and LWBR, even though its prediction over the joints fluctuates more compared to its performance in the step-by-step scenario. In detail, PC-ESN++ had an average prediction error of 1.25 on joints' positions and 1.34 on joints' velocities.
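The recursive strategy can be sketched as follows; as in the previous sketch, `model.predict` is a hypothetical one-step predictor and only the initial state comes from measurements.

```python
import numpy as np

def recursive_rollout(model, s0, torques):
    """Full trajectory prediction: after the first step the model is fed its
    own previous prediction instead of the measured state, so one-step errors
    accumulate over the horizon."""
    s, trajectory = s0, []
    for tau in torques:                   # applied torques along the trajectory
        s = model.predict(np.concatenate([s, tau]))
        trajectory.append(s)
    return np.asarray(trajectory)
```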
TABLE IV: Full trajectory prediction error (nMSE) of the evaluated algorithms averaged over all joints and all cross-validation sets for the KUKA dataset.

KUKA LWR
            Position Error        Velocity Error
Algorithm   Mean     Variance     Mean    Variance
PC-ESN++    1.01     0.85         1.48    1.21
GP          1.71     0.96         1.67    0.92
LWBR        2.42     1.25         3.59    3.52

TABLE V: Full trajectory prediction error (nMSE) of the evaluated algorithms averaged over all joints and all cross-validation sets for the Baxter dataset.

Fig. 5: Full trajectory prediction convergence (nMSE versus number of trained trajectories) of the evaluated algorithms. Only the base joint of Baxter is illustrated for clarity. The convergence is similar for all other joints.

Fig. 6: Operational space error (distance in m versus prediction horizon in time-steps) averaged over the trajectory of the Baxter robot for different prediction horizons. The error is the distance between the actual and the predicted position of the end-effector.

Fig. 7: Computational time of the evaluated methods for performing a model-update as a function of the training instances. All results were measured on an Intel Core i7 @ 2.4 GHz processor with 12 GB of RAM.
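In terms of computation, the operational space error of Fig. 6 amounts to the following; the forward-kinematics function `fk` is a hypothetical placeholder mapping joint positions to a 3-D end-effector position.

```python
import numpy as np

def operational_space_error(q_true, q_pred, fk):
    """Euclidean distance between the actual and the predicted end-effector
    position, as plotted in Fig. 6."""
    return np.linalg.norm(fk(q_true) - fk(q_pred))
```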
The decorrelated outputs of the self-organized layer are propagated to a dynamic reservoir which projects them to high dimensions by a non-linear combination of fixed weights. Leaky integrators are included in the reservoir as forgetting modules. The output weights are updated by applying the Bayes rule at each iteration, without making any assumption about the noise of the system.

The performance of the algorithm was evaluated on two self-recorded datasets gathered from two industrial robots (KUKA LWR and Rethink Robotics Baxter), which we make publicly available. The datasets consist of 10 different trajectories recorded during the execution of pick and place tasks. The proposed algorithm was evaluated and compared with other state-of-the-art algorithms in terms of convergence and prediction accuracy in both joint and operational space. The evaluation considered two different scenarios: a step-by-step and a multi-step trajectory prediction.

The prediction performance and convergence of PC-ESN++ are better than the state-of-the-art in both datasets and for both evaluated scenarios. Furthermore, our algorithm handles the accumulated errors in multi-step predictions better, due to the fading memory of the reservoir and the leaky integrator. Another important characteristic of PC-ESN++ is its low demand for computational resources. Its complexity depends only on the size of the reservoir, while the complexity of the other algorithms depends on the number of training samples. This allows us to perform learning on large datasets without additional burden.

Even if the effect of long-term predictions on accuracy was ameliorated by the use of leaky integrators, the need for a strategy that handles the accumulated errors better becomes apparent [27]. Thus, as future work, we intend to investigate the impact of different multi-step prediction approaches that have been proposed in the field of time-series forecasting for the prediction of robots' dynamics.

Finally, we intend to apply the proposed PC-ESN++ within a model-based RL algorithm for learning typical tasks performed by industrial robotic manipulators. We expect that such an algorithm would be able to learn tasks in a small number of iterations and achieve a level of accuracy that will allow it to handle manufacturing problems.

REFERENCES

[1] R. Featherstone and D. Orin, "Robot dynamics: equations and algorithms," in IEEE International Conference on Robotics and Automation (ICRA), vol. 1, 2000, pp. 826–834.
[2] K. S. Narendra and K. Parthasarathy, "Identification and control of dynamical systems using neural networks," IEEE Transactions on Neural Networks, vol. 1, no. 1, pp. 4–27, 1990.
[3] D. Nguyen-Tuong and J. Peters, "Model learning for robot control: a survey," Cognitive Processing, vol. 12, no. 4, pp. 319–340, Nov. 2011.
[4] J. Nakanishi, R. Cory, M. Mistry, J. Peters, and S. Schaal, "Operational space control: A theoretical and empirical comparison," The International Journal of Robotics Research, vol. 27, no. 6, pp. 737–757, 2008.
[5] D. Wolpert and M. Kawato, "Multiple paired forward and inverse models for motor control," Neural Networks, vol. 11, no. 7–8, pp. 1317–1329, 1998.
[6] Y. Wada and M. Kawato, "A neural network model for arm trajectory formation using forward and inverse dynamics models," Neural Networks, vol. 6, no. 7, pp. 919–932, 1993.
[7] P. Kormushev, S. Calinon, and D. G. Caldwell, "Reinforcement learning in robotics: Applications and real-world challenges," Robotics, vol. 2, no. 3, pp. 122–148, 2013.
[8] M. P. Deisenroth, P. Englert, J. Peters, and D. Fox, "Multi-task policy search for robotics," in IEEE International Conference on Robotics and Automation (ICRA), 2014, pp. 3876–3881.
[9] A. Kupcsik, M. P. Deisenroth, J. Peters, A. P. Loh, P. Vadakkepat, and G. Neumann, "Model-based contextual policy search for data-efficient generalization of robot skills," Artificial Intelligence, Dec. 2014.
[10] N. Heess, G. Wayne, D. Silver, T. Lillicrap, T. Erez, and Y. Tassa, "Learning continuous control policies by stochastic value gradients," in Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, Eds. Curran Associates, Inc., 2015, pp. 2944–2952.
[11] J. Kober, J. A. Bagnell, and J. Peters, "Reinforcement learning in robotics: A survey," The International Journal of Robotics Research, vol. 32, no. 11, pp. 1238–1274, Aug. 2013.
[12] O. Sigaud, C. Salaün, and V. Padois, "On-line regression algorithms for learning mechanical models of robots: a survey," Robotics and Autonomous Systems, vol. 59, no. 12, pp. 1115–1129, 2011.
[13] A. S. Polydoros, L. Nalpantidis, and V. Krüger, "Real-time deep learning of robotic manipulator inverse dynamics," in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Sept. 2015.
[14] M. Hermans and B. Schrauwen, "Training and analysing deep recurrent neural networks," in Advances in Neural Information Processing Systems, 2013, pp. 190–198.
[15] A. El-Fakdi and M. Carreras, "Policy gradient based reinforcement learning for real autonomous underwater cable tracking," in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Sept. 2008, pp. 3635–3640.
[16] S. Ross and J. A. Bagnell, "Agnostic system identification for model-based reinforcement learning," in Proceedings of the 29th International Conference on Machine Learning, 2012, pp. 1703–1710.
[17] R. Koppejan and S. Whiteson, "Neuroevolutionary reinforcement learning for generalized helicopter control," in Proceedings of the 11th Annual Conference on Genetic and Evolutionary Computation (GECCO '09), July 2009, p. 145.
[18] G. Boone, "Efficient reinforcement learning: model-based Acrobot control," in Proceedings of the IEEE International Conference on Robotics and Automation, vol. 1, 1997.
[19] C. G. Atkeson, "Nonparametric model-based reinforcement learning," in Advances in Neural Information Processing Systems, 1998, pp. 1008–1014.
[20] J. Schneider, "Exploiting model uncertainty estimates for safe dynamic control learning," in Advances in Neural Information Processing Systems, 1997.
[21] J. Bagnell and J. Schneider, "Autonomous helicopter control using reinforcement learning policy search methods," in IEEE International Conference on Robotics and Automation, vol. 2, 2001, pp. 1615–1620.
[22] T. D. Sanger, "Optimal unsupervised learning in a single-layer linear feedforward neural network," Neural Networks, vol. 2, no. 6, pp. 459–473, 1989.
[23] H. Jaeger, "Adaptive nonlinear system identification with echo state networks," in Advances in Neural Information Processing Systems, 2002, pp. 593–600.
[24] K. Funahashi and Y. Nakamura, "Approximation of dynamical systems by continuous time recurrent neural networks," Neural Networks, vol. 6, no. 6, pp. 801–806, 1993.
[25] M. Lukoševičius, "A practical guide to applying echo state networks," in Neural Networks: Tricks of the Trade. Springer, 2012, pp. 659–686.
[26] A. Sorjamaa, J. Hao, N. Reyhani, Y. Ji, and A. Lendasse, "Methodology for long-term prediction of time series," Neurocomputing, vol. 70, no. 16, pp. 2861–2869, 2007.
[27] A. Venkatraman, M. Hebert, and J. A. Bagnell, "Improving multi-step prediction of learned time series models," in AAAI, 2015, pp. 3024–3030.