2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)

Daejeon Convention Center


October 9-14, 2016, Daejeon, Korea

A Reservoir Computing Approach for Learning Forward Dynamics of Industrial Manipulators

Athanasios S. Polydoros and Lazaros Nalpantidis

All the authors are with the Robotics, Vision and Machine Intelligence (RVMI) Lab., Department of Mechanical and Manufacturing Engineering, Aalborg University Copenhagen, Denmark {athapoly,lanalpa}@m-tech.aau.dk

This work has been supported by the European Commission through the research project “Sustainable and Reliable Robotics for Part Handling in Manufacturing Automation (STAMINA)”, FP7-ICT-2013-10-610917.

The authors would like to thank Mikkel Rath Pedersen for his help in gathering the dataset from the KUKA LWR iiwa.

978-1-5090-3762-9/16/$31.00 ©2016 IEEE

Authorized licensed use limited to: University Of Minnesota Duluth. Downloaded on September 28,2024 at 21:09:31 UTC from IEEE Xplore. Restrictions apply.

Abstract: Many robot learning algorithms depend on a model of the robot’s forward dynamics for simulating potential trajectories and ultimately learning a required task. In this paper, we present a data-driven reservoir computing approach and apply it for learning forward dynamics models. Our proposed machine learning algorithm exploits the concepts of dynamic reservoir, self-organized learning and Bayesian inference. We have evaluated our approach on datasets gathered from two industrial robotic manipulators and compared it on both step-by-step and multi-step trajectory prediction scenarios with state-of-the-art algorithms. The evaluation considers the algorithms’ convergence and prediction performance on joint and operational space for varying prediction horizons, as well as computational time. Results show that the proposed algorithm performs better than the state-of-the-art, converges fast and can achieve accurate predictions over longer horizons, which makes it a reliable, data-efficient approach for learning forward models.

I. INTRODUCTION

Modeling robots’ dynamics has been an appealing topic of scientific research over the last decades with numerous proposed approaches and algorithms [1], [2]. Such models can represent a robot’s embodiment and its interaction with the environment, and can be used for control and simulation [3]. In particular, forward dynamics (also known as direct dynamics) provide the state of the joints given the applied forces, and can therefore be used for simulating the effects of actions on the robot state. Some applications of forward dynamics include operational space control [4], motor control [5] and formation of smooth trajectories [6]. Recently, trying to let robots learn tasks faster and minimize their interactions with the environment [7], model-based Reinforcement Learning (RL) has started gaining popularity [8]–[10]. Forward dynamics models are an essential part of model-based RL [11], as shown in Fig. 1, because the performance of such algorithms depends on the prediction accuracy of the model.

Fig. 1: Overview of a model-based RL algorithm. The dynamics model receives a training signal from the robot’s sensors, then the learned model is used for simulating the trajectory that derives from the current policy. The reward of the trajectory is propagated to the policy learning algorithm which updates the policy and applies it on the robot.

Traditionally, the derivation of forward models can be done analytically by exploiting physics equations and properties of the robot’s structure. However, this approach can be problematic since it requires precise knowledge of robot properties such as inertial matrices of the links, type of joints, masses, friction coefficients, centrifugal forces, etc. These are difficult to calculate and, even worse, they may change depending on the wear and tear of the robot. An alternative approach, in order to overcome such problems, is the use of data-based models. In those cases (this work is one of them) the models are learned by employing machine learning approaches which use information gathered from the sensors for inferring the mapping that best approximates the model [12].

In this paper we focus on learning forward dynamics directly from sensor data. To achieve this, we draw inspiration from our previous work [13], where we were learning inverse dynamics models, but develop an enhanced methodology to target this new, different problem. Our approach is based on reservoir computing which, to the best of the authors’ knowledge, has never been applied before to this problem. The proposed algorithm, which we call PC-ESN++, is an enhanced version of the Principal-Components Echo State Network (PC-ESN) algorithm, as presented in [13] for learning inverse dynamics, and belongs to the class of deep learning algorithms [14]. The network consists of a self-organized layer which decorrelates the inputs and a fixed recurrent network (dynamic reservoir), and the learning rule is an iterative formulation of Bayesian regression.

In contrast to our previous work, the enhanced version illustrated in this paper avoids any assumptions about the sensor noise; instead, the noise is solely inferred by the algorithm. This fact reduces the number of parameters requiring tuning and results in a different learning rule that makes the algorithm more adaptable to different robotic
manipulators. Furthermore, given the nature of our targeted problem (learning forward dynamics models) we have been able to simplify the structure of the network by omitting the connections from the output back to the reservoir. As a result of this, prediction errors are not propagated back to the network, making the algorithm more reliable in long-term predictions. Long horizon predictions suffer from the accumulation and propagation of errors, which can lead the model to unknown states and cause instability. Furthermore, we add a forgetting factor, the leaky integrator, which reduces the dynamic memory of the reservoir. This module makes the model more capable of handling long prediction horizons.

The contribution of this paper is three-fold. We apply a deep learning algorithm for the first time on the problem of learning forward dynamics which outperforms the current state-of-the-art algorithms in prediction performance and matches their convergence, both on step-by-step and multi-step trajectory prediction for various horizons. Moreover, the proposed algorithm is much less computationally demanding than the state-of-the-art since it is memory-less, i.e. it does not need to retain sensors’ data in memory. Finally, we make publicly available the datasets we gathered from two industrial robots, the Rethink Robotics Baxter robot and the KUKA LWR iiwa arm (see footnote 1).

Footnote 1: Our recorded datasets are made publicly available at https://bitbucket.org/athapoly/datasets/

II. RELATED WORK

Physics-based models that describe forward dynamics are widely used as deterministic models [15], [16]. The main disadvantage of such models is that they contain many factors that are difficult to express analytically, such as friction and dynamics of elastic joints. Another disadvantage is that these models do not take into account potential changes of the robot’s environment. On the other hand, machine learning models do not suffer from those problems because they are based only on sensor data. Thus, a workaround is to create hybrid models by combining physics models and machine learning algorithms for inferring hard-to-model quantities [17], [18]. Nevertheless, pure machine learning algorithms can be used to infer a mapping $M$ such that $M(s_t, u_t) \mapsto s_{t+1}$, where $s_t$ is the state of the robot at time step $t$, $u_t$ are the applied commands and $s_{t+1}$ is the predicted state for the next time step.

A widely used deterministic algorithm is the Locally Weighted Linear Regression (LWLR) [19], which creates a nonparametric model that fits linear regressions locally on training data. The predictions of the model’s regression coefficients are made using the ordinary least squares estimator, weighted by a kernel that provides a measurement of the similarity between new inputs and learned data. Thus, the data-points that are closer to the input affect the prediction more. LWLR is memory-based; it needs to keep all the training inputs in memory, which makes it a computationally expensive approach.

Stochastic machine learning models are more popular for modeling forward dynamics since they provide a level of uncertainty for their predictions. This characteristic is useful in model-based RL because the policy can be learned taking into account the noise of the system. An extension of LWLR is the Locally Weighted Bayesian Regression (LWBR), which is employed in [20] for learning the forward dynamics of a cart-pole system and in [21] for learning the forward model of a helicopter. LWBR combines Bayesian inference and LWLR. The regression coefficients are given a Gaussian prior and the noise is assumed to be described as a Gamma distribution. The weighted inputs are used for the derivation of the regression coefficients according to the learning rule of Bayesian regression. The model’s output is a Student’s t-distribution over the future state.

The state-of-the-art approach for learning forward models, and in particular stochastic models, is Gaussian Processes (GP), applied in [8], [9] for learning the forward dynamics of complex platforms like high-DOF robotic manipulators and biped robots. GP is a non-parametric method like LWBR and thus makes no assumption about the function $M$ that maps current states and actions to future states. This makes it a powerful learning method. Furthermore, it employs the kernel trick for projecting the inputs into a high-dimensional space and is defined by its mean $m$ and a kernel (covariance function) $k$. The mean of the GP is assumed to be zero in most cases and a very common choice of kernel is one that belongs to the exponential family. Some kernels depend on parameters, the hyper-parameters of a GP. The hyper-parameters can be tuned by a variety of methods, including greedy search over the hyper-parameters’ space and, more commonly, marginal likelihood-based methods. The prediction of the GP is a Gaussian distribution over the future state.

Our proposed method, the PC-ESN++, belongs to the class of non-parametric generative models such as LWBR and GP. Its main advantage over those methods is that it is memory-less; it does not keep the past inputs in memory, which significantly decreases its computational requirements. This is achieved due to its incremental learning rule. Moreover, contrary to LWBR, it projects the inputs to a high-dimensional space, like GP, by using a recurrent neural network with fixed weights (dynamic reservoir). This makes our approach computationally cheaper compared to the use of kernels. Finally, the fading memory property of the dynamic reservoir can reduce accumulated errors when predicting the trajectory over long horizons.

III. THE PC-ESN++

The PC-ESN++ is a deep neural network that consists of two hidden layers, an input and an output layer. The inputs are the position, velocity and applied torques at each time-step, which are propagated through a feed-forward network to the self-organized layer. That layer decorrelates them according to the Generalized Hebbian Learning (GHL) rule. The decorrelated inputs are fed to the second hidden layer, a fixed Recurrent Neural Network (RNN), which projects them to high dimensions. At the final step we iteratively apply Bayesian Linear Regression for inferring the weights

between the RNN and the output layer. The structure of the network is illustrated in Fig. 2.

Fig. 2: Overview of the PC-ESN++ deep neural network. The neurons are represented as nodes, the weights as arcs and the bias of nodes as rectangles. The dashed lines signify weights that are updated during the learning phase.

Even though the method we are proposing here is inspired by our previous work [13], several adaptations and extensions needed to be performed in order for it to be applicable to the forward dynamics learning problem. Changes include both the learning rule and the structure of the used network. The new learning rule avoids the assumption that the noise of the system is known by putting a prior distribution over the noise of the target values. This results in different update rules for the weights connecting the RNN with the output layer and an additional posterior distribution over the noise. Furthermore, the structure of the used network has changed, omitting the feed-back connections from the output layer to the RNN. This change has the advantage of reducing the propagation of possible errors back to the network. Finally, we introduced a leaky integrator which acts as a forgetting module and reduces the dynamic memory of the reservoir.

A. Generalized Hebbian Learning

The values $h$ of the self-organized layer nodes derive from a linear combination between the inputs $u$ and the weights connecting the input layer and the self-organized layer. This can be written in matrix form as in (1):

$$h_{t+1} = W^{in}_t u_t \tag{1}$$

where the node values of the input and self-organized layer are represented by the column vectors $u$ and $h$ respectively. The inputs’ weights are represented by the triangular matrix $W^{in}$ with elements $w^{in}_{jk}$ and therefore, its entry in $(j, k)$ is the weight from the input node $k$ to the self-organized layer node $j$. The inputs’ matrix is updated according to the GHL rule. GHL belongs to the class of unsupervised learning [22] and constitutes an extension of Oja’s rule to multiple outputs. The update rule is illustrated in (2) where $LT[\cdot]$ denotes a lower-triangular matrix and the learning rate $\eta_t$ decreases as $t \to \infty$.

$$\Delta W^{in}_{t+1} = \eta_t \left( u_t h^T_{t+1} - LT\left[ h_{t+1} h^T_{t+1} \right] W^{in}_t \right) \tag{2}$$

Thus, the values of the self-organized nodes are an approximation of the inputs’ principal components. The decorrelation of the inputs is important since the algorithm is fed with high-frequency samples of sensors’ values which are highly correlated. Further insights on the impact of the self-organized layer on the prediction accuracy of such an RNN, alongside additional details on its derivation, can be found in [13].

B. Dynamic Reservoir

The values of the self-organized nodes are propagated to an Echo State Network (ESN). This ESN includes, in addition to the version of [13], a leaky integrator which at each time-step reduces the memory of the reservoir by a certain amount. This element is expected to reduce the accumulated errors in the case of multi-step prediction and make the model more capable of handling long horizons. Also, ESN has been found to perform well in non-linear system identification applications [23].

ESNs belong to the class of RNNs which, unlike simple feed-forward networks, are able to approximate dynamical systems [24]. Another useful characteristic of RNNs is that their state is a unique representation of the inputs’ history, due to the recurrent connections, and therefore they have a dynamic memory. Despite those advantages, adaptation of the recurrent weights is a computationally expensive task. This issue is solved in ESNs by setting constant recurrent weights following a process that ensures the Echo State property. This property introduces a fading memory to the system, which should make ESN capable of coping with accumulated errors in long-term predictions. The derivation of the recurrent weights is described in more detail in [13].

The reservoir depends on four parameters: the number of nodes, its sparsity, its spectral radius and the leak rate. The reservoir’s size is related to the number of training instances; small reservoirs are preferable for large numbers of training samples [25]. The sparsity affects the computational time, while the spectral radius and the leak rate of the reservoir affect its memory. The state of the reservoir is updated according to (3):

$$r_{t+1} = -\zeta r_t + g\left( W^{res} r_t + W^{self} h_{t+1} \right) \tag{3}$$

where the values of the reservoir at time step $t + 1$ are denoted as $r_{t+1}$. $W^{self}$ is the weights matrix connecting the nodes of the self-organized layer with the reservoir. $W^{res}$ are the connections between the nodes of the reservoir and $\zeta$ is the leaking rate that causes a reservoir memory leak at each time step. Both $W^{self}$ and $W^{res}$ have fixed weights.

The state of the output layer is derived from a linear combination of the nodes connected to it and is calculated as:

$$o_{t+1} = W^{train} c_{t+1} \tag{4}$$

where all the adaptable weights and the nodes that are connected to the output layer $o_{t+1}$ have been concatenated in the matrix $W^{train}$ and the vector $c_{t+1}$ respectively.

Thus, it is clear that inference of $W^{train}$ constitutes a linear regression problem and can be updated by iteratively applying Bayesian linear regression, as explained in the following subsection.
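The forward pass described by (1) to (4) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The function names, the tanh choice for the activation g, the uniform random initialisation with spectral-radius rescaling, and the choice of stacking the reservoir and self-organized nodes into c are all hypothetical. The GHL step uses the common form of Sanger's rule, with the outer product ordered as h uᵀ so that the update has the same shape as W_in, and the reservoir step follows (3) as printed.

```python
import numpy as np


def make_reservoir(n_res, n_self, spectral_radius=0.9, sparsity=0.8, seed=0):
    """Draw fixed random weights (assumed scheme; the paper defers to [13])."""
    rng = np.random.default_rng(seed)
    W_res = rng.uniform(-1.0, 1.0, size=(n_res, n_res))
    # Zero out a fraction `sparsity` of the recurrent links.
    W_res[rng.random((n_res, n_res)) < sparsity] = 0.0
    # Rescale so the largest eigenvalue magnitude equals `spectral_radius`.
    radius = np.max(np.abs(np.linalg.eigvals(W_res)))
    W_res *= spectral_radius / radius
    W_self = rng.uniform(-1.0, 1.0, size=(n_res, n_self))
    return W_res, W_self


def ghl_step(W_in, u, eta):
    """Eq. (1) followed by one Generalized Hebbian (Sanger) update of W_in."""
    h = W_in @ u  # eq. (1): self-organized node values
    # LT[.] keeps the lower triangle (including the diagonal) of h h^T.
    dW = eta * (np.outer(h, u) - np.tril(np.outer(h, h)) @ W_in)
    return W_in + dW, h


def reservoir_step(r, h, W_res, W_self, zeta, g=np.tanh):
    """Eq. (3) as printed: leak term plus non-linear recurrent drive."""
    return -zeta * r + g(W_res @ r + W_self @ h)


def output_step(W_train, r, h):
    """Eq. (4); assumption: c stacks reservoir and self-organized nodes."""
    c = np.concatenate([r, h])
    return W_train @ c, c
```

In this sketch only `W_in` changes inside the forward pass; `W_res` and `W_self` stay fixed after `make_reservoir`, and `W_train` is whatever the Bayesian learning rule of the following subsection has produced so far.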

C. Iterative Bayesian Linear Regression

In this work we avoid making any assumptions about the sensor noise; instead, the noise is solely inferred by the algorithm. This fact reduces the number of parameters requiring tuning and results in a different learning rule that makes the algorithm more adaptable to different robotic manipulators.

The goal of our learning rule is to infer the appropriate weights such that the output $o$ of the PC-ESN++ approximates the true state of the robot $s$. For notation simplicity we limit the derivation of the learning rule to a single output, but it can be easily generalized to multiple outputs. Thus, the weights towards the node $o$ are represented by the row vector $w^{train}$ and the forward dynamics model for a single joint at time step $t$ can be rewritten as a regression problem such that $w^{train}_t c_t + \varepsilon = s_t$, where $\varepsilon$ is noise.

Regression problems can be solved using deterministic methods such as ordinary least squares estimation and ridge regression. The Bayesian approach, which is used in PC-ESN++, derives a joint probability distribution over the regression coefficients and the variance of the noise, $p(w^{train}_t, \sigma^2 \mid D)$. The likelihood of the regression model can be written as a conditional probability distribution according to (5) and it is assumed to be a normal distribution with mean $w^{train} c_t$ and variance $\sigma^2$.

$$p(\tau_t \mid c_t, w^{train}, \sigma^2) = \mathcal{N}(\tau_t \mid w^{train} c_t, \sigma^2) \tag{5}$$

In (5), $\sigma^2$ is the unknown variance of the noise $\varepsilon$, while $\tau$ is the target value of the training sample. A conjugate prior of the normally distributed likelihood is the Normal-Inverse-Gamma distribution (NIG), which is used to represent the prior knowledge about the joint distribution of the weights and noise variance, as in (6):

$$p(w^{train}_t, \sigma^2) = \mathrm{NIG}(w^{train}, \sigma^2 \mid w^{train}_{t-1}, V_{t-1}, \alpha_{t-1}, \beta_{t-1}) \tag{6}$$

where the prior of the regression coefficients’ covariance matrix is $V_{t-1}$ and $\alpha$, $\beta$ respectively are shape and scale parameters of the distribution that represents the belief before receiving a training signal at $t - 1$.

When a new training signal becomes available at time step $t$, the posterior probability distribution is calculated by recursively applying the Bayes rule as:

$$p(w^{train}_t, \sigma^2 \mid D) \propto \mathrm{NIG}(w^{train}, \sigma^2 \mid w^{train}_{t-1}, V_{t-1}, \alpha_{t-1}, \beta_{t-1}) \cdot \mathcal{N}(\tau_t \mid w^{train}_{t-1}, c_{t-1}, \sigma^2) \tag{7}$$

Given that the likelihood and the prior distributions are conjugate and by applying the Bayesian rule on linear Gaussian systems, the joint posterior probability is NIG as illustrated in (8):

$$p(w^{train}_t, \sigma^2 \mid D) = \mathrm{NIG}(w^{train}, \sigma^2 \mid w^{train}_t, V_t, \alpha_t, \beta_t) \tag{8}$$

The location of the NIG distribution $w^{train}_t$ corresponds to the most probable value of the weights and is derived from:

$$w^{train}_t = V_t \left( V^{-1}_{t-1} w_{t-1} + c_t \tau_t \right) \tag{9}$$

where $V_t$ is updated at each time-step according to the relationship $V_t = \left( V^{-1}_{t-1} + c_t c_t^T \right)^{-1}$, the initial weight vector $w_0$ is zero, and $V_0 = I$. The shape and scale parameters of the posterior distribution are updated according to (10) and (11) respectively:

$$\alpha_t = \frac{n}{2} \tag{10}$$

$$\beta_t = \frac{1}{2} \left( w^{train}_{t-1} V_{t-1} w^{T}_{t-1} + \tau^2 - w^{train}_t V_t w^{T}_t \right) \tag{11}$$

where $n$ is the number of samples at time step $t$.

The posterior predictive distribution for a new input $c_*$ is a Student’s t-distribution as illustrated in (12):

$$p(o_* \mid c_*, D) = \mathcal{T}\left( W^{train}_t c_*, \; \frac{\beta_t}{\alpha_t} \left( I + c_* V_t c_*^T \right), \; 2\alpha_t \right) \tag{12}$$

where $o_*$ is the predicted output. Finally, the marginal posterior distributions over the unknown trained weights $w^{train}$ and variance $\sigma^2$ are given by (13) and (14) respectively:

$$p(w^{train} \mid D) = \mathcal{T}\left( w^{train}_t, \; \frac{\beta_t}{\alpha_t} V_t, \; 2\alpha_t \right) \tag{13}$$

$$p(\sigma^2 \mid D) = \mathrm{IG}(\alpha_t, \beta_t) \tag{14}$$

Thus, the proposed learning rule infers iteratively both the regression coefficients and the variance of the sensors’ noise.

IV. EVALUATION RESULTS

The proposed algorithm has been evaluated on two self-recorded datasets gathered from two different robotic manipulators, the KUKA Light Weight Robot (LWR) and Rethink Robotics Baxter (we used one of its arms, equipped with a parallel gripper). The two robots have very different characteristics when it comes to accuracy, repeatability and stiffness, making them an interesting pair to examine. The datasets contain trajectories generated during the execution of pick and place tasks. The pick and place locations were drawn randomly from two non-overlapping areas with size 50 × 50 cm each, as illustrated in Fig. 3. The robots were considered to have fully executed a task by starting and finishing at the same location. An overview of the datasets is presented in Table I.

Fig. 3: Sketch of the set-up used for the creation of datasets. The exact pick and place locations were randomly chosen.

The performance of PC-ESN++ is compared with two other state-of-the-art non-parametric generative methods, the LWBR and GPs, as presented in Sec. II. The comparison is

performed in terms of prediction accuracy and convergence in two cases: a step-by-step (Sec. IV-A) and a full trajectory prediction scenario (Sec. IV-B). Furthermore, trying to find a middle ground between these two extreme cases, the algorithms are compared in terms of operational space error for different prediction horizons (Sec. IV-C). Moreover, we evaluate the computational time of the algorithms in Sec. IV-D. The inputs of the algorithms are the position, velocity and applied torques at a time step $t$, while the prediction output is the position and velocity at the next time step $t + 1$.

TABLE I: Description of the evaluation datasets

Dataset   Trajectories   Total # of Samples   Sampling Frequency   Motion Type    DoF
Baxter    10             19295                100 Hz               Pick & Place   7
KUKA      10             20068                120 Hz               Pick & Place   7

A. Step-by-step prediction

In this evaluation scenario we assume that the actual position and velocity are available after each time step and they are used as inputs for predicting the state of the joints at the next time step. The step-by-step evaluation can be useful for model-based RL algorithms that update their policy after every single step and not at the end of the trajectory. Tables II and III illustrate the nMSE of the three evaluated algorithms in the predicted joints’ position and velocity averaged over all joints and all cross-validation sets. The cross-validation sets are derived using a leave-one-out approach.

TABLE II: One-step prediction error (nMSE) of the evaluated algorithms averaged over all joints and all cross-validation sets for the KUKA LWR dataset.

KUKA LWR
Algorithm   Position Error (Mean)   Position Error (St.Dev)   Velocity Error (Mean)   Velocity Error (Variance)
PC-ESN++    1.67 · 10^-6            1.60 · 10^-6              0.04                    0.01
GP          4.96 · 10^-6            6.95 · 10^-6              0.27                    0.38
LWBR        1.03 · 10^-4            1.57 · 10^-4              1.61                    1.62

TABLE III: One-step prediction error (nMSE) of the evaluated algorithms averaged over all joints and all cross-validation sets for the Baxter dataset.

Rethink Robotics Baxter
Algorithm   Position Error (Mean)   Position Error (Variance)   Velocity Error (Mean)   Velocity Error (Variance)
PC-ESN++    2.40 · 10^-6            1.80 · 10^-6                0.02                    0.01
GP          4.68 · 10^-5            8.93 · 10^-5                0.33                    0.26
LWBR        2.67 · 10^-4            5.03 · 10^-4                1.96                    1.96

The PC-ESN++ performs better than LWBR and GPs both in position and velocity prediction. Particularly, our proposed algorithm gives an average error of 2.04 · 10^-6 on predicting joint positions and 0.03 on joint velocities when considering both datasets. On the other hand, the state-of-the-art GP algorithm has an average of 2.59 · 10^-5 and 0.3 respectively. Thus, PC-ESN++ decreased the prediction error on joint positions by 92.12% (1 − 2.04 · 10^-6 / 2.59 · 10^-5) and on joint velocities by 90% compared to GP, and even more compared to LWBR. Furthermore, its performance does not vary significantly between joints.

Another desirable characteristic of machine learning approaches is fast convergence, that is, their ability to learn the dynamics model with as little training data as possible. Fig. 4 illustrates the convergence of PC-ESN++, LWBR and GP on a single joint of the Baxter robot. The algorithms are trained with an increasing number of trajectories and tested always on the same trajectory. Both GPs and PC-ESN++ converge fast and as a result they are able to learn the forward dynamics from the first trajectories. On the contrary, LWBR needs more training samples in order to achieve its optimal performance.

Fig. 4: Step-by-step prediction convergence of the evaluated algorithms (nMSE versus number of trained trajectories, for PC-ESN++, LWBR and GP). Only the base joint of Baxter is illustrated for clarity. The convergence is similar for all other joints.

B. Full trajectory prediction

In this scenario we evaluate the ability of the algorithms to predict a full trajectory. This kind of problem is challenging and a number of approaches have been proposed for dealing with long-term time-series predictions [26]. In this paper we restrict ourselves to testing all algorithms employing simply a recursive strategy; trying more sophisticated approaches falls outside the scope of this work. In the training phase the algorithms learn to make one-step predictions using the ground-truth as inputs. In the testing phase the inputs are the predicted values of the previous time step. Thus, the algorithms are only based on their past predicted values, and an obvious problem in this case is that errors are accumulated over the trajectory.

As shown in Tables IV and V, the full trajectory prediction performance of all the algorithms is significantly worse compared to the step-by-step scenario. This comes as no surprise since the algorithms are making future predictions based on their own, imperfect, previous predictions. However, even in this scenario the PC-ESN++ performs better than GP and LWBR, even though its prediction over the joints fluctuates more compared to its performance in the step-by-step scenario. In detail, PC-ESN++ had an average prediction error of 1.25 on joints’ positions and 1.34 on joints’ velocities

when considering both datasets. The corresponding errors of GP are 2.51 and 2.765 respectively. Thus, the proposed approach decreased the prediction error by 50.2% on joints’ positions and by 51.54% on joints’ velocities compared to GP, and even more compared to LWBR. Finally, PC-ESN++ converges faster than the other algorithms, as illustrated in Fig. 5. It needs three trajectories, while LWBR and GP need more in order to achieve their optimal performance.

TABLE IV: Full trajectory prediction error (nMSE) of the evaluated algorithms averaged over all joints and all cross-validation sets for the KUKA dataset.

KUKA LWR
Algorithm   Position Error (Mean)   Position Error (Variance)   Velocity Error (Mean)   Velocity Error (Variance)
PC-ESN++    1.01                    0.85                        1.48                    1.21
GP          1.71                    0.96                        1.67                    0.92
LWBR        2.42                    1.25                        3.59                    3.52

TABLE V: Full trajectory prediction error (nMSE) of the evaluated algorithms averaged over all joints and all cross-validation sets for the Baxter dataset.

Rethink Robotics Baxter
Algorithm   Position Error (Mean)   Position Error (Variance)   Velocity Error (Mean)   Velocity Error (Variance)
PC-ESN++    1.48                    1.21                        1.20                    0.76
GP          1.6                     1.5                         1.93                    1.60
LWBR        3.52                    3.37                        1.61                    1.98

Fig. 5: Full trajectory prediction convergence of the evaluated algorithms (nMSE versus number of trained trajectories, for PC-ESN++, GP and LWBR). Only the base joint of Baxter is illustrated for clarity. The convergence is similar for all other joints.

C. Varying prediction horizons

In this part of the evaluation we explored how the algorithms perform in situations falling between the aforementioned two extreme cases. We varied the considered prediction horizons from 100 steps to 1000 steps, in increments of 100 steps. The evaluation was performed on the Baxter dataset, using 9 trajectories for training and 1 for testing; all algorithms were trained and tested with the same trajectories. For each horizon we evaluated the performance of all algorithms in terms of operational space error. The results are illustrated in Fig. 6. It can be seen that the error of PC-ESN++ in the operational space is lower than both GP and LWBR for all horizons. Furthermore, all the algorithms converge to low error for short horizons but longer ones are handled better by PC-ESN++.

Fig. 6: Operational space error (distance in m versus prediction horizon in time-steps) averaged over the trajectory of the Baxter robot for different prediction horizons. The error is the distance between the actual and the predicted position of the end-effector.

D. Computational Time

Computational time is a significant characteristic of model learning algorithms. Fast updates of the model make the algorithms usable in real-time, which can significantly improve the prediction accuracy. In this section we evaluated the algorithms in terms of computational time for an increasing amount of training data. Fig. 7 illustrates this comparison, where PC-ESN++ requires a constant amount of time, independent of the amount of data. The other methods depend on the number of training instances: they are memory-based and their computational time grows with the amount of considered data. Small fluctuations of the time are caused by operating system noise, while the sharp reduction noticeable in GP happens because the algorithm employs a more computationally efficient method after a certain amount of data.

Fig. 7: Computational time (in seconds) of the evaluated methods for performing a model-update as a function of the number of training instances. All results were measured on an Intel Core i7 @ 2.4 GHz processor with 12 GB of RAM.

V. CONCLUSIONS & DISCUSSION

In this paper we applied a novel memory-free and reservoir-based approach for learning the forward dynamics of robotic manipulators using only data from sensors. The

decorrelated outputs of the self-organized layer are propagated to a dynamic reservoir which projects them to high dimensions by a non-linear combination of fixed weights. Leaky integrators are included in the reservoir as forgetting modules. The output weights are updated by applying the Bayes rule at each iteration without making any assumption about the noise of the system.

The performance of the algorithm was evaluated on two self-recorded datasets gathered from two industrial robots (KUKA LWR and Rethink Robotics Baxter), which we make publicly available. The datasets consist of 10 different trajectories recorded during the execution of pick and place tasks. The proposed algorithm was evaluated and compared with other state-of-the-art algorithms in terms of convergence and prediction accuracy in both joint and operational space. The evaluation considered two different scenarios: step-by-step and multi-step trajectory prediction.

The prediction performance and convergence of PC-ESN++ is better than the state-of-the-art in both datasets and for both evaluated scenarios. Furthermore, our algorithm can better handle the accumulated errors in multi-step predictions due to the fading memory of the reservoir and the leaky integrator. Another important characteristic of PC-ESN++ is its low demand for computational resources. Its complexity only depends on the size of the reservoir, while the complexity of the other algorithms depends on the number of training samples. This allows us to perform learning on large datasets without additional burden.

Even if the effect of long-term predictions on accuracy was ameliorated by the use of leaky integrators, the need for a strategy that better handles the accumulated errors becomes apparent [27]. Thus, as future work, we intend to investigate the impact of different multi-step prediction approaches that have been proposed in the field of time-series forecasting for the prediction of robots’ dynamics.

Finally, we intend to apply the proposed PC-ESN++ within a model-based RL algorithm for learning typical tasks performed by industrial robotic manipulators. We expect that such an algorithm would be able to learn tasks in a small number of iterations and achieve a level of accuracy that will allow it to handle manufacturing problems.

REFERENCES
[1] R. Featherstone and D. Orin, “Robot dynamics: equations and algorithms,” in Robotics and Automation, 2000. Proceedings. ICRA ’00. IEEE International Conference on, vol. 1, 2000, pp. 826–834.
[2] K. S. Narendra and K. Parthasarathy, “Identification and control of dynamical systems using neural networks,” IEEE Transactions on Neural Networks, vol. 1, no. 1, pp. 4–27, 1990.
[3] D. Nguyen-Tuong and J. Peters, “Model learning for robot control: a survey,” Cognitive Processing, vol. 12, no. 4, pp. 319–340, Nov. 2011.
[4] J. Nakanishi, R. Cory, M. Mistry, J. Peters, and S. Schaal, “Operational space control: A theoretical and empirical comparison,” The International Journal of Robotics Research, vol. 27, no. 6, pp. 737–757, 2008.
[5] D. Wolpert and M. Kawato, “Multiple paired forward and inverse models for motor control,” Neural Networks, vol. 11, no. 7–8, pp. 1317–1329, 1998.
[6] Y. Wada and M. Kawato, “A neural network model for arm trajectory formation using forward and inverse dynamics models,” Neural Networks, vol. 6, no. 7, pp. 919–932, 1993.
[7] P. Kormushev, S. Calinon, and D. G. Caldwell, “Reinforcement learning in robotics: Applications and real-world challenges,” Robotics, vol. 2, no. 3, pp. 122–148, 2013.
[8] M. P. Deisenroth, P. Englert, J. Peters, and D. Fox, “Multi-task policy search for robotics,” in IEEE International Conference on Robotics and Automation. IEEE, 2014, pp. 3876–3881.
[9] A. Kupcsik, M. P. Deisenroth, J. Peters, A. P. Loh, P. Vadakkepat, and G. Neumann, “Model-based contextual policy search for data-efficient generalization of robot skills,” Artificial Intelligence, Dec. 2014.
[10] N. Heess, G. Wayne, D. Silver, T. Lillicrap, T. Erez, and Y. Tassa, “Learning continuous control policies by stochastic value gradients,” in Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, Eds. Curran Associates, Inc., 2015, pp. 2944–2952.
[11] J. Kober, J. A. Bagnell, and J. Peters, “Reinforcement learning in robotics: A survey,” The International Journal of Robotics Research, vol. 32, no. 11, pp. 1238–1274, Aug. 2013.
[12] O. Sigaud, C. Salaün, and V. Padois, “On-line regression algorithms for learning mechanical models of robots: a survey,” Robotics and Autonomous Systems, vol. 59, no. 12, pp. 1115–1129, 2011.
[13] A. S. Polydoros, L. Nalpantidis, and V. Krüger, “Real-time deep learning of robotic manipulator inverse dynamics,” in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Sept. 2015.
[14] M. Hermans and B. Schrauwen, “Training and analysing deep recurrent neural networks,” in Advances in Neural Information Processing Systems, 2013, pp. 190–198.
[15] A. El-Fakdi and M. Carreras, “Policy gradient based Reinforcement Learning for real autonomous underwater cable tracking,” in 2008 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, Sept. 2008, pp. 3635–3640.
[16] S. Ross and J. A. Bagnell, “Agnostic system identification for model-based reinforcement learning,” Proceedings of the 29th International Conference on Machine Learning, pp. 1703–1710, 2012.
[17] R. Koppejan and S. Whiteson, “Neuroevolutionary reinforcement learning for generalized helicopter control,” in Proceedings of the 11th Annual Conference on Genetic and Evolutionary Computation - GECCO ’09. New York, New York, USA: ACM Press, July 2009, p. 145.
[18] G. Boone, “Efficient reinforcement learning: model-based Acrobot control,” Proceedings of International Conference on Robotics and Automation, vol. 1, 1997.
[19] C. G. Atkeson, “Nonparametric model-based reinforcement learning,” in Advances in Neural Information Processing Systems, 1998, pp. 1008–1014.
[20] J. Schneider, “Exploiting model uncertainty estimates for safe dynamic control learning,” Advances in Neural Information Processing Systems, 1997.
[21] J. Bagnell and J. Schneider, “Autonomous helicopter control using reinforcement learning policy search methods,” Robotics and Automation, IEEE International Conference on, vol. 2, pp. 1615–1620, 2001.
[22] T. D. Sanger, “Optimal unsupervised learning in a single-layer linear feedforward neural network,” Neural Networks, vol. 2, no. 6, pp. 459–473, 1989.
[23] H. Jaeger, “Adaptive nonlinear system identification with echo state networks,” in Advances in Neural Information Processing Systems, 2002, pp. 593–600.
[24] K. Funahashi and Y. Nakamura, “Approximation of dynamical systems by continuous time recurrent neural networks,” Neural Networks, vol. 6, no. 6, pp. 801–806, 1993.
[25] M. Lukoševičius, “A practical guide to applying echo state networks,” in Neural Networks: Tricks of the Trade. Springer, 2012, pp. 659–686.
[26] A. Sorjamaa, J. Hao, N. Reyhani, Y. Ji, and A. Lendasse, “Methodology for long-term prediction of time series,” Neurocomputing, vol. 70, no. 16, pp. 2861–2869, 2007.
[27] A. Venkatraman, M. Hebert, and J. A. Bagnell, “Improving multi-step prediction of learned time series models,” in AAAI, 2015, pp. 3024–3030.
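
As an illustration of the reservoir mechanism summarized in Section V, the following is a minimal sketch of a generic leaky-integrator echo state network. It is not the authors' PC-ESN++ implementation: it omits the self-organized decorrelation layer, and it replaces the Bayesian output-weight update with an ordinary ridge-regression readout. The names `LeakyESN`, `nmse` and `run_demo`, all hyperparameter values, and the toy sine-wave signal are assumptions made purely for illustration. The demo contrasts step-by-step prediction, where a measured input is available at every step, with a multi-step rollout, where each prediction is fed back as the next input.

```python
import numpy as np


class LeakyESN:
    """Generic leaky-integrator echo state network (illustrative only)."""

    def __init__(self, n_in, n_res=100, leak=0.3, rho=0.9, ridge=1e-6, seed=0):
        rng = np.random.default_rng(seed)
        self.leak = leak
        self.ridge = ridge
        # Fixed random input and recurrent weights; they are never trained.
        self.w_in = rng.uniform(-0.5, 0.5, size=(n_res, n_in))
        w = rng.uniform(-0.5, 0.5, size=(n_res, n_res))
        # Rescale to spectral radius rho, a common heuristic for fading memory.
        w *= rho / np.max(np.abs(np.linalg.eigvals(w)))
        self.w_res = w
        self.w_out = None
        self.state = np.zeros(n_res)

    def step(self, u):
        """Advance the reservoir by one time step for input vector u."""
        pre = self.w_in @ u + self.w_res @ self.state
        # Leaky integration: blend the previous state with the new activation.
        self.state = (1.0 - self.leak) * self.state + self.leak * np.tanh(pre)
        return self.state

    def fit(self, inputs, targets):
        """Fit the linear readout by ridge regression.

        Assumption: this is a stand-in for the Bayesian update in the paper.
        """
        self.state = np.zeros_like(self.state)
        states = np.array([self.step(u).copy() for u in inputs])
        a = states.T @ states + self.ridge * np.eye(states.shape[1])
        self.w_out = np.linalg.solve(a, states.T @ targets)

    def predict(self, u):
        """Update the state with input u and return the readout."""
        return self.step(u) @ self.w_out


def nmse(y_true, y_pred):
    """Mean squared error normalized by the target variance (one common form)."""
    return np.mean((y_true - y_pred) ** 2) / np.var(y_true)


def run_demo(n_train=1000, horizon=100):
    """Compare step-by-step and multi-step errors on a toy sine signal."""
    signal = np.sin(np.arange(n_train + horizon + 1) * 0.05)[:, None]
    x, y = signal[:-1], signal[1:]
    esn = LeakyESN(n_in=1)
    esn.fit(x[:n_train], y[:n_train])
    start_state = esn.state.copy()
    y_test = y[n_train:n_train + horizon]

    # Step-by-step: the measured signal is the input at every step.
    one_step = np.array([esn.predict(u) for u in x[n_train:n_train + horizon]])

    # Multi-step: only the first input is measured; predictions are fed back.
    esn.state = start_state.copy()
    u = x[n_train]
    rollout = []
    for _ in range(horizon):
        u = esn.predict(u)
        rollout.append(u)
    return nmse(y_test, one_step), nmse(y_test, np.array(rollout))
```

Calling `run_demo()` returns the two nMSE values. In a rollout the prediction error at each step becomes part of the next input, which is the error accumulation the conclusions refer to; the `leak` parameter controls how quickly old reservoir state is forgotten.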
