Intelligent Robotic Systems
1. Conclusions
We’ve briefly discussed what learning or intelligence means in a
robotics context:
• We’re focussing on lower-level control, hence intelligence and
(supervised) learning are closely related
• A simple interpretation is that the robot’s behaviour changes
after an experience: “it learns”
• We’re focussing on supervised learning, where input and
desired output exemplars are available and the problem is to
model the underlying regression function
• We’re also focussing on on-line or instantaneous learning,
where data is presented iteratively to the learning algorithm
and it updates its parameters straight away (not off-line
design, as is typical for supervised learning)
1. Exercises
1. Describe what learning means in a robotics context
2. Give an example of how a learning algorithm can be used to
solve a robotics (control) problem, clearly describing why
learning is necessary
3. Define what a supervised learning problem is
4. (MSc students) Describe whether (least squares) system
identification and Recursive Least Squares are examples of
batch, iterative or instantaneous learning algorithms
2. Linear Regression
Initially consider the supervised learning problem when the model
is linear in the parameters.
The presumed model is of the form:
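$$y_k = \theta_0 + \theta_1 x_{1,k} + \dots + \theta_n x_{n,k} = \mathbf{x}_k^T \boldsymbol{\theta}$$
where
$\mathbf{x}_k = [1, x_{1,k}, \dots, x_{n,k}]^T$ is the input vector
$\boldsymbol{\theta} = [\theta_0, \theta_1, \dots, \theta_n]^T$ is the parameter vector
f: X -> Y is linear in the parameters, θ, and affine (linear + constant) in
the input.
[Figure: network diagram in which the inputs 1, x_1, x_2, ..., x_n are weighted by θ_0, θ_1, θ_2, ..., θ_n and summed (Σ) to give y]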
2. One and Two Input Linear Regression
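One input: $y \approx f(\mathbf{x}) = \theta_0 + \theta_1 x$
Two inputs: $y \approx f(\mathbf{x}) = \theta_0 + \theta_1 x_1 + \theta_2 x_2$
[Figure: data points (x) plotted against x for the one-input model and against (x_1, x_2) for the two-input model]
[Figure: performance function J(θ) against θ, showing a step from θ_k to θ_{k+1} with J(θ_{k+1}) < J(θ_k) and the gradient dJ(θ_k)/dθ]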
2. Instantaneous Gradient Descent
Instantaneous gradient descent (iGD) performs an update after
each datum is presented
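$$\boldsymbol{\theta}_{k+1} = \boldsymbol{\theta}_k - \eta \frac{dJ_k(\boldsymbol{\theta}_k)}{d\boldsymbol{\theta}}, \qquad J_k(\boldsymbol{\theta}_k) = \tfrac{1}{2} e_k^2 = \tfrac{1}{2}\,(y_k^* - \mathbf{x}_k^T \boldsymbol{\theta}_k)^2$$
• k is both the iteration and the datum index
• η (> 0) is the learning rate, which controls the update size
• the ½ multiplier isn’t too important
[Figure: parameter estimates θ_i against iteration k (0 to 25), and the trajectory of θ_2 against θ_1]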
In the next section, we’ll analyze this behaviour
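The update rule above can be sketched in a few lines of Python. This is a minimal illustration rather than the course's own simulation: the function name `igd`, the learning rate 0.1, the zero initial parameters and the number of steps are all assumptions, and the data are taken to be the three-point set used in Example 3.1 (which has no bias column).

```python
def igd(X, y_star, eta, theta0, n_steps):
    """Cyclic instantaneous gradient descent for the model y = x^T theta."""
    theta = list(theta0)
    history = [tuple(theta)]
    for k in range(n_steps):
        x = X[k % len(X)]              # k is both iteration and datum index
        target = y_star[k % len(X)]
        # prediction error e_k = y_k* - x_k^T theta_k
        e = target - sum(xi * ti for xi, ti in zip(x, theta))
        # theta_{k+1} = theta_k + eta * e_k * x_k
        theta = [ti + eta * e * xi for ti, xi in zip(theta, x)]
        history.append(tuple(theta))
    return theta, history


# Assumed data: the three-point set of Example 3.1 (no bias column).
X = [[1.0, 2.0], [2.0, 1.0], [1.0, 0.0]]
y_star = [3.0, 3.0, 1.0]

theta, history = igd(X, y_star, eta=0.1, theta0=[0.0, 0.0], n_steps=300)
print(theta)   # approaches [1, 1] for this consistent data set
```

Plotting the entries of `history` against the step index gives curves of the same kind as the parameter plots on this slide.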
2. Linear Parameter Estimation: Conclusions
This section introduced both a linear regression model as well as a
simple instantaneous gradient descent method to estimate the
parameters.
• Often the first column of the linear model includes a bias (+1)
term, although the examples do not include this
• The output is linear in the parameters
• Introduced both a sum and an instantaneous squared prediction
error performance function
• Moving in the direction of the negative gradient allows the
parameters to be updated and reduce the size of the
performance function
• Squared prediction error is quadratic in the parameters, hence
the gradient is linear in the parameters & iGD is:
$$\boldsymbol{\theta}_{k+1} = \boldsymbol{\theta}_k + \eta\, e_k\, \mathbf{x}_k$$
2. Linear Parameter Estimation: Exercises
1. Explain why the squared prediction error is a quadratic function of
the parameters when the model is linear.
2. Derive the (quadratic) relationship between the squared prediction
error and the parameters, clearly marking the quadratic & linear
terms as well as the quadratic weighting matrix.
3. Derive the iGD rule.
4. Re-write the iGD update algorithm as a (discrete time) state space
system where the states are the parameters.
5. Simulate the iGD example in Matlab
6. Using the simulation developed in Q5), explain what happens
when:
• The learning rate is small or large. What is a good value?
• The outputs are perturbed slightly.
• The inputs are changed so that they’re highly correlated or poorly
correlated
3. Analysis of Instantaneous Gradient Descent
In order to understand the performance of an adaptive system,
we need to analyse:
• Final parameter values (what the algorithm converges to)
• Parameter convergence (how rapid parameter convergence is)
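For a single datum $(x_{1,k}, y_k^*)$ and the model $y = \theta_0 + \theta_1 x_1$, zero prediction error requires
$$\theta_1 = -\frac{1}{x_{1,k}}\,\theta_0 + \frac{y_k^*}{x_{1,k}}$$
Hence, the solution hyperplane is a (green) line in the (θ_0, θ_1) plane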
3.1 Example: Solution Hyperplanes
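For the data set (note that $x_{k,1}$ is not unity; this is not important)
$$X = \begin{bmatrix} 1 & 2 \\ 2 & 1 \\ 1 & 0 \end{bmatrix}, \qquad \mathbf{y}^* = \begin{bmatrix} 3 \\ 3 \\ 1 \end{bmatrix}$$
The solution hyperplanes are given by
Equations for the solution hyperplanes are …
[Figure: the three solution hyperplanes (eqn 1, eqn 2, eqn 3) in the (θ_1, θ_2) plane, both axes from 0 to 1.5]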
3.1 Geometry of Parameter Update
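Instantaneous gradient descent: $\boldsymbol{\theta}_{k+1} = \boldsymbol{\theta}_k + \eta\, e_k\, \mathbf{x}_k$
The parameter update is in the direction of the input vector:
$$\Delta\boldsymbol{\theta}_k \propto \mathbf{x}_k$$
The direction can be positive or negative: the learning rate η > 0,
but the prediction error, e_k, can be positive or negative.
[Figure: update from θ_k to θ_{k+1} along x_k in the (θ_0, θ_1) plane]
The (perpendicular) distance of a point (prior parameter vector) from
the solution hyperplane is given by:
$$d(\boldsymbol{\theta}_k, \{\mathbf{x}_k, y_k^*\}) = \frac{|\boldsymbol{\theta}_k^T \mathbf{x}_k - y_k^*|}{\sqrt{x_{k,1}^2 + x_{k,2}^2 + x_{k,3}^2}} = \frac{|e_k|}{\|\mathbf{x}_k\|_2}$$
[Figure: distance d from θ_k to the solution hyperplane, measured along x_k]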
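As a small numerical check of the distance formula, the sketch below computes |e_k| / ||x_k||_2 for one datum and compares it with the length of the step that moves the parameter vector along x_k onto the solution hyperplane. The function names and the example values (θ = [0, 0], x = [1, 2], y* = 3) are illustrative assumptions.

```python
import math


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def distance_to_hyperplane(theta, x, y_star):
    """Perpendicular distance |e_k| / ||x_k||_2 from theta to x^T theta = y*."""
    e = y_star - dot(x, theta)
    return abs(e) / math.sqrt(dot(x, x))


theta = [0.0, 0.0]
x, y_star = [1.0, 2.0], 3.0
d = distance_to_hyperplane(theta, x, y_star)

# Foot of the perpendicular: move along x_k by e_k / ||x_k||^2.
e = y_star - dot(x, theta)
foot = [t + e / dot(x, x) * xi for t, xi in zip(theta, x)]
step = [f - t for f, t in zip(foot, theta)]
step_length = math.sqrt(dot(step, step))

print(d, step_length)   # the two lengths agree
```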
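[Figure: sketches in the (θ_0, θ_1) plane of the update from θ_k along x_k for learning rates at or beyond the limits 0 and 2/||x_k||_2^2]
c) Stable learning (2 cases): prediction error decreases
$$\frac{1}{\|\mathbf{x}_k\|_2^2} < \eta < \frac{2}{\|\mathbf{x}_k\|_2^2} \qquad \text{and} \qquad 0 < \eta < \frac{1}{\|\mathbf{x}_k\|_2^2}$$
[Figure: the two stable updates from θ_k along x_k in the (θ_0, θ_1) plane]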
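The stable range of the learning rate follows from one line of algebra. The derivation below is a sketch in the notation of these slides; the symbol $\bar{e}_k$ for the posterior error (the error on the same datum after the update) is introduced here for illustration.

```latex
\begin{align*}
\bar{e}_k &= y_k^* - \mathbf{x}_k^T \boldsymbol{\theta}_{k+1}
           = y_k^* - \mathbf{x}_k^T \left( \boldsymbol{\theta}_k + \eta\, e_k\, \mathbf{x}_k \right) \\
          &= \left( 1 - \eta\, \|\mathbf{x}_k\|_2^2 \right) e_k \\
|\bar{e}_k| < |e_k|
          &\iff \left| 1 - \eta\, \|\mathbf{x}_k\|_2^2 \right| < 1
           \iff 0 < \eta < \frac{2}{\|\mathbf{x}_k\|_2^2}
\end{align*}
```

The posterior error is exactly zero when $\eta = 1/\|\mathbf{x}_k\|_2^2$, keeps the sign of $e_k$ for smaller learning rates, and changes sign for larger ones.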
3.3 Normalized Instantaneous Gradient Descent
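The convergence of the iGD rule depends on the value of the
learning rate, whose stable range is, in turn, inversely related
to the squared magnitude of the input vector
$$\boldsymbol{\theta}_{k+1} = \boldsymbol{\theta}_k + \eta\, e_k\, \mathbf{x}_k, \qquad 0 < \eta < \frac{2}{\|\mathbf{x}_k\|_2^2}$$
Let’s define a normalized iGD rule as
$$\boldsymbol{\theta}_{k+1} = \boldsymbol{\theta}_k + \eta_k\, e_k\, \mathbf{x}_k, \qquad \eta_k = \frac{\mu}{\|\mathbf{x}_k\|_2^2}$$
Then convergence occurs for 0 < μ < 2, and when μ = 1 the prior
parameter vector is always projected onto the solution
hyperplane so that the posterior error is zero
[Figure: θ_k projected along x_k onto the solution hyperplane in the (θ_0, θ_1) plane]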
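A single normalized update can be sketched as follows. The function name and the example datum are assumptions, the normalized learning rate is written `mu`, and the input vector is assumed to be non-zero. The printout compares the prior and posterior errors for several values of `mu`.

```python
def normalized_igd_step(theta, x, y_star, mu):
    """One normalized iGD update with eta_k = mu / ||x_k||_2^2 (x must be non-zero)."""
    norm_sq = sum(xi * xi for xi in x)
    # prior error, before the update
    e = y_star - sum(xi * ti for xi, ti in zip(x, theta))
    theta_new = [ti + (mu / norm_sq) * e * xi for ti, xi in zip(theta, x)]
    # posterior error, on the same datum after the update
    e_post = y_star - sum(xi * ti for xi, ti in zip(x, theta_new))
    return theta_new, e, e_post


for mu in (0.5, 1.0, 1.5, 2.5):
    _, e, e_post = normalized_igd_step([0.0, 0.0], [1.0, 2.0], 3.0, mu)
    print(mu, e, e_post)   # e_post equals (1 - mu) * e up to rounding
```

With `mu = 2.5` the posterior error is larger in magnitude than the prior error, which is the unstable case outside 0 < mu < 2.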
3.4 Parameter Convergence 1
When the data set consists of multiple data points, we have
different convergence behaviours
Without loss of generality, we can assume the normalized
instantaneous gradient descent rule with normalized learning
rate μ = 1, so that the posterior error is 0
a) Less data than parameters
• 2 parameters and 1 data point
• Parameter vector is projected (perpendicularly) onto the solution
line: $\Delta\boldsymbol{\theta}_k = \alpha\, \mathbf{x}_k$
• There are multiple solutions to the learning problem, all of which
are equally good (with respect to the data set) & lie on the
solution hyperplane
• The equations are underdetermined.
[Figure: θ_k projected along x_k onto the solution line at θ_{k+1}, in the (θ_0, θ_1) plane]
3.4 Parameter Convergence 2
b) Consistent data
• All the solution hyper-planes intersect at a unique point
• When the parameter is projected onto the next hyper-plane, the
distance to that point is reduced (can be shown)
• iGD converges to that unique value
c) Inconsistent data
• The solution hyper-planes do not intersect at a unique point
• Equations are inconsistent
• iGD converges to a limit cycle or oscillates within a zone
defined by the hyper-plane intersection.
• Typical when data contains noise
[Figure: parameter trajectories in the (θ_0, θ_1) plane, converging to θ* for consistent data and cycling between hyper-planes for inconsistent data]
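The two behaviours can be reproduced with a short sketch that presents the data cyclically and projects onto each solution hyperplane in turn (normalized iGD with the normalized learning rate set to 1). The consistent case uses the data set of Example 3.1. The inconsistent case perturbs the third output from 1 to 1.5, which is a hypothetical value chosen only for illustration, and the function names are likewise assumptions.

```python
def project(theta, x, y_star):
    """Normalized iGD step with unit normalized rate: project onto x^T theta = y*."""
    norm_sq = sum(xi * xi for xi in x)
    e = y_star - sum(xi * ti for xi, ti in zip(x, theta))
    return [ti + e / norm_sq * xi for ti, xi in zip(theta, x)]


def run_cycles(X, y_star, theta0, n_cycles):
    """Present the data cyclically; return the parameters visited in the last cycle."""
    theta = list(theta0)
    last_cycle = []
    for _ in range(n_cycles):
        last_cycle = []
        for x, target in zip(X, y_star):
            theta = project(theta, x, target)
            last_cycle.append(tuple(theta))
    return last_cycle


X = [[1.0, 2.0], [2.0, 1.0], [1.0, 0.0]]
consistent = run_cycles(X, [3.0, 3.0, 1.0], [0.0, 0.0], 50)
inconsistent = run_cycles(X, [3.0, 3.0, 1.5], [0.0, 0.0], 50)  # hypothetical perturbed output

print(consistent)     # all three points coincide at the unique solution
print(inconsistent)   # three distinct points, repeated every cycle
```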
3.4 Rate of Parameter Convergence
The rate of parameter convergence depends on the correlation (or
lack of) between successive training data.
Without loss of generality assume the data is consistent and there
are two parameters and two data points
[Figure: parameter trajectories from θ_k in the (θ_0, θ_1) plane for a) correlated data and b) near orthogonal data]
• The rate of parameter convergence strongly depends on how
orthogonal (uncorrelated) successive data points are.
• More orthogonal update directions search the parameter space
better
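The effect can be illustrated by counting how many normalized iGD steps (normalized learning rate 1) are needed to reach a known solution. Both two-point input sets below are hypothetical, the outputs are generated from an assumed true parameter vector [1, 1] so that the data are consistent, and the tolerance is arbitrary.

```python
import math


def steps_to_converge(X, theta_star, theta0, tol=1e-6, max_steps=100000):
    """Count cyclic projection steps until ||theta - theta_star||_2 < tol."""
    # Outputs generated from theta_star, so the data are consistent.
    y_star = [sum(xi * ti for xi, ti in zip(x, theta_star)) for x in X]
    theta = list(theta0)
    for k in range(max_steps):
        err = math.sqrt(sum((t - s) ** 2 for t, s in zip(theta, theta_star)))
        if err < tol:
            return k
        x, target = X[k % len(X)], y_star[k % len(X)]
        norm_sq = sum(xi * xi for xi in x)
        e = target - sum(xi * ti for xi, ti in zip(x, theta))
        theta = [ti + e / norm_sq * xi for ti, xi in zip(theta, x)]
    return max_steps


orthogonal = [[1.0, 0.0], [0.0, 1.0]]   # hypothetical orthogonal inputs
correlated = [[1.0, 0.0], [1.0, 0.1]]   # hypothetical, nearly parallel inputs

n_orth = steps_to_converge(orthogonal, [1.0, 1.0], [0.0, 0.0])
n_corr = steps_to_converge(correlated, [1.0, 1.0], [0.0, 0.0])
print(n_orth, n_corr)   # far fewer steps for the orthogonal inputs
```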
3.4 Example: Normalized iGD Convergence
For the usual data set, the data is presented cyclically using a
normalized iGD rule with normalized learning rate μ = 1.
$$X = \begin{bmatrix} 1 & 2 \\ 2 & 1 \\ 1 & 0 \end{bmatrix}, \qquad \mathbf{y}^* = \begin{bmatrix} 3 \\ 3 \\ 1 \end{bmatrix}$$
[Figure: parameter estimates θ_i against iteration k (0 to 25), and two plots of the trajectory of θ_2 against θ_1]
It is usual to choose a small learning rate, which implicitly performs an
averaging type behaviour, and/or to reduce the learning rate with time
3. Analysis of iGD: Conclusions
• Each data point is visualized as a solution hyperplane (because
the prediction is linear in the parameters) with normal xk
• The iGD parameter updates are perpendicular to the solution
hyperplane, parallel to xk.
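• Stability of the algorithm is controlled via the learning rate η;
the posterior prediction error is reduced when
$$\boldsymbol{\theta}_{k+1} = \boldsymbol{\theta}_k + \eta\, e_k\, \mathbf{x}_k, \qquad 0 < \eta < \frac{2}{\|\mathbf{x}_k\|_2^2}$$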
• This influences the definition of the normalized iGD algorithm,
where the learning rate is normalized by the input magnitude
• Parameter convergence is fast/slow when successive data is near
orthogonal/highly correlated, respectively
• Noise affects the final convergence properties, because iGD is
calculated using a single data point only
3. Analysis of iGD: Exercises
• Derive and sketch the solution hyperplanes from example 3.1
• Derive the expression for the iGD posterior error and hence
derive the limits in the learning rate to ensure it decreases in
magnitude
• Assuming the same input-output pair is repeatedly presented to
the iGD rule, calculate the time constant for the error in the
parameters as a function of the learning rate and the input
vector magnitude
• Implement in Matlab the simulations in Examples 3.1, 3.4 and
3.5.
• In Example 3.4, when the 2nd data point is removed, is
parameter convergence slower or faster? Explain why.
• In Example 3.5, does the final limit cycle (normalized learning
rate μ = 1) depend on the initial parameter values?
4. Non-linear Models & Parameter Estimation
A large amount of research has been performed in the statistics,
artificial intelligence and cognitive robotics communities (to name
but a few) under the general heading of non-linear modelling
(supervised learning).
In this section, we’ll cover
1. Introduction
2. Model structure
3. Controlling complexity
4. Gradient descent parameter estimation
In this section, we’ll see how this work can be used to learn non-linear
functions
1. Understand the non-linear mappings f:X->Y
2. Derive gradient based learning rules
This is necessary in robotics applications (e.g. learning to invert a pendulum)
where the operational space is too large to use linearization.
In the past, these non-linear functions have had fancy names like neural
networks, fuzzy approximators, kernel machines. We won’t bother too
much with this labelling.
4.1 1D & 2D Non-Linear Regression Models
Unlike linear regression models which have a specific structure,
there is a large variety of non-linear models including
polynomials, piecewise polynomials, sigmoidal and radial basis
functions, kernel …
The key thing to appreciate is that the model’s output is typically
non-linear in terms of the input and parameters: f(x, θ)
[Figure: example non-linear functions f for a single input x and for two inputs (x_1, x_2)]
4.1 Inspiration from the Brain
The brain is a massively parallel “computer”
which can solve many problems (pattern
recognition, planning, locomotion, …) which
digital computers find difficult
• Can we use simple models of the brain to
learn how to classify, predict, organize …
information?
• This is the field of artificial neural networks
(ANN)
• Typically, the basic function of a neuron is a
“linear behaviour” followed by a non-linear
transformation, and computation is
distributed across many neurons.
• We’ll look at a basic MLP (multi-layer
perceptron)
4.2 Typical Non-linear Models: Polynomial
A polynomial model is of the form:
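$$\begin{aligned} y = {} & \theta_0 + \theta_1 x_1 + \dots + \theta_n x_n \\ & + \theta_{n+1} x_1^2 + \theta_{n+2} x_1 x_2 + \dots + \theta_* x_n^2 \\ & + \theta_* x_1^3 + \dots + \theta_* x_1 x_2 x_3 + \dots + \theta_* x_n^3 + \dots \end{aligned}$$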
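h = 3
c = {-2, 0, 2}
σ² ~ {0.6, 0.6, 0.6}
θ = {0.5, 1, 1}
[Figure: network diagram with inputs (x_2 labelled), a hidden layer (node h_2 labelled) with parameters Θ^h, and an output layer with parameters θ^o]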
$$y = \tanh(u), \qquad u = 0 + 1\,x_1 + 1\,x_2, \qquad \boldsymbol{\theta} = [0, 1, 1]$$
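The tanh example above can be evaluated directly. The sketch below is only an illustration (the function name `ridge` is an assumption). It shows that the output depends on the inputs only through u, so every input pair with the same x_1 + x_2 gives the same output.

```python
import math


def ridge(x1, x2, theta=(0.0, 1.0, 1.0)):
    """y = tanh(u) with u = theta0 + theta1 * x1 + theta2 * x2."""
    u = theta[0] + theta[1] * x1 + theta[2] * x2
    return math.tanh(u)


print(ridge(0.0, 0.0))    # u = 0
print(ridge(1.0, -1.0))   # also u = 0, so the same output
print(ridge(2.0, 1.0))    # u = 3, output close to the upper saturation level
```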
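$$J(\boldsymbol{\theta}) = \frac{1}{2} \sum_{i=1}^{T} \left( y_i^* - y(\mathbf{x}_i, \boldsymbol{\theta}) \right)^2, \qquad J_i(\boldsymbol{\theta}) = \frac{1}{2} \left( y_i^* - y(\mathbf{x}_i, \boldsymbol{\theta}) \right)^2$$
[Figure: performance function against θ, showing a step from θ_k to θ_{k+1} and the minimum θ*]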
• No direct solution
• Local minima are possible
• Learning rate is difficult to estimate because local Hessian
(second derivative matrix) varies in parameter space
4.3 Output Layer Gradient Calculation
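$$u = \boldsymbol{\theta}^T \mathbf{x}, \qquad y = f(u), \qquad J = \tfrac{1}{2}\,(y^* - y)^2$$
[Figure: network diagram showing the hidden layer and the output layer …]
Gradient descent update:
$$\Delta\boldsymbol{\theta}_k = -\eta\, \frac{dJ(\boldsymbol{\theta}_k)}{d\boldsymbol{\theta}}$$
with terms of the form $(y_i^* - y_i)\, \delta_j^h\, x_i$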
4.3 MLP Iterative Parameter Estimation
Randomly initialise all parameters in network (to small values)
For each parameter update
    present each input pattern to the network & get output
    calculate the update for each parameter according to:
        $\Delta\theta^{l}_{ij,k} = \eta\, (y_k^* - y_k)\, \delta^{l}_{j}\, x^{l}_{i}$
    where:
        output layer: $\delta^{o} = f'(u^{o})$
        hidden layer: $\delta^{h}_{j} = \delta^{o}\, \theta^{o}_{j}\, f'(u^{h}_{j})$
    calculate average parameter updates
    update weights
Stop when steps > max_steps or MSE < tolerance or test MSE
is minimum
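The procedure above can be sketched for a network with one hidden layer. This is a minimal illustration under stated assumptions, not the course implementation: f is taken to be tanh in both layers, every node has a bias weight, the initial weights are small Gaussian values, and only the first two stopping tests (step count and MSE tolerance) are implemented. All function and variable names are hypothetical.

```python
import math
import random


def forward(x, Wh, wo):
    """Return (inputs with bias, hidden outputs with bias, network output)."""
    xb = [1.0] + list(x)
    hb = [1.0] + [math.tanh(sum(w * xi for w, xi in zip(row, xb))) for row in Wh]
    y = math.tanh(sum(w * hi for w, hi in zip(wo, hb)))
    return xb, hb, y


def updates(x, y_star, Wh, wo, eta):
    """Per-datum updates eta * (y* - y) * delta * (layer input)."""
    xb, hb, y = forward(x, Wh, wo)
    err = y_star - y
    delta_o = 1.0 - y * y                     # f'(u^o) for f = tanh
    d_wo = [eta * err * delta_o * hi for hi in hb]
    d_Wh = []
    for j in range(len(Wh)):
        # delta^h_j = delta^o * theta^o_j * f'(u^h_j)
        delta_h = delta_o * wo[j + 1] * (1.0 - hb[j + 1] ** 2)
        d_Wh.append([eta * err * delta_h * xi for xi in xb])
    return d_Wh, d_wo


def train(data, n_hidden, eta, max_steps, tol, seed=0):
    rng = random.Random(seed)
    n_in, n = len(data[0][0]), len(data)
    Wh = [[0.1 * rng.gauss(0.0, 1.0) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    wo = [0.1 * rng.gauss(0.0, 1.0) for _ in range(n_hidden + 1)]
    mse = float("inf")
    for _ in range(max_steps):
        sum_Wh = [[0.0] * (n_in + 1) for _ in range(n_hidden)]
        sum_wo = [0.0] * (n_hidden + 1)
        for x, y_star in data:                # present each input pattern
            d_Wh, d_wo = updates(x, y_star, Wh, wo, eta)
            sum_Wh = [[s + d for s, d in zip(sr, dr)] for sr, dr in zip(sum_Wh, d_Wh)]
            sum_wo = [s + d for s, d in zip(sum_wo, d_wo)]
        # apply the average parameter update
        Wh = [[w + s / n for w, s in zip(row, sr)] for row, sr in zip(Wh, sum_Wh)]
        wo = [w + s / n for w, s in zip(wo, sum_wo)]
        mse = sum((t - forward(x, Wh, wo)[2]) ** 2 for x, t in data) / n
        if mse < tol:
            break
    return Wh, wo, mse


xor = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 0.0)]
Wh, wo, mse = train(xor, n_hidden=2, eta=0.05, max_steps=2000, tol=0.01)
print(mse)   # may still be above the tolerance after this few steps
```

Calling `updates` with `eta=1.0` returns the negative gradient of the per-datum cost ½(y* − y)², which can be compared with a finite-difference estimate as a check on the delta expressions.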
4. 4 Example: Learning the XOR Problem
[Figure: performance history for the XOR data and an MLP with 2
hidden nodes]
Note its non-monotonic behaviour and the large number of iterations.
η = 0.05, update after each datum
5. Rationale for Learning CTC
• CTC is widely used in mechanical and robotic systems because
it can provide accurate tracking performance.
• Requires an accurate model of the dynamics/equations of
motion. Modelling errors can have a large effect on the
performance of the control system
• Need to “precisely” know the masses, lengths and inertias, as well
as non-linear (not modelled) effects such as friction, stiction,
backlash, …
– Can measure masses and lengths and estimate inertias
from CAD diagrams
– Can use system identification techniques to validate (and
improve) parameters
– Missing non-linear dynamic relationships are difficult to
determine
5. Experimental Learning
In this section we’ll take a fairly simple view of learning where
plant inputs (control torques) are randomly generated for different
states and this produces accelerations
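[Figure: block diagram in which the state $\mathbf{x} = [q, \dot{q}]$ and the input u enter the dynamic system, which outputs $\ddot{q}$]
Use parameters:
l = 0.8 m
a = 0.4 m
m = 5 kg
J = 1 kg m²
$$\ddot{q} = 0.862\,\tau - 3.38 \cos(q)$$
[Figure: surface of d²q/dt² (-10 to 10) over τ (-10 to 10) and q (0 to 2)]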
5. SLM Data Generation
In practice, a lot of work needs to be done to generate suitable torque
signals which adequately excite all the dynamics and enable the plant
to visit all regions of the state space & stabilize the plant
• τ lies in [-10, 10]
No noise has been added to either the inputs or outputs, although a
more realistic experiment would have to consider this.
[Figure: torque signal τ (-5 to 10), and the surface of d²q/dt² (-10 to 10) over τ (-10 to 10) and q (0 to 2)]
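Noise-free training data of this kind could be generated as sketched below. Several choices are assumptions rather than details stated on the slides: the sign of the cosine term in the acceleration model, uniform random sampling, the range [0, 2] for q (read from the plot axes), the random seed, and all function names.

```python
import math
import random


def slm_acceleration(q, tau):
    """Assumed reading of the slide's model: qdd = 0.862 * tau - 3.38 * cos(q)."""
    return 0.862 * tau - 3.38 * math.cos(q)


def generate_data(n, q_range=(0.0, 2.0), tau_range=(-10.0, 10.0), seed=0):
    """Return n noise-free samples ((q, tau), qdd) with uniformly random inputs."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        q = rng.uniform(*q_range)
        tau = rng.uniform(*tau_range)
        data.append(((q, tau), slm_acceleration(q, tau)))
    return data


data = generate_data(1000)
print(data[0])
```

A more realistic experiment would add noise to the inputs and outputs here, as the slide notes.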
5. MLP Details
We’ll use an MLP with 2 inputs {q, τ} and 5 hidden nodes (ridges)
to approximate the single link manipulator non-linear
acceleration mapping.
• The choice of 5 hidden nodes is a bit arbitrary: there need to be
enough ridges to approximate the non-linear function sufficiently
accurately, but also enough data to accurately
estimate the parameters
• 21 parameters in the MLP and these are trained using 1,000
data points
• Parameters are randomly initialized and estimated using
instantaneous gradient descent iGD
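The figure of 21 parameters can be checked with a short count, assuming a single hidden layer in which every hidden node and the output node has one weight per input plus a bias. The function name is hypothetical.

```python
def mlp_parameter_count(n_inputs, n_hidden, n_outputs=1):
    """Weights plus biases for a single-hidden-layer MLP."""
    hidden = (n_inputs + 1) * n_hidden     # each hidden node: inputs + bias
    output = (n_hidden + 1) * n_outputs    # each output node: hidden nodes + bias
    return hidden + output


print(mlp_parameter_count(2, 5))   # 21
```

With 1,000 data points this gives roughly 48 data points per parameter.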
5. MLP Learning Behaviour
• Parameters initially randomly distributed (0.1*randn)
• η = 0.005
• Maximum of 1,000 cycles through the data set, where the data set
contains 1,000 data points
• Learning terminated when rmse = 0.1, which was after 117 cycles
[Figure: rmse_k against cycle k (0 to 150)]
[Figure: learned surface of d²q/dt² over τ (-10 to 10) and q (0 to 2)]
The trends are largely correct:
• Linear with respect to τ
• Non-linear with respect to q (half a sinusoid)
• RMSE is 0.1 (pre-set in the learning algorithm)
• Max error is ~0.7 at the edge
• The sigmoid (5 ridges) approximation is partially evident in the
form of the error plot
[Figure: error surface (-0.5 to 1) over τ (-10 to 10) and q (0 to 2)]
5. Inverting the Model
In CTC, we need to invert the