Intelligent Robotic Systems

Uploaded by

mj
Copyright
© © All Rights Reserved
Intelligent Systems:

with Application to Robotics


A. Supervised learning (2 days)
1. Introduction
2. Parameter estimation for linear models
3. Analysis
4. Parameter estimation for non-linear models
5. Robotics application to computed torque control &
related applications (vision, learning control, …)
B. Reinforcement learning (1 day)
0. What is Intelligence or Learning
What is intelligence in robotics?
A mechanical creature which can function autonomously:
• Mechanical: built, constructed, robot
• Creature: think of it as an entity with its own motivation and decision-making processes (but it’s a machine?)
• Function autonomously: can sense, act, maybe even reason; doesn’t just do the same thing over and over like automation
Similar discussions arise about Artificial Intelligence and Cognitive Robotics: intelligence is judged relative to the current behaviour or performance, and once you understand something, it loses its mystique.
0. What is Learning/Intelligence?
Rather than getting too hung up on precise definitions, let us
consider a simple example. Consider the task of learning to control
an inverted pendulum by applying a linear force to the cart
• No pre-programmed control
algorithm which specifies the
dynamic relationship
• Learn dynamics by performing
experiments and sensing the
reactions
• Need a precise meta specification of
the goal to make the learning
problem well-posed
• Need guarantees of learning
performance
http://www.youtube.com/watch?v=Lt-KLtkDlh8
A. Supervised Learning
A. Supervised learning (2 days)
1. Introduction
2. Parameter estimation for linear models
3. Analysis
4. Parameter estimation for non-linear models
5. Robotics application to computed torque control &
related applications (vision, learning control, …)
B. Reinforcement learning (1 day)
1. Supervised Learning
Many problems in robotics and control can be posed as supervised
learning problems:
[Figure: block diagram. The input x feeds the model f(x, θ), which produces the prediction y; the supervisor provides the target y*, and the error y* − y is formed.]
Aim: we want both to understand and approximate the relationship
y* = f(x, θ*) supervisor providing exemplar target data
y = f(x, θ) model predictions
θ is a vector of adjustable parameters which are learnt.
Learning uses a training data set of input/output samples $\{\mathbf{x}_k, y_k^*\}_{k=1}^{T}$, where k is the training sample index.
1. Example: Inverted Pendulum
For the inverted pendulum, a basic knowledge of the dynamics tells us that the linear force F applied to the cart will result in a linear acceleration $\ddot{x}$ and an angular acceleration $\ddot{\theta}$.
Hence, the equations of motion define a map of the form:
$$\{\ddot{x}, \ddot{\theta}\} \leftarrow \{x, \theta, \dot{x}, \dot{\theta}, F\}$$
Learn either:
• {force, positions, velocities} => accelerations: equations of motion, used for CTC
• {force, positions, velocities} => {next positions, next velocities}: state space description
Sampled data is collected from each experiment and used to reconstruct the map
1. Supervised Learning Notation
x: the inputs or feature vector: independent variables
– Generally x will be a vector of n values
– Sample values of x: xk is kth of T sample values
y: the output or response or predicted value: dependent variable
– Typically a scalar, although it can be a vector, of real values
(shown in bold, not italic)
– Again yk is the value of the kth sample.
– For regression, this is a real value (for statistical classification it represents a probability in [0,1])
y*: the desired output or target response/value
– For regression, this is a real value (for classification it is typically a binary {0,1} value)
– Again y*k is the value of the kth sample.
1. Supervised Learning Example: 1st Order
System Identification
System identification is an example of a supervised learning problem for modelling a dynamic system.
Given input $u_k$ and measured output (response) $y_k$ data taken from a dynamic system, the aim is to identify a suitable dynamic model which could be used for control, state estimation, fault detection …
System identification problem:
$$y_k + a_1 y_{k-1} = b_1 u_{k-1}, \qquad \text{e.g.}\quad y_k - 0.9\,y_{k-1} = 0.1\,u_{k-1}$$
Supervised learning, kth sample:
$$\mathbf{x}_k = [-y_{k-1},\; u_{k-1}]^T, \qquad \boldsymbol{\theta} = [a_1,\; b_1]^T, \qquad y_k^* = y_k$$
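As a hedged illustration of this mapping, the sketch below simulates the example system with a random input and recovers [a1, b1] by least squares. The use of NumPy, the random excitation, the sample count and the helper names `simulate_first_order` and `build_regression` are assumptions made for this example, not part of the slides.

```python
import numpy as np

def simulate_first_order(u, a1=-0.9, b1=0.1):
    """Simulate y_k + a1*y_{k-1} = b1*u_{k-1}, starting from y_0 = 0."""
    y = np.zeros(len(u))
    for k in range(1, len(u)):
        y[k] = -a1 * y[k - 1] + b1 * u[k - 1]
    return y

def build_regression(u, y):
    """Stack the inputs x_k = [-y_{k-1}, u_{k-1}] and the targets y*_k = y_k."""
    X = np.column_stack((-y[:-1], u[:-1]))
    targets = y[1:]
    return X, targets

rng = np.random.default_rng(0)      # fixed seed (assumption, for repeatability)
u = rng.standard_normal(200)        # hypothetical excitation signal
y = simulate_first_order(u)
X, targets = build_regression(u, y)
theta_hat, *_ = np.linalg.lstsq(X, targets, rcond=None)
print(theta_hat)                    # estimate of [a1, b1]
```

Because the simulated data is noise-free, the estimate should match the true parameters to numerical precision; with measurement noise it would only be approximate.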
1. Supervised Learning: Data
Training Data: D = {D1, D2, …, DT} a set of T learning examples
• Dk = {xk, y*k} a training datum at discrete time index
k = 1,2,…,T
• xk = [x1,k, x2,k, …, xn,k]T is the input vector of length n
• y*k is the desired output
On this module, we’re focussed on the case where

Objective: Learn the function f: X -> Y, such that
$y_k^* \approx f(\mathbf{x}_k)$ for all k = 1…T
The approximation symbol is used because we typically assume there is some form of small noise or error in the training data
Regression: the output is assumed to be a continuous function of the inputs, so the output is smooth.
1. Learning (Parameter Estimation) Schemes
Batch (D → θ)
• Experiment produces data set D.
• A closed form solution exists for determining the optimal parameters.
• Only possible with linear models & quadratic objective
• Off-line identification

Iterative (D → θk)
• Experiment produces data set D.
• Iterative approach (gradient descent) is used to estimate the parameters.
• For non-linear models where no explicit solution exists
• Iterative “learning”

Instantaneous (Dk → θk)
• Experiment produces an exemplar Dk.
• Dk is used to perform a single update of the current parameter estimate θk
• Instantaneous or on-line learning
1. Conclusions
We’ve briefly discussed what learning or intelligence means in a
robotics context
• We’re focussing on lower-level control, hence intelligence and (supervised) learning are closely related
• A simple interpretation is that there is a change in behaviour of the robot after an experience, “it learns”
• We’re focussing on supervised learning where input and desired output exemplars are available and the problem is to model the underlying regression function
• We’re also focussing on on-line or instantaneous learning,
where data is iteratively presented to the learning algorithm
and it updates its parameters straight away (not off-line
design as is typically done for supervised learning)
1. Exercises
1. Describe what learning means in a robotics context
2. Give an example of how a learning algorithm can be used to
solve a robotics (control) problem, clearly describing why
learning is necessary
3. Define what a supervised learning problem is
4. (MSc students) Describe whether (least squares) system
identification and Recursive Least Squares are examples of
batch, iterative or instantaneous learning algorithms
2. Linear Regression
Initially consider the supervised learning problem when the model
is linear in the parameters.
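The presumed model is of the form:
$$y_k = \theta_0 + \theta_1 x_{1,k} + \dots + \theta_n x_{n,k} = \mathbf{x}_k^T \boldsymbol{\theta}$$
where
$\mathbf{x}_k = [1,\; x_{1,k},\; \dots,\; x_{n,k}]^T$ is the input vector
$\boldsymbol{\theta} = [\theta_0,\; \theta_1,\; \dots,\; \theta_n]^T$ is the parameter vector
f: X -> Y is linear in the parameters, θ, and affine (linear + constant) in the input.
[Figure: network diagram. The inputs 1, x1, x2, …, xn are weighted by θ0, θ1, θ2, …, θn and summed (Σ) to give the output y.]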
2. One and Two Input Linear Regression
Single input: y ~ f(x) = θ0 + θ1x
Two inputs: y ~ f(x) = θ0 + θ1x1 + θ2x2
[Figure: data points scattered about a fitted line in the (x, y) plane, and data points scattered about a fitted plane over (x1, x2).]
• For a single input x, we can interpret linear regression as fitting a line through the data which minimizes the prediction errors (vertical distances to the line)
• For two inputs x = [x1, x2]T, the interpretation is the same except that the function is a plane
• For 3 or more inputs, the function is a hyperplane
2. Instantaneous Squared Prediction Error
Data: $D_k = \{\mathbf{x}_k, y_k^*\}$
Function: $\mathbf{x}_k \to f(\mathbf{x}_k)$
We’d like to have $y_k^* \approx f(\mathbf{x}_k)$ for all k = 1,…,T. In practice, there are errors and noise which make it an approximation.
The instantaneous prediction errors
$$\epsilon_k = y_k^* - f(\mathbf{x}_k, \boldsymbol{\theta})$$
define how close the model output is to the regression sample, and the instantaneous objective is $J_k = \epsilon_k^2$.
We can define a squared error objective function:
$$J = E_D(J_k) = \frac{1}{T}\sum_{k=1}^{T} J_k = \frac{1}{T}\sum_{k=1}^{T} \big(y_k^* - f(\mathbf{x}_k, \boldsymbol{\theta})\big)^2$$
[Figure: sketches of the instantaneous objective Jk(θ) and the objective J(θ) plotted against θ.]
Learning: the aim is to find the parameter values which minimize this (non-negative) function.
In this section, we’ll approach the problem using instantaneous gradient descent.
2. Gradient Descent
We want to iteratively estimate the parameters of the linear
model using supervised learning

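[Figure: the objective J(θ) against θ, showing an update from θk to θk+1 with J(θk+1) < J(θk).]
This will be done using gradient descent. At each iteration, the parameters are updated in the direction of the negative gradient.
Given an initial parameter estimate $\boldsymbol{\theta}_1$, iteratively update the parameters according to:
$$\boldsymbol{\theta}_{k+1} = \boldsymbol{\theta}_k - \eta \frac{dJ(\boldsymbol{\theta}_k)}{d\boldsymbol{\theta}}$$
where η > 0 is the learning rate or step size.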
2. Gradient Descent

[Figure: gradient descent illustration labelled with the gradient $dJ(\boldsymbol{\theta}_k)/d\boldsymbol{\theta}$.]
2. Instantaneous Gradient Descent
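Instantaneous gradient descent (iGD) performs an update after each datum is presented:
$$\boldsymbol{\theta}_{k+1} = \boldsymbol{\theta}_k - \eta \frac{dJ_k(\boldsymbol{\theta}_k)}{d\boldsymbol{\theta}}, \qquad J_k(\boldsymbol{\theta}_k) = \tfrac{1}{2}\epsilon_k^2 = \tfrac{1}{2}\big(y_k^* - \mathbf{x}_k^T \boldsymbol{\theta}_k\big)^2$$
• k is both the iteration and the datum index
• η (> 0) is the learning rate which controls the update size
• the ½ multiplier isn’t too important
Since
$$\frac{dJ_k(\boldsymbol{\theta}_k)}{d\boldsymbol{\theta}} = -\epsilon_k \mathbf{x}_k$$
the update becomes
$$\boldsymbol{\theta}_{k+1} = \boldsymbol{\theta}_k + \eta\,\epsilon_k \mathbf{x}_k$$
• Add the current input vector to the parameter vector, weighted by the scaled prediction error
Also known as the LMS or MIT learning rule, it is widely used for signal processing and adaptive control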
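The gradient in this rule follows from the chain rule applied to the instantaneous objective. The short derivation below is a sketch that uses only the definitions on this slide, with η the learning rate and ε_k the prediction error.

```latex
\begin{aligned}
J_k(\boldsymbol{\theta}_k) &= \tfrac{1}{2}\,\epsilon_k^2,
  \qquad \epsilon_k = y_k^* - \mathbf{x}_k^T \boldsymbol{\theta}_k \\
\frac{dJ_k}{d\boldsymbol{\theta}}
  &= \epsilon_k \,\frac{d\epsilon_k}{d\boldsymbol{\theta}}
   = -\,\epsilon_k\, \mathbf{x}_k \\
\boldsymbol{\theta}_{k+1}
  &= \boldsymbol{\theta}_k - \eta\,\frac{dJ_k}{d\boldsymbol{\theta}}
   = \boldsymbol{\theta}_k + \eta\,\epsilon_k\,\mathbf{x}_k
\end{aligned}
```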
2. Example: iGD Parameter Convergence
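Data set:
$$X = \begin{bmatrix} 1 & 2 \\ 2 & 1 \\ 1 & 0 \end{bmatrix}, \qquad \mathbf{y}^* = \begin{bmatrix} 3 \\ 3 \\ 1 \end{bmatrix}$$
• Iteratively cycle through data set
• First column of X is not 1
• $\boldsymbol{\theta}^* = [1\;\; 1]^T$
iGD with learning rate η = 0.1 and initial estimate $\boldsymbol{\theta}_1 = \mathbf{0}$.
[Figure: parameter estimates θi against iteration k (0 to 25), and the parameter trajectory in the (θ1, θ2) plane.]
In the next section, we’ll analyze this behaviour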
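A minimal Python sketch of this example is given below, assuming NumPy and a cyclic presentation order starting from the first row of X. The function name `igd` and the choice of 300 updates are illustrative assumptions; the slide's plot only covers the first 25 updates.

```python
import numpy as np

X = np.array([[1.0, 2.0], [2.0, 1.0], [1.0, 0.0]])
y_star = np.array([3.0, 3.0, 1.0])

def igd(X, y_star, eta, n_updates, theta0=None):
    """Cycle through the data, applying theta <- theta + eta*eps*x at each step."""
    theta = np.zeros(X.shape[1]) if theta0 is None else np.array(theta0, dtype=float)
    history = [theta.copy()]
    for k in range(n_updates):
        x = X[k % len(X)]
        eps = y_star[k % len(X)] - x @ theta   # prediction error for this datum
        theta = theta + eta * eps * x
        history.append(theta.copy())
    return np.array(history)

history = igd(X, y_star, eta=0.1, n_updates=300)
print(history[25])   # estimate after the 25 updates covered by the plot
print(history[-1])   # estimate after 300 updates
```

Changing `eta` or the rows of `X` in this sketch is one way to explore the questions in the exercises that follow.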
2. Linear Parameter Estimation: Conclusions
This section introduced both a linear regression model as well as a
simple instantaneous gradient descent method to estimate the
parameters.
• Often the first column of the linear model includes a bias (+1)
term, although the examples do not include this
• The output is linear in the parameters
• Introduced both a sum and an instantaneous squared prediction
error performance function
• Moving in the direction of the negative gradient allows the
parameters to be updated and reduce the size of the
performance function
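• Squared prediction error is quadratic in the parameters, hence the gradient is linear in the parameters & iGD is:
$$\boldsymbol{\theta}_{k+1} = \boldsymbol{\theta}_k + \eta\,\epsilon_k \mathbf{x}_k$$
where η is the learning rate and $\epsilon_k$ is the prediction error.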
2. Linear Parameter Estimation: Exercises
1. Explain why the squared prediction error is a quadratic function of
the parameters when the model is linear.
2. Derive the (quadratic) relationship between the squared prediction
error and the parameters, clearly marking the quadratic & linear
terms as well as the quadratic weighting matrix.
3. Derive the iGD rule.
4. Re-write the iGD update algorithm as a (discrete time) state space
system where the states are the parameters.
5. Simulate the iGD example in Matlab
6. Using the simulation developed in Q5), explain what happens
when:
• The learning rate is small or large. What is a good value?
• The outputs are perturbed slightly.
• The inputs are changed so that they’re highly correlated or poorly
correlated
3. Analysis of Instantaneous Gradient Descent
In order to understand the performance of an adaptive system,
we need to analyse:
• Final parameter values (what the algorithm converges to)
• Parameter convergence (how rapid parameter convergence is)

Understanding how the iGD algorithm performs will be done using the following steps:
1. Update direction & solution hyperplane
2. Posterior error & learning rate stability
3. Normalized iGD
4. Parameter convergence
5. Instantaneous & batch updates
3.1 Solution Hyperplane
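iGD has a simple form. For each datum $D_k = \{\mathbf{x}_k, y_k^*\}$:
$$\boldsymbol{\theta}_{k+1} = \boldsymbol{\theta}_k + \eta\,\epsilon_k \mathbf{x}_k$$
Each data point represents a single, linear constraint (solution hyperplane) in parameter space:
$$y_k^* = \theta_0 + \theta_1 x_{1,k} + \dots + \theta_n x_{n,k} = \mathbf{x}_k^T \boldsymbol{\theta}$$
This is the equation of a plane with normal $\mathbf{x}_k$.
For two parameters:
$$y_k^* = \theta_0 + \theta_1 x_{1,k} \quad\Rightarrow\quad \theta_1 = -\frac{1}{x_{1,k}}\,\theta_0 + \frac{y_k^*}{x_{1,k}}$$
Hence, the solution hyperplane is a (green) line.
[Figure: the solution line in the (θ0, θ1) plane with its normal xk.]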
3.1 Example: Solution Hyperplanes
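For the data set (note that $x_{k,1}$ is not unity, which is not important)
$$X = \begin{bmatrix} 1 & 2 \\ 2 & 1 \\ 1 & 0 \end{bmatrix}, \qquad \mathbf{y}^* = \begin{bmatrix} 3 \\ 3 \\ 1 \end{bmatrix}$$
The solution hyperplanes are given by
[Figure: the three solution hyperplanes (eqn 1, eqn 2, eqn 3) drawn as lines in the (θ1, θ2) plane, both axes running from 0 to 1.5.]
Equations for the solution hyperplanes are …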
3.1 Geometry of Parameter Update
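Instantaneous gradient descent: $\boldsymbol{\theta}_{k+1} = \boldsymbol{\theta}_k + \eta\,\epsilon_k \mathbf{x}_k$
The parameter update is in the direction of the input vector:
$$\Delta\boldsymbol{\theta}_k \propto \mathbf{x}_k$$
The direction can be positive or negative: the learning rate η > 0, but the prediction error $\epsilon_k$ can be positive or negative.
The (perpendicular) distance of a point (prior parameter vector) from the solution hyperplane is given by:
$$d(\boldsymbol{\theta}_k, \{\mathbf{x}_k, y_k^*\}) = \frac{\boldsymbol{\theta}_k^T \mathbf{x}_k - y_k^*}{\sqrt{x_{k,1}^2 + x_{k,2}^2 + x_{k,3}^2}} = -\frac{\epsilon_k}{\|\mathbf{x}_k\|_2}$$
The update must reduce this distance.
[Figure: two sketches in the (θ0, θ1) plane: the update from θk to θk+1 along xk, and the distance d from θk to the solution hyperplane.]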
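The distance formula can be checked numerically. The sketch below is an illustration under stated assumptions: it uses the first datum of the example data set (x = [1, 2], y* = 3), a learning rate η = 0.1 and a hypothetical helper `signed_distance`.

```python
import numpy as np

def signed_distance(theta, x, y_star):
    """Signed perpendicular distance of theta from the hyperplane x^T theta = y*."""
    return (theta @ x - y_star) / np.linalg.norm(x)

x = np.array([1.0, 2.0])     # first datum of the example data set
y_star = 3.0
eta = 0.1                    # learning rate from the earlier example

theta = np.zeros(2)
eps = y_star - x @ theta                 # prior prediction error
theta_next = theta + eta * eps * x       # one iGD update, parallel to x

d_before = signed_distance(theta, x, y_star)
d_after = signed_distance(theta_next, x, y_star)
print(d_before, d_after)
```

For this stable learning rate the magnitude of the distance shrinks after the update, while the step itself stays parallel to x.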
3.2 Posterior Error & Learning Stability
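To evaluate the effect of the iGD update rule on the prediction error, let’s evaluate it after the update:
$$\epsilon_k = y_k^* - \mathbf{x}_k^T \boldsymbol{\theta}_k \qquad \text{(usual definition, prior prediction error)}$$
$$\bar{\epsilon}_k = y_k^* - \mathbf{x}_k^T \boldsymbol{\theta}_{k+1} \qquad \text{(posterior prediction error)}$$
We’d like to show that the error has decreased, so
$$\bar{\epsilon}_k = y_k^* - \mathbf{x}_k^T \boldsymbol{\theta}_{k+1} = y_k^* - \mathbf{x}_k^T(\boldsymbol{\theta}_k + \eta\,\epsilon_k \mathbf{x}_k) = \big(1 - \eta\,\|\mathbf{x}_k\|_2^2\big)\,\epsilon_k$$
The posterior prediction error will be smaller in magnitude than the prior prediction error so long as:
$$\big|1 - \eta\,\|\mathbf{x}_k\|_2^2\big| < 1 \quad\Longleftrightarrow\quad 0 < \eta < \frac{2}{\|\mathbf{x}_k\|_2^2}$$
This is the learning rate bound for stable learning.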
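The relationship between prior and posterior error can be verified with a few lines of Python. This is a sketch, assuming NumPy; the datum x = [1, 2], y* = 3 and the three learning rates are illustrative choices, and `posterior_error` is a hypothetical helper.

```python
import numpy as np

def posterior_error(theta, x, y_star, eta):
    """Return (prior error, posterior error) for one iGD update."""
    eps = y_star - x @ theta
    theta_next = theta + eta * eps * x
    return eps, y_star - x @ theta_next

x = np.array([1.0, 2.0])            # ||x||^2 = 5, so the stability bound is eta < 0.4
theta = np.zeros(2)
for eta in (0.1, 0.2, 0.5):
    eps, eps_post = posterior_error(theta, x, 3.0, eta)
    # the measured ratio and the predicted factor 1 - eta*||x||^2 should agree
    print(eta, eps_post / eps, 1 - eta * (x @ x))
```

The last learning rate lies outside the bound, so the printed ratio has magnitude greater than one and the error grows.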
3.2 Geometry of Stable and Unstable Learning
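[Figure: three panels in the (θ0, θ1) plane showing the update from θk along xk.
a) Unstable learning (2 cases), prediction error increases: $\eta < 0$ or $\eta > 2/\|\mathbf{x}_k\|_2^2$.
b) Stable learning, $\eta = 1/\|\mathbf{x}_k\|_2^2$: zero posterior error.
c) Stable learning (2 cases), prediction error decreases: $0 < \eta < 1/\|\mathbf{x}_k\|_2^2$ or $1/\|\mathbf{x}_k\|_2^2 < \eta < 2/\|\mathbf{x}_k\|_2^2$.]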
3.3 Normalized Instantaneous Gradient Descent
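The convergence of the iGD rule depends on the value of the learning rate, which is, in turn, related to the inverse of the squared magnitude of the input vector:
$$\boldsymbol{\theta}_{k+1} = \boldsymbol{\theta}_k + \eta\,\epsilon_k \mathbf{x}_k, \qquad 0 < \eta < \frac{2}{\|\mathbf{x}_k\|_2^2}$$
Let’s define a normalized iGD rule as
$$\boldsymbol{\theta}_{k+1} = \boldsymbol{\theta}_k + \eta_k\,\epsilon_k \mathbf{x}_k, \qquad \eta_k = \frac{\tilde{\eta}}{\|\mathbf{x}_k\|_2^2}$$
Then convergence occurs for $0 < \tilde{\eta} < 2$, and when $\tilde{\eta} = 1$ the prior parameter vector is always projected onto the solution hyperplane so that the posterior error is zero.
[Figure: with $\tilde{\eta} = 1$, θk is projected along xk onto the solution hyperplane in the (θ0, θ1) plane.]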
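A small sketch of the normalized rule is shown below, assuming NumPy and the example data set used earlier. The symbol `eta_tilde` stands for the normalized learning rate, and the helper name `normalized_igd_step` and the choice of 20 cycles are assumptions for illustration.

```python
import numpy as np

def normalized_igd_step(theta, x, y_star, eta_tilde=1.0):
    """One normalized update: the step is scaled by 1/||x||^2."""
    eps = y_star - x @ theta
    return theta + (eta_tilde / (x @ x)) * eps * x

X = np.array([[1.0, 2.0], [2.0, 1.0], [1.0, 0.0]])
y_star = np.array([3.0, 3.0, 1.0])

theta = np.zeros(2)
for k in range(60):                       # 20 cycles through the data
    i = k % len(X)
    theta = normalized_igd_step(theta, X[i], y_star[i])
    # with eta_tilde = 1 the datum just presented is fitted exactly
    assert abs(y_star[i] - X[i] @ theta) < 1e-12
print(theta)
```

Each step is a perpendicular projection onto the hyperplane of the datum just presented, which is why the posterior error check inside the loop holds.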
3.4 Parameter Convergence 1
When the data set consists of multiple data points, we have different convergence behaviours.
Without loss of generality, we can assume the normalized instantaneous gradient descent rule with normalized learning rate $\tilde{\eta} = 1$, so that the posterior error is 0.
a) Less data than parameters
• 2 parameters and 1 data point
• Parameter vector is projected (perpendicularly) onto the solution line: $\Delta\boldsymbol{\theta}_k \propto \mathbf{x}_k$
• There are multiple solutions to the learning problem, all of which are equally good (with respect to the data set) & lie on the solution hyperplane
• The equations are underdetermined.
[Figure: θk projected along xk onto the solution line at θk+1, in the (θ0, θ1) plane.]
3.4 Parameter Convergence 2
b) Consistent data
• All the solution hyperplanes intersect at a unique point
• When the parameter is projected onto the next hyperplane, the distance to that point is reduced (can be shown)
• iGD converges to that unique value
c) Inconsistent data
• The solution hyperplanes do not intersect at a unique point
• Equations are inconsistent
• iGD converges to a limit cycle or oscillates within a zone defined by the hyperplane intersection.
• Typical when data contains noise
[Figure: parameter trajectories in the (θ0, θ1) plane for b), converging from θk to θ*, and for c), cycling between the hyperplanes.]
3.4 Rate of Parameter Convergence
The rate of parameter convergence depends on the correlation (or
lack of) between successive training data.
Without loss of generality assume the data is consistent and there
are two parameters and two data points
[Figure: parameter trajectories from θk in the (θ0, θ1) plane for a) correlated data and b) near orthogonal data.]
• The rate of parameter convergence strongly depends on how orthogonal (uncorrelated) successive data points are.
• When the update directions are more orthogonal, the parameter space is searched more effectively.
3.4 Example: Normalized iGD Convergence
For the usual data set, the data is presented cyclically using a normalized iGD rule with normalized learning rate $\tilde{\eta} = 1$.
$$X = \begin{bmatrix} 1 & 2 \\ 2 & 1 \\ 1 & 0 \end{bmatrix}, \qquad \mathbf{y}^* = \begin{bmatrix} 3 \\ 3 \\ 1 \end{bmatrix}$$
• Faster convergence with normalized learning rule
• Parameters are projected onto solution hyperplanes
• Data is consistent
• Reasonably fast rate of convergence as the data is not too correlated
[Figure: parameter estimates θi against iteration k (0 to 25), and the parameter trajectory in the (θ1, θ2) plane, both axes running from 0 to 1.5.]
3.5 iGD Expected Behaviour
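Instantaneous gradient descent approximates true, or batch, gradient descent, where all of the training data is available to calculate a parameter update:
$$\boldsymbol{\theta}_{k+1} = \boldsymbol{\theta}_k - \eta \frac{dJ(\boldsymbol{\theta}_k)}{d\boldsymbol{\theta}}$$
$$J = E_D(J_k) = \frac{1}{T}\sum_{k=1}^{T} J_k = \frac{1}{T}\sum_{k=1}^{T} \big(y_k^* - f(\mathbf{x}_k, \boldsymbol{\theta})\big)^2$$
Without performing a full analysis of this relationship, it is easy to see that
$$\frac{dJ(\boldsymbol{\theta}_k)}{d\boldsymbol{\theta}} = E_D\!\left(\frac{dJ_i(\boldsymbol{\theta}_k)}{d\boldsymbol{\theta}}\right)$$
• the batch gradient is the average of the instantaneous gradients.
• the average iGD update will be zero about the minimum of J
[Figure: sketches of Jk(θ) and J(θ) against θ.]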
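The averaging relationship can be checked numerically. The sketch below assumes NumPy, the consistent example data set, and the ½-scaled instantaneous objective from the iGD slide; the test point `theta` is an arbitrary choice and the function names are hypothetical.

```python
import numpy as np

X = np.array([[1.0, 2.0], [2.0, 1.0], [1.0, 0.0]])
y_star = np.array([3.0, 3.0, 1.0])

def instantaneous_gradient(theta, x, y):
    """Gradient of J_k = 0.5*eps^2 with eps = y - x^T theta."""
    return -(y - x @ theta) * x

def batch_gradient(theta, X, y):
    """Gradient of J = (1/T) * sum_k J_k, computed in one matrix expression."""
    return -X.T @ (y - X @ theta) / len(y)

theta = np.array([0.2, -0.4])    # arbitrary test point (assumption)
mean_inst = np.mean(
    [instantaneous_gradient(theta, x, y) for x, y in zip(X, y_star)], axis=0
)
print(mean_inst, batch_gradient(theta, X, y_star))
```

At the solution [1, 1] every instantaneous gradient vanishes for this consistent data set, so the batch gradient is zero there as well.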
3.5 Example: iGD Expected Behaviour
1 2   2.5 • y now contains a noisy
X   2 1  , y *  3.5  component
1 0   0.8
=1, 8 cycles =0.1, 80 cycles
2 2

1.5 1.5
q2
q2
1 1

0.5 0.5

0 0
0 0.5 1 q 1.5 2 0 0.5 1 q1 1.5 2
1
It is usual to choose a small learning rate which implicitly performs an
averaging type behaviour, &/or to reduce the learning rate with time
3. Analysis of iGD: Conclusions
• Each data point is visualized as a solution hyperplane (because
the prediction is linear in the parameters) with normal xk
• The iGD parameter updates are perpendicular to the solution
hyperplane, parallel to xk.
• Stability of the algorithm is controlled via the learning rate η: the posterior prediction error is reduced when
$$\boldsymbol{\theta}_{k+1} = \boldsymbol{\theta}_k + \eta\,\epsilon_k \mathbf{x}_k, \qquad 0 < \eta < \frac{2}{\|\mathbf{x}_k\|_2^2}$$
• This influences the definition of the normalized iGD algorithm,
where the learning rate is normalized by the input magnitude
• Parameter convergence is fast/slow when successive data is near
orthogonal/highly correlated, respectively
• Noise affects the final convergence properties, because iGD is
calculated using a single data point only
3. Analysis of iGD: Exercises
• Derive and sketch the solution hyperplanes from example 3.1
• Derive the expression for the iGD posterior error and hence
derive the limits in the learning rate to ensure it decreases in
magnitude
• Assuming the same input-output pair is repeatedly presented to
the iGD rule, calculate the time constant for the error in the
parameters as a function of the learning rate and the input
vector magnitude
• Implement in Matlab the simulations in Examples 3.1, 3.4 and
3.5.
• In Example 3.4, when the 2nd data point is removed, is
parameter convergence slower or faster? Explain why.
• In Example 3.5, does the final limit cycle (normalized learning rate $\tilde{\eta} = 1$) depend on the initial parameter values?
4. Non-linear Models & Parameter Estimation
A large amount of research has been performed in the statistics,
artificial intelligence and cognitive robotics communities (to name
but a few) under the general heading of non-linear modelling
(supervised learning).
In this section, we’ll cover
1. Introduction
2. Model structure
3. Controlling complexity
4. Gradient descent parameter estimation

We’re going to describe an “artificial neural network” view of things, but try to keep things reasonably generic and meaningful.
4.1. Non-Linear Supervised Learning
The previous section demonstrated how learning, or parameter
adaptation, can be used to estimate an unknown linear function.
• The parameters adapt to best fit an objective function (SSE)
• There are a variety of approaches to estimate the parameters, including direct solution and batch and instantaneous gradient descent

In this section, we’ll see how this work can be used to learn non-linear
functions
1. Understand the non-linear mappings f:X->Y
2. Derive gradient based learning rules
Necessary in robotics applications (learning to invert a pendulum) where
the operational space is too large to use linearization.

In the past, these non-linear functions have had fancy names like neural
networks, fuzzy approximators, kernel machines. We won’t bother too
much with this labelling.
4.1 1D & 2D Non-Linear Regression Models
Unlike linear regression models which have a specific structure,
there are a large variety of non-linear models including
polynomials, piecewise polynomials, sigmoidal and radial basis
functions, kernel …
The key thing to appreciate is that the model’s output is typically non-linear in terms of the input and parameters: f(x, θ)
[Figure: example non-linear regression functions f for a single input x and for two inputs (x1, x2).]
4.1 Inspiration from the Brain
The brain is a massively parallel “computer”
which can solve many problems (pattern
recognition, planning, locomotion, …) which
digital computers find difficult
• Can we use simple models of the brain to
learn how to classify, predict, organize …
information?
• This is the field of artificial neural networks
(ANN)
• Typically, basic function of the neuron has a
“linear behaviour” + non-linear
transformation and computation is
distributed across many neurons.
• We’ll look at a basic MLP (multi-layer
perceptron)
4.2 Typical Non-linear Models: Polynomial
A polynomial model is of the form:
$$y = \theta_0 + \theta_1 x_1 + \dots + \theta_n x_n + \theta_{n+1} x_1^2 + \theta_{n+2} x_1 x_2 + \dots + \theta_* x_n^2 + \theta_* x_1^3 + \theta_* x_1 x_2 x_3 + \dots + \theta_* x_n^3 + \dots$$
Here, a * on the parameter index is simply a wildcard used to denote the appropriate value (often a complex formula)
• Use is motivated by the (truncated) Taylor series, which approximates about a point rather than minimizing the SSE of a data set
• Non-linear in the input, but linear in the nK parameters
• Previously derived training rules can be used
• Can approximate any continuous function to an arbitrary accuracy.
• Often the unbounded, rapid growth of the polynomial terms as |x| increases makes them an unsuitable choice.
4.2 Radial Basis Function Models
Radial basis function models are defined by:
$$y = \sum_{i=1}^{h} \theta_i \exp\!\left(-\|\mathbf{x} - \mathbf{c}_i\|_2^2 / \sigma_i^2\right)$$
Example (plotted in the figure):
h = 3
c = {-2, 0, 2}
σ² ~ {0.6, 0.6, 0.6}
θ = {0.5, 1, 1}
• Parameters are split into a linear set (weights) and a non-linear set (centres and widths): $\{\theta_i, c_i, \sigma_i\}_i$
• Approximate a non-linear function with “lots of bumps”
• Linked into statistical techniques like Gaussian processes and kernel (support vector) machines
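The example parameters can be evaluated directly. The sketch below assumes a scalar input, NumPy, and the Gaussian form exp(-(x - c_i)^2 / s_i^2) for each basis function; the function name `rbf_model` and the evaluation grid are illustrative assumptions.

```python
import numpy as np

def rbf_model(x, weights, centres, widths_sq):
    """y = sum_i weights[i] * exp(-(x - centres[i])^2 / widths_sq[i]) for scalar x."""
    x = np.atleast_1d(np.asarray(x, dtype=float))[:, None]   # column of inputs
    basis = np.exp(-(x - centres) ** 2 / widths_sq)           # one column per bump
    return basis @ weights

centres = np.array([-2.0, 0.0, 2.0])
widths_sq = np.array([0.6, 0.6, 0.6])
weights = np.array([0.5, 1.0, 1.0])

xs = np.linspace(-4.0, 4.0, 9)
print(rbf_model(xs, weights, centres, widths_sq))
```

Because the output is linear in `weights`, the linear training rules derived earlier apply to them once the centres and widths are fixed.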
4.2 Sigmoidal Non-linear Models
The sigmoidal model is of the form:
$$y = \sum_{i=1}^{h_n} \theta_i^o\, h\!\left(\mathbf{x}^T \boldsymbol{\theta}_i^h + \theta_0^h\right) + \theta_0^o$$
where h is a sigmoidal function like:
$$h(u) = \frac{1}{1 + e^{-u}}$$
[Figure: network diagram with inputs 1, x1, x2, hidden nodes h1, h2 and output y.]
• These models are often called Artificial Neural Networks or Multi-Layer Perceptrons as each node can be thought of as an extremely simple model of a neuron.
• Non-linear in both the input & parameters
• Can approximate any continuous function to an arbitrary accuracy using ridges
4.2 Multi-Layer Perceptron Networks
• Layered perceptron (with bi-polar/binary outputs) networks
can realize any logical function, however there is no simple way
to estimate the parameters/generalise the (single layer)
Perceptron convergence procedure
• Multi-layer perceptron (MLP) networks are a class of models
that are formed from layered sigmoidal nodes, which can be
used for regression or classification purposes. Widely applied to
many prediction and classification problems
• They are commonly trained using gradient descent on a mean
squared error performance function, using a technique known
as error back propagation in order to calculate the gradients.
Often 2nd order methods are used to improve the rate of
convergence
4.2 Multi-Layer Perceptron Networks
• Use 2 or more layers of parameters where:
– Empty circles represent sigmoidal (tanh) nodes
– Solid circles represent real signals (inputs, biases & outputs)
– Arrows represent adjustable parameters
[Figure: MLP diagram. Inputs x0 = 1, x1, x2 feed the hidden layer (parameters Θh) with nodes h1, h2 and bias h0 = 1; the output layer (parameters θo) produces y.]
• Multi-Layer Perceptron networks can have:
– Any number of layers of parameters (but generally just 2)
– Any number of outputs (but generally just 1)
– Any number of nodes in the hidden layers
4.2 Tanh Function with 2D Inputs

$$y = \tanh(u), \qquad u = 0 + 1x_1 + 1x_2, \qquad \boldsymbol{\theta} = [0, 1, 1]$$
Such functions are often known as ridge functions, because they are constant along a line in input space:
• $u = \mathbf{x}^T \boldsymbol{\theta} = c$
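The ridge property is easy to confirm numerically. In the sketch below, which assumes NumPy, three points on the line x1 + x2 = 1 are evaluated; the point values and the function name `ridge` are arbitrary illustrative choices.

```python
import numpy as np

theta = np.array([0.0, 1.0, 1.0])          # [bias, weight on x1, weight on x2]

def ridge(x1, x2):
    """y = tanh(u) with u = theta0 + theta1*x1 + theta2*x2."""
    return np.tanh(theta[0] + theta[1] * x1 + theta[2] * x2)

# Three points on the line x1 + x2 = 1 give the same output.
points = [(0.0, 1.0), (1.0, 0.0), (3.0, -2.0)]
print([ridge(x1, x2) for x1, x2 in points])
```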
4.2 Exemplar Model Outputs
MLP with two hidden nodes. The response surface resembles an “impulse ridge” because one sigmoid is subtracted from the other. This is a learnt solution to the “classification” XOR problem.

This non-linear regression surface is generated by an MLP with three hidden nodes, and a linear transfer function in the output layer
4.3 Gradient Descent Parameter Estimation
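All of the model’s parameters can be stacked up into a single vector θ, then use gradient descent learning:
$$\boldsymbol{\theta}_{k+1} = \boldsymbol{\theta}_k - \eta \frac{dJ(\boldsymbol{\theta}_k)}{d\boldsymbol{\theta}}$$
– $\boldsymbol{\theta}_1$ are small, random values
• Performance function is non-quadratic in θ:
$$J(\boldsymbol{\theta}) = \frac{1}{2}\sum_{i=1}^{T} \big(y_i^* - y(\mathbf{x}_i, \boldsymbol{\theta})\big)^2, \qquad J_i(\boldsymbol{\theta}) = \frac{1}{2}\big(y_i^* - y(\mathbf{x}_i, \boldsymbol{\theta})\big)^2$$
[Figure: a non-quadratic objective J against θ, marking θ*, θk and θk+1.]
• No direct solution
• Local minima are possible
• Learning rate is difficult to estimate because the local Hessian (second derivative matrix) varies in parameter space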
4.3 Output Layer Gradient Calculation
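$$u = \mathbf{x}^T \boldsymbol{\theta}, \qquad y = f(u), \qquad J = \tfrac{1}{2}(y^* - y)^2$$
[Figure: network fragment showing the hidden layer feeding the output layer.]
Gradient descent update: $\Delta\boldsymbol{\theta}_k = -\eta\, dJ(\boldsymbol{\theta}_k)/d\boldsymbol{\theta}$
For the ith training pattern: $J_i(\boldsymbol{\theta}) = \tfrac{1}{2}\big(y_i^* - y(\mathbf{x}_i, \boldsymbol{\theta})\big)^2$
Using the chain rule:
$$\frac{dJ_i}{d\boldsymbol{\theta}} = \frac{dJ_i}{dy}\,\frac{dy}{du}\,\frac{du}{d\boldsymbol{\theta}} = -(y_i^* - y_i)\, f'(u)\, \mathbf{x}_i$$
Giving an update rule:
$$\boldsymbol{\theta}_{k+1} = \boldsymbol{\theta}_k + \eta\,(y_i^* - y_i)\, f'(u)\, \mathbf{x}_i$$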
4.3 Hidden Layer Gradient Calculation
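Analyze the path by which altering the jth hidden node’s parameter vector affects the model’s output:
[Figure: signal path $\mathbf{x} \to (\boldsymbol{\theta}_j^h) \to u_j^h \to y_j^h \to (\theta_j^o) \to \Sigma \to u^o \to f() \to y^o$.]
$$y_j^h = f(u_j^h) = f(\mathbf{x}^T \boldsymbol{\theta}_j^h)$$
By the chain rule:
$$\frac{dJ_i}{d\boldsymbol{\theta}_j^h} = \frac{dJ_i}{dy^o}\,\frac{dy^o}{du^o}\,\frac{du^o}{dy_j^h}\,\frac{dy_j^h}{du_j^h}\,\frac{du_j^h}{d\boldsymbol{\theta}_j^h}$$
Gradient expression (back error propagation):
$$\frac{dJ_i}{d\boldsymbol{\theta}_j^h} = -(y_i^* - y_i)\, f'(u^o)\, \theta_j^o\, f'(u_j^h)\, \mathbf{x}_i = -(y_i^* - y_i)\, \delta_j^h\, \mathbf{x}_i$$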
4.3 MLP Iterative Parameter Estimation
Randomly initialise all parameters in network (to small values)
For each parameter update
    present each input pattern to the network & get output
    calculate the update for each parameter according to:
        $\Delta\theta_{ij}^{l,k} = \eta\,(y_k^* - y_k)\,\delta_j^l\, x_i^l$
    where:
        output layer: $\delta^o = f'(u^o)$
        hidden layer: $\delta_j^h = \delta^o\, \theta_j^o\, f'(u_j^h)$
    calculate average parameter updates
    update weights
Stop when steps > max_steps or MSE < tolerance or test MSE is minimum
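Before running such a training loop, it is common to sanity-check the δ terms against finite differences. The sketch below does this for one bipolar XOR pattern and a 2-2-1 network; it assumes tanh nodes in both layers, NumPy, a fixed random seed and hypothetical helper names (`forward`, `backprop`, `numerical_gradient`). It checks the gradients only and is not a full training implementation.

```python
import numpy as np

def forward(x, Th, to):
    """x: input with leading 1, Th: hidden weights (n+1, H), to: output weights (H+1,)."""
    uh = x @ Th                               # hidden pre-activations u_j^h
    z = np.concatenate(([1.0], np.tanh(uh)))  # hidden outputs with bias h0 = 1
    uo = z @ to                               # output pre-activation u^o
    return uh, z, np.tanh(uo)

def loss(x, t, Th, to):
    return 0.5 * (t - forward(x, Th, to)[2]) ** 2

def backprop(x, t, Th, to):
    """Gradients of J_i = 0.5*(t - y)^2 using the delta terms."""
    uh, z, y = forward(x, Th, to)
    delta_o = 1.0 - y ** 2                                 # f'(u^o) for tanh
    delta_h = delta_o * to[1:] * (1.0 - np.tanh(uh) ** 2)  # hidden-layer deltas
    return -(t - y) * np.outer(x, delta_h), -(t - y) * delta_o * z

def numerical_gradient(fun, w, step=1e-6):
    """Central finite differences of fun with respect to the array w."""
    grad = np.zeros_like(w)
    for idx in np.ndindex(w.shape):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[idx] += step
        w_minus[idx] -= step
        grad[idx] = (fun(w_plus) - fun(w_minus)) / (2 * step)
    return grad

rng = np.random.default_rng(1)
Th = 0.5 * rng.standard_normal((3, 2))   # 2 inputs + bias, 2 hidden nodes
to = 0.5 * rng.standard_normal(3)        # 2 hidden nodes + bias
x, t = np.array([1.0, -1.0, 1.0]), 1.0   # one bipolar XOR pattern

g_Th, g_to = backprop(x, t, Th, to)
err_h = np.max(np.abs(g_Th - numerical_gradient(lambda w: loss(x, t, w, to), Th)))
err_o = np.max(np.abs(g_to - numerical_gradient(lambda w: loss(x, t, Th, w), to)))
print(err_h, err_o)
```

The parameter update is then the negative of these gradients scaled by the learning rate, averaged over the patterns if the updates are accumulated as in the steps above.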
4.4 Example: Learning the XOR Problem
[Figure] Performance history for the XOR data and an MLP with 2 hidden nodes. Note its non-monotonic behaviour and the large number of iterations. Learning rate η = 0.05, update after each datum.

[Figure] Learning histories for the 9 parameters in the MLP. Note that even when the MSE goes up, the parameters are heading towards “optimal” values.
4.4 Example: Trained XOR Model

The trained optimal model has a ridge where the target is 1 and plateaus out in the regions where the target is –1. Note that all inputs and targets are bipolar {–1,1}, rather than binary.
4. Conclusions
Despite being inspired by the brain architecture (very loosely)
and function (does the brain really perform supervised
learning?), MLPs are just a non-linear regression or classification
algorithm which are trained using supervised learning.
• The hidden nodes (sigmoids) are ridge non-linearities which
are linearly combined (output layer) to produce the non-linear
regression mapping
• They have good scaling abilities when the number of inputs is
large, but …
• Gradient descent learning can be quite slow (poorly
conditioned learning problem)
• Difficult to estimate the number of nodes necessary or to stop
training, although Bayesian methods, outside the scope of this
course, can provide a decent answer.
4. Exercises
1. Explain how the parameters of
a) polynomial models
b) radial basis functions
are estimated (this includes any structural terms, like the
order of the polynomial)
2. Give a logical interpretation to solving the 2 input XOR
problem, when an MLP is used with 2 nodes in the hidden
layer. Each node should correspond to a linear logical
mapping.
3. Simulate, in Matlab, the MLP learning to solve the XOR
problem
4. Derive the error back-propagation (iGD) rule for an MLP with
a single hidden layer
5 CTC using a Learning Approach
In this section, we’ll look at a more realistic supervised learning
problem.
Consider implementing a Computed Torque Control (CTC) approach for a single link manipulator, as described in week 2.
• Replace the non-linear block with a supervised learning algorithm
[Figure: CTC block diagram with a linear outer feedback loop (reference q*, signal u) and a non-linear block (M(q) and C(q, q̇) + g(q)) producing the torque τ applied to the robot dynamics, whose outputs are q, q̇.]
5. Rationale for Learning CTC
• CTC is widely used in mechanical and robotic systems because
it can provide accurate tracking performance.
• Requires an accurate model of the dynamics/equations of
motion. Modelling errors can have a large effect on the
performance of the control system
• Need to “precisely” know the masses, length, inertias as well
as non-linear (not modelled) effects such as friction, stiction,
backlash, …
– Can measure masses and lengths and estimate inertias
from CAD diagrams
– Can use system identification techniques to validate (and
improve) parameters
– Missing non-linear dynamic relationships are difficult to
determine
5. Experimental Learning
In this section we’ll take a fairly simple view of learning where plant inputs (control torques) are randomly generated for different states and this produces accelerations.
[Figure: the dynamic system maps the input u and the state $x = [q, \dot{q}]$ to the acceleration $\ddot{q}$.]
Invert the measured data set:
$$\{q, \dot{q}, \ddot{q}^*\} \to u$$
This provides the training set for a supervised learning approach.
In practice, the learning and data collection would be done on-line; we’re taking a simpler off-line approach in this case.
5. Manipulator Details
For simplicity, we’ll consider the single link manipulator
$$(J + ma^2)\,\ddot{q} + mga\cos(q) = \tau$$
$$\ddot{q} = \big(\tau - mga\cos(q)\big) / (J + ma^2)$$
The angular acceleration is a function of two “inputs”, {q, τ}
• Linearly dependent on the torque τ
• Non-linearly dependent on the angle q
Use parameters:
l = 0.8 m
a = 0.4 m
m = 5 kg
J = 1 kg m²
$$\ddot{q} = 0.862\,\tau - 3.38\cos(q)$$
[Figure: the link with mass m at distance a, angle q, applied torque τ and weight mg; surface plot of d²q/dt² over q and τ.]
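The numerical coefficients can be recomputed from the parameter list. The sketch below assumes g = 9.81 m/s² (not stated on the slide) and uses a hypothetical helper `slm_coefficients`. With the listed m = 5 kg it gives roughly 0.556 and 10.9, whereas the printed coefficients 0.862 and 3.38 are reproduced with m = 1 kg; the slides do not say which value was intended.

```python
def slm_coefficients(m, a, J, g=9.81):
    """Return (c_tau, c_cos) in  q_ddot = c_tau*tau - c_cos*cos(q)."""
    inertia = J + m * a ** 2           # effective inertia J + m*a^2
    return 1.0 / inertia, m * g * a / inertia

print(slm_coefficients(m=5.0, a=0.4, J=1.0))   # parameters as listed
print(slm_coefficients(m=1.0, a=0.4, J=1.0))   # matches 0.862 and 3.38 after rounding
```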
5. SLM Data Generation
In practice, a lot of work needs to be done to generate suitable torque signals which adequately excite all the dynamics, enable the plant to visit all regions of the state space and stabilize the plant.
For this, we’ll use 1,000 data points which are randomly generated in the region
• q lies in [0, π]
• τ lies in [-10, 10]
No noise has been added to either the inputs or outputs, although a more realistic experiment would have to consider this.
[Figures: scatter of the sampled (q, τ) points, and the corresponding accelerations d²q/dt² plotted over q and τ.]
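A data set of this kind can be generated in a few lines. The sketch below assumes uniform sampling, NumPy, a fixed seed and the acceleration equation quoted earlier (0.862τ − 3.38cos(q)); the function name `generate_slm_data` is hypothetical.

```python
import numpy as np

def generate_slm_data(n_points=1000, seed=0):
    """Sample (q, tau) uniformly and evaluate q_ddot = 0.862*tau - 3.38*cos(q)."""
    rng = np.random.default_rng(seed)
    q = rng.uniform(0.0, np.pi, n_points)        # joint angle in [0, pi]
    tau = rng.uniform(-10.0, 10.0, n_points)     # torque in [-10, 10]
    q_ddot = 0.862 * tau - 3.38 * np.cos(q)      # noise-free target
    return np.column_stack((q, tau)), q_ddot

inputs, targets = generate_slm_data()
print(inputs.shape, targets.min(), targets.max())
```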
5. MLP Details
We’ll use an MLP with 2 inputs {q,t} and 5 hidden nodes (ridges)
to approximate the single link manipulator non-linear
acceleration mapping.
• Choice of 5 hidden nodes is a bit arbitrary, there needs to be
enough ridges to approximate non-linear function sufficiently
accurately, but need to have enough data to accurately
estimate the parameters
• 21 parameters in the MLP and these are trained using 1,000
data points
• Parameters are randomly initialized and estimated using
instantaneous gradient descent iGD
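The parameter count quoted above can be reproduced by counting weights and biases layer by layer. The helper name `mlp_parameter_count` is an assumption for this sketch, which covers only single-hidden-layer networks.

```python
def mlp_parameter_count(n_inputs, n_hidden, n_outputs=1):
    """Weights plus biases for a single-hidden-layer MLP."""
    hidden = (n_inputs + 1) * n_hidden     # each hidden node: inputs + bias
    output = (n_hidden + 1) * n_outputs    # each output: hidden nodes + bias
    return hidden + output

print(mlp_parameter_count(2, 5))   # 21, the manipulator model
print(mlp_parameter_count(2, 2))   # 9, the XOR model of Section 4.4
```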
5. MLP Learning Behaviour
• Parameters initially randomly distributed (0.1*randn)
• Learning rate η = 0.005
• Maximum of 1,000 cycles through the data set, where the data set contains 1,000 data points
• Learning terminated when rmse = 0.1, which was after 117 cycles
[Figure: rmse against training cycle k (0 to 150).]
Deciding how to stop training and determining the number of hidden nodes (non-linear terms) can strongly affect the performance of the model.
More recently, Bayesian methods have been widely used.
5. MLP Results
The approximation ability of the model is pretty good.
The trends are largely correct:
• Linear with respect to τ
• Non-linear with respect to q (half a sinusoid)
• RMSE is 0.1 (pre-set in the learning algorithm)
• Max error is ~0.7 at the edge
• The sigmoid (5 ridges) approximation is partially evident in the form of the error plot
[Figures: the learnt acceleration surface d²q/dt² over q and τ, and the corresponding error surface.]
5. Inverting the Model
In CTC, we need to invert the model to calculate the torque necessary to achieve the desired acceleration.
This can be done by inverting the model (or by learning the inverse data in the first place!)
[Figure: the inverse map, torque τ plotted over q and d²q/dt².]
This strategy is conceptually simple and can be made to work well. However, CTC can be sensitive to modelling errors and some form of simple, robust controller is often necessary to ensure closed loop stability
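Because the acceleration is linear in the torque, the analytic model from the manipulator slide can be inverted in closed form. The sketch below does this for the quoted equation (0.862τ − 3.38cos(q)); it assumes NumPy, an arbitrary operating point and hypothetical function names. A learnt MLP model would generally need a numerical inversion or a separately learnt inverse map instead.

```python
import numpy as np

C_TAU, C_COS = 0.862, 3.38     # coefficients of the quoted acceleration model

def acceleration(q, tau):
    """Forward model: q_ddot = 0.862*tau - 3.38*cos(q)."""
    return C_TAU * tau - C_COS * np.cos(q)

def inverse_torque(q, q_ddot_desired):
    """Torque that the forward model maps to the desired acceleration."""
    return (q_ddot_desired + C_COS * np.cos(q)) / C_TAU

q, q_ddot_desired = 1.0, 2.5   # arbitrary operating point (assumption)
tau = inverse_torque(q, q_ddot_desired)
print(tau, acceleration(q, tau))
```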
5. Conclusions
• Supervised learning approaches can be used to implement a
CTC approach:
– The basic assumption of model matching is invalid
(parameters differ, missing non-linear effects, …) and this
can cause instabilities
• Supervised learning can be used to learn the
– {input, state} -> acceleration & invert
– {state, acceleration} -> input
regression maps.
• MLP was used to approximate the SLM mapping and while the
errors are relatively small, they’re not zero
• MLP can be trained using iGD
• Robust, stabilizing outer control loop would be necessary to
implement in practice
5. Exercises
1. Explain the rationale for using a learning approach to implement CTC for robotic systems.
2. Explain, using equations, why the basic approach for CTC is
not (fully) valid when there exists modelling error in the CTC
calculation
3. Discuss which would be the most accurate in predicting the
torque using a supervised learning approach:
a. Learn the {control, state} -> acceleration map and invert
b. Learn the {state, acceleration} -> torque map
4. Implement the CTC mapping for a single link manipulator, as
described on the slides
1. Investigate what happens when the learning rate varies
from 0.05
2. Investigate what happens when there are 1, 5 or 20
hidden nodes
