RBF Network XOR Problem
One can have a basis function centred on each training data point as in the case of exact
interpolation, but add an extra term to the error/cost function which penalizes mappings
that are not smooth. For network outputs $y_k(\mathbf{x}^p)$ and a sum-squared error function, an appropriate regularization function Ω can be introduced to give

$$E = E_{SSE} + \lambda\Omega = \frac{1}{2}\sum_p \sum_k \left(t_k^p - y_k(\mathbf{x}^p)\right)^2 + \lambda\Omega$$
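As a concrete illustration, here is a minimal NumPy sketch of this cost function for the simple weight-decay choice of Ω discussed below; the function name, array shapes, and use of NumPy are illustrative assumptions rather than part of the lecture notes.

```python
import numpy as np

def regularized_error(Y, T, W, lam):
    """Sum-squared error plus a weight-decay regularizer Omega = 0.5 * sum(W**2).

    Y   : (P, K) array of network outputs y_k(x^p)
    T   : (P, K) array of targets t_k^p
    W   : (K, M) array of output weights w_kj
    lam : regularization parameter lambda
    """
    sse = 0.5 * np.sum((T - Y) ** 2)
    omega = 0.5 * np.sum(W ** 2)
    return sse + lam * omega
```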
Computing the Regularized Weights
Provided the regularization term is quadratic in the output weights $w_{kj}$, they can still be found by solving a set of linear equations. For example, the two popular regularizers

$$\Omega = \frac{1}{2}\sum_{k,j} (w_{kj})^2 \qquad\text{and}\qquad \Omega = \frac{1}{2}\sum_p \sum_{k,i} \left(\frac{\partial^2 y_k(\mathbf{x}^p)}{\partial x_i^2}\right)^2$$

penalize large output weights and large output curvature respectively, and minimizing the error function E leads to solutions for the output weights that are no harder to compute than before:
$$\mathbf{W}^T = \mathbf{M}^{-1}\boldsymbol{\Phi}^T \mathbf{T}$$

We have the same matrices with components $(\mathbf{W})_{kj} = w_{kj}$, $(\boldsymbol{\Phi})_{pj} = \phi_j(\mathbf{x}^p)$ and $(\mathbf{T})_{pk} = t_k^p$ as before, but now have different regularized versions of $\boldsymbol{\Phi}^T\boldsymbol{\Phi}$ for the two regularizers:

$$\mathbf{M} = \boldsymbol{\Phi}^T\boldsymbol{\Phi} + \lambda\mathbf{I} \qquad\text{and}\qquad \mathbf{M} = \boldsymbol{\Phi}^T\boldsymbol{\Phi} + \lambda\sum_i \frac{\partial^2\boldsymbol{\Phi}^T}{\partial x_i^2}\,\frac{\partial^2\boldsymbol{\Phi}}{\partial x_i^2}$$
Clearly, for λ = 0 both reduce to the un-regularized result we derived in the last lecture.
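To make the computation concrete, here is a minimal NumPy sketch of the weight-decay case $\mathbf{M} = \boldsymbol{\Phi}^T\boldsymbol{\Phi} + \lambda\mathbf{I}$, assuming the design matrix Φ and target matrix T have already been built; the function name and argument layout are assumptions for illustration only.

```python
import numpy as np

def regularized_rbf_weights(Phi, T, lam):
    """Compute W^T = M^{-1} Phi^T T with M = Phi^T Phi + lam * I (weight decay).

    Phi : (P, M) design matrix with entries Phi[p, j] = phi_j(x^p)
    T   : (P, K) target matrix with entries T[p, k] = t_k^p
    lam : regularization parameter lambda (lam = 0 recovers the
          un-regularized pseudo-inverse solution from the last lecture)
    Returns W^T of shape (M, K), one weight vector per output unit.
    """
    M = Phi.T @ Phi + lam * np.eye(Phi.shape[1])
    return np.linalg.solve(M, Phi.T @ T)
```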
Example 1: M = N, σ = 2d_ave, λ = 0. [Figure from Neural Networks for Pattern Recognition, C. M. Bishop, Oxford University Press, 1995.]
Example 2: M = N, σ = 2d_ave, λ = 40. [Figure from Neural Networks for Pattern Recognition, C. M. Bishop, Oxford University Press, 1995.]
RBF Networks for Classification
So far, RBF networks have been used for function approximation, but they are also useful for classification problems. Consider a data set that falls into three classes. An MLP would naturally separate the classes with hyper-planes in the input space, whereas an alternative approach is to model the separate class distributions with localised radial basis functions.
Implementing RBF Classification Networks
In principle, it is easy to set up an RBF network to perform classification – one can
simply have an output function yk(x) for each class k with appropriate targets
$$t_k^p = \begin{cases} 1 & \text{if pattern } p \text{ belongs to class } k \\ 0 & \text{otherwise} \end{cases}$$
and, when the network is trained, it will automatically classify new patterns.
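As a sketch of how such 1-of-K targets might be set up in practice (the function name and use of NumPy are assumptions, not part of the lecture notes):

```python
import numpy as np

def one_hot_targets(labels, num_classes):
    """Build the target matrix with T[p, k] = 1 if pattern p is in class k, else 0."""
    T = np.zeros((len(labels), num_classes))
    T[np.arange(len(labels)), labels] = 1.0
    return T

# A trained network then classifies a new pattern by picking the class
# whose output y_k(x) is largest.
```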
The underlying justification is found in Cover’s theorem which states that “A complex
pattern classification problem cast in a high dimensional space non-linearly is more
likely to be linearly separable than in a low dimensional space”. We know that once we
have linearly separable patterns, the classification problem is easy to solve.
In addition to the RBF network outputting good classifications, it can be shown that the
outputs of such a regularized RBF network classifier can also provide good estimates of
the posterior class probabilities.
The XOR Problem Revisited
The familiar case of the non-linearly separable XOR function provides a good example:
p   x1   x2   t
1   0    0    0
2   0    1    1
3   1    0    1
4   1    1    0
It was seen before that Single Layer Perceptrons with step or sigmoidal activation functions
cannot generate the right outputs, because they can only form a single linear decision
boundary. To deal with this problem using Perceptrons, one must either change the activation
function, or introduce a non-linear hidden layer to give a Multi Layer Perceptron (MLP).
The XOR Problem in RBF Form
Recall that sensible RBFs are M Gaussians φ j (x) centred at random training data points:
$$\phi_j(\mathbf{x}) = \exp\!\left(-\frac{M}{d_{max}^2}\,\left\|\mathbf{x}-\boldsymbol{\mu}_j\right\|^2\right) \qquad\text{where}\qquad \{\boldsymbol{\mu}_j\} \subset \{\mathbf{x}^p\}$$
To perform the XOR classification in an RBF network, one must begin by deciding how
many basis functions are needed. Given there are four training patterns and two classes,
M = 2 seems a reasonable first guess. Then the basis function centres need to be chosen.
The two well separated patterns with zero targets seem a good choice, so $\boldsymbol{\mu}_1 = (0, 0)$ and $\boldsymbol{\mu}_2 = (1, 1)$, and the distance between them is $d_{max} = \sqrt{2}$. With M = 2, the width factor is $M/d_{max}^2 = 1$, which gives the basis functions:
$$\phi_1(\mathbf{x}) = \exp\!\left(-\left\|\mathbf{x}-\boldsymbol{\mu}_1\right\|^2\right) \quad\text{with}\quad \boldsymbol{\mu}_1 = (0,0)$$
$$\phi_2(\mathbf{x}) = \exp\!\left(-\left\|\mathbf{x}-\boldsymbol{\mu}_2\right\|^2\right) \quad\text{with}\quad \boldsymbol{\mu}_2 = (1,1)$$
This is hopefully sufficient to transform the problem into a linearly separable form.
The XOR Problem Basis Functions
Since the hidden unit activation space is only two dimensional, it is easy to plot the
activations to see how the four input patterns have been transformed:
p   x1   x2   φ1       φ2
1   0    0    1.0000   0.1353
2   0    1    0.3678   0.3678
3   1    0    0.3678   0.3678
4   1    1    0.1353   1.0000
It is clear that the patterns are now linearly separable. Note that, in this case, there is no
need to increase the dimensionality from the input space to the hidden unit/basis function
space – the non-linearity of the mapping is sufficient. Exercise: check what happens if
you choose a different pair of basis function centres, or use one or three centres.
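A short NumPy sketch, under the assumptions above (centres at (0,0) and (1,1), width factor 1), reproduces these activations and makes the suggested exercise easy to try with other centres; the variable names are illustrative assumptions.

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])    # the four XOR patterns
centres = np.array([[0, 0], [1, 1]])               # mu_1 and mu_2

# phi_j(x) = exp(-||x - mu_j||^2) for every pattern/centre pair
Phi = np.exp(-((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2))
print(np.round(Phi, 4))
# -> values matching the table above (exp(-1) ≈ 0.3679, exp(-2) ≈ 0.1353)
```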
The XOR Problem Output Weights
In this case, there is just one output y(x), with one weight $w_j$ from each hidden unit j, and one bias $-\theta$. So, the network's input-output relation for each input pattern $\mathbf{x}$ is

$$y(\mathbf{x}) = w_1\phi_1(\mathbf{x}) + w_2\phi_2(\mathbf{x}) - \theta$$

Thus, to make the outputs $y(\mathbf{x}^p)$ equal the targets $t^p$, there are four equations to satisfy:

$$w_1\,(1.0000) + w_2\,(0.1353) - \theta = 0$$
$$w_1\,(0.3678) + w_2\,(0.3678) - \theta = 1$$
$$w_1\,(0.3678) + w_2\,(0.3678) - \theta = 1$$
$$w_1\,(0.1353) + w_2\,(1.0000) - \theta = 0$$

Three are different, and there are three variables, so they are easily solved to give

$$w_1 = w_2 = -2.5018\,, \qquad \theta = -2.8404$$
This completes the “training” of the RBF network for the XOR problem.
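As a quick numerical check, the three distinct equations can be solved directly; this is a minimal sketch using the tabulated basis function values, and the use of NumPy is an assumption rather than part of the lecture notes.

```python
import numpy as np

# One row per distinct equation  w1*phi1 + w2*phi2 - theta = t,
# for the patterns (0,0), (0,1) (same as (1,0)), and (1,1).
A = np.array([[1.0000, 0.1353, -1.0],
              [0.3678, 0.3678, -1.0],
              [0.1353, 1.0000, -1.0]])
t = np.array([0.0, 1.0, 0.0])

w1, w2, theta = np.linalg.solve(A, t)
print(w1, w2, theta)   # approximately -2.50, -2.50, -2.84, matching the values above
```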
Interpretation of Gaussian Hidden Units
The Gaussian hidden units in an RBF network are only strongly "activated" by inputs that fall within an associated region of the input space, so they can be interpreted as having receptive fields.
For each hidden unit there will be a region of the input space that results in an activation
above a certain threshold, and that region is the receptive field for that hidden unit. This
provides a direct relation to the receptive fields in biological sensory systems.
Comparison of RBF Networks with MLPs
When deciding whether to use an RBF network or an MLP, there are several factors to
consider. There are clearly similarities between RBF networks and MLPs:
Similarities
1. They are both non-linear feed-forward networks.
2. They are both universal function approximators.
3. They are used in similar application areas.
It is not surprising, then, to find that there always exists an RBF network capable of
accurately mimicking a specific MLP, and vice versa. However, the two network types
do differ from each other in a number of important respects:
Differences
1. RBF networks are naturally fully connected with a single hidden layer, whereas
MLPs can have any number of hidden layers and any connectivity pattern.
2. In MLPs, the nodes in different layers share a common neuronal model, though
not always the same activation function. In RBF networks, the hidden nodes (i.e.,
basis functions) have a very different purpose and operation to the output nodes.
3. In RBF networks, the argument of each hidden unit activation function is the
distance between the input and the “weights” (RBF centres), whereas in MLPs it
is the inner product of the input and the weights, as illustrated in the sketch after this list.
4. RBF networks are usually trained quickly one layer at a time with the first layer
unsupervised, which allows them to make good use of unlabelled training data.
5. MLPs are usually trained iteratively with a single global supervised algorithm,
which is slow compared to RBF networks and requires many learning parameters
to be set well, but allows an early stopping approach to optimizing generalization.
6. MLPs construct global approximations to non-linear input-output mappings with
distributed hidden representations, whereas RBF networks tend to use localised
non-linearities (Gaussians) at the hidden layer to construct local approximations.
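The third difference can be made concrete with a small sketch contrasting the two kinds of hidden unit; the function names, the Gaussian width, and the choice of a sigmoid for the MLP unit are illustrative assumptions rather than part of the lecture notes.

```python
import numpy as np

def rbf_hidden_activation(x, centre, sigma=1.0):
    # The activation depends on the *distance* between the input and the centre.
    return np.exp(-np.sum((x - centre) ** 2) / (2.0 * sigma ** 2))

def mlp_hidden_activation(x, weights, bias=0.0):
    # The activation depends on the *inner product* of the input and the weights.
    return 1.0 / (1.0 + np.exp(-(np.dot(x, weights) + bias)))
```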
Real World Application – EEG Analysis
One successful RBF network application is the detection of epileptiform artefacts in EEG recordings.
Full details can be found in the original journal paper: A. Saastamoinen, T. Pietilä, A.
Värri, M. Lehtokangas, & J. Saarinen, (1998). Waveform Detection with RBF Network
– Application to Automated EEG Analysis. Neurocomputing, vol. 20, pp. 1-13.
Overview and Reading
Reading