Neural Network Lectures: RBF 1
Is it a Neural Network?
Unsupervised Learning
Linear models
It is mathematically easy to fit linear models to data.
We can learn a lot about model-fitting in this relatively simple case.
There are many ways to make linear models more powerful while
retaining their nice mathematical properties:
By using non-linear, non-adaptive basis functions, we can get
generalised linear models that learn non-linear mappings from input
to output but are linear in their parameters; only the linear part of
the model learns.
By using kernel methods we can handle expansions of the raw data
that use a huge number of non-linear, non-adaptive basis functions.
By using large margin kernel methods we can avoid overfitting even
when we use huge numbers of basis functions.
But linear methods will not solve most AI problems.
They have fundamental limitations.
Some types of basis function: sigmoids, Gaussians, polynomials.
\[ y(\mathbf{x}, \mathbf{w}) = w_0 + w_1 x_1 + w_2 x_2 + \dots = \mathbf{w}^T \mathbf{x} \]
\[ y(\mathbf{x}, \mathbf{w}) = w_0 + w_1 \phi_1(\mathbf{x}) + w_2 \phi_2(\mathbf{x}) + \dots = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}) \]
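To make the "linear in the parameters" point concrete, here is a minimal sketch (assuming NumPy, Gaussian basis functions, and hand-picked centres; all names and values are illustrative) of building the feature expansion \(\boldsymbol{\phi}(\mathbf{x})\):

```python
import numpy as np

def gaussian_basis(x, centres, width=1.0):
    """phi_j(x) = exp(-(x - c_j)^2 / (2 * width^2)) for scalar inputs x."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)   # one row per training case
    return np.exp(-(x - centres) ** 2 / (2 * width ** 2))

def design_matrix(x, centres, width=1.0):
    """Prepend a column of ones so that w0 acts as the bias."""
    phi = gaussian_basis(x, centres, width)
    return np.hstack([np.ones((phi.shape[0], 1)), phi])

x = np.linspace(0.0, 1.0, 10)
centres = np.array([0.25, 0.5, 0.75])               # illustrative centres
Phi = design_matrix(x, centres)                      # shape (10, 4)
# y(x, w) = w^T phi(x): non-linear in x, but still linear in the parameters w.
```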
The squared error summed over the training cases is
\[ \text{error} = \sum_{n} \left( t_n - \mathbf{w}^T \mathbf{x}_n \right)^2 . \]
Minimising this error gives the optimal weights in closed form:
\[ \mathbf{w}^{*} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{t} \]
Here \((\mathbf{X}^T \mathbf{X})^{-1}\) is the inverse of the covariance matrix of the input vectors, \(\mathbf{t}\) is the vector of target values, and \(\mathbf{X}^T\) is the transposed design matrix, which has one input vector per column.
For example, a design matrix with one input vector per row and one input component per column:
\[ \mathbf{X} = \begin{pmatrix} 3.1 & 4.2 \\ 1.5 & 2.7 \\ 0.6 & 1.8 \end{pmatrix} \]
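A minimal NumPy sketch of the closed-form solution, reusing the example design matrix above (the target values are made up for illustration, and np.linalg.solve is used rather than forming the inverse explicitly):

```python
import numpy as np

X = np.array([[3.1, 4.2],
              [1.5, 2.7],
              [0.6, 1.8]])           # design matrix: one input vector per row
t = np.array([1.0, 0.5, 0.2])        # illustrative target values

# w* = (X^T X)^{-1} X^T t, computed without forming the inverse explicitly
w_star = np.linalg.solve(X.T @ X, X.T @ t)
print(w_star)
```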
A probabilistic interpretation: \( y_n = y(\mathbf{x}_n, \mathbf{w}) \) is the model's estimate of the most probable value of the correct answer \( t_n \). Assuming Gaussian output noise with standard deviation \(\sigma\),
\[ p(t_n \mid y_n) = p(y_n + \text{noise} = t_n \mid \mathbf{x}_n, \mathbf{w}) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{(t_n - y_n)^2}{2\sigma^2} \right) \]
so that
\[ -\log p(t_n \mid y_n) = \log \sqrt{2\pi} + \log \sigma + \frac{(t_n - y_n)^2}{2\sigma^2} . \]
The first term is a constant, and \(\log \sigma\) can be ignored if \(\sigma\) is fixed and the same for every case, so maximising the probability of the targets is equivalent to minimising the squared error \((t_n - y_n)^2\).
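A quick numerical check of this relationship, as a sketch (NumPy; the value of \(\sigma\) and the sample target/prediction pair are arbitrary):

```python
import numpy as np

sigma = 0.3
t_n, y_n = 1.2, 0.9                  # arbitrary target and model prediction

# Negative log-likelihood written term by term, as above
nll = np.log(np.sqrt(2 * np.pi)) + np.log(sigma) \
      + (t_n - y_n) ** 2 / (2 * sigma ** 2)

# Direct evaluation of -log N(t_n | y_n, sigma^2) for comparison
density = np.exp(-(t_n - y_n) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)
assert np.isclose(nll, -np.log(density))
# Only the squared-error term depends on the weights, so maximum likelihood
# fitting and least-squares fitting pick the same w.
```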
Multiple outputs
If there are multiple outputs we can often treat the learning
problem as a set of independent problems, one per output.
This is not true if the output noise is correlated and changes from
case to case.
Even though they are independent problems, we can save work because the inverse covariance of the input components, \((\mathbf{X}^T \mathbf{X})^{-1}\), only has to be computed once. For output \(k\) we have:
\[ \mathbf{w}_k^{*} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{t}_k \]
where \((\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T\) does not depend on \(k\).
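A sketch of that shared computation (NumPy, with randomly generated data standing in for a real design matrix and target matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))        # design matrix: 100 cases, 5 input components
T = rng.normal(size=(100, 3))        # one column of targets per output

# Solving with the whole target matrix reuses the factorisation of X^T X
# for every output column instead of redoing it once per output.
W = np.linalg.solve(X.T @ X, X.T @ T)   # shape (5, 3): one weight vector per output
```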
Instead of solving for the weights in one go, we can update them after each training case:
\[ \mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \eta \, \nabla E_n\!\left(\mathbf{w}^{(\tau)}\right) \]
where \(\mathbf{w}^{(\tau+1)}\) is the weight vector after seeing training case \(\tau + 1\) and \(\eta\) is the learning rate.
This is called online learning. It can be more efficient if the dataset is very
redundant, and it is simple to implement in hardware.
It is also called stochastic gradient descent if the training cases are picked
at random.
Care must be taken with the learning rate to prevent divergent
oscillations, and the rate must decrease at the end to get a good fit.
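A minimal sketch of stochastic gradient descent for the linear model (NumPy; the synthetic data and the decaying learning-rate schedule are illustrative choices, not something prescribed by the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                    # design matrix
true_w = np.array([1.0, -2.0, 0.5])
t = X @ true_w + 0.1 * rng.normal(size=200)      # noisy targets

w = np.zeros(3)
eta0 = 0.05
for step in range(2000):
    n = rng.integers(len(t))                     # pick a training case at random
    eta = eta0 / (1 + step / 500)                # learning rate decreases over time
    grad = -(t[n] - X[n] @ w) * X[n]             # gradient of (1/2)(t_n - w^T x_n)^2
    w -= eta * grad
# w should now be close to true_w
```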
If we penalise large weights by adding \(\frac{\lambda}{2}\,\|\mathbf{w}\|^2\) to the squared error, the optimal weights become
\[ \mathbf{w}^{*} = (\lambda \mathbf{I} + \mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{t} \]
where \(\mathbf{I}\) is the identity matrix and \(\lambda\) controls the strength of the penalty.
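A sketch of the regularised solution (NumPy; the default \(\lambda = 0.1\) is an arbitrary illustrative value):

```python
import numpy as np

def ridge_weights(X, t, lam=0.1):
    """w* = (lambda*I + X^T X)^{-1} X^T t, without forming the inverse."""
    d = X.shape[1]
    return np.linalg.solve(lam * np.eye(d) + X.T @ X, X.T @ t)
```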
Radial Basis Functions
Choose basis functions that depend only on the distance from a centre \(\mathbf{x}_i\):
\[ \phi_i(\mathbf{x}) = \phi(\|\mathbf{x} - \mathbf{x}_i\|), \]
where \(\|\cdot\|\) is the Euclidean norm. \(\phi_i\) is then called a radial basis function (RBF).
Common choices of RBF, with \(r = \|\mathbf{x} - \mathbf{x}_i\|\):
Linear: \(\phi(r) = r\)
Cubic: \(\phi(r) = r^3\)
Multiquadric: \(\phi(r) = \sqrt{r^2 + c^2}\)
Polyharmonic splines: \(\phi(r) = r^{2n} \log r,\ n \ge 1\), in 2D; \(\phi(r) = r^{2n-1},\ n \ge 1\), in 3D
Gaussian: \(\phi(r) = e^{-c r^2}\)
Compactly supported (Wendland): \(\phi(r) = (1 - r)_+^4 \,(4r + 1)\)
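A minimal sketch of fitting with radial basis functions (NumPy; the Gaussian RBF, placing one centre at every data point, and the 1-D toy data are all illustrative assumptions):

```python
import numpy as np

def gaussian_rbf(r, c=5.0):
    """phi(r) = exp(-c * r^2), one of the RBF choices listed above."""
    return np.exp(-c * r ** 2)

# Toy 1-D data; every data point also serves as an RBF centre x_i.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
t = np.sin(2 * np.pi * x) + 0.05 * rng.normal(size=x.size)

r = np.abs(x[:, None] - x[None, :])     # pairwise distances |x_n - x_i|
Phi = gaussian_rbf(r)                   # Phi[n, i] = phi(|x_n - x_i|)

# The same linear least-squares machinery as before, now with RBF features.
# A tiny ridge term keeps the solve well conditioned.
w = np.linalg.solve(Phi.T @ Phi + 1e-8 * np.eye(x.size), Phi.T @ t)

# Prediction at a new point: y(x) = sum_i w_i * phi(|x - x_i|)
x_new = 0.37
y_new = gaussian_rbf(np.abs(x_new - x)) @ w
```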