Artificial Neural Networks – Basics of MLP, RBF and Kohonen Networks

Jerzy Stefanowski
Institute of Computing Science
Lecture 13 in Data Mining
for M.Sc. Course of SE
version for 2010
Acknowledgments
• These slides also draw on ideas from the following presentations:
– Rosaria Silipo: Lecture on ANN. IDA Spring School 2001
– Prévotet Jean-Christophe (Paris VI): Tutorial on Neural Networks
– Włodzisław Duch: Lectures on Computational Intelligence
– A few others
• and on many of my notes for a course on Machine Learning and Neural Networks (in Polish, ISWD; see my personal web page for more slides)
Outline
• Introduction
– Inspirations
– The biological and artificial neurons
– Architecture of networks and basic learning rules
• Single Linear and Non-linear Perceptrons
– Delta learning rule
• MultiLayer Perceptrons
– MLPs and Back-Propagation
– Tuning parameters of BP
• Radial Basis Functions
– Architectures and learning algorithms
• Competitive Learning
– Competitive Learning, LVQ, Kohonen self-organizing maps.
• Applications and Software Tools
• Final Remarks
Introduction
• Some definitions
– “… a system composed of many simple processing
elements operating in parallel whose function is
determined by network structure, connection strengths,
and the processing performed at computing elements or
nodes.” - DARPA (1988)
– A neural network: A set of connected input/output units
where each connection has a weight associated with it
• During the learning phase, the network learns by
adjusting the weights so as to be able to predict the
correct class output of the input signals
Some properties
• Some points from definitions
– Many neuron-like threshold switching units
– Many weighted interconnections among units
– Highly parallel, distributed process
– Emphasis on tuning weights automatically
– …
When to Consider Neural Networks
• Input: High-Dimensional and Discrete or Real-Valued
– e.g., raw sensor input
– Conversion of symbolic data to quantitative (numerical) representations possible
• Output: Discrete or Real Vector-Valued
– e.g., low-level control policy for a robot actuator
– Similar qualitative/quantitative (symbolic/numerical) conversions may apply
• Data: Possibly Noisy
• Target Function: Unknown Form
• Result: Human Readability Less Important Than Performance
– Performance measured purely in terms of accuracy and efficiency
– Readability: ability to explain inferences made using model; similar criteria
• Examples
– Speech phoneme recognition
– Image classification
– Time signal prediction, Robotics, and many others
Autonomous Learning Vehicle in a Neural Net (ALVINN)
• Pomerleau et al.
– http://www.cs.cmu.edu/afs/cs/project/alv/member/www/projects/ALVINN.html
– Drives 70 mph on highways
[Figure: hidden-to-output unit weight map and input-to-hidden unit weight map, each shown for one hidden unit]
Image Recognition and Classification of Postal Codes
[Figure: examples of handwritten postal codes drawn from a database available from the US Postal Service]
Example: Neural Nets for Face Recognition
[Figure: network with 30 x 32 inputs and output classes Left, Straight, Right, Up; panels show hidden layer weights after 1 epoch, hidden layer weights after 25 epochs, and output layer weights (including w0 = θ) after 1 epoch]
• 90% accurate learning head pose, recognizing 1-of-20 faces
• http://www.cs.cmu.edu/~tom/faces.html
Example: NetTalk
• Sejnowski and Rosenberg, 1987
• Early Large-Scale Application of Backprop
– Learning to convert text to speech
• Acquired model: a mapping from letters to phonemes and stress marks
• Output passed to a speech synthesizer
– Good performance after training on a vocabulary of ~1000 words
• Very Sophisticated Input-Output Encoding
– Input: 7-letter window; determines the phoneme for the center letter and context on
each side; distributed (i.e., sparse) representation: 200 bits
– Output: units for articulatory modifiers (e.g., “voiced”), stress, closest phoneme;
distributed representation
– 40 hidden units; 10000 weights total
• Experimental Results
– Vocabulary: trained on 1024 of 1463 (informal) and 1000 of 20000 (dictionary)
– 78% on informal, ~60% on dictionary
• http://www.boltz.cs.cmu.edu/benchmarks/nettalk.html
ANN and Mining Data
• ANN originally comes from AI and ML
• Data Mining and Exploration of Data
– The data we meet are often (at least partly) numerical, …
– The tasks (function approximation, pattern classification, etc.) are also similar
• ANN are very good approximators or classifiers
– However, remember the time cost, the parameterization, the black-box character, …
Examples of Different ANN
• Perceptron
• Multi-Layer Perceptron
• Radial Basis Function (RBF)
• Kohonen feature maps
• Other architectures, e.g.
– Hopfield networks and BAM
– ART
Looking at ANN
• ANN could be defined by:
– Model of artificial network (details of its
component and processing)
– Topology / architecture of the network
– Learning
Biological Inspirations
• Humans perform complex tasks like vision,
motor control, or language understanding
very well

• One way to build intelligent machines is to


try to imitate the (organizational principles
of) human brain
Biological inspirations
• Some numbers…
– The human brain contains about (or over) 10 billion
nerve cells (neurons)
– Each neuron is connected to the others through 10000
synapses

• Properties of the brain


– It can learn, reorganize itself from experience
– It adapts to the environment
– It is robust and fault tolerant
Biological neuron
[Figure: biological neuron with labeled dendrites, cell body, nucleus, axon and synapses]
• A neuron has
– A branching input (dendrites)
– A branching output (the axon)
• The information circulates from the dendrites to the axon via the
cell body
• Axon connects to dendrites via synapses
– Synapses vary in strength
– Synapses may be excitatory or inhibitory
The Action Potential
Human Brain
• The brain is a highly complex, non-linear, and parallel computer, composed of some 10^11 neurons that are densely connected (~10^4 connections per neuron). We have just begun to understand how the brain works...
• A neuron is much slower (10^-3 s) than a silicon logic gate (10^-9 s); however, the massive interconnection between neurons makes up for the comparably slow rate.
– Complex perceptual decisions are arrived at quickly (within a
few hundred milliseconds)
• 100-Steps rule: Since individual neurons operate in a few
milliseconds, calculations do not involve more than about 100
serial steps and the information sent from one neuron to another is
very small (a few bits)
• Plasticity: Some of the neural structure of the brain is present at
birth, while other parts are developed through learning, especially
in early stages of life, to adapt to the environment (new inputs).
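The Artificial Neuron
(McCulloch and Pitts, 1943)
[Figure: inputs x1 … xn with weights w1 … wn feed a summation unit Σ that produces the activation a; the activation function with threshold u gives the output y]
$$y(t+1) = f\left(\sum_{k=1}^{n} w_k x_k(t) - u\right) = f\left(\sum_{k=0}^{n} w_k x_k(t)\right)$$
(the second form absorbs the threshold as a bias weight, w_0 = -u with x_0 = +1)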
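As a rough illustration of the neuron equation, the following sketch computes the weighted sum, subtracts the threshold u and applies an activation function f. The helper names (`neuron_output`, `step`) and the AND weights are assumptions chosen for this example, not part of the original slides.

```python
def neuron_output(weights, inputs, threshold, f):
    """Return f(sum_k w_k * x_k - u) for a single artificial neuron."""
    activation = sum(w * x for w, x in zip(weights, inputs))
    return f(activation - threshold)


def step(a):
    """Threshold activation: +1 for a >= 0, otherwise -1."""
    return 1 if a >= 0 else -1


# Hypothetical weights and threshold that make the unit behave like logical AND.
and_weights = [1.0, 1.0]
and_threshold = 1.5
outputs = [neuron_output(and_weights, [x1, x2], and_threshold, step)
           for x1 in (0, 1) for x2 in (0, 1)]

# Same unit written with the threshold absorbed as a bias weight w0 = -u, x0 = +1.
bias_outputs = [neuron_output([-and_threshold] + and_weights, [1, x1, x2], 0.0, step)
                for x1 in (0, 1) for x2 in (0, 1)]
```

Both forms should give the same outputs, which is the point of the second equality in the formula.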
Activation Functions
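• Step function
$$f(a) = \begin{cases} +1 & \text{if } a \ge u \\ -1 & \text{if } a < u \end{cases}$$
• Linear function
$$f(a) = \begin{cases} +1 & \text{if } a \ge u \\ a & \text{if } -u \le a < u \\ -1 & \text{if } a < -u \end{cases}$$
• Logistic sigmoid
$$f(a) = \frac{1}{1 + e^{-ha}}$$
• Gaussian
$$f(a) = e^{-\frac{(a-u)^2}{2\sigma^2}}$$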
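The four activation functions can be sketched in a few lines. This is only an illustration: the default parameter values are arbitrary, and the Gaussian is assumed to have the usual form exp(-(a - u)^2 / (2 sigma^2)).

```python
import math


def step(a, u=0.0):
    """Step function: +1 if a >= u, otherwise -1."""
    return 1.0 if a >= u else -1.0


def saturating_linear(a, u=1.0):
    """Linear inside [-u, u), clipped to -1 / +1 outside."""
    if a >= u:
        return 1.0
    if a < -u:
        return -1.0
    return a


def logistic(a, h=1.0):
    """Logistic sigmoid with slope parameter h."""
    return 1.0 / (1.0 + math.exp(-h * a))


def gaussian(a, u=0.0, sigma=1.0):
    """Gaussian bump centred at u (assumed standard form)."""
    return math.exp(-((a - u) ** 2) / (2.0 * sigma ** 2))
```

For example, `logistic(0.0)` is 0.5 and `gaussian(0.0)` is 1.0 with the default parameters.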
Activation functions
[Figure: plots of the linear function y = x, the step function (values -1 and 1), the sigmoidal (logistic) function y = 1 / (1 + exp(-βx)), and the hyperbolic tangent y = (exp(x) - exp(-x)) / (exp(x) + exp(-x))]
Network topologies
Feed Forward Neural Networks
• The information is propagated from the inputs to the outputs
• Computation of No non-linear functions of n input variables by compositions of Nc algebraic functions
• Time has no role (NO cycle between outputs and inputs)
[Figure: feed-forward network with inputs x1, x2, …, xn, a 1st hidden layer, a 2nd hidden layer and an output layer]
Network topologies
Recurrent Neural Networks
• Can have arbitrary topologies
• Can model systems with internal states (dynamic ones)
• Delays are associated with specific weights
• Training is more difficult
• Performance may be problematic
– Stable outputs may be more difficult to evaluate
– Unexpected behavior (oscillation, chaos, …)
[Figure: recurrent network with inputs x1, x2 and binary (0/1) unit states]
Learning neural networks
• The procedure that consists in estimating the parameters of
neurons (usually weights) so that the whole network can
perform a specific task

• Basic types of learning
– Supervised learning
– Unsupervised learning
• The learning process (supervised)
– Present the network with a number of inputs and their corresponding outputs
– See how closely the actual outputs match the desired ones
– Modify the parameters to better approximate the desired outputs
The ANN Learning Process
• Neurons can learn, (Hebb, 1949):
– memory is stored in synapses and learning takes place by synaptic
modifications;
– neurons become organized into larger configurations to perform more
complex information processing
Hebbian learning:
• When two joining cells fire simultaneously, the connection between them strengthens (Hebb, 1949)
• Discovered at a biomolecular level by Lomo (1966) (long-term potentiation)
[Figure: conditioning diagram with units labeled US, CS and UR]
Supervised Learning of Neurons
Let us suppose that a sufficiently large set of examples (training
set) is available.
Supervised learning:
– The network answer to each input pattern is directly
compared with the desired answer and a feedback is given to
the network to correct possible errors

[Figure: the weights matrix maps input x to output y; the error y - d with respect to the required output d is fed back to the network]
Perceptron
[Figure: single-layer network in which inputs x1 … xn are connected to output units 1 … p (outputs y1 … yp) through weights w_ik]
$$y_i(t+1) = f\left(\sum_{k=0}^{n} w_{ik} x_k(t)\right), \quad i = 1, 2, \ldots, p$$
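What a Single Perceptron Does
• Regression: y = wx + w0
• Classification: y = 1 if (wx + w0 > 0)
[Figure: single unit with input x, bias input x0 = +1 and weights w, w0, shown for regression and for classification with a sigmoid output s]
$$y = \mathrm{sigmoid}(o) = \frac{1}{1 + \exp\left(-w^T x\right)}$$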
Perceptron
• Rosenblatt (1962)
• Linear separation
• Inputs: vector of real values
• Outputs: 1 or -1
$$o = w_0 + w_1 x_1 + w_2 x_2, \qquad y = f(o)$$
[Figure: two classes of points, y = +1 and y = -1, separated by the line w0 + w1 x1 + w2 x2 = 0; the unit has inputs 1, x1, x2 with weights w0, w1, w2]
Error Function
• Training set:
$$T = \{(x^q, d^q) \mid q = 1, 2, \ldots, m\}$$
• Error measure:
$$E(W) = f(o_i^q - d_i^q)$$
[Figure: error curve E(W) over W with its minimum E(W*) at W*]
Gradient Descent algorithm
• Simple Gradient Descent Algorithm
– Applicable to different type of learning (with proper representation)

• Algorithm Train-Perceptron (D ≡ {<x, d(x)>}), where d(x) is the desired output for x
– Initialize all weights wi to random values
– WHILE not all examples correctly predicted DO
FOR each training example x ∈ D
Compute current output o(x)
FOR i = 1 to n
wi ← wi + r (d - o) xi // delta perceptron learning rule, r = learning rate
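The Train-Perceptron loop can be sketched in Python as follows. This is a minimal illustration under stated assumptions: weights start at zero rather than at random values (for repeatability), the bias is handled through a constant input x0 = +1, the rate r = 0.5 and the AND data set with outputs in {-1, +1} are chosen for the example, and a `max_epochs` guard is added so the loop always terminates.

```python
def predict(w, x):
    """Threshold unit: +1 if w0 + sum_i w_i x_i > 0, otherwise -1."""
    o = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return 1 if o > 0 else -1


def train_perceptron(data, r=0.5, max_epochs=100):
    """Repeat w_i <- w_i + r (d - o) x_i until every example is classified correctly."""
    n = len(data[0][0])
    w = [0.0] * (n + 1)  # assumption: zeros instead of random initial values
    for _ in range(max_epochs):
        errors = 0
        for x, d in data:
            o = predict(w, x)
            if o != d:
                errors += 1
                w[0] += r * (d - o)  # bias weight, its input is x0 = +1
                for i in range(n):
                    w[i + 1] += r * (d - o) * x[i]
        if errors == 0:
            break
    return w


# Logical AND with desired outputs coded as -1 / +1 (linearly separable).
AND_DATA = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
w = train_perceptron(AND_DATA)
predictions = [predict(w, x) for x, _ in AND_DATA]
```

Because AND is linearly separable, the loop stops once all four examples are predicted correctly.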

• Definition: Gradient
$$\nabla E[\vec{w}] \equiv \left[\frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \ldots, \frac{\partial E}{\partial w_n}\right]$$
Gradient Descent algorithm
The RMS error function:
$$E(W) = \frac{1}{2}\sum_{q=1}^{m}\sum_{i=1}^{p}\left(o_i^q - d_i^q\right)^2 = \sum_{q=1}^{m} E^q(W)$$
The learning process (stepwise looking for solution):
$$w_{ik}(t+1) = w_{ik}(t) + \Delta w_{ik}(t)$$
The gradient descent algorithm:
$$\Delta w_{ik}(t) = -\eta\,\frac{\partial E(t)}{\partial w_{ik}} = -\eta\sum_{q=1}^{m}\frac{\partial E^q(t)}{\partial w_{ik}} = \sum_{q=1}^{m}\Delta w_{ik}^q(t)$$
Delta Learning Rule (Widrow, Hoff)
$$\frac{\partial E^q}{\partial w_{ik}} = \frac{\partial}{\partial w_{ik}}\left(\frac{1}{2}\sum_{i=1}^{p}\left(o_i^q - d_i^q\right)^2\right) = \ldots$$
After some computations:
$$\Delta w_{ik}^q = -\eta\,\delta_i^q x_k^q$$
In the literature the error is usually calculated as (d - o), and the delta learning rule is then given in the form:
$$\Delta w_{ik}^q = \eta\,\delta_i^q x_k^q$$
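A small sketch of the delta rule for a single linear unit, using the (d - o) sign convention. It assumes a linear output o = w0 + w1 x, so that the error term is simply delta = d - o. The data set (points on the line d = 2x + 1), the rate eta and the number of epochs are illustrative choices.

```python
def train_delta_rule(data, eta=0.05, epochs=500):
    """Per-pattern delta rule for one linear unit o = w0 + w1 * x."""
    w0, w1 = 0.0, 0.0
    for _ in range(epochs):
        for x, d in data:
            o = w0 + w1 * x
            delta = d - o            # error term for a linear unit
            w0 += eta * delta * 1.0  # bias input x0 = +1
            w1 += eta * delta * x
    return w0, w1


# Noise-free samples of d = 2x + 1 (assumed example data).
DATA = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w0, w1 = train_delta_rule(DATA)
```

With this small learning rate the weights should approach w0 = 1 and w1 = 2; a much larger eta would make the updates oscillate or diverge.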
Learning Rate (η)
∆w1= η1 δ x with η1 too small
∆w2= η2 δ x with η2 right size
∆w3= η3 δ x with η3 too big

[Figure: error curve E(W) over W showing the steps ∆W1, ∆W2, ∆W3 taken from W(q) towards the minimum E(W*) at W*]
• The standard perceptron learning algorithm converges if examples are linearly separable → see the figure
[Figure: linearly separable (LS) data set, with + and - examples in the (x1, x2) plane]
• Consider an example of a simple logical AND problem
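Perceptron limitations [Minsky, Papert]
The XOR function: the non-linear separability problem
[Figure: perceptron with inputs x1, x2, weights w1, w2 and output y; the four XOR points in the (x1, x2) plane cannot be separated by a single line w1 x1 + w2 x2 - w0 = 0]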
Need for constructing MLP
• The solution: a 2-layered network with non-linear functions
$$o_i(t) = \frac{1}{1 + e^{-(net_i(t) - \theta)/\tau}}$$
• However → how to learn weights in such networks?
x1 XOR x2 = (x1 AND ~x2) OR (~x1 AND x2)
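The decomposition of XOR into two AND-type hidden units and one OR output unit can be checked directly. In the sketch below the weights and thresholds are set by hand (they are assumptions for illustration, not learned values), and each unit is a simple 0/1 threshold unit.

```python
def step(a):
    """0/1 threshold unit."""
    return 1 if a > 0 else 0


def xor_net(x1, x2):
    """Two-layer network computing (x1 AND NOT x2) OR (NOT x1 AND x2)."""
    h1 = step(x1 - x2 - 0.5)    # hidden unit 1: x1 AND NOT x2
    h2 = step(x2 - x1 - 0.5)    # hidden unit 2: NOT x1 AND x2
    return step(h1 + h2 - 0.5)  # output unit: h1 OR h2


truth_table = [xor_net(x1, x2) for x1 in (0, 1) for x2 in (0, 1)]
```

No single threshold unit can produce this truth table, but one hidden layer is enough.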


The Universality Property
• A two-layer feed-forward neural network with step activation functions can implement any Boolean function, provided that the number of hidden neurons H is sufficiently large (McCulloch and Pitts, 1943).
• If the input variables are continuous in [0,1] and the activation function is the logistic sigmoid, it can be proven that any continuous decision boundary can be approximated arbitrarily closely by a two-layer Perceptron with a sufficient number H of hidden neurons (Cybenko, 1989).
MultiLayer Perceptrons
[Figure: inputs x1 … xn connected to hidden units 1 … H, which are connected to output units 1 … p with outputs y1 … yp]
Non-linear regression mapping
Output of a generic MLP neuron in layer l:
$$y_i = f_i(a_i) = f_i\left(\sum_{k=0}^{n(l-1)} w_{ik}\, o_k\right), \quad i = 1, \ldots, n(l), \quad k = 1, \ldots, n(l-1)$$
Two-layer MLP, only one output unit with linear activation function:
$$y_1 = b_1 = \sum_{k=0}^{n(1)} w_{1k}\, o_k = \sum_{k=0}^{n(1)} w_{1k}\, f_k\left(\sum_{j=0}^{n(0)} v_{kj} x_j\right) = \sum_{k=1}^{n(1)} w_{1k}\, f_k\left(\vec{v}_k^{\,T}\vec{x} + v_{k0}\right) + w_0$$
Back propagation (I)
(The Generalized Delta Rule )
Gradient Descent formula for a weight w_ik connecting units from two generic layers l and l-1 (i ∈ layer l, k ∈ layer l-1) after presentation of training pattern q:
$$\Delta w_{ik}^q = -\eta\,\frac{\partial E^q}{\partial w_{ik}}$$
Now the calculations should take the activation function into account:
$$\frac{\partial E^q}{\partial w_{ik}} = \frac{\partial E^q}{\partial a_i^q}\,\frac{\partial a_i^q}{\partial w_{ik}}$$
Back Propagation (II)
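$$\frac{\partial a_i^q}{\partial w_{ik}} = o_k^q, \qquad \delta_i^q = \frac{\partial E^q}{\partial a_i^q}$$
$$\Delta w_{ik}^q = -\eta\,\frac{\partial E^q}{\partial w_{ik}} = -\eta\,\delta_i^q\, o_k^q$$
For output units (i ∈ layer L), the generalized delta learning rule:
$$\delta_i^q = \frac{\partial E^q}{\partial a_i^q} = f'\left(a_i^q\right)\left(o_i^q - d_i^q\right)$$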
Multi-Layer Perceptron
• One or more hidden layers
• Where can we use the generalized delta rule?
• Where can we compute the error?
We do not know the desired answers of the hidden layer and therefore we cannot estimate the error function.
[Figure: input data, 1st hidden layer, 2nd hidden layer, output layer]
Back Propagation (III)
We do not know the desired answers of the hidden layer and therefore we cannot estimate the error function.
For hidden units (i ∈ layer l < L):
$$\delta_i^q = \frac{\partial E^q}{\partial a_i^q} = \sum_{j=1}^{n(l+1)} \frac{\partial E^q}{\partial a_j^q}\,\frac{\partial a_j^q}{\partial a_i^q} = \sum_{j=1}^{n(l+1)} \delta_j^q\,\frac{\partial a_j^q}{\partial a_i^q} = f'\left(a_i^q\right)\sum_{j=1}^{n(l+1)} \delta_j^q\, w_{ji}$$
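The output and hidden delta formulas can be put together in a short sketch for a network with one hidden layer. It assumes logistic activations in both layers (so f'(a) = o(1 - o)) and the error E = 1/2 Σ (o - d)^2; the function names, the weight layout (bias weight first in each row) and the numeric values are illustrative assumptions.

```python
import math


def logistic(a):
    return 1.0 / (1.0 + math.exp(-a))


def forward(x, V, W):
    """V: hidden-layer weights, W: output-layer weights; column 0 holds the bias."""
    xb = [1.0] + list(x)
    hidden = [logistic(sum(v * xi for v, xi in zip(row, xb))) for row in V]
    hb = [1.0] + hidden
    out = [logistic(sum(w * hi for w, hi in zip(row, hb))) for row in W]
    return xb, hb, out


def error(x, d, V, W):
    """E^q = 1/2 * sum_i (o_i - d_i)^2 for one pattern."""
    _, _, out = forward(x, V, W)
    return 0.5 * sum((o - t) ** 2 for o, t in zip(out, d))


def backprop(x, d, V, W):
    """Return dE/dV and dE/dW for one pattern using the delta formulas."""
    xb, hb, out = forward(x, V, W)
    # Output units: delta_i = f'(a_i) (o_i - d_i)
    delta_out = [o * (1.0 - o) * (o - t) for o, t in zip(out, d)]
    # Hidden units: delta_k = f'(a_k) * sum_j delta_j w_jk (index 0 is the bias)
    delta_hid = [hb[k + 1] * (1.0 - hb[k + 1]) *
                 sum(delta_out[j] * W[j][k + 1] for j in range(len(W)))
                 for k in range(len(V))]
    grad_W = [[dj * h for h in hb] for dj in delta_out]
    grad_V = [[dk * xi for xi in xb] for dk in delta_hid]
    return grad_V, grad_W


# Hypothetical 2-2-1 network and one training pattern.
V = [[0.1, 0.4, -0.3], [-0.2, 0.25, 0.6]]
W = [[0.05, -0.7, 0.5]]
x, d = [1.0, 0.0], [1.0]
grad_V, grad_W = backprop(x, d, V, W)
```

A gradient step is then w ← w - η · grad for every weight; comparing `grad_V` and `grad_W` with finite differences of `error` is a simple way to check an implementation like this.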
Back Propagation
(forward phase)
[Figure: inputs x1 … xn are propagated through hidden units 1 … H to output units 1 … p, giving outputs y1 … yp]
Back Propagation
(backward phase)
[Figure: error terms δ1 … δp are propagated back from output units 1 … p through hidden units 1 … H]
Elements of Backpropagation
• The set of learning examples is usually shown to the algorithm several times (iterations → epochs), sometimes thousands of times
• The order in which examples are shown is randomly shuffled
• Stopping conditions
– Threshold for RMS (should be smaller than …)
– Max no. of iterations
– Classification evaluations
Tuning learning rate
• Too small – local minimum of error
• Too large – oscillations and inability to reach the global minimum
• Some solutions
– Slowly decreasing the rate with epochs (time)
[Figure: error E as a function of weight wi, with points Sp and Sk marked]
Learning Rate and Momentum Term
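[Figure: error curve E(W) over W with steps ∆W1, ∆W2, ∆W3 starting from W(q); W* with error E(W*) and Wmin are marked on the W axis]
$$\Delta w_{ik}^q = -\eta\,\frac{\partial E^q}{\partial w_{ik}} + \alpha\,\Delta w_{ik}^{q-1}$$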
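The momentum update can be sketched as a single function that keeps the previous weight change. The function name, the default values of eta and alpha, and the toy error E(w) = w²/2 (whose gradient is simply w) are assumptions for illustration.

```python
def momentum_step(w, grad, prev_dw, eta=0.1, alpha=0.5):
    """dw = -eta * dE/dw + alpha * previous dw; returns new weights and dw."""
    dw = [-eta * g + alpha * p for g, p in zip(grad, prev_dw)]
    w_new = [wi + di for wi, di in zip(w, dw)]
    return w_new, dw


# Toy example: E(w) = 0.5 * w^2, so the gradient equals w itself.
w, dw = [1.0], [0.0]
w, dw = momentum_step(w, grad=w, prev_dw=dw)  # first step: plain gradient step
w, dw = momentum_step(w, grad=w, prev_dw=dw)  # second step: gradient plus momentum
```

After the first step w is 0.9; in the second step the momentum term adds half of the previous change, so the weight moves further than plain gradient descent would.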
Different non-linearly separable problems and number of layers
Structure and types of decision regions:
• Single-layer: half plane bounded by a hyperplane
• Two-layer: convex open or closed regions
• Three-layer: arbitrary regions (complexity limited by the number of nodes)
[Figure: for each structure, example decision regions for classes A and B in three columns: Exclusive-OR problem, classes with meshed regions, most general region shapes]
Source: Neural Networks – An Introduction, Dr. Andrew Hunter
Over-fitting
Too large a number of parameters allows the network to memorize all the examples of the training set together with the associated noise, errors and inconsistencies.
[Figure: error E(u) versus training step u for the training set, validation set and test set, with E(u*) marked at training step u*]
Overtraining in ANNs
• Recall: Definition of Overfitting
– h’ worse than h on Dtrain, better on Dtest
• Overtraining: A Type of Overfitting
– Due to excessive iterations
– Avoidance: stopping criterion
(cross-validation: holdout, k-fold)
– Avoidance: weight decay
[Figure: error versus epochs (Example 1)]
Choosing the number of neurons
Network size
The universality property requires
a sufficient number of hidden neurons.

• Pruning algorithms
Start with a large network and gradually remove weights or
complete units that do not seem to be necessary
– Sensitivity methods
– Penalty-term methods

• Growing algorithms
Start from a small architecture and allow new units to be added
when necessary.
Neural Network as a Classifier
• Weaknesses
– Long training time
– Require a number of parameters typically best determined empirically, e.g., the network topology or "structure"
– Poor interpretability: difficult to interpret the symbolic meaning behind the learned weights and of "hidden units" in the network
• Strengths
– High tolerance to noisy data
– Ability to classify untrained patterns
– Well-suited for continuous-valued inputs and outputs
– Successful on a wide array of real-world data
– Algorithms are inherently parallel
– Techniques have recently been developed for the extraction of rules from
trained neural networks
Knowledge Extraction
– Global approach
A tree of symbolic rules is built to represent the whole network. Each
rule is then tested against the network behavior until most of training
space is covered.
Disadvantage: huge trees.

– Local approach
The original MLP is decomposed into a series of smaller usually
single layered sub-networks. The incoming weights form the
antecedent of a symbolic rule for each unit. Those rules are gradually
combined together to define a more general set of rules that describes
the network as a whole.
Disadvantage: Because of the distributed knowledge in an ANN
hidden units do not typically represent clear logic entities.
RBF networks
• This is becoming an increasingly popular neural network
with diverse applications and is probably the main rival to the
multi-layered perceptron
• Much of the inspiration for RBF networks has come from
traditional statistics and pattern classification techniques
(mainly local methods for non-parametric regression)
• These include function approximation, regularization
theory, density estimation and interpolation in the presence
of noise [Bishop, 1995]
• Cover's theorem on non-linear projections: in a new feature space, difficult decision boundaries may become linearly separable
Numerical approximation of functions
• Consider N data points characterized by m features
$$\{x_i \in R^m \mid i = 1, \ldots, N\}$$
• and corresponding N outputs (real values)
$$\{d_i \in R \mid i = 1, \ldots, N\}$$
• The aim is to find an unknown function (mapping)
$$f(x_i) = d_i \quad \forall i = 1, \ldots, N$$
• Complicated functions are constructed from simple building blocks (local approximations)
Function Approximation with
Radial Basis Functions
RBF Networks approximate functions using (radial) basis functions
as the building blocks.
On Exact Interpolation of
• RBFs have their origins in techniques for performing
exact function interpolation [Bishop, 1995]:
– Find a function h(x) such that
h(xn) = tn ∀n=1, ... N

• Radial Basis Function approach (Powell 1987):
– Use a set of N basis functions of the form φ(||x - xn||), one for each point, where φ(.) is some non-linear function.
– Output: h(x) = Σn wn φ(||x - xn||)
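Exact interpolation amounts to solving the linear system Φ w = t, where Φ holds the basis function values between all pairs of data points. The sketch below assumes one-dimensional inputs and a Gaussian basis function φ(r) = exp(-(r/σ)²); the helper names, the small Gaussian-elimination solver and the data values are illustrative assumptions.

```python
import math


def phi(r, sigma=1.0):
    """Gaussian radial function (assumed choice of basis function)."""
    return math.exp(-(r / sigma) ** 2)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x


def fit_exact_rbf(xs, ts, sigma=1.0):
    """One basis function per data point; weights solve Phi w = t."""
    Phi = [[phi(abs(xa - xb), sigma) for xb in xs] for xa in xs]
    return solve(Phi, ts)


def h(x, xs, weights, sigma=1.0):
    """h(x) = sum_n w_n * phi(|x - x_n|)."""
    return sum(w * phi(abs(x - xn), sigma) for w, xn in zip(weights, xs))


xs = [0.0, 1.0, 2.0, 3.0]  # data points (example values)
ts = [0.0, 0.8, 0.9, 0.1]  # targets (example values)
weights = fit_exact_rbf(xs, ts)
```

By construction h(xn) reproduces tn at every data point, which is also why exact interpolation fits any noise in the targets.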
Radial Basis Function Networks
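Goal: each hidden unit k should represent a cluster k in the input space, for example by containing its prototype x_k.
$$y_i(\vec{x}) = \sum_{k=1}^{H} w_{ik}\,\Phi_k(\vec{x}) + w_{i0}$$
$$\Phi_k(\vec{x}) = \Phi_k\left(\lVert\vec{x} - \vec{x}_k\rVert\right)$$
$$\Phi_k(\vec{x}) = e^{-\frac{\lVert\vec{x} - \vec{\mu}_k\rVert^2}{2\sigma_k^2}}$$
[Figure: RBF network with inputs x1 … xn, hidden units Φ1 … ΦH and output units 1 … p with outputs y1 … yp]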
Typical radial functions
Examples:
• Simple radial: h(r) = r = ||X - X_i||
• Inverse multiquadratic: h(r) = (σ^2 + r^2)^(-α), α > 0
• Multiquadratic: h(r) = (σ^2 + r^2)^β, 1 > β > 0
• Gauss: h(r) = e^(-(r/σ)^2)
• Thin plate splines: h(r) = (σ r)^2 ln(σ r)
RBFNs and MLPs

• Locality. In RBFNs only a small fraction of Φk is active


for each input vector => more efficient training algorithms
• Separation surfaces. MLP produces open separation
surfaces vs. RBFNs closed separation surfaces
• Approximation capability. The universality property
still holds for RBFNs if a sufficient number of Φk is given.
• Interpretability. RBFNs are easier to interpret than
MLPs. Φk can be interpreted as p(cluster k| x) and wik as
p(Ci|cluster k)
MLPs versus RBFs
• Classification
– MLPs separate classes via hyperplanes
– RBFs separate classes via hyperspheres
• Learning
– MLPs use distributed learning
– RBFs use localized learning
– RBFs train faster
• Structure
– MLPs have one or more hidden layers
– RBFs have only one layer
– RBFs require more hidden neurons => curse of dimensionality
[Figure: class boundaries in the (X1, X2) plane for an MLP and for an RBF network]
The hybrid learning strategy
1. Unsupervised training of the RBF parameters
– K-means clustering algorithm
– Mixtures of Gaussians
– Kohonen Competitive learning

2. Supervised training of the weights connecting the hidden and the output layer
– Back-Propagation
– Or special mathematical approaches for solving matrix equations
RBF units Unsupervised Training
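• K-means algorithm.
$$\vec{\mu}_k = \frac{1}{N_k}\sum_{q \in S_k}\vec{x}^q, \qquad \sigma = \frac{1}{H}\sum_{i=1}^{H}\lVert\vec{\mu}_i - \vec{\mu}_j\rVert$$
$$J = \sum_{k=1}^{H}\sum_{q \in S_k}\lVert\vec{x}^q - \vec{\mu}_k\rVert^2$$
• Mixtures of Gaussians.
$$o(\vec{x}) = \sum_{j=1}^{H}\alpha_j(\vec{x})\,\Phi_j(\vec{x}), \qquad l = \prod_{q=1}^{m} p\left(\vec{x}^q\right)$$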
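The K-means step for placing the RBF centers can be sketched as follows. Each center μ_k becomes the mean of the patterns assigned to it, and J is the summed squared distance to the assigned centers. The function names, the fixed number of iterations and the choice of initial centers are assumptions for this example.

```python
def sqdist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def kmeans(points, centers, iterations=20):
    """Alternate between assigning points to the nearest center and re-averaging."""
    centers = [list(c) for c in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            k = min(range(len(centers)), key=lambda j: sqdist(p, centers[j]))
            clusters[k].append(p)
        for k, members in enumerate(clusters):
            if members:  # keep the old center if a cluster is empty
                centers[k] = [sum(c) / len(members) for c in zip(*members)]
    return centers, clusters


def distortion(centers, clusters):
    """J = sum_k sum_{q in S_k} ||x^q - mu_k||^2."""
    return sum(sqdist(p, centers[k]) for k, members in enumerate(clusters) for p in members)


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
centers, clusters = kmeans(points, centers=[points[0], points[3]])
J = distortion(centers, clusters)
```

The resulting centers can then serve as the μ_k of the RBF units, with the widths σ set from the distances between centers.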
RBFNs Training Algorithms (I)
• Modified Back-Propagation.
The corresponding expressions of the partial
derivatives of the error function have to be
evaluated and included into the gradient descent
procedure.

• Orthogonal Least Squares Algorithm.
RBF units are introduced sequentially. At the first step each RBF is centered on one training pattern; the RBF unit with the smallest error is retained. The algorithm continues on the remaining training data.
RBF analysis of the sine function
• Following a lecture by Prof. A. Bartkowiak, University of Wrocław
RBF analysis of the sine function (2)
Tasks for ANN
• Pattern classification
• Function approximation
• Time series and
forecasting
• Clustering
• Multidimensional
Projections
• Association memory
• Content addressed
memory
• Control strategies
• ..
Unsupervised ANN Learning
• Unsupervised Learning
If the desired answers are not available, not even for a subset
of data to use as training set, we use unsupervised learning.

• Similarity and Correlation


The network should organize the training data into clusters on
the basis of similarity and correlation criteria.

• Redundancy
This can happen only if there is redundancy in the training data

• Hebbian Learning and Competitive Learning


Standard Competitive Learning
(winner-take-all)
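[Figure: inputs x1 … xn fully connected to output units 1 … p with outputs y1 … yp; weights w_ij ≥ 0]
$$a_i = \sum_{j=1}^{n} w_{ij} x_j = \vec{w}_i^{\,T}\vec{x}$$
$$y_i = \begin{cases} 1 & \text{if } \vec{w}_i^{\,T}\vec{x} = \max_{k=1,\ldots,p}\left(\vec{w}_k^{\,T}\vec{x}\right) \\ 0 & \text{otherwise} \end{cases}$$
If ||w_i|| = 1:
$$y_i = \begin{cases} 1 & \text{if } \lVert\vec{w}_i - \vec{x}\rVert = \min_{k=1,\ldots,p}\lVert\vec{w}_k - \vec{x}\rVert \\ 0 & \text{otherwise} \end{cases}$$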
SCL: Training algorithm
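Goal:
$$\left(\vec{w}_i^q(t)\right)^T\vec{x}^q \le \left(\vec{w}_i^q(t+1)\right)^T\vec{x}^q = \left(\vec{w}_i^q(t) + \Delta\vec{w}_i^q(t)\right)^T\vec{x}^q$$
$$\Delta\vec{w}_i^q(t) = \begin{cases} \eta\,\vec{x}^q - \eta\,\vec{w}_i^q(t) & \text{if } \left(\vec{w}_i^q(t)\right)^T\vec{x}^q = \max_k\left(\left(\vec{w}_k^q(t)\right)^T\vec{x}^q\right) \\ 0 & \text{otherwise} \end{cases}$$
η > 0 (usually 0.1 < η < 0.7)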
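One step of standard competitive learning can be sketched as: find the unit with the largest activation w^T x, then move only that unit's weight vector towards the input by η(x - w). The function names, the weight values and the input are assumptions chosen for the example.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def winner(weights, x):
    """Index of the unit with the largest activation w_i^T x."""
    return max(range(len(weights)), key=lambda i: dot(weights[i], x))


def scl_update(weights, x, eta=0.5):
    """Winner-take-all update: only the winning unit moves towards x."""
    i = winner(weights, x)
    weights[i] = [w + eta * (xi - w) for w, xi in zip(weights[i], x)]
    return i


weights = [[1.0, 0.0], [0.0, 1.0]]  # two units with unit-length weight vectors
x = [0.8, 0.6]
before = dot(weights[0], x)
i = scl_update(weights, x)
after = dot(weights[i], x)
```

Here unit 0 wins, its weight vector moves to about (0.9, 0.3), and its activation for the same input grows, which is the stated goal of the update.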


SCL Training algorithm
[Figure: competitive network with inputs x1, x2, x3 and outputs y1, y2, y3; positions of the weight vectors w1, w2, w3 before training and after training]
Learning Vector Quantization (LVQ)

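LVQ is the supervised extension of the winner-take-all learning algorithm.
$$\Delta\vec{w}_i^q(t) = \begin{cases} +\eta(t)\left(\vec{x}^q - \vec{w}_i^q(t)\right) & \text{if the class of unit } i^q \text{ is correct} \\ -\eta(t)\left(\vec{x}^q - \vec{w}_i^q(t)\right) & \text{if the class of unit } i^q \text{ is incorrect} \\ 0 & \text{if } i^q \text{ is not a winner} \end{cases}$$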
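A minimal sketch of the basic LVQ step: the winner is the prototype closest to the input; it moves towards the input if its class label matches and away from it otherwise. The function name, the labels and the numeric values are assumptions for illustration, and a constant η is used instead of a decreasing η(t).

```python
def lvq_update(prototypes, labels, x, x_class, eta=0.5):
    """Move the nearest prototype towards x (correct class) or away from x (wrong class)."""
    i = min(range(len(prototypes)),
            key=lambda k: sum((xi - wi) ** 2 for xi, wi in zip(x, prototypes[k])))
    sign = 1.0 if labels[i] == x_class else -1.0
    prototypes[i] = [w + sign * eta * (xi - w) for w, xi in zip(prototypes[i], x)]
    return i


prototypes = [[0.0, 0.0], [1.0, 1.0]]
labels = ["A", "B"]

first = lvq_update(prototypes, labels, [0.2, 0.0], "A")   # winner has the correct class
after_attract = prototypes[0][:]
second = lvq_update(prototypes, labels, [0.2, 0.0], "B")  # winner has the wrong class
after_repel = prototypes[0][:]
```

In the first call the winning prototype moves halfway to the input; in the second call the same prototype is pushed away because its label does not match.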
Improved LVQ
The class of the input vector q is different from the class represented by winner unit i, but it is the same as that of the close unit j:
$$\Delta\vec{w}_i^q(t) = -\eta(t)\left(\vec{x}^q - \vec{w}_i^q(t)\right)$$
$$\Delta\vec{w}_j^q(t) = +\eta(t)\left(\vec{x}^q - \vec{w}_j^q(t)\right)$$
$$\Delta\vec{w}_k(t) = 0, \quad k \ne i, j$$
The class of the input vector q is the same as that of winner unit i and close unit j:
$$\Delta\vec{w}_h^q(t) = +\varepsilon\,\eta(t)\left(\vec{x}^q - \vec{w}_h^q(t)\right), \quad h = i, j$$
$$\Delta\vec{w}_k(t) = 0, \quad k \ne i, j$$
Kohonen Self-Organizing Maps
• Architecture:
– Kohonen maps consist of a two-dimensional array of neurons, fully connected to the inputs, with no lateral connections, arranged on a square or hexagonal lattice
• Learning algorithm:
– follows the winner-take-all strategy
– forces close neurons to fire for similar inputs (Self-
Organizing Maps)
• Properties:
– The topology of the input space is preserved
Self organizing maps
• The purpose of SOM is to map a multidimensional input space onto a topology-preserving map of neurons
– Preserve the topology so that neighboring neurons respond to « similar » input patterns
– The topological structure is often a 2 or 3 dimensional space
– The distance and proximity relationships (i.e., topology) are preserved as much as possible
• Similar to specific clustering: cluster centers tend to lie in a low-dimensional manifold in the feature space
[Figure: N-dimensional data space with data points (x) and neuron weight positions (o); the weights of a 2-D grid of neurons point to locations in the N-D space]
• The activation of the neuron is spread in its direct neighborhood => neighbors become sensitive to the same input patterns
• Block distance
• The size of the neighborhood is initially large but reduces over time => specialization of the network
[Figure: visualisation of the influence of different patterns on neuron outputs, showing the first and 2nd neighborhoods]
SOM Learning Algorithm
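Winner-take-all learning rule:
$$\Delta\vec{w}_k^q(t) = +\eta(t)\,\Lambda\left(k, i^q, t\right)\left(\vec{x}^q - \vec{w}_k^q(t)\right) \quad \text{for all units } k$$
Neighborhood function:
$$\Lambda\left(k, i^q, t\right) = \exp\left(-\frac{\lVert\vec{r}_k - \vec{r}_{i^q}\rVert^2}{2\sigma(t)^2}\right)$$
Quantization error:
$$Q = \frac{1}{m}\sum_{q=1}^{m}\lVert\vec{x}^q - \vec{w}_i^q\rVert^2$$
Average distortion:
$$D = \frac{1}{m}\sum_{q=1}^{m}\Lambda\left(i^q, i^q, t\right)\lVert\vec{x}^q - \vec{w}_i^q\rVert^2$$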
SOM algorithm
X^T = (X_1, X_2, …, X_d), samples from feature space.
Create a grid with nodes i = 1 … K in 1D, 2D or 3D, each node with a d-dimensional vector W^(i)T = (W_1^(i), W_2^(i), …, W_d^(i)), where W^(i) = W^(i)(t) changes with t (discrete time).
1. Initialize: random small W^(i)(0) for all i = 1 … K. Define the parameters of the neighborhood function h(|r_i - r_c|/σ(t), t)
2. Iterate: select randomly an input vector X
3. Calculate distances d(X, W^(i)), find the winner node W^(c) most similar (closest) to X
4. Update weights of all neurons in the neighborhood O(r_c)
5. Decrease the influence h_0(t) and shrink the neighborhood σ(t).
6. If in the last T steps all W^(i) changed less than ε, stop.
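The steps above can be sketched for a one-dimensional grid of nodes. Several simplifications are assumptions of this sketch: a Gaussian neighborhood over grid positions, linear decay of the learning rate and neighborhood width, a fixed number of steps instead of the stopping test in step 6, and illustrative parameter values and data.

```python
import math
import random


def train_som(data, n_nodes, steps=500, eta0=0.5, sigma0=2.0, seed=0):
    """Train a 1D self-organizing map; returns one weight vector per grid node."""
    rng = random.Random(seed)
    dim = len(data[0])
    # 1. Initialize with small random weights.
    W = [[rng.uniform(0.0, 0.1) for _ in range(dim)] for _ in range(n_nodes)]
    for t in range(steps):
        frac = 1.0 - t / steps
        eta = eta0 * frac                 # 5. decreasing influence
        sigma = max(sigma0 * frac, 0.5)   # 5. shrinking neighborhood width
        x = rng.choice(data)              # 2. random input vector
        # 3. Winner: node whose weight vector is closest to x.
        c = min(range(n_nodes),
                key=lambda i: sum((xi - wi) ** 2 for xi, wi in zip(x, W[i])))
        # 4. Update every node, weighted by its grid distance to the winner.
        for i in range(n_nodes):
            h = math.exp(-((i - c) ** 2) / (2.0 * sigma ** 2))
            W[i] = [wi + eta * h * (xi - wi) for wi, xi in zip(W[i], x)]
    return W


data = [(0.1, 0.1), (0.9, 0.1), (0.1, 0.9), (0.9, 0.9), (0.5, 0.5)]
W = train_som(data, n_nodes=5)
```

Since every update moves a weight vector part of the way towards a data point, the trained weights stay within the range spanned by the data and the initial weights.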
Where to use SOM
• Natural language processing: linguistic analysis, parsing, learning
languages, hyphenation patterns.
• Optimization: configuration of telephone connections, VLSI
design, time series prediction, scheduling algorithms.
• Signal processing: adaptive filters, real-time signal analysis,
radar, sonar seismic, USG, EKG, EEG and other medical signals
...
• Image recognition and processing: segmentation, object
recognition, texture recognition ...
• Content-based retrieval: examples of WebSOM, Cartia,
VisierPicSom – similarity based image retrieval.

• More on SOM – see earlier lecture on clustering


Software Tools
• Commercial products, e.g.
– Matlab Toolbox
– Statistica Neural Networks
– Peltarion Synapse
– NeuroXL
– …
• Many others
– Stuttgart Neural Simulator
– Limited options in WEKA, RapidMiner
– Many university projects, e.g. NuClass7 Arlington US
– …
SSN (Univ. Stuttgart)
Components in Process
Constructing and Learning ANN in Synapse
• German credit data (UCI repository) – prediction of paying loans by bank
customers / 700 good decisions and 300 bad ones
Hardware
• Usually more costly
• Specialized electronic devices
• Need for a real, popular application
• However, FPGA implementations?
Applications
• Aerospace
– High performance aircraft autopilots, flight path simulations, aircraft
control systems, autopilot enhancements, aircraft component
simulations, aircraft component fault detectors
• Automotive
– Automobile automatic guidance systems, warranty activity analyzers
• Banking
– Check and other document readers, credit application evaluators
• Defense
– Weapon steering, target tracking, object discrimination, facial
recognition, new kinds of sensors, sonar, radar and image signal
processing including data compression, feature extraction and noise
suppression, signal/image identification
• Electronics
– Code sequence prediction, integrated circuit chip layout, process
control, chip failure analysis, machine vision, voice synthesis, nonlinear
modeling
Applications
• Financial
– Real estate appraisal, loan advisor, mortgage screening, corporate bond
rating, credit line use analysis, portfolio trading program, corporate
financial analysis, currency price prediction
• Manufacturing
– Manufacturing process control, product design and analysis, process
and machine diagnosis, real-time particle identification, visual quality
inspection systems, beer testing, welding quality analysis, paper quality
prediction, computer chip quality analysis, analysis of grinding
operations, chemical product design analysis, machine maintenance
analysis, project bidding, planning and management, dynamic
modeling of chemical process systems
• Medical
– Breast cancer cell analysis, EEG and ECG analysis, prosthesis design,
optimization of transplant times, hospital expense reduction, hospital
quality improvement, emergency room test advisement
Applications
• Robotics
– Trajectory control, forklift robot, manipulator controllers, vision
systems
• Speech
– Speech recognition, speech compression, vowel classification,
text to speech synthesis
• Securities
– Market analysis, automatic bond rating, stock trading advisory
systems
• Telecommunications
– Image and data compression, automated information services,
real-time translation of spoken language, customer payment
processing systems
• Transportation
– Truck brake diagnosis systems, vehicle scheduling, routing
systems
Conclusions
• ANNs are roughly based on the simulation of biological
nervous systems
• An equivalence can be established between many ANN
paradigms and statistical analysis techniques
• Perceptron as a non-linear regression function
• The auto-associator projects input data onto a PC space
• RBFNs can be interpreted as statistical classifiers
• etc …
• ANNs drawbacks:
• Lack of criteria to define the optimal network size =>
genetic algorithms?
• Many parameters to tune
• Hard interpretation of the ANN analysis process => fuzzy
models?
• Computational time and cost requirements
References
• J. Hertz, A. Krogh, R.G. Palmer, “Introduction to the Theory of Neural Computation”, Addison-Wesley, 1991.
• C.M. Bishop, “Neural Networks for Pattern Recognition”, Oxford University Press, New York, 1995.
• S. Haykin, “Neural Networks: A Comprehensive Foundation”, IEEE Press, 1994.
• J.M. Zurada, R.J. Marks, C.J. Robinson, Eds., “Computational Intelligence: Imitating Life”, IEEE Press, New York, 1994.
• Many others
References in Polish Language
• Osowski Stanisław: Sieci Neuronowe do przetwarzania
informacji. Warszawa 2000
• J.M. Zurada, Baruch: Sztuczne sieci neuronowe, PWN.
• Several books by R.Tadeusiewicz
• Krawiec K., Stefanowski J.: Uczenie maszynowe i sieci
neuronowe, Wydawnictwo Politechniki Poznańskiej, Poznań
2004
• WWW teaching materials, e.g.:
• prof. Włodzisław Duch, UMK Toruń
• prof. Anna Bartkowiak UWr Wrocław
• My own slides for II part of the course Machine
Learning
• Many others
Any questions, remarks?
