Neural Networks
Jerzy Stefanowski
Institute of Computing Science
Lecture 13 in Data Mining
for M.Sc. Course of SE
version for 2010
Acknowledgments
• Slides are also based on ideas coming from presentations such as:
– Rosaria Silipo: Lecture on ANN. IDA Spring School 2001
– Prévotet Jean-Christophe (Paris VI): Tutorial on Neural
Networks
– Włodzisław Duch: Lectures on Computational Intelligence
– Few others
• and many of my notes for a course on Machine
Learning and Neural Networks (Polish Language
ISWD – see my personal web page for more slides)
Outline
• Introduction
– Inspirations
– The biological and artificial neurons
– Architecture of networks and basic learning rules
• Single Linear and Non-linear Perceptrons
– Delta learning rule
• MultiLayer Perceptrons
– MLPs and Back-Propagation
– Tuning parameters of BP
• Radial Basis Functions
– Architectures and learning algorithms
• Competitive Learning
– Competitive Learning, LVQ, Kohonen self-organizing maps.
• Applications and Software Tools
• Final Remarks
Introduction
• Some definitions
– “… a system composed of many simple processing
elements operating in parallel whose function is
determined by network structure, connection strengths,
and the processing performed at computing elements or
nodes.” - DARPA (1988)
– A neural network: A set of connected input/output units
where each connection has a weight associated with it
• During the learning phase, the network learns by
adjusting the weights so as to be able to predict the
correct class output of the input signals
Some properties
• Some points from definitions
– Many neuron-like threshold switching units
– Many weighted interconnections among units
– Highly parallel, distributed process
– Emphasis on tuning weights automatically
– …
When to Consider Neural Networks
• Input: High-Dimensional and Discrete or Real-Valued
– e.g., raw sensor input
– Conversion of symbolic data to quantitative (numerical) representations possible
• Output: Discrete or Real Vector-Valued
– e.g., low-level control policy for a robot actuator
– Similar qualitative/quantitative (symbolic/numerical) conversions may apply
• Data: Possibly Noisy
• Target Function: Unknown Form
• Result: Human Readability Less Important Than Performance
– Performance measured purely in terms of accuracy and efficiency
– Readability: ability to explain inferences made using model; similar criteria
• Examples
– Speech phoneme recognition
– Image classification
– Time signal prediction, Robotics, and many others
Autonomous Learning Vehicle in a Neural Net (ALVINN)
• Pomerleau et al.
– http://www.cs.cmu.edu/afs/cs/project/alv/member/www/projects/ALVINN.html
– Drives 70 mph on highways
[Figure: ALVINN weight maps — hidden-to-output unit weight map and input-to-hidden unit weight map (for one hidden unit), 30 x 32 inputs]
Image Recognition and Classification of Postal Codes
[Figure: biological neuron — dendrites, cell body, nucleus, axon, synapses]
• A neuron has
– A branching input (dendrites)
– A branching output (the axon)
• The information circulates from the dendrites to the axon via the
cell body
• Axon connects to dendrites via synapses
– Synapses vary in strength
– Synapses may be excitatory or inhibitory
The Action Potential
Human Brain
• The brain is a highly complex, non-linear, and parallel computer,
composed of some 10^11 neurons that are densely connected (~10^4
connections per neuron). We have just begun to understand how
the brain works...
• A neuron is much slower (10^-3 s) than a silicon logic
gate (10^-9 s); however, the massive interconnection between
neurons makes up for the comparatively slow rate.
– Complex perceptual decisions are arrived at quickly (within a
few hundred milliseconds)
• 100-Steps rule: Since individual neurons operate in a few
milliseconds, calculations do not involve more than about 100
serial steps and the information sent from one neuron to another is
very small (a few bits)
• Plasticity: Some of the neural structure of the brain is present at
birth, while other parts are developed through learning, especially
in early stages of life, to adapt to the environment (new inputs).
The Artificial Neuron
(McCulloch and Pitts, 1943)
[Figure: artificial neuron — inputs x_1 ... x_n with weights w_1 ... w_n, summation Σ producing activation a, threshold u, output y]

y(t+1) = f( \sum_{k=1}^{n} w_k x_k(t) - u ) = f( \sum_{k=0}^{n} w_k x_k(t) )

(the threshold u is absorbed as weight w_0 on a constant input x_0 = 1)
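As a minimal sketch of this threshold unit in Python (the function name and the AND example below are illustrative assumptions, not part of the lecture):

```python
import numpy as np

# A minimal sketch of a McCulloch-Pitts style threshold unit; the function
# name and the AND example are illustrative assumptions, not from the lecture.

def neuron_output(x, w, u):
    """Fire (+1) if the weighted sum of inputs reaches the threshold u, else -1."""
    a = np.dot(w, x)              # activation a = sum_k w_k * x_k
    return 1 if a >= u else -1

# Example: two inputs with unit weights and threshold 1.5 realize logical AND
x = np.array([1.0, 1.0])
w = np.array([1.0, 1.0])
print(neuron_output(x, w, u=1.5))  # +1 only when both inputs are active
```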
Activation Functions
• Step function:  f(a) = +1 if a ≥ u,  f(a) = -1 if a < u
• Linear (piecewise) function:  f(a) = +1 if a ≥ u;  f(a) = a if -u ≤ a < u;  f(a) = -1 if a < -u
• Logistic sigmoid:  f(a) = 1 / (1 + e^{-ha})
• Gaussian:  f(a) = e^{-(a - u)^2 / (2σ^2)}
Activation functions
[Plots of the activation functions]
• Linear:  y = x
• Step function
• Sigmoidal (logistic):  y = 1 / (1 + exp(-βx))
• Hyperbolic tangent:  y = (exp(x) - exp(-x)) / (exp(x) + exp(-x))
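A small hedged sketch of these activation functions in Python (parameter names such as `beta`, `u` and `sigma` are assumptions for illustration):

```python
import numpy as np

# Hedged sketches of the activation functions above; the parameter names
# `beta`, `u` and `sigma` are assumptions for illustration.

def step(a, u=0.0):
    return np.where(a >= u, 1.0, -1.0)

def logistic(a, beta=1.0):
    return 1.0 / (1.0 + np.exp(-beta * a))      # y = 1 / (1 + exp(-βx))

def tanh_act(a):
    return (np.exp(a) - np.exp(-a)) / (np.exp(a) + np.exp(-a))

def gaussian(a, u=0.0, sigma=1.0):
    return np.exp(-(a - u) ** 2 / (2.0 * sigma ** 2))

a = np.linspace(-5.0, 5.0, 5)
print(step(a), logistic(a), tanh_act(a), gaussian(a), sep="\n")
```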
Network topologies
Feed Forward Neural Networks
[Figure: feed-forward network — input layer x_1, x_2, ..., x_n, 1st hidden layer, 2nd hidden layer, output layer]
• The information is propagated from the inputs to the outputs
• Computation of N_o non-linear functions of the n input variables by composition of N_c algebraic functions
• Time plays no role (NO cycles between outputs and inputs)
Network topologies
Recurrent Neural Networks
[Figure: recurrent network over inputs x_1, x_2 with delayed feedback connections]
• Can have arbitrary topologies
• Can model systems with internal states (dynamic ones)
• Delays are associated to a specific weight
• Training is more difficult
• Performance may be problematic
– Stable outputs may be more difficult to evaluate
– Unexpected behavior (oscillation, chaos, ...)
Learning neural networks
• The procedure that consists in estimating the parameters of
neurons (usually weights) so that the whole network can
perform a specific task
[Figure: classical conditioning — unconditioned stimulus (US), unconditioned response (UR), conditioned stimulus (CS)]
Supervised Learning of Neurons
Let us suppose that a sufficiently large set of examples (training
set) is available.
Supervised learning:
– The network answer to each input pattern is directly
compared with the desired answer and a feedback is given to
the network to correct possible errors
[Figure: supervised learning loop — input x passes through the weight matrix to produce output y; the error y - d against the required output d is fed back to adjust the weights]
Perceptron
[Figure: single-layer perceptron — inputs x_1 ... x_n fully connected by weights w_ik to output units 1 ... p with outputs y_1 ... y_p]

y_i(t+1) = f( \sum_{k=0}^{n} w_{ik} x_k(t) ),   i = 1, 2, ..., p
What a Single Perceptron Does
• Regression:  y = w x + w_0
• Classification:  y = 1 if (w x + w_0 > 0)
[Figure: a linear unit for regression and a thresholded unit for classification; the bias w_0 enters through the constant input x_0 = +1]

y = sigmoid(o) = 1 / (1 + exp(-w^T x))
Perceptron
• Rosenblatt (1962)
• Linear separation
• Inputs: vector of real values
• Outputs: 1 or -1
[Figure: two classes of + points in the (x_1, x_2) plane separated by the line w_0 + w_1 x_1 + w_2 x_2 = 0, with y = +1 on one side and y = -1 on the other]

o = w_0 + w_1 x_1 + w_2 x_2,   y = f(o)
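The following is a hedged Python sketch of the classic perceptron learning rule on linearly separable data (the function name, learning rate, and toy data are illustrative assumptions):

```python
import numpy as np

# A sketch of the classic perceptron learning rule under the usual
# assumptions (labels in {-1, +1}, bias folded in as weight w_0 on x_0 = 1).

def train_perceptron(X, d, eta=0.1, epochs=100):
    X = np.hstack([np.ones((len(X), 1)), X])    # prepend x_0 = 1 for the bias
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        errors = 0
        for x, target in zip(X, d):
            y = 1 if w @ x > 0 else -1
            if y != target:                      # update only on mistakes
                w += eta * target * x
                errors += 1
        if errors == 0:                          # converged (linearly separable)
            break
    return w

X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
d = np.array([1, 1, -1, -1])
print(train_perceptron(X, d))
```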
Error Function
• Training set:  T = { (x^q, d^q) | q = 1, 2, ..., m }
• Error measure:  E(W) = f(o_i^q - d_i^q)
[Figure: error surface E(W) with minimum E(W*) at the optimal weights W*]
Gradient Descent algorithm
• Simple Gradient Descent Algorithm
– Applicable to different types of learning (with proper representation)
• Definition: Gradient
∇E[w] ≡ [ ∂E/∂w_0, ∂E/∂w_1, ..., ∂E/∂w_n ]
Gradient Descent algorithm
The RMS error function:

E(W) = (1/2) \sum_{q=1}^{m} \sum_{i=1}^{p} (o_i^q - d_i^q)^2 = \sum_{q=1}^{m} E^q(W)

∆w_{ik}(t) = -η ∂E(t)/∂w_{ik} = -η \sum_{q=1}^{m} ∂E^q(t)/∂w_{ik} = \sum_{q=1}^{m} ∆w_{ik}^q(t)
Delta Learning Rule (Widrow, Hoff)

∂E^q/∂w_{ik} = ∂[ (1/2) \sum_{i=1}^{p} (o_i^q - d_i^q)^2 ] / ∂w_{ik} = δ_i^q x_k^q

∆w_{ik}^q = -η ∂E^q/∂w_{ik} = -η δ_i^q x_k^q,   with δ_i^q = o_i^q - d_i^q
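A minimal Python sketch of the delta rule for one linear unit, assuming per-pattern updates and δ_i^q = o_i^q - d_i^q as above (names and the toy regression data are illustrative):

```python
import numpy as np

# Sketch of the Widrow-Hoff (delta) rule for a single linear unit;
# per-pattern (stochastic) updates, bias folded into the weights.

def train_delta(X, d, eta=0.01, epochs=1000):
    X = np.hstack([np.ones((len(X), 1)), X])     # x_0 = 1 carries the bias
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x, target in zip(X, d):
            o = w @ x                             # linear output
            delta = o - target                    # δ_i^q = o_i^q - d_i^q
            w -= eta * delta * x                  # Δw = -η δ x
    return w

X = np.array([[0.0], [1.0], [2.0], [3.0]])
d = np.array([1.0, 3.0, 5.0, 7.0])                # target line y = 2x + 1
print(train_delta(X, d))                          # ≈ [1.0, 2.0]
```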
Learning Rate (η)
∆w_1 = η_1 δ x  with η_1 too small
∆w_2 = η_2 δ x  with η_2 the right size
∆w_3 = η_3 δ x  with η_3 too big
[Figure: the three step sizes ∆W_1, ∆W_2, ∆W_3 on the error surface E(W), descending from W(q) toward the minimum E(W*)]
• The standard perceptron learning algorithm converges if the examples are linearly separable → see the figure
[Figure: a linearly separable set of + and - points in the (x_1, x_2) plane, split by the line w_1 x_1 + w_2 x_2 - w_0 = 0]
• Consider an example of a simple non-separable problem: no single line can split the two classes
[Figure: a single unit y with weights w_1, w_2 over inputs x_1, x_2, and a 0/1 class configuration that no line can separate]
Need for constructing MLP
[Figure: multi-layer network — inputs x_1 ... x_n, hidden units 1 ... H, output units 1 ... p]

o_i(t) = 1 / (1 + e^{-(net_i(t) - θ)/τ})
Non-linear regression mapping
Output of a generic MLP neuron in layer l:

y_i = f_i(a_i) = f_i( \sum_{k=0}^{n(l-1)} w_{ik} o_k ),   i = 1, ..., n(l),   k = 0, ..., n(l-1)

A one-hidden-layer MLP realizes the non-linear regression mapping

o(x) = \sum_{k=1}^{n(1)} w_{1k} f_k( v_k^T x + v_{k0} ) + w_0
Back propagation (I)
(The Generalized Delta Rule )
Gradient descent formula for a weight w_{ik} connecting units from two generic layers l and l-1 (i ∈ layer l, k ∈ layer l-1), after presentation of training pattern q:

∆w_{ik}^q = -η ∂E^q/∂w_{ik}

∂E^q/∂w_{ik} = (∂E^q/∂a_i^q) (∂a_i^q/∂w_{ik})
Back Propagation (II)
∂a_i^q/∂w_{ik} = o_k^q,    δ_i^q = ∂E^q/∂a_i^q

∆w_{ik}^q = -η ∂E^q/∂w_{ik} = -η δ_i^q o_k^q

For an output unit:

δ_i^q = ∂E^q/∂a_i^q = f'(a_i^q) (o_i^q - d_i^q)
Multi-Layer Perceptron
• One or more hidden layers
[Figure: input data feeding the 1st and 2nd hidden layers, then the output layer]
• Where can we use the generalized delta rule?
• Where can we compute the error?
We do not know the desired answers of the hidden layer and therefore we cannot estimate the error function directly.
Back Propagation (III)
We do not know the desired answers of the hidden layer and therefore we cannot estimate its error directly; instead, the error of a hidden unit in layer l is obtained from the errors of the units in layer l+1:
δ_i^q = ∂E^q/∂a_i^q = \sum_{j=1}^{n(l+1)} (∂E^q/∂a_j^q) (∂a_j^q/∂a_i^q) =
     = \sum_{j=1}^{n(l+1)} δ_j^q (∂a_j^q/∂a_i^q) =
     = f'(a_i^q) \sum_{j=1}^{n(l+1)} δ_j^q w_{ji}
Back Propagation
(forward phase)
[Figure: forward phase — inputs x_1 ... x_n propagate through hidden units 1 ... H to outputs y_1 ... y_p]
Back Propagation
(backward phase)
[Figure: backward phase — output errors δ_1 ... δ_p propagate back from units 1 ... p through hidden units 1 ... H toward the inputs x_1 ... x_n]
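Putting the forward and backward phases together, here is a compact, hedged NumPy sketch of back-propagation for a one-hidden-layer MLP with sigmoid units (shapes, names, and the XOR toy task are illustrative assumptions):

```python
import numpy as np

# A compact, hedged sketch of back-propagation for a one-hidden-layer MLP
# with sigmoid units and squared error; shapes, names and the XOR task are
# illustrative assumptions (real runs may need several random restarts).

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_mlp(X, D, n_hidden=4, eta=0.5, epochs=5000, seed=0):
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0.0, 0.5, (n_hidden, X.shape[1] + 1))  # hidden weights (+bias)
    W2 = rng.normal(0.0, 0.5, (D.shape[1], n_hidden + 1))  # output weights (+bias)
    for _ in range(epochs):
        for x, d in zip(X, D):
            x1 = np.append(1.0, x)                   # forward phase
            h = sigmoid(W1 @ x1)
            h1 = np.append(1.0, h)
            o = sigmoid(W2 @ h1)
            delta_o = (o - d) * o * (1.0 - o)        # output deltas: f'(a)(o - d)
            delta_h = h * (1.0 - h) * (W2[:, 1:].T @ delta_o)  # backward phase
            W2 -= eta * np.outer(delta_o, h1)        # Δw = -η δ o
            W1 -= eta * np.outer(delta_h, x1)
    return W1, W2

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # XOR inputs
D = np.array([[0], [1], [1], [0]], dtype=float)
W1, W2 = train_mlp(X, D)
for x in X:
    print(x, sigmoid(W2 @ np.append(1.0, sigmoid(W1 @ np.append(1.0, x)))))
```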
Elements of Backpropagation
• The set of learning examples is usually shown to
the algorithm several times (iterations → epochs),
sometimes thousands of times
• The order in which examples are shown is randomly
shuffled
• Stopping conditions
– Threshold for RMS (should be smaller than …)
– Max no. of iterations
– Classification evaluations
Tuning learning rate
• Too small – gets stuck in a local minimum of the error
• Too large – oscillations, unable to settle into the global minimum
• Some solutions
– Slowly decreasing the rate with epochs (time)
Learning Rate and Momentum Term
[Figure: with a momentum term, the successive steps ∆W_1, ∆W_2, ∆W_3 on the error surface E(W) are smoothed on the way from W(q) toward the minimum]

∆w_{ik}^q = -η ∂E^q/∂w_{ik} + α ∆w_{ik}^{q-1}
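A one-line version of this update in Python (a sketch; `grad` stands for ∂E^q/∂w and is assumed to be supplied by the caller):

```python
import numpy as np

# Minimal sketch of a gradient-descent step with a momentum term;
# `grad` stands for ∂E/∂w and is assumed to be supplied by the caller.

def momentum_step(w, grad, velocity, eta=0.1, alpha=0.9):
    velocity = -eta * grad + alpha * velocity   # Δw^q = -η ∂E/∂w + α Δw^{q-1}
    return w + velocity, velocity

w = np.zeros(3)
velocity = np.zeros(3)
grad = np.array([0.2, -0.1, 0.05])
w, velocity = momentum_step(w, grad, velocity)
```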
Different non-linearly separable problems and the number of layers

Structure | Types of Decision Regions | Exclusive-OR Problem / Classes with Meshed Regions / Most General Region Shapes
Single-Layer | Half plane bounded by a hyperplane | [figures: regions A and B]
Three-Layer | Arbitrary (complexity limited by the number of nodes) | [figures: regions A and B]

Source: Neural Networks – An Introduction, Dr. Andrew Hunter
Over-fitting
Too large a number of parameters can memorize all the examples
of the training set together with the associated noise, errors, and
inconsistencies.
[Figure: error E during training — the test-set error eventually rises while the training error keeps decreasing]
• Pruning algorithms
Start with a large network and gradually remove weights or
complete units that do not seem to be necessary
– Sensitivity methods
– Penalty-term methods
• Growing algorithms
Start from a small architecture and allow new units to be added
when necessary.
Neural Network as a Classifier
• Weakness
– Long training time
– Require a number of parameters typically best determined empirically,
e.g., the network topology or "structure"
– Poor interpretability: difficult to interpret the symbolic meaning behind
the learned weights and the "hidden units" in the network
• Strength
– High tolerance to noisy data
– Ability to classify untrained patterns
– Well-suited for continuous-valued inputs and outputs
– Successful on a wide array of real-world data
– Algorithms are inherently parallel
– Techniques have recently been developed for the extraction of rules from
trained neural networks
Knowledge Extraction
– Global approach
A tree of symbolic rules is built to represent the whole network. Each
rule is then tested against the network behavior until most of training
space is covered.
Disadvantage: huge trees.
– Local approach
The original MLP is decomposed into a series of smaller, usually
single-layered, sub-networks. The incoming weights form the
antecedent of a symbolic rule for each unit. Those rules are gradually
combined to define a more general set of rules that describes
the network as a whole.
Disadvantage: because of the distributed knowledge in an ANN,
hidden units do not typically represent clear logic entities.
RBF networks
• This is becoming an increasingly popular neural network
with diverse applications and is probably the main rival to the
multi-layered perceptron
• Much of the inspiration for RBF networks has come from
traditional statistics and pattern classification techniques
(mainly local methods for non-parametric regression)
• These include function approximation, regularization
theory, density estimation and interpolation in the presence
of noise [Bishop, 1995]
• Cover's theorem: a non-linear projection into a new feature
space where difficult decision boundaries may become linearly
separable
Numerical approximation of functions
• Consider N data points characterized by m features
{ x_i ∈ R^m | i = 1, ..., N }
• and corresponding N outputs (real values)
{ d_i ∈ R | i = 1, ..., N }
• The aim is to find an unknown function (mapping) such that
f(x_i) = d_i   ∀ i = 1, ..., N
• Complicated functions are constructed from simple
building blocks (local approximations)
Function Approximation with
Radial Basis Functions
RBF Networks approximate functions using (radial) basis functions
as the building blocks.
On Exact Interpolation
• RBFs have their origins in techniques for performing
exact function interpolation [Bishop, 1995]:
– Find a function h(x) such that h(x^n) = t^n, ∀ n = 1, ..., N
[Figure: RBF network — inputs x_1 ... x_n feed basis units Φ_1 ... Φ_H, whose outputs are linearly combined into y_1 ... y_p]

y_i(x) = \sum_{k=1}^{H} w_{ik} Φ_k(x) + w_{i0}

Φ_k(x) = Φ_k( ||x - x_k|| )

Φ_k(x) = exp( - ||x - μ_k||^2 / (2σ_k^2) )
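A hedged NumPy sketch of the RBF forward pass with Gaussian basis functions (centers, widths, and output weights are assumed to be given; all names are illustrative):

```python
import numpy as np

# Sketch of an RBF network forward pass with Gaussian basis functions;
# centers `mu`, widths `sigma` and output weights `W` are assumed given
# (e.g. from clustering and least squares, as discussed below).

def rbf_forward(x, mu, sigma, W, w0):
    # one Gaussian response per basis unit: Φ_k(x) = exp(-||x-μ_k||²/(2σ_k²))
    phi = np.exp(-np.sum((x - mu) ** 2, axis=1) / (2.0 * sigma ** 2))
    return W @ phi + w0                      # linear combination of the Φ_k

mu = np.array([[0.0, 0.0], [1.0, 1.0]])      # H = 2 centers in 2-D
sigma = np.array([0.5, 0.5])
W = np.array([[1.0, -1.0]])                  # one output unit
print(rbf_forward(np.array([0.9, 1.1]), mu, sigma, W, w0=0.1))
```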
Typical radial functions
Examples:
• Simple radial:  h(r) = r = ||X - X_i||
• Inverse multiquadratic:  h(r) = (σ^2 + r^2)^{-α},   α > 0
• Multiquadratic:  h(r) = (σ^2 + r^2)^{β},   0 < β < 1
• Gaussian:  h(r) = e^{-(r/σ)^2}
• Thin-plate splines:  h(r) = (σr)^2 ln(σr)
RBFNs and MLPs
• Hidden-layer centers μ_k can be placed by clustering, minimizing

J = \sum_{k=1}^{H} \sum_{q ∈ S_k} ||x^q - μ_k||^2

• Mixtures of Gaussians: maximize the likelihood  l = \prod_{q=1}^{m} p(x^q)

• The network output is a weighted combination of the basis functions:

o(x) = \sum_{j=1}^{H} α_j(x) Φ_j(x)
RBFNs Training Algorithms (I)
• Modified Back-Propagation.
The corresponding expressions of the partial
derivatives of the error function have to be
evaluated and included into the gradient descent
procedure.
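Besides modified back-propagation, a common two-stage scheme trains the centers by clustering and the output weights by least squares; the following NumPy sketch illustrates that alternative under stated assumptions (it is not the lecture's specific algorithm, and all names are illustrative):

```python
import numpy as np

# Hedged sketch of a two-stage RBFN training scheme: pick centers μ_k by
# simple k-means (minimizing the clustering criterion J above), then solve
# the linear output weights by least squares.

def kmeans(X, H, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), H, replace=False)]
    for _ in range(iters):
        assign = np.argmin(((X[:, None] - mu[None]) ** 2).sum(-1), axis=1)
        for k in range(H):
            if np.any(assign == k):
                mu[k] = X[assign == k].mean(axis=0)
    return mu

def train_rbfn(X, d, H=5, sigma=1.0):
    mu = kmeans(X, H)
    Phi = np.exp(-((X[:, None] - mu[None]) ** 2).sum(-1) / (2 * sigma ** 2))
    Phi = np.hstack([np.ones((len(X), 1)), Phi])     # bias column w_0
    w, *_ = np.linalg.lstsq(Phi, d, rcond=None)       # least-squares weights
    return mu, w

X = np.random.default_rng(1).uniform(-2, 2, (50, 1))
mu, w = train_rbfn(X, np.sin(X[:, 0]))
```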
• Redundancy
This can happen only if there is redundancy in the training data.

Winner-take-all output of a competitive unit (assuming normalized weights, ||w_i|| = 1):

y_i = 1  if  ||w_i - x|| = min_{k=1,...,p} ||w_k - x||,   y_i = 0 otherwise
SCL: Training algorithm
Goal:

w_i(t)^T x^q ≤ w_i(t+1)^T x^q = ( w_i(t) + ∆w_i(t) )^T x^q

∆w_i^q(t) = η x^q - η w_i(t)   if  w_i(t)^T x^q = max_k ( w_k(t)^T x^q )
∆w_i^q(t) = 0   otherwise

[Figure: weight vectors w_1, w_2, w_3 of the competitive units over inputs x_1, x_2, x_3, before and after training — the winning weight vectors move toward the inputs]
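A minimal Python sketch of this winner-take-all update (the normalization, unit count, and data are illustrative assumptions):

```python
import numpy as np

# Sketch of simple competitive learning (winner-take-all): only the unit
# whose weight vector responds most strongly to the pattern is updated.

def scl_epoch(W, X, eta=0.1):
    for x in X:
        winner = np.argmax(W @ x)                # unit with the largest response
        W[winner] += eta * (x - W[winner])       # Δw_i = η x - η w_i
    return W

rng = np.random.default_rng(0)
X = rng.normal(0, 1, (100, 2))
X /= np.linalg.norm(X, axis=1, keepdims=True)    # normalized inputs
W = rng.normal(0, 1, (3, 2))                     # p = 3 competitive units
W = scl_epoch(W, X)
```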
Learning Vector Quantization (LVQ)
∆w_i^q(t) = +η(t) ( x^q - w_i(t) )   if the class of winning unit i^q is correct
∆w_i^q(t) = -η(t) ( x^q - w_i(t) )   if the class of winning unit i^q is incorrect
∆w_i^q(t) = 0   if i^q is not the winner
Improved LVQ
The class of the input vector q is different from the class represented by the winning unit i, but it is the same as that of the close unit j:

∆w_i^q(t) = -η(t) ( x^q - w_i(t) )
∆w_j^q(t) = +η(t) ( x^q - w_j(t) )
∆w_k(t) = 0,   k ≠ i, j

The class of the input vector q is the same as that of the winning unit i and the close unit j:

∆w_h^q(t) = +ε η(t) ( x^q - w_h(t) ),   h = i, j
∆w_k(t) = 0,   k ≠ i, j
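A hedged sketch of the basic LVQ update in Python (prototype initialization and toy data are illustrative; the improved two-prototype variant above would extend the same loop):

```python
import numpy as np

# Sketch of the basic LVQ update: the winning prototype is attracted to
# the pattern when its class label matches, repelled otherwise.

def lvq_epoch(W, labels, X, y, eta=0.05):
    for x, cls in zip(X, y):
        winner = np.argmin(np.linalg.norm(W - x, axis=1))  # nearest prototype
        sign = 1.0 if labels[winner] == cls else -1.0
        W[winner] += sign * eta * (x - W[winner])          # Δw = ±η (x - w)
    return W

W = np.array([[0.0, 0.0], [1.0, 1.0]])     # one prototype per class
labels = np.array([0, 1])
X = np.array([[0.1, -0.2], [0.9, 1.2], [0.2, 0.1]])
y = np.array([0, 1, 0])
W = lvq_epoch(W, labels, X, y)
```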
Kohonen Self-Organizing Maps
• Architecture:
– Kohonen maps consist of a two-dimensional array of
neurons, fully connected to the input, with no lateral connections,
arranged on a square or hexagonal lattice
• Learning algorithm:
– follows the winner-take-all strategy
– forces close neurons to fire for similar inputs (Self-
Organizing Maps)
• Properties:
– The topology of the input space is preserved
Self organizing maps
• The purpose of SOM is to map a
multidimensional input space onto a
topology-preserving map of neurons
– Preserve the topology so that
neighboring neurons respond to
« similar » input patterns
– The topological structure is often a 2- or
3-dimensional space
– The distance and proximity relationships
(i.e., topology) are preserved as much as
possible
• Similar to a specific kind of clustering: cluster
centers tend to lie in a low-dimensional
manifold in the feature space
[Figure: x = data points in the N-dimensional data space; o = positions of the neuron weights — the weights point to points in N-D while the grid of neurons is 2-D]
• The activation of the neuron is spread in its
direct neighborhood => neighbors become
sensitive to the same input patterns
• Block distance
• The size of the neighborhood is initially
large but is reduced over time
=> specialization of the network
[Figure: 1st and 2nd neighborhoods of a unit on the map grid]

Neighborhood function:   Λ(k, i^q, t) = exp( - ||r_k - r_i||^2 / (2σ(t)^2) )
Quantization error:

Q = (1/m) \sum_{q=1}^{m} ||x^q - w_{i^q}||^2

Average distortion:

D = (1/m) \sum_{q=1}^{m} Λ(i, i^q, t) ||x^q - w_{i^q}||^2
SOM algorithm
X^T = (X_1, X_2, ..., X_d), samples from the feature space.
Create a grid with nodes i = 1 ... K in 1D, 2D or 3D,
each node with a d-dimensional weight vector W^(i)T = (W_1^(i), W_2^(i), ..., W_d^(i)),
W^(i) = W^(i)(t), changing with t – discrete time.
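A hedged NumPy sketch of this procedure, assuming the usual SOM update w_k ← w_k + η(t) Λ(k, i^q, t)(x - w_k) with shrinking η and σ (the grid size and schedules are illustrative assumptions):

```python
import numpy as np

# Sketch of SOM training on a 2-D grid, assuming the usual update
# w_k(t+1) = w_k(t) + η(t) Λ(k, winner, t) (x - w_k(t)); the grid size
# and the η/σ schedules below are illustrative assumptions.

def som_epoch(W, grid, X, eta=0.3, sigma=1.0):
    for x in X:
        winner = np.argmin(np.linalg.norm(W - x, axis=1))  # best-matching unit
        d2 = np.sum((grid - grid[winner]) ** 2, axis=1)    # grid distances
        h = np.exp(-d2 / (2.0 * sigma ** 2))               # neighborhood Λ
        W += eta * h[:, None] * (x - W)                    # pull weights toward x
    return W

side = 5                                                   # 5 x 5 map
grid = np.array([[i, j] for i in range(side) for j in range(side)], float)
rng = np.random.default_rng(0)
W = rng.random((side * side, 3))                           # d = 3 features
X = rng.random((200, 3))
for t in range(20):                                        # shrink η and σ over time
    W = som_epoch(W, grid, X,
                  eta=0.3 * (1 - t / 20),
                  sigma=2.0 * (1 - t / 20) + 0.5)
```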