Neural Networks
Perceptron: architecture
• We consider the following architecture: a feed-forward NN with one layer.
• It is sufficient to study single-layer perceptrons with just one neuron.
Single layer perceptrons
• Generalization to single-layer perceptrons with more neurons is easy, because
each output neuron has its own weights and can be trained independently of the
others.
Perceptron Training
• How can we train a perceptron for a
classification task?
• We try to find suitable values for the
weights in such a way that the training
examples are correctly classified.
• Geometrically, we try to find a hyperplane that separates the examples of
the two classes.
Perceptron Geometric View
The equation below describes a (hyper-)plane in the input space of real-valued
2D vectors. The plane splits the input space into two regions, each
corresponding to one class.
Decision boundary:  Σ (i = 1..2) wi xi + w0 = 0,  i.e.  w1x1 + w2x2 + w0 = 0
Decision region for C1:  w1x1 + w2x2 + w0 >= 0
Decision region for C2:  w1x1 + w2x2 + w0 < 0
[Figure: the 2D input space (x1, x2) with the decision boundary separating the
regions of C1 and C2.]
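A minimal sketch (not from the slides) of this decision rule in Python; the
weights w1 = w2 = 1 and bias w0 = -0.5 are an arbitrary choice, used purely
for illustration:

import numpy as np

def classify(x, w, w0):
    """Assign x to C1 if it lies on the non-negative side of the
    hyperplane w·x + w0 = 0, otherwise to C2."""
    return "C1" if np.dot(w, x) + w0 >= 0 else "C2"

w = np.array([1.0, 1.0])      # example weights (illustrative choice)
w0 = -0.5                     # example bias
print(classify(np.array([1.0, 1.0]), w, w0))    # -> C1
print(classify(np.array([-1.0, -1.0]), w, w0))  # -> C2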
Example: AND
• Here is a representation of the AND
function
• White means false, black means true for
the output
• -1 means false, +1 means true for the
input
-1 AND -1 = false
-1 AND +1 = false
+1 AND -1 = false
+1 AND +1 = true
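As a quick check (not on the slide), one possible choice of weights that
realizes AND with this ±1 encoding is w1 = w2 = 1 and bias w0 = -1:

def and_perceptron(x1, x2):
    """Threshold unit sign(w1*x1 + w2*x2 + w0) with w1 = w2 = 1, w0 = -1."""
    return 1 if 1.0 * x1 + 1.0 * x2 - 1.0 >= 0 else -1

for x1 in (-1, +1):
    for x2 in (-1, +1):
        print(x1, x2, "->", and_perceptron(x1, x2))
# only the input (+1, +1) yields +1 (true)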
Example: AND continued
• A linear decision surface (i.e. a plane in
3D space) intersecting the feature
space (i.e. the 2D plane where z=0)
separates false from true instances
Example: AND continued
• Watch a perceptron learn the AND function:
Example: XOR
• Here’s the XOR function:
-1 XOR -1 = false
-1 XOR +1 = true
+1 XOR -1 = true
+1 XOR +1 = false
Example
-1 -1 -1 -1 -1 -1 -1 -1
-1 -1 +1 +1 +1 +1 -1 -1
-1 -1 -1 -1 -1 +1 -1 -1
-1 -1 -1 +1 +1 +1 -1 -1
-1 -1 -1 -1 -1 +1 -1 -1
-1 -1 -1 -1 -1 +1 -1 -1
-1 -1 +1 +1 +1 +1 -1 -1
-1 -1 -1 -1 -1 -1 -1 -1
Example
• How to train a perceptron to recognize this 3?
• Set each weight equal to the corresponding input value of the pattern (–1
where the input is –1, +1 where it is +1), and set the bias to –63.
• Then the output (weighted sum) of the perceptron is 64 – 63 = 1 when it is
presented with the “perfect” 3, and at most –1 for all other patterns.
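A small numerical check of this argument (my own sketch, not part of the
slides); the 8×8 array below re-creates the pattern of the 3 shown earlier:

import numpy as np

three = -np.ones((8, 8))        # -1 everywhere (white pixels)
three[1, 2:6] = 1               # +1 (black) pixels of the digit 3
three[2, 5] = 1
three[3, 3:6] = 1
three[4, 5] = 1
three[5, 5] = 1
three[6, 2:6] = 1

w = three.flatten()             # weights equal to the pattern itself
bias = -63.0

def weighted_sum(pattern):
    return float(w @ pattern.flatten() + bias)

print(weighted_sum(three))      # 1.0 for the perfect 3

corrupted = three.copy()
corrupted[4, 1] = 1             # flip one pixel (as on the next slide)
print(weighted_sum(corrupted))  # -1.0: rejected with bias -63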
Example
-1 -1 -1 -1 -1 -1 -1 -1
-1 -1 +1 +1 +1 +1 -1 -1
-1 -1 -1 -1 -1 +1 -1 -1
-1 -1 -1 +1 +1 +1 -1 -1
-1 +1 -1 -1 -1 +1 -1 -1
-1 -1 -1 -1 -1 +1 -1 -1
-1 -1 +1 +1 +1 +1 -1 -1
-1 -1 -1 -1 -1 -1 -1 -1
Example
• What if a slightly different 3 is to be recognized,
like the one in the previous slide?
• The original 3 with one bit corrupted produces a weighted sum of
62 – 63 = –1 (each flipped pixel lowers the sum by 2).
• If the bias is set to –61 instead, this corrupted 3 is also recognized, and
so is every pattern that differs from the perfect 3 in at most one bit.
• The system has been able to generalize from just one example of a corrupted
pattern!
Perceptron: Learning Algorithm
n = 1;
initialize w(n) randomly;
while (there are misclassified training examples)
    Select a misclassified augmented example (x(n), d(n))
    w(n+1) = w(n) + η d(n) x(n);
    n = n+1;
end-while;
η = learning rate parameter (real number)
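A runnable sketch of this loop (my own illustration, not from the slides),
using augmented inputs with a leading 1 for the bias and labels d in {+1, -1};
eta plays the role of the learning rate η:

import numpy as np

def train_perceptron(X, d, eta=1.0, max_epochs=1000, seed=0):
    """Perceptron learning: w <- w + eta*d(n)*x(n) for misclassified examples.

    X : (N, m+1) array of augmented inputs (first column all ones).
    d : (N,) array of labels, +1 or -1.
    """
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[1])                  # initialize w randomly
    for _ in range(max_epochs):
        predictions = np.where(X @ w >= 0, 1, -1)    # decision rule: w^T x >= 0 -> +1
        wrong = np.flatnonzero(predictions != d)     # misclassified examples
        if wrong.size == 0:                          # all classified correctly: done
            break
        n = wrong[0]                                 # select a misclassified example
        w = w + eta * d[n] * X[n]                    # perceptron update
    return w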
Example
Train a perceptron on C1 ∪ C2
A possible implementation
Consider the augmented training set C’1 ∪ C’2, with the first entry fixed to 1
(to deal with the bias as an extra weight):
(1, 1, 1), (1, 1, -1), (1, 0, -1), (1, -1, -1), (1, -1, 1), (1, 0, 1)
Replace x with −x for all x ∈ C’2 and use the following update rule:
w(n+1) = w(n) + x(n)   if wᵀ(n) x(n) ≤ 0
w(n+1) = w(n)          otherwise
[Figure: the six training points plotted in the (x1, x2) plane, marked + (C1)
and − (C2), together with the weight vector w and the decision boundary
2x1 − x2 = 0.]
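A short run of this rule on the augmented set (my own sketch; the split of the
six points into C’1 and C’2 is an assumption, chosen to match the boundary
2x1 − x2 = 0 in the figure):

import numpy as np

# assumed split of the augmented points (first entry 1 is the bias input)
C1_aug = [(1, 1, 1), (1, 1, -1), (1, 0, -1)]
C2_aug = [(1, -1, -1), (1, -1, 1), (1, 0, 1)]

# negate the C'2 points, so every example should end up with w^T x > 0
C = np.array(C1_aug + [tuple(-v for v in x) for x in C2_aug], dtype=float)

w = np.zeros(3)                     # w(1) = 0, eta = 1 (as in the proof below)
updated = True
while updated:
    updated = False
    for x in C:
        if w @ x <= 0:              # misclassified: apply the update rule
            w = w + x
            updated = True

print(w)                            # a separating weight vector (w0, w1, w2)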
Termination of the learning algorithm
Suppose the classes C1, C2 are linearly separable (that is, there exists a
hyperplane that separates them). Then the perceptron algorithm applied to
C1 ∪ C2 terminates successfully after a finite number of iterations.
Proof:
Consider the set C containing the inputs of C1 ∪ C2 transformed by replacing
x with −x for each x with class label −1.
For simplicity assume w(1) = 0 and η = 1.
Let x(1), …, x(k) ∈ C be the sequence of inputs that have been used after k
iterations. Then
w(2) = w(1) + x(1)
w(3) = w(2) + x(2)
…
w(k+1) = w(k) + x(k),  so  w(k+1) = x(1) + … + x(k)
Convergence theorem (proof)
Since the classes are linearly separable, there is a weight vector w* with
w*ᵀ x > 0 for every x ∈ C. Let α = min over x ∈ C of w*ᵀ x (> 0). Then
w*ᵀ w(k+1) = w*ᵀ x(1) + … + w*ᵀ x(k) ≥ k α,
and by the Cauchy–Schwarz inequality
||w(k+1)||² ≥ (w*ᵀ w(k+1))² / ||w*||² ≥ k² α² / ||w*||²   (A)
Convergence theorem (proof)
• Now we consider another route:
w(k+1) = w(k) + x(k)
||w(k+1)||² = ||w(k)||² + ||x(k)||² + 2 wᵀ(k) x(k)      (Euclidean norm)
The last term is ≤ 0 because x(k) is misclassified, so
||w(k+1)||² ≤ ||w(k)||² + ||x(k)||²
Starting from ||w(1)||² = 0:
||w(2)||² ≤ ||w(1)||² + ||x(1)||²
||w(3)||² ≤ ||w(2)||² + ||x(2)||²
…
||w(k+1)||² ≤ Σ (i = 1..k) ||x(i)||²
Convergence theorem (proof)
Let β = max over x ∈ C of ||x||². The bound above gives ||w(k+1)||² ≤ k β   (B).
Combining (A) and (B):  k² α² / ||w*||² ≤ ||w(k+1)||² ≤ k β,  hence
k ≤ β ||w*||² / α².
So the number of weight updates is bounded, and the algorithm terminates after
a finite number of iterations.
Perceptron: Limitations
• The perceptron can only model linearly separable classes, such as those
described by the following Boolean functions:
• AND
• OR
• COMPLEMENT
• It cannot model the XOR function (see the short argument below).
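A short argument (not on the slides) for why no perceptron can compute XOR,
using the convention that a weighted sum ≥ 0 encodes true. The weights would
have to satisfy
  −w1 + w2 + w0 ≥ 0   (for −1 XOR +1 = true)
   w1 − w2 + w0 ≥ 0   (for +1 XOR −1 = true)
  −w1 − w2 + w0 < 0   (for −1 XOR −1 = false)
   w1 + w2 + w0 < 0   (for +1 XOR +1 = false)
Adding the first two inequalities gives w0 ≥ 0, while adding the last two gives
w0 < 0, a contradiction. Hence no choice of weights separates the XOR classes.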
[Figure: schematic of the Adaline unit with output f(x).]
Adaline: Adaptive Linear Element
• When the two classes are not linearly separable, it may be
desirable to obtain a linear separator that minimizes the mean
squared error.
• Adaline (Adaptive Linear Element):
– uses a linear neuron model and
– the Least-Mean-Square (LMS) learning algorithm
– useful for robust linear classification and regression
For an example (x, d) the error e(w) of the network is
e(w) = d − Σ (j = 0..m) xj wj
Incremental Gradient Descent
− (gradient of E(w(n))) = − [∂E/∂w1, …, ∂E/∂wm]
• take a small step (of size η) in that direction
[Figure: error surface over the weights (w1, w2) for the training set
D = {<(1,1),1>, <(-1,-1),1>, <(1,-1),-1>, <(-1,1),-1>},
with one descent step from (w1, w2) to (w1+Δw1, w2+Δw2).]
Gradient Descent
• Train the wi’s such that they minimize the
squared error
– E[w1, …, wm] = ½ Σ (d ∈ D) (td − od)²
Gradient:
∇E[w] = [∂E/∂w0, …, ∂E/∂wm]
Δw = −η ∇E[w]
Δwi = −η ∂E/∂wi
    = −η ∂/∂wi ½ Σd (td − od)²
    = −η ∂/∂wi ½ Σd (td − Σi wi xi,d)²
    = −η Σd (td − od)(−xi,d)
    = η Σd (td − od) xi,d
Gradient Descent
Gradient-Descent(training_examples, η)
Each training example is a pair of the form <(x1, …, xm), t> where (x1, …, xm)
is the vector of input values, t is the target output value, and η is the
learning rate (e.g. 0.1)
• Initialize each wi to some small random value
• Until the termination condition is met, Do
  – Initialize each Δwi to zero
  – For each <(x1, …, xm), t> in training_examples Do
    • Input the instance (x1, …, xm) to the linear unit and compute the output o
    • For each linear unit weight wi Do
      – Δwi = Δwi + η (t − o) xi
  – For each linear unit weight wi Do
    • wi = wi + Δwi
• Termination condition: the error falls below a given threshold
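A runnable sketch of this batch procedure (my own illustration, not from the
slides), for a linear unit with an extra bias weight on a constant input
x0 = 1; the sample data and the target t = 1 + 2*x1 − x2 are made up for the
demo:

import numpy as np

def gradient_descent(training_examples, eta=0.1, max_epochs=1000, threshold=1e-3, seed=0):
    """Batch gradient descent (delta rule) for a linear unit o = w · x."""
    rng = np.random.default_rng(seed)
    m = len(training_examples[0][0])
    w = rng.uniform(-0.05, 0.05, size=m + 1)       # small random initial weights
    aug = lambda x: np.concatenate(([1.0], np.asarray(x, dtype=float)))
    for _ in range(max_epochs):
        dw = np.zeros_like(w)                      # initialize each Delta w_i to zero
        for x, t in training_examples:
            o = w @ aug(x)                         # output of the linear unit
            dw += eta * (t - o) * aug(x)           # accumulate Delta w_i
        w += dw                                    # w_i <- w_i + Delta w_i
        E = 0.5 * sum((t - w @ aug(x)) ** 2 for x, t in training_examples)
        if E < threshold:                          # termination condition
            break
    return w

# made-up data: targets generated by t = 1 + 2*x1 - x2
examples = [((0, 0), 1), ((1, 0), 3), ((0, 1), 0), ((1, 1), 2), ((2, 1), 4)]
print(gradient_descent(examples))                  # approx. [1., 2., -1.]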
Incremental Stochastic
Gradient Descent
• Batch mode : gradient descent
w=w - ED[w] over the entire data D
ED[w]=1/2d(td-od)2
• Incremental mode: gradient descent
w=w - Ed[w] over individual training
examples d
Ed[w]=1/2 (td-od)2
• Computation of Gradient(E):
  ∂E(w)/∂w = e ∂e/∂w = e (−xᵀ) = −e xᵀ
n = 1;
initialize w(n) randomly;
while (E_tot unsatisfactory and n < max_iterations)
    Select an example (x(n), d(n))
    e(n) = d(n) − wᵀ(n) x(n)
    w(n+1) = w(n) + η e(n) x(n);
    n = n+1;
end-while;
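A compact sketch of this LMS loop (my own illustration, not from the slides),
again with augmented inputs; as a small demo it is run on the AND data from
earlier, for which it approaches the least-squares weights:

import numpy as np

def lms_train(X, d, eta=0.01, max_iterations=1000, tol=1e-3, seed=0):
    """Adaline / LMS: incremental updates w <- w + eta * e * x with e = d - w^T x."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.1, size=X.shape[1])   # initialize w(n) randomly
    for n in range(max_iterations):
        i = n % len(X)                           # select an example (cyclically)
        e = d[i] - w @ X[i]                      # e(n) = d(n) - w^T(n) x(n)
        w = w + eta * e * X[i]                   # w(n+1) = w(n) + eta * e(n) * x(n)
        E_tot = 0.5 * np.sum((d - X @ w) ** 2)   # total squared error
        if E_tot < tol:                          # stop when E_tot is satisfactory
            break
    return w

# AND with the +/-1 encoding, augmented with a leading 1 for the bias
X = np.array([[1, -1, -1], [1, -1, 1], [1, 1, -1], [1, 1, 1]], dtype=float)
d = np.array([-1, -1, -1, 1], dtype=float)
print(lms_train(X, d))                           # roughly (-0.5, 0.5, 0.5)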
[Figure: decision boundaries found by the Perceptron and by a Support Vector
Machine.]