Lecture 4:
Single Layer Perceptron (SLP)
Classifiers
[Figure: signal-flow model of neuron k: input signals x1, x2, ..., xm; synaptic weights wk1, wk2, ..., wkm; bias bk; summing junction Σ; activation function φ(·); output yk]
vk = Σ_{j=1..m} wkj xj + bk
yk = φ(vk)
Discrete Perceptron: φ(·) = sign(·)
Continuous Perceptron: φ(·) = S-shaped (sigmoid)
Activation Function of a perceptron
[Figure: the two activation functions plotted against vi, each saturating at +1 and -1]
Discrete Perceptron: φ(·) = sign(·), the signum function
Continuous Perceptron: φ(v) = S-shaped (sigmoid) function
SLP Architecture
[Figure: decision regions formed by a single layer perceptron versus a three-layer network (arbitrary regions, with complexity limited by the number of nodes), illustrated for two classes A and B]
Review from last lectures:
Implementing Logic Gates with
Perceptrons http://www.cs.bham.ac.uk/~jxb/NN/l3.pdf
We can use the perceptron to implement the basic logic gates (AND, OR
and NOT).
All we need to do is find the appropriate connection weights and neuron
thresholds to produce the right outputs for each set of inputs.
We saw how we can construct simple networks that perform NOT, AND,
and OR.
It is then a well-known result from logic that we can construct any logical
function from these three operations.
The resulting networks, however, will usually have a much more complex
architecture than a simple Perceptron.
We generally want to avoid decomposing complex problems into simple
logic gates, and instead find the weights and thresholds that work directly in a
Perceptron architecture.
Implementation of Logical NOT, AND, and OR
In each case we have inputs ini and an output out, and need to determine
the weights and thresholds. It is easy to find solutions by inspection, as sketched below:
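To make the solutions-by-inspection concrete, here is a minimal Python sketch (not from the slides); the particular weights and thresholds are just one valid choice, shown for illustration.

```python
import numpy as np

def perceptron(x, w, theta):
    """Threshold unit: output 1 if the weighted sum reaches the threshold theta, else 0."""
    return 1 if np.dot(w, x) >= theta else 0

# One possible set of weights and thresholds found by inspection (illustrative values).
NOT_gate = lambda a:    perceptron([a],    w=[-1],   theta=-0.5)   # fires only when a = 0
AND_gate = lambda a, b: perceptron([a, b], w=[1, 1], theta=1.5)    # both inputs must be 1
OR_gate  = lambda a, b: perceptron([a, b], w=[1, 1], theta=0.5)    # at least one input is 1

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "AND:", AND_gate(a, b), "OR:", OR_gate(a, b), "NOT a:", NOT_gate(a))
```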
The Need to Find Weights Analytically
Constructing simple networks by hand is one thing. But what about
harder problems? For example, what about:
We have two weights w1 and w2 and the threshold θ, and for each
training pattern we need to satisfy
Clearly the second and third inequalities are incompatible with the
fourth, so there is in fact no solution. We need more complex networks,
e.g. ones that combine together many simple networks, or that use different
activation/thresholding/transfer functions (a small check of this incompatibility
is sketched below).
It then becomes much more difficult to determine all the weights and
thresholds by hand.
These weights are instead adapted using learning rules. Hence the need to
consider learning rules (see previous lecture), and more complex
architectures.
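The pattern of inequalities above matches the XOR problem; assuming XOR is the intended example (the slides do not show it explicitly here), this small sketch searches a grid of weights and thresholds and confirms that no single threshold unit classifies all four patterns.

```python
import itertools
import numpy as np

# XOR truth table (inputs -> target), assumed here as the "harder problem".
patterns = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def classifies_all(w1, w2, theta):
    """True if a single unit with weights w1, w2 and threshold theta gets every pattern right."""
    return all((w1 * x1 + w2 * x2 >= theta) == bool(t) for (x1, x2), t in patterns)

# Coarse grid search over weights and thresholds: no single-unit solution exists.
grid = np.linspace(-2.0, 2.0, 41)
found = any(classifies_all(w1, w2, th) for w1, w2, th in itertools.product(grid, grid, grid))
print("single-unit solution found for XOR:", found)   # expected: False
```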
E.g. Decision Surface of a Perceptron
[Figure: two scatter plots of '+' and '-' patterns in the (x1, x2) plane. Left: linearly separable (a single straight line separates the classes). Right: non-linearly separable (no single straight line can separate them).]
Pattern classification/recognition
- Assign the input data (a physical object, event, or phenomenon)
to one of the pre-specified classes (categories)
The block diagram of the recognition and classification system
Classification: an example
http://webcourse.technion.ac.il/236607/Winter2002-2003/en/ho.htm
Duda & Hart, Chapter 1
Decide ω1 if p(x|ω1) P(ω1) > p(x|ω2) P(ω2); decide ω2 otherwise.
Equivalently, in likelihood-ratio form:
p(x|ω1) P(ω1) ≷ p(x|ω2) P(ω2)  ⇔  p(x|ω1) / p(x|ω2) ≷ P(ω2) / P(ω1)
(the upper inequality gives ω1, the lower gives ω2)
Partition the feature space into regions X1, ..., XR such that
x ∈ Xk ↔ x is assigned to ωk
P(correct) = Σ_{k=1..R} P(x ∈ Xk, ωk) = Σ_{k=1..R} P(x ∈ Xk | ωk) P(ωk)
Classify x ∈ Xk if, for all j ≠ k,
p(x|ωk) P(ωk) > p(x|ωj) P(ωj)
i.e. maximum posterior probability:
∀ j ≠ k: P(ωk|x) > P(ωj|x)
Here R = 2.
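As a concrete illustration of this maximum-posterior rule, here is a minimal sketch with made-up one-dimensional Gaussian class-conditional densities and priors; the means, standard deviations and priors are assumptions, not lecture values.

```python
import math

# Illustrative (made-up) 1-D Gaussian class-conditional densities and priors.
PRIOR = {1: 0.6, 2: 0.4}
MEAN  = {1: 0.0, 2: 2.0}
STD   = {1: 1.0, 2: 1.0}

def likelihood(x, k):
    """p(x | omega_k) for the 1-D Gaussian class model."""
    z = (x - MEAN[k]) / STD[k]
    return math.exp(-0.5 * z * z) / (STD[k] * math.sqrt(2.0 * math.pi))

def decide(x):
    """Choose omega_1 iff p(x|w1) P(w1) > p(x|w2) P(w2), i.e. the maximum posterior."""
    return 1 if likelihood(x, 1) * PRIOR[1] > likelihood(x, 2) * PRIOR[2] else 2

for x in (-1.0, 0.9, 1.5, 3.0):
    print(f"x = {x:4.1f} -> omega_{decide(x)}")
```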
Discriminant functions
Bayes approach:
– Estimate class-conditioned probability density
– Combine with prior class probability
– Determine posterior class probability
– Derive decision boundaries
Alternative approach, implemented by NNs
– Estimate posterior probability directly
– i.e. determine decision boundaries directly
DISCRIMINANT FUNCTIONS
Discriminant Functions http://140.122.185.120
Do not confuse n (the dimension of each input vector, i.e. of the feature space), P (the number of input vectors), and R (the number of classes).
Linear Machine and Minimum Distance
Classification
• Find the linear-form discriminant function for two-class classification when the class prototypes are known.
Linear Machine and Minimum Distance
Classification… (multiclass classification)
• The linear-form discriminant functions for multiclass classification
– There are up to R(R-1)/2 decision hyperplanes for R pairwise separable classes.
Linear Machine and Minimum Distance
Classification…
P1, P2, P3 are the centres of gravity of the prototype points; we need to design a minimum distance classifier. Using the formulas from the previous slide, we obtain the weights wi (a sketch of such a classifier follows below).
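A minimal sketch of such a minimum distance classifier, assuming the linear discriminants gi(x) = pi^T x - 0.5 ||pi||^2 built from the class prototypes; the prototype coordinates below are hypothetical, since the slide's actual P1, P2, P3 values are not reproduced in this text.

```python
import numpy as np

# Hypothetical class prototypes (centres of gravity); placeholders for P1, P2, P3.
prototypes = np.array([[ 2.0,  5.0],    # P1
                       [-1.0, -3.0],    # P2
                       [ 4.0, -2.0]])   # P3

def discriminants(x):
    """Linear minimum-distance discriminants g_i(x) = p_i^T x - 0.5 * ||p_i||^2."""
    return prototypes @ x - 0.5 * np.sum(prototypes**2, axis=1)

def classify(x):
    """Assign x to the class with the largest discriminant, i.e. the nearest prototype."""
    return int(np.argmax(discriminants(x))) + 1

print(classify(np.array([3.0, 4.0])))    # closest to P1 -> class 1
print(classify(np.array([0.0, -4.0])))   # closest to P2 -> class 2
```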
o1 = sgn(x1 + x2 + 1)
o2 = sgn(-x1 - x2 + 1)

 x1   x2  |  o1   o2
 -1   -1  |  -1    1
 -1    1  |   1    1
  1   -1  |   1    1
  1    1  |   1   -1

The two inputs (-1, 1) and (1, -1) map to the same point (1, 1) in the image space.
The Discrete Perceptron
Discrete Perceptron Training Algorithm
• So far, we have shown that the coefficients of linear
discriminant functions, called weights, can be determined
from a priori information about the sets of patterns and
their class membership.
•In what follows, we will begin to examine neural
network classifiers that derive their weights during the
learning cycle.
•The sample pattern vectors x1, x2, …, xp, called the
training sequence, are presented to the machine along
with the correct response.
Discrete Perceptron Training Algorithm
- Geometrical Representations http://140.122.185.120
Zurada, Chapter 3
[Figure: weight space. If pattern y1 belongs to Class 2 and is misclassified, the weights are corrected as w′ = w1 − c y1.]
c (> 0) is the correction increment (it is two times the learning constant ρ introduced before); the correction is in the negative gradient direction.
Discrete Perceptron Training Algorithm
- Geometrical Representations…
c y = (w1^T y / (y^T y)) y = p
(the correction term c y equals the projection p of w1 on y)
x1 = 1, x3 = 3, d1 = d3 = 1 : class 1
x2 = -0.5, x4 = -2, d2 = d4 = -1 : class 2
• The augmented input vectors are:
y1 = [1, 1]^T, y2 = [-0.5, 1]^T, y3 = [3, 1]^T, y4 = [-2, 1]^T
• We obtain the following outputs and weight updates, starting from the initial weight vector w1 = [-2.5, 1.75]^T:
• Step 1: Pattern y1 is input
o1 = sgn([-2.5 1.75] [1, 1]^T) = -1,  d1 - o1 = 2
w2 = w1 + y1 = [-1.5, 2.75]^T
Discrete Perceptron Training Algorithm
- Geometrical Representations…
• Step 2: Pattern y2 is input
o2 = sgn([-1.5 2.75] [-0.5, 1]^T) = 1,  d2 - o2 = -2
w3 = w2 - y2 = [-1, 1.75]^T
• Step 3: Pattern y3 is input
o3 = sgn([-1 1.75] [3, 1]^T) = -1,  d3 - o3 = 2
w4 = w3 + y3 = [2, 2.75]^T
Discrete Perceptron Training Algorithm
- Geometrical Representations…
• Since we have no evidence of correct classification of
weight w4, the training set, consisting of the ordered
sequence of patterns y1, y2, y3 and y4, needs to be recycled.
We thus have y5 = y1, y6 = y2, etc. (the index now denotes
the training step number).
• Step 4, 5: w6 = w5 = w4 (no misclassification, thus no
weight adjustments).
• You can check that the adjustments in steps 6
through 10 are as follows:
w7 = [2.5, 1.75]^T
w10 = w9 = w8 = w7
w11 = [3, 0.75]^T
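The whole run can be reproduced in a few lines of code. This is a minimal sketch, assuming the discrete perceptron update Δw = (c/2)(d − o) y with c = 1 and the initial weight vector w1 = [-2.5, 1.75]^T used in the worked example above.

```python
import numpy as np

def sgn(v):
    """Bipolar signum used by the discrete perceptron."""
    return 1 if v >= 0 else -1

# Augmented training patterns y = [x, 1]^T and desired responses (from the example above).
Y = [np.array([ 1.0, 1.0]), np.array([-0.5, 1.0]),
     np.array([ 3.0, 1.0]), np.array([-2.0, 1.0])]
D = [1, -1, 1, -1]

w = np.array([-2.5, 1.75])   # initial weight vector used in the worked example
c = 1.0                      # correction increment

step = 0
while True:
    errors = 0
    for y, d in zip(Y, D):
        step += 1
        o = sgn(w @ y)
        if o != d:                         # misclassified: correct by +/- y
            w = w + 0.5 * c * (d - o) * y
            errors += 1
        print(f"step {step}: o = {o:+d}, w = {w}")
    if errors == 0:                        # a full pass with no corrections: training ends
        break
```

Running it reproduces the sequence above (w2 = [-1.5, 2.75]^T, ..., w7 = [2.5, 1.75]^T, w11 = [3, 0.75]^T) and stops after one error-free pass.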
Training rule of the continuous perceptron (equivalent to the delta training rule): the weights are corrected in the negative gradient direction of the current error function.
Since net = w^T y, we have ∂(net)/∂wi = yi,  i = 1, 2, ..., n+1.
Continuous Perceptron Training Algorithm…
Same as previous example (of discrete perceptron) but with a
continuous activation function and using the delta rule.
E1(w) = (1/2) [1 - (2 / (1 + exp[-λ(w1 + w2)]) - 1)]^2
Setting λ = 1 and reducing the terms simplifies this expression to the following form:
E1(w) = 2 / [1 + exp(w1 + w2)]^2
and similarly
E2(w) = 2 / [1 + exp(0.5 w1 - w2)]^2
E3(w) = 2 / [1 + exp(3 w1 + w2)]^2
E4(w) = 2 / [1 + exp(2 w1 - w2)]^2
These error surfaces are as shown on the previous slide.
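A small illustrative sketch (not slide material) that evaluates these error functions both from the definition, i.e. the squared error of the bipolar sigmoid output, and from the simplified closed forms above, confirming the two agree at an arbitrary point in weight space.

```python
import numpy as np

# Augmented patterns and targets from the example (same data as the discrete case).
Y = [np.array([ 1.0, 1.0]), np.array([-0.5, 1.0]),
     np.array([ 3.0, 1.0]), np.array([-2.0, 1.0])]
D = [1.0, -1.0, 1.0, -1.0]
LAM = 1.0   # steepness lambda of the bipolar sigmoid

def output(w, y):
    """Continuous perceptron output o = 2 / (1 + exp(-lambda * net)) - 1."""
    return 2.0 / (1.0 + np.exp(-LAM * (w @ y))) - 1.0

def error_def(w, k):
    """E_k(w) = 0.5 * (d_k - o_k)^2 straight from the definition."""
    return 0.5 * (D[k] - output(w, Y[k]))**2

def error_closed(w, k):
    """Simplified closed forms quoted above, e.g. E1 = 2 / (1 + exp(w1 + w2))^2."""
    args = [w[0] + w[1], 0.5*w[0] - w[1], 3*w[0] + w[1], 2*w[0] - w[1]]
    return 2.0 / (1.0 + np.exp(args[k]))**2

w = np.array([-2.5, 1.75])          # any test point in weight space
for k in range(4):
    print(k + 1, error_def(w, k), error_closed(w, k))   # the two columns agree
```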
Continuous Perceptron Training Algorithm…
[Figure: the error surface, with its minimum marked]
Multicategory SLP
Multi-category Single Layer Perceptron nets
• Treat the last, fixed component of the input pattern vector as the neuron activation threshold: T = wn+1.
• Step 2: Pattern y2 = [2, -5, -1]^T is input (the asterisk marks an incorrect response):
sgn([1 -2 0] [2, -5, -1]^T) = 1*   →  w1^3 = w1^2 - y2 = [1, -2, 0]^T - [2, -5, -1]^T = [-1, 3, 1]^T
sgn([0 -1 2] [2, -5, -1]^T) = 1    →  w2^3 = w2^2
sgn([-9 1 0] [2, -5, -1]^T) = -1   →  w3^3 = w3^2
Multi-category Single layer Perceptron nets…
• Step 3: Pattern y3 = [-5, 5, -1]^T is input:
sgn(w1^3T y3) = 1*   →  w1^4 = w1^3 - y3 = [4, -2, 2]^T
sgn(w2^3T y3) = -1   →  w2^4 = w2^3
sgn(w3^3T y3) = 1    →  w3^4 = w3^3
One can verify that the only adjusted weights from now on are those of TLU1.
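A sketch of the R-category discrete perceptron training loop that is consistent with the worked numbers above. The class-2 and class-3 prototype patterns match the values visible in the example; the class-1 prototype and the initial TLU weights are assumptions inferred from the intermediate results, so treat the data as illustrative.

```python
import numpy as np

def sgn(v):
    return 1 if v >= 0 else -1

# Augmented prototype patterns y = [x, -1]^T, one per class. Class 2 and 3 match the
# example above; the class-1 pattern and the initial weights are inferred/assumed.
Y = [np.array([10.0,  2.0, -1.0]),   # class 1 (assumed)
     np.array([ 2.0, -5.0, -1.0]),   # class 2
     np.array([-5.0,  5.0, -1.0])]   # class 3
W = [np.array([1.0, -2.0, 0.0]),     # TLU1 initial weights
     np.array([0.0, -1.0, 2.0]),     # TLU2 initial weights
     np.array([1.0,  3.0, -1.0])]    # TLU3 initial weights (assumed)
c = 1.0

for cycle in range(20):
    corrections = 0
    for k, y in enumerate(Y):                 # present the prototypes in order
        for i in range(len(W)):               # each TLU is trained independently
            d = 1 if i == k else -1           # TLU_k should fire only for class k
            o = sgn(W[i] @ y)
            if o != d:
                W[i] = W[i] + 0.5 * c * (d - o) * y
                corrections += 1
    if corrections == 0:                      # stop after an error-free cycle
        break

for i, w in enumerate(W, start=1):
    print(f"TLU{i} final weights: {w}")
```

With these values the run reproduces the updates above (w3^2 = [-9, 1, 0]^T, w1^3 = [-1, 3, 1]^T, w1^4 = [4, -2, 2]^T), and afterwards only TLU1 is adjusted, as stated on the slide.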
Unconstrained Optimization Techniques
http://ai.kaist.ac.kr/~jkim/cs679/
Haykin, Chapter 3
Finally,
w(n+1) = w(n) + Δw(n) = w(n) - H^{-1}(n) g(n)
Newton's method converges quickly (asymptotically) and does not exhibit the zigzagging behavior of steepest descent; however, the Hessian H(n) has to be a positive definite matrix for all n.
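A minimal sketch of the Newton update w(n+1) = w(n) - H^{-1}(n) g(n) on a made-up quadratic cost E(w) = 0.5 w^T A w - b^T w; because the cost is exactly quadratic (so the Hessian is constant and positive definite), a single step lands on the minimum.

```python
import numpy as np

# Illustrative quadratic cost E(w) = 0.5 * w^T A w - b^T w (made-up A and b).
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])            # positive definite, as the method requires
b = np.array([1.0, -1.0])

def gradient(w):
    return A @ w - b                  # g(n)

def hessian(w):
    return A                          # H(n); constant for a quadratic cost

w = np.array([5.0, -4.0])             # arbitrary starting point
for n in range(3):
    w = w - np.linalg.solve(hessian(w), gradient(w))   # w(n+1) = w(n) - H^{-1} g
    print(n, w, np.linalg.norm(gradient(w)))           # gradient norm drops to ~0 at once
```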
Gauss-Newton Method
The Gauss-Newton method is applicable to a cost function of the form
E(w) = (1/2) Σ_{i=1..n} e^2(i)
Thus we get
w(n+1) = w(n) - (J^T(n) J(n))^{-1} J^T(n) e(n)
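A sketch of the Gauss-Newton iteration on a made-up nonlinear least-squares problem, fitting y = a exp(b t) to a few data points; the model, data, and starting point are illustrative assumptions.

```python
import numpy as np

# Illustrative least-squares problem (made-up data): fit y = a * exp(b * t).
t = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 1.2, 0.8, 0.5, 0.3])

def residuals(w):
    a, b = w
    return a * np.exp(b * t) - y                      # e(n), one entry per sample

def jacobian(w):
    a, b = w
    return np.column_stack([np.exp(b * t),            # d e_i / d a
                            a * t * np.exp(b * t)])   # d e_i / d b

w = np.array([1.0, 0.0])                              # initial guess
for n in range(10):
    e, J = residuals(w), jacobian(w)
    w = w - np.linalg.solve(J.T @ J, J.T @ e)         # w(n+1) = w(n) - (J^T J)^{-1} J^T e
    print(n, w, 0.5 * np.sum(residuals(w)**2))        # cost E(w) shrinks each iteration
```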
For the perceptron convergence proof, the linear combiner output is v(n) = w^T(n) x(n). With η(n) = 1 and w(0) = 0, summing the corrections made on misclassified patterns x(k) ∈ H1, k = 1, ..., n, gives w(n+1) = x(1) + x(2) + ... + x(n).
Eq. B4 states that the squared Euclidean norm of w(n+1)
grows at most linearly with the number of iterations n.
Perceptron Convergence Proof
The second result, Eq. B4, is clearly in conflict with Eq. B3.
• Indeed, we can state that n cannot be larger than some
value nmax for which Eq. B3 and B4 are both satisfied with
the equality sign. That is, nmax is the solution of the equation
nmax^2 α^2 / ||w0||^2 = nmax β
• Solving for nmax, given a solution vector w0, we find that
nmax = β ||w0||^2 / α^2
We have thus proved that for η(n) = 1 for all n, and for w(0) = 0,
given that a solution vector w0 exists, the rule for adapting the
synaptic weights of the perceptron must terminate after at most
nmax iterations.
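A small numeric check of this bound on made-up, linearly separable toy data; it assumes the proof's usual definitions α = min over x(k) ∈ H1 of w0^T x(k) and β = max over x(k) ∈ H1 of ||x(k)||^2, which are not shown in the extracted slides.

```python
import numpy as np

# Toy data (made-up), already in the proof's convention: class-C2 vectors are negated,
# so a solution w0 must satisfy w0^T x > 0 for every x in the combined set H1.
H1 = np.array([[1.0, 2.0], [2.0, 1.0], [0.5, 3.0], [3.0, 0.5]])
w0 = np.array([1.0, 1.0])                     # a known solution vector for this toy set

alpha = min(H1 @ w0)                          # alpha = min_x w0^T x (assumed definition)
beta = max(np.sum(H1**2, axis=1))             # beta  = max_x ||x||^2 (assumed definition)
n_max = beta * (w0 @ w0) / alpha**2
print("bound on the number of corrections:", n_max)

# Run the perceptron rule with eta = 1 and w(0) = 0, counting the actual corrections.
w, corrections = np.zeros(2), 0
for _ in range(100):                          # a few sweeps are plenty for this toy set
    for x in H1:
        if w @ x <= 0:                        # misclassified: correct toward x
            w, corrections = w + x, corrections + 1
print("actual corrections:", corrections, "<= bound:", corrections <= n_max)
```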
MORE READING
Suggested Reading.