CS407 Neural Computation
Lecture 4:
Single Layer Perceptron (SLP)
Classifiers
Lecturer: A/Prof. M. Bennamoun
Outline
What’s a SLP and what’s classification?
Limitation of a single perceptron.
Foundations of classification and Bayes Decision making theory
Discriminant functions, linear machine and minimum distance
classification
Training and classification using the Discrete perceptron
Single-Layer Continuous perceptron Networks for linearly
separable classifications
Appendix A: Unconstrained optimization techniques
Appendix B: Perceptron Convergence proof
Suggested reading and references
What is a perceptron and what is
a Single Layer Perceptron (SLP)?
Perceptron
The simplest form of a neural network
consists of a single neuron with adjustable
synaptic weights and bias
performs pattern classification with only two
classes
perceptron convergence theorem :
– Patterns (vectors) are drawn from two
linearly separable classes
– During training, the perceptron algorithm
converges and positions the decision
surface in the form of hyperplane between
two classes by adjusting synaptic weights
What is a perceptron?
[Figure: model of a single perceptron. The input signals x1, x2, ..., xm are multiplied by the synaptic weights wk1, wk2, ..., wkm, combined with the bias bk at the summing junction Σ, and passed through the activation function ϕ(.) to produce the output yk.]
v_k = \sum_{j=1}^{m} w_{kj}\, x_j + b_k
y_k = \varphi(v_k)
Discrete Perceptron: \varphi(\cdot) = \mathrm{sign}(\cdot)
Continuous Perceptron: \varphi(\cdot) = S\text{-shape}
Activation Function of a perceptron
[Figure: the two activation functions plotted against v_i — the signum (hard-limiting) function taking values -1 and +1, and the S-shaped (sigmoid) function.]
Discrete Perceptron: \varphi(\cdot) = \mathrm{sign}(\cdot)  (Signum function)
Continuous Perceptron: \varphi(v) = S\text{-shape}
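As a concrete illustration, here is a minimal NumPy sketch of the single-neuron computation above, using the sign function for the discrete perceptron and a bipolar sigmoid as one common S-shaped choice for the continuous perceptron; the input, weight and bias values are made up for the example.

```python
import numpy as np

def perceptron_output(x, w, b, kind="discrete", lam=1.0):
    """Compute v = w.x + b at the summing junction and y = phi(v)."""
    v = np.dot(w, x) + b
    if kind == "discrete":
        return np.sign(v)                         # phi(.) = sign(.)
    return 2.0 / (1.0 + np.exp(-lam * v)) - 1.0   # bipolar sigmoid (S-shape)

x = np.array([0.5, -1.0, 2.0])    # made-up input pattern
w = np.array([1.0, 0.3, -0.4])    # made-up synaptic weights
b = 0.1                           # bias

print(perceptron_output(x, w, b, kind="discrete"))    # +/-1 output
print(perceptron_output(x, w, b, kind="continuous"))  # value in (-1, 1)
```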
SLP Architecture
Single layer perceptron
Input layer Output layer
Where are we heading? Different
Non-Linearly Separable Problems
http://www.zsolutions.com/light.htm
Structure      Types of decision regions
Single-Layer   Half plane bounded by a hyperplane
Two-Layer      Convex open or closed regions
Three-Layer    Arbitrary (complexity limited by the number of nodes)

[Figures: for each structure, example decision regions (labelled A and B) for the Exclusive-OR problem, for classes with meshed regions, and for the most general region shapes.]
Review from last lectures:
Implementing Logic Gates with
Perceptrons http://www.cs.bham.ac.uk/~jxb/NN/l3.pdf
We can use the perceptron to implement the basic logic gates (AND, OR
and NOT).
All we need to do is find the appropriate connection weights and neuron
thresholds to produce the right outputs for each set of inputs.
We saw how we can construct simple networks that perform NOT, AND,
and OR.
It is then a well known result from logic that we can construct any logical
function from these three operations.
The resulting networks, however, will usually have a much more complex
architecture than a simple Perceptron.
We generally want to avoid decomposing complex problems into simple
logic gates, by finding the weights and thresholds that work directly in a
Perceptron architecture.
Implementation of Logical NOT, AND, and OR
In each case we have inputs ini and outputs out, and need to determine
the weights and thresholds. It is easy to find solutions by inspection:
The Need to Find Weights Analytically
Constructing simple networks by hand is one thing. But what about
harder problems? For example, what about:
How long do we keep looking for a solution? We need to be able to
calculate appropriate parameters rather than looking for solutions by trial
and error.
Each training pattern produces a linear inequality for the output in terms
of the inputs and the network parameters. These can be used to compute
the weights and thresholds.
Finding Weights Analytically for the AND Network
We have two weights w1 and w2 and the threshold θ, and for each
training pattern we need to satisfy
So the training data lead to four inequalities:
It is easy to see that there are an infinite number of solutions. Similarly,
there are an infinite number of solutions for the NOT and OR networks.
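As a quick check, one of the infinitely many AND solutions found by inspection (the values w1 = w2 = 1 and θ = 1.5 are just an assumed example) can be verified against all four training patterns with a few lines of Python:

```python
# Sketch: verify an assumed AND solution (w1 = w2 = 1, theta = 1.5)
w1, w2, theta = 1.0, 1.0, 1.5
patterns = [(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 1)]   # (in1, in2, target)

for in1, in2, target in patterns:
    out = 1 if w1 * in1 + w2 * in2 > theta else 0   # threshold unit
    assert out == target
print("w1 = w2 = 1, theta = 1.5 satisfies all four AND inequalities")
```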
Limitations of Simple Perceptrons
We can follow the same procedure for the XOR network:
Clearly the second and third inequalities are incompatible with the
fourth, so there is in fact no solution. We need more complex networks,
e.g. that combine together many simple networks, or use different
activation/thresholding/transfer functions.
It then becomes much more difficult to determine all the weights and
thresholds by hand.
These weights instead are adapted using learning rules. Hence, need to
consider learning rules (see previous lecture), and more complex
architectures.
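The incompatibility of the XOR inequalities can also be illustrated numerically: a brute-force search over a grid of candidate weights and thresholds (a sketch, not a proof — the proof is the incompatible inequalities above) finds no single threshold unit that reproduces XOR.

```python
import itertools
import numpy as np

xor = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]   # (in1, in2, XOR target)
grid = np.linspace(-2, 2, 21)                        # candidate parameter values

found = False
for w1, w2, theta in itertools.product(grid, grid, grid):
    if all((1 if w1 * a + w2 * b > theta else 0) == t for a, b, t in xor):
        found = True
        break
print("single-unit XOR solution found:", found)      # expected: False
```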
E.g. Decision Surface of a Perceptron
[Figures: a non-linearly separable (XOR-like) arrangement of + and - points in the (x1, x2) plane, and a linearly separable arrangement.]
• Perceptron is able to represent some useful functions
• But functions that are not linearly separable (e.g. XOR) are not representable
What is classification?
Classification ? http://140.122.185.120
Pattern classification/recognition
- Assign the input data (a physical object, event, or phenomenon)
to one of the pre-specified classes (categories)
The block diagram of the recognition and classification system
Classification: an example
• Automate the process of sorting incoming fish on
a conveyor belt according to species (Salmon or
Sea bass).
Set up a camera
Take some sample images
Note the physical differences between the two types
of fish
Length
Lightness
Width
No. & shape of fins
Position of the mouth
http://webcourse.technion.ac.il/236607/Winter2002-2003/en/ho.htm
Duda & Hart, Chapter 1
Classification: an example…
Classification: an example…
• Cost of misclassification: depends on application
Is it better to misclassify salmon as bass or vice versa?
Put salmon in a can of bass ⇒ lose profit
Put bass in a can of salmon ⇒ lose customers
There is a cost associated with our decision.
Make a decision to minimize a given cost.
• Feature Extraction:
Problem & Domain dependent
Requires knowledge of the domain
A good feature extractor would make the job of the
classifier trivial.
Bayesian decision theory
Bayesian Decision Theory
http://webcourse.technion.ac.il/236607/Winter2002-2003/en/ho.html
Duda & Hart, Chapter 2
Bayesian decision theory is a fundamental statistical approach
to the problem of pattern classification.
Decision making when all the probabilistic
information is known.
For given probabilities the decision is optimal.
When new information is added, it is assimilated in
optimal fashion for improvement of decisions.
Bayesian Decision Theory …
Fish Example:
Each fish is in one of 2 states: sea bass or salmon
Let ω denote the state of nature
ω = ω1 for sea bass
ω = ω2 for salmon
Bayesian Decision Theory …
The state of nature is unpredictable ⇒ ω is a
variable that must be described probabilistically.
If the catch produced as much salmon as sea bass ⇒
the next fish is equally likely to be sea bass or
salmon.
Define
P(ω1 ) : a priori probability that the next fish is sea bass
P(ω2 ): a priori probability that the next fish is salmon.
Bayesian Decision Theory …
If other types of fish are irrelevant:
P( ω1 ) + P( ω2 ) = 1.
Prior probabilities reflect our prior
knowledge (e.g. time of year, fishing area, …)
Simple decision Rule:
Make a decision without seeing the fish.
Decide ω1 if P(ω1) > P(ω2); ω2 otherwise.
OK if deciding for one fish
If several fish, all assigned to same class.
Bayesian Decision Theory ...
In general, we will have some features and
more information.
Feature: lightness measurement = x
Different fish yield different lightness readings
(x is a random variable)
Bayesian Decision Theory ….
Define
p(x|ω1) = Class Conditional Probability Density
Probability density function for x given that the
state of nature is ω1
The difference between p(x|ω1 ) and p(x|ω2 ) describes the
difference in lightness between sea bass and salmon.
Class conditioned probability density: p(x|ω)
Hypothetical class-conditional probability density functions.
The density functions are normalized (the area under each curve is 1.0).
Suppose that we know
The prior probabilities P(ω1) and P(ω2),
The conditional densities p(x|ω1) and p(x|ω2), and
Measure the lightness of a fish = x.
What is the category of the fish?
Bayesian Decision Theory ...
Bayes Formula
Given
– Prior probabilities P(ωj)
– Conditional probabilities p(x| ωj)
Measurement of particular item
– Feature value x
Bayes formula:
P(\omega_j \mid x) = \frac{p(x \mid \omega_j)\,P(\omega_j)}{p(x)}
(from p(x, \omega_j) = p(x \mid \omega_j)\,P(\omega_j) = P(\omega_j \mid x)\,p(x))
where
p(x) = \sum_i p(x \mid \omega_i)\,P(\omega_i)
so that
\sum_i P(\omega_i \mid x) = 1
Posterior = (Likelihood × Prior) / Evidence
Bayes' formula ...
• p(x|ωj ) is called the likelihood of ωj with
respect to x.
(the ωj category for which p(x|ωj ) is large is more
"likely" to be the true category)
•p(x) is the evidence
how frequently we will measure a pattern with
feature value x.
Scale factor that guarantees that the posterior
probabilities sum to 1.
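A small numerical sketch of Bayes' formula for the fish example: the priors 2/3 and 1/3 follow the slide below, while the Gaussian class-conditional densities (their means and standard deviations) are assumptions made only for illustration.

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

priors = {1: 2 / 3, 2: 1 / 3}                 # P(w1), P(w2) as on the slides
params = {1: (3.0, 1.0), 2: (6.0, 1.0)}       # assumed (mean, std) of p(x|wj)

def posteriors(x):
    likelihoods = {j: gauss_pdf(x, *params[j]) for j in priors}
    evidence = sum(likelihoods[j] * priors[j] for j in priors)   # p(x)
    return {j: likelihoods[j] * priors[j] / evidence for j in priors}

post = posteriors(4.0)                        # lightness measurement x = 4.0
print(post, "sum =", round(sum(post.values()), 6))   # posteriors sum to 1
```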
Posterior Probability
Posterior probabilities for the particular priors P(ω1)=2/3 and P(ω2)=1/3.
At every x the posteriors sum to 1.
Error
P(\mathrm{error} \mid x) =
\begin{cases}
P(\omega_1 \mid x) & \text{if we decide } \omega_2 \\
P(\omega_2 \mid x) & \text{if we decide } \omega_1
\end{cases}
For a given x, we can minimize the
probability of error by deciding ω1 if P(ω1|x)
> P(ω2|x) and ω2 otherwise.
Bayes' Decision Rule
(Minimizes the probability of error)
ω1 : if P(ω1|x) > P(ω2|x) i.e.
ω2 : otherwise
or
ω1 : if P ( x |ω1) P(ω1) > P(x|ω2) P(ω2)
ω2 : otherwise
and
P(Error|x) = min [P(ω1|x) , P(ω2|x)]
P(\omega_1 \mid x) \;\underset{\omega_2}{\overset{\omega_1}{\gtrless}}\; P(\omega_2 \mid x)
Likelihood ratio
p(x \mid \omega_1)\,P(\omega_1) \;\underset{\omega_2}{\overset{\omega_1}{\gtrless}}\; p(x \mid \omega_2)\,P(\omega_2)
\;\Leftrightarrow\;
\frac{p(x \mid \omega_1)}{p(x \mid \omega_2)} \;\underset{\omega_2}{\overset{\omega_1}{\gtrless}}\; \frac{P(\omega_2)}{P(\omega_1)}
The ratio P(ω2)/P(ω1) plays the role of the decision threshold.
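The same decision can be written as a likelihood-ratio test against the threshold P(ω2)/P(ω1); this sketch reuses the made-up Gaussian densities and priors assumed in the earlier example.

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

P1, P2 = 2 / 3, 1 / 3               # priors
m1, m2, s = 3.0, 6.0, 1.0           # assumed class-conditional parameters

def decide(x):
    ratio = gauss_pdf(x, m1, s) / gauss_pdf(x, m2, s)   # p(x|w1) / p(x|w2)
    return "w1" if ratio > P2 / P1 else "w2"            # compare to P(w2)/P(w1)

for x in (2.0, 4.4, 4.6, 7.0):
    print(x, "->", decide(x))
```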
Decision Boundaries
Classification as division of feature space into
non-overlapping regions
Boundaries between these regions are known
as decision surfaces or decision
boundaries
X_1, \ldots, X_K \;\text{ such that }\; x \in X_k \;\leftrightarrow\; x \text{ assigned to } \omega_k
Optimum decision boundaries
Criterion:
– minimize misclassification
– maximize correct classification
Classify x \in X_k \text{ if } \forall j \neq k:\;
p(x \mid \omega_k)\,P(\omega_k) > p(x \mid \omega_j)\,P(\omega_j)
i.e. P(\omega_k \mid x) > P(\omega_j \mid x)  (maximum posterior probability)
P(\mathrm{correct}) = \sum_{k=1}^{R} P(x \in X_k,\, \omega_k) = \sum_{k=1}^{R} P(x \in X_k \mid \omega_k)\,P(\omega_k)
Here R = 2.
Discriminant functions
Discriminant functions determine
classification by comparison of their values:
Optimum classification: based on posterior
probability
Any monotone function g may be applied
without changing the decision boundaries
Classify x \in X_k \text{ if } \forall j \neq k:\; g_k(x) > g_j(x)
g_k(x) = g\!\left(P(\omega_k \mid x)\right), \quad \text{e.g. } g_k(x) = \ln P(\omega_k \mid x)
The Two-Category Case
Use 2 discriminant functions g1 and g2, and assign x to ω1 if
g1 > g2.
Alternative: define a single discriminant function g(x) = g1(x) -
g2(x), decide ω1 if g(x)>0, otherwise decide ω2.
Two category case
g(\mathbf{x}) = P(\omega_1 \mid \mathbf{x}) - P(\omega_2 \mid \mathbf{x})
g(\mathbf{x}) = \ln \frac{p(\mathbf{x} \mid \omega_1)}{p(\mathbf{x} \mid \omega_2)} + \ln \frac{P(\omega_1)}{P(\omega_2)}
Summary
Bayes approach:
– Estimate class-conditioned probability density
– Combine with prior class probability
– Determine posterior class probability
– Derive decision boundaries
Alternate approach implemented by NN
– Estimate posterior probability directly
– i.e. determine decision boundaries directly
DISCRIMINANT FUNCTIONS
Discriminant Functions http://140.122.185.120
Determine the membership in a category by the
classifier based on the comparison of R discriminant
functions g1(x), g2(x),…, gR(x)
– When x is within the region Xk if gk(x) has the largest
value
Do not confuse n = dim of each I/P vector (dim of the feature space), P = # of I/P
vectors, and R = # of classes.
Discriminant Functions…
[Figures: several slides of figures illustrating classification by comparison of the R discriminant functions.]
Linear Machine and Minimum Distance
Classification
• Find the linear-form discriminant function for two class
classification when the class prototypes are known
• Example 3.1: Select the decision hyperplane that
contains the midpoint of the line segment connecting
center point of two classes
Linear Machine and Minimum Distance
Classification… (dichotomizer)
•The dichotomizer’s discriminant function g(x):
g(\mathbf{x}) = (\mathbf{x}_1 - \mathbf{x}_2)^t\,\mathbf{x} + \tfrac{1}{2}\left(\|\mathbf{x}_2\|^2 - \|\mathbf{x}_1\|^2\right)
where x1 and x2 are the prototype (center) vectors of the two classes.
Linear Machine and Minimum Distance
Classification…(multiclass classification)
•The linear-form discriminant functions for multiclass
classification
– There are up to R(R-1)/2 decision hyperplanes for R
pairwise separable classes
(i.e. next to or touching another)
Linear Machine and Minimum Distance
Classification… (multiclass classification)
•Linear machine or minimum-distance classifier
– Assume the class prototypes are known for all
classes
• Euclidean distance between the input pattern x and the
center of class i, xi:
\|\mathbf{x} - \mathbf{x}_i\| = \left[(\mathbf{x} - \mathbf{x}_i)^t(\mathbf{x} - \mathbf{x}_i)\right]^{1/2}
Choosing the nearest prototype is equivalent to using the linear discriminant functions
g_i(\mathbf{x}) = \mathbf{x}_i^t\,\mathbf{x} - \tfrac{1}{2}\,\mathbf{x}_i^t\,\mathbf{x}_i, \qquad i = 1, \ldots, R
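Assuming the minimum-distance discriminant g_i(x) = x_i^t x − ½ x_i^t x_i given above, a minimal linear-machine sketch with made-up class prototypes shows that picking the largest discriminant is the same as picking the nearest prototype.

```python
import numpy as np

prototypes = np.array([[2.0, 5.0],     # made-up class center (prototype) vectors
                       [-1.0, -3.0],
                       [5.0, -1.0]])

def g(x, xi):
    return xi @ x - 0.5 * xi @ xi      # g_i(x) = x_i^t x - 1/2 x_i^t x_i

def classify(x):
    return int(np.argmax([g(x, xi) for xi in prototypes])) + 1

x = np.array([4.0, 0.0])
print("largest discriminant:", classify(x))
print("nearest prototype:   ",
      int(np.argmin(np.linalg.norm(prototypes - x, axis=1))) + 1)
```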
Linear Machine and Minimum Distance
Classification… (multiclass classification)
Linear Machine and Minimum Distance
Classification…
Note: to find S12 we need to compute (g1 - g2).
P1, P2, P3 are the centres of gravity of the prototype points; we need to design a minimum distance classifier.
Using the formulas from the previous slide, we get the weights wi.
Linear Machine and Minimum Distance
Classification…
•If R linear discriminant functions exist for a set of
patterns such that
g_i(\mathbf{x}) > g_j(\mathbf{x}) \quad \text{for } \mathbf{x} \in \text{Class } i, \qquad i, j = 1, 2, \ldots, R, \;\; j \neq i
•then the classes are linearly separable.
Linear Machine and Minimum Distance
Classification… Example:
Linear Machine and Minimum Distance
Classification… Example…
Linear Machine and Minimum Distance
Classification…
•Examples 3.1 and 3.2 have shown that the coefficients
(weights) of the linear discriminant functions can be
determined if the a priori information about the sets of
patterns and their class membership is known
•In the next section (Discrete perceptron) we will
examine neural networks that derive their weights during
the learning cycle.
Linear Machine and Minimum Distance
Classification…
•The example of linearly non-separable patterns
Linear Machine and Minimum Distance
Classification…
Input space (x) and image space (o):
o_1 = \mathrm{sgn}(x_1 + x_2 + 1)
o_2 = \mathrm{sgn}(-x_1 - x_2 + 1)

 x1   x2   |   o1   o2
  1    1   |    1   -1
  1   -1   |    1    1
 -1    1   |    1    1
 -1   -1   |   -1    1

The two inputs (1, -1) and (-1, 1) map to the same point (1, 1) in the image space.
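The mapping table can be reproduced directly; the run below prints the image point for each input and shows (1, -1) and (-1, 1) collapsing onto the same point (1, 1).

```python
import numpy as np

for x1, x2 in [(1, 1), (1, -1), (-1, 1), (-1, -1)]:
    o1 = int(np.sign(x1 + x2 + 1))      # o1 = sgn(x1 + x2 + 1)
    o2 = int(np.sign(-x1 - x2 + 1))     # o2 = sgn(-x1 - x2 + 1)
    print(f"x = ({x1:2d}, {x2:2d})  ->  image point (o1, o2) = ({o1:2d}, {o2:2d})")
```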
The Discrete Perceptron
Discrete Perceptron Training Algorithm
• So far, we have shown that coefficients of linear
discriminant functions called weights can be
determined based on a priori information about sets of
patterns and their class membership.
•In what follows, we will begin to examine neural
network classifiers that derive their weights during the
learning cycle.
•The sample pattern vectors x1, x2, …, xp, called the
training sequence, are presented to the machine along
with the correct response.
Discrete Perceptron Training Algorithm
- Geometrical Representations http://140.122.185.120
Zurada, Chapter 3
(Intersects the origin
point w=0)
5 prototype patterns in this case: y1, y2, …y5
If the dimension of the augmented pattern vector is > 3, our powers of visualization are no longer of assistance. In this case,
the only recourse is to use the analytical approach.
Discrete Perceptron Training Algorithm
- Geometrical Representations…
•Devise an analytic approach based on the geometrical
representations
– E.g. the decision surface for the training pattern y1, drawn in the weight space
The gradient (the direction of steepest increase) of wᵗy1 with respect to w is
\nabla_{\mathbf{w}}\left(\mathbf{w}^t\mathbf{y}_1\right) = \mathbf{y}_1
If y1 is in Class 1 (and is misclassified): \mathbf{w}' = \mathbf{w} + c\,\mathbf{y}_1
If y1 is in Class 2 (and is misclassified): \mathbf{w}' = \mathbf{w} - c\,\mathbf{y}_1
(correction in the negative gradient direction; see previous slide)
c (> 0) is the correction increment (it is two times the learning constant ρ introduced before); c controls the size of the adjustment.
[Figures: the two cases sketched in the weight space.]
Discrete Perceptron Training Algorithm
- Geometrical Representations…
Discrete Perceptron Training Algorithm
- Geometrical Representations…
Note 1: p = distance from the current weight vector to the decision plane wᵗy = 0, so p > 0.
Note 2: c is not constant and depends on the current training pattern, as expressed by the equation
c = \frac{\left|\mathbf{w}^{1\,t}\mathbf{y}\right|}{\mathbf{y}^t\mathbf{y}} = \frac{p}{\|\mathbf{y}\|}
Discrete Perceptron Training Algorithm
- Geometrical Representations…
•For fixed correction rule: c=constant, the correction of
weights is always the same fixed portion of the current
training vector
– The weight can be initialised at any value
•For dynamic correction rule: c depends on the distance
from the weight (i.e. the weight vector) to the decision
surface in the weight space. Hence
– The initial weight should be different from 0.
(if w1=0, then cy =0 and w’=w1+cy=0, therefore no possible adjustments).
Current input
pattern
Current weight
Discrete Perceptron Training Algorithm
- Geometrical Representations…
•Dynamic correction rule: Using the value of c from previous slide as
a reference, we devise an adjustment technique which depends on
the length w2-w1
Note: λ is the ratio of the distance
between the old weight vector w1
and the new w2, to the distance
from w1 to the pattern hyperplane
λ=2: Symmetrical reflection w.r.t decision plane
λ=0: No weight adjustment
Discrete Perceptron Training Algorithm
- Geometrical Representations…
•Example:
x_1 = 1, \; x_3 = 3, \; d_1 = d_3 = 1 : \text{class 1}
x_2 = -0.5, \; x_4 = -2, \; d_2 = d_4 = -1 : \text{class 2}
•The augmented input vectors are:
\mathbf{y}_1 = \begin{bmatrix} 1 \\ 1 \end{bmatrix}, \;
\mathbf{y}_2 = \begin{bmatrix} -0.5 \\ 1 \end{bmatrix}, \;
\mathbf{y}_3 = \begin{bmatrix} 3 \\ 1 \end{bmatrix}, \;
\mathbf{y}_4 = \begin{bmatrix} -2 \\ 1 \end{bmatrix}
•The decision lines \mathbf{w}^t\mathbf{y}_i = 0, for i = 1, 2, 3, 4, are sketched
on the augmented weight space as follows:
Discrete Perceptron Training Algorithm
- Geometrical Representations…
Discrete Perceptron Training Algorithm
- Geometrical Representations…
For c = 1 and \mathbf{w}^1 = [-2.5 \;\; 1.75]^t:
•The weight training at each step can be summarized as follows:
\mathbf{w}' = \mathbf{w} \pm c\,\mathbf{y}, \qquad
\Delta\mathbf{w}^k = \frac{c}{2}\left[d^k - \mathrm{sgn}\!\left(\mathbf{w}^{k\,t}\mathbf{y}^k\right)\right]\mathbf{y}^k
•We obtain the following outputs and weight updates:
•Step 1: Pattern y1 is input
o^1 = \mathrm{sgn}\!\left([-2.5 \;\; 1.75]\begin{bmatrix} 1 \\ 1 \end{bmatrix}\right) = -1, \qquad d^1 - o^1 = 2 \neq 0
\mathbf{w}^2 = \mathbf{w}^1 + \mathbf{y}^1 = \begin{bmatrix} -1.5 \\ 2.75 \end{bmatrix}
Discrete Perceptron Training Algorithm
- Geometrical Representations…
•Step 2: Pattern y2 is input
o^2 = \mathrm{sgn}\!\left([-1.5 \;\; 2.75]\begin{bmatrix} -0.5 \\ 1 \end{bmatrix}\right) = 1, \qquad d^2 - o^2 = -2 \neq 0
\mathbf{w}^3 = \mathbf{w}^2 - \mathbf{y}^2 = \begin{bmatrix} -1 \\ 1.75 \end{bmatrix}
•Step 3: Pattern y3 is input
o^3 = \mathrm{sgn}\!\left([-1 \;\; 1.75]\begin{bmatrix} 3 \\ 1 \end{bmatrix}\right) = -1, \qquad d^3 - o^3 = 2 \neq 0
\mathbf{w}^4 = \mathbf{w}^3 + \mathbf{y}^3 = \begin{bmatrix} 2 \\ 2.75 \end{bmatrix}
Discrete Perceptron Training Algorithm
- Geometrical Representations…
• Since we have no evidence of correct classification of
weight w4, the training set, consisting of the ordered
sequence of patterns y1, y2, y3 and y4, needs to be recycled.
We thus have y5 = y1, y6 = y2, etc. (the superscript is used
to denote the training step number).
•Steps 4, 5: w6 = w5 = w4 (no misclassification, thus no
weight adjustments).
•You can check that the adjustments in steps 6
through 10 are as follows:
\mathbf{w}^7 = [2.5 \;\; 1.75]^t
\mathbf{w}^{10} = \mathbf{w}^9 = \mathbf{w}^8 = \mathbf{w}^7
\mathbf{w}^{11} = [3 \;\; 0.75]^t
w11 is in the solution area.
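The whole run above can be reproduced with a short loop; the update Δw = (c/2)[d − sgn(wᵗy)]y, the four augmented patterns and the initial weight follow the example as reconstructed here, so treat the printed sequence as a sketch to compare against the slides.

```python
import numpy as np

Y = np.array([[1.0, 1.0], [-0.5, 1.0], [3.0, 1.0], [-2.0, 1.0]])  # augmented patterns
d = np.array([1.0, -1.0, 1.0, -1.0])                              # desired outputs

w = np.array([-2.5, 1.75])       # initial weight w^1 from the example
c = 1.0                          # correction increment

step = 0
while True:
    errors = 0
    for y, target in zip(Y, d):
        step += 1
        o = np.sign(w @ y)
        if o != target:
            errors += 1
        w = w + (c / 2.0) * (target - o) * y     # discrete perceptron rule
        print(f"step {step:2d}: w = {w}")
    if errors == 0:              # stop after a full error-free cycle
        break
```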
The Continuous Perceptron
Continuous Perceptron Training Algorithm
http://140.122.185.120
Zurada, Chapter 3
•Replace the TLU (Threshold Logic Unit) with the
sigmoid activation function for two reasons:
– Gain finer control over the training procedure
– Facilitate the differential characteristics to enable
computation of the error gradient
(of current
error function)
The factor ½ does not affect the location of
the error minimum
Continuous Perceptron Training Algorithm…
•The new weights are obtained by moving in the direction
of the negative gradient along the multidimensional error
surface.
By definition of the steepest descent concept,
each elementary move should be
perpendicular to the current error contour.
Continuous Perceptron Training Algorithm…
•Define the error as the squared difference between the
desired output and the actual output
Since \mathrm{net} = \mathbf{w}^t\mathbf{y}, we have
\frac{\partial\,\mathrm{net}}{\partial w_i} = y_i, \qquad i = 1, 2, \ldots, n+1
This yields the training rule of the continuous perceptron
(equivalent to the delta training rule).
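For comparison, a sketch of the corresponding delta-rule (continuous perceptron) training on the same patterns, using the bipolar sigmoid with λ = 1 so that f'(net) = (1 − o²)/2; the learning rate and number of epochs are assumptions.

```python
import numpy as np

Y = np.array([[1.0, 1.0], [-0.5, 1.0], [3.0, 1.0], [-2.0, 1.0]])
d = np.array([1.0, -1.0, 1.0, -1.0])

w = np.array([-2.5, 1.75])      # same initial weight as the discrete example
eta, lam = 0.5, 1.0             # assumed learning rate; lambda = 1 as on the slides

def f(net):
    return 2.0 / (1.0 + np.exp(-lam * net)) - 1.0   # bipolar sigmoid

for epoch in range(200):
    for y, target in zip(Y, d):
        o = f(w @ y)
        # delta rule: dw = eta*(d - o)*f'(net)*y, with f'(net) = lam*(1 - o**2)/2
        w = w + eta * (target - o) * (lam / 2.0) * (1.0 - o ** 2) * y

print("trained weights:", w)
print("outputs:", np.round(f(Y @ w), 3), "targets:", d)
```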
Continuous Perceptron Training Algorithm…
Continuous Perceptron Training Algorithm…
Same as previous example (of discrete perceptron) but with a
continuous activation function and using the delta rule.
Same training pattern set as
discrete perceptron example
Continuous Perceptron Training Algorithm…
The error at step k, with the bipolar sigmoid activation, is
E_k = \frac{1}{2}\left[d_k - \left(\frac{2}{1+\exp(-\lambda\,\mathrm{net}_k)} - 1\right)\right]^2
For the first pattern,
E_1(\mathbf{w}) = \frac{1}{2}\left[1 - \left(\frac{2}{1+\exp\!\left(-\lambda(w_1+w_2)\right)} - 1\right)\right]^2
and for λ = 1, reducing the terms simplifies this expression to the following form:
E_1(\mathbf{w}) = \frac{2}{\left[1+\exp(w_1+w_2)\right]^2}
Similarly,
E_2(\mathbf{w}) = \frac{2}{\left[1+\exp(0.5\,w_1 - w_2)\right]^2}
E_3(\mathbf{w}) = \frac{2}{\left[1+\exp(3\,w_1 + w_2)\right]^2}
E_4(\mathbf{w}) = \frac{2}{\left[1+\exp(2\,w_1 - w_2)\right]^2}
These error surfaces are as shown on the previous slide.
[Figure: total error surface over (w1, w2) with its minimum marked.]
Multicategory SLP
Multi-category Single layer Perceptron nets
•Treat the last fixed component of input pattern vector as
the neuron activation threshold, T = wn+1, with
yn+1 = -1 (it is irrelevant whether it
is equal to +1 or -1)
Multi-category Single layer Perceptron nets…
• R-category linear classifier using R discrete bipolar
perceptrons
– Goal: The i-th TLU response of +1 is indicative of
class i and all other TLUs respond with -1
Multi-category Single layer Perceptron nets…
•Example 3.5
[Figure: three-class example; the indicated response should be (-1, -1, 1)ᵗ.]
Indecision regions = regions
where no class membership of
an input pattern can be
uniquely determined based on
the response of the classifier
(patterns in shaded areas are
not assigned any reasonable
classification. E.g. point Q for
which o=[1 1 –1]t => indecisive
response). However no
patterns such as Q have been
used for training in the
example.
Multi-category Single layer Perceptron nets…
For c = 1 and the initial weights
\mathbf{w}_1^1 = [1 \;\; -2 \;\; 0]^t, \quad \mathbf{w}_2^1 = [0 \;\; -1 \;\; 2]^t, \quad \mathbf{w}_3^1 = [1 \;\; 3 \;\; -1]^t
•Step 1: Pattern \mathbf{y}^1 = [10 \;\; 2 \;\; -1]^t is input
\mathrm{sgn}\!\left([1 \;\; -2 \;\; 0]\,\mathbf{y}^1\right) = 1, \quad
\mathrm{sgn}\!\left([0 \;\; -1 \;\; 2]\,\mathbf{y}^1\right) = -1, \quad
\mathrm{sgn}\!\left([1 \;\; 3 \;\; -1]\,\mathbf{y}^1\right) = 1^{*}
Since the only incorrect response (marked *) is provided by TLU3, we have
\mathbf{w}_1^2 = \mathbf{w}_1^1, \quad \mathbf{w}_2^2 = \mathbf{w}_2^1, \quad
\mathbf{w}_3^2 = \mathbf{w}_3^1 - \mathbf{y}^1 = [-9 \;\; 1 \;\; 0]^t
Multi-category Single layer Perceptron nets…
•Step 2: Pattern \mathbf{y}^2 = [2 \;\; -5 \;\; -1]^t is input
\mathrm{sgn}\!\left([1 \;\; -2 \;\; 0]\,\mathbf{y}^2\right) = 1^{*}, \quad
\mathrm{sgn}\!\left([0 \;\; -1 \;\; 2]\,\mathbf{y}^2\right) = 1, \quad
\mathrm{sgn}\!\left([-9 \;\; 1 \;\; 0]\,\mathbf{y}^2\right) = -1
This time the only incorrect response is provided by TLU1, so
\mathbf{w}_1^3 = \mathbf{w}_1^2 - \mathbf{y}^2 = [-1 \;\; 3 \;\; 1]^t, \quad
\mathbf{w}_2^3 = \mathbf{w}_2^2, \quad \mathbf{w}_3^3 = \mathbf{w}_3^2
Multi-category Single layer Perceptron nets…
•Step 3: Pattern \mathbf{y}^3 = [-5 \;\; 5 \;\; -1]^t is input
\mathrm{sgn}\!\left(\mathbf{w}_1^{3\,t}\mathbf{y}^3\right) = 1^{*}, \quad
\mathrm{sgn}\!\left(\mathbf{w}_2^{3\,t}\mathbf{y}^3\right) = -1, \quad
\mathrm{sgn}\!\left(\mathbf{w}_3^{3\,t}\mathbf{y}^3\right) = 1
One can verify that the only adjusted weights from now on are those of TLU1:
\mathbf{w}_1^4 = \mathbf{w}_1^3 - \mathbf{y}^3 = [4 \;\; -2 \;\; 2]^t, \quad
\mathbf{w}_2^4 = \mathbf{w}_2^3, \quad \mathbf{w}_3^4 = \mathbf{w}_3^3
• During the second cycle:
\mathbf{w}_1^5 = \mathbf{w}_1^4, \quad
\mathbf{w}_1^6 = [2 \;\; 3 \;\; 3]^t, \quad
\mathbf{w}_1^7 = [7 \;\; -2 \;\; 4]^t
and then
\mathbf{w}_1^8 = \mathbf{w}_1^7, \quad
\mathbf{w}_1^9 = [5 \;\; 3 \;\; 5]^t
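The R-category run can be reproduced with one TLU per class; the patterns, desired responses and initial weights follow the numbers reconstructed above, so this is a sketch for checking the hand computation rather than a definitive restatement of Example 3.5.

```python
import numpy as np

Y = np.array([[10.0, 2.0, -1.0],    # class 1 pattern (augmented)
              [2.0, -5.0, -1.0],    # class 2 pattern
              [-5.0, 5.0, -1.0]])   # class 3 pattern
D = 2 * np.eye(3) - 1               # desired: +1 on the pattern's own TLU, -1 elsewhere

W = np.array([[1.0, -2.0, 0.0],     # initial w1, w2, w3
              [0.0, -1.0, 2.0],
              [1.0, 3.0, -1.0]])
c = 1.0

for cycle in range(20):
    changed = False
    for y, d in zip(Y, D):
        o = np.sign(W @ y)                          # responses of the three TLUs
        W = W + (c / 2.0) * np.outer(d - o, y)      # per-TLU discrete perceptron rule
        changed = changed or not np.array_equal(o, d)
    if not changed:
        break

print("final weights:\n", W)
print("responses (should be +1 on the diagonal):\n", np.sign(W @ Y.T))
```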
Multi-category Single layer Perceptron nets…
•R-category linear classifier using R continuous bipolar
perceptrons
Comparison between Perceptron and
Bayes’ Classifier
Perceptron operates on the premise that the patterns to be
classified are linearly separable (otherwise the training algorithm will
oscillate), while the Bayes classifier can work on non-separable
patterns
The Bayes classifier minimizes the probability of misclassification;
this minimization is independent of the underlying distribution
Bayes classifier is a linear classifier on the assumption of
Gaussianity
The perceptron is non-parametric, while Bayes classifier is
parametric (its derivation is contingent on the assumption of the
underlying distributions)
The perceptron is adaptive and simple to implement
the Bayes’ classifier could be made adaptive but at the expense of
increased storage and more complex computations
APPENDIX A
Unconstrained Optimization
Techniques
Unconstrained Optimization Techniques
http://ai.kaist.ac.kr/~jkim/cs679/
Haykin, Chapter 3
Cost function E(w)
– continuously differentiable
– a measure of how to choose w of an adaptive
filtering algorithm so that it behaves in an optimum
manner
We want to find an optimal solution w* that minimizes
E(w), i.e.
\nabla E(\mathbf{w}^*) = \mathbf{0}
– local iterative descent:
starting with an initial guess denoted by w(0),
generate a sequence of weight vectors w(1), w(2),
…, such that the cost function E(w) is reduced at
each iteration of the algorithm, as shown by
E(\mathbf{w}(n+1)) < E(\mathbf{w}(n))
– Steepest Descent, Newton’s, Gauss-Newton’s
methods
Method of Steepest Descent
Here the successive adjustments applied to w
are in the direction of steepest descent, that
is, in a direction opposite to the gradient of E(w):
\mathbf{w}(n+1) = \mathbf{w}(n) - \eta\,\mathbf{g}(n)
\eta: small positive constant called the step size or
learning-rate parameter.
\mathbf{g}(n) = \nabla E(\mathbf{w}(n))
The method of steepest descent converges to
the optimal solution w* slowly.
The learning-rate parameter \eta has a profound
influence on its convergence behavior
– overdamped, underdamped, or even
unstable (diverges)
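A minimal steepest-descent sketch on a made-up quadratic cost E(w) = ½ wᵀAw − bᵀw illustrates the sensitivity to the learning-rate parameter (the matrix, vector and step sizes are assumptions).

```python
import numpy as np

A = np.array([[3.0, 0.5], [0.5, 1.0]])    # assumed positive-definite Hessian
b = np.array([1.0, -2.0])

def grad(w):
    return A @ w - b                      # g = dE/dw for E(w) = 1/2 w'Aw - b'w

for eta in (0.05, 0.3, 0.8):              # small, moderate, too-large step size
    w = np.zeros(2)
    for _ in range(100):
        w = w - eta * grad(w)             # w(n+1) = w(n) - eta * g(n)
    print(f"eta = {eta}: w = {w}, |g| = {np.linalg.norm(grad(w)):.2e}")
```

With the assumed A, the largest of the three step sizes exceeds 2 divided by the largest eigenvalue of the Hessian, so that run diverges — the unstable case mentioned above.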
Newton’s Method
Using a second-order Taylor series expansion of the
cost function around the point w(n):
\Delta E(\mathbf{w}(n)) = E(\mathbf{w}(n+1)) - E(\mathbf{w}(n))
\approx \mathbf{g}^T(n)\,\Delta\mathbf{w}(n) + \tfrac{1}{2}\,\Delta\mathbf{w}^T(n)\,\mathbf{H}(n)\,\Delta\mathbf{w}(n)
where \Delta\mathbf{w}(n) = \mathbf{w}(n+1) - \mathbf{w}(n) and
\mathbf{H}(n) is the Hessian matrix of E(n).
We want the \Delta\mathbf{w}^*(n) that minimizes \Delta E(\mathbf{w}(n)), so
differentiate \Delta E(\mathbf{w}(n)) with respect to \Delta\mathbf{w}(n):
\mathbf{g}(n) + \mathbf{H}(n)\,\Delta\mathbf{w}^*(n) = \mathbf{0}
so,
\Delta\mathbf{w}^*(n) = -\mathbf{H}^{-1}(n)\,\mathbf{g}(n)
Newton’s Method…
Finally,
\mathbf{w}(n+1) = \mathbf{w}(n) + \Delta\mathbf{w}(n) = \mathbf{w}(n) - \mathbf{H}^{-1}(n)\,\mathbf{g}(n)
Newton’s method converges quickly
asymptotically and does not exhibit the
zigzagging behavior.
The Hessian \mathbf{H}(n) has to be a positive definite
matrix for all n.
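For the same made-up quadratic cost, a single Newton step w(n+1) = w(n) − H⁻¹(n)g(n) lands exactly on the minimum, since the second-order Taylor expansion is then exact.

```python
import numpy as np

A = np.array([[3.0, 0.5], [0.5, 1.0]])   # Hessian H of the assumed quadratic cost
b = np.array([1.0, -2.0])

w = np.zeros(2)
g = A @ w - b                            # gradient at w
w = w - np.linalg.solve(A, g)            # w(n+1) = w(n) - H^{-1} g(n)
print("after one Newton step:", w, " gradient:", A @ w - b)   # gradient ~ 0
```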
Gauss-Newton Method
The Gauss-Newton method is applicable to a
cost function that is the sum of error squares,
E(\mathbf{w}) = \frac{1}{2}\sum_{i=1}^{n} e^2(i)
Because the error signal e(i) is a function of w,
we linearize the dependence of e(i) on w by
writing
e'(i, \mathbf{w}) = e(i) + \left[\frac{\partial e(i)}{\partial \mathbf{w}}\right]^T_{\mathbf{w}=\mathbf{w}(n)}\,(\mathbf{w} - \mathbf{w}(n))
Equivalently, by using matrix notation we may
write
\mathbf{e}'(n, \mathbf{w}) = \mathbf{e}(n) + \mathbf{J}(n)\,(\mathbf{w} - \mathbf{w}(n))
Gauss-Newton Method…
where \mathbf{J}(n) is the n-by-m Jacobian matrix of
\mathbf{e}(n):
\mathbf{J}(n) =
\begin{bmatrix}
\partial e(1)/\partial w_1 & \cdots & \partial e(1)/\partial w_m \\
\vdots & & \vdots \\
\partial e(n)/\partial w_1 & \cdots & \partial e(n)/\partial w_m
\end{bmatrix}
We want the updated weight vector w(n+1)
defined by
\mathbf{w}(n+1) = \arg\min_{\mathbf{w}} \left\{\tfrac{1}{2}\,\|\mathbf{e}'(n, \mathbf{w})\|^2\right\}
A simple algebraic calculation tells us that
\tfrac{1}{2}\|\mathbf{e}'(n,\mathbf{w})\|^2 = \tfrac{1}{2}\|\mathbf{e}(n)\|^2 + \mathbf{e}^T(n)\,\mathbf{J}(n)\,(\mathbf{w}-\mathbf{w}(n)) + \tfrac{1}{2}\,(\mathbf{w}-\mathbf{w}(n))^T\mathbf{J}^T(n)\mathbf{J}(n)\,(\mathbf{w}-\mathbf{w}(n))
Now differentiate this expression with respect
to w and set the result to 0; we obtain
Gauss-Newton Method…
\mathbf{J}^T(n)\,\mathbf{e}(n) + \mathbf{J}^T(n)\mathbf{J}(n)\,(\mathbf{w} - \mathbf{w}(n)) = \mathbf{0}
Thus we get
\mathbf{w}(n+1) = \mathbf{w}(n) - \left(\mathbf{J}^T(n)\mathbf{J}(n)\right)^{-1}\mathbf{J}^T(n)\,\mathbf{e}(n)
To guard against the possibility that the matrix
product \mathbf{J}^T(n)\mathbf{J}(n) is singular, the customary practice
is
\mathbf{w}(n+1) = \mathbf{w}(n) - \left(\mathbf{J}^T(n)\mathbf{J}(n) + \delta\,\mathbf{I}\right)^{-1}\mathbf{J}^T(n)\,\mathbf{e}(n)
where \delta is a small positive constant.
This modification's effect is progressively reduced as
the number of iterations, n, is increased.
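A sketch of the regularized Gauss-Newton update on a made-up nonlinear least-squares problem (fitting the rate a in exp(a·t) to synthetic data); the model, the data and δ are assumptions for illustration only.

```python
import numpy as np

t = np.linspace(0.0, 1.0, 20)
d = np.exp(0.7 * t)                       # made-up "measurements"

w = np.array([0.0])                       # parameter a, initial guess
delta = 1e-6                              # small positive regularizer

for n in range(10):
    e = d - np.exp(w[0] * t)              # error signal e(i) = d(i) - f(i, w)
    J = -(t * np.exp(w[0] * t))[:, None]  # Jacobian of e with respect to w
    # w(n+1) = w(n) - (J'J + delta*I)^{-1} J' e
    w = w - np.linalg.solve(J.T @ J + delta * np.eye(1), J.T @ e)

print("estimated parameter:", w)          # should approach 0.7
```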
Linear Least-Squares Filter
The single neuron around which it is built is linear.
The cost function consists of the sum of error
squares.
Using y(i) = \mathbf{x}^T(i)\,\mathbf{w}(i) and e(i) = d(i) - y(i), the error
vector is
\mathbf{e}(n) = \mathbf{d}(n) - \mathbf{X}(n)\,\mathbf{w}(n)
Differentiating with respect to \mathbf{w}(n):
\nabla\mathbf{e}(n) = -\mathbf{X}^T(n), \qquad \text{correspondingly } \mathbf{J}(n) = -\mathbf{X}(n)
From the Gauss-Newton method (eq. 3.22),
\mathbf{w}(n+1) = \left(\mathbf{X}^T(n)\,\mathbf{X}(n)\right)^{-1}\mathbf{X}^T(n)\,\mathbf{d}(n) = \mathbf{X}^{+}(n)\,\mathbf{d}(n)
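The closed-form least-squares solution can be checked on made-up data generated from a known linear model plus a little noise; the data sizes and noise level are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                   # input vectors x(i) stored as rows
w_true = np.array([0.5, -1.0, 2.0])
d = X @ w_true + 0.01 * rng.normal(size=100)    # desired responses

w_ls = np.linalg.solve(X.T @ X, X.T @ d)        # w = (X'X)^{-1} X'd = X^+ d
print("least-squares weights:", w_ls)           # close to w_true
```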
LMS Algorithm
Based on the use of instantaneous values for the
cost function:
E(\mathbf{w}) = \tfrac{1}{2}\,e^2(n)
Differentiating with respect to \mathbf{w}:
\frac{\partial E(\mathbf{w})}{\partial \mathbf{w}} = e(n)\,\frac{\partial e(n)}{\partial \mathbf{w}}
The error signal in the LMS algorithm:
e(n) = d(n) - \mathbf{x}^T(n)\,\mathbf{w}(n)
hence,
\frac{\partial e(n)}{\partial \mathbf{w}(n)} = -\mathbf{x}(n)
so,
\frac{\partial E(\mathbf{w})}{\partial \mathbf{w}(n)} = -\mathbf{x}(n)\,e(n)
LMS Algorithm …
Using -\mathbf{x}(n)\,e(n) as an estimate for the gradient
vector,
\hat{\mathbf{g}}(n) = -\mathbf{x}(n)\,e(n)
Using this for the gradient vector of the steepest
descent method, the LMS algorithm follows:
\hat{\mathbf{w}}(n+1) = \hat{\mathbf{w}}(n) + \eta\,\mathbf{x}(n)\,e(n)
– \eta: learning-rate parameter
The inverse of \eta is a measure of the
memory of the LMS algorithm
– When \eta is small, the adaptive process
progresses slowly, more of the past data are
remembered, and a more accurate filtering
action results.
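A minimal LMS loop on the same kind of made-up data shows the stochastic, sample-by-sample character of the update (the learning rate, noise level and number of iterations are assumptions).

```python
import numpy as np

rng = np.random.default_rng(1)
w_true = np.array([0.5, -1.0, 2.0])
w_hat = np.zeros(3)
eta = 0.05                                   # learning-rate parameter

for n in range(2000):
    x = rng.normal(size=3)                   # input x(n)
    d = w_true @ x + 0.01 * rng.normal()     # desired response d(n)
    e = d - w_hat @ x                        # e(n) = d(n) - x'(n) w_hat(n)
    w_hat = w_hat + eta * x * e              # w_hat(n+1) = w_hat(n) + eta x(n) e(n)

print("LMS estimate:", w_hat)                # noisy but close to w_true
```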
LMS Characteristics
The LMS algorithm produces an estimate \hat{\mathbf{w}}(n) of the
weight vector \mathbf{w}(n)
– It sacrifices a distinctive feature:
• Steepest descent algorithm: follows a
well-defined trajectory
• LMS algorithm: follows a random
trajectory
– As the number of iterations goes to infinity,
\hat{\mathbf{w}}(n) performs a random walk
But importantly, the LMS algorithm does not
require knowledge of the statistics of the
environment.
Convergence Consideration
Two distinct quantities, \eta and \mathbf{x}(n), determine
the convergence
– the user supplies \eta, and the selection of \mathbf{x}(n)
is important for the LMS algorithm to
converge
Convergence of the mean:
E[\hat{\mathbf{w}}(n)] \to \mathbf{w}_0 \quad \text{as } n \to \infty
– This is not a practical value
Convergence in the mean square:
E[e^2(n)] \to \text{constant} \quad \text{as } n \to \infty
Convergence condition for the LMS algorithm in the
mean square:
0 < \eta < \frac{2}{\text{sum of mean-square values of the sensor inputs}}
APPENDIX B
Perceptron Convergence Proof
Perceptron Convergence Proof Haykin, Chapter 3
Consider the following perceptron:
v(n) = \sum_{i=0}^{m} w_i(n)\,x_i(n) = \mathbf{w}^T(n)\,\mathbf{x}(n)
\mathbf{w}^T\mathbf{x} > 0 \quad \text{for every input vector } \mathbf{x} \text{ belonging to class } C_1
\mathbf{w}^T\mathbf{x} \le 0 \quad \text{for every input vector } \mathbf{x} \text{ belonging to class } C_2
Perceptron Convergence Proof…
The algorithm for the weight adjustment of the
perceptron:
– if x(n) is correctly classified, no adjustment to w:
\mathbf{w}(n+1) = \mathbf{w}(n) \quad \text{if } \mathbf{w}^T(n)\,\mathbf{x}(n) > 0 \text{ and } \mathbf{x}(n) \text{ belongs to class } C_1
\mathbf{w}(n+1) = \mathbf{w}(n) \quad \text{if } \mathbf{w}^T(n)\,\mathbf{x}(n) \le 0 \text{ and } \mathbf{x}(n) \text{ belongs to class } C_2
– otherwise:
\mathbf{w}(n+1) = \mathbf{w}(n) - \eta(n)\,\mathbf{x}(n) \quad \text{if } \mathbf{w}^T(n)\,\mathbf{x}(n) > 0 \text{ and } \mathbf{x}(n) \text{ belongs to class } C_2
\mathbf{w}(n+1) = \mathbf{w}(n) + \eta(n)\,\mathbf{x}(n) \quad \text{if } \mathbf{w}^T(n)\,\mathbf{x}(n) \le 0 \text{ and } \mathbf{x}(n) \text{ belongs to class } C_1
– the learning-rate parameter \eta(n) controls the adjustment
applied to the weight vector.
Perceptron Convergence Proof
For \eta(n) = 1 and \mathbf{w}(0) = \mathbf{0}:
Suppose the perceptron incorrectly classifies the vectors
\mathbf{x}(1), \mathbf{x}(2), \ldots such that \mathbf{w}^T(n)\,\mathbf{x}(n) \le 0, so that
\mathbf{w}(n+1) = \mathbf{w}(n) + \eta(n)\,\mathbf{x}(n). But since \eta(n) = 1,
\mathbf{w}(n+1) = \mathbf{w}(n) + \mathbf{x}(n) \quad \text{for } \mathbf{x}(n) \text{ belonging to } C_1
Since \mathbf{w}(0) = \mathbf{0}, we iteratively find \mathbf{w}(n+1):
\mathbf{w}(n+1) = \mathbf{x}(1) + \mathbf{x}(2) + \cdots + \mathbf{x}(n) \qquad (B1)
Since the classes C1 and C2 are assumed to be linearly
separable, there exists a solution w0 for which \mathbf{w}_0^T\mathbf{x}(n) > 0 for
the vectors x(1), …, x(n) belonging to the subset H1 (the subset of
training vectors that belong to class C1).
Perceptron Convergence Proof
For a fixed solution w0, we may then define a positive number
\alpha as
\alpha = \min_{\mathbf{x}(n)\,\in\, H_1} \mathbf{w}_0^T\,\mathbf{x}(n) \qquad (B2)
Hence equation (B1) above implies
\mathbf{w}_0^T\,\mathbf{w}(n+1) = \mathbf{w}_0^T\mathbf{x}(1) + \mathbf{w}_0^T\mathbf{x}(2) + \cdots + \mathbf{w}_0^T\mathbf{x}(n)
Using equation (B2) above (since each term is greater than or equal
to \alpha), we have
\mathbf{w}_0^T\,\mathbf{w}(n+1) \ge n\,\alpha
Now we use the Cauchy-Schwarz inequality:
\|\mathbf{a}\|^2\,\|\mathbf{b}\|^2 \ge (\mathbf{a}^T\mathbf{b})^2
\quad\text{or}\quad
\|\mathbf{a}\|^2 \ge \frac{(\mathbf{a}^T\mathbf{b})^2}{\|\mathbf{b}\|^2} \quad \text{for } \|\mathbf{b}\|^2 \neq 0
Perceptron Convergence Proof
This implies that:
\|\mathbf{w}(n+1)\|^2 \ge \frac{n^2\,\alpha^2}{\|\mathbf{w}_0\|^2} \qquad (B3)
Now let's follow another development route (notice the index k):
\mathbf{w}(k+1) = \mathbf{w}(k) + \mathbf{x}(k) \quad \text{for } k = 1, \ldots, n \text{ and } \mathbf{x}(k) \in H_1
By taking the squared Euclidean norm of both sides, we get:
\|\mathbf{w}(k+1)\|^2 = \|\mathbf{w}(k)\|^2 + 2\,\mathbf{w}^T(k)\,\mathbf{x}(k) + \|\mathbf{x}(k)\|^2
But under the assumption that the perceptron incorrectly
classifies an input vector x(k) belonging to the subset H1, we
have \mathbf{w}^T(k)\,\mathbf{x}(k) \le 0 and hence:
\|\mathbf{w}(k+1)\|^2 \le \|\mathbf{w}(k)\|^2 + \|\mathbf{x}(k)\|^2
Perceptron Convergence Proof
Or equivalently,
\|\mathbf{w}(k+1)\|^2 - \|\mathbf{w}(k)\|^2 \le \|\mathbf{x}(k)\|^2, \qquad k = 1, \ldots, n
Adding these inequalities for k = 1, …, n, and invoking the initial
condition w(0) = 0, we get the following inequality:
\|\mathbf{w}(n+1)\|^2 \le \sum_{k=1}^{n} \|\mathbf{x}(k)\|^2 \le n\,\beta \qquad (B4)
where \beta is a positive number defined by
\beta = \max_{\mathbf{x}(k)\,\in\, H_1} \|\mathbf{x}(k)\|^2
Eq. (B4) states that the squared Euclidean norm of w(n+1)
grows at most linearly with the number of iterations n.
Perceptron Convergence Proof
The second result, (B4), is clearly in conflict with Eq. (B3).
•Indeed, we can state that n cannot be larger than some
value nmax for which Eqs. (B3) and (B4) are both satisfied with
the equality sign. That is, nmax is the solution of the equation
\frac{n_{\max}^2\,\alpha^2}{\|\mathbf{w}_0\|^2} = n_{\max}\,\beta
•Solving for nmax, given a solution w0, we find that
n_{\max} = \frac{\beta\,\|\mathbf{w}_0\|^2}{\alpha^2}
We have thus proved that for η(n) = 1 for all n, and for w(0) = 0,
given that a solution vector w0 exists, the rule for adapting the
synaptic weights of the perceptron must terminate after at most
nmax iterations.
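As a numeric illustration (not part of the proof), the bound can be compared with an actual run: the patterns below are the earlier dichotomizer example with the class-2 vectors negated, so that a solution w0 must satisfy w0ᵀx > 0 for every x; the particular w0 used is the solution found in that example.

```python
import numpy as np

X = np.array([[1.0, 1.0], [0.5, -1.0], [3.0, 1.0], [2.0, -1.0]])  # class-2 vectors negated
w0 = np.array([3.0, 0.75])                    # a known solution vector

alpha = min(X @ w0)                           # alpha = min_x w0'x
beta = max(np.sum(X ** 2, axis=1))            # beta = max_x ||x||^2
print("bound n_max =", beta * (w0 @ w0) / alpha ** 2)

w, corrections = np.zeros(2), 0               # eta = 1, w(0) = 0
while any(X @ w <= 0):
    for x in X:
        if w @ x <= 0:
            w, corrections = w + x, corrections + 1
print("corrections actually used =", corrections, " final w =", w)
```

The bound is loose for this data; it only guarantees that the training terminates.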
MORE READING
Suggested Reading.
S. Haykin, “Neural Networks”, Prentice-Hall, 1999,
chapter 3.
L. Fausett, “Fundamentals of Neural Networks”,
Prentice-Hall, 1994, Chapter 2.
R. O. Duda, P.E. Hart, and D.G. Stork, “Pattern
Classification”, 2nd edition, Wiley 2001. Appendix A4,
chapter 2, and chapter 5.
J.M. Zurada, “Introduction to Artificial Neural Systems”,
West Publishing Company, 1992, chapter 3.
References:
These lecture notes were based on the references of the
previous slide, and the following references
1. Berlin Chen Lecture notes: Normal University, Taipei,
Taiwan, ROC. http://140.122.185.120
2. Ehud Rivlin, IIT:
http://webcourse.technion.ac.il/236607/Winter2002-
2003/en/ho.html
3. Jin Hyung Kim, KAIST Computer Science Dept., CS679
Neural Network lecture notes
http://ai.kaist.ac.kr/~jkim/cs679/detail.htm
4. Dr John A. Bullinaria, Course Material, Introduction to
Neural Networks,
http://www.cs.bham.ac.uk/~jxb/inn.html
Ad

More Related Content

What's hot (20)

Artificial Neural Networks Lect3: Neural Network Learning rules
Artificial Neural Networks Lect3: Neural Network Learning rulesArtificial Neural Networks Lect3: Neural Network Learning rules
Artificial Neural Networks Lect3: Neural Network Learning rules
Mohammed Bennamoun
 
Multilayer perceptron
Multilayer perceptronMultilayer perceptron
Multilayer perceptron
omaraldabash
 
03 Single layer Perception Classifier
03 Single layer Perception Classifier03 Single layer Perception Classifier
03 Single layer Perception Classifier
Tamer Ahmed Farrag, PhD
 
Mc culloch pitts neuron
Mc culloch pitts neuronMc culloch pitts neuron
Mc culloch pitts neuron
Siksha 'O' Anusandhan (Deemed to be University )
 
Perceptron & Neural Networks
Perceptron & Neural NetworksPerceptron & Neural Networks
Perceptron & Neural Networks
NAGUR SHAREEF SHAIK
 
Machine Learning with Decision trees
Machine Learning with Decision treesMachine Learning with Decision trees
Machine Learning with Decision trees
Knoldus Inc.
 
Statistical learning
Statistical learningStatistical learning
Statistical learning
Slideshare
 
Convolutional Neural Networks
Convolutional Neural NetworksConvolutional Neural Networks
Convolutional Neural Networks
Ashray Bhandare
 
Feedforward neural network
Feedforward neural networkFeedforward neural network
Feedforward neural network
Sopheaktra YONG
 
2.5 backpropagation
2.5 backpropagation2.5 backpropagation
2.5 backpropagation
Krish_ver2
 
Perceptron
PerceptronPerceptron
Perceptron
Nagarajan
 
Perceptron (neural network)
Perceptron (neural network)Perceptron (neural network)
Perceptron (neural network)
EdutechLearners
 
Associative memory network
Associative memory networkAssociative memory network
Associative memory network
Dr. C.V. Suresh Babu
 
Gradient descent method
Gradient descent methodGradient descent method
Gradient descent method
Sanghyuk Chun
 
Artifical Neural Network and its applications
Artifical Neural Network and its applicationsArtifical Neural Network and its applications
Artifical Neural Network and its applications
Sangeeta Tiwari
 
Activation function
Activation functionActivation function
Activation function
Astha Jain
 
Naive bayes
Naive bayesNaive bayes
Naive bayes
Ashraf Uddin
 
Back propagation
Back propagationBack propagation
Back propagation
Nagarajan
 
Cnn
CnnCnn
Cnn
Nirthika Rajendran
 
Recurrent Neural Networks (RNNs)
Recurrent Neural Networks (RNNs)Recurrent Neural Networks (RNNs)
Recurrent Neural Networks (RNNs)
Abdullah al Mamun
 

Viewers also liked (8)

Artificial Neural Networks Lect5: Multi-Layer Perceptron & Backpropagation
Artificial Neural Networks Lect5: Multi-Layer Perceptron & BackpropagationArtificial Neural Networks Lect5: Multi-Layer Perceptron & Backpropagation
Artificial Neural Networks Lect5: Multi-Layer Perceptron & Backpropagation
Mohammed Bennamoun
 
Recent Advances in Machine Learning: Bringing a New Level of Intelligence to ...
Recent Advances in Machine Learning: Bringing a New Level of Intelligence to ...Recent Advances in Machine Learning: Bringing a New Level of Intelligence to ...
Recent Advances in Machine Learning: Bringing a New Level of Intelligence to ...
Brocade
 
Artificial Neural Networks Lect1: Introduction & neural computation
Artificial Neural Networks Lect1: Introduction & neural computationArtificial Neural Networks Lect1: Introduction & neural computation
Artificial Neural Networks Lect1: Introduction & neural computation
Mohammed Bennamoun
 
Perceptron Slides
Perceptron SlidesPerceptron Slides
Perceptron Slides
ESCOM
 
Artificial Neural Networks Lect2: Neurobiology & Architectures of ANNS
Artificial Neural Networks Lect2: Neurobiology & Architectures of ANNSArtificial Neural Networks Lect2: Neurobiology & Architectures of ANNS
Artificial Neural Networks Lect2: Neurobiology & Architectures of ANNS
Mohammed Bennamoun
 
Lecture 9 Perceptron
Lecture 9 PerceptronLecture 9 Perceptron
Lecture 9 Perceptron
Marina Santini
 
Data Science, Machine Learning and Neural Networks
Data Science, Machine Learning and Neural NetworksData Science, Machine Learning and Neural Networks
Data Science, Machine Learning and Neural Networks
BICA Labs
 
Perception
PerceptionPerception
Perception
Preetham Preetu
 
Artificial Neural Networks Lect5: Multi-Layer Perceptron & Backpropagation
Artificial Neural Networks Lect5: Multi-Layer Perceptron & BackpropagationArtificial Neural Networks Lect5: Multi-Layer Perceptron & Backpropagation
Artificial Neural Networks Lect5: Multi-Layer Perceptron & Backpropagation
Mohammed Bennamoun
 
Recent Advances in Machine Learning: Bringing a New Level of Intelligence to ...
Recent Advances in Machine Learning: Bringing a New Level of Intelligence to ...Recent Advances in Machine Learning: Bringing a New Level of Intelligence to ...
Recent Advances in Machine Learning: Bringing a New Level of Intelligence to ...
Brocade
 
Artificial Neural Networks Lect1: Introduction & neural computation
Artificial Neural Networks Lect1: Introduction & neural computationArtificial Neural Networks Lect1: Introduction & neural computation
Artificial Neural Networks Lect1: Introduction & neural computation
Mohammed Bennamoun
 
Perceptron Slides
Perceptron SlidesPerceptron Slides
Perceptron Slides
ESCOM
 
Artificial Neural Networks Lect2: Neurobiology & Architectures of ANNS
Artificial Neural Networks Lect2: Neurobiology & Architectures of ANNSArtificial Neural Networks Lect2: Neurobiology & Architectures of ANNS
Artificial Neural Networks Lect2: Neurobiology & Architectures of ANNS
Mohammed Bennamoun
 
Data Science, Machine Learning and Neural Networks
Data Science, Machine Learning and Neural NetworksData Science, Machine Learning and Neural Networks
Data Science, Machine Learning and Neural Networks
BICA Labs
 
Ad

Similar to Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers (20)

Artificial Neural Network
Artificial Neural NetworkArtificial Neural Network
Artificial Neural Network
ssuserab4f3e
 
A formal ontology of sequences
A formal ontology of sequencesA formal ontology of sequences
A formal ontology of sequences
Robert Hoehndorf
 
12 pattern recognition
12 pattern recognition12 pattern recognition
12 pattern recognition
Talal Khaliq
 
tutorial.ppt
tutorial.ppttutorial.ppt
tutorial.ppt
Vara Prasad
 
pattern recognition
pattern recognition pattern recognition
pattern recognition
MohammadMoattar2
 
Bayesian Neural Networks
Bayesian Neural NetworksBayesian Neural Networks
Bayesian Neural Networks
Natan Katz
 
Lesson 38
Lesson 38Lesson 38
Lesson 38
Avijit Kumar
 
AI Lesson 38
AI Lesson 38AI Lesson 38
AI Lesson 38
Assistant Professor
 
CSC446: Pattern Recognition (LN4)
CSC446: Pattern Recognition (LN4)CSC446: Pattern Recognition (LN4)
CSC446: Pattern Recognition (LN4)
Mostafa G. M. Mostafa
 
Machine Learning 1
Machine Learning 1Machine Learning 1
Machine Learning 1
cairo university
 
Boundness of a neural network weights using the notion of a limit of a sequence
Boundness of a neural network weights using the notion of a limit of a sequenceBoundness of a neural network weights using the notion of a limit of a sequence
Boundness of a neural network weights using the notion of a limit of a sequence
IJDKP
 
MLHEP 2015: Introductory Lecture #2
MLHEP 2015: Introductory Lecture #2MLHEP 2015: Introductory Lecture #2
MLHEP 2015: Introductory Lecture #2
arogozhnikov
 
Lecture 3
Lecture 3Lecture 3
Lecture 3
subith t
 
Useing PSO to optimize logit model with Tensorflow
Useing PSO to optimize logit model with TensorflowUseing PSO to optimize logit model with Tensorflow
Useing PSO to optimize logit model with Tensorflow
Yi-Fan Liou
 
MLHEP 2015: Introductory Lecture #1
MLHEP 2015: Introductory Lecture #1MLHEP 2015: Introductory Lecture #1
MLHEP 2015: Introductory Lecture #1
arogozhnikov
 
Machine Learning- Perceptron_Backpropogation_Module 3.pdf
Machine Learning- Perceptron_Backpropogation_Module 3.pdfMachine Learning- Perceptron_Backpropogation_Module 3.pdf
Machine Learning- Perceptron_Backpropogation_Module 3.pdf
Dr. Shivashankar
 
Hidden Markov Models
Hidden Markov ModelsHidden Markov Models
Hidden Markov Models
Minesh A. Jethva
 
JAISTサマースクール2016「脳を知るための理論」講義02 Synaptic Learning rules
JAISTサマースクール2016「脳を知るための理論」講義02 Synaptic Learning rulesJAISTサマースクール2016「脳を知るための理論」講義02 Synaptic Learning rules
JAISTサマースクール2016「脳を知るための理論」講義02 Synaptic Learning rules
hirokazutanaka
 
MetiTarski: An Automatic Prover for Real-Valued Special Functions
MetiTarski: An Automatic Prover for Real-Valued Special FunctionsMetiTarski: An Automatic Prover for Real-Valued Special Functions
MetiTarski: An Automatic Prover for Real-Valued Special Functions
Lawrence Paulson
 
Lec2-review-III-svm-logreg_for the beginner.pptx
Lec2-review-III-svm-logreg_for the beginner.pptxLec2-review-III-svm-logreg_for the beginner.pptx
Lec2-review-III-svm-logreg_for the beginner.pptx
raheemsyedrameez12
 
Artificial Neural Network
Artificial Neural NetworkArtificial Neural Network
Artificial Neural Network
ssuserab4f3e
 
A formal ontology of sequences
A formal ontology of sequencesA formal ontology of sequences
A formal ontology of sequences
Robert Hoehndorf
 
12 pattern recognition
12 pattern recognition12 pattern recognition
12 pattern recognition
Talal Khaliq
 
Bayesian Neural Networks
Bayesian Neural NetworksBayesian Neural Networks
Bayesian Neural Networks
Natan Katz
 
Boundness of a neural network weights using the notion of a limit of a sequence
Boundness of a neural network weights using the notion of a limit of a sequenceBoundness of a neural network weights using the notion of a limit of a sequence
Boundness of a neural network weights using the notion of a limit of a sequence
IJDKP
 
MLHEP 2015: Introductory Lecture #2
MLHEP 2015: Introductory Lecture #2MLHEP 2015: Introductory Lecture #2
MLHEP 2015: Introductory Lecture #2
arogozhnikov
 
Useing PSO to optimize logit model with Tensorflow
Useing PSO to optimize logit model with TensorflowUseing PSO to optimize logit model with Tensorflow
Useing PSO to optimize logit model with Tensorflow
Yi-Fan Liou
 
MLHEP 2015: Introductory Lecture #1
MLHEP 2015: Introductory Lecture #1MLHEP 2015: Introductory Lecture #1
MLHEP 2015: Introductory Lecture #1
arogozhnikov
 
Machine Learning- Perceptron_Backpropogation_Module 3.pdf
Machine Learning- Perceptron_Backpropogation_Module 3.pdfMachine Learning- Perceptron_Backpropogation_Module 3.pdf
Machine Learning- Perceptron_Backpropogation_Module 3.pdf
Dr. Shivashankar
 
JAISTサマースクール2016「脳を知るための理論」講義02 Synaptic Learning rules
JAISTサマースクール2016「脳を知るための理論」講義02 Synaptic Learning rulesJAISTサマースクール2016「脳を知るための理論」講義02 Synaptic Learning rules
JAISTサマースクール2016「脳を知るための理論」講義02 Synaptic Learning rules
hirokazutanaka
 
MetiTarski: An Automatic Prover for Real-Valued Special Functions
MetiTarski: An Automatic Prover for Real-Valued Special FunctionsMetiTarski: An Automatic Prover for Real-Valued Special Functions
MetiTarski: An Automatic Prover for Real-Valued Special Functions
Lawrence Paulson
 
Lec2-review-III-svm-logreg_for the beginner.pptx
Lec2-review-III-svm-logreg_for the beginner.pptxLec2-review-III-svm-logreg_for the beginner.pptx
Lec2-review-III-svm-logreg_for the beginner.pptx
raheemsyedrameez12
 
Ad

Recently uploaded (20)

IntroSlides-April-BuildWithAI-VertexAI.pdf
IntroSlides-April-BuildWithAI-VertexAI.pdfIntroSlides-April-BuildWithAI-VertexAI.pdf
IntroSlides-April-BuildWithAI-VertexAI.pdf
Luiz Carneiro
 
new ppt artificial intelligence historyyy
new ppt artificial intelligence historyyynew ppt artificial intelligence historyyy
new ppt artificial intelligence historyyy
PianoPianist
 
Smart Storage Solutions.pptx for production engineering
Smart Storage Solutions.pptx for production engineeringSmart Storage Solutions.pptx for production engineering
Smart Storage Solutions.pptx for production engineering
rushikeshnavghare94
 
Compiler Design Unit1 PPT Phases of Compiler.pptx
Compiler Design Unit1 PPT Phases of Compiler.pptxCompiler Design Unit1 PPT Phases of Compiler.pptx
Compiler Design Unit1 PPT Phases of Compiler.pptx
RushaliDeshmukh2
 
Structural Response of Reinforced Self-Compacting Concrete Deep Beam Using Fi...
Structural Response of Reinforced Self-Compacting Concrete Deep Beam Using Fi...Structural Response of Reinforced Self-Compacting Concrete Deep Beam Using Fi...
Structural Response of Reinforced Self-Compacting Concrete Deep Beam Using Fi...
Journal of Soft Computing in Civil Engineering
 
Artificial Intelligence (AI) basics.pptx
Artificial Intelligence (AI) basics.pptxArtificial Intelligence (AI) basics.pptx
Artificial Intelligence (AI) basics.pptx
aditichinar
 
ADVXAI IN MALWARE ANALYSIS FRAMEWORK: BALANCING EXPLAINABILITY WITH SECURITY
ADVXAI IN MALWARE ANALYSIS FRAMEWORK: BALANCING EXPLAINABILITY WITH SECURITYADVXAI IN MALWARE ANALYSIS FRAMEWORK: BALANCING EXPLAINABILITY WITH SECURITY
ADVXAI IN MALWARE ANALYSIS FRAMEWORK: BALANCING EXPLAINABILITY WITH SECURITY
ijscai
 
Explainable-Artificial-Intelligence-XAI-A-Deep-Dive (1).pptx
Explainable-Artificial-Intelligence-XAI-A-Deep-Dive (1).pptxExplainable-Artificial-Intelligence-XAI-A-Deep-Dive (1).pptx
Explainable-Artificial-Intelligence-XAI-A-Deep-Dive (1).pptx
MahaveerVPandit
 
introduction to machine learining for beginers
introduction to machine learining for beginersintroduction to machine learining for beginers
introduction to machine learining for beginers
JoydebSheet
 
five-year-soluhhhhhhhhhhhhhhhhhtions.pdf
five-year-soluhhhhhhhhhhhhhhhhhtions.pdffive-year-soluhhhhhhhhhhhhhhhhhtions.pdf
five-year-soluhhhhhhhhhhhhhhhhhtions.pdf
AdityaSharma944496
 
Main cotrol jdbjbdcnxbjbjzjjjcjicbjxbcjcxbjcxb
Main cotrol jdbjbdcnxbjbjzjjjcjicbjxbcjcxbjcxbMain cotrol jdbjbdcnxbjbjzjjjcjicbjxbcjcxbjcxb
Main cotrol jdbjbdcnxbjbjzjjjcjicbjxbcjcxbjcxb
SunilSingh610661
 
AI-assisted Software Testing (3-hours tutorial)
AI-assisted Software Testing (3-hours tutorial)AI-assisted Software Testing (3-hours tutorial)
AI-assisted Software Testing (3-hours tutorial)
Vəhid Gəruslu
 
LECTURE-16 EARTHEN DAM - II.pptx it's uses
LECTURE-16 EARTHEN DAM - II.pptx it's usesLECTURE-16 EARTHEN DAM - II.pptx it's uses
LECTURE-16 EARTHEN DAM - II.pptx it's uses
CLokeshBehera123
 
Data Structures_Searching and Sorting.pptx
Data Structures_Searching and Sorting.pptxData Structures_Searching and Sorting.pptx
Data Structures_Searching and Sorting.pptx
RushaliDeshmukh2
 
RICS Membership-(The Royal Institution of Chartered Surveyors).pdf
RICS Membership-(The Royal Institution of Chartered Surveyors).pdfRICS Membership-(The Royal Institution of Chartered Surveyors).pdf
RICS Membership-(The Royal Institution of Chartered Surveyors).pdf
MohamedAbdelkader115
 
some basics electrical and electronics knowledge
some basics electrical and electronics knowledgesome basics electrical and electronics knowledge
some basics electrical and electronics knowledge
nguyentrungdo88
 
theory-slides-for react for beginners.pptx
theory-slides-for react for beginners.pptxtheory-slides-for react for beginners.pptx
theory-slides-for react for beginners.pptx
sanchezvanessa7896
 
"Feed Water Heaters in Thermal Power Plants: Types, Working, and Efficiency G...
"Feed Water Heaters in Thermal Power Plants: Types, Working, and Efficiency G..."Feed Water Heaters in Thermal Power Plants: Types, Working, and Efficiency G...
"Feed Water Heaters in Thermal Power Plants: Types, Working, and Efficiency G...
Infopitaara
 
Introduction to FLUID MECHANICS & KINEMATICS
Introduction to FLUID MECHANICS &  KINEMATICSIntroduction to FLUID MECHANICS &  KINEMATICS
Introduction to FLUID MECHANICS & KINEMATICS
narayanaswamygdas
 
Raish Khanji GTU 8th sem Internship Report.pdf
Raish Khanji GTU 8th sem Internship Report.pdfRaish Khanji GTU 8th sem Internship Report.pdf
Raish Khanji GTU 8th sem Internship Report.pdf
RaishKhanji
 
IntroSlides-April-BuildWithAI-VertexAI.pdf
IntroSlides-April-BuildWithAI-VertexAI.pdfIntroSlides-April-BuildWithAI-VertexAI.pdf
IntroSlides-April-BuildWithAI-VertexAI.pdf
Luiz Carneiro
 
new ppt artificial intelligence historyyy
new ppt artificial intelligence historyyynew ppt artificial intelligence historyyy
new ppt artificial intelligence historyyy
PianoPianist
 
Smart Storage Solutions.pptx for production engineering
Smart Storage Solutions.pptx for production engineeringSmart Storage Solutions.pptx for production engineering
Smart Storage Solutions.pptx for production engineering
rushikeshnavghare94
 
Compiler Design Unit1 PPT Phases of Compiler.pptx
Compiler Design Unit1 PPT Phases of Compiler.pptxCompiler Design Unit1 PPT Phases of Compiler.pptx
Compiler Design Unit1 PPT Phases of Compiler.pptx
RushaliDeshmukh2
 
Artificial Intelligence (AI) basics.pptx
Artificial Intelligence (AI) basics.pptxArtificial Intelligence (AI) basics.pptx
Artificial Intelligence (AI) basics.pptx
aditichinar
 
ADVXAI IN MALWARE ANALYSIS FRAMEWORK: BALANCING EXPLAINABILITY WITH SECURITY
ADVXAI IN MALWARE ANALYSIS FRAMEWORK: BALANCING EXPLAINABILITY WITH SECURITYADVXAI IN MALWARE ANALYSIS FRAMEWORK: BALANCING EXPLAINABILITY WITH SECURITY
ADVXAI IN MALWARE ANALYSIS FRAMEWORK: BALANCING EXPLAINABILITY WITH SECURITY
ijscai
 
Explainable-Artificial-Intelligence-XAI-A-Deep-Dive (1).pptx
Explainable-Artificial-Intelligence-XAI-A-Deep-Dive (1).pptxExplainable-Artificial-Intelligence-XAI-A-Deep-Dive (1).pptx
Explainable-Artificial-Intelligence-XAI-A-Deep-Dive (1).pptx
MahaveerVPandit
 
introduction to machine learining for beginers
introduction to machine learining for beginersintroduction to machine learining for beginers
introduction to machine learining for beginers
JoydebSheet
 
five-year-soluhhhhhhhhhhhhhhhhhtions.pdf
five-year-soluhhhhhhhhhhhhhhhhhtions.pdffive-year-soluhhhhhhhhhhhhhhhhhtions.pdf
five-year-soluhhhhhhhhhhhhhhhhhtions.pdf
AdityaSharma944496
 
Main cotrol jdbjbdcnxbjbjzjjjcjicbjxbcjcxbjcxb
Main cotrol jdbjbdcnxbjbjzjjjcjicbjxbcjcxbjcxbMain cotrol jdbjbdcnxbjbjzjjjcjicbjxbcjcxbjcxb
Main cotrol jdbjbdcnxbjbjzjjjcjicbjxbcjcxbjcxb
SunilSingh610661
 
AI-assisted Software Testing (3-hours tutorial)
AI-assisted Software Testing (3-hours tutorial)AI-assisted Software Testing (3-hours tutorial)
AI-assisted Software Testing (3-hours tutorial)
Vəhid Gəruslu
 
LECTURE-16 EARTHEN DAM - II.pptx it's uses
LECTURE-16 EARTHEN DAM - II.pptx it's usesLECTURE-16 EARTHEN DAM - II.pptx it's uses
LECTURE-16 EARTHEN DAM - II.pptx it's uses
CLokeshBehera123
 
Data Structures_Searching and Sorting.pptx
Data Structures_Searching and Sorting.pptxData Structures_Searching and Sorting.pptx
Data Structures_Searching and Sorting.pptx
RushaliDeshmukh2
 
RICS Membership-(The Royal Institution of Chartered Surveyors).pdf
RICS Membership-(The Royal Institution of Chartered Surveyors).pdfRICS Membership-(The Royal Institution of Chartered Surveyors).pdf
RICS Membership-(The Royal Institution of Chartered Surveyors).pdf
MohamedAbdelkader115
 
some basics electrical and electronics knowledge
some basics electrical and electronics knowledgesome basics electrical and electronics knowledge
some basics electrical and electronics knowledge
nguyentrungdo88
 
theory-slides-for react for beginners.pptx
theory-slides-for react for beginners.pptxtheory-slides-for react for beginners.pptx
theory-slides-for react for beginners.pptx
sanchezvanessa7896
 
"Feed Water Heaters in Thermal Power Plants: Types, Working, and Efficiency G...
"Feed Water Heaters in Thermal Power Plants: Types, Working, and Efficiency G..."Feed Water Heaters in Thermal Power Plants: Types, Working, and Efficiency G...
"Feed Water Heaters in Thermal Power Plants: Types, Working, and Efficiency G...
Infopitaara
 
Introduction to FLUID MECHANICS & KINEMATICS
Introduction to FLUID MECHANICS &  KINEMATICSIntroduction to FLUID MECHANICS &  KINEMATICS
Introduction to FLUID MECHANICS & KINEMATICS
narayanaswamygdas
 
Raish Khanji GTU 8th sem Internship Report.pdf
Raish Khanji GTU 8th sem Internship Report.pdfRaish Khanji GTU 8th sem Internship Report.pdf
Raish Khanji GTU 8th sem Internship Report.pdf
RaishKhanji
 

Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers

  • 1. CS407 Neural Computation Lecture 4: Single Layer Perceptron (SLP) Classifiers Lecturer: A/Prof. M. Bennamoun
  • 2. Outline What’s a SLP and what’s classification? Limitation of a single perceptron. Foundations of classification and Bayes Decision making theory Discriminant functions, linear machine and minimum distance classification Training and classification using the Discrete perceptron Single-Layer Continuous perceptron Networks for linearly separable classifications Appendix A: Unconstrained optimization techniques Appendix B: Perceptron Convergence proof Suggested reading and references
  • 3. What is a perceptron and what is a Single Layer Perceptron (SLP)?
  • 4. Perceptron The simplest form of a neural network consists of a single neuron with adjustable synaptic weights and bias performs pattern classification with only two classes perceptron convergence theorem : – Patterns (vectors) are drawn from two linearly separable classes – During training, the perceptron algorithm converges and positions the decision surface in the form of hyperplane between two classes by adjusting synaptic weights
  • 5. What is a perceptron? wk1 x1 wk2 x2 wkm xm ... ... Σ Bias bk ϕ(.) vk Input signal Synaptic weights Summing junction Activation function bxwv kj m j kjk += ∑=1 )(vy kk ϕ= )()( ⋅=⋅ signϕ Discrete Perceptron: Output yk shapeS −=⋅)(ϕ Continous Perceptron:
  • 6. Activation Function of a perceptron vi +1 -1 vi +1 Signum Function (sign) shapesv −=)(ϕ Continous Perceptron: )()( ⋅=⋅ signϕ Discrete Perceptron:
  • 7. SLP Architecture Single layer perceptron Input layer Output layer
  • 8. Where are we heading? Different Non-Linearly Separable Problems http://www.zsolutions.com/light.htm Structure Types of Decision Regions Exclusive-OR Problem Classes with Meshed regions Most General Region Shapes Single-Layer Two-Layer Three-Layer Half Plane Bounded By Hyperplane Convex Open Or Closed Regions Arbitrary (Complexity Limited by No. of Nodes) A AB B A AB B A AB B B A B A B A
  • 9. Review from last lectures:
  • 10. Implementing Logic Gates with Perceptrons http://www.cs.bham.ac.uk/~jxb/NN/l3.pdf We can use the perceptron to implement the basic logic gates (AND, OR and NOT). All we need to do is find the appropriate connection weights and neuron thresholds to produce the right outputs for each set of inputs. We saw how we can construct simple networks that perform NOT, AND, and OR. It is then a well known result from logic that we can construct any logical function from these three operations. The resulting networks, however, will usually have a much more complex architecture than a simple Perceptron. We generally want to avoid decomposing complex problems into simple logic gates, by finding the weights and thresholds that work directly in a Perceptron architecture.
  • 11. Implementation of Logical NOT, AND, and OR In each case we have inputs ini and outputs out, and need to determine the weights and thresholds. It is easy to find solutions by inspection:
  • 12. The Need to Find Weights Analytically Constructing simple networks by hand is one thing. But what about harder problems? For example, what about: How long do we keep looking for a solution? We need to be able to calculate appropriate parameters rather than looking for solutions by trial and error. Each training pattern produces a linear inequality for the output in terms of the inputs and the network parameters. These can be used to compute the weights and thresholds.
  • 13. Finding Weights Analytically for the AND Network We have two weights w1 and w2 and the threshold θ, and for each training pattern we need to satisfy So the training data lead to four inequalities: It is easy to see that there are an infinite number of solutions. Similarly, there are an infinite number of solutions for the NOT and OR networks.
  • 14. Limitations of Simple Perceptrons We can follow the same procedure for the XOR network: Clearly the second and third inequalities are incompatible with the fourth, so there is in fact no solution. We need more complex networks, e.g. that combine together many simple networks, or use different activation/thresholding/transfer functions. It then becomes much more difficult to determine all the weights and thresholds by hand. These weights instead are adapted using learning rules. Hence, need to consider learning rules (see previous lecture), and more complex architectures.
  • 15. E.g. Decision Surface of a Perceptron [Figure: in the (x1, x2) plane, one pattern set that is linearly separable by a single line, and an XOR-like pattern set that is non-linearly separable.] • The perceptron is able to represent some useful functions • But functions that are not linearly separable (e.g. XOR) are not representable
  • 17. Classification http://140.122.185.120 Pattern classification/recognition - Assign the input data (a physical object, event, or phenomenon) to one of the pre-specified classes (categories). [Figure: block diagram of the recognition and classification system.]
  • 18. Classification: an example • Automate the process of sorting incoming fish on a conveyor belt according to species (salmon or sea bass). Set up a camera, take some sample images, and note the physical differences between the two types of fish: length, lightness, width, number & shape of fins, position of the mouth. http://webcourse.technion.ac.il/236607/Winter2002-2003/en/ho.htm Duda & Hart, Chapter 1
  • 20. Classification: an example… • Cost of misclassification: depends on the application. Is it better to misclassify salmon as bass or vice versa? Put salmon in a can of bass ⇒ lose profit. Put bass in a can of salmon ⇒ lose the customer. There is a cost associated with our decision; make a decision to minimize a given cost. • Feature extraction: problem & domain dependent; requires knowledge of the domain. A good feature extractor would make the job of the classifier trivial.
  • 22. Bayesian Decision Theory http://webcourse.technion.ac.il/236607/Winter2002-2003/en/ho.html Duda & Hart, Chapter 2 Bayesian decision theory is a fundamental statistical approach to the problem of pattern classification. Decision making when all the probabilistic information is known. For given probabilities the decision is optimal. When new information is added, it is assimilated in optimal fashion for improvement of decisions.
  • 23. Bayesian Decision Theory … Fish Example: Each fish is in one of 2 states: sea bass or salmon Let ω denote the state of nature ω = ω1 for sea bass ω = ω2 for salmon
  • 24. Bayesian Decision Theory … The state of nature is unpredictable ⇒ ω is a variable that must be described probabilistically. If the catch produced as much salmon as sea bass, the next fish is equally likely to be sea bass or salmon. Define P(ω1): a priori probability that the next fish is sea bass; P(ω2): a priori probability that the next fish is salmon.
  • 25. Bayesian Decision Theory … If other types of fish are irrelevant: P(ω1) + P(ω2) = 1. Prior probabilities reflect our prior knowledge (e.g. time of year, fishing area, …). Simple decision rule: make a decision without seeing the fish. Decide ω1 if P(ω1) > P(ω2); ω2 otherwise. OK if deciding for one fish; if there are several fish, all are assigned to the same class.
  • 26. Bayesian Decision Theory ... In general, we will have some features and more information. Feature: lightness measurement = x Different fish yield different lightness readings (x is a random variable)
  • 27. Bayesian Decision Theory …. Define p(x|ω1) = Class Conditional Probability Density Probability density function for x given that the state of nature is ω1 The difference between p(x|ω1 ) and p(x|ω2 ) describes the difference in lightness between sea bass and salmon.
  • 28. Class conditional probability density: p(x|ω) [Figure: hypothetical class-conditional probability density functions for the two classes. The densities are normalized (the area under each curve is 1.0).]
  • 29. Bayesian Decision Theory ... Suppose that we know the prior probabilities P(ω1) and P(ω2), and the conditional densities p(x|ω1) and p(x|ω2). We measure the lightness of a fish = x. What is the category of the fish?
  • 30. Bayes Formula Given – prior probabilities P(ωj) – conditional densities p(x|ωj) and the measurement of a particular item – feature value x. Bayes formula: P(ωj|x) = p(x|ωj) P(ωj) / p(x) (from p(x, ωj) = p(x|ωj) P(ωj) = P(ωj|x) p(x)), where p(x) = Σ_i p(x|ωi) P(ωi), so that Σ_i P(ωi|x) = 1. In words: Posterior = (Likelihood × Prior) / Evidence.
  • 31. Bayes' formula ... • p(x|ωj) is called the likelihood of ωj with respect to x (the category ωj for which p(x|ωj) is large is more "likely" to be the true category). • p(x) is the evidence: how frequently we will measure a pattern with feature value x. It is a scale factor that guarantees that the posterior probabilities sum to 1.
  • 32. Posterior Probability Posterior probabilities for the particular priors P(ω1)=2/3 and P(ω2)=1/3. At every x the posteriors sum to 1.
  • 33. Error P(error|x) = P(ω2|x) if we decide ω1; P(ω1|x) if we decide ω2. For a given x, we can minimize the probability of error by deciding ω1 if P(ω1|x) > P(ω2|x) and ω2 otherwise.
  • 34. Bayes' Decision Rule (minimizes the probability of error) Decide ω1 if P(ω1|x) > P(ω2|x); ω2 otherwise. Equivalently: decide ω1 if p(x|ω1) P(ω1) > p(x|ω2) P(ω2); ω2 otherwise. In likelihood-ratio form: decide ω1 if the likelihood ratio p(x|ω1)/p(x|ω2) exceeds the threshold P(ω2)/P(ω1); ω2 otherwise. And P(error|x) = min [P(ω1|x), P(ω2|x)].
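A minimal NumPy sketch of the two-class rule above, assuming (purely for illustration) Gaussian class-conditional densities; the priors P(ω1) = 2/3 and P(ω2) = 1/3 are the ones used in the posterior plot below, while the means and variances are made up.

  import numpy as np

  def gaussian(x, mu, sigma):
      # class-conditional density p(x | omega)
      return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

  P1, P2 = 2/3, 1/3                       # priors P(w1), P(w2)
  x = 11.0                                # measured lightness (arbitrary value)
  lik1 = gaussian(x, mu=10.0, sigma=2.0)  # assumed p(x|w1) for sea bass
  lik2 = gaussian(x, mu=13.0, sigma=2.0)  # assumed p(x|w2) for salmon

  evidence = lik1 * P1 + lik2 * P2        # p(x) = sum_i p(x|wi) P(wi)
  post1, post2 = lik1 * P1 / evidence, lik2 * P2 / evidence
  decision = "w1 (sea bass)" if post1 > post2 else "w2 (salmon)"
  print(post1, post2, "->", decision, "P(error|x) =", min(post1, post2))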
  • 35. Decision Boundaries Classification can be viewed as a division of feature space into non-overlapping regions X1, …, XR such that x ∈ Xk ⇔ x is assigned to ωk. Boundaries between these regions are known as decision surfaces or decision boundaries.
  • 36. Optimum decision boundaries Criterion: – minimize misclassification – maximize correct classification: classify x ∈ Xk if p(x|ωk) P(ωk) > p(x|ωj) P(ωj) for all j ≠ k, i.e. P(ωk|x) > P(ωj|x) (maximum posterior probability). P(correct) = Σ_{k=1}^{R} P(x ∈ Xk, ωk) = Σ_{k=1}^{R} P(x ∈ Xk | ωk) P(ωk). Here R = 2.
  • 37. Discriminant functions Discriminant functions determine classification by comparison of their values: classify x ∈ Xk if gk(x) > gj(x) for all j ≠ k. Optimum classification is based on the posterior probability P(ωk|x). Any monotone function g may be applied without changing the decision boundaries: gk(x) = g(P(ωk|x)), e.g. gk(x) = ln(P(ωk|x)).
  • 38. The Two-Category Case Use 2 discriminant functions g1 and g2, and assign x to ω1 if g1 > g2. Alternative: define a single discriminant function g(x) = g1(x) − g2(x); decide ω1 if g(x) > 0, otherwise decide ω2. For the two-category case: g(x) = P(ω1|x) − P(ω2|x), or equivalently g(x) = ln[p(x|ω1)/p(x|ω2)] + ln[P(ω1)/P(ω2)].
  • 39. Summary Bayes approach: – Estimate class-conditioned probability density – Combine with prior class probability – Determine posterior class probability – Derive decision boundaries Alternate approach implemented by NN – Estimate posterior probability directly – i.e. determine decision boundaries directly
  • 41. Discriminant Functions http://140.122.185.120 The classifier determines membership in a category by comparing R discriminant functions g1(x), g2(x), …, gR(x) – x is within the region Xk if gk(x) has the largest value. Do not confuse n = dimension of each input vector (dimension of feature space), P = number of input vectors, and R = number of classes.
  • 47. Linear Machine and Minimum Distance Classification • Find the linear-form discriminant function for two-class classification when the class prototypes are known • Example 3.1: Select the decision hyperplane that contains the midpoint of the line segment connecting the center points of the two classes
  • 48. Linear Machine and Minimum Distance Classification… (dichotomizer) • The dichotomizer's discriminant function: g(x) = (x1 − x2)^t x + ½(||x2||² − ||x1||²), where x1 and x2 are the class prototype (center) vectors; g(x) > 0 assigns x to class 1 and g(x) < 0 to class 2.
  • 49. Linear Machine and Minimum Distance Classification…(multiclass classification) •The linear-form discriminant functions for multiclass classification – There are up to R(R-1)/2 decision hyperplanes for R pairwise separable classes (i.e. next to or touching another)
  • 50. Linear Machine and Minimum Distance Classification… (multiclass classification) • Linear machine or minimum-distance classifier – Assume the class prototypes are known for all classes • Euclidean distance between input pattern x and the center xi of class i: ||x − xi|| = [(x − xi)^t (x − xi)]^{1/2}. Minimizing this distance is equivalent to maximizing the linear discriminant gi(x) = xi^t x − ½ xi^t xi, so the classifier assigns x to the class with the largest gi(x).
  • 51. Linear Machine and Minimum Distance Classification… (multiclass classification)
  • 52. Linear Machine and Minimum Distance Classification… Note: to find the decision surface S12 we need to compute (g1 − g2). P1, P2, P3 are the centres of gravity of the prototype points, and we need to design a minimum-distance classifier. Using the formulas from the previous slide, we get the weight vectors wi.
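A minimal sketch of the minimum-distance (linear machine) classifier described above; the three prototype points are invented for illustration and are not the P1, P2, P3 of the slide's figure.

  import numpy as np

  prototypes = np.array([[2.0, 5.0],     # P1 (illustrative values)
                         [-1.0, -3.0],   # P2
                         [6.0, -2.0]])   # P3

  def g(x, p):
      # linear discriminant g_i(x) = p_i^t x - 0.5 * p_i^t p_i
      return p @ x - 0.5 * p @ p

  def classify(x):
      scores = [g(x, p) for p in prototypes]
      return int(np.argmax(scores)) + 1   # class index 1..R

  print(classify(np.array([1.5, 4.0])))   # closest to P1 -> class 1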
  • 53. Linear Machine and Minimum Distance Classification… • If R linear discriminant functions exist for a set of patterns such that gi(x) > gj(x) for all x ∈ class i, i = 1, 2, …, R, j = 1, 2, …, R, i ≠ j, • then the classes are linearly separable.
  • 54. Linear Machine and Minimum Distance Classification… Example:
  • 55. Linear Machine and Minimum Distance Classification… Example…
  • 56. Linear Machine and Minimum Distance Classification… •Examples 3.1 and 3.2 have shown that the coefficients (weights) of the linear discriminant functions can be determined if the a priori information about the sets of patterns and their class membership is known •In the next section (Discrete perceptron) we will examine neural networks that derive their weights during the learning cycle.
  • 57. Linear Machine and Minimum Distance Classification… •The example of linearly non-separable patterns
  • 58. Linear Machine and Minimum Distance Classification… Input space (x) → Image space (o): o1 = sgn(x1 + x2 + 1)
  • 59. Linear Machine and Minimum Distance Classification… o1 = sgn(x1 + x2 + 1), o2 = sgn(−x1 − x2 + 1)
   x1  x2 | o1  o2
    1   1 |  1  -1
    1  -1 |  1   1
   -1   1 |  1   1
   -1  -1 | -1   1
  The two inputs (1, -1) and (-1, 1) map to the same point (1, 1) in the image space.
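A tiny sketch that reproduces the table above by passing the four bipolar inputs through the two mapping units given on the slide.

  def sgn(v):
      return 1 if v >= 0 else -1

  for x1, x2 in [(1, 1), (1, -1), (-1, 1), (-1, -1)]:
      o1 = sgn(x1 + x2 + 1)       # first mapping unit
      o2 = sgn(-x1 - x2 + 1)      # second mapping unit
      print((x1, x2), "->", (o1, o2))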
  • 61. Discrete Perceptron Training Algorithm • So far, we have shown that coefficients of linear discriminant functions called weights can be determined based on a priori information about sets of patterns and their class membership. •In what follows, we will begin to examine neural network classifiers that derive their weights during the learning cycle. •The sample pattern vectors x1, x2, …, xp, called the training sequence, are presented to the machine along with the correct response.
  • 62. Discrete Perceptron Training Algorithm - Geometrical Representations http://140.122.185.120 Zurada, Chapter 3 (Each decision hyperplane in the weight space intersects the origin w = 0.) There are 5 prototype patterns in this case: y1, y2, …, y5. If the dimension of the augmented pattern vector is > 3, our powers of visualization are no longer of assistance. In this case, the only recourse is to use the analytical approach.
  • 63. Discrete Perceptron Training Algorithm - Geometrical Representations… • Devise an analytic approach based on the geometrical representations – e.g. the decision surface for the training pattern y1, drawn in weight space. The gradient (the direction of steepest increase) is ∇_w(w^t y1) = y1 (see previous slide). If y1 is in Class 1 but the current w misclassifies it: w' = w + c y1 (correction in the positive gradient direction). If y1 is in Class 2 but misclassified: w' = w − c y1 (correction in the negative gradient direction). c (> 0) is the correction increment (it is two times the learning constant ρ introduced before); c controls the size of the adjustment.
  • 64. Discrete Perceptron Training Algorithm - Geometrical Representations…
  • 65. Discrete Perceptron Training Algorithm - Geometrical Representations… Note 1: the distance from the current weight vector w¹ to the decision plane w^t y = 0 is p = |w^{1t} y| / ||y||, so choosing c = w^{1t} y / (y^t y) = p / ||y|| (> 0) moves the weight vector exactly onto that plane. Note 2: this c is not constant; it depends on the current training pattern, as expressed by the equation above.
  • 66. Discrete Perceptron Training Algorithm - Geometrical Representations… • For the fixed correction rule, c = constant: the correction of weights is always the same fixed portion of the current training vector – the weight can be initialised at any value. • For the dynamic correction rule, c depends on the distance from the weight vector to the decision surface in the weight space. Hence – the initial weight should be different from 0 (if w¹ = 0, then c y = 0 and w' = w¹ + c y = 0, therefore no possible adjustments). Here y is the current input pattern and w the current weight.
  • 67. Discrete Perceptron Training Algorithm - Geometrical Representations… • Dynamic correction rule: using the value of c from the previous slide as a reference, we devise an adjustment technique which depends on the length ||w² − w¹||. Note: λ is the ratio of the distance between the old weight vector w¹ and the new w², to the distance from w¹ to the pattern hyperplane. λ = 2: symmetrical reflection w.r.t. the decision plane; λ = 0: no weight adjustment.
  • 68. Discrete Perceptron Training Algorithm - Geometrical Representations… • Example: x1 = 1, x3 = 3, d1 = d3 = 1 (class 1); x2 = −0.5, x4 = −2, d2 = d4 = −1 (class 2). • The augmented input vectors are: y1 = [1, 1]^t, y2 = [−0.5, 1]^t, y3 = [3, 1]^t, y4 = [−2, 1]^t. • The decision lines w^t yi = 0, for i = 1, 2, 3, 4, are sketched on the augmented weight space as follows:
  • 69. Discrete Perceptron Training Algorithm - Geometrical Representations…
  • 70. Discrete Perceptron Training Algorithm - Geometrical Representations… For c = 1 and w1 = [−2.5, 1.75]^t. • The weight training at each step can be summarized as w' = w ± c y, i.e. Δw_k = (c/2)[d_k − sgn(w_k^t y_k)] y_k. • We obtain the following outputs and weight updates: • Step 1: Pattern y1 is input: o1 = sgn([−2.5, 1.75][1, 1]^t) = −1, d1 − o1 ≠ 0, so w2 = w1 + y1 = [−1.5, 2.75]^t
  • 71. Discrete Perceptron Training Algorithm - Geometrical Representations… • Step 2: Pattern y2 is input: o2 = sgn([−1.5, 2.75][−0.5, 1]^t) = 1, d2 − o2 ≠ 0, so w3 = w2 − y2 = [−1, 1.75]^t • Step 3: Pattern y3 is input: o3 = sgn([−1, 1.75][3, 1]^t) = −1, d3 − o3 ≠ 0, so w4 = w3 + y3 = [2, 2.75]^t
  • 72. Discrete Perceptron Training Algorithm - Geometrical Representations… • Since we have no evidence of correct classification of weight w4, the training set consisting of the ordered sequence of patterns y1, y2, y3 and y4 needs to be recycled; we thus have y5 = y1, y6 = y2, etc. (the superscript on y denotes the training step number). • Steps 4, 5: w6 = w5 = w4 (no misclassification, thus no weight adjustment). • You can check that the adjustments in steps 6 through 10 are as follows: w7 = [2.5, 1.75]^t, w8 = w9 = w10 = w7, w11 = [3, 0.75]^t. w11 is in the solution area.
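A minimal sketch that replays the fixed-increment training above (c = 1, w1 = [−2.5, 1.75]^t, patterns y1…y4 cycled in order), assuming the update rule Δw = (c/2)[d − sgn(w^t y)] y from the slides; it should reproduce the quoted weight sequence, ending near w11 = [3, 0.75]^t.

  import numpy as np

  # augmented patterns [x, 1] and desired outputs
  Y = np.array([[1.0, 1.0], [-0.5, 1.0], [3.0, 1.0], [-2.0, 1.0]])
  d = np.array([1, -1, 1, -1])

  w = np.array([-2.5, 1.75])   # initial weight vector w1
  c = 1.0                      # fixed correction increment

  step = 0
  for cycle in range(3):                       # a few passes over the training set
      for y, dk in zip(Y, d):
          step += 1
          o = 1.0 if w @ y >= 0 else -1.0      # TLU output sgn(w^t y)
          w = w + 0.5 * c * (dk - o) * y       # delta_w = (c/2)(d - o) y
          print("step", step, "w =", w)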
  • 74. Continuous Perceptron Training Algorithm http://140.122.185.120 Zurada, Chapter 3 • Replace the TLU (Threshold Logic Unit) with the sigmoid activation function for two reasons: – Gain finer control over the training procedure – Obtain differentiable characteristics, which enable computation of the error gradient (of the current error function). The factor ½ does not affect the location of the error minimum.
  • 75. Continuous Perceptron Training Algorithm… • The new weight vector is obtained by moving in the direction of the negative gradient along the multidimensional error surface. By definition of the steepest-descent concept, each elementary move should be perpendicular to the current error contour.
  • 76. Continuous Perceptron Training Algorithm… • Define the error as the squared difference between the desired output and the actual output: E = ½(d − o)². Since net = w^t y, we have ∂net/∂w_i = y_i, i = 1, 2, …, n+1. This gives the training rule of the continuous perceptron (equivalent to the delta training rule).
  • 78. Continuous Perceptron Training Algorithm… Same as previous example (of discrete perceptron) but with a continuous activation function and using the delta rule. Same training pattern set as discrete perceptron example
  • 79. Continuous Perceptron Training Algorithm… E_k = ½ [d_k − (2/(1 + exp(−λ net_k)) − 1)]². For pattern 1: E_1(w) = ½ [1 − (2/(1 + exp(−λ(w1 + w2))) − 1)]², and setting λ = 1 and reducing the terms simplifies this expression to the following form: E_1(w) = 2/[1 + exp(w1 + w2)]². Similarly, E_2(w) = 2/[1 + exp(0.5 w1 − w2)]², E_3(w) = 2/[1 + exp(3 w1 + w2)]², E_4(w) = 2/[1 + exp(2 w1 − w2)]². These error surfaces are as shown on the previous slide.
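A minimal sketch of delta-rule training for the same four patterns with the bipolar continuous activation o = 2/(1 + exp(−λ·net)) − 1 and λ = 1; the learning rate η = 0.5 and the number of epochs are arbitrary illustrative choices.

  import numpy as np

  Y = np.array([[1.0, 1.0], [-0.5, 1.0], [3.0, 1.0], [-2.0, 1.0]])  # augmented patterns
  d = np.array([1.0, -1.0, 1.0, -1.0])                              # desired outputs

  def bipolar_sigmoid(net, lam=1.0):
      return 2.0 / (1.0 + np.exp(-lam * net)) - 1.0

  w = np.array([-2.5, 1.75])   # same starting point as the discrete example
  eta = 0.5                    # learning constant (illustrative)

  for epoch in range(200):
      for y, dk in zip(Y, d):
          o = bipolar_sigmoid(w @ y)
          # delta rule: dw = eta * (d - o) * f'(net) * y, with f'(net) = 0.5*(1 - o^2)
          w = w + eta * (dk - o) * 0.5 * (1.0 - o ** 2) * y

  print("trained weights:", w)
  print("outputs:", [round(float(bipolar_sigmoid(w @ y)), 3) for y in Y])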
  • 80. Continuous Perceptron Training Algorithm… [Figure: error surface over (w1, w2); the marked point is the minimum.]
  • 82. Multi-category Single layer Perceptron nets • Treat the last fixed component of the input pattern vector as the neuron activation threshold…. T = w_{n+1}, y_{n+1} = −1 (it is irrelevant whether it is equal to +1 or −1)
  • 83. Multi-category Single layer Perceptron nets… • R-category linear classifier using R discrete bipolar perceptrons – Goal: the i-th TLU response of +1 is indicative of class i, and all other TLUs respond with −1
  • 84. Multi-category Single layer Perceptron nets… • Example 3.5: the correct response here should be (−1, −1, 1)^t. Indecision regions = regions where no class membership of an input pattern can be uniquely determined based on the response of the classifier (patterns in the shaded areas are not assigned any reasonable classification, e.g. point Q for which o = [1 1 −1]^t ⇒ indecisive response). However, no patterns such as Q have been used for training in the example.
  • 85. Multi-category Single layer Perceptron nets… For c = 1 and initial weights w1^1 = [1 −2 0]^t, w2^1 = [0 −1 2]^t, w3^1 = [1 3 −1]^t. • Step 1: Pattern y1 = [10 2 −1]^t is input: sgn(w1^1t y1) = sgn(6) = 1, sgn(w2^1t y1) = sgn(−4) = −1, sgn(w3^1t y1) = sgn(17) = 1*. Since the only incorrect response is provided by TLU3, we have w1^2 = w1^1, w2^2 = w2^1, w3^2 = w3^1 − y1 = [−9 1 0]^t.
  • 86. Multi-category Single layer Perceptron nets… • Step 2: Pattern y2 = [2 −5 −1]^t is input: sgn(w1^2t y2) = sgn(12) = 1*, sgn(w2^2t y2) = sgn(3) = 1, sgn(w3^2t y2) = sgn(−23) = −1. The only incorrect response is provided by TLU1, so w1^3 = w1^2 − y2 = [1 −2 0]^t − [2 −5 −1]^t = [−1 3 1]^t, w2^3 = w2^2, w3^3 = w3^2.
  • 87. Multi-category Single layer Perceptron nets… • Step 3: Pattern y3 = [−5 5 −1]^t is input: sgn(w1^3t y3) = 1*, sgn(w2^3t y3) = −1, sgn(w3^3t y3) = 1, so w1^4 = w1^3 − y3 = [4 −2 2]^t, w2^4 = w2^3, w3^4 = w3^3. One can verify that the only adjusted weights from now on are those of TLU1. • During the second cycle: w1^5 = w1^4, w1^6 = [2 3 3]^t, w1^7 = [7 −2 4]^t, and then w1^8 = w1^7, w1^9 = [5 3 5]^t.
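A minimal sketch of the R-category discrete-perceptron training above (three TLUs, c = 1, one prototype per class); under the same update rule it should reproduce the weight updates quoted in the slides.

  import numpy as np

  # augmented prototypes [x1, x2, -1], one per class
  Y = np.array([[10.0, 2.0, -1.0], [2.0, -5.0, -1.0], [-5.0, 5.0, -1.0]])
  # desired responses: TLU i should fire +1 for class i and -1 otherwise
  D = 2 * np.eye(3) - 1.0

  W = np.array([[1.0, -2.0, 0.0],   # w1^1
                [0.0, -1.0, 2.0],   # w2^1
                [1.0, 3.0, -1.0]])  # w3^1
  c = 1.0

  for cycle in range(4):
      for y, d in zip(Y, D):
          o = np.where(W @ y >= 0, 1.0, -1.0)      # all R TLU outputs
          W = W + 0.5 * c * np.outer(d - o, y)     # update only the wrong TLUs
      print("after cycle", cycle + 1, ":\n", W)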
  • 88. Multi-category Single layer Perceptron nets… •R-category linear classifier using R continuous bipolar perceptrons
  • 89. Comparison between Perceptron and Bayes' Classifier The perceptron operates on the premise that the patterns to be classified are linearly separable (otherwise the training algorithm will oscillate), while the Bayes classifier can work on non-separable patterns. The Bayes classifier minimizes the probability of misclassification, which is independent of the underlying distribution. The Bayes classifier is a linear classifier under the assumption of Gaussianity. The perceptron is non-parametric, while the Bayes classifier is parametric (its derivation is contingent on the assumption of the underlying distributions). The perceptron is adaptive and simple to implement; the Bayes classifier could be made adaptive, but at the expense of increased storage and more complex computations.
  • 91. Unconstrained Optimization Techniques http://ai.kaist.ac.kr/~jkim/cs679/ Haykin, Chapter 3 Cost function E(w) – continuously differentiable – a measure of how to choose w of an adaptive filtering algorithm so that it behaves in an optimum manner. We want to find an optimal solution w* that minimizes E(w), i.e. ∇E(w*) = 0 – local iterative descent: starting with an initial guess denoted by w(0), generate a sequence of weight vectors w(1), w(2), …, such that the cost function E(w) is reduced at each iteration of the algorithm, as shown by E(w(n+1)) < E(w(n)) – Steepest Descent, Newton's, Gauss-Newton's methods
  • 92. Method of Steepest Descent Here the successive adjustments applied to w are in the direction of steepest descent, that is, in a direction opposite to the gradient of E(w): w(n+1) = w(n) − η g(n), where η is a small positive constant called the step size or learning-rate parameter, and g(n) = ∇E(w(n)). The method of steepest descent converges to the optimal solution w* slowly. The learning-rate parameter η has a profound influence on its convergence behavior – overdamped, underdamped, or even unstable (diverges).
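A minimal sketch of the steepest-descent iteration w(n+1) = w(n) − η g(n) on a simple quadratic cost; the cost function and step size below are illustrative, not from the lecture.

  import numpy as np

  # illustrative quadratic cost E(w) = 0.5 * w^T A w - b^T w
  A = np.array([[3.0, 0.5], [0.5, 1.0]])
  b = np.array([1.0, -2.0])

  def grad(w):
      return A @ w - b          # g(n) = gradient of E at w(n)

  w = np.zeros(2)
  eta = 0.2                     # learning-rate parameter
  for n in range(100):
      w = w - eta * grad(w)     # steepest-descent update

  print("steepest descent:", w, "exact minimizer:", np.linalg.solve(A, b))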
  • 93. Newton's Method Using a second-order Taylor series expansion of the cost function around the point w(n): ΔE(w(n)) = E(w(n+1)) − E(w(n)) ≈ g^T(n) Δw(n) + ½ Δw^T(n) H(n) Δw(n), where Δw(n) = w(n+1) − w(n) and H(n) is the Hessian matrix of E(n). We want the Δw*(n) that minimizes ΔE(w(n)), so we differentiate ΔE(w(n)) with respect to Δw(n): g(n) + H(n) Δw*(n) = 0, so Δw*(n) = −H⁻¹(n) g(n).
  • 94. Newton's Method… Finally, w(n+1) = w(n) + Δw(n) = w(n) − H⁻¹(n) g(n). Newton's method converges quickly asymptotically and does not exhibit the zigzagging behavior; however, the Hessian H(n) has to be a positive definite matrix for all n.
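A minimal sketch of the Newton update on the same illustrative quadratic as the steepest-descent example; for a quadratic cost, one Newton step reaches the minimizer exactly.

  import numpy as np

  A = np.array([[3.0, 0.5], [0.5, 1.0]])   # Hessian of the quadratic cost (positive definite)
  b = np.array([1.0, -2.0])

  w = np.zeros(2)
  g = A @ w - b                            # gradient g(n)
  w = w - np.linalg.solve(A, g)            # w(n+1) = w(n) - H^{-1}(n) g(n)
  print("one Newton step:", w)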
  • 95. Gauss-Newton Method The Gauss-Newton method is applicable to a cost function of the form E(w) = ½ Σ_{i=1}^{n} e²(i). Because the error signal e(i) is a function of w, we linearize the dependence of e(i) on w by writing e'(i, w) = e(i) + [∂e(i)/∂w]^T (w − w(n)). Equivalently, by using matrix notation we may write e'(n, w) = e(n) + J(n)(w − w(n)).
  • 96. Gauss-Newton Method… where J(n) is the n-by-m Jacobian matrix of e(n), with entries ∂e(i)/∂w_j. We want the updated weight vector w(n+1) defined by w(n+1) = arg min_w { ½ ||e'(n, w)||² }. A simple algebraic calculation gives ½ ||e'(n, w)||² = ½ ||e(n)||² + e^T(n) J(n)(w − w(n)) + ½ (w − w(n))^T J^T(n) J(n)(w − w(n)). Now differentiate this expression with respect to w and set the result to 0; we obtain
  • 97. Gauss-Newton Method… J^T(n) e(n) + J^T(n) J(n)(w − w(n)) = 0. Thus we get w(n+1) = w(n) − (J^T(n) J(n))⁻¹ J^T(n) e(n). To guard against the possibility that the matrix product J^T(n) J(n) is singular, the customary practice is w(n+1) = w(n) − (J^T(n) J(n) + δI)⁻¹ J^T(n) e(n), where δ is a small positive constant. The effect of this modification is progressively reduced as the number of iterations, n, is increased.
  • 98. Linear Least-Squares Filter The single neuron around which it is built is linear, and the cost function consists of the sum of error squares. Using y(i) = x^T(i) w(i) and e(i) = d(i) − y(i), the error vector is e(n) = d(n) − X(n) w(n). Differentiating with respect to w(n) gives ∇e(n) = −X^T(n), and correspondingly J(n) = −X(n). From the Gauss-Newton method (eq. 3.22), w(n+1) = (X^T(n) X(n))⁻¹ X^T(n) d(n) = X⁺(n) d(n), where X⁺ denotes the pseudoinverse.
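A minimal sketch of the linear least-squares solution via the pseudoinverse; the synthetic data below are invented purely for illustration.

  import numpy as np

  rng = np.random.default_rng(0)
  X = rng.normal(size=(50, 3))                  # 50 input vectors x(i), 3 taps
  w_true = np.array([0.5, -1.0, 2.0])
  d = X @ w_true + 0.01 * rng.normal(size=50)   # desired responses with small noise

  # w = (X^T X)^{-1} X^T d, i.e. the pseudoinverse solution X^+ d
  w_ls = np.linalg.pinv(X) @ d
  print("least-squares weights:", w_ls)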
  • 99. LMS Algorithm Based on the use of instantaneous values for the cost function: E(w) = ½ e²(n). Differentiating with respect to w: ∂E(w)/∂w = e(n) ∂e(n)/∂w. The error signal in the LMS algorithm is e(n) = d(n) − x^T(n) w(n), hence ∂e(n)/∂w(n) = −x(n), so ∂E(w)/∂w(n) = −e(n) x(n).
  • 100. LMS Algorithm … Using −e(n) x(n) as an estimate for the gradient vector: ĝ(n) = −e(n) x(n). Using this for the gradient vector of the steepest descent method, the LMS algorithm follows as: ŵ(n+1) = ŵ(n) + η e(n) x(n), where η is the learning-rate parameter. The inverse of η is a measure of the memory of the LMS algorithm – when η is small, the adaptive process progresses slowly, more of the past data are remembered, and a more accurate filtering action results.
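A minimal sketch of the LMS update ŵ(n+1) = ŵ(n) + η e(n) x(n) on the same kind of synthetic data as the least-squares example; the step size is an illustrative choice.

  import numpy as np

  rng = np.random.default_rng(1)
  w_true = np.array([0.5, -1.0, 2.0])
  w_hat = np.zeros(3)
  eta = 0.05                                    # learning-rate parameter

  for n in range(2000):
      x = rng.normal(size=3)                    # input vector x(n)
      d = w_true @ x + 0.01 * rng.normal()      # desired response d(n)
      e = d - w_hat @ x                         # error signal e(n)
      w_hat = w_hat + eta * e * x               # LMS weight update

  print("LMS estimate:", w_hat)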
  • 101. LMS Characteristics The LMS algorithm produces an estimate ŵ(n) of the weight vector w(n) – it sacrifices a distinctive feature: • the steepest descent algorithm follows a well-defined trajectory, • the LMS algorithm follows a random trajectory – as the number of iterations goes to infinity, ŵ(n) performs a random walk about the optimal solution. But importantly, the LMS algorithm does not require knowledge of the statistics of the environment.
  • 102. Convergence Consideration Two distinct quantities, η and x(n), determine the convergence – the user supplies η, and the selection of η is important for the LMS algorithm to converge. Convergence of the mean: E[ŵ(n)] → w₀ as n → ∞ – this is not a practical value. Convergence in the mean square: E[e²(n)] → constant as n → ∞. Convergence condition for the LMS algorithm in the mean square: 0 < η < 2 / (sum of mean-square values of the sensor inputs).
  • 104. Perceptron Convergence Proof Haykin, Chapter 3 Consider the following perceptron: v(n) = Σ_{i=0}^{m} w_i(n) x_i(n) = w^T(n) x(n), with w^T x > 0 for every input vector x belonging to class C1, and w^T x ≤ 0 for every input vector x belonging to class C2.
  • 105. Perceptron Convergence Proof… The algorithm for the weight adjustment of the perceptron: – if x(n) is correctly classified, no adjustment to w: w(n+1) = w(n) if w^T x(n) > 0 and x(n) belongs to class C1, or if w^T x(n) ≤ 0 and x(n) belongs to class C2; – otherwise: w(n+1) = w(n) − η(n) x(n) if w^T(n) x(n) > 0 and x(n) belongs to class C2, and w(n+1) = w(n) + η(n) x(n) if w^T(n) x(n) ≤ 0 and x(n) belongs to class C1. – The learning-rate parameter η(n) controls the adjustment applied to the weight vector.
  • 106. Perceptron Convergence Proof For η(n) = 1 and w(0) = 0: suppose the perceptron incorrectly classifies the vectors x(1), x(2), … such that w^T(n) x(n) ≤ 0 while x(n) belongs to class C1, so that w(n+1) = w(n) + η(n) x(n). But since η(n) = 1, w(n+1) = w(n) + x(n) for x(n) belonging to C1. Since w(0) = 0, we can iteratively find w(n+1): w(n+1) = x(1) + x(2) + … + x(n)   (B1). Since the classes C1 and C2 are assumed to be linearly separable, there exists a solution w0 for which w0^T x(n) > 0 for the vectors x(1), …, x(n) belonging to the subset H1 (the subset of training vectors that belong to class C1).
  • 107. Perceptron Convergence Proof For a fixed solution w0, we may then define a positive number α = min_{x(n)∈H1} w0^T x(n)   (B2). Hence equation (B1) above implies w0^T w(n+1) = w0^T x(1) + w0^T x(2) + … + w0^T x(n). Using equation B2 (since each term is greater than or equal to α), we have w0^T w(n+1) ≥ n α. Now we use the Cauchy-Schwarz inequality: (a^T b)² ≤ ||a||² ||b||², or ||a||² ≥ (a^T b)² / ||b||² for b ≠ 0.
  • 108. Perceptron Convergence Proof This implies that ||w(n+1)||² ≥ n² α² / ||w0||²   (B3). Now let's follow another development route (notice the index k): w(k+1) = w(k) + x(k) for k = 1, …, n and x(k) ∈ H1. By taking the squared Euclidean norm of both sides, we get ||w(k+1)||² = ||w(k)||² + ||x(k)||² + 2 w^T(k) x(k). But under the assumption that the perceptron incorrectly classifies an input vector x(k) belonging to the subset H1, we have w^T(k) x(k) < 0 and hence ||w(k+1)||² ≤ ||w(k)||² + ||x(k)||².
  • 109. Perceptron Convergence Proof Or equivalently, ||w(k+1)||² − ||w(k)||² ≤ ||x(k)||², k = 1, …, n. Adding these inequalities for k = 1, …, n, and invoking the initial condition w(0) = 0, we get the following inequality: ||w(n+1)||² ≤ Σ_{k=1}^{n} ||x(k)||² ≤ n β   (B4), where β is a positive number defined by β = max_{x(k)∈H1} ||x(k)||². Eq. B4 states that the squared Euclidean norm of w(n+1) grows at most linearly with the number of iterations n.
  • 110. Perceptron Convergence Proof The second result (B4) is clearly in conflict with Eq. B3 for sufficiently large n. • Indeed, we can state that n cannot be larger than some value nmax for which Eqs. B3 and B4 are both satisfied with the equality sign. That is, nmax is the solution of the equation nmax² α² / ||w0||² = nmax β. • Solving for nmax given a solution w0, we find that nmax = β ||w0||² / α². We have thus proved that for η(n) = 1 for all n, and for w(0) = 0, given that a solution vector w0 exists, the rule for adapting the synaptic weights of the perceptron must terminate after at most nmax iterations.
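A small numeric sketch (with invented data) of the bound nmax = β ||w0||² / α²: it counts the error-correction updates made by the rule w ← w + x on a linearly separable toy set and compares the count with the bound computed from a known separating vector w0. As in the proof's setting, class-2 vectors are assumed to have been negated first, so w0^T x > 0 holds for every sample.

  import numpy as np

  rng = np.random.default_rng(2)
  w0 = np.array([1.0, 2.0])                     # a known separating vector (illustrative)
  X = rng.normal(size=(100, 2))
  X = X[X @ w0 > 0.3]                           # keep points with a clear margin

  alpha = np.min(X @ w0)                        # alpha = min_x w0^T x
  beta = np.max(np.sum(X ** 2, axis=1))         # beta  = max_x ||x||^2
  n_max = beta * (w0 @ w0) / alpha ** 2

  w = np.zeros(2)
  errors = 0
  for _ in range(100):                          # cycle until no mistakes remain
      mistakes = 0
      for x in X:
          if w @ x <= 0:                        # misclassified: apply w <- w + x
              w = w + x
              errors += 1
              mistakes += 1
      if mistakes == 0:
          break

  print("error-correction updates:", errors, "<= bound n_max =", n_max)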
  • 112. Suggested Reading. S. Haykin, “Neural Networks”, Prentice-Hall, 1999, chapter 3. L. Fausett, “Fundamentals of Neural Networks”, Prentice-Hall, 1994, Chapter 2. R. O. Duda, P.E. Hart, and D.G. Stork, “Pattern Classification”, 2nd edition, Wiley 2001. Appendix A4, chapter 2, and chapter 5. J.M. Zurada, “Introduction to Artificial Neural Systems”, West Publishing Company, 1992, chapter 3.
  • 113. References: These lecture notes were based on the references of the previous slide, and the following references 1. Berlin Chen Lecture notes: Normal University, Taipei, Taiwan, ROC. http://140.122.185.120 2. Ehud Rivlin, IIT: http://webcourse.technion.ac.il/236607/Winter2002- 2003/en/ho.html 3. Jin Hyung Kim, KAIST Computer Science Dept., CS679 Neural Network lecture notes http://ai.kaist.ac.kr/~jkim/cs679/detail.htm 4. Dr John A. Bullinaria, Course Material, Introduction to Neural Networks, http://www.cs.bham.ac.uk/~jxb/inn.html