Single Layer Perceptron

This document provides an outline and overview of a lecture on single layer perceptron (SLP) classifiers. The key points covered include: - What a perceptron and SLP are, including their architecture with weights, bias, and activation functions. - The limitations of a single perceptron for non-linearly separable problems. - An introduction to Bayesian decision theory and classification, including examples of classifying fish by species. - Training and classification using discrete and continuous perceptrons for linearly separable problems.

Uploaded by

Vinod kumar

CS407 Neural Computation

Lecture 4:
Single Layer Perceptron (SLP)
Classifiers

Lecturer: A/Prof. M. Bennamoun

Outline
What’s a SLP and what’s classification?
Limitation of a single perceptron.
Foundations of classification and Bayes Decision making theory
Discriminant functions, linear machine and minimum distance
classification
Training and classification using the Discrete perceptron
Single-Layer Continuous perceptron Networks for linearly
separable classifications
Appendix A: Unconstrained optimization techniques
Appendix B: Perceptron Convergence proof
Suggested reading and references
What is a perceptron and what is
a Single Layer Perceptron (SLP)?
Perceptron
The simplest form of a neural network
consists of a single neuron with adjustable
synaptic weights and bias
performs pattern classification with only two
classes
perceptron convergence theorem :
– Patterns (vectors) are drawn from two
linearly separable classes
– During training, the perceptron algorithm
converges and positions the decision
surface in the form of hyperplane between
two classes by adjusting synaptic weights
What is a perceptron?
[Figure: neuron k with input signals x1, ..., xm, synaptic weights wk1, ..., wkm, bias bk, a summing junction Σ, an activation function ϕ(.), and output yk]
$v_k = \sum_{j=1}^{m} w_{kj} x_j + b_k$
$y_k = \varphi(v_k)$
Discrete Perceptron: $\varphi(\cdot) = \mathrm{sign}(\cdot)$
Continuous Perceptron: $\varphi(\cdot)$ is S-shaped
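Activation Function of a perceptron
[Figure: two activation functions plotted against vi, with output levels -1 and +1. Left: the signum function (sign). Right: an S-shaped function.]
Discrete Perceptron: $\varphi(\cdot) = \mathrm{sign}(\cdot)$
Continuous Perceptron: $\varphi(v)$ is S-shaped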
SLP Architecture
Single layer perceptron

Input layer Output layer

Where are we heading? Different
Non-Linearly Separable Problems
http://www.zsolutions.com/light.htm

Structure     | Types of Decision Regions
Single-Layer  | Half plane bounded by hyperplane
Two-Layer     | Convex open or closed regions
Three-Layer   | Arbitrary (complexity limited by no. of nodes)

[Figure columns, one sketch per structure with regions labelled A and B: Exclusive-OR Problem, Classes with Meshed Regions, Most General Region Shapes]
Review from last lectures:
Implementing Logic Gates with
Perceptrons http://www.cs.bham.ac.uk/~jxb/NN/l3.pdf

We can use the perceptron to implement the basic logic gates (AND, OR
and NOT).
All we need to do is find the appropriate connection weights and neuron
thresholds to produce the right outputs for each set of inputs.
We saw how we can construct simple networks that perform NOT, AND,
and OR.
It is then a well known result from logic that we can construct any logical
function from these three operations.
The resulting networks, however, will usually have a much more complex
architecture than a simple Perceptron.
We generally want to avoid decomposing complex problems into simple
logic gates, by finding the weights and thresholds that work directly in a
Perceptron architecture.
Implementation of Logical NOT, AND, and OR
In each case we have inputs in_i and outputs out, and need to determine
the weights and thresholds. It is easy to find solutions by inspection:
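A minimal sketch of what finding solutions by inspection can look like. The weights and thresholds below are illustrative choices, not the values from the original slide figure, and the unit is assumed to output 1 when the weighted sum reaches the threshold:

```python
from itertools import product

def step(v):
    """Threshold unit: 1 when the net input is non-negative, else 0."""
    return 1 if v >= 0 else 0

def perceptron(inputs, weights, theta):
    """Output of a single perceptron with threshold theta."""
    net = sum(w * x for w, x in zip(weights, inputs))
    return step(net - theta)

# (weights, threshold) pairs picked by inspection; assumed, illustrative values.
NOT_GATE = ([-1.0], -0.5)
AND_GATE = ([1.0, 1.0], 1.5)
OR_GATE = ([1.0, 1.0], 0.5)

def truth_table(gate, n_inputs):
    """Map every binary input tuple to the gate's output."""
    weights, theta = gate
    return {bits: perceptron(bits, weights, theta)
            for bits in product((0, 1), repeat=n_inputs)}

if __name__ == "__main__":
    for name, gate, n in (("NOT", NOT_GATE, 1), ("AND", AND_GATE, 2), ("OR", OR_GATE, 2)):
        print(name, truth_table(gate, n))
```

Many other weight and threshold choices realise the same three gates.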
The Need to Find Weights Analytically
Constructing simple networks by hand is one thing. But what about
harder problems? For example, what about:

How long do we keep looking for a solution? We need to be able to

calculate appropriate parameters rather than looking for solutions by trial
and error.
Each training pattern produces a linear inequality for the output in terms
of the inputs and the network parameters. These can be used to compute
the weights and thresholds.
Finding Weights Analytically for the AND Network

We have two weights w1 and w2 and the threshold θ, and for each
training pattern we need to satisfy

So the training data lead to four inequalities:

It is easy to see that there are an infinite number of solutions. Similarly,

there are an infinite number of solutions for the NOT and OR networks.
Limitations of Simple Perceptrons

We can follow the same procedure for the XOR network:

Clearly the second and third inequalities are incompatible with the
fourth, so there is in fact no solution. We need more complex networks,
e.g. that combine together many simple networks, or use different
activation/thresholding/transfer functions.
It then becomes much more difficult to determine all the weights and
thresholds by hand.
These weights are instead adapted using learning rules. Hence we need to
consider learning rules (see previous lecture), and more complex
architectures.
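The contrast between AND and XOR can be checked numerically. The sketch below assumes the output convention out = 1 when w1·in1 + w2·in2 − θ ≥ 0 and searches a coarse, arbitrarily chosen grid of parameters. An empty result on a grid is only suggestive; the inequalities themselves are what rule XOR out.

```python
from itertools import product

PATTERNS = [(0, 0), (0, 1), (1, 0), (1, 1)]
AND_TARGETS = [0, 0, 0, 1]
XOR_TARGETS = [0, 1, 1, 0]

def output(w1, w2, theta, x):
    """Single threshold unit; the >= convention is an assumption."""
    return 1 if w1 * x[0] + w2 * x[1] - theta >= 0 else 0

def find_solutions(targets, grid):
    """Return every (w1, w2, theta) on the grid that reproduces the targets."""
    return [(w1, w2, th)
            for w1, w2, th in product(grid, repeat=3)
            if all(output(w1, w2, th, x) == t for x, t in zip(PATTERNS, targets))]

# Hypothetical search grid: -3.0, -2.5, ..., 3.0
GRID = [i / 2 for i in range(-6, 7)]

and_solutions = find_solutions(AND_TARGETS, GRID)
xor_solutions = find_solutions(XOR_TARGETS, GRID)

if __name__ == "__main__":
    print("AND solutions on grid:", len(and_solutions))
    print("XOR solutions on grid:", len(xor_solutions))
```

Under this convention XOR requires θ > 0, w1 ≥ θ, w2 ≥ θ and w1 + w2 < θ at once, which cannot hold, so the XOR list stays empty for any grid.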
E.g. Decision Surface of a Perceptron
[Figure: two scatter plots of + and - points in the (x1, x2) plane. Left: linearly separable. Right: non-linearly separable.]

• Perceptron is able to represent some useful functions

• But functions that are not linearly separable (e.g. XOR)
are not representable
What is classification?
Classification ? http://140.122.185.120

Pattern classification/recognition
- Assign the input data (a physical object, event, or phenomenon)
to one of the pre-specified classes (categories)
The block diagram of the recognition and classification system
Classification: an example
http://webcourse.technion.ac.il/236607/Winter2002-2003/en/ho.htm
Duda & Hart, Chapter 1

• Automate the process of sorting incoming fish on a conveyor belt according to species (Salmon or Sea bass).
– Set up a camera
– Take some sample images
– Note the physical differences between the two types of fish:
  Length
  Lightness
  Width
  No. & shape of fins (“sanfirim”)
  Position of the mouth
Classification: an example…

• Cost of misclassification: depends on application
Is it better to misclassify salmon as bass or vice versa?
– Put salmon in a can of bass ⇒ lose profit
– Put bass in a can of salmon ⇒ lose customer
– There is a cost associated with our decision.
– Make a decision to minimize a given cost.
• Feature Extraction:
– Problem & domain dependent
– Requires knowledge of the domain
– A good feature extractor would make the job of the classifier trivial.
Bayesian decision theory
Bayesian Decision Theory
http://webcourse.technion.ac.il/236607/Winter2002-2003/en/ho.html
Duda & Hart, Chapter 2

Bayesian decision theory is a fundamental statistical approach to the problem of pattern classification.
– Decision making when all the probabilistic information is known.
– For given probabilities the decision is optimal.
– When new information is added, it is assimilated in optimal fashion for improvement of decisions.
Bayesian Decision Theory …
Fish Example:
Each fish is in one of 2 states: sea bass or salmon
Let ω denote the state of nature
– ω = ω1 for sea bass
– ω = ω2 for salmon
Bayesian Decision Theory …
The state of nature is unpredictable ⇒ ω is a variable that must be described probabilistically.
If the catch produced as much salmon as sea bass, the next fish is equally likely to be sea bass or salmon.
Define
– P(ω1): a priori probability that the next fish is sea bass
– P(ω2): a priori probability that the next fish is salmon.
Bayesian Decision Theory …
If other types of fish are irrelevant:
P(ω1) + P(ω2) = 1.
Prior probabilities reflect our prior knowledge (e.g. time of year, fishing area, …)
Simple decision rule:
– Make a decision without seeing the fish.
– Decide ω1 if P(ω1) > P(ω2); ω2 otherwise.
– OK if deciding for one fish
– If several fish, all are assigned to the same class.
Bayesian Decision Theory ...

In general, we will have some features and more information.
Feature: lightness measurement = x
– Different fish yield different lightness readings (x is a random variable)
Bayesian Decision Theory ….
Define
p(x|ω1) = class-conditional probability density: the probability density function for x given that the state of nature is ω1

The difference between p(x|ω1) and p(x|ω2) describes the difference in lightness between sea bass and salmon.
Class-conditional probability density: p(x|ω)

[Figure: hypothetical class-conditional probability density functions]

Density functions are normalized (area under each curve is 1.0)
Bayesian Decision Theory ...
Suppose that we know
the prior probabilities P(ω1) and P(ω2),
the conditional densities p(x|ω1) and p(x|ω2).
Measure the lightness of a fish = x.
What is the category of the fish, i.e. what is P(ωj|x)?
Bayes Formula
Given
– Prior probabilities P(ωj)
– Conditional probabilities p(x|ωj)
Measurement of particular item
– Feature value x
Bayes formula:
$P(\omega_j \mid x) = \dfrac{p(x \mid \omega_j)\, P(\omega_j)}{p(x)}$
$\text{Posterior} = \dfrac{\text{Likelihood} \times \text{Prior}}{\text{Evidence}}$
(from $p(\omega_j, x) = p(x \mid \omega_j)\, P(\omega_j) = P(\omega_j \mid x)\, p(x)$)
where $\sum_i P(\omega_i \mid x) = 1$, so $p(x) = \sum_i p(x \mid \omega_i)\, P(\omega_i)$
Bayes' formula ...

• p(x|ωj) is called the likelihood of ωj with respect to x
(the category ωj for which p(x|ωj) is large is more "likely" to be the true category).
• p(x) is the evidence:
– how frequently we will measure a pattern with feature value x;
– a scale factor that guarantees that the posterior probabilities sum to 1.
Posterior Probability

Posterior probabilities for the particular priors P(ω1)=2/3 and P(ω2)=1/3.

At every x the posteriors sum to 1.
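A small sketch of Bayes' formula for the fish example. The priors 2/3 and 1/3 come from the slide above; the Gaussian class-conditional densities and their means and standard deviations are purely hypothetical stand-ins for the curves in the lost figure.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution, used here as an assumed p(x | class)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

PRIORS = {"sea bass": 2.0 / 3.0, "salmon": 1.0 / 3.0}
# Hypothetical lightness densities: (mean, standard deviation) per class.
DENSITIES = {"sea bass": (11.0, 1.5), "salmon": (13.0, 1.0)}

def posteriors(x):
    """P(class | x) = p(x | class) P(class) / p(x)."""
    joint = {c: gaussian_pdf(x, *DENSITIES[c]) * PRIORS[c] for c in PRIORS}
    evidence = sum(joint.values())  # p(x), the scale factor
    return {c: joint[c] / evidence for c in joint}

def decide(x):
    """Pick the class with the largest posterior."""
    post = posteriors(x)
    return max(post, key=post.get)

def error_probability(x):
    """P(error | x) when the larger posterior is chosen."""
    return min(posteriors(x).values())

if __name__ == "__main__":
    for x in (9.0, 12.0, 13.5):
        print(x, decide(x), posteriors(x))
```

Whatever densities are plugged in, dividing by the evidence makes the two posteriors sum to 1 at every x.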
Error

$P(\text{error} \mid x) = \begin{cases} P(\omega_1 \mid x) & \text{if we decide } \omega_2 \\ P(\omega_2 \mid x) & \text{if we decide } \omega_1 \end{cases}$
For a given x, we can minimize the probability of error by deciding ω1 if P(ω1|x) > P(ω2|x) and ω2 otherwise.
Bayes' Decision Rule
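(Minimizes the probability of error)

ω1 : if P(ω1|x) > P(ω2|x)
ω2 : otherwise

i.e. compare the likelihood ratio with a threshold:
$\dfrac{p(x \mid \omega_1)}{p(x \mid \omega_2)} \underset{\omega_2}{\overset{\omega_1}{\gtrless}} \dfrac{P(\omega_2)}{P(\omega_1)}$

and
P(Error|x) = min [P(ω1|x), P(ω2|x)]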
Decision Boundaries

Classification as division of feature space into non-overlapping regions
$X_1, \ldots, X_R$ such that $x \in X_k \leftrightarrow x$ is assigned to $\omega_k$

Boundaries between these regions are known as decision surfaces or decision boundaries
Optimum decision boundaries
Criterion:
– minimize misclassification
– maximize correct classification

$P(\text{correct}) = \sum_{k=1}^{R} P(x \in X_k, \omega_k) = \sum_{k=1}^{R} P(x \in X_k \mid \omega_k)\, P(\omega_k)$
Here R = 2.
Classify $x \in X_k$ if $\forall j \neq k$: $p(x \mid \omega_k)\, P(\omega_k) > p(x \mid \omega_j)\, P(\omega_j)$,
i.e. maximum posterior probability: $\forall j \neq k$: $P(\omega_k \mid x) > P(\omega_j \mid x)$
Discriminant functions

Discriminant functions determine classification by comparison of their values:
Classify $x \in X_k$ if $\forall j \neq k$: $g_k(x) > g_j(x)$
Optimum classification: based on the posterior probability $P(\omega_k \mid x)$
Any monotonically increasing function g may be applied without changing the decision boundaries:
$g_k(x) = g(P(\omega_k \mid x))$, e.g. $g_k(x) = \ln P(\omega_k \mid x)$
The Two-Category Case
Use 2 discriminant functions g1 and g2, and assign x to ω1 if g1 > g2.
Alternative: define a single discriminant function g(x) = g1(x) - g2(x); decide ω1 if g(x) > 0, otherwise decide ω2.
Two category case

$g(x) = P(\omega_1 \mid x) - P(\omega_2 \mid x)$
$g(x) = \ln \dfrac{p(x \mid \omega_1)}{p(x \mid \omega_2)} + \ln \dfrac{P(\omega_1)}{P(\omega_2)}$
Summary

Bayes approach:
– Estimate class-conditioned probability density
– Combine with prior class probability
– Determine posterior class probability
– Derive decision boundaries
Alternate approach implemented by NN
– Estimate posterior probability directly
– i.e. determine decision boundaries directly
DISCRIMINANT FUNCTIONS
Discriminant Functions http://140.122.185.120

The classifier determines the membership in a category by comparing R discriminant functions g1(x), g2(x), …, gR(x)
– x is within the region Xk if gk(x) has the largest value

Do not confuse n = dim of each I/P vector (dim of feature space), P = # of I/P vectors, and R = # of classes.
Discriminant Functions…
Linear Machine and Minimum Distance
Classification
• Find the linear-form discriminant function for two class
classification when the class prototypes are known

• Example 3.1: Select the decision hyperplane that contains the midpoint of the line segment connecting the center points of the two classes
Linear Machine and Minimum Distance
Classification… (dichotomizer)
•The dichotomizer’s discriminant function g(x), for class centers x1 and x2:
$g(x) = (x_1 - x_2)^t x + \tfrac{1}{2}\left(\|x_2\|^2 - \|x_1\|^2\right)$
Linear Machine and Minimum Distance
Classification…(multiclass classification)
•The linear-form discriminant functions for multiclass
classification
– There are up to R(R-1)/2 decision hyperplanes for R
pairwise separable classes

(i.e. next to or touching another)

Linear Machine and Minimum Distance
Classification… (multiclass classification)
•Linear machine or minimum-distance classifier
– Assume the class prototypes are known for all
classes
• Euclidean distance between input pattern x and the center of class i, xi:
$\|x - x_i\| = \sqrt{(x - x_i)^t (x - x_i)}$
Linear Machine and Minimum Distance
Classification…
P1, P2, P3 are the centres of gravity of the prototype points, we need to design a minimum distance classifier. Using
the formulas from the previous slide, we get wi

Note: to find S12 we need to compute (g1-g2)
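A sketch of the minimum-distance classifier as a linear machine. It assumes the standard form g_i(x) = x_i^t x − ½ x_i^t x_i, which follows from expanding the squared Euclidean distance and dropping the term common to all classes. The prototype coordinates are an assumption, chosen to match the patterns used later in the multi-category example.

```python
# Assumed class prototypes P1, P2, P3 (not stated on this slide).
PROTOTYPES = [(10.0, 2.0), (2.0, -5.0), (-5.0, 5.0)]

def augmented_weights(p):
    """Weights [w1, w2, w3] of g(x) = p.x - 0.5 * ||p||^2."""
    return [p[0], p[1], -0.5 * (p[0] ** 2 + p[1] ** 2)]

WEIGHTS = [augmented_weights(p) for p in PROTOTYPES]

def discriminants(x):
    """Values g_1(x), ..., g_R(x) of the linear machine."""
    return [w[0] * x[0] + w[1] * x[1] + w[2] for w in WEIGHTS]

def classify_linear(x):
    """Index of the largest discriminant."""
    g = discriminants(x)
    return g.index(max(g))

def classify_nearest(x):
    """Index of the nearest prototype in Euclidean distance."""
    d2 = [(x[0] - p[0]) ** 2 + (x[1] - p[1]) ** 2 for p in PROTOTYPES]
    return d2.index(min(d2))

if __name__ == "__main__":
    print(WEIGHTS)
    print(classify_linear((4.0, 0.0)), classify_nearest((4.0, 0.0)))
```

The decision line between two classes, such as S12, is then the set of points where g1(x) − g2(x) = 0.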

Linear Machine and Minimum Distance
Classification…
•If R linear discriminant functions exist for a set of patterns such that
$g_i(x) > g_j(x)$ for $x \in$ Class i, $i = 1, 2, \ldots, R$, $j = 1, 2, \ldots, R$, $i \neq j$,
then the classes are linearly separable.

Linear Machine and Minimum Distance
Classification… Example:
Linear Machine and Minimum Distance
Classification…

•Examples 3.1 and 3.2 have shown that the coefficients

(weights) of the linear discriminant functions can be
determined if the a priori information about the sets of
patterns and their class membership is known
•In the next section (Discrete perceptron) we will
examine neural networks that derive their weights during
the learning cycle.
Linear Machine and Minimum Distance
Classification…
•The example of linearly non-separable patterns
Linear Machine and Minimum Distance
Classification…
$o_1 = \mathrm{sgn}(x_1 + x_2 + 1)$
[Figure: the patterns shown in the input space (x) and in the image space (o)]
Linear Machine and Minimum Distance
Classification…

$o_1 = \mathrm{sgn}(x_1 + x_2 + 1)$
$o_2 = \mathrm{sgn}(-x_1 - x_2 + 1)$

x1   x2   o1   o2
-1   -1   -1    1
-1    1    1    1
 1   -1    1    1
 1    1    1   -1

The inputs (-1, 1) and (1, -1) map to the same point (1, 1) in the image space.
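The mapping can be reproduced in a few lines. The two threshold units are taken from the slide; the third unit `o3` that separates the classes in the image space is an illustrative assumption added here, not part of the slide.

```python
def sgn(v):
    """Bipolar threshold; sgn(0) is never needed for these inputs."""
    return 1 if v > 0 else -1

def image(x1, x2):
    """Map an input pattern to the image space (o1, o2)."""
    return (sgn(x1 + x2 + 1), sgn(-x1 - x2 + 1))

def o3(o1, o2):
    """Hypothetical output unit acting on the image space."""
    return sgn(o1 + o2 - 1)

TABLE = {(x1, x2): image(x1, x2) for x1 in (-1, 1) for x2 in (-1, 1)}

if __name__ == "__main__":
    for x, o in TABLE.items():
        print(x, "->", o, "->", o3(*o))
```

In the image space the two XOR-like classes become linearly separable: the point (1, 1) lies on one side of the line o1 + o2 − 1 = 0 and the points (-1, 1) and (1, -1) on the other.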
The Discrete Perceptron
Discrete Perceptron Training Algorithm
• So far, we have shown that coefficients of linear
discriminant functions called weights can be
determined based on a priori information about sets of
patterns and their class membership.
•In what follows, we will begin to examine neural
network classifiers that derive their weights during the
learning cycle.
•The sample pattern vectors x1, x2, …, xp, called the
training sequence, are presented to the machine along
with the correct response.
Discrete Perceptron Training Algorithm
- Geometrical Representations http://140.122.185.120
Zurada, Chapter 3

[Figure: decision surfaces in the weight space; each intersects the origin point w = 0]
5 prototype patterns in this case: y1, y2, …, y5
If the dimension of the augmented pattern vector is > 3, our power of visualization is no longer of assistance. In this case, the only recourse is to use the analytical approach.
Discrete Perceptron Training Algorithm
- Geometrical Representations…
•Devise an analytic approach based on the geometrical representations
– E.g. the decision surface for the training pattern y1
– Gradient (the direction of steepest increase): $\nabla_w (w^t y_1) = y_1$
– If y1 is in Class 1 (see previous slide): $w' = w^1 + c\,y_1$
– If y1 is in Class 2: $w' = w^1 - c\,y_1$ (correction in the negative gradient direction)
– c (> 0) is the correction increment; it controls the size of the adjustment (it is two times the learning constant ρ introduced before)
[Figure: weight-space sketches for y1 in Class 1 and y1 in Class 2]
Discrete Perceptron Training Algorithm
- Geometrical Representations…
The distance p from the current weight w1 to the decision surface of pattern y sets the correction:
$\|c\,y\| = \dfrac{|w^{1t} y|}{y^t y}\,\|y\| = p$, i.e. $c = \dfrac{|w^{1t} y|}{y^t y}$

Note 1: p = distance so > 0

Note 2: c is not constant and depends on the current training pattern as expressed by the eq. above.
Discrete Perceptron Training Algorithm
- Geometrical Representations…
•For fixed correction rule: c = constant, the correction of weights is always the same fixed portion of the current training vector
– The weight can be initialised at any value

•For dynamic correction rule: c depends on the distance from the weight (i.e. the weight vector) to the decision surface in the weight space. Hence c is determined by the current weight and the current input pattern.
– The initial weight should be different from 0
(if w1 = 0, then cy = 0 and w’ = w1 + cy = 0, therefore no possible adjustments).
Discrete Perceptron Training Algorithm
- Geometrical Representations…
•Dynamic correction rule: Using the value of c from the previous slide as a reference, we devise an adjustment technique which depends on the length ‖w2 - w1‖
– λ = 0: no weight adjustment
– λ = 2: symmetrical reflection w.r.t. the decision plane

Note: λ is the ratio of the distance between the old weight vector w1 and the new w2 to the distance from w1 to the pattern hyperplane
Discrete Perceptron Training Algorithm
- Geometrical Representations…
•Example:

$x_1 = 1,\; x_3 = 3,\; d_1 = d_3 = 1$: class 1
$x_2 = -0.5,\; x_4 = -2,\; d_2 = d_4 = -1$: class 2
•The augmented input vectors are:
$y_1 = \begin{bmatrix} 1 \\ 1 \end{bmatrix},\; y_2 = \begin{bmatrix} -0.5 \\ 1 \end{bmatrix},\; y_3 = \begin{bmatrix} 3 \\ 1 \end{bmatrix},\; y_4 = \begin{bmatrix} -2 \\ 1 \end{bmatrix}$
•The decision lines $w^t y_i = 0$, for i = 1, 2, 3, 4, are sketched on the augmented weight space as follows:
[Figure: decision lines in the augmented weight space]
Discrete Perceptron Training Algorithm
- Geometrical Representations…
For c = 1 and $w^1 = [-2.5 \;\; 1.75]^t$

•Using $w' = w \pm c\,y$, the weight training with each step can be summarized as follows:
$\Delta w^k = \dfrac{c}{2}\left[d_k - \mathrm{sgn}(w^{kt} y_k)\right] y_k$
•We obtain the following outputs and weight updates:
•Step 1: Pattern y1 is input
$o_1 = \mathrm{sgn}\left([-2.5 \;\; 1.75] \begin{bmatrix} 1 \\ 1 \end{bmatrix}\right) = -1$
$d_1 - o_1 = 2$
$w^2 = w^1 + y_1 = \begin{bmatrix} -1.5 \\ 2.75 \end{bmatrix}$
Discrete Perceptron Training Algorithm
- Geometrical Representations…
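•Step 2: Pattern y2 is input
$o_2 = \mathrm{sgn}\left([-1.5 \;\; 2.75] \begin{bmatrix} -0.5 \\ 1 \end{bmatrix}\right) = 1$
$d_2 - o_2 = -2$
$w^3 = w^2 - y_2 = \begin{bmatrix} -1 \\ 1.75 \end{bmatrix}$
•Step 3: Pattern y3 is input
$o_3 = \mathrm{sgn}\left([-1 \;\; 1.75] \begin{bmatrix} 3 \\ 1 \end{bmatrix}\right) = -1$
$d_3 - o_3 = 2$
$w^4 = w^3 + y_3 = \begin{bmatrix} 2 \\ 2.75 \end{bmatrix}$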
Discrete Perceptron Training Algorithm
- Geometrical Representations…
• Since we have no evidence that the weight w4 classifies all patterns correctly, training continues with y4, and the training set consisting of the ordered sequence of patterns y1, y2, y3 and y4 is then recycled. We thus have y5 = y1, y6 = y2, etc. (the superscript is used to denote the following training step number).
•Step 4, 5: w6 = w5 = w4 (no misclassification, thus no weight adjustments).
•You can check that the adjustments in steps 6 through 10 are as follows:
$w^7 = [2.5 \;\; 1.75]^t$
$w^{10} = w^9 = w^8 = w^7$
$w^{11} = [3 \;\; 0.75]^t$
w11 is in the solution area.
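The weight sequence of this example can be replayed with a short script. It is a sketch of the fixed-correction rule Δw = (c/2)[d − sgn(wᵗy)]y with c = 1, cycling through the four augmented patterns y1, ..., y4 in order. The stopping test (one full pass without corrections) and the function names are assumptions made for illustration.

```python
def sgn(v):
    """Bipolar threshold; sgn(0) does not occur in this example."""
    return 1 if v > 0 else -1

# Augmented patterns and desired responses from the example.
PATTERNS = [((1.0, 1.0), 1), ((-0.5, 1.0), -1), ((3.0, 1.0), 1), ((-2.0, 1.0), -1)]

def train(w, patterns, c=1.0, max_steps=100):
    """Return the list of weight vectors w1, w2, ... visited during training."""
    w = list(w)
    history = [tuple(w)]
    correct_in_a_row = 0
    step = 0
    while correct_in_a_row < len(patterns) and step < max_steps:
        y, d = patterns[step % len(patterns)]
        o = sgn(sum(wi * yi for wi, yi in zip(w, y)))
        factor = 0.5 * c * (d - o)          # 0 when the pattern is classified correctly
        w = [wi + factor * yi for wi, yi in zip(w, y)]
        correct_in_a_row = correct_in_a_row + 1 if d == o else 0
        history.append(tuple(w))
        step += 1
    return history

history = train([-2.5, 1.75], PATTERNS)

if __name__ == "__main__":
    for k, w in enumerate(history, start=1):
        print(f"w{k} = {w}")
```

Here `history[k - 1]` holds the weight written w^k on the slides, so `history[10]` is w11.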

The Continuous Perceptron
Continuous Perceptron Training Algorithm
http://140.122.185.120
Zurada, Chapter 3

•Replace the TLU (Threshold Logic Unit) with the sigmoid activation function for two reasons:
– Gain finer control over the training procedure
– Facilitate the differential characteristics to enable computation of the error gradient (of the current error function)

The factor ½ in the error function does not affect the location of the error minimum
Continuous Perceptron Training Algorithm…

•The new weights are obtained by moving in the direction of the negative gradient along the multidimensional error surface

By definition of the steepest descent concept, each elementary move should be perpendicular to the current error contour.
Continuous Perceptron Training Algorithm…
•Define the error as the squared difference between the desired output and the actual output: $E = \tfrac{1}{2}(d - o)^2$

Since $net = w^t y$, we have $\dfrac{\partial(net)}{\partial w_i} = y_i$, $i = 1, 2, \ldots, n+1$
This gives the training rule of the continuous perceptron (equivalent to the delta training rule)
Continuous Perceptron Training Algorithm…
Same as previous example (of discrete perceptron) but with a
continuous activation function and using the delta rule.

Same training pattern set as

discrete perceptron example
Continuous Perceptron Training Algorithm…
$E_k = \dfrac{1}{2}\left[d_k - \left(\dfrac{2}{1 + \exp(-\lambda\, net_k)} - 1\right)\right]^2$
$E_1(w) = \dfrac{1}{2}\left[1 - \left(\dfrac{2}{1 + \exp[-\lambda (w_1 + w_2)]} - 1\right)\right]^2$
Setting λ = 1 and reducing the terms simplifies this expression to the following form:
$E_1(w) = \dfrac{2}{[1 + \exp(w_1 + w_2)]^2}$
Similarly,
$E_2(w) = \dfrac{2}{[1 + \exp(0.5 w_1 - w_2)]^2}$
$E_3(w) = \dfrac{2}{[1 + \exp(3 w_1 + w_2)]^2}$
$E_4(w) = \dfrac{2}{[1 + \exp(2 w_1 - w_2)]^2}$
These error surfaces are as shown on the previous slide.
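The simplified error expressions can be checked against the general definition, and one gradient step can be sketched. The update below assumes the usual delta rule for the bipolar sigmoid, w ← w + η(d − o)·½λ(1 − o²)·y, which follows from the chain rule with f′(net) = ½λ(1 − o²). The learning rate η = 0.1 is an arbitrary illustrative value.

```python
import math

# Same training patterns as the discrete perceptron example.
PATTERNS = [((1.0, 1.0), 1.0), ((-0.5, 1.0), -1.0), ((3.0, 1.0), 1.0), ((-2.0, 1.0), -1.0)]

def activation(net, lam=1.0):
    """Bipolar sigmoid 2 / (1 + exp(-lam * net)) - 1."""
    return 2.0 / (1.0 + math.exp(-lam * net)) - 1.0

def error(w, y, d, lam=1.0):
    """E_k = 1/2 (d_k - o_k)^2 for one pattern."""
    net = w[0] * y[0] + w[1] * y[1]
    return 0.5 * (d - activation(net, lam)) ** 2

def simplified_errors(w):
    """The four closed forms E_1 ... E_4 for lam = 1."""
    w1, w2 = w
    return [2.0 / (1.0 + math.exp(w1 + w2)) ** 2,
            2.0 / (1.0 + math.exp(0.5 * w1 - w2)) ** 2,
            2.0 / (1.0 + math.exp(3.0 * w1 + w2)) ** 2,
            2.0 / (1.0 + math.exp(2.0 * w1 - w2)) ** 2]

def delta_rule_step(w, y, d, eta=0.1, lam=1.0):
    """One steepest-descent step on E_k (assumed form of the delta rule)."""
    o = activation(w[0] * y[0] + w[1] * y[1], lam)
    scale = eta * 0.5 * lam * (d - o) * (1.0 - o * o)
    return [wi + scale * yi for wi, yi in zip(w, y)]

if __name__ == "__main__":
    w = [-2.5, 1.75]
    print([error(w, y, d) for y, d in PATTERNS])
    print(simplified_errors(w))
```

A single small step on pattern y1 lowers E1, which is the behaviour the negative-gradient move is meant to produce.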
Continuous Perceptron Training Algorithm…

minimum
Multicategory SLP
Multi-category Single layer Perceptron nets
•Treat the last fixed component of the input pattern vector as the neuron activation threshold: $T = w_{n+1}$
$y_{n+1} = -1$ (it is irrelevant whether it is equal to +1 or –1)
Multi-category Single layer Perceptron nets…
• R-category linear classifier using R discrete bipolar
perceptrons
– Goal: The i-th TLU response of +1 is indicative of
class i and all other TLU respond with -1
Multi-category Single layer Perceptron nets…
•Example 3.5

[Figure annotation: should be (-1, -1, 1)^t]
Indecision regions = regions where no class membership of an input pattern can be uniquely determined based on the response of the classifier (patterns in shaded areas are not assigned any reasonable classification, e.g. point Q, for which o = [1 1 –1]^t => indecisive response). However, no patterns such as Q have been used for training in the example.
Multi-category Single layer Perceptron nets…
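For c = 1 and $w_1^1 = [1 \;\; -2 \;\; 0]^t$, $w_2^1 = [0 \;\; -1 \;\; 2]^t$ and $w_3^1 = [1 \;\; 3 \;\; -1]^t$

•Step 1: Pattern y1 is input
$\mathrm{sgn}\left([1 \;\; -2 \;\; 0] \begin{bmatrix} 10 \\ 2 \\ -1 \end{bmatrix}\right) = 1$
$\mathrm{sgn}\left([0 \;\; -1 \;\; 2] \begin{bmatrix} 10 \\ 2 \\ -1 \end{bmatrix}\right) = -1$
$\mathrm{sgn}\left([1 \;\; 3 \;\; -1] \begin{bmatrix} 10 \\ 2 \\ -1 \end{bmatrix}\right) = 1^*$
Since the only incorrect response (marked *) is provided by TLU3, we have
$w_1^2 = w_1^1$
$w_2^2 = w_2^1$
$w_3^2 = \begin{bmatrix} 1 \\ 3 \\ -1 \end{bmatrix} - \begin{bmatrix} 10 \\ 2 \\ -1 \end{bmatrix} = \begin{bmatrix} -9 \\ 1 \\ 0 \end{bmatrix}$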
Multi-category Single layer Perceptron nets…
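•Step 2: Pattern y2 is input
$\mathrm{sgn}\left([1 \;\; -2 \;\; 0] \begin{bmatrix} 2 \\ -5 \\ -1 \end{bmatrix}\right) = 1^*$
$\mathrm{sgn}\left([0 \;\; -1 \;\; 2] \begin{bmatrix} 2 \\ -5 \\ -1 \end{bmatrix}\right) = 1$
$\mathrm{sgn}\left([-9 \;\; 1 \;\; 0] \begin{bmatrix} 2 \\ -5 \\ -1 \end{bmatrix}\right) = -1$
$w_1^3 = \begin{bmatrix} 1 \\ -2 \\ 0 \end{bmatrix} - \begin{bmatrix} 2 \\ -5 \\ -1 \end{bmatrix} = \begin{bmatrix} -1 \\ 3 \\ 1 \end{bmatrix}$
$w_2^3 = w_2^2$
$w_3^3 = w_3^2$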
Multi-category Single layer Perceptron nets…
•Step 3: Pattern y3 is input
$\mathrm{sgn}(w_1^{3t} y_3) = 1^*$
$\mathrm{sgn}(w_2^{3t} y_3) = -1$
$\mathrm{sgn}(w_3^{3t} y_3) = 1$
$w_1^4 = [4 \;\; -2 \;\; 2]^t$
$w_2^4 = w_2^3$
$w_3^4 = w_3^3$
One can verify that the only adjusted weights from now on are those of TLU1

• During the second cycle:
$w_1^5 = w_1^4$
$w_1^6 = [2 \;\; 3 \;\; 3]^t$
$w_1^7 = [7 \;\; -2 \;\; 4]^t$
$w_1^8 = w_1^7$
$w_1^9 = [5 \;\; 3 \;\; 5]^t$
Multi-category Single layer Perceptron nets…
•R-category linear classifier using R continuous bipolar
perceptrons
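Before moving to continuous units, the discrete training run of Example 3.5 above can be replayed in code. This is a sketch under stated assumptions: y1 and y2 are the augmented patterns shown in the steps above, while y3 = (−5, 5, −1) does not appear explicitly in the text and is inferred here from the update w_1^4 = w_1^3 − y3. Each TLU i is trained with desired response +1 for pattern i and −1 otherwise.

```python
def sgn(v):
    """Bipolar threshold; sgn(0) does not occur in this example."""
    return 1 if v > 0 else -1

# Augmented patterns; the third one is inferred, not quoted from the slides.
PATTERNS = [(10.0, 2.0, -1.0), (2.0, -5.0, -1.0), (-5.0, 5.0, -1.0)]
INITIAL = [[1.0, -2.0, 0.0], [0.0, -1.0, 2.0], [1.0, 3.0, -1.0]]

def train(weights, patterns, c=1.0, max_cycles=20):
    """Train R discrete bipolar perceptrons; return final weights and cycles used."""
    weights = [list(w) for w in weights]
    for cycle in range(1, max_cycles + 1):
        changed = False
        for k, y in enumerate(patterns):
            for i, w in enumerate(weights):
                d = 1 if i == k else -1
                o = sgn(sum(wi * yi for wi, yi in zip(w, y)))
                if o != d:
                    factor = 0.5 * c * (d - o)
                    weights[i] = [wi + factor * yi for wi, yi in zip(w, y)]
                    changed = True
        if not changed:
            return weights, cycle
    return weights, max_cycles

final_weights, cycles = train(INITIAL, PATTERNS)

if __name__ == "__main__":
    print(final_weights, "after", cycles, "cycles")
```

Only the weights of TLU1 keep changing after the first cycle, in line with the remark on the slide.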
Comparison between Perceptron and
Bayes’ Classifier
The perceptron operates on the premise that the patterns to be classified are linearly separable (otherwise the training algorithm will oscillate), while the Bayes classifier can work on nonseparable patterns
The Bayes classifier minimizes the probability of misclassification, which is independent of the underlying distribution
The Bayes classifier is a linear classifier on the assumption of Gaussianity
The perceptron is non-parametric, while the Bayes classifier is parametric (its derivation is contingent on the assumption of the underlying distributions)
The perceptron is adaptive and simple to implement
The Bayes classifier could be made adaptive, but at the expense of increased storage and more complex computations
APPENDIX A

Unconstrained Optimization
Techniques
Unconstrained Optimization Techniques
http://ai.kaist.ac.kr/~jkim/cs679/
Haykin, Chapter 3

Cost function E(w)

– continuously differentiable
– a measure of how to choose w of an adaptive
filtering algorithm so that it behaves in an optimum
manner
We want to find an optimal solution w* that minimizes E(w); a necessary condition is
$\nabla E(w^*) = 0$
– local iterative descent :
starting with an initial guess denoted by w(0),
generate a sequence of weight vectors w(1), w(2),
…, such that the cost function E(w) is reduced at
each iteration of the algorithm, as shown by
E(w(n+1)) < E(w(n))
– Steepest Descent, Newton’s, Gauss-Newton’s
methods
Method of Steepest Descent
Here the successive adjustments applied to w
are in the direction of steepest descent, that
is, in a direction opposite to the grad(E(w))
w(n+1) = w(n) - a g(n)
a : small positive constant called step size or
learning-rate parameter.
g(n) : grad(E(w))
The method of steepest descent converges to
the optimal solution w* slowly
The learning rate parameter a has a profound
influence on its convergence behavior
– overdamped, underdamped, or even
unstable(diverges)
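The three regimes can be seen on a toy problem. The quadratic cost E(w) = ½(w1² + 10·w2²) below is an arbitrary illustrative choice; for it, steepest descent is stable only for a < 2/10.

```python
def cost(w):
    """Toy quadratic cost, chosen only for illustration."""
    return 0.5 * (w[0] ** 2 + 10.0 * w[1] ** 2)

def gradient(w):
    """g = grad E(w) for the toy cost."""
    return [w[0], 10.0 * w[1]]

def steepest_descent(w0, a, steps):
    """Iterate w(n+1) = w(n) - a g(n); return final weights and cost history."""
    w = list(w0)
    costs = [cost(w)]
    for _ in range(steps):
        g = gradient(w)
        w = [wi - a * gi for wi, gi in zip(w, g)]
        costs.append(cost(w))
    return w, costs

if __name__ == "__main__":
    for a in (0.05, 0.15, 0.25):   # overdamped, underdamped, unstable
        w, costs = steepest_descent([1.0, 1.0], a, 50)
        print(a, w, costs[-1])
```

With a = 0.05 both components shrink monotonically, with a = 0.15 the second component overshoots and changes sign on every step while still decaying, and with a = 0.25 it grows without bound.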
Newton’s Method
Using a second-order Taylor series expansion of the cost function around the point w(n):
$\Delta E(w(n)) = E(w(n+1)) - E(w(n)) \approx g^T(n)\, \Delta w(n) + \tfrac{1}{2}\, \Delta w^T(n)\, H(n)\, \Delta w(n)$
where $\Delta w(n) = w(n+1) - w(n)$ and $H(n)$ is the Hessian matrix of E(w) at w(n)
We want the $\Delta w^*(n)$ that minimizes $\Delta E(w(n))$, so differentiate $\Delta E(w(n))$ with respect to $\Delta w(n)$:

$g(n) + H(n)\, \Delta w^*(n) = 0$

so,
$\Delta w^*(n) = -H^{-1}(n)\, g(n)$
Newton’s Method…

Finally,
$w(n+1) = w(n) + \Delta w(n) = w(n) - H^{-1}(n)\, g(n)$
Newton’s method converges quickly asymptotically and does not exhibit the zigzagging behavior of steepest descent
The Hessian H(n) has to be a positive definite matrix for all n
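A minimal sketch of one Newton step, using an assumed quadratic cost E(w) = ½wᵀHw − bᵀw with a positive definite 2×2 Hessian. The matrix, the vector b and the starting point are illustrative values only.

```python
# Assumed positive definite Hessian and linear term of a toy quadratic cost.
H = [[3.0, 1.0], [1.0, 2.0]]
B = [1.0, 1.0]

def gradient(w):
    """g = H w - b for the quadratic cost."""
    return [H[0][0] * w[0] + H[0][1] * w[1] - B[0],
            H[1][0] * w[0] + H[1][1] * w[1] - B[1]]

def solve_2x2(a, rhs):
    """Solve a 2x2 linear system with Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [(a[1][1] * rhs[0] - a[0][1] * rhs[1]) / det,
            (a[0][0] * rhs[1] - a[1][0] * rhs[0]) / det]

def newton_step(w):
    """w(n+1) = w(n) - H^{-1} g(n), computed by solving H delta = -g."""
    g = gradient(w)
    delta = solve_2x2(H, [-g[0], -g[1]])
    return [w[0] + delta[0], w[1] + delta[1]]

if __name__ == "__main__":
    w1 = newton_step([5.0, -7.0])
    print(w1, gradient(w1))
```

For a quadratic cost the second-order expansion is exact, so a single step lands on the minimum; a general cost would need several iterations with H(n) re-evaluated each time.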
Gauss-Newton Method
The Gauss-Newton method is applicable to a cost function
$E(w) = \dfrac{1}{2} \sum_{i=1}^{n} e^2(i)$

Because the error signal e(i) is a function of w, we linearize the dependence of e(i) on w by writing
$e'(i, w) = e(i) + \left[\dfrac{\partial e(i)}{\partial w}\right]^T_{w = w(n)} (w - w(n))$

Equivalently, by using matrix notation we may write
$\mathbf{e}'(n, w) = \mathbf{e}(n) + J(n)\,(w - w(n))$
Gauss-Newton Method…
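where J(n) is the n-by-m Jacobian matrix of e(n):
$J = \begin{bmatrix} \dfrac{\partial e(1)}{\partial w_1} & \cdots & \dfrac{\partial e(1)}{\partial w_\alpha} & \cdots & \dfrac{\partial e(1)}{\partial w_m} \\ \vdots & & \vdots & & \vdots \\ \dfrac{\partial e(k)}{\partial w_1} & \cdots & \dfrac{\partial e(k)}{\partial w_\alpha} & \cdots & \dfrac{\partial e(k)}{\partial w_m} \\ \vdots & & \vdots & & \vdots \\ \dfrac{\partial e(n)}{\partial w_1} & \cdots & \dfrac{\partial e(n)}{\partial w_\alpha} & \cdots & \dfrac{\partial e(n)}{\partial w_m} \end{bmatrix}$
We want the updated weight vector w(n+1) defined by
$w(n+1) = \arg\min_{w} \left\{ \tfrac{1}{2} \|\mathbf{e}'(n, w)\|^2 \right\}$
A simple algebraic calculation gives
$\tfrac{1}{2}\|\mathbf{e}'(n, w)\|^2 = \tfrac{1}{2}\|\mathbf{e}(n)\|^2 + \mathbf{e}^T(n)\, J(n)\,(w - w(n)) + \tfrac{1}{2}(w - w(n))^T J^T(n) J(n)\,(w - w(n))$
Now differentiate this expression with respect to w and set the result to 0; we obtain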
Gauss-Newton Method…
$J^T(n)\, \mathbf{e}(n) + J^T(n) J(n)\,(w - w(n)) = 0$

Thus we get
$w(n+1) = w(n) - \left(J^T(n) J(n)\right)^{-1} J^T(n)\, \mathbf{e}(n)$

To guard against the possibility that the matrix product J^T(n)J(n) is singular, the customary practice is
$w(n+1) = w(n) - \left(J^T(n) J(n) + \delta I\right)^{-1} J^T(n)\, \mathbf{e}(n)$

where δ is a small positive constant.

The effect of this modification is progressively reduced as the number of iterations, n, is increased.
Linear Least-Squares Filter
The single neuron around which it is built is linear
The cost function consists of the sum of error
squares
Using $y(i) = x^T(i)\, w(i)$ and $e(i) = d(i) - y(i)$, the error vector is
$\mathbf{e}(n) = \mathbf{d}(n) - X(n)\, w(n)$
Differentiating with respect to w(n):
$\nabla \mathbf{e}(n) = -X^T(n)$
Correspondingly,
$J(n) = -X(n)$
From the Gauss-Newton method (eq. 3.22),
$w(n+1) = \left(X^T(n) X(n)\right)^{-1} X^T(n)\, \mathbf{d}(n) = X^{+}(n)\, \mathbf{d}(n)$
LMS Algorithm
Based on the use of instantaneous values for the cost function:
$E(w) = \tfrac{1}{2} e^2(n)$

Differentiating with respect to w,
$\dfrac{\partial E(w)}{\partial w} = e(n)\, \dfrac{\partial e(n)}{\partial w}$

The error signal in the LMS algorithm:
$e(n) = d(n) - x^T(n)\, w(n)$
hence
$\dfrac{\partial e(n)}{\partial w(n)} = -x(n)$, so $\dfrac{\partial E(w)}{\partial w(n)} = -x(n)\, e(n)$
LMS Algorithm …
Using $\dfrac{\partial E(w)}{\partial w(n)}$ as an estimate for the gradient vector,
$\hat{g}(n) = -x(n)\, e(n)$
Using this for the gradient vector of the steepest descent method, the LMS algorithm is as follows:
$\hat{w}(n+1) = \hat{w}(n) + \eta\, x(n)\, e(n)$
– η : learning-rate parameter
The inverse of η is a measure of the memory of the LMS algorithm
– When η is small, the adaptive process progresses slowly, more of the past data are remembered, and the filtering action is more accurate
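The update ŵ(n+1) = ŵ(n) + η x(n) e(n) is short enough to sketch directly. The data below are synthetic and noise-free, generated from an assumed "true" weight vector with inputs drawn uniformly from [−1, 1]; the names `TRUE_W` and `make_samples`, the seed and η = 0.1 are illustrative choices.

```python
import random

def lms(samples, eta, n_weights):
    """Run the LMS update over (x, d) pairs, starting from zero weights."""
    w = [0.0] * n_weights
    for x, d in samples:
        e = d - sum(wi * xi for wi, xi in zip(w, x))      # e(n) = d(n) - x^T(n) w(n)
        w = [wi + eta * xi * e for wi, xi in zip(w, x)]   # w(n+1) = w(n) + eta x(n) e(n)
    return w

def make_samples(true_w, n, seed=0):
    """Synthetic noise-free training pairs from a hypothetical linear system."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        x = [rng.uniform(-1.0, 1.0) for _ in true_w]
        d = sum(wi * xi for wi, xi in zip(true_w, x))
        samples.append((x, d))
    return samples

TRUE_W = [2.0, -1.0]
estimate = lms(make_samples(TRUE_W, 2000), eta=0.1, n_weights=2)

if __name__ == "__main__":
    print(estimate)
```

No statistics of the inputs are needed: each step uses only the current sample. With noisy data the estimate would keep fluctuating around the solution instead of settling exactly.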
LMS Characteristics
LMS algorithm produces an estimate of the weight vector
– It sacrifices a distinctive feature of steepest descent:
• Steepest descent algorithm: w(n) follows a well-defined trajectory
• LMS algorithm: $\hat{w}(n)$ follows a random trajectory
– As the number of iterations goes to infinity, $\hat{w}(n)$ performs a random walk
But importantly, the LMS algorithm does not require knowledge of the statistics of the environment
Convergence Consideration
Two distinct quantities, η and x(n), determine the convergence
– the user supplies η, and the selection of x(n) is important for the LMS algorithm to converge
Convergence of the mean:
$E[\hat{w}(n)] \to w_0$ as $n \to \infty$
– This is not a practical value
Convergence in the mean square:
$E[e^2(n)] \to \text{constant}$ as $n \to \infty$
Convergence condition for the LMS algorithm in the mean square:
$0 < \eta < \dfrac{2}{\text{sum of mean-square values of the sensor inputs}}$
APPENDIX B

Perceptron Convergence Proof

Perceptron Convergence Proof Haykin, Chapter 3

Consider the following perceptron:

$v(n) = \sum_{i=0}^{m} w_i(n)\, x_i(n) = w^T(n)\, x(n)$

$w^T x > 0$ for every input vector x belonging to class C1

$w^T x \le 0$ for every input vector x belonging to class C2
Perceptron Convergence Proof…
The algorithm for the weight adjustment for the
perceptron
– if x(n) is correctly classified no adjustments to w
w (n + 1) = w (n) if w T x(n) ≤ 0 and x(n) belongs to class C 2

w (n + 1) = w (n) if w T x(n) > 0 and x(n) belongs to class C1

– otherwise

w(n + 1) = w(n) − η (n)x(n) if wT x(n) > 0 and x(n) belongs to class C2

w(n + 1) = w(n) + η (n)x(n) if wT x(n) ≤ 0 and x(n) belongs to class C1

– learning rate parameter η (n) controls adjustment

applied to weight vector
Perceptron Convergence Proof
For η(n) = 1 and w(0) = 0
Suppose the perceptron incorrectly classifies the vectors x(1), x(2), ... such that

$w^T(n)\, x(n) \le 0$, so that $w(n+1) = w(n) + \eta(n)\, x(n)$

But since η = 1 ⇒
$w(n+1) = w(n) + x(n)$ for x(n) belonging to C1
Since w(0) = 0, iteratively we find w(n+1):
$w(n+1) = x(1) + x(2) + \ldots + x(n)$   (B1)
Since the classes C1 and C2 are assumed to be linearly separable, there exists a solution w0 for which $w_0^T x(n) > 0$ for the vectors x(1), …, x(n) belonging to the subset H1 (subset of training vectors that belong to class C1).
Perceptron Convergence Proof
For a fixed solution w0, we may then define a positive number α as
$\alpha = \min_{x(n) \in H_1} w_0^T x(n)$   (B2)

Hence equation (B1) above implies
$w_0^T w(n+1) = w_0^T x(1) + w_0^T x(2) + \ldots + w_0^T x(n)$
Using equation B2 above (since each term is greater than or equal to α), we have
$w_0^T w(n+1) \ge n\alpha$
Now we use the Cauchy-Schwarz inequality:
$(a \cdot b)^2 \le \|a\|^2\, \|b\|^2$
or
$\|a\|^2 \ge \dfrac{(a \cdot b)^2}{\|b\|^2}$ for $b \ne 0$
Perceptron Convergence Proof
This implies that:
$\|w(n+1)\|^2 \ge \dfrac{n^2 \alpha^2}{\|w_0\|^2}$   (B3)
Now let’s follow another development route (notice index k)
$w(k+1) = w(k) + x(k)$ for k = 1, ..., n and x(k) ∈ H1
By taking the squared Euclidean norm of both sides, we get:
$\|w(k+1)\|^2 = \|w(k)\|^2 + \|x(k)\|^2 + 2\, w^T(k)\, x(k)$
But under the assumption that the perceptron incorrectly classifies an input vector x(k) belonging to the subset H1, we have $w^T(k)\, x(k) \le 0$ and hence:
$\|w(k+1)\|^2 \le \|w(k)\|^2 + \|x(k)\|^2$
Perceptron Convergence Proof
Or equivalently,
2 2 2
w(k + 1) − w(k ) ≤ x(k ) ; k = 1,...n

Adding these inequalities for k = 1, ..., n, and invoking the initial condition w(0) = 0, we get the following inequality:
$$\|w(n+1)\|^2 \le \sum_{k=1}^{n} \|x(k)\|^2 \le n\beta \tag{B4}$$

where β is a positive number defined by
$$\beta = \max_{x(k) \in H_1} \|x(k)\|^2$$
Eq. B4 states that the squared Euclidean norm of w(n+1)
grows at most linearly with the number of iterations n.
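The two bounds on the squared norm of w(n+1) can be checked numerically on a small case. The sketch below assumes a hypothetical sequence of two misclassified C1 vectors and a hypothetical solution vector `w0`; it takes β as the largest squared norm over those vectors and builds w(n+1) as the running sum of Eq. B1.

```python
import numpy as np

# Hypothetical misclassified C1 vectors x(1), x(2), with eta = 1 and w(0) = 0.
# x(1) meets w = 0 (w.x = 0), and x(2) meets w = x(1) (w.x = -4), so both are errors.
xs = np.array([[-1.0, 2.0], [2.0, -1.0]])
w0 = np.array([1.0, 1.0])              # assumed solution vector: w0.x > 0 for both

alpha = (xs @ w0).min()                # Eq. B2: smallest projection onto w0
beta = (xs ** 2).sum(axis=1).max()     # largest squared norm among the vectors

ws = np.cumsum(xs, axis=0)             # Eq. B1: w(n+1) = x(1) + ... + x(n)
sq_norms = (ws ** 2).sum(axis=1)       # ||w(n+1)||^2 for n = 1, 2
n = np.arange(1, len(xs) + 1)

lower = n**2 * alpha**2 / (w0 @ w0)    # right-hand side of Eq. B3
upper = n * beta                       # right-hand side of Eq. B4

for k in range(len(xs)):
    print(n[k], lower[k], sq_norms[k], upper[k])
```

The lower bound grows like n squared while the upper bound grows like n, so the two can only stay compatible for a limited number of corrections.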
The second result, Eq. B4, is clearly in conflict with Eq. B3 for sufficiently large values of n.
• Indeed, we can state that n cannot be larger than some value $n_{max}$ for which Eq. B3 and B4 are both satisfied with the equality sign. That is, $n_{max}$ is the solution of the equation
$$\frac{n_{max}^2 \, \alpha^2}{\|w_0\|^2} = n_{max} \, \beta$$
• Solving for $n_{max}$ given a solution $w_0$, we find that
$$n_{max} = \frac{\beta \, \|w_0\|^2}{\alpha^2}$$
We have thus proved that for $\eta(n) = 1$ for all n, and for $w(0) = 0$, given that a solution vector $w_0$ exists, the rule for adapting the synaptic weights of the perceptron must terminate after at most $n_{max}$ iterations.
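As a rough illustration of this result, the sketch below trains a perceptron with η = 1 and w(0) = 0 on a small hypothetical set H1 and compares the number of weight corrections with the bound $n_{max}$. The data, the solution vector `w0`, and the function name `train_fixed_increment` are assumptions made for the example; β is taken as the largest squared norm in H1.

```python
import numpy as np

# Hypothetical training vectors, all taken to belong to H1 (class C1).
H1 = np.array([[-1.0, 2.0], [2.0, -1.0], [1.0, 1.0]])

def train_fixed_increment(X, max_epochs=1000):
    """Perceptron rule with eta = 1 and w(0) = 0; returns (w, number of corrections)."""
    w = np.zeros(X.shape[1])
    corrections = 0
    for _ in range(max_epochs):
        changed = False
        for x in X:
            if w @ x <= 0:          # C1 vector misclassified: add it to w
                w = w + x
                corrections += 1
                changed = True
        if not changed:             # a full pass without errors: training is done
            break
    return w, corrections

w, corrections = train_fixed_increment(H1)

w0 = np.array([1.0, 1.0])                 # assumed solution vector: w0.x > 0 on H1
alpha = min(w0 @ x for x in H1)           # Eq. B2
beta = max(x @ x for x in H1)             # largest squared norm in H1
n_max = beta * (w0 @ w0) / alpha**2       # bound on the number of corrections

print(corrections, n_max)
```

The bound depends on the chosen `w0`, so it is usually loose: here the rule stops after far fewer corrections than $n_{max}$ allows.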
MORE READING
Suggested Reading.

S. Haykin, “Neural Networks”, Prentice-Hall, 1999, chapter 3.
L. Fausett, “Fundamentals of Neural Networks”, Prentice-Hall, 1994, chapter 2.
R. O. Duda, P. E. Hart, and D. G. Stork, “Pattern Classification”, 2nd edition, Wiley, 2001, Appendix A4, chapter 2, and chapter 5.
J. M. Zurada, “Introduction to Artificial Neural Systems”, West Publishing Company, 1992, chapter 3.
References:
These lecture notes were based on the references of the previous slide, and the following references:

1. Berlin Chen, lecture notes, Normal University, Taipei, Taiwan, ROC. http://140.122.185.120
2. Ehud Rivlin, IIT: http://webcourse.technion.ac.il/236607/Winter2002-2003/en/ho.html
3. Jin Hyung Kim, KAIST Computer Science Dept., CS679 Neural Network lecture notes. http://ai.kaist.ac.kr/~jkim/cs679/detail.htm
4. Dr John A. Bullinaria, Course Material, Introduction to Neural Networks. http://www.cs.bham.ac.uk/~jxb/inn.html
