Single Layer Perceptron Classifier
By
Varun Deshmukh
Assistant Professor,
Department of Computer Engineering,
MPSTME, Shirpur Campus
08/13/2021
CLASSIFICATION MODEL, FEATURES, AND
DECISION REGIONS
• One of the most useful tasks that can be performed by networks of interconnected
nonlinear elements introduced in the previous chapter is pattern classification.
• A pattern is the quantitative description of an object, event, or phenomenon.
• The classification may involve spatial and temporal patterns. Examples of spatial
patterns are pictures, video images of ships, weather maps, fingerprints and
characters.
• Examples of temporal patterns include speech signals, signal vs time produced by
sensors, electrocardiograms, and seismograms.
• Temporal patterns usually involve ordered sequences of data appearing in time.
• The goal of pattern classification is to assign a physical object, event, or phenomenon to one of the prespecified classes (also called categories).
• Despite the lack of any formal theory of pattern perception and classification,
human beings and animals have performed these tasks since the beginning of their
existence.
• The oldest classification tasks required from a human being have been
classification of the human environment into such groups of objects as living
species, plants, weather conditions, minerals, tools, human faces, voices etc.
• The interpretation of data has been learned gradually as a result of repetitive
inspecting and classifying of examples.
• When a person perceives a pattern, an inductive inference is made and the
perception is associated with some general concepts or clues derived from the
person's past experience.
• The problem of pattern classification may be regarded as one of discriminating
the input data within object population via the search for invariant attributes
among members of the population.
• While some of the tasks mentioned above can be learned easily, the growing complexity of the human environment and technological progress have created classification problems that are diversified and also difficult.
• As a result, the use of various classifying aids became helpful and in some
applications, even indispensable.
• Reading and processing bank checks exemplifies a classification problem that can
be automated.
• It obviously can be performed by a human worker; however, machine classification can achieve much greater efficiency.
• Extensive study of the classification process has led to the development of an
abstract mathematical model that provides the theoretical basis for classifier
design.
• Eventually, machine classification came to maturity to help people in their
classification tasks.
• The electrocardiogram waveform, biomedical photograph, or disease diagnosis
problem can nowadays be handled by machine classifiers.
• Other applications include fingerprint identification, patent searches, radar and
signal detection, printed and written character classification, and speech
recognition.
• Figure shows the block diagram of the recognition and classification system.
• Recognition is understood here as a class assignment for input patterns that are
not identical to the patterns used for training of the classifier.
• Since the training concept has not been fully explained yet, we will focus first on
techniques for classifying patterns.
• The classifying system consists of an input transducer providing the input pattern
data to the feature extractor.
• Typically, inputs to the feature extractor are sets of data vectors that belong to a
certain category.
• Assume that each such set member consists of real numbers corresponding to
measurement results for a given physical situation.
• Usually, the converted data at the output of the transducer can be compressed
while still maintaining the same level of machine performance.
• The compressed data are called features. The feature extractor at the input of the
classifier in Figure performs the reduction of dimensionality.
• The feature space dimensionality is postulated to be much smaller than the
dimensionality of the pattern space.
• The feature vectors retain the minimum number of data dimensions while
maintaining the probability of correct classification, thus making handling data
easier.
• An example of possible feature extraction is available in the analysis of speech
vowel sounds. A 16-channel filter-bank can provide a set of 16-component
spectral vectors. The vowel spectral content can be transformed into perceptual
quality space consisting of two dimensions only. They are related to tongue height
and retraction.
• Another example of dimensionality reduction is the projection of planar data on a
single line, reducing the feature vector size to a single dimension. Although the
projection of data will often produce a useless mixture, by moving and/or rotating
the line it might be possible to find its orientation for which the projected data are
well separated.
• In such a case, two-dimensional data are represented by single-dimensional
features denoting the position of the projected points on the line.
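The projection idea above can be sketched in a few lines of Python. The two point clusters and the brute-force angle sweep below are hypothetical choices for illustration only: both clusters are projected onto a candidate line through the origin, and the orientation with the widest gap between the projected clusters is kept.

```python
import math

# Two synthetic 2-D clusters (hypothetical data for illustration).
class_a = [(1.0, 1.2), (1.5, 0.8), (2.0, 1.5)]
class_b = [(-1.0, -0.9), (-1.5, -1.4), (-0.5, -1.1)]

def project(points, angle):
    """Project 2-D points onto a line through the origin at `angle` radians,
    returning the scalar position of each projection on that line."""
    ux, uy = math.cos(angle), math.sin(angle)
    return [x * ux + y * uy for x, y in points]

def separation(angle):
    """Gap between the two projected clusters (positive = well separated)."""
    pa, pb = project(class_a, angle), project(class_b, angle)
    return min(pa) - max(pb)

# Sweep line orientations in 1-degree steps, keeping the widest gap.
best_angle = max((a * math.pi / 180 for a in range(180)), key=separation)
```

For these clusters many orientations separate the projections, but for overlapping data only a narrow range of angles (if any) would yield a positive gap.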
• It is beyond the scope of this chapter to discuss the selection of measurements or
data feature extraction from the input pattern vector.
• We shall henceforth assume that the sets of extracted feature vectors yield the sets of pattern vectors to be classified, and that the extraction, or selection, of input components to the classifier has been done as wisely as possible.
• Thus, the pattern vector x shown in figure consists of components that may be
features.
• However, the n-tuple vectors at the input to the classifier in Figure (b) may also be input pattern data when separate feature extraction does not take place.
• In such a case the classifier's function is to perform not only the classification
itself but also to internally extract input pattern features.
• The rationale for this approach in our study is that neural networks can be
successfully used for joint classification/recognition tasks and for feature
extraction.
• For such networks, the feature extractor and classifier from Figure (a) can be considered merged into the single classifier network of Figure (b).
• Classification can often be conveniently described in geometric terms.
• Any pattern can be represented by a point in n-dimensional Euclidean space
called the pattern space.
• Points in that space corresponding to members of the pattern set are n-tuple
vectors x.
• A pattern classifier maps sets of points in pattern space into one of the class numbers i = 1, 2, . . . , R.
• The sets containing patterns of classes 1, 2, . . . , R are denoted here by X1, X2, . . . , XR, respectively.
• An example case for n = 2 and R = 4 is illustrated in Figure showing disjoint regions.
• Let us postulate for simplicity that the
classifier’s response should be the class
number.
• We now have the decision function for a pattern of class j yielding the following result: the classifier's response is i0 = j for every pattern x belonging to class j.
• The regions of pattern space associated with each class are called decision regions. Regions are separated from each other by so-called decision surfaces.
DISCRIMINANT FUNCTIONS
• In this chapter, the assumption is made that both a set of n-dimensional patterns
and the desired classification for each pattern are known.
• The size P of the pattern set is finite, and it is usually much larger than the
dimensionality n of the pattern space.
• In many practical cases we will also assume that P is much larger than the number
of categories R.
• Although the assumptions regarding n, P, and R are often valid for practical
classification cases, they do not necessarily hold for our study of classification
principles, nor do they limit the validity of our final conclusions.
• We will first discuss classifiers that use the discriminant functions concept.
• This discussion will lead to interesting conclusions as to how neural network classifiers should be trained.
• Let us assume momentarily, and for the purpose of this presentation, that the
classifier has already been designed so that it can correctly perform the
classification tasks.
• During the classification step, the membership in a category needs to be
determined by the classifier based on the comparison of R discriminant functions
computed for the input pattern under consideration.
• It is convenient to assume that the discriminant functions gi(x) are scalar values and that the pattern x belongs to category i if and only if gi(x) > gj(x), for j = 1, 2, . . . , R, j ≠ i.
• Thus, within the region of class i the discriminant function gi(x) will have the largest value.
• This maximum property of the discriminant function for the pattern of class i is fundamental, and it will be subsequently used to choose, or assume, specific forms of the discriminant functions.
• The discriminant functions gi(x) and gj(x) for contiguous decision regions define the decision surface between patterns of classes i and j in pattern space.
• Since the decision surface itself obviously contains patterns x without membership in any single category, it is characterized by gi(x) equal to gj(x).
• Thus, the decision surface equation is gi(x) - gj(x) = 0.
• Assuming that the discriminant functions are known, the block diagram of a basic pattern classifier can now be adopted.
• For a given pattern, the discriminator computes the value of the function gi(x), called briefly the discriminant.
• The maximum selector selects the largest of all inputs, thus yielding the response equal to the category number i0.
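The discriminator-plus-maximum-selector structure can be sketched as below. The linear discriminant weights are hypothetical values chosen only to make the sketch runnable; the point being illustrated is the argmax decision over g1(x), . . . , gR(x) on an augmented pattern.

```python
# Hypothetical linear discriminants g_i(x) = w_i . [x1, x2, 1] for R = 3.
weights = [
    [1.0, 0.0, 0.0],    # g_1: favours large x1
    [0.0, 1.0, 0.0],    # g_2: favours large x2
    [-1.0, -1.0, 0.5],  # g_3: favours the remaining region
]

def discriminants(x):
    """Compute g_1(x), ..., g_R(x) for the augmented pattern [x1, x2, 1]."""
    aug = list(x) + [1.0]
    return [sum(w * v for w, v in zip(row, aug)) for row in weights]

def classify(x):
    """Maximum selector: return the category number i0 with the largest g_i."""
    g = discriminants(x)
    return g.index(max(g)) + 1  # categories numbered from 1
```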
• The discussion above and the associated example of classification have highlighted a special case of the R-class classifier for R = 2.
• Such a classifier is called a dichotomizer. Although the ancient Greek civilization is rather famous for interests other than decision-making machines, the word dichotomizer is of Greek origin.
• The two separate Greek roots are dicha and tomia, and they mean in two and cut, respectively.
• It has been noted that the general classification condition for the case of a dichotomizer can now be reduced to the inspection of the sign of the following discriminant function: g(x) = g1(x) - g2(x).
• Thus, the general classification rule can be rewritten for a dichotomizer as follows: decide class 1 if g(x) > 0, and class 2 if g(x) < 0.
• Once a general functional form of the discriminant functions has been suitably
chosen, discriminants can be computed using a priori information about the
classification of patterns, provided that such information is available.
• In such an approach, the design of a classifier can be based entirely on the
computation of decision boundaries as derived from patterns and their
membership in classes.
• Throughout this chapter and most portions of this book, however, we will focus
mainly on classifiers whose decision capabilities are generated from training
patterns by means of an iterative learning, or training, algorithm.
• Once a type of discriminant function has been assumed, the learning algorithm functions correctly, provided the training pattern sets are separable by the assumed type of decision function.
• For study of such adaptive, or trainable, classifiers the following assumptions are
made:
• The training pattern set and classification of all its members are known, thus
the training is supervised.
• The discriminant functions have a linear form and only their coefficients are
adjusted in the training procedure.
• Under these assumptions, a trainable classifier can be implemented that learns by
examples. In this context, we will be interested in input data vectors for which we
have a priori knowledge of their correct classification.
• These vectors will be referred to as class prototypes or exemplars. The classification problem will then be one of finding decision surfaces, in n-dimensional space, that will enable correct classification of the prototypes and will afford some degree of confidence in correctly recognizing and classifying unknown patterns that have not been used for training.
Linear Machine & Minimum Distance Classification
• The efficient classifier having the block diagram as shown in Figure must be described, in general, by discriminant functions that are not linear functions of the inputs.
• An example of such classification is provided in Figure c. As will be shown later,
the use of nonlinear discriminant functions can be avoided by changing the
classifier's feedforward architecture to the multilayer form.
• Such an architecture is composed of more layers of elementary classifiers such as the discrete dichotomizer discussed, or a dichotomizer providing a continuous response between -1 and +1.
• The elementary decision making discrete, or continuous dichotomizers, will then
again be described with the argument being the basic linear discriminant function.
• Figure depicts two clusters of patterns, each cluster belonging to one known
category.
• The center points of the clusters shown for classes 1 and 2 are vectors x1 and x2, respectively.
• The center, or prototype, points can be interpreted here as centers of gravity for
each cluster.
• We prefer that the decision hyperplane contain the midpoint of the line segment connecting prototype points x1 and x2, and that it be normal to the vector x1 - x2, which is directed from x2 toward x1.
• The decision hyperplane equation can thus be written in the following form: (x1 - x2)^t x + (1/2)(||x2||^2 - ||x1||^2) = 0.
• The left side of the equation is obviously the dichotomizer's discriminant function g(x). It can also be seen that g(x) implied here constitutes a hyperplane described by the equation g(x) = 0.
• The weighting coefficients w1, w2, . . . , wn+1 of the dichotomizer can now be obtained easily by comparing the above equations as follows: wi = x1i - x2i, for i = 1, 2, . . . , n, and wn+1 = (1/2)(||x2||^2 - ||x1||^2).
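A minimal sketch of these weight formulas, assuming two hypothetical cluster centers x1 and x2; the resulting g(x) is positive on the x1 side, negative on the x2 side, and zero at the midpoint.

```python
# Hypothetical prototype (center-of-gravity) points of the two clusters.
x1 = [2.0, 5.0]    # prototype of class 1
x2 = [-1.0, -3.0]  # prototype of class 2

# w_i = x1_i - x2_i for i = 1..n, and w_{n+1} = (||x2||^2 - ||x1||^2) / 2.
w = [a - b for a, b in zip(x1, x2)]
w.append((sum(v * v for v in x2) - sum(v * v for v in x1)) / 2)

def g(x):
    """Dichotomizer discriminant on the augmented pattern [x, 1]:
    positive for class 1, negative for class 2."""
    return sum(wi * xi for wi, xi in zip(w, list(x) + [1.0]))
```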
Perceptron Networks
• Perceptron networks come under single-layer feed-forward networks and are also called simple perceptrons.
• Various types of perceptrons were designed by Rosenblatt (1962) and Minsky-Papert (1969, 1988). However, a simple perceptron network was discovered by Block in 1962.
• The Key points to be noted in a perceptron network are:
• The perceptron network consists of three units, namely, the sensory unit (input unit), associator unit (hidden unit), and response unit (output unit).
• The sensory units are connected to associator units with fixed weights having values 1, 0, or -1, which are assigned at random.
• The binary activation function is used in the sensory unit and the associator unit.
• The response unit has an activation of 1, 0, or -1. The binary step with a fixed threshold is used as the activation for the associator. The output signals that are sent from the associator unit to the response unit are only binary.
• The output of the perceptron network is given by y = f(yin), where f(yin) = 1 if yin > θ, f(yin) = 0 if -θ ≤ yin ≤ θ, and f(yin) = -1 if yin < -θ.
• The perceptron learning rule is used in the weight updates between the associator unit and the response unit. For each training input, the net will calculate the response, and it will determine whether or not an error has occurred.
• The error calculation is based on the comparison of the values of the targets with those of the calculated outputs.
• The weights on the connections from the units that send the nonzero signal will get adjusted suitably.
• The weights will be adjusted on the basis of the learning rule if an error has occurred for a particular training pattern, i.e., wi(new) = wi(old) + α t xi and b(new) = b(old) + α t.
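The error-driven update can be sketched as a small helper; the learning rate alpha and the convention that no update occurs when y equals t follow the rule stated above.

```python
def perceptron_update(w, b, x, t, y, alpha=1.0):
    """One application of the perceptron learning rule.
    w: weight list, b: bias, x: input pattern, t: target, y: computed output.
    Returns the updated (w, b); unchanged when no error occurred."""
    if y != t:  # an error occurred for this training pattern
        w = [wi + alpha * t * xi for wi, xi in zip(w, x)]
        b = b + alpha * t
    return w, b
```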
• If no error occurs, there is no weight update and hence the training process may be stopped.
• In the above equations, the target value "t" is +1 or -1 and α is the learning rate. In general, these learning rules begin with an initial guess at the weight values, and then successive adjustments are made on the basis of the evaluation of an objective function.
• Eventually, the learning rules reach a near optimal or optimal solution in a finite number of steps.
• A perceptron network with its three units is shown in the figure. As shown, the sensory unit can be a two-dimensional matrix of 400 photodetectors upon which a lighted picture with a geometric black-and-white pattern impinges.
• These detectors provide a binary electrical signal if the input signal is found to exceed a certain threshold value.
• Also these detectors are connected randomly with the associator unit. The associator unit is found
to consist of a set of subcircuits called feature predicates.
• The feature predicates are hard-wired to detect the specific feature of a pattern and are equivalent
to the feature detectors. For a particular feature, each predicate is examined with a few or all of the
responses of the sensory unit.
• It can be found that the results from the predicate units are also binary (0 or 1). The last unit, i.e., the response unit, contains the pattern recognizers or perceptrons. The weights present in the input layers are all fixed, while the weights on the response unit are trainable.
Perceptron Learning Rule
• In the case of the perceptron learning rule, the learning signal is the difference between the desired and actual response of the neuron. The perceptron learning rule is explained as follows:
• Consider a finite number of input training vectors x(n), with their associated target values t(n), where n ranges from 1 to N.
• The target is either +1 or -1. The output "y" is obtained on the basis of the net input calculated and the activation function applied over the net input.
• The weights can be initialized at any values in this method. The perceptron rule convergence theorem states that "if there is a weight vector W such that f(x(n) · W) = t(n) for all n, then for any starting vector w1, the perceptron learning rule will converge to a weight vector that gives the correct response."
Perceptron Training Algorithm for Single Output class
• Step 0: Initialize the weights and the bias. Also initialize the learning rate α. For simplicity, α is set to 1.
• Step 1: Perform Steps 2-6 until the final stopping condition is false.
• Step 2: Perform Steps 3-5 for each training pair indicated by s:t.
• Step 3: The input layer containing input units is applied with identity activation functions: xi = si.
• Step 4: Calculate the output of the network. To do so, first obtain the net input: yin = b + Σ xi wi, where "n" is the number of neurons in the input layer. Then apply the activation function over the net input calculated to obtain the output: y = f(yin).
• Step 5: Weight and bias adjustment: Compare the value of the actual output and the desired output.
If y ≠ t, then wi(new) = wi(old) + α t xi and b(new) = b(old) + α t (α - learning rate);
else, we have wi(new) = wi(old) and b(new) = b(old).
• Step 6: Train the network until there is no weight change. This is the stopping condition for the network.
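Steps 0-6 can be collected into a runnable sketch. The dead-zone step activation with threshold theta and the max_epochs safety guard are assumptions added to make the loop self-contained.

```python
def activation(y_in, theta=0.0):
    """Bipolar step with a dead zone of width 2*theta around zero."""
    if y_in > theta:
        return 1
    if y_in < -theta:
        return -1
    return 0

def train_perceptron(samples, theta=0.0, alpha=1.0, max_epochs=100):
    """samples: list of (input vector s, target t). Returns (weights, bias)."""
    n = len(samples[0][0])
    w, b = [0.0] * n, 0.0                # Step 0: initialize weights and bias
    for _ in range(max_epochs):          # Step 1: repeat until no change
        changed = False
        for s, t in samples:             # Step 2: each training pair s:t
            x = s                        # Step 3: identity activation on input
            y_in = b + sum(wi * xi for wi, xi in zip(w, x))  # Step 4
            y = activation(y_in, theta)
            if y != t:                   # Step 5: adjust weights on error
                w = [wi + alpha * t * xi for wi, xi in zip(w, x)]
                b += alpha * t
                changed = True
        if not changed:                  # Step 6: stopping condition
            break
    return w, b
```

For the bipolar AND data used later in the examples, this loop settles on w = (1, 1), b = -1.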
Perceptron Training Algorithm for Multiple Output
classes
• Step 0: Initialize the weights and the bias. Also initialize the learning rate α. For simplicity, α is set to 1.
• Step 1: Perform Steps 2-6 until the final stopping condition is false.
• Step 2: Perform Steps 3-5 for each training pair indicated by s:t.
• Step 3: The input layer containing input units is applied with identity activation functions: xi = si.
• Step 4: Calculate the output response of each output unit j = 1, . . . , m. First obtain the net input as: yinj = bj + Σ xi wij. Then apply the activation function over the net input calculated to obtain the output: yj = f(yinj).
• Step 5: Make adjustments in weights and bias for j = 1, . . . , m and i = 1, . . . , n: if tj ≠ yj, then wij(new) = wij(old) + α tj xi and bj(new) = bj(old) + α tj; else they remain unchanged.
• Step 6: Train the network until there is no weight change. This is the stopping condition for the network.
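A sketch of the multi-output version: each output unit j carries its own weight column wij and bias bj and is trained independently by the same per-unit rule. The step activation, theta, and max_epochs are the same assumptions as in the single-output sketch.

```python
def activation(y_in, theta=0.0):
    """Bipolar step with a dead zone around zero."""
    return 1 if y_in > theta else (-1 if y_in < -theta else 0)

def train_multi(samples, m, theta=0.0, alpha=1.0, max_epochs=100):
    """samples: list of (input s, target vector t of length m).
    Returns (weight matrix w[i][j], bias list b[j])."""
    n = len(samples[0][0])
    w = [[0.0] * m for _ in range(n)]   # Step 0: n x m weights, m biases
    b = [0.0] * m
    for _ in range(max_epochs):         # Step 1
        changed = False
        for s, t in samples:            # Steps 2-3
            for j in range(m):          # Step 4: net input of output unit j
                y_in = b[j] + sum(s[i] * w[i][j] for i in range(n))
                y = activation(y_in, theta)
                if y != t[j]:           # Step 5: per-unit weight adjustment
                    for i in range(n):
                        w[i][j] += alpha * t[j] * s[i]
                    b[j] += alpha * t[j]
                    changed = True
        if not changed:                 # Step 6
            break
    return w, b
```

Training the two output units jointly on bipolar AND and OR targets yields the same weights each function would get on its own, since the units do not interact.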
Perceptron network Testing Algorithm
• Step 0: The initial weights to be used here are taken from the training algorithm (the final weights obtained during training).
• Step 1: For each input vector X to be classified, perform Steps 2-3.
• Step 2: Set activations of the input units.
• Step 3: Obtain the response of the output unit: yin = b + Σ xi wi. Then apply the activation function over the net input calculated to obtain the output: y = f(yin).
Examples
• Implement AND function using perceptron network for bipolar inputs and targets.
• Solution: The perceptron network, which uses the perceptron learning rule, is used to train the AND function. The network architecture is as shown in the figure. The input patterns are presented to the network one by one. When all four input patterns have been presented, one epoch is said to be completed. The initial weights and threshold are set to zero, i.e., w1 = w2 = b = 0 and θ = 0. The learning rate α is set to 1.

  x1  x2   t
   1   1   1
   1  -1  -1
  -1   1  -1
  -1  -1  -1

• For the first input pattern x1 = 1, x2 = 1, t = 1, with weights w1 = w2 = 0 and bias b = 0:
• Calculate the net input: yin = b + x1 w1 + x2 w2 = 0.
• The weights w1 = 1, w2 = 1, b = 1 are the weights obtained after the first input pattern is presented.
• The same process is repeated for all the input patterns.
• The process can be stopped when all the targets become equal to the calculated output or when a
separating line is obtained using the final weights for separating the positive responses from
negative responses.
• The table shows the training of the perceptron network until its targets and calculated outputs converge for all the patterns.
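The hand computation can be cross-checked with a short script that repeats epochs until no weight changes occur, using the same bipolar data, zero initial values, and α = 1.

```python
# Bipolar AND: (inputs, target) pairs.
samples = [((1, 1), 1), ((1, -1), -1), ((-1, 1), -1), ((-1, -1), -1)]
w, b = [0, 0], 0
epochs = 0
while True:
    changed = False
    for x, t in samples:
        y_in = b + w[0] * x[0] + w[1] * x[1]
        y = 1 if y_in > 0 else (-1 if y_in < 0 else 0)
        if y != t:                       # perceptron learning rule
            w = [w[0] + t * x[0], w[1] + t * x[1]]
            b += t
            changed = True
    epochs += 1
    if not changed:
        break
```

The run converges after the second (change-free) epoch with w1 = 1, w2 = 1, b = -1, matching the separating line found in the text.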
• It can easily be found that the above straight line separates the positive response region from the negative response region.
• Implement OR function with binary inputs and bipolar targets using perceptron training
algorithm up to 3 epochs.
• Solution: The truth table for the OR function with binary inputs and bipolar targets is shown in the table.

  x1  x2   t
   1   1   1
   1   0   1
   0   1   1
   0   0  -1

• The perceptron network, which uses the perceptron learning rule, is used to train the OR function.
• The network architecture is shown in the figure.
• The initial values of the weights and bias are taken as zero, i.e., w1 = w2 = b = 0.
• Also, the learning rate α is 1 and the threshold θ is 0.2. So the activation function becomes: y = 1 if yin > 0.2; y = 0 if -0.2 ≤ yin ≤ 0.2; y = -1 if yin < -0.2.
35
  Input        Target  Net input  Output  Weight changes   Weights
  x1  x2  1    (t)     (yin)      (y)     Δw1  Δw2  Δb     w1  w2   b
                                                          (0) (0) (0)
   1   1  1     1        0         0       1    1    1      1   1   1
   1   0  1     1        2         1       0    0    0      1   1   1
   0   1  1     1        2         1       0    0    0      1   1   1
   0   0  1    -1        1         1       0    0   -1      1   1   0
   1   1  1     1        2         1       0    0    0      1   1   0
   1   0  1     1        1         1       0    0    0      1   1   0
   0   1  1     1        1         1       0    0    0      1   1   0
   0   0  1    -1        0         0       0    0   -1      1   1  -1
   1   1  1     1        1         1       0    0    0      1   1  -1
   1   0  1     1        0         0       1    0    1      2   1   0
   0   1  1     1        1         1       0    0    0      2   1   0
   0   0  1    -1        0         0       0    0   -1      2   1  -1
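The OR table can be cross-checked with a short script that runs exactly three epochs with θ = 0.2, binary inputs, bipolar targets, and zero initial values.

```python
# Binary-input OR with bipolar targets.
samples = [((1, 1), 1), ((1, 0), 1), ((0, 1), 1), ((0, 0), -1)]
w, b = [0, 0], 0
for _ in range(3):                       # 3 epochs, as the problem asks
    for x, t in samples:
        y_in = b + w[0] * x[0] + w[1] * x[1]
        # Step activation with threshold theta = 0.2.
        y = 1 if y_in > 0.2 else (-1 if y_in < -0.2 else 0)
        if y != t:                       # perceptron learning rule
            w = [w[0] + t * x[0], w[1] + t * x[1]]
            b += t
```

After the third epoch the weights are w1 = 2, w2 = 1, b = -1, the last row of the table.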
• Find the weights using perceptron network for ANDNOT function when all the inputs are
presented only one time. Use bipolar inputs and targets.
• Solution: The truth table for the ANDNOT function is shown in the table.

  x1  x2   t
   1   1  -1
   1  -1   1
  -1   1  -1
  -1  -1  -1

• The network architecture of ANDNOT is shown in the figure.
• The initial values of the weights and bias are taken as zero, i.e., w1 = w2 = b = 0.
• For the first input sample, we compute the net input as yin = b + x1 w1 + x2 w2 = 0.
• For the third input sample, we calculate the net input as yin = b + x1 w1 + x2 w2, and the output y is obtained by applying the activation function over it. Whenever y ≠ t, the new weights are computed as wi(new) = wi(old) + α t xi and b(new) = b(old) + α t.
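The single-pass ANDNOT training can be cross-checked as follows, with bipolar inputs/targets, zero initial values, α = 1, and each pattern presented exactly once.

```python
# Bipolar ANDNOT (x1 AND NOT x2): (inputs, target) pairs.
samples = [((1, 1), -1), ((1, -1), 1), ((-1, 1), -1), ((-1, -1), -1)]
w, b = [0, 0], 0
for x, t in samples:                     # single pass over the patterns
    y_in = b + w[0] * x[0] + w[1] * x[1]
    y = 1 if y_in > 0 else (-1 if y_in < 0 else 0)
    if y != t:                           # perceptron learning rule
        w = [w[0] + t * x[0], w[1] + t * x[1]]
        b += t
```

One pass already ends with w1 = 1, w2 = -1, b = -1, which classifies all four ANDNOT patterns correctly.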
Training & Classification using Discrete Perceptron
• It can be seen from the previous equation that the discriminant function becomes explicitly known if prototype points P1 and P2 are known.
• We can also note that unless the cluster center coordinates x1, x2 are known, g(x) cannot be determined a priori using the method just presented.
• The linear form of discriminant functions can also be used for classifications between more than
two categories.
• In case of R pairwise separable classes, there will be up to R(R-1)/2 decision hyperplanes like the
one computed for R=2.
• For R=3 there are up to three decision hyperplanes.
• For a larger number of classes, some decision regions may not be contiguous, thus eliminating some decision hyperplanes.
• In such cases, the corresponding decision surface equation gi(x) = gj(x) has no solution within the pattern space.
• Still, the dichotomizer example just discussed can be considered as a simple case of a multiclass
minimum distance classifier.
Training & Classification using Discrete Perceptron:
Algorithm and Example
• Let us look now in more detail at the weight adjustment aspects.
• Again using the geometrical relationship, we choose the correction increment c so that the weight
adjustment step size is meaningfully controlled.
• The distance p of a point from a plane in (n+1)-dimensional Euclidean space is computed according to the formula p = ±(w^t y)/||y||, where the sign in front of the fraction is chosen so that p is nonnegative.
• Since p is always a nonnegative scalar by definition of distance, the expression above can be rewritten using absolute value notation as p = |w^t y| / ||y||.
• Let us now require that the correction increment constant c be selected such that the corrected weight vector lands exactly on the decision hyperplane w^t y = 0, which is the decision hyperplane used for this particular correction step.
• This implies that the corrected weight vector w' = w ± c y must satisfy w'^t y = 0.
• The required correction increment results for this training step as c = ∓ w^t y / (y^t y).
• Since the correction increment c is positive, the equation above can be briefly rewritten as c = |w^t y| / ||y||^2.
• The basic correction rule for c = 1 leads to a very simple adjustment of the weight vector. Such an adjustment alters the weight vector exactly by the pattern vector y.
• Using the value of the correction increment calculated above, several different adjustment techniques can be devised depending on the length of the weight correction vector c y.
• Given are P training pairs {x1, d1, x2, d2, . . . , xP, dP}, where xi is (n x 1) and di is (1 x 1), for i = 1, 2, . . . , P.
• Note that augmented input vectors are used: yi = [xi 1]^t, for i = 1, 2, . . . , P.
• In the following, k denotes the training step and p denotes the step counter within the training cycle.
1. c > 0 is chosen.
2. Weights are initialized at w at small random values; w is (n+1)x1. Counters and error are initialized: k = 1, p = 1, E = 0.
3. The training cycle begins here. Input is presented and output computed: y = yp, d = dp, o = sgn(w^t y).
Single Layer continuous perceptron network for
linearly separable classification
• In this section we introduce the concept of an error function in multidimensional weight space.
• Also, the TLU element with weights will be replaced by the continuous perceptron.
• This replacement has two main direct objectives.
• The first one is to gain finer control over the training procedure. The other is to facilitate
working with differentiable characteristics of the threshold element, thus enabling
computation of the error gradient. Such a continuous characteristic, or sigmoidal activation
function, describes the neuron in lieu of the sign function.
• According to the training theorem discussed in the last section, the TLU weights converge to some solution for any positive correction increment constant.
• The weight modification problem could be better solved, however, by minimizing the scalar
criterion function. Such a task can possibly be attempted by again using the gradient, or steepest
descent, procedure.
• The basic procedure of descent is quite simple. Starting from an arbitrarily chosen weight vector w, the gradient of the current error function is computed.
• The next value of w is obtained by moving in the direction of the negative gradient along the
multidimensional error surface. The direction of negative gradient is the one of steepest descent.
• The algorithm can be summarized as: w(k+1) = w(k) - η ∇E(w(k)), where η is a positive constant called the learning constant and the superscript k denotes the step number.
• Let us define the error in the k’th training step as the squared difference between the desired value
at the output of the continuous perceptron and its actual output value computed.
• As shown in Figure , the desired value is provided by the teacher.
• The expression for the classification error to be minimized is E = (1/2)(d - o)^2, where d is the desired output and o the actual output of the continuous perceptron.
• The coefficient 1/2 in front of the error expression is intended for convenience in simplifying the expression of the gradient value, and it does not affect the location of the error minimum or the error minimization itself.
• Our intention is to achieve minimization of the error function E(w) in (n + 1)-dimensional weight space.
• An example of a well-behaving error function is shown in the figure. The error function has a single minimum at w*, which can be reached using negative-gradient descent starting at the initial weight vector w0 shown in the figure.
• The vicinity of the point E(w*) = 0 is shown to be reached within a finite number of steps. The movement of the weight vector can be observed both on the error surface in Figure (a) and across the error contour lines shown in Figure (b). In the depicted case, the weights move from contour 1 through 4 ideally toward the point E = 0. By definition of the steepest descent concept, each elementary move should be perpendicular to the current error contour.
• Unfortunately, neither error functions for a TLU-based dichotomizer nor those for more complex
multiclass classifiers using TLU units result in a form suitable for the gradient descent-based training.
• The reason for this is that the error function has zero slope in the entire weight space where the error
function exists, and it is nondifferentiable on the decision surface itself.
• This property directly results from calculating any gradient vector component with sgn(net) replacing the continuous activation function.
• Indeed, the derivative of the sign function is nonexistent at zero and of zero value elsewhere.
• The error minimization algorithm requires computation of the gradient of the error E = (1/2)[d - f(w^t y)]^2 as follows: ∇E(w) = -(d - o) f'(net) y, where net = w^t y.
• The training step superscript k has been temporarily skipped for simplicity, but it should be understood that the error gradient computation refers strictly to the k'th training step.
• Since w ← w - η ∇E(w), we have the weight adjustment Δw = η (d - o) f'(net) y,
• which is the training rule of the continuous perceptron. It can be seen that the rule is equivalent to the
delta training rule.
• The computation of adjusted weights requires the assumption of the learning constant η and the specification of the activation function used.
• Note that in further considerations we assume trainable weights. Therefore, we will no longer need to use the steepness coefficient λ of the activation function as a variable.
• The assumption that λ = 1 is thus as valid as the assumption of any other constant value used to scale all the weights in the same proportion.
• For the bipolar continuous activation function f(net) = 2/(1 + exp(-net)) - 1 we have f'(net) = (1/2)(1 - o^2), and the gradient becomes ∇E = -(1/2)(d - o)(1 - o^2) y.
• The complete delta training rule for the bipolar continuous activation function results as w(k+1) = w(k) + (1/2) η (d - o)(1 - o^2) y.
• It may be noted that the weight adjustment rule corrects the weights in the same direction as the discrete
perceptron learning rule.
• The size, or simply length, of the weight correction vector is the main difference between the rules of
discrete perceptron and continuous perceptron.
• Both these rules involve adding or subtracting a fraction of the pattern vector y. The essential difference is the presence of the moderating factor (1/2)(1 - o^2). This scaling factor is obviously always positive and smaller than 1.
• For erroneous responses with net close to 0, or a weakly committed perceptron, the correction scaling factor will be larger than for those responses generated by a net of large magnitude.
• Another significant difference between the discrete and continuous perceptron training is that the discrete
perceptron training algorithm always leads to a solution for linearly separable problems. In contrast to
this property, the negative gradient-based training does not guarantee solutions for linearly separable
patterns.
• At this point the TLU-based classifier has been modified and consists of a continuous perceptron element.
In the past, the term perceptron has been used to describe the TLU-based discrete decision element with
synaptic weights and a summing node. Here, we have extended the perceptron concept by replacing the
TLU decision element with a neuron characterized by a continuous activation function. As in the case of a
discrete perceptron, the pattern components arrive through synaptic weight connections yielding the net
signal. The net signal excites the neuron, which responds according to its continuous activation function.
50
Single continuous perceptron Training Algorithm
• In the following, k denotes the training step and p denotes the step counter within the training cycle.
1. η > 0 and E_max > 0 are chosen.
2. Weights are initialized at w at small random values; w is (n+1)×1. Counters and error are initialized: k ← 1, p ← 1, E ← 0.
3. The training cycle begins here. Input is presented and output computed: y ← y_p, d ← d_p, o ← f(wᵗy), with f(net) = 2 / (1 + exp(−net)) − 1.
4. Weights are updated: w ← w + (1/2) η (d − o)(1 − o²) y.
5. Cycle error is computed: E ← (1/2)(d − o)² + E.
6. If p < P, then p ← p + 1, k ← k + 1, and go to Step 3; otherwise, go to Step 7.
7. The training cycle is completed. For E < E_max, terminate the training session and output the weights and k. If E ≥ E_max, then E ← 0, p ← 1, and begin a new training cycle by going to Step 3.
51
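The seven steps above can be sketched as a short training loop in Python. This is a minimal illustration, assuming the bipolar continuous activation and augmented patterns; the stopping constants, seed, and all names are my choices, not the lecture's:

```python
import numpy as np

def train_continuous_perceptron(Y, D, eta=0.5, e_max=0.01, max_cycles=1000):
    """Sketch of Steps 1-7: Y holds P augmented patterns (P x (n+1)),
    D the bipolar desired responses. e_max and max_cycles are assumed."""
    rng = np.random.default_rng(0)
    w = rng.uniform(-0.1, 0.1, Y.shape[1])      # Step 2: small random weights
    k = 0
    for _ in range(max_cycles):
        E = 0.0                                  # cycle error reset
        for y, d in zip(Y, D):                   # Steps 3-6
            o = 2.0 / (1.0 + np.exp(-np.dot(w, y))) - 1.0
            w = w + 0.5 * eta * (d - o) * (1.0 - o**2) * y   # Step 4
            E += 0.5 * (d - o) ** 2              # Step 5
            k += 1
        if E < e_max:                            # Step 7
            break
    return w, k
```

On a linearly separable set such as the bipolar AND patterns, the returned weight vector separates the two classes even before E falls below e_max.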
Multi category single layer perceptron networks
• Our approach so far has been limited to training of dichotomizers using both discrete and continuous
perceptron elements.
• In this section we will attempt to apply the error-correcting algorithm to the task of multicategory
classification.
• The assumption needed is that the classes are linearly pairwise separable, i.e., that each class is linearly separable from each other class.
• This assumption is equivalent to the fact that there exist R linear discriminant functions g₁(x), g₂(x), …, g_R(x) such that g_i(x) > g_j(x), for all j ≠ i, whenever x belongs to class i.
• Let us devise a suitable training procedure for such an R-category classifier. To begin, we need to define the augmented weight vectors w_i, for i = 1, 2, …, R, one per class.
• Assume that an augmented pattern y of class i is presented to the maximum selector-based classifier.
• The R decision functions w₁ᵗy, w₂ᵗy, …, w_Rᵗy are evaluated. If w_iᵗy is larger than each of the remaining R − 1 discriminant functions, no adjustment of weight vectors is needed, since the classification is correct.
• This indicates that w_k' = w_k, for k = 1, 2, …, R.
52
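The decision mechanism of the maximum selector itself is simple; a minimal Python sketch (names are mine) of the classification step is:

```python
import numpy as np

def max_selector_classify(W, y):
    """Maximum-selector classification: the class i with the largest
    discriminant value g_i(y) = w_i^t y wins. W is the R x (n+1)
    matrix whose rows are the augmented weight vectors."""
    return int(np.argmax(W @ y))
```

Training only has to make the correct row's discriminant the largest; the selector never changes.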
Multi category single layer perceptron networks
• The prime is again used to denote the weights after correction. If, however, for some j we have w_jᵗy ≥ w_iᵗy, then the updated weight vectors become
w_i' = w_i + c y, w_j' = w_j − c y, and w_k' = w_k for k ≠ i, j.
• The matrix formula can be rewritten using the double-subscript notation for the weights as follows: w_ik' = w_ik + c y_k and w_jk' = w_jk − c y_k, for k = 1, 2, …, n + 1.
• It can be seen from this equation that the weight of the connection between the i'th output and the k'th component of the input is supplemented with c y_k if that output is too small.
• The weight toward the j'th output from the k'th component of the input is reduced by c y_k if that output is excessive.
• So far in this chapter we have been dealing with the augmented pattern vector y with the component y_{n+1} always of fixed value +1.
• It is somewhat instructive to take a different look at this fixed component of the pattern vector to gain better insight into the perceptron's thresholding operation.
53
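A Python sketch of the correction step described above follows. It is an illustrative reading of the rule, assuming the correction is applied for every offending class j (names and the per-offender looping are my choices):

```python
import numpy as np

def multicat_update(W, y, i, c=1.0):
    """One correction step for an R-category classifier.
    W: R x (n+1) weight matrix (row k = augmented weight vector w_k),
    y: augmented pattern known to belong to class i.
    For each j whose discriminant is not beaten by class i,
    boost w_i by c*y and penalize w_j by c*y."""
    nets = W @ y                    # discriminant values w_k^t y
    W = W.copy()                    # do not mutate the caller's matrix
    for j in range(len(W)):
        if j != i and nets[j] >= nets[i]:
            W[i] += c * y           # correct class: too small, increase
            W[j] -= c * y           # offending class: excessive, decrease
    return W
```

After the correction, the discriminant of the true class moves up relative to each offender, in line with the double-subscript formula above.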
Multi category single layer perceptron networks
• Figure shows the perceptron-based dichotomizer in which the augmented pattern component is y_{n+1} = −1.
• Let us assume now that the neuron is excited as in Figure (a). We can now write that w₁x₁ + w₂x₂ + … + w_n x_n − w_{n+1} > 0.
54
Multi category single layer perceptron networks
• The activation function, expressed now in the form of f(·) with the nonaugmented activation Σᵢ₌₁ⁿ wᵢxᵢ as an argument, is shown in Figure (b).
• The figure shows the activation function of a neuron with a positive threshold value, T > 0, sketched versus the nonaugmented activation Σᵢ₌₁ⁿ wᵢxᵢ.
• It is instructive to notice that now the weight value w_{n+1} is equal to the threshold T, and the neuron is excited if the weighted sum of the original unaugmented pattern exceeds the threshold value T.
• Otherwise, the neuron remains inhibited. The nonzero threshold value causes the neuron to behave as a biased device with T being its bias level.
• It is important to stress again that from the training viewpoint, any fixed value of y_{n+1} is an appropriate choice.
• When y_{n+1} = −1, however, the value w_{n+1} becomes equal to the actual firing threshold of the neuron with input being the original pattern x.
• In further considerations we will use y_{n+1} = −1 and w_{n+1} = T unless otherwise stated.
55
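The equivalence between explicit thresholding and the fixed augmented component can be checked numerically. The following sketch (function names and test values are mine) compares both forms of the same neuron:

```python
import numpy as np

def fires(w, T, x):
    """Discrete neuron with an explicit firing threshold T:
    excited when the weighted sum of the unaugmented pattern exceeds T."""
    return np.dot(w, x) > T

def fires_augmented(w, T, x):
    """The same neuron using the augmented pattern [x, -1] and the
    augmented weights [w, T]: excited when the dot product exceeds 0."""
    y = np.append(x, -1.0)
    w_aug = np.append(w, T)
    return np.dot(w_aug, y) > 0.0
```

Since w_augᵗy = wᵗx − T, the two conditions are identical for every input pattern.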
Multi category single layer perceptron networks
• Let us now try to eliminate the maximum selector in an R-class linear classifier and replace it first with R discrete perceptrons.
• The network generated this way is comprised of R discrete perceptrons, as shown in Figure.
• The properly trained classifier from Figure should respond with o₁ = 1 when g₁(x) is larger than any of the remaining discriminant functions g_j(x), for j = 2, …, R, for all patterns belonging to class 1.
• Instead of signaling this event by the single-output maximum selector, a TLU #1 as in the figure may be used. Then the outputs o₁ = 1 and o_j = −1, for j = 2, …, R, should indicate a category 1 input.
56
Multi category single layer perceptron networks
• The adjustment of the threshold can be suitably done by altering the value of w_{1,n+1} through adding an additional threshold value T₁.
• Such an adjustment is clearly feasible since g₁(x) exceeds all remaining discriminant values for all x in class 1.
• The adjustment is done by changing only the single weight value w_{1,n+1} at the input of the first TLU by T₁.
• None of the remaining weights is affected during the T₁ setting step with inputs of class 1.
• Applying a similar procedure to the remaining inputs of the maximum selector, the classifier using
R individual TLU elements as shown in Figure can be obtained.
• For this classifier, the k’th TLU response of +1 is indicative of class k and all other TLUs respond
with -1.
57
Multi category single layer perceptron networks
• The weight update can be written as wᵢ' = wᵢ + (1/2) c (dᵢ − oᵢ) y, where dᵢ and oᵢ are the desired and actual responses, respectively, of the i'th discrete perceptron. The formula expresses the R-category discrete perceptron classifier training.
• For R-category classifiers with so-called local representation, the desired response for a training pattern of the i'th category is dᵢ = 1 and d_j = −1, for j ≠ i.
• For R-category classifiers with so-called distributed representation, this condition is not required, because more than a single neuron is allowed to respond with +1 in this mode.
58
R-category discrete perceptron Training Algorithm
• In the following, k denotes the training step and p denotes the step counter within the training cycle.
1. c > 0 is chosen.
2. Weights are initialized at W at small random values; W is R×(n+1). Counters and error are initialized: k ← 1, p ← 1, E ← 0.
3. The training cycle begins here. Input is presented and output computed: y ← y_p, d ← d_p, oᵢ ← sgn(wᵢᵗy), for i = 1, …, R, where wᵢ is the i'th row of W.
4. Weights are updated: wᵢ ← wᵢ + (1/2) c (dᵢ − oᵢ) y, for i = 1, …, R.
5. Cycle error is computed: E ← (1/2) ‖d − o‖² + E.
6. If p < P, then p ← p + 1, k ← k + 1, and go to Step 3; otherwise, go to Step 7.
7. The training cycle is completed. For E = 0, terminate the training session and output the weights and k. If E > 0, then E ← 0, p ← 1, and begin a new training cycle by going to Step 3.
59
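The R-category algorithm above can be sketched as a Python loop, directly mirroring the single-perceptron version. Local representation targets (one +1 per row of D) and the cap on cycles are my assumptions:

```python
import numpy as np

def train_rdp(Y, D, c=1.0, max_cycles=100):
    """Sketch of the R-category discrete perceptron training algorithm:
    Y is P x (n+1) augmented patterns, D is P x R bipolar desired
    responses (local representation: d_i = +1 for the true class)."""
    rng = np.random.default_rng(1)
    W = rng.uniform(-0.1, 0.1, (D.shape[1], Y.shape[1]))  # Step 2
    k = 0
    for _ in range(max_cycles):
        E = 0.0
        for y, d in zip(Y, D):
            o = np.where(W @ y >= 0, 1.0, -1.0)   # Step 3: R TLU outputs
            W += 0.5 * c * np.outer(d - o, y)     # Step 4
            E += 0.5 * np.sum((d - o) ** 2)       # Step 5
            k += 1
        if E == 0:                                # Step 7
            break
    return W, k
```

For linearly pairwise separable patterns each TLU solves a separable two-class problem, so the cycle error eventually reaches zero.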
Examples
Example: Using Madaline network, implement XOR function with bipolar inputs and targets. Assume
the required parameters for training of the network.
• Solution: The training patterns for the XOR function with bipolar inputs and targets are shown in the table.

  x1   x2    t
   1    1   -1
   1   -1    1
  -1    1    1
  -1   -1   -1

• The Madaline algorithm, in which the weights between the hidden layer and the output layer remain fixed, is used for training the network.
• Initializing the weights to small random values, the network architecture is shown in the figure with the initial weights.
• The initial weights and biases are as given in the figure.
60
Examples
• For the first input sample, x₁ = 1, x₂ = 1, t = −1:
• Calculate the net input to the hidden units: z_in_j = b_j + x₁w_1j + x₂w_2j, for j = 1, 2.
• Calculate the output by applying the activation over the net input computed. The activation function is given by f(z) = 1 for z ≥ 0 and f(z) = −1 for z < 0.
• Hence, z₁ = f(z_in1) and z₂ = f(z_in2).
• After computing the output of the hidden units, find the net input entering the output unit: y_in = b₃ + z₁v₁ + z₂v₂.
61
Examples
• Apply the activation function over the net input to calculate the output y = f(y_in).
• Since t ≠ y, weight updating has to be performed. Also, since t = −1, the weights are updated on the hidden units z₁ and z₂ that have positive net input. Since here both net inputs z_in1 and z_in2 are positive, updating the weights and biases on both hidden units, we obtain
w_ij(new) = w_ij(old) + α(t − z_inj)xᵢ and b_j(new) = b_j(old) + α(t − z_inj).
62
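The first training step above can be sketched in Python as one step of the MRI rule for a two-hidden-unit Madaline. The weight values in the test below and all names are illustrative assumptions, not the lecture's figure values; the hidden-to-output weights v and the output bias stay fixed, as stated in the solution:

```python
import numpy as np

def madaline_step(x, t, W, b, v, b_out, alpha=0.5):
    """One MRI step. W is 2x2 (W[i, j] = weight from input i to hidden
    unit j), b the hidden biases, v and b_out the fixed output weights.
    Returns updated copies of W and b, plus the network output y."""
    W, b = W.copy(), b.copy()
    z_in = W.T @ x + b                       # net inputs to hidden units
    z = np.where(z_in >= 0, 1.0, -1.0)       # bipolar hidden outputs
    y_in = v @ z + b_out
    y = 1.0 if y_in >= 0 else -1.0
    if y != t:
        if t == -1:
            idx = z_in > 0                   # push positive-net units down
        else:
            idx = np.zeros_like(z_in, bool)  # t = +1: push the unit whose
            idx[np.argmin(np.abs(z_in))] = True  # net is closest to zero up
        W[:, idx] += alpha * (t - z_in[idx]) * x[:, None]
        b[idx] += alpha * (t - z_in[idx])
    return W, b, y
```

With the sample (1, 1), t = −1, both hidden net inputs are positive, so both hidden units are corrected, matching the update carried out in the solution.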