
Linear Classifier: Perceptron

Compiled by Karthikeyan S CED16I015


Guided by
Dr Umarani Jayaraman

Department of Computer Science and Engineering


Indian Institute of Information Technology Design and Manufacturing
Kancheepuram

April 20, 2022

Introduction

If the probability density function is not known, then we cannot estimate it or assume any parametric form for it.
In such cases, we instead try to estimate the weight vector W and the bias w0 that separate the two classes, provided the classes are linearly separable.
Here W gives the orientation of the line, while w0 gives the position of the line which separates the two classes.
With this assumption we try to design linear classifiers.
One of the linear classifiers that we discuss here is the perceptron, together with its convergence proof.
Each sample X_i = (x_1, x_2, ..., x_d)^t is mapped to an augmented vector y_i with d+1 (= d̂) components:

y_i = (x_1, x_2, ..., x_d, 1)^t

If a^t y_i > 0 ⇒ y_i ∈ ω_1
If a^t y_i < 0 ⇒ y_i ∈ ω_2
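As a concrete illustration, here is a minimal sketch (Python with NumPy; the function name and arrays are mine, not the deck's) of building augmented samples, together with the usual sign normalization in which the ω_2 samples are negated so that every correctly classified sample satisfies a^t y > 0. The worked example later in the deck appears to use this convention, since its ω_2 patterns carry −1 in the augmented slot.

    import numpy as np

    def augment_and_normalize(X1, X2):
        # Append a trailing 1 to every sample, then negate the omega_2
        # samples so that "correct" always means a^t y > 0.
        Y1 = np.hstack([X1, np.ones((len(X1), 1))])    # omega_1: (x, 1)
        Y2 = -np.hstack([X2, np.ones((len(X2), 1))])   # omega_2: -(x, 1)
        return np.vstack([Y1, Y2])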
Uniform criterion function

For any sample with a^t y_i > 0, the weight vector a classifies that sample correctly.
Otherwise the sample is misclassified, and we should then update the weight vector from a(k) to a(k + 1).
We take some criterion function J(a).
J(a) is minimised when a is a solution vector, i.e. lies in the solution region.
One such criterion function is the perceptron criterion function:

J_p(a) = Σ (−a^t y)   ∀ y misclassified
Perceptron algorithm

The (batch) perceptron algorithm is:

a(0) = initial weight vector; arbitrary
a(k+1) = a(k) + η(k) Σ y   ∀ y misclassified

J_p(a) has a minimum value of zero, which is a global minimum.
It is attained by this iterative procedure whenever a reaches the solution region, i.e. becomes a solution vector.
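A minimal sketch of this batch rule (Python/NumPy, my own rendering; it assumes sign-normalized augmented samples Y, one row per sample, and a fixed learning rate):

    import numpy as np

    def batch_perceptron(Y, eta=1.0, max_iter=1000):
        # Batch perceptron: add the sum of all misclassified samples
        # to the weight vector until none remain.
        a = np.zeros(Y.shape[1])          # arbitrary initial weight vector
        for _ in range(max_iter):
            mis = Y[Y @ a <= 0]           # misclassified: a^t y <= 0
            if len(mis) == 0:
                return a                  # J_p(a) = 0, solution found
            a = a + eta * mis.sum(axis=0)
        return a                          # not converged within max_iter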
Issues in Perceptron algorithm

There is a problem with this procedure: the memory required to execute the algorithm.
In real situations, we may have thousands of samples that are misclassified initially.
The algorithm takes the summation over all misclassified samples in every iteration, so we need a large amount of memory.
The solution is, instead of considering all the samples together, to consider them sample by sample.
As a result, we get a sequential version of the perceptron algorithm.
Sequential Version of Perceptron algorithm

Given the sequence y_1, y_2, ..., y_k, ..., y_n, if y_k is misclassified, then:

a(0) = arbitrary
a(k+1) = a(k) + η(k) y_k

The memory requirement is much less than for the previous algorithm.
One variant of the perceptron that is easier to analyse:

We consider the samples in a sequence and modify the weight vector whenever a single sample is misclassified.
η(k) = constant ⇒ the fixed-increment case.
We take η(k) = 1 with no loss of generality.
Accordingly, the modified perceptron algorithm is as follows:

a(0) = arbitrary
a(k+1) = a(k) + 1 · y_k
Perceptron algorithm: Sequential Version

ALGORITHM - Fixed-Increment Single-Sample Perceptron

    initialize a, k ← 0
    do
        k ← (k + 1) mod n
        if y_k is misclassified by a then a ← a + y_k
    until all samples are correctly classified
    return a

END ALGORITHM
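A runnable rendering of this pseudocode, as a sketch (Python/NumPy; the epoch cap and function name are my additions). It assumes the rows of Y are sign-normalized augmented samples, and treats a^t y ≤ 0 as misclassified, matching the example below where w_1^t x_1 = 0 triggers an update:

    import numpy as np

    def fixed_increment_perceptron(Y, max_epochs=100):
        # Fixed-increment single-sample perceptron.
        a = np.zeros(Y.shape[1])              # initialize a
        for _ in range(max_epochs):
            errors = 0
            for k in range(len(Y)):           # k <- (k+1) mod n
                if Y[k] @ a <= 0:             # y_k misclassified by a
                    a = a + Y[k]              # a <- a + y_k
                    errors += 1
            if errors == 0:                   # all correctly classified
                return a
        raise RuntimeError("no convergence; classes may not be linearly separable")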
Two category case

[Figure: a two-category sample set in the plane; not recoverable from this extraction.]

Example: Perceptron learning algorithm

[Figures: the nine sample patterns x_1, ..., x_9 used below; not recoverable from this extraction.]
We start with

w_1 = [0, 0, 0]^t   and   x_1 = [−0.5, −3.0, −1]^t

Here w_1^t x_1 = 0, which is not > 0, so x_1 is misclassified and we apply the update a(k+1) = a(k) + η(k) Σ y over the misclassified y (here a single sample, η = 1):

w_2 = w_1 + x_1 = [−0.5, −3.0, −1]^t
Next we consider the pattern x_2:

w_2^t x_2 = [−0.5, −3.0, −1] [−1, −3, −1]^t = 10.5 > 0

x_3, x_4 and x_5 are also properly classified:

w_2^t x_3 = [−0.5, −3.0, −1] [−0.5, −2.5, −1]^t = 8.75 > 0
w_2^t x_4 = [−0.5, −3.0, −1] [−1, −2.5, −1]^t = 9 > 0
w_2^t x_5 = [−0.5, −3.0, −1] [−1.5, −2.5, −1]^t = 9.25 > 0
But x_6 is misclassified:

w_2^t x_6 = [−0.5, −3.0, −1] [4.5, 1, 1]^t = −6.25 < 0

so we update the weight vector:

w_3 = w_2 + x_6 = [−0.5, −3, −1]^t + [4.5, 1, 1]^t = [4, −2, 0]^t

Note that w_3 classifies the patterns x_7, x_8, x_9 and, in the next iteration, x_1, x_2, x_3 and x_4 correctly.
w_3^t x_7 = [4, −2, 0] [5, 1, 1]^t = 18
w_3^t x_8 = [4, −2, 0] [4.5, 0.5, 1]^t = 17
w_3^t x_9 = [4, −2, 0] [5.5, 0.5, 1]^t = 21
w_3^t x_1 = [4, −2, 0] [−0.5, −3.0, −1]^t = 4
w_3^t x_2 = [4, −2, 0] [−1, −3, −1]^t = 2
w_3^t x_3 = [4, −2, 0] [−0.5, −2.5, −1]^t = 3
w_3^t x_4 = [4, −2, 0] [−1, −2.5, −1]^t = 1

However, x_5 is misclassified by w_3; note that w_3^t x_5 is −1:

w_3^t x_5 = [4, −2, 0] [−1.5, −2.5, −1]^t = −1 < 0

So we update the weight vector, w_4 = w_3 + x_5:

w_4 = [4, −2, 0]^t + [−1.5, −2.5, −1]^t = [2.5, −4.5, −1]^t
w_4 classifies the patterns x_6, x_7, x_8, x_9, x_1, x_2, x_3, x_4 and x_5 correctly:

w_4^t x_6 = [2.5, −4.5, −1] [4.5, 1, 1]^t = 5.75
w_4^t x_7 = [2.5, −4.5, −1] [5, 1, 1]^t = 7
w_4^t x_8 = [2.5, −4.5, −1] [4.5, 0.5, 1]^t = 8
w_4^t x_9 = [2.5, −4.5, −1] [5.5, 0.5, 1]^t = 10.5
w_4^t x_1 = [2.5, −4.5, −1] [−0.5, −3.0, −1]^t = 13.25
w_4^t x_2 = [2.5, −4.5, −1] [−1, −3, −1]^t = 12
w_4^t x_3 = [2.5, −4.5, −1] [−0.5, −2.5, −1]^t = 11
w_4^t x_4 = [2.5, −4.5, −1] [−1, −2.5, −1]^t = 9.75
w_4^t x_5 = [2.5, −4.5, −1] [−1.5, −2.5, −1]^t = 8.5

All products are positive, so every pattern is now correctly classified.
So w_4 (or a_4) is the desired solution vector a.
In other words, 2.5 x_1 − 4.5 x_2 − 1 = 0 is the equation of the decision boundary.
Equivalently, the line separating the two classes is 5 x_1 − 9 x_2 − 2 = 0, i.e. w_1 = 5, w_2 = −9, w_0 = −2.
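To check this arithmetic, a self-contained sketch (Python/NumPy; the pattern list is transcribed from the slides above) that replays the fixed-increment rule on x_1, ..., x_9:

    import numpy as np

    # Augmented, sign-normalized patterns transcribed from the example.
    X = np.array([
        [-0.5, -3.0, -1], [-1, -3, -1], [-0.5, -2.5, -1],
        [-1, -2.5, -1], [-1.5, -2.5, -1],                      # x_1 .. x_5
        [4.5, 1, 1], [5, 1, 1], [4.5, 0.5, 1], [5.5, 0.5, 1]   # x_6 .. x_9
    ])

    a = np.zeros(3)                 # w_1 = [0, 0, 0]^t
    converged = False
    while not converged:
        converged = True
        for x in X:
            if x @ a <= 0:          # misclassified (or on the boundary)
                a = a + x
                converged = False

    print(a)                        # [ 2.5 -4.5 -1. ]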
Recap: Convergence of Perceptron Algorithm

Perceptron criterion setup: each sample X = (x_1, x_2, ..., x_d)^t is mapped to the augmented vector

y = (x_1, x_2, ..., x_d, 1)^t

If a^t y > 0 then y ∈ ω_1
If a^t y < 0 then y ∈ ω_2
Recap: Uniform Criterion Function

For all samples with a^t y > 0, the weight vector a classifies the sample correctly; otherwise it is misclassified, and we should update the weight vector from a(k) to a(k+1).
We are interested in finding the weight vector a for which J(a) is minimum, via gradient descent:

a(0) = arbitrary
a(k+1) = a(k) − η(k) ∇J(a(k))

Criterion:
J_p(a) = Σ (−a^t y)   ∀ y misclassified
a(0) = arbitrary
a(k+1) = a(k) + η(k) Σ y   ∀ y misclassified
Recap: Sequential Version of Perceptron Algorithm

Given y_1, y_2, y_3, ..., y_k, ..., y_n, when the k-th sample is misclassified:

a(0) = arbitrary
a(k+1) = a(k) + η y_k
Perceptron Algorithm: Convergence Proof

To demonstrate that the above sequential algorithm converges, let us consider the two-dimensional case:

[Figure: two linearly separable classes in 2-D; not recoverable from this extraction.]
The weight vector a is orthogonal to the decision surface.
In 2-D the decision surface is nothing but a line.
What are the straight lines that actually separate these two classes?
We could have two limiting cases, the lines l_1 and l_2.
Any line that lies between these two limiting lines l_1 and l_2 properly separates the two classes without error.
Now, the weight vectors are orthogonal to the decision boundary.
Any weight vector a lying within the conical region serves our purpose.
The conical region is the solution region.
Our weight vector should lie within this solution region.
When the algorithm converges, the weight vector lies within the solution region.
Perceptron Convergence Proof: Algorithm Illustration

[Figure: an initial weight vector a(0) and its decision surface, with three misclassified ω_1 samples; not recoverable from this extraction.]
Perceptron Convergence Proof: Algorithm Illustration

The initial weight vector a(0) misclassifies the 3 samples in ω 1 .


The decision surface corresponding to the weight vectors a(0) which
is drawn in blue line.
According to the algorithm:

a(k) = a(k − 1) + ηΣy ∀y − misclassified


a(k) = a(k − 1) + ηy

This vector 0 y 0 is scaled by a factor η in the direction of ’y’ and added


with the previous weight vector a(k − 1)

28 / 42
Perceptron Convergence Proof: Algorithm Illustration

The weight vector a(0) will be moved in the direction of misclassified


vector 0 y 0 by η times.
And finally when the algorithm converges the weight vectors lie within
the solution region.
This is ensured by the perceptron criterion.
But there is a problem of generalization.
This leads to risk in classification.
To minimise this risk, we should restrict the solution region some
where as the safe region (sub space of solution region).
That means we should ensure the weight vector ’a’ should lie in safe
region (refer Fig. 3).

29 / 42
[Figure 3: the solution region with the safe region, defined by the margin b, marked as a subregion; not recoverable from this extraction.]
To ensure that the weight vector a lies in the safe region, a^t y should exceed some margin b.
This can be ensured by the rule a^t y > b, for some positive constant b.
We now say that any y which satisfies a^t y > b is safely classified.
If a^t y > 0 the sample is properly classified, but not necessarily in the safe region.
With this, we can ensure that the weight vector lies in the safe region.
The perceptron criterion is not the only criterion function for designing a linear classifier.
Another criterion function can be defined based on the margin b; it is called the relaxation criterion.
Relaxation Criterion

It is based on the margin b:

J_r(a) = (1/2) Σ (a^t y − b)² / ||y||²   ∀ y misclassified

To minimize this criterion function J_r(a), we use the same gradient descent procedure to obtain the weight vector a:

∇J_r(a) = Σ ((a^t y − b) / ||y||²) y   ∀ y misclassified

a(0) = arbitrary
a(k+1) = a(k) + η Σ ((b − a^t y) / ||y||²) y   ∀ y misclassified
Sequential version of Relaxation Criterion

a(0) = arbitrary
a(k+1) = a(k) + η ((b − a^t(k) y_k) / ||y_k||²) y_k

Here, the samples are considered one after another.
The moment we find that a vector y is misclassified, we update the weight vector.
It can be noted that whether we use the perceptron criterion or the relaxation criterion, in both cases convergence is guaranteed if the classes are linearly separable.
Otherwise, the algorithm can never converge.
We can make use of these algorithms only if we know for sure that the classes are linearly separable.
However, if we are not sure (or do not know) whether the classes are linearly separable, we can still design a linear classifier with minimum error.
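A sketch of this single-sample relaxation rule (Python/NumPy, my own rendering; it assumes sign-normalized samples, treats failure to clear the margin b as misclassified, and uses over-relaxation with 0 < η < 2 so each update strictly clears the margin):

    import numpy as np

    def single_sample_relaxation(Y, b=1.0, eta=1.2, max_epochs=100):
        # Single-sample relaxation with margin: push a^t y above b
        # for every sample y.
        a = np.zeros(Y.shape[1])
        for _ in range(max_epochs):
            updated = False
            for y in Y:
                if y @ a <= b:                                # fails the margin
                    a = a + eta * (b - y @ a) / (y @ y) * y   # relaxation update
                    updated = True
            if not updated:
                return a
        return a                                              # best effort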
II. Minimum Squared Error - For Non Separable Case

The criterion functions considered so far have focused their attention on the misclassified samples.
Now we shall consider a criterion function that involves all of the samples.
Previously, the decision rule was a^t y > 0.
Now we shall try to make a^t y > b; the decision surface is a^t y = b, where b is some positive constant.
We should get a solution to this equation a^t y = b.
The solution can be obtained by the minimum squared error procedure; to be more general:

a^t y_i = b_i   for every sample y_i

We can have different margins b_i for generalization.
For every i-th sample we have one such equation.
So for n samples we have n simultaneous equations, and we must solve this system of simultaneous equations.
This can be simplified by introducing matrix notation.
In matrix form:

[ y_10  y_11  y_12  ...  y_1d ] [ a_0 ]   [ b_1 ]
[ y_20  y_21  y_22  ...  y_2d ] [ a_1 ]   [ b_2 ]
[  ...                        ] [ ... ] = [ ... ]
[ y_n0  y_n1  y_n2  ...  y_nd ] [ a_d ]   [ b_n ]

In compact form:

Ya = b

Find the weight vector a satisfying the above matrix equation:

a = Y⁻¹ b
But the problem is that Y is not a square matrix; it is rectangular:

number of rows = number of samples n
number of columns = d+1 (or d̂)

usually with more rows than columns.
In this case, the system for the vector a is overdetermined, so we cannot get an exact solution for a.
To get a solution for a, we can define an error vector:

e = Ya − b

Our aim is to get a solution for a that minimises this error.
Y holds the training samples and b the margins, so both Y and b are known.
a is unknown: we try to get the solution for a which minimises this error.
Sum of Squared Error Criterion

Let us define a criterion function, the sum-of-squared-error criterion:

J_s(a) = ||Ya − b||²

which is nothing but

J_s(a) = Σ (a^t y_i − b_i)²

This can be minimised by the gradient descent approach: we can start with an initial weight vector a and go on updating it.

∇J_s(a) = Σ 2 (a^t y_i − b_i) y_i = 2 Y^t (Ya − b)

Setting this gradient to zero yields a closed-form solution.
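As an illustration of the gradient descent route, a minimal sketch (Python/NumPy; the step size and iteration count are arbitrary choices of mine, not from the slides):

    import numpy as np

    def mse_gradient_descent(Y, b, eta=0.001, steps=10000):
        # Minimise J_s(a) = ||Ya - b||^2 by gradient descent.
        a = np.zeros(Y.shape[1])
        for _ in range(steps):
            grad = 2 * Y.T @ (Y @ a - b)   # grad J_s(a) = 2 Y^t (Ya - b)
            a = a - eta * grad
        return a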
Closed form solution

∇J_s(a) = 2 Y^t (Ya − b) = 0
2 Y^t Y a − 2 Y^t b = 0
Y^t Y a = Y^t b
a = (Y^t Y)⁻¹ Y^t b

Here Y is a rectangular matrix of dimension n × (d+1), but Y^t Y is a square matrix of dimension (d+1) × (d+1), and quite often this matrix is nonsingular.

a = Y⁺ b, where Y⁺ = (Y^t Y)⁻¹ Y^t is the pseudoinverse of Y.
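A compact sketch of this pseudoinverse solution (Python/NumPy; using np.linalg.lstsq instead of an explicit inverse is my choice for numerical stability, not the slides'):

    import numpy as np

    def mse_weight_vector(Y, b):
        # MSE solution a = Y^+ b to Ya = b. lstsq solves the
        # least-squares problem directly; when Y^t Y is nonsingular
        # this equals (Y^t Y)^{-1} Y^t b.
        a, *_ = np.linalg.lstsq(Y, b, rcond=None)
        return a

    # Usage: with Y the n x (d+1) augmented samples and b = np.ones(n),
    # a = mse_weight_vector(Y, b) gives an MSE linear classifier.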
Note:
If Y is square and nonsingular, the pseudoinverse coincides with the regular inverse.
Y⁺Y = I, but in general YY⁺ ≠ I.
However, the MSE solution always exists, and a = Y⁺ b is an MSE solution to Ya = b.
The MSE solution depends on the margin vector b.
Different choices for b give the solution different properties.
Problem of Generalization

Generalization is a term used to describe a model's ability to react to new data.
That is, after being trained on a training set, a model can digest new data and make accurate predictions.
THANK YOU
