Linear Classifier-Perceptron
1 / 42
Introduction
2 / 42
Introduction
x_i = (x_1, x_2, ..., x_d) is augmented to y_i, a vector with d+1 components (this dimension is also written d̂).
a^t y_i > 0 ⇒ y_i ∈ ω_1
a^t y_i < 0 ⇒ y_i ∈ ω_2
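As a minimal Python sketch of this decision rule (the weight vector a and the sample used below are made-up values, not from the slides):

    import numpy as np

    def classify(a, x):
        """Decide the class of a d-dimensional sample x with weight vector a (d+1 components).

        The sample is augmented with a constant 1 to form y, and the sign of
        a^t y decides the class: omega_1 if positive, omega_2 if negative.
        (The case a^t y = 0 is not defined on the slide; it is sent to omega_2 here.)
        """
        y = np.append(x, 1.0)                 # augmented vector with d+1 components
        return "omega_1" if a @ y > 0 else "omega_2"

    a = np.array([1.0, -2.0, 0.5])            # made-up weight vector, d = 2
    print(classify(a, np.array([3.0, 1.0])))  # a^t y = 3 - 2 + 0.5 = 1.5 > 0 -> omega_1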
3 / 42
Uniform criterion function
If a^t y_i > 0, the sample y_i is correctly classified by the weight vector a.
Otherwise, it is misclassified and we should update the weight
vector from a(k) to a(k + 1).
We take some criterion function J(a).
J(a) is minimised when a is a solution vector, i.e. when a lies in the solution region.
One such criterion function is the perceptron criterion function.
4 / 42
Perceptron algorithm
5 / 42
Issues in Perceptron algorithm
6 / 42
Sequential Version of Perceptron algorithm
a(0) = arbitrary
a(k+1) = a(k) + η(k) y^k
7 / 42
Sequential Version of Perceptron algorithm
a(0) = arbitrary
a(k+1) = a(k) + y^k   (fixed increment: η(k) = 1)
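A minimal Python sketch of this sequential, fixed-increment rule. It assumes the samples are already augmented and sign-normalized (class-ω_2 samples multiplied by −1) so that a correct weight vector gives a^t y > 0 for every sample; the epoch loop and stopping test are assumptions added for the sketch:

    import numpy as np

    def sequential_perceptron(Y, eta=1.0, max_epochs=100):
        """Fixed-increment single-sample perceptron.

        Y: (n, d+1) array of augmented, sign-normalized samples.
        Returns the learned weight vector (or the last iterate if the loop
        runs out of epochs, e.g. when the classes are not linearly separable).
        """
        a = np.zeros(Y.shape[1])            # a(0) chosen arbitrarily (here: zero)
        for _ in range(max_epochs):
            updated = False
            for y in Y:                     # present the samples one after another
                if a @ y <= 0:              # y is misclassified (or on the boundary)
                    a = a + eta * y         # a(k+1) = a(k) + eta * y^k
                    updated = True
            if not updated:                 # a full pass with no mistakes: converged
                return a
        return a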
8 / 42
Perceptron algorithm: Sequential Version
9 / 42
Two category case
10 / 42
Example: Perceptron learning algorithm
11 / 42
Example: Perceptron learning algorithm
12 / 42
Example: Perceptron learning algorithm
w_1 = (0, 0, 0)^t and x_1 = (−0.5, −3.0, −1)^t
Here w_1^t x_1 = 0, so w_2 = w_1 + x_1, an instance of the update rule
a(k+1) = a(k) + η(k) Σ y over all misclassified y:
w_2 = w_1 + x_1 = (−0.5, −3.0, −1)^t
13 / 42
Example: Perceptron learning algorithm
14 / 42
Example: Perceptron learning algorithm
w_2^t x_6 = (−0.5, −3.0, −1)(4.5, 1, 1)^t = −6.25 < 0
so update the weight vector:
w_3 = w_2 + x_6 = (−0.5, −3, −1)^t + (4.5, 1, 1)^t = (4, −2, 0)^t
Note that w_3 classifies the patterns x_7, x_8, x_9 (and, in the next iteration, x_1, x_2, x_3
and x_4) correctly.
15 / 42
Example: Perceptron learning algorithm
w_3^t x_7 = (4, −2, 0)(5, 1, 1)^t = 18
w_3^t x_8 = (4, −2, 0)(4.5, 0.5, 1)^t = 17
w_3^t x_9 = (4, −2, 0)(5.5, 0.5, 1)^t = 21
w_3^t x_1 = (4, −2, 0)(−0.5, −3.0, −1)^t = 4
w_3^t x_2 = (4, −2, 0)(−1, −3, −1)^t = 2
16 / 42
Example: Perceptron learning algorithm
w_3^t x_3 = (4, −2, 0)(−0.5, −2.5, −1)^t = 3
w_3^t x_4 = (4, −2, 0)(−1, −2.5, −1)^t = 1
However, x_5 is misclassified by w_3; note that w_3^t x_5 is −1:
w_3^t x_5 = (4, −2, 0)(−1.5, −2.5, −1)^t = −1 < 0
So, update the weight vector w_4 = w_3 + x_5:
w_4 = (4, −2, 0)^t + (−1.5, −2.5, −1)^t = (2.5, −4.5, −1)^t
17 / 42
Example: Perceptron learning algorithm
w_4 classifies the patterns x_6, x_7, x_8, x_9, x_1, x_2, x_3, x_4 and x_5 correctly:
w_4^t x_6 = (2.5, −4.5, −1)(4.5, 1, 1)^t = 5.75
w_4^t x_7 = (2.5, −4.5, −1)(5, 1, 1)^t = 7
w_4^t x_8 = (2.5, −4.5, −1)(4.5, 0.5, 1)^t = 8
w_4^t x_9 = (2.5, −4.5, −1)(5.5, 0.5, 1)^t = 10.5
18 / 42
Example: Perceptron learning algorithm
w_4^t x_1 = (2.5, −4.5, −1)(−0.5, −3.0, −1)^t = 13.25
w_4^t x_2 = (2.5, −4.5, −1)(−1, −3, −1)^t = 12
w_4^t x_3 = (2.5, −4.5, −1)(−0.5, −2.5, −1)^t = 11
w_4^t x_4 = (2.5, −4.5, −1)(−1, −2.5, −1)^t = 9.75
w_4^t x_5 = (2.5, −4.5, −1)(−1.5, −2.5, −1)^t = 8.5
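The whole run can be replayed in a short Python sketch. The nine augmented, sign-normalized sample vectors below are taken from the dot products on the slides above; the cyclic presentation order x_1, ..., x_9, the zero starting vector and the rule of updating whenever w^t x ≤ 0 (as in the step w_1 → w_2) are the assumptions needed to reproduce w_2, w_3 and w_4:

    import numpy as np

    # Augmented, sign-normalized samples read off the worked example
    # (the class-2 samples have already been multiplied by -1).
    X = np.array([
        [-0.5, -3.0, -1.0], [-1.0, -3.0, -1.0], [-0.5, -2.5, -1.0],  # x1, x2, x3
        [-1.0, -2.5, -1.0], [-1.5, -2.5, -1.0],                      # x4, x5
        [ 4.5,  1.0,  1.0], [ 5.0,  1.0,  1.0], [ 4.5,  0.5,  1.0],  # x6, x7, x8
        [ 5.5,  0.5,  1.0],                                          # x9
    ])

    w = np.zeros(3)                   # w1 = (0, 0, 0)^t
    history = [w.copy()]
    while True:
        changed = False
        for x in X:                   # present x1, ..., x9 cyclically
            if w @ x <= 0:            # misclassified (a product of 0 also triggers an update)
                w = w + x             # fixed-increment update
                history.append(w.copy())
                changed = True
        if not changed:               # one full pass with no mistakes: converged
            break

    print(history)  # [(0,0,0), (-0.5,-3,-1), (4,-2,0), (2.5,-4.5,-1)] = w1, w2, w3, w4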
19 / 42
Example: Perceptron learning algorithm
20 / 42
Recap: Convergence of Perceptron Algorithm
Perceptron Criterion:
{x} → {y}: each sample x = (x_1, x_2, ..., x_d)^t is augmented to y = (x_1, x_2, ..., x_d, 1)^t
If a^t y > 0 then y ∈ ω_1
If a^t y < 0 then y ∈ ω_2
21 / 42
Recap: Uniform Criterion Function
If a^t y > 0, the sample y is correctly classified by the weight vector a;
otherwise it is misclassified.
We then update the weight vector from a(k) to a(k+1); we are interested in
finding the weight vector a that minimises the criterion function J(a).
Gradient descent:
a(0) = arbitrary
a(k+1) = a(k) − η(k) ∇J(a(k))
Perceptron criterion:
J_p(a) = Σ (−a^t y)   ∀ y misclassified
a(0) = arbitrary
a(k + 1) = a(k) + η(k) Σ y   ∀ y misclassified
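The last update rule is exactly gradient descent on J_p; spelled out in LaTeX (the symbol \mathcal{Y}_k for the set of samples misclassified by a(k) is notation introduced here):

    \[
    J_p(a) = \sum_{y \in \mathcal{Y}_k} (-a^t y), \qquad
    \nabla J_p(a) = \sum_{y \in \mathcal{Y}_k} (-y),
    \]
    \[
    a(k+1) = a(k) - \eta(k)\,\nabla J_p(a(k)) = a(k) + \eta(k) \sum_{y \in \mathcal{Y}_k} y .
    \]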
22 / 42
Recap: Sequential Version of Perceptron Algorithm
a(0) = arbitrary
a(k+1) = a(k) + η y^k
23 / 42
Perceptron Algorithm: Convergence Proof
To demonstrate that the above sequential algorithm converges, let us
consider the two-dimensional case:
24 / 42
Perceptron Algorithm: Convergence Proof
25 / 42
Perceptron Algorithm: Convergence Proof
26 / 42
Perceptron Convergence Proof: Algorithm Illustration
27 / 42
Perceptron Convergence Proof: Algorithm Illustration
28 / 42
Perceptron Convergence Proof: Algorithm Illustration
29 / 42
Perceptron Convergence Proof: Algorithm Illustration
30 / 42
Perceptron Convergence Proof: Algorithm Illustration
In order to ensure that the weight vector a lies in the safe region of the
solution space, a^t y should exceed some margin b.
This can be ensured by the rule a^t y > b, for some positive constant b.
We now say that any y which satisfies a^t y > b is safely classified.
If a^t y is merely greater than 0, the sample is properly classified, but it is
not in the safe region.
With this rule, we can ensure that the weight vector lies in the safe region.
The perceptron criterion is not the only criterion function for designing a
linear classifier.
Another criterion function can be defined based on the margin b;
it is called the relaxation criterion.
31 / 42
Relaxation Criterion
It is based on the margin b:
J_r(a) = (1/2) Σ (a^t y − b)^2 / ||y||^2   ∀ y misclassified
For minimisation of this criterion function J_r(a) we use the same gradient
descent procedure to obtain the weight vector a:
∇J_r(a) = Σ [(a^t y − b) / ||y||^2] y   ∀ y misclassified
a(0) = arbitrary
a(k+1) = a(k) + η Σ [(b − a^t y) / ||y||^2] y   ∀ y misclassified
32 / 42
Sequential version of Relaxation Criterion
a(0) = arbitrary
a(k+1) = a(k) + η [(b − a(k)^t y^k) / ||y^k||^2] y^k
Here, the samples are considered one after another; the moment we find that a
vector y^k is misclassified, we update the weight vector.
It can be noted that whether we use the perceptron criterion or the relaxation
criterion, convergence is guaranteed in both cases if the classes are linearly
separable; otherwise, the algorithm never converges.
So we can make use of these algorithms only if we know for sure that the
classes are linearly separable.
However, if we are not sure (or do not know) whether the classes are linearly
separable, we can still design a linear classifier with minimum error.
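A minimal Python sketch of this single-sample relaxation rule. The epoch loop, the stopping test and the choice η = 1.5 (η is typically taken between 0 and 2) are assumptions added for the sketch; the update itself is the rule above:

    import numpy as np

    def sequential_relaxation(Y, b=1.0, eta=1.5, max_epochs=1000):
        """Single-sample relaxation rule with margin b.

        Y: (n, d+1) array of augmented, sign-normalized samples, so the goal
        is a^t y > b for every row y. Whenever a sample violates the margin,
        a is moved by eta * (b - a^t y) / ||y||^2 * y.
        """
        a = np.zeros(Y.shape[1])                       # a(0) arbitrary
        for _ in range(max_epochs):
            violated = False
            for y in Y:
                if a @ y < b:                          # margin not met for this sample
                    a = a + eta * (b - a @ y) / (y @ y) * y
                    violated = True
            if not violated:                           # every sample now meets the margin
                return a
        return a                                       # no convergence within max_epochs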
33 / 42
II. Minimum Squared Error - For Non Separable Case
The criterion functions considered so far have focused their attention on the
misclassified samples.
Now we shall consider a criterion function that involves all of the samples.
Previously, the decision rule was a^t y > 0; now we shall try to make a^t y > b.
The decision surface is a^t y = b, where b is some positive constant, and we
should get a solution to this set of equations a^t y = b.
The solution of these equations can be obtained by the minimum squared error
procedure, which is more general:
34 / 42
Minimum Squared Error - For Non Separable Case
35 / 42
Minimum Squared Error - For Non Separable Case
In matrix form:

[ y_10  y_11  y_12  ...  y_1d ]   [ a_0 ]   [ b_1 ]
[ y_20  y_21  y_22  ...  y_2d ]   [ a_1 ]   [ b_2 ]
[   .     .     .         .   ] x [  .  ] = [  .  ]
[   .     .     .         .   ]   [  .  ]   [  .  ]
[ y_n0  y_n1  y_n2  ...  y_nd ]   [ a_d ]   [ b_n ]

In compact form:
Ya = b
Find the weight vector a satisfying the above matrix equation:
a = Y^{-1} b
36 / 42
Minimum Squared Error - For Non Separable Case
But the problem is that Y is not a square matrix; it is a rectangular matrix.
No. of rows = no. of samples (n)
No. of columns = d+1 (or d̂); usually with more rows than columns.
In this case, the system of equations for the vector a is overdetermined.
So we cannot, in general, get an exact solution for the vector a.
To get a solution for the vector a we can define an error vector:
e = Ya − b
Our aim is to get a solution for a that minimises this error:
Y holds the training samples and b is the margin vector, so both Y and b are known;
a is unknown: we try to find the a which minimises this error.
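Concretely, the quantity to be minimised is the sum-of-squared-error criterion, written here in LaTeX (this is the standard form whose gradient 2Y^t(Ya − b) is used in the closed-form solution below):

    \[
    J_s(a) = \|Ya - b\|^2 = \sum_{i=1}^{n} (a^t y_i - b_i)^2 .
    \]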
37 / 42
Sum of Squared Error Criterion
38 / 42
Closed form solution
∇J_s(a) = 2Y^t (Ya − b) = 0
2Y^t Ya − 2Y^t b = 0
Y^t Ya = Y^t b
a = (Y^t Y)^{-1} Y^t b
where Y is a rectangular matrix of dimension n × d̂, but Y^t Y is a square matrix of
dimension d̂ × d̂, and quite often this matrix is non-singular.
a = Y^+ b, where Y^+ = (Y^t Y)^{-1} Y^t is the pseudo-inverse of Y.
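A minimal NumPy sketch of this closed-form solution. The four 2-D samples and the all-ones margin vector are made-up values for illustration; np.linalg.pinv computes the pseudo-inverse Y^+:

    import numpy as np

    def mse_weight_vector(Y, b):
        """Closed-form MSE solution a = (Y^t Y)^{-1} Y^t b = Y^+ b.

        Y: (n, d+1) matrix whose rows are augmented training samples.
        b: (n,) margin vector.
        """
        return np.linalg.pinv(Y) @ b          # pinv also copes with a singular Y^t Y

    # Made-up illustration: four samples in 2-D, augmented with a trailing 1.
    Y = np.array([[ 1.0,  2.0, 1.0],
                  [ 2.0,  0.0, 1.0],
                  [-1.0, -1.0, 1.0],
                  [-2.0,  1.0, 1.0]])
    b = np.ones(4)                            # simplest choice: every margin equal to 1
    a = mse_weight_vector(Y, b)
    print(a, Y @ a)                           # Y a approximates b in the least-squares sense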
39 / 42
Closed form solution
Note:
If Y is square and non-singular, the pseudo-inverse coincides with the regular
inverse.
Y^+ Y = I
But YY^+ ≠ I in general.
However, an MSE solution always exists, and a = Y^+ b is an MSE solution to Ya = b.
The MSE solution depends on the margin vector b.
Different choices for b give the solution different properties.
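These two properties are easy to check numerically; the tall 4 × 3 matrix below is a made-up example with full column rank:

    import numpy as np

    Y = np.array([[1.0,  0.0, 1.0],
                  [0.0,  1.0, 1.0],
                  [1.0,  1.0, 1.0],
                  [2.0, -1.0, 1.0]])
    Yp = np.linalg.pinv(Y)                        # pseudo-inverse Y^+

    print(np.allclose(Yp @ Y, np.eye(3)))         # True:  Y^+ Y = I (3 x 3)
    print(np.allclose(Y @ Yp, np.eye(4)))         # False: Y Y^+ is a projection, not I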
40 / 42
Problem of Generalization
41 / 42
THANK YOU
42 / 42