Basic Concept of SVM

This document summarizes the basic concept of support vector machines (SVMs). It explains that an SVM finds the optimal separating hyperplane that maximizes the margin between positive and negative examples. This hyperplane is determined by the support vectors, the data points closest to it. The document formulates the SVM optimization problem and shows that its dual formulation allows non-linearly separable problems to be solved efficiently by mapping the data to a higher-dimensional feature space.


Basic Concept of SVM:

o Which line will classify the unseen data well?
o The dotted line! It is the line with maximum margin!

Cont…

[Figure: the separating hyperplane W^T X + b = 0 with the margin boundaries W^T X + b = −1 and W^T X + b = +1; the support vectors are the points lying on these margin boundaries.]


Some definitions:
o Functional Margin:
1) w.r.t. individual examples: γ̂^(i) = Y^(i) (W^T X^(i) + b)
2) w.r.t. the example set S = {(X^(i), Y^(i)); i = 1,...,m}: γ̂ = min_{i=1,...,m} γ̂^(i)

o Geometric Margin:
1) w.r.t. individual examples: γ^(i) = Y^(i) ( (W/||W||)^T X^(i) + b/||W|| )
2) w.r.t. the example set S: γ = min_{i=1,...,m} γ^(i)

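As a quick illustration of these two definitions, here is a minimal NumPy sketch (the names W, b, X, y and the helper functions are illustrative, not from the slides):

import numpy as np

def functional_margin(W, b, X, y):
    # per-example functional margin: gamma_hat^(i) = y^(i) (W^T x^(i) + b)
    per_example = y * (X @ W + b)
    # the margin of the whole set S is the minimum over the examples
    return per_example.min()

def geometric_margin(W, b, X, y):
    # same quantity, but with W and b scaled by ||W||
    norm = np.linalg.norm(W)
    per_example = y * (X @ (W / norm) + b / norm)
    return per_example.min()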


Problem Formulation:

[Figure: data points separated by the hyperplane W^T X + b = 0, with the planes W^T X + b = −1 and W^T X + b = +1 bounding the margin.]
Cont..
o Distance of a point (u, v) from the line Ax + By + C = 0 is given by
|Au + Bv + C| / ||n||, where ||n|| is the norm of the normal vector n = (A, B)
o Distance of the hyperplane W^T X + b = 0 from the origin = |b| / ||W||
o Distance of point A (on W^T X + b = −1) from the origin = |b + 1| / ||W||
o Distance of point B (on W^T X + b = +1) from the origin = |b − 1| / ||W||
o Distance between points A and B (the Margin) = 2 / ||W||

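A quick numeric check of these distances, with illustrative values for W and b (not taken from the slides):

import numpy as np

W = np.array([3.0, 4.0])              # so ||W|| = 5
b = -5.0
norm_W = np.linalg.norm(W)

d_hyperplane = abs(b) / norm_W        # distance of W^T X + b = 0 from origin -> 1.0
d_A = abs(b + 1) / norm_W             # plane W^T X + b = -1 -> 0.8
d_B = abs(b - 1) / norm_W             # plane W^T X + b = +1 -> 1.2
margin = abs(d_B - d_A)               # 2 / ||W|| = 0.4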


Cont…
We have a data set {(X^(i), Y^(i))}, i = 1,...,m, with X ∈ R^d and Y ∈ R,
and a separating hyperplane
W^T X + b = 0
s.t.
W^T X^(i) + b > 0 if Y^(i) = +1
W^T X^(i) + b < 0 if Y^(i) = −1



Cont…
o Suppose the training data also satisfy the following constraints:
W^T X^(i) + b ≥ +1 for Y^(i) = +1
W^T X^(i) + b ≤ −1 for Y^(i) = −1
Combining these into one:
Y^(i) (W^T X^(i) + b) ≥ 1 for ∀i

o Our objective is to find the hyperplane (W, b) with maximal
separation between it and the closest data points while satisfying
the above constraints.



THE PROBLEM:
max_{W,b}  2 / ||W||

such that

Y^(i) (W^T X^(i) + b) ≥ 1 for ∀i

Also we know

||W|| = √(W^T W)



Cont..
So the Problem can be written as:
min_{W,b}  (1/2) W^T W

Such that

Y^(i) (W^T X^(i) + b) ≥ 1 for ∀i

Notice: W^T W = ||W||^2

It is just a convex quadratic optimization problem!


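Since it is a convex QP, it can be handed to an off-the-shelf solver. A minimal sketch with the cvxpy modelling library, on a toy linearly separable data set (the data and variable names are illustrative, not from the slides):

import numpy as np
import cvxpy as cp

# toy linearly separable data
X = np.array([[2.0, 2.0], [2.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

W = cp.Variable(2)
b = cp.Variable()

objective = cp.Minimize(0.5 * cp.sum_squares(W))        # (1/2) W^T W
constraints = [cp.multiply(y, X @ W + b) >= 1]          # Y^(i)(W^T X^(i) + b) >= 1
cp.Problem(objective, constraints).solve()

print(W.value, b.value)   # the maximal-margin hyperplane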
DUAL
o Solving the dual of our problem will let us apply the SVM to
nonlinearly separable data efficiently.
o It can be shown that
min (primal) = max_{α ≥ 0} ( min_{W,b} L(W, b, α) )
o Primal problem:
min_{W,b}  (1/2) W^T W

Such that

Y^(i) (W^T X^(i) + b) ≥ 1 for ∀i


Constructing Lagrangian
o Lagrangian for our problem:

L(W, b, α) = (1/2) ||W||^2 − Σ_{i=1}^{m} α_i [ Y^(i) (W^T X^(i) + b) − 1 ]

where the α_i are Lagrange multipliers and α_i ≥ 0.

o Now, minimizing it w.r.t. W and b:
We set the derivatives of the Lagrangian w.r.t. W and b to zero.
Cont…
o Setting the derivative w.r.t. W to zero gives:

W − Σ_{i=1}^{m} α_i Y^(i) X^(i) = 0

i.e.

W = Σ_{i=1}^{m} α_i Y^(i) X^(i)

o Setting the derivative w.r.t. b to zero gives:

Σ_{i=1}^{m} α_i Y^(i) = 0



Cont…
o Plugging these results into the Lagrangian gives

L(W, b, α) = Σ_{i=1}^{m} α_i − (1/2) Σ_{i,j=1}^{m} Y^(i) Y^(j) α_i α_j (X^(i))^T X^(j)

o Call this D(α):

D(α) = Σ_{i=1}^{m} α_i − (1/2) Σ_{i,j=1}^{m} Y^(i) Y^(j) α_i α_j (X^(i))^T X^(j)

o This is the result of our minimization w.r.t. W and b.



So The DUAL:
o Now the dual becomes:

max_α  D(α) = Σ_{i=1}^{m} α_i − (1/2) Σ_{i,j=1}^{m} Y^(i) Y^(j) α_i α_j ⟨X^(i), X^(j)⟩

s.t.
α_i ≥ 0, i = 1,...,m
Σ_{i=1}^{m} α_i Y^(i) = 0

o Solving this optimization problem gives us the α_i.

o Also, the Karush-Kuhn-Tucker (KKT) condition is satisfied at this solution, i.e.

α_i [ Y^(i) (W^T X^(i) + b) − 1 ] = 0, for i = 1,...,m



Values of W and b:
o W can be found using

W = Σ_{i=1}^{m} α_i Y^(i) X^(i)

o b can be found using:

b* = − ( max_{i: Y^(i) = −1} W*^T X^(i) + min_{i: Y^(i) = +1} W*^T X^(i) ) / 2
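A sketch of solving this dual numerically and then recovering W and b exactly as above, again with cvxpy and illustrative toy data (the variable names are assumptions, not from the slides):

import numpy as np
import cvxpy as cp

X = np.array([[2.0, 2.0], [2.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m = len(y)

Z = y[:, None] * X                      # row i is Y^(i) X^(i)
alpha = cp.Variable(m)
# D(alpha) = sum_i alpha_i - (1/2) ||Z^T alpha||^2, the dual objective above
objective = cp.Maximize(cp.sum(alpha) - 0.5 * cp.sum_squares(Z.T @ alpha))
constraints = [alpha >= 0, alpha @ y == 0]
cp.Problem(objective, constraints).solve()

a = alpha.value
W = ((a * y)[:, None] * X).sum(axis=0)  # W = sum_i alpha_i Y^(i) X^(i)
b = -(max(X[y == -1] @ W) + min(X[y == 1] @ W)) / 2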
What if data is nonlinearly separable?
o The maximal margin hyperplane can classify only linearly separable data.
o What if the data is linearly non-separable?
o Map your data to a higher-dimensional space where it is linearly separable, and use the maximal margin hyperplane there!



Taking it to higher dimension works!
Example: XOR (see the sketch below)

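A small NumPy sketch of the XOR case: the four points are not separable by any line in R^2, but adding one illustrative product feature x1·x2 makes them separable by a plane (this particular mapping and hyperplane are my choice, not necessarily the one on the slide):

import numpy as np

# XOR: not linearly separable in the original 2-D space
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([-1, +1, +1, -1])

# map (x1, x2) -> (x1, x2, x1*x2), a 3-D feature space
phi = np.column_stack([X, X[:, 0] * X[:, 1]])

# in that space the hyperplane W^T phi(x) + b = 0 separates the classes
W, b = np.array([1.0, 1.0, -2.0]), -0.5
print(np.sign(phi @ W + b))   # [-1.  1.  1. -1.], matching y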


Doing it in higher dimensional space
o Let Φ: X → F be a nonlinear mapping from the input
space X (the original space) to a higher-dimensional feature space F.
o Then our inner (dot) product ⟨X^(i), X^(j)⟩ in the higher-dimensional
space is ⟨φ(X^(i)), φ(X^(j))⟩.

o Now, the problem becomes:

max_α  D(α) = Σ_{i=1}^{m} α_i − (1/2) Σ_{i,j=1}^{m} Y^(i) Y^(j) α_i α_j ⟨φ(X^(i)), φ(X^(j))⟩

s.t.
α_i ≥ 0, i = 1,...,m
Σ_{i=1}^{m} α_i Y^(i) = 0
Kernel function:
o There exists a way to compute the inner product in feature
space as a function of the original input points – it is the kernel
function!
o Kernel function:

K(x, z) = ⟨φ(x), φ(z)⟩

o We need not know φ to compute K(x, z).



An example:
Let x, z ∈ R^n and K(x, z) = (x^T z)^2.

i.e.  K(x, z) = ( Σ_{i=1}^{n} x_i z_i ) ( Σ_{j=1}^{n} x_j z_j )
             = Σ_{i=1}^{n} Σ_{j=1}^{n} x_i x_j z_i z_j
             = Σ_{i,j=1}^{n} (x_i x_j)(z_i z_j)

For n = 3, the feature mapping φ is given as:

φ(x) = ( x_1 x_1, x_1 x_2, x_1 x_3, x_2 x_1, x_2 x_2, x_2 x_3, x_3 x_1, x_3 x_2, x_3 x_3 )^T

so that K(x, z) = ⟨φ(x), φ(z)⟩.
example cont…
o Here, for K(x, z) = (x^T z)^2 with

x = (1, 2)^T,  z = (3, 4)^T:

x^T z = 1·3 + 2·4 = 11,  so  K(x, z) = (x^T z)^2 = 121

φ(x) = ( x_1 x_1, x_1 x_2, x_2 x_1, x_2 x_2 )^T = (1, 2, 2, 4)^T
φ(z) = (9, 12, 12, 16)^T

φ(x)^T φ(z) = 1·9 + 2·12 + 2·12 + 4·16 = 121

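The same arithmetic as a short NumPy check (purely illustrative):

import numpy as np

x, z = np.array([1.0, 2.0]), np.array([3.0, 4.0])
phi = lambda v: np.outer(v, v).ravel()      # (v1*v1, v1*v2, v2*v1, v2*v2)
print((x @ z) ** 2, phi(x) @ phi(z))        # both print 121.0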


So our SVM for the non-linearly separable data:
o Optimization problem:

max_α  D(α) = Σ_{i=1}^{m} α_i − (1/2) Σ_{i,j=1}^{m} Y^(i) Y^(j) α_i α_j K(X^(i), X^(j))

s.t.
α_i ≥ 0, i = 1,...,m
Σ_{i=1}^{m} α_i Y^(i) = 0

o Decision function:

F(X) = Sign( Σ_{i=1}^{m} α_i Y^(i) K(X^(i), X) + b )

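A direct NumPy transcription of this decision function; the alphas and b are assumed to come from the dual solver, and the names here are illustrative:

import numpy as np

def decision(X_train, y_train, alpha, b, kernel, x_new):
    # F(X) = Sign( sum_i alpha_i Y^(i) K(X^(i), X) + b )
    k = np.array([kernel(x_i, x_new) for x_i in X_train])
    return np.sign(np.sum(alpha * y_train * k) + b)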


Some commonly used Kernel functions:

o Linear: K(X, Y) = X^T Y
o Polynomial of degree d: K(X, Y) = (X^T Y + 1)^d
o Gaussian Radial Basis Function (RBF): K(X, Y) = exp( −||X − Y||^2 / (2σ^2) )
o Tanh kernel: K(X, Y) = tanh( ρ (X^T Y) − δ )

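The same four kernels written out for a single pair of vectors X, Y (the parameter defaults d, sigma, rho, delta are illustrative):

import numpy as np

def linear(X, Y):
    return X @ Y

def polynomial(X, Y, d=3):
    return (X @ Y + 1) ** d

def rbf(X, Y, sigma=1.0):
    return np.exp(-np.linalg.norm(X - Y) ** 2 / (2 * sigma ** 2))

def tanh_kernel(X, Y, rho=1.0, delta=0.0):
    return np.tanh(rho * (X @ Y) - delta)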


Implementations:
Some Ready to use available SVM implementations:
1) LIBSVM: A library for SVM by Chih-Chung Chang and
Chih-Jen Lin
(at: http://www.csie.ntu.edu.tw/~cjlin/libsvm/)
2) SVMlight: An implementation in C by Thorsten
Joachims
(at: http://svmlight.joachims.org/)
3) Weka: A data mining software suite in Java by the University
of Waikato
(at: http://www.cs.waikato.ac.nz/ml/weka/)
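For Python users, scikit-learn's SVC estimator (which wraps LIBSVM) is another ready-to-use option; a minimal sketch on the XOR data from earlier (the parameter values are illustrative):

import numpy as np
from sklearn.svm import SVC

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([-1, 1, 1, -1])                    # XOR labels

clf = SVC(kernel="rbf", C=10.0, gamma="scale")  # Gaussian RBF kernel
clf.fit(X, y)
print(clf.predict([[0.9, 0.1]]))                # expected [1], the class of the nearby corner (1, 0)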
Issues:
o Selecting a suitable kernel: it is most of the time trial
and error.
o Multiclass classification: one decision function for
each class (that class vs. the other l−1) and then finding the one with the maximum
value, i.e., if X belongs to class 1, then for this and the
other (l−1) classes the values of the decision functions are:
F_1(X) ≥ +1
F_2(X) ≤ −1
.
.
F_l(X) ≤ −1
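A small sketch of that multiclass decision rule, assuming the l per-class decision functions F_1, ..., F_l have already been trained (everything here is illustrative):

import numpy as np

def predict_multiclass(decision_functions, x_new):
    # decision_functions: list of callables F_k, one per class (class k vs. the rest)
    values = np.array([F(x_new) for F in decision_functions])
    return int(np.argmax(values)) + 1     # classes numbered 1..l as on the slide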
Cont….
o Sensitive to noise: mislabeled data can badly affect
the performance.
o Good performance for applications like:
1) Computational biology and medical applications
(protein and cancer classification problems)
2) Image classification
3) Hand-written character recognition
and many others…
o Use SVM for high-dimensional, linearly separable
data (its strength); for nonlinearly separable data, performance depends on the choice of
kernel.
Conclusion:
Support Vector Machines provide a very
simple method for linear classification, but
performance, in the case of nonlinearly separable
data, largely depends on the choice of kernel!





Thank You!

[email protected] ;
[email protected]

