SVM
Linear Separators
• Binary classification can be viewed as the task of separating classes in feature space:
    w^T x + b = 0
    w^T x + b > 0
    w^T x + b < 0
    f(x) = sign(w^T x + b)
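As a minimal sketch (not from the slides), the decision rule f(x) = sign(w^T x + b) can be written directly with NumPy; the particular w and b below are made-up values chosen only for illustration.

```python
import numpy as np

# Hypothetical weight vector and bias; in practice these come from training.
w = np.array([2.0, -1.0])
b = -0.5

def predict(x):
    """Linear classifier f(x) = sign(w^T x + b): +1 on one side of the boundary, -1 on the other."""
    return 1 if w @ x + b > 0 else -1

print(predict(np.array([1.0, 0.0])))   # w^T x + b =  1.5 > 0  -> +1
print(predict(np.array([0.0, 1.0])))   # w^T x + b = -1.5 < 0  -> -1
```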
Linear Separators
• Which of the linear separators is optimal?
What is a good Decision Boundary?
• Many decision boundaries!
  – The Perceptron algorithm can be used to find such a boundary
• Are all decision boundaries equally good?
[Figure: points of Class 1 and Class 2 with several candidate linear boundaries]
Examples of Bad Decision Boundaries
[Figure: two cases where the boundary passes very close to the points of Class 1 or Class 2]
Finding the Decision Boundary
• Let {x_1, ..., x_n} be our data set and let y_i ∈ {1, -1} be the class label of x_i
    For y_i = 1:   w^T x_i + b ≥ 1
    For y_i = -1:  w^T x_i + b ≤ -1
  So:  y_i (w^T x_i + b) ≥ 1  for all (x_i, y_i)
[Figure: the two classes (labels y = 1 and y = -1) on either side of the boundary, with margin m]
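A quick sketch of checking the constraint y_i (w^T x_i + b) ≥ 1 for every point; the toy data, w and b are made up, and the relation m = 2/||w|| used in the last line is the standard geometric margin for constraints scaled this way.

```python
import numpy as np

# Toy 2-D data (hypothetical): class +1 on the right, class -1 on the left.
X = np.array([[2.0, 1.0], [3.0, 2.0], [-2.0, -1.0], [-3.0, 0.0]])
y = np.array([1, 1, -1, -1])

# A candidate separator (also hypothetical, not necessarily the optimal one).
w = np.array([1.0, 0.0])
b = 0.0

# Functional margins y_i (w^T x_i + b); all must be >= 1 for a feasible boundary.
margins = y * (X @ w + b)
print(margins)                  # [2. 3. 2. 3.]
print(np.all(margins >= 1))     # True -> constraints satisfied

# For constraints scaled this way, the geometric margin is m = 2 / ||w||.
print(2 / np.linalg.norm(w))    # 2.0
```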
Large-margin Decision Boundary
• The decision boundary should be as far away from the data of both classes as possible
  – We should maximize the margin, m
[Figure: the maximum-margin boundary between Class 1 and Class 2, with margin m]
Finding the Decision Boundary
• The decision boundary should classify all points correctly ⇒ y_i (w^T x_i + b) ≥ 1 for all i
• Maximizing the margin subject to these constraints amounts to minimizing ½ ||w||^2
• The Lagrangian is
    L = ½ w^T w - Σ_{i=1}^n α_i [y_i (w^T x_i + b) - 1]
  – α_i ≥ 0
  – Note that ||w||^2 = w^T w
Gradient with respect to w and b
• Setting the gradient of L w.r.t. w and b to zero, we have
    L = ½ w^T w - Σ_{i=1}^n α_i [y_i (w^T x_i + b) - 1]
      = ½ Σ_{k=1}^m (w^k)^2 - Σ_{i=1}^n α_i [y_i (Σ_{k=1}^m w^k x_i^k + b) - 1]
  (n: number of examples, m: dimension of the space)
    ∂L/∂w^k = 0 for all k,   ∂L/∂b = 0
The Dual Problem
• Setting the gradients to zero gives  w = Σ_{i=1}^n α_i y_i x_i  and  Σ_{i=1}^n α_i y_i = 0
• If we substitute w = Σ_i α_i y_i x_i into L, we have
    L = Σ_{i=1}^n α_i - ½ Σ_{i=1}^n Σ_{j=1}^n α_i α_j y_i y_j x_i^T x_j
  since Σ_{i=1}^n α_i y_i = 0, the term with b drops out
• w can be recovered by  w = Σ_{i=1}^n α_i y_i x_i
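The slides do not fix a particular solver, so the following is only a sketch of solving this dual numerically with scipy.optimize (an assumption; dedicated QP packages are normally used for SVMs), on made-up linearly separable data. It maximizes Σα_i − ½ ΣΣ α_i α_j y_i y_j x_i^T x_j subject to α_i ≥ 0 and Σ α_i y_i = 0, then recovers w and b.

```python
import numpy as np
from scipy.optimize import minimize

# Made-up linearly separable 2-D data.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -2.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

# Q_ij = y_i y_j x_i^T x_j, so the dual is: maximize sum(a) - 1/2 a^T Q a
Z = y[:, None] * X
Q = Z @ Z.T

def neg_dual(a):
    # Negated dual objective (SLSQP minimizes).
    return 0.5 * a @ Q @ a - a.sum()

cons = {'type': 'eq', 'fun': lambda a: a @ y}   # sum_i a_i y_i = 0
bnds = [(0, None)] * n                          # a_i >= 0
res = minimize(neg_dual, np.zeros(n), method='SLSQP', bounds=bnds, constraints=cons)
a = res.x

# Recover w = sum_i a_i y_i x_i, and b from a support vector (a_i > 0).
w = (a * y) @ X
sv = np.argmax(a)
b = y[sv] - w @ X[sv]
print(np.round(a, 3), np.round(w, 3), round(b, 3))   # roughly a=[0.111,0,0.111,0], w=[0.333,0.333], b=-0.333
```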
Characteristics of the Solution
• Many of the α_i are zero
  – w is a linear combination of a small number of data points
  – This "sparse" representation can be viewed as data compression, as in the construction of a k-NN classifier
[Figure: only a few points carry non-zero multipliers, e.g. α_1 = 0.8, α_6 = 1.4, α_8 = 0.6; all other α_i = 0]
Characteristics of the Solution
• For testing with a new data point z
  – Compute  w^T z + b = Σ_{i=1}^n α_i y_i x_i^T z + b
    and classify z as class 1 if the sum is positive, and class 2 otherwise
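A small sketch of this test-time rule, reusing the hand-worked solution for the toy data from the previous sketch; the point is that only the support vectors (non-zero α_i) contribute to the sum, so w never has to be formed explicitly.

```python
import numpy as np

# Toy training data and multipliers from the previous sketch (most alphas are 0).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -2.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha = np.array([1/9, 0.0, 1/9, 0.0])
b = -1/3

def decision(z):
    """w^T z + b written as sum_i alpha_i y_i x_i^T z + b (only support vectors contribute)."""
    return float(np.sum(alpha * y * (X @ z)) + b)

z = np.array([4.0, 1.0])
print(round(decision(z), 3))                       # about 1.333 (positive)
print('class 1' if decision(z) > 0 else 'class 2') # class 1
```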
Soft Margin Hyperplane
• Slack variables ξ_i ≥ 0 allow points to violate the margin:  y_i (w^T x_i + b) ≥ 1 - ξ_i
• The new conditions from setting the gradients of the Lagrangian to zero become
    ∂L/∂ξ_j = 0  ⇒  C - α_j - μ_j = 0
    ∂L/∂b = 0   ⇒  Σ_{i=1}^n y_i α_i = 0
The Dual Problem
• Substituting w = Σ_j α_j y_j x_j into the Lagrangian of the soft-margin problem:
    L = ½ Σ_{i=1}^n Σ_{j=1}^n α_i α_j y_i y_j x_i^T x_j + C Σ_{i=1}^n ξ_i
        - Σ_{i=1}^n α_i [y_i (Σ_{j=1}^n α_j y_j x_j^T x_i + b) - 1 + ξ_i] - Σ_{i=1}^n μ_i ξ_i
• With  Σ_{i=1}^n y_i α_i = 0  and  C - α_j - μ_j = 0, this simplifies to
    L = Σ_{i=1}^n α_i - ½ Σ_{i=1}^n Σ_{j=1}^n α_i α_j y_i y_j x_i^T x_j
The Optimization Problem
• The dual of this new constrained optimization problem is
    maximize  W(α) = Σ_{i=1}^n α_i - ½ Σ_{i=1}^n Σ_{j=1}^n α_i α_j y_i y_j x_i^T x_j
    subject to  0 ≤ α_i ≤ C,  Σ_{i=1}^n α_i y_i = 0
• The new constraint α_j ≤ C derives from C - α_j - μ_j = 0, since μ_j and α_j are non-negative
• w is recovered as  w = Σ_{i=1}^n α_i y_i x_i
• The quantity being minimized in the soft-margin primal is  ½ ||w||^2 + C Σ_{i=1}^n ξ_i
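In practice the soft-margin problem is rarely solved by hand; the following is a hedged sketch using scikit-learn (an assumption, not mentioned in the slides) showing C playing exactly the role above: a small C tolerates more margin violations, a large C penalizes them more heavily. The synthetic overlapping data are made up.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two overlapping Gaussian blobs: not perfectly separable, so slack variables are needed.
X = np.vstack([rng.normal(-1, 1.2, (50, 2)), rng.normal(1, 1.2, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel='linear', C=C).fit(X, y)
    # More support vectors usually means more points on or inside the (softer) margin.
    print(C, 'support vectors:', len(clf.support_))
```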
Extension to Non-linear Decision Boundary
• So far, we have only considered large-margin classifiers with a linear decision boundary
• How to generalize it to become nonlinear?
• Key idea: transform x_i to a higher dimensional space to "make life easier"
  – Input space: the space where the points x_i are located
  – Feature space: the space of φ(x_i) after transformation
• Why transform?
  – A linear operation in the feature space is equivalent to a non-linear operation in input space
  – Classification can become easier with a proper transformation. In the XOR problem, for example, adding a new feature x_1 x_2 makes the problem linearly separable (see the tables and the sketch below)
XOR is not linearly separable:
    X  Y | X XOR Y
    0  0 |    0
    0  1 |    1
    1  0 |    1
    1  1 |    0
With the extra feature XY it is linearly separable:
    X  Y  XY | X XOR Y
    0  0  0  |    0
    0  1  0  |    1
    1  0  0  |    1
    1  1  1  |    0
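A small numeric check of the XOR claim: with only (x_1, x_2) no linear rule separates the classes, but after adding the product feature x_1 x_2 the hand-picked (hypothetical) weights below classify all four points correctly.

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
xor = np.array([0, 1, 1, 0])

# Augment each point with the product feature x1*x2.
X_aug = np.column_stack([X, X[:, 0] * X[:, 1]])

# Hand-picked linear rule in the augmented space:
# score = x1 + x2 - 2*x1*x2, which is 0 for class 0 and 1 for class 1.
w = np.array([1.0, 1.0, -2.0])
scores = X_aug @ w
print(scores)                       # [0. 1. 1. 0.]
print((scores > 0.5).astype(int))   # [0 1 1 0] == xor
```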
Find a feature space
Transforming the Data
[Figure: data points in input space mapped by φ(·) into the feature space, where they become linearly separable]
Modification Due to Kernel Function
• Change all inner products to kernel functions
• For training,
    Original:              max. W(α) = Σ_{i=1}^n α_i - ½ Σ_{i=1}^n Σ_{j=1}^n α_i α_j y_i y_j x_i^T x_j
    With kernel function:  max. W(α) = Σ_{i=1}^n α_i - ½ Σ_{i=1}^n Σ_{j=1}^n α_i α_j y_i y_j K(x_i, x_j)
Modification Due to Kernel Function
• For testing, the new data z is classified as class 1 if f ≥ 0, and as class 2 if f < 0
    Original:              f = Σ_{i=1}^n α_i y_i x_i^T z + b
    With kernel function:  f = Σ_{i=1}^n α_i y_i K(x_i, z) + b
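A sketch of what actually changes in code: the Gram matrix of K(x_i, x_j) replaces the matrix of inner products x_i^T x_j in training, and prediction sums α_i y_i K(x_i, z) + b. The kernel used here (the degree-2 polynomial from the example that follows) and the α values in the demo call are only illustrative assumptions.

```python
import numpy as np

def kernel(u, v):
    # Degree-2 polynomial kernel (xy + 1)^2 for 1-D inputs; any valid kernel could be used instead.
    return (u * v + 1) ** 2

def gram_matrix(xs):
    """K[i, j] = K(x_i, x_j): this matrix replaces x_i^T x_j in the dual objective."""
    return np.array([[kernel(a, b) for b in xs] for a in xs])

def decision(z, xs, ys, alphas, b):
    """f(z) = sum_i alpha_i y_i K(x_i, z) + b; classify z as class 1 if f(z) >= 0."""
    return sum(a * y * kernel(x, z) for a, y, x in zip(alphas, ys, xs)) + b

xs = np.array([1.0, 2.0, 4.0])
print(gram_matrix(xs))
print(decision(3.0, xs, [1, 1, -1], [0.5, 0.0, 0.5], b=0.0))   # made-up alphas, just to show the call
```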
More on Kernel Functions
• Since the training of SVM only requires the value of K(x_i, x_j), there is no restriction on the form of x_i and x_j
  – x_i can be a sequence or a tree, instead of a feature vector
Example
• Suppose we have 5 one-dimensional data points
  – x_1=1, x_2=2, x_3=4, x_4=5, x_5=6, with 1, 2, 6 as class 1 and 4, 5 as class 2, i.e. y_1=1, y_2=1, y_3=-1, y_4=-1, y_5=1
Example
• We use the polynomial kernel of degree 2
  – K(x, y) = (xy + 1)^2
  – C is set to 100
Example
• By using a QP solver, we get
  – α_1=0, α_2=2.5, α_3=0, α_4=7.333, α_5=4.833
  – Note that the constraints are indeed satisfied: 0 ≤ α_i ≤ 100 and Σ_i α_i y_i = 2.5 - 7.333 + 4.833 = 0
  – The support vectors are {x_2=2, x_4=5, x_5=6}
• The discriminant function is
    f(z) = 2.5 (1)(2z+1)^2 + 7.333 (-1)(5z+1)^2 + 4.833 (1)(6z+1)^2 + b
         = 0.6667 z^2 - 5.333 z + b
  with b = 9 obtained from a support vector, e.g. by requiring f(2) = 1, so f(z) = 0.6667 z^2 - 5.333 z + 9 (verified numerically below)
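A short verification of the numbers above: plugging the reported α_i into f(z) = Σ_i α_i y_i K(x_i, z) + b, with b fixed by requiring f(2) = 1 at the support vector x_2, gives b ≈ 9 and a sign pattern that matches the class labels (small deviations come from the α_i being reported to only a few digits).

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 5.0, 6.0])
y = np.array([1, 1, -1, -1, 1])
alpha = np.array([0.0, 2.5, 0.0, 7.333, 4.833])   # values reported by the QP solver

K = lambda u, v: (u * v + 1) ** 2                 # polynomial kernel of degree 2

def f(z, b):
    # Discriminant f(z) = sum_i alpha_i y_i K(x_i, z) + b
    return float(np.sum(alpha * y * K(x, z)) + b)

b = 1 - f(2.0, 0.0)                    # require f(2) = 1 at the support vector x_2
print(round(b, 2))                     # approximately 9
print([round(f(z, b), 2) for z in x])  # positive at 1, 2, 6 (class 1), negative at 4, 5 (class 2)
print(round(float(np.sum(alpha * y)), 3))   # constraint sum_i alpha_i y_i = 0
```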
Example
[Figure: the value of the discriminant function f(z) plotted over the points 1, 2, 4, 5, 6; it is positive at the class-1 points (1, 2, 6) and negative at the class-2 points (4, 5)]
Kernel Functions
• In practical use of SVM, the user specifies the kernel function; the transformation φ(·) is not explicitly stated
• Given a kernel function K(x_i, x_j), the transformation φ(·) is given by its eigenfunctions (a concept in functional analysis)
  – Eigenfunctions can be difficult to construct explicitly
  – This is why people only specify the kernel function without worrying about the exact transformation
• Another view: a kernel function, being an inner product, is really a similarity measure between the objects
A kernel is associated to a transformation
– Given a kernel, in principle one can recover the transformation into the feature space that originates it.
– For example, the one-dimensional degree-2 polynomial kernel k(x, y) = (xy + 1)^2 corresponds to the transformation φ(x) = (x^2, √2 x, 1)^T, since φ(x)^T φ(y) = x^2 y^2 + 2xy + 1 = (xy + 1)^2.
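A two-line numerical check of the identity above, i.e. that the explicit map φ(x) = (x^2, √2 x, 1) reproduces the kernel (xy + 1)^2 as an inner product; the sample values of x and y are arbitrary.

```python
import numpy as np

phi = lambda x: np.array([x**2, np.sqrt(2) * x, 1.0])   # explicit feature map
k   = lambda x, y: (x * y + 1) ** 2                      # degree-2 polynomial kernel

x, y = 1.7, -0.4
print(phi(x) @ phi(y), k(x, y))   # both print the same value (0.1024)
```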
Examples of Kernel Functions
• Polynomial kernel up to degree d:  K(x, y) = (x^T y + 1)^d
Building new kernels
• If k1(x, y) and k2(x, y) are two valid kernels then the following kernels are valid
  – Linear combination (with c1, c2 ≥ 0):  k(x, y) = c1 k1(x, y) + c2 k2(x, y)
  – Exponential:  k(x, y) = exp(k1(x, y))
  – Product:  k(x, y) = k1(x, y) k2(x, y)
  – Polynomial transformation (Q: polynomial with non-negative coefficients):  k(x, y) = Q(k1(x, y))
  – Function product (f: any function):  k(x, y) = f(x) k1(x, y) f(y)
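A sketch that applies these closure rules to concrete kernels and sanity-checks each combination by confirming that its Gram matrix on a few sample points has no negative eigenvalues (a necessary property of a valid kernel). The second base kernel (Gaussian), the function f and the sample points are arbitrary choices, not from the slides.

```python
import numpy as np

k1 = lambda x, y: (x * y + 1) ** 2          # base kernel 1: polynomial, degree 2
k2 = lambda x, y: np.exp(-(x - y) ** 2)     # base kernel 2: Gaussian (also a valid kernel)
f  = lambda x: 1.0 + x ** 2                 # arbitrary function for the "f(x) k(x,y) f(y)" rule

combos = {
    'linear combination': lambda x, y: 2.0 * k1(x, y) + 0.5 * k2(x, y),
    'exponential':        lambda x, y: np.exp(k1(x, y)),
    'product':            lambda x, y: k1(x, y) * k2(x, y),
    'polynomial Q(k1)':   lambda x, y: 3.0 * k1(x, y) ** 2 + k1(x, y) + 1.0,
    'f(x) k1 f(y)':       lambda x, y: f(x) * k1(x, y) * f(y),
}

pts = np.array([-1.0, 0.3, 0.8, 2.0])
for name, k in combos.items():
    G = np.array([[k(a, b) for b in pts] for a in pts])   # Gram matrix on the sample points
    # A valid kernel gives a positive semi-definite Gram matrix: min eigenvalue >= 0 (up to rounding).
    print(name, 'min eigenvalue:', round(float(np.linalg.eigvalsh(G).min()), 6))
```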
Polynomial kernel
Resources
• http://www.kernel-machines.org/
• http://www.support-vector.net/
• http://www.support-vector.net/icml-tutorial.pdf
• http://www.kernel-machines.org/papers/tutorial-nips.ps.gz
• http://www.clopinet.com/isabelle/Projects/SVM/applist.html