
A Simple Introduction to Support Vector Machines

Adapted from various authors
by Mario Martin

Outline

• Large-margin linear classifier
  - Linearly separable case
  - Non-linearly separable case
• Creating nonlinear classifiers: the kernel trick
• Transduction
• Discussion on SVM
• Conclusion

History of SVM

• SVM is related to statistical learning theory [3]
• Introduced by Vapnik; first presented in 1992
• SVM became popular because of its success on a lot of classification problems

SVM: Large-margin linear classifier

Perceptron Revisited: Linear Separators

• Binary classification can be viewed as the task of separating classes in feature space:

    wᵀx + b = 0   (separating hyperplane)
    wᵀx + b > 0   (one class)
    wᵀx + b < 0   (the other class)

    f(x) = sign(wᵀx + b)

Linear Separators

• Which of the linear separators is optimal?

What is a good Decision Boundary?

• Consider a two-class, linearly separable classification problem
• Many decision boundaries are possible!
• The Perceptron algorithm can be used to find such a boundary
• Other algorithms have also been proposed
• Are all decision boundaries equally good?

Examples of Bad Decision Boundaries

• [Figures: separating hyperplanes that pass very close to the Class 1 or Class 2 points]

Maximum Margin Classification

• Maximizing the distance to the examples is good according to intuition and PAC theory.
• Examples closest to the hyperplane are support vectors.
• This implies that only a few vectors matter; the other training examples are ignorable.

Classification Margin

• Distance from an example x_i to the separator is r = |wᵀx_i + b| / ||w||
• Margin ρ of the separator is the distance between the support vectors of the two classes.

Large-margin Decision Boundary

• The decision boundary should be as far away from the data of both classes as possible: we should maximize the margin m
• We normalize the equations so that the decision function takes the values +1/−1 on the support vectors; the distance from a support vector to the boundary is then r = |wᵀx_i + b| / ||w|| = 1/||w||, so the margin is m = 2/||w||

Finding the Decision Boundary

• Let {x_1, ..., x_n} be our data set and let y_i ∈ {1, −1} be the class label of x_i
• The decision boundary should classify all points correctly
• Maximizing the margin while classifying all points correctly leads to the constraints

    y_i (wᵀx_i + b) ≥ 1   for all i

Finding the Decision Boundary

• Primal formulation:

    minimize   ½ wᵀw
    subject to y_i (wᵀx_i + b) ≥ 1,  i = 1, ..., n

• We can solve this problem using this formulation, or using the dual formulation…

[Recap of Constrained Optimization]

• Suppose we want to minimize f(x) subject to g(x) = 0
• A necessary condition for x_0 to be a solution:

    ∂/∂x [ f(x) + α g(x) ] = 0 at x = x_0,   α: the Lagrange multiplier

• For multiple constraints g_i(x) = 0, i = 1, …, m, we need a Lagrange multiplier α_i for each of the constraints
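As a concrete illustration of the primal formulation, here is a minimal sketch that solves the hard-margin problem on a toy 2-D data set with the cvxpy modelling library (cvxpy and the data are assumptions of this sketch, not part of the original slides):

# Hard-margin primal SVM as a small convex program (a sketch, assuming cvxpy).
import cvxpy as cp
import numpy as np

# Toy linearly separable data: x_i in R^2, labels y_i in {+1, -1}.
X = np.array([[2.0, 2.0], [2.5, 3.0], [3.0, 2.5],   # class +1
              [0.0, 0.0], [0.5, 1.0], [1.0, 0.2]])  # class -1
y = np.array([1, 1, 1, -1, -1, -1])

n, d = X.shape
w = cp.Variable(d)
b = cp.Variable()

# minimize (1/2) w^T w  subject to  y_i (w^T x_i + b) >= 1
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print("w =", w.value, "b =", b.value)
print("margin width = 2/||w|| =", 2 / np.linalg.norm(w.value))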

[Recap of Constrained Optimization]

• The case of an inequality constraint g_i(x) ≤ 0 is similar, except that the Lagrange multiplier α_i should be positive
• If x_0 is a solution to the constrained optimization problem

    minimize f(x) subject to g_i(x) ≤ 0, i = 1, …, m

  there must exist α_i ≥ 0 for i = 1, …, m such that x_0 satisfies

    ∂/∂x [ f(x) + Σ_i α_i g_i(x) ] = 0 at x = x_0

• The function f(x) + Σ_i α_i g_i(x) is also known as the Lagrangian; we want to set its gradient to 0

Back to the Original Problem

• The Lagrangian of the primal problem is

    L = ½ wᵀw − Σ_i α_i [ y_i (wᵀx_i + b) − 1 ]

• Note that ||w||² = wᵀw
• Setting the gradient of L w.r.t. w and b to zero, we have

    w = Σ_i α_i y_i x_i   and   Σ_i α_i y_i = 0

The Dual Formulation

• If we substitute w = Σ_i α_i y_i x_i into L, we have

    W(α) = Σ_i α_i − ½ Σ_i Σ_j α_i α_j y_i y_j x_iᵀx_j

• Remember that Σ_i α_i y_i = 0 (the result of differentiating the original Lagrangian w.r.t. b)
• This is a function of the α_i only

The Dual Formulation

• This is known as the dual problem (the original problem is known as the primal problem): if we know w, we know all the α_i; if we know all the α_i, we know w
• The objective function of the dual problem needs to be maximized!
• The dual problem is therefore:

    maximize   W(α) = Σ_i α_i − ½ Σ_i Σ_j α_i α_j y_i y_j x_iᵀx_j
    subject to α_i ≥ 0           (property of the Lagrange multipliers)
               Σ_i α_i y_i = 0   (from differentiating the original Lagrangian w.r.t. b)

The Dual Problem

• This is a quadratic programming (QP) problem
  - A global maximum of the α_i can always be found
• w can be recovered by w = Σ_i α_i y_i x_i

A Geometrical Interpretation

• [Figure: only the points on the margin receive non-zero multipliers, e.g. α_1 = 0.8, α_6 = 1.4, α_8 = 0.6; all other points have α_i = 0 (α_2 = α_3 = α_4 = α_5 = α_7 = α_9 = α_10 = 0)]

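The following sketch solves this dual QP for the same toy data as in the primal sketch, again with cvxpy (an assumption), and then recovers w and b from the multipliers:

# Dual of the hard-margin SVM (a sketch, assuming cvxpy).
import cvxpy as cp
import numpy as np

X = np.array([[2.0, 2.0], [2.5, 3.0], [3.0, 2.5],
              [0.0, 0.0], [0.5, 1.0], [1.0, 0.2]], dtype=float)
y = np.array([1, 1, 1, -1, -1, -1], dtype=float)
n = len(y)

alpha = cp.Variable(n)
# (1/2) sum_ij alpha_i alpha_j y_i y_j x_i.x_j  ==  (1/2) || sum_i alpha_i y_i x_i ||^2
objective = cp.Maximize(cp.sum(alpha)
                        - 0.5 * cp.sum_squares(X.T @ cp.multiply(alpha, y)))
constraints = [alpha >= 0, cp.sum(cp.multiply(y, alpha)) == 0]
cp.Problem(objective, constraints).solve()

a = alpha.value
w = (a * y) @ X                      # w = sum_i alpha_i y_i x_i
sv = int(np.argmax(a))               # index of one support vector (largest alpha)
b = y[sv] - w @ X[sv]                # b from y_sv (w.x_sv + b) = 1
print("support vector indices:", np.where(a > 1e-6)[0])
print("w =", w, "b =", b)
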
Characteristics of the Solution

• Many of the α_i are zero
  - w is a linear combination of a small number of data points
  - This "sparse" representation can be viewed as data compression
• x_i with non-zero α_i are called support vectors (SV)
  - The decision boundary is determined only by the SVs
  - Let t_j (j = 1, ..., s) be the indices of the s support vectors. We can write w = Σ_{j=1..s} α_{t_j} y_{t_j} x_{t_j}

Characteristics of the Solution

• For testing with a new data point z
  - Compute wᵀz + b = Σ_{j=1..s} α_{t_j} y_{t_j} (x_{t_j}ᵀ z) + b
  - Classify z as class 1 if the sum is positive, and class 2 otherwise
• Note: w need not be formed explicitly
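A small sketch of this test-time computation, reusing X, y and the multipliers a and offset b from the dual sketch above (illustrative code, not from the slides):

# Classify a new point using only the support vectors; w is never formed.
import numpy as np

def decision(z, X, y, a, b, tol=1e-6):
    """f(z) = sum over support vectors of alpha_j * y_j * <x_j, z> + b."""
    sv = a > tol                                  # mask of the support vectors
    return np.sum(a[sv] * y[sv] * (X[sv] @ z)) + b

z = np.array([1.5, 1.5])
print(1 if decision(z, X, y, a, b) > 0 else -1)   # class 1 / class 2 decision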

The Quadratic Programming Problem

• Many approaches have been proposed
  - LOQO, CPLEX, etc. (see http://www.numerical.rl.ac.uk/qp/qp.html)
• Most are "interior-point" methods
  - Start with an initial solution that can violate the constraints
  - Improve this solution by optimizing the objective function and/or reducing the amount of constraint violation
• For SVM, sequential minimal optimization (SMO) seems to be the most popular
  - A QP with two variables is trivial to solve
  - Each iteration of SMO picks a pair (α_i, α_j) and solves the QP with these two variables; repeat until convergence
• In practice, we can just regard the QP solver as a "black box" without bothering about how it works

SVM: Non-Separable Sets

• Sometimes, we do not want to separate perfectly.
• [Figure: the hyperplane is forced too close to one class by a single point that may not be so important]

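As an example of treating the QP/SMO solver as a black box, here is a sketch using scikit-learn's SVC, which wraps LIBSVM (scikit-learn and the data are assumptions of this sketch):

# Black-box training: LIBSVM/SMO hidden behind a library call.
from sklearn.svm import SVC
import numpy as np

X = np.array([[2.0, 2.0], [2.5, 3.0], [3.0, 2.5],
              [0.0, 0.0], [0.5, 1.0], [1.0, 0.2]])
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6)       # very large C ~ hard margin
clf.fit(X, y)

print("support vectors:\n", clf.support_vectors_)
print("alpha_i * y_i   :", clf.dual_coef_)   # non-zero only for the SVs
print("w =", clf.coef_, " b =", clf.intercept_)
print("prediction for [1.5, 1.5]:", clf.predict([[1.5, 1.5]]))
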
SVM: Non-Separable Sets

• Sometimes, we do not want to separate perfectly.
• [Figure: if we ignore the problematic point, the resulting hyperplane is nicer]

Soft Margin Classification

• What if the training set is not linearly separable?
• Slack variables ξ_i can be added to allow misclassification of difficult or noisy examples; the resulting margin is called a soft margin.

Non-linearly Separable Problems

• We allow an "error" ξ_i in classification; it is based on the output of the discriminant function wᵀx + b
• ξ_i approximates the number of misclassified samples
• [Figure: points inside the margin or on the wrong side of the boundary have non-zero slack ξ_i]

Soft Margin Hyperplane

• If we minimize Σ_i ξ_i, each ξ_i can be computed by

    ξ_i = max(0, 1 − y_i (wᵀx_i + b))

• The ξ_i are "slack variables" in the optimization
• Note that ξ_i = 0 if there is no error for x_i
• The number of slack points plus support vectors is an upper bound on the number of (leave-one-out) errors

Soft Margin Hyperplane

• We want to minimize

    ½ wᵀw + C Σ_i ξ_i

• C: trade-off parameter between error and margin
• The optimization problem becomes

    minimize   ½ wᵀw + C Σ_i ξ_i
    subject to y_i (wᵀx_i + b) ≥ 1 − ξ_i,  ξ_i ≥ 0
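A minimal sketch of this soft-margin primal, again in cvxpy (an assumption); one outlier makes the toy data non-separable:

# Soft-margin primal with explicit slack variables (a sketch, assuming cvxpy).
import cvxpy as cp
import numpy as np

X = np.array([[2.0, 2.0], [2.5, 3.0], [3.0, 2.5], [0.4, 0.4],   # last +1 point is an outlier
              [0.0, 0.0], [0.5, 1.0], [1.0, 0.2]])
y = np.array([1, 1, 1, 1, -1, -1, -1])
n, d = X.shape
C = 1.0                                   # trade-off between margin and training error

w, b, xi = cp.Variable(d), cp.Variable(), cp.Variable(n)
objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
constraints = [cp.multiply(y, X @ w + b) >= 1 - xi, xi >= 0]
cp.Problem(objective, constraints).solve()

print("slacks xi:", np.round(xi.value, 3))   # non-zero only for margin violations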

The Optimization Problem

• The dual of this new constrained optimization problem is

    maximize   W(α) = Σ_i α_i − ½ Σ_i Σ_j α_i α_j y_i y_j x_iᵀx_j
    subject to C ≥ α_i ≥ 0,  Σ_i α_i y_i = 0

• w is recovered as w = Σ_i α_i y_i x_i
• This is very similar to the optimization problem in the linearly separable case, except that there is now an upper bound C on the α_i
• Once again, a QP solver can be used to find the α_i

Non-linearly Separable Problems

• [Figure: points outside the margin have α_i = 0 (e.g. α_1 = 0), points on the margin have α_i ≤ C (e.g. α_2 ≤ C), and points violating the margin have α_i = C (e.g. α_3 = C)]

SVM with KERNELS: Large-margin NON-linear classifiers

Extension to Non-linear Decision Boundary

• So far, we have only considered large-margin classifiers with a linear decision boundary
• How can we generalize them to become nonlinear?
• Key idea: transform x_i to a higher-dimensional space to "make life easier"
  - Input space: the space where the points x_i are located
  - Feature space: the space of the φ(x_i) after transformation
• Why transform?
  - A linear operation in the feature space is equivalent to a non-linear operation in the input space
  - Classification can become easier with a proper transformation. In the XOR problem, for example, adding a new feature x_1·x_2 makes the problem linearly separable (as sketched below)

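A minimal sketch of that XOR remark (plain numpy; the particular separating hyperplane below is an illustrative choice, not from the slides):

# XOR becomes linearly separable once the feature x1*x2 is added.
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])                  # XOR labels

phi = np.c_[X, X[:, 0] * X[:, 1]]             # phi(x) = (x1, x2, x1*x2)
w, b = np.array([1.0, 1.0, -2.0]), -0.5       # one hyperplane that separates the mapped points
print(np.sign(phi @ w + b))                   # matches y: [-1.  1.  1. -1.]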

Transforming the Data

• [Figure: the map φ(·) sends every point from the input space to the feature space]
• Note: the feature space is of higher dimension than the input space in practice
• Computation in the feature space can be costly because it is high-dimensional
  - The feature space is typically infinite-dimensional!
• The kernel trick comes to the rescue

Non-linear SVMs: Feature spaces

• General idea: the original feature space can always be mapped to some higher-dimensional feature space where the training set is separable:

    Φ: x → φ(x)

The Kernel Trick

• Recall the SVM optimization problem
• The data points only appear as inner products
• As long as we can calculate the inner product in the feature space, we do not need the mapping explicitly
• Many common geometric operations (angles, distances) can be expressed by inner products
• Define the kernel function K by K(x_i, x_j) = φ(x_i)ᵀφ(x_j)

SVMs with kernels

• Training:

    maximize   Σ_{i=1..l} α_i − ½ Σ_{i=1..l} Σ_{j=1..l} α_i α_j y_i y_j K(x_i, x_j)
    subject to Σ_{i=1..l} α_i y_i = 0  and  C ≥ α_i ≥ 0

• Classification of x:

    h(x) = sign( Σ_{i=1..l} α_i y_i K(x_i, x) + b )

An Example for φ(·) and K(·,·)

• Suppose φ(·) is given as follows (for x = (x_1, x_2)):

    φ(x) = (1, √2 x_1, √2 x_2, x_1², x_2², √2 x_1 x_2)

• An inner product in the feature space is

    φ(x)ᵀφ(z) = (1 + x_1 z_1 + x_2 z_2)²

• So, if we define the kernel function as K(x, z) = (1 + x_1 z_1 + x_2 z_2)², there is no need to carry out φ(·) explicitly
• This use of a kernel function to avoid carrying out φ(·) explicitly is known as the kernel trick

Kernel Functions

• Kernel (Gram) matrix:

    K = [ K(x_i, x_j) ],  i, j = 1, ..., l

• The matrix is obtained from the product K = Φ′Φ, where Φ is the matrix whose columns are the φ(x_i)

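A quick numeric check of the example above, assuming the degree-2 map written there (illustrative values, not from the slides):

# Verify that the kernel (1 + x.z)^2 equals the explicit inner product phi(x).phi(z).
import numpy as np

def phi(v):
    x1, x2 = v
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

def K(x, z):
    return (1.0 + x @ z) ** 2

x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(phi(x) @ phi(z), K(x, z))   # both print 4.0
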
Kernel Functions

• Any function K(x, z) that creates a symmetric, positive definite matrix K_ij = K(x_i, x_j) is a valid kernel (an inner product in some space)
• Why? Because any symmetric positive definite matrix M can be decomposed as M = N′N, so N can be seen as the projection to the feature space

Kernel Functions

• Another view: a kernel function, being an inner product, is really a similarity measure between the objects
• Not all similarity measures are allowed – they must satisfy the Mercer conditions
• Any distance measure can be translated into a kernel
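A quick sanity check of this property (a sketch, not a proof): a candidate kernel should give a symmetric positive semi-definite Gram matrix on any sample of points.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))

def rbf(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

G = np.array([[rbf(a, b) for b in X] for a in X])
print(np.allclose(G, G.T))                    # symmetric
print(np.linalg.eigvalsh(G).min() >= -1e-10)  # eigenvalues >= 0 (up to round-off)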

Examples of Kernel Functions

• Polynomial kernel with degree d:

    K(x, z) = (xᵀz + 1)^d

• Radial basis function (RBF) kernel with width σ:

    K(x, z) = exp(−||x − z||² / (2σ²))

  - Closely related to radial basis function neural networks
  - The feature space is infinite-dimensional
• Sigmoid kernel with parameters κ and θ:

    K(x, z) = tanh(κ xᵀz + θ)

  - It does not satisfy the Mercer condition for all κ and θ

Modification Due to Kernel Function

• Change all inner products to kernel functions
• For training:
  - Original:              maximize Σ_i α_i − ½ Σ_i Σ_j α_i α_j y_i y_j x_iᵀx_j
  - With kernel function:  maximize Σ_i α_i − ½ Σ_i Σ_j α_i α_j y_i y_j K(x_i, x_j)
  (in both cases subject to C ≥ α_i ≥ 0 and Σ_i α_i y_i = 0)

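A sketch of the three kernels listed above as plain functions, vectorised over two sets of points (the names and parameterisation are illustrative choices):

import numpy as np

def polynomial_kernel(X, Z, d=3):
    return (X @ Z.T + 1.0) ** d

def rbf_kernel(X, Z, sigma=1.0):
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def sigmoid_kernel(X, Z, kappa=1.0, theta=0.0):
    # Not positive semi-definite for every (kappa, theta), as noted above.
    return np.tanh(kappa * (X @ Z.T) + theta)
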
Modification Due to Kernel Function

• For testing, the new data point z is classified as class 1 if f ≥ 0, and as class 2 if f < 0
  - Original:              f(z) = wᵀz + b = Σ_i α_i y_i x_iᵀz + b
  - With kernel function:  f(z) = Σ_i α_i y_i K(x_i, z) + b

More on Kernel Functions

• Since the training of an SVM only requires the values K(x_i, x_j), there is no restriction on the form of x_i and x_j
  - x_i can be a sequence or a tree, instead of a feature vector
• K(x_i, x_j) is just a similarity measure comparing x_i and x_j
• For a test object z, the discriminant function is essentially a weighted sum of the similarities between z and a pre-selected set of objects (the support vectors)
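As a concrete illustration of a kernelised, non-linear decision boundary, here is a sketch using the RBF kernel in scikit-learn (scikit-learn and the toy data set are assumptions; note scikit-learn parameterises the RBF width as gamma = 1/(2σ²)):

from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.15, random_state=0)
clf = SVC(kernel="rbf", C=1.0, gamma=2.0).fit(X, y)
print("training accuracy:", clf.score(X, y))
print("support vectors per class:", clf.n_support_)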

More on Kernel Functions

• Not every similarity measure can be used as a kernel function, however
  - The kernel function needs to satisfy the Mercer condition, i.e., the function must be "positive definite"
  - This implies that the n by n kernel matrix, in which the (i,j)-th entry is K(x_i, x_j), is always positive definite
  - This also means that the QP is convex and can be solved in polynomial time

Choosing the Kernel Function

• Probably the trickiest part of using an SVM.
• The kernel function is important because it creates the kernel matrix, which summarizes all the data
• Many principles have been proposed (diffusion kernel, Fisher kernel, string kernel, …)
• There is even research on estimating the kernel matrix from available information
• In practice, a low-degree polynomial kernel or an RBF kernel with a reasonable width is a good initial try
• Note that an SVM with an RBF kernel is closely related to RBF neural networks, with the centers of the radial basis functions chosen automatically by the SVM

Other Aspects of SVM

• How to use SVM for multi-class classification?
  - One can change the QP formulation to become multi-class
  - More often, multiple binary classifiers are combined
  - One can train multiple one-versus-all classifiers, or combine multiple pairwise classifiers "intelligently"
• How to interpret the SVM discriminant function value as a probability?
  - By performing logistic regression on the SVM outputs for a set of data (a validation set) that is not used for training
• Some SVM software (like LIBSVM) has these features built in (see the sketch below)

Software

• A list of SVM implementations can be found at http://www.kernel-machines.org/software.html
• Some implementations (such as LIBSVM) can handle multi-class classification
• SVMLight is among the earliest implementations of SVM
• Several Matlab toolboxes for SVM are also available
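A sketch of both features via scikit-learn's SVC, which wraps LIBSVM (an assumption of this sketch): pairwise binary classifiers are combined internally for multi-class problems, and probability=True fits a logistic (Platt-style) model on internally held-out folds.

from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)         # 3 classes
clf = SVC(kernel="rbf", C=1.0, gamma="scale",
          decision_function_shape="ovr",  # combine pairwise votes into one-vs-rest scores
          probability=True, random_state=0).fit(X, y)

print(clf.predict(X[:3]))                 # class labels
print(clf.predict_proba(X[:3]))           # calibrated class probabilities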

Summary: Steps for Classification

• Prepare the pattern matrix
• Select the kernel function to use
• Select the parameters of the kernel function and the value of C
  - You can use the values suggested by the SVM software, or you can set apart a validation set to determine the values of the parameters
• Execute the training algorithm and obtain the α_i
• Unseen data can be classified using the α_i and the support vectors
• (An end-to-end sketch of these steps follows the next slide)

Strengths and Weaknesses of SVM

• Strengths
  - Training is relatively easy
  - No local optima, unlike in neural networks
  - It scales relatively well to high-dimensional data
  - The trade-off between classifier complexity and error can be controlled explicitly
  - Non-traditional data like strings and trees can be used as input to an SVM, instead of feature vectors
• Weaknesses
  - Need to choose a "good" kernel function.

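An end-to-end sketch of the steps listed above: choose a kernel, tune its parameter and C on held-out data, train, then classify unseen data (scikit-learn assumed; the data set and parameter grid are illustrative):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Kernel-parameter and C selection via cross-validation on the training split
grid = GridSearchCV(SVC(kernel="rbf"),
                    param_grid={"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1]},
                    cv=5)
grid.fit(X_train, y_train)              # training (the alpha_i are found internally)

print("best parameters:", grid.best_params_)
print("test accuracy  :", grid.best_estimator_.score(X_test, y_test))
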
Other Types of Kernel Methods

• A lesson learnt from SVM: a linear algorithm in the feature space is equivalent to a non-linear algorithm in the input space
• Standard linear algorithms can be generalized to non-linear versions by going to the feature space
  - Kernel principal component analysis, kernel independent component analysis, kernel canonical correlation analysis, kernel k-means and 1-class SVM are some examples

Conclusion

• SVM is a useful alternative to neural networks
• Two key concepts of SVM: maximize the margin and the kernel trick
• Many SVM implementations are available on the web for you to try on your data set!

Examples: Toy Examples

• All examples have been run with the 2D graphic interface of LIBSVM (Chang and Lin, National Taiwan University)
• "LIBSVM is an integrated software for support vector classification (C-SVC, nu-SVC), regression (epsilon-SVR, nu-SVR) and distribution estimation (one-class SVM). It supports multi-class classification. The basic algorithm is a simplification of both SMO by Platt and SVMLight by Joachims. It is also a simplification of the modification 2 of SMO by Keerthi et al. Our goal is to help users from other fields to easily use SVM as a tool. LIBSVM provides a simple interface where users can easily link it with their own programs…"
• Available from: www.csie.ntu.edu.tw/~cjlin/libsvm (it includes a Web-integrated demo tool)

Examples: Toy Examples (I)

• Linearly separable data set
• Linear SVM: maximal margin hyperplane
• [Figure] What happens if we add a blue training example close to the boundary?

Examples: Toy Examples (I)

• (still) Linearly separable data set
• Linear SVM, high value of the C parameter
• Maximal margin hyperplane
• [Figure] The added example is correctly classified

Examples: Toy Examples (I)

• (still) Linearly separable data set
• Linear SVM, low value of the C parameter
• Trade-off between margin and training error
• [Figure] The added example is now a bounded SV
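A rough sketch of the same experiment with scikit-learn instead of the LIBSVM demo GUI (the data set is an illustrative stand-in for the one in the figures): compare a high and a low value of C and inspect the margin width and whether the extra point near the boundary ends up as a support vector.

import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [2.5, 3.0], [3.0, 2.5],      # class +1
              [0.0, 0.0], [0.5, 1.0], [1.0, 0.2],      # class -1
              [1.3, 1.2]])                             # extra point added near the boundary
y = np.array([1, 1, 1, -1, -1, -1, -1])

for C in (100.0, 0.01):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C}: margin width = {2 / np.linalg.norm(clf.coef_):.2f}, "
          f"extra point is a support vector: {6 in clf.support_}")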

Examples: Toy Examples (I)

• [Figures only: four further toy-example slides]

Resources

• http://www.kernel-machines.org/
• http://www.support-vector.net/
• http://www.support-vector.net/icml-tutorial.pdf
• http://www.kernel-machines.org/papers/tutorial-nips.ps.gz
• http://www.clopinet.com/isabelle/Projects/SVM/applist.html

Transduction with SVMs

The Learning Problem

• Transduction:
  - We consider a phenomenon f that maps inputs (instances) x to outputs (labels) y = f(x), with y ∈ {−1, 1}
  - Given a set of labeled examples {(x_i, y_i) : i = 1, …, n} and a set of unlabeled examples x′_1, …, x′_m,
  - the goal is to find the labels y′_1, …, y′_m
• There is no need to construct a function f; the output of the transduction algorithm is a vector of labels.

Transduction Based on Margin Size

• Binary classification, linear parameterization, joint set of (training + working) samples
• Two objectives of transductive learning:
  - (TL1) separate the labeled training data using a large-margin hyperplane (as in the standard inductive SVM)
  - (TL2) separate (explain) the working data set using a large-margin hyperplane.

Transductive SVMs

• Transductive instead of inductive (Vapnik, 1998)
• TSVMs take into account a particular test set and try to minimize misclassifications of just those particular examples
• Formal setting:

    S_train = {(x_1, y_1), (x_2, y_2), …, (x_n, y_n)}
    S_test  = {x*_1, x*_2, …, x*_k}   (normally k ≫ n)

• Goal of the transductive learner L: find a function h_L = L(S_train, S_test) so that the expected number of erroneous predictions on the test examples is minimized

Induction vs Transduction

• [Figure comparing the inductive and the transductive setting]

Optimization Formulation for SVM Transduction

• Given: the joint set of (training + working) samples
• Denote the slack variables ξ_i for the training samples and ξ*_j for the working samples
• Minimize

    R(w, b) = ½ (w·w) + C Σ_{i=1..n} ξ_i + C* Σ_{j=1..m} ξ*_j

  subject to

    y_i [(w·x_i) + b] ≥ 1 − ξ_i
    y*_j [(w·x_j) + b] ≥ 1 − ξ*_j
    ξ_i, ξ*_j ≥ 0,  i = 1, ..., n,  j = 1, ..., m
    where y*_j = sign((w·x_j) + b), j = 1, ..., m

• Solution (~ decision boundary): D(x) = (w*·x) + b*
• Unbalanced situation (small training set / large test set): all unlabeled samples may get assigned to one class
• Additional constraint:

    (1/n) Σ_{i=1..n} y_i = (1/m) Σ_{j=1..m} [(w·x_j) + b]

Optimization Formulation (cont'd)

• Hyperparameters C and C* control the trade-off between explanation and margin size
• The soft-margin inductive SVM is a special case of soft-margin transduction with zero slacks ξ*_j = 0
• A dual + kernel version of SVM transduction also exists
• Transductive SVM optimization is not convex (~ non-convexity of the loss for unlabeled data) – different optimization heuristics yield different solutions
• An exact solution (via exhaustive search) is possible only for a small number of test samples (m)

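A heavily hedged sketch of one simple heuristic in the spirit of transductive SVMs (not the exact algorithm of these slides): train on the labeled data, pseudo-label the working set, then retrain while gradually increasing the influence C* of the unlabeled points. scikit-learn is assumed; the function name, weighting scheme and schedule are illustrative choices.

import numpy as np
from sklearn.svm import SVC

def simple_tsvm(X_lab, y_lab, X_unl, C=1.0, C_star_max=1.0, steps=5):
    clf = SVC(kernel="linear", C=C).fit(X_lab, y_lab)
    y_unl = clf.predict(X_unl)                        # initial pseudo-labels for the working set
    X_all = np.vstack([X_lab, X_unl])
    for C_star in np.linspace(C_star_max / steps, C_star_max, steps):
        weights = np.r_[np.full(len(X_lab), 1.0),     # labeled points: full weight
                        np.full(len(X_unl), C_star / C)]  # unlabeled points: growing weight
        clf = SVC(kernel="linear", C=C).fit(X_all, np.r_[y_lab, y_unl],
                                            sample_weight=weights)
        y_unl = clf.predict(X_unl)                    # re-label the working set
    return clf, y_unl
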
Many Applications for Transduction

• Text categorization: classify word documents into a number of predetermined categories
• Email classification: spam vs non-spam
• Web page classification
• Image database classification
• All these applications have:
  - high-dimensional data
  - a small labeled training set (human-labeled)
  - a large unlabeled test set

Example Application

• Prediction of molecular bioactivity for drug discovery
  - Training data ~1,909 samples; test ~634 samples
  - Input space ~139,351-dimensional
• Prediction accuracy: SVM induction ~74.5%; transduction ~82.3%
• Ref: J. Weston et al., "KDD Cup 2001 data analysis: prediction of molecular bioactivity for drug design – binding to thrombin", Bioinformatics, 2003
