(Optimization) SVMs
History of SVM
SVM is related to statistical learning theory [3]
Introduced by Vapnik
SVM was first introduced in 1992
SVM became popular because of its success in many classification problems

SVM: Large-margin linear classifier
Perceptron Revisited: Linear Separators
Binary classification can be viewed as the task of separating classes in feature space:
  w^T x + b = 0  (the separating hyperplane)
  w^T x + b > 0 on one side, w^T x + b < 0 on the other
  f(x) = sign(w^T x + b)

Linear Separators
Which of the linear separators is optimal?
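Before turning to that question, here is a minimal Python sketch of the decision rule f(x) = sign(w^T x + b); the values of w and b are made up for illustration:

import numpy as np

# Made-up separator parameters, for illustration only.
w = np.array([1.0, -2.0])
b = 0.5

def f(x):
    # f(x) = sign(w^T x + b): +1 on one side of the hyperplane, -1 on the other.
    return int(np.sign(np.dot(w, x) + b))

print(f(np.array([3.0, 0.0])))   # -> 1   (w^T x + b = 3.5 > 0)
print(f(np.array([0.0, 2.0])))   # -> -1  (w^T x + b = -3.5 < 0)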
Classification Margin
Distance from example x_i to the separator is r = |w^T x_i + b| / ||w||
Examples closest to the hyperplane are support vectors.
Margin ρ of the separator is the distance between the support vectors.

Maximum Margin Classification
Maximizing the distance to the examples is good according to intuition and PAC theory.
It implies that only a few vectors matter; the other training examples are ignorable.
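A small numpy sketch of the margin quantities above, using an assumed separator (w, b) and made-up points: the distance r of each example from the hyperplane, and the margin taken as twice the smallest distance:

import numpy as np

# Assumed separator and points (illustration only).
w, b = np.array([1.0, 1.0]), -1.0
X = np.array([[2.0, 1.0], [3.0, 3.0], [-1.0, 0.0], [-2.0, -2.0]])

r = np.abs(X @ w + b) / np.linalg.norm(w)   # r_i = |w^T x_i + b| / ||w||
rho = 2 * r.min()                           # margin: twice the closest distance
print(r, rho)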
Finding the Decision Boundary
Primal formulation: find the separator that classifies all points correctly with the largest margin, i.e.
  minimize (1/2)||w||^2  subject to  y_i (w^T x_i + b) >= 1 for all i

[Recap of Constrained Optimization]
Suppose we want to: minimize f(x) subject to g(x) = 0
A necessary condition for x_0 to be a solution: the Lagrangian L(x, α) = f(x) + α g(x) is stationary at x_0, i.e. ∇f(x_0) + α ∇g(x_0) = 0 for some Lagrange multiplier α (one multiplier per constraint in the general case).
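As an aside (not in the original slides), the hard-margin primal above can be handed to a generic constrained optimizer such as scipy's SLSQP on a tiny separable data set; all data values are made up:

import numpy as np
from scipy.optimize import minimize

# Solve min 1/2 ||w||^2  s.t.  y_i (w.x_i + b) >= 1 on a tiny separable data set.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

def objective(theta):            # theta = [w1, w2, b]
    w = theta[:2]
    return 0.5 * np.dot(w, w)

constraints = [
    {"type": "ineq", "fun": (lambda theta, i=i: y[i] * (np.dot(theta[:2], X[i]) + theta[2]) - 1)}
    for i in range(len(y))
]

res = minimize(objective, x0=np.zeros(3), constraints=constraints, method="SLSQP")
w, b = res.x[:2], res.x[2]
print("w =", w, "b =", b)        # the maximal-margin separator for this toy set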
The Dual Formulation
If we substitute w = Σ_i α_i y_i x_i into the Lagrangian, we obtain a function of the α_i only.
It is known as the dual problem (the original problem is known as the primal problem): if we know w, we know all α_i; if we know all α_i, we know w.
The objective function of the dual problem needs to be maximized.
The dual problem is therefore:
  maximize  Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j y_i y_j (x_i^T x_j)
  subject to  α_i >= 0  and  Σ_i α_i y_i = 0
Remember that w = Σ_i α_i y_i x_i.
This is a quadratic programming (QP) problem
  A global maximum of the α_i can always be found
w can be recovered by w = Σ_i α_i y_i x_i

[Figure: a toy data set (Class 1 and Class 2) with the resulting α values; only the support vectors get non-zero coefficients (α_1 = 0.8, α_6 = 1.4, α_8 = 0.6), all other α_i are 0.]
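A sketch of the dual route on the same kind of toy data (using a generic scipy solver rather than a dedicated QP package, and made-up points): maximize the dual objective above under its constraints, then recover w from the non-zero α_i:

import numpy as np
from scipy.optimize import minimize

# Maximize  sum_i a_i - 1/2 sum_ij a_i a_j y_i y_j x_i.x_j
# subject to a_i >= 0 and sum_i a_i y_i = 0, then recover w = sum_i a_i y_i x_i.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
G = (y[:, None] * X) @ (y[:, None] * X).T      # G_ij = y_i y_j x_i.x_j

def neg_dual(a):                               # minimize the negated dual objective
    return 0.5 * a @ G @ a - a.sum()

cons = [{"type": "eq", "fun": lambda a: a @ y}]
bnds = [(0, None)] * len(y)
res = minimize(neg_dual, np.zeros(len(y)), bounds=bnds, constraints=cons, method="SLSQP")
alpha = res.x
w = ((alpha * y)[:, None] * X).sum(axis=0)     # w is a combination of the support vectors
sv = alpha > 1e-6                              # indices of the support vectors
b = np.mean(y[sv] - X[sv] @ w)                 # from y_i (w.x_i + b) = 1 at a support vector
print(alpha, w, b)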
Characteristics of the Solution
Many of the α_i are zero
  w is a linear combination of a small number of data points
  This "sparse" representation can be viewed as data compression
The x_i with non-zero α_i are called support vectors (SV)
The decision boundary is determined only by the SVs
  Let t_j (j = 1, ..., s) be the indices of the s support vectors. We can write
  w = Σ_{j=1..s} α_{t_j} y_{t_j} x_{t_j}

Characteristics of the Solution (cont.)
For testing with a new data point z
  Compute w^T z + b = Σ_{j=1..s} α_{t_j} y_{t_j} (x_{t_j}^T z) + b
  Classify z as class 1 if the sum is positive, and class 2 otherwise
  Note: w need not be formed explicitly
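A minimal sketch of the testing rule above; the support vectors, labels, multipliers and bias below are made-up values standing in for a trained model:

import numpy as np

# Hypothetical trained quantities (illustration only).
sv_x = np.array([[2.0, 2.0], [-1.0, -1.0]])
sv_y = np.array([1.0, -1.0])
sv_alpha = np.array([0.4, 0.4])
b = -0.5

def classify(z):
    # w^T z + b computed as a sum over support vectors, without forming w.
    score = np.sum(sv_alpha * sv_y * (sv_x @ z)) + b
    return 1 if score > 0 else -1

print(classify(np.array([3.0, 2.5])))   # -> 1
print(classify(np.array([-2.0, 0.0])))  # -> -1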
The Quadratic Programming Problem
Many approaches have been proposed
  LOQO, CPLEX, etc. (see http://www.numerical.rl.ac.uk/qp/qp.html)
Most are "interior-point" methods
  Start with an initial solution that can violate the constraints
  Improve this solution by optimizing the objective function and/or reducing the amount of constraint violation
For SVM, sequential minimal optimization (SMO) seems to be the most popular
  A QP with two variables is trivial to solve
  Each iteration of SMO picks a pair (α_i, α_j) and solves the QP with these two variables; repeat until convergence
In practice, we can just regard the QP solver as a "black box" without bothering how it works

SVM: Non-Separable Sets
Sometimes, we do not want to separate perfectly.
[Figure: a separable data set where one training point forces the hyperplane very close to the other class; the annotations read "This is too close!" and "Maybe this point is not so important."]
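In that black-box spirit, one option is scikit-learn's SVC, which uses an SMO-style solver (LIBSVM) internally and also handles the non-separable case through its C parameter; the toy data here is made up:

from sklearn.svm import SVC
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)
print(clf.support_vectors_)    # the support vectors found by the solver
print(clf.dual_coef_)          # alpha_i * y_i for each support vector
print(clf.predict([[0.0, 1.0]]))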
SVM: Non-Separable Sets (cont.)
[Figure: the same data set with the outlying point ignored; the annotation reads "If we ignore this point, the hyperplane is nicer!"]
[Figure: two overlapping classes (Class 1 and Class 2); points on the wrong side of the margin are marked with slack variables ξ_i.]
Soft Margin Hyperplane
If we minimize Σ_i ξ_i, the ξ_i can be computed from the relaxed constraints
  y_i (w^T x_i + b) >= 1 − ξ_i,  ξ_i >= 0
  ξ_i = 0 if there is no error for x_i; Σ_i ξ_i is an upper bound on the number of training errors
We want to minimize
  (1/2)||w||^2 + C Σ_i ξ_i
  C: trade-off parameter between the error and the margin
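A small sketch of how the slacks can be evaluated for a given separator; w, b, the data points, and C are assumed values, not a trained model:

import numpy as np

w, b = np.array([1.0, 1.0]), -1.0
X = np.array([[2.0, 2.0], [0.4, 0.4], [-1.0, -1.0]])
y = np.array([1.0, 1.0, -1.0])

# xi_i = max(0, 1 - y_i (w.x_i + b)): zero for points outside the margin,
# positive for margin violations or misclassified points.
xi = np.maximum(0.0, 1.0 - y * (X @ w + b))
C = 1.0
soft_margin_objective = 0.5 * w @ w + C * xi.sum()
print(xi, soft_margin_objective)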
SVM with Kernels: Large-Margin Non-linear Classifiers

Extension to Non-linear Decision Boundary
So far, we have only considered large-margin classifiers with a linear decision boundary
How to generalize it to become nonlinear?
Key idea: transform x_i to a higher-dimensional space to "make life easier"
  Input space: the space where the points x_i are located
  Feature space: the space of φ(x_i) after transformation
Why transform?
  A linear operation in the feature space is equivalent to a non-linear operation in the input space
  Classification can become easier with a proper transformation. In the XOR problem, for example, adding a new feature x_1 x_2 makes the problem linearly separable, as sketched below.
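A short sketch of the XOR remark above: with the original two features no linear separator exists, but adding the product feature x1*x2 makes the four points linearly separable:

import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]], dtype=float)
y = np.array([1, -1, -1, 1])                     # XOR-style labels

linear = SVC(kernel="linear", C=1e6).fit(X, y)
print(linear.score(X, y))                        # < 1.0: no linear separator exists

X_feat = np.hstack([X, (X[:, 0] * X[:, 1])[:, None]])   # add the feature x1*x2
lifted = SVC(kernel="linear", C=1e6).fit(X_feat, y)
print(lifted.score(X_feat, y))                   # 1.0: linearly separable in 3-D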
Kernel Functions
Any function K(x, z) that creates a symmetric, positive semi-definite matrix K_ij = K(x_i, x_j) is a valid kernel (an inner product in some space)
Why? Because any symmetric positive semi-definite matrix M can be decomposed as
  N^T N = M
so N can be seen as the projection to the feature space

Kernel Functions (another view)
A kernel function, being an inner product, is really a similarity measure between the objects
Not all similarity measures are allowed – they must satisfy the Mercer conditions
Many distance measures can be translated into a kernel
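A quick numerical sketch of the "valid kernel = positive semi-definite Gram matrix" statement, using the RBF kernel K(x, z) = exp(−||x − z||^2 / (2σ^2)) on random points:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
sigma = 1.0

sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / (2 * sigma ** 2))        # Gram matrix K_ij = K(x_i, x_j)

eigvals = np.linalg.eigvalsh(K)                 # symmetric, so eigvalsh applies
print(eigvals.min() >= -1e-10)                  # True: all eigenvalues >= 0 (PSD)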
Modification Due to Kernel Function
Change all inner products to kernel functions
For training:
  Original:     maximize  Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j y_i y_j (x_i^T x_j)
  With kernel:  maximize  Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j y_i y_j K(x_i, x_j)
For testing, the new data z is classified as class 1 if f >= 0, and as class 2 if f < 0
  Original:     f = Σ_j α_{t_j} y_{t_j} (x_{t_j}^T z) + b
  With kernel:  f = Σ_j α_{t_j} y_{t_j} K(x_{t_j}, z) + b

More on Kernel Functions
Since the training of SVM only requires the values K(x_i, x_j), there is no restriction on the form of x_i and x_j
  x_i can be a sequence or a tree, instead of a feature vector
K(x_i, x_j) is just a similarity measure comparing x_i and x_j
For a test object z, the discriminant function is essentially a weighted sum of the similarities between z and a pre-selected set of objects (the support vectors)
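A minimal sketch of the kernelized discriminant f(z) = Σ_j α_{t_j} y_{t_j} K(x_{t_j}, z) + b; the support vectors, multipliers and bias are hypothetical values chosen only for illustration:

import numpy as np

def rbf(a, b, sigma=1.0):
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2))

# Hypothetical trained quantities (support vectors, multipliers, bias).
sv_x = np.array([[1.0, 1.0], [-1.0, 1.0], [0.0, -1.0]])
sv_y = np.array([1.0, -1.0, -1.0])
sv_alpha = np.array([0.7, 0.4, 0.3])
b = 0.1

def discriminant(z):
    # Weighted sum of similarities between z and the support vectors.
    return sum(a * y * rbf(x, z) for a, y, x in zip(sv_alpha, sv_y, sv_x)) + b

z = np.array([0.8, 0.9])
print("class 1" if discriminant(z) >= 0 else "class 2")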
Other Aspects of SVM
How to use SVM for multi-class classification?
  One can change the QP formulation to become multi-class
  More often, multiple binary classifiers are combined: one can train multiple one-versus-all classifiers, or combine multiple pairwise classifiers "intelligently"
How to interpret the SVM discriminant function value as a probability?
  By performing logistic regression on the SVM output over a set of data (a validation set) that is not used for training
Some SVM software (like LIBSVM) has these features built in

Software
A list of SVM implementations can be found at http://www.kernel-machines.org/software.html
Some implementations (such as LIBSVM) can handle multi-class classification
SVMLight is among the earliest implementations of SVM
Several MATLAB toolboxes for SVM are also available
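Both points can be seen in scikit-learn's SVC (built on LIBSVM): multi-class problems are handled by combining pairwise binary classifiers, and probability=True fits a logistic (Platt-style) mapping on held-out data; a sketch on a standard 3-class data set:

from sklearn.svm import SVC
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)        # 3-class toy data set
clf = SVC(kernel="rbf", probability=True, decision_function_shape="ovo").fit(X, y)

print(clf.predict(X[:3]))                # class labels
print(clf.predict_proba(X[:3]))          # calibrated class probabilities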
Other Types of Kernel Methods
A lesson learnt from SVM: a linear algorithm in the feature space is equivalent to a non-linear algorithm in the input space
Standard linear algorithms can be generalized to non-linear versions by going to the feature space
  Kernel principal component analysis, kernel independent component analysis, kernel canonical correlation analysis, kernel k-means, and one-class SVM are some examples

Conclusion
SVM is a useful alternative to neural networks
Two key concepts of SVM: maximize the margin and the kernel trick
Many SVM implementations are available on the web for you to try on your data set!
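Before moving to the examples, a small kernel PCA sketch of the "linear in feature space = non-linear in input space" lesson above; the data set and parameters are chosen only for illustration:

from sklearn.decomposition import KernelPCA
from sklearn.datasets import make_circles

# Linear PCA cannot separate two concentric circles, but PCA carried out in an
# RBF feature space (kernel PCA) unfolds them.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)
X_kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10.0).fit_transform(X)
print(X_kpca.shape)   # (200, 2): the two rings are typically well separated here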
Examples: Toy Examples
All examples have been run with the 2D graphic interface of LIBSVM (Chang and Lin, National Taiwan University)
"LIBSVM is an integrated software for support vector classification (C-SVC, nu-SVC), regression (epsilon-SVR, nu-SVR) and distribution estimation (one-class SVM). It supports multi-class classification. The basic algorithm is a simplification of both SMO by Platt and SVMLight by Joachims. It is also a simplification of the modification 2 of SMO by Keerthi et al. Our goal is to help users from other fields to easily use SVM as a tool. LIBSVM provides a simple interface where users can easily link it with their own programs…"

Toy Examples (I)
Linearly separable data set
Linear SVM
Maximal margin hyperplane
[Figures: further Toy Examples (I) screenshots from the LIBSVM 2D interface, showing the linear SVM and its maximal-margin hyperplane on the linearly separable data set.]
Resources
http://www.kernel-machines.org/
http://www.support-vector.net/
http://www.support-vector.net/icml-tutorial.pdf
http://www.kernel-machines.org/papers/tutorial-nips.ps.gz
http://www.clopinet.com/isabelle/Projects/SVM/applist.html

Transduction with SVMs
The Learning Problem: Transduction
We consider a phenomenon f that maps inputs (instances) x to outputs (labels) y = f(x), with y ∈ {−1, 1}
Given a set of labeled examples {(x_i, y_i) : i = 1, …, n} and a set of unlabeled examples x'_1, …, x'_m, the goal is to find the labels y'_1, …, y'_m
There is no need to construct a function f; the output of the transduction algorithm is a vector of labels

Transduction Based on Margin Size
Binary classification, linear parameterization, joint set of (training + working) samples
Two objectives of transductive learning:
  (TL1) separate the labeled training data using a large-margin hyperplane (as in standard inductive SVM)
  (TL2) separate (explain) the working data set using a large-margin hyperplane
Transductive SVMs
Transductive instead of inductive (Vapnik, 1998)
TSVMs take into account a particular test set and try to minimize misclassifications of just those particular examples
Formal setting:
  S_train = {(x_1, y_1), (x_2, y_2), …, (x_n, y_n)}
  S_test = {x*_1, x*_2, …, x*_k}
Goal of the transductive learner L: find a function h_L = L(S_train, S_test) so that the expected number of erroneous predictions on the test examples is minimized
Transductive SVMs: Induction vs Transduction
The transductive SVM optimization problem:
  minimize over w, b, ξ, ξ*, y*:  (1/2)||w||^2 + C Σ_{i=1..n} ξ_i + C* Σ_{j=1..m} ξ*_j
  subject to
    y_i [(w · x_i) + b] >= 1 − ξ_i,      ξ_i >= 0,   i = 1, …, n
    y*_j [(w · x'_j) + b] >= 1 − ξ*_j,   ξ*_j >= 0,  j = 1, …, m
    where y*_j = sign((w · x'_j) + b), j = 1, …, m
  (transduction with zero slacks corresponds to ξ*_j = 0)
Solution (~ decision boundary): D(x) = (w* · x) + b*
A dual + kernel version of SVM transduction exists
Transductive SVM optimization is not convex (~ non-convexity of the loss for unlabeled data); different optimization heuristics lead to different solutions
An exact solution (via exhaustive search) is possible only for a small number of test samples (m)
Unbalanced situation (small training set / large test set): all unlabeled samples may get assigned to one class
  Additional constraint:  (1/n) Σ_{i=1..n} y_i = (1/m) Σ_{j=1..m} [(w · x'_j) + b]
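The sketch below is a heavily simplified self-labeling heuristic in the spirit of transduction, not the exact TSVM optimization above: it repeatedly labels the working set with the current separator and refits, giving the unlabeled points a smaller weight C* than the labeled ones; all data and parameters are made up:

import numpy as np
from sklearn.svm import SVC

def simple_transduction(X_lab, y_lab, X_unlab, C=1.0, C_star=0.1, n_iter=10):
    clf = SVC(kernel="linear", C=C).fit(X_lab, y_lab)
    for _ in range(n_iter):
        y_star = clf.predict(X_unlab)                     # current guessed labels
        X_all = np.vstack([X_lab, X_unlab])
        y_all = np.concatenate([y_lab, y_star])
        weights = np.concatenate([np.full(len(y_lab), C),
                                  np.full(len(y_star), C_star)])
        clf = SVC(kernel="linear", C=1.0).fit(X_all, y_all, sample_weight=weights)
    return clf, clf.predict(X_unlab)

rng = np.random.default_rng(0)
X_lab = np.array([[2.0, 2.0], [-2.0, -2.0]]); y_lab = np.array([1, -1])
X_unlab = rng.normal(size=(30, 2)) + np.where(rng.random(30) > 0.5, 2.0, -2.0)[:, None]
model, labels = simple_transduction(X_lab, y_lab, X_unlab)
print(labels)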
Many Applications for Transduction
Text categorization: classify word documents into a number of predetermined categories
Email classification: spam vs. non-spam
Web page classification
Image database classification
All these applications share:
  high-dimensional data
  a small labeled training set (human-labeled)
  a large unlabeled test set

Example Application
Prediction of molecular bioactivity for drug discovery
  Training data ~ 1,909 samples; test ~ 634 samples
  Input space ~ 139,351-dimensional
Prediction accuracy: SVM induction ~ 74.5%; transduction ~ 82.3%
Ref: J. Weston et al., "KDD Cup 2001 data analysis: prediction of molecular bioactivity for drug design – binding to thrombin," Bioinformatics, 2003.