SVM Tutorial
Abstract
Support vector machines (SVMs) have been extensively researched in the data mining and
machine learning communities for the last decade, and applied in various domains. They
represent a set of supervised learning techniques that create a function from training data, which usually consist of pairs of an input object (typically a vector) and a desired output. SVMs learn a function that generates the desired output given the input, and the learned function can be used to predict the output for a new object. They belong to a family of generalized linear classifiers where the classification (or boundary) function is a hyperplane
in the feature space. This chapter introduces the basic concepts and techniques of SVMs for
learning classification, regression, and ranking functions.
1 Introduction
Support vector machines are typically used for learning classification, regression, or ranking
functions, for which they are called classifying SVM, support vector regression (SVR), or
ranking SVM (RankSVM), respectively. Two special properties of SVMs are that they achieve
high generalization by maximizing the margin, and they support an efficient learning of
nonlinear functions using the kernel trick. This chapter introduces these general concepts
and techniques of SVMs for learning classification, regression, and ranking functions.
In particular, the SVMs for binary classification are first presented in > Sect. 2, SVR in > Sect. 3, ranking SVM in > Sect. 4, and another recently developed method for learning ranking functions, the ranking vector machine (RVM), in > Sect. 5.
2 SVM Classification
SVMs were initially developed for classification (Burges 1998) and have been extended for
regression (Smola and Schölkopf 1998) and preference (or rank) learning (Herbrich et al.
2000; Yu 2005). The initial form of SVMs is a binary classifier where the output of the learned
function is either positive or negative. A multiclass classification can be implemented by
combining multiple binary classifiers using the pairwise coupling method (Hastie and Tibshirani
1998; Friedman 1998). This section explains the motivation and formalization of SVM as a
binary classifier, and the two key properties – margin maximization and the kernel trick.
Binary SVMs are classifiers that discriminate data points of two categories. Each data
object (or data point) is represented by an n-dimensional vector. Each of these data points
belongs to only one of two classes. A linear classifier separates them with a hyperplane. For
example, > Fig. 1 shows two groups of data and separating hyperplanes that are lines in a two-
dimensional space. There are many linear classifiers that correctly classify (or divide) the two
groups of data such as L1, L2, and L3 in > Fig. 1. In order to achieve maximum separation
between the two classes, the SVM picks the hyperplane that has the largest margin. The margin is the sum of the shortest distances from the separating hyperplane to the nearest data points of the two categories. Such a hyperplane is likely to generalize better, meaning that the
hyperplane can correctly classify ‘‘unseen’’ or testing data points.
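To make the margin idea concrete, here is a small illustrative sketch (not from the original chapter) using scikit-learn's SVC on made-up two-dimensional data; with a very large C the soft-margin solver described later behaves essentially like the hard-margin SVM on separable data. Note that scikit-learn writes the hyperplane as w \cdot x + b = 0, whereas this chapter uses w \cdot x - b = 0.

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable groups of 2D points (toy data for illustration).
X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

# With a very large C, the soft-margin SVM approximates the hard-margin SVM
# on separable data: it picks the separating hyperplane with the largest margin.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]              # hyperplane: w . x + b = 0
print("w =", w, " b =", b)
print("margin width =", 2.0 / np.linalg.norm(w))    # distance between the two margin lines
print("support vectors:\n", clf.support_vectors_)
```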
. Fig. 1
Linear classifiers (hyperplanes) in a two-dimensional space.

SVMs support nonlinear classification problems by mapping the input space to a feature space. The kernel trick makes this possible without requiring an explicit formulation of the mapping function, which could otherwise introduce a curse-of-dimensionality problem. A linear classification in the new space (the feature space) is thus equivalent to a nonlinear classification in the original space (the input space). SVMs do this by mapping input vectors to a higher-dimensional space (the feature space) in which a maximal separating hyperplane is constructed.
To understand how SVMs compute the hyperplane of maximal margin and support nonlinear
classification, we first explain the hard-margin SVM where the training data is free of noise
and can be correctly classified by a linear function.
The data points D in > Fig. 1 (or training set) can be expressed mathematically as follows:
D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}    (1)

where x_i is an n-dimensional real vector and y_i is either 1 or -1, denoting the class to which the point x_i belongs. The SVM classification function F(x) takes the form

F(x) = w \cdot x - b    (2)
w is the weight vector and b is the bias, which will be computed by the SVM in the training
process.
First, to correctly classify the training set, F(·) (or w and b) must return positive numbers
for positive data points and negative numbers otherwise, that is, for every point xi in D,
w xi b > 0 if y i ¼ 1; and
w xi b < 0 if y i ¼ 1
. Fig. 2
SVM classification function: the hyperplane maximizing the margin in a two-dimensional space.
To correctly classify the training set with the maximal margin, the hard-margin SVM solves the following constrained optimization problem:

minimize:  \frac{1}{2} \|w\|^2    (6)

subject to:  y_i (w \cdot x_i - b) \geq 1, \; \forall (x_i, y_i) \in D    (7)

The constrained optimization problem of > Eqs. 6 and > 7 is called a primal problem. It is characterized as follows: the cost function is a convex function of w, and the constraints are linear in w. Accordingly, it can be solved using the method of Lagrange multipliers. We construct the Lagrange function

J(w, b, \alpha) = \frac{1}{2} w \cdot w - \sum_{i=1}^{m} \alpha_i \{ y_i (w \cdot x_i - b) - 1 \}    (8)

where the auxiliary nonnegative variables \alpha_i are called Lagrange multipliers. The solution to the
constrained optimization problem is determined by the saddle point of the Lagrange function J(w, b, \alpha), which has to be minimized with respect to w and b and maximized with respect to \alpha. Thus, differentiating J(w, b, \alpha) with respect to w and b and setting the results equal to zero, we obtain the following two conditions of optimality:

Condition 1:  \frac{\partial J(w, b, \alpha)}{\partial w} = 0    (9)

Condition 2:  \frac{\partial J(w, b, \alpha)}{\partial b} = 0    (10)

After rearrangement of terms, Condition 1 yields

w = \sum_{i=1}^{m} \alpha_i y_i x_i    (11)
The solution vector w is thus defined in terms of an expansion that involves the m training examples. Similarly, after rearrangement of terms, Condition 2 yields

\sum_{i=1}^{m} \alpha_i y_i = 0    (12)
As noted earlier, the primal problem deals with a convex cost function and linear
constraints. Given such a constrained optimization problem, it is possible to construct another
problem called the dual problem. The dual problem has the same optimal value as the primal
problem, but with the Lagrange multipliers providing the optimal solution.
To postulate the dual problem for the primal problem, > Eq. 8 is first expanded term by
term, as follows:
J(w, b, \alpha) = \frac{1}{2} w \cdot w - \sum_{i=1}^{m} \alpha_i y_i \, w \cdot x_i + b \sum_{i=1}^{m} \alpha_i y_i + \sum_{i=1}^{m} \alpha_i    (13)
The third term on the right-hand side of > Eq. 13 is zero by virtue of the optimality
condition of > Eq. 12. Furthermore, from > Eq. 11, we have
w \cdot w = \sum_{i=1}^{m} \alpha_i y_i \, w \cdot x_i = \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j    (14)
Accordingly, setting the objective function J(w, b, \alpha) = Q(\alpha), > Eq. 13 can be reformulated as

Q(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j    (15)
The dual problem can now be stated: find the Lagrange multipliers that

maximize:  Q(\alpha)    (16)

subject to:  \sum_{i=1}^{m} \alpha_i y_i = 0    (17)

             \alpha \geq 0    (18)
Note that the dual problem is cast entirely in terms of the training data. Moreover, the function Q(\alpha) to be maximized depends only on the input patterns in the form of the set of dot products \{x_i \cdot x_j\}_{i,j=1}^{m}.
Having determined the optimum Lagrange multipliers \alpha_i, the optimum weight vector w may be computed using > Eq. 11 and so can be written as

w = \sum_i \alpha_i y_i x_i    (19)
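The expansion in > Eq. 19 can be checked numerically. The sketch below is an illustration only (not part of the chapter) and assumes scikit-learn, whose dual_coef_ attribute stores the products \alpha_i y_i for the support vectors:

```python
import numpy as np
from sklearn.svm import SVC

# Toy separable data.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
              [3.0, 3.0], [4.0, 3.0], [3.0, 4.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)

# dual_coef_[0] holds alpha_i * y_i for each support vector, so Eq. 19,
# w = sum_i alpha_i y_i x_i, can be evaluated directly:
w_from_dual = clf.dual_coef_[0] @ clf.support_vectors_
print(np.allclose(w_from_dual, clf.coef_[0]))   # True: both expressions give the same w
```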
The discussion so far has focused on linearly separable cases. However, the optimization
problem > Eqs. 6 and > 7 will not have a solution if D is not linearly separable. To deal with
such cases, a soft margin SVM allows mislabeled data points while still maximizing the margin.
The method introduces slack variables, \xi_i, which measure the degree of misclassification. The
following is the optimization problem for a soft margin SVM.
minimize:  Q_1(w, b, \xi_i) = \frac{1}{2} \|w\|^2 + C \sum_i \xi_i    (23)

subject to:  y_i (w \cdot x_i - b) \geq 1 - \xi_i, \; \forall (x_i, y_i) \in D    (24)

             \xi_i \geq 0    (25)
Due to the \xi_i in > Eq. 24, data points are allowed to be misclassified, and the amount of
misclassification will be minimized while maximizing the margin according to the objective
function (> Eq. 23). C is a parameter that determines the trade-off between the margin size
and the amount of error in training.
Similarly to the case of hard-margin SVM, this primal form can be transformed to the
following dual form using the Lagrange multipliers:
maximize:  Q_2(\alpha) = \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j    (26)

subject to:  \sum_i \alpha_i y_i = 0    (27)

             C \geq \alpha \geq 0    (28)
Note that neither the slack variables \xi_i nor their Lagrange multipliers appear in the dual
problem. The dual problem for the case of nonseparable patterns is thus similar to that for the
simple case of linearly separable patterns except for a minor but important difference. The
objective function Q(\alpha) to be maximized is the same in both cases. The nonseparable case differs from the separable case in that the constraint \alpha_i \geq 0 is replaced with the more stringent constraint C \geq \alpha_i \geq 0. Except for this modification, the constrained optimization for the
nonseparable case and computations of the optimum values of the weight vector w and bias b
proceed in the same way as in the linearly separable case.
Just as for the hard-margin SVM, \alpha constitutes a dual representation for the weight vector, such that

w = \sum_{i=1}^{m_s} \alpha_i y_i x_i    (29)
where m_s is the number of support vectors, whose corresponding coefficients satisfy \alpha_i > 0. The determination of the optimum value of the bias also follows a procedure similar to that described before. Once \alpha and b are computed, the function in > Eq. 22 is used to classify new objects.
Relationships among \alpha, \xi, and C can be further disclosed using the Kuhn–Tucker conditions, which are defined by

\alpha_i \{ y_i (w \cdot x_i - b) - 1 + \xi_i \} = 0, \quad i = 1, 2, \ldots, m    (30)

and

\mu_i \xi_i = 0, \quad i = 1, 2, \ldots, m    (31)
> Eq. 30 is a rewrite of > Eq. 20 except that the unity term 1 is replaced by (1 - \xi_i). As for > Eq. 31, the \mu_i are Lagrange multipliers that have been introduced to enforce the nonnegativity of the slack variables \xi_i for all i. At the saddle point, the derivative of the Lagrange function for the primal problem with respect to the slack variable \xi_i is zero, the evaluation of which yields

\alpha_i + \mu_i = C    (32)

By combining > Eqs. 31 and > 32, we see that

\xi_i = 0 \quad \text{if } \alpha_i < C, \text{ and}    (33)
\xi_i \geq 0 \quad \text{if } \alpha_i = C    (34)
. Fig. 3
Graphical relationships among \alpha_i, \xi_i, and C.

We can graphically display the relationships among \alpha_i, \xi_i, and C in > Fig. 3. Data points outside the margin will have \alpha = 0 and \xi = 0, and those on the margin line will have C > \alpha > 0 and still \xi = 0. Data points within the margin will have \alpha = C. Among them, those correctly classified will have 1 > \xi > 0 and misclassified points will have \xi > 1.
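The relationships in > Fig. 3 can be observed empirically: support vectors sitting exactly on the margin have 0 < \alpha_i < C, while those inside the margin or misclassified saturate at \alpha_i = C. The following is an illustrative sketch only (overlapping toy data, scikit-learn's SVC):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two overlapping Gaussian clouds, so some slack is unavoidable.
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(1.5, 1.0, (50, 2))])
y = np.hstack([-np.ones(50), np.ones(50)])

C = 1.0
clf = SVC(kernel="linear", C=C).fit(X, y)

alphas = np.abs(clf.dual_coef_[0])        # |alpha_i y_i| = alpha_i for support vectors
on_margin = np.sum(alphas < C - 1e-8)     # 0 < alpha_i < C  ->  xi_i = 0 (on the margin)
at_bound  = np.sum(alphas >= C - 1e-8)    # alpha_i = C      ->  xi_i >= 0 (inside or misclassified)
print(f"support vectors: {alphas.size}, on margin: {on_margin}, at bound C: {at_bound}")
```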
If the training data is not linearly separable, there is no straight hyperplane that can separate
the classes. In order to learn a nonlinear function in that case, linear SVMs must be extended
to nonlinear SVMs for the classification of nonlinearly separable data. The process of finding
classification functions using nonlinear SVMs consists of two steps. First, the input vectors are
transformed into high-dimensional feature vectors where the training data can be linearly
separated. Then, SVMs are used to find the hyperplane of maximal margin in the new feature
space. The separating hyperplane becomes a linear function in the transformed feature space
but a nonlinear function in the original input space.
Let x be a vector in the n-dimensional input space and \varphi(\cdot) be a nonlinear mapping function from the input space to the high-dimensional feature space. The hyperplane representing the decision boundary in the feature space is defined as follows:

w \cdot \varphi(x) - b = 0    (35)
where w denotes a weight vector that can map the training data in the high-dimensional
feature space to the output space, and b is the bias. Using the \varphi(\cdot) function, the weight vector becomes

w = \sum_i \alpha_i y_i \varphi(x_i)    (36)
Furthermore, the dual problem of the soft-margin SVM (> Eq. 26) can be rewritten using
the mapping function on the data vectors as follows:
Q(\alpha) = \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j \, \varphi(x_i) \cdot \varphi(x_j)    (38)
The inner product \varphi(x_i) \cdot \varphi(x_j) can be replaced by a kernel function K(x_i, x_j) = \varphi(x_i) \cdot \varphi(x_j). Mercer's theorem ensures that the kernel function can always be expressed as the inner product between pairs of input vectors in some high-dimensional space; thus the inner product can be calculated using the kernel function with input vectors in the original space only, without transforming them into high-dimensional feature vectors.
The dual problem is now defined using the kernel function as follows:
maximize:  Q_2(\alpha) = \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j K(x_i, x_j)    (41)

subject to:  \sum_i \alpha_i y_i = 0    (42)

             C \geq \alpha \geq 0    (43)
Since K(\cdot) is computed in the input space, no feature transformation is actually performed and no \varphi(\cdot) is evaluated; thus the weight vector w = \sum_i \alpha_i y_i \varphi(x_i) is not explicitly computed either in nonlinear SVMs.
The following are popularly used kernel functions:
Polynomial: K(a, b) = (a \cdot b + 1)^d
Radial basis function (RBF): K(a, b) = \exp(-\gamma \|a - b\|^2)
Sigmoid: K(a, b) = \tanh(\kappa \, a \cdot b + c)
Note that a kernel function acts as a kind of similarity function between two vectors, with the output maximized when the two vectors are most similar. Because of this, SVMs can learn a function from data of any shape beyond vectors (such as trees or graphs) as long as a kernel (similarity) function can be computed between any pair of data objects. Further discussion of the properties of these kernel functions is beyond the scope of this chapter. Instead, we give an example of using the polynomial kernel for learning an XOR function in the following section.
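As a small sanity check of the kernel trick (an illustrative sketch, not from the chapter): for the degree-2 polynomial kernel K(a, b) = (a \cdot b + 1)^2, the explicit feature map is \varphi(x) = (1, x_1^2, \sqrt{2} x_1 x_2, x_2^2, \sqrt{2} x_1, \sqrt{2} x_2), the same map used in the XOR example of the next section, and the kernel value equals \varphi(a) \cdot \varphi(b):

```python
import numpy as np

def poly_kernel(a, b, d=2):
    # K(a, b) = (a . b + 1)^d, computed in the input space
    return (np.dot(a, b) + 1.0) ** d

def phi(x):
    # Explicit degree-2 feature map for 2D input (the map used in the XOR example).
    x1, x2 = x
    return np.array([1.0, x1**2, np.sqrt(2) * x1 * x2, x2**2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2])

a, b = np.array([0.3, -1.2]), np.array([2.0, 1.0])
print(poly_kernel(a, b))            # kernel value in the input space
print(np.dot(phi(a), phi(b)))       # same value via the explicit feature map
```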
To illustrate the procedure of training a nonlinear SVM function, assume that the training set
of > Table 1 is given.
. Table 1
XOR problem

x_1    x_2    y
-1     -1     -1
-1     +1     +1
+1     -1     +1
+1     +1     -1

> Figure 4 plots the training points in the two-dimensional input space. There is no linear function that can separate the two classes in the input space.

. Fig. 4
XOR problem.

For this problem, the polynomial kernel of degree two, K(a, b) = (a \cdot b + 1)^2, is used; its corresponding mapping function is \varphi(x) = (1, x_1^2, \sqrt{2} x_1 x_2, x_2^2, \sqrt{2} x_1, \sqrt{2} x_2).
Based on this mapping function, the objective function for the dual form can be derived
from > Eq. 41 as follows.
Q(\alpha) = \alpha_1 + \alpha_2 + \alpha_3 + \alpha_4 - \frac{1}{2} \left( 9\alpha_1^2 - 2\alpha_1\alpha_2 - 2\alpha_1\alpha_3 + 2\alpha_1\alpha_4 + 9\alpha_2^2 + 2\alpha_2\alpha_3 - 2\alpha_2\alpha_4 + 9\alpha_3^2 - 2\alpha_3\alpha_4 + 9\alpha_4^2 \right)    (48)
Optimizing Q(\alpha) with respect to the Lagrange multipliers yields the following set of simultaneous equations:

9\alpha_1 - \alpha_2 - \alpha_3 + \alpha_4 = 1
-\alpha_1 + 9\alpha_2 + \alpha_3 - \alpha_4 = 1
-\alpha_1 + \alpha_2 + 9\alpha_3 - \alpha_4 = 1
\alpha_1 - \alpha_2 - \alpha_3 + 9\alpha_4 = 1
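These four equations can be solved directly; a quick numerical check (a sketch with NumPy, not part of the original text) confirms that all four multipliers equal 1/8, the value that appears in > Eq. 49 below:

```python
import numpy as np

# Coefficient matrix of the four simultaneous equations above.
A = np.array([[ 9, -1, -1,  1],
              [-1,  9,  1, -1],
              [-1,  1,  9, -1],
              [ 1, -1, -1,  9]], dtype=float)
b = np.ones(4)

alpha = np.linalg.solve(A, b)
print(alpha)        # [0.125 0.125 0.125 0.125], i.e., alpha_i = 1/8 for every i
```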
From > Eq. 36, we find that the optimum weight vector is

w = \frac{1}{8} \left[ -\varphi(x_1) + \varphi(x_2) + \varphi(x_3) - \varphi(x_4) \right] = \left( 0, \; 0, \; -\frac{1}{\sqrt{2}}, \; 0, \; 0, \; 0 \right)^{T}    (49)
The bias b is 0 because the first element of w is 0. The optimal hyperplane becomes

w \cdot \varphi(x) = \left( 0, \; 0, \; -\frac{1}{\sqrt{2}}, \; 0, \; 0, \; 0 \right) \cdot \left( 1, \; x_1^2, \; \sqrt{2}\, x_1 x_2, \; x_2^2, \; \sqrt{2}\, x_1, \; \sqrt{2}\, x_2 \right)^{T} = 0    (50)
which reduces to

-x_1 x_2 = 0    (51)

This is the optimal hyperplane, the solution of the XOR problem. It makes the output y = -1 for both input points x_1 = x_2 = 1 and x_1 = x_2 = -1, and y = 1 for the input points x_1 = 1, x_2 = -1 and x_1 = -1, x_2 = 1. > Figure 5 shows the four points in the transformed feature space.
. Fig. 5
The four data points of the XOR problem in the transformed feature space.
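The same solution can be reproduced with an off-the-shelf SVM solver (an illustrative sketch; scikit-learn's polynomial kernel (gamma * a \cdot b + coef0)^degree reduces to (a \cdot b + 1)^2 with the parameters below, and the resulting decision function agrees in sign with -x_1 x_2):

```python
import numpy as np
from sklearn.svm import SVC

# The four XOR points and their labels (see Table 1 and Fig. 4).
X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])

# scikit-learn's polynomial kernel is (gamma * a.b + coef0)^degree = (a.b + 1)^2 here.
clf = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0, C=1e6).fit(X, y)

print(clf.predict(X))               # [-1  1  1 -1]: the XOR labeling is reproduced
print(clf.decision_function(X))     # approximately -x1*x2, i.e., about [-1, 1, 1, -1]
```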
3 SVM Regression
SVM regression (SVR) is a method to estimate a function that maps from an input object to a
real number based on training data. Similarly to the classifying SVM, SVR has the same
properties of the margin maximization and kernel trick for nonlinear mapping.
A training set for regression is represented as follows.
D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}    (52)

where x_i is an n-dimensional vector and y_i is the real-valued target for x_i. The SVR function F(x) maps an input vector x to the target y and takes the form

F(x) = w \cdot x + b    (53)

where w is the weight vector and b is the bias. The goal is to estimate the parameters (w and b) of the function that give the best fit to the data. An SVR function F(x) approximates all pairs (x_i, y_i) while keeping the differences between estimated values and real values within \varepsilon precision. That is, for every input vector x_i in D,

y_i - w \cdot x_i - b \leq \varepsilon    (54)
w \cdot x_i + b - y_i \leq \varepsilon    (55)
The margin is

margin = \frac{1}{\|w\|}    (56)

By minimizing \|w\|^2 to maximize the margin, the training in SVR becomes a constrained optimization problem, as follows:

minimize:  L(w) = \frac{1}{2} \|w\|^2    (57)

subject to:  y_i - w \cdot x_i - b \leq \varepsilon    (58)

             w \cdot x_i + b - y_i \leq \varepsilon    (59)
The solution of this problem does not allow any errors. To allow some errors, in order to deal with noise in the training data, the soft margin SVR uses slack variables \xi and \hat{\xi}. The optimization problem can then be revised as follows:

minimize:  L(w, \xi) = \frac{1}{2} \|w\|^2 + C \sum_i (\xi_i + \hat{\xi}_i), \quad C > 0    (60)

subject to:  y_i - w \cdot x_i - b \leq \varepsilon + \xi_i, \; \forall (x_i, y_i) \in D    (61)

             w \cdot x_i + b - y_i \leq \varepsilon + \hat{\xi}_i, \; \forall (x_i, y_i) \in D    (62)

             \xi_i, \hat{\xi}_i \geq 0    (63)
The constant C > 0 is the trade-off parameter between the margin size and the amount of error. The slack variables \xi and \hat{\xi} deal with infeasible constraints of the optimization problem by imposing a penalty on excess deviations that are larger than \varepsilon.
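To illustrate the roles of \varepsilon and C, the sketch below (illustrative only, on synthetic data) uses scikit-learn's SVR, which implements this \varepsilon-insensitive soft-margin formulation; points strictly inside the \varepsilon-tube do not become support vectors:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-3.0, 3.0, (60, 1)), axis=0)
y = 0.8 * X.ravel() + 0.3 + rng.normal(0.0, 0.1, 60)    # noisy linear target

# Points predicted within the epsilon tube incur no loss; only points on or
# outside the tube become support vectors (their slack is penalized through C).
svr = SVR(kernel="linear", C=1.0, epsilon=0.2).fit(X, y)

errors = np.abs(svr.predict(X) - y)
non_sv = np.setdiff1d(np.arange(len(X)), svr.support_)
print("support vectors:", len(svr.support_), "of", len(X))
# Non-support vectors lie inside the tube: at most epsilon (up to solver tolerance).
print("largest error among non-support vectors:", float(errors[non_sv].max()))
```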
To solve the optimization problem > Eq. 60, we can construct a Lagrange function from
the objective function with Lagrange multipliers as follows:
minimize:  L = \frac{1}{2} \|w\|^2 + C \sum_i (\xi_i + \hat{\xi}_i) - \sum_i (\eta_i \xi_i + \hat{\eta}_i \hat{\xi}_i)
              - \sum_i \alpha_i (\varepsilon + \xi_i - y_i + w \cdot x_i + b)
              - \sum_i \hat{\alpha}_i (\varepsilon + \hat{\xi}_i + y_i - w \cdot x_i - b)    (64)

\eta_i, \hat{\eta}_i \geq 0    (65)

\alpha_i, \hat{\alpha}_i \geq 0    (66)
where \eta_i, \hat{\eta}_i, \alpha_i, and \hat{\alpha}_i are the Lagrange multipliers, which must be nonnegative. The saddle point is found by setting the partial derivatives of L with respect to the primal variables (b, w, and the slack variables) to zero:
\frac{\partial L}{\partial b} = \sum_i (\alpha_i - \hat{\alpha}_i) = 0    (67)

\frac{\partial L}{\partial w} = w - \sum_i (\alpha_i - \hat{\alpha}_i) x_i = 0, \qquad w = \sum_i (\alpha_i - \hat{\alpha}_i) x_i    (68)

\frac{\partial L}{\partial \hat{\xi}_i} = C - \hat{\alpha}_i - \hat{\eta}_i = 0, \qquad \hat{\eta}_i = C - \hat{\alpha}_i    (69)
The optimization problem with inequality constraints can be changed to the following dual optimization problem by substituting > Eqs. 67, > 68, and > 69 into > Eq. 64:

maximize:  L(\alpha) = \sum_i y_i (\alpha_i - \hat{\alpha}_i) - \varepsilon \sum_i (\alpha_i + \hat{\alpha}_i)    (70)

                       - \frac{1}{2} \sum_i \sum_j (\alpha_i - \hat{\alpha}_i)(\alpha_j - \hat{\alpha}_j) \, x_i \cdot x_j    (71)

subject to:  \sum_i (\alpha_i - \hat{\alpha}_i) = 0    (72)

             0 \leq \alpha, \hat{\alpha} \leq C    (73)
The dual variables \eta_i, \hat{\eta}_i are eliminated in revising > Eq. 64 into > Eq. 70. > Eqs. 68 and > 69 can be rewritten as follows:

w = \sum_i (\alpha_i - \hat{\alpha}_i) x_i    (74)

\eta_i = C - \alpha_i    (75)

\hat{\eta}_i = C - \hat{\alpha}_i    (76)
where w is represented by a linear combination of the training vectors x_i. Accordingly, the SVR function F(x) becomes the following:

F(x) = \sum_i (\alpha_i - \hat{\alpha}_i) \, x_i \cdot x + b    (77)
> Eq. 77 maps the training vectors to target real values, allowing some errors, but it cannot handle the nonlinear SVR case. The same kernel trick can be applied by replacing the inner product of two vectors x_i, x_j with a kernel function K(x_i, x_j). The transformed feature space is usually high dimensional, and the SVR function in this space becomes nonlinear in the original input space. Using the kernel function K, the inner product in the transformed feature space can be computed as fast as the inner product x_i \cdot x_j in the original input space. The same kernel functions introduced in > Sect. 2.3 can be applied here.
Once the original inner product is replaced with a kernel function K, the remaining process for solving the optimization problem is very similar to that for the linear SVR. The optimization function of the linear case can be rewritten using the kernel function as follows:

maximize:  L(\alpha) = \sum_i y_i (\alpha_i - \hat{\alpha}_i) - \varepsilon \sum_i (\alpha_i + \hat{\alpha}_i)
                       - \frac{1}{2} \sum_i \sum_j (\alpha_i - \hat{\alpha}_i)(\alpha_j - \hat{\alpha}_j) K(x_i, x_j)    (78)

subject to:  \sum_i (\alpha_i - \hat{\alpha}_i) = 0    (79)

             \alpha_i \geq 0, \; \hat{\alpha}_i \geq 0    (80)

             0 \leq \alpha, \hat{\alpha} \leq C    (81)
Finally, using the kernel function, the SVR function F(x) becomes

F(x) = \sum_i (\alpha_i - \hat{\alpha}_i) K(x_i, x) + b    (82)
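> Eq. 82 is exactly what kernel SVR implementations evaluate at prediction time. The following sketch is an illustration only (not from the chapter) and assumes scikit-learn's SVR with an RBF kernel; its dual_coef_ attribute stores the per-support-vector differences of the two multipliers, so the prediction can be reassembled by hand:

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(1)
X = rng.uniform(-3.0, 3.0, (80, 1))
y = np.sin(X).ravel() + rng.normal(0.0, 0.1, 80)

gamma = 0.5
svr = SVR(kernel="rbf", gamma=gamma, C=10.0, epsilon=0.1).fit(X, y)

# Eq. 82: F(x) = sum_i (alpha_i - alpha_hat_i) K(x_i, x) + b.  scikit-learn keeps
# the per-support-vector coefficient differences in dual_coef_ and the bias in intercept_.
x_new = np.array([[0.7]])
K = rbf_kernel(svr.support_vectors_, x_new, gamma=gamma)     # K(x_i, x_new) for each SV
F = svr.dual_coef_[0] @ K[:, 0] + svr.intercept_[0]

print(float(F), float(svr.predict(x_new)[0]))                # the two values coincide
```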
4 SVM Ranking
Ranking SVM, learning a ranking (or preference) function, has resulted in various applications
in information retrieval (Herbrich et al. 2000; Joachims 2002; Yu et al. 2007). The task of
learning ranking functions is distinguished from that of learning classification functions as
follows:
1. While a training set in classification is a set of data objects and their class labels, in ranking a training set is an ordering of data. Let "A is preferred to B" be specified as "A \succ B." A training set for ranking SVM is denoted as R = \{(x_1, y_1), \ldots, (x_m, y_m)\}, where y_i is the ranking of x_i, that is, y_i < y_j if x_i \succ x_j.
2. Unlike a classification function, which outputs a distinct class for a data object, a ranking function outputs a score for each data object, from which a global ordering of the data is constructed. That is, the target function F(x_i) outputs a score such that F(x_i) > F(x_j) for any x_i \succ x_j.
Unless stated otherwise, R is assumed to be a strict ordering, which means that for every pair x_i and x_j in a set D, either x_i \succ_R x_j or x_j \succ_R x_i. However, it can be straightforwardly generalized to weak orderings. Let R^* be the optimal ranking of the data, in which the data are ordered perfectly according to the user's preference. A ranking function F is typically evaluated by how closely its ordering R_F approximates R^*.
Using the techniques of SVM, a global ranking function F can be learned from an ordering R. For now, assume F is a linear ranking function such that

\forall \{(x_i, x_j) : y_i < y_j \in R\} : \; F(x_i) > F(x_j) \iff w \cdot x_i > w \cdot x_j    (83)
A weight vector w that satisfies > Eq. 83 for as many training pairs as possible is learned by solving the following optimization problem:

minimize:  L_1(w, \xi) = \frac{1}{2} w \cdot w + C \sum \xi_{ij}    (84)

subject to:  w \cdot x_i \geq w \cdot x_j + 1 - \xi_{ij}, \; \forall \{(i, j) : y_i < y_j \in R\}    (85)

             \xi_{ij} \geq 0    (86)

By the constraint (> Eq. 85) and by minimizing the upper bound \sum \xi_{ij} in (> Eq. 84), the above optimization problem satisfies orderings on the training set R with minimal error. By minimizing w \cdot w, or equivalently by maximizing the margin (= \frac{1}{\|w\|}), it tries to maximize the generalization ability of the ranking function. We will explain how maximizing the margin corresponds to increasing the generalization of ranking in > Sect. 4.1. C is the soft margin parameter that controls the trade-off between the margin size and the training error.
By rearranging the constraint (> Eq. 85) we get
w \cdot (x_i - x_j) \geq 1 - \xi_{ij}    (87)
The optimization problem becomes equivalent to that of a classifying SVM on pairwise difference vectors (x_i - x_j). Thus, we can extend an existing SVM implementation to solve the
problem.
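Because > Eq. 87 has the form of a classification constraint on pairwise difference vectors, a linear ranking SVM can be sketched with an ordinary SVM solver. The code below is an illustration only (it is not the SVM-light implementation used later in the chapter); the hidden utility scores and the mirrored negative pairs are assumptions made for the sake of a runnable example:

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, (40, 3))
true_w = np.array([2.0, -1.0, 0.5])
scores = X @ true_w                     # hidden utility defining the gold ordering R

# Build pairwise difference vectors (x_i - x_j) for every pair where i is preferred to j.
diffs = np.array([X[i] - X[j]
                  for i in range(len(X)) for j in range(len(X))
                  if scores[i] > scores[j]])

# Classify the differences: mirrored negative copies keep the problem balanced and
# mimic the constraint w . (x_i - x_j) >= 1 - xi_ij of Eq. 87.
Xp = np.vstack([diffs, -diffs])
yp = np.hstack([np.ones(len(diffs)), -np.ones(len(diffs))])

rank_svm = LinearSVC(C=1.0, fit_intercept=False, max_iter=20000).fit(Xp, yp)
w = rank_svm.coef_[0]

learned = X @ w                         # F(x) = w . x gives the learned ranking
print("orderings agree:", bool(np.all(np.argsort(learned) == np.argsort(scores))))
```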
Note that the support vectors are the data pairs (x_i^s, x_j^s) such that constraint (> Eq. 87) is satisfied with the equality sign, that is, w \cdot (x_i^s - x_j^s) = 1 - \xi_{ij}. Unbounded support vectors are the ones on the margin (i.e., their slack variables \xi_{ij} = 0), and bounded support vectors are the ones within the margin (i.e., 1 > \xi_{ij} > 0) or misranked (i.e., \xi_{ij} > 1). As in the classifying SVM, the function F in ranking SVM is expressed only by the support vectors.
Similarly to the classifying SVM, the primal problem of ranking SVM can be transformed to the following dual problem using the Lagrange multipliers:

maximize:  L_2(\alpha) = \sum_{ij} \alpha_{ij} - \frac{1}{2} \sum_{ij} \sum_{uv} \alpha_{ij} \alpha_{uv} K(x_i - x_j, x_u - x_v)    (88)

subject to:  C \geq \alpha_{ij} \geq 0    (89)
Once transformed to the dual, the kernel trick can be applied to support a nonlinear ranking function. K(\cdot) is a kernel function, and \alpha_{ij} is a coefficient for the pairwise difference vector (x_i - x_j). Note that the kernel function is computed P^2 (\approx m^4) times, where P is the number of data pairs and m is the number of data points in the training set; thus solving the ranking SVM takes at least O(m^4) time. Fast training algorithms for ranking SVM have been proposed (Joachims 2006), but they are limited to linear kernels.
Once \alpha is computed, w can be written in terms of the pairwise difference vectors and their coefficients such that

w = \sum_{ij} \alpha_{ij} (x_i - x_j)    (90)

The ranking function F on a new vector z can be computed using the kernel function replacing the dot product as follows:

F(z) = w \cdot z = \sum_{ij} \alpha_{ij} (x_i - x_j) \cdot z = \sum_{ij} \alpha_{ij} K(x_i - x_j, z)    (91)
We now explain the margin-maximization of the ranking SVM, to reason about how the
ranking SVM generates a ranking function of high generalization. Some essential properties of
ranking SVM are first established. For convenience of explanation, it is assumed that a training
set R is linearly rankable and thus we use the hard-margin SVM, that is, \xi_{ij} = 0 for all (i, j) in the
objective (> Eq. 84) and the constraints (> Eq. 85).
. Fig. 6
Linear projection of four data points.

In the ranking formulation, from > Eq. 83, the linear ranking function F_w projects data vectors onto a weight vector w. For instance, > Fig. 6 illustrates linear projections of four vectors \{x_1, x_2, x_3, x_4\} onto two different weight vectors w_1 and w_2, respectively, in a two-dimensional space. Both F_{w_1} and F_{w_2} make the same ordering R for the four vectors, that is, x_1 \succ_R x_2 \succ_R x_3 \succ_R x_4. The ranking difference of two vectors (x_i, x_j) according to a ranking function F_w is denoted by the geometric distance of the two vectors projected onto w, that is, formulated as \frac{w \cdot (x_i - x_j)}{\|w\|}.
Corollary 2  The ranking function F, generated by the hard-margin ranking SVM, maximizes the minimal difference of any data pairs in ranking.

Proof  By minimizing w \cdot w, the ranking SVM maximizes the margin \delta = \frac{1}{\|w\|} = \frac{w \cdot (x_i^s - x_j^s)}{\|w\|}, where (x_i^s, x_j^s) are the support vectors that denote, from the proof of Corollary 1, the pair of minimal ranking difference. Thus, the ranking function maximizes the minimal difference of any data pairs in ranking.
The soft margin SVM allows bounded support vectors, whose \xi_{ij} > 0, as well as unbounded support vectors, whose \xi_{ij} = 0, in order to deal with noise and allow small errors for an R that is not completely linearly rankable. However, the objective function in (> Eq. 84) also minimizes the sum of the slacks, and thus the amount of error, and the support vectors are the close data pairs in the ranking. Thus, maximizing the margin generates the effect of maximizing the differences of close data pairs in the ranking.
From Corollaries 1 and 2, we observe that ranking SVM improves the generalization performance by maximizing the minimal ranking difference. For example, consider the two linear ranking functions F_{w_1} and F_{w_2} in > Fig. 6. Although the two weight vectors w_1 and w_2 make the same ordering, intuitively w_1 generalizes better than w_2 because the distance between the closest vectors on w_1 (i.e., d_1) is larger than that on w_2 (i.e., d_2). The SVM computes the weight vector w that maximizes the differences of close data pairs in the ranking. Ranking SVMs find a ranking function of high generalization in this way.
5 Ranking Vector Machine

This section presents another rank-learning method, the ranking vector machine (RVM), a revised 1-norm ranking SVM that is better for feature selection and more scalable to large datasets than the standard ranking SVM.
We first develop a 1-norm ranking SVM, a ranking SVM that is based on the 1-norm
objective function. (The standard ranking SVM is based on the 2-norm objective function.)
The 1-norm ranking SVM learns a function with far fewer support vectors than the standard SVM. Thereby, its testing is much faster than that of 2-norm SVMs, and it provides better feature selection properties. (The function of the 1-norm SVM is likely to utilize fewer
features by using fewer support vectors (Fung and Mangasarian 2004).) Feature selection is
also important in ranking. Ranking functions are relevance or preference functions in
document or data retrieval. Identifying key features increases the interpretability of the
function. Feature selection for nonlinear kernels is especially challenging, and the fewer the
number of support vectors, the more efficiently feature selection can be done (Guyon and
Elisseeff 2003; Mangasarian and Wild 1998; Cao et al. 2007; Yu et al. 2003; Cho et al. 2008).
We next present an RVM that revises the 1-norm ranking SVM for fast training. The RVM
trains much faster than standard SVMs while not compromising the accuracy when the
training set is relatively large. The key idea of the RVM is to express the ranking function with "ranking vectors" instead of support vectors. Support vectors in ranking SVMs are pairwise difference vectors of the closest pairs, as discussed in > Sect. 4. Thus, the training requires investigating every data pair as a potential candidate for a support vector, and the number of data pairs is quadratic in the size of the training set. On the other hand, the ranking function of
the RVM utilizes each training data object instead of data pairs. Thus, the number of variables
for optimization is substantially reduced in the RVM.
5.1 1-Norm Ranking SVM

The goal of the 1-norm ranking SVM is the same as that of the standard ranking SVM, that is, to learn F that satisfies > Eq. 83 for most \{(x_i, x_j) : y_i < y_j \in R\} and generalizes well beyond the training set. In the 1-norm ranking SVM, > Eq. 83 can be expressed using the F of > Eq. 91 as follows:
F(x_u) > F(x_v) \implies \sum_{ij}^{P} \alpha_{ij} (x_i - x_j) \cdot x_u > \sum_{ij}^{P} \alpha_{ij} (x_i - x_j) \cdot x_v    (92)

\implies \sum_{ij}^{P} \alpha_{ij} (x_i - x_j) \cdot (x_u - x_v) > 0    (93)
Then, replacing the inner product with a kernel function, the 1-norm ranking SVM is
formulated as
minimize:  L(\alpha, \xi) = \sum_{ij}^{P} \alpha_{ij} + C \sum_{ij}^{P} \xi_{ij}    (94)

subject to:  \sum_{ij}^{P} \alpha_{ij} K(x_i - x_j, x_u - x_v) \geq 1 - \xi_{uv}, \; \forall \{(u, v) : y_u < y_v \in R\}    (95)

             \alpha \geq 0, \; \xi \geq 0    (96)
While the standard ranking SVM suppresses the weight w to improve the generalization performance, the 1-norm ranking SVM suppresses \alpha in the objective function. Since the weight is expressed by the sum of the coefficients times the pairwise ranking difference vectors, suppressing the coefficients \alpha corresponds to suppressing the weight w in the standard SVM (Mangasarian proves this in Mangasarian (2000)). C is a user parameter controlling the trade-off between the margin size and the amount of error \xi, and K is the kernel function. P is the number of pairwise difference vectors (\approx m^2).
The training of the 1-norm ranking SVM becomes a linear programming (LP) problem, thus solvable by LP algorithms such as the simplex and interior point methods (Mangasarian 2000, 2006; Fung and Mangasarian 2004). Just as for the standard ranking SVM, K needs to be computed P^2 (\approx m^4) times, and there are P constraints (> Eq. 95) and \alpha values to compute. Once \alpha is computed, F is computed using the same ranking function as the standard ranking SVM, that is, > Eq. 91.
The accuracies of the 1-norm ranking SVM and standard ranking SVM are comparable,
and both methods need to compute the kernel function O(m^4) times. In practice, the training
of the standard SVM is more efficient because fast decomposition algorithms have been
developed such as sequential minimal optimization (SMO) (Platt 1998), while the 1-norm
ranking SVM uses common LP solvers.
It is shown that 1-norm SVMs use far fewer support vectors than standard 2-norm SVMs, that is, the number of positive coefficients (i.e., \alpha > 0) after training is much smaller in the
1-norm SVMs than in the standard 2-norm SVMs (Mangasarian 2006; Fung and Mangasarian
2004). This is because, unlike the standard 2-norm SVM, the support vectors in the 1-norm
SVM are not bounded to those close to the boundary in classification or the minimal ranking
difference vectors in ranking. Thus, the testing involves much fewer kernel evaluations, and it
is more robust when the training set contains noisy features (Zhu et al. 2003).
5.2 Ranking Vector Machine

Although the 1-norm ranking SVM has merits over the standard ranking SVM in terms of
the testing efficiency and feature selection, its training complexity is very high with respect to
the number of data points. In this section, we present an RVM that revises the 1-norm ranking
SVM to reduce the training time substantially. The RVM significantly reduces the number of
variables in the optimization problem while not compromising the accuracy. The key idea of
RVM is to express the ranking function with ‘‘ranking vectors’’ instead of support vectors. The
support vectors in ranking SVMs are chosen from pairwise difference vectors, and the number of pairwise difference vectors is quadratic in the size of the training set. On the other hand,
the ranking vectors are chosen from the training vectors, thus the number of variables to
optimize is substantially reduced.
To theoretically justify this approach, we first consider the representer theorem (Schölkopf et al. 2001): for a regularized risk functional of the form

c((x_1, y_1, f(x_1)), \ldots, (x_m, y_m, f(x_m))) + \Omega(\|f\|_H)    (97)

where c is an arbitrary loss function and \Omega is a strictly monotonically increasing function of the norm of f in the kernel-induced feature space H, each minimizer f admits a representation of the form

f(x) = \sum_{i=1}^{m} \alpha_i K(x_i, x)    (98)

Expressing the ranking function F in the form of > Eq. 98, the ranking condition of > Eq. 83 becomes

F(x_u) > F(x_v) \implies \sum_{i}^{m} \alpha_i K(x_i, x_u) > \sum_{i}^{m} \alpha_i K(x_i, x_v)    (99)

\implies \sum_{i}^{m} \alpha_i (K(x_i, x_u) - K(x_i, x_v)) > 0    (100)
The loss function considers couples of data points, penalizing misranked pairs; that is, it returns higher values as the number of misranked pairs increases. Thus, the loss function is order sensitive, and it is an instance of the function class c in > Eq. 97. We set the regularizer \Omega(\|f\|_H) = \sum_i^m \alpha_i (\alpha_i \geq 0), which is strictly monotonically increasing. Let P be the number of pairs (u, v) \in R such that y_u < y_v, and let \xi_{uv} = 1 - \sum_i^m \alpha_i (K(x_i, x_u) - K(x_i, x_v)). Then, the RVM is formulated as follows.
minimize:  L(\alpha, \xi) = \sum_{i}^{m} \alpha_i + C \sum_{ij}^{P} \xi_{ij}    (102)

subject to:  \sum_{i}^{m} \alpha_i (K(x_i, x_u) - K(x_i, x_v)) \geq 1 - \xi_{uv}, \; \forall \{(u, v) : y_u < y_v \in R\}    (103)

             \alpha \geq 0, \; \xi \geq 0    (104)
The solution of the optimization problem lies in the span of kernels centered on the training points (i.e., > Eq. 98), as suggested by the representer theorem. Just like the 1-norm ranking SVM, the RVM suppresses \alpha to improve generalization, and it enforces > Eq. 100 through constraint (> Eq. 103). Note that there are only m coefficients \alpha_i in the RVM. Thus, the kernel function is evaluated O(m^3) times, while the standard ranking SVM computes it O(m^4) times.
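To make the formulation concrete, the RVM optimization of > Eqs. 102-104 is a linear program and can be sketched with a generic LP solver. The code below is an illustrative toy (the chapter's experiments used CPLEX); the synthetic utility scores and the SciPy-based solver are assumptions, not part of the original method description:

```python
import numpy as np
from scipy.optimize import linprog
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
m, C, gamma = 15, 10.0, 0.5
X = rng.normal(0.0, 1.0, (m, 2))
scores = X[:, 0] - 0.5 * X[:, 1]                 # hidden utility: larger score = preferred

K = rbf_kernel(X, X, gamma=gamma)                # K[i, u] = K(x_i, x_u)
pairs = [(u, v) for u in range(m) for v in range(m) if scores[u] > scores[v]]
P = len(pairs)

# Variables z = [alpha_1..alpha_m, xi_1..xi_P]; objective: sum(alpha) + C * sum(xi).
c = np.hstack([np.ones(m), C * np.ones(P)])

# Constraint (103): sum_i alpha_i (K(x_i, x_u) - K(x_i, x_v)) >= 1 - xi_uv,
# rewritten in linprog's "A_ub @ z <= b_ub" form.
A_ub = np.zeros((P, m + P))
for k, (u, v) in enumerate(pairs):
    A_ub[k, :m] = -(K[:, u] - K[:, v])
    A_ub[k, m + k] = -1.0
b_ub = -np.ones(P)

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * (m + P), method="highs")
alpha = res.x[:m]

F = K.T @ alpha                                  # F(x_u) = sum_i alpha_i K(x_i, x_u)
correct = np.mean([F[u] > F[v] for (u, v) in pairs])
print("ranking vectors (alpha_i > 0):", int(np.sum(alpha > 1e-6)))
print("fraction of training pairs ordered correctly:", round(float(correct), 3))
```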
Another rationale for the RVM, that is, for using training vectors instead of pairwise difference vectors in the ranking function, is that the support vectors in the 1-norm ranking SVM are not the closest pairwise difference vectors, so expressing the ranking function with pairwise difference vectors is not as beneficial in the 1-norm ranking SVM. To explain this further, consider classifying SVMs. Unlike the 2-norm (classifying) SVM, the support vectors in the 1-norm (classifying) SVM are not limited to those close to the decision boundary. This makes it possible for the 1-norm (classifying) SVM to express a similar boundary function with fewer support vectors. Directly extended from the 2-norm (classifying) SVM, the 2-norm ranking SVM improves the generalization by maximizing the closest pairwise ranking difference, which corresponds to the margin in the 2-norm (classifying) SVM, as discussed in > Sect. 4. Thus, the 2-norm ranking SVM expresses the function with the closest pairwise difference vectors (i.e., the support vectors). However, the 1-norm ranking SVM improves the generalization by suppressing the coefficients \alpha, just as the 1-norm (classifying) SVM does. Thus, the support vectors in the 1-norm ranking SVM are no longer the closest pairwise difference vectors, and expressing the ranking function with pairwise difference vectors is not as beneficial in the 1-norm ranking SVM.
5.3 Experiment
This section evaluates the RVM on synthetic datasets (> Sect. 5.3.1) and a real-world dataset
(> Sect. 5.3.2). The RVM is compared with the state-of-the-art ranking SVM provided in
SVM-light. Experiment results show that the RVM trains substantially faster than the SVM-
light for nonlinear kernels while their accuracies are comparable. More importantly, the number
of ranking vectors in the RVM is multiple orders of magnitude smaller than the number of
support vectors in the SVM-light. Experiments are performed on a Windows XP Professional
machine with a Pentium IV 2.8 GHz and 1 GB of RAM. We implemented the RVM using C and
used CPLEX (http://www.ilog.com/products/cplex/) for the LP solver. The source codes are
freely available at http://iis.postech.ac.kr/rvm (Yu et al. 2008).
Evaluation Metric: MAP (mean average precision) is used to measure the ranking quality when there are only two classes of ranking (Yan et al. 2003), and NDCG is used to evaluate ranking performance for IR applications when there are multiple levels of ranking (Baeza-Yates and Ribeiro-Neto 1999; Burges et al. 2004; Cao et al. 2006; Xu and Li 2007). Kendall's \tau is used when there is a global ordering of the data and the training data is a subset of it. Ranking SVMs as well as the RVM minimize the amount of error or misranking, which corresponds to optimizing Kendall's \tau (Joachims 2002; Yu 2005). Thus, we use Kendall's \tau to compare their accuracies.
Kendall's \tau computes the overall accuracy by comparing the similarity of two orderings R^* and R_F. (R_F is the ordering of D according to the learned function F.) Kendall's \tau is defined based on the number of concordant pairs and discordant pairs. If R^* and R_F agree on how they order a pair, x_i and x_j, the pair is concordant; otherwise, it is discordant. The accuracy of a function F is defined as the number of concordant pairs between R^* and R_F divided by the total number of pairs in D, as follows:

F(R^*, R_F) = \frac{\text{number of concordant pairs}}{\binom{|R^*|}{2}}

For example, suppose R^* and R_F order five points x_1, \ldots, x_5 as follows:

(x_1, x_2, x_3, x_4, x_5) \in R^*
(x_3, x_2, x_1, x_4, x_5) \in R_F

Then the accuracy of F is 0.7, as the number of discordant pairs is 3, namely \{x_1, x_2\}, \{x_1, x_3\}, \{x_2, x_3\}, while all remaining seven pairs are concordant.
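The same computation in code (an illustrative sketch): the concordant-pair fraction can be counted directly, and it also equals (\tau + 1)/2, where \tau is the classical Kendall coefficient returned by scipy.stats.kendalltau (assuming no ties):

```python
from itertools import combinations
from scipy.stats import kendalltau

R_star = ["x1", "x2", "x3", "x4", "x5"]       # the true ordering R*
R_F    = ["x3", "x2", "x1", "x4", "x5"]       # the ordering produced by F

pos_star = {x: i for i, x in enumerate(R_star)}
pos_F    = {x: i for i, x in enumerate(R_F)}

pairs = list(combinations(R_star, 2))
concordant = sum((pos_star[a] - pos_star[b]) * (pos_F[a] - pos_F[b]) > 0 for a, b in pairs)
print(concordant / len(pairs))                # 0.7, as computed in the text

tau, _ = kendalltau([pos_star[x] for x in R_star], [pos_F[x] for x in R_star])
print((tau + 1) / 2)                          # also 0.7
```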
. Fig. 7
Accuracy: (a) Linear (b) RBF.
3. D_train and D_test are ranked according to F^*, which forms the global orderings R^*_train and R^*_test on the training and testing data.
4. We train a function F from R^*_train and test the accuracy of F on R^*_test.

We tuned the soft margin parameter C by trying C = 10^{-5}, 10^{-4}, \ldots, 10^{5} and used the highest accuracy for comparison. For the linear and RBF functions, we used linear and RBF kernels accordingly. This entire process is repeated 30 times to obtain the mean accuracy.
Accuracy: > Figure 7 compares the accuracies of the RVM and the ranking SVM from
the SVM-light. The ranking SVM outperforms RVM when the size of the dataset is small,
but their difference becomes trivial as the size of the dataset increases. This phenomenon can
be explained by the fact that when the training size is too small, the number of potential
ranking vectors becomes too small to draw an accurate ranking function whereas the number
of potential support vectors is still large. However, as the size of the training set increases, RVM
. Fig. 8
Training time: (a) Linear Kernel (b) RBF Kernel.
becomes as accurate as the ranking SVM because the number of potential ranking vectors
becomes large as well.
Training Time: > Figure 8 compares the training time of the RVM and the SVM-light.
While the SVM-light trains much faster than RVM for linear kernel (SVM-light is specially
optimized for linear kernels), the RVM trains significantly faster than the SVM-light for
RBF kernels.
Number of Support (or Ranking) Vectors: > Figure 9 compares the number of support (or
ranking) vectors used in the function of the RVM and the SVM-light. The RVM’s model uses a
significantly smaller number of support vectors than the SVM-light.
. Fig. 9
Number of support (or ranking) vectors: (a) Linear kernel (b) RBF kernel.

Sensitivity to Noise: In this experiment, the sensitivity of each method to noise is compared. Noise is inserted by switching the order of some data pairs in R^*_train. We set the size of the training set m_train = 100 and the dimension n = 5. After R^*_train is made from a random function F^*, k vectors are randomly picked from R^*_train and switched with their adjacent vectors in the ordering to implant noise in the training set.

. Fig. 10
Sensitivity to noise (m_train = 100): (a) Linear (b) RBF.

> Figure 10 shows the decrease in accuracy as the number of misorderings increases in the training set. The accuracies of both methods decrease moderately as the noise increases, and their sensitivities to noise are comparable.
In this section, we experiment using the OHSUMED dataset obtained from LETOR, the site
containing benchmark datasets for ranking (LETOR). OHSUMED is a collection of docu-
ments and queries on medicine, consisting of 348,566 references and 106 queries. There are in
total 16,140 query-document pairs upon which relevance judgments are made. In this dataset,
the relevance judgments have three levels: "definitely relevant," "partially relevant," and "irrelevant."

. Table 2
Experiment results: accuracy (Acc), training time (Time), and number of support or ranking vectors (#SV or #RV)

The OHSUMED dataset in LETOR extracts 25 features. We report our experi-
ments on the first three queries and their documents. We compare the performance of RVM
and SVM-light on them. We tuned the parameters using threefold cross-validation, trying C and \gamma = 10^{-6}, 10^{-5}, \ldots, 10^{6} for the linear and RBF kernels, and compared the highest performances. The training time is measured for training the model with the tuned parameters. The whole process was repeated three times and the mean values are reported.
> Table 2 shows the results. The accuracies of the SVM and RVM are comparable overall;
SVM shows slightly higher accuracy than RVM for query 1, but for the other queries their
accuracy differences are not statistically significant. More importantly, the number of ranking
vectors in RVM is significantly smaller than that of support vectors in SVM. For example, for
query 3, an RVM having just one ranking vector outperformed an SVM with over 150 support
vectors. The training time of the RVM is significantly shorter than that of SVM-light.
References

Baeza-Yates R, Ribeiro-Neto B (eds) (1999) Modern information retrieval. ACM Press, New York
Bertsekas DP (1995) Nonlinear programming. Athena Scientific, Belmont, MA
Burges C, Shaked T, Renshaw E, Lazier A, Deeds M, Hamilton N, Hullender G (2004) Learning to rank using gradient descent. In: Proceedings of the international conference on machine learning (ICML'04), Oregon State University, Corvallis, OR, USA
Burges CJC (1998) A tutorial on support vector machines for pattern recognition. Data Mining Knowl Discov 2:121–167
Cao B, Shen D, Sun JT, Yang Q, Chen Z (2007) Feature selection in a kernel space. In: Proceedings of the international conference on machine learning (ICML'07), Oregon State University, Corvallis, OR, USA
Cao Y, Xu J, Liu TY, Li H, Huang Y, Hon HW (2006) Adapting ranking SVM to document retrieval. In: Proceedings of the ACM SIGIR international conference on information retrieval (SIGIR'06), New York
Cho B, Yu H, Lee J, Chee Y, Kim I (2008) Nonlinear support vector machine visualization for risk factor analysis using nomograms and localized radial basis function kernels. IEEE Trans Inf Technol Biomed 12(2)
Christianini N, Shawe-Taylor J (2000) An introduction to support vector machines and other kernel-based learning methods. Cambridge University Press, Cambridge, UK
Cohen WW, Schapire RE, Singer Y (1998) Learning to order things. In: Proceedings of the advances in neural information processing systems (NIPS'98), Cambridge, MA
Friedman H (1998) Another approach to polychotomous classification. Tech. rep., Stanford University, Department of Statistics, Stanford, CA 10:1895–1924
Fung G, Mangasarian OL (2004) A feature selection Newton method for support vector machine classification. Comput Optim Appl 28:185–202
Guyon I, Elisseeff A (2003) An introduction to variable and feature selection. J Mach Learn Res 3:1157–1182
Hastie T, Tibshirani R (1998) Classification by pairwise coupling. In: Advances in neural information processing systems. MIT Press, Cambridge, MA
Herbrich R, Graepel T, Obermayer K (eds) (2000) Large margin rank boundaries for ordinal regression. MIT Press, Cambridge, MA
Joachims T (2002) Optimizing search engines using clickthrough data. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining (KDD'02), Paris, France
Joachims T (2006) Training linear SVMs in linear time. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining (KDD'06), Philadelphia, PA, USA
Liu T-Y (2009) Learning to rank for information retrieval. Found Trends Inf Retr 3(3):225–331
Mangasarian OL (2000) Generalized support vector machines. MIT Press, Cambridge, MA
Mangasarian OL (2006) Exact 1-norm support vector machines via unconstrained convex differentiable minimization. J Mach Learn Res 7:1517–1530
Mangasarian OL, Wild EW (1998) Feature selection for nonlinear kernel support vector machines. Tech. rep., University of Wisconsin, Madison
Platt J (1998) Fast training of support vector machines using sequential minimal optimization. In: Schölkopf B, Burges CJC (eds) Advances in kernel methods: support vector machines. MIT Press, Cambridge, MA
Schölkopf B, Herbrich R, Smola AJ, Williamson RC (2001) A generalized representer theorem. In: Proceedings of COLT, Amsterdam, The Netherlands
Smola AJ, Schölkopf B (1998) A tutorial on support vector regression. Tech. rep., NeuroCOLT2 Technical Report NC2-TR-1998-030
Vapnik V (1998) Statistical learning theory. John Wiley and Sons, New York
Xu J, Li H (2007) AdaRank: a boosting algorithm for information retrieval. In: Proceedings of the ACM SIGIR international conference on information retrieval (SIGIR'07), New York
Yan L, Dodier R, Mozer MC, Wolniewicz R (2003) Optimizing classifier performance via the Wilcoxon-Mann-Whitney statistics. In: Proceedings of the international conference on machine learning (ICML'03), Washington, DC
Yu H (2005) SVM selective sampling for ranking with application to data retrieval. In: Proceedings of the international conference on knowledge discovery and data mining (KDD'05), Chicago, IL
Yu H, Hwang SW, Chang KCC (2007) Enabling soft queries for data retrieval. Inf Syst 32:560–574
Yu H, Kim Y, Hwang SW (2008) RVM: an efficient method for learning ranking SVM. Tech. rep., Department of Computer Science and Engineering, Pohang University of Science and Technology (POSTECH), Pohang, Korea, http://iis.hwanjoyu.org/rvm
Yu H, Yang J, Wang W, Han J (2003) Discovering compact and highly discriminative features or feature combinations of drug activities using support vector machines. In: IEEE computer society bioinformatics conference (CSB'03), Stanford, CA, pp 220–228
Zhu J, Rosset S, Hastie T, Tibshirani R (2003) 1-norm support vector machines. In: Proceedings of the advances in neural information processing systems (NIPS'00), Berlin, Germany