SVM Tutorial
Abstract
Support vector machines (SVMs) have been extensively researched in the data mining and
machine learning communities for the last decade, and applied in various domains. They
represent a set of supervised learning techniques that create a function from training data, which usually consist of pairs of an input object (typically a vector) and a desired output. SVMs learn a function that generates the desired output given the input, and the learned function can be used to predict the output for a new object. They belong to a family of generalized linear classifiers where the classification (or boundary) function is a hyperplane
in the feature space. This chapter introduces the basic concepts and techniques of SVMs for
learning classification, regression, and ranking functions.
1 Introduction
Support vector machines are typically used for learning classification, regression, or ranking
functions, for which they are called classifying SVM, support vector regression (SVR), or
ranking SVM (RankSVM), respectively. Two special properties of SVMs are that they achieve
high generalization by maximizing the margin, and they support an efficient learning of
nonlinear functions using the kernel trick. This chapter introduces these general concepts
and techniques of SVMs for learning classification, regression, and ranking functions.
In particular, the SVMs for binary classification are first presented in > Sect. 2, SVR in > Sect. 3, ranking SVM in > Sect. 4, and another recently developed method for learning ranking functions, the ranking vector machine (RVM), in > Sect. 5.
2 SVM Classification
SVMs were initially developed for classification (Burges 1998) and have been extended for
regression (Smola and Schölkopf 1998) and preference (or rank) learning (Herbrich et al.
2000; Yu 2005). The initial form of SVMs is a binary classifier where the output of the learned
function is either positive or negative. A multiclass classification can be implemented by
combining multiple binary classifiers using the pairwise coupling method (Hastie and Tibshirani
1998; Friedman 1998). This section explains the motivation and formalization of SVM as a
binary classifier, and the two key properties – margin maximization and the kernel trick.
Binary SVMs are classifiers that discriminate data points of two categories. Each data
object (or data point) is represented by an n-dimensional vector. Each of these data points
belongs to only one of two classes. A linear classifier separates them with a hyperplane. For
example, > Fig. 1 shows two groups of data and separating hyperplanes that are lines in a two-
dimensional space. There are many linear classifiers that correctly classify (or divide) the two
groups of data such as L1, L2, and L3 in > Fig. 1. In order to achieve maximum separation
between the two classes, the SVM picks the hyperplane that has the largest margin. The margin is the sum of the shortest distances from the separating hyperplane to the nearest data points of the two categories. Such a hyperplane is likely to generalize better, meaning that the
hyperplane can correctly classify ‘‘unseen’’ or testing data points.
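To make the margin idea concrete, here is a small illustrative sketch (not from the original chapter) using scikit-learn's SVC on made-up two-dimensional data; with a very large C the soft-margin solver described later behaves essentially like the hard-margin SVM on separable data. Note that scikit-learn writes the hyperplane as w \cdot x + b = 0, whereas this chapter uses w \cdot x - b = 0.

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable groups of 2D points (toy data for illustration).
X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

# With a very large C, the soft-margin SVM approximates the hard-margin SVM
# on separable data: it picks the separating hyperplane with the largest margin.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]              # hyperplane: w . x + b = 0
print("w =", w, " b =", b)
print("margin width =", 2.0 / np.linalg.norm(w))    # distance between the two margin lines
print("support vectors:\n", clf.support_vectors_)
```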
. Fig. 1
Linear classifiers (hyperplanes) in a two-dimensional space.

SVMs support nonlinear classification problems by mapping the input space to a feature space. The kernel trick makes this possible without requiring an explicit formulation of the mapping function, which could otherwise introduce a curse-of-dimensionality problem. A linear classification in the new space (the feature space) is thus equivalent to a nonlinear classification in the original space (the input space). SVMs do this by mapping input vectors to a higher-dimensional space (the feature space) in which a maximal separating hyperplane is constructed.
To understand how SVMs compute the hyperplane of maximal margin and support nonlinear
classification, we first explain the hard-margin SVM where the training data is free of noise
and can be correctly classified by a linear function.
The data points D in > Fig. 1 (or training set) can be expressed mathematically as follows:
D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}    (1)

where x_i is an n-dimensional real vector and y_i is either 1 or -1, denoting the class to which the point x_i belongs. The SVM classification function F(x) takes the form

F(x) = w \cdot x - b    (2)
w is the weight vector and b is the bias, which will be computed by the SVM in the training
process.
First, to correctly classify the training set, F(·) (or w and b) must return positive numbers
for positive data points and negative numbers otherwise, that is, for every point xi in D,
w xi b > 0 if y i ¼ 1; and
w xi b < 0 if y i ¼ 1
. Fig. 2
SVM classification function: the hyperplane maximizing the margin in a two-dimensional space.
To correctly classify the training set with the maximal margin, the hard-margin SVM solves the following constrained optimization problem:

minimize:  \frac{1}{2} \|w\|^2    (6)

subject to:  y_i (w \cdot x_i - b) \geq 1, \; \forall (x_i, y_i) \in D    (7)

The constrained optimization problem of > Eqs. 6 and > 7 is called a primal problem. It is characterized as follows: the cost function is a convex function of w, and the constraints are linear in w. Accordingly, it can be solved using the method of Lagrange multipliers. We construct the Lagrange function

J(w, b, \alpha) = \frac{1}{2} w \cdot w - \sum_{i=1}^{m} \alpha_i \{ y_i (w \cdot x_i - b) - 1 \}    (8)

where the auxiliary nonnegative variables \alpha_i are called Lagrange multipliers. The solution to the
constrained optimization problem is determined by the saddle point of the Lagrange function J(w, b, \alpha), which has to be minimized with respect to w and b and maximized with respect to \alpha. Thus, differentiating J(w, b, \alpha) with respect to w and b and setting the results equal to zero, we obtain the following two conditions of optimality:

Condition 1:  \frac{\partial J(w, b, \alpha)}{\partial w} = 0    (9)

Condition 2:  \frac{\partial J(w, b, \alpha)}{\partial b} = 0    (10)

After rearrangement of terms, Condition 1 yields

w = \sum_{i=1}^{m} \alpha_i y_i x_i    (11)
The solution vector w is thus defined in terms of an expansion that involves the m training examples. Similarly, after rearrangement of terms, Condition 2 yields

\sum_{i=1}^{m} \alpha_i y_i = 0    (12)
As noted earlier, the primal problem deals with a convex cost function and linear
constraints. Given such a constrained optimization problem, it is possible to construct another
problem called the dual problem. The dual problem has the same optimal value as the primal
problem, but with the Lagrange multipliers providing the optimal solution.
To postulate the dual problem for the primal problem, > Eq. 8 is first expanded term by
term, as follows:
J(w, b, \alpha) = \frac{1}{2} w \cdot w - \sum_{i=1}^{m} \alpha_i y_i \, w \cdot x_i + b \sum_{i=1}^{m} \alpha_i y_i + \sum_{i=1}^{m} \alpha_i    (13)
The third term on the right-hand side of > Eq. 13 is zero by virtue of the optimality
condition of > Eq. 12. Furthermore, from > Eq. 11, we have
w \cdot w = \sum_{i=1}^{m} \alpha_i y_i \, w \cdot x_i = \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j    (14)
Accordingly, setting the objective function J(w, b, \alpha) = Q(\alpha), > Eq. 13 can be reformulated as

Q(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j    (15)
The dual problem can now be stated: find the Lagrange multipliers that

maximize:  Q(\alpha)    (16)

subject to:  \sum_{i=1}^{m} \alpha_i y_i = 0    (17)

             \alpha \geq 0    (18)
Note that the dual problem is cast entirely in terms of the training data. Moreover, the function Q(\alpha) to be maximized depends only on the input patterns in the form of the set of dot products \{x_i \cdot x_j\}_{i,j=1}^{m}.
Having determined the optimum Lagrange multipliers \alpha_i, the optimum weight vector w may be computed using > Eq. 11 and so can be written as

w = \sum_i \alpha_i y_i x_i    (19)
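The expansion in > Eq. 19 can be checked numerically. The sketch below is an illustration only (not part of the chapter) and assumes scikit-learn, whose dual_coef_ attribute stores the products \alpha_i y_i for the support vectors:

```python
import numpy as np
from sklearn.svm import SVC

# Toy separable data.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
              [3.0, 3.0], [4.0, 3.0], [3.0, 4.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)

# dual_coef_[0] holds alpha_i * y_i for each support vector, so Eq. 19,
# w = sum_i alpha_i y_i x_i, can be evaluated directly:
w_from_dual = clf.dual_coef_[0] @ clf.support_vectors_
print(np.allclose(w_from_dual, clf.coef_[0]))   # True: both expressions give the same w
```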
The discussion so far has focused on linearly separable cases. However, the optimization
problem > Eqs. 6 and > 7 will not have a solution if D is not linearly separable. To deal with
such cases, a soft margin SVM allows mislabeled data points while still maximizing the margin.
The method introduces slack variables, \xi_i, which measure the degree of misclassification. The
following is the optimization problem for a soft margin SVM.
minimize:  Q_1(w, b, \xi_i) = \frac{1}{2} \|w\|^2 + C \sum_i \xi_i    (23)

subject to:  y_i (w \cdot x_i - b) \geq 1 - \xi_i, \; \forall (x_i, y_i) \in D    (24)

             \xi_i \geq 0    (25)
Due to the \xi_i in > Eq. 24, data points are allowed to be misclassified, and the amount of
misclassification will be minimized while maximizing the margin according to the objective
function (> Eq. 23). C is a parameter that determines the trade-off between the margin size
and the amount of error in training.
Similarly to the case of hard-margin SVM, this primal form can be transformed to the
following dual form using the Lagrange multipliers:
maximize:  Q_2(\alpha) = \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j    (26)

subject to:  \sum_i \alpha_i y_i = 0    (27)

             C \geq \alpha \geq 0    (28)
Note that neither the slack variables \xi_i nor their Lagrange multipliers appear in the dual
problem. The dual problem for the case of nonseparable patterns is thus similar to that for the
simple case of linearly separable patterns except for a minor but important difference. The
objective function Q(\alpha) to be maximized is the same in both cases. The nonseparable case differs from the separable case in that the constraint \alpha_i \geq 0 is replaced with the more stringent constraint C \geq \alpha_i \geq 0. Except for this modification, the constrained optimization for the
nonseparable case and computations of the optimum values of the weight vector w and bias b
proceed in the same way as in the linearly separable case.
Just as for the hard-margin SVM, \alpha constitutes a dual representation for the weight vector, such that

w = \sum_{i=1}^{m_s} \alpha_i y_i x_i    (29)
where m_s is the number of support vectors, whose corresponding coefficients satisfy \alpha_i > 0. The determination of the optimum value of the bias also follows a procedure similar to that described before. Once \alpha and b are computed, the function in > Eq. 22 is used to classify new objects.
Relationships among \alpha, \xi, and C can be further disclosed using the Kuhn–Tucker conditions, which are defined by

\alpha_i \{ y_i (w \cdot x_i - b) - 1 + \xi_i \} = 0, \quad i = 1, 2, \ldots, m    (30)

and

\mu_i \xi_i = 0, \quad i = 1, 2, \ldots, m    (31)
> Eq. 30 is a rewrite of > Eq. 20 except that the unity term 1 is replaced by (1 - \xi_i). As for > Eq. 31, the \mu_i are Lagrange multipliers that have been introduced to enforce the nonnegativity of the slack variables \xi_i for all i. At the saddle point, the derivative of the Lagrange function for the primal problem with respect to the slack variable \xi_i is zero, the evaluation of which yields

\alpha_i + \mu_i = C    (32)

By combining > Eqs. 31 and > 32, we see that

\xi_i = 0 \quad \text{if } \alpha_i < C, \text{ and}    (33)
\xi_i \geq 0 \quad \text{if } \alpha_i = C    (34)
. Fig. 3
Graphical relationships among \alpha_i, \xi_i, and C.

We can graphically display the relationships among \alpha_i, \xi_i, and C in > Fig. 3. Data points outside the margin will have \alpha = 0 and \xi = 0, and those on the margin line will have C > \alpha > 0 and still \xi = 0. Data points within the margin will have \alpha = C. Among them, those correctly classified will have 1 > \xi > 0 and misclassified points will have \xi > 1.
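The relationships in > Fig. 3 can be observed empirically: support vectors sitting exactly on the margin have 0 < \alpha_i < C, while those inside the margin or misclassified saturate at \alpha_i = C. The following is an illustrative sketch only (overlapping toy data, scikit-learn's SVC):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two overlapping Gaussian clouds, so some slack is unavoidable.
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(1.5, 1.0, (50, 2))])
y = np.hstack([-np.ones(50), np.ones(50)])

C = 1.0
clf = SVC(kernel="linear", C=C).fit(X, y)

alphas = np.abs(clf.dual_coef_[0])        # |alpha_i y_i| = alpha_i for support vectors
on_margin = np.sum(alphas < C - 1e-8)     # 0 < alpha_i < C  ->  xi_i = 0 (on the margin)
at_bound  = np.sum(alphas >= C - 1e-8)    # alpha_i = C      ->  xi_i >= 0 (inside or misclassified)
print(f"support vectors: {alphas.size}, on margin: {on_margin}, at bound C: {at_bound}")
```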
If the training data is not linearly separable, there is no straight hyperplane that can separate
the classes. In order to learn a nonlinear function in that case, linear SVMs must be extended
to nonlinear SVMs for the classification of nonlinearly separable data. The process of finding
classification functions using nonlinear SVMs consists of two steps. First, the input vectors are
transformed into high-dimensional feature vectors where the training data can be linearly
separated. Then, SVMs are used to find the hyperplane of maximal margin in the new feature
space. The separating hyperplane becomes a linear function in the transformed feature space
but a nonlinear function in the original input space.
Let x be a vector in the n-dimensional input space and \varphi(\cdot) be a nonlinear mapping function from the input space to the high-dimensional feature space. The hyperplane representing the decision boundary in the feature space is defined as follows:

w \cdot \varphi(x) - b = 0    (35)
where w denotes a weight vector that can map the training data in the high-dimensional
feature space to the output space, and b is the bias. Using the \varphi(\cdot) function, the weight vector becomes

w = \sum_i \alpha_i y_i \varphi(x_i)    (36)
Furthermore, the dual problem of the soft-margin SVM (> Eq. 26) can be rewritten using
the mapping function on the data vectors as follows:
Q(\alpha) = \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j \, \varphi(x_i) \cdot \varphi(x_j)    (38)
The inner product \varphi(x_i) \cdot \varphi(x_j) can be replaced by a kernel function K(x_i, x_j) = \varphi(x_i) \cdot \varphi(x_j). Mercer's theorem ensures that the kernel function can always be expressed as the inner product between pairs of input vectors in some high-dimensional space; thus the inner product can be calculated using the kernel function with input vectors in the original space only, without transforming them into high-dimensional feature vectors.
The dual problem is now defined using the kernel function as follows:
maximize:  Q_2(\alpha) = \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j K(x_i, x_j)    (41)

subject to:  \sum_i \alpha_i y_i = 0    (42)

             C \geq \alpha \geq 0    (43)
Since K(\cdot) is computed in the input space, no feature transformation is actually performed and no \varphi(\cdot) is evaluated; thus the weight vector w = \sum_i \alpha_i y_i \varphi(x_i) is not explicitly computed either in nonlinear SVMs.
The following are popularly used kernel functions:
Polynomial: K(a, b) = (a \cdot b + 1)^d
Radial basis function (RBF): K(a, b) = \exp(-\gamma \|a - b\|^2)
Sigmoid: K(a, b) = \tanh(\kappa \, a \cdot b + c)
Note that a kernel function acts as a kind of similarity function between two vectors, with the output maximized when the two vectors are most similar. Because of this, SVMs can learn a function from data of any shape beyond vectors (such as trees or graphs) as long as a kernel (similarity) function can be computed between any pair of data objects. Further discussion of the properties of these kernel functions is beyond the scope of this chapter. Instead, we give an example of using the polynomial kernel for learning an XOR function in the following section.
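As a small sanity check of the kernel trick (an illustrative sketch, not from the chapter): for the degree-2 polynomial kernel K(a, b) = (a \cdot b + 1)^2, the explicit feature map is \varphi(x) = (1, x_1^2, \sqrt{2} x_1 x_2, x_2^2, \sqrt{2} x_1, \sqrt{2} x_2), the same map used in the XOR example of the next section, and the kernel value equals \varphi(a) \cdot \varphi(b):

```python
import numpy as np

def poly_kernel(a, b, d=2):
    # K(a, b) = (a . b + 1)^d, computed in the input space
    return (np.dot(a, b) + 1.0) ** d

def phi(x):
    # Explicit degree-2 feature map for 2D input (the map used in the XOR example).
    x1, x2 = x
    return np.array([1.0, x1**2, np.sqrt(2) * x1 * x2, x2**2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2])

a, b = np.array([0.3, -1.2]), np.array([2.0, 1.0])
print(poly_kernel(a, b))            # kernel value in the input space
print(np.dot(phi(a), phi(b)))       # same value via the explicit feature map
```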
To illustrate the procedure of training a nonlinear SVM function, assume that the training set
of > Table 1 is given.
. Table 1
XOR problem

x_1    x_2    y
-1     -1     -1
-1     +1     +1
+1     -1     +1
+1     +1     -1

> Figure 4 plots the training points in the two-dimensional input space. There is no linear function that can separate the two classes in the input space.

. Fig. 4
XOR problem.

For this problem, the polynomial kernel of degree two, K(a, b) = (a \cdot b + 1)^2, is used; its corresponding mapping function is \varphi(x) = (1, x_1^2, \sqrt{2} x_1 x_2, x_2^2, \sqrt{2} x_1, \sqrt{2} x_2).
Based on this mapping function, the objective function for the dual form can be derived
from > Eq. 41 as follows.
Q(\alpha) = \alpha_1 + \alpha_2 + \alpha_3 + \alpha_4 - \frac{1}{2} \left( 9\alpha_1^2 - 2\alpha_1\alpha_2 - 2\alpha_1\alpha_3 + 2\alpha_1\alpha_4 + 9\alpha_2^2 + 2\alpha_2\alpha_3 - 2\alpha_2\alpha_4 + 9\alpha_3^2 - 2\alpha_3\alpha_4 + 9\alpha_4^2 \right)    (48)
Optimizing Q(\alpha) with respect to the Lagrange multipliers yields the following set of simultaneous equations:

9\alpha_1 - \alpha_2 - \alpha_3 + \alpha_4 = 1
-\alpha_1 + 9\alpha_2 + \alpha_3 - \alpha_4 = 1
-\alpha_1 + \alpha_2 + 9\alpha_3 - \alpha_4 = 1
\alpha_1 - \alpha_2 - \alpha_3 + 9\alpha_4 = 1
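These four equations can be solved directly; a quick numerical check (a sketch with NumPy, not part of the original text) confirms that all four multipliers equal 1/8, the value that appears in > Eq. 49 below:

```python
import numpy as np

# Coefficient matrix of the four simultaneous equations above.
A = np.array([[ 9, -1, -1,  1],
              [-1,  9,  1, -1],
              [-1,  1,  9, -1],
              [ 1, -1, -1,  9]], dtype=float)
b = np.ones(4)

alpha = np.linalg.solve(A, b)
print(alpha)        # [0.125 0.125 0.125 0.125], i.e., alpha_i = 1/8 for every i
```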
From > Eq. 36, we find that the optimum weight vector is

w = \frac{1}{8} \left[ -\varphi(x_1) + \varphi(x_2) + \varphi(x_3) - \varphi(x_4) \right] = \left( 0, \; 0, \; -\frac{1}{\sqrt{2}}, \; 0, \; 0, \; 0 \right)^{T}    (49)
The bias b is 0 because the first element of w is 0. The optimal hyperplane becomes

w \cdot \varphi(x) = \left( 0, \; 0, \; -\frac{1}{\sqrt{2}}, \; 0, \; 0, \; 0 \right) \cdot \left( 1, \; x_1^2, \; \sqrt{2}\, x_1 x_2, \; x_2^2, \; \sqrt{2}\, x_1, \; \sqrt{2}\, x_2 \right)^{T} = 0    (50)
which reduces to

-x_1 x_2 = 0    (51)

This is the optimal hyperplane, the solution of the XOR problem. It makes the output y = -1 for both input points x_1 = x_2 = 1 and x_1 = x_2 = -1, and y = 1 for the input points x_1 = 1, x_2 = -1 and x_1 = -1, x_2 = 1. > Figure 5 shows the four points in the transformed feature space.
. Fig. 5
The four data points of the XOR problem in the transformed feature space.
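The same solution can be reproduced with an off-the-shelf SVM solver (an illustrative sketch; scikit-learn's polynomial kernel (gamma * a \cdot b + coef0)^degree reduces to (a \cdot b + 1)^2 with the parameters below, and the resulting decision function agrees in sign with -x_1 x_2):

```python
import numpy as np
from sklearn.svm import SVC

# The four XOR points and their labels (see Table 1 and Fig. 4).
X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])

# scikit-learn's polynomial kernel is (gamma * a.b + coef0)^degree = (a.b + 1)^2 here.
clf = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0, C=1e6).fit(X, y)

print(clf.predict(X))               # [-1  1  1 -1]: the XOR labeling is reproduced
print(clf.decision_function(X))     # approximately -x1*x2, i.e., about [-1, 1, 1, -1]
```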
3 SVM Regression
SVM regression (SVR) is a method to estimate a function that maps from an input object to a
real number based on training data. Similarly to the classifying SVM, SVR has the same
properties of the margin maximization and kernel trick for nonlinear mapping.
A training set for regression is represented as follows.
D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}    (52)

where x_i is an n-dimensional vector and y_i is the real-valued target for x_i. The SVR function F(x) maps an input vector x to the target y and takes the form

F(x) = w \cdot x + b    (53)

where w is the weight vector and b is the bias. The goal is to estimate the parameters (w and b) of the function that give the best fit to the data. An SVR function F(x) approximates all pairs (x_i, y_i) while keeping the differences between estimated values and real values within \varepsilon precision. That is, for every input vector x_i in D,

y_i - w \cdot x_i - b \leq \varepsilon    (54)
w \cdot x_i + b - y_i \leq \varepsilon    (55)
The margin is

margin = \frac{1}{\|w\|}    (56)

By minimizing \|w\|^2 to maximize the margin, the training in SVR becomes a constrained optimization problem, as follows:

minimize:  L(w) = \frac{1}{2} \|w\|^2    (57)

subject to:  y_i - w \cdot x_i - b \leq \varepsilon    (58)

             w \cdot x_i + b - y_i \leq \varepsilon    (59)
The solution of this problem does not allow any errors. To allow some errors, in order to deal with noise in the training data, the soft margin SVR uses slack variables \xi and \hat{\xi}. The optimization problem can then be revised as follows:

minimize:  L(w, \xi) = \frac{1}{2} \|w\|^2 + C \sum_i (\xi_i + \hat{\xi}_i), \quad C > 0    (60)

subject to:  y_i - w \cdot x_i - b \leq \varepsilon + \xi_i, \; \forall (x_i, y_i) \in D    (61)

             w \cdot x_i + b - y_i \leq \varepsilon + \hat{\xi}_i, \; \forall (x_i, y_i) \in D    (62)

             \xi_i, \hat{\xi}_i \geq 0    (63)
The constant C > 0 is the trade-off parameter between the margin size and the amount of error. The slack variables \xi and \hat{\xi} deal with infeasible constraints of the optimization problem by imposing a penalty on excess deviations that are larger than \varepsilon.
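To illustrate the roles of \varepsilon and C, the sketch below (illustrative only, on synthetic data) uses scikit-learn's SVR, which implements this \varepsilon-insensitive soft-margin formulation; points strictly inside the \varepsilon-tube do not become support vectors:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-3.0, 3.0, (60, 1)), axis=0)
y = 0.8 * X.ravel() + 0.3 + rng.normal(0.0, 0.1, 60)    # noisy linear target

# Points predicted within the epsilon tube incur no loss; only points on or
# outside the tube become support vectors (their slack is penalized through C).
svr = SVR(kernel="linear", C=1.0, epsilon=0.2).fit(X, y)

errors = np.abs(svr.predict(X) - y)
non_sv = np.setdiff1d(np.arange(len(X)), svr.support_)
print("support vectors:", len(svr.support_), "of", len(X))
# Non-support vectors lie inside the tube: at most epsilon (up to solver tolerance).
print("largest error among non-support vectors:", float(errors[non_sv].max()))
```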
To solve the optimization problem > Eq. 60, we can construct a Lagrange function from
the objective function with Lagrange multipliers as follows:
minimize:  L = \frac{1}{2} \|w\|^2 + C \sum_i (\xi_i + \hat{\xi}_i) - \sum_i (\eta_i \xi_i + \hat{\eta}_i \hat{\xi}_i)
              - \sum_i \alpha_i (\varepsilon + \xi_i - y_i + w \cdot x_i + b)
              - \sum_i \hat{\alpha}_i (\varepsilon + \hat{\xi}_i + y_i - w \cdot x_i - b)    (64)

\eta_i, \hat{\eta}_i \geq 0    (65)

\alpha_i, \hat{\alpha}_i \geq 0    (66)
where \eta_i, \hat{\eta}_i, \alpha_i, and \hat{\alpha}_i are the Lagrange multipliers, which must be nonnegative. The saddle point is found by setting the partial derivatives of L with respect to the primal variables (b, w, and the slack variables) to zero:
\frac{\partial L}{\partial b} = \sum_i (\alpha_i - \hat{\alpha}_i) = 0    (67)

\frac{\partial L}{\partial w} = w - \sum_i (\alpha_i - \hat{\alpha}_i) x_i = 0, \qquad w = \sum_i (\alpha_i - \hat{\alpha}_i) x_i    (68)

\frac{\partial L}{\partial \hat{\xi}_i} = C - \hat{\alpha}_i - \hat{\eta}_i = 0, \qquad \hat{\eta}_i = C - \hat{\alpha}_i    (69)
The optimization problem with inequality constraints can be changed to the following dual optimization problem by substituting > Eqs. 67, > 68, and > 69 into > Eq. 64:

maximize:  L(\alpha) = \sum_i y_i (\alpha_i - \hat{\alpha}_i) - \varepsilon \sum_i (\alpha_i + \hat{\alpha}_i)    (70)

                       - \frac{1}{2} \sum_i \sum_j (\alpha_i - \hat{\alpha}_i)(\alpha_j - \hat{\alpha}_j) \, x_i \cdot x_j    (71)

subject to:  \sum_i (\alpha_i - \hat{\alpha}_i) = 0    (72)

             0 \leq \alpha, \hat{\alpha} \leq C    (73)
The dual variables \eta_i, \hat{\eta}_i are eliminated in revising > Eq. 64 into > Eq. 70. > Eqs. 68 and > 69 can be rewritten as follows:

w = \sum_i (\alpha_i - \hat{\alpha}_i) x_i    (74)

\eta_i = C - \alpha_i    (75)

\hat{\eta}_i = C - \hat{\alpha}_i    (76)
where w is represented by a linear combination of the training vectors x_i. Accordingly, the SVR function F(x) becomes the following:

F(x) = \sum_i (\alpha_i - \hat{\alpha}_i) \, x_i \cdot x + b    (77)
> Eq. 77 maps the training vectors to target real values, allowing some errors, but it cannot handle the nonlinear SVR case. The same kernel trick can be applied by replacing the inner product of two vectors x_i, x_j with a kernel function K(x_i, x_j). The transformed feature space is usually high dimensional, and the SVR function in this space becomes nonlinear in the original input space. Using the kernel function K, the inner product in the transformed feature space can be computed as fast as the inner product x_i \cdot x_j in the original input space. The same kernel functions introduced in > Sect. 2.3 can be applied here.
Once the original inner product is replaced with a kernel function K, the remaining process for solving the optimization problem is very similar to that for the linear SVR. The optimization function of the linear case can be rewritten using the kernel function as follows:

maximize:  L(\alpha) = \sum_i y_i (\alpha_i - \hat{\alpha}_i) - \varepsilon \sum_i (\alpha_i + \hat{\alpha}_i)
                       - \frac{1}{2} \sum_i \sum_j (\alpha_i - \hat{\alpha}_i)(\alpha_j - \hat{\alpha}_j) K(x_i, x_j)    (78)

subject to:  \sum_i (\alpha_i - \hat{\alpha}_i) = 0    (79)

             \alpha_i \geq 0, \; \hat{\alpha}_i \geq 0    (80)

             0 \leq \alpha, \hat{\alpha} \leq C    (81)
Finally, using the kernel function, the SVR function F(x) becomes

F(x) = \sum_i (\alpha_i - \hat{\alpha}_i) K(x_i, x) + b    (82)
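> Eq. 82 is exactly what kernel SVR implementations evaluate at prediction time. The following sketch is an illustration only (not from the chapter) and assumes scikit-learn's SVR with an RBF kernel; its dual_coef_ attribute stores the per-support-vector differences of the two multipliers, so the prediction can be reassembled by hand:

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(1)
X = rng.uniform(-3.0, 3.0, (80, 1))
y = np.sin(X).ravel() + rng.normal(0.0, 0.1, 80)

gamma = 0.5
svr = SVR(kernel="rbf", gamma=gamma, C=10.0, epsilon=0.1).fit(X, y)

# Eq. 82: F(x) = sum_i (alpha_i - alpha_hat_i) K(x_i, x) + b.  scikit-learn keeps
# the per-support-vector coefficient differences in dual_coef_ and the bias in intercept_.
x_new = np.array([[0.7]])
K = rbf_kernel(svr.support_vectors_, x_new, gamma=gamma)     # K(x_i, x_new) for each SV
F = svr.dual_coef_[0] @ K[:, 0] + svr.intercept_[0]

print(float(F), float(svr.predict(x_new)[0]))                # the two values coincide
```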
4 SVM Ranking
Ranking SVM, learning a ranking (or preference) function, has resulted in various applications
in information retrieval (Herbrich et al. 2000; Joachims 2002; Yu et al. 2007). The task of
learning ranking functions is distinguished from that of learning classification functions as
follows:
1. While a training set in classification is a set of data objects and their class labels, in ranking a training set is an ordering of data. Let "A is preferred to B" be specified as "A \succ B." A training set for ranking SVM is denoted as R = \{(x_1, y_1), \ldots, (x_m, y_m)\}, where y_i is the ranking of x_i, that is, y_i < y_j if x_i \succ x_j.
2. Unlike a classification function, which outputs a distinct class for a data object, a ranking function outputs a score for each data object, from which a global ordering of the data is constructed. That is, the target function F(x_i) outputs a score such that F(x_i) > F(x_j) for any x_i \succ x_j.
Unless stated otherwise, R is assumed to be a strict ordering, which means that for every pair x_i and x_j in a set D, either x_i \succ_R x_j or x_j \succ_R x_i. However, it can be straightforwardly generalized to weak orderings. Let R^* be the optimal ranking of the data, in which the data are ordered perfectly according to the user's preference. A ranking function F is typically evaluated by how closely its ordering R_F approximates R^*.
Using the techniques of SVM, a global ranking function F can be learned from an ordering R. For now, assume F is a linear ranking function such that

\forall \{(x_i, x_j) : y_i < y_j \in R\} : \; F(x_i) > F(x_j) \iff w \cdot x_i > w \cdot x_j    (83)
A weight vector w that satisfies > Eq. 83 for as many training pairs as possible is learned by solving the following optimization problem:

minimize:  L_1(w, \xi) = \frac{1}{2} w \cdot w + C \sum \xi_{ij}    (84)

subject to:  w \cdot x_i \geq w \cdot x_j + 1 - \xi_{ij}, \; \forall \{(i, j) : y_i < y_j \in R\}    (85)

             \xi_{ij} \geq 0    (86)

By the constraint (> Eq. 85) and by minimizing the upper bound \sum \xi_{ij} in (> Eq. 84), the above optimization problem satisfies orderings on the training set R with minimal error. By minimizing w \cdot w, or equivalently by maximizing the margin (= \frac{1}{\|w\|}), it tries to maximize the generalization ability of the ranking function. We will explain how maximizing the margin corresponds to increasing the generalization of ranking in > Sect. 4.1. C is the soft margin parameter that controls the trade-off between the margin size and the training error.
By rearranging the constraint (> Eq. 85) we get
w \cdot (x_i - x_j) \geq 1 - \xi_{ij}    (87)
The optimization problem becomes equivalent to that of a classifying SVM on pairwise difference vectors (x_i - x_j). Thus, we can extend an existing SVM implementation to solve the
problem.
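Because > Eq. 87 has the form of a classification constraint on pairwise difference vectors, a linear ranking SVM can be sketched with an ordinary SVM solver. The code below is an illustration only (it is not the SVM-light implementation used later in the chapter); the hidden utility scores and the mirrored negative pairs are assumptions made for the sake of a runnable example:

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, (40, 3))
true_w = np.array([2.0, -1.0, 0.5])
scores = X @ true_w                     # hidden utility defining the gold ordering R

# Build pairwise difference vectors (x_i - x_j) for every pair where i is preferred to j.
diffs = np.array([X[i] - X[j]
                  for i in range(len(X)) for j in range(len(X))
                  if scores[i] > scores[j]])

# Classify the differences: mirrored negative copies keep the problem balanced and
# mimic the constraint w . (x_i - x_j) >= 1 - xi_ij of Eq. 87.
Xp = np.vstack([diffs, -diffs])
yp = np.hstack([np.ones(len(diffs)), -np.ones(len(diffs))])

rank_svm = LinearSVC(C=1.0, fit_intercept=False, max_iter=20000).fit(Xp, yp)
w = rank_svm.coef_[0]

learned = X @ w                         # F(x) = w . x gives the learned ranking
print("orderings agree:", bool(np.all(np.argsort(learned) == np.argsort(scores))))
```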
Note that the support vectors are the data pairs (x_i^s, x_j^s) such that constraint (> Eq. 87) is satisfied with the equality sign, that is, w \cdot (x_i^s - x_j^s) = 1 - \xi_{ij}. Unbounded support vectors are the ones on the margin (i.e., their slack variables \xi_{ij} = 0), and bounded support vectors are the ones within the margin (i.e., 1 > \xi_{ij} > 0) or misranked (i.e., \xi_{ij} > 1). As in the classifying SVM, the function F in ranking SVM is expressed only by the support vectors.
Similarly to the classifying SVM, the primal problem of ranking SVM can be transformed to the following dual problem using the Lagrange multipliers:

maximize:  L_2(\alpha) = \sum_{ij} \alpha_{ij} - \frac{1}{2} \sum_{ij} \sum_{uv} \alpha_{ij} \alpha_{uv} K(x_i - x_j, x_u - x_v)    (88)

subject to:  C \geq \alpha_{ij} \geq 0    (89)
Once transformed to the dual, the kernel trick can be applied to support a nonlinear ranking function. K(\cdot) is a kernel function, and \alpha_{ij} is a coefficient for the pairwise difference vector (x_i - x_j). Note that the kernel function is computed P^2 (\approx m^4) times, where P is the number of data pairs and m is the number of data points in the training set; thus solving the ranking SVM takes at least O(m^4) time. Fast training algorithms for ranking SVM have been proposed (Joachims 2006), but they are limited to linear kernels.
Once \alpha is computed, w can be written in terms of the pairwise difference vectors and their coefficients such that

w = \sum_{ij} \alpha_{ij} (x_i - x_j)    (90)

The ranking function F on a new vector z can be computed using the kernel function replacing the dot product as follows:

F(z) = w \cdot z = \sum_{ij} \alpha_{ij} (x_i - x_j) \cdot z = \sum_{ij} \alpha_{ij} K(x_i - x_j, z)    (91)
We now explain the margin-maximization of the ranking SVM, to reason about how the
ranking SVM generates a ranking function of high generalization. Some essential properties of
ranking SVM are first established. For convenience of explanation, it is assumed that a training
set R is linearly rankable and thus we use the hard-margin SVM, that is, \xi_{ij} = 0 for all (i, j) in the
objective (> Eq. 84) and the constraints (> Eq. 85).
. Fig. 6
Linear projection of four data points.

In the ranking formulation, from > Eq. 83, the linear ranking function F_w projects data vectors onto a weight vector w. For instance, > Fig. 6 illustrates linear projections of four vectors \{x_1, x_2, x_3, x_4\} onto two different weight vectors w_1 and w_2, respectively, in a two-dimensional space. Both F_{w_1} and F_{w_2} make the same ordering R for the four vectors, that is, x_1 \succ_R x_2 \succ_R x_3 \succ_R x_4. The ranking difference of two vectors (x_i, x_j) according to a ranking function F_w is denoted by the geometric distance of the two vectors projected onto w, that is, formulated as \frac{w \cdot (x_i - x_j)}{\|w\|}.
Corollary 2  The ranking function F, generated by the hard-margin ranking SVM, maximizes the minimal difference of any data pairs in ranking.

Proof  By minimizing w \cdot w, the ranking SVM maximizes the margin \delta = \frac{1}{\|w\|} = \frac{w \cdot (x_i^s - x_j^s)}{\|w\|}, where (x_i^s, x_j^s) are the support vectors that denote, from the proof of Corollary 1, the pair of minimal ranking difference. Thus, the ranking function maximizes the minimal difference of any data pairs in ranking.
The soft margin SVM allows bounded support vectors, whose \xi_{ij} > 0, as well as unbounded support vectors, whose \xi_{ij} = 0, in order to deal with noise and allow small errors for an R that is not completely linearly rankable. However, the objective function in (> Eq. 84) also minimizes the sum of the slacks, and thus the amount of error, and the support vectors are the close data pairs in the ranking. Thus, maximizing the margin generates the effect of maximizing the differences of close data pairs in the ranking.
From Corollaries 1 and 2, we observe that ranking SVM improves the generalization performance by maximizing the minimal ranking difference. For example, consider the two linear ranking functions F_{w_1} and F_{w_2} in > Fig. 6. Although the two weight vectors w_1 and w_2 make the same ordering, intuitively w_1 generalizes better than w_2 because the distance between the closest vectors on w_1 (i.e., d_1) is larger than that on w_2 (i.e., d_2). The SVM computes the weight vector w that maximizes the differences of close data pairs in the ranking. Ranking SVMs find a ranking function of high generalization in this way.
5 Ranking Vector Machine

This section presents another rank-learning method, the ranking vector machine (RVM), a revised 1-norm ranking SVM that is better for feature selection and more scalable to large datasets than the standard ranking SVM.
We first develop a 1-norm ranking SVM, a ranking SVM that is based on the 1-norm
objective function. (The standard ranking SVM is based on the 2-norm objective function.)
The 1-norm ranking SVM learns a function with far fewer support vectors than the standard SVM. Thereby, its testing is much faster than that of 2-norm SVMs, and it provides better feature selection properties. (The function of the 1-norm SVM is likely to utilize fewer
features by using fewer support vectors (Fung and Mangasarian 2004).) Feature selection is
also important in ranking. Ranking functions are relevance or preference functions in
document or data retrieval. Identifying key features increases the interpretability of the
function. Feature selection for nonlinear kernels is especially challenging, and the fewer the
number of support vectors, the more efficiently feature selection can be done (Guyon and
Elisseeff 2003; Mangasarian and Wild 1998; Cao et al. 2007; Yu et al. 2003; Cho et al. 2008).
We next present an RVM that revises the 1-norm ranking SVM for fast training. The RVM
trains much faster than standard SVMs while not compromising the accuracy when the
training set is relatively large. The key idea of the RVM is to express the ranking function with "ranking vectors" instead of support vectors. Support vectors in ranking SVMs are pairwise difference vectors of the closest pairs, as discussed in > Sect. 4. Thus, the training requires investigating every data pair as a potential candidate for a support vector, and the number of data pairs is quadratic in the size of the training set. On the other hand, the ranking function of
the RVM utilizes each training data object instead of data pairs. Thus, the number of variables
for optimization is substantially reduced in the RVM.
5.1 1-Norm Ranking SVM

The goal of the 1-norm ranking SVM is the same as that of the standard ranking SVM, that is, to learn F that satisfies > Eq. 83 for most \{(x_i, x_j) : y_i < y_j \in R\} and generalizes well beyond the training set. In the 1-norm ranking SVM, > Eq. 83 can be expressed using the F of > Eq. 91 as follows:
F(x_u) > F(x_v) \implies \sum_{ij}^{P} \alpha_{ij} (x_i - x_j) \cdot x_u > \sum_{ij}^{P} \alpha_{ij} (x_i - x_j) \cdot x_v    (92)

\implies \sum_{ij}^{P} \alpha_{ij} (x_i - x_j) \cdot (x_u - x_v) > 0    (93)
Then, replacing the inner product with a kernel function, the 1-norm ranking SVM is
formulated as
minimize:  L(\alpha, \xi) = \sum_{ij}^{P} \alpha_{ij} + C \sum_{ij}^{P} \xi_{ij}    (94)

subject to:  \sum_{ij}^{P} \alpha_{ij} K(x_i - x_j, x_u - x_v) \geq 1 - \xi_{uv}, \; \forall \{(u, v) : y_u < y_v \in R\}    (95)

             \alpha \geq 0, \; \xi \geq 0    (96)
While the standard ranking SVM suppresses the weight w to improve the generalization performance, the 1-norm ranking SVM suppresses \alpha in the objective function. Since the weight is expressed by the sum of the coefficients times the pairwise ranking difference vectors, suppressing the coefficients \alpha corresponds to suppressing the weight w in the standard SVM (Mangasarian proves this in Mangasarian (2000)). C is a user parameter controlling the trade-off between the margin size and the amount of error \xi, and K is the kernel function. P is the number of pairwise difference vectors (\approx m^2).
The training of the 1-norm ranking SVM becomes a linear programming (LP) problem, thus solvable by LP algorithms such as the simplex and interior point methods (Mangasarian 2000, 2006; Fung and Mangasarian 2004). Just as for the standard ranking SVM, K needs to be computed P^2 (\approx m^4) times, and there are P constraints (> Eq. 95) and \alpha values to compute. Once \alpha is computed, F is computed using the same ranking function as the standard ranking SVM, that is, > Eq. 91.
The accuracies of the 1-norm ranking SVM and standard ranking SVM are comparable,
and both methods need to compute the kernel function O(m^4) times. In practice, the training
of the standard SVM is more efficient because fast decomposition algorithms have been
developed such as sequential minimal optimization (SMO) (Platt 1998), while the 1-norm
ranking SVM uses common LP solvers.
It is shown that 1-norm SVMs use far fewer support vectors than standard 2-norm SVMs, that is, the number of positive coefficients (i.e., \alpha > 0) after training is much smaller in the
1-norm SVMs than in the standard 2-norm SVMs (Mangasarian 2006; Fung and Mangasarian
2004). This is because, unlike the standard 2-norm SVM, the support vectors in the 1-norm
SVM are not bounded to those close to the boundary in classification or the minimal ranking
difference vectors in ranking. Thus, the testing involves much fewer kernel evaluations, and it
is more robust when the training set contains noisy features (Zhu et al. 2003).
5.2 Ranking Vector Machine

Although the 1-norm ranking SVM has merits over the standard ranking SVM in terms of
the testing efficiency and feature selection, its training complexity is very high with respect to
the number of data points. In this section, we present an RVM that revises the 1-norm ranking
SVM to reduce the training time substantially. The RVM significantly reduces the number of
variables in the optimization problem while not compromising the accuracy. The key idea of
RVM is to express the ranking function with ‘‘ranking vectors’’ instead of support vectors. The
support vectors in ranking SVMs are chosen from pairwise difference vectors, and the number of pairwise difference vectors is quadratic in the size of the training set. On the other hand,
the ranking vectors are chosen from the training vectors, thus the number of variables to
optimize is substantially reduced.
To theoretically justify this approach, we first consider the representer theorem (Schölkopf et al. 2001): for a regularized risk functional of the form

c((x_1, y_1, f(x_1)), \ldots, (x_m, y_m, f(x_m))) + \Omega(\|f\|_H)    (97)

where c is an arbitrary loss function and \Omega is a strictly monotonically increasing function of the norm of f in the kernel-induced feature space H, each minimizer f admits a representation of the form

f(x) = \sum_{i=1}^{m} \alpha_i K(x_i, x)    (98)

Expressing the ranking function F in the form of > Eq. 98, the ranking condition of > Eq. 83 becomes

F(x_u) > F(x_v) \implies \sum_{i}^{m} \alpha_i K(x_i, x_u) > \sum_{i}^{m} \alpha_i K(x_i, x_v)    (99)

\implies \sum_{i}^{m} \alpha_i (K(x_i, x_u) - K(x_i, x_v)) > 0    (100)
The loss function considers couples of data points, penalizing misranked pairs; that is, it returns higher values as the number of misranked pairs increases. Thus, the loss function is order sensitive, and it is an instance of the function class c in > Eq. 97. We set the regularizer \Omega(\|f\|_H) = \sum_i^m \alpha_i (\alpha_i \geq 0), which is strictly monotonically increasing. Let P be the number of pairs (u, v) \in R such that y_u < y_v, and let \xi_{uv} = 1 - \sum_i^m \alpha_i (K(x_i, x_u) - K(x_i, x_v)). Then, the RVM is formulated as follows.
minimize:  L(\alpha, \xi) = \sum_{i}^{m} \alpha_i + C \sum_{ij}^{P} \xi_{ij}    (102)

subject to:  \sum_{i}^{m} \alpha_i (K(x_i, x_u) - K(x_i, x_v)) \geq 1 - \xi_{uv}, \; \forall \{(u, v) : y_u < y_v \in R\}    (103)

             \alpha \geq 0, \; \xi \geq 0    (104)
The solution of the optimization problem lies in the span of kernels centered on the training points (i.e., > Eq. 98), as suggested by the representer theorem. Just like the 1-norm ranking SVM, the RVM suppresses \alpha to improve generalization, and it enforces > Eq. 100 through constraint (> Eq. 103). Note that there are only m coefficients \alpha_i in the RVM. Thus, the kernel function is evaluated O(m^3) times, while the standard ranking SVM computes it O(m^4) times.
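To make the formulation concrete, the RVM optimization of > Eqs. 102-104 is a linear program and can be sketched with a generic LP solver. The code below is an illustrative toy (the chapter's experiments used CPLEX); the synthetic utility scores and the SciPy-based solver are assumptions, not part of the original method description:

```python
import numpy as np
from scipy.optimize import linprog
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
m, C, gamma = 15, 10.0, 0.5
X = rng.normal(0.0, 1.0, (m, 2))
scores = X[:, 0] - 0.5 * X[:, 1]                 # hidden utility: larger score = preferred

K = rbf_kernel(X, X, gamma=gamma)                # K[i, u] = K(x_i, x_u)
pairs = [(u, v) for u in range(m) for v in range(m) if scores[u] > scores[v]]
P = len(pairs)

# Variables z = [alpha_1..alpha_m, xi_1..xi_P]; objective: sum(alpha) + C * sum(xi).
c = np.hstack([np.ones(m), C * np.ones(P)])

# Constraint (103): sum_i alpha_i (K(x_i, x_u) - K(x_i, x_v)) >= 1 - xi_uv,
# rewritten in linprog's "A_ub @ z <= b_ub" form.
A_ub = np.zeros((P, m + P))
for k, (u, v) in enumerate(pairs):
    A_ub[k, :m] = -(K[:, u] - K[:, v])
    A_ub[k, m + k] = -1.0
b_ub = -np.ones(P)

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * (m + P), method="highs")
alpha = res.x[:m]

F = K.T @ alpha                                  # F(x_u) = sum_i alpha_i K(x_i, x_u)
correct = np.mean([F[u] > F[v] for (u, v) in pairs])
print("ranking vectors (alpha_i > 0):", int(np.sum(alpha > 1e-6)))
print("fraction of training pairs ordered correctly:", round(float(correct), 3))
```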
Another rationale for the RVM, that is, for using training vectors instead of pairwise difference vectors in the ranking function, is that the support vectors in the 1-norm ranking SVM are not the closest pairwise difference vectors, so expressing the ranking function with pairwise difference vectors is not as beneficial in the 1-norm ranking SVM. To explain this further, consider classifying SVMs. Unlike the 2-norm (classifying) SVM, the support vectors in the 1-norm (classifying) SVM are not limited to those close to the decision boundary. This makes it possible for the 1-norm (classifying) SVM to express a similar boundary function with fewer support vectors. Directly extended from the 2-norm (classifying) SVM, the 2-norm ranking SVM improves the generalization by maximizing the closest pairwise ranking difference, which corresponds to the margin in the 2-norm (classifying) SVM, as discussed in > Sect. 4. Thus, the 2-norm ranking SVM expresses the function with the closest pairwise difference vectors (i.e., the support vectors). However, the 1-norm ranking SVM improves the generalization by suppressing the coefficients \alpha, just as the 1-norm (classifying) SVM does. Thus, the support vectors in the 1-norm ranking SVM are no longer the closest pairwise difference vectors, and expressing the ranking function with pairwise difference vectors is not as beneficial in the 1-norm ranking SVM.
5.3 Experiment
This section evaluates the RVM on synthetic datasets (> Sect. 5.3.1) and a real-world dataset
(> Sect. 5.3.2). The RVM is compared with the state-of-the-art ranking SVM provided in
SVM-light. Experiment results show that the RVM trains substantially faster than the SVM-
light for nonlinear kernels while their accuracies are comparable. More importantly, the number
of ranking vectors in the RVM is multiple orders of magnitude smaller than the number of
support vectors in the SVM-light. Experiments are performed on a Windows XP Professional
machine with a Pentium IV 2.8 GHz and 1 GB of RAM. We implemented the RVM using C and
used CPLEX (http://www.ilog.com/products/cplex/) for the LP solver. The source codes are
freely available at http://iis.postech.ac.kr/rvm (Yu et al. 2008).
Evaluation Metric: MAP (mean average precision) is used to measure the ranking quality when there are only two classes of ranking (Yan et al. 2003), and NDCG is used to evaluate ranking performance for IR applications when there are multiple levels of ranking (Baeza-Yates and Ribeiro-Neto 1999; Burges et al. 2004; Cao et al. 2006; Xu and Li 2007). Kendall's \tau is used when there is a global ordering of the data and the training data is a subset of it. Ranking SVMs as well as the RVM minimize the amount of error or misranking, which corresponds to optimizing Kendall's \tau (Joachims 2002; Yu 2005). Thus, we use Kendall's \tau to compare their accuracies.
Kendall's \tau computes the overall accuracy by comparing the similarity of two orderings R^* and R_F. (R_F is the ordering of D according to the learned function F.) Kendall's \tau is defined based on the number of concordant pairs and discordant pairs. If R^* and R_F agree on how they order a pair, x_i and x_j, the pair is concordant; otherwise, it is discordant. The accuracy of a function F is defined as the number of concordant pairs between R^* and R_F divided by the total number of pairs in D, as follows:

F(R^*, R_F) = \frac{\text{number of concordant pairs}}{\binom{|R^*|}{2}}

For example, suppose R^* and R_F order five points x_1, \ldots, x_5 as follows:

(x_1, x_2, x_3, x_4, x_5) \in R^*
(x_3, x_2, x_1, x_4, x_5) \in R_F

Then the accuracy of F is 0.7, as the number of discordant pairs is 3, namely \{x_1, x_2\}, \{x_1, x_3\}, \{x_2, x_3\}, while all remaining seven pairs are concordant.
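The same computation in code (an illustrative sketch): the concordant-pair fraction can be counted directly, and it also equals (\tau + 1)/2, where \tau is the classical Kendall coefficient returned by scipy.stats.kendalltau (assuming no ties):

```python
from itertools import combinations
from scipy.stats import kendalltau

R_star = ["x1", "x2", "x3", "x4", "x5"]       # the true ordering R*
R_F    = ["x3", "x2", "x1", "x4", "x5"]       # the ordering produced by F

pos_star = {x: i for i, x in enumerate(R_star)}
pos_F    = {x: i for i, x in enumerate(R_F)}

pairs = list(combinations(R_star, 2))
concordant = sum((pos_star[a] - pos_star[b]) * (pos_F[a] - pos_F[b]) > 0 for a, b in pairs)
print(concordant / len(pairs))                # 0.7, as computed in the text

tau, _ = kendalltau([pos_star[x] for x in R_star], [pos_F[x] for x in R_star])
print((tau + 1) / 2)                          # also 0.7
```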
. Fig. 7
Accuracy: (a) Linear (b) RBF.
3. D_train and D_test are ranked according to F^*, which forms the global orderings R^*_train and R^*_test on the training and testing data.
4. We train a function F from R^*_train and test the accuracy of F on R^*_test.

We tuned the soft margin parameter C by trying C = 10^{-5}, 10^{-4}, \ldots, 10^{5} and used the highest accuracy for comparison. For the linear and RBF functions, we used linear and RBF kernels accordingly. This entire process is repeated 30 times to obtain the mean accuracy.
Accuracy: > Figure 7 compares the accuracies of the RVM and the ranking SVM from
the SVM-light. The ranking SVM outperforms RVM when the size of the dataset is small,
but their difference becomes trivial as the size of the dataset increases. This phenomenon can
be explained by the fact that when the training size is too small, the number of potential
ranking vectors becomes too small to draw an accurate ranking function whereas the number
of potential support vectors is still large. However, as the size of the training set increases, RVM
. Fig. 8
Training time: (a) Linear Kernel (b) RBF Kernel.
becomes as accurate as the ranking SVM because the number of potential ranking vectors
becomes large as well.
Training Time: > Figure 8 compares the training time of the RVM and the SVM-light.
While the SVM-light trains much faster than RVM for linear kernel (SVM-light is specially
optimized for linear kernels), the RVM trains significantly faster than the SVM-light for
RBF kernels.
Number of Support (or Ranking) Vectors: > Figure 9 compares the number of support (or
ranking) vectors used in the function of the RVM and the SVM-light. The RVM’s model uses a
significantly smaller number of support vectors than the SVM-light.
. Fig. 9
Number of support (or ranking) vectors: (a) Linear kernel (b) RBF kernel.

Sensitivity to Noise: In this experiment, the sensitivity of each method to noise is compared. Noise is inserted by switching the order of some data pairs in R^*_train. We set the size of the training set m_train = 100 and the dimension n = 5. After R^*_train is made from a random function F^*, k vectors are randomly picked from R^*_train and switched with their adjacent vectors in the ordering to implant noise in the training set.

. Fig. 10
Sensitivity to noise (m_train = 100): (a) Linear (b) RBF.

> Figure 10 shows the decrease in accuracy as the number of misorderings increases in the training set. The accuracies of both methods decrease moderately as the noise increases, and their sensitivities to noise are comparable.
In this section, we experiment using the OHSUMED dataset obtained from LETOR, the site
containing benchmark datasets for ranking (LETOR). OHSUMED is a collection of docu-
ments and queries on medicine, consisting of 348,566 references and 106 queries. There are in
total 16,140 query-document pairs upon which relevance judgments are made. In this dataset,
the relevance judgments have three levels: "definitely relevant," "partially relevant," and "irrelevant."

. Table 2
Experiment results: accuracy (Acc), training time (Time), and number of support or ranking vectors (#SV or #RV)

The OHSUMED dataset in LETOR extracts 25 features. We report our experi-
ments on the first three queries and their documents. We compare the performance of RVM
and SVM-light on them. We tuned the parameters using threefold cross-validation, trying C and \gamma = 10^{-6}, 10^{-5}, \ldots, 10^{6} for the linear and RBF kernels, and compared the highest performances. The training time is measured for training the model with the tuned parameters. The whole process was repeated three times and the mean values are reported.
> Table 2 shows the results. The accuracies of the SVM and RVM are comparable overall;
SVM shows slightly higher accuracy than RVM for query 1, but for the other queries their
accuracy differences are not statistically significant. More importantly, the number of ranking
vectors in RVM is significantly smaller than that of support vectors in SVM. For example, for
query 3, an RVM having just one ranking vector outperformed an SVM with over 150 support
vectors. The training time of the RVM is significantly shorter than that of SVM-light.
References

Baeza-Yates R, Ribeiro-Neto B (eds) (1999) Modern information retrieval. ACM Press, New York
Bertsekas DP (1995) Nonlinear programming. Athena Scientific, Belmont, MA
Burges C, Shaked T, Renshaw E, Lazier A, Deeds M, Hamilton N, Hullender G (2004) Learning to rank using gradient descent. In: Proceedings of the international conference on machine learning (ICML'04), Oregon State University, Corvallis, OR, USA
Burges CJC (1998) A tutorial on support vector machines for pattern recognition. Data Mining Knowl Discov 2:121–167
Cao B, Shen D, Sun JT, Yang Q, Chen Z (2007) Feature selection in a kernel space. In: Proceedings of the international conference on machine learning (ICML'07), Oregon State University, Corvallis, OR, USA
Cao Y, Xu J, Liu TY, Li H, Huang Y, Hon HW (2006) Adapting ranking SVM to document retrieval. In: Proceedings of the ACM SIGIR international conference on information retrieval (SIGIR'06), New York
Cho B, Yu H, Lee J, Chee Y, Kim I (2008) Nonlinear support vector machine visualization for risk factor analysis using nomograms and localized radial basis function kernels. IEEE Trans Inf Technol Biomed 12(2)
Christianini N, Shawe-Taylor J (2000) An introduction to support vector machines and other kernel-based learning methods. Cambridge University Press, Cambridge, UK
Cohen WW, Schapire RE, Singer Y (1998) Learning to order things. In: Proceedings of the advances in neural information processing systems (NIPS'98), Cambridge, MA
Friedman H (1998) Another approach to polychotomous classification. Tech. rep., Stanford University, Department of Statistics, Stanford, CA 10:1895–1924
Fung G, Mangasarian OL (2004) A feature selection Newton method for support vector machine classification. Comput Optim Appl 28:185–202
Guyon I, Elisseeff A (2003) An introduction to variable and feature selection. J Mach Learn Res 3:1157–1182
Hastie T, Tibshirani R (1998) Classification by pairwise coupling. In: Advances in neural information processing systems. MIT Press, Cambridge, MA
Herbrich R, Graepel T, Obermayer K (eds) (2000) Large margin rank boundaries for ordinal regression. MIT Press, Cambridge, MA
Joachims T (2002) Optimizing search engines using clickthrough data. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining (KDD'02), Paris, France
Joachims T (2006) Training linear SVMs in linear time. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining (KDD'06), Philadelphia, PA, USA
Liu T-Y (2009) Learning to rank for information retrieval. Found Trends Inf Retr 3(3):225–331
Mangasarian OL (2000) Generalized support vector machines. MIT Press, Cambridge, MA
Mangasarian OL (2006) Exact 1-norm support vector machines via unconstrained convex differentiable minimization. J Mach Learn Res 7:1517–1530
Mangasarian OL, Wild EW (1998) Feature selection for nonlinear kernel support vector machines. Tech. rep., University of Wisconsin, Madison
Platt J (1998) Fast training of support vector machines using sequential minimal optimization. In: Schölkopf B, Burges CJC (eds) Advances in kernel methods: support vector machines. MIT Press, Cambridge, MA
Schölkopf B, Herbrich R, Smola AJ, Williamson RC (2001) A generalized representer theorem. In: Proceedings of COLT, Amsterdam, The Netherlands
Smola AJ, Schölkopf B (1998) A tutorial on support vector regression. Tech. rep., NeuroCOLT2 Technical Report NC2-TR-1998-030
Vapnik V (1998) Statistical learning theory. John Wiley and Sons, New York
Xu J, Li H (2007) AdaRank: a boosting algorithm for information retrieval. In: Proceedings of the ACM SIGIR international conference on information retrieval (SIGIR'07), New York
Yan L, Dodier R, Mozer MC, Wolniewicz R (2003) Optimizing classifier performance via the Wilcoxon-Mann-Whitney statistics. In: Proceedings of the international conference on machine learning (ICML'03), Washington, DC
Yu H (2005) SVM selective sampling for ranking with application to data retrieval. In: Proceedings of the international conference on knowledge discovery and data mining (KDD'05), Chicago, IL
Yu H, Hwang SW, Chang KCC (2007) Enabling soft queries for data retrieval. Inf Syst 32:560–574
Yu H, Kim Y, Hwang SW (2008) RVM: an efficient method for learning ranking SVM. Tech. rep., Department of Computer Science and Engineering, Pohang University of Science and Technology (POSTECH), Pohang, Korea, http://iis.hwanjoyu.org/rvm
Yu H, Yang J, Wang W, Han J (2003) Discovering compact and highly discriminative features or feature combinations of drug activities using support vector machines. In: IEEE computer society bioinformatics conference (CSB'03), Stanford, CA, pp 220–228
Zhu J, Rosset S, Hastie T, Tibshirani R (2003) 1-norm support vector machines. In: Proceedings of the advances in neural information processing systems (NIPS'00), Berlin, Germany