Twin Support Vector Machines: Models, Extensions and Applications
Jayadeva
Reshma Khemchandani
Suresh Chandra
Studies in Computational Intelligence
Volume 659
Series editor
Janusz Kacprzyk, Polish Academy of Sciences, Warsaw, Poland
e-mail: [email protected]
Jayadeva
Department of Electrical Engineering
Indian Institute of Technology (IIT)
New Delhi, India

Suresh Chandra
Department of Mathematics
Indian Institute of Technology (IIT)
New Delhi, India

Reshma Khemchandani
Department of Computer Science, Faculty of Mathematics and Computer Science
South Asian University (SAU)
New Delhi, India
To my loving mother
Mrs. Shilpa Khemchandani, whose beautiful
memories will always stay with me,
my family and my teachers.
—Reshma Khemchandani
Preface
Support vector machines (SVMs), introduced by Vapnik in 1998, have proven to be a very effective computational tool in machine learning. SVMs have outperformed most other computational intelligence methodologies, mainly because they
are based on sound mathematical principles of statistical learning theory and
optimization theory. SVMs have been applied successfully to a wide spectrum of
areas, ranging from pattern recognition, text categorization, biomedicine, bioin-
formatics and brain–computer interface to financial time series forecasting.
In 2007, Jayadeva, Khemchandani and Chandra proposed a novel classifier called the twin support vector machine (TWSVM) for binary data classification.
TWSVM generates two non-parallel hyperplanes by solving a pair of smaller-sized
quadratic programming problems (QPPs) such that each hyperplane is closer to one
class and as far as possible from the other class. The strategy of solving a pair of
smaller-sized QPPs, rather than a single large one, makes the learning speed of
TWSVM approximately four times faster than the standard SVM.
Over the years, TWSVM has become a popular machine learning tool because of
its low computational complexity. Not only has TWSVM been applied to a wide spectrum of areas, but many researchers have also proposed new variants of TWSVM for classification, clustering (TWSVC) and regression (TWSVR) scenarios.
This monograph presents a systematic and focused study of the various aspects
of TWSVM and related developments for classification, clustering and regression.
Apart from presenting most of the basic models of TWSVM, TWSVC and TWSVR
available in the literature, a special effort has been made to include important and
challenging applications of the tool. A chapter on “Some Additional Topics” has
been included to discuss topics of kernel optimization and support tensor machines
which are comparatively new, but have great potential in applications.
After presenting an overview of support vector machines in Chap. 1 and devoting an introductory chapter (Chap. 2) to generalized eigenvalue proximal support vector machines (GEPSVM), the main content related to TWSVMs is presented in Chaps. 3–8. Here, Chap. 8 is fully devoted to applications.
This monograph is primarily addressed to graduate students and researchers in
the area of machine learning and related topics in computer science, mathematics
Contents

1 Introduction
  1.1 Support Vector Machines: An Overview
  1.2 The Classical L1-Norm SVM
    1.2.1 Linear SVM: Hard Margin Classifier
    1.2.2 Linear SVM: Soft Margin Classifier
    1.2.3 Nonlinear/Kernel SVM
  1.3 Least Squares SVM and Proximal SVM
  1.4 Support Vector Regression
  1.5 Efficient Algorithms for SVM and SVR
  1.6 Other Approaches to Solving the SVM QPP
    1.6.1 The Relaxed SVM
    1.6.2 The Relaxed LSSVM
    1.6.3 Solving the Relaxed SVM and the Relaxed LSSVM
  1.7 Conclusions
  References
2 Generalized Eigenvalue Proximal Support Vector Machines
  2.1 Introduction
  2.2 GEPSVM for Classification
    2.2.1 Linear GEPSVM Classifier
    2.2.2 Nonlinear GEPSVM Classifier
  2.3 Some Variants of GEPSVM for Classification
    2.3.1 ReGEPSVM Formulation
    2.3.2 Improved GEPSVM Formulation
  2.4 GEPSVR: Generalized Eigenvalue Proximal Support Vector Regression
    2.4.1 GEPSVR Formulation
    2.4.2 Regularized GEPSVR Formulation
    2.4.3 Experimental Results
  2.5 Conclusions
  References
Chapter 1
Introduction

1.1 Support Vector Machines: An Overview

Support Vector Machines (SVMs) [1] have emerged as the paradigm of
choice for classification tasks over the last two decades. SVMs emerged from some
celebrated work on statistical learning theory. Prior to SVMs, multilayer neural net-
work architectures were applied widely for a variety of learning tasks. Most learning
algorithms for multilayer neural networks focussed on minimizing the classification
error on training samples. A few efforts focussed on generalization and reducing the
error on test samples. Some of the recent works, such as Optimal Brain Surgeon and
Optimal Brain Damage, were related to pruning trained networks in order to improve
generalization. However, the advent of SVMs radically changed the machine learning
landscape. SVMs addressed generalization using a sound theoretical framework and
showed that the generalization error was related to the margin of a hyperplane clas-
sifier. In order to better appreciate the motivation for Twin Support Vector Machines
(TWSVM), we briefly discuss the ideas behind Support Vector Machines. The classical L1-norm SVM, which minimizes the L1-norm of the error vector, was followed by formulations that minimize the L2-norm of the error vector. These include the
Least Squares SVM (LSSVM) proposed by Suykens and Vandewalle [2], and the
Proximal SVM (PSVM) proposed by Fung and Mangasarian [3]. The two formula-
tions are similar in spirit, but have minor differences. The classical L1-norm SVM and its least squares counterparts suffer from one drawback: the formulations work well for binary classification when the two classes are balanced, but for imbalanced classes or datasets that are not linearly separable, test set accuracies tend to be poor. One can
construct synthetic datasets to reveal that in this scenario, the classifying hyperplane
tends to increasingly ignore the smaller class, and that test set accuracies drop for
the smaller class with increasing imbalance in the class sizes.
The Generalized Eigenvalue Proximal SVM (GEPSVM) is an attempt to include
information within samples of the same class and between samples of different
classes. It represents a radically different approach to the classical SVM, and is less
sensitive to class size imbalance. The GEPSVM leads to solving a generalized eigenvalue problem, and no longer requires solving a Quadratic Programming Problem as in the case of the L1-norm and L2-norm SVMs. The GEPSVM is an important milestone in the
journey from SVMs to the TWSVM, and is discussed in more detail in Chap. 2.
This chapter consists of the following main sections: The Classical L1-Norm SVM,
Least Squares SVM and Proximal SVM, Support Vector Regression (SVR), Efficient
Algorithms for SVM and SVR, and other approaches for solving the quadratic pro-
gramming problems arising in the SVM type formulations. The presentation of this
chapter is based on the work of Bi and Bennett [4], Suykens and Vandewalle [2],
Fung and Mangasarian [3], Burges [5], Sastry [6], Deng et al. [7].
1.2 The Classical L1-Norm SVM

1.2.1 Linear SVM: Hard Margin Classifier

Consider a binary classification dataset in which the training samples are denoted by

$$T_C = \{(x^{(1)}, y_1), (x^{(2)}, y_2), \ldots, (x^{(m)}, y_m)\}, \quad x^{(i)} \in \mathbb{R}^n, \; y_i \in \{-1, +1\}. \tag{1.1}$$

Let A (respectively B) denote the matrix whose rows are the training samples of class +1 (respectively class −1). The aim is to determine a hyperplane

$$w^T x + b = 0 \tag{1.2}$$

with maximum margin that separates samples of the two classes. Let us initially
consider the case when the samples of A and B are linearly separable, i.e., samples
of A lie on one side of the separating hyperplane, and samples of B lie on the other
side. To determine the classifier w T x + b = 0 for the linearly separable dataset Tc ,
Vapnik [1] proposed the principle of maximum margin. According to this principle,
we should determine w ∈ Rn , b ∈ R so that the margin, i.e. the distance between the
supporting hyperplanes w^T x + b = 1 and w^T x + b = −1 is maximum. A direct calculation yields that the margin equals 2/||w||, where ||w|| denotes the L2-norm of w ∈ R^n and is given by ||w||² = w^T w.
The problem of determining the maximum margin classifier is therefore

$$\max_{(w,b)} \; \frac{2}{\|w\|} \quad \text{subject to} \quad Aw + eb \geq 1, \;\; Bw + eb \leq -1, \tag{1.3}$$

which is equivalent to

$$\min_{(w,b)} \; \|w\| \quad \text{subject to} \quad Aw + eb \geq 1, \;\; Bw + eb \leq -1. \tag{1.4}$$
Since ||w||, the norm of the weight vector, is non-negative, we can also replace the objective function by its square, yielding the problem

$$\min_{(w,b)} \; \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad Aw + eb \geq 1, \;\; Bw + eb \leq -1. \tag{1.5}$$
Writing the constraints in terms of the class labels y_i, this is

$$\min_{(w,b)} \; \frac{1}{2} w^T w \quad \text{subject to} \quad y_i(w^T x^{(i)} + b) \geq 1, \quad (i = 1, 2, \ldots, m). \tag{1.6}$$
Here it may be noted that problem (1.6) has as many constraints as the number of
training samples, i.e., m. Since in practice m could be very large, it makes sense to
write the Wolfe dual [8, 9] of problem (1.6), which comes out to be
$$\max_{\alpha} \; -\frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} y_i y_j (x^{(i)})^T x^{(j)} \alpha_i \alpha_j + \sum_{j=1}^{m} \alpha_j$$
$$\text{subject to} \quad \sum_{i=1}^{m} y_i \alpha_i = 0, \qquad \alpha_i \geq 0, \quad (i = 1, 2, \ldots, m). \tag{1.7}$$
Let α* = (α1*, α2*, ..., αm*) be an optimal solution of (1.7). Then, using the Karush–Kuhn–Tucker (K.K.T.) [9] conditions, it can be shown that the hard margin classifier is x^T w* + b* = 0, where

$$w^* = \sum_{i=1}^{m} \alpha_i^* y_i x^{(i)}, \tag{1.8}$$

$$b^* = y_j - \sum_{i=1}^{m} \alpha_i^* y_i (x^{(i)})^T x^{(j)}. \tag{1.9}$$

Here, x^(j) is any pattern for which α_j* > 0. All patterns x^(j) for which α_j* > 0 are called support vectors. In view of the K.K.T. conditions for problem (1.6), we get y_j((w*)^T x^(j) + b*) = 1, i.e., x^(j) lies on one of the bounding hyperplanes w^T x + b = 1 and w^T x + b = −1. Since there can be many support vectors, in practice b* is taken as the average of all such values of b*.
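As an illustration of Eqs. (1.8) and (1.9), the following minimal NumPy sketch (not from the text; the function and variable names are my own) recovers w* and b* from a given vector of optimal dual multipliers.

```python
import numpy as np

def recover_primal(X, y, alpha, tol=1e-8):
    """Recover (w*, b*) of the hard margin classifier from optimal
    dual multipliers alpha, following Eqs. (1.8)-(1.9).

    X     : (m, n) array of training samples x^(i)
    y     : (m,)  array of labels in {-1, +1}
    alpha : (m,)  array of optimal dual variables
    """
    # Eq. (1.8): w* = sum_i alpha_i* y_i x^(i)
    w = (alpha * y) @ X
    # Support vectors are the patterns with alpha_j* > 0
    sv = alpha > tol
    # Eq. (1.9): b* = y_j - sum_i alpha_i* y_i <x^(i), x^(j)>,
    # averaged over all support vectors for numerical stability
    b = np.mean(y[sv] - X[sv] @ w)
    return w, b
```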
1.2.2 Linear SVM: Soft Margin Classifier

Consider a case where a few samples are not linearly separable. In this case, one may
choose to allow some training samples to be mis-classified, and to seek a classify-
ing hyperplane that achieves a tradeoff between having a large margin and a small
classification error. The optimization problem that allows us to find such a tradeoff
is given by
$$\min_{(w,b,\xi)} \; \frac{1}{2} w^T w + C \sum_{i=1}^{m} \xi_i$$
$$\text{subject to} \quad y_i(w^T x^{(i)} + b) + \xi_i \geq 1, \quad \xi_i \geq 0, \quad (i = 1, 2, \ldots, m), \tag{1.10}$$
and yields the soft-margin SVM. The above formulation is the primal formulation
and is usually not the one that is solved.
The margin is termed soft because the hyperplane w^T x + b = 0 does not linearly separate samples of the two classes, and some samples are incorrectly classified. The term $\sum_{i=1}^{m} \xi_i$ measures the mis-classification error, and the hyper-parameter C defines the importance of the mis-classification term $\sum_{i=1}^{m} \xi_i$ relative to the margin term $\frac{1}{2} w^T w$. This hyper-parameter is chosen by the user and is often determined by using a tuning set, that is, a small sample of the training set on which a search is conducted to find a good choice of C. Once again, we can determine the Wolfe dual of the primal soft margin problem. A little algebra yields the dual as
$$\max_{\alpha} \; -\frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} y_i y_j (x^{(i)})^T x^{(j)} \alpha_i \alpha_j + \sum_{j=1}^{m} \alpha_j$$
$$\text{subject to} \quad \sum_{i=1}^{m} y_i \alpha_i = 0, \qquad 0 \leq \alpha_i \leq C, \quad (i = 1, 2, \ldots, m). \tag{1.11}$$
Let α* = (α1*, α2*, ..., αm*) be an optimal solution of (1.11). Then, using the K.K.T. conditions, it can be shown that the soft margin classifier is x^T w* + b* = 0, where

$$w^* = \sum_{i=1}^{m} \alpha_i^* y_i x^{(i)}. \tag{1.12}$$
In this case, the support vectors are those patterns x^(j) for which the corresponding multiplier α_j* satisfies 0 < α_j* < C; using the K.K.T. conditions, one can also show that

$$y_j\big((w^*)^T x^{(j)} + b^*\big) = 1, \tag{1.13}$$

i.e., the support vectors are samples that lie on the hyperplanes w^T x + b = ±1. In the case of the soft margin SVM, the offset b* is obtained by taking any support vector, i.e., any sample x^(j) for which 0 < α_j* < C, and computing

$$b^* = y_j - \sum_{i=1}^{m} \alpha_i^* y_i (x^{(i)})^T x^{(j)}. \tag{1.14}$$
Since there can be many support vectors, in practice b* is taken as the average of all such values of b* obtained from all the support vectors.
One may also note that samples x^(j) for which α_j* = 0 are not support vectors and do not affect the classifying hyperplane. Finally, samples x^(j) for which α_j* = C, i.e., samples that are classified with a margin less than 1 so that y_j((w*)^T x^(j) + b*) < 1, are also support vectors, since their corresponding multipliers are non-zero. These include samples that are mis-classified.
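The role of C and the identification of support vectors can be seen numerically. The following sketch (an illustration using scikit-learn, assuming it is installed; the dataset and names are my own) trains a linear soft margin SVM on a small synthetic dataset and reports its support vectors.

```python
import numpy as np
from sklearn.svm import SVC

# Small 2-D synthetic dataset with some class overlap
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[-1, -1], scale=0.8, size=(40, 2)),
               rng.normal(loc=[+1, +1], scale=0.8, size=(40, 2))])
y = np.hstack([-np.ones(40), np.ones(40)])

# Linear soft margin SVM; C trades off margin width against training error
clf = SVC(kernel="linear", C=1.0).fit(X, y)

w, b = clf.coef_.ravel(), clf.intercept_[0]
print("w =", w, " b =", b)
# Indices of support vectors (patterns with nonzero multipliers)
print("number of support vectors:", len(clf.support_))
```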
When the data sets are too complex to be separated adequately well by a linear
hyperplane, the solution is to employ the so-called “kernel trick”. This involves
projecting the input samples to a higher dimensional “image” space or “feature”
space, and constructing a linear hyperplane in that space. This leads to the nonlinear
or kernel SVM, which we discuss in the next section.
1.2.3 Nonlinear/Kernel SVM

Consider the exclusive-OR (XOR) problem in two dimensions. Here, the samples for
the binary classification problem are (x1 , x2 ) = (0, 0), (1, 1), (0, 1) and (1,0). The first
two samples are in class (-1), while the latter two are in class 1, i.e., the class labels
are {−1, −1, 1, 1}. The samples are not linearly separable and a linear classifier of
the form w1 x1 + w2 x2 + b = 0 will be unable to separate the patterns. Consider a
map φ(x) that maps the input patterns to a higher dimension z. As an example, we
choose a map that takes the patterns (x1 , x2 ) and maps them to three dimensions, i.e.
z = φ(x), where z = (z 1 , z 2 , z 3 ), with
$$z_1 = x_1, \quad z_2 = x_2, \quad z_3 = -(2x_1 - 1)(2x_2 - 1). \tag{1.15}$$
The map φ thus produces the following patterns in the image space:

(0, 0) → (0, 0, −1), class label = −1,
(1, 1) → (1, 1, −1), class label = −1,
(0, 1) → (0, 1, +1), class label = +1,
(1, 0) → (1, 0, +1), class label = +1.
Note that the plane z 3 = 0 separates the patterns. The map φ from two dimensions
to three has made the patterns separable in the higher (three) dimensional feature
space.
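A few lines of NumPy (an illustrative sketch, not part of the original text) confirm that the map in (1.15) renders the XOR patterns linearly separable in the feature space.

```python
import numpy as np

X = np.array([[0, 0], [1, 1], [0, 1], [1, 0]], dtype=float)
y = np.array([-1, -1, 1, 1])

def phi(x):
    # Feature map of Eq. (1.15): (x1, x2) -> (z1, z2, z3)
    x1, x2 = x
    return np.array([x1, x2, -(2 * x1 - 1) * (2 * x2 - 1)])

Z = np.array([phi(x) for x in X])
print(Z)
# The sign of z3 matches the class label, so the plane z3 = 0 separates the classes
print(np.all(np.sign(Z[:, 2]) == y))   # True
```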
In order to construct the SVM classifier, we consider the patterns φ(x^(i)), (i = 1, 2, ..., m), with corresponding class labels y_i, (i = 1, 2, ..., m), and rewrite the soft margin SVM problem (1.10) for these patterns. The soft margin kernel SVM is given by

$$\min_{(w,b,\xi)} \; \frac{1}{2} w^T w + C \sum_{i=1}^{m} \xi_i$$
$$\text{subject to} \quad y_i\big(w^T \phi(x^{(i)}) + b\big) + \xi_i \geq 1, \quad \xi_i \geq 0, \quad (i = 1, 2, \ldots, m). \tag{1.16}$$
The corresponding dual is similarly obtained by substituting φ(x (i) ) for x (i) in
(1.11). This yields the Wolfe dual of the soft margin kernel SVM, as
$$\max_{\alpha} \; -\frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} y_i y_j \phi(x^{(i)})^T \phi(x^{(j)}) \alpha_i \alpha_j + \sum_{j=1}^{m} \alpha_j$$
$$\text{subject to} \quad \sum_{i=1}^{m} y_i \alpha_i = 0, \qquad 0 \leq \alpha_i \leq C, \quad (i = 1, 2, \ldots, m). \tag{1.17}$$
Note that the image vectors φ(x^(i)) are only present as part of inner product terms of the form φ(x^(i))^T φ(x^(j)). Let us define a kernel function K as follows:

$$K(p, q) = \phi(p)^T \phi(q).$$
Observe that the value of a kernel function is a scalar obtained by computing the inner product of the image vectors φ(p) and φ(q). It is possible to obtain this value without ever computing φ(p) and φ(q) explicitly. The implication is that one can take two vectors p and q that lie in the input space, and obtain the value of φ(p)^T φ(q) without explicitly computing φ(p) or φ(q). Since the map φ takes input
vectors from a lower dimensional input space and produces high dimensional image
vectors, this means that we do not need to use the high dimensional image vectors at
all. This requires the existence of a Kernel function K so that K ( p, q) = φ( p)T φ(q)
8 1 Introduction
for any choice of vectors p and q. A theorem by Mercer [6] provides the necessary and sufficient conditions for such a function to exist.
Mercer’s Theorem
Consider a kernel function K : Rn × Rn → R. Then, for such a kernel function,
there exists a map φ such that K ( p, q) ≡ φ( p)T φ(q) for any vectors p and q if and
only if the following condition holds.
For every square integrable function h : R^n → R, i.e., $\int h^2(x)\,dx < \infty$, we have

$$\int\!\!\int K(p, q)\, h(p)\, h(q)\, dp\, dq \geq 0.$$
A simple example is the linear kernel

$$K(p, q) = p^T q.$$

Note that the expansion contains only non-negative terms. A very popular kernel is the Radial Basis Function (RBF) or Gaussian kernel, i.e.,

$$K(p, q) = \exp\left(-\frac{\|p - q\|^2}{2\sigma^2}\right).$$

Writing K_{ij} = K(x^{(i)}, x^{(j)}), the dual (1.17) of the soft margin kernel SVM becomes

$$\max_{\alpha} \; -\frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} y_i y_j K_{ij} \alpha_i \alpha_j + \sum_{j=1}^{m} \alpha_j$$
$$\text{subject to} \quad \sum_{i=1}^{m} y_i \alpha_i = 0, \qquad 0 \leq \alpha_i \leq C, \quad (i = 1, 2, \ldots, m). \tag{1.20}$$
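The Mercer condition can be probed numerically: for any finite sample, the kernel (Gram) matrix K_{ij} must be symmetric positive semi-definite. The sketch below (illustrative only; parameter names are my own) builds an RBF kernel matrix and checks its eigenvalues.

```python
import numpy as np

def rbf_kernel_matrix(X, sigma=1.0):
    # K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-d2 / (2.0 * sigma**2))

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
K = rbf_kernel_matrix(X)

# Positive semi-definiteness: all eigenvalues >= 0 (up to round-off)
eigvals = np.linalg.eigvalsh(K)
print(eigvals.min() >= -1e-10)   # True
```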
1.3 Least Squares SVM and Proximal SVM
Many of the early SVM formulations largely differ with regard to the way they mea-
sure the empirical error. When the L 2 -norm, i.e. the squared or Euclidean norm is
used to measure the classification error, we obtain the Least Squares SVM (LSSVM)
proposed by Suykens and Vandewalle [2], and the Proximal SVM (PSVM) proposed
by Fung and Mangasarian [3]. The two differ in minor ways in terms of their formu-
lations but employ similar methodology for their solutions. We shall write LSSVM
(respectively PSVM) for least squares SVM (respectively Proximal SVM).
The LSSVM solves the following optimization problem.
$$\min_{(w,b,\xi)} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \xi_i^2$$
$$\text{subject to} \quad y_i(w^T x^{(i)} + b) = 1 - \xi_i, \quad (i = 1, 2, \ldots, m). \tag{1.21}$$
The PSVM solves the closely related problem

$$\min_{(w,b,\xi_A,\xi_B)} \; \frac{1}{2}(w^T w + b^2) + \frac{C}{2}\left(\xi_A^T \xi_A + \xi_B^T \xi_B\right)$$
$$\text{subject to} \quad Aw + e_A b + \xi_A = e_A, \qquad Bw + e_B b + \xi_B = -e_B, \tag{1.22}$$

where ξ_A and ξ_B are the error vectors associated with the two classes.
The addition of the term b2 to the objective function makes it strictly convex
(Fung and Mangasarian [3]), and facilitates algebraic simplification of the solution.
The evolution of the GEPSVM and the TWSVM are based on a similar premise,
and in the sequel we focus on the PSVM and its solution. Note that in the PSVM
formulation the error variables ξi are unrestricted in sign.
Similar to the LSSVM, the solution of the PSVM aims at finding w^T x + b = ±1, which represents a pair of parallel hyperplanes. Note that w^T x + b = 1 represents the samples of class A, while w^T x + b = −1 represents the samples of class B. The hyperplane w^T x + b = 1 can thus be
thought of as a hyperplane that passes through the cluster of samples of A. It is there-
fore termed as a hyperplane that is “proximal” to the set of samples of A. In a similar
manner, the hyperplane w T x + b = −1 can thus be thought of as a hyperplane that
passes through the cluster of samples of B. The first term of the objective function
of (1.22) is the reciprocal of the distance between the two proximal hyperplanes,
and minimizing it attempts to find two planes that pass through samples of the two
classes, while being as far apart as possible. The TWSVM is motivated by a similar
notion, as we shall elaborate in the sequel, but it offers a more generalized solution
by allowing the two hyperplanes to be non-parallel.
The formulation (1.22) is also a regularized least squares solution to the system
of linear equations
$$A_i w + b = 1, \quad (i = 1, 2, \ldots, m_1), \tag{1.23}$$
$$B_i w + b = -1, \quad (i = 1, 2, \ldots, m_2). \tag{1.24}$$

The Lagrangian for problem (1.22) may be written as

$$L = \frac{1}{2}(w^T w + b^2) + \frac{C}{2}\left(\xi_A^T \xi_A + \xi_B^T \xi_B\right) - \lambda_A^T(Aw + e_A b + \xi_A - e_A) - \lambda_B^T(Bw + e_B b + \xi_B + e_B), \tag{1.25}$$
where ξ A and ξ B are error variables; e A and e B are vectors of ones of appropriate
dimension. Here, λ A and λ B are vectors of Lagrange multipliers corresponding to the
equality constraints associated with samples of classes A and B, respectively. Setting
the gradients of L to zero yields the following Karush–Kuhn–Tucker optimality
conditions
$$w - A^T \lambda_A - B^T \lambda_B = 0 \;\Longrightarrow\; w = A^T \lambda_A + B^T \lambda_B, \tag{1.26}$$
$$b - e_A^T \lambda_A - e_B^T \lambda_B = 0 \;\Longrightarrow\; b = e_A^T \lambda_A + e_B^T \lambda_B, \tag{1.27}$$
$$C\xi_A - \lambda_A = 0 \;\Longrightarrow\; \xi_A = \frac{\lambda_A}{C}, \tag{1.28}$$
$$C\xi_B - \lambda_B = 0 \;\Longrightarrow\; \xi_B = \frac{\lambda_B}{C}, \tag{1.29}$$
$$Aw + e_A b + \xi_A = e_A, \tag{1.30}$$
$$Bw + e_B b + \xi_B = -e_B. \tag{1.31}$$
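Eliminating (w, b, ξ_A, ξ_B) from the conditions (1.26)–(1.31) leaves a single linear system in the multipliers. The sketch below (illustrative; function and variable names are my own) solves that system directly with NumPy.

```python
import numpy as np

def psvm_train(A, B, C=1.0):
    """Solve the PSVM KKT system (1.26)-(1.31) directly.

    A : (m1, n) samples of class +1
    B : (m2, n) samples of class -1
    Returns the proximal-plane parameters (w, b).
    """
    m1, m2 = A.shape[0], B.shape[0]
    # Stack [A e_A; B e_B] and the right-hand side (e_A; -e_B)
    E = np.vstack([np.hstack([A, np.ones((m1, 1))]),
                   np.hstack([B, np.ones((m2, 1))])])
    d = np.hstack([np.ones(m1), -np.ones(m2)])
    # Eliminating (w, b, xi) from the KKT conditions gives
    #   (E E^T + I/C) lambda = d,   [w; b] = E^T lambda
    lam = np.linalg.solve(E @ E.T + np.eye(m1 + m2) / C, d)
    z = E.T @ lam
    return z[:-1], z[-1]

# usage: w, b = psvm_train(A, B, C=10.0); predict with sign(x @ w + b)
```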
One may ask for proximal planes that can take into account the scatter within each class, and the between class scatter. This
idea may be thought of as a motivation for the Generalized Eigen Value Proximal
Support Vector Machine (GEPSVM). The basic GEPSVM model of Mangasarian
and Wild [10] is a major motivation for ideas behind Twin Support Vector Machines.
1.4 Support Vector Regression

Consider a regression problem in which we are given training samples x^(i), (i = 1, 2, ..., m), with corresponding observed values y_i ∈ R, and we wish to estimate a linear regressor

f(x) = w^T x + b.
We examine the case of ε-regression, where the task is to find a regressor that approximately matches the samples y_i within a tolerance ε.
In analogy with the hard margin SVM classifier, the hard ε-tube regression problem is usually formulated as

$$\min_{(w,b)} \; \frac{1}{2}\|w\|^2$$
$$\text{subject to} \quad w^T x^{(i)} + b \geq y_i - \epsilon, \quad w^T x^{(i)} + b \leq y_i + \epsilon, \quad (i = 1, 2, \ldots, m). \tag{1.32}$$
But this formulation does not have a correspondence with the SVM clas-
sification problem, and hence, the notion of margin cannot be adapted to the
regression setting. This requires us to use a clever and significant result due to
Bi and Bennett [4], that shows a link between the classification and regression
tasks. Consider a regression problem with data points x (i) , (i = 1, 2, . . . , m), and
where the value of an unknown function at the point x (i) is denoted by yi ∈ R.
As before, we assume that the dimension of the input samples is n, i.e. x (i) =
(x1(i) , x2(i) , . . . , xn(i) )T . Bi and Bennett [4] showed that the task of building a
regressor on this data has a one-to-one correspondence with a binary classifica-
tion task in which class (−1) points lie at the (n + 1)-dimensional co-ordinates
(x^(1); y_1 − ε), (x^(2); y_2 − ε), ..., (x^(m); y_m − ε), and class (+1) points lie at the co-ordinates (x^(1); y_1 + ε), (x^(2); y_2 + ε), ..., (x^(m); y_m + ε). Let us first con-
sider the case when these two subsets are linearly separable. This implies that the
convex hulls of these two subsets do not intersect. Let the closest points in the two
convex hulls be denoted by p+ and p−, respectively. The maximum margin separat-
ing hyperplane must pass through the midpoint of the line joining p+ and p−. Let
the separating hyperplane be given by w^T x + ηy + b = 0. We observe that η ≠ 0.
Then, the regressor is given by

$$y = -\frac{1}{\eta}(w^T x + b). \tag{1.33}$$
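The Bi–Bennett construction is easy to set up in code. The sketch below (illustrative, with my own names) lifts a regression sample into the (n+1)-dimensional classification problem described above.

```python
import numpy as np

def lift_regression_to_classification(X, y, eps=0.1):
    """Build the Bi-Bennett classification dataset: class -1 points are
    (x_i, y_i - eps) and class +1 points are (x_i, y_i + eps)."""
    lower = np.hstack([X, (y - eps)[:, None]])   # class -1
    upper = np.hstack([X, (y + eps)[:, None]])   # class +1
    Z = np.vstack([lower, upper])
    labels = np.hstack([-np.ones(len(X)), np.ones(len(X))])
    return Z, labels

# usage:
# X = np.linspace(0, 1, 20)[:, None]; y = 2 * X.ravel() + 0.5
# Z, t = lift_regression_to_classification(X, y, eps=0.05)
```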
Thus, as shown in Bi and Bennett [4], the SVR problem may be written as

$$\min_{(w,\eta,\alpha,\beta)} \; \frac{1}{2}(\|w\|^2 + \eta^2) + (\beta - \alpha)$$
$$\text{subject to} \quad w^T x^{(i)} + \eta(y_i + \epsilon) \geq \alpha, \quad w^T x^{(i)} + \eta(y_i - \epsilon) \leq \beta, \quad (i = 1, 2, \ldots, m), \tag{1.34}$$

which is closely related to

$$\min_{(w,b,\eta)} \; \frac{1}{2}(\|w\|^2 + \eta^2)$$
$$\text{subject to} \quad w^T x^{(i)} + \eta(y_i + \epsilon) + b \geq 1, \quad w^T x^{(i)} + \eta(y_i - \epsilon) + b \leq -1, \quad (i = 1, 2, \ldots, m). \tag{1.35}$$
Here it may be noted that problem (1.35) is the same as the problem obtained in Deng et al. [7], which has been reduced to the usual ε-SVM formulation (1.32). Some of
these details have been summarized in Chap. 4.
We can write the Wolfe dual of (1.34) and get the following QPP

$$\min_{(u,v)} \; \frac{1}{2}\left\| \begin{bmatrix} X^T \\ (y + \epsilon e)^T \end{bmatrix} u - \begin{bmatrix} X^T \\ (y - \epsilon e)^T \end{bmatrix} v \right\|^2$$
$$\text{subject to} \quad e^T u = 1, \quad e^T v = 1, \quad u \geq 0, \; v \geq 0, \tag{1.36}$$

where X denotes the m × n matrix whose rows are the training samples.
The soft ε-tube version allows some points to violate the tube, and is given by

$$\min_{(w,\xi^+,\xi^-,\eta,\alpha,\beta)} \; \frac{1}{2}(\|w\|^2 + \eta^2) + (\beta - \alpha) + C\sum_{i=1}^{m}(\xi_i^+ + \xi_i^-)$$
$$\text{subject to} \quad w^T x^{(i)} + \eta(y_i + \epsilon) \geq \alpha - \xi_i^+, \quad w^T x^{(i)} + \eta(y_i - \epsilon) \leq \beta + \xi_i^-, \quad \xi_i^+, \xi_i^- \geq 0, \quad (i = 1, 2, \ldots, m). \tag{1.37}$$

Its Wolfe dual has the same objective as (1.36), but with the constraints

$$e^T u = 1, \quad e^T v = 1, \quad 0 \leq u \leq Ce, \quad 0 \leq v \leq Ce. \tag{1.38}$$
The kernel version of the ε-SVR is obtained on substituting φ(X) for X and following the standard methodology of SVM.
Remark 1.4.2 Bi and Bennett [4] have termed the hard ε-tube formulation as H-SVR (convex hull SVR) because it is constructed by separating the convex hulls of the training data with the response variable shifted up and down by ε. The soft ε-tube formulation, termed as RH-SVR, is constructed by separating the reduced convex hulls of the training data with the response variable shifted up and down by ε. Reduced convex hulls limit the influence of any given point by reducing the upper bound on the multiplier for each point to a constant D < 1. There is a close connection between the constant D (appearing in RH-SVR) and C (appearing in the soft ε-tube SVR). For these and other details we refer to Bi and Bennett [4].
Remark 1.4.3 An alternate but somewhat simpler version of the derivation of ε-SVR
is presented in Deng et al. [7] which is also summarized in Chap. 4. The presentation
in Deng et al. [7] is again based on Bi and Bennett [4].
1.5 Efficient Algorithms for SVM and SVR

So far, we have seen formulations for Support Vector Machines for classification and
for regression. All the formulations involve solving a quadratic programming prob-
lem (QPP), i.e., minimizing a quadratic objective function with linear constraints.
While there are several algorithms for solving QPPs, the structure of the QPP in
SVMs allows for the construction of particularly efficient algorithms. One of the
early algorithms for doing so was given by Platt [11], and is referred to as Platt’s
Sequential Minimal Optimization (SMO) technique.
We first begin by re-iterating the dual formulation of the SVM, for the sake of
convenience. The dual QPP formulation for the soft margin SVM classifier is given
by
$$\max_{\alpha} \; -\frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} y_i y_j K_{ij} \alpha_i \alpha_j + \sum_{j=1}^{m} \alpha_j$$
$$\text{subject to} \quad \sum_{i=1}^{m} y_i \alpha_i = 0, \qquad 0 \leq \alpha_i \leq C, \quad (i = 1, 2, \ldots, m). \tag{1.39}$$
Platt's SMO starts with all multipliers α_i, (i = 1, 2, ..., m) set to zero. This trivially satisfies the constraints of (1.39). The algorithm updates two multipliers at a time. It first selects one multiplier to update, and updates it in order to achieve the maximum improvement in the value of the objective function. Note that changing the value of a single multiplier on its own would lead to the equality constraint of (1.39) getting violated. Platt's
SMO algorithm now updates the second multiplier to ensure that the constraints of
(1.39) remain satisfied. Thus, each iteration ensures an improvement in the value of the
objective function without violating any of the constraints. We now elaborate on the
individual steps of the algorithm.
Step 1
The first step is to find a pair of violating multipliers. This involves checking each
multiplier αi , (i = 1, 2, . . . , m) to see if it violates the K.K.T conditions. These con-
ditions may be summarized as:
(i) α_i = 0 ⟹ y_i f(x^(i)) ≥ 1,
(ii) 0 < α_i < C ⟹ y_i f(x^(i)) = 1,
(iii) α_i = C ⟹ y_i f(x^(i)) ≤ 1,
where

$$f(p) = \sum_{j=1}^{m} \alpha_j y_j K(p, x^{(j)}) + b. \tag{1.40}$$
If there are no violating multipliers, then the existing solution is optimal and the
algorithm terminates. Otherwise, it proceeds to Step 2. For the sake of convenience,
we assume that the violators chosen in Step 1 are denoted by α1 and α2 . Although
one could pick any pair of violating multipliers in Step 1, picking the most violating
pairs leads to faster convergence and fewer iterations. More sophisticated implemen-
tations of SMO indeed do that in order to reduce the number of updates needed for
convergence.
Step 2
Since only α1 and α2 are changed, and since $\sum_{i=1}^{m} y_i \alpha_i = 0$, the updated values of α1 and α2, denoted by α1^next and α2^next, must satisfy

$$y_1 \alpha_1^{next} + y_2 \alpha_2^{next} = y_1 \alpha_1 + y_2 \alpha_2, \tag{1.41}$$

i.e.,

$$\alpha_1^{next} + y_1 y_2 \alpha_2^{next} = \alpha_1 + y_1 y_2 \alpha_2 = \beta, \tag{1.42}$$

since y_1, y_2 ∈ {−1, 1}. Note that the updated values α1^next and α2^next need to satisfy 0 ≤ α1^next, α2^next ≤ C. The constraints on α1^next and α2^next may be summarized as α_min ≤ α2^next ≤ α_max, together with (1.42), where α_min and α_max are indicated in the
following table (Table 1.1), which also indicates the value of β in terms of the current values of α1 and α2.

Table 1.1
  Case        β          α_min             α_max
  y1 = y2     α1 + α2    max(0, β − C)     min(C, β)
  y1 ≠ y2     α1 − α2    max(0, −β)        min(C, C − β)
Since only α1 and α2 are being changed, the objective function of the SVM formulation, as given in (1.39), may be written as

$$Q(\alpha_1, \alpha_2) = \alpha_1 + \alpha_2 - \frac{1}{2}\left(K_{11}\alpha_1^2 + K_{22}\alpha_2^2 + 2 K_{12} y_1 y_2 \alpha_1 \alpha_2\right) - y_1 z_1 \alpha_1 - y_2 z_2 \alpha_2 + R(\alpha \setminus \{\alpha_1, \alpha_2\}), \tag{1.43}$$

where R(α \ {α1, α2}) collects the terms that do not involve α1 and α2, and

$$z_1 = \sum_{j=3}^{m} y_j \alpha_j K_{1j}, \tag{1.44}$$

$$z_2 = \sum_{j=3}^{m} y_j \alpha_j K_{2j}. \tag{1.45}$$
Substituting α1 = β − y1y2α2 from (1.42), Q becomes a function of α2 alone:

$$Q(\alpha_2) = (\beta - y_1 y_2 \alpha_2) + \alpha_2 - \frac{1}{2}\left(K_{11}(\beta - y_1 y_2 \alpha_2)^2 + K_{22}\alpha_2^2 + 2 K_{12} y_1 y_2 \alpha_2 (\beta - y_1 y_2 \alpha_2)\right) - y_1 z_1 (\beta - y_1 y_2 \alpha_2) - y_2 z_2 \alpha_2 + R(\alpha \setminus \{\alpha_1, \alpha_2\}). \tag{1.46}$$
The unconstrained maximizer of Q(α2) is obtained by setting

$$\frac{dQ(\alpha_2)}{d\alpha_2} = 0, \tag{1.47}$$

which gives

$$\alpha_2^{next}(K_{11} + K_{22} - 2K_{12}) - y_1 y_2 \beta(K_{11} - K_{12}) - y_2(z_1 - z_2) = 1 - y_1 y_2. \tag{1.48}$$
In order for Q(α2) to have a maximum at α2^next, we require the second derivative of Q(α2) to be negative, i.e.,

$$\frac{d^2 Q(\alpha_2)}{d\alpha_2^2} < 0, \tag{1.49}$$
which implies

$$K_{11} + K_{22} - 2K_{12} > 0. \tag{1.50}$$
The update proceeds by assuming that α2^next is a maximum, i.e., (1.50) holds. Some algebraic simplification leads us to the relation between α2 and α2^next:

$$\alpha_2^{next} = \alpha_2 + \frac{1}{(K_{11} + K_{22} - 2K_{12})}\left[y_2\big(f(x^{(1)}) - f(x^{(2)})\big) + (1 - y_1 y_2)\right], \tag{1.51}$$

where

$$f(x) = \sum_{j=1}^{m} \alpha_j y_j K(x, x^{(j)}) + b. \tag{1.52}$$
Once the target value for α2next is computed using (1.51), we proceed to Step 3.
Step 3
The target value of α2^next computed using (1.51) is now checked against the bound constraints, i.e., we clip α2^next to α_min or α_max if these bounds are crossed. Finally, we obtain the updated value for α1^next from (1.42) as

$$\alpha_1^{next} = \alpha_1 + y_1 y_2 (\alpha_2 - \alpha_2^{next}). \tag{1.53}$$
Step 4
Go to Step 1 to check if there are any more violators.
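Steps 1–4 can be prototyped compactly. The sketch below is an illustrative, simplified SMO variant (the second multiplier is chosen at random rather than by Platt's heuristics, and a precomputed kernel matrix is assumed); the names are my own.

```python
import numpy as np

def simplified_smo(K, y, C=1.0, tol=1e-3, max_passes=20, seed=0):
    """A simplified SMO solver for the dual (1.39), following Steps 1-4.
    K : (m, m) kernel matrix, y : labels in {-1, +1}.
    Returns the multipliers alpha and the offset b."""
    rng = np.random.default_rng(seed)
    m = len(y)
    alpha, b, passes = np.zeros(m), 0.0, 0
    f = lambda i: (alpha * y) @ K[:, i] + b          # Eq. (1.52) at x^(i)
    while passes < max_passes:
        changed = 0
        for i in range(m):
            Ei = f(i) - y[i]
            # Step 1: check whether alpha_i violates the K.K.T. conditions
            if (y[i] * Ei < -tol and alpha[i] < C) or (y[i] * Ei > tol and alpha[i] > 0):
                j = rng.integers(m - 1)
                j += (j >= i)                         # pick a second index j != i
                Ej = f(j) - y[j]
                ai, aj = alpha[i], alpha[j]
                # Bounds on alpha_j (Table 1.1)
                if y[i] != y[j]:
                    L, H = max(0.0, aj - ai), min(C, C + aj - ai)
                else:
                    L, H = max(0.0, ai + aj - C), min(C, ai + aj)
                eta = K[i, i] + K[j, j] - 2.0 * K[i, j]
                if L == H or eta <= 0:
                    continue
                # Step 2: unconstrained maximizer, Eq. (1.51); Step 3: clip
                alpha[j] = np.clip(aj + y[j] * (Ei - Ej) / eta, L, H)
                if abs(alpha[j] - aj) < 1e-8:
                    continue
                alpha[i] = ai + y[i] * y[j] * (aj - alpha[j])   # Eq. (1.53)
                # Update the offset b from the two modified multipliers
                b1 = b - Ei - y[i]*(alpha[i]-ai)*K[i, i] - y[j]*(alpha[j]-aj)*K[i, j]
                b2 = b - Ej - y[i]*(alpha[i]-ai)*K[i, j] - y[j]*(alpha[j]-aj)*K[j, j]
                b = b1 if 0 < alpha[i] < C else (b2 if 0 < alpha[j] < C else (b1 + b2) / 2)
                changed += 1
        passes = passes + 1 if changed == 0 else 0
    return alpha, b
```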
The SMO algorithm is widely used in many SVM solvers, including LIBSVM
(Chang and Lin [12]). LIBSVM in fact uses several additional heuristics to speed
up convergence. However, more efficient approaches are possible, as we show in the
sequel.
1.6 Other Approaches to Solving the SVM QPP

1.6.1 The Relaxed SVM

Small changes to the SVM formulation can in fact make it easier to solve. In Joshi et al. [13], the authors suggest the relaxed SVM (RSVM) and the relaxed LSSVM (RLSSVM) formulations.
The relaxed SVM solves

$$\min_{(\xi, w, b)} \; \frac{1}{2} w^T w + \frac{h}{2} b^2 + C e^T \xi$$
$$\text{subject to} \quad y_k\big[w^T \phi(x^{(k)}) + b\big] \geq 1 - \xi_k, \quad \xi_k \geq 0, \quad (k = 1, 2, \ldots, m). \tag{1.54}$$
The addition of the term (h/2)b², where h is a positive constant, to the objective function distinguishes it from the classical SVM formulation. Note that the h = 1 case is similar to the PSVM formulation, where b² is also added to the objective function.
The Lagrangian for the problem (1.54) is given by
$$L = Ce^T \xi + \sum_{k=1}^{m} \alpha_k\big[1 - \xi_k - y_k(w^T \phi(x^{(k)}) + b)\big] + \frac{1}{2}(w^T w + h b^2) - \sum_{k=1}^{m} \beta_k \xi_k. \tag{1.55}$$
Setting the gradients of L to zero yields

$$\nabla_w L = 0 \;\Rightarrow\; w = \sum_{k=1}^{m} \alpha_k y_k \phi(x^{(k)}), \tag{1.56}$$

$$\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; b = \frac{1}{h}\sum_{k=1}^{m} \alpha_k y_k, \tag{1.57}$$

$$\frac{\partial L}{\partial \xi_k} = 0 \;\Rightarrow\; C - \alpha_k - \beta_k = 0 \;\Rightarrow\; \alpha_k + \beta_k = C. \tag{1.58}$$
Substituting (1.56) and (1.57), the classifier may be evaluated as

$$w^T \phi(x) + b = \sum_{k=1}^{m} \alpha_k y_k \left(K(x^{(k)}, x) + \frac{1}{h}\right), \tag{1.59}$$

and the dual variables are only required to satisfy the box constraints

$$0 \leq \alpha_i \leq C, \quad (i = 1, 2, \ldots, m). \tag{1.60}$$
1.6.2 The Relaxed LSSVM

The relaxed LSSVM solves

$$\min_{(\xi, w, b)} \; \frac{1}{2} w^T w + \frac{h}{2} b^2 + C \xi^T \xi$$
$$\text{subject to} \quad y_k\big[w^T \phi(x^{(k)}) + b\big] = 1 - \xi_k, \quad (k = 1, 2, \ldots, m). \tag{1.61}$$
The Lagrangian for problem (1.61) is

$$L = \frac{1}{2}(w^T w + h b^2) + C \xi^T \xi + \sum_{k=1}^{m} \alpha_k\big[1 - \xi_k - y_k(w^T \phi(x^{(k)}) + b)\big]. \tag{1.62}$$
Setting the gradients of L to zero gives

$$\nabla_w L = 0 \;\Rightarrow\; w = \sum_{k=1}^{m} \alpha_k y_k \phi(x^{(k)}), \tag{1.63}$$

$$\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; b = \frac{1}{h}\sum_{k=1}^{m} \alpha_k y_k, \tag{1.64}$$

$$\frac{\partial L}{\partial \xi_k} = 0 \;\Rightarrow\; \xi_k = \frac{\alpha_k}{C}. \tag{1.65}$$
Substituting (1.63)–(1.65) into the equality constraints of (1.61) gives, at a training point x^(j),

$$w^T \phi(x^{(j)}) + b = \sum_{k=1}^{m} \alpha_k y_k \left(K(x^{(k)}, x^{(j)}) + \frac{1}{h}\right) = y_j\left(1 - \frac{\alpha_j}{C}\right), \tag{1.66}$$

which may be written compactly as

$$\sum_{k=1}^{m} \alpha_k y_k \left(P_{kj} + \frac{1}{h}\right) = y_j, \quad (j = 1, 2, \ldots, m), \tag{1.67}$$

where P_ij is given by

$$P_{ij} = \begin{cases} K_{ij}, & i \neq j, \\ K_{ii} + \dfrac{1}{C}, & i = j. \end{cases} \tag{1.68}$$
The duals of the RSVM and RLSSVM may be written in a common manner as

$$\min_{\lambda} \; \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m} y_i y_j \lambda_i \lambda_j Q_{ij} - \sum_{i=1}^{m} \lambda_i, \tag{1.69}$$

where Q_ij = K_ij + 1/h for the RSVM and Q_ij = P_ij + 1/h for the RLSSVM, as follows from (1.59) and (1.67), respectively.

1.6.3 Solving the Relaxed SVM and the Relaxed LSSVM

Consider the common dual problem

$$\min_{\alpha} \; \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m} y_i y_j \alpha_i \alpha_j Q_{ij} - \sum_{i=1}^{m} \alpha_i. \tag{1.70}$$

The box constraints

$$0 \leq \alpha_i \leq C, \quad (i = 1, 2, \ldots, m), \tag{1.71}$$
may be kept in mind when updating the multipliers. Without loss of generality, let
α1 be the multiplier under consideration. The objective function in (1.70) may be
written as
$$Q(\alpha_1) = -\alpha_1 - \sum_{j=2}^{m} \alpha_j + \alpha_1 y_1 \sum_{j=2}^{m} \alpha_j y_j Q_{1j} + \frac{1}{2}\alpha_1^2 Q_{11} + \frac{1}{2}\sum_{i=2}^{m}\sum_{j=2}^{m} y_i y_j \alpha_i \alpha_j Q_{ij}. \tag{1.72}$$
We assume Q to be symmetric. Note that y_1² = 1 in (1.72). For the new value of α1 to be a minimizer of Q(α1), we require ∂Q/∂α1 = 0, which gives
$$1 - \alpha_1^{new} Q_{11} - y_1 \sum_{j=2}^{m} \alpha_j y_j Q_{1j} = 0. \tag{1.73}$$
We also require
$$\frac{\partial^2 Q}{\partial \alpha_1^2} > 0, \quad \text{i.e.,} \quad Q_{11} > 0. \tag{1.74}$$
Thus,

$$\alpha_1^{new} Q_{11} = 1 - y_1 \sum_{j=2}^{m} \alpha_j y_j Q_{1j}. \tag{1.75}$$

Writing the sum in terms of the current classifier output f_old(x^(1)) = Σ_j α_j^old y_j Q_1j, this becomes

$$\alpha_1^{new} Q_{11} = 1 - y_1 f_{old}(x^{(1)}) + \alpha_1^{old} Q_{11}, \tag{1.76}$$

which simplifies to
$$\alpha_k^{new} = \alpha_k^{old} + \frac{1 - y_k f_{old}(x^{(k)})}{Q_{kk}}, \tag{1.77}$$
where the update rule has been written for a generic multiplier α_k. The boundary constraints (1.71) are enforced when updating the multipliers. For the RSVM, the update rule (1.77) may be written as

$$\alpha_k^{new} = \alpha_k^{old} + \frac{1 - y_k f_{old}(x^{(k)})}{(K_{kk} + \alpha_p)}, \tag{1.78}$$

and for the RLSSVM as

$$\alpha_k^{new} = \alpha_k^{old} + \frac{1 - y_k f_{old}(x^{(k)})}{(P_{kk} + \alpha_p)}, \tag{1.79}$$

where

$$P_{ij} = \begin{cases} K_{ij}, & i \neq j, \\ K_{ii} + \dfrac{1}{C}, & i = j. \end{cases} \tag{1.80}$$
Note that the 1SMO (single multiplier SMO) update rules (1.78) and (1.79) do not necessitate updating pairs of multipliers, as in the case of the SMO [11]. This leads to faster updates, as shown in
Joshi et al. [13].
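A single-multiplier coordinate update following (1.77), with the box constraints (1.71) enforced by clipping, can be sketched as follows. This is illustrative only; the stopping rule and the choice of the matrix Q (the effective kernel of the common dual) are my own simplifications.

```python
import numpy as np

def one_smo(Q, y, C=1.0, tol=1e-4, max_iter=1000):
    """Coordinate-wise updates for the common dual (1.70):
    alpha_k <- clip(alpha_k + (1 - y_k f_old(x_k)) / Q_kk, 0, C)."""
    m = len(y)
    alpha = np.zeros(m)
    for _ in range(max_iter):
        max_change = 0.0
        for k in range(m):
            f_k = (alpha * y) @ Q[:, k]            # current output at x^(k)
            step = (1.0 - y[k] * f_k) / Q[k, k]    # Eq. (1.77)
            new = np.clip(alpha[k] + step, 0.0, C) # enforce (1.71)
            max_change = max(max_change, abs(new - alpha[k]))
            alpha[k] = new
        if max_change < tol:
            break
    return alpha
```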
Remark 1.6.1 Keerthi and Shevade [14], and Shevade et al. [15] discussed certain
improvements in Platt’s SMO algorithm for SVM classifier design and regression.
Keerthi and Shevade [14] also developed an SMO algorithm for least squares SVM formulations. Some other efficient algorithms for solving SVM type formulations are Successive Overrelaxation (SOR) due to Mangasarian and Musicant [16] and the Iterative Single Data Algorithm due to Vogt et al. [17]. The SOR algorithm of Man-
gasarian and Musicant [16] has also been used in the study of non-parallel type
classification, e.g., Tian et al. [18] and Tian et al. [19].
1.7 Conclusions
The main focus of this chapter has been to present an overview of Support Vector
Machines and some of its variants, e.g., Least Squares Support Vector Machine
and Proximal Support Vector Machines. Support Vector Regression (SVR) has been discussed from a totally different perspective. This development of SVR is based on a very important result of Bi and Bennett [4], which shows that the regression problem is equivalent to an appropriately constructed classification problem in R^(n+1). Some
popular and efficient algorithms for solving SVM problems have also been included
in this chapter.
References
13. Joshi, S., Jayadeva, Ramakrishnan, G., & Chandra, S. (2012). Using sequential unconstrained
minimization techniques to simplify SVM solvers. Neurocomputing, 77, 253–260.
14. Keerthi, S. S., & Shevade, S. K. (2003). SMO algorithm for least squares SVM formulations.
Neural Computation, 15(2), 487–507.
15. Keerthi, S. S., Shevade, S. K., Bhattacharya, C., & Murthy, K. R. K. (2001). Improvements to
Platt's SMO algorithm for SVM classifier design. Neural Computation, 13(3), 637–649.
16. Mangasarian, O. L., & Musicant, D. R. (1999). Successive overrelaxation for support vector
machines. IEEE Transactions on Neural Networks, 10(5), 1032–1037.
17. Vogt, M., Kecman, V., & Huang, T. M. (2005). Iterative single data algorithm for training kernel
machines from huge data sets: Theory and performance. Support vector machines: Theory and
Applications, 177, 255–274.
18. Tian, Y. J., Ju, X. C., Qi, Z. Q., & Shi, Y. (2013). Improved twin support vector machine.
Science China Mathematics, 57(2), 417–432.
19. Tian, Y. J., Qi, Z. Q., Ju, X. C., Shi, Y., & Liu, X. H. (2013). Nonparallel support vector
machines for pattern classification. IEEE Transactions on Cybernetics, 44(7), 1067–1079.
Chapter 2
Generalized Eigenvalue Proximal Support Vector Machines
2.1 Introduction
Let the training set for the binary data classification problem be given by

$$T = \{(x^{(i)}, y_i), \; i = 1, 2, \ldots, m\}, \tag{2.1}$$

where x^(i) ∈ R^n and y_i ∈ {−1, 1}. Let there be m1 patterns having class label +1 and
m2 patterns having class label −1. We construct matrix A (respectively B) of order
(m1 × n) (respectively (m2 × n)) by taking the i th row of A (respectively B) as the
i th pattern of class label +1 (respectively class label −1). Thus, m1 + m2 = m.
The linear GEPSVM classifier aims to determine two non parallel planes
x T w1 + b1 = 0 and x T w2 + b2 = 0, (2.2)
such that the first plane is closest to the points of class +1 and farthest from the
points of class −1, while the second plane is closest to the points in class −1 and
farthest from the points in class +1. Here w1, w2 ∈ R^n, and b1, b2 ∈ R.
In order to determine the first plane x^T w1 + b1 = 0, we need to solve the following optimization problem

$$\min_{(w,b) \neq 0} \; \frac{\|Aw + eb\|^2 / \|(w, b)^T\|^2}{\|Bw + eb\|^2 / \|(w, b)^T\|^2}, \tag{2.3}$$

where e is a vector of 'ones' of appropriate dimension and ||·|| denotes the L2-norm.
In (2.3), the numerator of the objective function is the sum of the squares of
two-norm distances between each of the points of class +1 to the plane, and the
denominator is the similar quantity between each of the points of class −1 to the
plane. The way this objective function is constructed, it meets the stated goal of
determining a plane which is closest to the points of class +1 and farthest to the
points of class −1. Here, it may be remarked that ideally we would like to minimize
the numerator and maximize the denominator, but as the same (w, b) may not do the
simultaneous optimization of both the numerator and denominator, we take the ratio
and minimize the same. This seems to be a very natural motivation for introducing
the optimization problem (2.3).
The problem (2.3) can be re-written as
$$\min_{(w,b) \neq 0} \; \frac{\|Aw + eb\|^2}{\|Bw + eb\|^2}, \tag{2.4}$$
where it is assumed that (w, b) ≠ 0 implies that (Bw + eb) ≠ 0. This makes the problem (2.4) well defined. Now, let

G = [A  e]^T [A  e],  H = [B  e]^T [B  e],

and z^T = (w, b), so that problem (2.4) becomes
$$\min_{z \neq 0} \; \frac{z^T G z}{z^T H z}. \tag{2.5}$$
The objective function in (2.5) is the famous Rayleigh quotient of the generalized
eigenvalue problem Gz = λHz, z ≠ 0. When H is positive definite, the Rayleigh
quotient is bounded and its range is [λmin , λmax ] where λmin and λmax respectively
denote the smallest and the largest eigenvalues. Here, G and H are symmetric matrices
of order (n + 1) × (n + 1), and H is positive definite under the assumption that
columns of [B e] are linearly independent.
Now following similar arguments, we need to solve the following optimization
problem to determine the second plane x T w2 + b2 = 0
$$\min_{(w,b) \neq 0} \; \frac{\|Bw + eb\|^2}{\|Aw + eb\|^2}, \tag{2.6}$$

where we need to assume that (w, b) ≠ 0 implies that (Aw + eb) ≠ 0. We can again write (2.6) as

$$\min_{z \neq 0} \; \frac{z^T H z}{z^T G z}, \tag{2.7}$$
To handle possible ill-conditioning, a Tikhonov type regularization term [6] is introduced, leading to the problems

$$\min_{(w,b) \neq 0} \; \frac{\|Aw + eb\|^2 + \delta\|(w, b)^T\|^2}{\|Bw + eb\|^2}, \tag{2.8}$$

and

$$\min_{(w,b) \neq 0} \; \frac{\|Bw + eb\|^2 + \delta\|(w, b)^T\|^2}{\|Aw + eb\|^2}, \tag{2.9}$$

respectively. Here δ > 0. Further, a solution of (2.8) gives the first plane x^T w1 + b1 = 0, while that of (2.9) gives the second plane x^T w2 + b2 = 0. Now define
P = [A e]T [A e] + δI,
Q = [B e]T [B e],
R = [B e]T [B e] + δI,
S = [A e]T [A e],
zT = (w, b), (2.10)
Then, problems (2.8) and (2.9) become

$$\min_{z \neq 0} \; \frac{z^T P z}{z^T Q z}, \tag{2.11}$$

and

$$\min_{z \neq 0} \; \frac{z^T R z}{z^T S z}, \tag{2.12}$$

respectively.
We now have the following theorem of Parlett [5].

Theorem 2.2.1 Let G and H be symmetric matrices of order (n + 1) × (n + 1), with H positive definite. Then the Rayleigh quotient z^T G z / z^T H z, z ≠ 0, is bounded, and it attains its minimum value at an eigenvector of the generalized eigenvalue problem Gz = λHz corresponding to the smallest eigenvalue λ_min.
In view of Theorem 2.2.1, solving the optimization problem (2.11) amounts to finding
the eigenvector corresponding to the smallest eigenvalue of the generalized eigen-
value problem
Pz = λQz, z ≠ 0. (2.13)
In a similar manner, to solve the optimization problem (2.12) we need to get the
eigenvector corresponding to the smallest eigenvalue of the generalized eigenvalue
problem
Rz = μSz, z ≠ 0. (2.14)
We can summarize the above discussion in the form of the following theorem.

Theorem 2.2.2 Let the columns of [A e] and [B e] be linearly independent. Let z1 = (w1, b1)^T be an eigenvector of (2.13) corresponding to its smallest eigenvalue, and let z2 = (w2, b2)^T be an eigenvector of (2.14) corresponding to its smallest eigenvalue. Then x^T w1 + b1 = 0 and x^T w2 + b2 = 0 are the two desired non-parallel proximal planes.
Remark 2.2.3 The requirement that columns of (A e) and (B e) are linearly inde-
pendent, is only a sufficient condition for the determination of z1 and z2 . It is not
a necessary condition as may be verified for the XOR example. Also this linear
independence condition may not be too restrictive if m1 and m2 are much larger
than n.
Consider the XOR example, with training set

T_C = {((0, 0), +1), ((1, 1), +1), ((1, 0), −1), ((0, 1), −1)}.
Therefore,

$$A = \begin{bmatrix} 0 & 0 \\ 1 & 1 \end{bmatrix} \quad \text{and} \quad B = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}.$$

Hence, taking δ = 0,

$$P = S = \begin{bmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 2 \end{bmatrix}, \quad \text{and} \quad Q = R = \begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \\ 1 & 1 & 2 \end{bmatrix}.$$
Then the generalized eigenvalue problem (2.13) has the solution λ_min = 0 and z1 = (−1, 1, 0)^T. Therefore, the first plane is given by −x1 + x2 + 0 = 0, i.e., x1 − x2 = 0. In the same way, the generalized eigenvalue problem (2.14) has the solution μ_min = 0 and z2 = (−1, −1, 1)^T, which gives the second plane −x1 − x2 + 1 = 0, i.e., x1 + x2 = 1.
Here we observe that neither the columns of (A e) nor that of (B e) are linearly
independent but the problems (2.13) and (2.14) have solutions z1 and z2 respectively.
The XOR example also illustrates that proximal separability does not imply linear
separability. In fact these are two different concepts and therefore the converse is also
not true. It is also possible for two sets to be both proximally and linearly separable,
e.g. the AND problem.
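For data sets where the columns of [A e] and [B e] are linearly independent, the two generalized eigenvalue problems (2.13) and (2.14) can be solved with standard routines. The sketch below is illustrative (function and variable names are my own); it uses scipy.linalg.eigh, which assumes the right-hand matrix is positive definite.

```python
import numpy as np
from scipy.linalg import eigh

def gepsvm_planes(A, B, delta=1e-3):
    """Return (w1, b1), (w2, b2) by solving (2.13) and (2.14).
    A: (m1, n) class +1 samples, B: (m2, n) class -1 samples."""
    GA = np.hstack([A, np.ones((A.shape[0], 1))])   # [A e]
    GB = np.hstack([B, np.ones((B.shape[0], 1))])   # [B e]
    G, H = GA.T @ GA, GB.T @ GB
    I = np.eye(G.shape[0])
    # Smallest-eigenvalue eigenvectors of P z = lambda Q z and R z = mu S z
    _, V1 = eigh(G + delta * I, H)    # columns sorted by ascending eigenvalue
    _, V2 = eigh(H + delta * I, G)
    z1, z2 = V1[:, 0], V2[:, 0]
    return (z1[:-1], z1[-1]), (z2[:-1], z2[-1])

def classify(x, planes):
    # Assign to the class whose plane is nearer, as in GEPSVM
    d = [abs(x @ w + b) / np.linalg.norm(w) for (w, b) in planes]
    return +1 if d[0] <= d[1] else -1
```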
Geometrically, the Cross Planes data set is a perturbed generalization of the XOR
example. Similar to XOR, it serves as a test example for the efficacy of typical linear
classifiers, be it linear SVM, linear PSVM, linear GEPSVM or others to be studied
in the sequel. An obvious reason for the poor performance of linear PSVM on Cross
Planes data set is that in PSVM we have insisted on the requirement that the proximal
planes should be parallel, which is not the case with linear GEPSVM.
We next discuss the kernel version of GEPSVM to get the nonlinear GEPSVM
classifier.
The nonlinear GEPSVM classifier determines the two kernel generated surfaces

$$K(x^T, C^T)u_1 + b_1 = 0 \quad \text{and} \quad K(x^T, C^T)u_2 + b_2 = 0, \tag{2.15}$$

where C = [A; B] denotes the matrix of all training samples and K is a chosen kernel function. Here we note that (2.15) are nonlinear surfaces rather than planes, but serve the
same purpose as (2.2). Thus, the first (respectively second) surface is closest to data
points in class +1 (respectively class −1) and farthest from the data points in class
−1 (respectively class +1).
If we take the linear kernel K(x T , C T ) = x T C and define w1 = C T u1 , w2 = C T u2 ,
then (2.15) reduces to (2.2). Therefore to generate surfaces (2.15) we can generalize
our earlier arguments and get the following two optimization problems

$$\min_{(u,b) \neq 0} \; \frac{\|K(A, C^T)u + eb\|^2 + \delta\|(u, b)^T\|^2}{\|K(B, C^T)u + eb\|^2}, \tag{2.16}$$

and

$$\min_{(u,b) \neq 0} \; \frac{\|K(B, C^T)u + eb\|^2 + \delta\|(u, b)^T\|^2}{\|K(A, C^T)u + eb\|^2}. \tag{2.17}$$

If we now define

$$P_1 = [K(A, C^T) \; e]^T [K(A, C^T) \; e] + \delta I, \quad Q_1 = [K(B, C^T) \; e]^T [K(B, C^T) \; e],$$
$$R_1 = [K(B, C^T) \; e]^T [K(B, C^T) \; e] + \delta I, \quad S_1 = [K(A, C^T) \; e]^T [K(A, C^T) \; e], \quad z^T = (u, b), \tag{2.18}$$

then the optimization problems (2.16) and (2.17) reduce to the generalized eigenvalue
problems
$$\min_{z \neq 0} \; \frac{z^T P_1 z}{z^T Q_1 z}, \tag{2.19}$$

and

$$\min_{z \neq 0} \; \frac{z^T R_1 z}{z^T S_1 z}, \tag{2.20}$$
respectively.
Now we can state a theorem similar to Theorem 2.2.2 and make use of it to generate the required surfaces (2.15). Specifically, let z1 (respectively z2) be the eigenvector corresponding to the smallest eigenvalue of the generalized eigenvalue problem (2.19) (respectively (2.20)). Then z1 = (u1, b1) (respectively z2 = (u2, b2)) gives the desired surface K(x^T, C^T)u1 + b1 = 0 (respectively K(x^T, C^T)u2 + b2 = 0). For a new point x ∈ R^n, let
$$\text{distance}(x, S_i) = \frac{|K(x^T, C^T)u_i + b_i|}{\sqrt{(u_i)^T K(C, C^T) u_i}},$$
where Si is the non linear surface K(x T , C T )ui + bi = 0 (i = 1, 2). Then the class i
(i = 1, 2) for this new point x is assigned as per the following rule
$$\text{class}(x) = \arg\min_{i=1,2} \; \frac{|K(x^T, C^T)u_i + b_i|}{\sqrt{(u_i)^T K(C, C^T) u_i}}. \tag{2.21}$$
Mangasarian and Wild [1] implemented their linear and nonlinear GEPSVM clas-
sifiers extensively on artificial as well as real world data sets. On Cross Planes data
set in (300 × 7) dimension (i.e. m = 300, n = 7) the linear classifier gives 10-fold
testing correctness of 98 %, whereas linear PSVM and linear SVM give correctness of 55.3 % and 45.7 %, respectively. Also on the Galaxy Bright data set, the
GEPSVM linear classifier does significantly better than PSVM. Thus, Cross Planes
as well as Galaxy Bright data sets indicate that allowing the proximal planes to be
non-parallel allows the classifier to better represent the data set when needed. A
similar experience is also reported with the nonlinear GEPSVM classifier on various
other data sets.
2.3 Some Variants of GEPSVM for Classification

There are two main variants of the original GEPSVM. These are due to Guarracino
et al. [2] and Shao et al. [3]. Guarracino et al. [2] proposed a new regularization
technique which results in solving a single generalized eigenvalue problem. This
formulation is termed as Regularized GEPSVM and is denoted as ReGEPSVM.
The formulation of Shao et al. [3] is based on the difference measure, rather than the
ratio measure of GEPSVM, and therefore results in solving two eigenvalue problems,
unlike the two generalized eigenvalue problems in GEPSVM. The Shao et al. [3] model is termed Improved GEPSVM and is denoted as IGEPSVM. Both of these variants are attractive as they seem to be superior to the classical GEPSVM in terms of classification accuracy as well as computation time. This has been reported in [2] and [3] by implementing and experimenting with these algorithms on several artificial and
benchmark data sets.
2.3.1 ReGEPSVM Formulation

Let us refer to the problems (2.8) and (2.9), where a Tikhonov type regularization term is introduced in the basic formulations (2.4) and (2.6). Let us also recall the following theorem (Theorem 2.3.1) from Saad [7] in this regard, which relates the eigenvalues λ of a generalized eigenvalue problem to the eigenvalues μ of a suitably transformed problem G*x = λH*x through

$$\mu = \frac{\tau_2 \lambda + \delta_1}{\tau_1 + \delta_2 \lambda},$$

while leaving the eigenvectors unchanged. Guarracino et al. [2] introduced a regularized problem (2.22), which is similar to (2.8) (and also to (2.9)) but with a different regularization term. Now, if we take τ1 = τ2 = 1, δ̂1 = −δ1 > 0, δ̂2 = −δ2 > 0, with the condition that τ1τ2 − δ1δ2 ≠ 0, then Theorem 2.3.1 becomes applicable to the generalized eigenvalue problem corresponding to the optimization problem (2.22). Let this problem be

$$Uz = \lambda V z, \quad z \neq 0. \tag{2.23}$$
Further, the smallest eigenvalue of the original problem (2.5) becomes the largest
eigenvalue of (2.23), and the largest eigenvalue of (2.5) becomes the smallest eigen-
value of (2.23). This is because of Theorem (2.3.1) which asserts that the spectrum
of the transformed eigenvalue problem G∗ x = λH ∗ x gets shifted and inverted. Also,
the eigenvalues of (2.5) and (2.7) are reciprocal to each other with the same eigen-
vectors. Therefore, to determine the smallest eigenvalues of (2.5) and (2.7) respec-
tively, we need to determine the largest and the smallest eigenvalues of (2.23). Let
z1 = col(w1 , b1 ) and z2 = col(w2 , b2 ) be the corresponding eigenvectors. Then, the
respective planes are x T w1 + b1 = 0 and x T w2 + b2 = 0.
For the nonlinear case, we need to solve the analogous optimization problem via the related generalized eigenvalue problem G*x = λH*x. Its solution gives two proximal kernel generated surfaces of the form (2.15), and the classification of a new point x ∈ R^n is done as for the nonlinear GEPSVM classifier.
2.3.2 Improved GEPSVM Formulation

It has been remarked earlier that the determination of the first plane x^T w1 + b1 = 0 (respectively the second plane x^T w2 + b2 = 0) requires the minimization (respectively maximization) of ||Aw + eb||²/||(w, b)^T||² and the maximization (respectively minimization) of ||Bw + eb||²/||(w, b)^T||². Since the same (w, b) may not
perform this simultaneous optimization, Mangasarian and Wild [1] proposed a ratio
measure to construct the optimization problem (2.4) (respectively (2.6)) and thereby
obtained the generalized eigenvalue problem (2.13) (respectively (2.14)) for deter-
mining the first (respectively second) plane.
The main difference between the works of Mangasarian and Wild [1], and that of
Shao et al. [3] is that the former considers a ratio measure whereas the latter considers
a difference measure to formulate their respective optimization problems. Conceptu-
ally, there is a bi-objective optimization problem which requires the minimization of ||Aw + eb||²/||(w, b)^T||² and the maximization of ||Bw + eb||²/||(w, b)^T||², i.e., the minimization of −||Bw + eb||²/||(w, b)^T||². Thus the relevant optimization problem is the following bi-objective optimization problem

$$\min_{(w,b) \neq 0} \; \left( \frac{\|Aw + eb\|^2}{\|(w, b)^T\|^2}, \; -\frac{\|Bw + eb\|^2}{\|(w, b)^T\|^2} \right). \tag{2.25}$$
Let γ > 0 be the weighting factor which determines the trade-off between the two
objectives in (2.25). Then, the bi-objective optimization problem (2.25) is equivalent
to the following scalar optimization problem

$$\min_{(w,b) \neq 0} \; \frac{\|Aw + eb\|^2 - \gamma \|Bw + eb\|^2}{\|(w, b)^T\|^2}. \tag{2.26}$$
Let z = (w, b)T , G = [A e]T [A e] and H = [B e]T [B e]. Then, the optimiza-
tion problem (2.26) becomes
$$\min_{z \neq 0} \; \frac{z^T(G - \gamma H)z}{z^T I z}. \tag{2.27}$$
Now, similar to GEPSVM, we can also introduce a Tikhonov type regularization term in (2.27) to get

$$\min_{z \neq 0} \; \frac{z^T(G - \gamma H)z + \delta\|z\|^2}{z^T I z},$$

i.e.,

$$\min_{z \neq 0} \; \frac{z^T(G + \delta I - \gamma H)z}{z^T I z}. \tag{2.28}$$
The above problem (2.28) is exactly the minimization of a Rayleigh quotient, whose global optimum can be obtained by solving the following eigenvalue problem

$$(G + \delta I - \gamma H)z = \lambda z, \quad z \neq 0. \tag{2.29}$$

The eigenvector z1 = (w1, b1)^T corresponding to the smallest eigenvalue of (2.29) gives the first plane x^T w1 + b1 = 0. Similarly, the second plane x^T w2 + b2 = 0 is obtained from the eigenvector z2 = (w2, b2)^T corresponding to the smallest eigenvalue of

$$(H + \delta I - \gamma G)z = \lambda z, \quad z \neq 0. \tag{2.30}$$

A new point x ∈ R^n is then assigned to a class according to

$$\text{class}(x) = \arg\min_{i=1,2} \; \frac{|x^T w_i + b_i|}{\|w_i\|}.$$
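Because (2.29) and its counterpart are ordinary symmetric eigenvalue problems, IGEPSVM is straightforward to implement. The following sketch is illustrative (names are my own) and uses numpy.linalg.eigh.

```python
import numpy as np

def igepsvm_planes(A, B, gamma=1.0, delta=0.1):
    """Improved GEPSVM: solve the two standard eigenvalue problems
    (G + delta I - gamma H) z = lambda z and (H + delta I - gamma G) z = lambda z,
    and return the eigenvectors of the smallest eigenvalues as (w, b) pairs."""
    GA = np.hstack([A, np.ones((A.shape[0], 1))])
    GB = np.hstack([B, np.ones((B.shape[0], 1))])
    G, H = GA.T @ GA, GB.T @ GB
    I = np.eye(G.shape[0])
    # numpy.linalg.eigh returns eigenvalues in ascending order
    _, V1 = np.linalg.eigh(G + delta * I - gamma * H)
    _, V2 = np.linalg.eigh(H + delta * I - gamma * G)
    z1, z2 = V1[:, 0], V2[:, 0]
    return (z1[:-1], z1[-1]), (z2[:-1], z2[-1])
```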
The above results can easily be extended to the nonlinear case by considering the analogous pair of eigenvalue problems built from the kernel matrices K(A, C^T) and K(B, C^T), where the class of a new point x is assigned according to

$$\text{class}(x) = \arg\min_{i=1,2} \; \frac{|K(x^T, C^T)u_i + b_i|}{\sqrt{(u_i)^T K(C, C^T) u_i}}. \tag{2.33}$$
Shao et al. [3] termed their model as Improved Generalized Eigenvalue Proximal
Support Vector Machine (IGEPSVM). It is known that the linear IGEPSVM needs
only to solve two eigenvalue problems, with computational time complexity of O(n²), where n is the dimensionality of the data points. In contrast, the linear GEPSVM requires two generalized eigenvalue problems, whose complexity is O(n³). For the nonlinear case, the computational complexity is O(m²) for the Shao et al. model [3] and O(m³)
for Mangasarian and Wild model. Here, m is the number of training data points.
This probably explains why, in numerical implementations, IGEPSVM takes much
less time than GEPSVM. For details of these numerical experiments we shall refer
to Shao et al. [3]. Recently, Saigal and Khemchandani [8] presented comparison of
various nonparallel algorithms including variations of GEPSVM along with TWSVM
for multicategory classification.
2.4 GEPSVR: Generalized Eigenvalue Proximal Support Vector Regression

The aim of this section is to present the regression problem in the setting of the gener-
alized eigenvalue problem. Here, we present two models for the regression problem.
The first is termed as the GEPSVR which is in the spirit of GEPSVM and requires
the solution of two generalized eigenvalue problems. The second formulation is in
the spirit of ReGEPSVM and is termed as the Regularized Generalized Eigenvalue
Support Vector Regressor (ReGEPSVR). The formulation of ReGEPSVR requires
the solution of a single regularized eigenvalue problem and it reduces the execution
time to half as compared to GEPSVR.
Earlier Bi and Bennett [9] made a very significant theoretical contribution to
the theory of support vector regression. They (Bi and Bennett [9]) showed that the
problem of support vector regression can be regarded as a classification problem in
the dual space, and maximizing the margin corresponds to shrinking of the effective ε-tube. From an application point of view, this result is of utmost importance because it allows one to look for other classification algorithms and study their regression analogues.
We have already demonstrated this duality aspect in the context of SVR in Chap. 1,
where it was derived via SVM approach. We shall continue to take this approach
in this section as well as in some of the later chapters where regression problem is
further studied.
2.4.1 GEPSVR Formulation

For GEPSVR, our objective is to find the two non-parallel ε-insensitive bounding regressors. The two non-parallel regressors around the data points are derived by
solving two generalized eigenvalue problems. We first discuss the case of linear
GEPSVR, the case of nonlinear GEPSVR may be developed analogously.
Let the training set for the regression problem be given by T = {(x^(i), y_i), i = 1, 2, ..., m}, where x^(i) ∈ R^n and y_i ∈ R. Following the construction of Bi and Bennett [9], consider the associated classification training set

T_C = {((x^(i), y_i + ε), +1), ((x^(i), y_i − ε), −1), i = 1, 2, ..., m}, where ε > 0.
Here, let A denote the m × n matrix whose i-th row is x^(i), and let Y = (y_1, y_2, ..., y_m)^T. To determine the first ε-insensitive bounding regressor, GEPSVR solves

$$\min_{(w,b) \neq 0} \; \frac{\|Aw + eb - (Y - \epsilon e)\|^2}{\|Aw + eb - (Y + \epsilon e)\|^2}. \tag{2.35}$$
Introducing a Tikhonov regularization term δ > 0 and defining

R = [A  e  (Y − εe)]^T [A  e  (Y − εe)] + δI,
S = [A  e  (Y + εe)]^T [A  e  (Y + εe)],     (2.36)

problem (2.35) leads to the generalized eigenvalue problem

$$Ru = \eta Su, \quad u \neq 0. \tag{2.37}$$
Let u1 denote the eigenvector corresponding to the smallest eigenvalue ηmin of (2.37).
To obtain w1 and b1 from u1 , we normalize u1 by the negative of the (n + 2)th
element of u1 so as to force a (−1) at the (n + 2)th position. Let this normalized representation of u1 be u1^new = [w1  b1  −1]^T. Then, w1 and b1 determine an ε-insensitive bounding regressor as f1(x) = x^T w1 + b1.
Similarly, for determining the second bounding regressor f2(x), we consider the regularized optimization problem

$$\min_{(w,b) \neq 0} \; \frac{\|Aw + eb - (Y + \epsilon e)\|^2}{\|Aw + eb - (Y - \epsilon e)\|^2}. \tag{2.38}$$

Defining P = [A  e  (Y + εe)]^T [A  e  (Y + εe)] + δI and Q = [A  e  (Y − εe)]^T [A  e  (Y − εe)], this leads to the generalized eigenvalue problem

$$Pu = \nu Qu, \quad u \neq 0. \tag{2.39}$$
Now, as before, finding the minimum eigenvalue ν_min of (2.39) and the corresponding eigenvector u2, we obtain w2 and b2 by the normalization procedure explained earlier. This gives the other ε-insensitive bounding regressor f2(x) = x^T w2 + b2.
Having determined f1 (x) and f2 (x) from u1 and u2 , the final regressor f (x) is
constructed by taking the average, i.e.,
$$f(x) = \frac{1}{2}\big(f_1(x) + f_2(x)\big) = \frac{1}{2} x^T (w_1 + w_2) + \frac{1}{2}(b_1 + b_2).$$
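A linear GEPSVR can thus be assembled from two generalized eigenvalue solves and the normalization step above. The sketch below is illustrative (names are my own); it assumes the right-hand matrices are positive definite, e.g., m > n + 2 with linearly independent columns.

```python
import numpy as np
from scipy.linalg import eigh

def gepsvr_fit(X, y, eps=0.1, delta=1e-3):
    """Linear GEPSVR: returns the averaged regressor coefficients (w, b)."""
    m, n = X.shape
    e = np.ones((m, 1))
    lower = np.hstack([X, e, (y - eps)[:, None]])   # [A e (Y - eps*e)]
    upper = np.hstack([X, e, (y + eps)[:, None]])   # [A e (Y + eps*e)]
    I = np.eye(n + 2)

    def bounding_regressor(num, den):
        # Smallest-eigenvalue eigenvector of (num^T num + delta I) u = eta (den^T den) u
        _, V = eigh(num.T @ num + delta * I, den.T @ den)
        u = V[:, 0]
        u = u / (-u[-1])          # force -1 in the last position
        return u[:n], u[n]        # (w, b)

    w1, b1 = bounding_regressor(lower, upper)   # f1, close to y - eps
    w2, b2 = bounding_regressor(upper, lower)   # f2, close to y + eps
    return 0.5 * (w1 + w2), 0.5 * (b1 + b2)

# usage: w, b = gepsvr_fit(X, y); prediction: X_new @ w + b
```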
2.4.2 Regularized GEPSVR Formulation

Now, using the earlier discussed properties of the Rayleigh quotient, the regularized optimization problem (2.40) of ReGEPSVR is equivalent to the following generalized eigenvalue problem

$$Ut = \nu V t, \quad t \neq 0. \tag{2.41}$$
This yields the eigenvector t1 corresponding to largest eigenvalue νmax of (2.41), and
eigenvector t2 corresponding to the smallest eigenvalue νmin of (2.41). To obtain w1
and b1 from t1 and w2 and b2 from t2 , we follow the usual normalization procedure of
Sect. 2.4.1 and get t1^new = (w1  b1  −1)^T and t2^new = (w2  b2  −1)^T. This yields the ε-insensitive bounding regressors f1(x) = x^T w1 + b1 and f2(x) = x^T w2 + b2 from t1^new and t2^new. For a new point x ∈ R^n, the regressed value f(x) is given by

$$f(x) = \frac{1}{2}\big(f_1(x) + f_2(x)\big) = \frac{1}{2} x^T (w_1 + w_2) + \frac{1}{2}(b_1 + b_2).$$
For extending our results to the nonlinear case, we consider the following kernel generated functions instead of linear functions:

$$F_1(x) = K(x^T, A^T)w_1^{\phi} + b_1^{\phi} \quad \text{and} \quad F_2(x) = K(x^T, A^T)w_2^{\phi} + b_2^{\phi}, \tag{2.42}$$

where K is the chosen kernel function and w_1^φ, w_2^φ, b_1^φ and b_2^φ are defined in the kernel space.
Let t^φ = [w^φ  b^φ  −1]^T, and let E and F denote the kernel-space analogues of U and V. Then consider the generalized eigenvalue problem

$$E t^{\phi} = \beta F t^{\phi}, \quad t^{\phi} \neq 0. \tag{2.43}$$
This yields the eigenvector t_1^φ corresponding to the largest eigenvalue β_max of (2.43), and t_2^φ corresponding to the smallest eigenvalue β_min of (2.43). We do the usual normalization of t_1^φ and t_2^φ to get t_1^{φ,new} = [w_1^φ  b_1^φ  −1]^T and t_2^{φ,new} = [w_2^φ  b_2^φ  −1]^T. This gives the ε-insensitive bounding regressors F_1(x) = K(x^T, A^T)w_1^φ + b_1^φ and F_2(x) = K(x^T, A^T)w_2^φ + b_2^φ. For a new input pattern x ∈ R^n, the regressed value is given by

$$F(x) = \frac{1}{2} K(x^T, A^T)\big(w_1^{\phi} + w_2^{\phi}\big) + \frac{1}{2}\big(b_1^{\phi} + b_2^{\phi}\big).$$
To test the performance of ReGEPSVR, Khemchandani et al. [4] implemented
their model on several data sets including UCI, financial time series data and four
two dimensional functions considered by Lázaro et al. [10]. Here, we summarize
some of the conclusions reported in Khemchandani et al. [4].
2.4.3 Experimental Results

To test the performance of the proposed ReGEPSVR, Khemchandani et al. [4] com-
pared it with SVR on several datasets. The performance of these regression algorithms
on the above-mentioned datasets largely depends on the choice of initial parameters.
Hence, optimal parameters for these algorithms for UCI datasets [11] are selected
by using a cross-validation set [12] comprising 10 percent of the dataset, picked
up randomly. Further, the RBF kernel is taken as the choice of kernel function in all the experiments, and performance is compared in terms of the normalized mean squared error (NMSE).
A small NMSE value means good agreement between estimations and real-values.
We consider four benchmark datasets, including the Boston Housing, Machine
CPU, Servo, and Auto-price datasets, obtained from the UCI repository.
The Boston Housing dataset consists of 506 samples. Each sample has thirteen
features which designate the quantities that influence the price of a house in Boston
suburb and an output feature which is the house-price in thousands of dollars. The
Machine CPU dataset concerns Relative CPU performance data. It consists of 209
cases, with seven continuous features, which are MYCT, MMIN, MMAX, CACH,
CHMIN, CHMAX, PRP (output). The Servo dataset consists of 167 samples and
covers an extremely nonlinear phenomenon-predicting the rise time of a servo mech-
anism in terms of two continuous gain settings and two discrete choices of mechanical
linkages. The Auto price dataset consists of 159 samples with fifteen features.
Results for comparison between ReGEPSVR and SVR for the four UCI datasets
are given in Table 2.1.
2.5 Conclusions
This chapter presents the GEPSVM formulation of Mangasarian and Wild [1] for
binary data classification and discusses its advantages over the traditional SVM
formulations. We also discuss two variants of the basic GEPSVM formulation
which seem to reduce overall computational effort over that of GEPSVM. These are
ReGEPSVM formulation of Guarracino et al. [2], and Improved GEPSVM formu-
lation of Shao et al. [3]. In ReGEPSVM we get only a single generalized eigenvalue
problem, while Improved GEPSVM deals with two simple eigenvalue problems.
Taking motivation from Bi and Bennett [9], a regression analogue of GEPSVM is
also discussed. This regression formulation is due to Khemchandani et al. [4] and is
termed as Generalized Eigenvalue Proximal Support Vector Regression (GEPSVR).
A natural variant of GEPSVR, namely ReGEPSVR is also presented here.
References
1. Mangasarian, O. L., & Wild, E. W. (2006). Multisurface proximal support vector machine
classification via generalized eigenvalues. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 28(1), 69–74.
2. Guarracino, M. R., Cifarelli, C., Seref, O., & Pardalos, P. M. (2007). A classification method
based on generalized eigenvalue problems. Optimization Methods and Software, 22(1), 73–81.
3. Shao, Y.-H., Deng, N.-Y., Chen, W.-J., & Wang, Z. (2013). Improved generalized eigenvalue
proximal support vector machine. IEEE Signal Processing Letters, 20(3), 213–216.
4. Khemchandani, R., Karpatne, A., & Chandra, S. (2011). Generalized eigenvalue proximal
support vector regressor. Expert Systems with Applications, 38, 13136–13142.
5. Parlett, B. N. (1998). The symmetric eigenvalue problem: Classics in applied mathematics
(Vol. 20). Philadelphia: SIAM.
6. Tikhonov, A. N., & Arsenin, V. Y. (1977). Solutions of Ill-posed problems. New York: Wiley.
7. Saad, Y. (1992). Numerical methods for large eigenvalue problems. New York: Halsted Press.
8. Saigal, P., & Khemchandani, R. (2015) Nonparallel hyperplane classifiers for multi-category
classification. In IEEE Workshop on Computational Intelligence: Theories, Applications and
Future Directions (WCI). Indian Institute of Technology, Kanpur.
9. Bi, J., & Bennett, K. P. (2003). A geometric approach to support vector regression. Neurocom-
puting, 55, 79–108.
10. Lázaro, M., Santamaría, I., Pérez-Cruz, F., & Artés-Rodríguez, A. (2005). Support vector
regression for the simultaneous learning of a multivariate function and its derivative. Neuro-
computing, 69, 42–61.
11. Alpaydin, E., & Kaynak, C. (1998). UCI Machine Learning Repository, Irvine, CA: University
of California, Department of Information and Computer Sciences. http://archive.ics.uci.edu/
ml.
12. Duda, R., Hart, P., & Stork, D. (2001). Pattern classification. New York: Wiley.
Chapter 3
Twin Support Vector Machines (TWSVM)
for Classification
3.1 Introduction
[4]. Some other variants of TWSVM are later studied in Chap. 5. Section 3.7 contains
certain concluding remarks on TWSVM.
Let the training set TC for the given binary data classification problem be

TC = {(x^(i), y_i) : x^(i) ∈ R^n, y_i ∈ {+1, −1}, i = 1, 2, . . . , m},   (3.1)

where the (m1 × n) matrix A collects the patterns of class +1 and the (m2 × n) matrix B collects the patterns of class −1, with m = m1 + m2. The two primal problems of TWSVM are

(TWSVM1)   Min_(w1, b1, q1)   (1/2)‖Aw1 + e1 b1‖² + C1 e2^T q1
subject to
−(Bw1 + e2 b1) + q1 ≥ e2,
q1 ≥ 0,   (3.2)

and

(TWSVM2)   Min_(w2, b2, q2)   (1/2)‖Bw2 + e2 b2‖² + C2 e1^T q2
subject to
(Aw2 + e1 b2) + q2 ≥ e1,
q2 ≥ 0.   (3.3)
The parameter C1 > 0 controls the trade-off between the minimization of the two terms
in the objective function of (TWSVM1). A similar interpretation may also be given
to the formulation (TWSVM2) given in (3.3).
Thus, TWSVM formulation consists of a pair of quadratic programming prob-
lems (3.2) and (3.3) such that, in each QPP, the objective function corresponds to a
particular class and the constraints are determined by patterns of the other class. As
a consequence of this strategy, TWSVM formulation gives rise to two smaller sized
QPPs, unlike the standard SVM formulation where a single large QPP is obtained. In
(TWSVM1), patterns of class +1 are clustered around the plane x T w1 + b1 = 0. Sim-
ilarly, in (TWSVM2), patterns of class −1 cluster around the plane x T w2 + b2 = 0.
We observe that TWSVM is approximately four times faster than the usual SVM.
This is because the complexity of the usual SVM is no more than m³, and TWSVM
solves two problems, namely (3.2) and (3.3), each of size roughly (m/2). Thus the
ratio of run times is approximately m³ / (2 × (m/2)³) = 4.
At this stage, we give two simple examples to visually illustrate TWSVM and
GEPSVM. Figures 3.1 and 3.2 illustrate the classifier obtained for the two examples
by using GEPSVM and TWSVM, respectively. The data consists of points in R2 .
Points of class 1 and −1 are denoted by two different shapes. The training set accuracy
for TWSVM is 100 percent in both the examples, whereas, for GEPSVM, it is 70
percent and 61.53 percent, respectively.
Fig. 3.1 a GEPSVM Classifier. b TWSVM Classifier. Points of class 1 and −1 are denoted by
different shapes
Fig. 3.2 a GEPSVM Classifier. b TWSVM Classifier. Points of class 1 and −1 are denoted by
different shapes
Taking motivation from the standard SVM methodology, we derive the dual formula-
tion of TWSVM. Obviously this requires the duals of (TWSVM1) and (TWSVM2).
We first consider (TWSVM1) and write its Wolfe dual. For this, we use the Lagrangian
corresponding to the problem (TWSVM1) which is given by
L(w1, b1, q1, α, β) = (1/2)(Aw1 + e1 b1)^T (Aw1 + e1 b1) + C1 e2^T q1
− α^T (−(Bw1 + e2 b1) + q1 − e2) − β^T q1,   (3.4)
where α = (α1 , α2 . . . αm2 )T , and β = (β1 , β2 . . . βm2 )T are the vectors of Lagrange
multipliers. As (TWSVM1) is a convex optimization problem, the Karush–
Kuhn–Tucker (K. K. T) optimality conditions are both necessary and sufficient
(Mangasarian [5], Chandra et al. [6]). Therefore, we write K.K.T conditions for
(TWSVM1) and get the following
A^T (Aw1 + e1 b1) + B^T α = 0,   (3.5)
e1^T (Aw1 + e1 b1) + e2^T α = 0,   (3.6)
C1 e2 − α − β = 0,   (3.7)
−(Bw1 + e2 b1) + q1 ≥ e2,   (3.8)
α^T (−(Bw1 + e2 b1) + q1 − e2) = 0,   (3.9)
β^T q1 = 0,   (3.10)
α ≥ 0, β ≥ 0, q1 ≥ 0.   (3.11)
We now define H = [A  e1], G = [B  e2] and the augmented vector u = [w1^T, b1]^T. Combining (3.5) and (3.6) then gives

(H^T H)u + G^T α = 0,   (3.14)

so that

u = −(H^T H)^{-1} G^T α.   (3.15)

Here, we note that H^T H is always positive semidefinite, but this does not guarantee that
H^T H is invertible. Therefore, on the lines of the regularization term introduced in
works such as Saunders et al. [7], we introduce a regularization term εI, ε > 0, I being the
identity matrix of appropriate dimension, to take care of problems due to possible ill
conditioning of H^T H. This makes the matrix (H^T H + εI) invertible, and therefore
(3.15) gets modified to

u = −(H^T H + εI)^{-1} G^T α.   (3.16)

In the following we shall continue to use (3.15) instead of (3.16), with the understanding that, if need be, (3.16) is to be used for the determination of u.
The Wolfe dual (Mangasarian [5], Chandra et al. [6]) of (TWSVM1) is given by
Max L(w1, b1, q1, α, β)
subject to
∇_{w1} L(w1, b1, q1, α, β) = 0,
∂L/∂b1 = 0,
∂L/∂q1 = 0,
α ≥ 0, β ≥ 0.
Now using the K.K.T conditions (3.5)–(3.11), and making use of (3.15), we obtain
the Wolfe dual of (TWSVM1) as follows
(DTWSVM1)   Max_α   e2^T α − (1/2) α^T G(H^T H)^{-1} G^T α
subject to
0 ≤ α ≤ C1.

Similarly, defining Q = [B  e2] and J = [A  e1], the Wolfe dual of (TWSVM2) is obtained as

(DTWSVM2)   Max_ν   e1^T ν − (1/2) ν^T J(Q^T Q)^{-1} J^T ν
subject to
0 ≤ ν ≤ C2.   (3.17)
Once the duals are solved, the augmented vectors [w1^T, b1]^T and [w2^T, b2]^T are recovered from (3.15) and its analogue, giving the two non-parallel proximal planes

x^T w1 + b1 = 0 and x^T w2 + b2 = 0.   (3.19)

A new pattern x ∈ R^n is then assigned to the class r (r = 1, 2) whose plane it lies closer to, where

d_r(x) = |x^T w_r + b_r| / ‖w_r‖.   (3.20)
In this section, we extend the results of linear TWSVM to kernel TWSVM. For this,
we consider the following kernel generated surfaces instead of planes

K(x^T, C^T)u1 + b1 = 0 and K(x^T, C^T)u2 + b2 = 0,   (3.21)

where C^T = [A^T  B^T] and K is an appropriately chosen kernel. The corresponding primal problems are
(KTWSVM1)   Min_(u1, b1, q1)   (1/2)‖K(A, C^T)u1 + e1 b1‖² + C1 e2^T q1
subject to
−(K(B, C^T)u1 + e2 b1) + q1 ≥ e2,
q1 ≥ 0,   (3.22)

and

(KTWSVM2)   Min_(u2, b2, q2)   (1/2)‖K(B, C^T)u2 + e2 b2‖² + C2 e1^T q2
subject to
(K(A, C^T)u2 + e1 b2) + q2 ≥ e1,
q2 ≥ 0.   (3.23)
The Lagrangian corresponding to (KTWSVM1) is

L(u1, b1, q1, α, β) = (1/2)‖K(A, C^T)u1 + e1 b1‖² + C1 e2^T q1
− α^T (−(K(B, C^T)u1 + e2 b1) + q1 − e2) − β^T q1,   (3.24)
where α = (α1 , α2 . . . αm2 )T , and β = (β1 , β2 . . . βm2 )T are the vectors of Lagrange
multipliers.
We now write the K.K.T optimality conditions for (KTWSVM1) and obtain

K(A, C^T)^T (K(A, C^T)u1 + e1 b1) + K(B, C^T)^T α = 0,   (3.25)
e1^T (K(A, C^T)u1 + e1 b1) + e2^T α = 0,   (3.26)
C1 e2 − α − β = 0,   (3.27)
−(K(B, C^T)u1 + e2 b1) + q1 ≥ e2,   (3.28)
α^T (−(K(B, C^T)u1 + e2 b1) + q1 − e2) = 0,   (3.29)
β^T q1 = 0,   (3.30)
α ≥ 0, β ≥ 0, q1 ≥ 0.   (3.31)
Let S = [K(A, C^T)  e1], R = [K(B, C^T)  e2] and z1 = [u1^T, b1]^T. Proceeding exactly as in the linear case, combining (3.25) and (3.26) gives

z1 = −(S^T S)^{-1} R^T α.   (3.34)
Now using the K.K.T conditions (3.25)–(3.31), and (3.34), we obtain the Wolfe dual
(KDTWSVM1) of (KTWSVM1) as follows
(KDTWSVM1)   Max_α   e2^T α − (1/2) α^T R(S^T S)^{-1} R^T α
subject to
0 ≤ α ≤ C1.   (3.35)

Similarly, with J = [K(B, C^T)  e2] and L = [K(A, C^T)  e1], the Wolfe dual of (KTWSVM2) is obtained as

(KDTWSVM2)   Max_ν   e1^T ν − (1/2) ν^T L(J^T J)^{-1} L^T ν
subject to
0 ≤ ν ≤ C2.   (3.36)
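For the nonlinear case the same solver pattern as in the linear sketch carries over once the kernel blocks K(A, C^T) and K(B, C^T) are formed; a hedged sketch of building these blocks with an RBF kernel is given below, where C stacking all training patterns and the bandwidth sigma are assumptions consistent with how K(·, C^T) is used above.

import numpy as np

def rbf_kernel(X, Z, sigma=1.0):
    """K[i, j] = exp(-||x_i - z_j||^2 / (2 * sigma**2))."""
    sq = np.sum(X**2, 1)[:, None] + np.sum(Z**2, 1)[None, :] - 2.0 * X @ Z.T
    return np.exp(-sq / (2.0 * sigma**2))

def kernel_blocks(A, B, sigma=1.0):
    """Return K(A, C^T) and K(B, C^T), with C stacking the patterns of both classes."""
    C = np.vstack([A, B])
    return rbf_kernel(A, C, sigma), rbf_kernel(B, C, sigma)

In the duals, the blocks [A  e1] and [B  e2] of the linear case are then simply replaced by S = [K(A, C^T)  e1] and R = [K(B, C^T)  e2].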
The Twin Support Vector Machine (TWSVM), GEPSVM and SVM data classifica-
tion methods were implemented by using MATLAB 7 running on a PC with an Intel
P4 processor (3 GHz) with 1 GB RAM. The methods were evaluated on datasets
from the UCI Machine Learning Repository. Generalization error was determined
by following the standard ten fold cross-validation methodology.
Table 3.1 summarizes TWSVM performance on some benchmark datasets avail-
able at the UCI machine learning repository. The table compares the performance
of the TWSVM classifier with that of SVM and GEPSVM [1]. Optimal values of
c1 and c2 were obtained by using a tuning set comprising of 10% of the dataset.
Table 3.2 compares the performance of the TWSVM classifier with that of SVM and
GEPSVM [1] using a RBF kernel. In case of the RBF kernel, we have employed a
rectangular kernel using 80% of the data. Table 3.3 compares the training time for
ten folds, of SVM with that of TWSVM. The TWSVM training time has been deter-
mined for two cases: the first when an executable file is used, and secondly, when a
dynamic linked library (DLL) file is used. The table indicates that TWSVM is not
just effective, but also almost four times faster than a conventional SVM, because
it solves two quadratic programming problems of a smaller size instead of a single
QPP of a very large size.
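As a hedged sketch of the ten-fold protocol just described, the following uses the illustrative train_linear_twsvm and predict functions from the earlier sketch and assumes scikit-learn is available for the fold split; it is not the evaluation code used to produce the tables below.

import numpy as np
from sklearn.model_selection import StratifiedKFold

def ten_fold_accuracy(X, y, C1=1.0, C2=1.0):
    """Ten-fold cross-validation accuracy for labels in {+1, -1}."""
    scores = []
    for tr, te in StratifiedKFold(n_splits=10, shuffle=True, random_state=0).split(X, y):
        A, B = X[tr][y[tr] == 1], X[tr][y[tr] == -1]
        w1, b1, w2, b2 = train_linear_twsvm(A, B, C1, C2)   # from the earlier sketch
        scores.append(np.mean(predict(X[te], w1, b1, w2, b2) == y[te]))
    return np.mean(scores), np.std(scores)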
Table 3.1 Test set accuracy (as percentages) with a linear kernel
Dataset TWSVM GEPSVM SVM
Heart-statlog (270×14) 84.44±4.32 84.81±3.87 84.07±4.40
Heart-c (303×14) 83.80±5.53 84.44±5.27 82.82±5.15
Hepatitis (155×19) 80.79±12.24 58.29±19.07 80.00±8.30
Ionosphere (351×34) 88.03±2.81 75.19±5.50 86.04 ±2.37
Sonar (208×60) 77.26±10.10 66.76±10.75 79.79±5.31
Votes (435×16) 96.08±3.29 91.93±3.18 94.50±2.71
Pima-Indian (768×8) 73.70±3.97 74.60±5.07 76.68±2.90
Australian (690×14) 85.80±5.05 85.65±4.60 85.51±4.58
CMC (1473×9) 67.28±2.21 65.99±2.30 67.82±2.63
Table 3.2 Percentage test set accuracy with a RBF kernel. (∗ marked) Testing accuracy figures
have been obtained from [1]
Dataset TWSVM SVM GEPSVM
Hepatitis 82.67±10.04 83.13±11.25 78.25±11.79
WPBC 81.92±8.98 79.92±9.18 62.7∗
BUPA liver 67.83±6.49 58.32±8.20 63.8∗
Votes 94.72±4.72 94.94±4.33 94.2∗
The TWSVM formulation has several advantages over the standard SVM type for-
mulation. Some of these are
(i) The dual problems (DTWSVM1) and (DTWSVM2) have m1 and m2 variables
respectively, as opposed to m = m1 + m2 variables in the standard SVM. This
strategy of solving a pair of smaller sized QPP’s instead of a large one, makes
the learning speed of TWSVM approximately four times faster than that of
SVM.
(ii) TWSVM uses the quadratic loss function and therefore fully considers the prior
information within the classes of the data. This makes TWSVM less sensitive to
noise.
(iii) TWSVM is useful for automatically discovering two dimensional projection of
the data.
But the TWSVM formulation still has some drawbacks which we list below
(i) In the primal problems (TWSVM1) and (TWSVM2), only the empirical risk is
minimized, whereas the structural SVM formulation is based on the structural
risk minimization principle.
(ii) Though TWSVM solves two smaller sized QPP’s, it needs the explicit knowl-
edge of the inverse of matrices H T H and QT Q. The requirement of evaluating
(H T H)−1 and (QT Q)−1 explicitly puts severe limitation on the application of
TWSVM for very large data sets. Further while evaluating the computational
complexity of TWSVM, the cost of computing (H T H)−1 and (QT Q)−1 should
also be included.
(iii) As a quadratic loss function is involved in the TWSVM formulation, almost all
data points are involved in the determination of the final decision function. As a
consequence, the solution loses the sparseness enjoyed by the standard SVM.
3.6 Twin Bounded Support Vector Machine
It is well known that one significant advantage of SVM is the fact that it is based on
the structural risk minimization principle, but only the empirical risk is considered
in the primal problems (TWSVM1) and (TWSVM2) of TWSVM. Also in the dual
formulation of TWSVM, the inverse matrices (H T H)−1 and (QT Q)−1 appear explic-
itly. This implies that the TWSVM formulation implicitly assumes that the matrices
(H^T H) and (Q^T Q) are nonsingular. However, this additional requirement cannot
always be satisfied. This point was certainly noted in Jayadeva et al. [2], where an
appropriate regularization technique was suggested to handle this scenario. But then
this amounts to solving an approximate problem and not the real TWSVM problem.
Shao et al. [3] in (2011) suggested a variant of TWSVM which improves the
original TWSVM formulation. The stated improvements include adherence to the
structural risk minimization principle, dual formulations whose matrix inverses are
automatically guaranteed to exist, and suitability for the successive overrelaxation
(SOR) methodology. Shao et al. [3] termed their model as
Twin Bounded Support Vector Machine (TBSVM) which we discuss in the next
sub-section.
(TBSVM1)   Min_(w1, b1, q1)   (1/2)(Aw1 + e1 b1)^T (Aw1 + e1 b1) + (C3/2)(‖w1‖² + (b1)²) + C1 e2^T q1
subject to
−(Bw1 + e2 b1) + q1 ≥ e2,
q1 ≥ 0,   (3.37)

and

(TBSVM2)   Min_(w2, b2, q2)   (1/2)(Bw2 + e2 b2)^T (Bw2 + e2 b2) + (C4/2)(‖w2‖² + (b2)²) + C2 e1^T q2
subject to
(Aw2 + e1 b2) + q2 ≥ e1,
q2 ≥ 0.   (3.38)
Similar arguments give justification for the inclusion of the term ‖w2‖² + (b2)² in the objective
function of problem (3.38).
Though Shao et al. [3] claimed that the inclusion of the additional terms (C3/2)(‖w1‖² + (b1)²) in (TBSVM1) and (C4/2)(‖w2‖² + (b2)²) in (TBSVM2) achieves structural risk minimization, their arguments are only motivational and merely attempt to justify their claims.
Unlike SVM, a proper mathematical theory of TBSVM based on statistical learning
theory (SLT) is still not available and this aspect needs to be further explored.
In order to get the solutions of problems (TBSVM1) and (TBSVM2) we need
to derive their dual problems. To get the dual (DTBSVM1) of (TBSVM1), we con-
struct the Lagrangian and then follow the standard methodology. For (TBSVM1),
the Lagrangian is
L(w1, b1, q1, α, β) = (C3/2)(‖w1‖² + (b1)²) + (1/2)‖Aw1 + e1 b1‖² + C1 e2^T q1
− α^T (−(Bw1 + e2 b1) + q1 − e2) − β^T q1,   (3.39)
where α = (α1 , α2 . . . αm2 )T , and β = (β1 , β2 . . . βm2 )T are the vectors of Lagrange
multipliers.
Now writing the Wolfe dual of (TBSVM1), we get (DTBSVM1) as follows
(DTBSVM1)   Max_α   e2^T α − (1/2) α^T G(H^T H + C3 I)^{-1} G^T α
subject to
0 ≤ α ≤ C1.   (3.40)

Similarly, the Wolfe dual of (TBSVM2) is

(DTBSVM2)   Max_ν   e1^T ν − (1/2) ν^T J(Q^T Q + C4 I)^{-1} J^T ν
subject to
0 ≤ ν ≤ C2.   (3.41)

Here the matrices H, G, Q and J are the same as in the TWSVM duals. The important point is that the inverses (H^T H + C3 I)^{-1}
and (Q^T Q + C4 I)^{-1} are guaranteed to exist without any extra assumption or modification. The rest of the details are similar to that of TWSVM.
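Computationally, the only change relative to the earlier TWSVM sketch is that the user-chosen C3, C4 replace the small fixed regularizer inside the inverted matrices, as the following minimal helper (illustrative names, assuming numpy) indicates.

import numpy as np

def tbsvm_inverses(H, G, C3, C4):
    """The inverted matrices of the TBSVM duals (3.40)-(3.41): C3, C4 replace
    the small regularizer eps used in the TWSVM sketch."""
    return (np.linalg.inv(H.T @ H + C3 * np.eye(H.shape[1])),
            np.linalg.inv(G.T @ G + C4 * np.eye(G.shape[1])))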
For the nonlinear case, the kernel generated surfaces are used exactly as in kernel TWSVM, and the first primal problem becomes

(KTBSVM1)   Min_(u1, b1, q1)   (1/2)‖K(A, C^T)u1 + e1 b1‖² + (C3/2)(‖u1‖² + (b1)²) + C1 e2^T q1
subject to
−(K(B, C^T)u1 + e2 b1) + q1 ≥ e2,
q1 ≥ 0.   (3.43)
The Wolfe dual of (KTBSVM1) is obtained as

(DKTBSVM1)   Max_α   e2^T α − (1/2) α^T R(S^T S + C3 I)^{-1} R^T α
subject to
0 ≤ α ≤ C1.   (3.44)
In a similar manner, the problem (KTBSVM2) is constructed and its dual is obtained
as
(DKTBSVM2)   Max_ν   e1^T ν − (1/2) ν^T L(J^T J + C4 I)^{-1} L^T ν
subject to
0 ≤ ν ≤ C2.   (3.45)

Here the matrices R, S, L and J are the same as in problems (3.35) and (3.36). The rest
of the details are analogous to that of nonlinear TWSVM.
Remark 3.6.1 Shao et al. [3] implemented their linear TBSVM and nonlinear
TBSVM models on various artificial and real life datasets. The experimental results
show the effectiveness of these models in both computation time and classification
accuracy. In fact, Shao et al. [3] used the successive overrelaxation (SOR) technique to
speed up the training procedure, but the explicit inverses of the relevant matrices appearing
in the duals are still required.
3.7 Improved Twin Support Vector Machine

In a recent work, Tian et al. [4] presented an improved model of the twin methodology, termed as Improved Twin Support Vector Machine (ITSVM), for binary data
classification. Surprisingly, ITSVM is exactly the same as TBSVM (Shao et al. [3]), but
represented differently. This leads to Lagrangian functions for the primal problems which differ from those of TWSVM and TBSVM, and therefore to different dual formulations. It has
been shown in Tian et al. [4] that ITSVM does not need to compute the large inverse
matrices before training, and the kernel trick can be applied directly to ITSVM for the
nonlinear case. Further, ITSVM can be solved efficiently by the successive overrelaxation (SOR) and sequential minimal optimization (SMO) techniques, which
makes it more suitable for large datasets.
The primal problems of ITSVM are

(ITSVM1)   Min_(w1, b1, p, q1)   (1/2)p^T p + (C3/2)(‖w1‖² + (b1)²) + C1 e2^T q1
subject to
Aw1 + e1 b1 = p,
−(Bw1 + e2 b1) + q1 ≥ e2,
q1 ≥ 0,   (3.46)

and

(ITSVM2)   Min_(w2, b2, q, q2)   (1/2)q^T q + (C4/2)(‖w2‖² + (b2)²) + C2 e1^T q2
subject to
Bw2 + e2 b2 = q,
(Aw2 + e1 b2) + q2 ≥ e1,
q2 ≥ 0.   (3.47)
The Lagrangian corresponding to (ITSVM1) is

L(w1, b1, p, q1, α, β, λ) = (C3/2)(‖w1‖² + (b1)²) + (1/2)p^T p + C1 e2^T q1
+ λ^T (Aw1 + e1 b1 − p) − α^T (−(Bw1 + e2 b1) + q1 − e2) − β^T q1,   (3.48)
where α = (α1 , α2 . . . αm2 )T , β = (β1 , β2 . . . βm2 )T and λ = (λ1 , λ2 . . . λm1 )T are the
vectors of Lagrange multipliers. The K.K.T necessary and sufficient optimality con-
ditions for (ITSVM1) are given by
C3 w1 + A^T λ + B^T α = 0,   (3.49)
C3 b1 + e1^T λ + e2^T α = 0,   (3.50)
λ − p = 0,   (3.51)
C1 e2 − α − β = 0,   (3.52)
Aw1 + e1 b1 = p,   (3.53)
−(Bw1 + e2 b1) + q1 ≥ e2,   (3.54)
α^T (Bw1 + e2 b1 − q1 + e2) = 0,   (3.55)
β^T q1 = 0,   (3.56)
α ≥ 0, β ≥ 0, λ ≥ 0, q1 ≥ 0.   (3.57)
From (3.49) and (3.50) we get

w1 = −(1/C3)(A^T λ + B^T α),   (3.58)
b1 = −(1/C3)(e1^T λ + e2^T α).   (3.59)
Now using (3.58), (3.59) and (3.51), we obtain the Wolfe dual of (ITSVM1) as
(DITSVM1)   Max_(λ, α)   −(1/2)(λ^T, α^T) Q1 (λ^T, α^T)^T + C3 e2^T α
subject to
0 ≤ α ≤ C1 e2.   (3.60)
Here

Q1 = [ A A^T + C3 I   A B^T ;  B A^T   B B^T ] + E,   (3.61)

and I is the (m1 × m1) identity matrix. Also, E is the (m × m) matrix having all entries
equal to one.
Similarly, the dual of (ITSVM2) is obtained as

(DITSVM2)   Max_(θ, ν)   −(1/2)(θ^T, ν^T) Q2 (θ^T, ν^T)^T + C4 e1^T ν
subject to
0 ≤ ν ≤ C2 e1,   (3.62)

where

Q2 = [ B B^T + C4 I   B A^T ;  A B^T   A A^T ] + E.   (3.63)
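A minimal numpy sketch of assembling the block matrices of (3.61) and (3.63) is given below; the function name and arguments are illustrative, and E denotes the all-ones matrix of the text.

import numpy as np

def itsvm_Q_matrices(A, B, C3, C4):
    """Assemble Q1 of (3.61) and Q2 of (3.63) for the linear ITSVM duals."""
    m1, m2 = A.shape[0], B.shape[0]
    E = np.ones((m1 + m2, m1 + m2))                      # all-ones matrix E
    Q1 = np.block([[A @ A.T + C3 * np.eye(m1), A @ B.T],
                   [B @ A.T,                   B @ B.T]]) + E
    Q2 = np.block([[B @ B.T + C4 * np.eye(m2), B @ A.T],
                   [A @ B.T,                   A @ A.T]]) + E
    return Q1, Q2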
Once the optimal (λ*, α*) and (θ*, ν*) are obtained, w1 and w2 follow from (3.58) and its analogue, while

b1 = −(1/C3)(e1^T λ* + e2^T α*),

and

b2 = −(1/C4)(e2^T θ* + e1^T ν*).
The linear ITSVM is equivalent to the linear TBSVM because the primal problems
are the same; they differ only in representation.
The important point to note here is that problems (DITSVM1) and (DITSVM2)
are quadratic programming problems which do not require the computation of matrix
inverses. In this respect, the ITSVM formulation is certainly attractive in comparison
to the TWSVM and TBSVM formulations. However, a major disadvantage of ITSVM is
that the matrices Q1 and Q2 involve all the patterns of classes A and B. In a recent
work, Peng et al. [9] presented an L1-norm version of ITSVM, which again needs to
optimize a pair of dual QPPs that are larger than those of TWSVM.
We now present the nonlinear ITSVM formulation. Here unlike the nonlinear
TBSVM or TWSVM formulations, we do not need to consider the kernel generated
surfaces and construct two new primal problems corresponding to these surfaces.
But rather we can introduce the kernel function directly into the problems (3.60)
and (3.62) in the same manner as in the case of standard SVM formulation. As a
consequence of this construction, it follows that similar to SVM, linear ITSVM is a
special case of nonlinear ITSVM with its specific choice of the kernel as the linear
kernel. Unfortunately this natural property is not shared by TWSVM or TBSVM
formulations.
Let us now introduce the kernel function K(x, x′) = ⟨φ(x), φ(x′)⟩ and the corresponding transformation z = φ(x), where z ∈ H, H being an appropriate Hilbert
space termed as the feature space. Therefore, the corresponding primal ITSVM problems in the
feature space are
Min_(w1, b1, p, q1)   (1/2)p^T p + (C3/2)(‖w1‖² + (b1)²) + C1 e2^T q1
subject to
φ(A)w1 + e1 b1 = p,
−(φ(B)w1 + e2 b1) + q1 ≥ e2,
q1 ≥ 0,   (3.64)

and

Min_(w2, b2, q, q2)   (1/2)q^T q + (C4/2)(‖w2‖² + (b2)²) + C2 e1^T q2
subject to
φ(B)w2 + e2 b2 = q,
(φ(A)w2 + e1 b2) + q2 ≥ e1,
q2 ≥ 0.   (3.65)
We next consider problem (3.64) and write its Wolfe dual to get

(DKITSVM1)   Max_(λ, α)   −(1/2)(λ^T, α^T) Q3 (λ^T, α^T)^T + C3 e2^T α
subject to
0 ≤ α ≤ C1 e2.   (3.66)

Here

Q3 = [ K(A^T, A^T) + C3 I   K(A^T, B^T) ;  K(B^T, A^T)   K(B^T, B^T) ] + E.   (3.67)

Similarly,

(DKITSVM2)   Max_(θ, ν)   −(1/2)(θ^T, ν^T) Q4 (θ^T, ν^T)^T + C4 e1^T ν
subject to
0 ≤ ν ≤ C2 e1,   (3.68)

where

Q4 = [ K(B^T, B^T) + C4 I   K(B^T, A^T) ;  K(A^T, B^T)   K(A^T, A^T) ] + E.   (3.69)
Here again we note that in problems (3.66) and (3.68) we do not require the com-
putation of inverse matrices any more. Also these problems degenerate to problems
(3.60) and (3.62) corresponding to linear ITSVM when the chosen kernel K is the
linear kernel.
Remark 3.7.1 Tian et al. [4] presented two fast solvers to solve various optimiza-
tion problems involved in ITSVM, namely SOR and SMO type algorithms. This
makes ITSVM more suitable to large scale problems. Tian et al. [4] also performed
a very detailed numerical experimentation with ITSVM, both on small datasets and
very large datasets. The small datasets were the usual publicly available bench-
mark datasets, while the large datasets were NDC-10k, NDC-50k and NDC-1m
using Musicant's NDC data generator [10]. ITSVM performed better
than TWSVM and TBSVM on most of the datasets. On the large datasets NDC-10k, NDC-50k and NDC-1m, TWSVM and TBSVM failed because the experiments ran out of
memory, whereas ITSVM produced the classifier.
3.8 Conclusions
This chapter presents TWSVM formulation for binary data classification and dis-
cusses its advantages and possible drawbacks over the standard SVM formulation.
Though there are several variants of the basic TWSVM formulation available in the
literature, we discuss two of these in this chapter. These are TBSVM formulation
due to Shao et al. [3] and ITSVM formulation due to Tian et al. [4]. The construc-
tion of TBSVM attempts to include structural risk minimization in its formulation
and therefore claims its theoretical superiority over TWSVM. The construction of
ITSVM is very similar to TBSVM but has different mathematical representation.
One added advantage of ITSVM is that it does not require computation of inverses
and therefore is more suitable to large datasets. We shall discuss some other relevant
variants of TWSVM later in Chap. 5.
References
1. Mangasarian, O. L., & Wild, E. W. (2006). Multisurface proximal support vector machine
classification via generalized eigenvalues. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 28(1), 69–74.
2. Jayadeva, Khemchandani, R., & Chandra, S. (2007). Twin support vector machines for pattern
classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(5), 905–
910.
3. Shao, Y.-H., Zhang, C.-H., Wang, X.-B., & Deng, N.-Y. (2011). Improvements on twin support
vector machines. IEEE Transactions on Neural Networks, 22(6), 962–968.
4. Tian, Y. J., Ju, X. C., Qi, Z. Q., & Shi, Y. (2013). Improved twin support vector machine.
Science China Mathematics, 57(2), 417–432.
5. Mangasarian, O. L. (1994). Nonlinear programming. Philadelphia: SIAM.
6. Chandra, S., Jayadeva, & Mehra, A. (2009). Numerical optimization with applications. New
Delhi: Narosa Publishing House.
7. Saunders, C., Gammerman, A., & Vovk, V. (1998). Ridge regression learning algorithm in dual
variables. In Proceedings of the Fifteenth International Conference on Machine Learning (pp.
515–521).
8. Fung, G., & Mangasarian, O. L. (2001). Proximal support vector machine classifiers. In F.
Provost & R. Srikant (Eds.), Proceedings of Seventh International Conference on Knowledge
Discovery and Data Mining (pp. 77–86).
9. Peng, X. J., Xu, D., Kong, L., & Chen, D. (2016). L1-norm loss based twin support vector
machine for data recognition. Information Sciences, 340–341, 86–103.
10. Musicant, D. R. (1998). NDC: Normally Distributed Clustered Datasets, Computer Sciences
Department, University of Wisconsin, Madison.
Chapter 4
TWSVR: Twin Support Vector Machine
Based Regression
4.1 Introduction
SVR (Support Vector Regression) is a SVM based approach to study the regression
problem. The standard SVR model sets an epsilon tube around data points within
which errors are discarded using an epsilon insensitive loss function. We have already
presented the standard SVR formulation in Chap. 1.
One of the major theoretical developments in the context of SVR is due to Bi and
Bennett [1]. They (Bi and Bennett [1]) presented an intuitive geometric framework
for SVR which shows that SVR can be related to SVM for an appropriately con-
structed classification problem. This result of Bi and Bennett [1] is conceptually very
significant, because it suggests that any variant of SVM has possibility of having an
analogous SVR formulation. This has been the motivation of introducing GEPSVR
in Chap. 2.
In Chap. 3, we have presented the development of TWSVM (Jayadeva et al. [2])
for the binary data classification problem. Since TWSVM has proven its advantage
over the standard SVM, it makes sense to look into the possibility of obtaining a
regression analogue of TWSVM. In the literature, Peng [3] is credited to initiate the
study of regression problem in the twin framework. The work of Peng [3] motivated
many researchers to further study regression problem in twin setting, e.g. Xu and
Wang [4], Shao et al. [5], Chen et al. [6, 7], Zhao et al. [8], Zhang et al. [9], Peng
[10, 11], Balasundaram and Tanveer [12] and Singh et al. [13].
Recently, Khemchandani et al. [14, 15] presented a new framework of the twin
support vector model for the regression problem, termed as TWSVR. Unlike Peng [3],
TWSVR is truly inspired by TWSVM: the upper bound regressor (respectively, lower bound regressor) problem deals with the proximity of points in the upper
tube (respectively, lower tube) and, at the same time, keeps at least an ε distance from the points
of the lower tube (respectively, upper tube).
In our presentation here, we shall differentiate between the two notations TSVR
and TWSVR. TSVR refers to Peng's formulation [3], whereas TWSVR refers to the formulation of Khemchandani et al. [14, 15].
TC = {((x^(i), y_i + ε), +1), ((x^(i), y_i − ε), −1), (i = 1, 2, . . . , l)}.   (4.2)

Let A be an (l × n) matrix whose ith row is the vector (x^(i))^T. Let Y = (y1, y2, . . . , yl)^T, (Y + εe) = (y1 + ε, y2 + ε, . . . , yl + ε)^T and (Y − εe) = (y1 − ε, y2 − ε, . . . , yl − ε)^T. Then f(x) = x^T w + b may be identified as a hyperplane in R^{n+1}.
The TSVR formulation as given by Peng [3] consists of the following two QPPs

(TSVR1)   Min_(w1, b1, ξ1)   (1/2)‖Y − eε1 − (Aw1 + eb1)‖² + C1 e^T ξ1
subject to
Y − (Aw1 + eb1) ≥ eε1 − ξ1,  ξ1 ≥ 0,

and

(TSVR2)   Min_(w2, b2, ξ2)   (1/2)‖Y + eε2 − (Aw2 + eb2)‖² + C2 e^T ξ2
subject to
(Aw2 + eb2) − Y ≥ eε2 − ξ2,  ξ2 ≥ 0,

where C1, C2 > 0, ε1, ε2 > 0 are parameters, ξ1, ξ2 are slack vectors, e denotes a vector
of ones of appropriate dimension and ‖·‖ denotes the L2 norm.
Each of the above two QPPs is smaller than the one obtained in the classical SVR
formulation. Also, (TSVR1) finds the down bound regressor f1(x) = x^T w1 + b1 and
(TSVR2) finds the up bound regressor f2(x) = x^T w2 + b2. The final regressor is
taken as the mean of the up and down bound regressors.
We would now like to make certain remarks on Peng's formulation of TSVR.
These remarks not only show that Peng's formulation is not in the true spirit of
TWSVM but also motivate the proposed formulation of Khemchandani et al. [14, 15]. In this context, we have the following lemma.
Lemma 4.2.1 For the given dataset, let f(x) be the final regressor obtained from
(TSVR1) and (TSVR2) when ε1 = ε2 = 0, and g(x) be the final regressor obtained
for any constant values of ε1 and ε2. Then f(x) = g(x) + (ε1 − ε2)/2.

Proof Let (w1, b1) and (w2, b2) be the solutions to (TSVR1) and (TSVR2), respectively, for constant ε1 and ε2, so that g(x) = (x^T w1 + x^T w2)/2 + (b1 + b2)/2. Now
applying the transformation (b1)_new = b1 + ε1 to (TSVR1) and (b2)_new = b2 − ε2
to (TSVR2), we note that the resulting formulations have no ε term in them. The final
regressor obtained from these transformed formulations will be f(x). It follows from
the transformation that (w1, b1 + ε1) and (w2, b2 − ε2) will be the solutions to the transformed QPPs. Hence f(x) = (x^T w1 + x^T w2)/2 + (b1 + b2)/2 + (ε1 − ε2)/2, thus
proving the result.
The understanding of Lemma 4.2.1 comes from analyzing the physical interpretation of the TSVR model. For both formulations of TSVR, the objective function
and the constraints are shifted by the epsilon values in the same direction. This
shifting by epsilon in the same direction limits the role of epsilon to
shifting the final regressor linearly, without playing any role in the orientation of the
final regressor. Lemma 4.2.1 essentially tells us that the general regressor g(x) can be
obtained by first solving the two QPPs (TSVR1) and (TSVR2) neglecting ε1 and ε2,
and then shifting the result by (ε1 − ε2)/2.
The following remarks are evident from Lemma 4.2.1.
(i) The values of ε1 and ε2 only contribute a linear shift of the final regressor. The
orientation of the regressor is independent of the values of ε1 and ε2. Therefore, in
the final hyperplane y = w^T x + b, only b is a function of ε1 and ε2, whereas w
is independent of ε1 and ε2. This suggests that the TSVR formulation is not even in
line with classical support vector regression.
(ii) The regressor depends only on the single value ε1 − ε2. This unnecessarily
increases the burden of parameter selection to two parameters (ε1, ε2) when,
in reality, the final regressor depends on only one parameter (ε1 − ε2).
(iii) The regressor is independent of the values of ε1 and ε2 for ε1 = ε2. We get the same
final regressor whether, say, ε1 = ε2 = 0.1 or ε1 = ε2 = 100. The experiment
section of Peng [3] only shows results for the ε1 = ε2 case, which, as we have shown,
is the same as not considering the εs at all.
In order to support the above facts empirically, we provide certain plots in
Sect. 4.5.
Apart from the above points, there is another problem with the formulation of Peng [3].
Let us consider (TSVR1) first. For any sample point (x^(i), y_i) and the fitted function
f1(x), if (y_i − f1(x^(i)) − ε) ≥ 0, then the penalty term is (1/2)(y_i − f1(x^(i)) − ε)², and
if (y_i − f1(x^(i)) − ε) ≤ 0, then the penalty term is (1/2)(y_i − f1(x^(i)) − ε)² + C1 |y_i − f1(x^(i)) − ε|. Thus the penalty term is asymmetric, as it imposes more penalty on a negative
deviation of y_i − f1(x^(i)) − ε. This happens because the points of the same class
(x^(i), y_i − ε), i = 1, . . . , l, appear both in the objective function and in the
constraints, where (x^(i), y_i − ε), i = 1, . . . , l, and (x^(i), y_i + ε), i = 1, . . . , l, are
the two classes of points. Similar arguments hold for (TSVR2) as well. This logic is
not consistent with the basic principle of the twin methodology.
Fig. 4.1 Band regression: a original data (A_i); b shifted data (A_i^+ and A_i^−) and the separating plane;
c regression plane
Min_(w, b, η)   (1/2)‖w‖² + (1/2)η²
subject to
Aw + η(Y + εe) + be ≥ e,
Aw + η(Y − εe) + be ≤ −e.   (4.6)

The solution of the SVR problem is thus related to the SVM solution of the resulting classification
problem.
4.4 TWSVR via TWSVM

We now proceed to discuss the formulations of Goyal [16] and Khemchandani et al.
[14, 15] for twin support vector regression. Here the intuition is to derive the TWSVR
formulation by relating it to a suitable TWSVM formulation, in exactly the same way
as the SVR formulation is related to the SVM formulation. As a first step we construct D+
and D− as in Sect. 4.3 and obtain two hyperplanes for the resulting TWSVM problem.
If we show that the mean of these hyperplanes linearly separates the two sets D+ and
D−, then this mean becomes an ε-insensitive hyperplane for the regression problem. This
is true because TWSVM, when applied to two sets A and B, gives two hyperplanes x^T w1 +
b1 = 0 and x^T w2 + b2 = 0 such that the shifted hyperplane x^T w1 + b1 + 1 = 0 lies
above the points of set B and, similarly, the shifted hyperplane x^T w2 + b2 − 1 = 0
lies below the points of set A. Clearly, the mean of these shifted hyperplanes
(which is the same as the mean of the original hyperplanes) will then separate the two sets A
and B.
Now the TWSVM methodology applied to the two datasets D+ and D− produces the
following two QPPs, yielding hyperplanes x^T w1 + η1 y + b1 = 0 and x^T w2 + η2 y + b2 = 0:

Min_(w1, b1, η1, ξ1)   (1/2)‖Aw1 + η1(Y + εe) + eb1‖² + C1 e^T ξ1
subject to
−(Aw1 + η1(Y − εe) + eb1) + ξ1 ≥ e,  ξ1 ≥ 0,   (4.8)

and

Min_(w2, b2, η2, ξ2)   (1/2)‖Aw2 + η2(Y − εe) + eb2‖² + C2 e^T ξ2
subject to
(Aw2 + η2(Y + εe) + eb2) + ξ2 ≥ e,  ξ2 ≥ 0.   (4.9)
Let us consider the first problem (4.8). Here we note that η1 ≠ 0 and therefore, without
any loss of generality, we can assume that η1 < 0. We next consider the constraints
of (4.8) and rewrite them as

(−A (w1/(−η1)) + (Y − εe) − e (b1/(−η1)))(−η1) + ξ1 ≥ e,  ξ1 ≥ 0.   (4.10)
On replacing w1 by −w1/η1 and b1 by −b1/η1, and noting that −η1 > 0, we get

−(Aw1 + eb1) + Y − εe ≥ −(1/η1)e − ξ1/(−η1),  ξ1 ≥ 0.   (4.11)

Let ε1 = −1/η1. As η1 < 0, we have ε1 > 0, and therefore from the above

−(Aw1 + eb1) + Y − εe ≥ ε1 e − ξ1/(−η1).

Again, by replacing w1 by −w1/η1, b1 by −b1/η1, ξ1 by ξ1/(−η1) ≥ 0, and C1 by
C1/(−η1) > 0, we get the above objective function as

(η1²/2)‖Aw1 + eb1 − (Y + εe)‖² + C1 e^T ξ1.   (4.14)
Dropping the positive factor η1², problem (4.8) therefore reduces to

Min_(w1, b1, ξ1)   (1/2)‖(Y + εe) − (Aw1 + eb1)‖² + C1 e^T ξ1
subject to

where ε2 = (ε − 1/η2) > 0 and the new b1 = (b1 − 1/η2) is still denoted by b1. This
change requires a similar adjustment in the constraints. Thus

is equivalent to

Aw1 + eb1 − (1/η2)e − (Y − eε1) ≤ ξ1 − (ε/η2)e,

but as the new b1 = (b1 − 1/η2) is still denoted by b1 and ξ1 − (ε/η2)e ≥ 0 (since ξ1 ≥
0, η2 < 0, ε > 0) is still denoted by ξ1, problem (4.15) is equivalent to (TWSVR1)
stated below.
(TWSVR1)   Min_(w1, b1, ξ1)   (1/2)‖Y + eε2 − (Aw1 + eb1)‖² + C1 e^T ξ1
subject to
Y − (Aw1 + eb1) ≥ eε1 − ξ1,  ξ1 ≥ 0,

and

(TWSVR2)   Min_(w2, b2, ξ2)   (1/2)‖Y − eε1 − (Aw2 + eb2)‖² + C2 e^T ξ2
subject to
(Aw2 + eb2) − Y ≥ eε2 − ξ2,  ξ2 ≥ 0.
Solving two formulations instead of one, as in classical SVR, provides the advantage that the bounding regressors, which are now obtained by solving two different
formulations rather than one, are no longer required to be parallel, as they are in SVR.
Since the bounding regressors forming the epsilon tube are not constrained to be parallel, the tube width can become zero at some places and non-zero at others. Figure 4.2
illustrates this interpretation of the epsilon tube. This allows the final regressor to generalize
better than SVR; a similar advantage is observed for TWSVM as compared to
SVM. Also, the two formulations of TWSVR are each smaller than the formulation
of SVR, because the constraints of the SVR formulation are split across the two formulations
of TWSVR, making the number of inequality constraints in each TWSVR problem smaller
and hence making the TWSVR model faster. As with the TWSVM formulation, we
obtain that TWSVR is approximately four times faster than standard SVR. These
advantages of TWSVR over SVR, in terms of both performance and computational
time, make it better suited for a regression problem.
Next, on the lines of Shao et al. [18], we introduce a regularization term in each
of the two TWSVR formulations to control the structural risk, and from now on
we call the following two formulations the TWSVR formulations

(TWSVR1)   Min_(w1, b1, ξ1)   C1 e^T ξ1 + (C3/2)(‖w1‖² + (b1)²) + (1/2)‖Y + eε2 − (Aw1 + eb1)‖²
subject to
Y − (Aw1 + eb1) ≥ eε1 − ξ1,  ξ1 ≥ 0,

and

(TWSVR2)   Min_(w2, b2, ξ2)   C2 e^T ξ2 + (C4/2)(‖w2‖² + (b2)²) + (1/2)‖Y − eε1 − (Aw2 + eb2)‖²
subject to
(Aw2 + eb2) − Y ≥ eε2 − ξ2,  ξ2 ≥ 0.
As with (TWSVM), we work with the dual formulation of TWSVR. For this we con-
sider the Lagrangian corresponding to the problem(TWSVR1) and write its Wolfe’s
dual. The Lagrangian function is
0 ≤ α ≤ C1 . (4.27)
We define

H = [A  e],   (4.29)

and the augmented vector u = [w1^T, b1]^T. With these notations, (4.28) may be rewritten
as

giving

(DTWSVR1)   Min_α   −(1/2)α^T H(H^T H + C3 I)^{-1} H^T α + f^T H(H^T H + C3 I)^{-1} H^T α − f^T α + (ε1 + ε2)e^T α
subject to
0 ≤ α ≤ C1 e.   (4.33)

Similarly, the dual of (TWSVR2) is

(DTWSVR2)   Min_γ   g^T γ + (ε1 + ε2)e^T γ − (1/2)γ^T H(H^T H + C4 I)^{-1} H^T γ − g^T H(H^T H + C4 I)^{-1} H^T γ
subject to
0 ≤ γ ≤ C2 e,   (4.34)
Fig. 4.3 TWSVR and TSVR plot of the norm of the w hyperplane parameter versus different values of
ε for the Servo dataset, with ε1 = ε2
Fig. 4.5 TWSVR and TSVR plot of the sum of squared error versus different values of ε for the Sinc
function
We now state the primal and dual versions of Kernel TWSVR which can be derived
on the lines of Kernel TWSVM discussed in Chap. 3. We consider the following
kernel generated functions instead of linear functions.
(KTWSVR1)   Min_(w1, b1, ξ1)   C1 e^T ξ1 + (1/2)‖Y + eε2 − (K(A, A^T)w1 + eb1)‖² + (C3/2)(‖w1‖² + (b1)²)
subject to
Y − (K(A, A^T)w1 + eb1) ≥ eε1 − ξ1,  ξ1 ≥ 0,

and

(KTWSVR2)   Min_(w2, b2, ξ2)   C2 e^T ξ2 + (1/2)‖Y − eε1 − (K(A, A^T)w2 + eb2)‖² + (C4/2)(‖w2‖² + (b2)²)
subject to
(K(A, A^T)w2 + eb2) − Y ≥ eε2 − ξ2,  ξ2 ≥ 0.   (4.40)

The K.K.T optimality conditions for (KTWSVR1) include

C1 e − α − β = 0,   (4.41)
(K(A, A^T)w1 + eb1) − (Y − eε1) ≤ ξ1,  ξ1 ≥ 0,   (4.42)
α^T ((K(A, A^T)w1 + eb1) − (Y − eε1) − ξ1) = 0,  β^T ξ1 = 0,   (4.43)
α ≥ 0, β ≥ 0.   (4.44)
0 ≤ α ≤ C1.   (4.45)

We define

H = [K(A, A^T)  e],   (4.47)

and the augmented vector u = [w1^T, b1]^T. With these notations, (4.46) may be rewritten
as

giving

Using (4.38) and the above K.K.T conditions, we obtain the Wolfe dual
(Mangasarian [19] and Chandra et al. [20]) (DKTWSVR1) of (KTWSVR1) as
follows

(DKTWSVR1)   Min_α   −f^T α + (ε1 + ε2)e^T α − (1/2)α^T H(H^T H + C3 I)^{-1} H^T α + f^T H(H^T H + C3 I)^{-1} H^T α
subject to
0 ≤ α ≤ C1 e.   (4.51)

Similarly,

(DKTWSVR2)   Min_γ   g^T γ + (ε1 + ε2)e^T γ − (1/2)γ^T H(H^T H + C4 I)^{-1} H^T γ − g^T H(H^T H + C4 I)^{-1} H^T γ
subject to
0 ≤ γ ≤ C2 e,   (4.52)
4.4.3 Experiments
The SVR, TWSVR and TSVR models were implemented by using MATLAB 7.8
running on a PC with Intel Core 2 Duo processor (2.00 GHz) with 3 GB of RAM. The
methods were evaluated on two artificial dataset and on some standard regression
datasets from UCI Machine Learning Laboratory (Blake and Merz [22]).
For our simulations, we have considered the RBF kernel, and the values of parameters
such as C (regularization term), epsilon and sigma are selected from the set of values
{2^i | i = −9, −8, . . . , 10} by tuning on a set comprising a random 10 percent of the sample
set. Similar to Peng [3], we have taken C1 = C2 and ε = ε1 = ε2 in our experiments
to reduce the computational burden of parameter selection.
The datasets used for comparison and the evaluation criteria are similar to Peng [3];
for the sake of completeness, we specify the evaluation criteria before
presenting the experimental results. The total number of testing samples is denoted
by l, y_i denotes the real value of a sample x_i, ŷ_i denotes the predicted value of sample
x_i, and ȳ = (1/l) Σ_{i=1}^l y_i is the mean of y1, y2, . . . , yl. We use the following criteria for
algorithm evaluation.

SSE: Sum squared error of testing, defined as SSE = Σ_{i=1}^l (y_i − ŷ_i)².

SST: Sum squared deviation of testing, defined as SST = Σ_{i=1}^l (y_i − ȳ)².

SSR: Sum squared deviation that can be explained by the estimator, defined as SSR = Σ_{i=1}^l (ŷ_i − ȳ)². It reflects the explanation ability of the regressor.

SSE/SST: Ratio between the sum squared error and the sum squared deviation of the testing
samples, defined as SSE/SST = Σ_{i=1}^l (y_i − ŷ_i)² / Σ_{i=1}^l (y_i − ȳ)².

SSR/SST: Ratio between the interpretable sum squared deviation and the real sum
squared deviation of the testing samples, defined as SSR/SST = Σ_{i=1}^l (ŷ_i − ȳ)² / Σ_{i=1}^l (y_i − ȳ)².
In most cases, a small SSE/SST means good agreement between estimations and
real values, and obtaining a smaller SSE/SST usually accompanies an increase in
SSR/SST. However, an extremely small value of SSE/SST indicates overfitting of the
regressor. Therefore, a good estimator should strike a balance between SSE/SST
and SSR/SST. The other two criteria are SSE/SST_LOO and SSR/SST_LOO, which
are defined as SSE/SST_LOO = Σ_{i=1}^l (y_i − ŷ_i^LOO)² / Σ_{i=1}^l (y_i − ȳ_tr)² and SSR/SST_LOO =
Σ_{i=1}^l (ŷ_i^LOO − ȳ_tr)² / Σ_{i=1}^l (y_i − ȳ_tr)², where ŷ_i^LOO is the prediction of y_i when the sample
x_i is left out from the training set during the leave-one-out procedure and ȳ_tr is the mean of
the y values of the m training data points.
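For concreteness, a minimal Python helper computing the main criteria just defined is given below; the function name is illustrative and the criteria follow the formulas above.

import numpy as np

def regression_criteria(y_true, y_pred):
    """SSE, SSE/SST and SSR/SST as defined above."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    y_bar = y_true.mean()
    sse = np.sum((y_true - y_pred) ** 2)
    sst = np.sum((y_true - y_bar) ** 2)
    ssr = np.sum((y_pred - y_bar) ** 2)
    return {"SSE": sse, "SSE/SST": sse / sst, "SSR/SST": ssr / sst}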
We first compare the performance of our proposed TWSVR on synthetic dataset
Sinc function, which is defined as
y = Sinc(x) = sin(x)/x, where x ∈ [−12, 12].   (4.53)
To effectively reflect the performance of our method, training data points are per-
turbed by different Gaussian noises with zero means. Specifically, we have the fol-
lowing training samples (xi , yi ), i = 1, . . . , l.
(Type A)

y_i = sin(x_i)/x_i + ξ_i,   (4.54)

where x_i ∼ U[−12, 12] and ξ_i ∼ N(0, 0.2²). Here U[a, b] and N(c, d) represent the
uniform random variable on [a, b] and the Gaussian random variable with mean c
and variance d², respectively.
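A hedged numpy sketch of generating Type A training samples is given below; it interprets the noise as Gaussian with standard deviation 0.2 (i.e., variance 0.2²), and the function name and seed are illustrative.

import numpy as np

def sinc_type_a(l=256, seed=0):
    """Type A samples: x ~ U[-12, 12], y = sin(x)/x + Gaussian noise of std 0.2."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-12.0, 12.0, l)
    y = np.sinc(x / np.pi) + rng.normal(0.0, 0.2, l)   # np.sinc(t) = sin(pi t)/(pi t)
    return x, y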
Next, we compare TWSVR on the following synthetic dataset

g(x) = |x − 1|/4 + |sin(π(1 + (x − 1)/4))| + 1,   (4.56)

where x ∈ [−10, 10]. Again, the training samples are polluted with Gaussian noises
with zero means and different variances, as follows

(Type C)
y_i = g(x_i) + ξ_i,   (4.57)

(Type D)
y_i = g(x_i) + ξ_i,   (4.58)

where x_i ∼ U[−10, 10] and ξ_i ∼ N(0, 0.4²). To avoid biased comparisons, we gen-
erate 10 groups of noisy samples, which respectively consists of 256(400) training
samples for Type A and B (Type C and D) and 500(800) test samples for Type A and
B (Type C and D). Besides, testing data points are uniformly sampled without con-
sidering any noise. Tables 4.1 and 4.2 show the average results of TWSVR, TSVR
and SVR over 10 independent runs on the aforementioned synthetic datasets.
Table 4.1 Result comparison of TWSVR, Peng’s TSVR and classical SVR on Sinc dataset with
different noises
Noise Regressor SSE SSE/SST SSR/SST
Type A TWSVR 0.1926 ± 0.0796 0.0034 ± 0.0014 0.9959 ± 0.0281
TSVR 0.2511 ± 0.0851 0.0045 ± 0.0015 1.0020 ± 0.0328
SVR 0.2896 ± 0.1355 0.0052 ± 0.0024 0.9865 ± 0.0444
Type B TWSVR 0.7671 ± 0.3196 0.0137 ± 0.0057 0.9940 ± 0.0607
TSVR 0.9527 ± 0.3762 0.0170 ± 0.0067 1.0094 ± 0.0579
SVR 1.0978 ± 0.3570 0.0196 ± 0.0064 1.0187 ± 0.0818
Table 4.2 Result comparison of TWSVR, Peng’s TSVR and classical SVR on g(x) function with
different noises
Noise Regressor SSE SSE/SST SSR/SST
Type C TWSVR 2.5632 ± 0.4382 0.0048 ± 0.0008 0.9967 ± 0.0185
TSVR 2.8390 ± 0.7234 0.0053 ± 0.0014 0.9992 ± 0.0175
SVR 3.6372 ± 0.7828 0.0068 ± 0.0015 0.9832 ± 0.0293
Type D TWSVR 8.5473 ± 1.9399 0.0160 ± 0.0036 1.0295 ± 0.0560
TSVR 10.0011 ± 1.8801 0.0187 ± 0.0035 1.0230 ± 0.0572
SVR 9.4700 ± 2.1139 0.0177 ± 0.0040 1.0195 ± 0.0521
Table 4.3 Result comparison of our model of TWSVR, Peng’s model of TSVR and classical SVR
on Motorcycle and Diabetes datasets
Dataset Regression type SSE/SST100 SSR/SST100
Motorcycle TWSVR 0.2280 0.7942
TSVR 0.2364 0.8048
SVR 0.2229 0.8950
Diabetes TWSVR 0.6561 0.5376
TSVR 0.6561 0.5376
SVR 0.6803 0.3231
Next, we have shown the comparison on UCI benchmark dataset, i.e. we have
tested the efficacy of the algorithms on several benchmark datasets which includes
Motorcycle, Diabetes, Boston Housing, Servo, Machine CPU, Auto Price and
Wisconsin B.C. For these datasets we have reported ten fold cross validation result
with RBF kernel, except Motorcycle and Diabetes where we have used leave one out
cross validation approach and Wisconsin B.C where we have used the linear kernel.
Table 4.3 shows the comparison results for Motorcycle and Diabetes datasets
while Table 4.4 shows the comparison results for other five UCI datasets. We make
the following observations from the results of comparison tables
(i) The TWSVR model achieves the smallest SSE and SSE/SST for almost all the
datasets when compared to SVR. This clearly shows the superiority of TWSVR
over SVR in terms of testing accuracy.
(ii) TWSVR performs better than TSVR on the synthetic datasets,
while the two perform similarly on the UCI datasets. We note here that although
the TWSVR and TSVR models differ significantly conceptually, their dual
optimization problems differ only in terms of the epsilon values. In practice, ε is set
to small values and hence both models achieve similar results.
Shao et al. [5] presented another formulation, termed ε-TSVR. The formulation
of ε-TSVR is different from that of Peng [3] and is motivated by the classical SVR
formulation. Specifically, Shao et al. [5] constructed the following pair of quadratic programming problems
Table 4.4 Result comparison of our model of TWSVR, Peng’s model of TSVR and classical SVR
on following UCI datasets
Dataset Regression type SSE/SST SSR/SST
Machine CPU TWSVR 0.0048 ± 0.0040 0.9805 ± 0.0984
TSVR 0.0050 ± 0.0044 0.9809 ± 0.1004
SVR 0.0325 ± 0.0275 0.7973 ± 0.1804
Servo TWSVR 0.1667 ± 0.0749 0.9639 ± 0.2780
TSVR 0.1968 ± 0.1896 1.0593 ± 0.4946
SVR 0.1867 ± 0.1252 0.7828 ± 0.2196
Boston Housing TWSVR 0.1464 ± 0.079 0.9416 ± 0.1294
TSVR 0.1469 ± 0.079 0.9427 ± 0.13
SVR 0.1546 ± 0.053 0.8130 ± 0.1229
Auto Price TWSVR 0.1663 ± 0.0806 0.9120 ± 0.372
TSVR 0.1663 ± 0.0806 0.9120 ± 0.372
SVR 0.5134 ± 0.1427 0.3384 ± 0.1171
Wisconsin B.C TWSVR 0.9995 ± 0.2718 0.4338 ± 0.2248
TSVR 0.9989 ± 0.2708 0.4330 ± 0.2242
SVR 0.9113 ± 0.1801 0.2398 ± 0.1294
(ε-TSVR1)   Min_(w1, b1, ξ1)   C1 e^T ξ1 + (1/2)‖Y − (Aw1 + eb1)‖² + (C3/2)(‖w1‖² + (b1)²)
subject to

and

(ε-TSVR2)   Min_(w2, b2, ξ2)   C2 e^T ξ2 + (1/2)‖Y − (Aw2 + eb2)‖² + (C4/2)(‖w2‖² + (b2)²)
subject to
Shao et al. [5] also presented the kernel version of ε-TSVR (kernel ε-TSVR)
and carried out extensive numerical experimentation to test the efficiency of their formulation.
One natural question here is to check whether the ε-TSVR formulation of Shao et al.
[5] can also be derived from the Bi and Bennett [1] result. This is in fact true, as has been
demonstrated in Goyal [16].
Min_(p1, p2, q1, q2, w, b)   (1/2)w^T w + c1(e^T p1 + e^T p2) + c2(e^T q1 + e^T q2)
subject to

and

φ^k(x^(j)) ≡ [∂φ1(x^(j))/∂x_k, ∂φ2(x^(j))/∂x_k, . . . , ∂φ_d(x^(j))/∂x_k],
k = 1, 2, . . . , n;  j = 1, 2, . . . , l2.
The RLSVRD is then obtained by solving the following problem

Min_(p, q^1, q^2, ..., q^n, w, b)   (1/2)(w^T w + b²) + (c1/2) p^T p + (c2/2) Σ_{k=1}^n (q^k)^T q^k
subject to
w^T φ(x^(i)) + b − y_i + p_i = 0,  (i = 1, 2, . . . , l1),
w^T φ^k(x^(j)) − y_j^k + q_j^k = 0,  (j = 1, 2, . . . , l2, k = 1, 2, . . . , n),   (4.66)

where Z = [x^(1), x^(2), . . . , x^(l1)]^T, Z1 = [x^(1), x^(2), . . . , x^(l2)]^T, Y^k = [∂f/∂x_k]|_{x=x^(j)},
H = [φ(Z)]_{l1×d} and G^k = [φ^k(Z1)]_{l2×d}, (k = 1, 2, . . . , n, j = 1, 2, . . . , l2). Once α
and β^k, (k = 1, 2, . . . , n) are obtained, the estimated values of the function and its
derivatives at any x ∈ R^n are then given by

f(x) = Σ_{i=1}^{l1} α_i φ^T(x^(i)) φ(x) + Σ_{k=1}^n Σ_{j=1}^{l2} β_j^k [φ^k(x^(j))]^T φ(x) + e^T α,

and

f^k(x) = Σ_{i=1}^{m1} α_i φ^T(x^(i)) [φ^k(x)] + Σ_{j=1}^{m2} β_j^k [φ^k(x^(j))]^T [φ^k(x)],  (k = 1, 2, . . . , n).
We note that although the RLSVRD model improves over the Lázaro et al. model in terms
of estimation accuracy and computational complexity, its least squares approach
lacks sparseness, making it unsuitable for large datasets.
Here we describe the TSVRD model of Khemchandani et al. [31] for simultaneously
learning a function and its derivatives. For a real valued function f : R^n → R, let A
of dimension (l1 × n) be the set of points {x^(i) ∈ R^n, i = 1, 2, . . . , l1} over which
functional values y are provided, and let B_j of dimension (l2j × n) be the set of points
over which the partial derivatives ∂f/∂x_j are provided, j = 1, 2, . . . , n. Here x_j denotes the
jth component of the vector x.
Let B = [B1; B2; . . . ; Bn] (size: l2 × n, where l2 = l21 + l22 + · · · + l2n) and let C = [A; B]
(size: (l1 + l2) × n). Let Y be the set containing the values of the function at A, and
let Y_j be the set of jth derivative values of the function prescribed at B_j. Let Ȳ = [Y1; Y2; . . . ; Yn] and
Z = [Y; Ȳ].
On lines similar to TSVR (Peng [3]), ε1 and ε2 perturbations of Z are introduced
in order to create the upper-bound and lower-bound regression tubes for the simultaneous learning of a function and its derivatives, where ε1 = [ε11; ε12] and ε2 = [ε21; ε22].
Y and Y_j, j = 1, 2, . . . , n, are then regressed according to the following functions

and

where (w1, b1) and (w2, b2) define the upper-tube and lower-tube regressions of the
function and its derivatives, K(x, y) is the kernel used, with K_{x_j}(x, y) as the derivative
of K(x, y) with respect to x_j, j = 1, 2, . . . , n, and e is a vector of ones of appropriate
dimension.
Let e = [e1; e2] be a column vector of length (l1 + l2), e1 be a column vector of
ones of length l1 and o1 be a column vector of zeros of length l2. Let e* = [e1; o1],

G* = [K_{σ1}(A, C^T); K_{σ11 x1}(B1, C^T); . . . ; K_{σ1n xn}(Bn, C^T)],  u_i = [w_i; b_i], (i = 1, 2), and G = [G*  e*].

Then the two quadratic programming problems (QPPs) for the upper-bound and lower-bound regression functions, respectively, are
(TSVRD1)   Min_(u1, ξ1)   (1/2)‖Z − ε1 − G u1‖² + (C1/2)(u1^T u1) + C2 e^T ξ1
subject to

and

(TSVRD2)   Min_(u2, ξ2)   (1/2)‖Z + ε2 − G u2‖² + (C3/2)(u2^T u2) + C4 e^T ξ2
subject to

The final regressors of the function and its derivatives are then

f(x) = (1/2)(K_{σ1}(x, C^T)w1 + K_{σ2}(x, C^T)w2) + (1/2)(b1 + b2),   (4.73)

∂f/∂x_j = (1/2)(K_{σ1j xj}(x, C^T)w1 + K_{σ2j xj}(x, C^T)w2),  j = 1, 2, . . . , n.   (4.74)
and

where δ1 > 0 is the regularization coefficient. Let ν1 = [u1^T, −1]^T; hence the
above problem converts to a Rayleigh quotient of the form

Min_{ν1 ≠ 0}   (ν1^T R ν1) / (ν1^T S ν1),   (4.79)

where

Using Rayleigh quotient properties (Mangasarian and Wild [33]; Parlett [35]), the
solution of (4.79) is obtained by solving the following generalized eigenvalue problem

R ν1 = η S ν1,  ν1 ≠ 0.   (4.81)

Let μ1 denote the eigenvector corresponding to the smallest eigenvalue η_min of (4.81).
To obtain u1 from μ1, we normalize μ1 by the negative of its (l1 + l2 + 2)th element
so as to force a (−1) at the (l1 + l2 + 2)th position. Let this normalized
representation of μ1 be μ̄1, with (−1) at the end, so that μ̄1 = [u1^T, −1]^T. We get
w1 and b1 from u1, which determine the ε-insensitive bounding regressors f1(x) =
K_{σ1}(x, C^T)w1 + b1 and f1′(x) = K′_{σ1}(x, C^T)w1. We note that the solution is easily
obtained using a single MATLAB command that solves the classical generalized
eigenvalue problem.
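In Python, a comparable one-step solution can be sketched with scipy's generalized eigenvalue solver; this is an illustrative sketch assuming R and S are the (regularized, hence nonsingular) matrices of (4.79), and it applies the normalization described above.

import numpy as np
from scipy.linalg import eig

def gepsvrd_vector(R, S):
    """Solve R*v = eta*S*v, take the eigenvector of the smallest real eigenvalue,
    and normalize it so that its last entry is -1, returning u1."""
    vals, vecs = eig(R, S)                      # generalized eigenvalue problem
    v = vecs[:, np.argmin(vals.real)].real      # eigenvector for the smallest eigenvalue
    mu = v / (-v[-1])                           # force -1 in the last position: mu = [u1; -1]
    return mu[:-1]                              # u1, from which (w1, b1) are read off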
In a similar manner, f2(x) and f2′(x) are determined by considering the optimization
problem (4.76) along with Tikhonov regularization as follows

where δ2 > 0 is the regularization constant. Let ν2 = [u2^T, −1]^T; hence the
above problem converts to a Rayleigh quotient of the form

Min_{ν2 ≠ 0}   (ν2^T P ν2) / (ν2^T Q ν2),   (4.83)

where

The solution of (4.83) is thus obtained by solving the following generalized eigenvalue problem

P ν2 = η Q ν2,  ν2 ≠ 0.   (4.85)

Finding the minimum eigenvalue of (4.85) and then normalizing the corresponding
eigenvector in the manner described for problem (4.81), we get u2, from
which w2 and b2 are obtained, determining the ε-insensitive bounding regressors f2(x) =
K_{σ2}(x, C^T)w2 + b2 and f2′(x) = K′_{σ2}(x, C^T)w2.
We next extend the formulations to real valued functions of n variables, f : R^n → R. Let A of dimension (l1 × n) be the set of points {x^(i) ∈ R^n, i =
1, 2, . . . , l1} over which functional values y are provided and B_j of dimension (l2j × n)
be the set of points over which the partial derivatives ∂f/∂x_j are provided, j = 1, 2, . . . , n.
Here x_j denotes the jth component of the vector x. Let B = [B1; B2; . . . ; Bn] (size: l2 × n, where
l2 = l21 + l22 + · · · + l2n) and let C = [A; B] (size: (l1 + l2) × n). Let Y be the set
containing the values of the function at A, and Y_j be the set of jth derivative values
of the function prescribed at B_j. Let Ȳ = [Y1; Y2; . . . ; Yn] and Z = [Y; Ȳ].
The algorithm finds regressors of the function, f1(x) = K_{σ1}(x, C^T)w1 + b1 and
f2(x) = K_{σ2}(x, C^T)w2 + b2, and regressors of the derivative function, f1^j(x) =
K_{σ1 xj}(x, C^T)w1 and f2^j(x) = K_{σ2 xj}(x, C^T)w2, corresponding to the ε-insensitive bounding regressors of the function and its partial derivatives, where K_{x_j}(x, y) is the derivative
of K(x, y) with respect to the jth dimension of the input variable, x_j, j = 1, 2, . . . , n.
Let e = [e1; e2] be a column vector of length (l1 + l2), e1 be a column vector of
ones of length l1 and o1 be a column vector of zeros of length l2. Let e* = [e1; o1],

G* = [K_{σ1}(A, C^T); K_{σ11 x1}(B1, C^T); . . . ; K_{σ1n xn}(Bn, C^T)],  u_i = [w_i; b_i], i = 1, 2, and G = [G*  e*].
Similar to the case for single variable, the two optimization problems are
and
f(x) = (1/2)(K_{σ1}(x, C^T)w1 + K_{σ2}(x, C^T)w2) + (1/2)(b1 + b2),   (4.88)

∂f/∂x_j = (1/2)(K_{σ1j xj}(x, C^T)w1 + K_{σ2j xj}(x, C^T)w2),  j = 1, 2, . . . , n,   (4.89)
where δ̂1, δ̂2 are non-negative regularization constants and Ĝ = [Ĝ*  e*], where Ĝ* is
the diagonal matrix of the diagonal values of G*. From the work of Guarracino et al.
[36], the minimum eigenvalue of the original problem (4.77) becomes
the maximum eigenvalue of the above problem, and the maximum eigenvalue of the
original problem (4.77) becomes the minimum eigenvalue of the above problem. Let
t = [u^T, −1]^T and let U and V be defined as

U = [G  (Z − ε2)]^T [G  (Z − ε2)] + δ̂1 [Ĝ  (Z + ε1)]^T [Ĝ  (Z + ε1)],
V = [G  (Z + ε1)]^T [G  (Z + ε1)] + δ̂2 [Ĝ  (Z − ε2)]^T [Ĝ  (Z − ε2)].

Now again using the properties of the Rayleigh quotient, the optimization problem (4.90)
is equivalent to the following generalized eigenvalue problem

U t = ν V t,  t ≠ 0.   (4.91)
To prove the efficacy of the proposed GEPSVRD approach over the above mentioned
approaches, comparisons of estimation accuracy and run time have been
performed on sinc(x), x cos(x), x sin(y), and seven functions of two real variables
which were introduced in Lázaro et al. [27]. All the methods have been implemented
in MATLAB 7.8 running on a PC with an Intel Core 2 Duo processor (2.00 GHz) with
3 GB of RAM. Here, we note that the time complexity of the Lázaro et al. approach
has been shown to be higher than that of the RLSVRD and TSVRD approaches (Jayadeva et al.
[29, 30]; Khemchandani et al. [31]), and therefore only the run time of the discussed
GEPSVRD model (Goyal [16]) is compared with the RLSVRD and TSVRD approaches.
Table 4.5 Estimation accuracy comparisons for sinc(x), xcos(x) and xsin(y)
RAE MAE RMSE SSE/SST SSR/SST
sinc(x)
GEPSVRD f (x) 0.0192 0.00002 0.00004 3×10−8 1.00002
f (x) 0.0457 0.0001 0.0002 2×10−7 1.00003
TSVRD f (x) 0.6423 0.0009 0.0013 0.00002 0.9968
f (x) 0.5985 0.0019 0.0027 0.00003 0.9943
RLSVRD f (x) 2.0192 0.0028 0.0040 0.0002 0.9708
f (x) 2.2825 0.0073 0.0091 0.0004 0.9648
SVRD (Lazaro et al.) f (x) 1.7577 0.0024 0.0043 0.00028 0.9685
f (x) 1.7984 0.0057 0.0077 0.00026 0.9683
xcos(x)
GEPSVRD f (x) 0.0888 0.0022 0.0039 0.000001 0.9999
f (x) 0.3905 0.0087 0.0129 0.000024 0.9998
TSVRD f (x) 0.1142 0.0029 0.0049 0.000002 1.0004
f (x) 0.4900 0.0109 0.0172 0.00004 1.0005
RLSVRD f (x) 0.3540 0.0088 0.0134 0.0000 0.9925
f (x) 0.5510 0.0123 0.0183 0.0000 0.9940
SVRD (Lazaro et al.) f (x) 2.8121 0.0702 0.0941 0.0009 0.9430
f (x) 2.3863 0.0531 0.0655 0.0006 0.9559
xsin(y)
GEPSVRD f(x) 0.5723 0.0056 0.0085 0.00004 1.0028
∂f/∂x1 0.6789 0.0041 0.0056 0.00006 0.9932
∂f/∂x2 0.4103 0.0043 0.0071 0.00003 0.9990
TSVRD f(x) 0.8242 0.0081 0.0122 0.0001 1.0053
∂f/∂x1 1.1843 0.0072 0.0101 0.0002 0.9986
∂f/∂x2 0.6730 0.0070 0.0118 0.0001 1.0018
RLSVRD f(x) 0.8544 0.0084 0.0130 0.0001 0.9873
∂f/∂x1 2.9427 0.0180 0.0269 0.0015 0.9640
∂f/∂x2 1.2276 0.0128 0.0211 0.0002 0.9795
SVRD (Lazaro et al.) f(x) 1.0094 0.0100 0.0140 0.0001 0.9865
∂f/∂x1 1.6307 0.0100 0.0131 0.0004 0.9765
∂f/∂x2 1.5325 0.0160 0.0236 0.0003 0.9760
The comparison uses five evaluation measures, where y_i and ŷ_i denote, respectively, the original and estimated values of the function/derivatives at a training point x_i. Let m be the size of the testing
dataset and let ȳ = (1/l) Σ_{i=1}^l y_i. The five evaluation measures are then defined as:

1. Relative Absolute Error (RAE): RAE = (Σ_{i=1}^l |y_i − ŷ_i| / Σ_{i=1}^l |y_i|) × 100.

2. Mean Absolute Error (MAE): MAE = (1/l) Σ_{i=1}^l |y_i − ŷ_i|.

3. Root Mean Squared Error (RMSE): RMSE = sqrt((1/l) Σ_{i=1}^l (y_i − ŷ_i)²).

4. SSE/SST, which is defined as the ratio between the Sum Squared Error (SSE = Σ_{i=1}^l (y_i − ŷ_i)²) and the Sum Squared Deviation of testing samples (SST = Σ_{i=1}^l (y_i − ȳ)²).

5. SSR/SST, which is defined as the ratio between the Interpretable Sum Squared
Deviation (SSR = Σ_{i=1}^l (ŷ_i − ȳ)²) and the Sum Squared Deviation of testing samples
(SST).
In most cases, a lower value of RAE, MAE, RMSE and SSE/SST reflects precision
in agreement between the estimated and original values, while a higher value of
SSR/SST shows higher statistical information being accounted by the regressor.
4.6 Conclusions
In this chapter, we review the formulation of Twin Support Vector Regression (TSVR)
proposed by Peng [3] and conclude that this formulation is not in the true spirit of
TWSVM. This is because Peng’s formulation deals either with up or down bound
regressor in each of the twin optimization problems. But our TWSVR model is in the
true spirit of TWSVM where in each QPP, the objective function deals with up(down)
bound regressor and constraints deal with down(up)-bound regressor. As a result of
this strategy we solve two QPPs, each one being smaller than the QPP of classical
SVR. The efficacy of the proposed algorithm has been established when compared
with TSVR and SVR.
The proposed model of TWSVR is based on intuition, is shown to be mathematically derivable, and follows the spirit of TWSVM, making it not only a better but also
a correct choice for establishing future work in the twin domain. Here it may be remarked
that various existing extensions of Peng's model [3], e.g. Xu and Wang [4], Shao et al.
that various existing extensions of Peng’s model [3], e.g. Xu and Wang [4], Shao et al.
[5], Chen et al. [6, 7], Zhao et al. [8], Zhang et al. [9], Peng [10, 11], Balasundaram
and Tanveer [12] and Singh et al. [13], may be studied for our present model as well.
Another problem discussed in this chapter is the simultaneous learning
of a function and its derivatives. Here, after reviewing the work of Lázaro et al. [27, 28],
the formulation of the Twin Support Vector Regression of a Function and its Derivatives
(TSVRD) due to Khemchandani et al. is discussed in detail. Further, the formulation
(TSVRD) due to Khemchandani et al. is discussed in detail. Further the formulation
of Goyal [16] for a Generalized Eigenvalue Proximal Support Vector Regression of
a Function and its Derivatives (GEPSVRD) is presented. The efficiency of TSVRD
and GEPSVRD is demonstrated on various benchmark functions available in the
literature.
References
1. Bi, J., & Bennett, K. P. (2003). A geometric approach to support vector regression. Neurocom-
puting, 55, 79–108.
2. Jayadeva, Khemchandani, R., & Chandra, S. (2007). Twin support vector machines for pattern
classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(5), 905–
910.
3. Peng, X. (2010). TSVR: An efficient twin support vector machine for regression. Neural Networks,
23, 365–372.
4. Xu, Y. T., & Wang, L. S. (2012). A weighted twin support vector regression. Knowledge Based
Systems, 33, 92–101.
5. Shao, Y. H., Zhang, C. H., Yang, Z. M., Zing, L., & Deng, N. Y. (2013). ε-Twin support vector
machine for regression. Neural Computing and Applications, 23(1), 175–185.
6. Chen, X., Yang, J., & Chen, L. (2014). An improved robust and sparse twin support vector
regression via linear programming. Soft Computing, 18, 2335–2348.
7. Chen, X. B., Yang, J., Liang, J., & Ye, Q. L. (2012). Smooth twin support vector regression.
Neural Computing and Applications, 21(3), 505–513.
8. Zhao, Y. P., Zhao, J., & Zhao, M. (2013). Twin least squares support vector regression. Neuro-
computing, 118, 225–236.
9. Zhang, P., Xu, Y. T., & Zhao, Y. H. (2012). Training twin support vector regression via linear
programming. Neural Computing and Applications, 21(2), 399–407.
10. Peng, X. (2012). Efficient twin parametric insensitive support vector regression. Neurocom-
puting, 79, 26–38.
11. Peng, X. (2010). Primal twin support vector regression and its sparse approximation. Neuro-
computing, 73(16–18), 2846–2858.
12. Balasundaram, S., & Tanveer, M. (2013). On Lagrangian twin support vector regression. Neural
Computing and Applications, 22(1), 257–267.
13. Singh, M., Chadha, J., Ahuja, P., Jayadeva, & Chandra, S. (2011). Reduced twin support vector
regression. Neurocomputing, 74(9), 1471–1477.
14. Khemchandani, R., Goyal, K., & Chandra, S. (2015). Twin support vector machine based
regression. International Conference on Advances in Pattern Recognition, 1–6.
15. Khemchandani, R., Goyal, K., & Chandra, S. (2015). TWSVR: regression via twin support
vector machine. Neural Networks, 74, 14–21.
16. Goyal, K. (2015) Twin Support Vector Machines Based Regression and its Extensions. M.Tech
Thesis Report, Mathematics Department, Indian Institute of Technology, Delhi.
17. Deng, N., Tian, Y., & Zhang, C. (2012). Support vector machines: Optimization based theory,
algorithms and extensions. New York: Chapman & Hall, CRC Press.
18. Shao, Y.-H., Zhang, C.-H., Wang, X.-B., & Deng, N.-Y. (2011). Improvements on twin support
vector machines. IEEE Transactions on Neural Networks, 22(6), 962–968.
19. Mangasarian, O. L. (1994). Nonlinear programming. Philadelphia: SIAM.
20. Chandra, S., Jayadeva, & Mehra, A. (2009). Numerical optimization with applications. New
Delhi: Narosa Publishing House.
21. Saunders, C., Gammerman, A., & Vovk, V. (1998). Ridge regression learning algorithm in dual
variables. Proceedings of the Fifteenth International Conference on Machine Learning (pp.
515–521).
22. Blake, C. L., & Merz, C. J. UCI Repository for Machine Learning Databases, Irvine, CA:
University of California, Department of Information and Computer Sciences, http://www.ics.
uci.edu/~mlearn/MLRepository.html.
23. Zheng, S. (2011). Gradient descent algorithms for quantile regression with smooth approxi-
mation. International Journal of Machine Learning and Cybernetics, 2(3), 191–207.
24. Lagaris, I. E., Likas, A., & Fotiadis, D. (1998). Artificial neural networks for solving ordinary
and partial differential equations. IEEE Transactions on Neural Networks, 9, 987–1000.
25. Antonio, J., Martin, H., Santos, M., & Lope, J. (2010). Orthogonal variant moments features
in image analysis. Information Sciences, 180, 846–860.
26. Julier, S. J., & Uhlmann, J. K. (2004). Unscented filtering and nonlinear estimation. Proceedings
of the IEEE, 92, 401–422.
27. Lázaro, M., Santamaría, I., Pérez-Cruz, F., & Artés-Rodríguez, A. (2005). Support vector
regression for the simultaneous learning of a multivariate function and its derivative. Neuro-
computing, 69, 42–61.
28. Lázaro, M., Santamaría, I., Pérez-Cruz, F., & Artés-Rodríguez, A. (2003). SVM for the
simultaneous approximation of a function and its derivative. Proceedings of the IEEE inter-
national workshop on neural networks for signal processing (NNSP), Toulouse, France (pp.
189–198).
29. Jayadeva, Khemchandani, R., & Chandra, S. (2006). Regularized least squares twin SVR for
the simultaneous learning of a function and its derivative, IJCNN, 1192–1197.
30. Jayadeva, Khemchandani, R., & Chandra, S. (2008). Regularized least squares support vector
regression for the simultaneous learning of a function and its derivatives. Information Sciences,
178, 3402–3414.
31. Khemchandani, R., Karpatne, A., & Chandra, S. (2013). Twin support vector regression for
the simultaneous learning of a function and its derivatives. International Journal of Machine
Learning and Cybernetics, 4, 51–63.
32. Khemchandani, R., Karpatne, A., & Chandra, S. (2011). Generalized eigenvalue proximal
support vector regressor. Expert Systems with Applications, 38, 13136–13142.
33. Mangasarian, O. L., & Wild, E. W. (2006). Multisurface proximal support vector machine
classification via generalized eigenvalues. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 28(1), 69–74.
34. Tikhonov, A. N., & Arsenin, V. Y. (1977). Solutions of Ill-posed problems. New York: Wiley.
35. Parlett, B. N. (1998). The symmetric eigenvalue problem: Classics in applied mathematics
(Vol. 20). Philadelphia: SIAM.
36. Guarracino, M. R., Cifarelli, C., Seref, O., & Pardalos, P. M. (2007). A classification method
based on generalized eigenvalue problems. Optimization Methods and Software, 22(1), 73–81.
Chapter 5
Variants of Twin Support Vector Machines:
Some More Formulations
5.1 Introduction
On the lines of the Least Squares SVM proposed by Suykens and Vandewalle [1], Kumar and Gopal [2] presented Least Squares TWSVM (LS-TWSVM), in which they modified the primal and dual QPPs of TWSVM in the least squares sense and solved them with equality constraints instead of the inequalities of TWSVM. These modifications simplify the solution methodology of LS-TWSVM, as its solution is obtained directly by solving two systems of linear equations instead of the two QPPs of TWSVM. Here we restrict ourselves to the linear version only, as the kernel version is analogous to the kernel version of TWSVM.
5.2 Least Squares-TWSVM

Let the training set TC for the given binary data classification problem be as in the earlier chapters, with the patterns of classes +1 and −1 collected row-wise in the matrices A and B, respectively. The primal problems of linear LS-TWSVM are
(LS-TWSVM1)   Min_{(w1, b1, q1)}   (1/2)||Aw1 + e1 b1||² + C1 q1ᵀ q1
subject to
    −(Bw1 + e2 b1) + q1 = e2,    (5.2)

and

(LS-TWSVM2)   Min_{(w2, b2, q2)}   (1/2)||Bw2 + e2 b2||² + C2 q2ᵀ q2
subject to
    (Aw2 + e1 b2) + q2 = e1.    (5.3)
Thus, LS-TWSVM solves the classification problem using two matrix inverses, one for each hyperplane, where [w1, b1]ᵀ and [w2, b2]ᵀ are determined as per Eqs. (5.4) and (5.5), respectively. Since the computational complexity of solving a system of m linear equations in n unknowns is O(r³), where r is the rank of the m × n coefficient matrix and r ≤ min(m, n), LS-TWSVM is computationally faster than TWSVM. Kumar and Gopal [2] implemented LS-TWSVM on various real and synthetic datasets and compared its efficiency with that of TWSVM, GEPSVM and proximal SVM. For these details and other related issues we refer the reader to their paper [2].
Recently, Nasiri et al. [5] proposed an energy-based model of the least squares twin support vector machine (ELS-TWSVM) for activity recognition, in which they replaced the minimum unit-distance constraint used in LS-TWSVM by a fixed energy parameter, in order to reduce the effect of the intrinsic noise of spatio-temporal features [6]. However, this energy parameter is chosen explicitly, based on external observation by the user, rather than being determined as part of the optimization problem.
The primal form of the ELS-TWSVM formulation is given below:

(ELS-TWSVM1)   Min_{(w1, b1, y2)}   (1/2)||Aw1 + e1 b1||² + (C1/2) y2ᵀ y2
subject to
    −(Bw1 + e2 b1) + y2 = E1,    (5.6)

and

(ELS-TWSVM2)   Min_{(w2, b2, y1)}   (1/2)||Bw2 + e2 b2||² + (C2/2) y1ᵀ y1
subject to
    (Aw2 + e1 b2) + y1 = E2.    (5.7)

As in LS-TWSVM, the solutions are obtained from two systems of linear equations; with H = [A  e1] and G = [B  e2], the solution of (ELS-TWSVM2) is

    [w2 b2]ᵀ = [c2 Hᵀ H + Gᵀ G]⁻¹ [c2 Hᵀ E2],    (5.9)

and [w1 b1]ᵀ is obtained analogously from (5.8).
A drawback is that the energy parameter has to be fixed externally, which sometimes leads to instability in the problem formulation and affects the overall prediction accuracy of the system. We next discuss a formulation in which the parameter discussed above becomes part of the optimization problem.

5.3 Linear ν-TWSVM
(ν-TWSVM1)   Min_{(w1, b1, q1, ρ+)}   (1/2)||Aw1 + e1 b1||² − ν1 ρ+ + (1/m2) e2ᵀ q1
subject to
    −(Bw1 + e2 b1) + q1 ≥ ρ+ e2,
    ρ+ ≥ 0,  q1 ≥ 0,    (5.10)

and

(ν-TWSVM2)   Min_{(w2, b2, q2, ρ−)}   (1/2)||Bw2 + e2 b2||² − ν2 ρ− + (1/m1) e1ᵀ q2
subject to
    (Aw2 + e1 b2) + q2 ≥ ρ− e1,
    ρ− ≥ 0,  q2 ≥ 0.    (5.11)
The Wolfe dual problems corresponding to (5.10) and (5.11) are

(ν-DTWSVM1)   Max_α   −(1/2) αᵀ G (Hᵀ H)⁻¹ Gᵀ α
subject to
    e2ᵀ α ≥ ν1,
    0 ≤ α ≤ (1/m2) e2,    (5.12)

and

(ν-DTWSVM2)   Max_β   −(1/2) βᵀ H (Gᵀ G)⁻¹ Hᵀ β
subject to
    e1ᵀ β ≥ ν2,
    0 ≤ β ≤ (1/m1) e1,    (5.14)

where H = [A  e1] and G = [B  e2]. From the dual solution β, the augmented vector v = [w2; b2] of the second hyperplane is recovered as

    v = (Gᵀ G)⁻¹ Hᵀ β,    (5.15)

and [w1; b1] is obtained analogously from (5.13).
The two nonparallel hyperplanes obtained are

    xᵀ w1 + b1 = 0   and   xᵀ w2 + b2 = 0.    (5.16)

A new data point x ∈ Rn is assigned to the class r (r = 1, 2) of the nearer hyperplane, i.e., to arg min_{r=1,2} dr(x), where

    dr(x) = |xᵀ wr + br| / ||wr||.    (5.17)
Remark 5.3.1 Similar to TWSVM, Peng in [8] observed that patterns of class −1 for
which 0 < αi < C1 , (i = 1, 2 . . . , m2 ), lie on the hyperplane given by x T w1 + b1 =
ρ+ . We would further define such patterns of class −1 as support vectors of class +1
with respect to class −1, as they play an important role in determining the required
plane. A similar definition of support vectors of class −1 with respect to class +1
follows analogously.
5.4 Linear Parametric-TWSVM

Generally, the classical SVM and its extensions assume that the noise level on the training data is uniform throughout the domain, or at least that its functional dependency is known beforehand. The assumption of a uniform noise model, however, is not always satisfied. For instance, in the heteroscedastic noise structure, the amount of noise depends on the location. Recently, Hao [10] aimed at this
shortcoming appearing in the classical SVM, and proposed a novel SVM model,
called the Parametric-Margin ν-Support Vector Machine (par-ν-SVM), based on the
ν-Support Vector Machine (ν-SVM). This par-ν-SVM finds a parametric-margin
model of arbitrary shape. The parametric insensitive model is characterized by a
learnable function g(x), which is estimated by a new constrained optimization prob-
lem. This can be useful in many cases, especially when the data has heteroscedastic
error structure, i.e., the noise strongly depends on the input value.
In this section, we present a twin parametric margin SVM (TPMSVM) proposed
by Peng [11]. The proposed TPMSVM aims at generating two nonparallel hyper-
planes such that each one determines the positive or negative parametric-margin
hyperplane of the separating hyperplane. For this aim, similar to the TWSVM, the
TPMSVM also solves two smaller sized QPPs instead of solving a single large one as in the classical SVM or par-ν-SVM. The formulation of TPMSVM differs from that of par-ν-SVM in some respects. First, the TPMSVM solves a pair of smaller
sized QPPs, whereas, the par-ν-SVM only solves single large QPP, which makes
the learning speed of TPMSVM much faster than the par-ν-SVM. Second, the par-
ν-SVM directly finds the separating hyperplane and parametric-margin hyperplane,
while the TPMSVM indirectly determines the separating hyperplane through the pos-
itive and negative parametric-margin hyperplanes. In short, TPMSVM successfully
combines the merits of TWSVM, i.e., the fast learning speed, and par-ν-SVM, i.e.,
the flexible parametric-margin. In [11], Peng compared TPMSVM, par-ν-SVM, TWSVM and SVM in terms of generalization performance, number of support vectors (SVs) and training time on several artificial and benchmark datasets; the results indicate that TPMSVM is not only fast but also shows comparable generalization.
To understand the approach of TPMSVM, we first present the par-ν-SVM model in order to explain the mathematics behind the parametric behaviour of the SVM model. The par-ν-SVM considers a parametric-margin model g(x) = zᵀx + d instead of the functional margin in the ν-SVM. Specifically, the hyperplane f(x) = wᵀx + b in the par-ν-SVM separates the data if and only if
x T w + b ≥ x T z + d, ∀ x ∈ A, (5.18)
x T w + b ≤ −x T z − d, ∀ x ∈ B. (5.19)
(par-SVM)   Min_{(w, b, q, z, d)}   (1/2)||w||² + c ( ν ((1/2)||z||² + d) + (1/m) eᵀ q )
subject to
    yi (x(i)ᵀ w + b) ≥ x(i)ᵀ z + d − qi,   qi ≥ 0,   (i = 1, 2, . . . , m).    (5.20)
The Wolfe dual of (par-SVM) is

(par-DSVM)   Max_α   −(1/2) Σ_{i=1}^{m} Σ_{j=1}^{m} yi yj (x(i)ᵀ x(j)) αi αj + (1/(2cν)) Σ_{i=1}^{m} Σ_{j=1}^{m} (x(i)ᵀ x(j)) αi αj
subject to
    eᵀ α ≥ cν,
    Σ_{i=1}^{m} yi αi = 0,
    0 ≤ αi ≤ c/m,   (i = 1, 2, . . . , m).    (5.21)
Solving the above dual QPP, we obtain the vector of Lagrange multipliers α, which gives the weight vectors w and z as linear combinations of the data samples:

    w = Σ_{i=1}^{m} yi αi x(i),    z = (1/(cν)) Σ_{i=1}^{m} αi x(i).

Further, by exploiting the K.K.T. conditions [3, 4], the bias terms b and d are determined from a pair of support vectors x(i) ∈ A and x(j) ∈ B as

    b = −0.5 [ wᵀ x(i) + wᵀ x(j) − zᵀ x(i) + zᵀ x(j) ],
    d =  0.5 [ wᵀ x(i) − wᵀ x(j) − zᵀ x(i) − zᵀ x(j) ].

The separating function and the parametric-margin function are then given by

    f(x) = Σ_{i=1}^{m} αi yi (x(i)ᵀ x) + b,

and

    g(x) = (1/(cν)) Σ_{i=1}^{m} αi (x(i)ᵀ x) + d,

respectively.
In the following, we call the two nonparallel hyperplanes f (x) ± g(x) = 0 as the
parametric-margin hyperplanes.
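Given the dual solution α of (5.21), the expressions above translate directly into code. The sketch below evaluates f(x) and g(x) for the linear case and labels a point by the sign of f(x); the bias terms b and d are assumed to have been computed from suitable support vectors as described above, and the function name is illustrative.

```python
import numpy as np

def par_nu_svm_predict(x, X, y, alpha, b, d, c, nu):
    """Evaluate f(x), g(x) of the linear par-nu-SVM from the dual variables.

    X : (m, n) training samples, y : (m,) labels in {+1, -1},
    alpha : (m,) Lagrange multipliers, b, d : bias terms of f and g.
    """
    w = (alpha * y) @ X                  # w = sum_i y_i alpha_i x_i
    z = (alpha @ X) / (c * nu)           # z = (1/(c*nu)) sum_i alpha_i x_i
    f = w @ x + b                        # separating-hyperplane value
    g = z @ x + d                        # parametric-margin value
    return f, g, int(np.sign(f))
```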
Similar to TWSVM, TPMSVM also derives a pair of nonparallel planes around the datapoints through two QPPs, in order to find f1(x) = xᵀ w1 + b1 = 0 and f2(x) = xᵀ w2 + b2 = 0, each of which determines one of the parametric-margin hyperplanes. Peng [11] refers to f1(x) and f2(x) as the positive and negative parametric-margin hyperplanes, respectively. Specifically, given the training dat-
apoints {(x (i) , yi ), x (i) ∈ Rn , yi ∈ {−1, +1}, (i = 1, 2, ..., m)}, f1 (x) determines the
positive parametric-margin hyperplane, and f2 (x) determines the negative parametric-
margin hyperplane. By incorporating the positive and negative parametric margin
hyperplanes, this TPMSVM separates the data if and only if
x T w1 + b1 ≥ 0 ∀x ∈ A,
x T w2 + b2 ≤ 0 ∀x ∈ B.
It is further discussed in the paper that the parametric-margin hyperplanes f1(x) = 0 and f2(x) = 0 of TPMSVM are equivalent to f(x) ± g(x) = 0 of the par-ν-SVM.
Thus, Peng [11] considered the following pair of constrained optimization problems:
(par-TWSVM1)   Min_{(w1, b1, q1)}   (1/2)||w1||² + (ν1/m2) e2ᵀ(Bw1 + e2 b1) + (c1/m1) e1ᵀ q1
subject to
    Aw1 + e1 b1 ≥ 0 − q1,
    q1 ≥ 0,    (5.22)

and

(par-TWSVM2)   Min_{(w2, b2, q2)}   (1/2)||w2||² − (ν2/m1) e1ᵀ(Aw2 + e1 b2) + (c2/m2) e2ᵀ q2
subject to
    Bw2 + e2 b2 ≤ 0 + q2,
    q2 ≥ 0.    (5.23)
The corresponding dual problems are

(par-DTWSVM1)   Max_α   −(1/2) αᵀ A Aᵀ α + (ν1/m2) e2ᵀ B Aᵀ α
subject to
    e1ᵀ α = ν1,
    0 ≤ α ≤ (c1/m1) e1,    (5.24)

and

(par-DTWSVM2)   Max_β   −(1/2) βᵀ B Bᵀ β + (ν2/m1) e1ᵀ A Bᵀ β
subject to
    e2ᵀ β = ν2,
    0 ≤ β ≤ (c2/m2) e2.    (5.25)

From the dual solutions the hyperplane parameters are recovered; in particular,

    b2 = − (1/|N−|) Σ_{j∈N−} Bj w2,

where N− is the index set of class −1 samples satisfying βj ∈ (0, c2/m2).
On similar lines, using the concept of structural granularity, Peng et al. [12] introduced an improved version of TPMSVM, termed Structural TPMSVM. Structural TPMSVM incorporates data structural information within the corresponding class by adding a regularization term derived from the cluster granularity.
On similar lines to Hao [10], Khemchandani and Sharma [13] proposed a robust parametric TWSVM and demonstrated its application in a human activity recognition framework.
Remark 5.4.1 The TWSVM classifier finds, for each class, a hyperplane that passes
through the points of that class and is at a distance of at least unity from the other
class. In contrast to the TWSVM classifier, par-TWSVM finds a plane that touches
the points of one class so that points of that class lie on the one side, and is as far
away as possible from the points of the other class. Thus the roles are reversed, and therefore par-TWSVM is more in the spirit of a "reverse twin" than of the original twin formulation.
In Chap. 3, we noticed that for the nonlinear case, TWSVM considers kernel-generated surfaces instead of hyperplanes and constructs two different primal problems, which means that TWSVM has to solve two problems for the linear case and two other problems for the nonlinear case separately. However, in standard SVMs, only one dual problem is solved for both cases, with different kernels. In order to address this issue, Tian et al. [14] proposed a novel nonparallel SVM, termed NPSVM, for binary classification, where the dual problems of the two primal problems have the same advantages as those of standard SVMs, i.e., only inner products appear, so that the kernel trick can be applied directly. Further, the dual problems have the same formulation as that of standard SVMs and can be solved efficiently by SMO; there is no need to compute the inverses of large matrices, as is done in TWSVM, in order to solve the dual problems.
On similar lines to TWSVM, NPSVM also seeks two nonparallel hyperplanes by solving the following two convex QPPs:
(NPSVM1)   Min_{(w1, b1, η1, η1*, q1)}   (1/2)||w1||² + C1 (e1ᵀ η1 + e1ᵀ η1*) + C2 e2ᵀ q1
subject to
    Aw1 + e1 b1 ≤ ε e1 + η1,
    −(Aw1 + e1 b1) ≤ ε e1 + η1*,
    Bw1 + e2 b1 ≤ −e2 + q1,
    η1, η1*, q1 ≥ 0,    (5.26)

and

(NPSVM2)   Min_{(w2, b2, η2, η2*, q2)}   (1/2)||w2||² + C3 (e2ᵀ η2 + e2ᵀ η2*) + C4 e1ᵀ q2
subject to
    Bw2 + e2 b2 ≤ ε e2 + η2,
    −(Bw2 + e2 b2) ≤ ε e2 + η2*,
    Aw2 + e1 b2 ≥ e1 − q2,
    η2, η2*, q2 ≥ 0,    (5.27)

whose dual problems are, respectively,
(DNPSVM1)   Max_{(α, α*, β)}   −(1/2)(α* − α)ᵀ A Aᵀ (α* − α) + (α* − α)ᵀ A Bᵀ β − (1/2) βᵀ B Bᵀ β + e1ᵀ(α* − α) − e2ᵀ β
subject to
    e1ᵀ(α − α*) + e2ᵀ β = 0,
    0 ≤ α, α* ≤ C1,
    0 ≤ β ≤ C2,    (5.28)

and

(DNPSVM2)   Max_{(α, α*, β)}   −(1/2)(α* − α)ᵀ B Bᵀ (α* − α) + (α* − α)ᵀ B Aᵀ β − (1/2) βᵀ A Aᵀ β + e2ᵀ(α* − α) − e1ᵀ β
subject to
    e2ᵀ(α − α*) − e1ᵀ β = 0,
    0 ≤ α, α* ≤ C3,
    0 ≤ β ≤ C4.    (5.29)
Once the solution in terms of w and b is obtained, the decision rule is similar to that of TWSVM. Further, NPSVM degenerates to TBSVM and TWSVM when the parameters are chosen appropriately. Interested readers may refer to Tian et al. [14] for more details.
5.6 Multi-category Extensions of TWSVM

SVM has been widely studied as a binary classifier, and researchers have been trying to extend it to multi-category classification problems. The two most popular approaches for multi-class SVM are One-Against-All (OAA) and One-Against-One (OAO) SVM (Hsu and Lin [15]). OAA-SVM implements a series of binary classifiers where each classifier separates one class from the rest of the classes, but this approach can lead to biased classification due to the huge difference in the number of samples in the two groups. For a K-class classification problem, OAA-SVM requires K binary SVM evaluations for each test point. In the case of OAO-SVM, the binary SVM classifiers are determined using a pair of classes at a time, so it formulates up to K(K − 1)/2 binary SVM classifiers, thus increasing the computational complexity. A directed acyclic graph SVM (DAGSVM) was proposed in Platt et al. [16], in which the training phase is the same as in OAO-SVM, i.e., solving K(K − 1)/2 binary SVMs, but the testing phase is different: it uses a rooted binary directed acyclic graph which has K(K − 1)/2 internal nodes and K leaves. OAA-SVM classification using a decision tree was proposed by Kumar and Gopal [17]. Chen et al. proposed multiclass support vector classification via coding and regression (Chen et al. [18]). Jayadeva et al. proposed a fuzzy linear proximal SVM for multi-category data classification (Jayadeva et al. [19]). Lei and Govindaraju propose Half-Against-Half (HAH) multiclass SVM [20], which is built by recursively dividing the training dataset of K classes into two subsets of classes. Shao et al. propose a decision tree twin support vector machine (DTTSVM) for multi-class classification (Shao et al. [21]), constructing a binary tree based on the best separating principle, which maximizes the distance between the classes. Xie et al. have extended TWSVM to multi-class classification (Xie et al. [22]) using the OAA approach. Xu et al. proposed the Twin K-class support vector classifier (TwinKSVC) (Xu et al. [23]), which combines TWSVM with the support vector classification-regression machine for K-class classification (K-SVCR) and evaluates all the training points in a 1-versus-1-versus-rest structure, thereby generating ternary outputs (+1, 0, −1).
The speed of learning a model is a major challenge for multi-class classification problems with SVM. TWSVM is approximately four times faster to train than SVM, as it solves two smaller QPPs, and it overcomes the imbalance between the two class sizes by choosing two different penalty variables for the different classes. Because of these strengths of TWSVM, Khemchandani and Saigal [24] have recently extended TWSVM to the multi-category scenario and termed it Ternary Decision Structure based Multi-category Twin Support Vector Machine (TDS-TWSVM). For a balanced ternary structure, a K-class problem requires only about log₃ K tests. Also, at each level, the number of samples used by TDS-TWSVM diminishes with the expansion of the decision structure; hence, the order of the QPPs reduces as the height of the structure increases. The TDS-TWSVM algorithm determines a classifier model which is efficient in terms of accuracy and requires fewer tests for a K-class classification problem. The process of building the TDS-TWSVM classifier is explained in the algorithm below.
Algorithm: TDS-TWSVM
(This structure can be applied in general to any type of dataset; however, the experiments are performed in the context of image classification.) Given an image dataset with N images from K different classes, pre-compute the Completed Robust Local Binary Pattern with Co-occurrence matrix (CR-LBP-Co) [26] and Angular Radial Transform (ART) [27] features for all images in the dataset, as discussed in Sects. 3.1 and 3.2 respectively. Create a descriptor F by concatenating both features. F is a matrix of size N × n, where n is the length of the feature vector. Here, n = 172 and the feature vector of an image is given as

    fv = [ft1, ft2, ..., ft136, fs1, fs2, ..., fs36],

where fti (i = 1, 2, ..., 136) are texture features and fsk (k = 1, 2, ..., 36) are shape features.

1. Select the parameters, such as the penalty parameters Ci, the kernel type and the kernel parameter.
   Repeat Steps 2–5 five times (for 5-fold cross-validation).
2. Use k-means clustering to partition the training data into two sets. Identify two focused groups of classes with labels '+1' and '−1' respectively, and one ambiguous group of classes represented with label '0'. Here, k = 2 and we get at most three groups.
3. Take training samples of ‘+1’, ‘−1’ and ‘0’ groups as class representatives and
find three hyperplanes (w (1) , b(1) ), (w (2) , b(2) ) and (w(3) , b(3) ), by applying one-
against-all approach and solving for TWSVM classifier.
4. Recursively partition the datasets and obtain TWSVM classifiers until no further partitioning is possible.
5. Evaluate the test samples with the decision structure based classifier model and
assign the label of the non-divisible node.
The strength of the proposed algorithm lies in the fact that it requires fewer TWSVM evaluations than other state-of-the-art multi-class approaches such as OAA-SVM and OAO-SVM. In order to compare the accuracy of the
proposed system, Khemchandani and Saigal [24] have implemented OAA-TWSVM
and TB-TWSVM. OAA-TWSVM consists of solving K QPPs, one for each class,
so that we obtain 2 ∗ K nonparallel hyperplanes for K classes. Here, we construct a
TWSVM classifier where, in the ith TWSVM classifier, we solve one QPP taking the ith class samples as one class and the remaining samples as the other class. By using the TWSVM methodology, we determine the hyperplane for the ith class. The class-imbalance problem in the ith TWSVM is tackled by choosing a proper penalty vari-
able Ci for the ith class. In case of TB-TWSVM, we recursively divide the data into
two halves and create a binary tree of TWSVM classifiers. TB-TWSVM determines
2 ∗ (K − 1) TWSVM classifiers for a K-class problem. For testing, TB-TWSVM
requires at most log2 K binary TWSVM evaluations. They also implemented a
variation of TB-TWSVM as ternary tree-based TWSVM (TT-TWSVM) where each
node of the tree is recursively divided into three nodes. The partitioning is done by
k-means clustering, with k=3. The experimental results show that TDS-TWSVM
outperforms TT-TWSVM.
TDS-TWSVM is more efficient than OAA-TWSVM considering the time required to build the multi-category classifier. Also, a new test sample can be classified with about log₃ K comparisons, which is more efficient than the OAA-TWSVM and TB-TWSVM testing
time. For a balanced decision structure, the order of QPP reduces to one-third of
parent QPP with each level, as the classes of parent node are divided into three
groups. Experimental results show that TDS-TWSVM has advantages over OAA-
TWSVM, TT-TWSVM and TB-TWSVM in terms of testing complexity. At the
same time, TDS-TWSVM outperforms other approaches in multi-category image
classification and retrieval.
CBIR makes use of image features to determine the similarity or distance between two images [28]. For retrieval, CBIR fetches the images most similar to the given query image. Khemchandani and Saigal [24] proposed the use of TDS-TWSVM for image retrieval: the class label of the query image is found with the algorithm explained in Sect. 5.6, and then, to find similar images from the classified training set, the chi-square distance measure is used. A highlighting feature of TDS-TWSVM is that it is evaluated using out-of-sample data. Most CBIR approaches take the query image from the same dataset that is used to determine the model. Unlike other CBIR-based approaches, TDS-TWSVM reserves a separate part of the dataset for evaluation. Thus, it provides a way to test the model on data that has not been part of the optimization.
Therefore, the classifier model will not be influenced in any way by the out-of-sample
data. Figure 5.4 shows the retrieval result for a query image taken from Wang’s Color
dataset.
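A minimal sketch of the retrieval step is given below: it ranks the training images of the predicted class by the chi-square distance between their feature vectors and that of the query. The particular form of the chi-square distance and the small smoothing constant are standard choices assumed here for illustration, not quantities specified in the original text.

```python
import numpy as np

def chi_square_retrieval(query_feat, class_feats, top_k=10, eps=1e-10):
    """Return indices of the top_k most similar images of the predicted class.

    query_feat  : (n,) feature vector of the query image
    class_feats : (N, n) feature vectors of training images of the predicted class
    """
    diff = class_feats - query_feat
    denom = class_feats + query_feat + eps          # eps avoids division by zero
    chi2 = 0.5 * np.sum(diff ** 2 / denom, axis=1)  # chi-square distance per image
    return np.argsort(chi2)[:top_k]                 # indices of the nearest images
```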
Fig. 5.4 Retrieval result for a query image taken from Wang’s Color dataset
5.7 Conclusions
In this chapter, we reviewed variants of TWSVM developed in the recent past. The chapter discussed the least squares version of TWSVM, which is faster than TWSVM as it solves systems of linear equations to obtain the solution.
Further, ν-TWSVM was discussed, in which the unit-distance separation constraint of TWSVM is relaxed to a parameter ρ that becomes part of the optimization problem.
To capture heteroscedastic noise and outliers, par-TWSVM was also reviewed in this chapter. In this approach, a learnable function automatically adjusts a flexible parametric-insensitive zone of arbitrary shape and minimal radius to include the given data.
Towards the end, a nonparallel classifier termed NPSVM was introduced, in which the ε-insensitive loss function is used instead of the quadratic loss function. NPSVM implements the SRM principle; further, its dual problems have the same advantages as those of standard SVMs and can be solved efficiently by the SMO algorithm in order to deal with large-scale problems.
Other variants of TWSVM which we have not discussed in this chapter are Fuzzy Twin Support Vector Machines (Khemchandani et al. [29]), Incremental Twin Support Vector Machines (Khemchandani et al. [30]), Probabilistic TWSVM (Shao et al. [31]), ν-Nonparallel Support Vector Machines (Tian et al. [32]), efficient sparse nonparallel support vector machines (Tian et al. [33]) and Recursive Projection TWSVM ([34]), among others. Interested readers may refer to the papers related to the aforementioned works. In this regard, survey articles on TWSVM, for example Tian et al. [35] and Ding et al. [36, 37], are a good source of information.
References
1. Suykens, J. A., & Vandewalle, J. (1999). Least squares support vector machine classifiers.
Neural Processing Letters, 9(3), 293–300.
2. Kumar, M. A., & Gopal, M. (2009). Least squares twin support vector machines for pattern
classification. Expert Systems and its Applications, 36(4), 7535–7543.
3. Chandra, S., Jayadeva, & Mehra, A. (2009). Numerical Optimization with Applications. New
Delhi: Narosa Publishing House.
4. Mangasarian, O. L. (1994). Nonlinear Programming. Philadelphia: SIAM.
5. Nasiri, J. A., Charkari, N. M., & Mozafari, K. (2014). Energy-based model of least squares
twin Support Vector Machines for human action recognition. Signal Processing, 104, 248–257.
6. Laptev I., Marszalek M., Schmid C., & Rozenfeld B. (2008). Learning realistic human actions
from movies. IEEE Conference on Computer Vision and Pattern Recognition, CVPR, (p. 18).
IEEE.
7. Jayadeva, Khemchandani. R., & Chandra, S. (2007). Twin support vector machines for pattern
classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(5), 905–
910.
8. Peng, X. J. (2010). A ν-twin support vector machine (ν-TWSVM) classifier and its geometric
algorithms. Information Science, 180(20), 3863–3875.
9. Schoolkopf, B., Smola, A., Williamson, R., & Bartlett, P. L. (2000). New support vector algo-
rithms. Neural Computation, 12(5), 1207–1245.
10. Hao, Y. P. (2010). New support vector algorithms with parametric insensitive margin model.
Neural Networks, 23(1), 60–73.
11. Peng, X. (2011). TPSVM: A novel twin parametric-margin support vector for pattern recogni-
tion. Pattern Recognition, 44(10–11), 2678–2692.
12. Peng, X. J., Wang, Y. F., & Xu, D. (2013). Structural twin parametric margin support vector
machine for binary classification. Knowledge-Based Systems, 49, 63–72.
13. Khemchandani, R., & Sharma,S. (2016). Robust parametric twin support vector machines and
its applications to human activity recognition. In Proceedings of International Conference on
Image Processing, IIT Roorkee.
14. Tian, Y. J., Qi, Z. Q., Ju, X. C., Shi, Y., & Liu, X. H. (2013). Nonparallel support vector
machines for pattern classification. IEEE Transactions on Cybernetics, 44(7), 1067–1079.
15. Hsu, C.-W., & Lin, C.-J. (2002). A comparison of methods for multiclass support vector
machines. IEEE Transactions on Neural Networks, 13(2), 415–425.
16. Platt, J. C., Cristianini, N., & Shawe-Taylor, J. (2010). Large margin DAGs for multiclass
classification. Advances in Neural Information Processing Systems, 12, 547–553.
17. Kumar, M. A., & Gopal, M. (2010). Fast multiclass SVM classification using decision tree
based one-against-all method. Neural Processing Letters, 32, 311–323.
18. Chen, P.-C., Lee, K.-Y., Lee, T.-J., Lee, Y.-J., & Huang, S.-Y. (2010). Multiclass support vector
classification via coding and regression. Neurocomputing, 73, 1501–1512.
19. Jayadeva, Khemchandani. R., & Chandra, S. (2005). Fuzzy linear proximal support vector
machines for multi-category data classification. Neurocomputing, 67, 426–435.
20. Lei, H., & Govindaraju, V. (2005). Half-against-half multi-class support vector machines. MCS,
LNCS, 3541, 156–164.
21. Shao, Y.-H., Chen, W.-J., Huang, W.-B., Yang, Z.-M., & Deng, N.-Y. (2013). The best separating
decision tree twin support vector machine for multi-class classification. Procedia Computer
Science, 17, 1032–1038.
22. Xie, J., Hone, K., Xie, W., Gao, X., Shi, Y., & Liu, X. (2013). Extending twin support vector
machine classifier for multi-category classification problems. Intelligent Data Analysis, 17,
649–664.
23. Xu, Y., Guo, R., & Wang, L. (2013). A Twin multi-class classification support vector machine.
Cognate Computer, 5, 580–588.
24. Khemchandani, R., & Saigal, P. (2015). Color image classification and retrieval through ternary
decision structure based multi-category TWSVM. Neurocomputing, 165, 444–455.
25. MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations.
In, Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability (pp.
281–297). University of California.
26. Zhao, Y., Jia, W., Hu, R. X., & Min, H. (2013). Completed robust local binary pattern for
texture classification. Neurocomputing, 106, 68–76.
27. Ricard, J., Coeurjolly, D., & Baskurt, A. (2005). Generalizations of angular radial
transform for 2D and 3D shape retrieval. Pattern Recognition Letters, 26(14), 2174–2186.
28. Liu, G. H., Zhang, L., Hou, Y. K., Li, Z. Y., & Yang, J. Y. (2010). Image retrieval based on
multi-texton histogram. Pattern Recognition, 43(7), 2380–2389.
29. Khemchandani, R., Jayadeva, & Chandra, S. (2007). Fuzzy twin support vector machines for
pattern classification. In ISPDM’ 07 International Symposium on Mathematical Programming
for Decision Making: Theory and Applications. Singapore: World Scientific (Published in
Mathematical Programming and Game Theory for Decision Making)
30. Khemchandani, R., Jayadeva, & Chandra, S. (2008). Incremental twin support vector machines.
In S.K. Neogy, A.K.das and R. B. Bapat (Eds.), ICMCO-08, International Conference on Mod-
eling, Computation and Optimization. Published in Modeling, Computation and Optimization.
Singapore:World Scientific.
31. Shao, Y. H., Deng, N. Y., Yang, Z. M., Chen, W. J., & Wang, Z. (2012). Probabilistic outputs
for twin support vector machines. Knowledge-Based Systems, 33, 145–151.
32. Tian, Y. J., Zhang, Q., & Liu, D. L. (2014). ν-Nonparallel support vector machine for pattern
classification. Neural Computing and Applications,. doi:10.1007/s00521-014-1575-3.
33. Tian, Y. J., Ju, X. C., & Qi, Z. Q. (2013). Efficient sparse nonparallel support vector machines
for classification. Neural Computing and Applications, 24(5), 1089–1099.
34. Chen, X., Yang, J., Ye, Q., & Liang, J. (2011). Recursive projection twin support vector machine
via within-class variance Minimization. Pattern Recognition, 44(10), 2643–2655.
35. Tian, Y., & Qi, Z. (2014). Review on twin support vector machines. Annals of Data Science. doi:10.1007/s40745-014-0018-4.
36. Ding, S., Yu, J., Qi, B., & Huang, H. (2014). An Overview of twin support vector machines.
Artificial Intelligence Review., 42(2), 245–252.
37. Ding, S., Zhang, N., Zhang, X., & Wu. F. (2016). Twin support vector machine: theory,
algorithm and applications. Neural Computing and Applications,. doi:10.1007/s00521-016-
2245-4.
Chapter 6
TWSVM for Unsupervised
and Semi-supervised Learning
6.1 Introduction
Recently, Jayadeva et al. [6] have proposed the Twin Support Vector Machine (TWSVM) classifier for binary data classification, where the two hyperplanes are obtained by solving two related smaller-sized Quadratic Programming Problems (QPPs), as compared to a single large-sized QPP in the conventional Support Vector Machine (SVM). Taking motivation from Suykens and Vandewalle [7], Kumar and Gopal [8] have proposed a Least Squares version of TWSVM (LS-TWSVM) and shown that the formulation is extremely fast, since it solves two modified primal problems of TWSVM, which is equivalent to solving systems of linear equations. Most researchers minimize the loss function subject to the L1-norm or L2-norm penalty. In [9], the authors proposed the Lp-norm Least Squares Twin Support Vector Machine (Lp-LSTWSVM), which automatically selects the value of p from the data.
An extension of TWSVM to the semi-supervised framework has been proposed by Qi et al. [10], which they termed Laplacian-TWSVM (Lap-TWSVM). In Laplacian-TWSVM, the authors use a graph-based method to exploit the labeled information along with a large amount of unlabeled information in order to build a better classifier. To reduce the computational cost of Laplacian-TWSVM, Chen et al. [11] proposed a least squares version of Laplacian-TWSVM, termed Lap-LSTWSVM. Lap-LSTWSVM replaces the QPPs of Lap-TWSVM with systems of linear equations by using a squared loss function instead of the hinge loss function. Similar to LS-TWSVM, Lap-LSTWSVM is extremely fast, as its solution is determined by solving systems of linear equations.
In the semi-supervised binary classification problem, we consider the set S = {(x1, y1), (x2, y2), . . . , (xl, yl), xl+1, . . . , xm}, where Xl = {xi : i = 1, 2, . . . , l} are the l labeled data points in n dimensions with corresponding class labels Yl = {yi ∈ {+1, −1} : i = 1, 2, . . . , l}, and Xu = {xi : i = l + 1, l + 2, . . . , m} are the unlabeled data points. Thus, X = Xl ∪ Xu. Data points belonging to classes +1 and −1 are represented by the matrices A and B, with m1 and m2 patterns respectively; therefore, the sizes of A and B are (m1 × n) and (m2 × n), where n is the dimension of the feature space. Let Ai (i = 1, 2, . . . , m1) be a row vector in the n-dimensional real space Rn that represents the feature vector of a data sample.
For better generalization performance, we would like to construct a classifier which utilizes both labeled and unlabeled data. Recently, the manifold regularization learning technique was proposed by Belkin et al. [1]; it utilizes both labeled and unlabeled data and also preserves some geometric information of the data. In [1], the authors introduced the following regularization term:

    ||f||²_M = (1/2) Σ_{i,j=1}^{l+u} w_{i,j} (f(xi) − f(xj))² = f(X)ᵀ La f(X),    (6.1)

where f(X) = [f(x1), f(x2), . . . , f(xl+u)]ᵀ represents the decision function values over all the training data X, La = D − W is the graph Laplacian matrix, W = (wi,j) is the adjacency matrix of dimension (l + u) × (l + u) whose entry wi,j is the edge-weight defined for the pair of points (xi, xj), and D is the diagonal matrix given by Di,i = Σ_{j=1}^{l+u} wi,j.
With a chosen kernel function K(·, ·) and associated norm || · ||H in the Reproducing Kernel Hilbert Space (RKHS), the semi-supervised manifold regularization framework is established by minimizing

    Min_{f ∈ H_K}   Remp(f) + γH ||f||²_H + γM ||f||²_M,

where Remp(f) denotes the empirical risk on the labeled data Xl, γH is the parameter corresponding to ||f||²_H, which penalizes the complexity of f in the RKHS, and γM is the parameter associated with ||f||²_M, which enforces smoothness of the function f along the intrinsic manifold M. For more details, refer to Belkin et al. [1].
6.2 Laplacian-SVM
The Laplacian SVM (Lap-SVM) solves

    Min_{f ∈ H_K}   Σ_{i=1}^{l} max(1 − yi f(xi), 0) + γH ||f||²_H + γM ||f||²_M.    (6.3)

By introducing the slack variables ξi, the above unconstrained primal problem can be written as a constrained optimization problem, with decision variables (α, b, ξ), as

    Min_{(α, b, ξ)}   Σ_{i=1}^{l} ξi + γH αᵀ K α + γM αᵀ K L K α
    subject to
        yi ( Σ_{j=1}^{l+u} αj K(xi, xj) + b ) ≥ 1 − ξi,   (i = 1, 2, . . . , l),
        ξi ≥ 0,   (i = 1, 2, . . . , l).
The Wolfe dual of the above problem is

    Max_β   Σ_{i=1}^{l} βi − (1/2) βᵀ Q β
    subject to
        Σ_{i=1}^{l} βi yi = 0,
        0 ≤ βi ≤ 1,   (i = 1, 2, . . . , l),    (6.4)

where Q = Y JL K (2γH I + 2γM L K)⁻¹ JLᵀ Y, Y ∈ R^{l×l} is the diagonal matrix of the labels, JL = [I 0] ∈ R^{l×(l+u)} selects the labeled points, K ∈ R^{(l+u)×(l+u)} is the kernel matrix and I is the identity matrix of order (l + u).
Once the Lagrange multiplier vector β is obtained, the optimal value of the decision variable α∗ is determined from it as α∗ = (2γH I + 2γM L K)⁻¹ JLᵀ Y β∗ (see [1]).
The target function f∗ is defined as f∗(x) = Σ_{i=1}^{l+u} αi∗ K(xi, x).
The decision function that discriminates between classes +1 and −1 is then given by y(x) = sign(f∗(x)).
6.3 Laplacian-TWSVM
Lap-TWSVM seeks two nonparallel decision functions f1(x) = w1ᵀ x + b1 and f2(x) = w2ᵀ x + b2, where Ai and Bi are the ith rows of the matrices A (corresponding to class label +1) and B (corresponding to class label −1), and (w1, b1) and (w2, b2) are the weight vectors and biases corresponding to classes A and B, respectively.
On the lines of Lap-SVM, the regularization terms of Lap-TWSVM, i.e., ||f1||²_H and ||f2||²_H, can be expressed as

    ||f1||²_H = (1/2)(||w1||² + b1²),    (6.9)

and

    ||f2||²_H = (1/2)(||w2||² + b2²),    (6.10)

respectively.
For manifold regularization, a data adjacency graph W of size (l + u) × (l + u) is defined, whose entries Wi,j represent the similarity of each pair of input samples. The weight matrix W may be defined through k nearest neighbours as

    Wij = exp(−||xi − xj||² / 2σ²)  if xi, xj are neighbours,  and  Wij = 0  otherwise.    (6.11)

The manifold regularization terms are then

    ||f1||²_M = (1/(l + u)²) Σ_{i,j=1}^{l+u} Wi,j (f1(xi) − f1(xj))² = f1ᵀ L f1,    (6.12)

    ||f2||²_M = (1/(l + u)²) Σ_{i,j=1}^{l+u} Wi,j (f2(xi) − f2(xj))² = f2ᵀ L f2.    (6.13)
The primal problems of linear Lap-TWSVM are

    Min_{(w1, b1, ξ2)}   (1/2)||Aw1 + e1 b1||² + c1 e2ᵀ ξ2 + (c2/2)(w1ᵀ w1 + b1²) + (c3/2)(X w1 + e b1)ᵀ L (X w1 + e b1)
    subject to
        −(Bw1 + e2 b1) + ξ2 ≥ e2,
        ξ2 ≥ 0,

and

    Min_{(w2, b2, ξ1)}   (1/2)||Bw2 + e2 b2||² + c1 e1ᵀ ξ1 + (c2/2)(w2ᵀ w2 + b2²) + (c3/2)(X w2 + e b2)ᵀ L (X w2 + e b2)
    subject to
        (Aw2 + e1 b2) + ξ1 ≥ e1,
        ξ1 ≥ 0,

where e1 and e2 are vectors of ones of dimensions equal to the number of labeled patterns in the respective classes, and c1, c2 > 0 are trade-off parameters.
Similarly, for a nonlinear kernel, the corresponding decision functions with parameters (λ1, b1) and (λ2, b2) can be expressed through the kernel-generated hyperplanes

    f1(x) = K(xᵀ, Xᵀ) λ1 + b1   and   f2(x) = K(xᵀ, Xᵀ) λ2 + b2,

where K is the chosen kernel function, K(xi, xj) = (φ(xi) · φ(xj)), and X is the data matrix comprising both labelled and unlabelled patterns. With the above notation, the regularizer terms can be expressed as

    ||f1||²_H = (1/2)(λ1ᵀ K λ1 + b1²),    (6.16)

    ||f2||²_H = (1/2)(λ2ᵀ K λ2 + b2²).    (6.17)
Similar to the linear case, the manifold regularization terms ||f1||²_M and ||f2||²_M can be written in terms of Kλi + e bi, and the primal problems of nonlinear Lap-TWSVM become

    Min_{(λ1, b1, ξ2)}   (1/2)||K(A, Xᵀ)λ1 + e1 b1||² + c1 e2ᵀ ξ2 + (c2/2)(λ1ᵀ K λ1 + b1²) + (c3/2)(λ1ᵀ K + eᵀ b1) L (K λ1 + e b1)
    subject to
        −(K(B, Xᵀ)λ1 + e2 b1) + ξ2 ≥ e2,
        ξ2 ≥ 0,

and

    Min_{(λ2, b2, ξ1)}   (1/2)||K(B, Xᵀ)λ2 + e2 b2||² + c1 e1ᵀ ξ1 + (c2/2)(λ2ᵀ K λ2 + b2²) + (c3/2)(λ2ᵀ K + eᵀ b2) L (K λ2 + e b2)
    subject to
        (K(A, Xᵀ)λ2 + e1 b2) + ξ1 ≥ e1,
        ξ1 ≥ 0,    (6.20)

where e1 and e2 are vectors of ones of dimensions equal to the number of labeled patterns in the respective classes, and, similar to c1 and c2, c3 > 0 is a trade-off parameter.
From Qi et al. [10], the Wolfe dual problem corresponding to the hyperplane of class +1 can be expressed as

    Max_α   e2ᵀ α − (1/2)(αᵀ Gφ)(Hφᵀ Hφ + c2 Oφ + c3 Jφᵀ L Jφ)⁻¹(Gφᵀ α)
    subject to
        0 ≤ α ≤ c1 e2,    (6.21)

where Hφ = [K(A, Xᵀ)  e1], Oφ = [K 0; 0 0], Jφ = [K  e] and Gφ = [K(B, Xᵀ)  e2]. The Wolfe dual corresponding to the hyperplane of class −1 is obtained analogously, with constraints

        0 ≤ β ≤ c2 e1,    (6.23)

where Qφ = [K(B, Xᵀ)  e2], Uφ = [K 0; 0 0], Fφ = [K  e] and Pφ = [K(A, Xᵀ)  e1].
Once the vectors u1 and u2 are obtained from the above equations, a new data point x ∈ Rn is assigned to class +1 or −1 based on which of the two hyperplanes it lies closest to, i.e.,

    class(x) = arg min_{i=1,2} di(x),   where   di(x) = |λiᵀ K(xᵀ, Xᵀ) + bi| / ||λi||.    (6.26)

6.4 Laplacian Least Squares TWSVM
Taking motivation from the Least Squares SVM (LSSVM) [7] and the proximal SVM (PSVM) [12], the authors in [11] developed the formulation of the Laplacian Least Squares Twin SVM (Lap-LSTWSVM) for binary data classification. Later on, Khemchandani et al. [13] extended Laplacian Least Squares TWSVM to multicategory datasets with a One-versus-One-versus-Rest strategy. In [11], the authors modify the empirical loss terms as

    R1emp = Σ_{i=1}^{m1} f1(Ai)² + c1 Σ_{i=1}^{m2} (f1(Bi) + 1)²,    (6.27)

and

    R2emp = Σ_{i=1}^{m2} f2(Bi)² + c2 Σ_{i=1}^{m1} (f2(Ai) − 1)²,    (6.28)

where f1(Ai) and f2(Bi) represent the decision function values over the training data belonging to classes A and B respectively, and c1 > 0 and c2 > 0 are risk penalty parameters which determine the trade-off between the loss terms in Eqs. (6.27) and (6.28), respectively.
By introducing the slack (error) variables y1, y2, z1 and z2 of appropriate dimensions in the corresponding primal problems of Lap-LSTWSVM, we obtain the following optimization problems:

    Min_{(w1, b1, y1, z1)}   (1/2)(y1ᵀ y1 + c1 z1ᵀ z1) + ((1 − λ1)/2)(X w1 + e b1)ᵀ La (X w1 + e b1) + (λ1/2)(||w1||² + b1²)
    subject to
        A w1 + e1 b1 = y1,
        B w1 + e2 b1 + e2 = z1,    (6.29)

and

    Min_{(w2, b2, y2, z2)}   (1/2)(z2ᵀ z2 + c2 y2ᵀ y2) + ((1 − λ2)/2)(X w2 + e b2)ᵀ La (X w2 + e b2) + (λ2/2)(||w2||² + b2²)
    subject to
        B w2 + e2 b2 = z2,
        A w2 + e1 b2 − e1 = y2.    (6.30)

Substituting the equality constraints of (6.29) into its objective function gives the unconstrained problem

    Min_{(w1, b1)}   L = (1/2)(||A w1 + e1 b1||² + c1 ||B w1 + e2 b1 + e2||²) + (λ1/2)(||w1||² + b1²) + ((1 − λ1)/2)(X w1 + e b1)ᵀ La (X w1 + e b1).    (6.31)

Equating the partial derivatives of Eq. (6.31) with respect to w1 and b1 to zero leads to a system of linear equations in the augmented vector v1 = [w1; b1] of the form

    P v1 = −c1 Gᵀ e2,

where H = [A  e1], G = [B  e2], J = [X  e] and P = Hᵀ H + c1 Gᵀ G + λ1 I + (1 − λ1) Jᵀ La J.
In a similar way, the solution of Eq. (6.30) is obtained by solving the following system of linear equations,

    Q v2 = c2 Hᵀ e1,    (6.36)

where Q = Gᵀ G + c2 Hᵀ H + λ2 I + (1 − λ2) Jᵀ La J. A new point x is then assigned to the class of the nearer hyperplane,

    Class i = arg min_{k=1,2} |wkᵀ x + bk| / ||wk||.    (6.37)
6.5 Unsupervised Learning

Clustering is a powerful tool which aims at grouping similar objects into the same cluster and dissimilar objects into different clusters by identifying dominant structures in the data. It has remained a widely studied research area in machine learning (Anderberg [14]; Jain et al. [15]; Aldenderfer and Blashfield [16]) and has applications in diverse domains such as computer vision, text mining, bioinformatics and signal processing (QiMin et al. [17]; Zhan et al. [18]; Tu et al. [19]; Liu and Li [20]).
Traditional point-based clustering methods such as k-means (Anderberg [14]) and k-median (Bradley and Mangasarian [21]) work by partitioning the data into clusters based on cluster prototype points. These methods perform poorly when the data are not distributed around cluster prototype points. In contrast to these, plane
when data is not distributed around several cluster points. In contrast to these, plane
based clustering methods such as k-plane clustering (Bradley and Mangasarian [22]),
proximal plane clustering ([23]), local k-proximal plane clustering (Yang et. al. [24])
etc. have been proposed in literature. These methods calculate k cluster center planes
and partition the data into k clusters according to the proximity of the datapoints with
these k planes.
Recently, Wang et al. [25] proposed a novel plane based clustering method namely
twin support vector clustering (TWSVC). The method is based on twin support vector
machine (TWSVM) (Jayadeva et al. [6]) and exploits information from both within
and between clusters. Different from the TWSVM, the formulation of TWSVC is
modified to get one cluster plane close to the points of its own cluster and at the same
time far away from the points of different clusters from both sides of cluster planes.
Experimental results (Wang et al. [25]) show the superiority of the method over existing plane-based methods.
The samples are denoted by a set of m row vectors X = {x1; x2; . . . ; xm} in the n-dimensional real space Rn, where the jth sample is xj = (xj1, xj2, . . . , xjn). We assume that these samples belong to k clusters, with their corresponding cluster labels in {1, 2, . . . , k}. Let Xi denote the set of samples with cluster label i, and let X̂i, for i = 1, 2, . . . , k, denote the set of samples belonging to clusters other than i.
6.5.1 k-Means
k-means clustering seeks k cluster center points μ1, μ2, . . . , μk by solving

    Min_{(μ1, . . . , μk, X1, . . . , Xk)}   Σ_{i=1}^{k} Σ_{j=1}^{mi} ||Xi(j) − μi||²,    (6.38)

where Xi(j) represents the jth sample in Xi, mi is the number of samples in Xi so that m1 + m2 + · · · + mk = m, and || · || denotes the L2 norm.
In practice, an iterative relocation algorithm is followed which minimizes (6.38) locally. Given an initial set of k cluster center points, each sample x is labelled to its nearest cluster center by

    label(x) = arg min_{i=1, . . . , k} ||x − μi||².    (6.39)

Then the k cluster center points are updated as the means of the corresponding cluster samples, since for a given assignment Xi the mean of the cluster samples is the solution to (6.38). At each iteration, the cluster centers and sample labels are updated until some convergence criterion is satisfied.
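The iterative relocation just described is shown as a compact NumPy sketch below; the random choice of initial centers and the tolerance are illustrative assumptions.

```python
import numpy as np

def kmeans(X, k, n_iter=100, tol=1e-6, seed=0):
    """Plain Lloyd-style k-means: alternate label assignment (6.39) and mean update."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(X.shape[0], size=k, replace=False)]
    for _ in range(n_iter):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)                      # nearest-center assignment
        new_centers = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                                else centers[i] for i in range(k)])
        if np.linalg.norm(new_centers - centers) < tol:    # convergence check
            centers = new_centers
            break
        centers = new_centers
    return labels, centers
```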
6.5.2 TWSVC
TWSVC considers the following problem (Wang et al. [25]) to obtain k cluster center planes wiᵀ x + bi = 0, i = 1, 2, . . . , k, one for each cluster:

    Min_{(wi, bi, qi, Xi)}   (1/2)||Xi wi + bi e||² + C eᵀ qi
    subject to
        |X̂i wi + bi e| + qi ≥ e,
        qi ≥ 0,    (6.40)

where C > 0 is a penalty parameter and e is a vector of ones of appropriate dimension. Since the constraint of (6.40) is non-convex, the problem is solved by the concave–convex procedure (CCCP), which decomposes it into a series of convex subproblems (6.41) in which the concave part is replaced by its first-order Taylor expansion around the current iterate, where the index of the subproblem is j = 0, 1, 2, . . . , and T(·) denotes the first-order Taylor expansion.
Wang et al. [25] show that the subproblem (6.41) is equivalent to the following:

    Min_{(wi^{j+1}, bi^{j+1}, qi^{j+1})}   (1/2)||Xi wi^{j+1} + bi^{j+1} e||² + C eᵀ qi^{j+1}
    subject to
        diag(sign(X̂i wi^j + bi^j e))(X̂i wi^{j+1} + bi^{j+1} e) + qi^{j+1} ≥ e,
        qi^{j+1} ≥ 0,    (6.42)

which is solved for [wi^{j+1}; bi^{j+1}] through its dual problem

    Min_α   (1/2) αᵀ G (Hᵀ H)⁻¹ Gᵀ α − eᵀ α
    subject to
        0 ≤ α ≤ C e,    (6.43)

where G = diag(sign(X̂i wi^j + bi^j e))[X̂i  e], H = [Xi  e] and α ∈ R^{m−mi} is the vector of Lagrange multipliers. The solution of (6.42) is obtained from the solution of (6.43) by

    [wi^{j+1}; bi^{j+1}] = (Hᵀ H)⁻¹ Gᵀ α.    (6.44)
In short, for each i = 1, 2, . . . , k, we select an initial wi^0 and bi^0 and solve for [wi^{j+1}; bi^{j+1}] by (6.44) for j = 0, 1, 2, . . . , and stop when ||[wi^{j+1}; bi^{j+1}] − [wi^j; bi^j]|| is small enough. We then set wi = wi^{j+1} and bi = bi^{j+1}.
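The sketch below mirrors this iteration for one cluster i. The box-constrained dual (6.43) is solved here with SciPy's L-BFGS-B as a generic bounded solver (an illustrative assumption, not necessarily the solver used in [25]), and the plane parameters are updated via (6.44) until they stop changing; the ridge term is added only for numerical stability.

```python
import numpy as np
from scipy.optimize import minimize

def twsvc_plane(Xi, Xi_rest, C=1.0, n_cccp=20, tol=1e-5, ridge=1e-8):
    """CCCP iteration of linear TWSVC for one cluster plane w'x + b = 0.

    Xi      : samples currently assigned to cluster i
    Xi_rest : samples of the remaining clusters
    """
    H = np.hstack([Xi, np.ones((Xi.shape[0], 1))])
    E = np.hstack([Xi_rest, np.ones((Xi_rest.shape[0], 1))])
    HtH_inv = np.linalg.inv(H.T @ H + ridge * np.eye(H.shape[1]))
    u = np.zeros(H.shape[1]); u[-1] = 1.0                    # initial (w_i^0, b_i^0)
    e = np.ones(E.shape[0])

    for _ in range(n_cccp):
        G = np.diag(np.sign(E @ u)) @ E                      # linearization around current u
        Q = G @ HtH_inv @ G.T
        obj  = lambda a: 0.5 * a @ Q @ a - e @ a             # dual (6.43)
        grad = lambda a: Q @ a - e
        res = minimize(obj, x0=np.zeros(E.shape[0]), jac=grad,
                       bounds=[(0.0, C)] * E.shape[0], method="L-BFGS-B")
        u_new = HtH_inv @ G.T @ res.x                        # update (6.44)
        if np.linalg.norm(u_new - u) < tol:
            u = u_new
            break
        u = u_new
    return u[:-1], u[-1]                                     # (w_i, b_i)
```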
For the nonlinear case, TWSVC considers k cluster center kernel-generated surfaces K(xᵀ, Xᵀ)ui + γi = 0, i = 1, 2, . . . , k, obtained by solving

    Min_{(ui, γi, ηi, Xi)}   (1/2)||K(Xi, X)ui + γi e||² + C eᵀ ηi
    subject to
        |K(X̂i, X)ui + γi e| + ηi ≥ e,
        ηi ≥ 0,

where ηi is an error vector. The above problem is solved in a similar manner to the linear case by the CCCP. However, it is worth mentioning that, for each i = 1, 2, . . . , k, the solution of nonlinear TWSVC is decomposed into a series of subproblems, each of which requires the inversion of a matrix of size (m + 1) × (m + 1) along with the solution of a QPP, where m is the total number of patterns.
Taking motivation from Kumar and Gopal [8], Khemchandani et al. [27] proposed a least squares version of TWSVC (LS-TWSVC) and then extended it to a fuzzy version, F-LS-TWSVC. Here, we modify the primal problem of linear TWSVC (6.40) in the least squares sense (Suykens and Vandewalle [7]), with the inequality constraints replaced by equality constraints, and add a regularization term to the objective function to incorporate the Structural Risk Minimization (SRM) principle. Thus, for cluster i (i = 1, 2, . . . , k), the optimization problem is given as

    Min_{(wi, bi, qi, Xi)}   (1/2)||Xi wi + bi e||² + (ν/2)(||wi||² + bi²) + (C/2)||qi||²
    subject to
        |X̂i wi + bi e| + qi = e,    (6.47)

where C > 0 and ν > 0 are trade-off parameters. Note that QPP (6.47) uses the square of the L2-norm of the slack variable qi instead of the 1-norm of qi used in (6.40), which makes the constraint qi ≥ 0 redundant (Fung and Mangasarian [12]). Further, solving (6.47) reduces to solving systems of linear equations.
Further, we introduce the fuzzy matrices Si and Ŝi into (6.47); these indicate the fuzzy membership value of each data point in the different available clusters. The fuzzy problem for cluster i then reads

    Min_{(wi, bi, qi, Xi)}   (1/2)||Si Xi wi + bi e||² + (ν/2)(||wi||² + bi²) + (C/2)||qi||²

subject to the correspondingly fuzzy-weighted equality constraint. Similar to the solution of the TWSVC formulation (Wang et al. [25]), the above optimization problem can be solved using the concave–convex procedure (CCCP) (Yuille and Rangarajan [26]), which decomposes it into a series of quadratic subproblems indexed by j (j = 0, 1, 2, . . .), starting from an initial wi^0 and bi^0, as follows.
Substituting the error variable qi^{j+1} from the constraints of (6.50) into its objective function leads to an unconstrained problem (6.51). Setting the gradient of (6.51) with respect to wi^{j+1} and bi^{j+1} to zero gives

    (Si Xi)ᵀ[H1 zi^{j+1}] + ν wi^{j+1} + C (Si Xi)ᵀ Gᵀ [G(H2 zi^{j+1}) − e] = 0,    (6.52)

    eᵀ[H1 zi^{j+1}] + ν bi^{j+1} + C eᵀ Gᵀ [G(H2 zi^{j+1}) − e] = 0,    (6.53)

which together yield a closed-form solution for zi^{j+1} = [wi^{j+1}; bi^{j+1}] (Eqs. (6.54)–(6.55)).
It can finally be observed that the algorithm requires the solution of (6.55), which involves the inversion of a matrix of the smaller size (n + 1) × (n + 1), as compared to the additional QPP solution required in the case of TWSVC.
Working on the lines of Jayadeva et al. [6], the nonlinear formulations of LS-TWSVC and F-LS-TWSVC are obtained by considering k cluster center kernel-generated surfaces, i = 1, 2, . . . , k. The nonlinear LS-TWSVC problem for the ith surface is

    Min_{(ui, γi, ηi)}   (1/2)||K(Xi, X)ui + γi e||² + (ν/2)(||ui||² + γi²) + C ηiᵀ ηi,

subject to the corresponding equality constraints on the samples of the remaining clusters. Similar to the linear case, for each i = 1, 2, . . . , k, the above problem is decomposed into a series of quadratic subproblems indexed by j = 0, 1, 2, . . . , whose solutions can be derived in closed form. The fuzzy version analogously solves

    Min_{(ui, γi, ηi)}   (1/2)||Si K(Xi, X)ui + γi e||² + (ν/2)(||ui||² + γi²) + C ηiᵀ ηi,

where E1 = [Si K(Xi, X)  e], E2 = [Ŝi K(X̂i, X)  e] and F = diag(sign(E2 [ui^j; γi^j])).
The overall algorithm remains the same as in the linear case, except that we solve for the k kernel-generated surface parameters ui, γi, i = 1, 2, . . . , k.
It can be noted that the nonlinear algorithm requires the solution of (6.60), which involves calculating the inverse of a matrix of order (m + 1) × (m + 1). However, this inversion can be carried out by computing the inverses of two smaller matrices, rather than of the full (m + 1) × (m + 1) matrix, by using the Sherman–Morrison–Woodbury (SMW) formula (Golub and Van Loan [28]), as in (6.61).
To evaluate clustering performance, a pairwise similarity matrix is defined over a set of cluster labels y as

    M(i, j) = 1 if yi = yj, and 0 otherwise.    (6.62)

Let Mt be the similarity matrix computed from the true cluster labels of the data set, and let Mp correspond to the labels predicted by the clustering method. Then the metric accuracy of the clustering method is defined as

    Metric Accuracy = ((n00 + n11 − m) / (m² − m)) × 100 %,    (6.63)
where n00 is the number of entries that are zero in both Mp and Mt, and n11 is the number of entries that are one in both Mp and Mt (the subtraction of m removes the diagonal entries).
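In code, the measure reads as follows; labels are compared through their pairwise similarity matrices exactly as in (6.62)–(6.63), and the function name is illustrative.

```python
import numpy as np

def clustering_metric_accuracy(true_labels, pred_labels):
    """Similarity-matrix metric accuracy of (6.63), in percent."""
    t = np.asarray(true_labels); p = np.asarray(pred_labels)
    m = t.size
    Mt = (t[:, None] == t[None, :]).astype(int)   # true-label similarity matrix (6.62)
    Mp = (p[:, None] == p[None, :]).astype(int)   # predicted-label similarity matrix
    n11 = np.sum((Mt == 1) & (Mp == 1))           # pairs grouped together in both
    n00 = np.sum((Mt == 0) & (Mp == 0))           # pairs separated in both
    return (n00 + n11 - m) / (m ** 2 - m) * 100.0
```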
For image segmentation, the following measures are used:
• F-measure, which is computed from the precision and recall of the segmentation, where

    Precision = TP / (TP + FP)   and   Recall = TP / (TP + FN).

• ER, which can be calculated as

    ER = (FP + FN) / TT,    (6.65)

where TP is the number of true-detection object pixels, FP is the number of false-detection object pixels, FN is the number of false-detection non-object pixels and TT is the total number of pixels present in the image.
For our simulations, we have considered the RBF kernel, and the values of the parameters C, ν and σ (the kernel parameter) are optimized over the set {2^i | i = −9, −8, . . . , 0} using the cross-validation methodology [33]. The initial cluster labels and fuzzy membership values are obtained from the FNNG initialization discussed in Sect. 6.8.3.
Plane-based clustering methods require an initial assignment of cluster labels as input. In [25], the authors have shown via experiments that the results of plane-based clustering methods depend strongly on this initial input of class labels. Hence, taking motivation from the initialization algorithm based on the nearest neighbour graph (NNG) [25, 34], we implement a fuzzy NNG (FNNG) and provide its output, in the form of a fuzzy membership matrix, as the initial input to our algorithm. The main process of calculating the FNNG is as follows:
1. For the given data set and a parameter p, construct the p-nearest-neighbour undirected graph whose edges represent the distances between each xi (i = 1, . . . , m) and its p nearest neighbours.
2. From the graph, t clusters are obtained by associating the nearest samples. Further, construct a fuzzy membership matrix Sij, where i = 1, . . . , m and j = 1, . . . , t, whose (i, j) entry is calculated as

    Sij = 1 / dij,    (6.66)

where dij is the Euclidean distance of sample i from the jth cluster. If the current number of clusters t is equal to k, then stop. Else, go to step 3 or 4 accordingly.
3. If t < k, disconnect the two connected samples with the largest distance and go to step 2.
4. If t > k, compute the Hausdorff distance [35] between every two clusters among the t clusters and sort all pairs in ascending order. Merge the nearest pair of clusters into one, until k clusters are formed, where the Hausdorff distance between two sets of samples S1 and S2 is defined as

    H(S1, S2) = max { max_{x∈S1} min_{y∈S2} ||x − y||,  max_{y∈S2} min_{x∈S1} ||x − y|| }

(a small sketch of these computations is given below).
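The two quantities used by this procedure are simple to compute; in the sketch below, the fuzzy membership of (6.66) is taken with respect to cluster centres (the Euclidean metric and the small constant guarding against zero distances are assumptions for illustration).

```python
import numpy as np
from scipy.spatial.distance import cdist

def fuzzy_membership(X, cluster_centers, eps=1e-8):
    """S_ij = 1 / d_ij, the inverse distance of sample i to the centre of cluster j (6.66)."""
    d = cdist(X, cluster_centers) + eps
    return 1.0 / d

def hausdorff_distance(S1, S2):
    """Symmetric Hausdorff distance between two sets of samples S1 and S2."""
    D = cdist(S1, S2)
    return max(D.min(axis=1).max(), D.min(axis=0).max())
```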
For the initialization of the CCCP in F-LS-TWSVC, i.e., for the value of the initial decision variable [w1^0; b1^0], we have implemented the F-LS-TWSVM [36] classifier and obtained the solution for the aforementioned variables.
In [6], the authors have shown that TWSVM is approximately 4 times faster than
SVM. The computational complexity of TWSVM is (m3 /4), where m is the total
size of training samples. In [12], the authors have shown that the solution of
LS-TWSVM requires system of linear equations to be solved as opposed to the
solution in TWSVM which requires system of linear equations along with two QPPs
to be solved.
144 6 TWSVM for Unsupervised and Semi-supervised Learning
On the similar lines our algorithm F-LS-TWSVC essentially differs from TWSVC
from the optimization problem involved i.e. in order to obtain k cluster plane para-
meters, we solve only two matrix inverse of (n + 1) × (n + 1) in linear case whereas
TWSVC seeks to solve system of linear equations along with two QPPs. Table 6.1
shows the training time comparison among different algorithms with linear kernel
on UCI dataset.
For nonlinear F-LS-TWSVC, the solution requires the inverse of matrices of order (m + 1) × (m + 1), which can further be computed through (6.61) using the SMW formula,
where we instead invert two smaller matrices of dimensions (mi × mi) and ((m − mi) × (m − mi)). Table 6.2 shows the training time comparison among different techniques
with the nonlinear kernel on UCI datasets.
Tables 6.4 and 6.5 summarize the clustering accuracy results of our proposed F-LS-TWSVC and LS-TWSVC along with TWSVC on several UCI benchmark datasets using the
linear and nonlinear kernel, respectively. These tables show that the metric accuracies of LS-TWSVC and TWSVC are comparable to each other, and that the accuracy further
increases by approximately 2−5 % on each dataset after incorporating the fuzzy membership matrix. In Tables 6.4 and 6.5 we have taken the results of kPC [22], PPC [23]
and FCM [37] from [25].
In this part, the clustering accuracy was determined by following the standard 5-fold cross-validation methodology [33]. Tables 6.6 and 6.7 summarize the testing clustering
accuracy results of our proposed algorithms F-LS-TWSVC and LS-TWSVC along with TWSVC on several UCI benchmark datasets.
Table 6.6 Testing clustering accuracy with linear kernel on UCI datasets
Data TWSVC LS-TWSVC F-LS-TWSVC
Zoo 92.21 ± 3.23 93.56 ± 2.88 96.10 ± 2.18
Wine 85.88 ± 4.16 84.94 ± 4.89 90.92 ± 2.78
Iris 86.01 ± 8.15 86.57 ± 8.05 96.55 ± 1.23
Glass 65.27 ± 4.12 61.20 ± 5.26 65.41 ± 3.80
Dermatology 87.80 ± 2.39 88.08 ± 1.17 92.68 ± 2.42
Ecoli 80.96 ± 5.16 82.45 ± 4.96 86.23 ± 4.56
Compound 89.34 ± 3.53 90.70 ± 3.20 90.22 ± 3.29
Haberman 62.57 ± 4.06 60.63 ± 3.94 64.63 ± 3.94
Libras 87.31 ± 1.53 87.34 ± 0.64 88.52 ± 0.49
Page blocks 74.98 ± 4.07 74.63 ± 3.89 76.32 ± 3.12
Optical recognition 74.01 ± 4.78 73.33 ± 5.04 77.40 ± 4.32
Table 6.7 Testing clustering accuracy comparison with nonlinear kernel on UCI datasets
Data TWSVC LS-TWSVC F-LS-TWSVC
Zoo 93.47 ± 3.96 94.76 ± 3.04 97.26 ± 2.68
Wine 87.66 ± 4.46 88.04 ± 4.98 92.56 ± 3.48
Iris 88.08 ± 7.45 89.77 ± 7.88 97.25 ± 2.23
Glass 67.27 ± 4.62 64.64 ± 5.66 68.04 ± 4.14
Dermatology 88.26 ± 3.49 88.77 ± 1.74 94.78 ± 2.90
Ecoli 83.28 ± 5.46 84.74 ± 5.07 88.96 ± 5.24
Compound 90.14 ± 3.68 90.98 ± 3.44 91.88 ± 3.55
Haberman 62.16 ± 4.26 60.03 ± 3.14 63.36 ± 3.44
Libras 88.16 ± 1.98 88.46 ± 1.06 90.05 ± 0.84
Page blocks 76.68 ± 5.22 75.99 ± 6.07 79.88 ± 5.51
Optical recognition 75.82 ± 5.78 75.32 ± 6.03 78.44 ± 4.11
For every pixel, Gabor filter responses are computed at three scales and four orientations (0°, 45°, 90°, 135°). As a result, we have 12 (= 3 × 4) coefficients for each pixel of
the image. Finally, we use the maximum (in absolute value) of the 12 coefficients for each pixel, which represents the pixel-level Gabor feature of the image. Further, this
feature is used as the input to FNNG, which gives us the initial membership matrix for every pixel over the different clusters. We have also used this Gabor filter to identify
the number of clusters present in the image.
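A sketch (ours) of this pixel-level Gabor feature extraction using scikit-image is given below; the three filter frequencies are assumed values, not taken from the original experiments.

```python
import numpy as np
from skimage.filters import gabor

def pixel_gabor_feature(gray_image):
    """For each pixel, compute Gabor responses at 3 scales (frequencies) and
    4 orientations (0, 45, 90, 135 degrees) and keep the coefficient with the
    largest absolute value, giving one pixel-level feature map."""
    frequencies = [0.1, 0.2, 0.4]                      # assumed scales
    thetas = [0, np.pi / 4, np.pi / 2, 3 * np.pi / 4]  # the four orientations
    responses = []
    for f in frequencies:
        for theta in thetas:
            real, _imag = gabor(gray_image, frequency=f, theta=theta)
            responses.append(real)                     # one coefficient map per filter
    stack = np.stack(responses, axis=-1)               # shape: (H, W, 12)
    return np.max(np.abs(stack), axis=-1)              # max over the 12 coefficients
```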
Table 6.8 compares the performance of the implemented F-LS-TWSVC with the TWSVC method on the Berkeley Segmentation Dataset. Note that for better segmentation, the value
of the F-measure should be higher and the value of ER should be lower. Table 6.8 shows that the F-measure is higher and the ER is lower with F-LS-TWSVC than with
TWSVC (Table 6.9).
Figures 6.1, 6.2, 6.3 and 6.4 show the segmentation results with F-LS-TWSVC and TWSVC respectively.
Fig. 6.1 Segmentation results: (a) original image (ImageID-296059), (b) segmented image with F-LS-TWSVC and (c) segmented image with TWSVC
Fig. 6.2 Segmentation results: (a) original image (ImageID-86016), (b) segmented image with F-LS-TWSVC and (c) segmented image with TWSVC
6.9 Conclusions
In this chapter, we have reviewed the variants of TWSVM for the semi-supervised and unsupervised frameworks developed in the recent past. To begin with, the chapter
discussed the Laplacian SVM, Laplacian TWSVM and Laplacian LS-TWSVM, all variants of SVMs in semi-supervised settings. For unsupervised classification we discussed
k-means, followed by plane-based clustering algorithms which are on the lines of twin support vector machines and are termed twin support vector clustering. We have
also discussed our proposed work on fuzzy least squares twin support vector clustering and shown its results on UCI as well as image datasets.
References
1. Belkin, M., Niyogi, P., & Sindhwani, V. (2006). Manifold regularization: a geometric frame-
work for learning from labeled and unlabeled examples. The Journal of Machine Learning
Research, 7, 2399–2434.
2. Blum, A., & Mitchell, T. (1998). Combining labeled and unlabeled data with co-training.
In Proceedings of the Eleventh Annual Conference on Computational Learning Theory (pp.
92–100).
3. Nigam, K., McCallum, A. K., Thrun, S., & Mitchell, T. (2000). Text classification from labeled
and unlabeled documents using EM. Machine Learning, 39(2–3), 103–134.
4. Melacci, S., & Belkin, M. (2011). Laplacian support vector machines trained in the primal.
The Journal of Machine Learning Research, 12, 1149–1184.
5. Zhu, X. (2008). Semi-supervised learning literature survey, Computer Science TR (150). Madison:
University of Wisconsin.
6. Jayadeva, Khemchandani, R., & Chandra, S. (2007). Twin support vector machines for pattern
classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(5), 905–
910.
7. Suykens, J. A., & Vandewalle, J. (1999). Least squares support vector machine classifiers.
Neural Processing Letters, 9(3), 293–300.
8. Kumar, M. A., & Gopal, M. (2009). Least squares twin support vector machines for pattern
classification. Expert Systems and its Applications, 36(4), 7535–7543.
9. Zhang, Z., Zhen, L., Deng, N., & Tan, J. (2014). Sparse least square twin support vector machine
with adaptive norm. Applied Intelligence, 41, 1097–1107.
10. Qi, Z., Tian, Y., & Shi, Y. (2012). Laplacian twin support vector machine for semi-supervised
classification. Neural Networks, 35, 46–53.
11. Chen, W. J., Shao, Y. H., Deng, N. Y., & Feng, Z. L. (2014). Laplacian least squares twin
support vector machine for semi-supervised classification. Neurocomputing, 145, 465–476.
12. Fung, G., & Mangasarian, O. L. (2001). Proximal support vector machine classifiers. In F.
Provost & R. Srikant (Eds.), Proceedings of the Seventh International Conference on Knowledge
Discovery and Data Mining (pp. 77–86).
13. Khemchandani, R., & Pal, A. (2016). Multicategory laplacian twin support vector machines.
Applied Intelligence (To appear)
14. Anderberg, M. (1973). Cluster Analysis for Applications. New York: Academic Press.
15. Jain, A., Murty, M., & Flynn, P. (1999). Data clustering: a review. ACM Computing Surveys
(CSUR), 31(3), 264–323.
16. Aldenderfer, M., & Blashfield, R. (1985). Cluster Analysis. Los Angeles: Sage.
17. QiMin, C., Qiao, G., Yongliang, W., & Xianghua, W. (2015). Text clustering using VSM with
feature clusters. Neural Computing and Applications, 26(4), 995–1003.
18. Zhan, Y., Yin, J., & Liu, X. (2013). Nonlinear discriminant clustering based on spectral regu-
larization. Neural Computing and Applications, 22(7–8), 1599–1608.
19. Tu, E., Cao, L., Yang, J., & Kasabov, N. (2014). A novel graph-based k-means for nonlinear
manifold clustering and representative selection. Neurocomputing, 143, 109–122.
20. Liu, X., & Li, M. (2014). Integrated constraint based clustering algorithm for high dimensional
data. Neurocomputing, 142, 478–485.
21. Bradley, P., & Mangasarian, O. (1997). Clustering via concave minimization. Advances in
Neural Information Processing Systems, 9, 368–374.
22. Bradley, P., & Mangasarian, O. (2000). K-plane clustering. Journal of Global Optimization,
16(1), 23–32.
23. Shao, Y., Bai, L., Wang, Z., Hua, X., & Deng, N. (2013). Proximal plane clustering via eigen-
values. Procedia Computer Science, 17, 41–47.
24. Yang, Z., Guo, Y., Li, C., & Shao, Y. (2014). Local k-proximal plane clustering. Neural Com-
puting and Applications, 26(1), 199–211.
25. Wang, Z., Shao, Y., Bai, L., & Deng, N. (2014). Twin support vector machine for clustering.
IEEE Transactions on Neural Networks and Learning Systems,. doi:10.1109/TNNLS.2014.
2379930.
26. Yuille, A. L., & Rangarajan, A. (2002). The concave-convex procedure (CCCP) (Vol. 2).,
Advances in Neural Information Processing Systems Cambridge: MIT Press.
27. Khemchandani, R., & Pal, A. Fuzzy least squares twin support vector clustering. Neural Com-
puting and its Applications (To appear)
28. Golub, G. H., & Van Loan, C. F. (1996). Matrix computations (3rd ed.). Baltimore: John
Hopkins University Press.
29. Blake, C. L., & Merz, C. J. UCI Repository for Machine Learning Databases, Irvine, CA:
University of California, Department of Information and Computer Sciences. http://www.ics.
uci.edu/~mlearn/MLRepository.html.
30. Arbelaez, P., Fowlkes, C., & Martin, D. (2007). The Berkeley Segmentation Dataset and Bench-
mark. http://www.eecs.berkeley.edu/Research/Projects/CS/vision/bsds.
31. Mehrkanoon, S., Alzate, C., Mall, R., Langone, R., & Suykens, J. (2015). Multiclass semisuper-
vised learning based upon kernel spectral clustering. IEEE Transactions on Neural Networks
and Learning Systems, 26(4), 720–733.
32. Wang, X. Y., Wang, T., & Bu, J. (2011). Color image segmentation using pixel wise support
vector machine classification. Pattern Recognition, 44(4), 777–787.
33. Duda, R., Hart, P., & Stork, D. (2001). Pattern classification. New York: Wiley.
34. Keller, J. M., Gray, M. R., & Givens, J. A. (1985). A fuzzy k-nearest neighbor algorithm. IEEE
Transactions on Systems, Man and Cybernetics, 4, 580–585.
35. Hausdorff, F. (1927). Mengenlehre. Berlin: Walter de Gruyter.
36. Sartakhti, J. S., Ghadiri, N., & Afrabandpey, H. (2015). Fuzzy least squares twin support vector
machines. arXiv preprint arXiv:1505.05451.
37. Wang, X., Wang, Y., & Wang, L. (2004). Improving fuzzy c-means clustering based on feature-
weight learning. Pattern Recognition Letters, 25, 1123–1132.
38. Manjunath, B. S., & Ma, W. Y. (1996). Texture features for browsing and retrieval of image
data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(8), 837–842.
Chapter 7
Some Additional Topics
7.1 Introduction
This chapter is devoted to the study of certain additional topics on twin support
vector machines for classification. Specifically, these topics are kernel optimization in
TWSVM, knowledge based TWSVM and a recently introduced formulation of Twin
Support Tensor Machine (TWSTM) for matrix data classification. We are including
these topics for our discussion here because of their novelty and potential for real
life applications.
This chapter consists of five main sections namely, Optimal Kernel Selection in
Twin Support Vector Machines, Knowledge Based Twin Support Vector Machines
and Variants, Support Tensor Machines: A Brief Review, and Twin Support Tensor
Machines.
Our presentation here is based on Khemchandani et al. [1, 2], Kumar et al. [3],
Cai et al. [4], Zhao et al. [5] and Gao et al. [6].
It is well known that kernel based methods have proven to be a powerful tool for
solving classification and regression problems. Generally, kernels are chosen by pre-
defining a kernel function (Gaussian, polynomial etc.) and then adjusting the kernel
parameters by means of a tuning procedure. The classifier’s/regressor’s performance
on a subset of the training data, commonly referred to as the validation set, is usually
the main criterion for choosing a kernel. But this kernel selection procedure is mostly ad hoc and can be computationally expensive.
In recent years, several authors have proposed the use of a kernel that is obtained as an 'optimal' non-negative linear combination of finitely many 'basic' kernels.
The kernel so chosen is termed an 'optimal kernel'. This optimal kernel is determined as part of the training process itself, rather than through a separate tuning step.
Let us recall our discussion on kernel TWSVM formulation from Chap. 3. Let the
patterns to be classified be denoted by a set of m row vectors Xi , (i = 1, 2, . . . , m)
in the n-dimensional real space Rn , and let yi ∈ {1, −1} denote the class to which the
ith pattern belongs. Matrices A and B represent data points belonging to classes +1
and −1, respectively. Let the number of patterns in classes +1 and −1 be given by
m1 and m2 , respectively. Therefore, the matrices A and B are of sizes (m1 × n) and
(m2 × n), respectively.
In order to obtain the nonlinear classifiers, we consider the following kernel generated surfaces

K(x^T, C^T) u1 + b1 = 0   and   K(x^T, C^T) u2 + b2 = 0,   (7.1)

where C^T = [A  B]^T. The corresponding primal problems are

(KTWSVM1)   Min_(u1, b1, q1)   (1/2) ||K(A, C^T) u1 + e1 b1||^2 + c1 e2^T q1
subject to
−(K(B, C^T) u1 + e2 b1) + q1 ≥ e2,
q1 ≥ 0,   (7.2)

with dual

(KDTWSVM1)   Max_α   e2^T α − (1/2) α^T R (S^T S)^{-1} R^T α
subject to
0 ≤ α ≤ c1,   (7.3)

where

S = [K(A, C^T)   e1],   R = [K(B, C^T)   e2],   (7.4)

and the augmented vector z^(1) = [u1; b1] is obtained as

z^(1) = −(S^T S)^{-1} R^T α.   (7.5)

Similarly, the second primal problem is

(KTWSVM2)   Min_(u2, b2, q2)   (1/2) ||K(B, C^T) u2 + e2 b2||^2 + c2 e1^T q2
subject to
(K(A, C^T) u2 + e1 b2) + q2 ≥ e1,
q2 ≥ 0,   (7.6)

with dual

(KDTWSVM2)   Max_γ   e1^T γ − (1/2) γ^T L (N^T N)^{-1} L^T γ
subject to
0 ≤ γ ≤ c2,   (7.7)

where

L = [K(A, C^T)   e1],   N = [K(B, C^T)   e2],   (7.8)

and

z^(2) = [u2; b2] = (N^T N)^{-1} L^T γ.   (7.9)
The rectangular kernel technique of [13] can be applied to reduce the dimensionality of the kernel matrix, i.e. we can replace K(A, C^T) by a rectangular kernel of the
type K(A, Q^T) of size m1 × t, where Q is a t × n random submatrix of C, with t << m. In fact, t may be as small as 0.01 m. Thus, the size of the kernel matrix becomes
m1 × t, thereby requiring the inversion of a matrix of order (t + 1) × (t + 1) only.
The optimization problems (KDTWSVM1) and (KDTWSVM2) can be rewritten as

(KDTWSVM1)   Max_α   e2^T α − (1/2) α^T T^(1) α
subject to
0 ≤ α ≤ c1,

and

(KDTWSVM2)   Max_γ   e1^T γ − (1/2) γ^T T^(2) γ
subject to
0 ≤ γ ≤ c2,

where the matrices T^(1) and T^(2) are defined by T^(1) = R(S^T S + I)^{-1} R^T and T^(2) = L(N^T N + I)^{-1} L^T, respectively. Here the matrices R and S, and L and N,
are as defined in (7.4) and (7.8), respectively. Given a kernel such as a Gaussian, or a polynomial, we compute T^(1) and T^(2) and solve (KDTWSVM1) and (KDTWSVM2),
respectively, to obtain the kernel generated surfaces (7.1).
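A sketch (ours) of assembling T^(1) and T^(2) for a given kernel function is shown below; the RBF helper and its bandwidth are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(X, Y, sigma=1.0):
    """Gaussian (RBF) kernel matrix K(X, Y^T)."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def dual_kernel_matrices(A, B, kernel=rbf_kernel):
    """Build T1 = R (S^T S + I)^{-1} R^T and T2 = L (N^T N + I)^{-1} L^T with
    S = L = [K(A, C^T)  e1] and R = N = [K(B, C^T)  e2], where C stacks A and B."""
    C = np.vstack([A, B])
    KA = kernel(A, C)                                   # m1 x m block
    KB = kernel(B, C)                                   # m2 x m block
    S = np.hstack([KA, np.ones((A.shape[0], 1))])       # S and L share the same blocks
    R = np.hstack([KB, np.ones((B.shape[0], 1))])       # R and N share the same blocks
    L, N = S, R
    T1 = R @ np.linalg.inv(S.T @ S + np.eye(S.shape[1])) @ R.T   # m2 x m2
    T2 = L @ np.linalg.inv(N.T @ N + np.eye(N.shape[1])) @ L.T   # m1 x m1
    return T1, T2
```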
Let us now suppose that instead of the pair of kernels T^(1) and T^(2) being defined by a single kernel function such as a Gaussian or a polynomial, an optimal pair of
kernels J^(1) and J^(2) corresponding to (KDTWSVM1) and (KDTWSVM2), respectively, are chosen as follows

J^(1) = Σ_{j=1}^{p} μ_j^(1) T_j^(1);   J^(2) = Σ_{j=1}^{p} μ_j^(2) T_j^(2),

where μ_j^(1), μ_j^(2) ≥ 0. Here, similar to T^(1) and T^(2), the T_j^(1) (and T_j^(2)) are constructed from the p basic kernel matrices Kj, (j = 1, 2, . . . , p).
As pointed out in Lanckriet et al. [7], the set of basic kernels can be seen as a predefined set of initial guesses of the kernel matrices. We also note that the set of
basic kernels could contain very different kernel matrix models, e.g., linear, Gaussian, or polynomial kernels, or kernels with different hyperparameters.
Taking motivation from Fung et al. [10], we present an iterative alternating algorithm, termed A-TWSVM, for the determination of the optimal pair of kernel matrices
J^(1) and J^(2). The method alternates between optimizing the decision variables α and γ for fixed weights, and optimizing the weights μ_j^(1), μ_j^(2) for fixed α and γ.
For fixed weights, the first problem is

(ATWSVM1)   Max_α   e2^T α − (1/2) α^T ( Σ_{j=1}^{p} μ_j^(1) T_j^(1) ) α
subject to
0 ≤ α ≤ c1,   (7.12)
and

(ATWSVM2)   Max_γ   e1^T γ − (1/2) γ^T ( Σ_{j=1}^{p} μ_j^(2) T_j^(2) ) γ
subject to
0 ≤ γ ≤ c2.   (7.13)
The initial values of μ_j^(i), i = 1, 2, j = 1, 2, . . . , p, are set to one. Once the values of α and γ are known, z^(1) and z^(2) are obtained from (7.5) and (7.9), and
the classifier (7.1) is determined. With the values of α and γ as obtained from (7.12) and (7.13) respectively, the second part of the algorithm then finds the optimal
weights μ_j^(1), μ_j^(2) ≥ 0, j = 1, 2, . . . , p, by solving the pair of optimization problems (7.14)–(7.15):
(ATWSVM3)   Min_{e^T μ1 = p, μ1 ≥ 0}   (1/2) μ1^T μ1 + (1/2) Σ_{j=1}^{p} μ_j^(1) α^T T_j^(1) α,   (7.14)

and
(ATWSVM4)   Min_{e^T μ2 = p, μ2 ≥ 0}   (1/2) μ2^T μ2 + (1/2) Σ_{j=1}^{p} μ_j^(2) γ^T T_j^(2) γ.   (7.15)
The convergence of this alternating optimization (AO) scheme follows from a theorem of Bezdek and Hathaway [12]. This theorem states that if the objective function is a
strictly convex function on its domain that has a minimizer at which the Hessian of the function is continuous and positive definite, then AO will converge q-linearly
to the minimizer using any initialization.
The performance of A-TWSVM was assessed by using the pair of kernel matrices
Tj = (Tj(1) , Tj(2) ) (j = 1, 2, 3), and with the optimal pair of kernel matrices (J(1) , J(2) ).
The kernel matrices T1 , T2 and T3 are respectively defined by the linear kernel K1 ,
the Gaussian kernel K2 and the polynomial kernel of degree two, namely, K3 . In the
experiments, we have considered two sets of combinations of basic kernels. In the
first case the pair of kernels is obtained as a non-negative linear combination of kernel
matrices T1 , T2 and T3 , whereas in the second case the pair of kernels is obtained
as a non-negative linear combination of Gaussian kernels only, but with different
variances (σ). Thus, for the first case the pair is (J_123^(1), J_123^(2)), whereas for the second case the pair is (J_2^(1), J_2^(2)). Here the suffixes indicate that in
the first case all three kernels are used, whereas in the second case only Gaussian kernels are used.
While implementing an individual kernel, the optimal value of the kernel parameter is obtained by following the standard tuning procedure, i.e. we have taken 10 % of
the dataset for finding the optimal value of the parameter. Further, while implementing A-TWSVM for finding the optimal pair of Gaussian kernels, i.e. (J_2^(1), J_2^(2)),
we have chosen the variances of the Gaussian kernels in the neighborhood of σ = σ1. Here, σ1 has been obtained by tuning while implementing TWSVM with the individual
kernel K2. Thus, the variance of one Gaussian is σ1 and the variances of the other two Gaussians are chosen randomly from an interval around σ1. Similarly, while
implementing A-TWSVM for (J_123^(1), J_123^(2)), the variance for K2 is σ1. For all the examples, the interval [0.1, 2] suffices. Further, while implementing
(ATWSVM3)–(ATWSVM4), the kernel matrices are normalized using
T_st^(i) = T^(i)(x_s, x_t) / √( T^(i)(x_s, x_s) T^(i)(x_t, x_t) ),   (i = 1, 2), (s, t = 1, 2, . . . , m),   (7.16)

to ensure that the kernels T_j^(1) and T_j^(2) (j = 1, 2, 3) have trace equal to m. This in turn ensures the validity of the constraints e^T μ1 = 3 and e^T μ2 = 3.
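A sketch (ours) of this normalization step; as read from (7.16), it is the usual cosine normalization, which sets the diagonal of each kernel matrix to one.

```python
import numpy as np

def normalize_kernel(K):
    """Cosine-normalize a kernel matrix: K_st / sqrt(K_ss * K_tt).
    The diagonal becomes 1, so the trace equals the number of samples m."""
    d = np.sqrt(np.diag(K))
    Kn = K / np.outer(d, d)
    assert np.isclose(np.trace(Kn), K.shape[0])
    return Kn
```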
The optimal values of μ_j^(1), μ_j^(2) ≥ 0, j = 1, 2, . . . , p, in A-TWSVM are obtained by solving the problem iteratively. Using these optimal values of μ_j^(1), μ_j^(2),
a pair of hyperplanes is built and then the performance on the testing set is evaluated. In our case p = 3.
The proposed method was evaluated on five datasets namely Ionosphere, Heart
statlog, German Credit, Hepatitis and Sonar chosen from the UCI Machine Learning
Repository (Blake and Merz [14]), and the test set accuracy was determined by
following the standard ten-fold cross-validation methodology (Duda et al. [15]). The
hardware consisted of a PC with an Intel P4 processor (3 GHz) and 1 GB RAM.
The first dataset considered is the Ionosphere dataset. This data classifies radar returns as "good" or "bad", signifying whether or not evidence of structure in the
ionosphere is observed. There are a total of 351 patterns, of which 224 are "good" and 127 are "bad", each having 34 continuous attributes. The Heart-statlog dataset
has 270 instances; 150 patterns do not have heart disease, whereas in 120 the presence of heart disease is predicted. This dataset has 13 continuous attributes.
The German credit dataset is taken from the Institut für Statistik und Ökonometrie, Hamburg University. It has 13 attributes selected from the original 20 attributes.
There are a total of 1000 instances, of which 700 are "good" and 300 are "bad". The Hepatitis dataset is a binary dataset from the hepatitis domain, with each instance
having 19 attributes such as sex, age, steroid, antiviral, fatigue, bilirubin, albumin etc. that determine the disease. There are 155 instances, of which 32 are "die"
and 123 are "alive". The fifth dataset is the Sonar dataset. This is a two-class problem of identifying underwater targets ("rocks" or "mines") on the basis of
60 attributes. There are a total of 208 patterns (111 "mines" and 97 "rocks").
For the implementation of A-TWSVM we have fixed the maximum number of iterations to be 10. Thus, the algorithm terminates after 10 iterations, or if there is only a
small change (as per our tolerance of 10^{-2}) in the test set accuracy. The tables indicate the (mean ± standard deviation) performance for each classifier on each dataset.
Tables 7.1, 7.2, 7.3, 7.4 and 7.5 summarize the performance of the optimal kernel
selection approach for twin support vector machines on some benchmark datasets
available at the UCI machine learning repository. In practice we found that our algorithm converges in 3–4 iterations on all the datasets. The average run time (in
seconds) is also reported in the tables along with the average number of iterations. In each table, μ1 and μ2 are the weights in the optimal non-negative linear
combinations of kernels in (J_123^(1), J_123^(2)) and (J_2^(1), J_2^(2)), respectively.
From the tables, we observe that the sum of the μ_j^(i), (i = 1, 2); (j = 1, 2, 3), is three, and that the kernels corresponding to larger values of μ_j^(i) are more
important than the ones with smaller values. This information may play an important role while learning the structure of the data. Further, we conclude that for a given
classification problem the optimally combined kernel generally yields better test set accuracy than the individual kernels.
From Table 7.1, we conclude that the combination of three Gaussians improves the test set accuracy for the Ionosphere dataset as compared to the linear, Gaussian and
polynomial kernels. The same conclusion can be drawn for the Heart-statlog, German Credit, Hepatitis and Sonar datasets. Further, from the tables we observe that when
the two classes have different structures, two different kernels are used to capture the structure of the data. This information can be read from the values of
μ_j^(i), (i = 1, 2); (j = 1, 2, 3). For example, in Table 7.4, when the three different kernels are combined for the Hepatitis dataset, for class 1 the polynomial kernel
of degree 2 dominates the linear and the Gaussian, while for class −1 the Gaussian dominates
Table 7.1 Percentage test set accuracy on the Ionosphere dataset
Kernel (T1^(1), T1^(2)) (T2^(1), T2^(2)) (T3^(1), T3^(2)) (J123^(1), J123^(2))* (J2^(1), J2^(2))**
TSA 89.17 ± 4.58 93.14 ± 4.28 89.17 ± 5.25 93.15 ± 4.82 94.29 ± 5.11
μ1 – – – (0, 3, 0) (0.57, 1.00, 1.43)
μ2 – – – (0, 3, 0) (0.41, 0.85, 1.74)
Iterations – – – 2 ± 0 2.5 ± 1.75
Time (s) 1.16 1.20 1.51 3.22 3.85
* (J123^(1), J123^(2)) are obtained as a non-negative linear combination of the kernel matrices T1, T2 and T3.
** (J2^(1), J2^(2)) are obtained as a non-negative linear combination of Gaussians with different hyperparameters
Table 7.2 Percentage test set accuracy on the Heart-statlog dataset
Kernel (T1^(1), T1^(2)) (T2^(1), T2^(2)) (T3^(1), T3^(2)) (J123^(1), J123^(2))* (J2^(1), J2^(2))**
TSA 82.96 ± 4.44 81.85 ± 5.35 81.11 ± 6.07 83.47 ± 4.61 83.70 ± 6.46
μ1 – – – (0, 2.58, 0.42) (1.03, 1.01, 0.96)
μ2 – – – (0, 3, 0) (1.04, 1.02, 0.94)
Iterations – – – 2.4 ± 0.4 2.1 ± 0.54
Time (s) 0.71 0.97 0.80 2.41 2.31
* (J123^(1), J123^(2)) are obtained as a non-negative linear combination of the kernel matrices T1, T2 and T3.
** (J2^(1), J2^(2)) are obtained as a non-negative linear combination of Gaussians with different hyperparameters
Table 7.3 Percentage test set accuracy on the German credit dataset
Kernel (T1^(1), T1^(2)) (T2^(1), T2^(2)) (T3^(1), T3^(2)) (J123^(1), J123^(2))* (J2^(1), J2^(2))**
TSA 70.40 ± 3.01 70.90 ± 4.95 70.20 ± 3.43 70.90 ± 4.95 71.80 ± 2.93
μ1 – – – (0, 3, 0) (0, 3, 0)
μ2 – – – (0, 3, 0) (0, 3, 0)
Iterations – – – 1.9 ± 0.3 2.00 ± 0.0
Time (s) 9.85 14.04 12.88 51.63 51.20
* (J123^(1), J123^(2)) are obtained as a non-negative linear combination of the kernel matrices T1, T2 and T3.
** (J2^(1), J2^(2)) are obtained as a non-negative linear combination of Gaussians with different hyperparameters
the other two kernels. Similar behavior is also observed when Gaussians with different hyperparameters are used for training the Hepatitis and Sonar datasets.
Table 7.4 Percentage test set accuracy on the Hepatitis dataset
Kernel (T1^(1), T1^(2)) (T2^(1), T2^(2)) (T3^(1), T3^(2)) (J123^(1), J123^(2))* (J2^(1), J2^(2))**
TSA 79.83 ± 8.27 79.17 ± 9.00 79.83 ± 7.31 79.83 ± 8.85 82.46 ± 8.89
μ1 – – – (0, 0.33, 2.67) (0.32, 0.17, 2.51)
μ2 – – – (0.67, 1.46, 0.87) (1, 1.01, 0.99)
Iterations – – – 2 ± 1.1 3.7 ± 0.9
Time (s) 0.30 0.28 0.35 0.95 1.89
* (J123^(1), J123^(2)) are obtained as a non-negative linear combination of the kernel matrices T1, T2 and T3.
** (J2^(1), J2^(2)) are obtained as a non-negative linear combination of Gaussians with different hyperparameters
Table 7.5 Percentage test set accuracy on the Sonar dataset
Kernel (T1^(1), T1^(2)) (T2^(1), T2^(2)) (T3^(1), T3^(2)) (J123^(1), J123^(2))* (J2^(1), J2^(2))**
TSA 65.49 ± 5.80 80.34 ± 7.31 72.23 ± 4.35 85.17 ± 6.61 85.88 ± 3.34
μ1 – – – (0, 2.67, 0.33) (0, 3, 0)
μ2 – – – (0.7, 2.89, 0.4) (0, 0, 3)
Iterations – – – 2 ± 0 2 ± 0.5
Time (s) 1.17 2.47 2.32 5.60 6.24
* (J123^(1), J123^(2)) are obtained as a non-negative linear combination of the kernel matrices T1, T2 and T3.
** (J2^(1), J2^(2)) are obtained as a non-negative linear combination of Gaussians with different hyperparameters
7.3 Knowledge Based Twin Support Vector Machines and Variants

Classification with prior knowledge has now become an important field of research, e.g. the classification of medical datasets where only expert knowledge is prescribed
and no training data is available. In an earlier work of Fung et al. [16], prior knowledge in the form of multiple polyhedral sets, each belonging to one of the two
classes, has been introduced into a reformulation of the linear support vector machine classifier. Fung et al. [16] have discussed a 1-norm formulation of the knowledge
based classifier, which they termed the Knowledge Based SVM (KBSVM) classifier. Here, with the use of theorems of the alternative (Mangasarian [17]), the polyhedral
knowledge sets are reformulated into sets of inequalities, with which prior knowledge can be embedded into the linear programming formulation. Working on the lines of
Fung et al. [16], Khemchandani et al. [18] have proposed an extremely simple and fast knowledge based proximal SVM (KBPSVM) classifier in the light of the PSVM
formulation. They have used a 2-norm approximation of the knowledge sets and slack variables in their formulation instead of the 1-norm used in KBSVM. Experimental
results presented in [18] show that KBPSVM performs better than KBSVM in terms of both generalization and training speed.
Khemchandani et al. [18] proposed KBPSVM, where they incorporated prior knowledge, represented by polyhedral sets, into the linear PSVM. Consider the problem of binary
classification wherein a linearly inseparable data set of m points in the real n-dimensional space of features is represented by the matrix X ∈ R^{m×n}. The corresponding
target or class of each data point Xi, i = 1, 2, . . . , m, is represented by a diagonal matrix D ∈ R^{m×m} with entries Dii equal to +1 or −1. Let us assume that prior
information in the form of the following knowledge sets is given for the two classes: {x | Hx ≤ h} belonging to class +1 and {x | Gx ≤ g} belonging to class −1, where
the rows H^i and G^i of H and G are (1 × n) row vectors, and h and g are the corresponding right-hand-side vectors. Given the above problem, the aim of KBPSVM is to
determine the hyperplane described by w^T x + b = 0, which lies midway between the two parallel proximal hyperplanes given by w^T x + b = +1 and w^T x + b = −1.
KBPSVM obtains this hyperplane by solving the following optimization problem.
Min_(w, b)   (1/2)(w^T w + b^2) + (c1/2) ||q||^2 + (c2/2)( ||r1||^2 + r2^2 ) + (c3/2)( ||s1||^2 + s2^2 ) + (c4/2)( ||u||^2 + ||v||^2 )
subject to
D(Xw + eb) + q = e,
H^T u + w = r1,
h^T u + b + 1 = r2,
G^T v − w = s1,
g^T v − b + 1 = s2,   (7.17)
where
⎡ ⎤
I
⎢DPP D + c1 −DP
T
DP ⎥
⎢ ⎥
=⎢ ⎢ I I ⎥
U T
P D MM + (1 + )I
T
−I ⎥,
⎢ c4 c2 ⎥
⎣ I I ⎦
−P D T
−I NN + (1 + )I
T
c4 c3
and z = [w; b] = P^T Dα + β − σ, where P = [X  e], M = [H  h], N = [G  g], e1 = [0e; 1] ∈ R^{(n+1)×1}, α ∈ R^m, β ∈ R^{n+1}, σ ∈ R^{n+1}.
Thus, the solution of KBPSVM can be obtained by solving a single system of linear equations of order (m + 2(n + 1)). However, using a matrix partition method and the
Sherman-Morrison-Woodbury (SMW) formula (Golub and Van Loan [19]), an efficient way of computing the desired solution has also been proposed in Khemchandani et al. [18].
This solution requires three small inverses of order (n + 1) each, instead of a single large (m + 2(n + 1)) matrix inverse. This solution is also advantageous in the
sense that the order of the matrices to be inverted is independent of the number of data points m and depends only on the dimension n. This improves the training speed
of KBPSVM significantly, as usually n << m. Khemchandani et al. [18] have also empirically compared the proposed KBPSVM approach against KBSVM with 1-norm (KBSVM 1),
KBSVM with 2-norm (KBSVM 2), conventional SVM and PSVM on two selected data sets, WPBC and Cleveland Heart, respectively.
Before proceeding to KBTWSVM/KBLSTSVM, we will show an alternate way of solving QPP (7.17), which may turn out to be advantageous in many cases. The
solution of the system (7.18) was obtained in Khemchandani et al. [18] by construct-
ing the Lagrangian function and K.K.T conditions; here we will derive an alternate
solution by direct substitution. To obtain the alternate solution we will substitute all
the constraints of QPP (7.17) into the objective function. This will lead us to an
unconstrained QPP as shown below
Min_(w, b, u, v)   (1/2)(w^T w + b^2) + (c1/2) ||D(Xw + eb) − e||^2 + (c2/2)( ||H^T u + w||^2 + (h^T u + b + 1)^2 )
   + (c3/2)( ||G^T v − w||^2 + (g^T v − b + 1)^2 ) + (c4/2)( ||u||^2 + ||v||^2 ).   (7.19)
Setting the gradient of (7.19) with respect to w, b, u and v to zero gives a set of linear equations in these variables. Let us rearrange them in matrix form to yield the
alternate solution. For this, let the block columns Col1, Col2, Col3 and Col4 be denoted as follows:

Col1 = [ (1 + c2 + c3)I + c1 X^T X;   −c1 e^T X;   c2 H;   −c3 G ],
Col2 = [ −c1 X^T e;   (1 + c2 + c3 + c1 e^T e);   c2 h;   −c3 g ],
Col3 = [ c2 H^T;   c2 h^T;   c4 I + c2 hh^T + c2 HH^T;   0ee^T ],
Col4 = [ −c3 G^T;   −c3 g^T;   0ee^T;   c4 I + c3 gg^T + c3 GG^T ].

Then

[ Col1  Col2  Col3  Col4 ] [w; b; u; v] = [ c1 X^T De;   −(c1 e^T De + c2 − c3);   −c2 h;   −c3 g ].   (7.20)
While solution (7.18) requires inverses of order (n + 1), the advantage in using (7.20) is that often, in practical cases, l, k << n. The advantage is particularly apparent
under the scenario of an increasing number of knowledge sets. For example, consider including a new class +1 knowledge set {x | H1 x ≤ h1}, H1 ∈ R^{l1×n}, into the
KBPSVM formulation (7.17). Then, the new solution of type (7.18) will require an additional inverse of order (n + 1) together with the original three inverses of order
(n + 1). A solution of type (7.20) will need an additional inverse of order l1 together with the original three inverses. In other words, for every additional knowledge
set included in the KBPSVM formulation, solution (7.18) requires the same number of additional inverses of order (n + 1), whereas solution (7.20) needs additional
inverses whose order depends upon the number of constraints in the additional knowledge set (which is usually smaller than n). We will describe our KBLSTSVM algorithm
in the light of this alternate solution.
Now, let us consider the problem of incorporating two knowledge sets, {x | Hx ≤ h} belonging to class +1 and {x | Gx ≤ g} belonging to class −1 (H ∈ R^{l×n}, h ∈ R^l,
G ∈ R^{k×n} and g ∈ R^k), into the TWSVM formulation; the generalization to multiple knowledge sets is straightforward. First, we will discuss the incorporation of these
knowledge sets into (TWSVM1); the incorporation into (TWSVM2) will be on similar lines. The idea here is to use the fact that in each QPP of TWSVM, the objective
function corresponds to a particular class and the constraints are determined by patterns of the other class. Following Fung et al. [16] and Khemchandani et al. [18],
the knowledge set {x | Gx ≤ g} can be incorporated into the QPP for (TWSVM1) as

Min_(w1, b1)   (1/2) ||Aw1 + eb1||^2 + c1 e^T q1 + d1 (e^T r1 + p1)
subject to
−(Bw1 + eb1) + q1 ≥ e,   q1 ≥ 0e,
−r1 ≤ G^T u1 − w1 ≤ r1,   u1 ≥ 0e,   r1 ≥ 0e,
g^T u1 − b1 + 1 ≤ p1,   p1 ≥ 0.   (7.21)
Further, requiring the class +1 knowledge set {x | Hx ≤ h} to lie close to the hyperplane w1^T x + b1 = 0 leads to the following modified QPP:

Min_(w1, b1)   (1/2) ||Aw1 + eb1||^2 + c1 e^T q1 + d1 (e^T r1 + p1) + f1 (z1 + e^T s1 + e^T y1)
subject to
−(Bw1 + eb1) + q1 ≥ e,   q1 ≥ 0e,
−r1 ≤ G^T u1 − w1 ≤ r1,   u1 ≥ 0e,   r1 ≥ 0e,
g^T u1 − b1 + 1 ≤ p1,   p1 ≥ 0,
−s1 ≤ H^T v1 + w1 ≤ s1,   v1 ≥ 0e,   s1 ≥ 0e,
h^T v1 + b1 ≤ z1,
−y1 ≤ H^T t1 − w1 ≤ y1,   t1 ≥ 0e,   y1 ≥ 0e,
h^T t1 − b1 ≤ z1,   z1 ≥ 0.   (7.22)
In an exactly similar way, the knowledge sets can be incorporated into the QPP for (TWSVM2), and the following modified QPP can be obtained

Min_(w2, b2)   (1/2) ||Bw2 + eb2||^2 + c2 e^T q2 + d2 (e^T r2 + p2) + f2 (z2 + e^T s2 + e^T y2)
subject to
(Aw2 + eb2) + q2 ≥ e,   q2 ≥ 0e,
−r2 ≤ H^T u2 + w2 ≤ r2,   u2 ≥ 0e,   r2 ≥ 0e,
h^T u2 + b2 + 1 ≤ p2,   p2 ≥ 0,
−s2 ≤ G^T v2 + w2 ≤ s2,   v2 ≥ 0e,   s2 ≥ 0e,
g^T v2 + b2 ≤ z2,
−y2 ≤ G^T t2 − w2 ≤ y2,   t2 ≥ 0e,   y2 ≥ 0e,
g^T t2 − b2 ≤ z2,   z2 ≥ 0.   (7.23)
Together, QPPs (7.22) and (7.23) define our final KBTWSVM formulation, which
is capable of producing linear non-parallel hyperplanes w1T x + b1 = 0 and w2T x +
b2 = 0 from real data (A, B) and prior knowledge {x | Hx ≤ h}, {x | Gx ≤ g}.
Here c1 , c2 , d1 , d2 , f1 , f2 > 0 are trade-off parameters and q1 , q2 , r1 , r2 , p1 , p2 , z1 , z2 ,
s1 , s2 , y1 , y2 are slack variables and u1 , u2 , v1 , v2 , t1 , t2 are variables introduced in
the process of incorporating polyhedral knowledge sets using Farkas Theorem (Man-
gasarian [17]). It is easy to see that our KBTWSVM formulation simplifies to
TWSVM under the condition d1 , d2 , f1 , f2 = 0. Also, if we would like to give equal
significance to data and prior knowledge we can set d1 = c1 , d2 = c2 , and f1 , f2 = 1.
Following Jayadeva et al. [20], dual QPPs of (7.22) and (7.23) can be derived to obtain
the solution of KBTWSVM. However, considering the fact that LSTWSVM has been
shown to be significantly faster in training than TWSVM without any compromise
in accuracy (Kumar et al. [3]), we will use the above KBTWSVM formulation to
obtain KBLSTSVM in the next section and proceed to computational experiments
with KBLSTSVM in subsequent sections.
In this section, following Fung et al. [16] and Kumar et al. [3], we will take equalities in the primal QPPs (7.22) and (7.23) instead of inequalities to get our KBLSTSVM
formulation. The new QPPs of the KBLSTSVM formulation are shown in (7.24) and (7.25).
Min_(w1, b1)   (1/2) ||Aw1 + eb1||^2 + (c1/2) ||q1||^2 + (d1/2)( ||r1||^2 + p1^2 + ||u1||^2 ) + (f1/2)( z1^2 + ||s1||^2 + ||y1||^2 + ||v1||^2 + ||t1||^2 )
subject to
− (Bw1 + eb1 ) + q1 = e,
GT u1 − w1 = r1 ,
gT u1 − b1 + 1 = p1 ,
H T v1 + w1 = s1 ,
hT v1 + b1 = z1 ,
H T t1 − w1 = y1 ,
hT t1 − b1 = z1 . (7.24)
and

Min_(w2, b2)   (1/2) ||Bw2 + eb2||^2 + (c2/2) ||q2||^2 + (d2/2)( ||r2||^2 + p2^2 + ||u2||^2 ) + (f2/2)( z2^2 + ||s2||^2 + ||y2||^2 + ||v2||^2 + ||t2||^2 )
subject to
Aw2 + eb2 + q2 = e,
H T u2 + w2 = r2 ,
hT u2 + b2 + 1 = p2 ,
GT v2 + w2 = s2 ,
gT v2 + b2 = z2 ,
GT t2 − w2 = y2 ,
gT t2 − b2 = z2 . (7.25)
By substituting the constraints into the objective function of QPP (7.24) we get

Min_(w1, b1, u1, v1, t1)   (f1/2)[ (1/4)(h^T t1 + h^T v1)^2 + ||H^T v1 + w1||^2 + ||H^T t1 − w1||^2 + ||v1||^2 + ||t1||^2 ]
   + (d1/2)( ||G^T u1 − w1||^2 + (g^T u1 − b1 + 1)^2 + ||u1||^2 ) + (1/2) ||Aw1 + eb1||^2 + (c1/2) ||−(Bw1 + eb1) + e||^2.   (7.26)
where E = [E1  E2  E3  E4  E5], with block columns

E1 = [ A^T A + c1 B^T B + (d1 + 2f1)I;   −e^T A − c1 e^T B;   −d1 G;   f1 H;   −f1 H ],
E2 = [ A^T e + c1 B^T e;   m1 − c1 m2 + d1;   −d1 g;   0e;   0e ],
E3 = [ −d1 G^T;   −d1 g^T;   d1 ( gg^T + GG^T + I );   0ee^T;   0ee^T ],
E4 = [ f1 H^T;   0e^T;   0ee^T;   f1 ( hh^T/4 + HH^T + I );   f1 hh^T/4 ],
E5 = [ −f1 H^T;   0e^T;   0ee^T;   f1 hh^T/4;   f1 ( hh^T/4 + HH^T + I ) ].
Then the solution of QPP (7.24) is

E [w1; b1; u1; v1; t1] = [ −c1 B^T e;   d1 + c1 m2;   −d1 g;   0e;   0e ].   (7.27)
where F = [F1  F2  F3  F4  F5], with block columns

F1 = [ B^T B + c2 A^T A + (d2 + 2f2)I;   −(e^T B + c2 e^T A);   −d2 H;   f2 G;   −f2 G ],
F2 = [ −(B^T e + c2 A^T e);   m2 + c2 m1 + d2;   −d2 h;   0e;   0e ],
F3 = [ d2 H^T;   −d2 h^T;   d2 ( hh^T + HH^T + I );   0ee^T;   0ee^T ],
F4 = [ f2 G^T;   0e^T;   0ee^T;   f2 ( gg^T/4 + GG^T + I );   f2 gg^T/4 ],
F5 = [ −f2 G^T;   0e^T;   0ee^T;   f2 gg^T/4;   f2 ( gg^T/4 + GG^T + I ) ].
In an exactly similar way the solution of QPP (7.25) can be derived to be

F [w2; b2; u2; v2; t2] = [ −c2 A^T e;   −d2 − c2 m1;   −d2 h;   0e;   0e ].   (7.28)
The first dataset used was the WPBC dataset, with prior knowledge expressed over two features, namely tumor size (T) (feature 31) and lymph node status (L) (feature 32).
Tumor size is the diameter of the excised tumor in centimeters, and lymph node status refers to the number of metastasized axillary lymph nodes. The rules are given by:
The second dataset used was the Cleveland heart dataset (Blake and Merz [14]), using a less than 50 % cutoff diameter narrowing for predicting the presence or absence of
heart disease. The knowledge sets consist of rules constructed (for the sake of illustration) and used in Fung et al. [16] and Khemchandani et al. [18] with two features,
oldpeak (feature 10) and thal (T) (feature 13). The rules are given by:
7.4 Support Tensor Machines: A Brief Review

In the machine learning community, high dimensional image data with many attributes are often encountered in real applications. The representation and selection of
the features have a strong effect on the classification performance. Thus,
how to efficiently represent the image data is one of the fundamental problems in
classifier model design. It is worth noting that most of existing classification algo-
rithms are oriented to vector space model (VSM), e.g. support vector machine (SVM)
(Cristianini and Shawe-Taylor [22]; Vapnik [23]), proximal support vector machine
classifier (PSVM) (Fung and Mangasarian [13]), twin support vector machine for
pattern classification (TWSVM) (Jayadeva et al. [20]).
SVM relies on a vector dataset, takes vector data x in space Rn as inputs, and
aims at finding a single linear (or nonlinear) function. Thus, if SVM are applied for
image classification, images are commonly represented as long vectors in the high-
dimensional vector space, in which each pixel of the images corresponds to a feature
or dimension. For example, when VSM-focused methods are applied to an image with 64 × 64 pixels, the image x will be represented as a long vector x̂ with dimension
n = 4,096. In such cases, learning such a linear SVM function w^T x̂ + b in vector space is time-consuming, where w ∈ Rn and b are
parameters to be estimated. And most importantly, such a vector representation fails
to take into account the spatial locality of pixels in the image (Wang et al. [24]; Zhang
and Ye [25]).
Images are intrinsically matrices. To represent the images appropriately, it is
important to consider transforming the vector patterns to the corresponding matrix patterns, or second order tensors, before classification. In recent years, there has
been growing interest in tensor representations. Specifically, some tensor representation based approaches (Fu and Huang [26]; He et al. [27]; Yan et al. [28]) are
sentation based approaches (Fu and Huang [26]; He et al. [27]; Yan et al. [28]) are
proposed for high-dimensional data analysis. A tensor-based learning framework for
SVMs, termed as support tensor machines (STMs), was recently proposed by Cai et
al. [4], and Tao et al. [29, 30], which directly accepts tensors as inputs in the learning
model, without vectorization. Tao et al. [29, 30] have discussed tensor version of
various classification algorithms available in the literature. The use of tensor repre-
sentation in this way helps overcome the overfitting problem encountered mostly in
vector-based learning.
The set of all k-tensors on {R^{n_i}; i = 1, . . . , k}, denoted by T^k, is a vector space under the usual operations of pointwise addition and scalar multiplication.
Given S ∈ T^k and T ∈ T^l, their tensor product S ⊗ T ∈ T^{k+l} is defined by

S ⊗ T(a1, . . . , a_{k+l}) = S(a1, . . . , a_k) T(a_{k+1}, . . . , a_{k+l}).   (7.33)

In particular, {ε^i ⊗ ε̂^j} forms a basis of R^{n1} ⊗ R^{n2}, so every order-2 tensor T = Σ_{i,j} T_{ij} ε^i ⊗ ε̂^j in R^{n1} ⊗ R^{n2} uniquely corresponds to an
n1 × n2 matrix. Given two vectors a = Σ_{k=1}^{n1} a_k e_k ∈ R^{n1} and b = Σ_{l=1}^{n2} b_l ê_l ∈ R^{n2}, we have

T(a, b) = Σ_{i,j} T_{ij} ε^i ⊗ ε̂^j ( Σ_{k=1}^{n1} a_k e_k,  Σ_{l=1}^{n2} b_l ê_l ) = Σ_{i=1}^{n1} Σ_{j=1}^{n2} T_{ij} a_i b_j = a^T T b.
Different from Vector Space Model which considers input in the form of a vector
in Rn , Tensor Space Model considers the input variable as a second order tensor in
Rn1 ⊗ Rn2 , where n1 × n2 ≈ n. A linear classifier in Rn is represented as aT x + b,
in which there are n + 1 (≈ n1 × n2 + 1) parameters (b, ai , i = 1 . . . n). Similarly, a
linear classifier in the tensor space Rn1 ⊗ Rn2 can be represented as uT Xv + b where
u ∈ Rn1 and v ∈ Rn2 . Thus, there are only n1 + n2 + 1 parameters to be estimated.
This property makes tensor-based learning algorithms especially suitable for small
sample-size (S3) cases.
Problem Statement
Given a set of training samples {(Xi , yi ); i = 1, . . . , m}, where Xi is the data point
in order-2 tensor space, Xi ∈ Rn1 ⊗ Rn2 and yi ∈ {−1, 1} is the class associated with
Xi , STM finds a tensor classifier f (X) = uT Xv + b such that the two classes can be
separated with maximum margin.
Algorithm
STM is a tensor generalization of SVM. The algorithmic procedure is formally stated
below
Min_(v, b, ξ)   (1/2) β1 v^T v + C e^T ξ
subject to
y_i (v^T x_i + b) ≥ 1 − ξ_i,
ξ_i ≥ 0,   (i = 1, . . . , m).
Min_(u, b, ξ)   (1/2) β2 u^T u + C e^T ξ
subject to
y_i (u^T x̃_i + b) ≥ 1 − ξ_i,
ξ_i ≥ 0,   (i = 1, . . . , m).
The optimization problem in the tensor space is then reduced to the following
(TSO)   Min_(u, v, b, ξ)   (1/2) ||u v^T||^2 + C e^T ξ
subject to
y_i (u^T X_i v + b) ≥ 1 − ξ_i,
ξ_i ≥ 0,   (i = 1, . . . , m).   (7.39)
The Lagrangian of (TSO) is

L = (1/2) ||u v^T||^2 + C e^T ξ − Σ_i α_i y_i (u^T X_i v + b) + e^T α − α^T ξ − μ^T ξ.

Since

(1/2) ||u v^T||^2 = (1/2) trace(u v^T v u^T) = (1/2) (v^T v) trace(u u^T) = (1/2) (v^T v)(u^T u),

we have

L = (1/2) (v^T v)(u^T u) + C e^T ξ − Σ_i α_i y_i (u^T X_i v + b) + e^T α − α^T ξ − μ^T ξ.
Applying the K.K.T. sufficient optimality conditions (Mangasarian [17] and Chandra [31]), we obtain

u = ( Σ_i α_i y_i X_i v ) / (v^T v),   (7.40)
v = ( Σ_i α_i y_i X_i^T u ) / (u^T u),   (7.41)
α^T y = 0,   (7.42)
C − α_i − μ_i = 0,   (i = 1, . . . , m).   (7.43)
From Equations (7.40) and (7.41), we see that u and v are dependent on each other,
and cannot be solved independently. In the sequel, we describe a simple yet effective
computational method to solve this optimization problem.
We first fix u. Let β1 = ||u||^2 and x_i = X_i^T u. The optimization problem (TSO) can then be rewritten as follows

Min_(v, b, ξ)   (1/2) β1 ||v||^2 + C e^T ξ
subject to
y_i (v^T x_i + b) ≥ 1 − ξ_i,
ξ_i ≥ 0,   (i = 1, . . . , m).   (7.44)

Once v is obtained, let β2 = ||v||^2 and x̃_i = X_i v. Then u can be computed by solving

Min_(u, b, ξ)   (1/2) β2 ||u||^2 + C e^T ξ
subject to
y_i (u^T x̃_i + b) ≥ 1 − ξ_i,
ξ_i ≥ 0,   (i = 1, . . . , m).   (7.45)
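A sketch (ours) of this alternating scheme using scikit-learn's SVM solver is given below. The β factors in (7.44)–(7.45) only rescale the trade-off parameter, so a standard linear SVM can be reused for each subproblem; the function name and the fixed iteration count are our own assumptions.

```python
import numpy as np
from sklearn.svm import SVC

def stm_alternating(X_list, y, C=1.0, n_iter=10):
    """Alternate between the two SVM subproblems of STM: fix u and learn v
    via (7.44), then fix v and learn u via (7.45). X_list holds the n1 x n2
    matrix patterns and y the labels in {-1, +1}."""
    n1, _n2 = X_list[0].shape
    u = np.ones(n1)
    for _ in range(n_iter):
        # Fix u: project each matrix pattern onto u and train an SVM in R^{n2}.
        # Scaling by beta1 = ||u||^2 is equivalent to rescaling C.
        beta1 = u @ u
        Xv = np.array([Xi.T @ u for Xi in X_list])
        clf_v = SVC(kernel='linear', C=C / beta1).fit(Xv, y)
        v, b = clf_v.coef_.ravel(), clf_v.intercept_[0]
        # Fix v: project onto v and train an SVM in R^{n1}.
        beta2 = v @ v
        Xu = np.array([Xi @ v for Xi in X_list])
        clf_u = SVC(kernel='linear', C=C / beta2).fit(Xu, y)
        u, b = clf_u.coef_.ravel(), clf_u.intercept_[0]
    return u, v, b

# Decision rule for a new matrix pattern X: sign(u @ X @ v + b).
```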
Proof of Convergence
Define
f(u, v) = (1/2) ||u v^T||^2 + C e^T ξ,   (7.46)

where ξ is the slack vector and e is a vector of ones of appropriate dimension. Let u0 be the initial value. Fixing u0, we get v0 by solving the optimization problem
(7.44). Likewise, fixing v0, we get u1 by solving the optimization problem (7.45). Since the optimization problem of SVM is convex, its solution is globally optimal.
Specifically, the solutions of problems (7.44) and (7.45) are globally optimal. Thus, we have

f(u0, v0) ≥ f(u1, v0).
In matrix form, the SVM problem can be written as

Min_(W, b, ξ)   (1/2) ||W||_F^2 + C Σ_{k=1}^{n} ξ_k
subject to
y_k ( trace(W^T X_k) + b ) ≥ 1 − ξ_k,   ξ_k ≥ 0,   (k = 1, . . . , n),   (7.47)

where ||·||_F denotes the Frobenius norm. Consider the singular value decomposition W = U Σ V^T, where U ∈ R^{n1×n1} and V ∈ R^{n2×n2} are orthogonal matrices,
Σ = diag(σ1, σ2, . . . , σr), σ1 ≥ σ2 ≥ · · · ≥ σr > 0 and r = Rank(W). Denoting the columns of the matrices U and V by u_i and v_i, we can write

W = Σ_{i=1}^{r} σ_i u_i v_i^T.

Absorbing the singular values into the vectors, we write

W = Σ_{i=1}^{r} u_i v_i^T,

and the problem takes the form

Min_(u_i, v_i, b, ξ)   (1/2) Σ_{i=1}^{r} (u_i^T u_i)(v_i^T v_i) + C Σ_{k=1}^{n} ξ_k
subject to
y_k ( Σ_{i=1}^{r} u_i^T X_k v_i + b ) ≥ 1 − ξ_k,   (k = 1, . . . , n),
ξ_k ≥ 0,   (k = 1, . . . , n).   (7.48)

For r = 1 this reduces to the rank-one (STM) problem with constraints

y_i (u^T X_i v + b) ≥ 1 − ξ_i,   (i = 1, . . . , n),
ξ_i ≥ 0,   (i = 1, . . . , n).   (7.49)
Most of the development in this area has concentrated only on rank-one SVMs. Recently, however, some researchers have initiated work on Multiple Rank Support Tensor
Machines, which have also been termed Multiple Rank Support Matrix Machines for matrix data classification. In this context the work of Gao et al. [32] looks promising.
The above discussion suggests that the decision function of STM is f(X) = u^T X v + b, where X ∈ R^{n1×n2}, b ∈ R. If ⊗ denotes the Kronecker product operation, then
we can verify that

f(X) = u^T X v + b = (v ⊗ u)^T Vec(X) + b.

Here Vec(X) denotes an operator that vectorizes the matrix X ∈ R^{n1×n2} into a vector with (n1 × n2) components (x_{11}, . . . , x_{n1 1}; . . . ; x_{1 n2}, . . . , x_{n1 n2}),
obtained by stacking the columns of X. Now, comparing with the SVM classifier f(x) = w^T x + b, x ∈ R^n, and the STM classifier f(X) = u^T X v + b as obtained above,
it follows that the decision functions in both SVM and STM have the same form, and v ⊗ u in STM plays the same role as the vector w in SVM.
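This equivalence is easy to check numerically (a small illustration of ours; note that the column-stacking convention for Vec is what pairs with v ⊗ u).

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2 = 4, 6
u, v = rng.standard_normal(n1), rng.standard_normal(n2)
X = rng.standard_normal((n1, n2))

lhs = u @ X @ v                        # STM decision value (without the bias b)
w = np.kron(v, u)                      # plays the role of w in the SVM classifier
rhs = w @ X.flatten(order='F')         # Vec(X): the column-stacked entries of X
assert np.isclose(lhs, rhs)
```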
In a spirit similar to PSVM (Fung and Mangasarian [13]), Khemchandani et al. [2]
developed a least squares variant of STM in the tensor space model, which classifies
a new point to either of the two classes depending on its closeness to one of the
two parallel hyperplanes in the tensor space. By using a least squares error term
in the objective function, we reduce the inequality constraints of STM to equality
constraints, and proceeding along the lines of STM, we develop an iterative algorithm
which requires the solution of a system of linear equations at every iteration, rather
than solving a dual QPP at every step, as in the case of STM. This leads to the
formation of a fast tensor-based iterative algorithm, which requires the determination
of a significantly smaller number of decision variables as compared to vector-based
approaches. The above properties make (PSTMs) specially suited for overcoming
overfitting in Small Sample Size (S3) problems.
Given a set of training samples {(Xi , yi ); i = 1, . . . , m}, where Xi is the data point
in order-2 tensor space, Xi ∈ Rn1 ⊗ Rn2 and yi ∈ {−1, 1} is the class associated with
Xi , find a tensor classifier f (X) = uT Xv + b in the linear tensor space, such that the
two classes are proximal to either of the two parallel hyperplanes in the tensor space,
separated maximally using margin maximization. Here, u ∈ Rn1 , v ∈ Rn2 , and b ∈ R
are the decision variables to be determined by PSTM.
A linear classifier in the tensor space can be naturally represented as

f(X) = u^T X v + b,

which can be rewritten through the matrix inner product as (Golub and Van Loan [19])

f(X) = ⟨u v^T, X⟩ + b = trace(v u^T X) + b.

PSTM determines u, v and b by solving the following problem

Min_(u, v, b, q)   (1/2) ||u v^T||^2 + (1/2) b^2 + (C/2) q^T q
subject to
y_i (u^T X_i v + b) + q_i = 1,   (i = 1, . . . , m),   (7.50)

where C is the trade-off parameter between margin maximization and empirical error minimization. Applying the K.K.T. necessary and sufficient optimality conditions
(Mangasarian [17] and Chandra [31]) to the problem (7.50), we obtain
u = ( Σ_{i=1}^{m} α_i y_i X_i v ) / (v^T v),   (7.51)
v = ( Σ_{i=1}^{m} α_i y_i X_i^T u ) / (u^T u),   (7.52)
b = Σ_{i=1}^{m} α_i y_i,   (7.53)
q_i = α_i / C,   (i = 1, . . . , m).   (7.54)
We observe from (7.51) and (7.52) that u and v are dependent on each other, and
cannot be solved independently. Hence, we resort to the alternating projection method
for solving this optimization problem, which has earlier been used by Cai et al. [4]
and Tao et al. [29] in developing STM. The method can be described as follows.
We first fix u. Let β1 = ||u||2 and xi = Xi T u. Let D be a diagonal matrix where
Dii = yi . The optimization problem (7.50) can then be reduced to the following QPP
(v-PSTM)   Min_(v, b, q)   (1/2) β1 v^T v + (1/2) b^2 + (1/2) C q^T q
subject to
D(Xv + eb) + q = e.

The K.K.T. optimality conditions for (v-PSTM) give

β1 v = X^T Dα,   (7.55)
b = e^T Dα,   (7.56)
Cq = α.   (7.57)
Substituting the values of v, b and q from (7.55), (7.56) and (7.57) into the equality constraint of (v-PSTM) yields the following system of linear equations for
obtaining α:

( DXX^T D / β1 + Dee^T D + I/C ) α = e,

where I is an identity matrix of appropriate dimension. Let H_v = D[X/√β1   e] and G_v = (H_v H_v^T + I/C). Then

α = G_v^{-1} e.
Once α is obtained, v and b can be computed using (7.55) and (7.56) respectively. We observe that solving (v-PSTM) requires the inversion of an m × m matrix G_v, which
is non-singular due to the diagonal perturbation introduced by the I/C term. Optionally, when m >> n2, we can instead use the SMW formula (Golub and Van Loan [19]) to
invert a smaller (n2 + 1) × (n2 + 1) matrix for obtaining the value of α, saving some computational time. Applying the SMW identity gives the reduction

G_v^{-1} = C [ I − H_v ( I/C + H_v^T H_v )^{-1} H_v^T ].
Once v is obtained, let x̃_i = X_i v and β2 = ||v||^2. u can then be computed by solving the following optimization problem:

(u-PSTM)   Min_(u, b, p)   (1/2) β2 u^T u + (1/2) b^2 + (1/2) C p^T p
subject to
D(X̃u + eb) + p = e,

whose solution is obtained analogously as

β = G_u^{-1} e,

where G_u = (H_u H_u^T + I/C) and H_u = D[X̃/√β2   e]. Once β is obtained, u and b can be computed using the following equations:

β2 u = X̃^T Dβ,
b = e^T Dβ.
As in the case of STM, the alternating iterations satisfy f(u0, v0) ≥ f(u1, v0) ≥ f(u1, v1) ≥ · · ·, which guarantees convergence of the scheme. The resulting PSTM
algorithm can be summarized as follows.
Step 1 (Initialization)
Choose C > 0, a tolerance, and an initial u.
Step 2 (Computing v)
Given u, let β1 = ||u||^2 and x_i = X_i^T u, and solve

(v-PSTM)   Min_(v, b, q)   (1/2) β1 v^T v + (1/2) b^2 + (1/2) C q^T q
subject to
D(Xv + eb) + q = e.

Step 3 (Computing u)
Once v is obtained, let x̃_i = X_i v, X̃ = (x̃_1^T, x̃_2^T, . . . , x̃_m^T)^T and β2 = ||v||^2. u can then be computed by solving the following optimization problem:

(u-PSTM)   Min_(u, b, p)   (1/2) β2 u^T u + (1/2) b^2 + (1/2) C p^T p
subject to
D(X̃u + eb) + p = e.

Step 4 (Stopping rule)
If the changes in u and v between successive iterations are smaller than the chosen tolerance, we stop our iterative algorithm, and u_i, v_i and b_i are the required
optimal decision variables of PSTM. Iterate over Steps 2 and 3 otherwise.
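A sketch (ours) of one v-step of PSTM, in which α is obtained from a single linear solve α = G_v^{-1} e and v, b are then recovered from (7.55)–(7.56); the function name and default C are our own assumptions.

```python
import numpy as np

def pstm_v_step(X_list, y, u, C=0.5):
    """Given u, solve (v-PSTM): form x_i = X_i^T u, build G_v = H_v H_v^T + I/C
    with H_v = D[X/sqrt(beta1)  e], solve G_v alpha = e, and recover v and b."""
    m = len(X_list)
    beta1 = u @ u
    X = np.array([Xi.T @ u for Xi in X_list])        # m x n2 projected data matrix
    D = np.diag(y.astype(float))
    e = np.ones(m)
    Hv = D @ np.hstack([X / np.sqrt(beta1), e[:, None]])
    Gv = Hv @ Hv.T + np.eye(m) / C
    alpha = np.linalg.solve(Gv, e)                   # alpha = Gv^{-1} e
    v = (X.T @ D @ alpha) / beta1                    # (7.55): beta1 * v = X^T D alpha
    b = e @ D @ alpha                                # (7.56): b = e^T D alpha
    return v, b

# The u-step is symmetric: set x_tilde_i = X_i v, beta2 = ||v||^2, and repeat.
```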
Khemchandani et al. [2] compared the results of PSTM with another tensor-based
classification method, STM, on face detection and handwriting recognition datasets. In our presentation here, we restrict ourselves to the handwriting dataset only.
For the face detection dataset and other details we refer to Khemchandani et al. [2]. Since the datasets are mostly balanced in both cases, we consider percentage accuracy
at varying sizes of training set as a performance metric for comparison (Duda et al.
[15]; Vapnik [23]). For each size of the training set, we randomly select our training
images from the entire set and repeat the experiment 10 times, in a spirit similar to
(Cai et al. [4]; Hu et al. [33]). The mean and the standard deviation of the evaluation
metrics are then used for comparison. In all our simulations, the initial parameters
chosen for both PSTM and STM are: C = 0.5 and tolerance for convergence = 10−3 .
All the algorithms have been implemented in MATLAB7 (R2008a) on Ubuntu9.04
running on a PC with system configuration Intel Core2 Duo (1.8 GHz) with 1 GB of
RAM.
To provide a uniform testing environment to both PSTM and STM, the number
of iterations performed by both the methods are taken to be the same for every
simulation. Hence, for each train and test run, we first train the STM algorithm and
obtain the number of iterations needed by STM to converge. This value is then used
as the maximum number of iterations allowed for PSTM to converge on the same
train and test run. After constraining the number of training iterations of PSTM in
this fashion, computational time comparisons between PSTM and STM have also
been reported.
There are two distinct handwriting recognition domains- online and offline, which
are differentiated by the nature of their input signals (Plamondon and Srihari [34]).
Online handwriting recognition systems depend on the information acquired during
the production of the handwriting using specific equipments that capture the trajec-
tory of the writing tool, such as in electronic pads and smart-phones. In this case,
the input signal can be realized as a single-dimensional vector of trajectory points of
the writing tool arranged linearly in time. On the other hand, in offline handwriting
recognition systems, static representation of a digitized document such as cheque,
form, mail or document processing is used. In this case, any information in the tem-
poral domain regarding the production of the handwriting is absent, constraining the
input signal to be handled as a binary matrix representation of images, where the
ones correspond to the region in the image over which the character has been drawn
by hand. Since, we are interested in working with tensor data, where the size of the
input feature is large, we consider the offline recognition scenario, often called as
Optical Character Recognition (OCR).
The Optdigits Dataset (Alpaydin and Kaynak [35]), obtained from the UCI repos-
itory, contains 32 × 32 normalized binary images of handwritten numerals 0–9,
extracted from a preprinted form by using preprocessing programs made available
by NIST. From a total of 43 people, 30 contributed to the training set and a different
13 to the test set, offering a wide variety of handwriting styles among the writers.
The Optdigits dataset consists of 3 subsets of binary images - training data obtained
from the first 30 subjects, writer-dependent testing data obtained from the same
30 subjects, and writer-independent testing data, which has been generated by the
remaining 13 subjects and which is vital for comparing testing performances. The
number of training and testing examples for each numeral is provided in Table 7.8.
Similar to the approach used in comparing performance over binary classifier, we
consider a multi-class classification routine with varying ratio of the training set to be
used. Also, the performance comparisons of PSTM and STM are evaluated on both
the writer-dependent and writer-independent testing datasets, to study the effects of
changing the writer on testing accuracy.
A Micro-Averaged Percentage Accuracy measure for different ratios of the training data is generated by solving the C(10, 2) = 45 pairwise binary classification problems.
Result comparisons between PSTM and STM for this evaluation metric are provided in Table 7.9.
We further employ a one-against-one approach for extending the binary classifiers to multi-class classification of the Optdigits dataset. The class label for a new input
test image is decided by a voting system across the C(10, 2) = 45 binary classifiers. In this fashion, we obtain a macro-level comparison of percentage accuracy for multi-class
classification, the results of which are summarized in Table 7.10. The average training
time comparisons between PSTM and STM, for learning a single binary classifier,
are provided in Table 7.11.
Table 7.11 Training time comparisons and number of iterations on optdigits dataset
Training ratio
0.01 0.3 0.7
PSTM 0.005 ± 0.007 0.01 ± 0.060 0.73 ± 0.485
STM 0.02 ± 0.009 0.19 ± 0.067 1.38 ± 0.852
Number of iterations 10 ± 7 14 ± 10 26 ± 24
From the results we can conclude that the efficacy of PSTM is in line with that of STM, while in terms of training time it is much faster. Further, as we increase the
size of the training dataset we get a significant improvement in the testing accuracy.
Remark 7.4.1 Some authors, e.g. Meng et al. [36], have also studied Least Squares Support Tensor Machines (LSSTM). Here it may be noted that the formulation of LSSTM is
very similar to the PSTM presented here. The only difference is that in LSSTM, the extra term (1/2) b^2 present in the objective function of PSTM is not included.
Since we have given a detailed discussion of PSTM, we omit the details of LSSTM.
7.5 Twin and Least Squares Twin Tensor Machines for Classification

Consider a training set of matrix patterns

TC = {(Xi, yi); i = 1, 2, . . . , p},

where Xi ∈ R^{n1×n2} represents the second order tensor and yi ∈ {−1, 1} are the class labels. Thus TC is essentially a matrix pattern dataset and yi is the associated
class label for the matrix input Xi, for i = 1, 2, . . . , p. Let p1 and p2 be the numbers of samples of the positive and negative classes respectively, so that
p = p1 + p2. Let I1 and I2 respectively be the index sets of the positive and negative classes, with |I1| = p1 and |I2| = p2.
Linear TWSTM finds a pair of nonparallel hyperplanes f1(X) = u1^T X v1 + b1 = 0 and f2(X) = u2^T X v2 + b2 = 0 by constructing two appropriate quadratic programming
problems (QPPs). These two QPPs are motivated by (TWSVM1) and (TWSVM2) introduced in Chap. 3 for determining the nonparallel hyperplanes x^T w1 + b1 = 0 and
x^T w2 + b2 = 0. Thus the two QPPs are
(TWSTM1)   Min_(u1, v1, b1, ξ2)   (1/2) Σ_{i∈I1} (u1^T X_i v1 + b1)^2 + C1 Σ_{j∈I2} ξ_{2j}
subject to
−(u1^T X_j v1 + b1) + ξ_{2j} ≥ 1,   j ∈ I2,
ξ_{2j} ≥ 0,   j ∈ I2,

and

(TWSTM2)   Min_(u2, v2, b2, ξ1)   (1/2) Σ_{j∈I2} (u2^T X_j v2 + b2)^2 + C2 Σ_{i∈I1} ξ_{1i}
subject to
(u2^T X_i v2 + b2) + ξ_{1i} ≥ 1,   i ∈ I1,
ξ_{1i} ≥ 0,   i ∈ I1.
Once the two planes are obtained, a new matrix pattern X is assigned to the class whose plane is nearer to it, i.e.

class(X) = arg min_{k=1,2} | u_k^T X v_k + b_k |,   (7.59)

where k = 1 denotes the positive class and k = 2 denotes the negative class.
To solve the two quadratic programming problems (TWSTM1) and (TWSTM2)
we proceed in the same manner as in the alternating algorithm for (STM1) and
(STM2) discussed in Sect. 7.2. We present here the details for (TWSTM1) and the
details for (TWSTM2) are analogous.
Step 1 (Initialization)
Choose ε > 0 and u1^(0) = (1, 1, . . . , 1)^T ∈ R^{n1}. Set s = 0.
Step 2 (Determination of v1 and b1)
Given u1^(s), and hence X_i^T u1^(s), i ∈ I1, solve (TWSTM1) and obtain its optimal solution (v1^(s), b1^(s)).
Step 3 (Updation of u)
Given v1^(s), and hence X_i v1^(s), i ∈ I1, solve (TWSTM1) and obtain its optimal solution (u1^(s+1), b1^(s+1)).
Step 4 (Updation of v)
Given u1^(s+1), and hence X_i^T u1^(s+1), i ∈ I1, solve (TWSTM1) and obtain its optimal solution (v1^(s+1), b̂1^(s+1)).
Step 5 (Stopping rule)
If ||u1^(s+1) − u1^(s)|| and ||v1^(s+1) − v1^(s)|| are both less than ε, or the maximum number of iterations is reached, stop and take u1 = u1^(s+1), v1 = v1^(s+1),
b1 = (b1^(s+1) + b̂1^(s+1))/2, and f1(X) = u1^T X v1 + b1. Otherwise put s ← s + 1 and return to Step 2.
The solution of (TWSTM2) gives the second plane f2(X) = u2^T X v2 + b2 = 0, and then for a new input X_new, its class label is assigned as per rule (7.59).
Taking motivation from LS-TWSVM, we can formally write the formulation of LS-TWSTM as follows

(LS-TWSTM1)   Min_(u1, v1, b1, ξ2)   (1/2) Σ_{i∈I1} (u1^T X_i v1 + b1)^2 + C1 Σ_{j∈I2} ξ_{2j}^2
subject to
−(u1^T X_j v1 + b1) + ξ_{2j} = 1,   j ∈ I2,

and

(LS-TWSTM2)   Min_(u2, v2, b2, ξ1)   (1/2) Σ_{j∈I2} (u2^T X_j v2 + b2)^2 + C2 Σ_{i∈I1} ξ_{1i}^2
subject to
(u2^T X_i v2 + b2) + ξ_{1i} = 1,   i ∈ I1.
Zhao et al. [5] and Gao et al. [6] derived an alternating algorithm for (LS-TWSTM) which is similar to that for (TWSTM), except that the problems (LS-TWSTM1) and (LS-TWSTM2) are solved via systems of linear equations. This is the common approach for solving all least squares versions, be it LS-SVM, LS-TWSVM, LS-STM or LS-TWSTM. In fact, one can also consider a proximal version of TWSTM on the lines of PSTM and thereby achieve all the associated computational simplifications. A minimal sketch of one such linear-system step is given below.
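The following numpy sketch shows how, with $u_1$ held fixed, (LS-TWSTM1) as displayed above reduces to a single linear system in $(v_1, b_1)$; the function and variable names are illustrative assumptions, and an analogous step applies to (LS-TWSTM2) and to the update of $u_1$.

```python
# One alternating step of LS-TWSTM1 for fixed u1, solved as a linear system.
import numpy as np

def ls_twstm1_step(u1, X_pos, X_neg, C1=1.0):
    # Rows of H: [ (X_i^T u1)^T  1 ] for i in I1; rows of G likewise for j in I2.
    H = np.array([np.append(Xi.T @ u1, 1.0) for Xi in X_pos])
    G = np.array([np.append(Xj.T @ u1, 1.0) for Xj in X_neg])
    e = np.ones(len(X_neg))
    # Substituting xi_{2j} = 1 + (u1^T X_j v1 + b1) into the objective
    #   (1/2)||H z||^2 + C1 ||e + G z||^2,  z = [v1; b1],
    # and setting the gradient to zero gives the normal equations below.
    A = H.T @ H + 2.0 * C1 * (G.T @ G)
    rhs = -2.0 * C1 * (G.T @ e)
    z = np.linalg.solve(A, rhs)
    return z[:-1], z[-1]          # v1, b1
```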
Zhao et al. [5] and Gao et al. [6] carried out extensive experiments on the Eyes and ORL databases and showed that LS-TWSTM performs uniformly better than linear STM and linear TWSTM. Gao et al. [6] also proposed a kernel version of LS-TWSTM, termed NLS-TWSTM. On the ORL database, NLS-TWSTM is reported to have better performance than LS-TWSTM. Similar results have also been reported on the Yale database.
Remark 7.5.1 Taking motivation from (GEPSVM), Zhang and Chow [39] developed the Maximum Margin Multisurface Proximal Support Tensor Machine (M³PSTM) model and discussed its application to image classification and segmentation. Further, Shi et al. [40] proposed a Tensor Distance Based Least Square Twin Support Tensor Machine (TDLS-TSTM) model which can take full advantage of the structural information of the data. Various other extensions of (TWSTM), which are similar to the extensions of (TWSVM), have been reported in the literature.
Remark 7.5.2 In support tensor and twin support tensor type machines, mostly second order tensors are used as the input. Therefore, some authors have also referred to such learning machines as support matrix machines, e.g. the model of Xu et al. [41] for 2-dimensional image data classification. Some other references with this terminology are Matrix Pattern Based Projection Twin Support Vector Machines (Hua and Ding [42]), New Least Squares Support Vector Machines Based on Matrix Patterns (Wang and Chen [43]) and Multiple Rank Kernel Support Matrix Machine for Matrix Data Classification (Gao et al. [32]).
7.6 Conclusions
This chapter has studied certain additional topics related to TWSVM. Just as in SVM, the problem of optimal kernel selection in TWSVM is also very important. Here, taking motivation from Fung et al. [13] and utilizing the iterative alternating optimization algorithm of Bezdek and Hathaway [12], a methodology has been developed for selecting an optimal kernel in TWSVM. Classification with prior knowledge in the context of TWSVM has also been discussed in this chapter. We have presented the KBTWSVM formulation and its least squares version KBLTWSVM in detail, and shown their superiority over other knowledge based formulations.
The last topic discussed in this chapter is twin support tensor machines. Support tensor machines are conceptually different from SVMs, as they are tensor space models and not vector space models. They have found much favour in the areas of text categorization and image processing. This topic seems to be very promising for future studies.
References
1. Khemchandani, R., Jayadeva, & Chandra, S. (2009). Optimal kernel selection in twin support
vector machines. Optimization Letters, 3, 77–88.
2. Khemchandani, R., Karpatne, A., & Chandra, S. (2013). Proximal support tensor machines.
International Journal of Machine Learning and Cybernetics, 4, 703–712.
3. Kumar, M. A., Khemchandani, R., Gopal, M., & Chandra, S. (2010). Knowledge based least
squares twin support vector machines. Information Sciences, 180, 4606–4618.
4. Cai, D., He, X., Wen, J.-R., Han, J. & Ma, W.-Y. (2006). Support tensor machines for text
categorization. Technical Report, Department of Computer Science, UIUC. UIUCDCS-R-
2006-2714.
5. Zhao, X., Shi, H., Meng, L., & Jing, L. (2014). Least squares twin support tensor machines
for classification. Journal of Information and Computational Science, 11–12, 4175–4189.
6. Gao, X., Fan, L., & Xu, H. (2014). NLS-TWSTM: A novel and fast nonlinear image classifi-
cation method. WSEAS Transactions on Mathematics, 13, 626–635.
7. Lanckriet, G. R. G., Cristianini, N., Bartlett, P., Ghaoui, L. El, & Jordan, M. I. (2004). Learning
the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5,
27–72.
8. Xiong, H., Swamy, M. N. S., & Ahmad, M. O. (2005). Optimizing the kernel in the empirical
feature space. IEEE Transactions on Neural Networks, 16(2), 460–474.
9. Jayadeva, Shah, S., & Chandra, S. (2009). Kernel optimization using a generalized eigenvalue
approach. PReMI, 32–37.
10. Fung, G., Dundar, M., Bi, J., & Rao, B. (2004). A fast iterative algorithm for Fisher discrim-
inant using heterogeneous kernels. In: Proceedings of the 21st International conference on
Machine Learning (ICML) (pp. 264–272). Canada: ACM.
11. Mika, S., Rätsch, G., & Müller, K. (2001). A mathematical programming approach to the
kernel fisher algorithm. In. Advances in Neural Information Processing Systems, 13, 591–
597.
12. Bezdek, J. C., & Hathaway, R. J. (2003). Convergence of alternating optimization. Neural,
Parallel and Scientific Computations, 11(4), 351–368.
13. Fung, G., & Mangasarian, O. L. (2001). Proximal support vector machine classifiers. In: F.
Provost, & R. Srikant (Eds.), Proceedings of Seventh International Conference on Knowledge
Discovery and Data Mining (pp. 77–86).
14. Blake, C.L., & Merz, C.J., UCI Repository for Machine Learning Databases, Irvine, CA:
University of California, Department of Information and Computer Sciences. http://www.ics.
uci.edu/~mlearn/MLRepository.html.
15. Duda, R., Hart, P., & Stork, D. (2001). Pattern classification. New York: Wiley.
16. Fung, G., Mangasarian, O. L., & Shavlik, J. (2002). Knowledge- based support vector machine
classifiers. In: Advances in Neural Information Processing Systems (vol. 14, pp. 01–09).
Massachusetts: MIT Press.
17. Mangasarian, O. L. (1994). Nonlinear programming. SIAM.
18. Khemchandani, R., Jayadeva, & Chandra, S. (2009). Knowledge based proximal support
vector machines. European Journal of Operational Research, 195(3), 914–923.
19. Golub, G. H., & Van Loan, C. F. (1996). Matrix computations (3rd ed.). Baltimore, Maryland:
The John Hopkins University Press.
20. Jayadeva, Khemchandani, R., & Chandra, S. (2007). Twin support vector machines for pattern
classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(5), 905–
910.
21. Lee, Y. J., Mangasarian, O. L., & Wolberg, W. H. (2003). Survival-time classification of breast cancer patients. Computational Optimization and Applications, 25, 151–166.
22. Cristianini, N., & Shawe-Taylor, J. (2000). An introduction to support vector machines and
other Kernel-based learning methods. New York: Cambridge University Press.
23. Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
24. Wang, Z., Chen, S. C., Liu, J., & Zhang, D. Q. (2008). Pattern representation in feature extraction
and classifier design: Matrix versus vector. IEEE Transactions on Neural Networks, 19, 758–
769.
25. Zhang, Z., & Ye, N. (2011). Learning a tensor subspace for semi-supervised dimensionality
reduction. Soft Computing, 15(2), 383–395.
26. Fu, Y., & Huang, T. S. (2008). Image classification using correlation tensor analysis. IEEE
Transactions on Image Processing, 17(2), 226–234.
27. He, X., Cai, D., & Niyogi, P. (2005). Tensor subspace analysis, advances in neural information
processing systems. Canada: Vancouver.
28. Yan, S., Xu, D., Yang, Q., Zhang, L., Tang, X., & Zhang, H. (2007). Multilinear discriminant
analysis for face recognition. IEEE Transactions on Image Processing, 16(1), 212–220.
29. Tao, D., Li, X., Hu, W., Maybank, S. J., & Wu, X. (2007). General tensor discriminant analysis
and Gabor features for gait recognition. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 29(10), 1700–1715.
30. Tao, D., Li, X., Hu, W., Maybank, S. J., & Wu, X. (2008). Tensor rank one discriminant analy-
sis: a convergent method for discriminative multilinear subspace selection. Neurocomputing,
71(10–12), 1866–1882.
31. Chandra, S., Jayadeva, & Mehra, A. (2009). Numerical optimization with applications. New
Delhi: Narosa Publishing House.
32. Gao, X., Fan, L., & Xu, H. (2013). Multiple rank kernel support matrix machine for matrix
data classification. International Journal of Machine Learning and Cybernetics, 4, 703–712.
33. Hu, R. X., Jia, W., Huang, D. S., & Lei, Y. K. (2010). Maximum margin criterion with tensor
representation. Neurocomputing, 73(10–12), 1541–1549.
34. Plamondon, R., & Srihari, S. N. (2000). On-line and off-line handwriting recognition: A
comprehensive survey. IEEE Transactions on Pattern Analysis and Machine Intelligence,
22(1), 63–84.
35. Alpaydin, E., & Kaynak, C. (1998). UCI Machine Learning Repository, Irvine, CA: University
of California, Department of Information and Computer Sciences. http://archive.ics.uci.edu/
ml.
36. Meng, L.V., Zhao, X., Song, L., Shi, H., & Jing, L. (2013). Least squares support tensor
machine. ISORA 978-1-84919-713-7, IET.
37. Zhang, X., Gao, X., & Wang, Y. (2009). Twin support tensor machines for MCS detection.
Journal of Electronics, 26, 318–324.
38. Zhang, X., Gao, X., & Wang, Y. (2014). Least squares twin support tensor machine for
classification. Journal of Information and Computational Science, 11–12, 4175–4189.
39. Zhang, Z., & Chow, T. W. S. (2012). Maximum margin multisurface support tensor machines
with application to image classification and segmentation. Expert Systems with Applications,
39, 849–860.
40. Shi, H., Zhao, X., & Jing, L. (2014). Tensor distance based least square twin support tensor
machine. Applied Mechanics and Materials, 668–669, 1170–1173.
41. Xu, H., Fan, L., & Gao, X. (2015). Projection twin SMM’s for 2d image data classification.
Neural Computing and Applications, 26, 91–100.
42. Hua, X., & Ding, S. (2012). Matrix pattern based projection twin support vector machines.
International Journal of Digital Content Technology and its Applications, 6(20), 172–181.
43. Wang, Z., & Chen, S. (2007). New least squares support vector machines based on matrix
patterns. Neural Processing Letters, 26, 41–56.
Chapter 8
Applications Based on TWSVM
8.1 Introduction
The Twin Support Vector Machine (TWSVM) has been widely used in a variety
of applications based on classification (binary, multi-class, multi-label), regression
(function estimation) and clustering tasks. The key benefit of using the TWSVM
as an alternative to the conventional SVM arises from the fact that it is inherently
designed to handle the class imbalance problem and is four times faster than SVM,
which makes it usable for real-time applications. Such a situation is common in vari-
ous practical applications for classification and regression, particularly in the case of
multi-class classification. This makes the TWSVM a natural choice for such appli-
cations, resulting in better generalization (higher accuracy) as well as in obtaining
a faster solution (lower training time as a result of solving smaller sized Quadratic
Programming Problems - QPPs).
In order to deal with unbalanced datasets, some of the recent techniques include randomly oversampling the smaller class or undersampling the larger class; one-sided selection, which involves removing examples of the larger class that are “noisy” [1]; and cluster-based oversampling, which addresses small subsets of isolated examples [2]. Wilson’s editing [3] employs a 3-nearest neighbour classifier to remove majority class examples that are misclassified. SMOTE [4] adds minority class exemplars by interpolating between existing ones (a sketch of this idea is given below). A summary of many techniques and comparisons may be found in [5, 6]. All these techniques modify the training dataset in some manner. In contrast, the TWSVM does not change the dataset at all.
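For concreteness, the interpolation idea behind SMOTE mentioned above can be sketched as follows; this is a simplified illustration with assumed names (X_min, n_new, k), not the reference implementation of [4].

```python
# SMOTE-style oversampling sketch: synthesize minority samples by interpolating
# between a minority sample and one of its nearest minority-class neighbours.
import numpy as np

def smote_like(X_min, n_new, k=5, rng=np.random.default_rng(0)):
    new_samples = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)   # distances to all minority points
        neighbours = np.argsort(d)[1:k + 1]            # k nearest, skipping the point itself
        j = rng.choice(neighbours)
        lam = rng.random()                             # interpolation factor in [0, 1]
        new_samples.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.vstack(new_samples)
```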
This chapter aims to review a few significant applications which have benefited from using the TWSVM. The objective is to illustrate the widespread applicability of the TWSVM across a multitude of application domains, while enabling the reader to gain insight into identifying potential application domains. Research papers cited in this chapter are indexed in online databases such as Google Scholar and Scopus.
8.2 Applications to Bio-Signal Processing

The TWSVM has been successfully used in gesture classification using Surface Elec-
tromyogram (sEMG) signals. The problem at hand in this context is to train classifiers
using sEMG data corresponding to limb movements such as wrist and finger flexions,
such as those shown in Fig. 8.1. The data is collected by attaching electrodes to the
limb that pick up nerve impulses when the subject performs limb movements. These
electrodes, usually non-invasive, are called channels in biomedical signal parlance.
sEMG datasets are high-dimensional and unbalanced. The high dimensionality comes from the fact that these electrodes typically acquire data at a sampling rate of, say, S samples per second; when this data is collected from k electrodes (channels) for a time period of t seconds, the size of the data matrix is of the order of S × k × t.
Fig. 8.1 Various wrist and finger flexions that are classified from sEMG data in [7, 8]
For example, a recording of a finger flexion for 5 s from 4 channels sampled at 250
samples per second would have 5000 samples.
The second challenge in building efficient classifiers is the class imbalance problem inherent to such datasets. This is essentially due to two restrictions: first, collecting a large number of samples is tedious for the user, and secondly, being a multi-class problem, the use of a one-v/s-rest classification scheme results in a large number of samples for the “other” class. To give an example, suppose we are trying to classify $I_1, I_2, \ldots, I_p$ intents (wrist and finger flexions), and we have $k$ samples of each of the $p$ intents; then each classifier in the one-v/s-rest scheme would have $k$ samples of an intent $I_i$, $i \in \{1, \ldots, p\}$, and $k \times (p-1)$ samples corresponding to the “other” intents $I_j$, $j \in \{1, \ldots, p\}$, $j \neq i$. For instance, with $p = 7$ intents and $k = 20$ samples each, every one-v/s-rest classifier sees 20 samples of the target intent against 120 samples of the rest. This creates class imbalance, which amplifies as the number of intents to be detected increases. Thus, a classifier that is insensitive to class imbalance finds suitable applications in this domain.
Naik et al. [7] show that the use of the TWSVM results in a sensitivity and specificity of 84.8 and 88.1 %, respectively, as opposed to a neural network sensitivity and specificity of 58.3 and 42.5 %, respectively, for the same set of gestures using Root
Mean Square (RMS) values as features. Datasets corresponding to seven gestures
(various wrist and finger flexions) were investigated for seven subjects in this study.
Similarly, Arjunan et al. [9] used fractal features on similar data and obtained recog-
nition accuracies ranging from 82 to 95 % for the Radial Basis Function (RBF) ker-
nel, and 78 to 91 % for linear kernel TWSVM. Further, promising results on amputee
subjects have been reported by Kumar et al. [8], who obtain an accuracy and sensitivity of identification of finger actions from a single channel sEMG signal of 93 and 94 % for able-bodied subjects, and 81 and 84 % for trans-radial amputees, respectively. More recently, Arjunan et al. [10] have shown significant improvement in grip recognition accuracy, sensitivity, and specificity, obtaining a kappa coefficient κ = 0.91, which is
indicative of the TWSVM’s robustness. These results are significant for the development of sEMG-driven prosthesis control and rehabilitation systems.
The TWSVM has also been shown to be beneficial for classification of Electroen-
cephalogram (EEG) data by Soman and Jayadeva [11]. Their work investigates the use
of EEG data corresponding to imagined motor movements that are useful for devel-
oping Brain Computer Interface (BCI) systems. These use voluntary variations in
brain activity to control external devices, which are useful for patients suffering from
locomotor disabilities. In practice, these systems (the specific class of motor-imagery
BCI systems) allow disabled people to control external devices by imagining move-
ments of their limbs, which are detected from the EEG data. Applications include
movement of a motorized wheelchair, movement of the mouse cursor on a computer
screen, synthesized speech generation etc.
This problem faces challenges similar to the challenges with sEMG data, and this
has been addressed in [11] by developing a processing pipeline using the TWSVM
and a feature selection scheme that reduces computation time and leads to higher
accuracies on publicly available benchmark datasets. Conventionally, the EEG sig-
nals are high dimensional, as is the case with sEMG. In order to work in a space
of reasonable dimensionality, the EEG signals are projected to a lower dimensional
space [12], and classification is done in this feature space. This projection presents an
inherent challenge in terms of loss of information, specifically as a trade-off between
being able to retain either time or frequency information. Later methods improved
upon this by considering the dimensionality reduction after filtering the EEG data in
multiple frequency bands [13, 14]. However, this led to an increase in the number
of features, and an efficient selection scheme was needed. In [11], a computationally simple measure called the classifiability [15] has been used to accomplish this. The summary of the processing pipeline employed is shown in Fig. 8.2.
Fig. 8.2 The EEG processing pipeline using TWSVM for classification
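A rough Python sketch of such a pipeline is given below, under stated assumptions: scipy-based band-pass filtering, Common Spatial Patterns (CSP) as the dimensionality reduction step, and a simple Fisher-score stand-in used in place of the classifiability measure of [15]. Band edges, shapes and names are illustrative, and the sketch is not the implementation of [11]; the selected features would finally be passed to a TWSVM classifier.

```python
# Schematic EEG pipeline: band-pass filter trials, extract CSP log-variance
# features per band, rank features, then classify with a TWSVM (not shown).
import numpy as np
from scipy.signal import butter, filtfilt
from scipy.linalg import eigh

def bandpass(trials, low, high, fs):
    """Filter each trial (channels x samples) in one frequency band."""
    b, a = butter(4, [low, high], btype='band', fs=fs)
    return [filtfilt(b, a, t, axis=1) for t in trials]

def csp_features(trials_pos, trials_neg, n_pairs=2):
    """CSP via generalized eigendecomposition of the class covariance matrices."""
    def avg_cov(ts):
        return sum(t @ t.T / np.trace(t @ t.T) for t in ts) / len(ts)
    C1, C2 = avg_cov(trials_pos), avg_cov(trials_neg)
    _, W = eigh(C1, C1 + C2)                            # generalized eigenvectors
    W = np.hstack([W[:, :n_pairs], W[:, -n_pairs:]])    # most discriminative filters
    def feat(t):
        return np.log(np.var(W.T @ t, axis=1))          # log-variance features
    return [feat(t) for t in trials_pos], [feat(t) for t in trials_neg]

def fisher_score(f_pos, f_neg):
    """Stand-in ranking criterion (the work in [11] uses classifiability [15])."""
    f_pos, f_neg = np.array(f_pos), np.array(f_neg)
    num = (f_pos.mean(axis=0) - f_neg.mean(axis=0)) ** 2
    den = f_pos.var(axis=0) + f_neg.var(axis=0) + 1e-12
    return num / den
```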
Their results indicate a relative improvement in classification accuracy ranging from 6.60 % to 43.22 % over existing methods. On multi-class datasets, the use of the TWSVM over the conventional SVM clearly results in an improvement in classification accuracy (in the range of 73–79 %, as opposed to SVM accuracies of only up to 49 %).
Other than the applications pertaining to biomedical signal processing, the TWSVM
has also been successfully used to build systems to supplement medical disease diag-
nosis and detection. A few such examples are discussed in this section. In the context
of Micro-Calcification Cluster (MCC) detection, TWSVM has been shown to be use-
ful for detecting clustered micro-calcifications in mammogram images. This is useful
for early detection of breast cancer, and hence finds importance from the clinical
perspective. The datasets are from the Digital Database for Screening Mammogra-
phy (DDSM), which has been annotated for ground truth by radiologists. Various
combinations of features are used to build classifiers, which include intensity his-
tograms, statistical moments, texture-based features and transform domain features
such as wavelet transform coefficients. Zhang et al. [16, 17] use a boosted variant of
the TWSVM resulting in a sensitivity of 92.35 % with an error rate of 8.3 %. Further,
the tensor variant of the TWSVM has been explored in [18], and results indicate it
addresses the problem of over-fitting faced previously. The tensor variant results in
a sensitivity and specificity of 0.9385 and 0.9269 respectively, and an average error
rate of 0.0693, outperforming the SVM.
Si and Jing [19] have developed Computer-Aided Diagnostic (CAD) system for
mass detection in digital mammograms. Their results are based on the Digital Data-
base for Screening Mammography (DDSM) dataset, and reveal that 94 % sensitivity
was achieved for malignant masses and 78 % detection for benign masses.
Disease detection systems have also been developed using the TWSVM, such as the Parkinson’s disease diagnosis system of Tomar et al. [20], which uses the LS-TWSVM in conjunction with Particle Swarm Optimization (PSO) to obtain an accuracy of 97.95 %, superior to 11 other methods reported in the literature [20, Table 1]. These authors have
also developed a system for detection of heart disease using the TWSVM [21] and
obtain an accuracy of 85.59 % on the heart-statlog dataset. These results indicate the
feasibility of using the TWSVM for problems pertaining to disease detection and
diagnosis systems.
8.4 Applications to Intrusion and Fault Detection Problems

Problems such as intrusion and fault detection fall into the category of applications that benefit from the use of classifiers such as the TWSVM. Intrusion detection is of significance in wireless networks as it aids the development of efficient network
security protocols. Examples of intrusions to be detected include Denial of Service
attack, user to root attack, remote to local user attack and probing. The challenge
here is to be able to build a classifier that detects intrusions effectively, and this is
non-trivial as the available training data would contain very few labeled intrusions, as these are comparatively infrequent in terms of occurrence over time.
This results in the class imbalance problem. The case is similar for detection of faults
in mechanical systems. Machine learning techniques have been applied in the literature to automate the detection of intrusions or faults, and the use of the Twin SVM has shown a significant improvement in detection rates in comparison.
The TWSVM has been used for intrusion detection by Ding et al. [22] on the KDD 1999 dataset, where it obtains higher detection rates and lower training times. The objective was to build a predictive model capable of distinguishing between “bad” connections, called intrusions or attacks, and “good” normal connections. They have used basic features (derived from packet headers), content features (based on domain knowledge of network packets), and time-based and host-based features. The results show an improvement in detection rates across the various categories of network intrusions considered, for both the original set of 41 features as well as a reduced feature set. The results have been compared with the Wenke-Lee method and the conventional SVM. For the case of using all features, there is an improvement of 0.75 % over the traditional SVM model and 19.06 % over the Wenke-Lee model. For the reduced feature set, there is an improvement of 0.59 % over the SVM model and 19.48 % over the Wenke-Lee model. There is also a significant reduction in training time as opposed to conventional SVMs, of up to 75 % for the full feature set and 69 % for the reduced feature set. Further, a probability based approach by Nie et al. [23], using a network flow feature selection procedure, obtains a detection rate of up to 97.94 % on the same dataset.
Another application, detecting defects in computer software programs using the TWSVM, has been investigated by Aggarwal et al. [24]. They have used the CM1 PROMISE software engineering repository dataset, which contains information such as Lines of Code (LOC), Design Complexity, Cyclomatic Complexity, Essential Complexity, Effort Measures, comments and various other attributes that are useful for predicting whether a piece of software has defects or not. They obtain an accuracy of 99.10 % using the TWSVM, while 12 other state-of-the-art methods obtain accuracies
in the range of 76.83–93.17 %.
Ding et al. [25] have devised a diagnosis scheme for fault detection in wireless sensors based on the Twin SVM, with Particle Swarm Optimization (PSO) used for its parameter tuning. Four types of faults (shock, biasing, short circuit, and shifting) were investigated, and the performance of the TWSVM was compared with other diagnostic methods, which used neural networks and conventional SVMs. These methods suffered from drawbacks such as getting stuck in local minima and overfitting, in the case of neural networks. Their results indicate that the diagnosis results for wireless sensors using TWSVMs are better than those of the SVM and Artificial Neural Network (ANN): 96.7 % accuracy as opposed to 91.1 % (SVM) and 83.3 % (ANN).
Chu et al. [26] have investigated the use of a variant of the TWSVM, called the
multi-density TWSVM (MDTWSVM) for strip steel surface defect classification.
This problem aims to identify defects such as scarring, dents, scratches, dirt, holes,
damage and edge-cracks on steel surface, and is essentially a multi-class classifi-
cation problem. They use geometry, gray level, projection, texture and frequency-
domain features for classification. The authors have proposed the MDTWSVM for
large-scale, sparse, unbalanced and corrupted training datasets. Their formulation
incorporates density information using Kernel Density Estimation (KDE) to develop
the MDTWSVM for this specific application context. The formulation is a deriv-
ative of the Twin SVM formulation in principle. The results report an accuracy of
92.31 % for the SVM, 93.85 % for the TWSVM and 96.92 % for the MDTWSVM.
The training times are 91.1512, 11.3707 and 1.2623 s for the SVM, TWSVM and
MDTWSVM respectively.
Similarly, Shen et al. [27] show application of the TWSVM for fault diagnosis in
rolling bearings. Here too, the TWSVM is useful, as the faulty samples constitute a very small fraction of the total number of samples. They obtain higher accuracies and lower training times using the TWSVM as compared to the conventional SVM. These results show significant improvement from the use of the TWSVM and similarly motivated formulations.
8.5 Application to Activity Recognition

The Least Squares TWSVM (LSTWSVM) is computationally more attractive than SVM because it solves systems of linear equations instead of solving two QPPs to obtain the non-parallel hyperplanes; consequently, it has gained popularity in action classification and labelling tasks [28–30]. Mozafari et al. [28] introduced the application of LSTWSVM to activity recognition systems with local space-time features. In order to handle the intrinsic noise and outliers present in the related activity classes, Nasiri et al. [30] proposed the Energy-based Least Squares Twin Support Vector Machine (ELS-TWSVM) on the lines of LSTWSVM for activity recognition, in which they replaced the unit distance constraints used in LSTWSVM with a pre-defined energy parameter. Similar to LSTWSVM, the ELS-TWSVM [30] seeks a pair of hyperplanes
given as follows:
$$w_1^T x + b_1 = 0 \quad \text{and} \quad w_2^T x + b_2 = 0, \qquad (8.1)$$

which are obtained by solving the following pair of problems:

$$(\text{ELS-TWSVM 1})\qquad \min_{w_1,\, b_1,\, y_2}\ \ \frac{1}{2}\,\|Aw_1 + e_1 b_1\|^2 + \frac{c_1}{2}\, y_2^T y_2 \qquad (8.2)$$
$$\text{subject to}\quad -(Bw_1 + e_2 b_1) + y_2 = E_1,$$

and

$$(\text{ELS-TWSVM 2})\qquad \min_{w_2,\, b_2,\, y_1}\ \ \frac{1}{2}\,\|Bw_2 + e_2 b_2\|^2 + \frac{c_2}{2}\, y_1^T y_1 \qquad (8.3)$$
$$\text{subject to}\quad (Aw_2 + e_1 b_2) + y_1 = E_2.$$

The solutions of (8.2) and (8.3) are obtained by solving systems of linear equations, namely

$$[w_1 \ \ b_1]^T = -\left[c_1 G^T G + H^T H\right]^{-1}\left[c_1 G^T E_1\right] \qquad (8.4)$$

and

$$[w_2 \ \ b_2]^T = \left[c_2 H^T H + G^T G\right]^{-1}\left[c_2 H^T E_2\right], \qquad (8.5)$$

where $H = [A \ \ e_1]$ and $G = [B \ \ e_2]$.
A new data point $x \in \mathbb{R}^n$ is assigned to class $i$ ($i = +1$ or $-1$) using the following decision function:

$$f(x) = \begin{cases} +1, & \text{if } \ \dfrac{|x^T w_1 + b_1|}{|x^T w_2 + b_2|} \leq 1,\\[6pt] -1, & \text{otherwise.} \end{cases}$$

ELS-TWSVM takes advantage of the prior knowledge available in the human activity recognition problem about uncertainty and intra-class variations, and thus improves the performance of activity recognition to some degree [30].
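A minimal numpy sketch of the ELS-TWSVM training and prediction steps follows, assuming the closed-form solutions (8.4)–(8.5) and the decision rule given above; here A and B are assumed to hold the positive- and negative-class samples row-wise, E1 and E2 are the energy parameter vectors, and all identifiers are illustrative.

```python
# ELS-TWSVM via the closed-form solutions (8.4)-(8.5) and the ratio decision rule.
import numpy as np

def els_twsvm_train(A, B, E1, E2, c1=1.0, c2=1.0):
    H = np.hstack([A, np.ones((A.shape[0], 1))])     # H = [A  e1]
    G = np.hstack([B, np.ones((B.shape[0], 1))])     # G = [B  e2]
    z1 = -np.linalg.solve(c1 * (G.T @ G) + H.T @ H, c1 * (G.T @ E1))   # Eq. (8.4)
    z2 = np.linalg.solve(c2 * (H.T @ H) + G.T @ G, c2 * (H.T @ E2))    # Eq. (8.5)
    (w1, b1), (w2, b2) = (z1[:-1], z1[-1]), (z2[:-1], z2[-1])
    return w1, b1, w2, b2

def els_twsvm_predict(x, w1, b1, w2, b2):
    # Class +1 if the absolute-distance ratio to the first plane is at most 1.
    ratio = abs(x @ w1 + b1) / abs(x @ w2 + b2)
    return 1 if ratio <= 1 else -1
```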
Various experiments were carried out in [30] on general activity recognition datasets like the Weizmann and KTH datasets, which showed the efficacy of the ELS-TWSVM. Features were extracted using the Harris detector algorithm and Histogram of Optical Flow descriptors on the KTH database containing six human actions (walking, jogging, running, boxing, hand waving, hand clapping).
In the Robust Least Squares TWSVM (RLS-TWSVM) formulation of Khemchandani and Sharma [31], c1, c2 > 0, e1 and e2 are vectors of ones of appropriate dimensions, and ξ is the error variable. The parameter (1 − ρ1) replaces E1 of ELS-TWSVM; further, ν > 0 controls the bound on the fraction of support vectors, and hence its optimal value ensures that the effect of outliers is taken care of in the activity recognition framework. Here, ELS-TWSVM can be thought of as a special case of RLS-TWSVM where the energy parameter is fixed to the optimal value of (1 − ρ1).
Khemchandani et al. [31] further introduced a hierarchical framework to deal with the multi-category classification problem of activity recognition systems, in which the hierarchy of classes is determined using the different multi-category classification approaches discussed in Chap. 5. Experimental results in [31] using global space-time features demonstrated that ELS-TWSVM mostly outperforms other state-of-the-art methods, and RLS-TWSVM obtains 100 % prediction accuracy on the Weizmann dataset.
There have also been applications of the TWSVM in financial prediction and modelling. For instance, Ye et al. [37, 38] have used the TWSVM to identify the determinants of inflation for the Chinese economy by considering financial data from two distinct economic periods. Fang et al. [39] have used the economic development data of Anhui province from 1992 to 2009 to study economic development prediction using a wavelet kernel-based primal TWSVM, and have obtained an improvement over the conventional SVM.
8.8 Applications for Online Learning
The TWSVM has also been used to devise online learning schemes, which are beneficial for the large datasets and streaming data that occur in many practical applications. These models adapt the classifier based on predictions made on incoming samples, in cases where feedback on correctness can be provided. Such models find widespread application in real-time scenarios. For instance,
Khemchandani et al. [40] have devised the incremental TWSVM, Hao et al. [41]
devised the fast incremental TWSVM and Mu et al. [42] have developed the online
TWSVM.
The TWSVM has been used for achievement analysis of students by Yang and Liu [43], where the twin parametric-margin SVM has been used to determine students’ achievement levels in the “Advanced language program design” course. The performance has been compared against conventional SVM and neural networks. The evaluation results for student achievement based on the weighted TWSVM network are consistent with the actual results, and the method also performs better than the comparative methods investigated. Similar findings have also been reported by Liu [44].
Manifold fuzzy TWSVMs have been investigated by Liu et al. [45] for classifica-
tion of star spectra. Experiments with alternate classification methods, such as SVM
and k-nearest neighbors on the datasets illustrate the superior results obtained using
TWSVM. Aldhaifallah et al. [46] have identified Auto-Regressive Exogenous Mod-
els (ARX) based on Twin Support Vector Machine Regression (TWSVR). Their
results indicate that the use of the TWSVR outperforms SVR and Least Squares
SVR (LSSVR) in terms of accuracy. Moreover, the CPU time spent by the TWSVR
algorithm is much less than that spent by the SVR algorithm. Other recognition
applications include license plate recognition by Gao et al. [47].
8.10 Conclusions
A promising direction for future work lies in combining the TWSVM with emerging paradigms such as deep learning and allied hybrid models, which may possibly benefit from the
advantages of the TWSVM. Minimal complexity implementations of the TWSVM
are of interest as well, as these are relevant for realization on hardware where design
area and power consumption are of significance. The range of practical applications continues to grow by the day with the rapid growth of available data. Hence, the relevance of such algorithms, and of machine learning in general, continues to increase.
References
1. Kubat, M., & Matwin, S. (1997). Addressing the curse of imbalanced training sets: One-sided selection. In Proceedings of the International Conference on Machine Learning (ICML) (pp. 179–186).
2. Jo, T., & Japkowicz, N. (2004). Class imbalances versus small disjuncts. ACM SIGKDD Explo-
rations Newsletter, 6(1), 40–49.
3. Barandela, R., Valdovinos, R. M., Sánchez, J. S., & Ferri, F. J. (2004). The imbalanced train-
ing sample problem: Under or over sampling. Structural, Syntactic, and Statistical Pattern
Recognition, 806–814.
4. Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic
minority over-sampling technique. Journal of Artificial Intelligence Research, 321–357.
5. Van Hulse, J., Khoshgoftaar, T. M., & Napolitano, A. (2007). Experimental perspectives on learn-
ing from imbalanced data. In Proceedings of the 24th International Conference on Machine
learning (pp. 935–942).
6. Estabrooks, A., Jo, T., & Japkowicz, N. (2004). A multiple resampling method for learning
from imbalanced data sets. Computational Intelligence, 20(1), 18–36.
7. Naik, R., Kumar, D., & Jayadeva (2010). Twin SVM for gesture classification using the
surface electromyogram. IEEE Transactions on Information Technology in Biomedicine, 14(2),
301–308.
8. Kumar, D. K., Arjunan, S. P., & Singh, V. P. (2013). Towards identification of finger flexions
using single channel surface electromyography-able bodied and amputee subjects. Journal of
NeuroEngineering and Rehabilitation, 10(50), 1–7.
9. Arjunan, S. P., Kumar, D. K., & Naik, G. R. (2010). A machine learning based method for
classification of fractal features of forearm sEMG using twin support vector machines, engi-
neering in medicine and biology society (EMBC). In: 2010 Annual international conference
of the IEEE (pp. 4821–4824).
10. Arjunan, S. P., Kumar, D. K., & Jayadeva. (2015). Fractal and twin SVM-based handgrip
recognition for healthy subjects and trans-radial amputees using myoelectric signal, Biomedical
Engineering/Biomedizinische Technik. doi:10.1515/bmt-2014-0134.
11. Soman, S., & Jayadeva (2015). High performance EEG signal classification using classifia-
bility and the Twin SVM. Applied Soft Computing, 30, 305–318.
12. Ramoser, H., Muller-Gerking, J., & Pfurtscheller, G. (2000). Optimal spatial filtering of single
trial EEG during imagined hand movement. IEEE Transactions on Rehabilitation Engineering,
8(4), 441–446.
13. Ang, K. K., Chin, Z. Y., Zhang, H., & Guan, C. (2008). Filter bank common spatial pat-
tern (FBCSP) in brain-computer interface. In IEEE International Joint Conference on Neural
Networks, 2008. IJCNN, 2008 (pp. 2390–2397). IEEE World Congress on Computational
Intelligence,
14. Novi, Q., Guan, C., Dat, T. H., & Xue, P. (2007). Sub-band common spatial pattern (SBCSP) for
brain-computer interface. In 3rd International IEEE/EMBS Conference on Neural Engineering,
2007. CNE’07 (pp. 204–207).
15. Li, Y., Dong, M., & Kothari, R. (2005). Classifiability-based omnivariate decision trees. IEEE
Transactions on Neural Networks, 16(6), 1547–1560.
16. Zhang, X. (2009). Boosting twin support vector machine approach for MCs detection. In Asia-
Pacific Conference on Information Processing, 2009. APCIP 2009 (vol. 1, pp. 149–152).
17. Zhang, X., Gao, X., & Wang, Y. (2009). MCs detection with combined image features and twin
support vector machines. Journal of Computers, 4(3), 215–221.
18. Zhang, X., Gao, X., & Wang, Y. (2009). Twin support tensor machines for MCs detection.
Journal of Electronics (China), 26(3), 318–325.
19. Si, X., & Jing, L. (2009). Mass detection in digital mammograms using twin support vector
machine-based CAD system. In WASE International Conference on Information Engineering,
2009. ICIE’09 (vol. 1, pp. 240–243). New York: IEEE.
20. Tomar, D., Prasad, B. R., & Agarwal, S. (2014). An efficient Parkinson disease diagnosis system
based on least squares twin support vector machine and particle swarm optimization. In 2014
9th International IEEE Conference on Industrial and Information Systems (ICIIS) (pp. 1–6).
21. Tomar, D., & Agarwal, S. (2014). Feature selection based least square twin support vec-
tor machine for diagnosis of heart disease. International Journal of Bio-Science and Bio-
Technology, 6(2), 69–82.
22. Ding, X., Zhang, G., Ke, Y., Ma, B., & Li, Z. (2008). High efficient intrusion detection method-
ology with twin support vector machines. In International Symposium on Information Science
and Engineering, 2008. ISISE’08 (vol. 1, pp. 560–564).
23. Nie, W., & He, D. (2010). A probability approach to anomaly detection with twin support
vector machines. Journal of Shanghai Jiaotong University (Science), 15, 385–391.
24. Aggarwal, S., Tomar, D., & Verma, S. (2014). Prediction of software defects using twin support
vector machine. In IEEE International Conference on Information Systems and Computer
Networks (ISCON) (pp. 128–132).
25. Ding, M., Yang, D., & Li, X. (2013). Fault diagnosis for wireless sensor by twin support vector
machine. Mathematical Problems in Engineering. Cairo: Hindawi Publishing Corporation.
26. Chu, M., Gong, R., & Wang, A. (2014). Strip steel surface defect classification method based
on enhanced twin support vector machine. ISIJ International (the Iron and Steel Institute of
Japan), 54(1), 119–124.
27. Shen, Z., Yao, N., Dong, H., & Yao, Y. (2014). Application of twin support vector machine
for fault diagnosis of rolling bearing. In Mechatronics and Automatic Control Systems (pp.
161–167). Heidelberg: Springer.
28. Mozafari, K., Nasiri, J. A., Charkari, N. M., & Jalili, S. (2011). Action recognition by local space-
time features and least square twin SVM (LS-TSVM). First International IEEE Conference
Informatics and Computational Intelligence (ICI) (pp. 287–292).
29. Mozafari, K., Nasiri, J. A., Charkari, N. M., & Jalili, S. (2011). Hierarchical least square twin
support vector machines based framework for human action recognition. In 2011 9th Iranian
Conference on Machine Vision and Image Processing (MVIP) (pp. 1-5). New York: IEEE.
30. Nasiri, J. A., Charkari, N. M., & Mozafari, K. (2014). Energy-based model of least squares
twin support vector machines for human action recognition. Signal Processing, 104, 248–257.
31. Khemchandani, R., & Sharma, S. (2016). Robust least squares twin support vector machine
for human activity recognition. Applied Soft Computing.
32. Shao, Y.-H., Zhang, C.-H., Wang, X.-B., & Deng, N.-Y. (2011). Improvements on twin support
vector machines. IEEE Transactions on Neural Networks, 22(6), 962–968.
33. Golub, G. H., & Van Loan, C. F. (1996). Matrix computations (3rd ed.). Baltimore, Maryland:
The John Hopkins University Press.
34. Khemchandani, R., & Saigal, P. (2015). Color image classification and retrieval through ternary
decision structure based multi-category TWSVM. Neurocomputing, 165, 444–455.
35. Pickard, R., Graszyk, C., Mann, S., Wachman, J., Pickard, L., & Campbell, L. (1995). Vistex
database. In Media Lab Massachusetts: MIT Cambridge.
36. Yang, H. Y., Wang, X. Y., Niu, P. P., & Liu, Y. C. (2014). Image denoising using nonsubsampled
shearlet transform and twin support vector machines. Neural Networks, 57, 152–165.
37. Ye, Y. F., Liu, X. J., Shao, Y. H., & Li, J. Y. (2013). L1-ε-Twin support vector regression for the
determinants of inflation: A comparative study of two periods in China. Procedia Computer
Science, 17, 514–522.
38. Ye, Y., Shao, Y., & Chen, W. (2013). Comparing inflation forecasts using an ε-wavelet twin
support vector regression. Journal of Information and Computational Science, 10, 2041–2049.
39. Fang, S., & Hai Yang, S. (2013). A wavelet kernel-based primal twin support vector machine
for economic development prediction. Mathematical Problems in Engineering. Hindawi Publishing Corporation.
40. Khemchandani, R., Jayadeva, Chandra, S. (2008). Incremental twin support vector machines. In
S. K. Neogy, A. K. das, & R. B. Bapat, (Eds.), International Conference on Modeling, Compu-
tation and Optimization. Published in Modeling, Computation and Optimization, ICMCO-08
Singapore: World Scientific.
41. Hao, Y., Zhang, H., (2014). A fast incremental learning algorithm based on twin support vector
machine. In Seventh International IEEE Symposium on Computational Intelligence and Design
(ISCID) (vol. 2, pp. 92–95).
42. Mu, X., Chen, L., & Li, J. (2012). Online learning algorithm for least squares twin support
vector machines. Computer Simulation, 3, 1–7.
43. Yang, J., & Liu, W. (2014). A structural twin parametric-margin support vector model and its
application in students’ achievements analysis. Journal of Computational Information Systems,
10(6), 2233–2240.
44. Liu, W. (2014). Student achievement analysis based on weighted TSVM network. Journal of
Computational Information Systems, 10(5), 1877–1883.
45. Liu, Z.-B., Gao, Y.-Y., & Wang, J.-Z. (2015). Automatic classification method of star spectra
data based on manifold fuzzy twin support vector machine. Guang Pu Xue Yu Guang Pu Fen
Xi/Spectroscopy and Spectral Analysis, 35(1), 263–266.
46. Aldhaifallah, M., & Nisar, K. S. (2013). Identification of auto-regressive exogenous models
based on twin support vector machine regression. Life Science Journal, 10(4).
47. Gao, S. B., Ding, J., & Zhang, Y. J. (2011). License plate location algorithm based on twin SVM.
Advanced Materials Research (vol. 271, pp. 118-124). Switzerland: Trans Tech Publication.
Index

C
CBIR, 119
Chi-square distance measure, 119
Clustering, 134
Clustering algorithms, 125
Computer vision, 134
Concave-convex procedure, 136
Cross Planes data set, 30
Cross Planes example, 30
Curse of dimensionality, 125

D
DAGSVM, 115
Down bound regressor, 70
DTTSVM, 115
Dual formulation of TWSVM, 46
Dual formulation of TWSVR, 72

E
Efficient sparse nonparallel support vector machines, 121
Electroencephalogram (EEG) data, 196
Empirical risk, 127

G
Galaxy Bright data, 32
Gaussian kernel, 8
Generalized eigenvalue problem, 27
Generalized Eigenvalue Proximal Support Vector Machine, 25
Generalized Eigenvalue Proximal Support Vector Regression, 25
Generalized Eigenvalue Proximal Support Vector Regressor, 36
Graph Laplacian, 125

H
Half-Against-Half (HAH), 115
Hard-Margin Linear SVM, 3
Heteroscedastic noise, 108

I
Image classification, 119
Image processing, 202
Image retrieval, 119
Improved GEPSVM, 32
L
Lagrange multipliers, 10
Laplacian Least Squares Twin SVM (Lap-LSTWSVM), 126, 132
Laplacian Support Vector Machine (Laplacian-SVM), 125, 127
Laplacian-TWSVM, 126, 128
Least Squares Twin Support Tensor Machine, 187
Least Squares TWSVM (LS-TWSVM), ν-TWSVM, Parametric-TWSVM (par-TWSVM), Non-Parallel Plane classifier (NPPC), 103
Least Squares version of TWSVM, 126
LIBSVM, 18
Linear GEPSVM classifier, 26
Linear Twin Support Vector Machine (Linear TWSVM), 43, 44
L2-norm empirical risk, 106
Local k-proximal plane clustering, 134

M
Manifold, 125

P
Parametric-margin, 109
Parametric-Margin ν-Support Vector Machine (Par-ν-SVM), 108–110
Parametric-margin hyperplane, 109
Platt’s Sequential Minimal Optimization, 15
Positive and negative parametric-margin hyperplanes, 111
Probabilistic TWSVM (Par-TWSVM), 121
Projection Twin Support Matrix Machine, 190
Proximal plane clustering, 134
Proximal Support Tensor Machines, 181

R
Radial Basis Function, 8
Rayleigh quotient, 27
Regularized GEPSVM, 32
Relaxed LSSVM, 18
Relaxed SVM, 18
Reproducing Kernel Hilbert Space, 127
Ridge regression, 47
T
TB-TWSVM, 118
Tensor classifier, 176
Tensor Distance Based Least Square Twin Support Tensor Machine, 190
Tensor product, 175
Tensor Space Model, 175
Tensor with order k, 174
Ternary Decision Structure based Multi-category Twin Support Vector Machine (TDS-TWSVM), 116

V
Vapnik–Chervonenkis (VC) dimension, 3

W
Wolfe’s dual, 127

X
XOR example, 30