Abstract

Support Vector Machines (SVMs) have become a popular learning algorithm, in particular for large, high-dimensional classification problems. SVMs have been shown to give highly accurate classification results in a variety of applications. Several methods have been proposed to obtain not only a classification, but also an estimate of the SVM's confidence in the correctness of the predicted label. In this paper, several algorithms are compared which scale the SVM decision function to obtain an estimate of the conditional class probability. A new, simple and fast method is derived from theoretical arguments and empirically compared to the existing approaches.
1 Introduction

Support Vector Machines (SVMs) have become a popular learning algorithm, in particular for large, high-dimensional classification problems. SVMs have been shown to give highly accurate classification results in a variety of applications. Several methods have been proposed to obtain not only a classification, but also an estimate of the SVM's confidence in the correctness of the predicted label.
Usually, the performance of a classifier is measured in terms of accuracy or some other performance measure based on the comparison of the classifier's prediction ŷ with the true class y. But in some cases, this does not give sufficient information. For example, in credit card fraud detection one usually has many more negative than positive examples, such that the optimal classifier may be the default negative classifier. But then one would still like to find out which transactions are most probably fraudulent, even if this probability is small. In other situations, e.g. information retrieval, one could be more interested in a ranking of the examples with respect to their interestingness instead of a simple yes/no decision. Third, one may be interested in integrating a classifier into a bigger system, for example a multi-classifier learner. To combine and compare the SVM prognosis with that of other learners, one would like a comparable, well-defined confidence estimate. The best way to achieve a confidence estimate that allows one to rank the examples and gives well-defined, interpretable values is to estimate the conditional class probability P(y|x). Obviously, this is a more complex problem than finding a classification l(x) ∈ {−1, 1}, as it is possible to obtain a classification function by comparing P̂(y|x) to the threshold 0.5, but not vice versa.
For numerical classifiers, i.e. classifiers of the type l(x) = sign(f(x)) with a numerical decision function f, one usually tries to estimate the conditional class probability from the decision function, P̂(y|x) = P̂(y|f(x)). This reduces the probability estimation from a multivariate to a one-dimensional problem, where one has to find a scaling function σ such that P̂(Y = 1|x) = σ(f(x)). The idea behind this approach is that the classification l(x) of examples that lie close to the decision boundary {x | f(x) = 0} can easily change when the examples are randomly perturbed by a small amount, whereas this is very hard for examples with very high or very low f(x) (this argument requires some sort of continuity or differentiability constraints on the function f). Hence, the probability that the classifier is correct should be higher for larger absolute values of f. As was noted by Platt [10], this also means there is a strong prior for selecting a monotonic scaling function σ.
The rest of the paper is organized as follows: In the next section, we will briefly present the Support Vector Machine and Kernel Logistic Regression algorithms, as far as is necessary for this paper. In Section 3, existing methods for probabilistic scaling of SVM outputs will be discussed and a new, simple scaling method will be presented. The effectiveness of this method will be empirically evaluated in Section 4.
2 Algorithms
2.1 Support Vector Machines
Support Vector Machines are a classification method based on Statistical Learning Theory [12]. The goal is to find a function f(x) = w · x + b that minimizes the expected risk

    R[f] = \int\!\!\int L(y, f(x)) \, dP(y|x) \, dP(x)

of the learner by minimizing the regularized risk R_reg[f], which is the weighted sum of the empirical risk with respect to the data (x_i, y_i), i = 1, ..., n, and a complexity term ||w||^2:

    R_reg[f] = \frac{1}{2} ||w||^2 + C \sum_i |1 - y_i f(x_i)|_+        (1)
where |t|_+ = max(t, 0). This optimization problem can be solved efficiently in its dual formulation:

    \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j \, (x_i \cdot x_j) - \sum_{i=1}^{n} \alpha_i \to \min        (2)

    \text{w.r.t.} \quad \sum_{i=1}^{n} \alpha_i y_i = 0, \qquad \forall i: \; 0 \le \alpha_i \le C.
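To make the dual formulation (2) concrete, the following sketch in plain NumPy (hypothetical helper names, linear kernel) evaluates the dual objective and checks the constraints for a given vector of multipliers alpha:

    import numpy as np

    def dual_objective(alpha, X, y):
        # Dual objective of Eq. (2) for the linear kernel:
        # 0.5 * sum_{i,j} alpha_i alpha_j y_i y_j (x_i . x_j) - sum_i alpha_i
        K = X @ X.T                  # Gram matrix of inner products x_i . x_j
        v = alpha * y
        return 0.5 * v @ K @ v - np.sum(alpha)

    def is_feasible(alpha, y, C, tol=1e-8):
        # Constraints of Eq. (2): sum_i alpha_i y_i = 0 and 0 <= alpha_i <= C
        return (abs(alpha @ y) < tol
                and np.all(alpha >= -tol)
                and np.all(alpha <= C + tol))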
2.2 The Kernel Trick

The inner product x_i · x_j in Equation 2 can be replaced by a kernel function K(x_i, x_j) which corresponds to an inner product in some space, called the feature space. That is, there exists a mapping Φ into the feature space such that K(x, x') = Φ(x) · Φ(x'). This allows the construction of non-linear classifiers by an essentially linear algorithm.

The resulting decision function is given by

    f(x) = w \cdot \Phi(x) + b = \sum_{i=1}^{n} y_i \alpha_i K(x_i, x) + b.

The actual SVM classification is given by sign(f(x)). It can be shown that the SVM solution depends only on its support vectors SV = {x_i | α_i ≠ 0}. See [12, 2] for a more detailed introduction to SVMs.
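As an illustration, the decision function can be evaluated directly from the support vector expansion. The following sketch (hypothetical helper names, RBF kernel chosen as an example, plain NumPy) assumes the multipliers α_i, labels y_i, support vectors x_i and the offset b have already been obtained from training:

    import numpy as np

    def rbf_kernel(x1, x2, gamma=1.0):
        # K(x, x') = exp(-gamma * ||x - x'||^2), one common kernel choice
        return np.exp(-gamma * np.sum((x1 - x2) ** 2))

    def decision_function(x, support_vectors, alphas, labels, b, kernel=rbf_kernel):
        # f(x) = sum_i y_i alpha_i K(x_i, x) + b, summed over the support
        # vectors (the examples with alpha_i != 0)
        return sum(y_i * a_i * kernel(x_i, x)
                   for x_i, a_i, y_i in zip(support_vectors, alphas, labels)) + b

    # the predicted class is sign(decision_function(x, ...))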
2.3 Kernel Logistic Regression

Kernel Logistic Regression (KLR) applies the kernel trick to the logistic regression model, which estimates the conditional class probability directly as

    P(y|x) = \frac{1}{1 + e^{-y(w \cdot x + b)}}.

The drawback of KLR is that typically all α_i are nonzero, as all examples play a role in estimating the conditional class probability, whereas the SVM needs only a small number of support vectors to classify the examples. Hence, KLR is computationally much more expensive than the SVM.
3 Scaling SVM Outputs

The value of the SVM decision function f(x) measures the (signed, scaled) distance of the example x from the hyperplane (w, b). Assuming that P(Y = 1|x) is continuous in x, it seems reasonable that examples lying closer to the hyperplane have a larger probability of being misclassified than examples lying far away (the closer the example is to the hyperplane, the smaller the changes that are needed to produce a different classification). Hence, it seems suitable to model the conditional class probability P(y|x) as a function of the value of the SVM decision function, i.e. P̂(Y = 1|x) = σ(f(x)) with an appropriate scaling function σ.
There are several ad-hoc scaling functions, e.g. the softmax scaler

    \sigma_{softmax}(z) = \frac{1}{1 + e^{-2z}},

which monotonically maps the decision function value z = f(x) to the interval [0, 1]. The scaler assumes that the class decision is given by sign(z), so that at z = 0 the classifier's confidence in its class decision is smallest; accordingly, z = 0 is mapped to the conditional class probability 0.5. This allows one to view σ_softmax(z) as a probability. However, this mapping is not very well founded, as the scaled values are not justified by the data.
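For illustration, the softmax scaler is a one-liner (a minimal sketch; the function name is ours):

    import numpy as np

    def softmax_scaler(z):
        # Ad-hoc scaling of a decision value z = f(x) to [0, 1]:
        # sigma_softmax(z) = 1 / (1 + exp(-2 z)); z = 0 is mapped to 0.5
        return 1.0 / (1.0 + np.exp(-2.0 * z))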
To justify the interpretation P̂(Y = 1|x) = σ(f(x)), it is better to use data to calibrate the scaling. One can use a subset of the data which has not been used for training (or use a cross-validation-like approach) and optimize the scaling function to minimize the error between the predicted class probability σ(f(x)) and the empirical class probability defined by the class values y in the new data. Two error measures are usually used, cross-entropy and mean squared error. Cross-entropy is defined by

    CRE = -\sum_i \left( y_i \log(z_i) + (1 - y_i) \log(1 - z_i) \right)

(where z_i = σ(f(x_i))), which corresponds to the Kullback-Leibler distance between the predicted and the empirical class probability. For the comparison of different data sets it is better to divide the cross-entropy by the number of examples and work with the mean cross-entropy mCRE. The mean squared error is defined by

    MSE = \frac{1}{n} \sum_i (y_i - p_i)^2.

It is an appropriate error measure because for a binary random variable Y ∈ {0, 1}, the expected value of (Y − p)² is minimized by p = P(Y = 1). Hence, the task of estimating the conditional class probability becomes a regression task. The open question is which types of scaling functions should be fitted to the data.
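Both error measures are straightforward to compute on a calibration set. A minimal sketch (labels coded as 0/1, hypothetical function names, with clipping added as a numerical safeguard that the formulas above do not contain):

    import numpy as np

    def mean_cross_entropy(y, p, eps=1e-12):
        # mCRE for labels y in {0, 1} and predicted probabilities p = sigma(f(x));
        # probabilities are clipped to avoid log(0)
        p = np.clip(p, eps, 1.0 - eps)
        return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

    def mean_squared_error(y, p):
        # MSE = 1/n * sum_i (y_i - p_i)^2
        return np.mean((y - p) ** 2)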
Motivated by an empirical analysis, Platt [10] uses scaling functions of the form

    \sigma_{a,b}(z) = \frac{1}{1 + e^{az + b}}

with a ≤ 0 to obtain a monotonically increasing function. The parameters a and b are found by minimizing the cross-entropy error over a test set (x_i, y_i) with z_i = f(x_i). For an efficient implementation, see [8].
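A simple way to fit a and b is generic numerical minimization of the cross-entropy. The sketch below (scipy-based, hypothetical function name) omits the numerical safeguards and target smoothing of the careful implementation discussed in [8]:

    import numpy as np
    from scipy.optimize import minimize

    def fit_platt_scaler(f_vals, y, eps=1e-12):
        # Fit sigma_{a,b}(z) = 1 / (1 + exp(a z + b)) by minimizing the
        # cross-entropy on a calibration set; labels y are coded as 0/1.
        def cross_entropy(params):
            a, b = params
            p = np.clip(1.0 / (1.0 + np.exp(a * f_vals + b)), eps, 1.0 - eps)
            return -np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
        # note: the constraint a <= 0 is not enforced in this simple sketch
        a, b = minimize(cross_entropy, x0=np.array([-1.0, 0.0]),
                        method="Nelder-Mead").x
        return lambda z: 1.0 / (1.0 + np.exp(a * z + b))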
Garczarek [4] proposes a method which scales classification values by

    \sigma(z) = B^{-1}_{\alpha_1,\beta_1}\!\left( B_{\alpha_2,\beta_2}(z) \right),
where B_{α,β} is the Beta distribution function with parameters α and β. The parameters α_1, β_1, α_2 and β_2 are selected such that over a test set (x_i, y_i)

1. the average value of σ(f(x)) for each class is identical to the classification performance of the classifier f in this class, and

2. the mean squared error (y − σ(f(x)))² is minimized.

Originally, the algorithm is designed for multi-class problems and computes an individual scaler for each predicted class. For binary problems, it is better to modify this approach such that only one scaler is generated. This avoids discontinuities in P̂(Y = 1|x) when the prediction changes from one class to the other.
Binning has also been applied to this problem [3]. The decision values are discretized into several bins, and the conditional class probability is estimated from the class distribution within each bin. Other, more complicated approaches also exist, see e.g. [7] or [12], Ch. 11.11.
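A binning scaler can be sketched as follows (equal-frequency bins are one possible choice; [3] does not prescribe a particular binning, and the names are ours). Labels y are again coded as 0/1 and passed as NumPy arrays:

    import numpy as np

    def fit_binning_scaler(f_vals, y, n_bins=10):
        # Discretize the decision values into (here: equal-frequency) bins and
        # estimate P(Y=1 | f(x)) by the fraction of positives in each bin.
        edges = np.quantile(f_vals, np.linspace(0.0, 1.0, n_bins + 1))
        idx = np.clip(np.searchsorted(edges, f_vals, side="right") - 1, 0, n_bins - 1)
        probs = np.array([y[idx == b].mean() if np.any(idx == b) else 0.5
                          for b in range(n_bins)])
        def scaler(z):
            b = np.clip(np.searchsorted(edges, z, side="right") - 1, 0, n_bins - 1)
            return probs[b]
        return scaler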
[Figure: example data (x, y) together with the decision function values of the SVM and of KLR.]
4 Experiments

The experiments were conducted on 11 data sets, including 7 data sets from the UCI Repository [9] (covtype, diabetes, digits, ionosphere, liver, mushroom, promoters) and 4 other real-world data sets: a business cycle analysis problem (business), an analysis of a direct mailing application (directmailing), a data set from a life insurance company (insurance) and intensive care patient monitoring data (medicine). Prior to learning, nominal attributes were binarised and the attributes were scaled to mean 0 and variance 1. Multi-class problems were converted to two-class problems by arbitrarily selecting two of the classes (covtype and digits) or by combining smaller classes into a single class (business, medicine). For the covtype data set, a 1% sample was drawn. The following table sums up the description of the data sets:
Experiments were conducted with Support Vector Machines and Kernel Logistic Regression with both linear and radial basis function kernels. The parameters of the algorithms were selected in a prior step to optimize accuracy. The following algorithms were compared in the experiments:

KLR: Kernel Logistic Regression, used as the baseline.
For the linear kernel, the following results were obtained:

Method               MSE      mCRE
KLR                  0.1000   0.0332
SVM-Platt            0.0912   0.0291
SVM-Beta             0.5966   1
SVM-Beta-2           0.0915   0.0301
SVM-Bin (10 bins)    0.1201   0.0384
SVM-Bin (50 bins)    0.1301   0.0415
SVM-Softmax          0.0975   0.0343
SVM-01               0.0970   0.0317
SVM-PP               0.0933   0.0296
With respect to the mean squared error, this gives the following ranking (where "<" means "better than"): SVM-Platt < SVM-Beta-2 < SVM-PP < SVM-01 < SVM-Softmax < KLR < SVM-Bin-10 < SVM-Bin-50 << SVM-Beta. Sorting by mean cross-entropy, SVM-Beta-2 and SVM-PP change places, as do SVM-Softmax and SVM-Bin-10.

The RBF kernel gave the following results:

This gives the following ranking for MSE: KLR < SVM-Platt < SVM-Beta-2 < SVM-PP < SVM-01 < SVM-Bin-10 < SVM-Softmax < SVM-Bin-50 << SVM-Beta.
A closer inspection reveals that these results do not give the full picture, as the error measures reach very different values on the individual data sets. E.g., the MSE for Kernel Logistic Regression with the radial basis kernel ranges from 10^{-7} (mushroom) to 0.191 (liver). To allow for a better comparison, the methods were ranked according to their performance on each data set. The following table gives the average rank of each of the methods for the linear kernel:
Method               avg. rank (MSE)   avg. rank (mCRE)
KLR                  3.18              3.09
SVM-Platt            3.18              3.45
SVM-Beta             9.00              9.00
SVM-Beta-2           3.27              3.45
SVM-Bin (10 bins)    5.18              5.55
SVM-Bin (50 bins)    6.55              6.45
SVM-Softmax          5.18              5.36
SVM-01               4.91              5.09
SVM-PP               3.45              3.55
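The average ranks shown above can be reproduced from a matrix of per-data-set errors with a few lines. A sketch with hypothetical names, using scipy's rankdata:

    import numpy as np
    from scipy.stats import rankdata

    def average_ranks(errors):
        # errors: array of shape (n_datasets, n_methods), e.g. the MSE of every
        # method on every data set. Rank the methods per data set (rank 1 =
        # lowest error, ties get averaged ranks) and average over the data sets.
        ranks = np.vstack([rankdata(row) for row in errors])
        return ranks.mean(axis=0)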
To validate the significance of the results, a paired t-test (α = 0.05) was run over the cross-validation runs. The following table compares the cross-entropy for the linear kernel among the best five of the scaling algorithms and the KLR baseline. Each entry shows for how many data sets the hypothesis that the estimator in its row is better than the estimator in its column was rejected. E.g., the 6 in the last row and first column shows that the hypothesis that softmax scaling is better than KLR was rejected for 6 of the data sets. The contrary hypothesis was rejected on 2 data sets (first row, last column).

These are the results for cross-entropy and the radial basis kernel:
         KLR   Platt  Beta2  PP   Bin10  Soft
KLR       0     0      0     0     0      0
Platt     6     0      0     1     0      0
Beta2     7     6      0     4     1      0
PP        7     5      4     0     2      0
Bin10     8     3      3     3     0      2
Soft      9     9      7     9     6      0
5 Summary

The experiments in this paper showed that a trivial method of estimating the conditional class probability P(y|x) from the output of an SVM classifier performs comparably to much more complicated estimation techniques.

Acknowledgments

The financial support of the Deutsche Forschungsgemeinschaft (SFB 475, "Reduction of Complexity for Multivariate Data Structures") is gratefully acknowledged.
References
[1] Peter L. Bartlett and Ambuj Tewari. Sparseness vs estimating conditional proba-
bilities: Some asymptotic results. submitted, 2004.
[2] C. Burges. A tutorial on support vector machines for pattern recognition. Data
Mining and Knowledge Discovery, 2(2):121–167, 1998.
[3] Joseph Drish. Obtaining calibrated probability estimates from support vector ma-
chines. Technical report, University of California, San Diego, June 2001.
[4] Ursula Garczarek. Classification Rules in Standardized Partition Spaces. PhD
thesis, Universität Dortmund, 2002.
[5] T. S. Jaakkola and D. Haussler. Probabilistic kernel regression models. In Pro-
ceedings of the 1999 Conference on AI and Statistics, 1999.
[6] S. S. Keerthi, K. Duan, S. K. Shevade, and A.N. Poo. A fast dual algorithm for
kernel logistic regression. Submitted for publication in Machine Learning.
[7] James Tin-Yau Kwok. Moderating the outputs of support vector machine clas-
sifiers. IEEE Transactions on Neural Networks, 10(5):1018–1031, September
1999.
[8] H.-T. Lin, C.-J. Lin, and R. C. Weng. A note on Platt's probabilistic outputs for support vector machines, May 2003.
[9] P. M. Murphy and D. W. Aha. UCI repository of machine learning databases,
1994.
[10] John Platt. Advances in Large Margin Classifiers, chapter Probabilistic Outputs
for Support Vector Machines and Comparisons to Regularized Likelihood Meth-
ods. MIT Press, 1999.
[11] Volker Roth. Probabilistic discriminative kernel classifiers for multi-class prob-
lems. In B. Radig and S. Florczyk, editors, Pattern Recognition–DAGM’01, num-
ber 2191 in LNCS, pages 246–253. Springer, 2001.
[12] V. Vapnik. Statistical Learning Theory. Wiley, Chichester, GB, 1998.
[13] Grace Wahba. Advances in Kernel Methods - Support Vector Learning, chapter
Support Vector Machines, Reproducing Kernel Hilbert Spaces and the Random-
ized GACV, pages 69–88. MIT Press, 1999.
[14] Ji Zhu and Trevor Hastie. Kernel logistic regression and the import vector ma-
chine. In Neural Information Processing Systems, volume 14, 2001.