Support Vector Machines as Probabilistic Models
p(y|x). In contrast, we argue that the SVM implies a (non-uniform) marginal p(x), giving it a generative touch. Second, it is alluring to expect a model whose ML/MAP solution exactly agrees with the standard SVM. While we do not know whether this is possible at all, we did succeed in recovering an alternative parameterization of the SVM as an ML estimate of a suitable model. This is analogous to the ν-SVM reformulation of the standard SVM (Schölkopf et al., 2000), in which the rather unintuitive regularization parameter is replaced by a parameter controlling the number of SVs. In our model, the hyperparameter will be the length ‖w‖ of the hyperplane normal.

After a brief review of the SVM (Section 2), we present our model (Section 3.1), which is generative and semi-parametric. For technical reasons we restrict our analysis to the SVM classifier without the bias term.² The core result, the equivalence of ML in our model with the SVM, is presented and proved in Section 4. In Section 5 we demonstrate how max-margin clustering drops out of our model; after this, we conclude.

The per-class cost-factors are ω⁺ = ω and ω⁻ = 1 − ω for the positive and negative class, respectively. For any fixed λ > 0, ω ≥ 0, the function R(w; ω) is convex and the function F(w; λ, ω) is strictly convex.

For given λ and ω, the SVM learning algorithm returns the parameter vector w_SVM(λ, ω), which is the unique minimum of F(w; λ, ω), i.e.,

    w_SVM(λ, ω) = argmin_{w∈R^n} F(w; λ, ω) .        (2)

The problem (2) is well understood and there exists a plethora of efficient optimization algorithms for its solution.
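As a concrete illustration of (2), here is a minimal subgradient sketch, assuming the common weighted-hinge form F(w; λ, ω) = (λ/2)‖w‖² + (1/m) Σ_i ω^{y_i} ℓ(y_i⟨w, x_i⟩) with the hinge loss ℓ(s) = max(0, 1 − s) and cost-factors ω⁺ = ω, ω⁻ = 1 − ω; the step size and iteration budget are ad hoc, so this is a sketch rather than one of the dedicated solvers mentioned above.

```python
import numpy as np

def svm_objective(w, X, y, lam, omega):
    """F(w; lambda, omega): L2 regularizer plus class-weighted hinge risk (assumed form)."""
    cost = np.where(y == +1, omega, 1.0 - omega)      # omega^{y_i}
    hinge = np.maximum(0.0, 1.0 - y * (X @ w))        # l(y_i <w, x_i>)
    return 0.5 * lam * (w @ w) + np.mean(cost * hinge)

def train_svm_no_bias(X, y, lam, omega=0.5, steps=20000):
    """Subgradient descent for w_SVM(lambda, omega) = argmin_w F(w; lambda, omega), no bias term."""
    m, n = X.shape
    cost = np.where(y == +1, omega, 1.0 - omega)
    w = np.zeros(n)
    for t in range(1, steps + 1):
        margins = y * (X @ w)
        viol = margins < 1.0                          # points whose hinge term is active
        grad = lam * w
        if viol.any():
            grad = grad - (cost[viol] * y[viol]) @ X[viol] / m
        w = w - grad / (lam * t)                      # Pegasos-style 1/(lambda*t) step size
    return w

# toy usage: two Gaussian blobs, lambda = 0.1, balanced cost-factors (omega = 0.5)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+1.0, 1.0, (50, 2)), rng.normal(-1.0, 1.0, (50, 2))])
y = np.hstack([np.ones(50), -np.ones(50)])
w = train_svm_no_bias(X, y, lam=0.1)
print(np.linalg.norm(w), svm_objective(w, X, y, lam=0.1, omega=0.5))
```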
The SVM algorithm specifies how to learn the parameter vector w, while the hyper-parameters λ and ω must be determined differently. The standard SVM sets ω = 1/2. However, tuning ω is a routinely used heuristic in the case of an unbalanced class distribution. A common practice is to select the best combination of λ and ω based on solving
Theorem 1 Let h : R^n → R be a piece-wise continuous function which satisfies the following assumptions:

1. 0 ≤ h(x), ∀x ∈ R^n (positive),

2. 0 < ∫_{x∈R^n} h(x) dx < ∞ (integrable),

3. h(x) = h(x′), ∀x, x′ ∈ R^n such that ‖x‖ = ‖x′‖ (radial basis function).

Then for any τ ∈ R_{++}, ω ∈ (0, 1) and u ∈ U, the integrals I^+(τ, ω, u) and I^−(τ, ω, u) defined by

    I^y(τ, ω, u) = ∫_{x∈R^n} exp(−ω^y ℓ(y⟨τu, x⟩)) · h(x) · dx        (5)

satisfy the following properties:

1. 0 < I^y(τ, ω, u) (strictly positive),

2. I^y(τ, ω, u) < ∞ (finite),

3. I^y(τ, ω, u) = I^y(τ, ω) (invariant to u).

proof: For fixed τ, ω, u, we introduce the shorthand g(x) = exp(−ω^y ℓ(y⟨τu, x⟩)), which simplifies (5) to I^y(τ, ω, u) = ∫_{x∈R^n} h(x) g(x) dx. It is seen that g : R^n → (0, 1].

Because h is piece-wise continuous and its integral is strictly positive, there must exist µ ∈ R^n, r > 0 and ε₁ > 0 such that for all x within the ball B(µ, r) = {x′ ∈ R^n | ‖x′ − µ‖ ≤ r} the value of h(x) is not less than ε₁. The volume V of B(µ, r) is greater than 0. As g is strictly positive everywhere, there must exist ε₂ > 0 such that g(x) ≥ ε₂, ∀x ∈ B(µ, r). This implies that I^y(τ, ω, u) ≥ ε₁ε₂V > 0, which proves the property 1.

The property 2 follows from the integrability of h and the boundedness of g.

Finally, we prove the property 3. Let u and u′ be arbitrary unit vectors. Then, there exists an orthogonal matrix R ∈ R^{n×n} with determinant +1 (i.e., a rotation matrix) such that u′ = Ru. Let ϕ : R^n → R^n be a vector-valued function defined by ϕ(v) = Rv. It is seen that the determinant of the Jacobian matrix of ϕ is D_ϕ(v) = +1. We can write

    I^y(τ, ω, u′) = ∫_{x∈R^n} exp(−ω^y ℓ(y⟨τu′, x⟩)) · h(x) · dx
                  = ∫_{v∈R^n} exp(−ω^y ℓ(y⟨τu′, ϕ(v)⟩)) · h(ϕ(v)) · |D_ϕ(v)| · dv
                  = ∫_{v∈R^n} exp(−ω^y ℓ(y⟨τu, v⟩)) · h(v) · dv
                  = I^y(τ, ω, u) .

The equalities 2 and 3 use the substitution x = ϕ(v), the property h(x) = h(‖x‖) ensured by the assumption 3, the fact that |D_ϕ(v)| = 1, and the equality ⟨τu′, ϕ(v)⟩ = ⟨τRu, Rv⟩ = ⟨τu, v⟩. The equalities 1 and 4 are due to the definition (5), which completes the proof.

Two examples of functions satisfying the assumptions of Theorem 1 are h₁(x) = exp(−⟨x, c₁Ex⟩) and h₂(x) = c₂[[‖x‖ ≤ c₃]], where c₁, c₂ and c₃ are arbitrary strictly positive scalars and E is the identity matrix.
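The properties can also be checked numerically. The sketch below Monte-Carlo-estimates I^y(τ, ω, u) for the radial function h(x) = exp(−‖x‖²) (the example h₁ above with c₁ = 1), assuming ℓ is the standard SVM hinge loss ℓ(s) = max(0, 1 − s); since exp(−‖x‖²) = π^{n/2} N(x; 0, I/2), sampling from that Gaussian yields the estimate. The values come out strictly positive, finite and, up to sampling noise, identical for different unit vectors u.

```python
import numpy as np

def hinge(s):
    return np.maximum(0.0, 1.0 - s)

def I_mc(y, tau, omega, u, n_samples=200_000, seed=0):
    """Monte Carlo estimate of I^y(tau, omega, u) in (5) for h(x) = exp(-||x||^2).

    Since exp(-||x||^2) = pi^(n/2) * N(x; 0, I/2), sample x ~ N(0, I/2) and average
    the integrand exp(-omega^y * l(y <tau*u, x>)) under that Gaussian.
    """
    rng = np.random.default_rng(seed)
    n = len(u)
    x = rng.normal(0.0, np.sqrt(0.5), size=(n_samples, n))
    omega_y = omega if y == +1 else 1.0 - omega       # omega^+ = omega, omega^- = 1 - omega
    scores = y * (x @ (tau * u))
    return np.pi ** (n / 2) * np.mean(np.exp(-omega_y * hinge(scores)))

# property 3: the value does not depend on the direction u (only on tau and omega)
for u in (np.array([1.0, 0.0]), np.array([0.6, 0.8])):
    print([round(I_mc(y, tau=2.0, omega=0.3, u=u), 3) for y in (+1, -1)])
```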
Corollary 1 Let h be a function satisfying the assumptions of Theorem 1 and let us define

    Z(τ, ω) = 1 / ( I^+(τ, ω) + I^−(τ, ω) ) .        (7)

Then for any fixed τ ∈ R_{++}, ω ∈ (0, 1) and u ∈ U, the function p(x, y; τ, ω, u) given by (4) is a proper p.d.f. defined over R^n × {+1, −1}, that is,

    p(x, y; τ, ω, u) ≥ 0 , ∀x ∈ R^n , y ∈ {+1, −1} ,
    ∫_{x∈R^n} Σ_{y∈{+1,−1}} p(x, y; τ, ω, u) dx = 1 .

3.2. Prior probability

It follows from (7) that the prior probability of the class label y under the model (4) is given by

    p(y; τ, ω) = I^y(τ, ω) / ( I^+(τ, ω) + I^−(τ, ω) ) .        (8)

The prior probability does not depend on the parameter u. Moreover, we have the following theorem.

Theorem 2 For any τ ∈ R_{++} it holds that

    p(y = +1; τ, ω) = 0.5 for ω = 0.5 ,
    p(y = +1; τ, ω) < 0.5 for ω > 0.5 ,
    p(y = +1; τ, ω) > 0.5 for ω < 0.5 .

proof: It follows from the equality (6) that

    I^y(τ, ω) = ∫_{x∈R^n} f^y(x) · h(x) · dx ,        (9)
The first equality follows from the fact that W_τ is a subset of R^n containing the minimizer w(λ). Since all vectors in W_τ have the same norm, the second equality holds true. The third equality results from the variable substitution w = τu. This proves that for any λ ∈ R_{++} and τ = θ(λ) = ‖w(λ)‖ the equality (18) holds.

It remains to prove the monotonicity of θ, i.e., that λ₁ > λ₂ implies θ(λ₁) ≤ θ(λ₂). However, this is a direct consequence of Lemma 1 and the fact that θ(λ) = ‖w(λ)‖.

4.1. Standard SVM classifier

The formulation of the standard SVM assumes that the cost-factors for both classes are equal, i.e., ω = 0.5. Recall that by Theorem 2 the value ω = 0.5 implies that the prior probability of our model is uniform, i.e., p(y; τ, ω) = 0.5.

Theorem 4 Let ω = 0.5. Then, for any linear SVM classifier there exists an equivalent plug-in Bayes classifier derived from the model (4) whose parameters are estimated by the ML principle, i.e., the equality

    q_SVM(x; w_SVM(λ, ω)) = q_Bayes(x; τ, ω, u_ML(τ, ω))

holds for any x ∈ R^n, λ ∈ R_{++} and τ = θ(λ) = ‖w_SVM(λ, ω)‖. Moreover, the mapping θ : R_{++} → R_+ is monotonically decreasing.

proof: The proof follows trivially from Theorem 3 and the formulas for the linear SVM classifier (1) and the Bayes classifier (12).

Note that the standard SVM is theoretically linked to our model in much the same way the ν-SVM is linked to the standard SVM. In our case, training an SVM using λ would tell us which τ to use to get the same result; in the other case, training a ν-SVM using an a priori chosen ν would tell us which λ to use in the standard SVM to get the same result (see Proposition 6 in (Schölkopf et al., 2000)).

Now let us compare the SVM and the ML learning from a more practical point of view. We still assume the standard setting ω = 0.5. The SVM learning requires the user to supply a tuning set Λ = {λ₁, . . . , λ_p} for the hyper-parameter λ. The resulting SVM classifier is obtained by the procedure (3), which selects the best parameter vector from w_SVM(λ, 0.5), λ ∈ Λ, based on a validation criterion. The ML learning requires the user to supply a tuning set T = {τ₁, . . . , τ_p} for the hyper-parameter τ. The resulting plug-in Bayes classifier is obtained by the procedure (15), which selects the best parameter vector from u_ML(τ, 0.5), τ ∈ T, based on a validation criterion. Theorem 4 guarantees that both procedures will return exactly the same linear classifier provided we use T = {‖w_SVM(λ₁, 0.5)‖, . . . , ‖w_SVM(λ_p, 0.5)‖} and the same validation criterion in (3) and (15). Hence an empirical comparison is unnecessary.

Note that the tuning sets are heuristic in both cases. While this is outside the scope of this paper, we imagine that the direct geometric interpretation of τ as the reciprocal of the margin width will facilitate heuristics for finding reasonable settings, which could provide a practical advantage.
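To make the correspondence concrete, the sketch below builds the tuning set T from a tuning set Λ by recording τ_i = ‖w_SVM(λ_i, 0.5)‖. It uses scikit-learn's LinearSVC without an intercept as a stand-in for an unbiased linear SVM solver; the conversion C = 1/(λm) to its cost parameter assumes the weighted-hinge form of F used in the earlier sketch, so the numbers are only illustrative.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+1.0, 1.0, (100, 2)), rng.normal(-1.0, 1.0, (100, 2))])
y = np.hstack([np.ones(100), -np.ones(100)])
m = len(y)

lambdas = [10.0, 1.0, 0.1, 0.01]        # tuning set Lambda supplied to the SVM learning
taus = []
for lam in lambdas:
    # unbiased linear SVM with omega = 0.5; C = 1/(lam*m) converts lambda to LinearSVC's cost
    clf = LinearSVC(C=1.0 / (lam * m), loss="hinge", fit_intercept=False, max_iter=100_000)
    clf.fit(X, y)
    taus.append(np.linalg.norm(clf.coef_.ravel()))

# T = {||w_SVM(lambda_i, 0.5)||}: the tuning set under which the ML procedure (15) returns
# exactly the same classifiers; tau does not increase as lambda increases.
print(dict(zip(lambdas, np.round(taus, 3))))
```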
4.2. SVM classifier with different cost factors

It is common knowledge that SVMs often do not work well if the class distribution of the training examples is highly unbalanced. In view of the previous section this is not surprising, as the standard SVM classifier is equivalent to a Bayes classifier which assumes uniform prior probabilities. To cope with unbalanced classes, SVM practitioners routinely use two heuristics:

1. Set a higher cost-factor for the class which is less represented in the training data. For example, if the first class is the smaller class, then set ω > 0.5, i.e., ω⁺ > ω⁻. This option is supported by all major SVM solvers such as SVMlight (Joachims, 1999). A proper setting of the cost-factor ω is then tuned as an additional hyper-parameter.

2. After the linear SVM classifier is trained, tune only the bias of the classifier to achieve the desired error rate. Recall that the standard SVM classifier (1) is unbiased.

Let us confront these heuristics with the plug-in Bayes classifier (12) whose parameters are obtained by the ML estimator (14). Theorem 2 shows that the hyper-parameter ω directly controls the prior probability p(y; τ, ω). This is perfectly consistent with the first heuristic. For example, if the first class is less frequent in the training data, then, according to Theorem 2, we should set ω > 0.5 to guarantee that p(y = +1; τ, ω) < p(y = −1; τ, ω) holds, and vice versa.

Regarding the second heuristic, we showed that the plug-in Bayes classifier under the model (4) is a linear classification rule (12) with a bias term b = 2ω − 1. That is, for uniform priors the bias is set to 0, while for unbalanced classes the bias is negative if the first class is more probable (and positive otherwise). Note that the probabilistic model exactly specifies the value of the bias term given the hyper-parameter ω. By contrast, in classical SVM training the bias must be tuned as an additional hyper-parameter.
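For illustration, the sketch below implements the plug-in decision directly from the model, assuming ℓ is the hinge loss: a point x is labelled by argmax_y p(y | x; τ, ω, u) = argmin_y ω^y ℓ(y⟨τu, x⟩). Scanning the score s = ⟨τu, x⟩ shows that the decision threshold equals 2ω − 1, i.e., the bias is fixed by ω exactly as discussed above.

```python
import numpy as np

def hinge(s):
    return np.maximum(0.0, 1.0 - s)

def q_bayes(X, tau_u, omega):
    """Plug-in Bayes decision under the model: argmax_y p(y|x) = argmin_y omega^y * l(y <tau*u, x>)."""
    s = X @ tau_u
    risk_pos = omega * hinge(s)              # penalty if x is labelled +1
    risk_neg = (1.0 - omega) * hinge(-s)     # penalty if x is labelled -1
    return np.where(risk_pos <= risk_neg, +1, -1)

# the threshold on the score s = <tau*u, x> is fixed by omega (it equals 2*omega - 1),
# whereas a standard SVM would have to tune such a bias as an extra hyper-parameter
scores = np.linspace(-1.5, 1.5, 601)
for omega in (0.3, 0.5, 0.7):
    labels = q_bayes(scores.reshape(-1, 1), np.array([1.0]), omega)
    print(omega, round(scores[labels == +1].min(), 3), round(2 * omega - 1, 3))
```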
5. Equivalence between Maximum Margin Clustering and the Classification Maximum Likelihood approach

Maximum Margin Clustering (MMC) (Xu et al., 2004) is a popular heuristic that transfers the maximum margin principle from supervised SVM learning to the unsupervised setting. In this section we show how our model (4) theoretically justifies MMC. In particular, we demonstrate how MMC emerges from our model by applying the Classification Maximum Likelihood (CML) approach (Scott & Symons, 1971), which is a statistically motivated and theoretically well-understood clustering method.

We consider the clustering problem defined as follows. Let {(x₁, y₁), . . . , (x_m, y_m)} ∈ (R^n × {+1, −1})^m be drawn i.i.d. from an underlying distribution p*(x, y). Let us assume that we are given only the observations {x₁, . . . , x_m} and our goal is to estimate the corresponding hidden labels {y₁, . . . , y_m}.

As before, we assume an unbiased linear classifier (i.e., the hyperplane passes through the origin). With this, (23) is a well-posed problem—unlike the variant with a biased classifier, which has a trivial solution assigning all observations to just one class. This complication can also be solved by introducing an additional balance constraint enforcing solutions with a prescribed number of labels in each class. Note that all derivations below can easily be repeated with the balance constraint to recover the biased variant of the MMC.

5.1. Classification Maximum Likelihood approach to clustering

Assume a conditional density p(x | y; θ) parametrized by θ ∈ Θ. Given the observations {x₁, . . . , x_m}, the NLL of labels y ∈ {+1, −1}^m and a parameter θ is

    L(y, θ) = − Σ_{i=1}^{m} log p(x_i | y_i ; θ) .

The CML approach finds the labels by solving

    y_CML ∈ argmin_{y∈{+1,−1}^m} min_{θ∈Θ} L(y, θ) .

The CML assumes that both the vector θ and the labels y are the parameters to be estimated.
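As a schematic illustration of the CML principle (not a procedure from the paper), the sketch below minimizes L(y, θ) jointly over labelings and parameters by brute force; the class-conditional density, its ML fit and the toy data are hypothetical placeholders, and the exhaustive search over labelings is only feasible for a handful of points.

```python
import itertools
import numpy as np

def cml_nll(X, y, neg_log_pdf, theta):
    """L(y, theta) = - sum_i log p(x_i | y_i; theta)."""
    return sum(neg_log_pdf(x, yi, theta) for x, yi in zip(X, y))

def cml_labels(X, neg_log_pdf, fit_theta):
    """Brute-force CML: y_CML in argmin_y min_theta L(y, theta)."""
    best_val, best_y = np.inf, None
    for y in itertools.product((+1, -1), repeat=len(X)):
        theta = fit_theta(X, y)                     # inner ML estimate for this fixed labelling
        val = cml_nll(X, y, neg_log_pdf, theta)
        if val < best_val:
            best_val, best_y = val, y
    return np.array(best_y)

# hypothetical instantiation: 1-D Gaussian class-conditionals with unknown means theta = (mu_pos, mu_neg)
def neg_log_pdf(x, yi, theta):
    mu = theta[0] if yi == +1 else theta[1]
    return 0.5 * (x - mu) ** 2 + 0.5 * np.log(2.0 * np.pi)

def fit_theta(X, y):
    y = np.asarray(y)
    mu_pos = X[y == +1].mean() if (y == +1).any() else 0.0
    mu_neg = X[y == -1].mean() if (y == -1).any() else 0.0
    return (mu_pos, mu_neg)

X = np.array([-2.1, -1.9, 2.0, 2.2])
print(cml_labels(X, neg_log_pdf, fit_theta))        # recovers the two clusters (up to label swap)
```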
Let us instantiate the CML approach for our model. We assume that the hyper-parameter ω = 0.5 and that τ is fixed by other means (e.g., tuned on a validation set). Recall that ω = 0.5 implies the uniform prior p(y; τ, 0.5) = 0.5. Then, the NLL under the model (4) reads

    L(y, u; τ) = − Σ_{i=1}^{m} [ log h(x_i) − ω^{y_i} ℓ(y_i⟨τu, x_i⟩) − log I^{y_i}(τ, ω) ] ,

and the labels are found by solving

    y_CML(τ) = argmin_{y∈{+1,−1}^m} min_{u∈U} L(y, u; τ)
             = argmin_{y∈{+1,−1}^m} min_{u∈U} R_MMC(y, τu) .        (24)

proof: Because y* ∈ y_MMC(λ) is a minimizer of the problem (23), the inequality

    min_{w∈R^n} [ (λ/2)‖w‖² + R_MMC(y*, w) ] ≤ min_{w∈R^n} [ (λ/2)‖w‖² + R_MMC(y, w) ]        (25)

holds ∀y ∈ {+1, −1}^m. Let us denote the minimizer of the left hand side of (25) as w* = argmin_{w∈R^n} [ (λ/2)‖w‖² + R_MMC(y*, w) ], which is unique as the objective is strictly convex for fixed y*. Let us denote τ = ‖w*‖ and W_τ = {w ∈ R^n | ‖w‖ = τ}. Then, we can derive from (25) that the inequality

    min_{w∈W_τ} R_MMC(y*, w) ≤ min_{w∈W_τ} R_MMC(y, w)        (26)
holds ∀y ∈ {+1, −1}^m. To get from (25) to (26), we used the fact that all vectors in W_τ have the same norm and that one of them is the minimizer w*. The inequality (26) implies that y* is a minimizer of

    y* ∈ argmin_{y∈{+1,−1}^m} min_{w∈W_τ} R_MMC(y, w)
       = argmin_{y∈{+1,−1}^m} min_{u∈U} R_MMC(y, τu) ,

where the latter equality, obtained by the variable substitution w = τu, is just the definition of the CML problem (24), which was to be proved.

The established correspondence between the MMC and the CML not only provides a theoretical justification of the MMC but also opens ways for its extension. First, to cope with unbalanced data one can tune the hyper-parameter ω, which corresponds to changing the prior probability p(y; τ, ω). Second, the hard problem (23) required by the MMC can be attacked by algorithms routinely used for minimization of the CML criterion. Namely, the Classification EM algorithm (CEM) (Celeux & Govaert, 1992) is a simple iterative procedure which transforms the hard unsupervised problem into a series of much simpler supervised problems. In turn, existing SVM solvers can be readily recycled for solving the MMC.
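A minimal alternating sketch in this spirit: the classification step relabels the points with the current unbiased linear classifier and the supervised step retrains an SVM on the imputed labels, with scikit-learn's LinearSVC standing in as the recycled solver. The random initialization, the re-seeding of degenerate labelings and the absence of a balance constraint are ad hoc simplifications, so this illustrates the idea rather than reproducing the MMC algorithm of Xu et al. (2004).

```python
import numpy as np
from sklearn.svm import LinearSVC

def mmc_cem(X, lam=0.1, iters=20, seed=0):
    """Alternate a classification step (relabel by the current classifier) with a
    supervised SVM step (retrain an unbiased linear SVM on the imputed labels)."""
    rng = np.random.default_rng(seed)
    m = len(X)
    y = rng.choice([-1.0, 1.0], size=m)              # random initial labelling
    for _ in range(iters):
        if len(np.unique(y)) < 2:                    # degenerate labelling: re-seed and continue
            y = rng.choice([-1.0, 1.0], size=m)
            continue
        clf = LinearSVC(C=1.0 / (lam * m), loss="hinge", fit_intercept=False, max_iter=100_000)
        clf.fit(X, y)                                # supervised step for the fixed labels
        y_new = np.where(X @ clf.coef_.ravel() >= 0.0, 1.0, -1.0)   # classification step
        if np.array_equal(y_new, y):                 # fixed point reached
            break
        y = y_new
    return y

# toy usage: two well-separated blobs end up with opposite labels
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(+2.0, 0.5, (40, 2)), rng.normal(-2.0, 0.5, (40, 2))])
labels = mmc_cem(X)
print(labels[:40].mean(), labels[40:].mean())        # approximately +1 and -1 (or swapped)
```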
6. Conclusion

The received wisdom in machine learning has so far been that maximum margin SVM learning and probabilistic models constitute two separate sub-domains of machine learning. Our work has been motivated by the unsettling fact that SVM-like methods, albeit rooted in learning theory and powerful and efficient in practice, do not enjoy the principled view on modeling offered by probabilistic methods. In this contribution, we heal this rupture by setting up a probabilistic model that is equivalent to the SVM. So far, however, this work is limited to linear SVMs without bias; whether and how kernelization can be incorporated remains to be investigated.

Apart from the theoretical satisfaction of unification, the probabilistic understanding of the SVM can also lead to further insight. As an example, we demonstrate how a common and empirically successful heuristic for dealing with unbalanced class sizes can be understood in terms of biased priors, and how maximum margin clustering is naturally linked to the generic CML (Classification Maximum Likelihood) principle. Further work on semi-supervised SVMs is in progress, and we anticipate that many more such relationships will be discovered.

Acknowledgments

VF was supported by the Czech Ministry of Education project 1M0567 and by EC projects FP7-ICT-247525 HUMAVIPS and PERG04-GA-2008-239455 SEMISOL.

References

Bartlett, P. L. and Tewari, A. Sparseness vs estimating conditional probabilities: Some asymptotic results. Journal of Machine Learning Research, 8:775–790, 2007.

Celeux, G. and Govaert, G. A classification EM algorithm for clustering and two stochastic versions. Computational Statistics & Data Analysis, 14(3), 1992.

Grandvalet, Y., Mariéthoz, J., and Bengio, S. Interpretation of SVMs with an application to unbalanced classification. Advances in Neural Information Processing Systems, NIPS 18, 2005.

Joachims, T. Making large-scale SVM learning practical. In Schölkopf, B., Burges, C. J. C., and Smola, A. J. (eds.), Advances in Kernel Methods—Support Vector Learning, pp. 169–184. MIT Press, Cambridge, MA, 1999.

Platt, J. C. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In Smola, A. J. et al. (eds.), Advances in Large Margin Classifiers. MIT Press, Cambridge, MA, 2000.

Schölkopf, B. and Smola, A. J. Learning with Kernels. MIT Press, Cambridge, MA, 2002.

Schölkopf, B., Smola, A. J., Williamson, R. C., and Bartlett, P. L. New support vector algorithms. Neural Computation, 12(5):1207–1245, 2000.

Scott, A. J. and Symons, M. J. Clustering methods based on likelihood ratio criteria. Biometrics, 27:387–389, 1971.

Sollich, P. Bayesian methods for support vector machines: Evidence and predictive class probabilities. Machine Learning, 46(1):21–52, 2002.

Steinwart, I. and Christmann, A. Support Vector Machines. Springer, New York, 2008.

Vapnik, V. Statistical Learning Theory. John Wiley and Sons, New York, 1998.

Xu, L., Neufeld, J., Larson, B., and Schuurmans, D. Maximum margin clustering. In Proc. of Neural Information Processing Systems (NIPS), 2004.