
Support Vector Machines as Probabilistic Models

Vojtěch Franc [email protected]


Czech Technical University in Prague, Technická 2, 166 27 Praha 6, Czech Republic
Alex Zien [email protected]
LIFE Biosystems GmbH, Belfortstr. 2, 69115 Heidelberg, Germany
Bernhard Schölkopf [email protected]
Max Planck Institute for Intelligent Systems, Spemannstr. 38, 72076 Tübingen, Germany

Appearing in Proceedings of the 28th International Conference on Machine Learning, Bellevue, WA, USA, 2011. Copyright 2011 by the author(s)/owner(s).

Abstract

We show how the SVM can be viewed as a maximum likelihood estimate of a class of probabilistic models. This model class can be viewed as a reparametrization of the SVM in a similar vein to the ν-SVM reparametrizing the classical (C-)SVM. It is not discriminative, but has a non-uniform marginal. We illustrate the benefits of this new view by re-deriving and re-investigating two established SVM-related algorithms.

1. Introduction

The SVM is one of the most used and best studied machine learning models. As such, many aspects of it have been assayed, including its links to learning theory and regularization, its geometry, the influence of kernels on its regularization, its consistency, and its efficient optimization (Vapnik, 1998; Schölkopf & Smola, 2002; Steinwart & Christmann, 2008). This understanding has laid the ground for significant developments including recent methods for efficient and effective structured output learning.

However, one major gap in our understanding of SVMs still persists: the attempts to place it in a probabilistic framework have remained unsatisfactory. For instance, (Sollich, 2002) interprets the hinge loss as − log p(y|x), which necessitates artificially introducing a "don't-know" class. Counter-intuitively, its probability, a notion of predictive uncertainty, is minimal at the border of the margin (i.e., for f(x) = ±1) and increases when moving further away from the decision surface.

In (Grandvalet et al., 2005) the SVM objective is taken as an approximation to the negative log-likelihood such that the SVM outputs are translated into probability intervals. In a practical but also heuristic approach, (Platt, 2000) suggested to retrospectively fit a logit function to map (non-probabilistic) SVM outputs to probabilities. This works well and has become the standard, but fails to provide insight.

In fact, there are theoretical arguments indicating that the hinge loss used by the SVM does not lend itself well to the estimation of posterior class probabilities: as the number of datapoints goes to infinity, under certain conditions on the kernel and on the rate at which the regularization strength tends to zero, the real-valued discriminant function f returned by the SVM essentially converges to the optimal ±1-valued classifier, sign(Pr(y = 1|x) − 1/2) (Steinwart & Christmann, 2008). This indicates that the SVM tries to approximate a function that does not retain information beyond the (optimal) class membership.¹ However, this does not mean that the estimation of conditional probabilities is necessarily impossible in the finite sample setting. In that case, f will typically have a much larger range (e.g., a non-empty interval of R). Moreover, for the widely used linear kernel, the RKHS is not rich enough to approximate the Bayes classifier, so the above result does not apply in the first place.

¹ In a sense, this is the flipside of the SVM's sparsity, as argued by (Bartlett & Tewari, 2007).

Why has the SVM evaded being cast as an ML or MAP estimate of a probabilistic model, especially given that the rather similar (penalized) kernel logistic regression (LR) is clear and simple? Two mental barriers had to be overcome. First, typically the SVM (analogous to LR) is taken to be a discriminative model, i.e., one that only specifies the conditional p(y|x).

In contrast, we argue that the SVM implies a (non-uniform) marginal p(x), giving it a generative touch. Second, it is alluring to expect a model whose ML/MAP solution exactly agrees with the standard SVM. While we do not know whether this is possible at all, we did succeed in recovering an alternative parameterization of the SVM as an ML estimate of a suitable model. This is analogous to the ν-SVM reformulation of the standard SVM (Schölkopf et al., 2000), in which the rather unintuitive regularization parameter is replaced by a parameter controlling the number of SVs. In our model, the hyperparameter will be the length ‖w‖ of the hyperplane normal.

After a brief review of the SVM (Section 2), we present our model (Section 3.1), which is generative and semi-parametric. For technical reasons we restrict our analysis to the SVM classifier without the bias term.² The core result, the equivalence of ML in our model with the SVM, is presented and proved in Section 4. In Section 5 we demonstrate how max-margin clustering drops out of our model; after this, we conclude.

² Which can, by augmenting the feature space, be used to arbitrarily well approximate the solution of an SVM with bias.

2. Support Vector Machine classification

We are given a set of training examples {(x1, y1), ..., (xm, ym)} ∈ (R^n × {+1, −1})^m assumed to be i.i.d. from an unknown probability distribution function (p.d.f.) p*(x, y). The goal is to learn a Bayes classifier q : R^n → {+1, −1} which minimizes the expected classification error $\int_{x\in\mathbb{R}^n}\sum_{y\in\{+1,-1\}} [\![y \neq q(x)]\!]\, p^*(x,y)\,dx$, where [[X]] = 1 if X is satisfied and 0 otherwise.

The SVM model without bias assumes that the Bayes classifier can be well approximated by a linear classifier qSVM : R^n → {+1, −1} parametrized by a vector w ∈ R^n such that

\[
q_{\mathrm{SVM}}(x; w) = \begin{cases} +1 & \text{if } \langle x, w\rangle \ge 0, \\ -1 & \text{if } \langle x, w\rangle < 0. \end{cases} \tag{1}
\]

The parameter vector w is evaluated by a cost function

\[
F(w; \lambda, \omega) = \frac{\lambda}{2}\|w\|^2 + R(w; \omega)
\]

where $R(w; \omega) = \sum_{i=1}^{m} \omega^{y_i}\, \ell(y_i \langle w, x_i\rangle)$ is a convex approximation of the training (empirical) classification error, ℓ(t) = max{0, 1 − t} is the hinge loss, λ ∈ (0, ∞) =: R++ is a strictly positive regularization constant, and ω ∈ (0, 1) is a scalar defining cost-factors ω⁺ = ω and ω⁻ = 1 − ω for the positive and negative class, respectively. For any fixed λ > 0, ω ≥ 0, the function R(w; ω) is convex and the function F(w; λ, ω) is strictly convex.

For given λ and ω, the SVM learning algorithm returns the parameter vector wSVM(λ, ω) which is the unique minimizer of F(w; λ, ω), i.e.,

\[
w_{\mathrm{SVM}}(\lambda, \omega) = \mathop{\mathrm{argmin}}_{w\in\mathbb{R}^n} F(w; \lambda, \omega). \tag{2}
\]

The problem (2) is well understood and there exists a plethora of efficient optimization algorithms for its solution.

The SVM algorithm specifies how to learn the parameter vector w while the hyper-parameters λ and ω must be determined differently. The standard SVM sets ω = 0.5. However, tuning ω is a routinely used heuristic in the case of unbalanced class distributions. A common practice is to select the best combination of λ and ω by solving

\[
(\lambda_{\mathrm{best}}, \omega_{\mathrm{best}}) = \mathop{\mathrm{argmin}}_{\lambda\in\Lambda,\,\omega\in\Omega} G[q_{\mathrm{SVM}}(\cdot; w_{\mathrm{SVM}}(\lambda, \omega))] \tag{3}
\]

where the sets Λ = {λ1, ..., λp} and Ω = {ω1, ..., ωp} are prescribed manually based on the user's experience. The functional G[qSVM(·; w)] is an estimator of the expected classification error of the rule qSVM(·; w). The classification error computed on an independent set of examples, cross-validation, or the leave-one-out estimate are among the most typically used error estimators. The resulting classifier is then qSVM(x; wSVM(λbest, ωbest)).

3. Semi-parametric probabilistic model

3.1. The model

We consider the following semi-parametric p.d.f. defined over R^n × {+1, −1}:

\[
p(x, y; \tau, \omega, u) = Z(\tau, \omega)\cdot \exp\!\big(-\omega^{y}\, \ell(y\langle \tau u, x\rangle)\big)\cdot h(x) \tag{4}
\]

The distribution (4) is parametrized by a unit vector u ∈ U = {u′ ∈ R^n | ‖u′‖ = 1}, a strictly positive scalar τ ∈ R++ and a scalar ω ∈ (0, 1) defining ω⁺ = ω and ω⁻ = 1 − ω. The distribution (4) is composed of three terms. The first term, Z(τ, ω), is a normalization constant invariant to u. The second term, exp(−ω^y ℓ(y⟨τu, x⟩)), is a function of all three parameters (τ, ω, u) and the input variables (x, y). Finally, the third term, h(x), is a function which ensures that p(x, y; τ, ω, u) is integrable and that the normalization constant Z(τ, ω) does not depend on u. The properties of h are defined in Theorem 1.
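To make the objective of Section 2 concrete, the following sketch minimizes F(w; λ, ω) for a linear SVM without a bias term by plain subgradient descent. It is only a minimal illustration on assumed synthetic data, not the solver used by the authors; the function name svm_without_bias and all constants are ours.

    import numpy as np

    def svm_without_bias(X, y, lam=0.1, omega=0.5, steps=2000, lr=0.1):
        # Minimize F(w; lam, omega) = lam/2 ||w||^2 + sum_i omega^{y_i} * max(0, 1 - y_i <w, x_i>)
        # by subgradient descent; omega^+ = omega, omega^- = 1 - omega (no bias, as in Eq. (1)).
        m, n = X.shape
        w = np.zeros(n)
        cost = np.where(y == +1, omega, 1.0 - omega)
        for t in range(1, steps + 1):
            margins = y * (X @ w)
            active = margins < 1.0                       # examples with non-zero hinge loss
            grad = lam * w - X[active].T @ (cost[active] * y[active])
            w -= (lr / np.sqrt(t)) * grad                # diminishing step size
        return w

    # assumed toy data: two Gaussian blobs
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(+2.0, 1.0, (50, 2)), rng.normal(-2.0, 1.0, (50, 2))])
    y = np.hstack([np.ones(50), -np.ones(50)])
    w = svm_without_bias(X, y, lam=0.1, omega=0.5)       # standard SVM: omega = 0.5
    print("w =", w, " training error =", np.mean(np.sign(X @ w) != y))

Grid-searching lam (and, for unbalanced data, omega) and scoring each candidate on held-out data is one simple instance of the selection procedure (3).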

Theorem 1 Let h : R^n → R be a piece-wise continuous function which satisfies the following assumptions:

1. 0 ≤ h(x), ∀x ∈ R^n (positive)
2. $0 < \int_{x\in\mathbb{R}^n} h(x)\,dx < \infty$ (integrable)
3. h(x) = h(x′), ∀x, x′ ∈ R^n such that ‖x‖ = ‖x′‖ (radial basis function)

Then for any τ ∈ R++, ω ∈ (0, 1) and u ∈ U the integrals I⁺(τ, ω, u) and I⁻(τ, ω, u) defined by

\[
I^{y}(\tau, \omega, u) = \int_{x\in\mathbb{R}^n} \exp\!\big(-\omega^{y}\,\ell(y\langle \tau u, x\rangle)\big)\cdot h(x)\,dx \tag{5}
\]

satisfy the following properties:

1. 0 < I^y(τ, ω, u) (strictly positive)
2. I^y(τ, ω, u) < ∞ (finite)
3. I^y(τ, ω, u) = I^y(τ, ω) (invariant to u)

Proof: For fixed τ, ω, u, we introduce the shorthand g(x) = exp(−ω^y ℓ(y⟨τu, x⟩)), which simplifies (5) to $I^{y}(\tau,\omega,u) = \int_{x\in\mathbb{R}^n} h(x)\,g(x)\,dx$. It is seen that g : R^n → (0, 1].

Because h is piece-wise continuous and its integral is strictly positive, there must exist µ ∈ R^n, r > 0, and ε1 > 0 such that for all x within the ball B(µ, r) = {x′ ∈ R^n | ‖x′ − µ‖ ≤ r} the value of h(x) is not less than ε1. The volume V of B(µ, r) is greater than 0. As g is strictly positive everywhere, there must exist ε2 > 0 such that g(x) ≥ ε2, ∀x ∈ B(µ, r). This implies that I^y(τ, ω, u) ≥ ε1 ε2 V > 0, which proves property 1.

Property 2 follows from the integrability of h and the boundedness of g.

Finally, we prove property 3. Let u and u′ be arbitrary unit vectors. Then, there exists an orthogonal matrix R ∈ R^{n×n} with determinant +1 (i.e., a rotation matrix) such that u′ = Ru. Let ϕ : R^n → R^n be a vector-valued function defined by ϕ(v) = R⊤v. It is seen that the determinant of the Jacobian matrix of ϕ is Dϕ(v) = +1. We can write

\[
I^{y}(\tau,\omega,u)
\overset{(1)}{=} \int_{x\in\mathbb{R}^n} h(x)\,\exp\!\big(-\omega^{y}\ell(y\langle \tau u, x\rangle)\big)\,dx
\overset{(2)}{=} \int_{v\in\mathbb{R}^n} h(\varphi(v))\,\exp\!\big(-\omega^{y}\ell(y\langle \tau u, \varphi(v)\rangle)\big)\,|D_\varphi(v)|\,dv
\overset{(3)}{=} \int_{v\in\mathbb{R}^n} h(v)\,\exp\!\big(-\omega^{y}\ell(y\langle \tau u', v\rangle)\big)\,dv
\overset{(4)}{=} I^{y}(\tau,\omega,u'). \tag{6}
\]

The second equality follows from the substitution theorem for multivariate integrals. The third equality uses h(x) = h(‖x‖), ensured by assumption 3, the fact that |Dϕ(v)| = 1, and the equality ⟨τu, ϕ(v)⟩ = ⟨τRu, v⟩ = ⟨τu′, v⟩. Equalities (1) and (4) are due to definition (5), which completes the proof.

Two examples of functions satisfying the assumptions of Theorem 1 are h1(x) = exp(−⟨x, c1 E x⟩) and h2(x) = c2 [[‖x‖ ≤ c3]], where c1, c2 and c3 are arbitrary strictly positive scalars and E is the identity matrix.

Corollary 1 Let h be a function satisfying the assumptions of Theorem 1 and let us define

\[
Z(\tau, \omega) = \frac{1}{I^{+}(\tau,\omega) + I^{-}(\tau,\omega)}. \tag{7}
\]

Then for any fixed τ ∈ R++, ω ∈ (0, 1) and u ∈ U, the function p(x, y; τ, ω, u) given by (4) is a proper p.d.f. defined over R^n × {+1, −1}, that is,

\[
p(x, y; \tau, \omega, u) \ge 0, \quad \forall x\in\mathbb{R}^n,\ y\in\{+1,-1\}, \qquad
\int_{x\in\mathbb{R}^n} \sum_{y\in\{+1,-1\}} p(x, y; \tau, \omega, u)\,dx = 1.
\]

3.2. Prior probability

It follows from (7) that the prior probability of the class label y under the model (4) is given by

\[
p(y; \tau, \omega) = \frac{I^{y}(\tau,\omega)}{I^{+}(\tau,\omega) + I^{-}(\tau,\omega)}. \tag{8}
\]

The prior probability does not depend on the parameter u. Moreover, we have the following theorem.

Theorem 2 For any τ ∈ R++ it holds that

\[
p(y=+1; \tau, \omega) = 0.5 \ \text{ for } \omega = 0.5, \qquad
p(y=+1; \tau, \omega) < 0.5 \ \text{ for } \omega > 0.5, \qquad
p(y=+1; \tau, \omega) > 0.5 \ \text{ for } \omega < 0.5.
\]

Proof: It follows from the equality (6) that

\[
I^{y}(\tau, \omega) = \int_{x\in\mathbb{R}^n} f^{y}(x)\cdot h(x)\,dx, \tag{9}
\]

where f^y(x) = exp(−ω^y ℓ(⟨τu, x⟩)) is a function which is, for ω = 0.5 (i.e., ω⁺ = ω⁻ = 0.5), invariant to y ∈ {+1, −1}. In turn, ω = 0.5 implies that I⁺(τ, 0.5) = I⁻(τ, 0.5), which after substituting into (8) yields p(y = 1; τ, 0.5) = 0.5. For ω > 0.5 we have that f⁺(x) < f⁻(x) on the whole subspace {x ∈ R^n | ⟨τu, x⟩ < 1}, which implies that I⁺(τ, ω) < I⁻(τ, ω) and thus also p(y = +1; τ, ω) < 0.5. The same reasoning can be used to prove that ω < 0.5 implies p(y = +1; τ, ω) > 0.5.

Theorem 2 establishes a correspondence between the hyper-parameter ω and the prior probability p(y; τ, ω). Provided the prior is uniform, we know the value of the hyper-parameter ω exactly, namely ω = 0.5. In the case of unbalanced priors, we only know whether ω is greater or less than 0.5. Note that to make the correspondence exact we would need to compute the integral (9), which involves the function h.

3.3. Posterior and marginal probabilities and relation to LR

The class posterior probability derived from the model (4) reads

\[
p(y\mid x; \tau, \omega, u) = \frac{\exp\!\big(-\omega^{y}\,\ell(y\langle\tau u, x\rangle)\big)}{\sum_{y'\in\{+1,-1\}} \exp\!\big(-\omega^{y'}\,\ell(y'\langle\tau u, x\rangle)\big)}. \tag{10}
\]

It is seen that the posterior probability does not depend on the function h. The marginal p.d.f. reads

\[
p(x; \tau, \omega, u) = Z(\tau, \omega)\cdot h(x)\cdot f(x; \tau, \omega, u)
\]

where $f(x; \tau, \omega, u) = \sum_{y\in\{+1,-1\}} \exp\!\big(-\omega^{y}\,\ell(y\langle\tau u, x\rangle)\big)$ denotes its parametric part.

Figure 1 shows the posterior and the parametric part of the marginal p.d.f. in 1-d (i.e., x ∈ R¹) for different values of the hyper-parameter τ and the other parameters set to ω = 0.5, u = 1. For comparison, we also plot the posterior probability of the LR model

\[
p_{\mathrm{LR}}(y\mid x; w) = \frac{1}{1 + \exp(-y\langle w, x\rangle)} \quad \text{with } w = \tau u.
\]

It is seen that the posterior probability of our model is very close to that of the LR model. In fact, both are exactly the same in the margin band.

[Figure 1 appears here: three panels for τ = 0.5, τ = 1 and τ = 2, each over a univariate input.]
Figure 1. The figures show the posterior probability p(y | x; τ, ω, u) (blue) and the parametric part f(x; τ, ω, u) of the marginal p.d.f. (red) for three different values of the hyper-parameter τ, which is reciprocal to the margin width. The input variable corresponding to the x-axis is univariate, x ∈ R¹, and the other parameters are set to ω = 0.5 and u = 1. For comparison, the figures also show the posterior probability pLR(y | x; w) (dashed green) of the Logistic Regression model with w = τu.
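The coincidence inside the margin band can be checked directly. The sketch below (our own illustration, with assumed values τ = 2, ω = 0.5 and u = 1) evaluates the posterior (10) and the LR posterior on a 1-d grid; for |⟨τu, x⟩| ≤ 1 the two agree exactly, while outside the band they differ.

    import numpy as np

    def posterior_plus(s, omega):
        # p(y=+1 | x) from Eq. (10), written in terms of the score s = <tau*u, x>
        hinge = lambda t: np.maximum(0.0, 1.0 - t)
        num = np.exp(-omega * hinge(s))
        return num / (num + np.exp(-(1.0 - omega) * hinge(-s)))

    def lr_posterior_plus(s):
        # logistic regression posterior with w = tau*u
        return 1.0 / (1.0 + np.exp(-s))

    tau, omega = 2.0, 0.5
    x = np.linspace(-3.0, 3.0, 7)                 # univariate inputs, u = 1
    s = tau * x
    for xi, pm, pl in zip(x, posterior_plus(s, omega), lr_posterior_plus(s)):
        band = "inside margin band" if abs(tau * xi) <= 1.0 else "outside"
        print(f"x = {xi:+.1f}   model: {pm:.3f}   LR: {pl:.3f}   ({band})")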
The crucial difference between the LR model and our SVM model is the marginal p.d.f. The LR model imposes no assumption about the shape of the marginal at all. By contrast, our model defines its shape up to the non-parametric part h. It is obvious that the marginal of our model is non-uniform. Its shape is consistent with the SVM's well-known margin-maximizing property. The width of the margin band is inversely proportional to the hyper-parameter τ.

3.4. Bayes classifier

The (optimal) Bayes classifier minimizing the expected classification error is based on the log-likelihood ratio q(x) = log [p(y=+1 | x) / p(y=−1 | x)]; the input x is assigned the label y = +1 if q(x) ≥ 0 and the label y = −1 otherwise.

Using (10), we can show after a little algebra that the log-likelihood ratio is a piece-wise linear function

\[
q(x; \tau, \omega, u) = \log\frac{p(y=+1\mid x;\tau,\omega,u)}{p(y=-1\mid x;\tau,\omega,u)}
= \begin{cases}
(\langle\tau u, x\rangle - 1)\,\omega^{+} & \text{if } \langle\tau u, x\rangle \in (-\infty, -1], \\
\langle\tau u, x\rangle + 1 - 2\omega & \text{if } \langle\tau u, x\rangle \in [-1, 1], \\
(\langle\tau u, x\rangle + 1)\,\omega^{-} & \text{if } \langle\tau u, x\rangle \in [1, \infty).
\end{cases} \tag{11}
\]

Using the log-likelihood ratio (11), we can derive that the Bayes classifier is the linear classification rule

\[
q_{\mathrm{Bayes}}(x; \tau, \omega, u) = \begin{cases} +1 & \text{if } \langle\tau u, x\rangle \ge b, \\ -1 & \text{if } \langle\tau u, x\rangle < b, \end{cases} \tag{12}
\]

where b = 2ω − 1. Note that for ω = 0.5 the classifier (12) becomes unbiased, just like the SVM classifier (1).
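As a quick sanity check of (11) and (12), the snippet below (ours, with an assumed ω = 0.7) compares the log-likelihood ratio computed directly from (10) with the piecewise-linear form and confirms that its sign changes at ⟨τu, x⟩ = 2ω − 1.

    import numpy as np

    hinge = lambda t: np.maximum(0.0, 1.0 - t)

    def llr_from_posterior(s, omega):
        # log p(y=+1|x) - log p(y=-1|x) from Eq. (10); the normalizer cancels
        return -omega * hinge(s) + (1.0 - omega) * hinge(-s)

    def llr_piecewise(s, omega):
        # the piecewise-linear form of Eq. (11); s = <tau*u, x>
        return np.where(s <= -1.0, (s - 1.0) * omega,
               np.where(s >= 1.0, (s + 1.0) * (1.0 - omega), s + 1.0 - 2.0 * omega))

    omega = 0.7
    s = np.linspace(-3.0, 3.0, 13)
    assert np.allclose(llr_from_posterior(s, omega), llr_piecewise(s, omega))
    print("the Bayes rule (12) predicts +1 iff s >= b with b =", 2 * omega - 1)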

3.5. The Maximum-Likelihood estimator

Given training examples {(x1, y1), ..., (xm, ym)} ∈ (R^n × {+1, −1})^m assumed to be i.i.d. from the distribution (4), the negative log-likelihood (NLL) of the parameters (τ, ω, u) reads

\[
L(\tau, \omega, u) = -\sum_{i=1}^{m} \log p(x_i, y_i; \tau, \omega, u)
= \sum_{i=1}^{m} \omega^{y_i}\,\ell(y_i\langle\tau u, x_i\rangle) - m\log Z(\tau, \omega) - \sum_{i=1}^{m}\log h(x_i). \tag{13}
\]

The key observation is as follows. To compute the ML estimate of all three parameters (τ, ω, u) we need to know the function h. However, under the assumption that τ and ω are given, the ML estimator of the parameter vector u does not depend on h. Let uML(τ, ω) denote the ML estimator of u provided the hyper-parameters τ and ω are known; then we have

\[
u_{\mathrm{ML}}(\tau, \omega) \in \mathop{\mathrm{argmin}}_{u\in U} L(\tau, \omega, u) = \mathop{\mathrm{argmin}}_{u\in U} R(\tau u; \omega), \tag{14}
\]

where R(τu; ω) is the SVM risk term (see Section 2).

Having the ML estimator, one can implement ML learning of the plug-in Bayes classifier (12), i.e., we plug in the ML estimate of the parameters for the real ones. Following common practice, the hyper-parameters (τ, ω) can be found by solving

\[
(\tau_{\mathrm{best}}, \omega_{\mathrm{best}}) = \mathop{\mathrm{argmin}}_{\tau\in\mathcal{T},\,\omega\in\Omega} G[q_{\mathrm{Bayes}}(\cdot; \tau, \omega, u_{\mathrm{ML}}(\tau, \omega))] \tag{15}
\]

where the sets T = {τ1, ..., τp} and Ω = {ω1, ..., ωp} are prescribed manually based on the user's experience. The functional G[qBayes(·; τ, ω, uML(τ, ω))] is an estimator of the expected classification error of the rule qBayes(·; τ, ω, u). The resulting classifier is then qBayes(x; τbest, ωbest, uML(τbest, ωbest)).

4. Equivalence between SVM and ML learning

We first prove a theorem which establishes the equivalence between the optimization problems appearing in SVM and ML learning.

Theorem 3 Let us consider the following optimization problems

\[
w(\lambda) = \mathop{\mathrm{argmin}}_{w\in\mathbb{R}^n} \Big[\frac{\lambda}{2}\|w\|^2 + R(w)\Big] \tag{16}
\]
\[
u(\tau) = \mathop{\mathrm{argmin}}_{u\in U} R(\tau u) \tag{17}
\]

where R : R^n → R is a convex function, and assume that the minimum of (16) exists. Then, there exists a monotonically decreasing mapping θ : R++ → R+ which for any λ ∈ R++ returns τ = θ(λ) = ‖w(λ)‖ such that the following equality holds:

\[
\tau\, u(\tau) = w(\lambda). \tag{18}
\]

At first sight it may be puzzling that an ML estimate (hence lacking an explicit prior) can be equivalent to the SVM objective, for its regularizer is commonly interpreted as a log prior (in analogy to penalized LR). The resolving insight is simple: the regularizer (prior) acts only on ‖w‖, which is kept fixed in our model, serving as a hyperparameter substitute for the SVM's λ.

Before proving Theorem 3 we introduce the following auxiliary lemma.

Lemma 1 For any λ1 ∈ R++ and λ2 ∈ R++ such that λ1 > λ2, the inequality ‖w1‖ ≤ ‖w2‖ holds, where w1 ∈ w(λ1) and w2 ∈ w(λ2) are solutions of the problem (16) for λ1 and λ2, respectively.

Proof: Since w1 and w2 are minimizers of (16), the following inequalities

\[
\frac{\lambda_1}{2}\|w_1\|^2 + R(w_1) \le \frac{\lambda_1}{2}\|w\|^2 + R(w), \tag{19}
\]
\[
\frac{\lambda_2}{2}\|w_2\|^2 + R(w_2) \le \frac{\lambda_2}{2}\|w\|^2 + R(w) \tag{20}
\]

hold ∀w ∈ R^n. Substituting w = w2 into (19) and w = w1 into (20) yields

\[
\frac{\lambda_1}{2}\|w_1\|^2 + R(w_1) \le \frac{\lambda_1}{2}\|w_2\|^2 + R(w_2), \tag{21}
\]
\[
\frac{\lambda_2}{2}\|w_2\|^2 + R(w_2) \le \frac{\lambda_2}{2}\|w_1\|^2 + R(w_1). \tag{22}
\]

By summing the inequalities (21) and (22) we get

\[
\frac{\lambda_1}{2}\|w_1\|^2 + \frac{\lambda_2}{2}\|w_2\|^2 + R(w_1) + R(w_2) \le \frac{\lambda_1}{2}\|w_2\|^2 + \frac{\lambda_2}{2}\|w_1\|^2 + R(w_1) + R(w_2),
\]

which after a little algebra yields

\[
(\lambda_1 - \lambda_2)\big(\|w_2\|^2 - \|w_1\|^2\big) \ge 0,
\]

and so λ1 > λ2 implies ‖w1‖ ≤ ‖w2‖.

Now, we prove Theorem 3.

Proof: Let w(λ) be the minimizer of (16) for some λ ∈ R++. Note that w(λ) is unique as the objective of (16) is strictly convex. Let us denote τ = ‖w(λ)‖ and Wτ = {w ∈ R^n | ‖w‖ = τ}. Then, we can write

\[
w(\lambda) \overset{(1)}{=} \mathop{\mathrm{argmin}}_{w\in W_\tau} \Big[\frac{\lambda}{2}\|w\|^2 + R(w)\Big]
\overset{(2)}{=} \mathop{\mathrm{argmin}}_{w\in W_\tau} R(w)
\overset{(3)}{=} \tau\, u(\tau), \quad \text{where } u(\tau) = \mathop{\mathrm{argmin}}_{u\in U} R(\tau u).
\]

The first equality follows from the fact that Wτ is a subset of R^n containing the minimizer w(λ). Since all vectors in Wτ have the same norm, the second equality holds true. The third equality results from the variable substitution w = τu. This proves that for any λ ∈ R++ and τ = θ(λ) = ‖w(λ)‖ the equality (18) holds.

It remains to prove the monotonicity of θ, i.e., that λ1 > λ2 implies θ(λ1) ≤ θ(λ2). However, this is a direct consequence of Lemma 1 and the fact that θ(λ) = ‖w(λ)‖.
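Theorem 3 lends itself to a direct numerical check. The sketch below (our own, on assumed toy data, with ω = 0.5 so that R(w) is the unweighted hinge risk) solves (16) by subgradient descent, sets τ = ‖w(λ)‖, solves (17) by projected subgradient descent on the unit sphere, and compares τ·u(τ) with w(λ) as in (18).

    import numpy as np

    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(+1.5, 1.0, (100, 2)), rng.normal(-1.5, 1.0, (100, 2))])
    y = np.hstack([np.ones(100), -np.ones(100)])

    def risk_subgrad(w):
        # subgradient of R(w) = 0.5 * sum_i max(0, 1 - y_i <w, x_i>)   (omega = 0.5)
        active = y * (X @ w) < 1.0
        return -0.5 * X[active].T @ y[active]

    def solve_svm(lam, steps=20000, lr=0.05):
        # problem (16): w(lambda) = argmin_w  lam/2 ||w||^2 + R(w)
        w = np.zeros(2)
        for t in range(1, steps + 1):
            w -= (lr / np.sqrt(t)) * (lam * w + risk_subgrad(w))
        return w

    def solve_sphere(tau, steps=20000, lr=0.05):
        # problem (17): u(tau) = argmin_{||u||=1} R(tau * u), by projected subgradient descent
        u = np.array([1.0, 0.0])
        for t in range(1, steps + 1):
            u -= (lr / np.sqrt(t)) * tau * risk_subgrad(tau * u)
            u /= np.linalg.norm(u)
        return u

    lam = 0.5
    w_lam = solve_svm(lam)
    tau = np.linalg.norm(w_lam)                       # theta(lambda) = ||w(lambda)||
    print("w(lambda)    =", w_lam)
    print("tau * u(tau) =", tau * solve_sphere(tau))  # should roughly coincide, cf. Eq. (18)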

4.1. Standard SVM classifier

The formulation of the standard SVM assumes that the cost-factors for both classes are equal, i.e., ω = 0.5. Recall that by Theorem 2 the value ω = 0.5 implies that the prior probability of our model is uniform, i.e., p(y; τ, ω) = 0.5.

Theorem 4 Let ω = 0.5. Then, for any linear SVM classifier there exists an equivalent plug-in Bayes classifier derived from the model (4) whose parameters are estimated by the ML principle, i.e., the equality

\[
q_{\mathrm{SVM}}(x; w_{\mathrm{SVM}}(\lambda, \omega)) = q_{\mathrm{Bayes}}(x; \tau, \omega, u_{\mathrm{ML}}(\tau, \omega))
\]

holds for any x ∈ R^n, λ ∈ R++ and τ = θ(λ) = ‖wSVM(λ, ω)‖. Moreover, the mapping θ : R++ → R+ is monotonically decreasing.

Proof: The proof follows trivially from Theorem 3 and the formulas for the linear SVM classifier (1) and the Bayes classifier (12).

Note that the standard SVM is theoretically linked to our model in much the same way the ν-SVM is linked to the standard SVM. In our case, training an SVM using λ would tell us which τ to use to get the same result; in the other case, training a ν-SVM using an a priori chosen ν would tell us which λ to use in the standard SVM to get the same result (see Proposition 6 in (Schölkopf et al., 2000)).

Now let us compare the SVM and the ML learning from a more practical point of view. We still assume the standard setting ω = 0.5. SVM learning requires the user to supply a tuning set Λ = {λ1, ..., λp} for the hyper-parameter λ. The resulting SVM classifier is obtained by the procedure (3), which selects the best parameter vector from wSVM(λ, 0.5), λ ∈ Λ, based on a validation criterion. ML learning requires the user to supply a tuning set T = {τ1, ..., τp} for the hyper-parameter τ. The resulting plug-in Bayes classifier is obtained by the procedure (15), which selects the best parameter vector from uML(τ, 0.5), τ ∈ T, based on a validation criterion. Theorem 4 guarantees that both procedures will return exactly the same linear classifier provided we use T = {‖wSVM(λ1, 0.5)‖, ..., ‖wSVM(λp, 0.5)‖} and the same validation criterion in (3) and (15). Hence an empirical comparison is unnecessary.

Note that the tuning sets are heuristic in both cases. While this is outside the scope of this paper, we imagine that the direct geometric interpretation of τ as the reciprocal of the margin width will facilitate heuristics for finding reasonable settings, which could provide a practical advantage.

4.2. SVM classifier with different cost factors

It is common knowledge that SVMs often do not work well if the class distribution of the training examples is highly unbalanced. In view of the previous section this is not surprising, as the standard SVM classifier is equivalent to a Bayes classifier which assumes uniform prior probabilities. To cope with unbalanced classes, SVM practitioners routinely use two heuristics:

1. Set a higher cost-factor for the class which is less represented in the training data. For example, if the first class is the smaller class, then set ω > 0.5, i.e., ω⁺ > ω⁻. This option is supported by all major SVM solvers like SVMlight (Joachims, 1999). A proper setting of the cost-factor ω is then tuned as an additional hyper-parameter.

2. After the linear SVM classifier is trained, tune only the bias of the classifier to achieve the desired error rate. Recall that the standard SVM classifier (1) is unbiased.

Let us confront these heuristics with the plug-in Bayes classifier (12) whose parameters are obtained by the ML estimator (14). Theorem 2 shows how the hyper-parameter ω determines the prior probability p(y; τ, ω). This is perfectly consistent with the first heuristic. For example, if the first class is less frequent in the training data then, according to Theorem 2, we should set ω > 0.5 to guarantee that p(y = +1; τ, ω) < p(y = −1; τ, ω) holds, and vice versa.

Regarding the second heuristic, we showed that the plug-in Bayes classifier under the model (4) is a linear classification rule (12) with a bias term b = 2ω − 1. That is, for uniform priors the bias is set to 0, while for unbalanced classes the bias is negative if the first class is more probable (and positive otherwise). Note that the probabilistic model exactly specifies the value of the bias term given the hyper-parameter ω. By contrast, in classical SVM training the bias must be tuned as an additional hyper-parameter.
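A compact sketch of the resulting cost-sensitive rule (our own illustration, with assumed numbers): the direction u is learned from the ω-weighted risk as in (14), while the threshold is not tuned but fixed at b = 2ω − 1 by (12).

    import numpy as np

    def plugin_bayes_predict(X, u, tau, omega):
        # plug-in Bayes rule (12): predict +1 iff <tau*u, x> >= b with b = 2*omega - 1;
        # for omega = 0.5 the rule is unbiased, exactly like the SVM classifier (1)
        b = 2.0 * omega - 1.0
        return np.where(X @ (tau * u) >= b, +1, -1)

    # worked example: if the positive class is rare, Theorem 2 suggests omega > 0.5,
    # say omega = 0.75; then b = 2*0.75 - 1 = 0.5, so a larger score is required before
    # predicting the rare positive class, matching its smaller prior p(y=+1; tau, omega).
    u, tau = np.array([1.0, 0.0]), 2.0
    X = np.array([[0.1, 0.0], [0.4, 0.0]])             # scores <tau*u, x> are 0.2 and 0.8
    print(plugin_bayes_predict(X, u, tau, 0.50))       # -> [ 1  1]
    print(plugin_bayes_predict(X, u, tau, 0.75))       # -> [-1  1]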
5. Equivalence between Maximum Margin Clustering and Classification Maximum Likelihood approach

Maximum Margin Clustering (MMC) (Xu et al., 2004) is a popular heuristic that transfers the maximum margin principle from supervised SVM learning to the unsupervised setting. In this section we show how our model (4) theoretically justifies MMC. In particular, we demonstrate how MMC emerges from our model by applying the Classification Maximum Likelihood (CML) approach (Scott & Symons, 1971), which is a statistically motivated and theoretically well-understood clustering method.

We consider the clustering problem defined as follows. Let {(x1, y1), ..., (xm, ym)} ∈ (R^n × {+1, −1})^m be i.i.d. from an underlying distribution p*(x, y). Let us assume that we are given only the observations {x1, ..., xm} and our goal is to estimate the corresponding hidden labels {y1, ..., ym}.

The MMC finds the labels by solving

\[
y^{\mathrm{MMC}}(\lambda) = \mathop{\mathrm{argmin}}_{y\in\{+1,-1\}^m}\ \min_{w\in\mathbb{R}^n} \Big[\frac{\lambda}{2}\|w\|^2 + R_{\mathrm{MMC}}(y, w)\Big] \tag{23}
\]

where $R_{\mathrm{MMC}}(y, w) = \sum_{i=1}^{m} \ell(y_i\langle w, x_i\rangle)$. Thus the MMC searches for the labels which allow the best separation of the data by the SVM classifier. The problem (23) may not have a unique minimizer and hence y^MMC(λ) denotes a set. The problem (23) is difficult to optimize due to the integer variables y = (y1, ..., ym). A plethora of algorithms have been proposed to solve (23) approximately.

As before, we assume an unbiased linear classifier (i.e., the hyperplane passes through the origin). With this, (23) is a well-posed problem, unlike the variant with a biased classifier, which has a trivial solution assigning all observations to just one class. This complication can also be solved by introducing an additional balance constraint enforcing solutions with a prescribed number of labels in each class. Note that all derivations below can easily be repeated with the balance constraint to recover the biased variant of the MMC.

5.1. Classification Maximum Likelihood approach to clustering

Assume a conditional density p(x | y; θ) parametrized by θ ∈ Θ. Given the observations {x1, ..., xm}, the NLL of labels y ∈ {+1, −1}^m and a parameter θ is

\[
L(y, \theta) = -\sum_{i=1}^{m} \log p(x_i \mid y_i; \theta).
\]

The CML approach finds the labels by solving

\[
y^{\mathrm{CML}} \in \mathop{\mathrm{argmin}}_{y\in\{+1,-1\}^m}\ \min_{\theta\in\Theta} L(y, \theta).
\]

The CML approach assumes that both the vector θ and the labels y are parameters to be estimated.

Let us instantiate the CML approach for our model. We assume that the hyper-parameter ω = 0.5 and that τ is fixed otherwise (e.g., tuned on a validation set). Recall that ω = 0.5 implies the uniform prior p(y; τ, 0.5) = 0.5. Then, the NLL under the model (4) reads

\[
L(y, u; \tau) = -\sum_{i=1}^{m} \Big(\log h(x_i) - \omega^{y_i}\,\ell(y_i\langle\tau u, x_i\rangle) - \log I^{y_i}(\tau, \omega)\Big),
\]

and, since ω = 0.5 implies I⁺(τ, 0.5) = I⁻(τ, 0.5) so that the terms involving h and I^{y_i} do not affect the minimization, the labels are found by solving

\[
y^{\mathrm{CML}}(\tau) = \mathop{\mathrm{argmin}}_{y\in\{+1,-1\}^m}\ \min_{u\in U} L(y, u; \tau)
= \mathop{\mathrm{argmin}}_{y\in\{+1,-1\}^m}\ \min_{u\in U} R_{\mathrm{MMC}}(y, \tau u). \tag{24}
\]

5.2. Equivalence between MMC and CML approach

Now, we show that any clustering returned by the MMC can be found by the CML approach.

Theorem 5 Let y^MMC(λ) be the set of minimizers of the Maximum Margin Clustering problem (23) for some λ ∈ R++. Then, for any labeling y* ∈ y^MMC(λ) there exists τ ∈ R+ such that y* is a minimizer of the Classification Maximum Likelihood problem (24), i.e., y* ∈ y^CML(τ) holds.

Proof: Because y* ∈ y^MMC(λ) is a minimizer of the problem (23), the inequality

\[
\min_{w\in\mathbb{R}^n}\Big[\frac{\lambda}{2}\|w\|^2 + R_{\mathrm{MMC}}(y^{*}, w)\Big]
\le \min_{w\in\mathbb{R}^n}\Big[\frac{\lambda}{2}\|w\|^2 + R_{\mathrm{MMC}}(y, w)\Big] \tag{25}
\]

holds ∀y ∈ {+1, −1}^m. Let us denote the minimizer of the left-hand side of (25) as $w^{*} = \mathop{\mathrm{argmin}}_{w\in\mathbb{R}^n}\big[\frac{\lambda}{2}\|w\|^2 + R_{\mathrm{MMC}}(y^{*}, w)\big]$, which is unique as the objective is strictly convex for fixed y*. Let us denote τ = ‖w*‖ and Wτ = {w ∈ R^n | ‖w‖ = τ}. Then, we can derive from (25) that the inequality

\[
\min_{w\in W_\tau} R_{\mathrm{MMC}}(y^{*}, w) \le \min_{w\in W_\tau} R_{\mathrm{MMC}}(y, w) \tag{26}
\]
holds ∀y ∈ {+1, −1}^m. To get from (25) to (26), we used the fact that all vectors in Wτ have the same norm and that one of them is the minimizer w*. The inequality (26) implies that y* is a minimizer of

\[
y^{*} \in \mathop{\mathrm{argmin}}_{y\in\{+1,-1\}^m}\ \min_{w\in W_\tau} R_{\mathrm{MMC}}(y, w)
= \mathop{\mathrm{argmin}}_{y\in\{+1,-1\}^m}\ \min_{u\in U} R_{\mathrm{MMC}}(y, \tau u),
\]

where the latter equality, obtained by the variable substitution w = τu, is just the definition of the CML problem (24), which was to be proved.

The established correspondence between the MMC and the CML not only provides a theoretical justification of the MMC but also opens ways for its extension. First, to cope with unbalanced data one can tune the hyper-parameter ω, which corresponds to changing the prior probability p(y; τ, ω). Second, the hard problem (23) required by the MMC can be attacked by algorithms routinely used for minimization of the CML criterion. Namely, the Classification EM algorithm (CEM) (Celeux & Govaert, 1992) is a simple iterative procedure which transforms the hard unsupervised problem into a series of much simpler supervised problems. In turn, existing SVM solvers can be readily recycled for solving the MMC.
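The following sketch (ours, on assumed unlabeled toy data, with ω = 0.5 and τ fixed) shows the kind of CEM-style alternation meant above: for a fixed direction u, each label is set to the cheaper of ℓ(⟨τu, x_i⟩) and ℓ(−⟨τu, x_i⟩), i.e., y_i = sign(⟨τu, x_i⟩); for fixed labels, the supervised problem min over the unit sphere of R_MMC(y, τu) is solved, here by projected subgradient descent, though any SVM solver could be plugged in. Like other MMC heuristics, it only finds a local optimum of (24).

    import numpy as np

    rng = np.random.default_rng(2)
    X = np.vstack([rng.normal(+2.0, 0.7, (60, 2)), rng.normal(-2.0, 0.7, (60, 2))])  # unlabeled
    tau = 2.0

    def risk_subgrad(u, y):
        # subgradient w.r.t. u of R_MMC(y, tau*u) = sum_i max(0, 1 - y_i <tau*u, x_i>)
        active = y * (X @ (tau * u)) < 1.0
        return -tau * X[active].T @ y[active]

    def cem_mmc(outer=30, inner=2000, lr=0.05):
        u = rng.normal(size=2)
        u /= np.linalg.norm(u)
        y = np.where(X @ u >= 0.0, 1.0, -1.0)
        for _ in range(outer):
            # C-step: relabel each point with the cheaper label, y_i = sign(<tau*u, x_i>)
            y_new = np.where(X @ (tau * u) >= 0.0, 1.0, -1.0)
            # M-step: supervised step, minimize R_MMC(y_new, tau*u) over the unit sphere
            for t in range(1, inner + 1):
                u -= (lr / np.sqrt(t)) * risk_subgrad(u, y_new)
                u /= np.linalg.norm(u)
            if np.array_equal(y_new, y):
                break
            y = y_new
        return y, u

    labels, u = cem_mmc()
    print("cluster sizes:", int(np.sum(labels == 1)), int(np.sum(labels == -1)))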
6. Conclusion

The received wisdom in machine learning has so far been that maximum margin SVM learning and probabilistic models constitute two separate sub-domains of machine learning. Our work has been motivated by the unsettling fact that SVM-like methods, albeit rooted in learning theory and powerful and efficient in practice, do not enjoy the principled view on modeling offered by probabilistic methods. In this contribution, we heal this rupture by setting up a probabilistic model that is equivalent to the SVM. So far, however, this work is limited to linear SVMs without bias; whether and how kernelization can be incorporated remains to be investigated.

Apart from the theoretical satisfaction of unification, the probabilistic understanding of the SVM can also lead to further insight. As an example, we demonstrate how a common and empirically successful heuristic for dealing with unbalanced class sizes can be understood in terms of biased priors, and how maximum margin clustering is naturally linked to the generic CML (Classification Maximum Likelihood) principle. Further work on semi-supervised SVMs is in progress, and we anticipate that many more such relationships will be discovered.

Acknowledgments

VF was supported by the Czech Ministry of Education project 1M0567 and by the EC projects FP7-ICT-247525 HUMAVIPS and PERG04-GA-2008-239455 SEMISOL.

References

Bartlett, P. L. and Tewari, A. Sparseness vs estimating conditional probabilities: Some asymptotic results. Journal of Machine Learning Research, 8:775–790, 2007.

Celeux, G. and Govaert, G. A classification EM algorithm for clustering and two stochastic versions. Computational Statistics & Data Analysis, 14(3), 1992.

Grandvalet, Y., Mariéthoz, J., and Bengio, S. Interpretation of SVMs with an application to unbalanced classification. Advances in Neural Information Processing Systems, NIPS 18, 2005.

Joachims, T. Making large-scale SVM learning practical. In Schölkopf, B., Burges, C. J. C., and Smola, A. J. (eds.), Advances in Kernel Methods - Support Vector Learning, pp. 169–184. MIT Press, Cambridge, MA, 1999.

Platt, J. C. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In Smola, A. et al. (eds.), Advances in Large Margin Classifiers. MIT Press, Cambridge, MA, 2000.

Schölkopf, B. and Smola, A. J. Learning with Kernels. MIT Press, Cambridge, MA, 2002.

Schölkopf, B., Smola, A. J., Williamson, R. C., and Bartlett, P. L. New support vector algorithms. Neural Computation, 12(5):1207–1245, 2000.

Scott, A. J. and Symons, M. J. Clustering methods based on likelihood ratio criteria. Biometrics, 27:387–389, 1971.

Sollich, P. Bayesian methods for support vector machines: Evidence and predictive class probabilities. Machine Learning, 46(1):21–52, 2002.

Steinwart, I. and Christmann, A. Support Vector Machines. Springer, New York, 2008.

Vapnik, V. Statistical Learning Theory. John Wiley and Sons, New York, 1998.

Xu, L., Neufeld, J., Larson, B., and Schuurmans, D. Maximum margin clustering. In Proc. of Neural Information Processing Systems (NIPS), 2004.
