
STAT 538                                Lecture 8

Maximum Entropy Models

(c) Marina Meila
mmp@stat.washington.edu

1  Entropy and KL divergence

Assume that the sample space is Ω, a (typically large) finite space. For any
distribution p : Ω → [0, 1] the entropy is defined as

    H(p) = − ∑_{x∈Ω} p(x) log p(x)    (1)

In the above, and throughout this chapter, we adopt the convention 0 log 0 = 0.
The function H(p) is non-negative and concave on the space of all distributions
over Ω. The minimum H = 0 is attained for deterministic distributions, and the
maximum Hmax = log |Ω| is attained for the uniform distribution over Ω.
The entropy measures the uncertainty in a given distribution. Closely related
to the entropy is the Kullback–Leibler divergence between two distributions:

    D(p||q) = ∑_x p(x) log [p(x)/q(x)]    (2)

The KL divergence is asymmetric in p, q. It is non-negative, convex in (p, q),
and attains its minimum D(p||q) = 0 iff p ≡ q (proofs and details below).
The entropy and KL divergence can also be defined for continuous sample
spaces, by replacing the summation with an integral. The only property
that changes is that the entropy of a continuous distribution is not always
non-negative. All the other properties mentioned here remain the same.

Properties of H and D

1. H ≥ 0 (obvious)

2. H is a concave function of p (we know x ln x is convex)

3. H(p) = 0 iff p is deterministic

4. H(p) is maximized by p = u, the uniform distribution; in this case
   H(u) = ln |Ω|.
   Proof. Let Ω = {0, 1, . . . , m} w.l.o.g. Then H is a function of the m
   variables p_{1:m}, with ∑_{x=0}^m p_x = 1 and p_0 = 1 − ∑_{x=1}^m p_x.

       H(p) = − ∑_{x=1}^m p_x ln p_x − (1 − ∑_{x=1}^m p_x) ln (1 − ∑_{x=1}^m p_x)    (3)

       ∂H/∂p_x = − ln p_x − 1 + ln (1 − ∑_{x'=1}^m p_{x'}) + 1 = 0    (4)

       p_x = 1 − ∑_{x'=1}^m p_{x'} = p_0    (5)

   In other words, all p_x must be equal.


5. D(p||q) ≥ 0
   (D is a Bregman divergence; it corresponds to the convex function x ln x;
   see Lecture 6)

6. D(p||q) is convex in (p, q)
   Proof. We use the perspective of the convex function − ln x. We have
   (BV 3.2.6) that t (− ln (x/t)) is also convex jointly in (x, t), and we set
   t = p_i, x = q_i. It follows that −p_i ln (q_i/p_i) = p_i ln (p_i/q_i) is
   convex, and therefore D is convex.
7. We define the conditional entropy of two random variables X, Y with
   joint distribution p_XY by

       H(X|Y) = ∑_y p_Y(y) H(p_{X|Y=y})    (6)

   Thus, the conditional entropy is the average of the entropies H(p_{X|Y=y}).
   We have that

       H(X|Y) = ∑_x ∑_y p_y (− p_{x|y} ln p_{x|y})    (7)
              ≤ ∑_x [ − (∑_y p_y p_{x|y}) ln (∑_y p_y p_{x|y}) ]   (concavity of −t ln t)    (8)
              = − ∑_x p_x ln p_x = H(X)    (9)

   In other words, H(X|Y) ≤ H(X): conditioning on (observing) another
   variable Y decreases the uncertainty in X. The amount of the decrease
   is called the mutual information between X and Y:

       I(X||Y) = H(X) − H(X|Y) = H(Y) − H(Y|X)    (10)

   (proving this last equality is left as an exercise)


8. The mutual information is non-negative, and is 0 iff the two variables
   are independent, as a consequence of the following identity

       I(X||Y) = D(p_XY || p_X p_Y)    (11)

   (proof by direct calculation; a numeric check appears after this list).


9. For two variables X, Y with joint distribution P_XY = [p_xy]_{x,y}, the
   joint entropy is defined as

       H(X, Y) ≡ H(P_XY) = − ∑_{xy} p_xy ln p_xy    (12)

   It is easy to verify that

       H(X, Y) = H(X) + H(Y|X) ≤ H(X) + H(Y).    (13)

   The above inequality is satisfied with equality only if X ⊥ Y, i.e. only
   if X and Y are independent.
10. Aggregation decreases the entropy. Let the random variable Y be a
    deterministic function of X. When X is discrete, that means that
    several values x may be mapped to the same value y. Intuitively, the
    values x are aggregated into sets labeled by y. Hence, p_y = ∑_{x→y} p_x.

        H(X) = ∑_x p_x (− ln p_x) = ∑_y ∑_{x→y} p_x (− ln p_x)    (14)
             = ∑_y p_y [ ∑_{x→y} (p_x/p_y) ln (1/p_x) ]    (15)
             ≤ ∑_y p_y ln [ ∑_{x→y} (p_x/p_y) (1/p_x) ]   (because ln z is concave)    (16)
             = ∑_y p_y ln (n_y/p_y),  where n_y = #{x → y}    (17)
             = H(Y) + ∑_y p_y ln n_y    (18)

    Conversely, since p_x ≤ p_y for every x → y, each term in the bracket
    of (15) is at least ln (1/p_y), so H(X) ≥ H(Y): aggregation decreases
    the entropy, and by (18) the decrease is at most ∑_y p_y ln n_y.

11. Mixing increases the entropy. For the mixture distribution tp + (1 − t)q,
    by the concavity of the entropy it follows that

        H(tp + (1 − t)q) ≥ t H(p) + (1 − t) H(q)    (19)
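As referenced in item 8, here is a small numeric check of equations (6), (10)
and (11). It is an illustrative sketch, not part of the original notes; the
joint distribution p_xy is an arbitrary choice.

    import numpy as np

    def entropy(p):
        # H(p) = -sum_x p(x) ln p(x), with the convention 0 ln 0 = 0
        p = p[p > 0]
        return -np.sum(p * np.log(p))

    def kl(p, q):
        # D(p||q) = sum_x p(x) ln [p(x)/q(x)]
        mask = p > 0
        return np.sum(p[mask] * np.log(p[mask] / q[mask]))

    # an arbitrary 2x3 joint distribution p_XY
    p_xy = np.array([[0.2, 0.1, 0.1],
                     [0.1, 0.3, 0.2]])
    p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)

    # conditional entropy, equation (6)
    H_cond = sum(p_y[j] * entropy(p_xy[:, j] / p_y[j]) for j in range(3))

    # mutual information two ways: equations (10) and (11)
    I_def = entropy(p_x) - H_cond
    I_kl = kl(p_xy.ravel(), np.outer(p_x, p_y).ravel())
    assert np.isclose(I_def, I_kl)
    assert H_cond <= entropy(p_x)   # conditioning decreases uncertainty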

2  The Maximum Entropy Principle

Assume that we have a set of N observations D = {x^(1), . . . , x^(N)} from an
unknown distribution p. The observations define the empirical distribution p̂:

    p̂(x) = (1/N) ∑_{j=1}^N δ_{x^(j)}(x)    (20)

where

    δ_{x^(j)}(x) = 1 if x = x^(j), 0 otherwise    (21)

From now on we assume that the data is represented by the empirical
distribution p̂. We also have a set of features f_i(x) of the data; they are
functions f_i : Ω → (−∞, ∞), i = 1, . . . , K.
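To make these objects concrete, here is a minimal sketch (not part of the
original notes) of p̂ and the feature expectations E_p̂[f_i] on a small sample
space; the data and features are arbitrary illustrative choices.

    import numpy as np

    # sample space: binary pairs (y, z), as in the examples of section 2.1
    omega = [(y, z) for y in (0, 1) for z in (0, 1)]

    # features f_1(x) = y, f_2(x) = z (an illustrative choice)
    features = [lambda x: x[0], lambda x: x[1]]

    # observed data D
    data = [(1, 1), (1, 0), (0, 0), (1, 1), (0, 1)]

    # empirical distribution p_hat over omega, equation (20)
    p_hat = np.array([sum(x == w for x in data) for w in omega]) / len(data)

    # feature matrix F (K x |omega|) and the constraint values E_phat[f_i]
    F = np.array([[f(w) for w in omega] for f in features])
    E_phat = F @ p_hat    # here [0.6, 0.6]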
The Maximum Entropy Principle states that the best model of the
data is the distribution q representing the solution to the following problem:

    max_q H(q)   s.t. E_q[f_i] = E_p̂[f_i] for all i = 1, . . . , K.    (22)

This optimization problem has a concave objective to be maximized (which
is equivalent to minimizing the convex objective −H(q)), linear constraints,
and an overall convex space for the variable q. Thus, it is a convex
optimization problem and we know that if a solution exists, then it is unique.
Let us now apply the usual convex optimization machinery to find a general
form for the solution. The Lagrangean is

    L(q, λ) = H(q) + ∑_i λ_i (E_q[f_i] − E_p̂[f_i]) + λ_0 (∑_x q(x) − 1)    (23)

In the above, the λ_i are the Lagrange multipliers associated to each of the
constraints i = 1, . . . , K, and λ_0 corresponds to the normalization
constraint on q. Note that if Ω is finite, q is a vector indexed by x. We take
the partial derivative of L w.r.t each vector element q(x). For this, a very
useful identity will be the following:

    d/dy (y log y) = log y + 1    (24)
Therefore,

    ∂L/∂q(x) = − log q(x) − 1 + ∑_i λ_i f_i(x) + λ_0    (25)

By equating the above with 0 we obtain

    log q(x) = ∑_i λ_i f_i(x) + λ_0 − 1    (26)

or

    q(x) ∝ e^{∑_i λ_i f_i(x)}    (27)

or

    q_λ = (1/Z_λ) e^{λ^T f}    (29)

The normalization constant Z_λ, also known as the partition function,
is defined as

    Z_λ = ∑_x e^{∑_i λ_i f_i(x)}    (31)

This is the general form of the solution of the Maximum Entropy problem.
Note that although the formulation was non-parametric, i.e. we were
optimizing over all possible distributions over Ω, the resulting solution
depends on K parameters, one for each constraint.
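Continuing the earlier sketch, the exponential form (29), (31) can be
evaluated directly when Ω is small; F below is the K x |Ω| feature matrix
from the previous snippet.

    import numpy as np

    def maxent_q(lam, F):
        # q_lambda(x) = exp(sum_i lam_i f_i(x)) / Z_lambda, equations (29), (31)
        scores = lam @ F                     # sum_i lam_i f_i(x) for each x
        q = np.exp(scores - scores.max())    # subtract max for numerical stability
        return q / q.sum()

    q = maxent_q(np.array([0.4, 0.4]), F)    # a distribution over omega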

2.1  Examples

1. Without any constraints, the solution reduces to q ∝ 1, i.e. the
   uniform distribution.

2. Let x = (x_1, . . . , x_d) be a d-dimensional vector of binary variables
   x_j ∈ {0, 1}. Let f_j = x_j for j = 1, . . . , d. Then

       q(x) ∝ e^{∑_j λ_j x_j} = ∏_j (e^{λ_j})^{x_j}    (32)

   In other words, if the features depend only on one variable, in the
   maximum entropy distribution the variables are independent. In particular,
   in our case we recover the (multivariate) binomial distribution.
3. Let us solve the above case again, explicitly, for x = (y, z) ∈ {0, 1}^2.
   Define q by its four parameters q_11, q_10, q_01, q_00. We have three
   constraints: q_11 + q_10 = ȳ, q_11 + q_01 = z̄, q_11 + q_10 + q_01 + q_00 = 1,
   where ȳ, z̄ are respectively the sample means of y, z. Hence, the solution q
   depends on one free parameter only; let that be q_11. We can write q in
   the following way:

       [ q_11         ȳ − q_11          ]
       [ z̄ − q_11    1 − ȳ − z̄ + q_11 ]

   and the objective is

       H(q_11) = − q_11 log q_11 − (ȳ − q_11) log(ȳ − q_11) − (z̄ − q_11) log(z̄ − q_11)
                 − (1 − ȳ − z̄ + q_11) log(1 − ȳ − z̄ + q_11)    (33)

   Taking the derivative we get

       dH/dq_11 = log [ (ȳ − q_11)(z̄ − q_11) / (q_11 (1 − ȳ − z̄ + q_11)) ] = 0    (34)

   The above is solved for q_11 = ȳ z̄, which amounts to independence between
   y and z (a numeric check follows the examples list).
4. Assume the same 2-dimensional binary sample space, but now with the
   feature f(y, z) = 1 iff y = z. Let the expectation of f from the data be
   f̄ = N_{y=z}/N. Then

       L(q, λ, λ_0) = − ∑_{y,z} q_yz log q_yz − λ (q_00 + q_11 − f̄) − λ_0 (∑_{yz} q_yz − 1)    (35)

       ∂L/∂q_00 = − log q_00 − 1 − λ − λ_0 = 0    (36)
       ∂L/∂q_11 = − log q_11 − 1 − λ − λ_0 = 0    (37)

   From the last two equations we find that q_11 = q_00, and therefore
   q_11 = q_00 = f̄/2. Thus, the solution q is

       [ f̄/2          (1 − f̄)/2 ]
       [ (1 − f̄)/2    f̄/2       ]

   Again, the solution makes q as uniform as possible under the given
   constraints. Note that because the feature depends on both y and z, the
   two variables are dependent in the resulting distribution.

5. Exercise. Take the same sample space as above, with the unique
   feature f(y, z) = 1 iff y = z = 1. What is the corresponding MaxEnt
   distribution?
6. Assume now that Ω = (−∞, ∞) and thus we have a continuous Maximum
   Entropy problem. Let the features be f_1(x) = x, f_2(x) = x². Then the
   maximum entropy distribution is

       q(x) ∝ e^{λ_1 x + λ_2 x²}    (38)

   which represents a Gaussian that fits the first two moments of the
   observed data.

7. Exercise. Markov random fields and decomposable graphical models
   (aka junction trees) over discrete domains are maximum entropy
   distributions. Can you identify the features that they are matching?
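As a sanity check of example 3, the following sketch (illustrative, not from
the notes) maximizes H(q_11) of equation (33) on a grid and compares with the
closed-form answer q_11 = ȳ z̄.

    import numpy as np

    def H(q11, ybar, zbar):
        # the entropy objective of equation (33)
        p = np.array([q11, ybar - q11, zbar - q11, 1 - ybar - zbar + q11])
        p = p[p > 0]
        return -np.sum(p * np.log(p))

    ybar, zbar = 0.6, 0.3                        # arbitrary sample means
    lo = max(0.0, ybar + zbar - 1.0)             # feasibility bounds for q11
    hi = min(ybar, zbar)
    grid = np.linspace(lo + 1e-9, hi - 1e-9, 200001)
    q11_star = grid[np.argmax([H(t, ybar, zbar) for t in grid])]
    print(q11_star, ybar * zbar)                 # both approximately 0.18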

3  Minimum Relative Entropy

The Minimum Relative Entropy (MRE) problem generalizes the Maximum
Entropy (ME) problem to the case when we also have a prior distribution q_0.
We want to find the model that matches the sufficient statistics E_p̂[f_i] of
the data and is close to the prior:

    min_q D(q||q_0)   s.t. E_q[f_i] = E_p̂[f_i] for i = 1, . . . , K    (39)

The Lagrangean of this problem is

    L(q, λ) = ∑_x q(x) log [q(x)/q_0(x)] − ∑_i λ_i (∑_x f_i(x) q(x) − E_p̂[f_i]) + λ_0 (∑_x q(x) − 1)    (40)

And the solution is

    q_λ(x) ∝ q_0(x) e^{∑_i λ_i f_i(x)}    (41)

or

    q_λ = (1/Z_λ) q_0 e^{λ^T f}    (42)

with

    Z_λ = ∑_x q_0(x) e^{∑_i λ_i f_i(x)}    (43)

Note that the ME problem is a special case of the MRE problem with q_0 ∝ 1,
the uniform distribution. Indeed, it is easy to see that

    D(q||uniform) = ∑_x q log q − ∑_x q log (1/|Ω|) = −H(q) + log |Ω|    (44)

3.1  The relationship with exponential family models

From (42) we have that

    log q(x) = λ^T f(x) − log Z_λ + log q_0(x)    (45)

A family of probability distributions that can be put in this form is called
an exponential family model. Exponential family models comprise (multivariate)
normal distributions, Markov random fields (with positive distributions),
binomial and multinomial models, etc. They have many convenient properties,
some of which are evident from the definition above. For example, exponential
family models are essentially the only parametric models that have a finite
number of sufficient statistics (distributions that are piecewise uniform may
also have finite sufficient statistics; in their case, the sufficient
statistics are the intervals in which the data lie); they have conjugate
priors; from the differential geometry p.o.v., exponential families represent
flat manifolds, i.e. affine function spaces spanned by the vectors f_i.
We also know that Maximum Likelihood (ML) estimation for exponential
models consists in fitting the sufficient statistics of the data. Therefore,
finding the optimal parameters for MRE/ME models can be seen as ML
estimation. The next section discusses this in detail.

4  The ME-ML duality

The ME/MRE problem is a convex optimization problem. Here we examine
its dual and show that it represents a Maximum Likelihood problem.
The dual objective of problem (39) is given by L(λ) = inf_q h(q, λ), where
h(q, λ) denotes the Lagrangean (40). We can obtain L explicitly by
substituting the solution (42) into h:

    L(λ) = h(q_λ, λ)    (46)
         = ∑_x q_λ log [ q_0 e^{∑_i λ_i f_i} / (Z_λ q_0) ] − ∑_i λ_i (∑_x f_i q_λ − E_p̂[f_i])    (47)
         = ∑_i λ_i ∑_x f_i q_λ − log Z_λ − ∑_i λ_i ∑_x f_i q_λ + ∑_i λ_i E_p̂[f_i]    (48)
         = ∑_i λ_i E_p̂[f_i] − log Z_λ    (49)

Hence, the dual of the MRE problem is

    max_λ  λ^T E_p̂[f] − log Z_λ    (50)

Since all the constraints in the primal problem were equality constraints, the
dual has no constraints on λ.
We now show that the dual problem represents maximizing a likelihood. The
log-likelihood of the parameters λ given the dataset D is

    l(λ) = ∑_{x_i ∈ D} log q_λ(x_i)    (51)
         = N ∑_x p̂(x) log q_λ(x)    (52)
         = N ∑_x p̂(x) [ log q_0(x) + ∑_i λ_i f_i(x) − log Z_λ ]    (53)
         = N ( ∑_i λ_i E_p̂[f_i] − log Z_λ ) + constant    (54)
         = N L(λ) + constant    (55)

Therefore, solving the MRE problem is equivalent to maximizing the likelihood
of the exponential model given by (42).
It is also easy to show that maximizing the likelihood for any model family
{q} is equivalent to minimizing the KL divergence from p̂ to {q}:

    D(p̂||q) = ∑_x p̂ log (p̂/q)    (56)
            = − ∑_x p̂ log q − H(p̂)    (57)

where the first term equals −l(q)/N.

We can summarize the previous sections in the following way: for a fixed set
of features f and a fixed data set we define

    Q = { q | q ∝ q_0 e^{λ^T f} }    (58)
    P = { p | E_p[f] = E_p̂[f] }    (59)

i.e. the family of exponential models with sufficient statistics f and the
family of distributions that fit the sufficient statistics of the data. Let Q̄
be the closure of Q under the Euclidean norm. Then (see Della Pietra et al.)
the MRE distribution q* is the unique distribution in P ∩ Q̄, and is also the
unique distribution satisfying

    q* = argmin_{q ∈ P} D(q||q_0) = argmin_{q ∈ Q̄} D(p̂||q)    (60)

The first optimization in the above represents the original MRE problem,
the second is the dual ML problem. Minimizing a KL divergence to a set
can be thought of as projecting a distribution on the respective set. Note
that in our case the two projections differ since they are reversed forms of
the KL divergence.

[Figure 1 about here]

Figure 1: The duality between MRE and ML: the m-projection corresponds
to ML estimation, the e-projection corresponds to MRE estimation.

Thus, one obtains the same solution either by the e-projection of the prior
q_0 on the data manifold P or by the m-projection of the empirical
distribution p̂ on the model manifold Q.
The similarity to the EM algorithm is not incidental; see for example Neal
and Hinton, "A new view of the EM algorithm", for a view of the Expectation
Maximization algorithm that emphasizes the alternating minimization of KL
divergences.
A Pythagorean theorem is yet another equivalent way of characterizing q*.
For any q ∈ Q̄ and any p ∈ P,

    D(p||q) = D(p||q*) + D(q*||q)    (61)

Proof. Let p_1, p_2 ∈ P and q_1, q_2 ∈ Q, with q_2 ∝ q_1 e^{μ^T f} for some μ.
Then

    D(p_1||q_1) − D(p_1||q_2) − D(p_2||q_1) + D(p_2||q_2)
        = ∑_x p_1 log (q_2/q_1) − ∑_x p_2 log (q_2/q_1)    (62)
        = ∑_x (p_1 − p_2) μ^T f    (63)
        = (E_{p_1}[f] − E_{p_2}[f])^T μ    (64)

(the log-normalization terms cancel because p_1 and p_2 both sum to 1).
If now p_1 = q_1 = q*, p_2 = p, q_2 = q we obtain

    0 − D(q*||q) − D(p||q*) + D(p||q) = (E_{q*}[f] − E_p[f])^T μ = 0    (65)

since E_{q*}[f] = E_p[f] = E_p̂[f] (both q* and p belong to P), which
proves (61).

5  Estimating the parameters

For some ME models, there is a direct relationship between the parameters
and the sufficient statistics; e.g. for normal distributions, multinomial
distributions, and decomposable models the parameter fitting is done
analytically with a simple formula. For other exponential models, we have to
resort to iterative methods.
Gradient ascent. One method to find the parameters of the MRE
distribution is to find the unique maximum of the dual objective L(λ) in (50).
Here we have the problem of computing the normalization constant Z_λ, which
involves a summation over the whole sample space. Sometimes Z_λ and its
partial derivatives can be computed analytically; in the other cases, one has
to approximate these derivatives. The most direct method is to recall that

    ∂ log Z_λ / ∂λ_i = E_{q_λ}[f_i]

and apply a Markov-chain Monte Carlo method to compute the expectations.
Both Gibbs and Metropolis sampling from a ME distribution are
straightforward, since

    q(x)/q(x') = [q_0(x)/q_0(x')] e^{λ^T (f(x) − f(x'))}    (66)

For q_0 uniform, binary features, and x, x' differing in only one feature f_j,
the above ratio becomes equal to e^{±λ_j}. Note also that one needs only one
sample from q to estimate all the derivatives.
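When Ω is small enough to enumerate, the expectations are exact and gradient
ascent on the dual (50) takes a few lines. This is a minimal sketch with q_0
uniform (so it solves the ME problem), reusing the feature matrix F and the
constraints E_phat from the earlier snippets:

    import numpy as np

    def fit_maxent(F, E_phat, lr=0.5, iters=5000):
        # gradient ascent on L(lam) = lam . E_phat - log Z_lam, equation (50);
        # the gradient is E_phat - E_{q_lam}[f], computed exactly here
        lam = np.zeros(F.shape[0])
        for _ in range(iters):
            s = lam @ F
            q = np.exp(s - s.max())
            q /= q.sum()                    # q_lambda, equation (29)
            lam += lr * (E_phat - F @ q)    # primal constraint violation
        return lam, q

    lam, q = fit_maxent(F, E_phat)
    # at the optimum, E_q[f] matches E_phat as required by (22)
    print(F @ q, E_phat)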
Improved Iterative Scaling is a method introduced by Della Pietra et
al. which is generally faster than the gradient method.

Improved Iterative Scaling (IIS) Algorithm
Given p̂, q_0, and the features f = (f_1, . . . , f_K). Assume q_0 is
absolutely continuous w.r.t. p̂ (i.e. the prior gives non-zero probability to
the data). Denote

    f#(x) = ∑_i f_i(x)    (67)

1. Initialize q ← q_0, λ ← 0

2. For each i, let δ_i be the unique solution of

       E_q[f_i e^{δ_i f#}] = E_p̂[f_i]    (68)

3. Set q ← q e^{∑_i δ_i f_i} (renormalized) or, equivalently, set λ ← λ + δ

4. If q has not converged yet, go to step 2

The estimation in step 2 of the algorithm can be done by bisection, noting
that the left-hand side of (68) is monotonic in δ_i.
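A sketch of the IIS update on a finite sample space, assuming non-negative
features and empirical expectations strictly between 0 and their maximum (so
the root-finding in step 2 brackets a solution). It reuses F and E_phat from
the earlier snippets and is an illustration, not Della Pietra et al.'s
implementation.

    import numpy as np
    from scipy.optimize import brentq

    def iis_sweep(q, F, E_phat):
        # step 2: for each i, solve E_q[f_i exp(delta f#)] = E_phat[f_i]
        f_sharp = F.sum(axis=0)                   # f#(x), equation (67)
        delta = np.zeros(F.shape[0])
        for i in range(F.shape[0]):
            g = lambda d: np.sum(q * F[i] * np.exp(d * f_sharp)) - E_phat[i]
            delta[i] = brentq(g, -30.0, 30.0)     # monotone in d, so bisection works
        # step 3: q <- q exp(sum_i delta_i f_i), renormalized
        q_new = q * np.exp(delta @ F)
        return q_new / q_new.sum(), delta

    q = np.ones(F.shape[1]) / F.shape[1]          # initialize at q_0 = uniform
    for _ in range(50):
        q, delta = iis_sweep(q, F, E_phat)
    print(F @ q, E_phat)                          # the constraints are matched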

Learning the features

Della Pietra et al. present a method for learning features from data. The
method is a greedy algorithm, that consists of two steps: first, from a set
of candidate features C one feature is selected to be added to the model;
second, the optimal parameters of the augmented model are fit to the data.
Field Induction Algorithm
Input: p , q0 , a rule for constructing candidate features or a set of
possible features
1. Set q q0
2. Feature selection
(a) Construct the current set of candidate features C
(b) For g C compute the gain G(g) = max D(
p ||q)D(
p ||qeg /Z )
and
that maximizes the gain
(c) Add g = argmaxC G(g) to the current set of features
3. Parameter fitting Refit the model parameters by IIS, starting from
the previous parameter set and .

4. Return to step 2
13

The algorithm is guaranteed to improve the likelihood of the data at each


iteration, and to give the best model for the given features, but it incorporates
no model selection method. So the stopping criterion and model validation
are left to the standard methods.
Below we discuss the feature selection in more detail.

6.1  Greedy feature selection

In the following we assume that all features take values in {0, 1}. Assume
that q is the current ME distribution and g ∈ C is a new feature. Let
Z_α = ∑_x q e^{αg}, q̃ = q e^{αg}/Z_α, and let G(α, g) = D(p̂||q) − D(p̂||q̃)
be the gain of adding g with parameter α to the model. Then, it is easy to
show the following three equalities:

    G(α, g) = α E_p̂[g] − log E_q[e^{αg}]    (69)

    ∂G(α, g)/∂α = E_p̂[g] − E_q̃[g]    (70)

    ∂²G(α, g)/∂α² = − Var_q̃[g] ≤ 0    (71)

This shows that G(α, g) has a unique maximum w.r.t. α.


Lemma. We now show that the maximum is at

    α̂ = log [ p̂(g = 1)(1 − q(g = 1)) / ((1 − p̂(g = 1)) q(g = 1)) ]    (72)

We have that

    q̃ = q e^α / Z_α  if g = 1,    q̃ = q / Z_α  if g = 0    (73)

Therefore, E_q̃[g] = e^α q(g = 1)/Z_α and 1 − E_q̃[g] = q(g = 0)/Z_α =
(1 − q(g = 1))/Z_α. Remembering from (70) that E_q̃[g] = E_p̂[g] at the
optimal α, we obtain

    E_p̂[g] = e^α q(g = 1)/Z_α    (74)
    1 − E_p̂[g] = (1 − q(g = 1))/Z_α    (75)

By taking the ratio of the two, we obtain (72).

Lemma. Denote by p, q the probabilities p̂(g = 1), q(g = 1) respectively.
Then, the gain G(g) is equal to D(B_p||B_q), where B_p, B_q are respectively
the Bernoulli distributions with p, q as probabilities of success.
Proof.

    D(B_p||B_q) = p log (p/q) + (1 − p) log [(1 − p)/(1 − q)]    (76)
                = p log [ p(1 − q) / (q(1 − p)) ] − log [(1 − q)/(1 − p)]    (77)

By (69), at the maximizer α̂ of (72), for which e^{α̂} = p(1 − q)/(q(1 − p)),

    G(α̂, g) = p log [ p(1 − q) / (q(1 − p)) ] − log E_q[e^{α̂ g}]    (78)
             = p log [ p(1 − q) / (q(1 − p)) ] − log [ e^{α̂} q(g = 1) + 1 · q(g = 0) ]    (79)
             = p log [ p(1 − q) / (q(1 − p)) ] − log [ p(1 − q)/(1 − p) + 1 − q ]    (80)
             = p log [ p(1 − q) / (q(1 − p)) ] − log [(1 − q)/(1 − p)]    (81)

which is exactly (77). Thus, one adds the feature g along which the
discrepancy of p̂ and q is maximized (a small sketch follows).
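By the lemma, scoring a candidate feature requires only two numbers,
p̂(g = 1) and q(g = 1). A minimal sketch of the selection step (illustrative;
it assumes both probabilities lie strictly in (0, 1)):

    import numpy as np

    def bernoulli_kl(p, q):
        # D(B_p || B_q), equations (76)-(77); assumes 0 < p, q < 1
        return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

    def gain(g_values, p_hat, q):
        # gain G(g) of a binary candidate feature g, by the lemma above;
        # g_values lists g(x) over the sample space
        return bernoulli_kl(np.dot(p_hat, g_values), np.dot(q, g_values))

    # greedy selection over a candidate set C:
    # g_star = max(C, key=lambda g: gain(g, p_hat, q))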

7  Maximum Entropy Discrimination

7.1  The idea

Here we apply the ME framework to classification problems. We have a
dataset D = {(x_1, y_1), (x_2, y_2), . . . , (x_N, y_N)} of N labeled examples
and a family of classifiers F = {f(·; θ)} parametrized by θ. The values of the
labels are in {±1} and the label of x given by f(·; θ) is sign f(x; θ).
Standard classification chooses an f ∈ F that has low classification error and
perhaps obeys other regularization conditions. In the Maximum Entropy
approach to classification, we find a distribution q over F such that the
expected value f̂ of all classifiers f ∈ F under q classifies the training set
well. Naturally, more than one such q exists and we choose the one which
has maximum entropy:

    max_q H(q)   s.t. y_i E_q[f(x_i)] ≥ 1 for i = 1, . . . , N    (82)

A direct generalization of ME classification is to introduce a prior q_0 over
F and then to choose the q that is nearest to the prior. This is the MRE
discrimination problem:

    min_q D(q||q_0)   s.t. y_i E_q[f(x_i)] ≥ 1 for i = 1, . . . , N    (83)

7.2  The MRE solution

The problems (82), (83) are convex problems with linear constraints; by
applying the usual transformations we obtain the Lagrangean, the general
exponential form of the solution, and the dual. Note that unlike the
unsupervised ME problem, the current problem has inequality constraints, so
the Lagrange multipliers will have to be ≥ 0. Another difference is that the
sums are now replaced with integrals, as the function families we typically
consider are continuous.

    h(q, λ) = D(q||q_0) − ∑_i λ_i (y_i E_q[f(x_i)] − 1)
            = ∫ q(θ) [ log q(θ) − log q_0(θ) − ∑_i λ_i y_i f(x_i; θ) ] dθ + ∑_i λ_i

    q_λ(θ) = (1/Z_λ) q_0(θ) e^{∑_i λ_i y_i f(x_i; θ)}

    Z_λ = ∫ q_0(θ) e^{∑_i λ_i y_i f(x_i; θ)} dθ

    L(λ) = h(q_λ, λ)
         = ∫ q_λ(θ) [ ∑_i λ_i y_i f(x_i; θ) − log Z_λ − ∑_i λ_i y_i f(x_i; θ) ] dθ + ∑_i λ_i
         = ∑_i λ_i − log Z_λ

Hence, the solution has again an exponential form, with one factor for each
training example. The dual problem is

    max_λ L(λ)   s.t. λ_i ≥ 0 for i = 1, . . . , N    (90)

7.3  Computing the solution

The optimal λ's can be found by gradient ascent on the dual objective L. As
previously seen, the derivatives of log Z_λ are expectations:

    ∂ log Z_λ/∂λ_i = ∫ (1/Z_λ) q_0(θ) e^{∑_j λ_j y_j f(x_j; θ)} y_i f(x_i; θ) dθ = y_i E_{q_λ}[f(x_i; ·)]    (91)

    ∂L/∂λ_i = 1 − y_i E_{q_λ}[f(x_i; ·)]    (92)

These are equal to the primal constraint violations. In practice, we start
with λ_i = 0 for all i and update the Lagrange multipliers by

    λ_i ← max[ 0, λ_i + η ∂L/∂λ_i ]    (93)

where η is a step size. This iteration will converge to the unique solution
of (83) if one exists. Note that during the iteration, L increases, while the
primal objective increases too. This is because our method is essentially an
exterior point method: we start from the unconstrained optimum and adjust
the parameters until all the constraints are satisfied.
As is usual with constrained optimization, only the λ_i parameters that
correspond to tight constraints are non-zero. The corresponding x_i examples
represent support vectors for this problem. Thus the solution is often
sparse.
The ME classifier is given by

    f̂(x) = ∫ q_λ(θ) f(x; θ) dθ    (94)

Note that f̂ may not belong to the original family F.
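A minimal sketch of the update (93) for a discretized classifier family: θ
ranges over a finite grid with a positive prior q_0, and f(x; θ) = θ^T x
(linear classifiers). All names and choices here are illustrative assumptions,
not part of the original notes.

    import numpy as np

    def med_fit(X, y, thetas, q0, eta=0.1, iters=1000):
        # X: N x d data, y: labels in {-1, +1}, thetas: M x d parameter grid,
        # q0: prior over the grid. Projected gradient ascent on the dual (90).
        fvals = thetas @ X.T                      # f(x_i; theta), an M x N table
        lam = np.zeros(len(y))
        for _ in range(iters):
            # q_lambda(theta) prop. to q0(theta) exp(sum_i lam_i y_i f(x_i; theta))
            logq = np.log(q0) + fvals @ (lam * y)
            q = np.exp(logq - logq.max())
            q /= q.sum()
            margins = y * (q @ fvals)             # y_i E_q[f(x_i; .)]
            lam = np.maximum(0.0, lam + eta * (1.0 - margins))   # update (93)
        return lam, q

    # the resulting classifier is the expectation (94):
    #   f_hat(x) = sum_theta q(theta) * (theta @ x)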

8  ME discrimination extensions

8.1  Using generative models

One important use of the ME classification framework is to combine
discriminative classification, i.e. classification via optimization of the
decision boundaries, with generative models, i.e. probabilistic models of how
the data were generated. The former have the advantage that they optimize the
right criterion, the latter are much better at incorporating domain features.
Assume that we have a family of probabilistic models describing the data
in each class; let them be P(x|θ_+), P(x|θ_−) respectively. The models may
belong to different families (e.g. Gaussian for the + class and uniform on a
rectangle for the − class), and the parameters θ_± may belong to different
spaces. A likelihood ratio classifier with these model families is

    f(x; θ_+, θ_−, b) = log [ P(x|θ_+)/P(x|θ_−) ] + b    (95)

We construct a MRE distribution over this classifier family, which has the
form

    q(θ_+, θ_−, b) ∝ q_0(θ_+, θ_−, b) e^{∑_i λ_i [y_i (log P(x_i|θ_+)/P(x_i|θ_−) + b) − 1]}    (96)

If the prior q_0 factors into q_0(θ_+) q_0(θ_−) q_0(b), then the MRE
distribution also factors into independent distributions for the 3 parameters:

    q(θ_+, θ_−, b) ∝ q_0(θ_+) e^{∑_i λ_i y_i log P(x_i|θ_+)} · q_0(θ_−) e^{−∑_i λ_i y_i log P(x_i|θ_−)}
                     · q_0(b) e^{b ∑_i λ_i y_i} · e^{−∑_i λ_i}    (97)

There are several model classes for which computing the partition function
and the required expectations can be done in closed form; for example, when
P(·|θ_±) is an exponential family model and q_0 is the corresponding conjugate
prior. In particular, graphical models with fixed structure and exponential
family distributions are also tractable; in the special case of tree graphical
models, one can construct more interesting graphical models that can be
integrated over structures and parameters (see Jaakkola et al.).

8.2  Inseparable data (soft margins)

If the data are not separable by any model in F, then the constraints of (83)
cannot be satisfied for any q and the parameters λ tend to infinity. Now we
extend the MRE framework to deal with this case. We will fix a variable
margin γ_i for each example (x_i, y_i) and we will estimate a joint MRE
distribution over the classifier parameters θ and the margins γ. Note once
again that, if the prior q_0 is factored w.r.t. θ and γ, then θ and γ will be
independent under the final solution.
The soft margin MRE problem is

    min_q D(q||q_0)   s.t. E_q[y_i f(x_i) − γ_i] ≥ 0 for i = 1, . . . , N    (98)

where q, q_0 are now distributions over θ, γ_1, . . . , γ_N. If q_0(θ, γ) =
q_0(θ) ∏_i q_0(γ_i), then the MRE solution is

    q ∝ q_0(θ) e^{∑_i λ_i y_i f(x_i; θ)} ∏_{i=1}^N q_0(γ_i) e^{−λ_i γ_i}    (99)

If we denote the corresponding normalization constants by Z_θ, Z_{γ_i}
(keeping in mind that they are functions of λ), then the dual objective can
be written as

    L(λ) = − log Z_θ − ∑_i log Z_{γ_i}    (100)

The form of the MRE distribution for γ suggests an exponential prior

    q_0(γ) = c e^{−c(1−γ)}    (101)

For this q_0 the MRE distribution of γ_i is

    q(γ_i) = (c − λ_i) e^{−(c−λ_i)(1−γ_i)}    (102)

In the above, λ_i needs to be ≤ c for a proper q to exist. Hence, introducing
a soft margin in this way is equivalent to putting an upper bound on λ, very
much like in the case of SVMs.
Note also that E_{q_0}[γ_i] = 1 − 1/c. Therefore, if E_q[y_i f(x_i)] ≥ 1 − 1/c
then λ_i = 0. Since Z_{γ_i} ∝ 1/(c − λ_i), each term − log Z_{γ_i} in the dual
objective introduces a penalty log(c − λ_i). If λ_i approaches c from below,
this penalty tends to −∞, effectively stopping λ_i from growing too much.

8.3  Relationship to SVMs

The following theorem, from Jaakkola et al., shows that there is a strong
relationship between SVMs and MRE.

Theorem. Assume f(x; θ, b) = θ^T x + b and q_0(θ, b, γ) = q_0(θ) q_0(b) ∏_i q_0(γ_i),
where q_0(θ) = N(0, I), q_0(b) approaches a non-informative prior, and q_0(γ_i)
is given by equation (101). Then, the Lagrange multipliers λ are obtained by
maximizing L(λ) subject to 0 ≤ λ_i ≤ c and ∑_{i=1}^N λ_i y_i = 0, where

    L(λ) = ∑_i [ λ_i + log(c − λ_i) ] − (1/2) ∑_{i,j=1}^N λ_i λ_j y_i y_j x_i^T x_j    (103)

Note the similarity between this objective and the SVM with slack variables.
The only difference is the additional term log(c − λ_i). Moreover, for the
case of separable data and c → ∞, the same classifiers are obtained.
