
STAT 538                                Lecture 8

Maximum Entropy Models

(c) Marina Meila
mmp@stat.washington.edu

1  Entropy and KL divergence

Assume that the sample space is Ω, a (typically large) finite space. For any
distribution p : Ω → [0, 1] the entropy is defined as

    H(p) = − ∑_{x∈Ω} p(x) log p(x)    (1)

In the above, and throughout this chapter, we adopt the convention 0 log 0 = 0.
The function H(p) is non-negative and concave on the space of all distributions
over Ω. The minimum H = 0 is attained for deterministic distributions, and the
maximum Hmax = log |Ω| is attained for the uniform distribution over Ω.
The entropy measures the uncertainty in a given distribution. Closely related
to the entropy is the Kullback–Leibler divergence between two distributions:

    D(p||q) = ∑_x p(x) log [p(x)/q(x)]    (2)

The KL divergence is asymmetric in p, q. It is non-negative, convex in (p, q),
and attains its minimum D(p||q) = 0 iff p ≡ q (proofs and details below).
The entropy and KL divergence can also be defined for continuous sample
spaces, by replacing the summation with an integral. The only property
that changes is that the entropy of a continuous distribution is not always
non-negative. All the other properties mentioned here remain the same.

Properties of H and D

1. H ≥ 0 (obvious)

2. H is a concave function of p (we know x ln x is convex)

3. H(p) = 0 iff p is deterministic

4. H(p) is maximized by p = u, the uniform distribution; in this case
   H(u) = ln |Ω|.
   Proof. Let Ω = {0, 1, . . . , m} w.l.o.g. Then H is a function of the m
   variables p_{1:m}, with ∑_{x=0}^m p_x = 1 and p_0 = 1 − ∑_{x=1}^m p_x.

       H(p) = − ∑_{x=1}^m p_x ln p_x − (1 − ∑_{x=1}^m p_x) ln (1 − ∑_{x=1}^m p_x)    (3)

       ∂H/∂p_x = − ln p_x − 1 + ln (1 − ∑_{x'=1}^m p_{x'}) + 1 = 0    (4)

       p_x = 1 − ∑_{x'=1}^m p_{x'} = p_0    (5)

   In other words, all p_x must be equal.


5. D(p||q) ≥ 0
   (D is a Bregman divergence; it corresponds to the convex function x ln x;
   see Lecture 6)

6. D(p||q) is convex in (p, q)
   Proof. We use the perspective of the convex function − ln x. We have
   (BV 3.2.6) that t (− ln (x/t)) is also convex jointly in (x, t), and we set
   t = p_i, x = q_i. It follows that −p_i ln (q_i/p_i) = p_i ln (p_i/q_i) is
   convex, and therefore D is convex.
7. We define the conditional entropy of two random variables X, Y with
   joint distribution p_XY by

       H(X|Y) = ∑_y p_Y(y) H(p_{X|Y=y})    (6)

   Thus, the conditional entropy is the average of the entropies H(p_{X|Y=y}).
   We have that

       H(X|Y) = ∑_x ∑_y p_y (− p_{x|y} ln p_{x|y})    (7)
              ≤ ∑_x [ − (∑_y p_y p_{x|y}) ln (∑_y p_y p_{x|y}) ]   (concavity of −t ln t)    (8)
              = − ∑_x p_x ln p_x = H(X)    (9)

   In other words, H(X|Y) ≤ H(X): conditioning on (observing) another
   variable Y decreases the uncertainty in X. The amount of the decrease
   is called the mutual information between X and Y:

       I(X||Y) = H(X) − H(X|Y) = H(Y) − H(Y|X)    (10)

   (proving this last equality is left as an exercise)


8. The mutual information is non-negative, and is 0 iff the two variables
   are independent, as a consequence of the following identity

       I(X||Y) = D(p_XY || p_X p_Y)    (11)

   (proof by direct calculation; a numeric check appears after this list).


9. For two variables X, Y with joint distribution P_XY = [p_xy]_{x,y}, the
   joint entropy is defined as

       H(X, Y) ≡ H(P_XY) = − ∑_{xy} p_xy ln p_xy    (12)

   It is easy to verify that

       H(X, Y) = H(X) + H(Y|X) ≤ H(X) + H(Y).    (13)

   The above inequality is satisfied with equality only if X ⊥ Y, i.e. only
   if X and Y are independent.
10. Aggregation decreases the entropy. Let the random variable Y be a
    deterministic function of X. When X is discrete, that means that
    several values x may be mapped to the same value y. Intuitively, the
    values x are aggregated into sets labeled by y. Hence, p_y = ∑_{x→y} p_x.

        H(X) = ∑_x p_x (− ln p_x) = ∑_y ∑_{x→y} p_x (− ln p_x)    (14)
             = ∑_y p_y [ ∑_{x→y} (p_x/p_y) ln (1/p_x) ]    (15)
             ≤ ∑_y p_y ln [ ∑_{x→y} (p_x/p_y) (1/p_x) ]   (because ln z is concave)    (16)
             = ∑_y p_y ln (n_y/p_y),  where n_y = #{x → y}    (17)
             = H(Y) + ∑_y p_y ln n_y    (18)

    Conversely, since p_x ≤ p_y for every x → y, each term in the bracket
    of (15) is at least ln (1/p_y), so H(X) ≥ H(Y): aggregation decreases
    the entropy, and by (18) the decrease is at most ∑_y p_y ln n_y.

11. Mixing increases the entropy. For the mixture distribution tp + (1 − t)q,
    by the concavity of the entropy it follows that

        H(tp + (1 − t)q) ≥ t H(p) + (1 − t) H(q)    (19)
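As referenced in item 8, here is a small numeric check of equations (6), (10)
and (11). It is an illustrative sketch, not part of the original notes; the
joint distribution p_xy is an arbitrary choice.

    import numpy as np

    def entropy(p):
        # H(p) = -sum_x p(x) ln p(x), with the convention 0 ln 0 = 0
        p = p[p > 0]
        return -np.sum(p * np.log(p))

    def kl(p, q):
        # D(p||q) = sum_x p(x) ln [p(x)/q(x)]
        mask = p > 0
        return np.sum(p[mask] * np.log(p[mask] / q[mask]))

    # an arbitrary 2x3 joint distribution p_XY
    p_xy = np.array([[0.2, 0.1, 0.1],
                     [0.1, 0.3, 0.2]])
    p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)

    # conditional entropy, equation (6)
    H_cond = sum(p_y[j] * entropy(p_xy[:, j] / p_y[j]) for j in range(3))

    # mutual information two ways: equations (10) and (11)
    I_def = entropy(p_x) - H_cond
    I_kl = kl(p_xy.ravel(), np.outer(p_x, p_y).ravel())
    assert np.isclose(I_def, I_kl)
    assert H_cond <= entropy(p_x)   # conditioning decreases uncertainty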

2  The Maximum Entropy Principle

Assume that we have a set of N observations D = {x^(1), . . . , x^(N)} from an
unknown distribution p. The observations define the empirical distribution p̂:

    p̂(x) = (1/N) ∑_{j=1}^N δ_{x^(j)}(x)    (20)

where

    δ_{x^(j)}(x) = 1 if x = x^(j), 0 otherwise    (21)

From now on we assume that the data is represented by the empirical
distribution p̂. We also have a set of features f_i(x) of the data; they are
functions f_i : Ω → (−∞, ∞), i = 1, . . . , K.
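To make these objects concrete, here is a minimal sketch (not part of the
original notes) of p̂ and the feature expectations E_p̂[f_i] on a small sample
space; the data and features are arbitrary illustrative choices.

    import numpy as np

    # sample space: binary pairs (y, z), as in the examples of section 2.1
    omega = [(y, z) for y in (0, 1) for z in (0, 1)]

    # features f_1(x) = y, f_2(x) = z (an illustrative choice)
    features = [lambda x: x[0], lambda x: x[1]]

    # observed data D
    data = [(1, 1), (1, 0), (0, 0), (1, 1), (0, 1)]

    # empirical distribution p_hat over omega, equation (20)
    p_hat = np.array([sum(x == w for x in data) for w in omega]) / len(data)

    # feature matrix F (K x |omega|) and the constraint values E_phat[f_i]
    F = np.array([[f(w) for w in omega] for f in features])
    E_phat = F @ p_hat    # here [0.6, 0.6]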
The Maximum Entropy Principle states that the best model of the
data is the distribution q representing the solution to the following problem:

    max_q H(q)   s.t. E_q[f_i] = E_p̂[f_i] for all i = 1, . . . , K.    (22)

This optimization problem has a concave objective to be maximized (which
is equivalent to minimizing the convex objective −H(q)), linear constraints,
and an overall convex space for the variable q. Thus, it is a convex
optimization problem and we know that if a solution exists, then it is unique.
Let us now apply the usual convex optimization machinery to find a general
form for the solution. The Lagrangean is

    L(q, λ) = H(q) + ∑_i λ_i (E_q[f_i] − E_p̂[f_i]) + λ_0 (∑_x q(x) − 1)    (23)

In the above, the λ_i are the Lagrange multipliers associated to each of the
constraints i = 1, . . . , K, and λ_0 corresponds to the normalization
constraint on q. Note that if Ω is finite, q is a vector indexed by x. We take
the partial derivative of L w.r.t each vector element q(x). For this, a very
useful identity will be the following:

    d/dy (y log y) = log y + 1    (24)
Therefore,

    ∂L/∂q(x) = − log q(x) − 1 + ∑_i λ_i f_i(x) + λ_0    (25)

By equating the above with 0 we obtain

    log q(x) = ∑_i λ_i f_i(x) + λ_0 − 1    (26)

or

    q(x) ∝ e^{∑_i λ_i f_i(x)}    (27)

or

    q_λ = (1/Z_λ) e^{λ^T f}    (29)

The normalization constant Z_λ, also known as the partition function,
is defined as

    Z_λ = ∑_x e^{∑_i λ_i f_i(x)}    (31)

This is the general form of the solution of the Maximum Entropy problem.
Note that although the formulation was non-parametric, i.e. we were
optimizing over all possible distributions over Ω, the resulting solution
depends on K parameters, one for each constraint.
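Continuing the earlier sketch, the exponential form (29), (31) can be
evaluated directly when Ω is small; F below is the K x |Ω| feature matrix
from the previous snippet.

    import numpy as np

    def maxent_q(lam, F):
        # q_lambda(x) = exp(sum_i lam_i f_i(x)) / Z_lambda, equations (29), (31)
        scores = lam @ F                     # sum_i lam_i f_i(x) for each x
        q = np.exp(scores - scores.max())    # subtract max for numerical stability
        return q / q.sum()

    q = maxent_q(np.array([0.4, 0.4]), F)    # a distribution over omega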

2.1  Examples

1. Without any constraints, the solution reduces to q ∝ 1, i.e. the
   uniform distribution.

2. Let x = (x_1, . . . , x_d) be a d-dimensional vector of binary variables
   x_j ∈ {0, 1}. Let f_j = x_j for j = 1, . . . , d. Then

       q(x) ∝ e^{∑_j λ_j x_j} = ∏_j (e^{λ_j})^{x_j}    (32)

   In other words, if the features depend only on one variable, in the
   maximum entropy distribution the variables are independent. In particular,
   in our case we recover the (multivariate) binomial distribution.
3. Let us solve the above case again, explicitly, for x = (y, z) ∈ {0, 1}^2.
   Define q by its four parameters q_11, q_10, q_01, q_00. We have three
   constraints: q_11 + q_10 = ȳ, q_11 + q_01 = z̄, q_11 + q_10 + q_01 + q_00 = 1,
   where ȳ, z̄ are respectively the sample means of y, z. Hence, the solution q
   depends on one free parameter only; let that be q_11. We can write q in
   the following way:

       [ q_11         ȳ − q_11          ]
       [ z̄ − q_11    1 − ȳ − z̄ + q_11 ]

   and the objective is

       H(q_11) = − q_11 log q_11 − (ȳ − q_11) log(ȳ − q_11) − (z̄ − q_11) log(z̄ − q_11)
                 − (1 − ȳ − z̄ + q_11) log(1 − ȳ − z̄ + q_11)    (33)

   Taking the derivative we get

       dH/dq_11 = log [ (ȳ − q_11)(z̄ − q_11) / (q_11 (1 − ȳ − z̄ + q_11)) ] = 0    (34)

   The above is solved for q_11 = ȳ z̄, which amounts to independence between
   y and z (a numeric check follows the examples list).
4. Assume the same 2-dimensional binary sample space, but now with the
   feature f(y, z) = 1 iff y = z. Let the expectation of f from the data be
   f̄ = N_{y=z}/N. Then

       L(q, λ, λ_0) = − ∑_{y,z} q_yz log q_yz − λ (q_00 + q_11 − f̄) − λ_0 (∑_{yz} q_yz − 1)    (35)

       ∂L/∂q_00 = − log q_00 − 1 − λ − λ_0 = 0    (36)
       ∂L/∂q_11 = − log q_11 − 1 − λ − λ_0 = 0    (37)

   From the last two equations we find that q_11 = q_00, and therefore
   q_11 = q_00 = f̄/2. Thus, the solution q is

       [ f̄/2          (1 − f̄)/2 ]
       [ (1 − f̄)/2    f̄/2       ]

   Again, the solution makes q as uniform as possible under the given
   constraints. Note that because the feature depends on both y and z, the
   two variables are dependent in the resulting distribution.

5. Exercise. Take the same sample space as above, with the unique
   feature f(y, z) = 1 iff y = z = 1. What is the corresponding MaxEnt
   distribution?
6. Assume now that Ω = (−∞, ∞) and thus we have a continuous Maximum
   Entropy problem. Let the features be f_1(x) = x, f_2(x) = x². Then the
   maximum entropy distribution is

       q(x) ∝ e^{λ_1 x + λ_2 x²}    (38)

   which represents a Gaussian that fits the first two moments of the
   observed data.

7. Exercise. Markov random fields and decomposable graphical models
   (aka junction trees) over discrete domains are maximum entropy
   distributions. Can you identify the features that they are matching?
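As a sanity check of example 3, the following sketch (illustrative, not from
the notes) maximizes H(q_11) of equation (33) on a grid and compares with the
closed-form answer q_11 = ȳ z̄.

    import numpy as np

    def H(q11, ybar, zbar):
        # the entropy objective of equation (33)
        p = np.array([q11, ybar - q11, zbar - q11, 1 - ybar - zbar + q11])
        p = p[p > 0]
        return -np.sum(p * np.log(p))

    ybar, zbar = 0.6, 0.3                        # arbitrary sample means
    lo = max(0.0, ybar + zbar - 1.0)             # feasibility bounds for q11
    hi = min(ybar, zbar)
    grid = np.linspace(lo + 1e-9, hi - 1e-9, 200001)
    q11_star = grid[np.argmax([H(t, ybar, zbar) for t in grid])]
    print(q11_star, ybar * zbar)                 # both approximately 0.18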

3  Minimum Relative Entropy

The Minimum Relative Entropy (MRE) problem generalizes the Maximum
Entropy (ME) problem to the case when we also have a prior distribution q_0.
We want to find the model that matches the sufficient statistics E_p̂[f_i] of
the data and is close to the prior:

    min_q D(q||q_0)   s.t. E_q[f_i] = E_p̂[f_i] for i = 1, . . . , K    (39)

The Lagrangean of this problem is

    L(q, λ) = ∑_x q(x) log [q(x)/q_0(x)] − ∑_i λ_i (∑_x f_i(x) q(x) − E_p̂[f_i]) + λ_0 (∑_x q(x) − 1)    (40)

And the solution is

    q_λ(x) ∝ q_0(x) e^{∑_i λ_i f_i(x)}    (41)

or

    q_λ = (1/Z_λ) q_0 e^{λ^T f}    (42)

with

    Z_λ = ∑_x q_0(x) e^{∑_i λ_i f_i(x)}    (43)

Note that the ME problem is a special case of the MRE problem with q_0 ∝ 1,
the uniform distribution. Indeed, it is easy to see that

    D(q||uniform) = ∑_x q log q − ∑_x q log (1/|Ω|) = −H(q) + log |Ω|    (44)

3.1  The relationship with exponential family models

From (42) we have that

    log q(x) = λ^T f(x) − log Z_λ + log q_0(x)    (45)

A family of probability distributions that can be put in this form is called
an exponential family model. Exponential family models comprise (multivariate)
normal distributions, Markov random fields (with positive distributions),
binomial and multinomial models, etc. They have many convenient properties,
some of which are evident from the definition above. For example, exponential
family models are essentially the only parametric models that have a finite
number of sufficient statistics (distributions that are piecewise uniform may
also have finite sufficient statistics; in their case, the sufficient
statistics are the intervals in which the data lie); they have conjugate
priors; from the differential geometry p.o.v., exponential families represent
flat manifolds, i.e. affine function spaces spanned by the vectors f_i.
We also know that Maximum Likelihood (ML) estimation for exponential
models consists in fitting the sufficient statistics of the data. Therefore,
finding the optimal parameters for MRE/ME models can be seen as ML
estimation. The next section discusses this in detail.

4  The ME-ML duality

The ME/MRE problem is a convex optimization problem. Here we examine
its dual and show that it represents a Maximum Likelihood problem.
The dual objective of problem (39) is given by L(λ) = inf_q h(q, λ), where
h(q, λ) denotes the Lagrangean (40). We can obtain L explicitly by
substituting the solution (42) into h:

    L(λ) = h(q_λ, λ)    (46)
         = ∑_x q_λ log [ q_0 e^{∑_i λ_i f_i} / (Z_λ q_0) ] − ∑_i λ_i (∑_x f_i q_λ − E_p̂[f_i])    (47)
         = ∑_i λ_i ∑_x f_i q_λ − log Z_λ − ∑_i λ_i ∑_x f_i q_λ + ∑_i λ_i E_p̂[f_i]    (48)
         = ∑_i λ_i E_p̂[f_i] − log Z_λ    (49)

Hence, the dual of the MRE problem is

    max_λ  λ^T E_p̂[f] − log Z_λ    (50)

Since all the constraints in the primal problem were equality constraints, the
dual has no constraints on λ.
We now show that the dual problem represents maximizing a likelihood. The
log-likelihood of the parameters λ given the dataset D is

    l(λ) = ∑_{x_i ∈ D} log q_λ(x_i)    (51)
         = N ∑_x p̂(x) log q_λ(x)    (52)
         = N ∑_x p̂(x) [ log q_0(x) + ∑_i λ_i f_i(x) − log Z_λ ]    (53)
         = N ( ∑_i λ_i E_p̂[f_i] − log Z_λ ) + constant    (54)
         = N L(λ) + constant    (55)

Therefore, solving the MRE problem is equivalent to maximizing the likelihood
of the exponential model given by (42).
It is also easy to show that maximizing the likelihood for any model family
{q} is equivalent to minimizing the KL divergence from p̂ to {q}:

    D(p̂||q) = ∑_x p̂ log (p̂/q)    (56)
            = − ∑_x p̂ log q − H(p̂)    (57)

where the first term equals −l(q)/N.

We can summarize the previous sections in the following way: for a fixed set
of features f and a fixed data set we define

    Q = { q | q ∝ q_0 e^{λ^T f} }    (58)
    P = { p | E_p[f] = E_p̂[f] }    (59)

i.e. the family of exponential models with sufficient statistics f and the
family of distributions that fit the sufficient statistics of the data. Let Q̄
be the closure of Q under the Euclidean norm. Then (see Della Pietra et al.)
the MRE distribution q* is the unique distribution in P ∩ Q̄, and is also the
unique distribution satisfying

    q* = argmin_{q ∈ P} D(q||q_0) = argmin_{q ∈ Q̄} D(p̂||q)    (60)

The first optimization in the above represents the original MRE problem,
the second is the dual ML problem. Minimizing a KL divergence to a set
can be thought of as projecting a distribution on the respective set. Note
that in our case the two projections differ since they are reversed forms of
the KL divergence.

[Figure 1 about here]

Figure 1: The duality between MRE and ML: the m-projection corresponds
to ML estimation, the e-projection corresponds to MRE estimation.

Thus, one obtains the same solution either by the e-projection of the prior
q_0 on the data manifold P or by the m-projection of the empirical
distribution p̂ on the model manifold Q.
The similarity to the EM algorithm is not incidental; see for example Neal
and Hinton, "A new view of the EM algorithm", for a view of the Expectation
Maximization algorithm that emphasizes the alternating minimization of KL
divergences.
A Pythagorean theorem is yet another equivalent way of characterizing q*.
For any q ∈ Q̄ and any p ∈ P,

    D(p||q) = D(p||q*) + D(q*||q)    (61)

Proof. Let p_1, p_2 ∈ P and q_1, q_2 ∈ Q, with q_2 ∝ q_1 e^{μ^T f} for some μ.
Then

    D(p_1||q_1) − D(p_1||q_2) − D(p_2||q_1) + D(p_2||q_2)
        = ∑_x p_1 log (q_2/q_1) − ∑_x p_2 log (q_2/q_1)    (62)
        = ∑_x (p_1 − p_2) μ^T f    (63)
        = (E_{p_1}[f] − E_{p_2}[f])^T μ    (64)

(the log-normalization terms cancel because p_1 and p_2 both sum to 1).
If now p_1 = q_1 = q*, p_2 = p, q_2 = q we obtain

    0 − D(q*||q) − D(p||q*) + D(p||q) = (E_{q*}[f] − E_p[f])^T μ = 0    (65)

since E_{q*}[f] = E_p[f] = E_p̂[f] (both q* and p belong to P), which
proves (61).

5  Estimating the parameters

For some ME models, there is a direct relationship between the parameters
and the sufficient statistics; e.g. for normal distributions, multinomial
distributions, and decomposable models the parameter fitting is done
analytically with a simple formula. For other exponential models, we have to
resort to iterative methods.
Gradient ascent. One method to find the parameters of the MRE
distribution is to find the unique maximum of the dual objective L(λ) in (50).
Here we have the problem of computing the normalization constant Z_λ, which
involves a summation over the whole sample space. Sometimes Z_λ and its
partial derivatives can be computed analytically; in the other cases, one has
to approximate these derivatives. The most direct method is to recall that

    ∂ log Z_λ / ∂λ_i = E_{q_λ}[f_i]

and apply a Markov-chain Monte Carlo method to compute the expectations.
Both Gibbs and Metropolis sampling from a ME distribution are
straightforward, since

    q(x)/q(x') = [q_0(x)/q_0(x')] e^{λ^T (f(x) − f(x'))}    (66)

For q_0 uniform, binary features, and x, x' differing in only one feature f_j,
the above ratio becomes equal to e^{±λ_j}. Note also that one needs only one
sample from q to estimate all the derivatives.
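When Ω is small enough to enumerate, the expectations are exact and gradient
ascent on the dual (50) takes a few lines. This is a minimal sketch with q_0
uniform (so it solves the ME problem), reusing the feature matrix F and the
constraints E_phat from the earlier snippets:

    import numpy as np

    def fit_maxent(F, E_phat, lr=0.5, iters=5000):
        # gradient ascent on L(lam) = lam . E_phat - log Z_lam, equation (50);
        # the gradient is E_phat - E_{q_lam}[f], computed exactly here
        lam = np.zeros(F.shape[0])
        for _ in range(iters):
            s = lam @ F
            q = np.exp(s - s.max())
            q /= q.sum()                    # q_lambda, equation (29)
            lam += lr * (E_phat - F @ q)    # primal constraint violation
        return lam, q

    lam, q = fit_maxent(F, E_phat)
    # at the optimum, E_q[f] matches E_phat as required by (22)
    print(F @ q, E_phat)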
Improved Iterative Scaling is a method introduced by Della Pietra et
al. which is generally faster than the gradient method.

Improved Iterative Scaling (IIS) Algorithm
Given p̂, q_0, and the features f = (f_1, . . . , f_K). Assume q_0 is
absolutely continuous w.r.t. p̂ (i.e. the prior gives non-zero probability to
the data). Denote

    f#(x) = ∑_i f_i(x)    (67)

1. Initialize q ← q_0, λ ← 0

2. For each i, let δ_i be the unique solution of

       E_q[f_i e^{δ_i f#}] = E_p̂[f_i]    (68)

3. Set q ← q e^{∑_i δ_i f_i} (renormalized) or, equivalently, set λ ← λ + δ

4. If q has not converged yet, go to step 2

The estimation in step 2 of the algorithm can be done by bisection, noting
that the left-hand side of (68) is monotonic in δ_i.
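A sketch of the IIS update on a finite sample space, assuming non-negative
features and empirical expectations strictly between 0 and their maximum (so
the root-finding in step 2 brackets a solution). It reuses F and E_phat from
the earlier snippets and is an illustration, not Della Pietra et al.'s
implementation.

    import numpy as np
    from scipy.optimize import brentq

    def iis_sweep(q, F, E_phat):
        # step 2: for each i, solve E_q[f_i exp(delta f#)] = E_phat[f_i]
        f_sharp = F.sum(axis=0)                   # f#(x), equation (67)
        delta = np.zeros(F.shape[0])
        for i in range(F.shape[0]):
            g = lambda d: np.sum(q * F[i] * np.exp(d * f_sharp)) - E_phat[i]
            delta[i] = brentq(g, -30.0, 30.0)     # monotone in d, so bisection works
        # step 3: q <- q exp(sum_i delta_i f_i), renormalized
        q_new = q * np.exp(delta @ F)
        return q_new / q_new.sum(), delta

    q = np.ones(F.shape[1]) / F.shape[1]          # initialize at q_0 = uniform
    for _ in range(50):
        q, delta = iis_sweep(q, F, E_phat)
    print(F @ q, E_phat)                          # the constraints are matched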

Learning the features

Della Pietra et al. present a method for learning features from data. The
method is a greedy algorithm, that consists of two steps: first, from a set
of candidate features C one feature is selected to be added to the model;
second, the optimal parameters of the augmented model are fit to the data.
Field Induction Algorithm
Input: p , q0 , a rule for constructing candidate features or a set of
possible features
1. Set q q0
2. Feature selection
(a) Construct the current set of candidate features C
(b) For g C compute the gain G(g) = max D(
p ||q)D(
p ||qeg /Z )
and
that maximizes the gain
(c) Add g = argmaxC G(g) to the current set of features
3. Parameter fitting Refit the model parameters by IIS, starting from
the previous parameter set and .

4. Return to step 2
13

The algorithm is guaranteed to improve the likelihood of the data at each


iteration, and to give the best model for the given features, but it incorporates
no model selection method. So the stopping criterion and model validation
are left to the standard methods.
Below we discuss the feature selection in more detail.

6.1  Greedy feature selection

In the following we assume that all features take values in {0, 1}. Assume
that q is the current ME distribution and g ∈ C is a new feature. Let
Z_α = ∑_x q e^{αg}, q̃ = q e^{αg}/Z_α, and let G(α, g) = D(p̂||q) − D(p̂||q̃)
be the gain of adding g with parameter α to the model. Then, it is easy to
show the following three equalities:

    G(α, g) = α E_p̂[g] − log E_q[e^{αg}]    (69)

    ∂G(α, g)/∂α = E_p̂[g] − E_q̃[g]    (70)

    ∂²G(α, g)/∂α² = − Var_q̃[g] ≤ 0    (71)

This shows that G(α, g) has a unique maximum w.r.t. α.


Lemma. We now show that the maximum is at

    α̂ = log [ p̂(g = 1)(1 − q(g = 1)) / ((1 − p̂(g = 1)) q(g = 1)) ]    (72)

We have that

    q̃ = q e^α / Z_α  if g = 1,    q̃ = q / Z_α  if g = 0    (73)

Therefore, E_q̃[g] = e^α q(g = 1)/Z_α and 1 − E_q̃[g] = q(g = 0)/Z_α =
(1 − q(g = 1))/Z_α. Remembering from (70) that E_q̃[g] = E_p̂[g] at the
optimal α, we obtain

    E_p̂[g] = e^α q(g = 1)/Z_α    (74)
    1 − E_p̂[g] = (1 − q(g = 1))/Z_α    (75)

By taking the ratio of the two, we obtain (72).

Lemma. Denote by p, q the probabilities p̂(g = 1), q(g = 1) respectively.
Then, the gain G(g) is equal to D(B_p||B_q), where B_p, B_q are respectively
the Bernoulli distributions with p, q as probabilities of success.
Proof.

    D(B_p||B_q) = p log (p/q) + (1 − p) log [(1 − p)/(1 − q)]    (76)
                = p log [ p(1 − q) / (q(1 − p)) ] − log [(1 − q)/(1 − p)]    (77)

By (69), at the maximizer α̂ of (72), for which e^{α̂} = p(1 − q)/(q(1 − p)),

    G(α̂, g) = p log [ p(1 − q) / (q(1 − p)) ] − log E_q[e^{α̂ g}]    (78)
             = p log [ p(1 − q) / (q(1 − p)) ] − log [ e^{α̂} q(g = 1) + 1 · q(g = 0) ]    (79)
             = p log [ p(1 − q) / (q(1 − p)) ] − log [ p(1 − q)/(1 − p) + 1 − q ]    (80)
             = p log [ p(1 − q) / (q(1 − p)) ] − log [(1 − q)/(1 − p)]    (81)

which is exactly (77). Thus, one adds the feature g along which the
discrepancy of p̂ and q is maximized (a small sketch follows).
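By the lemma, scoring a candidate feature requires only two numbers,
p̂(g = 1) and q(g = 1). A minimal sketch of the selection step (illustrative;
it assumes both probabilities lie strictly in (0, 1)):

    import numpy as np

    def bernoulli_kl(p, q):
        # D(B_p || B_q), equations (76)-(77); assumes 0 < p, q < 1
        return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

    def gain(g_values, p_hat, q):
        # gain G(g) of a binary candidate feature g, by the lemma above;
        # g_values lists g(x) over the sample space
        return bernoulli_kl(np.dot(p_hat, g_values), np.dot(q, g_values))

    # greedy selection over a candidate set C:
    # g_star = max(C, key=lambda g: gain(g, p_hat, q))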

7  Maximum Entropy Discrimination

7.1  The idea

Here we apply the ME framework to classification problems. We have a
dataset D = {(x_1, y_1), (x_2, y_2), . . . , (x_N, y_N)} of N labeled examples
and a family of classifiers F = {f(·; θ)} parametrized by θ. The values of the
labels are in {±1} and the label of x given by f(·; θ) is sign f(x; θ).
Standard classification chooses an f ∈ F that has low classification error and
perhaps obeys other regularization conditions. In the Maximum Entropy
approach to classification, we find a distribution q over F such that the
expected value f̂ of all classifiers f ∈ F under q classifies the training set
well. Naturally, more than one such q exists and we choose the one which
has maximum entropy:

    max_q H(q)   s.t. y_i E_q[f(x_i)] ≥ 1 for i = 1, . . . , N    (82)

A direct generalization of ME classification is to introduce a prior q_0 over
F and then to choose the q that is nearest to the prior. This is the MRE
discrimination problem:

    min_q D(q||q_0)   s.t. y_i E_q[f(x_i)] ≥ 1 for i = 1, . . . , N    (83)

7.2  The MRE solution

The problems (82), (83) are convex problems with linear constraints; by
applying the usual transformations we obtain the Lagrangean, the general
exponential form of the solution, and the dual. Note that unlike the
unsupervised ME problem, the current problem has inequality constraints, so
the Lagrange multipliers will have to be ≥ 0. Another difference is that the
sums are now replaced with integrals, as the function families we typically
consider are continuous.

    h(q, λ) = D(q||q_0) − ∑_i λ_i (y_i E_q[f(x_i)] − 1)
            = ∫ q(θ) [ log q(θ) − log q_0(θ) − ∑_i λ_i y_i f(x_i; θ) ] dθ + ∑_i λ_i

    q_λ(θ) = (1/Z_λ) q_0(θ) e^{∑_i λ_i y_i f(x_i; θ)}

    Z_λ = ∫ q_0(θ) e^{∑_i λ_i y_i f(x_i; θ)} dθ

    L(λ) = h(q_λ, λ)
         = ∫ q_λ(θ) [ ∑_i λ_i y_i f(x_i; θ) − log Z_λ − ∑_i λ_i y_i f(x_i; θ) ] dθ + ∑_i λ_i
         = ∑_i λ_i − log Z_λ

Hence, the solution has again an exponential form, with one factor for each
training example. The dual problem is

    max_λ L(λ)   s.t. λ_i ≥ 0 for i = 1, . . . , N    (90)

7.3  Computing the solution

The optimal λ's can be found by gradient ascent on the dual objective L. As
previously seen, the derivatives of log Z_λ are expectations:

    ∂ log Z_λ/∂λ_i = ∫ (1/Z_λ) q_0(θ) e^{∑_j λ_j y_j f(x_j; θ)} y_i f(x_i; θ) dθ = y_i E_{q_λ}[f(x_i; ·)]    (91)

    ∂L/∂λ_i = 1 − y_i E_{q_λ}[f(x_i; ·)]    (92)

These are equal to the primal constraint violations. In practice, we start
with λ_i = 0 for all i and update the Lagrange multipliers by

    λ_i ← max[ 0, λ_i + η ∂L/∂λ_i ]    (93)

where η is a step size. This iteration will converge to the unique solution
of (83) if one exists. Note that during the iteration, L increases, while the
primal objective increases too. This is because our method is essentially an
exterior point method: we start from the unconstrained optimum and adjust
the parameters until all the constraints are satisfied.
As is usual with constrained optimization, only the λ_i parameters that
correspond to tight constraints are non-zero. The corresponding x_i examples
represent support vectors for this problem. Thus the solution is often
sparse.
The ME classifier is given by

    f̂(x) = ∫ q_λ(θ) f(x; θ) dθ    (94)

Note that f̂ may not belong to the original family F.
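A minimal sketch of the update (93) for a discretized classifier family: θ
ranges over a finite grid with a positive prior q_0, and f(x; θ) = θ^T x
(linear classifiers). All names and choices here are illustrative assumptions,
not part of the original notes.

    import numpy as np

    def med_fit(X, y, thetas, q0, eta=0.1, iters=1000):
        # X: N x d data, y: labels in {-1, +1}, thetas: M x d parameter grid,
        # q0: prior over the grid. Projected gradient ascent on the dual (90).
        fvals = thetas @ X.T                      # f(x_i; theta), an M x N table
        lam = np.zeros(len(y))
        for _ in range(iters):
            # q_lambda(theta) prop. to q0(theta) exp(sum_i lam_i y_i f(x_i; theta))
            logq = np.log(q0) + fvals @ (lam * y)
            q = np.exp(logq - logq.max())
            q /= q.sum()
            margins = y * (q @ fvals)             # y_i E_q[f(x_i; .)]
            lam = np.maximum(0.0, lam + eta * (1.0 - margins))   # update (93)
        return lam, q

    # the resulting classifier is the expectation (94):
    #   f_hat(x) = sum_theta q(theta) * (theta @ x)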

8  ME discrimination extensions

8.1  Using generative models

One important use of the ME classification framework is to combine
discriminative classification, i.e. classification via optimization of the
decision boundaries, with generative models, i.e. probabilistic models of how
the data were generated. The former have the advantage that they optimize the
right criterion, the latter are much better at incorporating domain features.
Assume that we have a family of probabilistic models describing the data
in each class; let them be P(x|θ_+), P(x|θ_−) respectively. The models may
belong to different families (e.g. Gaussian for the + class and uniform on a
rectangle for the − class), and the parameters θ_± may belong to different
spaces. A likelihood ratio classifier with these model families is

    f(x; θ_+, θ_−, b) = log [ P(x|θ_+)/P(x|θ_−) ] + b    (95)

We construct a MRE distribution over this classifier family, which has the
form

    q(θ_+, θ_−, b) ∝ q_0(θ_+, θ_−, b) e^{∑_i λ_i [y_i (log P(x_i|θ_+)/P(x_i|θ_−) + b) − 1]}    (96)

If the prior q_0 factors into q_0(θ_+) q_0(θ_−) q_0(b), then the MRE
distribution also factors into independent distributions for the 3 parameters:

    q(θ_+, θ_−, b) ∝ q_0(θ_+) e^{∑_i λ_i y_i log P(x_i|θ_+)} · q_0(θ_−) e^{−∑_i λ_i y_i log P(x_i|θ_−)}
                     · q_0(b) e^{b ∑_i λ_i y_i} · e^{−∑_i λ_i}    (97)

There are several model classes for which computing the partition function
and the required expectations can be done in closed form; for example, when
P(·|θ_±) is an exponential family model and q_0 is the corresponding conjugate
prior. In particular, graphical models with fixed structure and exponential
family distributions are also tractable; in the special case of tree graphical
models, one can construct more interesting graphical models that can be
integrated over structures and parameters (see Jaakkola et al.).

8.2  Inseparable data (soft margins)

If the data are not separable by any model in F, then the constraints of (83)
cannot be satisfied for any q and the parameters λ tend to infinity. Now we
extend the MRE framework to deal with this case. We will fix a variable
margin γ_i for each example (x_i, y_i) and we will estimate a joint MRE
distribution over the classifier parameters θ and the margins γ. Note once
again that, if the prior q_0 is factored w.r.t. θ and γ, then θ and γ will be
independent under the final solution.
The soft margin MRE problem is

    min_q D(q||q_0)   s.t. E_q[y_i f(x_i) − γ_i] ≥ 0 for i = 1, . . . , N    (98)

where q, q_0 are now distributions over θ, γ_1, . . . , γ_N. If q_0(θ, γ) =
q_0(θ) ∏_i q_0(γ_i), then the MRE solution is

    q ∝ q_0(θ) e^{∑_i λ_i y_i f(x_i; θ)} ∏_{i=1}^N q_0(γ_i) e^{−λ_i γ_i}    (99)

If we denote the corresponding normalization constants by Z_θ, Z_{γ_i}
(keeping in mind that they are functions of λ), then the dual objective can
be written as

    L(λ) = − log Z_θ − ∑_i log Z_{γ_i}    (100)

The form of the MRE distribution for γ suggests an exponential prior

    q_0(γ) = c e^{−c(1−γ)}    (101)

For this q_0 the MRE distribution of γ_i is

    q(γ_i) = (c − λ_i) e^{−(c−λ_i)(1−γ_i)}    (102)

In the above, λ_i needs to be ≤ c for a proper q to exist. Hence, introducing
a soft margin in this way is equivalent to putting an upper bound on λ, very
much like in the case of SVMs.
Note also that E_{q_0}[γ_i] = 1 − 1/c. Therefore, if E_q[y_i f(x_i)] ≥ 1 − 1/c
then λ_i = 0. Since Z_{γ_i} ∝ 1/(c − λ_i), each term − log Z_{γ_i} in the dual
objective introduces a penalty log(c − λ_i). If λ_i approaches c from below,
this penalty tends to −∞, effectively stopping λ_i from growing too much.

8.3  Relationship to SVMs

The following theorem, from Jaakkola et al., shows that there is a strong
relationship between SVMs and MRE.

Theorem. Assume f(x; θ, b) = θ^T x + b and q_0(θ, b, γ) = q_0(θ) q_0(b) ∏_i q_0(γ_i),
where q_0(θ) = N(0, I), q_0(b) approaches a non-informative prior, and q_0(γ_i)
is given by equation (101). Then, the Lagrange multipliers λ are obtained by
maximizing L(λ) subject to 0 ≤ λ_i ≤ c and ∑_{i=1}^N λ_i y_i = 0, where

    L(λ) = ∑_i [ λ_i + log(c − λ_i) ] − (1/2) ∑_{i,j=1}^N λ_i λ_j y_i y_j x_i^T x_j    (103)

Note the similarity between this objective and the SVM with slack variables.
The only difference is the additional term log(c − λ_i). Moreover, for the
case of separable data and c → ∞, the same classifiers are obtained.
