Markov Random Fields and Maximum Entropy Modeling For Music Information Retrieval
P(n|H) = \frac{|H, n|}{\sum_{n_i} |H, n_i|} \qquad (1)
Basic Dirichlet smoothing is used to overcome the
zero frequency problem (Zhai and Lafferty). Figure 3
shows an (unsmoothed) example for a sample 3-state
model trained with the data from Figure 2.
Counts                          P(n|H)
      0   1   2                       0     1     2
 0    0   1   1                  0    0     0.5   0.5
 1    1   0   1                  1    0.5   0     0.5
 2    0   2   0                  2    0     1.0   0

Figure 3: Example 1st-order (m = 2) state transition counts (left) and state transition distribution (right)
Prior to retrieval, at indexing time, we estimate
f(n_{i,t}, H_{i,t}) = n_{i,t} \prod_{n_{j,s} \in S} n_{j,s} \qquad (3)
Defined in this manner, our feature functions are always boolean, and equal to 1 if all the notes defined by S were played before the target note n_{i,t}. A feature function always includes the target note n_{i,t}. This is not a fallacy in the model, since n_{i,t} will never actually be considered a part of its own history. The presence of n_{i,t} in the feature serves only to tie the occurrences of the notes in S to the occurrence of n_{i,t}. If the feature is considered likely, that is evidence in favor of predicting n_{i,t} = 1. If the feature does not occur, it suggests that n_{i,t} is likely to be zero.
One final comment: we choose to make features time-invariant, but not index-invariant. This means that a feature is expected to characterize the same kind of dependency regardless of the time index t of the target n_{i,t}, but that the feature is not (pitch) transposition-invariant. Consequently, we will index the time component of the notes in S not in absolute values but relative to the time t. We do not do the same for the note index i, so these indices will remain absolute. As an illustration, Figure 5 contains some examples of features that could have an impact on note 2 at time t.
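To make this concrete, the following is a minimal sketch (ours, not from the paper) of how such a binary conjunction feature could be evaluated over a binary piano-roll; the 12-row roll array, the (note index, time offset) encoding of S, and all names are illustrative assumptions.

import numpy as np

# A feature is specified by a set S of (note index, time offset) pairs; offsets
# are relative to the target time t and negative (strictly in the past). The
# target note n_{i,t} itself is always an implicit part of the conjunction.
def feature_fires(roll: np.ndarray, i: int, t: int, S) -> int:
    """Return 1 iff n_{i,t} = 1 and every note n_{j, t+offset} in S is 1."""
    if roll[i, t] != 1:              # the feature ties S to the occurrence of n_{i,t}
        return 0
    for j, offset in S:
        s = t + offset               # time indices in S are relative to t
        if s < 0 or roll[j, s] != 1:
            return 0
    return 1

# Toy usage: 12 pitch classes x 4 simultaneities.
roll = np.zeros((12, 4), dtype=int)
roll[2, 3] = 1                       # target note 2 at time t = 3
roll[4, 2] = 1                       # note 4 one simultaneity earlier
print(feature_fires(roll, i=2, t=3, S={(4, -1)}))   # -> 1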
4.3 Exponential Form
At this point we are ready to select the parametric form that we will be using for computing the probabilities P(n_{i,t} | H_{i,t}). There are a number of different forms we could choose, but it turns out that for random fields there is a natural formulation of the distribution that is given by the maximum-entropy framework. Suppose we are given a set F of feature functions that define the structure of the field. The maximum-entropy principle states that we should select the parametric form that is: (i) consistent with the structure imposed by F and (ii) makes the least amount of unwarranted assumptions, that is, the most uniform of all distributions consistent with F. The family of functions that satisfies these two criteria is the exponential (or log-linear) family, expressed as:

P(n_{i,t} \mid H_{i,t}) = \frac{1}{Z_{i,t}} \exp\left( \sum_{f \in F} \lambda_f \, f(n_{i,t}, H_{i,t}) \right) \qquad (4)
In equation (4), the set of scalars Λ = {λ_f : f ∈ F} is the set of Lagrange multipliers for the set of structural constraints F. Intuitively, the parameter λ_f ensures that our model predicts feature f as often as it should occur in reality. Z_{i,t} is the normalization constant that ensures that our distribution sums to unity over all possible values of n_{i,t}. In statistical physics, it is known as a partition function and is defined as follows:

Z_{i,t} = \sum_{n} \exp\left( \sum_{f \in F} \lambda_f \, f(n, H_{i,t}) \right) \qquad (5)
For a general random field, the partition function Z_{i,t} is exceptionally hard to compute, since it involves summation over all possible states of the system. In a typical system the number of states is exponential in the number of field variables, and direct computation of equation (5) is not feasible.

In our case, our assumption of no dependencies between the current-state notes n_{i,t} makes computation of the partition function extremely simple. Recall that all underlying field variables are binary, so equation (5) only needs to be computed for two cases: n_{i,t} = 0 and n_{i,t} = 1. We can further simplify the problem if we recall that every feature function f is a binary conjunction, and every f includes n_{i,t} in its support. As a direct consequence, f(n_{i,t}, H_{i,t}) is non-zero only if n_{i,t} = 1. The assertion holds for all feature functions f ∈ F, which implies that the summation inside the exponent in equations (4) and (5) is zero whenever n_{i,t} = 0. These observations allow us to re-write equation (4) in a form that allows very rapid calculations:
P(n_{i,t} = 1 \mid H_{i,t}) = \sigma\left( \sum_{f \in F} \lambda_f \, f(1, H_{i,t}) \right)
P(n_{i,t} = 0 \mid H_{i,t}) = 1 - P(n_{i,t} = 1 \mid H_{i,t}) \qquad (6)
Here σ is the sigmoid function, defined as σ(x) = e^x / (1 + e^x). We have stated equation (6) as a special case applicable to our particular setting. In the remaining arguments we will use the form given by equations (4) and (5) to ensure generality.
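As a quick illustration of equation (6), a small sketch (ours; the variable names are hypothetical): sum the weights λ_f of the features whose history part is satisfied, and pass the sum through the sigmoid.

import math

def sigmoid(x: float) -> float:
    """sigma(x) = e^x / (1 + e^x), as used in equation (6)."""
    return 1.0 / (1.0 + math.exp(-x))

def p_note_on(weights: dict, active_features) -> float:
    """P(n_{i,t} = 1 | H_{i,t}): only features that would fire with n_{i,t} = 1
    contribute, so the partition function reduces to a sigmoid."""
    return sigmoid(sum(weights[f] for f in active_features))

weights = {"f1": 0.8, "f2": -0.3}
print(p_note_on(weights, {"f1", "f2"}))   # sigma(0.5) ~ 0.62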
4.4 Objective Function
The ultimate goal of this project is to develop a probability distribution P(n_{i,t} | H_{i,t}) that will accurately predict the notes n_{i,t} in polyphonic music. There exist a number of different measures that could indicate the quality of prediction. We choose one of the simplest: the log-likelihood of the training data. Given a training set consisting of T simultaneities with 12 notes each, the log-likelihood is simply the average logarithm of the probability of producing every note in the training set:

L_P = \frac{1}{12T} \log \prod_{t=1}^{T} \prod_{i=0}^{11} P(n_{i,t} \mid H_{i,t}) = \sum_{H} \tilde{P}(H) \sum_{n} \tilde{P}(n \mid H) \log P(n \mid H) \qquad (7)
In the second step in equation (7) we re-expressed the log-likelihood in terms of the expected cross-entropy between the target distribution P̃(n|H) and the estimate P(n|H). The target distribution is the empirical distribution of the training set:

\tilde{P}(n \mid H) = \frac{1}{12T} \sum_{t=1}^{T} \sum_{i=0}^{11} \delta(n, n_{i,t}) \, \delta(H, H_{i,t}) \qquad (8)

Here δ refers to the Kronecker delta function, defined as:

\delta(x, y) = \begin{cases} 1 & \text{if } x = y \\ 0 & \text{otherwise} \end{cases} \qquad (9)
Returning our attention to equation (7), we stress that the expectation over histories, \sum_H \tilde{P}(H)[\cdots], is performed over all possible values that a history H of a note might take. This set is exponentially large, and a direct computation would be infeasible. However, for computation we always use the first part (top) of equation (7), whereas the second part (bottom) comes in very handy in the algebraic derivations of the field induction algorithm.
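For concreteness, a small sketch (ours) of the top form of equation (7); the probs argument is assumed to hold the probabilities the model assigns to the note values actually observed in the training set.

import math

def log_likelihood(probs) -> float:
    """Equation (7), top form: the average log-probability of every note in the
    training set. probs[t][i] = P(n_{i,t} | H_{i,t}) for the observed value."""
    T = len(probs)
    return sum(math.log(p) for row in probs for p in row) / (12 * T)

# A model that assigns probability 0.5 to every observed note value over
# T = 3 simultaneities yields L_P = log 0.5 ~ -0.693.
print(log_likelihood([[0.5] * 12] * 3))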
4.5 Maximum Entropy
To summarize, in the previous two subsections we restricted ourselves to the exponential (Gibbs) form of the probability distribution P(n|H), and declared that our objective is to maximize the likelihood of the training data within that parametric form. It is important to note that there is a different statement of objectives that provably leads to exactly the same exponential solution. For every feature f, the training data defines a target (empirical) expectation Ẽ[f], which is simply how often f actually occurs in the training set:

\tilde{E}[f] = \sum_{H} \tilde{P}(H) \sum_{n} \tilde{P}(n \mid H) \, f(n, H) = \frac{1}{12T} \sum_{t=1}^{T} \sum_{i=0}^{11} f(n_{i,t}, H_{i,t}) \qquad (10)
Similarly, our estimate P(n|H) gives rise to the predicted expectation E[f] for the function f. The predicted expected value is simply how often our model thinks that f should occur in the training set:
E[f] = \sum_{H} \tilde{P}(H) \sum_{n} P(n \mid H) \, f(n, H) = \frac{1}{12T} \sum_{t=1}^{T} \sum_{i=0}^{11} \sum_{n} P(n \mid H_{i,t}) \, f(n, H_{i,t}) \qquad (11)
The key difference between Ẽ[f] and E[f] is that we do not look at the actual value n_{i,t} when we compute the predicted expectation; instead, we average over the model's predictions P(n|H_{i,t}). The alternative statement of objectives mentioned above is maximum entropy: among all distributions we prefer the most uniform one, where uniformity is measured by the conditional entropy:

H_P = - \sum_{H} \tilde{P}(H) \sum_{n} P(n \mid H) \log P(n \mid H) \qquad (12)
Curiously, maximizing the entropy subject to the constraint that Ẽ[f] = E[f] for every feature f turns out to be equivalent to assuming an exponential form for our probability distribution P(n|H) and maximizing the likelihood given by equation (7).
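The two expectations are straightforward averages over the 12T note positions. Below is a sketch (ours, with hypothetical inputs): Ẽ[f] averages the feature values observed in the data (equation (10)), while E[f] replaces the observed target value with the model's probability that the target note is on (equation (11), using the fact that our features vanish when n_{i,t} = 0).

def empirical_expectation(f_obs) -> float:
    """~E[f] (equation (10)): f_obs[t][i] = f(n_{i,t}, H_{i,t}) at the observed notes."""
    T = len(f_obs)
    return sum(v for row in f_obs for v in row) / (12 * T)

def predicted_expectation(f_on, p_on) -> float:
    """E[f] (equation (11)): f_on[t][i] = f(1, H_{i,t}), i.e. whether the history
    part of f is satisfied; p_on[t][i] = P(n_{i,t} = 1 | H_{i,t})."""
    T = len(f_on)
    return sum(f * p for f_row, p_row in zip(f_on, p_on)
                     for f, p in zip(f_row, p_row)) / (12 * T)

# Toy example over T = 2 simultaneities and a single feature:
f_obs = [[1] + [0] * 11, [0] * 12]           # f fired once in the data
f_on  = [[1] + [0] * 11, [1] + [0] * 11]     # history part satisfied twice
p_on  = [[0.9] + [0.5] * 11, [0.2] + [0.5] * 11]
print(empirical_expectation(f_obs))          # ~E[f] = 1/24
print(predicted_expectation(f_on, p_on))     # E[f] = (0.9 + 0.2)/24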
4.6 Feature Induction
In the previous sections we outlined the general structure of a random field over polyphonic music and stated our objective: to learn the probability distribution P(n|H) that maximizes the likelihood of the training data (equation (7)). Recall that we selected the exponential form for this distribution, and that its structure is determined entirely by the feature set F. We grow this set incrementally: given the current set F_k, we form a set G of candidate features by augmenting existing features with one additional note:

G = \left\{ f \cdot n_{j,s} : f \in F_k,\; n_{j,s} \notin S,\; \exists\, n_{j',s'} \in S \text{ such that } |s - s'| \le 2 \right\} \qquad (13)

where S is the set of notes defining the feature f.
In other words, we form new candidate features g by taking an existing feature f and attaching a single note n_{j,s} that is not too far from f in time (in our case, not more than two simultaneities away). Naturally, we do not include as candidates any features that are already members of F.
Now, following the reasoning of Della Pietra et al. (1997), we would like to pick the candidate g ∈ G that will result in the maximum improvement in the objective function. Suppose that the previous log-likelihood, based only on F_k, was L_P. Now, if we add a feature g weighted by the multiplier α, the new likelihood of the training data would be:

L_{P + \{g\}} = L_P + \alpha \tilde{E}[g] - \log E\left[ e^{\alpha g} \right] \qquad (14)
When we add a new feature g to the field, we would like to add it with a reasonable weight α, preferably the weight that maximizes the contribution of g. We can achieve that by differentiating the new log-likelihood L_{P+\{g\}} with respect to α and finding the root of the derivative:
0 = \frac{\partial L_{P + \{g\}}}{\partial \alpha} \;\Longrightarrow\; \alpha = \log\left( \frac{\tilde{E}[g]\,(1 - E[g])}{E[g]\,(1 - \tilde{E}[g])} \right) \qquad (15)
An important observation to make is that we arrived at a closed-form solution for the optimal weight α to be assigned to the new feature g. The closed-form solution is a special property of binary feature functions, and it greatly simplifies the process of inducing the field structure. Knowing the optimal value of α in closed form allows us to compute the resulting improvement, or gain, in log-likelihood, also in closed form:
\text{Gain} = \tilde{E}[g] \log \frac{\tilde{E}[g]}{E[g]} + \left(1 - \tilde{E}[g]\right) \log \frac{1 - \tilde{E}[g]}{1 - E[g]} \qquad (16)
The final form is particularly interesting, since it represents the Kullback-Leibler divergence between two Bernoulli distributions with expected values Ẽ[g] and E[g], respectively.
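A direct transcription (ours) of equations (15) and (16); both assume the expectations lie strictly between 0 and 1.

import math

def optimal_weight(e_emp: float, e_pred: float) -> float:
    """Equation (15): closed-form weight alpha for a new binary feature g,
    given ~E[g] (e_emp) and E[g] (e_pred), both strictly in (0, 1)."""
    return math.log(e_emp * (1.0 - e_pred) / (e_pred * (1.0 - e_emp)))

def gain(e_emp: float, e_pred: float) -> float:
    """Equation (16): the KL divergence between Bernoulli(~E[g]) and Bernoulli(E[g]),
    i.e. the improvement in log-likelihood from adding g at its optimal weight."""
    return (e_emp * math.log(e_emp / e_pred)
            + (1.0 - e_emp) * math.log((1.0 - e_emp) / (1.0 - e_pred)))

# A candidate that fires in 10% of positions empirically but only 2% under the
# current model gets a large positive weight and a noticeable gain.
print(optimal_weight(0.10, 0.02))   # ~ 1.69
print(gain(0.10, 0.02))             # ~ 0.084 nats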
4.7 Parameter Estimation
In the previous section we described how we can automatically induce the structure of a random field by incrementally adding the most promising candidate feature g ∈ G. We also presented the closed-form equations that allow us to determine the improvement in log-likelihood that would result from adding g to the field, and the optimal weight that would lead to that improvement. What we did not discuss is the effect of adding g on the weights of other features already in the field. Since the features f ∈ F are not independent of each other, adding a new feature will affect the balance of existing features. From equation (16) we know that the new log-likelihood L_{P+\{g\}} is always going to be better than the old one L_P (unless the field is saturated and cannot be improved anymore). However, this does not guarantee that the current set of weights is optimal for the new structure. We may be able to further improve the objective by re-optimizing the weights for all functions that are now in the field.
Assume now that the structure F contains all the desired features. We would like to adjust the set of weights Λ so that the objective function L_P is maximized. This is accomplished by computing the partial derivatives of L_P with respect to each weight λ_f, with the intention of driving these derivatives to zero:

\frac{\partial L_P}{\partial \lambda_f} = \tilde{E}[f] - E[f] \qquad (17)
Unfortunately, there is no closed-form solution that would allow us to set the weights to their optimal values. Instead, we utilize an iterative procedure that will drive the weights towards the optimum. There are a number of algorithms for adjusting the weights in an exponential model, the most widely known being the Generalized Iterative Scaling (GIS) algorithm proposed by Darroch and Ratcliff (1972). However, iterative scaling is extremely slow; much faster convergence can be achieved by using variations of gradient descent. Given the current value of the weight vector Λ, we will update it by a small step (of size β) in the direction of the steepest increase of the likelihood, given by the vector of partial derivatives:

\lambda_f^{(k+1)} \leftarrow \lambda_f^{(k)} + \beta \, \frac{\partial L_P}{\partial \lambda_f} = \lambda_f^{(k)} + \beta \left( \tilde{E}[f] - E[f] \right) \qquad (18)
Equation (18) will be applied iteratively, for all f ∈ F, until the change in likelihood is smaller than some pre-selected threshold. Note that while Ẽ[f] is computed only once for each feature f, we will have to re-compute the value E[f] after every update. This makes the learning procedure quite expensive. However, the learning procedure is guaranteed to converge to the global optimum. Convergence is ensured by the fact that the objective function L_P is concave (∩-convex) with respect to the weights λ_f. One may verify this by computing the second-order derivative of L_P and observing that it is everywhere negative.
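One sweep of equation (18) might look as follows (a sketch of our own; the step size and the dictionary representation of the weights are assumptions). Note that E[f] must be recomputed from the model after every sweep, which is what makes the procedure expensive.

def update_weights(weights: dict, e_emp: dict, e_pred: dict, step: float = 0.1) -> dict:
    """Equation (18): lambda_f <- lambda_f + step * (~E[f] - E[f]) for every f."""
    return {f: w + step * (e_emp[f] - e_pred[f]) for f, w in weights.items()}

# An under-predicted feature (the model expects it less often than the data
# shows) has its weight increased:
print(update_weights({"f1": 1.0}, {"f1": 0.10}, {"f1": 0.02}))   # {'f1': 1.008}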
4.8 Field Induction Algorithm
We are finally ready to bring together the results of the previous subsections into one algorithm for automatic induction of random field models for polyphonic music (a compact sketch of this loop, under our own simplifying assumptions, follows the listing):

1. Initialization
   (a) Let the feature set F_0 be the set of single-note features: F_0 = {n_{i,t} : i = 0 ... 11}
   (b) Set the initial feature weights λ_f = 1 for all f ∈ F_0

2. Weight Update
   (a) Set λ_f^{(k+1)} ← λ_f^{(k)} + β(Ẽ[f] − E[f]) for each feature f ∈ F
   (b) If there is a noticeable change in likelihood, repeat step (2a)

3. Feature Induction
   (a) Enumerate the set of candidate features G (equation (13))
   (b) For every candidate feature g ∈ G compute the optimal weight α_g = log[ Ẽ[g](1 − E[g]) / ( E[g](1 − Ẽ[g]) ) ]
   (c) For every g ∈ G compute the expected improvement (gain) from adding g to the structure F (equation (16))
   (d) Pick the candidate g that promises the highest improvement, add it to the structure F, and set λ_g = α_g
   (e) If there is a noticeable change in likelihood, go to step (2); otherwise return F and Λ as the induced field
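The compact sketch promised above wires steps 1-3 together. It is ours, not the authors' code: the callables emp_E, pred_E, log_likelihood and candidates_for stand in for the quantities of sections 4.5-4.7, and the step size and tolerance are arbitrary.

import math

def induce_field(emp_E, pred_E, log_likelihood, candidates_for,
                 step: float = 0.1, tol: float = 1e-4) -> dict:
    """Skeleton of the algorithm of section 4.8. emp_E(f) -> ~E[f];
    pred_E(f, F) -> E[f] under the current field F (a dict feature -> weight);
    log_likelihood(F) evaluates equation (7); candidates_for(F) enumerates G."""
    # Step 1: single-note features n_{i,t}, i = 0..11, each with weight 1.
    F = {("note", i): 1.0 for i in range(12)}
    prev = log_likelihood(F)
    while True:
        # Step 2: gradient sweeps (equation (18)) until the likelihood settles.
        while True:
            for f in list(F):
                F[f] += step * (emp_E(f) - pred_E(f, F))
            cur = log_likelihood(F)
            done = abs(cur - prev) < tol
            prev = cur
            if done:
                break
        # Step 3: add the candidate with the highest gain (equation (16))
        # at its optimal weight (equation (15)).
        best, best_gain, best_w = None, 0.0, 0.0
        for g in candidates_for(F):
            e, p = emp_E(g), pred_E(g, F)
            if not (0.0 < e < 1.0 and 0.0 < p < 1.0):
                continue
            w = math.log(e * (1 - p) / (p * (1 - e)))
            gn = e * math.log(e / p) + (1 - e) * math.log((1 - e) / (1 - p))
            if gn > best_gain:
                best, best_gain, best_w = g, gn, w
        if best is None:
            return F
        F[best] = best_w
        cur = log_likelihood(F)
        done = abs(cur - prev) < tol
        prev = cur
        if done:
            return F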
4.9 Discussion on Markov Chains
At this point we should note that our random field approach in some sense encompasses the Markov chain approach. Instead of inducing the features of the field, one could easily preselect as features all possible one-dimensional chains under a certain fixed length and then learn the weights of those features directly. One advantage of our current approach, however, is that by selectively adding only the best features to the model, not only is the final number of parameters much smaller, but the features themselves grow in the direction of the data. In our experiments, for some songs we learned features covering 6 simultaneities (onsets) within the first thousand features induced. A Markov chain would require 12^6 ≈ 2.99 million parameters to cover this same onset range.
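For reference, the figure quoted above is simply the number of distinct sequences of six notes drawn from the 12 pitch classes:

12^{6} = 2{,}985{,}984 \approx 2.99 \times 10^{6}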
4.10 Music Retrieval using Random Fields
Now that we have a framework for creating a Markov Random Field model of a piece of music, we wish to use this model for retrieval. We do this by estimating a model from a query and then observing how well that query model predicts the notes in each document in the collection. In other words, our measure of similarity is the expected cross-entropy between the empirical distribution P_D(n|H) of the document and the estimate P_Q(n|H) produced by the query model. This measure is essentially equation (7) from above, with the document rather than the query as the target distribution.

If the model estimated from the query does well at predicting the notes in a piece of music, then the query and that piece could have been drawn from the same underlying distribution. Therefore we regard them as similar. This process is repeated for all pieces in the collection, and pieces are then ranked in order of increasing cross-entropy (dissimilarity).
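A sketch (ours) of this retrieval step: score each piece by the cross-entropy that the query model assigns to its notes, then sort by increasing score. The query_model callable and the piano-roll document representation are assumptions, not the authors' interface.

import math

def cross_entropy(query_model, document) -> float:
    """Dissimilarity of a document under a query model: the negative average
    log-probability the query model assigns to the document's notes (equation (7)
    with the document as the target). query_model(i, t, history) returns
    P_Q(n_{i,t} = 1 | H_{i,t}); document is a list of 12-element binary rows."""
    total, count = 0.0, 0
    for t, simultaneity in enumerate(document):
        history = document[:t]
        for i, note_on in enumerate(simultaneity):
            p1 = query_model(i, t, history)
            p = p1 if note_on else 1.0 - p1
            total += math.log(max(p, 1e-12))   # guard against log(0)
            count += 1
    return -total / count

def rank_collection(query_model, collection: dict) -> list:
    """Rank piece names by increasing cross-entropy (most similar first)."""
    return sorted(collection, key=lambda name: cross_entropy(query_model, collection[name]))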
[11-pt Recall/Precision graph: precision (y-axis) versus recall (x-axis) curves for MRF, MC=2, MC=3 and MC=4; see the table below.]

                     MRF      MC=2  %Change    MC=3  %Change    MC=4  %Change
Retrieved:        151000    151000            151000            151000
Relevant:           8801      8801              8801              8801
Rel|ret:            7291      4508  -38.17*     6805   -6.67*     6037  -17.20*
Interpolated Recall - Precision
  at 0.00         1.0000    1.0000    0.0     1.0000    0.0     1.0000     0.0
  at 0.10         0.5316    0.3171  -40.3*    0.2176  -59.1*    0.1319   -75.2*
  at 0.20         0.3802    0.2388  -37.2*    0.1522  -60.0*    0.0970   -74.5*
  at 0.30         0.2787    0.1404  -49.6*    0.1316  -52.8*    0.0931   -66.6*
  at 0.40         0.2063    0.0848  -58.9*    0.0873  -57.7*    0.0691   -66.5*
  at 0.50         0.1598    0.0492  -69.2*    0.0754  -52.8*    0.0550   -65.6*
  at 0.60         0.1209    0.0307  -74.6*    0.0656  -45.7*    0.0498   -58.8*
  at 0.70         0.0880    0.0134  -84.8*    0.0530  -39.8*    0.0406   -53.9*
  at 0.80         0.0571    0.0021  -96.2*    0.0407  -28.7*    0.0404   -29.2*
  at 0.90         0.0218    0.0000 -100.0*    0.0116  -46.8*    0.0372   +70.5*
  at 1.00         0.0028    0.0000 -100.0*    0.0002  -91.2*    0.0197  +603.4*
Average precision (non-interpolated) over all rel docs
                  0.2175    0.1245 -42.76*    0.1021 -53.05*    0.0790  -63.68*

Figure 6: Recall-Precision results for Markov Random Field versus Markov Chain models
5 EXPERIMENTS AND ANALYSIS
For evaluation, we have assembled four collections. The first is a set of approximately 3000 polyphonic music pieces from the CCARH at Stanford. These are mostly baroque and classical pieces from Bach, Beethoven, Corelli, Handel, Haydn, Mozart and Vivaldi. Longer scores have sometimes been broken up into their various movements, but otherwise each piece is unique. Our remaining three sets of music are pieces which were intentionally composed as variations on some theme:

Twinkle: 26 individual variations on the tune known to English speakers as "Twinkle, twinkle, little star" (in fact a mixture of mostly polyphonic and a few monophonic versions);

Lachrimae: 75 versions of John Dowland's Lachrimae Pavan from different 16th- and 17th-century sources, sometimes varying in quality (numbers of wrong notes, omissions and other inaccuracies) and in scoring (for solo lute, keyboard or five-part instrumental ensemble);

Folia: 50 variations by four different composers on the well-known baroque tune "Les Folies d'Espagne".

For retrieval, we select a piece from the three sets of variations and use that as the query. All other pieces from that same variation set are judged relevant to the query, and the rest of the collection is judged non-relevant. This process is repeated for all pieces in all three sets of variations, for a total of 151 queries.
Let us define MC as the retrieval system based on Markov Chain models, and MRF as the retrieval system based on Markov Random Field models. Figure 6 shows the recall-precision results for the ranked lists produced by each of the various modeling approaches: MRF as well as MC with the length of the chain set to 2, 3, and 4 (1st-, 2nd-, and 3rd-order models, respectively). In the table, MRF is shown first, and each of the MC systems is shown in comparison, with percentage change (whether positive or negative) and an asterisk to indicate statistical significance (t-test at the 0.05 level).
No matter the chain length, the random field approach outperforms the Markov chain approach at just about every level of precision-recall. On average, the MC systems are from 42% to 63% worse. These results show the value of the random field approach.

We believe what is happening is that the random fields are less sensitive to the noise that appears with variations. For example, suppose there is an insertion of a single note in one of the variations. The Markov chain approach counts all possible paths through a polyphonic sequence. The number of these paths is exponential in the length of the chain. A single note insertion therefore disproportionately increases the number (and character) of paths being counted. This analysis is borne out by the fact that higher-order Markov chains actually do progressively worse (see Figure 6). The longer the chain, the more a single insertion affects the model. Two note insertions create even more paths.

On the other hand, random fields take into account the dependencies between the features. They work by calculating feature expectations over the data (see section 4.5), and adjusting the weights so that the contribution of each feature toward the prediction of a note label (on or off) is balanced. A single note insertion may activate a few additional feature functions, but the contributions of these additional features do not throw off the overall correctness of the note probability estimate, because their weights are learned initially with non-independence in mind. Random fields are more robust when it comes to detecting variations.
6 CONCLUSION
In this paper we developed a retrieval system based on automatically-induced random fields and showed the superiority of these models over Markov chains. Central to our approach is the notion of a binary feature function, a conjunction of notes positioned at fixed pitches and locations.

Yet these features are not the only ones possible. Recall from section 4.2 that features are just functions that return a boolean value. What happens inside the function is as limitless as one need imagine. For example, one can create models of rhythmic patterns by choosing onset-conjunction features of the sort: f_1 = "was there an onset 100 ticks prior to the current onset, and another onset 300 ticks prior to the current onset?" (We are currently developing such models.)

One is not limited to rhythm alone, any more than one is limited to pitch alone. Alongside pitch-only and rhythm-only features, our set of active feature functions F can contain features that are a mixture of both pitch and duration, e.g. f_3 = "the previous note was an E and it lasted for 200 ms". Not only does this feature contain both pitch and duration information, but if the model already contains the pitch-only feature f_2 = "the previous note was an E", we may add f_3 without worry. The training that occurs as part of weight updating (section 4.7) ensures that the values given to all features are balanced, automatically taking into account statistical dependencies between features. (It should be clear that f_2 and f_3 are not independent.) Early work by Doraisamy and Rüger (2001) and Lemström et al. (1998) experimented with features of this nature, but they were forced by the lack of such a framework to make the independence assumption. With random fields, this assumption is no longer necessary.
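As an illustration only (not the authors' code), a feature such as f_3 is just another binary wrapper; the (pitch class, duration) event history used below is a hypothetical representation.

def f3_previous_note_E_200ms(history) -> int:
    """Boolean mixed pitch/duration feature: 1 iff the previous note was an E
    (pitch class 4) and lasted 200 ms. history is a list of (pitch_class,
    duration_ms) events, most recent last -- a hypothetical representation."""
    if not history:
        return 0
    pitch_class, duration_ms = history[-1]
    return int(pitch_class == 4 and duration_ms == 200)

print(f3_previous_note_E_200ms([(0, 120), (4, 200)]))   # -> 1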
Finally, it goes without saying that in addition to pitch and rhythm, as long as one can craft a binary wrapper (feature function), one can use any other type of musical information one has available, including metadata and data obtained from semantic analysis of audio. Features could include functions such as: f_4 = "my timbre classifier gives me confidence > 0.9 that there is a trombone playing at this point in this audio recording, and my beat tracker estimates the current tempo for this song at between 80 and 100 bpm, and there was an onset around 300 ticks ago, and the metadata tells me this song was recorded in the 1970s". This feature function may (when properly weighted and combined with the rest of f ∈ F) be a strong positive indication that the note C# is on.

Random fields are a framework for sequential music modeling in which the combination of multiple, non-independent sources and types of data may be explored. Markov random fields are more robust than Markov chains. They are accurate, without overfitting, as we can see from the retrieval results above. They also offer a method for attaching relative importance to various features without having to make independence assumptions between the features used. In short, they are an important framework for developing the kinds of models needed for music information retrieval applications.
7 ACKNOWLEDGEMENTS
Many thanks to Victor Lavrenko, whose early collabora-
tion made retrieval systems of this sort possible. We also
wish to thank the conference reviewers for their invaluable
comments.
REFERENCES
A. L. Berger, S. A. Della Pietra, and V. J. Della Pietra. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39–71, 1996.

J. Darroch and D. Ratcliff. Generalized iterative scaling for log-linear models. Annals of Mathematical Statistics, 43:1470–1480, 1972.

S. Della Pietra, V. Della Pietra, and J. Lafferty. Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19:380–393, 1997.

S. Doraisamy and S. M. Rüger. An approach toward a polyphonic music retrieval system. In J. S. Downie and D. Bainbridge, editors, Proceedings of the 2nd Annual ISMIR, pages 187–193, Indiana University, Bloomington, Indiana, October 2001.

J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. 18th International Conf. on Machine Learning, pages 282–289. Morgan Kaufmann, San Francisco, CA, 2001.

K. Lemström, A. Haapaniemi, and E. Ukkonen. Retrieving music - to index or not to index. In The Sixth ACM International Multimedia Conference (MM '98), pages 64–65, Bristol, UK, September 13-16, 1998.

C. Zhai and J. Lafferty. Dual role of smoothing in the language modeling approach. URL citeseer.ist.psu.edu/zhai01dual.html.